estimation of entropy and mutual information

estimation of entropy and mutual information

ID:39761042

大小:845.84 KB

页数:63页

时间:2019-07-11

上传者:不努力梦想只是梦
estimation of entropy and mutual information_第1页
estimation of entropy and mutual information_第2页
estimation of entropy and mutual information_第3页
estimation of entropy and mutual information_第4页
estimation of entropy and mutual information_第5页
资源描述:

《estimation of entropy and mutual information》由会员上传分享,免费在线阅读,更多相关内容在学术论文-天天文库

ARTICLECommunicatedbyJonathanVictorEstimationofEntropyandMutualInformationLiamPaninskiliam@cns.nyu.eduCenterforNeuralScience,NewYorkUniversity,NewYork,NY10003,U.S.A.Wepresentsomenewresultsonthenonparametricestimationofentropyandmutualinformation.First,weuseanexactlocalexpansionoftheentropyfunctiontoprovealmostsureconsistencyandcentrallimitthe-oremsforthreeofthemostcommonlyuseddiscretizedinformationesti-mators.ThesetupisrelatedtoGrenander’smethodofsievesandplacesnoassumptionsontheunderlyingprobabilitymeasuregeneratingthedata.Second,weproveaconversetotheseconsistencytheorems,demon-stratingthatamisapplicationofthemostcommonestimationtechniquesleadstoanarbitrarilypoorestimateofthetrueinformation,evengivenunlimiteddata.This“inconsistency”theoremleadstoananalyticalap-proximationofthebias,validinsurprisinglysmallsampleregimesandmoreaccuratethantheusual1formulaofMillerandMadowoveralargeNregionofparameterspace.Thetwomostpracticalimplicationsoftheseresultsarenegative:(1)informationestimatesinacertaindataregimearelikelycontaminatedbybias,evenif“bias-corrected”estimatorsareused,and(2)confidenceintervalscalculatedbystandardtechniquesdrasticallyunderestimatetheerrorofthemostcommonestimationmethods.Finally,wenoteaveryusefulconnectionbetweenthebiasofentropyestimatorsandacertainpolynomialapproximationproblem.Bycastingbiascalculationproblemsinthisapproximationtheoryframework,weobtainthebestpossiblegeneralizationofknownasymptoticbiasresults.Moreinteresting,thisframeworkleadstoanestimatorwithsomeniceproperties:theestimatorcomesequippedwithrigorousboundsonthemaximumerroroverallpossibleunderlyingprobabilitydistributions,andthismaximumerrorturnsouttobesurprisinglysmall.Wedemon-stratetheapplicationofthisnewestimatoronbothrealandsimulateddata.1IntroductionThemathematicaltheoryofinformationtransmissionrepresentsapinna-cleofstatisticalresearch:theideasareatoncebeautifulandapplicabletoaremarkablywidevarietyofquestions.Whilepsychologistsandneu-rophysiologistsbegantoapplytheseconceptsalmostimmediatelyaftertheirintroduction,thepastdecadehasseenadramaticincreaseintheNeuralComputation15,1191–1253(2003)c2003MassachusettsInstituteofTechnology 1192L.Paninskipopularityofinformation-theoreticanalysisofneuraldata.Itisunsurpris-ingthatthesemethodshavefoundapplicationsinneuroscience;afterall,thetheoryshowsthatcertainconcepts,suchasmutualinformation,areunavoidablewhenoneasksthekindofquestionsneurophysiologistsareinterestedin.Forexample,thecapacityofaninformationchannelisafun-damentalquantitywhenoneisinterestedinhowmuchinformationcanbecarriedbyaprobabilistictransmissionsystem,suchasasynapse.Like-wise,weshouldcalculatethemutualinformationbetweenaspiketrainandanobservablesignalintheworldwhenweareinterestedinaskinghowmuchwe(oranyhomunculus)couldlearnaboutthesignalfromthespiketrain.Inthisarticle,wewillbeinterestednotinwhyweshouldesti-mateinformation-theoreticquantities(seeRieke,Warland,deRuytervanSteveninck,&Bialek,1997;Cover&Thomas,1991forextendedandelo-quentdiscussionsofthisquestion)butratherhowwellwecanestimatethesequantitiesatall,givenfiniteindependentlyandidenticallydistributed(i.i.d.)data.Onewouldthinkthisquestionwouldbewellunderstood;afterall,ap-pliedstatisticianshavebeenstudyingthisproblemsincethefirstappear-anceofShannon’spapers,over50yearsago(Weaver&Shannon,1949).Somewhatsurprisingly,though,manybasicquestionshaveremainedunan-swered.Tounderstandwhy,considertheproblemofestimatingthemutualinformation,I(X;Y),betweentwosignalsXandY.Thisestimationprob-lemliesattheheartofthemajorityofapplicationsofinformationtheorytodataanalysis;tomaketherelevancetoneuroscienceclear,letXbeaspiketrain,oranintracellularvoltagetrace,andYsomebehaviorallyrele-vant,physicallyobservablesignal,ortheactivityofadifferentneuron.Intheseexamplesandinmanyotherinterestingcases,theinformationesti-mationproblemiseffectivelyinfinite-dimensional.Bythedefinitionofmu-tualinformation,werequireknowledgeofthejointprobabilitydistributionP(X,Y)ontherangespacesofXandY,andthesespacescanbequitelarge—anditwouldseemtobeverydifficulttomakeprogresshereingeneral,giventhelimitedamountofdataonecanexpecttoobtainfromanyphysiologicalpreparation.Inthisarticle,weanalyzeadiscretizationprocedureforreducingthisveryhardinfinite-dimensionallearningproblemtoaseriesofmoretractablefinite-dimensionalproblems.Whileweareinterestedparticularlyinappli-cationstoneuroscience,ourresultsarevalidingeneralforanyinformationestimationproblem.Itturnsouttobepossibletoobtainafairlyclearpic-tureofexactlyhowwellthemostcommonlyuseddiscretizedinformationestimatorsperform,whytheyfailincertainregimes,andhowthisper-formancecanbeimproved.Ourmainpracticalconclusionsarethatthemostcommonestimatorscanfailbadlyincommondata-analyticsitua-tionsandthatthisfailureismoredramaticthanhasperhapsbeenappre-ciatedintheliterature.Themostcommonproceduresforestimatingcon-fidenceintervals,orerrorbars,failevenmoredramaticallyinthese“bad” EstimationofEntropyandMutualInformation1193regimes.Wesuggestanewapproachhereandprovesomeofitsadvan-tages.Thearticleisorganizedasfollows.Insection2,wedefinethebasicregu-larizationprocedure(or,rather,formalizeanintuitiveschemethathasbeenwidelyusedfordecades).Wereviewtheknownresultsinsection3andthengooninsection4toclarifyandimproveexistingbiasandvarianceresults,provingconsistencyandasymptoticnormalityresultsforafewofthemostcommonlyusedinformationestimators.Theseresultsservemainlytoshowwhenthesecommonestimatorscanbeexpectedtobeaccurateandwhentheyshouldbeexpectedtobreakdown.Sections5and6containthecentralresultsofthisarticle.Insection5,weshow,inafairlyintuitiveway,whythesecommonestimatorsperformsopoorlyincertainregimesandexactlyhowbadthisfailureis.Theseresultsleadus,insection6,tostudyapoly-nomialapproximationproblemassociatedwiththebiasofacertainclassofentropyestimators;thisclassincludesthemostcommonestimatorsintheliterature,andthesolutiontothisapproximationproblemprovidesanewestimatorwithmuchbetterproperties.Section7describessomenumeri-calresultsthatdemonstratetherelevanceofouranalysisforphysiologicaldataregimes.Weconcludewithabriefdiscussionofthreeextensionsofthiswork:section8.1examinesasurprising(andpossiblyuseful)degen-eracyofaBayesianestimator,section8.2givesaconsistencyresultforapotentiallymorepowerfulregularizationmethodthantheoneexaminedindepthhere,andsection8.3attemptstoplaceourresultsinthecontextofestimationofmoregeneralfunctionalsoftheprobabilitydistribution(thatis,notjustentropyandmutualinformation).Weattachtwoappendixes.InappendixA,welistafewassortedresultsthatareinterestingintheirownrightbutdidnotfiteasilyintotheflowofthearticle.InappendixB,wegiveproofsofseveralofthemoredifficultresults,deferredforclarity’ssakefromthemainbodyofthetext.Throughout,weassumelittlepreviousknowledgeofinformationtheorybeyondanunderstandingofthedefini-tionandbasicpropertiesofentropy(Cover&Thomas,1991);however,someknowledgeofbasicstatisticsisassumed(see,e.g.,Schervish,1995,foranintroduction).Thisarticleisintendedfortwoaudiences:appliedscientists(especiallyneurophysiologists)interestedinusinginformation-theoretictechniquesfordataanalysisandtheoristsinterestedinthemoremathematicalas-pectsoftheinformationestimationproblem.Thissplitaudiencecouldmakeforasomewhatsplitpresentation:thecorrectstatementofthere-sultsrequiressomemathematicalprecision,whilethedemonstrationoftheirutilityrequiressomemoreverboseexplanation.Nevertheless,wefeelthattheintersectionbetweentheappliedandtheoreticalcommunitiesislargeenoughtojustifyaunifiedpresentationofourresultsandourmoti-vations.Wehopereaderswillagreeandforgivethelengthoftheresultingarticle. 1194L.Paninski2TheSetup:Grenander’sMethodofSievesMuchoftheinherentdifficultyofourestimationproblemstemsfromthefactthatthemutualinformation,dP(x,y)I(X,Y)≡dP(x,y)log,X×Yd(P(x)×P(y))isanonlinearfunctionalofanunknownjointprobabilitymeasure,P(X,Y),ontwoarbitrarymeasurablespacesXandY.Inmanyinterestingcases,the“parameterspace”—thespaceofprobabilitymeasuresunderconsidera-tion—canbeverylarge,eveninfinite-dimensional.Forexample,intheneuroscientificdataanalysisapplicationsthatinspiredthiswork(Strong,Koberle,deRuytervanSteveninck,&Bialek,1998),Xcouldbeaspaceoftime-varyingvisualstimuliandYthespaceofspiketrainsthatmightbeevokedbyagivenstimulus.ThisYcouldbetakentobea(quitelarge)spaceofdiscrete(counting)measuresontheline,whileXcouldbemod-eledasthe(evenlarger)spaceofgeneralizedfunctionson3.GivenNi.i.d.samplesfromP(X,Y),{xi,yi}1≤i≤N(“stimulus”togetherwiththeevoked“response”),howwellcanweestimatetheinformationthiscellprovidesthebrainaboutthevisualscene?Clearly,itisdifficulttoanswerthisques-tionasposed;therelationshipbetweenstimulusandresponsecouldbetoocomplextoberevealedbytheavailabledata,evenifNislargebyneu-rophysiologicalstandards.Infact,therearegeneraltheoremstothiseffect(section3).Therefore,somekindofregularizationisneeded.ThemostsuccessfulapproachtakentodateinourfieldtocircumventtheseproblemswasintroducedbyBialekandcolleagues(Bialek,Rieke,deRuytervanSteveninck,&Warland,1991;Strongetal.,1998).Theideaistoadmittothedifficultyoftheproblemandinsteadestimateasystemoflowerboundsonthemutualinformationviathedataprocessinginequality(Cover&Thomas,1991),whichstatesthatI(X;Y)≥I(S(X);T(Y)),foranyrandomvariablesXandYandanyfunctionsSandTontherangeofXandY,respectively.ThegeneralityofthedataprocessinginequalityimpliesthatwearecompletelyunconstrainedinourchoiceofSandT.Sothestrategy,roughly,istochooseasequenceoffunctionsSNandTNthatpreserveasmuchinformationaspossiblegiventhatI(SN;TN)canbeestimatedwithsomefixedaccuracyfromNdatasamples.(NotethatSNandTNarechosenindependentofthedata.)Asthesizeoftheavailabledatasetincreases,ourlowerboundgrowsmonotonicallytowardthetrueinformation.Inslightlydifferentlanguage,SNandTNcouldbeviewedasmodels,orparameterizations,oftheallowedunderlyingmeasuresP(X,Y);wearesimplyallowingourmodeltobecomericher(higher-dimensional)asmoredatabecomeavailableforfitting.Clearly,then,wearenotintro- EstimationofEntropyandMutualInformation1195ducinganythingparticularlynovel,butmerelyformalizingwhatstatis-ticianshavebeendoingnaturallysincewellbeforeShannonwrotehispapers.Thisstrategybearsastrikingresemblancetoregularizationmethodsem-ployedinabstractstatisticalinference(Grenander,1981),generallyknownasthemethodofsieves.Here,onereplacestheparameterspaceofinterestwithacloselyrelatedspacethatsimplifiestheanalysisorprovidesesti-matorswithmoreattractivestatisticalproperties.Thefollowingexampleiscanonicalandhelpstoclarifyexactlywhyregularizationisnecessary.Sayoneissamplingfromsomeunknown,smoothprobabilitydensityfunc-tionandisinterestedinestimatingtheunderlyingdensity.Itisclearthatthereexistsnomaximumlikelihoodestimatorofthedensityinthespaceofsmoothfunctions(theobjectthatformallymaximizesthelikelihood,asumofDiracpointmasses,doesnotlieintheallowedsmoothnessclass).Thesituationispathological,then:asthesamplesizeincreasestoinfinity,ourestimatedoesnotconvergetothetruedensityinthesenseofanysmoothtopology.Toavoidthispathology,weregularizeourestimatorbyrequiringthatittakeitsvaluesinasmoothfunctionspace.Ineffect,werestrictourattentiontoasubset,a“sieve,”ofthepossibleparameterspace.Astheavail-abledataincrease,wegraduallyrelaxourconstraintsonthesmoothnessoftheestimator(decreasethe“meshsize”ofoursieve),untilinthelimitourestimateoftheunderlyingdensityisalmostsurelyarbitrarilyclosetothetruedensity.Wewillborrowthis“mesh”and“sieve”terminologyfortheremainderofthearticle.Here,wehavetoestimateajointprobabilitymeasure,P(X,Y),onalargeproductspace,X×Y,inordertocomputeI(X;Y).Thisisverydifficult;therefore,weregularizeourproblembyinsteadtryingtoestimateP(S,T)(whereP(S,T)isinducedbythemapsSandTinthenaturalway,i.e.,P(S=i,T=j)=P((x,y):S(x)=i,T(y)=j)).Thus,our“meshsize”isdeterminedbythedegreeofcompressioninherentingoingfrom(x,y)to(S(x),T(y)).Twovariantsofthisstrategyhaveappearedintheneuroscien-tificliterature.Thefirst,theso-calledreconstructiontechnique(Bialeketal.,1991),makesuseofsomeextremalpropertyofthepriorsignaldistributiontofacilitatethereliableestimationofalowerboundonthetrueinformation.TNhereisaseriesofconvolutionoperators,mappingspiketrains(elementsofY)backintothesignalspaceX.ThelowerboundontheinformationI(X,TN(Y))isestimatedbyspectraltechniques:thepriordistributionofX,P(X),ischosentobegaussian,andthewell-knownmaximum-entropypropertyandspectralinformationformulaforgaussiandistributionspro-videthedesiredbound.Thelowerboundsobtainedbythisreconstructionapproachhaveprovenquiteuseful(Riekeetal.,1997);however,theavail-ableconvergenceresults(ofI(X,TN(Y))toI(X,Y)asN→∞)relyonstrongassumptionsonP(X,Y),andwewillnotdiscussthistechniqueindepth.(Onefinalnote:readersfamiliarwiththereconstructiontechniquewillre-alizethatthisexampledoesnotquitefitintoourgeneralframework,asthe 1196L.PaninskiconvolutionoperatorsTN,whicharechosenbyregressiontechniques,areinfactdependentonthedata.Thesedependenciescomplicatetheanalysissignificantly,andwewillsayverylittleonthistopicbeyondabriefnoteinsection8.2.)Thesecondmethod,theso-calleddirectmethod(Strongetal.,1998;Buracas,Zador,DeWeese,&Albright,1998)isatfirstsightlessdepen-dentonassumptionsonthepriordistribtiononX.Hereonediscretizesthespaceofallspiketrainsonsomeintervalintosomefinitenumber,m,ofwordsw,andmakesuseoftheinformationformulafordiscretedistributions,I(X;W)=H(W)−H(W|X),toobtainalowerboundonthemutualinformationbetweenthespiketrainandthesignalofinterest.H(.)abovedenotestheentropyfunctional,H(W)≡−P(Wi)logP(Wi),iandH(.|.)denotesconditionalentropy;Xis,say,avisualsignalonwhichweareconditioning.1Inourpreviousnotation,W(y)=T(y).Thegeneralityofthedataprocessinginequality,again,meansthatthediscretizationcantakearbitraryform;lettingTdependonthedatasizeN,TNcould,forexam-ple,encodethetotalnumberofspikesemittedbytheneuronforsmallN,thentheoccurrenceofmoredetailedpatternsoffiring(Strongetal.,1998)forlargerN,until,inthelimit,alloftheinformationinthespiketrainisretained.Thus,inthis“direct”approach,SNandTNareassimpleaspossible:thesemapsdiscretizeXandYintoafinitenumberofpoints,mS,NandmT,N,wheremS,NandmT,NgrowwithN.ForeachvalueofN,ourproblemreducestoestimatingI(SN,TN),wherethejointdistributionoftherandomvariablesSNandTNisdiscreteonmS,NmT,Npoints,andourparameterspace,farfrombeinginfinite-dimensional,isthetractablemS,NmT,N-simplex,thesetofconvexcombinationsofmS,NmT,Ndisjointpointmasses.WeemphasizeagainthatneitherS,T,normisallowedtodependonthedata;ineffect,wepretendthatthediscretizingmapsandtheirrangesarechoseninadvance,beforeweseeasinglesample.Whilethisdiscrete“binning”approachappearsquitecrude,itwillal-lowustostatecompletelygeneralstrongconvergencetheoremsfortheinformationestimationproblem,withoutanyassumptionson,say,theex-1Tokeepdatarequirementsmanageable,H(W|X)—theexpectedconditionalentropyofWgivenx,averagedoverP(X)—isoftenreplacedwithH(W|x),theconditionalentropygivenonlyasinglex.Thefactthatanyrigorousjustificationofthissubstitutionrequiresastrongassumption(namely,thatH(W|x)iseffectivelyindependentofxwithhighP(x)-probability)hasperhapsbeenoverlyglossedoverintheliterature. EstimationofEntropyandMutualInformation1197istenceorsmoothnessofadensityforP(X,Y).Toourknowledge,resultsofthisgeneralityareunavailableoutsidethediscretecontext(butseeBeirlant,Dudewicz,Gyorfi,&vanderMeulen,1997)foragoodreviewofdifferen-tialentropyestimationtechniques,whichprovideapowerfulalternativeapproachwhentheunderlyingprobabilitymeasuresareknownaprioritopossessagivendegreeofsmoothness;Victor,2002).Inaddition,ofcourse,datathatnaturallytakeonlyafinitenumberofvaluesarenotuncommon.Therefore,wewillanalyzethisdiscreteapproachexclusivelyfortheremain-derofthisarticle.3PreviousWorkMostofthefollowingresultsarestatedintermsoftheentropyH(X);cor-respondingresultsforI(X,Y)followbyShannon’sformulafordiscretein-formation:I(X,Y)=H(X)+H(Y)−H(X,Y).Alloftheestimatorswewillconsiderarefunctionalsofthe“empiricalmea-sures”1NpN,i≡δi(TN(yj))Nj=1(whereδidenotestheprobabilitymeasureconcentratedati).Thethreemostpopularestimatorsforentropyseemtobe:1.Themaximumlikelihood(ML)estimatorgivenpN(alsocalledthe“plug-in”—byAntos&Kontoyiannis,2001)or“naive”—byStrongetal.,1998—estimator),mHˆMLE(pN)≡−pN,ilogpN,ii=1(alllogsarenaturalunlessstatedotherwise).2.TheMLEwiththeso-calledMiller-Madowbiascorrection(Miller,1955),mˆ−1HˆMM(pN)≡HˆMLE(pN)+,2NwheremˆissomeestimateofthenumberofbinswithnonzeroP-probability(herewetakemˆtobethenumberofbinswithnonzeropN-probability;seePanzeri&Treves,1996,forsomeotherexamples). 1198L.Paninski3.Thejackknifed(Efron&Stein,1981)versionoftheMLE,N−1NHˆJK≡NHˆMLE−HˆMLE−j,Nj=1whereHMLE−jistheMLEbasedonallbutthejthsample(unpublishednotesofJ.Victor;seealso,e.g.,Strongetal.,1998,inwhichaverysimilarestimatorisused).3.1CentralLimitTheorem,AsymptoticBiasandVariance.Themajor-ityofknownresultsarestatedinthefollowingcontext:fixsomediscretemeasureponmbinsandletNtendtoinfinity.Inthiscase,themultino-mialcentrallimittheorem(CLT)impliesthattheempiricalmeasurespNareasymptoticallynormal,concentratedonanellipseofsize∼N−1/2aroundthetruediscretemeasurep;sinceHˆMLEisasmoothfunctionofpontheinteriorofthem-simplex,HˆMLEisasymptoticallynormal(orchi-squaredordegenerate,accordingtotheusualconditions;Schervish,1995)aswell.ItfollowsthatboththebiasandvarianceofHˆMLEdecreaseapproximatelyas1(Basharin,1959)atallbutafinitenumberofpointsonthem-simplex.WeNwilldiscussthisbiasandvariancerateexplicitlyfortheaboveestimatorsinsection4;hereitissufficienttonotethattheasymptoticvarianceratevariessmoothlyacrossthespaceofunderlyingprobabilitymeasuresp(x,y),whilethebiasratedependsononlythenumberofnonzeroelementsofp(andisthereforeconstantontheinteriorofthem-simplexanddiscontinuousontheboundary).Theasymptoticbehaviorofthisestimationproblem(again,whenmisfixedandN→∞)isthuseasilyhandledbyclassicaltechniques.Whileitdoesnotseemtohavebeennotedpreviously,itfollowsfromtheabovethatHˆMLEisasymptoticallyminimaxforfixedmasN→∞(by“min-imax,”wemeanbestinaworst-casesense;wediscussthisconceptinmoredetailbelow);seePrakasaRao(2001)forthestandardtechnique,aclever“localBayesian”applicationoftheCramer-Raoinequality.Severalarticles(Miller,1955;Carlton,1969;Treves&Panzeri,1995;Victor,2000a)provideaseriesexpansionforthebias,inthehopeofestimatingandsubtractingoutthebiasdirectly.Althoughtheseauthorshaveallarrivedatbasicallythesameanswer,theyhavedonesowithvaryingdegreesofrigor:forexample,Miller(1955)usesanexpansionofthelogarithmthatisnoteverywhereconvergent(weoutlinethisapproachbelowandshowhowtoavoidtheseconvergenceproblems).Carlton(1969)rearrangedthetermsofaconvergentexpansionofthelogarithmterminH;unfortunately,thisexpansionisnotabsolutelyconvergent,andthereforethisrearrangementisnotnecessarilyjustified.TrevesandPanzeri(1995)andVictor(2000a)bothadmitthattheirmethods(adivergentexpansionofthelogarithmineachcase)arenotrigorous.Therefore,itwouldappearthatnoneoftheavailableresultsarestrongenoughtouseinthecontextofthisarticle,where EstimationofEntropyandMutualInformation1199mandpcandependarbitrarilystronglyonN.Wewillremedythissituationbelow.3.2ResultsofAntosandKontoyiannis.AntosandKontoyiannis(2001)recentlycontributedtworelevantresults.Thefirstissomewhatnegative:Theorem(Antos&Kontoyiannis,2001).Foranysequence{HˆN}ofentropyestimators,andforanysequence{aN},aN0,thereisadistributionPontheintegersZwithH≡H(P)<∞andE(|HˆN−H|)limsup=∞.n→∞aNInotherwords,thereisnouniversalrateatwhichtheerrorgoestozero,nomatterwhatestimatorwepick,evenwhenoursamplespaceisdiscrete(albeitinfinite).GivenanysuchputativerateaN,wecanalwaysfindsomedistributionPforwhichthetruerateofconvergenceisinfinitelyslowerthanaN.AntosandKontoyiannis(2001)proveidenticaltheoremsforthemutualinformation,aswellasafewotherfunctionalsofP.Thesecondresultisaneasyconsequenceofamoregeneralfactaboutfunctionsofmultiplerandomvariables;sincewewillusethisgeneraltheo-remrepeatedlybelow,wereproducethestatementhere.McDiarmid(1989)andDevroye,Gyorfi,andLugosi(1996)providedaproofandextendeddis-cussions.TheresultbasicallysaysthatiffisafunctionofNindependentrandomvariables,suchthatfdependsonlyweaklyonthevalueofanysinglevariable,thenfistightlyconcentratedaboutitsmean(i.e.,Var(f)issmall).Theorem(“McDiarmid’sinequality”;Chernoff,1952;Azuma,1967).If{xj}j:1,...,Nareindependentrandomvariablestakingvaluesinsomearbitrarymea-surablespaceA,andf:AN→issomefunctionsatisfyingthecoordinatewiseboundednesscondition,sup|f(x,...,x)−f(x,...,x,x,x,...,x)|0,N−22/c2P(|f(x,...,x)−E(f(x,...,x))|>)≤2ej=1j.(3.2)1N1NTheconditionsaysthatbychangingthevalueofthecoordinatexj,wecannotchangethevalueofthefunctionfbymorethansomeconstantcj. 1200L.PaninskiTheusefulnessofthetheoremisaresultofboththeubiquityoffunctionsfsatisfyingcondition3.1(andtheeasewithwhichwecanusuallycheckthecondition)andtheexponentialnatureoftheinequality,whichcanbequiteN2powerfulifj=1cjsatisfiesreasonablegrowthconditions.AntosandKontoyiannis(2001)pointedoutthatthisleadseasilytoausefulboundonthevarianceoftheMLEforentropy:Theorem(Antos&Kontoyiannis,2001).(a)ForallN,thevarianceoftheMLEforentropyisboundedabove:(logN)2Var(HˆMLE)≤.(3.3)N(b)Moreover,byMcDiarmid’sinequality,3.2,−N2(logN)−2P(|HˆMLE−E(HˆMLE)|>)≤2e2.(3.4)Notethatalthoughthisinequalityisnotparticularlytight—whileitsaysthatthevarianceofHˆMLEnecessarilydivestozerowithincreasingN,thetruevarianceturnsouttobeevensmallerthantheboundindicates—theinequalityiscompletelyuniversal,thatis,independentofmorP.Forex-ample,AntosandKontoyiannis(2001)useitinthecontextofm(countably)infinite.Inaddition,itiseasytoapplythisresulttootherfunctionalsofpN(seesection6foronesuchimportantgeneralization).3.3HˆMLEIsNegativelyBiasedEverywhere.Finally,forcompleteness,wementionthefollowingwell-knownfact,Ep(HˆMLE)≤H(p),(3.5)whereEp(.)denotestheconditionalexpectationgivenp.WehaveequalityintheaboveexpressiononlywhenH(p)=0;inwords,thebiasoftheMLEforentropyisnegativeeverywhereunlesstheunderlyingdistributionpissupportedonasinglepoint.ThisisallasimpleconsequenceofJensen’sinequality;aproofwasrecentlygiveninAntosandKontoyiannis(2001),andwewillsupplyanothereasyproofbelow.Notethatequation3.5doesnotimplythattheMLEformutualinformationisbiasedupwardeverywhere,ashasbeenclaimedelsewhere;itiseasytofinddistributionspsuchthatEp(IˆMLE)0NN→∞1/2forsomeα>0,thenN(Hˆ−H)isasymptoticallystandardnormal.σ2NNThefollowinglemmaisthekeytotheproofoftheorem1,andisinter-estinginitsownright:Lemma1.Ifm=o(N),thenHˆ→HNa.s.Notethatσ2inthestatementoftheCLT(theorem2)isexactlythevarianceNofthesuminexpression4.1,andcorrespondstotheasymptoticvariancederivedoriginallyinBasharin(1959),byasimilarlocalexpansion.Wealsopointoutthatσ2hasaspecificmeaninginthetheoryofdatacompressionN(whereσ2goesbythenameof“minimalcodingvariance”;seeKontoyian-Nnis,1997,formoredetails).WeclosethissectionwithsomeusefulresultsonthevarianceofHˆ.Weσ2have,underthestatedconditions,thatthevarianceofHˆisoforderNClog(N)2asymptotically(bytheCLT),andstrictlylessthanforallN,forsomeNfixedC(bytheresultofAntos&Kontoyiannis2001).Itturnsoutthatwecan“interpolate,”inasense,betweenthe(asymptoticallyloosebutgoodforallN)p-independentboundandthe(asymptoticallyexactbutbadforsmallN)p-dependentgaussianapproximation.ThetrickistoboundtheaveragefluctuationsinHˆwhenrandomlyreplacingonesample,insteadoftheworst-casefluctuations,asinMcDiarmid’sbound.ThekeyinequalityisduetoSteele(1986):Theorem(Steele’sinequality).IfS(x1,x2,...,xN)isanyfunctionofNi.i.d.randomvariables,then1Nvar(S)≤E(S−S)2,i2j=1whereSi=S(x1,x2,...,xi,...,xN)isgivenbyreplacingthexiwithani.i.d.copy. 1206L.PaninskiForS=Hˆ,itturnsouttobepossibletocomputetheright-handsideexplicitly;thedetailsaregiveninappendixB.ItshouldbeclearevenwithoutClog(N)2anycomputationthattheboundsoobtainedisatleastasgoodastheNguaranteedbyMcDiarmid;itisalsoeasytoshow,bythelinearexpansiontechniqueemployedabove,thattheboundisasymptoticallytightunderconditionssimilartothoseoftheorem2.Thus,σ(p)2playsthekeyroleindeterminingthevarianceofHˆ.Weknowσ2canbezeroforsomep,sinceVar(−logpi)iszeroforanypuniformonanykpoints,k≤m.Ontheotherhand,howlargecanσ2be?Thefollowingpropositionprovidestheanswer;theproofisinappendixB.Proposition2.maxσ2∼(logm)2.pThisleadsustodefinethefollowingbias-variancebalancefunction,validintheNmrange:N(logm)2V/B2≈.m2IfV/B2islarge,variancedominatesthemean-squareerror(inthe“worst-case”sense),andbiasdominatesifV/B2issmall.Itisnothardtoseethatifmisatalllarge,biasdominatesuntilNisrelativelyhuge(recallFigure1).(Thisisjustaruleofthumb,ofcourse,notleastbecausethelevelofaccuracydesired,andtherelativeimportanceofbiasandvariance,dependontheapplication.Wegivemoreprecise—inparticular,validforallvaluesofNandm—formulasforthebiasandvarianceinthefollowing.)Tosummarize,thesievemethodiseffectiveandtheasymptoticbehav-iorofHˆiswellunderstoodforNm.Inthisregime,ifV/B2>1,classi-cal(Cramer-Rao)effectsdominate,andthethreemostcommonestimators(HˆMLE,HˆMM,andHˆJK)areapproximatelyequivalent,sincetheysharethesameasymptoticvariancerate.ButifV/B2<1,biasplaysamoreimpor-tantrole,andestimatorsthatarespecificallydesignedtoreducethebiasbecomecompetitive;previousworkhasdemonstratedthatHˆMMandHˆJKareeffectiveinthisregime(Panzeri&Treves,1996;Strongetal.,1998).Inthenextsection,weturntoaregimethatismuchmorepoorlyunderstood,the(notuncommon)casewhenN∼m.Wewillseethatthelocalexpansionbecomesmuchlessusefulinthisregime,andadifferentkindofanalysisisrequired.5TheN∼mRange:ConsequencesofSymmetryThemainresultofthissectionisasfollows:ifN/misbounded,thebiasofHˆremainslargewhilethevarianceisalwayssmall,evenifN→∞.The EstimationofEntropyandMutualInformation1207basicideaisthatentropyisasymmetricfunctionofpi,1≤i≤m,inthatHisinvariantunderpermutationsofthepoints{1,...,m}.MostcommonestimatorsofH,includingHˆMLE,HˆMM,andHˆJK,sharethispermutationsymmetry(infact,onecanshowthatthereissomestatisticaljustificationforrestrictingourattentiontothisclassofsymmetricestimators;seeap-pendixA).Thus,thedistributionofHˆMLE(pN),say,isthesameasthatofHˆMLE(p),wherepistherank-sortedempiricalmeasure(forconcreteness,NNdefine“rank-sorted”as“rank-sortedindecreasingorder”).Thisleadsustostudythelimitingdistributionofthesesortedempiricalmeasures(seeFigure2).Itturnsoutthatthesesortedhistogramsconvergetothe“wrong”distributionundercertaincircumstances.Wehavethefollowingresult:Theorem3(Convergenceofsortedempiricalmeasures;inconsistency).LetPbeabsolutelycontinuouswithrespecttoLebesguemeasureontheinter-val[0,1],andletp=dP/dmbethecorrespondingdensity.LetSNbethem-equipartitionof[0,1],pdenotethesortedempiricalmeasure,andN/m→c,00.Herepc,∞isthemonotonicallydecreas-ingstepdensitywithgapsbetweenstepsjandj+1givenby1j−cp(t)(cp(t))dte.0j!b.Assumepisbounded.ThenHˆ−HN→Bc,Hˆ(p)a.s.,whereBc,Hˆ(p)isadeterministicfunction,nonconstantinp.ForHˆ=HˆMLE,B(p)=h(p)−h(p)<0,c,Hˆwhereh(.)denotesdifferentialentropy.Inotherwords,whenthesieveistoofine(N∼m),thelimitsortedempiricalhistogramexists(andissurprisinglyeasytocompute)butisnotequaltothetruedensity,evenwhentheoriginaldensityismonotonicallydecreasingandofstepform.Asaconsequence,HˆremainsbiasedevenasN→∞.ThisinturnleadstoastrictlypositivelowerboundontheasymptoticerrorofHˆoveralargeportionoftheparameterspace.ThebasicphenomenonisillustratedinFigure2.WecanapplythistheoremtoobtainsimpleformulasfortheasymptoticbiasB(p,c)forspecialcasesofp:forexample,fortheuniformdistributionU≡U([0,1]),∞cj−1B(U)=log(c)−e−clog(j);c,HˆMLE(j−1)!j=1 1208L.Paninski45m=N=1004332211002040608010000.5156m=N=1000443n22100200400600800100000.5166m=N=1000044220020004000600080001000000.51unsorted,unnormalizedisorted,normalizedpFigure2:“Incorrect”convergenceofsortedempiricalmeasures.Eachleftpanelshowsanexampleunsortedm-binhistogramofNsamplesfromtheuniformdensity,withN/m=1andNincreasingfromtoptobottom.Tensortedsamplehistogramsareoverlaidineachrightpanel,demonstratingtheconvergencetoanonuniformlimit.Theanalyticallyderivedpisdrawninthefinalpanelbutc,∞isobscuredbythesamplehistograms.1−e−cB(U)=B(U)+;c,HˆMMc,HˆMLE2c∞cj−1B(U)=1+log(c)−e−c(j−c)log(j).c,HˆJK(j−1)!j=1Togivesomeinsightintotheseformulas,notethatB(U)behaveslikec,HˆMLElog(N)−log(m)asc→0,asexpectedgiventhatHˆMLEissupportedon EstimationofEntropyandMutualInformation1209[0,log(N)](recallthelowerboundofproposition1);meanwhile,1B(U)∼B(U)+c,HˆMMc,HˆMLE2andB(U)∼B(U)+1c,HˆJKc,HˆMLEinthisc→0limit.Inotherwords,intheextremelyundersampledlimit,theMillercorrectionreducesthebiasbyonlyhalfanat,whilethejackknifegivesusonlytwicethat.Itturnsoutthattheproofofthetheoremleadstogoodupperboundsontheapproximationerroroftheseformulas,indicatingthattheseasymptoticresultswillbeusefulevenforsmallN.WeexaminethequalityoftheseapproximationsforfiniteNandminsection7.Thisasymptoticallydeterministicbehaviorofthesortedhistogramsisperhapssurprising,giventhatthereisnosuchcorrespondingdeterministicbehaviorfortheunsortedhistograms(although,bytheGlivenko-Cantellitheorem,vanderVaart&Wellner,1996,thereiswell-knowndeterministicbehaviorfortheintegralsofthehistograms).Whatisgoingonhere?Incrudeterms,thesortingprocedure“averagesover”thevariabilityintheunsortedhistograms.Inthecaseofthetheorem,the“variability”ateachbinturnsouttobeofaPoissonnature,inthelimitasm,N→∞,andthisleadstoawell-definedandeasy-to-computelimitforthesortedhistograms.Tobemoreprecise,notethatthevalueofthesortedhistogramatbinkisgreaterthantifandonlyifthenumberof(unsorted)pN,iwithpN,i>tisatleastk(rememberthatwearesortingindecreasingorder).Inotherwords,p=F−1,NNwhereFNistheempirical“histogramdistributionfunction,”1mFN(t)≡1(pN,i0.ThensomeeasycomputationsshowthatP(∃i:ni>j)→0forallj>α−1.Inotherwords,withhighprobability,wehavetoestimateHgivenonly1+α−1numbers,namely{hj}0≤j≤α−1,anditisnothardtosee,givenequation5.1andtheusualBayesianlowerboundsonminimaxerrorrates(see,e.g.,Ritov& EstimationofEntropyandMutualInformation1211Bickel,1990),thatthisisnotenoughtoestimateH(p).Wehave,therefore:Theorem4.IfN∼O(m1−α),α>0,thennoconsistentestimatorforHexists.ByShannon’sdiscreteformula,asimilarresultholdsformutualinfor-mation.6ApproximationTheoryandBiasThelastequality,expression5.1,iskeytotherestofourdevelopment.LettingB(Hˆ)denotethebiasofHˆ,wehave:NmNmB(Hˆ)=apj(1−p)N−j−−plog(p)j,Niiiijj=0i=1i=1Nj=plog(p)+ap(1−p)N−jiij,NiijiijNj=plog(p)+ap(1−p)N−j.iij,NiijijIfwedefinetheusualentropyfunction,H(x)=−xlogx,andthebinomialpolynomials,NB(x)≡xj(1−x)N−j,j,Njwehave−B(Hˆ)=H(pi)−aj,NBj,N(pi).ijInotherwords,thebiasisthem-foldsumofthedifferencebetweenthefunctionHandapolynomialofdegreeN;thesedifferencesaretakenatthepointspi,whichallfallontheinterval[0,1].Thebiaswillbesmall,therefore,ifthepolynomialisclose,insomesuitablesense,toH.Thistypeofpolynomialapproximationproblemhasbeenextensivelystudied(Devore&Lorentz,1993),andcertainresultsfromthisgeneraltheoryofapproximationwillprovequiteuseful. 1212L.PaninskiGivenanycontinuousfunctionfontheinterval,theBernsteinapproxi-matingpolynomialsoff,BN(f),aredefinedasalinearcombinationofthebinomialpolynomialsdefinedabove:NBN(f)(x)≡f(j/N)Bj,N(x).j=0NotethatfortheMLE,aj,N=H(j/N);thatis,thepolynomialappearinginequation5.1is,fortheMLE,exactlytheBernsteinpolynomialfortheentropyfunctionH(x).EverythingweknowaboutthebiasoftheMLE(andmore)canbederivedfromafewsimplegen-eralfactsaboutBernsteinpolynomials.Forexample,wefindthefollowingresultinDevoreandLorentz(1993):Theorem(Devore&Lorentz,1993,theorem10.4.2).Iffisstrictlyconcaveontheinterval,thenBN(f)(x)1,Nminipi→∞,thenN1limB(HˆMLE)=−.m−12Theproofisanelaborationoftheproofoftheabovetheorem10.3.1ofDevoreandLorentz(1993);weleaveitforappendixB.NotethattheconvergencestatedinthetheoremgivenbyDevoreandLorentz(1993)isnotuniformforf=H,becauseH(x)isnotdifferentiableatx=0;thus,whentheconditionofthetheoremisnotmet(i.e.,minp=O(1)),moreiiNintricateasymptoticbiasformulasarenecessary.Asbefore,wecanusethePoissonapproximationforthebinswithNpi→c,00.)Thus,tofindagoodestimator,weneedtomin-imizeboundsonbiasandvariancesimultaneously;wewouldliketofindHˆatominimizemax(B(Hˆ)2+V(Hˆ)),papapwherethenotationforbiasandvarianceshouldbeobviousenough.Wehavemax(B(Hˆ)2+V(Hˆ))≤maxB(Hˆ)2+maxV(Hˆ)papapapappp≤(c∗(f)M∗(f,Hˆ))2+maxV(Hˆ),(6.2)apapandatleasttwocandidatesforeasilycomputableuniformboundsonthevariance.ThefirstcomesfromMcDiarmid:Proposition5.Var(Hˆ)k,andchoosea,j≤ktominimizetheNN2NN,jleast-squaresobjectivefunction6.3;thisentailsasimplemodificationofequation6.4,−1tNtttNak+1aj≤k=XXj≤k+c∗(f)2(DD+IkIk)XYj≤k+c∗(f)2ek,whereIkisthematrixwhoseentriesareallzero,exceptforaoneat(k,k);ekisthevectorwhoseentriesareallzero,exceptforaoneinthekthelement;XtXj≤kistheupper-leftk×ksubmatrixofXtX;andNXtY=Bf,H−aBf.j≤kj,Nj,Nj,Nj=k+1Last,chooseaN,jtominimizethetrueobjectivefunction,equation6.2,overallKestimatorssoobtained.Inpractice,theminimaleffectiveKvariesquiteslowlywithN(forexample,forN=m<105,K≈30);thus,thealgorithmisapproximately(butnotrigorously)O(N).(Ofcourse,onceagood{aj,N}ischosen,HˆaisnohardertocomputethanHˆ.)WewillrefertotheresultingestimatorasHˆBUB,for“bestupperbound”(Matlabcodeimplementingthisestimatorisavailableon-lineathttp://www.cns.nyu.edu/∼liam).BeforewediscussHˆBUBfurther,wenoteseveralminorbutusefulmodi-ficationsoftheabovealgorithm.First,forsmallenoughN,theregularizedleast-squaressolutioncanbeusedasthestartingpointforahill-climbingprocedure,minimizingexpression6.2directly,forslightlyimprovedresults.Second,f,andthecorrespondingc∗(f)−2prefactoronthevariance(DtD)term,canbemodifiediftheexperimenterismoreinterestedinreducingbiasthanvariance,orviceversa.Finally,alongthesamelines,wecanconstrainthesizeofagivencoefficientak,NbyaddingaLagrangemultipliertotheregularizedleast-squaresolutionasfollows:−1tNtttXX+c∗(f)2DD+λkIkIkXY, 1218L.PaninskiUpper(BUB)andlower(JK)boundsonmaximumrmserror2.2N/m=0.10(BUB)2N/m=0.25(BUB)N/m=1.00(BUB)1.8N/m=0.10(JK)N/m=0.25(JK)N/m=1.00(JK)1.61.41.21RMSerrorbound(bits)0.80.60.40.2123456101010101010NFigure3:Acomparisonoflowerboundsonworst-caseerrorforHˆJK(upward-facingtriangles)toupperboundsonthesameforHˆBUB(downward-facingtri-angles),forseveraldifferentvaluesofN/m.whereIkisasdefinedabove.Thisisusefulinthefollowingcontext:atpointspforwhichH(p)issmall,mostoftheelementsofthetypicalempiricalmea-surearezero;hence,thebiasnearthesepointsis≈(N−1)a0,N+aN,N,andλ0canbesetashighasnecessarytokeepthebiasaslowasdesiredneartheselowentropypoints.Numericalresultsshowthatthesepertur-bationshavelittleilleffectontheperformanceoftheestimator;forex-ample,theworst-caseerrorisrelativelyinsensitivetothevalueofλ0(seeFigure7).Theperformanceofthisnewestimatorisquitepromising.Figure3indi-catesthatwhenmisallowedtogrowlinearlywithN,theupperboundontheRMSerrorofthisestimator(thesquarerootofexpression6.2)dropsoffapproximatelyasmax((E(Hˆ−H)2)1/2)<∼N−α,α≈1/3.BUBp EstimationofEntropyandMutualInformation1219(Recallthatwehavealowerboundontheworst-caseerrorofthethreemostcommonHˆ:max((E(Hˆ−H)2)1/2)>∼B(N/m),HˆpwhereBHˆ(N/m)isabiastermthatremainsboundedawayfromzeroifN/misbounded.)Foremphasis,wecodifythisobservationasaconjecture:Conjecture.HˆBUBisconsistentasN→∞evenifN/m∼c,02spikeseach;m=46. EstimationofEntropyandMutualInformation1229whileamonkeymoveditshandaccordingtoastationary,two-dimensional,filteredgaussiannoiseprocess.Weshowresultsfor11cells,simultaneouslyrecordedduringasingleexperiment,inFigure10;note,however,thatweareestimatingtheentropyofsingle-cellspiketrains,notthefullmulticellspiketrain.(Formoredetailsontheexperimentalprocedures,seePaninski,Fellows,Hatsopoulos,&Donoghue,1999,2003.)Withrealdata,itis,ofcourse,impossibletodeterminethetruevalueofH,andsothedetailederrorcalculationsperformedabovearenotpos-siblehere.Nevertheless,thebehavioroftheseestimatorsseemstofollowthetrendsseeninthesimulateddata.WeseetheconsistentslowincreaseinourestimateaswemovefromHˆMLEtoHˆMMtoHˆJK,andthenalargerjumpaswemovetoHˆBUB.Thisistrueeventhoughtherelevanttimescales(roughlydefinedasthecorrelationtimeofthestimulus)inthetwoexper-imentsdifferedbyaboutthreeordersofmagnitude.Similarresultswereobtainedforboththerealandsimulateddatausingavarietyofotherdis-cretizationparameters(datanotshown).Thus,asfarascanbedetermined,ourconclusionsaboutthebehaviorofthesefourestimators,obtainedus-ingtheanalyticalandnumericaltechniquesdescribedabove,seemtobeconsistentwithresultsobtainedusingphysiologicaldata.Inall,wehavethatthenewestimatorperformsquitewellinauniformsense.Thisgoodperformanceisespeciallystrikingin,butnotlimitedto,thecasewhenN/misO(1).WeemphasizethatevenatthepointswhereH=Hˆ=0(andthereforethethreemostcommonestimatorsperformwell,inatrivialsense),thenewestimatorperformsreasonably;byconstruction,HˆBUBneverexhibitsblowupsintheexpectederrorlikethoseseenwithHˆMLE,HˆMM,andHˆJK.GiventhefactthatwecaneasilytunethebiasofthenewestimatoratpointswhereH≈0,byadjustingλ0,HˆBUBappearstobearobustandusefulnewestimator.WeofferMatlabcode,availableon-lineathttp://www.cns.nyu.edu/∼liam,tocomputetheexactbiasandSteelevarianceboundforanyHˆa,atanydistributionp,ifthereaderisinterestedinmoredetailedinvestigationofthepropertiesofthisclassofestimator.8DirectionsforFutureWorkWehaveleftafewimportantopenproblems.Below,wegivethreesomewhatfreelydefineddirectionsforfuturework,alongwithafewpreliminaryresults.8.1Bayes.Allofourresultsherehavebeenfromaminimax,or“worst-case,”pointofview.Asdiscussedabove,thisapproachisnaturalifweknowverylittleabouttheunderlyingprobabilitymeasure.However,inmanycases,wedoknowsomethingaboutthisunderlyingp.WemightknowthatthespikecountisdistributedaccordingtosomethinglikeaPoissondistributionorthattheresponsesofaneurontoagivensetofstimulicanbe 1230L.Paninskifairlywellapproximatedbyasimpledynamicalmodel,suchasanIFcell.Howdoweincorporatethiskindofinformationinourestimates?ThefieldsofparametricandBayesianstatisticsaddressthisissueexplicitly.Wehavenotsystematicallyexploredtheparametricpointofview—thiswouldentailbuildingaseriousparametricmodelforspiketrainsandthenefficientlyestimatingtheentropyateachpointintheparameterspace—althoughthisapproachhasbeenshowntobepowerfulinafewselectcases.TheBayesianapproachwouldinvolvechoosingasuitableaprioridistributiononspiketrainsandthencomputingthecorrespondingMAPorconditionalmeanestimator;thisapproachisobviouslydifficultaswell,andwecangiveonlyapreliminaryresulthere.WolpertandWolf(1995)giveanexplicitformulafortheBayes’estimateofHandrelatedstatisticsinthecaseofauniformprioronthesimplex.Wenoteaninterestingphenomenonrelevanttothisestimator:asmincreases,thedistributiononHinducedbytheflatmeasureonthesimplexbecomesconcentratedaroundasinglepoint,andthereforethecorrespondingBayes’problembecomestrivialasm→∞,quitetheoppositeofthesituationconsideredinthecurrentwork.(Nemenman,Shafee,&Bialek,2002,inde-pendentlyobtainedafewinterestingresultsalongtheselines.)Theresultisinterestinginitsownright;itsproofsharesmanyofthefeatures(con-centrationofmeasureandsymmetrytechniques)ofourmainresultsintheprecedingsections.Moreprecisely,weconsideraclassofpriorsdeterminedbythefollowing“sort-difference”procedure:fixsomeprobabilitymeasurePontheunitinterval.Choosem−1independentsamplesdistributedaccordingtoP;sortthesamplesinascendingorder,andcallthesortedsamples{xi}0a)=O(e),foranyconstanta>0andsomeconstantC. EstimationofEntropyandMutualInformation1231Infact,itispossibletoprovemuchmore:theuniformmeasureonthesimplex(andmoregenerally,anypriorinducedbythesort-differenceproce-dure,undersomeconditionsontheintervalmeasureP)turnsouttoinduceanasymptoticallynormalprioronH,withvariancedecreasinginm.Wecancalculatetheasymptoticmeanofthisdistributionbyusinglinearityofexpectationandsymmetrytechniqueslikethoseusedinsection5.Inthefollowing,assumeforsimplicitythatPisequivalenttoLebesguemeasure(thatis,PisabsolutelycontinuouswithrespecttoLebesguemeasure,andviceversa);thisisatechnicalconditionthatcanberelaxedatthepriceofslightlymorecomplicatedformulas.Wehavethefollowing:Theorem7.P(H)isasymptoticallynormal,with1Var(H)∼mandasymptoticmeancalculatedasfollows.Letqbethesorted,normalizeddensitycorrespondingtoameasuredrawnac-cordingtothepriordescribedabove;definev1F(v)≡dudtp(t)2e−up(t),p00andq≡F−1,∞pwheretheinverseistakeninadistributionalsense.Thenq−q→0∞1inprobabilityandE(H)→h(q)+log(m),∞whereh(.)denotesdifferentialentropy.Fpaboveisthecumulativedistributionfunctionofthep-mixtureofex-ponentialswithratep(t)−1(justaspc,∞intheorem3wasdefinedastheinversecumulativedistributionfunction(c.d.f)ofamixtureofPoissondis-tributions).IfPisuniform,forexample,wehavethatq−q∞1→0inprobability,whereq(t)=−log(t),∞ 1232L.Paninskiand1H(q)→h(q)+log(m)=logm+dtlog(t)log(−log(t))∞0inprobability.8.2AdaptivePartitioning.Asemphasizedinsection1,wehavere-strictedourattentionheretopartitions,“sieves,”SandT,whichdonotdependonthedata.Thisisobviouslyastrongcondition.Canweobtainanyresultswithoutthisassumption?Asastart,wehavethefollowingconsistencyresult,statedintermsofthemeasureoftherichnessofapartitionintroducedbyVapnikandCher-vonenkis(1971),N(AF)(theshattercoefficientofthesetofallowedparti-tions,definedintheappendix;mis,asinthepreceding,themaximalnumberofelementsperpartition,F;Devroyeetal.,1996):Theorem8.Iflog(A)=oNandFgeneratesσa.s.,Iiscon-ˆNF(logm)2x,ysistentinprobability;Iisconsistenta.s.undertheslightlystrongerconditionˆ−NN(AF)e(logm)2<∞.Notetheslowerallowedrateofgrowthofm.Inaddition,theconditionsofthistheoremaretypicallyhardertocheckthanthoseoftheorem1.Forexample,itiseasytothinkofreasonablepartitioningschemesthatdonotgenerateσx,ya.s.:ifthesupportofPissomemeasurablepropersubsetofX×Y,thisisanunreasonablecondition.Wecanavoidthisproblembyrephrasingtheconditionintermsofσx,yrestrictedtothesupportofP(this,inturn,requiresplacingsomekindoftopologyonX×Y,whichshouldbenaturalenoughinmostproblems).Whatarethebenefits?Intuitively,weshouldgaininefficiency:weareputtingthepartitionswheretheydothemostgood(Darbellay&Vajda,1999).Wealsogaininapplicability,sinceinpractice,allpartitionschemesaredatadriventosomedegree.Themostimportantapplicationofthisre-sult,however,istothefollowingquestioninlearningtheory:Howdowechoosethemostinformativepartition?Forexample,givenaspiketrainandsomebehaviorallyrelevantsignal,whatisthemostefficientwaytoencodetheinformationinthespiketrainaboutthestimulus?Moreconcretely,allthingsbeingequal,doesencodingtemporalinformation,say,preservemoreinformationaboutagivenvisualstimulusthanencodingspikerateinfor-mation?Conversely,doesencodingthecontrastofascene,forexample,preservemoreinformationaboutagivenneuron’sactivitythanencodingcolor?Givenmcodewords,howmuchinformationcanwecaptureaboutwhatthisneuronistellingusaboutthescene?(See,e.g.,Victor,2000b,forrecentworkalongtheselines.) EstimationofEntropyandMutualInformation1233Theformalanalogtothesekindsofquestionsisasfollows(seeTishby,Pereira,&Bialek,1999,andGedeon,Parker,&Dimitrov,2003,forslightlymoregeneralformulations).LetFandGbeclassesof“allowed”functionsonthespacesXandY.Forexample,FandGcouldbeclassesofpartitioningoperators(correspondingtothediscretesetupusedhere)orspacesoflinearprojections(correspondingtotheinformation-maximizationapproachtoindependentcomponentanalysis.)Then,givenNi.i.d.datapairsinX×Y,wearetryingtochoosefN∈FandgN∈GinsuchawaythatI(fN(x);gN(y))ismaximized.Thisiswhereresultsliketheorem8areuseful;theyallowustoplacedistribution-freeboundsonPsupI(f(x);g(y))−Iˆ(fN(x);gN(y))>,(8.1)f∈F,g∈Gthatis,theprobabilitythatthesetofcodewordsthatlooksoptimalgivenNsamplesisactually-closetooptimal.Other(distribution-dependent)approachestotheasymptoticsofquantitieslikeequation8.1comefromthethetheoryofempiricalprocesses(see,e.g.,vanderVaart&Wellner,1996).Moreworkinthisdirectionwillbenecessarytorigorouslyanswerthebiasandvarianceproblemsassociatedwiththese“optimalcoding”questions.8.3SmoothnessandOtherFunctionals.Weendwithaslightlymoreabstractquestion.Inthecontextofthesievemethodanalyzedhere,areentropyandmutualinformationanyhardertoestimatethananyothergivenfunctionaloftheprobabilitydistribution?Clearly,thereisnothingspecialaboutH(and,byextension,I)inthecasewhenmandparefixed;here,classicalmethodsleadtotheusualN−1/2ratesofconvergence,withaprefactorthatdependsononlymandthedifferentialpropertiesofthefunctionalHatp;theentirebasictheorygoesthroughifHisreplacedbysomeotherarbitrary(smooth)functional.Thereareseveralreasonstosuspect,however,thatnotallfunctionalsarethesamewhenmisallowedtovarywithN.First,andmostobvious,Hˆisconsistentwhenm=o(N)butnotwhenm∼N;simpleexamplesshowthatthisisnottrueforallfunctionalsofp(e.g.,manylinearfunc-tionalsonmcanbeestimatedgivenfewerthanNsamples,andthiscanbeextendedtoweaklynonlinearfunctionalsaswell).Second,classicalresultsfromapproximationtheoryindicatethatsmoothnessplaysanessentialroleinapproximability;itiswellknown,forexample,thatthebestrateinthebiaspolynomialapproximationproblemdescribedinsection6isessentiallyde-terminedbyamodulusofcontinuityofthefunctionunderquestion(Ditzian&Totik,1987),andmoduliofcontinuitypopupinapparentlyverydiffer-entfunctionalestimationcontextsaswell(Donoho&Liu,1991;Jongbloed, 1234L.Paninski2000).Thus,itisreasonabletoexpectthatthesmoothnessofH,especiallyasmeasuredatthesingularpointnear0,shouldhavealottodowiththedifficultyoftheinformationestimationproblem.Finally,basicresultsinlearningtheory(Devroyeetal.,1996;vanderVaart&Wellner,1996;Cucker&Smale,2002)emphasizethestrongconnectionsbetweensmoothnessandvariousnotionsoflearnability.Forexample,anapplicationoftheoremII.2.3ofCuckerandSmale(2002)givestheexactrateofdecayofourL2objectivefunction,equation6.3,intermsofthespectralpropertiesofthediscretedif-ferentialoperatorD,expressedintheBernsteinpolynomialbasis;however,itisunclearatpresentwhetherthisresultcanbeextendedtoourfinalgoalofausefulasymptotictheoryfortheupperL∞bound,equation6.2.Afewofthequestionswewouldliketoanswermorepreciselyareasfollows.First,wewouldliketohavethepreciseminimaxrateoftheinfor-mationestimationproblem;thusfar,wehaveonlybeenabletoboundthisratebetweenm∼o(N)(seetheorem1)andm∼N1+α,α>0(seetheo-rem4).Second,howclosedoesHˆBUBcometothisminimaxrate?Indeed,doesthisestimatorrequirefewerthanmsamplestolearntheentropyonmbins,asFigure3seemstoindicate?Finally,howcanallofthisbegeneralizedforotherstatisticalfunctionals?Istheresomethinglikeasinglemodulusofcontinuitythatcontrolsthedifficultyofsomelargeclassofthesefunctionalestimationproblems?9ConclusionsSeveralpracticalconclusionsfollowfromtheresultspresentedhere;wehavegoodnewsandbadnews.First,thebadnews.•PastworkinwhichN/mwasoforder1orsmallerwasmostlikelycontaminatedbybias,evenifthejackknifeorMillercorrectionwasused.Thisisparticularlyrelevantforstudiesinwhichmultiplebin-ningschemeswerecomparedtoinvestigate,forexample,theroleoftemporalinformationintheneuralcode.WeemphasizeforfuturestudiesthatmandNmustbeprovidedforreaderstohaveconfidenceintheresultsofentropyestimation.•Errorbarsbasedonsamplevariance(orresamplingtechniques)giveverybadconfidenceintervalsifmandNarelarge.Thatis,confidenceintervalsbasedontheusualtechniquesdonotcontainthetruevalueofHorIwithhighprobability.Previousworkintheliteratureoftendisplayserrorbarsthatareprobablymisleadinglysmall.Confidenceintervalsshouldbeofsize∼B(Hˆ,N/m)+N−1/2log(min(m,N)),wherethebiastermBcanbecalculatedusingtechniquesdescribedinsection5. EstimationofEntropyandMutualInformation1235Nowthegoodnews:•Thisworkhasgivenusamuchbetterunderstandingofexactlyhowdifficulttheinformationestimationproblemisandwhatwecanhopetoaccomplishusingnonparametrictechniques,givenphysiologicallyplausiblesamplesizes.•Wehaveobtainedrigorous(andsurprisinglygeneral)resultsonbias,variance,andconvergenceofthemostcommonlyemployedestima-tors,includingthebestpossiblegeneralizationofMiller’swell-known1biasrateresult.OuranalysisclarifiestherelativeimportanceofNminimizingbiasorvariabilitydependingonNandm,accordingtothebias-variancebalancefunctionintroducedattheendofsection4.•Wehaveintroducedapromisingnewestimator,onethatcomesequippedwithbuilt-in,rigorousconfidenceintervals.Thetechniquesusedtoderivethisestimatoralsoleadtorigorousconfidenceintervalsforalargeclassofotherestimators(includingthethreemostcom-monHˆ).AppendixA:AdditionalResultsA.1Support.OnewouldliketobuildanestimatorthattakesvaluesstrictlyinsomenicesetaroundthetrueH,say,anintervalcontainingHwhoselengthshrinksasthenumberofsamples,N,increases.Thiswouldgiveusstrong“errorbars”onourestimateofH;wewouldbeabsolutelycertainthatourestimateisclosetothetrueH.TheMLEforentropyhassupporton[0,log(min(N,m))].Asimplevariationalargumentshowsthatanyestimator,T,forHonmbinsisinadmissibleifTtakesvaluesoutside[0,logm].Similarly,anyestimator,T,forIonmS×mTbinsisinadmissibleifTtakesvaluesoutside[0,log(min(mS,mT))].Itturnsoutthatthisisthebestpossible,inasense:theredonotexistanynontrivialestimatorsforentropythatarestrictlygreaterorlessthantheunknownH.Infact,thefollowingistrue:Proposition7.ThereisnoestimatorTandcorrespondinga,b,00exists.Ifso,T(ω)mustbenonzeroforallpossiblevaluesofthedata,ω(thedatacanberepresentedasanN-sequenceofintegers,1≤ω(i)≤m).Butthentheremustexistsomeω0,withp(ω0)>0,suchthatT(ω0)>0.BychoosingHsuchthat00.ForthisP,H(p)→Hˆ→{ni}doesnotformaMarkovchain;thesymmetryofHˆdiscardsinformationaboutthetrueunderlyingH(namely,observationofa1tellsussomethingverydifferentthandoesob-servationofa2).Thispropertyisclearlysharedbyanysymmetricestimator.Thefactthattheempiricalhistogramsareminimalsufficientfollows,forexample,fromBahadur’stheorem(Schervish,1995),andthefactthattheempiricalhistogramsarecompletesufficientstatistics.Inotherwords,anyσ-symmetricestimatornecessarilydiscardsinforma-tionaboutH,eventhoughHitselfisσ-symmetric.Thisindicatestheimpor-tanceofpriors;thenonparametricminimaxapproachtakenhere(focusingstrictlyonsymmetricestimatorsforalargepartofthework,asjustifiedbyproposition10)shouldbeconsideredonlyafirststep.Tobemoreconcrete,inmanyapplications,itisnaturaltoguessthattheunderlyingmeasurephassomecontinuityproperties;therefore,estimatorsthattakeadvantageofsomeunderlyingnotionofcontinuity(e.g.,bylocallysmoothingtheob-serveddistributionsbeforeestimatingtheirentropy)shouldbeexpectedtoperformbetter(onaverage,accordingtothismostlycontinuousprior)thanthebestσ-symmetricestimator,whichnecessarilydiscardsalltopologicalstructureintheunderlyingspaceX.(See,e.g.,Victor,2002,forrecentworkalongtheselines.)AppendixB:ProofsWecollectsomedeferredproofshere.Toconservespace,weomitsomeoftheeasilyverifieddetails.Thetheoremsarerestatedforconvenience.B.1Consistency.Statement(Theorem1).IfmS,NmT,N=o(N)andσSN,TNgeneratesσX,Y,thenIˆ→Ia.s.asN→∞.Theorem1isaconsequenceofthefollowinglemma:Statement(Lemma1).Ifm=o(N),thenHˆ→HNa.s.Proof.First,bytheexponentialboundofAntosandKontoyiannis,expres-sion3.4,andtheBorel-Cantellilemma,HˆN→HNa.s.ifthe(nonrandom)functionE(HˆN)↑HN.ThisconvergenceinexpectationisaconsequenceofthelocalexpansionforthebiasoftheMLE,expression4.2,andproposition1ofsection4.ProofofTheorem1.First,someterminology:byIˆ→Ia.s.,wemeanthatifI=∞,p((IˆN)i.o.)=0 EstimationofEntropyandMutualInformation1239∀>0.(“I.o.”standsfor“infinitelyoften.”)Inaddition,wecallthegivenσ-algebraonX×Y(thefamilyofsetsonwhichtheprobabilitymeasureP(X,Y)isdefined)σX,Y,andthesub-σ-algebrageneratedbySandTσS,T.Now,theproof:itfollowsfromShannon’sformulaformutualinforma-tioninthediscretecasethat|Iˆ(SN,TN)−I(SN,TN)|≤|Hˆ(S)−H(S)|+|Hˆ(T)−H(T)|+|Hˆ(S,T)−H(S,T)|.Thus,thelemmagivesIˆN→I(SN,TN)a.s.whenevermSmT/N→0.Itremainsonlytoshowthatthe(nonrandom)functionI(SN,TN)→I;thisfollowsfromresultsinstandardreferences,suchasBillingsley(1965)andKolmogorov(1993),ifeitherσS1,T1⊆σS2,T2⊆···⊆σSN,TN⊆···andσX,Y=∪NσSN,TN,orsupρ(A,B)→0,A∈σSN,TN,B∈σX,Ywhereρ(A,B)≡P(ABc∪AcB).Ifeitheroftheseconditionsholds,wesaythatσSN,TNgeneratesσX,Y.B.2CentralLimitTheorem.Statement(Theorem2).Letmσ2≡Var(−logp)≡p(−logp−H)2.NTNTN,iTN,iNi=1IfmN≡m=o(N1/2),andliminfN1−ασ2>0NN→∞1/2forsomeα>0,thenN(Hˆ−H)isasymptoticallystandardnormal.σ2NN 1240L.PaninskiProof.Thebasictool,again,isthelocalexpansionofHMLE,expression4.1.Wemustfirstshowthattheremaindertermbecomesnegligibleinproba-√bilityonaNscale,thatis,√NDKL(pN;p)=op(1).ThisfollowsfromtheformulaforEp(DKL(pN;p)),thenMarkov’sinequalityandthenonnegativityofDKL.SoitremainsonlytoshowthatdH(p;pN−p)isasymptoticallynormal.Hereweapplyaclassicaltheoremontheasymptoticnormalityofdoublearraysofinfinitesimalrandomvariables:Lemma.Let{xj,N},1≤N≤∞,1≤j≤Nbeadoublearrayofrowwisei.i.d.randomvariableswithzeromeanandvarianceσ2,withdistributionp(x,N)andNsatisfyingσ2=1/NforallN.ThenNxisasymptoticallynormal,withzeroNj=1j,Nmeanandunitvariance,iff{xj,N}satisfytheLindeberg(vanishingtail)condition:forall>0,Nx2dp(x,N)=o(1).(B.1)j=1|x|>TheconditionsofthetheoremimplytheLindebergcondition,with{xj,n}replacedby√1(dH(p;δ−p)−H).Toseethis,notethattheleft-handside2jNσofequationB.1becomes,afterthepropersubstitutions,1plog2p,2jjσ12−pj:(Nσ)2|(logpj)−H|>or1plog2p+plog2p.σ2jjjj11p:p>e(Nσ2)2+Hp:p0.Herepc,∞isthemonotonicallydecreas-ingstepdensitywithgapsbetweenstepsjandj+1givenby1j−cp(t)(cp(t))dte.0j!b.Assumepisbounded.ThenHˆ−HN→Bc,Hˆ(p)a.s.,whereBc,Hˆ(p)isadeterministicfunction,nonconstantinp.ForHˆ=HˆMLE,B(p)=h(p)−h(p)<0,c,Hˆwhereh(.)denotesdifferentialentropy.L1,a.s.Proof.Wewillgiveonlyanoutline.By→,wemeanthatp−pN1→0a.s.ByMcDiarmid’sinequality,foranydistributionq,q−pN1→E(q−pN1)a.s.Therefore,convergencetosomepinprobabilityinL1impliesalmostsureconvergenceinL1.Inaddition,McDiarmid’sinequalityhintsatthelimitingformoftheorderedhistograms.Wehaveninisort→Esort,a.s.∀1≤i≤m.NNOfcourse,thisisnotquitesatisfactory,sincebothsort(ni)andE(sort(ni))goNNtozeroformosti.Thus,weneedonlytoproveconvergenceofsort(ni)topinLinproba-N1bility.Thisfollowsbyanexaminationofthehistogramorderstatisticshn,j.Recallthatthesehn,jcompletelydeterminepn.Inaddition,thehn,jsatisfyalawoflargenumbers:11hn,j→E(hn,j)∀j≤k,mmforanyfinitek.(Thiscanbeproven,forexample,usingMcDiarmid’sin-equality.)Letusrewritetheabovetermmoreexplicitly:11m1mNjN−jE(hn,j)=E(1(ni=j))=pi(1−pi).mmmji=1i=1Nowwecanrewritethissumasanintegral:1mN1NjN−jjN−jpi(1−pi)=dtpn(t)(1−pn(t)),(B.3)mj0ji=1 1244L.Paninskiwherepn(t)≡mpmax(i|im10dt1(p=0),p−p1isobviouslyboundedawayfromzero.Regardingthefinalclaimofthetheorem,theconvergenceoftheHˆtoE(Hˆ)followsbypreviousconsiderations.Weneedproveonlythath(p)1,Nminipi→∞,thenN1limB(HˆMLE)=−.m−12Proof.Asstatedabove,theproofisanelaborationoftheproofoftheo-rem10.3.1ofDevoreandLorentz(1993).Weuseasecond-orderexpansionoftheentropyfunction:21H(t)=H(x)+(t−x)H(x)+(t−x)H(x)+hx(t−x),2wherehisaremainderterm.Pluggingin,wehave21−1−tlogt=−xlogx+(t−x)(−1−logx)+(t−x)+hx(t−x).2xAftersomealgebra,2x112(t−x)hx(t−x)=t−x+tlog+(t−x).t2xAftersomemorealgebra(mostlyrecognizingthemeanandvariancefor-mulaeforthebinomialdistribution),weseethatx(1−x)limN(BN(H)(x)−H(x))=H(x)+RN(x),N2where1−xNjjRN(x)≡−NBj,N(x)log.2NNxj=0 1246L.PaninskiTheproofoftheorem10.3.1inDevoreandLorentz(1993)proceedsbyshowingthatRN(x)=o(1)foranyfixedx∈(0,1).WeneedtoshowthatRN(x)=o(1)uniformlyforx∈[xN,1−xN],wherexNisanysequencesuchthatNxN→∞.Thiswillprovethetheorem,becausethebiasisthesumofBN(H)(x)−H(x)atmpointsonthisinterval.Theuniformestimateessentiallyfollowsfromthedeltamethod(somewhatlikeMillerandMadow’soriginalproof,exceptinonedimensioninsteadofm):usethefactthatthesuminthedefinitionofRN(x)convergestotheexpectation(withappropriatecutoffs)ofthefunctiontlogtwithrespecttothegaussiandistributionwithmeanxandvariancex1x(1−x).Wespendtherestoftheproofjustifyingtheabovestatement.NThesuminthedefinitionofRN(x)isexactlytheexpectationofthefunc-tiontlogtwithrespecttothebinomial(N,x)distribution(inaslightabuseofxtheusualnotation,wemeanabinomialrandomvariabledividedbyN,thatis,rescaledtohavesupporton[0,1]).Theresultfollowsifasecond-orderexpansionfortlogtatxconvergesatano(1/N)rateinBin-expectation,xN,xthatis,ift12EBinN,xNtlog−(t−x)−(t−x)≡EBinN,xgN,x(t)=o(1),x2xforx∈[xN,1−xN].Assume,wlog,thatxN→0;inaddition,wewillfocusonthehardestcaseandassumexN=o(N−1/2).Webreaktheaboveexpectationintofourparts:axNxNEBinN,xgN,x(t)=gdBinN,x+gdBinN,x0axNbN1+gdBin+gdBin,N,xN,xxNbNwhere0bN)a)=O(e),foranyconstanta>0andsomeconstantC.Proof.Weprovideonlyasketch.TheideaisthatHalmostsatisifiestheboundeddifferencecondition,inthefollowingsense:theredoexistpointsx∈[0,1]msuchthatm(H(x))2>m2,imi=1say,whereH(x)≡max|H(x,...,x,...x)−H(x,...,x,...x)|,i1im1imx,xiibutthesetofsuchx—callthesetA—isofdecreasingprobability.IfwemodifyHsothatH=HonthecomplementofAandletH=E(H|pi∈Ac)onA,thatis,H(x)x∈Ac,H(x)=1P(x)H(x)q∈A,P(Ac)Acthenwehavethat−a2(m2)−1P(|H−E(H)|>a)m)dp(xi)[0,1]mii=1m≤1(max(xi+2−xi)>m)dp(xi)i∼dtelogp−mpm.(B.7)ThefirstinequalityfollowsbyreplacingtheL2normintheboundeddiffer-enceconditionwithanL∞norm;thesecondfollowsfromsomecomputation EstimationofEntropyandMutualInformation1249andthesmoothnessofH(x)withrespecttochangesinsinglexi.Thelastapproximationisbasedonanapproximationinmeasurebynicefunctionsargumentsimilartotheoneintheproofoftheorem3,alongwiththewell-knownasymptoticequivalence(uptoconstantfactors),asN→∞,betweentheempiricalprocessassociatedwithadensitypandtheinhomogeneousPoissonprocessofrateNp.Weestimate|E(H)−E(H)|withthefollowinghacksaw:|E(H)−E(H)|=p(x)(H(x)−H(x))+p(x)(H(x)−H(x))AAc=p(x)(H(x)−H(x))+0A≤P(A)logm.Ifp>c>0,theintegralinequationB.7isasymptoticallylessthance−mmc;therateofthetheoremisobtainedbyacrudeoptimizationoverm.TheproofoftheCLTinTheorem7followsuponcombiningpreviousre-sultsinthisarticlewithafewpowerfulolderresults.Again,toconservespace,wegiveonlyanoutline.TheasymptoticnormalityfollowsfromMcLeish’smartingaleCLT(Chow&Teicher,1997)appliedtothemartingaleE(H|x1,...,xi);thecomputationoftheasymptoticmeanfollowsbymeth-odsalmostidenticaltothoseusedintheproofoftheorem3(sortingandlinearityofexpectation,effectively),andtheasymptoticvariancefollowsuponcombiningtheformulasofDarling(1953)andShaoandHahn(1995)withanapproximation-in-measureargumentsimilar,again,tothatusedtoprovetheorem3.SeealsoWolpertandWolf(1995)andNemenmanetal.(2002)forapplicationsofDarling’sformulatoasimilarproblem.B.8AdaptivePartitioning.Statement(Theorem8).Iflog(A)=oNandFgeneratesσNF(logm)2x,ya.s.,Iisconsistentinprobability;ˆIisconsistenta.s.undertheslightlystrongerˆcondition−NN(AF)e(logm)2<∞.Thekeyinequality,unfortunately,requiressomenotation.WefollowtheterminologyinDevroyeetal.(1996),withafewobviousmodifications.Wetake,asusual,{xj}asi.i.drandomvariablesinsomeprobabilityspace,G,P.LetFbeacollectionofpartitionsof,withPdenotingagivenpartition.2Pdenotes,asusual,the“powerset”ofapartition,thesetofallsetsthatcanbebuiltupbyunionsofsetsinP.Weintroducetheclassof 1250L.PaninskisetsAF,definedastheclassofallsetsobtainedbytakingunionsofsetsinagivenpartition,P.Inotherwords,A≡{A:A∈2P,P∈F}.FFinally,theVapnik-Chervonenkis“shattercoefficient”oftheclassofsetsAF,N(AF),isdefinedasthenumberofsetsthatcanbepickedoutofAFusingNarbitrarypointsωjin:N(AF)≡max|{ωj}∩A:A∈A|.{ωj}∈NTherateofgrowthinNofN(AF)providesapowerfulindexoftherichnessofthefamilyofpartitionsAF,asthefollowingtheorem(akindofuniformLLN)shows;pheredenotesanyprobabilitymeasureandpN,asusual,theempiricalmeasure.Theorem(Lugosi&Nobel,1996).Followingthenotationabove,forany>0,−N2/512Psup|pN(A)−p(A)|>≤8N(AF)e.P∈FA∈PThus,thistheoremisusefulifN(AF)doesnotgrowtooquicklywithN.Asitturnsout,N(AF)growsatmostpolynomiallyinNundervari-ouseasy-to-checkconditions.Additionally,N(AF)canoftenbecomputedusingstraightforwardcombinatorialarguments,evenwhenthenumberofdistinctpartitionsinFmaybeuncountable.(SeeDevroyeetal.,1996,foracollectionofinstructiveexamples.)Proof.Theorem8isprovenbyaBorel-Cantelliargument,couplingtheaboveVCinequalityofLugosiandNobelwiththefollowingeasyinequality,whichstatesthattheentropyfunctionalHis“almostL1Lipshitz”:|H(p)−H(q)|≤H2(2p−q1)+2p−q1log(m−1),whereH2(x)≡−xlog(x)−(1−x)log(1−x)denotestheusualbinaryentropyfunctionon[0,1].Weleavethedetailstothereader.AcknowledgmentsThankstoS.Schultz,R.Sussman,andE.Simoncelliformanyinterestingconversations;B.LauandA.Reyesfortheircollaborationoncollectingtheinvitrodata;M.Fellows,N.Hatsopoulos,andJ.Donoghuefortheircollab-orationoncollectingtheprimateMIdata;andI.Kontoyiannis,T.Gedeon, EstimationofEntropyandMutualInformation1251andA.Dimitrovfordetailedcommentsonpreviousdrafts.ThisworkwassupportedbyapredoctoralfellowshipfromtheHowardHughesMedicalInstitute.ReferencesAntos,A.,&Kontoyiannis,I.(2001).Convergencepropertiesoffunctionalestimatesfordiscretedistributions.RandomStructuresandAlgorithms,19,163–193.Azuma,K.(1967).Weightedsumsofcertaindependentvariables.TohokuMath-ematicalJournal,3,357–367.Basharin,G.(1959).Onastatisticalestimatefortheentropyofasequenceofindependentrandomvariables.TheoryofProbabilityandItsApplications,4,333–336.Beirlant,J.,Dudewicz,E.,Gyorfi,L.,&vanderMeulen,E.(1997).Nonparamet-ricentropyestimation:Anoverview.InternationalJournaloftheMathematicalStatisticsSciences,6,17–39.Bialek,W.,Rieke,F.,deRuytervanSteveninck,R.,&Warland,D.(1991).Readinganeuralcode.Science,252,1854–1857.Billingsley,P.(1965).Ergodictheoryandinformation.NewYork:Wiley.Buracas,G.,Zador,A.,DeWeese,M.,&Albright,T.(1998).Efficientdiscrimi-nationoftemporalpatternsbymotion-sensitiveneuronsinprimatevisualcortex.Neuron,5,959–969.Carlton,A.(1969).Onthebiasofinformationestimates.PsychologicalBulletin,71,108–109.Chernoff,H.(1952).Ameasureofasymptoticefficiencyfortestsofhypothe-sisbasedonthesumofobservations.AnnalsofMathematicalStatistics,23,493–509.Chow,Y.,&Teicher,H.(1997).Probabilitytheory.NewYork:Springer-Verlag.Cover,T.,&Thomas,J.(1991).Elementsofinformationtheory.NewYork:Wiley.Cucker,F.,&Smale,S.(2002).Onthemathematicalfoundationsoflearning.BulletinsoftheAmericanMathematicalSociety,39,1–49.Darbellay,G.,&Vajda,I.(1999).Estimationoftheinformationbyanadaptivepartitioningoftheobservationspace.IEEETransactionsonInformationTheory,45,1315–1321.Darling,D.(1953).Onaclassofproblemsrelatedtotherandomdivisionofaninterval.AnnalsofMathematicalStatistics,24,239–253.Dembo,A.,&Zeitouni,O.(1993).Largedeviationstechniquesandapplications.NewYork:Springer-Verlag.Devore,R.,&Lorentz,G.(1993).Constructiveapproximation.NewYork:Springer-Verlag.Devroye,L.,Gyorfi,L.,&Lugosi,G.(1996).Aprobabilistictheoryofpatternrecog-nition.NewYork:Springer-Verlag.Ditzian,Z.,&Totik,V.(1987).Moduliofsmoothness.Berlin:Springer-Verlag.Donoho,D.,&Liu,R.(1991).Geometrizingratesofconvergence.AnnalsofStatis-tics,19,633–701. 1252L.PaninskiEfron,B.,&Stein,C.(1981).Thejackknifeestimateofvariance.AnnalsofStatistics,9,586–596.Gedeon,T.,Parker,A.,&Dimitrov,A.(2003).Informationdistortionandneuralcoding.Manuscriptsubmittedforpublication.Gibbs,A.,&Su,F.(2002).Onchoosingandboundingprobabilitymetrics.Inter-nationalStatisticalReview,70,419–436.Grenander,U.(1981).Abstractinference.NewYork:Wiley.Jongbloed,G.(2000).Minimaxlowerboundsandmoduliofcontinuity.StatisticsandProbabilityLetters,50,279–284.Kolmogorov,A.(1993).Informationtheoryandthetheoryofalgorithms.Boston:Kluwer.Kontoyiannis,I.(1997).Second-ordernoiselesssourcecodingtheorems.IEEETransactionsInformationTheory,43,1339–1341.Lugosi,A.,&Nobel,A.B.(1996).Consistencyofdata-drivenhistogrammethodsfordensityestimationandclassification.AnnalsofStatistics,24,687–706.McDiarmid,C.(1989).Onthemethodofboundeddifferences.InJ.Siemons(Ed.),Surveysincombinatorics(pp.148–188).Cambridge:CambridgeUniversityPress.Miller,G.(1955).Noteonthebiasofinformationestimates.InH.Quastler(Ed.),InformationtheoryinpsychologyII-B(pp.95–100).Glencoe,IL:FreePress.Nemenman,I.,Shafee,F.,&Bialek,W.(2002).Entropyandinference,revisited.InT.G.Dietterich,S.Becker,&Z.Ghahramani(Eds.),Advancesinneuralinformationprocessing,14.Cambridge,MA:MITPress.Paninski,L.,Fellows,M.,Hatsopoulos,N.,&Donoghue,J.(1999).Codingdy-namicvariablesinpopulationsofmotorcortexneurons.SocietyforNeuro-scienceAbstracts,25,665.9.Paninski,L.,Fellows,M.,Hatsopoulos,N.,&Donoghue,J.(2003).Temporaltun-ingpropertiesforhandpositionandvelocityinmotorcorticalneurons.Manuscriptsubmittedforpublication.Paninski,L.,Lau,B.,&Reyes,A.(inpress).Noise-drivenadaptation:Invitroandmathematicalanalysis.Neurocomputing.Availableon-line:http://www.cns.nyu.edu/∼liam/adapt.html.Panzeri,S.,&Treves,A.(1996).Analyticalestimatesoflimitedsamplingbiasesindifferentinformationmeasures.Network:ComputationinNeuralSystems,7,87–107.PrakasaRao,B.(2001).Cramer-Raotypeintegralinequalitiesforgenerallossfunctions.TEST,10,105–120.Press,W.,Teukolsky,S.,Vetterling,W.,&Flannery,B.(1992).NumericalrecipesinC.Cambridge:CambridgeUniversityPress.Rieke,F.,Warland,D.,deRuytervanSteveninck,R.,&Bialek,W.(1997).Spikes:Exploringtheneuralcode.Cambridge,MA:MITPress.Ritov,Y.,&Bickel,P.(1990).Achievinginformationboundsinnon-andsemi-parametricmodels.AnnalsofStatistics,18,925–938.Schervish,M.(1995).Theoryofstatistics.NewYork:Springer-Verlag.Serfling,R.(1980).Approximationtheoremsofmathematicalstatistics.NewYork:Wiley. EstimationofEntropyandMutualInformation1253Shao,Y.,&Hahn,M.(1995).Limittheoremsforthelogarithmofsamplespacings.StatisticsandProbabilityLetters,24,121–132.Steele,J.(1986).AnEfron-Steininequalityfornonsymmetricstatistics.AnnalsofStatistics,14,753–758.Strong,S.Koberle,R.,deRuytervanSteveninckR.,&Bialek,W.(1998).Entropyandinformationinneuralspiketrains.PhysicalReviewLetters,80,197–202.Tishby,N.,Pereira,F.,&Bialek,W.(1999).Theinformationbottleneckmethod.InB.Hajek&R.S.Sreenivas(Eds.),Proceedingsofthe37thAllertonConferenceonCommunication,Control,andComputing.Urbana,IL:UniversityofIllinoisPress.Treves,A.,&Panzeri,S.(1995).Theupwardbiasinmeasuresofinformationderivedfromlimiteddatasamples.NeuralComputation,7,399–407.vanderVaart,A.,&Wellner,J.(1996).Weakconvergenceandempiricalprocesses.NewYork:Springer-Verlag.Vapnik,V.N.,&Chervonenkis,A.J.(1971).Ontheuniformconvergenceofrelativefrequenciesofeventstotheirprobabilities.TheoryofProbabilityandItsApplications,16,264–280.Victor,J.(2000a).Asymptoticbiasininformationestimatesandtheexponential(Bell)polynomials.NeuralComputation,12,2797–2804.Victor,J.(2000b).Howthebrainusestimetorepresentandprocessvisualinfor-mation.BrainResearch,886,33–46.Victor,J.(2002).Binlessstrategiesforestimationofinformationfromneuraldata.PhysicalReviewE,66,51903–51918.Watson,G.(1980).Approximationtheoryandnumericalmethods.NewYork:Wiley.Weaver,W.,&Shannon,C.E.(1949).Themathematicaltheoryofcommunication.Urbana:UniversityofIllinoisPress.Wolpert,D.,&Wolf,D.(1995).Estimatingfunctionsofprobabilitydistributionsfromafinitesetofsamples.PhysicalReviewE,52,6841–6854.ReceivedJune3,2002;acceptedNov.27,2002.

当前文档最多预览五页,下载文档查看全文

此文档下载收益归作者所有

当前文档最多预览五页,下载文档查看全文
温馨提示:
1. 部分包含数学公式或PPT动画的文件,查看预览时可能会显示错乱或异常,文件下载后无此问题,请放心下载。
2. 本文档由用户上传,版权归属用户,天天文库负责整理代发布。如果您对本文档版权有争议请及时联系客服。
3. 下载前请仔细阅读文档内容,确认文档内容符合您的需求后进行下载,若出现内容与标题不符可向本站投诉处理。
4. 下载文档时可能由于网络波动等原因无法下载或下载错误,付费完成后未能成功下载的用户请联系客服处理。
大家都在看
近期热门
关闭