Density Estimation Trees


Parikshit Ram
Georgia Institute of Technology, Atlanta GA

Alexander G. Gray
Georgia Institute of Technology, Atlanta GA

ABSTRACT

In this paper we develop density estimation trees (DETs), the natural analog of classification trees and regression trees, for the task of density estimation. We consider the estimation of a joint probability density function of a d-dimensional random vector X and define a piecewise constant estimator structured as a decision tree. The integrated squared error is minimized to learn the tree. We show that the method is nonparametric: under standard conditions of nonparametric density estimation, DETs are shown to be asymptotically consistent. In addition, being decision trees, DETs perform automatic feature selection. They empirically exhibit the interpretability, adaptability and feature selection properties of supervised decision trees while incurring slight loss in accuracy over other nonparametric density estimators. Hence they might be able to avoid the curse of dimensionality if the true density is sparse in dimensions. We believe that density estimation trees provide a new tool for exploratory data analysis with unique capabilities.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; G.3 [Probability and Statistics]: Nonparametric statistics

General Terms: Algorithms, Experimentation

Keywords: Decision trees, density estimation, data analysis

1. INTRODUCTION

Three most fundamental tasks of machine learning are classification, regression, and density estimation. Classification and regression are instances of supervised data analysis where a training set of examples is provided. The learning task is to estimate a function using the training set which also performs well on a test set. The third task, density estimation, is an instance of unsupervised learning. This is generally harder because one does not have any instance of the ground truth regarding the quantity being estimated.

Decision trees [1] have been widely used in the supervised setting for classification and regression. In this paper we introduce, derive, and explore the natural analog of classification trees and regression trees for the unsupervised task of density estimation. To our knowledge this analogy has never been explored rigorously, though several other ideas for density estimation involving hierarchical schemes have been proposed. Our hope is that the unique advantages of decision trees in the supervised setting will transfer to the unsupervised setting, to give rise to a new nonparametric density estimation method with interesting capabilities.

Density estimation. The problem of density estimation can be defined as estimating an unknown distribution f on X¹ with a close approximation f̂: X → R⁺, given a set of N iid (independent and identically distributed) observations {X₁, X₂, ..., X_N} ⊂ X drawn from f. Estimating the probability density of the given data is a fundamental task in multivariate statistics. It often appears as a subroutine in other inference tasks; for example, in classification, one is required to find

  P̂(C | X) = p̂(X | C) P(C) / p̂(X),

where C is one of the K possible classes and p̂(X | C) is the class-conditional density of the data X [2, 3, 4]. It is also widely used for conducting exploratory analyses such as outlier or cluster detection.
Decision trees. Decision trees [1] were developed for classification [5] and regression [6]. They are primarily used for the purpose of supervised learning and have been widely and successfully used in nonparametric classification and regression. The piecewise constant estimators and the simplistic model of univariate binary splits of the data in decision trees lead to relatively less accurate estimators with unknown convergence rates [7]. Techniques such as bagging are used to boost this accuracy. More sophisticated methods with better accuracies and clearer asymptotic properties, such as local linear regression [8] and support vector machines [9], exist for regression and classification respectively. Nonetheless, anecdotally, decision trees remain one of the more widely used methods in practice [7], possibly even the most widely used. This is due to the inherent intuitiveness, adaptability and interpretability of the models learned. Moreover, they are capable of dealing with mixed categorical, discrete and continuous variables within one unified framework. They also perform feature selection and are easy to implement. These properties come at the cost of accuracy in prediction but still make decision trees very desirable and practical.

¹ If all the attributes of the data are continuous then X ⊆ R^d.

Table 1: Characteristics of methods for density estimation (np: nonparametric; COD: cluster/outlier detection; VI: relative variable importance; ABD/AWD: adaptability between/within dimensions; q: per query).

Methods        np   Accuracy      COD  VI        Rules  ABD  AWD  Training                  Query
MoG            no   low           yes  no        no     no   no   fast (EM algorithm)       very fast O(d²)
Histogram      yes  low           no   no        no     no   no   fast O(MdN)               very fast O(d)
KDE (FBW)      yes  medium        no   no        no     no   no   slow O(HdN²)              slow O(dN)
KDE* (FBW)     yes  medium        no   no        no     no   no   fast O(HdN)               medium O(d log N)
KDE (VBW)      yes  highest       no   no        no     yes  yes  very slow O(H^d dN²)/q    slow O(dN)
KDE* (VBW)     yes  high          no   no        no     yes  yes  slow O(H^d dN log N)/q    medium O(d log N)
local r-KDE    yes  very high     no   implicit  no     yes  yes  fast (1 CV step/q)        slow O(dN)
global r-KDE   yes  medium/high   no   implicit  no     yes  no   fast (N CV steps)         slow O(dN)
DET            yes  medium        yes  yes       yes    yes  yes  slow (LOO-CV)             fast O(D_T)

Another advantage of decision trees is the efficient test querying once the model (in this case a tree) has been trained. Decision trees trade some accuracy for a simple, interpretable model. Considering that density estimation is widely used in exploratory analysis, giving up accuracy (which is not a useful concept in the absence of the ground truth) for understanding is acceptable.

1.1 Nonparametric Density Estimators

Histograms and kernel density estimators (KDEs) [10] are simpler nonparametric (np) techniques, whereas variants of KDEs such as rodeo-KDEs [11] and wavelet based methods [8] are more complex nonparametric estimators. A number of nonparametric Bayesian methods have also been developed for the task of density estimation [12]. Mixtures of Gaussians (MoGs) are widely used parametric density estimators. Even though the nonparametric methods have been shown to be significantly more accurate and to require fewer assumptions on the data than parametric methods, MoGs are widely used because of their interpretability as clusters (as well as their simple implementation).

Adaptability and Speed. Adaptability has two implications in density estimation: (1) adaptability between dimensions (ABD): the estimation process should treat dimensions differently depending on their influence on the density (for example, the addition of unimportant dimensions of uniform noise should not affect the estimator); (2) adaptability within a dimension (AWD): determining regions of interest within a given dimension and adjusting the estimator to the local behavior of the density in that dimension (for example, regions with fast changing density vs. regions with flat densities).

MoGs are not known to adapt between or within dimensions. Fixed-bandwidth KDEs (FBW) are accurate nonparametric estimators but are not adaptable between or within dimensions because of the restriction of using a multivariate kernel with the same bandwidth in every dimension (a spherical kernel) for every point. The bandwidth is chosen through leave-one-out cross-validation (LOO-CV) by picking the one with the best CV loss among the H different values tried. A significantly high value of H is required to obtain accurate results. LOO-CV for the naïve implementation of KDEs takes O(HdN²) time. Spatial partitioning tree data structures have been used for fast approximate computation of KDEs (KDE*s) [13, 14]. For cover-trees [15], this process requires O(HdN) time [16].

Various KDEs with adaptive bandwidths have been developed but are complicated in practice. The nearest-neighbor KDEs [10] are locally adaptive, choosing bandwidths based on the k-th-nearest-neighbor distance. However, they are not adaptive over dimensions since they still use a spherical multivariate kernel. Moreover, the estimate does not represent a probability density function (pdf), and is discontinuous and heavy tailed. A truly adaptive (ABD) KDE would require an optimal bandwidth for each dimension (VBW). This means the kernel used would be elliptical, adapting to the density of the data. This makes the VBW-KDEs adaptable between dimensions. If H bandwidths are tried in each dimension, the training time required for the naïve implementation of KDEs would be O(H^d dN²).
To make this locally adaptive for each dimension (AWD), the bandwidth estimation would be required for each query q. Even the faster methods (KDE*) would require O(H^d dN log N) training time for each query. This makes the computational cost of CV for VBW-KDEs intractable even in medium dimensions.

A recent algorithm [11] uses rodeo [17] to greedily select adaptive bandwidths for a KDE in each dimension. Under certain sparsity conditions on the unknown density function, the resulting estimator is shown to empirically perform well and to obtain a near optimal convergence rate for KDEs. The local version (local r-KDE) of this estimator computes the optimal bandwidth in every dimension for every query point. The paper demonstrates rodeo's adaptability: implicitly identifying the dimensions of interest during the process of bandwidth selection, as well as identifying regions of interest within a single dimension by selecting smaller bandwidths where the density varies more rapidly. However, this technique is expensive, requiring a rodeo step for every single query. The rodeo step is iterative and we are not aware of its runtime bound; however, it is empirically faster than LOO-CV for VBW-KDEs. The global rodeo [11] (global r-KDE) improves efficiency by using a fixed bandwidth within each dimension, estimating it by averaging the estimated bandwidths over the training points in every dimension. Hence, the resulting estimator loses local adaptability within a dimension (the AWD property). The training time now involves N rodeo steps instead of a rodeo step for each query. The accuracy of the estimate obviously depends on N (a larger number of training queries implies more accurate estimates and slower training). KDEs can be made adaptable, but at the cost of computational complexity.

Moreover, given the estimated bandwidths, the time taken for a single query using the naïve implementation of a KDE is O(dN). The cover-tree data structure provides an approximate estimate in O(d log N) query time². For O(N) queries, the query cost can be amortized over the queries using a tree data structure over the queries, requiring a total time of O(dN) [16].

² It is important to note that building the spatial partitioning tree has a one-time cost of O(dN log N).
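For concreteness, the following is a minimal Python sketch (ours, not from the paper) of fixed-bandwidth selection by LOO-CV for a Gaussian KDE, whose naïve cost grows as O(HdN²) as stated above. The unbiased CV criterion follows the standard ISE form; the function name and interface are assumptions.

import numpy as np

def loo_cv_bandwidth(X, bandwidths):
    """Pick a fixed Gaussian-KDE bandwidth by leave-one-out CV.

    Minimizes the unbiased LOO estimate of the integrated squared error,
    J(h) = int fhat^2 dx - (2/N) sum_i fhat_{-i}(X_i).
    Naive cost is O(H d N^2) for H candidate bandwidths.
    """
    N, d = X.shape
    # Pairwise squared distances: O(d N^2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    best_h, best_J = None, np.inf
    for h in bandwidths:
        K = np.exp(-sq / (2 * h * h)) / (2 * np.pi * h * h) ** (d / 2)
        # int fhat^2 dx: convolving two Gaussian kernels of bandwidth h
        # gives a Gaussian kernel of bandwidth sqrt(2) h.
        K2 = np.exp(-sq / (4 * h * h)) / (4 * np.pi * h * h) ** (d / 2)
        int_f2 = K2.sum() / N ** 2
        np.fill_diagonal(K, 0.0)            # leave one out
        loo = K.sum(axis=1) / (N - 1)       # fhat_{-i}(X_i)
        J = int_f2 - 2 * loo.mean()
        if J < best_J:
            best_h, best_J = h, J
    return best_h

# Example: h = loo_cv_bandwidth(np.random.randn(500, 2), np.logspace(-1, 0, 10))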

Decision trees are known for their adaptability over the data (over features by implicit feature selection, and within dimensions by choosing different leaf sizes in different regions). In this paper, we will demonstrate their adaptability for the task of density estimation. However, training decision trees using LOO-CV is an expensive step involving the growing of the tree and the subsequent cost-complexity pruning [1, 7]. However, this cost is amortized over multiple, empirically efficient queries. The query time for a decision tree estimator is O(D_T), where D_T is the depth of the decision tree T. The worst case upper bound for D_T is O(N), but it is empirically seen to be much tighter in practice (much closer to O(log N)). Hence decision trees also bring efficiency to nonparametric density estimation at the cost of some accuracy.

Histograms are also fast nonparametric estimators during training as well as query time. They require O(MdN) training time for trying M different binwidths and choosing the one with the best CV error. The query time is a blazing O(d). However, they lack the adaptability of decision trees, and become prohibitively inaccurate as the number of dimensions increases.

Interpretability. We also propose the use of decision trees for density estimation to introduce interpretability into the nonparametric setting. Decision trees provide an interesting overlap between the accuracy of nonparametric methods and the simplicity and interpretability of parametric methods. Interpretability in density estimation can be useful in providing the following information about the density of the data: (1) detecting clusters and outliers (COD); (2) providing relative variable/dimension importance (VI), identifying dimensions that significantly affect the density; (3) providing simple univariate rules. These simple rules can be used, for example, for directly specifying some chunk of a huge data set which sits in a DBMS database. In the case of a DET, this chunk might represent a cluster.

MoGs are known to detect clusters but are incapable of imparting any interpretability regarding the dimensions. The aforementioned method using rodeo [11] does implicitly identify dimensions of interest. However, it is hard to obtain the relative importance of dimensions by just looking at the values of the estimated bandwidths. Moreover, cluster and outlier detection is complicated with KDEs: an exhaustive scan of the whole space is needed to locate regions of high (clusters) and low (outliers) density. So KDEs are not interpretable density estimators. When accuracy is required with no concern over computation time, adaptive KDEs are the method of choice. However, they might be too costly for exploratory analysis.

We attempt to achieve the property of interpretability, while retaining adaptability, by using CART-style univariate splits, which give rise to the uniquely readable rule structure of decision trees. This key property is not shared by the several other methods for density estimation involving hierarchical schemes which have been proposed over the years. Determining dimensions of interest is a direct byproduct of the decision tree framework via the measure of relative importance of predictor variables [7]³. While linear models such as linear regression and linear SVMs yield such a capability via examination of the linear coefficients, this capability is harder to find in nonparametric methods.

³ We will define this measure in the technical section of the paper.

Figure 1: Density estimate in the form of a decision tree, and the relative variable importance.
A sorted list of the leaves of a DET can easily identify the clusters and the outliers in the data. All the aforementioned properties of these estimators are summarized in Table 1. We further motivate the desirability of the interpretability of a decision tree with the following example.

Motivating example. Consider the density estimate for the Iris data set (150 × 4)⁴ as a set of rules, depicted in Figure 1. Each leaf node contains the piecewise constant density estimate; more explicitly, f̂(x) takes one constant value for all x in a leaf R₁ (say), where the leaf R₁ is described by a conjunction of univariate conditions of the form {x: x₄ ≤ c₄ & x₃ ≤ c₃}. We have also listed the class memberships of the points in the leaf nodes⁵. This figure also presents the relative importance of the attributes. The tree representation lets us easily find regions of relatively high and low densities. The subtree corresponding to two of the regions contains purely one class, representing the well-known linearly separable class in the Iris data. The chart containing the relative importance of the predictor variables implies that the petal length has the highest importance with regards to the density, whereas the sepal length has no influence on the density (an example of feature deletion). Thus we see that density estimation in the decision tree framework provides much more information about the data than just the density estimates.

There are several other cases where it would be useful to get this kind of information about the data. (1) Astronomy: performing density estimation on quasar data is a common task in astronomy [18]. Common features in the data are colors. The feature-importance analysis corresponding to the density will indicate the unimportant color features. This information can be useful in deciding which filters are useful in future analyses, thereby conserving resources. (2) Bioinformatics: consider a metabolomics data set containing patient information with a large number of features, where the task is to differentiate cancer patients from healthy patients [19]. Density estimation performed over all patients' profiles can reveal which features are responsible for the main variation within, say, cancer patients, or within healthy patients. This is different from feature selection for classification, which would select the features that most differentiate the patients with cancer from the rest. In addition, cluster and outlier detection performed on this data set will produce interesting and possibly crucial information. Moreover, a representative partition of the state space by itself sheds light on the underlying structure of the distribution. Such information is particularly valuable in high dimensional problems where direct visualization of the data is difficult.

⁴ We will continue to represent the size of a data set with N points and d attributes/dimensions in the form (N × d) throughout this paper.
⁵ We intend to perform density estimation only on the points in the unsupervised setting. However, we provide the class memberships of the points in the node to motivate a well known property of the Iris data set.
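To illustrate the rule structure exploited above, here is a small hypothetical Python sketch (not from the paper) that prints each leaf of a tree as a conjunction of univariate conditions, e.g. to select a chunk of a table in a DBMS. The node layout and all numbers in the toy tree are made up.

class Node:
    def __init__(self, dim=None, cut=None, left=None, right=None, density=None):
        self.dim, self.cut = dim, cut          # internal: split x[dim] <= cut
        self.left, self.right = left, right
        self.density = density                 # set only at leaves

def rules(node, conds=()):
    """Yield (rule string, density) for every leaf, usable as a WHERE clause."""
    if node.density is not None:
        yield " AND ".join(conds) or "TRUE", node.density
        return
    yield from rules(node.left,  conds + (f"x{node.dim} <= {node.cut}",))
    yield from rules(node.right, conds + (f"x{node.dim} > {node.cut}",))

# A two-leaf toy tree in the spirit of Figure 1 (numbers invented):
toy = Node(dim=3, cut=0.6, left=Node(density=1.9), right=Node(density=0.2))
for rule, dens in rules(toy):
    print(f"{rule}: fhat = {dens}")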

Remark. DETs are similar to variable bin-width histograms, but are restricted to a hierarchical partitioning of the data (hence possibly having lower accuracy than the best possible variable bin-width histogram, although our experiments suggest otherwise).

1.2 Overview

The focus of this paper is to develop a rigorous decision-tree-structured density estimator and to demonstrate its usefulness and interpretability. In Section 2, we discuss existing connections between density estimation and decision trees. In the following section, we define an estimator based on the decision tree and apply the decision-tree framework for density estimation using this estimator. We also provide certain asymptotic properties of the decision-tree-structured estimator. Section 4 contains experimental results displaying the performance of DETs, with some comparisons with histograms and KDEs. The adaptability of DETs is demonstrated using some synthetic data sets. High dimensional image data sets are used to demonstrate the DETs' interpretability and their application to classification. Training and querying times of DETs are compared with the fastest training and querying method for KDEs (KDE*s). In the final section, conclusions are discussed along with certain open questions.

2. FURTHER RELATED WORK

Decision trees have been used alongside density estimation in earlier works. For example, Kohavi, 1996 [20] and Smyth et al., 1995 [21] use decision trees for the supervised task of classification, and density estimation (with naïve Bayes classifiers and KDEs respectively) is used solely for the purpose of obtaining smoother and more accurate classification and class-conditional probabilities, in contrast to the piecewise constant, non-smooth estimates of standard classification trees. Decision trees have been used for the task of diagnosing extrapolation [22] by building a classification tree differentiating the given data set from a random sample from a uniform distribution on the support of the data set. This tree provides a way to compute a measure of extrapolation. This measure can be indirectly used to compute the density, but the tree still performs classification. Decision trees have also been used in the supervised setting for estimating the joint probability (of the data and the label of the training examples) [23] for censored data by replacing the standard loss function (for example 0-1 loss or the Gini index) with the negative log-likelihood of the joint density. The idea of having a nested hierarchy of partitions of the support of the data has been used for the discretization of univariate data [24]. The set of observations is partitioned on the basis of the density using the log-likelihood loss function, but the focus is solely on univariate data. Decision trees have also been used in an unsupervised setting to perform hierarchical clustering [25], but this does not trivially translate to density estimation.
Seidl et al., 2009 [26] propose a novel method for indexing the density model in the form of a tree (called the Bayes tree) for fast access with a desired level of accuracy. This tree is grown bottom-up from the estimate of a KDE for the whole data set. The Bayes tree uses MoGs to index the density at the intermediate levels (increasing the number of Gaussians with the depth). This tree successfully locates clusters in the data. However, being a MoG, it fails to determine the relative relevance of the dimensions. Pólya trees [27] are hierarchical spatial partitioning trees, but are used as a prior for probability measures over the data space. Density estimation is done either by computing the posterior mean density given this prior, or by learning a fixed tree topology and computing the piecewise constant estimate conditional on this topology. DETs are related in spirit; however, they will be learned by directly minimizing the density estimation loss, and the estimates are finally obtained directly from the tree. Moreover, the aforementioned methods, though hierarchical in nature, do not share the interpretability and feature selection properties of decision trees, which are based on univariate splits.

3. DENSITY ESTIMATION TREE

This section provides the road map to performing density estimation using decision trees. We define an estimator and the corresponding loss function to be minimized during the training of the decision tree for continuous and mixed data. Following that, we explain the process of learning the optimal decision tree over the given sample and provide an asymptotic property of the DET.

3.1 Continuous Features

Definition 1. The piecewise constant density estimate of the decision tree T built on a set S of N observations in R^d is defined as

  f̂(x) = Σ_{l ∈ T̃} ( N_l / (N V_l) ) I(x ∈ l)    (1)

where T̃ is the set of leaves of the decision tree T representing the partitions of the data space, N_l is the number of observations of S in the leaf l, V_l is the volume of the leaf l within the d-dimensional bounding box of S, and I(·) is the indicator function.

A decision tree T requires a notion of a loss function R(T) which is minimized using a greedy algorithm to construct a tree on the set of observations. For the unsupervised task of density estimation, we consider the Integrated Squared Error (ISE) loss function [10]. The ISE gives a notion of the overall distance between the estimated and the true density, and is a favored choice in nonparametric density estimation for its inherent robustness in comparison to maximum-likelihood-based loss functions [28]. However, other distance functions, such as the KL-divergence, can be used as the loss function. We will explore this in the longer version of the paper.
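A minimal sketch of how the estimator in Definition 1 (Eq. 1) can be evaluated at a query point; the node layout is an assumption for illustration, not the authors' implementation. The query cost is the depth D_T of the tree, as discussed in Section 1.1.

class DETNode:
    def __init__(self, dim=None, cut=None, left=None, right=None,
                 n_leaf=None, volume=None):
        self.dim, self.cut = dim, cut              # internal: split x[dim] <= cut
        self.left, self.right = left, right
        self.n_leaf, self.volume = n_leaf, volume  # leaf: N_l and V_l

def det_density(root, x, n_total):
    """fhat(x) = N_l / (N * V_l) for the leaf l containing x.
    Cost is O(D_T), the depth of the tree."""
    node = root
    while node.n_leaf is None:                     # descend to the leaf
        node = node.left if x[node.dim] <= node.cut else node.right
    return node.n_leaf / (n_total * node.volume)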

The task of learning a DET involves solving the following optimization problem:

  min_{f̂ ∈ F} ∫_X ( f̂(x) − f(x) )² dx    (2)

where F is the class of estimators of the form in Definition 1 that can be learned with any set of N observations. After expanding the square and making the Monte Carlo substitution ∫_X f̂(x) f(x) dx ≈ (1/N) Σ_{i=1}^N f̂(X_i) (where {X_i}_{i=1}^N is the training set), the objective function in Eq. 2 is replaced by the following consistent plug-in estimator of the ISE [10]⁶:

  min_{f̂ ∈ F} { ∫_X f̂²(x) dx − (2/N) Σ_{i=1}^N f̂(X_i) }    (3)

Using the piecewise constant estimator from Definition 1 (which is constant within each leaf l), the objective function in Eq. 3 is replaced with

  min { Σ_{l ∈ T̃} ∫_l ( N_l / (N V_l) )² dx − (2/N) Σ_{i=1}^N Σ_{l ∈ T̃} ( N_l / (N V_l) ) I(X_i ∈ l) }    (4)

by substituting f̂²(x) = Σ_{l ∈ T̃} ( N_l / (N V_l) )² I(x ∈ l) (the cross terms in the expansion of f̂²(x) vanish because of the indicator functions). A simple calculation shows that

  (1/N) Σ_{i=1}^N f̂(X_i) = (1/N) Σ_{l ∈ T̃} ( N_l / (N V_l) ) Σ_{i=1}^N I(X_i ∈ l) = Σ_{l ∈ T̃} N_l² / (N² V_l),

and, making the substitution Σ_{l ∈ T̃} ∫_l ( N_l / (N V_l) )² dx = Σ_{l ∈ T̃} ( N_l / (N V_l) )² V_l = Σ_{l ∈ T̃} N_l² / (N² V_l), the estimator of the ISE for a DET takes the following form:

  R(T) = − Σ_{l ∈ T̃} N_l² / (N² V_l)    (5)

Defining Eq. 5 as the error R(T) of the tree, the greedy surrogate of the error for any node t (internal or otherwise) can be defined as

  R(t) = − N_t² / (N² V_t)    (6)

The tree is grown in a top-down manner by maximizing the reduction in this greedy surrogate of the error over the given observations.

⁶ The term ∫_X f²(x) dx is removed from the objective function since it is independent of the estimate and hence does not affect the optimum.

3.2 Mixed Features

For density estimation over data with mixed features, we define a novel density estimator and a loss function involving ordinal and categorical data along with continuous data, which can be used to learn DETs with mixed data:

Definition 2. Let S ⊂ R^{d₁} × Z^{d₂} × C^{d₃} with d₁ real features, d₂ ordinal features and d₃ categorical features. The piecewise constant density estimator of the decision tree T built on S is defined as

  f̂(x) = Σ_{l ∈ T̃} N_l / ( N · V_{l,d₁} · Π_{j=1}^{d_{l2}} R_{lj} · Π_{i=1}^{d_{l3}} M_{li} ) I(x ∈ l)    (7)

where V_{l,d₁} is the volume of the leaf l within the d₁-dimensional bounding box of the real part of S, R_{lj} is the range of the ordinal values in the j-th of the d_{l2} ordinal dimensions present in l, and M_{li} is the number of categories present in l for the i-th of the d_{l3} categorical dimensions present in l.

The error at a node t corresponding to the ISE is then obtained as

  R(t) = − N_t² / ( N² · V_{t,d₁} · Π_{j=1}^{d_{t2}} R_{tj} · Π_{i=1}^{d_{t3}} M_{ti} )    (8)

where V_{t,d₁} is the volume of the node t in the d₁ real dimensions, R_{tj} is the range of the ordinal values in the j-th of the d_{t2} ordinal dimensions present in t, and M_{ti} is the number of categories present in t for the i-th of the d_{t3} categorical dimensions present in t.

3.3 Tree Construction

We use the tree learning algorithm presented in Breiman et al., 1984 [1]. The splitting, pruning and cross-validation procedures are modified to work with this new loss function and for the unsupervised task of density estimation.

Splitting. For growing the tree, each node is split into two children. Let S be the set of all univariate splits.

Definition 3. [1] The best split s* of a node t is the split in the set of splits S which maximally reduces R(T).

This is done by greedily reducing R(t) for all the terminal nodes of the current tree. Hence, for any currently terminal node t, we need to find a split s* of node t into t_L and t_R such that

  s* = argmax_{s ∈ S} { R(t) − R(t_L) − R(t_R) }    (9)

where N_t = N_{t_L} + N_{t_R}. For continuous and ordinal dimensions, this optimization is performed by trying every possible split of the data in each of the dimensions. For categorical dimensions, the splits are performed in a manner similar to the CART model [1]. The splitting is stopped when the node size N_t goes below a certain threshold (say N_min).
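The following sketch illustrates the exhaustive best-split search of Eq. 9 using the node error of Eq. 6. It assumes axis-aligned boxes with known bounds and is only meant to mirror the description above, not the authors' code.

import numpy as np

def node_error(n_t, vol_t, n_total):
    """Greedy surrogate of the ISE at a node, Eq. (6): R(t) = -N_t^2 / (N^2 V_t)."""
    return -(n_t ** 2) / (n_total ** 2 * vol_t)

def best_split(X, lo, hi, n_total):
    """Exhaustive search for the split maximizing R(t) - R(t_L) - R(t_R), Eq. (9).

    X: points in node t; lo, hi: the node's bounding box. Child volumes are
    obtained by cutting the box at the candidate threshold.
    """
    n_t, d = X.shape
    vol_t = np.prod(hi - lo)
    r_t = node_error(n_t, vol_t, n_total)
    best = (None, None, -np.inf)                 # (dim, cut, gain)
    for dim in range(d):
        xs = np.sort(X[:, dim])
        for k in range(n_t - 1):                 # midpoints between order statistics
            if xs[k] == xs[k + 1]:
                continue
            cut = 0.5 * (xs[k] + xs[k + 1])
            frac = (cut - lo[dim]) / (hi[dim] - lo[dim])
            r_l = node_error(k + 1, vol_t * frac, n_total)
            r_r = node_error(n_t - k - 1, vol_t * (1 - frac), n_total)
            gain = r_t - r_l - r_r
            if gain > best[2]:
                best = (dim, cut, gain)
    return best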
Pruning. To avoid overfitting, we use minimal cost-complexity pruning [1]. The regularized error of a subtree rooted at a node t is defined as

  R_α(t) = R(T_t) + α |T̃_t|    (10)

where α is a regularization parameter (to be estimated through cross-validation), T_t is the subtree rooted at t, and |T̃_t| is the number of leaves of that subtree. The value of α is gradually increased, and the subtree rooted at t is pruned at the value of α where R_α(t) = R_α({t}), the regularized error of the pruned subtree. Since the size of the initial tree constructed by the splitting algorithm described previously is finite, the number of possible values of α at which a prune occurs is finite and can be calculated efficiently. Hence we only need to select the optimal α from a finite set of values (see Breiman et al., 1984 [1] for complete details).
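A sketch of the weakest-link computation behind Eq. 10: solving R_α({t}) = R_α(t) for α gives the critical value g(t) = (R(t) − R(T_t)) / (|T̃_t| − 1), and the smallest such value over internal nodes is where the next prune occurs. The node fields below are assumptions for illustration.

# Assumes each node caches its own error r = R(t) from Eq. (6).
class PrunableNode:
    def __init__(self, r, left=None, right=None):
        self.r, self.left, self.right = r, left, right

def subtree_stats(t):
    """Return (sum of leaf errors R(T_t), number of leaves |T~_t|)."""
    if t.left is None:
        return t.r, 1
    rl, nl = subtree_stats(t.left)
    rr, nr = subtree_stats(t.right)
    return rl + rr, nl + nr

def weakest_link_alpha(t):
    """Smallest alpha at which some subtree gets pruned:
    R_alpha({t}) <= R_alpha(t)  <=>  alpha >= (R(t) - R(T_t)) / (|T~_t| - 1)."""
    if t.left is None:
        return float("inf")
    r_sub, n_leaves = subtree_stats(t)
    g = (t.r - r_sub) / (n_leaves - 1)
    return min(g, weakest_link_alpha(t.left), weakest_link_alpha(t.right))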

Cross-validation. The leave-one-out cross-validation (LOO-CV) estimator of the density estimation loss function in Eq. 3 is given by

  Ĵ(α) = ∫_X f̂_α²(x) dx − (2/N) Σ_{i=1}^N f̂_α^{(−i)}(X_i)    (11)

where f̂_α is the estimator with the decision tree T pruned with the regularization parameter α, and f̂_α^{(−i)} is the estimator with the decision tree T^{(−i)}, built without the training example X_i, pruned with the regularization parameter α. This LOO-CV estimator is obtained from Silverman, 1986 [10] by switching the regularization parameter (replacing the bandwidth with α). The best sized tree is the tree T pruned with the parameter α* such that

  α* = argmin_α Ĵ(α)    (12)

3.4 Asymptotic Properties

We show that the method is nonparametric: it is consistent under mild assumptions on the model class of the input distribution f. Consistency is typically shown [8] for a nonparametric estimator f̂_N obtained from a set of N observations by showing that

  Pr ( lim_{N→∞} ∫_X ( f̂_N(x) − f(x) )² dx = 0 ) = 1    (13)

We prove the consistency of DETs on data with continuous features. The proof of consistency of the density estimator proposed in Definition 1 follows arguments similar to those used to show the consistency of regression trees [1].

Theorem 1. The estimator f̂_N defined in Definition 1 satisfies Eq. 13.

Proof. Given a fixed positive integer d, let B denote the collection of all sets A ⊂ X that can be described as the solution set of a system of 2d inequalities, each of the form bᵀx ≤ c for b ∈ R^d and c ∈ R. Now in a decision tree T, every leaf l ∈ T̃ is the solution set of a system of 2d inequalities of the form bᵀx ≤ c, with b ∈ R^d having just one entry equal to ±1 and the rest of the entries equal to 0. Therefore, T̃ ⊂ B.

Let X_n, n = 1, ..., N, be a random sample from the density function f on X. For N ≥ 1, let F̂_N denote the empirical distribution of X_n, n = 1, ..., N, defined on a set A ⊂ X by

  F̂_N(A) = (1/N) Σ_{n=1}^N I(X_n ∈ A) = N_A / N = ∫_A f̂_N(x) dx    (14)

where N_A is the number of the random samples {X_n}_{n=1}^N in the set A, and f̂_N(x) is the estimator given in Definition 1. According to a general version of the Glivenko-Cantelli theorem [29],

  Pr ( lim_{N→∞} sup_{A ∈ B} | F̂_N(A) − ∫_A f(x) dx | = 0 ) = 1    (15)

By Eqs. 14 and 15, we get

  Pr ( lim_{N→∞} sup_{A ∈ B} | ∫_A f̂_N(x) dx − ∫_A f(x) dx | = 0 ) = 1.

Assuming that lim_{N→∞} Pr(diameter(A) ≥ ε) = 0 for the cells A of the tree, and hence Pr( lim_{N→∞} ∫_A dx = 0 ) = 1, we get the following with probability 1, for some x* ∈ A:

  lim_{N→∞} sup_{A ∈ B} | ∫_A ( f̂_N(x) − f(x) ) dx | ≥ lim_{N→∞} | f̂_N(x*) − f(x*) | ∫_A dx = 0.

This assumption is commonly used for the consistency of data-partitioning estimators [30] and is justified since, as N → ∞, the diameter of any leaf node becomes smaller and smaller because a leaf node can only contain a bounded number of points. Hence

  Pr ( lim_{N→∞} sup_{A ∈ B} | ∫_A ( f̂_N(x) − f(x) ) dx | = 0 ) = 1
  ⟹ Pr ( lim_{N→∞} ∫_X ( f̂_N(x) − f(x) )² dx = 0 ) = 1,

and f̂_N satisfies Eq. 13. ∎

4. EXPERIMENTS

In this section, we demonstrate the performance of DETs under different conditions using synthetic and real data sets. Estimation accuracies of DETs are compared with existing nonparametric estimators. We exhibit the interpretability of DETs with two real data sets. Furthermore, density estimates of DETs are applied to classification and the subsequent labelling accuracies are presented. Finally, the speeds of training and querying DETs are presented on several real data sets and compared with an existing method. We only consider continuous and ordinal data in this paper for lack of space. Experiments with categorical and mixed data will be presented in the longer version of the paper.

4.1 Estimation Accuracy: Synthetic Examples

For the unsupervised task of density estimation, estimation accuracy can only be computed on synthetic data sets. We choose the best-sized tree through LOO-CV.
To compute the estimation error on synthetic data for density queries, we use two measures: the Root Mean Squared Error (RMSE) and the Hellinger Distance (HD). We compare DETs only with other nonparametric density estimators, since parametric estimators would involve choosing the model class (like choosing the number of Gaussians in MoGs). We report the comparison between DETs, histograms (bin-width selected by LOO-CV) and KDEs (using the Gaussian kernel) with the bandwidth selected by unbiased LOO-CV using the ISE criterion (FBW) and by local rodeo (local r-KDE) (VBW), on two synthetic data sets.

Example 1. (Strongly skewed distribution in one dimension) This distribution is defined as

  X ∼ Σ_{i=0}^{7} (1/8) Normal( 3((2/3)^i − 1), (2/3)^(2i) ).

Figure 2 shows the estimated density functions obtained by a DET, a histogram and two KDEs for a sample from this distribution. Because of the high skewness of the density function, the KDE (FBW) and the histogram fail to fit the very smooth tail: the fixed bandwidth/binwidth results in a highly wiggly estimate there. The DET provides a piecewise constant estimate, but it adjusts to the different parts of the density function by varying the leaf size: the tree has small leaves closer to the spike, where the density function changes rapidly, and larger leaves where the density function does not change rapidly. This demonstrates the adaptability of DETs within a single dimension. The local r-KDE exhibits the same adaptability by obtaining different bandwidths in different regions, hence capturing the smooth tail of the true density.
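The two error measures used in this section can be computed by numerical integration over a set of evaluation points; the sketch below is an assumption on our part, since the paper does not spell out its integration scheme, and the 1/2 factor in the squared Hellinger distance is one common convention.

import numpy as np

def rmse_hd(f_true, f_hat, X_grid, weights):
    """RMSE and Hellinger distance between two densities on a grid.

    f_true, f_hat: vectorized callables returning densities;
    weights: quadrature weights (e.g. grid cell volumes).
    """
    ft, fh = f_true(X_grid), f_hat(X_grid)
    rmse = np.sqrt(np.mean((ft - fh) ** 2))       # error over the density queries
    # Squared Hellinger distance: 0.5 * int (sqrt(f) - sqrt(fhat))^2 dx
    h2 = 0.5 * np.sum(weights * (np.sqrt(ft) - np.sqrt(fh)) ** 2)
    return rmse, np.sqrt(h2)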

Figure 2: Perspective plots of the density estimates for a DET, a histogram and two KDEs on the data set in Example 1: (a) DET, (b) Histogram, (c) KDE (FBW), (d) local r-KDE.

Table 2: Estimation errors (RMSE and HD) for histograms, KDEs (FBW), local r-KDEs and DETs with increasing number of observations.

Table 2 provides the estimation errors with increasing sample sizes for the different methods. The RMSE values are fairly close, and the HD values for the DETs are in fact better than those of the KDE using unbiased LOO-CV. This can be attributed to the adaptive nature of DETs with respect to the leaf sizes. The error values of the local r-KDE are better than those of the DET when the data set size is small (in the case of larger data sets, the local method might be overfitting on this instance of the data set). The accuracies of the histogram estimators and DETs are comparable in this univariate example. But we will demonstrate in the next example that the performance of histograms declines with increasing dimensions.

Example 2. (Two dimensional data: mixture of Beta distributions in one dimension, uniform in the other dimension) We create a two-dimensional data set by sampling one dimension as a mixture of Beta distributions and the other dimension as a uniform distribution (X₁ follows a two-component Beta mixture; X₂ ∼ U(0, 1)).

Figure 3: The density estimates obtained for Example 2 using a DET, a histogram and two KDEs, compared to the true density along the relevant and the irrelevant dimensions: (a) DET, (b) Histogram, (c) KDE (FBW), (d) local r-KDE.

Figure 3 shows the two-dimensional density estimates of a DET, a histogram and two KDEs. The DET fits the irrelevant uniform dimension perfectly, while closely approximating the mixture of Beta distributions in the relevant dimension. The KDE with fixed bandwidth and the histogram estimator completely fail to fit the irrelevant dimension. The local r-KDE does a better job of fitting the irrelevant dimension compared to the KDE (FBW), but still does not entirely capture the uniform density.

4.2 Interpretability: Variable Importance

The decision-tree framework defines a measure of relevance for each of the predictor variables [7] as follows:

Definition 4. The measure of relevance I_{d'}(T) for an attribute X_{d'} (the d'-th attribute) in a decision tree T is defined as

  I_{d'}(T) = Σ_{t ∈ T \ T̃} Δι_t I( d(t) = d' )    (16)

where the summation is over the internal nodes t of the tree T, and the attribute d(t) used to partition node t obtains an improvement of Δι_t in the squared error over the constant fit in node t.

We use this measure to present the interpretability of the DETs on two real data sets.

Example 3. (Iris data set) We perform density estimation on this 4-dimensional data set and compute the variable importance for each of the dimensions. Figure 1 displays the tree and the relative variable importance. The interpretation has been explained in Section 1.1.

Example 4. (MNIST digits image data) Each image has 28-by-28 pixels, providing a 784-dimensional data set. We perform density estimation on the images of the digit 8 (left panel of Figure 4).
This is also an instance of performing density estimation with ordinal data, since the pixel values are discrete. We compute the variable importance of each of the dimensions (pixels in this case). The right panel of Figure 4 displays the results obtained. The black pixels indicate pixels with zero variable importance in density estimation. For the pixels with non-zero variable importance, the color transitions from green (indicating low relative variable importance) to red (indicating high relative variable importance). The marginal densities of many pixels are close to point masses; hence the density does not vary at all in those dimensions, and the relative variable importance depicted in Figure 4 indicates that the DET is capable of capturing this property of the data.
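Definition 4 translates directly into a traversal accumulating each split's error improvement. A hedged sketch follows, where the node fields (left, right, dim, gain) are assumptions rather than the authors' data structure:

def variable_importance(root, d):
    """Eq. (16): per-dimension sum of the error improvements Delta iota_t
    achieved by the splits on that dimension; returned as relative values,
    as in the bar chart of Figure 1."""
    imp = [0.0] * d
    stack = [root]
    while stack:
        t = stack.pop()
        if t.left is not None:          # internal node
            imp[t.dim] += t.gain        # gain = R(t) - R(t_L) - R(t_R)
            stack.extend((t.left, t.right))
    total = sum(imp)
    return [v / total if total > 0 else 0.0 for v in imp]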

Table 4: Timings (in seconds) for training (T) and all queries (Q) with KDE*s and DETs on the SJ, Colors, Bio, Pall7 and PSF data sets.

Figure 4: Relative importance of the predictor variables (pixels in this case) in density estimation for images of the digit 8: (a) originals, (b) variable importance.

Table 3: Classification accuracy of r-KDEs vs. DETs for each case, with the number of training and test points.

4.3 Classification

As mentioned earlier, density estimation is a common subroutine in classification, where the class-conditional density p̂(x | C) is to be estimated for the query point x using the training set X. The estimated class is Ĉ(x) = argmax_C p̂(x | C) P̂(C). We compare the accuracy of classification between the density estimates computed by DETs and local r-KDEs.

Example 5. (Opt-digits image data (1797 × 64) [31]) We conduct binary as well as multi-class classification. For binary classification, we consider five pairs of digits (among them 5 vs. 8 and 8 vs. 9). We perform multi-class classification using all the classes (0-9). Table 3 lists the classification accuracies for the different tasks. In most cases, the accuracies are close for the two methods of density estimation, with the r-KDEs being more accurate than the DETs. But in the last two cases, the DETs outperform the r-KDEs. Moreover, the whole experiment (training and testing) with the DETs is much faster than the one using the local r-KDEs: for example, the final multi-class experiment required far less time using DETs than using the r-KDEs⁷.

4.4 Speed

We compare DETs to KDE*s (FBW) using the fast approximate dual-tree algorithm [13] with kd-trees. We use LOO-CV to estimate the optimal bandwidth for the KDE*s (FBW). Real world data sets drawn from the SDSS repository [33] (SJ, PSF) and the UCI Machine Learning Repository [31] (Colors, a dimension-reduced version of the Bio data set, Pall7) are used. The query set is the same as the training set in all the experiments, and the KDE*s provide the leave-one-out estimates. For the KDE*s, we try H bandwidths and choose the one with the best CV error. Usually, a much larger number of bandwidths is tried, and the training times scale linearly with H. For the DETs, we use 10-fold cross-validation, since we were not able to find an efficient way to conduct LOO-CV with decision trees, and it would require a prohibitively long time for these large data sets. This is one major limitation of this method. However, for sufficiently large data sets, our experiments indicate that DETs trained with LOO-CV are only slightly more accurate than those with 10-fold CV.

The absolute timing values (in seconds) for training (T) and querying (Q) are given in Table 4. The speedups of the DETs over the KDE*s in test/query time are shown in Figure 5. As can be seen from the results, the training time for the decision tree algorithm is significantly larger than the LOO-CV for the KDE*s, but the DETs provide orders of magnitude speedup in query time. The KDE*s are fast to train, but their work comes at test time. This experiment demonstrates the efficient querying of decision trees, while also indicating that the task of training decision trees using LOO-CV is quite challenging.

⁷ The DET is implemented in C++ and the rodeo-KDE is implemented using MATLAB on the same machine.

Figure 5: Query time speedup of the DET over the KDE* on the SJ, Colors, Bio, Pall7 and PSF data sets. Note that the KDE* is faster than the DET in the case of training.
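The classification procedure of Section 4.3 reduces to fitting one density estimator per class and applying Bayes' rule. A sketch is given below; fit_det stands in for any routine returning a density estimate as a callable, and is an assumption rather than the paper's API.

import numpy as np

def fit_class_conditional(X, y, fit_det):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}      # Phat(C)
    dets = {c: fit_det(X[y == c]) for c in classes}     # phat(x | C)
    return classes, priors, dets

def predict(x, classes, priors, dets):
    # Chat(x) = argmax_C phat(x | C) * Phat(C)
    scores = [dets[c](x) * priors[c] for c in classes]
    return classes[int(np.argmax(scores))]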
5. DISCUSSION AND CONCLUSIONS

This framework of decision-tree-structured density estimation provides a new way to estimate the density of a given set of observations in the form of a simple set of rules, which are inherently intuitive and interpretable. This framework has the capability of easily dealing with categorical, discrete as well as continuous variables, and performs automatic feature selection.

On having these rules, the density of a new test point can be computed cheaply. All these features stem from the fact that the density estimation is performed in the form of a decision tree. Along with that, DETs are easily implementable, like supervised decision trees. Although these advantages come with a loss of accuracy, the DETs are shown to be consistent, and hence are quite reliable in the presence of enough data.

Future directions of improvement include reducing the discontinuities in the density estimate caused by the piecewise constant estimator, and reducing the boundary bias arising because DETs put no mass outside the span of the data. Boundary bias is a common problem for almost all density estimators. One possible remedy is to use a normalized KDE at each of the leaves instead of the piecewise constant estimator (similar to the approach in Smyth et al. [21]). An analysis of the convergence rate of the DETs would quantify the amount of accuracy lost to account for the simplicity of the estimator. Being effectively a variable bin-width histogram, we conjecture that the convergence rate for the DETs would be O(N^(-2/3)) for univariate data and better than histograms in higher dimensions, since we demonstrate that the DETs can effectively model uninformative dimensions without requiring the extra number of points imposed by the curse of dimensionality. Moreover, we are actively working on obtaining density dependent bounds on the depth of the DETs to quantify the runtimes for training and test queries.

Overall, this method for density estimation has immediate application to various fields of data analysis (for example, outlier and anomaly detection)⁸ and machine learning, due to its simple and interpretable solution to the fundamental task of density estimation.

⁸ We will demonstrate these applications in the longer version of the paper.

6. REFERENCES

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[5] J. H. Friedman. A recursive partitioning decision rule for nonparametric classification. Transactions on Computers, 1977.
[6] J. H. Friedman. A tree-structured approach to nonparametric multiple regression. Smoothing Techniques for Curve Estimation, 1979.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[8] L. Wasserman. All of Nonparametric Statistics. Physica-Verlag, 2006.
[9] C. Cortes and V. N. Vapnik. Support vector networks. Machine Learning, 1995.
[10] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
[11] H. Liu, J. Lafferty, and L. Wasserman. Sparse nonparametric density estimation in high dimensions using the rodeo. In AISTATS, 2007.
[12] P. Müller and F. A. Quintana. Nonparametric Bayesian data analysis. Statistical Science, 2004.
[13] A. G. Gray and A. W. Moore. Nonparametric density estimation: Toward computational tractability. In SIAM Data Mining, 2003.
[14] D. Lee and A. G. Gray. Fast high-dimensional kernel summations using the Monte Carlo multipole method. In Advances in NIPS, 2008.
[15] A. Beygelzimer, S. Kakade, and J. C. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[16] P. Ram, D. Lee, W. March, and A. Gray. Linear-time algorithms for pairwise statistical problems. In Advances in NIPS, 2009.
[17] J. Lafferty and L. Wasserman. Rodeo: Sparse nonparametric regression in high dimensions. Arxiv preprint math.ST/0506342, 2005.
[18] R. Riegel, A. G. Gray, and G. Richards. Massive-scale kernel discriminant analysis: Mining for quasars. In SIAM Data Mining (SDM), 2008.
[19] W. Guan, M. Zhou, C. Y. Hampton, B. B. Benigno, L. D. Walker, A. G. Gray, J. F. McDonald, and F. M. Fernandez. Ovarian cancer detection from metabolomic liquid chromatography/mass spectrometry data by support vector machines. BMC Bioinformatics, 2009.
[20] R. Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, 1996.
[21] P. Smyth, A. G. Gray, and U. M. Fayyad. Retrofitting decision tree classifiers using kernel density estimation. In ICML, 1995.
[22] G. Hooker. Diagnosing extrapolation: Tree-based density estimation. In SIGKDD, 2004.
[23] A. M. Molinaro, S. Dudoit, and M. J. van der Laan. Tree-based multivariate regression and density estimation with right-censored data. Journal of Multivariate Analysis, 2004.
[24] G. Schmidberger and E. Frank. Unsupervised discretization using tree-based density estimation. Lecture Notes in Computer Science, 2005.
[25] J. Basak and R. Krishnapuram. Interpretable hierarchical clustering by constructing an unsupervised decision tree. IEEE Transactions on Knowledge and Data Engineering, 2005.
[26] T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In EDBT: Advances in Database Technology, 2009.
[27] W. H. Wong and L. Ma. Optional Pólya tree and Bayesian inference. The Annals of Statistics, 2010.
[28] E. Parzen. On the estimation of a probability density function and mode. Annals of Mathematical Statistics, pages 1065-1076, 1962.
[29] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 1971.
[30] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Verlag, 1996.
[31] C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases.
[32] A. G. Gray and A. W. Moore. N-body problems in statistical learning. In Advances in NIPS, 2001.
[33] R. Lupton, J. E. Gunn, Z. Ivezic, G. R. Knapp, S. Kent, and N. Yasuda. The SDSS imaging pipelines. Arxiv preprint astro-ph/0101420, 2001.


More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

1.1. Example: Polynomial Curve Fitting 4 1. INTRODUCTION

1.1. Example: Polynomial Curve Fitting 4 1. INTRODUCTION 4. INTRODUCTION Figure.2 Plo of a raining daa se of N = poins, shown as blue circles, each comprising an observaion of he inpu variable along wih he corresponding arge variable. The green curve shows he

More information

Stability and Bifurcation in a Neural Network Model with Two Delays

Stability and Bifurcation in a Neural Network Model with Two Delays Inernaional Mahemaical Forum, Vol. 6, 11, no. 35, 175-1731 Sabiliy and Bifurcaion in a Neural Nework Model wih Two Delays GuangPing Hu and XiaoLing Li School of Mahemaics and Physics, Nanjing Universiy

More information

Solutions to Odd Number Exercises in Chapter 6

Solutions to Odd Number Exercises in Chapter 6 1 Soluions o Odd Number Exercises in 6.1 R y eˆ 1.7151 y 6.3 From eˆ ( T K) ˆ R 1 1 SST SST SST (1 R ) 55.36(1.7911) we have, ˆ 6.414 T K ( ) 6.5 y ye ye y e 1 1 Consider he erms e and xe b b x e y e b

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

Numerical Dispersion

Numerical Dispersion eview of Linear Numerical Sabiliy Numerical Dispersion n he previous lecure, we considered he linear numerical sabiliy of boh advecion and diffusion erms when approimaed wih several spaial and emporal

More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN Inernaional Journal of Scienific & Engineering Research, Volume 4, Issue 10, Ocober-2013 900 FUZZY MEAN RESIDUAL LIFE ORDERING OF FUZZY RANDOM VARIABLES J. EARNEST LAZARUS PIRIYAKUMAR 1, A. YAMUNA 2 1.

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

Approximation Algorithms for Unique Games via Orthogonal Separators

Approximation Algorithms for Unique Games via Orthogonal Separators Approximaion Algorihms for Unique Games via Orhogonal Separaors Lecure noes by Konsanin Makarychev. Lecure noes are based on he papers [CMM06a, CMM06b, LM4]. Unique Games In hese lecure noes, we define

More information

Lecture 2 April 04, 2018

Lecture 2 April 04, 2018 Sas 300C: Theory of Saisics Spring 208 Lecure 2 April 04, 208 Prof. Emmanuel Candes Scribe: Paulo Orensein; edied by Sephen Baes, XY Han Ouline Agenda: Global esing. Needle in a Haysack Problem 2. Threshold

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

HYPERSPECTRAL IMAGE CLASSIFICATIO BASED O A FAST BREGMA SPARSE MULTI OMIAL LOGISTIC REGRESSIO ALGORITHM. Portugal - (jun,

HYPERSPECTRAL IMAGE CLASSIFICATIO BASED O A FAST BREGMA SPARSE MULTI OMIAL LOGISTIC REGRESSIO ALGORITHM. Portugal - (jun, HYPERSPECRAL IMAGE CLASSIFICAIO BASED O A FAS BREGMA SPARSE MULI OMIAL LOGISIC REGRESSIO ALGORIHM J. Li a, *, J. Bioucas-Dias a, Anonio Plaza b a Insiuo de elecomunicações, Insiuo Superior écnico, 049-00,

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits DOI: 0.545/mjis.07.5009 Exponenial Weighed Moving Average (EWMA) Char Under The Assumpion of Moderaeness And Is 3 Conrol Limis KALPESH S TAILOR Assisan Professor, Deparmen of Saisics, M. K. Bhavnagar Universiy,

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

Experiments on logistic regression

Experiments on logistic regression Experimens on logisic regression Ning Bao March, 8 Absrac In his repor, several experimens have been conduced on a spam daa se wih Logisic Regression based on Gradien Descen approach. Firs, he overfiing

More information

Hidden Markov Models. Adapted from. Dr Catherine Sweeney-Reed s slides

Hidden Markov Models. Adapted from. Dr Catherine Sweeney-Reed s slides Hidden Markov Models Adaped from Dr Caherine Sweeney-Reed s slides Summary Inroducion Descripion Cenral in HMM modelling Exensions Demonsraion Specificaion of an HMM Descripion N - number of saes Q = {q

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Lecture 10: The Poincaré Inequality in Euclidean space

Lecture 10: The Poincaré Inequality in Euclidean space Deparmens of Mahemaics Monana Sae Universiy Fall 215 Prof. Kevin Wildrick n inroducion o non-smooh analysis and geomery Lecure 1: The Poincaré Inequaliy in Euclidean space 1. Wha is he Poincaré inequaliy?

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time. Supplemenary Figure 1 Spike-coun auocorrelaions in ime. Normalized auocorrelaion marices are shown for each area in a daase. The marix shows he mean correlaion of he spike coun in each ime bin wih he spike

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is UNIT IMPULSE RESPONSE, UNIT STEP RESPONSE, STABILITY. Uni impulse funcion (Dirac dela funcion, dela funcion) rigorously defined is no sricly a funcion, bu disribuion (or measure), precise reamen requires

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

Lecture 3: Exponential Smoothing

Lecture 3: Exponential Smoothing NATCOR: Forecasing & Predicive Analyics Lecure 3: Exponenial Smoohing John Boylan Lancaser Cenre for Forecasing Deparmen of Managemen Science Mehods and Models Forecasing Mehod A (numerical) procedure

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

On Multicomponent System Reliability with Microshocks - Microdamages Type of Components Interaction

On Multicomponent System Reliability with Microshocks - Microdamages Type of Components Interaction On Mulicomponen Sysem Reliabiliy wih Microshocks - Microdamages Type of Componens Ineracion Jerzy K. Filus, and Lidia Z. Filus Absrac Consider a wo componen parallel sysem. The defined new sochasic dependences

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Isolated-word speech recognition using hidden Markov models

Isolated-word speech recognition using hidden Markov models Isolaed-word speech recogniion using hidden Markov models Håkon Sandsmark December 18, 21 1 Inroducion Speech recogniion is a challenging problem on which much work has been done he las decades. Some of

More information

Chapter 15: Phenomena. Chapter 15 Chemical Kinetics. Reaction Rates. Reaction Rates R P. Reaction Rates. Rate Laws

Chapter 15: Phenomena. Chapter 15 Chemical Kinetics. Reaction Rates. Reaction Rates R P. Reaction Rates. Rate Laws Chaper 5: Phenomena Phenomena: The reacion (aq) + B(aq) C(aq) was sudied a wo differen emperaures (98 K and 35 K). For each emperaure he reacion was sared by puing differen concenraions of he 3 species

More information

Block Diagram of a DCS in 411

Block Diagram of a DCS in 411 Informaion source Forma A/D From oher sources Pulse modu. Muliplex Bandpass modu. X M h: channel impulse response m i g i s i Digial inpu Digial oupu iming and synchronizaion Digial baseband/ bandpass

More information

Internet Traffic Modeling for Efficient Network Research Management Prof. Zhili Sun, UniS Zhiyong Liu, CATR

Internet Traffic Modeling for Efficient Network Research Management Prof. Zhili Sun, UniS Zhiyong Liu, CATR Inerne Traffic Modeling for Efficien Nework Research Managemen Prof. Zhili Sun, UniS Zhiyong Liu, CATR UK-China Science Bridge Workshop 13-14 December 2011, London Ouline Inroducion Background Classical

More information

WATER LEVEL TRACKING WITH CONDENSATION ALGORITHM

WATER LEVEL TRACKING WITH CONDENSATION ALGORITHM WATER LEVEL TRACKING WITH CONDENSATION ALGORITHM Shinsuke KOBAYASHI, Shogo MURAMATSU, Hisakazu KIKUCHI, Masahiro IWAHASHI Dep. of Elecrical and Elecronic Eng., Niigaa Universiy, 8050 2-no-cho Igarashi,

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

not to be republished NCERT MATHEMATICAL MODELLING Appendix 2 A.2.1 Introduction A.2.2 Why Mathematical Modelling?

not to be republished NCERT MATHEMATICAL MODELLING Appendix 2 A.2.1 Introduction A.2.2 Why Mathematical Modelling? 256 MATHEMATICS A.2.1 Inroducion In class XI, we have learn abou mahemaical modelling as an aemp o sudy some par (or form) of some real-life problems in mahemaical erms, i.e., he conversion of a physical

More information

Chapter 3 Boundary Value Problem

Chapter 3 Boundary Value Problem Chaper 3 Boundary Value Problem A boundary value problem (BVP) is a problem, ypically an ODE or a PDE, which has values assigned on he physical boundary of he domain in which he problem is specified. Le

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization A Forward-Backward Spliing Mehod wih Componen-wise Lazy Evaluaion for Online Srucured Convex Opimizaion Yukihiro Togari and Nobuo Yamashia March 28, 2016 Absrac: We consider large-scale opimizaion problems

More information