Non-parametric Techniques: Instance Based Learning
AKA: nearest neighbor methods, non-parametric, lazy, memory-based, or case-based learning
Copyright 2005 by David Helmbold

- Do not fit a model (as do LTUs, decision trees, etc.)
- Includes nearest neighbor and density estimation methods
- Variable-sized hypothesis space, like decision trees
- Lazy hypothesis: no gradient descent, optimization, or search (under "lazy" in Weka)

Nearest Neighbor Algorithm
- Instances (the x's) are vectors of reals
- Store the n training examples (x_1, y_1), ..., (x_n, y_n)
- To predict on a new x, find the x_i closest to x and predict with y_i
- Comments: not just a simple table lookup; can avoid square roots by minimizing the squared distance

Decision Boundaries
- Voronoi diagram: very flexible, gets more complicated with additional points

Nearest Neighbor Applications
- Astronomy (classifying objects)
- Medicine (diagnosis)
- Object detection
- Character recognition (shape matching)
- Many others (basic theory from the 1950s and 60s)

Distance metric is important
- Consider expensive houses with features:
  - Number of bedrooms (1 to 5)
  - Lot size in acres (1/6 to 1/2, plus a tail)
  - House square feet (1200 to 3000)
- The difference in square feet dominates the distance
- Irrelevant attributes (e.g. how far away was the owner born?) add variability
- Correlated attributes are also bad
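To make the basic algorithm concrete, here is a minimal sketch (not from the original slides) of 1-nearest-neighbor prediction in Python with NumPy; the function name `nn_predict` and the toy house data are our own, and squared Euclidean distance is used so the square root can be skipped:

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """Predict the label of x using the single nearest training example.

    Squared Euclidean distance is monotone in the true distance,
    so the square root can be skipped.
    """
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distance to every stored example
    return y_train[np.argmin(d2)]             # label of the closest one

# Toy data: (bedrooms, lot acres, square feet).  The raw square-feet differences
# dominate the distance, illustrating why the metric (and rescaling) matters.
X = np.array([[3, 0.25, 1400.0],
              [4, 0.30, 2800.0],
              [2, 0.20, 1300.0]])
y = np.array([0, 1, 0])
print(nn_predict(X, y, np.array([3, 0.40, 1500.0])))
```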
Irrelevant attribute example
- Let $x_1 \in [0,1]$ determine the class: $y = 1$ iff $x_1 > 0.3$
- Consider predicting on (0, 0) given the data (0.1, x_2) labeled 0 and (0.5, x_2') labeled 1, where x_2 and x_2' are random draws from [0,1]
- Chance of error is about 15%!

Some tricks
- Rescale attributes to mean 0, variance 1
- Use a weight $w_j$ on the j-th component: $\mathrm{Dist}(x, x') = \sum_j w_j (x_j - x'_j)^2$, with $w_j = I(x_j; y)$ (mutual information)
- Mahalanobis distance (covariance $\Sigma$, like LDA): $\mathrm{Dist}(x, x') = (x - x')^T \Sigma^{-1} (x - x')$

Curse of Dimensionality
- As the number of attributes d goes up, so does the volume
- Consider 1000 training points in $[0,1]^d$: where does each point predict?
  - When d = 1, interval per point ~ 0.001
  - When d = 2, area per point ~ 0.001, length of side about 0.032
  - When d = 10, volume per point ~ 0.001, length of side ~ 0.5
- Need exponentially many points (in d) to get good coverage

K-d trees
- Greatly speed up finding the nearest neighbor
- Like a binary search tree, but organized around dimensions
- Each node tests a single dimension against a threshold (the median)
- Can use the highest-variance dimension or cycle through the dimensions
- Growing a good k-d tree can be expensive

Noise can cause problems: an example
- Assume the true labels are always 1, but noise randomly corrupts labels 10% of the time (making them 0)
- Bayes optimal: predict 1; test error is 10%
- Nearest neighbor: use the closest training point
  - 90% of the time predict 1, and 10% of these predictions are wrong
  - 10% of the time predict 0, and 90% of these predictions are wrong
- Overall wrong 18% of the time (see the simulation sketch below)
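A quick simulation (our own sketch, not part of the slides) reproduces the 18% figure: with 10% label noise, 1-NN errs whenever exactly one of the test label and its nearest neighbor's stored label has been flipped, which happens with probability 0.1·0.9 + 0.9·0.1 = 0.18:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.10, 1_000_000          # noise rate and number of simulated test points

# True label is always 1; noise independently flips the stored training label
# and the observed test label to 0 with probability p.
train_label = (rng.random(n) > p).astype(int)
test_label = (rng.random(n) > p).astype(int)

# 1-NN copies the nearest training label, so it errs whenever the two disagree.
print(np.mean(train_label != test_label))   # ~0.18, vs. 0.10 for always predicting 1
```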
K-nearest neighbor
- Algorithm: find the closest k points and predict with their majority vote
- k-NN is Bayes optimal in the limit as k and the training set size go to infinity (with k growing more slowly than the training set size); known since the 1960s

Edited Nearest Neighbor
- Key idea: reduce memory and computation by only storing important points
- Heuristic: discard the points correctly predicted by the others (or keep the incorrectly predicted points)
- The remaining points are concentrated on the decision boundary
- Finding a smallest subset of points that correctly labels the others is NP-complete

Instance Based Density Estimation
- Histogram method: break the instance space X into bins and use the sample falling into each bin to estimate probabilities
- The histogram method is parametric, not instance based
- It has edge effects

Smoother method
- Add a slice of probability to an area centered at each example, rather than to a predetermined bin
- In general, a kernel function tells how the probability is added (see Duda and Hart); Gaussians are common
- Also called Parzen windows
- Often a width parameter controls the smoothing (like σ for Gaussians); see the sketch below

[Figures: a distribution and a sample from it; density estimates using bins, disks, and Gaussian kernels with σ = 2, 5, and 30]
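As a minimal sketch of the Parzen-window idea (ours, in one dimension, with a Gaussian kernel; the function name and sample data are illustrative only): each sample point contributes a small Gaussian bump of width sigma, and the density estimate is the average of the bumps:

```python
import numpy as np

def parzen_density(x_query, sample, sigma):
    """Gaussian Parzen-window density estimate at the query points (1-D sketch)."""
    z = (x_query[None, :] - sample[:, None]) / sigma             # (n_sample, n_query)
    bumps = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))
    return bumps.mean(axis=0)                                    # average bump height

sample = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=200)
xs = np.linspace(-2.0, 12.0, 8)
for sigma in (0.2, 1.0, 5.0):        # the width parameter controls the smoothing
    print(sigma, np.round(parzen_density(xs, sample, sigma), 3))
```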
Final points
- Can also use smoothed nearest neighbor (the whole sample votes on predictions, with weights depending on the distance to the new point), but this may be computationally expensive
- Can fix k and use the distance to the k-th nearest point to estimate the density
- Might use cross validation to estimate the smoothing parameter
- Can use density estimation for P(x | class) and then predict class labels using Bayes rule

Nonparametric Regression
- Aka smoothing models
- Regressogram:
  $$\hat{g}(x) = \frac{\sum_t b(x, x^t)\, r^t}{\sum_t b(x, x^t)}, \qquad b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin as } x \\ 0 & \text{otherwise} \end{cases}$$

[Figures: regressogram examples; h is the bin size]

Running Mean / Kernel Smoother
- Running mean smoother:
  $$\hat{g}(x) = \frac{\sum_t w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_t w\!\left(\frac{x - x^t}{h}\right)}, \qquad w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
- Running line smoother (locally linear)
- Kernel smoother (see the sketch below):
  $$\hat{g}(x) = \frac{\sum_t K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_t K\!\left(\frac{x - x^t}{h}\right)}$$
  where K(·) is Gaussian
- Additive models (Hastie and Tibshirani, 1990)
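The kernel smoother formula above translates almost directly into code. Here is a one-dimensional sketch (ours, not from the slides) using a Gaussian kernel, whose normalizing constant cancels in the ratio:

```python
import numpy as np

def kernel_smoother(x_query, x_train, r_train, h):
    """Kernel smoother: g(x) = sum_t K((x - x^t)/h) r^t / sum_t K((x - x^t)/h)."""
    u = (x_query[None, :] - x_train[:, None]) / h     # (n_train, n_query)
    K = np.exp(-0.5 * u ** 2)                         # Gaussian kernel; constants cancel
    return (K * r_train[:, None]).sum(axis=0) / K.sum(axis=0)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 10.0, 50))
r = np.sin(x) + rng.normal(scale=0.3, size=x.shape)   # noisy regression targets
xs = np.linspace(0.0, 10.0, 5)
print(kernel_smoother(xs, x, r, h=0.5))               # small h: wiggly; large h: smooth
```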
How to Choose k or h?
- When k or h is small, single instances matter; bias is small, variance is large (undersmoothing): high complexity
- As k or h increases, we average over more instances; variance decreases but bias increases (oversmoothing): low complexity
- Cross-validation is used to fine-tune k or h (see the sketch below)

Tree and NN comparison

                    Decision Trees        Nearest Neighbor
  Model             Trees - flexible      Instance based, flexible
  Data              Mixed                 Usually numeric
  Interpretable     If small tree         Only in 1 or 2 dimensions
  Missing values    Tricks                Training set: no, but OK for test points
  Noise/outliers    Good with pruning     Good with k-NN

Tree and kNN robustness

                            Decision tree     Nearest neighbor
  Monotone transformation   Great             Very bad
  Irrelevant features       Fair              Very bad
  Computation time          OK                Lazy - expensive
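To illustrate the cross-validation point above, here is a rough sketch (ours; the helper names and synthetic data are made up for illustration) that picks k for k-NN by comparing cross-validated error rates:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote of the k nearest training points (squared Euclidean distance)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    return np.bincount(y_train[np.argsort(d2)[:k]]).argmax()

def cv_error(X, y, k, folds=5):
    """Simple k-fold cross-validated error rate for one value of k."""
    fold_of = np.arange(len(X)) % folds
    errs = []
    for f in range(folds):
        tr, te = fold_of != f, fold_of == f
        preds = np.array([knn_predict(X[tr], y[tr], x, k) for x in X[te]])
        errs.append(np.mean(preds != y[te]))
    return float(np.mean(errs))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)   # noisy linear boundary
for k in (1, 3, 7, 15):
    print(k, cv_error(X, y, k))   # choose the k with the smallest CV error
```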