Memory-Based Learning
(Instance-Based Learning / k-Nearest Neighbor)

Motivating Example: the Inductive Assumption
- Similar inputs map to similar outputs
  - If not true => learning is impossible
  - If true => learning reduces to defining "similar"
- Not all similarities are created equal
  - predicting a person's height may depend on different attributes than predicting their IQ

1-Nearest Neighbor
- Dist(c1, c2) = sqrt( Σ_i (attr_i(c1) − attr_i(c2))^2 )
- NearestNeighbor = MIN_j ( Dist(c_j, c_test) )
- prediction_test = class_j (or value_j)
- works well if there is no attribute noise, class noise, or class overlap
- can learn complex functions (sharp class boundaries)
- as the number of training cases grows large, the error rate of 1-NN is at most 2 times the Bayes optimal rate (i.e. the rate you would get if you knew the true probability of each class for each test case)
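A minimal NumPy sketch of 1-NN as defined above (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def dist(c1, c2):
    # Euclidean distance: sqrt of summed squared attribute differences
    return np.sqrt(np.sum((c1 - c2) ** 2))

def one_nn_predict(X_train, y_train, x_test):
    # Index j of the training case with minimum distance to the test case
    j = int(np.argmin([dist(x, x_test) for x in X_train]))
    # prediction_test = class_j (or value_j)
    return y_train[j]

# Toy usage: three 2-D training cases with class labels
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(one_nn_predict(X_train, y_train, np.array([4.0, 4.5])))  # -> 1
```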
k-Nearest Neighbor
- Dist(c1, c2) = sqrt( Σ_i (attr_i(c1) − attr_i(c2))^2 )
- k-NearestNeighbors = k-MIN( Dist(c_i, c_test) )
- prediction_test = (1/k) Σ_{i=1..k} class_i (or value_i)
- the average of k points is more reliable when:
  - there is noise in the attributes
  - there is noise in the class labels
  - classes partially overlap

How to choose k
- Large k:
  - less sensitive to noise (particularly class noise)
  - better probability estimates for discrete classes
  - larger training sets allow larger values of k
- Small k:
  - captures the fine structure of the problem space better
  - may be necessary with small training sets
- A balance must be struck between large and small k
- As the training set approaches infinity and k grows large, kNN becomes Bayes optimal

[Figure: scatter plots of attribute_1 vs attribute_2 showing class regions; from Hastie, Tibshirani & Friedman 2001, p. 418]
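A sketch of kNN prediction under the same definitions (names illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k):
    # Distances from the test case to every training case
    d = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # Indices of the k nearest neighbors (the k-MIN of the distances)
    nearest = np.argsort(d)[:k]
    # prediction_test = (1/k) * sum of neighbor classes (or values)
    return np.mean(y_train[nearest])
```

For 0/1 class labels the returned mean is also the kNN estimate of the probability of class 1, which is why larger k gives smoother probability estimates for discrete classes.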
[Figure: from Hastie, Tibshirani & Friedman 2001, p. 419]

Cross-Validation
- Models usually perform better on training data than on future test cases
  - 1-NN is 100% accurate on training data!
- Leave-one-out cross-validation (LOOCV):
  - remove each case one at a time
  - use it as a test case, with the remaining cases as the train set
  - average performance over all test cases
- LOOCV is impractical with most learning methods, but extremely efficient with MBL!

Distance-Weighted kNN
- the tradeoff between small and large k can be difficult
  - use a large k, but put more emphasis on nearer neighbors?
- prediction_test = ( Σ_i w_i * class_i (or value_i) ) / ( Σ_i w_i ), where w_i = 1 / Dist(c_i, c_test)
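Why LOOCV is so cheap for MBL: the pairwise distances are computed once, and "removing" a case just means ignoring its self-distance. A sketch of both ideas on this slide (names illustrative):

```python
import numpy as np

def loocv_error_1nn(X, y):
    # Pairwise distance matrix, computed once for the whole training set
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)   # a held-out case may not be its own neighbor
    nearest = d.argmin(axis=1)    # 1-NN of every held-out case, all at once
    return np.mean(y[nearest] != y)

def distance_weighted_predict(X_train, y_train, x_test, k):
    d = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-12)   # w_i = 1/Dist; epsilon avoids division by zero
    return np.sum(w * y_train[nearest]) / np.sum(w)
```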
Locally Weighted Averaging
- Let k = number of training points
- Let weight fall off rapidly with distance:
  prediction_test = ( Σ_i w_i * class_i (or value_i) ) / ( Σ_i w_i ), where w_i = 1 / e^( Dist(c_i, c_test) / KernelWidth )
- KernelWidth controls the size of the neighborhood that has a large effect on the value (analogous to k)

Locally Weighted Regression
- all the algorithms so far are strict averagers: they interpolate, but can't extrapolate
- do weighted regression, centered at the test point, with weight controlled by distance and KernelWidth
- the local regressor can be linear, quadratic, an n-th degree polynomial, a neural net, ...
- yields a piecewise approximation to a surface that typically is more complex than the local regressor

Euclidean Distance
- D(c1, c2) = sqrt( Σ_i (attr_i(c1) − attr_i(c2))^2 )
- gives all attributes equal weight?
  - only if the scales of the attributes and of their differences are similar
  - scale attributes to equal range or equal variance
- assumes spherical classes
- what if classes are not spherical?
- what if some attributes are more/less important than other attributes?
- what if some attributes have more/less noise in them than other attributes?

[Figures: scatter plots of attribute_1 vs attribute_2 showing spherical and non-spherical class shapes]
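A sketch of locally weighted averaging with the exponential weight above (names illustrative; the exact kernel form is one common choice):

```python
import numpy as np

def locally_weighted_average(X_train, y_train, x_test, kernel_width):
    d = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # w_i = 1 / e^(Dist / KernelWidth): weight falls off rapidly with distance,
    # so every training point contributes, but distant ones only negligibly
    w = np.exp(-d / kernel_width)
    return np.sum(w * y_train) / np.sum(w)
```

Note this is still a strict averager: the prediction is a convex combination of training values, so it can never fall outside their range.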
Weighted Euclidean Distance
- D(c1, c2) = sqrt( Σ_i w_i * (attr_i(c1) − attr_i(c2))^2 )
- large weights => attribute is more important
- small weights => attribute is less important
- zero weights  => attribute doesn't matter
- weights allow kNN to be effective with axis-parallel elliptical classes
- where do the weights come from?

Learning Attribute Weights
- scale attribute ranges or attribute variances to make them uniform (fast and easy)
- prior knowledge
- numerical optimization:
  - gradient descent, simplex methods, genetic algorithms
  - criterion is cross-validation performance
- Information Gain or Gain Ratio of single attributes

Information Gain
- Information Gain = reduction in entropy due to splitting on an attribute
- Entropy = expected number of bits needed to encode the class of a randomly drawn + or − example using the optimal info-theory coding:
  Entropy = − p+ log2 p+ − p− log2 p−

Splitting Rules
- Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) * Entropy(S_v)
- GainRatio(S, A) = Gain(S, A) / ( − Σ_{v ∈ Values(A)} (|S_v| / |S|) * log2 (|S_v| / |S|) )
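A sketch of these splitting rules for a single discrete attribute (names illustrative; inputs assumed to be NumPy arrays):

```python
import numpy as np

def entropy(labels):
    # Expected bits needed to encode the class of a randomly drawn example
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_and_gain_ratio(attr_values, labels):
    gain, split_info, n = entropy(labels), 0.0, len(labels)
    for v in np.unique(attr_values):
        sv = labels[attr_values == v]     # S_v: cases where attribute A = v
        frac = len(sv) / n                # |S_v| / |S|
        gain -= frac * entropy(sv)        # accumulates Gain(S, A)
        split_info -= frac * np.log2(frac)
    return gain, (gain / split_info if split_info > 0 else 0.0)
```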
Gain Ratio Correction Factor
[Figure: the gain-ratio denominator ("correction factor") for equal-sized n-way splits; x-axis: number of splits (0-50), y-axis: correction factor (0.00-6.00)]

GainRatio-Weighted Euclidean Distance
- D(c1, c2) = sqrt( Σ_i gain_ratio_i * (attr_i(c1) − attr_i(c2))^2 )
- weight with gain_ratio after scaling?

Booleans, Nominals, Ordinals, and Reals
Consider attribute value differences: (attr_i(c1) − attr_i(c2))
- Reals:    easy! full continuum of differences
- Integers: not bad: discrete set of differences
- Ordinals: not bad: discrete set of differences
- Booleans: awkward: Hamming distances of 0 or 1
- Nominals: not good! recode as Booleans?

Curse of Dimensionality
- as the number of dimensions increases, distances between points become larger and more uniform (see the demo below)
- if the number of relevant attributes is fixed, increasing the number of less relevant attributes may swamp the distance:
  D(c1, c2) = sqrt( Σ_{i ∈ relevant} (attr_i(c1) − attr_i(c2))^2 + Σ_{j ∈ irrelevant} (attr_j(c1) − attr_j(c2))^2 )
- when there are more irrelevant than relevant dimensions, distance becomes less reliable
- solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions
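A quick demo of the "larger and more uniform" effect (a sketch; the uniform random data is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((1000, dim))                      # uniform points in [0,1]^dim
    d = np.sqrt(((X[1:] - X[0]) ** 2).sum(axis=1))   # distances to one query point
    # As dim grows, the mean distance rises while std/mean shrinks:
    # all points look almost equally far away, so "nearest" means less
    print(dim, round(d.mean(), 2), round(d.std() / d.mean(), 3))
```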
Advantages of Memory-Based Methods
- lazy learning: don't do any work until you know what you want to predict (and from what variables!)
  - never need to learn a global model
  - many simple local models taken together can represent a more complex global model
  - better focused learning
  - handles missing values, time-varying distributions, ...
- very efficient cross-validation
- an intelligible learning method for many users
- nearest neighbors support explanation and training
- can use any distance metric: string-edit distance, ...

Weaknesses of Memory-Based Methods
- Curse of Dimensionality: often works best with 25 or fewer dimensions
- run-time cost scales with training set size
- large training sets will not fit in memory
- many MBL methods are strict averagers
- sometimes doesn't seem to perform as well as other methods such as neural nets
- predicted values for regression are not continuous

Combine KNN with ANN
- train a neural net on the problem
- use the outputs of the neural net, or its hidden unit activations, as new feature vectors for each point
- use KNN on the new feature vectors for prediction
- does feature selection and feature creation
- sometimes works better than KNN or ANN alone (a sketch follows after this slide)

Current Research in MBL
- condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10^6-10^12 cases
- learning better distance metrics
- feature selection
- overfitting, VC-dimension, ...
- MBL in higher dimensions
- MBL in non-numeric domains:
  - Case-Based Reasoning
  - Reasoning by Analogy
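A sketch of the KNN-on-ANN-features idea using scikit-learn (the library choice, synthetic dataset, and layer sizes are assumptions for illustration, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1. Train a neural net on the problem
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)

# 2. Use the net's outputs as new feature vectors for each point
#    (hidden unit activations would work similarly)
F_tr, F_te = net.predict_proba(X_tr), net.predict_proba(X_te)

# 3. Use KNN on the new feature vectors for prediction
knn = KNeighborsClassifier(n_neighbors=5).fit(F_tr, y_tr)
print(knn.score(F_te, y_te))
```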
References
- "Locally Weighted Learning" by Atkeson, Moore, and Schaal
- "Tuning Locally Weighted Learning" by Schaal, Atkeson, and Moore

Closing Thought
- In many supervised learning problems, all the information you will ever have about the problem is in the training set.
- Why do most learning methods discard the training data after learning?
- Do neural nets, decision trees, and Bayes nets capture all the information in the training set when they are trained?
- In the future, we'll see more methods that combine MBL with these other learning methods:
  - to improve accuracy
  - for better explanation
  - for increased flexibility