Eight more Classic Machine Learning algorithms


Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Eight more Classic Machine Learning algorithms
Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University, awm@cs.cmu.edu
Copyright 2001, Andrew W. Moore

8: Polynomial Regression
So far we've mainly been dealing with linear regression. Given records with input vectors x_k = (x_k1, x_k2) and outputs y_k, stack the inputs into a matrix X and the outputs into a vector y, and prepend a constant 1 to each input to get z_k = (1, x_k1, x_k2), stacked as the matrix Z. Then b = (Z^T Z)^{-1} (Z^T y) and the prediction is y_est = β_0 + β_1 x_1 + β_2 x_2. [The slide works through a small table of X_1, X_2, Y values.]

Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions. Keep the same data, but build each row of Z from the quadratic expansion z = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2). As before b = (Z^T Z)^{-1} (Z^T y), and now y_est = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1^2 + β_4 x_1 x_2 + β_5 x_2^2.

Quadratic Regression (continued)
Each component of a z vector is called a term. Each column of the Z matrix is called a term column. How many terms are there in a quadratic regression with m inputs?
- 1 constant term
- m linear terms
- (m+1)-choose-2 = m(m+1)/2 quadratic terms
- (m+2)-choose-2 terms in total, which is O(m^2)
Note that solving b = (Z^T Z)^{-1} (Z^T y) is thus O(m^6).
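As a concrete sketch of this recipe (not from the slides; NumPy, with made-up toy data and illustrative names):

```python
import numpy as np

def quadratic_design_matrix(X):
    """Expand each row (x1, x2) into the terms (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# Made-up toy data: y is roughly quadratic in two inputs, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.1, size=50)

Z = quadratic_design_matrix(X)
# b = (Z^T Z)^{-1} (Z^T y); lstsq solves the same least-squares problem more stably
# than forming the inverse explicitly.
b, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_est = Z @ b   # fitted values on the training inputs
```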

Q-th degree Polynomial Regression
Same trick again: each row of Z is z = (all products of powers of the inputs in which the sum of the powers is Q or less), b = (Z^T Z)^{-1} (Z^T y), and y_est = β_0 + β_1 x_1 + ...

m inputs, degree Q: how many terms?
= the number of unique terms of the form x_1^{q_1} x_2^{q_2} ... x_m^{q_m} where Σ_{i=1}^m q_i ≤ Q
= the number of unique terms of the form x_0^{q_0} x_1^{q_1} ... x_m^{q_m} where Σ_{i=0}^m q_i = Q
= the number of lists of non-negative integers [q_0, q_1, q_2, ..., q_m] in which Σ q_i = Q
= the number of ways of placing Q red disks on a row of squares of length Q+m
= (Q+m)-choose-Q
[The slide illustrates the disk-placement argument with one concrete list of exponents for m = 4.]
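A quick check of the term count above, using Python's math.comb (the example calls are illustrative):

```python
from math import comb

def num_polynomial_terms(m, Q):
    """Number of terms in an m-input, degree-Q polynomial regression: (Q+m) choose Q."""
    return comb(Q + m, Q)

print(num_polynomial_terms(2, 2))   # 6: the quadratic terms 1, x1, x2, x1^2, x1*x2, x2^2
print(num_polynomial_terms(4, 3))   # 35
```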

7: Radial Basis Functions (RBFs)
Same recipe once more: each row of Z is z = (list of radial basis function evaluations of the input), b = (Z^T Z)^{-1} (Z^T y), and y_est = β_0 + β_1 φ_1(x) + ...

1-d RBFs
With basis-function centers c_1, c_2, c_3 placed along the input axis,
y_est = β_1 φ_1(x) + β_2 φ_2(x) + β_3 φ_3(x), where φ_i(x) = KernelFunction(|x - c_i| / KW).
[The slide shows a 1-d dataset with the three bumps φ_1, φ_2, φ_3 drawn beneath it.]

Example
A particular fitted combination of the three basis functions, e.g. y_est = β_1 φ_1(x) + β_2 φ_2(x) + 0.5 φ_3(x), plotted over the data, where φ_i(x) = KernelFunction(|x - c_i| / KW).

RBFs with Linear Regression
y_est = β_1 φ_1(x) + β_2 φ_2(x) + β_3 φ_3(x), where φ_i(x) = KernelFunction(|x - c_i| / KW).
All the c_i's are held constant (initialized randomly or on a grid in m-dimensional input space). KW is also held constant (initialized to be large enough that there's decent overlap between basis functions; usually much better than the crappy overlap on my diagram).

RBFs with Linear Regression (continued)
Given Q basis functions, define the matrix Z such that Z_kj = KernelFunction(|x_k - c_j| / KW), where x_k is the k-th vector of inputs. And as before, b = (Z^T Z)^{-1} (Z^T y).
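A sketch of RBFs with linear regression in NumPy, assuming a Gaussian kernel, centers on a fixed grid, and a fixed KW (the kernel choice and the data are illustrative assumptions, not from the slides):

```python
import numpy as np

def rbf_design_matrix(X, centers, kw):
    """Z[k, j] = KernelFunction(|x_k - c_j| / KW); here the kernel is a Gaussian bump."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-0.5 * (dists / kw) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(80, 1))                 # toy 1-d inputs
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=80)

centers = np.linspace(0, 10, 8)[:, None]             # c_j held fixed on a grid
kw = 2.0                                             # KW held fixed, wide enough to overlap
Z = rbf_design_matrix(X, centers, kw)
Z = np.column_stack([np.ones(len(X)), Z])            # include a constant term as well
b, *_ = np.linalg.lstsq(Z, y, rcond=None)            # b = (Z^T Z)^{-1} (Z^T y)
y_est = Z @ b
```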

RBFs with Nonlinear Regression
Now allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space), and allow KW to adapt to the data as well. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.) But how do we now find all the β_j's, c_i's and KW?
Answer: Gradient Descent.
(But I'd like to see, or hope someone's already done, a hybrid, where the c_i's and KW are updated with gradient descent while the β_j's use matrix inversion.)

Radial Basis Functions in 2-d
Two inputs. Outputs (heights sticking out of the page) not shown. [Figure labels: a center, and the sphere of significant influence of that center.]

Happy RBFs in 2-d
Blue dots denote coordinates of input vectors. [Here the centers and their spheres of influence cover the data well.]

Crabby RBFs in 2-d
Blue dots denote coordinates of input vectors. What's the problem in this example?

More crabby RBFs
And what's the problem in this example?

Hopeless!
Even before seeing the data, you should understand that this is a disaster!

Unhappy
Even before seeing the data, you should understand that this isn't good either.

6: Robust Regression
[A 1-d dataset containing a few gross outliers.]

Robust Regression
This is the best fit that Quadratic Regression can manage...

Robust Regression
...but this is what we'd probably prefer.

LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted: "You are a very good datapoint." "You are not too shabby." "But you are pathetic."

Robust Regression
For k = 1 to R:
- Let (x_k, y_k) be the k-th datapoint.
- Let y_k^est be the predicted value of y_k.
- Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly: w_k = KernelFn([y_k - y_k^est]^2).
Then redo the regression using weighted datapoints. (I taught you how to do this in the Instance-based lecture; only then the weights depended on distance in input-space.)
Guess what happens next?

Repeat the whole thing until it converges!
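A sketch of this reweight-and-refit loop in NumPy, assuming a quadratic fit and a Gaussian kernel on the residuals whose width is set from the median absolute residual (both are assumptions, not prescribed by the slides):

```python
import numpy as np

def fit_weighted(Z, y, w):
    """Weighted least squares: minimize sum_k w_k (y_k - z_k . b)^2."""
    sw = np.sqrt(w)
    b, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return b

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=100)
y = 1 + 2 * x - 0.5 * x ** 2 + rng.normal(0, 0.2, size=100)
y[:5] += 20                                  # a few gross outliers

Z = np.column_stack([np.ones_like(x), x, x ** 2])
w = np.ones_like(y)                          # the first fit is an ordinary unweighted regression
for _ in range(10):                          # repeat the whole thing until (roughly) converged
    b = fit_weighted(Z, y, w)
    resid = y - Z @ b
    scale = 2 * np.median(np.abs(resid)) + 1e-12
    w = np.exp(-0.5 * (resid / scale) ** 2)  # w_k = KernelFn([y_k - y_k^est]^2): small for bad fits
```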

Robust Regression: what we're doing
What regular regression does: assume y_k was originally generated using the following recipe: y_k = β_0 + β_1 x_k + β_2 x_k^2 + N(0, σ^2). The computational task is to find the Maximum Likelihood β_0, β_1 and β_2.

What LOESS robust regression does: assume y_k was originally generated using the following recipe: with probability p, y_k = β_0 + β_1 x_k + β_2 x_k^2 + N(0, σ^2); but otherwise y_k ~ N(µ, σ_huge^2). The computational task is to find the Maximum Likelihood β_0, β_1, β_2, p, µ and σ_huge.
Mysteriously, the reweighting procedure does this computation for us. Your first glimpse of two spectacular letters: E.M.

5: Regression Trees
Decision trees for regression.

A regression tree leaf
Predict age = 47 (the mean age of records matching this leaf node).

A one-split regression tree
Split on Gender: the Female leaf predicts the mean age of the matching female records, and the Male leaf predicts the mean age of the matching male records.

Choosing the attribute to split on
[A small table of example records with attributes Gender, Rich?, Num. Children, Num. Beanie Babies and output Age.]
We can't use information gain. What should we use?

Choosing the attribute to split on
MSE(Y | X_i) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X_i value. If we're told x_i = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x_i = j. Call this mean quantity µ_{x_i = j}. Then

MSE(Y | X_i) = (1/R) Σ_{j=1}^{N_{X_i}} Σ_{k such that x_ki = j} (y_k - µ_{x_i = j})^2.

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y | X). Guess what we do about real-valued inputs? Guess how we prevent overfitting?
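A sketch of scoring candidate split attributes by MSE(Y | X), assuming categorical attributes held in NumPy arrays (the toy records are made up for illustration):

```python
import numpy as np

def mse_given_attribute(x, y):
    """MSE(Y | X): squared error of predicting each record's y by the mean y
    among records that share its x value, averaged over all R records."""
    total = 0.0
    for value in np.unique(x):
        ys = y[x == value]
        total += np.sum((ys - ys.mean()) ** 2)
    return total / len(y)

# Made-up records; pick the attribute whose split minimizes MSE(Y | X).
gender = np.array(["F", "M", "M", "F", "M"])
rich = np.array(["N", "N", "Y", "Y", "N"])
age = np.array([38.0, 24.0, 72.0, 31.0, 25.0])

scores = {"Gender": mse_given_attribute(gender, age),
          "Rich?": mse_given_attribute(rich, age)}
best = min(scores, key=scores.get)           # greedy choice of split attribute
```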

Pruning Decision
Suppose the node "property-owner = Yes" has been split on Gender, giving a Female leaf and a Male leaf, each predicting the mean age of its matching property-owning records. Does the split deserve to live? Compare the two populations (their counts, mean ages and age standard deviations): use a standard Chi-squared test of the null hypothesis that these two populations have the same mean, and Bob's your uncle.

Linear Regression Trees
Also known as Model Trees. Leaves contain linear functions rather than constants, e.g. "Predict age = β_0 + β_1 · NumChildren + β_2 · YearsEducation", trained using linear regression on all records matching that leaf. The split attribute is chosen to minimize the MSE of the regressed children. Pruning uses a different Chi-squared test.

Linear Regression Trees (detail)
During the leaf regressions you typically ignore any categorical attribute that has been tested higher up in the tree, but use all untested attributes, and use real-valued attributes even if they've been tested above.

Test your understanding
Assuming regular regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?

Test your understanding
Assuming linear regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?

4: GMDH (c.f. BACON, AIM)
GMDH = Group Method Data Handling. A very simple but very good idea:
1. Do linear regression.
2. Use cross-validation to discover whether any quadratic term is good. If so, add it as a basis function and loop.
3. Use cross-validation to discover whether any of a set of familiar functions (log, exp, sin etc.) applied to any previous basis function helps. If so, add it as a basis function and loop.
4. Else stop.
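A rough sketch of this grow-the-basis loop, assuming NumPy, k-fold cross-validated squared error, and a small set of candidate transforms; real GMDH implementations differ in many details, and all names and thresholds here are illustrative:

```python
import numpy as np

def cv_mse(Z, y, folds=5):
    """K-fold cross-validated squared error of a linear fit on basis matrix Z."""
    idx = np.arange(len(y))
    err = 0.0
    for f in range(folds):
        test = (idx % folds) == f
        b, *_ = np.linalg.lstsq(Z[~test], y[~test], rcond=None)
        err += np.sum((y[test] - Z[test] @ b) ** 2)
    return err / len(y)

rng = np.random.default_rng(3)
X = rng.uniform(0.1, 3, size=(120, 2))
y = X[:, 0] ** 2 + np.log(X[:, 1]) + rng.normal(0, 0.05, size=120)

basis = [np.ones(len(y)), X[:, 0], X[:, 1]]        # step 1: plain linear regression
best = cv_mse(np.column_stack(basis), y)
improved = True
while improved:
    improved = False
    candidates = []
    for i in range(1, len(basis)):
        for j in range(i, len(basis)):
            candidates.append(basis[i] * basis[j])  # step 2: candidate quadratic terms
        for f in (np.log, np.exp, np.sin):          # step 3: familiar functions of old terms
            if f is np.log and np.any(basis[i] <= 0):
                continue
            if f is np.exp and np.max(np.abs(basis[i])) > 20:
                continue
            candidates.append(f(basis[i]))
    for cand in candidates:
        score = cv_mse(np.column_stack(basis + [cand]), y)
        if score < 0.99 * best:                     # keep it only if CV clearly improves
            basis.append(cand)
            best = score
            improved = True
            break
```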

GMDH (c.f. BACON, AIM)
A typical learned function has the form: age_est = height − c_1 · sqrt(weight) + c_2 · income / cos(NumCars), for some fitted constants c_1, c_2.

Cascade Correlation
A super-fast form of Neural Network learning:
- Begins with 0 hidden units.
- Incrementally adds hidden units, one by one, doing ingeniously little recomputation between each unit addition.

Cascade beginning
Begin with a regular dataset. Non-standard notation: X^(i) is the i-th attribute; x_k^(i) is the value of the i-th attribute in the k-th record. The data form columns X^(0), X^(1), ..., X^(m) and Y.

Cascade first step
Find weights w_i^(0) to best fit Y, i.e. to minimize
Σ_{k=1}^R (y_k − y_k^(0))^2, where y_k^(0) = Σ_j w_j^(0) x_k^(j).

Consider our errors
Define e_k^(0) = y_k − y_k^(0). This adds columns Y^(0) and E^(0) alongside the original data.

Create a hidden unit
Find weights u_i^(0) to define a new basis function H^(0)(x) of the inputs. Make it specialize in predicting the errors in our original fit: find {u_i^(0)} to maximize the correlation between H^(0)(x_k) and E^(0), where
H^(0)(x) = g(Σ_j u_j^(0) x^(j)).
This adds a column H^(0) with values h_k^(0).

Cascade next step
Find weights w_i^(1) and p_0^(1) to better fit Y, i.e. to minimize
Σ_{k=1}^R (y_k − y_k^(1))^2, where y_k^(1) = Σ_j w_j^(1) x_k^(j) + p_0^(1) h_k^(0).

Now look at the new errors
Define e_k^(1) = y_k − y_k^(1), giving columns Y^(1) and E^(1).

Create the next hidden unit
Find weights u_i^(1) and v_0^(1) to define a new basis function H^(1)(x) of the inputs. Make it specialize in predicting the errors in our previous fit: find {u_i^(1), v_0^(1)} to maximize the correlation between H^(1)(x_k) and E^(1), where
H^(1)(x) = g(Σ_j u_j^(1) x^(j) + v_0^(1) h^(0)).

Cascade n-th step
Find weights w_i^(n) and p_j^(n) to better fit Y, i.e. to minimize
Σ_{k=1}^R (y_k − y_k^(n))^2, where y_k^(n) = Σ_{j=1}^m w_j^(n) x_k^(j) + Σ_j p_j^(n) h_k^(j),
the second sum running over the hidden units built so far.

Now look at the new errors
Define e_k^(n) = y_k − y_k^(n).

Create the n-th hidden unit
Find weights u_i^(n) and v_j^(n) to define a new basis function H^(n)(x) of the inputs. Make it specialize in predicting the errors in our previous fit: find {u_i^(n), v_j^(n)} to maximize the correlation between H^(n)(x_k) and E^(n), where
H^(n)(x) = g(Σ_{j=1}^m u_j^(n) x^(j) + Σ_j v_j^(n) h^(j)),
again with the second sum over the previously built hidden units.
Continue until satisfied with the fit.
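A rough sketch of the cascade idea, assuming NumPy, tanh hidden units, and plain gradient ascent on the covariance between a candidate unit and the current errors (a stand-in for the correlation objective); real Cascade-Correlation trains a pool of candidates with Quickprop, so this is only a schematic:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(0, 0.05, size=200)

def refit_output(Z, y):
    """Output weights: plain least squares on the current columns (inputs + frozen hidden units)."""
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w

Z = np.column_stack([np.ones(len(y)), X])        # start with just a bias and the raw inputs
for n in range(3):                               # add hidden units one by one
    w = refit_output(Z, y)
    e = y - Z @ w                                # current errors E^(n)
    u = rng.normal(0, 0.1, size=Z.shape[1])      # candidate unit sees inputs and earlier hidden units
    for _ in range(500):                         # gradient ascent on |cov(H^(n), E^(n))|
        h = np.tanh(Z @ u)
        cov = np.dot(h - h.mean(), e - e.mean())
        grad = ((e - e.mean()) * np.sign(cov) * (1 - h ** 2)) @ Z
        u += 0.01 * grad / len(y)
    Z = np.column_stack([Z, np.tanh(Z @ u)])     # freeze the unit; its output is a new input column
w = refit_output(Z, y)                           # final output-layer fit
y_est = Z @ w
```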

Visualizing the first iteration
Visualizing the second iteration
Visualizing the third iteration
[Figures showing the fit after each iteration.]

Example: Cascade Correlation for Classification

Training two spirals
[Figures showing the learned decision regions on the two-spirals problem at earlier and later training steps.]

If you like Cascade Correlation, see also:
- Projection Pursuit, in which you add together many non-linear non-parametric scalar functions of carefully chosen directions. Each direction is chosen to maximize the error-reduction from the best scalar function.
- AdaBoost: an additive combination of regression trees in which the (n+1)-th tree learns the error of the n-th tree.

Multilinear Interpolation
Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.

Multilinear Interpolation
Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data, q_1, q_2, q_3, q_4, q_5.

Multilinear Interpolation
We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. [The slide shows several such piecewise-linear functions, none of which fits the data well.]

How to find the best fit?
Idea: simply perform a separate regression in each segment, for each part of the curve. What's the problem with this idea? (The pieces wouldn't join up at the knots, so the fit wouldn't be continuous.)

How to find the best fit?
Let's look at what goes on in the red segment between q_2 and q_3, with heights h_2 and h_3 at those knots:
y_est(x) = h_2 (q_3 − x)/w + h_3 (x − q_2)/w, where w = q_3 − q_2.

Equivalently, in the red segment,
y_est(x) = h_2 f_2(x) + h_3 f_3(x), where f_2(x) = 1 − |x − q_2|/w and f_3(x) = 1 − |x − q_3|/w.
[The slide builds up plots of the triangular basis functions φ_2(x) and φ_3(x) centered at q_2 and q_3.]

How to find the best fit?
In general,
y_est(x) = Σ_{i=1}^{N_K} h_i f_i(x), where f_i(x) = 1 − |x − q_i|/w if |x − q_i| < w, and 0 otherwise.
And this is simply a basis function regression problem! We know how to find the least-squares h_i's.
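A sketch of the 1-d case in NumPy, with equally spaced knots and the triangular f_i defined above (the data and knot count are illustrative):

```python
import numpy as np

def hat_basis(x, knots):
    """f_i(x) = 1 - |x - q_i| / w where that is positive, and 0 otherwise."""
    w = knots[1] - knots[0]                      # equal knot spacing
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / w, 0.0, None)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, size=120))
y = np.sin(x) + 0.3 * x + rng.normal(0, 0.1, size=120)

knots = np.linspace(0, 10, 6)                    # q_1 ... q_6 covering the data
F = hat_basis(x, knots)
h, *_ = np.linalg.lstsq(F, y, rcond=None)        # least-squares knot heights h_i
y_est = F @ h                                    # continuous, piecewise-linear fit
```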

In two dimensions
Blue dots show locations of input vectors (outputs not depicted).

In two dimensions
Each purple dot is a knot point. It will contain the height of the estimated surface. But how do we do the interpolation to ensure that the surface is continuous?

In two dimensions
To predict the value at a query point: first interpolate its value on two opposite edges of the cell containing it, then interpolate between those two values.
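A sketch of that two-stage interpolation for a single query point, assuming NumPy and an already-regressed grid of knot heights (all names and values are illustrative):

```python
import numpy as np

def predict_bilinear(x1, x2, q1, q2, H):
    """q1, q2: knot coordinates along each axis; H[i, j]: fitted height at knot (q1[i], q2[j])."""
    i = int(np.clip(np.searchsorted(q1, x1) - 1, 0, len(q1) - 2))
    j = int(np.clip(np.searchsorted(q2, x2) - 1, 0, len(q2) - 2))
    t1 = (x1 - q1[i]) / (q1[i + 1] - q1[i])              # position within the cell along axis 1
    t2 = (x2 - q2[j]) / (q2[j + 1] - q2[j])              # position within the cell along axis 2
    lo = (1 - t1) * H[i, j] + t1 * H[i + 1, j]           # interpolate along one edge of the cell
    hi = (1 - t1) * H[i, j + 1] + t1 * H[i + 1, j + 1]   # interpolate along the opposite edge
    return (1 - t2) * lo + t2 * hi                       # then interpolate between those two values

q1 = np.linspace(0.0, 1.0, 4)
q2 = np.linspace(0.0, 1.0, 4)
H = np.arange(16.0).reshape(4, 4)                        # stand-in for regressed knot heights
print(predict_bilinear(0.5, 0.25, q1, q2, H))
```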

In two dimensions: notes
1. This can easily be generalized to m dimensions.
2. It should be easy to see that it ensures continuity.
3. The patches are not linear.

Doing the regression
Given data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.) What's the problem in higher dimensions?

MARS: Multivariate Adaptive Regression Splines
Invented by Jerry Friedman (one of Andrew's heroes). Simplest version: let's assume the function we are learning is of the form
y_est(x) = Σ_{k=1}^m g_k(x_k).
Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

MARS
Idea: each g_k is a piecewise-linear function that can only bend at a set of knots q_1, ..., q_{N_K} in its own dimension. Concretely,
y_est(x) = Σ_{k=1}^m Σ_{j=1}^{N_K} h_j^k f_j^k(x_k), where f_j^k(x) = 1 − |x − q_j^k|/w_k if |x − q_j^k| < w_k, and 0 otherwise.
Here q_j^k is the location of the j-th knot in the k-th dimension, h_j^k is the regressed height of the j-th knot in the k-th dimension, and w_k is the spacing between knots in the k-th dimension.

That's not complicated enough!
Okay, now let's get serious. We'll allow arbitrary two-way interactions:
y_est(x) = Σ_{k=1}^m g_k(x_k) + Σ_{k=1}^m Σ_{t=k+1}^m g_{kt}(x_k, x_t).
The function we're learning is allowed to be a sum of non-linear functions over all one-d and two-d subsets of attributes. It can still be expressed as a linear combination of basis functions, and is thus learnable by linear regression.
Full MARS: uses cross-validation to choose a subset of subspaces, the knot resolution and other parameters.
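A sketch of the simplest (purely additive) version above, assuming NumPy, equally spaced knots in every dimension, and a joint least-squares fit of all the knot heights (data and knot placement are illustrative):

```python
import numpy as np

def hat_basis(x, knots):
    """Triangular basis f_j(x) = 1 - |x - q_j| / w where positive, 0 otherwise."""
    w = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / w, 0.0, None)

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * (X[:, 1] - 5) ** 2 + rng.normal(0, 0.1, size=200)

# One set of knots per input dimension; stack every dimension's basis into one design matrix,
# so all the knot heights h_j^k are fitted jointly by a single linear regression.
knots = [np.linspace(0, 10, 7), np.linspace(0, 10, 7)]
Z = np.column_stack([hat_basis(X[:, k], knots[k]) for k in range(X.shape[1])])
h, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_est = Z @ h                                    # y_est(x) = sum_k g_k(x_k)
```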

If you like MARS...
...see also CMAC (Cerebellar Model Articulation Controller) by James Albus (another of Andrew's heroes). Many of the same gut-level intuitions, but entirely in a neural-network, biologically plausible way. (All the low-dimensional functions are by means of lookup tables, trained with a delta-rule and using a clever blurred update and hash-tables.)

Where are we now?
- Inputs → Inference Engine → learn P(E_1 | E_2): Joint DE, Bayes Net Structure Learning.
- Inputs → Classifier → predict category: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation.
- Inputs → Density Estimator → probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning.
- Inputs → Regressor → predict real no.: Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS.

What You Should Know
For each of the eight methods you should be able to summarize briefly what they do and outline how they work. You should understand them well enough that, given access to the notes, you can quickly re-understand them at a moment's notice. But you don't have to memorize all the details. In the right context any one of these eight might end up being really useful to you one day! You should be able to recognize this when it occurs.

Citations
Radial Basis Functions: T. Poggio and F. Girosi, "Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks", Science, 247, 978-982, 1990.
LOESS: W. S. Cleveland, "Robust Locally Weighted Regression and Smoothing Scatterplots", Journal of the American Statistical Association, 74(368), 829-836, December 1979.
GMDH etc.: P. Langley, G. L. Bradshaw and H. A. Simon, "Rediscovering Chemistry with the BACON System", in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell and T. M. Mitchell (eds.), Morgan Kaufmann, 1983.
Regression Trees etc.: L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984. J. R. Quinlan, "Combining Instance-Based and Model-Based Learning", Machine Learning: Proceedings of the Tenth International Conference, 1993.
Cascade Correlation etc.: S. E. Fahlman and C. Lebiere, "The Cascade-Correlation Learning Architecture", Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1990. http://citeseer.nj.nec.com/fahlman9cascadecorrelation.html. J. H. Friedman and W. Stuetzle, "Projection Pursuit Regression", Journal of the American Statistical Association, 76(376), December 1981.
MARS: J. H. Friedman, "Multivariate Adaptive Regression Splines", Department of Statistics, Stanford University, Technical Report No. 102, 1988.


More information

Gaussian Process Vine Copulas for Multivariate Dependence

Gaussian Process Vine Copulas for Multivariate Dependence Gaussian Process Vine Copulas for Multivariate Dependence José Miguel Hernández-Lobato 1,2 joint work with David López-Paz 2,3 and Zoubin Ghahramani 1 1 Department of Engineering, Cambridge University,

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

f(x) = 2x 2 + 2x - 4

f(x) = 2x 2 + 2x - 4 4-1 Graphing Quadratic Functions What You ll Learn Scan the tet under the Now heading. List two things ou will learn about in the lesson. 1. Active Vocabular 2. New Vocabular Label each bo with the terms

More information

INF Introduction to classifiction Anne Solberg

INF Introduction to classifiction Anne Solberg INF 4300 8.09.17 Introduction to classifiction Anne Solberg anne@ifi.uio.no Introduction to classification Based on handout from Pattern Recognition b Theodoridis, available after the lecture INF 4300

More information

15. Eigenvalues, Eigenvectors

15. Eigenvalues, Eigenvectors 5 Eigenvalues, Eigenvectors Matri of a Linear Transformation Consider a linear ( transformation ) L : a b R 2 R 2 Suppose we know that L and L Then c d because of linearit, we can determine what L does

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

Section 7.3: Parabolas, from College Algebra: Corrected Edition by Carl Stitz, Ph.D. and Jeff Zeager, Ph.D. is available under a Creative Commons

Section 7.3: Parabolas, from College Algebra: Corrected Edition by Carl Stitz, Ph.D. and Jeff Zeager, Ph.D. is available under a Creative Commons Section 7.: Parabolas, from College Algebra: Corrected Edition b Carl Stitz, Ph.D. and Jeff Zeager, Ph.D. is available under a Creative Commons Attribution-NonCommercial-ShareAlike.0 license. 0, Carl Stitz.

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Announcements. CS 188: Artificial Intelligence Spring Classification. Today. Classification overview. Case-Based Reasoning

Announcements. CS 188: Artificial Intelligence Spring Classification. Today. Classification overview. Case-Based Reasoning CS 188: Artificial Intelligence Spring 21 Lecture 22: Nearest Neighbors, Kernels 4/18/211 Pieter Abbeel UC Berkeley Slides adapted from Dan Klein Announcements On-going: contest (optional and FUN!) Remaining

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information