1 History of statistical/machine learning. 2 Supervised learning. 3 Two approaches to supervised learning. 4 The general learning procedure


Overview
- Breiman, L. (2001). Statistical modeling: The two cultures.
- Statistical learning reading group, Alexander Ly, Psychological Methods, University of Amsterdam, 27 October.
- Outline: 1 History of statistical/machine learning; 2 Supervised learning; 3 Two approaches to supervised learning; 4 The general learning procedure; 5 Model complexity; 6 Conclusion.

Definition of machine learning
- Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed.
- Tom M. Mitchell (1997): a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

History
- Evolved from artificial intelligence and pattern recognition.
- Success stories (from the 1990s onwards): spam filters, optical character recognition, natural language processing (search engines), and recommender systems; the Netflix challenge (2006) was won by AT&T Labs Research.

History (continued)
- Summary: the successes correlated with the rise of the internet, and machine learning reinvented statistics, using machine learning terms for statistical concepts.
- Ignored by most statisticians, except for Breiman and for Tibshirani, Hastie, and Friedman (the Efron/Stanford school).

Terminology: statistics vs machine learning
  Statistics        Machine learning
  Estimation        Learning
  Data point        Example/Instance
  Regression        Supervised learning
  Classification    Supervised learning
  Covariate         Feature
  Response          Label

Breiman
- Career: 1954 PhD in probability; 1960s consulting; from the 1980s onwards professor at UC Berkeley.
- Breiman's consulting experience: prediction problems; live with the data before modelling; the solution should be either an algorithmic or a data model; predictive accuracy is the criterion for the quality of the model; computers are necessary in practice.

Problem statement: supervised learning
- Based on n pairs of data (x_1, y_1), ..., (x_n, y_n), where the x_i are features and the y_i are correct labels, predict a future label y_new given x_new.
- Example (classification): features x_1 = (nose, mouth) with label y_1 = "yes, face"; features x_2 = (doorbell, hinge) with label y_2 = "no face".
- Examples of algorithmic techniques/models: classification and regression trees, bagging, random forests.
- In the paper Breiman focusses on supervised learning.

Problem statement: supervised learning (continued)
- Based on n pairs of data (x_1, y_1), ..., (x_n, y_n), where the x_i are features and the y_i are correct labels, predict a future label y_new given x_new.
- The unknown f: y = f(x) + ε. For prediction, discover the unknown f that relates the features (covariates) x, the input, to the labels (dependent variables) y, the output. (A minimal numerical sketch follows below.)

Goal of machine learning
- Predict future instances based on already observed data, that is, prediction based on past data. Example: "Based on previous data I predict that it will rain tomorrow."
- Give a prediction accuracy. Example: "Based on previous data I'm 67% sure that it will rain tomorrow."
- Use data models, or even algorithmic models, to discover the relationship f provided by nature.

Machine learning vs the "standard" statistical inference culture
- Machine learning. Goal: prediction of new data (learn f from the data). Approach: the data are always right; there are no models, only algorithms. Passive: the data are already collected. "Big" data.
- "Standard" approach in psychology. Goal: evaluation of theory (evaluate a known f). Approach: the model is right, the data could be wrong; evaluate the theory by comparing two models. Active: confirmatory analysis based on the model (design of experiments, power analysis, etc.). "Small" data.
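As a minimal numerical sketch of this setting (not from the slides; the data-generating line, the noise level, and the choice of a linear candidate f are assumptions made purely for illustration), learning f and predicting a new label could look like this:

```python
import numpy as np

# Hypothetical pairs (x_i, y_i); in reality they come from nature's unknown f.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)   # plays the role of y = f(x) + error

# "Learn" f: within the class of straight lines, pick the one with the smallest squared loss.
slope, intercept = np.polyfit(x, y, deg=1)

def f_trained(x_new):
    return slope * x_new + intercept

# Predict the label of a yet unseen instance x_new.
print(f_trained(4.2))
```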

4 "Standard" approach in pscholog Goal: evaluation of theor (evaluate a known f ) X Man time no theor most research has an eplorator flavour (80% according to Lakens) Approach: The model is right the data could be wrong. Evaluate theor b comparing two models X What if the model is wrong? Active: Confirmator analsis based on the model: Design of eperiments power analsis etc etc X Design of eperiments power analsis etc etc wrong if the model assumption is wrong "Small" data X Mechanical turk fmri data genetics cito OSF international collaboration man labs etc. Breiman s: Critique on the data modelling approach Wrong presumption Let data be generated according to the data model f where f is linear/logistic regression/... Eample: Uncritical use of linear regression to for instance bimodal data. Breiman s: Critique on the data modelling approach Wrong presumption Let data be generated according to the data model f where f is linear/logistic regression/... Breiman s: Critique on the data modelling approach Wrong presumption Let data be generated according to the data model f where f is linear/logistic regression/... f is linear f is logistic Focus on modelling the sampling distribution of the error not the f : {z} obs known z} { f ( {z} )= N(0 obs 2 ) f is a "network" Eample: in ANOVA sums of squared error R 2 etc. If the errors are big it is implied that the theor is bad. To quantif "big" require sampling distribution.

Breiman's critique of the data-modelling approach (continued)
- To calculate sampling distributions one is stuck with simple (linear) models.
- The conclusions are about the model's mechanism, not about nature's mechanism; if the model is a poor emulation of nature, the conclusions will be wrong.

Breiman's experience from consulting
- The classical (linear) models are tractable but typically yield bad predictions.
- Live with the data before modelling.
- Algorithmic models are also good.
- Predictive accuracy on a test set is the criterion for how good the model is.
- Computers are important.

Recall the setting
- Problem: with which f does nature generate the data y = f(x) + ε?
- Goal: prediction based on past data; learn (estimate) the unknown f, mapping input x to output y, and give a prediction accuracy.
- "Live with the data before modelling": the data are split in three parts, starting with a training set to learn f from the data.

Recall the setting (continued)
- Problem: with which f does nature generate the data y = f(x) + ε?
- Goal: prediction based on past data; learn (estimate) the unknown f and give a prediction accuracy.
- "Live with the data before modelling": the data are split in three parts. Training set: to learn f from the data. Validation set: to do model selection. Test set: to estimate the prediction accuracy.

The general learning procedure
- Recall that the data set consists of n pairs (x_1, y_1), ..., (x_n, y_n).
- Randomly select the training samples, say n_train of them.
- The training set is used to learn f from the data. The corresponding formula is y = f(x) + ε; filling in the true y_train,i gives y_train,i = f(x_train,i) + ε_i. Find that f for which the training loss

    (1/n_train) Σ_{i=1}^{n_train} Loss( y_train,i , f(x_train,i) )

  is smallest, and call the minimiser f_trained. (A code sketch of this step follows below.)
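A small sketch of this training step (the data-generating polynomial, the noise level, the sample sizes, and the quadratic candidate class are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=2000)
y = 3 + 2 * x - 0.2 * x**2 + rng.normal(0, 1, size=2000)   # stand-in for nature's y = f(x) + error

# Randomly select training samples, say 1000 of them; keep the rest aside for later.
idx = rng.permutation(len(x))
train, rest = idx[:1000], idx[1000:]

# Learn f on the training set: least squares minimises the average squared loss.
f_trained = np.poly1d(np.polyfit(x[train], y[train], deg=2))

train_loss = np.mean((y[train] - f_trained(x[train])) ** 2)
print("average training loss:", round(train_loss, 3))
```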

The general learning procedure (continued)
- Use the other samples as test samples, say n_test of them.
- The training set was used to learn f from the data; the test set is used to derive the prediction accuracy. Fill in f_trained and apply it to the test features to yield y_implied = f_trained(x_test,i), and estimate the error between y_implied and the true y_test,i:

    ε_estim = (1/n_test) Σ_{i=1}^{n_test} Loss( y_test,i , f_trained(x_test,i) ).

- Generalise to a new instance x_new. Final answer: for yet unseen features x_new, generate

    y_new = f_trained(x_new) ± ε_estim,

  where ε_estim is the generalisation error; the error so estimated serves as the prediction accuracy.

Remarks
- The only assumption is that the data (x_i, y_i) are iid, that is, that they are generated with the same (true, fixed, but unknown) f. The data are assumed to be true.
- There is no specification of the loss function: the loss function replaces the assumptions on the error (typically a Gaussian error in "standard" statistics).
- There is no specification of the collection F of functions f that we believe to be viable.

Loss functions
- For categorical data, typically the zero-one (all or nothing) loss:

    Loss( y_i , f(x_i) ) = 0 if y_i = f(x_i), and 1 if y_i ≠ f(x_i).

- Note the hard rule here: the observations y_i "supervise" the learning. (A small sketch follows below.)
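A minimal sketch of the zero-one loss and of the test-set error estimate (the labels and the predictions below are made up, and f_trained is assumed to have been obtained already):

```python
import numpy as np

# Zero-one loss, averaged over the test set: 0 when a prediction matches, 1 otherwise.
def zero_one_error(y_true, y_pred):
    return np.mean(y_true != y_pred)

# Hypothetical test labels and the labels implied by some already-trained f.
y_test    = np.array(["face", "face", "no face", "face", "no face"])
y_implied = np.array(["face", "no face", "no face", "face", "face"])

# Estimated prediction accuracy = average loss on the held-out test samples.
eps_estim = zero_one_error(y_test, y_implied)
print("estimated prediction error:", eps_estim)   # 0.4 for these made-up labels
```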

Loss functions (continued)
- For continuous data, typically the mean-squared error loss:

    Loss( Y , f(x) ) = E[ (Y − f(x))² ],

  where E is the expectation (average) with respect to the true relationship f between X and Y.
- In practice the expectation E is replaced by the empirical average with respect to the data, which are assumed to be true:

    Loss( Y , f(x) ) = (1/n) Σ_{i=1}^{n} ( y_i − f(x_i) )².

Bias-variance trade-off and overfitting
- For the mean-squared error loss the expected loss decomposes as

    E[ (Y − f(x))² ] = Var( f(x) ) + [ Bias( f(x) ) ]² + Var( ε ).

  Both Var and E, and thus the Bias, are with respect to the true f.
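The decomposition can be checked numerically. The sketch below is a simulation with made-up settings (the "true" f, the noise level, and the degree-9 candidate class are all assumptions): it repeatedly refits a polynomial to fresh training sets and estimates the variance and squared bias of the fit at a single point x0.

```python
import numpy as np

rng = np.random.default_rng(3)
f_true = lambda x: 1 + 2 * x - 0.5 * x**2     # assumed "true" f for the simulation
sigma, x0, degree = 1.0, 1.5, 9               # noise sd, evaluation point, candidate complexity

preds = []
for _ in range(2000):
    x = rng.uniform(0, 3, size=30)
    y = f_true(x) + rng.normal(0, sigma, size=30)
    fit = np.poly1d(np.polyfit(x, y, deg=degree))
    preds.append(fit(x0))
preds = np.array(preds)

print("Var(f_hat(x0))   :", round(preds.var(), 3))
print("Bias(f_hat(x0))^2:", round((preds.mean() - f_true(x0)) ** 2, 4))
print("Var(error)       :", sigma ** 2)
# The expected squared error at x0 is approximately the sum of the three terms above.
```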

Bias-variance trade-off and overfitting (continued)
- In the decomposition, Var( f(x) ) + [ Bias( f(x) ) ]² is the structural error and Var( ε ) is the unavoidable error.
- Recall that f_trained comes from minimising this loss. The more complicated a candidate f, the smaller the apparent unavoidable error, as everything is treated as structural. Problem: overfitting.
- Overfitting occurs if the f's under consideration are too complicated. An over-complicated f generalises badly and is recognised by low bias and high variance.

Overfitting example
- The true data-generating mechanism is y = f(x) + ε, where ε follows a Laplace distribution (thicker tails than the normal). [The slides show two figures: the raw data, and the raw data together with the true function f.]

Overfitting example (continued)
- An over-complicated f has low bias and high variance: the data are fitted with a polynomial of order nine. [Figure of the degree-9 fit.]
- Observe a new test sample/instance: x_test = 6.5 with y_test = 66.
- There is a large loss between y_test and the implied value f_trained(x_test). [Figure of the test point against the fitted curve.] (A short reproduction sketch follows below.)
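The overfitting picture can be reproduced in a few lines. In the sketch below the degree-2 truth, the Laplace noise scale, and the sample sizes are assumptions chosen to mimic the slides' example; only the qualitative comparison between the degree-2 and degree-9 fits is the point.

```python
import numpy as np

rng = np.random.default_rng(4)
f_true = lambda x: 2 + 3 * x - 0.8 * x**2       # degree-2 truth, invented coefficients
noise = lambda n: rng.laplace(0, 3, size=n)      # Laplace errors: thicker tails than normal

x_train = rng.uniform(0, 10, size=15)
y_train = f_true(x_train) + noise(15)

# Fit an appropriately simple and an over-complicated candidate f.
fit2 = np.poly1d(np.polyfit(x_train, y_train, deg=2))
fit9 = np.poly1d(np.polyfit(x_train, y_train, deg=9))

# Fresh test instances reveal the over-complicated fit's poor generalisation.
x_test = rng.uniform(0, 10, size=200)
y_test = f_true(x_test) + noise(200)
print("degree 2 test MSE:", round(np.mean((y_test - fit2(x_test)) ** 2), 1))
print("degree 9 test MSE:", round(np.mean((y_test - fit9(x_test)) ** 2), 1))
```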

Model complexity
- The true data generation was done with a polynomial of degree 2, so a polynomial of degree 9 is too complex. Hence the collection of candidate f's should be restricted to those f's of at most degree 2.
- In reality we do not know the "complexity" of the true f. How should the collection of candidates F be chosen?
- Example of a meaningless prediction from a poor choice: "your expected survival is 12 years ± 50 years".

Two types of model complexity
- The structure of f: linear or polynomial (still a summation of terms, thus linear), which is nothing new, the same as in statistics; or neural networks (non-linear), support vector machines, generalised additive models, kernel smoothing, splines, reproducing kernel Hilbert space regression, trees.
- The number of features: n < p.

12 "Standard" setting Regularisation The functions f s in F are linear More data than features p < n. The functions f s in F are linear More features than data n < p. Hazard of overfitting Control the compleit with additional tuning parameter (for instance lasso) Hence F = F Eample: f () = p p + {z } {z} (1) Over complicated structure tuning parameter Here acts as a penalt for compleit. How to tune? Regularisation The functions f s in F are linear More features than data n < p. Hazard of overfitting Control the compleit with additional tuning parameter (for instance lasso) Hence F = F Eample: f () = p p + {z } {z} (1) Over complicated structure tuning parameter Here acts as a penalt for compleit. Tune with validation set aka repeat the general procedure man times with different (fied). Split data into three sets. Training set is used to learn f from the data. Validation set to tune (model selection). Test set is used to derive the prediction accurac. Partition. For the lasso = 0 no regularisation for = 1 onl the constant function is viable. Sa = For each fied follow the general procedure f.

Tuning with the validation set (continued)
- The training set is used to learn f_λ from the data; the validation set is used to tune λ (model selection); the test set is used to derive the prediction accuracy.
- Say λ = 0, 2, 2², ..., 2¹⁰. We then get the trained family

    F = { f_{0,trained}, f_{2,trained}, f_{2²,trained}, ..., f_{2¹⁰,trained} }.   (2)

- Pick the f_{λ,trained} with the lowest average loss on the validation set:

    f_val = argmin_λ (1/n_val) Σ_{i=1}^{n_val} Loss( f_{λ,trained}(x_val,i) , y_val,i ).

- Estimate the prediction accuracy by averaging the loss on the test set (a code sketch of the full three-set procedure follows after the conclusion):

    ε_est = (1/n_test) Σ_{i=1}^{n_test} Loss( f_val(x_test,i) , y_test,i ).

Conclusion
- This talk gave the general idea of supervised learning and the roles of the training, validation, and test sets.
- Which specific classes F to take? This seminar discusses a couple of them: neural networks (non-linear), support vector machines, generalised additive models, kernel smoothing, splines, reproducing kernel Hilbert space regression, trees.
- How to minimise the loss? Technicalities.
- The selection method used here was "select the best"; other methods are bagging and boosting.
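Putting the three sets together for the lasso example, the sketch below follows the train/validate/test recipe end to end (the simulated data, the λ grid, and the use of scikit-learn's Lasso are assumptions for illustration, not the slides' own code):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 300, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]                   # only a few features actually matter
y = X @ beta + rng.normal(0, 1, size=n)

# Split the data into training, validation, and test sets.
idx = rng.permutation(n)
tr, va, te = idx[:150], idx[150:225], idx[225:]

lambdas = [0.001, 0.01, 0.1, 1.0]                # illustrative grid of tuning parameters
val_loss = {}
for lam in lambdas:
    f_lam = Lasso(alpha=lam, max_iter=10_000).fit(X[tr], y[tr])     # learn f_lambda on the training set
    val_loss[lam] = np.mean((y[va] - f_lam.predict(X[va])) ** 2)    # average loss on the validation set

best_lam = min(val_loss, key=val_loss.get)       # model selection: lowest validation loss
f_val = Lasso(alpha=best_lam, max_iter=10_000).fit(X[tr], y[tr])

# Estimate the prediction accuracy on the untouched test set.
eps_est = np.mean((y[te] - f_val.predict(X[te])) ** 2)
print("chosen lambda:", best_lam, "  estimated test error:", round(eps_est, 3))
```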

Further organisation
- The name was changed from "machine learning reading group" to "statistical learning seminar": reading groups die, and a seminar does not require everyone to read everything.
- It requires a small peak of effort when preparing a talk, and it is not necessary to understand everything.
- Talks can be practical or theoretical.
- It is still good to read things in advance; websites with YouTube clips are also available.
