Hastie, Tibshirani & Friedman: Elements of Statistical Learning Chapter Model Assessment and Selection. CN700/March 4, 2008.

1 Hastie, Tibshirani & Friedman: Elements of Statistical Learning
Chapter Model Assessment and Selection
CN700, March 4, 2008
Satyavarta
Auditory Neuroscience Laboratory, Department of Cognitive and Neural Systems, Boston University, Boston MA 02215

2 Outline: Model Assessment and Selection
Choosing Model Complexity
Model Assessment
Other Loss Functions
Arriving at a Model
Comparing Model Classes
Choosing Model Complexity
Bias Variance Decomposition
Choosing Model Complexity
Bias Variance Decomposition: Special Cases
Bias and Variance
Bias-Variance Tradeoff in Model Complexity
Bias-Variance with 0-1 Loss
Bias-Variance with 0-1 Loss: kNN
Bias-Variance Tradeoff with 0-1 Loss: Regression
Optimism
AIC in model selection
Estimates of Model Complexity: # of Parameters
Estimates of Model Complexity: Vapnik-Chervonenkis Dimension
Shatter
Example: Error of models picked by criteria relative to best model
References
model selection (ch ) slide 2 of 27

4 Choosing Model Complexity

11 Model Assessment
Given $X$, estimate $Y$ as $\hat f(X)$.
Loss function, squared error: $L(Y, \hat f(X)) = (Y - \hat f(X))^2$
Loss function, absolute error: $L(Y, \hat f(X)) = |Y - \hat f(X)|$
Test error: $\mathrm{Err} = E[L(Y, \hat f(X))]$
Training error: $\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat f(x_i))$
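As a minimal sketch of the quantities on this slide, the Python below computes the training error under squared and absolute loss. The data and the rule `f_hat` are made up for illustration and are not from the slides. The test error Err is an expectation over new data, so in practice it would be estimated by applying the same average to held-out points.

```python
def squared_error(y, y_hat):
    """L(Y, f_hat(X)) = (Y - f_hat(X))^2"""
    return (y - y_hat) ** 2


def absolute_error(y, y_hat):
    """L(Y, f_hat(X)) = |Y - f_hat(X)|"""
    return abs(y - y_hat)


def mean_loss(loss, ys, preds):
    """Average loss over paired observations and predictions."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)


def f_hat(x):
    """Hypothetical fitted rule, assumed only for this example."""
    return 2.0 * x


# Toy training data (illustrative values).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.5, 1.5, 4.5, 6.0]

preds = [f_hat(x) for x in x_train]
train_err_sq = mean_loss(squared_error, y_train, preds)
train_err_abs = mean_loss(absolute_error, y_train, preds)
print(train_err_sq, train_err_abs)
```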

12 Other Loss Functions
0-1 loss: $L(G, \hat G(X)) = I(G \neq \hat G(X))$
Log-likelihood loss: $L(G, \hat p(X)) = -2 \log \hat p_G(X)$
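The two classification losses can be sketched as follows. The class labels and the probability vector `p_hat` are hypothetical, and the predicted class is taken to be the most probable one, which is an assumption of this example.

```python
import math


def zero_one_loss(g, g_hat):
    """I(G != G_hat): 1 for a misclassification, 0 otherwise."""
    return 0 if g == g_hat else 1


def log_likelihood_loss(g, p_hat):
    """-2 * log of the probability the model assigns to the true class g."""
    return -2.0 * math.log(p_hat[g])


# Hypothetical estimated class probabilities at one input point.
p_hat = {"a": 0.7, "b": 0.2, "c": 0.1}
g_hat = max(p_hat, key=p_hat.get)  # predict the most probable class

loss01_if_a = zero_one_loss("a", g_hat)
loss01_if_b = zero_one_loss("b", g_hat)
dev_if_a = log_likelihood_loss("a", p_hat)
dev_if_b = log_likelihood_loss("b", p_hat)
print(g_hat, loss01_if_a, loss01_if_b, dev_if_a, dev_if_b)
```

The 0-1 loss only records whether the prediction was right, while the log-likelihood loss also grows as the probability assigned to the true class shrinks.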

16 Arriving at a Model
Model training: training set
Model selection: validation set
Model assessment: test set
Data: Train | Validation | Test
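The three roles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated from a sine curve with Gaussian noise, the model family is k-nearest-neighbor regression, the split is 50/25/25, and the candidate values of k are arbitrary. None of these choices come from the slides.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    """Mean squared error of the k-NN fit on an evaluation set."""
    errs = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
            for x, y in zip(xs_eval, ys_eval)]
    return sum(errs) / len(errs)


# Simulated data (assumption: true f is sin, noise sd 0.3).
rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]

# Split: train for fitting, validation for selection, test for assessment.
n_train, n_val = 100, 50
x_tr, y_tr = xs[:n_train], ys[:n_train]
x_va, y_va = xs[n_train:n_train + n_val], ys[n_train:n_train + n_val]
x_te, y_te = xs[n_train + n_val:], ys[n_train + n_val:]

# Model selection: pick k by validation error.
candidates = [1, 3, 5, 9, 15, 25, 51]
val_errors = {k: mse(x_va, y_va, x_tr, y_tr, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# Model assessment: the test set is touched once, after k is chosen.
test_error = mse(x_te, y_te, x_tr, y_tr, best_k)
print(best_k, test_error)
```

Keeping the test set out of the selection loop is the point of the design: the validation error of the chosen k is biased downward by the selection itself.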

17 Comparing Model Classes
Data rich: validation (Data: Train | Validation | Test)
Data poor: approximate validation
Analytically: AIC, BIC, MDL, SRM
Efficient sample re-use: cross-validation, bootstrapping

18 Choosing Model Complexity

19 Bias Variance Decomposition
Assumptions: $Y = f(X) + \epsilon$, $\epsilon \sim N(0, \sigma_\epsilon^2)$, squared-error loss
Error:
$\mathrm{Err}(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0]$
$= \sigma_\epsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2$
$= \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))$
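The step from the first line of the decomposition to the second can be spelled out as follows. This is a sketch that assumes, in addition to the stated assumptions, that the noise $\epsilon$ in the new response is independent of the training sample used to build $\hat f$.

```latex
\begin{aligned}
Y - \hat f(x_0)
  &= \epsilon
   + \bigl(f(x_0) - E\hat f(x_0)\bigr)
   + \bigl(E\hat f(x_0) - \hat f(x_0)\bigr), \\
E\bigl[(Y - \hat f(x_0))^2 \mid X = x_0\bigr]
  &= E[\epsilon^2]
   + \bigl[E\hat f(x_0) - f(x_0)\bigr]^2
   + E\bigl[\hat f(x_0) - E\hat f(x_0)\bigr]^2 \\
  &\quad + 2\,\bigl(f(x_0) - E\hat f(x_0)\bigr)\,E[\epsilon]
   + 2\,E[\epsilon]\;E\bigl[E\hat f(x_0) - \hat f(x_0)\bigr] \\
  &\quad + 2\,\bigl(f(x_0) - E\hat f(x_0)\bigr)\,
           E\bigl[E\hat f(x_0) - \hat f(x_0)\bigr].
\end{aligned}
```

All three cross terms vanish because $E[\epsilon] = 0$ and $E[E\hat f(x_0) - \hat f(x_0)] = 0$, and $E[\epsilon^2] = \sigma_\epsilon^2$, which leaves the irreducible error, the squared bias, and the variance.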

20 Choosing Model Complexity

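21 Bias Variance Decomposition: Special Cases
k-nearest neighbor fit:
$\mathrm{Err}(x_0) = \sigma_\epsilon^2 + \left[f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_{(l)})\right]^2 + \sigma_\epsilon^2 / k$
Linear model fit $\hat f_p(x) = \hat\beta^T x$:
$\mathrm{Err}(x_0) = \sigma_\epsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\epsilon^2$
Linear model family: further decomposition of the bias. For a fit $\hat f_\alpha$ from the family, averaging over $x_0$ and writing $\beta_*$ for the parameters of the best-fitting linear approximation to $f$:
$E_{x_0}[f(x_0) - E\hat f_\alpha(x_0)]^2 = E_{x_0}[f(x_0) - \beta_*^T x_0]^2 + E_{x_0}[\beta_*^T x_0 - E\hat\beta_\alpha^T x_0]^2 = \mathrm{Ave}[\text{Model Bias}]^2 + \mathrm{Ave}[\text{Estimation Bias}]^2$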
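The k-nearest-neighbor formula can be evaluated directly when the true function is known. The sketch below assumes a true function sin, a fixed grid of training inputs, a target point x0 = 1.5 and noise variance 0.09, all chosen only for illustration.

```python
import math


def knn_bias_variance(f, xs, x0, k, sigma2):
    """Squared bias and variance of a k-NN fit at x0 for fixed inputs xs."""
    nearest = sorted(xs, key=lambda x: abs(x - x0))[:k]
    mean_f = sum(f(x) for x in nearest) / k
    bias2 = (f(x0) - mean_f) ** 2   # [f(x0) - (1/k) sum f(x_(l))]^2
    variance = sigma2 / k           # sigma^2 / k
    return bias2, variance


xs = [i / 10 for i in range(0, 61)]  # fixed design on [0, 6] (assumption)
x0, sigma2 = 1.5, 0.09               # x0 lies on the grid

table = {}
for k in (1, 5, 15, 41):
    bias2, var = knn_bias_variance(math.sin, xs, x0, k, sigma2)
    table[k] = (bias2, var, sigma2 + bias2 + var)  # last entry is Err(x0)

for k, (bias2, var, err) in table.items():
    print(k, bias2, var, err)
```

With k = 1 the only neighbor is x0 itself, so the bias is zero and the variance is the full noise variance. Larger k averages over a wider window, which lowers the variance and, for this curved f, raises the bias.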

22 Bias and Variance

23 Bias-Variance Tradeoff in Model Complexity
Figure: Hastie et al. 7.2. Labels: model; sample variance; best model (*); total error, model bias, model variance; restricted model (restricted); estimation bias.
Choose the restricted model if $B_{\mathrm{est}} + B + \mathrm{Var}_{\mathrm{restricted}} < B + \mathrm{Var}$

24 Bias-Variance with 0-1 Loss
Assumptions: $Y = f(X) + \epsilon$, $\epsilon \sim N(0, \sigma_\epsilon^2)$, squared-error loss
Error:
$\mathrm{Err}(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0]$
$= \sigma_\epsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2$
$= \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))$

25 Bias-Variance with 0-1 Loss: kNN
Figures: Hastie et al. 7.3. Prediction error (red), squared bias (green) and variance (blue).

26 Bias-Variance Tradeoff with 0-1 Loss: Regression
Figures: Hastie et al. 7.3. Prediction error (red), squared bias (green) and variance (blue).

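27 Optimism
Training error: $\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat f(x_i))$
True error: $\mathrm{Err} = E[L(Y, \hat f(X))]$
The model adapts to the training data, so typically $\overline{\mathrm{err}} < \mathrm{Err}$.
In-sample error: $\mathrm{Err}_{\mathrm{in}} = \frac{1}{N} \sum_{i=1}^{N} E_y E_{Y^{\mathrm{new}}} L(Y_i^{\mathrm{new}}, \hat f(x_i))$, with new responses $Y_i^{\mathrm{new}}$ observed at each training point $x_i$, $i = 1, 2, \ldots, N$
Optimism: $\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - E_y(\overline{\mathrm{err}})$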

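28 Estimating In-sample error using Optimism
$\mathrm{Err}_{\mathrm{in}} = E_y(\overline{\mathrm{err}}) + \mathrm{op}$
For squared error, 0-1 loss, and other loss functions: $\mathrm{op} = \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$
The more tightly the data are fit, the higher the optimism:
$\mathrm{Err}_{\mathrm{in}} = E_y(\overline{\mathrm{err}}) + \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$
For a linear fit with $d$ inputs: $\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d \sigma_\epsilon^2$, so
$\mathrm{Err}_{\mathrm{in}} = E_y(\overline{\mathrm{err}}) + 2 \frac{d}{N} \sigma_\epsilon^2$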
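The linear-fit identity can be checked numerically in the simplest case. For a least-squares line with intercept, the fitted values are linear in the responses, $\hat y = Hy$, so with independent errors of variance $\sigma_\epsilon^2$ we have $\mathrm{Cov}(\hat y_i, y_i) = h_{ii}\,\sigma_\epsilon^2$, where $h_{ii}$ is the leverage of point $i$. The sketch below uses the closed-form leverages of simple linear regression, where d = 2 counts the intercept and the slope. The input values and noise variance are made up.

```python
def leverages(xs):
    """Diagonal of the hat matrix for a least-squares line with intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]


xs = [0.0, 0.5, 1.0, 2.0, 3.5, 4.0, 6.0, 7.5]  # illustrative inputs
sigma2 = 0.25                                   # assumed noise variance
n, d = len(xs), 2                               # d: intercept + slope

h = leverages(xs)
cov_sum = sigma2 * sum(h)        # sum_i Cov(y_hat_i, y_i)
optimism = 2.0 / n * cov_sum     # op = (2/N) * sum_i Cov(y_hat_i, y_i)
print(sum(h), cov_sum, optimism, 2.0 * d / n * sigma2)
```

The leverages sum to 2 regardless of the input values, so the optimism equals $2 (d/N) \sigma_\epsilon^2$ with d = 2.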

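29 Estimates of in-sample prediction error
$C_p$ statistic: estimate $\hat\sigma_\epsilon^2$ from a low-bias model, then
$C_p = \overline{\mathrm{err}} + 2 \frac{d}{N} \hat\sigma_\epsilon^2$
(the observed training error $\overline{\mathrm{err}}$ stands in for $E_y(\overline{\mathrm{err}})$)
For logistic regression, using the binomial log-likelihood:
$\overline{\mathrm{err}} = -\frac{2}{N}\,\mathrm{loglik}$
$\widehat{\mathrm{Err}}_{\mathrm{in}} = -\frac{2}{N}\,\mathrm{loglik} + 2 \frac{d}{N}$ (AIC)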
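A minimal sketch of the $C_p$ comparison, assuming two candidate models on toy data: a constant (d = 1) and a least-squares line (d = 2). The noise variance is estimated from the line, treated here as the low-bias model, using RSS / (N - 2). The data values are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cp(train_err, d, n, sigma2_hat):
    """C_p = training error + 2 * (d / N) * sigma_hat^2"""
    return train_err + 2.0 * d / n * sigma2_hat


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 4.9]
n = len(xs)

# Constant model: predict the mean response.
y_bar = sum(ys) / n
err_const = sum((y - y_bar) ** 2 for y in ys) / n

# Linear model, also used as the low-bias model for the variance estimate.
b0, b1 = fit_line(xs, ys)
rss_line = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
err_line = rss_line / n
sigma2_hat = rss_line / (n - 2)

cp_const = cp(err_const, 1, n, sigma2_hat)
cp_line = cp(err_line, 2, n, sigma2_hat)
print(cp_const, cp_line)
```

Here the line wins despite its larger penalty, because its training error is far smaller than that of the constant model.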

30 Estimates of in-sample prediction error (contd)
Akaike Information Criterion (AIC): maximizing the likelihood corresponds to minimizing the negative log-likelihood loss
For the Gaussian model, AIC = $C_p$
In general, for a family of models with tuning parameter $\alpha$: $\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2 \frac{d(\alpha)}{N} \hat\sigma_\epsilon^2$
$d(\alpha)$ is the effective number of parameters, e.g. $d(S) = \mathrm{trace}(S)$ for a linear smoother $\hat y = S y$
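To make the trace definition concrete, the sketch below builds the smoother matrix S of a k-nearest-neighbor regression (an example chosen here, not one from the slides) and takes its trace. Each point is its own nearest neighbor, so every diagonal entry is 1/k and the effective number of parameters is N/k. The input values are arbitrary and assumed distinct.

```python
def knn_smoother_matrix(xs, k):
    """Row i holds the weights that map the responses y to the fit at xs[i]."""
    n = len(xs)
    S = [[0.0] * n for _ in range(n)]
    for i, x0 in enumerate(xs):
        nearest = sorted(range(n), key=lambda j: abs(xs[j] - x0))[:k]
        for j in nearest:
            S[i][j] = 1.0 / k
    return S


def trace(M):
    """Sum of the diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))


xs = [0.0, 0.7, 1.1, 1.9, 2.4, 3.0, 3.8, 4.1, 5.2, 6.0]  # distinct inputs
dof = {k: trace(knn_smoother_matrix(xs, k)) for k in (1, 2, 5, 10)}
print(dof)
```

Small k gives a flexible fit with many effective parameters; k = N collapses to a single parameter, the overall mean.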

31 AIC in model selection
Figures: Hastie et al. 7.4
Pick the model with the smallest AIC

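32 More Estimates of in-sample prediction error
Bayes Information Criterion (BIC): $\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d$
Penalizes complexity more heavily than $\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2 \frac{d}{N}$: on the common $-2\,\mathrm{loglik}$ scale the penalty per parameter is $\log N$ instead of 2, which is larger once $N > e^2 \approx 7.4$
Asymptotically optimal: picks the correct model (if it lies in the family) as $N \to \infty$
Minimum Description Length (MDL): formally the same as BIC, motivated by information theory
Description length = len(encoded message) + len(encoding parameters); choose the model that minimizes it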
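The different penalties can lead the two criteria to different choices. The sketch below is an illustration on invented binomial counts: two groups of 50 trials with 30 and 20 successes, compared under a pooled model (one probability, d = 1) and a split model (one probability per group, d = 2), each fitted by maximum likelihood.

```python
import math


def bernoulli_loglik(successes, trials, p):
    """Log-likelihood of independent Bernoulli trials with success prob p."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)


def aic(loglik, d, n):
    """AIC = -(2/N) * loglik + 2 * d / N"""
    return -2.0 / n * loglik + 2.0 * d / n


def bic(loglik, d, n):
    """BIC = -2 * loglik + log(N) * d"""
    return -2.0 * loglik + math.log(n) * d


# Invented counts: (successes, trials) for two groups.
groups = [(30, 50), (20, 50)]
n = sum(t for _, t in groups)

# Pooled model: one success probability for everyone (d = 1).
p_pooled = sum(s for s, _ in groups) / n
ll_pooled = sum(bernoulli_loglik(s, t, p_pooled) for s, t in groups)

# Split model: one success probability per group (d = 2).
ll_split = sum(bernoulli_loglik(s, t, s / t) for s, t in groups)

aic_pooled, aic_split = aic(ll_pooled, 1, n), aic(ll_split, 2, n)
bic_pooled, bic_split = bic(ll_pooled, 1, n), bic(ll_split, 2, n)
print(aic_pooled, aic_split, bic_pooled, bic_split)
```

On these counts AIC prefers the split model while BIC prefers the pooled one, since the gain in fit is larger than 2 but smaller than log(100) on the $-2\,\mathrm{loglik}$ scale.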

35 Estimates of Model Complexity: # of Parameters
$Y = \beta_0 + \beta_1 X$
$Y = I(\sin(\alpha_1 x + \alpha_0))$

36 Estimates of Model Complexity: Vapnik-Chervonenkis Dimension
The VC dimension of the class $\{f(x, \alpha)\}$ is defined to be the largest number of points (in some configuration) that can be shattered by members of $\{f(x, \alpha)\}$.

41 Shatter
A set of points is said to be shattered by a class of functions if, for any binary labeling, a member of the class can perfectly separate them.
Max points shattered: 3
Max points shattered:
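The definition can be tried out by brute force on small point sets. The sketch below assumes the class of functions is straight lines in the plane (the slide does not name the class) and looks for a separating line with a perceptron run for a capped number of passes. The cap is a heuristic: failing to find a line within `max_epochs` is treated as "not separable", which is adequate for these tiny examples but is not a proof in general.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron search for a line w1*x + w2*y + b = 0 separating the labels."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            activation = w[0] * x + w[1] * y + w[2]
            if label * activation <= 0:  # wrong side or on the line: update
                w[0] += label * x
                w[1] += label * y
                w[2] += label
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shattered(points):
    """True if every binary labeling of the points is separated by some line."""
    return all(linearly_separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
collinear = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
square = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

print(shattered(triangle), shattered(collinear), shattered(square))
```

Three points in general position are shattered, three collinear points are not, which is why the definition says "in some configuration". The square fails on the labeling that puts opposite corners in the same class. That is one configuration only; the known result that no four points in the plane can be shattered by lines is not established by this check.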

42 Example: Error of models picked by criteria relative to best model
Figures: Hastie et al. 7.7

43 References
Hastie, Tibshirani and Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001, pp tibs/elemstatlearn/
