Chapter 7: Model Assessment and Selection

1 Chapter 7: Model Assessment and Selection DD3364 April 20, 2012

2 Introduction

3 Regression: Review of our problem. Have target variable $Y$ to estimate from a vector of inputs $X$. A prediction model $\hat{f}(X)$ has been estimated from training data $\mathcal{T} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. The loss function $L(Y, \hat{f}(X))$ measures the errors between $Y$ and $\hat{f}(X)$. Common loss functions are $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$ (squared error) and $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$ (absolute error).
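As a concrete illustration (a minimal sketch, not from the slides; the arrays are made-up numbers), the two loss functions in numpy:

```python
import numpy as np

def squared_error(y, f_hat):
    """Squared error loss L(Y, f_hat(X)) = (Y - f_hat(X))^2."""
    return (y - f_hat) ** 2

def absolute_error(y, f_hat):
    """Absolute error loss L(Y, f_hat(X)) = |Y - f_hat(X)|."""
    return np.abs(y - f_hat)

y = np.array([1.0, 2.0, 3.0])
f_hat = np.array([1.1, 1.8, 3.5])
print(squared_error(y, f_hat).mean())   # mean squared error
print(absolute_error(y, f_hat).mean())  # mean absolute error
```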

6 Test Error, a.k.a. generalization error. Definition of test error: $\mathrm{Err}_{\mathcal{T}} = E[\, L(Y, \hat{f}(X)) \mid \mathcal{T} \,]$, the prediction error over an independent test sample. $X$ and $Y$ are drawn randomly from their joint distribution $p(X, Y)$. The training set $\mathcal{T}$ is fixed.

7 Definition of Expected Prediction Error. Expected prediction error (expected test error): $\mathrm{Err} = E[\, L(Y, \hat{f}(X)) \,] = E[\, \mathrm{Err}_{\mathcal{T}} \,]$. In this case the expectation is taken over all the random quantities, including the training set. Which quantities interest us? We would like to estimate $\mathrm{Err}_{\mathcal{T}}$, but in most cases it is easier to estimate $\mathrm{Err}$. Why?

9 Definition of Training Error. Training error: $\overline{\mathrm{err}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}(x_i))$. We already know that as the complexity of $\hat{f}$ increases, $\overline{\mathrm{err}} \to 0$, but there is a tendency to overfit and $\mathrm{Err}_{\mathcal{T}}$ increases. Hence $\overline{\mathrm{err}}$ is not a good estimate of $\mathrm{Err}_{\mathcal{T}}$ or $\mathrm{Err}$. We will be revisiting the bias-variance trade-off.

10 Issues in assessing generalization ability: test and training error as model complexity increases. FIGURE 7.1. Behavior of test-sample and training-sample error as the model complexity is varied (x-axis: model complexity/df, running from high bias & low variance to low bias & high variance; y-axis: prediction error). Light blue curves: training error $\overline{\mathrm{err}}$; solid blue curve: expected training error $E[\overline{\mathrm{err}}]$; light red curves: conditional test error $\mathrm{Err}_{\mathcal{T}}$ for 100 training sets of size 50 each; solid red curve: expected test error $\mathrm{Err}$. Test error, also referred to as generalization error, is the prediction error over an independent test sample.
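A small simulation in the spirit of Figure 7.1 (a sketch, not the book's code; the sine target, noise level, and polynomial family are illustrative choices): training error falls with complexity while test error is U-shaped.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # true regression function
n = 50
x_tr = rng.uniform(0, 1, n)
x_te = rng.uniform(0, 1, 1000)
y_tr = f(x_tr) + rng.normal(0, 0.3, n)
y_te = f(x_te) + rng.normal(0, 0.3, 1000)

for degree in [1, 2, 3, 5, 7, 10]:         # increasing model complexity
    coef = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    err_train = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    err_test = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE {err_train:.3f}, test MSE {err_test:.3f}")
```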

11 Same story for classification. Have target categorical variable $G \in \{1, \ldots, K\}$ to estimate from a vector of inputs $X$. Typically we model $p_k(X) = P(G = k \mid X)$ and define $\hat{G}(X) = \arg\max_k \hat{p}_k(X)$. Common loss functions are 0-1 loss, $L(G, \hat{G}(X)) = \mathrm{Ind}(G \neq \hat{G}(X))$, and $-2 \times$ log-likelihood, a.k.a. deviance, $L(G, \hat{p}(X)) = -2 \sum_{k=1}^{K} \mathrm{Ind}(G = k) \log \hat{p}_k(X) = -2 \log \hat{p}_G(X)$.

14 Performance scores for classification. Test error: $\mathrm{Err}_{\mathcal{T}} = E[\, L(G, \hat{G}(X)) \mid \mathcal{T} \,]$. Training error (one common definition): $\overline{\mathrm{err}} = -\frac{2}{n} \sum_{i=1}^{n} \log \hat{p}_{g_i}(x_i)$.

15 Goal of this chapter. $\hat{f}_\alpha(x)$ typically has a tunable parameter $\alpha$ controlling its complexity. We want to find the value of $\alpha$ such that $\hat{\alpha} = \arg\min_\alpha E[\, L(Y, \hat{f}_\alpha(X)) \,]$. Estimate $E[\, L(Y, \hat{f}_\alpha(X)) \,]$ for different values of $\alpha$ and choose the $\alpha$ with the minimum estimate. This chapter presents methods for doing this.

16 Our two separate goals. Model selection: estimate the performance of different models in order to choose the best one. Model assessment: having chosen a final model, estimate its prediction error on new data.

17 For a data-rich situation: randomly divide the dataset into 3 parts, Train | Validation | Test, with a common split ratio of 50%, 25%, 25%. Model selection: use the training set to fit each model, use the validation set to estimate $\mathrm{Err}_{\mathcal{T}}$ for each model, and choose the model with the lowest $\mathrm{Err}_{\mathcal{T}}$ estimate. Model assessment of the chosen model: use the test set, unseen until this stage, to estimate $\mathrm{Err}_{\mathcal{T}}$.

18 What if labelled data-sets are small? Approximate the validation step either analytically, with approaches such as (1) the Akaike Information Criterion, (2) the Bayesian Information Criterion, (3) Minimum Description Length, (4) Structural Risk Minimization, or with efficient sample re-use: (1) cross-validation, (2) the bootstrap. Each method also provides an estimate of $\mathrm{Err}$ or $\mathrm{Err}_{\mathcal{T}}$ of the final chosen model.

20 The Bias-Variance Decomposition

21 The bias-variance decomposition. We will assume an additive model $Y = f(X) + \varepsilon$ where $E[\varepsilon] = 0$ and $\mathrm{Var}[\varepsilon] = \sigma_\varepsilon^2$. Then the expected prediction error of $\hat{f}(X)$ at $X = x_0$, $\mathrm{Err}(x_0) = E[\, (Y - \hat{f}(x_0))^2 \mid X = x_0 \,]$, can be expressed as $\mathrm{Err}(x_0) = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$, with irreducible error $\sigma_\varepsilon^2$, bias $E[\hat{f}(x_0)] - f(x_0)$, and variance $\mathrm{Var}[\hat{f}(x_0)]$.

23 k-nearest neighbour regression fit. $\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \left[ f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_{(l)}) \right]^2 + \frac{\sigma_\varepsilon^2}{k}$, where the second term is the squared bias and the third the variance. The complexity of the model is inversely related to $k$: as $k$ increases the variance decreases, and as $k$ increases the squared bias increases. The above expression was computed by assuming the $x_i$'s are fixed.
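A quick Monte Carlo check of this decomposition (a sketch under the slide's fixed-design assumption; the quadratic $f$, noise level, and query point are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2                    # illustrative true function
sigma = 0.5
x = np.sort(rng.uniform(-1, 1, 100))    # fixed design points
x0, k = 0.1, 10
nn = np.argsort(np.abs(x - x0))[:k]     # indices of the k nearest neighbours of x0

preds, errs = [], []
for _ in range(10000):                  # repeated draws of the noise only
    y = f(x) + rng.normal(0, sigma, x.size)
    f_hat = y[nn].mean()                # k-NN regression estimate at x0
    preds.append(f_hat)
    errs.append((f(x0) + rng.normal(0, sigma) - f_hat) ** 2)

preds = np.array(preds)
bias2 = (f(x0) - f(x[nn]).mean()) ** 2  # closed-form squared bias
print("Monte Carlo Err(x0):          ", np.mean(errs))
print("sigma^2 + bias^2 + sigma^2/k: ", sigma**2 + bias2 + sigma**2 / k)
print("MC variance vs sigma^2/k:     ", preds.var(), sigma**2 / k)
```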

24 Linear model - least squares fit. Have a linear model $\hat{f}_p(x) = x^T \hat{\beta}$ where $\hat{\beta}$ is $p$-dimensional and fit by least squares. Then $\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \left[ f(x_0) - E[\hat{f}_p(x_0)] \right]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$, with $h(x_0) = X (X^T X)^{-1} x_0$ and $\hat{f}_p(x_0) = x_0^T (X^T X)^{-1} X^T y$.

25 Linear model - ridge regression fit. Have a linear model $\hat{f}_{p,\alpha}(x) = x^T \hat{\beta}_\alpha$ where $\hat{\beta}_\alpha$ is $p$-dimensional and fit via ridge regression. Then $\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \left[ f(x_0) - E[\hat{f}_{p,\alpha}(x_0)] \right]^2 + \|h_\alpha(x_0)\|^2 \sigma_\varepsilon^2$, with $h_\alpha(x_0) = X (X^T X + \alpha I)^{-1} x_0$ and $\hat{f}_{p,\alpha}(x_0) = x_0^T (X^T X + \alpha I)^{-1} X^T y$. Therefore this regression fit has a different bias and variance to the least squares fit.
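To see how the penalty shrinks the variance term, a minimal numpy sketch (synthetic $X$ and $x_0$; the $\alpha$ values are illustrative) comparing $\|h_\alpha(x_0)\|^2 \sigma_\varepsilon^2$ across penalties, with $\alpha = 0$ recovering least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 50, 10, 1.0
X = rng.normal(size=(n, p))
x0 = rng.normal(size=p)

for alpha in [0.0, 1.0, 10.0, 100.0]:   # alpha = 0 is the least squares fit
    # h_alpha(x0) = X (X^T X + alpha I)^{-1} x0
    h = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), x0)
    print(f"alpha={alpha:6.1f}  variance term = {sigma2 * (h @ h):.4f}")
```

The variance term decreases monotonically in $\alpha$, which is the shrinkage half of the ridge bias-variance trade.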

27 Linear model - finer decomposition of the bias. Let $\beta_*$ denote the parameters of the best-fitting linear approximation to $f$: $\beta_* = \arg\min_\beta E[\, (f(X) - X^T \beta)^2 \,]$. We can write the averaged squared bias $E_{x_0}[\, (f(x_0) - E[\hat{f}_\alpha(x_0)])^2 \,]$ as $E_{x_0}[\, (f(x_0) - x_0^T \beta_*)^2 \,] + E_{x_0}[\, (x_0^T \beta_* - E[x_0^T \hat{\beta}_\alpha])^2 \,] = \text{Ave[Model Bias]}^2 + \text{Ave[Estimation Bias]}^2$. The estimation bias is zero for the ordinary least squares estimate and positive for the ridge regression estimate.

29 Behaviour of bias and variance. FIGURE 7.2. Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the closest fit labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population". A shrunken or regularized fit (in a restricted model space) is also shown, having additional estimation bias but smaller prediction error due to its decreased variance.

30 Bias-variance trade-off: Example 1. The set-up: have $n = 80$ observations and $p = 20$ predictors. $X$ is uniformly distributed in $[0, 1]^{20}$ and $Y = 0$ if $X_1 \leq .5$, $Y = 1$ if $X_1 > .5$. Apply $k$-NN to perform both the classification and regression tasks. Use squared error loss to measure $\mathrm{Err}$ for the regression task and 0-1 loss for the classification task.

31 Bias-variance trade-off: Example 1. Expected prediction error as $k$ varies (left panels of FIGURE 7.3). Orange curve: expected prediction error; green curve: squared bias; blue curve: variance. FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classification with 0-1 loss. The models are $k$-nearest neighbors (left) and best subset regression of size $p$ (right). The variance and bias curves are the same in regression and classification, but the prediction error curve is different. Note the prediction error curves are not the same, as the loss functions differ.

33 Bias-variance trade-off: Example 2. The set-up: have $n = 80$ observations and $p = 20$ predictors. $X$ is uniformly distributed in $[0, 1]^{20}$ and $Y = 1$ if $\sum_{j=1}^{10} X_j > 5$, and $0$ otherwise. Use best subset linear regression of size $p$ for the classification and regression tasks. Use squared error loss to measure $\mathrm{Err}$ for the regression task and 0-1 loss for the classification task.

34 Bias-variance trade-off: Example 2. Expected prediction error as the subset size $p$ varies (right panels of FIGURE 7.3; see the caption on the previous slide). Orange curve: expected prediction error; green curve: squared bias; blue curve: variance. Note the prediction error curves are not the same, as the loss functions differ.

35 Optimism of the Training Error Rate

36 Estimating the optimism of err. The training error $\overline{\mathrm{err}}$ is typically less than $\mathrm{Err}_{\mathcal{T}}$, as it uses $\mathcal{T}$ for both fitting and assessment. One factor: the training and test input vectors for $\overline{\mathrm{err}}$ are the same, while for $\mathrm{Err}_{\mathcal{T}}$ they differ. We can begin to understand the optimism of $\overline{\mathrm{err}}$ if we focus on the in-sample error $\mathrm{Err}_{\mathrm{in}} = \frac{1}{n} \sum_{i=1}^{n} E_Y[\, L(y_i, \hat{f}(x_i)) \mid \mathcal{T} \,]$, where the expectation is over new responses $y_i$ at each training point $x_i$.

39 The optimism of err. Define the optimism as $\mathrm{op} = \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}$. The average optimism is $\omega = E_{\mathbf{y}}[\mathrm{op}]$, where the training input vectors are held fixed and the expectation is over the training output values. For many loss functions $\omega = \frac{2}{n} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i)$ (the factor 2 is needed for consistency with the $C_p$ penalty on slide 44).

40 The optimism of err. $\omega = \frac{2}{n} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i)$. The more strongly $y_i$ affects its own prediction $\hat{y}_i$, the larger $\omega$; and the larger $\omega$, the greater the optimism of $\overline{\mathrm{err}}$. In summary we get the important relation $E_{\mathbf{y}}[\mathrm{Err}_{\mathrm{in}}] = E_{\mathbf{y}}[\overline{\mathrm{err}}] + \frac{2}{n} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i)$.

41 How to estimate prediction error? Option 1: estimate the optimism and add it to $\overline{\mathrm{err}}$. The methods $C_p$, AIC and BIC work in this way for a special class of estimates. The in-sample error can be used for model selection, but it is not a good estimate of $\mathrm{Err}$. Option 2: use cross-validation and the bootstrap as direct estimates of the extra-sample error $\mathrm{Err}$.

43 Estimates of In-Sample Prediction Error

44 $\mathrm{Err}_{\mathrm{in}}$ estimate: the $C_p$ statistic. If $\hat{y}_i$ is obtained by a linear fit with $d$ inputs, then $\sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i) = d\, \sigma_\varepsilon^2$ for the additive error model $Y = f(X) + \varepsilon$. And so $E_{\mathbf{y}}[\mathrm{Err}_{\mathrm{in}}] = E_{\mathbf{y}}[\overline{\mathrm{err}}] + 2 \frac{d}{n} \sigma_\varepsilon^2$. Adapting this expression leads to the $C_p$ statistic $C_p = \overline{\mathrm{err}} + 2 \frac{d}{n} \hat{\sigma}_\varepsilon^2$, where $\hat{\sigma}_\varepsilon^2$ is an estimate of the noise variance.
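A small simulation (a sketch; the design matrix and coefficients are synthetic) checking that $\sum_i \mathrm{Cov}(\hat{y}_i, y_i) = d\, \sigma_\varepsilon^2$ for a linear least squares fit, which is the identity the $C_p$ penalty is built on:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 100, 5, 1.0
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix: y_hat = H y

# Simulate many response vectors and estimate sum_i Cov(yhat_i, y_i)
Y = X @ beta + rng.normal(0, sigma, size=(5000, n))
Yhat = Y @ H.T
cov_sum = sum(np.cov(Yhat[:, i], Y[:, i])[0, 1] for i in range(n))
print("simulated sum of Cov(yhat_i, y_i):", cov_sum)
print("d * sigma^2:                      ", d * sigma**2)
```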

47 Akaike Information Criterion. $\mathrm{AIC} = -\frac{2}{n} \cdot \mathrm{loglik} + 2 \frac{d}{n}$, where $\mathrm{loglik} = \sum_{i=1}^{n} \log P_{\hat{\theta}}(y_i)$ and $\hat{\theta}$ is the MLE of $\theta$. The first term rewards the fit between the model and the data; the second is a penalty for including extra predictors in the model. Note: AIC can be seen as an estimate of $\mathrm{Err}_{\mathrm{in}}$, in this case with a log-likelihood loss.

48 Akaike Information Criterion: using training error. Have a set of models $f_\alpha(x)$ indexed by $\alpha$; let $\overline{\mathrm{err}}(\alpha)$ be the training error and $d(\alpha)$ the number of parameters for each model. Then $\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2 \frac{d(\alpha)}{n} \hat{\sigma}_\varepsilon^2$. Note: AIC can be seen as an estimate of $\mathrm{Err}_{\mathrm{in}}$, in this case with a squared-error loss.
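A sketch of this squared-error form of AIC used for model selection (polynomial degree plays the role of $\alpha$; estimating $\hat{\sigma}_\varepsilon^2$ from the largest model is one common convention, an assumption here since the slides do not specify it):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - x ** 2 + rng.normal(0, 0.4, n)   # true model is quadratic

degrees = range(1, 9)
fits = {m: np.polyval(np.polyfit(x, y, m), x) for m in degrees}
errs = {m: np.mean((y - fits[m]) ** 2) for m in degrees}

# Estimate the noise variance from the most complex model
m_max = max(degrees)
sigma2_hat = np.sum((y - fits[m_max]) ** 2) / (n - (m_max + 1))

# AIC(alpha) = err(alpha) + 2 d(alpha)/n * sigma2_hat, with d = degree + 1
aic = {m: errs[m] + 2 * (m + 1) / n * sigma2_hat for m in degrees}
best = min(aic, key=aic.get)
print("AIC by degree:", {m: round(a, 4) for m, a in aic.items()})
print("selected degree:", best)   # typically 2, the true degree
```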

49 AIC used for model selection. FIGURE 7.4. AIC used for model selection for the phoneme recognition example. The logistic regression coefficient function $\beta(f) = \sum_{m=1}^{M} h_m(f)\, \theta_m$ is modeled as an expansion in $M$ spline basis functions. In the left panel, the AIC statistic is used to estimate $\mathrm{Err}_{\mathrm{in}}$ using log-likelihood loss; included is an estimate of $\mathrm{Err}$ based on an independent test sample. It does well except for the extremely over-parametrized case ($M = 256$ parameters for $N = 1000$ observations). In the right panel the same is done for 0-1 loss; although the AIC formula does not strictly apply here, it does a reasonable job in this case. The classifier is a logistic regression function with an expansion of $M$ spline basis functions.

50 How well AIC and BIC perform w.r.t. model selection. Boxplots over the four scenarios of Figure 7.3 (reg/knn, reg/linear, class/knn, class/linear) show $100 \cdot \frac{\mathrm{Err}(\hat{\alpha}) - \min_\alpha \mathrm{Err}(\alpha)}{\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)}$, the percentage increase over the best model, where $\hat{\alpha}$ is the parameter found via the selection method under investigation (AIC, BIC, or SRM). 100 training sets were used.

51 The Effective Number of Parameters

52 Generalization of the number of parameters. For regularized fitting we need to generalize the concept of the number of parameters. Consider regularized linear fitting, e.g. ridge regression or cubic smoothing splines, where $\hat{\mathbf{y}} = S \mathbf{y}$, with (1) $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$ the vector of training outputs, (2) $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_n)^T$ the vector of predictions, and (3) $S$ an $n \times n$ matrix depending on $x_1, \ldots, x_n$ but not on $y_1, \ldots, y_n$. Then the effective number of parameters is defined as $\mathrm{df}(S) = \mathrm{trace}(S)$.

54 General definition: effective degrees of freedom. If $\mathbf{y}$ arises from an additive-error model $Y = f(X) + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, then $\sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i) = \mathrm{trace}(S)\, \sigma_\varepsilon^2$. The more general definition of the effective degrees of freedom is then $\mathrm{df}(\hat{\mathbf{y}}) = \frac{\sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i)}{\sigma_\varepsilon^2}$.
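A minimal sketch (synthetic data; the $\alpha$ values are illustrative) computing the effective degrees of freedom of ridge regression from its smoother matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 10
X = rng.normal(size=(n, p))

for alpha in [0.0, 1.0, 10.0, 100.0]:
    # Ridge smoother matrix: y_hat = S y with S = X (X^T X + alpha I)^{-1} X^T
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
    print(f"alpha={alpha:6.1f}  df = trace(S) = {np.trace(S):.3f}")
```

At $\alpha = 0$ this recovers $\mathrm{df} = p$; increasing the penalty smoothly reduces the effective number of parameters.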

56 The Bayesian Approach and BIC

57 Generic form of BIC. Bayesian Information Criterion: $\mathrm{BIC} = -2 \cdot \mathrm{loglik} + \log(n) \cdot d$. Assuming a Gaussian model with known variance $\sigma_\varepsilon^2$, then $-2 \cdot \mathrm{loglik} = \frac{1}{\sigma_\varepsilon^2} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 = \frac{n \cdot \overline{\mathrm{err}}}{\sigma_\varepsilon^2}$, and $\mathrm{BIC} = \frac{n}{\sigma_\varepsilon^2} \left( \overline{\mathrm{err}} + \log(n) \frac{d}{n} \sigma_\varepsilon^2 \right)$. Note BIC is proportional to AIC with the factor 2 replaced by $\log(n)$, so BIC penalizes complex models more heavily than AIC.
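A sketch comparing the two penalties on the same synthetic polynomial data as the AIC example above (assumed known noise variance, an illustrative simplification):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 60
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - x ** 2 + rng.normal(0, 0.4, n)
sigma2 = 0.4 ** 2                       # assume the noise variance is known

for m in range(1, 7):
    resid = y - np.polyval(np.polyfit(x, y, m), x)
    err = np.mean(resid ** 2)
    d = m + 1                           # number of parameters of a degree-m fit
    aic = err + 2 * d / n * sigma2
    bic = n / sigma2 * (err + np.log(n) * d / n * sigma2)
    print(f"degree {m}: AIC = {aic:.4f}   BIC = {bic:.1f}")
```

Both criteria typically select the true degree here, but the BIC penalty grows with $\log(n)$ rather than 2 per parameter, so it favors smaller models as $n$ grows.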

59 Derivation of BIC: starting point. Have a set of candidate models $\{\mathcal{M}_1, \ldots, \mathcal{M}_M\}$ with corresponding parameters $\theta_1, \ldots, \theta_M$. Goal: choose the best model. How: have training data $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and priors $p(\theta_m \mid \mathcal{M}_m)$. The posterior of model $\mathcal{M}_m$ is $P(\mathcal{M}_m \mid Z) \propto P(\mathcal{M}_m)\, p(Z \mid \mathcal{M}_m) \propto P(\mathcal{M}_m) \int p(Z \mid \theta_m, \mathcal{M}_m)\, p(\theta_m \mid \mathcal{M}_m)\, d\theta_m$.

62 Derivation of BIC. The posterior of model $\mathcal{M}_m$ is $P(\mathcal{M}_m \mid Z) \propto P(\mathcal{M}_m) \int p(Z \mid \theta_m, \mathcal{M}_m)\, p(\theta_m \mid \mathcal{M}_m)\, d\theta_m$. Usually we assume a uniform prior: $P(\mathcal{M}_m) = 1/M$. Approximating the integral by simplification and a Laplace approximation gives $\log p(Z \mid \mathcal{M}_m) = \log p(Z \mid \hat{\theta}_m, \mathcal{M}_m) - \frac{d_m}{2} \log n + O(1)$, where $\hat{\theta}_m$ is the MLE and $d_m$ is the number of free parameters in $\mathcal{M}_m$. Then $\mathrm{BIC} \approx -2 \log p(Z \mid \mathcal{M}_m)$, so choosing the model with minimum BIC corresponds to choosing the model with (approximately) the largest posterior probability.

66 AIC vs BIC. If $\mathcal{M}_{\text{true}} \in \{\mathcal{M}_1, \ldots, \mathcal{M}_M\}$ then, as $n \to \infty$, BIC will select $\mathcal{M}_{\text{true}}$ while AIC will not: AIC tends to choose models which are too complex as $n \to \infty$. However, when $n$ is small, BIC often chooses models which are too simple.

68 Cross-Validation

69 K-Fold Cross-Validation

70 K-Fold Cross-Validation: general approach. Split the data into $K$ roughly equal-sized parts: Train | Train | Validation | Train | Train. For the $k$th part, calculate the prediction error of the model fit using the other $K - 1$ parts. Do this for $k = 1, 2, \ldots, K$ and combine the $K$ estimates of the prediction error. When and why: it is applied when labelled training data is relatively sparse, and the method directly estimates $\mathrm{Err} = E[L(Y, \hat{f}(X))]$.

72 K-fold cross-validation: detailed description. The mapping $\kappa : \{1, \ldots, n\} \to \{1, \ldots, K\}$ indicates that observation $i$ belongs to partition $\kappa(i)$. $\hat{f}^{-k}(x)$ is the function fitted with the $k$th part of the data removed. The cross-validation estimate of the prediction error is $\mathrm{CV}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^{-\kappa(i)}(x_i))$. Typical choices for $K$ are 5 or 10. The case $K = n$ is known as leave-one-out cross-validation.
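A from-scratch sketch of this estimate (the degree-3 polynomial model and the data are illustrative assumptions, not from the slides); $\kappa$ is built by shuffling balanced fold labels:

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 100, 10
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)

# kappa[i] = fold that observation i belongs to
kappa = rng.permutation(np.arange(n) % K)

cv_losses = np.empty(n)
for k in range(K):
    train, val = kappa != k, kappa == k
    coef = np.polyfit(x[train], y[train], 3)   # f^{-k}: fit without fold k
    cv_losses[val] = (y[val] - np.polyval(coef, x[val])) ** 2

# CV(f_hat) = (1/n) sum_i L(y_i, f^{-kappa(i)}(x_i))
print("CV(f_hat) =", cv_losses.mean())
```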

75 K-fold cross-validation: model selection. Have models $f(x, \alpha)$ indexed by a parameter $\alpha$. $\hat{f}^{-k}(x, \alpha)$ is the $\alpha$th model fit with the $k$th part of the data removed. Then define $\mathrm{CV}(\hat{f}, \alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha))$ and choose the model $\hat{\alpha} = \arg\min_\alpha \mathrm{CV}(\hat{f}, \alpha)$.

77 What quantity does K-fold cross-validation estimate? Intuition says: when $K = 5$ or $10$, $\mathrm{CV}(\hat{f}) \approx \mathrm{Err}$, as the training sets for each fold are fairly different; when $K = n$, $\mathrm{CV}(\hat{f}) \approx \mathrm{Err}_{\mathcal{T}}$, as the training sets for each fold are almost identical. The book's simulation experiments say: cross-validation really only effectively estimates $\mathrm{Err}$.

79 What quantity does K-fold cross-validation estimate? FIGURE 7.14. Conditional prediction error $\mathrm{Err}_{\mathcal{T}}$, 10-fold cross-validation, and leave-one-out cross-validation curves for 100 simulations from the top-right panel of Figure 7.3. The thick red curve is the expected prediction error $\mathrm{Err}$, while the thick black curves are the expected CV curves $E_{\mathcal{T}}[\mathrm{CV}_{10}]$ and $E_{\mathcal{T}}[\mathrm{CV}_N]$. The lower-right panel shows the mean absolute deviation of the CV curves from the conditional error, $E_{\mathcal{T}} |\mathrm{CV}_K - \mathrm{Err}_{\mathcal{T}}|$, for $K = 10$ (blue) and $K = N$ (green).

80 What value of K? When $K = n$: $\mathrm{CV}(\hat{f})$ is approximately an unbiased estimate of $\mathrm{Err}$, but it has high variance because the $n$ training sets are very similar, and the computational burden is high (except for a few special cases). When $K = 5$ (lowish): $\mathrm{CV}(\hat{f})$ has low variance, but it is potentially an upward-biased estimate of $\mathrm{Err}$; this bias only occurs if at each fold there is not enough training data to fit a good model.

81 Example of a K-fold cross-validation curve. FIGURE 7.9. Prediction error $\mathrm{Err}_{\mathcal{T}}$ (orange) and tenfold cross-validation curve $\mathrm{CV}_{10}(\hat{f})$ (blue) estimated from a single training set, from the scenario in the bottom-right panel of Figure 7.3 (a linear model with best subsets regression of subset size $p$). Standard error bars are shown, which are the standard errors of the individual misclassification errors.

82 Right & Wrong way to do Cross-validation

83 Classification problem with a large number of predictors. What's wrong with this strategy? (1) Screen the predictors: find a subset of good predictors that are correlated with the class labels. (2) Build a classifier based on the subset of good predictors. (3) Perform cross-validation to estimate the unknown tuning parameters and to estimate $\mathrm{Err}$ of the final model. The problem: the good predictors were chosen after seeing all the data, so no cross-validation fold is ever truly unseen.

87 Should have done this: (1) Divide the samples into $K$ groups randomly. (2) For each fold $k = 1, \ldots, K$: find a subset of good predictors using all the samples minus the $k$th fold; build a classifier using all the samples minus the $k$th fold; use the classifier to predict the labels for the samples in the $k$th fold.
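A sketch contrasting the two protocols on pure-noise data (the correlation-based screening rule and 1-NN classifier mirror the example on the next slide; the dimensions are scaled down so it runs quickly, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, s, K = 50, 1000, 20, 5
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)          # labels independent of X: true error rate 50%

def one_nn_error(Xtr, ytr, Xte, yte):
    # misclassification rate of a 1-nearest-neighbour classifier
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return (ytr[d.argmin(axis=1)] != yte).mean()

def screen(Xs, ys, s):
    # indices of the s predictors most correlated with the labels
    c = np.abs(np.corrcoef(Xs.T, ys)[-1, :-1])
    return np.argsort(c)[-s:]

folds = rng.permutation(np.arange(n) % K)

# Wrong way: screen once on ALL the data, then cross-validate
keep = screen(X, y, s)
wrong = np.mean([one_nn_error(X[folds != k][:, keep], y[folds != k],
                              X[folds == k][:, keep], y[folds == k])
                 for k in range(K)])

# Right way: redo the screening inside each training fold
errs = []
for k in range(K):
    tr, te = folds != k, folds == k
    keep_k = screen(X[tr], y[tr], s)   # screening never sees fold k
    errs.append(one_nn_error(X[tr][:, keep_k], y[tr], X[te][:, keep_k], y[te]))

print(f"wrong-way CV error: {wrong:.2f}   right-way CV error: {np.mean(errs):.2f}")
```

The wrong-way estimate comes out far below 50% even though the predictors carry no signal; the right-way estimate hovers near 50%.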

88 Example set-up. Have a binary classification problem with $n = 50$ and an equal number of points from each class. Have $p = 5000$ quantitative predictors that are independent of the class labels, so the true error rate of any classifier is 50%. If one performs pre-selection of 100 predictors using all the data and then builds a 1-NN classifier, the average CV error rate was 3% over 50 simulations!

89 Example: correlation of class labels with predictors. FIGURE 7.10. Cross-validation the wrong and right way: histograms show the correlation of class labels, in 10 randomly chosen samples, with the 100 predictors chosen using the incorrect (upper, red) and correct (lower, green) versions of cross-validation.

90 To perform multistep modelling: cross-validation must be applied to the entire sequence of modelling steps, and samples must be left out before any selection or filtering step which uses the labels. One exception: an unsupervised screening step can use all the samples.

92 Bootstrap Method

93 The Bootstrap. Have a training set $Z = (z_1, z_2, \ldots, z_n)$ with each $z_i = (x_i, y_i)$. The bootstrap idea: for $b = 1, 2, \ldots, B$: (1) randomly draw $n$ samples with replacement from $Z$ to get $Z^{*b} = (z_{b_1}, z_{b_2}, \ldots, z_{b_n})$ with $b_i \in \{1, \ldots, n\}$; (2) refit the model using $Z^{*b}$ to get $S(Z^{*b})$. Then examine the behaviour of the $B$ fits $S(Z^{*1}), S(Z^{*2}), \ldots, S(Z^{*B})$.

95 We can estimate any aspect of the distribution of $S(Z)$, for example its variance: $\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B - 1} \sum_{b=1}^{B} \left( S(Z^{*b}) - \bar{S}^* \right)^2$, where $\bar{S}^* = \frac{1}{B} \sum_{b=1}^{B} S(Z^{*b})$.

96 Use the bootstrap to estimate prediction error. Attempt 1: $\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B} \frac{1}{n} \sum_{b=1}^{B} \sum_{i=1}^{n} L(y_i, \hat{f}^{*b}(x_i))$, where $\hat{f}^{*b}(x_i)$ is the predicted value at $x_i$ using the model computed from $Z^{*b}$. Why is this not a good estimate? Overlap between training and test sets. How could we do better? Mimic cross-validation.

100 Use the bootstrap to estimate prediction error. Attempt 2: the leave-one-out bootstrap, $\widehat{\mathrm{Err}}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i))$, where $C^{-i}$ is the set of bootstrap samples $b$ not containing observation $i$. Either make $B$ large enough so that $|C^{-i}| > 0$ for all $i$, or omit observation $i$ from the average if $|C^{-i}| = 0$.
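A sketch of the leave-one-out bootstrap estimate (the degree-3 polynomial model and the data are illustrative assumptions, as in the cross-validation sketch above):

```python
import numpy as np

rng = np.random.default_rng(9)
n, B = 80, 200
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)

losses = [[] for _ in range(n)]        # losses[i]: L(y_i, f^{*b}(x_i)) for b in C^{-i}
for b in range(B):
    idx = rng.integers(0, n, n)        # bootstrap sample: n indices with replacement
    coef = np.polyfit(x[idx], y[idx], 3)
    for i in set(range(n)) - set(idx): # observations NOT in this bootstrap sample
        losses[i].append((y[i] - np.polyval(coef, x[i])) ** 2)

# Average over C^{-i}, omitting any i that appeared in every bootstrap sample
err1 = np.mean([np.mean(l) for l in losses if l])
print("leave-one-out bootstrap estimate:", err1)
```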

102 Is the leave-one-out prediction any good? Pros: (1) it avoids the overfitting problem of $\widehat{\mathrm{Err}}_{\mathrm{boot}}$. Cons: (1) it has the training-set-size bias of cross-validation; (2) $P(\text{observation } i \in Z^{*b}) = 1 - \left(1 - \frac{1}{n}\right)^n \approx 1 - e^{-1} = .632$, so the average number of distinct observations in $Z^{*b}$ is about $.632\, n$; (3) $\widehat{\mathrm{Err}}^{(1)}$'s bias is thus similar to that of twofold cross-validation.

103 Attempt 3: the .632 estimator, which alleviates this bias: $\widehat{\mathrm{Err}}^{(.632)} = .368\, \overline{\mathrm{err}} + .632\, \widehat{\mathrm{Err}}^{(1)}$. A compromise between the training error $\overline{\mathrm{err}}$ and the leave-one-out bootstrap estimate. Its derivation is not easy; obviously the constant .632 relates to $P(\text{observation } i \in Z^{*b})$. The .632 estimator does not do well if the predictor overfits.

105 Estimate the degree of overfitting. No-information error rate: $\hat{\gamma} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} L(y_i, \hat{f}(x_j))$, an estimate of the error rate of $\hat{f}$ if inputs and outputs were independent. Note the prediction rule $\hat{f}$ is evaluated on all possible combinations of targets $y_i$ and predictors $x_j$.

107 Estimate the degree of overfitting. Relative overfitting rate: $\hat{R} = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}}$, with $0 \leq \hat{R} \leq 1$. $\hat{R} = 0$ implies no overfitting; $\hat{R} = 1$ implies the overfitting equals the no-information value $\hat{\gamma} - \overline{\mathrm{err}}$.

108 Use the bootstrap to estimate prediction error. Attempt 4: the .632+ estimator, $\widehat{\mathrm{Err}}^{(.632+)} = (1 - \hat{w})\, \overline{\mathrm{err}} + \hat{w}\, \widehat{\mathrm{Err}}^{(1)}$ with $\hat{w} = \frac{.632}{1 - .368\, \hat{R}}$. As $\hat{R}$ ranges from 0 to 1, $\hat{w}$ ranges from .632 to 1, and $\widehat{\mathrm{Err}}^{(.632+)}$ ranges from $\widehat{\mathrm{Err}}^{(.632)}$ to $\widehat{\mathrm{Err}}^{(1)}$. $\widehat{\mathrm{Err}}^{(.632+)}$ is thus a compromise between $\overline{\mathrm{err}}$ and $\widehat{\mathrm{Err}}^{(1)}$ that depends on the amount of overfitting. The derivation of the above equation is non-trivial.
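A sketch assembling the pieces above into the .632+ estimate (squared-error regression setting and function names are illustrative assumptions; the `loo_boot_losses` argument is the `losses` structure from the leave-one-out bootstrap sketch):

```python
import numpy as np

def err632plus(y, yhat_train, loo_boot_losses):
    """Combine training error, leave-one-out bootstrap error, and the
    no-information rate into the .632+ estimate (squared-error loss).
    y:               targets, shape (n,)
    yhat_train:      predictions of the full-data fit at the training x_i
    loo_boot_losses: per-observation loss lists over the C^{-i} sets"""
    err_bar = np.mean((y - yhat_train) ** 2)                    # training error
    err1 = np.mean([np.mean(l) for l in loo_boot_losses if l])  # Err^(1)
    # gamma_hat = (1/n^2) sum_ij L(y_i, f_hat(x_j)): all target/prediction pairs
    gamma = np.mean((y[:, None] - yhat_train[None, :]) ** 2)
    R = np.clip((err1 - err_bar) / (gamma - err_bar), 0, 1)     # relative overfitting
    w = 0.632 / (1 - 0.368 * R)
    return (1 - w) * err_bar + w * err1
```

With `losses` from the leave-one-out bootstrap sketch and `yhat_train = np.polyval(np.polyfit(x, y, 3), x)`, calling `err632plus(y, yhat_train, losses)` returns the combined estimate.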

110 How bootstrap and CV perform w.r.t. model selection (Section 7.11, Bootstrap Methods). FIGURE 7.13. Boxplots show the distribution of the relative error $100 \cdot \frac{\mathrm{Err}(\hat{\alpha}) - \min_\alpha \mathrm{Err}(\alpha)}{\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)}$ over the four scenarios of Figure 7.3 (reg/knn, reg/linear, class/knn, class/linear), for cross-validation and the bootstrap. This is the error in using the chosen model relative to the best model, where $\hat{\alpha}$ is the parameter found via the selection method under investigation. There are 100 training sets represented in each boxplot.

Reference: Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Sections 7.1-7.9: Model Assessment and Selection.

More information

COS 424: Interacting with Data. Lecturer: Rob Schapire Lecture #15 Scribe: Haipeng Zheng April 5, 2007

COS 424: Interacting with Data. Lecturer: Rob Schapire Lecture #15 Scribe: Haipeng Zheng April 5, 2007 COS 424: Interacting ith Data Lecturer: Rob Schapire Lecture #15 Scribe: Haipeng Zheng April 5, 2007 Recapitulation of Last Lecture In linear regression, e need to avoid adding too much richness to the

More information

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II) Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Resampling techniques for statistical modeling

Resampling techniques for statistical modeling Resampling techniques for statistical modeling Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Resampling techniques p.1/33 Beyond the empirical error

More information

Proteomics and Variable Selection

Proteomics and Variable Selection Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

An Introduction to Statistical Machine Learning - Theoretical Aspects -

An Introduction to Statistical Machine Learning - Theoretical Aspects - An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms

VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms 03/Feb/2010 VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms Presented by Andriy Temko Department of Electrical and Electronic Engineering Page 2 of

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Algorithm Independent Topics Lecture 6

Algorithm Independent Topics Lecture 6 Algorithm Independent Topics Lecture 6 Jason Corso SUNY at Buffalo Feb. 23 2009 J. Corso (SUNY at Buffalo) Algorithm Independent Topics Lecture 6 Feb. 23 2009 1 / 45 Introduction Now that we ve built an

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Chapter 6 October 18, 2016 Chapter 6 October 18, 2016 1 / 80 1 Subset selection 2 Shrinkage methods 3 Dimension reduction methods (using derived inputs) 4 High

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Minimum Description Length (MDL)

Minimum Description Length (MDL) Minimum Description Length (MDL) Lyle Ungar AIC Akaike Information Criterion BIC Bayesian Information Criterion RIC Risk Inflation Criterion MDL u Sender and receiver both know X u Want to send y using

More information

Ensemble Methods and Random Forests

Ensemble Methods and Random Forests Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Linear Methods for Prediction

Linear Methods for Prediction This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

ESS2222. Lecture 4 Linear model

ESS2222. Lecture 4 Linear model ESS2222 Lecture 4 Linear model Hosein Shahnas University of Toronto, Department of Earth Sciences, 1 Outline Logistic Regression Predicting Continuous Target Variables Support Vector Machine (Some Details)

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

CSE446: Linear Regression Regulariza5on Bias / Variance Tradeoff Winter 2015

CSE446: Linear Regression Regulariza5on Bias / Variance Tradeoff Winter 2015 CSE446: Linear Regression Regulariza5on Bias / Variance Tradeoff Winter 2015 Luke ZeElemoyer Slides adapted from Carlos Guestrin Predic5on of con5nuous variables Billionaire says: Wait, that s not what

More information