Learning From Data, Lecture 13
Validation and Model Selection

  - The Validation Set
  - Model Selection
  - Cross Validation

M. Magdon-Ismail, CSCI 4100/6100
Recap: Regularization

Regularization combats the effects of noise by putting a leash on the algorithm:

    E_aug(h) = E_in(h) + (λ/N) Ω(h)

Ω(h) favors smooth, simple h; noise is rough and complex. Different regularizers
give different results, and we can choose λ, the amount of regularization.

[Figure: fits to the same data and target for λ = 0, 0.0001, 0.01, 1, ranging
from overfitting (λ = 0) to underfitting (λ = 1).]

The optimal λ balances approximation and generalization, bias and variance.

© AᴹL  Creator: Malik Magdon-Ismail
Validation: A Sneak Peek at E_out

    E_out(g) = E_in(g) + overfit penalty

Three ways to handle the overfit penalty:

  - VC theory bounds it using a complexity error bar for H.
  - Regularization estimates it through a heuristic complexity penalty for g.
  - Validation goes directly for the jugular: it estimates the penalty directly.

An in-sample estimate of E_out is the Holy Grail of learning from data.
The Test Set

Learn g from the data D (N data points) and evaluate it on a separate test set
D_test (K test points):

    e_k = e(g(x_k), y_k),    E_test = (1/K) Σ_{k=1}^K e_k

E_test is an estimate for E_out(g). Since E_{D_test}[e_k] = E_out(g),

    E[E_test] = (1/K) Σ_{k=1}^K E[e_k] = (1/K) Σ_{k=1}^K E_out(g) = E_out(g).

Because e_1, …, e_K are independent,

    Var[E_test] = (1/K²) Σ_{k=1}^K Var[e_k] = (1/K) Var[e]   ← decreases like 1/K

Bigger K = more reliable E_test.
The Validation Set

1. Remove K points from D (N data points): D = D_train ∪ D_val, with D_train
   holding N − K training points and D_val holding K validation points.
2. Learn using D_train only, producing g⁻.
3. Test g⁻ on D_val:  e_k = e(g⁻(x_k), y_k),  E_val = (1/K) Σ_{k=1}^K e_k.
4. Use the error E_val to estimate E_out(g⁻).
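A concrete sketch of steps 1 to 4 (toy data of my own, not from the slides): a least-squares line fit with K = N/5 points held out for validation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 2x + noise, N = 100 points.
N, K = 100, 20                      # K = N/5, the rule of thumb in this lecture
X = rng.uniform(-1, 1, N)
y = 2 * X + 0.3 * rng.standard_normal(N)
Z = np.c_[np.ones(N), X]            # add a bias column

# 1. Remove K points from D: D = D_train ∪ D_val.
idx = rng.permutation(N)
train, val = idx[K:], idx[:K]

# 2. Learn g⁻ using D_train only (least-squares fit).
w = np.linalg.lstsq(Z[train], y[train], rcond=None)[0]

# 3. Test g⁻ on D_val.
E_val = np.mean((Z[val] @ w - y[val]) ** 2)

# 4. E_val estimates E_out(g⁻); here it should be near the noise variance 0.09.
print(f"E_val = {E_val:.3f}")
```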
The Validation Set: Reliability of E_val

E_val is an estimate for E_out(g⁻). Since E_{D_val}[e_k] = E_out(g⁻),

    E[E_val] = (1/K) Σ_{k=1}^K E[e_k] = (1/K) Σ_{k=1}^K E_out(g⁻) = E_out(g⁻).

Because e_1, …, e_K are independent,

    Var[E_val] = (1/K²) Σ_{k=1}^K Var[e_k] = (1/K) Var[e(g⁻)]
                                               ← decreases like 1/K;
                                                 depends on g⁻, not H

Bigger K = more reliable E_val?
Choosing K

[Figure: expected E_val versus size of validation set K (10 to 30).]

Rule of thumb: K = N/5.
Restoring D

After validation, retrain on all of D (N points) to get g; the hypothesis g⁻,
trained on D_train (N − K points), stays behind closed doors.

Primary goal: output the best hypothesis. Give the customer g, which was
trained on all the data.

Secondary goal: estimate E_out(g). Which should we report: E_in(g) or E_val(g⁻)?
E_val Versus E_in

    E_out(g) ≤ E_in(g) + O(√((d_vc/N) ln N))        ← biased; error bar depends on H

    E_out(g) ≤ E_out(g⁻) ≤ E_val(g⁻) + O(1/√K)      ← unbiased; error bar depends on g⁻

The first step, E_out(g) ≤ E_out(g⁻), uses the fact that the learning curve is
decreasing (a practical truth, not a theorem).

E_val(g⁻) usually wins as an estimate for E_out(g), especially when the learning
curve is not steep.
Model Selection

The most important use of validation.

Given M models H_1, H_2, …, H_M, train each on D_train to get the finalists
g⁻_1, g⁻_2, …, g⁻_M, then evaluate each finalist on D_val.
Validation Estimate for (H_1, g⁻_1)

Evaluate the first finalist g⁻_1 on D_val to get E_val(g⁻_1).
Validation Estimate for (H_1, g⁻_1)

Call it E_1:  E_1 = E_val(g⁻_1).
Compute Validation Estimates for All Models

Evaluating every finalist g⁻_1, g⁻_2, …, g⁻_M on D_val gives E_1, E_2, …, E_M.
Pick the Best Model According to Validation Error

Select the model with the smallest validation error:  m* = argmin_m E_m.
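A minimal sketch of this selection loop (toy data and model set of my own choosing): the models H_1, …, H_8 are polynomials of degree 1 through 8, and the target is a noisy quadratic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical target: quadratic plus noise.
N, K = 60, 12
x = rng.uniform(-1, 1, N)
y = 1 - 2 * x + 3 * x**2 + 0.2 * rng.standard_normal(N)

idx = rng.permutation(N)
tr, va = idx[K:], idx[:K]          # D_train and D_val

# Models H_1 … H_8 = polynomials of degree 1 … 8; pick by validation error.
E_val = {}
for deg in range(1, 9):
    w = np.polyfit(x[tr], y[tr], deg)               # g⁻_m, trained on D_train
    E_val[deg] = np.mean((np.polyval(w, x[va]) - y[va]) ** 2)

m_star = min(E_val, key=E_val.get)                  # m* = argmin_m E_m
print("chosen degree:", m_star, " E_val:", round(E_val[m_star], 3))
```

Degree 1 underfits badly, so validation should steer the choice to degree 2 or a close neighbor.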
E_val(g⁻_{m*}) Is Not Unbiased for E_out(g⁻_{m*})

[Figure: expected error versus validation set size K (5 to 25); the curve for
E_out(g⁻_{m*}) lies above the curve for E_val(g⁻_{m*}).]

The estimate is optimistically biased because we choose one of the M finalists:

    E_out(g⁻_{m*}) ≤ E_val(g⁻_{m*}) + O(√(ln M / K))

This is the VC error bar for selecting a hypothesis from M using a data set of
size K.
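The optimism from picking the minimum is easy to see numerically (a toy illustration of mine, not from the lecture): M coin-flip finalists, all with the same true error 0.5, estimated on K validation points.

```python
import numpy as np

rng = np.random.default_rng(7)

# M finalists, all with true error E_out = 0.5; each E_val(g⁻_m) is an unbiased
# average of K Bernoulli(0.5) errors. Taking the minimum over the M finalists
# is what introduces the optimistic bias.
M, K, trials = 10, 25, 20000
E_vals = rng.binomial(1, 0.5, size=(trials, M, K)).mean(axis=2)

picked = E_vals.min(axis=1)          # E_val(g⁻_{m*}) in each trial
print(f"E[E_val(g_m*)] = {picked.mean():.3f}  <  E_out = 0.5")
```

Each individual E_val averages to 0.5, yet the selected minimum averages well below it, exactly the bias the slide describes.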
Restoring D

After selecting m*, retrain H_{m*} on all of D to output g_{m*}.

Leap of faith: the model with the best g⁻ also has the best g, so we can find
the model with the best g using validation. This is true modulo the E_val
error bar.
Comparing E_in and E_val for Model Selection

Procedure: train g⁻_1, …, g⁻_M on D_train; compute E_1, …, E_M on D_val; pick the
best (H_{m*}, E_{m*}); then retrain on all of D to output g_{m*}.

[Figure: expected E_out versus validation set size K (5 to 25) for in-sample
selection (output g_{m̂}), validation selection outputting g⁻_{m*}, validation
selection outputting g_{m*}, and the optimal choice.]
Application to Selecting λ

Which regularization parameter should we use: λ_1, λ_2, …, λ_M?

This is a special case of model selection over M models,

    (H, λ_1), (H, λ_2), …, (H, λ_M)  →  g⁻_1, g⁻_2, …, g⁻_M.

Picking a model amounts to choosing the optimal λ.
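A sketch under my own assumptions (random linear data, squared error): choose λ for weight decay by validation, then retrain on all of D with the winner.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: ridge regression ("weight decay") on noisy linear data.
N, K, d = 50, 10, 5
Z = rng.standard_normal((N, d))
y = Z @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)

idx = rng.permutation(N)
tr, va = idx[K:], idx[:K]

def ridge(Z, y, lam):
    """w_reg = (ZᵗZ + λI)⁻¹ Zᵗy"""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]     # the M candidate models (H, λ_m)
E_val = []
for lam in lambdas:
    w = ridge(Z[tr], y[tr], lam)          # g⁻_m, trained on D_train
    E_val.append(np.mean((Z[va] @ w - y[va]) ** 2))

lam_star = lambdas[int(np.argmin(E_val))]
w_final = ridge(Z, y, lam_star)           # restore D: retrain on everything
print("chosen λ:", lam_star)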
The Dilemma When Choosing K

Validation relies on the following chain of reasoning:

    E_out(g)  ≈  E_out(g⁻)  ≈  E_val(g⁻)
           (small K)      (large K)

The first link wants K small, so that g⁻ is trained on almost all the data;
the second wants K large, so that E_val is reliable. We cannot have both.
Can We Get Away With K = 1?

Yes, almost!
The Leave-One-Out Error (K = 1)

Remove a single point from D, train on the remaining N − 1 points to get g_1,
and evaluate on the point left out:

    e_1 = e(g_1(x_1), y_1),    E[e_1] = E_out(g_1)

…but a single-point estimate is a wild estimate.
The Leave-One-Out Errors

Repeat for every point: g_n is trained on D with (x_n, y_n) removed, and
e_n = e(g_n(x_n), y_n). Averaging the N wild estimates gives the cross
validation error:

    E_cv = (1/N) Σ_{n=1}^N e_n
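A direct implementation of E_cv (toy data of my own): N separate least-squares fits, each leaving out one point.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: fit a line to noisy linear data.
N = 30
x = rng.uniform(-1, 1, N)
y = 2 * x + 1 + 0.3 * rng.standard_normal(N)
Z = np.c_[np.ones(N), x]

# Leave-one-out: train g_n on the N-1 other points, test on point n.
e = np.empty(N)
for n in range(N):
    mask = np.arange(N) != n
    w = np.linalg.lstsq(Z[mask], y[mask], rcond=None)[0]
    e[n] = (Z[n] @ w - y[n]) ** 2

E_cv = e.mean()           # average of N wild single-point estimates
print(f"E_cv = {E_cv:.3f}")
```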
Cross Validation Is Unbiased

Theorem. E_cv is an unbiased estimate of Ē_out(N − 1), the expected E_out when
learning with N − 1 points.
Reliability of E_cv

e_n and e_m are not independent: e_n depends on g_n, which was trained on
(x_m, y_m), and e_m is evaluated on (x_m, y_m).

Nevertheless, the dependence is weak, and E_cv behaves as if it were computed
from an effective number of fresh examples close to N, giving a comparable
estimate of E_out.
Cross Validation Is Computationally Intensive

N rounds of learning, each on a data set of size N − 1. Two remedies:

1. Analytic shortcuts. For linear regression with weight decay,

       w_reg = (ZᵗZ + λI)⁻¹ Zᵗy,    H(λ) = Z(ZᵗZ + λI)⁻¹ Zᵗ,

   a single fit yields the full leave-one-out error:

       E_cv = (1/N) Σ_{n=1}^N ( (ŷ_n − y_n) / (1 − H_nn(λ)) )²

2. 10-fold cross validation. Partition D into D_1, D_2, …, D_10; train on nine
   parts and validate on the tenth, rotating the validation part through all
   ten.
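The analytic shortcut can be checked against the brute-force loop (same kind of toy data as before; λ = 0.1 is my own choice). For weight decay the hat-matrix formula is exact, so the two computations agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: noisy linear target, weight decay with λ = 0.1.
N, lam = 30, 0.1
x = rng.uniform(-1, 1, N)
y = 2 * x + 1 + 0.3 * rng.standard_normal(N)
Z = np.c_[np.ones(N), x]

# One fit gives the whole leave-one-out error via the hat matrix H(λ).
H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T)
y_hat = H @ y
E_cv_analytic = np.mean(((y_hat - y) / (1 - np.diag(H))) ** 2)

# Brute-force check: N separate fits, each leaving one point out.
e = np.empty(N)
for n in range(N):
    m = np.arange(N) != n
    w = np.linalg.solve(Z[m].T @ Z[m] + lam * np.eye(2), Z[m].T @ y[m])
    e[n] = (Z[n] @ w - y[n]) ** 2

assert np.isclose(E_cv_analytic, e.mean())   # the shortcut is exact
print(f"E_cv = {E_cv_analytic:.3f}")
```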
Restoring D

Each g_n is trained on D minus one point (x_n, y_n), producing the nearly
independent errors e_1, e_2, …, e_N; take their average E_cv. The customer gets
g, trained on all of D:

    E_out(g)  ≈  Ē_out(N − 1)  ≈  E_cv + O(1/√N)
       (learning curve)     (nearly independent e_n)

E_cv can be used for model selection just as E_val, for example to choose λ.
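A sketch of using E_cv for model selection (here 10-fold, as on the previous slide), choosing λ and then restoring D; the data and λ grid are my own.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical sketch: 10-fold cross validation to choose λ for weight decay.
N, d = 100, 5
Z = rng.standard_normal((N, d))
y = Z @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)

folds = np.array_split(rng.permutation(N), 10)   # D_1, …, D_10

def ridge(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def E_cv(lam):
    """Train on nine parts, validate on the tenth, rotating through all ten."""
    errs = []
    for f in folds:
        tr = np.setdiff1d(np.arange(N), f)
        w = ridge(Z[tr], y[tr], lam)
        errs.append(np.mean((Z[f] @ w - y[f]) ** 2))
    return np.mean(errs)

lambdas = [0.01, 0.1, 1.0, 10.0]
lam_star = min(lambdas, key=E_cv)
w_final = ridge(Z, y, lam_star)       # restore D: retrain on everything
print("chosen λ:", lam_star)
```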
Digits Problem: "1" Versus "Not 1"

Raw features: average intensity and symmetry, x = (1, x_1, x_2). Apply the 5th
order polynomial transform,

    z = (1, x_1, x_2, x_1², x_1x_2, x_2², x_1³, x_1²x_2, x_1x_2², x_2³, …,
         x_1⁵, x_1⁴x_2, x_1³x_2², x_1²x_2³, x_1x_2⁴, x_2⁵),

a 20-dimensional nonlinear feature space.

[Figure: the digits data ("1" versus "not 1") in the (average intensity,
symmetry) plane, and classification errors E_in, E_cv, E_out (roughly 0.01 to
0.03) versus the number of features used (5 to 20).]
Validation Wins in the Real World

[Figure: decision boundaries in the (average intensity, symmetry) plane for
both classifiers.]

    no validation (20 features):     E_in = 0%,    E_out = 2.5%
    cross validation (6 features):   E_in = 0.8%,  E_out = 1.5%