Empirical Risk Minimization, Model Selection, and Model Assessment
CS6780 Advanced Machine Learning, Spring 2015
Thorsten Joachims, Cornell University
Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1; Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924. (http://sci2s.ugr.es/keel/pdf/algorithm/articulo/dietterich1998.pdf)
Learning as Prediction: Batch Learning
[figure: real-world process P(X,Y) generates a training sample S_train = (x_1,y_1), ..., (x_n,y_n) for the learner, which outputs h; a test sample S_test = (x_{n+1},y_{n+1}), ... is drawn from the same process]
Definition: A particular instance of a learning problem is described by a probability distribution P(X,Y).
Definition: Any example (X_i, Y_i) is a random variable that is independently and identically distributed according to P(X,Y).
Training / Validation / Test Sample
Definition: A training / validation / test sample S = ((x_1, y_1), ..., (x_n, y_n)) is drawn i.i.d. from P(X,Y):
P(S) = P((x_1, y_1), ..., (x_n, y_n)) = ∏_{i=1}^{n} P(X_i = x_i, Y_i = y_i)
Risk
Definition: The Risk / Prediction Error / True Error / Generalization Error of a hypothesis h for a learning task P(X,Y) is
Err_P(h) = Σ_{(x,y)} Δ(y, h(x)) · P(X = x, Y = y)
Definition: The Loss Function Δ(y, ŷ) ∈ ℝ measures the quality of prediction ŷ if the true label is y.
Empirical Risk
Definition: The Empirical Risk / Error of hypothesis h on sample S = ((x_1, y_1), ..., (x_n, y_n)) is
Err_S(h) = (1/n) Σ_{i=1}^{n} Δ(y_i, h(x_i))
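For concreteness, a minimal sketch of computing the empirical risk under 0/1 loss for a fixed hypothesis h on a labeled sample (the helper names and the toy hypothesis are illustrative, not from the slides):

```python
# Sketch: empirical risk Err_S(h) under 0/1 loss for a fixed hypothesis h.
def zero_one_loss(y_true, y_pred):
    """Delta(y, y_hat): 1 if the prediction is wrong, 0 otherwise."""
    return 0.0 if y_true == y_pred else 1.0

def empirical_risk(h, sample, loss=zero_one_loss):
    """Err_S(h) = (1/n) * sum_i Delta(y_i, h(x_i)) over sample S = [(x_i, y_i)]."""
    return sum(loss(y, h(x)) for x, y in sample) / len(sample)

# Toy example: threshold classifier on 1-d inputs.
S = [(0.2, -1), (0.6, +1), (0.8, +1), (0.3, -1), (0.7, -1)]
h = lambda x: +1 if x > 0.5 else -1
print(empirical_risk(h, S))  # 0.2 -> one of five examples is misclassified
```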
Learning as Prediction: Overview
[figure: train sample S_train = (x_1,y_1), ..., (x_n,y_n) and test sample S_test = (x_{n+1},y_{n+1}), ... drawn i.i.d. from the real-world process P(X,Y); the learner maps S_train to h]
Goal: Find h with small prediction error Err_P(h) with respect to P(X,Y).
Training Error: Error Err_{S_train}(h) on the training sample.
Test Error: Error Err_{S_test}(h) on the test sample is an estimate of Err_P(h).
Bayes Risk
Given knowledge of P(X,Y), the true error of the best possible hypothesis h_bayes is
Err_P(h_bayes) = E_{x~P(X)}[ min_{y∈Y} (1 − P(Y = y | X = x)) ]
for the 0/1 loss.
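As a sanity check, a small sketch that evaluates the Bayes risk exactly for a toy discrete P(X,Y) given as a joint probability table (the table values are made up purely for illustration):

```python
# Sketch: Bayes risk for 0/1 loss, given a full joint distribution P(X, Y)
# over finitely many values. P is a dict mapping (x, y) -> probability.
from collections import defaultdict

P = {("sunny", +1): 0.30, ("sunny", -1): 0.10,   # illustrative numbers only
     ("rainy", +1): 0.15, ("rainy", -1): 0.45}

def bayes_risk(P):
    """Err_P(h_bayes) = sum_x P(X=x) * min_y (1 - P(Y=y | X=x))."""
    p_x = defaultdict(float)
    for (x, _), p in P.items():
        p_x[x] += p
    risk = 0.0
    for x, px in p_x.items():
        # Probability of the most likely label given x.
        p_best = max(p / px for (x2, _), p in P.items() if x2 == x)
        risk += px * (1.0 - p_best)
    return risk

print(bayes_risk(P))  # 0.25: predict +1 for sunny, -1 for rainy
```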
Three Roadmaps for Designing ML Methods
Generative Model: Learn P(X,Y) from the training sample.
Discriminative Conditional Model: Learn P(Y|X) from the training sample.
Discriminative ERM Model: Learn h directly from the training sample.
Empirical Risk Minimization
Definition [ERM Principle]: Given a training sample S = ((x_1, y_1), ..., (x_n, y_n)) and a hypothesis space H, select the rule h_ERM ∈ H that minimizes the empirical risk (i.e., training error) on S:
h_ERM = argmin_{h∈H} Σ_{i=1}^{n} Δ(y_i, h(x_i))
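A minimal sketch of the ERM principle over a small finite hypothesis space of 1-d threshold classifiers (the hypothesis space and data are illustrative; any finite H works the same way):

```python
# Sketch: ERM over a finite hypothesis space H of threshold rules h_t(x) = sign(x - t).
def make_threshold(t):
    return lambda x: +1 if x > t else -1

S = [(0.1, -1), (0.4, -1), (0.5, +1), (0.7, +1), (0.9, +1)]
H = [make_threshold(t) for t in (0.0, 0.2, 0.45, 0.6, 0.8)]

def training_error(h, sample):
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

# ERM: pick the hypothesis with minimal empirical risk on S.
h_erm = min(H, key=lambda h: training_error(h, S))
print(training_error(h_erm, S))  # 0.0 for the threshold at 0.45
```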
MODEL SELECTION
Overfitting
[figure: overfitting illustration from Mitchell]
Note: Accuracy = 1.0 − Error.
Decision Tree Example: revisited
[figure: decision tree splitting on attributes such as correct (complete/partial), original, presentation (clear/unclear), and guessing, with yes/no leaves]
Controlling Overfitting in Decision Trees
Early Stopping: Stop growing the tree and introduce a leaf when splitting is no longer reliable.
- Restrict size of tree (e.g., number of nodes, depth)
- Minimum number of examples in a node
- Threshold on splitting criterion
Post Pruning: Grow the full tree, then simplify it (see the sketch below).
- Reduced-error tree pruning
- Rule post-pruning
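A hedged sketch of both strategies using scikit-learn's DecisionTreeClassifier; note that scikit-learn implements early stopping via max_depth / min_samples_leaf and post-pruning via cost-complexity pruning (ccp_alpha) rather than the reduced-error or rule post-pruning variants named above, and the dataset and parameter values are placeholders:

```python
# Sketch: controlling decision-tree overfitting with scikit-learn.
# Note: sklearn offers cost-complexity pruning (ccp_alpha), not C4.5-style
# reduced-error or rule post-pruning; the idea of "grow, then simplify" is the same.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Early stopping: limit depth and require a minimum number of examples per leaf.
early_stop = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=0)
early_stop.fit(X_train, y_train)

# Post pruning: grow the full tree, then prune with cost-complexity parameter alpha.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned.fit(X_train, y_train)

for name, tree in [("early stopping", early_stop), ("post pruning", pruned)]:
    print(name, tree.tree_.node_count, "nodes, test accuracy", tree.score(X_test, y_test))
```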
Model Selection via Validation Sample
[figure: a sample drawn i.i.d. from the real-world process is split randomly into S_train and S_test; S_train is split again into a smaller training sample and a validation sample S_val; learners 1 through m are run on the training part and evaluated on S_val]
Training: Run the learning algorithm m times (e.g., with different parameters), producing ĥ_1, ..., ĥ_m.
Validation Error: The error Err_{S_val}(ĥ_i) is an estimate of Err_P(ĥ_i) for each ĥ_i.
Selection: Use the ĥ_i with minimal Err_{S_val}(ĥ_i) for prediction on the test examples (see the sketch below).
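A minimal sketch of this protocol, selecting a decision-tree depth on a held-out validation sample (the candidate depths and split sizes are arbitrary choices for illustration):

```python
# Sketch: model selection via a validation sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
# Split off the test sample first; it is used only once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# Split the remainder into a (smaller) training sample and a validation sample.
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

candidates = []
for depth in [1, 2, 4, 8, 16]:          # the m "runs" of the learner
    h = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    candidates.append((1.0 - h.score(X_val, y_val), depth, h))  # validation error

val_err, best_depth, h_best = min(candidates)
print("selected depth", best_depth, "with validation error", val_err)
print("test error of selected model:", 1.0 - h_best.score(X_test, y_test))
```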
Reduced-Error Pruning
Text Classification Example: Corporate Acquisitions Results
- Unpruned Tree (ID3 Algorithm): Size: 437 nodes, Training Error: 0.0%, Test Error: 11.0%
- Early Stopping Tree (ID3 Algorithm): Size: 299 nodes, Training Error: 2.6%, Test Error: 9.8%
- Reduced-Error Pruning (C4.5 Algorithm): Size: 167 nodes, Training Error: 4.0%, Test Error: 10.8%
- Rule Post-Pruning (C4.5 Algorithm): Size: 164 tests, Training Error: 3.1%, Test Error: 10.3%
Examples of rules:
- IF vs = 1 THEN − [99.4%]
- IF vs = 0 & export = 0 & takeover = 1 THEN + [93.6%]
MODEL ASSESSMENT
Evaluating Learned Hypotheses
[figure: a sample S drawn i.i.d. from the real-world process is split randomly into a training sample S_train = (x_1,y_1), ..., (x_n,y_n) and a test sample S_test = (x_1,y_1), ..., (x_k,y_k); the learner (incl. model selection) maps S_train to ĥ]
Goal: Find h with small prediction error Err_P(h) over P(X,Y).
Question: How good is Err_P(ĥ) for the ĥ found on the training sample S_train?
Training Error: Error Err_{S_train}(ĥ) on the training sample.
Test Error: Error Err_{S_test}(ĥ) is an estimate of Err_P(ĥ).
What is the True Error of a Hypothesis?
Given:
- Sample of labeled instances S
- Learning Algorithm A
Setup:
- Partition S randomly into S_train and S_test.
- Train learning algorithm A on S_train; the result is ĥ.
- Apply ĥ to S_test and compare its predictions against the true labels.
- The error on the test sample, Err_{S_test}(ĥ), is an estimate of the true error Err_P(ĥ).
- Compute a confidence interval (see the sketches below).
[figure: S split into S_train = (x_1,y_1), ..., (x_n,y_n) fed to the learner, producing ĥ, and S_test = (x_1,y_1), ..., (x_k,y_k) used for evaluation]
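A brief sketch of this protocol (the learning algorithm and split ratio are placeholder choices):

```python
# Sketch: estimating the true error of a learned hypothesis via a held-out test sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=2)

h = DecisionTreeClassifier(random_state=2).fit(X_train, y_train)  # learner A on S_train
test_error = 1.0 - h.score(X_test, y_test)                        # Err_{S_test}(h_hat)
print("test error (estimate of Err_P):", test_error)
```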
Binomial Distribution
The probability of observing x heads (i.e., errors) in a sample of n independent coin tosses (i.e., examples), where in each toss the probability of heads (i.e., making an error) is p, is
P(X = x) = (n choose x) · p^x · (1 − p)^(n−x)
Normal approximation: For n·p·(1−p) ≥ 5 the binomial can be approximated by the normal distribution with
Expected value: E(X) = n·p
Variance: Var(X) = n·p·(1−p)
With probability λ, the observation x falls in the interval E(X) ± z_λ · sqrt(Var(X)):
λ:   50%  68%  80%  90%  95%  98%  99%
z_λ: 0.67 1.00 1.28 1.64 1.96 2.33 2.58
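A small sketch that turns an observed error count on the test sample into a normal-approximation confidence interval, expressed for the error rate x/n rather than the raw count (the counts below are made up; a 95% interval uses z = 1.96):

```python
# Sketch: normal-approximation confidence interval for the true error Err_P(h),
# given x errors observed on n i.i.d. test examples.
import math

def error_confidence_interval(x, n, z=1.96):
    """Return (error_rate, lower, upper) using the normal approximation
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n); reasonable when n*p*(1-p) >= 5."""
    p_hat = x / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Example: 13 errors on 120 test examples, 95% confidence (z = 1.96).
print(error_confidence_interval(13, 120))  # roughly (0.108, 0.053, 0.164)
```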
Is Rule ĥ_1 More Accurate than ĥ_2?
Given:
- Sample of labeled instances S
- Learning Algorithms A_1 and A_2
Setup:
- Partition S randomly into S_train and S_test.
- Train learning algorithms A_1 and A_2 on S_train; the results are ĥ_1 and ĥ_2.
- Apply ĥ_1 and ĥ_2 to S_test and compute Err_{S_test}(ĥ_1) and Err_{S_test}(ĥ_2).
Test:
- Decide: is Err_P(ĥ_1) ≠ Err_P(ĥ_2)?
- Null Hypothesis: Err_{S_test}(ĥ_1) and Err_{S_test}(ĥ_2) come from binomial distributions with the same p.
- Binomial Sign Test (McNemar's Test) (see the sketch below)
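A sketch of the exact binomial sign test (McNemar's test) on the per-example disagreements between ĥ_1 and ĥ_2; the prediction arrays below are placeholders standing in for the two classifiers' outputs on S_test:

```python
# Sketch: binomial sign test (McNemar's test) for comparing two classifiers
# on the same test sample. Under the null hypothesis, each example on which
# the classifiers disagree is equally likely to favor either one (p = 0.5).
from math import comb

def sign_test(y_true, pred1, pred2):
    n01 = sum(1 for y, a, b in zip(y_true, pred1, pred2) if a != y and b == y)
    n10 = sum(1 for y, a, b in zip(y_true, pred1, pred2) if a == y and b != y)
    n = n01 + n10
    # Two-sided exact p-value: probability of a disagreement split at least
    # this extreme under Binomial(n, 0.5).
    k = min(n01, n10)
    p_value = min(1.0, 2.0 * sum(comb(n, i) for i in range(k + 1)) / 2**n)
    return n01, n10, p_value

# Placeholder predictions on a tiny test sample.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred1  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
pred2  = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
print(sign_test(y_true, pred1, pred2))  # (1, 3, 0.625)
```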
Is Learning Algorithm A_1 Better than A_2?
Given:
- k samples S_1, ..., S_k of labeled instances, all i.i.d. from P(X,Y)
- Learning Algorithms A_1 and A_2
Setup:
- For i from 1 to k:
  - Partition S_i randomly into S_train and S_test.
  - Train learning algorithms A_1 and A_2 on S_train; the results are ĥ_1 and ĥ_2.
  - Apply ĥ_1 and ĥ_2 to S_test and compute Err_{S_test}(ĥ_1) and Err_{S_test}(ĥ_2).
Test:
- Decide: is E_S(Err_P(A_1(S_train))) ≠ E_S(Err_P(A_2(S_train)))?
- Null Hypothesis: Err_{S_test}(A_1(S_train)) and Err_{S_test}(A_2(S_train)) come from the same distribution over samples S.
- t-Test or Wilcoxon Signed-Rank Test (see the sketch below)
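A sketch of both tests on the k paired test-error measurements; the error values below are placeholders, and scipy.stats provides ttest_rel and wilcoxon for paired samples:

```python
# Sketch: paired t-test and Wilcoxon signed-rank test on the k paired
# test errors of A_1 and A_2, one pair per sample S_i.
from scipy.stats import ttest_rel, wilcoxon

# Placeholder test errors Err_{S_test}(A_1(S_train)) and Err_{S_test}(A_2(S_train))
# measured on k = 10 independent samples.
errors_A1 = [0.112, 0.098, 0.120, 0.105, 0.131, 0.117, 0.109, 0.125, 0.101, 0.119]
errors_A2 = [0.104, 0.095, 0.111, 0.108, 0.119, 0.106, 0.103, 0.118, 0.099, 0.110]

t_stat, p_t = ttest_rel(errors_A1, errors_A2)
w_stat, p_w = wilcoxon(errors_A1, errors_A2)
print("paired t-test:        t =", t_stat, " p =", p_t)
print("Wilcoxon signed-rank: W =", w_stat, " p =", p_w)
```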
Approximation via k-fold Cross Validation
Given:
- Sample of labeled instances S
- Learning Algorithms A_1 and A_2
Compute:
- Randomly partition S into k equally sized subsets S_1, ..., S_k.
- For i from 1 to k:
  - Train A_1 and A_2 on S_1 ∪ ... ∪ S_{i−1} ∪ S_{i+1} ∪ ... ∪ S_k and get ĥ_1 and ĥ_2.
  - Apply ĥ_1 and ĥ_2 to S_i and compute Err_{S_i}(ĥ_1) and Err_{S_i}(ĥ_2).
Estimate:
- The average of the Err_{S_i}(ĥ_1) is an estimate of E_S(Err_P(A_1(S_train))).
- The average of the Err_{S_i}(ĥ_2) is an estimate of E_S(Err_P(A_2(S_train))).
- Count how often Err_{S_i}(ĥ_1) > Err_{S_i}(ĥ_2) and Err_{S_i}(ĥ_1) < Err_{S_i}(ĥ_2) (see the sketch below).
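A brief sketch of this procedure with scikit-learn's KFold; the two learning algorithms and the dataset are placeholder choices:

```python
# Sketch: k-fold cross validation to compare two learning algorithms.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)
kfold = KFold(n_splits=10, shuffle=True, random_state=3)

errors_A1, errors_A2 = [], []
for train_idx, test_idx in kfold.split(X):
    # Train both algorithms on the union of the other k-1 folds.
    h1 = DecisionTreeClassifier(random_state=3).fit(X[train_idx], y[train_idx])
    h2 = GaussianNB().fit(X[train_idx], y[train_idx])
    # Evaluate both on the held-out fold S_i.
    errors_A1.append(1.0 - h1.score(X[test_idx], y[test_idx]))
    errors_A2.append(1.0 - h2.score(X[test_idx], y[test_idx]))

print("avg error A1:", sum(errors_A1) / len(errors_A1))
print("avg error A2:", sum(errors_A2) / len(errors_A2))
print("A1 worse on", sum(e1 > e2 for e1, e2 in zip(errors_A1, errors_A2)), "folds,",
      "A2 worse on", sum(e1 < e2 for e1, e2 in zip(errors_A1, errors_A2)), "folds")
```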