Machine Learning Foundations (機器學習基石)
Lecture 13: Hazard of Overfitting
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
Lecture 12: Nonlinear Transform: nonlinear via nonlinear feature transform Φ plus linear, with the price of model complexity
4 How Can Machines Learn Better?
Lecture 13: Hazard of Overfitting
What is Overfitting?
The Role of Noise and Data Size
Deterministic Noise
Dealing with Overfitting
What is Overfitting?
Bad Generalization
regression for x ∈ R with N = 5 examples
target f(x) = 2nd-order polynomial
label y_n = f(x_n) + very small noise
linear regression in Z-space with Φ = 4th-order polynomial
unique solution passing all examples ⇒ E_in(g) = 0, E_out(g) huge
(figure: data, target, and fit)
bad generalization: low E_in, high E_out
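The slide's setup can be sketched numerically: a 4th-order polynomial has 5 coefficients, so it interpolates 5 examples exactly, driving E_in to 0 while E_out stays much larger. The particular target, noise level, and seed below are illustrative choices, not the slide's exact numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2nd-order target, N = 5 points, very small label noise.
f = lambda x: 1.0 - 2.0 * x + 0.5 * x**2          # target f(x)
x = rng.uniform(-1, 1, size=5)
y = f(x) + rng.normal(scale=0.01, size=5)          # y_n = f(x_n) + small noise

# 4th-order fit: 5 coefficients for 5 points -> passes all examples.
g = np.polynomial.Polynomial.fit(x, y, deg=4)

E_in = np.mean((g(x) - y) ** 2)                    # essentially 0
x_test = rng.uniform(-1, 1, size=10000)
E_out = np.mean((g(x_test) - f(x_test)) ** 2)      # much larger than E_in
print(E_in, E_out)
```

Because the fit chases the noisy labels exactly, its out-of-sample error can blow up far beyond the in-sample error: exactly the low-E_in, high-E_out pattern of bad generalization.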
What is Overfitting?
Bad Generalization and Overfitting
take d_VC = 1126 for learning: bad generalization ((E_out − E_in) large)
switch from d_VC = d_VC* to d_VC = 1126: overfitting (E_in ↓, E_out ↑)
switch from d_VC = d_VC* to d_VC = 1: underfitting (E_in ↑, E_out ↑)
(figure: error versus VC dimension d_VC; in-sample error decreases while out-of-sample error and model complexity rise)
bad generalization: low E_in, high E_out; overfitting: lower E_in, higher E_out
What is Overfitting?
Cause of Overfitting: A Driving Analogy
(figure: data, target, and fit; good fit versus overfit)
learning: overfit              driving: commit a car accident
learning: use excessive d_VC   driving: drive too fast
learning: noise                driving: bumpy road
learning: limited data size N  driving: limited observations about road condition
next: how do noise & data size affect overfitting?
What is Overfitting?
Fun Time
Based on our discussion, for data of fixed size, which of the following situations carries relatively the lowest risk of overfitting?
1 small noise, fitting from small d_VC to median d_VC
2 small noise, fitting from small d_VC to large d_VC
3 large noise, fitting from small d_VC to median d_VC
4 large noise, fitting from small d_VC to large d_VC
What is Overfitting?
Fun Time
Based on our discussion, for data of fixed size, which of the following situations carries relatively the lowest risk of overfitting?
1 small noise, fitting from small d_VC to median d_VC
2 small noise, fitting from small d_VC to large d_VC
3 large noise, fitting from small d_VC to median d_VC
4 large noise, fitting from small d_VC to large d_VC
Reference Answer: 1
Two causes of overfitting are noise and excessive d_VC. So if both are relatively under control, the risk of overfitting is smaller.
The Role of Noise and Data Size
Case Study (1/2)
10th-order target function + noise versus 50th-order target function, noiseless
(figures: data and target for each case)
overfitting from best g_2 ∈ H_2 to best g_10 ∈ H_10?
The Role of Noise and Data Size
Case Study (2/2)
(figures: data with 2nd- and 10th-order fits for each case)
10th-order target function + noise:
          g_2 ∈ H_2   g_10 ∈ H_10
  E_in    0.050       0.034
  E_out   0.127       9.00
50th-order target function, noiseless:
          g_2 ∈ H_2   g_10 ∈ H_10
  E_in    0.029       0.00001
  E_out   0.120       7680
overfitting from g_2 to g_10? both yes!
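A quick way to reproduce the flavor of the noisy case above (target, noise level, and N are illustrative choices, so the numbers will not match the table exactly): fit g_2 and g_10 to few noisy samples of a random 10th-order target and compare the errors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for the slide's experiment: a random 10th-order
# target with noise, fit by 2nd- versus 10th-order polynomials.
coef = rng.normal(size=11)                        # random 10th-order target
f = np.polynomial.Polynomial(coef)
N = 15
x = rng.uniform(-1, 1, size=N)
y = f(x) + rng.normal(scale=0.5, size=N)          # noisy labels

x_test = rng.uniform(-1, 1, size=20000)
y_test = f(x_test)                                 # E_out measured against f

results = {}
for deg in (2, 10):
    g = np.polynomial.Polynomial.fit(x, y, deg=deg)
    results[deg] = (np.mean((g(x) - y) ** 2),          # E_in
                    np.mean((g(x_test) - y_test) ** 2))  # E_out
    print(f"H_{deg}: E_in={results[deg][0]:.4f}  E_out={results[deg][1]:.4f}")
```

Since H_2 ⊂ H_10, the 10th-order fit always achieves E_in at most that of the 2nd-order fit, yet with few noisy examples its E_out is typically far worse: the "both yes" overfitting of the slide.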
The Role of Noise and Data Size
Irony of Two Learners
(figures: data with 2nd- and 10th-order fits; data and target)
learner Overfit: picks g_10 ∈ H_10; learner Restrict: picks g_2 ∈ H_2
when both know that target = 10th order, R gives up the ability to fit, but R wins in E_out by a lot!
philosophy: concession for advantage? :-)
The Role of Noise and Data Size
Learning Curves Revisited
(figures: expected E_in and E_out versus number of data points N, for H_2 and H_10)
H_10: lower E_out when N → ∞, but much larger generalization error for small N
gray area: O overfits! (E_in ↓, E_out ↑)
R always wins in E_out if N small!
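The two learning curves can be approximated by averaging E_out over many random datasets at a small and a large N (the target, noise level, and trial count below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.polynomial.Polynomial(rng.normal(size=11))   # random 10th-order target
x_test = np.linspace(-1, 1, 2001)
y_test = f(x_test)

def avg_E_out(deg, N, trials=200):
    # Average out-of-sample error of a degree-`deg` fit over many
    # noisy datasets of size N (a crude learning-curve estimate).
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, size=N)
        y = f(x) + rng.normal(scale=0.5, size=N)
        g = np.polynomial.Polynomial.fit(x, y, deg=deg)
        errs.append(np.mean((g(x_test) - y_test) ** 2))
    return float(np.mean(errs))

curves = {(deg, N): avg_E_out(deg, N) for deg in (2, 10) for N in (15, 120)}
print(curves)
```

The pattern matches the slide: for H_10, E_out shrinks dramatically as N grows, and at small N the restricted H_2 is the safer choice.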
The Role of Noise and Data Size
The No Noise Case
(figures: data with 2nd- and 10th-order fits; data and target)
learner Overfit: picks g_10 ∈ H_10; learner Restrict: picks g_2 ∈ H_2
when both know that there is no noise, R still wins
is there really no noise? target complexity acts like noise
The Role of Noise and Data Size
Fun Time
When having limited data, in which of the following cases would learner R perform better than learner O?
1 limited data from a 10th-order target function with some noise
2 limited data from a 1126th-order target function with no noise
3 limited data from a 1126th-order target function with some noise
4 all of the above
The Role of Noise and Data Size
Fun Time
When having limited data, in which of the following cases would learner R perform better than learner O?
1 limited data from a 10th-order target function with some noise
2 limited data from a 1126th-order target function with no noise
3 limited data from a 1126th-order target function with some noise
4 all of the above
Reference Answer: 4
We discussed 1 and 2, but you should be able to generalize :-) that R also wins in the more difficult case of 3.
Deterministic Noise
A Detailed Experiment
y = f(x) + ε ∼ Gaussian( Σ_{q=0}^{Q_f} α_q x^q , σ² ), where the sum is f(x)
Gaussian iid noise ε with level σ²
some uniform distribution on f(x) with complexity level Q_f
data size N
(figure: data and target)
goal: overfit level for different (N, σ²) and (N, Q_f)?
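The data-generating process above can be sketched directly; the slide draws the α_q from "some uniform distribution" with normalization details it does not spell out, so the standard-normal coefficients below are a simplification.

```python
import numpy as np

def generate_data(N, Q_f, sigma2, rng):
    # y = f(x) + eps, f(x) = sum_{q=0}^{Q_f} alpha_q x^q, eps ~ N(0, sigma2).
    # alpha drawn standard normal here (simplified vs. the slide's setup).
    alpha = rng.normal(size=Q_f + 1)
    f = np.polynomial.Polynomial(alpha)
    x = rng.uniform(-1, 1, size=N)
    y = f(x) + rng.normal(scale=np.sqrt(sigma2), size=N)
    return x, y, f

rng = np.random.default_rng(2)
x, y, f = generate_data(N=80, Q_f=20, sigma2=0.1, rng=rng)
print(x.shape, y.shape)
```

Sweeping (N, σ²) with Q_f fixed, or (N, Q_f) with σ² fixed, and measuring the overfit level at each grid point yields the two heat maps discussed next.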
Deterministic Noise
The Overfit Measure
(figures: data with 2nd- and 10th-order fits; data and target)
g_2 ∈ H_2, g_10 ∈ H_10
E_in(g_10) ≤ E_in(g_2) for sure
overfit measure: E_out(g_10) − E_out(g_2)
Deterministic Noise
The Results
(figures: overfit measure E_out(g_10) − E_out(g_2) as color maps on a scale from −0.2 to 0.2; left: noise level σ² versus number of data points N, with fixed Q_f = 20; right: target complexity Q_f versus N, with fixed σ² = 0.1)
ring a bell? :-)
Deterministic Noise
Impact of Noise and Data Size
impact of σ² versus N: stochastic noise
impact of Q_f versus N: deterministic noise
(figures: the two overfit-level color maps, noise level σ² versus N and target complexity Q_f versus N)
four reasons of serious overfitting:
  data size N ↓           overfit ↑
  stochastic noise ↑      overfit ↑
  deterministic noise ↑   overfit ↑
  excessive power ↑       overfit ↑
overfitting easily happens
Deterministic Noise
Deterministic Noise
if f ∉ H: something of f cannot be captured by H
deterministic noise: difference between the best h* ∈ H and f
acts like stochastic noise; not new to CS: pseudo-random generator
difference to stochastic noise: depends on H; fixed for a given x
philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :-)
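Deterministic noise can be computed exactly in a small example (this particular target and hypothesis set are illustrative, not from the slide): take f(x) = x³ with H = {h(x) = w·x} and x uniform on [−1, 1]. Minimizing E[(x³ − wx)²] gives w* = E[x⁴]/E[x²] = (1/5)/(1/3) = 3/5, so the deterministic noise at x is x³ − 0.6x, with mean-squared level E[(x³ − 0.6x)²] ≈ 0.0229.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=1_000_000)
fx = x ** 3                                     # target f not in H

# Empirical least-squares w for h(x) = w*x: w* = sum(x^4)/sum(x^2).
w_star = np.mean(x ** 4) / np.mean(x ** 2)

# Mean-squared level of the deterministic noise f(x) - h*(x).
det_noise_level = np.mean((fx - w_star * x) ** 2)
print(w_star, det_noise_level)
```

Note the slide's point in action: this "noise" is fixed for a given x (it is just x³ − 0.6x) and changes if H changes, unlike stochastic noise.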
Deterministic Noise
Fun Time
Consider the target function being sin(1126x) for x ∈ [0, 2π]. When x is uniformly sampled from the range, and we use all possible linear hypotheses h(x) = w·x to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?
1 sin(1126x)
2 −sin(1126x)
3 sin(1126x) + x
4 sin(1126x) − 1126x
Deterministic Noise
Fun Time
Consider the target function being sin(1126x) for x ∈ [0, 2π]. When x is uniformly sampled from the range, and we use all possible linear hypotheses h(x) = w·x to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?
1 sin(1126x)
2 −sin(1126x)
3 sin(1126x) + x
4 sin(1126x) − 1126x
Reference Answer: 1
You can try a few different w and convince yourself that the best hypothesis is h*(x) = 0. The deterministic noise is the difference between f and h*.
Dealing with Overfitting
Driving Analogy Revisited
learning                    driving
overfit                     commit a car accident
use excessive d_VC          drive too fast
noise                       bumpy road
limited data size N         limited observations about road condition
start from simple model     drive slowly
data cleaning/pruning       use more accurate road information
data hinting                exploit more road information
regularization              put the brakes
validation                  monitor the dashboard
all very practical techniques to combat overfitting
Dealing with Overfitting
Data Cleaning/Pruning
if detect the outlier "5" at the top: too close to examples of the other class, too far from examples of its own class, or wrong by the current classifier ...
possible action 1: correct the label (data cleaning)
possible action 2: remove the example (data pruning)
possibly helps, but effect varies
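The two possible actions can be sketched as follows. Everything here is a hypothetical illustration: `scores` stands for signed margins y_n · s(x_n) from some already-trained classifier, and the threshold is an arbitrary choice.

```python
import numpy as np

def clean_or_prune(X, y, scores, threshold=-2.0, action="prune"):
    # Flag examples the current classifier gets badly wrong (hypothetical
    # outlier criterion: signed margin below `threshold`).
    suspicious = scores < threshold
    if action == "clean":
        y = np.where(suspicious, -y, y)        # correct (flip) the label
        return X, y
    return X[~suspicious], y[~suspicious]      # remove the example

X = np.arange(10).reshape(5, 2).astype(float)
y = np.array([1, -1, 1, 1, -1])
scores = np.array([1.5, 2.0, -3.0, 0.5, 1.0])  # example 2 looks like an outlier
X2, y2 = clean_or_prune(X, y, scores)
print(len(y2))
```

As the slide warns, whether cleaning or pruning helps depends on whether the flagged examples really are mislabeled outliers rather than informative hard cases.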
Dealing with Overfitting
Data Hinting
slightly shifted/rotated digits carry the same meaning
possible action: add virtual examples by shifting/rotating the given digits (data hinting)
possibly helps, but watch out: virtual examples are not iid ∼ P(x, y)!
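A minimal sketch of data hinting for shift invariance, using toy 8x8 arrays in place of real digit images (the helper name and shift amounts are illustrative):

```python
import numpy as np

def shift_hints(images, labels):
    # Build virtual examples: shift each image one pixel left/right,
    # keeping the label, since shifted digits carry the same meaning.
    # Note: np.roll wraps pixels around the edge; real hinting would
    # pad with background instead.
    virtual_x, virtual_y = [], []
    for img, lab in zip(images, labels):
        for dx in (-1, 1):
            virtual_x.append(np.roll(img, dx, axis=1))
            virtual_y.append(lab)
    return np.stack(virtual_x), np.array(virtual_y)

images = np.zeros((3, 8, 8))
labels = np.array([5, 3, 5])
vx, vy = shift_hints(images, labels)
print(vx.shape, vy.shape)
```

Because these virtual examples are manufactured from the originals, they are not iid draws from P(x, y), which is exactly the slide's caveat.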
Dealing with Overfitting
Fun Time
Assume we know that f(x) is symmetric for some 1D regression application, that is, f(x) = f(−x). One possibility of using the knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. What virtual examples suit our needs best?
1 {(x_n, −y_n)}
2 {(−x_n, −y_n)}
3 {(−x_n, y_n)}
4 {(2x_n, 2y_n)}
Dealing with Overfitting
Fun Time
Assume we know that f(x) is symmetric for some 1D regression application, that is, f(x) = f(−x). One possibility of using the knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. What virtual examples suit our needs best?
1 {(x_n, −y_n)}
2 {(−x_n, −y_n)}
3 {(−x_n, y_n)}
4 {(2x_n, 2y_n)}
Reference Answer: 3
We want the virtual examples to encode the invariance when x → −x.
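The winning hint is a one-liner in practice: mirror the inputs while keeping the labels, so the augmented data encodes f(x) = f(−x). The sample points below are made up for illustration.

```python
import numpy as np

# Original data {(x_n, y_n)} (illustrative values).
x = np.array([0.5, 1.0, 2.0])
y = np.array([0.3, 0.7, 1.1])

# Virtual examples {(-x_n, y_n)}: same label at the mirrored input
# encodes the symmetry f(x) = f(-x).
x_aug = np.concatenate([x, -x])
y_aug = np.concatenate([y, y])
print(x_aug, y_aug)
```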
Dealing with Overfitting
Summary
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
Lecture 12: Nonlinear Transform
4 How Can Machines Learn Better?
Lecture 13: Hazard of Overfitting
What is Overfitting? lower E_in but higher E_out
The Role of Noise and Data Size: overfitting easily happens!
Deterministic Noise: what H cannot capture acts like noise
Dealing with Overfitting: data cleaning/pruning/hinting, and more
next: putting the brakes with regularization