Learning From Data Lecture 12 Regularization


1 Learning From Data Lecture 12: Regularization
Constraining the Model; Weight Decay; Augmented Error
M. Magdon-Ismail, CSCI 4100/6100

2 recap: Overfitting
Fitting the data more than is warranted.
[Figure: noisy data, the target function, and an overfit curve.]

3 recap: Noise is the Part of y We Cannot Model
Stochastic noise: y = f(x) + stochastic noise.
Deterministic noise: y = h*(x) + deterministic noise.
Stochastic and deterministic noise both hurt learning.
Human: good at extracting the simple pattern, ignoring the noise and complications.
Computer: pays equal attention to all pixels. Needs help simplifying (features, regularization).

4 Regularization
What is regularization? A cure for our tendency to fit (get distracted by) the noise, hence improving E_out.
How does it work? By constraining the model so that we cannot fit the noise: putting on the brakes.
Side effects? The medication will have side effects: if we cannot fit the noise, maybe we cannot fit f (the signal)?

5 Constraining the Model: Does it Help?
[Figure: an unconstrained fit to the data.]
...and the winner is:

6 Constraining the Model: Does it Help?
[Figure: the unconstrained fit next to a fit with the weights constrained to be smaller.]
...and the winner is:

7 Bias Goes Up a Little
[Figure: the average hypothesis ḡ(x) versus sin(πx), without and with regularization.]
No regularization: bias = 0.21. Regularization: bias = 0.23 (the side effect).
(The constant model had bias = 0.5 and var = 0.25.)

8 Variance Drop is Dramatic!
[Figure: ḡ(x) versus sin(πx), without and with regularization.]
No regularization: bias = 0.21, var = 1.69. Regularization: bias = 0.23, var = 0.33 (side effect vs. treatment).
(The constant model had bias = 0.5 and var = 0.25.)

9 Regularization in a Nutshell
VC analysis: E_out(g) ≤ E_in(g) + Ω(H).
If you use a simpler H and get a good fit, then your E_out is better.
Regularization takes this a step further: if you use a simpler h and get a good fit, is your E_out better?

10 Polynomials of Order Q: A Useful Testbed
H_Q: polynomials of order Q. We are using linear regression in a transformed space, h(x) = w^T z(x).
Standard polynomial: z = [1, x, x^2, ..., x^Q]^T, so h(x) = w_0 + w_1 x + ... + w_Q x^Q.
Legendre polynomial: z = [1, L_1(x), L_2(x), ..., L_Q(x)]^T, so h(x) = w_0 + w_1 L_1(x) + ... + w_Q L_Q(x). The Legendre basis allows us to treat the weights independently.
The first few Legendre polynomials:
L_1(x) = x, L_2(x) = (1/2)(3x^2 - 1), L_3(x) = (1/2)(5x^3 - 3x), L_4(x) = (1/8)(35x^4 - 30x^2 + 3), L_5(x) = (1/8)(63x^5 - 70x^3 + 15x).
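As an aside, here is a minimal sketch (an illustration, not part of the lecture) of how the Legendre feature transform z(x) can be built with numpy; the helper name legendre_features and the test point are assumptions for the example.

```python
# Sketch: Legendre feature transform for the order-Q polynomial testbed.
# Assumes x is scaled to [-1, 1], where the Legendre polynomials live.
import numpy as np
from numpy.polynomial import legendre as leg

def legendre_features(x, Q):
    """Return the N x (Q+1) matrix Z whose q-th column is L_q(x)."""
    # legvander builds the rows z(x_n) = [L_0(x_n), L_1(x_n), ..., L_Q(x_n)].
    return leg.legvander(np.asarray(x, dtype=float), Q)

# Quick check against the table: L_2(0.5) = (3*0.25 - 1)/2 = -0.125.
print(legendre_features([0.5], Q=2))   # [[ 1.    0.5  -0.125]]
```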

11 recap: Linear Regression
Data (x_1, y_1), ..., (x_N, y_N) gives X, y; the transformed data (z_1, y_1), ..., (z_N, y_N) gives Z, y.
Minimize E_in(w) = (1/N) Σ_{n=1}^{N} (w^T z_n - y_n)^2 = (1/N)(Zw - y)^T(Zw - y).
The linear regression fit is w_lin = (Z^T Z)^{-1} Z^T y.
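A companion sketch of this fit on toy data follows; the data, noise level, and Q = 10 are illustrative assumptions, not the lecture's experiment.

```python
# Sketch: unregularized linear regression in Z-space, w_lin = (Z^T Z)^{-1} Z^T y.
import numpy as np
from numpy.polynomial import legendre as leg

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=15)                         # toy inputs (assumed)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)   # noisy target (assumed)

Z = leg.legvander(x, 10)           # z_n = [1, L_1(x_n), ..., L_10(x_n)]
w_lin = np.linalg.pinv(Z) @ y      # pseudo-inverse form of (Z^T Z)^{-1} Z^T y
E_in = np.mean((Z @ w_lin - y) ** 2)
print(E_in)                        # near zero: the order-10 fit can chase the noise
```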

12 Constraining the Model: H_10 vs. H_2
H_10 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x) }
H_2 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x), such that w_3 = w_4 = ... = w_10 = 0 }
(a hard order constraint that sets some weights to zero)
H_2 ⊂ H_10

13 Soft Order Constraint
Don't set weights explicitly to zero (e.g. w_3 = 0). Give a budget and let the learning choose:
Σ_{q=0}^{10} w_q^2 ≤ C    (C is the budget for the weights)
The soft order constraint allows intermediate models between H_2 and H_10.

14 Soft Order Constrained Model H_C
H_10 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x) }
H_2 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x), such that w_3 = w_4 = ... = w_10 = 0 }
H_C = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x), such that Σ_{q=0}^{10} w_q^2 ≤ C }
(a soft budget constraint on the sum of squared weights)
VC perspective: H_C is smaller than H_10 ⇒ better generalization.

15 Fitting the Data
The optimal (regularized) weights w_reg in H_C should minimize the in-sample error while staying within the budget. w_reg is a solution to
minimize E_in(w) = (1/N)(Zw - y)^T(Zw - y), subject to w^T w ≤ C.
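This constrained problem can also be attacked directly with a generic optimizer. The sketch below (using scipy, with an illustrative budget C and toy data, none of it from the lecture) typically lands on the boundary w^T w = C, anticipating the observations on the next slide.

```python
# Sketch: minimize E_in(w) = (1/N)||Zw - y||^2 subject to w^T w <= C.
import numpy as np
from numpy.polynomial import legendre as leg
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)
Z = leg.legvander(x, 10)

def E_in(w):
    return np.mean((Z @ w - y) ** 2)

C = 1.0   # weight budget (illustrative value)
res = minimize(E_in, np.zeros(Z.shape[1]), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda w: C - w @ w}])
w_reg = res.x
print(E_in(w_reg), w_reg @ w_reg)   # w^T w is (approximately) C: the full budget is used
```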

16 Solving for w_reg
minimize E_in(w) = (1/N)(Zw - y)^T(Zw - y), subject to w^T w ≤ C.
[Figure: contours E_in = const., the unconstrained minimum w_lin, the constraint surface w^T w = C, and the normal at w_reg.]
Observations:
1. The optimal w tries to get as close to w_lin as possible, so it will use the full budget and lie on the surface w^T w = C.
2. At the optimal w, ∇E_in should be perpendicular to the surface w^T w = C; otherwise we could move along the surface and decrease E_in.
3. The normal to the surface w^T w = C at w is the vector w itself.
4. ∇E_in is perpendicular to the contour E_in = const., and w is perpendicular to the surface, so ∇E_in is parallel to the normal w (but points in the opposite direction):
∇E_in(w_reg) = -2 λ_C w_reg,
where λ_C, the Lagrange multiplier, is positive. (The 2 is for mathematical convenience.)

17 Solving for w_reg
E_in(w) is minimized subject to w^T w ≤ C
⇔ ∇E_in(w_reg) + 2 λ_C w_reg = 0
⇔ ∇(E_in(w) + λ_C w^T w), evaluated at w = w_reg, is zero
⇔ E_in(w) + λ_C w^T w is minimized unconditionally.
There is a correspondence: C ↔ λ_C.

18 The Augmented Error
Pick a C and minimize E_in(w) subject to w^T w ≤ C,
or pick a λ_C and minimize E_aug(w) = E_in(w) + λ_C w^T w unconditionally.
(The term λ_C w^T w is a penalty for the complexity of h, measured by the size of the weights.)
We can pick any budget C. Translation: we are free to pick any multiplier λ_C.
What's the right C? What's the right λ_C?

19 Linear Regression with the Soft Order Constraint
E_aug(w) = (1/N)(Zw - y)^T(Zw - y) + λ_C w^T w
It is convenient to set λ_C = λ/N:
E_aug(w) = (1/N) [ (Zw - y)^T(Zw - y) + λ w^T w ]
(called weight decay, as the penalty encourages smaller weights)
Unconditionally minimize E_aug(w).

20 The Solution for w_reg
∇E_aug(w) = (2/N) [ Z^T(Zw - y) + λ w ] = (2/N) [ (Z^T Z + λ I) w - Z^T y ]
Setting ∇E_aug(w) = 0 gives
w_reg = (Z^T Z + λ I)^{-1} Z^T y.
λ determines the amount of regularization.
Recall the unconstrained solution (λ = 0): w_lin = (Z^T Z)^{-1} Z^T y.
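A sketch of this closed form on toy data follows; the data and λ values are illustrative assumptions.

```python
# Sketch: weight decay solution w_reg = (Z^T Z + lambda*I)^{-1} Z^T y.
import numpy as np
from numpy.polynomial import legendre as leg

def weight_decay_fit(Z, y, lam):
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)
Z = leg.legvander(x, 10)

w_lin = weight_decay_fit(Z, y, lam=0.0)     # unregularized solution
w_reg = weight_decay_fit(Z, y, lam=0.01)    # a little weight decay
print(w_lin @ w_lin, w_reg @ w_reg)         # the penalty shrinks the weights
```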

21 A Little Regularization...
Minimizing E_in(w) + (λ/N) w^T w with different λ's.
[Figure: data, target, and the fit for λ = 0: overfitting.]

22 ...Goes a Long Way
Minimizing E_in(w) + (λ/N) w^T w with different λ's.
[Figure: the λ = 0 fit (overfitting) next to the fit with a small λ: wow!]

23 Don't Overdose
Minimizing E_in(w) + (λ/N) w^T w with different λ's.
[Figure: fits for λ = 0 (overfitting), a small λ, λ = 0.01, and λ = 1 (underfitting).]
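The qualitative behavior in these three slides can be reproduced with a short λ sweep; the sketch below estimates E_out on a dense grid (the target, noise level, and λ values are illustrative assumptions, not the lecture's experiment).

```python
# Sketch: sweep lambda and watch E_out go from overfitting to underfitting.
import numpy as np
from numpy.polynomial import legendre as leg

rng = np.random.default_rng(1)
def f(x):                      # assumed target for illustration
    return np.sin(np.pi * x)

x = rng.uniform(-1, 1, size=15)
y = f(x) + 0.3 * rng.standard_normal(15)
Z = leg.legvander(x, 10)

x_test = np.linspace(-1, 1, 1001)
Z_test = leg.legvander(x_test, 10)

for lam in [0.0, 1e-4, 1e-2, 1.0, 100.0]:
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)
    E_out = np.mean((Z_test @ w - f(x_test)) ** 2)
    print(f"lambda = {lam:g}: estimated E_out = {E_out:.3f}")
```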

24 Overfitting and Underfitting
[Figure: expected E_out versus the regularization parameter λ, with overfitting at small λ and underfitting at large λ.]

25 More Noise Needs More Medicine
[Figure: expected E_out versus λ for stochastic noise levels σ² = 0, 0.25, 0.5; more noise calls for more regularization.]

26 ...Even For Deterministic Noise
[Figure: expected E_out versus λ, for stochastic noise levels σ² = 0, 0.25, 0.5 and for deterministic noise with target complexities Q_f = 100, Q_f = 30, ...]

27 Variations on Weight Decay
[Figure: expected E_out versus λ for each regularizer; the third panel compares weight decay against weight growth.]
Uniform weight decay: Σ_{q=0}^{Q} w_q^2
Low order fit: Σ_{q=0}^{Q} q w_q^2
Weight growth: Σ_{q=0}^{Q} 1/w_q^2
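The two decay-type variations fit the same closed form with a diagonal matrix Γ of per-coefficient penalties; a sketch is below (the weight-growth penalty is not quadratic in w, so it is not covered by this form). The function name and λ value are assumptions for the example.

```python
# Sketch: Tikhonov-style variation sum_q gamma_q * w_q^2, solved by
#   w_reg = (Z^T Z + lambda * Gamma)^{-1} Z^T y,  with Gamma = diag(gamma).
import numpy as np

def tikhonov_fit(Z, y, lam, gamma):
    """Minimize (1/N)||Zw - y||^2 + (lam/N) * sum_q gamma[q] * w[q]**2."""
    return np.linalg.solve(Z.T @ Z + lam * np.diag(gamma), Z.T @ y)

Q = 10
uniform_decay = np.ones(Q + 1)      # gamma_q = 1: uniform weight decay
low_order_fit = np.arange(Q + 1)    # gamma_q = q: pushes high-order w_q toward zero
# usage (with Z, y built as in the earlier sketches):
#   w = tikhonov_fit(Z, y, lam=0.01, gamma=low_order_fit)
```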

28 Choosing a Regularizer: A Practitioner's Guide
The perfect regularizer: constrain in the direction of the target function. But the target function is unknown (going around in circles).
The guiding principle: constrain in the direction of smoother (usually simpler) hypotheses, which hurts your ability to fit the high-frequency noise. Smoother and simpler usually means weight decay, not weight growth.
What if you choose the wrong regularizer? You still have λ to play with: validation.

29 How Does Regularization Work?
Stochastic noise: nothing you can do about that. Good features help to reduce deterministic noise.
Regularization helps to combat whatever noise remains, especially when N is small.
Typical modus operandi: sacrifice a little bias for a huge improvement in var.
VC angle: you are using a smaller H without sacrificing too much E_in.

30 Augmented Error as a Proxy for E_out
E_aug(h) = E_in(h) + (λ/N) Ω(h)    (here Ω(h) was w^T w)
E_out(h) ≤ E_in(h) + Ω(H)    (here Ω(H) was O(sqrt((d_vc/N) ln N)))
E_aug can beat E_in as a proxy for E_out, depending on the choice of λ.
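To make the comparison concrete, the sketch below computes E_in, E_aug, and an estimate of E_out for the weight-decay fit at a few λ values on toy data; the setup is an illustrative assumption, not the lecture's experiment.

```python
# Sketch: E_in, E_aug = E_in + (lambda/N) w^T w, and an E_out estimate, side by side.
import numpy as np
from numpy.polynomial import legendre as leg

rng = np.random.default_rng(2)
def f(x):
    return np.sin(np.pi * x)

x = rng.uniform(-1, 1, size=15)
y = f(x) + 0.3 * rng.standard_normal(15)
Z = leg.legvander(x, 10)
N = len(y)
x_t = np.linspace(-1, 1, 1001)
Z_t = leg.legvander(x_t, 10)

for lam in [1e-4, 1e-2, 1.0]:
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)
    E_in = np.mean((Z @ w - y) ** 2)
    E_aug = E_in + (lam / N) * (w @ w)
    E_out = np.mean((Z_t @ w - f(x_t)) ** 2)
    print(f"lambda = {lam:g}: E_in = {E_in:.3f}, E_aug = {E_aug:.3f}, E_out ~ {E_out:.3f}")
```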
