Learning From Data, Lecture 12: Regularization
Constraining the Model · Weight Decay · Augmented Error
M. Magdon-Ismail, CSCI 4100/6100
Recap: Overfitting
Overfitting: fitting the data more than is warranted.
[Figure: noisy data from a simple target, with a complex fit that chases the noise. Legend: Data, Target, Fit.]
Recap: Noise Is the Part of y We Cannot Model
Stochastic noise: y = f(x) + stochastic noise.
Deterministic noise: y = h*(x) + deterministic noise.
Both stochastic and deterministic noise hurt learning.
Human: good at extracting the simple pattern, ignoring the noise and complications.
Computer: pays equal attention to all pixels; needs help simplifying (features, regularization).
Regularization
What is regularization? A cure for our tendency to fit (get distracted by) the noise, hence improving E_out.
How does it work? By constraining the model so that we cannot fit the noise: putting on the brakes.
Side effects? The medication will have side effects: if we cannot fit the noise, maybe we cannot fit f (the signal)?
Constraining the Model: Does It Help?
[Figure: a fit to the data with unconstrained weights.] ...and the winner is:
Constraining the Model: Does It Help?
[Figure: the same data fit with the weights constrained to be smaller.] ...and the winner is: the constrained fit, as the bias and var comparison on the next two slides shows.
Bias Goes Up a Little
[Figure: the average hypothesis ḡ(x) against sin(x), without regularization (bias = 0.21) and with regularization (bias = 0.23).]
The slightly higher bias is the side effect. (The constant model had bias = 0.50 and var = 0.25.)
The Variance Drop Is Dramatic!
[Figure: ḡ(x) with its variance band against sin(x), without regularization (bias = 0.21, var = 1.69) and with regularization (bias = 0.23, var = 0.33).]
Side effect vs. treatment. (The constant model had bias = 0.50 and var = 0.25.)
Regularization in a Nutshell
VC analysis: E_out(g) ≤ E_in(g) + Ω(H). If you use a simpler H and get a good fit, then your E_out is better.
Regularization takes this a step further: if you use a simpler h and get a good fit, is your E_out better?
Polynomials of Order Q: A Useful Testbed
H_Q: polynomials of order Q. We're using linear regression in a transformed space.
Standard polynomial: z = (1, x, x², ..., x^Q)^T, so h(x) = w^T z(x) = w_0 + w_1 x + ... + w_Q x^Q.
Legendre polynomial: z = (1, L_1(x), L_2(x), ..., L_Q(x))^T, so h(x) = w^T z(x) = w_0 + w_1 L_1(x) + ... + w_Q L_Q(x). The Legendre basis allows us to treat the weights independently.
The first few Legendre polynomials:
L_1(x) = x, L_2(x) = (3x² − 1)/2, L_3(x) = (5x³ − 3x)/2, L_4(x) = (35x⁴ − 30x² + 3)/8, L_5(x) = (63x⁵ − 70x³ + 15x)/8.
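A minimal sketch (Python with NumPy, not from the slides) of the Legendre feature transform z(x), built with the standard three-term recurrence:

```python
import numpy as np

def legendre_features(x, Q):
    """Map inputs x in [-1, 1] to rows z = (1, L_1(x), ..., L_Q(x)).

    Uses the recurrence (q+1) L_{q+1}(x) = (2q+1) x L_q(x) - q L_{q-1}(x).
    """
    x = np.asarray(x, dtype=float)
    Z = np.ones((len(x), Q + 1))   # column 0 is L_0(x) = 1
    if Q >= 1:
        Z[:, 1] = x                # L_1(x) = x
    for q in range(1, Q):
        Z[:, q + 1] = ((2 * q + 1) * x * Z[:, q] - q * Z[:, q - 1]) / (q + 1)
    return Z
```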
Recap: Linear Regression
The data (x_1, y_1), ..., (x_N, y_N) are transformed to (z_1, y_1), ..., (z_N, y_N), collected into the matrix Z and the vector y.
Minimize E_in(w) = (1/N) Σ_{n=1}^{N} (w^T z_n − y_n)² = (1/N)(Zw − y)^T(Zw − y).
The linear regression fit: w_lin = (Z^T Z)⁻¹ Z^T y.
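As a hedged sketch, the unregularized fit in NumPy (a least-squares solve rather than an explicit inverse is my numerical choice, not something the slides specify):

```python
def linear_regression(Z, y):
    """Return w_lin minimizing (1/N)||Zw - y||^2, i.e. w_lin = (Z^T Z)^{-1} Z^T y."""
    w_lin, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w_lin
```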
Constraining the Model: H_10 vs. H_2
H_10 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x) }
H_2 = { h(x) = w_0 + w_1 Φ_1(x) + w_2 Φ_2(x) + w_3 Φ_3(x) + ... + w_10 Φ_10(x), such that w_3 = w_4 = ... = w_10 = 0 }: a hard order constraint that sets some weights to zero.
H_2 ⊂ H_10.
Soft Order Constraint
Don't set weights explicitly to zero (e.g., w_3 = 0). Give a budget and let the learning choose:
Σ_{q=0}^{10} w_q² ≤ C, where C is the budget for the weights.
The soft order constraint allows intermediate models between H_2 and H_10.
The Soft Order Constrained Model H_C
H_10 = { h(x) = w_0 + w_1 Φ_1(x) + ... + w_10 Φ_10(x) }
H_2 = { h ∈ H_10 such that w_3 = w_4 = ... = w_10 = 0 }
H_C = { h(x) = w_0 + w_1 Φ_1(x) + ... + w_10 Φ_10(x), such that Σ_{q=0}^{10} w_q² ≤ C }: a soft budget constraint on the sum of squared weights.
VC perspective: H_C is smaller than H_10 ⟹ better generalization.
Fitting the Data
The optimal regularized weights w_reg ∈ H_C should minimize the in-sample error while staying within the budget. w_reg is a solution to
minimize E_in(w) = (1/N)(Zw − y)^T(Zw − y), subject to w^T w ≤ C.
Solving for w_reg
Minimize E_in(w) = (1/N)(Zw − y)^T(Zw − y), subject to w^T w ≤ C.
[Figure: ellipsoidal contours of constant E_in centered at w_lin; the sphere w^T w = C; w_reg on the sphere's surface, with ∇E_in and the surface normal drawn at w_reg.]
Observations:
1. The optimal w tries to get as close to w_lin as possible, so it will use the full budget and lie on the surface w^T w = C.
2. At the optimal w, the surface w^T w = C must be perpendicular to ∇E_in; otherwise we could move along the surface and decrease E_in.
3. The normal to the surface w^T w = C at w is the vector w itself.
4. Since ∇E_in is perpendicular to the surface and so is the normal, ∇E_in is parallel to the normal, but points in the opposite direction:
∇E_in(w_reg) = −2 λ_C w_reg,
where the Lagrange multiplier λ_C is positive and the 2 is for mathematical convenience.
Solving for w_reg (continued)
E_in(w) is minimized subject to w^T w ≤ C
⟺ ∇E_in(w_reg) + 2 λ_C w_reg = 0
⟺ ∇(E_in(w) + λ_C w^T w) |_{w = w_reg} = 0
⟺ E_in(w) + λ_C w^T w is minimized unconditionally.
There is a correspondence between C and λ_C: a larger budget C corresponds to a smaller multiplier λ_C.
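For completeness, the same equivalence written out as the standard Lagrangian argument (my phrasing, not verbatim from the slides):

```latex
\begin{align*}
\mathcal{L}(w,\lambda) &= E_{\text{in}}(w) + \lambda\,(w^{\mathsf{T}}w - C),\\
\nabla_w \mathcal{L} &= \nabla E_{\text{in}}(w) + 2\lambda w = 0
  \quad \text{at } w = w_{\text{reg}},\ \lambda = \lambda_C > 0,
\end{align*}
```

so the constrained minimizer of E_in over w^T w ≤ C is exactly the unconditional minimizer of E_in(w) + λ_C w^T w for the corresponding λ_C.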
The Augmented Error
Pick a C and minimize E_in(w) subject to w^T w ≤ C
⟺ pick a λ_C and minimize E_aug(w) = E_in(w) + λ_C w^T w unconditionally.
The term λ_C w^T w is a penalty for the complexity of h, measured by the size of the weights.
We can pick any budget C; translation: we are free to pick any multiplier λ_C.
What's the right C? What's the right λ_C?
Linear Regression with the Soft Order Constraint
E_aug(w) = (1/N)(Zw − y)^T(Zw − y) + λ_C w^T w.
It is convenient to set λ_C = λ/N:
E_aug(w) = (1/N) [ (Zw − y)^T(Zw − y) + λ w^T w ].
This is called weight decay, as the penalty encourages smaller weights. Unconditionally minimize E_aug(w).
The Solution for w_reg
Dropping the constant 1/N factor,
∇E_aug(w) = 2 Z^T (Zw − y) + 2 λ w = 2 (Z^T Z + λI) w − 2 Z^T y.
Setting ∇E_aug(w) = 0:
w_reg = (Z^T Z + λI)⁻¹ Z^T y.
λ determines the amount of regularization. Recall the unconstrained solution (λ = 0): w_lin = (Z^T Z)⁻¹ Z^T y.
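A hedged NumPy sketch of the weight-decay solution (again solving the linear system rather than forming the inverse, a numerical choice of mine):

```python
def weight_decay_regression(Z, y, lam):
    """Return w_reg = (Z^T Z + lam*I)^{-1} Z^T y, the weight-decay solution."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```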
A Little Regularization Goes a Long Way... but Don't Overdose
Minimizing E_in(w) + (λ/N) w^T w with different λ's:
[Figure: four fits to the same data (legend: Data, Target, Fit). λ = 0: overfitting. λ = 0.0001: wow! λ = 0.01. λ = 1: underfitting.]
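An end-to-end sketch of this kind of λ sweep, using legendre_features and weight_decay_regression from above; the sample size, target, and noise level are illustrative assumptions, not the slides' exact experiment:

```python
rng = np.random.default_rng(0)
N, Q, sigma = 20, 15, 0.5

# Noisy data from a simple target.
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(np.pi * x) + sigma * rng.standard_normal(N)

Z = legendre_features(x, Q)
for lam in [0.0, 1e-4, 1e-2, 1.0]:
    w = weight_decay_regression(Z, y, lam)
    # The weight "budget" w^T w shrinks as lambda grows.
    print(f"lambda = {lam:g}: w'w = {w @ w:.3f}")
```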
Overfitting and Underfitting
[Figure: expected E_out versus the regularization parameter λ (0 to 2). E_out first drops from about 0.84 as λ cures the overfitting, reaches a minimum near 0.76, then rises again as too much regularization causes underfitting.]
More Noise Needs More Medicine
[Figure: expected E_out versus λ for stochastic noise levels σ² = 0, 0.25, and 0.5. The more noise, the more regularization is called for.]
...Even for Deterministic Noise
[Figure: expected E_out versus λ for target complexities Q_f = 15, 30, and 100, alongside the stochastic-noise panel from the previous slide. More deterministic noise also calls for more regularization.]
Variations on Weight Decay
Uniform weight decay: Ω(w) = Σ_{q=0}^{Q} w_q².
Low-order fit: Ω(w) = Σ_{q=0}^{Q} q w_q² (higher-order weights are penalized more).
Weight growth: Ω(w) = Σ_{q=0}^{Q} 1/w_q², which rewards larger weights!
[Figure: expected E_out versus λ under each penalty. Weight decay decreases E_out before underfitting sets in; weight growth only makes E_out worse as λ increases.]
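The quadratic penalties above share one closed form; here is a sketch with a diagonal penalty matrix (my generalization of the slides' sums; the weight-growth penalty is not quadratic and has no such closed form):

```python
def generalized_weight_decay(Z, y, lam, gamma):
    """Minimize (1/N)(||Zw - y||^2 + lam * sum_q gamma[q] * w[q]^2),
    giving w_reg = (Z^T Z + lam * diag(gamma))^{-1} Z^T y."""
    return np.linalg.solve(Z.T @ Z + lam * np.diag(gamma), Z.T @ y)

# Uniform weight decay: gamma = np.ones(Q + 1)
# Low-order fit:        gamma = np.arange(Q + 1)  # penalize high orders more
```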
Choosing a Regularizer: A Practitioner's Guide
The perfect regularizer constrains in the direction of the target function, but the target function is unknown (we are going around in circles).
The guiding principle: constrain in the direction of smoother (usually simpler) hypotheses. Smoothness hurts your ability to fit the high-frequency noise, and smaller weights usually mean smoother, simpler fits, so: weight decay, not weight growth.
What if you choose the wrong regularizer? You still have λ to play with ⟹ validation.
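A minimal sketch of using validation to pick λ (a simple hold-out split; the split size and the candidate grid are illustrative assumptions):

```python
def pick_lambda(x, y, Q, lambdas, n_val=5):
    """Choose the weight-decay parameter with the lowest hold-out error."""
    Z = legendre_features(x, Q)
    Z_tr, y_tr = Z[:-n_val], y[:-n_val]     # training portion
    Z_val, y_val = Z[-n_val:], y[-n_val:]   # held-out portion
    val_errs = [np.mean((Z_val @ weight_decay_regression(Z_tr, y_tr, lam) - y_val) ** 2)
                for lam in lambdas]
    return lambdas[int(np.argmin(val_errs))]

best_lam = pick_lambda(x, y, Q, lambdas=[0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])
```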
How Does Regularization Work?
Stochastic noise: nothing you can do about that. Good features: they help to reduce the deterministic noise.
Regularization helps to combat whatever noise remains, especially when N is small.
Typical modus operandi: sacrifice a little bias for a huge improvement in var.
VC angle: you are using a smaller H without sacrificing too much E_in.
Augmented Error as a Proxy for E_out
E_aug(h) = E_in(h) + (λ/N) Ω(h), where Ω(h) was w^T w.
Compare the VC bound: E_out(h) ≤ E_in(h) + Ω(H), where Ω(H) was O( √( (d_VC / N) ln N ) ).
E_aug can beat E_in as a proxy for E_out, depending on the choice of λ.