Regression Hung-yi Lee 李宏毅
Regression: Output a scalar
Stock Market Forecast: f(…) = Dow Jones Industrial Average tomorrow
Self-driving Car: f(…) = steering wheel angle
Recommendation: f(user A, product B) = probability that user A purchases product B
Example Application: Estimating the Combat Power (CP) of a Pokémon after evolution
f(x) = y, where y is the CP after evolution
Attributes of a Pokémon x: $x_{cp}$ (CP before evolution), $x_s$ (species), $x_{hp}$ (hit points), $x_w$ (weight), $x_h$ (height)
Step 1: Model
A set of functions (the Model): $f_1, f_2, \dots$
$y = b + w \cdot x_{cp}$, where $w$ and $b$ are parameters (can be any value)
$f_1: y = 10.0 + 9.0 \cdot x_{cp}$
$f_2: y = 9.8 + 9.2 \cdot x_{cp}$
$f_3: y = -0.8 - 1.2 \cdot x_{cp}$
… infinitely many functions $f(x_{cp}) = y$, the CP after evolution
Linear model: $y = b + \sum_i w_i x_i$
$x_i$: features ($x_{cp}, x_{hp}, x_w, x_h$); $w_i$: weights; $b$: bias
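To make the linear model concrete, here is a minimal sketch in Python (NumPy assumed; the feature values and the use of a single $x_{cp}$ feature are illustrative):

```python
import numpy as np

def linear_model(x, w, b):
    """Linear model y = b + sum_i w_i * x_i.

    x: feature vector, e.g. [x_cp, x_hp, x_w, x_h]
    w: weight vector of the same length; b: bias (scalar)
    """
    return b + np.dot(w, x)

# One function from the model space, f1: y = 10.0 + 9.0 * x_cp
y = linear_model(np.array([30.0]), np.array([9.0]), b=10.0)  # x_cp = 30
print(y)  # 280.0
```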
Step 2: Goodness of Function
Model: a set of functions $f_1, f_2, \dots$, e.g. $y = b + w \cdot x_{cp}$
Training Data: pairs of function input and output
function input: $x^1$, function output (scalar): $\hat{y}^1$
function input: $x^2$, function output (scalar): $\hat{y}^2$
…
Step 2: Goodness of Function
Training Data: 10 Pokémons
$(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots, (x^{10}, \hat{y}^{10})$
[Figure: scatter plot of the CP after evolution $\hat{y}$ versus $x_{cp}$.] This is real data.
Source: https://www.openintro.org/stat/data/?data=pokemon
Step 2: Goodness of Function
Model: $y = b + w \cdot x_{cp}$
Goodness of function: measure each $f$ on the Training Data with a loss function $L$
Input of $L$: a function; output: how bad it is
$L(f) = L(w, b) = \sum_{n=1}^{10} \left( \hat{y}^n - (b + w \cdot x^n_{cp}) \right)^2$
i.e. the sum over examples of the squared estimation error: the true value $\hat{y}^n$ minus the $y$ estimated by the input function, $b + w \cdot x^n_{cp}$
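A minimal sketch of this loss in Python (NumPy assumed; the two arrays are hypothetical stand-ins for the 10 training pairs, not the real dataset):

```python
import numpy as np

# Hypothetical training data: CP before evolution, and true CP after evolution.
x_cp  = np.array([10., 52., 94., 136., 178., 220., 262., 304., 346., 388.])
y_hat = np.array([50., 180., 310., 300., 500., 600., 650., 700., 900., 950.])

def loss(w, b):
    """L(w, b): sum over the 10 examples of the squared estimation error."""
    residual = y_hat - (b + w * x_cp)
    return np.sum(residual ** 2)

print(loss(9.0, 10.0))  # how bad f1 is on this data
```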
Step 2: Goodness of Function
Loss Function: $L(w, b) = \sum_{n=1}^{10} \left( \hat{y}^n - (b + w \cdot x^n_{cp}) \right)^2$
[Figure: the $(w, b)$ plane. Each point in the figure is a function; the color represents $L(w, b)$. For example, $y = -180 - 2 \cdot x_{cp}$ has a very large loss (a true example); the marked point has the smallest loss.]
Step 3: Best Function
Pick the best function from the model: $f^* = \arg\min_f L(f)$, i.e.
$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{n=1}^{10} \left( \hat{y}^n - (b + w \cdot x^n_{cp}) \right)^2$
Solved by Gradient Descent
Step 3: Gradient Descent
Consider a loss function $L(w)$ with one parameter $w$: $w^* = \arg\min_w L(w)$
(Randomly) pick an initial value $w^0$
Compute $\left.\frac{dL}{dw}\right|_{w=w^0}$
Negative slope: increase $w$; positive slope: decrease $w$
Step 3: Gradient Descent
Update the parameter: $w^1 \leftarrow w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$
$\eta$ is called the learning rate
Step 3: Gradient Descent
Compute $\left.\frac{dL}{dw}\right|_{w=w^0}$, update $w^1 \leftarrow w^0 - \eta \left.\frac{dL}{dw}\right|_{w=w^0}$
Compute $\left.\frac{dL}{dw}\right|_{w=w^1}$, update $w^2 \leftarrow w^1 - \eta \left.\frac{dL}{dw}\right|_{w=w^1}$
… many iterations, until reaching some $w^T$
[Figure: the iterates $w^0, w^1, w^2, \dots, w^T$ walking down the loss curve; they may stop at a local minimum instead of the global minimum.]
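A minimal sketch of this one-parameter loop in Python (the loss and its derivative here are hypothetical stand-ins, not the Pokémon loss):

```python
def gradient_descent_1d(dL_dw, w0, eta=0.1, iterations=100):
    """Repeatedly step against the derivative: w <- w - eta * dL/dw."""
    w = w0
    for _ in range(iterations):
        w = w - eta * dL_dw(w)
    return w

# Hypothetical example: L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3).
w_star = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)  # approaches 3, the minimizer
```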
Step 3: Gradient Descent
How about two parameters? $w^*, b^* = \arg\min_{w,b} L(w, b)$
The gradient collects both partial derivatives: $\nabla L = \begin{bmatrix} \partial L/\partial w \\ \partial L/\partial b \end{bmatrix}$
(Randomly) pick initial values $w^0, b^0$
Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$ and $\left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$, then update
$w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$, $\quad b^1 \leftarrow b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$
Compute the partial derivatives again at $(w^1, b^1)$, then update
$w^2 \leftarrow w^1 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^1, b=b^1}$, $\quad b^2 \leftarrow b^1 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^1, b=b^1}$
Step 3: Gradient Descent
[Figure: contour plot over the $(b, w)$ plane; the color shows the value of the loss $L(w, b)$. Each update moves by $(-\eta\,\partial L/\partial b,\, -\eta\,\partial L/\partial w)$, i.e. along the negative gradient.]
Step 3: Gradient Descent
When solving $\theta^* = \arg\min_\theta L(\theta)$ by gradient descent:
"Each time we update the parameters, we obtain a $\theta$ that makes $L(\theta)$ smaller: $L(\theta^0) > L(\theta^1) > L(\theta^2) > \dots$"
Is this statement correct? (Not necessarily; for example, a too-large learning rate can make the loss increase.)
Step 3: Gradient Descent
[Figure: a one-dimensional loss curve $L(w)$ illustrating the pitfalls of gradient descent.]
Very slow at a plateau ($\partial L/\partial w \approx 0$)
Stuck at a saddle point ($\partial L/\partial w = 0$)
Stuck at a local minimum ($\partial L/\partial w = 0$)
Step 3: Gradient Descent
Formulation of $\partial L/\partial w$ and $\partial L/\partial b$ for
$L(w, b) = \sum_{n=1}^{10} \left( \hat{y}^n - (b + w \cdot x^n_{cp}) \right)^2$:
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left( \hat{y}^n - (b + w \cdot x^n_{cp}) \right)\left( -x^n_{cp} \right)$
$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left( \hat{y}^n - (b + w \cdot x^n_{cp}) \right)\left( -1 \right)$
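Putting the pieces together, a minimal sketch of this two-parameter gradient descent in Python (NumPy assumed; the training arrays, learning rate, and iteration count are hypothetical, illustrative choices, not values from the lecture):

```python
import numpy as np

def fit_linear(x_cp, y_hat, eta=1e-6, iterations=100_000):
    """Gradient descent on L(w, b) with the partial derivatives derived above."""
    w, b = 0.0, 0.0                            # (randomly) picked initial values
    for _ in range(iterations):
        residual = y_hat - (b + w * x_cp)      # per-example estimation error
        grad_w = np.sum(2 * residual * -x_cp)  # dL/dw
        grad_b = np.sum(2 * residual * -1.0)   # dL/db
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Hypothetical stand-in for the 10 training Pokémons.
x_cp = np.linspace(20, 400, 10)
y_hat = 2.7 * x_cp - 188.4
w, b = fit_linear(x_cp, y_hat)
print(w, b)  # w approaches 2.7; b drifts toward -188.4 only slowly
```

A small fixed $\eta$ is needed here because $x_{cp}$ is on the order of hundreds; the following lectures cover tips for setting the learning rate.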
How are the results?
$y = b + w \cdot x_{cp}$ with $b = -188.4$, $w = 2.7$
Average error on the training data: $\frac{1}{10} \sum_{n=1}^{10} e^n = 31.9$
[Figure: the fitted line over the training data; $e^n$ is the error of the $n$-th example, e.g. $e^1, e^2, \dots$]
How are the results? - Generalization
What we really care about is the error on new data (testing data).
Another 10 Pokémons serve as testing data.
$y = b + w \cdot x_{cp}$ with $b = -188.4$, $w = 2.7$
Average error on the testing data: $\frac{1}{10} \sum_{n=1}^{10} e^n = 35.0$ > average error on the training data (31.9)
How can we do better?
Selecting another Model
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$
Best function: $b = -10.3$, $w_1 = 1.0$, $w_2 = 2.7 \times 10^{-3}$
Training: average error = 15.4
Testing: average error = 18.4. Better! Could it be even better?
Selecting another Model
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3$
Best function: $b = 6.4$, $w_1 = 0.66$, $w_2 = 4.3 \times 10^{-3}$, $w_3 = -1.8 \times 10^{-6}$
Training: average error = 15.3
Testing: average error = 18.1. Slightly better. How about a more complex model?
Selecting another Model
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4$
Best function: training average error = 14.9
Testing: average error = 28.8. The results become worse …
Selecting another Model
$y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2 + w_3 \cdot (x_{cp})^3 + w_4 \cdot (x_{cp})^4 + w_5 \cdot (x_{cp})^5$
Best function: training average error = 12.8
Testing: average error = 232.1. The results are very bad.
Model Selection
Five models, fitted on the Training Data:
1. $y = b + w \cdot x_{cp}$
2. $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$
3. $y = b + w_1 \cdot x_{cp} + \dots + w_3 \cdot (x_{cp})^3$
4. $y = b + w_1 \cdot x_{cp} + \dots + w_4 \cdot (x_{cp})^4$
5. $y = b + w_1 \cdot x_{cp} + \dots + w_5 \cdot (x_{cp})^5$
A more complex model yields lower error on the training data, if we can truly find the best function.
Model Selection: Overfitting
Model  Training  Testing
1      31.9      35.0
2      15.4      18.4
3      15.3      18.1
4      14.9      28.2
5      12.8      232.1
A more complex model does not always lead to better performance on testing data. This is Overfitting. Select a suitable model.
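A minimal sketch of this model-selection experiment in Python (NumPy assumed; the train/test arrays are hypothetical placeholders, and `np.polyfit` stands in for "truly finding the best function" at each degree):

```python
import numpy as np

# Hypothetical stand-in for the 10 training and 10 testing Pokémons.
rng = np.random.default_rng(0)
x_train = rng.uniform(20, 400, 10)
y_train = 2.7 * x_train - 188.4 + rng.normal(0, 30, 10)
x_test = rng.uniform(20, 400, 10)
y_test = 2.7 * x_test - 188.4 + rng.normal(0, 30, 10)

def avg_error(coeffs, x, y):
    """Average absolute error of a fitted polynomial on data (x, y)."""
    return np.mean(np.abs(y - np.polyval(coeffs, x)))

for degree in range(1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)  # "best function" per model
    print(degree, avg_error(coeffs, x_train, y_train),
          avg_error(coeffs, x_test, y_test))
```

Training error shrinks as the degree grows, while testing error eventually blows up: the same pattern as the table above.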
Let's collect more data
There are some hidden factors not considered in the previous model.
What are the hidden factors? Eevee, Pidgey, Weedle, Caterpie
Back to step 1: Redesign the Model
$x_s$ = species of $x$
If $x_s$ = Pidgey: $y = b_1 + w_1 \cdot x_{cp}$
If $x_s$ = Weedle: $y = b_2 + w_2 \cdot x_{cp}$
If $x_s$ = Caterpie: $y = b_3 + w_3 \cdot x_{cp}$
If $x_s$ = Eevee: $y = b_4 + w_4 \cdot x_{cp}$
Is this still a linear model $y = b + \sum_i w_i x_i$?
Back to step 1: Redesign the Model
It is, written with indicator functions:
$y = b_1 \cdot \delta(x_s = \text{Pidgey}) + w_1 \cdot \delta(x_s = \text{Pidgey}) \cdot x_{cp}$
$+\, b_2 \cdot \delta(x_s = \text{Weedle}) + w_2 \cdot \delta(x_s = \text{Weedle}) \cdot x_{cp}$
$+\, b_3 \cdot \delta(x_s = \text{Caterpie}) + w_3 \cdot \delta(x_s = \text{Caterpie}) \cdot x_{cp}$
$+\, b_4 \cdot \delta(x_s = \text{Eevee}) + w_4 \cdot \delta(x_s = \text{Eevee}) \cdot x_{cp}$
where $\delta(x_s = \text{Pidgey}) = 1$ if $x_s$ = Pidgey, and $= 0$ otherwise.
E.g. if $x_s$ = Pidgey, only the first row survives: $y = b_1 + w_1 \cdot x_{cp}$
So this is still the linear model $y = b + \sum_i w_i x_i$, with indicator-based features.
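A minimal sketch of this indicator-function model in Python (the species list matches the slide; laying the features out as one delta and one delta-times-$x_{cp}$ per species is one natural encoding, and the parameter values below are hypothetical):

```python
import numpy as np

SPECIES = ["Pidgey", "Weedle", "Caterpie", "Eevee"]

def features(x_cp, x_s):
    """Features [delta(s), delta(s) * x_cp for each species]; model stays linear."""
    delta = np.array([1.0 if x_s == s else 0.0 for s in SPECIES])
    return np.concatenate([delta, delta * x_cp])  # pairs with [b1..b4, w1..w4]

params = np.array([1, 2, 3, 4, 0.5, 0.6, 0.7, 0.8])  # hypothetical [b1..b4, w1..w4]
x = features(30.0, "Pidgey")
y = np.dot(params, x)  # reduces to b1 + w1 * x_cp = 1 + 0.5 * 30 = 16
```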
Results: average error = 3.8 on the training data, and 14.3 on the testing data.
[Figure: one fitted line per species, over the training and testing data.]
Are there any other hidden factors?
[Figure: three scatter plots of CP after evolution versus weight, height, and HP.]
Back to step 1: Redesign the Model Again
If $x_s$ = Pidgey: $y' = b_1 + w_1 \cdot x_{cp} + w_5 \cdot (x_{cp})^2$
If $x_s$ = Weedle: $y' = b_2 + w_2 \cdot x_{cp} + w_6 \cdot (x_{cp})^2$
If $x_s$ = Caterpie: $y' = b_3 + w_3 \cdot x_{cp} + w_7 \cdot (x_{cp})^2$
If $x_s$ = Eevee: $y' = b_4 + w_4 \cdot x_{cp} + w_8 \cdot (x_{cp})^2$
$y = y' + w_9 \cdot x_{hp} + w_{10} \cdot (x_{hp})^2 + w_{11} \cdot x_h + w_{12} \cdot (x_h)^2 + w_{13} \cdot x_w + w_{14} \cdot (x_w)^2$
Training error = 1.9, testing error = 102.3. Overfitting!
Back to step 2: Regularization
For $y = b + \sum_i w_i x_i$, add a penalty term to the loss:
$L = \sum_n \left( \hat{y}^n - (b + \sum_i w_i x^n_i) \right)^2 + \lambda \sum_i (w_i)^2$
The functions with smaller $w_i$ are better: smaller $w_i$ means a smoother function, because when the input changes by $\Delta x_i$, the output only changes by $\sum_i w_i \Delta x_i$:
$y + \sum_i w_i \Delta x_i = b + \sum_i w_i (x_i + \Delta x_i)$
We believe a smoother function is more likely to be correct.
Do you have to apply regularization to the bias? (No: the bias only shifts the function up or down; it does not affect smoothness.)
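A minimal sketch of the regularized loss and its gradients in Python (NumPy assumed; `X` is a hypothetical design matrix with one row of features per example, and the bias is deliberately left out of the penalty, as discussed above):

```python
import numpy as np

def regularized_loss(w, b, X, y_hat, lam):
    """Squared error plus lam * sum_i w_i^2; the bias b is not penalized."""
    residual = y_hat - (b + X @ w)
    return np.sum(residual ** 2) + lam * np.sum(w ** 2)

def gradients(w, b, X, y_hat, lam):
    """Partial derivatives for gradient descent on the regularized loss."""
    residual = y_hat - (b + X @ w)
    grad_w = -2 * X.T @ residual + 2 * lam * w  # penalty adds 2 * lam * w_i
    grad_b = -2 * np.sum(residual)              # no penalty term for the bias
    return grad_w, grad_b
```

Sweeping `lam` over values like those in the table below, and comparing the testing error at each setting, is how "how smooth" gets decided.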
Regularization
λ        Training  Testing
0        1.9       102.3
1        2.3       68.7
10       3.5       25.7
100      4.1       11.1
1000     5.6       12.8
10000    6.3       18.7
100000   8.5       26.8
A larger λ gives a smoother function, and a larger training error, because the training-error term is weighted less.
How smooth should it be? Select the λ that obtains the best model.
We prefer a smooth function, but it should not be too smooth.
Conclusion
Pokémon: original CP and species almost determine the CP after evolution; there are probably other hidden factors.
Gradient descent: more theory and tips in the following lectures.
We finally got an average error of 11.1 on the testing data. How about new data? Larger error? Lower error?
Next lecture: Where does the error come from? More theory about overfitting and regularization, and the concept of validation.
Reference Bishop: Chapter 1.1
Acknowledgment
Thanks to 鄭凱文 for finding notation errors on the slides.
Thanks to 童寬 for finding notation errors on the slides.
Thanks to 黃振綸 for finding a broken video link on the course webpage.