Regression. Hung-yi Lee 李宏毅

Regression: Output a scalar.
Stock market forecast: f = the Dow Jones Industrial Average tomorrow
Self-driving car: f = steering wheel angle
Recommendation: f = how likely user A is to buy product B

Example Application: Estimating the Combat Power (CP) of a Pokémon after evolution. Input x: x_cp (current CP), x_s (species), x_hp (hit points), x_w (weight), x_h (height). Output: f(x) = y, the CP after evolution.

Step 1: Model. A model is a set of functions f_1, f_2, ... For example y = b + w·x_cp, where the parameters w and b can take any value:
f_1: y = 10.0 + 9.0·x_cp
f_2: y = 9.8 + 9.2·x_cp
f_3: y = -0.8 - 1.2·x_cp
... and so on, infinitely many functions f, each mapping x to y, the CP after evolution. More generally, a linear model has the form y = b + Σ_i w_i·x_i, where the x_i (x_cp, x_hp, x_w, x_h) are features, the w_i are weights, and b is the bias.
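As a quick illustration of "a model is a set of functions", here is a minimal Python sketch; the function name `predict` and the sample numbers are mine, not from the slides.

```python
# Linear model y = b + sum_i w_i * x_i.
# Choosing concrete values for b and the weights w picks one function
# (f_1, f_2, ...) out of the infinitely many functions in the model.

def predict(x, b, w):
    """Return y = b + sum_i w[i] * x[i] for a feature vector x."""
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))

# f_1: y = 10.0 + 9.0 * x_cp, using only the CP feature
print(predict([50.0], b=10.0, w=[9.0]))   # 460.0
```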

Step 2: Goodness of Function. Model (a set of functions f_1, f_2, ...): y = b + w·x_cp. Training data: pairs of a function input x^1, x^2, ... and the corresponding function output (a scalar) y^1, y^2, ...

Step 2: Goodness of Function. Training data: 10 Pokémon, (x^1, y^1), (x^2, y^2), ..., (x^10, y^10), where x^n_cp is the CP before evolution and y^n the CP after evolution of the n-th Pokémon. (Scatter plot of y against x_cp.) This is real data. Source: https://www.openintro.org/stat/data/?data=pokemon

Step 2: Goodness of Function. Define a loss function L: its input is a function, its output is how bad that function is. Summing the estimation error over the examples:
L(f) = Σ_{n=1}^{10} (y^n - f(x^n_cp))^2
Since f is determined by w and b, this is
L(w, b) = Σ_{n=1}^{10} (y^n - (b + w·x^n_cp))^2
where b + w·x^n_cp is the estimated y based on the input function.

Step 2: Goodness of Function. Loss function: L(w, b) = Σ_{n=1}^{10} (y^n - (b + w·x^n_cp))^2. (Figure: each point in the (w, b) plane is a function; the color represents L(w, b). The loss is smallest in one region and very large for, e.g., y = -180 - 2·x_cp, a true example.)
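A minimal sketch of the loss computation, assuming the training set is stored as (x_cp, y) pairs; the pairs below are placeholders, not the actual Pokémon data.

```python
# Sum-of-squares loss L(w, b) = sum_n (y^n - (b + w * x_cp^n))^2:
# input a function (here, its parameters w and b), output how bad it is.

def loss(w, b, data):
    """data: list of (x_cp, y) training pairs."""
    return sum((y - (b + w * x_cp)) ** 2 for x_cp, y in data)

train = [(20.0, 150.0), (100.0, 280.0), (300.0, 600.0)]   # placeholder pairs
print(loss(w=2.7, b=-188.4, data=train))
```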

Step 3: Best Function. Pick the best function from the model:
f* = arg min_f L(f), i.e. w*, b* = arg min_{w,b} L(w, b) = arg min_{w,b} Σ_{n=1}^{10} (y^n - (b + w·x^n_cp))^2
Solve this with gradient descent.

Step 3: Gradient Descent. (Illustration: http://chico386.pixnet.net/album/photo/171572850) Consider a loss function L(w) with one parameter w, and solve w* = arg min_w L(w). (Randomly) pick an initial value w^0 and compute the derivative dL/dw at w = w^0: if it is negative, increase w; if it is positive, decrease w.

Step 3: Gradient Descent. (Randomly) pick an initial value w^0, compute dL/dw at w = w^0, and update
w^1 ← w^0 - η·(dL/dw)|_{w=w^0}
where η is called the learning rate.

Step 3: Gradient Descent. Repeat for many iterations:
w^1 ← w^0 - η·(dL/dw)|_{w=w^0}
w^2 ← w^1 - η·(dL/dw)|_{w=w^1}
...
After T iterations we reach some w^T, which may be a local minimum rather than the global minimum.
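A sketch of this one-parameter update rule on a toy quadratic loss (the loss itself is made up here, just to show the loop):

```python
# Gradient descent on L(w) with a single parameter w:
#   w^{t+1} = w^t - eta * dL/dw evaluated at w = w^t

def L(w):                 # toy loss, minimized at w = 3
    return (w - 3.0) ** 2

def dL_dw(w):             # its derivative
    return 2.0 * (w - 3.0)

eta = 0.1                 # learning rate
w = 0.0                   # (randomly) picked initial value w^0
for t in range(100):      # many iterations
    w = w - eta * dL_dw(w)
print(w, L(w))            # w ends up close to the minimizer 3.0
```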

Step 3: Gradient Descent. How about two parameters? Solve w*, b* = arg min_{w,b} L(w, b); the gradient is ∇L = [∂L/∂w, ∂L/∂b]^T.
(Randomly) pick initial values w^0, b^0.
Compute ∂L/∂w and ∂L/∂b at (w^0, b^0), then update:
w^1 ← w^0 - η·(∂L/∂w)|_{w=w^0, b=b^0},  b^1 ← b^0 - η·(∂L/∂b)|_{w=w^0, b=b^0}
Compute ∂L/∂w and ∂L/∂b at (w^1, b^1), then update:
w^2 ← w^1 - η·(∂L/∂w)|_{w=w^1, b=b^1},  b^2 ← b^1 - η·(∂L/∂b)|_{w=w^1, b=b^1}

Step 3: Gradient Descent. (Figure: contour plot over (b, w); the color is the value of the loss L(w, b). At each point we compute (∂L/∂b, ∂L/∂w) and take the step -η·(∂L/∂b, ∂L/∂w), i.e. move along the negative gradient.)

Step 3: Gradient Descent. When solving θ* = arg min_θ L(θ) by gradient descent: "Each time we update the parameters, we obtain a θ that makes L(θ) smaller, so L(θ^0) > L(θ^1) > L(θ^2) > ..." Is this statement correct?

Step 3: Gradient Descent. (Figure: a non-convex loss curve L(w) over the value of the parameter w.) Gradient descent can be very slow at a plateau (∂L/∂w ≈ 0), get stuck at a saddle point (∂L/∂w = 0), or get stuck at a local minimum (∂L/∂w = 0).

Step 3: Gradient Descent. Formulation of ∂L/∂w and ∂L/∂b. With
L(w, b) = Σ_{n=1}^{10} (y^n - (b + w·x^n_cp))^2
the partial derivatives are
∂L/∂w = Σ_{n=1}^{10} 2·(y^n - (b + w·x^n_cp))·(-x^n_cp)
∂L/∂b = Σ_{n=1}^{10} 2·(y^n - (b + w·x^n_cp))·(-1)
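These two formulas translate directly into code; a sketch of the two-parameter update with placeholder data (not the Pokémon numbers), just to show the shape of the computation:

```python
# dL/dw = sum_n 2 * (y^n - (b + w * x^n)) * (-x^n)
# dL/db = sum_n 2 * (y^n - (b + w * x^n)) * (-1)

def gradients(w, b, data):
    dw, db = 0.0, 0.0
    for x, y in data:
        err = y - (b + w * x)        # estimation error on example n
        dw += 2.0 * err * (-x)
        db += 2.0 * err * (-1.0)
    return dw, db

def gradient_descent(data, eta=1e-4, steps=100000):
    w, b = 0.0, 0.0                  # initial values w^0, b^0
    for _ in range(steps):
        dw, db = gradients(w, b, data)
        w, b = w - eta * dw, b - eta * db
    return w, b

train = [(10.0, 150.0), (20.0, 300.0), (30.0, 450.0)]   # placeholder (x_cp, y) pairs
print(gradient_descent(train))       # close to w = 15, b = 0 for this toy data
```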

Step 3: Gradient Descent

How are the results? For y = b + w·x_cp the best parameters are b = -188.4, w = 2.7. Average error on the training data: (1/10)·Σ_{n=1}^{10} e^n = 31.9, where e^n is the error on the n-th training example.

How are the results? Generalization: what we really care about is the error on new data (testing data). For y = b + w·x_cp with b = -188.4, w = 2.7, the average error on the testing data (another 10 Pokémon) is (1/10)·Σ_{n=1}^{10} e^n = 35.0, larger than the average error on the training data (31.9). How can we do better?

Selecting another Model: y = b + w_1·x_cp + w_2·(x_cp)^2. Best function: b = -10.3, w_1 = 1.0, w_2 = 2.7 × 10^-3. Training: average error = 15.4. Testing: average error = 18.4. Better! Could it be even better?

Selecting another Model: y = b + w_1·x_cp + w_2·(x_cp)^2 + w_3·(x_cp)^3. Best function: b = 6.4, w_1 = 0.66, w_2 = 4.3 × 10^-3, w_3 = -1.8 × 10^-6. Training: average error = 15.3. Testing: average error = 18.1. Slightly better. How about a more complex model?

Selecting another Model: y = b + w_1·x_cp + w_2·(x_cp)^2 + w_3·(x_cp)^3 + w_4·(x_cp)^4. Best function: training average error = 14.9, testing average error = 28.8. The results become worse.

Selecting another Model: y = b + w_1·x_cp + w_2·(x_cp)^2 + w_3·(x_cp)^3 + w_4·(x_cp)^4 + w_5·(x_cp)^5. Best function: training average error = 12.8, testing average error = 232.1. The results are very bad.

Model Selection (on the training data):
1. y = b + w·x_cp
2. y = b + w_1·x_cp + w_2·(x_cp)^2
3. y = b + w_1·x_cp + w_2·(x_cp)^2 + w_3·(x_cp)^3
4. y = b + w_1·x_cp + w_2·(x_cp)^2 + w_3·(x_cp)^3 + w_4·(x_cp)^4
5. y = b + w_1·x_cp + w_2·(x_cp)^2 + w_3·(x_cp)^3 + w_4·(x_cp)^4 + w_5·(x_cp)^5
A more complex model yields lower error on the training data, if we can truly find the best function.

Model Selection: Overfitting

Model  Training  Testing
1      31.9      35.0
2      15.4      18.4
3      15.3      18.1
4      14.9      28.2
5      12.8      232.1

A more complex model does not always lead to better performance on the testing data. This is overfitting. Select a suitable model.
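A sketch of how one could reproduce this kind of train/test comparison for polynomial models of degree 1 to 5. It uses numpy's least-squares solver instead of the slides' gradient descent, and synthetic data instead of the Pokémon set, so the numbers will differ; the typical pattern is that training error keeps shrinking while testing error eventually blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=10):
    x = rng.uniform(10, 600, n)
    y = 2.7 * x - 188.4 + rng.normal(0, 30, n)   # synthetic stand-in for the CP data
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()

def fit_poly(x, y, degree):
    """Least-squares fit of y = b + w_1*x + ... + w_degree*x^degree."""
    X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ...
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def avg_error(coef, x, y):
    pred = np.vander(x, len(coef), increasing=True) @ coef
    return np.mean(np.abs(pred - y))

for degree in range(1, 6):
    coef = fit_poly(x_train, y_train, degree)
    print(degree, avg_error(coef, x_train, y_train), avg_error(coef, x_test, y_test))
```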

Let's collect more data. There are some hidden factors not considered in the previous model.

What are the hidden factors? The species: Eevee, Pidgey, Weedle, Caterpie.

Back to Step 1: Redesign the Model. Let x_s be the species of x.
If x_s = Pidgey:   y = b_1 + w_1·x_cp
If x_s = Weedle:   y = b_2 + w_2·x_cp
If x_s = Caterpie: y = b_3 + w_3·x_cp
If x_s = Eevee:    y = b_4 + w_4·x_cp
Is this still a linear model y = b + Σ_i w_i·x_i?

Back to Step 1: Redesign the Model. The branching model can be written as a single expression:
y = b_1·δ(x_s = Pidgey) + w_1·δ(x_s = Pidgey)·x_cp
  + b_2·δ(x_s = Weedle) + w_2·δ(x_s = Weedle)·x_cp
  + b_3·δ(x_s = Caterpie) + w_3·δ(x_s = Caterpie)·x_cp
  + b_4·δ(x_s = Eevee) + w_4·δ(x_s = Eevee)·x_cp
where δ(x_s = Pidgey) = 1 if x_s = Pidgey and 0 otherwise. For example, if x_s = Pidgey only the first two terms are nonzero, so y = b_1 + w_1·x_cp. With the δ terms treated as features, this is still a linear model y = b + Σ_i w_i·x_i (as sketched below).
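One way to see that this is still linear is to stack the δ terms into a feature vector; a small sketch (the feature layout and parameter values are mine, for illustration):

```python
SPECIES = ["Pidgey", "Weedle", "Caterpie", "Eevee"]

def features(x_cp, x_s):
    """Feature vector [delta(x_s = s), delta(x_s = s) * x_cp] for each species s."""
    phi = []
    for s in SPECIES:
        d = 1.0 if x_s == s else 0.0
        phi += [d, d * x_cp]
    return phi

def predict(phi, params):
    """Linear model y = sum_i params[i] * phi[i]; params = [b1, w1, b2, w2, b3, w3, b4, w4]."""
    return sum(p * f for p, f in zip(params, phi))

params = [10.0, 9.0, 20.0, 8.0, 15.0, 7.5, 5.0, 9.5]   # illustrative values
# For a Pidgey only b1 and w1 contribute: y = b1 + w1 * x_cp
print(predict(features(50.0, "Pidgey"), params))        # 10.0 + 9.0 * 50 = 460.0
```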

Results: average error = 3.8 on the training data, average error = 14.3 on the testing data.

Are there any other hidden factors? (Scatter plots of CP after evolution against weight, height, and HP.)

Back to Step 1: Redesign the Model Again.
If x_s = Pidgey:   y' = b_1 + w_1·x_cp + w_5·(x_cp)^2
If x_s = Weedle:   y' = b_2 + w_2·x_cp + w_6·(x_cp)^2
If x_s = Caterpie: y' = b_3 + w_3·x_cp + w_7·(x_cp)^2
If x_s = Eevee:    y' = b_4 + w_4·x_cp + w_8·(x_cp)^2
y = y' + w_9·x_hp + w_10·(x_hp)^2 + w_11·x_h + w_12·(x_h)^2 + w_13·x_w + w_14·(x_w)^2
Training error = 1.9, testing error = 102.3. Overfitting!

Back to Step 2: Regularization. For y = b + Σ_i w_i·x_i, redefine the loss as
L = Σ_n (y^n - (b + Σ_i w_i·x^n_i))^2 + λ·Σ_i (w_i)^2
Functions with smaller w_i are better. Smaller w_i means a smoother function: if the input changes by Δx_i, the output y = b + Σ_i w_i·(x_i + Δx_i) changes only by Σ_i w_i·Δx_i. We believe a smoother function is more likely to be correct. Do you have to apply regularization to the bias? (The bias only shifts the function up or down and does not affect its smoothness.)
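The regularized loss just adds the λ term, and its gradient with respect to each w_i gains an extra 2·λ·w_i (the bias is left out, matching the question above); a small sketch with made-up data:

```python
# L(w, b) = sum_n (y^n - (b + sum_i w_i * x_i^n))^2 + lam * sum_i w_i^2
# dL/dw_i gains the extra term 2 * lam * w_i; the bias b is not regularized.

def reg_loss_and_grads(w, b, data, lam):
    L, dw, db = 0.0, [0.0] * len(w), 0.0
    for x, y in data:                     # x: feature vector, y: target scalar
        err = y - (b + sum(wi * xi for wi, xi in zip(w, x)))
        L += err ** 2
        for i, xi in enumerate(x):
            dw[i] += 2.0 * err * (-xi)
        db += 2.0 * err * (-1.0)
    L += lam * sum(wi ** 2 for wi in w)
    dw = [dwi + 2.0 * lam * wi for dwi, wi in zip(dw, w)]
    return L, dw, db

data = [([10.0], 150.0), ([20.0], 300.0)]   # made-up (features, target) pairs
print(reg_loss_and_grads([15.0], 0.0, data, lam=100.0))   # (22500.0, [3000.0], 0.0)
```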

Regularization: as λ increases, the learned function becomes smoother.

λ        Training  Testing
0        1.9       102.3
1        2.3       68.7
10       3.5       25.7
100      4.1       11.1
1000     5.6       12.8
10000    6.3       18.7
100000   8.5       26.8

How smooth should the function be? Select the λ that gives the best model (here λ = 100, with testing error 11.1). The training error increases with λ: a larger λ means the training error is considered less. We prefer a smooth function, but don't make it too smooth.

Conclusion. Pokémon: the original CP and the species almost determine the CP after evolution, but there are probably other hidden factors. Gradient descent: more theory and tips in the following lectures. We finally get an average error of 11.1 on the testing data; how about new data? Larger error? Lower error? Next lecture: where does the error come from? More theory about overfitting and regularization, and the concept of validation.

Reference: Bishop, Chapter 1.1

Acknowledgments: thanks to 鄭凱文 and 童寬 for spotting notation errors in the slides, and to 黃振綸 for spotting an incorrect video link on the course webpage.