Machine Learning: Linear Models
Outline II - Linear Models
1. Linear Regression
   (a) Linear regression: history
   (b) Linear regression with least squares
   (c) Matrix representation and the Normal Equation method
   (d) Iterative method: gradient descent
   (e) Pros and cons of both methods
2. Linear Classification: Perceptron
   (a) Definition and history
   (b) Example
   (c) Algorithm
Supervised Learning

Training data: examples $x$ with labels $y$:
$$(x_1, y_1), \dots, (x_n, y_n), \quad x_i \in \mathbb{R}^d$$

Regression: $y$ is a real value, $y \in \mathbb{R}$. We learn $f : \mathbb{R}^d \to \mathbb{R}$; $f$ is called a regressor.

Classification: $y$ is discrete. To simplify, $y \in \{-1, +1\}$. We learn $f : \mathbb{R}^d \to \{-1, +1\}$; $f$ is called a binary classifier.
Linear Regression: History

A very popular technique, rooted in statistics. The method of least squares was used as early as 1795 by Gauss and was independently published in 1805 by Legendre. It was frequently applied in astronomy to study the large-scale structure of the universe, and it is still a very useful tool today.

[Figure: portrait of Carl Friedrich Gauss.]
Linear Regression

Given training data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$:
$$
\begin{array}{lcccc|c}
\text{example } x_1: & x_{11} & x_{12} & \cdots & x_{1d} & y_1 \\
\vdots & & & & & \vdots \\
\text{example } x_i: & x_{i1} & x_{i2} & \cdots & x_{id} & y_i \\
\vdots & & & & & \vdots \\
\text{example } x_n: & x_{n1} & x_{n2} & \cdots & x_{nd} & y_n \\
\end{array}
$$
where the last column holds the labels.

Task: learn a regression function $f : \mathbb{R}^d \to \mathbb{R}$ with $f(x) = y$.

Linear Regression: a regression model is said to be linear if it is represented by a linear function.
Y #" X 2!" X 1 d = 1, line in R 2 d = 2, hyperplane is R 3 Credit: Introduction to Statistical Learning.
Linear Regression Model:
$$f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j, \qquad \beta_j \in \mathbb{R},\ j \in \{1, \dots, d\}$$

The $\beta$'s are called parameters, coefficients, or weights. Learning the linear model means learning the $\beta$'s.

Estimation with least squares. Use the squared-error loss:
$$\ell(y_i, f(x_i)) = (y_i - f(x_i))^2$$

We want to minimize the loss over all examples, that is, minimize the risk (or cost function) $R$:
$$R = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
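As a minimal sketch of this cost function (assuming NumPy; the function name `risk` and the convention that the first entry of `beta` is the intercept are mine, not from the slides):

```python
import numpy as np

def risk(beta, X, y):
    """Least-squares risk R = 1/(2n) * sum_i (y_i - f(x_i))^2.

    X is an (n, d) feature matrix, y an (n,) label vector, and beta a
    (d+1,)-vector whose first entry is beta_0 (the intercept).
    """
    n = X.shape[0]
    predictions = beta[0] + X @ beta[1:]   # f(x_i) = beta_0 + sum_j beta_j x_ij
    residuals = y - predictions
    return (residuals @ residuals) / (2 * n)
```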
A simple case with one feature ($d = 1$):
$$f(x) = \beta_0 + \beta_1 x$$

We want to find $\beta_0$ and $\beta_1$ that minimize:
$$R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
[Figure: contour and surface plots of the residual sum of squares (RSS) as a function of $\beta_0$ and $\beta_1$. Credit: Introduction to Statistical Learning.]
Find $\beta_0$ and $\beta_1$ so that:
$$\operatorname*{argmin}_{\beta_0, \beta_1} \ \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Minimize $R(\beta_0, \beta_1)$, that is, solve:
$$\frac{\partial R}{\partial \beta_0} = 0 \qquad \frac{\partial R}{\partial \beta_1} = 0$$

For $\beta_0$:
$$\frac{\partial R}{\partial \beta_0} = \frac{1}{2n} \sum_{i=1}^{n} 2\,(y_i - \beta_0 - \beta_1 x_i)\,\frac{\partial (y_i - \beta_0 - \beta_1 x_i)}{\partial \beta_0}
= \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-1) = 0$$

which gives:
$$\beta_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i$$
Similarly, for $\beta_1$:
$$\frac{\partial R}{\partial \beta_1} = \frac{1}{2n} \sum_{i=1}^{n} 2\,(y_i - \beta_0 - \beta_1 x_i)\,\frac{\partial (y_i - \beta_0 - \beta_1 x_i)}{\partial \beta_1}
= \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0$$

so that:
$$\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i$$

Plugging $\beta_0$ into the equation for $\beta_1$:
$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2}$$
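A small sketch of these closed-form estimates for $d = 1$ (the function name `fit_simple` and the toy data are illustrative, not from the slides):

```python
import numpy as np

def fit_simple(x, y):
    """Closed-form least-squares fit for f(x) = beta_0 + beta_1 * x."""
    n = len(x)
    beta_1 = (np.sum(x * y) - np.sum(y) * np.sum(x) / n) / \
             (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    beta_0 = np.mean(y) - beta_1 * np.mean(x)
    return beta_0, beta_1

# Points lying exactly on y = 2x + 1 recover beta_0 = 1, beta_1 = 2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(fit_simple(x, y))
```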
With more than one feature, find the $\beta_j$'s of
$$f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j$$
that minimize:
$$R = \frac{1}{2n} \sum_{i=1}^{n} \Bigl(y_i - \bigl(\beta_0 + \sum_{j=1}^{d} \beta_j x_{ij}\bigr)\Bigr)^2$$

Let's write it more elegantly with matrices!
Matrix representation

Let $X$ be an $n \times (d+1)$ matrix where each row starts with a 1 followed by a feature vector. Let $y$ be the label vector of the training set. Let $\beta$ be the vector of weights (that we want to estimate!).
$$
X := \begin{pmatrix}
1 & x_{11} & \cdots & x_{1j} & \cdots & x_{1d} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{i1} & \cdots & x_{ij} & \cdots & x_{id} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{nj} & \cdots & x_{nd}
\end{pmatrix}
\qquad
y := \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
\qquad
\beta := \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_d \end{pmatrix}
$$
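As a quick illustration (assuming NumPy; the toy data and variable names are mine), the design matrix $X$ with its leading column of ones can be built like this:

```python
import numpy as np

features = np.array([[2.0, 3.0],
                     [1.0, 0.5],
                     [4.0, 1.0]])   # n = 3 examples, d = 2 features
y = np.array([5.0, 2.0, 6.0])       # labels

# Prepend a column of ones so that beta_0 acts as the intercept.
X = np.hstack([np.ones((features.shape[0], 1)), features])
print(X.shape)   # (3, 3), i.e., n x (d + 1)
```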
Normal Equation

We want to find the $(d+1)$ $\beta$'s that minimize $R$. In matrix form:
$$R(\beta) = \frac{1}{2n} \lVert y - X\beta \rVert^2 = \frac{1}{2n} (y - X\beta)^T (y - X\beta)$$

We have:
$$\frac{\partial R}{\partial \beta} = -\frac{1}{n} X^T (y - X\beta)
\qquad
\frac{\partial^2 R}{\partial \beta^2} = \frac{1}{n} X^T X$$

$X^T X$ is positive definite (when $X$ has full column rank), which ensures that the stationary point is a minimum. We solve:
$$X^T (y - X\beta) = 0$$

The unique solution is:
$$\beta = (X^T X)^{-1} X^T y$$
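A minimal sketch of the normal-equation solution, assuming $X$ already contains the column of ones and $X^T X$ is invertible (solving the linear system is preferred over forming the inverse explicitly; the function name and toy data are mine):

```python
import numpy as np

def normal_equation(X, y):
    """Least-squares estimate: solve (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny check: data generated from y = 1 + 2*x1 + 3*x2 is recovered exactly.
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = np.hstack([np.ones((4, 1)), features])
y = 1 + 2 * features[:, 0] + 3 * features[:, 1]
print(normal_equation(X, y))   # approximately [1. 2. 3.]
```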
Gradient descent!"#$%&'() ' 4.5' +,-./0$'12,3$' *$%&'( '
Gradient descent

Gradient Descent is an iterative optimization method.

Repeat until convergence, updating all $\beta_j$ simultaneously (here for $j = 0$ and $j = 1$):
$$\beta_0 := \beta_0 - \alpha \frac{\partial}{\partial \beta_0} R(\beta_0, \beta_1)
\qquad
\beta_1 := \beta_1 - \alpha \frac{\partial}{\partial \beta_1} R(\beta_0, \beta_1)$$

$\alpha$ is the learning rate.
Gradient descent

In the linear case ($d = 1$), the partial derivatives are:
$$\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)
\qquad
\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)\, x_i$$

Let's plug them into the updates. Repeat until convergence, updating $\beta_0$ and $\beta_1$ simultaneously:
$$\beta_0 := \beta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)
\qquad
\beta_1 := \beta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)\, x_i$$
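A compact sketch of these updates for the one-feature case (the learning rate `alpha=0.1` and `n_iters=1000` are arbitrary illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for f(x) = beta_0 + beta_1 * x."""
    n = len(x)
    beta_0, beta_1 = 0.0, 0.0
    for _ in range(n_iters):
        error = beta_0 + beta_1 * x - y    # (f(x_i) - y_i) for all i
        grad_0 = np.mean(error)            # dR/dbeta_0
        grad_1 = np.mean(error * x)        # dR/dbeta_1
        beta_0 -= alpha * grad_0           # simultaneous update
        beta_1 -= alpha * grad_1
    return beta_0, beta_1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(gradient_descent(x, y))   # close to (1.0, 2.0)
```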
Pros and Cons

Analytical approach (Normal Equation):
+ No need to choose a learning rate or to iterate.
- Works only if $X^T X$ is invertible.
- Very slow if $d$ is large: $O(d^3)$ to compute $(X^T X)^{-1}$.

Iterative approach (Gradient Descent):
+ Effective and efficient even in high dimensions.
- Iterative (sometimes needs many iterations to converge).
- Need to choose the learning rate $\alpha$.
Practical considerations
1. Scaling: bring your features to a similar scale (see the sketch after this list):
$$x_i := \frac{x_i - \mu_i}{\operatorname{stdev}(x_i)}$$
2. Learning rate: don't use a rate $\alpha$ that is too small or too large.
3. $R$ should decrease after each iteration.
4. Declare convergence when $R$ starts decreasing by less than a small threshold in one iteration.
5. What if $X^T X$ is not invertible?
   (a) Too many features compared to the number of examples (e.g., 50 examples and 500 features).
   (b) Linearly dependent features: e.g., weight in pounds and in kilograms.
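A small sketch of the scaling step and of a possible convergence test (the threshold `tol=1e-4` is an arbitrary illustrative value; both function names are mine):

```python
import numpy as np

def standardize(features):
    """Scale each feature column to zero mean and unit standard deviation."""
    return (features - features.mean(axis=0)) / features.std(axis=0)

def converged(prev_risk, curr_risk, tol=1e-4):
    """Declare convergence when the risk decreases by less than tol."""
    return prev_risk - curr_risk < tol
```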
Credit

T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition (corrected 10th printing), Springer, 2009.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.