Machine Learning: Linear Models
Outline II - Linear Models
1. Linear Regression
   (a) Linear regression: history
   (b) Linear regression with least squares
   (c) Matrix representation and the Normal Equation method
   (d) Iterative method: gradient descent
   (e) Pros and cons of both methods
2. Linear Classification: Perceptron
   (a) Definition and history
   (b) Example
   (c) Algorithm
Supervised Learning

Training data: examples $x$ with labels $y$:
$$(x_1, y_1), \dots, (x_n, y_n), \quad x_i \in \mathbb{R}^d$$

Regression: $y$ is a real value, $y \in \mathbb{R}$. We learn $f : \mathbb{R}^d \to \mathbb{R}$; $f$ is called a regressor.

Classification: $y$ is discrete. To simplify, $y \in \{-1, +1\}$. We learn $f : \mathbb{R}^d \to \{-1, +1\}$; $f$ is called a binary classifier.
Linear Regression: History

A very popular technique, rooted in statistics. The method of least squares was used as early as 1795 by Gauss and was independently published in 1805 by Legendre. It was frequently applied in astronomy to study the large-scale structure of the universe, and it is still a very useful tool today.

[Figure: portrait of Carl Friedrich Gauss.]
Linear Regression

Given training data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$:
$$
\begin{array}{lcccc|c}
\text{example } x_1: & x_{11} & x_{12} & \cdots & x_{1d} & y_1 \\
\vdots & & & & & \vdots \\
\text{example } x_i: & x_{i1} & x_{i2} & \cdots & x_{id} & y_i \\
\vdots & & & & & \vdots \\
\text{example } x_n: & x_{n1} & x_{n2} & \cdots & x_{nd} & y_n \\
\end{array}
$$
where the last column holds the labels.

Task: learn a regression function $f : \mathbb{R}^d \to \mathbb{R}$ with $f(x) = y$.

Linear Regression: a regression model is said to be linear if it is represented by a linear function.
Y #" X 2!" X 1 d = 1, line in R 2 d = 2, hyperplane is R 3 Credit: Introduction to Statistical Learning.
Linear Regression Model:
$$f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j, \qquad \beta_j \in \mathbb{R},\ j \in \{1, \dots, d\}$$

The $\beta$'s are called parameters, coefficients, or weights. Learning the linear model means learning the $\beta$'s.

Estimation with least squares. Use the squared-error loss:
$$\ell(y_i, f(x_i)) = (y_i - f(x_i))^2$$

We want to minimize the loss over all examples, that is, minimize the risk (or cost function) $R$:
$$R = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
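As a minimal sketch of this cost function (assuming NumPy; the function name `risk` and the convention that the first entry of `beta` is the intercept are mine, not from the slides):

```python
import numpy as np

def risk(beta, X, y):
    """Least-squares risk R = 1/(2n) * sum_i (y_i - f(x_i))^2.

    X is an (n, d) feature matrix, y an (n,) label vector, and beta a
    (d+1,)-vector whose first entry is beta_0 (the intercept).
    """
    n = X.shape[0]
    predictions = beta[0] + X @ beta[1:]   # f(x_i) = beta_0 + sum_j beta_j x_ij
    residuals = y - predictions
    return (residuals @ residuals) / (2 * n)
```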
A simple case with one feature ($d = 1$):
$$f(x) = \beta_0 + \beta_1 x$$

We want to find $\beta_0$ and $\beta_1$ that minimize:
$$R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
[Figure: contour and surface plots of the residual sum of squares (RSS) as a function of $\beta_0$ and $\beta_1$. Credit: Introduction to Statistical Learning.]
Find $\beta_0$ and $\beta_1$ so that:
$$\operatorname*{argmin}_{\beta_0, \beta_1} \ \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Minimize $R(\beta_0, \beta_1)$, that is, solve:
$$\frac{\partial R}{\partial \beta_0} = 0 \qquad \frac{\partial R}{\partial \beta_1} = 0$$

For $\beta_0$:
$$\frac{\partial R}{\partial \beta_0} = \frac{1}{2n} \sum_{i=1}^{n} 2\,(y_i - \beta_0 - \beta_1 x_i)\,\frac{\partial (y_i - \beta_0 - \beta_1 x_i)}{\partial \beta_0}
= \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-1) = 0$$

which gives:
$$\beta_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i$$
Similarly, for $\beta_1$:
$$\frac{\partial R}{\partial \beta_1} = \frac{1}{2n} \sum_{i=1}^{n} 2\,(y_i - \beta_0 - \beta_1 x_i)\,\frac{\partial (y_i - \beta_0 - \beta_1 x_i)}{\partial \beta_1}
= \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0$$

so that:
$$\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i$$

Plugging $\beta_0$ into the equation for $\beta_1$:
$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2}$$
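A small sketch of these closed-form estimates for $d = 1$ (the function name `fit_simple` and the toy data are illustrative, not from the slides):

```python
import numpy as np

def fit_simple(x, y):
    """Closed-form least-squares fit for f(x) = beta_0 + beta_1 * x."""
    n = len(x)
    beta_1 = (np.sum(x * y) - np.sum(y) * np.sum(x) / n) / \
             (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    beta_0 = np.mean(y) - beta_1 * np.mean(x)
    return beta_0, beta_1

# Points lying exactly on y = 2x + 1 recover beta_0 = 1, beta_1 = 2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(fit_simple(x, y))
```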
With more than one feature, find the $\beta_j$'s of
$$f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j$$
that minimize:
$$R = \frac{1}{2n} \sum_{i=1}^{n} \Bigl(y_i - \bigl(\beta_0 + \sum_{j=1}^{d} \beta_j x_{ij}\bigr)\Bigr)^2$$

Let's write it more elegantly with matrices!
Matrix representation

Let $X$ be an $n \times (d+1)$ matrix where each row starts with a 1 followed by a feature vector. Let $y$ be the label vector of the training set. Let $\beta$ be the vector of weights (that we want to estimate!).
$$
X := \begin{pmatrix}
1 & x_{11} & \cdots & x_{1j} & \cdots & x_{1d} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{i1} & \cdots & x_{ij} & \cdots & x_{id} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{nj} & \cdots & x_{nd}
\end{pmatrix}
\qquad
y := \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
\qquad
\beta := \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_d \end{pmatrix}
$$
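As a quick illustration (assuming NumPy; the toy data and variable names are mine), the design matrix $X$ with its leading column of ones can be built like this:

```python
import numpy as np

features = np.array([[2.0, 3.0],
                     [1.0, 0.5],
                     [4.0, 1.0]])   # n = 3 examples, d = 2 features
y = np.array([5.0, 2.0, 6.0])       # labels

# Prepend a column of ones so that beta_0 acts as the intercept.
X = np.hstack([np.ones((features.shape[0], 1)), features])
print(X.shape)   # (3, 3), i.e., n x (d + 1)
```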
Normal Equation

We want to find the $(d+1)$ $\beta$'s that minimize $R$. In matrix form:
$$R(\beta) = \frac{1}{2n} \lVert y - X\beta \rVert^2 = \frac{1}{2n} (y - X\beta)^T (y - X\beta)$$

We have:
$$\frac{\partial R}{\partial \beta} = -\frac{1}{n} X^T (y - X\beta)
\qquad
\frac{\partial^2 R}{\partial \beta^2} = \frac{1}{n} X^T X$$

$X^T X$ is positive definite (when $X$ has full column rank), which ensures that the stationary point is a minimum. We solve:
$$X^T (y - X\beta) = 0$$

The unique solution is:
$$\beta = (X^T X)^{-1} X^T y$$
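A minimal sketch of the normal-equation solution, assuming $X$ already contains the column of ones and $X^T X$ is invertible (solving the linear system is preferred over forming the inverse explicitly; the function name and toy data are mine):

```python
import numpy as np

def normal_equation(X, y):
    """Least-squares estimate: solve (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny check: data generated from y = 1 + 2*x1 + 3*x2 is recovered exactly.
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = np.hstack([np.ones((4, 1)), features])
y = 1 + 2 * features[:, 0] + 3 * features[:, 1]
print(normal_equation(X, y))   # approximately [1. 2. 3.]
```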
Gradient descent!"#$%&'() ' 4.5' +,-./0$'12,3$' *$%&'( '
Gradient descent

Gradient Descent is an iterative optimization method.

Repeat until convergence, updating all $\beta_j$ simultaneously (here for $j = 0$ and $j = 1$):
$$\beta_0 := \beta_0 - \alpha \frac{\partial}{\partial \beta_0} R(\beta_0, \beta_1)
\qquad
\beta_1 := \beta_1 - \alpha \frac{\partial}{\partial \beta_1} R(\beta_0, \beta_1)$$

$\alpha$ is the learning rate.
Gradient descent

In the linear case ($d = 1$), the partial derivatives are:
$$\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)
\qquad
\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)\, x_i$$

Let's plug them into the updates. Repeat until convergence, updating $\beta_0$ and $\beta_1$ simultaneously:
$$\beta_0 := \beta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)
\qquad
\beta_1 := \beta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)\, x_i$$
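A compact sketch of these updates for the one-feature case (the learning rate `alpha=0.1` and `n_iters=1000` are arbitrary illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for f(x) = beta_0 + beta_1 * x."""
    n = len(x)
    beta_0, beta_1 = 0.0, 0.0
    for _ in range(n_iters):
        error = beta_0 + beta_1 * x - y    # (f(x_i) - y_i) for all i
        grad_0 = np.mean(error)            # dR/dbeta_0
        grad_1 = np.mean(error * x)        # dR/dbeta_1
        beta_0 -= alpha * grad_0           # simultaneous update
        beta_1 -= alpha * grad_1
    return beta_0, beta_1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(gradient_descent(x, y))   # close to (1.0, 2.0)
```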
Pros and Cons

Analytical approach (Normal Equation):
+ No need to choose a learning rate or to iterate.
- Works only if $X^T X$ is invertible.
- Very slow if $d$ is large: $O(d^3)$ to compute $(X^T X)^{-1}$.

Iterative approach (Gradient Descent):
+ Effective and efficient even in high dimensions.
- Iterative (sometimes needs many iterations to converge).
- Need to choose the learning rate $\alpha$.
Practical considerations
1. Scaling: bring your features to a similar scale (see the sketch after this list):
$$x_i := \frac{x_i - \mu_i}{\operatorname{stdev}(x_i)}$$
2. Learning rate: don't use a rate $\alpha$ that is too small or too large.
3. $R$ should decrease after each iteration.
4. Declare convergence when $R$ starts decreasing by less than a small threshold in one iteration.
5. What if $X^T X$ is not invertible?
   (a) Too many features compared to the number of examples (e.g., 50 examples and 500 features).
   (b) Linearly dependent features: e.g., weight in pounds and in kilograms.
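A small sketch of the scaling step and of a possible convergence test (the threshold `tol=1e-4` is an arbitrary illustrative value; both function names are mine):

```python
import numpy as np

def standardize(features):
    """Scale each feature column to zero mean and unit standard deviation."""
    return (features - features.mean(axis=0)) / features.std(axis=0)

def converged(prev_risk, curr_risk, tol=1e-4):
    """Declare convergence when the risk decreases by less than tol."""
    return prev_risk - curr_risk < tol
```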
Credit

T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition (corrected 10th printing), Springer, 2009.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.