Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in
Outline
- Univariate regression
- Multivariate regression
- Probabilistic view of regression
- Loss functions
- Bias-Variance analysis
- Regularization
Example - Green Chilies Entertainment Company
[Scatter plot: earnings from the film (in crores of Rs) vs. cost of making the film (in crores of Rs)]
Notations
- Training dataset; number of examples - $N$
- Input variable - $x^{(i)}$
- Target variable - $y^{(i)}$
- Goal: learn a function that predicts $y$ for a new input $x$

Cost of Film (Crores of Rs) - $x$: 98.28, 199.69, 40.22, 93.69, 62.07, 100.33
Profit/Loss (Crores of Rs) - $y$: (values not recovered)
Linear Regression
Simplest form: $f(x) = w_0 + w_1 x$
[Scatter plot with fitted line: earnings vs. cost of making the film (in crores of Rs)]
Least Mean Squares - Cost Function
Choose parameters $w_0$ and $w_1$ (or $\mathbf{w}$) so that $f(x)$ is as close as possible to $y$.
[Scatter plot: earnings vs. cost of making the film (in crores of Rs)]
Least Mean Squares - Cost Function - Parameter Space (1)
Let $J(w_0, w_1) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Least Mean Squares - Cost Function - Parameter Space (2)
Let $J(w_0, w_1) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Least Mean Squares - Cost Function - Parameter Space (3)
Let $J(w_0, w_1) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Plot of the Error Surface
Contour Plot of Error Surface
Estimating Optimal Parameters
Gradient Descent - Basic Principle
Minimize $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
- Start with an initial estimate for $\mathbf{w}$
- Keep changing $\mathbf{w}$ so that $J(\mathbf{w})$ is progressively reduced
- Stop when there is no change, or the minimum has been reached
Gradient Descent - Intuition
Effect of Learning Parameter
- Too small a value: slow convergence
- Too large a value: oscillates widely and may not converge
Gradient Descent - Local Minima
Depending on the function $J(\mathbf{w})$, gradient descent can get stuck at local minima.
Gradient Descent for Regression
Convex error function: $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Geometrically, the error surface is bowl shaped: there is only a global minimum.
Exercise: prove that the sum of squared errors is a convex function.
Parameter Update (1)
Minimize $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Parameter Update (2)
Repeat till convergence:
$w_0 := w_0 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)$
$w_1 := w_1 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right) x^{(i)}$
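The update maps directly to code. Below is a minimal NumPy sketch of batch gradient descent for the univariate model; the earnings values `y` are invented for illustration (only the film costs survived in the table above), and `alpha` and the iteration count are arbitrary choices for unscaled data.

```python
import numpy as np

# Film costs from the table above; y values are hypothetical.
x = np.array([98.28, 199.69, 40.22, 93.69, 62.07, 100.33])
y = np.array([105.0, 215.0, 38.0, 90.0, 70.0, 112.0])   # made up

w0, w1 = 0.0, 0.0
alpha = 1e-5            # small because the features are unscaled
N = len(x)

for _ in range(100_000):
    err = (w0 + w1 * x) - y        # f(x^(i)) - y^(i) for every i
    w0 -= alpha * err.mean()       # (1/N) * sum of errors
    w1 -= alpha * (err * x).mean() # (1/N) * sum of errors * x^(i)

print(w0, w1)
```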
Example - Iterations 0, 1, 2, 4, 7, 9
[Figure sequence: the regression function fit and the error function at successive iterations of gradient descent]
Gradient Descent - Batch Mode
The update includes the contribution of all data points:
$w_0 := w_0 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)$
$w_1 := w_1 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right) x^{(i)}$
We will discuss stochastic gradient descent later (neural networks).
Multivariate Linear Regression

Cost of Film    Celebrity status     # of theatres   Age of the    Earnings
(Crores of Rs)  of the protagonist   (release)       protagonist   (Crores of Rs) - y
75.72           7.57                 32              52            157.39
18.74           1.87                 16              68            81.93
50.96           5.09                 27              35            131.95

Dimension of the input data - $D$
Multivariate Linear Regression - Formulation
Simplest model: $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D$
Parameters to learn: $w_0, w_1, \dots, w_D = \mathbf{w}$
Cost function: $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$
Update equation: $w_j := w_j - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Gradient Descent
Parameter update equation:
$w_j := w_j - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$
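In the multivariate case the same update vectorizes cleanly. A sketch, assuming `X` already carries a leading column of ones for the bias term:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent; X must include a leading column of ones."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        err = X @ w - y                 # all residuals f(x^(i)) - y^(i)
        w -= alpha * (X.T @ err) / N    # simultaneous update of every w_j
    return w
```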
Feature Scaling for Multivariate Linear Regression (1)
[Same film dataset as in the table above; the features span very different ranges]
Transform features to be on the same scale.
Feature Scaling for Multivariate Linear Regression (2)
- Normalization: $-1 \le x_d \le 1$ or $0 \le x_d \le 1$
- Standardization: mean 0 and standard deviation 1
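Both transforms are a few lines of NumPy. A minimal sketch; note that the training-set statistics must be kept and reused to scale test data:

```python
import numpy as np

def standardize(X):
    """Zero mean, unit variance per feature (column)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma for test data

def normalize(X):
    """Map each feature into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)
```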
Multivariate Linear Regression - Analytical Solution
Design matrix and target vector (from the film dataset above):
$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$
Least Squares Method
$f(X) = X\mathbf{w} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} \approx \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$
$J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$
Normal Equations
$\min_{\mathbf{w}} J(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}(X\mathbf{w} - Y)^T(X\mathbf{w} - Y)$
Find the gradient with respect to $\mathbf{w}$ and equate it to 0:
$\nabla_{\mathbf{w}} J = X^T(X\mathbf{w} - Y) = 0 \implies \mathbf{w} = (X^T X)^{-1} X^T Y$
Analytical Solution
Advantages:
- No need for the learning parameter $\alpha$!
- No need for iterative updates
Disadvantages:
- Need to perform matrix inversion - the pseudo-inverse of the matrix: $(X^T X)^{-1} X^T$
- Sometimes we deal with non-invertible matrices (redundant features)
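A sketch of the closed-form fit. `np.linalg.pinv` handles the non-invertible case; in practice `np.linalg.lstsq` is the usual, numerically safer route:

```python
import numpy as np

def fit_normal_equations(X, y):
    """w = (X^T X)^{-1} X^T y, via the pseudo-inverse for safety."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Numerically preferable in practice:
# w, *_ = np.linalg.lstsq(X, y, rcond=None)
```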
Probabilistic View of Linear Regression (1)
Let $y = f(x) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise.
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution
Probabilistic View of Linear Regression (2)
Let $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - why Gaussian?
$\mathcal{N}(0, \sigma^2)$ has maximum entropy among all real-valued distributions with a specified variance $\sigma^2$.
3-$\sigma$ rule: roughly 68%, 95%, and 99.7% of the probability mass lies within 1, 2, and 3 standard deviations of the mean, respectively.
Probabilistic View of Linear Regression (3)
Let $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then
$P(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$
and $P(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)$
[Figure: the Gaussian conditional density $p(t \mid x_0)$ centered on the regression curve $y(x)$ at $x_0$]
Probabilistic View of Linear Regression (4)
Assuming the examples are drawn independently:
$P(y^{(1)}, \dots, y^{(N)} \mid x^{(1)}, \dots, x^{(N)}; \mathbf{w}) = \prod_{i=1}^{N} P(y^{(i)} \mid x^{(i)}; \mathbf{w})$
Maximizing the Likelihood
Maximize $L(\mathbf{w}) = \prod_{i=1}^{N} P(y^{(i)} \mid x^{(i)}; \mathbf{w})$
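Substituting the Gaussian $P(y \mid x; \mathbf{w})$ from the previous slide and taking logs makes the connection to least squares explicit:

$$\log L(\mathbf{w}) = \sum_{i=1}^{N} \log P(y^{(i)} \mid x^{(i)}; \mathbf{w}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y^{(i)} - f(x^{(i)})\right)^2$$

The first term is constant in $\mathbf{w}$, so maximizing $\log L(\mathbf{w})$ is equivalent to minimizing $\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$ - the least squares objective.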
Loss Functions
- Squared loss: $(f(x) - y)^2$
- Absolute loss: $|f(x) - y|$
- Dead band loss: $\max(0, |f(x) - y| - \epsilon)$, $\epsilon \in \mathbb{R}^+$
Loss Functions - Problem with Squared Loss
Squared loss is sensitive to outliers: a single point with a large error dominates the objective and drags the fit toward it.
Linear Regression with Absolute Loss Function
Objective: $\min_{\mathbf{w}} \sum_{i=1}^{N} \left|\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right|$
Non-differentiable, so we cannot take the gradient descent approach.
Solution: frame it as a constrained optimization problem (a linear program).
Introduce new variables $\mathbf{v} \in \mathbb{R}^N$ with $v_i \ge \left|\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right|$:
$\min_{\mathbf{w}, \mathbf{v}} \sum_{i=1}^{N} v_i \quad \text{subject to} \quad -v_i \le \mathbf{x}^{(i)T}\mathbf{w} - y^{(i)} \le v_i$
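This LP can be handed to an off-the-shelf solver. A sketch using `scipy.optimize.linprog`, with the stacked variable vector `z = [w, v]` and `X` assumed to include a bias column of ones:

```python
import numpy as np
from scipy.optimize import linprog

def fit_absolute_loss(X, y):
    """Minimize sum_i |x_i.w - y_i| as an LP over z = [w, v]."""
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])   # objective: sum v_i
    A_ub = np.block([[ X, -np.eye(N)],              #   x_i.w - y_i <= v_i
                     [-X, -np.eye(N)]])             # -(x_i.w - y_i) <= v_i
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * D + [(0, None)] * N   # w free, v_i >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:D]
```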
Linear Regression with Absolute Loss Function - Example
[Figure comparing the LMS (squared loss) fit with the LP (absolute loss) fit]
Some Additional Notations
- Underlying response function (target concept): $C$
- Actual observed response: $y = C(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and $E[y \mid x] = C(x)$
- Predicted response based on the model learned from dataset $\mathcal{D}$: $f(x; \mathcal{D})$
- Expected response averaged over all datasets: $\bar{f}(x) = E_{\mathcal{D}}[f(x; \mathcal{D})]$
- Expected $L_2$ error on a new test instance $x$: $E_{err} = E_{\mathcal{D}}\left[(f(x; \mathcal{D}) - y)^2\right]$
Bias-Variance Analysis (1)
Bias-Variance Analysis (2)
Bias-Variance Analysis (3)
[Plot: root mean square error vs. model complexity]
Bias-Variance Analysis (4)
9th degree polynomial fit with more sample data
Bias-Variance Analysis (5)
Expected squared loss: $E[L] = \iint \left(f(x) - y\right)^2 P(x, y)\, dx\, dy$
Bias-Variance Analysis (6)
Decomposing the expected squared loss:
$E[L] = \int \left(f(x) - C(x)\right)^2 P(x)\, dx + \iint \left(C(x) - y\right)^2 P(x, y)\, dx\, dy$
The second term is the irreducible noise; only the first depends on our choice of $f$.
Bias-Variance Analysis (7)
Relevant part of the loss: $\int \left(f(x) - C(x)\right)^2 P(x)\, dx$
Bias-Variance Analysis (8)
Since $f$ is learned from a random dataset $\mathcal{D}$, take the expectation over datasets. Relevant part of the loss: $E_{\mathcal{D}}\left[\left(f(x; \mathcal{D}) - C(x)\right)^2\right]$
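Adding and subtracting $E_{\mathcal{D}}[f(x; \mathcal{D})]$ inside the square splits this into the two terms analyzed on the slides that follow (the cross term vanishes in expectation):

$$E_{\mathcal{D}}\left[\left(f(x;\mathcal{D}) - C(x)\right)^2\right] = \underbrace{\left(E_{\mathcal{D}}[f(x;\mathcal{D})] - C(x)\right)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{D}}\left[\left(f(x;\mathcal{D}) - E_{\mathcal{D}}[f(x;\mathcal{D})]\right)^2\right]}_{\text{variance}}$$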
Bias-Variance Analysis (9)
[Figure: fits across datasets for polynomial degree = 1 vs. degree = 4]
Bias-Variance Analysis (10)
Bias term of the error: $\left(E_{\mathcal{D}}[f(x; \mathcal{D})] - C(x)\right)^2$
- Measures how well our approximation architecture can fit the data
- Weak approximators will have high bias (example: low degree polynomials)
- Strong approximators will have low bias (example: high degree polynomials)
Bias-Variance Analysis (11)
Variance term of the error: $E_{\mathcal{D}}\left[\left(f(x; \mathcal{D}) - E_{\mathcal{D}}[f(x; \mathcal{D})]\right)^2\right]$
- No direct dependence on the target value
- For a fixed-size dataset $\mathcal{D}$:
  - Strong approximators tend to have more variance: small changes in the dataset can result in wide changes in the predictors
  - Weak approximators tend to have less variance: small changes in the dataset result in similar predictors
- Variance disappears as $|\mathcal{D}| \to \infty$
Bias-Variance Analysis (12)
Measuring bias and variance in practice: bootstrap from the given dataset (see the sketch below).
Managing the tradeoff:
- Start with a complex approximator, and reduce its complexity through regularization (setting more coefficients/parameters to 0) or feature selection
- This reduces variance but can increase bias - hopefully the model remains just sufficient for the given data
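A sketch of the bootstrap measurement. `fit` and `predict` are hypothetical placeholders for any regression learner; note that the bias estimate needs the true $C(x)$ on the test points, so this is only computable in simulation:

```python
import numpy as np

def bias_variance(fit, predict, X, y, X_test, C_test, B=200, seed=0):
    """Estimate bias^2 and variance of a learner via the bootstrap."""
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = np.empty((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)     # resample with replacement
        preds[b] = predict(X_test, fit(X[idx], y[idx]))
    f_bar = preds.mean(axis=0)               # estimate of E_D[f(x; D)]
    bias2 = ((f_bar - C_test) ** 2).mean()   # needs the true C(x)
    variance = preds.var(axis=0).mean()
    return bias2, variance
```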
Regularization
Central idea: penalize over-complicated solutions.
Linear regression minimizes $\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2$
Regularized regression minimizes $\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2 + \lambda \|\mathbf{w}\|^2$
Modified Solution
Solution for ordinary linear regression:
$\min_{\mathbf{w}} J(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}(X\mathbf{w} - Y)^T(X\mathbf{w} - Y)$, giving $\mathbf{w} = (X^T X)^{-1} X^T Y$
Now for the regularized version, which uses the $L_2$ norm - Ridge Regression:
$\min_{\mathbf{w}} J(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}(X\mathbf{w} - Y)^T(X\mathbf{w} - Y) + \lambda \|\mathbf{w}\|^2$, giving $\mathbf{w} = (X^T X + \lambda I)^{-1} X^T Y$
Exercise: derive the closed-form solution for ridge regression with the $L_2$ regularizer.
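The ridge solution in code, following the slide's convention of penalizing all weights including $w_0$ (in practice the bias is often left unpenalized):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """w = (X^T X + lambda*I)^{-1} X^T y; always invertible for lam > 0."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```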
How to choose λ?
Tradeoff between complexity and goodness of the fit.
- Solution 1: if we have lots of data - generate multiple models, then use lots of test data to discard the bad models
- Solution 2: with limited data - use k-fold cross validation (will be discussed later)
General Form of Regularizer Term
$\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2 + \lambda \sum_{d=1}^{D} |w_d|^q$
Quadratic/$L_2$ regularizer: $q = 2$
[Figure: contours of the regularization term for $q = 0.5, 1, 2, 4$]
Special Scenario: q = 1 - LASSO
Least Absolute Shrinkage and Selection Operator
Error function: $\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2 + \lambda \sum_{d=1}^{D} |w_d|$
For sufficiently large $\lambda$, many of the coefficients become 0, resulting in a sparse solution.
[Figure: $L_1$ vs. $L_2$ constraint regions in the $(w_1, w_2)$ plane; the corners of the $L_1$ diamond produce sparse solutions]
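For completeness, a minimal example with scikit-learn's `Lasso` (its objective is the same up to a $1/2N$ scaling of the data term); the synthetic data has a sparse true weight vector:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)

# sklearn minimizes (1/2N)*||Xw - y||^2 + alpha*||w||_1
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most coefficients come out exactly 0
```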
LASSO
- Quadratic programming to solve the optimization problem
- Least Angle Regression solution - refer to ESL
- http://web.stanford.edu/~hastie/glmnet_matlab/ - MATLAB package for LASSO
Linear Regression with Non-Linear Basis Functions
Linear combination of fixed non-linear functions of the input variables:
$f(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$
[Figure: examples of basis functions]
Linear Regression with Basis Functions - Solution
$f(X) = \Phi\mathbf{w} = \begin{bmatrix} 1 & \phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_D(\mathbf{x}^{(1)}) \\ 1 & \phi_1(\mathbf{x}^{(2)}) & \cdots & \phi_D(\mathbf{x}^{(2)}) \\ \vdots & & & \vdots \\ 1 & \phi_1(\mathbf{x}^{(N)}) & \cdots & \phi_D(\mathbf{x}^{(N)}) \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} \approx Y$
$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T Y$
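A sketch with a polynomial basis as the concrete choice of $\phi_j$; any fixed non-linear features slot into the same closed form:

```python
import numpy as np

def fit_polynomial_basis(x, y, degree):
    """Least squares on phi(x) = (1, x, x^2, ..., x^degree)."""
    Phi = np.vander(x, degree + 1, increasing=True)   # design matrix
    return np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y
```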
Linear Regression with Multiple Outputs
With multiple outputs, the targets form a matrix with one column per output, and the weights likewise:
$Y = \begin{bmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{bmatrix}, \quad f(X) = XW \approx Y, \quad W = (X^T X)^{-1} X^T Y$
Summary
- Linear regression (aka curve fitting)
- Gradient descent approach for finding the solution
- Analytical solution
- Loss functions
- Probabilistic view of linear regression
- Bias-variance analysis
- Regularization - ridge regression
- Regression with basis functions
- Locally weighted regression (refer to ML, Section 8.3)