Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in
Outline
- Univariate regression
- Multivariate regression
- Probabilistic view of regression
- Loss functions
- Bias-Variance analysis
- Regularization
Example - Green Chilies Entertainment Company
[Scatter plot: earnings from the film (in crores of Rs) vs. cost of making the film (in crores of Rs)]
Notations
- Training dataset; number of examples - $N$
- Input variable - $x^{(i)}$
- Target variable - $y^{(i)}$
- Goal: learn a function that predicts $y$ for a new input $x$

Cost of Film (Crores of Rs) - $x$: 98.28, 199.69, 40.22, 93.69, 62.07, 100.33
Profit/Loss (Crores of Rs) - $y$: (values not recovered)
Linear Regression
Simplest form: $f(x) = w_0 + w_1 x$
[Scatter plot with fitted line: earnings vs. cost of making the film (in crores of Rs)]
Least Mean Squares - Cost Function
Choose parameters $w_0$ and $w_1$ (or $\mathbf{w}$) so that $f(x)$ is as close as possible to $y$.
[Scatter plot: earnings vs. cost of making the film (in crores of Rs)]
Least Mean Squares - Cost Function - Parameter Space (1)
Let $J(w_0, w_1) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Least Mean Squares - Cost Function - Parameter Space (2)
Let $J(w_0, w_1) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Least Mean Squares - Cost Function - Parameter Space (3)
Let $J(w_0, w_1) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Plot of the Error Surface
Contour Plot of Error Surface
Estimating Optimal Parameters
Gradient Descent - Basic Principle
Minimize $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
- Start with an initial estimate for $\mathbf{w}$
- Keep changing $\mathbf{w}$ so that $J(\mathbf{w})$ is progressively reduced
- Stop when there is no change, or the minimum has been reached
Gradient Descent - Intuition
Effect of Learning Parameter
- Too small a value: slow convergence
- Too large a value: oscillates widely and may not converge
Gradient Descent - Local Minima
Depending on the function $J(\mathbf{w})$, gradient descent can get stuck at local minima.
Gradient Descent for Regression
Convex error function: $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Geometrically, the error surface is bowl shaped: there is only a global minimum.
Exercise: prove that the sum of squared errors is a convex function.
Parameter Update (1)
Minimize $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$
Parameter Update (2)
Repeat till convergence:
$w_0 := w_0 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)$
$w_1 := w_1 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right) x^{(i)}$
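The update maps directly to code. Below is a minimal NumPy sketch of batch gradient descent for the univariate model; the earnings values `y` are invented for illustration (only the film costs survived in the table above), and `alpha` and the iteration count are arbitrary choices for unscaled data.

```python
import numpy as np

# Film costs from the table above; y values are hypothetical.
x = np.array([98.28, 199.69, 40.22, 93.69, 62.07, 100.33])
y = np.array([105.0, 215.0, 38.0, 90.0, 70.0, 112.0])   # made up

w0, w1 = 0.0, 0.0
alpha = 1e-5            # small because the features are unscaled
N = len(x)

for _ in range(100_000):
    err = (w0 + w1 * x) - y        # f(x^(i)) - y^(i) for every i
    w0 -= alpha * err.mean()       # (1/N) * sum of errors
    w1 -= alpha * (err * x).mean() # (1/N) * sum of errors * x^(i)

print(w0, w1)
```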
Example - Iterations 0, 1, 2, 4, 7, 9
[Figure sequence: the regression function fit and the error function at successive iterations of gradient descent]
Gradient Descent - Batch Mode
The update includes the contribution of all data points:
$w_0 := w_0 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)$
$w_1 := w_1 - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right) x^{(i)}$
We will discuss stochastic gradient descent later (neural networks).
Multivariate Linear Regression

Cost of Film    Celebrity status     # of theatres   Age of the    Earnings
(Crores of Rs)  of the protagonist   (release)       protagonist   (Crores of Rs) - y
75.72           7.57                 32              52            157.39
18.74           1.87                 16              68            81.93
50.96           5.09                 27              35            131.95

Dimension of the input data - $D$
Multivariate Linear Regression - Formulation
Simplest model: $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D$
Parameters to learn: $w_0, w_1, \dots, w_D = \mathbf{w}$
Cost function: $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$
Update equation: $w_j := w_j - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Gradient Descent
Parameter update equation:
$w_j := w_j - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$
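In the multivariate case the same update vectorizes cleanly. A sketch, assuming `X` already carries a leading column of ones for the bias term:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent; X must include a leading column of ones."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        err = X @ w - y                 # all residuals f(x^(i)) - y^(i)
        w -= alpha * (X.T @ err) / N    # simultaneous update of every w_j
    return w
```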
Feature Scaling for Multivariate Linear Regression (1)
[Same film dataset as in the table above; the features span very different ranges]
Transform features to be on the same scale.
Feature Scaling for Multivariate Linear Regression (2)
- Normalization: $-1 \le x_d \le 1$ or $0 \le x_d \le 1$
- Standardization: mean 0 and standard deviation 1
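Both transforms are a few lines of NumPy. A minimal sketch; note that the training-set statistics must be kept and reused to scale test data:

```python
import numpy as np

def standardize(X):
    """Zero mean, unit variance per feature (column)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma for test data

def normalize(X):
    """Map each feature into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)
```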
Multivariate Linear Regression - Analytical Solution
Design matrix and target vector (from the film dataset above):
$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$
Least Squares Method
$f(X) = X\mathbf{w} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} \approx \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$
$J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\left(f(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$
Normal Equations
$\min_{\mathbf{w}} J(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}(X\mathbf{w} - Y)^T(X\mathbf{w} - Y)$
Find the gradient with respect to $\mathbf{w}$ and equate it to 0:
$\nabla_{\mathbf{w}} J = X^T(X\mathbf{w} - Y) = 0 \implies \mathbf{w} = (X^T X)^{-1} X^T Y$
Analytical Solution
Advantages:
- No need for the learning parameter $\alpha$!
- No need for iterative updates
Disadvantages:
- Need to perform matrix inversion - the pseudo-inverse of the matrix: $(X^T X)^{-1} X^T$
- Sometimes we deal with non-invertible matrices (redundant features)
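A sketch of the closed-form fit. `np.linalg.pinv` handles the non-invertible case; in practice `np.linalg.lstsq` is the usual, numerically safer route:

```python
import numpy as np

def fit_normal_equations(X, y):
    """w = (X^T X)^{-1} X^T y, via the pseudo-inverse for safety."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Numerically preferable in practice:
# w, *_ = np.linalg.lstsq(X, y, rcond=None)
```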
Probabilistic View of Linear Regression (1)
Let $y = f(x) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise.
$\epsilon \sim \mathcal{N}(0, \sigma^2)$ - Gaussian distribution
Probabilistic View of Linear Regression (2)
Let $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ - why Gaussian?
$\mathcal{N}(0, \sigma^2)$ has maximum entropy among all real-valued distributions with a specified variance $\sigma^2$.
3-$\sigma$ rule: roughly 68%, 95%, and 99.7% of the probability mass lies within 1, 2, and 3 standard deviations of the mean, respectively.
Probabilistic View of Linear Regression (3)
Let $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then
$P(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$
and $P(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)$
[Figure: the Gaussian conditional density $p(t \mid x_0)$ centered on the regression curve $y(x)$ at $x_0$]
Probabilistic View of Linear Regression (4)
Assuming the examples are drawn independently:
$P(y^{(1)}, \dots, y^{(N)} \mid x^{(1)}, \dots, x^{(N)}; \mathbf{w}) = \prod_{i=1}^{N} P(y^{(i)} \mid x^{(i)}; \mathbf{w})$
Maximizing the Likelihood
Maximize $L(\mathbf{w}) = \prod_{i=1}^{N} P(y^{(i)} \mid x^{(i)}; \mathbf{w})$
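Substituting the Gaussian $P(y \mid x; \mathbf{w})$ from the previous slide and taking logs makes the connection to least squares explicit:

$$\log L(\mathbf{w}) = \sum_{i=1}^{N} \log P(y^{(i)} \mid x^{(i)}; \mathbf{w}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y^{(i)} - f(x^{(i)})\right)^2$$

The first term is constant in $\mathbf{w}$, so maximizing $\log L(\mathbf{w})$ is equivalent to minimizing $\sum_{i=1}^{N}\left(f(x^{(i)}) - y^{(i)}\right)^2$ - the least squares objective.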
Loss Functions
- Squared loss: $(f(x) - y)^2$
- Absolute loss: $|f(x) - y|$
- Dead band loss: $\max(0, |f(x) - y| - \epsilon)$, $\epsilon \in \mathbb{R}^+$
Loss Functions - Problem with Squared Loss
Squared loss is sensitive to outliers: a single point with a large error dominates the objective and drags the fit toward it.
Linear Regression with Absolute Loss Function
Objective: $\min_{\mathbf{w}} \sum_{i=1}^{N} \left|\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right|$
Non-differentiable, so we cannot take the gradient descent approach.
Solution: frame it as a constrained optimization problem (a linear program).
Introduce new variables $\mathbf{v} \in \mathbb{R}^N$ with $v_i \ge \left|\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right|$:
$\min_{\mathbf{w}, \mathbf{v}} \sum_{i=1}^{N} v_i \quad \text{subject to} \quad -v_i \le \mathbf{x}^{(i)T}\mathbf{w} - y^{(i)} \le v_i$
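This LP can be handed to an off-the-shelf solver. A sketch using `scipy.optimize.linprog`, with the stacked variable vector `z = [w, v]` and `X` assumed to include a bias column of ones:

```python
import numpy as np
from scipy.optimize import linprog

def fit_absolute_loss(X, y):
    """Minimize sum_i |x_i.w - y_i| as an LP over z = [w, v]."""
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])   # objective: sum v_i
    A_ub = np.block([[ X, -np.eye(N)],              #   x_i.w - y_i <= v_i
                     [-X, -np.eye(N)]])             # -(x_i.w - y_i) <= v_i
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * D + [(0, None)] * N   # w free, v_i >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:D]
```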
Linear Regression with Absolute Loss Function - Example
[Figure comparing the LMS (squared loss) fit with the LP (absolute loss) fit]
Some Additional Notations
- Underlying response function (target concept): $C$
- Actual observed response: $y = C(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and $E[y \mid x] = C(x)$
- Predicted response based on the model learned from dataset $\mathcal{D}$: $f(x; \mathcal{D})$
- Expected response averaged over all datasets: $\bar{f}(x) = E_{\mathcal{D}}[f(x; \mathcal{D})]$
- Expected $L_2$ error on a new test instance $x$: $E_{err} = E_{\mathcal{D}}\left[(f(x; \mathcal{D}) - y)^2\right]$
Bias-Variance Analysis (1)
Bias-Variance Analysis (2)
Bias-Variance Analysis (3)
[Plot: root mean square error vs. model complexity]
Bias-Variance Analysis (4)
9th degree polynomial fit with more sample data
Bias-Variance Analysis (5)
Expected squared loss: $E[L] = \iint \left(f(x) - y\right)^2 P(x, y)\, dx\, dy$
Bias-Variance Analysis (6)
Decomposing the expected squared loss:
$E[L] = \int \left(f(x) - C(x)\right)^2 P(x)\, dx + \iint \left(C(x) - y\right)^2 P(x, y)\, dx\, dy$
The second term is the irreducible noise; only the first depends on our choice of $f$.
Bias-Variance Analysis (7)
Relevant part of the loss: $\int \left(f(x) - C(x)\right)^2 P(x)\, dx$
Bias-Variance Analysis (8)
Since $f$ is learned from a random dataset $\mathcal{D}$, take the expectation over datasets. Relevant part of the loss: $E_{\mathcal{D}}\left[\left(f(x; \mathcal{D}) - C(x)\right)^2\right]$
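Adding and subtracting $E_{\mathcal{D}}[f(x; \mathcal{D})]$ inside the square splits this into the two terms analyzed on the slides that follow (the cross term vanishes in expectation):

$$E_{\mathcal{D}}\left[\left(f(x;\mathcal{D}) - C(x)\right)^2\right] = \underbrace{\left(E_{\mathcal{D}}[f(x;\mathcal{D})] - C(x)\right)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{D}}\left[\left(f(x;\mathcal{D}) - E_{\mathcal{D}}[f(x;\mathcal{D})]\right)^2\right]}_{\text{variance}}$$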
Bias-Variance Analysis (9)
[Figure: fits across datasets for polynomial degree = 1 vs. degree = 4]
Bias-Variance Analysis (10)
Bias term of the error: $\left(E_{\mathcal{D}}[f(x; \mathcal{D})] - C(x)\right)^2$
- Measures how well our approximation architecture can fit the data
- Weak approximators will have high bias (example: low degree polynomials)
- Strong approximators will have low bias (example: high degree polynomials)
Bias-Variance Analysis (11)
Variance term of the error: $E_{\mathcal{D}}\left[\left(f(x; \mathcal{D}) - E_{\mathcal{D}}[f(x; \mathcal{D})]\right)^2\right]$
- No direct dependence on the target value
- For a fixed-size dataset $\mathcal{D}$:
  - Strong approximators tend to have more variance: small changes in the dataset can result in wide changes in the predictors
  - Weak approximators tend to have less variance: small changes in the dataset result in similar predictors
- Variance disappears as $|\mathcal{D}| \to \infty$
Bias-Variance Analysis (12)
Measuring bias and variance in practice: bootstrap from the given dataset (see the sketch below).
Managing the tradeoff:
- Start with a complex approximator, and reduce its complexity through regularization (setting more coefficients/parameters to 0) or feature selection
- This reduces variance but can increase bias - hopefully the model remains just sufficient for the given data
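A sketch of the bootstrap measurement. `fit` and `predict` are hypothetical placeholders for any regression learner; note that the bias estimate needs the true $C(x)$ on the test points, so this is only computable in simulation:

```python
import numpy as np

def bias_variance(fit, predict, X, y, X_test, C_test, B=200, seed=0):
    """Estimate bias^2 and variance of a learner via the bootstrap."""
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = np.empty((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)     # resample with replacement
        preds[b] = predict(X_test, fit(X[idx], y[idx]))
    f_bar = preds.mean(axis=0)               # estimate of E_D[f(x; D)]
    bias2 = ((f_bar - C_test) ** 2).mean()   # needs the true C(x)
    variance = preds.var(axis=0).mean()
    return bias2, variance
```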
Regularization
Central idea: penalize over-complicated solutions.
Linear regression minimizes $\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2$
Regularized regression minimizes $\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2 + \lambda \|\mathbf{w}\|^2$
Modified Solution
Solution for ordinary linear regression:
$\min_{\mathbf{w}} J(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}(X\mathbf{w} - Y)^T(X\mathbf{w} - Y)$, giving $\mathbf{w} = (X^T X)^{-1} X^T Y$
Now for the regularized version, which uses the $L_2$ norm - Ridge Regression:
$\min_{\mathbf{w}} J(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}(X\mathbf{w} - Y)^T(X\mathbf{w} - Y) + \lambda \|\mathbf{w}\|^2$, giving $\mathbf{w} = (X^T X + \lambda I)^{-1} X^T Y$
Exercise: derive the closed-form solution for ridge regression with the $L_2$ regularizer.
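The ridge solution in code, following the slide's convention of penalizing all weights including $w_0$ (in practice the bias is often left unpenalized):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """w = (X^T X + lambda*I)^{-1} X^T y; always invertible for lam > 0."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```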
How to choose λ?
Tradeoff between complexity and goodness of the fit.
- Solution 1: if we have lots of data - generate multiple models, then use lots of test data to discard the bad models
- Solution 2: with limited data - use k-fold cross validation (will be discussed later)
General Form of Regularizer Term
$\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2 + \lambda \sum_{d=1}^{D} |w_d|^q$
Quadratic/$L_2$ regularizer: $q = 2$
[Figure: contours of the regularization term for $q = 0.5, 1, 2, 4$]
Special Scenario: q = 1 - LASSO
Least Absolute Shrinkage and Selection Operator
Error function: $\sum_{i=1}^{N}\left(\mathbf{x}^{(i)T}\mathbf{w} - y^{(i)}\right)^2 + \lambda \sum_{d=1}^{D} |w_d|$
For sufficiently large $\lambda$, many of the coefficients become 0, resulting in a sparse solution.
[Figure: $L_1$ vs. $L_2$ constraint regions in the $(w_1, w_2)$ plane; the corners of the $L_1$ diamond produce sparse solutions]
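For completeness, a minimal example with scikit-learn's `Lasso` (its objective is the same up to a $1/2N$ scaling of the data term); the synthetic data has a sparse true weight vector:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)

# sklearn minimizes (1/2N)*||Xw - y||^2 + alpha*||w||_1
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most coefficients come out exactly 0
```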
LASSO
- Quadratic programming to solve the optimization problem
- Least Angle Regression solution - refer to ESL
- http://web.stanford.edu/~hastie/glmnet_matlab/ - MATLAB package for LASSO
Linear Regression with Non-Linear Basis Functions
Linear combination of fixed non-linear functions of the input variables:
$f(\mathbf{x}) = w_0 + \sum_{j=1}^{D} w_j \phi_j(\mathbf{x})$
[Figure: examples of basis functions]
Linear Regression with Basis Functions - Solution
$f(X) = \Phi\mathbf{w} = \begin{bmatrix} 1 & \phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_D(\mathbf{x}^{(1)}) \\ 1 & \phi_1(\mathbf{x}^{(2)}) & \cdots & \phi_D(\mathbf{x}^{(2)}) \\ \vdots & & & \vdots \\ 1 & \phi_1(\mathbf{x}^{(N)}) & \cdots & \phi_D(\mathbf{x}^{(N)}) \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} \approx Y$
$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T Y$
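A sketch with a polynomial basis as the concrete choice of $\phi_j$; any fixed non-linear features slot into the same closed form:

```python
import numpy as np

def fit_polynomial_basis(x, y, degree):
    """Least squares on phi(x) = (1, x, x^2, ..., x^degree)."""
    Phi = np.vander(x, degree + 1, increasing=True)   # design matrix
    return np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y
```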
Linear Regression with Multiple Outputs
With multiple outputs, the targets form a matrix with one column per output, and the weights likewise:
$Y = \begin{bmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NK} \end{bmatrix}, \quad f(X) = XW \approx Y, \quad W = (X^T X)^{-1} X^T Y$
Summary
- Linear regression (aka curve fitting)
- Gradient descent approach for finding the solution
- Analytical solution
- Loss functions
- Probabilistic view of linear regression
- Bias-variance analysis
- Regularization - ridge regression
- Regression with basis functions
- Locally weighted regression (refer to ML, Section 8.3)