Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

1 Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in

2 Outline
Univariate regression
Multivariate regression
Probabilistic view of regression
Loss functions
Bias-Variance analysis
Regularization
Linear Regression CSL465/603 - Machine Learning

3 Example - Green Chilies Entertainment Company
[Figure: earnings from the film (in crores of Rs) plotted against cost of making the film (in crores of Rs)]

4 Notations
Training dataset
  Number of examples: $N$
  Input variable: $x_i$
  Target variable: $y_i$
Goal: learn a function that predicts $y$ for a new input $x$
Cost of film (crores of Rs): $x$
Profit/loss (crores of Rs): $y$

5 Linear Regression
Simplest form: $f(x) = w_0 + w_1 x$
[Figure: earnings from the film (in crores of Rs) plotted against cost of making the film (in crores of Rs)]

6 Least Mean Squares - Cost Function
Choose parameters $w_0$ and $w_1$ (or $w$) so that $f(x)$ is as close as possible to $y$
[Figure: earnings from the film (in crores of Rs) plotted against cost of making the film (in crores of Rs)]

7 Least Mean Squares - Cost Function - Parameter Space (1)
Let $J(w_0, w_1) = \sum_{i=1}^{N} (f(x_i) - y_i)^2$

8 Least Mean Squares - Cost Function - Parameter Space (2)
Let $J(w_0, w_1) = \sum_{i=1}^{N} (f(x_i) - y_i)^2$

9 Least Mean Squares - Cost Function - Parameter Space (3)
Let $J(w_0, w_1) = \sum_{i=1}^{N} (f(x_i) - y_i)^2$

10 Plot of the Error Surface

11 Contour Plot of Error Surface

12 Estimating Optimal Parameters

13 Gradient Descent Basic Principle
Minimize $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2$
Start with an initial estimate for $w$
Keep changing $w$ so that $J(w)$ is progressively reduced
Stop when there is no further change or the minimum has been reached

14 Gradient Descent - Intuition

15 Effect of Learning Parameter
Too small a value: slow convergence
Too large a value: oscillates widely and may not converge
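
The effect of the learning parameter can be seen on a toy problem. The sketch below is an illustration only: it assumes the one-dimensional cost $J(w) = w^2$ (gradient $2w$), which is not from the slides, and the function name `descend` and the three values of `alpha` are hypothetical choices.

```python
def descend(alpha, steps=50, w=1.0):
    """Run gradient descent on the toy cost J(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

small = descend(alpha=0.001)  # too small: barely moves away from the start
good = descend(alpha=0.1)     # moderate: approaches the minimum at w = 0
large = descend(alpha=1.1)    # too large: overshoots on every step and grows

print(abs(small), abs(good), abs(large))
```

With the too-large value each update flips the sign of `w` and increases its magnitude, which is the oscillation the slide describes.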

16 Gradient Descent Local Minima
Depending on the function $J(w)$, gradient descent can get stuck at local minima

17 Gradient Descent for Regression
Convex error function: $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2$
Geometrically, the error surface is bowl-shaped, so it has only a global minimum
Exercise: prove that the sum of squared errors is a convex function

18 Parameter Update (1)
Minimize $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2$

19 Parameter Update (2)
Repeat till convergence:
$w_0 = w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)$
$w_1 = w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)\, x_i$
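
The two update rules for $w_0$ and $w_1$ can be sketched in plain Python as follows. This is a minimal illustration, not the course's implementation: the function name `fit_line`, the learning parameter, the iteration count, and the four data points are all assumptions chosen so that the loop converges.

```python
def fit_line(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent for f(x) = w0 + w1 * x with cost (1/2N) * sum of squared errors."""
    n = len(xs)
    w0, w1 = 0.0, 0.0  # initial estimate
    for _ in range(iterations):
        # f(x_i) - y_i for every example, using the current parameters
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        # simultaneous update of both parameters
        w0, w1 = w0 - alpha * grad0, w1 - alpha * grad1
    return w0, w1

# Made-up data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_line(xs, ys)
print(w0, w1)
```

Both gradients are computed from the same `errors` list before either parameter changes, so the update is simultaneous rather than sequential.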

20 Example Iteration 0
[Figure panels: Regression Function; Error Function]

21 Example Iteration 1
[Figure panels: Regression Function; Error Function]

22 Example Iteration 2
[Figure panels: Regression Function; Error Function]

23 Example Iteration 4
[Figure panels: Regression Function; Error Function]

24 Example Iteration 7
[Figure panels: Regression Function; Error Function]

25 Example Iteration 9
[Figure panels: Regression Function; Error Function]

26 Gradient Descent Batch Mode
Each update includes the contribution of all data points:
$w_0 = w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)$
$w_1 = w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)\, x_i$
Stochastic gradient descent will be discussed later (neural networks)

27 Multivariate Linear Regression
[Table columns: Cost of Film (Crores of Rs); Celebrity status of the protagonist; # of theatres release; Age of the protagonist; Earnings (Crores of Rs) - $y$]
Dimension of the input data: $D$

28 Multivariate Linear Regression - Formulation
Simplest model: $f(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D$
Parameters to learn: $w_0, w_1, \dots, w_D = w$
Cost function: $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2$
Update equation: $w_j = w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)\, x_{ij}$

29 Gradient Descent
Parameter update equation: $w_j = w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i)\, x_{ij}$, with $x_{i0} = 1$ for the intercept $w_0$
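
The same update extends to $D$ input features. The sketch below is one possible plain-Python rendering; the names `predict` and `gradient_descent`, the hyperparameters, and the four data points are illustrative assumptions rather than anything given on the slides.

```python
def predict(w, x):
    """f(x) = w[0] + w[1]*x[0] + ... + w[D]*x[D-1]; x holds the D features without the leading 1."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent on the (1/2N) sum-of-squared-errors cost."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(iterations):
        errors = [predict(w, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(errors) / n]  # intercept term, where x_i0 = 1
        for j in range(d):
            grad.append(sum(e * xi[j] for e, xi in zip(errors, X)) / n)
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w

# Made-up data generated from y = 1 + 2*x1 - 1*x2
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0]
w = gradient_descent(X, y)
print(w)
```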

30 Feature Scaling for Multivariate Linear Regression (1)
[Table columns: Cost of Film (Crores of Rs); Celebrity status of the protagonist; # of theatres release; Age of the protagonist; Profit/Loss (Crores of Rs) - $y$]
Transform features to be of the same scale

31 Feature Scaling for Multivariate Linear Regression (2)
Normalization: $-1 \le x_j \le 1$ or $0 \le x_j \le 1$
Standardization: mean 0 and standard deviation 1
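
Both transformations are short to write down. The sketch below assumes min-max scaling for the $0 \le x_j \le 1$ case and the population standard deviation for standardization; the function names and the theatre counts are hypothetical.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Map one feature column onto [0, 1]. Assumes the column is not constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale one feature column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

theatres = [100.0, 250.0, 400.0, 1000.0]  # illustrative values only
scaled = min_max_scale(theatres)
z = standardize(theatres)
print(scaled, z)
```

Each feature column is scaled independently, and the same `lo`, `hi`, `mu`, and `sigma` computed on the training data would be reused when scaling a new input.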

32 Multivariate Linear Regression Analytical Solution
[Table columns: Cost of Film (Crores of Rs); Celebrity status of the protagonist; # of theatres release; Age of the protagonist; Profit/Loss (Crores of Rs) - $y$]
Design Matrix and Target Vector:
$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix}$, $Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

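33 Least Squares Method
$f(X) = Xw = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$
$J(w) = \frac{1}{2} \sum_{i=1}^{N} (f(x_i) - y_i)^2$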

34 Normal Equations
$\min_W J(W) = \min_W \frac{1}{2} (XW - Y)^T (XW - Y)$
Find the gradient with respect to $W$ and equate it to 0
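
The gradient step can be written out as follows. This is a sketch of the standard derivation, assuming $X^T X$ is invertible; it is not copied from the slide, whose worked steps did not survive extraction.

```latex
\begin{aligned}
J(W) &= \tfrac{1}{2}(XW - Y)^T (XW - Y)
      = \tfrac{1}{2}\left(W^T X^T X W - 2\, W^T X^T Y + Y^T Y\right) \\
\nabla_W J(W) &= X^T X W - X^T Y = 0 \\
X^T X W &= X^T Y \\
W &= (X^T X)^{-1} X^T Y
\end{aligned}
```

The third line is the system usually called the normal equations.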

35 Analytical Solution
Advantages
  No need for the learning parameter $\alpha$
  No need for iterative updates
Disadvantages
  Need to perform matrix inversion
Pseudo-inverse of the matrix: $(X^T X)^{-1} X^T$
Sometimes we deal with non-invertible matrices (redundant features)
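
A small pure-Python sketch of the analytical solution is given below. Instead of forming the inverse explicitly, it solves $X^T X w = X^T Y$ by Gaussian elimination, which is a substituted technique and an assumption of this example; the helper names `solve` and `normal_equations` and the data are hypothetical.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (redundant features?)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def normal_equations(X, y):
    """Return w solving (X^T X) w = X^T y, where a column of 1s is prepended to X."""
    rows = [[1.0] + list(x) for x in X]
    d = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(d)] for a in range(d)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(d)]
    return solve(XtX, Xty)

# Made-up data generated from y = 1 + 2*x1 - 1*x2
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0]
w = normal_equations(X, y)
print(w)
```

When two feature columns are identical, $X^T X$ is singular and the sketch raises `ValueError`, which is the non-invertible case the slide mentions.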

36 Probabilistic View of Linear Regression (1)
Let $y = f(x) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim N(0, \sigma^2)$ - Gaussian distribution

37 Probabilistic View of Linear Regression (2)
Let $y = f(x) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim N(0, \sigma^2)$ - Gaussian distribution - why?
$N(0, \sigma^2)$ has maximum entropy among all real-valued distributions with a specified variance $\sigma^2$
3-$\sigma$ rule: about 99.7% of the probability mass lies within $3\sigma$ of the mean

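38 Probabilistic View of Linear Regression (3)
Let $y = f(x) + \epsilon$
$\epsilon$ is the error term that captures unmodeled effects or random noise
$\epsilon \sim N(0, \sigma^2)$ - Gaussian distribution
Then $P(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$
And $P(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)$
[Figure: regression curve $y(x)$ with the conditional density $p(t \mid x_0)$ drawn at $x_0$; axes $x$ and $t$]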

39 Probabilistic View of Linear Regression (4)
$P(y_1, \dots, y_N \mid x_1, \dots, x_N) = P(y_1, \dots, y_N \mid x_1, \dots, x_N; W) = \prod_{i=1}^{N} P(y_i \mid x_i; W)$, assuming the noise terms are independent

40 Maximizing the Likelihood
Maximize $L(W) = \prod_{i=1}^{N} P(y_i \mid x_i; W)$
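
Taking logarithms shows why this leads back to least squares. The derivation below is a sketch that assumes each $P(y_i \mid x_i; W)$ is the Gaussian density with mean $f(x_i)$ and variance $\sigma^2$, as implied by $y = f(x) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.

```latex
\begin{aligned}
\log L(W) &= \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma}
             \exp\left( -\frac{(y_i - f(x_i))^2}{2\sigma^2} \right) \right] \\
          &= -N \log\left(\sqrt{2\pi}\,\sigma\right)
             - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (f(x_i) - y_i)^2
\end{aligned}
```

The first term does not depend on $W$, so maximizing $L(W)$ is equivalent to minimizing $\sum_{i=1}^{N} (f(x_i) - y_i)^2$.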

41 Loss Functions
Squared loss: $(f(x) - y)^2$
Absolute loss: $|f(x) - y|$
Dead band loss: $\max(0, |f(x) - y| - \epsilon)$, $\epsilon \in \mathbb{R}^+$
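
The three losses translate directly into code. In the sketch below the function names and the example residuals are illustrative assumptions.

```python
def squared_loss(pred, target):
    """(f(x) - y)^2"""
    return (pred - target) ** 2

def absolute_loss(pred, target):
    """|f(x) - y|"""
    return abs(pred - target)

def dead_band_loss(pred, target, eps):
    """max(0, |f(x) - y| - eps): errors smaller than eps cost nothing."""
    return max(0.0, abs(pred - target) - eps)

# A residual of 10 (for example, an outlier) versus a residual of 1
for residual in (1.0, 10.0):
    print(residual,
          squared_loss(residual, 0.0),
          absolute_loss(residual, 0.0),
          dead_band_loss(residual, 0.0, eps=0.5))
```

The squared loss grows from 1 to 100 between the two residuals while the absolute loss grows only from 1 to 10, so a single outlier weighs far more heavily under the squared loss.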

42 Loss Functions
Problem with squared loss

43 Linear Regression with Absolute Loss Function
Objective: $\min_W \sum_{i=1}^{N} |x_i W - y_i|$
Non-differentiable, so cannot take the gradient descent approach
Solution: frame as a constrained optimization problem
Introduce new variables $v \in \mathbb{R}^N$, $v_i \ge |x_i W - y_i|$
$\min_{W, v} \sum_{i=1}^{N} v_i$, subject to $-v_i \le x_i W - y_i \le v_i$

44 Linear Regression with Absolute Loss Function - Example
[Figure panels: LMS output; LP output]

45 Some Additional Notations
Underlying response function (target concept): $C$
Actual observed response: $y = C(x) + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$ and $E[y \mid x] = C(x)$
Predicted response based on the model learned from dataset $A$: $f(x; A)$
Expected response averaged over all datasets: $\bar{f}(x) = E_A[f(x; A)]$
Expected $L_2$ error on a new test instance $x$: $E_{err} = E_A[(f(x; A) - y)^2]$

46 Bias-Variance Analysis (1)

47 Bias-Variance Analysis (2)

48 Bias-Variance Analysis (3)
Root Mean Square Error

49 Bias-Variance Analysis (4)
9th degree polynomial fit with more sample data

50 Bias-Variance Analysis (5)
Expected square loss: $E[L] = \iint (f(x) - y)^2 P(x, y)\, dx\, dy$

51 Bias-Variance Analysis (6)
Expected square loss: $E[L] = \iint (f(x) - y)^2 P(x, y)\, dx\, dy$

52 Bias-Variance Analysis (7)
Relevant part of loss: $\int (f(x) - C(x))^2 P(x)\, dx$

53 Bias-Variance Analysis (8)
Relevant part of loss: $E_A[(f(x; A) - C(x))^2]$

54 Bias-Variance Analysis (9)
[Figure panels: Degree = 1; Degree = 4]

55 Bias-Variance Analysis (10)
Bias term of the error: $(E_A[f(x; A)] - C(x))^2$
Measures how well our approximation architecture can fit the data
Weak approximators will have high bias
  Example: low degree polynomials
Strong approximators will have low bias
  Example: high degree polynomials

56 Bias-Variance Analysis (11)
Variance term of the error: $E_A[(f(x; A) - E_A[f(x; A)])^2]$
No direct dependence on the target value
For a fixed size dataset $A$
  Strong approximators tend to have more variance
    Small changes in the dataset can result in wide changes in the predictors
  Weak approximators tend to have less variance
    Small changes in the dataset result in similar predictors
Variance disappears as $|A| \to \infty$
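
The bias and variance terms add up to the relevant part of the loss, $E_A[(f(x; A) - C(x))^2]$. The derivation below is a sketch of the standard argument, writing $\bar{f}(x) = E_A[f(x; A)]$ for the expected response averaged over datasets.

```latex
\begin{aligned}
E_A\left[(f(x; A) - C(x))^2\right]
  &= E_A\left[\left(f(x; A) - \bar{f}(x) + \bar{f}(x) - C(x)\right)^2\right] \\
  &= E_A\left[(f(x; A) - \bar{f}(x))^2\right]
     + 2\,(\bar{f}(x) - C(x))\, E_A\left[f(x; A) - \bar{f}(x)\right]
     + (\bar{f}(x) - C(x))^2 \\
  &= \underbrace{(\bar{f}(x) - C(x))^2}_{\text{bias}^2}
     + \underbrace{E_A\left[(f(x; A) - \bar{f}(x))^2\right]}_{\text{variance}}
\end{aligned}
```

The cross term vanishes because $E_A[f(x; A) - \bar{f}(x)] = 0$ by the definition of $\bar{f}(x)$.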

57 Bias-Variance Analysis (12)
Measuring bias and variance in practice
  Bootstrap from the given dataset
Start with a complex approximator, and reduce the complexity through regularization
  Setting more coefficients/parameters to 0
  Do feature selection
  Reduces variance, but can increase bias
  Hopefully just sufficient to model the given data

58 Regularization
Central idea: penalize over-complicated solutions
Linear regression minimizes $\sum_{i=1}^{N} (x_i w - y_i)^2$
Regularized regression minimizes $\sum_{i=1}^{N} (x_i w - y_i)^2 + \lambda \|w\|$

59 Modified Solution
Solution for ordinary linear regression:
$\min_w J(w) = \min_w \frac{1}{2} (Xw - Y)^T (Xw - Y)$
$w = (X^T X)^{-1} X^T Y$
Regularized version using the $L_2$ norm (ridge regression):
$\min_w J(w) = \min_w \frac{1}{2} (Xw - Y)^T (Xw - Y) + \lambda \|w\|^2$
$w = (X^T X + \lambda I)^{-1} X^T Y$ (constant factors absorbed into $\lambda$)
Exercise: derive the closed-form solution for ridge regression with the $L_2$ regularizer
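
The effect of $\lambda$ on the ridge solution can be checked numerically. The sketch below handles only the one-feature case with an intercept, solving the 2-by-2 system $(X^T X + \lambda I) w = X^T Y$ by Cramer's rule; the function name `ridge_line` and the data are assumptions, and the penalty is applied to the intercept as well because the slide's formula uses $\lambda I$.

```python
import math

def ridge_line(xs, ys, lam):
    """Solve (X^T X + lam * I) w = X^T Y for f(x) = w0 + w1 * x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Entries of the symmetric 2x2 matrix X^T X + lam * I
    a11, a12, a22 = n + lam, sx, sxx + lam
    det = a11 * a22 - a12 * a12
    w0 = (sy * a22 - a12 * sxy) / det
    w1 = (a11 * sxy - a12 * sy) / det
    return w0, w1

# Made-up data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
ols = ridge_line(xs, ys, lam=0.0)       # lam = 0 recovers ordinary least squares
shrunk = ridge_line(xs, ys, lam=10.0)   # a larger lam pulls the weights toward 0
print(ols, shrunk, math.hypot(*ols), math.hypot(*shrunk))
```

Adding $\lambda > 0$ to the diagonal also makes the matrix invertible when $X^T X$ alone is not.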

60 How to choose $\lambda$?
Tradeoff between complexity and goodness of fit
Solution 1: if we have lots of data
  Generate multiple models
  Use lots of test data to discard the bad models
Solution 2: with limited data
  Use k-fold cross validation
  Will discuss later
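
Choosing $\lambda$ by k-fold cross validation can be sketched as below. To keep the example short it assumes a single feature and no intercept, for which the ridge solution reduces to $w = \sum x_i y_i / (\sum x_i^2 + \lambda)$; the fold assignment, the candidate values, and the data are all illustrative assumptions.

```python
def ridge_slope(xs, ys, lam):
    """Ridge solution for one feature without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def k_fold_error(xs, ys, lam, k):
    """Mean squared error on held-out examples, averaged over k folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th example forms one fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = ridge_slope(train_x, train_y, lam)
        total += sum((w * xs[i] - ys[i]) ** 2 for i in held_out)
    return total / n

# Made-up data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: k_fold_error(xs, ys, lam, k=3) for lam in candidates}
best_lam = min(errors, key=errors.get)
print(errors, best_lam)
```

Each example is held out exactly once, so every candidate $\lambda$ is scored on data that was not used to fit the corresponding model.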

61 General Form of Regularizer Term
$\sum_{i=1}^{N} (x_i w - y_i)^2 + \lambda \sum_{j=1}^{D} |w_j|^q$
Quadratic/$L_2$ regularizer: $q = 2$
[Figure: contours of the regularization term for $q = 0.5$, $q = 1$, $q = 2$, $q = 4$]

62 Special scenario $q = 1$ - LASSO
Least Absolute Shrinkage and Selection Operator
Error function: $\sum_{i=1}^{N} (x_i w - y_i)^2 + \lambda \sum_{j=1}^{D} |w_j|$
For sufficiently large $\lambda$, many of the coefficients become 0, resulting in a sparse solution
[Figure: two panels with axes $w_1$ and $w_2$]
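
Why the coefficients become exactly 0 is easiest to see with a single weight. The sketch below assumes one feature and no intercept, for which the minimizer of $\sum_i (x_i w - y_i)^2 + \lambda |w|$ has a closed form via soft thresholding; this is a simplified special case, and the function names and data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(xs, ys, lam):
    """Minimize sum((x*w - y)**2) + lam * |w| over a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2.0) / sxx

# Made-up data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for lam in (0.0, 28.0, 100.0):
    print(lam, lasso_1d(xs, ys, lam))
```

Here $\sum x_i y_i = 28$, so any $\lambda \ge 56$ drives the weight to exactly 0, whereas an $L_2$ penalty would only shrink it.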

63 LASSO
Quadratic programming to solve the optimization problem
Least Angle Regression solution - refer to ESL
MATLAB packages for LASSO

64 Linear Regression with Non-Linear Basis Functions
Linear combination of fixed non-linear functions of the input variables:
$f(x) = w_0 + \sum_{j=1}^{D} w_j \phi_j(x)$

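65 Linear Regression with Basis Functions Solution
$f(X) = \begin{bmatrix} 1 & \phi_1(x_1) & \cdots & \phi_D(x_1) \\ 1 & \phi_1(x_2) & \cdots & \phi_D(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(x_N) & \cdots & \phi_D(x_N) \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = Y$
$w = (\phi(X)^T \phi(X))^{-1} \phi(X)^T Y$, where $\phi(X)$ is the matrix of basis function values above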
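
Building the matrix $\phi(X)$ is the only new step compared with ordinary linear regression. The sketch below shows it for scalar inputs; the particular basis (identity, square, sine), the weights, and the function names are illustrative assumptions.

```python
import math

def design_matrix(xs, basis):
    """Row i is [1, phi_1(x_i), ..., phi_D(x_i)]."""
    return [[1.0] + [phi(x) for phi in basis] for x in xs]

def predict(Phi, w):
    """f(X) = Phi w: the model stays linear in w even though it is non-linear in x."""
    return [sum(wj * pj for wj, pj in zip(w, row)) for row in Phi]

basis = [lambda x: x, lambda x: x ** 2, math.sin]  # illustrative choice of phi_1..phi_3
xs = [0.0, 1.0, 2.0]
Phi = design_matrix(xs, basis)
w = [1.0, 0.0, 2.0, 0.0]  # hypothetical weights giving f(x) = 1 + 2 * x**2
preds = predict(Phi, w)
print(Phi, preds)
```

Because the predictions are linear in `w`, the matrix `Phi` can take the place of $X$ in either gradient descent or the analytical solution.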

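66 Linear Regression with Multiple Outputs
Multiple outputs ($K$ per example): $Y = \begin{bmatrix} y_{11} & \cdots & y_{1K} \\ \vdots & \ddots & \vdots \\ y_{N1} & \cdots & y_{NK} \end{bmatrix}$
$f(X) = XW = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{bmatrix} W = Y$
$W = (X^T X)^{-1} X^T Y$, where $W = \begin{bmatrix} w_{10} & \cdots & w_{K0} \\ \vdots & \ddots & \vdots \\ w_{1D} & \cdots & w_{KD} \end{bmatrix}$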

67 Summary
Linear Regression (aka curve fitting)
Gradient descent approach for finding the solution
Analytical solution
Loss functions
Probabilistic view of linear regression
Bias-Variance analysis
Regularization
  Ridge regression
Regression with basis functions
Locally weighted regression (refer ML - 8.3)
