Lecture 5 Multivariate Linear Regression



1 Lecture 5: Multivariate Linear Regression
Dan Sheldon, September 23, 2014

2 Topics
Multivariate linear regression
- Model
- Cost function
- Normal equations
- Gradient descent
- Features

3 Book Data
[Figure: Weight (lbs.) versus Pages, with a fitted line.]

6 Book Data
Can we predict better with multiple features?
[Table: book data with columns Width, Thickness, Height, # Pages, Hardcover, Weight]
Training data: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$
Each $x^{(i)}$ is a feature vector.

7 Multivariate Linear Regression
Input: $x \in \mathbb{R}^n$
Output: $y \in \mathbb{R}$
Model (hypothesis class): ?
Cost function: ?

11 Model
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$$
$$h_\theta(x) = \theta^T x = x^T \theta$$
(Augment the feature vector with a leading 1.)
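As a minimal sketch of the model above, the following evaluates $h_\theta(x) = \theta^T x$ by prepending a 1 to the feature vector and compares it with the term-by-term sum. The function name `predict` and the numeric values are assumptions chosen only for illustration.

```python
import numpy as np

def predict(theta, x):
    """Evaluate h_theta(x) = theta^T x after prepending a 1 to x."""
    x_aug = np.concatenate(([1.0], x))  # augmented feature vector (1, x_1, ..., x_n)
    return float(theta @ x_aug)

# Hypothetical parameters and features, not taken from the lecture.
theta = np.array([0.5, 2.0, -1.0])  # theta_0, theta_1, theta_2
x = np.array([3.0, 4.0])            # x_1, x_2

# The same value written out as theta_0 + theta_1 x_1 + theta_2 x_2.
explicit = theta[0] + theta[1] * x[0] + theta[2] * x[1]
print(predict(theta, x), explicit)
```

Both printed values agree, which is the point of the augmentation: the intercept $\theta_0$ becomes an ordinary coefficient.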

12 Geometry of high dimensional linear (affine) functions
n-dimensional function $h_\theta : \mathbb{R}^n \to \mathbb{R}$
$h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$ (linear)
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$ (affine)
Three facts on board
1. Contours = hyperplanes
2. Gradient = $\theta$ (a vector, orthogonal to contours)
3. The norm $\|\theta\|$ can be interpreted as slope
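The three facts can be checked directly from the definition. The following is a short sketch, not the board derivation; here $\theta$ denotes $(\theta_1, \ldots, \theta_n)$ without the intercept, and the level $c$ and step size $t$ are symbols introduced only for illustration.

```latex
\begin{align*}
\{x : \theta_0 + \theta^T x = c\}
  &\ \text{is a hyperplane with normal vector } \theta, \\
\nabla_x h_\theta(x)
  &= (\theta_1, \ldots, \theta_n) = \theta, \\
h_\theta\!\left(x + t\,\frac{\theta}{\|\theta\|}\right) - h_\theta(x)
  &= t\,\frac{\theta^T \theta}{\|\theta\|} = t\,\|\theta\|.
\end{align*}
```

The last line says that moving a distance $t$ in the direction of the gradient changes the function value by $t\,\|\theta\|$, which is the sense in which the norm acts as a slope.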

15 The Problem
Find $\theta$ such that $y^{(i)} \approx h_\theta(x^{(i)})$, $i = 1, \ldots, m$
$$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \approx \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$
$$y \approx X\theta$$

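16 Inputs: Data Matrix and Label Vector
Data matrix:
$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}$$
Label vector:
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
[Table: book data with columns Width, Thickness, Height, # Pages, Hardcover, Weight]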
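A minimal sketch of assembling the data matrix and label vector in NumPy. The numbers are hypothetical (they are not the lecture's book data), and the variable names `features`, `X`, and `y` are assumptions.

```python
import numpy as np

# Hypothetical raw features: m = 3 examples, n = 2 features per example.
features = np.array([[6.0, 1.0],
                     [8.0, 2.0],
                     [9.0, 1.5]])
y = np.array([1.0, 2.5, 2.0])  # label vector, one entry per example

# Prepend a column of ones so that theta_0 plays the role of the intercept.
X = np.column_stack([np.ones(len(features)), features])
print(X.shape)  # (m, n + 1)
```

Each row of `X` is one augmented feature vector, so `X @ theta` computes all $m$ predictions at once.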

17 Illustration
Find $\theta$ such that $y^{(i)} \approx h_\theta(x^{(i)})$, $i = 1, \ldots, m$
[Figure 3.1 from Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2nd Ed.), 2009, Chap. 3: Linear least squares fitting with $X \in \mathbb{R}^2$. We seek the linear function of $X$ that minimizes the sum of squared residuals from $Y$. Axes: $X_1$, $X_2$, $Y$.]

19 Cost Function
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Exercise: write this succinctly in matrix-vector notation.

20 Cost Function
Answer: $J(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y)$

21 The Problem
Given training data $X$ and $y$, find $\theta$ to minimize the cost function:
$$J(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y)$$
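To make the matrix-vector form concrete, the sketch below computes the cost both as a sum over examples and as $\frac{1}{2}(X\theta - y)^T(X\theta - y)$. The function names and the tiny dataset are assumptions for illustration only.

```python
import numpy as np

def cost_loop(theta, X, y):
    """Half the sum of squared errors, one training example at a time."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        total += (theta @ x_i - y_i) ** 2
    return 0.5 * total

def cost_vectorized(theta, X, y):
    """The same quantity as 1/2 (X theta - y)^T (X theta - y)."""
    r = X @ theta - y  # residual vector
    return 0.5 * float(r @ r)

# Hypothetical data: first column of X is the column of ones.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
theta = np.array([0.0, 1.0])

print(cost_loop(theta, X, y), cost_vectorized(theta, X, y))
```

With these values the residuals are 0, 0, and 1, so both functions return 0.5.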

22 Solution 1: Normal Equations
Normal equations: $\theta = (X^T X)^{-1} X^T y$
Heuristic derivation:

23 Proper Approach
Set all partial derivatives to zero:
$$0 = \frac{\partial}{\partial \theta_j} J(\theta), \quad j = 0, \ldots, n$$
Solve a system of $n + 1$ linear equations for $\theta_0, \ldots, \theta_n$.
Tedious, but leads to the normal equations.

28 Matrix Calculus
Succinct (and cool!) way to solve for the normal equations:
$$0 = \nabla J(\theta) = \frac{d}{d\theta} \frac{1}{2} (X\theta - y)^T (X\theta - y)$$
$$0 = (X\theta - y)^T X$$
$$0 = X^T (X\theta - y)$$
$$X^T X \theta = X^T y$$
$$\theta = (X^T X)^{-1} X^T y$$
(Note: you are not responsible for the vector derivative in the first line, but you should understand the rest of the derivation.)
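A minimal sketch of the normal equations in NumPy, assuming $X^T X$ is invertible. It solves the linear system $X^T X \theta = X^T y$ rather than forming the inverse explicitly, which is a common numerical choice and not something the slides prescribe. The function name and data are hypothetical.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data: first column of X is the column of ones.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

theta = fit_normal_equations(X, y)
gradient = X.T @ (X @ theta - y)  # should vanish at the minimizer
print(theta, gradient)
```

At the solution the gradient $X^T(X\theta - y)$ is zero up to rounding, matching the third line of the derivation.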

30 Solution 2: Gradient Descent
1. Initialize $\theta_0, \theta_1, \ldots, \theta_n$ arbitrarily
2. Repeat until convergence:
$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta), \quad j = 0, \ldots, n$$
Partial derivatives:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
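One way to sanity-check the partial-derivative formula is to compare it with a finite-difference approximation of $J$. The sketch below does this on hypothetical data; the helper names and the step size `eps` are assumptions, not part of the lecture.

```python
import numpy as np

def cost(theta, X, y):
    r = X @ theta - y
    return 0.5 * float(r @ r)

def partial_derivative(theta, X, y, j):
    """Sum over examples of (h_theta(x_i) - y_i) * x_ij."""
    return sum((theta @ x_i - y_i) * x_i[j] for x_i, y_i in zip(X, y))

def numerical_partial(theta, X, y, j, eps=1e-6):
    """Central finite difference of the cost in coordinate j."""
    step = np.zeros_like(theta)
    step[j] = eps
    return (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)

# Hypothetical data: first column of X is the column of ones.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
theta = np.array([0.0, 1.0])

for j in range(len(theta)):
    print(j, partial_derivative(theta, X, y, j), numerical_partial(theta, X, y, j))
```

The two columns of output should agree closely for every $j$.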

31 Vectorized Gradient Descent
1. Initialize $\theta$ arbitrarily
2. Repeat until convergence:
$$\theta \leftarrow \theta - \alpha \underbrace{X^T (X\theta - y)}_{\nabla J(\theta)}$$
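The vectorized update can be sketched as follows. This version starts from zero and runs a fixed number of iterations instead of testing for convergence; the step size `alpha`, the iteration count, and the data are all assumptions picked so that the iteration converges on this small example.

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Repeat theta <- theta - alpha * X^T (X theta - y) for num_iters steps."""
    theta = np.zeros(X.shape[1])  # one arbitrary initialization
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y)
        theta = theta - alpha * gradient
    return theta

# Hypothetical data: first column of X is the column of ones.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

theta = gradient_descent(X, y, alpha=0.05, num_iters=5000)
print(theta)
```

On this data the result matches the normal-equations solution to high precision. A step size that is too large makes the iterates diverge, so `alpha` generally needs tuning per dataset.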

32 Feature Normalization
Demo: Problem 3 from HW0
Advice: normalize your features so they have similar numeric ranges!

35 Feature Normalization
For each feature $j$, compute the mean $\mu_j$ and standard deviation $\sigma_j$ of that feature over the training set:
$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2}$$
Then, subtract the mean and divide by the standard deviation:
$$x_j^{(i)} \leftarrow \left( x_j^{(i)} - \mu_j \right) / \sigma_j$$
Effect: the columns of the data matrix are adjusted to have mean zero and standard deviation equal to one.
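A minimal sketch of this normalization in NumPy, applied to the raw feature columns before the column of ones is added. It assumes no feature is constant (so every $\sigma_j > 0$); the function name and the data are hypothetical.

```python
import numpy as np

def normalize_features(features):
    """Subtract each column's mean and divide by its standard deviation."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)  # divides by m, matching the formula above
    return (features - mu) / sigma, mu, sigma

# Hypothetical raw features on very different numeric scales.
features = np.array([[6.0, 300.0],
                     [8.0, 1200.0],
                     [9.0, 450.0]])

normalized, mu, sigma = normalize_features(features)
print(normalized.mean(axis=0), normalized.std(axis=0))
```

The function returns `mu` and `sigma` so that the same shift and scale can be applied to new inputs at prediction time.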

36 Feature Normalization
Example: cost function contours before and after normalization
[Figure: two contour plots of the cost function, with axes $w_1$ and $w_2$.]

37 Feature Design
It is possible to fit nonlinear functions using linear regression:
$$(x_1, x_2, x_3) \mapsto (x_1, x_2, x_3, x_1^2, \log(x_2), x_1 + x_3)$$
Approaches
- Try standard transformations
- Design features you think will work
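The feature map above can be written as a small function. This is a sketch assuming $x_2 > 0$ so that the logarithm is defined; the name `expand_features` and the input values are hypothetical.

```python
import numpy as np

def expand_features(x):
    """Map (x1, x2, x3) to (x1, x2, x3, x1^2, log(x2), x1 + x3)."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 ** 2, np.log(x2), x1 + x3])

# Hypothetical input vector.
x = np.array([2.0, 1.0, 3.0])
print(expand_features(x))
```

The model stays linear in $\theta$; only the inputs are transformed, so the same cost function and fitting procedures apply to the expanded vectors.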

38 Polynomial Regression
$$x \mapsto (1, x, x^2, x^3, \ldots)$$
[Figure: plot of $y$ against $x$.]
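Polynomial regression is linear regression on powers of $x$. The sketch below builds the feature matrix with `np.vander` and fits it with the normal equations on synthetic, noise-free data. The degree, the data, and the helper name are assumptions for illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Columns are x^0, x^1, ..., x^degree; x^0 is the column of ones."""
    return np.vander(x, degree + 1, increasing=True)

# Synthetic data generated from a known quadratic, so the fit can be checked.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x - 3.0 * x ** 2

X = polynomial_features(x, degree=2)
theta = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
print(theta)
```

Because the data are exactly quadratic, the recovered coefficients match the generating ones (1, 2, -3) up to rounding.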
