CS 340 Lec. 15: Linear Regression


1 CS 340 Lec. 15: Linear Regression
AD, February 2011

2 Regression
Assume you are given some training data $\{x^i, y^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^d$ and $y^i \in \mathbb{R}^c$. Given a test input $x$, you want to predict/estimate the output $y$ associated with $x$.
Applications:
- $x$: location, $y$: sensor reading.
- $x$: stock price at times $t-d, t-d+1, \ldots, t-1$; $y$: stock price at time $t$.
- $x$: temperature on days $t-d, t-d+1, \ldots, t-1$; $y$: temperature on day $t$.

3 Formalization
We want to learn a mapping/predictor based on $\{x^i, y^i\}_{i=1}^N$:
$$f : \mathbb{R}^d \to \mathbb{R}^c$$
which allows us to predict the response $y$ given a new input $x$; i.e. $y(x) = f(x)$.
Linear regression is the simplest approach to building such a mapping and is ubiquitous in applied science.

4 Linear Regression
For the sake of simplicity, consider the simplest case $c = 1$ and $d = 1$. Linear regression assumes the model
$$y(x) = w_1 x + w_0$$
where $w_0, w_1 \in \mathbb{R}$.
Given only 2 training data points, we can solve exactly for $w_1$ and $w_0$, but the result would depend on which 2 training points were selected and would be very sensitive to noise in the training responses $y^i$.
A more sensible approach is to minimize the residual errors $\varepsilon_i$ over the $N$ training data:
$$\varepsilon_i = y^i - y(x^i) = y^i - w_1 x^i - w_0.$$

5 Least Squares Regression
We propose to minimize the sum of squared residual errors with respect to $(w_0, w_1)$:
$$E(w_0, w_1) = \sum_{i=1}^N \left( y^i - w_1 x^i - w_0 \right)^2.$$
(Figure: sum-of-squares error contours for linear regression in the $(w_0, w_1)$ plane.)

6 Least Squares Regression
We compute $\frac{\partial E}{\partial w_0}$ and set it equal to zero:
$$\frac{\partial E}{\partial w_0} = -2 \sum_{i=1}^N \left( y^i - w_1 x^i - w_0 \right) = 0
\;\Rightarrow\;
w_0 = \frac{1}{N}\sum_{i=1}^N y^i - w_1 \frac{1}{N}\sum_{i=1}^N x^i = \overline{y} - w_1 \overline{x}.$$
Substituting $w_0 = \overline{y} - w_1 \overline{x}$ back into $E(w_0, w_1)$, we obtain
$$E(\overline{y} - w_1 \overline{x},\, w_1) = \sum_{i=1}^N \left( (y^i - \overline{y}) - w_1 (x^i - \overline{x}) \right)^2.$$

7 Least Squares Regression
We compute $\frac{\partial E}{\partial w_1}$ and set it equal to zero:
$$\frac{\partial E}{\partial w_1} = -2 \sum_{i=1}^N (x^i - \overline{x}) \left( (y^i - \overline{y}) - w_1 (x^i - \overline{x}) \right) = 0
\;\Rightarrow\;
w_1 = \frac{\sum_{i=1}^N (x^i - \overline{x})(y^i - \overline{y})}{\sum_{i=1}^N (x^i - \overline{x})^2}.$$
Hence
$$w_0 = \overline{y} - w_1 \overline{x}
= \overline{y} - \frac{\sum_{i=1}^N (x^i - \overline{x})(y^i - \overline{y})}{\sum_{i=1}^N (x^i - \overline{x})^2}\, \overline{x}.$$
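
As a quick illustration of these closed-form expressions, here is a minimal NumPy sketch (not from the original slides; the data values are made up for the example) that computes $w_1$ and $w_0$ for 1-D data:

```python
import numpy as np

# Hypothetical 1-D training data (x, y); any two arrays of equal length work.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

x_bar, y_bar = x.mean(), y.mean()
# w1 = sum_i (x_i - x_bar)(y_i - y_bar) / sum_i (x_i - x_bar)^2
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# w0 = y_bar - w1 * x_bar
w0 = y_bar - w1 * x_bar
print(w1, w0)
```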

8 Example
(Figure: 1-D least-squares fit showing prediction and truth.) In linear least squares we minimize the sum of squared distances from each training point (denoted by a red circle) to its approximation (denoted by a blue cross); that is, we minimize the sum of squared residuals to the line $y(x) = w_0 + w_1 x$. The red diagonal line is the least-squares regression line. Note that these residual lines are not perpendicular to the least-squares line, in contrast to PCA.

9 Least Squares Regression for Multidimensional Inputs
Consider now the case $c = 1$ but $d > 1$, and the model
$$y(x) = w_0 + \sum_{k=1}^d w_k x_k = w^T x + w_0 = \tilde{w}^T \tilde{x}$$
where
$$\tilde{w}^T = (w_0\;\; w^T) = (w_0\; w_1\; \cdots\; w_d), \qquad
\tilde{x}^T = (1\;\; x^T) = (1\; x_1\; \cdots\; x_d).$$
Given $N$ training data, we want to minimize with respect to $\tilde{w}$ the sum of squared residual errors
$$E(\tilde{w}) = \sum_{i=1}^N \left( y^i - \tilde{w}^T \tilde{x}^i \right)^2.$$

10 Least Squares Regression for Multidimensional Inputs
We can rewrite
$$E(\tilde{w}) = \left( Y - \tilde{X} \tilde{w} \right)^T \left( Y - \tilde{X} \tilde{w} \right)$$
where
$$Y = (y^1 \cdots y^N)^T \text{ is an } N \times 1 \text{ matrix}, \qquad
\tilde{X} = (\tilde{x}^1\; \tilde{x}^2 \cdots \tilde{x}^N)^T \text{ is an } N \times (d+1) \text{ matrix}.$$
Expanding,
$$E(\tilde{w}) = Y^T Y - 2 Y^T \tilde{X} \tilde{w} + \tilde{w}^T \tilde{X}^T \tilde{X} \tilde{w}.$$
By setting $\frac{\partial E}{\partial \tilde{w}} = 0$, we obtain, if $\tilde{X}^T \tilde{X}$ is invertible,
$$\tilde{w}_{LS} = \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T Y.$$
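
A minimal NumPy sketch of this normal-equation solution (the data and variable names are illustrative, not from the slides); in practice `np.linalg.lstsq` is numerically preferable to forming $(\tilde{X}^T\tilde{X})^{-1}$ explicitly:

```python
import numpy as np

# Hypothetical training data: N points in d dimensions.
N, d = 50, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=N)

# Augment with a column of ones: X_tilde is N x (d+1).
X_tilde = np.hstack([np.ones((N, 1)), X])

# Normal equations: w_LS = (X_tilde^T X_tilde)^{-1} X_tilde^T y,
# solved with np.linalg.solve rather than an explicit matrix inverse.
w_ls = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Equivalent, more numerically stable route:
w_lstsq, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
print(w_ls, w_lstsq)
```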

11 Least Squares Regression for Multidimensional Inputs and Outputs
Consider the case where $c > 1$ and $d > 1$. Then
$$E(\tilde{W}) = \sum_{i=1}^N \left( y^i - \tilde{W}^T \tilde{x}^i \right)^T \left( y^i - \tilde{W}^T \tilde{x}^i \right)
= \sum_{j=1}^c \sum_{i=1}^N \left( y^i_j - \left( \tilde{W}^T \tilde{x}^i \right)_j \right)^2
= \left\| Y - \tilde{X} \tilde{W} \right\|_F^2$$
where $\tilde{W} \in \mathbb{R}^{(d+1) \times c}$.
It can also be shown that
$$\tilde{W}_{LS} = \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T Y.$$

12 Geometric Interpretation
For the sake of simplicity, let us come back to the case where $c = 1$. For a new input $x$ we have
$$y(x) = \tilde{w}_{LS}^T \tilde{x} = \tilde{x}^T \tilde{w}_{LS} = \tilde{x}^T \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T Y.$$
On the training set $\{x^i, y^i\}_{i=1}^N$, we have
$$\hat{Y}(\tilde{X}) = \tilde{X} \tilde{w}_{LS} = \tilde{X} \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T Y,$$
so
$$\tilde{X}^T \left( \hat{Y}(\tilde{X}) - Y \right) = \tilde{X}^T \tilde{X} \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T Y - \tilde{X}^T Y = 0.$$
$\hat{Y}(\tilde{X})$ is simply the orthogonal projection of $Y$ onto the columns of $\tilde{X}$.
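
This orthogonality property is easy to check numerically; a small sketch, assuming a design matrix, targets, and least-squares fit like those in the earlier example:

```python
import numpy as np

def check_orthogonality(X_tilde, y, w_ls):
    """Residual of the least-squares fit projected onto the columns of X_tilde.

    Returns X_tilde^T (y_hat - y), which should be zero up to floating-point error.
    """
    residual = X_tilde @ w_ls - y
    return X_tilde.T @ residual

# Usage sketch: print(check_orthogonality(X_tilde, y, w_ls))
```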

13 Nonlinear Regression using Basis Functions
Linear regression can handle nonlinear regression problems; e.g.
$$y(x) = w^T \Phi(\tilde{x})$$
where $\Phi : \mathbb{R}^{d+1} \to \mathbb{R}^m$ and $w$ is an $m$-dimensional vector.
Examples:
- $\Phi(\tilde{x}) = (1, x, x^2)^T$ ($d = 1$, $m = 3$);
- $\Phi(\tilde{x}) = \tilde{x} = (1, x_1, x_2)$ ($d = 2$, $m = 3$);
- $\Phi(\tilde{x}) = (1, x_1, x_2, x_1^2, x_2^2)$ ($d = 2$, $m = 5$).
(Figure: polynomial regression fits of increasing degree, panels (a)-(c).)
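
For instance, a minimal sketch (illustrative names and data, not from the slides) that builds the polynomial basis $\Phi(\tilde{x}) = (1, x, x^2)^T$ for 1-D inputs and reuses the same least-squares machinery:

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D array x to the polynomial basis (1, x, x^2, ..., x^degree)."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical 1-D data with a quadratic trend plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.normal(size=x.size)

Phi = poly_features(x, degree=2)                # N x 3 design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # linear in w, as before
print(w)  # approximately (1, 2, -3)
```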

14 MSE on Training and Test Sets
(Figure: MSE on training and test sets vs. size of the training set, panels (a)-(c).)
Data generated from a degree-2 polynomial with Gaussian noise of variance 4. We fit polynomial models of varying degree to this data: (a) degree 1, (b) degree 2, (c) degree 25.
For small training set sizes, the test error of the degree-25 polynomial is higher than that of the degree-2 polynomial, due to overfitting, but this difference vanishes once we have enough data. Note also that the degree-1 polynomial is too simple and has high test error even given large amounts of training data.

15 Kernel Regression
Another way to perform nonlinear regression is to use kernels to define the basis functions:
$$\Phi(x) = \left( K(x, \mu_1), \ldots, K(x, \mu_m) \right).$$
For example, we could use a Radial Basis Function (RBF)
$$K(x, \mu) = \exp\left( -\frac{(x - \mu)^T (x - \mu)}{2\sigma^2} \right).$$
Alternatively we can use any set of functions: wavelets, curvelets, splines, etc.
Selecting $(\mu_1, \ldots, \mu_m, \sigma^2)$ can be difficult.
How to select the centers $\mu$: 1) place the centers uniformly spaced in the region containing the data, 2) place one center at each data point, 3) cluster the data and use one center for each cluster, 4) use CV, MLE or a Bayesian approach.
How to select $\sigma^2$: 1) use the average squared distance to neighboring centers (scaled by a constant), 2) use CV, MLE or a Bayesian approach.
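
A minimal sketch of RBF-basis regression, placing one center at each training point (option 2 above); the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """RBF basis for 1-D inputs: Phi[i, j] = exp(-(x_i - mu_j)^2 / (2 sigma^2))."""
    sq_dists = (x[:, None] - centers[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=40))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

Phi = rbf_features(x, centers=x, sigma=0.5)     # one center per data point
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # still linear in w
y_fit = Phi @ w
```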

16 Kernel Regression
(Left) RBF basis functions in 1-D. (Middle) Basis functions evaluated on a grid. (Right) Design matrix.

17 A Probabilistic Interpretation of Least-Squares Regression
Consider the following model:
$$p\left( y \mid x, w, \sigma^2 \right) = \mathcal{N}\left( y;\, y(x),\, \sigma^2 \right) \quad \text{where } y(x) = w^T \Phi(\tilde{x}).$$
Given the training set $\{x^i, y^i\}_{i=1}^N$, the MLE estimates of $(w, \sigma^2)$ are given by
$$\left( w_{MLE},\, \sigma^2_{MLE} \right) = \arg\max_{(w, \sigma^2)} \sum_{i=1}^N \log p\left( y^i \mid x^i, w, \sigma^2 \right)$$
where
$$\sum_{i=1}^N \log p\left( y^i \mid x^i, w, \sigma^2 \right)
= -\frac{N}{2} \log\left( 2\pi\sigma^2 \right) - \frac{1}{2\sigma^2} \sum_{i=1}^N \left( y^i - w^T \tilde{x}^i \right)^2.$$
Hence
$$w_{MLE} = \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T Y
\quad \text{and} \quad
\sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^N \left( y^i - w_{MLE}^T \tilde{x}^i \right)^2.$$
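
In code, the only new quantity beyond the earlier least-squares sketch is the noise variance estimate; a small sketch, assuming a design matrix, targets, and least-squares fit as in the earlier example:

```python
import numpy as np

def noise_variance_mle(X_tilde, y, w_ls):
    """MLE of the noise variance: mean squared residual of the least-squares fit."""
    residuals = y - X_tilde @ w_ls
    return np.mean(residuals ** 2)

# Usage sketch: sigma2_mle = noise_variance_mle(X_tilde, y, w_ls)
```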

18 Robust Regression
The problem with least-squares regression is that it is very sensitive to outliers, as it minimizes
$$E_{L2}(\tilde{w}) = \sum_{i=1}^N \left( y^i - \tilde{w}^T \tilde{x}^i \right)^2 = \sum_{i=1}^N \left( y^i - y(x^i) \right)^2,$$
i.e. the squares of the residual errors.
To design a procedure less sensitive to outliers, we could instead pick
$$E_{L1}(\tilde{w}) = \sum_{i=1}^N \left| y^i - \tilde{w}^T \tilde{x}^i \right| = \sum_{i=1}^N \left| y^i - y(x^i) \right|.$$
As $|u|$ goes to infinity more slowly than $u^2$ when $|u| \to \infty$, this procedure is less sensitive to outliers.
We could also use something like the Huber loss
$$E_{Huber}(\tilde{w}) = \sum_{i=1}^N c\left( y^i - y(x^i) \right)
\quad \text{where} \quad
c(u) = \begin{cases} u^2 & \text{if } |u| \le u_0 \\ 2 u_0 |u| - u_0^2 & \text{if } |u| > u_0, \end{cases}$$
which is quadratic for small residuals and grows only linearly for large ones.
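
One simple way to minimize the L1 criterion is iteratively reweighted least squares; this sketch is not from the slides, and the epsilon-clipping of small residuals is an assumption made to keep the weights finite:

```python
import numpy as np

def l1_regression_irls(X_tilde, y, n_iter=50, eps=1e-6):
    """Minimize sum_i |y_i - w^T x_i| by iteratively reweighted least squares.

    At each step, each point is weighted by 1/|residual| (clipped at eps), and the
    resulting weighted least-squares problem is solved in closed form.
    """
    w = np.linalg.lstsq(X_tilde, y, rcond=None)[0]   # least-squares starting point
    for _ in range(n_iter):
        r = np.abs(y - X_tilde @ w)
        weights = 1.0 / np.maximum(r, eps)
        Xw = X_tilde * weights[:, None]
        # Weighted normal equations: (X^T Lambda X) w = X^T Lambda y
        w = np.linalg.solve(Xw.T @ X_tilde, Xw.T @ y)
    return w

# Usage sketch: w_robust = l1_regression_irls(X_tilde, y)
```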

19 A Probabilistic Interpretation of Robust Regression
Consider the following model, where $y(x) = \tilde{w}^T \tilde{x}$ and
$$p(y \mid x, w, b) = \mathrm{Lap}\left( y;\, y(x),\, b \right), \qquad
\mathrm{Lap}(x; \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right)$$
is the Laplace density with location $\mu$ and scale $b$.
Given the training set $\{x^i, y^i\}_{i=1}^N$, the MLE estimates of $(w, b)$ are given by
$$\left( w_{MLE,Laplace},\, b_{MLE} \right)
= \arg\max_{(w, b)} \sum_{i=1}^N \log p\left( y^i \mid x^i, w, b \right)
= \arg\max_{(w, b)} \left[ -N \log(2b) - \frac{1}{b} \sum_{i=1}^N \left| y^i - \tilde{w}^T \tilde{x}^i \right| \right].$$
That is, $w_{MLE,Laplace}$ minimizes $E_{L1}(\tilde{w})$.

20 Gauss versus Laplace
(Left) Plots of the $\mathcal{N}(y; 0, 1)$ and $\mathrm{Lap}(y; 0, 1/\sqrt{2})$ densities (both have zero mean and unit variance). (Right) Negative logarithms of these pdfs.

21 Gauss versus Laplace Regression
(Figure) Linear regression with outliers. Data points are black circles. Black dotted line: least-squares fit. Blue dashed line: Laplace (robust) fit. A Student-t fit is also shown.

22 Limitations of Least-Squares Regression
Remember that our estimate is given by
$$\tilde{w}_{LS} = \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T Y.$$
This assumes that the $(d+1) \times (d+1)$ matrix $\tilde{X}^T \tilde{X}$ is invertible.
We have $\mathrm{rank}\left( \tilde{X}^T \tilde{X} \right) = \mathrm{rank}\left( \tilde{X} \right) \le \min(N, d+1)$. Hence in the common scenario where $N < d+1$, we can never invert $\tilde{X}^T \tilde{X}$!
We also have problems when columns/rows of $\tilde{X}$ are almost linearly dependent (collinearity).

23 The Need for Regularization
When we have a small amount of data, we want to be able to regularize the solution and limit overfitting.
When fitting a polynomial regression model, wiggly functions will have large weights $w$. For example, for the degree-14 polynomial model fitted previously, there are 11 coefficients $w_k$ such that $|\hat{w}_{k,LS}| > 100$!

24 Ridge Regression
To bypass the fact that $\tilde{X}^T \tilde{X}$ might not be invertible, we consider instead
$$\tilde{w}_R = \left( \tilde{X}^T \tilde{X} + \lambda I_{d+1} \right)^{-1} \tilde{X}^T Y$$
where $\lambda > 0$ and $I_{d+1}$ is the $(d+1) \times (d+1)$ identity matrix.
This estimate minimizes
$$E(\tilde{w}) = \sum_{i=1}^N \left( y^i - \tilde{w}^T \tilde{x}^i \right)^2 + \lambda\, \tilde{w}^T \tilde{w},$$
which is equivalent to minimizing
$$\sum_{i=1}^N \left( y^i - \tilde{w}^T \tilde{x}^i \right)^2 \quad \text{s.t.} \quad \tilde{w}^T \tilde{w} \le t(\lambda).$$
This shrinks the value of $\tilde{w}$ towards zero.
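
The ridge estimate is a one-line change to the earlier normal-equation sketch; the function name and the reuse of a precomputed design matrix are assumptions of this illustration:

```python
import numpy as np

def ridge_fit(X_tilde, y, lam):
    """Ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    d_plus_1 = X_tilde.shape[1]
    A = X_tilde.T @ X_tilde + lam * np.eye(d_plus_1)
    return np.linalg.solve(A, X_tilde.T @ y)

# Usage sketch: w_ridge = ridge_fit(X_tilde, y, lam=1.0)
```

In practice one often leaves the intercept column unpenalized, mirroring the flat prior on $w_0$ discussed on the next slide.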

25 A Probabilistic Interpretation of Ridge Regression
Consider the following model, where $y(x) = \tilde{w}^T \tilde{x}$:
$$p\left( y \mid x, w, \sigma^2 \right) = \mathcal{N}\left( y;\, y(x),\, \sigma^2 \right), \qquad
p\left( \tilde{w} \mid \sigma^2 \right) = \prod_{k=0}^d \mathcal{N}\left( w_k;\, 0,\, \sigma^2/\lambda \right).$$
In practice, we usually set a flat prior on $w_0$; i.e. $\mathcal{N}(w_0; 0, \sigma^2/\lambda_0)$ with $\lambda_0 \ll 1$.
Given $\{x^i, y^i\}_{i=1}^N$, the MAP estimate of $\tilde{w}$ is given by
$$\tilde{w}_{MAP} = \arg\max_{\tilde{w}} \log p\left( \tilde{w} \mid x^{1:N}, y^{1:N}, \sigma^2 \right)$$
where, up to additive constants independent of $\tilde{w}$,
$$\log p\left( \tilde{w} \mid x^{1:N}, y^{1:N}, \sigma^2 \right)
= \sum_{i=1}^N \log p\left( y^i \mid x^i, \tilde{w}, \sigma^2 \right) + \log p\left( \tilde{w} \mid \sigma^2 \right)
= -\frac{N}{2} \log\left( 2\pi\sigma^2 \right) - \frac{1}{2\sigma^2} \sum_{i=1}^N \left( y^i - \tilde{w}^T \tilde{x}^i \right)^2 - \frac{\lambda\, \tilde{w}^T \tilde{w}}{2\sigma^2}.$$
Hence $\tilde{w}_{MAP} = \tilde{w}_R = \left( \tilde{X}^T \tilde{X} + \lambda I_{d+1} \right)^{-1} \tilde{X}^T Y$.

26 Example of Ridge Regression
(Figure: fits for increasing $\ln \lambda$, panels (a)-(c).)
Degree-14 polynomial fit to $N = 21$ data points with increasing amounts of L2 regularization. The error bars, representing the noise standard deviation $\sigma$, get wider as $\lambda$ increases, since we are ascribing more of the variation in the data to the noise.

27 Regularization Path for Ridge Regression
(Figure: profiles of the ridge coefficients — lcavol, lweight, age, lbph, svi, lcp, gleason, pgg — for an example on real data with $d = 8$, plotted against the bound on $\tilde{w}^T \tilde{w}$; a small $t(\lambda)$ means a large $\lambda$.)

28 Example of Ridge Regression
(Figure, panels (a)-(c).) (Left) Training and test MSE for a degree-14 polynomial fit by ridge regression, plotted vs $\log(\lambda)$; models are ordered from complex (small regularizer) on the left to simple (large regularizer) on the right. (Center) Estimate of the test MSE produced by 5-fold cross-validation; the lowest CV error is indicated. (Right) Log marginal likelihood (evidence) vs $\log(\alpha)$, where $\alpha = \lambda/\sigma^2$.

29 L1 Regression - Lasso
We minimize in this case
$$E(\tilde{w}) = \sum_{i=1}^N \left( y^i - \tilde{w}^T \tilde{x}^i \right)^2 + \lambda \left\| \tilde{w} \right\|_1
\quad \text{where} \quad \left\| \tilde{w} \right\|_1 = \sum_{k=0}^d |w_k|.$$
This is equivalent to minimizing
$$\sum_{i=1}^N \left( y^i - \tilde{w}^T \tilde{x}^i \right)^2 \quad \text{s.t.} \quad \left\| \tilde{w} \right\|_1 \le t(\lambda).$$
Given $\{x^i, y^i\}_{i=1}^N$, this is also equivalent to taking the MAP estimate of $\tilde{w}$ associated with
$$p\left( y \mid x, w, \sigma^2 \right) = \mathcal{N}\left( y;\, y(x),\, \sigma^2 \right), \qquad
p\left( \tilde{w} \mid \sigma^2 \right) = \prod_{k=0}^d \mathrm{Lap}\left( w_k;\, 0,\, \sigma^2/\lambda \right).$$
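
Unlike ridge, the lasso has no closed-form solution, so in practice one typically calls an off-the-shelf solver. A minimal sketch using scikit-learn (an assumption of this illustration, not something used in the slides); note that sklearn's Lasso scales the squared-error term by 1/(2N), so its `alpha` is not numerically identical to the $\lambda$ above, and it leaves the intercept unpenalized, matching the flat prior on $w_0$:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first two of ten features are relevant.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1)   # larger alpha => more coefficients shrunk exactly to 0
model.fit(X, y)
print(model.intercept_, model.coef_)   # many coefficients come out exactly zero
```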

30 Regularization Path for Lasso
(Figure: profiles of the lasso coefficients — lcavol, lweight, age, lbph, svi, lcp, gleason, pgg — for an example on real data with $d = 8$, plotted against the bound on $\sum_{k=1}^d |w_k|$; a small $t(\lambda)$ means a large $\lambda$.)

31 Ridge versus Lasso
(Figure) Contours of the unregularized error function (blue) together with the constraint region: ridge (left) and lasso (right). The lasso gives a sparse solution, here with $w_1 = 0$. The corners of the diamond-shaped L1 constraint region are more likely to intersect the ellipse than its sides, as they stick out more.
