DS-GA 1002 Lecture notes 12, Fall 2016: Linear regression

1 Linear models

In statistics, regression consists of learning a function relating a certain quantity of interest y, the response or dependent variable, to several observed variables x_1, x_2, ..., x_p, known as covariates, features or independent variables. To ease notation, we store the features in a vector x ∈ R^p. For example, we might be interested in estimating the price of a house based on its extension, the number of rooms, the year it was built, etc.

The assumption in regression is that the response is generated by applying a function h to the features and then perturbing the result with some unknown noise z,

    y = h(x) + z.                                                          (1)

The aim is to learn h from n examples of responses and their corresponding features,

    (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)).                   (2)

In linear regression, we assume that h is a linear function, so that the response is well modeled as a linear combination of the predictors:

    y^(i) = (x^(i))^T β + z^(i),    1 ≤ i ≤ n,                             (3)

where z^(i) is an unknown perturbation. The function is parametrized by a vector of weights β ∈ R^p, which we would like to learn from the data. Combining the equations in (3), we obtain a matrix representation of the linear-regression problem,

    ⎡ y^(1) ⎤   ⎡ x^(1)_1  x^(1)_2  ⋯  x^(1)_p ⎤ ⎡ β_1 ⎤   ⎡ z^(1) ⎤
    ⎢ y^(2) ⎥ = ⎢ x^(2)_1  x^(2)_2  ⋯  x^(2)_p ⎥ ⎢ β_2 ⎥ + ⎢ z^(2) ⎥      (4)
    ⎢   ⋮   ⎥   ⎢    ⋮        ⋮     ⋱     ⋮    ⎥ ⎢  ⋮  ⎥   ⎢   ⋮   ⎥
    ⎣ y^(n) ⎦   ⎣ x^(n)_1  x^(n)_2  ⋯  x^(n)_p ⎦ ⎣ β_p ⎦   ⎣ z^(n) ⎦

Equivalently,

    y = X β + z,                                                           (5)

where X is an n × p matrix containing the features, y ∈ R^n contains the responses and z ∈ R^n represents the noise.
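The matrix form (5) translates directly into code. Below is a minimal NumPy sketch that generates synthetic data according to y = X β + z; the dimensions, the noise level and the random-number setup are illustrative choices, not taken from the notes.

    import numpy as np

    rng = np.random.default_rng(0)

    n, p = 100, 3                        # number of examples and features (illustrative values)
    beta_true = rng.normal(size=p)       # unknown weight vector
    X = rng.normal(size=(n, p))          # n x p feature matrix
    z = 0.1 * rng.normal(size=n)         # unknown perturbation (noise)

    # Response generated according to the linear model y = X beta + z, cf. (5).
    y = X @ beta_true + z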

Example 1.1 (Linear model for GDP). We consider the problem of building a linear model to predict the gross domestic product (GDP) of a US state from its population and unemployment rate. We have the following data for six states: the population, the unemployment rate (%), and the GDP (USD millions) of California, Minnesota, Oregon, Nevada, Idaho and Alaska. For South Carolina (population 4 774 839, unemployment rate 4.9 %) the GDP is unknown. The GDP is the response, whereas the population and the unemployment rate are the features. Our goal is to fit a linear model to the data so that we can predict the GDP of South Carolina.

Since the features are on different scales, we normalize them so that the columns of the regression matrix have unit norm. We also normalize the response, although this is not strictly necessary. We could also apply some additional preprocessing, such as centering the data. This yields

    y := [normalized GDP of the six states],    X := [normalized population and unemployment-rate columns].      (6)

If we find a β ∈ R^2 such that y ≈ X β, then we can estimate the GDP of South Carolina by computing (x_sc)^T β, where

    x_sc := [normalized features of South Carolina]                                                              (7)

corresponds to the population and unemployment rate of South Carolina divided by the constants used to normalize the columns of X.
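As an illustration of this preprocessing step, the following sketch normalizes the columns of a feature matrix to unit ℓ₂ norm and rescales the response; the arrays `features` and `gdp` contain made-up placeholder values, not the data from the table above.

    import numpy as np

    # Hypothetical raw data: one row per state, columns = (population, unemployment rate).
    features = np.array([[38_000_000, 8.9],
                         [ 5_400_000, 4.5],
                         [ 3_900_000, 6.2]], dtype=float)
    gdp = np.array([2.4e6, 3.3e5, 2.2e5])   # GDP in USD millions (made-up numbers)

    # Normalize each column of the regression matrix to unit l2 norm.
    col_norms = np.linalg.norm(features, axis=0)
    X = features / col_norms

    # Normalize the response as well (not strictly necessary).
    y = gdp / np.linalg.norm(gdp)

    # A new state's features must be divided by the same constants before prediction.
    x_new = np.array([4_800_000, 4.9]) / col_norms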

Figure 1: Linear model learnt via least-squares fitting for a simple example where there is just one feature (p = 1) and 4 examples (n = 4).

2 Least-squares estimation

To calibrate the linear-regression model, we need to estimate the weight vector so that it yields a good fit to the data. We can evaluate the fit for a specific choice of β ∈ R^p using the sum of the squares of the error,

    Σ_{i=1}^{n} ( y^(i) − (x^(i))^T β )² = ||y − X β||_2².                 (8)

The least-squares estimate β_LS is the vector of weights that minimizes this cost function,

    β_LS := arg min_β ||y − X β||_2.                                       (9)

The least-squares cost function is convenient from a computational point of view, since it is convex and can be minimized efficiently (in fact, as we will see in a moment, it has a closed-form solution). In addition, it has intuitive geometric and probabilistic interpretations. Figure 1 shows the linear model learnt using least squares in a simple example where there is just one feature (p = 1) and 4 examples (n = 4).
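In practice the least-squares estimate (9) can be computed with a standard solver; a minimal NumPy sketch, where the synthetic data are an illustrative assumption:

    import numpy as np

    def least_squares_fit(X, y):
        """Return the weight vector minimizing ||y - X beta||_2, as in (9)."""
        beta_ls, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
        return beta_ls

    # Example with synthetic data.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=100)

    beta_ls = least_squares_fit(X, y)   # should be close to beta_true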

Example 2.1 (Linear model for GDP, continued). Solving (9) for the normalized GDP data yields the least-squares estimate β_LS of the regression coefficients: the weight on the population is close to one, while the weight on the unemployment rate is close to zero. The model essentially learns that the GDP is roughly proportional to the population and that the unemployment rate does not help when fitting the data linearly. We now compare the fit provided by the linear model to the original data, as well as its prediction of the GDP of South Carolina:

[Table: GDP of each state and the corresponding estimate produced by the linear model, including the prediction for South Carolina.]

2.1 Geometric interpretation

The following theorem, proved in Section A of the appendix, shows that the least-squares problem has a closed-form solution.

Theorem 2.2 (Least-squares solution). For p ≤ n, if X is full rank, then the solution to the least-squares problem (9) is

    β_LS := (X^T X)^{-1} X^T y.                                            (11)

A corollary to this result provides a geometric interpretation of the least-squares estimate of y: it is obtained by projecting the response onto the column space of the matrix formed by the predictors.

Corollary 2.3. For p ≤ n, if X is full rank, then X β_LS is the projection of y onto the column space of X.

We provide a formal proof in Section B of the appendix, but the result is very intuitive. Any vector of the form X β lies in the span of the columns of X. By definition, the least-squares estimate is the closest vector to y that can be represented in this way, so it is the projection of y onto the column space of X. This is illustrated in Figure 2.
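The closed-form solution (11) and the projection property of Corollary 2.3 can be verified numerically. The sketch below compares (X^T X)^{-1} X^T y with the projection of y onto the column space of X computed from an orthonormal basis (obtained here via a QR factorization, an implementation choice that is not part of the notes).

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 50, 4
    X = rng.normal(size=(n, p))          # full rank with probability one
    y = rng.normal(size=n)

    # Closed-form least-squares solution, cf. (11).
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

    # Projection of y onto the column space of X via an orthonormal basis Q.
    Q, _ = np.linalg.qr(X)               # columns of Q span col(X)
    projection = Q @ (Q.T @ y)

    # X beta_LS equals the projection of y onto col(X) (Corollary 2.3).
    assert np.allclose(X @ beta_ls, projection)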

Figure 2: Illustration of Corollary 2.3. The least-squares solution is the projection of the data onto the subspace spanned by the columns of X, denoted by X_1 and X_2.

2.2 Probabilistic interpretation

If we model the noise in (5) as a realization of a random vector Z whose entries are independent Gaussian random variables with mean zero and a certain variance σ², then we can interpret the least-squares estimate as a maximum-likelihood estimate. Under that assumption, the data are a realization of the random vector

    Y := X β + Z,                                                          (12)

which is a Gaussian random vector with independent entries, mean X β and covariance matrix σ² I. The joint pdf of Y is equal to

    f_Y(a) = ∏_{i=1}^{n} (1 / √(2πσ²)) exp( −(a_i − (Xβ)_i)² / (2σ²) )     (13)

           = (1 / √((2π)^n σ^(2n))) exp( −||a − Xβ||_2² / (2σ²) ).         (14)

The likelihood is the probability density function of Y evaluated at the observed data y and interpreted as a function of the weight vector β,

    L_y(β) = (1 / √((2π)^n σ^(2n))) exp( −||y − Xβ||_2² / (2σ²) ).         (15)
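As a numerical sanity check of the equivalence derived in (16)-(19) below, the log-likelihood corresponding to (15) evaluated at β_LS should never be smaller than at any other choice of weights. This is only an illustrative sketch under the Gaussian noise assumption, with a known σ and synthetic data.

    import numpy as np

    def log_likelihood(beta, X, y, sigma=1.0):
        """Gaussian log-likelihood of the weights beta, cf. (15)."""
        n = len(y)
        residual = y - X @ beta
        return -0.5 * n * np.log(2 * np.pi * sigma**2) - residual @ residual / (2 * sigma**2)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(40, 3))
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=40)

    beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

    # The log-likelihood at beta_LS is never smaller than at randomly perturbed weights.
    for _ in range(100):
        beta_other = beta_ls + 0.1 * rng.normal(size=3)
        assert log_likelihood(beta_ls, X, y) >= log_likelihood(beta_other, X, y)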

To find the ML estimate, we maximize the log-likelihood. We conclude that it is given by the solution to the least-squares problem, since

    β_ML = arg max_β L_y(β)                                                (16)

         = arg max_β log L_y(β)                                            (17)

         = arg min_β ||y − X β||_2                                         (18)

         = β_LS.                                                           (19)

3 Overfitting

Imagine that a friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half, and it's perfect!" Your friend is not lying, but the problem is that she is using roughly as many data points to fit the linear model as there are parameters. If n ≤ p, we can find a β such that y = X β exactly, even if y and X have nothing to do with each other! This is called overfitting, and it is usually caused by using a model that is too flexible for the amount of data available.

To evaluate whether a model suffers from overfitting, we separate the data into a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate the error. A model that overfits the training set will have a very low error when evaluated on the training examples, but will not generalize well to the test examples.

Figure 3 shows the result of evaluating the training error and the test error of a linear model with a fixed number of parameters p fitted from n training examples. The training and test data are generated by fixing a vector of weights β and then computing

    y_train = X_train β + z_train,                                         (20)

    y_test  = X_test β,                                                    (21)

where the entries of X_train, X_test, z_train and β are sampled independently at random from a Gaussian distribution with zero mean and unit variance.

Figure 3: Relative ℓ₂-norm error in estimating the response achieved using least-squares regression for different values of n (the number of training data). The training error is plotted in blue, whereas the test error is plotted in red. The green line indicates the training error of the true model used to generate the data.

The training and test errors are defined as

    error_train = ||X_train β_LS − y_train||_2 / ||y_train||_2,            (22)

    error_test  = ||X_test β_LS − y_test||_2 / ||y_test||_2.               (23)

Note that even the true β does not achieve zero training error because of the presence of the noise, but the test error is exactly zero if we manage to estimate β exactly. The training error of the linear model grows with n. This makes sense, as the model has to fit more data using the same number of parameters. When n is close to p, the fitted model is much better than the true model at replicating the training data (the error of the true model is shown in green). This is a sign of overfitting: the model is adapting to the noise rather than learning the true linear structure. Indeed, in that regime the test error is extremely high. At larger n, the training error rises to the level achieved by the true linear model and the test error decreases, indicating that we are learning the underlying model.
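The experiment behind Figure 3 can be reproduced along the following lines. This is a sketch under the stated generative assumptions (Gaussian X, z and β); the number of parameters, the test-set size and the range of n are illustrative choices rather than the values used for the figure.

    import numpy as np

    rng = np.random.default_rng(4)
    p, n_test = 50, 1000                  # illustrative choices, not from the notes

    def relative_errors(n_train):
        """Training/test relative l2 errors of least squares, cf. (22)-(23)."""
        beta = rng.normal(size=p)
        X_train = rng.normal(size=(n_train, p))
        z_train = rng.normal(size=n_train)
        X_test = rng.normal(size=(n_test, p))

        y_train = X_train @ beta + z_train        # noisy training responses, cf. (20)
        y_test = X_test @ beta                    # noiseless test responses, cf. (21)

        beta_ls = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
        err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
        err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
        return err_train, err_test

    for n_train in [55, 100, 200, 400]:
        print(n_train, relative_errors(n_train))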

4 Global warming

In this section we describe an application of linear regression to climate data. In particular, we analyze temperature data taken at a weather station in Oxford over more than a century.¹ Our objective is not to perform prediction, but rather to determine whether temperatures have risen or fallen in Oxford over this period. In order to separate the temperature into different components that account for seasonal effects, we use a simple linear model with three predictors and an intercept,

    y_t ≈ β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12) + β_3 t,                 (24)

where 1 ≤ t ≤ n denotes the time in months (n equals 12 times the number of years of data). The corresponding matrix of predictors is

         ⎡ 1  cos(2πt_1/12)  sin(2πt_1/12)  t_1 ⎤
    X := ⎢ 1  cos(2πt_2/12)  sin(2πt_2/12)  t_2 ⎥                          (25)
         ⎢ ⋮        ⋮               ⋮        ⋮  ⎥
         ⎣ 1  cos(2πt_n/12)  sin(2πt_n/12)  t_n ⎦

The intercept β_0 represents the mean temperature, β_1 and β_2 account for the periodic yearly fluctuations, and β_3 is the overall trend. If β_3 is positive, the model indicates that temperatures are increasing; if it is negative, it indicates that temperatures are decreasing.

The results of fitting the linear model are shown in Figures 4 and 5. The fitted model indicates that both the maximum and minimum temperatures have an increasing trend of about 0.8 degrees Celsius (around 1.4 degrees Fahrenheit).

¹ The data are available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.
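A sketch of how the design matrix (25) and the fit could be assembled from a monthly temperature series; the `temperature` array below is a synthetic placeholder, since the actual Oxford data would have to be parsed from the Met Office file referenced in the footnote.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 12 * 30
    t = np.arange(1, n + 1)                       # time in months, 1 <= t <= n

    # Hypothetical monthly series (replace with the Oxford station data).
    temperature = 10 + 6 * np.cos(2 * np.pi * t / 12) + 0.001 * t + rng.normal(size=n)

    # Design matrix with intercept, yearly cosine/sine terms and a linear trend, cf. (25).
    X = np.column_stack([np.ones(n),
                         np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12),
                         t])

    beta_ls = np.linalg.lstsq(X, temperature, rcond=None)[0]
    beta0, beta1, beta2, beta3 = beta_ls
    # beta0: mean temperature, beta1/beta2: seasonal fluctuation, beta3: overall trend.
    print("trend per month:", beta3, "-> per century:", beta3 * 12 * 100)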

Figure 4: Temperature data together with the linear model described by (24), for both the maximum and minimum temperatures.

Figure 5: Temperature trend obtained by fitting the model described by (24), for both the maximum and minimum temperatures (fitted trends: +0.7 °C and +0.88 °C).

A Proof of Theorem 2.2

Let X = U Σ V^T be the singular-value decomposition (SVD) of X. Under the conditions of the theorem,

    (X^T X)^{-1} X^T y = V Σ^{-1} U^T y.

We begin by separating y into two components,

    y = U U^T y + (I − U U^T) y,                                           (26)

where U U^T y is the projection of y onto the column space of X. Note that (I − U U^T) y is orthogonal to the column space of X, and consequently to both U U^T y and X β for any β. By Pythagoras's theorem,

    ||y − X β||_2² = ||(I − U U^T) y||_2² + ||U U^T y − X β||_2².          (27)

The minimum value of this cost function that can be achieved by optimizing over β is ||(I − U U^T) y||_2², which is attained by solving the system of equations

    U U^T y = X β = U Σ V^T β.                                             (28)

Since U^T U = I because p ≤ n, multiplying both sides of the equality by U^T yields the equivalent system

    U^T y = Σ V^T β.                                                       (29)

Since X is full rank, Σ and V are square and invertible (and, by definition of the SVD, V^{-1} = V^T), so

    β_LS = V Σ^{-1} U^T y                                                  (30)

is the unique solution to the system, and consequently also the unique solution of the least-squares problem.

B Proof of Corollary 2.3

Let X = U Σ V^T be the singular-value decomposition of X. Since X is full rank and p ≤ n, we have U^T U = I, V^T V = I and Σ is a square invertible matrix, which implies

    X β_LS = X (X^T X)^{-1} X^T y                                          (31)

           = U Σ V^T (V Σ U^T U Σ V^T)^{-1} V Σ U^T y                      (32)

           = U U^T y.                                                      (33)
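Both SVD identities used in the appendix, β_LS = V Σ^{-1} U^T y and X β_LS = U U^T y, can be checked numerically on random full-rank data; a minimal sketch:

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 30, 5
    X = rng.normal(size=(n, p))          # full rank with probability one
    y = rng.normal(size=n)

    # Thin SVD: X = U S V^T with U of size n x p, S of size p, V^T of size p x p.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    beta_svd = Vt.T @ ((U.T @ y) / s)                 # V Sigma^{-1} U^T y, cf. (30)
    beta_normal = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y, cf. (11)

    assert np.allclose(beta_svd, beta_normal)         # solution used in the proof of Theorem 2.2
    assert np.allclose(X @ beta_svd, U @ (U.T @ y))   # X beta_LS = U U^T y, Corollary 2.3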