DS-GA 1002 Lecture notes 12, Fall 2016: Linear regression

1 Linear models

In statistics, regression consists of learning a function relating a certain quantity of interest y, the response or dependent variable, to several observed variables x_1, x_2, ..., x_p, known as covariates, features or independent variables. To ease notation, we store the features in a vector x ∈ R^p. For example, we might be interested in estimating the price of a house based on its extension, the number of rooms, the year it was built, etc.

The assumption in regression is that the response is generated by applying a function h to the features and then perturbing the result with some unknown noise z,

    y = h(x) + z.                                                          (1)

The aim is to learn h from n examples of responses and their corresponding features,

    (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)).                   (2)

In linear regression, we assume that h is a linear function, so that the response is well modeled as a linear combination of the predictors:

    y^(i) = (x^(i))^T β + z^(i),    1 ≤ i ≤ n,                             (3)

where z^(i) is an unknown perturbation. The function is parametrized by a vector of weights β ∈ R^p, which we would like to learn from the data. Combining the equations in (3), we obtain a matrix representation of the linear-regression problem,

    ⎡ y^(1) ⎤   ⎡ x^(1)_1  x^(1)_2  ⋯  x^(1)_p ⎤ ⎡ β_1 ⎤   ⎡ z^(1) ⎤
    ⎢ y^(2) ⎥ = ⎢ x^(2)_1  x^(2)_2  ⋯  x^(2)_p ⎥ ⎢ β_2 ⎥ + ⎢ z^(2) ⎥      (4)
    ⎢   ⋮   ⎥   ⎢    ⋮        ⋮     ⋱     ⋮    ⎥ ⎢  ⋮  ⎥   ⎢   ⋮   ⎥
    ⎣ y^(n) ⎦   ⎣ x^(n)_1  x^(n)_2  ⋯  x^(n)_p ⎦ ⎣ β_p ⎦   ⎣ z^(n) ⎦

Equivalently,

    y = X β + z,                                                           (5)

where X is an n × p matrix containing the features, y ∈ R^n contains the responses and z ∈ R^n represents the noise.
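The matrix form (5) translates directly into code. Below is a minimal NumPy sketch that generates synthetic data according to y = X β + z; the dimensions, the noise level and the random-number setup are illustrative choices, not taken from the notes.

    import numpy as np

    rng = np.random.default_rng(0)

    n, p = 100, 3                        # number of examples and features (illustrative values)
    beta_true = rng.normal(size=p)       # unknown weight vector
    X = rng.normal(size=(n, p))          # n x p feature matrix
    z = 0.1 * rng.normal(size=n)         # unknown perturbation (noise)

    # Response generated according to the linear model y = X beta + z, cf. (5).
    y = X @ beta_true + z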

Example 1.1 (Linear model for GDP). We consider the problem of building a linear model to predict the gross domestic product (GDP) of a US state from its population and unemployment rate. We have the following data for six states: the population, the unemployment rate (%), and the GDP (USD millions) of California, Minnesota, Oregon, Nevada, Idaho and Alaska. For South Carolina (population 4 774 839, unemployment rate 4.9 %) the GDP is unknown. The GDP is the response, whereas the population and the unemployment rate are the features. Our goal is to fit a linear model to the data so that we can predict the GDP of South Carolina.

Since the features are on different scales, we normalize them so that the columns of the regression matrix have unit norm. We also normalize the response, although this is not strictly necessary. We could also apply some additional preprocessing, such as centering the data. This yields

    y := [normalized GDP of the six states],    X := [normalized population and unemployment-rate columns].      (6)

If we find a β ∈ R^2 such that y ≈ X β, then we can estimate the GDP of South Carolina by computing (x_sc)^T β, where

    x_sc := [normalized features of South Carolina]                                                              (7)

corresponds to the population and unemployment rate of South Carolina divided by the constants used to normalize the columns of X.
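As an illustration of this preprocessing step, the following sketch normalizes the columns of a feature matrix to unit ℓ₂ norm and rescales the response; the arrays `features` and `gdp` contain made-up placeholder values, not the data from the table above.

    import numpy as np

    # Hypothetical raw data: one row per state, columns = (population, unemployment rate).
    features = np.array([[38_000_000, 8.9],
                         [ 5_400_000, 4.5],
                         [ 3_900_000, 6.2]], dtype=float)
    gdp = np.array([2.4e6, 3.3e5, 2.2e5])   # GDP in USD millions (made-up numbers)

    # Normalize each column of the regression matrix to unit l2 norm.
    col_norms = np.linalg.norm(features, axis=0)
    X = features / col_norms

    # Normalize the response as well (not strictly necessary).
    y = gdp / np.linalg.norm(gdp)

    # A new state's features must be divided by the same constants before prediction.
    x_new = np.array([4_800_000, 4.9]) / col_norms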

Figure 1: Linear model learnt via least-squares fitting for a simple example where there is just one feature (p = 1) and 4 examples (n = 4).

2 Least-squares estimation

To calibrate the linear-regression model, we need to estimate the weight vector so that it yields a good fit to the data. We can evaluate the fit for a specific choice of β ∈ R^p using the sum of the squares of the error,

    Σ_{i=1}^{n} ( y^(i) − (x^(i))^T β )² = ||y − X β||_2².                 (8)

The least-squares estimate β_LS is the vector of weights that minimizes this cost function,

    β_LS := arg min_β ||y − X β||_2.                                       (9)

The least-squares cost function is convenient from a computational point of view, since it is convex and can be minimized efficiently (in fact, as we will see in a moment, it has a closed-form solution). In addition, it has intuitive geometric and probabilistic interpretations. Figure 1 shows the linear model learnt using least squares in a simple example where there is just one feature (p = 1) and 4 examples (n = 4).
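In practice the least-squares estimate (9) can be computed with a standard solver; a minimal NumPy sketch, where the synthetic data are an illustrative assumption:

    import numpy as np

    def least_squares_fit(X, y):
        """Return the weight vector minimizing ||y - X beta||_2, as in (9)."""
        beta_ls, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
        return beta_ls

    # Example with synthetic data.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=100)

    beta_ls = least_squares_fit(X, y)   # should be close to beta_true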

Example 2.1 (Linear model for GDP, continued). Solving (9) for the normalized GDP data yields the least-squares estimate β_LS of the regression coefficients: the weight on the population is close to one, while the weight on the unemployment rate is close to zero. The model essentially learns that the GDP is roughly proportional to the population and that the unemployment rate does not help when fitting the data linearly. We now compare the fit provided by the linear model to the original data, as well as its prediction of the GDP of South Carolina:

[Table: GDP of each state and the corresponding estimate produced by the linear model, including the prediction for South Carolina.]

2.1 Geometric interpretation

The following theorem, proved in Section A of the appendix, shows that the least-squares problem has a closed-form solution.

Theorem 2.2 (Least-squares solution). For p ≤ n, if X is full rank, then the solution to the least-squares problem (9) is

    β_LS := (X^T X)^{-1} X^T y.                                            (11)

A corollary to this result provides a geometric interpretation of the least-squares estimate of y: it is obtained by projecting the response onto the column space of the matrix formed by the predictors.

Corollary 2.3. For p ≤ n, if X is full rank, then X β_LS is the projection of y onto the column space of X.

We provide a formal proof in Section B of the appendix, but the result is very intuitive. Any vector of the form X β lies in the span of the columns of X. By definition, the least-squares estimate is the closest vector to y that can be represented in this way, so it is the projection of y onto the column space of X. This is illustrated in Figure 2.
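The closed-form solution (11) and the projection property of Corollary 2.3 can be verified numerically. The sketch below compares (X^T X)^{-1} X^T y with the projection of y onto the column space of X computed from an orthonormal basis (obtained here via a QR factorization, an implementation choice that is not part of the notes).

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 50, 4
    X = rng.normal(size=(n, p))          # full rank with probability one
    y = rng.normal(size=n)

    # Closed-form least-squares solution, cf. (11).
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

    # Projection of y onto the column space of X via an orthonormal basis Q.
    Q, _ = np.linalg.qr(X)               # columns of Q span col(X)
    projection = Q @ (Q.T @ y)

    # X beta_LS equals the projection of y onto col(X) (Corollary 2.3).
    assert np.allclose(X @ beta_ls, projection)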

Figure 2: Illustration of Corollary 2.3. The least-squares solution is the projection of the data onto the subspace spanned by the columns of X, denoted by X_1 and X_2.

2.2 Probabilistic interpretation

If we model the noise in (5) as a realization of a random vector Z whose entries are independent Gaussian random variables with mean zero and a certain variance σ², then we can interpret the least-squares estimate as a maximum-likelihood estimate. Under that assumption, the data are a realization of the random vector

    Y := X β + Z,                                                          (12)

which is a Gaussian random vector with independent entries, mean X β and covariance matrix σ² I. The joint pdf of Y is equal to

    f_Y(a) = ∏_{i=1}^{n} (1 / √(2πσ²)) exp( −(a_i − (Xβ)_i)² / (2σ²) )     (13)

           = (1 / √((2π)^n σ^(2n))) exp( −||a − Xβ||_2² / (2σ²) ).         (14)

The likelihood is the probability density function of Y evaluated at the observed data y and interpreted as a function of the weight vector β,

    L_y(β) = (1 / √((2π)^n σ^(2n))) exp( −||y − Xβ||_2² / (2σ²) ).         (15)
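As a numerical sanity check of the equivalence derived in (16)-(19) below, the log-likelihood corresponding to (15) evaluated at β_LS should never be smaller than at any other choice of weights. This is only an illustrative sketch under the Gaussian noise assumption, with a known σ and synthetic data.

    import numpy as np

    def log_likelihood(beta, X, y, sigma=1.0):
        """Gaussian log-likelihood of the weights beta, cf. (15)."""
        n = len(y)
        residual = y - X @ beta
        return -0.5 * n * np.log(2 * np.pi * sigma**2) - residual @ residual / (2 * sigma**2)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(40, 3))
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=40)

    beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

    # The log-likelihood at beta_LS is never smaller than at randomly perturbed weights.
    for _ in range(100):
        beta_other = beta_ls + 0.1 * rng.normal(size=3)
        assert log_likelihood(beta_ls, X, y) >= log_likelihood(beta_other, X, y)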

To find the ML estimate, we maximize the log-likelihood. We conclude that it is given by the solution to the least-squares problem, since

    β_ML = arg max_β L_y(β)                                                (16)

         = arg max_β log L_y(β)                                            (17)

         = arg min_β ||y − X β||_2                                         (18)

         = β_LS.                                                           (19)

3 Overfitting

Imagine that a friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half, and it's perfect!" Your friend is not lying, but the problem is that she is using roughly as many data points to fit the linear model as there are parameters. If n ≤ p, we can find a β such that y = X β exactly, even if y and X have nothing to do with each other! This is called overfitting, and it is usually caused by using a model that is too flexible for the amount of data available.

To evaluate whether a model suffers from overfitting, we separate the data into a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate the error. A model that overfits the training set will have a very low error when evaluated on the training examples, but will not generalize well to the test examples.

Figure 3 shows the result of evaluating the training error and the test error of a linear model with a fixed number of parameters p fitted from n training examples. The training and test data are generated by fixing a vector of weights β and then computing

    y_train = X_train β + z_train,                                         (20)

    y_test  = X_test β,                                                    (21)

where the entries of X_train, X_test, z_train and β are sampled independently at random from a Gaussian distribution with zero mean and unit variance.

Figure 3: Relative ℓ₂-norm error in estimating the response achieved using least-squares regression for different values of n (the number of training data). The training error is plotted in blue, whereas the test error is plotted in red. The green line indicates the training error of the true model used to generate the data.

The training and test errors are defined as

    error_train = ||X_train β_LS − y_train||_2 / ||y_train||_2,            (22)

    error_test  = ||X_test β_LS − y_test||_2 / ||y_test||_2.               (23)

Note that even the true β does not achieve zero training error because of the presence of the noise, but the test error is exactly zero if we manage to estimate β exactly. The training error of the linear model grows with n. This makes sense, as the model has to fit more data using the same number of parameters. When n is close to p, the fitted model is much better than the true model at replicating the training data (the error of the true model is shown in green). This is a sign of overfitting: the model is adapting to the noise rather than learning the true linear structure. Indeed, in that regime the test error is extremely high. At larger n, the training error rises to the level achieved by the true linear model and the test error decreases, indicating that we are learning the underlying model.
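The experiment behind Figure 3 can be reproduced along the following lines. This is a sketch under the stated generative assumptions (Gaussian X, z and β); the number of parameters, the test-set size and the range of n are illustrative choices rather than the values used for the figure.

    import numpy as np

    rng = np.random.default_rng(4)
    p, n_test = 50, 1000                  # illustrative choices, not from the notes

    def relative_errors(n_train):
        """Training/test relative l2 errors of least squares, cf. (22)-(23)."""
        beta = rng.normal(size=p)
        X_train = rng.normal(size=(n_train, p))
        z_train = rng.normal(size=n_train)
        X_test = rng.normal(size=(n_test, p))

        y_train = X_train @ beta + z_train        # noisy training responses, cf. (20)
        y_test = X_test @ beta                    # noiseless test responses, cf. (21)

        beta_ls = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
        err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
        err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
        return err_train, err_test

    for n_train in [55, 100, 200, 400]:
        print(n_train, relative_errors(n_train))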

4 Global warming

In this section we describe an application of linear regression to climate data. In particular, we analyze temperature data taken at a weather station in Oxford over more than a century.¹ Our objective is not to perform prediction, but rather to determine whether temperatures have risen or fallen in Oxford over this period. In order to separate the temperature into different components that account for seasonal effects, we use a simple linear model with three predictors and an intercept,

    y_t ≈ β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12) + β_3 t,                 (24)

where 1 ≤ t ≤ n denotes the time in months (n equals 12 times the number of years of data). The corresponding matrix of predictors is

         ⎡ 1  cos(2πt_1/12)  sin(2πt_1/12)  t_1 ⎤
    X := ⎢ 1  cos(2πt_2/12)  sin(2πt_2/12)  t_2 ⎥                          (25)
         ⎢ ⋮        ⋮               ⋮        ⋮  ⎥
         ⎣ 1  cos(2πt_n/12)  sin(2πt_n/12)  t_n ⎦

The intercept β_0 represents the mean temperature, β_1 and β_2 account for the periodic yearly fluctuations, and β_3 is the overall trend. If β_3 is positive, the model indicates that temperatures are increasing; if it is negative, it indicates that temperatures are decreasing.

The results of fitting the linear model are shown in Figures 4 and 5. The fitted model indicates that both the maximum and minimum temperatures have an increasing trend of about 0.8 degrees Celsius (around 1.4 degrees Fahrenheit).

¹ The data are available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.
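A sketch of how the design matrix (25) and the fit could be assembled from a monthly temperature series; the `temperature` array below is a synthetic placeholder, since the actual Oxford data would have to be parsed from the Met Office file referenced in the footnote.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 12 * 30
    t = np.arange(1, n + 1)                       # time in months, 1 <= t <= n

    # Hypothetical monthly series (replace with the Oxford station data).
    temperature = 10 + 6 * np.cos(2 * np.pi * t / 12) + 0.001 * t + rng.normal(size=n)

    # Design matrix with intercept, yearly cosine/sine terms and a linear trend, cf. (25).
    X = np.column_stack([np.ones(n),
                         np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12),
                         t])

    beta_ls = np.linalg.lstsq(X, temperature, rcond=None)[0]
    beta0, beta1, beta2, beta3 = beta_ls
    # beta0: mean temperature, beta1/beta2: seasonal fluctuation, beta3: overall trend.
    print("trend per month:", beta3, "-> per century:", beta3 * 12 * 100)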

Figure 4: Temperature data together with the linear model described by (24), for both the maximum and minimum temperatures.

Figure 5: Temperature trend obtained by fitting the model described by (24), for both the maximum and minimum temperatures (fitted trends: +0.7 °C and +0.88 °C).

A Proof of Theorem 2.2

Let X = U Σ V^T be the singular-value decomposition (SVD) of X. Under the conditions of the theorem,

    (X^T X)^{-1} X^T y = V Σ^{-1} U^T y.

We begin by separating y into two components,

    y = U U^T y + (I − U U^T) y,                                           (26)

where U U^T y is the projection of y onto the column space of X. Note that (I − U U^T) y is orthogonal to the column space of X, and consequently to both U U^T y and X β for any β. By Pythagoras's theorem,

    ||y − X β||_2² = ||(I − U U^T) y||_2² + ||U U^T y − X β||_2².          (27)

The minimum value of this cost function that can be achieved by optimizing over β is ||(I − U U^T) y||_2², which is attained by solving the system of equations

    U U^T y = X β = U Σ V^T β.                                             (28)

Since U^T U = I because p ≤ n, multiplying both sides of the equality by U^T yields the equivalent system

    U^T y = Σ V^T β.                                                       (29)

Since X is full rank, Σ and V are square and invertible (and, by definition of the SVD, V^{-1} = V^T), so

    β_LS = V Σ^{-1} U^T y                                                  (30)

is the unique solution to the system, and consequently also the unique solution of the least-squares problem.

B Proof of Corollary 2.3

Let X = U Σ V^T be the singular-value decomposition of X. Since X is full rank and p ≤ n, we have U^T U = I, V^T V = I and Σ is a square invertible matrix, which implies

    X β_LS = X (X^T X)^{-1} X^T y                                          (31)

           = U Σ V^T (V Σ U^T U Σ V^T)^{-1} V Σ U^T y                      (32)

           = U U^T y.                                                      (33)
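Both SVD identities used in the appendix, β_LS = V Σ^{-1} U^T y and X β_LS = U U^T y, can be checked numerically on random full-rank data; a minimal sketch:

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 30, 5
    X = rng.normal(size=(n, p))          # full rank with probability one
    y = rng.normal(size=n)

    # Thin SVD: X = U S V^T with U of size n x p, S of size p, V^T of size p x p.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    beta_svd = Vt.T @ ((U.T @ y) / s)                 # V Sigma^{-1} U^T y, cf. (30)
    beta_normal = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y, cf. (11)

    assert np.allclose(beta_svd, beta_normal)         # solution used in the proof of Theorem 2.2
    assert np.allclose(X @ beta_svd, U @ (U.T @ y))   # X beta_LS = U U^T y, Corollary 2.3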