Linear Regression (9/11/13)


STA561: Probabilistic machine learning
Lecturer: Barbara Engelhardt
Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song

1 Why use linear regression?

Figure 1: Scatter plot showing the relation between two variables, x and y.

The general idea behind linear models is to capture how two observed variables are related in a linear way, assuming Gaussian noise. Given a fitted linear model (i.e., we have estimated the model parameters from training data), we may use this linear regression to:

- predict the value for y given a new observation x
- test if there is a linear relationship between two variables x and y
- when y is binary or multinomial, use the model to classify a new observation x

Later in the course, we will learn non-linear and non-Gaussian versions of this basic model.

Let's think about a bunch of possible relationships we might want to model:

- x = age and y = height
- x = distance from the shore and y = weight of clams found
- x = gestational age and y = birthweight

Once we have fitted a linear model, then, given a new observation x, we can predict the corresponding value of y. The model can also be used to test for associations between two variables: if there is no association between our variables, then the distribution of y should not depend conditionally on the value of x.

Figure 2: Scatter plot showing no relationship between two variables, x and y.

2 Definitions & model specification

- y ∈ R: response
- x ∈ R^p: predictors (also called covariates or explanatory variables)
- β ∈ R^p: regression coefficients (also called effects)

Our training set consists of observations of (x, y) for n samples. To fit the model, we will estimate the coefficients. Note that bold font for a random variable denotes a vector.

2.1 Linear model

Our linear model has the following form:

    y = x^T β + ε,                                                            (1)

where ε ~ N(0, σ²). Consequently, we have y | x, β, σ² ~ N(x^T β, σ²). Note that the distribution of x doesn't really matter, as long as y is normally distributed conditional on x.
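To make the generative model in (1) concrete, here is a minimal NumPy sketch that simulates data from y = x^T β + ε with Gaussian noise; the sample size, coefficients, and noise level are illustrative choices, not values from the notes.

import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 2                         # number of samples and predictors (illustrative)
beta = np.array([1.5, -0.7])          # "true" coefficients, chosen arbitrarily
sigma = 0.5                           # noise standard deviation (illustrative)

X = rng.normal(size=(n, p))           # the distribution of x does not matter for the model
eps = rng.normal(0.0, sigma, size=n)  # epsilon ~ N(0, sigma^2)
y = X @ beta + eps                    # y | x ~ N(x^T beta, sigma^2)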

To build our linear model, we must first define our data. Let our training data be D_train = {(x_1, y_1), ..., (x_n, y_n)}, where each x_i is a p × 1 column vector of p different predictors, and we have n observations of the predictors and the response. Then, using our knowledge of multivariate normals, we can expand (1) to

    (y_1, ..., y_n)^T ~ N( Xβ, diag_n(σ²) ),

where X is the n × p matrix whose i-th row is x_i^T. Our covariance matrix is diagonal with a single variance term σ² for all samples, and all covariance terms are zero, indicating independence across the n dimensions of this Gaussian (i.e., over the samples).

When our model is fitted, if we are given a new x*, we can predict the associated y* value; in particular, E[y* | x*] = x*^T β. Furthermore, if we take many samples with the same x-value, we should expect to see a normal distribution of y-values, centered around the mean that we've predicted.

Figure 3: Our linear regression fit to the data. For any new x*, we can use the model to predict the associated y* by projecting the point x* onto the red regression line and identifying the corresponding value of y.

2.2 What about offsets?

Our regression line will rarely go through the origin (a zero-year-old person is not zero inches tall), so we must also include an intercept term β_0. We accomplish this by adding a 1 as the first element of our x vector, and a corresponding β_0 as the first element of our β vector:

    x = [1, x_1, x_2, ..., x_p]^T.

Our model equation then takes the form y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_p x_p, where we can read off the value of the intercept from β_0.
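As a small illustration of the offset trick, one might prepend a column of ones to the design matrix so that the first coefficient plays the role of β_0; a sketch (the helper name is mine, not from the notes).

import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the first coefficient acts as beta_0."""
    return np.column_stack([np.ones(X.shape[0]), X])

# Usage (hypothetical): X1 = add_intercept(X); a fitted beta then has beta[0] as the intercept.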

3 Evaluating our model

3.1 Residuals

How much do the response values we predict using the fitted model differ from the actual response values? We first introduce some notation (univariate case):

    r_i = y_i − β̂ x_i = y_i − ŷ_i,

where β̂ is our trained β and ŷ_i is the predicted value. The terms r_i are called the residuals. For each data point, the error is the same as the residual: ε_i = y_i − β̂ x_i = r_i. Additionally, by our previous definition of the error term, E[r_i] = E[ε_i] = E[y_i − β̂ x_i] = 0. In other words, the expected residual is zero according to the Gaussian model.

3.2 Residual Sum of Squares (and friends!)

There are a number of metrics, based on the residual values, that quantify the fit of the model to the test-set or training-set data:

- Residual Sum of Squares (RSS): RSS(β̂, D) = Σ_{i=1}^n (y_i − β̂^T x_i)²
- Mean Squared Error (MSE): MSE = (1/n) RSS(β̂, D)
- Root-Mean-Square Error (RMSE): RMSE = √MSE

4 Parameter Estimation

Our data for linear regression are represented as D = {(x_1, y_1), ..., (x_n, y_n)}. For a univariate model (p = 1), where each x_i is a scalar value, the maximum likelihood estimate (MLE) of β, denoted β̂, can be found by taking the log likelihood, differentiating with respect to β, setting the derivative equal to zero, and solving for β:

    ℓ(β; D) = Σ_{i=1}^n log [ (1 / (2πσ²))^{1/2} exp{ −(1 / (2σ²)) (y_i − β x_i)² } ]
            = −(1 / (2σ²)) Σ_{i=1}^n (y_i − β x_i)² − (n/2) log(2πσ²)
            = −(1 / (2σ²)) RSS(β, D) + c,

where c is not a function of β and may be ignored because it is constant with respect to β. From this equation, we see that we can maximize the log likelihood by equivalently minimizing the residual sum of squares (RSS).
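The residual-based metrics in Section 3.2 translate directly into code. A minimal sketch, assuming X already carries the intercept column and beta_hat is some fitted coefficient vector (all names here are illustrative).

import numpy as np

def fit_metrics(X, y, beta_hat):
    """Residuals, RSS, MSE, and RMSE for a fitted coefficient vector."""
    residuals = y - X @ beta_hat        # r_i = y_i - beta_hat^T x_i
    rss = float(np.sum(residuals ** 2)) # residual sum of squares
    mse = rss / len(y)                  # mean squared error
    rmse = np.sqrt(mse)                 # root-mean-square error
    return rss, mse, rmse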

Let's return to the general multivariate case, i.e. p > 1, so each data point x_i is a p-dimensional vector. We derive the MLE for β as follows. Up to the constant factor 1/σ²,

    ∇_β ℓ(β; D) = ∇_β [ −(1/2) (y − Xβ)^T (y − Xβ) ] = X^T y − X^T Xβ.        (2)

Setting (2) to zero and solving for β, we have

    β̂_MLE = (X^T X)^{−1} X^T y.                                              (3)

Equation (3) is called the normal equation. This is a wonderful result: if X^T X is invertible, then we have an efficient way of computing the MLE for the regression coefficients (the cost of a matrix inversion). However, if X^T X is non-invertible (singular), then we cannot compute an estimate of β using the normal equation. One example of a singular X^T X occurs when the number of samples n is smaller than the number of predictors p.

4.1 Ridge regression

To regularize this problem, and to guard against singular matrices, we can place a prior distribution on β. One advantage of selecting a prior distribution for the regression coefficients is that it helps prevent distortions of the fit due to outliers. If we select a normal prior, the variance τ² of the coefficients reflects our confidence in the regression: a low value of τ² shrinks the coefficients more heavily toward zero, whereas a high value of τ² lets the data guide the selection of β more heavily. The prior is

    p(β | τ²) = ∏_{j=1}^p N(β_j | 0, τ²).

Figure 4: Distortion of the linear regression fit due to outliers.
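The normal equation (3) can be implemented in a few lines; in practice it is numerically safer to solve the linear system X^T X β = X^T y than to invert X^T X explicitly. A sketch with illustrative names:

import numpy as np

def ols_mle(X, y):
    """Least-squares / MLE coefficients via the normal equation (3).

    Solves X^T X beta = X^T y; raises LinAlgError if X^T X is singular,
    for example when n < p.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)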

We now calculate the log posterior distribution using Bayes' rule:

    log p(β | X, y, τ², σ²) = Σ_{i=1}^n log [ (1 / (2πσ²))^{1/2} exp{ −(1 / (2σ²)) (y_i − β^T x_i)² } ] + Σ_{j=1}^p log N(β_j | 0, τ²) + const
                            = −(1 / (2σ²)) Σ_{i=1}^n (y_i − β^T x_i)² − (1 / (2τ²)) Σ_{j=1}^p β_j² + const.    (4)

The solution β̂_MAP that maximizes Eqn. (4) is referred to as ridge regression:

    β̂_MAP = argmin_β  Σ_{i=1}^n (y_i − β^T x_i)² + λ Σ_{j=1}^p β_j²,

where the first term is (up to the factor 1/n) our MSE (see Section 3.2) and the second term is referred to as the penalty. Here λ can be regarded as a regularization parameter corresponding to σ²/τ² in Eq. (4). We compute the MAP estimate for β using the same strategy as for the normal equation and obtain

    β̂_MAP = (X^T X + λ I_p)^{−1} X^T y.

Notice that (X^T X + λ I_p)^{−1} exists because the matrix X^T X + λ I_p is generally invertible (i.e., non-singular). In general, a simple way to make a singular matrix invertible is to add a very small value to the diagonal, in order to (artificially) create a full-rank (and hence invertible) matrix.

5 Least mean squares (LMS) algorithm

Another way to solve for β is to directly minimize the least-squares cost function (the residual sum of squares):

    C(β) = (1/2) Σ_{i=1}^n (y_i − ŷ_i)²,
    ∂C(β)/∂β = −Σ_{i=1}^n (y_i − ŷ_i) x_i.

We can then use the steepest descent algorithm to find the estimate of β:

    β^(t+1) = β^(t) + ρ Σ_{i=1}^n (y_i − (β^(t))^T x_i) x_i,

where ρ is the step size. Note that the summation here runs from 1 to n, which means we must iterate through all n data instances before updating β^(t+1). This is called a batch method. However, it is often tempting to update our estimate as we collect each sample: in some cases n is too large for machines to process at once, or the matrix inversion in the normal equation may be difficult if there are too many features p. So we would like a way to process the data one sample at a time, which is called an online method. The LMS algorithm can be rewritten as

    β^(t+1) = β^(t) + ρ (y_i − (β^(t))^T x_i) x_i,

which is known as stochastic gradient descent. This means that at each iteration t we randomly pick a data sample (x_i, y_i) from the data set and update our estimate of β until convergence. This algorithm considers a single data point at a time, and is stochastic in how it selects a sample from the full data.
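Both the ridge (MAP) estimate and the LMS updates above are short to implement. A sketch of each in NumPy; the regularization strength λ, the step size ρ, and the iteration count are illustrative choices, not values from the notes.

import numpy as np

def ridge(X, y, lam):
    """MAP / ridge estimate: (X^T X + lambda * I_p)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lms_sgd(X, y, rho=0.01, n_iters=10_000, seed=0):
    """LMS via stochastic gradient descent: one randomly chosen sample per update."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        i = rng.integers(n)                        # pick one data point at random
        beta += rho * (y[i] - X[i] @ beta) * X[i]  # beta <- beta + rho * (y_i - beta^T x_i) x_i
    return beta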

5.1 Accounting for non-linearities

Imagine a model

    y_i = φ(x_i)^T β + ε,    with    φ(x) = [1, x, x², ..., x^p].

This would allow us to find non-linear relationships between x and y. We can also try combinations of x-values; for x_i = [x_1, x_2]:

    φ(x) = [x_1, x_2, x_1 x_2, x_1² x_2, ...].

The function φ(x) is called the basis function. It turns out that the regression model above works with any basis function. This is still a linear model because the relationship between the coefficients β and y is linear: regardless of the φ(·) term, the latent parameter β enters the model in a linear way. Furthermore, the log likelihood is convex with respect to β, and this is true for any φ(x). These non-linear basis functions will come up again when we talk about kernels, kernel methods, Gaussian processes, and adaptive basis methods.
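To illustrate the basis-function idea, here is a sketch of the polynomial basis φ(x) = [1, x, x², ..., x^d] for a scalar input; after the expansion, the same least-squares machinery applies unchanged (the degree and names are illustrative).

import numpy as np

def poly_basis(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree] for a 1-D input array x."""
    return np.column_stack([x ** k for k in range(degree + 1)])

# Usage (hypothetical): Phi = poly_basis(x, 3); beta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y).
# The model remains linear in beta even though it is non-linear in x.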