COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

SEAN GERRISH AND CHONG WANG

1. WAYS OF ORGANIZING MODELS

In probabilistic modeling, there are several ways of organizing models:

(1) Bayesian vs. Frequentist.
(2) Discriminative vs. Generative.
    (a) Discriminative: we condition on some of the information, e.g. regression models and classification models.
    (b) Generative: we fit a probability distribution to every part of the data, e.g. clustering and naive Bayes classification.
    Is the discriminative approach better, or the generative one? It is a common myth that one of the two is always the appropriate choice; neither dominates in general.
(3) Per-data-point prediction vs. data-set density estimation.
(4) Supervised vs. Unsupervised models.
    (a) Supervised: given {(x_i, y_i)}_{i=1}^N in training, predict y given x at test time (e.g. classification).
    (b) Unsupervised: given data, we seek to find structure in it; clustering is an example.

All of these models involve (1) treating observations as random variables in a probability distribution, and (2) computing something about that distribution.

2. LINEAR REGRESSION

In this section, we introduce the basic idea of linear regression and then study how to fit a linear regression model.

2.1. Overview of linear regression. Linear regression is a method for predicting a real-valued response y from covariates x using a linear model; Figure 1 shows an example. Usually we have multiple covariates x = (x_1, x_2, ..., x_p), where p is the number of covariates.

FIGURE 1. Linear regression. The +'s are data points and the dashed line is the output of fitting the linear regression.

In linear regression, we fit a linear function of the covariates,

(1)   f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i = \beta_0 + \beta^T x.

Note that \beta^T x = 0 is a hyperplane. Many candidate features can be used as the input x:

(1) any raw numeric data;
(2) any transformation, e.g. x_2 = \log x_1 and x_3 = \sqrt{x_1};
(3) basis expansions, e.g. x_2 = x_1^2 and x_3 = x_1^3;
(4) indicator functions of qualitative inputs, e.g. 1[the subject has brown hair]; and
(5) interactions between other covariates, e.g. x_3 = x_1 x_2.

2.2. Fitting a linear regression. Suppose we have a dataset D = {(x_n, y_n)}_{n=1}^N. In the simplest form of linear regression, we assume \beta_0 = 0 and p = 1, so the function to be fitted is just

(2)   f(x) = \beta x.

To fit a linear regression in this simplified setting, we minimize the sum of squared distances between the fitted values and the observed responses. The objective function (using squared Euclidean distance) is

(3)   RSS(\beta) = \frac{1}{2} \sum_{n=1}^{N} (y_n - \beta x_n)^2,

where RSS stands for Residual Sum of Squares; Figure 2 illustrates this. To minimize RSS(\beta), we take its derivative,

(4)   \frac{d\,RSS(\beta)}{d\beta} = -\sum_{n=1}^{N} (y_n - \beta x_n) x_n.
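To make Equations (3) and (4) concrete, here is a minimal numerical sketch (Python with numpy; the synthetic data and the true slope of 2.0 are illustrative assumptions, not part of the notes). It evaluates RSS(β) and its derivative for the simple one-covariate model and confirms that the derivative is approximately zero at the minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for the simplest model: p = 1, no intercept (beta_0 = 0).
N = 50
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.5, size=N)  # assumed true slope of 2.0

def rss(beta):
    """Equation (3): 0.5 * sum of squared residuals."""
    return 0.5 * np.sum((y - beta * x) ** 2)

def rss_deriv(beta):
    """Equation (4): d RSS / d beta = -sum_n (y_n - beta x_n) x_n."""
    return -np.sum((y - beta * x) * x)

betas = np.linspace(0.0, 4.0, 401)
curve = np.array([rss(b) for b in betas])
best = betas[np.argmin(curve)]

print("grid minimizer  :", best)
print("derivative there:", rss_deriv(best))  # approximately zero at the minimum
```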

FIGURE 2. Linear regression. The +'s are data points and the dashed line is the output of fitting the linear regression.

Since RSS(\beta) is convex, setting Equation (4) to zero and solving for β̂ leads to a closed-form expression for the optimal solution,

(5)   \hat{\beta} = \frac{\sum_{n=1}^{N} y_n x_n}{\sum_{n=1}^{N} x_n^2}.

When we have a new input x_new, the prediction is simply ŷ_new = β̂ x_new.

We can generalize Equation (2) by allowing for a constant offset in the predictions:

y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i.

Note that solving for β as above does not determine the fixed offset β_0. We can get around this by setting β_{p+1} = β_0 and x_{p+1} = 1. Then we have

y = \beta^T x.

As noted above, the RSS gives a sense of how accurate our estimate is, and in many situations (such as our current one) we are interested in minimizing it. One approach to finding the minimum is gradient descent. Equation (4) generalizes to the gradient of our objective function,

(6)   \nabla_{\beta} RSS(\beta) = -\sum_{n=1}^{N} (y_n - \beta^T x_n) x_n.

With a convex objective such as this RSS, the gradient is generally sufficient: we could hand it to a black-box gradient descent routine and obtain a solution relatively efficiently. In this case, as noted in Equation (5), there fortunately also exists an explicit solution.
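As a sketch of the black-box approach just described (not from the notes): plain gradient descent on Equation (6), using the offset trick of appending a constant 1 to each x_n. The step size, iteration count, true parameters, and synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with p = 2 covariates plus an appended constant 1 for the offset.
N, p = 200, 2
X_raw = rng.normal(size=(N, p))
X = np.hstack([X_raw, np.ones((N, 1))])          # offset trick: x_{p+1} = 1
beta_true = np.array([1.5, -0.7, 0.3])           # assumed true parameters
y = X @ beta_true + rng.normal(scale=0.1, size=N)

def rss_gradient(beta):
    """Equation (6): -sum_n (y_n - beta^T x_n) x_n."""
    return -X.T @ (y - X @ beta)

beta = np.zeros(p + 1)
step = 1e-3                                      # illustrative step size
for _ in range(5000):
    beta -= step * rss_gradient(beta)

print("gradient descent:", beta)
print("true parameters :", beta_true)
```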

2.3. Concise form of the exact solution. We can describe the exact solution more concisely with a bit of linear algebra. The design matrix X contains a set of n observations in p dimensions; using the constant-offset trick above, we also append a 1 to each row of X:

(7)   X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} & 1 \\ x_{2,1} & \cdots & x_{2,p} & 1 \\ \vdots & & \vdots & \vdots \\ x_{n,1} & \cdots & x_{n,p} & 1 \end{pmatrix}.

The response vector Y contains the corresponding labels for these observations:

(8)   Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.

Combining Equations (6), (7), and (8), we can write the gradient with respect to β concisely as

\nabla_{\beta} RSS(\beta) = -X^T (Y - X\beta).

Setting this to zero and solving for β, we have

(9)    X^T (Y - X\hat{\beta}) = 0
(10)   \implies X^T X \hat{\beta} = X^T Y
(11)   \implies \hat{\beta} = (X^T X)^{-1} X^T Y.

We observe a couple of things about the equations above. First, Equations (9)-(11) are sometimes referred to as the normal equations. Second, the matrix X^T X is invertible as long as X has rank p + 1, which requires that our covariates not be linearly dependent.
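A minimal sketch of Equation (11) in code (Python with numpy; the same illustrative data setup as the gradient descent sketch above, which is an assumption, not the lecture's example). Solving the linear system X^T X β = X^T Y is numerically preferable to forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: n observations, p covariates, appended column of ones.
n, p = 200, 2
X = np.hstack([rng.normal(size=(n, p)), np.ones((n, 1))])
beta_true = np.array([1.5, -0.7, 0.3])           # assumed true parameters
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Equation (11): beta_hat = (X^T X)^{-1} X^T Y, computed by solving the
# normal equations X^T X beta = X^T Y rather than inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print("beta_hat:", beta_hat)

# Prediction for a new input (remember to append the constant 1).
x_new = np.array([0.5, -1.0, 1.0])
print("y_new prediction:", x_new @ beta_hat)
```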

3. PROBABILISTIC INTERPRETATION

We can also frame linear regression using the probabilistic tools we have developed so far. When fitting the model, we have access to the observations (x_n, y_n) and seek to determine β. When predicting, our goal is to determine y_n given x_n and β:

y_n \sim \mathcal{N}(\beta^T x_n, \sigma^2),

or, equivalently,

(1) draw ε ~ N(0, σ²);
(2) set y_n = β^T x_n + ε.

FIGURE 3. Linear regression as a graphical model: (a) we first fit the parameter β; (b) we then predict values y_n given a set of covariates x_n.

Note that we can interpret this as a discriminative model, because we always condition on x_n. Because of this, we do not need to specify the distribution of x_n in the model. Figure 3 illustrates this process.

Given this model, our goal is to find the conditional MLE β̂. The likelihood of β given our training data x and y is

l(\beta \mid x_{1:N}, y_{1:N}) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - \beta^T x_n)^2}{2\sigma^2} \right).

Taking the logarithm and maximizing the log likelihood, we get

\beta_{MLE} := \arg\max_{\beta} \left( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \beta^T x_n)^2 \right) = \arg\min_{\beta} \; \frac{1}{2} \sum_{n=1}^{N} (y_n - \beta^T x_n)^2,

which is exactly the objective function we found in our earlier treatment of the problem.
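As an illustration of this equivalence (a sketch, not part of the notes): simulate data from the generative process above, then check that the least-squares solution from the normal equations attains a higher Gaussian log likelihood than nearby alternatives. The true parameters and the noise level σ = 0.5 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Generative process: draw eps ~ N(0, sigma^2), set y_n = beta^T x_n + eps.
N, p = 300, 3
sigma = 0.5                                   # assumed noise level
beta_true = np.array([2.0, -1.0, 0.5])        # assumed true parameters
X = rng.normal(size=(N, p))
y = X @ beta_true + rng.normal(scale=sigma, size=N)

def log_likelihood(beta):
    """Gaussian log likelihood of beta given the training data."""
    resid = y - X @ beta
    return -0.5 * N * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

# Conditional MLE via the least-squares normal equations (Equation 11).
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# The least-squares solution has higher likelihood than perturbed alternatives.
for _ in range(3):
    other = beta_mle + rng.normal(scale=0.1, size=p)
    assert log_likelihood(beta_mle) >= log_likelihood(other)

print("beta_mle:", beta_mle)
```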

4. PREDICTION FROM EXPECTATION

4.1. The Bias-Variance Tradeoff. In many statistical applications, it is useful to understand how close our estimate β̂, computed from the empirical distribution, is to the true parameter β of the underlying distribution. Such questions are central to countless applications of statistics, motivating metrics such as the standard error that are used in nearly every branch of science.

To formalize this a bit more, consider a dataset of N i.i.d. observations {(x_i, y_i)}_{i=1}^N drawn randomly from some true distribution D. The MLE β̂_MLE, which we compute from these observations, is then itself a random variable, with its randomness arising from the fact that the observations are drawn from D. Given an unseen datum (x_new, y_new) ~ D, we seek to determine how close E[y_new | x_new, β̂_MLE] is to E[y_new | x_new, β] = β^T x_new.

The mean squared error (MSE) is one measure of how close our estimator β̂_MLE is to the truth. We can decompose the MSE into a variance term and a bias term:

MSE = E_D[(\hat{\beta}X - \beta X)^2]
    = E_D[(\hat{\beta}X)^2] - 2\,E_D[\hat{\beta}X]\,\beta X + (\beta X)^2
    = E_D[(\hat{\beta}X)^2] - 2\,E_D[\hat{\beta}X]\,\beta X + (\beta X)^2 + \left( E_D[\hat{\beta}X]^2 - E_D[\hat{\beta}X]^2 \right)
    = \underbrace{E_D[(\hat{\beta}X)^2] - E_D[\hat{\beta}X]^2}_{(12)} + \underbrace{\left( E_D[\hat{\beta}X] - \beta X \right)^2}_{(13)},

where Equation (12) is the variance of our estimator β̂X and Equation (13) is its squared bias. When the bias is zero, the estimator is called an unbiased estimator; least squares is an example of such an estimator.

For many years, statisticians cared only about unbiased estimators. Recently, however, biased estimators have become more popular, because it is sometimes possible to significantly decrease variance at the expense of a little bias. This will be the topic of our next section. Figure 4 illustrates the tradeoff.

FIGURE 4. When attempting to determine the true parameter β of a distribution, we can use either biased or unbiased estimators. Many estimators, such as standard least squares, lead to unbiased estimates of the response means (blue). At other times, we may wish to use biased estimators, which can decrease the variance of the estimate of the response mean at the expense of some bias (red). (Here we have abused terminology and blurred the distinction between bias and squared bias.)
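To make the decomposition concrete, here is a small simulation sketch (not from the notes; the data-generating distribution, sample size, and shrinkage factor are illustrative assumptions). It repeatedly draws datasets from a true distribution D, fits the unbiased least-squares estimator and a deliberately shrunken (biased) estimator, and estimates the variance (12) and squared bias (13) of the prediction β̂X at a fixed input. In this deliberately noisy, small-sample regime, the biased estimator happens to attain lower MSE.

```python
import numpy as np

rng = np.random.default_rng(3)

beta_true, sigma, N = 0.5, 2.0, 10      # assumed 1-D model: y = beta * x + noise
x_query = 1.0                           # fixed input X at which we study beta_hat * X
shrink = 0.5                            # illustrative shrinkage factor (biased estimator)

def draw_dataset():
    x = rng.normal(size=N)
    y = beta_true * x + rng.normal(scale=sigma, size=N)
    return x, y

ls_preds, shrunk_preds = [], []
for _ in range(20000):                  # repeated draws from the true distribution D
    x, y = draw_dataset()
    beta_ls = np.sum(y * x) / np.sum(x**2)            # Equation (5): unbiased least squares
    ls_preds.append(beta_ls * x_query)
    shrunk_preds.append(shrink * beta_ls * x_query)   # biased, lower-variance estimator

for name, preds in [("least squares", ls_preds), ("shrunken", shrunk_preds)]:
    preds = np.array(preds)
    variance = preds.var()                                  # Equation (12)
    sq_bias = (preds.mean() - beta_true * x_query) ** 2     # Equation (13)
    print(f"{name:13s} variance={variance:.3f} squared bias={sq_bias:.3f} MSE={variance + sq_bias:.3f}")
```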