
Linear regression

We are ready to consider our first machine-learning problem: linear regression. Suppose that we are interested in the values of a function y(x): R^d → R, where x is a d-dimensional vector-valued input. We assume that we have made some observations of this mapping, D = {(x_i, y_i)}_{i=1}^N, to serve as training data. Given these examples, the goal of regression is to be able to predict the value of y at a new input location x*. In other words, we want to learn from D a (hopefully accurate!) prediction function (also called a regression function) ŷ(x*).

Without any further restrictions on ŷ, the problem is not well-posed. In general, I could choose any function ŷ. How do I choose a good function? In particular, how do I effectively use the training observations D to guide this choice?

The space of all possible regression functions ŷ is enormous; to restrict this space and make learning more tractable, linear regression assumes the relationship between x and y is mostly linear:

y(x) = xᵀw + ε(x),    (1)

where w ∈ R^d is a vector of parameters, and ε(x) represents a departure from the underlying linear function xᵀw at x. The value of ε(x) is also called the residual. If the underlying linear trend closely matches the data, we can expect small residuals.

Note that this model does not explicitly contain an intercept term. Instead, we normally add a constant feature to the input vector x as in x' = [1, x]. In this case, the value w_1 may be seen as an intercept term, allowing the hyperplane to avoid passing through the origin at x = 0. To avoid messy notation, we will not explicitly write the prime on the inputs and rather assume this expansion has already been done.

In general, we can imagine applying any function to x to serve as the inputs to our linear regression model. For example, to accomplish polynomial regression of a given degree k, we might take x' = [1, x, x², ..., x^k], where exponentiation is pointwise. A transformation of this type may be summarized by a map x ↦ φ(x); this is known as a basis expansion. With a nonlinear basis expansion, we can learn nonlinear regression functions. We assume in the following that such an expansion, if desired, has already been applied to the inputs.

Let us collect the N training examples into a matrix of training inputs X ∈ R^{N×d} and a vector of associated outputs y ∈ R^N. Then our assumed linear relationship in (1), when applied to the training data, may be written

y = Xw + ε,

where we have also collected the residuals into the vector ε. There are many classical approaches for estimating the parameter vector w; below, we describe two.
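To make the setup concrete, here is a minimal sketch in Python with NumPy (our own illustration; the note itself contains no code) of assembling a design matrix X from scalar inputs using the polynomial basis expansion [1, x, x², ..., x^k] described above. The function name polynomial_design_matrix and the example degree are assumptions made for this sketch.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map scalar inputs x (shape (N,)) to phi(x) = [1, x, x^2, ..., x^degree].

    The first column of ones is the constant feature added above, so the
    corresponding weight w_1 acts as the intercept.
    """
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)  # columns: x^0, x^1, ..., x^degree

# Example: N = 5 scalar inputs expanded with a cubic basis.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = polynomial_design_matrix(x, degree=3)   # shape (5, 4)
```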

Ordinary least squares

As mentioned previously, a good linear mapping xᵀw should result in small residuals. Our goal is to minimize the magnitude of the residuals at every possible test input x*. Of course, this is impossible, so we settle for the next best thing: we minimize the magnitude of the residuals on our training data.

The classical approach is to estimate w by minimizing the sum of the squared residuals on the training data:

ŵ = arg min_w ∑_{i=1}^N (xᵢᵀw − yᵢ)².    (2)

This approach is called ordinary least squares. The underlying assumption is that the magnitudes of the residuals are all independent of each other, so that we can simply sum up the squared residuals and minimize. (If the residuals were instead correlated, we would have to consider the interaction between pairs of residuals. This is also possible and gives rise to generalized least squares.)

It turns out that the minimization problem in (2) has an exact, closed-form solution:

ŵ = (XᵀX)⁻¹Xᵀy.

To predict the value of y at a new input x*, we plug our estimate of the parameter vector into (1):

ŷ(x*) = x*ᵀŵ = x*ᵀ(XᵀX)⁻¹Xᵀy.

Ordinary least squares is by far the most commonly applied linear regression procedure.

Ridge regression

Occasionally, the ordinary least squares approach can lead to problems. In particular, when the training data lie extremely close to a particular hyperplane (or when the number of observations is less than the dimension, and we may find a hyperplane intersecting the data exactly), the magnitudes of the estimated parameters ŵ can become ill-behaved and extremely large. This is an example of overfitting.

A priori, we might reasonably expect or desire the values of w to be relatively small. For this reason, so-called penalization or regularization methods modify the objective in (2) to avoid such pathologies. The most common approach is to add a term to the objective encouraging small entries of w, for example

ŵ = arg min_w ∑_{i=1}^N (xᵢᵀw − yᵢ)² + λ‖w‖₂²,    (3)

where we have added the squared 2-norm of w to the function to be minimized, thereby penalizing large-magnitude entries of w. The scalar λ ≥ 0 serves to trade off the contributions of the residual term and the penalty term, and is a parameter of the model that we may tune as desired. As λ → 0, we recover the ordinary least squares objective.

The minimization problem in (3) may also be solved exactly, giving the slightly modified estimator

ŵ = (XᵀX + λI)⁻¹Xᵀy.

Again, as λ → 0 (corresponding to very small penalties on the magnitude of entries), we see that the resulting estimator converges to the ordinary least squares estimator. As λ → ∞ (corresponding to very large penalties on the magnitude of entries), the estimator converges to 0. This approach is known as ridge regression, named for the additional term λI in the estimator, which when viewed from above can whimsically be described as resembling a mountain ridge.
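Both closed-form estimators above are one-liners in practice. The following sketch (Python/NumPy, our own example rather than anything prescribed by the note) computes the ordinary least squares and ridge estimates, using np.linalg.solve on the normal equations instead of forming an explicit matrix inverse.

```python
import numpy as np

def ols_estimator(X, y):
    # Solve (X^T X) w = X^T y, i.e. w_hat = (X^T X)^{-1} X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_estimator(X, y, lam):
    # Solve (X^T X + lambda I) w = X^T y; lam -> 0 recovers ordinary least squares.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predict(X_star, w_hat):
    # Plug the estimate into (1): y_hat(x*) = x*^T w_hat for each row of X_star.
    return X_star @ w_hat
```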

Bayesian linear regression

Here we consider a Bayesian approach to linear regression. Consider again the assumed underlying linear relationship (1): y(x) = xᵀw + ε(x). Here the vector w serves as the unknown parameter of the model, which we will reason about using the Bayesian method.

For convenience, we will select a multivariate Gaussian prior distribution on w:

p(w | µ, Σ) = N(w; µ, Σ).

As before, we might expect a priori that the entries of w are small. Furthermore, in the absence of any evidence otherwise, we might reason that all potential directions of w are equally likely. These two assumptions can be codified by choosing a multivariate Gaussian distribution for w with zero mean and isotropic diagonal covariance s²I:

p(w | s²) = N(w; 0, s²I).    (4)

Given our observed data D = (X, y), we wish to form the posterior distribution p(w | D). We will do so by deriving the joint distribution between the weight vector w and the observed vector of values y, given the inputs X. We write again the assumed linear relationship in our training data:

y = Xw + ε.

Given w and X, the only uncertainty in the above is the additive residuals ε. We assume that the residuals are independent, unbiased, and tend to be small. We may accomplish this by modeling each entry of ε as arising from zero-mean, independent, identically distributed Gaussian noise with variance σ²:

p(ε | σ²) = N(ε; 0, σ²I).

Define f = Xw, and note that f is a linear transformation of the multivariate-Gaussian distributed w; therefore, we may easily derive its distribution:

p(f | X, µ, Σ) = N(f; Xµ, XΣXᵀ).

Now the observation vector y is a sum of two independent multivariate Gaussian distributed vectors f and ε. Therefore, we may derive its prior distribution as well:

p(y | X, µ, Σ, σ²) = N(y; Xµ, XΣXᵀ + σ²I).

Finally, we compute the covariance between y and w:

cov[y, w] = cov[Xw + ε, w] = cov[Xw, w] = X cov[w, w] = XΣ,

where we have used the linearity of covariance and the fact that the noise vector ε is independent of w. Now the joint distribution of w and y is multivariate Gaussian:

p([w; y] | X, µ, Σ, σ²) = N([w; y]; [µ; Xµ], [Σ, ΣXᵀ; XΣ, XΣXᵀ + σ²I]),

where [·; ·] denotes vertical stacking into a block vector, and the bracketed blocks form the joint covariance matrix.
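The prior predictive distribution p(y | X, µ, Σ, σ²) and the covariance cov[y, w] = XΣ derived above translate directly into code. Before moving on to the conditioning step below, here is a small sketch (Python/NumPy, our own example with made-up prior and noise variances) that evaluates these quantities.

```python
import numpy as np

def prior_predictive(X, mu, Sigma, sigma2):
    """Return the mean and covariance of p(y | X, mu, Sigma, sigma^2) = N(X mu, X Sigma X^T + sigma^2 I),
    along with cov[y, w] = X Sigma, as derived above."""
    N = X.shape[0]
    mean_y = X @ mu
    cov_y = X @ Sigma @ X.T + sigma2 * np.eye(N)
    cov_y_w = X @ Sigma
    return mean_y, cov_y, cov_y_w

# Example with the simple prior (4): mu = 0, Sigma = s^2 I (s^2 and sigma^2 are made-up values).
d, s2, sigma2 = 3, 1.0, 0.1
X = np.random.default_rng(0).normal(size=(10, d))
mean_y, cov_y, cov_y_w = prior_predictive(X, np.zeros(d), s2 * np.eye(d), sigma2)
```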

We apply the formula for conditioning multivariate Gaussians on subvectors to derive the posterior distribution of w given the data D:

p(w | D, µ, Σ, σ²) = N(w; µ_D, Σ_D),

where

µ_D = µ + ΣXᵀ(XΣXᵀ + σ²I)⁻¹(y − Xµ);
Σ_D = Σ − ΣXᵀ(XΣXᵀ + σ²I)⁻¹XΣ.

Making predictions

Suppose we wish to use our model to predict the outputs y* associated with a set of inputs X*. Writing again y* = X*w + ε, we derive:

p(y* | D, µ, Σ, σ²) = N(y*; X*µ_D, X*Σ_D X*ᵀ + σ²I).

Therefore, we have a full, joint probability distribution over the outputs y*. If we must make point estimates, we choose a loss function for our predictions and use Bayesian decision theory. For example, to minimize expected squared loss, we predict the posterior mean.

An equivalent way to derive this result is to use the sum rule as follows:

p(y* | D, µ, Σ, σ²) = ∫ p(y* | w, σ²) p(w | D, µ, Σ, σ²) dw
                    = ∫ N(y*; X*w, σ²I) N(w; µ_D, Σ_D) dw
                    = N(y*; X*µ_D, X*Σ_D X*ᵀ + σ²I),

where we have used a slightly more general form of the convolution formula for multivariate Gaussians. In this case, we view the prediction process as considering the predictions we would make with every possible value of the parameters w and averaging them according to their plausibility.

Connection to ridge regression

For the simple prior (4), the posterior mean can be rewritten as

µ_D = s²Xᵀ(s²XXᵀ + σ²I)⁻¹y = (XᵀX + (σ²/s²)I)⁻¹Xᵀy,

where we have factored out the s² in the inverse and have applied the matrix identity (AB + cI)⁻¹A = A(BA + cI)⁻¹. This is exactly equal to the ridge regression solution (3) with λ = σ²/s²! Why is this?

Consider finding the maximum a posteriori estimator of w using the simple prior (4). Bayes' theorem tells us that the posterior distribution p(w | D) is proportional to the likelihood times the prior:

p(w | X, y, σ², s²) ∝ p(y | X, w, σ²) p(w | s²).

The distribution of y given w is simple:

p(y | X, w, σ²) = N(y; Xw, σ²I) = ∏_{i=1}^N N(yᵢ; xᵢᵀw, σ²).
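The posterior and predictive formulas above can be implemented essentially as written. The sketch below (Python/NumPy, our own illustration; the function names are chosen for this example) conditions the joint Gaussian on the observed y and then forms the predictive distribution for new inputs X*.

```python
import numpy as np

def posterior(X, y, mu, Sigma, sigma2):
    # Condition the joint Gaussian on y: p(w | D) = N(mu_D, Sigma_D).
    N = X.shape[0]
    K = X @ Sigma @ X.T + sigma2 * np.eye(N)     # X Sigma X^T + sigma^2 I
    A = np.linalg.solve(K, X @ Sigma).T          # Sigma X^T K^{-1} (K is symmetric)
    mu_D = mu + A @ (y - X @ mu)
    Sigma_D = Sigma - A @ X @ Sigma
    return mu_D, Sigma_D

def predictive(X_star, mu_D, Sigma_D, sigma2):
    # p(y* | D) = N(X* mu_D, X* Sigma_D X*^T + sigma^2 I).
    mean = X_star @ mu_D
    cov = X_star @ Sigma_D @ X_star.T + sigma2 * np.eye(X_star.shape[0])
    return mean, cov
```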

Rather than maximizing the posterior directly, we may instead maximize its logarithm:

ŵ_MAP = arg max_w [ −(1/(2σ²)) ∑_{i=1}^N (xᵢᵀw − yᵢ)² − (1/(2s²)) ∑_{j=1}^d w_j² ]
      = arg min_w ∑_{i=1}^N (xᵢᵀw − yᵢ)² + (σ²/s²)‖w‖₂²,

where we multiplied by 2σ² to derive the second line. This is exactly the ridge regression objective. Therefore ridge regression with ℓ₂ regularization can be seen from the Bayesian perspective as placing a Gaussian prior on w and finding the MAP estimator.

In general, many regularization methods can be interpreted as implicitly choosing a particular likelihood for the data and prior for the parameters of a model and maximizing the posterior density. For example, ℓ₁ regularization (as seen in the lasso) is equivalent to placing a Laplace distribution prior on the weight parameters, encouraging a sparse solution. Continuing with this argument, we can interpret the squared loss function ∑_{i=1}^N (xᵢᵀw − yᵢ)² as coming from an iid Gaussian noise assumption. In general, loss functions in minimization problems such as (3) can often be interpreted as negative log likelihoods arising from an implicit independent noise assumption.
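As a sanity check on the equivalence just derived, the short script below (Python/NumPy, our own example with arbitrary synthetic data and made-up variances) verifies numerically that the posterior mean under the prior (4) coincides with the ridge regression estimate at λ = σ²/s².

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 4
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # synthetic data for the check

s2, sigma2 = 2.0, 0.5        # made-up prior and noise variances
lam = sigma2 / s2            # lambda = sigma^2 / s^2

# Ridge / MAP estimate: (X^T X + lambda I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Posterior mean under prior (4): s^2 X^T (s^2 X X^T + sigma^2 I)^{-1} y.
w_post = s2 * X.T @ np.linalg.solve(s2 * X @ X.T + sigma2 * np.eye(N), y)

assert np.allclose(w_ridge, w_post)   # the two estimators agree
```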