Least Squares Regression


CIS 520: Machine Learning, Spring 2008
Lecture 4: Least Squares Regression
Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
- Regression and conditional expectation
- Linear least squares regression
- Ridge regression and Lasso
- Probabilistic view

1 Regression and Conditional Expectation

In this lecture we consider regression problems, where there is an instance space $\mathcal{X}$ as before, but labels and predictions are real-valued: $\mathcal{Y} = \hat{\mathcal{Y}} = \mathbb{R}$ (such as in a weather forecasting problem, where instances might be satellite images showing water vapor in some region and labels/predictions might be the amount of rainfall in the coming week, or in a stock price prediction problem, where instances might be feature vectors describing properties of stocks and labels/predictions might be the stock price after some time period). Here one is given a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathbb{R})^m$, and the goal is to learn from $S$ a regression model $f_S : \mathcal{X} \to \mathbb{R}$ that accurately predicts the labels of new instances in $\mathcal{X}$.

What should count as a good regression model? In other words, how should we measure the performance of a regression model? A widely used performance measure involves the squared loss function $\ell_{sq} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, defined as $\ell_{sq}(y, \hat{y}) = (\hat{y} - y)^2$. The loss of a model $f : \mathcal{X} \to \mathbb{R}$ on an example $(x, y)$ is measured by $\ell_{sq}(y, f(x)) = (f(x) - y)^2$. Assuming examples are drawn from some joint probability distribution $D$ on $\mathcal{X} \times \mathbb{R}$, the squared-loss generalization error of $f : \mathcal{X} \to \mathbb{R}$ w.r.t. $D$ is then given by

    $er^{sq}_D[f] = \mathbb{E}_{(X,Y) \sim D}\big[(f(X) - Y)^2\big]$.

What would be the optimal regression model for $D$ under the above loss? We have

    $er^{sq}_D[f] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}\big[(f(X) - Y)^2\big]\big]$.

Now, for each $x$, we know (and it is easy to see) that the value $c$ minimizing $\mathbb{E}_{Y|X=x}\big[(c - Y)^2\big]$ is given by $c = \mathbb{E}[Y \mid X = x]$. Therefore the optimal regression model is simply the conditional expectation function, also called the regression function of $Y$ on $x$:

    $f^*(x) = \mathbb{E}[Y \mid X = x]$.
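The claim that the conditional mean minimizes the expected squared loss is easy to check numerically. Below is a minimal sketch; the toy conditional distribution $Y \mid X = x \sim \mathcal{N}(\sin x,\, 0.3^2)$, the grid of candidate constants, and the sample size are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conditional distribution (illustrative assumption):
# given x, Y | X = x ~ N(sin(x), 0.3^2), so f*(x) = E[Y | X = x] = sin(x).
x = 1.2
y_samples = rng.normal(np.sin(x), 0.3, size=200_000)

# Monte Carlo estimate of E[(c - Y)^2 | X = x] over a grid of constants c.
cs = np.linspace(0.0, 2.0, 201)
risks = np.array([np.mean((c - y_samples) ** 2) for c in cs])

best_c = cs[np.argmin(risks)]
print(f"empirical minimizer c ~ {best_c:.3f};  E[Y | X = x] = sin(x) = {np.sin(x):.3f}")
```

The empirical minimizer lands close to $\sin(x)$, as the argument above predicts.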

The conditional expectation function plays the same role for regression w.r.t. squared loss as does a Bayes optimal classifier for binary classification w.r.t. 0-1 loss. The minimum achievable squared error w.r.t. $D$ is simply

    $er^{sq,*}_D = \inf_{f : \mathcal{X} \to \mathbb{R}} er^{sq}_D[f] = er^{sq}_D[f^*] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}\big[(Y - \mathbb{E}[Y \mid X])^2\big]\big] = \mathbb{E}_X\big[\mathrm{Var}[Y \mid X]\big]$,

which is simply the expectation over $X$ of the conditional variance of $Y$ given $X$; this plays the same role as the Bayes error for 0-1 binary classification.

2 Linear Least Squares Regression

For the remainder of the lecture, let $\mathcal{X} = \mathbb{R}^d$, and let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathbb{R})^m$. We start with a simple approach which does not make any assumptions about the underlying probability distribution, but simply fits a linear regression model of the form $f_w(x) = w^\top x$ to the data by minimizing the empirical squared error on $S$, $\hat{er}^{sq}_S[f_w] = \frac{1}{m}\sum_{i=1}^m (f_w(x_i) - y_i)^2$:

    $\min_w \frac{1}{m} \sum_{i=1}^m (w^\top x_i - y_i)^2.$    (1)

Setting the gradient of the above objective to zero yields

    $\frac{2}{m} \sum_{i=1}^m (w^\top x_i - y_i)\, x_i = 0$.

We can rewrite this using matrix notation as follows: let

    $X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_m^\top \end{pmatrix} \in \mathbb{R}^{m \times d}$  and  $y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} \in \mathbb{R}^m$;

then we have

    $X^\top X w - X^\top y = 0$.

These are known as the normal equations for least squares regression and yield the following solution for $w$ (assuming $X^\top X$ is non-singular):

    $\hat{w} = (X^\top X)^{-1} X^\top y$.

The linear least squares regression model is then given by $f_S(x) = \hat{w}^\top x$.

The solution $\hat{w}$ can be viewed as performing an orthogonal projection of the label vector $y$ in $\mathbb{R}^m$ onto the $d$-dimensional subspace (assuming $m > d$) spanned by the $d$ column vectors $x_k = (x_{1k}, \ldots, x_{mk})^\top \in \mathbb{R}^m$, $k = 1, \ldots, d$ (in particular, the vector $\hat{y} = X\hat{w}$ constitutes the projection of $y$ onto this subspace). We will see below that the same regression model also arises as a maximum likelihood solution under suitable probabilistic assumptions. Before doing so, we discuss two variants of the above model that are widely used in practice.
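Before moving on, here is a minimal numpy sketch of the closed-form solution just derived. The synthetic data, dimensions, and noise level are illustrative assumptions; the point is only that solving the normal equations recovers the same fit as a standard least squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3                                # sample size and dimension (illustrative)
X = rng.normal(size=(m, d))                  # rows are the instances x_i
w_true = np.array([2.0, -1.0, 0.5])          # illustrative "true" parameter
y = X @ w_true + 0.1 * rng.normal(size=m)    # noisy linear labels

# Solve the normal equations X^T X w = X^T y (assumes X^T X is non-singular).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem (via SVD) and
# should agree up to numerical precision.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)

# Predictions of the fitted model f_S(x) = w_hat^T x; the vector y_hat is
# also the orthogonal projection of y onto the column space of X.
y_hat = X @ w_hat
```

In practice one would use `np.linalg.lstsq` (or a QR/SVD-based solver) rather than inverting $X^\top X$ directly, since it is numerically more stable when $X^\top X$ is ill-conditioned.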

3 Ridge Regression and Lasso

We saw above that the simple least squares regression model requires $X^\top X$ to be non-singular; indeed, when $X^\top X$ is close to being singular (which is the case if two or more columns of $X$ are nearly co-linear), then $\hat{w}$ can contain large values that lead to over-fitting the training data. To prevent this, one often adds a penalty term or regularizer to the objective in Eq. (1) that penalizes large values in $w$ (such methods are also referred to as parameter shrinkage methods in statistics). A widely used regularizer is the $L_2$ regularizer $\|w\|_2^2 = \sum_{j=1}^d w_j^2$, leading to the following:

    $\min_w \frac{1}{m} \sum_{i=1}^m (w^\top x_i - y_i)^2 + \lambda \|w\|_2^2,$    (2)

where $\lambda > 0$ is a suitable regularization parameter that determines the trade-off between the two terms. Setting the gradient of the above objective to zero again yields a closed-form solution for $w$:

    $\hat{w} = (X^\top X + \lambda m I_d)^{-1} X^\top y$,

where $I_d$ denotes the $d \times d$ identity matrix; note that the matrix $(X^\top X + \lambda m I_d)$ is non-singular. The resulting regression model, $f_S(x) = \hat{w}^\top x$, is known as ridge regression and is widely used in practice. (The same $L_2$ regularizer is also widely used in logistic regression, leading to $L_2$-regularized logistic regression.)

Another regularizer that is frequently used is the $L_1$ regularizer $\|w\|_1 = \sum_{j=1}^d |w_j|$, which leads to

    $\min_w \frac{1}{m} \sum_{i=1}^m (w^\top x_i - y_i)^2 + \lambda \|w\|_1,$    (3)

where $\lambda > 0$ is again a suitable regularization parameter. This can be formulated as a quadratic programming problem which can be solved using numerical optimization methods. For large enough $\lambda$, the solution $\hat{w}$ turns out to be sparse, in the sense that many of the parameter values in $\hat{w}$ are equal to zero, so that the resulting regression model depends on only a small number of features. The $L_1$-regularized least squares regression model is known as the lasso and is also widely used, especially in high-dimensional problems where $d$ is large and dependence on a small number of features is desirable.

For both $L_2$ and $L_1$ regularizers, the regularization parameter $\lambda$ determines the extent of the penalty for large values in the parameter vector $w$. In practice, one generally selects $\lambda$ heuristically from some finite range using a validation set (which involves holding out part of the training data for validation, training on the remaining data with different values of $\lambda$, and selecting the one that gives the highest performance on the validation data) or cross-validation (which involves dividing the training sample into some $k$ subsamples/folds, holding out one of these folds at a time and training on the remaining $k-1$ folds with different values of $\lambda$, testing performance on the held-out fold, and repeating this procedure for all $k$ folds; the value of $\lambda$ that gives the highest average performance over the $k$ folds is then selected). An extreme case of cross-validation with $k = m$ leads to what is called leave-one-out validation. In recent years, algorithms for certain models (including the lasso) have been developed that can efficiently compute the entire path of solutions for all values of $\lambda$. Below we will also see a Bayesian interpretation of these regularizers; this gives another approach to selecting $\lambda$.
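The sketch below illustrates the ridge closed form above, with the same $\lambda m$ scaling, together with validation-set selection of $\lambda$ from a finite grid. The data, grid, and train/validation split are illustrative assumptions; a lasso fit is not shown since it requires an iterative solver (e.g. coordinate descent) rather than a closed form.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution w = (X^T X + lam * m * I_d)^{-1} X^T y,
    matching the 1/m scaling of the empirical error in Eq. (2)."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
m, d = 60, 20                                  # illustrative sizes
X = rng.normal(size=(m, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.0]                  # only a few informative features
y = X @ w_true + 0.5 * rng.normal(size=m)

# Simple validation-set selection of lambda from a finite grid.
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]
lams = [0.001, 0.01, 0.1, 1.0, 10.0]
best_lam = min(lams, key=lambda lam: mse(X_val, y_val, ridge_fit(X_tr, y_tr, lam)))
print("selected lambda:", best_lam)
```

Cross-validation would replace the single split with an average of validation errors over $k$ folds, but the selection logic is the same.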

4 Probabilistic View

We will now make a specific assumption on the conditional distribution of $Y$ given $x$, and will see that estimating the parameters of that distribution from the training sample using maximum likelihood estimation, and using the conditional expectation associated with the estimated distribution as our regression model, will recover the linear least squares regression model described above. We will also see that under the same probabilistic assumption, maximum a posteriori (MAP) estimation of the parameters under suitable priors will yield ridge regression and the lasso.

Specifically, assume that given $x \in \mathcal{X}$, a label $Y$ is generated randomly as follows:

    $Y = w^{*\top} x + \epsilon$,

where $w^* \in \mathbb{R}^d$ and $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is some normally distributed noise with variance $\sigma^2 > 0$. In other words, we have $Y \mid X = x \sim \mathcal{N}(w^{*\top} x, \sigma^2)$, so that the conditional density of $Y$ given $x$ can be written as

    $p(y \mid x; w^*, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big(-\frac{(y - w^{*\top} x)^2}{2\sigma^2}\Big)$.

Clearly, in this case, the optimal regression model (under squared error) is given by $f^*(x) = \mathbb{E}[Y \mid X = x] = w^{*\top} x$. In practice, the parameters $w^*, \sigma$ are unknown and must be estimated from the training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, which is assumed to contain examples drawn i.i.d. from the same distribution. Let us first proceed with maximum likelihood estimation.

Maximum likelihood estimation. We can write the conditional likelihood of $w, \sigma$ as

    $L(w, \sigma) = p(y_1, \ldots, y_m \mid x_1, \ldots, x_m; w, \sigma) = \prod_{i=1}^m p(y_i \mid x_i; w, \sigma) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\Big)$.

The log-likelihood becomes

    $\ln L(w, \sigma) = -\frac{m}{2}\ln(2\pi) - m \ln \sigma - \sum_{i=1}^m \frac{(y_i - w^\top x_i)^2}{2\sigma^2}$.

Clearly, maximizing the above log-likelihood w.r.t. $w$ is equivalent to simply minimizing the empirical squared error on $S$, yielding the same solution as above:

    $\hat{w} = (X^\top X)^{-1} X^\top y$.

This yields the same linear least squares regression model as above: $f_S(x) = \hat{w}^\top x$. The variance parameter $\sigma^2$ does not play a role in the regression model, but can be useful in determining the uncertainty in the model's prediction at any point. It can be estimated by maximizing the log-likelihood above w.r.t. $\sigma$, which gives

    $\hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{w}^\top x_i)^2$.
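A minimal sketch of these two maximum likelihood estimates on synthetic data generated from the assumed Gaussian noise model; the dimensions, true parameter, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3                        # illustrative sizes
sigma_true = 0.7                     # illustrative noise standard deviation
X = rng.normal(size=(m, d))
w_star = np.array([1.0, -2.0, 0.5])
y = X @ w_star + sigma_true * rng.normal(size=m)   # Y = w*^T x + eps

# Maximum likelihood estimates under the Gaussian noise model:
w_hat = np.linalg.solve(X.T @ X, X.T @ y)           # same as least squares
sigma2_hat = np.mean((y - X @ w_hat) ** 2)          # (1/m) sum of squared residuals

print("sigma^2 estimate:", sigma2_hat, "  true sigma^2:", sigma_true ** 2)
# Under the model, a prediction at a new point x0 is w_hat^T x0, with noise
# standard deviation roughly sqrt(sigma2_hat) around it.
```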

Maximum a posteriori (MAP) estimation. Continuing with the normal (Gaussian) noise model above, we can estimate $w$ using maximum a posteriori (MAP) estimation under a suitable prior, rather than using maximum likelihood estimation. For example, let us assume a zero-mean, isotropic normal prior on $w$. In other words, denoting by $W$ the random variable corresponding to $w$, we have

    $W \sim \mathcal{N}(0, \sigma_0^2 I_d)$,

where $I_d$ denotes the $d \times d$ identity matrix; this is equivalent to assuming that the prior selects each component of $W$ independently from a $\mathcal{N}(0, \sigma_0^2)$ distribution. The prior density can be written as

    $p(w) = \frac{1}{(2\pi)^{d/2} \sigma_0^d} \exp\!\Big(-\frac{1}{2\sigma_0^2} \|w\|_2^2\Big)$.

Assuming for simplicity that the noise variance parameter $\sigma^2$ is known, the posterior density of $w$ given the data $S$ then takes the form

    $p(w \mid S) \propto \exp\!\Big(-\frac{1}{2\sigma_0^2} \|w\|_2^2\Big) \prod_{i=1}^m \exp\!\Big(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\Big)$,

giving

    $\ln p(w \mid S) = -\frac{1}{2\sigma_0^2} \|w\|_2^2 - \frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - w^\top x_i)^2 + \text{const}$.

The MAP estimate of $w$ is obtained by maximizing this w.r.t. $w$; clearly, this is equivalent to solving the following $L_2$-regularized least squares regression problem:

    $\min_w \frac{1}{m} \sum_{i=1}^m (w^\top x_i - y_i)^2 + \frac{\sigma^2}{m \sigma_0^2} \|w\|_2^2$.

This provides an alternative view of ridge regression, and suggests that where it is suitable to assume the above conditional distribution with noise variance $\sigma^2$ and an isotropic normal prior on $w$ with variance $\sigma_0^2$, an appropriate choice for the regularization parameter is given by $\lambda = \frac{\sigma^2}{m \sigma_0^2}$.

If instead of an isotropic normal prior we assume an isotropic Laplace prior with density

    $p(w) = \Big(\frac{\lambda_0}{2}\Big)^d \exp\big(-\lambda_0 \|w\|_1\big)$,

then the posterior density becomes

    $p(w \mid S) \propto \exp\big(-\lambda_0 \|w\|_1\big) \prod_{i=1}^m \exp\!\Big(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\Big)$,

with

    $\ln p(w \mid S) = -\lambda_0 \|w\|_1 - \frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - w^\top x_i)^2 + \text{const}$.

In this case, finding the MAP estimate of $w$ is equivalent to solving the following $L_1$-regularized least squares regression problem:

    $\min_w \frac{1}{m} \sum_{i=1}^m (w^\top x_i - y_i)^2 + \frac{\sigma^2 \lambda_0}{m} \|w\|_1$.

Again, this provides an alternative view of the lasso, and suggests that where it is suitable to assume the above conditional distribution with noise variance $\sigma^2$ and an isotropic Laplace prior on $w$ with parameter $\lambda_0$, an appropriate choice for the regularization parameter is given by $\lambda = \frac{\sigma^2 \lambda_0}{m}$.

Exercise. Show that for any $f : \mathcal{X} \to \mathbb{R}$, the squared-error regret of $f$, i.e. the difference of its squared error from the optimal, is equal to the expected squared difference between $f(X)$ and $\mathbb{E}[Y \mid X]$:

    $er^{sq}_D[f] - er^{sq,*}_D = \mathbb{E}_X\big[(f(X) - \mathbb{E}[Y \mid X])^2\big]$.
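The correspondence between MAP estimation under the isotropic normal prior and ridge regression with $\lambda = \sigma^2/(m\sigma_0^2)$ can also be checked numerically. The sketch below uses assumed values of $\sigma$, $\sigma_0$ and synthetic data (all illustrative) and verifies that the closed-form MAP estimate coincides with the ridge solution at that $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 4                       # illustrative sizes
sigma, sigma0 = 0.5, 2.0           # assumed noise std and prior std
X = rng.normal(size=(m, d))
w_star = rng.normal(scale=sigma0, size=d)
y = X @ w_star + sigma * rng.normal(size=m)

# MAP estimate under the N(0, sigma0^2 I_d) prior: maximize
#   -||w||^2 / (2 sigma0^2) - sum_i (y_i - w^T x_i)^2 / (2 sigma^2),
# whose stationarity condition gives the closed form below.
w_map = np.linalg.solve(X.T @ X + (sigma**2 / sigma0**2) * np.eye(d), X.T @ y)

# Ridge solution with lambda = sigma^2 / (m sigma0^2), as derived above.
lam = sigma**2 / (m * sigma0**2)
w_ridge = np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

assert np.allclose(w_map, w_ridge)   # the two estimates coincide
```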