Linear Models for Regression


Linear Models for Regression
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

1 / 36

Outline

Regression
Least Squares Method
Ridge Regression
Least Mean Squares
Recursive (Sequential) Least Squares
Bias-Variance Dilemma
Bayesian Linear Regression

2 / 36

Regression

Regression aims at modeling the dependence of a response Y on a covariate X. In other words, the goal of regression is to predict the value of one or more continuous target variables y given the value of an input vector x. The regression model is described by y = f(x) + ε.

Terminology:
x: input, independent variable, predictor, regressor, covariate
y: output, dependent variable, response

The dependence of a response on a covariate is captured via a conditional probability distribution, p(y | x). Depending on the form of f(x):
Linear regression: f(x) = Σ_{j=1}^M w_j φ_j(x) + w_0.
Kernel regression: f(x) = Σ_{i=1}^N w_i k(x, x_i) + w_0.

3 / 36

Regression Function: Conditional Mean

We consider the mean squared error and find the MMSE estimate:

E(f) = E ‖y − f(x)‖²
     = ∫∫ ‖y − f(x)‖² p(x, y) dx dy
     = ∫∫ ‖y − f(x)‖² p(x) p(y|x) dx dy
     = ∫ p(x) [ ∫ ‖y − f(x)‖² p(y|x) dy ] dx,

where the inner integral is to be minimized for each x. Setting its derivative with respect to f(x) to zero,

∂/∂f(x) ∫ ‖y − f(x)‖² p(y|x) dy = 0  ⟹  f(x) = ∫ y p(y|x) dy = E{y | x}.

4 / 36

Linear Regression

Linear regression refers to a model in which the conditional mean of y given the value of x is an affine function of φ(x):

f(x) = Σ_{j=1}^M w_j φ_j(x) + w_0 φ_0(x) = w^T φ(x),

where the φ_j(x) are known as basis functions (with φ_0(x) = 1) and

w = [w_0, w_1, ..., w_M]^T,  φ = [φ_0, φ_1, ..., φ_M]^T.

By using nonlinear basis functions, we allow the function f(x) to be a nonlinear function of the input vector x (but a linear function of φ(x)).

5 / 36

Polynomial Regression

y_t = Σ_{j=0}^M w_j φ_j(x_t) = Σ_{j=0}^M w_j x_t^j.

[Figure: polynomial fits of order M to the data, plotted as t versus x; four panels, including M = 3 and M = 9.]

6 / 36

Basis Functions

Polynomial regression: φ_j(x) = x^j.
Gaussian basis functions: φ_j(x) = exp{ −‖x − µ_j‖² / (2σ²) }.
Spline basis functions: piecewise polynomials (divide the input space up into regions and fit a different polynomial in each region).
Many other possible basis functions: sigmoidal basis functions, hyperbolic tangent basis functions, Fourier basis, wavelet basis, and so on.

7 / 36
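
As a concrete illustration (the NumPy code and function names below are my own, not from the slides), the design matrices for the polynomial and Gaussian bases can be built as follows; a leading column of ones plays the role of φ_0(x) = 1, so that w_0 acts as the bias:

import numpy as np

def polynomial_design(x, M):
    # Columns phi_j(x) = x**j for j = 0, ..., M; the j = 0 column of ones carries w_0.
    return np.vander(x, M + 1, increasing=True)

def gaussian_design(x, centers, sigma):
    # Columns phi_j(x) = exp(-(x - mu_j)**2 / (2 sigma**2)), plus a leading column of ones.
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

For example, gaussian_design(x, np.linspace(0, 1, 24), 0.1) would give a 24-Gaussian-basis model like the one used in the bias-variance example later (the centers and width are assumptions; the slides do not specify them).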

Least Squares Method

Given a set of training data {(x_t, y_t)}_{t=1}^N, we determine the weight vector w which minimizes

E_LS = (1/2) Σ_{t=1}^N { y_t − w^T φ(x_t) }² = (1/2) ‖y − Φw‖²,

where y = [y_1, ..., y_N]^T and Φ ∈ R^{N×(M+1)} is known as the design matrix with Φ_{tj} = φ_j(x_t), i.e.,

Φ = [ φ_0(x_1)  φ_1(x_1)  ...  φ_M(x_1)
      φ_0(x_2)  φ_1(x_2)  ...  φ_M(x_2)
      ...
      φ_0(x_N)  φ_1(x_N)  ...  φ_M(x_N) ].

8 / 36

Setting ∂E_LS/∂w = 0 leads to the normal equation, which is of the form

Φ^T Φ w = Φ^T y.

Thus, the LS estimate of w is given by

w_LS = (Φ^T Φ)^{-1} Φ^T y = Φ† y,

where Φ† = (Φ^T Φ)^{-1} Φ^T is known as the Moore-Penrose pseudo-inverse, and Φ^T Φ is known as the Gram matrix.

9 / 36
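
A minimal NumPy sketch of a least squares fit on synthetic sinusoidal data (the data set, noise level, and polynomial degree are my own choices); np.linalg.lstsq solves the least squares problem in a numerically stable way rather than forming (Φ^T Φ)^{-1} explicitly:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=25)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

Phi = np.vander(x, 4, increasing=True)          # cubic polynomial design matrix (M = 3)
w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # w_LS = pseudo-inverse of Phi applied to y
print(w_ls)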

Maximum Likelihood

We assume that the target variable y_t is given by a deterministic function f(x_t, w) = w^T φ(x_t) with additive Gaussian noise, so that

y = Φw + ε,

where ε ~ N(0, σ²I). The log-likelihood is given by

L = log p(y | Φ, w) = Σ_{t=1}^N log p(y_t | φ(x_t), w)
  = −(N/2) log σ² − (N/2) log 2π − (1/σ²) E_LS.

Therefore, under the Gaussian noise assumption, w_ML = w_LS.

10 / 36

Ridge Regression

We consider the sum-of-squares error function with a Euclidean norm-based regularizer:

E = (1/2) ‖y − Φw‖² + (λ/2) ‖w‖²,

where the first term is the LS fit and the second term is the regularizer. Solving ∂E/∂w = 0 for w leads to

w_ridge = (λI + Φ^T Φ)^{-1} Φ^T y.

11 / 36
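
A one-function sketch of the closed-form ridge solution (the name ridge_fit and the use of np.linalg.solve are mine; solving the regularized normal equation avoids forming an explicit inverse):

import numpy as np

def ridge_fit(Phi, y, lam):
    # w_ridge = (lam * I + Phi^T Phi)^{-1} Phi^T y
    M1 = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M1) + Phi.T @ Phi, Phi.T @ y)

# Usage, with any design matrix Phi and targets y: w_ridge = ridge_fit(Phi, y, 1e-3)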

Ridge Regression: MAP Perspective

Recall the likelihood: p(y | Φ, w) = N(y | Φw, σ²I). Assume a zero-mean Gaussian prior with covariance Σ for the parameters w:

p(w) = N(w | 0, Σ).

Then the posterior over w,

p(w | y, Φ) = p(y | Φ, w) p(w) / ∫ p(y | Φ, w) p(w) dw,

is still Gaussian, with mean and mode at

ŵ = (σ² Σ^{-1} + Φ^T Φ)^{-1} Φ^T y.

When Σ is proportional to the identity, i.e., Σ = λ^{-1} σ² I, this is called ridge regression.

12 / 36

Ridge Regression: Shrinkage

Suppose that the SVD of the design matrix Φ ∈ R^{N×(M+1)} has the form Φ = U D V^T, where D = diag(d_1, ..., d_{M+1}). Using the SVD, we write the least squares fitted vector as

Φ w_LS = Φ (Φ^T Φ)^{-1} Φ^T y = U U^T y = Σ_{i=1}^{M+1} u_i u_i^T y,

where U^T y are the coordinates of y with respect to the orthonormal basis U. The ridge solutions are

Φ w_ridge = Φ (λI + Φ^T Φ)^{-1} Φ^T y = U D (D² + λI)^{-1} D U^T y = Σ_{i=1}^{M+1} u_i [ d_i² / (d_i² + λ) ] u_i^T y.

Ridge regression shrinks the coordinates by the factors d_i² / (d_i² + λ). A greater amount of shrinkage is applied to basis vectors with smaller d_i².

13 / 36
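
The shrinkage view can be checked numerically; this sketch (random data, arbitrary λ, all names my own) verifies that the SVD expression reproduces the closed-form ridge fitted vector:

import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(25, 6))                      # an arbitrary design matrix
y = rng.normal(size=25)
lam = 0.1

U, d, _ = np.linalg.svd(Phi, full_matrices=False)   # thin SVD: Phi = U diag(d) V^T
fit_svd = U @ ((d**2 / (d**2 + lam)) * (U.T @ y))   # sum_i u_i [d_i^2 / (d_i^2 + lam)] u_i^T y
w_ridge = np.linalg.solve(lam * np.eye(6) + Phi.T @ Phi, Phi.T @ y)
print(np.allclose(fit_svd, Phi @ w_ridge))          # True: both give the ridge fitted vector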

Least Mean Square (LMS)

LMS is a gradient-descent method which minimizes the instantaneous error E_t, where

E_LS = Σ_{t=1}^N E_t = (1/2) Σ_{t=1}^N { y_t − w^T φ(x_t) }².

The gradient descent method leads to the updating rule

w ← w − η ∂E_t/∂w = w + η { y_t − w^T φ(x_t) } φ(x_t),

where η > 0 is a constant known as the learning rate.

14 / 36
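
A sketch of the LMS update loop (the epoch structure, learning rate, and zero initialization are my choices; the slides only give the per-sample update):

import numpy as np

def lms(Phi, y, eta=0.05, n_epochs=50):
    # Gradient descent on the instantaneous error E_t = 0.5 * (y_t - w^T phi_t)^2.
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for phi_t, y_t in zip(Phi, y):
            w += eta * (y_t - w @ phi_t) * phi_t   # w <- w + eta * (y_t - w^T phi_t) * phi_t
    return w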

Recursive (Sequential) LS

We introduce the forgetting factor λ to de-emphasize old samples, leading to the following error function:

E_RLS = (1/2) Σ_{i=1}^t λ^{t−i} (y_i − φ_i^T w_t)²,

where φ_t = φ(x_t). Solving ∂E_RLS/∂w_t = 0 for w_t leads to

[ Σ_{i=1}^t λ^{t−i} φ_i φ_i^T ] w_t = Σ_{i=1}^t λ^{t−i} y_i φ_i.

15 / 36

We define

P_t = [ Σ_{i=1}^t λ^{t−i} φ_i φ_i^T ]^{-1},
r_t = Σ_{i=1}^t λ^{t−i} y_i φ_i.

With these definitions, we have w_t = P_t r_t. The core idea of RLS is to apply the matrix inversion lemma to develop a sequential algorithm without explicit matrix inversion:

(A + BCD)^{-1} = A^{-1} − A^{-1} B (D A^{-1} B + C^{-1})^{-1} D A^{-1}.

16 / 36

The recursion for P_t is given by

P_t = [ Σ_{i=1}^t λ^{t−i} φ_i φ_i^T ]^{-1}
    = [ λ Σ_{i=1}^{t−1} λ^{t−1−i} φ_i φ_i^T + φ_t φ_t^T ]^{-1}
    = (1/λ) [ P_{t−1} − (P_{t−1} φ_t φ_t^T P_{t−1}) / (λ + φ_t^T P_{t−1} φ_t) ].   (matrix inversion lemma)

17 / 36

Thus, the updating rule for w is given by

w_t = P_t r_t
    = (1/λ) [ P_{t−1} − (P_{t−1} φ_t φ_t^T P_{t−1}) / (λ + φ_t^T P_{t−1} φ_t) ] [ λ r_{t−1} + y_t φ_t ]
    = w_{t−1} + [ P_{t−1} φ_t / (λ + φ_t^T P_{t−1} φ_t) ] [ y_t − φ_t^T w_{t−1} ],

where the first bracket is the gain and the second bracket is the error.

18 / 36
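
A sketch of the resulting RLS algorithm (the initialization P_0 = δI with a large δ is a common convention I am assuming; the slides do not specify it):

import numpy as np

def rls(Phi, y, forget=0.99, delta=1e3):
    # Recursive least squares with forgetting factor lambda = forget.
    M1 = Phi.shape[1]
    w = np.zeros(M1)
    P = delta * np.eye(M1)                       # assumed initialization P_0 = delta * I
    for phi_t, y_t in zip(Phi, y):
        Pphi = P @ phi_t
        gain = Pphi / (forget + phi_t @ Pphi)    # P_{t-1} phi_t / (lambda + phi_t^T P_{t-1} phi_t)
        w = w + gain * (y_t - phi_t @ w)         # w_t = w_{t-1} + gain * error
        P = (P - np.outer(gain, Pphi)) / forget  # matrix-inversion-lemma update of P_t
    return w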

Expected Loss

The squared loss is given by L(y, f(x)) = (f(x) − y)². The expected loss is computed as

E{L(y, f(x))} = ∫∫ (f(x) − y)² p(x, y) dx dy
              = E{ (f(x) − E{y|x} + E{y|x} − y)² }
              = E{ (f(x) − E{y|x})² } + E{ (E{y|x} − y)² } + 2 E{ (f(x) − E{y|x}) (E{y|x} − y) },

where the last (cross) term vanishes.

19 / 36

The expected loss can therefore be written as

E{L(y, f(x))} = ∫ (f(x) − E{y|x})² p(x) dx + ∫∫ (E{y|x} − y)² p(x, y) dx dy.

The function f(x) appears only in the first term, which is minimized when f(x) = E{y|x}. The second term is the variance of the distribution of y, averaged over x; it represents the intrinsic variability of the target data and can be regarded as noise. Because it is independent of f(x), it is the irreducible minimum value of the loss function.

20 / 36

Bias-Variance Decomposition

Consider the integrand of the first term, which for a particular data set D takes the form (f(x; D) − E{y|x})². Taking the expectation over data sets D, we have

E_D{ (f(x; D) − E{y|x})² } = (E_D{f(x; D)} − E{y|x})² + E_D{ (f(x; D) − E_D{f(x; D)})² },

where the first term is the (bias)² and the second term is the variance. Hence

expected loss = (bias)² + variance + noise.

21 / 36

Bias-Variance Dilemma

There is a trade-off between bias and variance:
Flexible models: low bias but high variance.
Rigid models: high bias but low variance.
The model with the optimal predictive capability is the one that achieves the best balance between bias and variance.

22 / 36

Example

We consider the sinusoidal data set. We generate L = 100 data sets, each containing N = 25 data points, independently from the sinusoidal curve h(x) = E{y|x} = sin(2πx). The data sets are indexed by l = 1, ..., L. For each data set D^(l), we fit a model with 24 Gaussian basis functions by ridge regression to give a prediction function f^(l)(x).

f̄(x) = (1/L) Σ_{l=1}^L f^(l)(x),
(bias)² = (1/N) Σ_{t=1}^N [ f̄(x_t) − h(x_t) ]²,
variance = (1/N) Σ_{t=1}^N (1/L) Σ_{l=1}^L [ f^(l)(x_t) − f̄(x_t) ]².

23 / 36
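
A sketch of this experiment (the noise standard deviation, basis-function centers and width, regularization strength, and evaluation on a test grid rather than the training inputs are all my assumptions):

import numpy as np

rng = np.random.default_rng(0)
L, N = 100, 25
centers, width, lam = np.linspace(0, 1, 24), 0.1, 1.0
x_grid = np.linspace(0, 1, 200)

def design(x):
    # 24 Gaussian basis functions plus a bias column.
    return np.hstack([np.ones((len(x), 1)),
                      np.exp(-((x[:, None] - centers) ** 2) / (2 * width**2))])

preds = []
for _ in range(L):                                # L independently generated data sets
    x = rng.uniform(0, 1, N)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
    Phi = design(x)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ y)
    preds.append(design(x_grid) @ w)

preds = np.array(preds)                           # shape (L, number of grid points)
f_bar = preds.mean(axis=0)                        # average prediction over data sets
bias2 = np.mean((f_bar - np.sin(2 * np.pi * x_grid)) ** 2)
variance = np.mean(preds.var(axis=0))
print(bias2, variance)

Sweeping lam from small to large values moves the fit from the low-bias/high-variance regime toward the high-bias/low-variance regime shown on the next slide.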

Rigid model (1st row) and flexible model (2nd row)

[Figure: the rigid model shows low variance but high bias; the flexible model shows high variance but low bias.]

24 / 36

Bayesian Linear Regression

The likelihood function is given by

p(y | Φ, w) = Π_{t=1}^N p(y_t | φ_t, w),

where p(y_t | φ_t, w) = N(y_t | w^T φ_t, β^{-1}). Assuming a Gaussian prior for w, i.e., p(w) = N(w | µ_0, Σ_0), the posterior distribution over w is again Gaussian, of the form

p(w | y, Φ) = N(w | µ_N, Σ_N),

where

µ_N = Σ_N (Σ_0^{-1} µ_0 + β Φ^T y),
Σ_N^{-1} = Σ_0^{-1} + β Φ^T Φ.

25 / 36

For the sake of simplicity, we consider a particular form of Gaussian prior (a zero-mean isotropic Gaussian prior),

p(w | α) = N(w | 0, α^{-1} I).

Then the corresponding posterior distribution over w is given by

p(w | y) = N(w | µ_N, Σ_N),

where

µ_N = β Σ_N Φ^T y,
Σ_N^{-1} = αI + β Φ^T Φ.

26 / 36
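
A direct translation of these formulas into NumPy (the function name and the use of np.linalg.inv are mine; in practice one might keep Σ_N in factored form instead of inverting explicitly):

import numpy as np

def posterior(Phi, y, alpha, beta):
    # Posterior N(w | mu_N, Sigma_N) under the zero-mean isotropic prior N(w | 0, alpha^{-1} I).
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi   # Sigma_N^{-1}
    Sigma_N = np.linalg.inv(A)
    mu_N = beta * Sigma_N @ Phi.T @ y                       # mu_N = beta * Sigma_N * Phi^T y
    return mu_N, Sigma_N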

An Example of Sequential Bayesian Learning

Consider a linear model y = w_0 + w_1 x in which only two parameters are involved. The next slide illustrates the sequential nature of Bayesian learning, showing that the posterior distribution over the parameters becomes sharper as more data points are observed.
Left-hand column: likelihood, p(y | x, w).
Middle column: prior/posterior, p(w | D_t).
Right-hand column: samples of the function y = w_0 + w_1 x, with w drawn from the posterior.

27 / 36

[Figure: sequential Bayesian learning for the model y = w_0 + w_1 x, with the likelihood, prior/posterior, and posterior function samples arranged in columns as described above.]

28 / 36

Predictive Distribution

We make predictions of y* for new values x* (i.e., φ* = φ(x*)). Let D = {Φ, y}. Then the predictive distribution is given by

p(y* | φ*, D, α, β) = ∫ p(y* | φ*, w, β) p(w | D, α, β) dw = N(y* | µ_N^T φ*, σ_N²(x*)),

where

µ_N = β Σ_N Φ^T y,
Σ_N^{-1} = αI + β Φ^T Φ,
σ_N²(x*) = 1/β + φ*^T Σ_N φ*.

In σ_N²(x*), the term 1/β represents the noise on the data, and φ*^T Σ_N φ* the uncertainty associated with the parameters w.

29 / 36
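
Given µ_N and Σ_N from the posterior sketch above, the predictive mean and variance at a new feature vector φ* are each one line (names are mine):

import numpy as np

def predictive(phi_star, mu_N, Sigma_N, beta):
    # Predictive distribution N(y* | mu_N^T phi_star, sigma_N^2(x*)) at a new feature vector.
    mean = mu_N @ phi_star
    var = 1.0 / beta + phi_star @ Sigma_N @ phi_star   # data noise + parameter uncertainty
    return mean, var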

Examples of Predictive Distributions

[Figure: predictive distributions for the sinusoidal data set, plotted as t versus x (four panels).]

30 / 36

Plots of f(x; w) using samples from p(w | y)

[Figure: functions f(x; w) with w sampled from the posterior p(w | y), plotted as t versus x (four panels).]

31 / 36

Bayesian Model Comparison

Avoids the over-fitting associated with maximum likelihood by marginalizing over the model parameters instead of making point estimates of their values.
Models can be compared directly on the training data, without the need for a validation set.
Avoids the multiple training runs for each model associated with cross-validation.

32 / 36

Suppose that we wish to compare a set of L models {M_i}, i = 1, ..., L (a model refers to a probability distribution over the observed data D). Given a training set D, we then wish to evaluate the posterior distribution

p(M_i | D) ∝ p(M_i) p(D | M_i),

where p(M_i) is the prior and p(D | M_i) is the evidence. p(M_i) is a prior probability distribution expressing our uncertainty: the data is generated from one of these models, but we are uncertain which one. p(D | M_i) is the model evidence, which expresses the preference shown by the data for different models. It is also known as the marginal likelihood, since it can be viewed as a likelihood function over the space of models in which the parameters have been marginalized out.

33 / 36

The Bayes factor is the ratio of model evidences for two models:

p(D | M_i) / p(D | M_j).

Model selection: choose the single most probable model. For a model governed by a set of parameters w, the model evidence is given by

p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw.

34 / 36

[Figure: the model evidence p(D) plotted over the space of data sets D for three models M_1, M_2, M_3.]

M_1 is the simplest and M_3 is the most complex. For the particular observed data set D_0, the model M_2 of intermediate complexity has the largest evidence.

35 / 36

Marginal Likelihood: Hyperparameter Estimation

We estimate the hyperparameters α and β by maximizing the marginal likelihood. The marginal likelihood is given by

p(y | Φ, α, β) = ∫ p(y | Φ, w, β) p(w | α) dw.

Marginal likelihood maximization is discussed in detail in Secs. 3.5.1 and 3.5.2 of Bishop's textbook.

36 / 36
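
For Bayesian linear regression this integral has a closed form (see Bishop, Sec. 3.5.1); the sketch below evaluates the log marginal likelihood, which one can then maximize over α and β by grid search or by the re-estimation equations of Sec. 3.5.2 (the function name and variable names are mine):

import numpy as np

def log_evidence(Phi, y, alpha, beta):
    # Log marginal likelihood ln p(y | Phi, alpha, beta) for Bayesian linear regression.
    N, M1 = Phi.shape
    A = alpha * np.eye(M1) + beta * Phi.T @ Phi             # Sigma_N^{-1}
    mu_N = beta * np.linalg.solve(A, Phi.T @ y)
    E = 0.5 * beta * np.sum((y - Phi @ mu_N) ** 2) + 0.5 * alpha * mu_N @ mu_N
    return (0.5 * M1 * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))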