Regression. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh.


Regression. Machine Learning and Pattern Recognition. Chris Williams, School of Informatics, University of Edinburgh, September 2014. (All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)

Classification or Regression?

Classification: want to learn a discrete target variable.
Regression: want to learn a continuous target variable.

Linear regression, linear-in-the-parameters models
Linear regression is a conditional Gaussian model
Maximum likelihood solution: ordinary least squares
Can use nonlinear basis functions
Ridge regression
Full Bayesian treatment

Reading: Murphy chapter 7 (not all sections needed), Barber (17.1, 17.2, 18.1.1)

One Dimensional Data

[Scatter plot of the one-dimensional training data.]

Linear Regression

Simple example: one-dimensional linear regression. Suppose we have data of the form (x, y), and we believe the data should follow a straight line, i.e. admit a straight-line fit of the form y = w_0 + w_1 x.

However, we also believe the target values y are subject to measurement error, which we will assume to be Gaussian. So

y = w_0 + w_1 x + \eta,

where \eta is a Gaussian noise term with mean 0 and variance \sigma_\eta^2.
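
As an illustration (my own sketch, not from the slides), data can be generated from exactly this model; the weights and noise level below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

w0, w1 = 0.5, 1.5          # arbitrary true intercept and slope
sigma_eta = 0.4            # arbitrary noise standard deviation

x = rng.uniform(-2, 3, size=30)
eta = rng.normal(0.0, sigma_eta, size=30)   # Gaussian noise, mean 0, variance sigma_eta^2
y = w0 + w1 * x + eta                       # y = w0 + w1 x + eta
```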

Figure credit: http://jedismedicine.blogspot.co.uk/24//

Linear regression is just a conditional version of estimating a Gaussian (conditional on the input x).

Generated Data

[Scatter plot of data generated from the linear model with Gaussian noise.]

Multivariate Case

Consider the case where we are interested in y = f(x) for D-dimensional x:

y = w_0 + w_1 x_1 + ... + w_D x_D + \eta, where \eta \sim N(0, \sigma_\eta^2).

Examples? Final grade depends on time spent on work for each tutorial.

We set w = (w_0, w_1, ..., w_D)^T and introduce \phi = (1, x^T)^T; then we can write y = w^T \phi + \eta instead.

This implies p(y | \phi, w) = N(y; w^T \phi, \sigma_\eta^2).

Assume that the training data are iid, i.e. p(y^1, ..., y^N | x^1, ..., x^N, w) = \prod_{n=1}^{N} p(y^n | x^n, w).

Given data {(x^n, y^n), n = 1, 2, ..., N}, the log likelihood is

L(w) = \log p(y^1, ..., y^N | x^1, ..., x^N, w) = -\frac{1}{2\sigma_\eta^2} \sum_{n=1}^{N} (y^n - w^T \phi^n)^2 - \frac{N}{2} \log(2\pi\sigma_\eta^2)
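
A minimal NumPy sketch of the model above (illustrative, not from the slides): it stacks the vectors \phi^n = (1, x^n)^T as rows of a design matrix and evaluates the Gaussian log likelihood L(w); all variable names (Phi, sigma_eta, ...) are my own.

```python
import numpy as np

def design_matrix(X):
    """Stack phi^n = (1, x^n)^T as rows: shape (N, D+1)."""
    N = X.shape[0]
    return np.hstack([np.ones((N, 1)), X])

def log_likelihood(w, X, y, sigma_eta):
    """Gaussian log likelihood L(w) for the model y = w^T phi + eta."""
    Phi = design_matrix(X)
    resid = y - Phi @ w
    N = len(y)
    return (-0.5 / sigma_eta**2) * np.sum(resid**2) - 0.5 * N * np.log(2 * np.pi * sigma_eta**2)

# toy check: D = 2 inputs, N = 5 points, evaluated at the true weights
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=5)
print(log_likelihood(np.array([1.0, 2.0, -1.0]), X, y, sigma_eta=0.1))
```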

Minimizing Squared Error

L(w) = -\frac{1}{2\sigma_\eta^2} \sum_{n=1}^{N} (y^n - w^T \phi^n)^2 - \frac{N}{2} \log(2\pi\sigma_\eta^2)
     = -C_1 \sum_{n=1}^{N} (y^n - w^T \phi^n)^2 - C_2

where C_1 > 0 and C_2 don't depend on w. Now:

Multiplying by a positive constant doesn't change the maximum.
Adding a constant doesn't change the maximum.
\sum_{n=1}^{N} (y^n - w^T \phi^n)^2 is the sum of squared errors made if you use w.

So maximizing the likelihood is the same as minimizing the total squared error of the linear predictor. So you don't have to believe the Gaussian assumption. You can simply believe that you want to minimize the squared error.

Maximum Likelihood Solution I

Write \Phi = (\phi^1, \phi^2, ..., \phi^N)^T and y = (y^1, y^2, ..., y^N)^T.
\Phi is called the design matrix; it has N rows, one for each example.

L(w) = -\frac{1}{2\sigma_\eta^2} (y - \Phi w)^T (y - \Phi w) - C_2

Take derivatives of the log likelihood:

\nabla_w L(w) = -\frac{1}{\sigma_\eta^2} \Phi^T (\Phi w - y)

Maximum Likelihood Solution II

Setting the derivative to zero (the maximum of L(w), equivalently the minimum of the squared error) gives

\Phi^T \Phi \hat{w} = \Phi^T y

This means the maximum likelihood \hat{w} is given by

\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y

The matrix (\Phi^T \Phi)^{-1} \Phi^T is called the pseudo-inverse. This is the ordinary least squares (OLS) solution for w.

MLE for the variance:

\hat{\sigma}_\eta^2 = \frac{1}{N} \sum_{n=1}^{N} (y^n - \hat{w}^T \phi^n)^2

i.e. the average of the squared residuals.
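
A minimal NumPy sketch of the OLS solution and noise-variance MLE above (illustrative, not from the slides); np.linalg.lstsq solves the normal equations without forming the pseudo-inverse explicitly, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-2, 3, size=N)
y = 0.5 + 1.5 * x + rng.normal(0, 0.4, size=N)    # toy data: w0=0.5, w1=1.5, sigma_eta=0.4

Phi = np.column_stack([np.ones(N), x])            # design matrix, one row per example

# Maximum likelihood / OLS weights: solves Phi^T Phi w = Phi^T y
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# MLE of the noise variance: mean of the squared residuals
residuals = y - Phi @ w_hat
sigma2_hat = np.mean(residuals**2)

print(w_hat, sigma2_hat)
```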

Generated Data

[Plot of the generated data; the black line is the maximum likelihood fit to the data.]

Nonlinear regression

All this just used \phi. We chose to put the x values in \phi, but we could have put anything in there, including nonlinear transformations of the x values.

In fact we can choose any useful form for \phi, so long as the model remains linear in the parameters w (the derivatives wrt w are as before). We can even change the size of \phi.

We already have the maximum likelihood solution in the case of Gaussian noise: the pseudo-inverse solution.

Models of this form are called general linear models or linear-in-the-parameters models.

Example: polynomial fitting

Model y = w_1 + w_2 x + w_3 x^2 + w_4 x^3. Set \phi = (1, x, x^2, x^3)^T and w = (w_1, w_2, w_3, w_4)^T.

Can immediately write down the ML solution: \hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y, where \Phi and y are defined as before.

Could use any features we want: e.g. features that are only active in certain local regions (radial basis functions, RBFs).

Figure credit: David Barber, BRML Fig 17.6
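
As a hedged illustration (my own sketch, not from the slides), the same pseudo-inverse solution applied to a cubic polynomial basis; the target function and noise level are arbitrary choices:

```python
import numpy as np

def poly_design(x, degree=3):
    """phi(x) = (1, x, x^2, ..., x^degree)^T stacked as rows."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=30)
y = np.sin(x) + rng.normal(0, 0.1, size=30)      # a nonlinear target

Phi = poly_design(x, degree=3)
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # (Phi^T Phi)^{-1} Phi^T y

x_test = np.linspace(-2, 2, 9)
y_pred = poly_design(x_test, degree=3) @ w_hat
print(y_pred)
```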

Dimensionality issues

How many radial basis functions do we need? Suppose we need only three per dimension. Then we would need 3^D for a D-dimensional problem.

This becomes large very fast: this is commonly called the curse of dimensionality.

Gaussian processes (see later) can help with these issues.
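
For concreteness (my own arithmetic, not on the slide), a grid of three basis functions per dimension grows as 3^D:

```python
# number of basis functions needed with 3 per input dimension
for D in (2, 5, 10, 20):
    print(D, 3**D)   # 9, 243, 59049, 3486784401
```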

Higher dimensional outputs

Suppose the target values are vectors. Then we introduce a different weight vector w_i for each output dimension y_i, and we can do regression independently for each output.
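
A minimal sketch (mine, not from the slides) showing that with a target matrix Y whose columns are the output dimensions, the per-output least-squares fits can be computed in one call:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, K = 40, 3, 2                      # N examples, D inputs, K output dimensions
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D + 1, K))    # one weight column per output
Phi = np.hstack([np.ones((N, 1)), X])
Y = Phi @ W_true + 0.05 * rng.normal(size=(N, K))

# lstsq on a matrix right-hand side fits each output column independently
W_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(np.round(W_hat - W_true, 2))      # differences should be small
```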

Adding a Prior

Put a prior over the parameters, e.g.

p(y | \phi, w) = N(y; w^T \phi, \sigma_\eta^2)
p(w) = N(w; 0, \tau^2 I)

where I is the identity matrix. The log posterior is

\log p(w | D) = \text{const} - \frac{1}{2\sigma_\eta^2} \sum_{n=1}^{N} (y^n - w^T \phi^n)^2 - \frac{N}{2} \log(2\pi\sigma_\eta^2) - \underbrace{\frac{1}{2\tau^2} w^T w}_{\text{penalty on large weights}} - \frac{D}{2} \log(2\pi\tau^2)

The MAP solution can be computed analytically; the derivation is almost the same as with MLE:

w_{MAP} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y, where \lambda = \sigma_\eta^2 / \tau^2.

This is called ridge regression.
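
A minimal NumPy sketch of the MAP / ridge solution above (illustrative; function and variable names are mine):

```python
import numpy as np

def ridge_fit(Phi, y, sigma_eta2, tau2):
    """w_MAP = (Phi^T Phi + lambda I)^{-1} Phi^T y with lambda = sigma_eta^2 / tau^2."""
    lam = sigma_eta2 / tau2
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=25)
y = 0.5 + 1.5 * x + rng.normal(0, 0.3, size=25)
Phi = np.column_stack([np.ones_like(x), x])
print(ridge_fit(Phi, y, sigma_eta2=0.3**2, tau2=1.0))
```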

Effect of Ridge Regression

Collecting the constant terms from the log posterior on the last slide,

\log p(w | D) = \text{const} - \frac{1}{2\sigma_\eta^2} \sum_{n=1}^{N} (y^n - w^T \phi^n)^2 - \underbrace{\frac{1}{2\tau^2} w^T w}_{\text{penalty term } \|w\|_2^2}

This is called l_2 regularization or weight decay. The second term is (proportional to) the squared Euclidean (also called l_2) norm of w.

The idea is to reduce overfitting by forcing the function to be simple. The simplest possible function is the constant w = 0, so we encourage \hat{w} to be closer to that.

\tau is a parameter of the method. It trades off between how well you fit the training data and how simple the model is. It is most commonly set via cross validation (see the sketch below).

Regularization is a general term for adding a second term to an objective function to encourage simple models.
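
To illustrate setting the regularization strength by cross validation (my own sketch; the slides give no code), a simple K-fold search over \lambda = \sigma_\eta^2 / \tau^2 on an arbitrary grid, with an arbitrary polynomial basis:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

def cv_mse(Phi, y, lam, K=5):
    """Mean squared validation error of ridge regression, averaged over K folds."""
    N = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(N), K)
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(N), val)
        w = ridge_fit(Phi[train], y[train], lam)
        errs.append(np.mean((y[val] - Phi[val] @ w) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=60)
Phi = np.vander(x, N=10, increasing=True)            # degree-9 polynomial basis
lambdas = 10.0 ** np.arange(-6, 3)
best = min(lambdas, key=lambda lam: cv_mse(Phi, y, lam))
print("selected lambda:", best)
```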

Effect of Ridge Regression (Graphic) 2 ln lambda 2.35 2 ln lambda 8.57 5 5 5 5 5 5 5 5 2 5 5 5 2 Figure credit: Murphy Fig 7.7 Degree 4 polynomial fit with and without regularization 8 / 24

Why Ridge Regression Works (Graphic) u 2 ML Estimate u MAP Estimate prior mean Figure credit: Murphy Fig 7.9 9 / 24

Bayesian Regression

Bayesian regression model:

p(y | \phi, w) = N(y; w^T \phi, \sigma_\eta^2)
p(w) = N(w; 0, \tau^2 I)

It is possible to compute the posterior distribution analytically, because linear Gaussian models are jointly Gaussian (see Murphy 7.6.1 for details):

p(w | \Phi, y, \sigma_\eta^2) \propto p(w) \, p(y | \Phi, w, \sigma_\eta^2) = N(w | w_N, V_N)

w_N = \sigma_\eta^{-2} V_N \Phi^T y
V_N = \sigma_\eta^2 (\sigma_\eta^2 / \tau^2 \, I + \Phi^T \Phi)^{-1}
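
A minimal NumPy sketch computing the posterior mean and covariance above (illustrative; variable names are mine):

```python
import numpy as np

def posterior(Phi, y, sigma_eta2, tau2):
    """Posterior N(w; w_N, V_N) for Bayesian linear regression with known noise variance."""
    D = Phi.shape[1]
    V_N = sigma_eta2 * np.linalg.inv((sigma_eta2 / tau2) * np.eye(D) + Phi.T @ Phi)
    w_N = (1.0 / sigma_eta2) * V_N @ (Phi.T @ y)
    return w_N, V_N

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=20)
y = 0.5 + 1.5 * x + rng.normal(0, 0.3, size=20)
Phi = np.column_stack([np.ones_like(x), x])
w_N, V_N = posterior(Phi, y, sigma_eta2=0.3**2, tau2=1.0)
print(w_N)
print(V_N)
```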

Making predictions

For a new test point x^* with corresponding feature vector \phi^*, we have that

y^* = w^T \phi^* + \eta, where w \sim N(w_N, V_N).

Hence

p(y^* | x^*, D) = N(w_N^T \phi^*, (\phi^*)^T V_N \phi^* + \sigma_\eta^2)
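
Continuing the sketch from the previous slide (again my own illustrative code), the predictive mean and variance at a test input; the numerical values of w_N and V_N below are placeholders, in practice they come from the posterior() sketch above:

```python
import numpy as np

def predictive(phi_star, w_N, V_N, sigma_eta2):
    """Mean and variance of p(y* | x*, D) = N(w_N^T phi*, phi*^T V_N phi* + sigma_eta^2)."""
    mean = w_N @ phi_star
    var = phi_star @ V_N @ phi_star + sigma_eta2
    return mean, var

# illustrative placeholder values for w_N and V_N
w_N = np.array([0.48, 1.52])
V_N = np.array([[0.004, 0.000], [0.000, 0.003]])
phi_star = np.array([1.0, 2.5])          # phi* = (1, x*) for x* = 2.5
print(predictive(phi_star, w_N, V_N, sigma_eta2=0.09))
```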

Example of Bayesian Regression

[Panels showing the likelihood, the prior/posterior over (w_0, w_1), and samples in data space as data points arrive sequentially. Figure credit: Murphy Fig 7.11.]

Another Example

[Four panels: the prediction from the plugin approximation (MLE) vs. the posterior predictive (known variance) on the training data, and functions sampled from the plugin approximation vs. functions sampled from the posterior. Figure credit: Murphy Fig 7.12.]

Fitting a quadratic. Notice how the error bars get larger further away from the training data.

Summary

Linear regression is a conditional Gaussian model
Maximum likelihood solution: ordinary least squares
Can use nonlinear basis functions
Ridge regression
Full Bayesian treatment