Linear Models for Regression

Linear Models for Regression. Machine Learning. Torsten Möller. Möller/Mori 1

Reading: Chapter 3 of Pattern Recognition and Machine Learning by Bishop; Chapters 3, 5, 6, and 7 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Möller/Mori 2

Outline: Regression; Linear Basis Function Models; Loss Functions for Regression; Finding Optimal Weights; Regularization; Bayesian Linear Regression. Möller/Mori 4

Regression. Given a training set {(x_1, t_1), ..., (x_N, t_N)} where t_i is continuous: regression. For now, assume t_i ∈ R and x_i ∈ R^D. E.g. t_i is a stock price and x_i contains company profit, debt, cash flow, gross sales, number of spam emails sent, ... Möller/Mori 5

Outline: Regression; Linear Basis Function Models; Loss Functions for Regression; Finding Optimal Weights; Regularization; Bayesian Linear Regression. Möller/Mori 6

Linear Functions. A function f(·) is linear if f(αx_1 + βx_2) = αf(x_1) + βf(x_2). Linear functions will lead to simple algorithms, so let's see what we can do with them. Möller/Mori 7

Linear Regression. The simplest linear model for regression is y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_D x_D. Remember, we're learning w: set w so that y(x, w) aligns with the target values in the training data. This is a very simple model, limited in what it can do. Möller/Mori 10

Linear Basis Function Models. The simplest linear model, y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_D x_D, was linear in x and w. Linearity in w is what will be important for simple algorithms. Extend the model to include fixed non-linear functions of the data: y(x, w) = w_0 + w_1 ϕ_1(x) + w_2 ϕ_2(x) + ... + w_{M-1} ϕ_{M-1}(x). Linear combinations of these basis functions are still linear in the parameters. Möller/Mori 13

Linear Basis Function Models. The bias parameter w_0 allows a fixed offset in the data: y(x, w) = w_0 (bias) + w_1 ϕ_1(x) + w_2 ϕ_2(x) + ... + w_{M-1} ϕ_{M-1}(x). Think of a simple 1-D x: y(x, w) = w_0 (intercept) + w_1 x (slope). For notational convenience, define ϕ_0(x) = 1, so that y(x, w) = Σ_{j=0}^{M-1} w_j ϕ_j(x) = w^T ϕ(x). Möller/Mori 16

Linear Basis Function Models. The regression function y(x, w) is a non-linear function of x, but linear in w: y(x, w) = Σ_{j=0}^{M-1} w_j ϕ_j(x) = w^T ϕ(x). Polynomial regression is an example of this. For order-M polynomial regression, ϕ_j(x) = ? Answer: ϕ_j(x) = x^j, giving y(x, w) = w_0 x^0 + w_1 x^1 + ... + w_M x^M. Möller/Mori 19
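
As a worked illustration (mine, not the slides'), the polynomial basis turns 1-D inputs into a design matrix Φ with entries Φ_{nj} = ϕ_j(x_n) = x_n^j. A minimal NumPy sketch; the names polynomial_design_matrix, x, and M are my own.

import numpy as np

def polynomial_design_matrix(x, M):
    # Phi[n, j] = x_n ** j for j = 0..M-1; the j = 0 column is the bias ϕ_0(x) = 1
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=M, increasing=True)

x = np.array([0.0, 0.5, 1.0])                 # toy 1-D inputs
Phi = polynomial_design_matrix(x, M=4)
print(Phi)                                    # each row is ϕ(x_n) = (1, x_n, x_n^2, x_n^3)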

Basis Functions: Feature Functions. Often we extract features from x. An intuitive way to think of ϕ_j(x) is as feature functions. E.g. an automatic project report grading system, where x is the text of the report: "In this project we apply the algorithm of Möller [2] to recognizing blue objects. We test this algorithm on pictures of you and I from my holiday photo collection. ..." ϕ_1(x) is the count of occurrences of "Möller [", ϕ_2(x) is the count of occurrences of "of you and I", and the regression grade is y(x, w) = 20 ϕ_1(x) - 10 ϕ_2(x). Möller/Mori 23
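
A toy sketch (mine) of the two feature functions above as plain string counts, using the slide's weights of 20 and -10; the example text is abbreviated.

def phi_1(report):
    return report.count("Möller [")        # citations of Möller

def phi_2(report):
    return report.count("of you and I")    # a grammatical slip, penalized

def grade(report, w=(20.0, -10.0)):
    return w[0] * phi_1(report) + w[1] * phi_2(report)

text = "We apply the algorithm of Möller [2] ... pictures of you and I ..."
print(grade(text))                         # 20*1 - 10*1 = 10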

Other Non-linear Basis Functions. [Plots of the three families of basis functions.] Polynomial: ϕ_j(x) = x^j. Gaussian: ϕ_j(x) = exp{-(x - µ_j)^2 / (2s^2)}. Sigmoidal: ϕ_j(x) = 1 / (1 + exp((µ_j - x)/s)). Möller/Mori 24
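
These three basis-function families can be written directly in NumPy; an illustrative sketch with my own parameter names (mu, s).

import numpy as np

def poly_basis(x, j):
    return x ** j                                     # ϕ_j(x) = x^j

def gaussian_basis(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))      # bump centred at mu with width s

def sigmoid_basis(x, mu, s):
    return 1.0 / (1.0 + np.exp((mu - x) / s))         # smooth step around mu

x = np.linspace(-1, 1, 5)
print(poly_basis(x, j=2))
print(gaussian_basis(x, mu=0.0, s=0.5))
print(sigmoid_basis(x, mu=0.0, s=0.1))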

Example - Gaussian Basis Functions: Temperature. Use Gaussian basis functions for regression on temperature. Möller/Mori 25

Example - Gaussian Basis Functions: Temperature. µ_1 = Vancouver, µ_2 = San Francisco, µ_3 = Oakland. Temperature in x = Seattle? y(x, w) = w_1 exp{-(x - µ_1)^2 / (2s^2)} + w_2 exp{-(x - µ_2)^2 / (2s^2)} + w_3 exp{-(x - µ_3)^2 / (2s^2)}. Compute the distances to all µ; since Seattle is closest to Vancouver, y(x, w) ≈ w_1. Möller/Mori 28

Example - Gaussian Basis Functions. Could define ϕ_j(x) = exp{-(x - x_j)^2 / (2s^2)}: a Gaussian around each training data point x_j, so there are N of them. Could use this for modelling temperature or resource availability at spatial location x. Risk of overfitting: the fit interpolates the data. This is an example of a kernel method; more on these later. Möller/Mori 33

Outline: Regression; Linear Basis Function Models; Loss Functions for Regression; Finding Optimal Weights; Regularization; Bayesian Linear Regression. Möller/Mori 34

Loss Functions for Regression. We want to find the best set of coefficients w. Recall that one way to define "best" was minimizing the squared error E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2. We will now look at another way, based on maximum likelihood. Möller/Mori 35
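
The squared-error loss is a one-liner once the design matrix is in hand; a small sketch with my own variable names, using a toy design matrix and targets.

import numpy as np

def squared_error(w, Phi, t):
    # E(w) = 1/2 * sum_n (w^T ϕ(x_n) - t_n)^2
    residuals = Phi @ w - t
    return 0.5 * np.sum(residuals ** 2)

Phi = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])   # bias + identity basis, 3 points
t = np.array([0.1, 0.6, 0.9])
print(squared_error(np.array([0.0, 1.0]), Phi, t))     # error of the guess w = (0, 1)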

Gaussian Noise Model for Regression. We are provided with a training set {(x_i, t_i)}. Let's assume t arises from a deterministic function plus Gaussian distributed noise (with precision β): t = y(x, w) + ϵ. The probability of observing a target value t is then p(t | x, w, β) = N(t | y(x, w), β^{-1}). Notation: N(x | µ, σ^2) means x is drawn from a Gaussian with mean µ and variance σ^2. Möller/Mori 37

Maximum Likelihood for Regression. The likelihood of the data t = {t_i} under this Gaussian noise model is p(t | w, β) = Π_{n=1}^{N} N(t_n | w^T ϕ(x_n), β^{-1}). The log-likelihood is ln p(t | w, β) = Σ_{n=1}^{N} ln [ sqrt(β/(2π)) exp(-(β/2)(t_n - w^T ϕ(x_n))^2) ] = (N/2) ln β - (N/2) ln(2π) (const. wrt w) - (β/2) Σ_{n=1}^{N} (t_n - w^T ϕ(x_n))^2 (squared error). So minimizing the sum of squared errors is maximum likelihood under a Gaussian noise model. Möller/Mori 41

Outline: Regression; Linear Basis Function Models; Loss Functions for Regression; Finding Optimal Weights; Regularization; Bayesian Linear Regression. Möller/Mori 42

Finding Optimal Weights. How do we maximize the likelihood wrt w (or minimize the squared error)? Take the gradient of the log-likelihood wrt w: ∂/∂w_i ln p(t | w, β) = β Σ_{n=1}^{N} (t_n - w^T ϕ(x_n)) ϕ_i(x_n). In vector form: ∇ ln p(t | w, β) = β Σ_{n=1}^{N} (t_n - w^T ϕ(x_n)) ϕ(x_n)^T. Möller/Mori 45

Finding Optimal Weights. Set the gradient to 0: 0^T = Σ_{n=1}^{N} t_n ϕ(x_n)^T - w^T ( Σ_{n=1}^{N} ϕ(x_n) ϕ(x_n)^T ). The maximum likelihood estimate for w is w_ML = (Φ^T Φ)^{-1} Φ^T t, where Φ is the N x M design matrix with rows ϕ(x_n)^T:

Φ = [ ϕ_0(x_1) ϕ_1(x_1) ... ϕ_{M-1}(x_1) ;
      ϕ_0(x_2) ϕ_1(x_2) ... ϕ_{M-1}(x_2) ;
      ...
      ϕ_0(x_N) ϕ_1(x_N) ... ϕ_{M-1}(x_N) ].

Φ† = (Φ^T Φ)^{-1} Φ^T is known as the pseudo-inverse (pinv in MATLAB). Möller/Mori 47
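
A sketch of this closed-form solution on synthetic data, using NumPy's pseudo-inverse (np.linalg.pinv, the counterpart of MATLAB's pinv); np.linalg.lstsq gives the same answer and is the numerically preferred route. The data and basis choice here are only for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)   # noisy targets

Phi = np.vander(x, N=4, increasing=True)        # polynomial basis, M = 4

w_ml = np.linalg.pinv(Phi) @ t                  # w_ML = (Φ^T Φ)^{-1} Φ^T t
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_lstsq))               # same solution; lstsq is more stable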

Geometry of Least Squares. [Figure: the target vector t, the subspace S spanned by ϕ_1 and ϕ_2, and the projection y.] t = (t_1, ..., t_N) is the target value vector. S is the space spanned by the vectors φ_j = (ϕ_j(x_1), ..., ϕ_j(x_N)). The solution y lies in S. The least squares solution is the orthogonal projection of t onto S. Can verify this by looking at y = Φ w_ML = Φ Φ† t = P t, where P^2 = P and P = P^T. Möller/Mori 52
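
These projection identities are easy to check numerically; a small sketch on random synthetic data (my own setup).

import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((10, 3))              # N = 10 points, M = 3 basis functions
t = rng.standard_normal(10)

P = Phi @ np.linalg.pinv(Phi)                   # P = Φ Φ†, projector onto the span of Φ's columns
w_ml = np.linalg.pinv(Phi) @ t
y = Phi @ w_ml

print(np.allclose(P @ P, P), np.allclose(P, P.T))   # P^2 = P and P = P^T
print(np.allclose(y, P @ t))                        # y is the orthogonal projection of t onto S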

Sequential Learning. In practice N might be huge, or the data might arrive online. Can use a gradient descent method: start with an initial guess for w, then update by taking a step in the direction -∇E of the error function. Modify this to stochastic / sequential gradient descent: if the error function is a sum E = Σ_n E_n (e.g. least squares), update by taking a step in the direction -∇E_n for one example at a time. Details about the step size are important: decrease the step size as learning proceeds. Möller/Mori 56
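
A minimal sketch (not the authors' code) of the sequential update for least squares, w ← w + η (t_n - w^T ϕ(x_n)) ϕ(x_n), with a simple decaying step size; the names and the schedule are my own.

import numpy as np

def sgd_least_squares(Phi, t, eta=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for epoch in range(epochs):
        for n in rng.permutation(N):            # visit one example at a time
            error = t[n] - w @ Phi[n]
            w += eta * error * Phi[n]           # step along -∇E_n
        eta *= 0.99                             # slowly decrease the step size
    return w

Phi = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
t = np.array([0.1, 0.6, 0.9])
print(sgd_least_squares(Phi, t))                # approaches the least-squares solution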

Outline: Regression; Linear Basis Function Models; Loss Functions for Regression; Finding Optimal Weights; Regularization; Bayesian Linear Regression. Möller/Mori 57

Regularization. We discussed regularization as a technique to avoid over-fitting: Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2 + (λ/2) ||w||^2, where the second term is the regularizer. Next on the menu: other regularizers; Bayesian learning and the quadratic regularizer. Möller/Mori 58

Other Regularizers. [Contour plots of the regularizer for q = 0.5, q = 1, q = 2, q = 4.] Can use different norms for the regularizer: Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q. E.g. q = 2 gives ridge regression, q = 1 gives the lasso; the math is easiest with ridge regression. Möller/Mori 59

Optimization with a Quadratic Regularizer. With q = 2, the total error is still a nice quadratic: Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2 + (λ/2) w^T w. Calculus gives the regularized solution w = (λI + Φ^T Φ)^{-1} Φ^T t, similar to unregularized least squares. Advantage: (λI + Φ^T Φ) is well conditioned, so the inversion is stable. Möller/Mori 60
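
The regularized closed form in NumPy; a sketch with my own names, solving the linear system rather than forming an explicit inverse.

import numpy as np

def ridge_fit(Phi, t, lam):
    # w = (λI + Φ^T Φ)^{-1} Φ^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

x = np.linspace(0, 1, 10)
Phi = np.vander(x, N=6, increasing=True)        # ill-conditioned polynomial basis
t = np.sin(2 * np.pi * x)
print(ridge_fit(Phi, t, lam=1e-3))              # well behaved even when Φ^T Φ is nearly singular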

Ridge Regression vs. Lasso. [Figure: the optimum w in (w_1, w_2) space for the ridge and lasso regularizers.] Ridge regression is also known as parameter shrinkage: the weights w shrink back towards the origin. The lasso leads to sparse models: components of w tend to 0 with large λ (strong regularization). Intuitively, once the minimum is achieved at a large radius, the minimum lies on w_1 = 0. Möller/Mori 62

Outline: Regression; Linear Basis Function Models; Loss Functions for Regression; Finding Optimal Weights; Regularization; Bayesian Linear Regression. Möller/Mori 63

Bayesian Linear Regression. Start with a prior over the parameters w. The conjugate prior is a Gaussian: p(w) = N(w | 0, α^{-1} I). This simple form will make the math easier; it can be done for an arbitrary Gaussian too. The data likelihood is the Gaussian model as before: p(t | x, w, β) = N(t | y(x, w), β^{-1}). Möller/Mori 65

Bayesian Linear Regression. Posterior distribution on w: p(w | t) ∝ ( Π_{n=1}^{N} p(t_n | x_n, w, β) ) p(w) = [ Π_{n=1}^{N} sqrt(β/(2π)) exp(-(β/2)(t_n - w^T ϕ(x_n))^2) ] (α/(2π))^{M/2} exp(-(α/2) w^T w). Take the log: ln p(w | t) = -(β/2) Σ_{n=1}^{N} (t_n - w^T ϕ(x_n))^2 - (α/2) w^T w + const. So L2 regularization is maximum a posteriori (MAP) estimation with a Gaussian prior, with λ = α/β. Möller/Mori 69
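
Because the prior and likelihood are both Gaussian, the posterior is Gaussian with covariance S_N = (αI + β Φ^T Φ)^{-1} and mean m_N = β S_N Φ^T t (cf. Bishop §3.3; these formulas are not spelled out on the slide). A sketch that computes them and checks the MAP/ridge equivalence with λ = α/β; the data and hyperparameters are illustrative.

import numpy as np

def posterior(Phi, t, alpha, beta):
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 15)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(15)
Phi = np.vander(x, N=4, increasing=True)

alpha, beta = 2.0, 25.0
m_N, S_N = posterior(Phi, t, alpha, beta)

lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(4) + Phi.T @ Phi, Phi.T @ t)
print(np.allclose(m_N, w_ridge))    # MAP with a Gaussian prior = L2-regularized least squares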

Bayesian Linear Regression - Intuition. Simple example: x, t ∈ R, y(x, w) = w_0 + w_1 x. Start with a Gaussian prior in parameter space; samples from it are shown in data space. Receive data points (blue circles in data space) and compute the likelihood. The posterior is the prior (or the previous posterior) times the likelihood. Möller/Mori 75

Predictive Distribution. A single estimate of w (ML or MAP) doesn't tell the whole story. We have a distribution over w, and can use it to make predictions. Given a new value of x, we can compute a distribution over t: p(t | t, α, β) = ∫ p(t, w | t, α, β) dw = ∫ p(t | w, β) (predict) p(w | t, α, β) (probability) dw (sum). I.e., for each value of w, let it make a prediction, multiply by its probability, and sum over all w. For arbitrary models as the distributions, this integral may not be computationally tractable. Möller/Mori 78

Predictive Distribution. [Three plots of the predictive distribution as more data points are observed.] With the Gaussians we've used for these distributions, the predictive distribution will also be Gaussian (math on convolutions of Gaussians). The green line is the true (unobserved) curve, the blue circles are data points, the red line is the predictive mean, and the pink region is one standard deviation. Uncertainty is small around the data points, and the pink region shrinks with more data. Möller/Mori 79
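
For this Gaussian model the predictive distribution has mean m_N^T ϕ(x) and variance σ^2(x) = 1/β + ϕ(x)^T S_N ϕ(x) (again cf. Bishop §3.3; not written out on the slide). A self-contained sketch reusing the posterior formulas from the previous example; all data and settings are illustrative.

import numpy as np

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 1, 15)
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(15)
Phi = np.vander(x_train, N=4, increasing=True)

alpha, beta = 2.0, 25.0
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)   # posterior covariance
m_N = beta * S_N @ Phi.T @ t_train                            # posterior mean

x_new = np.linspace(0, 1, 5)                                  # new inputs to predict at
Phi_new = np.vander(x_new, N=4, increasing=True)
mean = Phi_new @ m_N                                          # predictive mean m_N^T ϕ(x)
var = 1.0 / beta + np.sum((Phi_new @ S_N) * Phi_new, axis=1)  # 1/β + ϕ(x)^T S_N ϕ(x), row-wise
print(mean)
print(np.sqrt(var))     # predictive std; smallest near the training inputs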

Bayesian Model Selection. So what do the Bayesians say about model selection? Model selection is choosing a model M_i, e.g. the degree of the polynomial or the type of basis function ϕ. Don't select, just integrate: p(t | x, D) = Σ_{i=1}^{L} p(t | x, M_i, D) (predictive dist.) p(M_i | D). Average together the results of all models. Alternatively, choose the most likely model a posteriori, max p(M_i | D): more efficient, but an approximation. Möller/Mori 83

Bayesian Model Selection. How do we compute the posterior over models? p(M_i | D) ∝ p(D | M_i) p(M_i): another likelihood + prior combination. The likelihood is p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw. Möller/Mori 85

Conclusion. Readings: Ch. 3.1, 3.1.1-3.1.4, 3.3.1-3.3.2, 3.4. Linear models for regression: linear combinations of (non-linear) basis functions. Fitting the parameters of the regression model: least squares; maximum likelihood (can be = least squares). Controlling over-fitting: regularization; Bayesian, use a prior (can be = regularization). Model selection: cross-validation (use held-out data); Bayesian (use the model evidence, i.e. the marginal likelihood). Möller/Mori 86