Music and Machine Learning (IFT68 Winter 8), Prof. Douglas Eck, Université de Montréal. These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop.

Linear Basis Function Models

Linear regression extended to consider fixed basis functions:

    y(x, w) = Σ_{j=0}^{M} w_j φ_j(x) = w^T φ(x)

where w = (w_0, ..., w_M)^T and φ = (φ_0, ..., φ_M)^T. Possible basis functions include polynomials, Fourier bases, wavelets, ...

    Polynomial:  φ_j(x) = x^j
    Gaussian:    φ_j(x) = exp{ −(x − μ_j)^2 / (2 s^2) }
    Sigmoidal:   φ_j(x) = σ( (x − μ_j) / s ),  where σ(a) = 1 / (1 + exp(−a))

[Figure: example polynomial, Gaussian, and sigmoidal basis functions.]
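As a concrete illustration (my own sketch, not from the slides), the following NumPy snippet builds a design matrix from Gaussian basis functions plus a constant bias column; the centres, the width s, and the grid of inputs are arbitrary example choices.

```python
import numpy as np

def gaussian_basis(x, centres, s):
    """Return the design matrix Phi with Phi[n, j] = phi_j(x_n).

    phi_0 is a constant bias term; phi_j (j >= 1) is a Gaussian bump
    centred at centres[j-1] with width s.
    """
    x = np.asarray(x).reshape(-1, 1)                     # shape (N, 1)
    bumps = np.exp(-(x - centres) ** 2 / (2 * s ** 2))   # shape (N, M)
    return np.hstack([np.ones_like(x), bumps])           # prepend bias column

# Example: 5 inputs, Gaussian bumps centred on a grid in [0, 1]
x = np.linspace(0, 1, 5)
centres = np.linspace(0, 1, 4)
Phi = gaussian_basis(x, centres, s=0.2)
print(Phi.shape)  # (5, 5): one bias column plus 4 Gaussian columns
```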

Maximum likelihood and least squares

Presume the target t is generated by a deterministic function plus Gaussian noise ε having precision β:

    t = y(x, w) + ε
    p(t | x, w, β) = N(t | y(x, w), β^{−1})

With a Gaussian conditional distribution the conditional mean is

    E[t | x] = ∫ t p(t | x) dt = y(x, w)

With a set of input points X = {x_1, ..., x_N} independently drawn:

    p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1})

Maximum likelihood and least squares

Log likelihood:

    ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{−1})
                   = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

where

    E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2

Gradient:

    ∇ ln p(t | w, β) = Σ_{n=1}^{N} { t_n − w^T φ(x_n) } φ(x_n)^T

Maximum likelihood and least squares

Set the gradient to 0:

    0 = Σ_{n=1}^{N} t_n φ(x_n)^T − w^T ( Σ_{n=1}^{N} φ(x_n) φ(x_n)^T )

Solving for the weights yields the normal equations for least squares:

    w_ML = (Φ^T Φ)^{−1} Φ^T t

where Φ is the design matrix

    Φ = [ φ_0(x_1)  φ_1(x_1)  ...  φ_M(x_1)
          φ_0(x_2)  φ_1(x_2)  ...  φ_M(x_2)
          ...
          φ_0(x_N)  φ_1(x_N)  ...  φ_M(x_N) ]

and Φ† = (Φ^T Φ)^{−1} Φ^T is the pseudo-inverse of Φ.
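A minimal sketch (not from the slides) of the closed-form maximum-likelihood fit: the polynomial basis, the noisy-sine data, the noise level 0.2, and M = 5 are illustrative assumptions. The weights come from the pseudo-inverse of the design matrix, exactly as in the normal equations above, and 1/β is estimated as the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy sine (illustrative choice)
N = 30
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Polynomial design matrix: Phi[n, j] = x_n ** j for j = 0..M
M = 5
Phi = np.vander(x, M + 1, increasing=True)        # shape (N, M+1)

# Normal equations via the Moore-Penrose pseudo-inverse: w_ML = pinv(Phi) @ t
w_ml = np.linalg.pinv(Phi) @ t

# Maximum-likelihood noise estimate: 1/beta = mean squared residual (assumed here)
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
print(w_ml, beta_ml)
```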

Geometry of least squares

[Figure: the target vector t, its projection y, and the subspace S spanned by the basis vectors φ_1, φ_2.]

Least-squares regression is obtained by finding the orthogonal projection of the data vector t onto the subspace spanned by the basis functions. Intuition: the sum-of-squares error is 1/2 the squared Euclidean distance between y and t, so the least-squares solution moves y as close as possible to t within the subspace S.

Online learning

For large datasets we may need to learn sequentially on sequences of smaller datasets, summing the error E = Σ_n E_n. Sequential gradient descent (also called stochastic gradient descent):

    w^{(τ+1)} = w^{(τ)} − η ∇E_n

where τ is the iteration number and η is the learning rate. For the sum-of-squares error we get Least Mean Squares (LMS):

    w^{(τ+1)} = w^{(τ)} + η (t_n − w^{(τ)T} φ_n) φ_n

The learning rate must be chosen carefully.
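A sketch of the LMS update on streaming data (illustrative only: the data-generating line, noise level, learning rate, and iteration count are assumed values, with the line chosen to match the synthetic example used later in the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Streaming data from a noisy line (assumed for illustration)
def next_point():
    x = rng.uniform(-1.0, 1.0)
    t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2)
    return x, t

w = np.zeros(2)          # weights for phi(x) = (1, x)
eta = 0.1                # learning rate; must be chosen carefully

for tau in range(10_000):
    x, t = next_point()
    phi = np.array([1.0, x])
    # LMS update: w <- w + eta * (t - w^T phi) * phi
    w = w + eta * (t - w @ phi) * phi

print(w)   # should approach (-0.3, 0.5)
```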

Regularized least squares

Regularize the magnitude of the weights:

    (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w

Setting the gradient with respect to w to zero yields an extension of the least-squares solution:

    w = (λI + Φ^T Φ)^{−1} Φ^T t      (compare w_ML = (Φ^T Φ)^{−1} Φ^T t)

More general regularizer (regularizer contours shown for q = 0.5, 1, 2, 4); when q = 1 we have the lasso regularizer, which selects sparse models:

    (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q
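A hedged sketch of the regularized (q = 2) solution in NumPy; the polynomial basis, the synthetic data, and the λ values are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares (q = 2): w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Illustrative use with a polynomial design matrix
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 25)
Phi = np.vander(x, 10, increasing=True)

w_unreg = ridge_fit(Phi, t, lam=0.0)     # reduces to ordinary least squares
w_ridge = ridge_fit(Phi, t, lam=1e-3)    # shrinks the weights toward zero
print(np.abs(w_unreg).max(), np.abs(w_ridge).max())
```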

Visualization of regularized least squares

[Figure: contours of the unregularized error function together with the constraint region of the quadratic regularizer (left, q = 2) and the lasso regularizer (right, q = 1), in the (w_1, w_2) plane.]

For the lasso, a sparse solution is generated in which w_1 = 0.

Bias-Variance decomposition

How do we best set the regularization parameter λ? Define the conditional expectation

    h(x) = E[t | x] = ∫ t p(t | x) dt

The expected squared loss can be written with the noise as a second term:

    E[L] = ∫ { y(x) − h(x) }^2 p(x) dx + ∫∫ { h(x) − t }^2 p(x, t) dx dt

We will minimize the first term, but we cannot hope to ever know the perfect regression function h(x). In a Bayesian model this uncertainty is expressed as a posterior over w. In a frequentist treatment we make a point estimate of w and assess confidence by making predictions over subsets of the data and taking the mean performance.

Bias-Variance decomposition

Take the integrand of the first term for a particular dataset D:

    { y(x; D) − h(x) }^2

which varies with the data, so take its average over datasets. Add and subtract the expected value over the data:

    { y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) }^2
      = { y(x; D) − E_D[y(x; D)] }^2 + { E_D[y(x; D)] − h(x) }^2
        + 2 { y(x; D) − E_D[y(x; D)] } { E_D[y(x; D)] − h(x) }

Take the expectation with respect to D; the final (cross) term vanishes:

    E_D[ { y(x; D) − h(x) }^2 ] = { E_D[y(x; D)] − h(x) }^2 + E_D[ { y(x; D) − E_D[y(x; D)] }^2 ]
                                =          (bias)^2          +              variance

The first term is the squared bias: the extent to which the average prediction differs from the desired regression function. The second term is the variance: the extent to which individual solutions vary around the average, and thus a measure of sensitivity to the data.

Bias-Variance decomposition

    expected loss = (bias)^2 + variance + noise

where:

    (bias)^2 = ∫ { E_D[y(x; D)] − h(x) }^2 p(x) dx
    variance = ∫ E_D[ { y(x; D) − E_D[y(x; D)] }^2 ] p(x) dx
    noise    = ∫∫ { h(x) − t }^2 p(x, t) dx dt

Very flexible models have low bias and high variance; relatively rigid models have high bias and low variance. The optimal model balances the two.

Bias variance example

L datasets, each with 25 data points, are fit with 25 Gaussian basis functions while the regularization parameter λ is varied. The top plots show individual fits; the bottom plots show the average fit along with the generating sine function in green.

Bias variance example

    average:   ȳ(x) = (1/L) Σ_{l=1}^{L} y^{(l)}(x)
    (bias)^2 = (1/N) Σ_{n=1}^{N} { ȳ(x_n) − h(x_n) }^2
    variance = (1/N) Σ_{n=1}^{N} (1/L) Σ_{l=1}^{L} { y^{(l)}(x_n) − ȳ(x_n) }^2

[Figure: squared bias, variance, their sum, and the test error plotted against ln λ.]

The plot of squared bias and variance together with their sum has its minimum at ln λ ≈ −0.31, which is close to the value yielding the minimum test error.
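To make the estimator concrete, here is a rough sketch (my own, with assumed values: 100 datasets of 25 points, 24 Gaussian basis functions plus a bias, noise standard deviation 0.3, and λ = exp(−0.31)) that fits regularized least squares to many datasets and estimates (bias)^2 and variance, evaluated on a dense grid rather than the training inputs:

```python
import numpy as np

rng = np.random.default_rng(3)

def h(x):                      # "true" regression function (illustrative)
    return np.sin(2 * np.pi * x)

def gaussian_basis(x, centres, s=0.1):
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack([np.ones_like(x), np.exp(-(x - centres) ** 2 / (2 * s ** 2))])

L, N, lam = 100, 25, np.exp(-0.31)        # assumed values for the example
centres = np.linspace(0, 1, 24)
x_grid = np.linspace(0, 1, 200)
Phi_grid = gaussian_basis(x_grid, centres)

preds = np.empty((L, x_grid.size))
for l in range(L):
    x = rng.uniform(0, 1, N)
    t = h(x) + rng.normal(0, 0.3, N)
    Phi = gaussian_basis(x, centres)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = Phi_grid @ w               # fit for dataset l, evaluated on the grid

y_bar = preds.mean(axis=0)                               # average prediction
bias2 = np.mean((y_bar - h(x_grid)) ** 2)                # (bias)^2
variance = np.mean(((preds - y_bar) ** 2).mean(axis=0))  # variance
print(bias2, variance)
```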

Bayesian Linear Regression

The bias-variance decomposition requires splitting the data, which is inefficient. The Bayesian approach avoids the overfitting of maximum likelihood and leads to an automatic way of determining model complexity. Here we look quickly at the Bayesian approach; we will return to it later in the semester.

Define a prior over the weights using a zero-mean Gaussian:

    p(w | α) = N(w | 0, α^{−1} I)

The log of the posterior is the sum of the log likelihood and the log of the prior:

    ln p(w | t) = −(β/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 − (α/2) w^T w + const

with a quadratic regularization term corresponding to λ = α/β in the least-squares sense.

Sequential Bayesian Learning

Consider a simple input variable x, a single target t, and a linear model of the form y(x, w) = w_0 + w_1 x. With just two weights, we can plot the prior and the posteriors. Generate synthetic data using

    f(x_n, a) = −0.3 + 0.5 x_n + ε

The goal is to recover a = {−0.3, 0.5} from the data. Basic algorithm: observe a point (x, t) from the dataset; calculate the likelihood p(t | x, w) based on an estimate of the noise precision β; multiply the likelihood by the previous prior over w to yield the new posterior.

Sequential Bayesian Learning

Basic algorithm:
- Observe a point (x, t) from the dataset
- Calculate the likelihood p(t | x, w) based on an estimate of the noise precision β
- Multiply the likelihood by the previous prior over w to yield the new posterior
- Observe another point ...

Samples from the posterior are shown on the right.
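A minimal sketch of this sequential update for the two-weight model, assuming a known noise precision β = 25, a prior precision α = 2, and 20 observations (all illustrative values); because the prior and the likelihood are both Gaussian, each update simply adds to the posterior precision and re-solves for the mean.

```python
import numpy as np

rng = np.random.default_rng(4)

alpha, beta = 2.0, 25.0            # assumed prior precision and (known) noise precision
a0, a1 = -0.3, 0.5                 # true weights used to generate synthetic data

# Start from the zero-mean Gaussian prior p(w | alpha) = N(w | 0, alpha^{-1} I)
m = np.zeros(2)                    # posterior mean
S_inv = alpha * np.eye(2)          # posterior precision (inverse covariance)

for n in range(20):
    # Observe one synthetic point (x, t)
    x = rng.uniform(-1.0, 1.0)
    t = a0 + a1 * x + rng.normal(0.0, 1.0 / np.sqrt(beta))
    phi = np.array([1.0, x])       # basis for y(x, w) = w0 + w1 * x

    # Multiply the Gaussian likelihood into the current Gaussian prior:
    # the posterior is again Gaussian, with updated precision and mean.
    S_inv_new = S_inv + beta * np.outer(phi, phi)
    m = np.linalg.solve(S_inv_new, S_inv @ m + beta * phi * t)
    S_inv = S_inv_new

print(m)                            # posterior mean approaches (-0.3, 0.5)
```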

Predictive distribution

We are generally not interested in the posterior over w itself but rather in predicting new values of t for new inputs x. Evaluate the predictive distribution

    p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw

This is the convolution of the conditional distribution of the target with the posterior over w. For our problem (two Gaussians) this results in

    p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))
    σ_N^2(x) = β^{−1} + φ(x)^T S_N φ(x)
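A sketch of evaluating this predictive distribution under the zero-mean isotropic prior above; the straight-line basis, α, β, the data, and the query point are illustrative assumptions:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) for a zero-mean isotropic prior."""
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Mean and variance of p(t | x, t, alpha, beta)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x    # sigma_N^2(x)
    return mean, var

# Illustrative use with the straight-line model phi(x) = (1, x)
rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 1 / np.sqrt(beta), 20)
Phi = np.column_stack([np.ones_like(x), x])
m_N, S_N = posterior(Phi, t, alpha, beta)
print(predictive(np.array([1.0, 0.5]), m_N, S_N, beta))
```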

Predictive distribution

[Figure: predictive distributions for a model with 9 Gaussian basis functions fit to data from f(x) = sin(2πx) + ε, shown in green. The red curve is the mean of the predictive distribution; the red shaded regions lie one standard deviation away from the mean.]

Predictive distribution

[Figure: plots of the functions y(x, w) drawn using samples from the posterior distributions over w corresponding to the previous plots.]

Equivalent kernel

The posterior mean can be interpreted as a kernel; this sets the stage for kernel methods, including Gaussian processes. The predictive mean can be written as

    y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n

We can also rewrite this as a kernel function:

    y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n

where the function k(x, x′) = β φ(x)^T S_N φ(x′) is known as the smoother matrix or equivalent kernel. Regression functions which predict using linear combinations of target values are known as linear smoothers.
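A sketch (my own) that computes the equivalent kernel for a toy model and checks numerically that the predictive mean equals the kernel-weighted sum of targets; the straight-line basis and the values of α and β are assumed for illustration:

```python
import numpy as np

def equivalent_kernel(x, x_train, alpha, beta, basis):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n), for every training input x_n."""
    Phi = basis(x_train)                                   # (N, M)
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    phi_x = basis(np.array([x]))[0]                        # (M,)
    return beta * Phi @ (S_N @ phi_x)                      # (N,) kernel weights

def basis(x):                                              # simple line basis (1, x)
    x = np.asarray(x).reshape(-1)
    return np.column_stack([np.ones_like(x), x])

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0
x_train = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x_train + rng.normal(0, 0.2, 20)

# Kernel-weighted prediction at x = 0.3 ...
k = equivalent_kernel(0.3, x_train, alpha, beta, basis)

# ... should match the predictive mean m_N^T phi(x)
Phi = basis(x_train)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t
print(k @ t, m_N @ basis(np.array([0.3]))[0])              # the two should match
```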

Equivalent kernel

[Figure: the equivalent kernel (left, middle) for the Gaussian basis functions (right).]

Above, k(x, x′) is plotted as a function of x′; note that it is localized around x. The mean of the predictive distribution at x, given by y(x, m_N), is obtained as a weighted combination of the target values in which points close to x are given higher weight. The idea of using a localized kernel in place of a set of basis functions leads to Gaussian processes (to be covered later).