Music and Machine Learning (IFT68 Winter 8), Prof. Douglas Eck, Université de Montréal. These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop.
Linear Basis Function Models

Linear regression extended to consider fixed basis functions:

y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)

where w = (w_0, \ldots, w_{M-1})^T and \phi = (\phi_0, \ldots, \phi_{M-1})^T. Possible basis functions include polynomials, Fourier, wavelets, ...

[Figure: polynomial, Gaussian, and sigmoidal basis functions]

Polynomial: \phi_j(x) = x^j
Gaussian: \phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}
Sigmoidal: \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), where \sigma(a) = \frac{1}{1 + \exp(-a)}
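The three basis families above can be sketched numerically; a minimal numpy implementation (not from the slides, with illustrative centres and widths):

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1 (phi_0 = 1 is the bias term)
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, mu, s):
    # phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)), one column per centre mu_j
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

def sigmoidal_basis(x, mu, s):
    # phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

x = np.linspace(-1, 1, 5)
Phi = gaussian_basis(x, mu=np.linspace(-1, 1, 3), s=0.5)
print(Phi.shape)  # one row per input point, one column per basis function
```

Each function returns a matrix with one row per input point, which is exactly the design matrix used later.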
Maximum likelihood and least squares

Presume the target t is generated by a deterministic function plus Gaussian noise \epsilon with precision \beta:

t = y(x, w) + \epsilon
p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

With a Gaussian conditional distribution, the conditional mean is:

\mathbb{E}[t \mid x] = \int t \, p(t \mid x) \, dt = y(x, w)

With a set of input points X = \{x_1, \ldots, x_N\} drawn independently:

p(\mathbf{t} \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
Maximum likelihood and least squares

Log likelihood:

\ln p(\mathbf{t} \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

where

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2

Gradient:

\nabla \ln p(\mathbf{t} \mid w, \beta) = \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T
Maximum likelihood and least squares

Set the gradient to zero:

0 = \sum_{n=1}^{N} t_n \phi(x_n)^T - w^T \left( \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \right)

Solving for the weights yields the normal equations for least squares:

w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t}

where \Phi is the N \times M design matrix with entries \Phi_{nj} = \phi_j(x_n), i.e. row n is (\phi_0(x_n), \ldots, \phi_{M-1}(x_n)), and \Phi^\dagger = (\Phi^T \Phi)^{-1} \Phi^T is the pseudo-inverse of \Phi.
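The normal equations can be checked numerically; a minimal numpy sketch (not from the slides; data, basis size, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 50)  # noisy targets

M = 6
Phi = np.vander(x, M, increasing=True)  # design matrix, Phi[n, j] = phi_j(x_n)

# Normal equations: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Same solution via the pseudo-inverse (numerically more stable in practice)
w_pinv = np.linalg.pinv(Phi) @ t
print(w_ml)
```

In practice one prefers `np.linalg.lstsq` or the pseudo-inverse over forming \Phi^T \Phi explicitly, since the latter squares the condition number.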
Geometry of least squares

[Figure: target vector t projected orthogonally onto the subspace S spanned by basis vectors \varphi_1, \varphi_2, giving y]

Least-squares regression is obtained by finding the orthogonal projection of the data vector t onto the subspace spanned by the basis functions.

Intuition: the sum-of-squares error is 1/2 the squared Euclidean distance between y and t. Thus the least-squares solution moves y as close as possible to t within the subspace S.
Online learning

For large datasets we may need to learn sequentially, processing smaller pieces of the data at a time and summing the error E = \sum_n E_n.

Sequential gradient descent (also called stochastic gradient descent):

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

where \tau is the iteration number and \eta is the learning rate.

For sum-of-squares error we get Least Mean Squares (LMS):

w^{(\tau+1)} = w^{(\tau)} + \eta ( t_n - w^{(\tau)T} \phi_n ) \phi_n

The learning rate must be chosen carefully.
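The LMS update is a one-liner per data point; a minimal sketch (not from the slides; the generating weights, noise level, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
t = -0.3 + 0.5 * x + rng.normal(0, 0.05, 200)  # linear data plus noise

eta = 0.1            # learning rate: too large diverges, too small is slow
w = np.zeros(2)      # weights for the basis (1, x)
for xn, tn in zip(x, t):
    phi = np.array([1.0, xn])
    # LMS update: w <- w + eta * (t_n - w^T phi_n) * phi_n
    w = w + eta * (tn - w @ phi) * phi

print(w)  # approaches the generating weights (-0.3, 0.5)
```

Unlike the normal equations, this never stores more than one data point at a time, which is what makes it attractive for large datasets.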
Regularized least squares

Regularize the magnitude of the weights:

\frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w

Setting the gradient with respect to w to zero yields an extension of least squares:

w = ( \lambda I + \Phi^T \Phi )^{-1} \Phi^T \mathbf{t}   versus   w_{ML} = ( \Phi^T \Phi )^{-1} \Phi^T \mathbf{t}

More general regularizer:

\frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

[Figure: contours of the regularizer for q = 0.5, 1, 2, 4]

When q = 1 we have the lasso regularizer, which selects for sparse models.
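The regularized solution differs from the maximum-likelihood one only by the \lambda I term; a minimal numpy sketch (not from the slides; data and \lambda are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)

M, lam = 9, 1e-3
Phi = np.vander(x, M, increasing=True)  # polynomial design matrix

# Regularized least squares: w = (lambda I + Phi^T Phi)^{-1} Phi^T t
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# A tiny lambda effectively recovers the unregularized ML solution
w_ml = np.linalg.solve(1e-12 * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(np.linalg.norm(w_reg), np.linalg.norm(w_ml))  # regularized norm is smaller
```

Shrinking the weight norm is exactly the effect the quadratic penalty is designed to have; the lasso (q = 1) instead drives some weights exactly to zero but has no closed-form solution.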
Visualization of regularized least squares

[Figure: contours of the unregularized error function together with the constraint region, for the quadratic regularizer (left, q = 2) versus the lasso regularizer (right, q = 1)]

For the lasso, a sparse solution is generated in which one of the weights is driven exactly to zero.
Bias-Variance decomposition

How do we best set the \lambda parameter for regularization?

Conditional expectation:

h(x) = \mathbb{E}[t \mid x] = \int t \, p(t \mid x) \, dt

Expected squared loss, written with the noise as second term:

\mathbb{E}[L] = \int \{ y(x) - h(x) \}^2 p(x) \, dx + \iint \{ h(x) - t \}^2 p(x, t) \, dx \, dt

We will minimize the first term. But we cannot hope to ever know the perfect regression function h(x).

In a Bayesian model, uncertainty is expressed as a posterior over w. In a frequentist treatment we make a point estimate of w and assess confidence by making predictions over subsets of the data and taking mean performance.
Bias-Variance decomposition

Take the integrand of the first term for some dataset D:

\{ y(x; D) - h(x) \}^2

which varies with the data, so we take its mean over datasets. Add and subtract the expected value over datasets:

\{ y(x; D) - \mathbb{E}_D[y(x; D)] + \mathbb{E}_D[y(x; D)] - h(x) \}^2
= \{ y(x; D) - \mathbb{E}_D[y(x; D)] \}^2 + \{ \mathbb{E}_D[y(x; D)] - h(x) \}^2 + 2 \{ y(x; D) - \mathbb{E}_D[y(x; D)] \} \{ \mathbb{E}_D[y(x; D)] - h(x) \}

Take the expectation with respect to D; the final (cross) term vanishes:

\mathbb{E}_D[ \{ y(x; D) - h(x) \}^2 ] = \{ \mathbb{E}_D[y(x; D)] - h(x) \}^2 + \mathbb{E}_D[ \{ y(x; D) - \mathbb{E}_D[y(x; D)] \}^2 ]
                                              (bias)^2                                variance

The first term is the (bias)^2: the extent to which the average prediction differs from the desired regression function. The second term is the variance: the extent to which individual solutions vary around the average, thus measuring sensitivity to the data.
Bias-Variance decomposition

expected loss = (bias)^2 + variance + noise

where:

(bias)^2 = \int \{ \mathbb{E}_D[y(x; D)] - h(x) \}^2 p(x) \, dx
variance = \int \mathbb{E}_D[ \{ y(x; D) - \mathbb{E}_D[y(x; D)] \}^2 ] p(x) \, dx
noise = \iint \{ h(x) - t \}^2 p(x, t) \, dx \, dt

Very flexible models have low bias and high variance; relatively rigid models have high bias and low variance. The optimal model balances the two.
Bias variance example

Many datasets, each with 25 data points, are fit with 25 Gaussian basis functions while the regularization parameter \lambda is varied. Top: individual fits. Bottom: the average fit along with the generating sine function in green.
Bias variance example

average: \bar{y}(x) = \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x)

(bias)^2 = \frac{1}{N} \sum_{n=1}^{N} \{ \bar{y}(x_n) - h(x_n) \}^2

variance = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \{ y^{(l)}(x_n) - \bar{y}(x_n) \}^2

[Figure: squared bias, variance, their sum, and test error plotted against ln \lambda]

The minimum of (bias)^2 + variance is at ln \lambda \approx -0.3, which is close to the value yielding minimum test error.
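The averages above are straightforward to simulate; a minimal sketch (not the slides' experiment; L, N, the noise level, basis widths, and the two \lambda values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
L, N = 100, 25                    # L datasets of N points each
x = np.linspace(0, 1, N)
h = np.sin(2 * np.pi * x)         # true regression function h(x)

mu, s = np.linspace(0, 1, 9), 0.15
Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))  # Gaussian basis

def bias_variance(lam):
    # Fit each dataset with regularized least squares, collect y^(l)(x)
    preds = []
    for _ in range(L):
        t = h + rng.normal(0, 0.3, N)
        w = np.linalg.solve(lam * np.eye(len(mu)) + Phi.T @ Phi, Phi.T @ t)
        preds.append(Phi @ w)
    Y = np.array(preds)
    ybar = Y.mean(axis=0)                 # average prediction over datasets
    bias2 = np.mean((ybar - h) ** 2)      # (bias)^2
    variance = np.mean((Y - ybar) ** 2)   # variance
    return bias2, variance

b_small, v_small = bias_variance(lam=1e-6)   # flexible: low bias, high variance
b_large, v_large = bias_variance(lam=100.0)  # rigid: high bias, low variance
print(b_small, v_small, b_large, v_large)
```

Running this reproduces the qualitative trade-off: shrinking \lambda drives the bias down and the variance up.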
Bayesian Linear Regression

The bias-variance decomposition requires splitting the data into many training sets, which is inefficient. A Bayesian approach avoids the overfitting of maximum likelihood and leads to an automatic way of determining model complexity. We look at it quickly now and will return to it later in the semester.

Define a prior over the weights using a zero-mean Gaussian:

p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)

The log of the posterior is the sum of the log likelihood and the log of the prior:

\ln p(w \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 - \frac{\alpha}{2} w^T w + const

Maximizing this posterior corresponds to least squares with a quadratic regularization term, in the sense that \lambda = \alpha / \beta.
Sequential Bayesian Learning

Consider a single input variable x, a single target t, and a linear model of the form y(x, w) = w_0 + w_1 x. With just two weights we can plot the prior and posteriors.

Generate synthetic data using f(x_n, a) = a_0 + a_1 x_n + \epsilon. The goal is to recover a = \{-0.3, 0.5\} from the data.

Basic algorithm:
- Observe a point (x, t) from the dataset
- Calculate the likelihood p(t \mid x, w) based on an estimate of the noise precision \beta
- Multiply the likelihood by the previous prior over w to yield the new posterior
Sequential Bayesian Learning

Basic algorithm:
- Observe a point (x, t) from the dataset
- Calculate the likelihood p(t \mid x, w) based on an estimate of the noise precision \beta
- Multiply the likelihood by the previous prior over w to yield the new posterior
- Observe another point, and repeat...

Samples from the posterior are shown on the right.
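The loop above can be sketched for the two-weight model; a minimal numpy version (not from the slides; the values of \alpha, \beta, and the number of points are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0           # prior precision and (assumed known) noise precision
a0, a1 = -0.3, 0.5                # generating weights to recover

Sinv = alpha * np.eye(2)          # posterior precision, initialized to the prior
m = np.zeros(2)                   # posterior mean, initialized to the prior mean

for _ in range(20):
    xn = rng.uniform(-1, 1)
    tn = a0 + a1 * xn + rng.normal(0, 1 / np.sqrt(beta))
    phi = np.array([1.0, xn])
    # One update: multiply the current (Gaussian) posterior by the new likelihood
    Sinv_new = Sinv + beta * np.outer(phi, phi)
    m = np.linalg.solve(Sinv_new, Sinv @ m + beta * phi * tn)
    Sinv = Sinv_new

print(m)  # posterior mean approaches (-0.3, 0.5)
```

Because Gaussians are closed under multiplication, each step only updates a 2-vector mean and a 2x2 precision matrix, so the posterior after all points is identical to a batch fit.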
Predictive distribution

We are generally not interested in the posterior over w itself but rather in predicting new values t from x. Evaluate the predictive distribution:

p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta) \, p(w \mid \mathbf{t}, \alpha, \beta) \, dw

This is the convolution of the conditional distribution of the target with the posterior over w. For our problem (two Gaussians) this results in:

p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid m_N^T \phi(x), \sigma_N^2(x))

\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)
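The predictive mean and variance can be computed directly from the posterior quantities m_N and S_N; a minimal sketch (not from the slides; \alpha, \beta, the basis, and the query point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
x = rng.uniform(0, 1, 15)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), 15)

mu, s = np.linspace(0, 1, 9), 0.15
def basis(x):
    # Gaussian basis functions, one column per centre mu_j
    return np.exp(-(np.atleast_1d(x)[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

Phi = basis(x)
S_N = np.linalg.inv(alpha * np.eye(len(mu)) + beta * Phi.T @ Phi)  # posterior covariance
m_N = beta * S_N @ Phi.T @ t                                       # posterior mean

phi_star = basis(0.5)[0]
mean = m_N @ phi_star                       # predictive mean: m_N^T phi(x)
var = 1 / beta + phi_star @ S_N @ phi_star  # sigma_N^2(x)
print(mean, var)
```

Note the two parts of the variance: 1/\beta is the irreducible noise floor, while \phi(x)^T S_N \phi(x) reflects remaining uncertainty about w and shrinks as more data arrive.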
Predictive distribution

Predictive distributions for 9 Gaussian basis functions fitting f(x) = sin(2\pi x) + \epsilon (shown in green). The red curve is the mean of the predictive distribution; the red shaded regions are one standard deviation away from the mean.
Predictive distribution

Plots of the functions y(x, w) using samples from the posterior distributions over w corresponding to the previous plots.
Equivalent kernel

The posterior mean can be interpreted as a kernel; this sets the stage for kernel methods, including Gaussian processes. The predictive mean can be written as:

y(x, m_N) = m_N^T \phi(x) = \beta \phi(x)^T S_N \Phi^T \mathbf{t} = \sum_{n=1}^{N} \beta \phi(x)^T S_N \phi(x_n) t_n

We can also rewrite this as a kernel function:

y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) t_n

where the function k(x, x') = \beta \phi(x)^T S_N \phi(x') is known as the smoother matrix or equivalent kernel.

Regression functions which predict using linear combinations of the target values are known as linear smoothers.
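The identity between the posterior-mean prediction and the kernel-weighted sum of targets can be verified numerically; a minimal sketch (not from the slides; \alpha, \beta, the basis, and the query point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)

mu, s = np.linspace(0, 1, 9), 0.15
def basis(x):
    # Gaussian basis functions, one column per centre mu_j
    return np.exp(-(np.atleast_1d(x)[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

Phi = basis(x)
S_N = np.linalg.inv(alpha * np.eye(len(mu)) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Equivalent kernel: k(x, x_n) = beta * phi(x)^T S_N phi(x_n)
x_star = 0.5
k = beta * basis(x_star) @ S_N @ Phi.T  # one weight per training target
y_kernel = (k * t).sum()                # prediction as a linear smoother
y_direct = m_N @ basis(x_star)[0]       # prediction from the posterior mean
print(y_kernel, y_direct)               # the two agree
```

Plotting the row `k` against `x` would show the localization around x_star described on the next slide.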
Equivalent kernel.75.5.25 Equivalent kernel (left, middle) for Gaussian basis function (right) Above k(x,x ) is plotted as a function of x. Note that it is localized around x. Mean of predictive distribution at x given by y(x,mn) is obtained using a weighted combination where points close to x are given higher weight. Idea of using a localized kernel in place of a set of basis functions leads to Gaussian processes (to be covered later). 22