Linear Models for Regression
Machine Learning
Torsten Möller
Reading: Chapter 3 of Pattern Recognition and Machine Learning by Bishop; Chapters 3, 5, 6, and 7 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
Outline
- Regression
- Linear Basis Function Models
- Loss Functions for Regression
- Finding Optimal Weights
- Regularization
- Bayesian Linear Regression
Regression
Given a training set $\{(\mathbf{x}_1, t_1), \ldots, (\mathbf{x}_N, t_N)\}$ where $t_i$ is continuous: regression.
For now, assume $t_i \in \mathbb{R}$, $\mathbf{x}_i \in \mathbb{R}^D$.
E.g. $t_i$ is a stock price; $\mathbf{x}_i$ contains company profit, debt, cash flow, gross sales, number of spam emails sent, ...
Linear Functions
A function $f(\cdot)$ is linear if $f(\alpha x_1 + \beta x_2) = \alpha f(x_1) + \beta f(x_2)$.
Linear functions will lead to simple algorithms, so let's see what we can do with them.
Linear Regression
The simplest linear model for regression:
$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D$
Remember, we're learning $\mathbf{w}$: set $\mathbf{w}$ so that $y(\mathbf{x}, \mathbf{w})$ aligns with the target values in the training data.
This is a very simple model, limited in what it can do.
Linear Basis Function Models
The simplest linear model, $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D$, was linear in both $\mathbf{x}$ and $\mathbf{w}$. Linearity in $\mathbf{w}$ is what will be important for simple algorithms.
Extend it to include fixed non-linear functions of the data:
$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_{M-1} \phi_{M-1}(\mathbf{x})$
Linear combinations of these basis functions are still linear in the parameters.
The bias parameter allows a fixed offset in the data:
$y(\mathbf{x}, \mathbf{w}) = \underbrace{w_0}_{\text{bias}} + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_{M-1} \phi_{M-1}(\mathbf{x})$
Think of a simple 1-D $x$: $y(x, \mathbf{w}) = \underbrace{w_0}_{\text{intercept}} + \underbrace{w_1}_{\text{slope}} x$
For notational convenience, define $\phi_0(\mathbf{x}) = 1$:
$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
The function for regression, $y(\mathbf{x}, \mathbf{w})$, is a non-linear function of $\mathbf{x}$ but linear in $\mathbf{w}$:
$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
Polynomial regression is an example of this. Order-$M$ polynomial regression: $\phi_j(x) = ?$
Answer: $\phi_j(x) = x^j$, so $y(x, \mathbf{w}) = w_0 x^0 + w_1 x^1 + \ldots + w_M x^M$.
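As a concrete illustration (not from the slides), here is a minimal NumPy sketch of building the polynomial design matrix $\boldsymbol{\Phi}$, where row $n$ holds $\phi_0(x_n), \ldots, \phi_M(x_n)$; the inputs are arbitrary example values.

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Design matrix Phi with Phi[n, j] = x_n**j for j = 0..M (phi_0 = 1 is the bias column)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, M + 1, increasing=True)  # columns x^0, x^1, ..., x^M

# Example: order-3 polynomial features for a few 1-D inputs
x = np.array([0.0, 0.5, 1.0, 1.5])
Phi = polynomial_design_matrix(x, M=3)
print(Phi.shape)  # (4, 4): N rows, M+1 basis functions
```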
Basis Functions: Feature Functions
Often we extract features from $\mathbf{x}$. An intuitive way to think of $\phi_j(\mathbf{x})$ is as feature functions.
E.g. an automatic project report grading system, where $\mathbf{x}$ is the text of the report: "In this project we apply the algorithm of Möller [2] to recognizing blue objects. We test this algorithm on pictures of you and I from my holiday photo collection. ..."
$\phi_1(\mathbf{x})$ is the count of occurrences of "Möller ["; $\phi_2(\mathbf{x})$ is the count of occurrences of "of you and I".
Regression grade: $y(\mathbf{x}, \mathbf{w}) = 20\,\phi_1(\mathbf{x}) - 10\,\phi_2(\mathbf{x})$
Other Non-linear Basis Functions
(Figure: example polynomial, Gaussian, and sigmoidal basis functions.)
Polynomial: $\phi_j(x) = x^j$
Gaussian: $\phi_j(x) = \exp\left\{-\frac{(x-\mu_j)^2}{2s^2}\right\}$
Sigmoidal: $\phi_j(x) = \frac{1}{1+\exp((\mu_j - x)/s)}$
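A small sketch (my own illustration, with arbitrary centres $\mu_j$ and scale $s$) of the three basis-function families listed above, each mapping an input $x$ to $\phi_j(x)$:

```python
import numpy as np

def poly_basis(x, j):
    """Polynomial basis function x^j."""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """Gaussian bump centred at mu_j with length scale s."""
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """Logistic sigmoid centred at mu_j with slope controlled by s."""
    return 1.0 / (1.0 + np.exp((mu_j - x) / s))

x = np.linspace(-1, 1, 5)
print(poly_basis(x, 2), gaussian_basis(x, 0.0, 0.5), sigmoid_basis(x, 0.0, 0.1))
```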
Example: Gaussian Basis Functions for Temperature
Use Gaussian basis functions for regression on temperature, with centres $\mu_1 =$ Vancouver, $\mu_2 =$ San Francisco, $\mu_3 =$ Oakland.
Temperature at $x =$ Seattle?
$y(x, \mathbf{w}) = w_1 \exp\left\{-\frac{(x-\mu_1)^2}{2s^2}\right\} + w_2 \exp\left\{-\frac{(x-\mu_2)^2}{2s^2}\right\} + w_3 \exp\left\{-\frac{(x-\mu_3)^2}{2s^2}\right\}$
Compute distances to all $\mu$; since Seattle is closest to Vancouver, $y(x, \mathbf{w}) \approx w_1$.
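A hedged sketch of this idea. The city coordinates below are only rough approximations, and the weights $w$ are made-up placeholder temperatures, not real data; the point is only the mechanics of the weighted Gaussian basis sum.

```python
import numpy as np

# Rough (lat, lon) coordinates -- approximate values, for illustration only.
mus = np.array([[49.3, -123.1],   # mu_1: Vancouver
                [37.8, -122.4],   # mu_2: San Francisco
                [37.8, -122.3]])  # mu_3: Oakland
w = np.array([10.0, 15.0, 16.0])  # hypothetical weights (deg C) -- not real data
s = 2.0                           # length scale in degrees (arbitrary choice)

def predict_temperature(x):
    """Weighted sum of Gaussian basis functions centred on the three cities."""
    d2 = np.sum((mus - x) ** 2, axis=1)   # squared distances to all mu
    phi = np.exp(-d2 / (2 * s ** 2))      # Gaussian basis values
    return w @ phi

seattle = np.array([47.6, -122.3])
print(predict_temperature(seattle))  # dominated by w_1, since Seattle is closest to Vancouver
```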
Example: Gaussian Basis Functions
Could define $\phi_j(x) = \exp\left\{-\frac{(x-x_j)^2}{2s^2}\right\}$: a Gaussian around each training data point $x_j$, $N$ of them.
Could use this for modelling temperature or resource availability at spatial location $x$.
Overfitting: this interpolates the data.
This is an example of a kernel method; more on these later.
Loss Functions for Regression
We want to find the best set of coefficients $\mathbf{w}$. Recall, one way to define "best" was minimizing squared error:
$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2$
We will now look at another way, based on maximum likelihood.
Gaussian Noise Model for Regression
We are provided with a training set $\{(\mathbf{x}_i, t_i)\}$. Let's assume $t$ arises from a deterministic function plus Gaussian distributed noise (with precision $\beta$):
$t = y(\mathbf{x}, \mathbf{w}) + \epsilon$
The probability of observing a target value $t$ is then:
$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$
Notation: $\mathcal{N}(x \mid \mu, \sigma^2)$ means $x$ is drawn from a Gaussian with mean $\mu$ and variance $\sigma^2$.
Maximum Likelihood for Regression
The likelihood of the data $\mathbf{t} = \{t_i\}$ under this Gaussian noise model is:
$p(\mathbf{t} \mid \mathbf{w}, \beta) = \prod_{n=1}^N \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$
The log-likelihood is:
$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \ln \prod_{n=1}^N \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2\right) = \underbrace{\frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)}_{\text{const. wrt } \mathbf{w}} - \beta \underbrace{\frac{1}{2}\sum_{n=1}^N \left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2}_{\text{squared error}}$
So minimizing the sum of squared errors is maximum likelihood under a Gaussian noise model.
Finding Optimal Weights
How do we maximize the likelihood wrt $\mathbf{w}$ (or minimize the squared error)? Take the gradient of the log-likelihood wrt $\mathbf{w}$:
$\frac{\partial}{\partial w_i} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^N \left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\phi_i(\mathbf{x}_n)$
In vector form:
$\nabla \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^N \left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)^T$
Set the gradient to 0:
$\mathbf{0}^T = \sum_{n=1}^N t_n \boldsymbol{\phi}(\mathbf{x}_n)^T - \mathbf{w}^T\left(\sum_{n=1}^N \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\right)$
Maximum likelihood estimate for $\mathbf{w}$:
$\mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$, where
$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \ldots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \ldots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \ldots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$
$\boldsymbol{\Phi}^\dagger = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T$ is known as the pseudo-inverse (pinv in MATLAB).
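A minimal sketch of this closed-form solution on synthetic data, in NumPy rather than MATLAB (my own illustration). In practice `np.linalg.lstsq` or `pinv` is preferred over forming $(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}$ explicitly, for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4                                   # 50 points, order-3 polynomial (4 basis functions)
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)   # noisy synthetic targets

Phi = np.vander(x, M, increasing=True)         # design matrix, columns x^0 .. x^3

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the pseudo-inverse
w_ml = np.linalg.pinv(Phi) @ t
# Equivalent, and the usual choice in practice:
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(np.allclose(w_ml, w_lstsq))              # True, up to numerical precision
```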
Geometry of Least Squares
(Figure: target vector $\mathbf{t}$ and its orthogonal projection $\mathbf{y}$ onto the subspace $\mathcal{S}$ spanned by $\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2$.)
$\mathbf{t} = (t_1, \ldots, t_N)$ is the target value vector. $\mathcal{S}$ is the space spanned by $\boldsymbol{\varphi}_j = (\phi_j(\mathbf{x}_1), \ldots, \phi_j(\mathbf{x}_N))$, and the solution $\mathbf{y}$ lies in $\mathcal{S}$.
The least squares solution is the orthogonal projection of $\mathbf{t}$ onto $\mathcal{S}$.
Can verify this by looking at $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}_{ML} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger\mathbf{t} = \mathbf{P}\mathbf{t}$, where $\mathbf{P}^2 = \mathbf{P}$ and $\mathbf{P} = \mathbf{P}^T$.
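A quick numerical check of this projection interpretation (my own sketch, on arbitrary random data): form $\mathbf{P} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger$, verify it is idempotent and symmetric, and verify the residual $\mathbf{t} - \mathbf{y}$ is orthogonal to the columns of $\boldsymbol{\Phi}$.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 3))           # any full-rank design matrix
t = rng.standard_normal(20)

P = Phi @ np.linalg.pinv(Phi)                # projection onto the column space of Phi
y = P @ t                                    # least squares fit lives in that space

print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.allclose(Phi.T @ (t - y), 0))               # residual orthogonal to the basis vectors
```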
Sequential Learning
In practice $N$ might be huge, or data might arrive online. Can use a gradient descent method:
- Start with an initial guess for $\mathbf{w}$
- Update by taking a step in the gradient direction $\nabla E$ of the error function
Modify this to stochastic / sequential gradient descent: if the error function decomposes as $E = \sum_n E_n$ (e.g. least squares), update by taking a step in the gradient direction $\nabla E_n$ for one example at a time.
Details about the step size are important: decrease the step size towards the end.
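A minimal sketch of sequential least squares with stochastic gradient descent (my own illustration; the decaying step-size schedule below is just one simple choice, not something prescribed by the slides):

```python
import numpy as np

def sgd_least_squares(Phi, t, eta0=0.1, epochs=50, seed=0):
    """SGD on E = sum_n 0.5 * (t_n - w^T phi(x_n))^2, one example per update."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                            # initial guess for w
    step = 0
    for _ in range(epochs):
        for n in rng.permutation(N):           # visit examples in random order
            step += 1
            eta = eta0 / (1 + 0.01 * step)     # decreasing step size
            err = t[n] - w @ Phi[n]            # prediction error on example n
            w += eta * err * Phi[n]            # step along -grad E_n = (t_n - w^T phi_n) phi_n
    return w
```

Run on the `Phi`, `t` from the earlier sketch, the result should approach $\mathbf{w}_{ML}$ as the number of passes grows.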
Regularization
We discussed regularization as a technique to avoid over-fitting:
$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \underbrace{\frac{\lambda}{2}\|\mathbf{w}\|^2}_{\text{regularizer}}$
Next on the menu: other regularizers; Bayesian learning and the quadratic regularizer.
Other Regularizers
(Figure: contours of the regularizer $\sum_j |w_j|^q$ for $q = 0.5$, $q = 1$, $q = 2$, $q = 4$.)
Can use different norms for the regularizer:
$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\sum_{j=1}^M |w_j|^q$
e.g. $q = 2$: ridge regression; $q = 1$: lasso. The math is easiest with ridge regression.
Optimization with a Quadratic Regularizer
With $q = 2$, the total error is still a nice quadratic:
$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
Calculus gives the regularized solution:
$\mathbf{w} = \underbrace{\left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}}_{\text{regularized}}\boldsymbol{\Phi}^T\mathbf{t}$
Similar to unregularized least squares. Advantage: $(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})$ is well conditioned, so the inversion is stable.
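A short sketch of this closed form (my own illustration, reusing a design matrix `Phi` and targets `t` such as those built earlier), solving a linear system rather than inverting explicitly:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """w = (lambda*I + Phi^T Phi)^{-1} Phi^T t, computed via a linear solve."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi     # well conditioned for lam > 0
    return np.linalg.solve(A, Phi.T @ t)
```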
Ridge Regression vs. Lasso
(Figure: ridge and lasso solutions $\mathbf{w}^\star$ in the $(w_1, w_2)$ plane.)
Ridge regression is also known as parameter shrinkage: the weights $\mathbf{w}$ shrink back towards the origin.
The lasso leads to sparse models: components of $\mathbf{w}$ tend to 0 with large $\lambda$ (strong regularization). Intuitively, once the minimum is achieved at a large enough radius, the minimum sits on a corner such as $w_1 = 0$.
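The lasso has no closed form; one common approach is proximal gradient descent (ISTA), sketched below as my own illustration rather than anything from the slides. With a strong $\lambda$, many entries of the returned $\mathbf{w}$ are exactly zero, whereas `ridge_fit` above only shrinks them.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Elementwise soft-thresholding operator (proximal map of the L1 norm)."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_ista(Phi, t, lam, n_iter=500):
    """Proximal gradient (ISTA) for 0.5*||Phi w - t||^2 + lam*||w||_1."""
    L = np.linalg.norm(Phi, 2) ** 2       # Lipschitz constant of the squared-error gradient
    eta = 1.0 / L                         # step size
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)      # gradient of the squared-error term
        w = soft_threshold(w - eta * grad, eta * lam)
    return w
```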
Bayesian Linear Regression
Start with a prior over the parameters $\mathbf{w}$. The conjugate prior is a Gaussian:
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$
This simple form will make the math easier; it can be done for an arbitrary Gaussian too.
Data likelihood, Gaussian model as before:
$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$
Posterior distribution over $\mathbf{w}$:
$p(\mathbf{w} \mid \mathbf{t}) \propto \left(\prod_{n=1}^N p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta)\right) p(\mathbf{w}) = \left[\prod_{n=1}^N \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2\right)\right] \left(\frac{\alpha}{2\pi}\right)^{M/2} \exp\left(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}\right)$
Take the log:
$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^N \left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const}$
So $L_2$ regularization is maximum a posteriori (MAP) estimation with a Gaussian prior, with $\lambda = \alpha/\beta$.
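For this Gaussian prior and likelihood the posterior is itself Gaussian; its mean and covariance have the standard closed form $\mathbf{S}_N = (\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}$, $\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$ (this comes from the Bishop Ch. 3 reading, not from the slide text above). A small sketch, treating $\alpha$ and $\beta$ as given:

```python
import numpy as np

def posterior_params(Phi, t, alpha, beta):
    """Gaussian posterior p(w | t) = N(w | m_N, S_N) for the zero-mean isotropic prior."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t      # posterior mean = MAP estimate
    return m_N, S_N
```

Note that $\mathbf{m}_N$ coincides with the ridge solution with $\lambda = \alpha/\beta$, matching the MAP statement above.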
Bayesian Linear Regression: Intuition
A simple example: $x, t \in \mathbb{R}$, $y(x, \mathbf{w}) = w_0 + w_1 x$.
Start with a Gaussian prior in parameter space; samples from it are shown in data space. Receive data points (blue circles in data space) and compute the likelihood. The posterior is the prior (or previous posterior) times the likelihood.
Predictive Distribution
A single estimate of $\mathbf{w}$ (ML or MAP) doesn't tell the whole story. We have a distribution over $\mathbf{w}$, and we can use it to make predictions. Given a new value of $\mathbf{x}$, we can compute a distribution over $t$:
$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t, \mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,d\mathbf{w} = \underbrace{\int}_{\text{sum}} \underbrace{p(t \mid \mathbf{w}, \beta)}_{\text{predict}}\,\underbrace{p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)}_{\text{probability}}\,d\mathbf{w}$
i.e. for each value of $\mathbf{w}$, let it make a prediction, multiply by its probability, and sum over all $\mathbf{w}$.
For arbitrary models as the distributions, this integral may not be computationally tractable.
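For the linear-Gaussian case the integral is tractable: the predictive distribution is Gaussian with mean $\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x})$ and variance $1/\beta + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$ (again from the Bishop Ch. 3 reading). A sketch using the posterior parameters computed earlier:

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance of t at a new input with feature vector phi_x."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # noise variance + parameter uncertainty
    return mean, var
```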
Predictive Distribution
(Figure: the predictive distribution as the number of observed data points grows.)
With the Gaussians we've used for these distributions, the predictive distribution will also be Gaussian (the math involves convolutions of Gaussians). In the figure, the green line is the true (unobserved) curve, the blue circles are data points, the red line is the predictive mean, and the pink region is one standard deviation. Uncertainty is small around the data points, and the pink region shrinks with more data.
Bayesian Model Selection
So what do the Bayesians say about model selection? Model selection is choosing a model $\mathcal{M}_i$, e.g. the degree of the polynomial or the type of basis function $\phi$.
Don't select, just integrate:
$p(t \mid \mathbf{x}, \mathcal{D}) = \sum_{i=1}^L \underbrace{p(t \mid \mathbf{x}, \mathcal{M}_i, \mathcal{D})}_{\text{predictive dist.}}\, p(\mathcal{M}_i \mid \mathcal{D})$
i.e. average together the results of all models. Alternatively, choose the most likely model a posteriori, $\arg\max_i p(\mathcal{M}_i \mid \mathcal{D})$: more efficient, but an approximation.
How do we compute the posterior over models?
$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)\,p(\mathcal{M}_i)$
Another likelihood + prior combination. The likelihood (model evidence) is:
$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\,p(\mathbf{w} \mid \mathcal{M}_i)\,d\mathbf{w}$
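For Bayesian linear regression this evidence has a closed form (derived in the Bishop Ch. 3 reading; the slides don't show it). A hedged sketch that could be used to score, say, different polynomial degrees, assuming fixed $\alpha$ and $\beta$:

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) for Bayesian linear regression."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)                 # posterior mean
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Hypothetical usage: pick the polynomial degree d with the highest evidence, e.g.
# best_d = max(range(1, 10),
#              key=lambda d: log_evidence(np.vander(x, d + 1, increasing=True), t, 1e-2, 25.0))
```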
Conclusion
Readings: Ch. 3.1, 3.1.1-3.1.4, 3.3.1-3.3.2, 3.4
Linear models for regression: a linear combination of (non-linear) basis functions.
Fitting the parameters of the regression model: least squares; maximum likelihood (which can coincide with least squares).
Controlling over-fitting: regularization; Bayesian learning with a prior (which can coincide with regularization).
Model selection: cross-validation (using held-out data); Bayesian (using the model evidence / likelihood).