Regression
Machine Learning and Pattern Recognition
Chris Williams, School of Informatics, University of Edinburgh
September 24
(All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)
Classification or Regression?
Classification: we want to learn a discrete target variable.
Regression: we want to learn a continuous target variable.

Overview:
- Linear regression, linear-in-the-parameters models
- Linear regression is a conditional Gaussian model
- Maximum likelihood solution: ordinary least squares
- Can use nonlinear basis functions
- Ridge regression
- Full Bayesian treatment

Reading: Murphy chapter 7 (not all sections needed), Barber (17.1, 17.2, 18.1.1)
One Dimensional Data
[Figure: scatter plot of one-dimensional data]
Linear Regression
Simple example: one-dimensional linear regression. Suppose we have data of the form $(x, y)$, and we believe the data should follow a straight line: the data should have a straight-line fit of the form $y = w_0 + w_1 x$. However, we also believe the target values $y$ are subject to measurement error, which we will assume to be Gaussian. So
$$y = w_0 + w_1 x + \eta$$
where $\eta$ is a Gaussian noise term with mean $0$ and variance $\sigma^2_\eta$.
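To make the model concrete, here is a minimal NumPy sketch of drawing data from it. The particular values of $w_0$, $w_1$ and $\sigma_\eta$ are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters (not from the slides)
w0, w1 = 0.5, 1.2       # intercept and slope
sigma_eta = 0.3         # noise standard deviation

N = 30
x = rng.uniform(-2.0, 3.0, size=N)         # inputs
eta = rng.normal(0.0, sigma_eta, size=N)   # Gaussian noise term
y = w0 + w1 * x + eta                      # noisy targets
```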
Linear regression is just a conditional version of estimating a Gaussian (conditional on the input $x$).
Figure credit: http://jedismedicine.blogspot.co.uk/24//
Generated Data
[Figure: data generated from the linear model with Gaussian noise]
Multivariate Case
Consider the case where we are interested in $y = f(x)$ for $D$-dimensional $x$:
$$y = w_0 + w_1 x_1 + \dots + w_D x_D + \eta, \qquad \eta \sim N(0, \sigma^2_\eta).$$
Examples? Final grade depends on time spent on work for each tutorial.
We set $w = (w_0, w_1, \dots, w_D)^T$ and introduce $\phi = (1, x^T)^T$; then we can write $y = w^T \phi + \eta$ instead. This implies
$$p(y \mid \phi, w) = N(y;\, w^T \phi,\, \sigma^2_\eta)$$
Assume that the training data are i.i.d., i.e. $p(y^1, \dots, y^N \mid x^1, \dots, x^N, w) = \prod_{n=1}^N p(y^n \mid x^n, w)$.
Given data $\{(x^n, y^n),\ n = 1, 2, \dots, N\}$, the log likelihood is
$$L(w) = \log p(y^1 \dots y^N \mid x^1 \dots x^N, w) = -\frac{1}{2\sigma^2_\eta} \sum_{n=1}^N (y^n - w^T \phi^n)^2 - \frac{N}{2} \log(2\pi\sigma^2_\eta)$$
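As a check on the formula, here is a small sketch that evaluates $L(w)$ directly; the toy data are assumed for illustration:

```python
import numpy as np

def log_likelihood(w, Phi, y, sigma2_eta):
    """Gaussian log likelihood L(w) for linear regression.

    Phi: (N, D+1) matrix whose rows are the feature vectors phi^n.
    """
    resid = y - Phi @ w
    N = len(y)
    return -0.5 / sigma2_eta * np.sum(resid**2) \
           - 0.5 * N * np.log(2 * np.pi * sigma2_eta)

# Tiny check on assumed toy data
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.1, 1.2, 1.9])
print(log_likelihood(np.array([0.0, 1.0]), Phi, y, sigma2_eta=0.09))
```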
Minimizing Squared Error
$$L(w) = -\frac{1}{2\sigma^2_\eta} \sum_{n=1}^N (y^n - w^T\phi^n)^2 - \frac{N}{2}\log(2\pi\sigma^2_\eta) = -C_1 \sum_{n=1}^N (y^n - w^T\phi^n)^2 - C_2$$
where $C_1 > 0$ and $C_2$ don't depend on $w$. Now:
- Multiplying by a positive constant doesn't change the maximum.
- Adding a constant doesn't change the maximum.
- $\sum_{n=1}^N (y^n - w^T\phi^n)^2$ is the sum of squared errors made if you use $w$.
So maximizing the likelihood is the same as minimizing the total squared error of the linear predictor. So you don't have to believe the Gaussian assumption: you can simply believe that you want to minimize the squared error.
Maximum Likelihood Solution I
Write $\Phi = (\phi^1, \phi^2, \dots, \phi^N)^T$ and $y = (y^1, y^2, \dots, y^N)^T$. $\Phi$ is called the design matrix; it has $N$ rows, one for each example.
$$L(w) = -\frac{1}{2\sigma^2_\eta} (y - \Phi w)^T(y - \Phi w) - C_2$$
Take derivatives of the log likelihood:
$$\nabla_w L(w) = -\frac{1}{\sigma^2_\eta} \Phi^T (\Phi w - y)$$
Maximum Likelihood Solution II
Setting the derivatives to zero to find the maximum gives
$$\Phi^T \Phi \hat{w} = \Phi^T y$$
This means the maximum likelihood $\hat{w}$ is given by
$$\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$$
The matrix $(\Phi^T \Phi)^{-1} \Phi^T$ is called the pseudo-inverse. This is the ordinary least squares (OLS) solution for $w$.
The MLE for the variance is
$$\hat{\sigma}^2_\eta = \frac{1}{N} \sum_{n=1}^N (y^n - \hat{w}^T \phi^n)^2$$
i.e. the average of the squared residuals.
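A minimal sketch of the OLS solution on assumed toy data. It uses `np.linalg.lstsq`, which solves the normal equations in a numerically stabler way than forming $(\Phi^T\Phi)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-2.0, 3.0, size=N)
y = 0.5 + 1.2 * x + rng.normal(0.0, 0.3, size=N)   # assumed toy data

Phi = np.column_stack([np.ones(N), x])   # design matrix: rows are phi^n = (1, x^n)

# OLS / maximum likelihood weights: solves Phi^T Phi w = Phi^T y
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# MLE of the noise variance: the average squared residual
resid = y - Phi @ w_hat
sigma2_hat = np.mean(resid**2)
```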
Generated Data
[Figure: generated data with fitted line]
The black line is the maximum likelihood fit to the data.
Nonlinear regression
All this just used $\phi$. We chose to put the $x$ values in $\phi$, but we could have put anything in there, including nonlinear transformations of the $x$ values. In fact we can choose any useful form for $\phi$, so long as the model remains linear in the parameters $w$. We can even change the size of $\phi$.
We already have the maximum likelihood solution in the case of Gaussian noise: the pseudo-inverse solution.
Models of this form are called general linear models or linear-in-the-parameters models.
Example: polynomial fitting
Model: $y = w_1 + w_2 x + w_3 x^2 + w_4 x^3$. Set $\phi = (1, x, x^2, x^3)^T$ and $w = (w_1, w_2, w_3, w_4)^T$.
We can immediately write down the ML solution: $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$, where $\Phi$ and $y$ are defined as before.
Could use any features we want: e.g. features that are only active in certain local regions (radial basis functions, RBFs).
Figure credit: David Barber, BRML Fig 7.6
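A sketch of fitting the cubic with exactly the same pseudo-inverse machinery, just with a different design matrix. The toy data, the RBF centres, and the bandwidth are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, size=40)
y = np.sin(2 * x) + rng.normal(0.0, 0.1, size=40)   # assumed toy data

# Cubic polynomial features: phi = (1, x, x^2, x^3)
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The same recipe works for radial basis functions, e.g. Gaussian bumps
centres = np.linspace(-2.0, 2.0, 10)   # assumed RBF centres
width = 0.5                            # assumed bandwidth
Phi_rbf = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)
w_rbf, *_ = np.linalg.lstsq(Phi_rbf, y, rcond=None)
```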
Dimensionality issues
How many radial basis functions do we need? Suppose we need only three per dimension. Then we would need $3^D$ for a $D$-dimensional problem. This becomes large very fast: this is commonly called the curse of dimensionality. Gaussian processes (see later) can help with these issues.
Higher dimensional outputs
Suppose the target values are vectors. Then we introduce a different weight vector $w_i$ for each output $y_i$, and we can do regression independently for each output.
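Because every output shares the same design matrix, the independent regressions can be solved in a single least-squares call; a sketch on assumed two-output toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])

# Two target dimensions stacked as the columns of Y (assumed toy data)
Y = np.column_stack([1.0 + 2.0 * x, -0.5 + 0.3 * x]) \
    + rng.normal(0.0, 0.1, size=(N, 2))

# One lstsq call fits an independent weight vector per output column:
# column i of W_hat is the weight vector w_i for output y_i
W_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
```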
Adding a Prior
Put a prior over the parameters, e.g.
$$p(y \mid \phi, w) = N(y;\, w^T\phi,\, \sigma^2_\eta) \qquad p(w) = N(w;\, 0,\, \tau^2 I)$$
where $I$ is the identity matrix. The log posterior is
$$\log p(w \mid D) = \text{const} - \frac{1}{2\sigma^2_\eta}\sum_{n=1}^N (y^n - w^T\phi^n)^2 - \frac{N}{2}\log(2\pi\sigma^2_\eta) - \underbrace{\frac{1}{2\tau^2} w^T w}_{\text{penalty on large weights}} - \frac{D}{2}\log(2\pi\tau^2)$$
The MAP solution can be computed analytically; the derivation is almost the same as with the MLE:
$$w_{\text{MAP}} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y \qquad \text{where } \lambda = \sigma^2_\eta/\tau^2$$
This is called ridge regression.
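A sketch of the ridge solution, with assumed values for $\sigma^2_\eta$ and $\tau^2$; `np.linalg.solve` is used rather than an explicit matrix inverse:

```python
import numpy as np

def ridge_fit(Phi, y, sigma2_eta, tau2):
    """MAP / ridge weights: (Phi^T Phi + lambda I)^{-1} Phi^T y."""
    lam = sigma2_eta / tau2
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Example usage on assumed toy data and assumed hyperparameters
rng = np.random.default_rng(4)
x = rng.uniform(-2.0, 2.0, size=30)
y = 0.5 + 1.2 * x + rng.normal(0.0, 0.3, size=30)
Phi = np.column_stack([np.ones_like(x), x])
w_map = ridge_fit(Phi, y, sigma2_eta=0.09, tau2=1.0)
```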
Effect of Ridge Regression
Collecting constant terms from the log posterior on the last slide:
$$\log p(w \mid D) = \text{const} - \frac{1}{2\sigma^2_\eta}\sum_{n=1}^N (y^n - w^T\phi^n)^2 - \underbrace{\frac{1}{2\tau^2} w^T w}_{\text{penalty term, } \propto \|w\|_2^2}$$
This is called $\ell_2$ regularization or weight decay. The second term is the squared Euclidean (also called $\ell_2$) norm of $w$.
The idea is to reduce overfitting by forcing the function to be simple. The simplest possible function is the constant $w = 0$, so we encourage $\hat{w}$ to be closer to that.
$\tau$ is a parameter of the method. It trades off between how well you fit the training data and how simple the method is. It is most commonly set via cross-validation.
Regularization is a general term for adding a second term to an objective function to encourage simple models.
Effect of Ridge Regression (Graphic)
[Figure: two panels, polynomial fit without regularization (ln lambda = -20.135, left) and with regularization (ln lambda = -8.571, right)]
Figure credit: Murphy Fig 7.7
Degree 14 polynomial fit with and without regularization.
Why Ridge Regression Works (Graphic)
[Figure: the ML estimate, the MAP estimate, and the prior mean shown in the $(u_1, u_2)$ plane]
Figure credit: Murphy Fig 7.9
Bayesian Regression
Bayesian regression model:
$$p(y \mid \phi, w) = N(y;\, w^T\phi,\, \sigma^2_\eta) \qquad p(w) = N(w;\, 0,\, \tau^2 I)$$
It is possible to compute the posterior distribution analytically, because linear Gaussian models are jointly Gaussian (see Murphy 7.6.1 for details):
$$p(w \mid \Phi, y, \sigma^2_\eta) \propto p(w)\, p(y \mid \Phi, w, \sigma^2_\eta) = N(w \mid w_N, V_N)$$
$$w_N = \sigma^{-2}_\eta V_N \Phi^T y \qquad V_N = \sigma^2_\eta \left(\frac{\sigma^2_\eta}{\tau^2} I + \Phi^T\Phi\right)^{-1}$$
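A minimal sketch of computing $w_N$ and $V_N$ from the formulas above (hyperparameter values would be assumed, as in the earlier sketches):

```python
import numpy as np

def posterior(Phi, y, sigma2_eta, tau2):
    """Gaussian posterior N(w; w_N, V_N) for Bayesian linear regression."""
    D = Phi.shape[1]
    V_N = sigma2_eta * np.linalg.inv((sigma2_eta / tau2) * np.eye(D) + Phi.T @ Phi)
    w_N = V_N @ Phi.T @ y / sigma2_eta
    return w_N, V_N
```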
Making predictions
For a new test point $x^*$ with corresponding feature vector $\phi^*$, we have that
$$y^* = w^T \phi^* + \eta \qquad \text{where } w \sim N(w_N, V_N).$$
Hence
$$p(y^* \mid x^*, D) = N\!\left(w_N^T \phi^*,\ (\phi^*)^T V_N \phi^* + \sigma^2_\eta\right)$$
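Continuing the sketch, the predictive mean and variance follow directly; the `posterior` helper from the previous sketch is assumed to supply `w_N` and `V_N`:

```python
import numpy as np

def predict(phi_star, w_N, V_N, sigma2_eta):
    """Predictive mean and variance of y* at feature vector phi_star."""
    mean = w_N @ phi_star
    var = phi_star @ V_N @ phi_star + sigma2_eta   # posterior variance + noise
    return mean, var

# Example usage (assumed test input x_star, with phi_star = (1, x_star)):
# w_N, V_N = posterior(Phi, y, sigma2_eta=0.09, tau2=1.0)
# mean, var = predict(np.array([1.0, x_star]), w_N, V_N, sigma2_eta=0.09)
```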
Example of Bayesian Regression
[Figure: sequential Bayesian learning for linear regression; columns show the likelihood, the prior/posterior over $(w_0, w_1)$, and samples in data space as data points arrive]
Figure credit: Murphy Fig 7.11
Another Example
[Figure: four panels comparing the plugin approximation (MLE) with the posterior predictive (known variance): predictions with error bars, and functions sampled from each]
Figure credit: Murphy Fig 7.12
Fitting a quadratic. Notice how the error bars get larger further away from the training data.
Summary
- Linear regression is a conditional Gaussian model
- Maximum likelihood solution: ordinary least squares
- Can use nonlinear basis functions
- Ridge regression
- Full Bayesian treatment