IAML: Linear Regression


Nigel Goddard
School of Informatics

Overview

- The linear model
- Fitting the linear model to data
- Probabilistic interpretation of the error function
- Examples of regression problems
- Dealing with multiple outputs
- Generalized linear regression
- Radial basis function (RBF) models

The Regression Problem

Classification and regression problems:

- Classification: the target of prediction is discrete.
- Regression: the target of prediction is continuous.

Training data: a set D of pairs (x_i, y_i) for i = 1, ..., n, where x_i ∈ R^D and y_i ∈ R.

Today: linear regression, i.e., the relationship between x and y is linear. Although this is simple (and limited), it is:

- More powerful than you would expect
- The basis for more complex nonlinear methods
- A model that teaches a lot about regression and classification

Examples of regression problems

- Robot inverse dynamics: predicting what torques are needed to drive a robot arm along a given trajectory.
- Electricity load forecasting: generate hourly forecasts two days in advance (see W & F, .3).
- Predicting staffing requirements at help desks based on historical data and product and sales information.
- Predicting the time to failure of equipment based on utilization and environmental conditions.

The Linear Model

The linear model is

    f(x; w) = w_0 + w_1 x_1 + ... + w_D x_D = φ(x) w

where φ(x) = (1, x_1, ..., x_D) = (1, x^T) and

    w = (w_0, w_1, ..., w_D)^T.

The maths of fitting linear models to data is easy. We use the notation φ(x) to make generalisation easy later.

Toy example: data

With two features, instead of a line we seek a plane; with more features, a hyperplane.

[Figure 3.1 from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2nd Ed.), Chapter 3: linear least squares fitting with X ∈ R². We seek the linear function of X that minimizes the sum of squared residuals from Y.]
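To make the notation concrete, here is a minimal sketch in Python/NumPy of evaluating f(x; w) = φ(x) w; the weights and input are hypothetical, chosen only for illustration.

    import numpy as np

    def phi(x):
        """Feature map: prepend the constant 1, giving (1, x_1, ..., x_D)."""
        return np.concatenate(([1.0], x))

    def f(x, w):
        """Linear model prediction f(x; w) = phi(x) . w."""
        return phi(x) @ w

    w = np.array([0.5, 2.0, -1.0])   # (w_0, w_1, w_2), hypothetical
    x = np.array([3.0, 4.0])         # one input with D = 2 features
    print(f(x, w))                   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5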

With more features: the CPU Performance Data Set

Predict PRP, the published relative performance, from:

- MYCT: machine cycle time in nanoseconds (integer)
- MMIN: minimum main memory in kilobytes (integer)
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer)
- CHMIN: minimum channels in units (integer)
- CHMAX: maximum channels in units (integer)

The fitted linear model:

    PRP = MYCT + .5 MMIN + .6 MMAX + .63 CACH - .27 CHMIN + .46 CHMAX

In matrix notation

The design matrix is n × (D + 1):

    Φ = [ 1  x_11  x_12  ...  x_1D
          1  x_21  x_22  ...  x_2D
          ...
          1  x_n1  x_n2  ...  x_nD ]

where x_ij is the jth component of the training input x_i. Let y = (y_1, ..., y_n)^T. Then ŷ = Φw is ... ?

Linear Algebra: The 1-Slide Version

What is matrix multiplication? Let

    A = [ a_11  a_12  a_13        b = [ b_1
          a_21  a_22  a_23              b_2
          a_31  a_32  a_33 ],           b_3 ].

First consider matrix times vector, i.e., Ab. Two answers:

1. Ab is a linear combination of the columns of A:

    Ab = b_1 (a_11, a_21, a_31)^T + b_2 (a_12, a_22, a_32)^T + b_3 (a_13, a_23, a_33)^T

2. Ab is a vector. Each element of the vector is the dot product between b and one row of A:

    Ab = ( (a_11, a_12, a_13) · b,  (a_21, a_22, a_23) · b,  (a_31, a_32, a_33) · b )^T
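Both views can be checked numerically; a small sketch with arbitrary numbers, for illustration only:

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
    b = np.array([1., 0., 2.])

    # View 1: Ab as a linear combination of the columns of A.
    by_columns = sum(b[j] * A[:, j] for j in range(3))

    # View 2: each element of Ab is the dot product of one row of A with b.
    by_rows = np.array([A[i, :] @ b for i in range(3)])

    print(A @ b, by_columns, by_rows)   # all three agree: [ 7. 16. 25.]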

Linear model (part 2)

In matrix notation: the design matrix Φ is n × (D + 1), with row i equal to (1, x_i1, x_i2, ..., x_iD), where x_ij is the jth component of the training input x_i. Let y = (y_1, ..., y_n)^T. Then ŷ = Φw is the model's predicted values on the training inputs.

Solving for Model Parameters

This looks like what we've seen in linear algebra:

    y = Φw

We know y and Φ but not w. So why not take w = Φ^{-1} y? (You can't, but why?)

Three reasons:

- Φ is not square. It is n × (D + 1), so Φ^{-1} is undefined.
- The system is overconstrained (n equations for D + 1 parameters).
- In other words, the data has noise, so no exact solution exists.

Loss function

We want a loss function O(w) that we minimize with respect to w. At the minimum, ŷ should look like y. (Recall that ŷ = Φw depends on w.)
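A minimal sketch of the non-squareness problem on synthetic data (shapes assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))        # n = 100 training inputs, D = 3 features

    # Design matrix: a column of ones for w_0, then the raw inputs.
    Phi = np.hstack([np.ones((100, 1)), X])
    print(Phi.shape)                     # (100, 4): n x (D + 1), not square,
                                         # so Phi has no ordinary inverse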

Fitting a linear model to data

[Figure 3.1 from Hastie, Tibshirani & Friedman again: linear least squares fitting with X ∈ R²; the black sticks from each point to the plane are the residuals.]

The Solution

A common choice: squared error (it makes the maths easy):

    O(w) = Σ_{i=1}^n (y_i - w^T x_i)²

In the picture, this is the sum of the squared lengths of the black sticks. (Each one is called a residual, i.e., each y_i - w^T x_i.)

Given a dataset D of pairs (x_i, y_i) for i = 1, ..., n, squared error makes the maths easy:

    E(w) = Σ_{i=1}^n (y_i - f(x_i; w))² = (y - Φw)^T (y - Φw)

We want to minimize this with respect to w. The error surface is a parabolic bowl. [Figure: Tom Mitchell]

How do we do this?

Answer: to minimize O(w) = Σ_{i=1}^n (y_i - w^T x_i)², set the partial derivatives to 0. This has an analytical solution:

    ŵ = (Φ^T Φ)^{-1} Φ^T y

(Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of Φ.

- First check: Does this make sense? Do the matrix dimensions line up?
- Then: Why is this called a pseudo-inverse?
- Finally: What happens if there are no features?

Probabilistic interpretation of O(w)

Assume that y = w^T x + ε, where ε ~ N(0, σ²). (This is an exact linear relationship plus Gaussian noise.) This implies that y_i ~ N(w^T x_i, σ²), i.e.

    log p(y_i | x_i) = -log √(2π) - log σ - (y_i - w^T x_i)² / (2σ²)

So minimising O(w) is equivalent to maximising the likelihood! We can view w^T x as E[y | x]. The squared residuals also allow estimation of σ²:

    σ̂² = (1/n) Σ_{i=1}^n (y_i - ŵ^T x_i)²
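A minimal sketch of the analytic solution on synthetic data (all numbers are assumptions for illustration). np.linalg.lstsq returns the same least-squares answer as the pseudo-inverse formula but avoids forming (Φ^T Φ)^{-1} explicitly, which is numerically safer:

    import numpy as np

    rng = np.random.default_rng(0)
    n, D = 100, 3
    X = rng.normal(size=(n, D))
    w_true = np.array([1.0, 2.0, -1.0, 0.5])           # (w_0, ..., w_3), hypothetical
    Phi = np.hstack([np.ones((n, 1)), X])
    y = Phi @ w_true + rng.normal(scale=0.3, size=n)   # exact linear + Gaussian noise

    # Least-squares fit; equivalent to w_hat = np.linalg.pinv(Phi) @ y.
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    residuals = y - Phi @ w_hat
    sigma2_hat = np.mean(residuals ** 2)               # estimate of the noise variance
    print(w_hat, sigma2_hat)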

Sensitivity to Outliers

Linear regression is sensitive to outliers. Example: suppose y = .5x + ε, where ε ~ N(0, .25), and then add a single point (2.5, 3): the one added point noticeably pulls the fitted line.

Fitting this into the general structure for learning algorithms:

- Define the task: regression.
- Decide on the model structure: linear regression model.
- Decide on the score function: squared error (likelihood).
- Decide on the optimization/search method to optimize the score function: calculus (analytic solution).

Diagnostics

Graphical diagnostics can be useful for checking:

- Is the relationship obviously nonlinear? Look for structure in the residuals.
- Are there obvious outliers?

The goal isn't to find all problems. You can't. The goal is to find obvious, embarrassing problems. Example: plot residuals by fitted values. Stats packages will do this for you.

Dealing with multiple outputs

Suppose there are q different targets for each input. We introduce a different weight vector w_i for each target dimension and do the regression separately for each one. This is called multiple regression.
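Because the squared error decomposes over target dimensions, the q separate regressions share one design matrix and can be solved in a single call. A sketch with synthetic shapes (all values assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    n, D, q = 100, 3, 2
    Phi = np.hstack([np.ones((n, 1)), rng.normal(size=(n, D))])
    Y = rng.normal(size=(n, q))              # one column of targets per output

    # One lstsq call fits all q regressions: column j of W is the
    # weight vector for target dimension j.
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    print(W.shape)                           # (D + 1, q)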

Basis expansion

We can easily transform the original attributes non-linearly into φ(x) and do linear regression on the transformed values, e.g., with polynomial, Gaussian, or sigmoid basis functions. [Figure credit: Chris Bishop, PRML]

The design matrix is now n × m:

    Φ = [ φ_1(x_1)  φ_2(x_1)  ...  φ_m(x_1)
          φ_1(x_2)  φ_2(x_2)  ...  φ_m(x_2)
          ...
          φ_1(x_n)  φ_2(x_n)  ...  φ_m(x_n) ]

Let y = (y_1, ..., y_n)^T and minimize E(w) = ||y - Φw||². As before we have an analytical solution:

    ŵ = (Φ^T Φ)^{-1} Φ^T y

where (Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of Φ.

Example: polynomial regression

    φ(x) = (1, x, x², ..., x^M)^T

[Figure: polynomial fits of t against x for several orders M, including M = 3 and M = 9. Figure credit: Chris Bishop, PRML]
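A sketch of polynomial basis expansion on synthetic 1-D data (M and the data are assumed for illustration). Note that the model is nonlinear in x but still linear in w, so the same least-squares solution applies:

    import numpy as np

    def poly_design_matrix(x, M):
        """phi(x) = (1, x, x^2, ..., x^M) for each scalar input; Phi is n x (M + 1)."""
        return np.vander(x, M + 1, increasing=True)

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, size=30)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)

    Phi = poly_design_matrix(x, M=3)
    w_hat, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # same pseudo-inverse solution
    print(w_hat)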

More about the features

Transforming the features can be important. Example: suppose I want to predict CPU performance, and maybe one of the features is the manufacturer:

    x = 1 if Intel, 2 if AMD, 3 if Apple, 4 if Motorola

Let's use this as a feature. Will this work?

Not the way you want. Do you really believe AMD is double Intel? Instead, convert this into one binary feature per manufacturer:

    x_1 = 1 if Intel, 0 otherwise
    x_2 = 1 if AMD, 0 otherwise
    ...

Note this is a consequence of linearity. We saw something similar with text in the first week.

Other good transformations: log, square, square root.
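A sketch of this one-of-K conversion (the category list is from the slide; the helper function is hypothetical):

    import numpy as np

    makers = ["Intel", "AMD", "Apple", "Motorola"]

    def one_hot(maker):
        """Replace the categorical code with one binary indicator per manufacturer."""
        v = np.zeros(len(makers))
        v[makers.index(maker)] = 1.0
        return v

    print(one_hot("AMD"))   # [0. 1. 0. 0.] -- no spurious ordering between makers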

Radial basis function (RBF) models

Set φ_i(x) = exp(-½ ||x - c_i||² / α²). We need to position these basis functions at some prior chosen centres c_i, with a given width α. There are many ways to do this, but choosing a subset of the datapoints as centres is one method that is quite effective. Finding the weights is the same as ever: the pseudo-inverse solution.

RBF example

[Figures: the original data; one RBF feature with centre c_1 = 3; another RBF feature with centre c_2 = 6.]

Notice how the feature functions specialize in input space. Run the RBF model with both basis functions above and plot the residuals:

[Figure: the fitted values φ(x)^T w against the original data, and the residuals of the RBF model.]
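A sketch putting the two RBF features together (the centres c_1 = 3 and c_2 = 6 follow the plots; the width α, the bias term, and the data are assumptions for illustration):

    import numpy as np

    def rbf(x, c, alpha):
        """phi_i(x) = exp(-(1/2) (x - c_i)^2 / alpha^2) for scalar inputs."""
        return np.exp(-0.5 * (x - c) ** 2 / alpha ** 2)

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 9, size=50)
    y = np.sin(x) + rng.normal(scale=0.1, size=50)      # synthetic target

    # Design matrix: a bias term plus the two RBF features.
    Phi = np.column_stack([np.ones_like(x), rbf(x, 3.0, 1.0), rbf(x, 6.0, 1.0)])
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # pseudo-inverse, as ever
    residuals = y - Phi @ w_hat
    print(w_hat, residuals.std())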

RBF: Ay, there's the rub

So why not use RBFs for everything? Short answer: you might need too many basis functions. This is especially true in high dimensions (we'll say more later). Too many means you probably overfit. Extreme example: centre one basis function on each training point.

Also: notice that we haven't yet seen in the course how to learn the RBF parameters, i.e., the mean and standard deviation of each kernel. The main point of presenting RBFs now is to set up later methods like support vector machines.

Summary

- Linear regression is often useful out of the box.
- It is more useful than it might seem, because linear means linear in the parameters. You can do a nonlinear transform of the data first, e.g., polynomial or RBF. This point will come up again.
- The maximum likelihood solution is computationally efficient (the pseudo-inverse).
- There is a danger of overfitting, especially with many features or basis functions.
