Machine Learning - MT 2016 2. Linear Regression Varun Kanade University of Oxford October 12, 2016
Announcements All students eligible to take the course for credit can sign-up for classes and practicals Attempt Problem Sheet 0 (contact your class tutor if you intend to attend class in Week 2) Problem Sheet 1 is posted (submit by noon 21 Oct at CS reception) 1
Announcement : Strachey Lecture Will finish 15-20 min early on Monday, October 31 May run over by 5 minutes or so a few other days 2
Outline Goals Review the supervised learning setting Describe the linear regression framework Apply the linear model to make predictions Derive the least squares estimate Supervised Learning Setting Data consists of input and output pairs Inputs (also covariates, independent variables, predictors, features) Output (also variates, dependent variable, targets, labels) 3
Why study linear regression? Least squares is at least 200 years old going back to Legendre and Gauss Francis Galton (1886): Regression to the mean Often real processes can be approximated by linear models More complex models require understanding linear regression Closed form analytic solutions can be obtained Many key notions of machine learning can be introduced 4
A toy example : Commute Times Want to predict commute time into city centre What variables would be useful? Distance to city centre Day of the week Data dist (km) day commute time (min) 2.7 fri 25 4.1 mon 33 1.0 sun 15 5.2 tue 45 2.8 sat 22 5
Linear Models Suppose the input is a vector x R D and the output is y R. We have data x i, y i N i=1 Notation: data dimension D, size of dataset N, column vectors Linear Model y = w 0 + x 1w 1 + + x Dw D + ɛ Bias/intercept Noise/uncertainty 6
Linear Models : Commute Time Linear Model y = w 0 + x 1w 1 + + x Dw D + ɛ Bias/intercept Noise/uncertainty Input encoding: mon-sun has to be converted to a number monday: 0, tuesday: 1,..., sunday: 6 0 if weekend, 1 if weekday Say x 1 R (distance) and x 2 {0, 1} (weekend/weekday) Linear model for commute time y = w 0 + w 1x 1 + w 2x 2 + ɛ Using 0-6 is a bad encoding. Use seven 0-1 features instead called one-hot encoding 7
Linear Model : Adding a feature for bias term dist day commute time x 1 x 2 y 2.7 fri 25 4.1 mon 33 1.0 sun 15 5.2 tue 45 2.8 sat 22 one dist day commute time x 0 x 1 x 2 y 1 2.7 fri 25 1 4.1 mon 33 1 1.0 sun 15 1 5.2 tue 45 1 2.8 sat 22 Model y = w 0 + w 1x 1 + w 2x 2 + ɛ Model y = w 0x 0 + w 1x 1 + w 2x 2 + ɛ = w x + ɛ 8
Learning Linear Models Data: (x i, y i) N i=1, where x i R D and y i R Model parameter w, where w R D Training phase: (learning/estimation w from data) (x i, y i) N i=1 data Learning Algorithm w (estimate) Testing/Deployment phase: (predict ŷ new = x new w) How different is ŷ new from y new (actual observation)? We should keep some data aside for testing before deploying a model 9
(x i, y i) N i=1, where x i R and y i R ŷ(x) = w 0 + x w 1, (no noise term in ŷ) L(w) = L(w 0, w 1) = 1 2N N (ŷ i y i) 2 = 1 N (w 0 + x i w 1 y i) 2 2N i=1 i=1 Loss function Cost function Objective Function Energy Function Notation - L, J, E, R This objective is known as the residual sum of squares or (RSS) The estimate (w 0, w 1) is known as the least squares estimate 10
(x i, y i) N i=1, where x i R and y i R ŷ(x) = w 0 + x w 1, (no noise term in ŷ) L(w) = L(w 0, w 1) = 1 2N N (ŷ i y i) 2 = 1 N (w 0 + x i w 1 y i) 2 2N i=1 i=1 L = 1 w 0 N L = 1 w 1 N N (w 0 + w 1 x i y i) i=1 N (w 0 + w 1 x i y i)x i i=1 We obtain the solution for (w 0, w 1) by setting the partial derivatives to 0 and solving the resulting system. (Normal Equations) w 0 w 0 + w 1 i xi N + w1 i xi N = i x2 i N = i yi N i xiyi N (1) (2) x = ȳ = var(x) = ĉov(x, y) = w 1 = i xi N i yi N i x2 i N x2 i xiyi x ȳ N ĉov(x, y) var(x) w 0 = ȳ w 1 x 11
Linear Regression : General Case Recall that the linear model is ŷ i = D j=0 x ijw j where we assume that x i0 = 1 for all x i, so that the bias term w 0 does not need to be treated separately. Expressing everything in matrix notation ŷ = Xw Here we have ŷ R N 1, X R N (D+1) and w R (D+1) 1 ŷ N 1 ŷ 1 ŷ 2. ŷ N = X N (D+1) w (D+1) 1 w 0 x T 1 x T 2.. x T N. w D = X N (D+1) x 10 x 1D x 20 x 2D....... x N0 x ND w (D+1) 1 w 0. w D 12
Back to toy example one dist (km) weekday? commute time (min) 1 2.7 1 (fri) 25 1 4.1 1 (mon) 33 1 1.0 0 (sun) 15 1 5.2 1 (tue) 45 1 2.8 0 (sat) 22 We have N = 5, D + 1 = 3 and so we get 25 1 2.7 1 33 1 4.1 1 y = 15, X = 1 1.0 0, w = 45 1 5.2 1 22 1 2.8 0 Suppose we get w = [6.09, 6.53, 2.11] T. Then our predictions would be 25.83 34.97 ŷ = 12.62 42.16 24.37 w 0 w 1 w 2 13
Least Squares Estimate : Minimise the Squared Error L(w) = 1 2N N (x T i w y i) 2 = (Xw y) T (Xw y) i=1 14
Finding Optimal Solutions using Calculus L(w) = 1 2N = 1 2N = 1 2N = N i=1 (x T i w y i) 2 = 1 2N (Xw y)t (Xw y) ( w T ( X T X ) ) w w T X T y y T Xw + y T y ( ( ) ) w T X T X w 2 y T Xw + y T y Then, write out all partial derivatives to form the gradient wl L w 0 = L w 1 =. L w D = Instead, we will develop tricks to differentiate using matrix notation directly 15
Differentiating Matrix Expressions Rules (Tricks) ) (i) Linear Form Expressions: w (c T w = c c T w = D c jw j j=0 ) (c T w) = c w j j, and so w (c T w (ii) Quadratic Form Expressions: ) w (w T Aw = Aw + A T w ( = 2Aw for symmetric A) w T Aw = (w T Aw) w k = D i=0 j=0 i=0 D w iw ja ij D D w ia ik + A kj w j = A T [:,k]w + A [k,:] w j=0 = c (3) ) w (w T Aw = A T w + Aw (4) 16
Deriving the Least Squares Estimate L(w) = 1 2N N i=1 (x T i w y i) 2 = 1 2N ( ( ) ) w T X T X w 2 y T Xw + y T y We compute the gradient wl = 0 using the matrix differentiation rules, wl = 1 ( ( ) X X) T w X T y N By setting wl = 0 and solving we get, ( ) X T X w = X T y ( 1 w = X X) T X T y (Assuming inverse exists) The predictions made by the model on the data X are given by ( 1 ŷ = Xw = X X X) T X T y ( 1 For this reason the matrix X X X) T X T is called the hat matrix 17
Least Squares Estimate w = ( X T X) 1 X T y When do we expect X T X to be invertible? rank(x T X) = rank(x) min{d + 1, N} As X T X is D + 1 D + 1, invertible is rank(x) = D + 1 What if we use one-hot encoding for a feature like day? Suppose x mon,..., x sun stand for 0-1 valued variables in the one-hot encoding We always have x mon + + x sun = 1 This introduces a linear dependence in the columns of X reducing the rank In this case, we can drop some features to adjust rank. We ll see alternative approaches later in the course. What is the computational complexity of computing w? Relatively easy to get O(D 2 N) bound 18
19
Recap : Predicting Commute Time Goal Predict the time taken for commute given distance and day of week Do we only wish to make predictions or also suggestions? Model and Choice of Loss Function Use a linear model y = w 0 + w 1x 1 + + w Dx D + ɛ = ŷ + ɛ Minimise average squared error 1 (yi ŷ 2N i) 2 Algorithm to Fit Model Simple matrix operations using closed-form solution 20
Model and Loss Function Choice Optimisation View of Machine Learning Pick model that you expect may fit the data well enough Pick a measure of performance that makes sense and can be optimised Run optimisation algorithm to obtain model parameters Probabilistic View of Machine Learning Pick a model for data and explicitly formulate the deviation (or uncertainty) from the model using the language of probability Use notions from probability to define suitability of various models Find the parameters or make predictions on unseen data using these suitability criteria (Frequentist vs Bayesian viewpoints) 21
Next Time Probabilistic View of Machine Learning (Maximum Likelihood) Non-linearity using basis expansion What to do when you have more features than data? Make sure you re familiar with the the multi-variate Gaussian distribution 22