Machine Learning - MT Linear Regression


Machine Learning - MT Linear Regression
Varun Kanade
University of Oxford
October 12, 2016

Announcements
All students eligible to take the course for credit can sign up for classes and practicals.
Attempt Problem Sheet 0 (contact your class tutor if you intend to attend class in Week 2).
Problem Sheet 1 is posted (submit by noon, 21 Oct, at CS reception).

Announcement: Strachey Lecture
Will finish 15-20 min early on Monday, October 31.
May run over by 5 minutes or so on a few other days.

Outline
Goals:
  Review the supervised learning setting
  Describe the linear regression framework
  Apply the linear model to make predictions
  Derive the least squares estimate
Supervised Learning Setting:
  Data consists of input and output pairs
  Inputs (also covariates, independent variables, predictors, features)
  Output (also variates, dependent variable, targets, labels)

Why study linear regression?
Least squares is at least 200 years old, going back to Legendre and Gauss.
Francis Galton (1886): regression to the mean.
Real processes can often be approximated by linear models.
More complex models require understanding linear regression.
Closed-form analytic solutions can be obtained.
Many key notions of machine learning can be introduced.


A toy example: Commute Times
Want to predict commute time into city centre.
What variables would be useful?
  Distance to city centre
  Day of the week
Data:
  dist (km)   day   commute time (min)
  2.7         fri
              mon
              sun
              tue
              sat   22

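Linear Models
Suppose the input is a vector $x \in \mathbb{R}^D$ and the output is $y \in \mathbb{R}$.
We have data $\langle x_i, y_i \rangle_{i=1}^N$.
Notation: data dimension $D$, size of dataset $N$, column vectors.
Linear Model:
\[ y = w_0 + x_1 w_1 + \cdots + x_D w_D + \epsilon \]
Here $w_0$ is the bias/intercept and $\epsilon$ is the noise/uncertainty.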

Linear Models: Commute Time
Linear Model:
\[ y = w_0 + x_1 w_1 + \cdots + x_D w_D + \epsilon \]
Here $w_0$ is the bias/intercept and $\epsilon$ is the noise/uncertainty.
Input encoding: mon-sun has to be converted to a number. Two options:
  monday: 0, tuesday: 1, ..., sunday: 6
  0 if weekend, 1 if weekday
Say $x_1 \in \mathbb{R}$ (distance) and $x_2 \in \{0, 1\}$ (weekend/weekday).
Linear model for commute time:
\[ y = w_0 + w_1 x_1 + w_2 x_2 + \epsilon \]
Using 0-6 is a bad encoding. Use seven 0-1 features instead; this is called one-hot encoding.
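
The two usable encodings of the day feature can be sketched in a few lines of Python. This is only an illustration: the names `DAYS`, `weekday_indicator` and `one_hot` are made up for this sketch, and the ordering of the seven one-hot features is an arbitrary choice. A 0-6 code is avoided because it would impose an ordering and equal spacing on the days that the linear model would then treat as meaningful.

```python
# Fixed order for the seven one-hot features (an arbitrary choice for this sketch).
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def weekday_indicator(day):
    """Binary encoding from the slide: 0 if weekend, 1 if weekday."""
    return 0 if day in ("sat", "sun") else 1

def one_hot(day):
    """Seven 0-1 features, exactly one of which equals 1."""
    return [1 if day == d else 0 for d in DAYS]

# The five days that appear in the toy data.
for day in ["fri", "mon", "sun", "tue", "sat"]:
    print(day, weekday_indicator(day), one_hot(day))
```

Each one-hot vector sums to 1, a property that matters later when asking whether $X^T X$ is invertible.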

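Linear Model: Adding a feature for bias term
The rows are the five commutes (fri, mon, sun, tue, sat) of the toy example.
Original data, with columns dist ($x_1$), day ($x_2$), commute time ($y$):
\[ y = w_0 + w_1 x_1 + w_2 x_2 + \epsilon \]
Augmented data, with columns one ($x_0$), dist ($x_1$), day ($x_2$), commute time ($y$), where $x_0 = 1$ in every row:
\[ y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \epsilon = w \cdot x + \epsilon \]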
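
A minimal sketch of the bias trick, assuming inputs are stored as plain Python lists. The helper names `add_bias_feature` and `dot` and the weight values are hypothetical; only the distance 2.7 for Friday comes from the toy data. It checks that $w \cdot x$ with the extra constant feature matches the explicit form $w_0 + w_1 x_1 + w_2 x_2$.

```python
def add_bias_feature(rows):
    """Prepend the constant feature x_0 = 1 to every input row."""
    return [[1.0] + list(row) for row in rows]

def dot(w, x):
    """Inner product w . x of two equal-length sequences."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

# One input row: dist = 2.7 km, weekday indicator = 1 (Friday).
rows = [[2.7, 1.0]]
augmented = add_bias_feature(rows)      # [[1.0, 2.7, 1.0]]

w = [1.0, 2.0, 3.0]                     # hypothetical weights (w_0, w_1, w_2)
explicit = w[0] + w[1] * rows[0][0] + w[2] * rows[0][1]
via_dot = dot(w, augmented[0])
print(explicit, via_dot)
```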

Learning Linear Models
Data: $\langle (x_i, y_i) \rangle_{i=1}^N$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$
Model parameter $w$, where $w \in \mathbb{R}^D$
Training phase (learning/estimating $w$ from data):
  data $\langle (x_i, y_i) \rangle_{i=1}^N$ $\rightarrow$ Learning Algorithm $\rightarrow$ $w$ (estimate)
Testing/Deployment phase (predict $\hat{y}_{new} = x_{new} \cdot w$):
  How different is $\hat{y}_{new}$ from $y_{new}$ (actual observation)?
  We should keep some data aside for testing before deploying a model
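
One way to keep data aside for testing is a random hold-out split, sketched below. The slide does not prescribe a particular splitting scheme, so the 20% fraction, the fixed seed, the function names and the synthetic data are all assumptions made for illustration.

```python
import random

def split_train_test(pairs, test_fraction=0.2, seed=0):
    """Shuffle (x, y) pairs and hold out a fraction of them for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def predict(w, x):
    """Prediction y_hat = x . w for one input vector x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def mean_squared_error(w, pairs):
    """Average squared gap between predictions and actual observations."""
    return sum((predict(w, x) - y) ** 2 for x, y in pairs) / len(pairs)

# Hypothetical noise-free data from y = 3 + 2 * x, with x_0 = 1 as the bias feature.
pairs = [([1.0, float(i)], 3.0 + 2.0 * i) for i in range(10)]
train, test = split_train_test(pairs)

w = [3.0, 2.0]  # stand-in for an estimate that would be learned from `train` only
print(len(train), len(test), mean_squared_error(w, test))
```

The error on `test` is the quantity that answers "how different is the prediction from the actual observation" for data the learning algorithm never saw.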

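$\langle (x_i, y_i) \rangle_{i=1}^N$, where $x_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$
$\hat{y}(x) = w_0 + x \cdot w_1$ (no noise term in $\hat{y}$)
\[ L(w) = L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^N (w_0 + x_i w_1 - y_i)^2 \]
$L$ is variously called the loss function, cost function, objective function or energy function. Notation: $L$, $J$, $E$, $R$.
This objective is known as the residual sum of squares (RSS), here scaled by $\frac{1}{2N}$.
The estimate $(w_0, w_1)$ that minimises it is known as the least squares estimate.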
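
The objective $L(w_0, w_1)$ translates directly into code. The sketch below uses a hypothetical function name `squared_loss` and three made-up data points lying exactly on the line $y = 1 + 2x$, so the loss is zero at $(w_0, w_1) = (1, 2)$ and positive elsewhere.

```python
def squared_loss(w0, w1, xs, ys):
    """L(w0, w1) = 1/(2N) * sum_i (w0 + w1 * x_i - y_i)^2."""
    n = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

print(squared_loss(1.0, 2.0, xs, ys))  # perfect fit
print(squared_loss(0.0, 2.0, xs, ys))  # every residual is -1
```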

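$\langle (x_i, y_i) \rangle_{i=1}^N$, where $x_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$
$\hat{y}(x) = w_0 + x \cdot w_1$ (no noise term in $\hat{y}$)
\[ L(w) = L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^N (w_0 + x_i w_1 - y_i)^2 \]
\[ \frac{\partial L}{\partial w_0} = \frac{1}{N} \sum_{i=1}^N (w_0 + w_1 x_i - y_i) \]
\[ \frac{\partial L}{\partial w_1} = \frac{1}{N} \sum_{i=1}^N (w_0 + w_1 x_i - y_i) x_i \]
We obtain the solution for $(w_0, w_1)$ by setting the partial derivatives to 0 and solving the resulting system (Normal Equations):
\[ w_0 + w_1 \frac{\sum_i x_i}{N} = \frac{\sum_i y_i}{N} \qquad (1) \]
\[ w_0 \frac{\sum_i x_i}{N} + w_1 \frac{\sum_i x_i^2}{N} = \frac{\sum_i x_i y_i}{N} \qquad (2) \]
With
\[ \bar{x} = \frac{\sum_i x_i}{N}, \quad \bar{y} = \frac{\sum_i y_i}{N}, \quad \widehat{\mathrm{var}}(x) = \frac{\sum_i x_i^2}{N} - \bar{x}^2, \quad \widehat{\mathrm{cov}}(x, y) = \frac{\sum_i x_i y_i}{N} - \bar{x} \bar{y} \]
the solution is
\[ w_1 = \frac{\widehat{\mathrm{cov}}(x, y)}{\widehat{\mathrm{var}}(x)}, \qquad w_0 = \bar{y} - w_1 \bar{x} \]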
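
The closed-form solution of the normal equations for a single input can be sketched as follows. The function name `least_squares_1d` and the four data points are hypothetical; the points lie on $y = 1 + 2x$, so the estimate should recover intercept 1 and slope 2.

```python
def least_squares_1d(xs, ys):
    """Solve the two normal equations via empirical variance and covariance."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    w1 = cov_xy / var_x          # requires var_x != 0, i.e. the x_i are not all equal
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = least_squares_1d(xs, ys)
print(w0, w1)
```

Note the division by the empirical variance: if all inputs are identical the slope is not determined, which is the one-dimensional version of the invertibility question for $X^T X$.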

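Linear Regression: General Case
Recall that the linear model is
\[ \hat{y}_i = \sum_{j=0}^{D} x_{ij} w_j \]
where we assume that $x_{i0} = 1$ for all $x_i$, so that the bias term $w_0$ does not need to be treated separately.
Expressing everything in matrix notation:
\[ \hat{y} = Xw \]
Here we have $\hat{y} \in \mathbb{R}^{N \times 1}$, $X \in \mathbb{R}^{N \times (D+1)}$ and $w \in \mathbb{R}^{(D+1) \times 1}$:
\[
\underbrace{\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{bmatrix}}_{\hat{y}_{N \times 1}}
=
\underbrace{\begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}}_{X_{N \times (D+1)}}
\underbrace{\begin{bmatrix} w_0 \\ \vdots \\ w_D \end{bmatrix}}_{w_{(D+1) \times 1}}
=
\underbrace{\begin{bmatrix} x_{10} & \cdots & x_{1D} \\ x_{20} & \cdots & x_{2D} \\ \vdots & \ddots & \vdots \\ x_{N0} & \cdots & x_{ND} \end{bmatrix}}_{X_{N \times (D+1)}}
\underbrace{\begin{bmatrix} w_0 \\ \vdots \\ w_D \end{bmatrix}}_{w_{(D+1) \times 1}}
\]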

Back to toy example
Data columns: one, dist (km), weekday?, commute time (min), with rows (fri), (mon), (sun), (tue), (sat).
We have $N = 5$, $D + 1 = 3$, and so we get $y \in \mathbb{R}^{5 \times 1}$, $X \in \mathbb{R}^{5 \times 3}$ and $w = [w_0, w_1, w_2]^T$.
Suppose we get $w = [6.09, 6.53, 2.11]^T$. Then our predictions would be $\hat{y} = Xw$.
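
As a worked instance of one entry of $\hat{y} = Xw$, take the Friday commute. This assumes the Friday row of the toy data is (one = 1, dist = 2.7 km, weekday = 1), which is what the surviving table entries and the weekend/weekday encoding suggest:

```latex
\[
\hat{y}_{\text{fri}}
= w_0 \cdot 1 + w_1 \cdot 2.7 + w_2 \cdot 1
= 6.09 + 6.53 \times 2.7 + 2.11
= 6.09 + 17.631 + 2.11
= 25.831
\]
```

So with these weights the model would predict a commute of roughly 26 minutes for that Friday.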

Least Squares Estimate: Minimise the Squared Error
\[ L(w) = \frac{1}{2N} \sum_{i=1}^N (x_i^T w - y_i)^2 = \frac{1}{2N} (Xw - y)^T (Xw - y) \]

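Finding Optimal Solutions using Calculus
\begin{align*}
L(w) &= \frac{1}{2N} \sum_{i=1}^N (x_i^T w - y_i)^2 = \frac{1}{2N} (Xw - y)^T (Xw - y) \\
&= \frac{1}{2N} \left( w^T (X^T X) w - w^T X^T y - y^T X w + y^T y \right) \\
&= \frac{1}{2N} \left( w^T (X^T X) w - 2 y^T X w + y^T y \right)
\end{align*}
Then, write out all partial derivatives to form the gradient $\nabla_w L$:
\[ \frac{\partial L}{\partial w_0} = \cdots, \quad \frac{\partial L}{\partial w_1} = \cdots, \quad \ldots, \quad \frac{\partial L}{\partial w_D} = \cdots \]
Instead, we will develop tricks to differentiate using matrix notation directly.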

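Differentiating Matrix Expressions
Rules (Tricks)
(i) Linear Form Expressions: $\nabla_w (c^T w) = c$
\[ c^T w = \sum_{j=0}^{D} c_j w_j \]
\[ \frac{\partial (c^T w)}{\partial w_j} = c_j, \quad \text{and so} \quad \nabla_w (c^T w) = c \qquad (3) \]
(ii) Quadratic Form Expressions: $\nabla_w (w^T A w) = Aw + A^T w$ ($= 2Aw$ for symmetric $A$)
\[ w^T A w = \sum_{i=0}^{D} \sum_{j=0}^{D} w_i w_j A_{ij} \]
\[ \frac{\partial (w^T A w)}{\partial w_k} = \sum_{i=0}^{D} w_i A_{ik} + \sum_{j=0}^{D} A_{kj} w_j = A_{[:,k]}^T w + A_{[k,:]} w \]
\[ \nabla_w (w^T A w) = A^T w + Aw \qquad (4) \]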
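
Both rules can be sanity-checked numerically with central finite differences. The sketch below assumes NumPy is available; `numerical_gradient`, the random seed and the 4-dimensional test vectors are choices made for this illustration. The matrix $A$ is deliberately not symmetric, so the check exercises the general form $Aw + A^T w$ rather than $2Aw$.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Approximate the gradient of a scalar function f at w by central differences."""
    grad = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))   # generic, non-symmetric matrix
c = rng.normal(size=4)
w = rng.normal(size=4)

# Rule (i): gradient of c^T w is c.
linear_ok = np.allclose(numerical_gradient(lambda v: c @ v, w), c, atol=1e-5)

# Rule (ii): gradient of w^T A w is A w + A^T w.
quadratic_ok = np.allclose(
    numerical_gradient(lambda v: v @ A @ v, w), A @ w + A.T @ w, atol=1e-5
)
print(linear_ok, quadratic_ok)
```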

Deriving the Least Squares Estimate
\[ L(w) = \frac{1}{2N} \sum_{i=1}^N (x_i^T w - y_i)^2 = \frac{1}{2N} \left( w^T (X^T X) w - 2 y^T X w + y^T y \right) \]
We compute the gradient $\nabla_w L$ using the matrix differentiation rules:
\[ \nabla_w L = \frac{1}{N} \left( (X^T X) w - X^T y \right) \]
By setting $\nabla_w L = 0$ and solving we get
\[ (X^T X) w = X^T y \]
\[ w = (X^T X)^{-1} X^T y \quad \text{(assuming the inverse exists)} \]
The predictions made by the model on the data $X$ are given by
\[ \hat{y} = Xw = X (X^T X)^{-1} X^T y \]
For this reason the matrix $X (X^T X)^{-1} X^T$ is called the hat matrix.
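
A minimal NumPy sketch of the estimate and the hat matrix, on synthetic data. The data-generating weights, noise level, seed and variable names are assumptions for illustration, and `np.linalg.solve` is used on the normal equations instead of forming the inverse explicitly, which is a common numerical preference rather than something the slide specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 2

# Design matrix with a leading column of ones for the bias term.
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])
w_true = np.array([1.0, 2.0, -3.0])            # hypothetical ground truth
y = X @ w_true + 0.1 * rng.normal(size=N)      # noisy observations

# Solve (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values y_hat.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# The gradient (1/N)((X^T X) w - X^T y) should vanish at the estimate.
grad = (X.T @ X @ w_hat - X.T @ y) / N
print(w_hat, np.abs(grad).max())
```

The hat matrix is a projection, so applying it twice changes nothing: `H @ H` equals `H` up to rounding.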

Least Squares Estimate
\[ w = (X^T X)^{-1} X^T y \]
When do we expect $X^T X$ to be invertible?
  $\mathrm{rank}(X^T X) = \mathrm{rank}(X) \le \min\{D + 1, N\}$
  As $X^T X$ is $(D + 1) \times (D + 1)$, it is invertible if $\mathrm{rank}(X) = D + 1$
What if we use one-hot encoding for a feature like day?
  Suppose $x_{mon}, \ldots, x_{sun}$ stand for 0-1 valued variables in the one-hot encoding
  We always have $x_{mon} + \cdots + x_{sun} = 1$
  This introduces a linear dependence in the columns of $X$ (the sum equals the bias column), reducing the rank
  In this case, we can drop some features to adjust rank. We'll see alternative approaches later in the course.
What is the computational complexity of computing $w$?
  Relatively easy to get an $O(D^2 N)$ bound
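
The rank problem caused by one-hot encoding alongside a bias column can be seen numerically. The sketch below assumes NumPy; the list of days is made-up data chosen so that every day of the week occurs at least once, and dropping the last one-hot column is just one possible way to restore full column rank.

```python
import numpy as np

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Hypothetical observations covering all seven days.
days = ["fri", "mon", "sun", "tue", "sat", "wed", "thu", "mon"]
one_hot = np.array([[1.0 if d == name else 0.0 for name in DAYS] for d in days])

# Bias column plus seven one-hot columns: the one-hot columns sum to the bias column.
X_full = np.column_stack([np.ones(len(days)), one_hot])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one one-hot column (here "sun") removes the linear dependence.
X_dropped = X_full[:, :-1]
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)        # more columns than rank
print(X_dropped.shape[1], rank_dropped)  # full column rank
```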


Recap: Predicting Commute Time
Goal:
  Predict the time taken for commute given distance and day of week
  Do we only wish to make predictions or also suggestions?
Model and Choice of Loss Function:
  Use a linear model $y = w_0 + w_1 x_1 + \cdots + w_D x_D + \epsilon = \hat{y} + \epsilon$
  Minimise average squared error $\frac{1}{2N} \sum_{i=1}^N (y_i - \hat{y}_i)^2$
Algorithm to Fit Model:
  Simple matrix operations using closed-form solution

Model and Loss Function Choice
Optimisation View of Machine Learning:
  Pick a model that you expect may fit the data well enough
  Pick a measure of performance that makes sense and can be optimised
  Run an optimisation algorithm to obtain model parameters
Probabilistic View of Machine Learning:
  Pick a model for data and explicitly formulate the deviation (or uncertainty) from the model using the language of probability
  Use notions from probability to define suitability of various models
  Find the parameters or make predictions on unseen data using these suitability criteria (Frequentist vs Bayesian viewpoints)

Next Time
Probabilistic View of Machine Learning (Maximum Likelihood)
Non-linearity using basis expansion
What to do when you have more features than data?
Make sure you're familiar with the multivariate Gaussian distribution
