Linear Regression
Machine Learning CSE546
Kevin Jamieson, University of Washington
Oct 5, 2017
The regression problem

Given past sales data on zillow.com, predict:
y = house sale price, from x = {# sq. ft., zip code, date of sale, etc.}

Training data: \{(x_i, y_i)\}_{i=1}^n, with x_i \in R^d, y_i \in R
[Figure: sale price vs. # square feet, with the best linear fit]

Hypothesis (linear): y_i \approx x_i^T w
Loss (least squares): \min_w \sum_{i=1}^n (y_i - x_i^T w)^2
The regression problem in matrix notation

y = [y_1, \dots, y_n]^T,   X = [x_1^T; \dots; x_n^T]

\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2
            = \arg\min_w (y - Xw)^T (y - Xw)
            = \arg\min_w \|y - Xw\|_2^2
The regression problem in matrix notation

\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y

What about an offset?

\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \sum_{i=1}^n (y_i - (x_i^T w + b))^2
                         = \arg\min_{w,b} \|y - (Xw + 1b)\|_2^2
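The closed-form solution \hat{w}_{LS} = (X^T X)^{-1} X^T y can be checked numerically. A minimal sketch with NumPy; the data, dimensions, and noise level are invented for illustration:

```python
import numpy as np

# Synthetic regression data: y = X w_true + noise.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y. Solving the linear system
# is numerically safer than forming the explicit inverse (X^T X)^{-1}.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this much data and little noise, `w_hat` lands close to `w_true`; in practice `np.linalg.lstsq` is safer still when X^T X is poorly conditioned.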
Dealing with an offset

\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \|y - (Xw + 1b)\|_2^2

Setting the gradient with respect to w and b to zero gives the normal equations:

X^T X \hat{w}_{LS} + \hat{b}_{LS} X^T 1 = X^T y
1^T X \hat{w}_{LS} + \hat{b}_{LS} 1^T 1 = 1^T y

If X^T 1 = 0 (i.e., each feature is mean-zero), then

\hat{w}_{LS} = (X^T X)^{-1} X^T y,   \hat{b}_{LS} = \frac{1}{n} \sum_{i=1}^n y_i
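The mean-centering trick from the slide can be sketched as follows: center each feature column so that X^T 1 = 0, solve for w, and recover the offset for the original coordinates. Synthetic data for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(loc=5.0, size=(n, d))   # deliberately NOT mean-zero
w_true, b_true = np.array([1.5, -2.0]), 3.0
y = X @ w_true + b_true + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)                # now Xc^T 1 = 0 (up to float error)
w_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

# In centered coordinates b_hat is just mean(y); undoing the centering
# gives the offset for predictions on the raw features.
b_hat = y.mean() - X.mean(axis=0) @ w_hat
```

The prediction x^T w_hat + b_hat then matches fitting w and b jointly.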
The regression problem in matrix notation

\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y

But why least squares? Consider y_i = x_i^T w + \epsilon_i where \epsilon_i \sim i.i.d. N(0, \sigma^2). Then

P(y \mid x, w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - x^T w)^2}{2\sigma^2}\right)
Maximizing the log-likelihood

Maximize:

\log P(D \mid w, \sigma) = \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \prod_{i=1}^n e^{-\frac{(y_i - x_i^T w)^2}{2\sigma^2}}\right]
= -n \log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T w)^2

Maximizing over w is therefore the same as minimizing \sum_{i=1}^n (y_i - x_i^T w)^2.
MLE is LS under the linear model

\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2
\hat{w}_{MLE} = \arg\max_w P(D \mid w, \sigma)

If y_i = x_i^T w + \epsilon_i with \epsilon_i \sim i.i.d. N(0, \sigma^2), then

\hat{w}_{LS} = \hat{w}_{MLE} = (X^T X)^{-1} X^T y
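The equivalence can be checked numerically: maximize the log-likelihood by gradient descent on the negative log-likelihood and compare against the closed form. The data, step size, and iteration count below are ad hoc choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0]) + 0.3 * rng.normal(size=n)
sigma = 0.3

# Closed-form least squares.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the negative log-likelihood,
# -log P(D | w, sigma) = const + ||y - Xw||^2 / (2 sigma^2).
w_mle = np.zeros(d)
lr = sigma**2 / (2 * n)   # small enough for stable convergence here
for _ in range(500):
    grad = -(X.T @ (y - X @ w_mle)) / sigma**2
    w_mle -= lr * grad
```

Both estimators converge to the same vector, as the slide claims; note that \sigma only scales the objective, so the MLE for w does not depend on it.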
The regression problem

Given past sales data on zillow.com, predict y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}

Training data: \{(x_i, y_i)\}_{i=1}^n, with x_i \in R^d, y_i \in R
[Figure: sale price vs. date of sale, with the best linear fit]

Hypothesis (linear): y_i \approx x_i^T w
Loss (least squares): \min_w \sum_{i=1}^n (y_i - x_i^T w)^2

A linear fit in the original features may not be enough. Transformed data:
The regression problem

Transformed data: h : R^d \to R^p maps the original features to a rich, possibly high-dimensional space.

h(x) = [h_1(x), h_2(x), \dots, h_p(x)]^T

In d = 1: h(x) = [x, x^2, \dots, x^p]^T

For d > 1, generate \{u_j\}_{j=1}^p \subset R^d and use, e.g.:

h_j(x) = \frac{1}{1 + \exp(u_j^T x)},   h_j(x) = (u_j^T x)^2,   h_j(x) = \cos(u_j^T x)
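The three d > 1 feature maps above are easy to sketch with random directions u_j; the function names and dimensions here are my own, for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, p = 4, 10
U = rng.normal(size=(p, d))    # rows are the random directions u_j

def sigmoid_features(x):
    # h_j(x) = 1 / (1 + exp(u_j^T x)), as written on the slide
    return 1.0 / (1.0 + np.exp(U @ x))

def quadratic_features(x):
    # h_j(x) = (u_j^T x)^2
    return (U @ x) ** 2

def cosine_features(x):
    # h_j(x) = cos(u_j^T x)
    return np.cos(U @ x)

x = rng.normal(size=d)
h = cosine_features(x)         # h(x) lives in R^p
```

Each map turns a d-dimensional input into a p-dimensional feature vector; the regression stays linear in h(x).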
The regression problem

Training data: \{(x_i, y_i)\}_{i=1}^n, with x_i \in R^d, y_i \in R
Transformed data: h(x) = [h_1(x), h_2(x), \dots, h_p(x)]^T

Hypothesis (linear in the features): y_i \approx h(x_i)^T w, with w \in R^p
Loss (least squares): \min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2
The regression problem

With transformed features, the least-squares fit
\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2
can be far more expressive than a line.

[Figure: sale price vs. date of sale, showing a small-p fit (smooth, misses structure) and a large-p fit (wiggly, chases every point)]

What's going on here?
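The small-p vs. large-p behavior can be reproduced in 1-d with polynomial features. This sketch uses synthetic data and arbitrary degrees, and only compares training errors; a large p drives the training error toward zero even when the fit is wild between points:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 15
x = np.sort(rng.uniform(-1, 1, size=n))
y = np.sin(3 * x) + 0.1 * rng.normal(size=n)

def poly_train_error(p):
    # Feature map h(x) = [1, x, x^2, ..., x^p] and its least-squares fit.
    H = np.vander(x, p + 1, increasing=True)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return np.mean((y - H @ w) ** 2)

err_small = poly_train_error(2)    # small p: underfits the sine
err_large = poly_train_error(12)   # large p: nearly interpolates the noise
```

The training error alone cannot distinguish a good fit from an overfit one; the bias-variance discussion that follows explains why.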
Bias-Variance Tradeoff
Statistical Learning

Assume the data comes from a joint distribution P_{XY}(X = x, Y = y).
[Figure: scatter of y vs. x, with the conditional densities P_{XY}(Y = y \mid X = x_0) and P_{XY}(Y = y \mid X = x_1) drawn at two slices x_0 and x_1]

Ideally, we want to find:

\eta(x) = E_{XY}[Y \mid X = x]

But we only have samples (x_i, y_i) \sim i.i.d. P_{XY} for i = 1, \dots, n, and are restricted to a function class F (e.g., linear), so we compute:

\hat{f} = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2

We care about future predictions: E_{XY}[(Y - \hat{f}(X))^2]

Each draw D = \{(x_i, y_i)\}_{i=1}^n results in a different \hat{f}, with average prediction E_D[\hat{f}(x)].
Bias-Variance Tradeoff

\eta(x) = E_{XY}[Y \mid X = x],   \hat{f} = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2

E_{XY}[E_D[(Y - \hat{f}_D(x))^2] \mid X = x]
= E_{XY}[E_D[(Y - \eta(x) + \eta(x) - \hat{f}_D(x))^2] \mid X = x]
= E_{XY}[E_D[(Y - \eta(x))^2 + 2(Y - \eta(x))(\eta(x) - \hat{f}_D(x)) + (\eta(x) - \hat{f}_D(x))^2] \mid X = x]
= E_{XY}[(Y - \eta(x))^2 \mid X = x] + E_D[(\eta(x) - \hat{f}_D(x))^2]

(The cross term vanishes: Y is independent of D, and E_{Y \mid X = x}[Y - \eta(x)] = 0.)

The first term is the irreducible error, caused by stochastic label noise. The second is the learning error, caused by either using too simple a model or not having enough data to learn the model accurately.
Bias-Variance Tradeoff

Decompose the learning error further:

E_D[(\eta(x) - \hat{f}_D(x))^2]
= E_D[(\eta(x) - E_D[\hat{f}_D(x)] + E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]
= E_D[(\eta(x) - E_D[\hat{f}_D(x)])^2 + 2(\eta(x) - E_D[\hat{f}_D(x)])(E_D[\hat{f}_D(x)] - \hat{f}_D(x)) + (E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]
= (\eta(x) - E_D[\hat{f}_D(x)])^2 + E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]

bias squared + variance

(The cross term again vanishes, since E_D[E_D[\hat{f}_D(x)] - \hat{f}_D(x)] = 0.)
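The decomposition on the slide is an exact identity, which a Monte-Carlo simulation makes concrete: fit a linear model to a quadratic \eta over many draws of D and split the error at a fixed test point. The setup (quadratic truth, linear fit, all constants) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 30, 2000
x0 = 0.5                                    # fixed test point
eta = lambda t: t ** 2                      # eta(x) = E[Y | X = x]

preds = np.empty(trials)
for t in range(trials):
    # One draw of D = {(x_i, y_i)}.
    x = rng.uniform(-1, 1, size=n)
    y = eta(x) + 0.1 * rng.normal(size=n)
    H = np.vander(x, 2, increasing=True)    # linear model: [1, x]
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    preds[t] = np.array([1.0, x0]) @ w      # f_D(x0)

mse = np.mean((eta(x0) - preds) ** 2)       # E_D[(eta(x0) - f_D(x0))^2]
bias_sq = (eta(x0) - preds.mean()) ** 2     # (eta(x0) - E_D[f_D(x0)])^2
variance = preds.var()                      # E_D[(E_D[f_D(x0)] - f_D(x0))^2]
```

Here `mse` equals `bias_sq + variance` up to floating-point error, and the bias term is nonzero because no line can match the quadratic \eta.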
Bias-Variance Tradeoff

E_{XY}[E_D[(Y - \hat{f}_D(x))^2] \mid X = x]
= E_{XY}[(Y - \eta(x))^2 \mid X = x]          (irreducible error)
+ (\eta(x) - E_D[\hat{f}_D(x)])^2             (bias squared)
+ E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]   (variance)

Model too simple: high bias, cannot fit the data well.
Model too complex: high variance, small changes in the data change the learned function a lot.
Overfitting
Bias-Variance Tradeoff

The choice of hypothesis class introduces a learning bias:
More complex class: less bias.
More complex class: more variance.
But in practice?

Earlier we saw how enlarging the feature space increases the complexity of the learned estimator. Consider nested classes

F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots

\hat{f}_D^{(k)} = \arg\min_{f \in F_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2

Complexity grows as k grows.
Training set error as a function of model complexity

F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots,   D \sim i.i.d. P_{XY}

\hat{f}_D^{(k)} = \arg\min_{f \in F_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2

TRAIN error: \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}_D^{(k)}(x_i))^2
TRUE error: E_{XY}[(Y - \hat{f}_D^{(k)}(X))^2]
TEST error: with T \sim i.i.d. P_{XY},   \frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}_D^{(k)}(x_i))^2
Important: D \cap T = \emptyset

[Figure: train and test error vs. complexity k; each line is an i.i.d. draw of D or T. Plot from Hastie et al.]

TRAIN error is optimistically biased because it is evaluated on the data the model trained on. TEST error is unbiased only if T is never used to train the model or even to pick the complexity k.
Test set error

Given a dataset, randomly split it into two parts:
Training data: D
Test data: T
Important: D \cap T = \emptyset

Use the training data to learn a predictor, e.g.
\frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}_D^{(k)}(x_i))^2,
and use the training data to pick the complexity k (next lecture).

Use the test data to report predicted performance:
\frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}_D^{(k)}(x_i))^2
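The split-and-report recipe above can be sketched as follows; the synthetic data, the 90/10 split, and the degree k = 5 are arbitrary choices for this example:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)

# Random disjoint split into TRAIN (D) and TEST (T).
perm = rng.permutation(n)
train, test = perm[: int(0.9 * n)], perm[int(0.9 * n):]

k = 5                                        # model complexity (poly degree)
H = lambda t: np.vander(t, k + 1, increasing=True)

# Fit ONLY on the training data...
w, *_ = np.linalg.lstsq(H(x[train]), y[train], rcond=None)

# ...then report error on each split.
train_err = np.mean((y[train] - H(x[train]) @ w) ** 2)
test_err = np.mean((y[test] - H(x[test]) @ w) ** 2)
```

Only `test_err` is an honest estimate of future performance, and only because the test indices never touched the fit.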
Overfitting

Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that:

err_{TRAIN}(w) < err_{TRAIN}(w')   but   err_{TRUE}(w') < err_{TRUE}(w)
How many points do I use for training/testing?

A very hard question to answer!
Too few training points: the learned model is bad.
Too few test points: you never know whether you have reached a good solution.
Bounds, such as Hoeffding's inequality, can help. More on this later this quarter, but it is still hard to answer.
Typically:
If you have a reasonable amount of data, 90/10 splits are common.
If you have little data, then you need to get fancy (e.g., bootstrapping).
Recap

Learning is:
Collect some data (e.g., housing info and sale price).
Randomly split the dataset into TRAIN and TEST (e.g., 80% and 20%, respectively).
Choose a hypothesis class or model (e.g., linear).
Choose a loss function (e.g., least squares).
Choose an optimization procedure (e.g., set the derivative to zero to obtain the estimator).
Justify the accuracy of the estimate (e.g., report TEST error).