Linear Regression. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 2, 2017.


1 Linear Regression. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 2, 2017.

2 The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. [Figure: scatter plot of sale price vs. # square feet]

3 The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. [Figure: sale price vs. # square feet with best linear fit]

4 The regression problem in matrix notation. With $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$ and $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}$:
$\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 = \arg\min_w (y - Xw)^T (y - Xw)$

5 The regression problem in matrix notation. $\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w (y - Xw)^T(y - Xw)$. Setting the gradient to zero: $\nabla_w \|y - Xw\|_2^2 = -2X^T(y - Xw) = 0$, so $X^T y = X^T X w$. If $(X^T X)^{-1}$ exists, then $\hat{w}_{LS} = (X^T X)^{-1} X^T y$.
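A minimal NumPy sketch of this closed form on synthetic data (the data and names here are illustrative, not from the slides); in practice one solves the normal equations rather than forming the explicit inverse:

```python
import numpy as np

# Sketch: least squares via the normal equations X^T X w = X^T y.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])          # assumed ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve X^T X w = X^T y directly (better numerics than inverting X^T X).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is ill-conditioned:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)                        # both close to w_true
```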

6 The regression problem in matrix notation. $\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$. What about an offset?
$\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \sum_{i=1}^n \big(y_i - (x_i^T w + b)\big)^2 = \arg\min_{w,b} \|y - (Xw + 1b)\|_2^2$

7 Dealing with an offset. $\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \|y - (Xw + 1b)\|_2^2$. Setting both gradients to zero: $\nabla_w = -2X^T(y - Xw - 1b) = 0$ and $\nabla_b = -2 \cdot 1^T(y - Xw - 1b) = 0$. The second equation gives $\hat{b} = \frac{1}{n}\sum_{i=1}^n y_i - \frac{1}{n} 1^T X \hat{w}$.

8 Dealing with an offset. $\hat{w}_{LS}, \hat{b}_{LS} = \arg\min_{w,b} \|y - (Xw + 1b)\|_2^2$. The normal equations are
$X^T X \hat{w}_{LS} + \hat{b}_{LS} X^T 1 = X^T y$
$1^T X \hat{w}_{LS} + \hat{b}_{LS} 1^T 1 = 1^T y$
If $X^T 1 = 0$ (i.e., if each feature is mean-zero) then $\hat{w}_{LS} = (X^T X)^{-1} X^T y$ and $\hat{b}_{LS} = \frac{1}{n}\sum_{i=1}^n y_i$. Given a new $x$, predict $\hat{w}_{LS}^T (x - \mu) + \hat{b}_{LS}$, where $\mu$ is the training feature mean used to center $X$.
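A hedged sketch of the two standard ways to handle the offset in code (synthetic data; both should recover the same fit):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(loc=3.0, size=(n, d))          # deliberately not mean-zero
w_true, b_true = np.array([2.0, -1.0]), 5.0
y = X @ w_true + b_true + 0.1 * rng.normal(size=n)

# Option 1: append a constant-1 column and solve one joint LS problem.
X1 = np.hstack([X, np.ones((n, 1))])
wb, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_hat, b_hat = wb[:-1], wb[-1]

# Option 2: de-mean so X^T 1 = 0, fit w on centered data, recover b from means.
mu = X.mean(axis=0)
Xc, yc = X - mu, y - y.mean()
w_hat2 = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
b_hat2 = y.mean() - mu @ w_hat2               # so predictions are w^T x + b
print(w_hat, b_hat, w_hat2, b_hat2)
```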

9 The regression problem in matrix notation. $\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$. But why least squares? Consider $y_i = x_i^T w + \epsilon_i$ where $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$. Then
$P(y \mid x, w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - x^T w)^2}{2\sigma^2}\right)$

10 Maximizing log-likelihood. Maximize:
$\log P(D \mid w, \sigma) = \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} e^{-\sum_{i=1}^n (y_i - x_i^T w)^2 / 2\sigma^2}\right] = n \log\frac{1}{\sigma\sqrt{2\pi}} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T w)^2$
Maximizing over $w$ is therefore the same as minimizing the sum of squared errors.

11 MLE is LS under the linear model. $\hat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$ and $\hat{w}_{MLE} = \arg\max_w P(D \mid w, \sigma)$. If $y_i = x_i^T w + \epsilon_i$ with $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$, then
$\hat{w}_{LS} = \hat{w}_{MLE} = (X^T X)^{-1} X^T y$

12 Analysis of error. $Y = Xw + \epsilon$ if $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$; $\hat{w}_{MLE} = (X^T X)^{-1} X^T Y$. Substituting,
$\hat{w}_{MLE} = (X^T X)^{-1} X^T (Xw + \epsilon) = w + (X^T X)^{-1} X^T \epsilon$,
so $E[\hat{w}_{MLE}] = w$ (unbiased). For the covariance,
$E[(\hat{w} - w)(\hat{w} - w)^T] = (X^T X)^{-1} X^T E[\epsilon\epsilon^T] X (X^T X)^{-1}$, with $E[\epsilon\epsilon^T] = \sigma^2 I$.

13 Analysis of error. $Y = Xw + \epsilon$ if $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$;
$\hat{w}_{MLE} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (Xw + \epsilon) = w + (X^T X)^{-1} X^T \epsilon$
$\mathrm{Cov}(\hat{w}_{MLE}) = E[(\hat{w} - E[\hat{w}])(\hat{w} - E[\hat{w}])^T] = \sigma^2 (X^T X)^{-1}$
$\hat{w}_{MLE} \sim \mathcal{N}\big(w,\; \sigma^2 (X^T X)^{-1}\big)$
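A Monte Carlo sketch of this sampling distribution (assumed synthetic setup): holding $X$ fixed and redrawing the noise many times, the empirical mean and covariance of $\hat{w}$ should match $w$ and $\sigma^2 (X^T X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 50, 2, 0.5
X = rng.normal(size=(n, d))                   # fixed design
w = np.array([1.0, -1.0])

# Re-fit over many independent noise draws.
draws = np.stack([
    np.linalg.solve(X.T @ X, X.T @ (X @ w + sigma * rng.normal(size=n)))
    for _ in range(20000)
])

print(draws.mean(axis=0))                     # approx w (unbiased)
print(np.cov(draws.T))                        # approx sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```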

14 The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. [Figure: sale price vs. # square feet with best linear fit]

15 The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. [Figure: sale price vs. date of sale with best linear fit]

16 The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. Transformed data: (filled in on the next slide)

17 The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. Transformed data: $h : \mathbb{R}^d \to \mathbb{R}^p$ maps original features to a rich, possibly high-dimensional space,
$h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$.
In $d = 1$: $h(x) = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^p \end{bmatrix}$. For $d > 1$, generate $\{u_j\}_{j=1}^p \subset \mathbb{R}^d$ and use, e.g., $h_j(x) = \frac{1}{1 + \exp(u_j^T x)}$, $h_j(x) = (u_j^T x)^2$, or $h_j(x) = \cos(u_j^T x)$.
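A sketch of these feature maps in NumPy (the function names are illustrative, not from the slides); once $H$ is built, the least-squares fit is unchanged:

```python
import numpy as np

def poly_features(x, p):
    """d=1 case: h(x) = [x, x^2, ..., x^p] for a 1-d array x."""
    return np.stack([x**j for j in range(1, p + 1)], axis=-1)

def random_features(X, p, kind="cos", seed=0):
    """d>1 case: draw u_1..u_p and apply a fixed nonlinearity to u_j^T x."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(X.shape[1], p))      # columns are the u_j
    Z = X @ U                                 # n x p matrix of u_j^T x_i
    if kind == "cos":
        return np.cos(Z)
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(Z))        # matches slide: 1/(1 + exp(u_j^T x))
    return Z**2                               # quadratic features (u_j^T x)^2

# Fitting is unchanged: run least squares on H = h(X) instead of X.
```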

18 [Handwritten worked example: stack the transformed features into a matrix $H$ with rows $h(x_i)^T$, solve $\hat{w} = (H^T H)^{-1} H^T y$, and given a new $x$ predict $\hat{w}^T h(x)$.]

19 The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Transformed data: $h(x) = [h_1(x), h_2(x), \ldots, h_p(x)]^T$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$.

20 The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Transformed data: $h(x) = [h_1(x), h_2(x), \ldots, h_p(x)]^T$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$. [Figure: sale price vs. date of sale with best linear fit]

21 The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Transformed data: $h(x) = [h_1(x), h_2(x), \ldots, h_p(x)]^T$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$. [Figure: sale price vs. date of sale, small p fit]

22 The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. Transformed data: $h(x) = [h_1(x), h_2(x), \ldots, h_p(x)]^T$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$. [Figure: sale price vs. date of sale, large p fit] What's going on here?
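A small-vs-large $p$ sketch (assumed synthetic 1-d data and an assumed "true" trend) showing the effect the slides point at: a large-$p$ fit drives training error toward zero while error on fresh samples blows up:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = np.sort(rng.uniform(-1, 1, size=n))
f = lambda t: np.sin(3 * t)                   # assumed true trend
y = f(x) + 0.3 * rng.normal(size=n)

for p in (2, 25):
    # Polynomial features including a constant term for the offset.
    H = np.stack([x**j for j in range(p + 1)], axis=1)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)

    # Fresh samples from the same distribution.
    x_new = rng.uniform(-1, 1, size=1000)
    H_new = np.stack([x_new**j for j in range(p + 1)], axis=1)
    y_new = f(x_new) + 0.3 * rng.normal(size=1000)

    print(p, np.mean((H @ w - y)**2), np.mean((H_new @ w - y_new)**2))
```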

23 Bias-Variance Tradeoff. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 5, 2017.

24 Statistical Learning. $P_{XY}(X = x, Y = y)$. Goal: predict $Y$ given $X$. Find the function $\eta$ that minimizes $E_{XY}[(Y - \eta(X))^2]$. Working pointwise: $E_{XY}[(Y - \eta(X))^2] = E_X\big[E_{Y|X}[(Y - \eta(x))^2 \mid X = x]\big]$, and minimizing $E_{Y|X}[(Y - c)^2 \mid X = x]$ over $c$ by setting the derivative $2c - 2E_{Y|X}[Y \mid X = x] = 0$ gives $c = E_{Y|X}[Y \mid X = x]$.

25 Statistical Learning. $P_{XY}(X = x, Y = y)$. Goal: predict $Y$ given $X$. Find the function that minimizes
$E_{XY}[(Y - \eta(X))^2] = E_X\big[E_{Y|X}[(Y - \eta(x))^2 \mid X = x]\big]$
Pointwise, $\eta(x) = \arg\min_c E_{Y|X}[(Y - c)^2 \mid X = x] = E_{Y|X}[Y \mid X = x]$. Under the LS loss, the optimal predictor is $\eta(x) = E_{Y|X}[Y \mid X = x]$.
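A tiny numeric check of this pointwise claim (illustrative setup): among constants $c$, the average squared loss over samples of $Y$ is minimized at the sample mean, mirroring $\eta(x) = E[Y \mid X = x]$:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=2.0, scale=1.0, size=100000)   # samples of Y at a fixed x

cs = np.linspace(0.0, 4.0, 401)
losses = [np.mean((y - c)**2) for c in cs]        # empirical E[(Y - c)^2]
print(cs[int(np.argmin(losses))], y.mean())       # both approx 2.0
```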

26 Statistical Learning. $E_{XY}[(Y - \eta(X))^2]$, $P_{XY}(X = x, Y = y)$. [Figure: joint distribution $P_{XY}$ over the $(x, y)$ plane]

27 Statistical Learning. $E_{XY}[(Y - \eta(X))^2]$, $P_{XY}(X = x, Y = y)$. [Figure: joint distribution with the conditional slices $P_{XY}(Y = y \mid X = x_0)$ and $P_{XY}(Y = y \mid X = x_1)$ highlighted]

28 Statistical Learning. $E_{XY}[(Y - \eta(X))^2]$, $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{Y|X}[Y \mid X = x]$. [Figure: conditional distributions $P_{XY}(Y = y \mid X = x_0)$ and $P_{XY}(Y = y \mid X = x_1)$]

29 Statistical Learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{Y|X}[Y \mid X = x]$. [Figure: the curve $\eta(x)$ drawn over the joint distribution]

30 Statistical Learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{Y|X}[Y \mid X = x]$. But we only have samples: $(x_i, y_i) \overset{i.i.d.}{\sim} P_{XY}$ for $i = 1, \ldots, n$. [Figure: sampled points over the joint distribution]

31 Statistical Learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{Y|X}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{i.i.d.}{\sim} P_{XY}$ for $i = 1, \ldots, n$, and we are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute:
$\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$

32 Statistical Learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{Y|X}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{i.i.d.}{\sim} P_{XY}$ for $i = 1, \ldots, n$, and we are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$. We care about future predictions: $E_{XY}[(Y - \hat{f}(X))^2]$.

33 Statistical Learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{Y|X}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{i.i.d.}{\sim} P_{XY}$ for $i = 1, \ldots, n$, and we are restricted to a function class $\mathcal{F}$, so we compute $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$. Each draw $D = \{(x_i, y_i)\}_{i=1}^n$ results in a different $\hat{f}$.

34 Statistical Learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{Y|X}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{i.i.d.}{\sim} P_{XY}$, and we are restricted to a function class $\mathcal{F}$, so we compute $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$. Each draw $D = \{(x_i, y_i)\}_{i=1}^n$ results in a different $\hat{f}$. [Figure: several fits $\hat{f}$ from different draws of $D$, together with their average $E_D[\hat{f}(x)]$]

35 Bias-Variance Tradeoff. $\eta(x) = E_{Y|X}[Y \mid X = x]$, $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2$.
$E_{Y|X}\big[E_D[(Y - \hat{f}_D(x))^2] \mid X = x\big] = E_{Y|X}\big[E_D[(Y - \eta(x) + \eta(x) - \hat{f}_D(x))^2] \mid X = x\big]$
(add and subtract $\eta(x)$; the expansion is carried out on the next slide)

36 Bias-Variance Tradeoff. $\eta(x) = E_{Y|X}[Y \mid X = x]$, $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2$.
$E_{Y|X}\big[E_D[(Y - \hat{f}_D(x))^2] \mid X = x\big]$
$= E_{Y|X}\big[E_D[(Y - \eta(x) + \eta(x) - \hat{f}_D(x))^2] \mid X = x\big]$
$= E_{Y|X}\big[E_D[(Y - \eta(x))^2 + 2(Y - \eta(x))(\eta(x) - \hat{f}_D(x)) + (\eta(x) - \hat{f}_D(x))^2] \mid X = x\big]$
$= E_{Y|X}[(Y - \eta(x))^2 \mid X = x] + E_D[(\eta(x) - \hat{f}_D(x))^2]$
The cross term vanishes because $E_{Y|X}[Y - \eta(x) \mid X = x] = 0$ and the test draw of $Y$ is independent of $D$. The first term is the irreducible error, caused by stochastic label noise. The second is the learning error, caused by either using too simple a model or not having enough data to learn the model accurately.

37 Bias-Variance Tradeoff. $\eta(x) = E_{Y|X}[Y \mid X = x]$, $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2$.
$E_D[(\eta(x) - \hat{f}_D(x))^2] = E_D\big[(\eta(x) - E_D[\hat{f}_D(x)] + E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2\big]$
(add and subtract $E_D[\hat{f}_D(x)]$; expanded on the next slide)

38 Bias-Variance Tradeoff. $\eta(x) = E_{Y|X}[Y \mid X = x]$, $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2$.
$E_D[(\eta(x) - \hat{f}_D(x))^2] = E_D\big[(\eta(x) - E_D[\hat{f}_D(x)] + E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2\big]$
$= E_D\big[(\eta(x) - E_D[\hat{f}_D(x)])^2 + 2(\eta(x) - E_D[\hat{f}_D(x)])(E_D[\hat{f}_D(x)] - \hat{f}_D(x)) + (E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2\big]$
$= \underbrace{(\eta(x) - E_D[\hat{f}_D(x)])^2}_{\text{bias squared}} + \underbrace{E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]}_{\text{variance}}$
The cross term vanishes since $E_D[E_D[\hat{f}_D(x)] - \hat{f}_D(x)] = 0$.

39 Bias-Variance Tradeoff.
$E_{Y|X}\big[E_D[(Y - \hat{f}_D(x))^2] \mid X = x\big] = \underbrace{E_{Y|X}[(Y - \eta(x))^2 \mid X = x]}_{\text{irreducible error}} + \underbrace{(\eta(x) - E_D[\hat{f}_D(x)])^2}_{\text{bias squared}} + \underbrace{E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]}_{\text{variance}}$
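A Monte Carlo sketch of this decomposition at a fixed test point (all names and the polynomial fit are illustrative assumptions, not from the slides): repeatedly redraw the training set $D$, refit, and estimate $E_D[\hat{f}_D(x_0)]$, the bias squared, and the variance:

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, p, x0 = 30, 0.3, 3, 0.5
eta = lambda t: np.sin(3 * t)                 # assumed eta(x) = E[Y | X = x]

preds = []
for _ in range(5000):                         # independent draws of D
    x = rng.uniform(-1, 1, size=n)
    y = eta(x) + sigma * rng.normal(size=n)
    H = np.stack([x**j for j in range(p + 1)], axis=1)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    preds.append(np.array([x0**j for j in range(p + 1)]) @ w)
preds = np.array(preds)

bias_sq = (eta(x0) - preds.mean())**2         # (eta(x0) - E_D[f_D(x0)])^2
variance = preds.var()                        # E_D[(E_D[f_D(x0)] - f_D(x0))^2]
print(bias_sq, variance, sigma**2)            # plus sigma^2 irreducible error
```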

40 Example: Linear LS. $Y = Xw + \epsilon$ if $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$; $\hat{w}_{MLE} = (X^T X)^{-1} X^T Y = w + (X^T X)^{-1} X^T \epsilon$.
$\eta(x) = E_{Y|X}[Y \mid X = x] = E[x^T w + \epsilon \mid X = x] = x^T w$
$\hat{f}_D(x) = \hat{w}^T x$, and $E_D[\hat{f}_D(x)] = E_D\big[(w + (X^T X)^{-1} X^T \epsilon)^T x\big] = w^T x$.

41 Example: Linear LS. (model as above) $\hat{f}_D(x) = \hat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x$.
Irreducible error: $E_{XY}[(Y - \eta(x))^2 \mid X = x] = \sigma^2$.
Bias squared: $(\eta(x) - E_D[\hat{f}_D(x)])^2 = 0$.

42 Example: Linear LS. (model as above) $\hat{f}_D(x) = \hat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x$.
Variance: $E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2] = E_D\big[x^T (X^T X)^{-1} X^T \epsilon \epsilon^T X (X^T X)^{-1} x\big] = \sigma^2 x^T (X^T X)^{-1} x$.

43 Example: Linear LS. (model as above)
$E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2] = E_D\big[x^T (X^T X)^{-1} X^T \epsilon \epsilon^T X (X^T X)^{-1} x\big] = \sigma^2 x^T (X^T X)^{-1} x = \sigma^2 \,\mathrm{Trace}\big((X^T X)^{-1} x x^T\big)$
For $n$ large: $\frac{1}{n} X^T X = \frac{1}{n}\sum_{i=1}^n x_i x_i^T \to \Sigma = E[X X^T]$, so averaging over the test point $X \sim P_X$,
$E_{X,D}\big[(E_D[\hat{f}_D(X)] - \hat{f}_D(X))^2\big] \approx \frac{\sigma^2}{n} E_X[\mathrm{Trace}(\Sigma^{-1} X X^T)] = \frac{\sigma^2}{n}\mathrm{Trace}(\Sigma^{-1} E[XX^T]) = \frac{d\,\sigma^2}{n}$

44 Example: Linear LS. (model as above) $\eta(x) = E_{Y|X}[Y \mid X = x] = x^T w$ and $\hat{f}_D(x) = \hat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x$. Putting it together:
irreducible error $E_{XY}[(Y - \eta(x))^2 \mid X = x] = \sigma^2$; bias squared $(\eta(x) - E_D[\hat{f}_D(x)])^2 = 0$; variance $E_{X,D}\big[(E_D[\hat{f}_D(X)] - \hat{f}_D(X))^2\big] \approx \frac{d\,\sigma^2}{n}$.
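A simulation sketch checking the $d\sigma^2/n$ rate (assumed synthetic Gaussian design): under the well-specified linear model the bias is zero, so the average excess risk $E_{X,D}[(\eta(X) - \hat{f}_D(X))^2]$ is pure variance and should scale like $d\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(6)
d, sigma = 10, 1.0
w = np.ones(d)

for n in (50, 200, 800):
    errs = []
    for _ in range(500):                      # draws of the training set D
        X = rng.normal(size=(n, d))
        y = X @ w + sigma * rng.normal(size=n)
        w_hat = np.linalg.solve(X.T @ X, X.T @ y)
        x_test = rng.normal(size=(1000, d))   # fresh test points ~ P_X
        errs.append(np.mean((x_test @ (w_hat - w))**2))
    print(n, np.mean(errs), d * sigma**2 / n) # empirical vs. d*sigma^2/n
```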
