The key is to understand an estimator as a random variable

Marybeth Adams

5 The key is to understand an estimator as a random variable

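8 IID from $P(X, Y)$: outputs $y_1, y_2, \dots, y_n$ observed for inputs $x_1, x_2, \dots, x_n$; the output for a new input $x_{n+1}$ is unknown (?).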

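9 $\underset{w \in \mathbb{R}^d}{\text{minimize}} \; \sum_{i=1}^{n} (y_i - x_i^\top w)^2 + \lambda \|w\|^2$. The first term is the training error; the second is the regularization (ridge) term ($\lambda$: regularization constant). Note: this can be interpreted as maximum a posteriori (MAP) estimation with a Gaussian likelihood and a Gaussian prior.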
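The MAP reading can be made explicit with a short derivation. This is a sketch under assumed notation: the noise variance $\sigma^2$ and the prior precision $\alpha$ are symbols chosen here for illustration. Suppose $y_i \mid x_i, w \sim \mathcal{N}(x_i^\top w, \sigma^2)$ independently and $w \sim \mathcal{N}(0, \alpha^{-1} I_d)$. Then

$$
\begin{aligned}
-\log p(w \mid y) &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top w)^2 + \frac{\alpha}{2} \|w\|^2 + \text{const}, \\
\arg\max_w \, p(w \mid y) &= \arg\min_w \; \sum_{i=1}^{n} (y_i - x_i^\top w)^2 + \sigma^2 \alpha \, \|w\|^2 .
\end{aligned}
$$

Under these assumptions the ridge objective is recovered with $\lambda = \sigma^2 \alpha$.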

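10 $\underset{w}{\text{minimize}} \; \|y - Xw\|^2 + \lambda \|w\|^2$ (training error plus regularization (ridge) term; $\lambda$: regularization constant), with target output $y = (y_1, y_2, \dots, y_n)^\top$ and design matrix $X = (x_1, x_2, \dots, x_n)^\top$, whose $i$-th row is $x_i^\top$. Note: this can be interpreted as maximum a posteriori (MAP) estimation with a Gaussian likelihood and a Gaussian prior.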

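11 Design matrix with one row per sample and one column per feature (Size, #rooms, Bathroom, Sunlight, Neighborhood, Train st.): $X = \begin{pmatrix} x_1^{1} & x_1^{2} & \cdots & x_1^{p} \\ x_2^{1} & x_2^{2} & \cdots & x_2^{p} \\ \vdots & \vdots & & \vdots \\ x_n^{1} & x_n^{2} & \cdots & x_n^{p} \end{pmatrix}$, where the subscript indexes the sample and the superscript indexes the feature.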

12 $-X^\top (y - Xw) + \lambda w = 0$, which gives $\hat{w} = (X^\top X + \lambda I_d)^{-1} X^\top y$ ($I_d$: $d \times d$ identity matrix). The solution can also be written as (exercise) $\hat{w} = X^\top (X X^\top + \lambda I_n)^{-1} y$.
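The equality of the two closed forms can be checked numerically before attempting the exercise. The following is a minimal sketch in dependency-free Python; the helper names (`solve`, `ridge_primal`, `ridge_dual`) and the toy data are illustrative assumptions, not part of the slides.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def add_ridge(A, lam):
    """Return A + lam * I for a square matrix A."""
    n = len(A)
    return [[A[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][k] * z[k] for k in range(i + 1, n))) / M[i][i]
    return z

def ridge_primal(X, y, lam):
    """w = (X^T X + lam I_d)^{-1} X^T y, a d x d linear system."""
    Xt = transpose(X)
    A = add_ridge(matmul(Xt, X), lam)
    b = [sum(v * t for v, t in zip(row, y)) for row in Xt]  # X^T y
    return solve(A, b)

def ridge_dual(X, y, lam):
    """w = X^T (X X^T + lam I_n)^{-1} y, an n x n linear system."""
    Xt = transpose(X)
    K = add_ridge(matmul(X, Xt), lam)
    a = solve(K, y)
    return [sum(v * t for v, t in zip(row, a)) for row in Xt]  # X^T a

# Toy data (n = 4 samples, d = 2 features), chosen only for illustration.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
y = [1.0, -0.5, 2.0, 0.3]
w_primal = ridge_primal(X, y, lam=0.1)
w_dual = ridge_dual(X, y, lam=0.1)
print(w_primal, w_dual)
```

The first form solves a $d \times d$ system and the second an $n \times n$ system, so the cheaper one depends on whether there are more features or more samples.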

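13 $y = w_1 x^{d-1} + \cdots + w_{d-1} x + w_d + \text{noise} = (x^{d-1}, \dots, x, 1) \, (w_1, \dots, w_{d-1}, w_d)^\top + \text{noise}$. Design matrix: $X = \begin{pmatrix} x_1^{d-1} & \cdots & x_1^{2} & x_1 & 1 \\ x_2^{d-1} & \cdots & x_2^{2} & x_2 & 1 \\ \vdots & & \vdots & \vdots & \vdots \\ x_n^{d-1} & \cdots & x_n^{2} & x_n & 1 \end{pmatrix}$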

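14 [Figure: true function, learned function, and training examples for $\lambda = 0.001$; the numeric values of the true and learned $w$ are not recoverable.]

15 [Figure: true function, learned function, and training examples for $\lambda = 0.01$; the numeric values of the true and learned $w$ are not recoverable.]

16 [Figure: true function, learned function, and training examples for $\lambda = 0.1$; the numeric values of the true and learned $w$ are not recoverable.]

17 [Figure: true function, learned function, and training examples for $\lambda = 1$; the numeric values of the true and learned $w$ are not recoverable.]

18 [Figure: true function, learned function, and training examples for $\lambda = 10$; the numeric values of the true and learned $w$ are not recoverable.]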

19 Outputs to be predicted: $y$ = orange (+1) or lemon (-1)

20 USPS digits dataset: 7291 training samples, 2007 test samples. http://www-stat-class.stanford.edu/~tibs/elemstatlearn/datasets/zip.info [Figure: digit labels $y$; axis: number of samples.]

21 We can obtain 88% accuracy on a held-out test set using about 7300 training examples. A machine can learn! (using a very simple learning algorithm) [Plot: accuracy (%) against number of samples; the value of $\lambda$ is not recoverable.]

22 W=(X'*X+lambda*eye(d))\(X'*Y);

24 [Plot: USPS digits, $\lambda = 10^{-6}$; accuracy (%) against number of samples (n). An annotation marks the number of pixels (16x16) in the image.]

25 Breast Cancer Wisconsin: 30 real-valued features (radius, texture, perimeter, area, etc.). [Plot: accuracy (%) against number of samples (n); the value of $\lambda$ is not recoverable.]

26 SPECT Heart, p=22: 22 binary features. [Plot: accuracy (%) against number of samples (n); the value of $\lambda$ is not recoverable.]

27 Spambase, p=57: 55 real-valued features (word frequency, character frequency) and 2 integer-valued features (run-length). [Plot: accuracy (%) against number of samples (n); the value of $\lambda$ is not recoverable.]

28 musk, p=166: 166 real-valued features. [Plot: accuracy (%) against number of samples (n); the value of $\lambda$ is not recoverable.]

30 $y = Xw^* + \xi$, $\hat{w} = (X^\top X + \lambda I_d)^{-1} X^\top y$, $\mathrm{Err}(\hat{w}) = \mathbb{E}_\xi \|\hat{w} - w^*\|^2$. The estimator is a random variable!

33 $\mathbb{E}_\xi \|\hat{w} - w^*\|^2 \neq \|\mathbb{E}_\xi \hat{w} - w^*\|^2$

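34 $\mathbb{E}_\xi \|\hat{w} - w^*\|^2 = \mathbb{E}_\xi \|\hat{w} - \bar{w}\|^2 + \|\bar{w} - w^*\|^2$, where $\bar{w} = \mathbb{E}_\xi \hat{w}$. Bias (second term): error coming from the model/design matrix, i.e. under-fitting. Variance (first term): error caused by the noise, i.e. over-fitting. [Figure: the points $\hat{w}$, $\bar{w}$, and $w^*$.]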
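The decomposition follows from expanding the square around the mean estimate. A short derivation, writing $\bar{w} = \mathbb{E}_\xi \hat{w}$ for the mean of the estimator over the noise (the noise symbol $\xi$ is notation assumed here):

$$
\begin{aligned}
\mathbb{E}_\xi \|\hat{w} - w^*\|^2
&= \mathbb{E}_\xi \|(\hat{w} - \bar{w}) + (\bar{w} - w^*)\|^2 \\
&= \mathbb{E}_\xi \|\hat{w} - \bar{w}\|^2 + 2\, \mathbb{E}_\xi [\hat{w} - \bar{w}]^\top (\bar{w} - w^*) + \|\bar{w} - w^*\|^2 \\
&= \mathbb{E}_\xi \|\hat{w} - \bar{w}\|^2 + \|\bar{w} - w^*\|^2 ,
\end{aligned}
$$

where the cross term vanishes because $\bar{w} - w^*$ is not random and $\mathbb{E}_\xi [\hat{w} - \bar{w}] = 0$.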

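36 $y = Xw^* + \xi$, $\mathbb{E}[\xi] = 0$, $\mathrm{Cov}(\xi) = \sigma^2 I_n$. Then $\mathbb{E}[\hat{w}] = (\hat{\Sigma} + \lambda_n I_d)^{-1} \hat{\Sigma} w^*$ and $\mathrm{Cov}(\hat{w}) = \frac{\sigma^2}{n} (\hat{\Sigma} + \lambda_n I_d)^{-1} \hat{\Sigma} (\hat{\Sigma} + \lambda_n I_d)^{-1}$, where $\lambda_n := \lambda / n$ and $\hat{\Sigma} = \frac{1}{n} X^\top X$.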
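Both moments can be checked by substituting the model into the estimator. A short derivation, assuming the notation $y = Xw^* + \xi$ with $\mathbb{E}[\xi] = 0$, $\mathrm{Cov}(\xi) = \sigma^2 I_n$, $\hat{\Sigma} = \frac{1}{n} X^\top X$ and $\lambda_n = \lambda / n$ (the symbols $\xi$ and $\sigma^2$ are chosen here for the noise and its variance):

$$
\begin{aligned}
\hat{w} &= (X^\top X + \lambda I_d)^{-1} X^\top (X w^* + \xi)
= (\hat{\Sigma} + \lambda_n I_d)^{-1} \hat{\Sigma}\, w^* + \tfrac{1}{n} (\hat{\Sigma} + \lambda_n I_d)^{-1} X^\top \xi, \\
\mathbb{E}[\hat{w}] &= (\hat{\Sigma} + \lambda_n I_d)^{-1} \hat{\Sigma}\, w^*, \\
\mathrm{Cov}(\hat{w}) &= \tfrac{1}{n^2} (\hat{\Sigma} + \lambda_n I_d)^{-1} X^\top (\sigma^2 I_n) X (\hat{\Sigma} + \lambda_n I_d)^{-1}
= \tfrac{\sigma^2}{n} (\hat{\Sigma} + \lambda_n I_d)^{-1} \hat{\Sigma}\, (\hat{\Sigma} + \lambda_n I_d)^{-1} .
\end{aligned}
$$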

37 plotellipse(mu, Sigma, color, width, marker_size), with mu = $\mathbb{E}[\hat{w}]$ and Sigma = $\mathrm{Cov}(\hat{w})$

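38 Bias: $\|\bar{w} - w^*\|^2 = \|\mathbb{E}_\xi \hat{w} - w^*\|^2 = \lambda_n^2 \|(\hat{\Sigma} + \lambda_n I_d)^{-1} w^*\|^2$ ($\lambda_n := \lambda / n$). Variance: $\mathbb{E}_\xi \|\hat{w} - \bar{w}\|^2 = \mathrm{Tr}(\mathrm{Cov}(\hat{w}))$. [Figure labels: $\mathbb{E}[\hat{w}]$, $\mathrm{Cov}(\hat{w})$.]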
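The squared-bias and variance expressions can be compared against a Monte Carlo estimate of the estimation error. The sketch below restricts itself to a single feature ($d = 1$) so that it needs no linear algebra library; in that case $\hat{\Sigma}$ is the scalar $\frac{1}{n} \sum_i x_i^2$. The function names, the seed, and the parameter values are illustrative assumptions.

```python
import random

def ridge_1d(x, y, lam):
    """Ridge estimate for a single feature: sum(x*y) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def theory(x, w_star, sigma, lam):
    """Squared bias and variance of the 1-D ridge estimator for fixed inputs x."""
    n = len(x)
    sigma_hat = sum(a * a for a in x) / n      # (1/n) X^T X, a scalar here
    lam_n = lam / n
    bias2 = (lam_n * w_star / (sigma_hat + lam_n)) ** 2
    var = sigma ** 2 / n * sigma_hat / (sigma_hat + lam_n) ** 2
    return bias2, var

def simulate(x, w_star, sigma, lam, trials, rng):
    """Average squared estimation error over fresh noise draws, inputs fixed."""
    total = 0.0
    for _ in range(trials):
        y = [w_star * a + rng.gauss(0.0, sigma) for a in x]
        total += (ridge_1d(x, y, lam) - w_star) ** 2
    return total / trials

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(10)]
bias2, var = theory(x, w_star=2.0, sigma=0.5, lam=1.0)
mc_error = simulate(x, w_star=2.0, sigma=0.5, lam=1.0, trials=20000, rng=rng)
print(bias2, var, bias2 + var, mc_error)
```

Only the noise is redrawn between trials while the inputs stay fixed, which matches the expectation $\mathbb{E}_\xi$ used in the formulas; the simulated error should land close to the sum of the two analytic terms.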

39 [Figure labels: $\|w^*\|^2$, $\hat{\Sigma}$.]

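40 [Figure: "Ridge Regression: number of variables=100, lambda=1e-06". Curves: simulation, bias$^2$, variance. Vertical axis: estimation error between $\hat{w}$ and $w^*$; horizontal axis: number of samples n.]

41 [Figure: "Ridge Regression: number of variables=100, lambda=0.001". Curves: simulation, bias$^2$, variance. Vertical axis: estimation error between $\hat{w}$ and $w^*$; horizontal axis: number of samples n.]

42 [Figure: "Ridge Regression: number of variables=100, lambda=1". Curves: simulation, bias$^2$, variance. Vertical axis: estimation error between $\hat{w}$ and $w^*$; horizontal axis: number of samples n.]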

43 $\mathbb{E}_\xi \|\hat{w} - w^*\|^2$; $\mathrm{Gen}(x) = \mathbb{E}_\xi (x^\top w^* - x^\top \hat{w})^2 = \mathbb{E}_\xi \{x^\top (w^* - \hat{w})\}^2$ exp_frequentists_errorbar.m

44 $w^* - \hat{w} = (w^* - \bar{w}) + (\bar{w} - \hat{w})$ exp_frequentists_errorbar.m

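45 $\mathrm{Gen}(x) = \mathbb{E}_\xi \{x^\top (w^* - \hat{w})\}^2 = \lambda_n^2 \{x^\top \hat{\Sigma}_{\lambda_n}^{-1} w^*\}^2 + \frac{\sigma^2}{n} x^\top \hat{\Sigma}_{\lambda_n}^{-1} \hat{\Sigma} \hat{\Sigma}_{\lambda_n}^{-1} x$ ($\hat{\Sigma}_{\lambda_n} := \hat{\Sigma} + \lambda_n I_d$). The first term can be bounded, $\{x^\top \hat{\Sigma}_{\lambda_n}^{-1} w^*\}^2 \le \|\hat{\Sigma}_{\lambda_n}^{-1} x\|^2 \|w^*\|^2$, or averaged, $\mathbb{E}_{w^*} \{x^\top \hat{\Sigma}_{\lambda_n}^{-1} w^*\}^2 = \alpha^{-1} \|\hat{\Sigma}_{\lambda_n}^{-1} x\|^2$, assuming $\mathbb{E}_{w^*}[w^* {w^*}^\top] = \alpha^{-1} I_d$.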

46 [Figure: n=20 (the value of d is not recoverable). Legend: true error, predicted error, true function, learned function, samples. Axes: input (horizontal), output (vertical).]

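48 $\frac{1}{n} \mathbb{E}_\xi \|y - X\hat{w}\|^2 + \frac{2\sigma^2}{n} \mathrm{Tr}\big(\hat{\Sigma} (\hat{\Sigma} + \lambda_n I_d)^{-1}\big) = \sigma^2 + \mathbb{E}_\xi (\hat{w} - \bar{w})^\top \hat{\Sigma} (\hat{w} - \bar{w}) + (\bar{w} - w^*)^\top \hat{\Sigma} (\bar{w} - w^*)$ ($\lambda_n := \lambda / n$). $\mathrm{Tr}\big(\hat{\Sigma} (\hat{\Sigma} + \lambda_n I_d)^{-1}\big)$: known as the effective degrees of freedom.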
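The name "effective degrees of freedom" is easier to read off in the eigenbasis of $\hat{\Sigma}$. As a sketch, assuming $\hat{\Sigma} = \frac{1}{n} X^\top X$ has eigenvalues $s_1, \dots, s_d \ge 0$ (a symbol introduced here) and $\lambda_n = \lambda / n > 0$:

$$
\mathrm{Tr}\big(\hat{\Sigma} (\hat{\Sigma} + \lambda_n I_d)^{-1}\big) = \sum_{j=1}^{d} \frac{s_j}{s_j + \lambda_n} .
$$

Each direction contributes a value between 0 and 1: directions with $s_j \gg \lambda_n$ count as almost one full parameter, while directions with $s_j \ll \lambda_n$ are effectively switched off. The sum approaches $\mathrm{rank}(\hat{\Sigma})$ as $\lambda_n \to 0$ and approaches 0 as $\lambda_n \to \infty$.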

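49 $\frac{1}{n} \mathbb{E}_{\xi'} \mathbb{E}_\xi \|y' - X\hat{w}\|^2 = \frac{1}{n} \mathbb{E}_\xi \|y - X\hat{w}\|^2 + \frac{2\sigma^2}{n} \mathrm{Tr}\big(\hat{\Sigma} (\hat{\Sigma} + \lambda_n I_d)^{-1}\big)$, where $y' = Xw^* + \xi'$ is an independent copy of the outputs.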

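50 Leave-one-out cross-validation, with $\hat{w}_{\setminus i}$ the estimate computed without the $i$-th sample: $\sum_{i=1}^{n} (y_i - x_i^\top \hat{w}_{\setminus i})^2 = \sum_{i=1}^{n} \left( \frac{y_i - x_i^\top \hat{w}}{1 - S(i, i)} \right)^2$, where $S = X (X^\top X + \lambda I_d)^{-1} X^\top$.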
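The leave-one-out identity can be verified by brute force on a small problem. The sketch below again uses a single feature so that it stays dependency-free; in that case $S(i, i) = x_i^2 / (\sum_j x_j^2 + \lambda)$. The function names and the data are illustrative assumptions.

```python
def ridge_1d(x, y, lam):
    """Ridge estimate for a single feature: sum(x*y) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def loocv_explicit(x, y, lam):
    """Refit n times, each time leaving one sample out, and sum the squared errors."""
    total = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        w_minus_i = ridge_1d(x_rest, y_rest, lam)
        total += (y[i] - x[i] * w_minus_i) ** 2
    return total

def loocv_shortcut(x, y, lam):
    """Fit once and rescale each residual by 1 / (1 - S(i, i))."""
    w_hat = ridge_1d(x, y, lam)
    denom = sum(a * a for a in x) + lam
    total = 0.0
    for xi, yi in zip(x, y):
        s_ii = xi * xi / denom                 # diagonal entry of S in one dimension
        total += ((yi - xi * w_hat) / (1.0 - s_ii)) ** 2
    return total

# Toy data chosen only for illustration.
x = [0.5, -1.2, 2.0, 0.3, -0.7]
y = [1.1, -2.0, 4.2, 0.4, -1.6]
explicit = loocv_explicit(x, y, lam=0.5)
shortcut = loocv_shortcut(x, y, lam=0.5)
print(explicit, shortcut)
```

The shortcut needs one fit instead of $n$, which is what makes the formula attractive for choosing $\lambda$.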

51 $\hat{\Sigma} \simeq$ (the right-hand side is not recoverable)

55 $\mathrm{Objective}(\hat{w}) \le \mathrm{Objective}(w^*)$

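57 $w \sim \mathcal{N}(0, \alpha^{-1} I_d)$, $\xi \sim \mathcal{N}(0, \sigma^2 I_n)$, $y = Xw + \xi$. Then $w \mid y \sim \mathcal{N}(\mu, C)$ with $\mu := (X^\top X + \sigma^2 \alpha I_d)^{-1} X^\top y$ and $C := \sigma^2 (X^\top X + \sigma^2 \alpha I_d)^{-1}$.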
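The posterior mean and covariance follow from completing the square in $w$. A sketch, assuming the prior precision is written $\alpha$ and the noise variance $\sigma^2$ (symbols chosen here):

$$
\begin{aligned}
\log p(w \mid y) &= -\frac{1}{2\sigma^2} \|y - Xw\|^2 - \frac{\alpha}{2} \|w\|^2 + \text{const} \\
&= -\frac{1}{2} w^\top \Big( \frac{1}{\sigma^2} X^\top X + \alpha I_d \Big) w + \frac{1}{\sigma^2} w^\top X^\top y + \text{const} .
\end{aligned}
$$

Matching this with $-\frac{1}{2} (w - \mu)^\top C^{-1} (w - \mu)$ gives $C^{-1} = \frac{1}{\sigma^2} X^\top X + \alpha I_d$, so $C = \sigma^2 (X^\top X + \sigma^2 \alpha I_d)^{-1}$, and $C^{-1} \mu = \frac{1}{\sigma^2} X^\top y$, so $\mu = (X^\top X + \sigma^2 \alpha I_d)^{-1} X^\top y$.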

58 exp_bayesian_regression.m

59 $p(w \mid y) = \underset{q(w)}{\operatorname{argmin}} \; \mathbb{E}_{w \sim q(w)} [-\log p(y \mid w)] + D(q \| p)$, subject to $\int q(w) \, dw = 1$. $p(w)$: prior distribution.

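60 $y_{n+1} \mid x_{n+1}, y \sim \mathcal{N}(x_{n+1}^\top \mu, \; \sigma^2 + x_{n+1}^\top C x_{n+1})$ and $y_{n+1} \mid x_{n+1}, y \sim \mathcal{N}(x_{n+1}^\top \hat{w}, \; \sigma^2)$. Note: $\hat{w} = \mu$ if $\lambda = \sigma^2 \alpha$.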

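61 $D\big(p_{w^*}(y_{n+1} \mid x_{n+1}) \,\|\, \hat{p}(y_{n+1} \mid x_{n+1})\big) = \frac{\{x_{n+1}^\top (w^* - \hat{w})\}^2}{2 \sigma_{\mathrm{pred}}^2} + \frac{1}{2} \left( \frac{\sigma^2}{\sigma_{\mathrm{pred}}^2} + \log \frac{\sigma_{\mathrm{pred}}^2}{\sigma^2} - 1 \right)$, where $p_{w^*}(y_{n+1} \mid x_{n+1})$: $y_{n+1} \mid x_{n+1} \sim \mathcal{N}(x_{n+1}^\top w^*, \sigma^2)$ and $\hat{p}(y_{n+1} \mid x_{n+1})$: $y_{n+1} \mid x_{n+1} \sim \mathcal{N}(x_{n+1}^\top \hat{w}, \sigma_{\mathrm{pred}}^2)$.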

63 $\sigma_{\mathrm{pred}}^2 = \sigma^2 + \{x_{n+1}^\top (w^* - \hat{w})\}^2$
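One reading of this formula is that it gives the predictive variance minimizing the KL divergence between the true conditional $\mathcal{N}(x_{n+1}^\top w^*, \sigma^2)$ and a prediction $\mathcal{N}(x_{n+1}^\top \hat{w}, \sigma_{\mathrm{pred}}^2)$. A sketch of that calculation, with the shorthand $a := \{x_{n+1}^\top (w^* - \hat{w})\}^2$ and $s := \sigma_{\mathrm{pred}}^2$ introduced here:

$$
\begin{aligned}
f(s) &= \frac{a}{2s} + \frac{1}{2} \left( \frac{\sigma^2}{s} + \log \frac{s}{\sigma^2} - 1 \right), \\
f'(s) &= -\frac{a + \sigma^2}{2 s^2} + \frac{1}{2s} = \frac{s - (a + \sigma^2)}{2 s^2} .
\end{aligned}
$$

The derivative is negative for $s < \sigma^2 + a$ and positive for $s > \sigma^2 + a$, so the minimum is at $s = \sigma^2 + a$.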

64 $\mathbb{E}_{w^* \sim \mathcal{N}(0, \alpha^{-1} I_d)} \mathbb{E}_\xi \{x_{n+1}^\top (w^* - \hat{w})\}^2 = x_{n+1}^\top C x_{n+1}$
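This identity can be checked from the pointwise error decomposition. A sketch, assuming $\lambda = \sigma^2 \alpha$ (so $\lambda_n = \sigma^2 \alpha / n$), $\hat{\Sigma} = \frac{1}{n} X^\top X$, $\hat{\Sigma}_{\lambda_n} := \hat{\Sigma} + \lambda_n I_d$, and writing $x$ for $x_{n+1}$:

$$
\begin{aligned}
\mathbb{E}_{w^*} \mathbb{E}_\xi \{x^\top (w^* - \hat{w})\}^2
&= \frac{\lambda_n^2}{\alpha} \, x^\top \hat{\Sigma}_{\lambda_n}^{-2} x + \frac{\sigma^2}{n} \, x^\top \hat{\Sigma}_{\lambda_n}^{-1} \hat{\Sigma} \hat{\Sigma}_{\lambda_n}^{-1} x \\
&= \frac{\sigma^2}{n} \, x^\top \hat{\Sigma}_{\lambda_n}^{-1} (\lambda_n I_d + \hat{\Sigma}) \hat{\Sigma}_{\lambda_n}^{-1} x \\
&= \frac{\sigma^2}{n} \, x^\top \hat{\Sigma}_{\lambda_n}^{-1} x
= \sigma^2 x^\top (X^\top X + \sigma^2 \alpha I_d)^{-1} x = x^\top C x .
\end{aligned}
$$

The second line uses $\lambda_n^2 / \alpha = (\sigma^2 / n) \lambda_n$, which holds only under the assumed choice $\lambda = \sigma^2 \alpha$.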

65 $R[q_y] = \mathbb{E}_{w \sim p(w)} \, \mathbb{E}_{y \sim \prod_{i=1}^{n} p(y_i \mid w)} \big[ D\big(p(y_{n+1} \mid w) \,\|\, q_y(y_{n+1})\big) \big]$

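67 Loss $L(s, w)$, for example $L(s, w) = \begin{cases} 0 & \text{if } y x^\top w \ge 0, \\ 1 & \text{otherwise} \end{cases}$ ($s = (y, x)$, $L_{\max} = 1$). $\hat{L}(Q) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{w \sim Q(w)} [L(s_i, w)]$, $L(Q) = \mathbb{E}_s \mathbb{E}_{w \sim Q(w)} [L(s, w)]$.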

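68 [McAllester 1999, 2013; Catoni 2007] $\mathbb{E}_S L(Q) \le \mathbb{E}_S \hat{L}(Q) + \frac{L_{\max}}{n} \mathbb{E}_S D(Q \| P)$ (any Greek-letter constants in the original bound are not recoverable)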

71 E S k ˆf f k 2 n apple O n min(, )/(2 +d)
