Machine Learning Brett Bernstein

Week 2 Lecture: Concept Check Exercises

Starred problems are optional.

Excess Risk Decomposition

1. Let $\mathcal{X} = \mathcal{Y} = \{1, 2, \dots, 10\}$, $\mathcal{A} = \{1, \dots, 10, 11\}$, and suppose the data distribution has marginal distribution $X \sim \mathrm{Unif}\{1, \dots, 10\}$. Furthermore, assume $Y = X$ (i.e., $Y$ always has the exact same value as $X$). In the questions below we use the square loss function $\ell(a, x) = (a - x)^2$.

   (a) What is the Bayes risk?

   (b) What is the approximation error when using the hypothesis space of constant functions?

   (c) Suppose we use the hypothesis space $\mathcal{F}$ of affine functions.

       i. What is the approximation error?

       ii. Consider the function $\hat{f}(x) = x + 1$. Compute $R(\hat{f}) - R(f_{\mathcal{F}})$.

   Solution:

   (a) The best decision function is $f^*(x) = x$. The associated risk is 0.

   (b) The best constant function is $f(x) = \mathbb{E}[Y] = \mathbb{E}[X] = 5.5$. This has risk $\mathbb{E}[(Y - 5.5)^2] = \mathrm{Var}(Y) = \frac{33}{4}$, by using (or deriving) the formula for the variance of a discrete uniform distribution. Thus the approximation error is $33/4$.

   (c) i. The Bayes decision function is affine, so the approximation error is 0.

       ii. The risk is
           $$R(\hat{f}) = \mathbb{E}[(Y - \hat{f}(X))^2] = \mathbb{E}[(X - (X + 1))^2] = 1.$$
           Thus the answer is 1.
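The closed-form answers above are easy to sanity-check by simulation. Below is a small Monte Carlo sketch (not part of the original exercise) that estimates the three risks; it assumes NumPy is available.

```python
import numpy as np

# Monte Carlo check of Problem 1: X ~ Unif{1,...,10}, Y = X, square loss.
rng = np.random.default_rng(0)
x = rng.integers(1, 11, size=1_000_000)
y = x  # Y always equals X

risk_bayes = np.mean((y - x) ** 2)          # f*(x) = x, Bayes risk 0
risk_const = np.mean((y - 5.5) ** 2)        # best constant 5.5, risk 33/4 = 8.25
risk_fhat  = np.mean((y - (x + 1)) ** 2)    # f_hat(x) = x + 1, risk 1

print(risk_bayes, risk_const, risk_fhat)    # ~0.0, ~8.25, 1.0
```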

2. (*) Let $\mathcal{X} = [-10, 10]$, $\mathcal{Y} = \mathcal{A} = \mathbb{R}$, and suppose the data distribution has marginal distribution $X \sim \mathrm{Unif}(-10, 10)$ and $Y \mid X = x \sim \mathcal{N}(a + bx, 1)$. Throughout we assume the square loss function $\ell(a, x) = (a - x)^2$.

   (a) What is the Bayes risk?

   (b) What is the approximation error when using the hypothesis space of constant functions (in terms of $a$ and $b$)?

   (c) Suppose we use the hypothesis space of affine functions.

       i. What is the approximation error?

       ii. Suppose you have a fixed data set and compute the empirical risk minimizer $\hat{f}_n(x) = c + dx$. What is the estimation error (in terms of $a, b, c, d$)?

   Solution: Throughout we use the fact that $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

   (a) The best decision function is $f^*(x) = \mathbb{E}[Y \mid X = x] = a + bx$. This has risk
       $$\mathbb{E}[(Y - a - bX)^2] = \mathbb{E}\big[\mathbb{E}[(Y - a - bX)^2 \mid X]\big] = \mathbb{E}[1] = 1.$$

   (b) The best constant function is given by $\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid X]] = a + b\,\mathbb{E}[X] = a$. This has risk
       $$\mathbb{E}[(Y - a)^2] = \mathbb{E}\big[\mathbb{E}[(Y - a)^2 \mid X]\big] = \mathbb{E}[1 + b^2 X^2] = 1 + b^2\,\mathbb{E}[X^2],$$
       where
       $$\mathbb{E}[X^2] = \int_{-10}^{10} \frac{x^2}{20}\, dx = \frac{2000}{3 \cdot 20} = \frac{100}{3}.$$
       Thus the approximation error is $100 b^2 / 3$.

   (c) i. There is an affine Bayes decision function, so the approximation error is 0.

       ii. Note that
           $$R(\hat{f}_n) = \mathbb{E}[(Y - c - dX)^2] = \mathbb{E}\big[\mathbb{E}[(Y - c - dX)^2 \mid X]\big] = \mathbb{E}\big[1 + ((a - c) + (b - d)X)^2\big] = 1 + (a - c)^2 + 100(b - d)^2/3,$$
           since $\mathbb{E}[X] = 0$ kills the cross term. Thus the estimation error is $(a - c)^2 + 100(b - d)^2/3$.
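As a quick numerical sanity check (not in the original notes), the sketch below estimates the three risks for one arbitrary choice of $a, b, c, d$ and compares them to the closed-form expressions; it assumes NumPy.

```python
import numpy as np

# Monte Carlo check of Problem 2 with arbitrary a, b, c, d.
rng = np.random.default_rng(0)
a, b, c, d = 2.0, 0.5, 1.5, 0.7
n = 2_000_000

x = rng.uniform(-10, 10, size=n)
y = a + b * x + rng.standard_normal(n)          # Y | X = x ~ N(a + bx, 1)

risk_bayes  = np.mean((y - (a + b * x)) ** 2)   # should be ~1
risk_const  = np.mean((y - a) ** 2)             # ~1 + 100 b^2 / 3
risk_affine = np.mean((y - (c + d * x)) ** 2)   # ~1 + (a-c)^2 + 100 (b-d)^2 / 3

print(risk_bayes, risk_const, risk_affine)
print(1 + 100 * b**2 / 3, 1 + (a - c)**2 + 100 * (b - d)**2 / 3)
```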

3. Try to best characterize each of the following in terms of one or more of optimization error, approximation error, and estimation error.

   (a) Overfitting.

   (b) Underfitting.

   (c) Precise empirical risk minimization for your hypothesis space is computationally intractable.

   (d) Not enough data.

   Solution:

   (a) High estimation error due to insufficient data relative to the complexity of your hypothesis space. Can be accompanied by low approximation error, indicating a complex hypothesis space.

   (b) High approximation error due to an overly simplistic hypothesis space. Can be accompanied by low estimation error due to the large amount of data relative to the (low) complexity of the hypothesis space.

   (c) Increased optimization error.

   (d) High estimation error.

4. (a) We sometimes look at $R(\hat{f}_n)$ as random, and other times as deterministic. What causes this difference?

   (b) True or False: Increasing the size of our hypothesis space can shift risk from approximation error to estimation error, but always leaves the quantity $R(\hat{f}_n) - R(f^*)$ constant.

   (c) True or False: Assume we treat our data set as a random sample and not a fixed quantity. Then the estimation error and the approximation error are random and not deterministic.

   (d) True or False: The empirical risk of the ERM, $\hat{R}_n(\hat{f}_n)$, is an unbiased estimator of the risk of the ERM, $R(\hat{f}_n)$.

   (e) In each of the following situations, there is an implicit sample space in which the given expectation is computed. Give that space.

       i. When we say the empirical risk $\hat{R}_n(f)$ is an unbiased estimator of the risk $R(f)$ (where $f$ is independent of the training data used to compute the empirical risk).

       ii. When we compute the expected empirical risk $\mathbb{E}[\hat{R}_n(\hat{f}_n)]$ (i.e., the outer expectation).

       iii. When we say the minibatch gradient is an unbiased estimator of the full training set gradient.

   Solution:

   (a) The quantity is random when we consider the training data as a random sample of size $n$. If we focus on a fixed set of training data then the quantity is deterministic.

   (b) False. Note that $\hat{f}_n$ depends on which hypothesis space you have chosen. As an example, imagine having an affine Bayes decision function, and changing the hypothesis space from the set of affine functions to the set of all decision functions. This can cause empirical risk minimization to overfit the training data, thus creating a sharp rise in $R(\hat{f}_n) - R(f^*)$.

   (c) False; approximation error is a deterministic quantity.

   (d) False. The empirical risk of the ERM will often be biased low. This is why we use a test set to approximate its true risk. The issue is that $\hat{f}_n$ depends on the training data, so
       $$\mathbb{E}\,\ell(\hat{f}_n(x_i), y_i) \neq \mathbb{E}\,\ell(\hat{f}_n(x), y),$$
       where $(x, y)$ is a new random draw from the data distribution that isn't in the training data.

   (e) i. The space of training sets (i.e., samples of size $n$ from the data generating distribution).

       ii. The space of training sets (i.e., samples of size $n$ from the data generating distribution).

       iii. The space of all minibatches chosen from the full training set (i.e., samples of the batch size from the empirical distribution on the full training set).
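Part 4(d) is easy to see empirically. The sketch below (not from the original notes) fits least squares, the ERM over affine functions, on many small training sets drawn from the distribution of Problem 2, and compares the average training error to the average error on fresh data; NumPy is assumed.

```python
import numpy as np

# Empirical illustration of 4(d): the ERM's training error is biased low.
rng = np.random.default_rng(0)
n, trials = 10, 5000
train_errs, test_errs = [], []

for _ in range(trials):
    x = rng.uniform(-10, 10, size=n)
    y = 2.0 + 0.5 * x + rng.standard_normal(n)
    A = np.column_stack([np.ones(n), x])
    w = np.linalg.lstsq(A, y, rcond=None)[0]        # ERM over affine functions
    train_errs.append(np.mean((A @ w - y) ** 2))

    x_new = rng.uniform(-10, 10, size=n)            # fresh draws from the same distribution
    y_new = 2.0 + 0.5 * x_new + rng.standard_normal(n)
    test_errs.append(np.mean((w[0] + w[1] * x_new - y_new) ** 2))

# Training error is systematically below the error on new data (about 0.8 vs 1.2 here,
# while the Bayes risk is 1), so it is not an unbiased estimate of R(f_hat_n).
print(np.mean(train_errs), np.mean(test_errs))
```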

5. For each of the following, use $\leq$, $\geq$, or $=$ to determine the relationship between the two quantities, or state that the relationship cannot be determined. Throughout, assume $\mathcal{F}_1, \mathcal{F}_2$ are hypothesis spaces with $\mathcal{F}_1 \subseteq \mathcal{F}_2$, and assume we are working with a fixed loss function $\ell$.

   (a) The estimation errors of two decision functions $f_1, f_2$ that minimize the empirical risk over the same hypothesis space, where $f_2$ uses 5 extra data points.

   (b) The approximation errors of the two decision functions $f_1, f_2$ that minimize risk with respect to $\mathcal{F}_1, \mathcal{F}_2$, respectively (i.e., $f_1 = f_{\mathcal{F}_1}$ and $f_2 = f_{\mathcal{F}_2}$).

   (c) The empirical risks of two decision functions $f_1, f_2$ that minimize the empirical risk over $\mathcal{F}_1, \mathcal{F}_2$, respectively. Both use the same fixed training data.

   (d) The estimation errors (for $\mathcal{F}_1, \mathcal{F}_2$, respectively) of two decision functions $f_1, f_2$ that minimize the empirical risk over $\mathcal{F}_1, \mathcal{F}_2$, respectively.

   (e) The risks of two decision functions $f_1, f_2$ that minimize the empirical risk over $\mathcal{F}_1, \mathcal{F}_2$, respectively.

   Solution:

   (a) Roughly speaking, more data is better, so we would tend to expect that $f_2$ will have lower estimation error. That said, this is not always the case, so the relationship cannot be determined.

   (b) The approximation error of $f_1$ will be at least as large ($\geq$): since $\mathcal{F}_1 \subseteq \mathcal{F}_2$, minimizing risk over the larger space can only do better.

   (c) The empirical risk of $f_1$ will be at least as large ($\geq$), by the same reasoning applied to the empirical risk.

   (d) Roughly speaking, enlarging the hypothesis space should increase the estimation error, since the approximation error will decrease and we expect to need more data. That said, this is not always the case, so the relationship cannot be determined.

   (e) Cannot be determined.
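To make (c) and (e) concrete, here is a small simulation (not part of the original notes) with nested hypothesis spaces chosen for illustration, $\mathcal{F}_1$ the degree-1 polynomials and $\mathcal{F}_2$ the degree-9 polynomials: the ERM over the larger space always has training error at most that of the smaller space, while its risk, estimated on fresh data, may be better or worse. NumPy is assumed.

```python
import numpy as np

# F1 = degree-1 polynomials, F2 = degree-9 polynomials (F1 is a subset of F2).
rng = np.random.default_rng(0)
n = 20
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)          # training data

f1 = np.polynomial.Polynomial.fit(x, y, deg=1)             # ERM over F1
f2 = np.polynomial.Polynomial.fit(x, y, deg=9)             # ERM over F2

train1, train2 = np.mean((f1(x) - y) ** 2), np.mean((f2(x) - y) ** 2)

x_test = rng.uniform(-1, 1, size=100_000)                  # fresh data to estimate risk
y_test = np.sin(3 * x_test) + 0.3 * rng.standard_normal(100_000)
risk1, risk2 = np.mean((f1(x_test) - y_test) ** 2), np.mean((f2(x_test) - y_test) ** 2)

print(train1, train2)   # part (c): train2 <= train1, since F2 contains F1
print(risk1, risk2)     # part (e): either ordering can occur
```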

6. In the excess risk decomposition lecture, we introduced the decision tree classifier spaces $\mathcal{F}$ (the space of all decision trees) and $\mathcal{F}_d$ (the space of decision trees of depth $d$) and went through some examples. The following questions are based on those slides. Recall that $P_X = \mathrm{Unif}([0,1]^2)$, $\mathcal{Y} = \{\text{blue}, \text{orange}\}$, orange occurs with probability .9 below the line $y = x$, and blue occurs with probability .9 above the line $y = x$.

   (a) Prove that the Bayes error rate is 0.1.

   (b) Is the Bayes decision function in $\mathcal{F}$?

   (c) For the hypothesis space $\mathcal{F}_3$ the slide states that $R(\hat{f}) = 0.176 \pm .004$ for $n = 1024$. Assuming you had access to the training code that produces $\hat{f}$ from a set of $n$ data points, and to random draws from the data generating distribution, give an algorithm (pseudocode) to compute (or estimate) the values 0.176 and .004.

   Solution:

   (a) Since the output space is discrete and we are using the 0-1 loss, our best prediction is the highest-probability output conditional on the input. By choosing orange below the line $y = x$ and blue above it, we predict incorrectly with probability .1 at every input, so the overall probability of error is .1. For the 0-1 loss the probability of error is the risk, so the Bayes error rate is 0.1.

   (b) No. Any decision tree in $\mathcal{F}$ has finite depth, and thus divides $[0,1]^2$ into a finite number of rectangles. Thus we cannot produce the decision boundary $y = x$ used by the Bayes decision function.

   (c) Pseudocode follows:

       i. Initialize $L$ to be an empty list of risks.

       ii. Repeat the following $M$ times for some sufficiently large $M$:

          A. Draw a random sample $(x_1, y_1), \dots, (x_n, y_n)$ from the data generating distribution.

          B. Obtain a decision function $\hat{f}$ by running our training algorithm on the generated sample.

          C. Draw a new random sample $(x_1', y_1'), \dots, (x_S', y_S')$ of size $S$, where $S$ is sufficiently large.

          D. Compute $e = |\{i \mid \hat{f}(x_i') \neq y_i'\}|$, that is, the number of times $\hat{f}$ is incorrect on our new sample.

          E. Add $e/S$ to the list $L$.

       iii. Compute the sample average and standard deviation of the values in $L$. Above, 0.176 would be the average and .004 would be the standard deviation.

       Instead of drawing the sample of size $S$, we could have computed the risk analytically. (A runnable sketch of this procedure is given below.)
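One possible realization of that pseudocode in Python, with scikit-learn's DecisionTreeClassifier standing in for the depth-3 training code (an assumption; the exact numbers depend on the training algorithm actually used for the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def draw_sample(m):
    """Draw m points from the data generating distribution on [0,1]^2."""
    X = rng.uniform(size=(m, 2))
    below = X[:, 1] < X[:, 0]                          # point lies below the line y = x
    p_orange = np.where(below, 0.9, 0.1)               # orange w.p. .9 below, .1 above
    y = (rng.uniform(size=m) < p_orange).astype(int)   # 1 = orange, 0 = blue
    return X, y

n, S, M = 1024, 50_000, 100
risks = []
for _ in range(M):
    X_train, y_train = draw_sample(n)
    f_hat = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)   # stand-in for F_3 training
    X_test, y_test = draw_sample(S)
    risks.append(np.mean(f_hat.predict(X_test) != y_test))              # e / S

print(np.mean(risks), np.std(risks))   # compare with the slide's 0.176 and .004
```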

$L_1$ and $L_2$ Regularization

1. Consider the following two minimization problems:
   $$\operatorname*{arg\,min}_{w} \; \Omega(w) + \lambda \sum_{i=1}^{n} L(f_w(x_i), y_i)$$
   and
   $$\operatorname*{arg\,min}_{w} \; C\,\Omega(w) + \frac{1}{n}\sum_{i=1}^{n} L(f_w(x_i), y_i),$$
   where $\Omega(w)$ is the penalty function (for regularization) and $L$ is the loss function. Give sufficient conditions under which these two give the same minimizer.

   Solution: Let $C = \frac{1}{n\lambda}$. Then the two objectives differ by a constant factor (the second is $\frac{1}{n\lambda}$ times the first), so they have the same minimizers.

2. (*) Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Prove that $\|\nabla f(x)\|_2 \leq L$ for all $x$ if and only if $f$ is Lipschitz with constant $L$.

   Solution: First suppose $\|\nabla f(x)\|_2 \leq L$ for some $L \geq 0$ and all $x \in \mathbb{R}^n$. By the mean value theorem we have, for any $x, y \in \mathbb{R}^n$,
   $$f(y) - f(x) = \nabla f(x + \xi(y - x))^T (y - x),$$
   where $\xi$ is some value between 0 and 1. Taking absolute values on each side we have
   $$|f(y) - f(x)| = |\nabla f(x + \xi(y - x))^T (y - x)| \leq \|\nabla f(x + \xi(y - x))\|_2\, \|y - x\|_2$$
   by Cauchy-Schwarz. Applying our bound on the gradient norm proves $f$ is Lipschitz with constant $L$.

   Conversely, suppose $f$ is Lipschitz with constant $L$. Note that
   $$\nabla f(x)^T v = f'(x; v) = \lim_{t \to 0^+} \frac{f(x + tv) - f(x)}{t} \leq \lim_{t \to 0^+} \frac{L \|tv\|_2}{t} = L \|v\|_2.$$
   Letting $v = \nabla f(x)$ we obtain $\|\nabla f(x)\|_2^2 \leq L \|\nabla f(x)\|_2$, giving the result.

3. (*) Let $\hat{w}$ denote the minimizer for
   $$\text{minimize}_{w} \;\; \|Xw - y\|_2^2 \quad \text{subject to} \;\; \|w\|_1 \leq r.$$
   Prove that $f(x) = \hat{w}^T x$ is Lipschitz with constant $r$.

   Solution: Note that $\|\hat{w}\|_2 \leq \|\hat{w}\|_1 \leq r$, so the argument from class gives the result. To see the inequality, note that
   $$\|w\|_1^2 = (|w_1| + \cdots + |w_n|)^2 \geq |w_1|^2 + \cdots + |w_n|^2 = \|w\|_2^2.$$

4. Two of the plots in the lecture slides use the fact that $\|\hat{\beta}\| / \|\beta\|$ is always between 0 and 1. Here $\hat{\beta}$ is the parameter vector of the linear model resulting from the regularized least squares problem, and, analogously, $\beta$ is the parameter vector from the unregularized problem. Why does this quotient lie in $[0, 1]$?

   Solution: We assume Ivanov regularization (since Tikhonov is equivalent). We know that
   $$\frac{1}{n}\sum_{i=1}^{n} (\beta^T x_i - y_i)^2 \leq \frac{1}{n}\sum_{i=1}^{n} (\hat{\beta}^T x_i - y_i)^2,$$
   since $\beta$ is the solution to the unconstrained minimization. But if $\|\beta\| \leq \|\hat{\beta}\|$, then $\beta$ is feasible for the regularized problem (its norm is at most that of $\hat{\beta}$, which satisfies the constraint), so $\hat{\beta} = \beta$. Thus in either case $\|\hat{\beta}\| \leq \|\beta\|$, and the quotient lies in $[0, 1]$.

5. Explain why feature normalization is important if you are using $L_1$ or $L_2$ regularization.

   Solution: Suppose you have a model $y = w^T x$ where $x_1$ is highly correlated with $y$, but the feature is measured in meters. Then $w_1 = 4$ would mean each increase in $x_1$ by 1 meter yields an increase in $y$ by 4. Now suppose we change the units of $x_1$ to kilometers. This would require us to change $w_1$ to 4000 to achieve the same decision function. While this has no effect on the loss $(y - w^T x)^2$, it has a significant effect on $\lambda \|w\|_2^2$ or $\lambda \|w\|_1$. For example, even if $x_2, \dots, x_n$ had very little relationship with $y$, we would still undervalue $w_1$ due to the regularization.
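As a quick illustration of this point (not from the original notes), the sketch below fits ridge regression to the same synthetic data twice, once with $x_1$ in meters and once in kilometers, and shows how the coefficient on the rescaled feature gets shrunk far below the value it would need; it assumes scikit-learn's Ridge, and the data and penalty strength are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Effect of feature scale on L2 regularization: same data, x1 in meters vs. kilometers.
rng = np.random.default_rng(0)
n = 200
x1_m = rng.uniform(0, 10, size=n)                  # informative feature, measured in meters
x2 = rng.standard_normal(n)                        # uninformative feature
y = 4 * x1_m + 0.1 * rng.standard_normal(n)        # y increases by 4 per meter of x1

X_meters = np.column_stack([x1_m, x2])
X_km     = np.column_stack([x1_m / 1000, x2])      # same feature, now in kilometers

lam = 1.0
print(Ridge(alpha=lam).fit(X_meters, y).coef_)     # w1 stays close to 4
print(Ridge(alpha=lam).fit(X_km, y).coef_)         # w1 should be ~4000 but is shrunk heavily
```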