Machine Learning Brett Bernstein

Week 1 Lecture: Concept Check Exercises

Starred problems are optional.

Statistical Learning Theory

1. Suppose $\mathcal{A} = \mathcal{Y} = \mathbb{R}$ and $\mathcal{X}$ is some other set. Furthermore, assume $P_{\mathcal{X} \times \mathcal{Y}}$ is a discrete joint distribution. Compute a Bayes decision function when the loss function $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ is given by $\ell(a, y) = 1(a \neq y)$, the 0-1 loss.

Solution. The Bayes decision function $f^*$ satisfies
$$f^* = \arg\min_f R(f) = \arg\min_f E[1(f(X) \neq Y)] = \arg\min_f P(f(X) \neq Y),$$
where $(X, Y) \sim P_{\mathcal{X} \times \mathcal{Y}}$. Let
$$f_1(x) = \arg\max_y P(Y = y \mid X = x),$$
the maximum a posteriori estimate of $Y$. If there is a tie, we choose any of the maximizers. If $f_2$ is another decision function we have
$$\begin{aligned}
P(f_1(X) \neq Y) &= \sum_x P(f_1(x) \neq Y \mid X = x)\, P(X = x) \\
&= \sum_x \big(1 - P(f_1(x) = Y \mid X = x)\big)\, P(X = x) \\
&\le \sum_x \big(1 - P(f_2(x) = Y \mid X = x)\big)\, P(X = x) && \text{(Def of } f_1\text{)} \\
&= \sum_x P(f_2(x) \neq Y \mid X = x)\, P(X = x) \\
&= P(f_2(X) \neq Y).
\end{aligned}$$
Thus $f^* = f_1$.

2. (⋆) Suppose $\mathcal{A} = \mathcal{Y} = \mathbb{R}$, $\mathcal{X}$ is some other set, and $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ is given by $\ell(a, y) = (a - y)^2$, the square error loss. What is the Bayes risk and how does it compare with the variance of $Y$?

Solution. From Homework 1 we know that the Bayes decision function is given by $f^*(x) = E[Y \mid X = x]$. Thus the Bayes risk is given by
$$E[(f^*(X) - Y)^2] = E[(E[Y \mid X] - Y)^2] = E\big[E[(E[Y \mid X] - Y)^2 \mid X]\big] = E[\operatorname{Var}(Y \mid X)],$$
where we applied the law of iterated expectations. The law of total variance states that
$$\operatorname{Var}(Y) = E[\operatorname{Var}(Y \mid X)] + \operatorname{Var}[E(Y \mid X)].$$

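This proves the Bayes risk satisfies
$$E[\operatorname{Var}(Y \mid X)] = \operatorname{Var}(Y) - \operatorname{Var}[E(Y \mid X)] \le \operatorname{Var}(Y).$$
Recall from Homework 1 that $\operatorname{Var}(Y)$ is the Bayes risk when we estimate $Y$ without any input $X$. This shows that using $X$ in our estimation reduces the Bayes risk, and that the improvement is measured by $\operatorname{Var}[E(Y \mid X)]$. As a sanity check, note that if $X, Y$ are independent then $E(Y \mid X) = E(Y)$, so $\operatorname{Var}[E(Y \mid X)] = 0$. If $X = Y$ then $E(Y \mid X) = Y$ and $\operatorname{Var}[E(Y \mid X)] = \operatorname{Var}(Y)$. The prominent role of variance in our analysis above is due to the fact that we are using the square loss.

3. Let $\mathcal{X} = \{1, \ldots, 10\}$, let $\mathcal{Y} = \{1, \ldots, 10\}$, and let $\mathcal{A} = \mathcal{Y}$. Suppose the data generating distribution, $P$, has marginal $X \sim \text{Unif}\{1, \ldots, 10\}$ and conditional distribution $Y \mid X = x \sim \text{Unif}\{1, \ldots, x\}$. For each loss function below give a Bayes decision function.
(a) $\ell(a, y) = (a - y)^2$,
(b) $\ell(a, y) = |a - y|$,
(c) $\ell(a, y) = 1(a \neq y)$.

Solution.
(a) From Homework 1 we know that $f^*(x) = E[Y \mid X = x] = (x + 1)/2$.
(b) From Homework 1, we know that $f^*(x)$ is the conditional median of $Y$ given $X = x$. If $x$ is odd, then $f^*(x) = (x + 1)/2$. If $x$ is even, then we can choose any value in the interval
$$\left[ \left\lfloor \frac{x + 1}{2} \right\rfloor, \left\lceil \frac{x + 1}{2} \right\rceil \right].$$
(c) From question 1 above, we know that $f^*(x) = \arg\max_y P(Y = y \mid X = x)$. Since the conditional distribution is uniform, every value is a maximizer, so we can choose any integer between 1 and $x$, inclusive, for $f^*(x)$.

4. Show that the empirical risk $\hat{R}_n(f)$ is an unbiased and consistent estimator of the risk $R(f)$ of a fixed decision function $f$. You may assume the risk is finite.

Solution. We assume a given loss function $\ell$ and an i.i.d. sample $(x_1, y_1), \ldots, (x_n, y_n)$. To show it is unbiased, note that
$$\begin{aligned}
E[\hat{R}_n(f)] &= E\left[ \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) \right] \\
&= \frac{1}{n} \sum_{i=1}^n E[\ell(f(x_i), y_i)] && \text{(Linearity of } E\text{)} \\
&= E[\ell(f(x_1), y_1)] && \text{(i.i.d.)} \\
&= R(f).
\end{aligned}$$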
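As a rough numerical check of the unbiasedness argument, the sketch below reuses the distribution from problem 3 with the square loss and the decision function $f(x) = (x + 1)/2$. It computes the risk exactly by enumeration and compares it with the average of many small-sample empirical risks. The function names, the sample size of 5, the number of trials, and the seed are illustrative choices, not part of the exercise.

```python
import random
from fractions import Fraction


def f_star(x):
    """Conditional mean of Y given X = x when Y | X = x ~ Unif{1, ..., x}."""
    return Fraction(x + 1, 2)


def exact_risk(f):
    """Exact risk E[(f(X) - Y)^2] for X ~ Unif{1..10}, Y | X = x ~ Unif{1..x}."""
    total = Fraction(0)
    for x in range(1, 11):
        for y in range(1, x + 1):
            # P(X = x, Y = y) = (1/10) * (1/x)
            total += Fraction(1, 10 * x) * (f(x) - y) ** 2
    return total


def empirical_risk(f, n, rng):
    """Empirical risk of f on a fresh i.i.d. sample of size n."""
    total = 0.0
    for _ in range(n):
        x = rng.randint(1, 10)  # draw X from its marginal
        y = rng.randint(1, x)   # draw Y from its conditional given X = x
        total += float(f(x) - y) ** 2
    return total / n


def average_empirical_risk(f, n, trials, seed=0):
    """Average the empirical risk over many independent samples of size n."""
    rng = random.Random(seed)
    return sum(empirical_risk(f, n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("exact risk:", float(exact_risk(f_star)))
    print("mean empirical risk (n = 5):", average_empirical_risk(f_star, 5, 20000))
```

The exact risk here is 25/8 = 3.125, which agrees with $E[\operatorname{Var}(Y \mid X)]$ from problem 2, since $\text{Unif}\{1, \ldots, x\}$ has variance $(x^2 - 1)/12$. Even with samples of size 5, the average of the empirical risks should land close to this value, which is what unbiasedness predicts.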

For consistency, we must show that as $n \to \infty$ we have $\hat{R}_n(f) \to R(f)$ with probability 1. Letting $z_i = \ell(f(x_i), y_i)$, we see that the $z_i$ are i.i.d. with finite mean. Thus consistency follows by applying the strong law of large numbers.

5. Let $\mathcal{X} = [0, 1]$ and $\mathcal{Y} = \mathcal{A} = \mathbb{R}$. Suppose you receive the $(x, y)$ data points (0, 5), (.2, 3), (.37, 4.2), (.9, 3), (1, 5). Throughout assume we are using the 0-1 loss.
(a) Suppose we restrict our decision functions to the hypothesis space $\mathcal{F}_1$ of constant functions. Give a decision function that minimizes the empirical risk over $\mathcal{F}_1$ and the corresponding empirical risk. Is the empirical risk minimizing function unique?
(b) Suppose we restrict our decision functions to the hypothesis space $\mathcal{F}_2$ of piecewise-constant functions with at most one discontinuity. Give a decision function that minimizes the empirical risk over $\mathcal{F}_2$ and the corresponding empirical risk. Is the empirical risk minimizing function unique?

Solution.
(a) We can let $\hat{f}(x) = 5$ or $\hat{f}(x) = 3$ and obtain the minimal empirical risk of 3/5. Thus the empirical risk minimizer is not unique.
(b) One solution is to let $\hat{f}(x) = 5$ for $x \in [0, .1]$ and $\hat{f}(x) = 3$ for $x \in (.1, 1]$, giving an empirical risk of 2/5. There are uncountably many empirical risk minimizers, so again we do not have uniqueness.

6. (⋆) Let $\mathcal{X} = [-10, 10]$, $\mathcal{Y} = \mathcal{A} = \mathbb{R}$ and suppose the data generating distribution has marginal distribution $X \sim \text{Unif}[-10, 10]$ and conditional distribution $Y \mid X = x \sim \mathcal{N}(a + bx, 1)$ for some fixed $a, b \in \mathbb{R}$. Suppose you are also given the following data points: (0, 1), (0, 2), (1, 3), (2.5, 3.1), (−4, −2.1).
(a) Assuming the 0-1 loss, what is the Bayes risk?
(b) Assuming the square error loss $\ell(a, y) = (a - y)^2$, what is the Bayes risk?
(c) Using the full hypothesis space of all (measurable) functions, what is the minimum achievable empirical risk for the square error loss?
(d) Using the hypothesis space of all affine functions (i.e., of the form $f(x) = cx + d$ for some $c, d \in \mathbb{R}$), what is the minimum achievable empirical risk for the square error loss?
(e) Using the hypothesis space of all quadratic functions (i.e., of the form $f(x) = cx^2 + dx + e$ for some $c, d, e \in \mathbb{R}$), what is the minimum achievable empirical risk for the square error loss?

Solution.

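(a) For any decision function $f$ the risk is given by
$$E[1(f(X) \neq Y)] = P(f(X) \neq Y) = 1 - P(f(X) = Y) = 1.$$
To see this note that
$$P(f(X) = Y) = \frac{1}{20\sqrt{2\pi}} \int_{-10}^{10} \int_{-\infty}^{\infty} 1(f(x) = y)\, e^{-(y - a - bx)^2/2}\, dy\, dx = \frac{1}{20\sqrt{2\pi}} \int_{-10}^{10} 0\, dx = 0.$$
Thus every decision function is a Bayes decision function, and the Bayes risk is 1.

(b) By problem 2 above we know the Bayes risk is given by
$$E[\operatorname{Var}(Y \mid X)] = E[1] = 1,$$
since $\operatorname{Var}(Y \mid X = x) = 1$.

(c) We choose $\hat{f}$ such that
$$\hat{f}(0) = 1.5, \quad \hat{f}(1) = 3, \quad \hat{f}(2.5) = 3.1, \quad \hat{f}(-4) = -2.1,$$
and $\hat{f}(x) = 0$ otherwise. Then we achieve the minimum empirical risk of 1/10.

(d) Letting
$$A = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 2.5 \\ 1 & -4 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 3.1 \\ -2.1 \end{pmatrix},$$
we obtain (using a computer)
$$\hat{w} = \begin{pmatrix} \hat{d} \\ \hat{c} \end{pmatrix} = (A^T A)^{-1} A^T y = \begin{pmatrix} 1.4856 \\ 0.8556 \end{pmatrix}.$$
This gives
$$\hat{R}_5(\hat{f}) = \frac{1}{5} \|A\hat{w} - y\|_2^2 = 0.2473.$$
[Aside: In general, to solve systems like the one above on a computer you shouldn't actually invert the matrix $A^T A$, but use something like w=A\y in Matlab, which performs a QR factorization of A.]

(e) Letting
$$A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2.5 & 6.25 \\ 1 & -4 & 16 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 3.1 \\ -2.1 \end{pmatrix},$$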

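we obtain (using a computer)
$$\hat{w} = \begin{pmatrix} \hat{e} \\ \hat{d} \\ \hat{c} \end{pmatrix} = (A^T A)^{-1} A^T y = \begin{pmatrix} 1.7175 \\ 0.7545 \\ -0.0521 \end{pmatrix}.$$
This gives
$$\hat{R}_5(\hat{f}) = \frac{1}{5} \|A\hat{w} - y\|_2^2 = 0.1928.$$

Stochastic Gradient Descent

1. When performing mini-batch gradient descent, we often randomly choose the mini-batch from the full training set without replacement. Show that the resulting mini-batch gradient is an unbiased estimate of the gradient of the full training set. Here we assume each decision function $f_w$ in our hypothesis space is determined by a parameter vector $w \in \mathbb{R}^d$.

Solution. Let $(x_{m_1}, y_{m_1}), \ldots, (x_{m_n}, y_{m_n})$ be our mini-batch selected uniformly without replacement from the full training set $(x_1, y_1), \ldots, (x_N, y_N)$. Then
$$\begin{aligned}
E\left[ \nabla_w \frac{1}{n} \sum_{i=1}^n \ell(f_w(x_{m_i}), y_{m_i}) \right] &= \frac{1}{n} \sum_{i=1}^n E[\nabla_w \ell(f_w(x_{m_i}), y_{m_i})] && \text{(Linearity of } \nabla, E\text{)} \\
&= \frac{1}{n} \sum_{i=1}^n E[\nabla_w \ell(f_w(x_{m_1}), y_{m_1})] && \text{(Marginals are the same)} \\
&= E[\nabla_w \ell(f_w(x_{m_1}), y_{m_1})] \\
&= \sum_{i=1}^N \frac{1}{N} \nabla_w \ell(f_w(x_i), y_i) \\
&= \nabla_w \frac{1}{N} \sum_{i=1}^N \ell(f_w(x_i), y_i) && \text{(Linearity of } \nabla\text{)}.
\end{aligned}$$

2. You want to estimate the average age of the people visiting your website. Over a fixed week we will receive a total of $N$ visitors (which we will call our full population). Suppose the population mean $\mu$ is unknown but the variance $\sigma^2$ is known. Since we don't want to bother every visitor, we will ask a small sample what their ages are. How many visitors must we randomly sample so that our estimator $\hat{\mu}$ has variance at most $\epsilon > 0$?

Solution. Let $x_1, \ldots, x_n$ denote our randomly sampled ages, and let $\hat{\mu}$ denote the sample mean $\frac{1}{n} \sum_{i=1}^n x_i$. Assuming the samples are drawn independently (i.e., with replacement), we have
$$\operatorname{Var}(\hat{\mu}) = \frac{\sigma^2}{n}.$$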

Thus we require $n \ge \sigma^2/\epsilon$. Note that this doesn't depend on $N$, the full population size.

3. (⋆) Suppose you have been successfully running mini-batch gradient descent with a full training set size of $10^5$ and a mini-batch size of 100. After receiving more data your full training set size increases to $10^9$. Give a heuristic argument as to why the mini-batch size need not increase even though we have 10000 times more data.

Solution. Throughout we assume our gradient lies in $\mathbb{R}^d$. Consider the empirical distribution on the full training set (i.e., each sample is chosen with probability $1/N$ where $N$ is the full training set size). Assume this distribution has mean vector $\mu \in \mathbb{R}^d$ (the full-batch gradient) and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. By the central limit theorem the mini-batch gradient will be approximately normally distributed with mean $\mu$ and covariance $\frac{1}{n}\Sigma$, where $n$ is the mini-batch size. As $N$ grows the entries of $\Sigma$ need not grow, and thus $n$ need not grow. In fact, as $N$ grows, the empirical mean and covariance matrix will converge to their true values. More precisely, the mean of the empirical distribution will converge to $E[\nabla \ell(f(X), Y)]$ and the covariance will converge to
$$E\big[(\nabla \ell(f(X), Y))(\nabla \ell(f(X), Y))^T\big] - E[\nabla \ell(f(X), Y)]\, E[\nabla \ell(f(X), Y)]^T,$$
where $(X, Y) \sim P_{\mathcal{X} \times \mathcal{Y}}$.

The important takeaway here is that the size of the mini-batch is dependent on the speed of computation, and on the characteristics of the distribution of the gradients (such as the moments), and thus may vary independently of the size of the full training set.
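The claims in problems 1 and 3 of this section can be checked exactly on a tiny example. The sketch below takes the five data points from problem 6 of the previous section and assumes a hypothetical one-parameter model $f_w(x) = wx$ with the square loss; the model, the value of $w$, and all function names are illustrative assumptions. It enumerates every mini-batch of a given size to compute the expected mini-batch gradient (drawn without replacement) and, separately, the variance of the mini-batch gradient when points are drawn with replacement, which is the i.i.d. setting behind the $\frac{1}{n}\Sigma$ heuristic.

```python
from fractions import Fraction
from itertools import combinations, product

# Data points from problem 6 of the previous section, stored as exact fractions.
DATA = [
    (Fraction(0), Fraction(1)),
    (Fraction(0), Fraction(2)),
    (Fraction(1), Fraction(3)),
    (Fraction(5, 2), Fraction(31, 10)),
    (Fraction(-4), Fraction(-21, 10)),
]
W = Fraction(1, 2)  # arbitrary parameter value, chosen only for illustration


def grad(w, x, y):
    """d/dw of the square loss (w*x - y)^2 for the assumed model f_w(x) = w*x."""
    return 2 * x * (w * x - y)


def mean(values):
    """Exact arithmetic mean of an iterable of Fractions."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def batch_gradient(batch, w):
    """Average gradient over a batch (the full training set is the largest batch)."""
    return mean(grad(w, x, y) for x, y in batch)


def expected_minibatch_gradient(data, w, n):
    """Expectation over all size-n mini-batches drawn uniformly without replacement."""
    return mean(batch_gradient(batch, w) for batch in combinations(data, n))


def minibatch_variance_with_replacement(data, w, n):
    """Exact variance of the mini-batch gradient for n i.i.d. draws from the data."""
    grads = [batch_gradient(batch, w) for batch in product(data, repeat=n)]
    center = mean(grads)
    return mean((g - center) ** 2 for g in grads)


if __name__ == "__main__":
    full = batch_gradient(DATA, W)
    for n in range(1, len(DATA) + 1):
        print(n, expected_minibatch_gradient(DATA, W, n) == full)
    # Quadrupling the training set (same empirical distribution) leaves the variance unchanged.
    print(minibatch_variance_with_replacement(DATA, W, 2)
          == minibatch_variance_with_replacement(DATA * 4, W, 2))
```

Because the arithmetic uses exact fractions, the comparisons are equalities rather than approximate checks. Replicating the data set four times changes $N$ but not the empirical distribution of the gradients, so the variance for a fixed mini-batch size stays the same, in line with the heuristic argument above.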