Understanding Machine Learning Solution Manual
Written by Alon Gonen
Edited by Dana Rubinstein
November 17
(The solutions to Chapters 13 and 14 were written by Shai Shalev-Shwartz.)

2 Gentle Start

1. Given $S = ((x_i, y_i))_{i=1}^m$, define the multivariate polynomial
$$p_S(x) = -\prod_{i \in [m] : y_i = 1} \|x - x_i\|^2 .$$
Then, for every $i$ such that $y_i = 1$ we have $p_S(x_i) = 0$, while for every other $x$ we have $p_S(x) < 0$.
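This construction is easy to check numerically. The sketch below is an illustration only (the data points and function names are made up, not part of the exercise): it builds $p_S$ from the positive examples and verifies that it vanishes exactly on them and is strictly negative elsewhere, so thresholding $p_S$ at $0$ reproduces the labels.

```python
import numpy as np

def p_S(x, positives):
    """Evaluate p_S(x) = -prod_{i: y_i=1} ||x - x_i||^2 at a query point x."""
    x = np.asarray(x, dtype=float)
    return -np.prod([np.sum((x - np.asarray(p)) ** 2) for p in positives])

# Hypothetical 2-D training set: two positive and two negative instances.
positives = [(0.0, 1.0), (2.0, -1.0)]
negatives = [(1.0, 1.0), (-3.0, 0.5)]

for x in positives:
    assert p_S(x, positives) == 0.0   # p_S is zero on every positive example
for x in negatives:
    assert p_S(x, positives) < 0.0    # and strictly negative everywhere else

# The induced classifier labels x positive iff p_S(x) >= 0.
h = lambda x: int(p_S(x, positives) >= 0)
print([h(x) for x in positives + negatives])  # [1, 1, 0, 0]
```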
2. By the linearity of expectation,
$$\mathbb{E}_{S|_x \sim D^m}[L_S(h)] = \mathbb{E}_{S|_x \sim D^m}\left[\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}_{[h(x_i) \neq f(x_i)]}\right] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{x_i \sim D}\left[\mathbb{1}_{[h(x_i) \neq f(x_i)]}\right] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{P}_{x_i \sim D}[h(x_i) \neq f(x_i)] = \frac{1}{m}\sum_{i=1}^{m} L_{(D,f)}(h) = L_{(D,f)}(h).$$

3. (a) First, observe that by definition, $A$ labels positively all the positive instances in the training set. Second, as we assume realizability, and since the tightest rectangle enclosing all positive examples is returned, all the negative instances are labeled correctly by $A$ as well. We conclude that $A$ is an ERM.

(b) Fix some distribution $D$ over $X$, and define $R^\star$ as in the hint. Let $f$ be the hypothesis associated with $R^\star$. Given a training set $S$, denote by $R(S)$ the rectangle returned by the proposed algorithm and by $A(S)$ the corresponding hypothesis. The definition of the algorithm $A$ implies that $R(S) \subseteq R^\star$ for every $S$. Thus,
$$L_{(D,f)}(R(S)) = D(R^\star \setminus R(S)).$$
Fix some $\epsilon \in (0,1)$. Define $R_1, R_2, R_3$ and $R_4$ as in the hint. For each $i \in [4]$, define the event
$$F_i = \{S|_x : S|_x \cap R_i = \emptyset\}.$$
Applying the union bound, we obtain
$$D^m\left(\{S : L_{(D,f)}(A(S)) > \epsilon\}\right) \le D^m\left(\bigcup_{i=1}^{4} F_i\right) \le \sum_{i=1}^{4} D^m(F_i).$$
Thus, it suffices to ensure that $D^m(F_i) \le \delta/4$ for every $i$. Fix some $i \in [4]$. The probability that a sample is in $F_i$ is the probability that none of the instances falls in $R_i$, which is exactly $(1-\epsilon/4)^m$. Therefore,
$$D^m(F_i) = (1-\epsilon/4)^m \le \exp(-\epsilon m/4),$$
and hence,
$$D^m\left(\{S : L_{(D,f)}(A(S)) > \epsilon\}\right) \le 4\exp(-\epsilon m/4).$$
Plugging in the assumption on $m$, we conclude the proof.

(c) The hypothesis class of axis-aligned rectangles in $\mathbb{R}^d$ is defined as follows. Given real numbers $a_1 \le b_1, a_2 \le b_2, \ldots, a_d \le b_d$, define the classifier $h_{(a_1,b_1,\ldots,a_d,b_d)}$ by
$$h_{(a_1,b_1,\ldots,a_d,b_d)}(x_1,\ldots,x_d) = \begin{cases} 1 & \text{if } a_i \le x_i \le b_i \text{ for all } i \in [d] \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
The class of all axis-aligned rectangles in $\mathbb{R}^d$ is defined as
$$\mathcal{H}^d_{rec} = \{h_{(a_1,b_1,\ldots,a_d,b_d)} : \forall i \in [d],\ a_i \le b_i\}.$$
It can be seen that the same algorithm proposed above is an ERM for this case as well. The sample complexity is analyzed similarly; the only difference is that instead of 4 strips we have $2d$ strips (2 strips for each dimension). Thus, it suffices to draw a training set of size $\left\lceil \frac{2d \log(2d/\delta)}{\epsilon} \right\rceil$.

(d) For each dimension, the algorithm has to find the minimal and the maximal values among the positive instances in the training sequence. Therefore, its runtime is $O(md)$. Since we have shown that the required value of $m$ is at most $\left\lceil \frac{2d \log(2d/\delta)}{\epsilon} \right\rceil$, it follows that the runtime of the algorithm is indeed polynomial in $d$, $1/\epsilon$, and $\log(1/\delta)$.
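The ERM of parts (a), (c) and (d) is just a per-coordinate min/max over the positive examples. A minimal sketch (the array layout, names, and synthetic data are assumptions for illustration, not part of the solution):

```python
import numpy as np

def fit_tightest_rectangle(X, y):
    """ERM for axis-aligned rectangles: the smallest box containing all positive points.

    X: (m, d) array of instances, y: (m,) array of 0/1 labels.
    Runs in O(m d) time, matching part (d).
    """
    pos = X[y == 1]
    if len(pos) == 0:                          # no positives: return an empty rectangle
        a = np.full(X.shape[1], np.inf)
        b = np.full(X.shape[1], -np.inf)
    else:
        a, b = pos.min(axis=0), pos.max(axis=0)
    return a, b

def predict(a, b, X):
    """h_{(a,b)}(x) = 1 iff a_i <= x_i <= b_i for every coordinate i."""
    return np.all((X >= a) & (X <= b), axis=1).astype(int)

# Tiny synthetic, realizable example: the target rectangle is [0, 1]^2.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, size=(200, 2))
y = np.all((X >= 0) & (X <= 1), axis=1).astype(int)
a, b = fit_tightest_rectangle(X, y)
print("empirical error:", np.mean(predict(a, b, X) != y))  # 0.0, as an ERM should achieve
```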
3 A Formal Learning Model

1. The proofs follow (almost) immediately from the definition. We will show that the sample complexity is monotonically decreasing in the accuracy parameter $\epsilon$; the proof that it is monotonically decreasing in the confidence parameter $\delta$ is analogous. Denote by $D$ an unknown distribution over $X$, and let $f \in H$ be the target hypothesis. Denote by $A$ an algorithm which learns $H$ with sample complexity $m_H(\cdot,\cdot)$. Fix some $\delta \in (0,1)$, and suppose that $0 < \epsilon_1 \le \epsilon_2$. We need to show that $m_1 \stackrel{\mathrm{def}}{=} m_H(\epsilon_1,\delta) \ge m_H(\epsilon_2,\delta) \stackrel{\mathrm{def}}{=} m_2$. Given an i.i.d. training sequence of size $m \ge m_1$, we have that with probability at least $1-\delta$, $A$ returns a hypothesis $h$ such that $L_{(D,f)}(h) \le \epsilon_1 \le \epsilon_2$. By the minimality of $m_2$, we conclude that $m_2 \le m_1$.

2. (a) We propose the following algorithm: if a positive instance $x_+$ appears in $S$, return the (true) hypothesis $h_{x_+}$; if $S$ doesn't contain any positive instance, return the all-negative hypothesis $h^-$. It is clear that this algorithm is an ERM.

(b) Let $\epsilon \in (0,1)$, and fix the distribution $D$ over $X$. If the true hypothesis is the all-negative hypothesis $h^-$, then our algorithm returns a perfect hypothesis.
Assume now that there exists a unique positive instance $x_+$. It is clear that if $x_+$ appears in the training sequence $S$, our algorithm returns a perfect hypothesis. Furthermore, if $D[\{x_+\}] \le \epsilon$ then in any case the returned hypothesis has generalization error at most $\epsilon$ (with probability 1). Thus, it only remains to bound the probability of the event in which $D[\{x_+\}] > \epsilon$ but $x_+$ doesn't appear in $S$. Denote this event by $F$. Then
$$\mathbb{P}_{S|_x \sim D^m}[F] \le (1-\epsilon)^m \le e^{-\epsilon m}.$$
Hence, $\mathcal{H}_{\mathrm{Singleton}}$ is PAC learnable, and its sample complexity is bounded by
$$m_H(\epsilon,\delta) \le \left\lceil \frac{\log(1/\delta)}{\epsilon} \right\rceil.$$

3. Consider the ERM algorithm $A$ which, given a training sequence $S = ((x_i, y_i))_{i=1}^m$, returns the hypothesis $\hat{h}$ corresponding to the tightest circle which contains all the positive instances. Denote the radius of this hypothesis by $\hat{r}$. Assume realizability and let $h^\star$ be a circle with zero generalization error; denote its radius by $r^\star$. Let $\epsilon, \delta \in (0,1)$. Let $r \le r^\star$ be a scalar such that $D_X(\{x : r \le \|x\| \le r^\star\}) = \epsilon$, and define $E = \{x \in \mathbb{R}^2 : r \le \|x\| \le r^\star\}$. The probability (over drawing $S$) that $L_D(h_S) \ge \epsilon$ is bounded above by the probability that no point in $S$ belongs to $E$. The probability of this event is bounded above by $(1-\epsilon)^m \le e^{-\epsilon m}$. The desired bound on the sample complexity follows by requiring that $e^{-\epsilon m} \le \delta$.
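For the circle class of Exercise 3, the ERM reduces to taking the largest norm among the positive examples. A small sketch, assuming (as the norms in the solution suggest) that hypotheses are origin-centered discs $h_r(x) = \mathbb{1}[\|x\| \le r]$; the data below are made up:

```python
import numpy as np

def fit_tightest_circle(X, y):
    """ERM for origin-centered discs: radius = largest norm among positive instances."""
    pos = X[y == 1]
    return 0.0 if len(pos) == 0 else float(np.linalg.norm(pos, axis=1).max())

def predict(r, X):
    return (np.linalg.norm(X, axis=1) <= r).astype(int)

# Hypothetical realizable data: the target disc has radius 1.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
y = (np.linalg.norm(X, axis=1) <= 1.0).astype(int)
r_hat = fit_tightest_circle(X, y)
print(r_hat <= 1.0, np.mean(predict(r_hat, X) != y))  # True 0.0 on the training data
```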
4. We first observe that $H$ is finite. Let us calculate its size exactly. Each hypothesis, besides the all-negative hypothesis, is determined by deciding, for each variable $x_i$, whether $x_i$, $\bar{x}_i$, or neither of them appears in the corresponding conjunction. Thus, $|H| = 3^d + 1$. We conclude that $H$ is PAC learnable and its sample complexity can be bounded by
$$m_H(\epsilon,\delta) \le \left\lceil \frac{d\log 3 + \log(1/\delta)}{\epsilon} \right\rceil.$$
Let us describe our learning algorithm. We define $h_0 = x_1 \wedge \bar{x}_1 \wedge \cdots \wedge x_d \wedge \bar{x}_d$; observe that $h_0$ is the always-minus hypothesis. Let $((a^1, y_1), \ldots, (a^m, y_m))$ be an i.i.d. training sequence of size $m$. Since we cannot extract any information from negative examples, our algorithm ignores them. Starting from $h = h_0$, for each positive example $a$ we remove from $h$ all the literals that are missing in $a$: if $a_i = 1$ we remove $\bar{x}_i$ from $h$, and if $a_i = 0$ we remove $x_i$ from $h$. Finally, our algorithm returns $h$. By construction and realizability, $h$ labels positively all the positive examples among $a^1, \ldots, a^m$. For the same reasons, the set of literals in $h$ contains the set of literals in the target hypothesis; thus, $h$ classifies correctly the negative elements among $a^1, \ldots, a^m$. This implies that $h$ is an ERM. Since the algorithm takes linear time (in terms of the dimension $d$) to process each example, the running time is bounded by $O(md)$.
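The update rule above is straightforward to implement. A sketch of the conjunction learner on Boolean vectors, under the hypothetical encoding that a hypothesis is the set of surviving literals, with $(i, \text{True})$ standing for $x_i$ and $(i, \text{False})$ for $\bar{x}_i$ (data and names are illustrative only):

```python
def learn_conjunction(examples, d):
    """ERM for Boolean conjunctions over d variables.

    examples: list of (a, y) pairs, a a 0/1 tuple of length d, y in {0, 1}.
    Starts from h_0 (all 2d literals) and shrinks it on every positive example.
    """
    h = {(i, v) for i in range(d) for v in (True, False)}
    for a, y in examples:
        if y == 1:                                   # negative examples are ignored
            # keep only literals satisfied by a: x_i if a_i = 1, not-x_i if a_i = 0
            h = {(i, v) for (i, v) in h if (a[i] == 1) == v}
    return h

def predict(h, a):
    return int(all((a[i] == 1) == v for (i, v) in h))

# Hypothetical target over d = 4 variables: x_1 AND not-x_3 (0-indexed: x_0 and not x_2).
data = [((1, 0, 0, 1), 1), ((1, 1, 0, 0), 1), ((0, 1, 0, 1), 0), ((1, 1, 1, 1), 0)]
h = learn_conjunction(data, d=4)
print(sorted(h), [predict(h, a) for a, _ in data])   # training error is zero
```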
5. Fix some $h \in H$ with $L_{(\bar{D},f)}(h) > \epsilon$, where $\bar{D} = \frac{1}{m}\sum_{i=1}^{m} D_i$. By definition,
$$\frac{1}{m}\sum_{i=1}^{m}\mathbb{P}_{x \sim D_i}[h(x) = f(x)] = \mathbb{P}_{x \sim \bar{D}}[h(x) = f(x)] < 1 - \epsilon.$$
We now bound the probability that $h$ is consistent with $S$ (i.e., that $L_S(h) = 0$) as follows:
$$\mathbb{P}_S[L_S(h) = 0] = \prod_{i=1}^{m}\mathbb{P}_{x \sim D_i}[h(x) = f(x)] \le \left(\frac{1}{m}\sum_{i=1}^{m}\mathbb{P}_{x \sim D_i}[h(x) = f(x)]\right)^{m} < (1-\epsilon)^m \le e^{-\epsilon m}.$$
The first inequality is the geometric-arithmetic mean inequality. Applying the union bound, we conclude that the probability that there exists some $h \in H$ with $L_{(\bar{D},f)}(h) > \epsilon$ which is consistent with $S$ is at most $|H|\exp(-\epsilon m)$.

6. Suppose that $H$ is agnostic PAC learnable, and let $A$ be a learning algorithm that learns $H$ with sample complexity $m_H(\cdot,\cdot)$. We show that $H$ is PAC learnable using $A$. Let $D$ and $f$ be an (unknown) distribution over $X$ and the target function, respectively. We may assume w.l.o.g. that $D$ is a joint distribution over $X \times \{0,1\}$, where the conditional probability of $y$ given $x$ is determined deterministically by $f$. Since we assume realizability, we have $\inf_{h \in H} L_D(h) = 0$. Let $\epsilon, \delta \in (0,1)$. Then, for every positive integer $m \ge m_H(\epsilon, \delta)$, if we equip $A$ with a training set $S$ consisting of $m$ i.i.d. instances which are labeled by $f$, then with probability at least $1-\delta$ (over the choice of $S|_x$), it returns a hypothesis $h$ with
$$L_D(h) \le \inf_{h' \in H} L_D(h') + \epsilon = 0 + \epsilon = \epsilon.$$

7. Let $x \in X$, and let $\alpha_x$ be the conditional probability of a positive label given $x$. We have
$$\mathbb{P}[f_D(X) \neq Y \mid X = x] = \mathbb{1}_{[\alpha_x \ge 1/2]}\,\mathbb{P}[Y = 0 \mid X = x] + \mathbb{1}_{[\alpha_x < 1/2]}\,\mathbb{P}[Y = 1 \mid X = x] = \mathbb{1}_{[\alpha_x \ge 1/2]}(1 - \alpha_x) + \mathbb{1}_{[\alpha_x < 1/2]}\,\alpha_x = \min\{\alpha_x, 1 - \alpha_x\}.$$
Let $g$ be a classifier from $X$ to $\{0,1\}$ (as we shall see, $g$ might be non-deterministic). We have
$$\begin{aligned}
\mathbb{P}[g(X) \neq Y \mid X = x] &= \mathbb{P}[g(X) = 0 \mid X = x]\,\mathbb{P}[Y = 1 \mid X = x] + \mathbb{P}[g(X) = 1 \mid X = x]\,\mathbb{P}[Y = 0 \mid X = x] \\
&= \mathbb{P}[g(X) = 0 \mid X = x]\,\alpha_x + \mathbb{P}[g(X) = 1 \mid X = x]\,(1 - \alpha_x) \\
&\ge \mathbb{P}[g(X) = 0 \mid X = x]\,\min\{\alpha_x, 1 - \alpha_x\} + \mathbb{P}[g(X) = 1 \mid X = x]\,\min\{\alpha_x, 1 - \alpha_x\} \\
&= \min\{\alpha_x, 1 - \alpha_x\}.
\end{aligned}$$
The statement now follows from the fact that the above holds for every $x \in X$. More formally, by the law of total expectation,
$$L_D(f_D) = \mathbb{E}_{(x,y) \sim D}\left[\mathbb{1}_{[f_D(x) \neq y]}\right] = \mathbb{E}_{x \sim D_X}\left[\mathbb{E}_{y \sim D_{Y|x}}\left[\mathbb{1}_{[f_D(x) \neq y]} \mid X = x\right]\right] = \mathbb{E}_{x \sim D_X}\left[\min\{\alpha_x, 1 - \alpha_x\}\right] \le \mathbb{E}_{x \sim D_X}\left[\mathbb{E}_{y \sim D_{Y|x}}\left[\mathbb{1}_{[g(x) \neq y]} \mid X = x\right]\right] = L_D(g).$$
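The optimality claim of Exercise 7 can also be sanity-checked numerically. The toy computation below (all values made up) evaluates the exact error $\mathbb{P}[h(X) \neq Y]$ of every deterministic rule on a five-point domain and confirms that thresholding $\alpha_x$ at $1/2$ minimizes it:

```python
import numpy as np

# Toy domain of 5 points with known conditionals alpha_x = P[Y = 1 | X = x].
alpha = np.array([0.1, 0.4, 0.5, 0.8, 0.95])
p_x = np.full(5, 0.2)                        # uniform marginal over the domain

bayes = (alpha >= 0.5).astype(int)           # f_D(x) = 1[alpha_x >= 1/2]

def true_error(h):
    # P[h(X) != Y] = sum_x p(x) * (alpha_x if h(x) = 0 else 1 - alpha_x)
    return float(np.sum(p_x * np.where(h == 0, alpha, 1 - alpha)))

# The Bayes error equals E_x[min(alpha_x, 1 - alpha_x)] ...
print(np.isclose(true_error(bayes), np.sum(p_x * np.minimum(alpha, 1 - alpha))))  # True
# ... and no other deterministic rule does better.
others = [np.array(list(np.binary_repr(k, 5)), dtype=int) for k in range(32)]
print(all(true_error(bayes) <= true_error(h) for h in others))                    # True
```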
8. (a) This was proved in the previous exercise.

(b) We proved in the previous exercise that for every distribution $D$, the Bayes optimal predictor $f_D$ is optimal w.r.t. $D$.

(c) Choose any distribution $D$. Then $A$ is not better than $f_D$ w.r.t. $D$.

9. (a) Suppose that $H$ is PAC learnable in the one-oracle model. Let $A$ be an algorithm which learns $H$, and denote by $m_H$ the function that determines its sample complexity. We prove that $H$ is PAC learnable also in the two-oracle model. Let $D$ be a distribution over $X \times \{0,1\}$. Note that drawing points from the negative and positive oracles with equal probability is equivalent to obtaining i.i.d. examples from a distribution $\bar{D}$ which gives equal probability to positive and negative examples. Formally, for every subset $E \subseteq X$ we have
$$\bar{D}[E] = \frac{1}{2} D^+[E] + \frac{1}{2} D^-[E].$$
Thus, $\bar{D}[\{x : f(x) = 1\}] = \bar{D}[\{x : f(x) = 0\}] = \frac{1}{2}$. If we give $A$ access to a training set which is drawn i.i.d. according to $\bar{D}$ with size $m \ge m_H(\epsilon/2, \delta)$, then with probability at least $1-\delta$, $A$ returns $h$ with
$$\begin{aligned}
\epsilon/2 &\ge L_{(\bar{D},f)}(h) = \mathbb{P}_{x \sim \bar{D}}[h(x) \neq f(x)] \\
&= \mathbb{P}_{x \sim \bar{D}}[f(x) = 1, h(x) = 0] + \mathbb{P}_{x \sim \bar{D}}[f(x) = 0, h(x) = 1] \\
&= \mathbb{P}_{x \sim \bar{D}}[f(x) = 1]\,\mathbb{P}_{x \sim \bar{D}}[h(x) = 0 \mid f(x) = 1] + \mathbb{P}_{x \sim \bar{D}}[f(x) = 0]\,\mathbb{P}_{x \sim \bar{D}}[h(x) = 1 \mid f(x) = 0] \\
&= \frac{1}{2}\,\mathbb{P}_{x \sim D^+}[h(x) = 0] + \frac{1}{2}\,\mathbb{P}_{x \sim D^-}[h(x) = 1] \\
&= \frac{1}{2} L_{(D^+,f)}(h) + \frac{1}{2} L_{(D^-,f)}(h).
\end{aligned}$$
This implies that with probability at least $1-\delta$, both $L_{(D^+,f)}(h) \le \epsilon$ and $L_{(D^-,f)}(h) \le \epsilon$.
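The reduction in Exercise 9(a) only needs a wrapper that turns the two oracles into a single example oracle for $\bar{D}$: flip a fair coin and query the positive or negative oracle accordingly. A minimal sketch of that wrapper (the oracle callables and the toy distributions are hypothetical stand-ins, not part of the solution):

```python
import random

def make_balanced_oracle(sample_positive, sample_negative, f):
    """Simulate an example oracle for D-bar from the two one-class oracles.

    sample_positive() / sample_negative() draw x from D+ / D-; f is the target labeling.
    Each call returns (x, f(x)), with positive and negative draws equally likely.
    """
    def oracle():
        x = sample_positive() if random.random() < 0.5 else sample_negative()
        return x, f(x)
    return oracle

# Toy usage: D+ uniform on [2, 3), D- uniform on [0, 1), target f(x) = 1[x >= 2].
oracle = make_balanced_oracle(lambda: 2 + random.random(),
                              lambda: random.random(),
                              lambda x: int(x >= 2))
S = [oracle() for _ in range(6)]   # i.i.d. sample from D-bar, fed to the one-oracle learner A
print(S)
```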