582669 Supervised Machine Learning (Spring 2014)
Homework, sample solutions

Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen.

Problem 1

The logarithmic loss is by definition
\[
L_{\log}(y, \hat{y}) =
\begin{cases}
-\ln(1 - \hat{y}) & \text{if } y = 0 \\
-\ln \hat{y} & \text{if } y = 1.
\end{cases}
\]
Let first $y_t = 0$. Now
\[
L_{\log}(y_t, \hat{y}_t) = -\ln(1 - \hat{y}_t)
= -\ln\Bigl(1 - \sum_{i=1}^{n} v_{t,i} h_i(x_t)\Bigr),
\]
where $v_{t,i} = w_{t,i}/W_t$ and $W_t = \sum_{i=1}^{n} w_{t,i}$. From the update rule we get
\[
w_{t+1,i} = w_{t,i} \exp\bigl(-\eta L_{\log}(0, h_i(x_t))\bigr)
= w_{t,i} \exp\bigl(\eta \ln(1 - h_i(x_t))\bigr)
= w_{t,i}\,(1 - h_i(x_t)),
\]
where we used the assumption $\eta = 1$. Because $c = 1$,
\[
P_t - P_{t+1} = c \ln W_t - c \ln W_{t+1}
= -\ln\frac{W_{t+1}}{W_t}
= -\ln\frac{\sum_{i=1}^{n} w_{t,i}(1 - h_i(x_t))}{W_t}
= -\ln\Bigl(1 - \sum_{i=1}^{n} v_{t,i} h_i(x_t)\Bigr)
= L_{\log}(y_t, \hat{y}_t).
\]
Let next $y_t = 1$. In this case
\[
L_{\log}(y_t, \hat{y}_t) = -\ln \hat{y}_t = -\ln \sum_{i=1}^{n} v_{t,i} h_i(x_t),
\]
and
\[
w_{t+1,i} = w_{t,i} \exp\bigl(-\eta L_{\log}(1, h_i(x_t))\bigr)
= w_{t,i} \exp\bigl(\eta \ln h_i(x_t)\bigr)
= w_{t,i}\, h_i(x_t).
\]
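Before the $y_t = 1$ case is concluded below, the per-trial identity $L_{\log}(y_t, \hat{y}_t) = \ln W_t - \ln W_{t+1}$ and the resulting loss bound can be checked numerically. A minimal sketch with $\eta = c = 1$; the expert predictions, labels, and variable names here are synthetic, chosen only for illustration:

```python
import math
import random

random.seed(0)
n, T = 5, 50
# n "expert" predictions h_i(x_t) in (0, 1) and random binary labels y_t
H = [[random.uniform(0.05, 0.95) for _ in range(n)] for _ in range(T)]
ys = [random.randint(0, 1) for _ in range(T)]

def log_loss(y, p):
    return -math.log(1 - p) if y == 0 else -math.log(p)

w = [1.0] * n                      # initial weights, so W_1 = n
total = 0.0
for preds, y in zip(H, ys):
    W = sum(w)
    y_hat = sum(wi * p for wi, p in zip(w, preds)) / W   # weighted-average prediction
    # multiplicative update with eta = 1: w_i <- w_i * exp(-L_log(y, h_i(x)))
    w_new = [wi * math.exp(-log_loss(y, p)) for wi, p in zip(w, preds)]
    # per-trial identity: the loss equals the potential drop ln W_t - ln W_{t+1}
    assert abs(log_loss(y, y_hat) - (math.log(W) - math.log(sum(w_new)))) < 1e-9
    total += log_loss(y, y_hat)
    w = w_new

best = min(sum(log_loss(y, preds[i]) for preds, y in zip(H, ys)) for i in range(n))
assert total <= best + math.log(n)   # loss bound: L(WA) <= min_i L(h_i) + ln n
```

The assertion inside the loop is exactly the telescoping step of the proof; summing it over the trials gives the bound verified at the end.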
Therefore,
\[
P_t - P_{t+1} = -\ln\frac{W_{t+1}}{W_t}
= -\ln\frac{\sum_{i=1}^{n} w_{t,i} h_i(x_t)}{W_t}
= -\ln \sum_{i=1}^{n} v_{t,i} h_i(x_t)
= L_{\log}(y_t, \hat{y}_t),
\]
which completes the proof.

Loss bound. Let $\eta = c = 1$ as before, and let $H = \{h_1, h_2, \ldots, h_n\}$. We get the bound directly by using the previous result $L_{\log}(y_t, \hat{y}_t) = P_t - P_{t+1}$. For all $h \in H$ we have
\[
L_{\log}(S, \mathrm{WA}) = \sum_{t=1}^{T} L_{\log}(y_t, \hat{y}_t)
= \sum_{t=1}^{T} (P_t - P_{t+1})
= P_1 - P_{T+1}
= c \ln W_1 - c \ln W_{T+1}
= \ln n - \ln \sum_{i=1}^{n} \exp\Bigl(-\sum_{t=1}^{T} L_{\log}(y_t, h_i(x_t))\Bigr)
\le \ln n - \ln \exp\Bigl(-\sum_{t=1}^{T} L_{\log}(y_t, h(x_t))\Bigr)
= \ln n + L_{\log}(S, h),
\]
since the initial weights are $w_{1,i} = 1$, so that $W_1 = n$. Taking the minimum over $H$ gives the bound
\[
L_{\log}(S, \mathrm{WA}) \le \min_{h \in H} L_{\log}(S, h) + \ln n.
\]

Problem 2

(a) Let $A_i$ be the event "heads comes up in toss number $i$", $i \in \{1, 2, \ldots, 10\}$. Then
\[
P(A_1 \cap A_2 \cap \cdots \cap A_{10}) = \prod_{i=1}^{10} P(A_i) = \Bigl(\frac{1}{2}\Bigr)^{10},
\]
because the tosses are independent and identically distributed. This is an example of a binomial random variable: the number of heads in a series of 10 fair coin tosses can be modelled as a random variable $X \sim \mathrm{Bin}(10, 1/2)$, and $P(X = 10) = \binom{10}{10} (1/2)^{10} (1/2)^{0} = (1/2)^{10}$.

(b) From (a) we know that the probability that a given coin does not come up heads 10 times is $q = 1 - (1/2)^{10}$. Because the coin tosses are independent, the probability that no coin comes up heads 10 times is $q^{1000}$. Finally, the probability that at least one coin comes up heads 10 times is
\[
1 - q^{1000} = 1 - \Bigl(1 - \Bigl(\frac{1}{2}\Bigr)^{10}\Bigr)^{1000} = 1 - \Bigl(\frac{1023}{1024}\Bigr)^{1000} \approx 0.62358.
\]
We can also formulate the situation as follows. The series of 10 tosses with each of the 1000 coins again form a sequence of repeated experiments. Let the length of the sequence be $n = 1000$, and let the probability of success (10 heads with a single coin) in one experiment be $p = 2^{-10}$. The number of successes $X$ in the sequence is a binomially distributed random variable $X \sim \mathrm{Bin}(n, p)$. What is asked is the probability
\[
P(X \ge 1) = 1 - P(X = 0) = 1 - \binom{n}{0} p^{0} (1 - p)^{n} = 1 - (1 - p)^{1000}.
\]

(c) Let $B_i$ be the event "coin number $i$ comes up heads 10 times", and let $C$ be the event "at least one coin comes up heads 10 times". The union bound gives us the estimate
\[
P(C) = P\Bigl(\bigcup_{i=1}^{1000} B_i\Bigr) \le \sum_{i=1}^{1000} P(B_i) = 1000 \cdot \Bigl(\frac{1}{2}\Bigr)^{10} \approx 0.97656.
\]
Our estimate turns out to be very inaccurate. Indeed, if we had considered 1024 coins instead of 1000, we would have got only the trivial bound $P(C) \le 1$.

Exercise 3

(a) We calculate the risk using the expression from page 4 of the lecture notes:
\[
R(h) = \nu + (1 - 2\nu) \sum_{x:\, h(x) \ne f(x)} P_X(x). \tag{1}
\]
Using this for $h = f$, the sum is empty and we directly get $R(f) = \nu$. On the other hand, for any $h$ such that $P_X(h(X) \ne f(X)) > \epsilon$, using (1) we get
\[
R(h) = \nu + (1 - 2\nu) P_X(h(X) \ne f(X)) > \nu + \epsilon (1 - 2\nu),
\]
since we assume $\nu < 1/2$, so that $1 - 2\nu > 0$.

(b) Following the proof of Theorem 1.9, we are going to show that with probability at least $1 - \delta$ we have
\[
\hat{R}(f) \le R(f) + \frac{\epsilon (1 - 2\nu)}{2} \tag{2}
\]
for the target classifier, and
\[
\hat{R}(h) \ge R(h) - \frac{\epsilon (1 - 2\nu)}{2} \tag{3}
\]
for all $h \ne f$. Combining (2) and (3) with the previous estimates shows that, for any $h$ such that $P_X(h(X) \ne f(X)) > \epsilon$ holds, we have
\[
\hat{R}(h) \ge R(h) - \frac{\epsilon (1 - 2\nu)}{2}
> \nu + \epsilon (1 - 2\nu) - \frac{\epsilon (1 - 2\nu)}{2}
= \nu + \frac{\epsilon (1 - 2\nu)}{2}
= R(f) + \frac{\epsilon (1 - 2\nu)}{2}
\ge \hat{R}(f).
\]
Hence, ERM cannot pick any such $h$ as its hypothesis. Therefore, if we prove that (2) and (3) hold with probability at least $1 - \delta$, we have the desired result.

We use Hoeffding's inequality
\[
\Pr[S \ge \mathrm{E}[S] + t] \le \exp\Bigl(-\frac{2 t^2}{\sum_{i=1}^{m} (b_i - a_i)^2}\Bigr).
\]
The probability that (2) fails is estimated as
\[
\Pr\Bigl[\hat{R}(f) > R(f) + \frac{\epsilon (1 - 2\nu)}{2}\Bigr]
= \Pr\Bigl[m \hat{R}(f) > m R(f) + \frac{m \epsilon (1 - 2\nu)}{2}\Bigr]
\le \exp\Bigl(-\frac{2 m^2 \epsilon^2 (1 - 2\nu)^2 / 4}{m}\Bigr)
= \exp\bigl(-m \epsilon^2 (1 - 2\nu)^2 / 2\bigr),
\]
since $m \hat{R}(f)$ is a sum of $m$ random variables, each with range $[0, 1]$. Since we assume
\[
m \ge \frac{2}{\epsilon^2 (1 - 2\nu)^2} \ln \frac{2 |H|}{\delta},
\]
this implies
\[
\Pr\Bigl[\hat{R}(f) > R(f) + \frac{\epsilon (1 - 2\nu)}{2}\Bigr]
\le \exp\Bigl(-\ln \frac{2 |H|}{\delta}\Bigr) = \frac{\delta}{2 |H|}.
\]
Similarly, we get
\[
\Pr\Bigl[\hat{R}(h) \le R(h) - \frac{\epsilon (1 - 2\nu)}{2}\Bigr] \le \frac{\delta}{2 |H|}
\]
for each $h \ne f$. There are at most $|H|$ events in total, so the union bound shows that (2) and (3) all hold with probability at least $1 - |H| \cdot \delta / (2 |H|) = 1 - \delta/2 \ge 1 - \delta$, which gives the final result.

Problem 4

(a) Positive examples. Each positive example $(x_j, 1)$ in the sample will be classified by $\hat{h}$ as positive iff $\hat{h}$ does not contain a literal that is false on $x_j$. All literals that are false on $x_j$ have been removed from $\hat{h}$ during training when $(x_j, 1)$ was processed, so all positive examples are classified correctly by $\hat{h}$.

Negative examples. All negative examples $(x_j, 0)$ are classified as negative if all the literals of the target $h^*$ are also in $\hat{h}$. The algorithm proceeds by removing from $L$ only the literals that contradict some positive example $x_j$. Those contradicting literals cannot be in $h^*$, because $h^*$ classifies all positive examples correctly. Thus, no literal of $h^*$ has been removed from $\hat{h}$.
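The consistency argument of part (a) can be checked by running the elimination algorithm on a randomly generated sample. A minimal sketch; the target conjunction, sample sizes, and helper names are made up for illustration:

```python
import random

random.seed(1)
n, m = 8, 200

# a literal is (variable index, is_positive); the target is a random 3-literal conjunction
target = [(i, random.choice([True, False])) for i in random.sample(range(n), 3)]

def lit_val(lit, x):
    i, pos = lit
    return x[i] if pos else 1 - x[i]

def conj(lits, x):
    return int(all(lit_val(l, x) == 1 for l in lits))

sample = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
labels = [conj(target, x) for x in sample]

# elimination algorithm: start with all 2n literals and remove every literal
# that is false on some positive example
h_hat = [(i, pos) for i in range(n) for pos in (True, False)]
for x, y in zip(sample, labels):
    if y == 1:
        h_hat = [l for l in h_hat if lit_val(l, x) == 1]

assert all(conj(h_hat, x) == y for x, y in zip(sample, labels))  # consistent on the sample
assert all(l in h_hat for l in target)                           # no target literal removed
```

The two assertions correspond exactly to the two halves of the argument: consistency on positive examples by construction, and consistency on negative examples because every literal of the target survives.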
(b) As we argued in part (a), the hypothesis $\hat{h}$ includes all the literals of the target $h^*$. Therefore,
\[
\Pr_x[\hat{h}(x) = 1 \text{ and } h^*(x) = 0] = 0.
\]
In addition to this, we need to show that
\[
\Pr_x[\hat{h}(x) = 0 \text{ and } h^*(x) = 1] \le \epsilon
\]
holds with probability at least $1 - \delta$.

Fix $\epsilon > 0$, and call a literal dangerous if it has probability greater than $\epsilon/n$ of being false on a positive example. More precisely, denote by $\tilde{v}(x)$ the value of literal $\tilde{v}$ on instance $x$: if $\tilde{v} = v_i$, then $\tilde{v}(x) = x_i$, and if $\tilde{v} = \neg v_i$, then $\tilde{v}(x) = 1 - x_i$. Then a literal $\tilde{v}$ is dangerous if
\[
\Pr_x[\tilde{v}(x) = 0 \text{ and } h^*(x) = 1] > \frac{\epsilon}{n}.
\]
There are at most $n$ literals $\tilde{v}$ in $\hat{h}$ (once the first positive example has been processed, at most one of $v_i$ and $\neg v_i$ remains for each $i$). If $\hat{h}(x) = 0$ and $h^*(x) = 1$, then at least one of the literals $\tilde{v}$ in $\hat{h}$ satisfies $\tilde{v}(x) = 0$ and $h^*(x) = 1$. If none of the literals in $\hat{h}$ is dangerous, then by the union bound the probability of drawing $x$ such that $\tilde{v}(x) = 0$ and $h^*(x) = 1$ holds for at least one $\tilde{v}$ in $\hat{h}$ is at most $n \cdot (\epsilon/n) = \epsilon$.

Thus, we are done if we can show that with probability at least $1 - \delta$, no dangerous literal remains in $\hat{h}$. There are $2n$ literals to consider, so by the union bound it is sufficient to show that for a fixed dangerous literal, the probability that it remains is at most $\delta/(2n)$. If a literal $\tilde{v}$ is dangerous, then each example $x$ has probability greater than $\epsilon/n$ of satisfying both $h^*(x) = 1$ and $\tilde{v}(x) = 0$, causing $\tilde{v}$ to be removed from $\hat{h}$. The probability that $\tilde{v}$ remains after $m$ examples is therefore at most
\[
\Bigl(1 - \frac{\epsilon}{n}\Bigr)^{m} \le \exp\Bigl(-\frac{m \epsilon}{n}\Bigr).
\]
For any
\[
m \ge \frac{n}{\epsilon} \ln \frac{2n}{\delta}
\]
this is at most $\delta/(2n)$, as desired.

(c) First, there is the conjunction with no satisfying assignments, which we can represent, for example, as $v_1 \wedge \neg v_1$. Consider now conjunctions that have at least one satisfying assignment. For each variable $i$, there are three alternatives:
- literal $v_i$ is included, but literal $\neg v_i$ is not,
- literal $\neg v_i$ is included, but literal $v_i$ is not,
- neither literal $v_i$ nor $\neg v_i$ is included.

Since we can choose from these three alternatives independently for each of the $n$ variables, we get $3^n$ conjunctions. This includes the conjunction with no literals at all, which by convention represents the hypothesis that always outputs 1.
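For small $n$ the count can also be confirmed by brute force: enumerate all subsets of the $2n$ literals and count the distinct Boolean functions that the corresponding conjunctions define. An illustrative check, not part of the original solution:

```python
from itertools import product

n = 3
functions = set()
for included in product([0, 1], repeat=2 * n):
    # truth table of the conjunction over the chosen literals
    table = []
    for x in product([0, 1], repeat=n):
        # literal values on input x: v_1..v_n, then their negations
        lits = list(x) + [1 - xi for xi in x]
        table.append(all(v == 1 for v, inc in zip(lits, included) if inc))
    functions.add(tuple(table))

# 3^n conjunctions with a satisfying assignment, plus the always-false function
assert len(functions) == 3 ** n + 1
```

Every subset containing a complementary pair collapses to the single always-false function, and the remaining pair-free subsets each define a distinct function, which is where $3^n + 1$ comes from.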
We therefore get $|C_n| = 3^n + 1$. The required number can be calculated directly with the theorem:
\[
m \ge \frac{1}{0.1} \ln \frac{3^{100} + 1}{0.001}
= 10 \bigl(\ln(3^{100} + 1) - \ln 0.001\bigr) \approx 1167.7,
\]
and $m = 1168$ is enough. Calculating $m$ with the formula from the previous part gives
\[
m \ge \frac{n}{\epsilon} \ln \frac{2n}{\delta}
= \frac{100}{0.1} \ln \frac{2 \cdot 100}{0.001} \approx 12206.07.
\]
Hence, $m = 12207$ is enough.

Exercise 5

The Set Cover problem can be formulated as follows. Let $A = \{a_1, a_2, \ldots, a_n\}$ be a set, and let $U = \{B_1, B_2, \ldots, B_k\} \subseteq \mathcal{P}(A)$ be a collection of subsets of $A$ such that $\bigcup_{C \in U} C = A$. We are asked to find the smallest $V \subseteq U$ such that $\bigcup_{C \in V} C = A$. In other words, we should cover the set $A$ with the smallest possible number of sets from $U$.

We show how the Set Cover problem could be solved with an algorithm that takes as input a sample $((x_1, y_1), \ldots, (x_m, y_m))$ and outputs a monotone conjunction $f$ such that $\hat{R}(f)$ is minimized. We generate input vectors with label 0 for every $a \in A$ and vectors with label 1 for every $B \in U$. The coordinates of an input vector correspond to the sets in $U$, and in the monotone conjunction $f$ the variables indicate which sets belong to the set cover.

For every $i \in \{1, 2, \ldots, n\}$, make two identical sample pairs
\[
(x_i, 0) = (x_{i+n}, 0) = ((x_{i1}, x_{i2}, \ldots, x_{ik}), 0)
\]
such that $x_{ij} = 0$ if $a_i \in B_j$ and $x_{ij} = 1$ otherwise. We call the vectors of these pairs 0-vectors. If the output conjunction corresponds to a set cover, all the labels of these vectors are correctly predicted; and if some element $a \in A$ belongs to none of the sets indicated by $f$, that causes two prediction errors.

Also, make for every $i \in \{1, 2, \ldots, k\}$ a pair
\[
(x_{2n+i}, 1) = ((x_{2n+i,1}, x_{2n+i,2}, \ldots, x_{2n+i,k}), 1),
\]
where $x_{2n+i,j} = 0$ if $i = j$ and $x_{2n+i,j} = 1$ otherwise. We call these 1-vectors. The purpose of the 1-vectors is to incur one prediction error for each set included in the set cover.

Using this input, assume that the output $f$ of the algorithm corresponds to a collection $V \subseteq U$, and that there is some $a \in A$ such that $a \notin \bigcup_{C \in V} C$. But then we can improve $f$ and $\hat{R}(f)$ by choosing some $B \in U$ such that $a \in B$: including $B$ increases the number of prediction errors on the 1-vectors by one, while covering $a$ reduces the number of errors among the 0-vectors by two. So $f$ always corresponds to a set cover.
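The error accounting used in this argument can be illustrated on a small, made-up Set Cover instance; the 0-vectors and 1-vectors are built exactly as described above:

```python
# small Set Cover instance: A = {0, 1, 2, 3}, U = {B_1, B_2, B_3}
A = [0, 1, 2, 3]
U = [{0, 1}, {1, 2}, {2, 3}]
k = len(U)

# 0-vectors, duplicated: coordinate j is 0 iff a_i belongs to B_j
zeros = [tuple(0 if a in B else 1 for B in U) for a in A] * 2
# 1-vectors: coordinate j is 0 iff j == i
ones = [tuple(0 if j == i else 1 for j in range(k)) for i in range(k)]
sample = [(x, 0) for x in zeros] + [(x, 1) for x in ones]

def errors(chosen):
    """Empirical errors of the monotone conjunction over the variables in `chosen`."""
    return sum(int(all(x[j] for j in chosen)) != y for x, y in sample)

cover = {0, 2}        # B_1 and B_3 cover A: errs only on its own 1-vectors
assert errors(cover) == len(cover)
not_cover = {0}       # leaves elements 2 and 3 uncovered: 1 + 2 * 2 errors
assert errors(not_cover) == 1 + 2 * 2
# adding a set that covers a missing element strictly improves the empirical risk
assert errors(not_cover | {2}) < errors(not_cover)
```

The three assertions mirror the text: a cover pays one error per chosen set, an uncovered element costs two errors, and repairing a non-cover always pays off.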
For every set cover, all prediction errors happen on the 1-vectors, and the number of prediction errors is equal to the size of the set cover. Thus, the output of the algorithm corresponds to a smallest set cover. The sample can be generated in time polynomial in the size of the Set Cover instance, so the given problem is NP-hard.
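As an end-to-end illustration of the reduction, for a tiny (made-up) instance one can enumerate all monotone conjunctions and confirm that an empirical risk minimizer corresponds to a minimum set cover:

```python
from itertools import combinations

A = [0, 1, 2, 3, 4]
U = [{0, 1}, {1, 2, 3}, {3, 4}, {2, 4}, {0}]
k = len(U)

# sample construction from the reduction: duplicated 0-vectors and one 1-vector per set
zeros = [tuple(0 if a in B else 1 for B in U) for a in A] * 2
ones = [tuple(0 if j == i else 1 for j in range(k)) for i in range(k)]
sample = [(x, 0) for x in zeros] + [(x, 1) for x in ones]

def errors(chosen):
    return sum(int(all(x[j] for j in chosen)) != y for x, y in sample)

def is_cover(chosen):
    covered = set()
    for j in chosen:
        covered |= U[j]
    return covered == set(A)

# ERM over all monotone conjunctions = all subsets of the k variables
all_subsets = [set(c) for r in range(k + 1) for c in combinations(range(k), r)]
erm = min(all_subsets, key=errors)
opt = min((s for s in all_subsets if is_cover(s)), key=len)

assert is_cover(erm)            # the ERM conjunction is a set cover
assert errors(erm) == len(opt)  # its empirical errors equal the minimum cover size
assert len(erm) == len(opt)     # so it is itself a minimum cover
```

This brute-force check is exponential in $k$, of course; the point of the reduction is precisely that no polynomial-time ERM algorithm for monotone conjunctions can exist unless P = NP.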