Lecture 7: October 18, 2017
Information and Coding Theory                                      Autumn 2017
Lecturer: Madhur Tulsiani                         Lecture 7: October 18, 2017

1 Binary hypothesis testing

In this lecture, we apply the tools developed in the past few lectures to understand the problem of distinguishing two distributions (special cases of which have been discussed in the previous lectures). This problem is also known as hypothesis testing. Suppose we have two distributions $P_0$ and $P_1$ on a finite universe $U$. The environment chooses one of the two distributions and generates the data, which consists of a sequence $x \in U^n$ chosen either from $P_0^n$ or $P_1^n$. The true distribution is unknown to us, but we are guaranteed that once $P_0$ or $P_1$ is chosen, all samples in the sequence $x$ are sampled independently from the chosen distribution. The goal is to distinguish between the following two hypotheses:

$H_0$: The true distribution is $P_0$.
$H_1$: The true distribution is $P_1$.

Sometimes $H_0$ is also referred to as the null (default) hypothesis. We will consider (deterministic) tests $T : U^n \to \{0, 1\}$, which take the sequence of samples $x$ as input and select one of the hypotheses. There are two types of errors we will be concerned with:

$$\alpha(T) := \mathbb{P}_{x \sim P_0^n}[T(x) = 1] \quad \text{(False Positive)}$$
$$\beta(T) := \mathbb{P}_{x \sim P_1^n}[T(x) = 0] \quad \text{(False Negative)}.$$

The following claim is easy to prove based on the properties of total-variation distance considered earlier.

Claim 1.1 $\min_T \{\alpha(T) + \beta(T)\} = 1 - \delta_{TV}(P_0^n, P_1^n)$.

Recall that the optimal test for the above claim should be of the form

$$T(x) = \begin{cases} 1 & \text{if } P_1^n(x) \geq P_0^n(x) \\ 0 & \text{if } P_1^n(x) < P_0^n(x). \end{cases}$$
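As a quick numerical sanity check of Claim 1.1 (my own illustration, not from the notes; the distributions $P_0$, $P_1$ and the horizon $n$ below are arbitrary choices), the following sketch enumerates all sequences in $U^n$, runs the likelihood-ratio test above, and compares $\alpha(T) + \beta(T)$ against $1 - \delta_{TV}(P_0^n, P_1^n)$:

```python
import itertools
import numpy as np

# A minimal sketch (assumed setup, not from the notes): two distributions on a
# 3-element universe and n i.i.d. samples. We enumerate all sequences in U^n.
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
n = 4

def seq_prob(P, seq):
    """Probability of a sequence under the product distribution P^n."""
    return np.prod([P[s] for s in seq])

alpha = beta = 0.0      # errors of the optimal (likelihood-ratio) test
tv = 0.0                # total-variation distance between P0^n and P1^n
for seq in itertools.product(range(len(P0)), repeat=n):
    p0, p1 = seq_prob(P0, seq), seq_prob(P1, seq)
    tv += abs(p0 - p1) / 2
    if p1 >= p0:        # T(seq) = 1: output hypothesis H_1
        alpha += p0     # false-positive mass
    else:               # T(seq) = 0: output hypothesis H_0
        beta += p1      # false-negative mass

print(alpha + beta, 1 - tv)   # the two numbers agree, matching Claim 1.1
```

The two quantities coincide because the optimal test collects exactly $\min(P_0^n(x), P_1^n(x))$ in error mass on every sequence $x$.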
One may ask why we should only consider the optimal tests for minimizing the sum $\alpha(T) + \beta(T)$. We may care more about a false positive than a false negative, and may want to minimize a weighted sum (or some other monotone function) of the errors. The following lemma shows that all optimal tests should be of the form above, which makes its decision based only on the ratio $P_1^n(x)/P_0^n(x)$.

Lemma 1.2 (Neyman-Pearson Lemma) Let $T$ be a test of the form

$$T(x) = \begin{cases} 1 & \text{if } P_1^n(x)/P_0^n(x) \geq \tau \\ 0 & \text{if } P_1^n(x)/P_0^n(x) < \tau, \end{cases}$$

for some constant $\tau \geq 0$. Let $T'$ be any other test. Then, $\alpha(T') \geq \alpha(T)$ or $\beta(T') \geq \beta(T)$.

Proof: The proof follows simply from the observation that for all $x \in U^n$,

$$\left(T(x) - T'(x)\right)\left(P_1^n(x) - \tau \cdot P_0^n(x)\right) \geq 0.$$

This is true because if $P_1^n(x) \geq \tau \cdot P_0^n(x)$, then $T(x) = 1$ and the first quantity is nonnegative. Similarly, when $P_1^n(x) - \tau \cdot P_0^n(x)$ is negative, $T(x) = 0$ and $T(x) - T'(x) \leq 0$. Summing over all $x \in U^n$ on both sides gives

$$\mathbb{E}_{x \sim P_1^n}\left[T(x) - T'(x)\right] - \tau \cdot \mathbb{E}_{x \sim P_0^n}\left[T(x) - T'(x)\right] \geq 0$$
$$\Rightarrow \left((1 - \beta(T)) - (1 - \beta(T'))\right) - \tau \cdot \left(\alpha(T) - \alpha(T')\right) \geq 0$$
$$\Rightarrow \left(\beta(T') - \beta(T)\right) - \tau \cdot \left(\alpha(T) - \alpha(T')\right) \geq 0.$$

Thus, $\alpha(T) - \alpha(T') \geq 0$ implies $\beta(T') - \beta(T) \geq 0$.

We now discuss how to analyze the error probabilities for the optimal tests as characterized by the Neyman-Pearson lemma. As before, let $P_x$ denote the type (empirical distribution on $U$) of the sequence $x$. Check that the test $T(x)$ considered above can be written in the following form:

$$P_1^n(x) \geq \tau \cdot P_0^n(x) \quad \Longleftrightarrow \quad D(P_x \| P_0) - D(P_x \| P_1) \geq \frac{1}{n} \log \tau.$$

We define the following sets of probability distributions:

$$\Pi := \left\{ P \;\middle|\; D(P \| P_0) - D(P \| P_1) \geq \frac{1}{n} \log \tau \right\}$$
$$\Pi^c := \left\{ P \;\middle|\; D(P \| P_0) - D(P \| P_1) < \frac{1}{n} \log \tau \right\}$$
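The rewriting of the test follows from the identity $P^n(x) = 2^{-n(D(P_x \| P) + H(P_x))}$ for any distribution $P$. As a sanity check (my own sketch, not from the notes; the distributions, threshold $\tau$, and sample size are arbitrary), the following compares the likelihood-ratio form and the divergence-difference form of the test on a random sequence:

```python
import numpy as np
from collections import Counter

# Assumed example data, not from the notes: verify that the likelihood-ratio
# test and its divergence-difference reformulation make the same decision.
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
tau = 1.7
rng = np.random.default_rng(0)
x = rng.choice(len(P0), size=50, p=P0)   # a sample sequence from P0^n
n = len(x)

def D(P, Q):
    """KL-divergence D(P || Q) in bits; terms with P[i] = 0 contribute 0."""
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

# Empirical type P_x of the sequence
counts = Counter(x)
Px = np.array([counts.get(i, 0) / n for i in range(len(P0))])

# Likelihood-ratio form: log(P1^n(x)/P0^n(x)) >= log(tau)
lhs = np.sum(np.log2(P1[x])) - np.sum(np.log2(P0[x])) >= np.log2(tau)
# Divergence-difference form: D(P_x||P0) - D(P_x||P1) >= (1/n) log(tau)
rhs = D(Px, P0) - D(Px, P1) >= np.log2(tau) / n
print(lhs, rhs)   # the two booleans coincide
```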
Check the following property of the sets $\Pi$ and $\Pi^c$.

Exercise 1.3 Check that both the sets $\Pi$ and $\Pi^c$ are convex (and are in fact defined by linear inequalities in the distributions $P$). Also, check that $\Pi$ is a closed set.

We know from Sanov's theorem that

$$\alpha(T) = \mathbb{P}_{x \sim P_0^n}\left[P_x \in \Pi\right] \leq 2^{-n \cdot D(P_0^* \| P_0)}$$
$$\beta(T) = \mathbb{P}_{x \sim P_1^n}\left[P_x \in \Pi^c\right] \leq 2^{-n \cdot D(P_1^* \| P_1)},$$

where $P_0^* = \arg\min_{P \in \Pi} \{D(P \| P_0)\}$. Also, since $\Pi^c$ is not a closed set, we define $P_1^*$ with respect to the closure $\overline{\Pi^c}$ of $\Pi^c$, i.e., $P_1^* = \arg\min_{P \in \overline{\Pi^c}} \{D(P \| P_1)\}$.

We will see later how to compute the distributions which minimize the KL-divergence (known as I-projections) as in the bounds above. The distributions $P_0^*$ and $P_1^*$ in the above bounds turn out to be of the form

$$P_0^*(x) = P_1^*(x) = \frac{P_0^{\lambda}(x) \cdot P_1^{1-\lambda}(x)}{\sum_{y \in U} P_0^{\lambda}(y) \cdot P_1^{1-\lambda}(y)},$$

where $\lambda$ is the solution to an optimization problem.

While the above analysis gives the optimal bounds for all optimal tests as characterized by the Neyman-Pearson lemma, the bound we will use the most is the lower bound in terms of the total variation distance, i.e.,

$$\min_T \{\alpha(T) + \beta(T)\} \geq 1 - \delta_{TV}(P_0^n, P_1^n).$$

We will now develop such a bound for the case of multiple hypotheses.
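To make the form of the I-projections concrete, here is a small sketch (my own illustration; $P_0$, $P_1$, $n$, and $\tau$ are arbitrary choices) that sweeps $\lambda$ over the geometric mixtures of $P_0$ and $P_1$ and picks out the member for which the linear constraint defining $\Pi$ is tight, a natural candidate for the projection onto the boundary of $\Pi$:

```python
import numpy as np

# Sketch (assumed inputs): the geometric-mixture family P_lambda between P0 and
# P1, normalized over U. The I-projections P0*, P1* lie on this family; we scan
# lambda for the point where the constraint defining Pi is tight, i.e.
# D(P_lambda || P0) - D(P_lambda || P1) = (1/n) log(tau).
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
n, tau = 50, 1.7
threshold = np.log2(tau) / n

def D(P, Q):
    return np.sum(P * np.log2(P / Q))

def P_lam(lam):
    w = P0**lam * P1**(1 - lam)
    return w / w.sum()

# On the boundary of Pi the inequality holds with equality; scan lambda in [0, 1].
lams = np.linspace(0.0, 1.0, 10001)
gaps = np.array([D(P_lam(l), P0) - D(P_lam(l), P1) - threshold for l in lams])
lam_star = lams[np.argmin(np.abs(gaps))]
print(lam_star, P_lam(lam_star))   # candidate I-projection on the boundary
```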
2 Fano's inequality and multiple hypothesis testing

Fano's inequality is concerned with Markov chains, which we saw before in the context of the data processing inequality. We will denote the Markov chain as $Z \to Y \to \hat{Z}$. In the context of hypothesis testing, we can think of $Z$ as the choice of an unknown hypothesis from some finite set (hypothesis class) $U_Z$. We think of $Y$ as the data generated from this hypothesis, say a sequence $x$ of $n$ independent samples. Finally, we think of $\hat{Z}$ as a guess for $Z$, which depends only on the data. Fano's inequality is concerned with the probability of error in the guess, defined as $p_e = \mathbb{P}[\hat{Z} \neq Z]$. We have the following statement.

Lemma 2.1 (Fano's inequality) Let $Z \to Y \to \hat{Z}$ be a Markov chain, and let $p_e = \mathbb{P}[\hat{Z} \neq Z]$. Let $H(p_e)$ denote the binary entropy function computed at $p_e$. Then,

$$H(p_e) + p_e \cdot \log(|U_Z| - 1) \geq H(Z \mid \hat{Z}) \geq H(Z \mid Y).$$

Proof: We define a binary random variable $E$, which indicates an error, i.e.,

$$E := \begin{cases} 1 & \text{if } \hat{Z} \neq Z \\ 0 & \text{if } \hat{Z} = Z. \end{cases}$$

The bound in the inequality then follows from considering the entropy $H(Z, E \mid \hat{Z})$. On one hand,

$$H(Z, E \mid \hat{Z}) = H(Z \mid \hat{Z}) + H(E \mid Z, \hat{Z}) = H(Z \mid \hat{Z}),$$

since $H(E \mid Z, \hat{Z}) = 0$ (why?). Another way of computing this entropy is

$$H(Z, E \mid \hat{Z}) = H(E \mid \hat{Z}) + H(Z \mid E, \hat{Z})$$
$$= H(E \mid \hat{Z}) + p_e \cdot H(Z \mid E = 1, \hat{Z}) + (1 - p_e) \cdot H(Z \mid E = 0, \hat{Z})$$
$$\leq H(E) + p_e \cdot H(Z \mid E = 1, \hat{Z})$$
$$\leq H(p_e) + p_e \cdot \log(|U_Z| - 1),$$

where the term for $E = 0$ vanishes since $E = 0$ implies $Z = \hat{Z}$, and the last step uses the fact that conditioned on $E = 1$ and $\hat{Z}$, the variable $Z$ takes one of at most $|U_Z| - 1$ values. Comparing the two expressions then proves the claim.

We can use Fano's inequality to derive a convenient way of obtaining a lower bound for testing multiple hypotheses. However, we need the following property of KL-divergence.

Exercise 2.2 Prove that KL-divergence is (strictly) convex in both its arguments, i.e., for all $\alpha \in (0, 1)$ and all $P_1 \neq P_2$, $Q_1 \neq Q_2$,

$$D(\alpha \cdot P_1 + (1 - \alpha) \cdot P_2 \,\|\, Q) < \alpha \cdot D(P_1 \| Q) + (1 - \alpha) \cdot D(P_2 \| Q)$$
$$D(P \,\|\, \alpha \cdot Q_1 + (1 - \alpha) \cdot Q_2) < \alpha \cdot D(P \| Q_1) + (1 - \alpha) \cdot D(P \| Q_2)$$

In fact, KL-divergence is jointly convex in both its arguments, but we will only need the property above.
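The proof below uses convexity in the second argument, which is easy to probe numerically. A minimal sketch (my own, not from the notes; random distributions on a 4-element universe) testing the inequality on many random instances:

```python
import numpy as np

# Sketch (assumed inputs): numerically check convexity of D(P || .) in its
# second argument, the property used in the proof of Proposition 2.3.
rng = np.random.default_rng(1)

def rand_dist(k):
    p = rng.random(k)
    return p / p.sum()

def D(P, Q):
    return np.sum(P * np.log2(P / Q))

P = rand_dist(4)
for _ in range(1000):
    Q1, Q2 = rand_dist(4), rand_dist(4)
    a = rng.random()
    mix = a * Q1 + (1 - a) * Q2
    # convexity: D(P || a*Q1 + (1-a)*Q2) <= a*D(P||Q1) + (1-a)*D(P||Q2)
    assert D(P, mix) <= a * D(P, Q1) + (1 - a) * D(P, Q2) + 1e-12
print("convexity in the second argument held on all random trials")
```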
Let $\{P_v\}_{v \in V}$ be a collection of hypotheses. Let the environment choose one of the hypotheses uniformly at random (denoted by a random variable $V$) and let $x \sim P_v^n$ be a sequence of $n$ independent samples from the chosen distribution $P_v$ (denoted by the random variable $X$). We will now bound the probability of error for a classifier $\hat{V}$ for $V$. Note that $V \to X \to \hat{V}$ is a Markov chain.

Proposition 2.3 Let $V \to X \to \hat{V}$ be the Markov chain as above. Then,

$$p_e = \mathbb{P}\left[\hat{V} \neq V\right] \geq 1 - \frac{n \cdot \mathbb{E}_{v_1, v_2 \sim V}\left[D(P_{v_1} \| P_{v_2})\right] + 1}{\log |V|}.$$

Proof: From Fano's inequality, we have that

$$1 + p_e \cdot \log |V| \geq H(p_e) + p_e \cdot \log |V| \geq H(V \mid X) = \log |V| - I(V; X).$$

We can now analyze the mutual information between $V$ and $X$ using the equivalent expression in terms of KL-divergence:

$$I(V; X) = D(P(V, X) \,\|\, P(V) P(X)) = D(P(V) \,\|\, P(V)) + \mathbb{E}_{v \sim V}\left[D(P(X \mid V = v) \,\|\, P(X))\right] = \mathbb{E}_{v \sim V}\left[D(P_v^n \,\|\, \overline{P})\right],$$

where $\overline{P} = \mathbb{E}_{v \sim V}[P_v^n]$ denotes the marginal distribution of $X$. Using the convexity of KL-divergence in the second argument, Jensen's inequality and the chain rule for KL-divergence, we get

$$\mathbb{E}_{v \sim V}\left[D(P_v^n \,\|\, \overline{P})\right] \leq \mathbb{E}_{v_1, v_2 \sim V}\left[D(P_{v_1}^n \,\|\, P_{v_2}^n)\right] = n \cdot \mathbb{E}_{v_1, v_2 \sim V}\left[D(P_{v_1} \| P_{v_2})\right].$$

Combining the bounds gives

$$1 + p_e \cdot \log |V| \geq \log |V| - n \cdot \mathbb{E}_{v_1, v_2 \sim V}\left[D(P_{v_1} \| P_{v_2})\right],$$

which proves the claim.
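To see the bound in action, here is a simulation sketch (entirely my own construction, not from the notes: four nearby distributions, a maximum-likelihood classifier as $\hat{V}$, and logarithms in base 2 so that $H(p_e) \leq 1$) comparing the empirical error of a reasonable classifier against the lower bound of Proposition 2.3:

```python
import numpy as np

# Sketch (assumed setup): |V| hypotheses that are small perturbations of a base
# distribution, n samples, and a maximum-likelihood classifier V_hat. We compare
# its empirical error to the lower bound of Proposition 2.3 (log base 2).
rng = np.random.default_rng(2)
base = np.array([0.25, 0.25, 0.25, 0.25])
eps = 0.02
hyps = []
for i in range(4):
    p = base.copy()
    p[i] += eps
    p /= p.sum()
    hyps.append(p)
hyps = np.array(hyps)
n, trials = 30, 5000

def D(P, Q):
    return np.sum(P * np.log2(P / Q))

# Lower bound: p_e >= 1 - (n * E_{v1,v2}[D(P_v1 || P_v2)] + 1) / log|V|,
# where v1, v2 are independent uniform picks (the diagonal contributes 0).
avg_D = np.mean([[D(p, q) for q in hyps] for p in hyps])
bound = 1 - (n * avg_D + 1) / np.log2(len(hyps))

errors = 0
for _ in range(trials):
    v = rng.integers(len(hyps))
    x = rng.choice(len(base), size=n, p=hyps[v])
    loglik = [np.sum(np.log2(h[x])) for h in hyps]   # ML classification
    errors += int(np.argmax(loglik) != v)
print(errors / trials, ">=", max(bound, 0.0))
```

Because the hypotheses are close together, the average divergence is tiny and the bound is far from vacuous: no classifier, ML or otherwise, can beat it.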