CS 229: Lecture 7 Notes

Size: px

Start display at page:

Download "CS 229: Lecture 7 Notes"

Jared Mills
5 years ago
Views:

1 CS 9: Lecture 7 Notes Scribe: Hirsh Jain Lecturer: Angela Fan Overview Overview of today s lecture: Hypothesis Testing Total Variation Distance Pinsker s Inequality Application of Pinsker s Inequality to Coin Tossing Hypothesis Testing Hypothesis Testing broadly fits into the framework of inference, where we have a hypothesis that we set as a null hypothesis and an alternative hypothesis. For eample, for coin tosses, our null hypothesis may be that we have a fair coin which is Ber0.5) and our alternative may be that we have a biased coin which is Berp) for some p 0.5. Given samples from an unknown distribution, we would like to figure out which of our hypothesis is correct. Moreover, there are two types of error: Type : False positive error. This refers to the rejection of a true null hypothesis. Type : False negative error. This refers to the failure to reject a false null hypothesis. We want to find the true error, or the sum of the two errors. Total Variational Distance We frequently want to find the distance between two distributions. There are many ways to do this, and one of them is called total variational distance, or TVD. This is defined on two distributions A, B as: TVDA, B) = sup AS) BS) S Ω

2 where AS) is the probability that A assigns to subset S, and BS) is the same. Intuitively, this refers to the largest difference in probability that distributions A and B will assign to the same event. We can rewrite as } sup S Ω S A) S B) Suppose we define S ma = A) B)} Then, assuming finite distributions, it is clear that TVD can also be defined as A) B) = A) B) S ma S ma where the equality holds because A) = B) =. Given the equality of the two above, we can take the average and it will also be equal, so we have: TVDA, B) = A) B) = A) B) S ma S ma = A) B) + A) B) S ma S ma = A) B) where the signs and the placement of the absolute values follows from the definitions of S ma. Note, however, that this is precisely the -norm, so we have: TVDA, B) = A B Now, it s important to ask a basic question: why would we use this? There are some pros and some cons.. Pro: Symmetric. Unlike KL-divergence, we know that TVDA, B) = TVDB, A).. Con: T V DA, B ) can be equal to T V DA, B), where the former is defined as the distribution on two samples from each distribution. Eample: A = Berp) and B = Berp) T V DA, B) = T V DA, B ). This is bad because we would like to believe that two samples will give us more information about the true nature of the underlying distribution, but it does not always. This should not be interpreted to mean that for any A, B, this is true. It isn t. This is simply a statement that such distributions A, B eist.

3 3. Pro: T V DA, B) is equal to the sum of the false positive error and the false negative error. 4. Pro: lim n T V DA n, B n ) =. This is counterintuitive, given ), but tells us that we will get eponentially) more information about the true distributions with more sampling, even if having samples vs sample does not tell us anything new. We will prove 4. from the list above. The folloiwing claim reflects the fact that total variation distance goes to eponentially quickly. Claim. Suppose we have X, Y such that TVDX, Y ) = δ. We want to prove that for all k N: e kδ / TVDX k, Y k ) Proof. By the definition of TVD, there eists a subset S such that given samples X, y Y, we have: P S) P y S) = δ. We also define P y S) = p P S) = p+δ. Given k samples of X, we know that the probability of any sample being in S is p + δ. Thus, in epectation, p + δ)k of the samples will be in S. Similarly, given k samples of Y, pk will be in S in epectation. Now, we can apply the Chernoff bound to see that: P at least p + δ ) P at most p + δ k components of Y are in S) < e kδ ) k components of X are in S) < e kδ Let S be the set of k-tuples that contain more than p + δ ) k components of S. Then, we can bound: TVDX k, Y k ) P X k S ) P Y k S ) > e kδ ) e kδ = e kδ which gives us the desired result. Pinsker s Inequality Pinsker s Inequality states that: DP Q) ln P Q 3

4 where DP Q) is the KL Divergence of P and Q, defined as: DP Q) = ) p) p) log q) We can rewrite Pinsker s Inequality: DP Q) ln P Q TVDP, Q) ln) DP Q) Proof: We re going to start with the case where P and Q are Bernoulli. This can then be used to prove any other case. Define P and Q as: w.p. p w.p. q P = Q = 0 w.p. p 0 w.p. q We assume without loss of generality that p q. We can write out the KL-divergence and TVD eplicitly: Dp q) = p log p ) p q + p) log q TVDp, q) = p, p) q, q) = p q) We can define fp, q) = p log p ) p q + p) log p q)) q ln We can take the derivative of this function: δfp, q) δq = p q ln ) q q) 4 Note that p q is always positive and ln is positive. Moreover, q q) for all q. 4 Thus, the term inside the parentheses is positive, and the negative in front of the epression makes the derivative 0, and equal to 0 when p = q. From this, we can conclude that for p > q, the function must be positive, as it is zero at p = q and decreasing. Thus: fp, q) = p log p ) p q + p) log q ln p q)) 0 p log p p q + p) log q DP Q) = ln) P Q ) p q)) ln) as desired. Moreover, we can prove the non-binary case via a reduction to the binary case. 4

5 Applications of Pinsker s Inequality to Coin Tossing Note that DP m Q m ) = mdp Q). We can prove this using the chain rule of KL Divergence: DP X, Y ) QX, Y )) =,y =,y = p, y) log p, y) q, y) p)py ) log p)py ) q)qy ) p) log p) q) py ) + y p) y = DP Q ) + p)dp y q y X = ) = DP Q ) + DP y Q y X) = DP Q ) + DP y Q y ) = DP Q) py ) log py ) qy ) where DP y Q y X) = DP y Q y ) follows from X being independent of Y. Now, we can iteratively apply this procedure to determine that DP m Q m ) = mdp Q) as desired. Consider the following set-up for a coin tossing problem. Let = Heads, 0 = Tails, and define P and Q as: P = w.p. ɛ w.p. 0 w.p. + ɛ Q = 0 w.p. Let =,,..., m ) be a number of coin flips. We can map to either P or Q as a prediction on which distribution they came from. encoding P as 0 and Q as. This gives us a function A: A :,..., m ) 0, } We want to find the value of m i.e. the number of samples) we need to ensure that A predicts between P and Q with > 90% probability. Eplicitly, we want to find m such that: P P m[a) = 0] 9 0 and P Q m[a) = ] 9 0 Equivalently, when taking the epectation over all m-tuples in P m and Q m, we want: E P m[a)] 0 and E Q m[a)] 9 0 = E Q m[ax)] E P m[ax)] 8 0 5

6 Lemma. P, Q discrete on U, then given f : U [0, B], E P [f)] E Q[f)] B P Q Proof. We can rewrite the left using epected value, and the law of the unconscious statistician LOTUS) : E P [f)] E Q[f)] = p)f) q)f) = f) p) q)) = p) q)) f) B ) + B p) q)) p) q) f) B B P Q Now, we can use this lemma: let P = P m, Q = Q m, f = A, so we have P m Q m E Q max) E P ma) = P m Q m Now, using Pinsker s Lemma, we have that: m DP Q) = DP m Q m ) ln) ) 8 = m 5 ln) DP Q) Thus, it remains to bound DP Q): ) DP Q) = ɛ log ɛ ) ) + + ɛ log + ɛ ) = )) + ɛ log ɛ) + ɛ) + ɛ log ɛ ɛ ln ln + 4ɛ ) ɛ 4ɛ ln ɛ 8 0 = 8 5 ) of the unconscious statistician 6

7 where the last inequality uses the fact that ln + ) e. Now, if we assume that ɛ < 4, we can write: DP Q) 8ɛ ln Finally, combining this with the above inequality, we have a bound on m: m ln) DP Q) ) ɛ which can be shown to be upto constants by the Chernoff bound. 7

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose