Lecture 7: October 18, 2017


Information and Coding Theory, Autumn 2017
Lecturer: Madhur Tulsiani
Lecture 7: October 18, 2017

1 Binary hypothesis testing

In this lecture, we apply the tools developed in the past few lectures to understand the problem of distinguishing between two distributions (special cases of which have been discussed in the previous lectures). This problem is also known as hypothesis testing. Suppose we have two distributions $P_0$ and $P_1$ on a finite universe $U$. The environment chooses one of the two distributions and generates the data, which consists of a sequence $x \in U^n$ chosen either from $P_0^n$ or $P_1^n$. The true distribution is unknown to us, but we are guaranteed that once $P_0$ or $P_1$ is chosen, all samples in the sequence $x$ are sampled independently from the chosen distribution. The goal is to distinguish between the following two hypotheses:

$H_0$: The true distribution is $P_0$.
$H_1$: The true distribution is $P_1$.

Sometimes $H_0$ is also referred to as the null (default) hypothesis. We will consider (deterministic) tests $T : U^n \to \{0, 1\}$, which take the sequence of samples $x$ as input and select one of the hypotheses. There are two types of errors we will be concerned with:
$$\alpha(T) := \mathbb{P}_{x \sim P_0^n}[T(x) = 1] \quad \text{(False Positive)}, \qquad \beta(T) := \mathbb{P}_{x \sim P_1^n}[T(x) = 0] \quad \text{(False Negative)}.$$

The following claim is easy to prove based on the properties of total-variation distance considered earlier.

Claim 1.1. $\min_T \{\alpha(T) + \beta(T)\} = 1 - \delta_{TV}(P_0^n, P_1^n)$.

Recall that the optimal test achieving the minimum in the above claim is of the form
$$T(x) = \begin{cases} 1 & \text{if } P_1^n(x) \geq P_0^n(x) \\ 0 & \text{if } P_1^n(x) < P_0^n(x). \end{cases}$$
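The following is a small numerical sketch (my own illustration, not part of the notes) of Claim 1.1 for a single sample, i.e., $n = 1$: on a toy universe with two arbitrarily chosen distributions, the pointwise-optimal test $T(x) = \mathbb{1}[P_1(x) \geq P_0(x)]$ achieves exactly $\alpha + \beta = 1 - \delta_{TV}(P_0, P_1)$.

```python
import numpy as np

# Toy universe U = {0, 1, 2, 3} with two arbitrary distributions (illustrative choice).
P0 = np.array([0.50, 0.25, 0.15, 0.10])
P1 = np.array([0.10, 0.20, 0.30, 0.40])

# Pointwise-optimal test from Claim 1.1: output 1 exactly when P1(x) >= P0(x).
T = (P1 >= P0).astype(int)

alpha = P0[T == 1].sum()           # false positive: x ~ P0 but T(x) = 1
beta = P1[T == 0].sum()            # false negative: x ~ P1 but T(x) = 0
tv = 0.5 * np.abs(P0 - P1).sum()   # total-variation distance

print(f"alpha + beta        = {alpha + beta:.4f}")
print(f"1 - delta_TV(P0,P1) = {1 - tv:.4f}")   # matches the line above
```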

One may ask why we should only consider the optimal tests for minimizing the sum $\alpha(T) + \beta(T)$. We may care more about a false positive than a false negative, and may want to minimize a weighted sum (or some other monotone function) of the errors. The following lemma shows that all optimal tests should be of the form above, making a decision only based on the likelihood ratio $P_1^n(x)/P_0^n(x)$.

Lemma 1.2 (Neyman-Pearson lemma). Let $T$ be a test of the form
$$T(x) = \begin{cases} 1 & \text{if } P_1^n(x)/P_0^n(x) \geq \tau \\ 0 & \text{if } P_1^n(x)/P_0^n(x) < \tau, \end{cases}$$
for some constant $\tau \geq 0$. Let $T'$ be any other test. Then, $\alpha(T') \geq \alpha(T)$ or $\beta(T') \geq \beta(T)$.

Proof: The proof follows simply from the observation that for all $x \in U^n$,
$$\left( T(x) - T'(x) \right) \cdot \left( P_1^n(x) - \tau \cdot P_0^n(x) \right) \geq 0.$$
This is true because if $P_1^n(x) \geq \tau \cdot P_0^n(x)$, then $T(x) = 1$ and the first quantity is nonnegative. Similarly, when $P_1^n(x) - \tau \cdot P_0^n(x)$ is negative, $T(x) = 0$ and $T(x) - T'(x) \leq 0$. Summing over all $x \in U^n$ on both sides gives
$$\mathbb{E}_{x \sim P_1^n}\left[ T(x) - T'(x) \right] - \tau \cdot \mathbb{E}_{x \sim P_0^n}\left[ T(x) - T'(x) \right] \geq 0$$
$$\Rightarrow \quad \left( (1 - \beta(T)) - (1 - \beta(T')) \right) - \tau \cdot \left( \alpha(T) - \alpha(T') \right) \geq 0$$
$$\Rightarrow \quad \left( \beta(T') - \beta(T) \right) - \tau \cdot \left( \alpha(T) - \alpha(T') \right) \geq 0.$$
Thus, $\alpha(T) - \alpha(T') \geq 0$ implies $\beta(T') - \beta(T) \geq 0$.

We now discuss how to analyze the error probabilities for the optimal tests as characterized by the Neyman-Pearson lemma. As before, let $P_x$ denote the type (empirical distribution on $U$) of the sequence $x$. Check that the test $T(x)$ considered above can be written in the following form:
$$\frac{P_1^n(x)}{P_0^n(x)} \geq \tau \quad \iff \quad D(P_x \| P_0) - D(P_x \| P_1) \geq \frac{1}{n} \log \tau.$$
We define the following sets of probability distributions:
$$\Pi := \left\{ P : D(P \| P_0) - D(P \| P_1) \geq \frac{1}{n} \log \tau \right\}, \qquad \Pi^c := \left\{ P : D(P \| P_0) - D(P \| P_1) < \frac{1}{n} \log \tau \right\}.$$
Check the following property of the sets $\Pi$ and $\Pi^c$.
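As a quick sanity check of the rewriting above (my own sketch, not from the notes), the following draws a sequence from $P_0^n$ and verifies numerically that the normalized log-likelihood ratio $\frac{1}{n} \log\left( P_1^n(x)/P_0^n(x) \right)$ equals $D(P_x \| P_0) - D(P_x \| P_1)$, where $P_x$ is the empirical type. The distributions and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary full-support distributions on U = {0, 1, 2, 3} and a sample size n.
P0 = np.array([0.50, 0.25, 0.15, 0.10])
P1 = np.array([0.10, 0.20, 0.30, 0.40])
n = 200

x = rng.choice(len(P0), size=n, p=P0)        # a sequence drawn from P0^n
Px = np.bincount(x, minlength=len(P0)) / n   # empirical type of x

def kl(P, Q):
    """KL divergence D(P || Q) in bits, with the convention 0 * log(0/q) = 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

# (1/n) * log( P1^n(x) / P0^n(x) ), computed sample by sample.
llr = np.mean(np.log2(P1[x] / P0[x]))

print(f"(1/n) log-likelihood ratio : {llr:.6f}")
print(f"D(Px||P0) - D(Px||P1)      : {kl(Px, P0) - kl(Px, P1):.6f}")  # should agree
```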

Exercise.3 Check that both the sets Π ad Π c are covex (ad are i fact defied by liear iequalities i the distributios P). Also, check that Π is a closed set. We kow from Saov s theorem that α(t) = β(t) = P [P x Π] 2 D(P 0 P 0) 0 P [P x Π c ] 2 D(P P ), where P 0 = arg mi P Π {D(P P 0 )}. Also, sice Π c is ot a closed set, we defie P with respect to the closure of Π c of Π c i.e., P = arg mi P Π c {D(P P )}. We will see later how to compute the distributios which miimize the KL-divergece (kow as I-projectios) as i the bouds above. The distributios P0 ad P i the above bouds tur out to be of the form P 0 (x) = P (x) = P λ 0 y U P λ 0 (x) P λ (x) (y) P λ (y), where λ is the solutio to a optimizatio problem. While the above aalysis gives the optimal bouds for optimal all tests characterized by the Neyma-Pearso lemma, the boud we will use the most is the lower boud i terms of the total variatio distace i.e., mi T {α(t) + β(t)} δ TV (P 0, P ). We will ow develop such a boud for the case of multiple hypotheses. 2 Fao s iequality ad multiple hypothesis testig Fao s iequality is cocered with Markov chais, which we saw before i the cotext of data processig iequality. We will deote the Markov chai as Z Y Ẑ. I the cotext of hypothesis testig, we ca thik of Z as the choice of a ukow hypothesis from some fiite set (hypothesis class) U Z. We thik of Y as the data geerated from this hypothesis, say a sequece x of idepedet samples. Fially, we thik of Ẑ as a guess for Z, which depeds oly o the data. Fao s iequality ] is cocered with the probability of error i the guess, defied as p e = P [Ẑ = Z. We have the followig statemet Lemma 2. (Fao s iequaity) Let Z Y Ẑ be a Markov chai, ad let p e = P [ Ẑ = Z ]. Let H(p e ) deote the biary etropy fuctio computed at p e. The, H(p e ) + p e log ( U Z ) H(Z Ẑ) H(Z Y). 3

Proof: We define a binary random variable $E$, which indicates an error, i.e.,
$$E := \begin{cases} 1 & \text{if } \hat{Z} \neq Z \\ 0 & \text{if } \hat{Z} = Z. \end{cases}$$
The bound in the inequality then follows from considering the entropy $H(Z, E \mid \hat{Z})$. On one hand,
$$H(Z, E \mid \hat{Z}) = H(Z \mid \hat{Z}) + H(E \mid Z, \hat{Z}) = H(Z \mid \hat{Z}),$$
since $H(E \mid Z, \hat{Z}) = 0$ (why?). Another way of computing this entropy is
$$H(Z, E \mid \hat{Z}) = H(E \mid \hat{Z}) + H(Z \mid E, \hat{Z}) = H(E \mid \hat{Z}) + p_e \cdot H(Z \mid E = 1, \hat{Z}) + (1 - p_e) \cdot H(Z \mid E = 0, \hat{Z})$$
$$\leq H(E) + p_e \cdot H(Z \mid E = 1, \hat{Z}) \leq H(p_e) + p_e \cdot \log(|U_Z| - 1).$$
Comparing the two expressions then proves the claim.

We can use Fano's inequality to derive a convenient way of obtaining a lower bound for testing multiple hypotheses. However, we first need the following property of KL-divergence.

Exercise 2.2. Prove that KL-divergence is (strictly) convex in both its arguments, i.e., for all $\alpha \in (0, 1)$ and all $P_1 \neq P_2$, $Q_1 \neq Q_2$,
$$D(\alpha P_1 + (1 - \alpha) P_2 \| Q) < \alpha \cdot D(P_1 \| Q) + (1 - \alpha) \cdot D(P_2 \| Q),$$
$$D(P \| \alpha Q_1 + (1 - \alpha) Q_2) < \alpha \cdot D(P \| Q_1) + (1 - \alpha) \cdot D(P \| Q_2).$$
In fact, KL-divergence is jointly convex in both its arguments, but we will not need this property.

Let $\{P_v\}_{v \in V}$ be a collection of hypotheses. Let the environment choose one of the hypotheses uniformly at random (denoted by a random variable $V$) and let $x \sim P_v^n$ be a sequence of $n$ independent samples from the chosen distribution $P_v$ (denoted by the random variable $X$). We will now bound the probability of error for a classifier $\hat{V}$ for $V$. Note that $V \to X \to \hat{V}$ is a Markov chain.

Proposition 2.3. Let $V \to X \to \hat{V}$ be the Markov chain as above. Then,
$$p_e = \mathbb{P}\left[ \hat{V} \neq V \right] \geq 1 - \frac{n \cdot \mathbb{E}_{v_1, v_2 \sim V}\left[ D(P_{v_1} \| P_{v_2}) \right] + 1}{\log |V|}.$$
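Before turning to the proof, here is a small numerical sketch of the bound in Proposition 2.3 (my own illustration; the hypothesis family and the values of $n$ are arbitrary choices, and all logarithms are taken in bits).

```python
import numpy as np

def kl(P, Q):
    """KL divergence D(P || Q) in bits."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

# Arbitrary family of |V| = 4 hypotheses on a 3-element universe (illustrative choice).
family = np.array([
    [0.60, 0.25, 0.15],
    [0.50, 0.30, 0.20],
    [0.40, 0.35, 0.25],
    [0.30, 0.40, 0.30],
])
V = len(family)

# E_{v1, v2 ~ V}[ D(P_v1 || P_v2) ] over independent uniform v1, v2 (diagonal terms are 0).
avg_kl = np.mean([kl(family[i], family[j]) for i in range(V) for j in range(V)])

# Lower bound of Proposition 2.3 for a few sample sizes; it becomes vacuous once negative.
for n in [1, 5, 10, 20]:
    bound = 1 - (n * avg_kl + 1) / np.log2(V)
    print(f"n = {n:2d}: p_e >= {max(bound, 0.0):.4f}")
```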

Proof: From Fano's inequality, we have that
$$1 + p_e \cdot \log |V| \geq H(p_e) + p_e \cdot \log |V| \geq H(V \mid X) = \log |V| - I(V; X),$$
where the last equality uses that $V$ is uniformly distributed over the hypothesis class. We can now analyze the mutual information between $V$ and $X$ using the equivalent expression in terms of KL-divergence:
$$I(V; X) = D(P(V, X) \| P(V) P(X)) = D(P(V) \| P(V)) + \mathbb{E}_{v \sim V}\left[ D(P(X \mid V = v) \| P(X)) \right] = \mathbb{E}_{v \sim V}\left[ D(P_v^n \| \overline{P}) \right],$$
where $\overline{P} = \mathbb{E}_{v \sim V}[P_v^n]$ denotes the marginal distribution of $X$. Using the convexity of KL-divergence in the second argument, Jensen's inequality, and the chain rule for KL-divergence (which gives $D(P_{v_1}^n \| P_{v_2}^n) = n \cdot D(P_{v_1} \| P_{v_2})$ for product distributions), we get
$$\mathbb{E}_{v \sim V}\left[ D(P_v^n \| \overline{P}) \right] \leq \mathbb{E}_{v_1, v_2 \sim V}\left[ D(P_{v_1}^n \| P_{v_2}^n) \right] = n \cdot \mathbb{E}_{v_1, v_2 \sim V}\left[ D(P_{v_1} \| P_{v_2}) \right].$$
Combining the bounds gives
$$1 + p_e \cdot \log |V| \geq \log |V| - n \cdot \mathbb{E}_{v_1, v_2 \sim V}\left[ D(P_{v_1} \| P_{v_2}) \right],$$
which proves the claim.
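The key step above, $I(V; X) \leq n \cdot \mathbb{E}_{v_1, v_2 \sim V}[D(P_{v_1} \| P_{v_2})]$, can be checked by brute force for a tiny example. The sketch below is my own; the family, universe size, and $n$ are arbitrary small choices so that all $|U|^n$ sequences can be enumerated and $I(V; X)$ computed exactly.

```python
import itertools
import numpy as np

def kl(P, Q):
    """KL divergence D(P || Q) in bits."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

# Small illustrative family of hypotheses on a 3-element universe, and a small n,
# so that all |U|^n sequences can be enumerated exactly.
family = np.array([
    [0.60, 0.25, 0.15],
    [0.30, 0.40, 0.30],
    [0.20, 0.30, 0.50],
])
V, U = family.shape
n = 4

sequences = list(itertools.product(range(U), repeat=n))

# P_v^n(x) for every hypothesis v and every length-n sequence x.
seq_probs = np.array([[np.prod(P[list(x)]) for x in sequences] for P in family])

# Marginal of X under the uniform prior on V, and the exact mutual information
# I(V; X) = E_v[ D(P_v^n || Pbar) ].
P_bar = seq_probs.mean(axis=0)
I_VX = np.mean([kl(seq_probs[v], P_bar) for v in range(V)])

avg_pairwise = np.mean([kl(family[i], family[j]) for i in range(V) for j in range(V)])

print(f"I(V; X)                = {I_VX:.4f} bits")
print(f"n * E[D(P_v1 || P_v2)] = {n * avg_pairwise:.4f} bits")   # should dominate I(V; X)
```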