4.1 Data processing inequality



ECE598: Information-theoretic methods in high-dimensional statistics, Spring 2016

Lecture 4: Total variation/Inequalities between f-divergences

Lecturer: Yihong Wu. Scribe: Matthew Tsao, Feb 8, 2016 [Ed. Mar 22]

Recall the definition of $f$-divergences from last time. Suppose a function $f : \mathbb{R}_+ \to \mathbb{R}$ satisfies the following properties:

• $f$ is a convex function.
• $f(1) = 0$.
• $f$ is strictly convex at $x = 1$, i.e., for all $x \neq y$ and $\alpha \in (0,1)$ such that $\alpha x + \bar\alpha y = 1$ (where $\bar\alpha = 1 - \alpha$), the inequality $f(1) < \alpha f(x) + \bar\alpha f(y)$ is strict.

Then the functional that maps pairs of distributions to $\mathbb{R}_+$ defined by

\[ D_f(P\|Q) \triangleq \mathbb{E}_Q\left[ f\left( \frac{dP}{dQ} \right) \right] \]

is an $f$-divergence.

4.1 Data processing inequality

Theorem 4.1. Consider a channel that produces $Y$ given $X$ based on the law $P_{Y|X}$ (shown below). If $P_Y$ is the distribution of $Y$ when $X$ is generated by $P_X$ and $Q_Y$ is the distribution of $Y$ when $X$ is generated by $Q_X$, then for any $f$-divergence $D_f$,

\[ D_f(P_Y\|Q_Y) \le D_f(P_X\|Q_X). \]

[Diagram: the channel $P_{Y|X}$ maps the input laws $P_X$ and $Q_X$ to the output laws $P_Y$ and $Q_Y$.]

One interpretation of this result is that processing the observation $x$ makes it more difficult to determine whether it came from $P_X$ or $Q_X$.
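As a quick numerical sanity check of Theorem 4.1, the sketch below evaluates a few $f$-divergences before and after a channel on finite alphabets. The input laws, the channel matrix, and the helper names `f_divergence` and `push_through` are illustrative assumptions, not part of the lecture; the code also assumes every mass of $Q$ is strictly positive.

```python
import math

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_x q(x) f(p(x)/q(x)); assumes q(x) > 0 for all x."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def push_through(channel, p):
    """Law of Y when X ~ p and channel[x][y] = P(Y = y | X = x)."""
    n_out = len(channel[0])
    return [sum(p[x] * channel[x][y] for x in range(len(p))) for y in range(n_out)]

def kl(x):
    # f(x) = x log x gives the KL divergence (in nats).
    return x * math.log(x) if x > 0 else 0.0

def tv(x):
    # f(x) = |x - 1| / 2 gives total variation.
    return 0.5 * abs(x - 1)

def hellinger(x):
    # f(x) = (1 - sqrt(x))^2 gives the squared Hellinger distance.
    return (1 - math.sqrt(x)) ** 2

# Hypothetical input laws and channel, chosen only for illustration.
p_x = [0.5, 0.3, 0.2]
q_x = [0.2, 0.3, 0.5]
channel = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

p_y = push_through(channel, p_x)
q_y = push_through(channel, q_x)

for name, f in [("KL", kl), ("TV", tv), ("H^2", hellinger)]:
    before = f_divergence(f, p_x, q_x)
    after = f_divergence(f, p_y, q_y)
    print(f"{name}: D_f(P_X||Q_X) = {before:.4f} >= D_f(P_Y||Q_Y) = {after:.4f}")
```

A single example of course proves nothing; it only illustrates the direction of the inequality.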

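Proof.

\begin{align*}
D_f(P_X\|Q_X) &= \mathbb{E}_{Q_X}\left[ f\left( \frac{P_X}{Q_X} \right) \right] \overset{(a)}{=} \mathbb{E}_{Q_{XY}}\left[ f\left( \frac{P_{XY}}{Q_{XY}} \right) \right] \\
&= \mathbb{E}_{Q_Y} \mathbb{E}_{Q_{X|Y}}\left[ f\left( \frac{P_{XY}}{Q_{XY}} \right) \right] \\
&\ge \mathbb{E}_{Q_Y}\left[ f\left( \mathbb{E}_{Q_{X|Y}}\left[ \frac{P_{XY}}{Q_{XY}} \right] \right) \right] \qquad \text{(Jensen's inequality)} \\
&\overset{(b)}{=} \mathbb{E}_{Q_Y}\left[ f\left( \frac{P_Y}{Q_Y} \right) \right] = D_f(P_Y\|Q_Y).
\end{align*}

Note that (a) means $D_f(P_X\|Q_X) = D_f(P_{XY}\|Q_{XY})$, which holds because both joint laws use the same channel $P_{Y|X}$, so that $P_{XY}/Q_{XY} = P_X/Q_X$; (b) can be alternatively understood by noting that $\mathbb{E}_Q\left[ \frac{P_{XY}}{Q_{XY}} \,\middle|\, Y \right]$ is precisely the relative density $\frac{P_Y}{Q_Y}$, by checking the definition of change of measure, i.e.,

\[ \mathbb{E}_P[g(Y)] = \mathbb{E}_Q\left[ g(Y) \frac{P_{XY}}{Q_{XY}} \right] = \mathbb{E}_Q\left[ g(Y)\, \mathbb{E}_Q\left[ \frac{P_{XY}}{Q_{XY}} \,\middle|\, Y \right] \right] \quad \text{for any } g. \]

Remark 4.1. $P_{Y|X}$ can be a deterministic map so that $Y = f(X)$. More specifically, if $f(X) = 1_E(X)$ for any event $E$, then $Y$ is Bernoulli with parameter $P(E)$ or $Q(E)$ and the data processing inequality gives

\[ D_f(P_X\|Q_X) \ge D_f(\mathrm{Ber}(P(E))\,\|\,\mathrm{Ber}(Q(E))). \tag{4.1} \]

This is how we prove the converse direction of large deviation.

Example 4.1. If $X = (X_1, X_2)$ and $f(X) = X_1$, then we have

\[ D_f(P_{X_1 X_2}\|Q_{X_1 X_2}) \ge D_f(P_{X_1}\|Q_{X_1}). \]

As seen from the proof of Theorem 4.1, this is in fact equivalent to the data processing inequality.

Remark 4.2. If $D_f(P\|Q)$ is an $f$-divergence, then $D_{\tilde f}(P\|Q)$ with $\tilde f(x) := x f(1/x)$ is also an $f$-divergence and $D_{\tilde f}(P\|Q) = D_f(Q\|P)$. Example: if $D_f(P\|Q) = D(P\|Q)$, then $D_{\tilde f}(P\|Q) = D(Q\|P)$.

Proof. First we verify that $\tilde f$ has all three properties required for $D_{\tilde f}$ to be an $f$-divergence.

• For $x, y \in \mathbb{R}_+$ and $\alpha \in [0,1]$ define $c = \alpha x + \bar\alpha y$, so that $\frac{\alpha x}{c} + \frac{\bar\alpha y}{c} = 1$. Observe that

\[ \tilde f(\alpha x + \bar\alpha y) = c f\left( \frac{1}{c} \right) = c f\left( \frac{\alpha x}{c} \cdot \frac{1}{x} + \frac{\bar\alpha y}{c} \cdot \frac{1}{y} \right) \le c\, \frac{\alpha x}{c} f\left( \frac{1}{x} \right) + c\, \frac{\bar\alpha y}{c} f\left( \frac{1}{y} \right) = \alpha \tilde f(x) + \bar\alpha \tilde f(y). \]

Thus $\tilde f : \mathbb{R}_+ \to \mathbb{R}$ is a convex function.

• $\tilde f(1) = f(1) = 0$.

• For $x \neq y$ in $\mathbb{R}_+$ and $\alpha \in (0,1)$, if $\alpha x + \bar\alpha y = 1$, then by strict convexity of $f$ at $1$,

\[ 0 = \tilde f(1) = f(1) = f\left( \alpha x \cdot \frac{1}{x} + \bar\alpha y \cdot \frac{1}{y} \right) < \alpha x f\left( \frac{1}{x} \right) + \bar\alpha y f\left( \frac{1}{y} \right) = \alpha \tilde f(x) + \bar\alpha \tilde f(y). \]

So $\tilde f$ is strictly convex at $1$ and thus $D_{\tilde f}$ is a valid $f$-divergence.

Finally,

\[ D_f(P\|Q) = \mathbb{E}_Q\left[ f\left( \frac{P}{Q} \right) \right] = \mathbb{E}_P\left[ \frac{Q}{P} f\left( \frac{P}{Q} \right) \right] = \mathbb{E}_P\left[ \tilde f\left( \frac{Q}{P} \right) \right] = D_{\tilde f}(Q\|P). \]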
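The identity in Remark 4.2 can be checked numerically for the KL divergence, where $f(x) = x \log x$ and hence $\tilde f(x) = x f(1/x) = -\log x$. The following is a minimal sketch on a finite alphabet with strictly positive masses; the distributions and the helper name `conjugate` are assumptions made for illustration.

```python
import math

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_x q(x) f(p(x)/q(x)); assumes all masses are positive."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def conjugate(f):
    """Return the function x -> x * f(1/x) from Remark 4.2."""
    return lambda x: x * f(1.0 / x)

def kl_f(x):
    # Generator of the KL divergence (natural logarithm).
    return x * math.log(x)

# Hypothetical distributions on a three-point alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

lhs = f_divergence(conjugate(kl_f), p, q)  # D_{f~}(P||Q)
rhs = f_divergence(kl_f, q, p)             # D_f(Q||P) = D(Q||P)
print(f"D_f~(P||Q) = {lhs:.6f}, D(Q||P) = {rhs:.6f}")
```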

4.2 Total variation and hypothesis testing

Recall that the choice of $f(x) = \frac{1}{2}|x - 1|$ gives rise to the total variation distance,

\[ D_f(P\|Q) = \frac{1}{2} \mathbb{E}_Q\left| \frac{P}{Q} - 1 \right| = \frac{1}{2} \int |P - Q|, \]

where $\int |P - Q|$ is a shorthand understood in the usual sense, namely, $\int \left| \frac{dP}{d\mu} - \frac{dQ}{d\mu} \right| d\mu$ where $\mu$ is a dominating measure, e.g., $\mu = P + Q$, and the value of the integral does not depend on $\mu$. We will denote total variation by $d_{TV}(P, Q)$ or $TV(P, Q)$.

Theorem 4.2. The following definitions for total variation are equivalent:

1. \[ d_{TV}(P, Q) = \sup_E \, P(E) - Q(E), \tag{4.2} \] where the supremum is over all measurable sets $E$.

2. $1 - d_{TV}(P, Q)$ is the minimal sum of Type-I and Type-II error probabilities for testing $P$ versus $Q$, and \[ d_{TV}(P, Q) = 1 - \int P \wedge Q. \tag{4.3} \]

3. Provided the diagonal $\{(x, x) : x \in \mathcal{X}\}$ is measurable, \[ d_{TV}(P, Q) = \inf_{P_{XY} :\, P_X = P,\, P_Y = Q} P[X \neq Y]. \tag{4.4} \]

4. Let $\mathcal{F} = \{ f : \mathcal{X} \to \mathbb{R},\ \|f\|_\infty \le 1 \}$. Then \[ d_{TV}(P, Q) = \frac{1}{2} \sup_{f \in \mathcal{F}} \, \mathbb{E}_P f(X) - \mathbb{E}_Q f(X). \tag{4.5} \]

Notation: throughout the course, $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. Here again $\int P \wedge Q$ is a shorthand understood in the usual sense, namely, $\int \left( \frac{dP}{d\mu} \wedge \frac{dQ}{d\mu} \right) d\mu$ where $\mu$ is any dominating measure.

Remark 4.3 (Variational representation). Equations (4.2) and (4.5) provide sup-representations of total variation, which will be extended to general $f$-divergences later. Note that (4.4) is an inf-representation of total variation in terms of couplings, meaning total variation is the Wasserstein distance with respect to Hamming distance. The benefit of variational representations is that choosing a particular coupling in (4.4) gives an upper bound on $d_{TV}(P, Q)$, and choosing a particular $f$ in (4.5) yields a lower bound.

Remark 4.4 (Operational meaning). In the binary hypothesis test for $H_0 : X \sim P$ or $H_1 : X \sim Q$, Theorem 4.2 shows that $1 - d_{TV}(P, Q)$ is the sum of false alarm and missed detection probabilities. This can be seen either from (4.2), where $E$ is the decision region for deciding $P$, or from (4.3), since the optimal test (for average probability of error) is the likelihood ratio test $\frac{dP}{dQ} > 1$. In particular,

• $d_{TV}(P, Q) = 1 \Leftrightarrow P \perp Q$: the probability of error is zero since essentially $P$ and $Q$ have disjoint supports.
• $d_{TV}(P, Q) = 0 \Leftrightarrow P = Q$: the minimal sum of error probabilities is one, meaning the best thing to do is to flip a coin.
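On a finite alphabet the equivalent characterizations in Theorem 4.2 can all be evaluated directly. The sketch below computes total variation through (4.2) by brute force over events, through (4.3), through (4.5) with the test function $f = \mathrm{sign}(p - q)$, and through (4.4) using one standard construction of a maximal coupling. The distributions and the helper name `maximal_coupling` are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical distributions on a three-point alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
n = len(p)

# Half the L1 distance between the two mass functions.
tv_l1 = 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# (4.2): supremum of P(E) - Q(E), by brute force over all subsets E.
tv_sup = max(
    sum(p[i] - q[i] for i in event)
    for r in range(n + 1)
    for event in combinations(range(n), r)
)

# (4.3): one minus the overlap sum_x min(p(x), q(x)).
tv_overlap = 1.0 - sum(min(a, b) for a, b in zip(p, q))

# (4.5): half of E_P f - E_Q f for the bounded test function f = sign(p - q).
f_star = [1.0 if a >= b else -1.0 for a, b in zip(p, q)]
tv_dual = 0.5 * sum(f * (a - b) for f, a, b in zip(f_star, p, q))

def maximal_coupling(p, q):
    """Joint pmf with marginals p and q that keeps X = Y with probability sum min(p, q)."""
    size = len(p)
    overlap = [min(a, b) for a, b in zip(p, q)]
    d = 1.0 - sum(overlap)
    joint = [[0.0] * size for _ in range(size)]
    for i in range(size):
        joint[i][i] = overlap[i]
    if d > 0:
        # Spread the leftover mass of p and q independently off the diagonal.
        for i in range(size):
            for j in range(size):
                joint[i][j] += (p[i] - overlap[i]) * (q[j] - overlap[j]) / d
    return joint

# (4.4): P[X != Y] under the coupling above.
joint = maximal_coupling(p, q)
tv_coupling = sum(joint[i][j] for i in range(n) for j in range(n) if i != j)

print(tv_l1, tv_sup, tv_overlap, tv_dual, tv_coupling)
```

In line with Remark 4.3, any other coupling would give a value of $P[X \neq Y]$ at least as large, and any other bounded $f$ a value no larger.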

4.3 Motivating example: Hypothesis testing with multiple samples

Observation:

1. Not all $f$-divergences are born equal. Different $f$-divergences have different operational significance. For example, as we saw in Section 4.2, testing two hypotheses boils down to total variation, which determines the fundamental limit (minimum average probability of error). Later in the course we will encounter another $f$-divergence, $L(P\|Q) = \int \frac{(P - Q)^2}{P + Q}$, which is useful for estimation.

2. Some $f$-divergences are easier to evaluate than others. For example, for product distributions, Hellinger distance and $\chi^2$-divergence tensorize in the sense that they are easily expressible in terms of those of the one-dimensional marginals; however, computing the total variation between product measures is frequently difficult. Another example is that computing the $\chi^2$-divergence between a product distribution and a mixture of product distributions is convenient, which will become useful later in the course.

Therefore the punchline is that it is often fruitful to bound one $f$-divergence by another, and this sometimes leads to tight characterizations. In this section we consider a specific useful example to drive this point home. Then in the next section we develop inequalities between $f$-divergences systematically.

Consider a binary hypothesis test where data $X = (X_1, X_2, \ldots, X_n)$ are drawn i.i.d. from either $P$ or $Q$ and the goal is to test

\[ H_0 : X \sim P^{\otimes n} \quad \text{vs} \quad H_1 : X \sim Q^{\otimes n}. \]

As mentioned before, $1 - d_{TV}(P^{\otimes n}, Q^{\otimes n})$ gives the minimal sum of Type-I and Type-II error probabilities, achieved by the maximum likelihood test. By the data processing inequality, $d_{TV}(P^{\otimes m}, Q^{\otimes m}) \le d_{TV}(P^{\otimes n}, Q^{\otimes n})$ for $m < n$. From this we see that $d_{TV}(P^{\otimes n}, Q^{\otimes n})$ is an increasing sequence in $n$ (and bounded by 1 by definition) and hence converges. One would hope that as $n \to \infty$, $d_{TV}(P^{\otimes n}, Q^{\otimes n})$ converges to 1 and consequently, the probability of error in the hypothesis test converges to zero. It turns out that if the distributions $P, Q$ are independent of $n$, then large deviation theory gives

\[ d_{TV}(P^{\otimes n}, Q^{\otimes n}) = 1 - \exp(-n C(P, Q) + o(n)), \tag{4.6} \]

where the constant $C(P, Q) = -\log \inf_{0 \le \alpha \le 1} \int P^\alpha Q^{1 - \alpha}$ is the Chernoff information of $P, Q$.
It is clear from this that, provided $P \neq Q$, $d_{TV}(P^{\otimes n}, Q^{\otimes n}) \to 1$ as $n \to \infty$, and, in fact, exponentially fast.

However, as frequently encountered in high-dimensional problems, if the distributions $P = P_n$ and $Q = Q_n$ depend on $n$, then the large-deviation approach that leads to (4.6) is no longer valid. In such a situation, total variation is still relevant for hypothesis testing, but its behavior as $n \to \infty$ is not obvious or easy to compute. In this case, understanding how a more computationally tractable $f$-divergence is related to total variation may give insight on hypothesis testing without needing to directly compute the total variation. It turns out Hellinger distance is precisely suited for this task (see Theorem 4.3 below).

Recall that the squared Hellinger distance, $H^2(P, Q) = \mathbb{E}_Q\left[ \left( 1 - \sqrt{\frac{P}{Q}} \right)^2 \right]$, is an $f$-divergence with $f(x) = (1 - \sqrt{x})^2$, which provides a sandwich bound for total variation:

\[ 0 \le \frac{1}{2} H^2(P, Q) \le d_{TV}(P, Q) \le H(P, Q) \sqrt{1 - \frac{H^2(P, Q)}{4}} \le 1. \tag{4.7} \]
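For Bernoulli observations the total variation between $n$-fold products can still be computed exactly, because the number of ones is a sufficient statistic. The sketch below does this for fixed $P = \mathrm{Ber}(0.5)$ and $Q = \mathrm{Ber}(0.6)$ (an arbitrary choice), shows that the sequence is increasing in $n$, and prints the empirical exponent $-\frac{1}{n}\log(1 - d_{TV})$ next to a grid-search approximation of the Chernoff information. Since (4.6) is only an asymptotic statement with an $o(n)$ term, the two numbers should be expected to agree only roughly at moderate $n$. The function names are assumptions made for this example.

```python
import math

def tv_bernoulli_product(p, q, n):
    """Exact TV between Ber(p)^n and Ber(q)^n, summing over the number of ones k."""
    return 0.5 * sum(
        math.comb(n, k) * abs(p**k * (1 - p) ** (n - k) - q**k * (1 - q) ** (n - k))
        for k in range(n + 1)
    )

def chernoff_information(p, q, grid=10_000):
    """-log min over a in [0, 1] of sum_x p(x)^a q(x)^(1-a), via a simple grid search."""
    best = min(
        p**a * q ** (1 - a) + (1 - p) ** a * (1 - q) ** (1 - a)
        for a in (i / grid for i in range(grid + 1))
    )
    return -math.log(best)

p, q = 0.5, 0.6
tvs = [tv_bernoulli_product(p, q, n) for n in range(1, 201)]
c = chernoff_information(p, q)

for n in (1, 10, 50, 200):
    exponent = -math.log(1 - tvs[n - 1]) / n
    print(f"n={n:4d}  TV={tvs[n - 1]:.6f}  -log(1-TV)/n={exponent:.5f}  C(P,Q)~{c:.5f}")
```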

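The proof of the sandwich bound (4.7) will be explained in the next lecture. A few observations which are direct consequences of these inequalities:

• $H^2(P, Q) = 2$ if and only if $d_{TV}(P, Q) = 1$.
• $H^2(P, Q) = 0$ if and only if $d_{TV}(P, Q) = 0$.
• Hellinger consistency $\Leftrightarrow$ TV consistency, namely $H^2(P_n, Q_n) \to 0 \Leftrightarrow d_{TV}(P_n, Q_n) \to 0$.

Theorem 4.3. For any sequence of distributions $P_n$ and $Q_n$, as $n \to \infty$,

\begin{align*}
d_{TV}(P_n^{\otimes n}, Q_n^{\otimes n}) \to 0 &\iff H^2(P_n, Q_n) = o\left( \frac{1}{n} \right), \\
d_{TV}(P_n^{\otimes n}, Q_n^{\otimes n}) \to 1 &\iff H^2(P_n, Q_n) = \omega\left( \frac{1}{n} \right).
\end{align*}

Here, for positive sequences $\{a_n\}$, $\{b_n\}$, we say $a_n = \omega(b_n)$ if $b_n = o(a_n)$.

Proof. Because the observations $X = (X_1, X_2, \ldots, X_n)$ are i.i.d., the joint distribution factors:

\begin{align*}
H^2(P_n^{\otimes n}, Q_n^{\otimes n}) &= 2 - 2\, \mathbb{E}_{Q_n^{\otimes n}}\left[ \sqrt{ \prod_{i=1}^n \frac{P_n}{Q_n}(X_i) } \right] \\
&= 2 - 2 \prod_{i=1}^n \mathbb{E}_{Q_n}\left[ \sqrt{ \frac{P_n}{Q_n}(X_i) } \right] \qquad \text{(by independence)} \\
&= 2 - 2 \left( \mathbb{E}_{Q_n}\left[ \sqrt{ \frac{P_n}{Q_n} } \right] \right)^n \\
&= 2 - 2 \left( 1 - \frac{1}{2} H^2(P_n, Q_n) \right)^n.
\end{align*}

By the sandwich bound (4.7), $d_{TV}(P_n^{\otimes n}, Q_n^{\otimes n}) \to 0$ if and only if $H^2(P_n^{\otimes n}, Q_n^{\otimes n}) \to 0$, which happens precisely when $\left( 1 - \frac{1}{2} H^2(P_n, Q_n) \right)^n \to 1$, which happens when $H^2(P_n, Q_n) = o(\frac{1}{n})$. Similarly, $d_{TV}(P_n^{\otimes n}, Q_n^{\otimes n}) \to 1$ if and only if $H^2(P_n^{\otimes n}, Q_n^{\otimes n}) \to 2$, which happens precisely when $\left( 1 - \frac{1}{2} H^2(P_n, Q_n) \right)^n \to 0$, if and only if $H^2(P_n, Q_n) = \omega(\frac{1}{n})$.

Remark 4.5. The proof of Theorem 4.3 relies on two ingredients:

1. Sandwich bound (4.7).

2. Tensorization properties of Hellinger:
\[ H^2\left( \prod_{i=1}^n P_i, \prod_{i=1}^n Q_i \right) = 2 - 2 \prod_{i=1}^n \left( 1 - \frac{1}{2} H^2(P_i, Q_i) \right). \]

Note that there are other $f$-divergences that are also tensorizable, e.g., $\chi^2$-divergences:

\[ \chi^2\left( \prod_{i=1}^n P_i, \prod_{i=1}^n Q_i \right) = \prod_{i=1}^n \left( 1 + \chi^2(P_i, Q_i) \right) - 1; \tag{4.9} \]

however, no sandwich inequality like (4.7) exists for $d_{TV}$ and $\chi^2$ and hence there is no $\chi^2$-version of Theorem 4.3. Asserting the non-existence of such inequalities requires understanding the relationship between these two $f$-divergences.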
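The two tensorization identities in Remark 4.5 can be verified by brute force on small product alphabets, and the dichotomy of Theorem 4.3 can be illustrated with the sandwich bound (4.7). In the sketch below the marginals, the family $P_n = \mathrm{Ber}(\frac12 + \delta_n)$, $Q_n = \mathrm{Ber}(\frac12)$ with $\delta_n = 0.4\, n^{-a}$, and all function names are assumptions chosen for illustration. For this family $H^2(P_n, Q_n)$ is of order $\delta_n^2$, so $a = 1$ falls in the $o(1/n)$ regime and $a = 1/4$ in the $\omega(1/n)$ regime.

```python
import math
from itertools import product

def hellinger_sq(p, q):
    """Squared Hellinger distance sum_x (sqrt p(x) - sqrt q(x))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def chi_sq(p, q):
    """Chi-squared divergence sum_x (p(x) - q(x))^2 / q(x); assumes q(x) > 0."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def product_law(marginals):
    """Joint pmf of independent coordinates, listed in itertools.product order."""
    return [math.prod(probs) for probs in product(*marginals)]

# Hypothetical marginals with different alphabet sizes.
ps = [[0.5, 0.5], [0.7, 0.3], [0.2, 0.3, 0.5]]
qs = [[0.4, 0.6], [0.6, 0.4], [0.3, 0.3, 0.4]]
p_joint, q_joint = product_law(ps), product_law(qs)

h2_direct = hellinger_sq(p_joint, q_joint)
h2_formula = 2 - 2 * math.prod(1 - hellinger_sq(p, q) / 2 for p, q in zip(ps, qs))
chi_direct = chi_sq(p_joint, q_joint)
chi_formula = math.prod(1 + chi_sq(p, q) for p, q in zip(ps, qs)) - 1
print(h2_direct, h2_formula, chi_direct, chi_formula)

def tv_bounds_for_product(delta, n):
    """Bounds (4.7) on TV between n-fold products of Ber(1/2 + delta) and Ber(1/2)."""
    h2_single = hellinger_sq([0.5 + delta, 0.5 - delta], [0.5, 0.5])
    h2_product = 2 - 2 * (1 - h2_single / 2) ** n
    lower = h2_product / 2
    upper = math.sqrt(h2_product * (1 - h2_product / 4))
    return lower, upper

for n in (10, 1000, 100000):
    for a in (1.0, 0.25):
        lower, upper = tv_bounds_for_product(0.4 * n ** (-a), n)
        print(f"n={n:6d}  a={a:4.2f}  {lower:.6f} <= TV <= {upper:.6f}")
```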

4.4 Inequalities between f-divergences

We will discuss two methods for finding inequalities between $f$-divergences.

• ad hoc approach: case-by-case proof using results like Jensen's inequality, max $\ge$ mean $\ge$ min, Cauchy-Schwarz, etc.
• systematic approach: joint range of $f$-divergences.

Definition 4.1. The joint range between two $f$-divergences $D_f$ and $D_g$ is the range of the mapping $(P, Q) \mapsto (D_f(P\|Q), D_g(P\|Q))$, i.e., the set $\mathcal{R} \subset \mathbb{R}_+ \times \mathbb{R}_+$ where $(x, y) \in \mathcal{R}$ if there exist distributions $P, Q$ on some common measurable space such that $x = D_f(P\|Q)$ and $y = D_g(P\|Q)$.

[Figure: a green region illustrating a joint range; the axes are labeled $D_f$ and $D_g$.]

The green region in the above figure shows what a joint range between $D_f$ and $D_g$ might look like. By definition of $\mathcal{R}$, the lower boundary gives the sharpest lower bound of $D_f$ in terms of $D_g$, namely

\[ D_f(P\|Q) \ge V(D_g(P\|Q)), \quad \text{where } V(t) \triangleq \inf\{ D_f(P\|Q) : D_g(P\|Q) = t \}; \]

similarly, the upper boundary gives the best upper bound. As will be discussed in the next lecture, the sandwich bound (4.7) corresponds precisely to the lower and upper boundaries of the joint range of $H^2$ and $d_{TV}$, and is therefore not improvable. It is important to note, however, that $\mathcal{R}$ may be an unbounded region and some of the boundaries may not exist, meaning it is impossible to bound one by the other, such as $\chi^2$ versus $d_{TV}$.
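To see concretely why the joint range of $\chi^2$ and $d_{TV}$ is unbounded, one can exhibit a family along which total variation vanishes while $\chi^2$ blows up. The sketch below uses $P = \mathrm{Ber}(\epsilon)$ and $Q = \mathrm{Ber}(\epsilon^3)$; this family is an illustrative choice, not taken from the lecture. It rules out any upper bound on $\chi^2$ that depends only on $d_{TV}$. In the other direction a bound does exist: Cauchy-Schwarz gives $\chi^2(P\|Q) \ge 4 d_{TV}^2(P, Q)$, which the code also checks.

```python
def total_variation(p, q):
    """Half the L1 distance between two pmfs on the same finite alphabet."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def chi_sq(p, q):
    """Chi-squared divergence sum_x (p(x) - q(x))^2 / q(x); assumes q(x) > 0."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

# Hypothetical family: P = Ber(eps), Q = Ber(eps^3).
results = []
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    p = [eps, 1 - eps]
    q = [eps**3, 1 - eps**3]
    results.append((eps, total_variation(p, q), chi_sq(p, q)))

for eps, tv, chi in results:
    print(f"eps={eps:.0e}  TV={tv:.6f}  chi^2={chi:.2f}")
```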

To gain some intuition, we start with the ad hoc approach by proving Pinsker's inequality, which bounds total variation from above by the KL divergence.

Theorem 4.4 (Pinsker's inequality).

\[ D(P\|Q) \ge 2 d_{TV}^2(P, Q). \tag{4.10} \]

Proof. First we show that, by the data processing inequality, it suffices to prove the result for Bernoulli distributions. For any event $E$, let $Y = 1\{X \in E\}$, which is Bernoulli with parameter $P(E)$ or $Q(E)$. By the data processing inequality, $D(P\|Q) \ge d(P(E)\|Q(E))$, where $d(p\|q) = D(\mathrm{Ber}(p)\|\mathrm{Ber}(q))$ denotes the binary divergence. If Pinsker's inequality is true for all Bernoulli random variables, we have

\[ \sqrt{ \frac{1}{2} D(P\|Q) } \ge d_{TV}(\mathrm{Ber}(P(E)), \mathrm{Ber}(Q(E))) = |P(E) - Q(E)|. \]

Taking the supremum over $E$ gives

\[ \sqrt{ \frac{1}{2} D(P\|Q) } \ge \sup_E |P(E) - Q(E)| = d_{TV}(P, Q), \]

in view of Theorem 4.2. The binary case follows easily from Taylor's theorem (with logarithms to base $e$):

\[ d(p\|q) = \int_q^p \frac{p - t}{t(1 - t)}\, dt \ge 4 \int_q^p (p - t)\, dt = 2(p - q)^2, \]

where the inequality uses $t(1 - t) \le \frac{1}{4}$, and $d_{TV}(\mathrm{Ber}(p), \mathrm{Ber}(q)) = |p - q|$.

Remark 4.6. Pinsker's inequality is known to be sharp in the sense that the constant 2 in (4.10) is not improvable, i.e., there exist $\{P_n, Q_n\}$ such that $D(P_n\|Q_n) / d_{TV}^2(P_n, Q_n) \to 2$ as $n \to \infty$. (Why?) Nevertheless, this does not mean that (4.10) itself is not improvable, because it might be possible to add some higher-order term in $d_{TV}$ to the right-hand side. This is indeed the case and there are many refinements of Pinsker's inequality. But what is the best inequality? Settling this question rests on characterizing the joint range and the lower boundary. This is the topic of the next lecture.
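The binary case of Pinsker's inequality and the sharpness claim in Remark 4.6 are easy to probe numerically. The sketch below checks $d(p\|q) \ge 2(p - q)^2$ on a grid and then evaluates the ratio $d(p\|q)/(p - q)^2$ along the family $p = \frac12 + \frac1n$, $q = \frac12$. This family is one possible choice that appears to witness sharpness and is an assumption of this example, as is the function name `binary_kl`. Divergences are in nats.

```python
import math

def binary_kl(p, q):
    """Binary divergence d(p||q) in nats, for 0 <= p <= 1 and 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

# Smallest value of d(p||q) - 2 (p - q)^2 over a grid; Pinsker says it is nonnegative.
grid = [i / 50 for i in range(1, 50)]
worst_slack = min(binary_kl(p, q) - 2 * (p - q) ** 2 for p in grid for q in grid)
print(f"smallest slack on the grid: {worst_slack:.3e}")

# Ratio D / d_TV^2 along p = 1/2 + 1/n, q = 1/2; it should approach the constant 2.
ratios = {}
for n in (10, 100, 1000):
    p, q = 0.5 + 1 / n, 0.5
    ratios[n] = binary_kl(p, q) / (p - q) ** 2
    print(f"n={n:5d}  D/d_TV^2 = {ratios[n]:.6f}")
```

The ratios stay slightly above 2, consistent with the remark that higher-order refinements of (4.10) are possible.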
