10-704: Information Processing and Learning, Fall 2016, Lecture 21: Nov 14


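Lecturer: Aarti Singh

Note: These notes are based on scribed notes from the Spring 2015 offering of this course. LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

21.1 Minimax Risk and Le Cam's lower bound

The minimax risk for a class $\Theta$ and loss $\ell$ is

$$R_n(\Theta) = \inf_{T} \sup_{\theta \in \Theta} \mathbb{E}_{x \sim P_\theta}\left[\ell(T(x), \theta)\right],$$

where $T$ is any estimator. An upper bound on the minimax risk is obtained by designing an algorithm, and a lower bound is obtained by information-theoretic techniques. Testing problems focus on the specific loss function $\ell(T(x), \theta) = \mathbf{1}\{T(x) \neq \theta\}$, so the minimax risk is

$$R_n(\Theta) = \inf_{T} \sup_{\theta \in \Theta} P_{x \sim P_\theta}[T(x) \neq \theta].$$

In the previous lecture we saw that if there are two parameters $\theta_0$ and $\theta_1$, then Le Cam's method shows that the minimax risk is lower bounded by

$$R_n(\{\theta_0, \theta_1\}) \geq \frac{1}{2} - \frac{1}{2}\left\|P^n_{\theta_0} - P^n_{\theta_1}\right\|_{TV} \geq \frac{1}{2} - \frac{1}{2}\sqrt{\frac{1}{2} KL\left(P^n_{\theta_0} \,\|\, P^n_{\theta_1}\right)},$$

where the second inequality is Pinsker's inequality. We saw lower bounds for a simple normal mean testing problem. We also saw that we can use Le Cam's method for composite hypothesis tests using the following two tricks:

1. We can always throw away parameters in the supremum and lower bound the risk: for any $\Theta' \subseteq \Theta$, $\sup_{\theta \in \Theta} P_\theta[\cdot] \geq \sup_{\theta \in \Theta'} P_\theta[\cdot]$. Any problem with $\{0, 1\}$ loss can therefore be lower bounded by just choosing two parameters $\theta_0, \theta_1$ and computing their TV or KL.

2. We can also separate the parameter space into two regions $\Theta_0$ and $\Theta_1$ and mix over these sets:

$$\inf_T \sup_{\theta} P_\theta[T(x) \neq \theta] \geq \inf_T \max_{j \in \{0,1\}} \sup_{\theta \in \Theta_j} P_\theta[T(x) \neq j] \geq \inf_T \left\{ \frac{1}{2}\mathbb{E}_{\theta \sim \pi_0,\, x \sim P_\theta}\left[\mathbf{1}\{T(x) \neq 0\}\right] + \frac{1}{2}\mathbb{E}_{\theta \sim \pi_1,\, x \sim P_\theta}\left[\mathbf{1}\{T(x) \neq 1\}\right] \right\} \geq \frac{1}{2} - \frac{1}{2}\left\|P_{\pi_0} - P_{\pi_1}\right\|_{TV},$$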

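where $P_{\pi_0}(A) = \mathbb{E}_{\theta \sim \pi_0}[P_\theta(A)]$, $\pi_0$ is a distribution on $\Theta_0$, and $\pi_1$ is a distribution on $\Theta_1$. This is important for some problems. By mixing you can make the distributions much closer together and so prove stronger lower bounds, but it is often challenging to compute the divergence between mixtures.

21.2 Neyman-Pearson Lemma

For simple vs. simple tests, the optimal test is the likelihood ratio test

$$\Lambda(x) = \frac{P_0(x)}{P_1(x)}, \qquad T(x) = \mathbf{1}\{\Lambda(x) \leq \text{threshold}\},$$

and

$$\inf_T \left\{ \frac{1}{2} P_0[T(x) \neq 0] + \frac{1}{2} P_1[T(x) \neq 1] \right\} = \frac{1}{2} - \frac{1}{2}\|P_0 - P_1\|_{TV}.$$

Proof: In the last class we saw that for any deterministic test $T : \mathcal{X} \to \{0, 1\}$ with acceptance region $A = \{x \in \mathcal{X} : T(x) = 1\}$,

$$P_0(T \neq 0) + P_1(T \neq 1) = P_0(A) + P_1(A^c) = 1 - P_1(A) + P_0(A). \qquad (21.1)$$

The result follows by noticing that this is minimized if $A$ is the region where $P_0(x) \leq P_1(x)$, that is, where $\Lambda(x) \leq 1$.

21.3 Information Theoretic Connections and Fano's Method

Another way to think of minimax testing is as a channel decoding problem. Given a channel $\theta \to X$, we send $\theta \in \{0, 1\}$ and you see the samples $X \sim P_\theta$. If $P_0$ is close to $P_1$ then you will have a high decoding error, because when $P_0$ is close to $P_1$, $H(\theta \mid X)$ is big. Fano's inequality characterizes this relationship and can be used for proving minimax lower bounds for multiple hypothesis tests.

Consider the Markov chain $\theta \to X \to T$. Let $P_e = P[T \neq \theta]$ for any test/decoder $T$. Fano's inequality implies that

$$h(P_e) + P_e \log(|\Theta| - 1) \geq H(\theta \mid X), \quad \text{or} \quad P_e \geq \frac{H(\theta \mid X) - \log 2}{\log(|\Theta| - 1)},$$

where $P_e = P_{\theta \sim \pi,\, x \sim P_\theta}[T(x) \neq \theta]$. Using the identities from earlier in the course, there are many equivalent ways to state this inequality. For $\pi$ uniform on $\Theta$ we have $H(\theta \mid X) = \log|\Theta| - I(\theta; X)$, so

$$P_e \geq 1 - \frac{I(\theta; X) + \log 2}{\log |\Theta|} = 1 - \frac{\mathbb{E}_{\theta \sim \pi}\left[KL(P_\theta \,\|\, P_\pi)\right] + \log 2}{\log |\Theta|},$$

since

$$I(\theta; X) = \sum_{\theta} \int \pi(\theta)\, p_\theta(x) \log \frac{\pi(\theta)\, p_\theta(x)}{\pi(\theta) \sum_{\theta'} \pi(\theta')\, p_{\theta'}(x)}\, dx = \mathbb{E}_{\theta \sim \pi}\left[KL(P_\theta \,\|\, P_\pi)\right],$$

where $P_\pi = \sum_\theta \pi(\theta) P_\theta$ is the mixture distribution. This is the global Fano's method.

We can weaken the mixture representation of KL to obtain the local or pairwise Fano method:

$$\mathbb{E}_{\theta \sim \pi}\left[KL(P_\theta \,\|\, P_\pi)\right] \leq \mathbb{E}_{\theta, \theta' \sim \pi}\left[KL(P_\theta \,\|\, P_{\theta'})\right].$$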

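The last step follows from Jensen's inequality, since KL divergence is convex in its second argument. In this case, if we have $M$ hypotheses $\theta_1, \ldots, \theta_M$, then we obtain (here $[M] = \{1, \ldots, M\}$)

$$\inf_T \sup_{j \in [M]} P_{\theta_j}[T(x) \neq j] \geq \inf_T \frac{1}{M}\sum_{j=1}^{M} P_{\theta_j}[T(x) \neq j] \geq 1 - \frac{\frac{1}{M^2}\sum_{i,j} KL(P_{\theta_i} \,\|\, P_{\theta_j}) + \log 2}{\log M}.$$

21.4 Application to testing for the nonzero coordinate of a 1-sparse vector in $\mathbb{R}^d$

$$H_v : X^n \overset{\text{iid}}{\sim} \mathcal{N}(\mu v, I), \qquad (21.2)$$

where $v \in \{0, 1\}^d$ has only 1 nonzero component. There are $d$ hypotheses, and with identity covariance each pair has $KL(P_i^n \,\|\, P_j^n) = \frac{n}{2}\|\mu e_i - \mu e_j\|_2^2 = n\mu^2$. The local Fano method then gives

$$R_n(\Theta) \geq 1 - \frac{n\mu^2 + \log 2}{\log d},$$

which is bounded away from zero (for $d \geq 4$, say) if

$$\mu \leq c\sqrt{\frac{\log d}{n}}$$

for a sufficiently small constant $c > 0$.

Note that this rate is achieved for this problem by picking the largest coordinate of the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$:

$$T(X^n) = \arg\max_j \bar{X}(j).$$

Write $\mu(j)$ for the $j$-th coordinate of the mean vector $\mu v$. Each $\bar{X}(j) - \mu(j)$ is $\mathcal{N}(0, 1/n)$, so by the Gaussian tail bound and a union bound we know that

$$P\left[\exists j : |\bar{X}(j) - \mu(j)| \geq \epsilon\right] \leq 2d \exp\{-n\epsilon^2/2\},$$

or, with probability at least $1 - \delta$,

$$\forall j : \quad |\bar{X}(j) - \mu(j)| \leq \sqrt{\frac{2\log(2d/\delta)}{n}}.$$

The estimated coordinate $\hat{j}$ agrees with the true one $j^\star$ if $\bar{X}(j^\star) - \bar{X}(k) > 0$ for all $k \neq j^\star$. On the event above,

$$\bar{X}(j^\star) - \bar{X}(k) = \bar{X}(j^\star) - \mu(j^\star) + \mu(j^\star) - \mu(k) + \mu(k) - \bar{X}(k) \geq \mu(j^\star) - \mu(k) - |\bar{X}(k) - \mu(k)| - |\mu(j^\star) - \bar{X}(j^\star)| \geq \mu - 2\sqrt{\frac{2\log(2d/\delta)}{n}},$$

so that if $\mu = \omega\left(\sqrt{\log(d)/n}\right)$ this estimator has success probability tending to 1.

Theorem 21.1 For the 1-sparse recovery problem, the minimax rate is

$$\mu \asymp \sqrt{\frac{\log d}{n}}.$$

Actually the same rate holds for the $k$-sparse problem, but it is slightly less obvious. Also, there are many techniques for proving lower bounds (Le Cam, local Fano, and global Fano) just for testing problems. It is important to know about all of these techniques because some are better suited to some problems.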
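As a rough numerical sanity check of this section (a sketch, not part of the original notes), the Python snippet below evaluates the local Fano bound and simulates the argmax estimator. The function names `fano_lower_bound` and `argmax_error_rate` are hypothetical. The pairwise KL is passed in explicitly; the example call assumes unit-covariance Gaussian noise, for which the KL between two hypotheses is $n\mu^2$. The sample mean is drawn directly from $\mathcal{N}(\mu v, I/n)$ instead of averaging $n$ draws.

```python
import math
import random


def fano_lower_bound(pairwise_kl, num_hypotheses):
    """Local Fano bound 1 - (KL + log 2) / log M, clipped at 0.

    pairwise_kl is an upper bound on KL(P_i || P_j) over all pairs.
    """
    bound = 1.0 - (pairwise_kl + math.log(2)) / math.log(num_hypotheses)
    return max(0.0, bound)


def argmax_error_rate(n, mu, d, trials=2000, seed=0):
    """Monte Carlo error of the estimator argmax_j Xbar(j).

    Assumes unit-variance noise, so each coordinate of the sample mean
    is Gaussian with standard deviation 1/sqrt(n).
    """
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)
    errors = 0
    for _ in range(trials):
        j_star = rng.randrange(d)  # true nonzero coordinate, uniform prior
        xbar = [rng.gauss(mu if j == j_star else 0.0, sd) for j in range(d)]
        j_hat = max(range(d), key=lambda j: xbar[j])
        errors += j_hat != j_star
    return errors / trials


if __name__ == "__main__":
    n, d = 100, 16
    rate = math.sqrt(math.log(d) / n)
    for scale in (0.25, 0.5, 1.0, 2.0, 4.0):
        mu = scale * rate
        lower = fano_lower_bound(n * mu**2, d)  # assumed KL = n * mu^2
        upper = argmax_error_rate(n, mu, d)
        print(f"mu = {scale:4.2f} * sqrt(log d / n): "
              f"Fano bound {lower:.3f}, argmax error {upper:.3f}")
```

For small multiples of $\sqrt{\log d / n}$ the simulated error should sit above the Fano bound, and for large multiples it should approach zero, which is the two-sided behaviour the theorem describes.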

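21.5 Estimation Problem

Now let's turn to estimation problems, or more general losses. We write

$$R_n(\Theta) = \inf_T \sup_{\theta \in \Theta} \mathbb{E}\left[\Phi\left(\rho(T(X), \theta)\right)\right],$$

where $\rho : \Theta \times \Theta \to \mathbb{R}_+$ is a semi-metric and $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ is a non-decreasing function with $\Phi(0) = 0$.

Example: $\rho(\theta, \theta') = \|\theta - \theta'\|_2$ and $\Phi(t) = t^2$, so we are looking at mean squared error. This can also cover things like classification performance and excess log loss, which we have seen before.

21.5.1 Proving lower bounds

Step 1: Discretization. Fix a $\delta > 0$ and find a large set of parameters $\Theta' = \{\theta_i\}_{i=1}^M \subseteq \Theta$ such that

$$\rho(\theta_i, \theta_j) \geq 2\delta \quad \forall\, i \neq j.$$

This set is called a $2\delta$-packing in the $\rho$-metric.

Step 2: Reduce to Testing. Consider $j \sim \text{uniform}([M])$ and $X \sim P_{\theta_j}$. If you cannot differentiate between $\theta_j$ and some other $\theta_i$, you will certainly make error at least $\Phi(\delta)$ in the estimation problem. More formally:

Proposition 21.2 Let $\{\theta_j\}_{j=1}^M$ be a $2\delta$-packing in the $\rho$ metric. Then

$$R_n(\Theta, \Phi \circ \rho) \geq \Phi(\delta) \inf_{\Psi} P_{j \sim \text{unif}([M]),\, X^n \sim P_{\theta_j}}\left[\Psi(X^n) \neq j\right].$$

Proof: Fix an estimator $T$. For any fixed $\theta$ we have

$$\mathbb{E}\left[\Phi(\rho(T, \theta))\right] \geq \mathbb{E}\left[\Phi(\delta)\mathbf{1}\{\rho(T, \theta) \geq \delta\}\right] = \Phi(\delta) P\left[\rho(T, \theta) \geq \delta\right].$$

Now define the test $\Psi(T) = \arg\min_j \rho(T, \theta_j)$. If $\rho(T, \theta_j) < \delta$ then $\Psi(T) = j$ by $2\delta$ separation and the triangle inequality, since for every $k \neq j$,

$$\rho(T, \theta_k) \geq \rho(\theta_j, \theta_k) - \rho(T, \theta_j) > 2\delta - \delta = \delta.$$

The contrapositive of this statement is that if $\Psi(T) \neq j$ then $\rho(T, \theta_j) \geq \delta$. Hence

$$\sup_{\theta} P\left[\rho(T, \theta) \geq \delta\right] \geq \frac{1}{M}\sum_{j=1}^{M} P_j\left[\rho(T, \theta_j) \geq \delta\right] \geq \frac{1}{M}\sum_{j=1}^{M} P_j\left[\Psi(T) \neq j\right].$$

Now take an infimum over all $T$ and $\Psi$.

Step 3: Use Fano or Le Cam to Lower Bound $P_e$ in Testing Problems. We saw how to do this earlier in this lecture and in the previous lecture.

21.6 Normal Means Estimation in $\ell_2$

Let $X^n \sim \mathcal{N}(v, I)$, $v \in \mathbb{R}^d$. The goal is to make $\mathbb{E}\|T(X^n) - v\|_2^2$ small. Let $U$ be a $1/2$-packing of the unit ball in $\mathbb{R}^d$. Note that the unit ball in $d$ dimensions has a $1/2$-packing of size at least $2^d$ in the $\ell_2$ metric. For each $u \in U$ let $\theta_u = \delta u \in \mathbb{R}^d$ for some $\delta > 0$, so that

$$\|\theta_u - \theta_{u'}\|_2 = \delta\|u - u'\|_2 \geq \frac{\delta}{2}. \qquad (21.3)$$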
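The key step of the reduction from estimation to testing, namely that the nearest-point test recovers $j$ whenever the estimate lies within $\delta$ of $\theta_j$, can be checked on a toy example. The sketch below is illustrative only: it assumes a one-dimensional packing with $\rho(a, b) = |a - b|$, and the helper names `nearest_index` and `count_violations` are hypothetical.

```python
import random


def nearest_index(estimate, thetas):
    """The test Psi: index of the packing point closest to the estimate."""
    return min(range(len(thetas)), key=lambda j: abs(estimate - thetas[j]))


def count_violations(thetas, delta, trials=10000, seed=0):
    """Count draws where Psi != j even though rho(estimate, theta_j) < delta.

    For a valid 2*delta-packing the count should be 0.
    """
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        j = rng.randrange(len(thetas))
        # A hypothetical estimator output: theta_j plus bounded noise.
        estimate = thetas[j] + rng.uniform(-3 * delta, 3 * delta)
        close = abs(estimate - thetas[j]) < delta
        if close and nearest_index(estimate, thetas) != j:
            violations += 1
    return violations


if __name__ == "__main__":
    delta = 0.5
    # Points spaced 2*delta apart on the real line form a 2*delta-packing.
    thetas = [2 * delta * j for j in range(8)]
    print("violations:", count_violations(thetas, delta))
```

If the points are placed closer together than $2\delta$, the same check starts reporting violations, which is why the separation requirement in Step 1 matters.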


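Figure 21.1: If you get $\theta_{j'}$ instead of $\theta_j$, then your estimate $\hat{\theta}$ must be far from $\theta_j$.

Also notice that since $u, u'$ lie in the unit ball, $\|u - u'\|_2 \leq 2$ and hence $\|\theta_u - \theta_{u'}\|_2 \leq 2\delta$, so the KL divergence between each pair $\theta_u, \theta_{u'}$ is

$$KL\left(P_{\theta_u} \,\|\, P_{\theta_{u'}}\right) = \frac{n}{2}\|\theta_u - \theta_{u'}\|_2^2 \leq 2n\delta^2,$$

so Fano's lemma, with $\log M \geq d\log 2$, gives for any test $T$

$$\frac{1}{M}\sum_{j=1}^{M} P_{\theta_j}\left[T(X^n) \neq j\right] \geq 1 - \frac{2n\delta^2 + \log 2}{d\log 2}.$$

The points $\theta_u$ are separated by at least $\delta/2 = 2 \cdot (\delta/4)$, so the reduction to testing with $\Phi(t) = t^2$ gives the lower bound

$$R_n\left(\Theta, \|\cdot\|_2^2\right) \geq \left(\frac{\delta}{4}\right)^2 \mathbb{E}_j P_{\theta_j}\left[T(X^n) \neq j\right] \geq \frac{\delta^2}{16}\left(1 - \frac{2n\delta^2 + \log 2}{d\log 2}\right).$$

Now we can choose $\delta$: set $\delta^2 = d\log 2/(8n)$. The term in parentheses is then $3/4 - 1/d \geq 1/4$ for $d \geq 2$, so $R_n \geq c\,d/n$ for some constant $c > 0$. (For $d = 1$ the problem reduces to testing two simple hypotheses, for which we can use Le Cam's method.) This is the right parametric rate for this problem.

21.7 Strong data processing inequalities

How can we extend these lower bound techniques to new settings that arise in modern learning problems? One approach is to use strong data processing inequalities, since modern learning settings can often be thought of as a classical problem with some transformation applied to the data, i.e.

$$\text{parameter} \to \text{classical data} \to \text{new data} \qquad (21.4)$$

$$\theta \to X \to Z \qquad (21.5)$$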

Example: Local differentially private channel. The channel $X \to Z$ must be differentially private for each data point, i.e., for each data point $X_i$ we have a distribution $Q(Z \mid X)$ such that

$$\sup_{S} \sup_{x, x' \in \mathcal{X}} \frac{Q(Z_i \in S \mid X_i = x)}{Q(Z_i \in S \mid X_i = x')} \leq \exp(\alpha). \qquad (21.6)$$

We would like to leverage existing technology to get lower bounds in these settings for learning with $Z$. Clearly we can use the data processing inequality, which gives $I(\theta; X) \geq I(\theta; Z)$, but this bound is quite loose. Thus we are interested in strong data processing inequalities: suppose we have a channel $\theta \to X \to Z$ and $Q(Z \mid X)$ is the distribution of $Z \mid X$ with a certain property. We want to show that

$$I(\theta; Z) \leq f(Q)\, I(\theta; X), \quad \text{where } f(Q) < 1,$$

which yields a much tighter lower bound.

In the next class we will see that $(\alpha, 0)$-differentially private learning leads to an $\alpha^2$ contraction in KL divergence, which means the effective sample size goes from $n$ to $n\alpha^2$. In other words, if we had $n$ samples in the differentially private setting, it is as if we only had $n\alpha^2$ samples in the classical setting. So we need more samples in the new setting to learn well.
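To make the contraction concrete, here is a small sketch (an illustration under stated assumptions, not the construction from the lecture) using binary randomized response: $Z$ equals $X$ with probability $e^\alpha/(1 + e^\alpha)$ and is flipped otherwise. This mechanism meets the ratio condition (21.6) with equality. The Bernoulli parameters 0.3 and 0.7 and all function names are hypothetical choices made for the example.

```python
import math


def randomized_response(alpha):
    """Return Q[z][x] = Q(Z = z | X = x) for binary randomized response."""
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return [[keep, 1.0 - keep], [1.0 - keep, keep]]


def privacy_ratio(Q):
    """Largest likelihood ratio Q(z | x) / Q(z | x') over z, x, x'."""
    return max(Q[z][x] / Q[z][xp]
               for z in range(2) for x in range(2) for xp in range(2))


def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def privatized_mean(p, Q):
    """P(Z = 1) when X ~ Bern(p) is passed through the channel Q."""
    return Q[1][1] * p + Q[1][0] * (1 - p)


if __name__ == "__main__":
    alpha = 0.5
    Q = randomized_response(alpha)
    p0, p1 = 0.3, 0.7  # hypothetical parameters theta_0 and theta_1
    kl_x = bernoulli_kl(p0, p1)
    kl_z = bernoulli_kl(privatized_mean(p0, Q), privatized_mean(p1, Q))
    print("max likelihood ratio:", privacy_ratio(Q), "exp(alpha):", math.exp(alpha))
    print("KL before privatization:", kl_x)
    print("KL after privatization: ", kl_z)
    print("ratio:", kl_z / kl_x, "alpha^2:", alpha**2)
```

In this example the per-sample KL after privatization is smaller than $\alpha^2$ times the original KL, in line with the $\alpha^2$ scaling quoted above. Because the KL of $n$ iid samples is $n$ times the per-sample KL, shrinking the per-sample KL by a factor acts like shrinking $n$ by the same factor.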
