Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad


140.850: Mathematical Foundation of Big Data Analysis, Spring 2016

Lecture 8: Minimax Lower Bounds: LeCam, Fano, and Assouad

Lecturer: Fang Han, March 07

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Lecturer.

8.1 Overview on minimax lower bounds

This lecture note is focused on introducing the general framework for constructing lower bounds for any given statistical problem. Such a framework, formulated by Lucien LeCam and many others in the 1970s-80s, aims at a mathematically rigorous understanding of how challenging a statistical problem is via a worst-case analysis.

We define a statistical problem as follows. It contains three components: (1) a parameter space $\Theta$; (2) a class of probability measures $\{P_\theta, \theta \in \Theta\}$; (3) a semi-distance $d(\cdot, \cdot): \Theta \times \Theta \to \mathbb{R}$ on $\Theta$. A statistical problem aims to recover $\theta$, with accuracy measured by $d(\cdot, \cdot)$, given data from $P_\theta$.

Example 8.1.1. Throughout this note, we will repeatedly visit the sparse normal mean estimation problem: estimate $\theta \in \Theta_s := \{v \in \mathbb{R}^p : \|v\|_0 \le s\}$ based on $P_\theta := N_p(\theta, \sigma^2 I_p)^{\otimes n}$. Here $d(\cdot, \cdot)$ is commonly adopted to be the Euclidean distance on $\mathbb{R}^p$.

Regarding a general statistical problem, we aim to know how hard, at worst, it is to recover any given $\theta \in \Theta$. This is formulated by the maximum risk of an estimator $\hat\theta_n$ over $\Theta$:
$$r(\hat\theta_n) := \sup_{\theta \in \Theta} E_\theta\, d^2(\hat\theta_n, \theta).$$

8.1.1 Upper bounding the maximum risk

By the techniques we have learnt so far, it is usually not too hard to upper bound $r(\hat\theta_n)$. For example, in Example 8.1.1, let's consider the MLE (also called the least squares estimator):
$$\hat\theta_n := \mathop{\mathrm{argmin}}_{v \in \Theta_s} \sum_{i=1}^n \|X_i - v\|_2^2.$$
We then have the following theorem.

Theorem 8.1.2 (Sparse normal mean estimation - upper bound). For Example 8.1.1, we have
$$\sup_{\theta \in \Theta_s} E_\theta \|\hat\theta_n - \theta\|_2^2 \le \frac{C \sigma^2 s \log(ep/s)}{n},$$
where $C$ is an absolute constant.

Remark 8.1.3. Note that the upper bound does not depend on the scale of $\theta$, which is a little bit surprising.
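Before turning to the proof, here is a quick numerical illustration of Theorem 8.1.2. It is a minimal sketch (the truth $\theta$, the dimensions, and the seed are illustrative choices, not from the notes), using the fact that the least squares estimator over $\Theta_s$ has a closed form here: since $\sum_i \|X_i - v\|_2^2 = n\|\bar X - v\|_2^2 + \text{const}$, the minimizer keeps the $s$ largest coordinates of $\bar X$ in absolute value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, sigma = 50, 500, 5, 1.0

# Hypothetical s-sparse truth (an illustrative choice).
theta = np.zeros(p)
theta[:s] = 1.0

X = theta + sigma * rng.standard_normal((n, p))
xbar = X.mean(axis=0)

# Least squares over Theta_s: keep the s largest coordinates of xbar.
theta_hat = np.zeros(p)
top = np.argsort(np.abs(xbar))[-s:]
theta_hat[top] = xbar[top]

err = np.sum((theta_hat - theta) ** 2)
rate = sigma**2 * s * np.log(np.e * p / s) / n
print(f"squared error = {err:.4f}; rate sigma^2 s log(ep/s)/n = {rate:.4f}")
```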

Proof. For any $\theta \in \Theta_s$, by the definition of $\hat\theta_n$, we have
$$\sum_{i=1}^n \|X_i - \hat\theta_n\|_2^2 \le \sum_{i=1}^n \|X_i - \theta\|_2^2,$$
which implies
$$\|\hat\theta_n - \theta\|_2^2 \le 2\langle \bar X - \theta,\ \hat\theta_n - \theta\rangle, \quad\text{i.e.,}\quad \|\hat\theta_n - \theta\|_2 \le 2\Big\langle \bar X - \theta,\ \frac{\hat\theta_n - \theta}{\|\hat\theta_n - \theta\|_2}\Big\rangle,$$
where $\bar X := n^{-1}\sum_{i=1}^n X_i$. By the standard uniform consistency argument in empirical process theory, using the fact that $\|\hat\theta_n - \theta\|_0 \le 2s$, we can continue writing
$$\|\hat\theta_n - \theta\|_2 \le 2 \sup_{v \in B(2s)} v^T(\bar X - \theta), \quad\text{where } B(s) := \{v \in \mathbb{R}^p : \|v\|_0 \le s,\ \|v\|_2 = 1\}.$$
We then employ a covering net argument as in Lecture note #4. In particular, for any $a \in \mathbb{R}^q$, $\|v_1 - v_2\|_2 \le \epsilon$, and $\|v_1\|_2 = \|v_2\|_2 = 1$, we have
$$(v_1^T a)^2 - (v_2^T a)^2 \le 2\epsilon \sup_{\|v\|_2 = 1} (v^T a)^2.$$
This implies
$$\sup_{\|v\|_2 = 1} (v^T a)^2 \le (1 - 2\epsilon)^{-1} \max_{v \in N_\epsilon} (v^T a)^2.$$
Here $N_\epsilon$ is an $\epsilon$-net of $S^{q-1}$, with cardinality smaller than $(1 + 2/\epsilon)^q$. Taking $\epsilon = 1/4$, this further implies
$$P_\theta\big(\|\hat\theta_n - \theta\|_2 \ge 2t\big) \le P_\theta\Big(\sup_{v \in B(2s)} v^T(\bar X - \theta) \ge t\Big) \le \binom{p}{2s}\, 9^{2s} \sup_{v \in B(2s)} P_\theta\big(v^T(\bar X - \theta) \ge t\big) \le \binom{9p}{2s} \cdot 2\exp\big(-nt^2/(2\sigma^2)\big),$$
where the last inequality is due to the fact that $\bar X - \theta \sim N_p(0, \sigma^2 I_p/n)$ under $P_\theta$, implying $v^T(\bar X - \theta) \sim N(0, \sigma^2/n)$ for any unit vector $v$. This implies $\|\hat\theta_n - \theta\|_2 = O_P\big(\sqrt{s \log(ep/s)/n}\big)$. The expectation version follows from $EX = \int_{t > 0} P(X \ge t)\, dt$ for any positive random variable $X$. $\square$
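The key quantity above, $\sup_{v \in B(2s)} v^T(\bar X - \theta)$, equals the $\ell_2$ norm of the $2s$ largest coordinates (in absolute value) of $\bar X - \theta$, since the optimal unit vector puts all of its mass on those coordinates. The following minimal Monte Carlo sketch (dimensions and seed are illustrative) checks that this supremum is indeed of the order $\sigma\sqrt{s \log(ep/s)/n}$ suggested by the tail bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s, sigma, reps = 100, 1000, 5, 1.0, 200

sups = []
for _ in range(reps):
    g = sigma / np.sqrt(n) * rng.standard_normal(p)  # law of xbar - theta
    top = np.sort(np.abs(g))[-2 * s:]                # 2s largest |g_i|
    sups.append(np.sqrt(np.sum(top**2)))             # = sup over B(2s) of v^T g

print(f"Monte Carlo mean of the sup:      {np.mean(sups):.4f}")
print(f"sigma * sqrt(2s log(ep/(2s))/n):  {sigma * np.sqrt(2*s*np.log(np.e*p/(2*s))/n):.4f}")
```

The two numbers agree up to a modest constant, as the union bound predicts.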

8.1.2 A general reduction scheme

In this section, we introduce a general framework to calculate lower bounds for any given statistical problem. This section largely follows Tsybakov's wonderful book "Introduction to Nonparametric Estimation". Our aim is to lower bound the minimax error
$$\inf_{\hat\theta} \sup_{\theta \in \Theta} E_\theta\, d^2(\hat\theta, \theta),$$
where $\hat\theta$ ranges over all measurable statistics of the data drawn from $P_\theta$, for any given $\theta$.

(1) The first step is to reduce the expectation bound to a probability bound. This is via Markov's inequality:
$$E_\theta\, d(\hat\theta, \theta) \ge A\, P_\theta\big(d(\hat\theta, \theta) \ge A\big).$$
Therefore, since $E_\theta\, d^2(\hat\theta, \theta) \ge \{E_\theta\, d(\hat\theta, \theta)\}^2$, as long as we can find an $A$ such that
$$\inf_{\hat\theta_n} \sup_{\theta \in \Theta} P_\theta\big(d(\hat\theta_n, \theta) \ge A\big) \ge C > 0$$
for some absolute constant $C$, we have $\inf_{\hat\theta_n} \sup_{\theta \in \Theta} E_\theta\, d^2(\hat\theta_n, \theta) \ge C^2 A^2$.

(2) The second step is to find a worst-case parameter set $\{\theta_0, \theta_1, \dots, \theta_M\} \subset \Theta$ with a finite number of elements ($\theta_0$ is made to be the reference value). This is usually the key step. By the property of the supremum, we must have
$$\sup_{\theta \in \Theta} P_\theta\big(d(\hat\theta, \theta) \ge A\big) \ge \max_{\theta \in \{\theta_0, \dots, \theta_M\}} P_\theta\big(d(\hat\theta, \theta) \ge A\big).$$

(3) The third step is to make $\theta_0, \dots, \theta_M$ uniformly distinguishable, so that we can transfer the estimation problem to a testing problem (see the simulation sketch after this list). In detail, we need to construct $\theta_0, \dots, \theta_M$ such that
$$d(\theta_j, \theta_k) \ge 2A \quad\text{for any } k \ne j.$$
Accordingly, for any estimator $\hat\theta$, if there exists $\theta_k$ with $k \ne j$ such that $d(\hat\theta, \theta_k) \le d(\hat\theta, \theta_j)$, then we must have $d(\hat\theta, \theta_j) \ge A$ (otherwise, by the triangle inequality, $d(\theta_j, \theta_k) \le d(\hat\theta, \theta_j) + d(\hat\theta, \theta_k) < 2A$, a contradiction). In other words, we must have
$$P_{\theta_j}\big(d(\hat\theta, \theta_j) \ge A\big) \ge P_{\theta_j}(\Psi \ne j) \quad\text{for } j = 0, \dots, M,$$
where $\Psi$ is the minimum distance test:
$$\Psi := \mathop{\mathrm{argmin}}_{0 \le j \le M} d(\hat\theta, \theta_j).$$
Therefore, we have successfully transferred the estimation problem to a testing problem:
$$\inf_{\hat\theta} \sup_{\theta \in \Theta} P_\theta\big(d(\hat\theta, \theta) \ge A\big) \ge \inf_\Psi \max_{0 \le j \le M} P_{\theta_j}(\Psi \ne j).$$

The testing problem on the righthand side is much nicer to deal with. In particular, when $M = 1$, we recover the two-class hypothesis testing problem, where by the Neyman-Pearson lemma the likelihood ratio gives a sharp lower bound (actually, this is one of the major motivations to bring information theory into statistics: the MLE is exactly the projection estimator with respect to the K-L divergence!). The lower bound for the multiple (but finite) hypothesis testing problem was constructed by LeCam. As you can imagine, it is still based on likelihood ratios.
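To make step (3) concrete, here is a minimal two-hypothesis simulation ($M = 1$, univariate Gaussian data; all numbers are illustrative choices, not from the notes). Per realization, $\{\Psi \ne j\}$ implies $\{d(\hat\theta, \theta_j) \ge A\}$, so the estimation error probability dominates the testing error, exactly as the reduction promises.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 20, 1.0, 2000
theta0, theta1 = 0.0, 1.0          # two parameters at distance 2A
A = abs(theta1 - theta0) / 2

err_est = err_test = 0
for _ in range(reps):
    j = rng.integers(2)            # draw the true hypothesis at random
    theta = (theta0, theta1)[j]
    theta_hat = theta + sigma / np.sqrt(n) * rng.standard_normal()  # sample mean
    psi = 0 if abs(theta_hat - theta0) <= abs(theta_hat - theta1) else 1
    err_test += (psi != j)
    err_est += (abs(theta_hat - theta) >= A)

print(f"P(d(theta_hat, theta_j) >= A) ~ {err_est/reps:.3f}"
      f"  >=  P(Psi != j) ~ {err_test/reps:.3f}")
```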

8.2 LeCam's approach

In the following, we ignore issues of absolute continuity. Please check Tsybakov's book for the complete version.

Theorem 8.2.1. Let $P_0, P_1, \dots, P_M$ be probability measures on $(\mathcal{X}, \mathcal{A})$. We then have
$$\inf_\Psi \max_{0 \le j \le M} P_j(\Psi \ne j) \ge \sup_{\tau > 0} \frac{\tau M}{1 + \tau M} \min_{1 \le j \le M} P_j\Big(\frac{dP_0}{dP_j} \ge \tau\Big).$$

Proof. Let $\Psi$ be any test taking values in $\{0, 1, \dots, M\}$, and fix $\tau > 0$. Let
$$T_j := \Big\{\frac{dP_0}{dP_j} \ge \tau\Big\}.$$
We then have
$$P_0(\Psi \ne 0) = \sum_{j=1}^M P_0(\Psi = j) = \sum_{j=1}^M \int I(\Psi = j)\, dP_0 \ge \tau \sum_{j=1}^M P_j\big(\{\Psi = j\} \cap T_j\big) \ge \tau \sum_{j=1}^M \big(P_j(\Psi = j) - P_j(T_j^C)\big) \ge \tau M (\bar p_0 - \alpha),$$
where
$$\bar p_0 := \frac{1}{M} \sum_{j=1}^M P_j(\Psi = j) \quad\text{and}\quad \alpha := \max_{1 \le j \le M} P_j(T_j^C) = \max_{1 \le j \le M} P_j\Big(\frac{dP_0}{dP_j} < \tau\Big).$$
Since the maximum dominates the average, $\max_{1 \le j \le M} P_j(\Psi \ne j) \ge 1 - \bar p_0$. Accordingly, we have
$$\max_{0 \le j \le M} P_j(\Psi \ne j) = \max\Big\{P_0(\Psi \ne 0),\ \max_{1 \le j \le M} P_j(\Psi \ne j)\Big\} \ge \max\big\{\tau M(\bar p_0 - \alpha),\ 1 - \bar p_0\big\} \ge \min_{0 \le p \le 1} \max\big\{\tau M(p - \alpha),\ 1 - p\big\} = \frac{\tau M (1 - \alpha)}{1 + \tau M},$$
where the last equality follows by balancing the two terms: $\tau M(p - \alpha) = 1 - p$ at $p^* = (1 + \tau M \alpha)/(1 + \tau M)$, with common value $\tau M(1 - \alpha)/(1 + \tau M)$. This completes the proof. $\square$

This then gives us the desired result.

Theorem 8.2.2 (LeCam). Assume $\{\theta_0, \theta_1, \dots, \theta_M\} \subset \Theta$ is constructed such that (1) $d(\theta_j, \theta_k) \ge 2A > 0$ for any $0 \le j < k \le M$; (2) there exist $\tau > 0$ and $0 < \alpha < 1$ such that
$$\min_{1 \le j \le M} P_j\Big(\frac{dP_0}{dP_j} \ge \tau\Big) \ge 1 - \alpha.$$
We then have
$$\inf_{\hat\theta} \sup_{\theta \in \Theta} P_\theta\big(d(\hat\theta, \theta) \ge A\big) \ge \frac{\tau M (1 - \alpha)}{1 + \tau M}.$$
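As a sanity check on Theorem 8.2.1 in the simplest case $M = 1$, take $P_0 = N(0, 1)$ and $P_1 = N(\mu, 1)$, where the likelihood ratio is explicit: $dP_0/dP_1(x) = \exp(\mu^2/2 - \mu x)$, so that $P_1(dP_0/dP_1 \ge \tau) = \Phi(-\mu/2 - \log(\tau)/\mu)$ for $\mu > 0$. The minimal sketch below ($\mu$ and the $\tau$-grid are illustrative; scipy is assumed available) compares the resulting bound with the exact minimax testing error $\Phi(-\mu/2)$, attained by the likelihood ratio test.

```python
import numpy as np
from scipy.stats import norm

mu = 1.0
taus = np.linspace(0.05, 3.0, 500)
# P_1(dP_0/dP_1 >= tau) = Phi(-mu/2 - log(tau)/mu) under P_1 = N(mu, 1).
probs = norm.cdf(-mu / 2 - np.log(taus) / mu)
bound = np.max(taus / (1 + taus) * probs)   # Theorem 8.2.1 with M = 1

print(f"LeCam bound: {bound:.4f}  <=  optimal testing error: {norm.cdf(-mu / 2):.4f}")
```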

8.2.1 Elementary information theory

Given Theorem 8.2.2, what is left is to determine the value of $P_j(dP_0/dP_j \ge \tau)$. This is, naturally, the problem of determining the distance between the two probability measures $P_0$ and $P_j$, which plays a key role in information theory. Of note, there is a track in probability that rethinks everything in statistics from the perspective of information theory. One motivating example is Stein's method, which is a flexible way of determining the rates of weak convergence in weak topologies. Interested readers should read Ramon van Handel's notes as well as John Duchi's ( class/stats3/lectures/full_notes.pdf).

Nevertheless, let's restrict our interest to studying $P_j(dP_0/dP_j \ge \tau)$. To this end, let's introduce the following concepts.

Definition 8.2.3. For any two probability measures $P$ and $Q$ on $(\mathcal{X}, \mathcal{A})$, suppose $\nu$ is a $\sigma$-finite measure on $(\mathcal{X}, \mathcal{A})$ satisfying $P \ll \nu$ and $Q \ll \nu$. We define $p = dP/d\nu$ and $q = dQ/d\nu$, and

f-divergence:
$$D_f(P, Q) := \int f\Big(\frac{dP}{dQ}\Big)\, dQ.$$

Hellinger distance ($H^2$ is a special case of f-divergence, taking $f(x) = (\sqrt{x} - 1)^2$):
$$H(P, Q) := \Big(\int (\sqrt{p} - \sqrt{q})^2\, d\nu\Big)^{1/2} = \Big[\int \big(\sqrt{dP} - \sqrt{dQ}\big)^2\Big]^{1/2}.$$

Total variation distance (taking $f(x) = |x - 1|/2$):
$$V(P, Q) := \sup_{T \in \mathcal{A}} |P(T) - Q(T)| = \sup_{T \in \mathcal{A}} \int_T (p - q)\, d\nu = \frac{1}{2} \int |p - q|\, d\nu.$$
The last equality is the famous Scheffé theorem.

Kullback-Leibler (K-L) divergence (taking $f(x) = x \log x$):
$$K(P, Q) := \int \log\frac{dP}{dQ}\, dP \quad\text{for } P \ll Q.$$

$\chi^2$ divergence (taking $f(x) = (x - 1)^2$):
$$\chi^2(P, Q) := \int \Big(\frac{dP}{dQ} - 1\Big)^2\, dQ.$$

There are a bunch of useful inequalities within the f-divergence family. However, due to the scope limit, let's focus on one of the most useful sets, Pinsker's inequalities.

Lemma 8.2.4 (Pinsker's inequalities). (1) $V(P, Q) \le \sqrt{K(P, Q)/2}$. (2) If $P \ll Q$, then
$$\int \Big|\log\frac{dP}{dQ}\Big|\, dP := \int_{pq > 0} p\, \Big|\log\frac{p}{q}\Big|\, d\nu \le K(P, Q) + \sqrt{2 K(P, Q)},$$
and
$$\int \Big(\log\frac{dP}{dQ}\Big)_+\, dP \le K(P, Q) + \sqrt{K(P, Q)/2},$$
where $a_+ := \max(a, 0)$.

Proof. Left as a good exercise. $\square$
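These quantities are easy to evaluate for discrete distributions, where $\nu$ can be taken to be the counting measure. A minimal sketch (the two distributions are illustrative choices) that computes each divergence above and verifies the first Pinsker inequality:

```python
import numpy as np

# Two hypothetical distributions on {0, ..., 4} with full support (so P << Q).
p = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

tv = 0.5 * np.abs(p - q).sum()                        # V(P, Q)
kl = np.sum(p * np.log(p / q))                        # K(P, Q)
hell = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q))**2))  # H(P, Q)
chi2 = np.sum((p / q - 1)**2 * q)                     # chi^2(P, Q)

print(f"V = {tv:.4f}, K = {kl:.4f}, H = {hell:.4f}, chi2 = {chi2:.4f}")
assert tv <= np.sqrt(kl / 2)                          # Pinsker: V <= sqrt(K/2)
print(f"Pinsker check: {tv:.4f} <= sqrt(K/2) = {np.sqrt(kl / 2):.4f}")
```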

8.2.2 Back to Theorem 8.2.2

Using Pinsker's inequalities, we are now well equipped to provide a more user-friendly version of LeCam's theorem.

Theorem 8.2.5. Assume that $M \ge 2$ and $\{\theta_0, \dots, \theta_M\} \subset \Theta$ satisfies (1) $d(\theta_j, \theta_k) \ge 2A > 0$ for any $0 \le j < k \le M$; (2) $P_j \ll P_0$ and $K(P_j, P_0) \le \alpha \log M$, with $0 < \alpha < 1/8$ (the upper bound $1/8$ is to make sure the final lower bound is larger than 0). We then have
$$\inf_{\hat\theta} \sup_{\theta \in \Theta} P_\theta\big(d(\hat\theta, \theta) \ge A\big) \ge \frac{\sqrt{M}}{1 + \sqrt{M}} \Big(1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}}\Big).$$

Proof. In Theorem 8.2.2, for any $0 < \tau < 1$, we have
$$P_j\Big(\frac{dP_0}{dP_j} < \tau\Big) = P_j\Big(\frac{dP_j}{dP_0} > \frac{1}{\tau}\Big) = P_j\Big(\log\frac{dP_j}{dP_0} > \log\frac{1}{\tau}\Big) \le \frac{1}{\log(1/\tau)} \int \Big(\log\frac{dP_j}{dP_0}\Big)_+\, dP_j,$$
where the last inequality is due to Markov's inequality. We can then employ the second Pinsker inequality to derive
$$P_j\Big(\frac{dP_0}{dP_j} < \tau\Big) \le \frac{1}{\log(1/\tau)} \Big[K(P_j, P_0) + \sqrt{K(P_j, P_0)/2}\Big].$$
By Condition (2), $K(P_j, P_0) + \sqrt{K(P_j, P_0)/2} \le \alpha \log M + \sqrt{(\alpha \log M)/2}$. Accordingly, we have
$$P_j\Big(\frac{dP_0}{dP_j} < \tau\Big) \le \frac{1}{\log(1/\tau)} \Big(\alpha \log M + \sqrt{(\alpha \log M)/2}\Big),$$
which, combined with Theorem 8.2.2, yields
$$\inf_{\hat\theta} \sup_{\theta \in \Theta} P_\theta\big(d(\hat\theta, \theta) \ge A\big) \ge \frac{\tau M}{1 + \tau M} \Big(1 - \frac{1}{\log(1/\tau)}\big(\alpha \log M + \sqrt{(\alpha \log M)/2}\big)\Big).$$
Picking $\tau = 1/\sqrt{M}$ finalizes the proof: then $\log(1/\tau) = (\log M)/2$, so the bracket equals $1 - 2\alpha - \sqrt{2\alpha/\log M}$, while $\tau M/(1 + \tau M) = \sqrt{M}/(1 + \sqrt{M})$. $\square$

8.2.3 Application to Example 8.1.1

Let's now show that the upper bound in Theorem 8.1.2 is rate-optimal in the minimax sense, using Theorem 8.2.5.
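As preparation, note that the K-L divergence between the product Gaussian measures of Example 8.1.1 has a closed form; the following short derivation (a standard computation, spelled out here for completeness) is exactly what the proof below relies on. For one observation in one coordinate,
$$K\big(N(\mu_1, \sigma^2),\ N(\mu_0, \sigma^2)\big) = E_{X \sim N(\mu_1, \sigma^2)}\Big[\frac{(X - \mu_0)^2 - (X - \mu_1)^2}{2\sigma^2}\Big] = \frac{(\mu_1 - \mu_0)^2}{2\sigma^2},$$
and since the K-L divergence adds up over independent coordinates and i.i.d. observations,
$$K\big(N_p(\theta_j, \sigma^2 I_p)^{\otimes n},\ N_p(\theta_0, \sigma^2 I_p)^{\otimes n}\big) = \frac{n}{2\sigma^2}\, \|\theta_j - \theta_0\|_2^2.$$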

Theorem 8.2.6. Regarding the statistical problem in Example 8.1.1, we have
$$\inf_{\hat\theta} \sup_{\theta \in \Theta_s} E_\theta \|\hat\theta - \theta\|_2^2 \gtrsim \sigma^2 s \log(ep/s)/n.$$

To prove the result, we need a strong result in combinatorics. In detail, let
$$\Omega := \{\omega = (\omega_1, \dots, \omega_m),\ \omega_i \in \{0, 1\}\} = \{0, 1\}^m.$$
We then have the Varshamov-Gilbert bound.

Lemma 8.2.7 (Varshamov-Gilbert bound). Let $m \ge 8$. Then there exists a subset $\{\omega^{(0)}, \dots, \omega^{(M)}\}$ of $\Omega$ such that $\omega^{(0)} = (0, \dots, 0)$,
$$\|\omega^{(j)} - \omega^{(k)}\|_0 \ge \frac{m}{8} \quad\text{for } 0 \le j < k \le M$$
(here $\|\cdot\|_0$ represents the Hamming distance), and $M \ge 2^{m/8}$.

Proof of Theorem 8.2.6. To prove this theorem, we use a variation of the Varshamov-Gilbert bound. We construct the parameter space $\Theta_0$ as follows: $\theta_0 = 0$ and
$$[\theta_{T_i}]_j = a\, I(j \in T_i) \quad\text{for } T_i \in \mathcal{T},$$
where $\mathcal{T}$ is a collection of subsets of $\{1, \dots, p\}$, each containing $s$ distinct elements, that pairwise overlap in at most $s/8$ positions, and $a > 0$ is a value to be determined later. Using a variation of the Varshamov-Gilbert bound (whose proof uses the probabilistic method, like the proof for random projections; maybe I will cover it in class), we can prove
$$\log M := \log |\Theta_0| \gtrsim s \log(p/s).$$
Furthermore, for the Gaussian distribution, we know
$$K(P_j, P_0) = \frac{n}{2\sigma^2} \|\theta_j\|_2^2 = \frac{n s a^2}{2\sigma^2} \quad\text{and}\quad \|\theta_j - \theta_k\|_2^2 \ge s a^2.$$
Accordingly, picking $a^2 \asymp \sigma^2 \log(p/s)/n$ completes the proof. $\square$

8.3 Fano and Assouad

I do not believe I will have time to introduce them, but I can promise you they are very useful. Actually, they were developed to handle the cases that LeCam's method cannot handle, or handles only with great difficulty. Interested readers should read Duchi's notes.
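The Varshamov-Gilbert bound is also easy to explore numerically. Below is a minimal greedy sketch (brute force over a small $m$; an illustrative packing construction, whereas the lemma's actual proof is probabilistic): keep every binary string whose Hamming distance to all previously kept strings is at least $m/8$.

```python
import numpy as np
from itertools import product

m = 10
kept = []
for w in product([0, 1], repeat=m):
    w = np.array(w)
    # Accept w if it is m/8-separated from everything kept so far.
    if not kept or (np.abs(np.array(kept) - w).sum(axis=1) >= m / 8).all():
        kept.append(w)

M = len(kept) - 1  # kept[0] is (0, ..., 0), matching omega^(0)
print(f"m = {m}: greedy packing gives M = {M}; "
      f"the lemma guarantees M >= 2^(m/8) = {2 ** (m / 8):.2f}")
```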
