16.1 Bounding Capacity with Covering Number

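ECE598: Information-theoretic methods in high-dimensional statistics, Spring 2016

Lecture 16: Upper Bounds for Density Estimation

Lecturer: Yihong Wu. Scribe: Yang Zhang, Apr. 2016

So far we have mostly focused on upper bounds for $L_p$ risks. In this lecture we shift our attention to "density" estimation, which can be formulated as follows. Given $X^n = (X_1, \ldots, X_n) \overset{\text{iid}}{\sim} p_\theta \in \mathcal{P}$, we obtain an estimate $\hat{p} = \hat{p}(\cdot\,; X_1, \ldots, X_n)$. The loss function is the KL divergence $D(p_\theta \| \hat{p})$. The average risk is thus
\[
\mathbb{E}_{p_\theta} D(p_\theta \| \hat{p}) = \int D\big(p_\theta \,\big\|\, \hat{p}(\cdot\,; X^n = x^n)\big)\, p_\theta^{\otimes n}(dx^n).
\]
Our task is to upper bound the minimax risk
\[
\inf_{\hat{p}} \sup_{p_\theta \in \mathcal{P}} \mathbb{E}_{p_\theta} D(p_\theta \| \hat{p}).
\]
The term "density" is a little misleading, and hence quoted, because $\mathcal{P}$ is not confined to densities.

16.1 Bounding Capacity with Covering Number

This section introduces a bound on capacity in terms of the covering number. Both the conclusion and the proof technique are used later. Before the bound is formally stated, here is a recap of some important concepts.

KL divergence: $D(P \| Q) = \mathbb{E}_P \log \frac{P}{Q}$.

Mutual information:
\[
I(X;Y) = D(P_{XY} \| P_X P_Y) = \inf_{Q} D(P_{Y|X} \| Q \mid P_X), \tag{16.1}
\]
where the infimum is achieved at $Q = P_Y = \mathbb{E}_X[P_{Y|X}]$.

Capacity: Denote $\mathcal{P} = \{P_{Y|X=x} : x \in \mathcal{X}\}$. The capacity is defined as
\[
C = \sup_{P_X} I(X;Y) \le \text{radius} = \inf_{Q} \sup_{x \in \mathcal{X}} D(P_{Y|X=x} \| Q), \tag{16.2}
\]
with "=" if $\mathcal{P}$ is convex.

Covering number for sets of distributions:
\[
N(\varepsilon) = \min \#\{\text{balls that cover } \mathcal{P}\}
= \min\{N : \exists\, Q_1, \ldots, Q_N \text{ s.t. } \forall x \in \mathcal{X},\ \exists\, i \in [N],\ D(P_{Y|X=x} \| Q_i) \le \varepsilon\}.
\]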
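The variational form of mutual information recalled above (an infimum over $Q$, attained at $Q = P_Y$) can be checked on a small finite example. The sketch below is only an illustration: the 3-by-3 channel, the prior, and the helper names are made up for this example and are not part of the lecture. It uses the identity $D(P_{Y|X} \| Q \mid P_X) = I(X;Y) + D(P_Y \| Q)$, so any $Q \neq P_Y$ pays an extra $D(P_Y \| Q)$.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats for finite distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def conditional_kl(channel, q, prior):
    """D(P_{Y|X} || Q | P_X) = sum_x P_X(x) * D(P_{Y|X=x} || Q)."""
    return sum(px * kl(row, q) for px, row in zip(prior, channel))

def output_marginal(channel, prior):
    """P_Y = E_X[P_{Y|X}], the output distribution induced by the prior."""
    n_out = len(channel[0])
    return [sum(px * row[y] for px, row in zip(prior, channel)) for y in range(n_out)]

# Hypothetical channel (rows are P_{Y|X=x}) and prior, chosen only for illustration.
channel = [[0.7, 0.2, 0.1],
           [0.1, 0.8, 0.1],
           [0.3, 0.3, 0.4]]
prior = [0.5, 0.3, 0.2]

p_y = output_marginal(channel, prior)
mutual_info = conditional_kl(channel, p_y, prior)  # value at the minimizer Q = P_Y

# Any other Q gives a larger conditional divergence; the excess is D(P_Y || Q).
q_other = [1 / 3, 1 / 3, 1 / 3]
gap = conditional_kl(channel, q_other, prior) - mutual_info
print(f"I(X;Y) = {mutual_info:.6f} nats, excess for uniform Q = {gap:.6f}")
```

The printed excess should agree with `kl(p_y, q_other)` up to floating-point error, which is the reason the infimum is attained at the output marginal.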

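Now we are ready to state the lemma.

Lemma 16.1.
\[
C \le \inf_{\varepsilon > 0} \{\varepsilon + \log N(\varepsilon)\}. \tag{16.3}
\]

There are two ways of proving this lemma.

Proof #1. Fix $\varepsilon$, let $N = N(\varepsilon)$, and let $Q_1, \ldots, Q_N$ form an $\varepsilon$-cover. For each $x \in \mathcal{X}$, let $i(x) = \arg\min_{i \in [N]} D(P_{Y|X=x} \| Q_i)$, so that $D(P_{Y|X=x} \| Q_{i(x)}) \le \varepsilon$. Fix any $P_X$. Then
\[
I(X;Y) = I(X, i(X); Y) \overset{(1)}{=} I(i(X); Y) + I(X; Y \mid i(X)) \le H(i(X)) + I(X; Y \mid i(X)) \overset{(2)}{\le} \log N + \varepsilon,
\]
where (1) is the chain rule of mutual information, and (2) uses two facts. First, $H(i(X))$ is the entropy of a distribution with $N$ outcomes, whose maximum $\log N$ is achieved when all outcomes are equiprobable. Second,
\[
I(X; Y \mid i(X)) = \inf_{Q_{Y|i(X)}} D(P_{Y|X} \| Q_{Y|i(X)} \mid P_X) \le D(P_{Y|X} \| Q_{i(X)} \mid P_X) \le \varepsilon.
\]

Proof #2.
\[
\begin{aligned}
I(X;Y) &= \inf_{Q} D(P_{Y|X} \| Q \mid P_X)
\le D\Big(P_{Y|X} \,\Big\|\, \frac{1}{N}\sum_{i=1}^{N} Q_i \,\Big|\, P_X\Big) \\
&= \mathbb{E}_X\, D\Big(P_{Y|X} \,\Big\|\, \frac{1}{N}\sum_{i=1}^{N} Q_i\Big)
= \mathbb{E}_X\, \mathbb{E}_{P_{Y|X}} \log \frac{P_{Y|X}}{\frac{1}{N}\sum_{i=1}^{N} Q_i} \\
&\le \mathbb{E}_X\, \mathbb{E}_{P_{Y|X}} \log \frac{P_{Y|X}}{\frac{1}{N} Q_{i(X)}}
= \log N + \mathbb{E}_{P_X} D(P_{Y|X} \| Q_{i(X)})
\le \log N + \varepsilon.
\end{aligned}
\]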
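The bound in the lemma can be sanity-checked numerically on a finite family. The sketch below is an illustration under assumptions that are not in the lecture: the family is Binomial(20, θ) for θ on a grid, the cover centers are Binomial(20, c) on a coarser grid, and a Blahut-Arimoto iteration (a standard technique swapped in here) supplies a prior whose mutual information is a lower bound on the capacity. Any cover with `n_centers` centers and radius `eps` gives the upper bound `eps + log(n_centers)`, since the covering number at level `eps` is at most `n_centers`.

```python
import math

def kl(p, q):
    """D(p || q) in nats for finite distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binomial_pmf(m, theta):
    """Distribution of Binomial(m, theta) as a list of m + 1 probabilities."""
    return [math.comb(m, k) * theta**k * (1 - theta)**(m - k) for k in range(m + 1)]

def marginal(family, prior):
    """Mixture of the family members under the given prior weights."""
    return [sum(w * row[y] for w, row in zip(prior, family))
            for y in range(len(family[0]))]

def mutual_information(family, prior):
    """I(X;Y) when X ~ prior picks a family member and Y is drawn from it."""
    p_y = marginal(family, prior)
    return sum(w * kl(row, p_y) for w, row in zip(prior, family) if w > 0)

def blahut_arimoto_prior(family, iters=200):
    """Blahut-Arimoto updates; the returned prior gives a lower bound on capacity."""
    prior = [1.0 / len(family)] * len(family)
    for _ in range(iters):
        p_y = marginal(family, prior)
        weights = [w * math.exp(kl(row, p_y)) for w, row in zip(prior, family)]
        total = sum(weights)
        prior = [w / total for w in weights]
    return prior

def cover_bound(family, centers):
    """eps + log N for the cover `centers`, with eps the worst-case KL radius."""
    eps = max(min(kl(row, q) for q in centers) for row in family)
    return eps + math.log(len(centers))

M = 20  # hypothetical number of trials per observation
family = [binomial_pmf(M, 0.05 * i) for i in range(1, 20)]

bounds = {}
for n_centers in range(1, 31):
    centers = [binomial_pmf(M, (j + 0.5) / n_centers) for j in range(n_centers)]
    bounds[n_centers] = cover_bound(family, centers)

best_bound = min(bounds.values())
capacity_lower = mutual_information(family, blahut_arimoto_prior(family))
print(f"capacity >= {capacity_lower:.4f} nats, covering bound = {best_bound:.4f} nats")
```

Coarse covers pay through a large radius and fine covers pay through `log(n_centers)`, so the minimum over cover sizes mirrors the infimum over $\varepsilon$ in the lemma.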

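Remark 16.1. Equality in (16.3) holds if $\mathcal{P}$ is convex, in which case $C = \text{radius}$ by (16.2). This is easy to verify in the special case $\varepsilon = \text{radius}$, where $N(\varepsilon) = 1$ and both sides of (16.3) equal the radius.

Remark 16.2. For $n$ samples $X^n = (X_1, \ldots, X_n) \overset{\text{iid}}{\sim} P_X$, note that $D(P^{\otimes n} \| Q^{\otimes n}) = n D(P \| Q)$. Denote by $N_n(\varepsilon)$ the covering number of $\mathcal{P}^{\otimes n}$ and by $N(\varepsilon)$ that of $\mathcal{P}$. The product distributions of an $\varepsilon/n$-cover of $\mathcal{P}$ form an $\varepsilon$-cover of $\mathcal{P}^{\otimes n}$. Therefore
\[
N_n(\varepsilon) \le N\Big(\frac{\varepsilon}{n}\Big).
\]
In the Gaussian case, for instance, the KL divergence is expressed through the Euclidean norm, and thus $N(\varepsilon) \lesssim (1/\varepsilon)^d$. According to (16.3),
\[
\begin{aligned}
C_n &\le \inf_{\varepsilon > 0} \{\varepsilon + \log N_n(\varepsilon)\}
\lesssim \inf_{\varepsilon > 0} \Big\{\varepsilon + d \log \frac{n}{\varepsilon}\Big\} \\
&= d \inf_{\varepsilon > 0} \Big\{\frac{\varepsilon}{d} + \log \frac{n/d}{\varepsilon/d}\Big\} \\
&= d \inf_{\varepsilon' > 0} \Big\{\varepsilon' + \log \frac{n/d}{\varepsilon'}\Big\} \qquad (\varepsilon' = \varepsilon/d) \\
&= d \Big(1 + \log \frac{n}{d}\Big) \lesssim d \log\Big(1 + \frac{n}{d}\Big),
\end{aligned} \tag{16.4}
\]
where the infimum is attained at $\varepsilon' = 1$ and the last step holds for $n \ge d$.

16.2 An Upper Bound on the Bayes Risk

This section introduces an upper bound on the Bayes risk, which inspires the upper bound on the minimax risk shown in the next section. Consider the standard Bayes setting where $\theta \sim \pi$, $X^n = (X_1, \ldots, X_n) \overset{\text{iid}}{\sim} p_\theta$ given $\theta$, and the estimate $\hat{p}(\cdot\,; X^n)$ is a function of $X^n$. The Bayes risk is given by
\[
\mathbb{E}_{\theta, X^n} D\big(p_\theta \| \hat{p}(\cdot\,; X^n)\big) = \int \pi(d\theta)\, p_\theta^{\otimes n}(dx^n)\, D\big(p_\theta \,\big\|\, \hat{p}(\cdot\,; X^n = x^n)\big).
\]

Lemma 16.2. The optimal Bayes risk is
\[
\inf_{\hat{p}} \mathbb{E}_{\theta, X^n} D\big(p_\theta \| \hat{p}(\cdot\,; X^n)\big) = I(\theta; X_{n+1} \mid X^n),
\]
where, conditionally on $\theta$, $X_{n+1}$ is identically distributed to and independent of $X_1, \ldots, X_n$. The infimum is achieved when $\hat{p}(\cdot\,; X^n) = P_{X_{n+1}|X^n}$, which is the Bayes estimator.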

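Proof. First note that $p_\theta$ and $\hat{p}(\cdot\,; X^n)$ are distributions for a new data point, which can be denoted by $X_{n+1}$. Taking the infimum over $\hat{p}$ of the Bayes risk,
\[
\begin{aligned}
\inf_{\hat{p}} \mathbb{E}_{\theta, X^n} D\big(p_\theta \| \hat{p}(\cdot\,; X^n)\big)
&= \inf_{\hat{p}} \int \pi(d\theta)\, p_\theta^{\otimes n}(dx^n)\, D\big(p_\theta \,\big\|\, \hat{p}(\cdot\,; X^n = x^n)\big) \\
&= \inf_{\hat{p}} \int P_{X^n}(dx^n)\, \mathbb{E}_{\theta | X^n = x^n} D\big(p_\theta \,\big\|\, \hat{p}(\cdot\,; x^n)\big) \\
&= \inf_{\hat{p}} \int P_{X^n}(dx^n)\, D\big(P_{X_{n+1}|\theta} \,\big\|\, \hat{p}(\cdot\,; x^n) \,\big|\, P_{\theta | X^n = x^n}\big) \\
&\overset{(1)}{=} \int P_{X^n}(dx^n)\, D\big(P_{X_{n+1}|\theta} \,\big\|\, P_{X_{n+1}|X^n = x^n} \,\big|\, P_{\theta | X^n = x^n}\big) \\
&= I(\theta; X_{n+1} \mid X^n),
\end{aligned}
\]
where (1) follows from (16.1). Specifically, for fixed $X^n = x^n$, the infimum of $D(P_{X_{n+1}|\theta} \| \hat{p} \mid P_{\theta | X^n = x^n})$ is achieved when $\hat{p} = \mathbb{E}_{\theta | X^n = x^n}[P_{X_{n+1}|\theta}] = P_{X_{n+1}|X^n = x^n}$.

With Lemma 16.2, we can derive an upper bound in terms of capacity. Fix any prior $\pi$. Then
\[
\begin{aligned}
C_{n+1} = \sup_{\pi} I(\theta; X^{n+1}) &\ge I(\theta; X^{n+1}) \\
&\overset{(1)}{=} I(\theta; X_1) + I(\theta; X_2 \mid X_1) + \cdots + I(\theta; X_{n+1} \mid X^n) \\
&\overset{(2)}{\ge} (n+1)\, I(\theta; X_{n+1} \mid X^n),
\end{aligned}
\]
where (1) is the chain rule of mutual information, and (2) holds because the information gained from one additional data point diminishes as the number of existing data points increases, namely $I(\theta; X_{n+1} \mid X^n) \le I(\theta; X_n \mid X^{n-1})$. Therefore, from (16.4), we have a bound on the optimal Bayes risk that holds for any prior $\pi$:
\[
I(\theta; X_{n+1} \mid X^n) \le \frac{C_{n+1}}{n+1} \lesssim \frac{d}{n} \log\Big(1 + \frac{n}{d}\Big). \tag{16.5}
\]

16.3 An Upper Bound for Minimax Risk

This section introduces a theorem stating that the bound in (16.5) also holds for the minimax risk. Its proof is inspired by the Bayes case.

Theorem 16.1 (Yang-Barron 1999).
\[
\inf_{\hat{p}} \sup_{\theta \in \Theta} \mathbb{E}_{p_\theta} D(p_\theta \| \hat{p}) \le \inf_{\varepsilon > 0} \Big\{\frac{1}{n} \log N(\varepsilon) + \varepsilon\Big\} \lesssim \frac{d}{n} \log\Big(1 + \frac{n}{d}\Big).
\]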

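Proof. Choose the following estimate:
\[
\hat{p}(\cdot\,; X^n) = \frac{1}{n} \sum_{i=1}^{n} P_{X_i | X^{i-1}}(\cdot \mid X^{i-1}),
\qquad \text{where} \qquad
P_{X_i | X^{i-1}}(x_i \mid x^{i-1}) = \frac{\int \pi(d\theta) \prod_{j=1}^{i} p_\theta(x_j)}{\int \pi(d\theta) \prod_{j=1}^{i-1} p_\theta(x_j)}.
\]
Hence the estimator $\hat{p}$ is a function of $\pi$. Note that $\pi$ is used here only to define an estimator; it has nothing to do with the Bayes setting. The rest of the proof bounds the worst-case risk of the $\hat{p}$ induced by an appropriate $\pi$.

Fix $\theta$. The risk of $\hat{p}$ can be upper bounded by
\[
\mathbb{E}_{p_\theta} D(p_\theta \| \hat{p})
= \mathbb{E}_{p_\theta} D\Big(p_\theta \,\Big\|\, \frac{1}{n} \sum_{i=1}^{n} P_{X_i | X^{i-1}}\Big)
\overset{(1)}{\le} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{p_\theta} D\big(p_\theta \,\big\|\, P_{X_i | X^{i-1}}\big)
\overset{(2)}{=} \frac{1}{n} D\big(p_\theta^{\otimes n} \,\big\|\, P_{X^n}\big), \tag{16.6}
\]
where (1) is due to the convexity of the KL divergence and (2) is the chain rule of the KL divergence:
\[
D(P_{X^n} \| Q_{X^n}) = \mathbb{E}_P \log \frac{P_{X^n}}{Q_{X^n}}
= \sum_{i=1}^{n} \mathbb{E}_P \log \frac{P_{X_i | X^{i-1}}}{Q_{X_i | X^{i-1}}}
= \sum_{i=1}^{n} D\big(P_{X_i | X^{i-1}} \,\big\|\, Q_{X_i | X^{i-1}} \,\big|\, P_{X^{i-1}}\big).
\]

Fix $\varepsilon$ and denote the covering number by $N = N(\varepsilon)$. Let $G = \{\theta_1, \ldots, \theta_N\}$ be a set whose corresponding $p_{\theta_i}$'s form an $\varepsilon$-covering of $\mathcal{P}$, and for each $\theta$ let $i(\theta)$ be an index with $D(p_\theta \| p_{\theta_{i(\theta)}}) \le \varepsilon$. Choose the $\hat{p}$ induced by $\pi = \text{uniform}(G)$. Then
\[
\begin{aligned}
D\big(p_\theta^{\otimes n} \,\big\|\, P_{X^n}\big)
&= D\Big(p_\theta^{\otimes n} \,\Big\|\, \frac{1}{N} \sum_{i=1}^{N} p_{\theta_i}^{\otimes n}\Big)
= \mathbb{E} \log \frac{p_\theta^{\otimes n}}{\frac{1}{N} \sum_{i=1}^{N} p_{\theta_i}^{\otimes n}} \\
&\le \mathbb{E} \log \frac{p_\theta^{\otimes n}}{\frac{1}{N}\, p_{\theta_{i(\theta)}}^{\otimes n}}
= \log N + n D\big(p_\theta \| p_{\theta_{i(\theta)}}\big)
\le \log N + n\varepsilon.
\end{aligned} \tag{16.7}
\]
Combining (16.6) and (16.7), we can bound the minimax risk:
\[
\inf_{\hat{p}} \sup_{\theta \in \Theta} \mathbb{E}_{p_\theta} D(p_\theta \| \hat{p})
\le \sup_{\theta \in \Theta} \mathbb{E}_{p_\theta} D(p_\theta \| \hat{p})
\le \frac{1}{n} (\log N + n\varepsilon) = \frac{1}{n} \log N + \varepsilon.
\]
Since this holds for every $\varepsilon > 0$, taking the infimum over $\varepsilon$ and noticing that $N(\varepsilon) \lesssim (1/\varepsilon)^d$ concludes the proof.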
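The key step of the proof, that the KL divergence from the $n$-fold product to a uniform mixture over cover centers is at most $\log N$ plus $n$ times the distance to the nearest center, can be checked numerically. The sketch below assumes a Bernoulli family and a hypothetical nine-point grid of centers; neither comes from the lecture. For each test parameter it compares the exact divergence with $\log N + n \min_i D(p_\theta \| p_{\theta_i})$.

```python
import math

def bern_kl(p, q):
    """D(Bern(p) || Bern(q)) in nats, for q strictly between 0 and 1."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def kl_product_to_mixture(theta, grid, n):
    """Exact D(p_theta^n || (1/N) sum_i p_{theta_i}^n) for Bernoulli sequences.

    The likelihood of a sequence depends only on its number of ones k,
    so the sum over 2^n sequences collapses to a sum over k.
    """
    total = 0.0
    for k in range(n + 1):
        seq_true = theta**k * (1 - theta)**(n - k)
        seq_mix = sum(g**k * (1 - g)**(n - k) for g in grid) / len(grid)
        total += math.comb(n, k) * seq_true * math.log(seq_true / seq_mix)
    return total

# Hypothetical cover centers and test parameters, chosen for illustration.
grid = [0.1 * j for j in range(1, 10)]   # N = 9 centers
test_thetas = [0.13, 0.37, 0.5, 0.85]
n = 50

results = []
for theta in test_thetas:
    lhs = kl_product_to_mixture(theta, grid, n)
    rhs = math.log(len(grid)) + n * min(bern_kl(theta, g) for g in grid)
    results.append((theta, lhs, rhs))
    print(f"theta={theta:.2f}: D = {lhs:.4f} <= log N + n*eps_theta = {rhs:.4f}")
```

Dividing both sides by `n` gives the per-sample risk bound; a finer grid lowers the distance term but raises `log N`, which is the trade-off the infimum over $\varepsilon$ resolves.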

In the Gaussian case, the minimax risk is the canonical $d/n$ (the KL divergence reduces to the Euclidean norm), so Theorem 16.1 is loose by an additional factor of $\log(1 + n/d)$. In the next lecture, we will obtain a tighter bound, which is polynomial in $1/n$, given an additional Lipschitz continuity constraint.
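To see where the logarithmic factor comes from, one can optimize the covering bound over the cover radius. The sketch below treats the covering number as exactly $(1/\varepsilon)^d$, which is an assumption made only for illustration (the notes state it up to constants). Under that assumption, $\frac{d}{n}\log\frac{1}{\varepsilon} + \varepsilon$ is minimized at $\varepsilon = d/n$, giving $\frac{d}{n}\big(1 + \log\frac{n}{d}\big)$.

```python
import math

def covering_bound(n, d, eps):
    """(1/n) * log N(eps) + eps, assuming N(eps) = (1/eps)^d exactly."""
    return (d / n) * math.log(1 / eps) + eps

def optimized_bound(n, d):
    """Value at eps = d/n, the stationary point of covering_bound in eps."""
    return covering_bound(n, d, d / n)

n, d = 1000, 10  # hypothetical sample size and dimension
closed_form = optimized_bound(n, d)

# Brute-force check over a grid of radii in (0, 1].
grid_min = min(covering_bound(n, d, k / 10000) for k in range(1, 10001))

# Ratio to the parametric rate d/n; equals 1 + log(n/d) under the assumption.
ratio = closed_form / (d / n)
print(f"optimized bound = {closed_form:.6f}, grid minimum = {grid_min:.6f}")
print(f"ratio to d/n = {ratio:.4f}")
```

The ratio grows like $\log(n/d)$, which matches the gap between the covering bound and the $d/n$ rate noted above.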

