ECE598: Information-theoretic methods in high-dimensional statistics        Spring 2016
Lecture 17: Density Estimation
Lecturer: Yihong Wu        Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1]

In the last lecture we studied the minimax risk of parametric density estimation and its upper bound. We are given $n$ i.i.d. samples $X_1, \dots, X_n$ generated from $P_\theta$, where $P_\theta \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is the distribution to be estimated. Let the loss between the true distribution $P_\theta$ and an estimate $\hat P$ be their KL divergence, i.e., $\ell(P_\theta, \hat P) = D(P_\theta \| \hat P)$. One can bound the minimax risk $R_n^*$ of this estimation problem by
$$
R_n^* = \inf_{\hat P} \sup_{\theta \in \Theta} \mathbb{E}_\theta\, D(P_\theta \| \hat P) \le \frac{C_n}{n}, \tag{17.1}
$$
where $C_n$ is the capacity between $\theta$ and $X^n$, i.e.,
$$
C_n = \sup_{\pi \in \mathcal{M}(\Theta)} I(\theta; X^n) \le \inf_{\epsilon > 0} \{ n\epsilon + \log N_{\mathrm{KL}}(\epsilon) \},
$$
where $N_{\mathrm{KL}}(\epsilon)$ is the KL covering number of $\mathcal{P}$.

Further, we can use the chain rule of mutual information to understand the properties of $C_n$. For any prior $\pi$ over $\Theta$, the Bayes risk satisfies
$$
R_\pi = I(\theta; X_{n+1} \mid X^n) = I(\theta; X^{n+1}) - I(\theta; X^n).
$$
Taking the supremum over $\pi$ on both sides,
$$
R_n^* \ge \sup_\pi R_\pi = \sup_\pi \big( I(\theta; X^{n+1}) - I(\theta; X^n) \big) \ge \sup_\pi I(\theta; X^{n+1}) - \sup_\pi I(\theta; X^n) = C_{n+1} - C_n,
$$
so the capacity increments also provide a lower bound on $R_n^*$.

Remark 17.1. Some properties of $\{C_n\}$:

- $\{C_n\}$ is increasing and subadditive, i.e., $C_{n+m} \le C_n + C_m$ for all $m, n \in \mathbb{Z}_+$, and therefore $C_n/n$ has a limit as $n \to \infty$. By Fekete's lemma,
$$
\lim_{n \to \infty} \frac{C_n}{n} = \inf_{n} \frac{C_n}{n}.
$$

- If we let $\Delta_n = C_{n+1} - C_n$ (with $C_0 = 0$), one can rewrite $C_n$ as
$$
C_n = \sum_{k=0}^{n-1} \Delta_k, \qquad \text{and therefore} \qquad \frac{C_n}{n} = \frac{1}{n} \sum_{k=0}^{n-1} \Delta_k,
$$
so $C_n/n$ is the Cesàro mean of the capacity increments. In particular, (17.1) bounds $R_n^*$ by this Cesàro mean, while the chain-rule argument above lower bounds $R_n^*$ by the single increment $\Delta_n$.

In today's lecture, we use the bound in (17.1) to study the minimax risk of nonparametric density estimation.

17.1 Density Estimation

We are interested in estimating a smooth probability density function. To be precise, we are interested in estimating a pdf $f \in \mathcal{P}_\beta$ with smoothness parameter $\beta > 0$, where $f$ belongs to $\mathcal{P}_\beta$ if and only if:

1. $f$ is a pdf on $[0,1]$ and is upper bounded by a constant, say $f \le 2$;
2. $f^{(m)}$ is $\alpha$-Hölder continuous, i.e.,
$$
|f^{(m)}(x) - f^{(m)}(y)| \le |x - y|^{\alpha}, \qquad \forall x, y \in (0,1),
$$
where $\alpha \in (0, 1]$, $m \in \mathbb{Z}_{\ge 0}$ and $\beta = \alpha + m$.

Note: For example, if $\beta = 1$, then $\mathcal{P}_1$ is simply the set of pdfs which are $1$-Lipschitz and bounded by $2$.

Theorem 17.1. Given $n$ i.i.d. samples $X_1, \dots, X_n$ generated from a pdf $f \in \mathcal{P}_\beta$, the minimax risk of estimating $f$ under the quadratic loss $\ell(f, \hat f) = \|f - \hat f\|_2^2 = \int_0^1 (f(x) - \hat f(x))^2 \, dx$ satisfies
$$
R_n^*(\mathcal{P}_\beta) = \inf_{\hat f} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \|f - \hat f\|_2^2 \asymp n^{-\frac{2\beta}{2\beta+1}}. \tag{17.2}
$$

Before we go into the proof of Theorem 17.1, we make some remarks.

Remark 17.2. The larger $\beta$ is, the smoother the pdfs are, and the faster $R_n^*$ decays with $n$.

Remark 17.3. If $f$ is defined over $[0,1]^d$, the bound becomes
$$
R_n^*(\mathcal{P}_\beta) = \inf_{\hat f} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \|f - \hat f\|_2^2 \asymp n^{-\frac{2\beta}{d+2\beta}}.
$$
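To see Remarks 17.2 and 17.3 quantitatively, the short computation below (an illustration added for these notes, not part of the lecture; the helper name `minimax_rate` is ours) simply evaluates the rate $n^{-2\beta/(d+2\beta)}$ for a few values of $\beta$ and $d$.

```python
# Illustration of Remarks 17.2 and 17.3 (not from the lecture): the minimax rate
# n^(-2*beta/(d + 2*beta)) improves with the smoothness beta and degrades with
# the dimension d (curse of dimensionality).

def minimax_rate(n, beta, d=1):
    return n ** (-2.0 * beta / (d + 2.0 * beta))

n = 10**6
for beta in (0.5, 1.0, 2.0, 5.0):
    print(f"beta = {beta}, d = 1 : rate ~ {minimax_rate(n, beta):.1e}")
for d in (1, 2, 5, 10):
    print(f"beta = 1.0, d = {d:>2}: rate ~ {minimax_rate(n, 1.0, d):.1e}")
```

For instance, with $n = 10^6$ and $\beta = 1$, the rate is about $10^{-4}$ in dimension $d = 1$ but only about $10^{-1}$ in dimension $d = 10$.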

Now we prove Theorem 17.1.

Proof. First, we claim that we can use the minimax risk over a set of lower-bounded pdfs to bound $R_n^*(\mathcal{P}_\beta)$. The idea is in Lemma 17.1.

Lemma 17.1. Let $\mathcal{F}$ be the set of pdfs on $[0,1]$ that are lower bounded by $1/2$, i.e., $\mathcal{F} = \{ f : f \ge 1/2 \}$. Let $\mathcal{P}$ be an arbitrary set of pdfs on $[0,1]$ and let $\tilde{\mathcal{P}} = \mathcal{P} \cap \mathcal{F}$. Then
$$
R_n^*(\tilde{\mathcal{P}}) \le R_n^*(\mathcal{P}) \le 16 \, R_n^*(\tilde{\mathcal{P}}).
$$

Proof of Lemma 17.1. Since $\tilde{\mathcal{P}}_\beta \subset \mathcal{P}_\beta$, the lower bound is obvious:
$$
R_n^*(\tilde{\mathcal{P}}_\beta) \le R_n^*(\mathcal{P}_\beta). \tag{17.3}
$$
We will construct an estimator to show
$$
R_n^*(\mathcal{P}_\beta) \le 16 \, R_n^*(\tilde{\mathcal{P}}_\beta). \tag{17.4}
$$
Let $X_1, \dots, X_n$ be the $n$ i.i.d. samples from $f \in \mathcal{P}_\beta$, and let $U_1, \dots, U_n$ be $n$ i.i.d. samples uniformly distributed on $[0,1]$. We define $n$ i.i.d. random variables $Z_1, \dots, Z_n$ by
$$
Z_i = \begin{cases} U_i & \text{w.p. } \tfrac12, \\ X_i & \text{otherwise.} \end{cases}
$$
Thus it is equivalent to think of $Z_1, \dots, Z_n$ as i.i.d. samples from $g = \frac{1+f}{2} \in \tilde{\mathcal{P}}_\beta$. Let $\hat g$ be an estimator of $g$ based on $Z^n$, and let $\tilde g$ be its projection onto $\mathcal{F}$, i.e.,
$$
\tilde g = \arg\min_{h \in \mathcal{F}} \| \hat g - h \|_2.
$$
Note that $g \in \mathcal{F}$, so $\|\tilde g - \hat g\|_2 \le \|g - \hat g\|_2$ by the definition of the projection, and we can bound the distance between $\tilde g$ and $g$ by
$$
\| \tilde g - g \|_2 \le \| \tilde g - \hat g \|_2 + \| \hat g - g \|_2 \le 2 \, \| \hat g - g \|_2.
$$
Let $\hat f = 2 \tilde g - 1$, which is a valid pdf since $\tilde g$ is lower bounded by $1/2$. As a result, for every pdf $f \in \mathcal{P}_\beta$ there is a corresponding $g = \frac{1+f}{2} \in \tilde{\mathcal{P}}_\beta$ which has a good estimator $\hat g$, and from $\hat g$ one can construct a good estimator $\hat f$ of $f$, in the sense that
$$
\| f - \hat f \|_2 = 2 \, \| g - \tilde g \|_2 \le 4 \, \| g - \hat g \|_2.
$$
Therefore,
$$
R_n^*(\mathcal{P}_\beta) = \inf_{\hat f} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \| f - \hat f \|_2^2
\le 16 \inf_{\hat g} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \Big\| \tfrac{1+f}{2} - \hat g \Big\|_2^2
\le 16 \inf_{\hat g} \sup_{g \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, \| g - \hat g \|_2^2
= 16 \, R_n^*(\tilde{\mathcal{P}}_\beta),
$$
where the first inequality is due to the construction of $\hat f$, and the second inequality is due to $\{ \frac{1+f}{2} : f \in \mathcal{P}_\beta \} \subset \tilde{\mathcal{P}}_\beta$. Therefore Lemma 17.1 follows from (17.3) and (17.4).

It is then equivalent to prove
$$
R_n^*(\tilde{\mathcal{P}}_\beta) = \inf_{\hat f} \sup_{f \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, \| f - \hat f \|_2^2 \asymp n^{-\frac{2\beta}{2\beta+1}}.
$$
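The containment $\{\frac{1+f}{2} : f \in \mathcal{P}_\beta\} \subset \tilde{\mathcal{P}}_\beta$ invoked in the second inequality of the proof is easy to verify; the following short check (written out for these notes, using the class constants from Section 17.1) spells it out.

```latex
\begin{align*}
\text{Let } f \in \mathcal{P}_\beta \text{ and } g = \tfrac{1+f}{2}. \text{ Then}\quad
& g \ \ge\ \tfrac12, \qquad \int_0^1 g \ =\ \tfrac{1+1}{2} \ =\ 1, \qquad g \ \le\ \tfrac{1+2}{2} \ =\ \tfrac32 \ \le\ 2,\\
& \big|g^{(m)}(x) - g^{(m)}(y)\big| \ =\ \tfrac12 \big|f^{(m)}(x) - f^{(m)}(y)\big| \ \le\ \tfrac12 |x-y|^{\alpha} \ \le\ |x-y|^{\alpha}.
\end{align*}
```

Hence $g$ is a pdf on $[0,1]$ that is bounded by $2$, has an $\alpha$-Hölder $m$-th derivative, and is lower bounded by $1/2$, so $g \in \mathcal{P}_\beta \cap \mathcal{F} = \tilde{\mathcal{P}}_\beta$.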

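The reduction in the proof of Lemma 17.1 can also be made concrete numerically. Below is a minimal sketch (ours, not from the lecture): the samples are mixed with uniforms to simulate draws from $g = \frac{1+f}{2}$, a plain histogram stands in for an arbitrary black-box estimator of $g$, the estimate is $L_2$-projected onto $\mathcal{F}$ (for a piecewise-constant estimate the projection takes the water-filling form $\max(\hat g + c, \tfrac12)$ with $c$ chosen so that the total mass is one, found here by bisection), and the output is $\hat f = 2\tilde g - 1$. All function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_estimate(z, bins=20):
    """Black-box density estimate on [0, 1]: a regular histogram (placeholder)."""
    counts, edges = np.histogram(z, bins=bins, range=(0.0, 1.0))
    heights = counts / (len(z) * np.diff(edges))
    return edges, heights

def project_onto_F(edges, heights):
    """L2 projection of a piecewise-constant density onto F = {h >= 1/2, integral = 1}.
    The projection is max(heights + c, 1/2); find c by bisection on the total mass."""
    w = np.diff(edges)

    def total_mass(c):
        return np.sum(w * np.maximum(heights + c, 0.5))

    lo, hi = -np.max(heights) - 1.0, 1.0   # total_mass(lo) = 0.5 < 1 < total_mass(hi)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if total_mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(heights + 0.5 * (lo + hi), 0.5)

def reduced_estimator(x):
    """The reduction of Lemma 17.1: estimate f from X^n via g = (1 + f)/2."""
    n = len(x)
    u = rng.uniform(size=n)
    z = np.where(rng.random(n) < 0.5, u, x)   # Z_i ~ g = (1 + f)/2
    edges, g_hat = histogram_estimate(z)      # any estimator of g would do here
    g_tilde = project_onto_F(edges, g_hat)    # ||g_tilde - g|| <= 2 ||g_hat - g||
    return edges, 2.0 * g_tilde - 1.0         # f_hat = 2*g_tilde - 1 is a valid pdf

# Example: f(x) = x + 1/2 on [0, 1] (1-Lipschitz, bounded by 3/2, hence f is in P_1).
u0 = rng.uniform(size=5000)
x = (np.sqrt(1.0 + 8.0 * u0) - 1.0) / 2.0     # inverse-CDF sampling from f
edges, f_hat = reduced_estimator(x)
print("f_hat integrates to", np.sum(f_hat * np.diff(edges)))
```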
Upper bound. First we use the capacity to upper bound the minimax risk. On one hand, for pdfs $f$ and $g$ on $[0,1]$ that are bounded by $2$, the squared $L_2$ distance is controlled by the squared Hellinger distance and hence by the KL divergence:
$$
\| f - g \|_2^2 = \int_0^1 (\sqrt{f} + \sqrt{g})^2 (\sqrt{f} - \sqrt{g})^2 \le 8 \, H^2(f, g) \le 8 \, D(f \| g).
$$
As a result,
$$
R_n^*(\tilde{\mathcal{P}}_\beta) = \inf_{\hat g} \sup_{g \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, \| g - \hat g \|_2^2 \lesssim \inf_{\hat g} \sup_{g \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, D(g \| \hat g) = R_{n,\mathrm{KL}}^*(\tilde{\mathcal{P}}_\beta). \tag{17.5}
$$
On the other hand, one can bound the minimax risk under the KL divergence by (17.1), where the capacity between $g$ and $X^n$ can be bounded via
$$
C_n \le \inf_{\epsilon > 0} \{ \log N_{\mathrm{KL}}(\epsilon) + n\epsilon \}
\lesssim \inf_{\epsilon > 0} \{ \log N_{2}(\sqrt{\epsilon}) + n\epsilon \}
\lesssim \inf_{\epsilon > 0} \{ \epsilon^{-\frac{1}{2\beta}} + n\epsilon \}
\asymp n^{\frac{1}{1+2\beta}}.
$$
The first step uses the connection between the KL divergence and the $L_2$ distance: since every $g \in \tilde{\mathcal{P}}_\beta$ is lower bounded by $1/2$, we have $D(f \| g) \le \chi^2(f \| g) = \int \frac{(f-g)^2}{g} \le 2 \| f - g \|_2^2$, so an $L_2$-covering at scale $\sqrt{\epsilon/2}$ is a KL-covering at scale $\epsilon$. The second step comes from the Kolmogorov-Tikhomirov theorem, which gives $\log N_2(\tilde{\mathcal{P}}_\beta, \delta) \lesssim \delta^{-1/\beta}$; the infimum is attained by balancing the two terms at $\epsilon \asymp n^{-\frac{2\beta}{2\beta+1}}$. Therefore, together with (17.5), the upper bound is proved:
$$
R_n^*(\tilde{\mathcal{P}}_\beta) \lesssim \frac{C_n}{n} \lesssim \frac{n^{\frac{1}{1+2\beta}}}{n} = n^{-\frac{2\beta}{1+2\beta}}. \tag{17.6}
$$

Lower bound. Next we lower bound $R_n^*(\tilde{\mathcal{P}}_\beta)$ by Fano's inequality. Due to the relation between covering and packing numbers, we know
$$
\log M(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon) \ge \log N(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon) \asymp \epsilon^{-1/\beta},
$$
where the second step is again due to the Kolmogorov-Tikhomirov theorem. Let $\epsilon \asymp n^{-\frac{\beta}{1+2\beta}}$ with a sufficiently small constant. Fano's inequality tells us
$$
R_n^*(\tilde{\mathcal{P}}_\beta) \gtrsim \epsilon^2 \left( 1 - \frac{I(g; X^n) + \log 2}{\log M(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon)} \right)
\ge \epsilon^2 \left( 1 - \frac{C_n + \log 2}{\log M(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon)} \right)
\gtrsim \epsilon^2 \left( 1 - c \, n^{\frac{1}{1+2\beta}} \epsilon^{1/\beta} \right)
\gtrsim \epsilon^2 \asymp n^{-\frac{2\beta}{1+2\beta}}. \tag{17.7}
$$
The proof is complete by combining (17.6) and (17.7).

We make some remarks on the proof.

Remark 17.4. We have learned two ways to construct a density estimator:

- the average of the predictive density estimators;
- the maximum likelihood estimator.

Neither of these is computationally efficient. In practice, the kernel density estimator (KDE) is used instead: given samples $X_1, \dots, X_n$, one can first estimate the density by the empirical measure
$$
\hat\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}.
$$
This estimator, however, is not a pdf. To address this issue, one puts a kernel instead of a spike at each sample point, i.e.,
$$
\hat f_\Omega = \Omega * \hat\mu,
$$
where $\Omega$ is the kernel function, chosen to match the smoothness constraint of the pdfs. (A small numerical sketch is given at the end of these notes.)

Remark 17.5. The result in Theorem 17.1 can be generalized to the $L_p$ loss for $1 \le p < \infty$, following the same analysis.

17.2 Estimator Based on Pairwise Comparison

When we proved the lower bound in Theorem 17.1, we used the fact that if the elements of an $\epsilon$-packing of a set $\Theta$ cannot be reliably tested against each other, then $\theta \in \Theta$ cannot be estimated. Conversely, if such tests do exist, can $\theta \in \Theta$ be estimated?

First we study when testing is possible.

- In a binary hypothesis test between two distributions, where $\phi = 0$ indicates the data comes from $P$ and $\phi = 1$ indicates the data comes from $Q$ (with $\phi$ equiprobable), the optimal probability of error is governed by the total variation between $P$ and $Q$:
$$
\min_{\hat\phi} \mathbb{P}[\hat\phi \ne \phi] = \frac{1}{2} \big( 1 - d_{\mathrm{TV}}(P, Q) \big).
$$
- In a composite binary hypothesis test between two sets of distributions, where $\phi = 0$ indicates the data comes from some $P \in \mathcal{P}$ and $\phi = 1$ indicates the data comes from some $Q \in \mathcal{Q}$, the minimax probability of error is governed by the total variation between the convex hulls:
$$
\min_{\hat\phi} \sup_{P \in \mathcal{P}, \, Q \in \mathcal{Q}} \mathbb{P}[\hat\phi \ne \phi] = \frac{1}{2} \Big( 1 - \min_{P \in \mathrm{co}(\mathcal{P}), \, Q \in \mathrm{co}(\mathcal{Q})} d_{\mathrm{TV}}(P, Q) \Big).
$$

Remark 17.6. In general, minimizing $d_{\mathrm{TV}}(P, Q)$ over the convex hulls of $\mathcal{P}$ and $\mathcal{Q}$ is a hard problem.
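To close the loop on Remark 17.4, here is the minimal KDE sketch referred to there (ours, not from the lecture). The Gaussian kernel and the bandwidth $h \asymp n^{-1/(2\beta+1)}$ are illustrative choices: for $\beta \le 2$ this bandwidth balances the squared bias (of order $h^{2\beta}$) against the variance (of order $\frac{1}{nh}$), recovering the $n^{-2\beta/(2\beta+1)}$ rate of Theorem 17.1; higher smoothness would require higher-order kernels, and boundary effects on $[0,1]$ are ignored in this sketch.

```python
import numpy as np

# A minimal kernel density estimator, following Remark 17.4: smooth the empirical
# measure (1/n) * sum_i delta_{X_i} with a kernel. The Gaussian kernel and the
# bandwidth h ~ n^(-1/(2*beta+1)) below are illustrative choices (suitable for
# beta <= 2); boundary effects on [0, 1] are ignored.

def kde(samples, grid, beta=1.0):
    n = len(samples)
    h = n ** (-1.0 / (2.0 * beta + 1.0))              # bias-variance balancing bandwidth
    diffs = (grid[:, None] - samples[None, :]) / h
    kernel_vals = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernel_vals.mean(axis=1) / h               # f_hat(t) = (1/(n h)) sum_i K((t - X_i)/h)

rng = np.random.default_rng(1)
u = rng.uniform(size=2000)
x = (np.sqrt(1.0 + 8.0 * u) - 1.0) / 2.0              # samples from f(x) = x + 1/2 on [0, 1]
grid = np.linspace(0.1, 0.9, 161)                     # interior grid, away from the boundary
f_hat = kde(x, grid)
print("approximate interior squared L2 error:", np.mean((f_hat - (grid + 0.5)) ** 2) * 0.8)
```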

We make some remarks on the proof. Remark 7.4. We have learned two ways to construct a density estimator: The mean of predictive density estimators; The maximum likelihood estimator. None of those is computationally efficient. In practice, kernel density estimator (KDE) is proposed: let X,..., X n be the n samples, one can estimate the density by its histogram, ˆ = n n δ Xi. i= This estimator, however, is not a pdf. To address this issue, one put a kernel instead of a spike over each sample points, i.e., Ω = Ω ˆ, where Ω is the kernel function and is chosen to satisfy the smooth constraint of the pdfs. Remark 7.5. The result in Theorem 7. can be generalized to L p for p <, following the same analysis. 7. Estimator Based On Pairwise Comparison When we prove the lower bound in Theorem 7., we use the property that if an ɛ-covering of a set Θ cannot be tested, then θ Θ cannot be estimated. On the contrary, if the ɛ-covering can be tested, can θ Θ be estimated? First we study when the ɛ-covering can be tested. In a binary classification problem over two distributions, where φ = 0 indicates the data comes from P and φ = indicates the data comes from Q. The error is lower bounded by the total variation between P and Q, i.e., min p( ˆφ φ) = d TV (P, Q). ˆφ In a binary classification problem over two sets of distributions, where φ = 0 indicates the data comes from P P and φ = indicates the data comes from Q Q. The error is lower bounded by the total variation between P and Q, i.e., min p( ˆφ φ) = min d TV(P, Q). ˆφ P co(p),q co(q) Remark 7.6. In general, the minimization of d TV (P, Q) over the convex hulls of P and Q is a hard problem. References 5