ECE598: Information-theoretic methods in high-dimensional statistics        Spring 2016
Lecture 17: Density Estimation
Lecturer: Yihong Wu        Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1]

In the last lecture we studied the minimax risk of parametric density estimation and its upper bound. We are given $n$ i.i.d. samples $X_1, \dots, X_n$ generated from $P_\theta$, where $P_\theta \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is the distribution to be estimated. Let the loss between the true distribution $P_\theta$ and an estimate $\hat P$ be their KL divergence, i.e., $\ell(P_\theta, \hat P) = D(P_\theta \| \hat P)$. One can bound the minimax risk $R_n^*$ of this estimation problem by
$$
R_n^* = \inf_{\hat P} \sup_{\theta \in \Theta} \mathbb{E}_\theta\, D(P_\theta \| \hat P) \le \frac{C_n}{n}, \tag{17.1}
$$
where $C_n$ is the capacity between $\theta$ and $X^n$, i.e.,
$$
C_n = \sup_{\pi \in \mathcal{M}(\Theta)} I(\theta; X^n) \le \inf_{\epsilon > 0} \{ n\epsilon + \log N_{\mathrm{KL}}(\epsilon) \},
$$
where $N_{\mathrm{KL}}(\epsilon)$ is the KL covering number of $\mathcal{P}$.

Further, we can use the chain rule of mutual information to understand the properties of $C_n$. For any prior $\pi$ over $\Theta$, the Bayes risk satisfies
$$
R_\pi = I(\theta; X_{n+1} \mid X^n) = I(\theta; X^{n+1}) - I(\theta; X^n).
$$
Taking the supremum over $\pi$ on both sides,
$$
R_n^* \ge \sup_\pi R_\pi = \sup_\pi \big( I(\theta; X^{n+1}) - I(\theta; X^n) \big) \ge \sup_\pi I(\theta; X^{n+1}) - \sup_\pi I(\theta; X^n) = C_{n+1} - C_n,
$$
so the capacity increments also provide a lower bound on $R_n^*$.

Remark 17.1. Some properties of $\{C_n\}$:

- $\{C_n\}$ is increasing and subadditive, i.e., $C_{n+m} \le C_n + C_m$ for all $m, n \in \mathbb{Z}_+$, and therefore $C_n/n$ has a limit as $n \to \infty$. By Fekete's lemma,
$$
\lim_{n \to \infty} \frac{C_n}{n} = \inf_{n} \frac{C_n}{n}.
$$

- If we let $\Delta_n = C_{n+1} - C_n$ (with $C_0 = 0$), one can rewrite $C_n$ as
$$
C_n = \sum_{k=0}^{n-1} \Delta_k, \qquad \text{and therefore} \qquad \frac{C_n}{n} = \frac{1}{n} \sum_{k=0}^{n-1} \Delta_k,
$$
so $C_n/n$ is the Cesàro mean of the capacity increments. In particular, (17.1) bounds $R_n^*$ by this Cesàro mean, while the chain-rule argument above lower bounds $R_n^*$ by the single increment $\Delta_n$.

In today's lecture, we use the bound in (17.1) to study the minimax risk of nonparametric density estimation.

17.1 Density Estimation

We are interested in estimating a smooth probability density function. To be precise, we are interested in estimating a pdf $f \in \mathcal{P}_\beta$ with smoothness parameter $\beta > 0$, where $f$ belongs to $\mathcal{P}_\beta$ if and only if:

1. $f$ is a pdf on $[0,1]$ and is upper bounded by a constant, say $f \le 2$;
2. $f^{(m)}$ is $\alpha$-Hölder continuous, i.e.,
$$
|f^{(m)}(x) - f^{(m)}(y)| \le |x - y|^{\alpha}, \qquad \forall x, y \in (0,1),
$$
where $\alpha \in (0, 1]$, $m \in \mathbb{Z}_{\ge 0}$ and $\beta = \alpha + m$.

Note: For example, if $\beta = 1$, then $\mathcal{P}_1$ is simply the set of pdfs which are $1$-Lipschitz and bounded by $2$.

Theorem 17.1. Given $n$ i.i.d. samples $X_1, \dots, X_n$ generated from a pdf $f \in \mathcal{P}_\beta$, the minimax risk of estimating $f$ under the quadratic loss $\ell(f, \hat f) = \|f - \hat f\|_2^2 = \int_0^1 (f(x) - \hat f(x))^2 \, dx$ satisfies
$$
R_n^*(\mathcal{P}_\beta) = \inf_{\hat f} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \|f - \hat f\|_2^2 \asymp n^{-\frac{2\beta}{2\beta+1}}. \tag{17.2}
$$

Before we go into the proof of Theorem 17.1, we make some remarks.

Remark 17.2. The larger $\beta$ is, the smoother the pdfs are, and the faster $R_n^*$ decays with $n$.

Remark 17.3. If $f$ is defined over $[0,1]^d$, the bound becomes
$$
R_n^*(\mathcal{P}_\beta) = \inf_{\hat f} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \|f - \hat f\|_2^2 \asymp n^{-\frac{2\beta}{d+2\beta}}.
$$
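To see Remarks 17.2 and 17.3 quantitatively, the short computation below (an illustration added for these notes, not part of the lecture; the helper name `minimax_rate` is ours) simply evaluates the rate $n^{-2\beta/(d+2\beta)}$ for a few values of $\beta$ and $d$.

```python
# Illustration of Remarks 17.2 and 17.3 (not from the lecture): the minimax rate
# n^(-2*beta/(d + 2*beta)) improves with the smoothness beta and degrades with
# the dimension d (curse of dimensionality).

def minimax_rate(n, beta, d=1):
    return n ** (-2.0 * beta / (d + 2.0 * beta))

n = 10**6
for beta in (0.5, 1.0, 2.0, 5.0):
    print(f"beta = {beta}, d = 1 : rate ~ {minimax_rate(n, beta):.1e}")
for d in (1, 2, 5, 10):
    print(f"beta = 1.0, d = {d:>2}: rate ~ {minimax_rate(n, 1.0, d):.1e}")
```

For instance, with $n = 10^6$ and $\beta = 1$, the rate is about $10^{-4}$ in dimension $d = 1$ but only about $10^{-1}$ in dimension $d = 10$.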

Now we prove Theorem 17.1.

Proof. First, we claim that we can use the minimax risk over a set of lower-bounded pdfs to bound $R_n^*(\mathcal{P}_\beta)$. The idea is in Lemma 17.1.

Lemma 17.1. Let $\mathcal{F}$ be the set of pdfs on $[0,1]$ that are lower bounded by $1/2$, i.e., $\mathcal{F} = \{ f : f \ge 1/2 \}$. Let $\mathcal{P}$ be an arbitrary set of pdfs on $[0,1]$ and let $\tilde{\mathcal{P}} = \mathcal{P} \cap \mathcal{F}$. Then
$$
R_n^*(\tilde{\mathcal{P}}) \le R_n^*(\mathcal{P}) \le 16 \, R_n^*(\tilde{\mathcal{P}}).
$$

Proof of Lemma 17.1. Since $\tilde{\mathcal{P}}_\beta \subset \mathcal{P}_\beta$, the lower bound is obvious:
$$
R_n^*(\tilde{\mathcal{P}}_\beta) \le R_n^*(\mathcal{P}_\beta). \tag{17.3}
$$
We will construct an estimator to show
$$
R_n^*(\mathcal{P}_\beta) \le 16 \, R_n^*(\tilde{\mathcal{P}}_\beta). \tag{17.4}
$$
Let $X_1, \dots, X_n$ be the $n$ i.i.d. samples from $f \in \mathcal{P}_\beta$, and let $U_1, \dots, U_n$ be $n$ i.i.d. samples uniformly distributed on $[0,1]$. We define $n$ i.i.d. random variables $Z_1, \dots, Z_n$ by
$$
Z_i = \begin{cases} U_i & \text{w.p. } \tfrac12, \\ X_i & \text{otherwise.} \end{cases}
$$
Thus it is equivalent to think of $Z_1, \dots, Z_n$ as i.i.d. samples from $g = \frac{1+f}{2} \in \tilde{\mathcal{P}}_\beta$. Let $\hat g$ be an estimator of $g$ based on $Z^n$, and let $\tilde g$ be its projection onto $\mathcal{F}$, i.e.,
$$
\tilde g = \arg\min_{h \in \mathcal{F}} \| \hat g - h \|_2.
$$
Note that $g \in \mathcal{F}$, so $\|\tilde g - \hat g\|_2 \le \|g - \hat g\|_2$ by the definition of the projection, and we can bound the distance between $\tilde g$ and $g$ by
$$
\| \tilde g - g \|_2 \le \| \tilde g - \hat g \|_2 + \| \hat g - g \|_2 \le 2 \, \| \hat g - g \|_2.
$$
Let $\hat f = 2 \tilde g - 1$, which is a valid pdf since $\tilde g$ is lower bounded by $1/2$. As a result, for every pdf $f \in \mathcal{P}_\beta$ there is a corresponding $g = \frac{1+f}{2} \in \tilde{\mathcal{P}}_\beta$ which has a good estimator $\hat g$, and from $\hat g$ one can construct a good estimator $\hat f$ of $f$, in the sense that
$$
\| f - \hat f \|_2 = 2 \, \| g - \tilde g \|_2 \le 4 \, \| g - \hat g \|_2.
$$
Therefore,
$$
R_n^*(\mathcal{P}_\beta) = \inf_{\hat f} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \| f - \hat f \|_2^2
\le 16 \inf_{\hat g} \sup_{f \in \mathcal{P}_\beta} \mathbb{E}\, \Big\| \tfrac{1+f}{2} - \hat g \Big\|_2^2
\le 16 \inf_{\hat g} \sup_{g \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, \| g - \hat g \|_2^2
= 16 \, R_n^*(\tilde{\mathcal{P}}_\beta),
$$
where the first inequality is due to the construction of $\hat f$, and the second inequality is due to $\{ \frac{1+f}{2} : f \in \mathcal{P}_\beta \} \subset \tilde{\mathcal{P}}_\beta$. Therefore Lemma 17.1 follows from (17.3) and (17.4).

It is then equivalent to prove
$$
R_n^*(\tilde{\mathcal{P}}_\beta) = \inf_{\hat f} \sup_{f \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, \| f - \hat f \|_2^2 \asymp n^{-\frac{2\beta}{2\beta+1}}.
$$
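The containment $\{\frac{1+f}{2} : f \in \mathcal{P}_\beta\} \subset \tilde{\mathcal{P}}_\beta$ invoked in the second inequality of the proof is easy to verify; the following short check (written out for these notes, using the class constants from Section 17.1) spells it out.

```latex
\begin{align*}
\text{Let } f \in \mathcal{P}_\beta \text{ and } g = \tfrac{1+f}{2}. \text{ Then}\quad
& g \ \ge\ \tfrac12, \qquad \int_0^1 g \ =\ \tfrac{1+1}{2} \ =\ 1, \qquad g \ \le\ \tfrac{1+2}{2} \ =\ \tfrac32 \ \le\ 2,\\
& \big|g^{(m)}(x) - g^{(m)}(y)\big| \ =\ \tfrac12 \big|f^{(m)}(x) - f^{(m)}(y)\big| \ \le\ \tfrac12 |x-y|^{\alpha} \ \le\ |x-y|^{\alpha}.
\end{align*}
```

Hence $g$ is a pdf on $[0,1]$ that is bounded by $2$, has an $\alpha$-Hölder $m$-th derivative, and is lower bounded by $1/2$, so $g \in \mathcal{P}_\beta \cap \mathcal{F} = \tilde{\mathcal{P}}_\beta$.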

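The reduction in the proof of Lemma 17.1 can also be made concrete numerically. Below is a minimal sketch (ours, not from the lecture): the samples are mixed with uniforms to simulate draws from $g = \frac{1+f}{2}$, a plain histogram stands in for an arbitrary black-box estimator of $g$, the estimate is $L_2$-projected onto $\mathcal{F}$ (for a piecewise-constant estimate the projection takes the water-filling form $\max(\hat g + c, \tfrac12)$ with $c$ chosen so that the total mass is one, found here by bisection), and the output is $\hat f = 2\tilde g - 1$. All function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_estimate(z, bins=20):
    """Black-box density estimate on [0, 1]: a regular histogram (placeholder)."""
    counts, edges = np.histogram(z, bins=bins, range=(0.0, 1.0))
    heights = counts / (len(z) * np.diff(edges))
    return edges, heights

def project_onto_F(edges, heights):
    """L2 projection of a piecewise-constant density onto F = {h >= 1/2, integral = 1}.
    The projection is max(heights + c, 1/2); find c by bisection on the total mass."""
    w = np.diff(edges)

    def total_mass(c):
        return np.sum(w * np.maximum(heights + c, 0.5))

    lo, hi = -np.max(heights) - 1.0, 1.0   # total_mass(lo) = 0.5 < 1 < total_mass(hi)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if total_mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(heights + 0.5 * (lo + hi), 0.5)

def reduced_estimator(x):
    """The reduction of Lemma 17.1: estimate f from X^n via g = (1 + f)/2."""
    n = len(x)
    u = rng.uniform(size=n)
    z = np.where(rng.random(n) < 0.5, u, x)   # Z_i ~ g = (1 + f)/2
    edges, g_hat = histogram_estimate(z)      # any estimator of g would do here
    g_tilde = project_onto_F(edges, g_hat)    # ||g_tilde - g|| <= 2 ||g_hat - g||
    return edges, 2.0 * g_tilde - 1.0         # f_hat = 2*g_tilde - 1 is a valid pdf

# Example: f(x) = x + 1/2 on [0, 1] (1-Lipschitz, bounded by 3/2, hence f is in P_1).
u0 = rng.uniform(size=5000)
x = (np.sqrt(1.0 + 8.0 * u0) - 1.0) / 2.0     # inverse-CDF sampling from f
edges, f_hat = reduced_estimator(x)
print("f_hat integrates to", np.sum(f_hat * np.diff(edges)))
```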
Upper bound. First we use the capacity to upper bound the minimax risk. On one hand, for pdfs $f$ and $g$ on $[0,1]$ that are bounded by $2$, the squared $L_2$ distance is controlled by the squared Hellinger distance and hence by the KL divergence:
$$
\| f - g \|_2^2 = \int_0^1 (\sqrt{f} + \sqrt{g})^2 (\sqrt{f} - \sqrt{g})^2 \le 8 \, H^2(f, g) \le 8 \, D(f \| g).
$$
As a result,
$$
R_n^*(\tilde{\mathcal{P}}_\beta) = \inf_{\hat g} \sup_{g \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, \| g - \hat g \|_2^2 \lesssim \inf_{\hat g} \sup_{g \in \tilde{\mathcal{P}}_\beta} \mathbb{E}\, D(g \| \hat g) = R_{n,\mathrm{KL}}^*(\tilde{\mathcal{P}}_\beta). \tag{17.5}
$$
On the other hand, one can bound the minimax risk under the KL divergence by (17.1), where the capacity between $g$ and $X^n$ can be bounded via
$$
C_n \le \inf_{\epsilon > 0} \{ \log N_{\mathrm{KL}}(\epsilon) + n\epsilon \}
\lesssim \inf_{\epsilon > 0} \{ \log N_{2}(\sqrt{\epsilon}) + n\epsilon \}
\lesssim \inf_{\epsilon > 0} \{ \epsilon^{-\frac{1}{2\beta}} + n\epsilon \}
\asymp n^{\frac{1}{1+2\beta}}.
$$
The first step uses the connection between the KL divergence and the $L_2$ distance: since every $g \in \tilde{\mathcal{P}}_\beta$ is lower bounded by $1/2$, we have $D(f \| g) \le \chi^2(f \| g) = \int \frac{(f-g)^2}{g} \le 2 \| f - g \|_2^2$, so an $L_2$-covering at scale $\sqrt{\epsilon/2}$ is a KL-covering at scale $\epsilon$. The second step comes from the Kolmogorov-Tikhomirov theorem, which gives $\log N_2(\tilde{\mathcal{P}}_\beta, \delta) \lesssim \delta^{-1/\beta}$; the infimum is attained by balancing the two terms at $\epsilon \asymp n^{-\frac{2\beta}{2\beta+1}}$. Therefore, together with (17.5), the upper bound is proved:
$$
R_n^*(\tilde{\mathcal{P}}_\beta) \lesssim \frac{C_n}{n} \lesssim \frac{n^{\frac{1}{1+2\beta}}}{n} = n^{-\frac{2\beta}{1+2\beta}}. \tag{17.6}
$$

Lower bound. Next we lower bound $R_n^*(\tilde{\mathcal{P}}_\beta)$ by Fano's inequality. Due to the relation between covering and packing numbers, we know
$$
\log M(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon) \ge \log N(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon) \asymp \epsilon^{-1/\beta},
$$
where the second step is again due to the Kolmogorov-Tikhomirov theorem. Let $\epsilon \asymp n^{-\frac{\beta}{1+2\beta}}$ with a sufficiently small constant. Fano's inequality tells us
$$
R_n^*(\tilde{\mathcal{P}}_\beta) \gtrsim \epsilon^2 \left( 1 - \frac{I(g; X^n) + \log 2}{\log M(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon)} \right)
\ge \epsilon^2 \left( 1 - \frac{C_n + \log 2}{\log M(\tilde{\mathcal{P}}_\beta, \|\cdot\|_2, \epsilon)} \right)
\gtrsim \epsilon^2 \left( 1 - c \, n^{\frac{1}{1+2\beta}} \epsilon^{1/\beta} \right)
\gtrsim \epsilon^2 \asymp n^{-\frac{2\beta}{1+2\beta}}. \tag{17.7}
$$
The proof is complete by combining (17.6) and (17.7).

We make some remarks on the proof.

Remark 17.4. We have learned two ways to construct a density estimator:

- the average of the predictive density estimators;
- the maximum likelihood estimator.

Neither of these is computationally efficient. In practice, the kernel density estimator (KDE) is used instead: given samples $X_1, \dots, X_n$, one can first estimate the density by the empirical measure
$$
\hat\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}.
$$
This estimator, however, is not a pdf. To address this issue, one puts a kernel instead of a spike at each sample point, i.e.,
$$
\hat f_\Omega = \Omega * \hat\mu,
$$
where $\Omega$ is the kernel function, chosen to match the smoothness constraint of the pdfs. (A small numerical sketch is given at the end of these notes.)

Remark 17.5. The result in Theorem 17.1 can be generalized to the $L_p$ loss for $1 \le p < \infty$, following the same analysis.

17.2 Estimator Based on Pairwise Comparison

When we proved the lower bound in Theorem 17.1, we used the fact that if the elements of an $\epsilon$-packing of a set $\Theta$ cannot be reliably tested against each other, then $\theta \in \Theta$ cannot be estimated. Conversely, if such tests do exist, can $\theta \in \Theta$ be estimated?

First we study when testing is possible.

- In a binary hypothesis test between two distributions, where $\phi = 0$ indicates the data comes from $P$ and $\phi = 1$ indicates the data comes from $Q$ (with $\phi$ equiprobable), the optimal probability of error is governed by the total variation between $P$ and $Q$:
$$
\min_{\hat\phi} \mathbb{P}[\hat\phi \ne \phi] = \frac{1}{2} \big( 1 - d_{\mathrm{TV}}(P, Q) \big).
$$
- In a composite binary hypothesis test between two sets of distributions, where $\phi = 0$ indicates the data comes from some $P \in \mathcal{P}$ and $\phi = 1$ indicates the data comes from some $Q \in \mathcal{Q}$, the minimax probability of error is governed by the total variation between the convex hulls:
$$
\min_{\hat\phi} \sup_{P \in \mathcal{P}, \, Q \in \mathcal{Q}} \mathbb{P}[\hat\phi \ne \phi] = \frac{1}{2} \Big( 1 - \min_{P \in \mathrm{co}(\mathcal{P}), \, Q \in \mathrm{co}(\mathcal{Q})} d_{\mathrm{TV}}(P, Q) \Big).
$$

Remark 17.6. In general, minimizing $d_{\mathrm{TV}}(P, Q)$ over the convex hulls of $\mathcal{P}$ and $\mathcal{Q}$ is a hard problem.
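To close the loop on Remark 17.4, here is the minimal KDE sketch referred to there (ours, not from the lecture). The Gaussian kernel and the bandwidth $h \asymp n^{-1/(2\beta+1)}$ are illustrative choices: for $\beta \le 2$ this bandwidth balances the squared bias (of order $h^{2\beta}$) against the variance (of order $\frac{1}{nh}$), recovering the $n^{-2\beta/(2\beta+1)}$ rate of Theorem 17.1; higher smoothness would require higher-order kernels, and boundary effects on $[0,1]$ are ignored in this sketch.

```python
import numpy as np

# A minimal kernel density estimator, following Remark 17.4: smooth the empirical
# measure (1/n) * sum_i delta_{X_i} with a kernel. The Gaussian kernel and the
# bandwidth h ~ n^(-1/(2*beta+1)) below are illustrative choices (suitable for
# beta <= 2); boundary effects on [0, 1] are ignored.

def kde(samples, grid, beta=1.0):
    n = len(samples)
    h = n ** (-1.0 / (2.0 * beta + 1.0))              # bias-variance balancing bandwidth
    diffs = (grid[:, None] - samples[None, :]) / h
    kernel_vals = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernel_vals.mean(axis=1) / h               # f_hat(t) = (1/(n h)) sum_i K((t - X_i)/h)

rng = np.random.default_rng(1)
u = rng.uniform(size=2000)
x = (np.sqrt(1.0 + 8.0 * u) - 1.0) / 2.0              # samples from f(x) = x + 1/2 on [0, 1]
grid = np.linspace(0.1, 0.9, 161)                     # interior grid, away from the boundary
f_hat = kde(x, grid)
print("approximate interior squared L2 error:", np.mean((f_hat - (grid + 0.5)) ** 2) * 0.8)
```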

We make some remarks on the proof. Remark 7.4. We have learned two ways to construct a density estimator: The mean of predictive density estimators; The maximum likelihood estimator. None of those is computationally efficient. In practice, kernel density estimator (KDE) is proposed: let X,..., X n be the n samples, one can estimate the density by its histogram, ˆ = n n δ Xi. i= This estimator, however, is not a pdf. To address this issue, one put a kernel instead of a spike over each sample points, i.e., Ω = Ω ˆ, where Ω is the kernel function and is chosen to satisfy the smooth constraint of the pdfs. Remark 7.5. The result in Theorem 7. can be generalized to L p for p <, following the same analysis. 7. Estimator Based On Pairwise Comparison When we prove the lower bound in Theorem 7., we use the property that if an ɛ-covering of a set Θ cannot be tested, then θ Θ cannot be estimated. On the contrary, if the ɛ-covering can be tested, can θ Θ be estimated? First we study when the ɛ-covering can be tested. In a binary classification problem over two distributions, where φ = 0 indicates the data comes from P and φ = indicates the data comes from Q. The error is lower bounded by the total variation between P and Q, i.e., min p( ˆφ φ) = d TV (P, Q). ˆφ In a binary classification problem over two sets of distributions, where φ = 0 indicates the data comes from P P and φ = indicates the data comes from Q Q. The error is lower bounded by the total variation between P and Q, i.e., min p( ˆφ φ) = min d TV(P, Q). ˆφ P co(p),q co(q) Remark 7.6. In general, the minimization of d TV (P, Q) over the convex hulls of P and Q is a hard problem. References 5