Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n


Jiantao Jiao (Stanford EE)
Joint work with: Kartik Venkat (Stanford EE), Yanjun Han (Tsinghua EE), Tsachy Weissman (Stanford EE)
Information Theory, Learning, and Big Data Workshop, March 17th, 2015

Problem Setting
Observe n samples from a discrete distribution P with support size S. We want to estimate
H(P) = -\sum_{i=1}^{S} p_i \ln p_i  (Shannon 48)
F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha,  \alpha > 0  (Diversity, Simpson index, Rényi entropy, etc.)
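For concreteness, here is a minimal sketch (assuming NumPy; not part of the original slides) of the two target functionals evaluated on a known distribution P:

```python
import numpy as np

def shannon_entropy(p):
    """H(P) = -sum_i p_i ln p_i in nats; zero-probability symbols contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def power_sum(p, alpha):
    """F_alpha(P) = sum_i p_i^alpha for alpha > 0 (Simpson index at alpha = 2, etc.)."""
    p = np.asarray(p, dtype=float)
    return np.sum(p[p > 0] ** alpha)

# Example: the uniform distribution on S symbols has H = ln S and F_alpha = S^(1 - alpha).
S = 1000
p_uniform = np.full(S, 1.0 / S)
print(shannon_entropy(p_uniform), np.log(S))
print(power_sum(p_uniform, 0.5), S ** 0.5)
```

The estimation problem, of course, is to approximate these quantities from samples of P rather than from P itself.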

What was known?
Start with entropy: H(P) = -\sum_{i=1}^{S} p_i \ln p_i.
Question: what is the optimal estimator for H(P) given n samples?
No unbiased estimator exists, and the minimax estimator cannot be computed in closed form (Lehmann and Casella 98).
Classical asymptotics: what is the optimal estimator for H(P) as n \to \infty?

Classical problem
Empirical entropy: H(P_n) = -\sum_{i=1}^{S} \hat{p}_i \ln \hat{p}_i, where \hat{p}_i is the empirical frequency of symbol i.
H(P_n) is the Maximum Likelihood Estimator (MLE).
Theorem: The MLE H(P_n) is asymptotically efficient. (Hájek–Le Cam theory)
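As a quick illustration, a sketch (assuming NumPy; the sizes are only for illustration) of the plug-in estimator; when n is comparable to S it noticeably underestimates the true entropy ln S:

```python
import numpy as np

def empirical_entropy(samples):
    """Plug-in (MLE) entropy estimate H(P_n) from i.i.d. samples of a discrete variable."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()
    return -np.sum(p_hat * np.log(p_hat))

rng = np.random.default_rng(0)
S, n = 10_000, 5_000
samples = rng.integers(0, S, size=n)          # n samples from the uniform distribution on S symbols
print(empirical_entropy(samples), np.log(S))  # plug-in estimate vs. the true value ln S
```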

Is there something missing?
The optimal estimator for finite samples is unknown :(
Question: how about using the MLE when n is finite?
We will show it is a very bad idea in general.

Our evaluation criterion: minimax decision theory
Denote by M_S the set of distributions with support size S.
Maximum risk: R^{max}_n(F, \hat{F}) = \sup_{P \in M_S} E_P[(F(P) - \hat{F})^2]
Minimax risk: \inf_{\hat{F}} \sup_{P \in M_S} E_P[(F(P) - \hat{F})^2]
Notation: a_n \asymp b_n means 0 < c \le a_n/b_n \le C < \infty; a_n \gtrsim b_n means a_n/b_n \ge C > 0.
L_2 risk = Bias^2 + Variance: E_P[(F(P) - \hat{F})^2] = (F(P) - E_P \hat{F})^2 + \mathrm{Var}(\hat{F})

Shall we analyze the MLE non-asymptotically?
Theorem (J., Venkat, Han, Weissman 14):
R^{max}_n(H, H(P_n)) \asymp \underbrace{S^2/n^2}_{\text{Bias}^2} + \underbrace{\ln^2 S / n}_{\text{Variance}},  for n \gtrsim S.
n \gtrsim S is what consistency requires, and the bias term dominates if n is not too large compared to S.
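A small simulation sketch (assuming NumPy and a uniform P; the sizes are illustrative) showing that the plug-in bias, not the variance, is what hurts when n is comparable to S:

```python
import numpy as np

rng = np.random.default_rng(0)

def plug_in_entropy(counts):
    p_hat = counts[counts > 0] / counts.sum()
    return -np.sum(p_hat * np.log(p_hat))

S = 5000
true_H = np.log(S)
for n in [S // 2, S, 5 * S, 50 * S]:
    est = np.array([plug_in_entropy(rng.multinomial(n, np.full(S, 1.0 / S))) for _ in range(20)])
    # For small n the (negative) bias dwarfs the standard deviation across trials.
    print(n, round(est.mean() - true_H, 3), round(est.std(), 3))
```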

Can we reduce the bias?
Taylor series? (Taylor expansion of H(P_n) around P_n = P) Requires n \gtrsim S (Paninski 03)
Jackknife? Requires n \gtrsim S (Paninski 03)
Bootstrap? Requires at least n \gtrsim S (Han, J., Weissman 15)
Bayes estimator under a Dirichlet prior? Requires at least n \gtrsim S (Han, J., Weissman 15)
Plug-in of a Dirichlet-smoothed distribution? Requires at least n \gtrsim S (Han, J., Weissman 15)

There are many more...
Coverage-adjusted estimator (Chao, Shen 03; Vu, Yu, Kass 07)
BUB estimator (Paninski 03)
Shrinkage estimator (Hausser, Strimmer 09)
Grassberger estimator (Grassberger 08)
NSB estimator (Nemenman, Shafee, Bialek 02)
B-Splines estimator (Daub et al. 04)
Wagner, Viswanath, Kulkarni 11
Ohannessian, Tan, Dahleh 11
...
Recent breakthrough (Valiant and Valiant 11): the exact phase transition of entropy estimation is n \asymp S/\ln S.
Their linear-programming-based estimator is not shown to achieve the minimax rates (dependence on \epsilon); another of their estimators achieves it only for n \gtrsim S^{1.03}/\ln S. It is not clear how to handle other functionals.

Systematic improvement
Question: can we find a systematic methodology to improve the MLE and achieve the minimax rates for a wide class of functionals?
George Pólya: the more general problem may be easier to solve than the special problem.

Start from first principles
X \sim B(n, p); if we use g(X) to estimate f(p):
\mathrm{Bias}(g(X)) = f(p) - E_p[g(X)] = f(p) - \sum_{j=0}^{n} g(j) \binom{n}{j} p^j (1-p)^{n-j}
Only polynomials of degree at most n can be estimated without bias:
E_p\left[\frac{X(X-1)\cdots(X-r+1)}{n(n-1)\cdots(n-r+1)}\right] = p^r,  0 \le r \le n
Bias corresponds to polynomial approximation error.
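A quick numerical check (a sketch assuming NumPy and SciPy; the particular n, p, r are arbitrary) that the falling-factorial statistic is indeed unbiased for p^r under the binomial model:

```python
import numpy as np
from scipy.stats import binom

def falling_factorial_estimate(x, n, r):
    """X(X-1)...(X-r+1) / (n(n-1)...(n-r+1)), unbiased for p^r when X ~ Binomial(n, p)."""
    num = np.prod([float(x - k) for k in range(r)])
    den = np.prod([float(n - k) for k in range(r)])
    return num / den

n, p, r = 50, 0.3, 3
xs = np.arange(n + 1)
pmf = binom.pmf(xs, n, p)
exact_mean = sum(pmf[x] * falling_factorial_estimate(x, n, r) for x in xs)
print(exact_mean, p ** r)  # equal up to floating-point error: zero bias
```

Non-polynomial functionals such as -p ln p admit no such exact representation, which is why the bias question becomes one of polynomial approximation.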

Minimize the maximum bias
What if we choose g(X) such that g^* = \arg\min_{g} \sup_{p \in [0,1]} |\mathrm{Bias}(g(X))|?
Paninski 03: it does not work for entropy. (The variance is too large!)
Best polynomial approximation: P_K^*(x) = \arg\min_{P_K \in \mathrm{Poly}_K} \sup_{x \in D} |f(x) - P_K(x)|
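As an illustration of the approximation step (a sketch only: Chebyshev interpolation is a near-best stand-in for the true minimax polynomial that a Remez-type routine or Chebfun would compute), one can approximate f(x) = -x ln x on [0, 1] by a low-degree polynomial and inspect the uniform error:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def f(x):
    return np.where(x > 0, -x * np.log(np.where(x > 0, x, 1.0)), 0.0)

K = 20  # approximation degree; the estimator below uses degree ~ ln n on [0, ln n / n]
# Chebyshev nodes on [0, 1], mapped to [-1, 1] for the Chebyshev basis
nodes = 0.5 * (np.cos((2 * np.arange(K + 1) + 1) * np.pi / (2 * (K + 1))) + 1)
coeffs = C.chebfit(2 * nodes - 1, f(nodes), K)   # interpolation: K+1 points, degree K
grid = np.linspace(0, 1, 10001)
err = np.max(np.abs(f(grid) - C.chebval(2 * grid - 1, coeffs)))
print(err)  # the uniform error decays roughly like 1/K^2 for -x ln x
```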

We only approximate when the problem is hard!
(Figure: the treatment of f(p) is split at \hat{p}_i \approx \ln n / n.)
Nonsmooth regime (\hat{p}_i below about \ln n / n): use an unbiased estimate of the best polynomial approximation of f of order \ln n.
Smooth regime (\hat{p}_i above about \ln n / n): use the bias-corrected plug-in f(\hat{p}_i) - \frac{f''(\hat{p}_i)\,\hat{p}_i(1-\hat{p}_i)}{2n}.
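Putting the two regimes together for entropy, here is a simplified sketch (not the authors' released code: the threshold constant, the approximation order, and the use of Chebyshev-node interpolation instead of the exact best approximation are all illustrative assumptions, and details such as sample splitting are omitted):

```python
import numpy as np

def entropy_estimate(counts):
    """Two-regime entropy estimator sketch: polynomial-approximation-based estimates for
    rarely observed symbols, bias-corrected plug-in for frequently observed ones."""
    counts = np.asarray(counts)
    counts = counts[counts > 0]
    n = int(counts.sum())
    K = max(1, int(np.log(n)))               # approximation order ~ ln n
    thresh = min(1.0, 2.0 * np.log(n) / n)   # nonsmooth/smooth split ~ ln n / n

    # Approximate f(p) = -p ln p on [0, thresh] by a degree-K polynomial in t = p / thresh
    # (Chebyshev nodes as a stand-in for the exact best polynomial approximation).
    t_nodes = 0.5 * (np.cos((2 * np.arange(K + 1) + 1) * np.pi / (2 * (K + 1))) + 1)
    p_nodes = thresh * t_nodes
    f_nodes = -p_nodes * np.log(np.maximum(p_nodes, 1e-300))
    b = np.polynomial.polynomial.polyfit(t_nodes, f_nodes, K)  # f(p) ~ sum_k b_k (p/thresh)^k

    total = 0.0
    for c in counts:
        p_hat = c / n
        if p_hat < thresh:
            # Unbiased estimate of sum_k b_k (p/thresh)^k: replace p^k by the
            # falling-factorial statistic, which is unbiased for p^k under Binomial(n, p).
            est = b[0]
            for k in range(1, K + 1):
                num = np.prod([float(c - j) for j in range(k)])
                den = np.prod([float(n - j) for j in range(k)])
                est += b[k] * (num / den) / thresh ** k
            total += est
        else:
            # Bias-corrected plug-in: for f(p) = -p ln p, f''(p) = -1/p, so the correction
            # -f''(p_hat) p_hat (1 - p_hat) / (2n) equals (1 - p_hat) / (2n).
            total += -p_hat * np.log(p_hat) + (1 - p_hat) / (2 * n)
    return total
```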

Why the \ln n / n threshold and the \ln n approximation order?

Approximation theory in functional estimation
Lepski, Nemirovski, Spokoiny 99; Cai and Low 11; Wu and Yang 14 (entropy estimation lower bound)
Duality with shrinkage (J., Venkat, Han, Weissman 14): a new field to be explored!

A class of functionals
Note F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha, \alpha > 0. [J., Venkat, Han, Weissman 14] showed:

Functional                        Minimax L_2 rate                                L_2 rate of MLE
H(P)                              S^2/(n \ln n)^2 + \ln^2 S / n                   S^2/n^2 + \ln^2 S / n
F_\alpha(P), 0 < \alpha \le 1/2   S^2/(n \ln n)^{2\alpha}                         S^2/n^{2\alpha}
F_\alpha(P), 1/2 < \alpha < 1     S^2/(n \ln n)^{2\alpha} + S^{2-2\alpha}/n       S^2/n^{2\alpha} + S^{2-2\alpha}/n
F_\alpha(P), 1 < \alpha < 3/2     (n \ln n)^{-2(\alpha-1)}                        n^{-2(\alpha-1)}
F_\alpha(P), \alpha \ge 3/2       n^{-1}                                          n^{-1}

Effective sample size enlargement: the minimax rate-optimal estimator with n samples performs as the MLE does with n \ln n samples.

Sample complexity perspectives

Functional                     MLE                     Optimal
H(P)                           \Theta(S)               \Theta(S/\ln S)
F_\alpha(P), 0 < \alpha < 1    \Theta(S^{1/\alpha})    \Theta(S^{1/\alpha}/\ln S)
F_\alpha(P), \alpha > 1        \Theta(1)               \Theta(1)

Existing literature on F_\alpha(P) for 0 < \alpha < 2: minimax rates unknown, sample complexity unknown, optimal schemes unknown.

Unique features of our entropy estimator
One realization of a general methodology.
Achieves the minimax rate (lower bound due to Wu and Yang 14): S^2/(n \ln n)^2 + \ln^2 S / n.
Agnostic to the knowledge of S.

Unique features of our entropy estimator
Easily implementable: approximation order \ln n; a 500th-order approximation takes 0.46 s on a PC (Chebfun).
Empirical performance outperforms every available entropy estimator. (Code to be released this week; available upon request.)
Some statisticians raised interesting questions: "We may not use this estimator unless you prove it is adaptive."

Adaptive estimation
Near-minimax estimator: \sup_{P \in M_S} E_P(\hat{H}_{Our} - H(P))^2 \asymp \inf_{\hat{H}} \sup_{P \in M_S} E_P(\hat{H} - H(P))^2
What if we know a priori that H(P) \le H? We want
\sup_{P \in M_S(H)} E_P(\hat{H}_{Our} - H(P))^2 \asymp \inf_{\hat{H}} \sup_{P \in M_S(H)} E_P(\hat{H} - H(P))^2,
where M_S(H) = \{P : H(P) \le H, P \in M_S\}.
Can our estimator satisfy all these requirements without knowing S and H?

Our estimator is adaptive!
Han, J., Weissman 15 showed:
Theorem: \inf_{\hat{H}} \sup_{P \in M_S(H)} E_P|\hat{H} - H(P)|^2 \asymp \frac{S^2}{(n\ln n)^2} + \frac{H\ln S}{n} if \frac{S}{\ln S} \lesssim \frac{e\,nH}{\ln n}, and \asymp \left[\frac{H}{\ln S}\ln\!\left(\frac{S\ln n}{nH\ln S}\right)\right]^2 otherwise. (1)
For small H, achieving root MSE \epsilon requires only on the order of H\,S^{1-\epsilon/H}/\ln S samples, much smaller than S/\ln S; the MLE precisely requires a logarithmic factor more.
The n \to n \ln n phenomenon is still true!

Understanding the generality
Can you deal with cases where the non-differentiable regime is not just a point? We consider L_1(P, Q) = \sum_{i=1}^{S} |p_i - q_i|.
Breakthrough by Valiant and Valiant 11: the MLE requires n \asymp S samples, the optimal is S/\ln S!
However, VV 11's construction is only proven optimal when n \asymp S/\ln S.

Applying our general methodology
The minimax L_2 rate is \frac{S}{n \ln n}, while the MLE achieves \frac{S}{n}. The n \to n \ln n phenomenon is still true! We show that VV 11 cannot achieve it when n \gtrsim S.
However, is it easy to do? The nonsmooth regime is a whole line! How to cover it?
Use a small square [0, \frac{\ln n}{n}]^2? Use a band? Use some other shape?

Bad news
Best polynomial approximation in 2D is not unique. There exists no efficient algorithm to compute it. Some polynomials that achieve the best approximation rate cannot be used in our methodology. All of the nonsmooth-regime designs mentioned above fail.
Solution: hints are in our paper "Minimax estimation of functionals of discrete distributions", or wait for the soon-to-be-finished paper "Minimax estimation of divergence functions".

Significant difference in phase transitions
One may think: who cares about a \ln n improvement?
Take the sequence n = 2S/\ln S, with S sampled equally (on a log scale) from 10^2 to 10^7. For each (n, S), sample 20 times from a uniform distribution. The resulting scale of S/n: S/n \in [2.2727, 8.0590].

Significant improvement!
(Figure: MSE versus S/n for the experiment above, comparing the JVHW estimator against the MLE.)
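A sketch of this experiment (reusing the `entropy_estimate` sketch from the earlier slide; the authors' released JVHW code is the real implementation, and the S range is reduced here so the loop runs quickly):

```python
import numpy as np

rng = np.random.default_rng(0)

def plug_in(counts):
    p_hat = counts[counts > 0] / counts.sum()
    return -np.sum(p_hat * np.log(p_hat))

for S in np.logspace(2, 5, 6).astype(int):      # the slide sweeps S from 1e2 to 1e7
    n = int(2 * S / np.log(S))
    true_H = np.log(S)
    err_mle, err_poly = [], []
    for _ in range(20):
        counts = rng.multinomial(n, np.full(S, 1.0 / S))
        err_mle.append((plug_in(counts) - true_H) ** 2)
        err_poly.append((entropy_estimate(counts) - true_H) ** 2)  # sketch defined above
    print(S, n, round(S / n, 2), np.mean(err_mle), np.mean(err_poly))
```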

Mutual Information I(P_{XY})
Mutual information: I(P_{XY}) = H(P_X) + H(P_Y) - H(P_{XY})
Theorem (J., Venkat, Han, Weissman 14): the n \to n \ln n effective sample size enlargement also holds.

Learning Tree Graphical Models
d-dimensional random vector with alphabet size S: X = (X_1, X_2, \ldots, X_d). Suppose the p.m.f. has a tree structure:
P_X = \prod_{i=1}^{d} P_{X_i | X_{\pi(i)}}
(Figure: diagram of an approximating dependence tree structure; in this example P_T = P_{X_6} P_{X_1|X_6} P_{X_3|X_6} P_{X_4|X_3} P_{X_2|X_3} P_{X_5|X_2}.)
Given n i.i.d. samples from P_X, estimate the underlying tree structure.

Chow–Liu algorithm (1968)
Natural approach: Maximum Likelihood! Chow and Liu 68 solved it (2000 citations):
- Maximum Weight Spanning Tree (MWST)
- Empirical mutual information as edge weights
Theorem (Chow, Liu 68): E_{MLE} = \arg\max_{E_Q : Q \text{ is a tree}} \sum_{e \in E_Q} I(\hat{P}_e)
Our approach: replace the empirical mutual information by better estimates!
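A minimal Chow–Liu sketch (assuming NumPy and NetworkX; `mi_estimate` is a hypothetical hook: the plug-in below gives the original algorithm, and passing a minimax-rate-optimal mutual information estimator gives the modified version of these slides):

```python
import itertools
import numpy as np
import networkx as nx

def plug_in_mi(x, y):
    """Plug-in mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    def ent(labels):
        _, counts = np.unique(labels, return_counts=True, axis=0)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))
    return ent(x) + ent(y) - ent(np.stack([x, y], axis=1))

def chow_liu_tree(data, mi_estimate=plug_in_mi):
    """data: (n, d) array of discrete samples. Returns the edges of the maximum-weight
    spanning tree using pairwise mutual information estimates as edge weights."""
    _, d = data.shape
    g = nx.Graph()
    for i, j in itertools.combinations(range(d), 2):
        g.add_edge(i, j, weight=mi_estimate(data[:, i], data[:, j]))
    return sorted(nx.maximum_spanning_tree(g).edges())
```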

Original CL vs. Modified CL
We set d = 7, S = 300, a star graph, and sweep n from 1k to 55k.
(Figure: expected wrong-edges ratio versus the number of samples n for the original Chow–Liu algorithm.)

8-fold improvement! (and even more)
Same setting: d = 7, S = 300, a star graph, n swept from 1k to 55k.
(Figure: expected wrong-edges ratio versus n for the original and modified Chow–Liu algorithms.)

What we did not talk about
- How to analyze the bias of the MLE
- Why the bootstrap and jackknife fail
- Extensions to multivariate settings (\ell_p distance)
- Extensions to nonparametric estimation
- Extensions to non-i.i.d. models
- Applications in machine learning, medical imaging, computer vision, genomics, etc.

Hindsight
Unbiased estimation: the approximation basis must satisfy the unbiasedness equation (Kolmogorov 50); the basis should have good approximation properties; the additional variance incurred by the approximation should not be large.
The theory of functional estimation draws on the analytical properties of functions and on the whole field of approximation theory, or even more.

Related work
- O. Lepski, A. Nemirovski, and V. Spokoiny 99, "On estimation of the L_r norm of a regression function"
- T. Cai and M. Low 11, "Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional"
- Y. Wu and P. Yang 14, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation"
- J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi 14, "The complexity of estimating Rényi entropy"

Our work
- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Minimax estimation of functionals of discrete distributions", to appear in IEEE Transactions on Information Theory
- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Maximum likelihood estimation of functionals of discrete distributions", available on arXiv
- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Beyond maximum likelihood: from theory to practice", available on arXiv
- J. Jiao, Y. Han, T. Weissman, "Minimax estimation of divergence functions", in preparation
- Y. Han, J. Jiao, T. Weissman, "Adaptive estimation of Shannon entropy", available on arXiv
- Y. Han, J. Jiao, T. Weissman, "Does Dirichlet prior smoothing solve the Shannon entropy estimation problem?", available on arXiv
- Y. Han, J. Jiao, T. Weissman, "Bias correction using Taylor series, Bootstrap, and Jackknife", in preparation
- Y. Han, J. Jiao, T. Weissman, "How to bound the gap in Jensen's inequality?", in preparation

Thank you!