Maximum Likelihood Approach for Symmetric Distribution Property Estimation


Maximum Likelihood Approach for Symmetric Distribution Property Estimation. Jayadev Acharya (Cornell), Hirakendu Das (Yahoo), Alon Orlitsky (UCSD), Ananda Suresh (Google).

Property estimation. $p$: unknown discrete distribution over $k$ elements; $f(p)$: a property of $p$; $\varepsilon$: accuracy parameter; $\delta$: error probability. Given: access to independent samples from $p$. Goal: estimate $f(p)$ to within $\pm\varepsilon$ with probability $> 1 - \delta$. Usually $\delta$ is a constant, say 0.1. Focus of the talk: large $k$.

Sample complexity. $X_1, X_2, \ldots, X_n$: independent samples from $p$; $\hat f(X_1^n)$: an estimate of $f(p)$. Sample complexity of $\hat f$: $S(\hat f, \varepsilon, k, \delta) = \min\{n : \forall p,\ \Pr(|\hat f(X_1^n) - f(p)| > \varepsilon) < \delta\}$. Sample complexity of $f$: $S(f, \varepsilon, k, \delta) = \min_{\hat f} S(\hat f, \varepsilon, k, \delta)$.

Symmetric properties. Permuting symbol labels does not change $f(p)$. Examples: entropy $H(p) = -\sum_x p_x \log p_x$, support size $S(p) = \sum_x \mathbf{1}_{p_x > 0}$, Rényi entropy, distance to uniformity, unseen symbols, divergences, etc.
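
As a quick illustration (my own sketch, not from the talk), the following Python snippet computes entropy and support size and checks that both are unchanged when the symbol labels are permuted:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p_x log p_x (in nats), ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def support_size(p):
    """Number of symbols with positive probability."""
    return int(np.sum(np.asarray(p) > 0))

p = np.array([0.5, 0.3, 0.2, 0.0])
q = p[np.random.permutation(len(p))]   # relabel (permute) the symbols

# Symmetric properties depend only on the multiset of probabilities.
assert np.isclose(entropy(p), entropy(q))
assert support_size(p) == support_size(q)
```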

Sequence maximum likelihood. $N_x$: number of times $x$ appears in $X_1^n$ (multiplicity). $\hat p$: empirical distribution, $\hat p_x = N_x / n$; plug-in estimate $\hat f(X_1^n) = f(\hat p)$. Empirical estimators are maximum likelihood estimators: $\hat p = \arg\max_p p(X_1^n)$. Call this sequence maximum likelihood (SML).
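
A minimal sketch of the SML (empirical) plug-in, assuming the sample is given as a list of symbols; the helper names are mine, for illustration only:

```python
from collections import Counter

def multiplicities(sample):
    """N_x: number of times each observed symbol x appears in the sample."""
    return Counter(sample)

def empirical_distribution(sample):
    """SML / empirical distribution: p_hat_x = N_x / n."""
    n = len(sample)
    return {x: c / n for x, c in multiplicities(sample).items()}

def sml_plugin(sample, f):
    """Sequence-maximum-likelihood plug-in: evaluate the property f at p_hat."""
    return f(empirical_distribution(sample))

# Example: X_1^n = H, T, T  ->  p_hat = {H: 1/3, T: 2/3}
print(empirical_distribution(["H", "T", "T"]))
```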

Empirical entropy estimation. Empirical estimator: $\hat H(X_1^n) = H(\hat p) = \sum_x \frac{N_x}{n} \log \frac{n}{N_x}$. Its sample complexity: $S(\hat H, \varepsilon, k, 0.1) = \Theta\!\left(\frac{k}{\varepsilon}\right)$. Various corrections have been proposed (Miller-Madow, jackknifed, coverage-adjusted, ...), but their sample complexity remains $\Omega(k)$. [Paninski 03]: $S(H, \varepsilon, k, 0.1) = o(k)$ (existential). Note: for $\varepsilon \lesssim 1/k$, empirical estimators are already optimal.
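
For concreteness, a short sketch of the empirical entropy together with the classical Miller-Madow bias correction, which adds (number of observed symbols $-$ 1)/(2n) in nats; this is standard background, not the estimator analyzed in the talk:

```python
import math
from collections import Counter

def empirical_entropy(sample):
    """Plug-in entropy H(p_hat) = sum_x (N_x/n) log(n/N_x), in nats."""
    n = len(sample)
    return sum((c / n) * math.log(n / c) for c in Counter(sample).values())

def miller_madow_entropy(sample):
    """Miller-Madow correction: plug-in entropy + (#observed symbols - 1) / (2n)."""
    n = len(sample)
    m = len(set(sample))
    return empirical_entropy(sample) + (m - 1) / (2 * n)

sample = ["a", "b", "b", "c", "a", "a"]
print(empirical_entropy(sample), miller_madow_entropy(sample))
```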

Entropy estimation. [ValiantValiant 11a]: constructive proofs based on linear programming: $S(H, \varepsilon, k, 0.1) = \Theta_\varepsilon\!\left(\frac{k}{\log k}\right)$. [WuYang 14, JiaoVenkatHanWeissman 14, ValiantValiant 11b]: simplified algorithms and exact rates: $S(H, \varepsilon, k, 0.1) = \Theta\!\left(\frac{k}{\varepsilon \log k}\right)$.

Support coverage. $S_m(p) = \sum_x \bigl(1 - (1 - p_x)^m\bigr)$: expected number of distinct symbols observed when $p$ is sampled $m$ times. Goal: estimate $S_m(p)$ to within $\pm \varepsilon m$. [OrlitskySureshWu 16, ZouValiant et al. 16]: $n = \Theta\!\left(\frac{m}{\log m} \log\frac{1}{\varepsilon}\right)$ samples suffice for $\delta = 0.1$.
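
To make the target quantity concrete, here is a small sketch (mine, not the estimator from the cited papers) that evaluates $S_m(p)$ exactly and checks it by simulation:

```python
import numpy as np

def support_coverage(p, m):
    """S_m(p) = sum_x (1 - (1 - p_x)^m): expected # distinct symbols in m draws."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(1.0 - (1.0 - p) ** m))

def monte_carlo_coverage(p, m, trials=2000, seed=0):
    """Empirical average of the number of distinct symbols seen in m draws."""
    rng = np.random.default_rng(seed)
    draws = rng.choice(len(p), size=(trials, m), p=p)
    return float(np.mean([len(set(row)) for row in draws]))

p = [0.4, 0.3, 0.2, 0.1]
print(support_coverage(p, 10), monte_carlo_coverage(p, 10))
```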

Known results: summary. Many symmetric properties: entropy, support size, distance to uniformity, support coverage. A different estimator for each property. Sophisticated results from approximation theory.

Main result. A simple, maximum-likelihood-based plug-in method that is sample-optimal for entropy, support coverage, distance to uniformity, and support size.

Profiles. The profile of a sequence is the multiset of multiplicities: $\Phi(X_1^n) = \{N_x\}$. Example: $X_1^n = $ 1H, 2T or $X_1^n = $ 2H, 1T both give $\Phi(X_1^n) = \{1, 2\}$. Symmetric properties depend only on the multiset of probabilities: coins with bias 0.4 and 0.6 have the same symmetric properties. Optimal estimators have the same output for sequences with the same profile; profiles are a sufficient statistic.
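
In code, the profile is just the sorted multiset of counts (a minimal sketch):

```python
from collections import Counter

def profile(sample):
    """Profile: the multiset of multiplicities, returned as a sorted tuple."""
    return tuple(sorted(Counter(sample).values()))

# 1 head, 2 tails and 2 heads, 1 tail share the profile {1, 2}.
assert profile(["H", "T", "T"]) == profile(["H", "H", "T"]) == (1, 2)
```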

Profile maximum likelihood [OSVZ 04]. Probability of a profile: $p(\Phi(X_1^n)) = \sum_{y_1^n : \Phi(y_1^n) = \Phi(X_1^n)} p(y_1^n)$. PML maximizes the profile probability: $p_{\mathrm{PML}} = \arg\max_p p(\Phi(X_1^n))$. Example: $X_1^n = $ 1H, 2T: SML gives (2/3, 1/3), PML gives (1/2, 1/2).
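
For intuition, a brute-force sketch of PML for a binary distribution $(q, 1-q)$ (my illustration, not the algorithm used in the talk): it sums the probability of all length-$n$ sequences sharing the observed profile and maximizes over a grid of $q$. For the profile $\{1, 2\}$ the profile probability is $3q(1-q)$, so the maximizer is $q = 1/2$, matching the slide.

```python
from collections import Counter
from itertools import product
import numpy as np

def profile(seq):
    return tuple(sorted(Counter(seq).values()))

def profile_probability(q, target_profile, n):
    """p(Phi): sum of p(y_1^n) over all binary sequences y_1^n with the given
    profile, under the distribution P(H) = q, P(T) = 1 - q."""
    probs = {"H": q, "T": 1.0 - q}
    total = 0.0
    for seq in product(("H", "T"), repeat=n):
        if profile(seq) == target_profile:
            total += float(np.prod([probs[s] for s in seq]))
    return total

observed = ["H", "T", "T"]          # 1 head, 2 tails
target = profile(observed)          # (1, 2)
grid = np.linspace(0.01, 0.99, 99)
pml_q = grid[np.argmax([profile_probability(q, target, len(observed)) for q in grid])]
print(pml_q)                        # approximately 0.5
```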

PML for symmetric properties. To estimate a symmetric $f(p)$: find $p_{\mathrm{PML}}$ from $\Phi(X_1^n)$, then output $f(p_{\mathrm{PML}})$. Advantages: no tuning parameters; not function specific.
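
Continuing the toy example above, the plug-in step is just evaluating any symmetric property at the PML distribution; the property functions below are mine, for illustration:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def distance_to_uniformity(p):
    """Total variation distance between p and the uniform distribution on its alphabet."""
    p = np.asarray(p, dtype=float)
    return 0.5 * float(np.sum(np.abs(p - 1.0 / len(p))))

p_pml = np.array([0.5, 0.5])        # PML distribution for the profile {1, 2}
print(entropy(p_pml), distance_to_uniformity(p_pml))
```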

Main result PML is sample-optimal for entropy, support coverage, distance to uniformity, and support size.

Ingredients: guarantee for PML. If $n = S(f, \varepsilon, k, \delta)$, then $S\bigl(f(p_{\mathrm{PML}}), 2\varepsilon, k, \delta \cdot e^{3\sqrt{n}}\bigr) \le n$. In particular, if $n = S\bigl(f, \varepsilon, k, e^{-3\sqrt{n}}\bigr)$, then $S\bigl(f(p_{\mathrm{PML}}), 2\varepsilon, k, 0.1\bigr) \le n$. Key fact: the number of profiles of length $n$ is $< e^{3\sqrt{n}}$.

Ingredients: why the guarantee holds. Let $n = S(f, \varepsilon, k, \delta)$, achieved by an estimator $\hat f(\Phi(X_1^n))$, and let $p$ be the underlying distribution. For profiles $\Phi$ with $p(\Phi) > \delta$: since $\hat f$ errs with probability $< \delta$ under every distribution, $|\hat f(\Phi) - f(p)| \le \varepsilon$; moreover $p_{\mathrm{PML}}(\Phi) \ge p(\Phi) > \delta$, so also $|\hat f(\Phi) - f(p_{\mathrm{PML}})| \le \varepsilon$, and hence $|f(p_{\mathrm{PML}}) - f(p)| \le |f(p_{\mathrm{PML}}) - \hat f(\Phi)| + |\hat f(\Phi) - f(p)| < 2\varepsilon$. For profiles with $p(\Phi) < \delta$: the probability under $p$ of observing any such profile is $< \delta \cdot (\#\text{profiles of length } n)$.
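
Putting the two cases together (a sketch of the union bound, in the notation above):

\[
\Pr\bigl(|f(p_{\mathrm{PML}}) - f(p)| > 2\varepsilon\bigr)
\;\le\; \Pr\bigl(p(\Phi(X_1^n)) < \delta\bigr)
\;\le\; \sum_{\Phi : p(\Phi) < \delta} p(\Phi)
\;\le\; \delta \cdot \#\{\text{profiles of length } n\}
\;<\; \delta\, e^{3\sqrt{n}},
\]

which matches the $\delta \cdot e^{3\sqrt{n}}$ factor in the guarantee above.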

Ingredients: better error-probability guarantees. Recall: $S(H, \varepsilon, k, 0.1) = \Theta\!\left(\frac{k}{\varepsilon \log k}\right)$. Stronger error guarantees using McDiarmid's inequality: $S\bigl(H, \varepsilon, k, e^{-2\sqrt{n}}\bigr) = \Theta\!\left(\frac{k}{\varepsilon \log k}\right)$, i.e., with twice the samples the error probability drops exponentially. Similar results hold for other properties.
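
For reference, the McDiarmid (bounded-differences) inequality behind this boosting, in standard form: if changing any single sample $X_i$ changes the estimator by at most $c_i$, then

\[
\Pr\Bigl(\bigl|\hat f(X_1, \ldots, X_n) - \mathbb{E}\,\hat f(X_1, \ldots, X_n)\bigr| \ge t\Bigr)
\;\le\; 2 \exp\!\left(\frac{-2 t^2}{\sum_{i=1}^{n} c_i^2}\right).
\]

When each $c_i$ is of order $1/n$ up to logarithmic factors (as for the empirical entropy, where a single sample changes the estimate by $O(\log n / n)$), the right-hand side decays like $e^{-\Omega(n/\mathrm{polylog}(n))}$, far below the $e^{-3\sqrt{n}}$ threshold needed above.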

Main result. PML is sample-optimal for entropy, support coverage, distance to uniformity, and support size. Even an approximate PML is optimal for these properties.

Algorithms. EM algorithm [Orlitsky et al.]. Approximate PML via Bethe permanents [Vontobel]. Extension to Markov chains [VatedkaVontobel]. Open question: polynomial-time algorithms for approximating PML.

Summary. Symmetric property estimation via the PML plug-in approach: universal, simple to state, and independent of the particular property. Directions: efficient algorithms for approximate PML; the current analysis relies heavily on the existence of other, property-specific estimators.

In Fisher's words: "Of course nobody has been able to prove that maximum likelihood estimates are best under all circumstances. Maximum likelihood estimates computed with all the information available may turn out to be inconsistent. Throwing away a substantial part of the information may render them consistent." (R. A. Fisher)

References
1. Acharya, J., Das, H., Orlitsky, A., & Suresh, A. T. (2016). A unified maximum likelihood approach for optimal distribution property estimation. arXiv preprint arXiv:1611.02960.
2. Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15(6), 1191-1253.
3. Wu, Y., & Yang, P. (2016). Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6), 3702-3720.
4. Orlitsky, A., Suresh, A. T., & Wu, Y. (2016). Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 201607774.
5. Orlitsky, A., Santhanam, N. P., Viswanathan, K., & Zhang, J. (2004). On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 426-435). AUAI Press.
6. Valiant, G., & Valiant, P. (2011). Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (pp. 685-694). ACM.
7. Jiao, J., Venkat, K., Han, Y., & Weissman, T. (2015). Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5), 2835-2885.
8. Vatedka, S., & Vontobel, P. O. (2016). Pattern maximum likelihood estimation of finite-state discrete-time Markov chains. In 2016 IEEE International Symposium on Information Theory (ISIT) (pp. 2094-2098). IEEE.
9. Vontobel, P. O. (2014). The Bethe and Sinkhorn approximations of the pattern maximum likelihood estimate and their connections to the Valiant-Valiant estimate. In 2014 Information Theory and Applications Workshop (ITA) (pp. 1-10). IEEE.
10. Valiant, G., & Valiant, P. (2011). The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS) (pp. 403-412). IEEE.
11. Orlitsky, A., Sajama, S., Santhanam, N. P., Viswanathan, K., & Zhang, J. (2004). Algorithms for modeling distributions over large alphabets. In Proceedings of the 2004 IEEE International Symposium on Information Theory (ISIT) (pp. 304-304). IEEE.