Unified Maximum Likelihood Estimation of Symmetric Distribution Properties


Unified Maximum Likelihood Estimation of Symmetric Distribution Properties
Jayadev Acharya (Cornell), Hirakendu Das (Yahoo), Alon Orlitsky (UCSD), Ananda Suresh (Google)
Frontiers in Distribution Testing Workshop, FOCS 2017

Special thanks to (photo slide: "I don't smile" / "I only smile")

Symmetric properties
P = {(p_1, ..., p_k)}: collection of distributions over {1, ..., k}
Distribution property: f: P → R
f is symmetric if unchanged under input permutations
Entropy: H(p) = Σ_i p_i log(1/p_i)
Support size: S(p) = Σ_i I(p_i > 0)
Rényi entropy, support coverage, distance to uniformity, ...
Determined by the probability multiset {p_1, p_2, ..., p_k}

Symmetric properties
p = (p_1, ..., p_k) (k finite or infinite): a discrete distribution
f(p): a property of p
f is symmetric if unchanged under input permutations
Entropy: H(p) = Σ_i p_i log(1/p_i)
Support size: S(p) = Σ_i I(p_i > 0)
Rényi entropy, support coverage, distance to uniformity, ...
Determined by the probability multiset {p_1, p_2, ..., p_k}

Property estimation
p: unknown distribution in P
Given independent samples X_1, X_2, ..., X_n ~ p
Estimate f(p)
Sample complexity S(f, k, ε, δ): minimum n necessary to estimate f(p) to within ±ε with error probability < δ

Plug-in estimation
Use X_1, X_2, ..., X_n ~ p to find an estimate p̂ of p
Estimate f(p) by f(p̂)
How to estimate p?

Sequence Maximum Likelihood (SML)
p_SML(x^n) = arg max_p p(x^n) = arg max_p Π_i p(x_i)
Example: x^3 = h, h, t
p_SML^{h,h,t} = arg max_p p(h)² p(t), so p_SML^{h,h,t}(h) = 2/3, p_SML^{h,h,t}(t) = 1/3
Same as the empirical-frequency distribution
Multiplicity N_x: # times x appears in x^n
p_SML(x) = N_x / n
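As a concrete reference point, here is a minimal Python sketch of the SML (empirical-frequency) estimator described above; the function name sml_distribution is mine, not from the talk.

```python
from collections import Counter

def sml_distribution(sample):
    """Sequence maximum likelihood = empirical frequencies: p_SML(x) = N_x / n."""
    n = len(sample)
    return {x: count / n for x, count in Counter(sample).items()}

print(sml_distribution(["h", "h", "t"]))  # {'h': 0.666..., 't': 0.333...}
```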

Prior Work
For several important properties:
Empirical-frequency plug-in requires Θ(k) samples
Newer, more complex (non-plug-in) estimators need only Θ(k / log k) samples

Property                 Notation       SML            Optimal                  References
Entropy                  H(p)           k/ε            k/(ε log k)              Valiant & Valiant 2011a; Wu & Yang 2016; Jiao et al. 2015
Support size             S(p)           k log(1/ε)     (k/log k) log²(1/ε)      Wu & Yang 2015
Support coverage         S_m(p)         m              (m/log m) log(1/ε)       Orlitsky et al. 2016
Distance to uniformity   ||p − u||_1    k/ε²           k/(ε² log k)             Valiant & Valiant 2011b; Jiao et al. 2016

A different estimator for each property; they use sophisticated approximation-theory results

Entropy estimation
SML estimate of entropy: H(p_SML) = Σ_x (N_x/n) log(n/N_x)
Sample complexity: Θ(k/ε)
Various corrections proposed: Miller-Madow, jackknifed estimator, coverage-adjusted, ...
Sample complexity: Ω(k) for all the above estimators
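For illustration, a small Python sketch of the SML plug-in entropy estimate and the Miller-Madow correction mentioned above (entropy in nats; the function names are mine):

```python
import math
from collections import Counter

def entropy_sml(sample):
    """Plug-in (SML) entropy estimate in nats: sum_x (N_x / n) * log(n / N_x)."""
    n = len(sample)
    return sum((c / n) * math.log(n / c) for c in Counter(sample).values())

def entropy_miller_madow(sample):
    """Miller-Madow correction: add (number of observed symbols - 1) / (2n)."""
    return entropy_sml(sample) + (len(set(sample)) - 1) / (2 * len(sample))
```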

Entropy estimation
[Paninski 03]: o(k) sample complexity (existential)
[Valiant-Valiant 11a]: constructive, LP-based methods: Θ(k / log k)
[Valiant-Valiant 11b, Wu-Yang 14, Han-Jiao-Venkat-Weissman 14]: simplified algorithms and the precise growth rate: Θ(k / (ε log k))

New (as of August) Results
Unified, simple, sample-optimal approach for all of the above problems
Plug-in estimator: replace sequence maximum likelihood with profile maximum likelihood

Profiles
h, h, t or h, t, h or t, h, t → same estimate: one element appeared once, one appeared twice
Profile: multiset of multiplicities: Φ(X^n) = {N_x : x ∈ X^n}
Φ(h, h, t) = Φ(t, h, t) = {1, 2}
Φ(α, γ, β, γ) = {1, 1, 2}
Sufficient statistic for symmetric properties
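The profile is a one-line computation; a Python sketch (representing the multiset as a sorted tuple is my choice):

```python
from collections import Counter

def profile(sample):
    """Profile Phi(x^n): the multiset of multiplicities, here as a sorted tuple."""
    return tuple(sorted(Counter(sample).values()))

print(profile(["h", "h", "t"]))       # (1, 2)
print(profile(["a", "c", "b", "c"]))  # (1, 1, 2)
```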

Profile maximum likelihood [+SVZ 04]
Profile probability: p(Φ) = Σ_{x^n : Φ(x^n) = Φ} p(x^n)
Maximize the profile probability: p_PML = arg max_p p(Φ(X^n))
See "On estimating the probability multiset" (Orlitsky, Santhanam, Viswanathan, Zhang) for a detailed treatment and an argument for competitive distribution estimation.

Profile maximum likelihood (PML) [+SVZ 04]
Profile probability: p(Φ) = Σ_{x^n : Φ(x^n) = Φ} p(x^n)
Distribution maximizing the profile probability: p_PML = arg max_p p(Φ)
PML is competitive for distribution estimation
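To make the definition concrete, here is a brute-force Python sketch of the profile probability p(Φ) for a distribution on a small finite support; it enumerates all length-n sequences, so it is only meant for toy examples, and the function names are mine:

```python
import itertools
from collections import Counter

def profile(seq):
    return tuple(sorted(Counter(seq).values()))

def profile_probability(p, phi, n):
    """p(Phi) = sum of p(x^n) over all length-n sequences x^n with profile Phi.
    p is a dict mapping symbols to probabilities. Exponential in n."""
    total = 0.0
    for seq in itertools.product(p.keys(), repeat=n):
        if profile(seq) == phi:
            prob = 1.0
            for x in seq:
                prob *= p[x]
            total += prob
    return total

# Profile {1,2} for n = 3 under the SML distribution (2/3, 1/3) of the next slide:
print(profile_probability({"h": 2 / 3, "t": 1 / 3}, (1, 2), 3))  # 18/27 = 0.666...
```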

PML example
X^3 = h, h, t
p_SML^{h,h,t}(h) = 2/3, p_SML^{h,h,t}(t) = 1/3
Φ(h, h, t) = {1, 2}
p(Φ = {1,2}) = p(s,s,d) + p(s,d,s) + p(d,s,s) = 3 p(s,s,d) = 3 Σ_{x≠y} p(x)² p(y)
p_SML({1,2}) = 3 [ (2/3)²(1/3) + (1/3)²(2/3) ] = 18/27 = 2/3

PML of {1,2}
P({1,2}) = p(s,s,d) + p(s,d,s) + p(d,s,s) = 3 p(s,s,d)
Under (1/2, 1/2): p(s,s,d) = 1/8 + 1/8 = 1/4
In general: p(s,s,d) = Σ_x p(x)² Σ_{y≠x} p(y) = Σ_x p(x)² (1 − p(x)) ≤ (1/4) Σ_x p(x) = 1/4
So p_PML(s,s,d) = 1/4 and p_PML({1,2}) = 3/4
Recall: p_SML({1,2}) = 2/3
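A quick numerical check of the two-symbol case (the bound above already shows no distribution can exceed 3/4): for a distribution (p, 1 − p), Pr(profile = {1,2}) = 3[p²(1−p) + (1−p)²p] = 3p(1−p), maximized at p = 1/2.

```python
# Grid search over two-symbol distributions (p, 1 - p):
# Pr(profile = {1,2}) = 3 * [p^2 * (1 - p) + (1 - p)^2 * p] = 3 * p * (1 - p).
best_p, best_val = max(
    ((p / 1000, 3 * (p / 1000) * (1 - p / 1000)) for p in range(1001)),
    key=lambda pair: pair[1],
)
print(best_p, best_val)  # 0.5 0.75 -> p_PML({1,2}) = (1/2, 1/2) with value 3/4
```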

PML({1,1,2})
Φ(α, γ, β, γ) = {1, 1, 2}
p_PML({1,1,2}) = U[5], the uniform distribution over 5 symbols
PML can predict the existence of new symbols

Profile maximum likelihood (recap)
PML of {1,2} is (1/2, 1/2):
p_PML({1,2}) = 3 [ (1/2)²(1/2) + (1/2)²(1/2) ] = 3/4 > 18/27,
since Σ_x p(x)² Σ_{y≠x} p(y) = Σ_x p(x)² (1 − p(x)) ≤ (1/4) Σ_x p(x) = 1/4
X^n = α, γ, β, γ: Φ(X^n) = {1, 1, 2}, p_PML({1,1,2}) = U[5]
PML can predict the existence of new symbols
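The {1,1,2} claim can be sanity-checked numerically. The sketch below brute-forces Pr(profile = {1,1,2}) for n = 4 under uniform distributions U[m] only (a restricted search; the slide's claim is that the unrestricted maximizer is U[5]), and m = 5 indeed comes out best among these candidates:

```python
import itertools
from collections import Counter

def profile(seq):
    return tuple(sorted(Counter(seq).values()))

def prob_profile_uniform(m, phi, n):
    """Pr(profile = phi) when X^n is drawn i.i.d. from the uniform distribution over m symbols."""
    matches = sum(1 for seq in itertools.product(range(m), repeat=n) if profile(seq) == phi)
    return matches / m ** n

for m in range(3, 9):
    print(m, prob_profile_uniform(m, (1, 1, 2), 4))  # maximized at m = 5 (72/125 = 0.576)
```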

PML Plug-in
To estimate a symmetric property f:
Find p_PML(Φ(X^n))
Output f(p_PML)
Simple, unified, no tuning parameters
Some experimental results (c. 2009):

Uniform over 500 symbols, 350 samples
Multiplicities: 2 symbols appeared 6 times, 3 × 4, 13 × 3, 63 × 2, 161 × 1
242 symbols appeared, 258 did not

U[500], 350x, 12 experiments

Uniform over 500 symbols
At 350 samples: multiplicities 2 × 6, 3 × 4, 13 × 3, 63 × 2, 161 × 1; 242 symbols appeared, 258 did not
Now 700 samples

U[500], 700x, 12 experiments

Staircase distribution
15K elements, 5 steps, ~3x
30K samples: observed 8,882 elements, 6,118 missing

Zipf distribution
Underlies many natural phenomena
p_i = c/i over 15,000 elements
30,000 samples: observed 9,047 elements, 5,953 missing

1990 Census - Last names

Name        Freq (%)  Cum (%)   Rank
SMITH       1.006     1.006     1
JOHNSON     0.810     1.816     2
WILLIAMS    0.699     2.515     3
JONES       0.621     3.136     4
BROWN       0.621     3.757     5
DAVIS       0.480     4.237     6
MILLER      0.424     4.660     7
WILSON      0.339     5.000     8
MOORE       0.312     5.312     9
TAYLOR      0.311     5.623     10
...
AMEND       0.001     77.478    18835
ALPHIN      0.001     77.478    18836
ALLBRIGHT   0.001     77.479    18837
AIKIN       0.001     77.479    18838
ACRES       0.001     77.480    18839
ZUPAN       0.000     77.480    18840
ZUCHOWSKI   0.000     77.481    18841
ZEOLLA      0.000     77.481    18842

18,839 names cover 77.48% of a population of ~230 million

1990 Census - Last names
18,839 last names, based on ~230 million people
35,000 samples: observed 9,813 names

Coverage (# new symbols)
Zipf distribution over 15K elements, sampled 30K times
Estimate: # new symbols in a sample of size λ × 30K
λ < 1: Good-Toulmin estimator
λ > 1: estimate the PML distribution & predict (extends to λ > 1)
Applies to other properties
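For reference, a minimal sketch of the classical Good-Toulmin estimator mentioned on this slide (my function names; the PML-based predictor itself is not shown here):

```python
from collections import Counter

def good_toulmin(sample, lam):
    """Good-Toulmin estimate of the number of new symbols expected in lam * n
    additional samples: -sum_i (-lam)^i * Phi_i, where Phi_i is the number of
    symbols appearing exactly i times in the observed sample. Reliable for lam <= 1."""
    phi = Counter(Counter(sample).values())  # multiplicity i -> Phi_i
    return -sum((-lam) ** i * count for i, count in phi.items())

print(good_toulmin(["a", "a", "b", "c", "c", "d"], lam=1.0))  # Phi_1 - Phi_2 = 2 - 2 = 0
```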

λ >1

Finding the PML distribution
EM algorithm [+ Pan, Sajama, Santhanam, Viswanathan, Zhang 05-08]
Approximate PML via Bethe permanents [Vontobel]
Extensions via Markov chains [Vatedka, Vontobel]
No provable algorithms known
Motivated Valiant & Valiant
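None of those algorithms is reproduced here, but for intuition the following hedged sketch numerically approximates the PML distribution for a toy profile by maximizing the brute-force profile likelihood over a fixed support size with a generic optimizer (softmax parameterization plus Nelder-Mead); it assumes scipy is available and is not a provable or scalable method.

```python
import itertools
from collections import Counter

import numpy as np
from scipy.optimize import minimize

def profile(seq):
    return tuple(sorted(Counter(seq).values()))

def profile_probability(probs, phi, n):
    """Brute-force p(Phi) over support {0, ..., len(probs)-1}; exponential in n."""
    total = 0.0
    for seq in itertools.product(range(len(probs)), repeat=n):
        if profile(seq) == phi:
            total += float(np.prod([probs[i] for i in seq]))
    return total

def approx_pml(phi, n, support_size):
    """Maximize the profile likelihood over a fixed support via softmax + Nelder-Mead."""
    def neg_log_lik(theta):
        p = np.exp(theta - theta.max())
        p /= p.sum()
        return -np.log(profile_probability(p, phi, n) + 1e-300)

    result = minimize(neg_log_lik, np.zeros(support_size), method="Nelder-Mead")
    p = np.exp(result.x - result.x.max())
    return p / p.sum()

print(approx_pml((1, 2), 3, support_size=2))  # ~[0.5, 0.5], matching the {1,2} example
```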

Maximum Likelihood Estimation Plug-in
General property estimation technique
P: collection of distributions over a domain Z
f: P → R, any property (say entropy)
MLE estimator: given z ∈ Z, determine p_MLE = arg max_{p ∈ P} p(z); output f(p_MLE)
How good is MLE?

Competitiveness of the MLE plug-in
P: collection of distributions over a domain Z
Let f̂: Z → R be any estimator such that for all p ∈ P and Z ~ p: Pr(|f(p) − f̂(Z)| > ε) < δ
Then the MLE plug-in error is bounded by: Pr(|f(p) − f(p_MLE)| > 2ε) < δ |Z|
Simple, universal, competitive with any f̂

Quiz: probability of unlikely outcomes
6-sided die, p = (p_1, p_2, ..., p_6), p_i ≥ 0 and Σ p_i = 1, otherwise arbitrary; Z ~ p
Pr(p_Z ≤ 1/6)? Can be anything:
  (1, 0, ..., 0): Pr(p_Z ≤ 1/6) = 0
  (1/6, ..., 1/6): Pr(p_Z ≤ 1/6) = 1
Pr(p_Z ≤ 0.01) ≤ 0.06:
  Pr(p_Z ≤ 0.01) = Σ_{z: p_z ≤ 0.01} p_z ≤ 6 × 0.01 = 0.06

Competitiveness of the MLE plug-in: proof
Given f̂: Z → R with, for all p ∈ P and Z ~ p, Pr(|f(p) − f̂(Z)| > ε) < δ, show Pr(|f(p) − f(p_MLE)| > 2ε) < δ |Z|.
For every z such that p(z) > δ:
1) |f(p) − f̂(z)| ≤ ε (otherwise f̂ errs on z alone with probability > δ)
2) p_MLE(z) ≥ p(z) > δ, hence, applying 1) with p_MLE in place of p, |f(p_MLE) − f̂(z)| ≤ ε
Triangle inequality: |f(p_MLE) − f(p)| ≤ 2ε
So if |f(p_MLE) − f(p)| > 2ε then p(z) ≤ δ, and
Pr(|f(p_MLE) − f(p)| > 2ε) ≤ Pr(p(Z) ≤ δ) ≤ Σ_{z: p(z) ≤ δ} p(z) ≤ δ |Z|

PML performance bound
If n = S(f, k, ε, δ), then S_PML(f, k, 2ε, |Φ^n| δ) ≤ n
|Φ^n|: number of profiles of length n
A profile of length n is a partition of n: {3}, {1,2}, {1,1,1} ↔ 3, 2+1, 1+1+1
|Φ^n| = partition number of n
Hardy-Ramanujan: |Φ^n| < e^{c√n} with c = π√(2/3) < 3 (an easy counting argument gives the weaker e^{O(√n log n)})
So if n = S(f, k, ε, e^{−4√n}), then S_PML(f, k, 2ε, e^{−√n}) ≤ n
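The partition number is easy to compute exactly; a short Python check against the Hardy-Ramanujan-type exponential bound used above:

```python
import math

def partition_count(n):
    """Number of integer partitions of n = number of distinct profiles of length n."""
    p = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]

for n in (10, 50, 100):
    bound = math.exp(math.pi * math.sqrt(2 * n / 3))
    print(n, partition_count(n), f"<= e^(pi sqrt(2n/3)) ~ {bound:.3g}")
```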

Summary
Symmetric property estimation: the PML plug-in approach
Simple, universal, sample-optimal for the known sublinear-sample properties
Future directions:
Provably efficient algorithms
Independent proof technique
Thank you!

PML for symmetric f
For any symmetric property f, if n = S(f, k, ε, 0.1), then S_PML(f, k, 2ε, 0.1) = O(n²).
Proof. By the median trick, S(f, k, ε, e^{−m}) = O(n m). Therefore S_PML(f, k, 2ε, e^{c√(nm)} e^{−m}) = O(n m). Plugging in m = C n, for a sufficiently large constant C, gives the desired result.
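The median trick invoked in this proof is just "estimate on independent batches and take the median"; a minimal sketch, assuming `estimator` maps a batch of samples to a real number:

```python
import statistics

def median_boost(estimator, batches):
    """Median trick: if each independent run is within eps of the truth with
    probability >= 0.9, the median of m runs is within eps with probability
    >= 1 - e^(-c m) for some constant c > 0."""
    return statistics.median(estimator(batch) for batch in batches)
```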

Better error probabilities: warm-up
Estimating a distribution p over [k] to L1 distance ε w.p. > 0.9 requires Θ(k/ε²) samples. Proof: exercise.
Estimating a distribution p over [k] to L1 distance ε w.p. 1 − e^{−k} still requires only Θ(k/ε²) samples.
Proof: the empirical estimator's error ||p_emp − p||_1 has bounded difference constant (b.d.c.) 2/n; apply McDiarmid's inequality.
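A small simulation illustrating the concentration claim (p uniform over [k]; the empirical L1 error's spread across trials is far smaller than its mean, consistent with the 2/n bounded differences plus McDiarmid):

```python
import random
from collections import Counter

def empirical_l1_errors(k, n, trials=200):
    """Simulate ||p_emp - p||_1 for p uniform over [k]; return (min, mean, max)."""
    errors = []
    for _ in range(trials):
        counts = Counter(random.randrange(k) for _ in range(n))
        errors.append(sum(abs(counts.get(x, 0) / n - 1 / k) for x in range(k)))
    return min(errors), sum(errors) / len(errors), max(errors)

print(empirical_l1_errors(k=100, n=10000))  # min, mean, max are all close together
```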

Better error probabilities
Recall S(H, ε, k, 2/3) = Θ(k/(ε log k))
Existing optimal estimators have high b.d.c.
Modify them to have small b.d.c. and remain sample-optimal
In particular, can get b.d.c. = n^{−0.9} (exponent close to 1)
With twice the samples, the error probability drops super-fast: S(H, ε, k, e^{−n^{0.8}}) = Θ(k/(ε log k))
Similar results for other properties

Even approximate PML works!
Perhaps finding the exact PML is hard. Even approximate PML works:
Find any distribution q such that q(Φ(X^n)) ≥ e^{−n^c} p_PML(Φ(X^n)), for a suitable constant c < 1
Even this is sample-optimal (for large k)

In Fisher's words
"Of course nobody has been able to prove that MLE is best under all circumstances. MLE computed with all the information available may turn out to be inconsistent. Throwing away a substantial part of the information may render them consistent."
R. A. Fisher

Proof of PML performance
Claim: if n = S(f, k, ε, δ), then S_PML(f, k, 2ε, |Φ^n| δ) ≤ n.
S(f, k, ε, δ) is achieved by some estimator f̂(Φ(X^n)).
For profiles Φ(X^n) such that p(Φ(X^n)) > δ: p_PML(Φ) ≥ p(Φ) > δ, hence
|f(p_PML) − f(p)| ≤ |f(p_PML) − f̂(Φ)| + |f̂(Φ) − f(p)| < 2ε.
For profiles with p(Φ(X^n)) ≤ δ: Pr(p(Φ(X^n)) ≤ δ) ≤ δ |Φ^n|.