
Lower Bound Techniques for Statistical Estimation Gregory Valiant and Paul Valiant

The Setting. Given independent samples from a distribution D of discrete support: estimate the number of species (support size), estimate the entropy, hypothesis testing (is the unknown distribution equal to D? is it close to D?), compare two distributions under many metrics. Algorithms: practical, yet unexpected!

Practical Algorithms. Why lower bounds? Impossibility results.

Birthday Paradox. Task: distinguish U_n from U_{n/2} in k samples (which n/2 elements? a random subset). Birthday paradox: for k = o(sqrt(n)), the two cases produce the same fingerprints w.h.p. [Figure: a sample of data, its histogram (number of occurrences of each element), and its fingerprint, i.e. the histogram of the histogram: f_i = number of distinct elements seen exactly i times (1-way, 2-way, ...).]
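A minimal sketch in Python (mine, not from the talk) of computing a fingerprint; for k well below sqrt(n), both U_n and U_{n/2} typically yield the all-singletons fingerprint:

# Hypothetical sketch: the "fingerprint" of a sample is the histogram of its histogram,
# f[i] = number of distinct domain elements that appear exactly i times in the sample.
from collections import Counter
import random

def fingerprint(sample):
    histogram = Counter(sample)                # element -> number of occurrences
    return dict(Counter(histogram.values()))   # occurrences -> number of such elements

n, k = 10_000, 50                              # k = o(sqrt(n)): collisions are unlikely
sample_n  = [random.randrange(n) for _ in range(k)]        # k samples from U_n
sample_n2 = [random.randrange(n // 2) for _ in range(k)]   # k samples from U_{n/2}
print(fingerprint(sample_n))                   # typically {1: 50}: every element seen once
print(fingerprint(sample_n2))                  # typically also {1: 50} when k << sqrt(n)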

More Generally? Characterize (and compare) fingerprint distributions. Compare (# heads, # tails) given k = 10 flips with (# heads, # tails) given Poi(10) flips: with a Poisson number of flips, the occurrence counts of the different elements are independent! ([Roos 99] quantifies the approximation for low frequencies.) [Figure: data, histogram of data, and fingerprint (histogram of histogram), with 1-way and 2-way entries arising from Poi(k)-many draws.]

Double Poissonization. Introduced in [Raskhodnikova et al. 07], here as in [V 08]. Characterize (and compare) fingerprint distributions: if the sample size is drawn as Poi(k), the fingerprint entries are distributed (almost) like independent Poisson random variables ([Roos 99] for low frequencies). [Figure: data, histogram of data, fingerprint (histogram of histogram).]
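A small sketch (my own illustration) of Poissonization: drawing Poi(k) samples is equivalent to giving each domain element an independent Poi(k*p_i) count, so fingerprint entries become (nearly) independent Poissons:

# Hypothetical sketch of Poissonization: instead of exactly k samples, draw Poi(k) samples;
# then each element's occurrence count is an independent Poisson(k * p_i) variable, and the
# fingerprint entries f_1, f_2, ... are themselves (almost) independent Poissons.
import numpy as np

rng = np.random.default_rng(0)
p = np.full(100, 1 / 100)      # example distribution: uniform on 100 elements
k = 50

def poissonized_counts(p, k):
    # Equivalent to drawing Poi(k) i.i.d. samples from p and histogramming them.
    return rng.poisson(k * p)

counts = poissonized_counts(p, k)
f1 = int(np.sum(counts == 1))  # fingerprint entry: # elements seen exactly once
f2 = int(np.sum(counts == 2))  # # elements seen exactly twice
print(f1, f2)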

Theorem: Moments Describe All (for low-weight distributions). Fingerprint entries are distributed (almost) like independent Poisson random variables, and a Poisson is characterized by its expectation. So fingerprint distributions are characterized by the expected numbers of 2-way collisions, 3-way collisions, etc., i.e. by the moments sum_i p_i^2, sum_i p_i^3, etc.
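A quick sketch (my own, hedged) of the correspondence between these moments and expected collision counts:

# Hypothetical sketch: the expected number of j-way collisions among k fixed samples is
# C(k, j) * sum_i p_i^j, so matching the moments sum_i p_i^j matches expected collision
# counts (and, in the Poissonized low-weight regime, the whole fingerprint distribution).
from math import comb
import numpy as np

def expected_jway_collisions(p, k, j):
    return comb(k, j) * np.sum(np.asarray(p) ** j)

p = [0.5, 0.25, 0.25]
k = 20
print(expected_jway_collisions(p, k, 2))   # expected # of colliding pairs
print(expected_jway_collisions(p, k, 3))   # expected # of colliding triples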

Application: Entropy lower bounds (Theorem: Moments Don't Describe Anything). [Figure: a high-entropy distribution next to a low-entropy distribution; the second moment is depicted as an area, the third moment as a volume.] Edit a small portion of each distribution: entropy is preserved by continuity, but the moments are dominated by those (arbitrary!) edits, so the left and right distributions' moments can be made to match. Hence moments don't determine entropy, or any other continuous property!
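A toy numerical illustration (my own, not the talk's construction) that tiny edits dominate the moments while barely moving the entropy:

# Hypothetical illustration: move a tiny amount of mass (eps) onto one heavy element in both a
# high-entropy and a low-entropy distribution; the moments sum_i p_i^j become dominated by the
# edit (so they nearly agree), while the entropies still differ by about log 2.
import numpy as np

n, eps = 10**6, 0.01
p_high = np.append(np.full(n, (1 - eps) / n), eps)               # ~uniform on n elements + edit
p_low  = np.append(np.full(n // 2, (1 - eps) / (n // 2)), eps)   # ~uniform on n/2 elements + edit

def entropy(p): return float(-np.sum(p * np.log(p)))
def moment(p, j): return float(np.sum(p ** j))

print(entropy(p_high) - entropy(p_low))     # still about log 2 ~ 0.69
print(moment(p_high, 2), moment(p_low, 2))  # both ~ eps^2: dominated by the edit
print(moment(p_high, 3), moment(p_low, 3))  # both ~ eps^3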

Double Poissonization (recap). Characterize (and compare) fingerprint distributions: fingerprint entries are distributed (almost) like independent Poisson random variables ([Roos 99] for low frequencies). [Figure: data, histogram of data, fingerprint.]

Generalized Multinomial Distributions. Generalizes binomial and multinomial distributions, and any sums of such distributions. Parameterized by a matrix of probabilities, one row per independent trial. [Figure: an example parameter matrix with entries between 0 and 1.] Aim: characterize these distributions so we can recognize when two are close. So far: characterization by Poissons in the rare-events regime.
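A minimal sampler sketch (my own reading of the definition, with a made-up parameter matrix): each row is an independent trial that lands in one of the columns with the row's probabilities, any leftover mass contributing nothing, and the output is the vector of column counts.

# Hypothetical sketch of sampling a generalized multinomial distribution from a parameter
# matrix rho: row i gives the probabilities that trial i falls in each column; leftover mass
# means the trial contributes to no column. Binomials, multinomials, and sums of such
# distributions are special cases.
import numpy as np

rng = np.random.default_rng(0)

def sample_generalized_multinomial(rho):
    rho = np.asarray(rho, dtype=float)
    n_cols = rho.shape[1]
    counts = np.zeros(n_cols, dtype=int)
    for row in rho:                                   # independent trials
        probs = np.append(row, 1.0 - row.sum())       # last slot = "no column"
        outcome = rng.choice(n_cols + 1, p=probs)
        if outcome < n_cols:
            counts[outcome] += 1
    return counts

rho = [[0.5, 0.2], [0.1, 0.8], [0.4, 0.4]]            # made-up example matrix
print(sample_generalized_multinomial(rho))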

Poisson vs. what? Poi(λ) is like the number of heads in 1000·λ coin flips, each with heads probability 1/1000. A multivariate Poi(λ_1, λ_2, ...) produces each dimension separately. Nice features: variance = mean, and each dimension is independent. New idea: a discretized multivariate Gaussian, parameterized by a covariance matrix, giving more degrees of freedom.

Gaussian Characterization of Generalized Multinomial Distributions. Tools: convolution/deconvolution, an earthmover CLT, unimodality [Gurvitz 2008]. Thm: Given a generalized multinomial distribution M with n rows and j columns, with mean μ and covariance Σ, d_TV(M, G_disc(μ, Σ)) ≤ (j^{4/3} / σ^{1/3}) · 2.2 · (3.1 + 0.83 log n)^{2/3}, where σ^2 is the minimum eigenvalue of Σ and G_disc denotes the discretized Gaussian.

Characterizing Fingerprints. Fingerprint expectations + covariance very similar ⇒ distributions of fingerprints close. A richer characterization, but too rich? Nice Lemma: fingerprint expectations determine fingerprint covariance.

Expectations ⇒ Covariance. Lemma: Fingerprint expectations robustly determine fingerprint covariance. Writing Poi(λ, i) := Pr[Poi(λ) = i] and h(x) for the number of domain elements of probability x:
E[f_i] = Σ_{x: h(x) ≠ 0} Poi(kx, i) · h(x)
Cov[f_i, f_j] = -Σ_{x: h(x) ≠ 0} Poi(kx, i) · Poi(kx, j) · h(x)   (for i ≠ j)
Key identity: Poi(λ, i) · Poi(λ, j) = Poi(2λ, i+j) · C(i+j, j) · 2^{-(i+j)}.
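A short numerical sketch (mine) of these formulas and of the product identity, using scipy's Poisson pmf:

# Hypothetical sketch: expected fingerprints and (i != j) covariances of a Poissonized sample,
# computed from the distribution's histogram h(x) = number of elements of probability x.
from math import comb
from scipy.stats import poisson

def expected_fingerprint(h, k, i):
    # E[f_i] = sum_{x: h(x) != 0} h(x) * Pr[Poi(k*x) = i]
    return sum(hx * poisson.pmf(i, k * x) for x, hx in h.items())

def fingerprint_cov(h, k, i, j):
    # For i != j: Cov[f_i, f_j] = - sum_x h(x) * Pr[Poi(k*x) = i] * Pr[Poi(k*x) = j]
    assert i != j
    return -sum(hx * poisson.pmf(i, k * x) * poisson.pmf(j, k * x) for x, hx in h.items())

h = {0.001: 500, 0.01: 50}    # made-up histogram: 500 elements of prob 0.001, 50 of prob 0.01
k = 200
print(expected_fingerprint(h, k, 1), fingerprint_cov(h, k, 1, 2))

# The identity Poi(l, i) * Poi(l, j) = Poi(2l, i+j) * C(i+j, j) * 2^(-(i+j)):
lam, i, j = 3.0, 2, 4
lhs = poisson.pmf(i, lam) * poisson.pmf(j, lam)
rhs = poisson.pmf(i + j, 2 * lam) * comb(i + j, j) * 2 ** (-(i + j))
print(abs(lhs - rhs) < 1e-12)  # True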

Approximating Thin Poissons as Linear Combinations of Poissons. For fixed m but all λ, can we approximate Pr[Poi(2λ) = m] as a linear combination over j of Pr[Poi(λ) = j]? These look like Gaussians, and the answer is the same as for Gaussians! Result: can approximate to within ε using coefficients of magnitude at most 1/ε.
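A least-squares sketch (my own, over an arbitrary grid of λ values) of the approximation question; the talk's point is that modest coefficients suffice for ε accuracy:

# Hypothetical sketch: for a fixed m, fit coefficients c_j so that
# sum_j c_j * Pr[Poi(lambda) = j] ~ Pr[Poi(2*lambda) = m] over a range of lambda.
import numpy as np
from scipy.stats import poisson

m = 6
js = np.arange(0, 30)                        # which Poi(lambda) pmf values may be combined
lams = np.linspace(0.1, 15, 300)             # grid of lambda values to fit over

A = poisson.pmf(js[None, :], lams[:, None])  # A[l, j] = Pr[Poi(lams[l]) = j]
b = poisson.pmf(m, 2 * lams)                 # target: Pr[Poi(2*lams[l]) = m]

c, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.max(np.abs(A @ c - b)))             # approximation error on the grid
print(np.max(np.abs(c)))                     # magnitude of the coefficients used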

Application: Two Lower Bounds. Tool: fingerprint expectations + covariance very similar ⇒ distributions of fingerprints close. 1. Explicit construction of distributions p^+, p^-, where p^+ is close to U_{n/2} and p^- is close to U_n, that have very close k-sample expected fingerprints for k = c·n/log n. 2. Too hard! Throw everything into a linear program and let linear programming do all the work for us.

Lower Bound Construction. Goal: distributions p^+, p^-, where p^+ is close to U_{n/2} and p^- is close to U_n, with very close k-sample expected fingerprints for k = c·n/log n. For all j we need Σ_x (h^+(x) - h^-(x)) · (kx)^j e^{-xk} / j! ≈ 0, so g(xk) := (h^+(x) - h^-(x)) · e^{-xk} should be orthogonal to low-degree polynomials. We want a signed measure g on the positive reals that: is orthogonal to low-degree polynomials; decays exponentially fast; and whose positive and negative portions have most of their mass at 2c log n and c log n, respectively.

Orthogonal to Polynomials. Fact: if P is a degree-j polynomial with distinct real roots {x_i}, then the signed measure g_P with point mass 1/P'(x_i) at each root x_i is orthogonal to all polynomials of degree at most j-2. Task: find P such that P'(x_i) grows exponentially in x_i. Answer: LAGUERRE POLYNOMIALS.
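A quick numerical check (my own) of the stated fact, using a Laguerre polynomial for P via numpy:

# Hypothetical check: for P = L_j (the degree-j Laguerre polynomial, which has j distinct
# positive real roots), the signed measure with mass 1/P'(x_i) at each root x_i annihilates
# every monomial of degree <= j-2.
import numpy as np
from numpy.polynomial import laguerre

j = 8
coeffs = np.zeros(j + 1)
coeffs[j] = 1.0                                  # Laguerre-series coefficients of L_j
roots = laguerre.lagroots(coeffs)                # the j distinct positive roots
deriv = laguerre.lagder(coeffs)                  # Laguerre-series coefficients of L_j'
weights = 1.0 / laguerre.lagval(roots, deriv)    # point masses 1/P'(x_i)

for m in range(j - 1):                           # degrees 0 .. j-2
    print(m, np.sum(weights * roots ** m))       # ~0 up to floating-point error
print(j - 1, np.sum(weights * roots ** (j - 1))) # degree j-1: genuinely nonzero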

Application: Two Lower Bounds (revisited). Approach 1 gave the explicit construction; approach 2: throw everything into a linear program and let linear programming do all the work for us.

Dual to Optimal Estimator.
Primal (find an estimator z): minimize ε such that, for all distributions p over [n], the expectation of the estimator z applied to k samples from p is within ε of H(p).
Dual (find a lower bound instance y^+, y^-, distributions over [n]): maximize H(y^+) - H(y^-) such that for all i, |E[f_i^+] - E[f_i^-]| ≤ k^{1-c}, i.e. the expected fingerprint entries given k samples from y^+ and y^- match to within k^{1-c}.
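A rough linear-programming sketch of the lower-bound-instance side (my own discretization and constants, not the talk's exact program); both the entropy and the expected fingerprints are linear in the histogram representation, so this really is an LP:

# Hypothetical sketch: search for histograms y+(x), y-(x) over a grid of probability values x
# that maximize the entropy gap while keeping expected fingerprints within k^(1-c) of each other.
import numpy as np
from scipy.optimize import linprog
from scipy.stats import poisson

n, k, c = 10**4, 10**3, 0.5
xs = np.geomspace(1e-6, 1e-1, 60)             # grid of probability values
I = 20                                        # match the first I expected fingerprint entries
P = poisson.pmf(np.arange(1, I + 1)[:, None], k * xs[None, :])   # row i-1: Pr[Poi(k x) = i]

ent = xs * np.log(1 / xs)                     # entropy contribution per element of prob x
zeros, ones = np.zeros_like(xs), np.ones_like(xs)
obj = np.concatenate([-ent, ent])             # linprog minimizes, so negate H(y+) - H(y-)
A_eq = np.block([[xs, zeros], [zeros, xs]])   # each histogram carries total probability 1
b_eq = [1.0, 1.0]
D = np.hstack([P, -P])                        # E[f_i^+] - E[f_i^-], as a linear map
A_ub = np.vstack([D, -D, np.block([[ones, zeros]]), np.block([[zeros, ones]])])
b_ub = np.concatenate([np.full(2 * I, k ** (1 - c)), [n, n]])    # fingerprint match + support <= n

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("achievable entropy gap:", -res.fun)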

Take-home Ideas. Poissonization is a crucial first step. Fingerprints capture all the information (for symmetric property testing); more generally, forget and then reconstruct the sample. Two general characterizations of the fingerprint distribution: (1) the Poisson regime, where domain elements are each seen <<1 time in the sample: a tight characterization that can be leveraged even in extreme situations (elements seen >>1 time); (2) the Gaussian regime, where the fingerprints have high variance in every direction: much more flexible and leads to tighter bounds, but more intricate. Fun with orthogonal polynomials! (Hermite, Laguerre, Chebyshev)

Approximating Thin Gaussians as Linear Combinations of Gaussians. What do we convolve a Gaussian with to approximate a thinner Gaussian? (The other direction is easy, since convolving Gaussians adds their variances.) Blurring is easy, unblurring is hard: we can only do it approximately. How to analyze? The Fourier transform: convolution becomes multiplication. Now: what do we multiply a Gaussian with to approximate a fatter Gaussian? From e^{-x^2} · ??? = e^{-x^2/2} we get ??? = e^{x^2/2}. Problem: this blows up. Answer: to approximate to within ε, truncate where e^{-x^2/2} = ε. Result: can approximate to within ε using coefficients of magnitude at most 1/ε.
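A numerical sketch (my own construction, with arbitrary widths and ε) of the Fourier-domain unblurring with truncation:

# Hypothetical sketch: to write a thin Gaussian as (wide Gaussian convolved with some kernel),
# divide the two Fourier transforms; the quotient blows up at high frequencies, so truncate it
# where the target transform has dropped below eps. The kept multiplier stays O(1/eps) and the
# approximation error stays small.
import numpy as np

x = np.linspace(-20, 20, 4001)
wide = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)                     # sigma = 1
s = 1 / np.sqrt(2)                                                # thinner: sigma = 1/sqrt(2)
thin = np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

eps = 1e-3
W = np.fft.fft(np.fft.ifftshift(wide))                            # transform of the wide Gaussian
T = np.fft.fft(np.fft.ifftshift(thin))                            # transform of the thin Gaussian
K = np.zeros_like(T)
keep = np.abs(T) > eps * np.abs(T).max()                          # truncate where target < eps
K[keep] = T[keep] / W[keep]                                       # the "unblurring" multiplier

approx = np.fft.fftshift(np.fft.ifft(W * K)).real
print(np.max(np.abs(K)))                                          # roughly 1/eps
print(np.max(np.abs(approx - thin)))                              # small approximation error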