Minimax lower bounds I

Minimax lower bounds I. Kyoung Hee Kim, Sungshin University.

Outline:
1. Preliminaries
2. General strategy
3. Le Cam, 1973
4. Assouad, 1983
5. Appendix

Setting
Family of probability measures {P_θ : θ ∈ Θ} on a sigma-field A.
Estimator θ̂: a measurable map from Ω to Θ.
L(θ̂, θ): loss function.
Maximum risk: R(Θ, θ̂) := sup_{θ∈Θ} E_θ L(θ̂, θ).
Minimax risk, attained by a minimax estimator θ̂_mm:
R(Θ) = inf_{θ̃} sup_{θ∈Θ} E_θ L(θ̃, θ) = sup_{θ∈Θ} E_θ L(θ̂_mm, θ).
1 / 44

Why minimax?
Uniformity excludes super-efficient estimators.
Optimal rates reveal the difficulty of the model of interest.
It gives guidance on the choice of estimators.
2 / 44

Various distances
P, Q: two probability measures with densities p, q w.r.t. ν.
Total variation distance: V(P, Q) = sup_A |P(A) − Q(A)| = sup_A ∫_A (p − q) dν.
Squared Hellinger distance: h²(P, Q) = ∫ (p^{1/2} − q^{1/2})² dν.
Kullback–Leibler (KL) divergence: KL(P, Q) = ∫ p log(p/q) dν.
Chi-squared (χ²) distance: χ²(P, Q) = ∫ (p²/q) dν − 1.
3 / 44

Relation between distances
P, Q: two probability measures with densities p, q w.r.t. ν.
1. (1/2) h²(P, Q) ≤ V(P, Q) ≤ h(P, Q) √(1 − h²(P, Q)/4)
2. V(P, Q) ≤ h(P, Q), and h²(P, Q) ≤ KL(P, Q) ≤ χ²(P, Q)
3. h²(Pⁿ, Qⁿ) ≤ n h²(P, Q)
4. (Pinsker) V(P, Q) ≤ √(KL(P, Q)/2), and V(P, Q) ≤ 1 − (1/2) exp(−KL(P, Q))
4 / 44
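
These relations are easy to sanity-check numerically. The following is a minimal Python sketch added for illustration (it assumes only numpy; the two discrete distributions are arbitrary examples, not taken from the lecture): it computes V, h², KL and χ² on a common finite support and verifies relations 1, 2 and 4.

import numpy as np

def distances(p, q):
    """Return (V, h2, KL, chi2) for two discrete distributions p, q (q > 0 assumed)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    V = 0.5 * np.abs(p - q).sum()                      # total variation
    h2 = ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()        # squared Hellinger
    mask = p > 0                                       # convention 0*log(0) = 0
    KL = (p[mask] * np.log(p[mask] / q[mask])).sum()   # Kullback-Leibler
    chi2 = (p ** 2 / q).sum() - 1                      # chi-squared
    return V, h2, KL, chi2

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
V, h2, KL, chi2 = distances(p, q)
h = np.sqrt(h2)

assert 0.5 * h2 <= V <= h * np.sqrt(1 - h2 / 4)        # relation 1
assert V <= h and h2 <= KL <= chi2                     # relation 2
assert V <= np.sqrt(KL / 2)                            # relation 4 (Pinsker)
assert V <= 1 - 0.5 * np.exp(-KL)                      # relation 4 (second bound)
print(V, h2, KL, chi2)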

Relation to Bayes estimators
Let θ̂_Π be the Bayes estimator under a prior Π.
A least favourable prior is one whose Bayes risk (BR) is maximal among all prior distributions.
If the Bayes risk of θ̂_Π equals the maximum risk of θ̂_Π, then θ̂_Π is minimax and Π is least favourable.
If the maximum risk of an estimator θ̂ equals lim_n BR(θ̂_{Π_n}), the limit of the Bayes risks along a sequence of priors (a limit that is at least as large as the Bayes risk under any single prior), then θ̂ is minimax.
5 / 44

Examples of minimax estimators
1. Sample mean. Let X_i be i.i.d. N(θ, 1), i = 1, ..., n, and let L(θ̂, θ) = (θ̂ − θ)². Then X̄ is minimax, using the sequence of priors Π_b = N(µ, b): the Bayes estimator is θ̂_{Π_b} = (nX̄ + µ/b)/(n + 1/b), and the posterior variance 1/(n + 1/b) converges to 1/n as b → ∞, which equals Var(X̄).
2. James–Stein estimator. Let X ∼ N_d(θ, I_d) with d ≥ 3, and let L(θ̂, θ) = Σ_{i=1}^d (θ̂_i − θ_i)². A minimax estimator (dominating X) is
θ̂_mm = (1 − (d − 2)/Σ_{i=1}^d X_i²) X.
But θ̂_mm is not the unique minimax estimator (and it is not admissible); for instance, the positive-part estimator (1 − (d − 2)/Σ_{i=1}^d X_i²)_+ X dominates θ̂_mm.
6 / 44

[Figure: risk functions for d = 4 as functions of θ: the James–Stein estimator (minimax), X (MLE, UMVUE), and the positive-part James–Stein estimator (minimax).]
7 / 44
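
The risk comparison in the figure can be reproduced with a short Monte Carlo experiment. The sketch below is illustrative only (numpy assumed; the grid of θ values, the replication count and the choice θ ∝ (1, ..., 1) are arbitrary): it estimates the risks of the MLE X, the James–Stein estimator and its positive-part version for d = 4.

import numpy as np

rng = np.random.default_rng(0)
d, reps = 4, 20000

def risks(theta):
    """Monte Carlo risks of the MLE, James-Stein and positive-part JS at theta."""
    X = theta + rng.standard_normal((reps, d))
    s = (X ** 2).sum(axis=1, keepdims=True)
    shrink = 1 - (d - 2) / s
    mle, js, js_plus = X, shrink * X, np.maximum(shrink, 0) * X
    sq_risk = lambda est: ((est - theta) ** 2).sum(axis=1).mean()
    return sq_risk(mle), sq_risk(js), sq_risk(js_plus)

for t in [0.0, 2.0, 4.0, 8.0]:
    theta = np.full(d, t / np.sqrt(d))        # a point with ||theta|| = t
    print(t, [round(r, 2) for r in risks(theta)])
# The MLE risk stays at d = 4 for every theta; both shrinkage estimators stay
# below 4, with the largest improvement near theta = 0.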

Minimax optimality
It is usually difficult to find the minimax risk and a minimax estimator exactly.
We are typically satisfied if we can find a good lower bound l(n) on R(Θ) and a good upper bound u(n) achieved by a specific estimator θ̃, i.e.
l(n) ≤ R(Θ) ≤ sup_{θ∈Θ} E_θ L(θ̃, θ) ≤ u(n),  (1)
with lim_n u(n)/l(n) ≤ C.
If θ̃ satisfies (1), then θ̃ is called a minimax optimal estimator.
8 / 44

Possible questions
1. In a nonparametric regression problem where the regression function θ satisfies smoothness conditions (e.g. twice continuously differentiable), with the squared error loss at one point, what is a minimax lower bound? That is,
sup_{θ∈Θ} E_θ (θ̃(x_0) − θ(x_0))² ≥ ... ?
2. In a density estimation problem where a density f on R is assumed to be s times differentiable, with the integrated squared error loss, what is a minimax lower bound? That is,
sup_{f∈F_s} E_f ∫ (f̃(x) − f(x))² dx ≥ ... ?
9 / 44

Lower bounds via TV distance
d is a metric on Θ, and d(θ_0, θ_1) ≥ 2δ for θ_0, θ_1 ∈ Θ.
Denote A_0 := {ω : d(θ_0, θ̂(ω)) < δ}.
(1) A risk bound implies a TV bound. Suppose
P_θ{d(θ, θ̂) ≥ δ} ≤ ɛ for all θ ∈ Θ.
Then P_{θ_0} A_0 ≥ 1 − ɛ and P_{θ_1} A_0 ≤ ɛ (on A_0, d(θ_1, θ̂) ≥ d(θ_0, θ_1) − d(θ_0, θ̂) > δ), which shows
V(P_{θ_0}, P_{θ_1}) ≥ P_{θ_0} A_0 − P_{θ_1} A_0 ≥ 1 − 2ɛ.
10 / 44

Lower bounds via TV distance
d is a metric on Θ, and d(θ_0, θ_1) ≥ 2δ for θ_0, θ_1 ∈ Θ.
Denote A_0 := {ω : d(θ_0, θ̂(ω)) < δ}.
(2) A TV bound implies a risk bound. Suppose V(P_{θ_0}, P_{θ_1}) < 1 − 2ɛ. Then
sup_{θ∈Θ} P_θ(d(θ, θ̂) ≥ δ) =: R(Θ, θ̂) > ɛ,
since
2 R(Θ, θ̂) ≥ P_{θ_0}(d(θ_0, θ̂) ≥ δ) + P_{θ_1}(d(θ_1, θ̂) ≥ δ)
≥ P_{θ_0} A_0^c + P_{θ_1} A_0 ≥ 1 − sup_A (P_{θ_0} A − P_{θ_1} A) > 2ɛ.
10 / 44

Lower bounds via testing
The argument above uses the loss L(ξ, θ) = 1{d(ξ, θ) ≥ δ}, for which
L(ξ, θ_0) + L(ξ, θ_1) ≥ 1 for all ξ whenever d(θ_0, θ_1) ≥ 2δ.
For a general loss function, suppose
inf_{ξ∈Θ} (L(ξ, θ_0) + L(ξ, θ_1)) ≥ d(θ_0, θ_1) for a pair θ_0, θ_1 in Θ.
Then
inf_{ξ∈Θ} ( L(ξ, θ_0)/d(θ_0, θ_1) + L(ξ, θ_1)/d(θ_0, θ_1) ) ≥ 1.
11 / 44

Lower bounds via testing
2 R(Θ, θ̂) ≥ E_{θ_0} L(θ̂, θ_0) + E_{θ_1} L(θ̂, θ_1)
= d(θ_0, θ_1) [ E_{θ_0} L(θ̂, θ_0)/d(θ_0, θ_1) + E_{θ_1} L(θ̂, θ_1)/d(θ_0, θ_1) ]
≥ d(θ_0, θ_1) inf_{f_0+f_1≥1, f_0,f_1≥0} [ E_{θ_0} f_0 + E_{θ_1} f_1 ].
Note that
inf_{f_0+f_1≥1, f_0,f_1≥0} ( E_{θ_0} f_0 + E_{θ_1} f_1 ) = ∫ (p_{θ_0} ∧ p_{θ_1}) dν = 1 − (1/2) ∫ |p_{θ_0} − p_{θ_1}| dν = 1 − V(P_{θ_0}, P_{θ_1}).
12 / 44

[Figure: densities of N(θ_0, 1) and N(θ_1, 1) on [−4, 4], with the type I and type II error regions of the test that cuts at (θ_0 + θ_1)/2.]
13 / 44

Testing affinity (TA)
Testing affinity: ∫ (p_{θ_0} ∧ p_{θ_1}) dν =: ‖P_{θ_0} ∧ P_{θ_1}‖_1.
TA = E_{θ_0} 1{p_{θ_0} ≤ p_{θ_1}} + E_{θ_1} 1{p_{θ_1} < p_{θ_0}} = sum of the two testing errors.
If P_{θ_0} = P_{θ_1}, then TA = 1, meaning testing is impossible.
If supp(P_{θ_0}) = [0, 1] and supp(P_{θ_1}) = [2, 3], then TA = 0, meaning a perfect test exists.
Calculation of TA for product measures P_{θ_0} = P_0ⁿ, P_{θ_1} = P_1ⁿ:
(1 − TA)² ≤ n h²(P_0, P_1), with h²(P_0, P_1) ≤ KL(P_0, P_1) ≤ χ²(P_0, P_1) for P_0 ≪ P_1.
14 / 44
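
As an illustrative numerical example (not from the slides; scipy assumed): for P_0 = N(0, 1) and P_1 = N(Δ, 1), the n-fold products form a normal location family again (via the sufficient statistic), so the affinity has the closed form TA = 2Φ(−Δ√n/2), while h²(P_0, P_1) = 2(1 − exp(−Δ²/8)). The sketch compares 1 − TA with the product-measure bound above.

import numpy as np
from scipy.stats import norm

def affinity_normal(delta, n):
    """Testing affinity ||P_0^n ∧ P_1^n||_1 for P_0 = N(0,1), P_1 = N(delta,1)."""
    return 2 * norm.cdf(-delta * np.sqrt(n) / 2)

delta = 0.5
h2 = 2 * (1 - np.exp(-delta ** 2 / 8))        # squared Hellinger for one observation
for n in [1, 4, 16, 64]:
    ta = affinity_normal(delta, n)
    print(n, round(ta, 3), (1 - ta) ** 2 <= n * h2)   # the bound (1-TA)^2 <= n*h^2 holds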

Reduction strategy
Restrict attention to a finite subset
Θ_F := {θ_α : α ∈ A} ⊂ Θ,
where A is a finite index set. With an estimator θ̂, define
α̂ := arg min_{α∈A} L(θ̂, θ_α).
Assume that, for a metric d and some constant δ > 0,
inf_{ξ∈Θ} ( L(ξ, θ_α) + L(ξ, θ_β) ) ≥ δ d(θ_α, θ_β) > 0 for all α ≠ β ∈ A.  (2)
15 / 44

Reduction strategy
Reduce the maximum risk to the average risk over the finite sub-parameter space:
R(Θ, θ̂) = (1/2) sup_{θ∈Θ} E_θ 2L(θ̂, θ)
≥ (1/2) max_{α∈A} E_{θ_α} 2L(θ̂, θ_α)
≥ (1/2) max_{α∈A} E_{θ_α} ( L(θ̂, θ_α) + L(θ̂, θ_{α̂}) )   [by definition of α̂]
≥ (δ/2) max_{α∈A} E_{θ_α} d(θ_{α̂}, θ_α)   [by condition (2)]
≥ (δ/(2|A|)) Σ_{α∈A} E_{θ_α} d(θ_{α̂}, θ_α).   [bound the max by the average]
16 / 44

Two-point tests
Let A = {0, 1} and d(θ_α, θ_β) = 1{θ_α ≠ θ_β} for α, β ∈ A. By the above calculation,
R(Θ, θ̂) ≥ (δ/4) ( E_{θ_0} d(θ_{α̂}, θ_0) + E_{θ_1} d(θ_{α̂}, θ_1) )
= (δ/4) ( P_{θ_0}(α̂ = 1) + P_{θ_1}(α̂ = 0) )
≥ (δ/4) inf { E_{θ_0} f_0 + E_{θ_1} f_1 : 0 ≤ f_0, f_1 ≤ 1, f_0 + f_1 = 1 }
= (δ/4) ‖P_{θ_0} ∧ P_{θ_1}‖_1.
Thus
R(Θ) ≥ (δ/4) ‖P_{θ_0} ∧ P_{θ_1}‖_1 ≥ (δ/4) (1 − h(P_{θ_0}, P_{θ_1})).  (3)
17 / 44

Le Cam's Lemma
Construct A := {θ_0, θ_1} ⊂ Θ such that
1. inf_{ξ∈Θ} ( L(ξ, θ_0) + L(ξ, θ_1) ) ≥ δ,
2. ‖P_{θ_0} ∧ P_{θ_1}‖_1 ≥ c > 0.
Then, for every estimator θ̂,
sup_{θ∈Θ} E_θ L(θ̂, θ) ≥ cδ/4.
18 / 44

Examples
(1) Normal location models. Let Y_1, ..., Y_n be i.i.d. N(θ, 1) with θ ∈ Θ ⊂ R. Then for any estimator θ̃,
sup_{θ∈Θ} E_θ (θ̃ − θ)² ≥ C n⁻¹.
Here P_θ = P̄_θⁿ (i.i.d. sample) with P̄_θ = N(θ, 1), and the loss is L(θ̂, θ) = (θ̂ − θ)².
(Loss condition) L(ξ, θ_0) + L(ξ, θ_1) ≥ (1/2) L(θ_0, θ_1) ≥(?) δ, using (ξ − θ_0)² + (ξ − θ_1)² ≥ (θ_0 − θ_1)²/2.
(Testing condition) V²(P_{θ_0}, P_{θ_1}) ≤ h²(P_{θ_0}, P_{θ_1}) ≤ n h²(P̄_{θ_0}, P̄_{θ_1}) ≤ n KL(P̄_{θ_0}, P̄_{θ_1}) = n (θ_1 − θ_0)²/2 ≤(?) 1/2.
Fix θ_0 and take θ_1 = θ_0 + 1/√n. Then (2) is satisfied with δ = 1/(2n), and h²(P_{θ_0}, P_{θ_1}) ≤ 1/2. By (3), we have R(Θ) ≥ (1/(8n)) (1 − 1/√2).
19 / 44
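
An illustrative numerical check of this two-point construction (a sketch, not part of the derivation; numpy assumed, and the exact product Hellinger distance for normals is used in place of the cruder bound h² ≤ n·KL):

import numpy as np

def lecam_bound_normal(n, theta0=0.0):
    """Le Cam two-point lower bound for the normal location model, theta1 = theta0 + 1/sqrt(n)."""
    theta1 = theta0 + 1 / np.sqrt(n)
    delta = 0.5 * (theta1 - theta0) ** 2                          # loss-separation constant = 1/(2n)
    kl_one = (theta1 - theta0) ** 2 / 2                           # KL for a single observation
    h2_prod = 2 * (1 - np.exp(-n * (theta1 - theta0) ** 2 / 8))   # exact Hellinger of the n-fold products
    assert h2_prod <= n * kl_one                                  # h^2 <= n*KL = 1/2 here
    c = 1 - np.sqrt(h2_prod)                                      # affinity >= 1 - h
    return c * delta / 4                                          # Le Cam bound c*delta/4

for n in [10, 100, 1000]:
    lb = lecam_bound_normal(n)
    print(n, lb, lb * n)
# lb*n stays constant: a lower bound of order 1/n, matching the risk 1/n of the
# sample mean, so the sample mean is minimax rate-optimal here.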

(2) Nonparametric regression. Let Y_i = θ(x_i) + ɛ_i for i = 1, ..., n, where ɛ_i ∼ N(0, 1), x_i = i/n, and θ ∈ Θ, with Θ the set of all twice continuously differentiable functions on [0, 1] with |θ''(x)| ≤ M. Then for any estimator θ̃ and any x_0 ∈ [0, 1],
sup_{θ∈Θ} E_θ (θ̃(x_0) − θ(x_0))² ≥ C n^{−4/5}.
Here P_θ = ⊗_{i=1}^n P_{θ,i} with P_{θ,i} = N(θ(i/n), 1), and the loss is L(θ̂, θ) = (θ̂(x_0) − θ(x_0))².
(Loss condition) L(ξ, θ_0) + L(ξ, θ_1) ≥ (1/2) L(θ_0, θ_1) ≥(?) δ
(Testing condition) h²(P_{θ_0}, P_{θ_1}) ≤ KL(P_{θ_0}, P_{θ_1}) ≤(?) 1/2
20 / 44

Let θ_0 = 0 on [0, 1] and θ_1(x) = M h_n² K((x − x_0)/h_n), where
K(t) = a exp( −1/(1 − (2t)²) ) 1{|2t| ≤ 1},
with a a normalizing constant so that K is a kernel, and let h_n = c n^{−1/5}.
Check θ_0, θ_1 ∈ Θ.
Check the loss condition:
L(ξ, θ_0) + L(ξ, θ_1) ≥ (1/2) L(θ_0, θ_1) = (M²/2) h_n⁴ K²(0) = (M²/2) h_n⁴ a² exp(−2).
Check the testing condition:
KL(P_{θ_0}, P_{θ_1}) = ∫ ⋯ ∫ Π_{i=1}^n φ(u_i) log [ Π_{i=1}^n φ(u_i) / Π_{i=1}^n φ(u_i − θ_1(x_i)) ] du_1 ⋯ du_n
= ∫ ⋯ ∫ Π_{i=1}^n φ(u_i) Σ_{i=1}^n ( −u_i θ_1(x_i) + (1/2) θ_1(x_i)² ) du_1 ⋯ du_n
21 / 44

KL(P θ0,p θ1 ) = 1 n θ 1 (x i ) 2 2 i=1 = M 2 n h 4 2 na 2 exp i=1 M 2 2 h4 na 2 n i=1 2 ) 2 1 4( xi x 0 h n 1 {x0 hn 2 i n x 0+ hn 2 } M 2 2 h4 na 2 nh n = 1 2 a2 nh 5 n 1 { xi x 0 hn 2 } Since h n 1/5, by choosing a sufficiently small, we have the lower bound of order h 4 n n 4/5. 22 / 44

Multiple tests
Let A = {0, 1}^m and d(θ_α, θ_β) = H(α, β) = Σ_{k=1}^m 1{α_k ≠ β_k} for α, β ∈ A,
e.g. α = (0, 0, ..., 1, 0, 1, ..., 1).
Start from
R(Θ, θ̂) ≥ (δ/(2|A|)) Σ_{α∈A} E_{θ_α} d(θ_{α̂}, θ_α).
23 / 44

R(Θ, θ̂) ≥ (δ/(2|A|)) Σ_{α∈A} E_{θ_α} d(θ_{α̂}, θ_α)
= (δ/(2 · 2^m)) Σ_{α∈A} Σ_{k=1}^m E_{θ_α} 1{α̂_k ≠ α_k}
= (δ/4) Σ_{k=1}^m [ (1/2^{m−1}) Σ_{α∈A_{0,k}} E_{θ_α} 1{α̂_k ≠ 0} + (1/2^{m−1}) Σ_{α∈A_{1,k}} E_{θ_α} 1{α̂_k ≠ 1} ]
= (δ/4) Σ_{k=1}^m ( P̄_{0,k}(α̂_k ≠ 0) + P̄_{1,k}(α̂_k ≠ 1) )
≥ (δ/4) Σ_{k=1}^m ‖P̄_{0,k} ∧ P̄_{1,k}‖_1 ≥ (δ/4) m min_{H(α,β)=1} ‖P_{θ_α} ∧ P_{θ_β}‖_1,
where P̄_{i,k} = (1/2^{m−1}) Σ_{α∈A_{i,k}} P_{θ_α} and A_{i,k} = {α ∈ A : α_k = i} for i = 0, 1.
(For the last inequality, pair each α ∈ A_{0,k} with the element of A_{1,k} obtained by flipping its k-th coordinate.)
24 / 44

Assouad's Lemma
Construct A := {θ_α : α ∈ {0, 1}^m} ⊂ Θ such that
1. inf_{ξ∈Θ} ( L(ξ, θ_α) + L(ξ, θ_β) ) ≥ δ H(α, β) for all α, β ∈ {0, 1}^m,
2. ‖P_{θ_α} ∧ P_{θ_β}‖_1 ≥ c > 0 whenever H(α, β) = 1.
Then, for every estimator θ̂,
sup_{θ∈Θ} E_θ L(θ̂, θ) ≥ cδm/4.
Notation: H(α, β) = ‖α − β‖_0 = Σ_{k=1}^m 1{α_k ≠ β_k}.
25 / 44

Examples
(3) Gaussian sequence models. Let Y_i = θ_i + σɛ_i, where ɛ_i ∼ N(0, 1) and σ → 0, with θ = (θ_1, θ_2, ...) ∈ Θ = { θ : Σ_k k^{2s} θ_k² ≤ M }, where M is a constant. Then for any estimator θ̃,
sup_{θ∈Θ} E_θ ‖θ̃ − θ‖_2² ≥ C σ^{4s/(1+2s)}.
P_{θ_k} = N(θ_k, σ²) for k = 1, 2, ..., and P_θ = ⊗_{k∈N} P_{θ_k}.
The loss L(θ, t) = Σ_k (θ_k − t_k)² satisfies L(θ_α, t) + L(θ_β, t) ≥ (1/2) L(θ_α, θ_β).
26 / 44

We need to construct θ_α ∈ Θ, α ∈ {0, 1}^m, such that
1. (1/2) Σ_k (θ_{α,k} − θ_{β,k})² ≥ δ ‖α − β‖_0 for all α, β ∈ {0, 1}^m,
2. ‖P_{θ_α} ∧ P_{θ_β}‖_1 ≥ c > 0 for ‖α − β‖_0 = 1.
Construct a finite subset by setting
θ_α = (θ_{α_1}, θ_{α_2}, ..., θ_{α_m}, 0, 0, ...) = ɛ (α_1, α_2, ..., α_m, 0, 0, ...).
Then
(1/2) Σ_k (θ_{α,k} − θ_{β,k})² = (1/2) ɛ² H(α, β),
so condition 1 holds with δ = ɛ²/2.
27 / 44

TA calculation: w.l.o.g. α_1 ≠ β_1. The two product measures differ only in the first coordinate, so
‖P_{θ_α} ∧ P_{θ_β}‖_1 = ‖ (P_0 ⊗ Q ⊗ ⊗_{i>m} P_0) ∧ (P_ɛ ⊗ Q ⊗ ⊗_{i>m} P_0) ‖_1 = ‖P_0 ∧ P_ɛ‖_1,
where P_t = N(t, σ²) and Q is the common distribution of coordinates 2, ..., m.
With the normal distribution, if ɛ = c_0 σ for σ → 0,
‖P_0 ∧ P_ɛ‖_1 = 2 ∫_{−∞}^{ɛ/2} φ_{σ²}(x − ɛ) dx = 2 (1 − Φ(ɛ/(2σ))) = 2 (1 − Φ(c_0/2)) > 0.
28 / 44

Check θ_α ∈ Θ by letting m ≍ (M ɛ^{−2})^{1/(1+2s)}, since
Σ_k k^{2s} θ_{α,k}² ≤ Σ_{k=1}^m k^{2s} ɛ² ≲ m^{1+2s} ɛ² ≤ M.
The lower bound is then of order
m ɛ² ≍ ɛ^{4s/(1+2s)} ≍ σ^{4s/(1+2s)},
which gives n^{−2s/(1+2s)} with the usual choice σ = n^{−1/2}.
29 / 44

(4) Smooth density estimation. Let Y_1, ..., Y_n be an i.i.d. sample from a density f ∈ F_s, where F_s is a class of s times differentiable densities satisfying the Hölder-type condition
∫ ( f^{(s−1)}(x + h) − f^{(s−1)}(x) )² dx ≤ (M|h|)².
Then
sup_{f∈F_s} E_f ∫ ( f̃(x) − f(x) )² dx ≥ C n^{−2s/(1+2s)}.
P_f = P̄_fⁿ, where P̄_f has density f.
The loss L(f, g) = ∫ (f(x) − g(x))² dx satisfies L(ξ, f_α) + L(ξ, f_β) ≥ (1/2) L(f_α, f_β).
30 / 44

Typical construction: small perturbations of a base density,
f_α(x) = f_0(x) + ɛ Σ_{k=1}^m α_k ϕ_k(x).
For the loss-separation condition,
(1/2) ∫ (f_α(x) − f_β(x))² dx = (1/2) ɛ² ∫ ( Σ_{k=1}^m (α_k − β_k) ϕ_k(x) )² dx ≥(?) δ Σ_{k=1}^m (α_k − β_k)².
This is convenient if ∫ ϕ_k ϕ_{k'} = V 1{k = k'}, so that δ = V ɛ²/2.
31 / 44

Ibragimov and Has'minskii (1983)
Construct perturbed uniform densities on [0, 1]:
f_0(x) = 1, ϕ_k(x) = ϕ(mx − k + 1/2) for k = 1, ..., m,
where
ϕ(x) = c_0 exp( −1/x − 1/(1 − 2x) ) 1{0 < x < 1/2}, extended by ϕ(−x) = −ϕ(x),
with c_0 small enough that the perturbed densities belong to F_s.
Perturbation function ϕ: infinitely differentiable, with ∫_{−1/2}^{1/2} ϕ(x) dx = 0.
Then ∫_0^1 f_α(x) dx = ∫_0^1 [ 1 + ɛ Σ_{k=1}^m α_k ϕ_k(x) ] dx = 1, and f_α(x) ≥ c_1 > 0.
32 / 44

[Figure: the perturbation function ϕ on [−1/2, 1/2].]
33 / 44

[Figure: the perturbed density with a perturbation on the first interval of [0, 1].]
33 / 44

[Figure: the perturbed density with a perturbation on every interval of [0, 1].]
33 / 44
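
The three pictures can be regenerated with a short script (illustrative; numpy and matplotlib assumed, and the constants c_0, ɛ, m below are arbitrary choices, not the ones used in the proof):

import numpy as np
import matplotlib.pyplot as plt

c0, m, eps = 1.0, 5, 0.5

def phi(x):
    """Antisymmetric smooth bump: c0*exp(-1/x - 1/(1-2x)) on (0, 1/2), odd extension on (-1/2, 0)."""
    x = np.asarray(x, float)
    out = np.zeros_like(x)
    pos = (x > 0) & (x < 0.5)
    out[pos] = c0 * np.exp(-1 / x[pos] - 1 / (1 - 2 * x[pos]))
    neg = (x < 0) & (x > -0.5)
    out[neg] = -c0 * np.exp(1 / x[neg] - 1 / (1 + 2 * x[neg]))
    return out

def f_alpha(x, alpha):
    """Perturbed uniform density 1 + eps * sum_k alpha_k * phi(m*x - k + 1/2)."""
    return 1 + eps * sum(a * phi(m * x - k + 0.5) for k, a in enumerate(alpha, start=1))

x = np.linspace(0, 1, 1000)
t = np.linspace(-0.5, 0.5, 1000)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(t, phi(t)); axes[0].set_title("perturbation function")
axes[1].plot(x, f_alpha(x, [1, 0, 0, 0, 0])); axes[1].set_title("perturbation on the first interval")
axes[2].plot(x, f_alpha(x, [1, 1, 1, 1, 1])); axes[2].set_title("perturbation on every interval")
plt.show()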

If ∫ ϕ_k ϕ_{k'} = 1{k = k'} C/m, then δ = C ɛ²/(2m).
For the testing condition with ‖α − β‖_0 = 1, bound the χ² distance:
χ²(P̄_{f_α}, P̄_{f_β}) := ∫ (f_α − f_β)²/f_α dx ≤ (1/c_1) ∫ (f_α − f_β)² dx ≍ (δ/c_1) ‖α − β‖_0 = δ/c_1,
and we need δ = C ɛ²/(2m) ≲ 1/n.
The resulting lower bound is mδ ≍ ɛ² ≍ m/n, up to constants.
34 / 44

Note that ϕ_k^{(s−1)}(x) = m^{s−1} ϕ^{(s−1)}(mx − k + 1/2). Check f_α ∈ F_s:
∫ ( f_α^{(s−1)}(x + h) − f_α^{(s−1)}(x) )² dx = ɛ² ∫ ( Σ_k α_k ( ϕ_k^{(s−1)}(x + h) − ϕ_k^{(s−1)}(x) ) )² dx
≲ ɛ² m^{2s−2} ∫ ( ϕ^{(s−1)}(x + mh) − ϕ^{(s−1)}(x) )² dx   [essentially disjoint supports and a change of variables]
≤ ɛ² m^{2s−2} (M m|h|)²   [since ϕ ∈ F_s]
= M² ɛ² m^{2s} h² ≤(?) (M|h|)²,
which holds if m ≍ ɛ^{−1/s}.
From the testing condition, 1/n ≳ ɛ²/m ≍ ɛ^{2+1/s}; that is, ɛ ≍ n^{−s/(1+2s)}.
Then the lower bound is ɛ² ≍ n^{−2s/(1+2s)}.
35 / 44

Lemma. V(P, Q) = (1/2) ∫ |p − q| dν = 1 − ∫ (p ∧ q) dν.
Proof. Let A_0 = {x ∈ X : p(x) ≥ q(x)}.
1. V(P, Q) ≥ P(A_0) − Q(A_0) = (1/2) ∫ |p − q| dν = 1 − ∫ (p ∧ q) dν.
2. For all A ∈ A,
∫_A (p − q) dν = ∫_{A∩A_0} (p − q) + ∫_{A∩A_0^c} (p − q) ≤ max{ ∫_{A_0} (p − q), ∫_{A_0^c} (q − p) } = (1/2) ∫ |p − q|.
By the above two, the claims are proved.
36 / 44

Lemma. (1/2) h²(P, Q) ≤ V(P, Q) ≤ h(P, Q) √(1 − h²(P, Q)/4).
Proof.
1. 1 − (1/2) h²(P, Q) = ∫ √(pq) = ∫_{p>q} √(pq) + ∫_{q≥p} √(pq) ≥ ∫_{p>q} q + ∫_{q≥p} p = ∫ (p ∧ q) = 1 − V(P, Q).
2. (1 − (1/2) h²(P, Q))² = ( ∫ √((p ∧ q)(p ∨ q)) )² ≤ ∫ (p ∧ q) · ∫ (p ∨ q) = (1 − V)(1 + V) = 1 − V².
Thus V(P, Q)² ≤ 1 − (1 − (1/2) h²(P, Q))² = h²(P, Q) (1 − h²(P, Q)/4).
37 / 44

Lemma. h²(P, Q) ≤ KL(P, Q) ≤ χ²(P, Q).
Proof.
1. KL(P, Q) = ∫_{pq>0} p log(p/q) = −2 ∫_{pq>0} p log √(q/p) = −2 ∫_{pq>0} p log( √(q/p) − 1 + 1 ) ≥ −2 ∫_{pq>0} p ( √(q/p) − 1 ) = 2 − 2 ∫ √(pq) = h²(P, Q), using log(1 + x) ≤ x.
2. KL(P, Q) = ∫_{pq>0} p log( p/q − 1 + 1 ) ≤ ∫_{pq>0} p ( p/q − 1 ) ≤ χ²(P, Q).
38 / 44

Lemma. h²(Pⁿ, Qⁿ) = 2 ( 1 − (1 − h²(P, Q)/2)ⁿ ) ≤ n h²(P, Q).
Proof.
1. 1 − (1/2) h²(Pⁿ, Qⁿ) = ∫ √(pⁿ qⁿ) = ( ∫ √(pq) )ⁿ = ( 1 − h²(P, Q)/2 )ⁿ.
2. Use (1 − a)ⁿ ≥ 1 − an for 0 ≤ a < 1.
39 / 44

Lemma (Pinsker). V(P, Q) ≤ √(KL(P, Q)/2).
Proof.
1. Let ψ(x) = x log x − x + 1 for x ≥ 0, with 0 log 0 := 0. Note ψ(0) = 1, ψ(1) = 0, ψ'(1) = 0, ψ''(x) = 1/x ≥ 0, and ψ(x) ≥ 0 for all x ≥ 0.
2. Use (4/3 + 2x/3) ψ(x) ≥ (x − 1)² for x ≥ 0. If P ≪ Q,
V(P, Q) = (1/2) ∫ |p − q| = (1/2) ∫_{q>0} |p/q − 1| q
≤ (1/2) ∫_{q>0} q √( (4/3 + 2p/(3q)) ψ(p/q) )
≤ (1/2) √( ∫_{q>0} q (4/3 + 2p/(3q)) ) √( ∫_{q>0} q ψ(p/q) )   [Cauchy–Schwarz]
= (1/2) √2 √( ∫_{pq>0} p log(p/q) ) = √( KL(P, Q)/2 ).
40 / 44

[Figure: ψ(x) = x log x − x + 1 on [0, 3] (left), and (4/3 + 2x/3) ψ(x) compared with (x − 1)² (right), illustrating the inequality used in the proof.]
41 / 44
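
The pointwise inequality (4/3 + 2x/3) ψ(x) ≥ (x − 1)² used in step 2 is easy to confirm numerically; a minimal check (numpy assumed):

import numpy as np

x = np.linspace(1e-9, 3, 100000)
psi = x * np.log(x) - x + 1
assert np.all((4/3 + 2*x/3) * psi >= (x - 1)**2 - 1e-12)   # holds on (0, 3]; at x = 0, (4/3)*psi(0) = 4/3 >= 1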

Lemma. V(P, Q) ≤ 1 − (1/2) exp(−KL(P, Q)).
Proof.
1. W.l.o.g. KL(P, Q) < ∞ (i.e. P ≪ Q).
2. ( ∫ √(pq) )² = exp( 2 log ∫_{pq>0} p √(q/p) ) ≥ exp( 2 ∫_{pq>0} p log √(q/p) ) = exp( −KL(P, Q) ),   [Jensen's inequality]
3. Use 2 ∫ (p ∧ q) ≥ ∫ (p ∧ q) ∫ (p ∨ q) ≥ ( ∫ √((p ∧ q)(p ∨ q)) )² = ( ∫ √(pq) )² ≥ exp(−KL(P, Q)), together with 2 ∫ (p ∧ q) = 2 (1 − V(P, Q)).
42 / 44

Theorem. Suppose Π is a distribution on Θ such that
∫ R(θ, δ_Π) dΠ(θ) = sup_θ R(θ, δ_Π).
Then
1. δ_Π is minimax;
2. Π is least favourable.
Proof.
(i) Let δ be any other procedure. Then
sup_θ R(θ, δ) ≥ ∫ R(θ, δ) dΠ(θ) ≥ ∫ R(θ, δ_Π) dΠ(θ) = sup_θ R(θ, δ_Π).
(ii) Let Π' be some other distribution on θ. Then
∫ R(θ, δ_{Π'}) dΠ'(θ) ≤ ∫ R(θ, δ_Π) dΠ'(θ) ≤ sup_θ R(θ, δ_Π) = ∫ R(θ, δ_Π) dΠ(θ).
43 / 44

Theorem. Suppose {Π_n} is a sequence of prior distributions with Bayes risks r_{Π_n} := ∫ R(θ, δ_n) dΠ_n(θ), and suppose δ is an estimator for which
sup_θ R(θ, δ) = lim_n r_{Π_n}.
Then
1. δ is minimax;
2. the sequence {Π_n} is least favourable.
Proof.
(i) Suppose δ' is any other estimator. Then
sup_θ R(θ, δ') ≥ ∫ R(θ, δ') dΠ_n(θ) ≥ r_{Π_n},
and this holds for every n. Hence sup_θ R(θ, δ') ≥ lim_n r_{Π_n} = sup_θ R(θ, δ), so δ is minimax.
(ii) If Π is any distribution, then
∫ R(θ, δ_Π) dΠ(θ) ≤ ∫ R(θ, δ) dΠ(θ) ≤ sup_θ R(θ, δ) = lim_n r_{Π_n}.
The claim holds by definition.
44 / 44