Asymptotic efficiency of simple decisions for the compound decision problem

Eitan Greenshtein
Department of Statistical Sciences, Duke University, Durham, NC 27708-0251, USA
e-mail: eitan.greenshtein@gmail.com

Ya'acov Ritov
Jerusalem, Israel
e-mail: yaacov.ritov@gmail.com

(Partially supported by NSF grant DMS-0605236, and an ISF grant.)

Abstract: We consider the compound decision problem of estimating a vector of $n$ parameters, known up to a permutation, corresponding to $n$ independent observations, and discuss the difference between two symmetric classes of estimators. The first and larger class is restricted to the set of all permutation invariant estimators. The second class is restricted further to simple symmetric procedures, that is, estimators such that each parameter is estimated by a function of the corresponding observation alone. We show that under mild conditions the minimal total squared error risks over these two classes are asymptotically equivalent up to an essentially $O(1)$ difference.

AMS 2000 subject classifications: Primary 62C25; secondary 62C12, 62C07.
Keywords and phrases: Compound decision, simple decision rules, permutation invariant rules.

1. Introduction

Let $F = \{F_\mu : \mu \in M\}$ be a parameterized family of distributions. Let $Y_1, Y_2, \dots$ be a sequence of independent random variables, where $Y_i$ takes values in some space $\mathcal{Y}$ and $Y_i \sim F_{\mu_i}$, $i = 1, 2, \dots$. For each $n$, we suppose that the sequence $\mu_{1:n}$ is known up to a permutation, where for any sequence $x = (x_1, x_2, \dots)$ we denote the sub-sequence $x_s, \dots, x_t$ by $x_{s:t}$. We denote by $\tilde\mu = \tilde\mu_n$ the set $\{\mu_1, \dots, \mu_n\}$, i.e., $\tilde\mu$ is $\mu_{1:n}$ without any order information. We consider in this note the problem of estimating $\mu_{1:n}$ by $\hat\mu_{1:n}$ under the loss $\sum_{i=1}^n (\hat\mu_i - \mu_i)^2$, where $\hat\mu_{1:n} = \Delta(Y_{1:n})$. We assume that the family $F$ is dominated by a measure $\nu$, and denote the corresponding densities simply by $f_i = f_{\mu_i}$, $i = 1, \dots, n$. The important example is, as usual, $F_{\mu_i} = N(\mu_i, 1)$.

Let $D^S = D^S_n$ be the set of all simple symmetric decision functions, that is, all $\Delta$ such that $\Delta(Y_{1:n}) = (\delta(Y_1), \dots, \delta(Y_n))$ for some function $\delta : \mathcal{Y} \to M$. In particular, the best simple symmetric rule is denoted by $\Delta^S_{\tilde\mu} = (\delta^S_{\tilde\mu}(Y_1), \dots, \delta^S_{\tilde\mu}(Y_n))$:

  $\Delta^S_{\tilde\mu} = \arg\min_{\Delta \in D^S_n} E\|\Delta - \mu_{1:n}\|^2$,

and denote

  $r^S_n = E\|\Delta^S_{\tilde\mu}(Y_{1:n}) - \mu_{1:n}\|^2$,

where, as usual, $\|a_{1:n}\|^2 = \sum_{i=1}^n a_i^2$.

The class of simple rules may be considered too restrictive. Since the $\mu$s are known up to a permutation, the problem seems to be that of matching the $Y$s to the $\mu$s. Thus, if $Y_i \sim N(\mu_i, 1)$ and $n = 2$, a reasonable decision would make $\hat\mu_1$ closer to $\mu_1 \wedge \mu_2$ as $Y_2$ gets larger. The simple rule clearly remains inefficient if the $\mu$s are well separated, and, generally speaking, a bigger class of decision rules may be needed to obtain efficiency. However, given the natural invariance of the problem, it makes sense to restrict attention to the class $D^{PI} = D^{PI}_n$ of all permutation invariant decision functions, i.e., functions that satisfy, for any permutation $\pi$ and any $(Y_1, \dots, Y_n)$:

  $\Delta(Y_1, \dots, Y_n) = (\hat\mu_1, \dots, \hat\mu_n) \;\Longrightarrow\; \Delta(Y_{\pi(1)}, \dots, Y_{\pi(n)}) = (\hat\mu_{\pi(1)}, \dots, \hat\mu_{\pi(n)})$.

Let $\Delta^{PI}_{\tilde\mu} = \arg\min_{\Delta \in D^{PI}_n} E\|\Delta(Y_{1:n}) - \mu_{1:n}\|^2$ be the optimal permutation invariant rule under $\tilde\mu$, and denote its risk by

  $r^{PI}_n = E\|\Delta^{PI}_{\tilde\mu}(Y_{1:n}) - \mu_{1:n}\|^2$.

Obviously $D^S \subset D^{PI}$, and hence $r^S_n \ge r^{PI}_n$. Still, folklore, theorems in the spirit of De Finetti, and results like Hannan and Robbins (1955) imply that asymptotically (as $n \to \infty$) $\Delta^{PI}_{\tilde\mu}$ and $\Delta^S_{\tilde\mu}$ will have similar mean risks: $r^S_n - r^{PI}_n = o(n)$. Our main result establishes conditions that imply the stronger claim $r^S_n - r^{PI}_n = O(1)$.

To repeat, $\tilde\mu$ is assumed known in this note. In the general decision theory framework the unknown parameter is the correspondence of its members with $Y_{1:n}$, and the parameter space therefore corresponds to the set of all permutations of $1, \dots, n$. An asymptotic equivalence as above implies that when we confine ourselves to the class of permutation invariant procedures, we may further restrict ourselves to the class of simple symmetric procedures, as is usually done in the standard analysis of compound decision problems. The latter class is smaller and simpler.

The motivation for this paper stems from the way the notion of an oracle is used in some sparse estimation problems. Consider two oracles, both of which know the value of $\tilde\mu$. Oracle I is restricted to use only procedures from the class $D^{PI}$, while Oracle II is restricted further to procedures from $D^S$. Obviously Oracle I has an advantage; our results quantify this advantage and show that it is asymptotically negligible. Furthermore, starting with Robbins (1951), various oracle inequalities were obtained showing that one can achieve nearly the risk of Oracle II by a legitimate statistical procedure. See, e.g., the survey Zhang (2003) for oracle inequalities regarding the difference in risks. See also Brown and Greenshtein (2007) and Jiang and Zhang (2007) for oracle inequalities regarding the ratio

of the risks. However, Oracle II is limited, and hence these claims may seem too weak. Our equivalence results extend many of those oracle inequalities so that they are valid also with respect to Oracle I.

We needed a stronger result than the usual objective that the mean risks are equal up to an $o(1)$ difference. Many of the above mentioned recent applications of the compound decision notion are about sparse situations in which most of the $\mu$s are in fact 0, the mean risk is $o(1)$, and the only interest is in the total risk.

Let $\mu_1, \dots, \mu_n$ be some arbitrary ordering of $\tilde\mu$. Consider now the Bayesian model under which $(\pi, Y_{1:n})$, $\pi$ a random permutation, have a distribution given by

  $\pi$ is uniformly distributed over $\mathcal{P}(1{:}n)$;
  given $\pi$, $Y_1, \dots, Y_n$ are independent, $Y_i \sim F_{\mu_{\pi(i)}}$, $i = 1, \dots, n$,    (1.1)

where for every $s < t$, $\mathcal{P}(s{:}t)$ is the set of all permutations of $s, \dots, t$. The above description induces a joint distribution of $(M_1, \dots, M_n, Y_1, \dots, Y_n)$, where $M_i = \mu_{\pi(i)}$ for a random permutation $\pi$.

The first part of the following proposition is a simple special case of general theorems representing the best invariant procedure under certain groups as the Bayesian decision with respect to the appropriate Haar measure; for background see, e.g., Berger (1985, Chapter 6). The second part of the proposition was derived in various papers starting with Robbins (1951). In the following proposition and proof, $E_{\mu_{1:n}}$ is the expectation under the model in which the observations are independent and $Y_i \sim F_{\mu_i}$, while $E_{\tilde\mu}$ is the expectation under the above joint distribution of $Y_{1:n}$ and $M_{1:n}$. Note that under the latter model, for any $i = 1, \dots, n$, marginally $M_i \sim G_n$, the empirical measure defined by the vector $\tilde\mu$, and conditionally on $M_i = m$, $Y_i \sim F_m$.

Proposition 1.1. The best simple and permutation invariant rules are given by

(i) $\Delta^{PI}_{\tilde\mu}(Y_{1:n}) = E_{\tilde\mu}(M_{1:n} \mid Y_{1:n})$.

(ii) $\Delta^S_{\tilde\mu}(Y_{1:n}) = \big(E_{\tilde\mu}(M_1 \mid Y_1), \dots, E_{\tilde\mu}(M_n \mid Y_n)\big)$.

(iii) $r^S_n = r^{PI}_n + E_{\tilde\mu}\|\Delta^S_{\tilde\mu} - \Delta^{PI}_{\tilde\mu}\|^2$.

Proof. We need only give the standard proof of the third part. First, note that by invariance $\Delta^{PI}_{\tilde\mu}$ is an equalizer (over all the permutations of $\tilde\mu$), and hence $E_{\mu_{1:n}}\|\Delta^{PI}_{\tilde\mu} - \mu_{1:n}\|^2 = E_{\tilde\mu}\|\Delta^{PI}_{\tilde\mu} - M_{1:n}\|^2$. Also $E_{\mu_{1:n}}\|\Delta^S_{\tilde\mu} - \mu_{1:n}\|^2 = E_{\tilde\mu}\|\Delta^S_{\tilde\mu} - M_{1:n}\|^2$. Then, given the above joint distribution,

  $r^S_n = E_{\tilde\mu}\|\Delta^S_{\tilde\mu} - M_{1:n}\|^2$
       $= E_{\tilde\mu} E_{\tilde\mu}\{\|\Delta^S_{\tilde\mu} - M_{1:n}\|^2 \mid Y_{1:n}\}$
       $= E_{\tilde\mu} E_{\tilde\mu}\{\|\Delta^S_{\tilde\mu} - \Delta^{PI}_{\tilde\mu}\|^2 + \|\Delta^{PI}_{\tilde\mu} - M_{1:n}\|^2 \mid Y_{1:n}\}$
       $= r^{PI}_n + E_{\tilde\mu}\|\Delta^S_{\tilde\mu} - \Delta^{PI}_{\tilde\mu}\|^2$.
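Proposition 1.1 lends itself to a direct numerical check for a small $n$, since $\Delta^{PI}_{\tilde\mu}$ can be computed by brute force over all $n!$ permutations, while $\Delta^S_{\tilde\mu}$ is the coordinatewise posterior mean under the empirical distribution of $\tilde\mu$. The following sketch is only an illustration (the vector $\tilde\mu$, the normal model and the sample sizes are arbitrary choices, not taken from the paper); it estimates $r^S_n$, $r^{PI}_n$ and $E_{\tilde\mu}\|\Delta^S_{\tilde\mu} - \Delta^{PI}_{\tilde\mu}\|^2$ by Monte Carlo and checks part (iii). Since both rules are permutation invariant, their risks can be simulated with $Y_i \sim N(\mu_i, 1)$ in a fixed order.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])           # illustrative tilde-mu, n = 6
n = len(mu)
perms = np.array(list(itertools.permutations(range(n))))  # all n! = 720 orderings

def phi(z):                                               # N(0,1) density
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def delta_S(y):
    # Coordinatewise rule E(M_i | Y_i): posterior mean under the empirical law of mu.
    w = phi(y[:, None] - mu[None, :])                     # w[i, j] = f_j(Y_i)
    return (w * mu).sum(axis=1) / w.sum(axis=1)

def delta_PI(y):
    # Brute-force E(M_{1:n} | Y_{1:n}) under the uniform-permutation model (1.1).
    like = phi(y[None, :] - mu[perms]).prod(axis=1)       # likelihood of each ordering
    post = like / like.sum()
    return post @ mu[perms]

reps = 4000
loss_S = np.empty(reps); loss_PI = np.empty(reps); gap = np.empty(reps)
for r in range(reps):
    y = rng.normal(mu, 1.0)
    dS, dPI = delta_S(y), delta_PI(y)
    loss_S[r] = np.sum((dS - mu) ** 2)
    loss_PI[r] = np.sum((dPI - mu) ** 2)
    gap[r] = np.sum((dS - dPI) ** 2)

print("r_S                      ~", loss_S.mean())
print("r_PI + E||d_S - d_PI||^2 ~", loss_PI.mean() + gap.mean())  # equal up to MC error
```

The two printed quantities estimate the two sides of (iii) and should agree up to Monte Carlo error for any choice of $\tilde\mu$; the question studied in this paper is when the gap term stays $O(1)$ as $n$ grows.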

We now briefly review some related literature and problems. On simple symmetric functions, compound decision and its relation to empirical Bayes, see Samuel (1965), Copas (1969), Robbins (1983), and Zhang (2003), among many other papers.

Hannan and Robbins (1955) formulated essentially the same equivalence problem for testing problems; see their Section 6. They show, for a special case, an equivalence up to an $o(n)$ difference in the total (i.e., non-averaged) risk. Our results for estimation under squared loss are stated in terms of the total risk, and we obtain an $O(1)$ difference.

Our results have a strong conceptual connection to De Finetti's theorem. The exchangeability induced on $M_1, \dots, M_n$ by the Haar measure implies asymptotic independence as in De Finetti's theorem, and consequently asymptotic independence of $Y_1, \dots, Y_n$. Thus we expect $E(M_1 \mid Y_1)$ to be asymptotically similar to $E(M_1 \mid Y_1, \dots, Y_n)$. Quantifying this similarity as $n$ grows has to do with the rate of convergence in De Finetti's theorem. Such rates were established by Diaconis and Freedman (1980), but they are not directly applicable to our results.

After quoting a simple result in the following section, we consider in Section 3 the important, but simple, special case of a two-valued parameter. In Section 4 we obtain a strong result under strong conditions. Finally, the main result is given in Section 5; it covers the two preceding cases, but with some price paid for the generality.

2. Basic lemma and notation

The following lemma is standard in the theory of comparison of experiments; for background on comparison of experiments in testing see Lehmann (1986, p. 86). The proof is a simple application of Jensen's inequality.

Lemma 2.1. Consider two pairs of distributions, $\{G_0, G_1\}$ and $\{\tilde G_0, \tilde G_1\}$, such that the first pair represents a weaker experiment in the sense that there is a Markov kernel $K$ with $G_i(\cdot) = \int K(y, \cdot)\, d\tilde G_i(y)$, $i = 0, 1$. Then for any convex function $\psi$,

  $E_{G_0}\, \psi\Big(\dfrac{dG_1}{dG_0}\Big) \le E_{\tilde G_0}\, \psi\Big(\dfrac{d\tilde G_1}{d\tilde G_0}\Big)$.

For simplicity denote $f_i(\cdot) = f_{\mu_i}(\cdot)$, and for any random variable $X$ we may write $X \sim g$ if $g$ is its density with respect to a certain dominating measure. We use the notation $y_{-i}$ for the sequence $y_1, \dots, y_n$ without its $i$-th member, and similarly $\tilde\mu_{-i} = \{\mu_1, \dots, \mu_n\} \setminus \{\mu_i\}$. Finally, $f_{-i}(Y_{-j})$ is the marginal density of $Y_{-j}$ under the model (1.1), conditional on $M_j = \mu_i$.

3. Two-valued parameter

We suppose in this section that $\mu$ can take one of two values, which we denote by $\{0, 1\}$. To simplify notation we denote the two densities by $f_0$ and $f_1$.
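Before turning to the theorem, here is a small numerical illustration of Lemma 2.1 in exactly this two-density setting (the choices $f_0 = N(0,1)$, $f_1 = N(1,1)$ and the sample size are mine, for illustration only). The stronger pair is the ordered two-coordinate experiment $(f_0 \otimes f_0,\ f_1 \otimes f_0)$; randomly permuting the coordinates is a Markov kernel, under which the likelihood ratio becomes an average, so its second moment can only decrease, as the lemma (with $\psi(x) = x^2$) asserts.

```python
import numpy as np

rng = np.random.default_rng(2)

# f_0 = N(0,1), f_1 = N(1,1); (f_1/f_0)(y) = exp(y - 1/2).
y = rng.normal(0.0, 1.0, size=(200_000, 2))        # a sample from G~_0 = f_0 x f_0
lr = np.exp(y - 0.5)                                # coordinatewise likelihood ratios

strong = np.mean(lr[:, 0] ** 2)                     # E_{G~0}(dG~1/dG~0)^2, exact value e
weak = np.mean(((lr[:, 0] + lr[:, 1]) / 2) ** 2)    # after the random-swap kernel, exact (e+1)/2
print("weaker experiment:", round(weak, 3), "<= stronger experiment:", round(strong, 3))
```

The same mechanism, with the random permutation acting on $n-1$ coordinates rather than two, is what drives the proofs below.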

Theorem 3.1. Suppose that either of the following two conditions holds:

(i) $f_{1-\mu}(Y_1)/f_\mu(Y_1)$ has a finite variance under $\mu$, for both $\mu \in \{0, 1\}$.

(ii) $\sum_{i=1}^n \mu_i/n \to \gamma \in (0, 1)$, and $f_{1-\mu}(Y_1)/f_\mu(Y_1)$ has a finite variance under $\mu$, for at least one $\mu \in \{0, 1\}$.

Then $E_{\tilde\mu}\|\hat\mu^S - \hat\mu^{PI}\|^2 = O(1)$, where $\hat\mu^S = \Delta^S_{\tilde\mu}(Y_{1:n})$ and $\hat\mu^{PI} = \Delta^{PI}_{\tilde\mu}(Y_{1:n})$.

Proof. Suppose condition (i) holds. Let $K = \sum_{i=1}^n \mu_i$, and suppose, WLOG, that $K \le n/2$. Consider the Bayes model (1.1). By Bayes' theorem,

  $P(M_1 = 1 \mid Y_1) = \dfrac{K f_1(Y_1)}{K f_1(Y_1) + (n-K) f_0(Y_1)}$.

On the other hand,

  $P(M_1 = 1 \mid Y_{1:n}) = \dfrac{K f_1(Y_1) f_{K-1}(Y_{2:n})}{K f_1(Y_1) f_{K-1}(Y_{2:n}) + (n-K) f_0(Y_1) f_K(Y_{2:n})}$
  $= \dfrac{K f_1(Y_1)}{K f_1(Y_1) + (n-K) f_0(Y_1)\Big(1 + \big(\tfrac{f_K}{f_{K-1}}(Y_{2:n}) - 1\big)\Big)}$
  $= P(M_1 = 1 \mid Y_1)\Big(1 + \tilde\gamma \big(\tfrac{f_K}{f_{K-1}}(Y_{2:n}) - 1\big)\Big)^{-1}$,   $\tilde\gamma = \dfrac{(n-K) f_0(Y_1)}{K f_1(Y_1) + (n-K) f_0(Y_1)}$,

where, with some abuse of notation, $f_k(Y_{2:n})$ is the joint density of $Y_{2:n}$ conditional on $\sum_{j=2}^n M_j = k$, and the random variable $\tilde\gamma$ is in $[0, 1]$.

We now prove that $(f_K/f_{K-1})(Y_{2:n})$ converges to 1 in mean square. We use Lemma 2.1 (with $\psi$ the square) to compare the testing of $f_K(Y_{2:n})$ vs. $f_{K-1}(Y_{2:n})$ to an easier problem, from which the original problem can be obtained by adding a random permutation. Suppose for simplicity, and WLOG, that in fact $Y_{2:K}$ are i.i.d. under $f_1$, while $Y_{K+1:n}$ are i.i.d. under $f_0$. Then we compare

  $g_{K-1}(Y_{2:n}) = \prod_{j=2}^{K} f_1(Y_j) \prod_{j=K+1}^{n} f_0(Y_j)$,

the true distribution, to the mixture

  $g_K(Y_{2:n}) = g_{K-1}(Y_{2:n}) \cdot \dfrac{1}{n-K} \sum_{j=K+1}^{n} \dfrac{f_1}{f_0}(Y_j)$.

However, the likelihood ratio between $g_K$ and $g_{K-1}$ is an average of $n-K$ terms, each with mean 1 (under $g_{K-1}$) and finite variance. The ratio between the $g$s is, therefore, $1 + O_p(n^{-1/2})$ in mean square. By Lemma 2.1, this applies also to the ratio of the $f$s.

Consider now the second condition. By assumption, $K$ is of the same order as $n$, and we can assume, WLOG, that $f_1/f_0$ has a finite variance under $f_0$. With this understanding, the above proof holds under the second condition.
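The step that drives the proof is that, under $g_{K-1}$, the ratio $g_K/g_{K-1}$ is an average of $n-K$ i.i.d. likelihood ratios with mean 1, so its mean-square distance from 1 is $\mathrm{Var}\big(f_1(Y)/f_0(Y)\big)/(n-K)$. A quick Monte Carlo sketch for the normal shift model with $f_0 = N(0,1)$ and $f_1 = N(1,1)$ (illustrative choices; here this variance equals $e - 1$):

```python
import numpy as np

rng = np.random.default_rng(3)

def lr(y):                         # (f_1/f_0)(y) for f_0 = N(0,1), f_1 = N(1,1)
    return np.exp(y - 0.5)

# Under g_{K-1} the coordinates Y_{K+1:n} are i.i.d. f_0, and g_K/g_{K-1} is the
# average of (f_1/f_0)(Y_j) over these m = n - K coordinates.
for m in (100, 1_000, 10_000):
    y0 = rng.normal(0.0, 1.0, size=(5_000, m))
    ratio = lr(y0).mean(axis=1)
    mse = np.mean((ratio - 1.0) ** 2)
    print(f"m = {m:6d}   E(ratio-1)^2 ~ {mse:.5f}   (e-1)/m = {(np.e - 1)/m:.5f}")
```

By Lemma 2.1 (with the convex function $\psi(x) = (x-1)^2$) the same second-moment bound carries over to $(f_K/f_{K-1})(Y_{2:n})$, which is obtained from the $g$ pair by a random permutation.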

The condition of the theorem is clearly satisfied in the normal shift model, $F_i = N(\mu_i, 1)$, $i = 0, 1$. It is satisfied for the normal scale model, $F_i = N(0, \sigma_i^2)$, $i = 0, 1$, if $K$ is of the same order as $n$, or if $\sigma_0^2/2 < \sigma_1^2 < 2\sigma_0^2$.

4. Dense µs

We consider now another simple case, in which $\tilde\mu$ can be ordered as $\mu_{(1)}, \dots, \mu_{(n)}$ such that the differences $\mu_{(i+1)} - \mu_{(i)}$ are uniformly small. This will happen if, for example, $\tilde\mu$ is in fact a random sample from a distribution with a density (with respect to the Lebesgue measure) which is bounded away from 0 on its support, or, more generally, if it is sampled from a distribution with short tails. Denote by $Y_{(1)}, \dots, Y_{(n)}$ and $f_{(1)}, \dots, f_{(n)}$ the $Y$s and $f$s ordered according to the $\mu$s. We assume in this section:

(B1) For some constants $A_n$ and $V_n$ which are bounded by a sequence converging to infinity slowly:

  $\max_{i,j} |\mu_i - \mu_j| \le A_n$,    $\mathrm{Var}\Big(\dfrac{f_{(j+1)}}{f_{(j)}}(Y_{(j)})\Big) \le \dfrac{V_n}{n^2}$.

Note that condition (B1) holds for both the normal shift model and the normal scale model if $\tilde\mu$ behaves like a sample from a distribution with a density as above. (For the normal shift model, for instance, $\mathrm{Var}\big((f_{(j+1)}/f_{(j)})(Y_{(j)})\big) = e^{d_j^2} - 1 \approx d_j^2$ with $d_j = \mu_{(j+1)} - \mu_{(j)}$, and the maximal spacing of such a sample is $O_p(\log n/n)$.)

Theorem 4.1. If Assumption (B1) holds, then

  $\sum_{i=1}^n (\hat\mu^{PI}_i - \hat\mu^S_i)^2 = O_p(A_n^2 V_n^2/n)$.

Proof. By definition,

  $\hat\mu^S_1 = \dfrac{\sum_{i=1}^n \mu_i f_i(Y_1)}{\sum_{i=1}^n f_i(Y_1)}$,    $\hat\mu^{PI}_1 = \dfrac{\sum_{i=1}^n \mu_i f_i(Y_1) f_{-i}(Y_{2:n})}{\sum_{i=1}^n f_i(Y_1) f_{-i}(Y_{2:n})}$,

where $f_{-i}$ is the density of $Y_{2:n}$ under $\tilde\mu_{-i}$:

  $f_{-i}(y_{2:n}) = \dfrac{1}{(n-1)!} \sum_{\pi \in \mathcal{P}(2:n)} \prod_{j=2}^n f_{\pi(j)}(y_j)\, \dfrac{f_1}{f_i}\big(y_{\pi^{-1}(i)}\big)$

(for $i = 1$ the last factor is 1). The result will follow if we argue that

  $|\hat\mu^{PI}_1 - \hat\mu^S_1| \le \max_{i,j} |\mu_i - \mu_j| \cdot \max_{i,j \le n} \Big|\dfrac{f_{-i}}{f_{-j}}(Y_{2:n}) - 1\Big| = O_p(A_n V_n/n)$.   (4.1)

That is, $\max_i |(f_{-i}/f_{-1})(Y_{2:n}) - 1| = O_p(V_n/n)$. In fact, we will establish the slightly stronger claim that $\|f_{-i} - f_{-1}\|_{TV} = O_p(V_n/n)$, where $\|\cdot\|_{TV}$ denotes the total variation norm. We will bound this distance by the distance between two other densities. Let $g_{-1}(y_{2:n}) = \prod_{j=2}^n f_j(y_j)$, the true distribution of $Y_{2:n}$. We now define a similar analog of $f_{-i}$. Let $r_j$ and $y_{(r_j)}$ be defined by $f_j = f_{(r_j)}$ and $y_{(r_j)} = y_j$, $j = 1, \dots, n$. Suppose, for simplicity, that $r_i < r_1$. Let

  $g_{-i}(y_{2:n}) = g_{-1}(y_{2:n}) \prod_{j=r_i}^{r_1 - 1} \dfrac{f_{(j+1)}}{f_{(j)}}(y_{(j)})$.

The case $r_1 < r_i$ is defined similarly. Note that $g_{-i}$ depends only on $\tilde\mu_{-i}$. Moreover, if $\tilde Y_{2:n} \sim g_{-j}$, then one can obtain $Y_{2:n} \sim f_{-j}$ by the Markov kernel that takes $\tilde Y_{2:n}$ to a random permutation of itself. It follows from Lemma 2.1 that

  $\|f_{-i} - f_{-1}\|_{TV} \le \|g_{-i} - g_{-1}\|_{TV} = E_{\mu_{2:n}} \Big| \dfrac{g_{-i}}{g_{-1}}(Y_{2:n}) - 1 \Big| = E_{\mu_{2:n}} \Big| \prod_{j=r_i}^{r_1-1} \dfrac{f_{(j+1)}}{f_{(j)}}(Y_{(j)}) - 1 \Big|$.

But, by assumption,

  $R_k = \prod_{j=k}^{r_1-1} \dfrac{f_{(j+1)}}{f_{(j)}}(Y_{(j)})$

is a reversed $L_2$ martingale, and it follows from Assumption (B1) that

  $\max_{k < r_1} |R_k - 1| = O_p(A_n V_n/n)$.

A similar argument applies to the indices $i$ with $r_i > r_1$, yielding

  $\max_i \|f_{-i} - f_{-1}\|_{TV} = O_p(A_n V_n/n)$.

We have established (4.1), and the theorem follows.

5. Main result

We assume:

(G1) For some $C < \infty$: $\max_{i \in \{1,\dots,n\}} |\mu_i| < C$, and $\max_{i,j = 1,\dots,n} E_{\mu_i}\big(f_{\mu_j}(Y_1)/f_{\mu_i}(Y_1)\big)^2 < C$. Also, there is a $\gamma > 0$ such that

  $\min_{i,j = 1,\dots,n} P_{\mu_i}\big(f_{\mu_j}(Y_1)/f_{\mu_i}(Y_1) > \gamma\big) \ge 1/2$.

(G2) The random variables $p_j(Y_i) = \dfrac{f_j(Y_i)}{\sum_{k=1}^n f_k(Y_i)}$, $i, j = 1, \dots, n$, are bounded in expectation as follows:

  $E \sum_{i=1}^n \sum_{j=1}^n \big(p_j(Y_i)\big)^2 < C$;

  $E \sum_{i=1}^n \dfrac{1}{n \min_j p_j(Y_i)} < Cn$;

  $E \sum_{i=1}^n \dfrac{\sum_{j=1}^n \big(p_j(Y_i)\big)^2}{n \min_j p_j(Y_i)} < C$.

Both assumptions describe a situation where the $\mu$s do not separate. They cannot be too far from one another, geometrically or statistically (Assumption (G1)), and they are dense in the sense that each $Y$ can be explained by many of the $\mu$s (Assumption (G2)).

The conditions hold for the normal shift model if the means are uniformly bounded. Suppose the common variance is 1 and $|\mu_j| < A_n$. Then

  $E \sum_{j=1}^n \big(p_j(Y_1)\big)^2 = E \dfrac{\sum_{j=1}^n f_j^2(Y_1)}{\big(\sum_{k=1}^n f_k(Y_1)\big)^2}
  \le E \dfrac{n\, e^{-(Y_1^2 - 2A_n|Y_1| + A_n^2)}}{\big(n\, e^{-(Y_1^2 + 2A_n|Y_1| + A_n^2)/2}\big)^2}
  = \dfrac{1}{n} E\, e^{4A_n|Y_1|}
  \le \dfrac{1}{n}\big(e^{8A_n^2 + 4A_n\mu_1} + e^{8A_n^2 - 4A_n\mu_1}\big)
  \le \dfrac{2}{n}\, e^{12A_n^2}$,

and the first part of (G2) holds. The other parts follow by similar calculations.

Theorem 5.1. Assume that (G1) and (G2) hold. Then

(i) $E\|\Delta^S_{\tilde\mu} - \Delta^{PI}_{\tilde\mu}\|^2 = O(1)$;

(ii) $r^S_n - r^{PI}_n = O(1)$.

Corollary 5.2. Suppose $F = \{N(\mu, 1) : |\mu| < c\}$ for some $c < \infty$; then the conclusions of the theorem follow.

Proof. It was mentioned already in the introduction that when we are restricted to permutation invariant procedures we may consider the Bayesian model under which $(\pi, Y_{1:n})$, $\pi$ a random permutation, have the distribution given by (1.1). Fix now $i \in \{1, \dots, n\}$. Under this model we want to compare

  $\hat\mu^S_i = E(\mu_{\pi(i)} \mid Y_i)$, $i = 1, \dots, n$,

to

  $\hat\mu^{PI}_i = E(\mu_{\pi(i)} \mid Y_{1:n})$, $i = 1, \dots, n$.

More explicitly:

  $\hat\mu^S_i = \dfrac{\sum_{j=1}^n \mu_j f_j(Y_i)}{\sum_{j=1}^n f_j(Y_i)} = \sum_{j=1}^n \mu_j p_j(Y_i)$, $i = 1, \dots, n$,

  $\hat\mu^{PI}_i = \dfrac{\sum_{j=1}^n \mu_j f_j(Y_i) f_{-j}(Y_{-i})}{\sum_{j=1}^n f_j(Y_i) f_{-j}(Y_{-i})} = \sum_{j=1}^n \mu_j p_j(Y_i) W_j(Y_i, Y_{-i})$, $i = 1, \dots, n$,   (5.1)

where for all $i, j = 1, \dots, n$, $f_{-j}(Y_{-i})$ was defined in Section 2, and

  $p_j(Y_i) = \dfrac{f_j(Y_i)}{\sum_{k=1}^n f_k(Y_i)}$,    $W_j(Y_i, Y_{-i}) = \dfrac{f_{-j}(Y_{-i})}{\sum_{k=1}^n p_k(Y_i) f_{-k}(Y_{-i})}$.

Note that $\sum_{k=1}^n p_k(Y_i) = 1$, and $W_j(Y_i, Y_{-i})$ is the likelihood ratio between two (conditional on $Y_i$) densities of $Y_{-i}$, say $g_{j0}$ and $g_1$; indeed, since marginally $P(M_i = \mu_k) = 1/n$, the conditional density of $Y_{-i}$ given $Y_i$ under (1.1) is exactly $g_1 = \sum_{k=1}^n p_k(Y_i) f_{-k}(Y_{-i})$. Consider two other densities (again, conditional on $Y_i$):

  $\tilde g_{j0}(Y_{-i} \mid Y_i) = f_i(Y_j) \prod_{m \ne i,j} f_m(Y_m)$,

  $\tilde g_{j1}(Y_{-i} \mid Y_i) = \tilde g_{j0}(Y_{-i} \mid Y_i) \Big( \sum_{k \ne i,j} p_k(Y_i) \dfrac{f_j}{f_k}(Y_k) + p_i(Y_i) \dfrac{f_j}{f_i}(Y_j) + p_j(Y_i) \Big)$.

Note that $g_{j0} = \tilde g_{j0} K$ and $g_1 = \tilde g_{j1} K$, where $K$ is the Markov kernel that takes $Y_{-i}$ to a random permutation of itself. It follows from Lemma 2.1 that

  $E\big( (W_j(Y_i, Y_{-i}) - 1)^2 \mid Y_i \big) \le E_{\tilde g_{j1}} \Big( \dfrac{\tilde g_{j0}}{\tilde g_{j1}} - 1 \Big)^2 = E_{\tilde g_{j0}} \Big( \dfrac{\tilde g_{j0}}{\tilde g_{j1}} - 2 + \dfrac{\tilde g_{j1}}{\tilde g_{j0}} \Big)$.   (5.2)

This expectation does not depend on $i$ except through the value of $Y_i$. Hence, to simplify notation, we take WLOG $i = j$. Denote

  $L = \dfrac{\tilde g_{j1}}{\tilde g_{j0}} = p_j(Y_j) + \sum_{k \ne j} p_k(Y_j) \dfrac{f_j}{f_k}(Y_k)$,    $V = \dfrac{n}{4}\, \gamma \min_k p_k(Y_j)$,

where $\gamma$ is as in (G1). Then by (5.2),

  $E\big( (W_j(Y_j, Y_{-j}) - 1)^2 \mid Y_j \big) \le E_{\tilde g_{j0}}\Big( \dfrac{1}{L} - 2 + L \Big) = E_{\tilde g_{j0}} \dfrac{(L-1)^2}{L}$
  $= E_{\tilde g_{j0}} \Big[ \dfrac{(L-1)^2}{L}\, \mathbb{1}(L > V) \Big] + E_{\tilde g_{j0}} \Big[ \dfrac{(L-1)^2}{L}\, \mathbb{1}(L \le V) \Big]$
  $\le \dfrac{1}{V} E_{\tilde g_{j0}} (L-1)^2 + E_{\tilde g_{j0}} \Big[ \dfrac{1}{L}\, \mathbb{1}(L \le V) \Big]$
  $\le E_{\tilde g_{j0}} \Big[ \dfrac{1}{L}\, \mathbb{1}(L \le V) \Big] + \dfrac{C}{V} \sum_{k=1}^n p_k^2(Y_j)$,   (5.3)

by (G1). Bound

  $L \ge \gamma \min_k p_k(Y_j) \sum_{k=1}^n \mathbb{1}\Big( \dfrac{f_j}{f_k}(Y_k) > \gamma \Big) \ge \gamma \min_k p_k(Y_j)\, (1 + U)$,

where $U$ is stochastically larger than a $B(n-1, 1/2)$ random variable (the 1 accounts for the $j$th summand). Hence

  $E_{\tilde g_{j0}} \Big[ \dfrac{1}{L}\, \mathbb{1}(L \le V) \Big] \le \dfrac{1}{\gamma \min_k p_k(Y_j)} \sum_{k=0}^{n/4 - 1} \dfrac{1}{k+1} \binom{n-1}{k} 2^{-(n-1)} = \dfrac{1}{\gamma\, n \min_k p_k(Y_j)}\, O(e^{-n})$,   (5.4)

by a large deviation bound. From (G1), (G2), (5.1), (5.3) and (5.4),

  $E\, E\big( (\hat\mu^S_i - \hat\mu^{PI}_i)^2 \mid Y_i \big) = E\, E\Big( \big( \textstyle\sum_{j=1}^n \mu_j p_j(Y_i) (W_j(Y_i, Y_{-i}) - 1) \big)^2 \,\Big|\, Y_i \Big)$
  $\le \max_j \mu_j^2\, \sum_{j=1}^n E\Big[ p_j(Y_i)\, E\big( (W_j(Y_i, Y_{-i}) - 1)^2 \mid Y_i \big) \Big] \le \kappa C^3/n$,

for some $\kappa$ large enough. Summing over $i = 1, \dots, n$, claim (i) of the theorem follows. Claim (ii) follows from (i) by Proposition 1.1.

References

[1] Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edition. Springer-Verlag, New York.

[2] Brown, L.D. and Greenshtein, E. (2007). Nonparametric empirical Bayes and compound decision approaches to estimation of a high dimensional vector of normal means. Manuscript.

[3] Copas, J.B. (1969). Compound decisions and empirical Bayes (with discussion). JRSSB 31 397-425.

[4] Diaconis, P. and Freedman, D. (1980). Finite exchangeable sequences. Ann. Prob. 8, No. 4, 745-764.

[5] Hannan, J.F. and Robbins, H. (1955). Asymptotic solutions of the compound decision problem for two completely specified distributions. Ann. Math. Stat. 26, No. 1, 37-51.

[6] Lehmann, E.L. (1986). Testing Statistical Hypotheses, 2nd edition. Wiley & Sons, New York.

[7] Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. Proc. Second Berkeley Symp. Math. Statist. Probab. 131-148.

[8] Robbins, H. (1983). Some thoughts on empirical Bayes estimation. Ann. Stat. 11 713-723.

[9] Samuel, E. (1965). On simple rules for the compound decision problem. JRSSB 27 238-244.

[10] Zhang, C.-H. (2003). Compound decision theory and empirical Bayes methods (invited paper). Ann. Stat. 31 379-390.

[11] Jiang, W. and Zhang, C.-H. (2007). General maximum likelihood empirical Bayes estimation of normal means. Manuscript.