Non-parametric Empirical Bayes and Compound Bayes Estimation of Independent Normal Means

Joint work with E. Greenshtein

L. Brown
Statistics Department, Wharton School, University of Pennsylvania
lbrown@wharton.upenn.edu

Conference: Innovation and Inventiveness of John Hartigan, April 17, 2009

Topic Outline

1. Empirical Bayes (Concept)
2. Independent Normal Means (Setting + some theory)
3. The NP-EB Estimator (Heuristics)
4. A Tale of Two Concepts: Empirical Bayes and Compound Bayes
5. (Somewhat) Sparse Problems
6. Numerical results
7. Theorem and Proof
8. The heteroscedastic case: heuristics

Empirical Bayes: General Background

$n$ parameters to be estimated: $\theta_1,\dots,\theta_n$. [$\theta_i$ real-valued.]
Observe $X_i \sim f_{\theta_i}$, independent. Let $X = (X_1,\dots,X_n)$.
Estimate $\theta_i$ by $\delta_i(X)$.
Component-wise loss and overall risk:
$$L(\theta_i,\delta_i) = (\delta_i - \theta_i)^2 \qquad\text{and}\qquad R(\theta,\delta) = \frac{1}{n}\sum_i E_{\theta}\bigl[L\bigl(\theta_i,\delta_i(X)\bigr)\bigr].$$

Bayes Estimation

Pretend $\{\theta_i\}$ are iid, with a (prior) distribution $G_n$.
Under $G_n$ the Bayes procedure would be
$$\delta^{G_n} = \bigl(\delta^{G_n}_1,\dots,\delta^{G_n}_n\bigr), \qquad \delta^{G_n}_i(X) = E\bigl[\theta_i \mid X_i\bigr].$$
[Note: $\delta^{G_n}_i$ depends only on $X_i$ (and on $G_n$).]
It would have Bayes risk
$$B(G_n) = B\bigl(G_n,\delta^{G_n}\bigr) = E_{G_n} R\bigl(\theta,\delta^{G_n}\bigr).$$
[Note: Because of the scaling of sum-of-squared-error loss by $1/n$, $B(G_n)$ is also the coordinate-wise Bayes risk, i.e., $B(G_n) = E_{G_n}\bigl[\bigl(\theta_i - \delta^{G_n}_i(X)\bigr)^2\bigr]$.]

Empirical Bayes

Introduced by Robbins: An empirical Bayes approach to statistics, 3rd Berkeley Symp., 1956.
Applicable when "the same decision problem presents itself repeatedly and independently with a fixed but unknown a priori distribution of the parameter." (Robbins, Ann. Math. Stat., 1964)
Thus: Fix $G$. Let $G_n = G$ for all $n$. Try to find a sequence of estimators, $\delta_n(X_n)$, that are asymptotically as good as $\delta^G$; i.e., want
$$B(G,\delta_n) - B(G) \to 0.$$
Much of the subsequent literature emphasized the sequential nature of this problem.

The Fixed-Sample Empirical Goal

Even earlier, Robbins had taken a slightly different perspective: Asymptotically subminimax solutions of compound decision problems, 2nd Berkeley Symp., 1951. See also Zhang (2003).
Robbins began, "When statistical problems of the same type are considered in large groups ... there may exist solutions which are asymptotically ... [desirable]."
That is, one can benefit even for fixed, but large, $n$ (and even if $G_n$ may change with $n$).
To measure the desirability we propose
$$\text{(1)}\qquad \sup_{G_n\in\mathcal G_n}\ \frac{B(G_n,\delta_n) - B(G_n)}{B(G_n)} \;\to\; 0.$$
Here, $\mathcal G_n$ is a (very inclusive) subset of priors. [But not all priors.]

Independent Normal Means

Observe $X_i \sim N(\theta_i,\sigma^2)$, $i = 1,\dots,n$, independent, with $\sigma$ known. Let $\varphi_\sigma$ denote the normal density with variance $\sigma^2$.
Assume $\theta_i \sim G_n$, iid. Write $G = G_n$ for convenience.
Consider the $i$-th coordinate. Write $\theta_i = \theta$, $x_i = x$ for convenience.
The Bayes estimator (for squared-error loss) is
$$\delta^G(x) = E_G\bigl[\theta \mid X = x\bigr].$$
Denote the marginal density of $X$ as
$$g_\sigma(x) = \int \varphi_\sigma(x-\theta)\,G(d\theta).$$
As a general notation, let
$$\psi_G(x) = \delta^G(x) - x.$$

Note that
$$\psi_G(x) = \delta^G(x) - x = \frac{\int (\theta - x)\,\varphi_\sigma(x-\theta)\,G(d\theta)}{\int \varphi_\sigma(x-\theta)\,G(d\theta)}.$$
Differentiate inside the integral (always OK) to write
$$(*)\qquad \psi_G(x) = \sigma^2\,\frac{g_\sigma'(x)}{g_\sigma(x)}.$$
Moral of (*): A really good estimate of the marginal density $g_\sigma(x)$ should yield a good approximation to $\psi_G(x)$.
Validity of (*) is proved in Brown (1971).
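
A quick check of (*) that I am adding (not on the slides): for a normal prior $G = N(0,\tau^2)$, the marginal $g_\sigma$ is the $N(0,\sigma^2+\tau^2)$ density, so $g_\sigma'(x)/g_\sigma(x) = -x/(\sigma^2+\tau^2)$, and (*) gives
$$\delta^G(x) = x + \sigma^2\,\frac{g_\sigma'(x)}{g_\sigma(x)} = \Bigl(1 - \frac{\sigma^2}{\sigma^2+\tau^2}\Bigr)x = \frac{\tau^2}{\sigma^2+\tau^2}\,x,$$
the familiar posterior mean under a normal prior.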

Proposed Non-Parametric Empirical-Bayes Estimator

Let $h$ be a bandwidth constant (to be chosen later). Estimate $g_\sigma(x)$ by the kernel density estimator
$$\hat g_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}\,\varphi\!\left(\frac{x - X_i}{h}\right) = \frac{1}{n}\sum_{i=1}^n \varphi_h(x - X_i).$$
The normal kernel has some nice properties, to be explained later.
Define the NP-EB estimator $\tilde\delta = (\tilde\delta_1,\dots,\tilde\delta_n)$ by
$$\tilde\delta_i(x_i) = x_i + \tilde\psi(x_i) = x_i + \sigma^2\,\frac{\hat g_h'(x_i)}{\hat g_h(x_i)}.$$
A useful formula is
$$\hat g_h'(x) = \frac{1}{nh^2}\sum_i \frac{X_i - x}{h}\,\varphi\!\left(\frac{x - X_i}{h}\right).$$
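
A minimal runnable sketch of $\tilde\delta$ (my own code, not from the slides; the function and variable names are mine, and $\sigma = 1$ is the default):

import numpy as np

def np_eb_estimator(X, h, sigma=1.0):
    """Nonparametric EB estimate of normal means via Tweedie's formula.

    Plugs a normal-kernel density estimate ghat_h and its derivative into
    delta_i = x_i + sigma^2 * ghat_h'(x_i) / ghat_h(x_i).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise standardized differences u_{ik} = (x_i - X_k) / h.
    U = (X[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * U**2) / np.sqrt(2 * np.pi)   # phi((x_i - X_k)/h)
    g_hat = K.mean(axis=1) / h                     # (1/(n h)) sum_k phi(u)
    g_hat_deriv = (-U * K).mean(axis=1) / h**2     # (1/(n h^2)) sum_k (-u) phi(u)
    return X + sigma**2 * g_hat_deriv / g_hat

# Example: n = 1000, most means 0, a few at 5, bandwidth giving v = 1 + h^2 = 1.15.
rng = np.random.default_rng(0)
theta = np.zeros(1000); theta[:50] = 5.0
X = theta + rng.standard_normal(1000)
theta_hat = np_eb_estimator(X, h=np.sqrt(0.15))
print(np.sum((theta_hat - theta)**2))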

A Key Lemma

Let $\hat G_n^X$ denote the sample CDF of $X = (X_1,\dots,X_n)$.
Let $g_{G,v}$ denote the marginal density when $X_i \sim N(\theta_i, v)$.
Let $\sigma = 1$ and let $v = 1 + h^2$.

Lemma 1: $E\bigl[\hat g_h(x)\bigr] = g_{G,v}(x)$ and $E\bigl[\hat g_h'(x)\bigr] = g_{G,v}'(x)$.

Proof: $\hat g_h = \hat G_n^X * \varphi_h$, and $E\bigl[\hat G_n^X\bigr] = \text{CDF of } X_1 = G * \Phi_1$. Hence
$$E\bigl[\hat g_h(x)\bigr] = \bigl(E[\hat G_n^X] * \varphi_h\bigr)(x) = (G * \Phi_1 * \varphi_h)(x) = g_{G,v}(x).$$
The proof for the derivatives is similar.
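
A quick Monte Carlo illustration of Lemma 1 (my own code, not from the slides), using a hypothetical two-point prior and $h = 0.4$, comparing the average of $\hat g_h(x_0)$ over replications with $g_{G,v}(x_0)$ for $v = 1 + h^2$:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, h, x0 = 2000, 0.4, 1.0
v = 1 + h**2                          # variance of the smoothed marginal

# Prior G: mass 0.9 at 0 and 0.1 at 3; sigma = 1.
def g_Gv(x, var):                     # marginal density of X under G with variance `var`
    return 0.9 * norm.pdf(x, 0, np.sqrt(var)) + 0.1 * norm.pdf(x, 3, np.sqrt(var))

reps = []
for _ in range(400):
    theta = np.where(rng.random(n) < 0.1, 3.0, 0.0)
    X = theta + rng.standard_normal(n)
    reps.append(norm.pdf((x0 - X) / h).mean() / h)   # ghat_h(x0)

print(np.mean(reps), g_Gv(x0, v))     # the two numbers should be close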

Derivation of the Estimator

The expression for the estimator appears at the beginning and end of the following string of (approximate) equalities:
$$\psi_G = \frac{g_{G,1}'}{g_{G,1}} \;\approx\; \frac{g_{G,v}'}{g_{G,v}} \;\approx\; \frac{\hat g_h'}{\hat g_h}.$$
The first equality is the fundamental equation (*); the first approximation holds since $v = 1 + h^2 \approx 1$; the second follows from the Lemma, via plug-in (method of moments) in numerator and denominator.
See Jiang and Zhang (2007) for a different scheme based on a Fourier kernel.

A Tale of Two Formulations: Compound and Empirical Bayes

Compound problem: Let $\vec\theta_{(\cdot)} = \{\theta_{(1)},\dots,\theta_{(n)}\}$ and $\vec X_{(\cdot)} = \{X_{(1)},\dots,X_{(n)}\}$ denote the order statistics of $\vec\theta$ and $\vec X$, respectively.
Consider estimators of the form
$$\vec\delta = \{\delta_i\}, \qquad \delta_i = \delta\bigl(x_i;\ \vec x_{(\cdot)}\bigr),$$
with the same function $\delta$ in every coordinate. These are called simple symmetric estimators; SS denotes the class of all of them.
Given $\vec\theta_{(\cdot)}$, the optimal SS rule is denoted $\delta^{\vec\theta_{(\cdot)}}$. It satisfies
$$R\bigl(\vec\theta,\delta^{\vec\theta_{(\cdot)}}\bigr) = \inf_{\delta\in SS} R(\vec\theta,\delta).$$
The goal of compound decision theory is to find rule(s) that do almost as well as $\delta^{\vec\theta_{(\cdot)}}$, as judged by a criterion like (1).
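
For a concrete member of SS (my illustration, not on the slides): a soft-threshold rule $\delta_i = \operatorname{sign}(x_i)\bigl(|x_i| - \lambda(\vec x_{(\cdot)})\bigr)_+$, with the threshold $\lambda$ chosen from the order statistics (say by SURE or cross-validation), is simple symmetric, whereas a rule that treats the first coordinate differently from the rest is not.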

The Link Between the Formulations: EB Implies CO

Recall that $\hat G_{n(\cdot)}$ denotes the sample CDF of $\vec\theta_{(\cdot)}$. Then $\delta \in SS$ implies
$$R(\vec\theta,\delta) = E\Bigl[\frac{1}{n}\sum_i \bigl(\delta(X_i;\vec X_{(\cdot)}) - \theta_i\bigr)^2\Bigr] = B\bigl(\hat G_{n(\cdot)},\delta\bigr).$$
Consequently: If $\delta_n$ is EB [in the sense of (1)] then it is also Compound Optimal in the sense of
$$(1')\qquad \sup_{\vec\theta_n:\ \hat G_{n(\cdot)}\in\mathcal G_n}\ \frac{R(\vec\theta_n,\delta_n) - \inf_{\delta\in SS} R(\vec\theta_n,\delta)}{\inf_{\delta\in SS} R(\vec\theta_n,\delta)} \;\le\; \epsilon_n \to 0.$$
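
A remark on the first display (my addition, not on the slides): for a rule that uses only the $i$-th observation, say $\delta_i = t(X_i)$, the identity is immediate, since
$$\frac{1}{n}\sum_i E_{\theta_i}\bigl(t(X_i) - \theta_i\bigr)^2 = \int E_\theta\bigl(t(X) - \theta\bigr)^2\,\hat G_{n(\cdot)}(d\theta) = B\bigl(\hat G_{n(\cdot)},t\bigr);$$
the display asserts that the same bookkeeping extends to rules that are also allowed to depend on the order statistics $\vec X_{(\cdot)}$.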

The Converse: CO Implies EB

To MOTIVATE the converse, assume $\delta_n \in SS$ is CO in the sense that
$$(1')\qquad \sup_{\vec\theta_n\in\Theta_n}\ \frac{R(\vec\theta_n,\delta_n) - \inf_{\delta\in SS} R(\vec\theta_n,\delta)}{\inf_{\delta\in SS} R(\vec\theta_n,\delta)} \;<\; \epsilon_n \to 0.$$
Suppose this holds when $\Theta_n$ is the set of ALL possible vectors $\vec\theta_n$.
Under a prior $G_n$ the vector $\vec\theta$ has iid components, and
$$B_{[n]}(G_n,\delta) = E\bigl[B_{[n]}\bigl(\hat G_{n(\cdot)},\delta\bigr)\bigr].$$
Truth of (1') for all $\vec\theta_n$ then implies truth of its expectation over the distribution of $\vec\theta_{(\cdot)}$ under $G_n$. This yields (1).
In reality (1') does not hold for all $\vec\theta_n$, but only for a very rich subset. Hence the proof requires extra details.

Sparse Problems

An initial motivation for this research was to create a CO-EB method suitable for sparse settings.
The prototypical sparse CO setting has (sparse)
$$\vec\theta_{(\cdot)} = (\theta_0,\dots,\theta_0,\theta_1,\dots,\theta_1)$$
with $\approx (1-\alpha)n$ values $\theta_0$ and only $\approx \alpha n$ values $\theta_1$. Here, $\alpha$ is near 0, but $\theta_0,\theta_1$ may be either known or unknown.
Situations with $\alpha = O(1/n)$ can be considered extremely sparse. Situations with, say, $\alpha \approx 1/n^{1-\gamma}$, $0 < \gamma < 1$, are moderately sparse.
Sparse problems are of interest on their own merits (e.g. in genetics), for example as in Efron (2004+), and for their importance in building nonparametric regression estimators; see e.g. Johnstone and Silverman (2004).
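
For concreteness (a worked special case I am adding, not on the slides): when $\theta_0,\theta_1$ are known, the sparse setting corresponds to the two-point prior $G$ putting mass $1-\alpha$ at $\theta_0$ and mass $\alpha$ at $\theta_1$. Then
$$g_\sigma(x) = (1-\alpha)\,\varphi_\sigma(x-\theta_0) + \alpha\,\varphi_\sigma(x-\theta_1), \qquad
\delta^G(x) = \frac{(1-\alpha)\,\theta_0\,\varphi_\sigma(x-\theta_0) + \alpha\,\theta_1\,\varphi_\sigma(x-\theta_1)}{g_\sigma(x)},$$
so the oracle Bayes rule moves $x$ toward $\theta_0$ or $\theta_1$ according to the posterior odds of a signal. The NP-EB estimator targets this rule without knowing $\alpha$, $\theta_0$, or $\theta_1$.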

(Typical) Comparison of NP-EB Estimator with Best Competitor

  #Signals   Estimator                Value=3   Value=4   Value=5   Value=7
     5       $\tilde\delta_{1.15}$       53        49        42        27
     5       Best other                  34        32        17         7
    50       $\tilde\delta_{1.15}$      179       136        81        40
    50       Best other                 201       156        95        52
   500       $\tilde\delta_{1.15}$      484       302       158        48
   500       Best other                 829       730       609       505

Table: Total expected squared error (via simulation; to nearest integer). Compound Bayes setup with $n = 1000$; most means $= 0$ and the others $=$ Value. "Best other" is the best performing of the 18 estimators studied in J & S (2004). $\tilde\delta_{1.15}$ is our NP-EB estimator with $v = 1.15$.
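
A sketch of the simulation protocol behind the table (my reconstruction of the stated setup, not the authors' code; the replication count and seed are my choices), reusing np_eb_estimator from the sketch above with $h = \sqrt{0.15}$ so that $v = 1.15$:

import numpy as np

def total_sq_error(n_signals, value, n=1000, reps=100, seed=0):
    """Average total squared error of the NP-EB estimator in the compound setup
    of the table: n = 1000 means, n_signals of them equal to `value`, the rest 0,
    unit noise variance."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        theta = np.zeros(n)
        theta[:n_signals] = value
        X = theta + rng.standard_normal(n)
        theta_hat = np_eb_estimator(X, h=np.sqrt(0.15))  # v = 1 + h^2 = 1.15
        errs.append(np.sum((theta_hat - theta) ** 2))
    return np.mean(errs)

for k in (5, 50, 500):
    print(k, [round(total_sq_error(k, mu)) for mu in (3, 4, 5, 7)])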

Statement of EB-CO Theorem

Assumptions:
For every $\epsilon > 0$, $\mathcal G_n \subseteq \{G_n : B_{[n]}(G_n) > n^{-\epsilon}\}$. Hence (only) moderately sparse settings are allowed.
$\mathcal G_n \subseteq \{G_n : G_n([-C_n, C_n]) = 1\}$ with $C_n = O(n^{\kappa})$, $\kappa > 0$. We believe this assumption can be relaxed, but it seems that some sort of uniformly light-tail condition on $G_n$ is needed.

Theorem: Let $h_n^2 = 1/d_n$ with $d_n/\log(n) \to \infty$ and $d_n = o(n^{\epsilon})$ for every $\epsilon > 0$. Then (under the above assumptions) $\tilde\delta_n$ satisfies (1).

[Note: $n = 1000$ and $d_n = \log(n)$ give $v = 1 + d_n^{-1} \approx 1.15$, as in the Table.]

Heteroscedastic Setting

EB formulation: $\sigma_1,\dots,\sigma_n$ known. Observe $X_i \sim N(\theta_i,\sigma_i^2)$, independent, $i = 1,\dots,n$.
Assume (EB) $\theta_i \sim G_n$, independent, but $G_n$ unknown, except for $G_n \in \mathcal G_n$.
Loss function, risk function, and optimality target (1) are as before.

Heuristics

The Bayes estimator on the $i$-th coordinate (in the notation of Lemma 1, with $g_{G,v}$ the marginal density of an observation with variance $v$) is
$$\delta_i^G(x_i) = x_i + \sigma_i^2\,\frac{g_{G,\sigma_i^2}'(x_i)}{g_{G,\sigma_i^2}(x_i)}.$$
The previous heuristics suggest approximating $g_{G,\sigma_i^2}$ and $g_{G,\sigma_i^2}'$ by $g_{G,\sigma_i^2(1+\eta)}$ and $g_{G,\sigma_i^2(1+\eta)}'$, and then estimating $g_{G,\sigma_i^2(1+\eta)}(x_i)$ as the average of
$$\varphi_{h_{k,i}}(x_i - X_k), \qquad k = 1,\dots,n, \qquad\text{where } h_{k,i}^2 = \sigma_i^2(1+\eta) - \sigma_k^2.$$
To avoid impossibilities, one needs to use $h_{k,i}^2 = \bigl(\sigma_i^2(1+\eta) - \sigma_k^2\bigr)_+$.

The resulting estimator has
$$\delta_i^G(x_i) \approx \tilde\delta_i(X) = x_i + \sigma_i^2\,\frac{\hat g_i'(x_i)}{\hat g_i(x_i)}, \qquad
\hat g_i(x_i) = \frac{\sum_k I\{h_{k,i} > 0\}\,\varphi_{h_{k,i}}(x_i - X_k)}{\sum_k I\{h_{k,i} > 0\}},$$
with $h_{k,i}$ as above and $\hat g_i'$ the corresponding average of derivatives.
(With inessential modifications) this is the estimator used in Brown (AOAS, 2008).
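
A minimal Python sketch of this heteroscedastic rule (my own reading of the formula above; the names and the default inflation parameter eta are mine):

import numpy as np

def np_eb_hetero(X, sigma, eta=0.15):
    """Heteroscedastic NP-EB estimate: for each coordinate i, average the normal
    kernels phi_{h_ki}(x_i - X_k) with h_ki^2 = (sigma_i^2 (1+eta) - sigma_k^2)_+,
    restricted to k with h_ki > 0, then apply Tweedie's formula."""
    X, sigma = np.asarray(X, float), np.asarray(sigma, float)
    n = len(X)
    est = np.empty(n)
    for i in range(n):
        h2 = sigma[i] ** 2 * (1 + eta) - sigma ** 2      # h_{k,i}^2 over all k
        keep = h2 > 0
        h = np.sqrt(h2[keep])
        d = X[i] - X[keep]
        K = np.exp(-0.5 * (d / h) ** 2) / (np.sqrt(2 * np.pi) * h)  # phi_{h_ki}(x_i - X_k)
        g = K.mean()
        g_prime = (-(d / h ** 2) * K).mean()
        est[i] = X[i] + sigma[i] ** 2 * g_prime / g
    return est

# Example: half the observations have sigma = 1, half have sigma = 2.
sig = np.repeat([1.0, 2.0], 500)
theta = np.zeros(1000); theta[:50] = 5.0
rng = np.random.default_rng(2)
X = theta + sig * rng.standard_normal(1000)
print(np.sum((np_eb_hetero(X, sig) - theta) ** 2))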