Empirical Bayes Quantile-Prediction aka E-B Prediction under Check-loss;

Size: px

Start display at page:

Download "Empirical Bayes Quantile-Prediction aka E-B Prediction under Check-loss;"

Linda Sims
5 years ago
Views:

1 BFF4, May 2, 2017 Empirical Bayes Quantile-Prediction aka E-B Prediction under Check-loss; Lawrence D. Brown Wharton School, Univ. of Pennsylvania Joint work with Gourab Mukherjee and Paat Rusmevichientong Marshall School, U. S. C. Preprint available on ArXiv. NOTE: In several places the symbol ν should read ν 2. This correction should be clear from the context. 1

2 Multiple Independent Gaussian Problems n independent problems, indexed by i = 1,..,n Observe X i N( 2 θ i,v ) Pi [θ i unknown; v P, i known] [In applications, X i might be X i = X ii with X ik iid N( 2 θ i,σ i ), k = 1,..,K i, v 2 P,i =σ 2 i K i.] 2

3 Quantile Prediction For each i potential future observation Y i N( θ i,ν F,i ). [Again this might be average of several iid var s.] Note that for each i, the mean θ i is same for both Past and Future. Fix b i ( 0,1), i = 1,..,n. Goal is to predict the b i -th quantile of dist. of Y i for each index, i. This quantile is q 0,i =θ i + v F,i Φ 1 b i ( ). The naïve coordinate-wise predictor is ˆq 0,i = X i + v F,i Φ 1 b i ( ). NOTE: This is formal Bayes for uniform prior 3

4 Shrinkage Estimation Generally beneficial in multi-mean problems (and many other multivariate settings). But the shrinkage needs to be properly implemented. For homoscedastic problems this is well understood. (Stein estimation) But the current problem is quite heteroskedastic: v P,i, v F,i, b i can all depend on i. For such problems minimax shrinkage may dramatically differ from Empirical Bayes shrinkage; And, well-implemented E-B shrinkage is usually preferable See Efron and Morris (1973, 1975) and Xie, Kou, Brown (2012, 2015) 4

5 Parametric Empirical Bayes We follow the hierarchical conjugate-prior paradigm Efron and Morris (op. cit.), Stein (1962), Lindley (1962), Lindley and Smith (1972). θ 1,..,θ n iid N( η,τ 2 ) η,τ 2 are unknown hyper-parameters to be estimated from the data. Let ˆη, τˆ 2 denote the estimates. For simplicity and notational simplicity: Consider (for the talk) the case where η = 0 is known. (Paper allows unknown η and estimates it along with τ.) 5

6 Quantile Prediction for Known Hyper-parameters If τ is known then posterior distribution of! θ known (& Gaussian) Yields known, Gaussian predictive distribution for each future Y i : Let α i = τ 2 ( τ 2 + v P,i ), then F Yi ( y τ 2,X ) i = N α i X i,v F,i +α i v P,i ( ). So quantile prediction is ˆq ( X i ;τ 2 ) = α i X i + ( v F,i +α i v P,i )Φ 1 ( b ) i. The role of the E-B prior structure is only to motivate this family of quantile predictors. 6

7 Direct E-B quantile prediction Could use data + any plausible estimate of τ 2, the hyper-parameter(s), and plug in. i.e., could use marginal MLE τˆ 2 and get prediction q( X i ; τˆ 2 ). Or could similarly use M of M estimate instead of MLE. These (and other) plausible estimates of τ 2 are different & give different predictors. They re not bad. BUT none of these needs to be the best choice. See Xie, Kou, Brown (2012) about the more familiar estimation of means problem. 7

8 Formulation with Loss Function To investigate optimal choice of τ 2 in q X i ; τ 2 ( ) impose a predictive loss function under which ˆq τ = q( X i ;τ 2 ) is the natural predictor of q when τ 2 is known. Then find best (or, just good ) τ 2 = τ 2 loss. Use q = q( X i ; τ 2 ) for each i. " X ( ) under this Check-Loss (aka, pinball loss or quantile loss). Let b,h > 0; normalize by b+ h = 1 (w.l.o.g). i-th component of predictive loss is ( ) = b ( i q Y ) + i + h ( i Y i q) + l i Y i,q 8

9 Note that b i,h i are known and allowed to depend on i. b = 0.7 Plot of Check Loss Y q axis 9

10 This loss has been introduced here as a device to evaluate potential Quantilepredictors. But it has motivation in various applications. For example: 10

11 11

12 Improved E-B quantile prediction Conceptual Idea Define average predictive risk as! ( ) = n 1 E θi l i Y i, q ( i X ) R! θ,! q ( ( )). When using a particular hyper-parameter estimator,!τ 2 =!τ 2 " X ( ), this is ( ) = n 1 E θi l i Y i, q (! i X i ; "τ 2 ( X )) R! θ ; "τ 2 ( ( )). Goal is to create a good/best estimator!τ. 12

13 KEY is to find a good and (nearly) unbiased estimator of ( ( ( ))) [Note what happened to τ 2.] n 1 E θi l i Y i, q i X i ;τ 2 Call it RE! ( " X;τ 2 ). [ RE! is a fnct. of X!, but not of Y!.] This will have (1) E θi RE! " X;τ 2 Then choose " 2 τ RE X ( ( )) n 1 E θi l i Y i, q ( i X i ;τ 2 ) ( ( )). { ( )}. ( ) = argmin τ 2 RE # " X;τ 2 13

14 Why does this yield the best τ 2 (as n )? Because there is a best τ 2 (as n ). Oracle Predictor For every! θ let!! τˆ 2 2 OR = τˆ ( OR X; θ ) = argmin τ 2 R( θ;τ 2 ). Then show [as n for every (L1 b nded) seq! θ ] " " 2 (2) τ ( RE X ) τˆ 2 ( X; " θ ) in prob, & OR (3) R (! θ ; 2 τ RE ) R (! 2 θ ; τˆ OR ). This conceptual scheme involves two major steps: 14

15 Two Major Steps Step 1: Create the (asymptotic) risk estimator RE! τ 2 to yield a suitable approximation (1). ( ) Step 2: Prove it has the desired convergence and risk properties (2) and (3). The two steps are interrelated: ( ) needs to produce a satisfactory approximation in (1), and be computationally feasible. RE! τ 2 AND it also needs to enable Step 2. 15

16 Step 1: Asymptotic Risk Estimator Need to satisfy (1) E θi RE! " X;τ 2 ( ( )) n 1 E θi l i Y i, q ( i X i ;τ 2 ) where l i is check-loss. ( ( )), If l i were squared error then can use SURE. But l i is not differentiable. So something like SURE seems a b i g s t r e t c h! 16

17 HOWEVER, l * ( i θ i,q) " E θi Y i ( l ( i Y i,q)) = v F,i G q θ i,b i v F,i ( )+ wφ( w) bw. G( w, b) =ϕ w ( ) 0 plays the role of a conventional loss function. G is a smooth, C, function. And l i * θ i,q Here s a picture of G for b = 0.7: 17

18 Approxima)on theory based non-linear approxima)on Of the loss Linear Tail approxima)on b = 0.7 b = 0.7 Linear Tail approxima)on 0 Plot of the function G 18

19 ! Creation of the Asymptotic Risk Estimator ARE For v P,i = 1 (for simplicity): (a) Approximate G w,b i K(i) terms. (b) Substitute q ( i X i ;τ ) θ i = w. ( ) by Taylor expansion ( TE ) to (c) The resulting expectation is, say, E θi ( TE( X i,θ ) i ). It includes powers of θ i times functions of X i. (d) Integrate by parts to create an equivalent expectation that contains only functions of X i & no terms with θ i. (e) Step (d) could be understood as a version of SURE generalized to higher powers of θ i. 19

20 (f) Actual computation is greatly facilitated by use of Hermite polynomials. (g) The resulting expectand is a function of X i that is an unbiased estimate of the expected Taylor approximation.! That s the core of our ARE. (h) There are some additional (clever) steps needed to handle large values of W, for which Taylor expansion is not good.! (i) Just to impress you, here s the core of ARE without the additional steps in (h): (In the following U ( i τ ) is a minor truncation/modification of X i and d ( i τ ) is a simple rational function of τ,v P,i,b i. This is copied from the paper, which uses the notation τ in place of τ 2. ) 20

21 (j) Choose K n (i), the number of terms in the Taylor expansion. This varies with i, but is O( logn).! (k) Then add over i to get ARE.! (l) Finally, find the desired argmin τ ARE via a discrete grid search. ( ) 21

22 Step 2: Prove the desired asymptotic properties There are some clues in Xie, Kou and Brown (2012). But parts of the proof here differ in key respects from what s there, and the details here are much more complex. 22

23 Simulation #1 ( ) pairs: Homoscedastic ( σ 2 = 1). Two types of θ i,b i ( 0.58,0.51) with prob 0.9 ( 5.1,0.9) with prob 0.1 Comparison of pred. risks of three pred. methods #coord s n=20 n=20 n=50 n=50 Method Efficiency Ave. of τˆ 2 Efficiency % Ave. of τˆ 2 ARE 86% % ML/MM 68% % Oracle 100% Note: Because model is homoscedastic, MaxLik and MethMom are the same 23

24 Simulations #3 This was a set of 6 simulations involving different scenarios, all with heteroscedastic data and random choices for parameters and b i. e.g., In the first setup (the simplest) ν 2 P U( 0.1,0.33),ν 2 F = 1, θ U( 0,1), b U( 0.5,0.99), all indep. In these 6 scenarios the efficiency of our ARE predictor relative to the oracle was For n=20: 88% < Eff. < 98% For n=100: 91% < Eff. < 99%. [Note: The last scenario involved uniformly distributed observations, rather than normally distributed ones. But it isn t reported in the manuscript whether the oracle knew and used that fact. I ll need to ask Gourab and Paat. 24

25 Recap Problem involves multivariate normal means Goal is quantile predictions of future observations A shrinkage predictor is produced o Shrinkage is produced by a hierarchical conjugate setup. o Quality of quantile prediction measured through predictive check-loss, which is the natural loss for quantile prediction. Methodology involves creation and use of a complicated but computable Asymptotic Risk Estimate! ( ARE ) involving Hermite polynomials. Method is asymptotically justified. And Appears to work well in simulations for moderate n. 25

26 Concluding Remarks for BFF4 1. Quantile predictor methodology is motivated from a Bayesian hierarchical structure. 2. The form of the predictor then drives the methodology 3. The ARE method is a CONDITIONAL FREQUENTIST notion. 4. BUT: The proposed ARE predictor is NOT BAYESIAN It doesn t result from any prior I can think of. not even asymptotically 5. Well known, but often overlooked Prediction is different from parameter estimation. An estimate of the sample quantile doesn t directly lead to quantile prediction. 26

g-priors for Linear Regression

Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,