Borrowing Strength from Experience: Empirical Bayes Methods and Convex Optimization


1 Borrowing Strength from Experience: Empirical Bayes Methods and Convex Optimization
Ivan Mizera
University of Alberta, Edmonton, Alberta, Canada
Department of Mathematical and Statistical Sciences
("Edmonton Eulers")
Wien, January 2016
Research supported by the Natural Sciences and Engineering Research Council of Canada
Co-authors: Roger Koenker (Illinois), Mu Lin, Sile Tao (Alberta)

2 Compound decision problem
Estimate (predict) a vector µ = (µ_1, ..., µ_n)
- many quantities here, and not that much sample for each of them
- we observe one Y_i for every µ_i, with (conditionally) known distribution
- Example: Y_i ~ Binomial(n_i, µ_i), n_i known
- Another example: Y_i ~ N(µ_i, 1)
- Yet another example: Y_i ~ Poisson(µ_i)
The µ_i are assumed to be sampled iid (at random) from P
Thus, when the conditional density of the Y_i is ϕ(y, µ), the marginal density of the Y_i is g(y) = ∫ ϕ(y, µ) dP(µ)
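The setup is easy to simulate; the sketch below (an illustration of my own, not code from the talk) draws the µ_i from a uniform P, generates Y_i ~ N(µ_i, 1), and approximates the marginal density g(y) = ∫ ϕ(y − µ) dP(µ) by Monte Carlo over P.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000
mu = rng.uniform(5, 15, size=n)      # mu_i sampled iid from P = U(5, 15)
y = mu + rng.standard_normal(n)      # Y_i ~ N(mu_i, 1)

# marginal density g(y) = int phi(y - mu) dP(mu), approximated by Monte Carlo over P
prior_draws = rng.uniform(5, 15, size=100_000)

def g(points):
    return norm.pdf(points[:, None] - prior_draws[None, :]).mean(axis=1)

grid = np.linspace(2, 18, 9)
print(np.c_[grid, g(grid)])          # roughly flat on (5, 15), with normal tails
```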

3 A sporty example
Data: known performance of individual players, typically summarized as a number of successes, k_i, in a number, n_i, of repeated trials (bats, penalties)
- typically, the data are not very extensive (start of the season, say); the objective is to predict the true capabilities of individual players
One possibility: Y_i = k_i ~ Binomial(n_i, µ_i)
Another possibility: take Y_i = arcsin √((k_i + 1/4)/(n_i + 1/2)) ~ N(µ_i, 1/(4 n_i)), approximately
Solutions via maximum likelihood: ˆµ_i = k_i/n_i or ˆµ_i = Y_i
The overall mean (or marginal MLE) is often better than this
Efron and Morris (1975), Brown (2008), Koenker and Mizera (2014): bayesball

4 NBA data (Agresti, 2002)
[Table: columns player, n, k, prop for 15 players: Yao, Frye, Camby, Okur, Blount, Mihm, Ilgauskas, Brown, Curry, Miller, Haywood, Olowokandi, Mourning, Wallace, Ostertag; the numeric entries did not survive transcription.]
It may be better to take the overall mean!

5 An insurance example
Y_i - known number of accidents of individual insured motorists
Predict their expected number - the rate, µ_i (in the next year, say)
Y_i ~ Poisson(µ_i)
Maximum likelihood: ˆµ_i = Y_i
Nothing better?

6 The data of Simar (1976) 6

7 So, what is better?
First, what is better? We express it via some (expected) loss function
Most often it is the averaged or aggregated squared error loss Σ_i (ˆµ_i - µ_i)²
But it could also be some other loss...
And then? Well, it is sooo simple...

8 ... if P is known!
The µ_i are sampled iid from P - the prior distribution
Conditionally on µ_i, the distribution of Y_i is, say, N(µ_i, 1)
The optimal prediction is the posterior mean, the mean of the posterior distribution: the conditional distribution of µ_i given Y_i (given that the loss function is quadratic!)
For instance, if P is N(0, σ²), then (homework) the best predictor is ˆµ_i = Y_i - (1/(σ² + 1)) Y_i
Borrowing strength via shrinkage: neither will the good be that good, nor the bad that bad
More generally, µ_i can be N(µ, σ²) and Y_i then N(µ_i, σ₀²), and then ˆµ_i = Y_i - (σ₀²/(σ² + σ₀²))(Y_i - µ)  (if σ² = σ₀², halfway to µ)
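A minimal numerical check of the shrinkage rule in the normal-normal case (my own sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma2, sigma2_0 = 2.0, 4.0, 1.0        # prior N(mu0, sigma2), noise variance sigma2_0
mu = mu0 + np.sqrt(sigma2) * rng.standard_normal(10_000)
y = mu + np.sqrt(sigma2_0) * rng.standard_normal(10_000)

# Bayes rule: shrink Y_i toward the prior mean mu0 by the factor sigma2_0 / (sigma2 + sigma2_0)
mu_hat = y - sigma2_0 / (sigma2 + sigma2_0) * (y - mu0)

print(np.mean((mu_hat - mu) ** 2),           # about sigma2 * sigma2_0 / (sigma2 + sigma2_0) = 0.8
      np.mean((y - mu) ** 2))                # about sigma2_0 = 1.0
```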

9 If only all of them published posthumously... Thomas Bayes ( ) 9

10 But do we know P (or σ²)?
Hierarchical model
Random effects
Smoothing
Empirical Bayes - no less Bayes than empirical
We know it is frequentist, but frequentists think it is Bayesian, so this is why we discuss it here
Many inventors...

11 11 What is mathematics? Herbert Ellis Robbins ( )

12 12 On experience in statistical decision theory (1954) Antonín Špaček ( )

13 13 I. J. Good (2000) Alan Mathison Turing ( )

14 So, how?
A. We may try to estimate the prior - g-modeling, Efron (2014)
B. Or, more directly, the prediction rule - f-modeling
A. Estimated normal prior (parametric)
(Nonparametric ouverture)
A. Empirical prior (nonparametric)
B. Empirical prediction rule (nonparametric)
Simulation contests

15 A. Estimated normal prior
James-Stein (JS): if P is N(0, σ²), then the unknown part, 1/(σ² + 1), of the prediction rule can be estimated by (n - 2)/S, where S = Σ_i Y_i²
For general µ in place of 0, the rule is ˆµ_i = Y_i - ((n - 3)/S)(Y_i - Ȳ), with Ȳ = (1/n) Σ_i Y_i and S = Σ_i (Y_i - Ȳ)²
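A compact sketch of the general-µ James-Stein rule displayed above (my own code; it assumes unit-variance observations):

```python
import numpy as np

def james_stein(y):
    """mu_hat_i = Y_i - ((n - 3)/S) (Y_i - Ybar), with S = sum_i (Y_i - Ybar)^2."""
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    S = np.sum((y - ybar) ** 2)
    return y - (y.size - 3) / S * (y - ybar)

rng = np.random.default_rng(2)
mu = rng.normal(2.0, 1.0, size=500)            # P = N(2, 1)
y = mu + rng.standard_normal(500)              # Y_i ~ N(mu_i, 1)
print(np.sum((james_stein(y) - mu) ** 2),      # James-Stein...
      np.sum((y - mu) ** 2))                   # ...versus the MLE Y_i
```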

17 16 JS as empirical Bayes: Efron and Morris (1975) Charles Stein (1920 )

18 17 Nonparametric ouverture: MLE of density
Density estimation: given the datapoints X_1, X_2, ..., X_n, solve
∏_{i=1}^n g(X_i) → max_g!
or equivalently
-Σ_{i=1}^n log g(X_i) → min_g!
under the side conditions g ≥ 0, ∫ g = 1

19 18 Doesn't work

20 How to prevent Dirac catastrophe? 19

21 20 Reference
Koenker and Mizera (2014)... and those that cite it (Google Scholar)
... the chance meeting on a dissecting-table of a sewing-machine and an umbrella
See also the REBayes package on CRAN
For simplicity: ϕ(y, µ) = ϕ(y - µ), and the latter is the standard normal density

22 21 A. Empirical prior
MLE of P: Kiefer and Wolfowitz (1956)
-Σ_i log( ∫ ϕ(Y_i - u) dP(u) ) → min_P!
The regularizer is the fact that it is a mixture
No tuning parameter needed (but known form of ϕ!)
The resulting ˆP is atomic ("empirical prior")
However, it is an infinite-dimensional problem...

23 22 EM nonsense
Laird (1978), Jiang and Zhang (2009): use a grid {u_1, ..., u_m} (m = 1000) containing the support of the observed sample and estimate the prior density via EM iterations
ˆp_j^(k+1) = (1/n) Σ_{i=1}^n [ ˆp_j^(k) ϕ(Y_i - u_j) / Σ_{l=1}^m ˆp_l^(k) ϕ(Y_i - u_l) ]
Sloooooow... (original versions: 55 hours for 1000 replications)
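The update itself takes only a few lines (a sketch of the grid-based iteration with my own variable names); the objection is its slowness, not its difficulty.

```python
import numpy as np
from scipy.stats import norm

def kw_em(y, m=1000, iters=100):
    """EM for the Kiefer-Wolfowitz NPMLE on a fixed grid u_1, ..., u_m:
    p_j <- (1/n) sum_i p_j phi(Y_i - u_j) / sum_l p_l phi(Y_i - u_l)."""
    y = np.asarray(y, dtype=float)
    u = np.linspace(y.min(), y.max(), m)
    A = norm.pdf(y[:, None] - u[None, :])       # A_ij = phi(Y_i - u_j)
    p = np.full(m, 1.0 / m)                     # uniform starting weights
    for _ in range(iters):
        w = A * p                               # w_ij = p_j phi(Y_i - u_j)
        p = (w / w.sum(axis=1, keepdims=True)).mean(axis=0)
    return u, p
```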

24 23 Convex optimization!
Koenker and Mizera (2014): it is a convex problem!
-Σ_i log( ∫ ϕ(Y_i - u) dP(u) ) → min_P!
When discretized,
-Σ_i log( Σ_{j=1}^m ϕ(Y_i - u_j) p_j ) → min_p!
or in a more technical form
-Σ_i log y_i → min!  subject to  Az = y and z ∈ S,
where A = (ϕ(Y_i - u_j)) and S = {s ∈ R^m : 1ᵀs = 1, s ≥ 0}.
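A sketch of the discretized convex program in CVXPY (the authors' own implementation is the REBayes package on CRAN; the grid size and data here are made up):

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

rng = np.random.default_rng(3)
mu = rng.uniform(5, 15, size=500)
y = mu + rng.standard_normal(500)

m = 300
u = np.linspace(y.min(), y.max(), m)            # grid for the mixing distribution
A = norm.pdf(y[:, None] - u[None, :])           # A_ij = phi(Y_i - u_j)

p = cp.Variable(m, nonneg=True)                 # masses p_j on the grid
problem = cp.Problem(cp.Minimize(-cp.sum(cp.log(A @ p))), [cp.sum(p) == 1])
problem.solve()

p_hat = np.clip(p.value, 0, None)
# plug-in Bayes rule: posterior mean under the estimated (atomic) prior
delta = (A * (u * p_hat)).sum(axis=1) / (A * p_hat).sum(axis=1)
```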

25 24 With a dual
The solution is an atomic probability measure, with not more than n atoms. The locations, ˆµ_j, and the masses, ˆp_j, at these locations can be found via the following dual characterization: the solution, ˆν, of
Σ_{i=1}^n log ν_i → max!  subject to  Σ_{i=1}^n ν_i ϕ(Y_i - µ) ≤ n for all µ
satisfies the extremal equations
Σ_j ϕ(Y_i - ˆµ_j) ˆp_j = 1/ˆν_i,
and the ˆµ_j are exactly those µ where the dual constraint is active.
And one can use modern convex optimization methods again...
(And note: everything goes through for general ϕ(y, µ))
(And one can also handle - numerically - alternative loss functions!)

26 25 A typical result: µ_i drawn from U(5, 15)
[Figure (Koenker and Mizera): left, the estimated mixture density ĝ (target in blue); right, the corresponding Bayes decision rule ˆδ (target in blue), for a simple compound decision problem.]

27 26 B. Empirical prediction rule
Lawrence Brown, personal communication
Also, it looks like in Maritz and Lwin (1989)
Do not estimate P, but rather the prediction rule
Tweedie formula: for known (general) P, and hence known g, the Bayes rule is
δ(y) = y + σ² g′(y)/g(y)
One may try to estimate g and plug it in - when knowing σ² (= 1, for instance): Brown and Greenshtein (2009)
By an exponential family argument, δ(y) is nondecreasing in y (van Houwelingen & Stijnen, 1983)
(that came automatically when the prior was estimated)
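A plug-in version of the Tweedie formula with a Gaussian kernel estimate of g, in the spirit of Brown and Greenshtein (2009) (only a sketch; the bandwidth rule here is a generic placeholder, not theirs):

```python
import numpy as np

def tweedie_rule(y_eval, y_obs, h=None, sigma2=1.0):
    """delta(y) = y + sigma2 * g'(y)/g(y), with g a Gaussian kernel density estimate."""
    y_obs = np.asarray(y_obs, dtype=float)
    n = y_obs.size
    if h is None:
        h = 1.06 * y_obs.std() * n ** (-1 / 5)       # Silverman-type bandwidth
    z = (np.asarray(y_eval)[:, None] - y_obs[None, :]) / h
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)   # phi(z)
    g = k.sum(axis=1) / (n * h)                      # kernel estimate of g
    g_prime = (-z * k).sum(axis=1) / (n * h ** 2)    # its derivative
    return np.asarray(y_eval) + sigma2 * g_prime / g
```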

28 27 Monotone (estimate of) empirical Bayes rule
Maximum likelihood again (h = log g)
- but with some shape-constraint regularization,
- like log-concavity: (log g)″ ≤ 0
- but we rather want y + g′(y)/g(y) = y + (log g(y))′ nondecreasing
- that is, ½y² + log g(y) = ½y² + h(y) convex
Building up the optimization problem:
-Σ_{i=1}^n log g(X_i) → min_g!  subject to  g ≥ 0, ∫ g = 1
-Σ_{i=1}^n log g(X_i) → min_g!  subject to  -log g convex, g ≥ 0, ∫ g dx = 1
-Σ_{i=1}^n h(X_i) → min_h!  subject to  -h convex, e^h ≥ 0, ∫ e^h dx = 1
-Σ_{i=1}^n h(X_i) → min_h!  subject to  -h convex, ∫ e^h dx = 1
-Σ_{i=1}^n h(X_i) + ∫ e^h dx → min_h!  subject to  -h convex
-Σ_{i=1}^n h(X_i) + ∫ e^h dx → min_h!  subject to  ½y² + h(y) convex
The regularizer is the monotonicity constraint
No tuning parameter, or knowledge of ϕ - but knowing all the time that σ² = 1
A convex problem again
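A discretized CVXPY sketch of this shape-constrained problem (my own discretization: h lives on a grid, ∫ e^h is a Riemann sum, the likelihood is scaled by 1/n so that e^ĥ integrates to one, and convexity of ½y² + h becomes a second-difference constraint):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
mu = rng.uniform(5, 15, size=400)
y = mu + rng.standard_normal(400)

m = 300
t = np.linspace(y.min() - 3, y.max() + 3, m)     # grid carrying h = log g
dx = t[1] - t[0]
idx = np.searchsorted(t, y)                      # grid cell of each observation

h = cp.Variable(m)
loglik = cp.sum(h[idx]) / y.size                 # (1/n) sum_i h(Y_i)
penalty = cp.sum(cp.exp(h)) * dx                 # approximates int e^h dx
# (1/2) y^2 + h(y) convex  <=>  second differences of h >= -dx^2
constraints = [h[2:] - 2 * h[1:-1] + h[:-2] >= -dx ** 2]
cp.Problem(cp.Minimize(-loglik + penalty), constraints).solve()

g_hat = np.exp(h.value)                          # estimated mixture density
delta_hat = t + np.gradient(h.value, t)          # estimated (nondecreasing) Bayes rule
```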

38 28 Some remarks
After reparametrization, omitting constants, etc., one can write it as the solution of an equivalent problem
-(1/n) Σ_{i=1}^n K(Y_i) + ∫ e^{K(y)} dΦ_c(y) → min over K ∈ K (the cone of convex functions)!
Compare:
-(1/n) Σ_{i=1}^n h(X_i) + ∫ e^h dx → min over h, -h ∈ K!

39 29 Dual formulation
Analogous to Koenker and Mizera (2010): the solution, ˆK, exists and is piecewise linear. It admits a dual characterization: e^{ˆK(y)} = ˆf(y), where ˆf is the solution of
∫ f(y) log f(y) dΦ(y) → min_f!  subject to  f = d(P_n - G)/dΦ, G ∈ K
The estimated decision rule, ˆδ, is piecewise constant and has no jumps at min Y_i and max Y_i.

40 30 A typical result: µ_i drawn from U(5, 15)
[Figure: left, the estimated mixture density ĝ (target in blue); right, the piecewise constant empirical decision rule ˆδ; the target Bayes rule and its mixture density are plotted as smooth (blue) lines. The local maxima give y for which ˆδ(y) = y.]

41 31 Doable also for some other exponential families
However: a version of the Tweedie formula may be obtainable only for the canonical parameter (binomial!) and depends on the loss function
For the Poisson case:
- the optimal prediction with respect to the quadratic loss function is, for x = 0, 1, 2, ...,
ˆµ(x) = (x + 1) g(x + 1) / g(x),
where g is the marginal density of the Y_i
- for the loss function (µ - ˆµ)²/µ, the optimal prediction is, for x = 1, 2, ...,
ˆµ(x) = x g(x) / g(x - 1).

42 32 What can be done with that?
One can estimate g(x) by the relative frequency, as Robbins (1956):
ˆµ(x) = (x + 1) (#{Y_i = x + 1}/n) / (#{Y_i = x}/n) = (x + 1) #{Y_i = x + 1} / #{Y_i = x}
However, the predictions obtained this way are not monotone, and also erratic, especially when some denominator is 0
- the latter can be rectified by the adjustment of Maritz and Lwin (1989):
ˆµ(x) = (x + 1) #{Y_i = x + 1} / (1 + #{Y_i = x})
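Both formulas fit in a few lines (a sketch with my own function name; the synthetic data stand in for accident counts):

```python
import numpy as np

def robbins(y, adjust=False):
    """Robbins (1956): mu_hat(x) = (x+1) #{Y_i = x+1} / #{Y_i = x};
    adjust=True uses the Maritz-Lwin denominator 1 + #{Y_i = x}."""
    y = np.asarray(y, dtype=int)
    counts = np.bincount(y, minlength=y.max() + 2).astype(float)
    x = np.arange(y.max() + 1)
    denom = counts[x] + (1.0 if adjust else 0.0)
    with np.errstate(divide="ignore", invalid="ignore"):
        mu_hat = (x + 1) * counts[x + 1] / denom
    return x, mu_hat

rng = np.random.default_rng(5)
lam = rng.gamma(shape=1.0, scale=0.5, size=5000)   # hypothetical accident rates mu_i
y = rng.poisson(lam)                               # observed counts Y_i
print(robbins(y, adjust=True))
```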

43 33 Better: monotonizations
The suggestion of van Houwelingen & Stijnen (1983): pool adjacent violators
- also requires a grid
Or one can estimate the marginal density under the shape restriction that the resulting prediction is monotone:
(x + 1) ĝ(x + 1)/ĝ(x) ≤ (x + 2) ĝ(x + 2)/ĝ(x + 1)
After reparametrization in terms of logarithms, the problem is almost linear: a linear constraint resulting from the one above, and a linear objective function
- with a nonlinear Lagrange term ensuring that the result is a probability mass function
At any rate, again a convex problem - and the number of variables is the number of the x's
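A CVXPY sketch of the log-reparametrized problem (my own choice of the exponential "Lagrange" weight; the monotonicity constraint is the linear one displayed above):

```python
import numpy as np
import cvxpy as cp

def monotone_poisson_eb(y):
    """Estimate h(x) = log g(x) on {0, ..., max(y)} by penalized ML, subject to
    log(x+1) + h(x+1) - h(x), the log of the decision rule, being nondecreasing."""
    y = np.asarray(y, dtype=int)
    M = y.max()
    counts = np.bincount(y, minlength=M + 1).astype(float)

    h = cp.Variable(M + 1)
    # linear likelihood term; the exp-sum term keeps sum_x e^{h(x)} equal to one
    objective = cp.Minimize(-cp.sum(cp.multiply(counts, h)) + y.size * cp.sum(cp.exp(h)))
    d = np.log(np.arange(1, M + 1)) + h[1:] - h[:-1]
    cp.Problem(objective, [d[:-1] <= d[1:]]).solve()
    return np.exp(d.value)                          # monotone rule at x = 0, ..., M-1
```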

44 34 Why all this is feasible: interior point methods
(Leave optimization to experts)
Andersen, Christiansen, Conn, and Overton (2000)
We acknowledge using Mosek, Danish optimization software: E. D. Andersen (2010)
PDCO: Saunders (2003)
Nesterov and Nemirovskii (1994)
Boyd, Grant and Ye: Disciplined Convex Programming
Folk wisdom: If it is convex, it will fly.

45 35 Simulations - or how to be highly cited
Johnstone and Silverman (2004): empirical Bayes for sparsity
- n = 1000 observations, k of which have µ all equal to one of the 4 values 3, 4, 5, 7
- the remaining n - k have µ = 0
- there are three choices of k: 5, 50, 500
Criterion: sum of squared errors, averaged over replications, and rounded
Seems like this scenario (or similar ones) became popular
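The scenario takes one line to generate; the sketch below (my labels) also reports the James-Stein benchmark on a single draw:

```python
import numpy as np

def js_scenario(n=1000, k=50, mu_val=5.0, rng=None):
    """Johnstone-Silverman (2004) setup: k means equal to mu_val, the other n-k equal to 0."""
    if rng is None:
        rng = np.random.default_rng(6)
    mu = np.zeros(n)
    mu[:k] = mu_val
    return mu, mu + rng.standard_normal(n)

mu, y = js_scenario(k=50, mu_val=5.0)
ybar = y.mean()
js = y - (y.size - 3) / np.sum((y - ybar) ** 2) * (y - ybar)        # James-Stein, for reference
print(round(np.sum((js - mu) ** 2)), round(np.sum((y - mu) ** 2)))  # SSE: JS vs. the MLE
```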

46 36 The first race
[Table (Koenker and Mizera, Table 2): risk of the shape-constrained rule ˆδ compared to two versions of Jiang and Zhang's GMLEB procedure (ˆδ_GMLEBIP, the empirical prior implemented via the interior point algorithm described in the text, and ˆδ_GMLEBEM, implemented via 100 EM iterations), the kernel procedure δ_1.15 of Brown and Greenshtein (2009) (best results reported for the bandwidth-related constant 1.15), and the best of the 18 procedures of Johnstone and Silverman (2004) (only their winner reported here, J-S Min). Columns: k = 5, 50, 500 crossed with µ = 3, 4, 5, 7. Entries are sums of squared errors in n = 1000 observations, based on 1000, 100, 100, 50 and 100 replications, respectively; the numeric entries did not survive transcription.]

47 37 A new lineup
[Table: columns BL, DL(1/n), DL(1/2), HS, EBMW, EBB, EBKM, oracle; numeric entries not preserved.]
Bhattacharya, Pati, Pillai, Dunson (2012): Bayesian shrinkage
BL: Bayesian Lasso
DL: Dirichlet-Laplace priors (with different strengths)
HS: Carvalho, Polson, and Scott (2009) horseshoe priors
EBMW: asympt. minimax EB of Martin and Walker (2013)
elsewhere: Castillo & van der Vaart (2012) posterior concentration

48 38 Comments (Conclusions?)
- both approaches typically outperform other methods
- the Kiefer-Wolfowitz empirical prior typically outperforms monotone empirical Bayes (for the examples we considered!)
- both methods adapt to general P, in particular to those with multiple modes
- however, the Kiefer-Wolfowitz empirical prior is more flexible: it adapts (much) better to certain peculiarities vital in practical data analysis, like unequal σ_i, inclusion of covariates, etc.
- in particular, it also exhibits a certain independence of the choice of the loss function (the estimate of the prior, and hence of the posterior, is always the same)
- but, in certain situations, Kiefer-Wolfowitz (on the grid!) may be more computationally demanding

49 39 NBA data again
[Table: columns player, n, prop, k, ast, sigma, ebkw, jsmm, glmm, lmer for the 15 players Yao, Frye, Camby, Okur, Blount, Mihm, Ilgauskas, Brown, Curry, Miller, Haywood, Olowokandi, Mourning, Wallace, Ostertag; numeric entries not preserved.]

50 40 A (partial) picture
[Figure: observed proportions (nba$prop) plotted against the predictions p, kw, glmm, lmer, js.]

51 Mixing distribution ("empirical prior")

52 42 Mixing distribution for glmm
[Figure: estimated mixing distribution for glmm (nba.glmm$p).]

53 43 The auto insurance predictions
[Figure: predictions (acd.mon$robbmon) plotted against acd$rate.]


55 45 That's it?
What if P is unimodal? Can't we do better in such a case?
And if we can, will it be (significantly) better than James-Stein?
Joint work with Mu Lin

58 46 OK, so just impose unimodality on P
or more precisely, constrain P to be log-concave (or q-convex)
(unimodality does not work well in this context)
However, the resulting problem is not convex!
Nevertheless, given that log-concavity of P + that of ϕ implies that of the convolution
g(y) = ∫ ϕ(y - µ) dP(µ),
one can impose log-concavity on the mixture!
(So that the resulting formulation then a convex problem is.)

62 47 3. Unimodal Kiefer-Wolfowitz
-Σ_i log g(Y_i) → min over g, P!  subject to  g(·) = ∫ ϕ(· - u) dP(u)  and  -log g convex
(Works, but needs a special version of Mosek)
May be demanding for large sample sizes

65 48 4. Unimodal monotone empirical Bayes
Want both: ½y² + h(y) convex and h(y) concave
-Σ_{i=1}^n h(X_i) + ∫ e^h dx → min_h!  subject to  0 ≥ h″(y) ≥ -1
(½y² + h(y) convex ⇔ 1 + h″(y) ≥ 0, i.e. h″(y) ≥ -1; h(y) concave ⇔ h″(y) ≤ 0)
Very easy, very fast

70 49 A typical result, again from U(5, 15)
[Figure: mixture density (left) and prediction rule (right).]
(Empirical prior, mixture unimodal)

71 50 A typical result, again from U(5, 15)
[Figure: mixture density (left) and prediction rule (right).]
(Empirical prediction rule, mixture unimodal)

72 51 Some simulations
Sum of squared errors, averaged over replications, rounded
[Table: scenarios U[5, 15], t_3, χ (details not preserved), and the four mixtures of Johnstone and Silverman (2004); methods br, kw, brlc, kwlc, mle, js, oracle; numeric entries not preserved.]
Last four: the mixtures of Johnstone and Silverman (2004): n = 1000 observations, with 5% or 50% of the µ equal to 2 or 5, and the remaining ones 0

73 52 Conclusions II
- when the mixing (and then the mixture) distribution is unimodal, it pays to enforce this shape constraint on the estimate
- if it is not, then it does not pay
- unimodal Kiefer-Wolfowitz still appears to outperform the unimodal monotonized empirical Bayes, by a small margin
- and both outperform James-Stein, significantly so for asymmetric mixing distributions
- computationally, unimodal monotonized empirical Bayes is much more painless than unimodal Kiefer-Wolfowitz
