
ISyE8843A, Brani Vidakovic   Handout 8

Hierarchical Bayes and Empirical Bayes. ML II Method

Hierarchical Bayes and Empirical Bayes are related by their goals, but quite different in the methods by which these goals are achieved. The attribute "hierarchical" refers mostly to the modeling strategy, while "empirical" refers to the methodology. Both methods are concerned with specifying the distribution at the prior level: hierarchical Bayes does so via Bayesian inference involving additional levels of hierarchy (hyperpriors and hyperparameters), while empirical Bayes uses the data more directly. In expanding Bayesian models and inference to more complex problems, going beyond the simple likelihood-prior-posterior scheme, a hierarchy of models may be needed. The parameters of interest then enter the model via their realizations, which are modeled much as if they were measurements. The common name "parameter population distribution" is indicative of the nature of the approach.

1 Hierarchical Bayesian Analysis

Hierarchical Bayesian analysis is a convenient representation of a Bayesian model, in particular of the prior π, via a conditional hierarchy of so-called hyperpriors π_1, ..., π_{n+1},

π(θ) = ∫ π_1(θ|θ_1) π_2(θ_1|θ_2) ··· π_n(θ_{n-1}|θ_n) π_{n+1}(θ_n) dθ_1 dθ_2 ··· dθ_n.   (1)

Operationally, the model

[x|θ] ~ f(x|θ), [θ|θ_1] ~ π_1(θ|θ_1), [θ_1|θ_2] ~ π_2(θ_1|θ_2), ..., [θ_{n-1}|θ_n] ~ π_n(θ_{n-1}|θ_n), [θ_n] ~ π_{n+1}(θ_n)   (2)

is equivalent to the model

[x|θ] ~ f(x|θ), [θ] ~ π(θ),

as far as inference on θ is concerned; a small simulation check of this equivalence is sketched below. Notice that in the hierarchy of data, parameters and hyperparameters,

X → θ → θ_1 → θ_2 → ··· → θ_n,

X and the θ_i are independent given θ. That means

[X | θ, θ_1, ...] =_d [X | θ],   [θ_i | θ, X] =_d [θ_i | θ],

where =_d denotes equality in distribution. The joint distribution [X, θ, θ_1, ..., θ_n], which by definition is

[X, θ, θ_1, ..., θ_n] = [X | θ, θ_1, ..., θ_n] [θ | θ_1, ..., θ_n] [θ_1 | θ_2, ..., θ_n] ··· [θ_{n-1} | θ_n] [θ_n],

can therefore be represented as

[X, θ, θ_1, ..., θ_n] = [X | θ] [θ | θ_1] [θ_1 | θ_2] ··· [θ_{n-1} | θ_n] [θ_n];

thus, to fully specify the model, only the neighbouring conditionals [X|θ], [θ|θ_1], [θ_1|θ_2], ..., [θ_{n-1}|θ_n] and the closure distribution [θ_n] are needed. Why then decompose the prior as in (1) and use the model (2)? Here are some of the reasons:

- Modeling requirements may lead to a hierarchy in the prior, for example Bayesian models in meta-analysis;
- The prior information may be separated into a structural part and a subjective/noninformative part at a higher level of the hierarchy;
- Robustness and objectivity: let the data talk about the hyperparameters;
- Computational issues (utilizing hidden mixtures, mixture priors, missing data, MCMC format).
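The equivalence of the hierarchical representation (2) and the collapsed prior (1) is easy to check by simulation. The following is a minimal sketch (not part of the original handout) for a two-stage normal hierarchy, θ_1 ~ N(β, τ²) and θ|θ_1 ~ N(θ_1, σ²), whose collapsed prior is N(β, σ² + τ²); the numerical values are arbitrary and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
beta, tau2, s2 = 0.0, 2.0, 1.0          # arbitrary illustrative hyperparameters (assumptions)

# theta sampled through the hierarchy (2): theta_1 ~ N(beta, tau2), theta | theta_1 ~ N(theta_1, s2)
theta1 = rng.normal(beta, np.sqrt(tau2), size=100_000)
theta_hier = rng.normal(theta1, np.sqrt(s2))

# theta sampled from the collapsed prior (1): theta ~ N(beta, s2 + tau2)
theta_coll = rng.normal(beta, np.sqrt(s2 + tau2), size=100_000)

print(theta_hier.mean(), theta_hier.var())    # both approximately 0 and 3
print(theta_coll.mean(), theta_coll.var())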

Sometimes it is not computationally feasible to carry out the analysis by reducing the sequence of hyperpriors in (2) to a single prior as in (1). Rather, the Bayes rule is obtained (by using Fubini's theorem) as a repeated integral with respect to more convenient conditional distributions. Here is the result involving the model f, the conditional prior π_1(θ|θ_1), and the hyperprior π_2(θ_1). Suppose the hierarchical model is given as [X|θ] ~ f(x|θ), [θ|θ_1] ~ π_1(θ|θ_1), and [θ_1] ~ π_2(θ_1). Then the posterior distribution can be written as

π(θ|x) = ∫_{Θ_1} π(θ|x, θ_1) π(θ_1|x) dθ_1.

The densities under the integral are

π(θ|x, θ_1) = f(x|θ) π_1(θ|θ_1) / m_1(x|θ_1)   and   π(θ_1|x) = m_1(x|θ_1) π_2(θ_1) / m(x),

where m_1(x|θ_1) = ∫_Θ f(x|θ) π_1(θ|θ_1) dθ is the conditional marginal likelihood and m(x) = ∫_{Θ_1} m_1(x|θ_1) π_2(θ_1) dθ_1 is the marginal. Consequently, for any function h of the parameter,

E^{θ|x} h(θ) = E^{θ_1|x} [ E^{θ|θ_1, x} h(θ) ].   (3)

Example: Suppose you poll n people about their favorite presidential candidate and X of them favor candidate A. The likelihood is [X|p] ~ Bin(n, p), and the proportion p is the parameter of interest. You believe that the proportion is close to 1/2, but you are not quite sure. An appropriate prior on p is the Beta Be(k, k), k ∈ N (see Figure 1(a)); it is symmetric about 1/2, but you are reluctant to specify the natural number k. Thus, [p|k] ~ Be(k, k). Finally, you put a hyperprior on k, π_2(k) ∝ 1/(k(2k - 1)). The hyperprior probability mass function is in fact

π_2(k) = 1 / (2 log(2) k (2k - 1)),   k = 1, 2, ....

What is the Bayes estimator of p if n = 20 and X = 12? The unconditional prior on p is

π(p) = Σ_{k=1}^∞ [ p^{k-1} (1-p)^{k-1} / B(k, k) ] · 1 / (2 log(2) k (2k - 1)) = (1 - |1 - 2p|) / (4 log(2) p (1-p)).   (4)

This prior, depicted in Figure 1(b), is symmetric about 1/2. The posterior does not have a closed form (it can be expressed in terms of special functions), so the Bayes rule has to be found by numerical integration,

δ^π(12) = ∫_0^1 p f(12|p) π(p) dp / ∫_0^1 f(12|p) π(p) dp.
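A quick numerical sketch of this direct route (assuming n = 20 and X = 12, and mirroring the Mathematica code in the Appendix); the code below is illustrative and not part of the original handout.

import numpy as np
from scipy import integrate
from scipy.stats import binom

n, x = 20, 12

def prior(p):
    # unconditional prior (4): (1 - |1 - 2p|) / (4 log(2) p (1 - p))
    return (1.0 - np.abs(1.0 - 2.0 * p)) / (4.0 * np.log(2.0) * p * (1.0 - p))

num = integrate.quad(lambda p: p * binom.pmf(x, n, p) * prior(p), 0, 1, points=[0.5])[0]
den = integrate.quad(lambda p: binom.pmf(x, n, p) * prior(p), 0, 1, points=[0.5])[0]
print(num / den)    # approximately 0.58, slightly below the MLE 12/20 = 0.6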

Figure 1: (a) Beta(k, k) densities for k = 1, 2, 3, 5, 10, and 20. (b) The unconditional prior π(p) from (4).

Mathematica programming is particularly convenient in this case, and the integrals have an exact solution for given n and X (see the Appendix). The above approach to producing the Bayes estimator is direct: the unconditional prior is obtained as a mixture of Betas, i.e., we utilized the transition from the hierarchy in (2) to the single prior in (1).

Now we consider the same problem by utilizing the hierarchical Bayes representation discussed previously and equation (3). The hierarchical Bayes approach is as follows. The Bayes rule is δ^π(x) = E^{k|x} E^{p|k,x} p, where the expectation E^{k|x} is taken with respect to π_1(k|x) and E^{p|k,x} with respect to π_1(p|k, x). The distribution of [p|k, x] is again Beta,

π_1(p|k, x) ∝ p^x (1-p)^{n-x} · p^{k-1} (1-p)^{k-1},

i.e., Beta with parameters x + k and n - x + k, so the conditional expectation of p is

E^{p|k,x} p = (x + k) / ((x + k) + (n - x + k)) = (x + k) / (2k + n).

The distribution π_1(k|x) = m_1(x|k) π_2(k) / m(x) does not have a closed form. Here m_1(x|k) = ∫ f(x|p) π_1(p|k) dp is Beta-Binomial,

m_1(x|k) = Γ(2k)/(Γ(k)Γ(k)) · Γ(n+1)/(Γ(x+1)Γ(n-x+1)) · Γ(k+x)Γ(k+n-x)/Γ(2k+n).

Of course, m_1(x|k) can be simplified a bit more. The marginal m(x) is Σ_{k=1}^∞ m_1(x|k) / (2 log(2) k (2k-1)), and it does not have a closed form, although it can be expressed in terms of hypergeometric pFq special functions.¹ The Bayes rule is then

δ^π(x) = E^{k|x} E^{p|k,x} p = E^{k|x} (x+k)/(2k+n) = [ Σ_{k=1}^∞ ((x+k)/(2k+n)) m_1(x|k)/(2 log(2) k (2k-1)) ] / [ Σ_{k=1}^∞ m_1(x|k)/(2 log(2) k (2k-1)) ],

which Mathematica expresses as a ratio of two 3F2 hypergeometric functions with argument z = 1, i.e., as an infinite sum of ratios of Gamma functions. This ratio is easily evaluated numerically; see the Mathematica notebook.

¹ The hypergeometric pFq function is defined as pFq([a_1, ..., a_p], [b_1, ..., b_q], z) = Σ_{k=0}^∞ [ (a_1)_k ··· (a_p)_k / ((b_1)_k ··· (b_q)_k) ] z^k / k!, where (a)_n = a(a+1)···(a+n-1) = Γ(a+n)/Γ(a).
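A numerical sketch of this hierarchical route, truncating the infinite sum over k at an assumed cutoff; again illustrative, not from the handout.

import numpy as np
from scipy.special import betaln, comb

n, x = 20, 12
K = 5000                                  # truncation point for the infinite sum (assumption)
k = np.arange(1, K + 1)

# Beta-Binomial marginal m1(x|k) and hyperprior pi2(k) = 1/(2 log 2 * k * (2k - 1)), on the log scale
log_m1 = np.log(comb(n, x)) + betaln(x + k, n - x + k) - betaln(k, k)
log_pi2 = -np.log(2.0 * np.log(2.0) * k * (2.0 * k - 1.0))

w = np.exp(log_m1 + log_pi2)
w /= w.sum()                              # truncated posterior pmf of k given x

print(np.sum(w * (x + k) / (2.0 * k + n)))    # E^{k|x} E^{p|k,x} p; agrees with the direct route

Both routes describe the same posterior, so they produce the same estimate.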

For n = 20 and X = 12 the estimate is δ^π(12) ≈ 0.58. Notice the slight shrinkage toward 1/2 compared with the MLE p̂ = 12/20 = 0.6.

You may have noticed that even in a simple hierarchy, as in the previous example, the Bayesian analysis may not be computationally easy. This is true even when we have a perfect conjugate structure at the different levels of the hierarchy and normal models. The following theorem is adapted from Berger (1985) and is illustrated on the IQ adventures of our old friend Jeremy. Berger (1985), Section 4.6, contains an excellent account of hierarchical models with detailed proofs. The following model, in addition to its educational value, can be quite useful as a modeling tool if one believes in normality at the various stages of the hierarchy.

Assume that X = (X_1, ..., X_p) is a p-dimensional observation with multivariate normal likelihood f(x|θ) = MVN_p(θ, σ_f² I). The parameter θ = (θ_1, ..., θ_p) is of interest, and the positive scalar σ_f² is assumed known. The first-stage prior on θ, π_1(θ|µ_π, σ_π²), is multivariate normal MVN_p(µ, σ_π² I), where µ = µ_π 1 and 1 is the p × 1 vector of ones. The hyperparameter in this case is two-dimensional, λ = (µ_π, σ_π²). To complete the model, assume π_2(λ) = π_2(µ_π) π_2(σ_π²), with π_2(µ_π) = N(β, τ²) and an appropriate π_2(σ_π²). Of course β, τ², and possible parameters in π_2(σ_π²) are assumed known, i.e., the hierarchy stops here. The following theorem gives an explicit (up to a univariate numerical integration) Bayes estimator of θ and its covariance matrix.

Theorem 1. The posterior mean is δ^π(x) = E^{θ|x} θ = E^{σ_π²|x} δ_1(x, σ_π²), where

δ_1(x, σ_π²) = x - [σ_f²/(σ_f² + σ_π²)] (x - x̄ 1) - [σ_f²/(σ_π² + σ_f² + pτ²)] (x̄ - β) 1.   (5)

The posterior covariance matrix is

V = E^{σ_π²|x} [ σ_f² I - σ_f⁴/(σ_π² + σ_f²) I + σ_f⁴ τ²/((σ_π² + σ_f²)(pτ² + σ_π² + σ_f²)) J + (δ_1(x, σ_π²) - δ^π(x))(δ_1(x, σ_π²) - δ^π(x))' ],

where J is the p × p matrix of ones. The distribution of [σ_π²|x], needed for the above expectations, satisfies

π_1(σ_π²|x) ∝ exp{ -(1/2) [ s²/(σ_π² + σ_f²) + p(x̄ - β)²/(pτ² + σ_π² + σ_f²) ] } / [ (σ_π² + σ_f²)^{(p-1)/2} (pτ² + σ_π² + σ_f²)^{1/2} ] · π_2(σ_π²),

where s² = Σ_{i=1}^p (x_i - x̄)².

The expectation with respect to [σ_π²|x] has to be carried out numerically. It is important that the marginal posterior π_1(σ_π²|x) be proper (integrate to a finite value). This is true whenever the closure prior π_2(σ_π²) is bounded and p ≥ 3.
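A numerical sketch of Theorem 1 with the flat closure prior π_2(σ_π²) = 1. The five scores below are hypothetical placeholders, not the actual data, which are processed in jeremy.nb; the hyperparameter values follow the exercise.

import numpy as np
from scipy import integrate

x = np.array([110.0, 105.0, 96.0, 92.0, 98.0])    # hypothetical scores (assumption)
sigma_f2, beta, tau2 = 80.0, 110.0, 120.0          # values from the exercise below
p = len(x)
xbar = x.mean()
s2 = np.sum((x - xbar) ** 2)

def delta1(v):
    # conditional posterior mean (5), given sigma_pi^2 = v
    return (x - sigma_f2 / (sigma_f2 + v) * (x - xbar)
              - sigma_f2 / (sigma_f2 + v + p * tau2) * (xbar - beta))

def post_v(v):
    # marginal posterior of sigma_pi^2, up to a constant, with the flat closure prior
    q = s2 / (v + sigma_f2) + p * (xbar - beta) ** 2 / (p * tau2 + v + sigma_f2)
    return np.exp(-0.5 * q) / ((v + sigma_f2) ** ((p - 1) / 2.0)
                               * np.sqrt(p * tau2 + v + sigma_f2))

c = integrate.quad(post_v, 0, np.inf)[0]
theta_hat = np.array([integrate.quad(lambda v: delta1(v)[i] * post_v(v), 0, np.inf)[0]
                      for i in range(p)]) / c
print(theta_hat)    # coordinatewise shrinkage of the scores toward a compromise of xbar and beta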

Exercise: Suppose that Jeremy, our IQ-concerned fellow, has taken 5 IQ tests in the last 5 years and has obtained a vector of scores X = (X_1, ..., X_5). Assume that each measurement is X_i ~ N(θ_i, 80), i = 1, ..., 5, and that the θ_i represent realizations of a random variable describing Jeremy's true ability. Unlike before, we assume that his true ability changes randomly in time; however, the underlying law from which such time-varying abilities are generated is common. Thus, θ_i ~ N(µ_π, σ_π²), i = 1, ..., 5. Finally, the model is closed by assuming that µ_π ~ N(110, 120) and π_2(σ_π²) = 1. Find the Bayes estimator of θ and the covariance matrix of the estimate.

Solution: The estimator of θ can be found coordinatewise. Theorem 1 is directly applicable with σ_f² = 80, β = 110, τ² = 120, and p = 5. The numerical values of θ̂ and of the posterior covariance matrix V are produced by the MATHEMATICA program jeremy.nb on the web site.

Figure 2: Marginal posterior π_1(σ_π²|x) when π_2(σ_π²) = 1 in the IQ example.

Figure 2 shows π_1(σ_π²|x) when π_2(σ_π²) = 1. Notice that even though the closure prior π_2(σ_π²) is improper, the marginal posterior is a proper density, although quite flat.

2 Empirical Bayes. ML II Method

Empirical Bayes has several formulations. The original formulation of empirical Bayes assumes that past values of X_i and of the corresponding parameters θ_i are known to the statistician, who then, on the basis of the current observation X_{n+1}, tries to make inference about the unobserved θ_{n+1}. Of course, the parameters θ_i are seldom known. However, it may be assumed that the past (and current) θ's are realizations from the same unknown prior distribution. Empirical Bayes is an approach to inference in which the observations are used to select the prior, usually via the marginal distribution. Once the prior is specified, the inference proceeds in a standard Bayesian fashion. The use of the data to estimate the prior, in addition to their subsequent use in the inference, is criticized by subjectivists, who consider the prior information exogenous to the observations. The repeated use of the data is also loaded with perils, since it can lead to underestimating modeling errors: any data set will be complacent with a model that used the same data to specify some of its features.

An excellent and comprehensive monograph on empirical Bayes methodology is Maritz and Lwin (1989). Empirical Bayes and compound decision problems were introduced by Robbins in the 1950's, but the first implicit empirical Bayes approach goes back to von Mises (1943).

Figure 3: (a) Herbert Robbins, 1915-2001. (b) Richard von Mises, 1883-1953.

Von Mises was considering the problem of testing drinking water for a particular bacterial contamination. In a single experiment 5 small batches of water are taken. The experiment is positive if at least one batch is positive, i.e., contains at least one bacterium. Denote by θ the probability that a single batch contains at least one bacterium, and let X be the number of positive batches among the 5 taken. Here

m(x) = ∫_0^1 C(5, x) θ^x (1-θ)^{5-x} π(θ) dθ,

and π(θ) is not specified. Von Mises was interested in P(θ ≥ θ_0 | X = x) for an appropriate threshold θ_0. For the rest of the story and its solution, consult von Mises (1943). We modify von Mises' problem into an exercise and provide a step-by-step solution.

Suppose that in 16 independent experiments, each containing 5 batches of water, the numbers of batches found positive for bacteria were

number of positive batches:  0   1   2   3
number of experiments:       7   5   3   1

Consider these 16 measurements historic and coming from Bin(5, θ_i) distributions, where the θ_i are different; for example, the experiments may have been conducted at different places. Take a 17th experiment and measure X_17. Assume now that θ_i, i = 1, ..., 17, have a Beta(α, β) distribution. Evaluate P(θ_17 ≤ θ_0 | X_17 = x_17) for a given threshold θ_0. Here are the steps in the solution:

(i) Find the marginal for X_i in experiment i. The result is Beta-Binomial,

P(X_i = k | α, β) = C(5, k) B(k + α, 5 - k + β) / B(α, β),   k = 0, 1, ..., 5.

What is critical here is that this distribution is free of the variable θ_i (pun not intended).

(ii) Estimate α and β from the marginal. The method of moments is most straightforward. The theoretical moments

E(X_i) = 5α/(α + β)   and   E(X_i²) = 5α(5α + 5 + β)/((α + β)(α + β + 1))

are replaced by the empirical moments based on the historic data, m_1 = (1/16) Σ_{i=1}^{16} X_i = 0.875 and m_2 = (1/16) Σ_{i=1}^{16} X_i² = 1.625, and the two equations are solved for α and β. The solutions are

α̂ = m_1 (m_2 - 5 m_1) / (m_1 (4 m_1 + 5) - 5 m_2) = 3.5,
β̂ = (5 - m_1)(m_2 - 5 m_1) / (m_1 (4 m_1 + 5) - 5 m_2) = 16.5.

(iii) Express P(θ_17 ≤ θ_0 | X_17 = x_17) in terms of incomplete Beta functions with the estimated hyperparameters α̂ and β̂: the posterior of θ_17 is Be(α̂ + x_17, β̂ + 5 - x_17), so the probability is the ratio of an incomplete Beta function to the complete one, computed in MATHEMATICA as Beta[θ_0, α̂ + x_17, β̂ + 5 - x_17]/Beta[α̂ + x_17, β̂ + 5 - x_17].
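A compact numerical sketch of steps (ii)-(iii); the values of x_17 and θ_0 below are only illustrative assumptions.

import numpy as np
from scipy.stats import beta

counts = np.repeat([0, 1, 2, 3], [7, 5, 3, 1])   # the 16 historic counts of positive batches
m1, m2 = counts.mean(), np.mean(counts ** 2)     # 0.875 and 1.625

den = m1 * (4 * m1 + 5) - 5 * m2
a_hat = m1 * (m2 - 5 * m1) / den                 # 3.5
b_hat = (5 - m1) * (m2 - 5 * m1) / den           # 16.5

# posterior of theta_17 given X_17 = x17 is Be(a_hat + x17, b_hat + 5 - x17)
x17, theta0 = 1, 0.2                             # illustrative values (assumptions)
print(a_hat, b_hat)
print(beta.cdf(theta0, a_hat + x17, b_hat + 5 - x17))   # P(theta_17 <= theta0 | X_17 = x17)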

Unlike the von Mises example, the original Robbins formulation of empirical Bayes was nonparametric. In the following example the Bayes rule with respect to an unknown prior is first expressed in terms of the marginal distribution. Nonparametric empirical Bayes then uses the historic data to estimate the marginal distribution in a nonparametric fashion, and the estimated marginal is plugged into the formal Bayes rule.

Example: Assume that X|θ ~ Poi(θ) and that π(θ) is to be specified. The Bayes rule is

δ^π(x) = ∫ θ (θ^x/x!) e^{-θ} π(θ) dθ / ∫ (θ^x/x!) e^{-θ} π(θ) dθ = (x + 1) ∫ (θ^{x+1}/(x+1)!) e^{-θ} π(θ) dθ / ∫ (θ^x/x!) e^{-θ} π(θ) dθ = (x + 1) m_π(x + 1)/m_π(x),

i.e., the Bayes rule depends on the unknown prior only through the marginal (prior predictive) distribution m_π(x).

Let X_i|θ_i ~ Poi(θ_i), i = 1, ..., n + 1, where all the θ_i are parameters having the same distribution. We are interested in estimating θ_{n+1}, using X_{n+1} and X_1, ..., X_n. The estimator (X_1 + ··· + X_{n+1})/(n + 1) cannot be used, since the underlying θ's are different; thus the MLE of θ_{n+1} is simply X_{n+1}. The (nonparametric) empirical Bayes rule replaces m_π by its empirical counterpart,

δ^π(x_{n+1}; x_1, ..., x_n) = (x_{n+1} + 1) m̂_n(x_{n+1} + 1)/m̂_n(x_{n+1}),   where   m̂_n(x) = #{i ≤ n : x_i = x}/n.

The performance of the estimator can be improved if the estimated marginals m̂_n are smoothed. The MATLAB file eb.m demonstrates the NPEB estimator. A description of this m-file and of the simulations in it will be added soon; in the meantime, please take a look at the file, since it is annotated.

Now consider the same setup, X_i|θ_i ~ Poi(θ_i), but this time with the prior distribution of the θ's assumed known up to a hyperparameter: the θ_i have an exponential distribution with density π(θ_i|λ) = λ e^{-λ θ_i} 1(θ_i ≥ 0), λ ∈ R_+. The Negative Binomial is the marginal distribution for a Poisson likelihood and a Gamma prior; when the shape parameter of the Gamma prior equals 1, the Negative Binomial becomes the Geometric distribution (see an earlier handout),

m_π(x_i) = (1/(1 + λ))^{x_i} · λ/(1 + λ).

The MLE of λ based on x_1, ..., x_n is λ̂_MLE = 1/x̄. The Bayes estimator of θ_{n+1} is (x_{n+1} + 1)/(λ + 1), because of the Poisson/Gamma conjugate structure, so the parametric empirical Bayes estimator is

δ(x_{n+1}; x_1, ..., x_n) = (x̄/(x̄ + 1)) (x_{n+1} + 1).

When n → ∞, m̂_n(x) → m_π(x) in probability, and the empirical Bayes rule is consistent, δ^π(x; x_1, ..., x_n) → δ^π(x).
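A small simulation sketch in the spirit of eb.m and of Figure 4 below, comparing the MLE, the nonparametric EB rule and the parametric EB rule; all simulation settings are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n, lam = 200, 0.5                         # illustrative settings (assumptions)
theta = rng.exponential(scale=1.0 / lam, size=n + 1)
x = rng.poisson(theta)                    # x[:n] are historic, x[n] is the current observation

hist, x_new = x[:n], x[n]

def m_hat(t):
    # empirical marginal based on the historic data
    return float(np.mean(hist == t))

mle = x_new
npeb = (x_new + 1) * m_hat(x_new + 1) / max(m_hat(x_new), 1.0 / n)   # Robbins' rule (guarded denominator)
peb = hist.mean() / (hist.mean() + 1.0) * (x_new + 1)                # parametric EB with lambda_hat = 1/xbar

print("MLE:", mle, " NPEB:", round(npeb, 3), " PEB:", round(peb, 3),
      " true theta:", round(theta[n], 3))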

Figure 4: Comparison of MLE, NPEB, and PEB. The circle is the true parameter θ.

Exercise: Let X|θ ~ Geom(1 - θ) with f(x|θ) = θ^x (1 - θ), x = 0, 1, 2, ..., and let the prior π(θ) be unknown. Show that δ^π(x) = m_π(x + 1)/m_π(x). For X|θ ~ NB(n, θ) with f(x|θ) = [Γ(n + x)/(Γ(x + 1)Γ(n))] θ^x (1 - θ)^n, the Bayes rule is δ^π(x) = [(x + 1)/(n + x)] m_π(x + 1)/m_π(x).

James-Stein Estimator and its EB Justification. Consider the estimation of θ in the model X ~ MVN_p(θ, I) under squared error loss L(θ, a) = Σ_i (θ_i - a_i)². For p = 1 and 2 the estimator θ̂ = X is admissible (and unique minimax), i.e., no estimator has uniformly better risk. However, for p ≥ 3, X is neither unique minimax nor admissible. A better estimator is

δ_JS(X) = (1 - (p - 2)/Σ_{i=1}^p X_i²) X,

known as the James-Stein estimator. An empirical Bayes justification of δ_JS(X) is provided next. Suppose that θ has prior distribution θ ~ MVN_p(0, τ² I), where the hyperparameter τ² is not known and will be estimated from the sample, in this case the single observation X. The Bayes rule under squared error loss is

δ_B(X) = (τ²/(1 + τ²)) X = (1 - 1/(1 + τ²)) X.

The marginal (prior predictive) distribution of X is MVN_p(0, (1 + τ²) I). For such X, the random variable

T = Σ_{i=1}^p X_i²/(1 + τ²)

has a χ²_p distribution.

9 has χ p distribution. Then, Thus E T = = = = E + τ p i= X i t Γ(p/ ) Γ(p/) p/ p. = p t p/ e t/ Γ(p/) dt p/ E p p i= X i t p/ e t dt Γ(p/ )p/ = + τ. Therefore, the method-of-moments estimator of +τ is δ EB (X) = p p, which yields an empirical Bayes estimator i= X i ( p ) p X. i= X i.. ML II We already have seen the spirit of ML II method in Parametric Empirical Bayes. The ML II approach was proposed by I. J. Good, a statistician that was in the team who broke German code in the WWII. The idea is to mimic the maximum likelihood estimation at the marginal level: Select a prior π that maximizes m π (x), given the data. Figure 5: I. J. Good, born 96 Exercise: Suppose that Bayesian have chosen a prior π but wants to look at all priors close to π. One such family of priors, close to π is contamination family, Γ = {( ɛ)π + ɛq, q Q}. Suppose that Q is a family of all distributions. Determine empirical Bayes choice. 9

2.1 ML II

We have already seen the spirit of the ML II method in parametric empirical Bayes. The ML II approach was proposed by I. J. Good, a statistician who was on the team that broke the German codes in WWII. The idea is to mimic maximum likelihood estimation at the level of the marginal: select the prior π that maximizes m_π(x), given the data.

Figure 5: I. J. Good, born 1916.

Exercise: Suppose that a Bayesian has chosen a prior π_0 but wants to look at all priors close to π_0. One such family of priors close to π_0 is the ε-contamination family, Γ = {(1 - ε)π_0 + εq, q ∈ Q}. Suppose that Q is the family of all distributions. Determine the empirical Bayes (ML II) choice.

Hint: Observe that for π ∈ Γ the marginal is m_π(x) = (1 - ε) m_0(x) + ε m_q(x). Also, for any model f(x|θ), a weighted average cannot exceed the mode, i.e., ∫ f(x|θ) q(θ) dθ ≤ f(x|θ̂_MLE) = ∫ f(x|θ) δ_{θ̂_MLE}(dθ), where δ_{θ̂_MLE} is the point mass distribution concentrated at θ̂_MLE. Thus the ML II (empirical Bayes) choice is π̂(θ) = (1 - ε) π_0(θ) + ε δ_{θ̂_MLE}(θ).
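As a parametric illustration of the ML II recipe (choose the hyperparameters that maximize the marginal likelihood), the following sketch fits µ and τ² for the normal-normal model that also appears in Exercise 2 below; the data are simulated and purely illustrative.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(1.5, 2.0, size=50)        # illustrative data; marginally x_i ~ N(mu, 1 + tau^2)

def neg_log_marginal(par):
    mu, log_tau2 = par
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.sqrt(1.0 + np.exp(log_tau2))))

res = minimize(neg_log_marginal, x0=np.array([0.0, 0.0]))
mu_hat, tau2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, tau2_hat)                                            # ML II hyperparameters
print(x.mean(), max(np.mean((x - x.mean()) ** 2) - 1.0, 0.0))      # closed-form ML II solution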

Exercises

1. Assume X|θ is exponential E(1/θ) with density f(x|θ) = (1/θ) e^{-x/θ}, x ≥ 0. Let F be the cdf corresponding to f. Assume a prior π(θ) on θ. Let m_π(x) = ∫_Θ f(x|θ) π(θ) dθ be the marginal and M_π(x) = ∫_0^x m_π(t) dt the corresponding c.d.f.
(a) Show that θ = (1 - F(x|θ))/f(x|θ).
(b) Show that the Bayes estimator of θ with respect to π is δ(x) = (1 - M_π(x))/m_π(x). [Hint: You will need a version of Fubini's theorem (Tonelli's theorem) to change the order of integration; Tonelli's theorem allows the change when the integrands are nonnegative.]
(c) Suppose you observe X_i|θ_i ~ E(1/θ_i), i = 1, ..., n + 1. Explain how you would estimate θ_{n+1} in the empirical Bayes fashion, using the result in (b).

2. Assume [X|θ] ~ N(θ, 1) and [θ|µ, τ²] ~ N(µ, τ²).
(a) Find the marginal distribution of X.
(b) What are the moment-matching estimators of µ and τ² if a sample X_i ~ N(θ_i, 1), i = 1, ..., n, is available? [Hint: Find the moments of the marginal and be careful, the estimator of the variance needs to be nonnegative.]
(c) Propose an empirical Bayes estimator of θ based on the considerations in (b).
(d) What modifications in (b) are needed if you use the ML II estimators of µ and τ²?
[Solution sketch for the moment matching: the marginal is N(µ, 1 + τ²), µ̂ = X̄, and the estimator of 1 + τ² is s², so τ̂² = max{0, s² - 1}. For the ML II estimate, τ̂² = max{0, ((n-1)/n) s² - 1}.]

3. If the data X_i ~ f(x_i|θ) cannot be reduced by a sufficient statistic, a so-called pseudo-Bayes approach is possible. Let T be an estimator of θ whose distribution g(t|θ) is known, and let π be the adopted prior. Instead of finding the Bayes rule

δ^π(x_1, ..., x_n) = ∫_Θ θ Π_{i=1}^n f(x_i|θ) π(θ) dθ / ∫_Θ Π_{i=1}^n f(x_i|θ) π(θ) dθ,

one finds the pseudo-Bayes rule

δ_*^π(t) = ∫_Θ θ g(t|θ) π(θ) dθ / ∫_Θ g(t|θ) π(θ) dθ.

Suppose that you have the model X_i ~ N(θ, 1), θ ~ N(µ, τ²). For n large, the distribution of the sample median m = Med(X_1, X_2, ..., X_n) is approximately N(θ, π/(2n)). Write down the pseudo-Bayes estimator of θ based on the median and its normal approximation.

4. Another way to justify the Stein shrinkage estimator is to mimic Exercise 2. Suppose you have the model X ~ MVN_p(θ, I), θ ~ MVN_p(0, τ² I).
(i) Find the marginal distribution of X.
(ii) Show that in (i) the MLE of τ² is τ̂² = Σ_i x_i²/p - 1 if Σ_i x_i² > p, and τ̂² = 0 otherwise.
(iii) Replacing the MLE in the Bayes estimator

δ^π(x) = (τ²/(1 + τ²)) x,

the truncated James-Stein estimator is obtained,

δ_EB(x) = (1 - p/Σ_i x_i²)_+ x,

where (x)_+ = max{x, 0} is the positive part of x.

References

[1] Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, Second Edition. Springer-Verlag, New York.
[2] Maritz, J. S. and Lwin, T. (1989). Empirical Bayes Methods, Second Edition. Chapman and Hall, New York.
[3] Robert, C. (2001). The Bayesian Choice, Second Edition. Springer-Verlag, New York.
[4] von Mises, R. (1943). On the correct use of Bayes' formula. Ann. Math. Statist., 13.

Appendix

Mathematica code for finding the hierarchical Bayes estimator in the example about the political poll:

Integrate[p Gamma[n + 1]/(Gamma[x + 1] Gamma[n - x + 1]) p^x (1 - p)^(n - x)
    (1 - Abs[1 - 2 p])/(4 Log[2] p (1 - p)) /. {n -> 20, x -> 12}, {p, 0, 1}] /
  Integrate[Gamma[n + 1]/(Gamma[x + 1] Gamma[n - x + 1]) p^x (1 - p)^(n - x)
    (1 - Abs[1 - 2 p])/(4 Log[2] p (1 - p)) /. {n -> 20, x -> 12}, {p, 0, 1}] // N
