Predictive Density Estimation for Multiple Regression


Edward I. George and Xinyi Xu

March 2006, Revised January 2007

Abstract

Suppose we observe X ∼ N_m(Aβ, σ²I) and would like to estimate the predictive density p(y|β) of a future Y ∼ N_n(Bβ, σ²I). Evaluating predictive estimates p̂(y|x) by Kullback-Leibler loss, we develop and evaluate Bayes procedures for this problem. We obtain general sufficient conditions for minimaxity and dominance of the noninformative uniform prior Bayes procedure. We extend these results to situations where only a subset of the predictors in A is thought to be potentially irrelevant. We then consider the more realistic situation where there is model uncertainty and this subset is unknown. For this situation we develop multiple shrinkage predictive estimators and obtain general minimaxity and dominance conditions. Finally, we provide an explicit example of a minimax multiple shrinkage predictive estimator based on scaled harmonic priors.

Keywords: BAYESIAN PREDICTION; MODEL UNCERTAINTY; MULTIPLE SHRINKAGE; PRIOR DISTRIBUTION; SHRINKAGE ESTIMATION.

Edward I. George is Professor, Statistics Department, The Wharton School, 3730 Walnut Street 400 JMHH, Philadelphia, PA, edgeorge@wharton.upenn.edu. Xinyi Xu is Assistant Professor, Department of Statistics, The Ohio State University, Columbus, OH, xinyi@stat.ohio-state.edu. We would like to acknowledge Larry Brown, Feng Liang, Linda Zhao and three referees for their helpful suggestions. This work was supported by various NSF grants, DMS the most recent.

1 Introduction

We begin with the canonical normal linear model setup

X ∼ N_m(Aβ, σ²I),   (1)

where X is an m-vector of observations, A is a fixed, full rank m × p matrix of p potential predictors with m ≥ p, and β is a p-vector of unknown regression coefficients. Based on observing X = x, we consider the problem of estimating the predictive density p(y|β) of a future n-vector Y, where

Y ∼ N_n(Bβ, σ²I).   (2)

Here B is a fixed n × p matrix of the same p potential predictors in A, although with possibly different values. We also assume that X and Y are conditionally independent given β. Finally, we assume σ² is known, and without loss of generality set σ² = 1 throughout.

For each value of x, we evaluate a predictive estimate p̂(y|x) of p(y|β) by the well-known Kullback-Leibler (KL) loss

L(β, p̂(·|x)) = ∫ p(y|β) log [p(y|β) / p̂(y|x)] dy.   (3)

The overall quality of the procedure p̂ = p̂(y|x) for each β is then conveniently summarized by the KL risk

R_KL(β, p̂) = ∫ p(x|β) L(β, p̂(·|x)) dx.   (4)

Letting β̂_x = (A′A)⁻¹A′x be the traditional least squares estimate of β based on x, it is tempting to consider the plug-in predictive estimate p̂_plug-in(y|β̂_x), which simply substitutes β̂_x for β in p(y|β). However, as we show in Section 2 by extending the arguments of Aitchison (1975), the formal Bayes predictive estimate

p̂_U(y|x) = ∫ p(x|β) p(y|β) dβ / ∫ p(x|β) dβ   (5)

has smaller KL risk than p̂_plug-in(y|β̂_x) for every β. Thus, p̂_plug-in(y|β̂_x) should be ruled out and we turn our focus to Bayes procedures.
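To fix ideas, here is a small numerical sketch, not part of the original paper, of the setup (1)-(2) and of the KL risk (4) of the plug-in estimate computed by Monte Carlo; the dimensions, designs, seed and the helper name kl_risk_plug_in are all illustrative choices.

    # Illustrative sketch of model (1)-(2) and the plug-in KL risk (4); all settings are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p = 20, 5, 3
    A = rng.standard_normal((m, p))      # m x p design for the observed X
    B = rng.standard_normal((n, p))      # n x p design for the future Y
    beta = np.array([1.0, -0.5, 2.0])
    Sigma_A = np.linalg.inv(A.T @ A)

    def kl_risk_plug_in(num_rep=20000):
        """Monte Carlo estimate of the KL risk of the plug-in density N_n(B beta_hat_x, I)."""
        losses = np.empty(num_rep)
        for r in range(num_rep):
            x = A @ beta + rng.standard_normal(m)
            beta_hat_x = Sigma_A @ A.T @ x                # least squares estimate
            # KL( N_n(B beta, I) || N_n(B beta_hat_x, I) ) = ||B(beta - beta_hat_x)||^2 / 2
            losses[r] = 0.5 * np.sum((B @ (beta - beta_hat_x)) ** 2)
        return losses.mean()

    # The Monte Carlo value matches the closed form trace(B Sigma_A B')/2 derived in Section 2.
    print(kl_risk_plug_in(), 0.5 * np.trace(B @ Sigma_A @ B.T))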

For a prior π on β, the Bayes predictive estimator p̂_π(y|x) is given by

p̂_π(y|x) = ∫ p(x|β) p(y|β) π(β) dβ / ∫ p(x|β) π(β) dβ.   (6)

It also follows from the arguments of Aitchison (1975) that for proper π, p̂_π minimizes the average risk r_π(p̂) = ∫ R_KL(β, p̂) π(β) dβ. Note that p̂_U in (5) is the formal Bayes estimate under the improper uniform noninformative density π_U(β), and would seem to be a good default procedure. Indeed, p̂_U has constant risk and is minimax under KL loss; see Liang (2002) and Liang and Barron (2004). But surprisingly, as we will show, in many cases p̂_U itself can be uniformly dominated in terms of KL risk by other Bayes predictive estimators.

In Section 2, we develop general conditions under which p̂_π will be minimax and uniformly dominate p̂_U in terms of the KL risk (4) for the multiple regression prediction problem. Our results can be seen as a substantial generalization of George, Liang and Xu (2006), who considered the special case of this problem when X ∼ N_m(µ, σ_x²I) and Y ∼ N_m(µ, σ_y²I), where µ is the common unknown multivariate normal mean. Moving further away from this common mean setup, we proceed in Section 3 to extend these results to the setting where only a subset of the p predictors is considered to be potentially irrelevant. In Section 4, we consider the more realistic model uncertainty setting where such a subset is unknown, and develop minimax multiple shrinkage predictive densities that adaptively shrink towards the model most favored by the data. In Section 5, we conclude by showing how our results can be extended for minimax shrinkage prediction towards any linear subspace. Although we do not consider the issue of admissibility in this paper, it may be of interest to note that for the multivariate normal prediction problem above, Brown, George and Xu (2006) recently established that all admissible predictive densities are Bayes procedures.

2 Priors for Minimax Predictive Estimation

In this section, we develop general conditions on π for p̂_π in (6) to uniformly dominate p̂_U in (5) under the KL risk (4). The minimaxity of such p̂_π will then follow immediately from the minimaxity of p̂_U.

We begin by establishing some convenient notation. As indicated previously, we use β̂_x = (A′A)⁻¹A′x to denote the least squares estimate of β based on x. Although y is not observed, it will be useful to use

β̂_{x,y} = (C′C)⁻¹ C′ (x′, y′)′,   where C = (A′, B′)′,   (7)

to denote the least squares estimate of β based on x and y. Note that β̂_x ∼ N_p(β, Σ_A) and β̂_{x,y} ∼ N_p(β, Σ_C), where for notational convenience throughout we let Σ_A = (A′A)⁻¹ and Σ_C = (C′C)⁻¹. It will also be useful to let

R_x = ‖x − Aβ̂_x‖²   and   R_{x,y} = ‖(x′, y′)′ − Cβ̂_{x,y}‖²

denote the corresponding residual sums of squares (RSS). In terms of this notation, we have the following.

Lemma 1. The uniform prior predictive estimate p̂_U in (5) can be expressed as

p̂_U(y|x) = (2π)^{−n/2} (|C′C| / |A′A|)^{−1/2} exp{−(R_{x,y} − R_x)/2}
         = (2π)^{−n/2} |Ψ|^{−1/2} exp{−(y − Bβ̂_x)′ Ψ⁻¹ (y − Bβ̂_x)/2},   (8)

where Ψ = I + BΣ_A B′. Moreover, the KL risk of p̂_U is uniformly smaller than that of the plug-in estimator p̂_plug-in(y|β̂_x).

Proof. Since β̂_x | β ∼ N_p(β, Σ_A), the posterior of β under the uniform prior is β | β̂_x ∼ N_p(β̂_x, Σ_A). It follows that the posterior of Bβ is Bβ | β̂_x ∼ N_n(Bβ̂_x, BΣ_A B′), and thus the predictive distribution is Y | β̂_x ∼ N_n(Bβ̂_x, I + BΣ_A B′), which yields (8).

To calculate the risk of p̂_U, let Ĥ_A = A(A′A)⁻¹A′ denote the hat matrix based on x and Ĥ_C = C(C′C)⁻¹C′ denote the hat matrix based on both x and y. It is easy to see that

R_KL(β, p̂_U) = ∫∫ p(x|β) p(y|β) log [p(y|β) / p̂_U(y|x)] dx dy
= (1/2) log(|C′C| / |A′A|) − n/2 + (1/2) ∫∫ p(x|β) p(y|β) [R_{x,y} − R_x] dx dy
= (1/2) log(|C′C| / |A′A|) − n/2 + (1/2) [trace(I_{m+n} − Ĥ_C) − trace(I_m − Ĥ_A)]
= (1/2) log(|C′C| / |A′A|) − n/2 + n/2
= (1/2) Σ_{i=1}^p log(e_i + 1),

where e_1, ..., e_p are the eigenvalues of (A′A)⁻¹B′B. Moreover,

R_KL(β, p̂_plug-in(y|β̂_x)) = ∫∫ p(x|β) p(y|β) log [p(y|β) / p̂_plug-in(y|β̂_x)] dx dy
= (1/2) ∫∫ p(x|β) p(y|β) [‖y − Bβ̂_x‖² − ‖y − Bβ‖²] dx dy
= (1/2) ∫ p(x|β) ‖Bβ̂_x − Bβ‖² dx
= (1/2) trace(B(A′A)⁻¹B′) = (1/2) Σ_{i=1}^p e_i.

That p̂_U dominates p̂_plug-in(y|β̂_x) then follows from the fact that log(x + 1) < x for any x > 0.
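As an illustration (not from the paper), the two closed-form risks obtained in the proof above can be compared numerically from the eigenvalues e_i of (A′A)⁻¹B′B; the designs and the helper name risk_comparison are arbitrary.

    # Sketch: R_KL(p_U) = (1/2) sum log(e_i + 1)  versus  R_KL(plug-in) = (1/2) sum e_i.
    import numpy as np

    def risk_comparison(A, B):
        e = np.linalg.eigvals(np.linalg.solve(A.T @ A, B.T @ B)).real   # eigenvalues of (A'A)^{-1} B'B
        return 0.5 * np.sum(np.log(e + 1.0)), 0.5 * np.sum(e)

    rng = np.random.default_rng(1)
    A = rng.standard_normal((20, 3))
    B = rng.standard_normal((5, 3))
    risk_U, risk_plug_in = risk_comparison(A, B)
    print(risk_U, risk_plug_in)    # risk_U < risk_plug_in since log(e + 1) < e for e > 0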

Risk comparisons of a Bayes predictive density p̂_π with p̂_U are greatly facilitated by the following representation of p̂_π in terms of p̂_U. An analogous representation of the posterior mean in terms of the MLE, which simplifies multivariate normal mean estimation under quadratic risk, was proposed by Brown (1971). For our representation, it will be useful to denote the marginal distribution of Z | β ∼ N_p(β, Σ) under π by

m_π(z; Σ) = ∫ p(z|β) π(β) dβ.   (9)

Thus, the marginal distributions of β̂_x and β̂_{x,y} under π are denoted by m_π(β̂_x; Σ_A) and m_π(β̂_{x,y}; Σ_C) respectively.

Lemma 2. If m_π(z; Σ) is finite for all z and Σ, then p̂_π(y|x) is a proper probability distribution. Furthermore, it can be expressed as

p̂_π(y|x) = [m_π(β̂_{x,y}; Σ_C) / m_π(β̂_x; Σ_A)] p̂_U(y|x),   (10)

where p̂_U is defined by (8).

Proof. When m_π(z; Σ) is finite for all z and Σ, that p̂_π(y|x) is a proper probability distribution follows from integrating with respect to y and switching the order of integration. Next, straightforward calculation yields

∫ p(x|β) π(β) dβ = ∫ (2π)^{−m/2} exp{−‖x − Aβ‖²/2} π(β) dβ
= (2π)^{−m/2} exp{−R_x/2} ∫ exp{−‖Aβ̂_x − Aβ‖²/2} π(β) dβ
= (2π)^{−(m−p)/2} |A′A|^{−1/2} exp{−R_x/2} m_π(β̂_x; Σ_A).   (11)

Similarly, we obtain

∫ p(x|β) p(y|β) π(β) dβ = (2π)^{−(m+n−p)/2} |C′C|^{−1/2} exp{−R_{x,y}/2} m_π(β̂_{x,y}; Σ_C).   (12)

The representation (10) follows immediately from (6), (11) and (12).
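The representation (10) is easy to compute whenever the marginal m_π has a closed form. The sketch below (not from the paper) uses a conjugate prior β ∼ N_p(0, τ²I), for which m_π(z; Σ) is simply the N_p(0, Σ + τ²I) density; the normal prior, the value of τ² and the function name p_pi are illustrative choices, not the harmonic priors studied later.

    # Sketch of the Lemma 2 representation (10) under an illustrative conjugate prior N_p(0, tau2 I).
    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def p_pi(y, x, A, B, tau2=4.0):
        Sigma_A = np.linalg.inv(A.T @ A)
        C = np.vstack([A, B])
        Sigma_C = np.linalg.inv(C.T @ C)
        beta_x = Sigma_A @ A.T @ x                         # least squares from x
        beta_xy = Sigma_C @ C.T @ np.concatenate([x, y])   # least squares from (x, y)
        p = A.shape[1]
        # uniform-prior predictive density (8): Y | x ~ N_n(B beta_x, I + B Sigma_A B')
        p_U = mvn(mean=B @ beta_x, cov=np.eye(B.shape[0]) + B @ Sigma_A @ B.T).pdf(y)
        m_num = mvn(mean=np.zeros(p), cov=Sigma_C + tau2 * np.eye(p)).pdf(beta_xy)
        m_den = mvn(mean=np.zeros(p), cov=Sigma_A + tau2 * np.eye(p)).pdf(beta_x)
        return (m_num / m_den) * p_U                       # marginal ratio times p_U, as in (10)

    rng = np.random.default_rng(2)
    A = rng.standard_normal((15, 3)); B = rng.standard_normal((4, 3))
    beta = np.array([0.3, -0.2, 0.1])
    x = A @ beta + rng.standard_normal(15)
    y = B @ beta + rng.standard_normal(4)
    print(p_pi(y, x, A, B))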

The next result provides a representation of the difference between the KL risks of p̂_U and p̂_π in terms of the marginal distributions of β̂_x and β̂_{x,y}.

Lemma 3. The difference between the KL risks of p̂_U and p̂_π is given by

R_KL(β, p̂_U) − R_KL(β, p̂_π) = E_{β,Σ_C} log m_π(β̂_{x,y}; Σ_C) − E_{β,Σ_A} log m_π(β̂_x; Σ_A),   (13)

where E_{β,Σ}(·) stands for expectation with respect to the N_p(β, Σ) distribution.

Proof. The KL risk difference between p̂_U and p̂_π can be expressed as

R_KL(β, p̂_U) − R_KL(β, p̂_π) = ∫∫ p(x|β) p(y|β) log [p̂_π(y|x) / p̂_U(y|x)] dx dy
= ∫∫ p(x|β) p(y|β) log [m_π(β̂_{x,y}; Σ_C) / m_π(β̂_x; Σ_A)] dx dy,

where the last equality follows from Lemma 2. The result then follows from the change of variable theorem.

To exploit the representation (13), we proceed to transform the distributions to canonical form. Since Σ_A and Σ_C are both symmetric and positive definite, there exists an invertible p × p matrix W such that

Σ_A = WW′   and   Σ_C = WΣ_D W′,   (14)

where

Σ_D = diag(d_1, ..., d_p).   (15)

Since Σ_C⁻¹ = C′C = A′A + B′B and B′B is nonnegative definite, d_i ∈ (0, 1] for all 1 ≤ i ≤ p, with at least one d_i < 1. Finally, let µ = W⁻¹β, µ̂_x = W⁻¹β̂_x, and µ̂_{x,y} = W⁻¹β̂_{x,y}, so that

µ̂_x ∼ N_p(µ, I)   and   µ̂_{x,y} ∼ N_p(µ, Σ_D).   (16)

Lemma 4. Let π_W(µ) = π(Wµ). Then, the difference between the KL risks of p̂_U and p̂_π is given by

R_KL(β, p̂_U) − R_KL(β, p̂_π) = E_{µ,Σ_D} log m_{π_W}(µ̂_{x,y}; Σ_D) − E_{µ,I} log m_{π_W}(µ̂_x; I),   (17)

where E_{µ,Σ}(·) stands for expectation with respect to the N_p(µ, Σ) distribution.

Proof. The result follows by transforming the expressions in Lemma 3,

E_{β,Σ_A} log m_π(β̂_x; Σ_A) = ∫ p(β̂_x|β) log [∫ p(β̂_x|b) π(b) db] dβ̂_x
= ∫ p(µ̂_x|µ) log [∫ p(µ̂_x|t) π_W(t) dt] dµ̂_x
= E_{µ,I} log m_{π_W}(µ̂_x; I).

Similarly, E_{β,Σ_C} log m_π(β̂_{x,y}; Σ_C) = E_{µ,Σ_D} log m_{π_W}(µ̂_{x,y}; Σ_D). Thus, (17) equals (13).
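A matrix W satisfying (14) can be computed by a standard simultaneous diagonalization. The following sketch is an assumed construction, not from the paper, building one such W from a Cholesky factor of Σ_A and an eigendecomposition; the function name canonical_form is invented for the illustration.

    # Sketch: find W with Sigma_A = W W' and Sigma_C = W diag(d) W', d_i in (0, 1].
    import numpy as np

    def canonical_form(A, B):
        Sigma_A = np.linalg.inv(A.T @ A)
        Sigma_C = np.linalg.inv(A.T @ A + B.T @ B)
        L = np.linalg.cholesky(Sigma_A)                          # Sigma_A = L L'
        M = np.linalg.solve(L, np.linalg.solve(L, Sigma_C).T).T  # M = L^{-1} Sigma_C L^{-T}
        d, Q = np.linalg.eigh(M)                                 # M = Q diag(d) Q'
        return L @ Q, d                                          # W = L Q

    rng = np.random.default_rng(3)
    A = rng.standard_normal((20, 3)); B = rng.standard_normal((5, 3))
    W, d = canonical_form(A, B)
    Sigma_A = np.linalg.inv(A.T @ A); Sigma_C = np.linalg.inv(A.T @ A + B.T @ B)
    print(np.allclose(W @ W.T, Sigma_A), np.allclose(W @ np.diag(d) @ W.T, Sigma_C), d)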

We now proceed to find conditions on m_π for which the risk difference (17) is nonnegative for all µ. Because p̂_U is minimax, this will then imply that p̂_π is minimax under the prior π corresponding to π_W. Now for w ∈ [0, 1], let

V_w = wI + (1 − w)Σ_D,   (18)

where Σ_D is defined as in (15). Next, for Z ∼ N_p(µ, V_w), let

h_µ(V_w) = E_{µ,V_w} log m_{π_W}(Z; V_w).   (19)

Thus, we may rewrite (17) as

R_KL(β, p̂_U) − R_KL(β, p̂_π) = h_µ(V_0) − h_µ(V_1).   (20)

Since h_µ(V_w) is continuous in w, it suffices to derive conditions on m_π such that (∂/∂w) h_µ(V_w) ≤ 0 for all µ and w ∈ [0, 1]. Letting v_i be the ith diagonal element of V_w, we have by the chain rule

∂h_µ/∂w = Σ_{i=1}^p (∂h_µ/∂v_i)(∂v_i/∂w) = Σ_{i=1}^p (1 − d_i) ∂h_µ/∂v_i.   (21)

The following result provides unbiased estimates of the components of (21) which, when combined with (17), will be seen to play a key role in establishing sufficient conditions on m_π for p̂_π to be minimax and to dominate p̂_U. As noted by George, Liang and Xu (2006), these estimates are very similar to the unbiased estimates of risk for the estimation of a multivariate mean under squared error loss; see Stein (1974, 1981).

Lemma 5. If m_{π_W}(z; I) is finite for all z, then for any 0 ≤ w ≤ 1, m_{π_W}(z; V_w) is finite. Moreover,

∂h_µ/∂v_i = E_{µ,V_w} [ ∂²_{z_i} m_{π_W}(Z; V_w) / m_{π_W}(Z; V_w) − (1/2) {∂_{z_i} log m_{π_W}(Z; V_w)}² ]   (22)
          = 2 E_{µ,V_w} [ ∂²_{z_i} √(m_{π_W}(Z; V_w)) / √(m_{π_W}(Z; V_w)) ].   (23)

Proof. When m_{π_W}(z; I) is finite for all z, it is easy to check that for any fixed z and any 0 ≤ w ≤ 1,

m_{π_W}(z; V_w) ≤ (Π_{i=1}^p d_i)^{−1/2} m_{π_W}(z; I) < ∞.

Next, letting Z* = V_w^{−1/2}(Z − µ) ∼ N(0, I), we have

∂h_µ/∂v_i = (∂/∂v_i) E log m_{π_W}(V_w^{1/2} Z* + µ; V_w)
= E [ (∂/∂v_i) m_{π_W}(V_w^{1/2} Z* + µ; V_w) / m_{π_W}(V_w^{1/2} Z* + µ; V_w) ],   (24)

where, with z = V_w^{1/2} z* + µ and differentiating under the integral sign,

(∂/∂v_i) m_{π_W}(V_w^{1/2} z* + µ; V_w)
= ∫ [ (z_i − µ̃_i)²/(2v_i²) − 1/(2v_i) − (z_i − µ_i)(z_i − µ̃_i)/(2v_i²) ] p(z|µ̃) π_W(µ̃) dµ̃,

with p(z|µ̃) the N_p(µ̃, V_w) density. Making use of the well-known univariate heat equation

(∂/∂v_i) m_{π_W}(z; V_w) = (1/2) ∂²_{z_i} m_{π_W}(z; V_w),   (25)

see for example Steele (2001), and the Brown (1971) representation E_π(µ_i | z) = z_i + v_i ∂_{z_i} log m_{π_W}(z; V_w), (22) and (23) can be verified via the same steps as in the proof of Lemma 3 in George, Liang and Xu (2006).
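The heat equation (25) used above is easy to check numerically. The finite-difference sketch below (illustrative only, not from the paper) verifies it for a single normal kernel, a property that m_{π_W}(z; V_w) inherits coordinatewise by mixing over the prior.

    # Finite-difference check of (d/dv) phi(z; mu, v) = (1/2) (d^2/dz^2) phi(z; mu, v).
    import numpy as np
    from scipy.stats import norm

    def phi(z, v, mu=0.3):
        return norm.pdf(z, loc=mu, scale=np.sqrt(v))

    z, v = 1.2, 0.7
    dv = (phi(z, v + 1e-6) - phi(z, v - 1e-6)) / 2e-6
    dzz = (phi(z + 1e-4, v) - 2 * phi(z, v) + phi(z - 1e-4, v)) / 1e-8
    print(dv, 0.5 * dzz)    # the two values agree to several digits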

Now we can obtain sufficient conditions for a Bayes procedure p̂_π to be minimax by combining (20), (21) and Lemma 5. The following result provides a substantial generalization of Theorem 1 of George, Liang and Xu (2006).

Theorem 1. Suppose m_π(z; WW′) is finite for all z, with the invertible matrix W defined as in (14). Let H(f(z_1, ..., z_p)) be the Hessian matrix of f.

(i) If trace{H(m_π(z; WV_wW′))[Σ_A − Σ_C]} ≤ 0 for all w ∈ [0, 1], then p̂_π is minimax under R_KL. Furthermore, p̂_π dominates p̂_U unless π = π_U.

(ii) If trace{H(√(m_π(z; WV_wW′)))[Σ_A − Σ_C]} ≤ 0 for all w ∈ [0, 1], then p̂_π is minimax under R_KL. Furthermore, p̂_π dominates p̂_U if for all w ∈ [0, 1], this inequality is strict on a set of positive Lebesgue measure.

Proof. To prove the minimaxity of p̂_π under R_KL, it suffices to show that (22) or (23) is nonpositive, because by (21) that would imply the nonnegativity of (20). Dominance would further follow by showing that (22) or (23) is also strictly negative on a set of positive Lebesgue measure. Noting that m_{π_W}(z; V_w) = m_π(Wz; WV_wW′), and letting z̃ = Wz, we obtain

Σ_{i=1}^p (1 − d_i) ∂²_{z_i} m_{π_W}(z; V_w) = Σ_{i=1}^p (1 − d_i) ∂²_{z_i} m_π(Wz; WV_wW′)
= Σ_{i=1}^p (1 − d_i) Σ_{j=1}^p Σ_{k=1}^p W_{ji} W_{ki} ∂²_{z̃_j z̃_k} m_π(z̃; WV_wW′)
= trace{(I − Σ_D) W′ H(m_π(z̃; WV_wW′)) W}
= trace{H(m_π(z̃; WV_wW′)) W(I − Σ_D)W′}
= trace{H(m_π(z̃; WV_wW′))[Σ_A − Σ_C]}.   (26)

Similarly,

Σ_{i=1}^p (1 − d_i) ∂²_{z_i} √(m_{π_W}(z; V_w)) = trace{H(√(m_π(z̃; WV_wW′)))[Σ_A − Σ_C]}.   (27)

Now (i) and (ii) follow immediately from (22), (23), (26) and (27).

The next result follows using the fact that ∂²_{z_i} m_{π_W}(z; V_w) ≤ 0 whenever ∂²_{µ_i} π_W(µ) ≤ 0.

Corollary 1. Suppose m_π(z; WW′) is finite for all z. Then p̂_π will be minimax if trace{H(π(β))[Σ_A − Σ_C]} ≤ 0 a.e. Furthermore, p̂_π will dominate p̂_U unless π = π_U.

Example 1. (Scaled harmonic prior). Suppose A = B. In this case, Σ_C = Σ_A/2, so that

trace{H(π(β))[Σ_A − Σ_C]} = (1/2) trace{H(π(β)) Σ_A} = (1/2) trace{H(π(β)) WW′} = (1/2) Δ_µ π_W(µ).   (28)

Let π_W(µ) = ‖µ‖^{−(p−2)} when p ≥ 3, and π_W(µ) ≡ 1 when p < 3. Note that π_W is harmonic away from the origin, so that Δπ_W(µ) ≤ 0, and is not equal to π_U when p ≥ 3. For p ≥ 3, the corresponding prior on β is a scaled harmonic prior

π(β) ∝ ‖W⁻¹β‖^{−(p−2)} = ‖diag(η_1, ..., η_p)^{−1/2} O′β‖^{−(p−2)},   (29)

where η_1, ..., η_p > 0 are the eigenvalues of Σ_A, and for p < 3, π(β) ≡ 1. (The expression (29) is obtained using the fact that there exists an orthonormal matrix O such that W = O diag(η_1, ..., η_p)^{1/2} O′.) By Corollary 1 and (28), the predictive estimator p̂_π under this prior is minimax and dominates p̂_U when p ≥ 3. It is easy to check that these results hold when A = rB for any known constant r.
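A quick numerical sanity check (illustrative, not from the paper) of the property behind Example 1: π_W(µ) = ‖µ‖^{−(p−2)} is harmonic away from the origin, so its Laplacian vanishes there. The scaled prior in (29) is then evaluated through W⁻¹β; the function names and the reuse of the canonical_form helper from the earlier sketch are assumptions of the illustration.

    # Check that the Laplacian of pi_W(mu) = ||mu||^{-(p-2)} is (numerically) 0 away from the origin.
    import numpy as np

    p = 5

    def pi_W(mu):
        return np.sum(mu ** 2) ** (-(p - 2) / 2.0)

    def laplacian(f, mu, h=1e-4):
        val = 0.0
        for i in range(p):
            e = np.zeros(p); e[i] = h
            val += (f(mu + e) - 2 * f(mu) + f(mu - e)) / h ** 2
        return val

    mu = np.array([0.8, -1.1, 0.4, 2.0, -0.3])
    print(laplacian(pi_W, mu))      # approximately 0, as expected for a harmonic function

    def scaled_harmonic_prior(beta, W):
        """Unnormalized scaled harmonic prior (29): pi(beta) proportional to ||W^{-1} beta||^{-(p-2)}."""
        return pi_W(np.linalg.solve(W, beta))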

3 Predictive Density Estimation Near Subset Models

When a prior centered at 0 such as the scaled harmonic prior (29) is applied to β, the risk reduction of p̂_π over p̂_U is greatest when all the components of β are close to 0. Thus, it would be sensible to use this prior if it was felt that all p predictors in A and B were potentially irrelevant. However, such a prior would be ineffectual if only a subset of the p predictors were irrelevant, in other words if only a subset of the β components were close to 0. In this section, we extend our results to the setting where such a subset is known. This will set the stage for Section 4, where we develop new results for the more realistic model uncertainty setting where such a subset is unknown.

Let S be the subset of {1, ..., p} corresponding to the indices of the potentially irrelevant predictors, and let q_S = |S| be the number of elements in S. Let β_S be the subvector of β corresponding to the columns of A indexed by S. Similarly, let β̂_{S,x} and β̂_{S,x,y} be the subvectors of β̂_x and β̂_{x,y} respectively, corresponding to β_S. Finally, for notational convenience, let Σ_{A,S} and Σ_{C,S} be the submatrices of Σ_A and Σ_C respectively, which are the covariance matrices of β̂_{S,x} and β̂_{S,x,y}.

When only the elements of β_S are thought to be close to zero, it would be sensible to consider a prior which is uniform on β_S̄, where S̄ is the complement of S. We denote such a prior by π^S, and let π_S be the restriction of π^S to β_S so that

π^S(β) = π_S(β_S)   (30)

is a function of β_S only. To exploit the possibility that β_S is close to zero, π_S would then be centered around 0.

Lemma 6. If m_{π_S}(z; Σ) is finite for all z and Σ, then p̂_{π^S}(y|x) is a proper probability distribution. Furthermore, it can be expressed as

p̂_{π^S}(y|x) = [m_{π_S}(β̂_{S,x,y}; Σ_{C,S}) / m_{π_S}(β̂_{S,x}; Σ_{A,S})] p̂_U(y|x),   (31)

where p̂_U is defined by (8).

Proof. The first assertion was proved in Lemma 2. Next, proceeding as in the derivation of (11), we obtain

∫ p(x|β) π^S(β) dβ = (2π)^{−(m−p)/2} |A′A|^{−1/2} exp{−R_x/2} m_{π_S}(β̂_{S,x}; Σ_{A,S}).   (32)

Similarly, we obtain

∫ p(x|β) p(y|β) π^S(β) dβ = (2π)^{−(m+n−p)/2} |C′C|^{−1/2} exp{−R_{x,y}/2} m_{π_S}(β̂_{S,x,y}; Σ_{C,S}).   (33)

The representation (31) follows immediately from (6), (32) and (33).

The following results provide sufficient conditions for the minimaxity of p̂_{π^S} and for its dominance over p̂_U. We omit the proofs, which are obtained using the same arguments leading to Theorem 1 and Corollary 1. Analogously to our previous development there, we let W_S be an invertible q_S × q_S matrix such that Σ_{A,S} = W_S W_S′ and Σ_{C,S} = W_S Σ_{D,S} W_S′, where Σ_{D,S} = diag(d_1, ..., d_{q_S}) as in (15). Finally, let V_{S,w} = wI + (1 − w)Σ_{D,S} as in (18).

Theorem 2. Suppose m_{π_S}(z; W_S W_S′) is finite for all z. Let H(f(z_1, ..., z_{q_S})) be the Hessian matrix of f.

(i) If trace{H(m_{π_S}(z; W_S V_{S,w} W_S′))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1], then p̂_{π^S} is minimax under R_KL. Furthermore, p̂_{π^S} dominates p̂_U unless π^S = π_U.

(ii) If trace{H(√(m_{π_S}(z; W_S V_{S,w} W_S′)))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1], then p̂_{π^S} is minimax under R_KL. Furthermore, p̂_{π^S} dominates p̂_U if for all w ∈ [0, 1], this inequality is strict on a set of positive Lebesgue measure.

Corollary 2. Suppose m_{π_S}(z; W_S W_S′) is finite for all z. Then p̂_{π^S} will be minimax if trace{H(π_S(β_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 a.e. Furthermore, p̂_{π^S} will dominate p̂_U unless π^S = π_U.

Example 1 (continued). (Scaled harmonic prior). Suppose A = B so that, as in (28),

trace{H(π_S(β_S))[Σ_{A,S} − Σ_{C,S}]} = (1/2) Δ_{µ_S} π_{W_S}(µ_S),   (34)

where µ_S = W_S⁻¹ β_S. Here let π_{W_S}(µ_S) = ‖µ_S‖^{−(q_S−2)} when q_S ≥ 3, and π_{W_S}(µ_S) ≡ 1 when q_S < 3. As before, π_{W_S} is harmonic away from the origin, so that Δπ_{W_S}(µ_S) ≤ 0, and is not equal to π_U when q_S ≥ 3. For q_S ≥ 3, the corresponding scaled harmonic prior on β is

π^S(β) = π_S(β_S) ∝ ‖W_S⁻¹ β_S‖^{−(q_S−2)} = ‖diag(η_1, ..., η_{q_S})^{−1/2} O_S′ β_S‖^{−(q_S−2)},   (35)

where η_1, ..., η_{q_S} > 0 are the eigenvalues of Σ_{A,S}, and for q_S < 3, π^S(β) ≡ 1. By Corollary 2 and (34), p̂_{π^S} here is minimax and dominates p̂_U when q_S ≥ 3.

4 Minimax Multiple Shrinkage Predictive Estimation

We now consider the more realistic model uncertainty setting where there is uncertainty about which subset of the p predictors in A and B should be included in the model. For each choice of S, we have obtained general sufficient conditions for p̂_{π^S} to be minimax and to dominate p̂_U. However, such a p̂_{π^S} will only offer meaningful risk reduction when β is near the region where π^S is largest. For example, under the scaled harmonic prior in (35), such risk reduction occurs when β_S is close to 0. The difficulty then is how to proceed when the subset of irrelevant predictors indexed by S is unknown.

Rather than arbitrarily selecting S, an attractive alternative is to use a multiple shrinkage predictive estimator which uses the data to adaptively emulate the most effective p̂_{π^S}. The multiple shrinkage procedure here is obtained by using a finite mixture of the contemplated priors. A similar multiple shrinkage construction for parameter estimation under squared error loss was proposed and developed by George (1986a,b,c).

Let Ω be the set of all the subsets under consideration, possibly even the set of all possible subsets. For each S ∈ Ω, let π^S be the designated prior of the form (30) on β, and assign it probability w_S ∈ [0, 1] such that Σ_{S∈Ω} w_S = 1. Thus we construct the mixture prior

π_∗(β) = Σ_{S∈Ω} w_S π^S(β).   (36)

This prior yields the multiple shrinkage predictive estimator

p̂_∗(y|x) = Σ_{S∈Ω} p̂(S|x) p̂_{π^S}(y|x).   (37)

Here each p̂_{π^S} is given by (31) in Lemma 6, and each posterior probability is of the form

p̂(S|x) = w_S m_{π_S}(β̂_{S,x}; Σ_{A,S}) / Σ_{S′∈Ω} w_{S′} m_{π_{S′}}(β̂_{S′,x}; Σ_{A,S′}),   (38)

which follows from (32). The form (37) reveals p̂_∗(y|x) to be an adaptive convex combination of the individual shrinkage predictive estimates p̂_{π^S}. Note that through p̂(S|x), p̂_∗ doubly shrinks p̂_U(y|x) by putting more weight on the p̂_{π^S} for which m_{π_S} is largest and hence p̂_{π^S} is shrinking most. Thus, we expect p̂_∗ to offer meaningful risk reduction whenever β is near the region where π^S is largest for any S ∈ Ω. For example, if every π^S in π_∗ is one of the scaled harmonic priors in (35), such risk reduction occurs when β_S is close to 0 for any S ∈ Ω for which q_S ≥ 3. Thus, the potential for risk reduction using p̂_∗ is far greater than the risk reduction using an arbitrarily chosen p̂_{π^S}.

We should also note that the allocation of risk reduction by p̂_∗ is in part determined by the w_S weights in (38). Because each p̂(S|x) is so adaptive through m_{π_S}, choosing the weights to be uniform should be adequate. However, one may also want to consider some of the more refined suggestions for choosing such weights for the multiple shrinkage estimators in George (1986b).
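To illustrate how the posterior weights (38) adapt to the data, the sketch below (not from the paper) replaces the scaled harmonic priors by conjugate normal priors N(0, τ²I) on β_S, so that each marginal m_{π_S} has a closed form; the design, coefficients, τ², uniform prior weights and the function name model_weights are all illustrative choices.

    # Sketch of the multiple shrinkage weights (38) with illustrative conjugate priors on beta_S.
    import numpy as np
    from itertools import combinations
    from scipy.stats import multivariate_normal as mvn

    def model_weights(x, A, subsets, tau2=1.0):
        Sigma_A = np.linalg.inv(A.T @ A)
        beta_x = Sigma_A @ A.T @ x
        w_prior = 1.0 / len(subsets)                     # uniform prior weights w_S
        scores = []
        for S in subsets:
            S = list(S)
            m_S = mvn(mean=np.zeros(len(S)),
                      cov=Sigma_A[np.ix_(S, S)] + tau2 * np.eye(len(S))).pdf(beta_x[S])
            scores.append(w_prior * m_S)                 # w_S * m_{pi_S}(beta_hat_{S,x}; Sigma_{A,S})
        scores = np.array(scores)
        return scores / scores.sum()                     # posterior weights p_hat(S | x)

    rng = np.random.default_rng(4)
    A = rng.standard_normal((30, 4))
    beta = np.array([0.0, 0.0, 4.0, -3.0])               # the first two coefficients are irrelevant
    x = A @ beta + rng.standard_normal(30)
    subsets = [S for r in range(1, 5) for S in combinations(range(4), r)]
    weights = model_weights(x, A, subsets)
    print(dict(zip(subsets, np.round(weights, 3))))      # most weight falls on subsets of {0, 1}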

The potential for a multiple shrinkage p̂_∗ to offer meaningful risk reduction in many different regions of the parameter space is greatly enhanced when it is minimax, and therefore can only improve on the noninformative minimax p̂_U. The following two results show that such minimaxity and dominance of p̂_U can be obtained. We then conclude with an explicit example of such domination.

Theorem 3. Suppose for all S ∈ Ω, m_{π_S}(z; W_S W_S′) is finite for all z. Let H(f(z_1, ..., z_{q_S})) be the Hessian matrix of f. If for all S ∈ Ω,

trace{H(m_{π_S}(z; W_S V_{S,w} W_S′))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1],   (39)

then p̂_∗ in (37) is minimax under R_KL. Furthermore, p̂_∗ dominates p̂_U unless π_∗ = π_U.

Proof. From (31), (37), and (38), it is straightforward to show that p̂_∗ can be reexpressed as

p̂_∗(y|x) = [Σ_{S∈Ω} w_S m_{π_S}(β̂_{S,x,y}; Σ_{C,S}) / Σ_{S∈Ω} w_S m_{π_S}(β̂_{S,x}; Σ_{A,S})] p̂_U(y|x).   (40)

Because p̂_∗ is of the same form as p̂_{π^S} in (31), namely a ratio of marginals times p̂_U, we can apply the same arguments leading to the proofs of Theorems 1 and 2. These steps show that a sufficient condition for the minimaxity and dominance claims is

Σ_{S∈Ω} w_S trace{H(m_{π_S}(z; W_S V_{S,w} W_S′))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1].

This condition is implied if (39) holds for all S ∈ Ω.

The next result follows using the same argument leading to Corollaries 1 and 2.

Corollary 3. Suppose for all S ∈ Ω, m_{π_S}(z; W_S W_S′) is finite for all z. Then p̂_∗ in (37) will be minimax if for all S ∈ Ω, trace{H(π_S(β_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 a.e. Furthermore, p̂_∗ will dominate p̂_U unless π_∗ = π_U.

Example 1 (continued). (Scaled harmonic prior). For each S ∈ Ω, let π^S(β) be the scaled harmonic prior given by (35) when q_S ≥ 3, and by π^S(β) ≡ 1 when q_S < 3. When A = B, by Corollary 3, p̂_∗ under these priors will be minimax and will dominate p̂_U if q_S ≥ 3 for at least one S ∈ Ω.

5 Predictive Density Estimation Near Linear Subspaces

The harmonic prior predictive estimator p̂_{π^S}(y|x) described in Section 3, and incorporated into the multiple shrinkage predictive estimators p̂_∗(y|x) in Section 4, offers risk reduction in the region of the parameter space where β_S is close to 0. This can be seen as a special case of the following general construction of a predictive estimator that obtains risk reduction when β is close to a linear subspace of R^p.

Suppose one would like to obtain a predictive density estimator with greatest risk reduction in the region where β is close to a linear subspace G ⊂ R^p. In the case of p̂_{π^S}(y|x), G would be the subspace of all β ∈ R^p for which β_S = 0. Alternately, if risk reduction was desired, say, when the components of β were close to equal, then one would consider G = [1], the subspace spanned by (1, ..., 1)′. Let P_G β ≡ argmin_{g∈G} ‖β − g‖ be the projection of β onto G, and define β_G ≡ (I − P_G)β to be the projection of β onto the orthogonal complement of G. For the construction of p̂_{π^S}(y|x) in Section 3, β_G = β_S. For G = [1], β_G = (β − β̄), where β̄ is the vector with components all equal to (1/p) Σ_{i=1}^p β_i.

The main idea behind the general construction is to use a prior that leads to shrinkage of β_G towards 0 while leaving the remainder of β untouched. This can be obtained by using a prior of the form

π^G(β) = π_G(β_G),   (41)

which is effectively uniform on (β − β_G). This is a special case of the prior over β in (30). Note that since β_G is q_G ≡ (p − dim(G)) dimensional, π_G is a function from R^{q_G} to R.

Analogous to the construction in Lemma 6, predictive density estimators p̂_{π^G} corresponding to priors of the form π^G in (41) can be expressed as

p̂_{π^G}(y|x) = [m_{π_G}(β̂_{G,x,y}; Σ_{C,G}) / m_{π_G}(β̂_{G,x}; Σ_{A,G})] p̂_U(y|x),   (42)

where p̂_U is defined by (8), β̂_{G,x} = (I − P_G)β̂_x and β̂_{G,x,y} = (I − P_G)β̂_{x,y} are the projections of β̂_x and β̂_{x,y} onto the orthogonal complement of G, respectively, and Σ_{A,G} and Σ_{C,G} are the covariance matrices of β̂_{G,x} and β̂_{G,x,y}, respectively. It is straightforward to see that Theorem 2 and Corollary 2 and their proofs can be extended to get conditions on π_G(β_G) for such a p̂_{π^G} to be minimax and to dominate p̂_U. (Simply substitute the symbol G for the symbol S throughout.)
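The projection quantities used in this construction are straightforward to compute. The following sketch (not from the paper) does so for the "nearly equal coefficients" subspace G = [1] mentioned above; the function name projector is an illustrative choice.

    # Sketch: projector P_G onto G = span{(1,...,1)'} and the shrunk component beta_G = (I - P_G) beta.
    import numpy as np

    def projector(G_basis):
        """Orthogonal projector onto the column span of G_basis."""
        Q, _ = np.linalg.qr(G_basis)
        return Q @ Q.T

    p = 4
    P_G = projector(np.ones((p, 1)))
    beta = np.array([2.1, 1.9, 2.0, 2.2])
    beta_G = (np.eye(p) - P_G) @ beta     # deviations from the common mean; small when components are nearly equal
    print(P_G @ beta, beta_G)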

Example 1 (continued). Extending (35), consider the following scaled harmonic prior on β. For q_G ≥ 3, let

π^G(β) = π_G(β_G) ∝ ‖diag(η_1, ..., η_{q_G})^{−1/2} O_G′ β_G‖^{−(q_G−2)},   (43)

where η_1, ..., η_{q_G} > 0 are the eigenvalues of Σ_{A,G}, and for q_G < 3, let π^G(β) ≡ 1. Note that when q_G ≥ 3 the resulting p̂_{π^G} shrinks p̂_U towards G, offering reduced risk when β is close to G. By the extension of Corollary 2, such a p̂_{π^G} will be minimax and dominate p̂_U when A = B and q_G ≥ 3.

Finally, following the development in Section 4, one can easily incorporate such p̂_{π^G} into multiple shrinkage predictive estimators p̂_∗. Letting Ω be a set of subspaces G under consideration, construct the mixture prior

π_∗(β) = Σ_{G∈Ω} w_G π^G(β),   (44)

where for each G ∈ Ω, π^G is the designated prior of the form (41), and w_G ∈ [0, 1] is such that Σ_{G∈Ω} w_G = 1. This prior yields the multiple shrinkage predictive estimator

p̂_∗(y|x) = Σ_{G∈Ω} p̂(G|x) p̂_{π^G}(y|x),   (45)

where each p̂_{π^G} is given by (42), and each posterior probability is of the form

p̂(G|x) = w_G m_{π_G}(β̂_{G,x}; Σ_{A,G}) / Σ_{G′∈Ω} w_{G′} m_{π_{G′}}(β̂_{G′,x}; Σ_{A,G′}).   (46)

Here, p̂_∗(y|x) is an adaptive convex combination of the individual shrinkage predictive estimates p̂_{π^G}, and offers risk reduction whenever β_G is near the region where π_G is largest for any G ∈ Ω. Thus, the potential for risk reduction using p̂_∗ is far greater than the risk reduction using an arbitrarily chosen p̂_{π^G}. It is straightforward to see that Theorem 3 and Corollary 3 and their proofs can be extended to get conditions for such a p̂_∗(y|x) to be minimax and dominate p̂_U. (Simply substitute the symbol G for the symbol S throughout.)

Example 1 (continued). (Scaled harmonic prior). For each G ∈ Ω, let π^G(β) be the scaled harmonic prior given by (43) when q_G ≥ 3, and by π^G(β) ≡ 1 when q_G < 3. When A = B, by the extension of Corollary 3, p̂_∗ for these priors will be minimax and will dominate p̂_U if q_G ≥ 3 for at least one G ∈ Ω.

References

Aitchison, J. (1975). Goodness of Prediction Fit. Biometrika, 62.

Brown, L.D. (1971). Admissible Estimators, Recurrent Diffusions, and Insoluble Boundary Value Problems. Annals of Mathematical Statistics, 42.

Brown, L.D., George, E.I. and Xu, X. (2006). Admissible Predictive Density Estimation. Submitted.

George, E.I. (1986a). Minimax Multiple Shrinkage Estimation. Annals of Statistics, 14.

George, E.I. (1986b). Combining Minimax Shrinkage Estimators. Journal of the American Statistical Association, 81.

George, E.I. (1986c). A Formal Bayes Multiple Shrinkage Estimator. Communications in Statistics: Part A - Theory and Methods (Special issue "Stein-type Multivariate Estimation"), 15(7).

George, E.I., Liang, F. and Xu, X. (2006). Improved Minimax Prediction Under Kullback-Leibler Loss. Annals of Statistics, 34.

Liang, F. (2002). Exact Minimax Procedures for Predictive Density Estimation and Data Compression. Ph.D. dissertation, Department of Statistics, Yale University.

Liang, F. and Barron, A. (2004). Exact Minimax Strategies for Predictive Density Estimation, Data Compression and Model Selection. IEEE Transactions on Information Theory, 50.

Steele, J.M. (2001). Stochastic Calculus and Financial Applications. Springer, New York.

Stein, C. (1974). Estimation of the Mean of a Multivariate Normal Distribution. In Proceedings of the Prague Symposium on Asymptotic Statistics, Ed. J. Hajek. Prague: Universita Karlova.

Stein, C. (1981). Estimation of a Multivariate Normal Mean. Annals of Statistics, 9.


Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

6.1 Main properties of Shannon entropy. Let X be a random variable taking values x in some alphabet with probabilities.

6.1 Main properties of Shannon entropy. Let X be a random variable taking values x in some alphabet with probabilities. Chapter 6 Quantum entropy There is a notion of entropy which quantifies the amount of uncertainty contained in an ensemble of Qbits. This is the von Neumann entropy that we introduce in this chapter. In

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

http://www.math.uah.edu/stat/markov/.xhtml 1 of 9 7/16/2009 7:20 AM Virtual Laboratories > 16. Markov Chains > 1 2 3 4 5 6 7 8 9 10 11 12 1. A Markov process is a random process in which the future is

More information

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality.

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality. 88 Chapter 5 Distribution Theory In this chapter, we summarize the distributions related to the normal distribution that occur in linear models. Before turning to this general problem that assumes normal

More information

CS 246 Review of Linear Algebra 01/17/19

CS 246 Review of Linear Algebra 01/17/19 1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

Linear Models Review

Linear Models Review Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

Exact Minimax Predictive Density Estimation and MDL

Exact Minimax Predictive Density Estimation and MDL Exact Minimax Predictive Density Estimation and MDL Feng Liang and Andrew Barron December 5, 2003 Abstract The problems of predictive density estimation with Kullback-Leibler loss, optimal universal data

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Minimax design criterion for fractional factorial designs

Minimax design criterion for fractional factorial designs Ann Inst Stat Math 205 67:673 685 DOI 0.007/s0463-04-0470-0 Minimax design criterion for fractional factorial designs Yue Yin Julie Zhou Received: 2 November 203 / Revised: 5 March 204 / Published online:

More information

Ridge Regression Revisited

Ridge Regression Revisited Ridge Regression Revisited Paul M.C. de Boer Christian M. Hafner Econometric Institute Report EI 2005-29 In general ridge (GR) regression p ridge parameters have to be determined, whereas simple ridge

More information