Predictive Density Estimation for Multiple Regression


Edward I. George and Xinyi Xu

March 2006, Revised January 2007

Abstract

Suppose we observe X ∼ N_m(Aβ, σ²I) and would like to estimate the predictive density p(y|β) of a future Y ∼ N_n(Bβ, σ²I). Evaluating predictive estimates p̂(y|x) by Kullback-Leibler loss, we develop and evaluate Bayes procedures for this problem. We obtain general sufficient conditions for minimaxity and dominance of the noninformative uniform prior Bayes procedure. We extend these results to situations where only a subset of the predictors in A is thought to be potentially irrelevant. We then consider the more realistic situation where there is model uncertainty and this subset is unknown. For this situation we develop multiple shrinkage predictive estimators and obtain general minimaxity and dominance conditions. Finally, we provide an explicit example of a minimax multiple shrinkage predictive estimator based on scaled harmonic priors.

Keywords: BAYESIAN PREDICTION; MODEL UNCERTAINTY; MULTIPLE SHRINKAGE; PRIOR DISTRIBUTION; SHRINKAGE ESTIMATION.

Edward I. George is Professor, Statistics Department, The Wharton School, 3730 Walnut Street 400 JMHH, Philadelphia, PA, edgeorge@wharton.upenn.edu. Xinyi Xu is Assistant Professor, Department of Statistics, The Ohio State University, Columbus, OH, xinyi@stat.ohio-state.edu. We would like to acknowledge Larry Brown, Feng Liang, Linda Zhao and three referees for their helpful suggestions. This work was supported by various NSF grants, DMS the most recent.

1 Introduction

We begin with the canonical normal linear model setup

X ∼ N_m(Aβ, σ²I),   (1)

where X is an m-vector of observations, A is a fixed, full rank m × p matrix of p potential predictors with m ≥ p, and β is a p-vector of unknown regression coefficients. Based on observing X = x, we consider the problem of estimating the predictive density p(y|β) of a future n-vector Y, where

Y ∼ N_n(Bβ, σ²I).   (2)

Here B is a fixed n × p matrix of the same p potential predictors in A, although with possibly different values. We also assume that X and Y are conditionally independent given β. Finally, we assume σ² is known, and without loss of generality set σ² = 1 throughout.

For each value of x, we evaluate a predictive estimate p̂(y|x) of p(y|β) by the well-known Kullback-Leibler (KL) loss

L(β, p̂(·|x)) = ∫ p(y|β) log [p(y|β) / p̂(y|x)] dy.   (3)

The overall quality of the procedure p̂ = p̂(y|x) for each β is then conveniently summarized by the KL risk

R_KL(β, p̂) = ∫ p(x|β) L(β, p̂(·|x)) dx.   (4)

Letting β̂_x = (A′A)⁻¹A′x be the traditional least squares estimate of β based on x, it is tempting to consider the plug-in predictive estimate p̂_plug-in(y|β̂_x), which simply substitutes β̂_x for β in p(y|β). However, as we show in Section 2 by extending the arguments of Aitchison (1975), the formal Bayes predictive estimate

p̂_U(y|x) = ∫ p(x|β) p(y|β) dβ / ∫ p(x|β) dβ   (5)

has smaller KL risk than p̂_plug-in(y|β̂_x) for every β. Thus, p̂_plug-in(y|β̂_x) should be ruled out and we turn our focus to Bayes procedures.
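To fix ideas, here is a small numerical sketch, not part of the original paper, of the setup (1)-(2) and of the KL risk (4) of the plug-in estimate computed by Monte Carlo; the dimensions, designs, seed and the helper name kl_risk_plug_in are all illustrative choices.

    # Illustrative sketch of model (1)-(2) and the plug-in KL risk (4); all settings are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p = 20, 5, 3
    A = rng.standard_normal((m, p))      # m x p design for the observed X
    B = rng.standard_normal((n, p))      # n x p design for the future Y
    beta = np.array([1.0, -0.5, 2.0])
    Sigma_A = np.linalg.inv(A.T @ A)

    def kl_risk_plug_in(num_rep=20000):
        """Monte Carlo estimate of the KL risk of the plug-in density N_n(B beta_hat_x, I)."""
        losses = np.empty(num_rep)
        for r in range(num_rep):
            x = A @ beta + rng.standard_normal(m)
            beta_hat_x = Sigma_A @ A.T @ x                # least squares estimate
            # KL( N_n(B beta, I) || N_n(B beta_hat_x, I) ) = ||B(beta - beta_hat_x)||^2 / 2
            losses[r] = 0.5 * np.sum((B @ (beta - beta_hat_x)) ** 2)
        return losses.mean()

    # The Monte Carlo value matches the closed form trace(B Sigma_A B')/2 derived in Section 2.
    print(kl_risk_plug_in(), 0.5 * np.trace(B @ Sigma_A @ B.T))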

For a prior π on β, the Bayes predictive estimator p̂_π(y|x) is given by

p̂_π(y|x) = ∫ p(x|β) p(y|β) π(β) dβ / ∫ p(x|β) π(β) dβ.   (6)

It also follows from the arguments of Aitchison (1975) that for proper π, p̂_π minimizes the average risk r_π(p̂) = ∫ R_KL(β, p̂) π(β) dβ. Note that p̂_U in (5) is the formal Bayes estimate under the improper uniform noninformative density π_U(β), and would seem to be a good default procedure. Indeed, p̂_U has constant risk and is minimax under KL loss; see Liang (2002) and Liang and Barron (2004). But surprisingly, as we will show, in many cases p̂_U itself can be uniformly dominated in terms of KL risk by other Bayes predictive estimators.

In Section 2, we develop general conditions under which p̂_π will be minimax and uniformly dominate p̂_U in terms of the KL risk (4) for the multiple regression prediction problem. Our results can be seen as a substantial generalization of George, Liang and Xu (2006), who considered the special case of this problem when X ∼ N_m(µ, σ_x²I) and Y ∼ N_m(µ, σ_y²I), where µ is the common unknown multivariate normal mean. Moving further away from this common mean setup, we proceed in Section 3 to extend these results to the setting where only a subset of the p predictors is considered to be potentially irrelevant. In Section 4, we consider the more realistic model uncertainty setting where such a subset is unknown, and develop minimax multiple shrinkage predictive densities that adaptively shrink towards the model most favored by the data. In Section 5, we conclude by showing how our results can be extended for minimax shrinkage prediction towards any linear subspace. Although we do not consider the issue of admissibility in this paper, it may be of interest to note that for the multivariate normal prediction problem above, Brown, George and Xu (2006) recently established that all admissible predictive densities are Bayes procedures.

2 Priors for Minimax Predictive Estimation

In this section, we develop general conditions on π for p̂_π in (6) to uniformly dominate p̂_U in (5) under the KL risk (4). The minimaxity of such p̂_π will then follow immediately from the minimaxity of p̂_U.

We begin by establishing some convenient notation. As indicated previously, we use β̂_x = (A′A)⁻¹A′x to denote the least squares estimate of β based on x. Although y is not observed, it will be useful to use

β̂_{x,y} = (C′C)⁻¹ C′ (x′, y′)′,   where C = (A′, B′)′,   (7)

to denote the least squares estimate of β based on x and y. Note that β̂_x ∼ N_p(β, Σ_A) and β̂_{x,y} ∼ N_p(β, Σ_C), where for notational convenience throughout we let Σ_A = (A′A)⁻¹ and Σ_C = (C′C)⁻¹. It will also be useful to let

R_x = ‖x − Aβ̂_x‖²   and   R_{x,y} = ‖(x′, y′)′ − Cβ̂_{x,y}‖²

denote the corresponding residual sums of squares (RSS). In terms of this notation, we have the following.

Lemma 1. The uniform prior predictive estimate p̂_U in (5) can be expressed as

p̂_U(y|x) = (2π)^{−n/2} (|C′C| / |A′A|)^{−1/2} exp{−(R_{x,y} − R_x)/2}
         = (2π)^{−n/2} |Ψ|^{−1/2} exp{−(y − Bβ̂_x)′ Ψ⁻¹ (y − Bβ̂_x)/2},   (8)

where Ψ = I + BΣ_A B′. Moreover, the KL risk of p̂_U is uniformly smaller than that of the plug-in estimator p̂_plug-in(y|β̂_x).

Proof. Since β̂_x | β ∼ N_p(β, Σ_A), the posterior of β under the uniform prior is β | β̂_x ∼ N_p(β̂_x, Σ_A). It follows that the posterior of Bβ is Bβ | β̂_x ∼ N_n(Bβ̂_x, BΣ_A B′), and thus the predictive distribution is Y | β̂_x ∼ N_n(Bβ̂_x, I + BΣ_A B′), which yields (8).

To calculate the risk of p̂_U, let Ĥ_A = A(A′A)⁻¹A′ denote the hat matrix based on x and Ĥ_C = C(C′C)⁻¹C′ denote the hat matrix based on both x and y. It is easy to see that

R_KL(β, p̂_U) = ∫∫ p(x|β) p(y|β) log [p(y|β) / p̂_U(y|x)] dx dy
= (1/2) log(|C′C| / |A′A|) − n/2 + (1/2) ∫∫ p(x|β) p(y|β) [R_{x,y} − R_x] dx dy
= (1/2) log(|C′C| / |A′A|) − n/2 + (1/2) [trace(I_{m+n} − Ĥ_C) − trace(I_m − Ĥ_A)]
= (1/2) log(|C′C| / |A′A|) − n/2 + n/2
= (1/2) Σ_{i=1}^p log(e_i + 1),

where e_1, ..., e_p are the eigenvalues of (A′A)⁻¹B′B. Moreover,

R_KL(β, p̂_plug-in(y|β̂_x)) = ∫∫ p(x|β) p(y|β) log [p(y|β) / p̂_plug-in(y|β̂_x)] dx dy
= (1/2) ∫∫ p(x|β) p(y|β) [‖y − Bβ̂_x‖² − ‖y − Bβ‖²] dx dy
= (1/2) ∫ p(x|β) ‖Bβ̂_x − Bβ‖² dx
= (1/2) trace(B(A′A)⁻¹B′) = (1/2) Σ_{i=1}^p e_i.

That p̂_U dominates p̂_plug-in(y|β̂_x) then follows from the fact that log(x + 1) < x for any x > 0.
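As an illustration (not from the paper), the two closed-form risks obtained in the proof above can be compared numerically from the eigenvalues e_i of (A′A)⁻¹B′B; the designs and the helper name risk_comparison are arbitrary.

    # Sketch: R_KL(p_U) = (1/2) sum log(e_i + 1)  versus  R_KL(plug-in) = (1/2) sum e_i.
    import numpy as np

    def risk_comparison(A, B):
        e = np.linalg.eigvals(np.linalg.solve(A.T @ A, B.T @ B)).real   # eigenvalues of (A'A)^{-1} B'B
        return 0.5 * np.sum(np.log(e + 1.0)), 0.5 * np.sum(e)

    rng = np.random.default_rng(1)
    A = rng.standard_normal((20, 3))
    B = rng.standard_normal((5, 3))
    risk_U, risk_plug_in = risk_comparison(A, B)
    print(risk_U, risk_plug_in)    # risk_U < risk_plug_in since log(e + 1) < e for e > 0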

Risk comparisons of a Bayes predictive density p̂_π with p̂_U are greatly facilitated by the following representation of p̂_π in terms of p̂_U. An analogous representation of the posterior mean in terms of the MLE, which simplifies multivariate normal mean estimation under quadratic risk, was proposed by Brown (1971). For our representation, it will be useful to denote the marginal distribution of Z | β ∼ N_p(β, Σ) under π by

m_π(z; Σ) = ∫ p(z|β) π(β) dβ.   (9)

Thus, the marginal distributions of β̂_x and β̂_{x,y} under π are denoted by m_π(β̂_x; Σ_A) and m_π(β̂_{x,y}; Σ_C) respectively.

Lemma 2. If m_π(z; Σ) is finite for all z and Σ, then p̂_π(y|x) is a proper probability distribution. Furthermore, it can be expressed as

p̂_π(y|x) = [m_π(β̂_{x,y}; Σ_C) / m_π(β̂_x; Σ_A)] p̂_U(y|x),   (10)

where p̂_U is defined by (8).

Proof. When m_π(z; Σ) is finite for all z and Σ, that p̂_π(y|x) is a proper probability distribution follows from integrating with respect to y and switching the order of integration. Next, straightforward calculation yields

∫ p(x|β) π(β) dβ = ∫ (2π)^{−m/2} exp{−‖x − Aβ‖²/2} π(β) dβ
= (2π)^{−m/2} exp{−R_x/2} ∫ exp{−‖Aβ̂_x − Aβ‖²/2} π(β) dβ
= (2π)^{−(m−p)/2} |A′A|^{−1/2} exp{−R_x/2} m_π(β̂_x; Σ_A).   (11)

Similarly, we obtain

∫ p(x|β) p(y|β) π(β) dβ = (2π)^{−(m+n−p)/2} |C′C|^{−1/2} exp{−R_{x,y}/2} m_π(β̂_{x,y}; Σ_C).   (12)

The representation (10) follows immediately from (6), (11) and (12).
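The representation (10) is easy to compute whenever the marginal m_π has a closed form. The sketch below (not from the paper) uses a conjugate prior β ∼ N_p(0, τ²I), for which m_π(z; Σ) is simply the N_p(0, Σ + τ²I) density; the normal prior, the value of τ² and the function name p_pi are illustrative choices, not the harmonic priors studied later.

    # Sketch of the Lemma 2 representation (10) under an illustrative conjugate prior N_p(0, tau2 I).
    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def p_pi(y, x, A, B, tau2=4.0):
        Sigma_A = np.linalg.inv(A.T @ A)
        C = np.vstack([A, B])
        Sigma_C = np.linalg.inv(C.T @ C)
        beta_x = Sigma_A @ A.T @ x                         # least squares from x
        beta_xy = Sigma_C @ C.T @ np.concatenate([x, y])   # least squares from (x, y)
        p = A.shape[1]
        # uniform-prior predictive density (8): Y | x ~ N_n(B beta_x, I + B Sigma_A B')
        p_U = mvn(mean=B @ beta_x, cov=np.eye(B.shape[0]) + B @ Sigma_A @ B.T).pdf(y)
        m_num = mvn(mean=np.zeros(p), cov=Sigma_C + tau2 * np.eye(p)).pdf(beta_xy)
        m_den = mvn(mean=np.zeros(p), cov=Sigma_A + tau2 * np.eye(p)).pdf(beta_x)
        return (m_num / m_den) * p_U                       # marginal ratio times p_U, as in (10)

    rng = np.random.default_rng(2)
    A = rng.standard_normal((15, 3)); B = rng.standard_normal((4, 3))
    beta = np.array([0.3, -0.2, 0.1])
    x = A @ beta + rng.standard_normal(15)
    y = B @ beta + rng.standard_normal(4)
    print(p_pi(y, x, A, B))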

The next result provides a representation of the difference between the KL risks of p̂_U and p̂_π in terms of the marginal distributions of β̂_x and β̂_{x,y}.

Lemma 3. The difference between the KL risks of p̂_U and p̂_π is given by

R_KL(β, p̂_U) − R_KL(β, p̂_π) = E_{β,Σ_C} log m_π(β̂_{x,y}; Σ_C) − E_{β,Σ_A} log m_π(β̂_x; Σ_A),   (13)

where E_{β,Σ}(·) stands for expectation with respect to the N_p(β, Σ) distribution.

Proof. The KL risk difference between p̂_U and p̂_π can be expressed as

R_KL(β, p̂_U) − R_KL(β, p̂_π) = ∫∫ p(x|β) p(y|β) log [p̂_π(y|x) / p̂_U(y|x)] dx dy
= ∫∫ p(x|β) p(y|β) log [m_π(β̂_{x,y}; Σ_C) / m_π(β̂_x; Σ_A)] dx dy,

where the last equality follows from Lemma 2. The result then follows from the change of variable theorem.

To exploit the representation (13), we proceed to transform the distributions to canonical form. Since Σ_A and Σ_C are both symmetric and positive definite, there exists an invertible p × p matrix W such that

Σ_A = WW′   and   Σ_C = WΣ_D W′,   (14)

where

Σ_D = diag(d_1, ..., d_p).   (15)

Since Σ_C⁻¹ = C′C = A′A + B′B and B′B is nonnegative definite, d_i ∈ (0, 1] for all 1 ≤ i ≤ p, with at least one d_i < 1. Finally, let µ = W⁻¹β, µ̂_x = W⁻¹β̂_x, and µ̂_{x,y} = W⁻¹β̂_{x,y}, so that

µ̂_x ∼ N_p(µ, I)   and   µ̂_{x,y} ∼ N_p(µ, Σ_D).   (16)

Lemma 4. Let π_W(µ) = π(Wµ). Then, the difference between the KL risks of p̂_U and p̂_π is given by

R_KL(β, p̂_U) − R_KL(β, p̂_π) = E_{µ,Σ_D} log m_{π_W}(µ̂_{x,y}; Σ_D) − E_{µ,I} log m_{π_W}(µ̂_x; I),   (17)

where E_{µ,Σ}(·) stands for expectation with respect to the N_p(µ, Σ) distribution.

Proof. The result follows by transforming the expressions in Lemma 3,

E_{β,Σ_A} log m_π(β̂_x; Σ_A) = ∫ p(β̂_x|β) log [∫ p(β̂_x|b) π(b) db] dβ̂_x
= ∫ p(µ̂_x|µ) log [∫ p(µ̂_x|t) π_W(t) dt] dµ̂_x
= E_{µ,I} log m_{π_W}(µ̂_x; I).

Similarly, E_{β,Σ_C} log m_π(β̂_{x,y}; Σ_C) = E_{µ,Σ_D} log m_{π_W}(µ̂_{x,y}; Σ_D). Thus, (17) equals (13).
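A matrix W satisfying (14) can be computed by a standard simultaneous diagonalization. The following sketch is an assumed construction, not from the paper, building one such W from a Cholesky factor of Σ_A and an eigendecomposition; the function name canonical_form is invented for the illustration.

    # Sketch: find W with Sigma_A = W W' and Sigma_C = W diag(d) W', d_i in (0, 1].
    import numpy as np

    def canonical_form(A, B):
        Sigma_A = np.linalg.inv(A.T @ A)
        Sigma_C = np.linalg.inv(A.T @ A + B.T @ B)
        L = np.linalg.cholesky(Sigma_A)                          # Sigma_A = L L'
        M = np.linalg.solve(L, np.linalg.solve(L, Sigma_C).T).T  # M = L^{-1} Sigma_C L^{-T}
        d, Q = np.linalg.eigh(M)                                 # M = Q diag(d) Q'
        return L @ Q, d                                          # W = L Q

    rng = np.random.default_rng(3)
    A = rng.standard_normal((20, 3)); B = rng.standard_normal((5, 3))
    W, d = canonical_form(A, B)
    Sigma_A = np.linalg.inv(A.T @ A); Sigma_C = np.linalg.inv(A.T @ A + B.T @ B)
    print(np.allclose(W @ W.T, Sigma_A), np.allclose(W @ np.diag(d) @ W.T, Sigma_C), d)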

We now proceed to find conditions on m_π for which the risk difference (17) is nonnegative for all µ. Because p̂_U is minimax, this will then imply that p̂_π is minimax under the prior π corresponding to π_W. Now for w ∈ [0, 1], let

V_w = wI + (1 − w)Σ_D,   (18)

where Σ_D is defined as in (15). Next, for Z ∼ N_p(µ, V_w), let

h_µ(V_w) = E_{µ,V_w} log m_{π_W}(Z; V_w).   (19)

Thus, we may rewrite (17) as

R_KL(β, p̂_U) − R_KL(β, p̂_π) = h_µ(V_0) − h_µ(V_1).   (20)

Since h_µ(V_w) is continuous in w, it suffices to derive conditions on m_π such that (∂/∂w) h_µ(V_w) ≤ 0 for all µ and w ∈ [0, 1]. Letting v_i be the ith diagonal element of V_w, we have by the chain rule

∂h_µ/∂w = Σ_{i=1}^p (∂h_µ/∂v_i)(∂v_i/∂w) = Σ_{i=1}^p (1 − d_i) ∂h_µ/∂v_i.   (21)

The following result provides unbiased estimates of the components of (21) which, when combined with (17), will be seen to play a key role in establishing sufficient conditions on m_π for p̂_π to be minimax and to dominate p̂_U. As noted by George, Liang and Xu (2006), these estimates are very similar to the unbiased estimates of risk for the estimation of a multivariate mean under squared error loss; see Stein (1974, 1981).

Lemma 5. If m_{π_W}(z; I) is finite for all z, then for any 0 ≤ w ≤ 1, m_{π_W}(z; V_w) is finite. Moreover,

∂h_µ/∂v_i = E_{µ,V_w} [ ∂²_{z_i} m_{π_W}(Z; V_w) / m_{π_W}(Z; V_w) − (1/2) {∂_{z_i} log m_{π_W}(Z; V_w)}² ]   (22)
          = 2 E_{µ,V_w} [ ∂²_{z_i} √(m_{π_W}(Z; V_w)) / √(m_{π_W}(Z; V_w)) ].   (23)

Proof. When m_{π_W}(z; I) is finite for all z, it is easy to check that for any fixed z and any 0 ≤ w ≤ 1,

m_{π_W}(z; V_w) ≤ (Π_{i=1}^p d_i)^{−1/2} m_{π_W}(z; I) < ∞.

Next, letting Z* = V_w^{−1/2}(Z − µ) ∼ N(0, I), we have

∂h_µ/∂v_i = (∂/∂v_i) E log m_{π_W}(V_w^{1/2} Z* + µ; V_w)
= E [ (∂/∂v_i) m_{π_W}(V_w^{1/2} Z* + µ; V_w) / m_{π_W}(V_w^{1/2} Z* + µ; V_w) ],   (24)

where, with z = V_w^{1/2} z* + µ and differentiating under the integral sign,

(∂/∂v_i) m_{π_W}(V_w^{1/2} z* + µ; V_w)
= ∫ [ (z_i − µ̃_i)²/(2v_i²) − 1/(2v_i) − (z_i − µ_i)(z_i − µ̃_i)/(2v_i²) ] p(z|µ̃) π_W(µ̃) dµ̃,

with p(z|µ̃) the N_p(µ̃, V_w) density. Making use of the well-known univariate heat equation

(∂/∂v_i) m_{π_W}(z; V_w) = (1/2) ∂²_{z_i} m_{π_W}(z; V_w),   (25)

see for example Steele (2001), and the Brown (1971) representation E_π(µ_i | z) = z_i + v_i ∂_{z_i} log m_{π_W}(z; V_w), (22) and (23) can be verified via the same steps as in the proof of Lemma 3 in George, Liang and Xu (2006).
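The heat equation (25) used above is easy to check numerically. The finite-difference sketch below (illustrative only, not from the paper) verifies it for a single normal kernel, a property that m_{π_W}(z; V_w) inherits coordinatewise by mixing over the prior.

    # Finite-difference check of (d/dv) phi(z; mu, v) = (1/2) (d^2/dz^2) phi(z; mu, v).
    import numpy as np
    from scipy.stats import norm

    def phi(z, v, mu=0.3):
        return norm.pdf(z, loc=mu, scale=np.sqrt(v))

    z, v = 1.2, 0.7
    dv = (phi(z, v + 1e-6) - phi(z, v - 1e-6)) / 2e-6
    dzz = (phi(z + 1e-4, v) - 2 * phi(z, v) + phi(z - 1e-4, v)) / 1e-8
    print(dv, 0.5 * dzz)    # the two values agree to several digits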

Now we can obtain sufficient conditions for a Bayes procedure p̂_π to be minimax by combining (20), (21) and Lemma 5. The following result provides a substantial generalization of Theorem 1 of George, Liang and Xu (2006).

Theorem 1. Suppose m_π(z; WW′) is finite for all z, with the invertible matrix W defined as in (14). Let H(f(z_1, ..., z_p)) be the Hessian matrix of f.

(i) If trace{H(m_π(z; WV_wW′))[Σ_A − Σ_C]} ≤ 0 for all w ∈ [0, 1], then p̂_π is minimax under R_KL. Furthermore, p̂_π dominates p̂_U unless π = π_U.

(ii) If trace{H(√(m_π(z; WV_wW′)))[Σ_A − Σ_C]} ≤ 0 for all w ∈ [0, 1], then p̂_π is minimax under R_KL. Furthermore, p̂_π dominates p̂_U if for all w ∈ [0, 1], this inequality is strict on a set of positive Lebesgue measure.

Proof. To prove the minimaxity of p̂_π under R_KL, it suffices to show that (22) or (23) is nonpositive, because by (21) that would imply the nonnegativity of (20). Dominance would further follow by showing that (22) or (23) is also strictly negative on a set of positive Lebesgue measure. Noting that m_{π_W}(z; V_w) = m_π(Wz; WV_wW′), and letting z̃ = Wz, we obtain

Σ_{i=1}^p (1 − d_i) ∂²_{z_i} m_{π_W}(z; V_w) = Σ_{i=1}^p (1 − d_i) ∂²_{z_i} m_π(Wz; WV_wW′)
= Σ_{i=1}^p (1 − d_i) Σ_{j=1}^p Σ_{k=1}^p W_{ji} W_{ki} ∂²_{z̃_j z̃_k} m_π(z̃; WV_wW′)
= trace{(I − Σ_D) W′ H(m_π(z̃; WV_wW′)) W}
= trace{H(m_π(z̃; WV_wW′)) W(I − Σ_D)W′}
= trace{H(m_π(z̃; WV_wW′))[Σ_A − Σ_C]}.   (26)

Similarly,

Σ_{i=1}^p (1 − d_i) ∂²_{z_i} √(m_{π_W}(z; V_w)) = trace{H(√(m_π(z̃; WV_wW′)))[Σ_A − Σ_C]}.   (27)

Now (i) and (ii) follow immediately from (22), (23), (26) and (27).

The next result follows using the fact that ∂²_{z_i} m_{π_W}(z; V_w) ≤ 0 whenever ∂²_{µ_i} π_W(µ) ≤ 0.

Corollary 1. Suppose m_π(z; WW′) is finite for all z. Then p̂_π will be minimax if trace{H(π(β))[Σ_A − Σ_C]} ≤ 0 a.e. Furthermore, p̂_π will dominate p̂_U unless π = π_U.

Example 1. (Scaled harmonic prior). Suppose A = B. In this case, Σ_C = Σ_A/2, so that

trace{H(π(β))[Σ_A − Σ_C]} = (1/2) trace{H(π(β)) Σ_A} = (1/2) trace{H(π(β)) WW′} = (1/2) Δ_µ π_W(µ).   (28)

Let π_W(µ) = ‖µ‖^{−(p−2)} when p ≥ 3, and π_W(µ) ≡ 1 when p < 3. Note that π_W is harmonic away from the origin, so that Δπ_W(µ) ≤ 0, and is not equal to π_U when p ≥ 3. For p ≥ 3, the corresponding prior on β is a scaled harmonic prior

π(β) ∝ ‖W⁻¹β‖^{−(p−2)} = ‖diag(η_1, ..., η_p)^{−1/2} O′β‖^{−(p−2)},   (29)

where η_1, ..., η_p > 0 are the eigenvalues of Σ_A, and for p < 3, π(β) ≡ 1. (The expression (29) is obtained using the fact that there exists an orthonormal matrix O such that W = O diag(η_1, ..., η_p)^{1/2} O′.) By Corollary 1 and (28), the predictive estimator p̂_π under this prior is minimax and dominates p̂_U when p ≥ 3. It is easy to check that these results hold when A = rB for any known constant r.
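A quick numerical sanity check (illustrative, not from the paper) of the property behind Example 1: π_W(µ) = ‖µ‖^{−(p−2)} is harmonic away from the origin, so its Laplacian vanishes there. The scaled prior in (29) is then evaluated through W⁻¹β; the function names and the reuse of the canonical_form helper from the earlier sketch are assumptions of the illustration.

    # Check that the Laplacian of pi_W(mu) = ||mu||^{-(p-2)} is (numerically) 0 away from the origin.
    import numpy as np

    p = 5

    def pi_W(mu):
        return np.sum(mu ** 2) ** (-(p - 2) / 2.0)

    def laplacian(f, mu, h=1e-4):
        val = 0.0
        for i in range(p):
            e = np.zeros(p); e[i] = h
            val += (f(mu + e) - 2 * f(mu) + f(mu - e)) / h ** 2
        return val

    mu = np.array([0.8, -1.1, 0.4, 2.0, -0.3])
    print(laplacian(pi_W, mu))      # approximately 0, as expected for a harmonic function

    def scaled_harmonic_prior(beta, W):
        """Unnormalized scaled harmonic prior (29): pi(beta) proportional to ||W^{-1} beta||^{-(p-2)}."""
        return pi_W(np.linalg.solve(W, beta))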

3 Predictive Density Estimation Near Subset Models

When a prior centered at 0 such as the scaled harmonic prior (29) is applied to β, the risk reduction of p̂_π over p̂_U is greatest when all the components of β are close to 0. Thus, it would be sensible to use this prior if it was felt that all p predictors in A and B were potentially irrelevant. However, such a prior would be ineffectual if only a subset of the p predictors were irrelevant, in other words if only a subset of the β components were close to 0. In this section, we extend our results to the setting where such a subset is known. This will set the stage for Section 4, where we develop new results for the more realistic model uncertainty setting where such a subset is unknown.

Let S be the subset of {1, ..., p} corresponding to the indices of the potentially irrelevant predictors, and let q_S = |S| be the number of elements in S. Let β_S be the subvector of β corresponding to the columns of A indexed by S. Similarly, let β̂_{S,x} and β̂_{S,x,y} be the subvectors of β̂_x and β̂_{x,y} respectively, corresponding to β_S. Finally, for notational convenience, let Σ_{A,S} and Σ_{C,S} be the submatrices of Σ_A and Σ_C respectively, which are the covariance matrices of β̂_{S,x} and β̂_{S,x,y}.

When only the elements of β_S are thought to be close to zero, it would be sensible to consider a prior which is uniform on β_S̄, where S̄ is the complement of S. We denote such a prior by π^S, and let π_S be the restriction of π^S to β_S so that

π^S(β) = π_S(β_S)   (30)

is a function of β_S only. To exploit the possibility that β_S is close to zero, π_S would then be centered around 0.

Lemma 6. If m_{π_S}(z; Σ) is finite for all z and Σ, then p̂_{π^S}(y|x) is a proper probability distribution. Furthermore, it can be expressed as

p̂_{π^S}(y|x) = [m_{π_S}(β̂_{S,x,y}; Σ_{C,S}) / m_{π_S}(β̂_{S,x}; Σ_{A,S})] p̂_U(y|x),   (31)

where p̂_U is defined by (8).

Proof. The first assertion was proved in Lemma 2. Next, proceeding as in the derivation of (11), we obtain

∫ p(x|β) π^S(β) dβ = (2π)^{−(m−p)/2} |A′A|^{−1/2} exp{−R_x/2} m_{π_S}(β̂_{S,x}; Σ_{A,S}).   (32)

Similarly, we obtain

∫ p(x|β) p(y|β) π^S(β) dβ = (2π)^{−(m+n−p)/2} |C′C|^{−1/2} exp{−R_{x,y}/2} m_{π_S}(β̂_{S,x,y}; Σ_{C,S}).   (33)

The representation (31) follows immediately from (6), (32) and (33).

The following results provide sufficient conditions for the minimaxity of p̂_{π^S} and for its dominance over p̂_U. We omit the proofs, which are obtained using the same arguments leading to Theorem 1 and Corollary 1. Analogously to our previous development there, we let W_S be an invertible q_S × q_S matrix such that Σ_{A,S} = W_S W_S′ and Σ_{C,S} = W_S Σ_{D,S} W_S′, where Σ_{D,S} = diag(d_1, ..., d_{q_S}) as in (15). Finally, let V_{S,w} = wI + (1 − w)Σ_{D,S} as in (18).

Theorem 2. Suppose m_{π_S}(z; W_S W_S′) is finite for all z. Let H(f(z_1, ..., z_{q_S})) be the Hessian matrix of f.

(i) If trace{H(m_{π_S}(z; W_S V_{S,w} W_S′))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1], then p̂_{π^S} is minimax under R_KL. Furthermore, p̂_{π^S} dominates p̂_U unless π^S = π_U.

(ii) If trace{H(√(m_{π_S}(z; W_S V_{S,w} W_S′)))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1], then p̂_{π^S} is minimax under R_KL. Furthermore, p̂_{π^S} dominates p̂_U if for all w ∈ [0, 1], this inequality is strict on a set of positive Lebesgue measure.

Corollary 2. Suppose m_{π_S}(z; W_S W_S′) is finite for all z. Then p̂_{π^S} will be minimax if trace{H(π_S(β_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 a.e. Furthermore, p̂_{π^S} will dominate p̂_U unless π^S = π_U.

Example 1 (continued). (Scaled harmonic prior). Suppose A = B so that, as in (28),

trace{H(π_S(β_S))[Σ_{A,S} − Σ_{C,S}]} = (1/2) Δ_{µ_S} π_{W_S}(µ_S),   (34)

where µ_S = W_S⁻¹ β_S. Here let π_{W_S}(µ_S) = ‖µ_S‖^{−(q_S−2)} when q_S ≥ 3, and π_{W_S}(µ_S) ≡ 1 when q_S < 3. As before, π_{W_S} is harmonic away from the origin, so that Δπ_{W_S}(µ_S) ≤ 0, and is not equal to π_U when q_S ≥ 3. For q_S ≥ 3, the corresponding scaled harmonic prior on β is

π^S(β) = π_S(β_S) ∝ ‖W_S⁻¹ β_S‖^{−(q_S−2)} = ‖diag(η_1, ..., η_{q_S})^{−1/2} O_S′ β_S‖^{−(q_S−2)},   (35)

where η_1, ..., η_{q_S} > 0 are the eigenvalues of Σ_{A,S}, and for q_S < 3, π^S(β) ≡ 1. By Corollary 2 and (34), p̂_{π^S} here is minimax and dominates p̂_U when q_S ≥ 3.

4 Minimax Multiple Shrinkage Predictive Estimation

We now consider the more realistic model uncertainty setting where there is uncertainty about which subset of the p predictors in A and B should be included in the model. For each choice of S, we have obtained general sufficient conditions for p̂_{π^S} to be minimax and to dominate p̂_U. However, such a p̂_{π^S} will only offer meaningful risk reduction when β is near the region where π^S is largest. For example, under the scaled harmonic prior in (35), such risk reduction occurs when β_S is close to 0. The difficulty then is how to proceed when the subset of irrelevant predictors indexed by S is unknown.

Rather than arbitrarily selecting S, an attractive alternative is to use a multiple shrinkage predictive estimator which uses the data to adaptively emulate the most effective p̂_{π^S}. The multiple shrinkage procedure here is obtained by using a finite mixture of the contemplated priors. A similar multiple shrinkage construction for parameter estimation under squared error loss was proposed and developed by George (1986a,b,c).

Let Ω be the set of all the subsets under consideration, possibly even the set of all possible subsets. For each S ∈ Ω, let π^S be the designated prior of the form (30) on β, and assign it probability w_S ∈ [0, 1] such that Σ_{S∈Ω} w_S = 1. Thus we construct the mixture prior

π_∗(β) = Σ_{S∈Ω} w_S π^S(β).   (36)

This prior yields the multiple shrinkage predictive estimator

p̂_∗(y|x) = Σ_{S∈Ω} p̂(S|x) p̂_{π^S}(y|x).   (37)

Here each p̂_{π^S} is given by (31) in Lemma 6, and each posterior probability is of the form

p̂(S|x) = w_S m_{π_S}(β̂_{S,x}; Σ_{A,S}) / Σ_{S′∈Ω} w_{S′} m_{π_{S′}}(β̂_{S′,x}; Σ_{A,S′}),   (38)

which follows from (32). The form (37) reveals p̂_∗(y|x) to be an adaptive convex combination of the individual shrinkage predictive estimates p̂_{π^S}. Note that through p̂(S|x), p̂_∗ doubly shrinks p̂_U(y|x) by putting more weight on the p̂_{π^S} for which m_{π_S} is largest and hence p̂_{π^S} is shrinking most. Thus, we expect p̂_∗ to offer meaningful risk reduction whenever β is near the region where π^S is largest for any S ∈ Ω. For example, if every π^S in π_∗ is one of the scaled harmonic priors in (35), such risk reduction occurs when β_S is close to 0 for any S ∈ Ω for which q_S ≥ 3. Thus, the potential for risk reduction using p̂_∗ is far greater than the risk reduction using an arbitrarily chosen p̂_{π^S}.

We should also note that the allocation of risk reduction by p̂_∗ is in part determined by the w_S weights in (38). Because each p̂(S|x) is so adaptive through m_{π_S}, choosing the weights to be uniform should be adequate. However, one may also want to consider some of the more refined suggestions for choosing such weights for the multiple shrinkage estimators in George (1986b).
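To illustrate how the posterior weights (38) adapt to the data, the sketch below (not from the paper) replaces the scaled harmonic priors by conjugate normal priors N(0, τ²I) on β_S, so that each marginal m_{π_S} has a closed form; the design, coefficients, τ², uniform prior weights and the function name model_weights are all illustrative choices.

    # Sketch of the multiple shrinkage weights (38) with illustrative conjugate priors on beta_S.
    import numpy as np
    from itertools import combinations
    from scipy.stats import multivariate_normal as mvn

    def model_weights(x, A, subsets, tau2=1.0):
        Sigma_A = np.linalg.inv(A.T @ A)
        beta_x = Sigma_A @ A.T @ x
        w_prior = 1.0 / len(subsets)                     # uniform prior weights w_S
        scores = []
        for S in subsets:
            S = list(S)
            m_S = mvn(mean=np.zeros(len(S)),
                      cov=Sigma_A[np.ix_(S, S)] + tau2 * np.eye(len(S))).pdf(beta_x[S])
            scores.append(w_prior * m_S)                 # w_S * m_{pi_S}(beta_hat_{S,x}; Sigma_{A,S})
        scores = np.array(scores)
        return scores / scores.sum()                     # posterior weights p_hat(S | x)

    rng = np.random.default_rng(4)
    A = rng.standard_normal((30, 4))
    beta = np.array([0.0, 0.0, 4.0, -3.0])               # the first two coefficients are irrelevant
    x = A @ beta + rng.standard_normal(30)
    subsets = [S for r in range(1, 5) for S in combinations(range(4), r)]
    weights = model_weights(x, A, subsets)
    print(dict(zip(subsets, np.round(weights, 3))))      # most weight falls on subsets of {0, 1}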

The potential for a multiple shrinkage p̂_∗ to offer meaningful risk reduction in many different regions of the parameter space is greatly enhanced when it is minimax, and therefore can only improve on the noninformative minimax p̂_U. The following two results show that such minimaxity and dominance of p̂_U can be obtained. We then conclude with an explicit example of such domination.

Theorem 3. Suppose for all S ∈ Ω, m_{π_S}(z; W_S W_S′) is finite for all z. Let H(f(z_1, ..., z_{q_S})) be the Hessian matrix of f. If for all S ∈ Ω,

trace{H(m_{π_S}(z; W_S V_{S,w} W_S′))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1],   (39)

then p̂_∗ in (37) is minimax under R_KL. Furthermore, p̂_∗ dominates p̂_U unless π_∗ = π_U.

Proof. From (31), (37), and (38), it is straightforward to show that p̂_∗ can be reexpressed as

p̂_∗(y|x) = [Σ_{S∈Ω} w_S m_{π_S}(β̂_{S,x,y}; Σ_{C,S}) / Σ_{S∈Ω} w_S m_{π_S}(β̂_{S,x}; Σ_{A,S})] p̂_U(y|x).   (40)

Because p̂_∗ is of the same form as p̂_{π^S} in (31), namely a ratio of marginals times p̂_U, we can apply the same arguments leading to the proofs of Theorems 1 and 2. These steps show that a sufficient condition for the minimaxity and dominance claims is

Σ_{S∈Ω} w_S trace{H(m_{π_S}(z; W_S V_{S,w} W_S′))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1].

This condition is implied if (39) holds for all S ∈ Ω.

The next result follows using the same argument leading to Corollaries 1 and 2.

Corollary 3. Suppose for all S ∈ Ω, m_{π_S}(z; W_S W_S′) is finite for all z. Then p̂_∗ in (37) will be minimax if for all S ∈ Ω, trace{H(π_S(β_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 a.e. Furthermore, p̂_∗ will dominate p̂_U unless π_∗ = π_U.

Example 1 (continued). (Scaled harmonic prior). For each S ∈ Ω, let π^S(β) be the scaled harmonic prior given by (35) when q_S ≥ 3, and by π^S(β) ≡ 1 when q_S < 3. When A = B, by Corollary 3, p̂_∗ under these priors will be minimax and will dominate p̂_U if q_S ≥ 3 for at least one S ∈ Ω.

5 Predictive Density Estimation Near Linear Subspaces

The harmonic prior predictive estimator p̂_{π^S}(y|x) described in Section 3, and incorporated into the multiple shrinkage predictive estimators p̂_∗(y|x) in Section 4, offers risk reduction in the region of the parameter space where β_S is close to 0. This can be seen as a special case of the following general construction of a predictive estimator that obtains risk reduction when β is close to a linear subspace of R^p.

Suppose one would like to obtain a predictive density estimator with greatest risk reduction in the region where β is close to a linear subspace G ⊂ R^p. In the case of p̂_{π^S}(y|x), G would be the subspace of all β ∈ R^p for which β_S = 0. Alternately, if risk reduction was desired, say, when the components of β were close to equal, then one would consider G = [1], the subspace spanned by (1, ..., 1)′. Let P_G β ≡ argmin_{g∈G} ‖β − g‖ be the projection of β onto G, and define β_G ≡ (I − P_G)β to be the projection of β onto the orthogonal complement of G. For the construction of p̂_{π^S}(y|x) in Section 3, β_G = β_S. For G = [1], β_G = (β − β̄), where β̄ is the vector with components all equal to (1/p) Σ_{i=1}^p β_i.

The main idea behind the general construction is to use a prior that leads to shrinkage of β_G towards 0 while leaving the remainder of β untouched. This can be obtained by using a prior of the form

π^G(β) = π_G(β_G),   (41)

which is effectively uniform on (β − β_G). This is a special case of the prior over β in (30). Note that since β_G is q_G ≡ (p − dim(G)) dimensional, π_G is a function from R^{q_G} to R.

Analogous to the construction in Lemma 6, predictive density estimators p̂_{π^G} corresponding to priors of the form π^G in (41) can be expressed as

p̂_{π^G}(y|x) = [m_{π_G}(β̂_{G,x,y}; Σ_{C,G}) / m_{π_G}(β̂_{G,x}; Σ_{A,G})] p̂_U(y|x),   (42)

where p̂_U is defined by (8), β̂_{G,x} = (I − P_G)β̂_x and β̂_{G,x,y} = (I − P_G)β̂_{x,y} are the projections of β̂_x and β̂_{x,y} onto the orthogonal complement of G, respectively, and Σ_{A,G} and Σ_{C,G} are the covariance matrices of β̂_{G,x} and β̂_{G,x,y}, respectively. It is straightforward to see that Theorem 2 and Corollary 2 and their proofs can be extended to get conditions on π_G(β_G) for such a p̂_{π^G} to be minimax and to dominate p̂_U. (Simply substitute the symbol G for the symbol S throughout.)
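The projection quantities used in this construction are straightforward to compute. The following sketch (not from the paper) does so for the "nearly equal coefficients" subspace G = [1] mentioned above; the function name projector is an illustrative choice.

    # Sketch: projector P_G onto G = span{(1,...,1)'} and the shrunk component beta_G = (I - P_G) beta.
    import numpy as np

    def projector(G_basis):
        """Orthogonal projector onto the column span of G_basis."""
        Q, _ = np.linalg.qr(G_basis)
        return Q @ Q.T

    p = 4
    P_G = projector(np.ones((p, 1)))
    beta = np.array([2.1, 1.9, 2.0, 2.2])
    beta_G = (np.eye(p) - P_G) @ beta     # deviations from the common mean; small when components are nearly equal
    print(P_G @ beta, beta_G)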

Example 1 (continued). Extending (35), consider the following scaled harmonic prior on β. For q_G ≥ 3, let

π^G(β) = π_G(β_G) ∝ ‖diag(η_1, ..., η_{q_G})^{−1/2} O_G′ β_G‖^{−(q_G−2)},   (43)

where η_1, ..., η_{q_G} > 0 are the eigenvalues of Σ_{A,G}, and for q_G < 3, let π^G(β) ≡ 1. Note that when q_G ≥ 3 the resulting p̂_{π^G} shrinks p̂_U towards G, offering reduced risk when β is close to G. By the extension of Corollary 2, such a p̂_{π^G} will be minimax and dominate p̂_U when A = B and q_G ≥ 3.

Finally, following the development in Section 4, one can easily incorporate such p̂_{π^G} into multiple shrinkage predictive estimators p̂_∗. Letting Ω be a set of subspaces G under consideration, construct the mixture prior

π_∗(β) = Σ_{G∈Ω} w_G π^G(β),   (44)

where for each G ∈ Ω, π^G is the designated prior of the form (41), and w_G ∈ [0, 1] is such that Σ_{G∈Ω} w_G = 1. This prior yields the multiple shrinkage predictive estimator

p̂_∗(y|x) = Σ_{G∈Ω} p̂(G|x) p̂_{π^G}(y|x),   (45)

where each p̂_{π^G} is given by (42), and each posterior probability is of the form

p̂(G|x) = w_G m_{π_G}(β̂_{G,x}; Σ_{A,G}) / Σ_{G′∈Ω} w_{G′} m_{π_{G′}}(β̂_{G′,x}; Σ_{A,G′}).   (46)

Here, p̂_∗(y|x) is an adaptive convex combination of the individual shrinkage predictive estimates p̂_{π^G}, and offers risk reduction whenever β_G is near the region where π_G is largest for any G ∈ Ω. Thus, the potential for risk reduction using p̂_∗ is far greater than the risk reduction using an arbitrarily chosen p̂_{π^G}. It is straightforward to see that Theorem 3 and Corollary 3 and their proofs can be extended to get conditions for such a p̂_∗(y|x) to be minimax and dominate p̂_U. (Simply substitute the symbol G for the symbol S throughout.)

Example 1 (continued). (Scaled harmonic prior). For each G ∈ Ω, let π^G(β) be the scaled harmonic prior given by (43) when q_G ≥ 3, and by π^G(β) ≡ 1 when q_G < 3. When A = B, by the extension of Corollary 3, p̂_∗ for these priors will be minimax and will dominate p̂_U if q_G ≥ 3 for at least one G ∈ Ω.

References

Aitchison, J. (1975). Goodness of Prediction Fit. Biometrika, 62.

Brown, L.D. (1971). Admissible Estimators, Recurrent Diffusions, and Insoluble Boundary Value Problems. Annals of Mathematical Statistics, 42.

Brown, L.D., George, E.I. and Xu, X. (2006). Admissible Predictive Density Estimation. Submitted.

George, E.I. (1986a). Minimax Multiple Shrinkage Estimation. Annals of Statistics, 14.

George, E.I. (1986b). Combining Minimax Shrinkage Estimators. Journal of the American Statistical Association, 81.

George, E.I. (1986c). A Formal Bayes Multiple Shrinkage Estimator. Communications in Statistics: Part A - Theory and Methods (Special issue "Stein-type Multivariate Estimation"), 15(7).

George, E.I., Liang, F. and Xu, X. (2006). Improved Minimax Prediction Under Kullback-Leibler Loss. Annals of Statistics, 34.

Liang, F. (2002). Exact Minimax Procedures for Predictive Density Estimation and Data Compression. Ph.D. dissertation, Department of Statistics, Yale University.

Liang, F. and Barron, A. (2004). Exact Minimax Strategies for Predictive Density Estimation, Data Compression and Model Selection. IEEE Transactions on Information Theory, 50.

Steele, J.M. (2001). Stochastic Calculus and Financial Applications. Springer, New York.

Stein, C. (1974). Estimation of the Mean of a Multivariate Normal Distribution. In Proceedings of the Prague Symposium on Asymptotic Statistics, Ed. J. Hajek. Prague: Universita Karlova.

Stein, C. (1981). Estimation of a Multivariate Normal Mean. Annals of Statistics, 9.


Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

6.1 Main properties of Shannon entropy. Let X be a random variable taking values x in some alphabet with probabilities.

6.1 Main properties of Shannon entropy. Let X be a random variable taking values x in some alphabet with probabilities. Chapter 6 Quantum entropy There is a notion of entropy which quantifies the amount of uncertainty contained in an ensemble of Qbits. This is the von Neumann entropy that we introduce in this chapter. In

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

http://www.math.uah.edu/stat/markov/.xhtml 1 of 9 7/16/2009 7:20 AM Virtual Laboratories > 16. Markov Chains > 1 2 3 4 5 6 7 8 9 10 11 12 1. A Markov process is a random process in which the future is

More information

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality.

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality. 88 Chapter 5 Distribution Theory In this chapter, we summarize the distributions related to the normal distribution that occur in linear models. Before turning to this general problem that assumes normal

More information

CS 246 Review of Linear Algebra 01/17/19

CS 246 Review of Linear Algebra 01/17/19 1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

Linear Models Review

Linear Models Review Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

Exact Minimax Predictive Density Estimation and MDL

Exact Minimax Predictive Density Estimation and MDL Exact Minimax Predictive Density Estimation and MDL Feng Liang and Andrew Barron December 5, 2003 Abstract The problems of predictive density estimation with Kullback-Leibler loss, optimal universal data

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Minimax design criterion for fractional factorial designs

Minimax design criterion for fractional factorial designs Ann Inst Stat Math 205 67:673 685 DOI 0.007/s0463-04-0470-0 Minimax design criterion for fractional factorial designs Yue Yin Julie Zhou Received: 2 November 203 / Revised: 5 March 204 / Published online:

More information

Ridge Regression Revisited

Ridge Regression Revisited Ridge Regression Revisited Paul M.C. de Boer Christian M. Hafner Econometric Institute Report EI 2005-29 In general ridge (GR) regression p ridge parameters have to be determined, whereas simple ridge

More information