Predictive Density Estimation for Multiple Regression
Edward I. George and Xinyi Xu

March 2006, Revised January 2007

Abstract

Suppose we observe X ~ N_m(Aβ, σ²I) and would like to estimate the predictive density p(y|β) of a future Y ~ N_n(Bβ, σ²I). Evaluating predictive estimates p̂(y|x) by Kullback-Leibler loss, we develop and evaluate Bayes procedures for this problem. We obtain general sufficient conditions for minimaxity and dominance of the noninformative uniform prior Bayes procedure. We extend these results to situations where only a subset of the predictors in A is thought to be potentially irrelevant. We then consider the more realistic situation where there is model uncertainty and this subset is unknown. For this situation we develop multiple shrinkage predictive estimators and obtain general minimaxity and dominance conditions. Finally, we provide an explicit example of a minimax multiple shrinkage predictive estimator based on scaled harmonic priors.

Keywords: BAYESIAN PREDICTION; MODEL UNCERTAINTY; MULTIPLE SHRINKAGE; PRIOR DISTRIBUTION; SHRINKAGE ESTIMATION.

Edward I. George is Professor, Statistics Department, The Wharton School, 3730 Walnut Street 400 JMHH, Philadelphia, PA, edgeorge@wharton.upenn.edu. Xinyi Xu is Assistant Professor, Department of Statistics, The Ohio State University, Columbus, OH, xinyi@stat.ohio-state.edu. We would like to acknowledge Larry Brown, Feng Liang, Linda Zhao and three referees for their helpful suggestions. This work was supported by various NSF grants, the most recent a DMS grant.
1 Introduction

We begin with the canonical normal linear model setup

X ~ N_m(Aβ, σ²I),   (1)

where X is an m-vector of m observations, A is a full rank, fixed m x p matrix of p potential predictors where m ≥ p, and β is a p-vector of unknown regression coefficients. Based on observing X = x, we consider the problem of estimating the predictive density p(y|β) of a future n-vector Y where

Y ~ N_n(Bβ, σ²I).   (2)

Here B is a fixed n x p matrix of the same p potential predictors in A, although with possibly different values. We also assume that X and Y are conditionally independent given β. Finally, we assume σ² is known, and without loss of generality set σ² = 1 throughout.

For each value of x, we evaluate a predictive estimate p̂(y|x) of p(y|β) by the well-known Kullback-Leibler (KL) loss

L(β, p̂(y|x)) = ∫ p(y|β) log [p(y|β)/p̂(y|x)] dy.   (3)

The overall quality of the procedure p̂ = p̂(y|x) for each β is then conveniently summarized by the KL risk

R_KL(β, p̂) = ∫ p(x|β) L(β, p̂(y|x)) dx.   (4)

Letting β̂_x = (A'A)^{-1}A'x be the traditional least squares estimate of β based on x, it is tempting to consider the plug-in predictive estimate p̂_plug-in(y|β̂_x), which simply substitutes β̂_x for β in p(y|β). However, as we show in Section 2 by extending the arguments of Aitchison (1975), the formal Bayes predictive estimate

p̂_U(y|x) = ∫ p(x|β) p(y|β) dβ / ∫ p(x|β) dβ   (5)

has smaller KL risk than p̂_plug-in(y|β̂_x) for every β. Thus, p̂_plug-in(y|β̂_x) should be ruled out and we turn our focus to Bayes procedures. For a prior π on β, the Bayes predictive estimator p̂_π(y|x) is given by

p̂_π(y|x) = ∫ p(x|β) p(y|β) π(β) dβ / ∫ p(x|β) π(β) dβ.   (6)

It also follows from the arguments of Aitchison (1975) that for proper π, p̂_π minimizes the average risk r_π(p̂) = ∫ R_KL(β, p̂) π(β) dβ. Note that p̂_U in (5) is the formal Bayes estimate under the improper uniform noninformative density π_U(β), and would seem to be a good default procedure. Indeed, p̂_U has constant risk and is minimax under KL loss, see Liang (2002) and Liang
and Barron (2004). But surprisingly, as we will show, in many cases p̂_U itself can be uniformly dominated in terms of KL risk by other Bayes predictive estimators.

In Section 2, we develop general conditions under which p̂_π will be minimax and uniformly dominate p̂_U in terms of the KL risk (4) for the multiple regression prediction problem. Our results can be seen as a substantial generalization of George, Liang and Xu (2006), who considered the special case of this problem when X ~ N_m(µ, σ_x²I) and Y ~ N_m(µ, σ_y²I), where µ is the common unknown multivariate normal mean. Moving further away from this common mean setup, we proceed in Section 3 to extend these results to the setting where only a subset of the p predictors is considered to be potentially irrelevant. In Section 4, we consider the more realistic model uncertainty setting where such a subset is unknown, and develop minimax multiple shrinkage predictive densities that adaptively shrink towards the model most favored by the data. In Section 5, we conclude by showing how our results can be extended for minimax shrinkage prediction towards any linear subspace. Although we do not consider the issue of admissibility in this paper, it may be of interest to note that for the multivariate normal prediction problem above, Brown, George and Xu (2006) recently established that all admissible predictive densities are Bayes procedures.

2 Priors for Minimax Predictive Estimation

In this section, we develop general conditions on π for p̂_π in (6) to uniformly dominate p̂_U in (5) under KL risk (4). The minimaxity of such p̂_π will then follow immediately from the minimaxity of p̂_U.

We begin by establishing some convenient notation. As indicated previously, we use β̂_x = (A'A)^{-1}A'x to denote the least squares estimate of β based on x. Although y is not observed, it will be useful to use

β̂_{x,y} = (C'C)^{-1}C'(x', y')',  where  C = (A', B')' is the (m+n) x p matrix obtained by stacking A above B,   (7)

to denote the least squares estimate of β based on x and y.
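As a concrete numerical sketch of the setup above (all data simulated, with hypothetical designs and dimensions, and σ² = 1), the following snippet computes the least squares estimate β̂_x, the uniform-prior predictive density p̂_U(y|x), which is the N_n(Bβ̂_x, I + BΣ_A B') density derived in Lemma 1 below, and the two constant KL risks from Lemma 1, (1/2) Σ log(e_i + 1) for p̂_U versus (1/2) Σ e_i for the plug-in rule, where e_i are the eigenvalues of (A'A)^{-1}B'B.

```python
import numpy as np

# Hypothetical simulation of the setup (1)-(2) with sigma^2 = 1.
rng = np.random.default_rng(0)
m, n, p = 20, 5, 3
A = rng.standard_normal((m, p))        # m x p design for the observed X
B = rng.standard_normal((n, p))        # n x p design for the future Y
beta = np.array([1.0, -2.0, 0.5])      # true coefficients (unknown in practice)
x = A @ beta + rng.standard_normal(m)  # one draw of X ~ N_m(A beta, I)

Sigma_A = np.linalg.inv(A.T @ A)       # covariance of the LS estimate
beta_hat = Sigma_A @ A.T @ x           # beta_hat_x = (A'A)^{-1} A'x

# p_U(y|x) is the N_n(B beta_hat, Psi) density with Psi = I + B Sigma_A B'
mean_U = B @ beta_hat
Psi = np.eye(n) + B @ Sigma_A @ B.T

# Constant KL risks from Lemma 1: (1/2) sum log(e_i + 1) for p_U versus
# (1/2) sum e_i for the plug-in rule; log(1 + e) < e for e > 0 shows that
# p_U uniformly dominates the plug-in estimator.
e = np.linalg.eigvals(Sigma_A @ (B.T @ B)).real
risk_U, risk_plug = 0.5 * np.log1p(e).sum(), 0.5 * e.sum()
print(risk_U < risk_plug)  # True
```

Since both risks are free of β, the dominance here is uniform over the whole parameter space, which is what rules the plug-in rule out.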
Note that β̂_x ~ N_p(β, Σ_A) and β̂_{x,y} ~ N_p(β, Σ_C), where for notational convenience throughout we let Σ_A = (A'A)^{-1} and Σ_C = (C'C)^{-1}. It will also be useful to let

RSS_x = ||x − Aβ̂_x||²  and  RSS_{x,y} = ||(x', y')' − Cβ̂_{x,y}||²

denote the corresponding residual sums of squares (RSS). In terms of this notation, we have the following.

Lemma 1. The uniform prior predictive estimate p̂_U in (5) can be expressed as

p̂_U(y|x) = (2π)^{-n/2} (|A'A|/|C'C|)^{1/2} exp{−(RSS_{x,y} − RSS_x)/2}
= (2π)^{-n/2} |Ψ|^{-1/2} exp{−(y − Bβ̂_x)'Ψ^{-1}(y − Bβ̂_x)/2},   (8)

where Ψ = I + BΣ_A B'. Moreover, the KL risk of p̂_U is uniformly smaller than that of the plug-in estimator p̂_plug-in(y|β̂_x).

Proof. Since β̂_x ~ N_p(β, Σ_A), the posterior of β under the uniform prior is β|β̂_x ~ N_p(β̂_x, Σ_A). It follows that the posterior of Bβ is Bβ|β̂_x ~ N_n(Bβ̂_x, BΣ_A B'), and thus the predictive distribution is Y|β̂_x ~ N_n(Bβ̂_x, I + BΣ_A B').

To calculate the risk of p̂_U, let Ĥ_A = A(A'A)^{-1}A' denote the hat matrix based on x and Ĥ_C = C(C'C)^{-1}C' denote the hat matrix based on both x and y. It is easy to see that

R_KL(β, p̂_U) = ∫∫ p(x|β) p(y|β) log [p(y|β)/p̂_U(y|x)] dx dy
= (1/2) log(|C'C|/|A'A|) − n/2 + (1/2) ∫∫ p(x|β) p(y|β) [RSS_{x,y} − RSS_x] dx dy
= (1/2) log(|C'C|/|A'A|) − n/2 + (1/2) [trace(I_{m+n} − Ĥ_C) − trace(I_m − Ĥ_A)]
= (1/2) log(|C'C|/|A'A|) − n/2 + n/2
= (1/2) log(|C'C|/|A'A|) = (1/2) Σ_{i=1}^p log(e_i + 1),

where e_1, ..., e_p are the eigenvalues of (A'A)^{-1}B'B. Moreover,

R_KL(β, p̂_plug-in(y|β̂_x)) = ∫∫ p(x|β) p(y|β) log [p(y|β)/p̂_plug-in(y|β̂_x)] dx dy
= (1/2) ∫∫ p(x|β) p(y|β) [ ||y − Bβ̂_x||² − ||y − Bβ||² ] dx dy
= (1/2) ∫ p(x|β) ||Bβ̂_x − Bβ||² dx
= (1/2) trace(B(A'A)^{-1}B') = (1/2) Σ_{i=1}^p e_i.

That p̂_U dominates p̂_plug-in(y|β̂_x) follows from the fact that log(x + 1) < x for any x > 0.

Risk comparisons of a Bayes predictive density p̂_π with p̂_U are greatly facilitated by the following representation of p̂_π in terms of p̂_U. An analogous representation of the posterior mean in terms of the MLE, which simplifies multivariate normal mean estimation under quadratic risk, was proposed by Brown (1971). For our representation, it will be useful to denote the marginal distribution of
Z|β ~ N_p(β, Σ) under π by

m_π(z; Σ) = ∫ p(z|β) π(β) dβ.   (9)

Thus, the marginal distributions of β̂_x and β̂_{x,y} under π are denoted by m_π(β̂_x; Σ_A) and m_π(β̂_{x,y}; Σ_C) respectively.

Lemma 2. If m_π(z; Σ) is finite for all z and Σ, then p̂_π(y|x) is a proper probability distribution. Furthermore, it can be expressed as

p̂_π(y|x) = [m_π(β̂_{x,y}; Σ_C) / m_π(β̂_x; Σ_A)] p̂_U(y|x),   (10)

where p̂_U is defined by (8).

Proof. When m_π(z; Σ) is finite for all z and Σ, that p̂_π(y|x) is a proper probability distribution follows from integrating with respect to y and switching the order of integration. Next, straightforward calculation yields

∫ p(x|β) π(β) dβ = ∫ (2π)^{-m/2} exp{−||x − Aβ||²/2} π(β) dβ
= (2π)^{-m/2} exp{−||x − Aβ̂_x||²/2} ∫ exp{−||Aβ̂_x − Aβ||²/2} π(β) dβ
= (2π)^{-(m-p)/2} |A'A|^{-1/2} exp{−RSS_x/2} m_π(β̂_x; Σ_A).   (11)

Similarly, we obtain

∫ p(x|β) p(y|β) π(β) dβ = (2π)^{-(m+n-p)/2} |C'C|^{-1/2} exp{−RSS_{x,y}/2} m_π(β̂_{x,y}; Σ_C).   (12)

The representation (10) follows immediately from (6), (11) and (12).

The next result provides a representation of the difference between the KL risks of p̂_U and p̂_π in terms of the marginal distributions of β̂_x and β̂_{x,y}.

Lemma 3. The difference between the KL risks of p̂_U and p̂_π is given by

R_KL(β, p̂_U) − R_KL(β, p̂_π) = E_{β,Σ_C} log m_π(β̂_{x,y}; Σ_C) − E_{β,Σ_A} log m_π(β̂_x; Σ_A),   (13)

where E_{β,Σ}(·) stands for expectation with respect to the N_p(β, Σ) distribution.
Proof. The KL risk difference between p̂_U and p̂_π can be expressed as

R_KL(β, p̂_U) − R_KL(β, p̂_π) = ∫∫ p(x|β) p(y|β) log [p̂_π(y|x)/p̂_U(y|x)] dx dy
= ∫∫ p(x|β) p(y|β) log [m_π(β̂_{x,y}; Σ_C)/m_π(β̂_x; Σ_A)] dx dy,

where the last equality follows from Lemma 2. The result then follows from the change of variable theorem.

To exploit the representation (13), we proceed to transform the distributions to canonical form. Since Σ_A and Σ_C are both symmetric and positive definite, there exists an invertible p x p matrix W such that

Σ_A = W'W  and  Σ_C = W'Σ_D W,   (14)

where

Σ_D = diag(d_1, ..., d_p).   (15)

Since Σ_C^{-1} = C'C = A'A + B'B and B'B is nonnegative definite, d_i ∈ (0, 1] for all 1 ≤ i ≤ p, with at least one d_i < 1. Finally, let µ = (W')^{-1}β, µ̂_x = (W')^{-1}β̂_x, and µ̂_{x,y} = (W')^{-1}β̂_{x,y}, so that

µ̂_x ~ N_p(µ, I)  and  µ̂_{x,y} ~ N_p(µ, Σ_D).   (16)

Lemma 4. Let π_W(µ) = π(W'µ). Then the difference between the KL risks of p̂_U and p̂_π is given by

R_KL(β, p̂_U) − R_KL(β, p̂_π) = E_{µ,Σ_D} log m_{π_W}(µ̂_{x,y}; Σ_D) − E_{µ,I} log m_{π_W}(µ̂_x; I),   (17)

where E_{µ,Σ}(·) stands for expectation with respect to the N_p(µ, Σ) distribution.

Proof. The result follows by transforming the expressions in Lemma 3:

E_{β,Σ_A} log m_π(β̂_x; Σ_A) = ∫ p(β̂_x|β) log [∫ p(β̂_x|β') π(β') dβ'] dβ̂_x
= ∫ p(µ̂_x|µ) log [∫ p(µ̂_x|µ') π_W(µ') dµ'] dµ̂_x = E_{µ,I} log m_{π_W}(µ̂_x; I).

Similarly, E_{β,Σ_C} log m_π(β̂_{x,y}; Σ_C) = E_{µ,Σ_D} log m_{π_W}(µ̂_{x,y}; Σ_D). Thus, (17) equals (13).
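The factorization (14)-(15) can be computed explicitly. The following sketch (with hypothetical simulated designs) constructs one valid choice of W from a Cholesky factor of Σ_A and an eigendecomposition, and illustrates that the d_i lie in (0, 1].

```python
import numpy as np

# One concrete construction of the canonical transformation (14)-(15):
# write Sigma_A = L L' (Cholesky) and eigendecompose
# L^{-1} Sigma_C L'^{-1} = O diag(d) O'.  Then W = O' L' satisfies
# W'W = L O O' L' = Sigma_A and W' diag(d) W = L O diag(d) O' L' = Sigma_C.
rng = np.random.default_rng(1)
m, n, p = 12, 4, 3
A = rng.standard_normal((m, p))
B = rng.standard_normal((n, p))
Sigma_A = np.linalg.inv(A.T @ A)
Sigma_C = np.linalg.inv(A.T @ A + B.T @ B)

L = np.linalg.cholesky(Sigma_A)              # Sigma_A = L L'
Linv = np.linalg.inv(L)
d, O = np.linalg.eigh(Linv @ Sigma_C @ Linv.T)
W = O.T @ L.T                                # invertible p x p matrix
Sigma_D = np.diag(d)                         # d_i in (0, 1]
```

Because B'B is positive definite here, every d_i is strictly less than 1, matching the remark below (15).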
We now proceed to find conditions on m_π for which the risk difference (17) is nonnegative for all µ. Because p̂_U is minimax, this will then imply that p̂_π is minimax under the prior π corresponding to π_W. Now for w ∈ [0, 1], let

V_w = wI + (1 − w)Σ_D,   (18)

where Σ_D is defined as in (15). Next, for Z ~ N_p(µ, V_w), let

h_µ(V_w) = E_{µ,V_w} log m_{π_W}(Z; V_w).   (19)

Thus, we may rewrite (17) as

R_KL(β, p̂_U) − R_KL(β, p̂_π) = h_µ(V_0) − h_µ(V_1).   (20)

Since h_µ(V_w) is continuous in w, it suffices to derive conditions on m_π such that (∂/∂w) h_µ(V_w) ≤ 0 for all µ and w ∈ [0, 1]. Letting v_i be the ith diagonal element of V_w, we have by the chain rule

(∂/∂w) h_µ(V_w) = Σ_{i=1}^p (∂h_µ/∂v_i)(∂v_i/∂w) = Σ_{i=1}^p (1 − d_i)(∂h_µ/∂v_i).   (21)

The following result provides unbiased estimates of the components of (21) which, when combined with (17), will be seen to play a key role in establishing sufficient conditions on m_π for p̂_π to be minimax and to dominate p̂_U. As noted by George, Liang and Xu (2006), these estimates are very similar to the unbiased estimates of risk for the estimation of a multivariate mean under squared error loss; see Stein (1974, 1981).

Lemma 5. If m_{π_W}(z; I) is finite for all z, then for any 0 ≤ w ≤ 1, m_{π_W}(z; V_w) is finite. Moreover,

∂h_µ/∂v_i = E_{µ,V_w} [ (∂²/∂z_i²) m_{π_W}(Z; V_w) / m_{π_W}(Z; V_w) − (1/2) ((∂/∂z_i) log m_{π_W}(Z; V_w))² ]   (22)
= E_{µ,V_w} [ 2 (∂²/∂z_i²) √(m_{π_W}(Z; V_w)) / √(m_{π_W}(Z; V_w)) ].   (23)

Proof. When m_{π_W}(z; I) is finite for all z, it is easy to check that for any fixed z and any 0 ≤ w ≤ 1,

m_{π_W}(z; V_w) ≤ (Π_{i=1}^p d_i^{-1/2}) m_{π_W}(z; I) < ∞.

Next, letting Z* = V_w^{-1/2}(Z − µ) ~ N_p(0, I), we have

∂h_µ/∂v_i = E (∂/∂v_i) log m_{π_W}(V_w^{1/2} Z* + µ; V_w) = E [ (∂/∂v_i) m_{π_W}(V_w^{1/2} Z* + µ; V_w) / m_{π_W}(V_w^{1/2} Z* + µ; V_w) ],   (24)

where, writing z = V_w^{1/2} z* + µ,

(∂/∂v_i) m_{π_W}(V_w^{1/2} z* + µ; V_w) = ∫ [ (z_i − µ'_i)²/(2v_i²) − 1/(2v_i) − (z_i − µ'_i)(z_i − µ_i)/(2v_i²) ] p(z|µ'; V_w) π_W(µ') dµ'.

Making use of the well-known univariate heat equation

(∂/∂v_i) m_{π_W}(z; V_w) = (1/2) (∂²/∂z_i²) m_{π_W}(z; V_w),   (25)

see for example Steele (2001), and the Brown (1971) representation E_π(µ'_i | z) = z_i + v_i (∂/∂z_i) log m_{π_W}(z; V_w), (22) and (23) can be verified via the same steps as in the proof of Lemma 3 in George, Liang and Xu (2006).

Now we can obtain sufficient conditions for a Bayes procedure p̂_π to be minimax by combining (20), (21) and Lemma 5. The following result provides a substantial generalization of Theorem 1 of George, Liang and Xu (2006).

Theorem 1. Suppose m_π(z; W'W) is finite for all z, with the invertible matrix W defined as in (14). Let H(f(z_1, ..., z_p)) be the Hessian matrix of f.
(i) If trace{H(m_π(z; W'V_w W))[Σ_A − Σ_C]} ≤ 0 for all w ∈ [0, 1], then p̂_π is minimax under R_KL. Furthermore, p̂_π dominates p̂_U unless π = π_U.
(ii) If trace{H(√(m_π(z; W'V_w W)))[Σ_A − Σ_C]} ≤ 0 for all w ∈ [0, 1], then p̂_π is minimax under R_KL. Furthermore, p̂_π dominates p̂_U if for all w ∈ [0, 1], this inequality is strict on a set of positive Lebesgue measure.

Proof. To prove the minimaxity of p̂_π under R_KL, it suffices to show that (22) or (23) is nonpositive, because by (21) that would imply the nonnegativity of (20). Dominance would further follow by showing that (22) or (23) is also strictly negative on a set of positive Lebesgue measure. Noting that m_{π_W}(z; V_w) = m_π(W'z; W'V_w W), and letting z̃ = W'z, we obtain

Σ_{i=1}^p (1 − d_i) (∂²/∂z_i²) m_{π_W}(z; V_w) = Σ_{i=1}^p (1 − d_i) (∂²/∂z_i²) m_π(z̃; W'V_w W)
= trace{H(m_π(z̃; W'V_w W)) W'(I − Σ_D)W}
= trace{H(m_π(z̃; W'V_w W))[Σ_A − Σ_C]}.   (26)

Similarly,

Σ_{i=1}^p (1 − d_i) (∂²/∂z_i²) √(m_{π_W}(z; V_w)) = trace{H(√(m_π(z̃; W'V_w W)))[Σ_A − Σ_C]}.   (27)

Now (i) and (ii) follow immediately from (22), (23), (26) and (27).

The next result follows using the fact that Σ_{i=1}^p (1 − d_i)(∂²/∂z_i²) m_{π_W}(z; V_w) ≤ 0 whenever Σ_{i=1}^p (1 − d_i)(∂²/∂µ_i²) π_W(µ) ≤ 0.

Corollary 1. Suppose m_π(z; W'W) is finite for all z. Then p̂_π will be minimax if trace{H(π(β))[Σ_A − Σ_C]} ≤ 0 a.e. Furthermore, p̂_π will dominate p̂_U unless π = π_U.

Example 1. (Scaled harmonic prior.) Suppose A = B. In this case Σ_C = Σ_A/2, so

trace{H(π(β))[Σ_A − Σ_C]} = (1/2) trace{H(π(β)) Σ_A} = (1/2) trace{H(π(β)) W'W} = (1/2) ∇²π_W(µ).   (28)

Let π_W(µ) ∝ ||µ||^{-(p-2)} when p ≥ 3, and π_W(µ) ≡ 1 when p < 3. Note that π_W is harmonic, i.e. ∇²π_W(µ) = 0 for µ ≠ 0, and not equal to π_U when p ≥ 3. For p ≥ 3, the corresponding prior on β is a scaled harmonic prior

π(β) ∝ ||W^{-1}β||^{-(p-2)},   (29)

where η_1, ..., η_p > 0 are the eigenvalues of Σ_A, and for p < 3, π(β) ≡ 1. (The expression (29) uses the fact that W can be taken to be the symmetric square root O diag(η_1^{1/2}, ..., η_p^{1/2}) O' of Σ_A, where O is an orthonormal matrix.) By Corollary 1 and (28), the predictive estimator p̂_π under this prior is minimax and dominates p̂_U when p ≥ 3. It is easy to check that these results hold when A = rB for any known constant r ≠ 0.
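The harmonicity driving Example 1 is easy to check numerically. The following sketch evaluates a finite-difference Laplacian of π_W(µ) = ||µ||^{-(p-2)} at a hypothetical point away from the origin for p = 4; by (28), this vanishing Laplacian is exactly what makes the trace condition of Corollary 1 hold when A = B.

```python
import numpy as np

# Finite-difference check that pi_W(mu) = ||mu||^{-(p-2)} is harmonic for
# p >= 3 (its Laplacian vanishes away from the origin).  The evaluation
# point mu0 is arbitrary and hypothetical.
p = 4
f = lambda mu: np.sum(mu**2) ** (-(p - 2) / 2)   # ||mu||^{-(p-2)}

mu0 = np.array([1.0, -0.5, 2.0, 0.3])
h = 1e-4
lap = sum(
    (f(mu0 + h * e) - 2 * f(mu0) + f(mu0 - h * e)) / h**2
    for e in np.eye(p)
)
print(abs(lap) < 1e-4)  # True: the Laplacian is numerically zero
```

For p < 3 the exponent is nonnegative and the prior degenerates to the uniform π_U, which is why the example requires p ≥ 3 for strict dominance.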
3 Predictive Density Estimation Near Subset Models

When a prior centered at 0, such as the scaled harmonic prior (29), is applied to β, the risk reduction of p̂_π over p̂_U is greatest when all the components of β are close to 0. Thus, it would be sensible to use this prior if it was felt that all p predictors in A and B were potentially irrelevant. However, such a prior would be ineffectual if only a subset of the p predictors were irrelevant, in other words if only a subset of the β components were close to 0. In this section, we extend our results to the setting where such a subset is known. This will set the stage for Section 4, where we develop new results for the more realistic model uncertainty setting where such a subset is unknown.

Let S be the subset of {1, ..., p} corresponding to the indices of the potentially irrelevant predictors, and let q = |S| be the number of elements in S. Let β_S be the subvector of β corresponding to the columns of A indexed by S. Similarly, let β̂_{S,x} and β̂_{S,x,y} be the subvectors of β̂_x and β̂_{x,y} respectively, corresponding to β_S. Finally, for notational convenience, let Σ_{A,S} and Σ_{C,S} be the submatrices of Σ_A and Σ_C respectively, which are the covariance matrices of β̂_{S,x} and β̂_{S,x,y}.

When only the elements of β_S are thought to be close to zero, it would be sensible to consider a prior which is uniform on β_{S^c}, where S^c is the complement of S. We denote such a prior by π_S, and let π*_S be the restriction of π_S to β_S, so that

π_S(β) = π*_S(β_S)   (30)

is a function of β_S only. To exploit the possibility that β_S is close to zero, π*_S would then be centered around 0.

Lemma 6. If m_{π*_S}(z; Σ) is finite for all z and Σ, then p̂_{π_S}(y|x) is a proper probability distribution. Furthermore, it can be expressed as

p̂_{π_S}(y|x) = [m_{π*_S}(β̂_{S,x,y}; Σ_{C,S}) / m_{π*_S}(β̂_{S,x}; Σ_{A,S})] p̂_U(y|x),   (31)

where p̂_U is defined by (8).

Proof. The first assertion was proved in Lemma 2. Next, proceeding as in the derivation of (11), we obtain

∫ p(x|β) π_S(β) dβ = (2π)^{-(m-p)/2} |A'A|^{-1/2} exp{−RSS_x/2} m_{π*_S}(β̂_{S,x}; Σ_{A,S}).   (32)

Similarly, we obtain

∫ p(x|β) p(y|β) π_S(β) dβ = (2π)^{-(m+n-p)/2} |C'C|^{-1/2} exp{−RSS_{x,y}/2} m_{π*_S}(β̂_{S,x,y}; Σ_{C,S}).   (33)

The representation (31) follows immediately from (6), (32) and (33).

The following results provide sufficient conditions for the minimaxity of p̂_{π_S} and for its dominance over p̂_U. We omit the proofs, which are obtained using the same arguments leading to Theorem 1 and Corollary 1. Analogously to our previous development there, we let W_S be an invertible q x q matrix such that Σ_{A,S} = W_S'W_S and Σ_{C,S} = W_S'Σ_{D,S}W_S, where Σ_{D,S} = diag(d_1, ..., d_q) as in (15). Finally, let V_{S,w} = wI + (1 − w)Σ_{D,S} as in (18).

Theorem 2. Suppose m_{π*_S}(z; W_S'W_S) is finite for all z. Let H(f(z_1, ..., z_q)) be the Hessian matrix of f.
(i) If trace{H(m_{π*_S}(z; W_S'V_{S,w}W_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1], then p̂_{π_S} is minimax under R_KL. Furthermore, p̂_{π_S} dominates p̂_U unless π_S = π_U.
(ii) If trace{H(√(m_{π*_S}(z; W_S'V_{S,w}W_S)))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 for all w ∈ [0, 1], then p̂_{π_S} is minimax under R_KL. Furthermore, p̂_{π_S} dominates p̂_U if for all w ∈ [0, 1], this inequality is strict on a set of positive Lebesgue measure.

Corollary 2. Suppose m_{π*_S}(z; W_S'W_S) is finite for all z. Then p̂_{π_S} will be minimax if trace{H(π*_S(β_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 a.e. Furthermore, p̂_{π_S} will dominate p̂_U unless π_S = π_U.

Example 1 (continued). (Scaled harmonic prior.) Suppose A = B so that, as in (28),

trace{H(π*_S(β_S))[Σ_{A,S} − Σ_{C,S}]} = (1/2) ∇²π_{W_S}(µ_S),   (34)

where µ_S = (W_S')^{-1}β_S. Here let π_{W_S}(µ_S) ∝ ||µ_S||^{-(q-2)} when q ≥ 3, and π_{W_S}(µ_S) ≡ 1 when q < 3. As before, π_{W_S} is harmonic, i.e. ∇²π_{W_S}(µ_S) = 0 for µ_S ≠ 0, and not equal to π_U when q ≥ 3. For q ≥ 3, the corresponding scaled harmonic prior on β is

π_S(β) = π*_S(β_S) ∝ ||W_S^{-1}β_S||^{-(q-2)},   (35)

where W_S is constructed as in (29) from the eigenvalues η_1, ..., η_q > 0 of Σ_{A,S}, and for q < 3, π_S(β) ≡ 1. By Corollary 2 and (34), p̂_{π_S} here is minimax and dominates p̂_U when q ≥ 3.
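In code, the subset objects of this section are just coordinate extractions. A minimal sketch with a hypothetical subset S, taking A = B as in Example 1 so that Σ_C = Σ_A/2 exactly:

```python
import numpy as np

# Subset objects of Section 3 (hypothetical subset S of predictor indices):
# Sigma_{A,S} and Sigma_{C,S} are the rows/columns of Sigma_A and Sigma_C
# indexed by S.  With A = B, Sigma_C = Sigma_A / 2.
rng = np.random.default_rng(2)
m, p = 15, 4
A = rng.standard_normal((m, p))
B = A.copy()                              # the A = B case of Example 1
Sigma_A = np.linalg.inv(A.T @ A)
Sigma_C = np.linalg.inv(A.T @ A + B.T @ B)

S = [0, 2, 3]                             # hypothetical irrelevant predictors
Sigma_A_S = Sigma_A[np.ix_(S, S)]         # covariance of beta_hat_{S,x}
Sigma_C_S = Sigma_C[np.ix_(S, S)]         # covariance of beta_hat_{S,x,y}

# Sigma_{A,S} - Sigma_{C,S} = Sigma_{A,S}/2 is positive definite here, so the
# trace condition of Corollary 2 reduces to harmonicity of pi*_S, as in (34).
```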
4 Minimax Multiple Shrinkage Predictive Estimation

We now consider the more realistic model uncertainty setting where there is uncertainty about which subset of the p predictors in A and B should be included in the model. For each choice of S, we have obtained general sufficient conditions for p̂_{π_S} to be minimax and to dominate p̂_U. However, such p̂_{π_S} will only offer meaningful risk reduction when β is near the region where π_S is largest. For example, under the scaled harmonic prior in (35), such risk reduction occurs when β_S is close to 0. The difficulty then is how to proceed when the subset of irrelevant predictors indexed by S is unknown.

Rather than arbitrarily selecting S, an attractive alternative is to use a multiple shrinkage predictive estimator which uses the data to adaptively emulate the most effective p̂_{π_S}. The multiple shrinkage procedure here is obtained by using a finite mixture of the contemplated priors. A similar multiple shrinkage construction for parameter estimation under squared error loss was proposed and developed by George (1986a,b,c).

Let Ω be the set of all the subsets S under consideration, possibly even the set of all possible subsets. For each S ∈ Ω, let π_S be the designated prior of the form (30) on β, and assign it probability w_S ∈ [0, 1] such that Σ_{S∈Ω} w_S = 1. Thus we construct the mixture prior

π*(β) = Σ_{S∈Ω} w_S π_S(β).   (36)

This prior yields the multiple shrinkage predictive estimator

p̂*(y|x) = Σ_{S∈Ω} p̂(S|x) p̂_{π_S}(y|x).   (37)

Here each p̂_{π_S} is given by (31) in Lemma 6, and each posterior probability is of the form

p̂(S|x) = w_S m_{π*_S}(β̂_{S,x}; Σ_{A,S}) / Σ_{S'∈Ω} w_{S'} m_{π*_{S'}}(β̂_{S',x}; Σ_{A,S'}),   (38)

which follows from (32). The form (37) reveals p̂*(y|x) to be an adaptive convex combination of the individual shrinkage predictive estimates p̂_{π_S}. Note that through p̂(S|x), p̂* doubly shrinks p̂_U(y|x) by putting more weight on the p̂_{π_S} for which m_{π*_S} is largest and hence p̂_{π_S} is shrinking most.
Thus, we expect p̂* to offer meaningful risk reduction whenever β is near the region where π_S is largest for any S ∈ Ω. For example, if every π_S in π* is one of the scaled harmonic priors in (35), such risk reduction occurs when β_S is close to 0 for any S ∈ Ω for which q_S ≥ 3. Thus, the potential for risk reduction using p̂* is far greater than the risk reduction using an arbitrarily chosen p̂_{π_S}. We should also note that the allocation of risk reduction by p̂* is in part determined by the w_S weights in (38). Because each p̂(S|x) is so adaptive through m_{π*_S}, choosing the weights to be uniform should be adequate. However, one may also want to consider some of the more refined suggestions for choosing such weights for the multiple shrinkage estimators in George (1986b).
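The adaptivity of the weights (38) can be illustrated with a small sketch. For computational ease we use proper N(0, τI) priors on β_S as stand-ins, whose marginals m(β̂_S; Σ_{A,S}) = N(0, τI + Σ_{A,S}) are available in closed form; the paper's harmonic priors, which lack a simple closed-form marginal, would be used in practice. The data, subsets, and τ below are all hypothetical.

```python
import numpy as np

# Illustration of the adaptive posterior weights (38) with N(0, tau I)
# stand-in priors on beta_S (uniform mixture weights w_S).
rng = np.random.default_rng(3)
m, p, tau = 20, 4, 2.0
A = rng.standard_normal((m, p))
x = A @ np.array([0.0, 0.0, 3.0, -2.0]) + rng.standard_normal(m)
Sigma_A = np.linalg.inv(A.T @ A)
beta_hat = Sigma_A @ A.T @ x

def log_marginal(S):
    """log m(beta_hat_S; Sigma_{A,S}) under the N(0, tau I) stand-in prior."""
    S = np.asarray(S)
    b = beta_hat[S]
    V = tau * np.eye(len(S)) + Sigma_A[np.ix_(S, S)]
    _, logdet = np.linalg.slogdet(2 * np.pi * V)
    return -0.5 * (logdet + b @ np.linalg.solve(V, b))

subsets = [[0, 1], [2, 3], [0, 1, 2, 3]]             # candidate sets S in Omega
logw = np.array([log_marginal(S) for S in subsets])
post = np.exp(logw - logw.max())
post /= post.sum()                                    # posterior weights p(S|x)
```

Since the first two coefficients are truly near 0, the weight on S = [0, 1] dominates: the mixture automatically emphasizes the component that is shrinking the most, which is the mechanism behind (37).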
The potential for a multiple shrinkage p̂* to offer meaningful risk reduction in many different regions of the parameter space is greatly enhanced when it is minimax, and therefore can only improve on the noninformative minimax p̂_U. The following two results show that such minimaxity and dominance of p̂_U can be obtained. We then conclude with an explicit example of such domination.

Theorem 3. Suppose for all S ∈ Ω, m_{π*_S}(z; W_S'W_S) is finite for all z. Let H(f(z_1, ..., z_{q_S})) be the Hessian matrix of f. If for all S ∈ Ω,

trace{H(m_{π*_S}(z; W_S'V_{S,w}W_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0  for all w ∈ [0, 1],   (39)

then p̂* in (37) is minimax under R_KL. Furthermore, p̂* dominates p̂_U unless π* = π_U.

Proof. From (31), (37) and (38), it is straightforward to show that p̂* can be reexpressed as

p̂*(y|x) = [Σ_{S∈Ω} w_S m_{π*_S}(β̂_{S,x,y}; Σ_{C,S}) / Σ_{S∈Ω} w_S m_{π*_S}(β̂_{S,x}; Σ_{A,S})] p̂_U(y|x).   (40)

Because p̂* is of the same form as p̂_{π_S} in (31), namely a ratio of marginals times p̂_U, we can apply the same arguments leading to the proofs of Theorems 1 and 2. These steps show that a sufficient condition for the minimaxity and dominance claims is

Σ_{S∈Ω} w_S trace{H(m_{π*_S}(z; W_S'V_{S,w}W_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0  for all w ∈ [0, 1].

This condition is implied if (39) holds for all S ∈ Ω.

The next result follows using the same argument leading to Corollaries 1 and 2.

Corollary 3. Suppose for all S ∈ Ω, m_{π*_S}(z; W_S'W_S) is finite for all z. Then p̂* in (37) will be minimax if, for all S ∈ Ω, trace{H(π*_S(β_S))[Σ_{A,S} − Σ_{C,S}]} ≤ 0 a.e. Furthermore, p̂* will dominate p̂_U unless π* = π_U.

Example 1 (continued). (Scaled harmonic prior.) For each S ∈ Ω, let π_S(β) be the scaled harmonic prior given by (35) when q_S ≥ 3, and let π_S(β) ≡ 1 when q_S < 3. When A = B, by Corollary 3, p̂* under these priors will be minimax and will dominate p̂_U if q_S ≥ 3 for at least one S ∈ Ω.
5 Predictive Density Estimation Near Linear Subspaces

The harmonic prior predictive estimator p̂_{π_S}(y|x) described in Section 3, and incorporated into the multiple shrinkage predictive estimators p̂*(y|x) in Section 4, offers risk reduction in the region of the parameter space where β_S is close to 0. This can be seen as a special case of the following general construction of a predictive estimator that obtains risk reduction when β is close to a linear subspace of R^p.

Suppose one would like to obtain a predictive density estimator with greatest risk reduction in the region where β is close to a linear subspace G ⊂ R^p. In the case of p̂_{π_S}(y|x), G would be the subspace of all β ∈ R^p for which β_S = 0. Alternatively, if risk reduction was desired, say, when the components of β were close to equal, then one would consider G = [1], the subspace spanned by 1 = (1, ..., 1)'. Let P_G β ≡ argmin_{g∈G} ||β − g|| be the projection of β onto G, and define β_G ≡ (I − P_G)β to be the projection of β onto the orthogonal complement of G. For the construction of p̂_{π_S}(y|x) in Section 3, β_G = β_S. For G = [1], β_G = β − β̄1, where β̄1 is the vector with all components equal to β̄ = p^{-1} Σ_{i=1}^p β_i.

The main idea behind the general construction is to use a prior that leads to shrinkage of β_G towards 0 while leaving the remainder of β untouched. This can be obtained by using a prior of the form

π_G(β) = π*_G(β_G),   (41)

which is effectively uniform on β − β_G = P_G β. This is a special case of the prior over β in (30). Note that since β_G is q_G ≡ (p − dim(G)) dimensional, π*_G is a function from R^{q_G} to R.
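The projections above are elementary to compute. A minimal sketch for G = [1] with a hypothetical β: P_G replaces every component by the mean, and β_G = (I − P_G)β is the centered vector that the prior shrinks towards 0.

```python
import numpy as np

# Projection onto G = [1] (the span of the ones vector) and onto its
# orthogonal complement, as used in Section 5; beta is hypothetical.
p = 4
one = np.ones((p, 1))
P_G = one @ one.T / p                  # projection matrix onto span(1)
beta = np.array([1.0, 2.0, 3.0, 6.0])
beta_G = (np.eye(p) - P_G) @ beta      # beta minus its componentwise mean

print(beta_G.tolist())  # [-2.0, -1.0, 0.0, 3.0]: the mean 3 is removed
```

Shrinking β_G towards 0 thus shrinks β towards the equal-components subspace without touching the common level P_G β.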
Analogous to the construction in Lemma 6, predictive density estimators p̂_{π_G} corresponding to priors of the form π_G in (41) can be expressed as

p̂_{π_G}(y|x) = [m_{π*_G}(β̂_{G,x,y}; Σ_{C,G}) / m_{π*_G}(β̂_{G,x}; Σ_{A,G})] p̂_U(y|x),   (42)

where p̂_U is defined by (8), β̂_{G,x} = (I − P_G)β̂_x and β̂_{G,x,y} = (I − P_G)β̂_{x,y} are the projections of β̂_x and β̂_{x,y} onto the orthogonal complement of G, respectively, and Σ_{A,G} and Σ_{C,G} are the covariance matrices of β̂_{G,x} and β̂_{G,x,y}, respectively. It is straightforward to see that Theorem 2 and Corollary 2 and their proofs can be extended to get conditions on π*_G(β_G) for such p̂_{π_G} to be minimax and to dominate p̂_U. (Simply substitute the symbol G for the symbol S throughout.)

Example 1 (continued). Extending (35), consider the following scaled harmonic prior on β. For q_G ≥ 3, let

π_G(β) = π*_G(β_G) ∝ ||W_G^{-1}β_G||^{-(q_G-2)},   (43)

where W_G is constructed as in (29) from the eigenvalues η_1, ..., η_{q_G} > 0 of Σ_{A,G}, and for q_G < 3, let π_G(β) ≡ 1. Note that when q_G ≥ 3 the resulting p̂_{π_G} shrinks p̂_U towards G, offering reduced risk when β is close to G. By the extension of Corollary 2, such p̂_{π_G} will be minimax and dominate p̂_U when A = B and q_G ≥ 3.

Finally, following the development in Section 4, one can easily incorporate such p̂_{π_G} into multiple shrinkage predictive estimators p̂*. Letting Ω be a set of subspaces G under consideration, construct the mixture prior

π*(β) = Σ_{G∈Ω} w_G π_G(β),   (44)

where for each G ∈ Ω, π_G is the designated prior of the form (41), and w_G ∈ [0, 1] is such that Σ_{G∈Ω} w_G = 1. This prior yields the multiple shrinkage predictive estimator

p̂*(y|x) = Σ_{G∈Ω} p̂(G|x) p̂_{π_G}(y|x),   (45)

where each p̂_{π_G} is given by (42), and each posterior probability is of the form

p̂(G|x) = w_G m_{π*_G}(β̂_{G,x}; Σ_{A,G}) / Σ_{G'∈Ω} w_{G'} m_{π*_{G'}}(β̂_{G',x}; Σ_{A,G'}).   (46)

Here, p̂*(y|x) is an adaptive convex combination of the individual shrinkage predictive estimates p̂_{π_G}, and offers risk reduction whenever β_G is near the region where π*_G is largest for any G ∈ Ω. Thus, the potential for risk reduction using p̂* is far greater than the risk reduction using an arbitrarily chosen p̂_{π_G}. It is straightforward to see that Theorem 3 and Corollary 3 and their proofs can be extended to get conditions for such p̂*(y|x) to be minimax and dominate p̂_U. (Simply substitute the symbol G for the symbol S throughout.)

Example 1 (continued). (Scaled harmonic prior.) For each G ∈ Ω, let π_G(β) be the scaled harmonic prior given by (43) when q_G ≥ 3, and π_G(β) ≡ 1 when q_G < 3. When A = B, by the extension of Corollary 3, p̂* for these priors will be minimax and will dominate p̂_U if q_G ≥ 3 for at least one G ∈ Ω.

References

Aitchison, J. (1975). Goodness of Prediction Fit. Biometrika 62.

Brown, L.D. (1971). Admissible Estimators, Recurrent Diffusions, and Insoluble Boundary Value Problems. Annals of Mathematical Statistics 42.

Brown, L.D., George, E.I. and Xu, X. (2006). Admissible Predictive Density Estimation. Submitted.

George, E.I. (1986a). Minimax Multiple Shrinkage Estimation. Annals of Statistics 14.

George, E.I. (1986b). Combining Minimax Shrinkage Estimators. Journal of the American Statistical Association 81.

George, E.I. (1986c). A Formal Bayes Multiple Shrinkage Estimator. Communications in Statistics: Part A - Theory and Methods (special issue "Stein-type Multivariate Estimation") 15.

George, E.I., Liang, F. and Xu, X. (2006). Improved Minimax Prediction Under Kullback-Leibler Loss. Annals of Statistics 34.

Liang, F. (2002). Exact Minimax Procedures for Predictive Density Estimation and Data Compression. Ph.D. dissertation, Department of Statistics, Yale University.

Liang, F. and Barron, A. (2004). Exact Minimax Strategies for Predictive Density Estimation, Data Compression and Model Selection. IEEE Transactions on Information Theory 50.

Steele, J.M. (2001). Stochastic Calculus and Financial Applications. Springer, New York.

Stein, C. (1974). Estimation of the Mean of a Multivariate Normal Distribution. In Proceedings of the Prague Symposium on Asymptotic Statistics, Ed. J. Hajek. Prague: Universita Karlova.

Stein, C. (1981). Estimation of a Multivariate Normal Mean. Annals of Statistics 9.
Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationRecall the convention that, for us, all vectors are column vectors.
Some linear algebra Recall the convention that, for us, all vectors are column vectors. 1. Symmetric matrices Let A be a real matrix. Recall that a complex number λ is an eigenvalue of A if there exists
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationI. Bayesian econometrics
I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory Question: once we ve calculated the posterior distribution, what do we do
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More informationSTA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources
STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various
More informationAkaike Information Criterion
Akaike Information Criterion Shuhua Hu Center for Research in Scientific Computation North Carolina State University Raleigh, NC February 7, 2012-1- background Background Model statistical model: Y j =
More informationUniform Correlation Mixture of Bivariate Normal Distributions and. Hypercubically-contoured Densities That Are Marginally Normal
Uniform Correlation Mixture of Bivariate Normal Distributions and Hypercubically-contoured Densities That Are Marginally Normal Kai Zhang Department of Statistics and Operations Research University of
More informationPart III. A Decision-Theoretic Approach and Bayesian testing
Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to
More informationMAS223 Statistical Inference and Modelling Exercises
MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Merlise Clyde STA721 Linear Models Duke University August 31, 2017 Outline Topics Likelihood Function Projections Maximum Likelihood Estimates Readings: Christensen Chapter
More informationClassification 2: Linear discriminant analysis (continued); logistic regression
Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:
More informationLeast Sparsity of p-norm based Optimization Problems with p > 1
Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from
More informationMath 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.
Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,
More informationMinimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.
Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of
More informationBootstrap prediction and Bayesian prediction under misspecified models
Bernoulli 11(4), 2005, 747 758 Bootstrap prediction and Bayesian prediction under misspecified models TADAYOSHI FUSHIKI Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569,
More informationApproximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA)
Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA) Willem van den Boom Department of Statistics and Applied Probability National University
More informationStatistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach
Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score
More informationInverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1
Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is
More informationProjection. Ping Yu. School of Economics and Finance The University of Hong Kong. Ping Yu (HKU) Projection 1 / 42
Projection Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Projection 1 / 42 1 Hilbert Space and Projection Theorem 2 Projection in the L 2 Space 3 Projection in R n Projection
More informationA minimalist s exposition of EM
A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability
More informationMatrices and Multivariate Statistics - II
Matrices and Multivariate Statistics - II Richard Mott November 2011 Multivariate Random Variables Consider a set of dependent random variables z = (z 1,..., z n ) E(z i ) = µ i cov(z i, z j ) = σ ij =
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationSpring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM
University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle
More informationVariational Inference (11/04/13)
STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further
More informationSpatial Process Estimates as Smoothers: A Review
Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed
More informationAn Alternative Method for Estimating and Simulating Maximum Entropy Densities
An Alternative Method for Estimating and Simulating Maximum Entropy Densities Jae-Young Kim and Joonhwan Lee Seoul National University May, 8 Abstract This paper proposes a method of estimating and simulating
More informationFormulas for probability theory and linear models SF2941
Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms
More informationAsymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationLawrence D. Brown* and Daniel McCarthy*
Comments on the paper, An adaptive resampling test for detecting the presence of significant predictors by I. W. McKeague and M. Qian Lawrence D. Brown* and Daniel McCarthy* ABSTRACT: This commentary deals
More informationModel comparison and selection
BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)
More informationMLES & Multivariate Normal Theory
Merlise Clyde September 6, 2016 Outline Expectations of Quadratic Forms Distribution Linear Transformations Distribution of estimates under normality Properties of MLE s Recap Ŷ = ˆµ is an unbiased estimate
More informationLinear Algebra. Session 12
Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)
More informationOnline Appendix. j=1. φ T (ω j ) vec (EI T (ω j ) f θ0 (ω j )). vec (EI T (ω) f θ0 (ω)) = O T β+1/2) = o(1), M 1. M T (s) exp ( isω)
Online Appendix Proof of Lemma A.. he proof uses similar arguments as in Dunsmuir 979), but allowing for weak identification and selecting a subset of frequencies using W ω). It consists of two steps.
More informationLecture Notes 15 Prediction Chapters 13, 22, 20.4.
Lecture Notes 15 Prediction Chapters 13, 22, 20.4. 1 Introduction Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give an introduction. We observe training data
More informationWLS and BLUE (prelude to BLUP) Prediction
WLS and BLUE (prelude to BLUP) Prediction Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark April 21, 2018 Suppose that Y has mean X β and known covariance matrix V (but Y need
More informationEstimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk
Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:
More informationMaximum Likelihood Estimation
Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More information1 Data Arrays and Decompositions
1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is
More informationThe Skorokhod reflection problem for functions with discontinuities (contractive case)
The Skorokhod reflection problem for functions with discontinuities (contractive case) TAKIS KONSTANTOPOULOS Univ. of Texas at Austin Revised March 1999 Abstract Basic properties of the Skorokhod reflection
More informationBayesian estimation of the discrepancy with misspecified parametric models
Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012
More informationTECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection
DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model
More informationChapter 13. Convex and Concave. Josef Leydold Mathematical Methods WS 2018/19 13 Convex and Concave 1 / 44
Chapter 13 Convex and Concave Josef Leydold Mathematical Methods WS 2018/19 13 Convex and Concave 1 / 44 Monotone Function Function f is called monotonically increasing, if x 1 x 2 f (x 1 ) f (x 2 ) It
More informationChapter 1 Preliminaries
Chapter 1 Preliminaries 1.1 Conventions and Notations Throughout the book we use the following notations for standard sets of numbers: N the set {1, 2,...} of natural numbers Z the set of integers Q the
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationLecture I A Gentle Introduction to Markov Chain Monte Carlo (MCMC)
1 Lecture I A Gentle Introduction to Markov Chain Monte Carlo (MCMC) Ed George University of Pennsylvania Seminaire de Printemps Villars-sur-Ollon, Switzerland March 2005 2 1. MCMC: A New Approach to Simulation
More informationPh.D. Qualifying Exam Friday Saturday, January 6 7, 2017
Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017 Put your solution to each problem on a separate sheet of paper. Problem 1. (5106) Let X 1, X 2,, X n be a sequence of i.i.d. observations from a
More informationEstimation of parametric functions in Downton s bivariate exponential distribution
Estimation of parametric functions in Downton s bivariate exponential distribution George Iliopoulos Department of Mathematics University of the Aegean 83200 Karlovasi, Samos, Greece e-mail: geh@aegean.gr
More informationEXPLICIT MULTIVARIATE BOUNDS OF CHEBYSHEV TYPE
Annales Univ. Sci. Budapest., Sect. Comp. 42 2014) 109 125 EXPLICIT MULTIVARIATE BOUNDS OF CHEBYSHEV TYPE Villő Csiszár Budapest, Hungary) Tamás Fegyverneki Budapest, Hungary) Tamás F. Móri Budapest, Hungary)
More informationMAT 257, Handout 13: December 5-7, 2011.
MAT 257, Handout 13: December 5-7, 2011. The Change of Variables Theorem. In these notes, I try to make more explicit some parts of Spivak s proof of the Change of Variable Theorem, and to supply most
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 12, DECEMBER
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 12, DECEMBER 2010 6433 Bayesian Minimax Estimation of the Normal Model With Incomplete Prior Covariance Matrix Specification Duc-Son Pham, Member,
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More information1 Directional Derivatives and Differentiability
Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=
More informationMark Gordon Low
Mark Gordon Low lowm@wharton.upenn.edu Address Department of Statistics, The Wharton School University of Pennsylvania 3730 Walnut Street Philadelphia, PA 19104-6340 lowm@wharton.upenn.edu Education Ph.D.
More informationMATRICES ARE SIMILAR TO TRIANGULAR MATRICES
MATRICES ARE SIMILAR TO TRIANGULAR MATRICES 1 Complex matrices Recall that the complex numbers are given by a + ib where a and b are real and i is the imaginary unity, ie, i 2 = 1 In what we describe below,
More informationEC3062 ECONOMETRICS. THE MULTIPLE REGRESSION MODEL Consider T realisations of the regression equation. (1) y = β 0 + β 1 x β k x k + ε,
THE MULTIPLE REGRESSION MODEL Consider T realisations of the regression equation (1) y = β 0 + β 1 x 1 + + β k x k + ε, which can be written in the following form: (2) y 1 y 2.. y T = 1 x 11... x 1k 1
More informationA Bayesian Treatment of Linear Gaussian Regression
A Bayesian Treatment of Linear Gaussian Regression Frank Wood December 3, 2009 Bayesian Approach to Classical Linear Regression In classical linear regression we have the following model y β, σ 2, X N(Xβ,
More informationHomework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015
10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2
MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized
More informationNovember 2002 STA Random Effects Selection in Linear Mixed Models
November 2002 STA216 1 Random Effects Selection in Linear Mixed Models November 2002 STA216 2 Introduction It is common practice in many applications to collect multiple measurements on a subject. Linear
More informationLinear Algebra March 16, 2019
Linear Algebra March 16, 2019 2 Contents 0.1 Notation................................ 4 1 Systems of linear equations, and matrices 5 1.1 Systems of linear equations..................... 5 1.2 Augmented
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationHT Introduction. P(X i = x i ) = e λ λ x i
MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework
More informationAssignment 1 Math 5341 Linear Algebra Review. Give complete answers to each of the following questions. Show all of your work.
Assignment 1 Math 5341 Linear Algebra Review Give complete answers to each of the following questions Show all of your work Note: You might struggle with some of these questions, either because it has
More informationChapter 8 Integral Operators
Chapter 8 Integral Operators In our development of metrics, norms, inner products, and operator theory in Chapters 1 7 we only tangentially considered topics that involved the use of Lebesgue measure,
More informationFinite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product
Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )
More information6.1 Main properties of Shannon entropy. Let X be a random variable taking values x in some alphabet with probabilities.
Chapter 6 Quantum entropy There is a notion of entropy which quantifies the amount of uncertainty contained in an ensemble of Qbits. This is the von Neumann entropy that we introduce in this chapter. In
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationhttp://www.math.uah.edu/stat/markov/.xhtml 1 of 9 7/16/2009 7:20 AM Virtual Laboratories > 16. Markov Chains > 1 2 3 4 5 6 7 8 9 10 11 12 1. A Markov process is a random process in which the future is
More information5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality.
88 Chapter 5 Distribution Theory In this chapter, we summarize the distributions related to the normal distribution that occur in linear models. Before turning to this general problem that assumes normal
More informationCS 246 Review of Linear Algebra 01/17/19
1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector
More informationBayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4
Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection
More informationLinear Models Review
Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign
More informationA note on profile likelihood for exponential tilt mixture models
Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential
More informationExact Minimax Predictive Density Estimation and MDL
Exact Minimax Predictive Density Estimation and MDL Feng Liang and Andrew Barron December 5, 2003 Abstract The problems of predictive density estimation with Kullback-Leibler loss, optimal universal data
More informationMAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9
MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended
More information[y i α βx i ] 2 (2) Q = i=1
Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation
More informationMinimax design criterion for fractional factorial designs
Ann Inst Stat Math 205 67:673 685 DOI 0.007/s0463-04-0470-0 Minimax design criterion for fractional factorial designs Yue Yin Julie Zhou Received: 2 November 203 / Revised: 5 March 204 / Published online:
More informationRidge Regression Revisited
Ridge Regression Revisited Paul M.C. de Boer Christian M. Hafner Econometric Institute Report EI 2005-29 In general ridge (GR) regression p ridge parameters have to be determined, whereas simple ridge
More information