Posterior Predictive Distribution

Suppose we have observed a new set of explanatory variables X̃ and we want to predict the outcomes ỹ using the regression model. Components of uncertainty in p(ỹ | y):

- variability of the model, represented by σ² and not accounted for by X̃β;
- posterior uncertainty in β and σ² due to the finite sample size of y. As n → ∞ this uncertainty decreases to zero.

Drawing a sample ỹ from its posterior predictive distribution can be done as follows:
1. draw (β, σ²) from p(β, σ² | y);
2. draw ỹ ~ N(X̃β, σ²I).
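The two-step draw can be sketched in code. A minimal sketch, assuming the standard non-informative prior p(β, σ²) ∝ 1/σ² (so that σ² | y is scaled inverse-χ² and β | σ², y is normal); function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior_predictive(X, y, X_new, n_draws=4000):
    """Two-step draws from p(y_tilde | y) under p(beta, sigma^2) ∝ 1/sigma^2."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)              # V_beta = (X'X)^{-1}
    beta_hat = V_beta @ X.T @ y                  # least-squares estimate
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k)
    L = np.linalg.cholesky(V_beta)
    draws = np.empty((n_draws, X_new.shape[0]))
    for i in range(n_draws):
        # 1. draw (beta, sigma^2) from p(beta, sigma^2 | y)
        sigma2 = (n - k) * s2 / rng.chisquare(n - k)   # scaled inverse-chi^2
        beta = beta_hat + np.sqrt(sigma2) * (L @ rng.standard_normal(k))
        # 2. draw y_tilde ~ N(X_new beta, sigma^2 I)
        draws[i] = X_new @ beta + np.sqrt(sigma2) * rng.standard_normal(X_new.shape[0])
    return draws
```

The outer draw of σ² propagates the posterior uncertainty in the parameters into the predictive draws, which is what separates this from simply plugging in the point estimates.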
Given σ², the future observation ỹ has a normal distribution, with mean and variance given by

E(ỹ | σ², y) = E( E(ỹ | β, σ², y) | σ², y ) = E( X̃β | σ², y ) = X̃β̂

and

V(ỹ | σ², y) = E[ V(ỹ | β, σ², y) | σ², y ] + V[ E(ỹ | β, σ², y) | σ², y ]
             = E[ σ²I | σ², y ] + V[ X̃β | σ², y ]
             = (I + X̃ V_β X̃ᵀ) σ².
To determine p(ỹ | y) we must average over the marginal posterior of σ². Then

p(ỹ | y) = ∫ N(ỹ | X̃β̂, (I + X̃ V_β X̃ᵀ)σ²) p(σ² | y) dσ².

This is a multivariate t with center X̃β̂, squared scale matrix σ̂²(I + X̃ V_β X̃ᵀ), and n − k degrees of freedom.
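For a single new point, the t result above gives a closed-form predictive interval without any simulation. A sketch under the same non-informative prior, using scipy's t quantiles (names are illustrative):

```python
import numpy as np
from scipy import stats

def predictive_interval(X, y, x_new, level=0.95):
    """Closed-form 100*level% posterior predictive interval for one new
    observation, from the t_{n-k} predictive distribution above."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ X.T @ y
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - k)   # sigma-hat^2
    center = x_new @ beta_hat                                 # t center
    scale = np.sqrt(sigma2_hat * (1.0 + x_new @ V_beta @ x_new))
    q = stats.t.ppf(0.5 + level / 2.0, df=n - k)
    return center - q * scale, center + q * scale
```

Under this non-informative prior the interval coincides with the classical frequentist prediction interval.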
Example

The table gives short-term radon measurements for a sample of houses in three counties in Minnesota. All the measurements were recorded on the basement level of the houses, except for those indicated with *, which were recorded on the first floor.

County       Radon measurements (pCi/L)
Blue Earth   5.0, 13.0, 7.2, 6.8, 12.8, 5.8, 9.5, 6.0, 3.8, 14.3, 1.8, 6.9, 4.7, 9.5
Clay         0.9, 12.9, 2.6, 3.5, 26.6, 1.5, 13.0, 8.8, 19.5, 2.5, 9.0, 13.1, 3.6, 6.9
Goodhue      14.3, 6.9, 7.6, 9.8, 2.6, 43.5, 4.9, 3.5, 4.8, 5.6, 3.5, 3.9, 6.7
We can define a model in terms of indicator variables as follows:

x₂ = 1 if y_{i,j} is from Blue Earth, 0 otherwise,
x₃ = 1 if y_{i,j} is from Clay, 0 otherwise,
z  = 1 if y_{i,j} was recorded on the first floor, 0 otherwise,

for i = 1, 2, 3 and j = 1, …, n_i. Then the model can be written in the following form:

log(y_{i,j}) = µ + α_i + δ z_{i,j} + ε_{i,j},   ε_{i,j} ~ N(0, σ²),

with µ the mean effect for Goodhue, α₁ the effect of Blue Earth over µ, α₂ the effect of Clay over µ, α₃ = 0, and δ the effect of the first floor.
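As a sketch, the county part of the design matrix can be built directly from the table. Since the first-floor flags (*) are not visible in the data as reproduced above, the z column is omitted here and δ is not estimated; variable names are illustrative:

```python
import numpy as np

# Radon data from the table (pCi/L); the first-floor flags (*) are not
# recoverable here, so the z column is left out and delta is not fit.
blue_earth = [5.0, 13.0, 7.2, 6.8, 12.8, 5.8, 9.5, 6.0, 3.8, 14.3, 1.8, 6.9, 4.7, 9.5]
clay = [0.9, 12.9, 2.6, 3.5, 26.6, 1.5, 13.0, 8.8, 19.5, 2.5, 9.0, 13.1, 3.6, 6.9]
goodhue = [14.3, 6.9, 7.6, 9.8, 2.6, 43.5, 4.9, 3.5, 4.8, 5.6, 3.5, 3.9, 6.7]

y = np.log(np.concatenate([blue_earth, clay, goodhue]))
counts = [len(blue_earth), len(clay), len(goodhue)]
x2 = np.repeat([1.0, 0.0, 0.0], counts)          # Blue Earth indicator
x3 = np.repeat([0.0, 1.0, 0.0], counts)          # Clay indicator
X = np.column_stack([np.ones(len(y)), x2, x3])   # columns: mu, alpha_1, alpha_2

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat, alpha1_hat, alpha2_hat = beta_hat        # mu_hat: Goodhue mean of log radon
```

With this coding the least-squares fit recovers the group means of log radon: µ̂ is the Goodhue mean, and µ̂ + α̂₁, µ̂ + α̂₂ are the Blue Earth and Clay means.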
Results

[Figure: posterior intervals for the coefficients CONST, BLUE E, CLAY, and 1ST FL.]
Prediction

Assume another house is sampled at random from Blue Earth County. We have two scenarios, depending on whether the measurement we want to predict will be recorded on the basement or on the first floor. If we want to predict a basement measurement, we need to sample log(y_rep) from the posterior predictive distribution N(µ + α₁, σ²). If we want a prediction for a first-floor measurement, then we need to sample log(y_rep) from the posterior predictive distribution N(µ + α₁ + δ, σ²).

Location      95% P.I.           Median
Basement      (0.526, 29.663)    7.152
First floor   (0.266, 20.994)    5.012
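Given posterior draws of (µ, α₁, δ, σ²), both predictive distributions can be simulated and summarized on the original pCi/L scale. The sketch below uses stand-in posterior draws for illustration only, not the fitted values, so the resulting endpoints will not match the table:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_radon(mu, alpha1, delta, sigma2, first_floor=False):
    """Posterior predictive draws, back-transformed to the pCi/L scale,
    for a new Blue Earth house.  Arguments are arrays of posterior draws."""
    mean = mu + alpha1 + (delta if first_floor else 0.0)
    log_rep = mean + np.sqrt(sigma2) * rng.standard_normal(mu.shape)
    return np.exp(log_rep)           # exponentiate the log-scale draw

# stand-in posterior draws, for illustration only (not the fitted values)
S = 5000
mu = rng.normal(1.6, 0.1, S)
alpha1 = rng.normal(0.4, 0.15, S)
delta = rng.normal(-0.35, 0.2, S)
sigma2 = rng.gamma(5.0, 0.12, S)

rep = predict_radon(mu, alpha1, delta, sigma2)
lo, med, hi = np.percentile(rep, [2.5, 50, 97.5])   # 95% P.I. and median
```

The same call with `first_floor=True` gives the first-floor scenario.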
[Figure: histograms of the posterior predictive draws of the radon measurement, for the basement and first-floor scenarios.]
Unequal Variances

If we consider a linear model with a known general covariance matrix V, then we have

y = Xβ + ε,   ε ~ N(0, V).

Let V = LLᵀ be the Cholesky decomposition of V. Then

L⁻¹y = L⁻¹Xβ + v,   v ~ N(0, I).

Letting z = L⁻¹y and W = L⁻¹X, the LSE of β is the solution of WᵀW β̂ = Wᵀz. This is equivalent to XᵀV⁻¹X β̂ = XᵀV⁻¹y.
The conclusion is that, in order to deal with unequal variances, we have to solve LW = X and Lz = y. There are several interesting special cases:

1. V = σ²V₀, with V₀ known but σ² unknown.
2. V is a diagonal matrix. Then L_ii = √V_ii, and thus X and y are pre-multiplied by the inverses of the square roots of the diagonal elements of V, usually denoted as weights.
3. V_ij = σ² h(i, j, φ), implying that the matrix is unknown, but there is a parametric form that its elements correspond to.
4. When V is totally unknown, then

p(V | β, y) ∝ |V|^{−1/2} exp{ −½ tr( V⁻¹ (y − Xβ)(y − Xβ)ᵀ ) } p(V).

If p(V) is an inverse Wishart then this full conditional corresponds to an inverse Wishart as well.
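The whitening recipe above (solve LW = X and Lz = y, then run ordinary least squares on the transformed data) can be sketched as a small GLS routine, assuming V is positive definite; names are illustrative:

```python
import numpy as np

def gls(X, y, V):
    """GLS via Cholesky whitening: solve L W = X and L z = y, then
    OLS of z on W, which solves X' V^{-1} X beta = X' V^{-1} y."""
    L = np.linalg.cholesky(V)          # V = L L'
    W = np.linalg.solve(L, X)          # W = L^{-1} X
    z = np.linalg.solve(L, y)          # z = L^{-1} y
    beta_hat, *_ = np.linalg.lstsq(W, z, rcond=None)
    return beta_hat
```

Solving triangular systems instead of forming V⁻¹ explicitly is both cheaper and numerically more stable.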
Including Prior Information

So far we have considered only the case where the prior for β and σ² is non-informative. It is clear that using an inverse gamma prior for σ² will not change the analysis much. We can include information about β by using a multivariate normal prior, say β ~ N(β₀, V_β). We can treat the prior for β as k additional data points by considering the model

y* = X* β + ε*,   var(ε*) = V*,

where y* stacks y on top of β₀, X* stacks X on top of I_k, and V* is block diagonal:

y* = [ y  ]      X* = [ X   ]      V* = [ V_y  0   ]
     [ β₀ ],          [ I_k ],          [ 0    V_β ].

We proceed by obtaining the posterior distribution of β from this model assuming p(β) ∝ 1.
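The augmented-data trick can be sketched as follows, assuming for illustration that V_y = σ²I with σ² treated as known (function and variable names are illustrative):

```python
import numpy as np

def posterior_mean_with_prior(X, y, sigma2, beta0, V_beta):
    """Augmented-data trick: append the prior beta ~ N(beta0, V_beta)
    as k pseudo-observations and solve the GLS normal equations.
    For illustration V_y = sigma2 * I with sigma2 treated as known."""
    n, k = X.shape
    X_star = np.vstack([X, np.eye(k)])            # X* = [X; I_k]
    y_star = np.concatenate([y, beta0])           # y* = [y; beta_0]
    V_star_inv = np.block([
        [np.eye(n) / sigma2,  np.zeros((n, k))],
        [np.zeros((k, n)),    np.linalg.inv(V_beta)],
    ])                                            # V*^{-1}, block diagonal
    A = X_star.T @ V_star_inv @ X_star            # = X'X/sigma2 + V_beta^{-1}
    b = X_star.T @ V_star_inv @ y_star            # = X'y/sigma2 + V_beta^{-1} beta_0
    return np.linalg.solve(A, b)                  # posterior mean of beta
```

Multiplying out the blocks shows this is exactly the usual conjugate-normal posterior mean, (XᵀX/σ² + V_β⁻¹)⁻¹(Xᵀy/σ² + V_β⁻¹β₀), so the pseudo-observation view and the direct conjugate calculation agree.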