Caterpillar Regression Example: Conjugate Priors, Conditional & Marginal Posteriors, Predictive Distribution, Variable Selection


1 Caterpillar Regression Example: Conjugate Priors, Conditional & Marginal Posteriors, Predictive Distribution, Variable Selection Prof. Nicholas Zabaras University of Notre Dame Notre Dame, IN, USA URL: September 22,

2 Contents The Caterpillar Regression Problem, Linear Regression Model, MLE Estimator, Likelihood, MLE Estimates for our example, Conjugate Priors, Conditional and Marginal Posteriors, Ridge Regression, Predictive Distribution, Implementation, Influence of the Conjugate Prior; Zellner's G-Prior, Marginal Posterior Mean and Variance, Predictive Modeling, Credible Intervals; Jeffreys' Non-informative Prior, Credible Intervals; Zellner's G-Prior Marginal Distribution of y, Point Null Hypothesis and Bayes Factors; Variable Selection, Model Competition, Variable Selection Prior, Stochastic Search for the Most Probable Model, Gibbs Sampling for Variable Selection, Implementation. C. P. Robert, The Bayesian Core, Springer, 2nd edition, chapter 3 (full text available)

3 Regression Regression refers to statistical analysis that deals with the representation of dependencies between several variables. In particular, we want to find a representation of the distribution f(y | θ, x) of an observable variable y given a vector of observables x, using samples (x_i, y_i), i = 1, ..., n. Here, we will consider a particular example of modeling dependencies in pine processionary caterpillar colony size.

4 Linear Regression Models In linear regression, we analyze the linear influence of some variables on others. In our particular example of pine processionary caterpillar colonies: Response variable: y is the number of processionary caterpillar colonies. Explanatory variables: covariates x = (x_1, x_2, ..., x_k) as defined next (in general they can be continuous, discrete, or of mixed type).

5 Caterpillar Regression Problem The pine processionary caterpillar colony size (y = number of nests) is influenced by the following explanatory variables:
x_1 is the altitude (in meters)
x_2 is the slope (in degrees)
x_3 is the number of pines in the square
x_4 is the height (in meters) of the tree sampled at the center of the square
x_5 is the diameter of the tree sampled at the center of the square
x_6 is the index of the settlement density
x_7 is the orientation of the square (from 1 if southbound to 2 otherwise)
x_8 is the height (in meters) of the dominant tree
x_9 is the number of vegetation strata
x_10 is the mix settlement index (from 1 if not mixed to 2 if mixed)

6 Caterpillar Regression Problem [Figure: semilog-y plots of the data (x_i, y), i = 1, ..., 9, one panel per covariate x_1 through x_9.] An implementation is available (MatLab, C++).

7 Regression The distribution of y given x is considered in the context of a set of experimental data, i = 1, ..., n, on which both y_i and x_{i1}, ..., x_{ik} are measured. The dataset is made from the outcomes y = (y_1, ..., y_n) and the n x (k+1) matrix of explanatory variables

X = [ 1 x_11 x_12 ... x_1k
      1 x_21 x_22 ... x_2k
      ...            ...
      1 x_n1 x_n2 ... x_nk ]

8 Linear Regression The most common linear regression model is of the form y | β, σ², X ~ N_n(Xβ, σ² I_n). From this we conclude that E[y_i | β, σ², X] = β_0 + β_1 x_i1 + ... + β_k x_ik and Var[y_i | β, σ², X] = σ². Note the difference between finite-valued regressors x_i (like in our caterpillar problem) and categorical variables, which also take a finite number of values but whose range has no numerical meaning. It makes no sense to involve such an x directly in the regression: replace the single regressor x [belonging in {1, ..., m}, say] with m indicator (or dummy) variables.

9 Categorical Variables Note the difference between finite-valued regressors x_i (like in our caterpillar problem) and categorical variables, which also take a finite number of values but whose range has no numerical meaning. It makes no sense to use such an x directly in the regression: replace the single regressor x [in {1, ..., m}, say] with m indicator variables x_1 = I_1(x), x_2 = I_2(x), ..., x_m = I_m(x). Use a different β_i for each categorical variable value (class): E[y | β, σ², X] = β_1 I_1(x) + ... + β_m I_m(x). Identifiability requires eliminating one of the classes (e.g. β_1 = 0) since I_1(x) + ... + I_m(x) = 1.

10 Linear Regression Returning back to our linear regression model, y | β, σ², X ~ N_n(Xβ, σ² I_n), from which E[y_i | β, σ², X] = β_0 + β_1 x_i1 + ... + β_k x_ik and Var[y_i | β, σ², X] = σ². Assume that k + 1 < n and that X is of full rank: rank(X) = k + 1. X is of full rank if and only if X^T X is invertible. The likelihood is then

l(β, σ² | y, X) = (2πσ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ)^T (y - Xβ) )

11 MLE Estimator The MLE of β is the solution of the least squares minimization problem

min_β (y - Xβ)^T (y - Xβ) = min_β Σ_{i=1}^n ( y_i - β_0 - β_1 x_i1 - ... - β_k x_ik )²,

namely β̂ = (X^T X)^{-1} X^T y. Since β̂ = (X^T X)^{-1} X^T y is a linear transform of y ~ N_n(Xβ, σ² I_n),

β̂ ~ N( (X^T X)^{-1} X^T X β, (X^T X)^{-1} X^T σ² I_n X (X^T X)^{-1} ) = N( β, σ² (X^T X)^{-1} ).

Thus β̂ is an unbiased estimator and Var(β̂ | σ², X) = σ² (X^T X)^{-1}. Also, β̂ is the best linear unbiased estimator of β: for all α ∈ R^{k+1}, Var(α^T β̂ | σ², X) ≤ Var(α^T β̄ | σ², X), where β̄ is any unbiased linear estimator of β.
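A minimal numerical sketch of this least-squares computation (Python/NumPy is assumed here rather than the MatLab/C++ implementations referenced in these slides; the synthetic X and y below only stand in for the caterpillar data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 3                                                  # small synthetic problem, not the caterpillar data
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])    # design matrix with intercept column
beta_true = np.array([1.0, 0.5, 0.0, -0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# MLE / least-squares estimate: beta_hat = (X^T X)^{-1} X^T y
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# unbiased estimate of sigma^2 and the approximate covariance of beta_hat
s2 = np.sum((y - X @ beta_hat) ** 2)                          # s^2 = (y - X beta_hat)^T (y - X beta_hat)
sigma2_hat = s2 / (n - k - 1)
cov_beta_hat = sigma2_hat * np.linalg.inv(XtX)                # Var(beta_hat | sigma^2, X) ~ sigma2_hat (X^T X)^{-1}
```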

12 MLE Estimator The MLE of σ² is the solution of

min_{σ²} [ (1/(2σ²)) (y - Xβ)^T (y - Xβ) + (n/2) ln σ² ] evaluated at β = β̂,

which gives σ̂²_MLE = (y - Xβ̂)^T (y - Xβ̂) / n. Let M = X (X^T X)^{-1} X^T be the n x n matrix projecting the data y onto the column space of X. Then

E[σ̂²_MLE] = E[ (y - Xβ̂)^T (y - Xβ̂) ] / n = E[ (y - My)^T (y - My) ] / n = E[ y^T (I - M)^T (I - M) y ] / n = E[ y^T (I - M) y ] / n.

You can easily show that (see this slide): for Y an n-dimensional random vector with E[Y] = μ and Cov[Y] = V, and A an n x n matrix, E[Y^T A Y] = tr(AV) + μ^T A μ.

13 MLE Estimator Applying the result of the earlier slide, we have

E[σ̂²_MLE] = E[ y^T (I - M) y ] / n = [ tr( (I - M) σ² I_n ) + (Xβ)^T (I - M) (Xβ) ] / n = σ² (n - k - 1) / n,

since (I - M) X = 0 and tr(I - M) = tr(I_n) - tr(M) = n - tr( X (X^T X)^{-1} X^T ) = n - tr( (X^T X)^{-1} X^T X ) = n - tr(I_{k+1}) = n - k - 1. Note: I - M has 1 as its eigenvalue with multiplicity n - k - 1 and 0 with multiplicity k + 1.

14 MLE Estimator The expectation of the MLE of σ² is then E[σ̂²_MLE] = (n - k - 1) σ² / n. The unbiased estimator of σ² is thus

σ̂² = (y - Xβ̂)^T (y - Xβ̂) / (n - k - 1) = s² / (n - k - 1), where s² = (y - Xβ̂)^T (y - Xβ̂).

Indeed, using the result from the earlier slide, E[σ̂²] = E[ (y - Xβ̂)^T (y - Xβ̂) ] / (n - k - 1) = σ² (n - k - 1) / (n - k - 1) = σ². To approximate the covariance of β̂ in β̂ ~ N( β, σ² (X^T X)^{-1} ), use Var(β̂ | σ², X) ≈ σ̂² (X^T X)^{-1}.

15 Appendix We can prove this theorem quite easily as follows. For Y an n-dimensional random vector with E[Y] = μ and Cov[Y] = V, and A an n x n matrix, E[Y^T A Y] = tr(AV) + μ^T A μ. Indeed,

(Y - μ)^T A (Y - μ) = Y^T A Y - μ^T A Y - Y^T A μ + μ^T A μ.

Using the identity E[tr W] = tr E[W] for any random square matrix W, we can write the expectation of the l.h.s. of the equation above as

E[ (Y - μ)^T A (Y - μ) ] = E[ tr( A (Y - μ)(Y - μ)^T ) ] = tr( A E[ (Y - μ)(Y - μ)^T ] ) = tr(AV).

Taking expectations on the r.h.s. gives E[Y^T A Y] - μ^T A μ, so E[Y^T A Y] = tr(AV) + μ^T A μ. This proves the theorem.

16 T-Statistic We can define the standard t-statistic as follows:

T_i = ( β̂_i - β_i ) / sqrt( σ̂² ω_ii ) ~ T( n - k - 1, 0, 1 ), where ω_ii = (X^T X)^{-1}_{(i,i)}.

This statistic can be used for hypothesis testing, e.g. accept H_0: β_i = 0 versus H_1: β_i ≠ 0 at level α if

|β̂_i| / sqrt( σ̂² ω_ii ) ≤ F^{-1}_{n-k-1}( 1 - α/2 ), the ( 1 - α/2 ) quantile of T_{n-k-1}.

17 T-Statistic The frequentist argument in using this bound is that there is significant evidence against H_0 if the p-value is smaller than α:

p_i = P_{H_0}( |T_i| > |t_i| ), with t_i = β̂_i / sqrt( σ̂² ω_ii ),
    = P_{H_0}( T_i < -|t_i| ) + P_{H_0}( T_i > |t_i| ) = F_{n-k-1}( -|t_i| ) + 1 - F_{n-k-1}( |t_i| ) < α.

Finally, the statistic T_i can be used to derive the frequentist marginal confidence interval

{ β_i : |β_i - β̂_i| ≤ sqrt( σ̂² ω_ii ) F^{-1}_{n-k-1}( 1 - α/2 ) }.
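A short sketch of how these t-statistics, p-values, and confidence intervals could be computed (Python with NumPy/SciPy assumed; the synthetic data is only a stand-in for the caterpillar dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])    # synthetic design with intercept
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = s2 / (n - k - 1)
omega = np.diag(XtX_inv)                                      # omega_ii = (X^T X)^{-1}_{(i,i)}

t_stat = beta_hat / np.sqrt(sigma2_hat * omega)               # t_i = beta_hat_i / sqrt(sigma2_hat * omega_ii)
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - k - 1)          # p_i = P_{H0}(|T_i| > |t_i|)

# 1 - alpha frequentist marginal confidence intervals
alpha = 0.05
half = stats.t.ppf(1 - alpha / 2, df=n - k - 1) * np.sqrt(sigma2_hat * omega)
ci = np.column_stack([beta_hat - half, beta_hat + half])
```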

18 MLE (Least Squares) Estimates Using p_i = P_{H_0}( |T_i| > |t_i| = |β̂_i| / sqrt( σ̂² ω_ii ) ) for each coefficient: [Table: Estimate, Std. Error, t-value and Pr(>|t|) for the intercept and the ten covariates XV1-XV10, with significance codes; the numerical values are not preserved in this transcription.] An implementation is available (MatLab, C++).

19 Likelihood

l(β, σ² | y, X) = (2πσ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ)^T (y - Xβ) )

Note that the likelihood above can be written in the following very useful form:

l(β, σ² | y, X) = (2πσ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ̂ + Xβ̂ - Xβ)^T (y - Xβ̂ + Xβ̂ - Xβ) )
               = (2πσ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ̂)^T (y - Xβ̂) - (1/(2σ²)) (β - β̂)^T X^T X (β - β̂) )
               = (2πσ²)^{-n/2} exp( -(1/(2σ²)) s² - (1/(2σ²)) (β - β̂)^T X^T X (β - β̂) ),

where s² = (y - Xβ̂)^T (y - Xβ̂) and β̂ = (X^T X)^{-1} X^T y.

20 Conjugate Priors Observing the form of the likelihood suggests the following conjugate prior:

β | σ², X ~ N_{k+1}( β̃, σ² M^{-1} ), M a (k+1) x (k+1) positive definite symmetric matrix,
σ² | X ~ InvGamma(a, b), a, b > 0.

The posterior is then

π(β, σ² | β̂, s², X) ∝ (2πσ²)^{-n/2} exp( -(1/(2σ²)) s² - (1/(2σ²)) (β - β̂)^T X^T X (β - β̂) )
   × (σ²)^{-(k+1)/2} exp( -(1/(2σ²)) (β - β̃)^T M (β - β̃) ) × (σ²)^{-(a+1)} exp( -b/σ² )
 ∝ σ^{-(k + n + 2a + 3)} exp( -(1/(2σ²)) { s² + 2b + (β - β̂)^T X^T X (β - β̂) + (β - β̃)^T M (β - β̃) } )
 = σ^{-(k + n + 2a + 3)} exp( -(1/(2σ²)) { s² + 2b + β^T (M + X^T X) β - 2 β^T (M β̃ + X^T X β̂) + β̃^T M β̃ + β̂^T X^T X β̂ } ).

21 Conjugate Priors: Posterior of β given σ² Completing the square in β,

π(β, σ² | β̂, s², X) ∝ σ^{-(k + n + 2a + 3)} exp( -(1/(2σ²)) { s² + 2b
   + ( β - (M + X^T X)^{-1}( M β̃ + X^T X β̂ ) )^T (M + X^T X) ( β - (M + X^T X)^{-1}( M β̃ + X^T X β̂ ) )
   + β̃^T M β̃ + β̂^T X^T X β̂ - ( M β̃ + X^T X β̂ )^T (M + X^T X)^{-1} ( M β̃ + X^T X β̂ ) } ).

Based on this, the posterior of β given σ² is

π(β | σ², y, X) ∝ exp( -(1/(2σ²)) ( β - E[β | σ², y, X] )^T (M + X^T X) ( β - E[β | σ², y, X] ) ),

i.e. β | σ², y, X ~ N_{k+1}( (M + X^T X)^{-1}( X^T X β̂ + M β̃ ), σ² (M + X^T X)^{-1} ), where E[β | σ², y, X] = (M + X^T X)^{-1}( M β̃ + X^T X β̂ ) and Var[β | σ², y, X] = σ² (M + X^T X)^{-1}.

22 Conjugate Priors: Marginal Posterior of σ²

π(β, σ² | β̂, s², X) ∝ σ^{-(k + n + 2a + 3)} exp( -(1/(2σ²)) { s² + 2b
   + ( β - E[β | σ², y, X] )^T (M + X^T X) ( β - E[β | σ², y, X] )
   + β̃^T M β̃ + β̂^T X^T X β̂ - ( M β̃ + X^T X β̂ )^T (M + X^T X)^{-1} ( M β̃ + X^T X β̂ ) } ).

Integrating out β gives the following marginal for σ²:

π(σ² | β̂, s², X) ∝ σ^{-(n + 2a + 2)} exp( -(1/(2σ²)) { β̃^T M β̃ + β̂^T X^T X β̂ + s² + 2b
   - ( M β̃ + X^T X β̂ )^T (M + X^T X)^{-1} ( M β̃ + X^T X β̂ ) } ).

To simplify, we will use the following two identities:

A: (M + X^T X)^{-1} = (X^T X)^{-1} - (X^T X)^{-1} [ M^{-1} + (X^T X)^{-1} ]^{-1} (X^T X)^{-1}
B: X^T X (M + X^T X)^{-1} M = M (M + X^T X)^{-1} X^T X = [ M^{-1} + (X^T X)^{-1} ]^{-1}

23 Conjugate Priors: Marginal Posterior of σ²

π(σ² | β̂, s², X) ∝ σ^{-(n + 2a + 2)} exp( -(1/(2σ²)) { β̃^T M β̃ + β̂^T X^T X β̂ + s² + 2b
   - ( M β̃ + X^T X β̂ )^T (M + X^T X)^{-1} ( M β̃ + X^T X β̂ ) } ).

We can simplify the last term above using identities A and B. Expanding,

( M β̃ + X^T X β̂ )^T (M + X^T X)^{-1} ( M β̃ + X^T X β̂ )
   = 2 β̂^T X^T X (M + X^T X)^{-1} M β̃ + β̃^T M (M + X^T X)^{-1} M β̃ + β̂^T X^T X (M + X^T X)^{-1} X^T X β̂,

where, by identity B, the cross term equals 2 β̂^T [ M^{-1} + (X^T X)^{-1} ]^{-1} β̃, and by identity A (and its analogue with the roles of M and X^T X exchanged), M (M + X^T X)^{-1} M = M - [ M^{-1} + (X^T X)^{-1} ]^{-1} and X^T X (M + X^T X)^{-1} X^T X = X^T X - [ M^{-1} + (X^T X)^{-1} ]^{-1}. Collecting terms,

β̃^T M β̃ + β̂^T X^T X β̂ - ( M β̃ + X^T X β̂ )^T (M + X^T X)^{-1} ( M β̃ + X^T X β̂ ) = ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ).

Finally:

π(σ² | β̂, s², X) ∝ σ^{-(n + 2a + 2)} exp( -(1/(2σ²)) { ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) + s² + 2b } ).

24 Conjugate Priors: Marginal Posterior of σ²

π(σ² | β̂, s², X) ∝ σ^{-(n + 2a + 2)} exp( -(1/(2σ²)) { ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) + s² + 2b } )

The marginal posterior is

σ² | y, X ~ InvGamma( n/2 + a, b + s²/2 + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) / 2 ).

The marginal posterior mean (for n + 2a > 2) is then

E^π[σ² | y, X] = { 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) } / ( n + 2a - 2 ).

25 Conjugate Priors: MLE, Posterior Mean and Ridge Estimator Note that setting M = I_{k+1}/c, c > 0, and β̃ = 0_{k+1} in the conditional posterior mean

E[β | σ², y, X] = (M + X^T X)^{-1}( M β̃ + X^T X β̂ )

we obtain the classical ridge regression estimate:

E[β | σ², y, X] = ( I_{k+1}/c + X^T X )^{-1} X^T X β̂ = ( I_{k+1}/c + X^T X )^{-1} X^T y.

The general estimator can be seen as a weighted average of the prior mean and the MLE.

26 Conjugate Priors: Marginal Posterior of β

π(β | y, X) = ∫ π(β | σ², y, X) π(σ² | y, X) dσ²,
π(β | σ², y, X) = N_{k+1}( (M + X^T X)^{-1}( X^T X β̂ + M β̃ ), σ² (M + X^T X)^{-1} ),
π(σ² | y, X) = IG( n/2 + a, b + s²/2 + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) / 2 ).

To integrate, we use the following transformations: τ = 1/σ², dσ² = -dτ/τ², μ = (M + X^T X)^{-1}( X^T X β̂ + M β̃ ), and

Z² = (τ/2) { ( β - μ )^T (M + X^T X) ( β - μ ) + 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) } = (τ/2) G, so that dτ = (2/G) d(Z²).

27 Conjugate Priors: Marginal Posterior of β With τ = 1/σ², μ = (M + X^T X)^{-1}( X^T X β̂ + M β̃ ) and

G = ( β - μ )^T (M + X^T X) ( β - μ ) + 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ),

the marginal now takes the form

π(β | y, X) ∝ ∫_0^∞ (σ²)^{-(k+1)/2 - n/2 - a - 1} e^{-G/(2σ²)} dσ² = ∫_0^∞ τ^{(k+1)/2 + n/2 + a - 1} e^{-τ G/2} dτ ∝ G^{-( (k+1)/2 + n/2 + a )},

since the remaining integral (after the further substitution Z² = τ G/2) is a scalar quantity independent of β.

28 Conjugate Priors: Marginal Posterior of β Thus integrating out σ² leads to a multivariate Student's t marginal posterior on β:

π(β | y, X) ∝ [ ( β - μ )^T (M + X^T X) ( β - μ ) + 2b + s² + ( β̃ - β̂ )^T ( M^{-1} + (X^T X)^{-1} )^{-1} ( β̃ - β̂ ) ]^{-(k + n + 2a + 1)/2},

where μ = (M + X^T X)^{-1}( X^T X β̂ + M β̃ ). Recall that the density of the multivariate T_p(ν, θ, Σ) is

f(t) = Γ( (ν + p)/2 ) / [ Γ(ν/2) (νπ)^{p/2} ] det(Σ)^{-1/2} [ 1 + (t - θ)^T Σ^{-1} (t - θ)/ν ]^{-(ν + p)/2}.

29 Conjugate Priors: Marginal Posterior of β We can now see that our marginal

π(β | y, X) ∝ [ ( β - μ )^T (M + X^T X) ( β - μ ) + 2b + s² + ( β̃ - β̂ )^T ( M^{-1} + (X^T X)^{-1} )^{-1} ( β̃ - β̂ ) ]^{-(k + n + 2a + 1)/2}

can be written (check by substituting the expressions below) in the form of the multivariate T_p(ν, θ, Σ) density given above, with p = k + 1 and ν = n + 2a:

β | y, X ~ T_{k+1}( n + 2a, μ, Σ ),
μ = (M + X^T X)^{-1}( X^T X β̂ + M β̃ ),
Σ = [ 2b + s² + ( β̃ - β̂ )^T ( M^{-1} + (X^T X)^{-1} )^{-1} ( β̃ - β̂ ) ] / ( n + 2a ) × (M + X^T X)^{-1}.
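A sketch of the conjugate-prior posterior quantities derived above (conditional posterior mean, marginal inverse-gamma posterior for σ², and the parameters of the marginal Student's t for β), assuming Python/NumPy and the particular choices M = I_{k+1}/c and β̃ = 0 used later in the implementation slides; the data below is synthetic, not the caterpillar data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

# conjugate prior: beta | sigma^2 ~ N(beta_tilde, sigma^2 M^{-1}), sigma^2 ~ InvGamma(a, b)
a, b, c = 2.1, 2.0, 100.0
beta_tilde = np.zeros(k + 1)
M = np.eye(k + 1) / c

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
s2 = np.sum((y - X @ beta_hat) ** 2)

A = M + XtX
mu = np.linalg.solve(A, XtX @ beta_hat + M @ beta_tilde)       # E[beta | sigma^2, y, X]

d = beta_tilde - beta_hat
quad = d @ np.linalg.solve(np.linalg.inv(M) + np.linalg.inv(XtX), d)

# marginal posterior: sigma^2 | y, X ~ InvGamma(n/2 + a, b + s2/2 + quad/2)
ig_shape = n / 2 + a
ig_scale = b + s2 / 2 + quad / 2
post_mean_sigma2 = ig_scale / (ig_shape - 1)                   # = (2b + s2 + quad) / (n + 2a - 2)

# marginal posterior: beta | y, X ~ T_{k+1}(n + 2a, mu, Sigma)
Sigma = (2 * b + s2 + quad) / (n + 2 * a) * np.linalg.inv(A)
post_var_beta = (n + 2 * a) / (n + 2 * a - 2) * Sigma          # covariance of a t with nu = n + 2a dof
```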

30 Multivariate Student's-t Distribution Recall the properties of the multivariate Student's-t distribution (density, mean, and covariance).* [Table reproduced from the reference below.] *Note that in the notation of the table the degrees of freedom of the distribution, rather than the dimensionality, are shown as subscripts, e.g. t_ν vs. t_d in the earlier slide. Hopefully the notation will be clear from the discussion. Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin.

31 Conjugate Priors: Predictive Distribution For a given m x (k+1) explanatory matrix X̃, the outcome ỹ can be inferred through the predictive distribution* π(ỹ | σ², y, X, X̃). Since π(ỹ | X̃, β, σ²) = N( X̃β, σ² I_m ), and since the posterior of β conditional on σ² is given as

β | σ², y, X ~ N_{k+1}( (M + X^T X)^{-1}( X^T X β̂ + M β̃ ), σ² (M + X^T X)^{-1} ),

we can see that

π(ỹ | σ², y, X, X̃) = ∫ π(ỹ | β, σ², y, X, X̃) π(β | σ², y, X) dβ

is a Gaussian. * We will later on integrate σ² out to compute π(ỹ | y, X, X̃).

32 Appendix We will make use next of the following conditional expectation equalities:

E[X | Z] = E[ E[X | Y, Z] | Z ]
Var[X | Z] = Var[ E[X | Y, Z] | Z ] + E[ Var[X | Y, Z] | Z ]

The proof of these is quite straightforward, e.g.

E[X | Z] = ∫ x p(x | z) dx = ∫ x ( ∫ p(x, y | z) dy ) dx = ∫ x ( ∫ p(x | y, z) p(y | z) dy ) dx = ∫ ( ∫ x p(x | y, z) dx ) p(y | z) dy = E[ E[X | Y, Z] | Z ].

33 Conjugate Priors: Predictive Distribution We can compute the mean and the variance of this predictive distribution as follows:

E^π[ỹ | σ², y, X, X̃] = E^π[ E^π(ỹ | β, σ², y, X, X̃) | σ², y, X, X̃ ] = E^π[ X̃β | σ², y, X, X̃ ] = X̃ (M + X^T X)^{-1}( X^T X β̂ + M β̃ ),

and

Var^π[ỹ | σ², y, X, X̃] = E^π[ Var(ỹ | β, σ², y, X, X̃) | σ², y, X, X̃ ] + Var^π[ E(ỹ | β, σ², y, X, X̃) | σ², y, X, X̃ ]
   = E^π[ σ² I_m | σ², y, X, X̃ ] + Var^π[ X̃β | σ², y, X, X̃ ] = σ² I_m + X̃ σ² (M + X^T X)^{-1} X̃^T = σ² ( I_m + X̃ (M + X^T X)^{-1} X̃^T ).

In conclusion:

ỹ | σ², y, X, X̃ ~ N_m( X̃ E[β | σ², y, X], σ² I_m + X̃ Var[β | σ², y, X] X̃^T ), with E[β | σ², y, X] = (M + X^T X)^{-1}( X^T X β̂ + M β̃ ).

34 Conjugate Priors: Predictive Distribution

ỹ | σ², y, X, X̃ ~ N_m( X̃ E[β | σ², y, X], σ² I_m + X̃ Var[β | σ², y, X] X̃^T ),
E[ỹ | σ², y, X, X̃] = X̃ E[β | σ², y, X] = X̃ (M + X^T X)^{-1}( X^T X β̂ + M β̃ ),
Var[ỹ | σ², y, X, X̃] = σ² I_m + X̃ Var[β | σ², y, X] X̃^T = σ² ( I_m + X̃ (M + X^T X)^{-1} X̃^T ).

We now integrate σ² against the posterior distribution

σ² | y, X ~ IG( n/2 + a, b + s²/2 + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) / 2 )

to obtain

π(ỹ | y, X, X̃) = ∫_0^∞ N( X̃ E[β | σ², y, X], σ² I_m + X̃ Var[β | σ², y, X] X̃^T ) IG( n/2 + a, b + s²/2 + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) / 2 ) dσ².

35 Conjugate Priors: Predictive Distribution We substitute σ² I_m + X̃ Var[β | σ², y, X] X̃^T = σ²( I_m + X̃ (M + X^T X)^{-1} X̃^T ); the Gaussian-times-inverse-gamma product then gives

π(ỹ | y, X, X̃) ∝ ∫_0^∞ σ^{-(m + n + 2a + 2)} exp( -(1/(2σ²)) { 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ )
   + ( ỹ - E[ỹ | σ², y, X, X̃] )^T ( I_m + X̃ (M + X^T X)^{-1} X̃^T )^{-1} ( ỹ - E[ỹ | σ², y, X, X̃] ) } ) dσ².

36 Conjugate Priors: Predictive Distribution

π(ỹ | y, X, X̃) ∝ ∫_0^∞ σ^{-(m + n + 2a + 2)} exp( -(1/(2σ²)) { 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ )
   + ( ỹ - E[ỹ | σ², y, X, X̃] )^T ( I_m + X̃ (M + X^T X)^{-1} X̃^T )^{-1} ( ỹ - E[ỹ | σ², y, X, X̃] ) } ) dσ².

If we call the term inside {·} G, then

π(ỹ | y, X, X̃) ∝ ∫_0^∞ (σ²)^{-(m + n + 2a + 2)/2} e^{-G/(2σ²)} dσ² ∝ G^{-(m + n + 2a)/2},

the remaining integral (after the substitution Z² = G/(2σ²)) being a constant independent of ỹ. Hence

π(ỹ | y, X, X̃) ∝ { 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ )
   + ( ỹ - E[ỹ | σ², y, X, X̃] )^T ( I_m + X̃ (M + X^T X)^{-1} X̃^T )^{-1} ( ỹ - E[ỹ | σ², y, X, X̃] ) }^{-(m + n + 2a)/2}.

37 Conjugate Priors: Predictive Distribution

π(ỹ | y, X, X̃) ∝ { 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ )
   + ( ỹ - E[ỹ | σ², y, X, X̃] )^T ( I_m + X̃ (M + X^T X)^{-1} X̃^T )^{-1} ( ỹ - E[ỹ | σ², y, X, X̃] ) }^{-(m + n + 2a)/2}

This corresponds to a Student's t distribution T_p(ν, θ, Σ) (with the density given earlier):

ỹ | y, X, X̃ ~ T_m( n + 2a, μ̃, Σ̃ ),
μ̃ = E[ỹ | σ², y, X, X̃] = X̃ (M + X^T X)^{-1}( X^T X β̂ + M β̃ ),
Σ̃ = [ 2b + s² + ( β̃ - β̂ )^T ( M^{-1} + (X^T X)^{-1} )^{-1} ( β̃ - β̂ ) ] / ( n + 2a ) × ( I_m + X̃ (M + X^T X)^{-1} X̃^T ).
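A sketch computing the parameters of this predictive Student's t under the same assumptions as before (Python/NumPy, M = I/c, β̃ = 0; the synthetic data and the hypothetical new design matrix X_new are placeholders, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 40, 3, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)
X_new = np.column_stack([np.ones(m), rng.normal(size=(m, k))])   # m new design points (hypothetical)

a, b, c = 2.1, 2.0, 100.0
beta_tilde, M = np.zeros(k + 1), np.eye(k + 1) / c

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
s2 = np.sum((y - X @ beta_hat) ** 2)
d = beta_tilde - beta_hat
quad = d @ np.linalg.solve(np.linalg.inv(M) + np.linalg.inv(XtX), d)

A_inv = np.linalg.inv(M + XtX)
pred_mean = X_new @ A_inv @ (XtX @ beta_hat + M @ beta_tilde)            # location of the predictive t
pred_scale = (2 * b + s2 + quad) / (n + 2 * a) * (np.eye(m) + X_new @ A_inv @ X_new.T)
# y_new | y, X, X_new ~ T_m(n + 2a, pred_mean, pred_scale)
```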

38 Implementation of the Conjugate Prior Our prior: β | σ², X ~ N_{k+1}( β̃, σ² M^{-1} ), M a (k+1) x (k+1) positive definite symmetric matrix; σ² | X ~ IG(a, b), a, b > 0. We assume that there is no precise information available about β̃, M, a, b. Let us choose M = I_{k+1}/c, so that β | σ², X ~ N_{k+1}( 0_{k+1}, c σ² I_{k+1} ), and take a = 2.1 and b = 2, i.e. prior mean and prior variance of σ² equal to 1.82 and 33.06, with β̃ = 0_{k+1}. The intuition is that if c is large, the prior on β should be more diffuse and have less bearing on the outcome. However, this turns out not to be the case: there is a lasting influence of c on the posterior means of σ² and β_0.

39 Influence of the Prior Scale c on Bayesian Estimates **************** Conjugate Priors ******************** [Table: E[σ² | y, X], E[β_0 | y, X] and V[β_0 | y, X] for a range of values of c; the numerical values are not preserved in this transcription.] An implementation is available (MatLab, C++).

E^π[σ² | y, X] = { 2b + s² + ( β̃ - β̂ )^T [ M^{-1} + (X^T X)^{-1} ]^{-1} ( β̃ - β̂ ) } / ( n + 2a - 2 )
E^π[β | y, X] = (M + X^T X)^{-1}( X^T X β̂ + M β̃ )
Var[β | y, X] = { 2b + s² + ( β̃ - β̂ )^T ( M^{-1} + (X^T X)^{-1} )^{-1} ( β̃ - β̂ ) } / ( n + 2a - 2 ) × (M + X^T X)^{-1}

40 Bayesian Estimates of for c=100 ************** Conjugate Priors*************** Bayes estimates of beta for c= i y, X V i y, X beta_i E[beta_i y, X] V[beta_i y, X] beta_ beta_ beta_ beta_ beta_ beta_ beta_ beta_ beta_ beta_ beta_ An implementation is available MatLab, C++ 40

41 Influence of the Conjugate Prior The value of c thus has a significant influence on the estimators and an even greater one on the posterior variance. The Bayes estimates stabilize only for very large values of c. Thus the prior associated with a particular c should not be considered as a weak or pseudo-noninformative prior but, on the contrary, as carrying specific proper prior information. The dependence on (a, b) is equally strong. Considering these limitations of conjugate priors on at least the posterior variance, a more sophisticated noninformative strategy is needed. We first look at a middle-ground perspective which settles the problem of the choice of M.

42 Zellner's Informative G-Prior We start with a middle-ground solution by introducing only information about the location parameter of the regression while bypassing the selection of the prior correlation structure:

β | σ², X ~ N_{k+1}( β̃, c σ² (X^T X)^{-1} ),
π(σ² | X) ∝ σ^{-2} (improper Jeffreys prior).

Zellner, A. (1971). An Introduction to Bayesian Econometrics. John Wiley, New York. Zellner, A. (1984). Basic Issues in Econometrics. University of Chicago Press, Chicago. Carlin, B. and Louis, T. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, New York.

43 Zellner's Informative G-Prior We start with a middle-ground solution by introducing only information about the location parameter of the regression while bypassing the selection of the prior correlation structure:

β | σ², X ~ N_{k+1}( β̃, c σ² (X^T X)^{-1} ),
π(σ² | X) ∝ σ^{-2} (improper Jeffreys prior).

The prior determination is restricted to the choices of β̃ and the constant c. c can be interpreted as a measure of the information available in the prior relative to the sample. E.g., we will see that setting 1/c = 0.5 gives the prior the same weight as 50% of the sample. There is still a strong influence of c.

44 Zellner's Informative G-Prior: Conditional Posterior of β The joint posterior now takes the form (note that X^T X is used in both likelihood and prior):

π(β, σ² | y, X) ∝ (σ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ̂)^T (y - Xβ̂) - (1/(2σ²)) (β - β̂)^T X^T X (β - β̂) )
   × (σ²)^{-(k+1)/2} exp( -(1/(2cσ²)) (β - β̃)^T X^T X (β - β̃) ) × (σ²)^{-1}.

We can now compute the following:

π(β | σ², y, X) ∝ exp( -(1/(2σ²)) (β - β̂)^T X^T X (β - β̂) - (1/(2cσ²)) (β - β̃)^T X^T X (β - β̃) )
   ∝ exp( -( (c+1)/(2cσ²) ) ( β - (c/(c+1))( β̃/c + β̂ ) )^T X^T X ( β - (c/(c+1))( β̃/c + β̂ ) ) ),

i.e. β | σ², y, X ~ N_{k+1}( (c/(c+1))( β̃/c + β̂ ), ( cσ²/(c+1) ) (X^T X)^{-1} ).

45 Zellner's Informative G-Prior: Posterior Marginal of σ² Starting again with the posterior

π(β, σ² | y, X) ∝ (σ²)^{-(n + k + 3)/2} exp( -(1/(2σ²)) s² - (1/(2σ²)) (β - β̂)^T X^T X (β - β̂) - (1/(2cσ²)) (β - β̃)^T X^T X (β - β̃) )
   ∝ (σ²)^{-(k+1)/2} exp( -( (c+1)/(2cσ²) ) ( β - (c/(c+1))( β̃/c + β̂ ) )^T X^T X ( β - (c/(c+1))( β̃/c + β̂ ) ) )
     × (σ²)^{-n/2 - 1} exp( -(1/(2σ²)) { s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (c + 1) } ),

we can now compute the following posterior marginal by integrating in β:

σ² | y, X ~ InvGamma( n/2, s²/2 + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (2(c + 1)) ),
E[σ² | y, X] = { s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (c + 1) } / ( n - 2 ).

46 Zellner's Informative G-Prior: Posterior Marginal of β Starting again with the posterior

π(β, σ² | y, X) ∝ (σ²)^{-(n + k + 3)/2} exp( -(1/(2σ²)) [ ( β - (c/(c+1))( β̃/c + β̂ ) )^T ( (c+1)/c ) X^T X ( β - (c/(c+1))( β̃/c + β̂ ) )
   + s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (c + 1) ] ),

we can now compute the following posterior marginal by integrating in σ²:

π(β | y, X) ∝ [ ( β - (c/(c+1))( β̃/c + β̂ ) )^T ( (c+1)/c ) X^T X ( β - (c/(c+1))( β̃/c + β̂ ) ) + s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) ]^{-(n + k + 1)/2},

i.e. β | y, X ~ T_{k+1}( n, (c/(c+1))( β̃/c + β̂ ), { c ( s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) ) / ( n(c + 1) ) } (X^T X)^{-1} ).

47 Zellner's Informative G-Prior: Bayes Estimates The Bayes estimate of β can be derived from

β | y, X ~ T_{k+1}( n, (c/(c+1))( β̃/c + β̂ ), { c ( s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) ) / ( n(c + 1) ) } (X^T X)^{-1} ):

E[β | y, X] = ( β̃ + c β̂ ) / ( c + 1 ).

The posterior variance of β is also given from the above as

V^π[β | y, X] = ( c/(c + 1) ) { s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) } / ( n - 2 ) × (X^T X)^{-1}.
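A sketch of these Zellner G-prior Bayes estimates (Python/NumPy assumed; the prior mean β̃ = 0 and the synthetic data are assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, c = 40, 3, 100.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)
beta_tilde = np.zeros(k + 1)                     # G-prior mean (assumed 0 here)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
s2 = np.sum((y - X @ beta_hat) ** 2)
d = beta_tilde - beta_hat
quad = d @ XtX @ d                               # (beta_tilde - beta_hat)^T X^T X (beta_tilde - beta_hat)

post_mean_beta = (beta_tilde + c * beta_hat) / (c + 1)                          # E[beta | y, X]
post_var_beta = (c / (c + 1)) * (s2 + quad / (c + 1)) / (n - 2) * np.linalg.inv(XtX)
post_mean_sigma2 = (s2 + quad / (c + 1)) / (n - 2)                              # E[sigma^2 | y, X]
```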

48 Zellner's Informative G-Prior: Bayes Estimates E[β | y, X] = ( β̃ + c β̂ ) / ( c + 1 ). If c = 1, you can see from the above equation that it is like putting the same weight on the prior information and on the sample: E[β | y, X] = ( β̃ + β̂ ) / 2, which is the average of the prior mean and the MLE estimator. If c = 100, the prior weighs 1% of the sample.

49 Zellner's Informative G-Prior: Bayes Estimates The Bayes estimate of σ² can be derived from (use the mean of the InvGamma)

σ² | y, X ~ IG( n/2, s²/2 + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (2(c + 1)) )

as E[σ² | y, X] = { s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (c + 1) } / ( n - 2 ). Note that only when c goes to infinity does the influence of the prior vanish.

50 Zellner's Informative G-Prior: Marginal Posterior Mean and Variance of β **************** Zellner's G-Prior ***************** Posterior mean and variance of beta for c = 100, with

E[β | y, X] = ( β̃ + c β̂ ) / ( c + 1 ),
V^π[β | y, X] = ( c/(c + 1) ) { s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) } / ( n - 2 ) × (X^T X)^{-1}.

[Table: E[β_i | y, X], V[β_i | y, X] and log10(BF) for the intercept and X1 through X10, with significance flags; the numerical values are not fully preserved in this transcription.] See here for the BF calculation. An implementation is available (MatLab, C++). Evidence against H_0: (****) decisive, (***) strong, (**) substantial, (*) poor.

51 Zellner's Informative G-Prior: Marginal Posterior Mean and Variance of β **************** Zellner's G-Prior ***************** Posterior mean and variance of beta for c = 1000, with

E[β | y, X] = ( β̃ + c β̂ ) / ( c + 1 ),
V^π[β | y, X] = ( c/(c + 1) ) { s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) } / ( n - 2 ) × (X^T X)^{-1}.

[Table: E[β_i | y, X], V[β_i | y, X] and log10(BF) for the intercept and X1 through X10, with significance flags; the numerical values are not fully preserved in this transcription.] See here for the BF calculation. An implementation is available (MatLab, C++). Evidence against H_0: (****) decisive, (***) strong, (**) substantial, (*) poor.

52 Zellner's Informative G-Prior: Predictive Modeling We want to predict m ≥ 1 future observations ỹ when the explanatory variables X̃, but not the outcome variables, have been observed:

ỹ ~ N_m( X̃β, σ² I_m ).

The predictive distribution on ỹ is defined as the marginal of the joint posterior distribution on (ỹ, β, σ²). It can be computed analytically:

π(ỹ | y, X, X̃) ∝ ∫ π(ỹ | σ², y, X, X̃) π(σ² | y, X, X̃) dσ².

53 Zellner's Informative G-Prior: Predictive Modeling Since β | σ², y, X ~ N_{k+1}( (c/(c+1))( β̃/c + β̂ ), ( cσ²/(c+1) ) (X^T X)^{-1} ), conditional on σ² the future vector of observations has a Gaussian distribution with

E^π[ỹ | σ², y, X, X̃] = E^π[ E^π(ỹ | β, σ², y, X, X̃) | σ², y, X, X̃ ] = E^π[ X̃β | σ², y, X, X̃ ] = X̃ ( β̃ + c β̂ ) / ( c + 1 ),

independent of σ². This representation is quite intuitive, being the product of the matrix of explanatory variables X̃ by the Bayes estimate of β.

54 Zellner's Informative G-Prior: Predictive Modeling With β | σ², y, X ~ N_{k+1}( (c/(c+1))( β̃/c + β̂ ), ( cσ²/(c+1) ) (X^T X)^{-1} ), similarly we can compute

V^π[ỹ | σ², y, X, X̃] = E^π[ V(ỹ | β, σ², y, X, X̃) | σ², y, X, X̃ ] + V^π[ E(ỹ | β, σ², y, X, X̃) | σ², y, X, X̃ ]
   = E^π[ σ² I_m | σ², y, X, X̃ ] + V^π[ X̃β | σ², y, X, X̃ ]
   = σ² ( I_m + ( c/(c + 1) ) X̃ (X^T X)^{-1} X̃^T ).

55 Zellner's Informative G-Prior: Credible Regions - Highest Posterior Density Regions Here we are interested in the highest posterior density (HPD) regions on subvectors of the parameter β, derived from the marginal posterior distribution of β. For a single parameter,

β_i | y, X ~ T_1( n, ( β̃_i + c β̂_i ) / ( c + 1 ), { c ( s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) ) / ( n(c + 1) ) } ω_ii ),

where ω_ii is the (i, i) element of (X^T X)^{-1}.

56 Zellner's Informative G-Prior: Credible Regions - Highest Posterior Density Regions Let us define

τ = ( β̃ + c β̂ ) / ( c + 1 )  and  K = { c ( s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) ) / ( n(c + 1) ) } (X^T X)^{-1} = (K_ij).

The variable ( β_i - τ_i ) / sqrt(K_ii) has a T-distribution with n degrees of freedom.

57 Zellner's Informative G-Prior: Credible Regions - Highest Posterior Density Regions A 1 - α HPD interval on β_i is thus given by (see also here)

[ τ_i - sqrt(K_ii) F_n^{-1}( 1 - α/2 ), τ_i + sqrt(K_ii) F_n^{-1}( 1 - α/2 ) ],

where F_n is the CDF of T(ν = n).
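A sketch of this HPD computation for the individual β_i's (Python with SciPy's Student-t quantiles; the synthetic data, the choice c = 100, β̃ = 0, and the credibility level are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k, c, level = 40, 3, 100.0, 0.95
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)
beta_tilde = np.zeros(k + 1)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2)
d = beta_tilde - beta_hat
quad = d @ (X.T @ X) @ d

tau = (beta_tilde + c * beta_hat) / (c + 1)                         # marginal posterior location
K = c * (s2 + quad / (c + 1)) / (n * (c + 1)) * XtX_inv             # scale matrix K
half = stats.t.ppf(1 - (1 - level) / 2, df=n) * np.sqrt(np.diag(K))
hpd = np.column_stack([tau - half, tau + half])                     # HPD interval per coefficient
```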

58 Zellner's Informative G-Prior: High Posterior Density (HPD) Intervals for the β_i's Note that these HPD intervals are different from the frequentist confidence intervals defined earlier:

β_i | y, X ~ T( n - k - 1, β̂_i, ω_ii s² / ( n - k - 1 ) ),
( β_i - β̂_i ) / sqrt( ω_ii s² / ( n - k - 1 ) ) ~ T( ν = n - k - 1, 0, 1 ), where ω_ii = (X^T X)^{-1}_{(i,i)},
{ β_i : |β_i - β̂_i| ≤ F^{-1}_{n-k-1}( 1 - α/2 ) sqrt( ω_ii s² / ( n - k - 1 ) ) }, s² = (y - Xβ̂)^T (y - Xβ̂).

59 Zellner's Informative G-Prior: 95% High Posterior Density (HPD) Intervals for the β_i's, c = 100. [Table: 95% HPD intervals for beta_0 through beta_10; the transcription preserves only part of the values, e.g. the lower endpoints 4.6518 for beta_0 and 0.0152 for beta_5.] An implementation is available (MatLab, C++).

{ β_i : |β_i - β̂_i| ≤ F^{-1}_{n-k-1}( 1 - α/2 ) sqrt( ω_ii s² / ( n - k - 1 ) ) }

60 Zellner's Informative G-Prior: 90% High Posterior Density (HPD) Intervals for the β_i's, c = 100. [Table: 90% HPD intervals for beta_0 through beta_10; the transcription preserves only part of the values, e.g. the lower endpoints 5.7435 for beta_0 and 0.0524 for beta_5.] An implementation is available (MatLab, C++). Note: the results given in C. P. Robert, The Bayesian Core, Springer, 2nd edition, chapter 3, refer to 90% HPD intervals rather than the 95% ones posted earlier; those results agree with what is shown above.

61 Zellner's Informative G-Prior: Marginal Distribution of y The marginal distribution of y (the evidence) is a multivariate t. Since β satisfies β | σ², X ~ N_{k+1}( β̃, c σ² (X^T X)^{-1} ), the linear transform Xβ satisfies

Xβ | σ², X ~ N( Xβ̃, c σ² X (X^T X)^{-1} X^T ), which implies y | σ², X ~ N_n( Xβ̃, σ² ( I_n + c X (X^T X)^{-1} X^T ) ).

Integration in σ² with π(σ²) ∝ σ^{-2},

f(y | X, c) = ∫_0^∞ (2π)^{-n/2} (c + 1)^{-(k+1)/2} (σ²)^{-n/2 - 1} exp( -(1/(2σ²)) (y - Xβ̃)^T ( I_n + c X (X^T X)^{-1} X^T )^{-1} (y - Xβ̃) ) dσ²,

gives (see the Appendices):

f(y | X, c) = (c + 1)^{-(k+1)/2} π^{-n/2} Γ(n/2) [ y^T y - ( c/(c + 1) ) y^T X (X^T X)^{-1} X^T y + ( 1/(c + 1) ) β̃^T X^T X β̃ - ( 2/(c + 1) ) y^T Xβ̃ ]^{-n/2}.

62 Appendix It can be easily shown that

( I_n + c X (X^T X)^{-1} X^T )^{-1} = I_n - ( c/(c + 1) ) X (X^T X)^{-1} X^T.

Indeed:

( I_n - ( c/(c+1) ) X (X^T X)^{-1} X^T )( I_n + c X (X^T X)^{-1} X^T )
   = I_n + c X (X^T X)^{-1} X^T - ( c/(c+1) ) X (X^T X)^{-1} X^T - ( c²/(c+1) ) X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T
   = I_n + ( c - c/(c+1) - c²/(c+1) ) X (X^T X)^{-1} X^T = I_n.

This simplifies the term in the exponential in the earlier slide:

(y - Xβ̃)^T ( I_n + c X (X^T X)^{-1} X^T )^{-1} (y - Xβ̃) = (y - Xβ̃)^T ( I_n - ( c/(c+1) ) X (X^T X)^{-1} X^T ) (y - Xβ̃)
   = y^T y - ( c/(c+1) ) y^T X (X^T X)^{-1} X^T y - 2 y^T Xβ̃ + ( 2c/(c+1) ) y^T Xβ̃ + β̃^T X^T X β̃ - ( c/(c+1) ) β̃^T X^T X β̃
   = y^T y - ( c/(c+1) ) y^T X (X^T X)^{-1} X^T y + ( 1/(c+1) ) β̃^T X^T X β̃ - ( 2/(c+1) ) y^T Xβ̃.

63 Appendix Let us revisit the matrix I_n + c X (X^T X)^{-1} X^T. Note that for any θ ∈ R^{k+1},

( I_n + c X (X^T X)^{-1} X^T ) Xθ = Xθ + c Xθ = (1 + c) Xθ,

which implies that Xθ is an eigenvector with eigenvalue 1 + c. There are k + 1 of those. Finally note that for any z in the null space of X^T, i.e. X^T z = 0,

( I_n + c X (X^T X)^{-1} X^T ) z = z,

which implies that these z (n - k - 1 in number) are eigenvectors of our matrix with 1 as the eigenvalue. The determinant of the matrix is then (1 + c)^{k+1} · 1^{n-k-1} = (1 + c)^{k+1}. This explains the derivation in the earlier slide.

64 Zellner's Informative G-Prior: Point Null Hypothesis If a null hypothesis is H_0: Rβ = r, R being a q x (k+1) matrix, the model under H_0 can be rewritten as

y | β_0, σ², X_0 ~ N_n( X_0 β_0, σ² I_n ) under H_0,

where β_0 is (k + 1 - q)-dimensional. Under the prior

β_0 | σ², X_0 ~ N_{k+1-q}( β̃_0, c_0 σ² (X_0^T X_0)^{-1} ),

the marginal distribution of y under H_0 is

f(y | X_0, H_0) = (c_0 + 1)^{-(k+1-q)/2} π^{-n/2} Γ(n/2) [ y^T y - ( c_0/(c_0 + 1) ) y^T X_0 (X_0^T X_0)^{-1} X_0^T y + ( 1/(c_0 + 1) ) β̃_0^T X_0^T X_0 β̃_0 - ( 2/(c_0 + 1) ) y^T X_0 β̃_0 ]^{-n/2}.

65 Zellner's Informative G-Prior: Bayes Factor The Bayes factor is then given in analytical form as

B^π_10 = f(y | X) / f(y | X_0, H_0)
   = ( (c_0 + 1)^{(k+1-q)/2} / (c + 1)^{(k+1)/2} )
     × { [ y^T y - ( c_0/(c_0+1) ) y^T X_0 (X_0^T X_0)^{-1} X_0^T y + ( 1/(c_0+1) ) β̃_0^T X_0^T X_0 β̃_0 - ( 2/(c_0+1) ) y^T X_0 β̃_0 ]
       / [ y^T y - ( c/(c+1) ) y^T X (X^T X)^{-1} X^T y + ( 1/(c+1) ) β̃^T X^T X β̃ - ( 2/(c+1) ) y^T Xβ̃ ] }^{n/2}.

Note that we use the same σ² in both models. The Bayes factor depends on c_0 and c. This calculation can be used to evaluate the inclusion or not of each of the terms β_i, i = 1, ..., k + 1, in our regression model. Note: X_0 = X[:, setdiff(1:k+1, j)].
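A sketch of this Bayes factor computation for dropping a single covariate, assuming β̃ = β̃_0 = 0 and the same c in both models (Python/NumPy; the log_evidence helper and the synthetic data are illustrative, not from the slides):

```python
import numpy as np

def log_evidence(y, Xg, c):
    """log f(y | X_g, c) up to a common constant, Zellner G-prior with prior mean 0."""
    n, p = Xg.shape
    fit = y @ Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T @ y)        # y^T X_g (X_g^T X_g)^{-1} X_g^T y
    return -0.5 * p * np.log(c + 1) - 0.5 * n * np.log(y @ y - c / (c + 1) * fit)

rng = np.random.default_rng(6)
n, k, c = 40, 3, 100.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

# Bayes factor B10 for H0: beta_j = 0 (drop column j), using the same c in both models
for j in range(1, k + 1):
    X0 = np.delete(X, j, axis=1)
    log10_BF = (log_evidence(y, X, c) - log_evidence(y, X0, c)) / np.log(10)
    print(f"x_{j}: log10 B10 = {log10_BF:.2f}")
```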

66 Noninformative Prior Analysis: Jeffreys Prior Considering the robustness issues of the two priors examined earlier in the case of a complete lack of prior information, we consider now the non-informative Jeffreys prior. The Jeffreys prior is flat in (β, log σ²):

π^J(β, σ² | X) ∝ σ^{-2}.

The posterior is then given as

π^J(β, σ² | y, X) ∝ (σ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ̂)^T (y - Xβ̂) - (1/(2σ²)) (β - β̂)^T X^T X (β - β̂) ) σ^{-2}
   = (σ²)^{-(k+1)/2} exp( -(1/(2σ²)) (β - β̂)^T X^T X (β - β̂) ) × (σ²)^{-(n-k-1)/2 - 1} exp( -s²/(2σ²) ).

67 Noninformative Prior Analysis: Jeffreys Prior

π^J(β, σ² | y, X) ∝ (σ²)^{-(k+1)/2} exp( -(1/(2σ²)) (β - β̂)^T X^T X (β - β̂) ) × (σ²)^{-(n-k-1)/2 - 1} exp( -s²/(2σ²) )

From this joint posterior, we can immediately evaluate the following conditional and marginal posteriors:

π^J(σ² | y, X) ∝ (σ²)^{-(n-k-1)/2 - 1} exp( -s²/(2σ²) ), i.e. σ² | y, X ~ IG( (n - k - 1)/2, s²/2 ),
π^J(β | σ², y, X) ∝ (σ²)^{-(k+1)/2} exp( -(1/(2σ²)) (β - β̂)^T X^T X (β - β̂) ) = N_{k+1}( β̂, σ² (X^T X)^{-1} ).

From the first of these equations, the Bayes estimate of σ² is E[σ² | y, X] = s² / ( n - k - 3 ). This estimate is larger (more pessimistic) than the earlier estimates.

68 Noninformative Prior Analysis: Jeffreys Prior

π(β, σ² | y, X) ∝ (σ²)^{-n/2 - 1} exp( -( s² + (β - β̂)^T X^T X (β - β̂) ) / (2σ²) )

To compute the marginal posterior of β, we need to integrate in σ²:

π(β | y, X) ∝ ∫ (σ²)^{-n/2 - 1} exp( -( s² + (β - β̂)^T X^T X (β - β̂) ) / (2σ²) ) dσ²
   ∝ [ s² + (β - β̂)^T X^T X (β - β̂) ]^{-n/2}
   ∝ [ 1 + (β - β̂)^T ( s² (X^T X)^{-1} / (n - k - 1) )^{-1} (β - β̂) / (n - k - 1) ]^{-n/2}.

This is a Student's t distribution: π(β | y, X) = T_{k+1}( ν = n - k - 1, θ, Σ ), with θ = β̂ and Σ = s² (X^T X)^{-1} / (n - k - 1). Thus the Bayes estimate of β is E^π[β | y, X] = β̂. The HPD intervals coincide with the frequentist confidence intervals.

69 Zellner's Non-Informative G-Prior The main difference with the informative G-prior setup is that we now consider c as unknown. We use the same G-prior distribution with β̃ = 0_{k+1} conditional on c, and introduce a diffuse prior on c,

π(c) ∝ c^{-1} I_{N*}(c).

The corresponding marginal posterior is given as

π(β, σ² | y, X) = Σ_{c=1}^∞ π(β, σ² | y, X, c) π(c | y, X),
π(c | y, X) ∝ f(y | X, c) π(c) ∝ f(y | X, c) c^{-1}.

70 Zellner's Non-Informative G-Prior

π(β, σ² | y, X) ∝ Σ_{c=1}^∞ π(β, σ² | y, X, c) f(y | X, c) c^{-1}.

We have proved earlier (for constant c) that

f(y | X, c) = (c + 1)^{-(k+1)/2} π^{-n/2} Γ(n/2) [ y^T y - ( c/(c + 1) ) y^T X (X^T X)^{-1} X^T y + ( 1/(c + 1) ) β̃^T X^T X β̃ - ( 2/(c + 1) ) y^T Xβ̃ ]^{-n/2}.

For our problem with β̃ = 0_{k+1} this becomes

f(y | X, c) ∝ (c + 1)^{-(k+1)/2} [ y^T y - ( c/(c + 1) ) y^T X (X^T X)^{-1} X^T y ]^{-n/2}.

71 Zellner's Non-Informative G-Prior: Posterior Mean of β Recall that

β | y, X, c ~ T_{k+1}( n, (c/(c+1))( β̃/c + β̂ ), { c ( s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ )/(c + 1) ) / ( n(c + 1) ) } (X^T X)^{-1} ).

The Bayes estimate of β for β̃ = 0_{k+1} is now given by

E^π[β | y, X] = E^π[ E^π(β | y, X, c) | y, X ] = E^π[ ( c/(c + 1) ) β̂ | y, X ] = Σ_{c=1}^∞ ( c/(c + 1) ) π(c | y, X) β̂
   = { Σ_{c=1}^∞ ( c/(c + 1) ) f(y | X, c) c^{-1} } / { Σ_{c=1}^∞ f(y | X, c) c^{-1} } β̂.

72 Zellner's Non-Informative G-Prior: Posterior Mean of σ² Recall that

σ² | y, X, c ~ IG( n/2, s²/2 + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (2(c + 1)) ), E[σ² | y, X, c] = { s² + ( β̃ - β̂ )^T X^T X ( β̃ - β̂ ) / (c + 1) } / ( n - 2 ).

The Bayes estimate of σ² is given similarly (with β̃ = 0) by

E^π[σ² | y, X] = { Σ_{c=1}^∞ [ ( s² + β̂^T X^T X β̂ / (c + 1) ) / ( n - 2 ) ] f(y | X, c) c^{-1} } / { Σ_{c=1}^∞ f(y | X, c) c^{-1} }.

Both Bayes estimates involve infinite summations over c. The denominator in both cases is the normalizing constant of the posterior, Σ_{c=1}^∞ f(y | X, c) c^{-1}.

73 Zellner's Non-Informative G-Prior: Posterior Variance of β With (for β̃ = 0)

β | y, X, c ~ T_{k+1}( n, ( c/(c + 1) ) β̂, { c ( s² + β̂^T X^T X β̂ /(c + 1) ) / ( n(c + 1) ) } (X^T X)^{-1} ),

we have

V^π[β | y, X] = E^π[ V^π(β | y, X, c) | y, X ] + V^π[ E^π(β | y, X, c) | y, X ]
   = ( { Σ_{c=1}^∞ [ c ( s² + β̂^T X^T X β̂ /(c + 1) ) / ( (n - 2)(c + 1) ) ] f(y | X, c) c^{-1} } / { Σ_{c=1}^∞ f(y | X, c) c^{-1} } ) (X^T X)^{-1}
   + { E^π[ ( c/(c + 1) )² | y, X ] - ( E^π[ c/(c + 1) | y, X ] )² } β̂ β̂^T,

where the expectations over c are again ratios of sums weighted by f(y | X, c) c^{-1}. See the earlier slide for the term f(y | X, c).
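A sketch of how these infinite sums over c can be approximated by truncation (Python/NumPy; the truncation level c_max and the synthetic data are assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
s2 = np.sum((y - X @ beta_hat) ** 2)
fit = y @ X @ beta_hat                        # y^T X (X^T X)^{-1} X^T y
quad_hat = beta_hat @ XtX @ beta_hat          # beta_hat^T X^T X beta_hat (equals fit here)

c_max = 100_000                               # truncation of the infinite sum over c (assumption)
c = np.arange(1, c_max + 1, dtype=float)
# log f(y | X, c) up to a constant, prior mean beta_tilde = 0
log_f = -0.5 * (k + 1) * np.log(c + 1) - 0.5 * n * np.log(y @ y - c / (c + 1) * fit)
w = np.exp(log_f - log_f.max()) / c           # weights f(y | X, c) * pi(c), with pi(c) = 1/c
w /= w.sum()                                  # normalize (the common constant cancels)

post_mean_beta = (w * c / (c + 1)).sum() * beta_hat
post_mean_sigma2 = (w * (s2 + quad_hat / (c + 1)) / (n - 2)).sum()
```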

74 Zellner's Non-Informative G-Prior: Marginal Distribution The marginal distribution of the dataset is available in closed form:

f(y | X) = Σ_{c=1}^∞ π(c) f(y | X, c) ∝ Σ_{c=1}^∞ c^{-1} (c + 1)^{-(k+1)/2} [ y^T y - ( c/(c + 1) ) y^T X (X^T X)^{-1} X^T y ]^{-n/2}.

The T-shape means that we can also compute the normalizing constant!

75 Zellner's Non-Informative G-Prior: Posterior Mean and Variance [Table: E[β_i | y, X] and V[β_i | y, X] for the intercept and X1 through X10; the numerical values are not preserved in this transcription.] A Matlab implementation can be downloaded here.

E^π[β | y, X] = { Σ_{c=1}^∞ ( c/(c + 1) ) f(y | X, c) c^{-1} } / { Σ_{c=1}^∞ f(y | X, c) c^{-1} } β̂,

with V^π[β | y, X] as on the previous slide. We use β̃ = 0_{11} and a finite truncation of the sums over c.

76 Zellner's Non-Informative G-Prior: Point Null Hypothesis If a null hypothesis is H_0: Rβ = r, the model under H_0 can be rewritten as

y | β_0, σ², X_0 ~ N_n( X_0 β_0, σ² I_n ) under H_0,

where β_0 is (k + 1 - q)-dimensional. Under the prior π(c) ∝ c^{-1} and

β_0 | X_0, σ², c ~ N_{k+1-q}( 0_{k+1-q}, c σ² (X_0^T X_0)^{-1} ),

the marginal distribution of y under H_0 is

f(y | X_0, H_0) ∝ Σ_{c=1}^∞ c^{-1} (c + 1)^{-(k+1-q)/2} [ y^T y - ( c/(c + 1) ) y^T X_0 (X_0^T X_0)^{-1} X_0^T y ]^{-n/2}.

The Bayes factor B^π_10 = f(y | X) / f(y | X_0, H_0) can now be computed.

77 Zellner's Non-Informative G-Prior: Posterior Mean and Variance For H_0: β_i = 0, the log10(BF) values are tabulated together with the posterior moments. [Table: E[β_i | y, X], V[β_i | y, X] and log10(BF) for the intercept and X1 through X10, with significance flags; the numerical values are not preserved in this transcription.] A Matlab implementation can be downloaded here. (We use β̃ = 0.) Evidence against H_0: (****) decisive, (***) strong, (**) substantial, (*) poor.

78 Variable Selection Let us return to our regression model with one dependent random variable y and a set {x_1, x_2, ..., x_k} of k explanatory variables. Are all the x_i's needed in the regression? We assume that every subset {x_{i_1}, x_{i_2}, ..., x_{i_q}}, 0 ≤ q ≤ k, of the explanatory variables is a proper set of explanatory variables for the regression of y (as before, the intercept is included in all models). We have a total of 2^k models to select from!

79 Variable Selection Following earlier notation, we denote X = [ 1_n x_1 x_2 ... x_k ] as the matrix that contains 1_n and the k potential predictor variables. Each model M_γ is associated with a binary indicator vector γ ∈ {0, 1}^k, where γ_i = 1 means that the variable x_i is included in the model M_γ and γ_i = 0 that it is not. The number of variables included in the model M_γ is q_γ = 1^T γ. The indices of the variables included in and excluded from the model are denoted, respectively, t_1(γ) and t_0(γ).

80 Variable Selection: Models in Competition For β ∈ R^{k+1} and X, we define β_γ as the sub-vector ( β_0, (β_i)_{i ∈ t_1(γ)} ). Let X_γ be the submatrix of X where only the column 1_n and the columns in t_1(γ) have been kept. The model M_γ is then defined as

y | γ, β_γ, σ², X ~ N_n( X_γ β_γ, σ² I_n ),

where (β_γ, σ²) ∈ R^{q_γ + 1} x R*_+ are the unknown parameters. The σ² is common to all models and we use the same prior for all models.

81 Variable Selection: Models in Competition We have a high number 2^k of models in competition. We cannot specify a prior on every M_γ in a completely subjective and autonomous manner. We derive all priors from a single global prior associated with the full model that corresponds to γ = (1, ..., 1).

82 Zellner's Informative Prior: Variable Selection For the full model that corresponds to γ = (1, ..., 1), we use Zellner's informative G-prior:

β | σ², X ~ N_{k+1}( β̃, c σ² (X^T X)^{-1} ), π(σ² | X) ∝ σ^{-2} (improper Jeffreys prior).

For each model M_γ, the prior distribution of β_γ conditional on σ² is fixed as

β_γ | γ, σ² ~ N_{q_γ + 1}( β̃_γ, c σ² (X_γ^T X_γ)^{-1} ), where β̃_γ = (X_γ^T X_γ)^{-1} X_γ^T X β̃,

and the same prior is used on σ².

83 Zellner's Informative Prior: Variable Selection Prior The joint prior for model M_γ is the improper prior

π(β_γ, σ² | γ) ∝ (σ²)^{-(q_γ + 1)/2 - 1} exp( -(1/(2cσ²)) ( β_γ - β̃_γ )^T X_γ^T X_γ ( β_γ - β̃_γ ) ).

There are infinitely many ways of defining a prior on the model index γ: our choice is a uniform prior, π(γ | X) = 2^{-k}. The posterior distribution of γ is central to variable selection since it is proportional to the marginal density of y on M_γ (the evidence of M_γ).

84 Zellner's Informative Prior: Variable Selection Prior The posterior distribution of γ is proportional to the marginal density of y on M_γ (so it can also be used to compute Bayes factors):

π(γ | y, X) ∝ f(y | γ, X) π(γ | X) ∝ f(y | γ, X) = ∫∫ f(y | γ, β, σ², X) π(β | γ, σ², X) π(σ² | X) dβ dσ²,

where (see the earlier derivation)

f(y | γ, σ², X) = ∫ f(y | γ, β, σ², X) π(β | γ, σ², X) dβ
   = (c + 1)^{-(q_γ + 1)/2} (2π)^{-n/2} (σ²)^{-n/2} exp( -(1/(2σ²)) y^T y + (1/(2σ²(c + 1))) [ c y^T X_γ (X_γ^T X_γ)^{-1} X_γ^T y - β̃_γ^T X_γ^T X_γ β̃_γ + 2 y^T X_γ β̃_γ ] ).

85 Zellner's Informative Prior: Variable Selection Prior

π(γ | y, X) ∝ ∫ f(y | γ, σ², X) π(σ² | X) dσ².

The posterior distribution of γ is then given as

π(γ | y, X) ∝ (c + 1)^{-(q_γ + 1)/2} [ y^T y - ( c/(c + 1) ) y^T X_γ (X_γ^T X_γ)^{-1} X_γ^T y + ( 1/(c + 1) ) β̃_γ^T X_γ^T X_γ β̃_γ - ( 2/(c + 1) ) y^T X_γ β̃_γ ]^{-n/2}.

We have already seen this distribution earlier for fixed c:

f(y | X, c) = (c + 1)^{-(k+1)/2} π^{-n/2} Γ(n/2) [ y^T y - ( c/(c + 1) ) y^T X (X^T X)^{-1} X^T y + ( 1/(c + 1) ) β̃^T X^T X β̃ - ( 2/(c + 1) ) y^T Xβ̃ ]^{-n/2}.
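For small k, the 2^k model posteriors can be enumerated directly; a sketch under the informative G-prior with β̃ = 0 (Python/NumPy; the log_post_gamma helper and the synthetic data are illustrative assumptions, not the caterpillar data):

```python
import numpy as np
from itertools import product

def log_post_gamma(y, X_gamma, c):
    """log pi(gamma | y, X) up to a constant, Zellner G-prior with beta_tilde = 0."""
    n, p = X_gamma.shape                                    # p = q_gamma + 1 (intercept always kept)
    fit = y @ X_gamma @ np.linalg.solve(X_gamma.T @ X_gamma, X_gamma.T @ y)
    return -0.5 * p * np.log(c + 1) - 0.5 * n * np.log(y @ y - c / (c + 1) * fit)

rng = np.random.default_rng(8)
n, k, c = 40, 4, 100.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.0, 0.0, -0.5]) + rng.normal(size=n)

models = list(product([0, 1], repeat=k))                    # all 2^k indicator vectors gamma
logp = np.array([log_post_gamma(y, X[:, [0] + [j + 1 for j in range(k) if g[j]]], c)
                 for g in models])
prob = np.exp(logp - logp.max())
prob /= prob.sum()                                          # normalize over the 2^k models

for g, p in sorted(zip(models, prob), key=lambda t: -t[1])[:5]:
    print(g, round(p, 4))                                   # top 5 models by posterior probability
```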

86 Model Selection Most likely models, ordered by decreasing posterior probabilities, using Zellner's informative G-prior with c = 100. [Table: t_1(γ) and π(γ | y, X) for the top models; the numerical values are not preserved in this transcription.] An implementation is available (MatLab, C++).

87 Model Selection The model M_γ with the highest posterior probability is t_1(γ) = (1, 2, 4, 5), which corresponds to the variables altitude, slope, height of the tree sampled at the center of the area, and diameter of the tree sampled at the center of the area.

88 Model Selection For Zellner's non-informative prior with π(c) ∝ 1/c, we have (with β̃ = 0_{k+1}):

π(γ | y, X) ∝ Σ_{c=1}^∞ c^{-1} (c + 1)^{-(q_γ + 1)/2} [ y^T y - ( c/(c + 1) ) y^T X_γ (X_γ^T X_γ)^{-1} X_γ^T y ]^{-n/2}.

Again we have seen this before as

f(y | X) = Σ_{c=1}^∞ π(c) f(y | X, c) ∝ Σ_{c=1}^∞ c^{-1} (c + 1)^{-(k+1)/2} [ y^T y - ( c/(c + 1) ) y^T X (X^T X)^{-1} X^T y ]^{-n/2}.

89 Model Selection Most likely models, ordered by decreasing posterior probabilities, using Zellner's non-informative G-prior. [Table: t_1(γ) and π(γ | y, X) for the top models; the numerical values are not preserved in this transcription.] An implementation is available (MatLab, C++).

90 Stochastic Search for the Most Likely Model When k is large, it becomes computationally intractable to compute the posterior probabilities of the 2^k models. We need a tailored algorithm that samples from π(γ | y, X) and selects the most likely models. This can be done by Gibbs sampling*, given the availability of the full conditional posterior probabilities of the γ_i's. If γ_{-i} = (γ_1, ..., γ_{i-1}, γ_{i+1}, ..., γ_k) (1 ≤ i ≤ k), then π(γ_i | y, γ_{-i}, X) ∝ π(γ | y, X) (to be evaluated at both γ_i = 0 and γ_i = 1). * Gibbs and other sampling algorithms will be introduced and discussed in detail in forthcoming lectures.

91 Gibbs Sampling for Variable Selection Initialization: draw γ^(0) from the uniform distribution on {0, 1}^k. Iteration t: given (γ_1^(t-1), ..., γ_k^(t-1)), generate

γ_1^(t) according to π( γ_1 | y, γ_2^(t-1), ..., γ_k^(t-1), X ),
γ_2^(t) according to π( γ_2 | y, γ_1^(t), γ_3^(t-1), ..., γ_k^(t-1), X ),
...
γ_k^(t) according to π( γ_k | y, γ_1^(t), γ_2^(t), ..., γ_{k-1}^(t), X ).

92 Gibbs Sampling for Variable Selection Question: How do we sample γ_1^(t) according to π( γ_1 | y, γ_2^(t-1), ..., γ_k^(t-1), X )? 1. The conditional distribution π( γ_1 | y, γ_2^(t-1), ..., γ_k^(t-1), X ) is proportional to π(γ | y, X). 2. Since γ_1 has only two possible values, 0 and 1, we compute

p_0 ∝ π( γ_1 = 0, γ_2^(t-1), ..., γ_k^(t-1) | y, X ) and p_1 ∝ π( γ_1 = 1, γ_2^(t-1), ..., γ_k^(t-1) | y, X ),

so the probability that γ_1^(t) = 1 is p_1 / ( p_0 + p_1 ). We can then use Gibbs sampling to approximate the distribution of {γ_i}.
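A sketch of this Gibbs sampler over γ (Python/NumPy, informative G-prior with β̃ = 0; the chain length, burn-in, and synthetic data are placeholders, not from the slides):

```python
import numpy as np

def log_post_gamma(y, X, gamma, c):
    """log pi(gamma | y, X) up to a constant (Zellner G-prior, beta_tilde = 0, intercept always in)."""
    cols = [0] + [j + 1 for j in range(len(gamma)) if gamma[j]]
    Xg = X[:, cols]
    fit = y @ Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T @ y)
    return -0.5 * len(cols) * np.log(c + 1) - 0.5 * len(y) * np.log(y @ y - c / (c + 1) * fit)

rng = np.random.default_rng(9)
n, k, c = 40, 4, 100.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.0, 0.0, -0.5]) + rng.normal(size=n)

T, T0 = 2000, 500                                   # chain length and burn-in (assumptions)
gamma = rng.integers(0, 2, size=k)                  # gamma^(0) uniform on {0,1}^k
draws = np.zeros((T, k), dtype=int)

for t in range(T):
    for i in range(k):                              # full sweep over gamma_1, ..., gamma_k
        lp = np.empty(2)
        for v in (0, 1):
            gamma[i] = v
            lp[v] = log_post_gamma(y, X, gamma, c)
        p1 = 1.0 / (1.0 + np.exp(lp[0] - lp[1]))    # P(gamma_i = 1 | y, gamma_{-i}, X) = p1/(p0 + p1)
        gamma[i] = int(rng.random() < p1)
    draws[t] = gamma

incl_prob = draws[T0:].mean(axis=0)                 # empirical P(gamma_i = 1 | y, X) after burn-in
print(incl_prob)
```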

93 Gibbs Sampling: Posterior Probabilities After T ≫ 1 MCMC iterations, we approximate the posterior probabilities π(γ | y, X) by the empirical averages

π̂(γ | y, X) = ( 1 / (T - T_0 + 1) ) Σ_{t=T_0}^{T} I_{ γ^(t) = γ }.

The first T_0 values (burn-in) of the MCMC chain are eliminated.

94 Model Choice Comparison: Gibbs Estimates First-level informative G-prior model (with β̃ = 0_{11}, c = 100) compared with the Gibbs estimates of the top ten posterior probabilities. [Table: t_1(γ), π(γ | y, X) and π̂(γ | y, X) for the top ten models; the numerical values are not preserved in this transcription.] A MatLab implementation is available.

95 Model Choice Comparison: Gibbs Estimates Non-informative G-prior variable model choice compared with the Gibbs estimates of the top ten posterior probabilities. [Table: t_1(γ), π(γ | y, X) and π̂(γ | y, X) for the top ten models; the numerical values are not preserved in this transcription.] A MatLab implementation is available.

96 Gibbs Sampling: Probabilities of Inclusion An approximation of the probability of including the i-th variable:

P^π( γ_i = 1 | y, X ) ≈ ( 1 / (T - T_0 + 1) ) Σ_{t=T_0}^{T} I_{ γ_i^(t) = 1 }.

97 Probabilities of Inclusion Estimates Informative (β̃ = 0_{11}, c = 100) and non-informative G-prior variable inclusion estimates (based on the same Gibbs output as in the earlier two tables). [Table: P^π(γ_i = 1 | y, X) under both priors for gamma_1 through gamma_10; the numerical values are not preserved in this transcription.] A MatLab implementation is available.


More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Bayesian Inference. Chapter 1. Introduction and basic concepts

Bayesian Inference. Chapter 1. Introduction and basic concepts Bayesian Inference Chapter 1. Introduction and basic concepts M. Concepción Ausín Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master

More information

Bayesian Regressions in Experimental Design

Bayesian Regressions in Experimental Design Bayesian Regressions in Experimental Design Stanley Sawyer Washington University Vs. April 9, 8. Introduction. The purpose here is to derive a formula for the posterior probabilities given observed data

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Using prior knowledge in frequentist tests

Using prior knowledge in frequentist tests Using prior knowledge in frequentist tests Use of Bayesian posteriors with informative priors as optimal frequentist tests Christian Bartels Email: h.christian.bartels@gmail.com Allschwil, 5. April 2017

More information

Heriot-Watt University

Heriot-Watt University Heriot-Watt University Heriot-Watt University Research Gateway Prediction of settlement delay in critical illness insurance claims by using the generalized beta of the second kind distribution Dodd, Erengul;

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Lecture 16 : Bayesian analysis of contingency tables. Bayesian linear regression. Jonathan Marchini (University of Oxford) BS2a MT / 15

Lecture 16 : Bayesian analysis of contingency tables. Bayesian linear regression. Jonathan Marchini (University of Oxford) BS2a MT / 15 Lecture 16 : Bayesian analysis of contingency tables. Bayesian linear regression. Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 15 Contingency table analysis North Carolina State University

More information

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models Ioannis Ntzoufras, Department of Statistics, Athens University of Economics and Business, Athens, Greece; e-mail: ntzoufras@aueb.gr.

More information

Problem Selected Scores

Problem Selected Scores Statistics Ph.D. Qualifying Exam: Part II November 20, 2010 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. Problem 1 2 3 4 5 6 7 8 9 10 11 12 Selected

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

A Few Special Distributions and Their Properties

A Few Special Distributions and Their Properties A Few Special Distributions and Their Properties Econ 690 Purdue University Justin L. Tobias (Purdue) Distributional Catalog 1 / 20 Special Distributions and Their Associated Properties 1 Uniform Distribution

More information

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

MAXIMUM LIKELIHOOD, SET ESTIMATION, MODEL CRITICISM

MAXIMUM LIKELIHOOD, SET ESTIMATION, MODEL CRITICISM Eco517 Fall 2004 C. Sims MAXIMUM LIKELIHOOD, SET ESTIMATION, MODEL CRITICISM 1. SOMETHING WE SHOULD ALREADY HAVE MENTIONED A t n (µ, Σ) distribution converges, as n, to a N(µ, Σ). Consider the univariate

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Bayesian Model Averaging

Bayesian Model Averaging Bayesian Model Averaging Hoff Chapter 9, Hoeting et al 1999, Clyde & George 2004, Liang et al 2008 October 24, 2017 Bayesian Model Choice Models for the variable selection problem are based on a subset

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Stat 5102 Final Exam May 14, 2015

Stat 5102 Final Exam May 14, 2015 Stat 5102 Final Exam May 14, 2015 Name Student ID The exam is closed book and closed notes. You may use three 8 1 11 2 sheets of paper with formulas, etc. You may also use the handouts on brand name distributions

More information

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression 1 / 36 Previous two lectures Linear and logistic

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine September 1, 2008 Abstract This paper

More information

Part 2: One-parameter models

Part 2: One-parameter models Part 2: One-parameter models 1 Bernoulli/binomial models Return to iid Y 1,...,Y n Bin(1, ). The sampling model/likelihood is p(y 1,...,y n ) = P y i (1 ) n P y i When combined with a prior p( ), Bayes

More information

Supplement to Bayesian inference for high-dimensional linear regression under the mnet priors

Supplement to Bayesian inference for high-dimensional linear regression under the mnet priors The Canadian Journal of Statistics Vol. xx No. yy 0?? Pages?? La revue canadienne de statistique Supplement to Bayesian inference for high-dimensional linear regression under the mnet priors Aixin Tan

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

STAT Advanced Bayesian Inference

STAT Advanced Bayesian Inference 1 / 32 STAT 625 - Advanced Bayesian Inference Meng Li Department of Statistics Jan 23, 218 The Dirichlet distribution 2 / 32 θ Dirichlet(a 1,...,a k ) with density p(θ 1,θ 2,...,θ k ) = k j=1 Γ(a j) Γ(

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Environmentrics 00, 1 12 DOI: 10.1002/env.XXXX Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Regina Wu a and Cari G. Kaufman a Summary: Fitting a Bayesian model to spatial

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Bayesian Inference for the Multivariate Normal

Bayesian Inference for the Multivariate Normal Bayesian Inference for the Multivariate Normal Will Penny Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK. November 28, 2014 Abstract Bayesian inference for the multivariate

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract Bayesian analysis of a vector autoregressive model with multiple structural breaks Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus Abstract This paper develops a Bayesian approach

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information