Modelling the Variance: MCMC methods for fitting multilevel models with complex level 1 variation and extensions to constrained variance matrices. By Dr William Browne, Centre for Multilevel Modelling, Institute of Education, London.
Summary of Talk Background to Multilevel modelling project. What is complex level 1 variation? Tutorial dataset. Method 1 : Inverse Wishart proposals. Method 2 : Truncated Normal proposals. Log formulations. Extensions to the multivariate problem.
Multilevel modelling project Based at Institute of Education. Headed by Professor Harvey Goldstein. Funded by ESRC originally through ALCD programme. 3 Full-time Research officers. 2 Lecturers associated with project. A Network of project Fellows.
Aims of Project Modelling complex structures in social science data. Establishing forms of model structure. Developing methodology to fit models. Comparing alternative methodologies. Programming methodology into computer package MLwiN. Disseminating ideas to the social science community.
MLwiN Software package Evolved from a chain of packages produced by the Multilevel Models Project; forerunners include ML2, ML3 and MLn. Main programmer: Jon Rasbash. Consists of a user-friendly Windows interface on top of fast estimation engines. Over 3,000 users (mainly academic) worldwide. Estimation by IGLS, RIGLS, MCMC methods and bootstrapping. MCMC theory and programming by William Browne and David Draper.
Current research interests Cross-classified and multiple membership models. Missing data and measurement errors in multilevel modelling. Multilevel factor analysis modelling. Spatial modelling. Combining estimation procedures. Improving user interface.
Univariate Normal model
y_i ~ N(μ, σ²), i = 1, …, 4. Can be written y = (y_1, y_2, y_3, y_4)^T ~ MVN(μ, V), where μ = (μ, μ, μ, μ)^T and V = diag(σ², σ², σ², σ²).
Normal linear model
y_i ~ N(X_i β, σ²), i = 1, …, 4. Can be written y = (y_1, y_2, y_3, y_4)^T ~ MVN(μ, V), where μ = (X_1 β, X_2 β, X_3 β, X_4 β)^T and V = diag(σ², σ², σ², σ²).
2 level variance components model
y_ij = X_ij β + u_j + e_ij, u_j ~ N(0, σ²_u), e_ij ~ N(0, σ²_e), i = 1, 2, j = 1, 2.
Can be written y = (y_11, y_21, y_12, y_22)^T ~ MVN(μ, V), where μ = (X_11 β, X_21 β, X_12 β, X_22 β)^T and

V = | σ²_u + σ²_e   σ²_u          0             0           |
    | σ²_u          σ²_u + σ²_e   0             0           |
    | 0             0             σ²_u + σ²_e   σ²_u        |
    | 0             0             σ²_u          σ²_u + σ²_e |
Complex Variation
Definition: a model where the variance depends on predictor variables.
y_ij = X_ij β + Z_ij u_j + X^C_ij e_ij, u_j ~ MVN(0, Ω_u), e_ij ~ MVN(0, Ω_e).
The V matrix now has diagonal elements of the form V_{ij,ij} = σ²_{eij} + σ²_{uij}, where σ²_{eij} = X^{CT}_ij Ω_e X^C_ij and σ²_{uij} = Z^T_ij Ω_u Z_ij. The off-diagonal elements of V are zero if they correspond to observations in different level 2 units, and otherwise V_{ij,i'j} = Z^T_ij Ω_u Z_{i'j}.
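As a small numerical sketch, the level 1 variance for one observation is just the quadratic form X^{CT}_ij Ω_e X^C_ij. The Ω_e values below are hypothetical (loosely in the range of the tutorial-dataset results later in the talk):

```python
import numpy as np

# Hypothetical 2x2 level-1 covariance block Omega_e for X^C = (1, LRT)
omega_e = np.array([[0.553, -0.015],
                    [-0.015, 0.003]])
x_c = np.array([1.0, 0.65])           # (constant, LRT score) for one pupil

# sigma2_e = O_e00 + 2*LRT*O_e01 + LRT^2 * O_e11
sigma2_e = x_c @ omega_e @ x_c
assert sigma2_e > 0                   # the level-1 variance must be positive
```

Expanding the quadratic form recovers exactly the variance functions (constant + 2·LRT + LRT²) used in the examples below.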
Example : Tutorial Dataset Dataset of school exam results at age 16, covering 4059 pupils from 65 schools. Response variable is total GCSE score. Main predictor variable is LRT (London Reading Test) score. Other predictor of interest is gender.
Partitioning the dataset
Here we see the mean and variance for different partitions of the dataset.

Partition            N      Mean     Variance
Whole dataset        4059   0.000    1.000
Boys                 1623   -0.140   1.052
Girls                2436   0.093    0.940
LRT < -1             612    -0.887   0.731
-1 < LRT < -0.5      594    -0.499   0.599
-0.5 < LRT < -0.1    619    -0.191   0.650
-0.1 < LRT < 0.3     710    0.044    0.658
0.3 < LRT < 0.7      547    0.279    0.659
0.7 < LRT < 1.1      428    0.571    0.678
1.1 < LRT            549    0.963    0.703
A 1 Level Model
y_ij ~ N(β0 + X1_ij β1, V), σ²_{eij} = Ω_{e00} + 2 X1_ij Ω_{e01} + X1²_ij Ω_{e11}   (1)
where X1 is London Reading Test (LRT) score. This graph mimics the results from partitioning the data.
[Figure: level 1 variance (roughly 0.64 to 0.74) plotted against standardised LRT score from -3 to 3.]
A 2 level model with a constant variance at level 2
y_ij ~ N(β0 + X1_ij β1, V), σ²_{uij} = Ω_{u00}, σ²_{eij} = Ω_{e00} + 2 X1_ij Ω_{e01} + X1²_ij Ω_{e11}   (2)
where X1 is London Reading Test (LRT) score.
[Figure: level 1 and level 2 variance functions plotted against standardised LRT score from -3 to 3.]
A 2 Level model with complex variation at both levels 1 and 2
y_ij ~ N(β0 + X1_ij β1, V), σ²_{uij} = Ω_{u00} + 2 X1_ij Ω_{u01} + X1²_ij Ω_{u11}, σ²_{eij} = Ω_{e00} + 2 X1_ij Ω_{e01} + X1²_ij Ω_{e11}   (3)
where X1 is London Reading Test (LRT) score.
[Figure: level 1 and level 2 variance functions plotted against standardised LRT score from -3 to 3.]
A 2 level model with a more complicated variance structure at level 1
y_ij ~ N(β0 + X1_ij β1 + X2_ij β2, V), σ²_{uij} = Ω_{u00} + 2 X1_ij Ω_{u01} + X1²_ij Ω_{u11}, σ²_{eij} = Ω_{e00} + 2 X1_ij Ω_{e01} + 2 X1_ij X2_ij Ω_{e12} + X2_ij Ω_{e22}   (4)
where X1 is London Reading Test (LRT) score, and X2 is 1 for boys and 0 for girls.
[Figure: level 1 variance for boys, level 1 variance for girls and level 2 variance plotted against standardised LRT score from -3 to 3.]
Two possible formulations
We can write a general two level Normal model with complex level 1 variation in two similar but not identical formulations. Firstly
y_ij = X_ij β + Z_ij u_j + X^C_ij e_ij, where u_j ~ MVN(0, Ω_u), e_ij ~ MVN(0, Ω_e),
and secondly
y_ij = X_ij β + Z_ij u_j + e*_ij, where u_j ~ MVN(0, Ω_u), e*_ij ~ N(0, σ²_{eij}),
with e*_ij = X^C_ij e_ij and σ²_{eij} = X^{CT}_ij Ω_e X^C_ij.
Gibbs Sampling steps for both methods
In a Gibbs sampling algorithm we construct conditional posterior distributions of each parameter (or group of parameters) in turn. This constructs a chain of values for each parameter which, upon convergence, is a sample from the joint posterior distribution. Here we find:
Step 1 : p(β | y, u, Ω_u, Ω_e) ~ MVN(β̂, D̂)
Step 2 : p(u_j | y, β, Ω_u, Ω_e) ~ MVN(û_j, D̂_j)
Step 3 : p(Ω_u | y, β, u, Ω_e) ~ InvWishart(ν̂, Ŝ)
Step 4 : p(Ω_e | y, β, u, Ω_u) ~ ?
The distribution in Step 4 does not have a `nice' form under either formulation, so for this step we use the Metropolis-Hastings (MH) sampler.
Method 1 : Inverse Wishart proposals
In formulation 1 we know that Ω_e is a variance matrix, therefore values of Ω_e must form a positive definite matrix. To use Metropolis-Hastings we require a proposal distribution that generates positive definite matrices. We will use an inverse Wishart distribution. Let Σ ~ invWishart_k(ν, S); then E(Σ) = (ν − k − 1)^{-1} S. So at timestep t + 1 draw Ω*_e from the proposal distribution
p(Ω*_e | Ω_e^(t)) ~ invWishart_k(w + k + 1, w Ω_e^(t)).
This has mean the current estimate of Ω_e, and w is a tuning constant.
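A minimal sketch of this proposal using scipy.stats.invwishart (the Ω_e matrix and w below are illustrative, not values from the talk). With df = w + k + 1 and scale w Ω_e^(t), the inverse Wishart mean scale/(df − k − 1) is exactly Ω_e^(t), which a quick simulation also confirms:

```python
import numpy as np
from scipy.stats import invwishart

k, w = 2, 20.0
omega_current = np.array([[0.5, -0.02],      # hypothetical current Omega_e
                          [-0.02, 0.1]])
df, scale = w + k + 1, w * omega_current

# Analytic mean of the proposal equals the current value
assert np.allclose(scale / (df - k - 1), omega_current)

# Monte Carlo check: draws are positive definite with mean ~ Omega_e^(t)
draws = invwishart.rvs(df=df, scale=scale, size=20000, random_state=1)
assert np.allclose(draws.mean(axis=0), omega_current, atol=0.02)
```

Larger w concentrates the proposal around the current value (smaller moves, higher acceptance); smaller w does the reverse.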
Method 1 continued:
The inverse Wishart proposal distribution is not symmetric, so we have to work out the Hastings ratio. If the current value of Ω_e is A and we propose a move to B, then the Hastings ratio is as follows:
hr = p(A | invWishart_k(w + k + 1, wB)) / p(B | invWishart_k(w + k + 1, wA))
   = (|B| / |A|)^{(2w+3k+3)/2} exp( (w/2) [tr(A B^{-1}) − tr(B A^{-1})] ).
Our Step 4 now becomes:
Ω_e^{(t+1)} = Ω*_e with probability min(1, hr · p(Ω*_e | y, …) / p(Ω_e^{(t)} | y, …)), and Ω_e^{(t+1)} = Ω_e^{(t)} otherwise,
where Ω*_e is drawn from an invWishart_k(w + k + 1, w Ω_e^{(t)}) distribution.
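The closed-form ratio can be checked numerically. A sketch that evaluates the analytic log Hastings ratio q(A|B)/q(B|A) and compares it with the difference of scipy inverse-Wishart log-densities (the matrices and w are arbitrary test values):

```python
import numpy as np
from scipy.stats import invwishart

def log_hastings_ratio(A, B, w):
    """Analytic log of q(A|B)/q(B|A) for the invwishart_k(w+k+1, w*current)
    proposal: (|B|/|A|)^{(2w+3k+3)/2} * exp((w/2)(tr(AB^-1) - tr(BA^-1)))."""
    k = A.shape[0]
    _, logdet_a = np.linalg.slogdet(A)
    _, logdet_b = np.linalg.slogdet(B)
    det_term = 0.5 * (2 * w + 3 * k + 3) * (logdet_b - logdet_a)
    trace_term = 0.5 * w * (np.trace(A @ np.linalg.inv(B))
                            - np.trace(B @ np.linalg.inv(A)))
    return det_term + trace_term

rng = np.random.default_rng(1)
k, w = 2, 10.0
A = invwishart.rvs(df=5, scale=np.eye(k), random_state=rng)  # "current"
B = invwishart.rvs(df=5, scale=np.eye(k), random_state=rng)  # "proposed"

# Direct evaluation of log q(A|B) - log q(B|A) via scipy's log-density
direct = (invwishart.logpdf(A, df=w + k + 1, scale=w * B)
          - invwishart.logpdf(B, df=w + k + 1, scale=w * A))
assert np.isclose(log_hastings_ratio(A, B, w), direct)
```

Working on the log scale, as here, avoids overflow when the determinants or traces are large.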
Earlier Example 3
Here we look again at the third earlier example model:
y_ij ~ N(β0 + X1_ij β1, V), σ²_{uij} = Ω_{u00} + 2 X1_ij Ω_{u01} + X1²_ij Ω_{u11}, σ²_{eij} = Ω_{e00} + 2 X1_ij Ω_{e01} + X1²_ij Ω_{e11}   (3)
where y is the (normalised) GCSE score and X1 is the (standardised) LRT score.

Results
Par.     IGLS            MCMC Meth 1     MCMC Meth 2
β0       -0.012 (0.040)  -0.010 (0.041)  -0.010 (0.041)
β1       0.558 (0.020)   0.559 (0.020)   0.559 (0.020)
Ω_u00    0.091 (0.018)   0.097 (0.020)   0.097 (0.020)
Ω_u01    0.019 (0.007)   0.020 (0.007)   0.020 (0.007)
Ω_u11    0.014 (0.004)   0.015 (0.005)   0.015 (0.005)
Ω_e00    0.553 (0.015)   0.549 (0.013)   0.553 (0.015)
Ω_e01    -0.015 (0.006)  -0.016 (0.007)  -0.015 (0.007)
Ω_e11    0.001 (0.009)   0.008 (0.006)   0.003 (0.009)
Output for parameter Ω_e11 using the inverse Wishart method
Method 2 : Truncated Normal proposals
Our second formulation of the model was as follows:
y_ij = X_ij β + Z_ij u_j + e*_ij, where u_j ~ MVN(0, Ω_u), e*_ij ~ N(0, σ²_{eij}), with e*_ij = X^C_ij e_ij and σ²_{eij} = X^{CT}_ij Ω_e X^C_ij.
So rather than a positive definite constraint for the matrix Ω_e we instead have the weaker constraint that σ²_{eij} = X^{CT}_ij Ω_e X^C_ij > 0 for all i, j. Note that this would be identical to a positive definite constraint if X^C took all possible values, but in practice it does not. This constraint looks quite difficult, but we will consider the elements of Ω_e one at a time.
Method 2 : Updating diagonal terms Ω_{ekk}
At time t we require
σ²_{eij} = (X^C_ij)^T Ω_e^{(t)} X^C_ij > 0, i.e. (X^C_{ij(k)})² Ω_{ekk}^{(t)} − d^C_{ij(kk)} > 0 for all i, j,
where d^C_{ij(kk)} = (X^C_{ij(k)})² Ω_{ekk}^{(t)} − (X^C_ij)^T Ω_e^{(t)} X^C_ij collects the terms not involving Ω_{ekk}.
So Ω_{ekk}^{(t)} > max_{ekk}, where max_{ekk} = max_{i,j} ( d^C_{ij(kk)} / (X^C_{ij(k)})² ).
Method 2 : Updating off-diagonal terms Ω_{ekl}
This step is similar to the step given for diagonal terms, except this time Ω_{ekl} is multiplied by X^C_{ij(k)} X^C_{ij(l)}, which can be negative. This means that there will be two truncation points (a maximum and a minimum) rather than one. Step 4 of the algorithm becomes (repeated for all k and l):
Ω_{ekl}^{(t+1)} = Ω*_{ekl} with probability min(1, hr · p(Ω*_{ekl} | y, …) / p(Ω_{ekl}^{(t)} | y, …)), and Ω_{ekl}^{(t+1)} = Ω_{ekl}^{(t)} otherwise,
where Ω*_{ekl} is drawn from a truncated Normal distribution with truncation points that maintain a positive variance for every observation.
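A sketch of the truncated Normal proposal draw using scipy.stats.truncnorm, which parameterises the truncation points in units standardised about the mean; all numeric values here are illustrative:

```python
import numpy as np
from scipy.stats import truncnorm

def propose_trunc_normal(current, lower, upper, s, seed=None):
    """Propose from N(current, s^2) truncated to (lower, upper).
    truncnorm wants the bounds standardised: a=(lower-mean)/s, b=(upper-mean)/s."""
    a, b = (lower - current) / s, (upper - current) / s
    return truncnorm.rvs(a, b, loc=current, scale=s, random_state=seed)

# Illustrative truncation points for one off-diagonal element
draws = [propose_trunc_normal(0.2, -0.1, 0.5, s=0.3, seed=i) for i in range(500)]
assert all(-0.1 < d < 0.5 for d in draws)   # every proposal respects the bounds
```

Because every proposed value lies inside the truncation interval by construction, each proposal automatically keeps σ²_{eij} positive for every observation.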
Calculating the Hastings ratios for Method 2
With current value A, proposed value B, proposal standard deviation s_kl and truncation points min_ekl and max_ekl, the Normal density terms cancel by symmetry and the Hastings ratio is the ratio of the two truncated-Normal normalising constants:
hr = [ Φ((max_ekl − A)/s_kl) − Φ((min_ekl − A)/s_kl) ] / [ Φ((max_ekl − B)/s_kl) − Φ((min_ekl − B)/s_kl) ].
Figure 1: Plots of truncated univariate Normal proposal distributions for a parameter. A is the current value and B is the proposed new value; M is max_ekl and m is min_ekl, the truncation points. The distributions in (i) and (iii) have mean A, while the distributions in (ii) and (iv) have mean B.
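This ratio can also be verified numerically against scipy's truncated-Normal log-density (the parameter values are arbitrary test inputs):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def log_hr_truncnorm(A, B, lo, hi, s):
    """log of q(A|B)/q(B|A) for a Normal(mean, s^2) proposal truncated to
    (lo, hi): the phi terms cancel, leaving the normaliser ratio Z(A)/Z(B)."""
    zA = norm.cdf((hi - A) / s) - norm.cdf((lo - A) / s)
    zB = norm.cdf((hi - B) / s) - norm.cdf((lo - B) / s)
    return np.log(zA) - np.log(zB)

A, B, lo, hi, s = 0.2, 0.5, 0.0, 1.0, 0.4   # current, proposed, bounds, sd

# Direct log q(A|B) - log q(B|A) via truncnorm's log-density
direct = (truncnorm.logpdf(A, (lo - B) / s, (hi - B) / s, loc=B, scale=s)
          - truncnorm.logpdf(B, (lo - A) / s, (hi - A) / s, loc=A, scale=s))
assert np.isclose(log_hr_truncnorm(A, B, lo, hi, s), direct)
```

Note that the truncation points depend only on the other elements of Ω_e and the data, so the same (lo, hi) interval applies to both the forward and reverse proposal densities.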
Output for parameter Ω_e11 using the truncated Normal method
Example 2
Our model is as follows:
y_ij ~ N(β0 + girl_ij β1, V), V = Ω_{u00} + Ω_{e00} + 2 girl_ij Ω_{e01}   (2)
This model fits a variance for boys and a term that represents the difference in variance between boys and girls.

Par.     IGLS            RIGLS           MCMC
β0       -0.161 (0.058)  -0.161 (0.058)  -0.160 (0.060)
β1       0.261 (0.041)   0.261 (0.041)   0.260 (0.040)
Ω_u00    0.162 (0.031)   0.165 (0.032)   0.171 (0.035)
Ω_e00    0.913 (0.032)   0.914 (0.032)   0.916 (0.032)
Ω_e01    -0.062 (0.020)  -0.062 (0.020)  -0.062 (0.020)
Summary so far In this talk I have introduced 2 MCMC methods for fitting models with complex level 1 variation. Below is a summary of their respective advantages and disadvantages. Method 1 does not allow the variance to be negative for unobserved predictors. Method 1 allows easy specification of informative prior distributions. Method 2 mimics the existing ML methods. Method 2 allows more flexibility in specification of level 1 variance functions. Method 2 can be extended to include log specifications.
Log variance/precision formulation
As an alternative we can write a general two level Normal model with complex level 1 variation in the following formulation, as used by Spiegelhalter et al. (1996):
y_ij = X_ij β + Z_ij u_j + e_ij, where u_j ~ MVN(0, Ω_u), e_ij ~ N(0, 1/τ_ij) and log(τ_ij) = X^C_ij θ_e.
This results in a multiplicative variance function:
σ²_{eij} = exp(−x^C_{1ij} θ_{e1}) × … × exp(−x^C_{nij} θ_{en}).
The main advantage is that the parameters are unconstrained. The main disadvantage is the difficulty in interpreting the individual parameters.
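A tiny sketch of why this formulation removes the constraints: whatever the (hypothetical) coefficients θ_e, the implied variance 1/τ = exp(−X^C θ_e) is automatically positive, so no truncation or positive-definiteness checks are needed.

```python
import numpy as np

rng = np.random.default_rng(0)
X_c = np.column_stack([np.ones(5), rng.normal(size=5)])  # (constant, LRT)
theta_e = np.array([0.5, -0.3])        # hypothetical log-precision coefficients

tau = np.exp(X_c @ theta_e)            # precision for each observation
var = 1.0 / tau                        # = product of exp(-x * theta) factors
assert np.all(var > 0)                 # positive for ANY theta_e
```

The price, as noted above, is interpretability: each θ acts multiplicatively on the variance rather than adding a variance contribution.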
A comparison of four possible models
In the following graph we plot the variance function for four possible formulations of the level 1 variance for the tutorial dataset:
σ²_{eij} = Ω_{e00} + 2 X1_ij Ω_{e01}   (1)
σ²_{eij} = Ω_{e00} + 2 X1_ij Ω_{e01} + X1²_ij Ω_{e11}   (2)
σ²_{eij} = exp(−θ_{e00} − 2 X1_ij θ_{e01})   (3)
σ²_{eij} = exp(−θ_{e00} − 2 X1_ij θ_{e01} − X1²_ij θ_{e11})   (4)
where X1 is London Reading Test (LRT) score.
[Figure: fitted level 1 variance (roughly 0.45 to 0.65) against standardised LRT score for the Linear, Quadratic, Exp. Linear and Exp. Quadratic formulations.]
Comparison of Speed and Efficiency
Here we look at the four models illustrated in the graph and compare the speed and efficiency of the MH truncated Normal approach and the adaptive rejection (AR) approach used in WinBUGS. The time is based on running the method on the model for 50,000 iterations, and the efficiency is the maximum Raftery-Lewis N̂ statistic, firstly for a level 1 variance parameter and secondly for any model parameter.

Results
Model            MH Time    AR Time     MH Eff.          AR Eff.
Linear           28 mins    -           14.4k / 16.3k    -
Quadratic        30 mins    -           16.9k / 16.9k    -
Exp. Linear      34 mins    143 mins    14.8k / 16.8k    3.8k / 17.1k
Exp. Quadratic   38 mins    340 mins    16.2k / 17.5k    4.7k / 17.7k
(In each efficiency cell the first figure is for a level 1 variance parameter, the second for any model parameter.)
Model Comparison (work in progress)
The DIC diagnostic (Spiegelhalter et al. 2001) is a measure that can be used for comparing complex models fitted using MCMC methods. It can be thought of as a generalization of the AIC diagnostic: it combines a measure of fit based on a deviance function, D(θ̄), with a measure of complexity based on an `effective' number of parameters, p_D:
DIC = D(θ̄) + 2 p_D.
The following table gives DIC values for some of the models above.

Variance function   D(θ̄)     p_D     DIC
Constant            9031.3    91.7    9214.7
Quadratic           9029.7    91.7    9213.2
Linear              9027.7    91.7    9211.1
Exp. Linear         9028.4    91.2    9210.8
Exp. Quadratic      9028.3    92.3    9212.9
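The DIC arithmetic is simple enough to sketch: p_D is the mean posterior deviance minus the deviance at the posterior mean, and DIC adds 2 p_D back onto the latter. The deviance numbers below are made up for illustration:

```python
import numpy as np

def dic(deviance_samples, deviance_at_mean):
    """DIC = D(theta_bar) + 2*pD, with pD = mean(D(theta)) - D(theta_bar)."""
    d_bar = np.mean(deviance_samples)     # posterior mean deviance
    p_d = d_bar - deviance_at_mean        # effective number of parameters
    return deviance_at_mean + 2.0 * p_d, p_d

value, p_d = dic([10.0, 12.0, 14.0], 9.0)   # toy deviance chain
assert (value, p_d) == (15.0, 3.0)
```

Note that DIC = D(θ̄) + 2 p_D is algebraically the same as D̄ + p_D, which is how the diagnostic is sometimes quoted.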
Extension to multivariate problems (work in progress)
We will here ignore the multilevel structures considered so far and stick to a one level problem, as the multilevel analogue is a simple extension. Assume for each of P individuals we have an N-vector response y_i, and this response is assumed to come from a multivariate Normal distribution:
y_i ~ MVN(μ_i, V_i).
Assume we wish to update a variance matrix V_i with the Metropolis-Hastings algorithm. Normally we would have V_i = V and use Gibbs sampling, but let us assume that each individual has a unique variance matrix. For example, let us assume that for individual i we have V_i[j,k] = θ0 + θ1 X_i for the (j,k)th element of the matrix V_i. In this talk so far we have considered the case where N = 1 for this problem.
Constraints to maintain positive definiteness
We could consider using the truncated Normal method, but this means calculating all the parameter constraints needed to retain a positive definite matrix V_i for all i:
N = 1 : V_i > 0 for all i.
N = 2 : V_i[0,0] > 0, V_i[1,1] > 0 and −1 < V_i[0,1] / √(V_i[0,0] V_i[1,1]) < 1.
N = 3 : 3 variance constraints, 3 correlation constraints and a 3-way correlation constraint.
Generally, for an N × N matrix there are in total 2^N − 1 constraints. Each variance parameter is involved in 2^{N−1} constraints and each covariance in 2^{N−2}. So even though some constraints are redundant, evaluating all constraints is impractical for large N.
Solution: use univariate Normal proposals with no truncation!
Univariate Normal proposals: Metropolis method explanation
A Metropolis step is (generally) easier than a Gibbs sampling step as it involves evaluating a posterior distribution at two points rather than calculating fully the form of the conditional posterior distribution. Similarly, the Normal proposal is easier than a truncated Normal proposal as it involves checking whether a proposed value satisfies the positive definite constraints rather than fully calculating these constraints in advance of generating the proposed value. Any values that do not satisfy the constraints have probability 0 and are automatically rejected. The univariate Normal proposal also has the advantage of being symmetric, so we do not need to worry about calculating Hastings ratios.
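The check-then-reject idea can be sketched as follows. This is a hypothetical skeleton (the function names `build_V` and `log_post` are placeholders, not MLwiN code): positive definiteness is tested cheaply via a Cholesky attempt, and any proposal that fails is rejected outright before the posterior is ever evaluated.

```python
import numpy as np

def is_pos_def(V):
    """Cheap positive-definiteness check: Cholesky succeeds iff V is PD."""
    try:
        np.linalg.cholesky(V)
        return True
    except np.linalg.LinAlgError:
        return False

def metropolis_step(theta, log_post, build_V, s, rng):
    """One symmetric Normal random-walk update of a scalar theta.
    Proposals that break positive definiteness have posterior 0 -> reject."""
    prop = theta + s * rng.normal()
    if not is_pos_def(build_V(prop)):
        return theta                      # automatic rejection
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        return prop
    return theta

# Toy run: updating a correlation r in [[1, r], [r, 1]] under a flat posterior;
# the PD check alone keeps the chain inside (-1, 1).
rng = np.random.default_rng(0)
build_V = lambda r: np.array([[1.0, r], [r, 1.0]])
log_post = lambda r: 0.0
r = 0.0
for _ in range(2000):
    r = metropolis_step(r, log_post, build_V, s=0.5, rng=rng)
    assert abs(r) < 1.0
```

Because the Normal proposal is symmetric, the acceptance ratio needs only the two posterior evaluations, exactly as the slide argues.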
Applications Multivariate response models where elements of the variance matrix are functions of predictor variables (as above). Factor Analysis models with correlated factors (see Goldstein and Browne 2001). Mixed Normal and Binomial response (with probit link) models (see various work by Chib)
Useful web sites http://multilevel.ioe.ac.uk/ - Project home page that contains general information on multilevel modelling and information about MLwiN, including bug listings and downloads of the latest version of MLwiN plus the documentation. http://multilevel.ioe.ac.uk/team/bill.html - contains drafts of all my publications, including papers awaiting publication. http://multilevel.ioe.ac.uk/team/billtalk.html - contains downloads of recent presentations I have given. http://tramss.data-archive.ac.uk - Training materials in social sciences site containing a free teaching version of MLwiN.