
Modelling the Variance: MCMC methods for fitting multilevel models with complex level 1 variation and extensions to constrained variance matrices. By Dr William Browne, Centre for Multilevel Modelling, Institute of Education, London.

Summary of Talk: Background to the Multilevel Modelling Project. What is complex level 1 variation? Tutorial dataset. Method 1: Inverse Wishart proposals. Method 2: Truncated Normal proposals. Log formulations. Extensions to the multivariate problem.

Multilevel Modelling Project: Based at the Institute of Education. Headed by Professor Harvey Goldstein. Funded by the ESRC, originally through the ALCD programme. 3 full-time research officers. 2 lecturers associated with the project. A network of project Fellows.

Aims of the Project: Modelling complex structures in social science data. Establishing forms of model structure. Developing methodology to fit models. Comparing alternative methodologies. Programming the methodology into the computer package MLwiN. Disseminating ideas to the social science community.

MLwiN Software package: Developed from a chain of packages produced by the MMP; forerunners include ML2, ML3 and MLn. Main programmer: Jon Rasbash. Consists of a user-friendly Windows interface on top of fast estimation engines. Over 3,000 users (mainly academic) worldwide. Estimation by IGLS, RIGLS, MCMC methods and bootstrapping. MCMC theory and programming by William Browne and David Draper.

Current research interests: Cross-classified and multiple membership models. Missing data and measurement errors in multilevel modelling. Multilevel factor analysis modelling. Spatial modelling. Combining estimation procedures. Improving the user interface.

Univariate Normal model: $y_i \sim N(\mu, \sigma^2)$, $i = 1, \dots, 4$. Can be written

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} \sim \mathrm{MVN}(\mu, V), \quad \mu = \begin{pmatrix} \mu \\ \mu \\ \mu \\ \mu \end{pmatrix}, \quad V = \begin{pmatrix} \sigma^2 & 0 & 0 & 0 \\ 0 & \sigma^2 & 0 & 0 \\ 0 & 0 & \sigma^2 & 0 \\ 0 & 0 & 0 & \sigma^2 \end{pmatrix}$$

Normal linear model: $y_i \sim N(X_i\beta, \sigma^2)$, $i = 1, \dots, 4$. Can be written $y = (y_1, y_2, y_3, y_4)^T \sim \mathrm{MVN}(\mu, V)$ where

$$\mu = \begin{pmatrix} X_1\beta \\ X_2\beta \\ X_3\beta \\ X_4\beta \end{pmatrix}, \quad V = \begin{pmatrix} \sigma^2 & 0 & 0 & 0 \\ 0 & \sigma^2 & 0 & 0 \\ 0 & 0 & \sigma^2 & 0 \\ 0 & 0 & 0 & \sigma^2 \end{pmatrix}$$

2 level variance components model:

$$y_{ij} = X_{ij}\beta + u_j + e_{ij}, \quad u_j \sim N(0, \sigma^2_u), \quad e_{ij} \sim N(0, \sigma^2_e), \quad i = 1, 2, \; j = 1, 2.$$

Can be written $y = (y_{11}, y_{21}, y_{12}, y_{22})^T \sim \mathrm{MVN}(\mu, V)$ where

$$\mu = \begin{pmatrix} X_{11}\beta \\ X_{21}\beta \\ X_{12}\beta \\ X_{22}\beta \end{pmatrix}, \quad V = \begin{pmatrix} \sigma^2_u + \sigma^2_e & \sigma^2_u & 0 & 0 \\ \sigma^2_u & \sigma^2_u + \sigma^2_e & 0 & 0 \\ 0 & 0 & \sigma^2_u + \sigma^2_e & \sigma^2_u \\ 0 & 0 & \sigma^2_u & \sigma^2_u + \sigma^2_e \end{pmatrix}$$

Complex Variation. Definition: a model where the variance depends on predictor variables.

$$y_{ij} = X_{ij}\beta + Z_{ij}u_j + X^C_{ij}e_{ij}, \quad u_j \sim \mathrm{MVN}(0, \Omega_u), \quad e_{ij} \sim \mathrm{MVN}(0, \Omega_e)$$

The $V$ matrix now has diagonal elements of the form $V_{ij,ij} = \sigma^2_{e_{ij}} + \sigma^2_{u_{ij}}$, where $\sigma^2_{e_{ij}} = X^{C\,T}_{ij}\Omega_e X^C_{ij}$ and $\sigma^2_{u_{ij}} = Z^T_{ij}\Omega_u Z_{ij}$. The off-diagonal elements of $V$ are zero if they correspond to observations in different level 2 units; otherwise $V_{ij,i'j} = Z^T_{ij}\Omega_u Z_{i'j}$.
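As a concrete illustration, here is a minimal sketch (Python, not part of the talk) of how these quadratic-form variance functions are evaluated for one observation. The Omega_e values are hypothetical, loosely based on the tutorial-dataset estimates reported later.

```python
import numpy as np

def level1_variance(x_c, omega_e):
    """Level 1 variance for one observation: sigma2_e,ij = x_c' Omega_e x_c."""
    return x_c @ omega_e @ x_c

def level2_variance(z, omega_u):
    """Level 2 variance contribution: sigma2_u,ij = z' Omega_u z."""
    return z @ omega_u @ z

# Hypothetical values: level 1 design is [1, LRT score].
omega_e = np.array([[0.553, -0.015],
                    [-0.015, 0.001]])   # (sigma_e00, sigma_e01; sigma_e01, sigma_e11)
x_c = np.array([1.0, 0.5])              # a pupil with LRT = 0.5
print(level1_variance(x_c, omega_e))    # 0.553 + 2*0.5*(-0.015) + 0.25*0.001
```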

Example: Tutorial Dataset. A dataset of school exam results at age 16, comprising 4059 pupils from 65 schools. The response variable is total GCSE score. The main predictor variable is LRT score; the other predictor of interest is gender.

Partitioning the dataset. Here we see the mean and variance for different partitions of the dataset.

Partition            N     Mean    Variance
Whole dataset        4059   0.000  1.000
Boys                 1623  -0.140  1.052
Girls                2436   0.093  0.940
LRT < -1              612  -0.887  0.731
-1 < LRT < -0.5       594  -0.499  0.599
-0.5 < LRT < -0.1     619  -0.191  0.650
-0.1 < LRT < 0.3      710   0.044  0.658
0.3 < LRT < 0.7       547   0.279  0.659
0.7 < LRT < 1.1       428   0.571  0.678
1.1 < LRT             549   0.963  0.703

A 1 Level Model:

$$y_{ij} \sim N(\beta_0 + X1_{ij}\beta_1, V), \quad \sigma^2_{e_{ij}} = \sigma_{e00} + 2X1_{ij}\sigma_{e01} + X1^2_{ij}\sigma_{e11} \quad (1)$$

where X1 is London Reading Test (LRT) score. This graph mimics the results from partitioning the data. [Figure: level 1 variance (0.64 to 0.74) plotted against standardised LRT score (-3 to 3).]

A 2 level model with a constant variance at level 2:

$$y_{ij} \sim N(\beta_0 + X1_{ij}\beta_1, V), \quad \sigma^2_{u_{ij}} = \sigma_{u00}, \quad \sigma^2_{e_{ij}} = \sigma_{e00} + 2X1_{ij}\sigma_{e01} + X1^2_{ij}\sigma_{e11} \quad (2)$$

where X1 is London Reading Test (LRT) score. [Figure: level 1 and level 2 variances plotted against standardised LRT score.]

A 2 level model with complex variation at both levels 1 and 2:

$$y_{ij} \sim N(\beta_0 + X1_{ij}\beta_1, V), \quad \sigma^2_{u_{ij}} = \sigma_{u00} + 2X1_{ij}\sigma_{u01} + X1^2_{ij}\sigma_{u11}, \quad \sigma^2_{e_{ij}} = \sigma_{e00} + 2X1_{ij}\sigma_{e01} + X1^2_{ij}\sigma_{e11} \quad (3)$$

where X1 is London Reading Test (LRT) score. [Figure: level 1 and level 2 variances plotted against standardised LRT score.]

A 2 level model with a more complicated variance structure at level 1:

$$y_{ij} \sim N(\beta_0 + X1_{ij}\beta_1 + X2_{ij}\beta_2, V)$$
$$\sigma^2_{u_{ij}} = \sigma_{u00} + 2X1_{ij}\sigma_{u01} + X1^2_{ij}\sigma_{u11}$$
$$\sigma^2_{e_{ij}} = \sigma_{e00} + 2X1_{ij}\sigma_{e01} + 2X1_{ij}X2_{ij}\sigma_{e12} + X2_{ij}\sigma_{e22} \quad (4)$$

where X1 is London Reading Test (LRT) score and X2 is 1 for boys and 0 for girls. [Figure: level 1 variance for boys, level 1 variance for girls and level 2 variance plotted against standardised LRT score.]

Two possible formulations. We can write a general two level Normal model with complex level 1 variation in two similar but not identical formulations. Firstly

$$y_{ij} = X_{ij}\beta + Z_{ij}u_j + X^C_{ij}e_{ij}, \quad u_j \sim \mathrm{MVN}(0, \Omega_u), \quad e_{ij} \sim \mathrm{MVN}(0, \Omega_e)$$

and secondly

$$y_{ij} = X_{ij}\beta + Z_{ij}u_j + e^*_{ij}, \quad u_j \sim \mathrm{MVN}(0, \Omega_u), \quad e^*_{ij} \sim N(0, \sigma^2_{e_{ij}})$$

where $e^*_{ij} = X^C_{ij}e_{ij}$ and $\sigma^2_{e_{ij}} = X^{C\,T}_{ij}\Omega_e X^C_{ij}$.

Gibbs Sampling steps for both methods. In a Gibbs sampling algorithm we construct the conditional posterior distribution of each parameter (or group of parameters) in turn. This constructs a chain of values for each parameter which, upon convergence, is a sample from the joint posterior distribution. Here we find

Step 1: $p(\beta \mid y, u, \Omega_u, \Omega_e) \sim \mathrm{MVN}(\hat{\beta}, \hat{D})$
Step 2: $p(u_j \mid y, \beta, \Omega_u, \Omega_e) \sim \mathrm{MVN}(\hat{u}_j, \hat{D}_j)$
Step 3: $p(\Omega_u \mid y, \beta, u, \Omega_e) \sim \mathrm{InvWishart}(\hat{\nu}, \hat{S})$
Step 4: $p(\Omega_e \mid y, \beta, u, \Omega_u) \sim$ ?

The distribution in Step 4 does not have a 'nice' form under either formulation, so for this step we use the Metropolis-Hastings (MH) sampler.
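This hybrid structure, conjugate Gibbs draws where they exist and an embedded MH step where they do not, can be sketched on a deliberately simple toy model. This is illustrative Python, not the talk's multilevel sampler; the heavy-tailed prior is hypothetical, chosen only to break conjugacy for the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.3, size=200)
n = len(y)

def log_post_sigma2(s2, mu):
    """Log conditional posterior of sigma2 under a hypothetical
    heavy-tailed prior (no closed-form conditional, hence MH)."""
    if s2 <= 0:
        return -np.inf
    loglik = -0.5 * n * np.log(s2) - np.sum((y - mu) ** 2) / (2 * s2)
    return loglik - np.log(1 + s2 ** 2)

mu, s2 = 0.0, 1.0
chain = []
for t in range(5000):
    # Analogue of Steps 1-3: exact Gibbs draw, here p(mu | y, s2) = N(ybar, s2/n).
    mu = rng.normal(y.mean(), np.sqrt(s2 / n))
    # Analogue of Step 4: random-walk MH update for the awkward parameter.
    s2_prop = s2 + rng.normal(0, 0.3)
    if np.log(rng.uniform()) < log_post_sigma2(s2_prop, mu) - log_post_sigma2(s2, mu):
        s2 = s2_prop
    chain.append((mu, s2))
print(np.mean(chain[1000:], axis=0))  # posterior means after burn-in
```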

Method 1: Inverse Wishart proposals. In formulation 1 we know that $\Omega_e$ is a variance matrix, therefore values of $\Omega_e$ must form a positive definite matrix. To use Metropolis-Hastings we require a proposal distribution that generates positive definite matrices; we use an inverse Wishart distribution. If $\Omega \sim \mathrm{invWishart}_k(\nu, S)$ then $E(\Omega) = (\nu - k - 1)^{-1}S$. So at timestep $t+1$ we draw from the proposal distribution

$$p(\Omega^{(t+1)}_e) \sim \mathrm{invWishart}_k\!\left(w + k + 1, \; w\Omega^{(t)}_e\right).$$

This has mean equal to the current estimate of $\Omega_e$, and $w$ is a tuning constant.

Method 1 continued: The inverse Wishart proposal distribution is not symmetric, so we have to work out the Hastings ratio. If the current value of $\Omega_e$ is $A$ and we propose a move to $B$, the Hastings ratio is

$$hr = \frac{p\!\left(\Omega^{(t+1)}_e = A \mid \mathrm{IW}(w + k + 1, wB)\right)}{p\!\left(\Omega^{(t+1)}_e = B \mid \mathrm{IW}(w + k + 1, wA)\right)} = \left(\frac{|B|}{|A|}\right)^{\frac{2w+3k+3}{2}} \exp\!\left(\frac{w}{2}\left[\mathrm{tr}(AB^{-1}) - \mathrm{tr}(BA^{-1})\right]\right).$$

Our Step 4 now becomes

$$\Omega^{(t+1)}_e = \begin{cases} \Omega^*_e & \text{with prob. } \min\!\left(1, \; hr \cdot \dfrac{p(\Omega^*_e \mid y, \dots)}{p(\Omega^{(t)}_e \mid y, \dots)}\right) \\[4pt] \Omega^{(t)}_e & \text{otherwise} \end{cases}$$

where $\Omega^*_e$ is drawn from an $\mathrm{invWishart}_k(w + k + 1, w\Omega^{(t)}_e)$ distribution.
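A sketch of this proposal and its Hastings ratio using scipy, whose inverse Wishart density (proportional to $|X|^{-(\nu+k+1)/2}\exp(-\tfrac{1}{2}\mathrm{tr}(SX^{-1}))$) matches the parameterisation above:

```python
import numpy as np
from scipy.stats import invwishart

def iw_proposal(omega_current, w, rng=None):
    """Draw Omega* ~ invWishart_k(w + k + 1, w * Omega^(t)).
    With these parameters E(Omega*) equals the current value; w tunes the spread."""
    k = omega_current.shape[0]
    return invwishart.rvs(df=w + k + 1, scale=w * omega_current, random_state=rng)

def log_hastings_ratio(A, B, w):
    """log q(A | B) - log q(B | A), where A is the current and B the proposed matrix.
    Equivalent closed form:
      ((2w + 3k + 3) / 2) * log(|B| / |A|) + (w / 2) * (tr(A B^-1) - tr(B A^-1))."""
    k = A.shape[0]
    return (invwishart.logpdf(A, df=w + k + 1, scale=w * B)
            - invwishart.logpdf(B, df=w + k + 1, scale=w * A))
```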

Earlier Example 3. Here we look again at the third earlier example model

$$y_{ij} \sim N(\beta_0 + X1_{ij}\beta_1, V), \quad \sigma^2_{u_{ij}} = \sigma_{u00} + 2X1_{ij}\sigma_{u01} + X1^2_{ij}\sigma_{u11}, \quad \sigma^2_{e_{ij}} = \sigma_{e00} + 2X1_{ij}\sigma_{e01} + X1^2_{ij}\sigma_{e11}$$

where y is the (normalised) GCSE score and X1 is the (standardised) LRT score.

Results:

Par.        IGLS            MCMC Meth 1     MCMC Meth 2
beta_0      -0.012 (0.040)  -0.010 (0.041)  -0.010 (0.041)
beta_1       0.558 (0.020)   0.559 (0.020)   0.559 (0.020)
sigma_u00    0.091 (0.018)   0.097 (0.020)   0.097 (0.020)
sigma_u01    0.019 (0.007)   0.020 (0.007)   0.020 (0.007)
sigma_u11    0.014 (0.004)   0.015 (0.005)   0.015 (0.005)
sigma_e00    0.553 (0.015)   0.549 (0.013)   0.553 (0.015)
sigma_e01   -0.015 (0.006)  -0.016 (0.007)  -0.015 (0.007)
sigma_e11    0.001 (0.009)   0.008 (0.006)   0.003 (0.009)

Output for parameter $\sigma_{e11}$ using the inverse Wishart method. [Figure: MCMC diagnostic output.]

Method 2: Truncated Normal proposals. Our second formulation of the model was as follows:

$$y_{ij} = X_{ij}\beta + Z_{ij}u_j + e^*_{ij}, \quad u_j \sim \mathrm{MVN}(0, \Omega_u), \quad e^*_{ij} \sim N(0, \sigma^2_{e_{ij}})$$

where $e^*_{ij} = X^C_{ij}e_{ij}$ and $\sigma^2_{e_{ij}} = X^{C\,T}_{ij}\Omega_e X^C_{ij}$. So rather than a positive definite constraint on the matrix $\Omega_e$ we instead have the weaker constraint that $\sigma^2_{e_{ij}} = X^{C\,T}_{ij}\Omega_e X^C_{ij} > 0 \;\forall\, i, j$. Note that this would be identical to a positive definite constraint if $X^C$ took all possible values, but in practice it does not. This constraint looks quite difficult, but we will consider the elements of $\Omega_e$ one at a time.

Method 2: Updating diagonal terms $\sigma_{ekk}$. At time $t$ we require

$$\sigma^2_{e_{ij}} = (X^C_{ij})^T\Omega^{(t)}_e X^C_{ij} = (X^C_{ij(k)})^2\sigma^{(t)}_{ekk} - d^C_{ij(kk)} > 0 \quad \forall\, i, j$$

where $d^C_{ij(kk)} = (X^C_{ij(k)})^2\sigma^{(t)}_{ekk} - (X^C_{ij})^T\Omega^{(t)}_e X^C_{ij}$. So

$$\sigma^{(t)}_{ekk} > \max{}_{ekk}, \quad \text{where } \max{}_{ekk} = \max_{i,j}\left(d^C_{ij(kk)} \,/\, (X^C_{ij(k)})^2\right).$$
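In code, the lower truncation point for a diagonal element can be found by scanning all observations. A sketch (assuming X_c holds the level 1 design vectors row-wise):

```python
import numpy as np

def lower_bound_diag(X_c, omega_e, k):
    """Lower truncation point for omega_e[k, k]: the smallest value keeping
    x' Omega_e x > 0 for every observed level 1 design vector x."""
    quad = np.einsum('ip,pq,iq->i', X_c, omega_e, X_c)  # current sigma2_e,ij per row
    xk2 = X_c[:, k] ** 2
    d = xk2 * omega_e[k, k] - quad                      # d_ij(kk) per row
    mask = xk2 > 0                                      # rows where omega_ekk enters
    if not mask.any():
        return -np.inf                                  # parameter unconstrained
    return np.max(d[mask] / xk2[mask])
```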

Method 2: Updating off-diagonal terms $\sigma_{ekl}$. This step is similar to the step given for diagonal terms, except that this time $\sigma_{ekl}$ is multiplied by $X^C_{ij(k)}X^C_{ij(l)}$, which can be negative. This means there are two truncation points (a maximum and a minimum) rather than one. Step 4 of the algorithm becomes (repeated for all $k$ and $l$):

$$\sigma^{(t+1)}_{ekl} = \begin{cases} \sigma^*_{ekl} & \text{with prob. } \min\!\left(1, \; hr \cdot \dfrac{p(\sigma^*_{ekl} \mid y, \dots)}{p(\sigma^{(t)}_{ekl} \mid y, \dots)}\right) \\[4pt] \sigma^{(t)}_{ekl} & \text{otherwise} \end{cases}$$

where $\sigma^*_{ekl}$ is drawn from a truncated Normal distribution with truncation points that maintain a positive variance for every observation.

Calculating the Hastings ratios for Method 2. With current value $A$, proposed value $B$ and truncation points $\min_{ekl}$ and $\max_{ekl}$, the Hastings ratio is the ratio of the two truncated-Normal normalising constants:

$$hr = \frac{\Phi\!\left((\max_{ekl} - A)/s_{kl}\right) - \Phi\!\left((\min_{ekl} - A)/s_{kl}\right)}{\Phi\!\left((\max_{ekl} - B)/s_{kl}\right) - \Phi\!\left((\min_{ekl} - B)/s_{kl}\right)}.$$

[Figure 1: Plots of truncated univariate Normal proposal distributions for a parameter, in four panels (i)-(iv). $A$ is the current value and $B$ is the proposed new value; $M$ is $\max_{ekl}$ and $m$ is $\min_{ekl}$, the truncation points. The distributions in (i) and (iii) have mean $A$, while the distributions in (ii) and (iv) have mean $B$.]
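A sketch of one such truncated Normal MH update in Python (log_post is a user-supplied log conditional posterior for the element being updated; lo and hi are the truncation points computed as above):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncnorm_mh_step(current, lo, hi, s, log_post, rng):
    """One MH update with a Normal(current, s^2) proposal truncated to (lo, hi)."""
    a, b = (lo - current) / s, (hi - current) / s
    prop = truncnorm.rvs(a, b, loc=current, scale=s, random_state=rng)
    # Hastings ratio = ratio of the two truncated-Normal normalising constants.
    log_hr = (np.log(norm.cdf((hi - current) / s) - norm.cdf((lo - current) / s))
              - np.log(norm.cdf((hi - prop) / s) - norm.cdf((lo - prop) / s)))
    if np.log(rng.uniform()) < log_post(prop) - log_post(current) + log_hr:
        return prop
    return current
```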

Output for parameter $\sigma_{e11}$ using the truncated Normal method. [Figure: MCMC diagnostic output.]

Example 2. Our model is as follows:

$$y_{ij} \sim N(\beta_0 + \mathrm{girl}_{ij}\beta_1, V), \quad V = \sigma_{u00} + \sigma_{e00} + 2\,\mathrm{girl}_{ij}\sigma_{e01}$$

This model fits a variance for boys and a term that represents the difference in variance between boys and girls.

Par.        IGLS            RIGLS           MCMC
beta_0      -0.161 (0.058)  -0.161 (0.058)  -0.160 (0.060)
beta_1       0.261 (0.041)   0.261 (0.041)   0.260 (0.040)
sigma_u00    0.162 (0.031)   0.165 (0.032)   0.171 (0.035)
sigma_e00    0.913 (0.032)   0.914 (0.032)   0.916 (0.032)
sigma_e01   -0.062 (0.020)  -0.062 (0.020)  -0.062 (0.020)

Summary so far. In this talk I have introduced 2 MCMC methods for fitting models with complex level 1 variation. Below is a summary of their respective advantages and disadvantages. Method 1 does not allow the variance to be negative for unobserved predictors. Method 1 allows easy specification of informative prior distributions. Method 2 mimics the existing ML methods. Method 2 allows more flexibility in the specification of level 1 variance functions. Method 2 can be extended to include log specifications.

Log variance/precision formulation. As an alternative we can write a general two level Normal model with complex level 1 variation in the following formulation, as used by Spiegelhalter et al. (1996):

$$y_{ij} = X_{ij}\beta + Z_{ij}u_j + e_{ij}, \quad u_j \sim \mathrm{MVN}(0, \Omega_u), \quad e_{ij} \sim N(0, 1/\tau_{ij}), \quad \log(\tau_{ij}) = X^C_{ij}\theta_e.$$

This results in a multiplicative variance function:

$$\sigma^2_{e_{ij}} = \exp(-x^C_{1ij}\theta_{e1}) \times \dots \times \exp(-x^C_{nij}\theta_{en})$$

The main advantage is that the parameters are unconstrained. The main disadvantage is the difficulty of interpreting the individual parameters.
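A sketch of the multiplicative variance function (the theta values are hypothetical):

```python
import numpy as np

def log_linear_variance(x_c, theta_e):
    """Level 1 variance under the log-precision formulation:
    log(tau_ij) = x_c' theta_e, so sigma2_e,ij = exp(-x_c' theta_e)."""
    return np.exp(-x_c @ theta_e)

theta_e = np.array([0.6, 0.05])                      # hypothetical: intercept + LRT
print(log_linear_variance(np.array([1.0, -2.0]), theta_e))  # variance at LRT = -2
```

The unconstrained parameterisation is why no truncation (and no positive definiteness bookkeeping) is needed here; any real-valued theta gives a valid, strictly positive variance.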

A comparison of four possible models. In the following graph we plot the variance function for four possible formulations of the level 1 variance for the tutorial dataset:

$$\sigma^2_{e_{ij}} = \sigma_{e00} + 2X1_{ij}\sigma_{e01} \quad (1)$$
$$\sigma^2_{e_{ij}} = \sigma_{e00} + 2X1_{ij}\sigma_{e01} + X1^2_{ij}\sigma_{e11} \quad (2)$$
$$\sigma^2_{e_{ij}} = \exp(-\theta_{e00} - 2X1_{ij}\theta_{e01}) \quad (3)$$
$$\sigma^2_{e_{ij}} = \exp(-\theta_{e00} - 2X1_{ij}\theta_{e01} - X1^2_{ij}\theta_{e11}) \quad (4)$$

where X1 is London Reading Test (LRT) score. [Figure: level 1 variance (0.45 to 0.65) against standardised LRT score for the Linear, Quadratic, Exp. Linear and Exp. Quadratic formulations.]

Comparison of speed and efficiency. Here we look at the four models illustrated in the graph and compare the speed and efficiency of the MH truncated Normal approach and the adaptive rejection (AR) approach used in WinBUGS. The time is based on running each method on the model for 50,000 iterations, and the efficiency is the maximum Raftery-Lewis $\hat{N}$ statistic, firstly for a level 1 variance parameter and secondly for any model parameter.

Results:

Model           MH Time   AR Time    MH Eff.         AR Eff.
Linear          28 mins   -          14.4k / 16.3k   -
Quadratic       30 mins   -          16.9k / 16.9k   -
Exp. Linear     34 mins   143 mins   14.8k / 16.8k   3.8k / 17.1k
Exp. Quadratic  38 mins   340 mins   16.2k / 17.5k   4.7k / 17.7k

(In each efficiency column the first figure is for a level 1 variance parameter and the second for any model parameter.)

Model Comparison (work in progress). The DIC diagnostic (Spiegelhalter et al. 2001) is a measure that can be used for comparing complex models fitted using MCMC methods. It can be thought of as a generalisation of the AIC diagnostic, combining a measure of fit based on a deviance function $D(\bar{\theta})$ with a measure of complexity based on an 'effective' number of parameters $p_D$:

$$\mathrm{DIC} = D(\bar{\theta}) + 2p_D$$

The following table gives DIC values for some of the models above.

Variance function   D(theta_bar)   p_D    DIC
Constant            9031.3         91.7   9214.7
Quadratic           9029.7         91.7   9213.2
Linear              9027.7         91.7   9211.1
Exp. Linear         9028.4         91.2   9210.8
Exp. Quadratic      9028.3         92.3   9212.9
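A sketch of the DIC computation from MCMC output, assuming Spiegelhalter et al.'s definition $p_D = \bar{D} - D(\bar{\theta})$ (illustrative Python, not MLwiN's implementation):

```python
import numpy as np

def dic(deviance_chain, deviance_at_posterior_mean):
    """DIC = D(theta_bar) + 2 * pD, where pD = mean(D) - D(theta_bar)
    (equivalently Dbar + pD). Returns (DIC, pD)."""
    p_d = np.mean(deviance_chain) - deviance_at_posterior_mean
    return deviance_at_posterior_mean + 2 * p_d, p_d
```

For instance, a mean deviance of 9123.0 with $D(\bar{\theta}) = 9031.3$ gives $p_D = 91.7$ and DIC $= 9214.7$, matching the first row of the table.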

Extension to multivariate problems (work in progress). We will here ignore the multilevel structures considered so far and stick to a one level problem, as the multilevel analogue is a simple extension. Assume that for each of P individuals we have an N-vector response $y_i$, and that this response comes from a multivariate Normal distribution:

$$y_i \sim \mathrm{MVN}(\mu_i, V_i)$$

Assume we wish to update a variance matrix $V_i$ with the Metropolis-Hastings algorithm. Normally we would have $V_i = V$ and use Gibbs sampling, but let us assume that each individual has a unique variance matrix. For example, suppose that for individual $i$ the $(j,k)$th element of $V_i$ is $V_i[j,k] = \alpha_0 + \alpha_1 X_i$. In this talk so far we have considered the case where N = 1 for this problem.

Constraints to maintain positive definiteness. We could consider using the truncated Normal method, but this means calculating all the parameter constraints required to keep every matrix $V_i$ positive definite:

N = 1: $V_i > 0 \;\forall i$.
N = 2: $V_i[0,0] > 0$, $V_i[1,1] > 0$ and $-1 < V_i[0,1]/\sqrt{V_i[0,0]V_i[1,1]} < 1$.
N = 3: 3 variance constraints, 3 correlation constraints and a 3-way correlation constraint.

Generally, for an N x N matrix there are $2^N - 1$ constraints in total. Each variance parameter is involved in $2^{N-1}$ constraints and each covariance in $2^{N-2}$. So even though some constraints are redundant, evaluating all the constraints is impractical for large N. Solution: use univariate Normal proposals with no truncation!

Univariate Normal proposals: Metropolis method. Explanation: a Metropolis step is (generally) easier than a Gibbs sampling step, as it involves evaluating the posterior distribution at two points rather than calculating the full form of the conditional posterior distribution. Similarly, the Normal proposal is easier than a truncated Normal proposal, as it involves checking whether a proposed value satisfies the positive definite constraints rather than fully calculating those constraints in advance of generating the proposed value. Any values that do not satisfy the constraints have probability 0 and are automatically rejected. The univariate Normal proposal also has the advantage of being symmetric, so we do not need to worry about calculating Hastings ratios.
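A sketch of this "propose freely, reject on constraint violation" update for one variance-function parameter (illustrative Python; build_V_all and log_post are hypothetical user-supplied callables returning each individual's $V_i$ and the log conditional posterior):

```python
import numpy as np

def is_pos_def(V):
    """Cheap positive definiteness test via Cholesky factorisation."""
    try:
        np.linalg.cholesky(V)
        return True
    except np.linalg.LinAlgError:
        return False

def metropolis_element_update(alpha, j, s, log_post, build_V_all, rng):
    """Symmetric Normal proposal for one parameter alpha[j] of the variance
    functions. Rather than pre-computing truncation points, propose freely
    and reject any draw that makes some individual's V_i non positive definite."""
    prop = alpha.copy()
    prop[j] += rng.normal(0, s)
    if not all(is_pos_def(V) for V in build_V_all(prop)):
        return alpha        # constraint violated: posterior is 0, auto-reject
    if np.log(rng.uniform()) < log_post(prop) - log_post(alpha):
        return prop         # symmetric proposal: no Hastings ratio needed
    return alpha
```

The Cholesky check costs $O(N^3)$ per matrix but replaces the $2^N - 1$ explicit constraint calculations, which is the point of the method.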

Applications: Multivariate response models where elements of the variance matrix are functions of predictor variables (as above). Factor analysis models with correlated factors (see Goldstein and Browne 2001). Mixed Normal and Binomial response (with probit link) models (see various work by Chib).

Useful web sites:
http://multilevel.ioe.ac.uk/ - project home page containing general information on multilevel modelling and information about MLwiN, including bug listings and downloads of the latest version of MLwiN plus the documentation.
http://multilevel.ioe.ac.uk/team/bill.html - contains drafts of all my publications, including papers awaiting publication.
http://multilevel.ioe.ac.uk/team/billtalk.html - contains downloads of recent presentations I have given.
http://tramss.data-archive.ac.uk - Training Materials in Social Sciences site containing a free teaching version of MLwiN.