If you believe in things that you don't understand, then you suffer. — Stevie Wonder, "Superstition"


For Slovenian municipality i, i = 1, ..., 194, we have
SIR_i = (observed stomach cancers 1995-2001 / expected)_i = O_i / E_i
SEc_i = socioeconomic score (centered and scaled)
[Maps of SIR_i and SEc_i over the municipalities; dark is high. A scatterplot of SIR against SEc shows a clear negative association.]

First do a non-spatial analysis: O_i ~ Poisson(µ_i), where log µ_i = log E_i + α + β SEc_i.
No surprise: β has posterior median -0.14, 95% interval -0.17 to -0.10.
Now do a spatial analysis: log µ_i = log E_i + α + β SEc_i + S_i + H_i, where S = (S_1, ..., S_194) is an improper CAR with precision parameter τ_s, and H = (H_1, ..., H_194) is iid Normal with mean 0 and precision τ_h.
SURPRISE!! β now has posterior median -0.02, 95% interval -0.10 to +0.06. DIC improved from 1153.0 to 1081.5; p_D went from 2.0 to 62.3.
The obvious association has gone away. What happened?
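
As an illustration only (not from the original analysis, which reported Bayesian posterior summaries), here is a minimal maximum-likelihood analogue of the non-spatial model in Python/statsmodels, assuming numpy arrays O (observed counts), E (expected counts), and SEc are available; the spatial CAR model needs MCMC or other specialized machinery and is not sketched here.

```python
# Non-spatial analogue: Poisson GLM with offset log(E).
# Assumes arrays O, E, SEc of length 194; names are illustrative.
import numpy as np
import statsmodels.api as sm

def nonspatial_fit(O, E, SEc):
    X = sm.add_constant(SEc)                    # columns: intercept (alpha) and SEc (beta)
    model = sm.GLM(O, X, family=sm.families.Poisson(), offset=np.log(E))
    res = model.fit()
    return res.params, res.conf_int()           # point estimates and 95% intervals
```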

Two premises underlie much of the course:
(1) Writing down a model and using it to analyze a dataset = specifying a function from the data to inferential or predictive summaries. It is essential to understand that function from data to summaries: when I fit this model to this dataset, why do I get this result, and how much would the result change if I made this change to the data or model?
(2) We must distinguish between the model we choose to analyze a given dataset and the process that we imagine produced the data. In particular, our choice to use a model with random effects does not imply that those random effects correspond to any random mechanism out there in the world.

Mixed linear models in the standard form
Mixed linear models are commonly written in the form y = Xβ + Zu + ɛ, where:
the observation y is an n-vector;
X, the fixed-effects design matrix, is n × p and known;
β, the fixed effects, is p × 1 and unknown;
Z, the random-effects design matrix, is n × q and known;
u, the random effects, is q × 1 and unknown;
u ~ N(0, G(φ_G)), where φ_G is unknown;
ɛ ~ N(0, R(φ_R)), where φ_R is unknown; and
the unknowns in G and R are φ = (φ_G, φ_R).
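
To make the pieces concrete, a small simulation sketch of this standard form (dimensions and variance parameters are made up purely for illustration):

```python
# Simulate y = X beta + Z u + eps with 10 clusters of 10 observations each.
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 2, 10
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # n x p fixed-effects design
Z = np.kron(np.eye(q), np.ones((n // q, 1)))             # n x q random-effects design
beta = np.array([1.0, 0.5])                              # fixed effects (unknown in practice)
G = 0.8 * np.eye(q)                                      # cov(u) = G(phi_G)
R = 0.3 * np.eye(n)                                      # cov(eps) = R(phi_R)
u = rng.multivariate_normal(np.zeros(q), G)              # random effects
eps = rng.multivariate_normal(np.zeros(n), R)            # errors
y = X @ beta + Z @ u + eps
```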

The novelty here is the random effects
y = Xβ + Zu + ɛ, where u ~ N(0, G(φ_G)), ɛ ~ N(0, R(φ_R)).
Original meaning (old-style random effects): A random effect's levels are draws from a population, and the draws are not of interest in themselves but only as samples from the larger population, which is of interest. The random effects Zu are a way to model sources of variation that affect several observations the same way, e.g., all observations in a cluster.
Current meaning (new-style random effects): Also, a random effect Zu is a kind of model that is flexible because it has many parameters but that avoids overfitting because u is constrained by means of its covariance G.

Example: Pig weights (RWC Sec. 4.2), longitudinal data
48 pigs (i = 1, ..., 48), weighed once per week for 9 weeks (j = 1, ..., 9).
Dumb model: weight_ij = β_0 + β_1 week_j + ɛ_ij, ɛ_ij iid N(0, σ²).
This model assigns all variation between pigs to ɛ_ij, when pigs evidently vary in both intercept and slope.

Pig weights: A less dumb model
weight_ij = β_0i + β_1 week_j + ɛ_ij, ɛ_ij iid N(0, σ²_e)
         = β_0 + u_i + β_1 week_j + ɛ_ij,
where β_0 and β_1 are fixed and u_i iid N(0, σ²_u).
This model shifts pig i's line up or down as u_i > 0 or u_i < 0; it partitions variation into variation between pigs, u_i, and variation within pigs, ɛ_ij.
Note that cov(weight_ij, weight_ij′) = cov(u_i, u_i) = σ²_u for j ≠ j′.

Pig weights: Possibly not dumb model
weight_ij = β_0 + u_i0 + (β_1 + u_i1) week_j + ɛ_ij,
(u_i0, u_i1)′ ~ N( 0, [ σ²_00  σ²_01 ; σ²_01  σ²_11 ] )
This is the so-called random regressions model.
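
For concreteness, a hedged sketch of fitting this random-regressions model with Python's statsmodels (the slides do not use this software; a long-format data frame pigs with columns weight, week, and pig is assumed):

```python
# Random intercept and random slope for week, correlated within pig (REML fit).
import statsmodels.formula.api as smf

def fit_random_regressions(pigs):
    model = smf.mixedlm("weight ~ week", data=pigs,
                        groups=pigs["pig"], re_formula="~week")
    result = model.fit(reml=True)
    print(result.summary())   # fixed effects plus the 2 x 2 random-effects covariance
    return result
```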

Random regressions model in the standard form
y = Xβ + Zu + ɛ, where X stacks the 9 × 2 block
[ 1 week_1
  ⋮  ⋮
  1 week_9 ]
once for each of the 48 pigs (432 × 2), and Z is block-diagonal with that same 9 × 2 block repeated down the diagonal, one block per pig, all off-diagonal blocks being 0_{9×2} (432 × 96).

Random regressions model in the standard form (2)
y = Xβ + Zu + ɛ, where
y = (weight_1,1, ..., weight_1,9, weight_2,1, ..., weight_2,9, ..., weight_48,9)′,
β = (β_0, β_1)′,
u = (u_1,0, u_1,1, u_2,0, u_2,1, ..., u_48,0, u_48,1)′,
ɛ = (ɛ_1,1, ..., ɛ_1,9, ɛ_2,1, ..., ɛ_2,9, ..., ɛ_48,9)′.
Conventional fits of this model commonly give an estimate of ±1 for the correlation between the random intercept and slope. Nobody knows why.
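
A sketch of building this X and Z numerically, assuming the measurement times are simply weeks 1-9 for every pig (illustrative only):

```python
# Stack the 9 x 2 block [1 week_j] for X; put it on the block diagonal for Z.
import numpy as np

n_pigs, n_weeks = 48, 9
week = np.arange(1, n_weeks + 1, dtype=float)
block = np.column_stack([np.ones(n_weeks), week])   # 9 x 2: [1, week_j]
X = np.tile(block, (n_pigs, 1))                     # 432 x 2
Z = np.kron(np.eye(n_pigs), block)                  # 432 x 96, block diagonal
assert X.shape == (432, 2) and Z.shape == (432, 96)
```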

Example: Molecular structure of a virus
Peterson et al. (2001) hypothesized a molecular description of the outer shell (prohead) of the bacteriophage virus φ29.
Goal: Break the prohead or phage into constituent molecules and weigh them; the weights (and other information) test the hypotheses about the components of the prohead (gp7, gp8, gp8.5, etc.).
There were four steps in measuring each component:
Select a parent: 2 prohead parents, 2 phage parents.
Prepare a batch of the parent.
On a gel date, create electrophoresis gels, separate the molecules on the gels, and cut out the relevant piece of the gel.
Burn gels in an oxidizer run; each gel gives a weight for each molecule.
Here's a sample of the design for the gp8 molecule.

Parent  Batch  Gel Date  Oxidizer Run  gp8 Weight
1       1      1         1             244
1       1      2         1             267
1       1      2         1             259
1       1      2         1             286
1       3      1         1             218
1       3      2         1             249
1       3      2         1             266
1       3      2         1             259
1       7      4         3             293
1       7      4         3             297
1       7      5         4             315
1       7      5         4             283
1       7      7         4             311
1       7      7         4             334
2       2      1         1             272
2       2      2         1             223

To analyze these data, I treated each of the four steps as an old-style random effect.
Model: For y_i the i-th measurement of the number of gp8 molecules,
y_i = µ + parent_j(i) + batch_k(i) + gel_l(i) + run_m(i) + ɛ_i,
parent_j(i) iid N(0, σ²_p), j = 1, ..., 4
batch_k(i) iid N(0, σ²_b), k = 1, ..., 9 (batches are nested within parents)
gel_l(i) iid N(0, σ²_g), l = 1, ..., 11
run_m(i) iid N(0, σ²_r), m = 1, ..., 7
ɛ_i iid N(0, σ²_e), i = 1, ..., 98
Deterministic functions j(i), k(i), l(i), and m(i) map i to parent, batch, gel, and run indices.

Molecular-structure model in the standard form
In y = Xβ + Zu + ɛ, X = 1_98 and β = µ. Z (98 × 31) has one indicator column per level of each factor: 4 columns for parent, 9 for batch, 11 for gel, and 7 for run, so each row of Z has exactly one 1 in each of the four groups of columns.
u (31 × 1) = [parent_1, ..., parent_4, batch_1, ..., batch_9, gel_1, ..., gel_11, run_1, ..., run_7]′,
G = blockdiag( σ²_p I_4, σ²_b I_9, σ²_g I_11, σ²_r I_7 ).
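
A sketch of assembling this Z from indicator columns, assuming a data frame gels with factor columns parent, batch, gel, and run (column names are illustrative, not from the slides):

```python
# Z = [parent indicators | batch indicators | gel indicators | run indicators].
import numpy as np
import pandas as pd

def build_Z(gels):
    blocks = [pd.get_dummies(gels[col], dtype=float).to_numpy()
              for col in ["parent", "batch", "gel", "run"]]   # 4, 9, 11, 7 columns
    return np.hstack(blocks)                                  # 98 x 31
```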

Example: Rating vocal fold images
ENT docs evaluate speech/larynx problems by taking videos of the inside of the larynx during speech and having trained raters assess them.
Standard method (2008): strobe lighting with period slightly longer than the vocal folds' period.
Kendall (2009) tested a new method using high-speed video (HSV), giving a direct view of the folds' vibration.
Each of 50 subjects was measured using both image forms (strobe/HSV).
Each of the 100 images was assessed by one or more of the raters CR, KK, and KUC.
Each rating consisted of 5 quantities on continuous scales.
Interest: compare image forms, compare raters, measure their interaction.

Subject ID  Imaging Method  Rater  % Open Phase
1           strobe          CR     56
1           strobe          KK     70
1           strobe          KK     70
1           HSV             KUC    70
1           HSV             KK     70
1           HSV             KK     60
2           strobe          KUC    50
2           HSV             CR     54
3           strobe          KUC    60
3           strobe          KUC    70
3           HSV             KK     56
4           strobe          CR     65
4           HSV             KK     56
5           strobe          KK     50
5           HSV             KUC    55
5           HSV             KUC    67
6           strobe          KUC    50
6           strobe          KUC    50
6           HSV             KUC    50

Vocal folds example: the design
This is a repeated-measures design. The subject effect is study number.
There are 3 fixed-effect factors: image form, varying entirely within subject; rater, varying both within and between subjects; and the image form-by-rater interaction.
The random effects are: study number; study number by image form; study number by rater; and residual (the 3-way interaction).

First rows of the FE design matrix X
Intercept  Image form  Rater (2 cols)  Interaction (2 cols)
1          0           1 0             0 0
1          0           0 0             0 0
1          0           0 0             0 0
1          1           0 1             0 1
1          1           0 0             0 0
1          1           0 0             0 0
1          0           0 1             0 0
1          1           1 0             1 0
1          0           0 1             0 0
1          0           0 1             0 0
1          1           0 0             0 0
1          0           1 0             0 0
1          1           0 0             0 0
1          0           0 0             0 0
1          1           0 1             0 1
1          1           0 1             0 1
1          0           0 1             0 0

Random effects: Construct u first, then Z
Study number: u_1, ..., u_50, one per subject.
Study number by image form: u_1H, u_1S, ..., u_50H, u_50S, one per subject/form combination.
Study number by rater: Complicated! One per unique combination of a rater and a study number:
u_10,CR, u_10,KK, u_10,KUC; u_60,CR, u_60,KUC; u_66,KK, u_66,KUC; u_79,CR, u_79,KK; ...; u_931,CR; u_967,CR

Non-zero columns of the RE design matrix Z
Sub ID  Form    Rater  Sub  Sub-by-Method  Sub-by-Rater
1       strobe  CR     100  100000         1000000
1       strobe  KK     100  100000         0100000
1       strobe  KK     100  100000         0100000
1       HSV     KUC    100  010000         0010000
1       HSV     KK     100  010000         0100000
1       HSV     KK     100  010000         0100000
2       strobe  KUC    010  001000         0001000
2       HSV     CR     010  000100         0000100
3       strobe  KUC    001  000010         0000010
3       strobe  KUC    001  000010         0000010
3       HSV     KK     001  000001         0000001
JMP's RL-maximizing algorithm converged for 4 of 5 outcomes. The designs are identical, so the difference arises from y. Nobody knows how.

A key point of these examples and of this course is that we now have tremendous ability to fit models in this form but little understanding of how the data determine the fits. This course and book are my stab at developing that understanding.
The next section's purpose is to emphasize three points:
The theory and methods of mixed linear models are strongly connected to the theory and methods of linear models, though the differences are important;
The restricted likelihood is the posterior distribution from a particular Bayesian analysis; and
Conventional and Bayesian analyses are incomplete or problematic in many respects.

Doing statistics with MLMs: Conventional analysis
The conventional analysis usually proceeds in three steps:
Estimate φ = (φ_G, φ_R), the unknowns in G and R.
Treating φ̂ as if it's true, estimate β and u.
Treating φ̂ as if it's true, compute tests and confidence intervals for β and u.
We will start with estimating β and u (mean structure), move to estimating φ (variance structure), and then go on to tests, etc. The obvious problem, treating φ̂ as if it's true, is well known.

Mean structure estimates (β and u)
This area has a long history and a lot of jargon has accrued: fixed effects are estimated; random effects are predicted; fixed + random effects are (estimated) best linear unbiased predictors, (E)BLUPs.
I avoid this as much as I can and instead use the more recent unified approach, based on likelihood ideas:
Assume normal errors and random effects (the same approach is used for non-normal models).
For estimates of β and u, use the likelihood-maximizing values given the variance components.

We have y = Xβ + Zu + ɛ, u ~ N_q(0, G(φ_G)), ɛ ~ N_n(0, R(φ_R)).
In the conventional view, we can write the joint density of y and u as
f(y, u | β, φ) = f(y | u, β, φ_R) f(u | φ_G).
Taking the log of both sides gives
log f(y, u | β, φ) = K - ½ log|R(φ_R)| - ½ log|G(φ_G)| - ½ { (y - Xβ - Zu)′ R(φ_R)⁻¹ (y - Xβ - Zu) + u′ G(φ_G)⁻¹ u }.
Treat G and R as known, set C = [X Z], and estimate β and u by minimizing
[y - C(β′, u′)′]′ R⁻¹ [y - C(β′, u′)′]   (likelihood)   +   (β′, u′) blockdiag(0, G⁻¹) (β′, u′)′   (penalty).

It's easy to show that for C = [X Z], this gives point estimates
(β̃_φ′, ũ_φ′)′ = [ C′R⁻¹C + blockdiag(0, G⁻¹) ]⁻¹ C′R⁻¹y.
The tildes and subscript mean these estimates depend on φ. Omit the penalty term and this is just the GLS estimate. The penalty term is an extra piece of information about u; this is how it affects the estimates of both u and β.
These estimates give fitted values
ỹ_φ = C(β̃_φ′, ũ_φ′)′ = C [ C′R⁻¹C + blockdiag(0, G⁻¹) ]⁻¹ C′R⁻¹y.
Insert estimates for G and R to give EBLUPs.
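
A numpy transcription of these point estimates and fitted values, treating G and R as known (inputs are assumed to be conformable arrays):

```python
# Solve the penalized (mixed-model) equations for (beta_tilde, u_tilde).
import numpy as np

def blup(y, X, Z, G, R):
    C = np.hstack([X, Z])
    p, q = X.shape[1], Z.shape[1]
    Rinv = np.linalg.inv(R)
    penalty = np.zeros((p + q, p + q))
    penalty[p:, p:] = np.linalg.inv(G)           # block diag(0, G^{-1})
    lhs = C.T @ Rinv @ C + penalty
    est = np.linalg.solve(lhs, C.T @ Rinv @ y)   # (beta_tilde, u_tilde)
    fitted = C @ est                             # y_tilde_phi
    return est[:p], est[p:], fitted
```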

Estimates of φ_G and φ_R, the unknowns in G and R
Obvious [?] approach: Write down the likelihood and compute MLEs. But this has problems.
(1) [Dense, obscure] It's not clear exactly what the likelihood is.
(2) [Pragmatic] Avoid problem (1) by getting rid of u, writing
E(y | β) = Xβ, cov(y | G, R) = ZGZ′ + R ≡ V,
so y ~ N(Xβ, V), for V ≡ V(G, R). But maximizing this likelihood gives biased estimates.
Example: If X_1, ..., X_n iid N(µ, σ²), the ML estimator σ̂²_ML = (1/n) Σ_i (X_i − X̄)² is biased.
MLEs are biased because they don't account for estimating fixed effects.
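
A quick, purely illustrative simulation of that bias for the iid normal example (the ML estimator divides by n, the unbiased estimator by n - 1):

```python
# With n = 10 and sigma^2 = 4, the ML estimator averages about 3.6, not 4.0.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 10, 4.0
draws = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
print(draws.var(axis=1, ddof=0).mean())   # ML: divide by n, approx 3.6
print(draws.var(axis=1, ddof=1).mean())   # unbiased: divide by n-1, approx 4.0
```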

Solution: The restricted (or residual) likelihood
Let Q_X = I − X(X′X)⁻¹X′ be the projection onto the orthogonal complement of the column space of the fixed-effect design matrix X.
First try: The residuals Q_X y = Q_X Zu + Q_X ɛ are n-variate normal with mean 0 and covariance Q_X ZGZ′Q_X + Q_X R Q_X. The resulting likelihood is messy: Q_X y has a singular covariance matrix.
Instead, do the same thing but with a non-singular covariance matrix.

Solution: The restricted (or residual) likelihood
Second try: Q_X has spectral decomposition K_0 D K_0′, where D = diag(1, ..., 1, 0_p) for p = rank(X), and K_0 is an orthogonal matrix whose first n − p columns are K. Then Q_X = KK′; K is n × (n − p), and its columns are a basis for the orthogonal complement of the column space of X (the residual space of X).
Project y onto the column space of K, i.e., the residual space of X: K′y = K′Zu + K′ɛ is (n − p)-variate normal with mean 0 and covariance K′ZGZ′K + K′RK, which is non-singular.
The likelihood arising from K′y is the restricted (residual) likelihood. Pre-multiplying the previous equation by any non-singular B with |B| = 1 gives the same restricted likelihood.
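
Numerically, one convenient way to get such a K is an orthonormal basis for the orthogonal complement of the column space of X; the scipy call below is my choice of tool, not something from the slides:

```python
# K has n - rank(X) orthonormal columns with K'X = 0, so K'y lives in the residual space.
from scipy.linalg import null_space

def residual_basis(X):
    K = null_space(X.T)   # orthonormal basis for {v : X'v = 0}
    return K
```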

The RL also arises as a Bayesian marginal posterior
The previous construction has intuitive content: attribute to β the part of y that lies in the column space of X, and use only the residual to estimate G and R. Unfortunately, when derived this way, the RL is obscure.
It can be shown that you get the same RL as the Bayesian marginal posterior of φ with flat priors on everything. I'll derive that posterior.

The RL as a marginal posterior
Use the likelihood from a few slides ago, y ~ N(Xβ, V) for V = ZGZ′ + R, with prior π(β, φ_G, φ_R) ∝ 1 (DON'T USE THIS in a Bayesian analysis). Then the posterior is
π(β, φ_G, φ_R | y) ∝ |V|⁻½ exp( −½ (y − Xβ)′ V⁻¹ (y − Xβ) ).
To integrate out β, expand the quadratic form and complete the square:
(y − Xβ)′ V⁻¹ (y − Xβ) = y′V⁻¹y + β′X′V⁻¹Xβ − 2β′X′V⁻¹y
 = y′V⁻¹y − β̃′X′V⁻¹Xβ̃ + (β − β̃)′X′V⁻¹X(β − β̃),
for β̃ = (X′V⁻¹X)⁻¹X′V⁻¹y, the GLS estimate given V.

Integrating out β is just the integral of a multivariate normal density, so
∫ L(β, V) dβ = K |V|⁻½ |X′V⁻¹X|⁻½ exp( −½ [ y′V⁻¹y − β̃′X′V⁻¹Xβ̃ ] ),
for β̃ = (X′V⁻¹X)⁻¹X′V⁻¹y. Take the log and expand β̃ to get the log restricted likelihood
log RL(φ | y) = K − 0.5 ( log|V| + log|X′V⁻¹X| ) − 0.5 y′ [ V⁻¹ − V⁻¹X(X′V⁻¹X)⁻¹X′V⁻¹ ] y,
where V = ZG(φ_G)Z′ + R(φ_R) is a function of φ = (φ_G, φ_R). This is more explicit than the earlier expression, though not closed-form.
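
A direct numpy transcription of this log restricted likelihood (dropping the constant K), assuming V has already been formed as ZG(φ_G)Z′ + R(φ_R):

```python
# log RL(phi | y) up to an additive constant.
import numpy as np

def log_restricted_likelihood(y, X, V):
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    _, logdetV = np.linalg.slogdet(V)
    _, logdetXVX = np.linalg.slogdet(XtVinvX)
    P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
    return -0.5 * (logdetV + logdetXVX) - 0.5 * (y @ P @ y)
```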

Standard errors for β̂ in standard software
For the model y ~ N_n(Xβ, V(φ)), V(φ) = ZG(φ_G)Z′ + R(φ_R):
Set φ = φ̂, giving V̂. β̂ is the GLS estimator β̂ = (X′V̂⁻¹X)⁻¹X′V̂⁻¹y, which has the familiar covariance form cov(β̂) ≈ (X′V̂⁻¹X)⁻¹, giving SE(β̂_i) ≈ [cov(β̂)_ii]^½.
Non-Bayesian alternative: Kenward-Roger approximation (in SAS):
cov(β̂) ≈ (X′V̂⁻¹X)⁻¹ + an adjustment for the bias of (X′V̂⁻¹X)⁻¹ as an estimate of (X′V⁻¹X)⁻¹ + an adjustment for the bias of (X′V⁻¹X)⁻¹ itself (Kackar-Harville).
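
A sketch of the plug-in GLS estimate and its conventional SEs, assuming V̂ has been built from the estimated variance parameters; note this ignores the uncertainty in φ̂ that the Kenward-Roger approximation tries to account for:

```python
# Plug-in GLS: beta_hat = (X' Vhat^-1 X)^-1 X' Vhat^-1 y, SEs from (X' Vhat^-1 X)^-1.
import numpy as np

def gls_with_se(y, X, V_hat):
    Vinv = np.linalg.inv(V_hat)
    cov_beta = np.linalg.inv(X.T @ Vinv @ X)
    beta_hat = cov_beta @ X.T @ Vinv @ y
    se = np.sqrt(np.diag(cov_beta))
    return beta_hat, se
```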

Two covariance matrices for (β, u)
Unconditional covariance:
cov( β̂, û − u ) = [ C′R(φ̂_R)⁻¹C + blockdiag(0, G(φ̂_G)⁻¹) ]⁻¹.
This is with respect to the distributions of ɛ and u, and gives the same SEs as above.
Conditional on u:
cov( β̂, û | u ) = [ C′R(φ̂_R)⁻¹C + blockdiag(0, G(φ̂_G)⁻¹) ]⁻¹ C′R⁻¹C [ C′R(φ̂_R)⁻¹C + blockdiag(0, G(φ̂_G)⁻¹) ]⁻¹.
This is with respect to the distribution of ɛ only. Later we'll see how this can be useful.

Tests for fixed effects in standard software
Tests and intervals for linear combinations l′β use (RWC p. 104):
z = l′β̂ / [ l′ côv(β̂) l ]^½, treated as N(0, 1).   (1)
"[T]heoretical justification of [(1)] for general mixed models is somewhat elusive owing to the dependence in y imposed by the random effects. Theoretical back-up for [(1)] exists in certain special cases, such as those arising in analysis of variance and longitudinal data analysis. For some mixed models, including many used in this book, justification of [(1)] remains an open problem. Use them at your own risk."
The LRT has the same problem. RWC suggest a parametric bootstrap instead (e.g., p. 144).

Tests and intervals for random effects
I'll say little about these because (a) they have only asymptotic rationales, which are incomplete, and (b) mostly they perform badly.
Restricted LRT: Compare G, R values for the same X. Test for σ²_s = 0: the asymptotic distribution of the LRT is a mixture of χ²'s.
The XICs: AIC, BIC (aka Schwarz criterion), etc.
Satterthwaite approximate confidence interval for a variance (SAS):
ν σ̂²_s / χ²_{ν, 1−α/2} ≤ σ²_s ≤ ν σ̂²_s / χ²_{ν, α/2}.
The denominators are quantiles of χ²_ν for ν = 2 [ σ̂²_s / (Asymp. SE of σ̂²_s) ]².
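
A sketch of computing the Satterthwaite-style interval above, given a variance-component estimate and its asymptotic SE (both assumed inputs):

```python
# nu = 2 * (sigma2_hat / SE)^2; interval uses chi-square quantiles with nu df.
from scipy.stats import chi2

def satterthwaite_interval(sigma2_hat, asymp_se, alpha=0.05):
    nu = 2.0 * (sigma2_hat / asymp_se) ** 2
    lower = nu * sigma2_hat / chi2.ppf(1 - alpha / 2, nu)
    upper = nu * sigma2_hat / chi2.ppf(alpha / 2, nu)
    return lower, upper
```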