Author(s): Kerby Shedden, Ph.D., 2010

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials, visit http://open.umich.edu/privacy-and-terms-use.

Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis, or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: some medical content is graphic and may not be suitable for all viewers.

Regression analysis with dependent data

Kerby Shedden
Department of Statistics, University of Michigan

December 8, 2015

Clustered data

Clustered data are sampled from a population that can be viewed as the union of a number of related subpopulations. Write the data as $Y_{ij} \in \mathbb{R}$, $X_{ij} \in \mathbb{R}^p$, where $i$ indexes the subpopulation and $j$ indexes the individuals in the sample belonging to the $i$th subpopulation.

Clustered data

It may happen that units from the same subpopulation are more alike than units from different subpopulations, i.e.

$$\mathrm{cor}(Y_{ij}, Y_{ij'}) > \mathrm{cor}(Y_{ij}, Y_{i'j'}).$$

Part of the within-cluster similarity may be explained by $X$, i.e. units from the same subpopulation may have similar $X$ values, which leads to similar $Y$ values. In this case,

$$\mathrm{cor}(Y_{ij}, Y_{ij'}) > \mathrm{cor}(Y_{ij}, Y_{ij'} \mid X_{ij}, X_{ij'}).$$

Even after accounting for measured covariates, units in a cluster may still resemble each other more than units in different clusters:

$$\mathrm{cor}(Y_{ij}, Y_{ij'} \mid X) > \mathrm{cor}(Y_{ij}, Y_{i'j'} \mid X_{ij}, X_{i'j'}).$$

Clustered data

A simple working model to account for this dependence is

$$\hat{Y}_{ij} = \hat{\theta}_i + \hat{\beta}' X_{ij}.$$

The idea here is that $\hat{\beta}' X_{ij}$ explains the variation in $Y$ that is related to the measured covariates, and the $\hat{\theta}_i$ explain variation in $Y$ that is related to the clustering.

This working model would be correct if there were $q$ omitted variables $Z_{ijl}$, $l = 1, \ldots, q$, that were constant for all units in the same cluster (i.e. $Z_{ijl}$ depends on $l$ and $i$, but not on $j$). In that case, $\hat{\theta}_i$ would stand in for the value of $\sum_l \hat{\psi}_l Z_{ijl}$ that we would have obtained if the $Z_{ijl}$ were observed.

Clustered data

As an alternate notation, we can vectorize the data to express $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times (p+1)}$, then write

$$\hat{Y} = \sum_j \hat{\theta}_j I_j + X\hat{\beta},$$

where $I_j \in \mathbb{R}^n$ is the indicator of which subjects belong to cluster $j$.

If the observed covariates in $X$ are related to the clustering (i.e. if the columns of $X$ and the $I_j$ are not orthogonal), then OLS apportions the overlapping variance between $\hat{\beta}$ and $\hat{\theta}$.
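As a concrete illustration, here is a minimal numpy sketch of this vectorized working model on simulated data (all sizes, names, and parameter values below are hypothetical): the design matrix stacks one indicator column per cluster next to the observed covariates, and OLS estimates the $\hat{\theta}_j$ and $\hat{\beta}$ jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated clustered data: m clusters, k units per cluster (hypothetical sizes).
m, k, p = 20, 10, 2
cluster = np.repeat(np.arange(m), k)      # cluster label for each unit
X = rng.normal(size=(m * k, p))           # observed covariates
theta = rng.normal(size=m)                # cluster effects
beta = np.array([1.0, -0.5])
y = theta[cluster] + X @ beta + rng.normal(size=m * k)

# Design matrix [I_1 ... I_m  X]: one indicator column per cluster, no intercept.
I = (cluster[:, None] == np.arange(m)).astype(float)
D = np.hstack([I, X])

# OLS apportions the variance between the theta-hat's and beta-hat.
coef = np.linalg.lstsq(D, y, rcond=None)[0]
theta_hat, beta_hat = coef[:m], coef[m:]
print(beta_hat)
```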

Test score example

Suppose $Y_{ij}$ is a reading test score for the $j$th student in the $i$th classroom, out of a large number of classrooms that are considered. Suppose $X_{ij}$ is the income of a student's family.

We might postulate the population model

$$Y_{ij} = \theta_i + \beta X_{ij} + \epsilon_{ij},$$

which can be fit as

$$\hat{Y}_{ij} = \hat{\theta}_i + \hat{\beta} X_{ij}.$$

Test score example

Ideally we would want the parameter estimates to reflect sources of variation as follows:

Direct effects of parent income, such as access to books, life experiences, good health care, etc., should go entirely to $\hat{\beta}$.

Attributes of classrooms that are not related to parent income, for example the effect of an exceptionally good or bad teacher, should go entirely to the $\hat{\theta}_i$.

Attributes of classrooms that are correlated with parent income, such as teacher salary, training, and resources, will be apportioned by OLS between the $\hat{\theta}_i$ and $\hat{\beta}$.

Unique events affecting particular individuals, such as the severe illness of the student or a family member, should go entirely to $\epsilon$.

Other examples of clustered data

Treatment outcomes for patients treated in various hospitals.

Crime rates in police precincts distributed over a number of large cities (the precincts are the units and the cities are the clusters).

Prices of stocks belonging to various business sectors.

Surveys in which the data are collected following a cluster sampling approach.

What if we ignore the $\theta_i$?

The $\theta_i$ are usually not of primary interest, but we should be concerned that by failing to take account of the clustering, we may incorrectly assess the relationship between $Y$ and $X$. If the $\theta_i$ are nonzero, but we fail to include them in the model, the working model is misspecified.

Let $X$ be the design matrix without intercept, and let $Q$ be the matrix of cluster indicators (which includes the intercept):

$$Y = \begin{pmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \\ Y_{23} \end{pmatrix} \qquad Q = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \qquad X = \begin{pmatrix} X_{111} & X_{112} \\ X_{121} & X_{122} \\ X_{211} & X_{212} \\ X_{221} & X_{222} \\ X_{231} & X_{232} \end{pmatrix}$$

What if we ignore the $\theta_i$?

Let $\tilde{X} = [1_n\ X]$, and write $\hat{\tilde{\beta}} = (\hat{\beta}_0, \hat{\beta}')'$ for the estimate that results from regressing $Y$ on $\tilde{X}$. Then

$$E\hat{\tilde{\beta}} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'EY = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'(X\beta + Q\theta).$$

For the bias of $\hat{\tilde{\beta}}$ for $\tilde{\beta} = (\beta_0, \beta')'$ (where $\beta_0 = E\hat{\beta}_0$) to be zero, we need

$$E\hat{\tilde{\beta}} - \tilde{\beta} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'(X\beta + Q\theta - \tilde{X}\tilde{\beta}) = 0.$$

Since $\tilde{X}\tilde{\beta} = \beta_0 1_n + X\beta$, this requires

$$0 = \tilde{X}'(X\beta + Q\theta - \tilde{X}\tilde{\beta}) = \tilde{X}'(Q\theta - \beta_0 1_n).$$

What if we ignore the $\theta_i$?

Let $S = Q\theta$ be the vector of cluster effects. Since the first column of $\tilde{X}$ consists of 1's, the condition above implies that $\bar{S} = \beta_0$. For the other covariates, it implies that $X_j'S = \beta_0 X_j'1_n$, so $X_j$ and $S$ must have zero sample covariance.

This is unlikely to hold in many studies, where people tend to cluster (in schools, hospitals, etc.) with other people having similar covariate levels.
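The following simulation sketch illustrates the point under hypothetical data-generating values: when the cluster effects are correlated with the within-cluster mean of a covariate, OLS without cluster indicators is biased, while the fixed effects fit is not.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 200, 5
cluster = np.repeat(np.arange(m), k)

# Cluster effects correlated with the within-cluster mean of x
# (people cluster with others having similar covariate levels).
theta = rng.normal(size=m)
x = theta[cluster] + rng.normal(size=m * k)   # x shares variation with theta
y = theta[cluster] + 1.0 * x + rng.normal(size=m * k)

# OLS ignoring clusters: intercept + x.
Xmat = np.column_stack([np.ones(m * k), x])
b = np.linalg.lstsq(Xmat, y, rcond=None)[0]
print(b[1])      # noticeably above the true slope of 1.0 (about 1.5 here)

# Adding cluster indicators removes the bias.
I = (cluster[:, None] == np.arange(m)).astype(float)
b_fe = np.linalg.lstsq(np.hstack([I, x[:, None]]), y, rcond=None)[0]
print(b_fe[-1])  # close to 1.0
```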

Fixed effects analysis

In a fixed effects approach, we model the $\theta_i$ as regression parameters, by including additional columns in the design matrix whose covariate levels are the cluster indicators.

As the sample size grows, in most applications the cluster size will stay constant (e.g. a primary school classroom might have up to 30 students). Thus the number of clusters must grow, so the dimension of the parameter vector $\theta$ grows. This is not typical in standard regression modeling.

Cluster effect examples

What does the inclusion of fixed effects do to the parameter estimates $\hat{\beta}$, which are often of primary interest?

The following slides show scatterplots of an outcome $Y$ against a scalar covariate $X$, in a setting where there are three clusters (indicated by color). Above each plot are the coefficient estimates, Z-scores, and $r^2$ values for fits in which the cluster effects are either included (top line) or excluded (second line). These estimates are obtained from the following two working models:

$$\hat{Y} = \hat{\alpha} + \hat{\beta}X + \sum_j \hat{\theta}_j I_j \qquad\qquad \hat{Y} = \hat{\alpha} + \hat{\beta}X.$$

The Z-scores are $\hat{\beta}/\mathrm{SD}(\hat{\beta})$ and the $r^2$ values are $\mathrm{cor}(\hat{Y}, Y)^2$.

Cluster effect examples

Left: No effect of clusters or X; X and clusters are independent; $\hat{\beta} \approx 0$ with and without cluster effects in the model.

Right: An effect of X, but no cluster effects; X and clusters are independent; $\hat{\beta}$, Z, and $r^2$ are similar with and without cluster effects in the model.

[Scatterplots of Y against X. Left panel: $\hat{\beta}_1 = 0.07$, $Z = 0.49$, $r^2 = 0.02$ with cluster effects; $\hat{\beta}_{1s} = 0.06$, $Z_s = 0.44$, $r^2_s = 0.00$ without. Right panel: $\hat{\beta}_1 = 1.04$, $Z = 7.75$, $r^2 = 0.57$ with cluster effects; $\hat{\beta}_{1s} = 1.01$, $Z_s = 7.81$, $r^2_s = 0.56$ without.]

Cluster effect examples

Left: Cluster effects, but no effect of X; X and clusters are independent; $\hat{\beta} \approx 0$ with and without cluster effects in the model.

Right: An effect of clusters but no effect of X; X and clusters are dependent; when cluster effects are not modeled, their effect is picked up by X.

[Scatterplots of Y against X. Left panel: $\hat{\beta}_1 = 0.03$, $Z = 0.85$, $r^2 = 0.94$ with cluster effects; $\hat{\beta}_{1s} = 0.00$, $Z_s = 0.02$, $r^2_s = 0.00$ without. Right panel: $\hat{\beta}_1 = 0.03$, $Z = 0.19$, $r^2 = 0.94$ with cluster effects; $\hat{\beta}_{1s} = 0.92$, $Z_s = 20.72$, $r^2_s = 0.90$ without.]

Cluster effect examples

Left: Effects for clusters and for X; X and clusters are dependent.

Right: Effects for clusters but not for X; X and clusters are dependent, but the net cluster effect is not linearly related to X.

[Scatterplots of Y against X. Left panel: $\hat{\beta}_1 = 0.90$, $Z = 5.97$, $r^2 = 0.95$ with cluster effects; $\hat{\beta}_{1s} = 0.96$, $Z_s = 29.26$, $r^2_s = 0.95$ without. Right panel: $\hat{\beta}_1 = 0.04$, $Z = 0.31$, $r^2 = 0.84$ with cluster effects; $\hat{\beta}_{1s} = 0.00$, $Z_s = 0.02$, $r^2_s = 0.00$ without.]

Cluster effect examples

Left: Effects for clusters and for X; X and clusters are dependent; the X effect has the opposite sign of the net cluster effect.

Right: Effects for clusters and for X; X and clusters are dependent; the signs and magnitudes of the X effects are cluster-specific.

[Scatterplots of Y against X. Left panel: $\hat{\beta}_1 = 1.93$, $Z = 12.68$, $r^2 = 0.99$ with cluster effects; $\hat{\beta}_{1s} = 2.72$, $Z_s = 16.81$, $r^2_s = 0.85$ without. Right panel: $\hat{\beta}_1 = 0.52$, $Z = 1.70$, $r^2 = 0.98$ with cluster effects; $\hat{\beta}_{1s} = 2.92$, $Z_s = 8.20$, $r^2_s = 0.58$ without.]

Implications of fixed effects analysis for observational data

A stable confounder is a confounding factor that is approximately constant within clusters. A stable confounder will become part of the net cluster effect.

If a stable confounder is correlated with an observed covariate X, then this will create non-orthogonality between the cluster effects and the effects of the observed covariates.

Implications of fixed effects analysis for experimental data

Experiments are often carried out on batches of objects (specimens, parts, etc.) in which uncontrolled factors cause elements of the same batch to be more similar than elements of different batches.

If treatments are assigned randomly within each batch, there are no stable confounders (in general there are no confounders in experiments). Therefore the overall OLS estimate of $\beta$ is unbiased as long as the standard linear model assumptions hold.

Implications of fixed effects analysis for experimental data

Suppose the design is balanced (e.g. exactly half of each batch is treated and half is untreated). This is an orthogonal design, so the estimate $\hat{\beta}$ based on the working model $\hat{Y}_{ij} = \hat{\beta} X_{ij}$ and the estimate $\hat{\beta}^*$ based on the working model $\hat{Y}_{ij} = \hat{\theta}_i + \hat{\beta}^* X_{ij}$ are identical ($\hat{\beta} = \hat{\beta}^*$). Thus they have the same variance.

But the estimated variance of $\hat{\beta}$ will be greater than the estimated variance of $\hat{\beta}^*$ (since the corresponding estimate of $\sigma^2$ is greater), so the model without batch effects will have lower power and wider confidence intervals.
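A small simulation sketch (hypothetical batch sizes and effect values) illustrates this: with a balanced treatment assignment, the two working models give identical slope estimates but different estimated standard errors.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 30, 10                                   # 30 batches of 10 units
cluster = np.repeat(np.arange(m), k)
x = np.tile(np.repeat([0.0, 1.0], k // 2), m)   # half treated within each batch
theta = rng.normal(scale=2.0, size=m)           # batch effects
y = theta[cluster] + 1.0 * x + rng.normal(size=m * k)

X1 = np.column_stack([np.ones(m * k), x])       # no batch effects
I = (cluster[:, None] == np.arange(m)).astype(float)
X2 = np.hstack([I, x[:, None]])                 # batch fixed effects

for X in (X1, X2):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])  # estimate of sigma^2
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[-1, -1])
    # Slope estimates match; the SE is larger for the first fit,
    # whose sigma^2 estimate absorbs the batch variation.
    print(b[-1], se)
```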

Random cluster effects

As we have seen, the cluster effects $\theta_i$ can be treated as unobserved constants, and estimated as we would estimate any other regression coefficient. An alternative way to handle cluster effects is to view the $\theta_i$ as unobserved (latent) random variables.

Now we have two random variables in the model: $\theta_i$ and $\epsilon_{ij}$. These are usually taken to be independent of each other. If the $\theta_i$ are independent and identically distributed, we can combine them with the error terms to get a single random error term per observation:

$$\epsilon^*_{ij} = \theta_i + \epsilon_{ij}.$$

Random cluster effects

Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ denote the vector of responses in the $i$th cluster, let $X_i$ denote the matrix of predictor variables for the $i$th cluster, and let $\epsilon^*_i$ denote the vector of errors ($\epsilon^*_{ij}$) for the $i$th cluster. Thus we have the model

$$Y_i = X_i\beta + \epsilon^*_i,$$

for clusters $i = 1, \ldots, m$. The $\epsilon^*_i$ are taken to be uncorrelated between clusters, i.e.

$$\mathrm{cov}(\epsilon^*_i, \epsilon^*_{i'}) = 0$$

for $i \ne i'$.

Random cluster effects

The structure of the covariance matrix $S_i \equiv \mathrm{cov}(\epsilon^*_i \mid X_i)$ is

$$S_i = \begin{pmatrix} \sigma^2 + \sigma_\theta^2 & \sigma_\theta^2 & \cdots & \sigma_\theta^2 \\ \sigma_\theta^2 & \sigma^2 + \sigma_\theta^2 & \cdots & \sigma_\theta^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_\theta^2 & \sigma_\theta^2 & \cdots & \sigma^2 + \sigma_\theta^2 \end{pmatrix} = \sigma^2 I + \sigma_\theta^2 11^T,$$

where $\mathrm{var}(\epsilon_{ij}) = \sigma^2$ and $\mathrm{var}(\theta_i) = \sigma_\theta^2$.

Generalized Least Squares

Suppose we have a linear model with mean structure $E[Y \mid X] = X\beta$ for $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times (p+1)}$, and $\beta \in \mathbb{R}^{p+1}$, and variance structure $\mathrm{var}(Y \mid X) \equiv \Sigma$, where $\Sigma$ is a given $n \times n$ matrix.

We can factor the covariance matrix as $\Sigma = GG'$, and consider the transformed model

$$G^{-1}Y = G^{-1}X\beta + G^{-1}\epsilon.$$

Then letting $\eta \equiv G^{-1}\epsilon$, it follows that $\mathrm{cov}(\eta) = I$. Note that the slope vector $\beta$ of the transformed model is identical to the slope vector of the original model.

Generalized Least Squares

We can use GLS to analyze clustered data with random cluster effects. Let $\tilde{Y}_i = S_i^{-1/2} Y_i$, $\tilde{\epsilon}_i = S_i^{-1/2} \epsilon^*_i$, and $\tilde{X}_i = S_i^{-1/2} X_i$. Let $\tilde{Y}$, $\tilde{X}$, and $\tilde{\epsilon}$ be the $\tilde{Y}_i$, $\tilde{X}_i$, and $\tilde{\epsilon}_i$, respectively, stacked over $i$.

Generalized Least Squares

Since $\mathrm{cov}(\tilde{\epsilon}_i) = I$, the OLS estimate of $\beta$ for the model

$$\tilde{Y} = \tilde{X}\beta + \tilde{\epsilon}$$

is the best estimate of $\beta$ that is linear in $\tilde{Y}$ (by the Gauss-Markov theorem). Since the set of linear estimates based on $\tilde{Y}$ is the same as the set of linear estimates based on $Y$, it follows that the GLS estimate of $\beta$ based on $\tilde{Y}$ is the best unbiased estimate of $\beta$ that is linear in $Y$.

Generalized Least Squares

The covariance matrix of $\hat{\beta}$ in GLS is

$$(\tilde{X}^T\tilde{X})^{-1} = (X^T\Sigma^{-1}X)^{-1}.$$

Note that there is no $\sigma^2$ term on the left because we have transformed the error distribution to have covariance matrix $I$.

To apply GLS, it is only necessary to know $\Sigma$ up to a multiplicative constant: the same estimated slopes $\hat{\beta}$ are obtained if we decorrelate with $\Sigma$ or with $k\Sigma$ for $k > 0$. However, we need to estimate $\sigma^2$ and use $\hat{\sigma}^2(X^T\Sigma^{-1}X)^{-1}$ for inference when $\Sigma$ is only correct up to a scalar multiple.
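A minimal numpy sketch of GLS via the whitening transform described above, assuming $\Sigma$ is known and positive definite (the lower Cholesky factor plays the role of $G$; the data and parameter values are simulated and hypothetical):

```python
import numpy as np

def gls(y, X, Sigma):
    """GLS via whitening: regress G^{-1}y on G^{-1}X, where Sigma = G G'."""
    G = np.linalg.cholesky(Sigma)
    y_w = np.linalg.solve(G, y)              # G^{-1} y
    X_w = np.linalg.solve(G, X)              # G^{-1} X
    beta_hat = np.linalg.lstsq(X_w, y_w, rcond=None)[0]
    cov_beta = np.linalg.inv(X_w.T @ X_w)    # = (X' Sigma^{-1} X)^{-1}
    return beta_hat, cov_beta

# Example: one cluster of size 4 with the exchangeable covariance
# sigma^2 I + sigma_theta^2 11' (sigma^2 = 1, sigma_theta^2 = 0.5 assumed).
n = 4
Sigma = np.eye(n) + 0.5 * np.ones((n, n))
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(Sigma) @ rng.normal(size=n)
print(gls(y, X, Sigma)[0])
```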

Generalized Least Squares

Since the sampling covariance matrix for GLS is proportional to $(X^T\Sigma^{-1}X)^{-1}$, the parameter estimates are uncorrelated with each other if and only if $X^T\Sigma^{-1}X$ is diagonal, that is, if the columns of $X$ are mutually orthogonal with respect to the Mahalanobis metric defined by $\Sigma$.

This is related to the fact that $\hat{Y}$ in GLS is the projection of $Y$ onto $\mathrm{col}(X)$ in the Mahalanobis metric defined by $\Sigma$. This generalizes the fact that $\hat{Y}$ in OLS is the projection of $Y$ onto $\mathrm{col}(X)$ in the Euclidean metric.

Least squares with mis-specified covariance

What if we perform GLS using a working covariance $S$ that is incorrect? For example, what happens if we use OLS (i.e. $S = I$) when the actual covariance matrix of the errors is $\Sigma \ne I$?

Since $E[\epsilon \mid X] = 0$ implies $E[\tilde{\epsilon} \mid X] = 0$, the estimate $\hat{\beta}$ remains unbiased. However it has two problems: it may not be the BLUE (i.e. it may not have the least variance among unbiased estimates), and the usual linear model inference procedures will be wrong.

Least squares with mis-specified covariance

The sampling covariance when the error structure is mis-specified is given by the sandwich expression

$$\mathrm{cov}(\hat{\beta}) = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{\Sigma}_\epsilon\tilde{X}(\tilde{X}'\tilde{X})^{-1},$$

where $\tilde{\Sigma}_\epsilon = \mathrm{cov}(\tilde{\epsilon} \mid \tilde{X})$.

This result covers two situations: (i) we use OLS, so $\tilde{X} = X$ and $\tilde{\Sigma}_\epsilon = \Sigma_\epsilon$ is the covariance matrix of the errors, and (ii) we use a working covariance to define the decorrelating matrix $G$, and this working covariance is not equal to $\Sigma$.
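A short numpy sketch of the sandwich expression for the OLS case ($\tilde{X} = X$), assuming the error covariance $\Sigma_\epsilon$ is known or has been estimated (function name hypothetical):

```python
import numpy as np

def ols_sandwich_cov(X, Sigma_eps):
    """Sandwich covariance of the OLS estimate when the errors have
    covariance Sigma_eps: (X'X)^{-1} X' Sigma_eps X (X'X)^{-1}."""
    bread = np.linalg.inv(X.T @ X)
    return bread @ (X.T @ Sigma_eps @ X) @ bread

# When Sigma_eps is proportional to I, this reduces to the usual
# sigma^2 (X'X)^{-1}; otherwise the two can differ substantially.
```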

Iterated GLS

In practice, we will generally not know $\Sigma$, so it must be estimated from the data. $\Sigma$ is the covariance matrix of the errors, so we can estimate it from the residuals. Typically a low-dimensional parametric model $\Sigma = \Sigma(\alpha)$ is used for the error covariance. For a given error model, the fitting algorithm alternates between estimating $\alpha$ and estimating the regression slopes $\beta$.

Iterated GLS

For example, suppose our working covariance has the form

$$S_i(j, j) = \nu \qquad S_i(j, k) = r\nu \quad (j \ne k).$$

This is an exchangeable model with intraclass correlation coefficient (ICC) $r$ between two observations in the same cluster.

There are several closely-related ICC estimates. One approach is based on the standardized residuals $R^s_{ij}$:

$$\hat{r} = \frac{\sum_i \sum_{j < j'} R^s_{ij} R^s_{ij'}}{\sum_i n_i(n_i - 1)/2 - p - 1},$$

where $n_i$ is the size of the $i$th cluster.
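A sketch of this moment estimate of the ICC, assuming the standardized residuals and cluster labels are already available (the function name and arguments are hypothetical):

```python
import numpy as np

def icc_moment_estimate(resid_std, cluster, p):
    """Moment estimate of the ICC r: sum over clusters of products of
    distinct pairs of standardized residuals, divided by
    (total number of within-cluster pairs - p - 1)."""
    num, n_pairs = 0.0, 0
    for i in np.unique(cluster):
        r = resid_std[cluster == i]
        n_i = len(r)
        # sum_{j < j'} r_j r_{j'} = ((sum r)^2 - sum r^2) / 2
        num += (r.sum() ** 2 - (r ** 2).sum()) / 2
        n_pairs += n_i * (n_i - 1) // 2
    return num / (n_pairs - p - 1)
```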

Iterated GLS

Another common model for the error covariance is the first order autoregressive (AR-1) model:

$$\Sigma_{jk} = \alpha^{|j-k|},$$

where $|\alpha| < 1$ is a parameter. There are several possible estimates of $\alpha$, borrowing ideas from time series analysis.

Generalized Least Squares and stable confounders

For GLS to give unbiased estimates of $\beta$ (that also have minimum variance due to the Gauss-Markov theorem), we must have

$$E[\epsilon^*_{ij} \mid X] = E[\theta_i + \epsilon_{ij} \mid X] = 0.$$

Since $E[\epsilon_{ij} \mid X] = 0$, this is equivalent to requiring that $E[\theta_i \mid X] = 0$.

Thus if the $X$ covariates have distinct within-cluster mean values, and the within-cluster mean values of $X$ are correlated with the $\theta_i$, then the GLS estimate of $\beta$ will be biased.

Likelihood inference for the random intercepts model

The random intercepts model

$$Y_{ij} = \theta_i + \beta'X_{ij} + \epsilon_{ij}$$

can be analyzed using a likelihood approach, using:

a random intercepts density $\phi(\theta \mid X)$, and

a density for the data given the random intercept, $f(Y \mid X, \theta)$.

Likelihood inference for the random intercepts model

The marginal model for the observed data is obtained by integrating out the random intercepts:

$$f(Y \mid X) = \int f(Y \mid X, \theta)\,\phi(\theta \mid X)\,d\theta.$$

The parameters in $f(Y \mid X)$ are all parameters in $f(Y \mid X, \theta)$, and all parameters in $\phi(\theta \mid X)$. Typically, $f(Y \mid X, \theta)$ is parameterized in terms of $\beta$ and $\sigma^2$, and $\phi(\theta \mid X)$ is parameterized in terms of a scale parameter $\sigma_\theta$.

Likelihood inference for the random intercepts model

The first two moments of the marginal distribution $Y_{ij} \mid X$ can be calculated explicitly. Let $\mu_\theta \equiv E(\theta \mid X)$ and $\sigma_\theta^2 \equiv \mathrm{var}(\theta \mid X)$. The moments are

$$E(Y_{ij} \mid X) = \mu_\theta + \beta'X_{ij}$$
$$\mathrm{var}(Y_{ij} \mid X) = \sigma_\theta^2 + \sigma^2$$
$$\mathrm{cov}(Y_{ij_1}, Y_{ij_2} \mid X) = \sigma_\theta^2 \quad (j_1 \ne j_2).$$

Likelihood inference for the random intercepts model

Suppose

$$\epsilon \mid X \sim N(0, \sigma^2) \qquad \theta \mid X \sim N(\mu_\theta, \sigma_\theta^2).$$

In this case $Y \mid X$ is Gaussian, with mean and variance as given above. Thus the random intercept model can be equivalently written in marginal form as

$$Y_{ij} = \mu_\theta + \beta'X_{ij} + \epsilon^*_{ij},$$

where $\epsilon^*_{ij} = \theta_i - \mu_\theta + \epsilon_{ij}$. It follows that $E(\epsilon^*_{ij} \mid X) = 0$, $\mathrm{var}(\epsilon^*_{ij} \mid X) = \sigma_\theta^2 + \sigma^2$, and the $\epsilon^*_{ij}$ values have correlation coefficient $\sigma_\theta^2/(\sigma^2 + \sigma_\theta^2)$ within clusters.

Likelihood computation

Maximum likelihood estimates for the model can be calculated using a gradient-based optimization procedure, or the EM algorithm. Asymptotic standard errors can be obtained from the inverse of the Fisher information, and likelihood ratio tests can be used to compare nested models.

The parameter estimates from iterated GLS and the MLE for the random intercepts model will be similar, and both are consistent under similar conditions.
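In practice, this model can be fit with standard mixed model software. A sketch using statsmodels' MixedLM on simulated data follows; the variable names and simulation settings are hypothetical, and this is one of several packages that could be used.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
m, k = 50, 8
cluster = np.repeat(np.arange(m), k)
x = rng.normal(size=m * k)
theta = rng.normal(scale=0.7, size=m)           # random intercepts
y = 2.0 + 1.0 * x + theta[cluster] + rng.normal(size=m * k)
df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

# Random intercepts model: the marginal likelihood is maximized over
# beta, sigma^2, and sigma_theta^2.
model = sm.MixedLM.from_formula("y ~ x", groups="cluster", data=df)
result = model.fit()
print(result.fe_params)   # beta-hat (fixed effects)
print(result.cov_re)      # estimate of sigma_theta^2
print(result.scale)       # estimate of sigma^2
```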

Predicting the random intercepts

Since the model is fit by optimizing the marginal distribution, we obtain an estimate of $\sigma_\theta^2 = \mathrm{var}(\theta_i)$, but we don't automatically learn anything about the individual $\theta_i$.

If there is an interest in the individual $\theta_i$ values, we can predict them using the best linear unbiased predictor (BLUP):

$$E_{\hat{\beta}, \hat{\sigma}^2, \hat{\sigma}_\theta^2, \hat{\mu}_\theta}(\theta \mid Y, X) = \hat{\mu}_\theta + \widehat{\mathrm{cov}}(\theta, Y)\,\widehat{\mathrm{cov}}(Y \mid X)^{-1}(Y - \hat{E}[Y \mid X]).$$

Predicting the random intercepts

The estimated second moments needed to calculate the BLUP are

$$\widehat{\mathrm{cov}}(\theta_i, Y_i) = \hat{\sigma}_\theta^2 1_{n_i}^T$$

and

$$\widehat{\mathrm{cov}}(Y_i \mid X) = \hat{\sigma}^2 I + \hat{\sigma}_\theta^2 11^T,$$

where $n_i$ is the size of the $i$th group, $\widehat{\mathrm{cov}}(\theta_i, Y_j) = 0$ for $i \ne j$, and $\widehat{\mathrm{cov}}(Y \mid X)$ is block-diagonal with blocks given by the $\widehat{\mathrm{cov}}(Y_i \mid X)$.

Predicting the random intercepts

Note that once the parameters are estimated, the BLUP for $\theta_i$ only depends on the data in cluster $i$:

$$E_{\hat{\beta}, \hat{\sigma}^2, \hat{\sigma}_\theta^2, \hat{\mu}_\theta}(\theta_i \mid Y, X) = \hat{\mu}_\theta + \widehat{\mathrm{cov}}(\theta_i, Y_i)\,\widehat{\mathrm{cov}}(Y_i \mid X)^{-1}(Y_i - \hat{E}[Y_i \mid X]),$$

where

$$\widehat{\mathrm{cov}}(Y_i \mid X)^{-1} = \hat{\sigma}^{-2}\left(I - \hat{\sigma}_\theta^2 11^T/(\hat{\sigma}^2 + n_i\hat{\sigma}_\theta^2)\right).$$

For a given set of parameter values, the BLUP is a linear function of the data. For the $i$th cluster,

$$E_{\hat{\beta}, \hat{\sigma}^2, \hat{\sigma}_\theta^2, \hat{\mu}_\theta}(\theta_i \mid Y, X) = \hat{\mu}_\theta + \frac{n_i\hat{\sigma}_\theta^2}{\hat{\sigma}^2 + n_i\hat{\sigma}_\theta^2} \cdot 1^T(Y_i - \hat{E}[Y_i \mid X])/n_i.$$

Predicting the random intercepts

In the BLUP for $\theta_i$, the term

$$1^T(Y_i - \hat{E}[Y_i \mid X])/n_i$$

is the mean of the residuals, and it is shrunk by the factor

$$n_i\hat{\sigma}_\theta^2/(\hat{\sigma}^2 + n_i\hat{\sigma}_\theta^2).$$

This shrinkage allows us to interpret the random intercepts model as a partially pooled model that is intermediate between the fixed effects model and the model that completely ignores the clusters.
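A minimal sketch of this shrinkage computation, assuming the residuals $Y_i - \hat{E}[Y_i \mid X]$ and the estimated variance components are already in hand (function name and arguments hypothetical):

```python
import numpy as np

def blup_intercepts(resid, cluster, sigma2, sigma2_theta, mu_theta=0.0):
    """BLUP of theta_i: mu_theta plus the mean within-cluster residual,
    shrunk by n_i sigma_theta^2 / (sigma^2 + n_i sigma_theta^2)."""
    out = {}
    for i in np.unique(cluster):
        r = resid[cluster == i]
        n_i = len(r)
        shrink = n_i * sigma2_theta / (sigma2 + n_i * sigma2_theta)
        out[i] = mu_theta + shrink * r.mean()
    return out
```

Note how the shrinkage factor approaches 1 as $n_i$ grows (approaching the fixed effects fit) and approaches 0 as $\hat{\sigma}_\theta^2 \to 0$ (approaching the model that ignores clusters).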

Predicting the random intercepts

In the fixed effects model, the parameter estimates $\hat{\theta}_i$ are unbiased, but they are overdispersed, meaning that the sample variance of the $\hat{\theta}_i$ will generally be greater than $\sigma_\theta^2$. In the random intercepts model, the variance parameter $\sigma_\theta^2$ is estimated with low bias, and the BLUPs of the $\theta_i$ are shrunk toward zero.

Random slopes

Suppose we are interested in a measured covariate $Z$ whose effect may vary by cluster. We might start with the model

$$Y_i = X_i\beta + \gamma Z_i + \epsilon_i,$$

so if $Z_i \in \{0, 1\}^{n_i}$ is a vector of treatment indicators (1 for treated subjects, 0 for untreated subjects), then $\gamma$ is the average change associated with being treated (the population treatment effect).

In many cases, it is reasonable to consider the possibility that different clusters may have different treatment effects, that is, different clusters have different $\gamma$ values. In this case we can let $\gamma_i$ be the treatment effect for cluster $i$.

Random slopes

Suppose we model the random slopes $\gamma_i$ as being Gaussian (given $X$) with expected value $\gamma_0$ and variance $\sigma_\gamma^2$. The marginal model is

$$E[Y_i \mid X_i, Z_i] = X_i\beta + Z_i\gamma_0$$

and

$$\mathrm{cov}[Y_i \mid X_i, Z_i] = \sigma_\gamma^2 Z_iZ_i' + \sigma^2 I.$$

Again we can form a BLUP $E_{\hat{\beta}, \hat{\sigma}_\gamma^2, \hat{\sigma}^2}[\gamma_i \mid Y, X, Z]$, and the BLUP is a shrunken version of the fixed effects model in which we regress $Y$ on $X$ and $Z$ within every cluster.

Linear mixed effects models

The random intercept and random slope models are special cases of the linear mixed model

$$Y = X\beta + Z\gamma + \epsilon,$$

where $X$ is a $n \times p$ design matrix and $Z$ is a $n \times q$ design matrix; $\gamma$ is a random vector with mean 0 and covariance matrix $\Psi$, and $\epsilon$ is a random vector with mean 0 and covariance matrix $\sigma^2 I$.
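As a closing sketch, a linear mixed model with both a random intercept and a random slope can be fit with statsmodels' MixedLM via its re_formula argument; the simulated data, names, and settings below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
m, k = 40, 10
cluster = np.repeat(np.arange(m), k)
x = rng.normal(size=m * k)
z = rng.integers(0, 2, size=m * k).astype(float)   # treatment indicator
gamma = 1.0 + rng.normal(scale=0.5, size=m)        # cluster-specific slopes
theta = rng.normal(scale=0.7, size=m)              # cluster intercepts
y = theta[cluster] + 0.5 * x + gamma[cluster] * z + rng.normal(size=m * k)
df = pd.DataFrame({"y": y, "x": x, "z": z, "cluster": cluster})

# Linear mixed model Y = X beta + Z gamma + eps with a random intercept
# and a random slope on z in each cluster.
model = sm.MixedLM.from_formula("y ~ x + z", groups="cluster",
                                re_formula="~z", data=df)
result = model.fit()
print(result.fe_params)   # beta-hat, with gamma_0 as the z coefficient
print(result.cov_re)      # estimated Psi (2 x 2: intercept and z slope)
```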