Residuals in the Analysis of Longitudinal Data

Jemila Hamid, PhD (joint work with WeiLiang Huang)
Clinical Epidemiology and Biostatistics & Pathology and Molecular Medicine, McMaster University

Outline
1. Introduction
2. Residuals in the analysis of longitudinal data
3. Transformed residuals
4. The growth curve model and decomposed residuals
5. Real data application
6. Discussion

1. Introduction
- Statistical modeling plays an important role in understanding the relationships between variables.
- It is used in a wide range of applications, from finance, banking and weather forecasting to clinical medicine and public health.
- In medical and biological research in particular, modeling has proved to be an essential tool for enhancing our understanding of a variety of common as well as rare diseases affecting the public.

Introduction (cont'd)
- Statistical modeling also plays important roles in disease diagnosis, prognosis and management, as well as in disease prevention and health promotion.
- Statistical models are commonly used to identify risk factors associated with diseases, which allows effective diagnosis, treatment and prevention.
- The terms evidence-based medicine, evidence-based diagnosis and evidence-based decision making highlight the importance of statistical methods in medicine and public health.

Introduction (cont'd)
- At the initial stage of modeling, we specify the model and its assumptions.
- We then estimate the model parameters based on the specified model and the underlying assumptions.
- Statistical models often rely on several assumptions, including:
  - Distributional assumptions: mostly on the outcome variables.
  - Relational assumptions: these quantify the relationship between outcome and predictors.

Introduction (cont'd)
- However, modeling is not complete without investigating model-data agreement (model diagnostics).
- We need to ask questions such as:
  - Do the data support the model assumptions?
  - Does the model fit the data?
  - Are the mean and the covariance modeled properly?
  - Do we need to add or remove variables?
  - Are there outliers and/or influential observations that influence our estimation and affect the generalizability of our model?
- Model diagnostics is therefore a crucial component of any model-fitting problem.

Residuals
- We cannot talk about model diagnostics without residuals.
- Residuals are used not only to check the adequacy of model fit; they are also excellent tools for validating model assumptions and identifying outliers and/or influential observations.
- Residuals in univariate models are relatively simple to explore, have been studied extensively, and are routinely used for model diagnostics.
- Different types of residuals have been proposed: ordinary, standardized, studentized and jackknife residuals.

Residuals (cont'd)
- Model: $Y = X\beta + \varepsilon$
- Parameter estimates: $\hat{\beta} = (X'X)^{-1}X'Y$
- Ordinary residuals: $R = Y - \hat{Y} = (I - X(X'X)^{-1}X')Y = (I - H)Y$
- Note: $R = (I - H)\varepsilon$, $E(R) = 0$ and $\mathrm{Var}(r_i) = (1 - h_{ii})\sigma^2$
- Residuals represent the part of the data that is left unexplained after a model has been fitted.
[Figure: Y and the column space C(X).]

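Residuals (cont'd)
- Standardized residuals: $r_i / s$, where $s^2 = \frac{1}{n-p-1}\sum_{i=1}^{n} r_i^2$
- Studentized residuals: $r_i / (s\sqrt{1 - h_{ii}})$
- Jackknife residuals: $r_i / (s_{(i)}\sqrt{1 - h_{ii}})$, where $s_{(i)}$ is computed with the $i$-th observation deleted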
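The four residual types above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the function name `residual_types`, the simulated data and the closed-form deletion update for $s_{(i)}$ are assumptions made here, and the design matrix is assumed to already contain the intercept column (so it has $p + 1$ columns).

```python
import numpy as np

def residual_types(X, y):
    """Return ordinary, standardized, studentized and jackknife residuals.

    X is an (n, k) design matrix that already contains the intercept column,
    so k = p + 1 and the residual degrees of freedom are n - p - 1 = n - k.
    """
    n, k = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
    h = np.diag(H)                          # leverages h_ii
    r = y - H @ y                           # ordinary residuals (I - H) y
    s2 = r @ r / (n - k)                    # s^2 = sum r_i^2 / (n - p - 1)
    standardized = r / np.sqrt(s2)
    studentized = r / np.sqrt(s2 * (1.0 - h))
    # s_(i)^2: residual variance with observation i deleted, obtained from
    # the closed-form update SSE_(i) = SSE - r_i^2 / (1 - h_ii).
    s2_deleted = (r @ r - r**2 / (1.0 - h)) / (n - k - 1)
    jackknife = r / np.sqrt(s2_deleted * (1.0 - h))
    return r, standardized, studentized, jackknife

# Hypothetical simulated data: one predictor plus an intercept.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)
r, r_std, r_stud, r_jack = residual_types(X, y)
```

The deletion update avoids refitting the model n times; refitting without observation i gives the same $s_{(i)}$.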

Residuals (cont'd)
How are the residuals used? Graphically:
- Checking normality: Q-Q plots
- Checking model fit: scatter plots
- Checking independence: scatter plots
- Checking for outliers and/or influential observations: leverage plots, plots of Cook's distance, plots of DFBETAS and DFFITS
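As an illustration of the normality check, the coordinates of a normal Q-Q plot can be computed with the standard library and NumPy alone. This is a sketch under assumptions: the plotting positions $(i - 0.5)/n$ are one common convention chosen here, and the function names and simulated samples are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_qq_points(resid):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    sample = np.sort(np.asarray(resid, dtype=float))
    n = sample.size
    probs = (np.arange(1, n + 1) - 0.5) / n          # assumed plotting positions
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, sample

def qq_correlation(resid):
    """Correlation between the two axes; values near 1 suggest normality."""
    theoretical, sample = normal_qq_points(resid)
    return np.corrcoef(theoretical, sample)[0, 1]

# Hypothetical samples: one normal, one lognormal.
rng = np.random.default_rng(2)
corr_normal = qq_correlation(rng.normal(size=500))
corr_lognormal = qq_correlation(rng.lognormal(size=500))
```

Plotting `sample` against `theoretical` gives the usual Q-Q display; the correlation is only a rough numeric summary of how straight that plot is.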

Residuals (cont'd)
How are the residuals used? Formal tests based on residuals:
- Normality: Shapiro-Wilk test, Kolmogorov-Smirnov test
- Constant variance (homoscedasticity): White's test
- Independence: Durbin-Watson test
- Outliers and/or influential observations: Wald's test using Cook's distance
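Of the tests listed, the Durbin-Watson statistic is simple enough to compute directly. The sketch below shows only the statistic, $\sum_t (r_t - r_{t-1})^2 / \sum_t r_t^2$, not its critical values; the simulated series and variable names are assumptions for illustration.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: about 2 for uncorrelated residuals,
    below 2 for positive and above 2 for negative serial correlation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Hypothetical residual series: white noise and an AR(1) series with rho = 0.8.
rng = np.random.default_rng(1)
white = rng.normal(size=500)
ar1 = np.empty(500)
ar1[0] = white[0]
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
```

Since the statistic is roughly $2(1 - \hat{\rho})$, where $\hat{\rho}$ is the lag-one autocorrelation of the residuals, strongly correlated residuals of the kind that arise in longitudinal data push it well below 2.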

Residuals (cont'd)
[Figure: normal Q-Q plots (sample quantiles against theoretical quantiles). Left: data from the normal distribution. Right: data from the lognormal distribution.]

Residuals (cont'd)
[Figure: normal Q-Q plots and plots of resid(fitsq) against fitted(fitsq) for data from the normal distribution. One pair of panels shows a wrongly fitted model, the other a correctly fitted model.]

Residuals (cont'd)
[Figure: scatter plots of Y against X with the unusual observations marked separately from the others, a plot of DFBETAS against index, and a plot of Cook's distance against index.]

2. Residuals in the Analysis of Longitudinal Data
- Residuals are correlated.
- Residuals are not normally distributed.
[Figure: residuals from the analysis of longitudinal data where there is no systematic component (no effect of time).]

Residuals in the Analysis of Longitudinal Data (cont'd)
- When there is time dependency, with the mean represented by a function of time, it is not obvious how to use ordinary residuals, obtained as the difference between the observed and fitted values.
[Figure: residuals from a correctly fitted model and from a wrongly fitted model.]

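3. Transformed Residuals
Cholesky decomposition
- Recall: the residuals from one individual are correlated; let $\hat{\Sigma}$ denote their estimated covariance matrix.
- Consider the Cholesky decomposition $\hat{\Sigma} = LL'$, where $L$ is lower triangular.
- Transform the residuals to get $r^* = L^{-1} r$, which are approximately uncorrelated with unit variance (Fitzmaurice, 2004).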
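A minimal sketch of the Cholesky transformation follows. It assumes balanced data, so that every individual shares one estimated covariance matrix, and it uses simulated mean-zero residuals; the function name, the AR(1)-type covariance and the sample sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def cholesky_transform(resid, sigma_hat):
    """Decorrelate residual vectors.

    resid:     (n_subjects, p) array, one residual vector per row.
    sigma_hat: (p, p) estimated covariance matrix of a residual vector.
    Returns r* = L^{-1} r for every row, where sigma_hat = L L'.
    """
    L = np.linalg.cholesky(sigma_hat)        # lower triangular factor
    return np.linalg.solve(L, resid.T).T     # solve L r* = r for each subject

# Hypothetical correlated residuals with an AR(1)-type covariance.
rng = np.random.default_rng(3)
p, n_subj = 4, 200
true_sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
resid = rng.multivariate_normal(np.zeros(p), true_sigma, size=n_subj)

sigma_hat = resid.T @ resid / n_subj         # covariance of mean-zero residuals
transformed = cholesky_transform(resid, sigma_hat)
```

After the transformation the pooled residuals can be examined with the usual univariate tools, such as a normal Q-Q plot.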

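Transformed Residuals (cont'd)
Small's graphical method
- The idea behind Small's graphical approach is to reduce the multivariate data to a univariate statistic.
- Suppose $x_1, x_2, \dots, x_n$ are independently distributed as $N_p(\mu, \Sigma)$. Then the statistic
  $$b_i = \frac{n}{(n-1)^2}(x_i - \bar{x})' S^{-1} (x_i - \bar{x})$$
  has a Beta distribution with parameters $\alpha = \tfrac{1}{2}p$ and $\beta = \tfrac{1}{2}(n-p-1)$, where $\bar{x}$ is the sample mean and $S$ is the sample covariance matrix (with divisor $n-1$).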
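The reduction to a univariate statistic can be sketched as follows. The slide's own formula was not captured in the transcript, so this uses the standard form of the result as an assumption: the scaled squared Mahalanobis distance $\frac{n}{(n-1)^2}(x_i - \bar{x})' S^{-1}(x_i - \bar{x})$, which follows a Beta distribution with shape parameters $p/2$ and $(n-p-1)/2$ under multivariate normality. The function name and simulated data are hypothetical, and observations are stored in rows here.

```python
import numpy as np

def small_beta_statistic(X):
    """Scaled squared Mahalanobis distances, one per observation.

    X: (n, p) array with one observation per row.
    """
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / (n - 1)              # sample covariance
    d2 = np.einsum("ij,ij->i", centered @ np.linalg.inv(S), centered)
    return n * d2 / (n - 1) ** 2

# Hypothetical multivariate normal sample.
rng = np.random.default_rng(4)
n_obs, p_dim = 50, 3
data = rng.normal(size=(n_obs, p_dim))
b = small_beta_statistic(data)
```

For a beta probability plot, the sorted values of `b` would be plotted against quantiles of the reference Beta distribution; computing those quantiles needs a statistics library and is left out of this sketch.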

Transformed Residuals (cont'd)
[Figure: normal Q-Q plots of the residuals. Multivariate normal data, independent (left) and correlated (right); model correctly fitted.]
[Figure: normal Q-Q plots after Fitzmaurice's transformation. Multivariate normal data, independent (left) and correlated (right); model correctly fitted.]

Transformed Residuals (cont'd)
[Figure: normal Q-Q plots of the residuals. Multivariate normal data, independent (left) and correlated (right); model correctly fitted.]
[Figure: beta probability plots after Small's transformation. Multivariate normal data, independent (left) and correlated (right); model correctly fitted.]

Transformed Residuals (cont'd)
[Figure: normal Q-Q plots of R. Multivariate lognormal data, independent (left) and correlated (right); model correctly fitted.]
[Figure: normal Q-Q plots of Fitzmaurice's transformed residuals. Multivariate lognormal data, independent (left) and correlated (right); model correctly fitted.]

Transformed Residuals (cont'd)
[Figure: normal Q-Q plots of R. Multivariate lognormal data, independent (left) and correlated (right); model correctly fitted.]
[Figure: beta probability plots of Small's transformed residuals. Multivariate lognormal data, independent (left) and correlated (right); model correctly fitted.]

Transformed Residuals (cont'd)
[Figure: normal Q-Q plot of R. Multivariate normal, correlated data; model wrongly fitted.]
[Figure: Fitzmaurice's transformed residuals. Multivariate normal, correlated data; model wrongly fitted.]

Transformed Residuals (cont'd)
Limitations of the above two transformations in multivariate analysis:
- They are meant for checking distributional assumptions and do not allow assessment of model fit.
- If the model is not properly fitted, they also perform poorly as checks of multivariate normality.
- This is particularly important in the analysis of longitudinal data, where a within-individual assumption has to be prespecified to describe the mean growth or change over time.
- Residuals that allow us to check both the within-individual and the between-individual assumptions are better suited to this situation.

4. The GCM and Decomposed Residuals
The Growth Curve Model
- Suppose that we have m different groups, where repeated measurements are taken from each individual at p different time points.
- Suppose also that the mean for the i-th group follows a polynomial curve of degree q over time, which can be described as
  $$\mu_i(t) = b_{0i} + b_{1i} t + \dots + b_{qi} t^q, \quad i = 1, \dots, m.$$
- Then the Growth Curve Model (GCM) can be formulated as
  $$X = ABC + E.$$

The Growth Curve Model (cont'd)
- $A$ ($p \times (q+1)$): within-individual design matrix
- $B$ ($(q+1) \times m$): parameter matrix
- $C$ ($m \times n$): between-individual design matrix
- $X$ ($p \times n$): observation matrix, with $n = n_1 + n_2$

The Growth Curve Model (cont'd)
Example: Dental data
- Dental measurements were taken on eleven girls and sixteen boys at four different ages (8, 10, 12 and 14).
- Each measurement is the distance, in millimeters, from the center of the pituitary to the pterygomaxillary fissure.
- The observation matrix X is therefore 4 × 27.
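For the dental example, the within-individual and between-individual design matrices can be built as sketched below. Only the ages and group sizes come from the example; the helper names, the choice of a linear curve (q = 1) and the ordering of the groups (girls first) are assumptions made for illustration.

```python
import numpy as np

def within_design(times, degree):
    """A: p x (q + 1) matrix with columns 1, t, ..., t^q."""
    return np.vander(np.asarray(times, dtype=float), degree + 1, increasing=True)

def between_design(group_sizes):
    """C: m x n indicator matrix; row g marks the individuals in group g."""
    n = sum(group_sizes)
    C = np.zeros((len(group_sizes), n))
    start = 0
    for g, size in enumerate(group_sizes):
        C[g, start:start + size] = 1.0
        start += size
    return C

A = within_design([8, 10, 12, 14], degree=1)   # assumed linear growth, q = 1
C = between_design([11, 16])                   # assumed order: girls, then boys
```

Setting `degree=2` gives the quadratic alternative; the parameter matrix B then has three rows instead of two.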

The Growth Curve Model (cont'd)
Objectives
- Should the growth curves be represented by second-degree equations in time (t), or are linear equations adequate?
- Should two separate curves be used for boys and girls, or do both have the same growth curve?
- We may also be interested in estimating the growth curve(s) and obtaining confidence band(s) for the expected growth curve(s).

The Growth Curve Model (cont'd)
Example: Glucose data
- A standard glucose tolerance test was administered to 13 control and 20 obese patients.
- Plasma inorganic phosphate measurements were determined from blood samples taken at 0, 0.5, 1, 1.5, 2, 3, 4 and 5 hours after a standard oral dose of glucose.
- The objective of the study was to determine whether there is a significant difference between the control and obese groups of patients.
- A second-degree polynomial is used to model both groups.

The Growth Curve Model (cont'd)
[The matrices for the model, shown on the slide.]

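Decomposed residuals (cont'd)
- The maximum likelihood estimator of the parameter matrix B in the GCM is given by Khatri (1966):
  $$\hat{B} = (A'S^{-1}A)^{-1}A'S^{-1}XC'(CC')^{-1},$$
  where $S = X(I - C'(CC')^{-1}C)X'$.
- The predicted value is given by $\hat{X} = A\hat{B}C$.
- Therefore, ordinary residuals can be calculated as $R = X - A\hat{B}C$.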

Decomposed residuals (cont'd)
- Recall the MANOVA model: $X = BC + E$.
- The MLE of $B$ is $\hat{B} = XC'(CC')^{-1}$.
- The residuals are therefore given by $R = X(I - C'(CC')^{-1}C)$.

Decomposed residuals (cont'd)
- The ordinary residuals decompose into three parts:
  $$R_1 = (I - P_A)X(I - P_C), \quad R_2 = P_A X(I - P_C), \quad R_3 = (I - P_A)XP_C,$$
  where $P_C = C'(CC')^{-1}C$, $P_A = A(A'S^{-1}A)^{-1}A'S^{-1}$ and $S = X(I - P_C)X'$.
- Note that $R_1 + R_2 = X(I - C'(CC')^{-1}C)$. This sum can be used to check between-individual assumptions such as the normality assumption.
- $R_3 = XC'(CC')^{-1}C - A\hat{B}C$ can be used to check the within-individual assumption, that is, whether the fitted curve is adequate to represent the change over time.
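The estimator and the three residual components can be sketched together in NumPy. This is an illustration under assumptions, not the authors' code: it takes $P_C = C'(CC')^{-1}C$, $S = X(I - P_C)X'$ and $P_A = A(A'S^{-1}A)^{-1}A'S^{-1}$, which is the standard form of Khatri's estimator, and it uses simulated data with a hypothetical parameter matrix on the dental-example design (4 time points, 11 + 16 individuals, linear growth).

```python
import numpy as np

def gcm_decomposed_residuals(X, A, C):
    """Fit X = ABC + E and return (B_hat, R1, R2, R3)."""
    p, n = X.shape
    I_p, I_n = np.eye(p), np.eye(n)
    P_C = C.T @ np.linalg.solve(C @ C.T, C)           # C'(CC')^{-1}C
    S = X @ (I_n - P_C) @ X.T                         # needs n - m >= p
    S_inv_A = np.linalg.solve(S, A)                   # S^{-1}A
    G = np.linalg.solve(A.T @ S_inv_A, S_inv_A.T)     # (A'S^{-1}A)^{-1}A'S^{-1}
    B_hat = G @ X @ C.T @ np.linalg.inv(C @ C.T)
    P_A = A @ G
    R1 = (I_p - P_A) @ X @ (I_n - P_C)
    R2 = P_A @ X @ (I_n - P_C)
    R3 = (I_p - P_A) @ X @ P_C
    return B_hat, R1, R2, R3

# Hypothetical simulated data on the dental-example design.
rng = np.random.default_rng(5)
A = np.vander(np.array([8.0, 10.0, 12.0, 14.0]), 2, increasing=True)
C = np.repeat(np.eye(2), [11, 16], axis=1)            # 2 x 27 group indicators
B_true = np.array([[17.0, 16.0], [0.5, 0.8]])         # made-up intercepts and slopes
X = A @ B_true @ C + rng.normal(size=(4, 27))

B_hat, R1, R2, R3 = gcm_decomposed_residuals(X, A, C)
```

The three components add up to the ordinary residuals $X - A\hat{B}C$, so examining $R_1 + R_2$ and $R_3$ separately splits the usual residual matrix into a between-individual part and a within-individual part.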

Decomposed residuals (cont'd)
[Figure: normal Q-Q plots of R. Multivariate normal, correlated data; correctly fitted (left) and wrongly fitted (right) model.]
[Figure: scatter plots of $R_3$. Multivariate normal, correlated data; correctly fitted (left) and wrongly fitted (right) model.]

Decomposed residuals (cont'd)
[Figure: normal Q-Q plots of $R_1 + R_2$ after Fitzmaurice's transformation. Multivariate normal data; perfectly fitted (left) and misfitted (right) model.]
[Figure: beta probability plots of $R_1 + R_2$ after Small's transformation. GCM with normal error; perfectly fitted (left) and misfitted (right) model.]

5. Real Data Application
Recall: Dental data

Real data application (cont'd)
[Figure: normal Q-Q plot of $R_1 + R_2$ (left) and scatter plot of $R_3$ (right).]
[Figure: normal Q-Q plot of Fitzmaurice's transformed $R_1 + R_2$ (left) and beta quantile plot of Small's transformed $R_1 + R_2$ (right).]

Real data application (cont'd)
Dental data after outliers have been removed
[Figure: normal Q-Q plot of Fitzmaurice's transformed $R_1 + R_2$ (left) and beta quantile plot of Small's transformed $R_1 + R_2$ (right).]

Real data application (cont'd)
Recall: Glucose data

Real data application (cont'd)
[Figure: normal Q-Q plot of $R_1 + R_2$ (left) and scatter plot of $R_3$ (right).]
[Figure: normal Q-Q plot of Fitzmaurice's transformed $R_1 + R_2$ (left) and beta quantile plot of Small's transformed $R_1 + R_2$ (right).]

Real data application (cont'd)
Glucose data without and with a higher-order polynomial fit
[Figure: scatter plot of the decomposed residual $R_3$ for the quadratic fit (left) and for the third-degree fit (right).]

6. Discussion
- Residuals play important roles in checking the adequacy of model fit, validating assumptions and identifying outliers and/or influential observations.
- Residuals in the analysis of longitudinal data are correlated and not necessarily normally distributed.
- Both Fitzmaurice's transformation and Small's graphical method successfully removed the correlation structure.
- However, Fitzmaurice's transformation did not perform well when the data were not normally distributed; in that case the transformed residuals can lead to wrong decisions.

Discussion (cont'd)
- Residuals based on the growth curve model provide separate components that are useful for model diagnostics and for checking multivariate normality.
- The scatter plot of $R_3$ is able to identify systematic error in model fitting.
- $R_1 + R_2$ provides a reliable basis for checking the normality assumption.
- The results are consistent for small as well as large sample sizes, and for different covariance structures.

Thank you!