Applied Multivariate and Longitudinal Data Analysis


Longitudinal Data Analysis: General Linear Model
Ana-Maria Staicu, SAS Hall 522

Introduction

Consider the following examples. Pay attention to the response variable, the observational unit, and the number of measurements collected per unit.

Low flux dialyzers are used to treat patients with end stage renal disease by removing excess fluid and waste from their blood. In low flux hemodialysis, the ultrafiltration rate (ml/hr) at which fluid is removed is thought to follow a straight-line relationship with the transmembrane pressure (mmHg) applied across the dialyzer membrane. A study was conducted to compare the average ultrafiltration rate (the response) of such dialyzers across three dialysis centers where they are used on patients. A total of 41 dialyzers (units) were involved. The experiment involved recording the ultrafiltration rate at 4 transmembrane pressures (depicted by dots in the figure below) for each dialyzer.

Dietary Calcium Absorption Data. Calcium absorption is measured for 88 subjects aged between 35 and 45 years at the beginning of the study. Between one and four measurements are taken per subject. How does the typical calcium absorption vary across the subjects?

[Figure: calcium absorption versus age; left panel shows calcium absorption for 2 subjects, right panel for all subjects.]

Outline. In the second part of the course we will focus on statistical models and methods for studies in which individuals/subjects/objects/units are measured repeatedly over time. Specifically, we will focus on modeling longitudinal data. We will discuss modeling from a marginal perspective (i.e. aggregating the among/between sources of variability) as well as from a subject-specific perspective (i.e. modeling explicitly all the sources of variability in the data).

This chapter considers a general model perspective. We discuss flexible modeling of the mean trajectory (incorporating time explicitly) and various covariance structures to model the dependence. The modeling techniques allow for the incorporation of covariate information. We discuss estimation/statistical inference for the mean parameters and estimation of the covariance parameters.

Basic concepts and notation

Response is the outcome of interest (denoted typically by Y).

Unit (object or subject) is the object on which repeated measurements are taken; typically units are individuals (i indexes units and j indexes the repeated measurement). Y_ij denotes the jth repeated measurement taken on the ith subject or unit; n denotes the total number of units and m_i denotes the number of repeated measurements for unit i. The response vector of measurements for subject/individual/object/unit i is:

Y_i = (Y_i1, Y_i2, ..., Y_{i m_i})^T.

The responses are typically assumed independent across units (e.g. Y_i and Y_{i'} are independent for i ≠ i'). However, within a unit the responses are correlated (e.g. Y_ij and Y_{ij'} are typically correlated). Many statistical models consider modeling the response vector Y_i rather than the Y_ij's separately; nevertheless it is not uncommon to model the Y_ij's separately. We'll discuss models that exploit both representations.

Time is the generic term for the condition of measurement (t is used to denote time).
Time is considered an important covariate in longitudinal data; it is modeled differently than the other covariates in the data. Both the mean of the response vector Y_i and the covariance of Y_i may depend on time. t_ij denotes the time corresponding to Y_ij.

We say the design is balanced when m_i = m (the same number of repeated measurements across units). Otherwise we say the design is unbalanced. We say the design is regular if t_ij = t_j (the times of measurement are the same for all the units). Otherwise we say the design is irregular.

Although not specified explicitly, it is assumed that times occur in increasing order, t_{i1} < t_{i2} < ... < t_{i m_i}. The general data structure for a balanced, regular design (m_i = m and t_ij = t_j) is:

Time:    t_1    t_2    t_3   ...   t_m
Unit 1:  Y_11   Y_12   Y_13  ...   Y_1m
Unit 2:  Y_21   Y_22   Y_23  ...   Y_2m
  ...
Unit n:  Y_n1   Y_n2   Y_n3  ...   Y_nm

Setting: In the following, consider the observed data {(Y_ij, t_ij) : j = 1, ..., m_i} for each unit i, where Y_ij is assumed to be continuous. For simplicity we assume t_ij = t_j and m_i = m (balanced and regular design). We are interested in studying the typical behavior of the outcome over time, and furthermore in studying the way the outcome varies over time.

Modeling longitudinal data is more complex than modeling independent data:

multiple observations from the same person are correlated, so we need to model the correlation among the repeated measurements;

modeling the mean trend across time requires attention; typically the effect of the other possible predictors is modeled in the mean (systematic part).

Conceptual model: For continuous data we write

DATA_ij = Mean_j + Residual_ij,

where Mean_j is the average response corresponding to time t_j and Residual_ij describes the deviation of the data DATA_ij from the mean Mean_j.

The mean describes how the response changes on average over time. If additional factors (or covariate info such as group, or additional subject information) are available then the mean may depend on these factors. Common notation: µ = (µ_1, ..., µ_m)^T for the mean vector.

The residual determines how far the data deviate from the mean. It determines the distribution of the response (in this part it is commonly assumed normal). It also determines how the repeated observations correlate over time.

The three main steps in modeling longitudinal data are: modeling the mean, modeling the covariance, and selecting the distribution. In each of these it is imperative that we look at the data and use any available visualization tools.

Mean. Because the elements of the mean vector µ = (µ_1, ..., µ_m)^T are arranged in increasing time order t_1 < t_2 < ... < t_m (µ_j corresponds to t_j), we refer to the mean µ as a mean trajectory rather than a mean vector (the common term in multivariate statistics). Examples:

µ_j = µ(t_j) = a + b t_j (linear trajectory)

µ_j = µ(t_j) = a + b t_j + c t_j^2 (quadratic trajectory)

These are representations of the mean trajectory using a finite set of parameters (parametric structures of the mean function). By an abuse of notation, in this example µ(·) was used to denote a function.

Random deviation. Sources of variation. For longitudinal data there are two main types of potential sources of variation in the data:

Among-unit variation: the variation that occurs among units (subjects/individuals/objects/units are different).

Within-unit variation: the fluctuation of the response that occurs from one measurement to another within the same subject/individual/object/unit. Measurement error, for instance, is included in this source.

The next figure depicts the two sources of variation. Left panel: subject mean trajectories ("inherent trend" for each subject) in dashed lines; overall mean trajectory in a solid black line. The variation between these curves represents the among-unit variation. Middle panel: the true subject trajectories, where the deviations from the mean subject trajectories are due to the biological variation of the responses (think of the fluctuations of one's blood pressure from one time to another). Right panel: the observed data for each subject. Notice that the measurements deviate more, and this is due to measurement error (say, imperfections in the measuring device).

[Figure: three panels versus time — subject mean response, true subject response, and observed subject response (filled circles).]

Illustration: One simple way to represent how the data Data_ij vary with time t_ij = t_j is the following:

Data_ij = Mean_j + SubjSpecific_i + BiologicalDev_ij + Error_ij,

where

Mean_j = µ_j is the overall (population) mean at time point t_ij = t_j,

SubjSpecific_i represents the biological variation of the ith unit; this deviation dictates the inherent trend of the ith subject,

BiologicalDev_ij is the component of deviation from the subject's trend that is due to the biological variation over time within the subject,

Error_ij is the component of the deviation that is due to measurement error.
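The decomposition above can be sketched with simulated data (a hypothetical illustration, not the course data): each source of variation is generated separately and then summed, and sharing SubjSpecific_i across a unit's measurements is what induces the within-unit correlation.

```python
import numpy as np

# Simulate the conceptual decomposition
#   Data_ij = Mean_j + SubjSpecific_i + BiologicalDev_ij + Error_ij
# with hypothetical variance components chosen for illustration.
rng = np.random.default_rng(0)
n, m = 200, 4                      # n units, m repeated measurements each
t = np.arange(1.0, m + 1)          # common time points t_1 < ... < t_m

mean_j = 2.0 + 0.5 * t                       # population mean trajectory
subj = rng.normal(0.0, 1.0, size=(n, 1))     # among-unit deviation (shared within unit)
biol = rng.normal(0.0, 0.5, size=(n, m))     # within-unit biological deviation
err = rng.normal(0.0, 0.3, size=(n, m))      # measurement error
Y = mean_j + subj + biol + err               # observed data, n x m

# Repeated measurements on the same unit share SubjSpecific_i, so they
# are positively correlated across time points:
within_corr = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]
```

With these (hypothetical) variances, the correlation between two measurements on the same unit is 1/(1 + 0.25 + 0.09) ≈ 0.75, driven entirely by the shared subject-specific term.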


Exploratory analysis: Look at your data as much as possible (in general we don't look enough at the data).

Visualization. Spaghetti plots are a method of viewing the data to visualize the dynamic behavior over time, corresponding to each unit/subject. Notice the measurements for each subject are connected with a line, but measurements from different subjects are not connected.

Example: Researchers are interested in the dental development of kids over time. They collect dental growth measurements of the distance (mm) from the center of the pituitary gland to the pterygomaxillary fissure for 27 kids (11 girls and 16 boys) at ages 8, 10, 12, and 14. A picture of the pterygomaxillary fissure can be found at
The interest is in how the dental growth measurements vary over time, whether they differ between boys and girls, and furthermore whether the rate of change differs between boys and girls. The figure below displays the distance measurements per age for each child; the trajectory for each child is connected by a solid line so that individual child patterns may be seen.

[Figure: spaghetti plot of distance versus age (years), and mean distance for girls and boys.]

Mean. The primary objective in LDA is estimation and inference for the mean function. In our setting we have a mean vector µ = (µ_1, ..., µ_m)^T, but recall that µ_j corresponds to the time t_j. In the case of a balanced and regular design, an estimator for the mean is given by the sample mean. Likewise an estimator for the covariance is given by the sample covariance. Most longitudinal studies do not involve balanced and regular designs; estimators in those cases are not this simple.

The estimator of µ = (µ_1, ..., µ_m)^T is µ̂ defined by

µ̂ = (µ̂_1, µ̂_2, ..., µ̂_m)^T = (Ȳ_1, Ȳ_2, ..., Ȳ_m)^T, where Ȳ_j = (1/n) Σ_{i=1}^n Y_ij.

Graphical inspection of the mean vector is an important tool to understand the possible relationship of the means over time. Examine how µ̂_j changes with time t_j. In particular look for a linear trend, or curvature for a quadratic trend, etc. We apply this estimator to the dental study data for girls and obtain the estimate µ̂_G.

Variation/Correlation. An unbiased estimator for the covariance is the sample covariance

Σ̂ = (1/(n−1)) Σ_{i=1}^n (Y_i − µ̂)(Y_i − µ̂)^T;

the numerator Σ_{i=1}^n (Y_i − µ̂)(Y_i − µ̂)^T is also known as the sums of squares and cross-products matrix (SS&CP). Denote the elements of this matrix by (Σ̂_jk)_{1 ≤ j,k ≤ m} and also let σ̂_j = (Σ̂_jj)^{1/2}.

To describe the dependence in the data, the cross-covariance matrix is often used, and in particular the variance behavior over time and the correlation (for dependence). Examine how the variances Σ̂_jj change with time t_j, to learn about the various sources of variability in the data. Examine how the correlation varies: ρ̂_jk = Σ̂_jk / (Σ̂_jj Σ̂_kk)^{1/2}; denote by Γ̂ = (ρ̂_jk)_{1 ≤ j,k ≤ m} the estimate of the correlation matrix Γ = (ρ_jk)_{1 ≤ j,k ≤ m}. The off-diagonal terms of Γ̂ estimate the combined sources of variability (among-unit + within-unit variability), but they do not distinguish between them (dental data for girls).
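The estimators above can be sketched in a few lines for a balanced, regular design; the data matrix here is simulated for illustration, not the dental data.

```python
import numpy as np

# Sample mean, sample covariance (SS&CP / (n-1)), and sample correlation
# for an n x m matrix Y of repeated measures (hypothetical simulated data
# with an exchangeable dependence structure).
rng = np.random.default_rng(1)
n, m = 100, 4
Y = rng.normal(25.0, 2.0, size=(n, m)) + rng.normal(0.0, 2.0, size=(n, 1))

mu_hat = Y.mean(axis=0)                     # (Ybar_1, ..., Ybar_m)
R = Y - mu_hat                              # centered data
Sigma_hat = R.T @ R / (n - 1)               # unbiased sample covariance
sd = np.sqrt(np.diag(Sigma_hat))            # sigma_hat_j
Gamma_hat = Sigma_hat / np.outer(sd, sd)    # estimated correlation matrix
```

Inspecting `np.diag(Sigma_hat)` against the time index shows how the variance changes over time, and the off-diagonals of `Gamma_hat` show the (combined) dependence.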

Interpretation:

Remark: instead of estimating ρ_jk it is common to display the association using the so-called scatterplot matrix - essentially a matrix of plots where, for each distinct pair (t_j, t_k), one plots (Y_ij − µ̂_j)/σ̂_j against (Y_ik − µ̂_k)/σ̂_k - and to examine whether:

the association seems constant across the pairs,

the association seems to decay over time, or

the association does not vary at all with time.

Graphical display of the observations through such scatterplots can reveal systematic features.


Autocorrelation. Another measure that describes the association is the autocorrelation: the correlation between the repeated measurements when the lag, or distance between the times, is held constant. Stationarity is a property of a stochastic process that requires the first/second/etc. moments to be constant over time. Examining the autocorrelation is done with the purpose of checking the stationarity assumption (whether the covariance varies with the lag between the observations, t_j − t_j', instead of the actual times t_j, t_j').

Autocorrelation is formally defined as ρ(u) = corr{Y_ij, Y_ij'}, where t_j − t_j' = u; this measure describes the stationary nature of the dependence. Here u is commonly referred to as the lag. To study this behavior, plot, for each lag u, the standardized residuals (Y_ij − µ̂_j)/σ̂_j against (Y_ij' − µ̂_j')/σ̂_j' for pairs with t_j − t_j' = u. Equivalently you can calculate a sample autocorrelation estimator ρ̂(u) based on these standardized residuals. Notice however that the estimator is based on different numbers of pairs at different lags, hence it has different theoretical properties at various lags, and thus caution should be used in interpreting it.

Alternative examination - using the variogram. The variogram is defined as

V(u) = (1/2) E{(Y_ij − Y_ij')^2}, where t_j − t_j' = u.

For stationary processes (mean and variance constant over time) we have V(u) = τ^2 + σ^2{1 − ρ(u)}, where τ^2 is the noise variance (known in spatial statistics as the nugget effect). When data are unbalanced it is easier to estimate V(u) than ρ(u). To estimate the variogram, we compute v_ijj' = (1/2)(Y_ij − Y_ij')^2 and estimate V(u) by V̂(u) = Ave_{|t_ij − t_ij'| ≈ u}(v_ijj').
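For a balanced design with equispaced times, the sample variogram reduces to averaging v_ijj' over all pairs at each lag. A minimal sketch (the data below are simulated with an exchangeable dependence, so the true variogram is flat across lags):

```python
import numpy as np

def sample_variogram(Y):
    """Return {lag u: Vhat(u)} for an n x m matrix of repeated measures
    observed at equispaced times, where Vhat(u) averages
    v_ijj' = 0.5 * (Y_ij - Y_ij')**2 over all pairs with |j - j'| = u."""
    n, m = Y.shape
    vhat = {}
    for u in range(1, m):
        diffs = Y[:, u:] - Y[:, :-u]       # all within-unit pairs at lag u
        vhat[u] = 0.5 * np.mean(diffs ** 2)
    return vhat

# Hypothetical data: Y_ij = b_i + e_ij (exchangeable), so V(u) = var(e) = 1
# at every lag.
rng = np.random.default_rng(2)
Y = rng.normal(size=(500, 5)) + rng.normal(size=(500, 1))
vhat = sample_variogram(Y)
```

A variogram that rises with u would instead indicate serial correlation that decays with the time separation.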

2 General linear model

Motivation: Dialysis study (Ultrafiltration Data for Low Flux Dialyzers, presented in Vonesh and Chinchilli, 1997). Low flux dialyzers are used to treat patients with end stage renal disease by removing excess fluid and waste from their blood. In low flux hemodialysis, the ultrafiltration rate (ml/hr) at which fluid is removed is thought to follow a straight-line relationship with the transmembrane pressure (mmHg) applied across the dialyzer membrane. A study was conducted to compare the average ultrafiltration rate (the response) of such dialyzers across three dialysis centers where they are used on patients. A total of 41 dialyzers (units) were involved. The experiment involved recording the ultrafiltration rate at 4 transmembrane pressures (depicted by dots in the figure below) for each dialyzer.

Models for the mean trajectory. Many situations involve irregular and unbalanced sampling designs. One uses simple parametric models to describe the behavior of the mean response over time; or, more generally, one assumes that the mean response changes over time in a smooth way.

i. Polynomial trends in time

The simplest possible curve that describes how the mean response changes over time is a straight line, E[Y_ij] = β_0 + β_1 t_ij. Similarly, a quadratic trend over time can be represented as E[Y_ij] = β_0 + β_1 t_ij + β_2 t_ij^2.

A. Observed data {Y_ij, t_ij : j = 1, ..., m_i}, where Y_ij is the ultrafiltration rate for the ith dialyzer (within Center 1), corresponding to the transmembrane pressure t_ij. Think of a model that describes the mean ultrafiltration rate and how it varies over time.

B. Observed data {Y_ij, t_ij : j = 1, ..., m_i, C_i}, where Y_ij is the ultrafiltration rate for the ith dialyzer, corresponding to the transmembrane pressure t_ij, and C_i is the center membership. Think of a model that describes the mean ultrafiltration rate and how it varies over time.

Hip replacement study. These data are adapted from Crowder and Hand (1990, section 5.2). 30 patients (13 males and 17 females) underwent hip-replacement surgery. Haematocrit, the ratio of the volume of packed red blood cells relative to the volume of whole blood, recorded on a percentage basis, was supposed to be measured for each patient before the replacement and then at weeks 1, 2, and 3 after the replacement. In addition the age of each participant is recorded. The primary interest was to determine whether there are possible differences in mean response following replacement for men and women. Spaghetti plots of the profiles for each patient are shown in the left-hand panels of Figure 3. (We will discuss the right-hand panels later.) It may be seen from the figure that a number of both male and female patients are missing the measurement at week 2; in fact, there is one female missing the pre-replacement measurement and week 2. Here we have a situation where the data vectors Y_i are of possibly different lengths for different units.

Exercise: Think of and write down a model that describes the mean trajectory and how it varies over time. How do you incorporate the effect of age in this modeling framework?

ii. Linear splines

In some applications the longitudinal trends in the mean response cannot be characterized by a low-degree polynomial (first or second order) in time; in some applications the trend cannot be well represented by polynomials in time of any order. This mostly occurs when the mean response increases (or decreases) rapidly for some duration, and more slowly thereafter (or vice versa). When this type of change pattern occurs, the mean trend can be modeled by spline models. In a nutshell, a spline regression model involves a linear combination of connected or joined piecewise polynomial functions.

Splines are defined by their degree and knots. A linear (quadratic, cubic, etc.) spline means that the joined polynomials are lines (quadratic functions, cubic functions, etc.). Knots are the locations at which the pieces meet or are tied together. Linear spline models provide a useful and flexible way to model non-linear trends that cannot be approximated by simple polynomial functions in time.

We defined earlier polynomial models as linear combinations of the power basis functions {1, t, t^2, ...}. Linear spline models rely on the same general idea, except the basis functions are of the form {1, t, (t − κ_1)_+, ..., (t − κ_k)_+}, where {κ_1, ..., κ_k} are knots and k is the number of knots. Here (x)_+ = x if x > 0 and 0 if x ≤ 0. A linear spline model (using a single knot κ) for the mean trend can be represented as:

E[Y_ij] = β_0 + β_1 t_ij + β_2 (t_ij − κ)_+.

[Figure: Examples of a linear mean trend (left), a quadratic mean trend (middle) and a linear spline with one knot (right). The mean trend is depicted as a red solid line while the observed data are shown as black circles.]

Overall, a parametric model for the mean trajectory can be represented mathematically as µ(t_ij) = X_ij^T β, where X_ij^T is a row vector of covariates corresponding to the jth measurement of the ith subject and β is the column vector of unknown parameters. For the three examples considered above, specify the form of X_ij and β:

linear trend

quadratic trend

linear spline with one knot κ

Remark: one can easily incorporate additional covariate information in the mean structure.
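The three design rows can be sketched as functions mapping a time point to X_ij^T; the knot value and the β vector below are hypothetical choices for illustration.

```python
import numpy as np

# Design rows X_ij^T for the three mean models, so that mu(t_ij) = X_ij^T beta.
def X_linear(t):
    return np.column_stack([np.ones_like(t), t])                  # [1, t]

def X_quadratic(t):
    return np.column_stack([np.ones_like(t), t, t ** 2])          # [1, t, t^2]

def X_spline(t, kappa):
    # truncated power basis: (t - kappa)_+ = max(t - kappa, 0)
    return np.column_stack([np.ones_like(t), t,
                            np.maximum(t - kappa, 0.0)])          # [1, t, (t-kappa)_+]

t = np.array([1.0, 2.0, 3.0, 4.0])
Xs = X_spline(t, kappa=2.5)              # hypothetical knot
# Before the knot the trend has slope beta_1; after it, slope beta_1 + beta_2.
mu = Xs @ np.array([1.0, 2.0, -3.0])     # hypothetical beta
```

Here the slope is 2 before the knot and 2 − 3 = −1 after it, so the fitted mean rises and then falls.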

Models for the covariance: Assume the observed data are {Y_ij, X_ij : j = 1, ..., m_i}; let Y_i be the m_i-dimensional vector of the Y_ij's and X_i the m_i × p dimensional design matrix (e.g. it could include 1's, or the t_ij's, or t_ij^2, or other covariates observed for subject i, or time-varying covariates, etc.). Assume that cov(Y_i) = Σ_i, where Σ_i is an m_i × m_i covariance matrix. Here the index i is used specifically to allow for a different number of repeated measurements per unit, m_i. In this part we assume that the covariance model is parametric, that is, Σ_i = Σ_i(ω) is known up to a lower-dimensional parameter ω.

Recall that the responses measured on the same unit/subject are correlated. Although the correlations, or more generally the covariance among the repeated responses, are not usually of particular interest, we need to account for them in making inferences about the mean parameters. Accounting for the correlations among the repeated measures completes the specification of a (normal) model for the longitudinal data and usually increases the precision with which the regression parameters are estimated.

There are three main approaches to describing the covariance among the repeated measures: (1) unstructured; (2) covariance pattern models (described below); and (3) random effects covariance models (discussed later in the course). Importantly, these models for the covariance matrix will not explicitly distinguish between the among-unit and the within-unit variation.

Here are a few common covariance models that are described by only a few parameters:

(1) The unstructured covariance is typically used when there is a regular (sampling) design, say {t_ij : j = 1, ..., m_i, i = 1, ..., n} = {t_1, t_2, ..., t_r} for not-so-large r; it involves r(r − 1)/2 pairwise covariances.

(2) Covariance pattern models

Compound symmetric, ω = (σ^2, ρ):

Σ_i(ω) = [ σ^2    ρσ^2   ...   ρσ^2
           ρσ^2   σ^2    ...   ρσ^2
           ...                 ...
           ρσ^2   ρσ^2   ...   σ^2  ]

One-dependent, ω = (σ^2, ρ): covariance ρσ^2 between measurements one occasion apart, and 0 otherwise:

Σ_i(ω) = [ σ^2    ρσ^2   0     ...   0
           ρσ^2   σ^2    ρσ^2  ...   0
           0      ρσ^2   σ^2   ...   ...
           0      ...    ...   ρσ^2  σ^2 ]

Toeplitz structure, ω = (σ^2, ρ_1, ..., ρ_{m−1}): the covariance depends only on the separation between occasions:

Σ(ω) = [ σ^2          ρ_1 σ^2      ...   ρ_{m−2} σ^2   ρ_{m−1} σ^2
         ρ_1 σ^2      σ^2          ...   ρ_{m−3} σ^2   ρ_{m−2} σ^2
         ...                                            ...
         ρ_{m−1} σ^2  ρ_{m−2} σ^2  ...   ρ_1 σ^2       σ^2         ]

Exponential structure, ω = (σ^2, ρ):

Σ_i(ω) = [ σ^2                      ρ^{|t_i1 − t_i2|} σ^2    ...   ρ^{|t_i1 − t_im_i|} σ^2
           ρ^{|t_i2 − t_i1|} σ^2    σ^2                      ...   ρ^{|t_i2 − t_im_i|} σ^2
           ...                                                     ...
           ρ^{|t_im_i − t_i1|} σ^2  ρ^{|t_im_i − t_i2|} σ^2  ...   σ^2                    ]

Notice: when the set of time points {t_ij : i, j} is a set of equispaced time points, the above covariance reduces to the AR(1) covariance model corresponding to the set of unique points.

Remark: The above covariance structures assume the same variance over time. This was done for simplicity; one can specify covariance structures with unstructured variance over time.

General linear model formulation (population average or marginal model)

We can write the general model for the variation of the responses in matrix form as

Y_i = X_i β + ε_i,

where ε_i is the m_i-dimensional vector of random deviations and β is the fixed effects parameter corresponding to the design matrix X_i; β (often called the mean regression parameter) is the main object of inference. The term ε_i is the deviation from the systematic component; it has a multivariate distribution with mean 0_{m_i} and covariance matrix Σ_i = Σ_i(ω). In this chapter we assume that the responses are normally distributed, that is, Y_i ~ N_{m_i}(X_i β, Σ_i(ω)).

Remark: This approach separates the modeling of the mean (systematic component) from the correlation of the random component; the covariance of the random component does not distinguish between the two main sources of variability - among-unit and within-unit variation. Modeling the correlation in longitudinal data is important for obtaining correct inferences on the regression coefficients β. The correlation model does not change the interpretation of the β parameters.
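The covariance pattern models above are easy to build from their low-dimensional parameter ω; a sketch with hypothetical parameter values:

```python
import numpy as np

def compound_symmetric(m, sigma2, rho):
    """All variances sigma2, all covariances rho * sigma2."""
    S = np.full((m, m), rho * sigma2)
    np.fill_diagonal(S, sigma2)
    return S

def toeplitz_cov(sigma2, rhos):
    """rhos = (rho_1, ..., rho_{m-1}); correlation depends only on |j - k|."""
    m = len(rhos) + 1
    corr = np.eye(m)
    for j in range(m):
        for k in range(m):
            if j != k:
                corr[j, k] = rhos[abs(j - k) - 1]
    return sigma2 * corr

def exponential_cov(times, sigma2, rho):
    """Correlation rho**|t_j - t_k| decays with the time separation."""
    t = np.asarray(times, dtype=float)
    return sigma2 * rho ** np.abs(t[:, None] - t[None, :])

S_cs = compound_symmetric(4, 2.0, 0.5)
S_toep = toeplitz_cov(1.0, [0.6, 0.3, 0.1])
S_exp = exponential_cov([1.0, 2.0, 4.0], 1.0, 0.5)
```

Note that for equispaced times the exponential model coincides with an AR(1)-type Toeplitz structure, e.g. `exponential_cov([1, 2, 3], 1.0, 0.7)` equals `toeplitz_cov(1.0, [0.7, 0.49])`.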

3 Estimation of the regression parameters

Parameter estimation: Maximum Likelihood (ML)

Consider a framework for the estimation of the unknown parameters: the mean regression parameters (β) and the variance parameters (ω). When full distributional assumptions have been made about the vector of responses, a standard approach is to employ maximum likelihood estimation (MLE). For simplicity assume first that the covariance parameters ω are known.

Recall: The main idea in MLE is to estimate the parameters by the values that make the observed data most likely to have occurred under the specified model. As usual, we use a hat to denote parameter estimators.

Setting: Observed data are {Y_ij : j = 1, ..., m_i; X_ij}, where the Y_ij are the responses, Y_i denotes the vector of responses for unit i, and X_ij is the k-dimensional vector of covariate information. Assume

Y_ij = X_ij^T β + ε_ij, ε_i ~ N(0_{m_i}, Σ_i),

for Σ_i = Σ_i(ω), where Σ_i is an m_i × m_i matrix and is known.

To obtain the MLE of β we need to maximize the following log-likelihood function:

l(β) = −(1/2) (Σ_{i=1}^n m_i) log(2π) − (1/2) Σ_{i=1}^n log|Σ_i| − (1/2) Σ_{i=1}^n (Y_i − X_i β)^T Σ_i^{−1} (Y_i − X_i β),

where X_i is the m_i × k dimensional matrix with jth row given by X_ij^T, and β̂_MLE = argmax_β l(β). Since β does not appear in the first two terms, maximization of the log-likelihood function l(β) is equivalent to minimization of the third sum:

β̂_MLE = argmin_β Σ_{i=1}^n (Y_i − X_i β)^T Σ_i^{−1} (Y_i − X_i β).

The solution is

β̂_MLE = { Σ_{i=1}^n X_i^T Σ_i^{−1} X_i }^{−1} Σ_{i=1}^n X_i^T Σ_i^{−1} Y_i;

this is exactly the generalized least squares (GLS) estimator of β, β̂_GLS. In the case when the Σ_i's are known, this estimator is the best linear unbiased estimator of β (Gauss-Markov theorem).
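The closed-form GLS solution can be sketched directly from the formula; the data below are simulated under a hypothetical compound-symmetric error covariance purely to check that the estimator recovers the true β.

```python
import numpy as np

def gls(Y_list, X_list, Sigma_list):
    """GLS/ML estimator for known Sigma_i:
    beta_hat = (sum X_i' S_i^{-1} X_i)^{-1} (sum X_i' S_i^{-1} Y_i)."""
    p = X_list[0].shape[1]
    A = np.zeros((p, p))
    b = np.zeros(p)
    for Y, X, S in zip(Y_list, X_list, Sigma_list):
        Si = np.linalg.inv(S)
        A += X.T @ Si @ X
        b += X.T @ Si @ Y
    return np.linalg.solve(A, b)

# Hypothetical balanced example: 300 units, 4 occasions, CS errors
# (sigma2 = 1, rho = 0.75), true beta = (1, 0.5).
rng = np.random.default_rng(3)
t = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), t])
S = 0.25 * np.eye(4) + 0.75                 # compound symmetry
L = np.linalg.cholesky(S)
beta_true = np.array([1.0, 0.5])
Y_list = [X @ beta_true + L @ rng.normal(size=4) for _ in range(300)]
beta_hat = gls(Y_list, [X] * 300, [S] * 300)
```

In practice one would solve the per-unit systems rather than invert each Σ_i, but the direct translation keeps the correspondence with the formula visible.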

Properties of β̂:

Unbiasedness of β̂. What does it mean in layman's terms?

Covariance of β̂. What does it mean in layman's terms?

The sampling distribution of β̂. What does it mean in layman's terms?

What is the expression of β̂ when Σ_i = σ^2 I_{m_i}?

Remark. The GLS estimator is the best linear unbiased estimator (BLUE) of β. When the underlying distribution is multivariate normal, the GLS estimator is also the MLE of β, and furthermore one can show that it is the uniformly minimum variance unbiased estimator (UMVUE).

Question: what is the ordinary least squares (OLS) estimator, say β̂_OLS, and what is the difference between the OLS estimator and the GLS estimator?

The GLS estimator has the smallest variance among all the weighted least squares estimators. The loss of efficiency is calculated as:

eff(β̂_OLS) = precision(β̂_OLS) / precision(β̂_GLS) = {1/var(β̂_OLS)} / {1/var(β̂_GLS)}.   (1)

If this ratio is close to 1, then use of β̂_OLS is fine. In general the ratio is less than one, which means that there is a loss of efficiency from using an incorrect independence assumption in estimating the mean regression parameter.

In practice the covariance parameter ω is not known. Typically ML/REML estimation is used to obtain an estimate of ω (REML = restricted maximum likelihood, to be discussed soon). The ML/REML estimator ω̂ does not have a simple closed-form expression; numerical algorithms are used to obtain ω̂. Once such an estimate is obtained, Σ̂_i = Σ_i(ω̂) is substituted in the expression of the GLS estimator β̂. When the sample size n is large, the resulting estimator β̂ will have (approximately) all the same properties as if ω, and thus Σ_i, were known.

In R we fit marginal models (or population average models) using the function geeglm from the package geepack. The syntax is similar to glm. The correlation structure can be specified either using pre-specified models - independence, exchangeable, ar1, unstructured, userdefined (specified by the option corstr) - or using a user-defined correlation model (specified by the option zcor).
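The efficiency ratio in (1) can be computed exactly when Σ is known, since both sampling covariances have closed forms. A sketch with a hypothetical AR(1)-type covariance (the sandwich formula gives the true variance of OLS under dependence):

```python
import numpy as np

# var(beta_OLS) = (X'X)^{-1} X' Sigma X (X'X)^{-1}   (sandwich form)
# var(beta_GLS) = (X' Sigma^{-1} X)^{-1}
t = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), t])
rho = 0.6
Sigma = rho ** np.abs(t[:, None] - t[None, :])   # hypothetical AR(1) correlation

XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv
var_gls = np.linalg.inv(X.T @ np.linalg.inv(Sigma) @ X)

# Efficiency of OLS for the slope, as in (1): always <= 1 by Gauss-Markov.
eff_slope = var_gls[1, 1] / var_ols[1, 1]
```

Repeating the computation with a compound-symmetric Σ is instructive: there OLS loses no efficiency for this design, while serially decaying correlation does penalize OLS.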

Case study: the Vlagtwedde-Vlaardingen Study

Study description: This is an epidemiologic study conducted in two different areas of the Netherlands - the rural area of Vlagtwedde (N-E) and the urban, industrial area of Vlaardingen (S-W). The residents were followed over time to obtain information on the prevalence of, and risk factors for, chronic obstructive lung disease. This dataset is based on the sample of men and women from the rural area of Vlagtwedde. The sample, initially aged 15-44, participated in follow-up surveys approximately every 3 years for up to 21 years. At each survey, information on respiratory symptoms and smoking status was collected by questionnaire and spirometry was performed. Pulmonary function was determined by spirometry, and a measure of forced expiratory volume (FEV) was obtained every three years for the first 15 years of the study, and also at year 19. The dataset is comprised of a sub-sample of 133 residents aged 36 or older at their entry into the study and whose smoking status did not change over the 19 years of follow-up. Each study participant was either a current or former smoker. Current smoking was defined as smoking at least one cigarette per day. In this dataset FEV was not recorded for every subject at each of the planned measurement occasions; the number of repeated measurements of FEV on each subject varied from 1 to 7.

Question of interest: How does pulmonary function change over time? Is the change different for current smokers than for former smokers? Use various visualization tools to assess the mean behavior over time and gain insight into the dependence over time. Write down a parametric model for both the mean and the covariance. Using a normal model assumption, estimate the model parameters.

Parameter estimation: Restricted Maximum Likelihood (REML)

Recall the setting: Observed data are {Y_ij : j = 1, ..., m_i; X_ij}, where the Y_ij are the responses, Y_i denotes the vector of responses for unit i, and X_ij is the k-dimensional vector of covariates. Assume

Y_ij = X_ij^T β + ε_ij, ε_i ~ N(0, Σ_i),

for Σ_i = Σ_i(ω), where ω is an unknown vector of parameters. Recall the log-likelihood function l(β, ω) = log L(β, ω):

l(β, ω) = −(1/2) Σ_{i=1}^n m_i log(2π) − (1/2) Σ_{i=1}^n log|Σ_i(ω)| − (1/2) Σ_{i=1}^n (Y_i − X_i β)^T Σ_i(ω)^{−1} (Y_i − X_i β).

As stated earlier, the MLEs of β and ω are obtained by maximizing the above log-likelihood function. The maximization over ω requires numerical optimization; there is no analytical solution for the ML estimator ω̂ obtained in this way. Nevertheless we can still study the properties of the ML-based covariance estimator. It turns out that the ML-based estimator is biased.

Optional. To gain more insight, consider the simpler case where we have scalar data, m_i = 1 for all i. That is, the observed data are {Y_i; X_i}, where X_i is a k-dimensional vector of covariates, and assume the model

Y_i = X_i^T β + ε_i, ε_i ~ N(0, σ^2).

Determine the ML estimator of σ^2, and then discuss its bias. Hint: substitute m_i = 1 in the above log-likelihood function, with Σ_i(ω) = σ^2. The maximizer with respect to σ^2 is σ̂^2_ML = Σ_{i=1}^n (Y_i − X_i^T β̂)^2 / n.

Insight: Bias arises because the ML estimate σ̂^2_ML does not take into account that β is also estimated. It may be shown that similar problems arise more generally, when the covariance is more complex. The theory of restricted maximum likelihood (REML) was developed precisely to address this limitation. The REML likelihood is the likelihood function for the marginal distribution of the residuals. REML produces estimates of the variance/covariance parameters that are unbiased.
For example, in ordinary regression with independent errors,

σ̂^2_REML = (1/(n − k)) Σ_{i=1}^n (Y_i − X_i^T β̂)^2, and E[σ̂^2_REML] = σ^2 (σ̂^2_REML is unbiased for σ^2).

The idea behind REML was proposed by Bartlett (1937) and was further developed for the estimation of covariance components in unbalanced data by Patterson and Thompson (Biometrika, 1971). Harville (1974) gives a Bayesian interpretation. The distinction between the REML and

the ML becomes relevant when k is relatively large. A nice article on REML is by LaMotte, L.R. (Statistical Papers, 2007).

Intuition: REML is a generalization of the unbiased sample variance estimator. In a nutshell, the REML approach applies a ML function to suitably transformed data, which allows the estimation of the covariance parameters independently of the estimation of the mean parameters.

Intuition behind the procedure: Transform the data Y to Y* = A^T Y, where the N × (N − k) matrix A is chosen to make the distribution of Y* free of β. Here N = Σ_{i=1}^n m_i. For example, consider A such that {I − X(X^T X)^{−1} X^T} = A A^T and A^T A = I_{N−k}; then Y* has a multivariate normal distribution with mean zero and covariance equal to A^T Σ A, which is free of β. The covariance estimators are obtained by maximizing the likelihood of Y*. Remark that this likelihood function (which is called the REML function) is in fact the product of the original likelihood function evaluated at β̂ and an adjustment factor. The adjustment factor is |Σ_{i=1}^n X_i^T Σ_i(ω)^{−1} X_i|^{−1/2}. The REML log-likelihood function is:

l_REML(β̂, ω) = −(1/2) Σ_{i=1}^n m_i log(2π) − (1/2) Σ_{i=1}^n log|Σ_i(ω)| − (1/2) Σ_{i=1}^n (Y_i − X_i β̂)^T Σ_i(ω)^{−1} (Y_i − X_i β̂) − (1/2) log|Σ_{i=1}^n X_i^T Σ_i(ω)^{−1} X_i|.

The solution, again, is obtained by numerical optimization. The REML estimator of ω, ω̂_REML, is unbiased for ω. REML estimation is the default method used to estimate the variance component parameters in many algorithms.

Remark: Since the adjustment is a function solely of ω, the ML- and REML-based estimators of the mean regression parameter β coincide.
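The ML/REML contrast in the ordinary regression case can be sketched by simulation (hypothetical data, true σ^2 = 1): dividing the residual sum of squares by n gives a downward-biased estimate, while dividing by n − k is unbiased.

```python
import numpy as np

# Contrast sigma2_ML = RSS/n (biased) with sigma2_REML = RSS/(n - k)
# (unbiased) in ordinary regression with k estimated mean parameters.
rng = np.random.default_rng(4)
n, k = 30, 2
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta = np.array([1.0, 2.0])

def var_estimates(Y):
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = np.sum((Y - X @ beta_hat) ** 2)
    return rss / n, rss / (n - k)        # (ML, REML)

# Average each estimator over many simulated datasets with sigma2 = 1:
reps = [var_estimates(X @ beta + rng.normal(size=n)) for _ in range(2000)]
ml_mean = np.mean([r[0] for r in reps])
reml_mean = np.mean([r[1] for r in reps])
```

The ML average settles near (n − k)/n = 28/30 ≈ 0.93, the theoretical bias factor, while the REML average settles near 1.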

4 Selection of various covariance models For this section a maximal model for the mean is assumed, and the mean structure is thus FIXED. How do we select the most appropriate covariance model? The choices of models for the mean and the covariance are interdependent. Since confidence intervals and tests of hypotheses for the mean regression parameters depend critically upon correct specification of the covariance model, it is important to begin by specifying the covariance model. Nevertheless, the model for the covariance depends on the assumed model for the mean: the covariance models the dependence between the residuals {Y_ij - μ_ij(β)} and {Y_ij' - μ_ij'(β)} for j ≠ j'. Therefore the model for the covariance should be based on a maximal model for the mean. Intuitively, any systematic part that is left out (due to misspecification of the mean model) will lead to a certain amount of spurious covariance among the residuals and will induce spurious dependence of the covariance on the covariates. In longitudinal studies with a balanced design for the time points and a very small number of covariates (e.g., group and the time points at which the repeated outcome is measured) it is possible to fit a saturated model. A saturated model allows an arbitrary pattern for the mean response trajectory at every level of the covariates, and thus minimizes the impact of misspecification of the mean model. Nevertheless, determining the maximal model is, in general, difficult and should be done on subject-matter grounds. However, once a maximal model for the mean response has been fixed, the residual variances and covariances can be used to select an appropriate model for the covariance. Likelihood ratio test (LRT). One possible way to choose between two competing covariance models is to compare the maximized (REML) likelihoods for the corresponding covariance models within a hypothesis-testing framework.
Specifically, consider the case where we compare two covariance models that are nested within one another (two covariance models are nested when the reduced model is a special case of the full model). For example, the compound symmetric covariance model is a special case of the Toeplitz covariance model, obtained when ρ_1 = ρ_2 = ... = ρ_{m-1} (equal correlation at all lags). The null hypothesis is H_0: Σ has compound symmetric structure vs. H_1: Σ has Toeplitz structure. The LRT is obtained by comparing the maximized (REML) likelihood for the reduced covariance model (compound symmetric) with the maximized (REML) likelihood for the full covariance model (Toeplitz). Formally, the test statistic is LRT = 2 l̂_full - 2 l̂_red. Because of the unbiasedness properties of the estimators obtained using the REML likelihood, the REML likelihoods are typically used for this test.
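Given the two maximized REML log-likelihoods, the computation is short. The sketch below uses invented log-likelihood values and parameter counts purely for illustration; the degrees of freedom equal the difference in the number of covariance parameters:

```python
from scipy.stats import chi2

# Hypothetical maximized REML log-likelihoods (illustrative values only)
ll_red, q_red = -412.7, 2    # compound symmetry: sigma^2, rho
ll_full, q_full = -408.1, 6  # Toeplitz with m = 6 times: sigma^2, rho_1..rho_5

lrt = 2 * ll_full - 2 * ll_red   # = 9.2
df = q_full - q_red              # = 4
p_value = chi2.sf(lrt, df)       # about 0.056: borderline evidence

print(round(lrt, 1), df, round(p_value, 3))
```

A small p-value favors the richer Toeplitz structure; here the evidence against compound symmetry is borderline at the 5% level.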

Under the null hypothesis, the sampling distribution of the LRT is chi-square with degrees of freedom equal to the difference between the number of covariance parameters in the full and the reduced models. In general, the LRT is preferable for testing between competing nested models. One important limitation arises when the null hypothesis involves parameters that are on the boundary of their parameter space, for example when the null hypothesis comes down to H_0: σ² = 0 (a variance parameter equal to zero; recall σ² ≥ 0). This situation is known in the literature as testing a null hypothesis on the boundary of the parameter space. In this case, the usual asymptotics used to derive the null distribution of the LRT are no longer valid; in particular the null distribution of the LRT is no longer chi-square. We will discuss this further when we study linear mixed models. Akaike's Information Criterion (AIC). Often it is of interest to compare models that are not nested. One common method is Akaike's Information Criterion (AIC), which is also based on the maximized log-likelihood but includes a penalty for the complexity of the assumed covariance model: AIC = -2 l̂_model + 2c, where l̂_model is the maximized (REML) log-likelihood under the assumed model and c is the number of parameters included in this model. Among all the covariance models of interest, the one with the smallest AIC is preferred. The basic idea behind the AIC is to strike a balance between the fit to the data and the number of parameters involved in the covariance model (assuming the competing models share the same model for the mean trend). Schwarz's Bayesian Information Criterion (BIC). Another information criterion for choosing among competing covariance models is Schwarz's Bayesian Information Criterion (BIC), which also uses the maximized log-likelihood and penalizes the complexity of the model (though in a different way).
BIC is defined as BIC = -2 l̂_model + c log N, where c is the number of parameters included in the model of interest, and N = Σ_{i=1}^n m_i is the total number of observations in the data. Among all the covariance models of interest, the one with the smallest BIC is preferred. The main idea of the BIC comes from the Bayesian approach to model selection, which is based on the highest posterior model probability (or largest Bayes factor); BIC tries to approximate this Bayesian criterion. Because BIC penalizes the number of parameters in the model drastically, it tends to select the most parsimonious (simplest) model; because of this, BIC is not among the most popular approaches for selecting covariance models.
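Both criteria are simple functions of the maximized log-likelihood. A sketch with hypothetical fits (log-likelihood values and parameter counts invented for illustration; all three candidates share the same mean model):

```python
import math

N = 150  # total number of observations, N = sum_i m_i

# (name, maximized REML log-likelihood, number of covariance parameters c)
# -- hypothetical values for illustration
fits = [
    ("compound symmetry", -512.4, 2),
    ("AR(1)",             -506.9, 2),
    ("unstructured",      -497.0, 10),
]

for name, ll, c in fits:
    aic = -2 * ll + 2 * c
    bic = -2 * ll + math.log(N) * c
    print(f"{name:18s} AIC={aic:7.1f} BIC={bic:7.1f}")

best_aic = min(fits, key=lambda f: -2 * f[1] + 2 * f[2])[0]
best_bic = min(fits, key=lambda f: -2 * f[1] + math.log(N) * f[2])[0]
print(best_aic, best_bic)
```

With these numbers AIC selects the richer unstructured model while BIC, with its log N penalty, prefers the parsimonious AR(1), illustrating how the two criteria can disagree.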

Remark: AIC penalizes the number of model parameters less strongly than BIC. In small samples, the corrected AIC (AICc, which is AIC with a greater penalty for extra parameters) has been found to be more successful than AIC or BIC. Inferences about β using the model-based covariance rely heavily on the correct specification of the covariance model. Misspecification of the covariance model has negligible effects on the estimation of the mean regression parameters β, but it may have serious implications for the inference about these parameters (construction of confidence intervals and hypothesis tests). Fortunately, one can still make valid inferences even if there are concerns about the specification of the covariance model. In particular, valid inferences can be made using the so-called sandwich estimator of cov(β̂); the resulting standard errors are robust to misspecification of the covariance model. The sandwich estimator of cov(β̂) is more common in marginal models for discrete longitudinal observations, and we will study it in detail when we discuss that topic.
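For intuition, here is a sketch of a sandwich covariance computation in this linear setting (my own illustration, not a formula from the notes): the "bread" is B = Σ_i X_i^T W^{-1} X_i for a working covariance W, and the "meat" replaces W with observed residual cross-products, so the result remains valid even when W is misspecified. The simulated design and all names are hypothetical; the working covariance is deliberately wrong (independence) while the truth is exchangeable.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 200, 4, 2

beta_true = np.array([1.0, -0.5])
Sigma_true = 0.5 * np.eye(m) + 0.5           # exchangeable, correlation 0.5
Lchol = np.linalg.cholesky(Sigma_true)

X = np.ones((n, m, k))                       # column 0: intercept
X[:, :, 1] = rng.normal(size=(n, 1))         # column 1: time-invariant covariate
Y = X @ beta_true + rng.normal(size=(n, m)) @ Lchol.T

W_inv = np.eye(m)                            # working covariance: independence

# GLS estimate under the working covariance (here it reduces to pooled OLS)
bread = sum(X[i].T @ W_inv @ X[i] for i in range(n))
rhs = sum(X[i].T @ W_inv @ Y[i] for i in range(n))
beta_hat = np.linalg.solve(bread, rhs)

# "Meat": accumulate X_i^T W^{-1} r_i r_i^T W^{-1} X_i over units
meat = np.zeros((k, k))
for i in range(n):
    r = Y[i] - X[i] @ beta_hat               # residual vector for unit i
    meat += X[i].T @ W_inv @ np.outer(r, r) @ W_inv @ X[i]

B_inv = np.linalg.inv(bread)
cov_model = B_inv                            # trusts the working covariance
cov_sandwich = B_inv @ meat @ B_inv          # robust to its misspecification

print(np.sqrt(np.diag(cov_model)))           # understates the standard errors here
print(np.sqrt(np.diag(cov_sandwich)))
```

Because the covariates are constant within a unit and the true within-unit correlation is ignored, the model-based standard errors are too small by roughly the design-effect factor sqrt(1 + (m - 1)ρ); the sandwich standard errors recover the correct scale.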

5 Inference for the regression parameters In this section we discuss how to make inferences about β: specifically, we consider the construction of confidence intervals and tests of hypotheses. To do so we use the ML (or REML) estimator of β and its estimated covariance matrix: ĉov(β̂) = { Σ_{i=1}^n X_i^T Σ̂_i^{-1} X_i }^{-1}, with Σ̂_i = Σ_i(ω̂), where ω̂ is obtained either by ML or by REML. Confidence intervals: using this result we can construct confidence intervals for a single component of β, say β_l: 95% CI for β_l: β̂_l ± 1.96 sqrt(var̂(β̂_l)). Essentially we use the lth diagonal element of the estimated covariance ĉov(β̂) and the multivariate normal distribution of the estimator β̂. If the data are not normally distributed, this is an approximate confidence interval, valid when the number of units n is large. Hypothesis tests. Assume it is of interest to test H_0: β_l = 0 versus the alternative H_1: β_l ≠ 0. One can use the Wald test statistic: Z = β̂_l / sqrt(var̂(β̂_l)). More generally, it may be of interest to construct tests that certain linear combinations of the components of β are 0. For example, if β = (β_1, β_2, β_3)^T, it might be of interest to test a hypothesis of the form H_0: β_1 - β_2 = 0, and so on. Let L be a 1 × k vector of weights, and assume that we want to test the null hypothesis H_0: Lβ = 0 versus the alternative H_1: Lβ ≠ 0. Statistical inference about Lβ relies on the distribution of Lβ̂, which is N(Lβ, L cov(β̂) L^T), based on the distribution of β̂ when the data are multivariate normal. Here we discuss hypothesis testing, but the ideas can also be applied to the construction of confidence intervals. Wald test statistic for Lβ, where L is a 1 × k vector: Z = Lβ̂ / sqrt(L ĉov(β̂) L^T), Z ~ N(0, 1). Equivalently, W = Z² has a chi-square distribution with 1 degree of freedom, χ²_1: W = (Lβ̂) { L ĉov(β̂) L^T }^{-1} (Lβ̂)^T, W ~ χ²_1.
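As a concrete sketch (simulated data with a known covariance; all values illustrative), the interval and both Wald forms can be computed directly from β̂ and ĉov(β̂):

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(3)
n, m, k = 60, 4, 3
beta = np.array([2.0, -1.0, 0.5])
Sigma = 0.3 * np.eye(m) + 0.7          # known exchangeable covariance
Sigma_inv = np.linalg.inv(Sigma)
Lchol = np.linalg.cholesky(Sigma)

X = rng.normal(size=(n, m, k))
Y = X @ beta + rng.normal(size=(n, m)) @ Lchol.T

# GLS estimate and its model-based covariance
A = sum(X[i].T @ Sigma_inv @ X[i] for i in range(n))
b = sum(X[i].T @ Sigma_inv @ Y[i] for i in range(n))
beta_hat = np.linalg.solve(A, b)
cov_hat = np.linalg.inv(A)

# 95% CI for beta_1 (first component)
se = np.sqrt(cov_hat[0, 0])
ci = (beta_hat[0] - 1.96 * se, beta_hat[0] + 1.96 * se)

# Wald test of H0: beta_1 - beta_2 = 0 via the contrast L = (1, -1, 0)
L = np.array([[1.0, -1.0, 0.0]])
Z = (L @ beta_hat).item() / np.sqrt((L @ cov_hat @ L.T).item())
W = Z ** 2                              # W = Z^2 ~ chi2_1 under H0
p_value = chi2.sf(W, df=1)
p_two_sided = 2 * norm.sf(abs(Z))       # identical to p_value

print(ci, W, p_value)
```

The two p-values agree because P(χ²_1 > z²) equals the two-sided normal tail probability; here the true contrast is 3, so the test rejects decisively.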

The advantage of the latter form is that it readily generalizes to cases where L has more than one row, for instance when L is an r × k matrix. In that case, the null distribution of W is χ²_r, and p-values are calculated from this distribution. The function esticon in R can be used to estimate linear combinations of regression parameters, test them using Wald statistics, and construct confidence intervals. An alternative to the Wald test statistic is the likelihood ratio test (LRT) statistic. The LRT for testing H_0: Lβ = 0 versus the alternative H_1: Lβ ≠ 0 is obtained by comparing the maximized likelihoods of two models: one model that incorporates the constraint Lβ = 0 specified by the null hypothesis (the reduced model), and one without the constraint (the full model). Note that the two models are nested in the sense that the reduced model is a special case of the full model: when the constraint holds, the full model reduces to the reduced model. The maximized log-likelihood for the full model is denoted by l̂_full and that for the reduced model by l̂_red. The LRT is obtained as LRT = 2(l̂_full - l̂_red); when the null hypothesis is true, the distribution of the LRT is χ² with degrees of freedom equal to the difference between the number of parameters in the full model and the number of parameters in the reduced model. Remark 1: Wald tests are commonly employed when testing mean regression parameters. When testing between competing covariance models, Wald tests are NOT valid. Remark 2: When testing two nested models that differ in their mean regression parameters, do not use REML (the REML adjustment depends on the structure of the systematic mean part, so the two REML likelihoods are not comparable). Do use REML when testing between two nested covariance models.

6 Final Remarks: main features and limitations When confronted with a real data application, an important step is the selection of an appropriate covariance model. Such a covariance structure incorporates both sources of variation (among units and within units). Useful ideas in the selection of the covariance model are:

- Informal graphical/numerical summaries and other techniques may be applied to a preliminary fit based on OLS estimates of the regression parameters.
- AIC and BIC criteria may be used, but a dose of subjectivity is also involved.
- If no model is truly appropriate, that is alright too; the models used in the next chapter offer an alternative approach.

Important features of the regression approach:

- The regression approach gives the analyst much flexibility in representing the form of the mean of the response. The mean can be modeled smoothly over time; the rate of change is the slope of this function. Modeling the mean in this fashion also allows estimation of the mean at any time, not just at the observed times.
- The approach does not require a balanced design for the time points: the vectors of observations may have different lengths m_i. One important caveat: the imbalance may be due to missingness when data were intended to be collected at common time points. If the missingness is completely unrelated to the issues under study (e.g., a sample from a certain subject at a certain time is mistakenly destroyed or misplaced), then the analysis is fine. However, if the missingness is related to the issues under study (e.g., two treatments are compared and a subject in one arm does not show up because they are too ill), then the missingness may contain information about the treatment, and this type of analysis would not be valid.
- The approach allows the analyst to choose an appropriate model for the covariance out of many options.
- Multiple groups/populations can be accounted for by appropriately manipulating the design matrix.
Recall the explicit parameterization and the difference parameterization. Some limitations of this methodology:

- The modeling of the covariance matrix aggregates the two sources of variation and does not allow the analyst to understand the two sources separately.
- The main focus is modeling the mean trajectories over time; reconstruction of the individual trajectories is not considered. Characterizing the subject-specific trajectories may be of interest, but the current framework does not allow such a study.


More information

Outline. Linear OLS Models vs: Linear Marginal Models Linear Conditional Models. Random Intercepts Random Intercepts & Slopes

Outline. Linear OLS Models vs: Linear Marginal Models Linear Conditional Models. Random Intercepts Random Intercepts & Slopes Lecture 2.1 Basic Linear LDA 1 Outline Linear OLS Models vs: Linear Marginal Models Linear Conditional Models Random Intercepts Random Intercepts & Slopes Cond l & Marginal Connections Empirical Bayes

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models

Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models EPSY 905: Multivariate Analysis Spring 2016 Lecture #12 April 20, 2016 EPSY 905: RM ANOVA, MANOVA, and Mixed Models

More information

Heteroskedasticity. Part VII. Heteroskedasticity

Heteroskedasticity. Part VII. Heteroskedasticity Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least

More information

Generalized, Linear, and Mixed Models

Generalized, Linear, and Mixed Models Generalized, Linear, and Mixed Models CHARLES E. McCULLOCH SHAYLER.SEARLE Departments of Statistical Science and Biometrics Cornell University A WILEY-INTERSCIENCE PUBLICATION JOHN WILEY & SONS, INC. New

More information

Spatial Regression. 3. Review - OLS and 2SLS. Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Spatial Regression. 3. Review - OLS and 2SLS. Luc Anselin.   Copyright 2017 by Luc Anselin, All Rights Reserved Spatial Regression 3. Review - OLS and 2SLS Luc Anselin http://spatial.uchicago.edu OLS estimation (recap) non-spatial regression diagnostics endogeneity - IV and 2SLS OLS Estimation (recap) Linear Regression

More information

Some properties of Likelihood Ratio Tests in Linear Mixed Models

Some properties of Likelihood Ratio Tests in Linear Mixed Models Some properties of Likelihood Ratio Tests in Linear Mixed Models Ciprian M. Crainiceanu David Ruppert Timothy J. Vogelsang September 19, 2003 Abstract We calculate the finite sample probability mass-at-zero

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 217, Chicago, Illinois Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Circle the single best answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 6, 2017 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 32 multiple choice

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

Using Estimating Equations for Spatially Correlated A

Using Estimating Equations for Spatially Correlated A Using Estimating Equations for Spatially Correlated Areal Data December 8, 2009 Introduction GEEs Spatial Estimating Equations Implementation Simulation Conclusion Typical Problem Assess the relationship

More information

Sample Size and Power Considerations for Longitudinal Studies

Sample Size and Power Considerations for Longitudinal Studies Sample Size and Power Considerations for Longitudinal Studies Outline Quantities required to determine the sample size in longitudinal studies Review of type I error, type II error, and power For continuous

More information

Chapter 3: Regression Methods for Trends

Chapter 3: Regression Methods for Trends Chapter 3: Regression Methods for Trends Time series exhibiting trends over time have a mean function that is some simple function (not necessarily constant) of time. The example random walk graph from

More information

Covariance Models (*) X i : (n i p) design matrix for fixed effects β : (p 1) regression coefficient for fixed effects

Covariance Models (*) X i : (n i p) design matrix for fixed effects β : (p 1) regression coefficient for fixed effects Covariance Models (*) Mixed Models Laird & Ware (1982) Y i = X i β + Z i b i + e i Y i : (n i 1) response vector X i : (n i p) design matrix for fixed effects β : (p 1) regression coefficient for fixed

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor

Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor Y. Wang M. J. Daniels wang.yanpin@scrippshealth.org mjdaniels@austin.utexas.edu

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Lecture 5: ANOVA and Correlation

Lecture 5: ANOVA and Correlation Lecture 5: ANOVA and Correlation Ani Manichaikul amanicha@jhsph.edu 23 April 2007 1 / 62 Comparing Multiple Groups Continous data: comparing means Analysis of variance Binary data: comparing proportions

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

10. Alternative case influence statistics

10. Alternative case influence statistics 10. Alternative case influence statistics a. Alternative to D i : dffits i (and others) b. Alternative to studres i : externally-studentized residual c. Suggestion: use whatever is convenient with the

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

POWER ANALYSIS TO DETERMINE THE IMPORTANCE OF COVARIANCE STRUCTURE CHOICE IN MIXED MODEL REPEATED MEASURES ANOVA

POWER ANALYSIS TO DETERMINE THE IMPORTANCE OF COVARIANCE STRUCTURE CHOICE IN MIXED MODEL REPEATED MEASURES ANOVA POWER ANALYSIS TO DETERMINE THE IMPORTANCE OF COVARIANCE STRUCTURE CHOICE IN MIXED MODEL REPEATED MEASURES ANOVA A Thesis Submitted to the Graduate Faculty of the North Dakota State University of Agriculture

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

General Linear Model: Statistical Inference

General Linear Model: Statistical Inference Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter 4), least

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

Analysis of Longitudinal Data: Comparison Between PROC GLM and PROC MIXED. Maribeth Johnson Medical College of Georgia Augusta, GA

Analysis of Longitudinal Data: Comparison Between PROC GLM and PROC MIXED. Maribeth Johnson Medical College of Georgia Augusta, GA Analysis of Longitudinal Data: Comparison Between PROC GLM and PROC MIXED Maribeth Johnson Medical College of Georgia Augusta, GA Overview Introduction to longitudinal data Describe the data for examples

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

WU Weiterbildung. Linear Mixed Models

WU Weiterbildung. Linear Mixed Models Linear Mixed Effects Models WU Weiterbildung SLIDE 1 Outline 1 Estimation: ML vs. REML 2 Special Models On Two Levels Mixed ANOVA Or Random ANOVA Random Intercept Model Random Coefficients Model Intercept-and-Slopes-as-Outcomes

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information