Sparse Matrix Methods and Mixed-effects Models


Douglas Bates, University of Wisconsin-Madison

Outline

- Theory and Practice of Statistics
- Challenges from Data Characteristics
- Linear Mixed Model Formulation
- Evaluating the deviance
- Other Criteria and Models

Theory and Practice of Statistics

We celebrate the 50th anniversary of the founding of our department this summer. From its inception our department has had as its goal developing excellence in both the theory and the practice of statistics, and fostering the interaction of theory and practice. Involvement in the practice of statistics, even if only by taking the required course on statistical consulting, provides a grounding in real problems addressed by real clients with their inevitably messy real data. Knowledge of theory grounds the practice of statistics in a solid framework. It isn't enough to fit models and report estimates and p-values. We should also assess the validity of the assumptions in the model. You can't do this if you don't know what the model is.

Why Consider the Interplay of Theory and Practice?

Deriving properties of models without reference to data is sterile, because "all models are wrong; some models are useful" (G.E.P. Box). Ideally the model is derived from properties of the data. The opposite approach (posit a model, derive its properties, and then go looking for some data that follow such a model) is not likely to prove useful. Conversely, deriving parameter estimates without assessing, or sometimes even knowing, the model is a risky practice.

> fortune("provocatively")

To paraphrase provocatively, machine learning is statistics minus any checking of models and assumptions.
   -- Brian D. Ripley (about the difference between machine learning and statistics)
      useR! 2004, Vienna (May 2004)

What Positive Role Does Computing Play?

Computing is an integral part of essentially all applications of statistics. It gives us the ability to explore complex models applied to large data sets with complex structure. In terms of theory, computing is widely used in simulation studies. It also provides (or should provide) a grounding for proposed methods. We have powerful computers, but not infinitely powerful and not with an infinite amount of storage.

What Negative Role Does Computing Play?

If you define the extent of statistical analysis by the capabilities of available software, you tend to shoehorn the data into a prefabricated model. The noted linguist Benjamin Lee Whorf observed, "Language shapes the way we think, and determines what we can think about." This is true not only for natural languages but also for computing languages. For example, not long ago many people believed that applied statistics is the use of SAS. More to the point for this discussion, many people believe that linear mixed models must be hierarchical models, even when the data are not hierarchical.

Combining Practice, Theory and Computing

To the extent possible, the methodology should encompass the characteristics of data encountered in real-world situations (large data sets with complex structures, such as non-nested random effects; unbalanced data is almost a given). Methodology should have a firm theoretical basis. Before you fit a model you should be able to write it down. Before you write code to produce some estimates, you should be able to describe the criterion optimized by those estimates, even if you are optimizing it indirectly. Theory should teach us that there isn't "the" formula or "the" representation. Often there are many representations of the same problem; take advantage of this. Computational methods should be reliable, robust, and based on stable calculations. They should handle the edge cases. As Kernighan and Plauger state, your code should be able to "do nothing gracefully" when appropriate.

Data challenges

As already stated, we find more and more that we encounter large, unbalanced data sets with complex structure. (Compare, for example, the data for M.S. Exam problems from 20 years ago, 10 years ago, and today.) For mixed-effects models one current challenge is the analysis of annual test scores on students in grades 3 to 8 mandated by the No Child Left Behind Act. These are longitudinal data, grouped by student, school and district. The act also mandates relating the scores to demographic variables (sex, race/ethnicity, socioeconomic status). Many states currently do not record information on teachers; that will change because of the Race to the Top program. Models with random effects associated with student, teacher, school and district will not be hierarchical (also called "multilevel"). The random effects will be partially crossed (i.e. neither nested nor fully crossed).
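The distinction between nested, fully crossed and partially crossed grouping factors can be checked mechanically. A minimal sketch (in Python rather than R, with an invented set of student/teacher pairs) illustrates the "partially crossed" case described above:

```python
# Toy illustration (hypothetical data) of nested versus partially crossed
# grouping factors. Nested: each student appears under exactly one teacher.
# Fully crossed: every student appears under every teacher. Partially
# crossed: neither holds, as with students changing teachers across years.
from collections import defaultdict

# (student, teacher) pairs over several years -- invented for illustration
obs = [("s1", "tA"), ("s1", "tB"), ("s2", "tA"),
       ("s2", "tC"), ("s3", "tB"), ("s3", "tC")]

teachers_per_student = defaultdict(set)
for student, teacher in obs:
    teachers_per_student[student].add(teacher)

# Nested would mean every student has exactly one teacher.
nested = all(len(ts) == 1 for ts in teachers_per_student.values())

# Fully crossed would mean every student meets every teacher.
all_teachers = {t for _, t in obs}
fully_crossed = all(ts == all_teachers for ts in teachers_per_student.values())

print(nested, fully_crossed)  # False False: the factors are partially crossed
```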

The Wisconsin Knowledge and Concepts Exam (WKCE)

> str(wkce)
'data.frame': ... obs. of 14 variables:
 $ LDS : Factor w/ ... levels "10000","10001",..
 $ Yr  : Ord.factor w/ 4 levels ...
 $ Gr  : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..
 $ Rss : int ...
 $ Mss : int ...
 $ sch : Factor w/ 2411 levels "6","7","16","56",..
 $ dist: Factor w/ 445 levels "7","14","63",..
 $ Sx  : Factor w/ 2 levels "F","M"
 $ Ra  : Factor w/ 5 levels "W","B","H","A",..
 $ Dis : Factor w/ 2 levels "N","Y"
 $ ELP : Factor w/ 2 levels "N","Y"
 $ EC  : Factor w/ 2 levels "N","Y"
> xtabs(~ Yr + Gr, WKCE)

(table of counts by Yr and Gr lost in transcription)

How Do They Get the Scaled Scores?

The scores on a given test (about 50 questions) are not a simple tally of the number of correct answers. They are calculated according to "pattern scoring" determined by an Item Response Theory (IRT) model. The student abilities are assumed to be a sample from a distribution, and the question difficulties are also a sample. A reasonable model would be a generalized linear mixed model (binary response) with (crossed) random effects for student and for question. IRT models often go further and incorporate discrimination parameters for the questions, and even a per-question baseline probability of a correct response (i.e. a guessing parameter). Building this into a mixed model would result in a generalized nonlinear mixed model with crossed random effects. In practice, IRT models are fit by ad hoc methodologies with a poor theoretical basis, producing really weird results.
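The item-response probabilities described above can be sketched directly. This is a hypothetical illustration, not the scoring code used in practice: the Rasch (1PL) model uses only ability and difficulty, while the 3PL variant adds the discrimination and guessing parameters mentioned in the text.

```python
import math

def rasch_prob(ability, difficulty):
    """P(correct) under the Rasch (1PL) model: logistic(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def three_pl_prob(ability, difficulty, discrimination, guessing):
    """3PL model: a discrimination slope plus a guessing floor on P(correct)."""
    p = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * p

# A student whose ability equals the item difficulty answers correctly
# with probability 1/2 under the Rasch model.
print(rasch_prob(0.0, 0.0))  # 0.5
```

In the GLMM formulation of the slide, ability and difficulty would be crossed random effects for student and question; the 3PL extras are what push the model into the generalized nonlinear class.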

Formulating Linear Mixed-effects Models (LMMs)

We see linear mixed-effects models specified in many different ways. Usually we first see the "subscript fest" formulation

    y_ijklmn... = μ + α_i + β_j + b_k + ...

which is not generalizable and which confounds many aspects of the model. Later we may see a vector representation with model matrices

    y = Xβ + Zb + ε,   ε ~ N(0, σ²I),   b ~ N(0, Σ),   b ⊥ ε

but this too confounds aspects of the model. Writing a linear model as Xβ + ε separates the mean (the linear predictor) from the variance, ε ~ N(0, σ²I). In a mixed model, is Zb part of the mean or part of the variance? In a generalized linear model or generalized linear mixed model (GLMM) you can't separate the mean and the variance.
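For a single grouping factor with a simple scalar random effect, the model matrix Z in the vector representation is just an indicator matrix of the factor levels. A small sketch in Python/NumPy (the construction is language-agnostic; lmer builds these matrices internally from the model formula):

```python
import numpy as np

def indicator_matrix(factor):
    """Random-effects model matrix Z for a simple scalar random effect:
    one indicator column per level of the grouping factor."""
    levels = sorted(set(factor))
    col = {lev: j for j, lev in enumerate(levels)}
    Z = np.zeros((len(factor), len(levels)))
    for i, lev in enumerate(factor):
        Z[i, col[lev]] = 1.0
    return Z

# six observations grouped into three levels (invented grouping)
Z = indicator_matrix(["a", "a", "b", "b", "c", "c"])
print(Z.shape)        # (6, 3)
print(Z.sum(axis=1))  # each row has exactly one 1
```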

My Current Formulation of LMMs

It helps to follow the advice we give introductory students and distinguish between a random variable (Y or B) and its value (y or b). We observe y but not b. The probability model specifies the conditional distribution

    (Y | B = b) ~ N(Zb + Xβ, σ²I_n)

and the unconditional distribution

    B ~ N(0, Σ),

where X and Z are n × p and n × q model matrices, σ ≥ 0, and Σ is a positive semidefinite q × q variance-covariance matrix determined by the variance-component parameters. Note the emphasis on semidefinite: σ = 0 does not occur in practice, but singular Σ does. That is, estimates of zero for variance components can and do occur. During the course of numerical optimization it is common to want to evaluate likelihoods on the boundary of the parameter space.
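Reading this model generatively, one draws b from the unconditional distribution of B and then y from the conditional distribution of (Y | B = b). A toy simulation in Python/NumPy, with invented dimensions and variance components:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, q = 12, 2, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x p fixed-effects matrix
Z = np.kron(np.eye(q), np.ones((n // q, 1)))           # n x q indicators: 4 obs per group
beta = np.array([1.0, 2.0])                            # invented fixed effects
sigma, Sigma = 0.5, 0.8 * np.eye(q)                    # invented variance components

b = rng.multivariate_normal(np.zeros(q), Sigma)        # B ~ N(0, Sigma)
y = rng.normal(Z @ b + X @ beta, sigma)                # (Y | B = b) ~ N(Zb + Xbeta, sigma^2 I_n)
print(y.shape)  # (12,)
```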

Expressing Σ

Model specification in lmer produces a parameterization that generates Σ through a general q × q matrix Λ_θ as

    Σ = σ² Λ_θ Λ_θᵀ,

where θ is the variance-component parameter. For models with simple scalar random effects the elements of θ are ratios of standard deviations, θ_i = σ_i/σ, subject to θ_i ≥ 0, and Λ_θ is block-diagonal with blocks of the form θ_i I. Let U ~ N(0, σ²I_q) be the "spherical" random effects for which B = Λ_θ U, so that

    (Y | U = u) ~ N(Xβ + ZΛ_θu, σ²I_n).

It is easy to verify that B has the desired properties. Note that the transformation from U to B is well-defined even when Λ_θ is singular.
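The claimed properties of B = Λ_θU can be checked by simulation, including the boundary case where one component of θ is zero and Λ_θ is therefore singular. A sketch with invented values:

```python
import numpy as np

sigma = 1.5
theta = np.array([0.0, 2.0])               # first component on the boundary
Lam = np.kron(np.diag(theta), np.eye(2))   # block-diagonal, blocks theta_i * I

rng = np.random.default_rng(0)
U = sigma * rng.standard_normal((4, 100_000))  # spherical U ~ N(0, sigma^2 I_q)
B = Lam @ U                                    # B = Lambda_theta U: fine even for singular Lam

Sigma = sigma**2 * Lam @ Lam.T                 # the target covariance of B
print(np.allclose(np.cov(B), Sigma, atol=0.3))  # True, up to Monte Carlo error
```

The zero rows of Λ_θ simply force the corresponding components of B to zero, which is exactly the "estimated variance component of zero" situation the text describes.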

Densities of U and Y

The joint density f_{U,Y}(u, y) = f_U(u) f_{Y|U}(y | u), expressed on the deviance scale (negative twice the log density), is

    (n + q) log(2πσ²) + (||y − Xβ − ZΛ_θu||² + ||u||²) / σ².

Evaluated at the observed y, this expression (on the density scale) provides the unnormalized conditional density, h(u | y), of (U | Y = y). The normalizing constant, ∫ h(u | y) du, is the likelihood of the parameters θ, β and σ given y.
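The deviance-scale expression translates directly into code. A sketch with dense toy matrices and invented values; at u = 0 with a zero residual it reduces to the constant term (n + q) log(2πσ²):

```python
import numpy as np

def joint_deviance(u, y, X, Z, Lam, beta, sigma):
    """-2 log of the joint density f_{U,Y}(u, y) for the Gaussian LMM."""
    n, q = len(y), len(u)
    resid = y - X @ beta - Z @ (Lam @ u)
    return (n + q) * np.log(2 * np.pi * sigma**2) + (resid @ resid + u @ u) / sigma**2

# Invented toy values: with u = 0 and y equal to its conditional mean,
# only the constant (n + q) log(2 pi sigma^2) remains.
n, q, sigma = 4, 2, 1.0
X, beta = np.ones((n, 1)), np.array([2.0])
Z, Lam = np.eye(n)[:, :q], np.eye(q)
y = X @ beta
print(joint_deviance(np.zeros(q), y, X, Z, Lam, beta, sigma))
```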

53 Evaluating the likelihood
For a LMM there are several different expressions for the likelihood. One method that generalizes to other forms of mixed models is first to determine the conditional mode
    ũ = arg max_u f_{U|Y}(u | y) = arg max_u h(u | y) = arg min_u ‖y − Xβ − Z Λ_θ u‖² + ‖u‖²,
and to expand h at ũ. Because we are dealing with multivariate Gaussians, the log conditional density is exactly quadratic in u. For other types of models it is only approximately quadratic.

55 Solving the Penalized Least Squares Problem
The point of all this development is that we can solve the penalized least squares problem, which can be rewritten as
    ũ = arg min_u ‖ [y − Xβ; 0] − [Z Λ_θ; I_q] u ‖²,
with solution satisfying
    (Λ_θᵀ Zᵀ Z Λ_θ + I_q) ũ = Λ_θᵀ Zᵀ (y − Xβ).
We do this by forming the sparse Cholesky factor, a sparse lower-triangular matrix L_θ satisfying
    L_θ L_θᵀ = Λ_θᵀ Zᵀ Z Λ_θ + I_q.
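A dense numpy stand-in for this computation (in lme4 the factor L_θ is sparse; here a small dense Cholesky illustrates the same pair of triangular solves, with all matrices simulated):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, p = 12, 4, 2                       # hypothetical dimensions
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
Lam = np.diag([0.5, 0.5, 1.2, 1.2])      # hypothetical Lambda_theta
beta = np.array([2.0, -1.0])
y = rng.normal(size=n)

A = Lam.T @ Z.T @ Z @ Lam + np.eye(q)    # Lambda' Z'Z Lambda + I_q
rhs = Lam.T @ Z.T @ (y - X @ beta)
L = np.linalg.cholesky(A)                # L L' = A (dense stand-in for L_theta)

# Two solves with the triangular factor give the conditional mode u~
w = np.linalg.solve(L, rhs)              # forward solve: L w = rhs
u_tilde = np.linalg.solve(L.T, w)        # back solve:    L' u = w
```

`np.linalg.solve` is used as a general solver for brevity; with a genuinely sparse L_θ one would use sparse triangular solves instead.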

57 Practical Aspects of the Sparse Cholesky Decomposition
When working with very large matrices one must be careful about how certain computations are performed. Sparse matrix methods usually proceed in two phases: a symbolic phase, in which the positions of the non-zeros in the factor are determined, and a numeric phase, in which the numerical values of those non-zeros are calculated. During optimization of the likelihood we must evaluate L_θ for many different values of θ; the symbolic phase does not need to be repeated, only the numeric phase. Part of the symbolic phase is determining a fill-reducing permutation, represented by a q × q permutation matrix P. The actual decomposition used is
    L_θ L_θᵀ = P (Λ_θᵀ Zᵀ Z Λ_θ + I_q) Pᵀ.
Incorporating P does not affect the theory. It can profoundly affect time and storage requirements.
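The effect of the permutation on fill-in can be seen even in a tiny dense example. The classic case is an "arrow" matrix: factoring the hub row first produces a completely dense factor, while permuting it last preserves sparsity. This is a numpy sketch with an invented matrix; real implementations choose P with orderings such as approximate minimum degree during the symbolic phase:

```python
import numpy as np

# "Arrow" matrix: one hub variable connected to all others.  Factoring
# with the hub FIRST fills the factor in completely; permuting the hub
# LAST keeps it sparse -- the point of a fill-reducing permutation P.
q = 8
A = 4.0 * np.eye(q)
A[0, 1:] = A[1:, 0] = 1.0             # hub in position 0

def nnz_cholesky(M, tol=1e-12):
    """Count non-zeros in the lower Cholesky factor of M."""
    L = np.linalg.cholesky(M)
    return int(np.sum(np.abs(L) > tol))

perm = np.r_[1:q, 0]                  # move the hub to the last position
P = np.eye(q)[perm]                   # permutation matrix
nnz_bad = nnz_cholesky(A)             # dense triangle: q(q+1)/2 = 36
nnz_good = nnz_cholesky(P @ A @ P.T)  # arrow preserved: 2q - 1 = 15
```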

61 Expansion at ũ
The penalized residual sum of squares (PRSS) can now be written
    ‖y − Xβ − Z Λ_θ u‖² + ‖u‖² = r²_{θ,β} + ‖L_θᵀ (u − ũ)‖²,
where r²_{θ,β} is the PRSS at ũ. A simple change of variable can then be used to evaluate the likelihood. On the deviance scale it is
    −2ℓ(θ, β, σ) = n log(2πσ²) + log(|L_θ|²) + r²_{θ,β}/σ².
Notice that β affects only r²_{θ,β}. If we minimize the PRSS simultaneously w.r.t. u and β, producing r²_θ, and plug in the conditional estimate of σ, we obtain the profiled deviance
    d(θ | y) = log(|L_θ|²) + n [1 + log(2π r²_θ / n)].
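These pieces can be verified against the marginal distribution of Y directly, since Y ~ N(Xβ, σ²(ZΛ_θΛ_θᵀZᵀ + I_n)). A numpy check with simulated arrays; the O(n³) marginal route is for verification only, the Cholesky route is the one that scales:

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, p = 15, 4, 2                    # hypothetical dimensions
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
Lam = np.diag([0.9, 0.9, 0.4, 0.4])   # hypothetical Lambda_theta
beta = np.array([1.0, 0.5])
sigma = 1.1
y = rng.normal(size=n)

# Deviance via the Cholesky route of the slides
A = Lam.T @ Z.T @ Z @ Lam + np.eye(q)
L = np.linalg.cholesky(A)
cu = np.linalg.solve(L, Lam.T @ Z.T @ (y - X @ beta))
u_tilde = np.linalg.solve(L.T, cu)
r2 = np.sum((y - X @ beta - Z @ Lam @ u_tilde) ** 2) + u_tilde @ u_tilde
dev = n * np.log(2 * np.pi * sigma**2) + 2 * np.sum(np.log(np.diag(L))) + r2 / sigma**2

# Same quantity from the marginal Y ~ N(X beta, sigma^2 V),
# V = Z Lam Lam' Z' + I -- an O(n^3) check, not the production route
V = Z @ Lam @ Lam.T @ Z.T + np.eye(n)
resid = y - X @ beta
dev_marg = (n * np.log(2 * np.pi * sigma**2) + np.linalg.slogdet(V)[1]
            + resid @ np.linalg.solve(V, resid) / sigma**2)
```

The agreement rests on two identities: det(I_q + Λᵀ Zᵀ Z Λ) = det(I_n + Z Λ Λᵀ Zᵀ) (Sylvester) and r²_{θ,β} = residᵀ V⁻¹ resid (Woodbury).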

64 More practical aspects
Because L_θ is triangular, its determinant is the product of its diagonal elements. When minimizing the PRSS w.r.t. both u and β we write the system to be solved as
    [ P (Λ_θᵀ Zᵀ Z Λ_θ + I) Pᵀ   P Λ_θᵀ Zᵀ X ] [ ũ    ]   [ P Λ_θᵀ Zᵀ y ]
    [ Xᵀ Z Λ_θ Pᵀ                Xᵀ X        ] [ β̂_θ ] = [ Xᵀ y        ]
and calculate the (left) Cholesky factor as
    [ L_θ     0    ]
    [ R_ZXᵀ   R_Xᵀ ].
These are almost Henderson's mixed-model equations, but with two important differences: the system is stable as Λ_θ becomes singular, and we decompose the part depending on Z first.
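The "decompose the Z part first" strategy amounts to a block Cholesky factorization. A numpy sketch with P taken as the identity for clarity and all arrays simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, p = 20, 5, 3                      # hypothetical dimensions
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
Lam = 0.6 * np.eye(q)                   # hypothetical Lambda_theta
y = rng.normal(size=n)

# Blocks of the bordered system (P = I here)
A = Lam.T @ Z.T @ Z @ Lam + np.eye(q)   # upper-left
B = Lam.T @ Z.T @ X                     # upper-right
C = X.T @ X                             # lower-right

# Decompose the Z part first, then border with the X part
L = np.linalg.cholesky(A)                    # L L' = A
RZX = np.linalg.solve(L, B)                  # L RZX = B
RX = np.linalg.cholesky(C - RZX.T @ RZX).T   # RX' RX = C - RZX' RZX

# Forward and back substitution for u~ and beta-hat_theta
cu = np.linalg.solve(L, Lam.T @ Z.T @ y)
cb = np.linalg.solve(RX.T, X.T @ y - RZX.T @ cu)
beta_hat = np.linalg.solve(RX, cb)
u_tilde = np.linalg.solve(L.T, cu - RZX @ beta_hat)

# Check against solving the full bordered system in one shot
M = np.block([[A, B], [B.T, C]])
sol = np.linalg.solve(M, np.concatenate([Lam.T @ Z.T @ y, X.T @ y]))
```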

67 Outline
Theory and Practice of Statistics
Challenges from Data Characteristics
Linear Mixed Model Formulation
Evaluating the deviance
Other Criteria and Models

68 Other criteria and model formulations
The profiled REML criterion has a similar form,
    d_R(θ | y) = log(|L_θ|² |R_X|²) + (n − p) [1 + log(2π r²_θ / (n − p))].
For a generalized linear mixed model (GLMM) the conditional mode, ũ, is determined by penalized iteratively reweighted least squares (PIRLS); for a nonlinear mixed model (NLMM) it is determined by penalized nonlinear least squares (PNLS). The Laplace approximation to the deviance for a GLMM or NLMM has an expression similar to that for the LMM. The log(|L_θ|²) term then depends on β and ũ, but usually this dependence is weak. For some models (those with a single grouping factor, or with shallowly nested grouping factors for the random effects) a further refinement is available using adaptive Gauss-Hermite quadrature.
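A one-dimensional sketch of the adaptive Gauss-Hermite refinement: center the quadrature rule at the conditional mode and scale it by the curvature there. The integrand below is a hypothetical Poisson-GLMM-style log density, not lme4's internal objective:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def f(u):                        # hypothetical log h(u|y), Poisson-style
    return 3.0 * u - np.exp(u) - 0.5 * u**2

# Conditional mode and curvature by Newton's method
u = 0.0
for _ in range(50):
    grad = 3.0 - np.exp(u) - u
    hess = -np.exp(u) - 1.0
    u -= grad / hess
s = 1.0 / np.sqrt(-hess)         # Gaussian scale at the mode

# Laplace: replace exp(f) by its Gaussian approximation at the mode
laplace = np.sqrt(2.0 * np.pi) * s * np.exp(f(u))

# AGQ: substitute u = u~ + sqrt(2) s x so the e^{-x^2} weight cancels
x, w = hermgauss(10)             # 10-node Gauss-Hermite rule
agq = np.sqrt(2.0) * s * np.sum(w * np.exp(f(u + np.sqrt(2.0) * s * x) + x**2))

# Brute-force Riemann sum over a wide grid, for comparison
g = np.linspace(u - 12.0 * s, u + 12.0 * s, 200001)
truth = np.sum(np.exp(f(g))) * (g[1] - g[0])
```

With one node, AGQ reduces to the Laplace approximation; extra nodes correct for the non-Gaussian shape of the integrand.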

72 Benefits for Moderate-sized Data Sets
Being able to evaluate and optimize the deviance rapidly is beneficial for small and moderate-sized data sets too. We can profile the deviance w.r.t. individual parameters; the change in the profiled deviance from its optimum is a LRT statistic on 1 degree of freedom. For inferences based only on estimates and standard errors to be reliable, the profiled deviance should be quadratic, and its signed square root, written ζ, should be linear.
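The signed square root ζ can be illustrated with the simplest possible profile: the mean of a Gaussian sample with σ treated as known, where the profiled deviance is exactly quadratic and ζ is therefore exactly linear. A numpy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=5.0, scale=2.0, size=40)   # simulated sample
sigma = 2.0                                   # sigma treated as known
mu_hat = y.mean()                             # ML estimate of mu

def deviance(mu):
    """-2 log-likelihood of mu for the N(mu, sigma^2) sample."""
    return len(y) * np.log(2 * np.pi * sigma**2) + np.sum((y - mu)**2) / sigma**2

mus = np.linspace(mu_hat - 1.5, mu_hat + 1.5, 100)
d = np.array([deviance(m) for m in mus]) - deviance(mu_hat)
zeta = np.sign(mus - mu_hat) * np.sqrt(np.maximum(d, 0.0))
```

Here ζ is exactly the straight line √n (μ − μ̂)/σ; for a variance component in a mixed model the corresponding curve is typically bent, which is what the profile zeta plot reveals.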

73 Data from Davies (1947), collected by G.E.P. Box
[Figure: dotplot of the yield of dyestuff (grams of standard color) by Batch, batches ordered E, C, B, A, D, F.]
Linear mixed model fit by maximum likelihood
  Formula: Yield ~ 1 + (1 | Batch)
     Data: Dyestuff
  AIC  BIC  logLik  deviance
Random effects:
 Groups   Name        Variance  Std.Dev.
 Batch    (Intercept)
 Residual
Number of obs: 30, groups: Batch, 6

74 Profile zeta plots
[Figure: profile ζ plots for the parameters σ₁ (.sig01), log(σ) (.lsig) and the fixed effect (Intercept).]

75 Profile pairs plots
[Figure: scatter-plot matrix of profile pairs for (Intercept), .lsig and .sig01.]

76 Summary
Theory and practice are both important in statistical research and applications; they should not be regarded as either/or. Computing determines what is feasible in practice and what theory is relevant to practice. Fortunately, computing capabilities become greater with each passing year. Sparse matrix methods are a valuable tool in many situations treating large data sets, and it is worthwhile learning how to use them. Linear mixed models can be formulated in a general way that allows for highly effective methods of determining the ML or REML estimates of the parameters, and this formulation can be extended to GLMMs and NLMMs. For models fit to small to moderate data sets these efficient methods allow for more effective evaluation of the precision of parameter estimates. Some current practices (quoting only an estimate and standard error of a variance component) can be misleading when the profiled deviance is far from quadratic.


More information

WU Weiterbildung. Linear Mixed Models

WU Weiterbildung. Linear Mixed Models Linear Mixed Effects Models WU Weiterbildung SLIDE 1 Outline 1 Estimation: ML vs. REML 2 Special Models On Two Levels Mixed ANOVA Or Random ANOVA Random Intercept Model Random Coefficients Model Intercept-and-Slopes-as-Outcomes

More information

An R # Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM

An R # Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM An R Statistic for Fixed Effects in the Linear Mixed Model and Extension to the GLMM Lloyd J. Edwards, Ph.D. UNC-CH Department of Biostatistics email: Lloyd_Edwards@unc.edu Presented to the Department

More information

PQL Estimation Biases in Generalized Linear Mixed Models

PQL Estimation Biases in Generalized Linear Mixed Models PQL Estimation Biases in Generalized Linear Mixed Models Woncheol Jang Johan Lim March 18, 2006 Abstract The penalized quasi-likelihood (PQL) approach is the most common estimation procedure for the generalized

More information

Introduction to Within-Person Analysis and RM ANOVA

Introduction to Within-Person Analysis and RM ANOVA Introduction to Within-Person Analysis and RM ANOVA Today s Class: From between-person to within-person ANOVAs for longitudinal data Variance model comparisons using 2 LL CLP 944: Lecture 3 1 The Two Sides

More information

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 An Introduction to Multilevel Models PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 Today s Class Concepts in Longitudinal Modeling Between-Person vs. +Within-Person

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Some general observations.

Some general observations. Modeling and analyzing data from computer experiments. Some general observations. 1. For simplicity, I assume that all factors (inputs) x1, x2,, xd are quantitative. 2. Because the code always produces

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Topics: What happens to missing predictors Effects of time-invariant predictors Fixed vs. systematically varying vs. random effects Model building

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017

MLMED. User Guide. Nicholas J. Rockwood The Ohio State University Beta Version May, 2017 MLMED User Guide Nicholas J. Rockwood The Ohio State University rockwood.19@osu.edu Beta Version May, 2017 MLmed is a computational macro for SPSS that simplifies the fitting of multilevel mediation and

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

Approximate Likelihoods

Approximate Likelihoods Approximate Likelihoods Nancy Reid July 28, 2015 Why likelihood? makes probability modelling central l(θ; y) = log f (y; θ) emphasizes the inverse problem of reasoning y θ converts a prior probability

More information

Generalized linear models

Generalized linear models Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs) 36-309/749 Experimental Design for Behavioral and Social Sciences Dec 1, 2015 Lecture 11: Mixed Models (HLMs) Independent Errors Assumption An error is the deviation of an individual observed outcome (DV)

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Outline. Statistical inference for linear mixed models. One-way ANOVA in matrix-vector form

Outline. Statistical inference for linear mixed models. One-way ANOVA in matrix-vector form Outline Statistical inference for linear mixed models Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark general form of linear mixed models examples of analyses using linear mixed

More information

Regression tree-based diagnostics for linear multilevel models

Regression tree-based diagnostics for linear multilevel models Regression tree-based diagnostics for linear multilevel models Jeffrey S. Simonoff New York University May 11, 2011 Longitudinal and clustered data Panel or longitudinal data, in which we observe many

More information

A multivariate multilevel model for the analysis of TIMMS & PIRLS data

A multivariate multilevel model for the analysis of TIMMS & PIRLS data A multivariate multilevel model for the analysis of TIMMS & PIRLS data European Congress of Methodology July 23-25, 2014 - Utrecht Leonardo Grilli 1, Fulvia Pennoni 2, Carla Rampichini 1, Isabella Romeo

More information

Ron Heck, Fall Week 3: Notes Building a Two-Level Model

Ron Heck, Fall Week 3: Notes Building a Two-Level Model Ron Heck, Fall 2011 1 EDEP 768E: Seminar on Multilevel Modeling rev. 9/6/2011@11:27pm Week 3: Notes Building a Two-Level Model We will build a model to explain student math achievement using student-level

More information

Estimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1

Estimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1 Estimation and Model Selection in Mixed Effects Models Part I Adeline Samson 1 1 University Paris Descartes Summer school 2009 - Lipari, Italy These slides are based on Marc Lavielle s slides Outline 1

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the

More information

More Accurately Analyze Complex Relationships

More Accurately Analyze Complex Relationships SPSS Advanced Statistics 17.0 Specifications More Accurately Analyze Complex Relationships Make your analysis more accurate and reach more dependable conclusions with statistics designed to fit the inherent

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Generalized, Linear, and Mixed Models

Generalized, Linear, and Mixed Models Generalized, Linear, and Mixed Models CHARLES E. McCULLOCH SHAYLER.SEARLE Departments of Statistical Science and Biometrics Cornell University A WILEY-INTERSCIENCE PUBLICATION JOHN WILEY & SONS, INC. New

More information

Generalized linear mixed models for biologists

Generalized linear mixed models for biologists Generalized linear mixed models for biologists McMaster University 7 May 2009 Outline 1 2 Outline 1 2 Coral protection by symbionts 10 Number of predation events Number of blocks 8 6 4 2 2 2 1 0 2 0 2

More information

Lecture 9 STK3100/4100

Lecture 9 STK3100/4100 Lecture 9 STK3100/4100 27. October 2014 Plan for lecture: 1. Linear mixed models cont. Models accounting for time dependencies (Ch. 6.1) 2. Generalized linear mixed models (GLMM, Ch. 13.1-13.3) Examples

More information

MLR Model Selection. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project

MLR Model Selection. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project MLR Model Selection Author: Nicholas G Reich, Jeff Goldsmith This material is part of the statsteachr project Made available under the Creative Commons Attribution-ShareAlike 3.0 Unported License: http://creativecommons.org/licenses/by-sa/3.0/deed.en

More information

Describing Within-Person Change over Time

Describing Within-Person Change over Time Describing Within-Person Change over Time Topics: Multilevel modeling notation and terminology Fixed and random effects of linear time Predicted variances and covariances from random slopes Dependency

More information

Mixed effects models

Mixed effects models Mixed effects models The basic theory and application in R Mitchel van Loon Research Paper Business Analytics Mixed effects models The basic theory and application in R Author: Mitchel van Loon Research

More information

The performance of estimation methods for generalized linear mixed models

The performance of estimation methods for generalized linear mixed models University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2008 The performance of estimation methods for generalized linear

More information

Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement

Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement Exploiting TIMSS and PIRLS combined data: multivariate multilevel modelling of student achievement Second meeting of the FIRB 2012 project Mixture and latent variable models for causal-inference and analysis

More information

Review of CLDP 944: Multilevel Models for Longitudinal Data

Review of CLDP 944: Multilevel Models for Longitudinal Data Review of CLDP 944: Multilevel Models for Longitudinal Data Topics: Review of general MLM concepts and terminology Model comparisons and significance testing Fixed and random effects of time Significance

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Describing Within-Person Fluctuation over Time using Alternative Covariance Structures

Describing Within-Person Fluctuation over Time using Alternative Covariance Structures Describing Within-Person Fluctuation over Time using Alternative Covariance Structures Today s Class: The Big Picture ACS models using the R matrix only Introducing the G, Z, and V matrices ACS models

More information

36-720: The Rasch Model

36-720: The Rasch Model 36-720: The Rasch Model Brian Junker October 15, 2007 Multivariate Binary Response Data Rasch Model Rasch Marginal Likelihood as a GLMM Rasch Marginal Likelihood as a Log-Linear Model Example For more

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

36-463/663: Multilevel & Hierarchical Models

36-463/663: Multilevel & Hierarchical Models 36-463/663: Multilevel & Hierarchical Models (P)review: in-class midterm Brian Junker 132E Baker Hall brian@stat.cmu.edu 1 In-class midterm Closed book, closed notes, closed electronics (otherwise I have

More information

STAT 526 Advanced Statistical Methodology

STAT 526 Advanced Statistical Methodology STAT 526 Advanced Statistical Methodology Fall 2017 Lecture Note 10 Analyzing Clustered/Repeated Categorical Data 0-0 Outline Clustered/Repeated Categorical Data Generalized Linear Mixed Models Generalized

More information