Technische Universität München. Zentrum Mathematik. Linear Mixed Models Applied to Bank Branch Deposit Data


Project by Eike Christian Brechmann
Supervisor: Prof. Claudia Czado, Ph.D.
Tutor: Dr. Mathias Hofmann
Deadline: 28 February 2009

Contents

List of Figures
Introduction
1 Data Description
  1.1 State Variables
  1.2 County Variables
  1.3 Branch Variables
2 Linear Mixed Model for 1998
  2.1 Explorative Data Analysis
    2.1.1 Branch Variable
    2.1.2 County Variables
  2.2 Model Formulation and Fit
    2.2.1 Hierarchical Model
    2.2.2 Necessity of the Random Effects
    2.2.3 Variance Structure of the Residuals
    2.2.4 Significance of the Fixed Effects
  2.3 Model Diagnostics
    2.3.1 Within-Group Errors
    2.3.2 Random Effects
  2.4 Discussion
    2.4.1 Interpretation
3 Linear Mixed Model for the Full Data Set
  3.1 Explorative Data Analysis
    3.1.1 Branch Variable
    3.1.2 County Variables
    3.1.3 State Variables
  3.2 Model Formulation and Fit
    3.2.1 Initial Model
    3.2.2 Necessity of the Random Effects
    3.2.3 Variance Structure of the Errors
    3.2.4 Significance of the Fixed Effects
  3.3 Model Diagnostics
    3.3.1 Within-Group Errors
    3.3.2 Random Effects
  3.4 Prediction
  3.5 Discussion
    3.5.1 Interpretation
4 Hierarchical Model
  4.1 Explorative Data Analysis
  4.2 Model Formulation and Fit
    4.2.1 Initial Model
    4.2.2 Necessity of the Random Effects
    4.2.3 Variance Structure of the Residuals
    4.2.4 Significance of the Fixed Effects
  4.3 Model Diagnostics
    4.3.1 Within-Group Errors
    4.3.2 Random Effects
  4.4 Prediction
  4.5 Discussion
    4.5.1 Interpretation
5 Conclusion
  5.1 Fixed Effects
  5.2 Random Effects
  5.3 Variance Structure of the Residuals
  5.4 Prediction
  5.5 Summary
Bibliography
A The nlme Library
  A.1 groupedData
  A.2 lmList
  A.3 lme
    A.3.1 Random Effects
    A.3.2 The varFunc Classes
    A.3.3 The corStruct Classes
    A.3.4 Fixed Effects
    A.3.5 Prediction
B The lattice Library
  B.1 Clustered Data
  B.2 Longitudinal Data
C Overview of Models
  C.1 Linear Mixed Model for 1998 (Chapter 2)
  C.2 Linear Mixed Model for the Full Data Set (Chapter 3)
  C.3 Hierarchical Model (Chapter 4)

List of Figures

1.1 Map of New York State
1.2 Hierarchical structure of the data
1.3 State Variables over time
1.4 Histogram of log.dep
1.5 Boxplots of log.dep for each county
1.6 Boxplots of comp for each county
2.1 Trellis display of log.dep by comp
2.2 Residual plots by county for model 2.1
2.3 95-percent confidence intervals for the regression coefficients of model 2.2
2.4 QQ-plot of the standardized residuals of model 2.2
2.5 Scatter plots of the County Variables against log.dep
2.6 Standardized residuals of model 2.4 for each county
2.7 Residual variances of model 2.5 for each county
2.8 Standardized residuals of model 2.6 for each county
2.9 Residual variances of model 2.6 for each county
2.10 QQ-plot of the standardized residuals of model 2.6
2.11 EBLUPs of model 2.6
2.12 QQ-plot of the EBLUPs of model 2.6
2.13 Intercepts of model 2.6
2.14 Relationship between deposits and competition in Nassau and Kings
2.15 Influences of the population and the income on the deposits
3.1 Scatter plot of comp against log.dep
3.2 Observations of 20 randomly selected branches
3.3 Trellis display of log.dep by comp for 20 randomly selected branches
3.4 95-percent confidence intervals for the regression coefficients of model
3.5 Scatter plots of the County Variables against log.dep
3.6 Trellis display of log.dep by pop
3.7 Trellis display of log.dep by inc.pc
3.8 Trellis display of log.dep by unemp
3.9 95-percent confidence intervals for the regression coefficients of model
3.10 Interaction plots of the categorized County Variables and comp
3.11 Scatter plots of the State Variables against log.dep
3.12 Interaction plots of no.fail and the County and Branch Variables
3.13 Interaction plots of mshare and the County and Branch Variables
3.14 Interaction plots of branch.total and the County and Branch Variables
3.15 Interaction plots of dep.total and the County and Branch Variables
3.16 Interaction plots of av.dep and the County and Branch Variables
3.17 Standardized residuals of model 3.4 for each year
3.18 Residual variances of model 3.5 for each year
3.19 Empirical autocorrelation function from the residuals of model
3.20 Standardized residuals of model 3.7 for each year
3.21 Deposits of branch
3.22 Residual variances of model 3.7 for each year
3.23 QQ-plots of the standardized residuals of model
3.24 EBLUPs of model
3.25 QQ-plots of the EBLUPs of model
3.26 Predicted versus observed values for
3.27 3-D interaction plot of inc.pc and av.dep in model
3.28 Significant exploratory interactions of unemp in model
3.29 3-D interaction plots of unemp and State Variables in model
3.30 Standard errors of model 3.7 and the corresponding linear model
3.31 Relationship between deposits and competition
4.1 Scatter plots of the Branch and County Variables against log.dep
4.2 Standardized residuals of model 4.3 for each year
4.3 Residual variances of model 4.4 for each year
4.4 Empirical autocorrelation function from the residuals of model
4.5 Standardized residuals of model 4.6 for each year
4.6 Residual variances of model 4.6 for each year
4.7 QQ-plots of the standardized residuals of model
4.8 EBLUPs of model
4.9 QQ-plot of the EBLUPs of model
4.10 Predicted versus observed values for
4.11 Significant exploratory interactions in model
4.12 3-D interaction plots of unemp and State Variables in model
4.13 Histogram of the intercepts of model
4.14 Standard errors of model 4.6 and the corresponding linear model
4.15 Comparison of the predicted values of models 3.7 and

Introduction

Linear models are not always appropriate for a given data set. Linear models assume independent response variables, but often this is not the case. The data can be

- clustered, i.e. the response is measured once for each subject and each subject belongs to a group of subjects (cluster), or
- longitudinal, i.e. the response is measured at several time points and the number of time points is not too large.

For such dependent data structures the linear model has to be extended by allowing random effects in so-called linear mixed models. The focus of this project is on the application of such mixed models. Regarding the theory of linear mixed models I recommend Fahrmeir et al. [2007].

The data considered in this project is both clustered and longitudinal. An analysis of the data and model building using linear models can be found in Schabenberger [2008]. This project expands that first approach to model building in order to find appropriate mixed models for estimation and prediction. To do this I use the software R with the nlme and lattice libraries. The nlme library makes it possible to fit linear mixed models, and the lattice library provides various useful graphics. For a detailed description of the nlme library I recommend Pinheiro and Bates [2000]. An illustrative approach to using the nlme and lattice libraries is given by Fox [2002]. Fitting linear mixed models using other statistical software packages such as SAS is described in West et al. [2007].

First of all, the data is described, particularly with regard to the need for mixed modeling. Subsequently, a reduced data set without the time effects is considered in chapter 2. In this first approach a linear mixed model is fitted to this clustered data in order to investigate the influence and importance of random effects. In chapter 3 a linear mixed model for the full data set is built and examined. Since this model does not have a hierarchical structure regarding the clusters, a hierarchical model with adjusted covariates is fitted in chapter 4. Finally, the appendix provides additional information about the nlme and lattice libraries and how they are used in this project. An overview of the models used in this project is also provided.
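To make the clustered case concrete, here is a toy numerical sketch (mine, not from the project, and in Python rather than the R used below; the variance values are invented): a random intercept shared by all observations of a cluster induces a constant within-cluster covariance, the compound-symmetry structure that a plain linear model with independent errors ignores.

```python
# Marginal covariance implied by a random-intercept model
#   y_ij = x_ij' * beta + u_j + eps_ij,
#   u_j ~ N(0, psi0_sq), eps_ij ~ N(0, sigma_sq), all independent.
# Within one cluster of size n: Var(y_j) = psi0_sq * J_n + sigma_sq * I_n.

def cluster_cov(n, psi0_sq, sigma_sq):
    """Covariance matrix of the n responses of one cluster."""
    return [[psi0_sq + (sigma_sq if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def intraclass_corr(psi0_sq, sigma_sq):
    """Correlation between two different responses of the same cluster."""
    return psi0_sq / (psi0_sq + sigma_sq)

# toy variance components (not estimated from the bank data)
V = cluster_cov(3, psi0_sq=0.25, sigma_sq=0.75)
print(V[0][0], V[0][1], intraclass_corr(0.25, 0.75))  # 1.0 0.25 0.25
```

With these toy values every pair of observations in the same cluster has correlation 0.25; a model that assumes independence understates exactly this dependence.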

Chapter 1

Data Description

The data set considered in this project contains 2988 branch-year records of a major US bank in the state of New York with multiple branches.

Figure 1.1: Map of New York State.

506 branches are included, with observations over a period of nine years beginning in 1994. There is one row in the data frame for each record of a branch in a particular year, with the columns branch, year, dep, no.fail, mshare, branch.total, dep.total, av.dep,

unemp, county, log.dep, inc, inc.pc, pop, comp and obs.

For a detailed discussion of the data set see Schabenberger [2008]. The data is clustered (branches within counties within the state):

Figure 1.2: Hierarchical structure of the data.

The data is also longitudinal, since it is observed over a period of nine years. Therefore, a mixed model approach seems appropriate to model the dependencies in the data that arise from the clusters (counties) and from the measurements taken on the same subjects (branches, counties, state). In the following I concentrate on the State, County and Branch Variables. The ZIP Code Variable is not considered, because there are mostly very few observations in a zip code area.

1.1 State Variables

The State Variables are constant over counties and differ only from year to year.

- no.fail: the number of branches that closed in NY during the year.
- mshare: the market share in NY.

- branch.total: the share of the number of branches in NY compared to the USA.
- dep.total: the share of the total deposits of the bank in NY compared to the USA.
- av.dep: the average deposit per bank in NY.

These variables are highly correlated, especially with year.

Figure 1.3: State Variables over time.

To avoid singular matrices in the computations, year is not considered as a covariate.

1.2 County Variables

The County Variables are constant over the branches within a county and change from year to year.

- county: the county name.
- obs: the number of observations in the county.

- pop: the population in the county (in 1000 for better interpretability).
- inc: the total income that the people of the county earn (in 1000 for better interpretability).
- inc.pc: the per capita income (in 1000 for better interpretability).
- unemp: the unemployment rate in the county.

Since inc and inc.pc are highly correlated, I consider just one of the two variables. In the following I concentrate on inc.pc because it allows for better comparability between the counties. The data is unbalanced because the number of observations varies a lot across the 28 counties: Albany, Bronx, Broome, Chautauq, Chemung, Erie, Genesee, Herkimer, Kings, Livingst, Madison, Monroe, Nassau, New York, Niagara, Onondaga, Ontario, Orange, Oswego, Putnam, Queens, Rensselaer, Richmond, Rockland, Suffolk, Tioga, Wayne and Westchester.

1.3 Branch Variables

- branch: the branch identity number (constant over the years).
- dep: the total deposits (in USD) in the branch (different for each year).
- log.dep: the total deposits in log form, i.e. log(dep).
- comp: a measure of the geographical competition of the branch (different for each year).

In the following log.dep is used as the dependent variable because its distribution is relatively symmetric (Figure 1.4) and it is common to use the log form for monetary values. A boxplot for each county provides an overview of the values of log.dep (Figure 1.5). The plot shows some variability that is examined later.
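The decision to keep only one of inc and inc.pc rests on their sample correlation. As a self-contained illustration (Python with invented toy income figures; the report computes correlations in R):

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# toy figures: inc.pc is almost a rescaling of inc, so r is close to 1
inc = [20.0, 35.0, 50.0, 65.0, 80.0]
inc_pc = [1.1, 1.8, 2.4, 3.2, 3.9]
r = pearson(inc, inc_pc)
```

A correlation this close to 1 means the two columns carry nearly the same information, which is exactly the near-collinearity that motivates dropping one of them.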

Figure 1.4: Histogram of log.dep.

Figure 1.5: Boxplots of log.dep for each county.

In the original data there are two variables, SingleDensity and MMCDensity, which give the sum of all distances between the branch and all branches of other banks that have only a single branch or multiple branches, respectively. These variables are highly correlated (94%). As a result it is possible either to choose one of the variables or to introduce a new variable in order to avoid singular matrices in the computations. I introduce a new variable because the existing variables are not very easy to interpret. Therefore I merge the two variables, standardized by their medians, and then scale the new variable comp so that it has values between 0 and 100, where a value of 100 indicates high geographical competition:

a = SingleDensity / median(SingleDensity) + MMCDensity / median(MMCDensity)
b = a − min(a)
comp = (1 − b / max(b)) · 100

The geographical competition varies a lot across counties. For example, in New York City the values of comp are very high, whereas in Chautauqua, which is much more rural, the geographical competition is rather low.

Figure 1.6: Boxplots of comp for each county.
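The scaling above can be replayed on toy distance sums (my invented values, in Python; the report builds comp in R): the branch with the smallest combined standardized distances gets comp = 100, the one with the largest gets comp = 0.

```python
def comp_score(single_density, mmc_density):
    """comp on a 0-100 scale: 100 = smallest combined distances
    to competitors, i.e. highest geographical competition."""
    def median(v):
        s = sorted(v)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    med_s, med_m = median(single_density), median(mmc_density)
    # a: sum of the two median-standardized distance measures
    a = [s / med_s + m / med_m for s, m in zip(single_density, mmc_density)]
    # b: shift so the minimum is 0, then invert and rescale to [0, 100]
    b = [x - min(a) for x in a]
    return [(1 - x / max(b)) * 100 for x in b]

# toy distance sums for four branches (not the bank data)
single = [10.0, 40.0, 25.0, 80.0]
mmc = [12.0, 50.0, 30.0, 90.0]
comp = comp_score(single, mmc)
# branch 0 has the smallest distances -> comp = 100; branch 3 the largest -> comp = 0
```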

Chapter 2

Linear Mixed Model for 1998

The aim is to fit a model that describes the deposits (in log form) of branch i in county j in year t, i.e. I choose log.dep_ijt as the dependent variable. In this first approach to linear mixed models I show the necessity of random effects when fitting a model to a reduced data set. I have a closer look at one particular year because this eliminates the time effect from the data, i.e. the dependent variable log.dep_ijt simplifies to log.dep_ij. Therefore I am dealing with clustered two-level data (branches within counties) that is easier to examine. In the following I work with the data of 1998. The following libraries are used:

> library(nlme)
> library(lattice)

2.1 Explorative Data Analysis

At first I have to adjust the variable obs in order to get the number of observations of the counties in 1998. Therefore I compute this number and introduce the new variable obs98.

> Obs98 = rep(0, 28)
> for (i in 1:28) {
+     Obs98[i] = length(branch[county == as.character(levels(county)[i])
+         & year == 1998])
+ }
> names(Obs98) = levels(county)
> bank$obs98 = Obs98[as.character(bank$county)]

The data of 1998 is stored in a new data frame. A groupedData object is also created in order to group the data by county (see Appendix A.1 for more information about groupedData objects).

> bank98 = bank[year == 1998, ]
> detach(bank)
> attach(bank98)
> bank98grouped = groupedData(log.dep ~ comp | county, data = bank98)

> head(bank98)

There are 393 observations available for 1998:

> dim(bank98)[1]
[1] 393

Now I want to determine the relationships between the covariates and the dependent variable. The State Variables are not used as covariates, since they are constant for one particular year, i.e. the influence of the State Variables is included in the intercept.

2.1.1 Branch Variable

I have a look at the relationship between log.dep and comp by using a Trellis display (Figure 2.1; see Appendix B for more information about Trellis graphics in the lattice library). Only counties with at least 15 observations are displayed, since it is difficult to work with linear least-squares and local-regression fits for very few observations. Also pay attention to the differently scaled axes, which are used in order to get a better impression of the data. The plots show no clear overall relationship between log.dep and comp because the relationships vary across counties: the slopes of the linear regression lines are different and even have different signs. This variability is modeled later by random effects.

> print(xyplot(log.dep ~ comp | county, subset = obs98 >= 15,
+     scales = list(relation = "free"), panel = function(x, y) {
+         panel.xyplot(x, y, col = "grey")
+         panel.loess(x, y, span = 1)
+         panel.lmline(x, y, lty = 2)
+     }))

Figure 2.1: Trellis display of log.dep by comp for counties with at least 15 observations in 1998. The dashed lines give linear least-squares fits, the solid lines local-regression fits.

To underline this impression I fit a linear model to examine the influence of comp on log.dep (where n_j = number of observations in county j in 1998):

log.dep_ij = β_0 + β_1 comp_ij + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ² I_{n_j})    (2.1)

> bb98lm = lm(log.dep ~ comp)

As a result of the observed variability the model fit does not show a strong influence of comp on log.dep, but this influence is highly significant because other explanatory variables are missing:

> summary(bb98lm)$coef

The residuals also show the variability in the data (Figure 2.2): the linear model does not consider the correlation of observations within counties, and therefore there is some variation in the residuals.

> print(bwplot(county ~ resid(bb98lm)))

Figure 2.2: Residual plots by county for model 2.1.

In order to examine this variability further I fit linear models for each county with at least 15 observations:

log.dep_ij = β_0j + β_1j comp_ij + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ_j² I_{n_j})    (2.2)

> bb98list = lmList(bank98grouped, subset = obs98 >= 15)

As seen in Figure 2.1 the influences of comp on log.dep are very variable:

> coef(bb98list)

A look at the 95-percent confidence intervals for the regression coefficients (Figure 2.3) underlines these impressions: the intervals are very different, but they overlap and almost every interval includes zero, i.e. the effects are possibly not significantly different. However, it is also important to pay attention to the different scales: the regression coefficients of comp are close to zero, while those of the intercept are much larger.

> print(plot(intervals(bb98list)))

Figure 2.3: 95-percent confidence intervals for the regression coefficients of model 2.2.

A QQ-plot of the standardized residuals (Figure 2.4) also casts doubt on the normality assumption of the errors in the linear regression model, which again shows the necessity of random effects to model this variability.

> print(qqnorm(bb98list, id = 0.05, idLabels = bank98$branch,
+     cex = 0.7, col = "grey"))

Figure 2.4: QQ-plot of the standardized residuals of model 2.2 with possible outliers (labelled with their branch ID).

2.1.2 County Variables

To examine the relationships between the County Variables and log.dep I use scatter plots with LOESS lines (Figure 2.5).

> par(mfrow = c(1, 3))
> scatter.smooth(pop, log.dep, col = "grey")
> scatter.smooth(inc.pc, log.dep, col = "grey")
> scatter.smooth(unemp, log.dep, col = "grey")

Figure 2.5: Scatter plots of the County Variables against log.dep. The lines give local-regression fits.

There is a weak positive relationship between pop and log.dep as well as between inc.pc and log.dep. The influence of unemp is not as clear as for the first two variables. A transformation with sine or cosine could be used, but there is no evident interpretation for the (co)sine of the unemployment rate. Therefore the variable is not transformed.

20 CHAPTER 2. LINEAR MIXED MODEL FOR Model Formulation and Fit Hierarchical Model I fit a hierarchical linear model to the data. This allows to model the variability of the two different levels. First, there is the regression of the deposits (in log form) of branch i in county j on the geographical competition of this branch. log.dep ij = α 0j + α 1j comp ij + ε ij Second, the intercepts and the slopes possibly depend on the County variables: α 0j = γ 00 + γ 01 pop j + γ 02 inc.pc j + γ 03 unemp j + u 0j α 1j = γ 10 + γ 11 pop j + γ 12 inc.pc j + γ 13 unemp j + u 1j These two equations now can be sustituted into the first one: log.dep ij = γ 00 + γ 01 pop j + γ 02 inc.pc j + γ 03 unemp j + u 0j (γ 10 + γ 11 pop j + γ 12 inc.pc j + γ 13 unemp j + u 1j )comp ij + ε ij The γ s are fixed effects and the u s random effects. This notation is equivalent to the notation of a linear mixed model: log.dep ij = β 0 + β 1 comp ij + β 2 pop j + β 3 inc.pc j + β 4 unemp j + β 5 pop j comp ij + β 6 inc.pc j comp ij + β 7 unemp j comp ij + u 0j + u 1j comp ij + ε ij (2.3) Now, the β s are the fixed effects, which is the normally used notation of fixed effects, and the following distributions for the errors and the random effects are assumed (where n j = number of observations in county j in 1998): ε j = (ε 1j,..., ε nj j) T N nj (0, σ 2 I nj ) u j = (u 0j, u 1j ) T N 2 (0, Ψ) ( ) ψ 2 Ψ = 0 ψ 01 ψ 01 ψ1 2 This linear mixed model can be fitted by using the groupeddata-object that specifies the random effects on the county level. > bb98lme1 = lme(log.dep ~ comp + pop + inc.pc + unemp + + comp:pop + comp:inc.pc + comp:unemp, random = ~comp, + data = bank98grouped) The model summary provides detailed information about the model fit (see Appendix A.3.1 for more information): > summary(bb98lme1)

Linear mixed-effects model fit by REML
  Data: bank98grouped
       AIC   BIC   logLik

Random effects:
 Formula: ~comp | county
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev  Corr
(Intercept) 3.3e-01 (Intr)
comp
Residual    9.1e-01

Fixed effects: log.dep ~ comp + pop + inc.pc + unemp + comp:pop + comp:inc.pc + comp:unemp
            Value Std.Error DF t-value p-value
(Intercept)
comp
pop
inc.pc
unemp
comp:pop
comp:inc.pc
comp:unemp

Standardized Within-Group Residuals:
  Min   Q1  Med   Q3  Max

Number of Observations: 393
Number of Groups: 27

2.2.2 Necessity of the Random Effects

The data examination showed that the effects of the linear models for each county are possibly not significantly different (compare Figure 2.3). This leads to the assumption that the random effects of comp can be omitted. To test this, model 2.3 has to be reduced

by eliminating u_1j from the model:

log.dep_ij = β_0 + β_1 comp_ij + β_2 pop_j + β_3 inc.pc_j + β_4 unemp_j
             + β_5 pop_j comp_ij + β_6 inc.pc_j comp_ij + β_7 unemp_j comp_ij
             + u_0j + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ² I_{n_j}),   u_0j ~ N(0, ψ_0²)    (2.4)

> bb98lme1a = update(bb98lme1, random = ~1)

Since model 2.4 and model 2.3 are nested, it can be tested by a likelihood-ratio test whether the random effects of comp are significant at the 5-percent level (see Appendix A.3.1 for more information about nested model comparison):

> anova(bb98lme1, bb98lme1a)
          Model df AIC BIC logLik   Test  L.Ratio p-value
bb98lme1      1
bb98lme1a     2                   1 vs 2  2.5e-07       1

The hypothesis that the random effects of comp are zero cannot be rejected (with a p-value of 1). Therefore these random effects can be eliminated from the model. Even though the data examination showed variability in the influence of comp on log.dep, this variability is not significant once the County Variables are taken into account.

Nevertheless, it should also be tested whether the random effects for the intercept are significant. To do this the random effects of the intercept (u_0j) have to be eliminated from model 2.3:

> bb98lme1b = update(bb98lme1, random = ~comp - 1)
> anova(bb98lme1, bb98lme1b)

The test shows that these random effects are also non-significant if the random effects for comp are already in the model. This leads to the question whether the random effects are needed at all (which was the conclusion after the data examination). Therefore, I compare the models to a corresponding linear model.

> bb98lm = lm(log.dep ~ comp + pop + inc.pc + unemp +
+     comp:pop + comp:inc.pc + comp:unemp, data = bank98)

The information criteria can be compared because the models have the same fixed effects.
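For reference, the statistic behind these anova comparisons is twice the log-likelihood difference, referred to a chi-square distribution with degrees of freedom equal to the number of restricted parameters. A small Python sketch for the df = 1 case (toy log-likelihood values, not the fitted ones; it uses the identity that the chi-square(1) tail probability equals erfc(sqrt(x/2))):

```python
from math import erfc, sqrt

def lr_test_df1(loglik_full, loglik_reduced):
    """Likelihood-ratio test of a reduced model nested in a full model,
    for a single restricted parameter (df = 1).
    Returns (LR statistic, p-value)."""
    lr = 2.0 * (loglik_full - loglik_reduced)
    # survival function of chi^2 with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = erfc(sqrt(max(lr, 0.0) / 2.0))
    return lr, p

# toy values: the reduced model loses almost no log-likelihood,
# so the restricted random effect looks unnecessary (p close to 1)
lr, p = lr_test_df1(loglik_full=-500.0, loglik_reduced=-500.0000001)
```

This mirrors the logic of the comparison above: a tiny L.Ratio yields a p-value near 1, so the extra random effect is not supported by the data.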

> AIC(bb98lm)
[1] 1070
> summary(bb98lme1a)$AIC
[1] 1160
> summary(bb98lme1b)$AIC
[1] 1161

The AIC of the linear model is smaller than the AICs of the mixed models. Nevertheless, I continue my analysis with model 2.4, since its AIC is the smaller of the two mixed models and the model can possibly be improved to show the necessity of random effects in order to model the within-group correlations.

2.2.3 Variance Structure of the Residuals

Since the number of observations and the values of log.dep vary across counties, the within-group errors might vary across counties, too. A look at the residuals confirms this (Figure 2.6).

> print(plot(bb98lme1a, county ~ resid(., type = "p"), abline = 0))

Figure 2.6: Standardized residuals of model 2.4 for each county.

As a result the variance structure of model 2.4 has to be modified in order to allow heterogeneous residual variances, i.e. residual variances σ_j² for all counties j:

log.dep_ij = β_0 + β_1 comp_ij + β_2 pop_j + β_3 inc.pc_j + β_4 unemp_j
             + β_5 pop_j comp_ij + β_6 inc.pc_j comp_ij + β_7 unemp_j comp_ij
             + u_0j + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ_j² I_{n_j}),   u_0j ~ N(0, ψ_0²)    (2.5)

This specific variance structure can be modeled (see Appendix A.3.2 for more information; the number of iterations is increased because of convergence problems):

> bb98lme1var = update(bb98lme1a, weights = varIdent(form = ~1 | county),
+     control = list(maxIter = 100, msMaxIter = 100,
+         niterEM = 50, msMaxEval = 500))

The residual variances actually show some variation:

> varstruct = bb98lme1var$modelStruct$varStruct
> sigma = (1/unique(attributes(varstruct)$weights) *
+     bb98lme1var$sigma)^2
> xlabels = unique(attributes(bb98lme1var$modelStruct$varStruct)$groups)
> plot(sigma, axes = F, xlab = "", ylab = "Residual Variance",
+     type = "h")
> axis(1, at = seq(1, 27, 1), labels = xlabels, las = 2)
> axis(2)

Figure 2.7: Residual variances of model 2.5 for each county.

In order to test whether this variance structure is significant, I can compare model 2.5 to model 2.4 because the models are nested.

> anova(bb98lme1a, bb98lme1var)
            Model df AIC BIC logLik   Test L.Ratio p-value
bb98lme1a       1
bb98lme1var     2                   1 vs 2         <.0001

The new model is a significant improvement in the model fit. Therefore I continue my analysis with model 2.5.

2.2.4 Significance of the Fixed Effects

Some fixed effects of model 2.5 are not significant:

> summary(bb98lme1var)$tTable

Thus I reduce the model by stepwise t tests at the 5-percent level.

> bb98lme2 = update(bb98lme1var, fixed = . ~ . - comp:unemp)
> summary(bb98lme2)$tTable
> bb98lme3 = update(bb98lme2, fixed = . ~ . - unemp)
> summary(bb98lme3)$tTable
> bb98lme4 = update(bb98lme3, fixed = . ~ . - comp:inc.pc)
> summary(bb98lme4)$tTable
> bb98lme5 = update(bb98lme4, fixed = . ~ . - comp:pop)

In the last model (bb98lme5) all fixed effects are significant at the 5-percent level.

> summary(bb98lme5)

Linear mixed-effects model fit by REML
  Data: bank98grouped
       AIC   BIC   logLik

Random effects:
 Formula: ~1 | county
        (Intercept) Residual

 StdDev:

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | county

Fixed effects: log.dep ~ comp + pop + inc.pc
            Value Std.Error DF t-value p-value
(Intercept)
comp
pop
inc.pc

Number of Observations: 393
Number of Groups: 27

Thus the final model is specified as follows:

log.dep_ij = β_0 + β_1 comp_ij + β_2 pop_j + β_3 inc.pc_j + u_0j + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ_j² I_{n_j}),   u_0j ~ N(0, ψ_0²)    (2.6)

2.3 Model Diagnostics

In the following I examine whether the error assumptions of the final model 2.6 are appropriate. These diagnostics can be divided into the examination of two sets of assumptions: those on the within-group (i.e. within-county) errors and those on the random effects.

2.3.1 Within-Group Errors

At first I have a look at the standardized residuals of model 2.6 for each county individually to assess the assumption that the within-group errors are independent and identically distributed within each county with mean 0 and variance σ_j² (Figure 2.8).

> print(plot(bb98lme5, county ~ resid(., type = "p"), abline = 0))

Figure 2.8: Standardized residuals of model 2.6 for each county.

The residuals scatter around 0 (approximately in a [−2, 2] band, i.e. the 95-percent confidence band) and show no pattern, but some residuals are large, i.e. there are several possible outliers. These problems are probably a result of the few observations

in some counties. There is not enough data to model the effects correctly, since there are 15 counties with at most 5 observations (over 50 percent of all 28 counties!):

> length(unique(county[obs98 <= 5]))
[1] 15

As a result the residual variances show a lot of variability (Figure 2.9).

> varstruct1 = bb98lme5$modelStruct$varStruct
> sigma1 = (1/unique(attributes(varstruct1)$weights) *
+     bb98lme5$sigma)^2
> xlabels1 = unique(attributes(bb98lme5$modelStruct$varStruct)$groups)
> plot(sigma1, axes = F, xlab = "", ylab = "Residual Variance",
+     type = "h")
> axis(1, at = seq(1, 27, 1), labels = xlabels1, las = 2)
> axis(2)

Figure 2.9: Residual variances of model 2.6 for each county.

There are two counties with very high residual variances: Rockland and Ontario. To examine the reasons for this I have a look at the corresponding observations:

> bank98[county == "Rockland", ]

> bank98[county == "Ontario", ]

Branch 3604 in Rockland has an unusually small value of log.dep, whereas the other four observations of the county are close to the median of log.dep (11). Therefore the residual variance is increased. In Ontario there are only two observations with different values of log.dep. As a result the residual variance is also increased.

However, the assumption of independence and normality seems approximately appropriate because the residuals scatter around 0. The variability of the standardized residuals is also approximately homogeneous across counties, and a QQ-plot (Figure 2.10) confirms these observations as well, but shows several possible outliers at the same time.

> print(qqnorm(bb98lme5, ~resid(., type = "p"), id = 0.05,
+     idLabels = branch, col = "grey", cex = 0.7))

Figure 2.10: QQ-plot of the standardized residuals of model 2.6 with possible outliers (labelled with their branch ID).

To sum up, the residuals show good characteristics, but the assumption of normality is possibly wrong because of the small sample sizes for some counties. Perhaps a fat-tailed

distribution such as the t-distribution would be a more appropriate assumption on the errors.

2.3.2 Random Effects

To assess the assumption that the random effects are normally distributed with mean 0 and variance ψ₀², I have a look at the EBLUPs (Empirical Best Linear Unbiased Predictors, see for example page 262 in Fahrmeir et al. [2007]) of the random effects for each county (Figure 2.11).

> print(plot(ranef(bb98lme5)))

Figure 2.11: EBLUPs (û_0j) of model 2.6.

The random effects are scattered around 0 and actually very small (pay attention to the scale). They show no pattern either. The QQ-plot also shows that the assumption is probably appropriate (Figure 2.12).

> print(qqnorm(bb98lme5, ~ranef(.)))

Figure 2.12: QQ-plot of the EBLUPs of model 2.6.

2.4 Discussion

The final model 2.6 corresponds to the results of the data examination, which showed positive influences of pop and inc.pc on log.dep. There is also a positive influence of comp on log.dep, which corresponds to the results of model 2.1.

> summary(bb98lme5)$coef$fixed[2:4]
    comp      pop   inc.pc

The variability in the data is reflected by the random effects of the intercept. The following plot shows the different intercepts for each county:

> print(plot(coef(bb98lme5)))

Figure 2.13: Intercepts (β̂_0 + û_0j) of model 2.6.

Although there is a lot of variability in the influence of comp on log.dep, there are no significant random effects for comp. This variability is probably explained by the variables pop and inc.pc. In view of the hierarchical model fit 2.3, the final model 2.6 can be written as follows:

log.dep_ij = α_0j + β_1 comp_ij + ε_ij
α_0j = β_0 + β_2 pop_j + β_3 inc.pc_j + u_0j        (2.7)

On the one hand, the branch-level intercept α_0j depends on the county and on the population and income within this county. On the other hand, the slope of comp does not depend on the county in which the branch is located.

However, the model diagnostics showed that there are still problems regarding the model fit. A comparison to a linear model showed no significance of the random effects of the initial model 2.3. Now I compare the final model 2.6 to a standard linear model in order to investigate whether the linear mixed model improves the model fit.

> bb98lm.test = lm(log.dep ~ comp + pop + inc.pc)
> summary(bb98lme5)$AIC
[1] 1027

> AIC(bb98lm.test)
[1] 1064

The comparison of the AICs of the two models shows that the AIC of model 2.6 is smaller, i.e. the mixed model fit is a significant improvement. A look at the standard errors of both models helps to assess the influence of the random effects:

> summary(bb98lme5)$tTable
> summary(bb98lm.test)$coef

Whereas the standard errors of the covariates are similar, the standard error of the intercept is much smaller in the mixed model than in the standard linear model. This is a result of the correlated observations within the counties, which are modeled in the mixed model but not in the linear model.

2.5 Interpretation

The examination of the data showed that the deposits of a bank branch in 1998 significantly depend on the geographical competition of the branch, on the county in which the branch is located, and on the county's population and per capita income. The influence of the geographical competition is positive, since there are probably more branches in an area where it is more likely to get new customers and more deposits. Another explanation is that competition stimulates business, i.e. a high geographical competition leads to higher deposits because of increased efforts to get new customers. The influences of the county's population and per capita income are also positive, since it is obvious that there are more deposits if there are more people and if the people earn more money. Lastly, the deposits of a bank branch also depend simply on the county the branch is located in: for example, in Kings there are more deposits than in Nassau.
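Because the model is linear on the log scale, each coefficient acts multiplicatively on the deposit scale. As a sketch in the notation of model 2.7 (the symbols are mine; the fitted estimates are not reproduced here), raising comp by one unit within a fixed county j scales the predicted deposits by a constant factor:

```latex
% Effect of one additional unit of competition, county j held fixed (sketch):
\frac{\widehat{\mathrm{Deposits}}(\mathrm{comp} + 1)}
     {\widehat{\mathrm{Deposits}}(\mathrm{comp})}
  = \frac{\exp\bigl(\hat{\alpha}_{0j} + \hat{\beta}_1 (\mathrm{comp} + 1)\bigr)}
         {\exp\bigl(\hat{\alpha}_{0j} + \hat{\beta}_1 \,\mathrm{comp}\bigr)}
  = e^{\hat{\beta}_1}
```

So the county determines the level of deposits through its intercept, while a positive estimate of β_1 scales deposits up by the factor e^{β̂_1} per unit of competition.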
For branches in Nassau the following equation explains the deposits of a branch:

Deposits of branch i = exp(α̂_0,Nassau + β̂_1 · (Competition of branch i on a scale of 100))

In Kings the equation is as follows:

Deposits of branch i = exp(α̂_0,Kings + β̂_1 · (Competition of branch i on a scale of 100))

Since α̂_0,Nassau < α̂_0,Kings, the deposits in Nassau are always lower at the same level of competition:

Figure 2.14: Relationship between deposits and competition in Nassau and Kings.

The positive influences of the population and the per capita income can also be displayed:

Figure 2.15: Influences of the population and the income on the deposits at the average level of geographical competition.

In the next chapter, the model explaining the effects on the deposits of a bank branch is extended and a more detailed interpretation will be possible.
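The improvement over the standard linear model seen in this chapter, in particular the smaller standard error of the intercept, can be traced to the within-county covariance that the random intercept induces. As a sketch (assuming, for simplicity, a common residual variance σ², although model 2.6 actually allows county-specific variances):

```latex
% Covariance structure induced by the random intercept u_{0j} \sim N(0, \psi_0^2)
% for two different branches i \neq i' in the same county j:
\operatorname{Var}(\mathrm{log.dep}_{ij}) = \psi_0^2 + \sigma^2, \qquad
\operatorname{Cov}(\mathrm{log.dep}_{ij}, \mathrm{log.dep}_{i'j}) = \psi_0^2,
\qquad
\rho = \frac{\psi_0^2}{\psi_0^2 + \sigma^2}
```

The intraclass correlation ρ quantifies how strongly observations within a county are tied together; a standard linear model sets ρ = 0 and therefore misstates the precision of the intercept.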

Chapter 3

Linear Mixed Model for the Full Data Set

After having fit a mixed model to the reduced data set, I now fit a mixed model to the complete data set. The aim is still to fit a model that describes the deposits (in log form) of branch i in county j in year t. Now I am dealing with clustered two-level data (branches within counties) that includes time effects for each level (branches, counties, state). These time effects require additional random effects because the observations are correlated over time.

3.1 Explorative Data Analysis

I examine relationships between the dependent variable and all three levels of covariates.

3.1.1 Branch Variable

To get a first impression of the relationship between log.dep and comp, I have a look at the corresponding scatter plot (Figure 3.1). There is a weak positive overall influence of comp on log.dep, but there is also a lot of variation in the data that needs to be examined. Because there are too many branches to look at each individually, I select a sample of 20 branches that are examined more closely. A groupedData object is also created to store the data in a data frame grouped by branch.

> set.seed(2)
> banksample = sample(unique(branch), 20)
> sample20 = groupedData(log.dep ~ year | branch,
+     data = bank[is.element(branch, banksample), ])

This sample includes measurements at different time points, but not each branch is measured at each time point (Figure 3.2).

> scatter.smooth(comp, log.dep, col = "grey")

Figure 3.1: Scatter plot of comp against log.dep.

> print(plot(sample20, layout = c(5, 4), aspect = 1))

Figure 3.2: Observations of 20 randomly selected branches.

Now the relationship between log.dep and comp can be investigated individually by using a Trellis display (Figure 3.3). (Pay attention to the differently scaled axes, which are used in order to get a better impression of the data.)

> print(xyplot(log.dep ~ comp | branch, data = sample20,
+     scales = list(relation = "free"), panel = function(x, y) {
+         panel.xyplot(x, y)
+         panel.lmline(x, y, lty = 2)
+     }))

Figure 3.3: Trellis display of log.dep by comp for 20 randomly selected branches. The dashed lines give linear least-squares fits.

Even if the overall influence of comp on log.dep is positive, the plots show that there are negative influences as well as positive influences for the branches in this sample. The slopes of the effects are very different, too. Thus there is a lot of variability in the effects. Another important fact is also shown in the plots: there are very few observations for some branches, and this makes it difficult to fit an appropriate standard linear model to the data. Random effects are needed to deal with this problem.
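The reason random effects cope with sparsely observed branches is shrinkage: the predicted branch effect is a compromise between the branch's own data and the overall mean. For a simple random-intercept model with n_i observations on branch i and common residual variance σ² (a sketch for illustration, not the exact model fitted later in this chapter), the EBLUP has the form:

```latex
% Shrinkage form of the EBLUP for a branch random intercept u_{0i} \sim N(0, \psi_0^2):
\hat{u}_{0i}
  = \frac{\psi_0^2}{\psi_0^2 + \sigma^2 / n_i}\;
    \frac{1}{n_i} \sum_{t=1}^{n_i} \bigl(y_{it} - x_{it}^\top \hat{\beta}\bigr)
```

The fewer observations n_i a branch has, the more its predicted effect is shrunk toward 0, so branches with little data borrow strength from the rest of the sample instead of receiving an unstable individual estimate.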


More information

Bivariate Regression Analysis. The most useful means of discerning causality and significance of variables

Bivariate Regression Analysis. The most useful means of discerning causality and significance of variables Bivariate Regression Analysis The most useful means of discerning causality and significance of variables Purpose of Regression Analysis Test causal hypotheses Make predictions from samples of data Derive

More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of

More information

22s:152 Applied Linear Regression. Returning to a continuous response variable Y...

22s:152 Applied Linear Regression. Returning to a continuous response variable Y... 22s:152 Applied Linear Regression Generalized Least Squares Returning to a continuous response variable Y... Ordinary Least Squares Estimation The classical models we have fit so far with a continuous

More information

Random Intercept Models

Random Intercept Models Random Intercept Models Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline A very simple case of a random intercept

More information

CRP 272 Introduction To Regression Analysis

CRP 272 Introduction To Regression Analysis CRP 272 Introduction To Regression Analysis 30 Relationships Among Two Variables: Interpretations One variable is used to explain another variable X Variable Independent Variable Explaining Variable Exogenous

More information

Introduction to Simple Linear Regression

Introduction to Simple Linear Regression Introduction to Simple Linear Regression 1. Regression Equation A simple linear regression (also known as a bivariate regression) is a linear equation describing the relationship between an explanatory

More information

Validation of Visual Statistical Inference, with Application to Linear Models

Validation of Visual Statistical Inference, with Application to Linear Models Validation of Visual Statistical Inference, with pplication to Linear Models Mahbubul Majumder, Heike Hofmann, Dianne Cook Department of Statistics, Iowa State University pril 2, 212 Statistical graphics

More information

22s:152 Applied Linear Regression. In matrix notation, we can write this model: Generalized Least Squares. Y = Xβ + ɛ with ɛ N n (0, Σ)

22s:152 Applied Linear Regression. In matrix notation, we can write this model: Generalized Least Squares. Y = Xβ + ɛ with ɛ N n (0, Σ) 22s:152 Applied Linear Regression Generalized Least Squares Returning to a continuous response variable Y Ordinary Least Squares Estimation The classical models we have fit so far with a continuous response

More information

Correlation in Linear Regression

Correlation in Linear Regression Vrije Universiteit Amsterdam Research Paper Correlation in Linear Regression Author: Yura Perugachi-Diaz Student nr.: 2566305 Supervisor: Dr. Bartek Knapik May 29, 2017 Faculty of Sciences Research Paper

More information

Random and Mixed Effects Models - Part III

Random and Mixed Effects Models - Part III Random and Mixed Effects Models - Part III Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Quasi-F Tests When we get to more than two categorical factors, some times there are not nice F tests

More information

36-463/663: Multilevel & Hierarchical Models

36-463/663: Multilevel & Hierarchical Models 36-463/663: Multilevel & Hierarchical Models Some Random Effects Configurations Brian Junker 132E Baker Hall brian@stat.cmu.edu 1 Outline Random Effect Configurations Most of our models so far: Level 1

More information

Simple Linear Regression for the Climate Data

Simple Linear Regression for the Climate Data Prediction Prediction Interval Temperature 0.2 0.0 0.2 0.4 0.6 0.8 320 340 360 380 CO 2 Simple Linear Regression for the Climate Data What do we do with the data? y i = Temperature of i th Year x i =CO

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression INTRODUCTION TO CLINICAL RESEARCH Introduction to Linear Regression Karen Bandeen-Roche, Ph.D. July 17, 2012 Acknowledgements Marie Diener-West Rick Thompson ICTR Leadership / Team JHU Intro to Clinical

More information

STATISTICS 479 Exam II (100 points)

STATISTICS 479 Exam II (100 points) Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the

More information

multilevel modeling: concepts, applications and interpretations

multilevel modeling: concepts, applications and interpretations multilevel modeling: concepts, applications and interpretations lynne c. messer 27 october 2010 warning social and reproductive / perinatal epidemiologist concepts why context matters multilevel models

More information

Single and multiple linear regression analysis

Single and multiple linear regression analysis Single and multiple linear regression analysis Marike Cockeran 2017 Introduction Outline of the session Simple linear regression analysis SPSS example of simple linear regression analysis Additional topics

More information

#Alternatively we could fit a model where the rail values are levels of a factor with fixed effects

#Alternatively we could fit a model where the rail values are levels of a factor with fixed effects examples-lme.r Tue Nov 25 12:32:20 2008 1 library(nlme) # The following data shows the results of tests carried over 6 rails. The response # indicated the time needed for a an ultrasonic wave to travel

More information

Solution to Series 6

Solution to Series 6 Dr. M. Dettling Applied Series Analysis SS 2014 Solution to Series 6 1. a) > r.bel.lm summary(r.bel.lm) Call: lm(formula = NURSING ~., data = d.beluga) Residuals: Min 1Q

More information

Tentative solutions TMA4255 Applied Statistics 16 May, 2015

Tentative solutions TMA4255 Applied Statistics 16 May, 2015 Norwegian University of Science and Technology Department of Mathematical Sciences Page of 9 Tentative solutions TMA455 Applied Statistics 6 May, 05 Problem Manufacturer of fertilizers a) Are these independent

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

AMS 315/576 Lecture Notes. Chapter 11. Simple Linear Regression

AMS 315/576 Lecture Notes. Chapter 11. Simple Linear Regression AMS 315/576 Lecture Notes Chapter 11. Simple Linear Regression 11.1 Motivation A restaurant opening on a reservations-only basis would like to use the number of advance reservations x to predict the number

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

Multiple Regression Methods

Multiple Regression Methods Chapter 1: Multiple Regression Methods Hildebrand, Ott and Gray Basic Statistical Ideas for Managers Second Edition 1 Learning Objectives for Ch. 1 The Multiple Linear Regression Model How to interpret

More information

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author...

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author... From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. Contents About This Book... xiii About The Author... xxiii Chapter 1 Getting Started: Data Analysis with JMP...

More information

Solution pigs exercise

Solution pigs exercise Solution pigs exercise Course repeated measurements - R exercise class 2 November 24, 2017 Contents 1 Question 1: Import data 3 1.1 Data management..................................... 3 1.2 Inspection

More information

Mathematics for Economics MA course

Mathematics for Economics MA course Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between

More information

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression Introduction to Correlation and Regression The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested

More information