Technische Universität München. Zentrum Mathematik. Linear Mixed Models Applied to Bank Branch Deposit Data


Project by Eike Christian Brechmann
Supervisor: Prof. Claudia Czado, Ph.D.
Tutor: Dr. Mathias Hofmann
Deadline: 28 February 2009

Contents

List of Figures
Introduction
1 Data Description
  1.1 State Variables
  1.2 County Variables
  1.3 Branch Variables
2 Linear Mixed Model for 1998
  2.1 Explorative Data Analysis
    2.1.1 Branch Variable
    2.1.2 County Variables
  2.2 Model Formulation and Fit
    2.2.1 Hierarchical Model
    2.2.2 Necessity of the Random Effects
    2.2.3 Variance Structure of the Residuals
    2.2.4 Significance of the Fixed Effects
  2.3 Model Diagnostics
    2.3.1 Within-Group Errors
    2.3.2 Random Effects
  2.4 Discussion
    2.4.1 Interpretation
3 Linear Mixed Model for the Full Data Set
  3.1 Explorative Data Analysis
    3.1.1 Branch Variable
    3.1.2 County Variables
    3.1.3 State Variables
  3.2 Model Formulation and Fit
    3.2.1 Initial Model
    3.2.2 Necessity of the Random Effects
    3.2.3 Variance Structure of the Errors
    3.2.4 Significance of the Fixed Effects
  3.3 Model Diagnostics
    3.3.1 Within-Group Errors
    3.3.2 Random Effects
  3.4 Prediction
  3.5 Discussion
    3.5.1 Interpretation
4 Hierarchical Model
  4.1 Explorative Data Analysis
  4.2 Model Formulation and Fit
    4.2.1 Initial Model
    4.2.2 Necessity of the Random Effects
    4.2.3 Variance Structure of the Residuals
    4.2.4 Significance of the Fixed Effects
  4.3 Model Diagnostics
    4.3.1 Within-Group Errors
    4.3.2 Random Effects
  4.4 Prediction
  4.5 Discussion
    4.5.1 Interpretation
5 Conclusion
  5.1 Fixed Effects
  5.2 Random Effects
  5.3 Variance Structure of the Residuals
  5.4 Prediction
  5.5 Summary
Bibliography
A The nlme Library
  A.1 groupedData
  A.2 lmList
  A.3 lme
    A.3.1 Random Effects
    A.3.2 The varFunc Classes
    A.3.3 The corStruct Classes
    A.3.4 Fixed Effects
    A.3.5 Prediction
B The lattice Library
  B.1 Clustered Data
  B.2 Longitudinal Data
C Overview of Models
  C.1 Linear Mixed Model for 1998 (Chapter 2)
  C.2 Linear Mixed Model for the Full Data Set (Chapter 3)
  C.3 Hierarchical Model (Chapter 4)

List of Figures

1.1 Map of New York State
1.2 Hierarchical structure of the data
1.3 State Variables over time
1.4 Histogram of log.dep
1.5 Boxplots of log.dep for each county
1.6 Boxplots of comp for each county
2.1 Trellis display of log.dep by comp
2.2 Residual plots by county for model 2.1
2.3 95-percent confidence intervals for the regression coefficients of model 2.2
2.4 QQ-plot of the standardized residuals of model 2.2
2.5 Scatter plots of the County Variables against log.dep
2.6 Standardized residuals of model 2.4 for each county
2.7 Residual variances of model 2.5 for each county
2.8 Standardized residuals of model 2.6 for each county
2.9 Residual variances of model 2.6 for each county
2.10 QQ-plot of the standardized residuals of model 2.6
2.11 EBLUPs of model 2.6
2.12 QQ-plot of the EBLUPs of model 2.6
2.13 Intercepts of model 2.6
2.14 Relationship between deposits and competition in Nassau and Kings
2.15 Influences of the population and the income on the deposits
3.1 Scatter plot of comp against log.dep
3.2 Observations of 20 randomly selected branches
3.3 Trellis display of log.dep by comp for 20 randomly selected branches
3.4 95-percent confidence intervals for the regression coefficients of model
3.5 Scatter plots of the County Variables against log.dep
3.6 Trellis display of log.dep by pop
3.7 Trellis display of log.dep by inc.pc
3.8 Trellis display of log.dep by unemp
3.9 95-percent confidence intervals for the regression coefficients of model
3.10 Interaction plots of the categorized County Variables and comp
3.11 Scatter plots of the State Variables against log.dep
3.12 Interaction plots of no.fail and the County and Branch Variables
3.13 Interaction plots of mshare and the County and Branch Variables
3.14 Interaction plots of branch.total and the County and Branch Variables
3.15 Interaction plots of dep.total and the County and Branch Variables
3.16 Interaction plots of av.dep and the County and Branch Variables
3.17 Standardized residuals of model 3.4 for each year
3.18 Residual variances of model 3.5 for each year
3.19 Empirical autocorrelation function from the residuals of model
3.20 Standardized residuals of model 3.7 for each year
3.21 Deposits of branch
3.22 Residual variances of model 3.7 for each year
3.23 QQ-plots of the standardized residuals of model
3.24 EBLUPs of model
3.25 QQ-plots of the EBLUPs of model
3.26 Predicted versus observed values for
3.27 3-D interaction plot of inc.pc and av.dep in model
3.28 Significant exploratory interactions of unemp in model
3.29 3-D interaction plots of unemp and State Variables in model
3.30 Standard errors of model 3.7 and the corresponding linear model
3.31 Relationship between deposits and competition
4.1 Scatter plots of the Branch and County Variables against log.dep
4.2 Standardized residuals of model 4.3 for each year
4.3 Residual variances of model 4.4 for each year
4.4 Empirical autocorrelation function from the residuals of model
4.5 Standardized residuals of model 4.6 for each year
4.6 Residual variances of model 4.6 for each year
4.7 QQ-plots of the standardized residuals of model
4.8 EBLUPs of model
4.9 QQ-plot of the EBLUPs of model
4.10 Predicted versus observed values for
4.11 Significant exploratory interactions in model
4.12 3-D interaction plots of unemp and State Variables in model
4.13 Histogram of the intercepts of model
4.14 Standard errors of model 4.6 and the corresponding linear model
4.15 Comparison of the predicted values of models 3.7 and

Introduction

Linear models are not always appropriate for a given data set. Linear models assume independent response variables, but often this is not the case. The data can be

- clustered, i.e. the response is measured once for each subject and each subject belongs to a group of subjects (cluster), or
- longitudinal, i.e. the response is measured at several time points and the number of time points is not too large.

For such dependent data structures the linear model has to be extended by allowing random effects in so-called linear mixed models. The focus of this project is on the application of such mixed models. Regarding the theory of linear mixed models I recommend Fahrmeir et al. [2007].

The data considered in this project is both clustered and longitudinal. An analysis of the data and model building using linear models can be found in Schabenberger [2008]. This project expands that first approach to model building in order to find appropriate mixed models for estimation and prediction. To do this I use the software R with the nlme and lattice libraries. The nlme library makes it possible to fit linear mixed models, and the lattice library provides various useful graphics. For a detailed description of the nlme library I recommend Pinheiro and Bates [2000]. An illustrative approach to using the nlme and lattice libraries is given by Fox [2002]. Fitting linear mixed models using other statistical software packages such as SAS is described in West et al. [2007].

First of all, the data is described, particularly with regard to the need for mixed modeling. Subsequently, a reduced data set without the time effects is considered in chapter 2. In this first approach a linear mixed model is fitted to this clustered data in order to investigate the influence and importance of random effects. In chapter 3 a linear mixed model for the full data set is built and examined. Since this model does not have a hierarchical structure regarding the clusters, a hierarchical model with adjusted covariates is fitted in chapter 4. Finally, the appendix provides additional information about the nlme and lattice libraries and how they are used in this project. An overview of the models used in this project is also provided.
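To make the clustered case concrete, here is a toy numerical sketch (mine, not from the project, and in Python rather than the R used below; the variance values are invented): a random intercept shared by all observations of a cluster induces a constant within-cluster covariance, the compound-symmetry structure that a plain linear model with independent errors ignores.

```python
# Marginal covariance implied by a random-intercept model
#   y_ij = x_ij' * beta + u_j + eps_ij,
#   u_j ~ N(0, psi0_sq), eps_ij ~ N(0, sigma_sq), all independent.
# Within one cluster of size n: Var(y_j) = psi0_sq * J_n + sigma_sq * I_n.

def cluster_cov(n, psi0_sq, sigma_sq):
    """Covariance matrix of the n responses of one cluster."""
    return [[psi0_sq + (sigma_sq if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def intraclass_corr(psi0_sq, sigma_sq):
    """Correlation between two different responses of the same cluster."""
    return psi0_sq / (psi0_sq + sigma_sq)

# toy variance components (not estimated from the bank data)
V = cluster_cov(3, psi0_sq=0.25, sigma_sq=0.75)
print(V[0][0], V[0][1], intraclass_corr(0.25, 0.75))  # 1.0 0.25 0.25
```

With these toy values every pair of observations in the same cluster has correlation 0.25; a model that assumes independence understates exactly this dependence.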

Chapter 1

Data Description

The data set considered in this project contains 2988 branch-year records of a major US bank in the state of New York with multiple branches.

Figure 1.1: Map of New York State.

506 branches are included, with observations over a period of nine years beginning in 1994. There is one row in the data frame for each record of a branch in a particular year, with the columns branch, year, dep, no.fail, mshare, branch.total, dep.total, av.dep,

unemp, county, log.dep, inc, inc.pc, pop, comp and obs.

For a detailed discussion of the data set see Schabenberger [2008]. The data is clustered (branches within counties within the state):

Figure 1.2: Hierarchical structure of the data.

The data is also longitudinal, since it is observed over a period of nine years. Therefore, a mixed model approach seems appropriate to model the dependencies in the data that arise from the clusters (counties) and from the measurements taken on the same subjects (branches, counties, state). In the following I concentrate on the State, County and Branch Variables. The ZIP Code Variable is not considered, because there are mostly very few observations in a zip code area.

1.1 State Variables

The State Variables are constant over counties and differ only from year to year.

- no.fail: the number of branches that closed in NY during the year.
- mshare: the market share in NY.

- branch.total: the share of the number of branches in NY compared to the USA.
- dep.total: the share of the total deposits of the bank in NY compared to the USA.
- av.dep: the average deposit per bank in NY.

These variables are highly correlated, especially with year.

Figure 1.3: State Variables over time.

To avoid singular matrices in the computations, year is not considered as a covariate.

1.2 County Variables

The County Variables are constant over the branches within a county and change from year to year.

- county: the county name.
- obs: the number of observations in the county.

- pop: the population in the county (in 1000 for better interpretability).
- inc: the total income that the people of the county earn (in 1000 for better interpretability).
- inc.pc: the per capita income (in 1000 for better interpretability).
- unemp: the unemployment rate in the county.

Since inc and inc.pc are highly correlated, I consider just one of the two variables. In the following I concentrate on inc.pc because it allows for better comparability between the counties. The data is unbalanced because the number of observations varies a lot across the 28 counties: Albany, Bronx, Broome, Chautauq, Chemung, Erie, Genesee, Herkimer, Kings, Livingst, Madison, Monroe, Nassau, New York, Niagara, Onondaga, Ontario, Orange, Oswego, Putnam, Queens, Rensselaer, Richmond, Rockland, Suffolk, Tioga, Wayne and Westchester.

1.3 Branch Variables

- branch: the branch identity number (constant over the years).
- dep: the total deposits (in USD) in the branch (different for each year).
- log.dep: the total deposits in log form, i.e. log(dep).
- comp: a measure of the geographical competition of the branch (different for each year).

In the following log.dep is used as the dependent variable because its distribution is relatively symmetric (Figure 1.4) and it is common to use the log form for monetary values. A boxplot for each county provides an overview of the values of log.dep (Figure 1.5). The plot shows some variability that is examined later.
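The decision to keep only one of inc and inc.pc rests on their sample correlation. As a self-contained illustration (Python with invented toy income figures; the report computes correlations in R):

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# toy figures: inc.pc is almost a rescaling of inc, so r is close to 1
inc = [20.0, 35.0, 50.0, 65.0, 80.0]
inc_pc = [1.1, 1.8, 2.4, 3.2, 3.9]
r = pearson(inc, inc_pc)
```

A correlation this close to 1 means the two columns carry nearly the same information, which is exactly the near-collinearity that motivates dropping one of them.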

Figure 1.4: Histogram of log.dep.

Figure 1.5: Boxplots of log.dep for each county.

In the original data there are two variables, SingleDensity and MMCDensity, which give the sum of all distances between the branch and all branches of other banks that have only a single branch or multiple branches, respectively. These variables are highly correlated (94%). As a result it is possible either to choose one of the variables or to introduce a new variable in order to avoid singular matrices in the computations. I introduce a new variable because the existing variables are not very easy to interpret. Therefore I merge the two variables, standardized by their medians, and then scale the new variable comp so that it has values between 0 and 100, where a value of 100 indicates high geographical competition:

a = SingleDensity / median(SingleDensity) + MMCDensity / median(MMCDensity)
b = a − min(a)
comp = (1 − b / max(b)) · 100

The geographical competition varies a lot across counties. For example, in New York City the values of comp are very high, whereas in Chautauqua, which is much more rural, the geographical competition is rather low.

Figure 1.6: Boxplots of comp for each county.
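The scaling above can be replayed on toy distance sums (my invented values, in Python; the report builds comp in R): the branch with the smallest combined standardized distances gets comp = 100, the one with the largest gets comp = 0.

```python
def comp_score(single_density, mmc_density):
    """comp on a 0-100 scale: 100 = smallest combined distances
    to competitors, i.e. highest geographical competition."""
    def median(v):
        s = sorted(v)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    med_s, med_m = median(single_density), median(mmc_density)
    # a: sum of the two median-standardized distance measures
    a = [s / med_s + m / med_m for s, m in zip(single_density, mmc_density)]
    # b: shift so the minimum is 0, then invert and rescale to [0, 100]
    b = [x - min(a) for x in a]
    return [(1 - x / max(b)) * 100 for x in b]

# toy distance sums for four branches (not the bank data)
single = [10.0, 40.0, 25.0, 80.0]
mmc = [12.0, 50.0, 30.0, 90.0]
comp = comp_score(single, mmc)
# branch 0 has the smallest distances -> comp = 100; branch 3 the largest -> comp = 0
```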

Chapter 2

Linear Mixed Model for 1998

The aim is to fit a model that describes the deposits (in log form) of branch i in county j in year t, i.e. I choose log.dep_ijt as the dependent variable. In this first approach to linear mixed models I show the necessity of random effects when fitting a model to a reduced data set. I have a closer look at one particular year because this eliminates the time effect from the data, i.e. the dependent variable log.dep_ijt simplifies to log.dep_ij. Therefore I am dealing with clustered two-level data (branches within counties) that is easier to examine. In the following I work with the data of 1998. The following libraries are used:

> library(nlme)
> library(lattice)

2.1 Explorative Data Analysis

At first I have to adjust the variable obs in order to get the number of observations of the counties in 1998. Therefore I compute this number and introduce the new variable obs98.

> Obs98 = rep(0, 28)
> for (i in 1:28) {
+     Obs98[i] = length(branch[county == as.character(levels(county)[i])
+         & year == 1998])
+ }
> names(Obs98) = levels(county)
> bank$obs98 = Obs98[as.character(bank$county)]

The data of 1998 is stored in a new data frame. A groupedData object is also created in order to group the data by county (see Appendix A.1 for more information about groupedData objects).

> bank98 = bank[year == 1998, ]
> detach(bank)
> attach(bank98)
> bank98grouped = groupedData(log.dep ~ comp | county, data = bank98)

> head(bank98)

There are 393 observations available for 1998:

> dim(bank98)[1]
[1] 393

Now I want to determine the relationships between the covariates and the dependent variable. The State Variables are not used as covariates, since they are constant for one particular year, i.e. the influence of the State Variables is included in the intercept.

2.1.1 Branch Variable

I have a look at the relationship between log.dep and comp by using a Trellis display (Figure 2.1; see Appendix B for more information about Trellis graphics in the lattice library). Only counties with at least 15 observations are displayed, since it is difficult to work with linear least-squares and local-regression fits for very few observations. Also pay attention to the differently scaled axes, which are used in order to get a better impression of the data. The plots show no clear overall relationship between log.dep and comp because the relationships vary across counties: the slopes of the linear regression lines are different and even have different signs. This variability is modeled later by random effects.

> print(xyplot(log.dep ~ comp | county, subset = obs98 >= 15,
+     scales = list(relation = "free"), panel = function(x, y) {
+         panel.xyplot(x, y, col = "grey")
+         panel.loess(x, y, span = 1)
+         panel.lmline(x, y, lty = 2)
+     }))

Figure 2.1: Trellis display of log.dep by comp for counties with at least 15 observations in 1998. The dashed lines give linear least-squares fits, the solid lines local-regression fits.

To underline this impression I fit a linear model to examine the influence of comp on log.dep (where n_j = number of observations in county j in 1998):

log.dep_ij = β_0 + β_1 comp_ij + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ² I_{n_j})    (2.1)

> bb98lm = lm(log.dep ~ comp)

As a result of the observed variability the model fit does not show a strong influence of comp on log.dep, but this influence is highly significant because other explanatory variables are missing:

> summary(bb98lm)$coef

The residuals also show the variability in the data (Figure 2.2): the linear model does not consider the correlation of observations within counties, and therefore there is some variation in the residuals.

> print(bwplot(county ~ resid(bb98lm)))

Figure 2.2: Residual plots by county for model 2.1.

In order to examine this variability further I fit linear models for each county with at least 15 observations:

log.dep_ij = β_0j + β_1j comp_ij + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ_j² I_{n_j})    (2.2)

> bb98list = lmList(bank98grouped, subset = obs98 >= 15)

As seen in Figure 2.1 the influences of comp on log.dep are very variable:

> coef(bb98list)

A look at the 95-percent confidence intervals for the regression coefficients (Figure 2.3) underlines these impressions: the intervals are very different, but they overlap and almost every interval includes zero, i.e. the effects are possibly not significantly different. However, it is also important to pay attention to the different scales: the regression coefficients of comp are close to zero, while those of the intercept are much larger.

> print(plot(intervals(bb98list)))

Figure 2.3: 95-percent confidence intervals for the regression coefficients of model 2.2.

A QQ-plot of the standardized residuals (Figure 2.4) also casts doubt on the normality assumption of the errors in the linear regression model, which again shows the necessity of random effects to model this variability.

> print(qqnorm(bb98list, id = 0.05, idLabels = bank98$branch,
+     cex = 0.7, col = "grey"))

Figure 2.4: QQ-plot of the standardized residuals of model 2.2 with possible outliers (labelled with their branch ID).

2.1.2 County Variables

To examine the relationships between the County Variables and log.dep I use scatter plots with LOESS lines (Figure 2.5).

> par(mfrow = c(1, 3))
> scatter.smooth(pop, log.dep, col = "grey")
> scatter.smooth(inc.pc, log.dep, col = "grey")
> scatter.smooth(unemp, log.dep, col = "grey")

Figure 2.5: Scatter plots of the County Variables against log.dep. The lines give local-regression fits.

There is a weak positive relationship between pop and log.dep as well as between inc.pc and log.dep. The influence of unemp is not as clear as for the first two variables. A transformation with sine or cosine could be used, but there is no evident interpretation for the (co)sine of the unemployment rate. Therefore the variable is not transformed.

20 CHAPTER 2. LINEAR MIXED MODEL FOR Model Formulation and Fit Hierarchical Model I fit a hierarchical linear model to the data. This allows to model the variability of the two different levels. First, there is the regression of the deposits (in log form) of branch i in county j on the geographical competition of this branch. log.dep ij = α 0j + α 1j comp ij + ε ij Second, the intercepts and the slopes possibly depend on the County variables: α 0j = γ 00 + γ 01 pop j + γ 02 inc.pc j + γ 03 unemp j + u 0j α 1j = γ 10 + γ 11 pop j + γ 12 inc.pc j + γ 13 unemp j + u 1j These two equations now can be sustituted into the first one: log.dep ij = γ 00 + γ 01 pop j + γ 02 inc.pc j + γ 03 unemp j + u 0j (γ 10 + γ 11 pop j + γ 12 inc.pc j + γ 13 unemp j + u 1j )comp ij + ε ij The γ s are fixed effects and the u s random effects. This notation is equivalent to the notation of a linear mixed model: log.dep ij = β 0 + β 1 comp ij + β 2 pop j + β 3 inc.pc j + β 4 unemp j + β 5 pop j comp ij + β 6 inc.pc j comp ij + β 7 unemp j comp ij + u 0j + u 1j comp ij + ε ij (2.3) Now, the β s are the fixed effects, which is the normally used notation of fixed effects, and the following distributions for the errors and the random effects are assumed (where n j = number of observations in county j in 1998): ε j = (ε 1j,..., ε nj j) T N nj (0, σ 2 I nj ) u j = (u 0j, u 1j ) T N 2 (0, Ψ) ( ) ψ 2 Ψ = 0 ψ 01 ψ 01 ψ1 2 This linear mixed model can be fitted by using the groupeddata-object that specifies the random effects on the county level. > bb98lme1 = lme(log.dep ~ comp + pop + inc.pc + unemp + + comp:pop + comp:inc.pc + comp:unemp, random = ~comp, + data = bank98grouped) The model summary provides detailed information about the model fit (see Appendix A.3.1 for more information): > summary(bb98lme1)

Linear mixed-effects model fit by REML
  Data: bank98grouped
       AIC   BIC   logLik

Random effects:
 Formula: ~comp | county
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev  Corr
(Intercept) 3.3e-01 (Intr)
comp
Residual    9.1e-01

Fixed effects: log.dep ~ comp + pop + inc.pc + unemp + comp:pop + comp:inc.pc + comp:unemp
            Value Std.Error DF t-value p-value
(Intercept)
comp
pop
inc.pc
unemp
comp:pop
comp:inc.pc
comp:unemp

Standardized Within-Group Residuals:
  Min   Q1  Med   Q3  Max

Number of Observations: 393
Number of Groups: 27

2.2.2 Necessity of the Random Effects

The data examination showed that the effects of the linear models for each county are possibly not significantly different (compare Figure 2.3). This leads to the assumption that the random effects of comp can be omitted. To test this, model 2.3 has to be reduced

by eliminating u_1j from the model:

log.dep_ij = β_0 + β_1 comp_ij + β_2 pop_j + β_3 inc.pc_j + β_4 unemp_j
             + β_5 pop_j comp_ij + β_6 inc.pc_j comp_ij + β_7 unemp_j comp_ij
             + u_0j + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ² I_{n_j}),   u_0j ~ N(0, ψ_0²)    (2.4)

> bb98lme1a = update(bb98lme1, random = ~1)

Since model 2.4 and model 2.3 are nested, it can be tested by a likelihood-ratio test whether the random effects of comp are significant at the 5-percent level (see Appendix A.3.1 for more information about nested model comparison):

> anova(bb98lme1, bb98lme1a)
          Model df AIC BIC logLik   Test  L.Ratio p-value
bb98lme1      1
bb98lme1a     2                   1 vs 2  2.5e-07       1

The hypothesis that the random effects of comp are zero cannot be rejected (with a p-value of 1). Therefore these random effects can be eliminated from the model. Even though the data examination showed variability in the influence of comp on log.dep, this variability is not significant once the County Variables are taken into account.

Nevertheless, it should also be tested whether the random effects for the intercept are significant. To do this the random effects of the intercept (u_0j) have to be eliminated from model 2.3:

> bb98lme1b = update(bb98lme1, random = ~comp - 1)
> anova(bb98lme1, bb98lme1b)

The test shows that these random effects are also non-significant if the random effects for comp are already in the model. This leads to the question whether the random effects are needed at all (which was the conclusion after the data examination). Therefore, I compare the models to a corresponding linear model.

> bb98lm = lm(log.dep ~ comp + pop + inc.pc + unemp +
+     comp:pop + comp:inc.pc + comp:unemp, data = bank98)

The information criteria can be compared because the models have the same fixed effects.
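For reference, the statistic behind these anova comparisons is twice the log-likelihood difference, referred to a chi-square distribution with degrees of freedom equal to the number of restricted parameters. A small Python sketch for the df = 1 case (toy log-likelihood values, not the fitted ones; it uses the identity that the chi-square(1) tail probability equals erfc(sqrt(x/2))):

```python
from math import erfc, sqrt

def lr_test_df1(loglik_full, loglik_reduced):
    """Likelihood-ratio test of a reduced model nested in a full model,
    for a single restricted parameter (df = 1).
    Returns (LR statistic, p-value)."""
    lr = 2.0 * (loglik_full - loglik_reduced)
    # survival function of chi^2 with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = erfc(sqrt(max(lr, 0.0) / 2.0))
    return lr, p

# toy values: the reduced model loses almost no log-likelihood,
# so the restricted random effect looks unnecessary (p close to 1)
lr, p = lr_test_df1(loglik_full=-500.0, loglik_reduced=-500.0000001)
```

This mirrors the logic of the comparison above: a tiny L.Ratio yields a p-value near 1, so the extra random effect is not supported by the data.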

> AIC(bb98lm)
[1] 1070
> summary(bb98lme1a)$AIC
[1] 1160
> summary(bb98lme1b)$AIC
[1] 1161

The AIC of the linear model is smaller than the AICs of the mixed models. Nevertheless, I continue my analysis with model 2.4, since its AIC is the smaller of the two mixed models and the model can possibly be improved to show the necessity of random effects in order to model the within-group correlations.

2.2.3 Variance Structure of the Residuals

Since the number of observations and the values of log.dep vary across counties, the within-group errors might vary across counties, too. A look at the residuals confirms this (Figure 2.6).

> print(plot(bb98lme1a, county ~ resid(., type = "p"), abline = 0))

Figure 2.6: Standardized residuals of model 2.4 for each county.

As a result the variance structure of model 2.4 has to be modified in order to allow heterogeneous residual variances, i.e. residual variances σ_j² for all counties j:

log.dep_ij = β_0 + β_1 comp_ij + β_2 pop_j + β_3 inc.pc_j + β_4 unemp_j
             + β_5 pop_j comp_ij + β_6 inc.pc_j comp_ij + β_7 unemp_j comp_ij
             + u_0j + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ_j² I_{n_j}),   u_0j ~ N(0, ψ_0²)    (2.5)

This specific variance structure can be modeled (see Appendix A.3.2 for more information; the number of iterations is increased because of convergence problems):

> bb98lme1var = update(bb98lme1a, weights = varIdent(form = ~1 | county),
+     control = list(maxIter = 100, msMaxIter = 100,
+         niterEM = 50, msMaxEval = 500))

The residual variances actually show some variation:

> varstruct = bb98lme1var$modelStruct$varStruct
> sigma = (1/unique(attributes(varstruct)$weights) *
+     bb98lme1var$sigma)^2
> xlabels = unique(attributes(bb98lme1var$modelStruct$varStruct)$groups)
> plot(sigma, axes = F, xlab = "", ylab = "Residual Variance",
+     type = "h")
> axis(1, at = seq(1, 27, 1), labels = xlabels, las = 2)
> axis(2)

Figure 2.7: Residual variances of model 2.5 for each county.

In order to test whether this variance structure is significant, I can compare model 2.5 to model 2.4 because the models are nested.

> anova(bb98lme1a, bb98lme1var)
            Model df AIC BIC logLik   Test L.Ratio p-value
bb98lme1a       1
bb98lme1var     2                   1 vs 2         <.0001

The new model is a significant improvement in the model fit. Therefore I continue my analysis with model 2.5.

2.2.4 Significance of the Fixed Effects

Some fixed effects of model 2.5 are not significant:

> summary(bb98lme1var)$tTable

Thus I reduce the model by stepwise t tests at the 5-percent level.

> bb98lme2 = update(bb98lme1var, fixed = . ~ . - comp:unemp)
> summary(bb98lme2)$tTable
> bb98lme3 = update(bb98lme2, fixed = . ~ . - unemp)
> summary(bb98lme3)$tTable
> bb98lme4 = update(bb98lme3, fixed = . ~ . - comp:inc.pc)
> summary(bb98lme4)$tTable
> bb98lme5 = update(bb98lme4, fixed = . ~ . - comp:pop)

In the last model (bb98lme5) all fixed effects are significant at the 5-percent level.

> summary(bb98lme5)

Linear mixed-effects model fit by REML
  Data: bank98grouped
       AIC   BIC   logLik

Random effects:
 Formula: ~1 | county
        (Intercept) Residual

 StdDev:

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | county

Fixed effects: log.dep ~ comp + pop + inc.pc
            Value Std.Error DF t-value p-value
(Intercept)
comp
pop
inc.pc

Number of Observations: 393
Number of Groups: 27

Thus the final model is specified as follows:

log.dep_ij = β_0 + β_1 comp_ij + β_2 pop_j + β_3 inc.pc_j + u_0j + ε_ij,
ε_j = (ε_1j, ..., ε_{n_j,j})^T ~ N_{n_j}(0, σ_j² I_{n_j}),   u_0j ~ N(0, ψ_0²)    (2.6)

2.3 Model Diagnostics

In the following I examine whether the error assumptions of the final model 2.6 are appropriate. These diagnostics can be divided into the examination of two sets of assumptions: those on the within-group (i.e. within-county) errors and those on the random effects.

2.3.1 Within-Group Errors

At first I have a look at the standardized residuals of model 2.6 for each county individually to assess the assumption that the within-group errors are independent and identically distributed within each county with mean 0 and variance σ_j² (Figure 2.8).

> print(plot(bb98lme5, county ~ resid(., type = "p"), abline = 0))

Figure 2.8: Standardized residuals of model 2.6 for each county.

The residuals scatter around 0 (approximately in a [−2, 2] band, i.e. the 95-percent confidence band) and show no pattern, but some residuals are large, i.e. there are several possible outliers. These problems are probably a result of the few observations

in some counties. There is not enough data to model the effects correctly, since there are 15 counties with at most 5 observations (over 50 percent of all 28 counties!):

> length(unique(county[obs98 <= 5]))
[1] 15

As a result the residual variances show a lot of variability (Figure 2.9).

> varstruct1 = bb98lme5$modelStruct$varStruct
> sigma1 = (1/unique(attributes(varstruct1)$weights) *
+     bb98lme5$sigma)^2
> xlabels1 = unique(attributes(bb98lme5$modelStruct$varStruct)$groups)
> plot(sigma1, axes = F, xlab = "", ylab = "Residual Variance",
+     type = "h")
> axis(1, at = seq(1, 27, 1), labels = xlabels1, las = 2)
> axis(2)

Figure 2.9: Residual variances of model 2.6 for each county.

There are two counties with very high residual variances: Rockland and Ontario. To examine the reasons for this I have a look at the corresponding observations:

> bank98[county == "Rockland", ]

> bank98[county == "Ontario", ]

Branch 3604 in Rockland has an unusually small value of log.dep, whereas the other four observations of the county are close to the median of log.dep (11). Therefore the residual variance is increased. In Ontario there are only two observations with different values of log.dep. As a result the residual variance is also increased.

However, the assumption of independence and normality seems approximately appropriate because the residuals scatter around 0. The variability of the standardized residuals is also approximately homogeneous across counties, and a QQ-plot (Figure 2.10) confirms these observations as well, but shows several possible outliers at the same time.

> print(qqnorm(bb98lme5, ~resid(., type = "p"), id = 0.05,
+     idLabels = branch, col = "grey", cex = 0.7))

Figure 2.10: QQ-plot of the standardized residuals of model 2.6 with possible outliers (labelled with their branch ID).

To sum up, the residuals show good characteristics, but the assumption of normality is possibly wrong because of the small sample sizes for some counties. Perhaps a fat-tailed

distribution such as the t-distribution would be a more appropriate assumption on the errors.

2.3.2 Random Effects

To assess the assumption that the random effects are normally distributed with mean 0 and variance ψ₀², I have a look at the EBLUPs (Empirical Best Linear Unbiased Predictors, see for example page 262 in Fahrmeir et al. [2007]) of the random effects for each county (Figure 2.11).

> print(plot(ranef(bb98lme5)))

Figure 2.11: EBLUPs (û_0j) of model 2.6.

The random effects are scattered around 0 and actually very small (pay attention to the scale). They show no pattern either. The QQ-plot also shows that the assumption is probably appropriate (Figure 2.12).

> print(qqnorm(bb98lme5, ~ranef(.)))

Figure 2.12: QQ-plot of the EBLUPs of model 2.6.

2.4 Discussion

The final model 2.6 corresponds to the results of the data examination, which showed positive influences of pop and inc.pc on log.dep. There is also a positive influence of comp on log.dep, which corresponds to the results of model 2.1.

> summary(bb98lme5)$coef$fixed[2:4]
    comp      pop   inc.pc

The variability in the data is reflected by the random effects of the intercept. The following plot shows the different intercepts for each county:

> print(plot(coef(bb98lme5)))

Figure 2.13: Intercepts (β̂_0 + û_0j) of model 2.6.

Although there is a lot of variability in the influence of comp on log.dep, there are no significant random effects for comp. This variability is probably explained by the variables pop and inc.pc. In view of the hierarchical model fit 2.3, the final model 2.6 can be written as follows:

log.dep_ij = α_0j + β_1 comp_ij + ε_ij
α_0j = β_0 + β_2 pop_j + β_3 inc.pc_j + u_0j        (2.7)

On the one hand, the branch-level intercept α_0j depends on the county and on the population and income within this county. On the other hand, the slope of comp does not depend on the county in which the branch is located.

However, the model diagnostics showed that there are still problems regarding the model fit. A comparison to a linear model showed no significance of the random effects of the initial model 2.3. Now I compare the final model 2.6 to a standard linear model in order to investigate whether the linear mixed model improves the model fit.

> bb98lm.test = lm(log.dep ~ comp + pop + inc.pc)
> summary(bb98lme5)$AIC
[1] 1027

> AIC(bb98lm.test)
[1] 1064

The comparison of the AICs of the two models shows that the AIC of model 2.6 is smaller, i.e. the mixed model fit is a significant improvement. A look at the standard errors of both models helps to assess the influence of the random effects:

> summary(bb98lme5)$tTable
> summary(bb98lm.test)$coef

Whereas the standard errors of the covariates are similar, the standard error of the intercept is much smaller in the mixed model than in the standard linear model. This is a result of the correlated observations within the counties, which are modeled in the mixed model but not in the linear model.

2.5 Interpretation

The examination of the data showed that the deposits of a bank branch in 1998 significantly depend on the geographical competition of the branch, on the county in which the branch is located, and on the county's population and per capita income. The influence of the geographical competition is positive, since there are probably more branches in an area where it is more likely to get new customers and more deposits. Another explanation is that competition stimulates business, i.e. a high geographical competition leads to higher deposits because of increased efforts to get new customers. The influences of the county's population and per capita income are also positive, since it is obvious that there are more deposits if there are more people and if the people earn more money. Lastly, the deposits of a bank branch also depend simply on the county the branch is located in: for example, in Kings there are more deposits than in Nassau.
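Because the model is linear on the log scale, each coefficient acts multiplicatively on the deposit scale. As a sketch in the notation of model 2.7 (the symbols are mine; the fitted estimates are not reproduced here), raising comp by one unit within a fixed county j scales the predicted deposits by a constant factor:

```latex
% Effect of one additional unit of competition, county j held fixed (sketch):
\frac{\widehat{\mathrm{Deposits}}(\mathrm{comp} + 1)}
     {\widehat{\mathrm{Deposits}}(\mathrm{comp})}
  = \frac{\exp\bigl(\hat{\alpha}_{0j} + \hat{\beta}_1 (\mathrm{comp} + 1)\bigr)}
         {\exp\bigl(\hat{\alpha}_{0j} + \hat{\beta}_1 \,\mathrm{comp}\bigr)}
  = e^{\hat{\beta}_1}
```

So the county determines the level of deposits through its intercept, while a positive estimate of β_1 scales deposits up by the factor e^{β̂_1} per unit of competition.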
For branches in Nassau the following equation explains the deposits of a branch:

Deposits of branch i = exp(α̂_0,Nassau + β̂_1 · (Competition of branch i on a scale of 100))

In Kings the equation is as follows:

Deposits of branch i = exp(α̂_0,Kings + β̂_1 · (Competition of branch i on a scale of 100))

Since α̂_0,Nassau < α̂_0,Kings, the deposits in Nassau are always lower at the same level of competition:

Figure 2.14: Relationship between deposits and competition in Nassau and Kings.

The positive influences of the population and the per capita income can also be displayed:

Figure 2.15: Influences of the population and the income on the deposits at the average level of geographical competition.

In the next chapter, the model explaining the effects on the deposits of a bank branch is extended and a more detailed interpretation will be possible.
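The improvement over the standard linear model seen in this chapter, in particular the smaller standard error of the intercept, can be traced to the within-county covariance that the random intercept induces. As a sketch (assuming, for simplicity, a common residual variance σ², although model 2.6 actually allows county-specific variances):

```latex
% Covariance structure induced by the random intercept u_{0j} \sim N(0, \psi_0^2)
% for two different branches i \neq i' in the same county j:
\operatorname{Var}(\mathrm{log.dep}_{ij}) = \psi_0^2 + \sigma^2, \qquad
\operatorname{Cov}(\mathrm{log.dep}_{ij}, \mathrm{log.dep}_{i'j}) = \psi_0^2,
\qquad
\rho = \frac{\psi_0^2}{\psi_0^2 + \sigma^2}
```

The intraclass correlation ρ quantifies how strongly observations within a county are tied together; a standard linear model sets ρ = 0 and therefore misstates the precision of the intercept.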

Chapter 3

Linear Mixed Model for the Full Data Set

After having fit a mixed model to the reduced data set, I now fit a mixed model to the complete data set. The aim is still to fit a model that describes the deposits (in log form) of branch i in county j in year t. Now I am dealing with clustered two-level data (branches within counties) that includes time effects for each level (branches, counties, state). These time effects require additional random effects because the observations are correlated over time.

3.1 Explorative Data Analysis

I examine relationships between the dependent variable and all three levels of covariates.

3.1.1 Branch Variable

To get a first impression of the relationship between log.dep and comp, I have a look at the corresponding scatter plot (Figure 3.1). There is a weak positive overall influence of comp on log.dep, but there is also a lot of variation in the data that needs to be examined. Because there are too many branches to look at each individually, I select a sample of 20 branches that are examined more closely. A groupedData object is also created to store the data in a data frame grouped by branch.

> set.seed(2)
> banksample = sample(unique(branch), 20)
> sample20 = groupedData(log.dep ~ year | branch,
+     data = bank[is.element(branch, banksample), ])

This sample includes measurements at different time points, but not each branch is measured at each time point (Figure 3.2).

> scatter.smooth(comp, log.dep, col = "grey")

Figure 3.1: Scatter plot of comp against log.dep.

> print(plot(sample20, layout = c(5, 4), aspect = 1))

Figure 3.2: Observations of 20 randomly selected branches.

Now the relationship between log.dep and comp can be investigated individually by using a Trellis display (Figure 3.3). (Pay attention to the differently scaled axes, which are used in order to get a better impression of the data.)

> print(xyplot(log.dep ~ comp | branch, data = sample20,
+     scales = list(relation = "free"), panel = function(x, y) {
+         panel.xyplot(x, y)
+         panel.lmline(x, y, lty = 2)
+     }))

Figure 3.3: Trellis display of log.dep by comp for 20 randomly selected branches. The dashed lines give linear least-squares fits.

Even if the overall influence of comp on log.dep is positive, the plots show that there are negative influences as well as positive influences for the branches in this sample. The slopes of the effects are very different, too. Thus there is a lot of variability in the effects. Another important fact is also shown in the plots: there are very few observations for some branches, and this makes it difficult to fit an appropriate standard linear model to the data. Random effects are needed to deal with this problem.
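The reason random effects cope with sparsely observed branches is shrinkage: the predicted branch effect is a compromise between the branch's own data and the overall mean. For a simple random-intercept model with n_i observations on branch i and common residual variance σ² (a sketch for illustration, not the exact model fitted later in this chapter), the EBLUP has the form:

```latex
% Shrinkage form of the EBLUP for a branch random intercept u_{0i} \sim N(0, \psi_0^2):
\hat{u}_{0i}
  = \frac{\psi_0^2}{\psi_0^2 + \sigma^2 / n_i}\;
    \frac{1}{n_i} \sum_{t=1}^{n_i} \bigl(y_{it} - x_{it}^\top \hat{\beta}\bigr)
```

The fewer observations n_i a branch has, the more its predicted effect is shrunk toward 0, so branches with little data borrow strength from the rest of the sample instead of receiving an unstable individual estimate.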


More information

Bivariate Regression Analysis. The most useful means of discerning causality and significance of variables

Bivariate Regression Analysis. The most useful means of discerning causality and significance of variables Bivariate Regression Analysis The most useful means of discerning causality and significance of variables Purpose of Regression Analysis Test causal hypotheses Make predictions from samples of data Derive

More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of

More information

22s:152 Applied Linear Regression. Returning to a continuous response variable Y...

22s:152 Applied Linear Regression. Returning to a continuous response variable Y... 22s:152 Applied Linear Regression Generalized Least Squares Returning to a continuous response variable Y... Ordinary Least Squares Estimation The classical models we have fit so far with a continuous

More information

Random Intercept Models

Random Intercept Models Random Intercept Models Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2019 Outline A very simple case of a random intercept

More information

CRP 272 Introduction To Regression Analysis

CRP 272 Introduction To Regression Analysis CRP 272 Introduction To Regression Analysis 30 Relationships Among Two Variables: Interpretations One variable is used to explain another variable X Variable Independent Variable Explaining Variable Exogenous

More information

Introduction to Simple Linear Regression

Introduction to Simple Linear Regression Introduction to Simple Linear Regression 1. Regression Equation A simple linear regression (also known as a bivariate regression) is a linear equation describing the relationship between an explanatory

More information

Validation of Visual Statistical Inference, with Application to Linear Models

Validation of Visual Statistical Inference, with Application to Linear Models Validation of Visual Statistical Inference, with pplication to Linear Models Mahbubul Majumder, Heike Hofmann, Dianne Cook Department of Statistics, Iowa State University pril 2, 212 Statistical graphics

More information

22s:152 Applied Linear Regression. In matrix notation, we can write this model: Generalized Least Squares. Y = Xβ + ɛ with ɛ N n (0, Σ)

22s:152 Applied Linear Regression. In matrix notation, we can write this model: Generalized Least Squares. Y = Xβ + ɛ with ɛ N n (0, Σ) 22s:152 Applied Linear Regression Generalized Least Squares Returning to a continuous response variable Y Ordinary Least Squares Estimation The classical models we have fit so far with a continuous response

More information

Correlation in Linear Regression

Correlation in Linear Regression Vrije Universiteit Amsterdam Research Paper Correlation in Linear Regression Author: Yura Perugachi-Diaz Student nr.: 2566305 Supervisor: Dr. Bartek Knapik May 29, 2017 Faculty of Sciences Research Paper

More information

Random and Mixed Effects Models - Part III

Random and Mixed Effects Models - Part III Random and Mixed Effects Models - Part III Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Quasi-F Tests When we get to more than two categorical factors, some times there are not nice F tests

More information

36-463/663: Multilevel & Hierarchical Models

36-463/663: Multilevel & Hierarchical Models 36-463/663: Multilevel & Hierarchical Models Some Random Effects Configurations Brian Junker 132E Baker Hall brian@stat.cmu.edu 1 Outline Random Effect Configurations Most of our models so far: Level 1

More information

Simple Linear Regression for the Climate Data

Simple Linear Regression for the Climate Data Prediction Prediction Interval Temperature 0.2 0.0 0.2 0.4 0.6 0.8 320 340 360 380 CO 2 Simple Linear Regression for the Climate Data What do we do with the data? y i = Temperature of i th Year x i =CO

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression INTRODUCTION TO CLINICAL RESEARCH Introduction to Linear Regression Karen Bandeen-Roche, Ph.D. July 17, 2012 Acknowledgements Marie Diener-West Rick Thompson ICTR Leadership / Team JHU Intro to Clinical

More information

STATISTICS 479 Exam II (100 points)

STATISTICS 479 Exam II (100 points) Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the

More information

multilevel modeling: concepts, applications and interpretations

multilevel modeling: concepts, applications and interpretations multilevel modeling: concepts, applications and interpretations lynne c. messer 27 october 2010 warning social and reproductive / perinatal epidemiologist concepts why context matters multilevel models

More information

Single and multiple linear regression analysis

Single and multiple linear regression analysis Single and multiple linear regression analysis Marike Cockeran 2017 Introduction Outline of the session Simple linear regression analysis SPSS example of simple linear regression analysis Additional topics

More information

#Alternatively we could fit a model where the rail values are levels of a factor with fixed effects

#Alternatively we could fit a model where the rail values are levels of a factor with fixed effects examples-lme.r Tue Nov 25 12:32:20 2008 1 library(nlme) # The following data shows the results of tests carried over 6 rails. The response # indicated the time needed for a an ultrasonic wave to travel

More information

Solution to Series 6

Solution to Series 6 Dr. M. Dettling Applied Series Analysis SS 2014 Solution to Series 6 1. a) > r.bel.lm summary(r.bel.lm) Call: lm(formula = NURSING ~., data = d.beluga) Residuals: Min 1Q

More information

Tentative solutions TMA4255 Applied Statistics 16 May, 2015

Tentative solutions TMA4255 Applied Statistics 16 May, 2015 Norwegian University of Science and Technology Department of Mathematical Sciences Page of 9 Tentative solutions TMA455 Applied Statistics 6 May, 05 Problem Manufacturer of fertilizers a) Are these independent

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

AMS 315/576 Lecture Notes. Chapter 11. Simple Linear Regression

AMS 315/576 Lecture Notes. Chapter 11. Simple Linear Regression AMS 315/576 Lecture Notes Chapter 11. Simple Linear Regression 11.1 Motivation A restaurant opening on a reservations-only basis would like to use the number of advance reservations x to predict the number

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

Multiple Regression Methods

Multiple Regression Methods Chapter 1: Multiple Regression Methods Hildebrand, Ott and Gray Basic Statistical Ideas for Managers Second Edition 1 Learning Objectives for Ch. 1 The Multiple Linear Regression Model How to interpret

More information

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author...

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author... From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. Contents About This Book... xiii About The Author... xxiii Chapter 1 Getting Started: Data Analysis with JMP...

More information

Solution pigs exercise

Solution pigs exercise Solution pigs exercise Course repeated measurements - R exercise class 2 November 24, 2017 Contents 1 Question 1: Import data 3 1.1 Data management..................................... 3 1.2 Inspection

More information

Mathematics for Economics MA course

Mathematics for Economics MA course Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between

More information

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression Introduction to Correlation and Regression The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested

More information