
Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors

Ludwig Fahrmeir and Stefan Lang
University of Munich, Ludwigstr. 33, Munich

Abstract

Most regression problems in practice require flexible semiparametric forms of the predictor for modelling the dependence of responses on covariates. Moreover, it is often necessary to add random effects accounting for overdispersion caused by unobserved heterogeneity or for correlation in longitudinal or spatial data. We present a unified approach for Bayesian inference via Markov chain Monte Carlo (MCMC) simulation in generalized additive and semiparametric mixed models. Different types of covariates, such as usual covariates with fixed effects, metrical covariates with nonlinear effects, unstructured random effects, trend and seasonal components in longitudinal data and spatial covariates, are all treated within the same general framework by assigning appropriate priors with different forms and degrees of smoothness. The approach is particularly appropriate for discrete and other fundamentally non-Gaussian responses, where Gibbs sampling techniques developed for Gaussian models cannot be applied, but it also works well for Gaussian responses. We use the close relation between nonparametric regression and dynamic or state space models to develop posterior sampling procedures based on Markov random field priors. They include recent Metropolis-Hastings block move algorithms for dynamic generalized linear models and extensions for spatial covariates as building blocks. We illustrate the approach with a number of applications that arose out of consulting cases, showing that the methods are computationally feasible also in problems with many covariates and large data sets.
Keywords: generalized semiparametric mixed models, Markov chain Monte Carlo, random effects, semiparametric Bayesian inference, spatial and spatio-temporal data, state space and Markov random field models, varying coefficients

1 Introduction

This work has been motivated by various regression problems that resulted from consulting cases. We consider the following applications: credit-scoring, where the aim is to model the probability that a client with certain covariates ("risk factors") will not pay back his credit as agreed upon by contract; a longitudinal study on forest damage to assess the impact of covariates like age of trees, pH value of soil and canopy density of the stand on the defoliation degree of trees; an analysis of rents paid for flats or apartments in Munich, depending on floor space, year of construction, location in the city, and a large number of indicators for equipment and

type of the flat; and an investigation of duration of unemployment in West Germany from 1980 to 1995, with spatio-temporal data from the German Federal Employment Office, including several time scales, regional effects and personal characteristics of unemployed persons. With the exception of the rent example, responses are discrete. In all cases, we have metrical or spatially correlated covariates x_1, ..., x_p, say, with unknown, possibly nonlinear effects, and a vector w of further covariates, whose influence on the predictor is assumed to be linear. In the forest damage problem, we will add unstructured, tree-specific random effects b to account for unobserved heterogeneity or correlation. Therefore an additive predictor of the form

η = f_1(x_1) + ... + f_p(x_p) + w'γ + b   (1)

seems reasonable. Together with an exponential family observation model and a suitable link function, (1) defines a generalized additive or semiparametric mixed model (GAMM). A further extension, which we consider in the forest damage study, leads to a varying coefficient mixed model (VCMM).

In the rent data application, we include structured, spatially correlated random effects that are specific for subquarters in Munich, to account for spatial heterogeneity not explained by other covariates. Similarly, in the unemployment duration analysis we include district-specific spatial random effects. From a classical perspective, spatial effects are considered as correlated random effects with an appropriate prior reflecting neighborhood relationships. For our Bayesian approach, it is more useful to consider them as spatially correlated effects f_j(x_j) of a spatial covariate x_j, and to incorporate them formally as a component of the nonparametric additive terms in (1). This leads to a unified treatment of metrical covariates and spatially correlated random effects or, in other words, spatial covariates. Nevertheless, we will call such models GAMM or VCMM.
Inference for these models with existing methodology has some drawbacks and limitations, in particular for non-Gaussian responses. For GAMMs, Lin and Zhang (1999) propose approximate inference using smoothing splines and double penalized quasi-likelihood, extending previous work of Breslow and Clayton (1993) and Breslow and Lin (1995) for generalized linear mixed models (GLMMs). As they point out in the discussion, similarly to approximate inference in GLMMs, there are bias problems, especially with binary data or correlated random effects, so that MCMC methods may be an attractive alternative. S-Plus has an option for exploratory data analysis in GAMMs with splines or loess functions, but there is no possibility for estimating variance components jointly with smoothing parameters.

In this paper we propose a general Bayesian approach via Markov chain Monte Carlo (MCMC) for inference in generalized additive and varying coefficients models, including mixed models with structured or unstructured random effects. As an important feature, all different types of covariates or effects are considered from a unified viewpoint. Usual covariates with fixed effects, metrical covariates, such as time scales with nonparametric trend or seasonal effects, unstructured random effects and spatial covariates are treated within the same general framework by assigning Markov random field smoothness priors with common structure but different degrees of smoothness to the corresponding effects. Data driven choice of smoothing parameters is automatically included. Although models with Gaussian responses are also

covered by the framework, our main interest lies in models for fundamentally non-Gaussian responses, such as binary or other discrete-valued responses. The data examples in the applications section show the practical feasibility in situations with many covariates and with large data sets. The MCMC procedure provides samples from all posteriors of interest and permits estimation of posterior means, medians, quantiles, confidence bands and predictive distributions. No approximations based on conjectures of asymptotic normality have to be made, and no "plug-in" procedures are needed.

For Gaussian responses, Gibbs sampling can be used for fully Bayesian analysis based on smoothness priors, see for example Wong and Kohn (1996), who use state space or dynamic model representations of splines in additive models without any inclusion of unstructured or spatial random effects, or Hastie and Tibshirani (1998), who derive the Gibbs sampler as a Bayesian version of backfitting. Bayesian basis function approaches with regression splines or a more general class of piecewise polynomials are proposed by Smith and Kohn (1996) and Denison et al. (1998), again without random effects. For fundamentally non-Gaussian models, more general MCMC techniques than Gibbs sampling are needed, and there is still a lack of methods and of practical experience with existing suggestions. Hastie and Tibshirani (1998) sketch a Metropolis-Hastings-type algorithm for GAMs, but do not give any examples or statements about performance. Mallick et al. (1999) propose a Bayesian MARS method for GLMs as an extension of the Bayesian curve fitting method of Denison et al. (1998), but, as they state, the sampler has slow convergence. The more recent RJMCMC algorithm developed by Biller (1999) for adaptive regression splines in semiparametric GLMs seems to have better convergence properties, and an extension to GAMMs and VCMMs might be promising.
Our approach is based on the close relationship between dynamic generalized linear models (see, e.g., Fahrmeir and Tutz, 1997, ch. 8) and generalized additive or varying coefficient models (Hastie and Tibshirani, 1990, 1993). This relationship is well known for the classical smoothing problem, where observations y = (y(1), ..., y(n)) are assumed to be the sum

y(t) = f(t) + ε(t),   ε(t) ~ N(0, σ²)   (2)

of a smooth regression function f, evaluated at equidistant design points t = 1, ..., n, and independent Gaussian noise variables. Within a dynamic or state space framework, the observation model (2) is supplemented by a linear Gaussian Markov model for the parameters or states f = (f(1), ..., f(n)). A common choice as a so-called smoothness prior is a second order random walk model f(t) = 2f(t-1) - f(t-2) + u(t), u(t) ~ N(0, τ²). For given variances σ² and τ², posterior means f^(t) and variances can be efficiently computed by the Kalman filter and smoother. For a full Bayesian analysis with hyperpriors for the variances σ² and τ², the Kalman filter and smoother can be exploited for efficient, blockwise Gibbs sampling (Carter and Kohn, 1994; Frühwirth-Schnatter, 1994). Due to the Gaussian observation model (2), the posterior mean estimates f^(t) and posterior mode estimates coincide, and are equivalent to the solution of a corresponding penalized least squares criterion, see for example Fahrmeir and Tutz (1997, ch. 8.1), or Green and

Silverman (1994), in the related context of cubic smoothing splines. Basically, this equivalence extends to additive Gaussian models as well as to non-equally spaced design points or covariate observations. For fundamentally non-Gaussian responses as considered in this paper, the observation model (2) has to be replaced by a non-Gaussian model, and, as a consequence, the equivalence between posterior mean and posterior mode estimation is lost. The linear Kalman filter and smoother is no longer applicable, and Gibbs sampling techniques cannot be reasonably applied. Therefore, we incorporate the Metropolis-Hastings algorithm with conditional prior proposals, developed by Knorr-Held (1999) in the context of dynamic generalized linear models, for drawing block move samples from posteriors of nonlinear effects f_j(x_j) of metrical covariates x_j. For spatial covariates with Markov random field priors, this block move sampler is extended in a computationally efficient way for drawing from posteriors of spatial effects.

The rest of the paper is organized as follows: Bayesian semiparametric mixed models are described in Section 2, while Section 3 contains details about the chosen MCMC techniques. In Section 4 we demonstrate the practical feasibility of our approach through the data applications mentioned in the beginning. In the first example, we reanalyze the credit-scoring data used in Fahrmeir and Tutz (1997) with a semiparametric additive model without random effects, and we apply a VCMM to the longitudinal study on forest damage. In the third example, a Gaussian additive mixed model with spatially correlated random effects is applied to the rent data example, showing that the method works well also for Gaussian responses in problems with many covariates and a large data set.
Finally, we present a space-time analysis of unemployment durations in West Germany with a massive data set from the German Federal Employment Office, through a discrete hazard model with a GAMM predictor that includes several time scales, a number of personal characteristics of unemployed persons, and regional effects.

2 Bayesian semiparametric mixed models

Consider now regression situations where observations (y_i, x_i1, ..., x_ip, w_i), i = 1, ..., n, on a response y, a vector x = (x_1, ..., x_p) of metrical or spatial covariates and a vector w of further covariates are given. In longitudinal studies, as in our applications to forest damage or to duration of unemployment in Section 4, the covariate vector will typically include one or more time scales, such as duration and calendar time. In some studies, as in the rents for flats data and the study on unemployment, a spatial covariate, such as the location of flats or districts, may be considered and appropriately incorporated into the model. Generalized additive and semiparametric models (Hastie and Tibshirani, 1990) assume that, given x_i = (x_i1, ..., x_ip) and w_i, the distribution of y_i belongs to an exponential family, with mean μ_i = E(y_i | x_i, w_i) linked to an additive semiparametric predictor η_i by

μ_i = h(η_i),   η_i = f_1(x_i1) + ... + f_p(x_ip) + w_i'γ.   (3)

Here h is a known link or response function, and f_1, ..., f_p are unknown smooth

functions of the covariates. Note that spatially correlated (random) effects of a spatial covariate x_j are included as a nonparametric component f_j(x_j). For identifiability reasons, the unknown functions are centered appropriately.

Observation models of the form (3) may be inappropriate if heterogeneity among units is not sufficiently described by covariates, or for correlated data in a longitudinal study. A common way to deal with this problem is the inclusion of additive random effects into the predictor. This leads to generalized additive mixed models (GAMMs) with a predictor of the form

η_i = f_1(x_i1) + ... + f_p(x_ip) + w_i'γ + b_{g_i},   (4)

where b_{g_i} is a unit- or group-specific random effect, with b_{g_i} = b_g if unit i is in group g, g = 1, ..., G. For example, in our analysis of the forest damage data, b_g is an additional tree-specific effect to account for correlation and unobserved heterogeneity. Due to the large number of trees, a fixed effects approach will not be feasible, and a random effects model is chosen instead. A further extension leads to varying coefficient mixed models (VCMMs),

η_i = f_1(x_i1) z_i1 + ... + f_p(x_ip) z_ip + w_i'γ + b_{g_i},   (5)

see Hastie and Tibshirani (1993) for the case without random effects. The design vector z = (z_1, ..., z_p) may contain components of x or w as well as some additional covariates. If a design variable is identically 1, e.g. z_j = 1, then the corresponding function f_j is the main effect of x_j, while terms like f_p(x_ip) z_ip model an effect of z_p that varies over x_p or, in other words, an interaction between x_p and z_p. We will use a VCMM in the forest damage study, with time-varying effects and a tree-specific random effect.

For Bayesian semiparametric inference, the unknown functions f_1, ..., f_p, more exactly the corresponding vectors of function evaluations, the parameters γ = (γ_1, ..., γ_r) and the random effects b = (b_1, ..., b_G) are all considered as random variables.
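To make the predictor structure concrete, here is a minimal numerical sketch (all names and data are illustrative, not from the paper) of evaluating the VCMM predictor (5); setting all z_ij = 1 recovers the GAMM predictor (4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 2

x = rng.uniform(0.0, 1.0, size=(n, p))   # metrical covariates x_i1, ..., x_ip
z = np.ones((n, p))                      # design variables; all 1 -> main effects, i.e. a GAMM
w = rng.normal(size=(n, 3))              # covariates with fixed (linear) effects
gamma = np.array([0.5, -1.0, 0.2])       # fixed effects
group = np.array([0, 0, 1, 1, 2, 2])     # group index g_i
b = np.array([0.3, -0.1, 0.4])           # group-specific random intercepts b_g

# stand-in smooth functions; in the paper the f_j carry smoothness priors
f = [np.sin, np.cos]

# eta_i = sum_j f_j(x_ij) z_ij + w_i' gamma + b_{g_i}
eta = sum(f[j](x[:, j]) * z[:, j] for j in range(p)) + w @ gamma + b[group]
print(eta.shape)  # (6,)
```

In an exponential family model, η would then be mapped to the mean via the response function h, e.g. the logistic function for binary responses.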
The observation models (3), (4) or (5) are understood to be conditional upon these random variables, and have to be supplemented by appropriate prior distributions. Priors for the unknown functions f_1, ..., f_p depend on the type of the covariates and on prior beliefs about the smoothness of f_j. Priors for time scales and metrical covariates are based on Gaussian smoothness priors that are common in dynamic generalized linear models, see, for example, Fahrmeir and Tutz (1997, ch. 8). For spatial covariates, priors are based on (Gaussian) Markov random fields, see for example Besag (1974), Besag et al. (1991) or Besag and Kooperberg (1995).

Let us first consider the case of a metrical covariate x with equally spaced observations x_i, i = 1, ..., n, with m ≤ n different values. Then the ordered sequence x_(1) < ... < x_(t) < ... < x_(m) defines an equidistant grid on the x-axis. The typical case for this situation arises if the covariate x corresponds to time t, and the grid points correspond to time units such as weeks, months, or years, but generally x can be any ordered dimension. Define f(t) := f(x_(t)) and let f = (f(1), ..., f(t), ..., f(m))' denote the vector of function evaluations. Then, just as for the time trends example in Section 1, common priors for smooth functions are, respectively, first or second

order random walk models

f(t) = f(t-1) + u(t)   or   f(t) = 2f(t-1) - f(t-2) + u(t)   (6)

with Gaussian errors u(t) ~ N(0, τ²) and diffuse priors f(1) ∝ const, and f(1) and f(2) ∝ const, for the initial values, respectively. Both specifications act as smoothness priors that penalize functions f that are too rough. A first order random walk penalizes abrupt jumps f(t) - f(t-1) between successive states, and a second order random walk penalizes deviations from the linear trend 2f(t-1) - f(t-2). Note that a second order random walk is derived by computing second differences, i.e. the differences of neighboring first order differences. The difference in practice between the two specifications is that estimated functions tend to be somewhat smoother for second order random walk priors. Of course, higher order difference priors are also possible. For example, if the covariate x is time t, measured in months, then a common smoothness prior for a seasonal component f(t) is

f(t) + f(t-1) + ... + f(t-11) = u(t) ~ N(0, τ²).   (7)

Next we consider the general case with non-equally spaced observations. Let x_(1) < ... < x_(t) < ... < x_(m) denote the m ≤ n strictly ordered, different observations of the covariate x, and f = (f(1), ..., f(t), ..., f(m))', with f(t) := f(x_(t)), the vector of function evaluations. Random walk or autoregressive priors have to be modified to account for the nonequal distances δ_t = x_(t) - x_(t-1) between observations. Random walks of first order are now specified by

f(t) = f(t-1) + u(t),   u(t) ~ N(0, δ_t τ²),   (8)

i.e., by adjusting the error variances from τ² to δ_t τ². Random walks of second order are

f(t) = (1 + δ_t/δ_{t-1}) f(t-1) - (δ_t/δ_{t-1}) f(t-2) + u(t),   (9)

with u(t) ~ N(0, w_t τ²), where w_t is an appropriate weight. Several possibilities are conceivable for the weights. The simplest one is w_t = δ_t, as for a first order random walk. Another choice is w_t = δ_t (1 + δ_t/δ_{t-1}), which also takes into account the distance δ_{t-1}. It can be derived from the difference

(f(t) - f(t-1))/δ_t - (f(t-1) - f(t-2))/δ_{t-1}

of first order differences, treating them as if they were independent. A related, yet different proposal for a second order autoregressive prior is given by Berzuini and Larizza (1996). Another possibility would be to use state space representations

of stochastic differential equation priors based on the work by Kohn and Ansley (1987). Biller and Fahrmeir (1997) follow this idea, but there are significant problems associated with convergence and mixing behaviour of posterior samples, compared with the priors chosen here.

Due to the Markovian specification by random walks or other autoregressive models, priors for the vector f of function evaluations are seemingly defined in an asymmetric, directed way. However, these priors can always be rewritten in an undirected, symmetric form. This follows from the fact that any discrete Markov process can be formulated in the undirected form of a Markov random field by conditioning not only on previous variables f(t-1), f(t-2), etc., but also on future variables f(t+1), f(t+2), etc. This also becomes evident from the joint Gaussian prior for the entire vector f: for each of the priors (8), (9) or (7), f has a partially improper Gaussian prior

f ~ N(0, τ² K⁻),

where K⁻ is a generalized inverse of a band-diagonal precision or penalty matrix K. For example, it is easy to see that for a random walk of first order (8), the penalty matrix is the tridiagonal matrix

K = [  δ_2^{-1}     -δ_2^{-1}                                                        ]
    [ -δ_2^{-1}   δ_2^{-1} + δ_3^{-1}     -δ_3^{-1}                                  ]
    [               -δ_3^{-1}   δ_3^{-1} + δ_4^{-1}    ...                           ]
    [                               ...   δ_{m-1}^{-1} + δ_m^{-1}     -δ_m^{-1}      ]
    [                                               -δ_m^{-1}        δ_m^{-1}        ]

which reduces to

K = [  1  -1              ]
    [ -1   2  -1          ]
    [      ...  ...  ...  ]
    [         -1   2  -1  ]
    [             -1   1  ]

for equidistant x-values.

Let us now turn our attention to a spatial covariate x, where the values of x represent the location or site in connected geographical regions. For example, in the unemployment study x represents the district in which the unemployed have their domicile, whereas in the rents for flats example x indicates the location of the flats in Munich. A common way to deal with spatial covariates is to assume that neighboring sites are more alike than two arbitrary sites. Thus, for a valid prior definition, a set of neighbors for each site x_t must be defined.
For geographical data as considered in this paper, one usually assumes that two sites x_t and x_j are neighbors if they share a common boundary. However, more sophisticated neighborhood definitions are possible, see for example Besag et al. (1991). We assume the following spatial smoothness prior for the function evaluations f(t), t = 1, ..., m, of the m different sites x_t:

f(t) | f(j), j ≠ t, τ²  ~  N( Σ_{j∈∂t} f(j)/N_t ,  τ²/N_t ),   (10)

where N_t is the number of adjacent sites and j ∈ ∂t denotes that site x_j is a neighbor of site x_t. Thus, the (conditional) mean of f(t) is an unweighted average of the function evaluations of neighboring sites. Note that for spatial data the conditioning is undirected, since there is no natural ordering of the different sites x_t, as is the case for time scales or metrical covariates. A more general prior, including (10) as a special case, is given by

f(t) | f(j), j ≠ t, τ²  ~  N( Σ_{j∈∂t} (w_tj/w_{t+}) f(j) ,  τ²/w_{t+} ),   (11)

where the w_tj are known (not necessarily equal) weights, and "+" denotes summation over the missing subscript. Such a prior is called a Gaussian intrinsic autoregression, see Besag et al. (1991) and Besag and Kooperberg (1995). The prior (10) is obtained as a special case by specifying w_tj = 1, resulting in equal weights for each neighbor of site x_t. Unequal weights are, for example, based on the common boundary length of neighboring sites or on the distance of the centroids of two sites, see Besag et al. (1991) for details. However, the applications in this paper are restricted to the prior (10) based on adjacency weights. As in the case of the autoregressive priors (8), (9) or (7), definition (11) may equivalently be written in terms of a penalty matrix K, i.e.

p(f | τ²) ∝ exp( - (1/(2τ²)) f'Kf ),   (12)

where the elements of K are given by k_tt = w_{t+} and

k_tj = -w_tj  if j ∈ ∂t,   k_tj = 0  else.

Usually this prior is improper, since K is rank deficient and thus not invertible. It is important to note that the autoregressive priors (8), (9) and (7) are all special cases of the Gaussian intrinsic autoregression (11), or the equivalent definition (12). For example, a first order random walk for equidistant observations is obtained by defining w_tj = 1 for j = t ± 1 and w_tj = 0 else.
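The construction k_tt = w_{t+}, k_tj = -w_tj covers the random walk and the spatial case alike; a small sketch of our own (illustrative, not the paper's code) of building K from a symmetric weight matrix, with the RW1 case recovered exactly as described above:

```python
import numpy as np

def penalty_matrix(W):
    """Penalty matrix of the intrinsic autoregression (11)/(12):
    k_tt = w_{t+} (row sums), k_tj = -w_tj for neighbors, 0 otherwise."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

# RW1 for m = 4 equidistant points: w_tj = 1 for j = t +/- 1, 0 else.
m = 4
W_rw1 = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
K = penalty_matrix(W_rw1)
print(K)
# the familiar tridiagonal RW1 penalty:
# [[ 1. -1.  0.  0.]
#  [-1.  2. -1.  0.]
#  [ 0. -1.  2. -1.]
#  [ 0.  0. -1.  1.]]

# K is rank deficient (the prior is improper): K annihilates constants.
print(np.allclose(K @ np.ones(m), 0.0))  # True
```

The same function applied to a 0/1 adjacency matrix of districts yields the spatial penalty matrix of prior (10).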
This close formal similarity of our priors for the function f of a metrical or a spatial covariate allows a unified MCMC algorithm that depends essentially on the penalty matrix K and is more or less independent of the type of covariate and of the definition of smoothness. This is described in detail in the next section.

For a fully Bayesian analysis, hyperpriors for the variances are introduced in a further stage of the hierarchy. This allows for simultaneous estimation of the unknown functions and the amount of smoothness. A common choice are highly dispersed inverse gamma priors

p(τ²) ~ IG(a, b).

A common choice for a and b is very small a = b, for example a = b = 0.0001, leading to almost diffuse priors for the variance parameters. An alternative, proposed for example in Besag et al. (1995), is a = 1 and a small value for b, such as b = 0.005. However, since estimation results tend to be sensitive to the choice of hyperpriors, especially in situations where data is sparse, some kind of sensitivity analysis should always be performed.

For the fixed effect parameters γ_1, ..., γ_r, we will assume independent diffuse priors p(γ_j) ∝ const, j = 1, ..., r. Another choice would be highly dispersed Gaussian priors. For random effects, we make the usual assumption that the b_g's are i.i.d. Gaussian,

b_g | v² ~ N(0, v²),   g = 1, ..., G,

and define again a highly dispersed hyperprior for v². In the following, let f = (f_1, ..., f_p), τ² = (τ²_1, ..., τ²_p), γ = (γ_1, ..., γ_r), b = (b_1, ..., b_G) denote the parameter vectors for function evaluations, variances, fixed, and random effects. Then the Bayesian model specification is completed by the following conditional independence assumptions:

i) For given covariates and parameters f, γ and b, the observations y_i are conditionally independent.
ii) The priors p(f_j | τ²_j), j = 1, ..., p, are conditionally independent.
iii) The priors for fixed and random effects, and the hyperpriors for τ²_j, j = 1, ..., p, are mutually independent.

3 MCMC inference

Full Bayesian inference is based on the entire posterior distribution

p(f, τ², γ, b | y) ∝ p(y | f, τ², γ, b) p(f, τ², γ, b).

By assumption i), the conditional distribution of the observed data y is the product of the individual likelihoods:

p(y | f, τ², γ, b) = ∏_{i=1}^n L_i(y_i; η_i),   (13)

with L_i(y_i; η_i) determined by the specific exponential family distribution and the form chosen for the predictor. Together with the conditional independence assumptions ii) and iii), we have

p(f, τ², γ, b | y) ∝ ∏_{i=1}^n L_i(y_i; η_i) ∏_{j=1}^p { p(f_j | τ²_j) p(τ²_j) } ∏_{k=1}^r p(γ_k) ∏_{g=1}^G p(b_g | v²) p(v²)

for the posterior. Bayesian inference via MCMC simulation is based on updating full conditionals of single parameters or blocks of parameters, given the rest and the data. Single-move steps, as in Carlin et al. (1992), which update each parameter f(t) separately, suffer from problems with convergence and mixing. For Gaussian models, Gibbs sampling with so-called multimove steps can be applied, see, for example, Carter and Kohn (1994) and Wong and Kohn (1996). For non-Gaussian responses, Gibbs sampling is no longer feasible and more general Metropolis-Hastings algorithms are needed. We adopt and extend a computationally very efficient MH algorithm with conditional prior proposals, developed recently by Knorr-Held (1999) for dynamic generalized linear models. Convergence and mixing are considerably improved by block moves, where blocks f[r, s] = (f(r), ..., f(s)) of parameters are updated instead of single parameters f(t). Suppressing conditioning parameters and data notationally, the full conditionals for the blocks f[r, s] are

p(f[r, s] | ·) ∝ L(f[r, s]) p(f[r, s] | f(l), l ∉ [r, s], τ²).

The first factor L(f[r, s]) is the product of all likelihood contributions in (13) that depend on f[r, s]. The second factor, the conditional distribution of f[r, s] given the rest f(l), l ∉ [r, s], is a multivariate Gaussian distribution. Its mean and covariance matrix can be written in terms of the precision matrix K of f. Let K[r, s] denote the submatrix of K given by the rows and columns numbered r to s, and let K[1, r-1] and K[s+1, m] denote the matrices left and right of K[r, s].
Then the (conditional) mean μ[r, s] and the covariance matrix Σ[r, s] are given by

μ[r, s] = -K[r, s]^{-1} K[s+1, m] f[s+1, m]                              for r = 1,
μ[r, s] = -K[r, s]^{-1} K[1, r-1] f[1, r-1]                              for s = m,
μ[r, s] = -K[r, s]^{-1} ( K[1, r-1] f[1, r-1] + K[s+1, m] f[s+1, m] )    else,

and

Σ[r, s] = τ² K[r, s]^{-1},

respectively. This result may be obtained by applying the usual formulae for conditional Gaussian distributions. MH block move updates for f[r, s] are obtained by drawing a conditional prior proposal f*[r, s] from the conditional Gaussian N(μ[r, s], Σ[r, s]) and accepting it with probability

min{ 1, L(f*[r, s]) / L(f[r, s]) }.
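A minimal sketch of one such block move, in our own notation and with an illustrative Gaussian likelihood (the paper targets non-Gaussian likelihoods, but the update has the same form):

```python
import numpy as np

rng = np.random.default_rng(1)

def block_update(f, K, tau2, r, s, loglik_block):
    """One MH step with a conditional prior proposal for f[r..s].
    K is the penalty (prior precision / tau^2) matrix of the full vector f."""
    idx = slice(r, s + 1)
    K_rs = K[idx, idx]
    # Sigma[r,s] = tau^2 K[r,s]^{-1}
    Sigma = tau2 * np.linalg.inv(K_rs)
    # mu[r,s] = -K[r,s]^{-1} (K_left f_left + K_right f_right); tau^2 cancels
    rest = np.r_[0:r, s + 1:len(f)]
    mu = -np.linalg.solve(K_rs, K[idx, :][:, rest] @ f[rest])
    proposal = rng.multivariate_normal(mu, Sigma)
    # prior terms cancel: only the likelihood ratio enters the acceptance
    log_ratio = loglik_block(proposal) - loglik_block(f[idx])
    if np.log(rng.uniform()) < log_ratio:
        f[idx] = proposal
    return f

# toy use: RW1 penalty on m = 6 points, Gaussian likelihood around y
K = np.diag([1., 2, 2, 2, 2, 1]) - np.diag(np.ones(5), 1) - np.diag(np.ones(5), -1)
y = np.array([0.0, 0.5, 1.0, 1.0, 0.5, 0.0])
f = np.zeros(6)
f = block_update(f, K, tau2=1.0, r=2, s=3,
                 loglik_block=lambda fb: -0.5 * np.sum((y[2:4] - fb) ** 2))
print(f.shape)  # (6,)
```

In a full sampler this step is repeated over a partition of [1, m] into blocks, cycling with the updates for γ, b and the variances described below.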

A fast implementation of the MCMC updates requires efficient computation of the mean μ[r, s]. For that reason we define the matrices K[r, s]_l = K[r, s]^{-1} K[1, r-1] and K[r, s]_r = K[r, s]^{-1} K[s+1, m]. Then the conditional mean may be rewritten as

μ[r, s] = -K[r, s]_r f[s+1, m]                              for r = 1,
μ[r, s] = -K[r, s]_l f[1, r-1]                              for s = m,
μ[r, s] = -( K[r, s]_l f[1, r-1] + K[r, s]_r f[s+1, m] )    else.

Knorr-Held's (1999) conditional prior proposal refers to autoregressive priors in dynamic models, where the precision matrix K has nonzero diagonal bands, implying sparse structures for K[r, s]_r and K[r, s]_l, where most elements are zero. For example, for a first order random walk only the first column of K[r, s]_r and the last column of K[r, s]_l contain nonzero elements. For spatial covariates, the precision matrix is no longer band-diagonal but still sparse, with nonzero entries reflecting neighborhood relationships, so that sparse matrix operations can be used to improve computational efficiency. Computing μ[r, s] and Σ[r, s]^{1/2}, as required for drawing the conditional prior proposal f*[r, s], then proceeds as follows: for every block [r, s] we compute and store the matrices K[r, s]^{-1/2}, K[r, s]_l and K[r, s]_r in advance, where the last two are stored as sparse matrices. In every iteration of our MCMC algorithm, computing μ[r, s] and Σ[r, s]^{1/2} requires at most two (sparse) matrix multiplications with a column vector and multiplication of the result with a scalar involving τ². From a computational point of view, another main advantage is the simple form of the acceptance probability: only the likelihood has to be computed, no first or second derivatives are involved, which considerably reduces the number of calculations.

The full conditional for a variance parameter τ² is again an inverse gamma distribution,

p(τ² | ·) ~ IG(a', b').   (14)

Its parameters a' and b' can again be expressed in terms of the penalty matrix K, and are given by

a' = a + rank(K)/2   and   b' = b + f'Kf/2,

respectively.
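The inverse gamma update can be sketched as follows (our illustrative code; NumPy parametrizes the Gamma distribution, so we draw the precision 1/τ² ~ Gamma(a', 1/b') and invert):

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_tau2(f, K, a=1.0, b=0.005):
    """Gibbs draw from the full conditional (14): IG(a', b') with
    a' = a + rank(K)/2 and b' = b + f'Kf/2."""
    a_new = a + np.linalg.matrix_rank(K) / 2.0
    b_new = b + 0.5 * f @ K @ f
    return 1.0 / rng.gamma(shape=a_new, scale=1.0 / b_new)

# RW1 penalty for m = 4 equidistant points (rank m - 1 = 3)
K = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
tau2 = draw_tau2(np.array([0.1, 0.4, 0.2, 0.0]), K)
print(tau2 > 0)  # True
```

Note that f'Kf is the quadratic penalty of the current function evaluations, so rougher functions pull τ² upward, which is exactly the data-driven smoothing described above.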
Thus, updating of the variance parameters can be done by simple Gibbs steps, drawing directly from the inverse gamma densities (14). Once again, a fast implementation requires sparse matrix multiplications for computing f'Kf. With a diffuse prior p(γ_j) ∝ const for the fixed effects parameters, the full conditional for γ is

p(γ | ·) ∝ ∏_{i=1}^n L_i(y_i; η_i).

Updating of γ can in principle be done by MH steps with a random walk proposal q(γ, γ*), but a serious problem is tuning, i.e. specifying a suitable covariance matrix for the proposal that guarantees high acceptance rates and good mixing. Especially when the dimension of γ is high, with significant correlations among components, tuning "by hand" is no longer feasible. An alternative is the weighted least squares proposal suggested by Gamerman (1997). Here a Gaussian proposal with mean m(γ) and covariance matrix C(γ) is used, where γ is the current state of the chain. The mean m(γ) is obtained by making one Fisher scoring step to maximize the full conditional p(γ | ·), and C(γ) is the inverse of the expected Fisher information, evaluated at the current state of the chain. In this case the acceptance probability of a proposed new vector γ* is

min{ 1, [ p(γ* | ·) q(γ*, γ) ] / [ p(γ | ·) q(γ, γ*) ] }.   (15)

Note that q is not symmetric, because the covariance matrix C of q depends on γ. Thus, in principle, the fraction q(γ*, γ)/q(γ, γ*) cannot be omitted from (15). In practice, however, experience shows that this fraction is almost always near one, so omitting it does not significantly affect the efficiency of the algorithm, but leads to a considerable saving in computation. Further computing time can be saved by omitting the Fisher scoring step when computing the mean of Gamerman's proposal, and simply taking the current state of the chain as the mean. Compared to Gamerman's original proposal, our slightly modified updating scheme for the fixed effects parameters is more efficient and avoids tuning "by hand".

For an additional random intercept, the full conditional for the parameter b_g is given by

p(b_g | ·) ∝ ∏_{i: g_i = g} L_i(y_i; η_i) p(b_g | v²).

Here a simple Gaussian random walk proposal with mean b_g and variance v² works well in most cases. To improve mixing, tuning is sometimes required, by multiplying the prior variance v² in the proposal with a constant factor, e.g. 2.
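The random walk update for a random intercept can be sketched as follows; the tuning factor c = 2 is the one mentioned above, while the Bernoulli-logit likelihood and all data are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def logit_loglik(y, eta):
    # log-likelihood of independent Bernoulli responses with logit link
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

def update_bg(b_g, v2, y_g, eta_rest, c=2.0):
    """One MH step for b_g; eta_rest is the predictor of group g without b_g.
    Propose b* ~ N(b_g, c * v^2), accept with likelihood-times-prior ratio."""
    prop = rng.normal(b_g, np.sqrt(c * v2))
    log_ratio = (logit_loglik(y_g, eta_rest + prop)
                 - logit_loglik(y_g, eta_rest + b_g)
                 - 0.5 * (prop ** 2 - b_g ** 2) / v2)  # N(0, v^2) prior ratio
    return prop if np.log(rng.uniform()) < log_ratio else b_g

b_new = update_bg(0.0, 1.0, y_g=np.array([1, 0, 1]),
                  eta_rest=np.array([0.2, -0.1, 0.5]))
print(np.isfinite(b_new))  # True
```

Unlike the conditional prior proposal for the blocks f[r, s], the prior ratio does not cancel here, because the proposal is a plain random walk rather than a draw from the prior conditional.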
An alternative is, again, Gamerman's weighted least squares proposal or a slight modification of it. This becomes especially attractive when the observation model contains one or more random slope parameters in addition to the random intercept. By analogy to the variance parameters τ²_j of the nonparametric terms, the full conditional of v² is again an inverse gamma distribution, so updating is straightforward.

4 Applications

The following applications arose out of consulting cases and show the practical feasibility of the methods even for complex models. In the first example, a reanalysis of the credit-scoring data described in Fahrmeir and Tutz (1997) illustrates performance for a semiparametric model without random effects. The data are available from the authors. In the other three applications, we use a VCMM with unstructured random effects for the forest damage data, and GAMMs with spatial random effects

and many covariates for the large data set applications on rents and unemployment duration. To assess the sensitivity of the estimated functions to the hyperparameters a and b of the prior for the variance parameters, in all applications models were (re)estimated with different choices of a and b. We tested four different choices of hyperparameters, that is a = b = 0.01, a = b = 0.0001, a = 1, b = 0.01 and a = 1, b = 0.005. However, since in all applications except the forest damage data estimation results differ only slightly across the different choices of hyperpriors, results are printed only for our standard choice of hyperparameters, that is a = 1 and b = 0.005. Only for the forest damage application do we give a comparison of relative changes of the estimated functions with respect to the different hyperparameters.

4.1 Credit-Scoring

In our first application, we reanalyze the credit-scoring problem described in Fahrmeir and Tutz (1997, ch. 2.1). The aim of credit-scoring is to model or predict the probability that a client with certain covariates ("risk factors") is to be considered as a potential risk, and therefore will probably not pay back his credit as agreed upon by contract. The data set consists of 1000 consumers' credits from a South German bank. The response variable is "creditability", which is given in dichotomous form (y = 0 for creditworthy, y = 1 for not creditworthy). In addition, 20 covariates assumed to influence creditability were collected.
As in Fahrmeir and Tutz (1997), we will use a subset of these data, containing only the following covariates, which are partly metrical and partly categorical:

x1  running account, trichotomous with categories "no running account" (= 1), "good running account" (= 2), "medium running account" ("less than 200 DM" = 3 = reference category)
x3  duration of credit in months, metrical
x4  amount of credit in DM, metrical
x5  payment of previous credits, dichotomous with categories "good", "bad" (= reference category)
x6  intended use, dichotomous with categories "private" or "professional" (= reference category)
x8  marital status, with reference category "living alone".

Effect coding is used for all categorical covariates. A parametric logit model for the probability pr(y = 1 | x) of being not creditworthy leads to the conclusion that the covariate "amount of credit" has no significant influence on the risk. Here, we reanalyze the data with a semiparametric logit model

log[pr(y = 1 | x) / (1 − pr(y = 1 | x))] = β0 + β1 x1(1) + β2 x1(2) + f3(x3) + f4(x4) + β5 x5 + β6 x6 + β8 x8,

where x1(1) and x1(2) are dummies for the categories "good" and "medium" running account. The predictor has semiparametric additive form: the smooth functions f3(x3), f4(x4) of the metrical covariates "duration of credit" and "amount of credit"

are estimated nonparametrically using second order random walk models for non-equally spaced observations. The constant β0 and the effects β1, β2, β5, β6, β8 of the remaining categorical covariates are considered as fixed and estimated jointly with the curves. Figure 1 shows estimates for the curves f3 and f4. For comparison, cubic smoothing splines are included in addition to posterior mean estimates. Although cubic splines are posterior mode estimators and the penalty terms are not exactly the same, both estimates are close. While the effect of the variable "duration of credit" is almost linear, the effect of "amount of credit" is clearly nonlinear. The curve has a bathtub shape, and indicates that not only high credits but also low credits increase the risk, compared to "medium" credits between 3000 and 6000 DM. Apparently, if the influence is misspecified by assuming a linear function β4 x4 instead of f4(x4), the estimated effect β̂4 will be near zero, corresponding to an almost horizontal line β̂4 x4 near zero, and falsely considered as nonsignificant. Table 1 gives the posterior means together with 80% credible intervals and, for comparison, maximum likelihood estimates of the remaining effects. Both estimates are in close agreement. They also have the same signs and are quite near to the estimates for a parametric logit model given in Fahrmeir and Tutz (1997), so that interpretation remains qualitatively the same for these constant effects. Closeness of fixed effects is in agreement with asymptotic normality results for penalized likelihood estimation in semiparametric GLMs by Mammen and van de Geer (1997). Closeness of posterior means and modes of the nonparametric components is in agreement with empirical evidence and our own experience in the related context of non-Gaussian state space models. A heuristic justification is given in Fahrmeir and Wagenpfeil (1997), but rigorous asymptotic results are still missing.
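As an illustration of the RW(2) smoothness prior used for f3 and f4, the following sketch builds the penalty (precision) matrix for equally spaced function values; the non-equally spaced version actually used here additionally weights the differences by the distances between observations. The function name is ours.

```python
import numpy as np

def rw2_precision(m):
    """Precision (penalty) matrix K = D'D of a second-order random walk
    prior over m equally spaced function values. K is rank-deficient:
    its null space contains constant and linear sequences, which are
    therefore left unpenalized."""
    D = np.zeros((m - 2, m))
    for i in range(m - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]   # second differences
    return D.T @ D
```

The null space of K is what makes the prior partially improper, so the level and slope of each curve are determined by the data alone.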
A main interest in credit scoring is prediction, that is, to predict the probability of failure π0 for a new client with characteristics x0. Within a Bayesian framework this can easily be done by drawing samples from the predictive distribution. The advantage compared to existing frequentist approaches is that we obtain not only point estimates but also exact credible intervals to assess the uncertainty of the prediction. Suppose for example that we have a new client with covariate vector x0 = (−1, −1, 16, 1.4, 1, −1, 1). Then sampling from the predictive distribution results in a predicted probability of π0 = 0.35 (posterior mean) and a credible interval of (0.2, 0.54) based on the posterior 10 and 90 percent quantiles. It should be noted that the credit scoring data come from a stratified sample with 300 "bad" and 700 "good" credits. This explains the high value of the risk π0. To obtain realistic values a stratification correction is necessary, compare Anderson (1980).

Table 1: Estimates of constant parameters for the credit-scoring data.
covariate | mean | 10% quantile | 90% quantile | ML estimator
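Sampling from the predictive distribution, as done for the new client above, amounts to pushing each MCMC draw of the predictor through the logit link. A minimal sketch, where eta0_samples stands for draws of the new client's predictor η0 (fixed effects plus nonparametric terms evaluated at x0) taken from the sampler's output; the function name is ours:

```python
import numpy as np

def predictive_risk(eta0_samples):
    """Monte Carlo predictive summary for a new client: transform each
    posterior draw of eta_0 to a probability and summarize by the
    posterior mean and the 10%/90% quantiles (an 80% credible interval,
    as used in the text)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(eta0_samples)))
    return p.mean(), tuple(np.quantile(p, [0.10, 0.90]))
```

Because the quantiles are taken on the probability scale, the interval automatically respects the (0, 1) bounds.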

Figure 1: Estimated effects of duration of credit (in months) and amount of credit (in 1000 DM). Shown is the posterior mean within 80% credible regions and, for comparison, cubic smoothing splines (dotted lines).

4.2 Forest damage

In this longitudinal study, we analyze the influence of calendar time, age of trees, pH value of the soil and canopy density of the stand on the damage state of beeches at the stand. Data have been collected in yearly forest damage inventories carried out in the forest district of Rothenbuch in northern Bavaria from 1983 onwards. There are 80 observation points with occurrence of beeches spread over the whole area. We use the degree of defoliation as a binary indicator for damage state, with y_it = 1 for "light or distinct damage" of tree i in year t, y_it = 0 for "no damage". Figure 2 shows relative frequencies of the response "damage state" over the years. There is a clear pattern with a maximum of damage around 1986, while trees seem to recover in the following five years. A detailed data description can be found in Göttlein and Pruscha (1996). Covariates used here are defined as follows:

Figure 2: Relative frequencies of the response "damage state" over the years.

A   age of tree at the beginning of the study in 1983, measured in three categories: "below 50 years" (= 1), "between 50 and 120 years" (= 2), and "above 120 years" (reference category);
CD  canopy density at the stand, with categories "low" (= 1), "medium" (= 2), and "high" (reference category);
pH  pH value of the soil near the surface, measured as a metrical covariate with values ranging from a minimum of 3.3 to a maximum of 6.1.

The covariates pH value and canopy density are time varying, while age is time constant by definition. Based on results of Gieger (1998) with a marginal model, we use a logistic VCMM with predictor

η = f1(t) + f2(t) A(1) + f3(t) A(2) + f4(pH) + β1 CD(1) + β2 CD(2) + b,

where t is calendar time in years, and A(1), A(2), CD(1), CD(2) are dummy variables for A and CD. The impact of calendar time is modelled by a baseline effect f1(t) and time-varying effects f2(t), f3(t) of the age categories, and the possibly nonlinear effect of pH is also modelled nonparametrically, using RW(2) smoothness priors. The tree-specific random effect b accounts for correlation and unobserved heterogeneity. Figure 3 shows posterior mean estimates for the baseline effect f1(t), the time-varying effects f2(t) and f3(t) of age, and the effect of the pH value, f4(pH). The baseline effect f1(t) corresponds to the time trend of old trees over 120 years. Trees in this age group recovered somewhat after 1986, but then the probability of damage increases again. This is in contrast to the effect of the age groups A(1) (young trees) and A(2) (medium age). The estimated curve f2(t) is significantly negative and decreasing, that is, in comparison to old trees, younger trees have lower probabilities of being damaged and they seem to recover better over the years. The effect for trees of medium age in figure 3c) is similar but less pronounced.
As we might have expected, low pH values, i.e., acid soil, have a positive effect on damage probability; this effect decreases, however, as the soil becomes less acid, until it vanishes.

Figure 3: Estimated nonparametric functions for the forest damage data: a) baseline effect of calendar time, b) time-varying effect of age category 1, c) time-varying effect of age category 2, d) effect of pH value. Shown is the posterior mean within 80% credible regions.

Table 2: Relative changes of estimated functions for different choices of hyperparameters (columns: a, b, f1(t), f2(t), f3(t), f4(pH)).

Note that the impact of this effect seems to be very small, because the credible intervals are very large, indicating strong uncertainty about the estimated effect. This becomes even more obvious if we compare estimation results for different choices of hyperparameters of the variance parameters. Table 2 compares relative changes of estimated functions for different choices of hyperpriors with respect to our standard choice a = 1 and b = 0.005. As can be seen, estimates for the effect of the pH value change considerably for different hyperpriors, while the remaining effects are more or less unaffected (at least the plotted functions cannot be distinguished) by the respective choice of hyperparameters. This demonstrates once again the uncertainty about the effect of the pH value. Estimated effects for canopy density are given in Table 3. Stands with low (CD(1)) or medium (CD(2)) density have an increased probability of damage compared to

Table 3: Estimates of constant parameters for the forest damage data (columns: covariate, mean, 10% quantile, 90% quantile; rows: CD(1), CD(2)).

stands with high canopy density: it seems that beeches get more shelter from bad environmental influences in stands with high canopy density.

4.3 Rents for flats in Munich

According to the German rental law, owners of apartments or flats can base an increase in the amount that they charge for rent on "average rents" for flats comparable in type, size, equipment, quality and location in a community. To provide information about these "average rents", most larger cities publish "rental guides", which can be based on regression analyses with rent as the dependent variable. We use data from the City of Munich, collected in 1998 by Infratest Sozialforschung for a random sample of more than 3000 flats. As response variable we chose

R   monthly net rent per square meter in German marks, that is, the monthly rent minus calculated or estimated utility costs.

Covariates characterizing the flat were constructed from almost 200 variables out of a questionnaire answered by tenants of the flats. In the following reanalysis, we include 27 covariates. Here is a selection of some typical covariates:

F   floor space in square meters
A   year of construction
H   0/1 indicator of no central heating
B   0/1 indicator of no bathroom
E   0/1 indicator of bathroom equipment above average
K   0/1 indicator of kitchen equipment above average
W   0/1 indicator of no central warm water system
S   0/1 indicator of a large balcony to the south or west
O   0/1 indicator of a simple, old building, constructed before 1949
R   0/1 indicator of an old building, renovated, in good state
N   0/1 indicator of a new modern building, built after 1978

For the official Munich '99 rental guide, location in the city was assessed in three categories (average, good, top) by experts. Our main focus here is to investigate empirically the validity of this experts' assessment.
For that reason we estimated two models, one with the experts' assessment excluded from the predictor and another with it included. In both models we include additional spatially correlated random effects that are specific for subquarters ("Bezirksviertel") in Munich to account for extra spatial variation. If the experts' assessment is valid, the extra spatial variation measured by the random effects should decrease considerably in model two.

So we choose Gaussian additive mixed models with predictor

η = β0 + f1(F) + f2(A) + f3(s) + w′γ

for model one and

η = β0 + f1(F) + f2(A) + f3(s) + w′γ + β1 L1 + β2 L2

for model two, where L1 and L2 are 0/1 indicators for good and top locations, assessed by experts. In both cases w is the vector of 25 binary indicators, including those mentioned above, and f3(s) is a spatial random effect for the subquarter s = s_i in which flat i is located. Except for the random effects, we give here only results for model one, since results for the other effects differ only slightly. Figure 4a) shows the strong influence of floor space on rents: small flats and apartments are considerably more expensive than larger ones, but this nonlinear effect becomes smaller with increasing floor space. The effect of year of construction on rents in figure 4b) is more or less constant until the '50s. It then distinctly increases until about 1990, and stabilizes on a high level in the '90s. Estimates for fixed effects of selected binary indicators are given in Table 4, showing positive or negative effects as expected. By construction, an estimated effect is the additional positive or negative amount of net rent per square meter caused by the presence (= 1) of the indicator for a flat.

Figure 4: Estimated effects of floor space (in square meters) and year of construction for the rent data. Shown is the posterior mean within 80% credible regions.

Figure 5a) shows a map of Munich, displaying subquarters and their spatial random effects f3(s) for model one, i.e., without the experts' location indicators. Figure 5b) shows the corresponding map for model two, with inclusion of the location indicators L1 and L2. In figure 5a) there is still a lot of spatial variation, showing increased rents in the downtown area and near parks. The remaining variation after inclusion of L1 and L2 is much smoother.
This shows that the experts' assessment explains much, but not all, of the effect of location on rents.
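The spatially correlated random effects for subquarters rest on a Markov random field prior whose precision matrix is determined by the neighbourhood structure of the map. A minimal sketch of the usual intrinsic specification, with a hypothetical adjacency list standing in for the Munich subquarter map; the function name is ours:

```python
import numpy as np

def mrf_precision(neighbors):
    """Precision matrix of an intrinsic Markov random field prior for
    spatial effects: K[s, s] = number of neighbours of site s, and
    K[s, r] = -1 if sites r and s are adjacent. 'neighbors' maps each
    site index to the list of its adjacent site indices."""
    n = len(neighbors)
    K = np.zeros((n, n))
    for s, nbrs in neighbors.items():
        K[s, s] = len(nbrs)
        for r in nbrs:
            K[s, r] = -1.0
    return K
```

As with the random walk priors, K has zero row sums, so only contrasts between neighbouring sites are penalized.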

Figure 5: Estimated spatial random effects for model one and model two. Shown is the posterior mean.

Table 4: Estimates of selected constant parameters for the rents of flats (columns: covariate, mean, 10% quantile, 90% quantile; rows: const, H, B, E, K, W, S, O, R, N).

4.4 Duration of unemployment

In this fourth application, we analyze official unemployment data from the German Federal Employment Office ("Bundesanstalt für Arbeit"). Our analysis is based on a sample of approximately 6300 males with one or more unemployment spells during the observation period from 1980 to 1995. Since the dataset must be augmented to apply an ordinary logistic regression model, see below, we end up with a very large dataset. Typical questions that arise in studies on duration of unemployment are: How can the baseline effect (duration dependence) be modelled? How can trend and seasonal effects of calendar time be flexibly incorporated? What effect has age? Are there regional differences in the probability of leaving unemployment and finding a new job? An important problem in connection with persistent unemployment in Europe in the 1990s is the effect of unemployment compensation and social welfare. Are there negative side-effects of public unemployment compensation? Our analysis is based on the following covariates:

D   calendar time, measured in months
A   age (in years) at the beginning of unemployment
N   nationality, dichotomous with categories "German" and "foreigner" (= reference category)
U   unemployment compensation, trichotomous with categories "unemployment benefit" (= reference category), "unemployment assistance" (U1) and "subsistence allowance" (U2)
C   district in which the unemployed have their domicile

Note that calendar time D and unemployment compensation U are both duration-time dependent covariates. As in our first application, effect coding is used for all categorical covariates. Since duration of unemployment is measured in months, we use a discrete time duration model as described in Fahrmeir and Tutz (1997, ch. 9).
Let T ∈ {1, …, q + 1} denote the duration of unemployment, where T = t means that the spell ends in month t after the beginning of unemployment.
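The augmentation mentioned above, which makes the dataset so large, turns each spell into one binary observation per month at risk, so that an ordinary logistic regression on the augmented rows estimates the discrete-time hazard. A minimal sketch with hypothetical names:

```python
def person_period(durations, events):
    """Augment duration data to person-period form: a spell of length t
    contributes t binary rows, with y = 0 for months 1, ..., t-1 and
    y = event indicator (1 = spell ended, 0 = censored) in month t.
    Returns (spell index, month, y) triples."""
    rows = []
    for i, (t, d) in enumerate(zip(durations, events)):
        for month in range(1, t + 1):
            y = int(d) if month == t else 0
            rows.append((i, month, y))
    return rows
```

Duration-time dependent covariates such as D and U are then simply attached to each (spell, month) row at their value in that month.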


More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Rewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35

Rewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35 Rewrap ECON 4135 November 18, 2011 () Rewrap ECON 4135 November 18, 2011 1 / 35 What should you now know? 1 What is econometrics? 2 Fundamental regression analysis 1 Bivariate regression 2 Multivariate

More information

Implementing componentwise Hastings algorithms

Implementing componentwise Hastings algorithms Computational Statistics & Data Analysis 48 (2005) 363 389 www.elsevier.com/locate/csda Implementing componentwise Hastings algorithms Richard A. Levine a;, Zhaoxia Yu b, William G. Hanley c, John J. Nitao

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

PENALIZED STRUCTURED ADDITIVE REGRESSION FOR SPACE-TIME DATA: A BAYESIAN PERSPECTIVE

PENALIZED STRUCTURED ADDITIVE REGRESSION FOR SPACE-TIME DATA: A BAYESIAN PERSPECTIVE Statistica Sinica 14(2004), 715-745 PENALIZED STRUCTURED ADDITIVE REGRESSION FOR SPACE-TIME DATA: A BAYESIAN PERSPECTIVE Ludwig Fahrmeir, Thomas Kneib and Stefan Lang University of Munich Abstract: We

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Deposited on: 07 September 2010

Deposited on: 07 September 2010 Lee, D. and Shaddick, G. (2008) Modelling the effects of air pollution on health using Bayesian dynamic generalised linear models. Environmetrics, 19 (8). pp. 785-804. ISSN 1180-4009 http://eprints.gla.ac.uk/36768

More information

Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals. John W. Mac McDonald & Alessandro Rosina

Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals. John W. Mac McDonald & Alessandro Rosina Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals John W. Mac McDonald & Alessandro Rosina Quantitative Methods in the Social Sciences Seminar -

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Spatially Adaptive Smoothing Splines

Spatially Adaptive Smoothing Splines Spatially Adaptive Smoothing Splines Paul Speckman University of Missouri-Columbia speckman@statmissouriedu September 11, 23 Banff 9/7/3 Ordinary Simple Spline Smoothing Observe y i = f(t i ) + ε i, =

More information

output dimension input dimension Gaussian evidence Gaussian Gaussian evidence evidence from t +1 inputs and outputs at time t x t+2 x t-1 x t+1

output dimension input dimension Gaussian evidence Gaussian Gaussian evidence evidence from t +1 inputs and outputs at time t x t+2 x t-1 x t+1 To appear in M. S. Kearns, S. A. Solla, D. A. Cohn, (eds.) Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 999. Learning Nonlinear Dynamical Systems using an EM Algorithm Zoubin

More information

CV-NP BAYESIANISM BY MCMC. Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUEZ

CV-NP BAYESIANISM BY MCMC. Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUEZ CV-NP BAYESIANISM BY MCMC Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUE Department of Mathematics and Statistics University at Albany, SUNY Albany NY 1, USA

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

* * * * * * * * * * * * * * * ** * **

* * * * * * * * * * * * * * * ** * ** Generalized Additive Models Trevor Hastie y and Robert Tibshirani z Department of Statistics and Division of Biostatistics Stanford University 12th May, 1995 Regression models play an important role in

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

13 Notes on Markov Chain Monte Carlo

13 Notes on Markov Chain Monte Carlo 13 Notes on Markov Chain Monte Carlo Markov Chain Monte Carlo is a big, and currently very rapidly developing, subject in statistical computation. Many complex and multivariate types of random data, useful

More information

Bayesian Analysis of Vector ARMA Models using Gibbs Sampling. Department of Mathematics and. June 12, 1996

Bayesian Analysis of Vector ARMA Models using Gibbs Sampling. Department of Mathematics and. June 12, 1996 Bayesian Analysis of Vector ARMA Models using Gibbs Sampling Nalini Ravishanker Department of Statistics University of Connecticut Storrs, CT 06269 ravishan@uconnvm.uconn.edu Bonnie K. Ray Department of

More information

Duration Analysis. Joan Llull

Duration Analysis. Joan Llull Duration Analysis Joan Llull Panel Data and Duration Models Barcelona GSE joan.llull [at] movebarcelona [dot] eu Introduction Duration Analysis 2 Duration analysis Duration data: how long has an individual

More information

Research Division Federal Reserve Bank of St. Louis Working Paper Series

Research Division Federal Reserve Bank of St. Louis Working Paper Series Research Division Federal Reserve Bank of St Louis Working Paper Series Kalman Filtering with Truncated Normal State Variables for Bayesian Estimation of Macroeconomic Models Michael Dueker Working Paper

More information

LASSO-Type Penalization in the Framework of Generalized Additive Models for Location, Scale and Shape

LASSO-Type Penalization in the Framework of Generalized Additive Models for Location, Scale and Shape LASSO-Type Penalization in the Framework of Generalized Additive Models for Location, Scale and Shape Nikolaus Umlauf https://eeecon.uibk.ac.at/~umlauf/ Overview Joint work with Andreas Groll, Julien Hambuckers

More information

1. INTRODUCTION Propp and Wilson (1996,1998) described a protocol called \coupling from the past" (CFTP) for exact sampling from a distribution using

1. INTRODUCTION Propp and Wilson (1996,1998) described a protocol called \coupling from the past (CFTP) for exact sampling from a distribution using Ecient Use of Exact Samples by Duncan J. Murdoch* and Jerey S. Rosenthal** Abstract Propp and Wilson (1996,1998) described a protocol called coupling from the past (CFTP) for exact sampling from the steady-state

More information

data lam=36.9 lam=6.69 lam=4.18 lam=2.92 lam=2.21 time max wavelength modulus of max wavelength cycle

data lam=36.9 lam=6.69 lam=4.18 lam=2.92 lam=2.21 time max wavelength modulus of max wavelength cycle AUTOREGRESSIVE LINEAR MODELS AR(1) MODELS The zero-mean AR(1) model x t = x t,1 + t is a linear regression of the current value of the time series on the previous value. For > 0 it generates positively

More information

Pruscha: Semiparametric Estimation in Regression Models for Point Processes based on One Realization

Pruscha: Semiparametric Estimation in Regression Models for Point Processes based on One Realization Pruscha: Semiparametric Estimation in Regression Models for Point Processes based on One Realization Sonderforschungsbereich 386, Paper 66 (1997) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

On estimation of the Poisson parameter in zero-modied Poisson models

On estimation of the Poisson parameter in zero-modied Poisson models Computational Statistics & Data Analysis 34 (2000) 441 459 www.elsevier.com/locate/csda On estimation of the Poisson parameter in zero-modied Poisson models Ekkehart Dietz, Dankmar Bohning Department of

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. 2 Biostatistics, School of Public

More information

New Statistical Methods That Improve on MLE and GLM Including for Reserve Modeling GARY G VENTER

New Statistical Methods That Improve on MLE and GLM Including for Reserve Modeling GARY G VENTER New Statistical Methods That Improve on MLE and GLM Including for Reserve Modeling GARY G VENTER MLE Going the Way of the Buggy Whip Used to be gold standard of statistical estimation Minimum variance

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

The lmm Package. May 9, Description Some improved procedures for linear mixed models

The lmm Package. May 9, Description Some improved procedures for linear mixed models The lmm Package May 9, 2005 Version 0.3-4 Date 2005-5-9 Title Linear mixed models Author Original by Joseph L. Schafer . Maintainer Jing hua Zhao Description Some improved

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

The Bayesian approach to inverse problems

The Bayesian approach to inverse problems The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 12: Gaussian Belief Propagation, State Space Models and Kalman Filters Guest Kalman Filter Lecture by

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Lecture Notes based on Koop (2003) Bayesian Econometrics

Lecture Notes based on Koop (2003) Bayesian Econometrics Lecture Notes based on Koop (2003) Bayesian Econometrics A.Colin Cameron University of California - Davis November 15, 2005 1. CH.1: Introduction The concepts below are the essential concepts used throughout

More information

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Abhirup Datta 1 Sudipto Banerjee 1 Andrew O. Finley 2 Alan E. Gelfand 3 1 University of Minnesota, Minneapolis,

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School

More information

Fahrmeir, Pritscher: Regression analysis of forest damage by marginal models for correlated ordinal responses

Fahrmeir, Pritscher: Regression analysis of forest damage by marginal models for correlated ordinal responses Fahrmeir, Pritscher: Regression analysis of forest damage by marginal models for correlated ordinal responses Sonderforschungsbereich 386, Paper 9 (1995) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner

More information

Part I State space models

Part I State space models Part I State space models 1 Introduction to state space time series analysis James Durbin Department of Statistics, London School of Economics and Political Science Abstract The paper presents a broad

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Estimation and Inference on Dynamic Panel Data Models with Stochastic Volatility

Estimation and Inference on Dynamic Panel Data Models with Stochastic Volatility Estimation and Inference on Dynamic Panel Data Models with Stochastic Volatility Wen Xu Department of Economics & Oxford-Man Institute University of Oxford (Preliminary, Comments Welcome) Theme y it =

More information