Bayesian Estimation of Expected Cell Counts by Using R
Haydar Demirhan (haydarde@hacettepe.edu.tr) and Canan Hamurkaroglu (caca@hacettepe.edu.tr)
Department of Statistics, Hacettepe University, Beytepe, 06800, Ankara, Turkey

Abstract

In this article, we give a script that works in the R software for the Bayesian estimation of the expected cell counts of a contingency table over a given log linear model. The script promotes the usage and applicability of the Bayesian estimation of expected cell counts. We illustrate the usage and the interpretation of the resulting outputs over an example.

Keywords: Log linear models; WinBugs; Script; Maximum likelihood estimation

1 Introduction

Log linear modelling is frequently used in the analysis of contingency tables. When researchers want to include their subjective information in the analysis, the Bayesian point of view is a way to include all available information. In log linear modelling, both the model parameters and the expected cell counts of the considered contingency table can be estimated using Bayesian approaches. For each of these purposes, a prior distribution is determined and combined with the corresponding likelihood function using the Bayes theorem. Bayesian estimation of the model parameters over a given log linear model is considered by Leighty & Johnson [1]. King & Brooks [2] use the prior specified for the log linear model parameters to elicit a prior for the expected cell counts. They show the consistency of the transition from the model parameters to the expected cell counts in the representation of the prior information. The advantage of using the information on the model parameters for the estimation of expected cell counts is that one has to determine prior information for fewer parameters for all log linear models except the saturated model. Demirhan & Hamurkaroglu [3] show that the transition from the prior distribution of the model parameters to that of the expected cell counts, which is given by King & Brooks [2], is not appropriate for posterior inferences because of the singularity of the variance matrix of the prior distribution induced on the expected cell counts. In addition, Demirhan & Hamurkaroglu [3] give an approach to solve the singularity problem. They used the Metropolis-Hastings (MH) algorithm for posterior inferences instead of the Gibbs sampling because of the difficulties in the derivation of the full conditional distributions. The approach given by Demirhan & Hamurkaroglu [3] is not easy to implement for researchers who are not familiar with Bayesian approaches. Moreover, one has to write one's own computer program to implement the MH algorithm for posterior inferences. In this article, we give an R script that takes the mean vector of the prior distribution of the model parameters, the belief in the prior information and the considered contingency table, makes the transition by using the approach of Demirhan & Hamurkaroglu [3], and then draws posterior inferences via the Gibbs sampling in the well known Bayesian analysis software WinBugs [4].
WinBugs does not require the determination of the full conditional distributions by the user [5]. The function bugs(), given by Gelman [6], calls WinBugs from R for the Gibbs sampling. We use R to make the calculations required for the transition and for the determination of the parameters of the prior distribution induced on the expected cell counts, and bugs() for the posterior inferences. In this way, the Gibbs sampling is used instead of the MH algorithm with the approach of Demirhan & Hamurkaroglu [3].

Log linear models are frequently used for the analysis of contingency tables in medical and social research. In general, researchers from these fields have little information on Bayesian estimation. Even if one has enough background to understand Bayesian techniques, there are severe difficulties in the calculation of posterior estimates. There is no software that is able to conduct most of the proposed Bayesian approaches; therefore, one has to prepare a computer program to draw posterior inferences in most cases. Because of these difficulties, researchers are not using Bayesian approaches that are very useful and practical for their research. The given script is a ready-to-use tool for the mentioned estimation purposes, and we believe that it will promote the usage of the approach of Demirhan & Hamurkaroglu [3]. The approaches given by Leighty & Johnson [1], King & Brooks [2] and Demirhan & Hamurkaroglu [3] are summarized in Section 2. The R script and discussions on it are given in Section 3. An example is given in Section 4 to clarify the usage and outputs of the script.

2 Bayesian estimation of the expected cell counts

It is assumed that the variables constituting the considered contingency table are nominal. The log linear model is defined as follows:

log n = X_m β_m,   (1)

where β_m includes the main and interaction effect parameters of a model m, X_m is the corresponding design matrix, composed of -1, 0 and 1 values, which includes a column for each parameter in β_m, and n is the vector of expected cell counts. Leighty & Johnson [1] use a multivariate normal distribution with parameters µ_m and Σ_m as the prior distribution of the model parameters. µ_m is used to represent the prior information on the parameters. For example, if one has the information that one of the interaction parameters between two variables is negative and weak, then the relevant element of µ_m is chosen as negative and close to zero. Σ_m represents the researcher's belief in the induced prior information. They give a two-staged approach for the specification of the entries of Σ_m. At the first stage it is taken as

Σ_m = αC_m = αcI_m,   (2)

where I_m is the identity matrix of dimension p = dim(β_m); and to make C_m^{-1} agree with the precision of the likelihood, c is chosen as c = p/tr(V_b^{-1}), where V_b is the inverse of the variance matrix of the maximum likelihood (ML) estimates [1]. At this point we face another difficulty: the researcher has to construct the design matrix and obtain the relevant ML estimates using a statistical software before the determination of the prior distribution. Our script handles these issues for the user.
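The first-stage choice in (2) can be obtained directly from a glm() fit. The following minimal sketch uses a hypothetical 2 x 2 table rather than the data of this paper; the counts are invented for illustration only, and the treatment coding produced by glm() is used in place of the -1/0/1 coding mentioned above.

y <- c(10, 20, 30, 40)                           # hypothetical 2 x 2 cell counts
A <- factor(rep(1:2, each = 2))
B <- factor(rep(1:2, times = 2))
fit <- glm(y ~ A + B, family = poisson(log), x = TRUE)   # independence log linear model
X <- fit$x                                       # design matrix X_m (r rows, p columns)
p <- ncol(X)
c_val <- p / sum(diag(vcov(fit)))                # c = p / tr(V_b^{-1}) of equation (2)
C_m <- c_val * diag(p)                           # C_m = c I_m, so that Sigma_m = alpha * C_m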
The α seen in (2) is used to represent the prior belief in the induced prior information. The distribution of α is given by the second stage. It is taken that τ = 1/(1 + α) and τ ~ uniform(0, 1) to make the calculations and the expression of the belief in the prior easier. Values of τ close to zero represent high disbelief, and values close to one represent high belief in the prior. When τ is close to zero, the information coming from the sample dominates the prior information; in the opposite case, the prior information dominates the information provided by the sample.

King & Brooks [2] give an approach to obtain a prior distribution for the expected cell counts by using the prior distribution of the model parameters and (1). If β_m follows N(µ_m, Σ_m), then log n follows N(µ, Σ) from (1), where µ = X_m µ_m and Σ = X_m Σ_m X_m^T. Proofs of these results are given in [2]. As a result, the joint prior distribution of the expected cell counts is a multivariate log normal distribution with parameters µ and Σ, provided that Σ is a positive definite matrix. Demirhan & Hamurkaroglu [3] show that Σ is singular, and hence not positive definite; thus, using the transition with this setting is not possible. Demirhan & Hamurkaroglu [3] propose an approach to overcome the singularity problem. They define the scale matrix of the multivariate normal distribution as Σ* = Σ + hI, where h is a constant. They show that Σ* is positive definite if h is taken as tr(Σ)/dim(Σ) - e_min(Σ), where e_min(Σ), tr(Σ) and dim(Σ) denote the smallest eigenvalue, the trace and the dimension of Σ, respectively. Then, Demirhan & Hamurkaroglu [3] use the approach of Leighty & Johnson [1] to specify the variance matrix of the prior distribution of the model parameters. As a result, h is taken as follows:

h = α [ tr(X_m C_m X_m^T)/dim(Σ) - e_min(X_m C_m X_m^T) ] = αv.   (3)

Demirhan & Hamurkaroglu [3] state that, because h depends on τ, it has some effect on the degree of belief; however, apart from τ, it should not have any effect on the degree of belief in the prior. The diagonal elements of Σ increase by the addition of h. Therefore, if the diagonal elements of Σ are decreased by the value of h before the addition, the effect of h can be neutralized on the average. Let τ_g be the value of the random variable τ specified to represent our degree of belief in the prior, and let τ_s be the decreased value of τ obtained by using τ_g. Demirhan & Hamurkaroglu [3] show that if τ_s is defined as

τ_s = τ_g / { [ tr(X_m C_m X_m^T)/dim(Σ) ] / [ tr(X_m C_m X_m^T)/dim(Σ) + v ] (1 - τ_g) + τ_g },   (4)

the effect of h is minimized and the degree of belief in the prior information is not changed. The joint prior distribution of the expected cell counts given τ and h is as follows:

p(n | τ, h) ∝ [ ∏_{k ∈ K} (1/n_k) ] exp{ -(1/2) (log n - µ)^T (Σ*)^{-1} (log n - µ) },  0 < n_k,  k ∈ K.   (5)

In (5), K denotes the considered contingency table and k corresponds to each cell of the table. Details of this notation style are given by King & Brooks [2].
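Continuing the small hypothetical example started after (2), the following sketch shows the singularity of Σ = X_m Σ_m X_m^T and the repair implied by the text's equations (3) and (4); the value τ_g = 0.8 is an assumption chosen only for illustration.

XCX <- X %*% C_m %*% t(X)                        # X_m C_m X_m^T
qr(XCX)$rank                                     # rank p < r, so Sigma = alpha * XCX is singular
v <- sum(diag(XCX)) / nrow(XCX) - min(eigen(XCX, only.values = TRUE)$values)
tau_g <- 0.8                                     # assumed degree of belief in the prior
ratio <- (sum(diag(XCX)) / nrow(XCX)) / (sum(diag(XCX)) / nrow(XCX) + v)
tau_s <- tau_g / (ratio * (1 - tau_g) + tau_g)   # adjusted belief, equation (4)
alpha <- (1 - tau_s) / tau_s                     # since tau = 1 / (1 + alpha)
h <- alpha * v                                   # equation (3)
Sigma_star <- alpha * XCX + h * diag(nrow(XCX))  # Sigma* = Sigma + h I
min(eigen(Sigma_star, only.values = TRUE)$values) > 0   # now positive definite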
Under the Poisson sampling plan, the kernel of the log likelihood function is as follows:

l(n | y) = ∑_{k ∈ K} y_k log n_k,   (6)

where y_k is the observed frequency of cell k. The joint posterior distribution of the expected cell counts is obtained from (5) and (6) as follows:

p(n | y, τ, h) ∝ [ ∏_{k ∈ K} (1/n_k) ] exp{ -(1/2) (log n - µ)^T (Σ*)^{-1} (log n - µ) + ∑_{k ∈ K} y_k log n_k }.   (7)

3 The R script for the Bayesian estimation of expected cell counts

The script works on the posterior distribution given in (7), with the prior scheme mentioned in Section 2. It requires the τ_g value, µ_m, the relevant contingency table and the considered model to work. It obtains the ML estimates of the model parameters under the given model, then calculates the elements of µ, h and the entries of Σ* to specify the prior distribution. After the specification of the prior distribution, it calls WinBugs to implement the Gibbs sampling for the posterior inferences. The obtained posterior estimates, some of the percentiles of the marginal posterior distributions of the expected cell counts and the scale reduction factors corresponding to each parameter are displayed by R. All of the statistics available in the procedure bugs, which are mentioned by Gelman [6], can be displayed in the console window of R. The script is given below; we added line numbers for clarification.

1 library(Matrix)
2 library(R2WinBUGS)
3 library(BRugs)
4 tau_g<-
5 tablo<-read.table("x://obs.dat", header=TRUE, fill=TRUE, blank.lines.skip=TRUE)
6 #data.frame(tablo$y, as.factor(tablo$Gender), as.factor(tablo$Edu), as.factor(tablo$Age))
7 gm<-glm(tablo$y ~ as.factor(tablo$Gender) + as.factor(tablo$Edu) + as.factor(tablo$Age) + as.factor(tablo$Gender)*as.factor(tablo$Edu) + as.factor(tablo$Gender)*as.factor(tablo$Age) + as.factor(tablo$Edu)*as.factor(tablo$Age), family=poisson(log), x=TRUE)
8 tas<-gm$x
9 p<-ncol(tas)
10 r<-nrow(tas)
11 tas_dev<-t(tas)
12 tas_tas_dev<-tas %*% tas_dev
13 prm_kest<-gm$coefficients
14 g1<-tas %*% prm_kest
15 bgs_kest<-exp(g1)
16 d_bgs_kest<-matrix(0,r,r)
17 for(i in 1:r){d_bgs_kest[i,i]=bgs_kest[i]}
18 g2<-tas_dev %*% d_bgs_kest
19 g3<-g2 %*% tas
20 VarCov<-solve(g3)
21 iz_V<-sum(diag(VarCov))
22 C<-diag(p/iz_V,r,r) %*% tas_tas_dev
23 iz_C<-sum(diag(C))
24 v<-(iz_C/p)-min(eigen(C, only.values=TRUE)$values)
25 tau_s<-tau_g/((((iz_C/p)/((iz_C/p)+v))*(1-tau_g))+tau_g)
26 alpha<-(1-tau_s)/tau_s
27 h<-alpha*v
28 cov<-diag(p/(alpha*iz_V),r,r) %*% tas_tas_dev + diag(h,r,r)
29 mu_m<-as.matrix(read.table("x://mu_m.dat"))
30 mu<-tas %*% mu_m
31 y<-as.matrix(tablo$y)
32 k<-tablo$d_say[1]
33 nn<-vector(mode="numeric", length=r)
34 girdi<-list("r","cov","mu","y")
35 basl<-function(){list(nn=runif(r,0,1))}
36 pars<-list("nn")
37 sonucsim<-bugs(girdi,basl,pars,"x://model.bug", n.chains=3, n.iter=5000, DIC=FALSE)
38 print(sonucsim)

The placeholder x:// seen in lines 5, 29 and 37 should be replaced with the relevant path before running the script. Lines 1-3 load the required R libraries; R2WinBUGS and BRugs are given by Gelman [6]. Lines 4 and 5 get the inputs from the user. Definitions of the variables are made at line 6. ML estimates of the log linear parameters are obtained at line 7 over the model that contains all of the two-factor interactions. The design matrix X_m is taken from the glm procedure of R at line 8. Some calculations are carried out at lines 9-12. ML estimates of the model parameters are taken from the glm procedure at line 13. The calculations required to obtain v, seen in (3), are done at lines 14-23, and v itself is obtained at line 24. The τ_s, mentioned in (4), is calculated at line 25. The value of α, which determines the precision of the prior distribution of the expected cell counts, is calculated at line 26, and the value of h is obtained at line 27. Σ*, µ_m and µ are obtained at lines 28, 29 and 30, respectively. Lines 31-36 prepare the inputs of the bugs procedure. Line 37 calls WinBugs for the implementation of the Gibbs sampling. The outputs generated by WinBugs are taken back into the R environment and displayed in the R console window by line 38. The script uses the files obs.dat, mu_m.dat and model.bug; the contents of these files are given in the appendix.

4 An example

We give an example to illustrate the outputs of the script. We used a gender, age and level of education cross tabulation with 132 cells. The data are taken from the web site of the Turkish Statistical Institute. The considered hierarchical model contains all terms except the three-factor interaction term. The script given in Section 3 generates three chains, each of which has 5000 iterations. The outputs obtained from the implementation of the script are as follows:

Inference for Bugs model at "x://model.bug", fit using winbugs,
 3 chains, each with 5000 iterations (first 500 discarded)
 n.sims = 1500 iterations saved
        mean  sd  2.5%  25%  50%  75%  97.5%  Rhat  n.eff
nn[1]
nn[2]
nn[3]
...
nn[130]
nn[131]
nn[132]
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1)
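Beyond print(), the object returned by bugs() can be inspected directly in R. The following sketch is only illustrative; the component names (summary, sims.list) are those documented for R2WinBUGS and should be checked against the installed version.

post <- sonucsim$summary                         # matrix with mean, sd, percentiles, Rhat, n.eff
head(post)                                       # summaries of the first few expected cell counts
draws <- sonucsim$sims.list$nn                   # saved posterior draws of the expected cell counts
apply(draws, 2, quantile, probs = c(0.025, 0.975))   # 95% posterior intervals per cell
max(post[, "Rhat"])                              # largest potential scale reduction factor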
It is seen from the output that the first 500 iterations were discarded to reduce the autocorrelation. The posterior estimates of the expected cell counts are given in the mean column, and the estimated standard deviations are under the sd column. The next five columns correspond to the 2.5th, 25th, 50th, 75th and 97.5th percentiles. The potential scale reduction factors are given in the Rhat column. It can be concluded that convergence is achieved for a parameter if the relevant Rhat value is close to 1 and less than 1.2 [7].

5 Conclusions

We consider the application in the R software of the approach given by Demirhan & Hamurkaroglu [3] for the Bayesian estimation of the expected cell counts of a contingency table over a given log linear model, and we prepare a script for this aim. The script is practical for non-Bayesian statisticians and non-statistician researchers. For the mentioned estimation purposes one would otherwise have to construct the relevant design matrix, obtain the ML estimates, calculate the parameters of the prior distribution of the expected cell counts, prepare a computer program to implement the MH algorithm, and calculate convergence measures such as Rhat and the percentiles of each marginal posterior distribution. Following this list is hard work for most researchers. Our script provides an easy way: one should copy and paste the script into the console window of R, prepare obs.dat, which contains the observed table, and mu_m.dat, which contains the prior information on the model parameters, and create the file model.bug by copying it from the appendix to obtain the Bayesian estimates of the expected cell counts. The R and WinBugs softwares can be obtained from their internet sites free of charge [8, 4]; this also promotes the usage of the script. In conclusion, the given script increases the applicability of the approach of the authors, and it is a ready-to-use tool for researchers from various fields.

References

[1] Leighty, R.M., Johnson, W.J., 1990, A Bayesian log-linear model analysis of categorical data. Journal of Official Statistics, 6, 2.

[2] King, R., Brooks, S.P., 2001, Prior induction in log-linear models for general contingency table analysis. The Annals of Statistics, 29, 3.

[3] Demirhan, H., Hamurkaroglu, C., 2006, A Bayesian approach to the estimation of expected cell counts by using log linear models. Communications in Statistics - Theory and Methods, 35, 2.
[4] WinBugs, 2007. Available online at: www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml (accessed 10 June 2007).

[5] Spiegelhalter, D., Thomas, A., Best, N., Lunn, D., 2003, WinBUGS user manual, v1.4. Available online at: www.mrc-bsu.cam.ac.uk/bugs (accessed 10 June 2007).

[6] Gelman, A., 2006, bugs.R: functions for running WinBugs and OpenBugs from R. Available online at: www.stat.columbia.edu/~gelman/bugsR (accessed 10 June 2007).

[7] Gelman, A., 1996, Inference and monitoring convergence, in Markov Chain Monte Carlo in Practice (eds W.R. Gilks, S. Richardson, D.J. Spiegelhalter), Chapman & Hall/CRC, London.

[8] R project, 2007. Available online at: www.r-project.org (accessed 10 June 2007).

Appendix

A.1 Content of the obs.dat file

The content of the obs.dat file is as follows:

Gender Edu Age y

A.2 Content of the mu_m.dat file

mu_m.dat includes the prior information about the value of each parameter, written one under the other.

A.3 Content of the model.bug file

The model.bug file contains the model code in the WinBugs language. The content is as follows:

model{
  for(i in 1:r) {
    y[i,1]~dpois(nn[i])
    nn[i]~dlnorm(mu[i,1],cov[i,i])
  }
}
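Since the numeric contents of obs.dat and mu_m.dat are not reproduced above, the following purely hypothetical sketch shows how input files with the described layout could be prepared in R for a small 2 x 2 x 2 table; the counts and prior means are invented only for illustration, and x:// is the same path placeholder used in the script.

# obs.dat: one row per cell, with the factor levels and the observed count y
obs <- expand.grid(Gender = 1:2, Edu = 1:2, Age = 1:2)
obs$y <- c(12, 7, 9, 14, 20, 11, 8, 16)          # made-up cell counts
write.table(obs, "x://obs.dat", row.names = FALSE, quote = FALSE)
# mu_m.dat: one prior mean per model parameter, written one under the other;
# the all-two-factor-interaction model for a 2 x 2 x 2 table has 7 parameters
mu_m <- c(2, 0, 0, 0, -0.1, 0, 0)                # e.g. a weak negative interaction belief
write.table(mu_m, "x://mu_m.dat", row.names = FALSE, col.names = FALSE)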