MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA
Lisa Madsen
January 9, 7

JAPANESE BEETLE DATA
[Figure: Gauge Plots, Tuscarora Central Fairways, 1996. Map panels: Grubs, Adult Activity, Organic Matter, Distance to Nearest Tree. Legends: No Grubs to Many Grubs; Low OM to High OM.]

JAPANESE BEETLE DATA
[Figure: Grub Counts versus Organic Matter %.]

JAPANESE BEETLE DATA
Model
The data are overdispersed counts. A sensible model is negative binomial with mean given by a function of organic matter:
$$E(Y_i) = \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3),$$
where $Y_i$ is the $i$th grub count and $x_i$ is the percent organic matter at that location.
OUTLINE
Grub Example
Geostatistical Model
Spatial GEE Model
Spatial Gaussian Copula
Continuing Discrete Random Variables
Spatial Gaussian Copula For Discrete Data
Analysis of the Grub Data
Simulations
Conclusions, Extensions, and Further Research

THE GEOSTATISTICAL MODEL
For Normal Data
$$Y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \Sigma)$$
The covariance matrix $\Sigma$ is constructed from a spatial covariogram, a function depending on distance and a vector of parameters.

THE EXPONENTIAL COVARIOGRAM MODEL
If $h_{ij}$ = distance between the locations of $Y_i$ and $Y_j$,
$$\mathrm{cov}(Y_i, Y_j) = \Sigma_{ij} = \begin{cases} \theta_0 + \theta_1, & h_{ij} = 0 \\ \theta_1 \exp(-\theta_2 h_{ij}), & h_{ij} > 0 \end{cases}$$
$\theta_0$ = nugget (measurement error)
$\theta_1$ = partial sill
$\theta_2$ = decay (reciprocal of range)

THE EXPONENTIAL COVARIOGRAM MODEL
[Figure: Covariance versus Lag Distance h for the exponential covariogram, with $\theta_0 + \theta_1$ and $\theta_1$ marked on the covariance axis.]
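The exponential covariogram above can be turned into a covariance matrix in a few lines. The following is a minimal sketch, not the code used for the talk: the function name, the coordinates, and the parameter values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def exponential_covariance(coords, nugget, partial_sill, decay):
    """Covariance matrix from the exponential covariogram.

    Sigma_ij = nugget + partial_sill             if h_ij == 0
             = partial_sill * exp(-decay * h_ij) if h_ij > 0
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances h_ij between the rows of coords.
    diff = coords[:, None, :] - coords[None, :, :]
    h = np.sqrt((diff ** 2).sum(axis=-1))
    sigma = partial_sill * np.exp(-decay * h)
    # The nugget contributes only at zero distance.
    sigma[h == 0.0] += nugget
    return sigma


# Hypothetical locations and parameter values (assumptions for illustration).
coords = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
sigma = exponential_covariance(coords, nugget=0.5, partial_sill=2.0, decay=0.1)
```

Here the first two locations are 5 units apart, so their covariance is the partial sill times exp(-0.5), while each diagonal entry is nugget plus partial sill.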
THE GEOSTATISTICAL LIKELIHOOD
Combining the covariogram model with the normal assumption yields a likelihood
$$f(y \mid \beta, \theta) = (2\pi)^{-n/2}\, |\Sigma(\theta)|^{-1/2} \exp\left[ -\tfrac{1}{2} (Y - X\beta)' \Sigma(\theta)^{-1} (Y - X\beta) \right],$$
from which we can find maximum likelihood estimates for the parameters $\beta$ and $\theta$.

THE SPATIAL GEE MODEL
Some History
Liang and Zeger's (1986) pioneering paper in Biometrika introduced GEEs for longitudinal data.
Zeger (1988) developed GEE analysis for a time series of counts using a latent process model.
McShane, Albert, and Palmatier (1997) adapted Zeger's model and analysis to spatially correlated count data.
Gotway and Stroup (1997) used GEEs to model and predict spatially correlated binary and count data.
Lin and Clayton (2005) developed asymptotic theory for GEE estimators of parameters in a spatial logistic regression model.

THE LATENT PROCESS SPATIAL GEE MODEL
A latent process, typically lognormal, is used to model the spatial correlation. A conditionally independent discrete process, typically Poisson for counts, is assumed to model the data. Let
$s$ = spatial location
$x(s)$ = vector of known covariates at location $s$
$\beta$ = vector of unknown regression coefficients
$Z(s) \sim$ lognormal with $E[Z(s)] = 1$, $\mathrm{var}[Z(s)] = \sigma^2$
$Y(s) \mid Z(\cdot) \sim$ independent Poisson$\{\exp[x'(s)\beta]\, Z(s)\}$.

THE LATENT PROCESS SPATIAL GEE MODEL
Marginal Moments
The marginal moments of lognormal-Poisson $Y(s)$,
$$E[Y(s)] = \exp[x'(s)\beta], \qquad \mathrm{var}[Y(s)] = E[Y(s)] + \sigma^2 E[Y(s)]^2,$$
closely resemble those of a negative binomial process: if $W$ is distributed as negative binomial, then
$$\mathrm{var}(W) = E(W) + \frac{E(W)^2}{k}$$
for some $k > 0$.
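The resemblance between the two variance functions can be checked numerically: with $k = 1/\sigma^2$ they coincide. This is a small sketch under assumed values; the function names, covariates, coefficients, and $\sigma^2$ below are hypothetical.

```python
import math


def lognormal_poisson_moments(x, beta, sigma2):
    """Marginal mean and variance of Y(s) under the lognormal-Poisson model."""
    mean = math.exp(sum(b * xi for b, xi in zip(beta, x)))
    var = mean + sigma2 * mean ** 2
    return mean, var


def negative_binomial_variance(mean, k):
    """Negative binomial variance function: E(W) + E(W)^2 / k."""
    return mean + mean ** 2 / k


# Hypothetical covariates, coefficients, and latent variance (assumptions).
sigma2 = 0.4
mean, lnp_var = lognormal_poisson_moments(x=[1.0, 2.0], beta=[0.5, 0.25], sigma2=sigma2)
nb_var = negative_binomial_variance(mean, k=1.0 / sigma2)
```

Matching the first two moments says nothing about the joint behavior of the counts, which is where the two models differ.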
THE LATENT PROCESS SPATIAL GEE MODEL
Correlations
The latent process $Z(\cdot)$ carries the spatial correlation,
$$\mathrm{corr}[Z(s), Z(s+h)] = \rho_Z(h),$$
which induces correlation among the $Y(s)$:
$$\mathrm{corr}[Y(s), Y(s+h)] = \rho_Z(h) \left\{ 1 + \frac{1}{\sigma^2 E[Y(s)]} \right\}^{-1/2} \left\{ 1 + \frac{1}{\sigma^2 E[Y(s+h)]} \right\}^{-1/2}.$$
These correlations are severely limited compared to those possible between negative binomial random variables.

THE LATENT PROCESS SPATIAL GEE MODEL
Limits to Correlation
[Figure: upper bound (UB) for $\rho$ as a function of $\mu_i$ and $\mu_j$. Panels: (a) Lognormal-Poisson, (b) Negative binomial.]

THE LATENT PROCESS SPATIAL GEE MODEL
The latent process model may underestimate correlations among the data.
When correlations are underestimated, standard errors are also underestimated.

BRASH ASSERTION
Correlation is not an appropriate measure of dependence for discrete random variables. In fact, it's only appropriate for normal random variables.
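The limit on the correlation induced by the latent process can be made concrete by setting $\rho_Z(h) = 1$ in the lognormal-Poisson correlation formula. The sketch below does that for hypothetical values of $\sigma^2$ and the two means; the function name is an assumption for illustration.

```python
import math


def lognormal_poisson_corr(rho_z, sigma2, mu_i, mu_j):
    """corr[Y_i, Y_j] induced by latent correlation rho_z.

    Each count contributes a factor {1 + 1 / (sigma2 * mu)}^(-1/2),
    which is always below 1, so the result is smaller than rho_z.
    """
    shrink_i = math.sqrt(1.0 + 1.0 / (sigma2 * mu_i))
    shrink_j = math.sqrt(1.0 + 1.0 / (sigma2 * mu_j))
    return rho_z / (shrink_i * shrink_j)


# Upper bounds (rho_z = 1) for hypothetical sigma2 and means.
upper_bound = lognormal_poisson_corr(1.0, sigma2=0.5, mu_i=2.0, mu_j=2.0)
upper_bound_small_means = lognormal_poisson_corr(1.0, sigma2=0.5, mu_i=0.2, mu_j=0.2)
```

With these values the bound is about 0.5 when both means equal 2, and it drops further as the means shrink.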
[Figure: scatterplots of $Y_1$ and $Y_2$. Panels: (a) Perfect correlation, (b) Almost perfect correlation.]

THE BIVARIATE GAUSSIAN COPULA
Let $Y_1 \sim F_1$ and $Y_2 \sim F_2$ be continuous random variables. The Gaussian copula defines a joint distribution function
$$C(y_1, y_2; \delta) = \Phi_\delta\left[ \Phi^{-1}(F_1(y_1)),\ \Phi^{-1}(F_2(y_2)) \right].$$
$\Phi$ = standard normal cdf
$\Phi_\delta$ = bivariate normal cdf with correlation $\delta$
Maximum correlation between $Y_1$ and $Y_2$ is achieved by setting $\delta = 1$.

THE MULTIVARIATE GAUSSIAN COPULA
The bivariate Gaussian copula can be generalized. For $i = 1, \ldots, n$, let $Y_i \sim F_i$ be continuous random variables and
$\Phi$ = standard normal cdf
$\Phi_\Sigma$ = multivariate Gaussian cdf with covariance matrix $\Sigma$
$\Sigma$ = a correlation matrix
A joint distribution function is
$$C(y_1, \ldots, y_n; \Sigma) = \Phi_\Sigma\left[ \Phi^{-1}(F_1(y_1)), \ldots, \Phi^{-1}(F_n(y_n)) \right].$$

THE MULTIVARIATE GAUSSIAN COPULA
Joint Density
Differentiating the distribution function yields a joint density for random variables $Y_i$ with marginal density $f_i$:
$$c(y; \Sigma) = |\Sigma|^{-1/2} \exp\left[ -\tfrac{1}{2} z' \Sigma^{-1} z \right] \exp\left[ \tfrac{1}{2} z' z \right] \prod_{i=1}^{n} f_i(y_i),$$
where $z = \left[ \Phi^{-1}\{F_1(y_1)\}, \ldots, \Phi^{-1}\{F_n(y_n)\} \right]'$.
$\Sigma$ determines the dependence structure.
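The joint density above is straightforward to evaluate on the log scale. The following is a minimal sketch, assuming the marginal cdfs and log densities are supplied as Python callables; the function name and the exponential marginals in the example are hypothetical choices, not part of the talk.

```python
import math
from statistics import NormalDist

import numpy as np


def gaussian_copula_logdensity(y, cdfs, logpdfs, corr):
    """log c(y; Sigma) = -1/2 log|Sigma| - 1/2 z'Sigma^{-1}z + 1/2 z'z + sum_i log f_i(y_i)."""
    std_normal = NormalDist()
    # z_i = Phi^{-1}{F_i(y_i)}
    z = np.array([std_normal.inv_cdf(F(yi)) for F, yi in zip(cdfs, y)])
    corr = np.asarray(corr, dtype=float)
    _, logdet = np.linalg.slogdet(corr)
    quad = z @ np.linalg.solve(corr, z)
    log_marginals = sum(lp(yi) for lp, yi in zip(logpdfs, y))
    return -0.5 * logdet - 0.5 * quad + 0.5 * (z @ z) + log_marginals


# Hypothetical example: two unit-rate exponential marginals, correlation 0.6.
exp_cdf = lambda y: 1.0 - math.exp(-y)
exp_logpdf = lambda y: -y
corr = np.array([[1.0, 0.6], [0.6, 1.0]])
log_c = gaussian_copula_logdensity([0.5, 2.0], [exp_cdf, exp_cdf],
                                   [exp_logpdf, exp_logpdf], corr)
```

Two sanity checks follow from the formula: with $\Sigma = I$ the density reduces to the product of the marginals, and with standard normal marginals it reduces to the multivariate normal density.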
THE SPATIAL GAUSSIAN COPULA
Bring non-normal $Y_1, \ldots, Y_n$ into the geostatistical framework by modeling the Gaussian copula's $\Sigma$ as a spatial correlation matrix,
$$\Sigma_{ij} = \rho(h_{ij}) = \begin{cases} \theta_0 \exp(-h_{ij} \theta_1), & i \ne j \\ 1, & i = j \end{cases}$$
where $h_{ij}$ is the distance between the locations of $Y_i$ and $Y_j$, and $\theta_0 \in (0, 1]$ and $\theta_1 > 0$ are parameters.

A SPATIAL CORRELATION FUNCTION
[Figure: $\rho(h)$ versus Lag Distance h for three settings of $(\theta_0, \theta_1)$.]

RECAP
Observations $Y_i$ with cdf $F_i$ and density $f_i$, $i = 1, \ldots, n$
$E(Y_i)$ depends on unknown parameter vector $\beta$ and known covariates $x_i$
Joint density
$$c(y_1, \ldots, y_n; \beta, \theta) = |\Sigma(\theta)|^{-1/2} \exp\left[ -\tfrac{1}{2} z' \Sigma(\theta)^{-1} z \right] \exp\left[ \tfrac{1}{2} z' z \right] \prod_{i=1}^{n} f_i(y_i)$$
The joint density forms a likelihood for the parameters $\beta$ and $\theta$, which can be maximized to obtain MLEs.

But... how does this work for discrete data?
CONTINUING DISCRETE RANDOM VARIABLES
Denuit and Lambert (2005): Associate with discrete $Y_i$ a continuous random variable
$$Y_i^* = Y_i - U_i,$$
where $U_i \sim$ Uniform$(0, 1)$ is independent of $Y_i$ and of $U_j$ for $j \ne i$.

CONTINUING DISCRETE RANDOM VARIABLES
$Y_i^* = Y_i - U_i$ is a continuous random variable with distribution function and density
$$F_i^*(y) = F_i([y]) + (y - [y]) \Pr\{Y_i = [y + 1]\}$$
$$f_i^*(y) = \Pr\{Y_i = [y + 1]\}$$
where $[y]$ denotes the integer part of $y$.

CONTINUING DISCRETE RANDOM VARIABLES
A couple of observations:
$Y_i^* = Y_i - U_i$ if and only if $Y_i = [Y_i^* + 1]$, so no information is lost by continuing $Y_i$.
The distribution and density functions $F_i^*$ and $f_i^*$ depend on only the parameters of the distribution of $Y_i$.

THE SPATIAL GAUSSIAN COPULA FOR DISCRETE DATA
The spatial Gaussian copula joint density for $Y_1^*, \ldots, Y_n^*$,
$$c(y^*; \beta, \theta) = |\Sigma(\theta)|^{-1/2} \exp\left[ -\tfrac{1}{2} z^{*\prime} \Sigma(\theta)^{-1} z^* \right] \exp\left[ \tfrac{1}{2} z^{*\prime} z^* \right] \prod_{i=1}^{n} f_i^*(y_i^*),$$
where $z^* = \left[ \Phi^{-1}\{F_1^*(y_1^*)\}, \ldots, \Phi^{-1}\{F_n^*(y_n^*)\} \right]'$, gives a log-likelihood
$$L(\beta, \theta; Y, U) = \log[c(Y^*; \beta, \theta)].$$
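The continued distribution function and density are easy to compute from any discrete pmf. The sketch below uses a Poisson marginal purely as a stand-in to keep the code short (the talk's grub model is negative binomial); the function names, the mean, and the counts are hypothetical. Here `math.floor` plays the role of the integer part $[y]$, which matters for $y$ between $-1$ and $0$.

```python
import math
import random


def poisson_pmf(k, mu):
    """Pr{Y = k} for a Poisson(mu) count; zero for negative k."""
    if k < 0:
        return 0.0
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def poisson_cdf(k, mu):
    """Pr{Y <= k}; an empty sum (zero) when k < 0."""
    return sum(poisson_pmf(j, mu) for j in range(k + 1))


def continued_cdf(y, mu):
    """F*(y) = F([y]) + (y - [y]) Pr{Y = [y + 1]}."""
    lower = math.floor(y)
    return poisson_cdf(lower, mu) + (y - lower) * poisson_pmf(math.floor(y + 1), mu)


def continued_density(y, mu):
    """f*(y) = Pr{Y = [y + 1]}."""
    return poisson_pmf(math.floor(y + 1), mu)


# Jitter some hypothetical counts and recover them: Y = [Y* + 1].
rng = random.Random(0)
counts = [0, 3, 1, 7]
jittered = [y - rng.random() for y in counts]      # Y* = Y - U
recovered = [math.floor(y_star + 1) for y_star in jittered]
```

The recovered counts equal the originals, which is the sense in which continuing a count loses no information.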
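THE SPATIAL GAUSSIAN COPULA FOR DISCRETE DATA
Since $L(\beta, \theta; Y, U)$ depends on $U$, the MLEs will be
$$(\hat\beta, \hat\theta) = E_U\left\{ \arg\max_{\beta, \theta} \left[ L(\beta, \theta; Y, U) \right] \right\}.$$

ANALYSIS OF THE GRUB DATA
Model
$Y_i \sim$ Negative Binomial, $i = 1, \ldots, n$
$$E(Y_i \mid x_i) = \mu_i = \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3)$$
$$\mathrm{var}(Y_i) = \mu_i \left( \frac{1 + \phi}{\phi} \right)$$
$$\mathrm{corr}(Y_i, Y_j) = \begin{cases} \theta_0 \exp(-h_{ij} \theta_1), & i \ne j \\ 1, & i = j \end{cases}$$

ANALYSIS OF THE GRUB DATA
[Figure: Grub Counts versus Organic Matter %.]

ANALYSIS OF THE GRUB DATA
Method
1. Generate $U_1, \ldots, U_n$ iid $U(0, 1)$ and form $Y_i^* = Y_i - U_i$.
2. Find $(\tilde\beta, \tilde\theta) = \arg\max_{\beta, \theta} [L(\beta, \theta; Y, U)]$ and an approximation of the negative Hessian of $L$ at the maximum.
3. Repeat steps 1 and 2 several times.
4. $\hat\beta$ and $\hat\theta$ are averages of the $(\tilde\beta, \tilde\theta)$.
5. Standard errors are square roots of the diagonal elements of the inverse of the average approximated negative Hessian.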
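The jitter, maximize, and average steps above can be sketched end to end on a toy problem. This is a heavily simplified illustration, not the analysis from the talk: it assumes an intercept-only mean $\mu = \exp(\beta_0)$, treats $\phi$ and the correlation matrix as known, replaces the optimizer with a grid search over $\beta_0$, and approximates the negative Hessian by a central second difference. The counts, locations, and parameter values are made up, and taking the inverse of the averaged negative Hessian is one reading of the standard error step.

```python
import math
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and variance mu * (1 + phi) / phi."""
    if k < 0:
        return 0.0
    size, p = mu * phi, phi / (1.0 + phi)
    return math.exp(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                    + size * math.log(p) + k * math.log(1.0 - p))


def copula_loglik(beta0, y_star, corr_inv, logdet, phi):
    """Log spatial Gaussian copula density at the continued counts y_star."""
    mu = math.exp(beta0)
    z, log_marginals = [], 0.0
    for ys in y_star:
        lower = math.floor(ys)                               # [y*]
        pmf_k = nb_pmf(lower + 1, mu, phi)                   # f*(y*) = Pr{Y = [y* + 1]}
        cdf_lower = sum(nb_pmf(j, mu, phi) for j in range(lower + 1))
        cdf_star = cdf_lower + (ys - lower) * pmf_k          # F*(y*)
        cdf_star = min(max(cdf_star, 1e-12), 1.0 - 1e-12)    # keep inv_cdf finite
        z.append(STD_NORMAL.inv_cdf(cdf_star))
        log_marginals += math.log(pmf_k)
    z = np.array(z)
    return -0.5 * logdet - 0.5 * (z @ corr_inv @ z) + 0.5 * (z @ z) + log_marginals


def fit_by_jittering(counts, corr, phi, n_jitters=5, seed=0):
    rng = np.random.default_rng(seed)
    corr_inv = np.linalg.inv(corr)
    logdet = np.linalg.slogdet(corr)[1]
    grid = np.linspace(-2.0, 4.0, 301)                       # candidate beta0 values
    step = grid[1] - grid[0]
    estimates, neg_hessians = [], []
    for _ in range(n_jitters):                               # step 3: repeat
        u = 1.0 - rng.random(len(counts))                    # step 1: U in (0, 1]
        y_star = np.asarray(counts) - u
        ll = np.array([copula_loglik(b, y_star, corr_inv, logdet, phi) for b in grid])
        best = int(np.argmax(ll[1:-1])) + 1                  # step 2: interior maximizer
        estimates.append(grid[best])
        neg_hessians.append(-(ll[best + 1] - 2.0 * ll[best] + ll[best - 1]) / step ** 2)
    beta0_hat = float(np.mean(estimates))                    # step 4: average estimates
    std_err = float(1.0 / math.sqrt(np.mean(neg_hessians)))  # step 5: inverse avg. neg. Hessian
    return beta0_hat, std_err


# Made-up counts on a transect with an assumed correlation 0.7 * exp(-0.5 * h).
counts = [2, 4, 3, 0, 5, 1]
locations = np.arange(6, dtype=float)
h = np.abs(locations[:, None] - locations[None, :])
corr = np.where(h == 0, 1.0, 0.7 * np.exp(-0.5 * h))
beta0_hat, std_err = fit_by_jittering(counts, corr, phi=1.0)
```

A real analysis would maximize over all of $\beta$, $\phi$, and $\theta$ with a numerical optimizer; the grid search here only keeps the sketch dependency-light.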
ANALYSIS OF THE GRUB DATA
Fitted Mean Function
[Figure: Grub Counts versus Organic Matter %, showing Observed counts and the Fitted Mean.]

ANALYSIS OF THE GRUB DATA
Parameter Estimates
Columns: Parameter, Estimate, Standard Error, Nominal 9% Confidence Interval
$\beta_0$: ..78 ( 6.8,.)
$\beta_1$: .9.8 (.,.7)
$\beta_2$: .997.8 (.7,.)
$\beta_3$: .9. (.,.776)

SIMULATIONS
n = with spatial locations from grub data
$\mu_i = \exp(\beta_0)$, where $\beta_0$ =
Data generated using software package discsim. (www.stat.oregonstate.edu/people/lmadsen)
About % of the pairs $(Y_i, Y_j)$ had correlations exceeding the lognormal-Poisson upper bound.
MLE and GEE estimates of $\beta_0$ were calculated.

SIMULATIONS
Results
Columns: Procedure, Bias, Variance, Nominal 9% Confidence Coverage
Spatial GEE: -...69
MLE: -...9
CONCLUSIONS
Latent variable spatial GEE model can dangerously underestimate variance.
Spatial Gaussian copula makes it easy to model spatial dependence for non-normal data.
ML method is easier to work with than GEE method.

GENERALIZATIONS TO THE MODEL
The method can be used for any non-normal marginals and any correlation structure.
It is not necessary that all $Y_i$ share the same marginal distribution. For example, data could be overdispersed in some regions and underdispersed in others.
For the negative binomial marginal model, $\phi$ could be allowed to vary.

FURTHER RESEARCH
More simulations to assess performance in a variety of situations.
More applications.
Asymptotic details.
Generating highly correlated discrete data.

ACKNOWLEDGEMENTS
The research presented here has been partially funded by the U.S. Environmental Protection Agency Grant #CR-899, the Science To Achieve Results (STAR) Program. It has not been subjected to the Agency's review and therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred.
Thanks to Clif Johnson for his extensive help figuring out how to run the simulations on the College of Engineering's Beowulf cluster.
THANK YOU!