Integrated Likelihood Estimation in Semiparametric Regression Models
Thomas A. Severini, Department of Statistics, Northwestern University
Joint work with Heping He, University of York
Introduction

Let $Y_1, Y_2, \dots, Y_n$ denote real-valued random variables of the form
$$Y_j = x_j^T \beta + \gamma(z_j) + \epsilon_j, \qquad j = 1, \dots, n,$$
where
- $x_1, \dots, x_n$ are constants in $R^p$;
- $z_1, \dots, z_n$ are constants, taking values in a set $\mathcal{Z}$;
- $\epsilon_1, \dots, \epsilon_n$ are unobserved mean-0 random variables such that $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$ has a multivariate normal distribution with covariance matrix $\Omega_\phi$, $\phi \in \Phi$;
- $\beta \in R^p$ is an unknown parameter;
- $\gamma$ is an unknown real-valued function on $\mathcal{Z}$, taking values in a set of functions $\Gamma$.

Our goal is inference about the parameter $\beta$ in the presence of the nuisance parameters $\gamma$ and $\phi$.
The likelihood function for this model is
$$|\Omega_\phi|^{-1/2} \exp\{-\tfrac{1}{2}(Y - X\beta - g)^T \Omega_\phi^{-1} (Y - X\beta - g)\},$$
where $Y = (Y_1, \dots, Y_n)^T$, $X$ is the $n \times p$ matrix with $j$th row $x_j^T$, and $g = (\gamma(z_1), \dots, \gamma(z_n))^T$. Hence, in order to proceed with likelihood inference for $\beta$, some method of dealing with the nuisance parameters $\gamma$ and $\phi$ is needed.

Many methods of estimation have been proposed for this model: Engle, Granger, Rice, and Weiss (1986); Hastie and Tibshirani (1990); Heckman (1986); Ruppert, Wand, and Carroll (2003); Severini and Staniswalis (1994); and Speckman (1988). Most involve eliminating $\gamma$ using some modification of the profile likelihood idea.
An alternative approach is to use an integrated likelihood, in which $\gamma$ is removed by averaging with respect to some weight function. Suppose that $\mathcal{Z} \subset R$ and $\Gamma$ is a set of differentiable functions on $\mathcal{Z}$. Consider a weight function for $\gamma$ corresponding to a mean-zero Gaussian stochastic process with covariance function $K_\lambda(\cdot, \cdot)$, where $\lambda$ is a parameter. Under this distribution, the vector $(\gamma(z_1), \dots, \gamma(z_n))^T$ has a multivariate normal distribution with mean vector 0 and covariance matrix $\Sigma_\lambda$. The integrated likelihood is
$$|\Omega_\phi + \Sigma_\lambda|^{-1/2} \exp\{-\tfrac{1}{2}(Y - X\beta)^T (\Omega_\phi + \Sigma_\lambda)^{-1} (Y - X\beta)\}.$$
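The integrated likelihood is just a multivariate normal likelihood and is easy to evaluate directly; the following is a minimal numpy sketch (function and argument names are illustrative, not from the talk), dropping the additive constant:

```python
import numpy as np

def integrated_loglik(beta, Y, X, Omega, Sigma):
    # Integrated log-likelihood of beta: Y ~ N(X beta, V) with
    # V = Omega + Sigma, up to an additive constant.
    V = Omega + Sigma
    r = Y - X @ beta
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * logdet - 0.5 * r @ np.linalg.solve(V, r)
```

Using `slogdet` and `solve` (rather than an explicit determinant and inverse) keeps the evaluation numerically stable for moderate $n$.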
The integrated likelihood approach has several advantages:
- Restrictions on $\gamma$ are often easy to impose by using a covariance function that respects the restrictions.
- More complicated models, in which the parameters of interest are intertwined with the unknown function, are often easier to handle through the covariance structure than through the mean function of the observations.
- It is straightforward to incorporate a parametric model for the covariance matrix of the errors.
Inference based on an integrated likelihood is related to Bayesian inference in nonparametric and semiparametric regression models. Much of the Bayesian work in this area has made use of the fact that smoothing splines have a Bayesian interpretation (Wahba, 1990), with the covariance function chosen so that spline estimation can be used (see below). Here the covariance function is chosen to reflect our assumptions about $\gamma$ and the model. We also consider non-Bayesian methods of inference and study standard frequentist properties such as consistency and asymptotic distribution theory. However, the basic approach could also be applied to Bayesian inference.
Estimation

The integrated likelihood is a normal likelihood with mean vector $X\beta$ and covariance matrix $V(\theta) = \Omega_\phi + \Sigma_\lambda$, $\theta = (\phi, \lambda)$. Given the covariance parameter $\theta$, $\beta$ can be estimated by generalized least squares:
$$\hat\beta(\theta) = (X^T V^{-1} X)^{-1} X^T V^{-1} Y, \qquad V \equiv V(\theta).$$
When $\theta$ is unknown, it can be replaced by an estimator. To estimate $\theta$, we can use the restricted maximum likelihood (REML) estimator, which maximizes
$$l_p(\theta) - \tfrac{1}{2} \log |X^T V(\theta)^{-1} X|,$$
where $l_p$ is the profile integrated likelihood. Given the REML estimator $\hat\theta$ of $\theta$, an estimator of $\beta$ is given by $\hat\beta(\hat\theta)$.
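Given a covariance matrix $V(\theta)$, both the GLS estimator and the REML criterion are a few lines of linear algebra; here is a minimal sketch, with illustrative names, assuming numpy arrays (the criterion would then be maximized over $\theta$ by a generic optimizer):

```python
import numpy as np

def gls_beta(Y, X, V):
    # Generalized least-squares estimate: (X^T V^{-1} X)^{-1} X^T V^{-1} Y.
    VinvX = np.linalg.solve(V, X)
    VinvY = np.linalg.solve(V, Y)
    return np.linalg.solve(X.T @ VinvX, X.T @ VinvY)

def reml_criterion(Y, X, V):
    # REML objective for the covariance parameters, up to an additive
    # constant: profile integrated log-likelihood minus
    # (1/2) log |X^T V^{-1} X|.  Maximize over theta with V = V(theta).
    beta = gls_beta(Y, X, V)
    r = Y - X @ beta
    _, logdetV = np.linalg.slogdet(V)
    _, logdetXVX = np.linalg.slogdet(X.T @ np.linalg.solve(V, X))
    return -0.5 * logdetV - 0.5 * r @ np.linalg.solve(V, r) - 0.5 * logdetXVX
```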
Note that standard methods of computation for mixed models can be used.

To estimate $\gamma$, we can use the best linear unbiased predictor (BLUP), based on the assumption that $\gamma$ is a random function. Let $z^*$ denote an element of $\mathcal{Z}$ and consider estimation of $\gamma(z^*)$. The BLUP of $\gamma(z^*)$ is
$$\Sigma_*(\hat\theta)\, V(\hat\theta)^{-1} (Y - X\hat\beta(\hat\theta)),$$
where $\Sigma_*(\theta)$ denotes the row vector with $j$th element $K_\lambda(z^*, z_j)$. To use this approach, the covariance function $K_\lambda$ must be chosen; to do this, we consider the properties of $\{\gamma(z) : z \in \mathcal{Z}\}$ as a random process.
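A sketch of the BLUP at a new point, assuming the fitted quantities are already available (the helper name and signature are illustrative):

```python
import numpy as np

def blup_gamma(z_new, z, Y, X, beta_hat, V, k):
    # BLUP of gamma(z_new): sigma_* V^{-1} (Y - X beta_hat), where
    # sigma_*[j] = k(z_new, z[j]) is the assumed covariance K_lambda(z_new, z_j).
    sigma_star = np.array([k(z_new, zj) for zj in z])
    return sigma_star @ np.linalg.solve(V, Y - X @ beta_hat)
```

Evaluating this over a grid of `z_new` values gives the estimated curve $\hat\gamma(\cdot)$.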
Models with an Unknown Continuous Function on the Real Line

Suppose $\mathcal{Z} \subset R$ and $\gamma$ is a smooth function. It is often reasonable to assume that the covariance of $\gamma(z)$ and $\gamma(\tilde z)$ is a decreasing function of $|z - \tilde z|$, so that
$$K_\lambda(z, \tilde z) = \tau^2 \bar K_\nu(|z - \tilde z| / \alpha),$$
where $\bar K_\nu$ is a decreasing, positive-definite function on $[0, \infty)$ with $\bar K_\nu(0) = 1$. Here $\tau > 0$ is the standard deviation of $\gamma(z)$, $\alpha > 0$ represents a scale parameter, and $\nu$ represents a shape parameter (if present). One choice for $\bar K_\nu$ is the Gaussian covariance function $\bar K(t) = \exp(-\tfrac{1}{2} t^2)$; then $\{\gamma(z) : z \in \mathcal{Z}\}$ is a stationary, infinitely differentiable random process.
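Under the Gaussian choice for $\bar K_\nu$, the matrix $\Sigma_\lambda$ is built directly from the design points; a minimal sketch (illustrative name):

```python
import numpy as np

def gaussian_cov_matrix(z, tau, alpha):
    # Sigma_lambda for the Gaussian covariance function:
    # Sigma[i, j] = tau^2 * exp(-0.5 * ((z_i - z_j) / alpha)^2).
    z = np.asarray(z, dtype=float)
    d = (z[:, None] - z[None, :]) / alpha
    return tau**2 * np.exp(-0.5 * d**2)
```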
As noted earlier, the IL approach is related to spline estimation. There are at least two spline methods that can be used here: smoothing splines (e.g., Wahba, 1990) and penalized splines (e.g., Ruppert, Wand, and Carroll, 2003).

Smoothing splines: $\gamma$ is a mean-zero Gaussian process with covariance function
$$(1 + z \tilde z)\,[(1 + z^2)(1 + \tilde z^2)]^{-1/2}, \qquad z, \tilde z \in [0, 1].$$
This process is nonstationary and highly correlated.

Penalized splines: $\gamma$ is a Gaussian stochastic process with mean $\delta_0 + \delta_1 z + \delta_2 z^2$ and covariance function
$$K_P(z, \tilde z) = \tau^2 \sum_{j=1}^{k} (z - d_j)^2 (\tilde z - d_j)^2 \quad \text{for } d_k < z \le d_{k+1} \text{ and } z \le \tilde z,$$
where $0 < d_1 < d_2 < \dots < d_r < 1$ are given knots. Under $K_P$, the correlation of $\gamma(z)$ and $\gamma(\tilde z)$ is generally small.
Incorporating Assumptions about $\gamma(\cdot)$ in the Model

A main advantage of the IL approach is in models with additional assumptions on $\gamma$.

Linear constraints on $\gamma$. Suppose $\gamma$ is subject to a constraint of the form $T\gamma = 0$, where $T$ is a known, real-valued, affine function on $L^2(\mathcal{Z})$. In carrying out the IL approach, we need a distribution for $\{\gamma(z) : z \in \mathcal{Z}\}$ that respects the condition $T\gamma = 0$. First consider a mean-zero Gaussian process $\{\gamma_0(z) : z \in \mathcal{Z}\}$ with Gaussian covariance function $H_\lambda$, and take $\{\gamma(z) : z \in \mathcal{Z}\}$ to have the conditional distribution of $\gamma_0$ given that $T\gamma_0 = 0$. This conditional distribution is identical to the distribution of
$$\gamma_0(z) - \frac{\mathrm{Cov}[\gamma_0(z), T\gamma_0]}{\mathrm{Var}(T\gamma_0)}\, T\gamma_0$$
(Janson, 1997).
It follows that $\{\gamma(z) : z \in \mathcal{Z}\}$ is a mean-zero Gaussian process with covariance function
$$K_\lambda(t, s) = H_\lambda(t, s) - \frac{\mathrm{Cov}[\gamma_0(t), T\gamma_0; \lambda]\,\mathrm{Cov}[\gamma_0(s), T\gamma_0; \lambda]}{\mathrm{Var}(T\gamma_0; \lambda)}.$$
Thus, the restriction can be taken into account by simply modifying the covariance function of the process. For instance, suppose that
$$T\gamma_0 = \int_{\mathcal{Z}} \gamma_0(t) w(t)\,dt - c,$$
where $w$ is a given element of $L^2(\mathcal{Z})$ and $c$ is a constant. Then
$$K_\lambda(t, s) = H_\lambda(t, s) - \frac{\int_{\mathcal{Z}} H_\lambda(t, u) w(u)\,du \,\int_{\mathcal{Z}} H_\lambda(s, u) w(u)\,du}{\int_{\mathcal{Z}} \int_{\mathcal{Z}} H_\lambda(u, v) w(u) w(v)\,du\,dv}.$$
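On a grid of points in $\mathcal{Z}$, the integrals above become sums, and the constrained covariance matrix can be formed in one rank-one update. A minimal sketch under that discretization (names are illustrative); note that under the resulting $K$, the constrained functional has zero variance, which is exactly the constraint:

```python
import numpy as np

def constrained_cov(H, w, dz):
    # Discretized constrained covariance: K = H - (Hw)(Hw)^T / (w^T H w),
    # where integrals over Z are approximated by sums on a grid with
    # spacing dz, for the constraint T gamma0 = ∫ gamma0(t) w(t) dt - c = 0.
    Hw = H @ w * dz                 # ∫ H(t, u) w(u) du at each grid point t
    denom = (w @ H @ w) * dz**2     # ∫∫ H(u, v) w(u) w(v) du dv
    return H - np.outer(Hw, Hw) / denom
```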
Asymptotic Properties of the Estimator

Suppose that $\hat\theta$ satisfies $\hat\theta = \theta^* + O_p(1/\sqrt{n})$ for some $\theta^*$. Recall that $\theta = (\phi, \lambda)$, where $\phi$ is a parameter of the error covariance matrix and $\lambda$ is a parameter of the covariance function of $\gamma(\cdot)$. Therefore $\phi^* = \phi_0$, the true value of $\phi$. However, there is no conventional true value of $\lambda$.

$\hat\beta$ has the same asymptotic distribution as
$$\hat\beta^* \equiv (X^T (V^*)^{-1} X)^{-1} X^T (V^*)^{-1} Y, \qquad V^* = V(\theta^*).$$
Note that $\hat\beta^*$ is normally distributed but has bias
$$(X^T (V^*)^{-1} X)^{-1} X^T (V^*)^{-1} g, \qquad g = (\gamma(z_1), \dots, \gamma(z_n))^T.$$
The key idea in showing that the bias is asymptotically negligible is that $\Sigma_{\lambda^*}$ has properties similar to a covariance function of $g$. E.g., suppose that $\Omega_{\phi^*} = I$ and $\Sigma_{\lambda^*} = g g^T$, the sample covariance function based on $g$. Then, by the Sherman–Morrison formula,
$$(V^*)^{-1} g = (I + g g^T)^{-1} g = \frac{1}{1 + g^T g}\, g = O(n^{-1}).$$
Under fairly general conditions on $\gamma$, it can be shown that
$$\sqrt{n}\,(\hat\beta - \beta_0) \to_D N(0, M^*) \quad \text{as } n \to \infty,$$
where
$$M^* \equiv \lim_{n \to \infty} n\,[(X^T V^{-1}(\theta^*) X)^{-1} X^T V^{-1}(\theta^*)\, \Omega_{\phi_0}\, V^{-1}(\theta^*) X (X^T V^{-1}(\theta^*) X)^{-1}].$$
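The matrix identity $(I + g g^T)^{-1} g = g / (1 + g^T g)$ used in the bias argument is easy to verify numerically; a quick sketch:

```python
import numpy as np

# Check the Sherman–Morrison step in the bias argument:
# (I + g g^T)^{-1} g = g / (1 + g^T g).
rng = np.random.default_rng(0)
g = rng.normal(size=20)
lhs = np.linalg.solve(np.eye(20) + np.outer(g, g), g)
rhs = g / (1 + g @ g)
```

Since $g^T g$ grows like $n$ while the entries of $g$ stay bounded, each entry of the right-hand side is $O(1/n)$.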
Examples

Example 1: Semiparametric regression model with independent errors. Bowman and Azzalini (1997) present data taken from a survey of the fauna on the sea bed lying between the coast of northern Queensland and the Great Barrier Reef. Let $Y$ denote catch score 1 and let $x$ and $z$ denote the latitude and longitude, respectively, of the sampling position. Here we use the data from zone 1; the sample size is $n = 42$. An appropriate model for these data is
$$Y_j = \beta_0 + \beta_1 x_j + \gamma(z_j) + \epsilon_j, \qquad j = 1, \dots, n,$$
where $\epsilon_1, \dots, \epsilon_n$ are independent error terms with mean 0 and constant variance.
This model was fit using the IL method with a Gaussian covariance function. For comparison, the model was also fit using the generalized additive model approach of Hastie & Tibshirani (smoothing splines), the penalized spline method described in Semiparametric Regression by Ruppert, Wand, & Carroll, and a kernel-based estimator (Speckman, 1988, and many others).

Estimates of $\beta_1$ (reported SE):
- IL: 1.020 (0.356)
- GAM: 1.153 (0.371)
- Pen Spline: 1.098 (0.368)
- Kernel: 1.203 (0.371)

The estimates of $\gamma$ are also in close agreement.
[Figure: Estimates of $\gamma$ in the reef example — $\hat\gamma(z)$ plotted against $z$ for the Int Like, SPM, GAM, and Kernel methods.]
A small simulation study was conducted in which data were simulated from the model described here, with the parameter values taken to be the estimates based on the integrated likelihood method. A Monte Carlo sample size of 5000 was used.

Comparison of Estimators in the Reef Example

Method        Bias     SD      MSE     Est SE   Cov Prob
Int Lik      -0.067   0.365   0.138   0.350    0.933
GAM          -0.017   0.364   0.133   0.354    0.938
Pen Spline   -0.007   0.368   0.135   0.360    0.940
Kernel        0.030   0.423   0.180   0.377    0.910
Example 2: A shape-invariant model. Hastie, Tibshirani, and Friedman (2001) describe data on bone mineral density (BMD) in adolescents. The response variable $Y_j$ is relative change in spinal BMD, which is modeled as a function of age and gender. Preliminary analysis suggests that the relationship between $Y_j$ and age is different for males and females, with the function relating $Y_j$ and age for males being a scaled and shifted version of the corresponding function for females. This observation suggests a model in which the mean of $Y_j$ is of the form
$$\beta_0 + \beta_1^{x_j}\, \gamma(z_j + \beta_2 x_j),$$
where $z_j$ denotes age and $x_j = 1$ if subject $j$ is male and 0 otherwise.
It follows that the mean function for males is $\beta_0 + \beta_1 \gamma(z_j + \beta_2)$, while the mean function for females is $\beta_0 + \gamma(z_j)$. To compute the IL, we use a weight function based on taking $\gamma$ to be a mean-0 Gaussian process with a Gaussian covariance function. Then
$$\mathrm{Cov}\big(\beta_1^{x_j} \gamma(z_j + \beta_2 x_j),\, \beta_1^{x_k} \gamma(z_k + \beta_2 x_k)\big) = \beta_1^{x_j + x_k} K_\lambda\big(|z_j - z_k + \beta_2 (x_j - x_k)|\big).$$
There is a further complication to this data set: some of the subjects are tested multiple times (485 observations on 261 subjects). To account for this, the model was modified to include subject-specific intercept terms, taken to be normally distributed random effects.
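With the Gaussian covariance function for $\gamma$, this covariance matrix can be formed directly from ages and gender indicators; a minimal sketch (the function name and parameterization are illustrative, and the random-intercept term is omitted):

```python
import numpy as np

def shape_invariant_cov(z, x, beta1, beta2, tau, alpha):
    # Cov(beta1^{x_j} gamma(z_j + beta2 x_j), beta1^{x_k} gamma(z_k + beta2 x_k))
    # = beta1^{x_j + x_k} * tau^2 * exp(-0.5 * ((u_j - u_k) / alpha)^2),
    # where u_j = z_j + beta2 x_j are the shifted ages.
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    u = z + beta2 * x
    d = (u[:, None] - u[None, :]) / alpha
    scale = beta1 ** (x[:, None] + x[None, :])
    return tau**2 * scale * np.exp(-0.5 * d**2)
```

Note that $\beta_1$ and $\beta_2$ enter the likelihood only through this covariance matrix, which is the sense in which the parameters of interest sit in the covariance structure rather than the mean.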
Thus, the model has 7 parameters:
- $\beta_0$, the mean of the subject-specific intercepts;
- $\beta_1$ and $\beta_2$, which describe how males and females differ;
- the variances of the error term and of the random intercepts;
- two parameters for the Gaussian covariance function.

Note that the parameters of primary interest, $\beta_1$ and $\beta_2$, appear in the covariance matrix of $Y$ rather than in the mean function. The estimate of the shift is 2.1 years (SE = 0.19); the estimate of the scaling factor is 0.79 (SE = 0.068). The plot of the estimated model describes the differences in the relationship between change in BMD and age for males and females.
[Figure: Comparison of Males and Females in the BMD Example — estimated relative change in spinal BMD plotted against age for each gender.]
Summary
- The IL method provides a conceptually easy approach to estimation in models with an unknown function.
- In simple models, the IL method works (nearly) as well as standard methods.
- In more complicated settings, it is often straightforward to modify the covariance function used to form the IL.
- Computation: standard methods work surprisingly well in the normal case; for non-normal errors more sophisticated methods will be needed.
- Current proofs of asymptotic properties require stronger conditions than other methods; examples suggest that weaker conditions would suffice.