A New Approach to Modeling Covariate Effects and Individualization in Population Pharmacokinetics-Pharmacodynamics


Journal of Pharmacokinetics and Pharmacodynamics, Vol. 33, No. 1, February 2006 (© 2006 Springer Science+Business Media, Inc.)

Tze Leung Lai,1,* Mei-Chiung Shih,2 and Samuel P. Wong3

Received September 7, 2004; final October 7, 2005; published online January 10, 2006.

By combining Laplace's approximation and Monte Carlo methods to evaluate multiple integrals, this paper develops a new approach to estimation in nonlinear mixed effects models that are widely used in population pharmacokinetics and pharmacodynamics. Estimation here involves not only estimating the model parameters from Phase I and II studies but also using the fitted model to estimate the concentration versus time curve or the drug effects of a subject who has covariate information but sparse measurements. Because of its computational tractability, the proposed approach can model the covariate effects nonparametrically by using (i) regression splines or neural networks as basis functions and (ii) AIC or BIC for model selection. Its computational and statistical advantages are illustrated in simulation studies and in Phase I trials.

KEY WORDS: mixed effects model; hybrid estimator; neural network; regression splines.

1 Department of Statistics, Stanford University, Stanford, CA 94305, USA.
2 Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
3 Department of Statistics, The Chinese University of Hong Kong, Hong Kong, China.
* To whom correspondence should be addressed. E-mail: lait@stat.stanford.edu

INTRODUCTION

A widely used model in population pharmacokinetics (PK) and pharmacodynamics (PD) is the nonlinear mixed effects model of the form

$$y_{ij} = f_i(t_{ij}, \theta_i) + \varepsilon_{ij}, \qquad \theta_i = g(x_i, \beta) + b_i \qquad (1 \le j \le n_i,\ 1 \le i \le I), \tag{1}$$

where θ_i is a 1 × r vector of the ith subject's parameters whose regression function on the subject's observed covariate x_i is g(x_i, β) with a 1 × s parameter vector β, which is the fixed effect to be estimated. The random effects b_i in Eq. (1) are assumed to be independent and identically distributed (i.i.d.), and their nonzero components have a common distribution G with mean 0. The ith subject's response y_ij at t_ij has mean f_i(t_ij, θ_i), in which f_i is a known function. Given θ_i, the random errors ε_ij are assumed to be independent normal random variables with mean 0 and standard deviation σω_ij(θ_i), in which ω_ij is a given function and σ is an unknown parameter. In PK, the concentrations at times t_ij after the administration of a single oral dose D_i are often modeled by the one-compartment model

$$y_{ij} = \frac{D_i k_{ai}}{V_i (k_{ai} - k_{ei})}\left(e^{-k_{ei} t_{ij}} - e^{-k_{ai} t_{ij}}\right) + \varepsilon_{ij}, \qquad 1 \le j \le n_i. \tag{2}$$

Here V_i, k_ai, k_ei are the ith subject's volume of distribution, absorption rate, and elimination rate, respectively, and their logarithms constitute the vector θ_i in Eq. (1). The regression function g relates θ_i to the ith subject's physiologic characteristics that constitute the covariate vector x_i. The population distribution G is usually assumed to be normal with unknown parameters which, together with β and σ, can be estimated by maximum likelihood.

Unlike linear mixed effects models, in which the normality assumption on G yields closed-form expressions for the likelihood, the normality of G in nonlinear mixed effects models leads to computationally intensive likelihoods that involve I multiple integrals. A commonly used approach, as adopted in the software package NONMEM (1) and the nlme function in S-Plus due to Lindstrom and Bates (2) and Pinheiro and Bates (3), is to develop iterative schemes based on first-order approximations of f_i(t_ij, g(x_i, β) + b_i) in Eq. (1), so that the normality assumption on G can be used to reduce the problem to that of a linear Gaussian mixed effects model at each iterative step. A basic issue with this approximation is that when some of the subjects have sparse data, there are considerable errors in approximating the likelihood function via these first-order approximations, as noted by Yafune et al. (4), who propose to use Monte Carlo integration to evaluate the I multiple integrals in the likelihood function for Phase I studies but point out that the computational time (already taking 22 hours in their particular Phase I trial) may be too long for Phase II (or later) trials to be of practical interest. Another issue is that the actual population distribution may be highly nonnormal. Since there is no computational advantage in using normal G when first-order approximations that reduce the problem to linear Gaussian mixed effects models are not used, it may be more appropriate to try more flexible parametric families for G. Refs. (5)-(7) propose certain parametric families that incorporate skewness and multimodality, but they are too computationally intensive for routine use.
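To make Eqs. (1)-(2) concrete, the following minimal sketch simulates noisy concentrations for one subject from the one-compartment model. All numerical values (dose, rates, sampling times) are hypothetical illustrations rather than values from the paper, and the error standard deviation is taken proportional to the mean, i.e., ω_ij = f, the form used later in the paper's simulation study.

```python
import numpy as np

rng = np.random.default_rng(0)

def conc(t, D, V, ka, ke):
    """Mean concentration of the one-compartment model with first-order
    absorption, Eq. (2): D*ka/(V*(ka-ke)) * (exp(-ke*t) - exp(-ka*t))."""
    return D * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Hypothetical subject: theta = (log V, log ka, log ke) = g(x, beta) + b
t = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0])   # sampling times (h)
D, V, ka, ke, sigma = 500.0, 30.0, 1.2, 0.15, 0.1
mean = conc(t, D, V, ka, ke)
y = mean * (1.0 + sigma * rng.standard_normal(t.size))  # error sd proportional to mean
```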

Refs. (8)-(10) point out difficulties in likelihood inference due to inaccuracies of the first-order approximations in nonlinear mixed effects models.

In this paper we make use of a hybrid approach, developed recently by Lai and Shih (11), that uses first-order approximations based on Laplace's method to evaluate the likelihood when the subject has sufficient data, in combination with Monte Carlo approximations of the likelihood involving relatively few simulation runs when the subject has sparse data. Details of this approach and its underlying rationale are given in the next section, which also discusses why this approach can lead to consistent estimators of the fixed effects even when the population distribution G is misspecified (e.g., as normal). In this connection, a review of recent work by Lai and Shih (12) and by previous authors on nonparametric modeling of G is also given in (11), where it is shown that such nonparametric modeling typically does not yield better estimates of the fixed effects than the hybrid approach that assumes G to be normal. Moreover, an improved hybrid method that uses importance sampling instead of direct Monte Carlo is developed.

Another issue with NONMEM that we address herein is the choice of the functional form of g. Whereas the choice of f_i in Eq. (1) is usually based on scientific theory, like Eq. (2) for PK, regression of the random effects θ_i on the covariates x_i is statistical modeling of the "black box" type, and one often chooses linear regression functions for convenience. In recent years there have been attempts to apply modern nonparametric regression techniques, such as generalized additive models (13), to the residuals after fitting a linear regression model or even the simpler population model without covariate effects. Because the θ_i in Eq. (1) are unobservable, this raises the question of how residuals should be defined. We review previous definitions in the literature and propose a new definition by making use of the concept of generalized residuals introduced by Cox and Snell (14). These generalized residuals, computed by the hybrid method, enable us to perform diagnostic checks of the assumed regression model (see "Regression Diagnostics"). Instead of linear regression, we assume more flexible regression models that use regression splines or neural networks as basis functions, and fit them via the likelihood function, which has relatively low computational complexity when we use the improved hybrid method. This computational tractability also enables us to address likelihood-based inference and develop model selection schemes (see "Flexible Regression Modeling and Likelihood Inference").

The individualization problem is of fundamental interest in population PK. Since the efficacy and toxicity of a drug are directly related to drug concentrations at a target site, which are generally not available but for which blood concentrations are often good surrogates, criteria for designing the dosing regimen for a specific subject often involve functions of the subject's concentrations, or equivalently in Eq. (1), functions of the subject's parameter vector θ = g(x, β) + b.

The subject's blood samples are often too sparse to provide an adequate estimate of θ. The empirical Bayes approach borrows information from healthy volunteers in Phase I studies who have undergone intensive blood sampling, and also from clinical patients for whom intensive blood sampling is not feasible. Combining an individual patient's characteristics (as measured by x) and sparse concentration data with a large database from other subjects is one of the main motivations for building population PK models. Making use of the improved hybrid method, we show how empirical Bayes estimates of h(θ) can be computed from (a) the patient's data and (b) the population PK model fitted from other subjects' data (see "Individualization").

A HYBRID METHOD FOR MAXIMUM LIKELIHOOD ESTIMATION

Suppose the distribution G is normal with mean 0 and covariance matrix Σ. For given values of β, σ and Σ, the integral for the ith subject in the likelihood function can be written as an expectation E ψ_i(b), which can be computed by Monte Carlo simulations of the random vector b with the normal density function φ_Σ having mean 0 and covariance matrix Σ. Alternatively, letting e^{l_i(b)} = ψ_i(b) φ_Σ(b), we can use Laplace's method to approximate the integral:

$$\int e^{l_i(b)}\, db^{(1)} \cdots db^{(r)} \approx (2\pi)^{r/2}\, \bigl|{-\ddot l_i(\hat b_i)}\bigr|^{-1/2}\, e^{l_i(\hat b_i)}, \tag{3}$$

where l̈_i is the Hessian matrix of second partial derivatives of l_i with respect to the components b^(j) of b, and b̂_i is the maximizer of l_i(b). Laplace's approximation basically approximates l_i by a quadratic function in a neighborhood of the maximizer b̂_i; its adequacy can be gauged by λ_min(−l̈_i(b̂_i)), where λ_min(·) denotes the minimum eigenvalue of a symmetric matrix. If the observations (y_ij, t_ij), 1 ≤ j ≤ n_i, are sufficiently informative about the ith subject's parameter vector θ_i = g(x_i, β_0) + b_i, then for (β, σ) near the true value (β_0, σ_0), l_i(b) becomes peaked around b̂_i and can be well approximated by the quadratic function l_i(b̂_i) + (b − b̂_i) l̈_i(b̂_i)(b − b̂_i)^T/2. Laplace's approximation is also applicable when λ_min(Σ^{-1}) is large, which occurs when the distribution of b is concentrated around 0. When λ_min(Σ^{-1}) is not sufficiently large and the ith subject has sparse data, Laplace's method may give a poor approximation to the left-hand side of Eq. (3), which will be denoted by L_i(β, σ, Σ).
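A minimal sketch of Laplace's approximation (3) for a generic log-integrand l(b), using numerical optimization and a finite-difference Hessian; the function name and step size are our own choices, and the returned minimum eigenvalue of V_i = −l̈_i(b̂_i) is the quantity used by the hybrid method's switching rule below.

```python
import numpy as np
from scipy import optimize

def laplace_integral(l, b0, h=1e-5):
    """Approximate int exp(l(b)) db by (2*pi)^{r/2} |-Hess l(b_hat)|^{-1/2} exp(l(b_hat)),
    Eq. (3), where b_hat maximizes l; also return lambda_min(-Hess l(b_hat))."""
    r = len(b0)
    res = optimize.minimize(lambda b: -l(np.asarray(b)), b0)
    b_hat = res.x
    H = np.zeros((r, r))                      # central-difference Hessian of l at b_hat
    for i in range(r):
        for j in range(r):
            ei, ej = h * np.eye(r)[i], h * np.eye(r)[j]
            H[i, j] = (l(b_hat + ei + ej) - l(b_hat + ei - ej)
                       - l(b_hat - ei + ej) + l(b_hat - ei - ej)) / (4 * h * h)
    V = -H                                    # V_i = -l''(b_hat) in the text
    lam_min = np.linalg.eigvalsh(V).min()
    value = (2 * np.pi) ** (r / 2) * np.exp(l(b_hat)) / np.sqrt(np.linalg.det(V))
    return value, lam_min

# Sanity check: for l(b) = log N(b; 0, 1) the integral is exactly 1.
val, lam = laplace_integral(lambda b: -0.5 * b[0] ** 2 - 0.5 * np.log(2 * np.pi), [0.5])
```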

These considerations led Lai and Shih (11) to introduce the following hybrid method for evaluating L_i(β, σ, Σ), which combines Laplace's approximation with the Monte Carlo approximation. Choose a threshold c and let V_i = −l̈_i(b̂_i).

(i) If λ_min(V_i) < c, evaluate L_i(β, σ, Σ) by the Monte Carlo approximation B^{-1} Σ_{j=1}^B ψ_i(Σ^{1/2} z_j), where z_j, j = 1, ..., B, are independent random vectors from the standard normal distribution. Note that Σ^{1/2} z_j is normal with mean 0 and covariance matrix Σ.

(ii) If λ_min(V_i) ≥ c, evaluate L_i(β, σ, Σ) by its Laplace approximation (2π)^{r/2} |V_i|^{-1/2} e^{l_i(b̂_i)}.

Following Lindstrom and Bates (2), the iterative procedure used to maximize the logarithm of ∏_{i=1}^I L_i(β, σ, Σ) first maximizes over β for fixed η = (σ, Σ) and then maximizes over η for fixed β, repeating until convergence or until a prespecified maximum number of iterative steps is reached. To avoid numerical instability in differentiating log L_i with respect to β, care should be taken when L_i computed by Monte Carlo approximation is small, in which case we can circumvent the difficulty by simply replacing L_i by its Laplace approximation, whose logarithm is convenient for differentiation. Details on the choice of the threshold c and starting values for β, σ, Σ can be found in Section 3.2 of (11). In particular, for typical population PK studies that involve both healthy volunteers, from whom intensive blood samples are taken, and clinical patients, who only have sparse blood samples, one can first single out potentially "good" studies and check their λ_min(V_i) values. It is usually adequate to choose a threshold c as low as 10 for λ_min(V_i) to determine if these potentially good studies indeed qualify for using Laplace's approximation to L_i(β, σ, Σ); see "Simulation Study" for further discussion of the choice of c. Moreover, for such experimental designs, good starting values can be obtained by using only those studies that have sufficient data, so that their θ_i can be well estimated by the nonlinear least squares estimate based on (y_ij, t_ij), 1 ≤ j ≤ n_i.

By performing simple diagnostics on the appropriateness of using Laplace's approximation to evaluate the integral in Eq. (3) for the ith subject, the hybrid approach preserves the computational simplicity of Laplace's method when it can be used and switches to the Monte Carlo method when Laplace's method fails. If the ith subject has enough data so that l_i(b) is peaked around b̂_i for (β, σ) near (β_0, σ_0), the Monte Carlo approach becomes unreliable unless B is very large or importance sampling is used to generate the B samples from a distribution that is peaked around b̂_i, so Laplace's method gives a better approximation to L_i(β, σ, Σ) in this case.
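The switching rule (i)-(ii) can then be sketched as follows, reusing laplace_integral from the sketch above; ψ_i, the log-integrand l, Σ, and the starting point b0 are placeholders to be supplied by the surrounding fitting code.

```python
import numpy as np

def hybrid_L(psi, l, Sigma, b0, c=10.0, B=50, rng=np.random.default_rng(1)):
    """Hybrid evaluation of L_i = E psi(b), b ~ N(0, Sigma): Laplace's
    approximation when lambda_min(V_i) >= c (step (ii)), otherwise plain
    Monte Carlo with B draws Sigma^{1/2} z_j (step (i))."""
    value, lam_min = laplace_integral(l, b0)   # from the sketch above
    if lam_min >= c:
        return value                           # step (ii): Laplace
    A = np.linalg.cholesky(Sigma)              # Sigma^{1/2}
    z = rng.standard_normal((B, len(b0)))
    return float(np.mean([psi(A @ zj) for zj in z]))
```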

On the other hand, if the ith subject has sparse data and l_i(b) is relatively flat in b, then applying the Monte Carlo approach is tantamount to choosing a random distribution G_i, which is the empirical distribution of a sample of B random vectors Σ^{1/2} z_j with standard normal z_j, to approximate G. As there is no need for high resolution in the random distribution used to approximate the actual G (which may not even be normal), using 50 ≤ B ≤ 200 samples in the Monte Carlo method should be able to provide enough statistical detail while maintaining a low computational cost, comparable to that of the first-order method that can be derived from Laplace's approximation (2).

We can improve the Monte Carlo method in (i) above by using importance sampling instead of sampling directly from φ_Σ. Specifically, we evaluate L_i(β, σ, Σ) by the importance sampling estimate

$$\sum_{j=1}^{B} \psi_i(\zeta_j)\, w_j \Big/ \sum_{j=1}^{B} w_j, \tag{4}$$

where P{ζ_j = Σ^{1/2} z_j} = p = 1 − P{ζ_j = b̂_i + (V_i + εI)^{-1/2} z_j} with standard normal z_j, which corresponds to sampling ζ_j from a mixture of the prior normal distribution with density φ_Σ and the posterior normal distribution with mean b̂_i and covariance matrix (V_i + εI)^{-1}, choosing some small ε > 0 to ensure that the covariance matrix is invertible. Denoting the density function of this mixture distribution by λ, note that λ(x) = p φ_Σ(x) + (1 − p) φ_{(V_i + εI)^{-1}}(x − b̂_i). The w_j in Eq. (4) are the importance weights given by w_j = φ_Σ(ζ_j)/λ(ζ_j). Note that the special case p = 1 reduces to the direct Monte Carlo method in (i) above, whereas the case p = 0 corresponds to a Monte Carlo implementation of Laplace's method. We recommend choosing p in the range 0.2 ≤ p ≤ 0.5. This choice of the importance distribution has the advantage of further incorporating the essence of Laplace's approximation in the simulation step, making the method less dependent than the direct Monte Carlo used in (11) on the choice of the threshold c. For further discussion of this importance sampling approach, see "Individualization".

In the case where the I studies contain many good ones (in the preceding sense), Lai and Shih (12) developed nonparametric maximum likelihood estimates of G, β and σ. Previous work in this direction by Mallet (15,16) and Mentré and Mallet (17) assumes that the x_i are i.i.d., so that β can be estimated via the joint distribution of (x_i, b_i), which can be estimated by using the algorithms developed in (15). By using the good studies to initialize the nonparametric maximum likelihood estimate of (G, β, σ), Lai and Shih (12) were able to dispense with the restrictive assumption that the x_i be i.i.d. and to estimate the finite-dimensional parameter β directly, without going through the much more difficult infinite-dimensional problem of estimating the joint distribution of (x_i, b_i).
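A sketch of the importance-sampling estimate (4) under the mixture proposal; ψ_i, Σ, b̂_i and V_i are placeholders for quantities produced by the fitting code, and the defaults for p and B follow the recommendations in the text.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def is_L(psi, Sigma, b_hat, V, p=0.3, B=50, eps=1e-6,
         rng=np.random.default_rng(2)):
    """Importance-sampling estimate (4): zeta_j drawn from the mixture
    p*N(0, Sigma) + (1-p)*N(b_hat, (V + eps*I)^{-1}), with weights
    w_j = phi_Sigma(zeta_j) / lambda(zeta_j)."""
    r = len(b_hat)
    post_cov = np.linalg.inv(V + eps * np.eye(r))
    from_prior = rng.random(B) < p
    zeta = np.where(from_prior[:, None],
                    rng.multivariate_normal(np.zeros(r), Sigma, size=B),
                    rng.multivariate_normal(b_hat, post_cov, size=B))
    lam = (p * mvn.pdf(zeta, mean=np.zeros(r), cov=Sigma)
           + (1 - p) * mvn.pdf(zeta, mean=b_hat, cov=post_cov))
    w = mvn.pdf(zeta, mean=np.zeros(r), cov=Sigma) / lam
    vals = np.array([psi(z) for z in zeta])
    return float(np.sum(vals * w) / np.sum(w))
```

Setting p = 1 recovers the direct Monte Carlo method in (i), and p = 0 samples only from the Laplace-style posterior approximation, mirroring the two special cases noted above.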

They found, however, that even when G is highly non-normal (e.g., has a bimodal distribution), the parametric estimates of β and σ that assume normal G compare favorably with the nonparametric estimates. An asymptotic theory explaining this is given in (11). Since the nonparametric maximum likelihood estimate Ĝ has relatively low resolution (with a very slow rate of convergence to G as the total sample size n_1 + ... + n_I becomes infinite), approximating the population distribution G by a normal distribution (with covariance matrix Σ to be estimated from the data), or by the random distribution G_i when l_i(b) is relatively flat in the hybrid method, is usually an innocuous assumption in population PK/PD models.

FLEXIBLE REGRESSION MODELING AND LIKELIHOOD INFERENCE

The function g in Eq. (1) is often assumed to be linear in β and x_i because of simplicity and ease of interpretation, but this may be overly restrictive in practice. Mandema et al. (18) introduced a three-step procedure to relax this linearity assumption. The first step consists of fitting a basic PK/PD model without covariates, from which empirical Bayes estimates θ̂_i = (θ̂_i1, ..., θ̂_ir) of θ_i are derived. In the second step, θ̂_im (m = 1, ..., r) is regressed on the covariates x_i = (x_i1, ..., x_ip) by using a generalized additive model of the form

$$\hat\theta_{im} = a_m + \sum_{l=1}^{p} g_{ml}(x_{il}), \qquad m = 1, \ldots, r, \tag{5}$$

in which the constants a_m and the functions g_ml are estimated from (θ̂_i, x_i), i = 1, ..., I, using splines of degree k to approximate the functional form of g_ml. A stepwise addition/deletion method is used to decide which covariates should be included in the model by using Akaike's information criterion (AIC), which is also used to choose the degree k of the splines. In the third step, NONMEM is used to estimate the parameters of the model chosen in the second step.

The additivity assumption in generalized additive models precludes interactions among the covariates. Moreover, the empirical Bayes estimate θ̂_i in Eq. (5), derived from the PK/PD model without covariates, may differ considerably from the actual θ_i, which is not observable and which in fact depends on covariates. Since the hybrid method enables us to carry out full likelihood computations, we propose to apply a likelihood-based model selection procedure that consists of stepwise forward addition followed by stepwise backward elimination, similar to that introduced by Kooperberg et al. (19) for generalized linear models (without random effects).

In addition, we propose to use, instead of additive regression models, the following regression functions, which do not require the additivity assumption and which are widely used in nonparametric multiple regression because of their attractive statistical and computational properties.

(i) Regression splines: For univariate x_i, a regression spline of degree k is a piecewise polynomial, for which the regions that define the pieces are separated by knots and the polynomials join smoothly at the knots. It can be expressed as a linear combination of the basis functions 1, x_i, ..., x_i^k and (x_i − ξ_j)_+^k, where the ξ_j are the knots and t_+ = max(0, t). An alternative piecewise polynomial basis that has computational advantages consists of the B-splines (20). For multivariate x_i, one can define regression splines by adding tensor products of the univariate regression splines, i.e., choosing g in Eq. (1) to be

$$g(x_i, \beta) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x_i), \tag{6}$$

in which B_m(x_i) is a product of terms of the form x_{ij}^l or (x_{ij} − ξ_{mj})_+^k, for some 1 ≤ l ≤ k. Note that estimating the coefficients β_m of regression splines once the knots are determined involves the same procedure as that of the traditional mixed effects model with linear regression of the random effects on the covariates. Moreover, linear regression corresponds to the special case in which there are no knots. It is convenient to choose the knots at certain quantiles of the predictor variables. A much more computationally intensive adaptive knot selection scheme has been developed by Friedman (21) for the case k = 1 (i.e., linear splines) in his MARS (multivariate adaptive regression splines) procedure. In population PK/PD studies, because the number I of subjects is typically not larger than a few hundred and because many of these subjects have sparse data, using quadratic splines with a few knots at some quantiles of the predictor variables typically suffices. Stone (22) has established certain asymptotic optimality properties of these spline approximations to regression functions.
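For the univariate case, a minimal sketch of the truncated-power spline basis with knots at quantiles of the predictor, as suggested above; the function name and the quadratic default are our own choices.

```python
import numpy as np

def spline_basis(x, knots, degree=2):
    """Truncated-power basis of a regression spline of the given degree:
    columns 1, x, ..., x^k and (x - xi)_+^k for each knot xi, so that
    g(x, beta) = basis @ beta is linear in beta once the knots are fixed,
    as in Eq. (6)."""
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.clip(x - xi, 0.0, None) ** degree for xi in knots]
    return np.column_stack(cols)

age = np.linspace(2, 75, 200)
knots = np.quantile(age, [0.25, 0.50, 0.75])   # knots at the quartiles
X = spline_basis(age, knots)                   # 200 x 6 design matrix
```

For multivariate covariates, the tensor-product terms B_m(x_i) of Eq. (6) are columnwise products of such univariate basis columns.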

(ii) Neural networks: The term "neural network" refers to a multi-layer regression function that represents the output in each unit of a layer as a nonlinear function of linear combinations of the inputs, which are the outputs from the units in the previous layer. A popular choice of the nonlinear function is the sigmoid φ(z) = 1/(1 + exp(−z)), and a simple class of neural networks that suffices for PK/PD applications is the feedforward neural network model (NN_k) with a single layer of k sigmoidal hidden units: g(x) = γ_0 + Σ_{j=1}^k γ_j φ(a_j + α_j^T x). Barron (23) has proved that a large class of smooth functions can be approximated by sums of single-layer sigmoidal functions, with an integrated squared error of order O(k^{-1}) that does not depend on the dimension p of x. This is often called the "universal approximation property" of feedforward neural networks for multivariate function approximation, which circumvents the curse of dimensionality in p. To include the commonly used linear regression function as a special case, we propose to augment NN_k to a more general model of the form

$$g(x_i, \beta) = \gamma_0 + \alpha_0^T x_i + \sum_{j=1}^{k} \gamma_j \varphi(a_j + \alpha_j^T x_i), \tag{7}$$

with parameter vector β = (γ_0, γ_1, ..., γ_k, a_1, ..., a_k, α_0^T, α_1^T, ..., α_k^T).
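A sketch of evaluating the augmented network (7); the parameter-packing convention is our own, and k = 0 recovers ordinary linear regression, as intended above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_k(x, beta, k):
    """Evaluate Eq. (7): g(x) = gamma_0 + alpha_0'x + sum_{j=1}^k gamma_j *
    sigmoid(a_j + alpha_j'x). beta packs (gamma_0..gamma_k, a_1..a_k,
    alpha_0, alpha_1, ..., alpha_k); x has shape (n, p)."""
    n, p = x.shape
    gamma = beta[: k + 1]
    a = beta[k + 1 : 2 * k + 1]
    alpha = beta[2 * k + 1 :].reshape(k + 1, p)
    out = gamma[0] + x @ alpha[0]
    for j in range(1, k + 1):
        out = out + gamma[j] * sigmoid(a[j - 1] + x @ alpha[j])
    return out

# Example with k = 1 hidden unit and p = 2 covariates (7 parameters in all).
x = np.random.default_rng(3).standard_normal((5, 2))
g = nn_k(x, beta=np.array([0.1, 1.0, -0.5, 0.3, 0.2, -0.7, 0.4]), k=1)
```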

Likelihood Inference

Let R_i = log L_i, and use Ṙ_i to denote the gradient vector of partial derivatives and R̈_i to denote the Hessian matrix of second derivatives of R_i with respect to σ and the components of β and Σ. A consistent estimator of the asymptotic covariance matrix of (β̂, σ̂, Σ̂) is V̂^{-1}, where V̂ is the observed information matrix −Σ_{i=1}^I R̈_i(β̂, σ̂, Σ̂), which can be computed by taking numerical derivatives and using the hybrid method to evaluate L_i; see (11). By Theorem 2 of (11), V̂^{1/2}(β̂ − β, σ̂ − σ, (Σ̂_ij − Σ_ij)_{1 ≤ i ≤ j ≤ r}) has a limiting standard multivariate normal distribution as I → ∞ under certain regularity conditions.

To test a null hypothesis H that some components of ω = (β, σ, Σ) satisfy certain equality constraints (e.g., that a specified subset of the components of β are 0), we can use the hybrid method to compute the generalized likelihood ratio statistic

$$\mathrm{GLR} = 2\left\{ \sum_{i=1}^{I} R_i(\hat\beta, \hat\sigma, \hat\Sigma) - \sum_{i=1}^{I} R_i(\hat\omega_H) \right\}, \tag{8}$$

where ω̂_H is the maximum likelihood estimate of ω under H. Because we have a smooth parametric family here, Eq. (8) has a limiting χ² distribution (as I → ∞), with degrees of freedom equal to the difference between the dimensionalities of H and the unconstrained parameter space.

Simulation studies in (8)-(10) have shown the significance levels of GLR tests using the χ² approximation to be anti-conservative. A possible explanation for this is that the first-order approximation to the likelihood function used in these simulation studies may yield nominal GLR values that differ substantially from the actual values. Therefore, using a more accurate method such as the hybrid method to compute the left-hand side of Eq. (3) can improve the approximation. On the other hand, the sample size (or, more precisely, the information content of the experimental design) may not be large enough for the applicability of the limiting χ² distribution of (exact) GLR statistics. A more reliable approach is to compute the sampling distribution of the approximate GLR statistic by bootstrap simulations. The bootstrap test involves computing the sampling distribution of the GLR statistic by Monte Carlo simulations assuming the unknown parameters to take the value ω̂_H, and its theoretical justification is that the GLR statistic is an approximate pivot under the null hypothesis; see (24). For similar reasons, it is more reliable to use the bootstrap method (24, Sections 12.5 and 21.5) to construct likelihood confidence regions for ω than to apply the asymptotic normality of V̂^{1/2}(β̂ − β, σ̂ − σ, (Σ̂_ij − Σ_ij)_{1 ≤ i ≤ j ≤ r}), which results in ellipsoidal confidence regions; see Sections 4.2 and 4.4 of (25) for further details in a related application.
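A sketch of the parametric-bootstrap calibration just described; loglik_full and loglik_null (maximized log-likelihoods) and simulate_null (a data generator under ω̂_H) are hypothetical callbacks standing in for the hybrid-method fitting routines.

```python
import numpy as np

def bootstrap_glr_pvalue(loglik_full, loglik_null, simulate_null, data,
                         n_boot=2000, rng=np.random.default_rng(4)):
    """Parametric-bootstrap calibration of the GLR statistic of Eq. (8):
    datasets are simulated with the unknown parameters set to the null
    estimate omega_hat_H, and the observed GLR is referred to the
    resulting Monte Carlo null distribution."""
    glr_obs = 2.0 * (loglik_full(data) - loglik_null(data))
    glr_boot = np.empty(n_boot)
    for b in range(n_boot):
        data_b = simulate_null(rng)          # new dataset under omega_hat_H
        glr_boot[b] = 2.0 * (loglik_full(data_b) - loglik_null(data_b))
    return (1.0 + np.sum(glr_boot >= glr_obs)) / (n_boot + 1.0)
```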

Model Selection

Model selection consists of choosing the number of knots in regression splines, or the number of hidden units in single-layer neural networks, and which of a set of covariates should be included in the model. Two commonly used likelihood-based criteria are Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), where

$$\mathrm{BIC} = -2 \sum_{i=1}^{I} \log L_i(\hat\beta, \hat\sigma, \hat\Sigma) + q \log N, \tag{9}$$

with q log N in Eq. (9) replaced by 2q for AIC, in which q is the number of unknown parameters in the model and N is the total number of observations. To circumvent the computational complexity of minimizing the information criterion over a potentially large set of models, we use a model selection procedure that consists of stepwise forward addition followed by stepwise backward elimination, similar to that introduced in (19) for polychotomous regression. To fix the ideas, we consider the case of regression splines, for which the spline basis in Eq. (6) also includes the choice of covariates. For the forward addition of basis functions, we follow the guidelines of Kooperberg et al. (19): (a) The constant function 1 is included in Eq. (6). (b) If (x_ik − ξ_mk)_+^r (x_ih − ξ_mh)_+^r is included in Eq. (6), then so are (x_ik − ξ_mk)_+^r and (x_ih − ξ_mh)_+^r; thus, main effects are included before interactions. (c) If (x_ik − ξ_mk)_+^r is included in Eq. (6), then so are x_ik^l, l = 1, ..., r; thus, polynomials of degree r are included before knots are incorporated. One reason for these guidelines is that adding main effects before incorporating knots and then adding interactions yields models that are easier to interpret. Another reason is that Stone's (22) theory on the optimal rate of convergence in nonparametric function estimation assumes such a hierarchical structure.

As in (19), a forward addition step aims at adding to the current model the most significant among all possible candidates. Specifically, suppose M − 1 spline basis functions, with coefficients β̂_1, ..., β̂_{M−1}, together with σ̂ and Σ̂, have been fitted to the model and an additional spline basis function is to be chosen from k candidates. For the jth candidate, we consider the Mth component s_M^(j) of the score vector Σ_{i=1}^I Ṙ_i((β̂_1, ..., β̂_{M−1}, 0), σ̂, Σ̂) for the model that includes the M − 1 basis functions and the jth candidate. Let ṽ_M^(j) be the Mth diagonal element of −Σ_{i=1}^I R̈_i((β̂_1, ..., β̂_{M−1}, 0), σ̂, Σ̂). We choose the candidate that maximizes |s_M^(j)|/√ṽ_M^(j) for forward addition. The rationale for using score statistics instead of the GLR statistics in Eq. (8) for this pseudo-test of H: β_M^(j) = 0 is their computational simplicity in ranking the significance of the k candidates. Stepwise forward addition terminates when the information criterion for model selection does not decrease with the addition of a basis function, or when there is no candidate basis function left to be included.

Stepwise backward elimination begins upon the termination of stepwise forward addition, and proceeds until the information criterion for model selection does not improve with the deletion of a basis function. As in (19), a backward elimination step aims at excluding from the current model the least significant basis function, where significance is ranked by Wald statistics. Specifically, suppose M spline basis functions, with coefficients β̂_1, ..., β̂_M, together with σ̂ and Σ̂, have been fitted to the model. The Wald statistic for testing β_j = 0 is β̂_j/ŝe_j, where ŝe_j is the square root of the jth diagonal element of {−Σ_{i=1}^I R̈_i(β̂, σ̂, Σ̂)}^{-1}. If elimination of the basis function with the smallest Wald statistic leads to a new model with a smaller BIC (or AIC), then the new model is preferred.

Note that in the preceding model selection procedure, the score statistics and Wald statistics are only used to rank the significance of the candidate basis functions for inclusion or deletion, and selection is based on an information criterion rather than on significance testing. Therefore, the issue of anti-conservative significance tests in model selection raised in (8)-(10) is not relevant to our model selection procedure. Following (19), we use BIC (instead of AIC) in the subsequent simulation and experimental studies.
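The forward-addition loop can be sketched generically as follows; score_stat(model, j), returning |s_M^(j)|/√ṽ_M^(j), and bic(model) are hypothetical callbacks wrapping the likelihood computations described above.

```python
def forward_addition(candidates, score_stat, bic, max_terms=20):
    """Stepwise forward addition: repeatedly add the candidate basis function
    with the largest score statistic, stopping when BIC no longer decreases
    or no candidates remain; backward elimination (ranked by Wald statistics)
    would then prune the returned model."""
    model, best = [], bic([])
    pool = list(candidates)
    while pool and len(model) < max_terms:
        j_star = max(pool, key=lambda j: score_stat(model, j))
        trial = model + [j_star]
        trial_bic = bic(trial)
        if trial_bic >= best:
            break
        model, best = trial, trial_bic
        pool.remove(j_star)
    return model, best
```

As in the text, the score statistic only ranks candidates within a step; the accept/reject decision rests on the information criterion.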

For the neural network basis, we can again proceed by stepwise forward addition and backward elimination, with the number k of hidden units varying between 0 and some small upper bound K. The forward procedure chooses sequentially, for each k, the variables to be included in NN_k. It enters one variable at a time, choosing from the set of variables not already entered the one with the largest score statistic. Also, it starts with k = 0 (the linear regression case), then proceeds to k = 1, and so on. The forward selection terminates with a k and a set of variables associated with NN_k. Backward elimination then proceeds to delete variables sequentially from NN_k.

INDIVIDUALIZATION AND REGRESSION DIAGNOSTICS

Individualization

An important application of a nonlinear mixed effects model is the individualization problem of estimating a function h(θ) of a subject's unobservable parameter θ, given the subject's covariate x and some (or even no) measurements taken from the subject. To fix the ideas, assume that all the f_i in Eq. (1) are equal to f, and that the standard deviations ω_ij(θ_i) of ε_ij/σ are of the form ω(t_ij, θ_i). If β, σ and Σ are known, then a natural estimate of h(θ) is the posterior mean of h(θ) given the subject's data. Without assuming β, σ and Σ to be known, the empirical Bayes approach replaces them in the posterior mean by their estimates β̂, σ̂, Σ̂, so that h(θ) is estimated by

$$\hat h = E_{\hat\beta,\hat\sigma,\hat\Sigma}\{h(\theta) \mid \text{subject's data}\}. \tag{10}$$

The expectation in Eq. (10) can be evaluated by the hybrid method that we have used for likelihood calculation. First note that Laplace's approximation in Eq. (3) is based on the Taylor expansion

$$l_i(b) \doteq l_i(\hat b_i) + (b - \hat b_i)\, \ddot l_i(\hat b_i)\, (b - \hat b_i)^T / 2. \tag{11}$$

Since the density function of b_i given the subject's data is proportional to e^{l_i(b)}, it follows from Eq. (11) that the conditional distribution of b_i given the ith subject's data is approximately normal with mean b̂_i and covariance matrix V_i^{-1}, where V_i = −l̈_i(b̂_i).

Hence for a new subject with informative data (i.e., satisfying λ_min(V) ≥ c), for whom we drop the subscript i in l_i, V_i and b̂_i, we can use the above normal density to evaluate E_{β̂,σ̂,Σ̂}{h(θ) | subject's data} approximately via

$$h(\theta) = h(g(x, \hat\beta) + b) \doteq h\bigl(g(x, \hat\beta) + \hat b + V^{-1/2} Z\bigr), \tag{12}$$

where Z is standard normal. The expectation with respect to Z can be evaluated either by the Taylor expansion

$$h(\theta) \doteq h(g(x,\hat\beta) + \hat b) + \dot h(g(x,\hat\beta) + \hat b)\, V^{-1/2} Z + Z^T V^{-1/2}\, \ddot h(g(x,\hat\beta) + \hat b)\, V^{-1/2} Z / 2,$$

whose expectation yields

$$\hat h \doteq h(g(x,\hat\beta) + \hat b) + \tfrac{1}{2}\, \mathrm{tr}\bigl\{ V^{-1/2}\, \ddot h(g(x,\hat\beta) + \hat b)\, V^{-1/2} \bigr\}, \tag{13a}$$

or by Gaussian quadrature applied to h̃(z) = h(g(x, β̂) + b̂ + V^{-1/2} z), yielding

$$\hat h \doteq (2\pi)^{-r/2} \int \tilde h(z) \exp\{-z^T z/2\}\, dz \doteq \sum_{j_1=1}^{J} \cdots \sum_{j_r=1}^{J} \Bigl( \prod_{i=1}^{r} a_{j_i} \Bigr) \tilde h(\sqrt{2}\, s_j), \tag{13b}$$

with a small J, where, for the multi-index j = (j_1, ..., j_r), s_j = (s_{j_1}, ..., s_{j_r})^T, and {s_j}_{j=1}^J, {a_j}_{j=1}^J are predetermined sequences such that the approximation is exact when r = 1 and h̃(·) is a polynomial of degree less than 2J.

When the eigenvalue criterion in the hybrid method fails (i.e., when λ_min(V) < c), Eq. (10) can be evaluated by using importance sampling for the Monte Carlo approximation

$$\hat h \doteq \sum_{j=1}^{J} h\bigl(g(x, \hat\beta) + b^{(j)}\bigr) w_j \Big/ \sum_{j=1}^{J} w_j, \tag{14}$$

where {b^(1), ..., b^(J)} are independent samples from some density λ(·) and the importance weights are w_j = e^{l(b^(j))}/λ(b^(j)), noting that the posterior density function of the subject's random effect b given the subject's data is proportional to e^{l(b)}. The density λ is typically chosen so that it is easy to sample from, so that it has a simple formula for the weights w_j, and so that the coefficient of variation of the w_j is not too large. One good choice of λ is a mixture of the prior normal distribution N(0, Σ̂) and the posterior normal distribution N(b̂, (−l̈(b̂) + εI)^{-1}), where ε is a positive constant that ensures the covariance matrix is positive definite. We recommend choosing 0.2 ≤ p ≤ 0.5 in the mixing proportion p : (1 − p) for the prior N(0, Σ̂) versus the posterior normal distribution in the mixture.
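For the quadrature route in Eq. (13b), a minimal sketch using Gauss-Hermite nodes; after the change of variables z = √2 x, the rescaled weights play the role of the a_j, and the function and parameter names are our own.

```python
import numpy as np

def gh_mean(h_tilde, r, J=5):
    """Evaluate (2*pi)^{-r/2} * int h_tilde(z) exp(-z'z/2) dz over R^r by a
    tensor product of J-point Gauss-Hermite rules, as in Eq. (13b)."""
    s, a = np.polynomial.hermite.hermgauss(J)     # nodes/weights for e^{-x^2}
    a = a / np.sqrt(np.pi)                        # rescaled weights sum to 1
    grids = np.meshgrid(*([s] * r), indexing="ij")
    z = np.sqrt(2.0) * np.stack([g.ravel() for g in grids], axis=-1)
    w = np.ones(J ** r)
    for gw in np.meshgrid(*([a] * r), indexing="ij"):
        w *= gw.ravel()
    return float(np.sum(w * np.apply_along_axis(h_tilde, 1, z)))

# Sanity check: E[Z1^2] = 1 for Z ~ N(0, I_r).
print(gh_mean(lambda z: z[0] ** 2, r=2, J=5))     # ~1.0
```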

Regression Diagnostics

The empirical Bayes idea can also be used to provide diagnostics for the regression function g. If the individual parameters θ_i were observed, the residuals r_i = θ_i − g(x_i, β̂) would provide approximations for the i.i.d. random variables b_i that are not observable. Therefore, substantial deviation of these residuals from an i.i.d. pattern would suggest inadequacies and possible improvements of the assumed regression model. Since the θ_i are not observed, we propose to replace them by the empirical Bayes estimate E_{(β̂,σ̂²,Σ̂)}(θ_i | y_i, t_i, x_i), leading to the following generalized residuals in the sense of Cox and Snell (14):

$$\hat r_i = E_{(\hat\beta,\hat\sigma^2,\hat\Sigma)}(\theta_i \mid y_i, t_i, x_i) - g(x_i, \hat\beta) = E_{(\hat\beta,\hat\sigma^2,\hat\Sigma)}(b_i \mid y_i, t_i, x_i), \qquad i = 1, \ldots, I. \tag{15}$$

Noting that the first expectation in Eq. (15) is a special case of Eq. (10) with h(θ) = θ, the calculation of r̂_i can be carried out by the hybrid method. These Cox-Snell-type generalized residuals r̂_i provide better approximations to the unobservable residuals r_i = θ_i − g(x_i, β̂), particularly when the ith subject has sparse data, than the computationally more convenient b̂_i, which Maitre et al. (26) proposed to use as residuals.

SIMULATION STUDY

Consider a one-compartment open model with first-order absorption given by Eq. (1), in which θ_i = (log V_i, log k_ai, log k_ei), where V_i, k_ai and k_ei denote the ith subject's volume of distribution, absorption rate and elimination rate, and f_i = f with

$$f(t_{ij}, \theta_i) = \frac{500\, k_{ai}}{V_i (k_{ai} - k_{ei})}\left(e^{-k_{ei} t_{ij}} - e^{-k_{ai} t_{ij}}\right), \qquad \theta_i = g(x_i, \beta) + b_i = \bigl(g_1(x_i) + b_{i1},\ \beta_2 + b_{i2},\ \beta_3 + b_{i3}\bigr), \tag{16}$$

in which x_i is the subject's age. Assume that ε_ij is normal with standard deviation σ f(t_ij, θ_i), and that b_i1, b_i2 and b_i3 are independent normal with mean 0 and var(b_ik) = τ_k², k = 1, 2, 3. We take g_1(x_i) = 3 + exp{−(x_i − 20)(x_i − 25)(x_i − 40)}, β_2 = 1, β_3 = −1, τ_1 = 0.1, τ_2 = 0.5, τ_3 = 0.2, σ = 0.1. One hundred datasets are generated from this model, each consisting of I_1 = 30 subjects with 15 measurements each, taken at times t_ij = 0.17, 0.33, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 4, 5, 6, 8, 10, 12, and I_2 = 100 subjects with sparse (2 to 4) measurements taken at timepoints that are randomly selected from the above 15 timepoints. For the 30 subjects with 15 measurements each, their ages x_i are randomly sampled from the uniform distribution on {19, ..., 65}. The x_i for the 100 subjects with sparse data are randomly chosen from seven strata: 2-6, 7-12, 13-18, 19-24, 25-40, 41-65, 66-75, with stratum sizes proportional to 1:1:1:2:2:2:1.
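A sketch of the data-generating mechanism of Eq. (16). The printed form of g_1 appears to have lost a scaling factor in extraction (as printed it overflows numerically), so the code substitutes a clearly labeled hypothetical stand-in with a similar qualitative shape (a bump near x = 10, as in Figure 1); the sign of β_3 is likewise our reading of the garbled source, and the stratified ages of the sparse subjects are simplified to uniform sampling.

```python
import numpy as np

rng = np.random.default_rng(5)
times = np.array([0.17, 0.33, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 4, 5, 6, 8, 10, 12])
beta2, beta3 = 1.0, -1.0                # log k_a and log k_e (sign of beta3 assumed)
tau = np.array([0.1, 0.5, 0.2])         # sd of b_i1, b_i2, b_i3
sigma = 0.1

def g1(x):
    # Hypothetical stand-in for the paper's g1: a smooth bump near x = 10
    # on a baseline of 3 (the printed g1 lost its scaling in extraction).
    return 3.0 + 0.5 * np.exp(-((x - 10.0) / 5.0) ** 2)

def simulate_subject(age, t):
    b = tau * rng.standard_normal(3)
    V, ka, ke = np.exp([g1(age) + b[0], beta2 + b[1], beta3 + b[2]])
    f = 500 * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))
    return f * (1.0 + sigma * rng.standard_normal(t.size))  # sd proportional to mean

# 30 intensively sampled subjects with ages uniform on {19, ..., 65}
dense = [simulate_subject(rng.integers(19, 66), times) for _ in range(30)]
# 100 sparse subjects with 2-4 measurements (stratified ages simplified here)
sparse = [simulate_subject(age, rng.choice(times, size=rng.integers(2, 5), replace=False))
          for age in rng.integers(2, 76, size=100)]
```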

We applied nonparametric modeling of g_1 using regression splines or neural networks,

$$g_1(x_i) = \alpha_0 + \alpha_1 x_i + \cdots + \alpha_m x_i^m + \sum_{k=1}^{K} \gamma_k (x_i - \xi_k)_+^m \quad \text{or} \quad g_1(x_i) = \gamma_0 + \alpha_0 x_i + \sum_{k=1}^{K} \gamma_k \psi(a_k + \alpha_k x_i),$$

in conjunction with the hybrid method with c = 10 and B = 50, to estimate g_1, β_2, β_3, σ, τ_1, τ_2 and τ_3.

A Typical Dataset

In this simulated dataset, there are 31 subjects with 2 observations each, 32 subjects with 3 observations, and 37 subjects with 4 observations. The five-number summaries (minimum, 1st quartile, median, 3rd quartile, maximum) of the age variable x_i are (19, 22.25, 32.5, 43.5, 60) for the 30 subjects with 15 measurements each, and (2, 17, 24.5, 44.5, 74) for the 100 subjects who have sparse measurements. Figure 1 shows that the fitted ĝ_1 by neural networks with 1 hidden unit (NN1, dot-and-dash curve) and suitably chosen regression splines (long dashed curve) approximate the true regression function (solid curve) quite well, while the linear model (dotted line) does not catch the nonlinear pattern around x = 10. Table I gives the estimates of β_2, β_3, σ, τ_1, τ_2, and τ_3 together with the BIC values for different models. Here the selected neural network model has one hidden unit (NN1); the selected spline model has order 2. Table I shows that for this particular dataset the spline model has the smallest BIC. The estimates of β_3 and σ are close to the true values, but the estimates of β_2 and of the variances of the random effects are larger than the true values. Note that there are many fewer observations in the absorption phase than in the elimination phase, which may account for the less accurate estimate of the absorption rate.

To examine the sensitivity of the parameter estimates to the choice of c in the hybrid estimation method, we refitted the linear model using c = 5, 10, 15, 30, 60. Table II shows that the parameter estimates are quite similar for c ranging from 5 to 60.

Fig. 1. Fitted regression function ĝ_1 for a typical simulated PK study. The solid curve represents the true function. The dotted, dot-and-dash, and long dashed curves are, respectively, the linear, NN1, and splines models fitted by the hybrid method. The short dashed curve is the linear model fitted by nlme.

Table I. Estimates of Regression Parameters β_2, β_3, Measurement Error Standard Deviation σ, the Standard Deviations τ_1, τ_2, τ_3 of the Random Effects, and BIC for a Typical Dataset (columns: β_2, β_3, σ, τ_1, τ_2, τ_3, BIC; rows: True, Linear (nlme), Linear, NN, Spline).

Also given in Table II is the CPU time (in seconds) needed per iteration. For comparison, the CPU time required per iteration by nlme to fit the same linear model is shorter than that of the hybrid method, because nlme uses Laplace's approximation for all individuals and does not need to calculate the eigenvalues of V_i.

Table II. Estimates of Regression Parameters β_2, β_3, Measurement Error Standard Deviation σ, the Standard Deviations τ_1, τ_2, τ_3 of the Random Effects, and CPU Time (s) per Iteration for a Typical Dataset, under the Linear Model and a Range of Values for the Threshold c (columns: c, β_2, β_3, σ, τ_1, τ_2, τ_3, CPU time (s) per iteration).

As expected, the CPU time needed per iteration for the hybrid method increases with c in Table II, because for larger c more individuals are classified as inadequate for the application of Laplace's approximation in the likelihood evaluation. In particular, c = 0 corresponds to using Laplace's approximation for all individuals, and c = ∞ corresponds to using importance sampling, which requires longer computing time, for all individuals. Based on the tradeoff between the computational effort and the approximation error, we chose c = 10 for the numerical examples in this paper.

We applied the fitted model to estimate the concentration versus time curve of a new subject at age x = 30 with no measurements or with only a few concentration measurements. We used the Monte Carlo estimate given in Eq. (14) with J = 50 importance samples and p = 0.2, with a small ε > 0 to form the mixture importance density. The estimated concentration versus time curves are plotted in Figure 2, which shows that, by making use of the data from the 130 subjects, the fitted regression models provide good empirical Bayes estimates of the concentration versus time curve when the new subject has only three observations.

Bias and Standard Error Based on 100 Simulated Datasets

Table III shows the mean and standard error for the estimates of β_2, β_3, σ, τ_1, τ_2, τ_3 under various models fitted by the hybrid method. Also given for comparison are the corresponding values for the linear model fitted by the S-function nlme. It shows that the estimates are generally close to the true values except for the upward bias in τ_1, which reflects the approximation error ĝ_1 − g_1 for volume of distribution. As expected, the bias in τ_1 for the splines or NN1 model is smaller than that of the linear model because the former approximates the nonlinear function g_1 better. In addition, the standard errors of all hybrid estimates are in general smaller than those of nlme, indicating that the hybrid method gives more precise estimates than nlme.

Fig. 2. Estimates of the concentration curve for a new subject with covariate x = 30, based on a fitted population PK model: (a) when no observation is available, (b) when two observations are available, (c) when three observations are available. The solid, dotted, dot-and-dash and dashed curves denote, respectively, the subject's true concentration curve and the Bayes estimates of f(t, θ) based on the fitted linear, NN1 and spline models.

Since the unknown regression function g_1(x) is nonlinear and the random effects b_i1, b_i2, b_i3 are unobservable, we evaluated the combined mean absolute error of (ĝ_1, β̂_2, β̂_3, τ̂_1, τ̂_2, τ̂_3) by the relative mean absolute error

$$\mathrm{rmae} = \sum_{i=1}^{I} \sum_{j=1}^{n_i} \bigl| \hat f_{ij} / \tilde f_{ij} - 1 \bigr| \Big/ \sum_{i=1}^{I} n_i \tag{17}$$

for each simulated dataset, where

$$\hat f_{ij} = M^{-1} \sum_{m=1}^{M} f\bigl(t_{ij};\ \hat g_1(x_i) + \hat\tau_1 z^{(m)}_{i1},\ \hat\beta_2 + \hat\tau_2 z^{(m)}_{i2},\ \hat\beta_3 + \hat\tau_3 z^{(m)}_{i3}\bigr),$$
$$\tilde f_{ij} = M^{-1} \sum_{m=1}^{M} f\bigl(t_{ij};\ g_1(x_i) + \tau_1 z^{(m)}_{i1},\ \beta_2 + \tau_2 z^{(m)}_{i2},\ \beta_3 + \tau_3 z^{(m)}_{i3}\bigr), \tag{18}$$

and {z^(1)_i1, z^(1)_i2, z^(1)_i3, ..., z^(M)_i1, z^(M)_i2, z^(M)_i3} is a set of independent standard normal random variables.
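A sketch of the rmae computation in Eqs. (17)-(18); f_hat and f_tilde are supplied per subject as arrays of the Monte Carlo averages over that subject's timepoints.

```python
import numpy as np

def rmae(f_hat, f_tilde):
    """Relative mean absolute error, Eq. (17): the average over all
    observations of |f_hat_ij / f_tilde_ij - 1|, where each list entry
    holds one subject's Monte Carlo averages from Eq. (18)."""
    num = sum(np.sum(np.abs(fh / ft - 1.0)) for fh, ft in zip(f_hat, f_tilde))
    n_total = sum(ft.size for ft in f_tilde)
    return num / n_total
```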

Table III. Mean and Standard Error (in parentheses) for Estimates of β_2, β_3, Measurement Error Standard Deviation σ, and the Standard Deviations τ_1, τ_2, τ_3 of the Random Effects for the Simulation Study, Based on 100 Simulated Datasets

Model | β_2 | β_3 | σ | τ_1 | τ_2 | τ_3
True | | | | | |
Linear (nlme) | (0.285) | (0.277) | (0.066) | (0.091) | (0.174) | (0.064)
Linear | (0.042) | (0.009) | (0.004) | (0.052) | (0.076) | (0.023)
NN1 | (0.042) | (0.009) | (0.004) | (0.051) | (0.056) | (0.023)
Spline | (0.102) | (0.022) | (0.004) | (0.066) | (0.058) | (0.027)

The average over the M random vectors (z^(m)_i1, z^(m)_i2, z^(m)_i3), 1 ≤ m ≤ M, in Eq. (18) is basically a Monte Carlo estimate of the expected concentration

$$\bar f_{ij} \doteq \int f\bigl(t_{ij};\ g_1(x_i) + b_{i1},\ \beta_2 + b_{i2},\ \beta_3 + b_{i3}\bigr)\, \varphi(b_i)\, db_i,$$

where φ denotes the N(0, diag(τ_1², τ_2², τ_3²)) density, with (g_1, β_2, β_3, τ_1, τ_2, τ_3) replaced by (ĝ_1, β̂_2, β̂_3, τ̂_1, τ̂_2, τ̂_3) for f̂_ij. Here we choose to work with the relative error in Eq. (17) because the standard deviation of ε_ij is proportional to the mean, so normalization by the mean is needed. Also, we choose absolute rather than squared errors because the former is more stable and robust. Figure 3 shows the boxplots of the 100 replicates of rmae, using M = 500 to compute f̂_ij and f̃_ij in Eq. (18). The NN1 and splines models perform much better than the linear models.

EXPERIMENTAL STUDY

An orally administered cancer drug, temozolomide, was given to 65 adult patients with advanced cancer in four Phase I trials sponsored by the Schering-Plough Research Institute (Jen et al. (27)); see also a smaller pilot study by Newlands et al. (28). A total of 756 concentration measurements were collected. Each subject had concentration measurements from 10 min to 16 h after a single dose. These concentrations are modeled by the one-compartment open model in Eq. (16) that is used in the above simulation study, and the objective is to identify the influence of patient characteristics on the pharmacokinetics. Patient covariates included in the analysis are body surface area, gender, age and creatinine clearance, forming the vector x_i in Eq. (16).

Test for Gender Difference

An important question in the study was whether there was a gender difference in volume of distribution, which we addressed by using the GLR test described in Eq. (8). The GLR test statistic computed by the hybrid method gives a p-value, using the χ² approximation, that is considerably smaller than that given by nlme. To check the validity of the χ² approximation, which shows the volume of distribution to be significantly different between males and females, Figure 4 plots the quantiles of the GLR test statistic, computed by the hybrid method (left panel) and by nlme (right panel), from 2000 bootstrap replicates against the quantiles of a χ² distribution with 1 degree of freedom.

Fig. 3. Goodness of fit (rmae) for NN1, splines and linear models fitted by the hybrid method and for the linear model fitted by nlme.

Fig. 4. QQ (quantile versus quantile) plots of the generalized likelihood ratio statistic for gender difference versus the χ² distribution with 1 degree of freedom, based on the hybrid method (left panel) or nlme (right panel).

These QQ plots show the χ² approximation to be adequate for the present study.

Model Selection and Regression Diagnostics

We considered four patient covariates for the population model g_1: body surface area, gender, age, and creatinine clearance. We first examined the generalized residuals of the null model, which assumes no covariates in g_1, against each of the four covariates, and found that the residuals for volume of distribution tend to increase with body surface area. We applied the automatic model selection procedure with respect to these four covariates using three models: linear, neural networks, and splines. All three models selected body surface area for modeling volume of distribution and no covariates for the absorption or elimination rates. Table IV presents the fitted parameters of the linear model, the neural network with 1 hidden unit, and the spline model with the 3 quartiles as the knots for a continuous covariate, treating gender as a dichotomous covariate and using the same fitting procedure as that in the simulation study. The model with the smallest BIC value in Table IV is a spline model that has no knots: a quadratic function of body surface area x without an intercept term, with linear coefficient 2.917. The linear model in Table IV is linear in body surface area x, and its BIC value is only slightly larger than that of the quadratic (spline) model. Because of ease of interpretation, we chose the linear model instead of the quadratic model without an intercept term.

To check the goodness of fit, the generalized residual plots for the linear model are given in Figure 5. No specific trends are observed with respect to any covariate, indicating that after adjusting for body surface area, volume of distribution is no longer significantly different between males and females.

Table IV. Estimates of Regression Coefficients β's, Measurement Error Standard Deviation σ, and the Standard Deviations τ_1, τ_2, τ_3 of the Random Effects for the Experimental Study (columns: β_0, β_1, β_2, β_3, σ, τ_1, τ_2, τ_3, BIC; rows: Linear (nlme), Linear, NN, Spline).

Fig. 5. Generalized residuals from the linear model fitted by the hybrid method: residuals of log(V), log(k_a) and log(k_e) plotted against body surface area, gender, age and creatinine clearance. The residuals are marked by points, except in the panels for gender where box plots are used.

Analogous to the rmae introduced in Eq. (17), we measure the goodness of fit by

$$\mathrm{rmae} = \sum_{i=1}^{I} \sum_{j=1}^{n_i} \bigl| \hat y_{ij} / y_{ij} - 1 \bigr| \Big/ \sum_{i=1}^{I} n_i, \tag{19}$$

where ŷ_ij is the model-based estimate ∫ f_i(t_ij, g(x_i, β̂) + b) φ_Σ̂(b) db of y_ij. For the linear model fitted by the hybrid method, rmae = 0.306, which is comparable to σ̂. Since y_ij/f(t_ij, θ_i) has mean 1 and standard deviation σ, this suggests that the fitted linear model estimates f(t_ij, θ_i) reasonably well. The rmae values of the NN1 and splines models are comparable to that of the linear model, and all of these are better than the rmae of nlme's fitted linear model.

Empirical Bayes Estimates for an Individual's Concentrations

To illustrate the usefulness of the fitted linear model for estimating the concentration versus time curve of an adult patient after a single


Accounting for Complex Sample Designs via Mixture Models Accounting for Complex Sample Designs via Finite Normal Mixture Models 1 1 University of Michigan School of Public Health August 2009 Talk Outline 1 2 Accommodating Sampling Weights in Mixture Models 3

More information

ON THE CONSEQUENCES OF MISSPECIFING ASSUMPTIONS CONCERNING RESIDUALS DISTRIBUTION IN A REPEATED MEASURES AND NONLINEAR MIXED MODELLING CONTEXT

ON THE CONSEQUENCES OF MISSPECIFING ASSUMPTIONS CONCERNING RESIDUALS DISTRIBUTION IN A REPEATED MEASURES AND NONLINEAR MIXED MODELLING CONTEXT ON THE CONSEQUENCES OF MISSPECIFING ASSUMPTIONS CONCERNING RESIDUALS DISTRIBUTION IN A REPEATED MEASURES AND NONLINEAR MIXED MODELLING CONTEXT Rachid el Halimi and Jordi Ocaña Departament d Estadística

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I. Vector Autoregressive Model Vector Autoregressions II Empirical Macroeconomics - Lect 2 Dr. Ana Beatriz Galvao Queen Mary University of London January 2012 A VAR(p) model of the m 1 vector of time series

More information

Threshold Autoregressions and NonLinear Autoregressions

Threshold Autoregressions and NonLinear Autoregressions Threshold Autoregressions and NonLinear Autoregressions Original Presentation: Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Threshold Regression 1 / 47 Threshold Models

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)

More information

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 UCLA 2 Abstract Multilevel analysis often leads to modeling

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Accurate Maximum Likelihood Estimation for Parametric Population Analysis. Bob Leary UCSD/SDSC and LAPK, USC School of Medicine

Accurate Maximum Likelihood Estimation for Parametric Population Analysis. Bob Leary UCSD/SDSC and LAPK, USC School of Medicine Accurate Maximum Likelihood Estimation for Parametric Population Analysis Bob Leary UCSD/SDSC and LAPK, USC School of Medicine Why use parametric maximum likelihood estimators? Consistency: θˆ θ as N ML

More information

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17 Model selection I February 17 Remedial measures Suppose one of your diagnostic plots indicates a problem with the model s fit or assumptions; what options are available to you? Generally speaking, you

More information

Bayesian Estimation of Prediction Error and Variable Selection in Linear Regression

Bayesian Estimation of Prediction Error and Variable Selection in Linear Regression Bayesian Estimation of Prediction Error and Variable Selection in Linear Regression Andrew A. Neath Department of Mathematics and Statistics; Southern Illinois University Edwardsville; Edwardsville, IL,

More information

A brief introduction to mixed models

A brief introduction to mixed models A brief introduction to mixed models University of Gothenburg Gothenburg April 6, 2017 Outline An introduction to mixed models based on a few examples: Definition of standard mixed models. Parameter estimation.

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

Bayesian spatial quantile regression

Bayesian spatial quantile regression Brian J. Reich and Montserrat Fuentes North Carolina State University and David B. Dunson Duke University E-mail:reich@stat.ncsu.edu Tropospheric ozone Tropospheric ozone has been linked with several adverse

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

Statistical Practice

Statistical Practice Statistical Practice A Note on Bayesian Inference After Multiple Imputation Xiang ZHOU and Jerome P. REITER This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed

More information

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Data Uncertainty, MCML and Sampling Density

Data Uncertainty, MCML and Sampling Density Data Uncertainty, MCML and Sampling Density Graham Byrnes International Agency for Research on Cancer 27 October 2015 Outline... Correlated Measurement Error Maximal Marginal Likelihood Monte Carlo Maximum

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

Open Problems in Mixed Models

Open Problems in Mixed Models xxiii Determining how to deal with a not positive definite covariance matrix of random effects, D during maximum likelihood estimation algorithms. Several strategies are discussed in Section 2.15. For

More information

4. Nonlinear regression functions

4. Nonlinear regression functions 4. Nonlinear regression functions Up to now: Population regression function was assumed to be linear The slope(s) of the population regression function is (are) constant The effect on Y of a unit-change

More information

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines)

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines) Dr. Maddah ENMG 617 EM Statistics 11/28/12 Multiple Regression (3) (Chapter 15, Hines) Problems in multiple regression: Multicollinearity This arises when the independent variables x 1, x 2,, x k, are

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Heteroskedasticity. Part VII. Heteroskedasticity

Heteroskedasticity. Part VII. Heteroskedasticity Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Implementation and Evaluation of Nonparametric Regression Procedures for Sensitivity Analysis of Computationally Demanding Models

Implementation and Evaluation of Nonparametric Regression Procedures for Sensitivity Analysis of Computationally Demanding Models Implementation and Evaluation of Nonparametric Regression Procedures for Sensitivity Analysis of Computationally Demanding Models Curtis B. Storlie a, Laura P. Swiler b, Jon C. Helton b and Cedric J. Sallaberry

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Impact of serial correlation structures on random effect misspecification with the linear mixed model.

Impact of serial correlation structures on random effect misspecification with the linear mixed model. Impact of serial correlation structures on random effect misspecification with the linear mixed model. Brandon LeBeau University of Iowa file:///c:/users/bleb/onedrive%20 %20University%20of%20Iowa%201/JournalArticlesInProgress/Diss/Study2/Pres/pres.html#(2)

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University JSM, 2015 E. Christou, M. G. Akritas (PSU) SIQR JSM, 2015

More information

Approaches to Modeling Menstrual Cycle Function

Approaches to Modeling Menstrual Cycle Function Approaches to Modeling Menstrual Cycle Function Paul S. Albert (albertp@mail.nih.gov) Biostatistics & Bioinformatics Branch Division of Epidemiology, Statistics, and Prevention Research NICHD SPER Student

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design 1 / 32 Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design Changbao Wu Department of Statistics and Actuarial Science University of Waterloo (Joint work with Min Chen and Mary

More information

Lecture 6: Discrete Choice: Qualitative Response

Lecture 6: Discrete Choice: Qualitative Response Lecture 6: Instructor: Department of Economics Stanford University 2011 Types of Discrete Choice Models Univariate Models Binary: Linear; Probit; Logit; Arctan, etc. Multinomial: Logit; Nested Logit; GEV;

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

A short introduction to INLA and R-INLA

A short introduction to INLA and R-INLA A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

BIOS 2083 Linear Models c Abdus S. Wahed

BIOS 2083 Linear Models c Abdus S. Wahed Chapter 5 206 Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter

More information

Akaike Information Criterion

Akaike Information Criterion Akaike Information Criterion Shuhua Hu Center for Research in Scientific Computation North Carolina State University Raleigh, NC February 7, 2012-1- background Background Model statistical model: Y j =

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Obnoxious lateness humor

Obnoxious lateness humor Obnoxious lateness humor 1 Using Bayesian Model Averaging For Addressing Model Uncertainty in Environmental Risk Assessment Louise Ryan and Melissa Whitney Department of Biostatistics Harvard School of

More information

Multivariate Linear Regression Models

Multivariate Linear Regression Models Multivariate Linear Regression Models Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

Reduced-rank hazard regression

Reduced-rank hazard regression Chapter 2 Reduced-rank hazard regression Abstract The Cox proportional hazards model is the most common method to analyze survival data. However, the proportional hazards assumption might not hold. The

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October

More information

Regression tree-based diagnostics for linear multilevel models

Regression tree-based diagnostics for linear multilevel models Regression tree-based diagnostics for linear multilevel models Jeffrey S. Simonoff New York University May 11, 2011 Longitudinal and clustered data Panel or longitudinal data, in which we observe many

More information