Dennis Cosmatos. Department of Biostatistics, University of North Carolina at Chapel Hill. September 1988


METHODS FOR MODELING DISEASE RISK USING PROBABILITY-OF-EXPOSURE MEASURES

by Dennis Cosmatos

Department of Biostatistics, University of North Carolina at Chapel Hill

Institute of Statistics Mimeo Series No. 1858T

September 1988

METHODS FOR MODELING DISEASE RISK USING PROBABILITY-OF-EXPOSURE MEASURES

by Dennis Cosmatos

A Dissertation submitted to the faculty of The University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Public Health in the Department of Biostatistics.

Chapel Hill 1988

Dennis Cosmatos. Methods for Modeling Disease Risk Using Probability-of-Exposure Measures. (Under the direction of Lawrence L. Kupper.)

ABSTRACT

A usual goal in health research is to examine the relationship between a specified disease outcome and exposure to a given agent or condition. In certain settings, however, the true exposure status for each individual may not be available. Instead, a Probability-Of-Exposure (POE) for each individual may be considered. This research introduces a modified logistic model (MLM) that can be used to estimate the exposure-disease (E-D) relationship when individual POE values are the only exposure data available. Various forms of the MLM are suggested, several of which allow for consideration of covariate information. Two alternative methods based on application of the usual logistic model to the data under certain exposure assumptions are also considered.

Application of the MLM to three small samples of exploratory data shows that the parameter estimates are influenced by sampling error in the POE measure. In order to further investigate this finding and observe how the MLM and the alternative models perform in various analysis settings, the MLM, the two alternative models, and the usual logistic model (using known exposure status and serving as a "gold standard" model) are applied to simulated data. Eight simulation studies are conducted, and the results of these analyses suggest the circumstances under which the MLM provides the most reliable and the least reliable estimates of the true underlying E-D relationship.

Practical application of the MLM is illustrated by conducting an analysis of a set of genetic data. These data are used to address a hypothesis that carriers of a gene for the autosomal recessive trait ataxia-telangiectasia are at increased risk of developing a malignant neoplasm, compared to noncarriers of the gene. The probability of being a carrier (i.e., the probability of being heterozygotic for the trait, a POE of sorts) is known without error by examining the relationship of each individual to a homozygous proband.

Some future research goals are outlined, including adaptation of the proposed methods to analyses of occupational health data by considering methods of estimating POE, and development of a modified survival model that can be used for analyses of clinical trial data.

ACKNOWLEDGEMENTS

I would like to express my unbounded appreciation to my advisor, Dr. Larry Kupper, for his skillful guidance and sustained enthusiasm throughout the course of this research. I would also like to express my thanks to the other members of my doctoral committee, Drs. Ed Davis, Cliff Patrick, Dana Quade, and Mike Symons, for their careful review of this material and their helpful comments.

For introducing me to the field of biostatistics (in the guise of biometrics), I would like to thank my undergraduate instructor in biometrics, Dr. Lesile Marcus. I still recall my first analysis of his marten femur-length data with tempered fondness.

This research would not have been possible had it not been conducted on the superb computational facilities at the Cornell National Supercomputer Facility. I would like to thank Mr. Rostyslaw Lewyckyj from Academic Computing Services here at UNC, and the User Support group at Cornell University, for their invaluable assistance with the many computational idiosyncrasies that emerged along the way.

I will always remember the fun and friendship I shared with Jerry Schindler, George Jerdack, Nancy Lucas, Susan Reade, Ali Barakat, and Cecilia Wada, the finest group of friends and fellow classmates one could ask for.

Finally, I would like to thank my best friend and wife, Irene. Only through her secret mixture of support, endurance, humor, tolerance, variety, conflict, and love, was it possible to not only complete this program, but have only fond memories of my time spent doing so, and of our time together in North Carolina.

TABLE OF CONTENTS

Chapter I  Introduction and Research Outline
    1.1 Introduction
    Model Notation and Formulation
    Development and Specification of the Modified Logistic Model
    Estimation of MLM Parameters
    Specification and Parameter Estimation for Alternative Model 1: The "Usual" Logistic Model Applied to (01+) Data
    Specification and Parameter Estimation for Alternative Model 2: The "Usual" Logistic Model Applied to (01) Data
    Specification and Parameter Estimation for the "Gold Standard" Logistic Model: The "Usual" Logistic Model Applied to Data with Known Exposure Status
    The MLM Under POE Recategorizations C and D
    Effects of POE Error on Parameter Estimates in the MLM
    Hypothesis Testing and Confidence Interval Estimation
    Research Outline

Chapter II  Simulation Studies
    2.1 Overview
    Specification of Simulation Studies
    Parameter Specifications for Simulation Set A
    Parameter Specifications for Simulation Set B
    Parameter Specifications for Simulation Set C
    Model Specification for Analysis of Simulation Set C Data
    Computer Programs Used in the Simulation Studies
    Simulation Program Details - Program Input and Data Generation
    Simulation Program Details - Computational Algorithms
    Simulation Program Details - Program Output

Chapter III  Simulation Study Results
    3.1 Introduction
    Simulation Study Results for Simulation Set A
    Simulation A(1) - N=500, R=1000, β=
    Simulation A(2) - N=1000, R=1000, β=
    Simulation A(3) - N=2000, R=1000, β=
    Simulation Set A Summary and Concluding Remarks
    Simulation Study Results for Simulation Set B
    Simulation B(1) - N=1000, R=1000, β=
    Simulation B(2) - N=2000, R=1000, β=
    Simulation B(3) - N=4000, R=1000, β=
    Simulation Set B Summary and Concluding Remarks
    Simulation Study Results for Simulation Set C
    Simulation C(1) - N=2000, R=1000, β= , γ=
    Simulation C(2) - N=2000, R=1000, β=0.00, γ=
    Simulation Set C Summary and Concluding Remarks
    3.5 Summary and Conclusions from Simulation Studies

Chapter IV  A Numerical Example - Analysis of the A-T Data
    4.1 Introduction
    General Overview and Specification of the Research Hypothesis
    Data Collection Methods and Study Design
    Specification of the POE Variable and Disease Outcome
    "Crude" Estimates of the E-D Relationship Using the Modified Logistic Model
    Analyses of the A-T Data Controlling for Extraneous Factors
    Analysis of the A-T Data Controlling for SEX: Application of MLM
    Specification of an AGE Covariate
    Analysis of the A-T Data Controlling for AGE and SEX: Application of MLM-2 with a Continuous AGE Variable
    Analysis of the A-T Data Controlling for AGE and SEX: Application of MLM-3 with a Categorical AGE Variable
    Analysis of the A-T Data - Modeling AGE as an Effect Modifier: Application of MLM-4 with a Categorical AGE Variable
    Summary and Conclusions from Analyses of the A-T Data

Chapter V  Future Research
    5.1 Introduction
    5.2 Small Sample Parameter Estimation in the Modified Logistic Model
    The MLM with Several Exposure (and associated POE) Variables
    Estimating Probability-of-Exposure
    Alternative POE-Based Models
    Closing Remarks

References

Chapter I

INTRODUCTION AND RESEARCH OUTLINE

1.1 Introduction

In most epidemiologic studies, the investigator is interested in examining the relationship between a disease or health condition and exposure to a set of potentially harmful environmental or physiological agents. It is hypothesized that such exposures affect the development or progression of the disease of interest. The investigator usually has some measure of the extent of exposure to these agents, either in the form of a dichotomous measurement (exposed ($E$) vs. unexposed ($\bar{E}$)), an ordinal measure of exposure divided into different levels (e.g. low, medium, or high), or a measurement on a continuous scale. One underlying assumption common to all these classes of exposure measurement is that they are known "without error". This research focuses on the development of models for analyzing data which are known to violate this assumption, but which include information on the level of exposure uncertainty in the form of probability-of-exposure (POE) measures. These POE values may be known with or without measurement error, or may have less error than that associated with an ordinary classification into exposure levels.

Within this framework, several characteristics of the POE measures can be considered. In the simplest case, they may be measurements reflecting the probability of being exposed to a specific agent. A probability of 1.0 would suggest certain exposure, one of 0.0 would suggest the individual is unexposed, and a value between 0.0 and 1.0 gives the corresponding probability of being exposed to that specific agent. We may also incorporate a "background" exposure probability into this setting by specifying a value slightly greater than 0.0 for individuals that are considered "unexposed". In a more complex setting, we may have a vector of POE measurements indicating an individual's exposure probabilities for several agents, or probabilities of being exposed to each of several levels of a single agent. The proposed models could be extended to this setting with some modification.

The development of the proposed models was motivated by the need to analyze a set of genetic data which addresses the hypothesis that individuals who are heterozygotic for the hereditary condition ataxia-telangiectasia (A-T) are at higher risk for developing malignant neoplasms than are noncarriers of the A-T gene (Swift et al., 1976; 1987). Such a hypothesis could be examined by conventional methods if we had a large enough sample of individuals who were carriers of the trait and of individuals who were definitely noncarriers. However, since A-T carriers are phenotypically normal and a cytogenetic method for identifying carriers has not yet been devised, a method was needed to incorporate each individual's probability of being heterozygotic for A-T (a POE in some sense) into some type of risk model. Such probabilities are determined by basic genetic properties of trait inheritance.

To date, there has been very little research devoted to the development of such models. The few earlier studies that have addressed this problem (Swift et al., 1974; 1976; Chase et al., 1977) have suggested models that do not allow incorporation of continuous covariates, and may have some undesirable statistical properties. The goal of this research is to develop models that are not subject to these drawbacks and can be used in a variety of research settings.

1.2 Model Notation and Formulation

In a typical analysis examining relationships between exposure ($E$ or $\bar{E}$) to a substance or condition of interest and a given disease outcome ($D$ or $\bar{D}$), we typically profile the data in the form of a 2x2 table (when no covariates are considered), as depicted by Table 1.1 below.

Table 1.1: Data Layout for a Study of E-D Relationships Assuming Known Exposure Status for Each Subject

$$\begin{array}{c|cc|c} & E & \bar{E} & \\ \hline D & a & b & m_1 \\ \bar{D} & c & d & m_0 \\ \hline & n_1 & n_0 & N \end{array}$$

Here $a$, $b$, $c$, and $d$ are the numbers of individuals with given exposure-disease characteristics. In this setting, we need to know (with certainty) the exposure and the disease status of each individual in order to accurately complete this table, and we then proceed to calculate measures of the E-D relationship.

The above table may be expanded in order to depict the data layout under a POE setting. In such a setting, we will not be able to specify exposure as $E$ or $\bar{E}$ for all individuals, but instead will know some measure of POE for certain subjects. We may display these types of data in the form of a 2x(I+1) table, where we define $I$ POE categories associated with probabilities greater than zero ($I \ge 2$) and one category for individuals classified as "unexposed". The data from studies that consider POE may be put into the tabular format of Table 1.2.

Table 1.2: Data Layout for Studies Considering POE

$$\begin{array}{c|cccccc|c} & \multicolumn{6}{c|}{\text{Probability of Exposure Category}} & \\ & \pi_1 & \pi_0 & \pi_2 & \pi_3 & \cdots & \pi_I & \\ \hline D & x_1 & x_0 & x_2 & x_3 & \cdots & x_I & m_1 \\ \bar{D} & y_1 & y_0 & y_2 & y_3 & \cdots & y_I & m_0 \\ \hline & n_1 & n_0 & n_2 & n_3 & \cdots & n_I & N \end{array}$$

In Table 1.2, the POE categories are indexed by the probabilities $\pi_0, \ldots, \pi_I$. We will generally assume that $\pi_0$ has the value 0.0, $\pi_1$ has the value 1.0, and $\pi_1 = 1.0 > \pi_2 > \pi_3 > \cdots > \pi_I > \pi_0 = 0.0$. Under this specification, we note that the first two rows and columns of Table 1.2 comprise a 2x2 subtable that contains counts of individuals for which we "know" the exposure status (i.e. if an individual is in $\pi_1 = 1.0$ then he/she is known to be exposed; if in $\pi_0$, he/she is known to be not exposed). This situation is similar to the known exposure situation (Table 1.1), and usual logistic methods may be applied. In fact, if the cell counts in this subtable are large, we may consider using standard methods to quantify E-D relationships without considering the data in the remaining cells. We should note, however, that disregarding what may be the majority of the data is likely to cause instability and unreliability in the statistical analyses. In the models that are introduced in this research, the parameters are estimated using all the data collected in the study.

Upon initial examination of the above data setting, an investigator may be tempted to consider applying the usual logistic model to the data, treating POE as a covariate in the model. We should stress that such an approach is inappropriate, for several reasons. First, the model that treats $P_j$ as a covariate implies that $P_j$ can be treated as a "level" of exposure. If, in fact, we believe that there are only two true exposure levels, $E$ and $\bar{E}$, a "dose" assumption for the exposure variable would not be justified. Also, it would be difficult to argue that the POE is related to disease outcome via some "logistic-linear" relationship. On the other hand, justifiable probability relationships (to be shown in equation [1]) support the rationale for the MLM. Given these considerations, it was felt that the usual logistic model with $P_j$ as a covariate would not be appropriate for these types of data, and it is not considered in this research.

In the sections to follow, we will introduce methods designed to perform analyses on data that are in the form depicted by Table 1.2. We will examine how these methods compare to alternative approaches that collapse the 2x(I+1) table into a 2x2 table so that standard analysis procedures can be used. In addition, we will consider the analysis of data that involve a dichotomous confounding factor. We will examine the performance of the proposed models as compared to standard procedures under several different methods of treatment of the exposure variable. For the purpose of simplifying the presentation of these methods, most of the development will focus on the analyses of data with no covariates. Once the methods for this setting are fully developed, the covariate setting will be presented as an extension. In the sections that follow, we review the notation and formulation of each model that is considered in this research.

Development and Specification of the Modified Logistic Model

We begin the development of appropriate models for analyzing data containing probability-of-exposure measures by considering some basic statistical properties. In most cases, we are interested in modeling the probability of disease development [$\Pr(D)$] in a population. Some individuals in this population may be exposed ($E$) to some specified agent or condition, and some are not exposed ($\bar{E}$). In what follows, we will make use of the simple probability law:

$$\Pr(D) = \Pr(D \cap E) + \Pr(D \cap \bar{E}) = \Pr(E)\Pr(D \mid E) + \Pr(\bar{E})\Pr(D \mid \bar{E}). \qquad [1]$$
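As a brief numerical illustration of [1], consider an individual for whom the probability of exposure is 0.25, and suppose the disease risks are 0.20 among the exposed and 0.10 among the unexposed. These three values are hypothetical and chosen only for arithmetic convenience; they are not taken from the data analyzed in this research.

$$\Pr(D) = (0.25)(0.20) + (0.75)(0.10) = 0.050 + 0.075 = 0.125.$$

The resulting disease probability is a weighted average of the two conditional risks, with the POE acting as the weight rather than as a "dose".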

The logistic model is often used for modeling the probability of developing disease as a function of various independent variables known or hypothesized to be related to disease development. A simple form of the logistic model, one that does not contain any covariate terms other than an indication of being exposed or not exposed, specifies the following relationships:

$$\Pr(D \mid E) = \frac{e^{\alpha+\beta}}{1+e^{\alpha+\beta}}, \qquad \Pr(D \mid \bar{E}) = \frac{e^{\alpha}}{1+e^{\alpha}},$$

$$e^{\alpha} = \frac{\Pr(D \mid \bar{E})}{\Pr(\bar{D} \mid \bar{E})}, \qquad \text{and} \qquad e^{\beta} = \frac{\Pr(D \mid E)/\Pr(\bar{D} \mid E)}{\Pr(D \mid \bar{E})/\Pr(\bar{D} \mid \bar{E})} = OR,$$

where $OR$ is the ratio of the odds of developing the disease in the exposed group to the odds in the unexposed group, and $e^{\alpha}$ is the odds of disease in the unexposed group.

In the usual specification of the logistic model, $\Pr(E)$ (henceforth referred to as POE) is 0 or 1 for every subject in the study. The proposed modification to this model allows for the inclusion of POE values $P_j$, where $0 \le P_j \le 1$ for individual $j = 1, 2, \ldots, N$. We utilize this POE information by specifying the following modified logistic model (MLM) based on equation [1]:

(MLM-1)
$$W_j = P_j\,\frac{e^{\alpha+\beta}}{1+e^{\alpha+\beta}} + (1-P_j)\,\frac{e^{\alpha}}{1+e^{\alpha}}, \qquad [2]$$

where $j = 1, 2, \ldots, N$ indexes each subject in the study, $W_j$ indicates the probability of disease for subject $j$, and $P_j$ specifies that individual's POE.

Just as we can extend the logistic model to include covariates, we can specify the MLM with covariates as:

(MLM-2)
$$W_j = P_j\,\frac{e^{\alpha+\beta+\sum_{r=1}^{R}\gamma_r X_{rj}}}{1+e^{\alpha+\beta+\sum_{r=1}^{R}\gamma_r X_{rj}}} + (1-P_j)\,\frac{e^{\alpha+\sum_{r=1}^{R}\gamma_r X_{rj}}}{1+e^{\alpha+\sum_{r=1}^{R}\gamma_r X_{rj}}}, \qquad [3]$$

where $j = 1, 2, \ldots, N$ indexes each subject, there are $r = 1, 2, \ldots, R$ continuous covariates, and $X_{rj}$ specifies the value of the $r$th covariate for the $j$th individual. Similarly, for a single nominal covariate with $M$ categories, we specify:

(MLM-3)
$$W_j = P_j\,\frac{e^{\alpha+\beta+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}{1+e^{\alpha+\beta+\sum_{m=1}^{M-1}\gamma_m V_{mj}}} + (1-P_j)\,\frac{e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}{1+e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}, \qquad [4]$$

where $j = 1, 2, \ldots, N$, and where $V_{mj} = 1$ if the $j$th subject is in category $m$ and equals zero otherwise, $m = 1, 2, \ldots, M-1$. Note that $e^{\alpha}$ denotes the odds of disease for unexposed individuals having the "baseline" or "reference" level $m = M$ of the covariate (i.e. $V_{mj} = 0$ for $m = 1, \ldots, M-1$ if individual $j$ has the baseline value $m = M$ of the covariate), and $OR_m = e^{\beta}$ specifies an odds ratio that is "common" across all levels of the covariate ($m = 1, 2, \ldots, M$).

Finally, an extension of MLM-3 which allows for effect modification (i.e. nonuniformity of the odds ratio across strata) may be specified as:

(MLM-4)
$$W_j = P_j\,\frac{e^{\alpha+\beta+\sum_{m=1}^{M-1}(\gamma_m+\delta_m) V_{mj}}}{1+e^{\alpha+\beta+\sum_{m=1}^{M-1}(\gamma_m+\delta_m) V_{mj}}} + (1-P_j)\,\frac{e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}{1+e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}. \qquad [5]$$

Here, $OR_m = e^{\beta+\delta_m}$, $m = 1, 2, \ldots, M-1$, and $OR_M = e^{\beta}$.

As with the standard logistic model, both categorical and continuous confounders and effect modifiers can be accommodated by the proposed MLM. This research concentrates on the examination of the statistical properties of MLM-1 by examining the performance of this model on simulated data. As an extension, MLM-3 with one dichotomous covariate will also be studied by simulation.
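The disease probabilities defined by MLM-1 and MLM-3 can be sketched in a few lines of Python. This is an illustration only, not the software used in this research: the function names (`logistic`, `mlm1_prob`, `mlm3_prob`) and the parameter values in the usage lines are hypothetical.

```python
import math


def logistic(eta: float) -> float:
    """Return e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + math.exp(-eta))


def mlm1_prob(p: float, alpha: float, beta: float) -> float:
    """Disease probability W_j under MLM-1 for a subject with POE p.

    The exposed and unexposed logistic risks are mixed with weights
    p and (1 - p), following the probability law in equation [1].
    """
    return p * logistic(alpha + beta) + (1.0 - p) * logistic(alpha)


def mlm3_prob(p, alpha, beta, gammas, v):
    """Disease probability under MLM-3 (one nominal covariate).

    gammas and v are lists of length M-1: the covariate coefficients and
    the 0/1 category indicators V_mj (all zero for the baseline level).
    """
    shift = sum(g * ind for g, ind in zip(gammas, v))
    return (p * logistic(alpha + beta + shift)
            + (1.0 - p) * logistic(alpha + shift))


# Hypothetical parameter values, for illustration only.
alpha, beta = -2.0, 1.0
for poe in (0.0, 0.25, 0.5, 1.0):
    print(poe, round(mlm1_prob(poe, alpha, beta), 4))
```

At a POE of 0 or 1 the function reduces to the usual logistic risk for an unexposed or exposed subject, which is the sense in which the MLM generalizes the usual model.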

Estimation of MLM Parameters

Considering the above specifications of the MLM, we are interested in devising a method to calculate unconditional maximum-likelihood estimates of the parameters. We begin by considering the application of MLM-1 to a hypothetical set of data. For a given set of $N$ observations, let us assume we have $n_i$ individuals all having the same POE value $\pi_i$, where $i = 0, 1, \ldots, I$ and $\sum_{i=0}^{I} n_i = N$. Within each of these $(I+1)$ POE groups we observe $x_i$ individuals with the disease of interest and $y_i$ individuals without that disease (as depicted in Table 1.2). For this situation, we cannot obtain model-specific estimates of $\alpha$ and $\beta$ in closed form. We can, however, use numerical methods to obtain estimates for these parameters.

Given one of the above specifications of the MLM, we can estimate the appropriate parameters by using unconditional maximum likelihood procedures. The unconditional likelihood for these data is given as

$$L(\theta; y) = \prod_{j=1}^{N}\left[W_j^{Z_j}(1-W_j)^{1-Z_j}\right], \quad \text{where } Z_j = \begin{cases} 1 & \text{if individual } j \text{ is diseased} \\ 0 & \text{otherwise} \end{cases} \qquad [6]$$

or

$$\ln L(\theta; y) = \sum_{j=1}^{N}\left[Z_j \ln W_j + (1-Z_j)\ln(1-W_j)\right]. \qquad [7]$$

The specification of $W_j$ in the above likelihood is determined by the model being considered. For individual $j$, the data vector $y_j$ contains the POE value $P_j$, a value for the disease indicator variable $Z_j$, and the values of the continuous and/or categorical covariates $X_{rj}$ and/or $V_{mj}$. The parameter vector $\theta$ specifies the parameters included in the particular model being considered. The elements of these vectors for models MLM-1, MLM-2, MLM-3, and MLM-4 are shown in Table 1.3.

Table 1.3: Specification of Data and Parameter Vectors for the MLM

$$\begin{array}{lll} \text{Model} & \text{Data Vector} & \text{Parameter Vector} \\ \hline \text{MLM-1} & y_j = (P_j, Z_j) & \theta' = (\alpha, \beta) \\ \text{MLM-2} & y_j = (P_j, Z_j, X_{1j}, \ldots, X_{Rj}) & \theta' = (\alpha, \beta, \gamma_1, \gamma_2, \ldots, \gamma_R) \\ \text{MLM-3} & y_j = (P_j, Z_j, V_{1j}, \ldots, V_{(M-1)j}) & \theta' = (\alpha, \beta, \gamma_1, \gamma_2, \ldots, \gamma_{(M-1)}) \\ \text{MLM-4} & y_j = (P_j, Z_j, V_{1j}, \ldots, V_{(M-1)j}) & \theta' = (\alpha, \beta, \gamma_1, \ldots, \gamma_{(M-1)}, \delta_1, \ldots, \delta_{(M-1)}) \end{array}$$

One point to note in the above specifications of the data vectors is that the parameters are estimated using information from each individual in the study. Although the data layout in Table 1.2 and the general approach we present when discussing the MLM consider counts of individuals in various POE categories (for simplicity), we see that the models being proposed apply directly to situations where the POE is a continuous variable.

We use Newton-Raphson and "direct search" iterative estimation algorithms to calculate maximum-likelihood estimates for the appropriate parameters. We also calculate the associated variance estimates for those parameters, and values of the log likelihood. These algorithms involve using the computer program MAXLIK. Details of this program are given in Chapter II.

Specification and Parameter Estimation for Alternative Model 1: The "Usual" Logistic Model Applied to (01+) Data

As an alternative to the proposed MLM, we consider the application of the "usual" logistic model to the data under a recategorization of the POE variable. We create two exposure groups and classify the individuals in the sample into one of these two groups based on their values of $P_j$ as follows. If $P_j > 0.0$, then we classify individual $j$ as $E$ (i.e. $P_j^{+} = 1$). If $P_j = 0.0$, then we classify the individual as $\bar{E}$ (i.e. $P_j^{+} = 0$). We can then display the data in the form of Table 1.4.

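Table 1.4: Data Layout for Alternative Model 1

$$\begin{array}{c|cc|c} & E & \bar{E} & \\ \hline D & x_{+} & x_0 & m_1 \\ \bar{D} & y_{+} & y_0 & m_0 \\ \hline & n_{+} & n_0 & N \end{array}$$

where $x_{+} = \sum_{i=1}^{I} x_i$ and $y_{+} = \sum_{i=1}^{I} y_i$ (using the notation specified in Table 1.2).

We see that application of the MLM under this setting leads to the expression

$$W_j^{+} = P_j^{+}\,\frac{e^{\alpha^{+}+\beta^{+}}}{1+e^{\alpha^{+}+\beta^{+}}} + (1-P_j^{+})\,\frac{e^{\alpha^{+}}}{1+e^{\alpha^{+}}} \qquad [8]$$

and, since $P_j^{+}$ is 0 or 1, we are in the "usual" logistic setting. By specifying the log likelihood equation as in [7] for the data in Table 1.4, and by then solving the equations

$$\frac{\partial \ln L(\alpha^{+}, \beta^{+}; P^{+}, z)}{\partial \alpha^{+}} = 0 \qquad \text{and} \qquad \frac{\partial \ln L(\alpha^{+}, \beta^{+}; P^{+}, z)}{\partial \beta^{+}} = 0,$$

we can obtain explicit expressions for the MLE's of $\alpha^{+}$ and $\beta^{+}$, for the value of $\ln L(\hat{\alpha}^{+}, \hat{\beta}^{+}; P^{+}, z)$, and for $\widehat{\mathrm{Var}}(\hat{\beta}^{+})$. These expressions are as follows:

$$\hat{\alpha}^{+} = \ln\left[\frac{x_0}{y_0}\right], \qquad \hat{\beta}^{+} = \ln\left[\frac{x_{+}\,y_0}{x_0\,y_{+}}\right], \qquad [9, 10]$$

$$\ln L(\hat{\alpha}^{+}, \hat{\beta}^{+}; P^{+}, z) = x_{+}\ln\left[\frac{x_{+}}{n_{+}}\right] + x_0\ln\left[\frac{x_0}{n_0}\right] + y_{+}\ln\left[\frac{y_{+}}{n_{+}}\right] + y_0\ln\left[\frac{y_0}{n_0}\right], \qquad [11]$$

and

$$\widehat{\mathrm{Var}}(\hat{\beta}^{+}) = \frac{1}{x_{+}} + \frac{1}{x_0} + \frac{1}{y_{+}} + \frac{1}{y_0}. \qquad [12]$$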
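The collapse of the POE table into the 2x2 layout of Table 1.4, followed by the closed-form fit of the usual logistic model, might be sketched as below. The formulas are the standard maximum-likelihood results for a 2x2 table (log odds, log odds ratio, and the sum of reciprocal cell counts for the variance). The function names and the cell counts are hypothetical, and the lists are indexed by category number, so position 0 holds the POE = 0.0 group.

```python
import math


def collapse_01plus(x, y):
    """Collapse POE categories into the (01+) layout of Table 1.4.

    x[i], y[i] are the diseased / non-diseased counts in POE category i,
    with index 0 the POE = 0.0 ("unexposed") category.
    Returns (x_plus, y_plus, x0, y0).
    """
    return sum(x[1:]), sum(y[1:]), x[0], y[0]


def usual_logistic_2x2(x_e, y_e, x_u, y_u):
    """Closed-form ML results for the usual logistic model on a 2x2 table.

    x_e, y_e: diseased / non-diseased counts among those classified exposed.
    x_u, y_u: the same counts among those classified unexposed.
    All four counts must be positive.
    """
    alpha = math.log(x_u / y_u)                 # log odds, unexposed
    beta = math.log((x_e * y_u) / (x_u * y_e))  # log odds ratio
    n_e, n_u = x_e + y_e, x_u + y_u
    loglik = (x_e * math.log(x_e / n_e) + x_u * math.log(x_u / n_u)
              + y_e * math.log(y_e / n_e) + y_u * math.log(y_u / n_u))
    var_beta = 1 / x_e + 1 / x_u + 1 / y_e + 1 / y_u
    return alpha, beta, loglik, var_beta


# Hypothetical counts for POE categories 0, 1, 2, 3 (not real data).
x = [10, 8, 6, 4]
y = [90, 12, 24, 36]
alpha, beta, loglik, var_beta = usual_logistic_2x2(*collapse_01plus(x, y))
print(math.exp(beta), var_beta)
```

The same `usual_logistic_2x2` function applies unchanged to any other 2x2 layout, since only the four cell counts enter the formulas.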

This alternative model is often used "implicitly" when individuals are designated $E$ because they have an attribute which serves as a surrogate for the true exposure of interest. If the surrogate is imperfect, we will have an "exposed" group that contains several individuals that are not truly exposed to the agent of interest. Using the A-T data as an example, we may classify individuals as $E$ if they are related to the A-T proband, and as $\bar{E}$ if they are the spouse controls. The "exposed" group would then contain some individuals with the A-T gene and many without the A-T gene. Since "relationship-to-proband" is not a perfect predictor of the "presence of the A-T gene", there will be some misclassification error.

In the example given with the A-T data, we know that using relationship-to-proband as a risk factor would not be a reasonable approach considering the underlying genetic mechanisms that control inheritance of traits and transfer of genetic information. However, in other research settings where such mechanisms are not so well defined, we make similar types of assumptions so that we can employ the (usual) logistic model. In these situations, we are probably operating quite often under the setting of this alternative model.

Specification and Parameter Estimation for Alternative Model 2: The "Usual" Logistic Model Applied to (01) Data

Another alternative to implementation of the proposed MLM is to consider only a subset of the data in deriving parameter estimates and making statistical inferences about E-D relationships. Specifically, we may consider ignoring the data where exposure is uncertain and using only data for individuals with $P_j = 0$ or $P_j = 1$ (i.e. data only for individuals who are known to be truly exposed and for those known to be truly unexposed). The data layout for such a model is depicted in Table 1.5. Note that this table is merely the 2x2 subtable formed using the first two rows and columns of Table 1.2.

Table 1.5: Data Layout for Alternative Model 2

$$\begin{array}{c|cc|c} & E & \bar{E} & \\ \hline D & x_1 & x_0 & m_1^{01} \\ \bar{D} & y_1 & y_0 & m_0^{01} \\ \hline & n_1 & n_0 & N^{01} \end{array}$$

We would expect the parameter estimates obtained from this model to be statistically reliable only when the four cell counts are large. We would not generally expect this model to perform well, since we will typically be disregarding a considerable amount of the data. For the analysis setting in which we are particularly interested (analysis of the A-T data), we would actually be ignoring a majority of the data collected.

The application of the MLM to this setting leads to the formulation

$$W_j^{01} = P_j^{01}\,\frac{e^{\alpha^{01}+\beta^{01}}}{1+e^{\alpha^{01}+\beta^{01}}} + (1-P_j^{01})\,\frac{e^{\alpha^{01}}}{1+e^{\alpha^{01}}}. \qquad [13]$$

As with Alternative Model 1 [8], the above model [13] is equivalent to a specification of the "usual" logistic model, since $P_j^{01} = 0$ when individual $j$ has a value of $P_j = 0$ and $P_j^{01} = 1$ when $P_j = 1$ for the $j$th individual in Table 1.5. As before, we can give explicit expressions for the MLE's of $\alpha^{01}$ and $\beta^{01}$, and for $\ln L(\hat{\alpha}^{01}, \hat{\beta}^{01}; P^{01}, z)$ and $\widehat{\mathrm{Var}}(\hat{\beta}^{01})$. In particular, we have

$$\hat{\alpha}^{01} = \ln\left[\frac{x_0}{y_0}\right], \qquad \hat{\beta}^{01} = \ln\left[\frac{x_1\,y_0}{x_0\,y_1}\right], \qquad [14, 15]$$

$$\ln L(\hat{\alpha}^{01}, \hat{\beta}^{01}; P^{01}, z) = x_1\ln\left[\frac{x_1}{n_1}\right] + x_0\ln\left[\frac{x_0}{n_0}\right] + y_1\ln\left[\frac{y_1}{n_1}\right] + y_0\ln\left[\frac{y_0}{n_0}\right], \qquad [16]$$

and

$$\widehat{\mathrm{Var}}(\hat{\beta}^{01}) = \frac{1}{x_1} + \frac{1}{x_0} + \frac{1}{y_1} + \frac{1}{y_0}. \qquad [17]$$

Specification and Parameter Estimation for the "Gold Standard" Logistic Model: The "Usual" Logistic Model Applied to Data with Known Exposure Status

If we know the true exposure status of each individual, then application of the usual logistic model, specified as

$$W_j^{g} = \frac{e^{\alpha^{g}+\beta^{g}E_j}}{1+e^{\alpha^{g}+\beta^{g}E_j}}, \qquad E_j = \begin{cases} 1 & \text{if individual } j \text{ is exposed} \\ 0 & \text{otherwise} \end{cases} \qquad j = 1, \ldots, N, \qquad [18]$$

would yield reliable estimates of the E-D relationship (given relatively large $N$). We will consider this model when we examine the simulated data. Since we will know the true exposure status for all individuals in the simulated data, we can apply this "usual" logistic model [18]. The resulting parameter estimates can then be informatively contrasted to estimates obtained from the other models which utilize POE information. Since this model [18] involves no misclassification error for the exposure variable, we fully expect it to be more statistically reliable than any of the models that involve potential exposure misclassification.

It should be emphasized that the "usual" logistic model [18] can only be applied if we know each individual's true exposure status. For the data we are considering, this will not be the situation. Thus, the "usual" logistic model [18] cannot actually be used to analyze the data under consideration. It is included as a "gold standard" model to which we may compare the performance of the other proposed models.

The data layout for this model is the 2x2 table specified in Table 1.1. For this "gold standard" model, we specify the parameters as $\alpha^{g}$ and $\beta^{g}$. For the data in Table 1.1, expressions for the MLE's of $\alpha^{g}$ and $\beta^{g}$, and for $\ln L(\hat{\alpha}^{g}, \hat{\beta}^{g}; E, z)$ and $\widehat{\mathrm{Var}}(\hat{\beta}^{g})$, are given below.

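$$\hat{\alpha}^{g} = \ln\left[\frac{b}{d}\right], \qquad \hat{\beta}^{g} = \ln\left[\frac{a\,d}{b\,c}\right], \qquad [19, 20]$$

$$\ln L(\hat{\alpha}^{g}, \hat{\beta}^{g}; E, z) = a\ln\left[\frac{a}{a+c}\right] + b\ln\left[\frac{b}{b+d}\right] + c\ln\left[\frac{c}{a+c}\right] + d\ln\left[\frac{d}{b+d}\right], \qquad [21]$$

and

$$\widehat{\mathrm{Var}}(\hat{\beta}^{g}) = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}. \qquad [22]$$

The MLM Under POE Recategorizations C and D

For situations where we define POE categories from continuous or discrete measures of POE, selection of the category specifications may influence the estimation properties of the MLM. As will be discussed later (Section 1.4 and Chapter III), categories that contain too few individuals (especially those with disease) may make reliable parameter estimation more difficult. We therefore consider the application of the MLM to data which have alternative specifications of the POE categories. First, we consider a model which combines individuals from a POE category with few observations into the next higher POE category, and then we consider a model that moves these individuals into the next lower POE category.

Let us specify $P_j^{C}$ as a new POE measure for individuals $j = 1, 2, \ldots, N$. Let us also assume that there are very few individuals with $P_j = \pi_2$ and $P_j = \pi_4$. We may then assign values to $P_j^{C}$ as follows:

$$P_j^{C} = \begin{cases} P_j & \text{if individual } j \text{ has a value of } P_j \text{ equal to } \pi_0, \pi_1, \pi_3, \pi_5, \pi_6, \ldots, \pi_I; \\ \pi_1 & \text{if individual } j \text{ has a value of } P_j \text{ equal to } \pi_2; \text{ and} \\ \pi_3 & \text{if individual } j \text{ has a value of } P_j \text{ equal to } \pi_4. \end{cases}$$

Recall our earlier assumption that $\pi_1 > \pi_2 > \cdots > \pi_I > \pi_0$, with $\pi_1 = 1.0$ and $\pi_0 = 0.0$. Then, we see that the $P_j^{C}$ values are determined by recategorizing individuals with POE values of $\pi_2$ and $\pi_4$ into their next higher POE groups, namely, the groups with POE values of $\pi_1$ and $\pi_3$, respectively. If we fit MLM-1 to these data, we have the representation

$$W_j^{C} = P_j^{C}\,\frac{e^{\alpha^{C}+\beta^{C}}}{1+e^{\alpha^{C}+\beta^{C}}} + (1-P_j^{C})\,\frac{e^{\alpha^{C}}}{1+e^{\alpha^{C}}}. \qquad [23]$$

We compute the MLE's of $\alpha^{C}$ and $\beta^{C}$, and other associated statistics, by considering the following likelihood equations:

$$L(\alpha^{C}, \beta^{C}; P^{C}, z) = \prod_{j=1}^{N}\left[(W_j^{C})^{Z_j}(1-W_j^{C})^{1-Z_j}\right] \qquad [24]$$

or

$$\ln L(\alpha^{C}, \beta^{C}; P^{C}, z) = \sum_{j=1}^{N}\left[Z_j\ln W_j^{C} + (1-Z_j)\ln(1-W_j^{C})\right], \qquad [25]$$

where

$$Z_j = \begin{cases} 1 & \text{if individual } j \text{ is diseased} \\ 0 & \text{otherwise.} \end{cases}$$

The appropriate MLEs ($\hat{\alpha}^{C}$ and $\hat{\beta}^{C}$), variance and covariance estimates ($\widehat{\mathrm{Var}}(\hat{\alpha}^{C})$, $\widehat{\mathrm{Var}}(\hat{\beta}^{C})$, and $\widehat{\mathrm{Cov}}(\hat{\alpha}^{C}, \hat{\beta}^{C})$), and $\ln L(\hat{\alpha}^{C}, \hat{\beta}^{C}; P^{C}, z)$ are calculated using MAXLIK.

Similarly, we can consider recategorizing individuals in POE groups $\pi_2$ and $\pi_4$ into their next lower POE groups. We do this by specifying $P_j^{D}$ as follows:

$$P_j^{D} = \begin{cases} P_j & \text{if individual } j \text{ has a value of } P_j \text{ equal to } \pi_0, \pi_1, \pi_3, \pi_5, \pi_6, \ldots, \pi_I; \\ \pi_3 & \text{if individual } j \text{ has a value of } P_j \text{ equal to } \pi_2; \text{ and} \\ \pi_5 & \text{if individual } j \text{ has a value of } P_j \text{ equal to } \pi_4. \end{cases}$$

We then specify

$$W_j^{D} = P_j^{D}\,\frac{e^{\alpha^{D}+\beta^{D}}}{1+e^{\alpha^{D}+\beta^{D}}} + (1-P_j^{D})\,\frac{e^{\alpha^{D}}}{1+e^{\alpha^{D}}} \qquad [26]$$

and

$$L(\alpha^{D}, \beta^{D}; P^{D}, z) = \prod_{j=1}^{N}\left[(W_j^{D})^{Z_j}(1-W_j^{D})^{1-Z_j}\right]. \qquad [27]$$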
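Because the MLM likelihood has no closed-form maximum, the estimates must be found iteratively. The sketch below merges sparse POE categories and then fits MLM-1 to grouped counts. It is an illustration only and is not the MAXLIK program: it uses Fisher scoring with step-halving, a close relative of Newton-Raphson that is swapped in here because it needs only first derivatives. The function names and the counts are hypothetical, and at least two distinct POE values are assumed so that the information matrix is nonsingular.

```python
import math


def logistic(eta):
    """Return e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + math.exp(-eta))


def recategorize(pi, x, y, moves):
    """Merge sparse POE categories; moves maps source index -> target index.

    Counts from each source category are added to its target, and emptied
    categories are dropped. The target POE values are left unchanged.
    """
    x, y = list(x), list(y)
    for src, dst in moves.items():
        x[dst] += x[src]
        y[dst] += y[src]
        x[src] = y[src] = 0
    keep = [i for i in range(len(pi)) if x[i] + y[i] > 0]
    return [pi[i] for i in keep], [x[i] for i in keep], [y[i] for i in keep]


def loglik(alpha, beta, pi, x, y):
    """Grouped-data log likelihood for MLM-1."""
    a, b = logistic(alpha + beta), logistic(alpha)
    total = 0.0
    for p, xi, yi in zip(pi, x, y):
        w = p * a + (1.0 - p) * b
        total += xi * math.log(w) + yi * math.log(1.0 - w)
    return total


def fit_mlm1(pi, x, y, tol=1e-10, max_iter=200):
    """Maximize the MLM-1 log likelihood by Fisher scoring (a sketch)."""
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        a, b = logistic(alpha + beta), logistic(alpha)
        ga = gb = iaa = iab = ibb = 0.0
        for p, xi, yi in zip(pi, x, y):
            w = p * a + (1.0 - p) * b
            wa = p * a * (1 - a) + (1 - p) * b * (1 - b)  # dW/d(alpha)
            wb = p * a * (1 - a)                          # dW/d(beta)
            r = xi / w - yi / (1.0 - w)
            ga += r * wa
            gb += r * wb
            k = (xi + yi) / (w * (1.0 - w))               # expected information weight
            iaa += k * wa * wa
            iab += k * wa * wb
            ibb += k * wb * wb
        det = iaa * ibb - iab * iab
        da = (ibb * ga - iab * gb) / det
        db = (iaa * gb - iab * ga) / det
        # Halve the step until the log likelihood does not decrease.
        step, base = 1.0, loglik(alpha, beta, pi, x, y)
        while step > 1e-8 and loglik(alpha + step * da, beta + step * db, pi, x, y) < base:
            step /= 2.0
        alpha += step * da
        beta += step * db
        if abs(step * da) + abs(step * db) < tol:
            break
    return alpha, beta


# Hypothetical grouped counts (not real data); index i is POE category i.
pi = [0.0, 1.0, 0.5, 0.25]
x = [40, 30, 2, 20]
y = [360, 70, 3, 80]
pi_c, x_c, y_c = recategorize(pi, x, y, {2: 1})  # move category 2 into category 1
print(fit_mlm1(pi, x, y), fit_mlm1(pi_c, x_c, y_c))
```

Moving the sparse category in the other direction only requires a different `moves` mapping, so both recategorizations can be compared against the fit to the original categories.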

Effects of POE Error on Parameter Estimates in the MLM

Since we are considering the inclusion of information on POE in estimating E-D relationships, we must examine how parameter estimates and subsequent inferences are affected by error in this information. This error can be attributed to error in measurement of the POE variable, or to sampling errors associated with taking finite samples from a population. The measurement error setting is of interest if the POE measure is not known, but instead is estimated by some model. In this research, we assume that the levels of the POE variable ($P_j$) are known without error for each individual ($j$). We then concentrate on the examination of POE "misclassification" due to sampling error. This approach is more relevant to the examination of the A-T data. The measurement error situation will be a focus of future research.

The general concept of error or "misclassification" of the POE variable involves several underlying issues that need to be considered. In presenting these issues, we begin by clarifying the distinction between POE and exposure status.

In the dichotomous exposure situation (i.e. exposed or not exposed), the $j$th individual drawn at random from a population will have one of two levels of exposure status ($E_j$), either exposed ($E_j = 1$) or not exposed ($E_j = 0$). We typically label these exposure status categories as $E$ for "exposed" and $\bar{E}$ for "not exposed". (This notation is not to be confused with the dichotomous random variable $E_j$, which indicates the exposure status for individual $j$.) A given individual ($j$) may, however, have an associated POE ($P_j$) equal to any value between 0 and 1 (if we assume POE takes on values from a continuous scale), or any one of $(I+1)$ possible POE values if we know that only certain (discrete) realizations of the POE variable are possible. To simplify notation and the concepts we are discussing, we only consider the discrete case. We then specify these $(I+1)$ possible values of the POE measure (i.e. POE categories) as $\pi_0, \pi_1, \ldots, \pi_I$.

To illustrate the concept of "misclassification" of the POE variable or, specifically, sampling error associated with the realization of the POE measures, let us consider the following situation. A group of $n_i$ individuals is drawn at random from a population containing individuals who are either truly exposed or truly unexposed to some agent or condition. Assume that these $n_i$ individuals all have the same POE value, given as $P_j = \pi_i$ for $j = 1, 2, \ldots, n_i$. Of these, we expect to have $n_i \pi_i$ exposed individuals and $n_i (1 - \pi_i)$ unexposed individuals in the sample (for $i = 0, 1, 2, \ldots, I$). However, we would generally observe $n_i \pi_i'$ and $n_i (1 - \pi_i')$ exposed and unexposed individuals, respectively, in the sample, where $\pi_i' \neq \pi_i$. Moreover, the smaller the value of $n_i$, the more the sample would tend to disagree with the expected numbers of exposed and unexposed individuals.

Although we refer to this situation as "misclassification" of the POE variable, we should note that it is not strictly analogous to the usual exposure misclassification setting. In exposure misclassification, there is error in categorizing individuals into an exposure category. The POE variable introduces a further level of complexity. For example, if an individual ($j$) is drawn at random from the subpopulation of all individuals in the POE category $\pi_3 = 0.50$, the probability of misclassifying that individual as "exposed" when he or she is actually "not exposed" is equal to 0.50. The POE variable $P_j$ is, in this situation, an accurate measure of misclassification probability for that individual. If, however, the individual was not drawn from the entire subpopulation, but instead was drawn from a sample of the subpopulation in which (due to sampling variation) only 40% of the individuals are truly exposed, then the true misclassification probability for each individual in that group should be 0.60. In this situation, $P_j$ is not an accurate measure of misclassification probability.
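The dependence of this sampling error on the group size can be illustrated with a minimal Python sketch. It assumes that the number of truly exposed individuals in a POE group behaves as a binomial draw with parameters $n_i$ and $\pi_i$, which is an assumption consistent with random sampling from a large subpopulation rather than something specified in the text. The function names are hypothetical and chosen only for this illustration.

```python
import math


def realized_proportion_sd(pi, n):
    """Standard deviation of the realized exposed proportion pi' = X / n,
    ASSUMING X ~ Binomial(n, pi) (an assumption made for this sketch)."""
    return math.sqrt(pi * (1.0 - pi) / n)


def prob_exact_match(pi, n):
    """Probability that the sample holds exactly n * pi exposed individuals,
    i.e. that pi' equals pi. Only meaningful when n * pi is a whole number."""
    k = round(n * pi)
    return math.comb(n, k) * pi ** k * (1.0 - pi) ** (n - k)


if __name__ == "__main__":
    # POE category 0.50 with increasing group sizes: the spread of pi'
    # around pi shrinks as n grows.
    for n in (10, 30, 100):
        print(n, realized_proportion_sd(0.50, n), prob_exact_match(0.50, n))
```

Under this binomial assumption the spread of $\pi_i'$ around $\pi_i$ shrinks at the rate $1/\sqrt{n_i}$, which matches the remark that smaller groups tend to depart more from their expected counts.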
Error in the POE variable is therefore more properly identified as error in assigned misclassification probabilities, which indirectly relates to error in exposure status classification. In order to illustrate the effects of such misclassification on the estimates of the MLM parameters, we apply MLM-1 to three sets of contrived data. For these three analyses, no covariates are considered.

In the first analysis (Analysis 1), we apply MLM-1 to a relatively small sample of individuals ($N = 160$), all having known POE values. For simplicity, we restrict each individual to have one of five possible POE values ($\pi_0$, $\pi_1$, $\pi_2$, $\pi_3$, and $\pi_4$). In this analysis, we assume no POE-related sampling error. We also select $n_i$ so that application of the designated exposure probabilities and disease probabilities results in whole (observed) cell counts, thus eliminating computational round-off errors. This "ideal" analysis situation is created to examine the model's performance under the best of circumstances. Estimation problems in this setting would surely force us to reconsider the basic model specification.

The study sample for Analysis 1 is defined by specifying the numbers of individuals in the five POE groups: $n_0 = 80$, $n_1 = 15$, $n_2 = 15$, $n_3 = 30$, $n_4 = 20$; the disease probabilities in the population: $\Pr(D \mid E) = 0.60$, $\Pr(D \mid \bar{E}) = 0.20$ (and hence the values of the "true" OR and $\beta$ as 6.00 and $\ln 6.00 \approx 1.7918$, respectively); and the probability values associated with the POE groups: $\pi_0 = 0.0$, $\pi_1 = 1.0$, $\pi_2 = 0.67$, $\pi_3 = 0.50$, $\pi_4 = 0.25$.

The hypothetical data generated for Analysis 1 are detailed in Figure 1.1. Table 1.6 depicts the layout of these data in the form of a $2 \times (I+1)$ table. Table 1.7 shows counts of individuals by disease status and true exposure status. Below Table 1.7 we show the results from fitting MLM-1 to these data. The matrix $\hat\Sigma$ is the estimated variance-covariance matrix for $(\hat\alpha, \hat\beta)$. The standard errors reported are simply the square roots of the appropriate diagonal elements of $\hat\Sigma$. These results are obtained by implementation of the MAXLIK computer program.

We see from Table 1.7 that $\hat\beta^g = 1.7918$ ($\widehat{OR}^g = 6.00$) exactly agrees with the true value of $\beta$ (and OR) calculated from the known population disease probabilities. The value of $\beta$ (and OR) estimated by applying MLM-1 to the data, given as $\hat\beta = 1.7899$ ($\widehat{OR} = 5.99$), also agrees with the known value of $\beta$, except for numerical round-off error and the limited precision of the computer program.

For the purpose of comparison, we calculate the estimate of $\beta$ under Alternative Model 1 [8] for these analyses. Alternative Model 2 [13] is not informative in these analyses since we have constrained the counts in $\pi_1$ and $\pi_0$ to exactly reflect the population probabilities. Other categorizations of POE are also not considered for these analyses.

Applying Alternative Model 1, where we classify individuals with $P_j > 0.0$ as exposed and individuals with $P_j = 0.0$ as unexposed, we generate Table 1.8. In contrast to the estimate of $\beta$ obtained from application of MLM-1, we see for these data that $\hat\beta^{+}$ is biased toward the null (an odds ratio of one). This is what would be expected in a nondifferential misclassification situation where we misclassify truly unexposed individuals as "exposed". In fact, by recategorizing the individuals from the five POE groups into two exposure groups, we are introducing exactly such a bias. For this "ideal" setting, the estimates given by the MLM are less biased than those from Alternative Model 1.
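The Analysis 1 fit can be approximated with a short Python sketch. The defining equation of MLM-1 (equation [1]) is not reproduced in this section, so the sketch assumes a mixture form in which the disease probability for an individual with POE value $P$ is $P \cdot \mathrm{expit}(\alpha + \beta) + (1 - P) \cdot \mathrm{expit}(\alpha)$. That form, the Fisher-scoring routine, and all names below are assumptions made for illustration and are not the MAXLIK program. The grouped counts are summed from the Analysis 1 data described above.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


# Analysis 1 by POE group: (pi_i, n_i, number diseased in the group).
ANALYSIS_1 = [(0.0, 80, 16), (1.0, 15, 9), (0.67, 15, 7),
              (0.50, 30, 12), (0.25, 20, 6)]


def fit_mixture_logistic(groups, iterations=100, tol=1e-10):
    """Fisher scoring for (alpha, beta) under the ASSUMED model
    Pr(D | P) = P * expit(alpha + beta) + (1 - P) * expit(alpha)."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        p0, p1 = expit(alpha), expit(alpha + beta)
        u_a = u_b = i_aa = i_ab = i_bb = 0.0
        for pi, n, d in groups:
            p = pi * p1 + (1.0 - pi) * p0               # group disease probability
            dp_b = pi * p1 * (1.0 - p1)                 # dp / d(beta)
            dp_a = dp_b + (1.0 - pi) * p0 * (1.0 - p0)  # dp / d(alpha)
            var = p * (1.0 - p)
            resid = (d - n * p) / var
            u_a += resid * dp_a                         # score vector
            u_b += resid * dp_b
            i_aa += n / var * dp_a * dp_a               # expected information
            i_ab += n / var * dp_a * dp_b
            i_bb += n / var * dp_b * dp_b
        det = i_aa * i_bb - i_ab * i_ab
        step_a = (i_bb * u_a - i_ab * u_b) / det        # information^{-1} * score
        step_b = (i_aa * u_b - i_ab * u_a) / det
        alpha += step_a
        beta += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return alpha, beta


if __name__ == "__main__":
    alpha_hat, beta_hat = fit_mixture_logistic(ANALYSIS_1)
    print(alpha_hat, beta_hat, math.exp(beta_hat))
```

If the assumed mixture form matches equation [1], the fitted slope should land close to the reported 1.7899, with the small departure from $\ln 6$ arising because $\pi_2$ is entered as 0.67 while 10 of the 15 individuals in that group are exposed.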

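Figure 1.1: Data for Analysis 1 (true exposure and disease status within each POE group)

  POE group         n_i    Exposed: n (D, $\bar{D}$)    Not exposed: n (D, $\bar{D}$)
  $\pi_1$ = 1.0      15    15 (9, 6)                    0
  $\pi_2$ = 0.67     15    10 (6, 4)                    5 (1, 4)
  $\pi_3$ = 0.50     30    15 (9, 6)                    15 (3, 12)
  $\pi_4$ = 0.25     20    5 (3, 2)                     15 (3, 12)
  $\pi_0$ = 0.0      80    0                            80 (16, 64)

Table 1.6: POE by Disease Status for Analysis 1 (counts as implied by Figure 1.1)

  Probability of Exposure Category
               $\pi_1$   $\pi_0$   $\pi_2$   $\pi_3$   $\pi_4$
  D               9        16         7        12         6
  $\bar{D}$       6        64         8        18        14

Table 1.7: Exposure by Disease Status for Analysis 1 (counts as implied by Figure 1.1)

               E     $\bar{E}$
  D            27       23
  $\bar{D}$    18       92

RESULTS FROM MLM-1: $\widehat{OR}$ = 5.99. The remaining entries ($\hat\beta$, $\hat\alpha$, their standard errors, $\hat\Sigma$, and $\ln L(\hat\alpha, \hat\beta; P, z)$) are illegible.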

Table 1.8: Application of Alternative Model 1 to Analysis 1 Data

The second and third analyses examine data containing POE-related sampling error. We examine a data set that is generated with the same specifications as those used in the prior analysis (i.e., the same values of $n_i$, $\Pr(D \mid E)$, $\Pr(D \mid \bar{E})$, and designated levels of $\pi_i$), but which has fewer truly exposed individuals in some POE groups than are expected based on the $\pi_i$ values associated with those groups (Analysis 2). The converse situation, where there are more truly exposed individuals in some POE groups than expected based on the values of $\pi_i$ for those groups (Analysis 3), is also examined.

In Analysis 2, POE-related sampling error is present in two of the five POE groups. Of the 15 individuals sampled from POE group 2, only 5 are truly exposed; however, we would expect 10 to be truly exposed if there were no sampling error (as in Analysis 1). Also, in POE group 3, only 10 individuals are truly exposed. Since $\pi_3 = 0.50$, we would expect 15 of the 30 individuals in this group to be exposed. Using the notation specified earlier, the above situation may be stated as

$$\pi_0' = \pi_0, \quad \pi_1' = \pi_1, \quad \pi_2' < \pi_2, \quad \pi_3' < \pi_3, \quad \text{and} \quad \pi_4' = \pi_4.$$

A summary of the true exposure-disease status and the specification of the values of $\pi_i'$ in this sample are depicted in Figure 1.2. As we see from Table 1.10, the cell counts comprising this table differ from those of Analysis 1 (Table 1.7); however, the true $\beta$ still equals $\ln 6.0 \approx 1.7918$ (OR = 6.0).

The estimate of $\beta$ under MLM-1 for Analysis 2 ($\widehat{OR}$ = 4.49, Figure 1.2) is biased towards the null. Under Alternative Model 1 we get $\hat\beta^{+} = 0.8755$, which is more biased than the estimate obtained from MLM-1. Again, these biases agree with what we would expect in a situation where we have nondifferential misclassification of the exposure variable.

In Analysis 3, POE-related sampling error is present in POE groups 2 and 3, but in a direction opposite to that in Analysis 2. For POE group 2, there are 15 individuals truly exposed, whereas we would expect only 10 to be exposed if there were no sampling error. In POE group 3, there are 20 instead of 15 individuals truly exposed. As before, using $\pi_i'$ to indicate the proportion of exposed individuals in this sample, we specify

$$\pi_0' = \pi_0, \quad \pi_1' = \pi_1, \quad \pi_2' > \pi_2, \quad \pi_3' > \pi_3, \quad \text{and} \quad \pi_4' = \pi_4.$$

A summary of these data is presented in Figure 1.3 and Tables 1.11 and 1.12. When MLM-1 is fitted to these data, we obtain $\hat\beta = 2.0669$, reflecting a bias away from the null and an overestimate of the odds ratio attributable to exposure. Under Alternative Model 1, $\hat\beta^{+}$ is biased towards the null. We note, however, that the magnitude of the bias for the dichotomized treatment of the data is greater than the bias using the MLM.

Analysis 3 illustrates a curious property of the MLM. The data are constructed in a manner that seems to follow a pattern of nondifferential misclassification of the POE variable, but the estimate obtained from the MLM does not behave as we would expect in such a situation. The fact that $\hat\beta$ is less biased than $\hat\beta^{+}$ is encouraging, but the anticonservative direction of the bias may be of concern. What we have called "misclassification" in the POE measure, as it applies to the MLM, may actually be a different phenomenon.
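The comparison between the gold-standard estimate and Alternative Model 1 across the three analyses reduces to arithmetic on 2x2 tables, which the following Python sketch makes explicit. The cell counts are transcribed from the tree diagrams in Figures 1.1 to 1.3; the data layout and function names are hypothetical choices for this illustration.

```python
import math

# Per POE group with POE > 0, in the order pi_1, pi_2, pi_3, pi_4:
# (exposed diseased, exposed not diseased,
#  unexposed diseased, unexposed not diseased)
POSITIVE_POE = {
    "Analysis 1": [(9, 6, 0, 0), (6, 4, 1, 4), (9, 6, 3, 12), (3, 2, 3, 12)],
    "Analysis 2": [(9, 6, 0, 0), (3, 2, 2, 8), (6, 4, 4, 16), (3, 2, 3, 12)],
    "Analysis 3": [(9, 6, 0, 0), (9, 6, 0, 0), (12, 8, 2, 8), (3, 2, 3, 12)],
}
# The POE = 0 group is the same in all three analyses (80 unexposed).
ZERO_POE = (0, 0, 16, 64)


def log_odds_ratio(a, b, c, d):
    """ln OR for a 2x2 table: a, b = exposed D / not D; c, d = unexposed."""
    return math.log((a * d) / (b * c))


def gold_standard_beta(groups):
    """Uses true exposure status, pooling all five POE groups."""
    rows = groups + [ZERO_POE]
    a, b, c, d = (sum(row[k] for row in rows) for k in range(4))
    return log_odds_ratio(a, b, c, d)


def alternative_model_1_beta(groups):
    """Treats everyone with POE > 0 as exposed and POE = 0 as unexposed."""
    a = sum(row[0] + row[2] for row in groups)
    b = sum(row[1] + row[3] for row in groups)
    return log_odds_ratio(a, b, ZERO_POE[2], ZERO_POE[3])


if __name__ == "__main__":
    for name, groups in POSITIVE_POE.items():
        print(name, gold_standard_beta(groups), alternative_model_1_beta(groups))
```

In each analysis the gold-standard value equals $\ln 6$, while the dichotomized estimate falls below it, which is the direction of bias described for Alternative Model 1.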

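Figure 1.2: Data for Analysis 2 (true exposure and disease status within each POE group)

  POE group         $\pi_i'$    n_i    Exposed: n (D, $\bar{D}$)    Not exposed: n (D, $\bar{D}$)
  $\pi_1$ = 1.0      1.0         15    15 (9, 6)                    0
  $\pi_2$ = 0.67     0.33        15    5 (3, 2)                     10 (2, 8)
  $\pi_3$ = 0.50     0.33        30    10 (6, 4)                    20 (4, 16)
  $\pi_4$ = 0.25     0.25        20    5 (3, 2)                     15 (3, 12)
  $\pi_0$ = 0.0      0.0         80    0                            80 (16, 64)

Table 1.9: POE by Disease Status for Analysis 2 (counts as implied by Figure 1.2)

  Probability of Exposure Category
               $\pi_1$   $\pi_0$   $\pi_2$   $\pi_3$   $\pi_4$
  D               9        16         5        10         6
  $\bar{D}$       6        64        10        20        14

Table 1.10: Exposure by Disease Status for Analysis 2 (counts as implied by Figure 1.2)

               E     $\bar{E}$
  D            21       25
  $\bar{D}$    14      100

RESULTS FROM MLM-1: $\widehat{OR}$ = 4.49. The remaining entries ($\hat\beta$, $\hat\alpha$, their standard errors, $\hat\Sigma$, and $\ln L(\hat\alpha, \hat\beta; P, z)$) are illegible.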

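Figure 1.3: Data for Analysis 3 (true exposure and disease status within each POE group)

  POE group         $\pi_i'$    n_i    Exposed: n (D, $\bar{D}$)    Not exposed: n (D, $\bar{D}$)
  $\pi_1$ = 1.0      1.0         15    15 (9, 6)                    0
  $\pi_2$ = 0.67     1.0         15    15 (9, 6)                    0
  $\pi_3$ = 0.50     0.67        30    20 (12, 8)                   10 (2, 8)
  $\pi_4$ = 0.25     0.25        20    5 (3, 2)                     15 (3, 12)
  $\pi_0$ = 0.0      0.0         80    0                            80 (16, 64)

Table 1.11: POE by Disease Status for Analysis 3 (counts as implied by Figure 1.3)

  Probability of Exposure Category
               $\pi_1$   $\pi_0$   $\pi_2$   $\pi_3$   $\pi_4$
  D               9        16         9        14         6
  $\bar{D}$       6        64         6        16        14

Table 1.12: Exposure by Disease Status for Analysis 3 (counts as implied by Figure 1.3)

               E     $\bar{E}$
  D            33       21
  $\bar{D}$    22       84

RESULTS FROM MLM-1: $\widehat{OR}$ = 7.90. The remaining entries ($\hat\beta$, $\hat\alpha$, their standard errors, $\hat\Sigma$, and $\ln L(\hat\alpha, \hat\beta; P, z)$) are illegible.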

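Hypothesis Testing and Confidence Interval Estimation

After examining the statistical properties of the parameter estimates from the suggested models, our focus turns to the inferential properties associated with these estimates. We apply common inferential procedures using the estimates obtained and examine the behavior of these methods.

The likelihood ratio test is performed to test the null hypothesis $H_0: \beta = 0$, incorporating the various estimators of $\beta$ discussed in Section 1.2 ($\hat\beta$, $\hat\beta^{+}$, $\hat\beta^{01}$, $\hat\beta^{g}$, $\hat\beta^{c}$, and $\hat\beta^{d}$). We define the likelihood ratio statistic as

$$\lambda = \frac{L(\hat\theta_0; y)}{L(\hat\theta_A; y)}$$

or equivalently

$$-2 \ln \lambda = -2 \left[ \ln L(\hat\theta_0; y) - \ln L(\hat\theta_A; y) \right].$$

The log-likelihood under the alternative hypothesis, $\ln L(\hat\theta_A; y)$, is calculated for each of the suggested models as detailed earlier (equations [7], [11], [16], [21], [25], and [27]). The value of $\ln L(\hat\theta_0; y)$ is calculated using the cell counts in the appropriate contingency tables. We develop the procedure for calculating $\ln L(\hat\theta_0; y)$ by first considering the "usual" logistic model and then noting the direct extensions to the MLM and the alternative models.

The "usual" logistic model, as specified by equation [18], reduces to

$$\Pr(D \mid E_j) = \frac{1}{1 + \exp(-\alpha_0^g)}$$

when $\beta^g = 0$. We also see that, under $H_0$,

$$L(\alpha_0^g; E, z) = \prod_{j=1}^{N} \left[ \frac{\exp(\alpha_0^g)}{1 + \exp(\alpha_0^g)} \right]^{z_j} \left[ \frac{1}{1 + \exp(\alpha_0^g)} \right]^{1 - z_j} = \frac{\exp(m_1 \alpha_0^g)}{\left[ 1 + \exp(\alpha_0^g) \right]^{N}}$$  [28]

where $m_1$ and $m_0$ are the numbers of individuals with and without disease, respectively ($N = m_1 + m_0$), and $z_j = 1$ if individual $j$ is diseased, $z_j = 0$ otherwise. Setting

$$\frac{\partial \ln L(\alpha_0^g; E, z)}{\partial \alpha_0^g} = 0$$

and solving for $\alpha_0^g$, we specify the MLE of $\alpha_0^g$ as

$$\hat\alpha_0^g = \ln \left[ \frac{m_1}{m_0} \right].$$

If we examine the MLM (equation [1]), we see that under the null hypothesis the model reduces to the "usual" logistic model. This also holds true for the MLM applied to data under recategorizations C [23] and D [26] and for the model which collapses across all POE groups that are greater than zero (Alternative Model 1 [8]). For these models, the parameter estimate under the null hypothesis is specified, respectively, as

$$\hat\alpha_0 = \hat\alpha_0^{c} = \hat\alpha_0^{d} = \hat\alpha_0^{+} = \ln \left[ \frac{m_1}{m_0} \right].$$  [29]

We then specify (in general terms)

$$\widehat{\Pr}(D) = \frac{\exp(\hat\alpha_0)}{1 + \exp(\hat\alpha_0)} = \frac{m_1}{N}, \qquad \widehat{\Pr}(\bar{D}) = \frac{1}{1 + \exp(\hat\alpha_0)} = \frac{m_0}{N}.$$  [30, 31]

We note that the values $m_1$ and $m_0$ are identical for all data layouts except for Alternative Model 2 (Table 1.5). In that model, the parameter estimates are based on consideration of only part of the data. The row totals in Table 1.5 are denoted by $m_1^{01}$ and $m_0^{01}$ to distinguish them from the marginals in the data layout tables for the other suggested models. For Alternative Model 2 we specify

$$\hat\alpha_0^{01} = \ln \left[ \frac{m_1^{01}}{m_0^{01}} \right],$$  [32]

and

$$\widehat{\Pr}(D) = \frac{m_1^{01}}{N^{01}}, \qquad \widehat{\Pr}(\bar{D}) = \frac{m_0^{01}}{N^{01}}.$$  [33, 34]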
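The null-hypothesis calculations above can be sketched in a few lines of Python. The counts used in the example (27, 18, 23, 92) are the Analysis 1 exposure-by-disease counts summed from Figure 1.1. The alternative-hypothesis log-likelihood shown is that of the usual logistic model with a single dichotomous exposure, which for a 2x2 table equals the saturated binomial log-likelihood; it is not the MLM log-likelihood. Function names are hypothetical.

```python
import math


def null_log_likelihood(m1, m0):
    """ln L under H0: beta = 0, using the fitted probabilities
    m1 / N and m0 / N implied by alpha_0-hat = ln(m1 / m0)."""
    n = m1 + m0
    return m1 * math.log(m1 / n) + m0 * math.log(m0 / n)


def usual_logistic_log_likelihood(a, b, c, d):
    """Maximized ln L for the usual logistic model with one binary exposure.
    a, b = exposed diseased / not diseased; c, d = unexposed.
    Assumes every cell count is positive."""
    total = 0.0
    for diseased, healthy in ((a, b), (c, d)):
        n = diseased + healthy
        total += diseased * math.log(diseased / n) + healthy * math.log(healthy / n)
    return total


def likelihood_ratio_statistic(a, b, c, d):
    """-2 ln(lambda) = -2 [ln L(theta_0) - ln L(theta_A)]."""
    ln_l0 = null_log_likelihood(a + c, b + d)
    ln_la = usual_logistic_log_likelihood(a, b, c, d)
    return -2.0 * (ln_l0 - ln_la)


if __name__ == "__main__":
    # Analysis 1 gold-standard table: m1 = 50 diseased, m0 = 110 not diseased.
    print(null_log_likelihood(50, 110))
    print(likelihood_ratio_statistic(27, 18, 23, 92))
```

The resulting statistic is compared with 3.84, the 95th percentile of a central chi-square distribution with one degree of freedom.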

For all models except Alternative Model 2, we use equations [29], [30], and [31] to derive

$$\ln L(\hat\theta_0; y) = \ln L(\hat\alpha_0; p, z) = m_1 \ln \left[ \frac{m_1}{N} \right] + m_0 \ln \left[ \frac{m_0}{N} \right]$$  [35]

(shown here for $\hat\theta_0 = \hat\alpha_0$, with obvious extensions for $\hat\alpha_0^{+}$, $\hat\alpha_0^{g}$, $\hat\alpha_0^{c}$, and $\hat\alpha_0^{d}$). Similarly, for Alternative Model 2, we have

$$\ln L(\hat\theta_0; y) = \ln L(\hat\alpha_0^{01}; p, z) = m_1^{01} \ln \left[ \frac{m_1^{01}}{N^{01}} \right] + m_0^{01} \ln \left[ \frac{m_0^{01}}{N^{01}} \right].$$  [36]

Once we have calculated $\ln L(\hat\theta_0; y)$ and $\ln L(\hat\theta_A; y)$ for each model, we calculate

$$-2 \ln \lambda = -2 \left[ \ln L(\hat\theta_0; y) - \ln L(\hat\theta_A; y) \right].$$

We reject the hypothesis $H_0: \beta = 0$ at the $\alpha = .05$ level if $-2 \ln \lambda > 3.84$ (the 95th percentile value of a central $\chi^2_1$ variable).

Another approach to studying the inferential properties of the estimates obtained under the suggested models is to examine the structure of the confidence intervals for $\beta$. In general, we specify a $100(1-\alpha)\%$ large-sample confidence interval for $\beta$ as

$$C.I.(\hat\beta) = \hat\beta \pm z_{1-\alpha/2} \sqrt{\widehat{Var}(\hat\beta)}.$$  [37]

For the parameter estimates obtained from MAXLIK ($\hat\beta$, $\hat\beta^{c}$, and $\hat\beta^{d}$), $\widehat{Var}(\hat\beta)$ is the appropriate element of the inverse of the observed information matrix and is given as part of the computer output. For the models based on the logistic model with data layouts in the form of 2x2 contingency tables (those involving $\hat\beta^{+}$, $\hat\beta^{01}$, and $\hat\beta^{g}$), we can calculate $\widehat{Var}(\hat\beta)$ explicitly, using the variance estimates specified earlier.

Using the above estimates (and estimates from MAXLIK) and equation [37], we calculate $C.I.(\hat\beta)$, $C.I.(\hat\beta^{+})$, $C.I.(\hat\beta^{01})$, $C.I.(\hat\beta^{g})$, $C.I.(\hat\beta^{c})$, and $C.I.(\hat\beta^{d})$. For a given model, we reject $H_0: \beta = 0$ if the lower bound of the respective confidence interval exceeds zero.

1.5 Research Outline

The concentration of effort in this research is on examining the properties of the parameter estimates obtained using the Modified Logistic Model. The lack of previous research on models of this type has necessitated a complete consideration of these properties. Since this is the first investigation of these models, we focus on a close examination of basic properties of the simplest forms of the model, with the intention of continuing this research to examine more complex extensions to the model in the future.

The complexity of the equations specifying the model, and the fact that the solutions to these equations may not be specified in closed form, dictate that the approach to examining these models must involve simulation studies. In applying the models under investigation to data that are drawn from hypothetical populations with known parameters, we can closely monitor the behavior of the resulting parameter estimates and obtain an understanding of the factors that affect their properties. Unfortunately, a limitation of the simulation study approach is that not all possible permutations of influencing factors can be examined. Our approach, therefore, is to concentrate on a few of what we believe are the most important factors.

This research is based on analyses of three sets of simulated data. The first simulation set, which we call Simulation Set A, is a set of three simulation studies which vary only by our specification of sample size. In this simulation set, we specify a positive E-D relationship and do not consider covariate information. In Simulation Set B, we perform three simulation studies (again, varying by specification of sample size), in which we sample from a population that has been created with the same POE structure as the population in Simulation Set A, but where there is no relationship between exposure and disease status. Finally, in Simulation Set C, we adapt the models that are being investigated to consider a dichotomous covariate that is acting as a confounder. Analyses are conducted on two sets of simulated data with the same sample sizes and confounding effect of the covariate. The first simulation study in this simulation set is based on a sample from a population where there is a positive E-D relationship (the value of $\beta$ is equivalent to that used for Simulation Set A), and the second simulation study specifies no E-D relationship ($\beta = 0$). Details of these simulation studies are given in Chapter II.

After the results of the simulation studies are completely examined (reported in Chapter III), we should have a good understanding of the factors that affect parameter estimation and statistical inferences. At this point, we apply the appropriate models to the A-T data and report the conclusions in conjunction with consideration of the information gained by examination of the simulated data. The details and results of this part of the research are presented in Chapter IV. Extensions to the proposed MLM and target areas for future research are given in Chapter V.
