Dennis Cosmatos. Department of Biostatistics, University of North Carolina at Chapel Hill. September 1988


METHODS FOR MODELING DISEASE RISK USING PROBABILITY-OF-EXPOSURE MEASURES by Dennis Cosmatos, Department of Biostatistics, University of North Carolina at Chapel Hill. Institute of Statistics Mimeo Series No. 1858T, September 1988

METHODS FOR MODELING DISEASE RISK USING PROBABILITY-OF-EXPOSURE MEASURES by Dennis Cosmatos. A Dissertation submitted to the faculty of The University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Public Health in the Department of Biostatistics. Chapel Hill, 1988.

Dennis Cosmatos. Methods for Modeling Disease Risk Using Probability-of-Exposure Measures. (Under the direction of Lawrence L. Kupper.)

ABSTRACT

A usual goal in health research is to examine the relationship between a specified disease outcome and exposure to a given agent or condition. In certain settings, however, the true exposure status for each individual may not be available. Instead, a Probability-Of-Exposure (POE) for each individual may be considered. This research introduces a modified logistic model (MLM) that can be used to estimate the exposure-disease (E-D) relationship when individual POE values are the only exposure data available. Various forms of the MLM are suggested, several of which allow for consideration of covariate information. Two alternative methods based on application of the usual logistic model to the data under certain exposure assumptions are also considered.

Application of the MLM to three small samples of exploratory data shows that the parameter estimates are influenced by sampling error in the POE measure. In order to further investigate this finding and observe how the MLM and the alternative models perform in various analysis settings, the MLM, the two alternative models, and the usual logistic model (using known exposure status and serving as a "gold standard" model) are applied to simulated data. Eight simulation studies are conducted, and the results of these analyses suggest the circumstances under which the MLM provides the most reliable and the least reliable estimates of the true underlying E-D relationship.

Practical application of the MLM is illustrated by conducting an analysis of a set of genetic data. These data are used to address a hypothesis that carriers of a gene for the autosomal recessive trait ataxia-telangiectasia are at increased risk of developing a malignant neoplasm, compared to noncarriers of the gene. The probability of being a carrier (i.e., the probability of being heterozygotic for the trait, a POE of sorts) is known without error by examining the relationship of each individual to a homozygous proband.

Some future research goals are outlined, including adaptation of the proposed methods to analyses of occupational health data by considering methods of estimating POE, and development of a modified survival model that can be used for analyses of clinical trial data.

ACKNOWLEDGEMENTS

I would like to express my unbounded appreciation to my advisor, Dr. Larry Kupper, for his skillful guidance and sustained enthusiasm throughout the course of this research. I would also like to express my thanks to the other members of my doctoral committee, Drs. Ed Davis, Cliff Patrick, Dana Quade, and Mike Symons, for their careful review of this material and their helpful comments.

For introducing me to the field of biostatistics (in the guise of biometrics), I would like to thank my undergraduate instructor in biometrics, Dr. Leslie Marcus. I still recall my first analysis of his marten femur-length data with tempered fondness.

This research would not have been possible had it not been conducted on the superb computational facilities at the Cornell National Supercomputer Facility. I would like to thank Mr. Rostyslaw Lewyckyj from Academic Computing Services here at UNC, and the User Support group at Cornell University, for their invaluable assistance with the many computational idiosyncrasies that emerged along the way.

I will always remember the fun and friendship I shared with Jerry Schindler, George Jerdack, Nancy Lucas, Susan Reade, Ali Barakat, and Cecilia Wada, the finest group of friends and fellow classmates one could ask for.

Finally, I would like to thank my best friend and wife, Irene. Only through her secret mixture of support, endurance, humor, tolerance, variety, conflict, and love was it possible not only to complete this program, but to have only fond memories of my time spent doing so, and of our time together in North Carolina.

TABLE OF CONTENTS

Chapter I Introduction and Research Outline
  1.1 Introduction
  Model Notation and Formulation
    Development and Specification of the Modified Logistic Model
    Estimation of MLM Parameters
    Specification and Parameter Estimation for Alternative Model 1: The "Usual" Logistic Model Applied to (01+) Data
    Specification and Parameter Estimation for Alternative Model 2: The "Usual" Logistic Model Applied to (01) Data
    Specification and Parameter Estimation for the "Gold Standard" Logistic Model: The "Usual" Logistic Model Applied to Data with Known Exposure Status
    The MLM Under POE Recategorizations C and D
  Effects of POE Error on Parameter Estimates in the MLM
  Hypothesis Testing and Confidence Interval Estimation
  Research Outline

Chapter II Simulation Studies
  2.1 Overview
  Specification of Simulation Studies
    Parameter Specifications for Simulation Set A
    Parameter Specifications for Simulation Set B
    Parameter Specifications for Simulation Set C
    Model Specification for Analysis of Simulation Set C Data
  Computer Programs Used in the Simulation Studies
    Simulation Program Details - Program Input and Data Generation
    Simulation Program Details - Computational Algorithms
    Simulation Program Details - Program Output

Chapter III Simulation Study Results
  3.1 Introduction
  Simulation Study Results for Simulation Set A
    Simulation A(1) - N=500, R=1000, β=
    Simulation A(2) - N=1000, R=1000, β=
    Simulation A(3) - N=2000, R=1000, β=
    Simulation Set A Summary and Concluding Remarks
  Simulation Study Results for Simulation Set B
    Simulation B(1) - N=1000, R=1000, β=
    Simulation B(2) - N=2000, R=1000, β=
    Simulation B(3) - N=4000, R=1000, β=
    Simulation Set B Summary and Concluding Remarks
  Simulation Study Results for Simulation Set C
    Simulation C(1) - N=2000, R=1000, β=, γ=
    Simulation C(2) - N=2000, R=1000, β=0.00, γ=
    Simulation Set C Summary and Concluding Remarks
  3.5 Summary and Conclusions from Simulation Studies

Chapter IV A Numerical Example - Analysis of the A-T Data
  4.1 Introduction
  General Overview and Specification of the Research Hypothesis
  Data Collection Methods and Study Design
  Specification of the POE Variable and Disease Outcome
  "Crude" Estimates of the E-D Relationship Using the Modified Logistic Model
  Analyses of the A-T Data Controlling for Extraneous Factors
    Analysis of the A-T Data Controlling for SEX: Application of MLM
    Specification of an AGE Covariate
    Analysis of the A-T Data Controlling for AGE and SEX: Application of MLM-2 with a Continuous AGE Variable
    Analysis of the A-T Data Controlling for AGE and SEX: Application of MLM-3 with a Categorical AGE Variable
    Analysis of the A-T Data - Modeling AGE as an Effect Modifier: Application of MLM-4 with a Categorical AGE Variable
  Summary and Conclusions from Analyses of the A-T Data

Chapter V Future Research
  5.1 Introduction
  5.2 Small Sample Parameter Estimation in the Modified Logistic Model
  The MLM with Several Exposure (and Associated POE) Variables
  Estimating Probability-of-Exposure
  Alternative POE-Based Models
  Closing Remarks

References

Chapter I

INTRODUCTION AND RESEARCH OUTLINE

1.1 Introduction

In most epidemiologic studies, the investigator is interested in examining the relationship between a disease or health condition and exposure to a set of potentially harmful environmental or physiological agents. It is hypothesized that such exposures affect the development or progression of the disease of interest. The investigator usually has some measure of the extent of exposure to these agents, either in the form of a dichotomous measurement (exposed ($E$) vs. unexposed ($\bar{E}$)), an ordinal measure of exposure divided into different levels (e.g. low, medium, or high), or a measurement on a continuous scale. One underlying assumption common to all these classes of exposure measurement is that they are known "without error". This research focuses on the development of models for analyzing data which are known to violate this assumption, but which include information on the level of exposure uncertainty in the form of probability-of-exposure (POE) measures. These POE values may be known with or without measurement error, or may have less error than that associated with an ordinary classification into exposure levels.

Within this framework, several characteristics of the POE measures can be considered. In the simplest case, they may be measurements reflecting the probability of being exposed to a specific agent. A probability of 1.0 would indicate certain exposure, a probability of 0.0 would indicate that the individual is unexposed, and a value between 0.0 and 1.0 gives the corresponding probability of being exposed to that specific agent. We may also incorporate a "background" exposure probability into this setting by specifying a value slightly greater than 0.0 for individuals who are considered "unexposed". In a more complex setting, we may have a vector of POE measurements indicating an individual's exposure probabilities for several agents, or probabilities of being exposed to each of several levels of a single agent. The proposed models could be extended to this setting with some modification.

The development of the proposed models was motivated by the need to analyze a set of genetic data which addresses the hypothesis that individuals who are heterozygotic for the hereditary condition ataxia-telangiectasia (A-T) are at higher risk for developing malignant neoplasms than are noncarriers of the A-T gene (Swift et al., 1976; 1987). Such a hypothesis could be examined by conventional methods if we had a large enough sample of individuals who were carriers of the trait and of individuals who were definitely noncarriers. However, since A-T carriers are phenotypically normal and a cytogenetic method for identifying carriers has not yet been devised, a method was needed to incorporate each individual's probability of being heterozygotic for A-T (a POE in some sense) into some type of risk model. Such probabilities are determined by basic genetic properties of trait inheritance.

To date, there has been very little research devoted to the development of such models. The few earlier studies that have addressed this problem (Swift et al., 1974; 1976; Chase et al., 1977) have suggested models that do not allow incorporation of continuous covariates and may have some undesirable statistical properties. The goal of this research is to develop models that are not subject to these drawbacks and can be used in a variety of research settings.

1.2 Model Notation and Formulation

In a typical analysis examining relationships between exposure ($E$ or $\bar{E}$) to a substance or condition of interest and a given disease outcome ($D$ or $\bar{D}$), we profile the data in the form of a 2x2 table (when no covariates are considered), as depicted in Table 1.1 below.

Table 1.1: Data Layout for a Study of E-D Relationships Assuming Known Exposure Status for Each Subject

            E       Ē
    D       a       b       m1
    D̄       c       d       m0
            n1      n0      N

Here a, b, c, and d are the numbers of individuals with the given exposure-disease characteristics. In this setting, we need to know (with certainty) the exposure and the disease status of each individual in order to complete this table accurately, after which we proceed to calculate measures of the E-D relationship.

The above table may be expanded in order to depict the data layout under a POE setting. In such a setting, we will not be able to specify exposure as $E$ or $\bar{E}$ for all individuals, but instead will know some measure of POE for certain subjects. We may display these types of data in the form of a 2x(I+1) table, where we define $I$ POE categories associated with probabilities greater than zero ($I \geq 2$) and one category for individuals classified as "unexposed". The data from studies that consider POE may be put into the tabular format of Table 1.2.

Table 1.2: Data Layout for Studies Considering POE

                 Probability of Exposure Category
            π1      π0      π2      π3      ...     πI
    D       x1      x0      x2      x3      ...     xI      m1
    D̄       y1      y0      y2      y3      ...     yI      m0
            n1      n0      n2      n3      ...     nI      N

In Table 1.2, the POE categories are indexed by the probabilities $\pi_0, ..., \pi_I$. We will generally assume that $\pi_0$ has the value 0.0, $\pi_1$ has the value 1.0, and $\pi_1 = 1.0 > \pi_2 > \pi_3 > ... > \pi_I > \pi_0 = 0.0$. Under this specification, we note that the first two rows and columns of Table 1.2 comprise a 2x2 subtable that contains counts of individuals for whom we "know" the exposure status (i.e. an individual in category $\pi_1 = 1.0$ is known to be exposed, and an individual in category $\pi_0$ is known to be not exposed). This situation is similar to the known exposure situation (Table 1.1), and usual logistic methods may be applied. In fact, if the cell counts in this subtable are large, we may consider using standard methods to quantify E-D relationships without considering the data in the remaining cells. We should note, however, that disregarding what may be the majority of the data is likely to cause instability and unreliability in the statistical analyses. In the models that are introduced in this research, the parameters are estimated using all the data collected in the study.

Upon initial examination of the above data setting, an investigator may be tempted to apply the usual logistic model to the data, treating POE as a covariate in the model. We should stress that such an approach is inappropriate, for several reasons. First, the model that treats the POE value $P_j$ of individual $j$ as a covariate implies that $P_j$ can be treated as a "level" of exposure. If, in fact, we believe that there are only two true exposure levels, $E$ and $\bar{E}$, a "dose" assumption for the exposure variable would not be justified. Also, it would be difficult to argue that the POE is related to disease outcome via some "logistic-linear" relationship. On the other hand, justifiable probability relationships (to be shown in equation [1]) support the rationale for the MLM. Given these considerations, the usual logistic model with $P_j$ as a covariate was judged inappropriate for these types of data and is not considered in this research.

In the sections to follow, we will introduce methods designed to perform analyses on data that are in the form depicted by Table 1.2. We will examine how these methods compare to alternative approaches that collapse the 2x(I+1) table into a 2x2 table so that standard analysis procedures can be used. In addition, we will consider the analysis of data that involve a dichotomous confounding factor. We will examine the performance of the proposed models as compared to standard procedures under several different methods of treatment of the exposure variable. To simplify the presentation of these methods, most of the development will focus on the analysis of data with no covariates. Once the methods for this setting are fully developed, the covariate setting will be presented as an extension. In the sections that follow, we review the notation and formulation of each model that is considered in this research.

Development and Specification of the Modified Logistic Model

We begin the development of appropriate models for analyzing data containing probability-of-exposure measures by considering some basic statistical properties. In most cases, we are interested in modeling the probability of disease development [$\Pr(D)$] in a population. Some individuals in this population may be exposed ($E$) to some specified agent or condition, and some are not exposed ($\bar{E}$). In what follows, we will make use of the simple probability law:

$$\Pr(D) = \Pr(D \cap E) + \Pr(D \cap \bar{E}) = \Pr(E)\Pr(D \mid E) + \Pr(\bar{E})\Pr(D \mid \bar{E}). \qquad [1]$$
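The contrast drawn above, between the probability law [1] and a logistic model that enters POE as a covariate, can be made concrete with a small numerical sketch. The risk values and the function names below are hypothetical and chosen only for illustration; the covariate model is calibrated so that both approaches agree when POE is 0 or 1.

```python
import math

def mixture_risk(poe, risk_exposed, risk_unexposed):
    """Equation [1]: Pr(D) = Pr(E) Pr(D|E) + Pr(not E) Pr(D|not E)."""
    return poe * risk_exposed + (1.0 - poe) * risk_unexposed

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of logit: e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_covariate_risk(poe, risk_exposed, risk_unexposed):
    """Usual logistic model with POE as a covariate, with intercept and slope
    chosen so that it reproduces the two risks at POE = 0 and POE = 1."""
    intercept = logit(risk_unexposed)
    slope = logit(risk_exposed) - intercept
    return expit(intercept + slope * poe)

# Hypothetical risks (assumptions for illustration, not values from the text).
RISK_E, RISK_NOT_E = 0.40, 0.05

for poe in (0.0, 0.25, 0.5, 1.0):
    print(poe,
          round(mixture_risk(poe, RISK_E, RISK_NOT_E), 4),
          round(logistic_covariate_risk(poe, RISK_E, RISK_NOT_E), 4))
```

Under these assumptions the two curves coincide only at the endpoints: the probability law is linear in POE, while the covariate model interpolates on the log-odds scale, which is the "dose" interpretation the text argues against.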

The logistic model is often used for modeling the probability of developing disease as a function of various independent variables known or hypothesized to be related to disease development. A simple form of the logistic model, one that does not contain any covariate terms other than an indicator of being exposed or not exposed, specifies the following relationships:

$$\Pr(D \mid E) = \frac{e^{\alpha+\beta}}{1+e^{\alpha+\beta}}, \qquad \Pr(D \mid \bar{E}) = \frac{e^{\alpha}}{1+e^{\alpha}},$$

$$e^{\alpha} = \frac{\Pr(D \mid \bar{E})}{\Pr(\bar{D} \mid \bar{E})}, \quad \text{and} \quad e^{\beta} = \frac{\Pr(D \mid E)/\Pr(\bar{D} \mid E)}{\Pr(D \mid \bar{E})/\Pr(\bar{D} \mid \bar{E})} = OR,$$

where $OR$ is the ratio of the odds of developing the disease in the exposed group to the odds in the unexposed group, and $e^{\alpha}$ is the odds of disease in the unexposed group.

In the usual specification of the logistic model, $\Pr(E)$ (henceforth referred to as POE) is 0 or 1 for every subject in the study. The proposed modification to this model allows for the inclusion of POE values $P_j$, where $0 \leq P_j \leq 1$ for individual $j = 1, 2, ..., N$. We utilize this POE information by specifying the following modified logistic model (MLM) based on equation [1]:

(MLM-1)
$$w_j = P_j \, \frac{e^{\alpha+\beta}}{1+e^{\alpha+\beta}} + (1-P_j) \, \frac{e^{\alpha}}{1+e^{\alpha}} \qquad [2]$$

where $j = 1, 2, ..., N$ indexes each subject in the study, $w_j$ denotes the probability of disease for subject $j$, and $P_j$ specifies that individual's POE. Just as we can extend the logistic model to include covariates, we can specify the MLM with covariates as:

(MLM-2)
$$w_j = P_j \, \frac{e^{\alpha+\beta+\sum_{r=1}^{R}\gamma_r X_{rj}}}{1+e^{\alpha+\beta+\sum_{r=1}^{R}\gamma_r X_{rj}}} + (1-P_j) \, \frac{e^{\alpha+\sum_{r=1}^{R}\gamma_r X_{rj}}}{1+e^{\alpha+\sum_{r=1}^{R}\gamma_r X_{rj}}} \qquad [3]$$

where $j = 1, 2, ..., N$ indexes each subject, there are $r = 1, 2, ..., R$ continuous covariates, and $X_{rj}$ specifies the value of the $r$th covariate for the $j$th individual. Similarly, for a single nominal covariate with $M$ categories, we specify:

(MLM-3)
$$w_j = P_j \, \frac{e^{\alpha+\beta+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}{1+e^{\alpha+\beta+\sum_{m=1}^{M-1}\gamma_m V_{mj}}} + (1-P_j) \, \frac{e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}{1+e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}} \qquad [4]$$

where $j = 1, 2, ..., N$, and where $V_{mj} = 1$ if the $j$th subject is in category $m$ and equals zero otherwise, $m = 1, 2, ..., M-1$. Note that $e^{\alpha}$ denotes the odds of disease for unexposed individuals having the "baseline" or "reference" level $m = M$ of the covariate (i.e. $V_{mj} = 0$ for $m = 1, ..., M-1$ if individual $j$ has the baseline value $m = M$ of the covariate), and $OR_m = e^{\beta}$ specifies an odds ratio that is "common" across all levels of the covariate ($m = 1, 2, ..., M$).

Finally, an extension of MLM-3 which allows for effect modification (i.e. nonuniformity of the odds ratio across strata) may be specified as:

(MLM-4)
$$w_j = P_j \, \frac{e^{\alpha+\beta+\sum_{m=1}^{M-1}(\gamma_m+\delta_m) V_{mj}}}{1+e^{\alpha+\beta+\sum_{m=1}^{M-1}(\gamma_m+\delta_m) V_{mj}}} + (1-P_j) \, \frac{e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}}{1+e^{\alpha+\sum_{m=1}^{M-1}\gamma_m V_{mj}}} \qquad [5]$$

Here, $OR_m = e^{\beta+\delta_m}$, $m = 1, 2, ..., M-1$, and $OR_M = e^{\beta}$.

As with the standard logistic model, both categorical and continuous confounders and effect modifiers can be accommodated by the proposed MLM. This research concentrates on the statistical properties of MLM-1, examined through the performance of this model on simulated data. As an extension, MLM-3 with one dichotomous covariate will also be studied by simulation.
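As a minimal sketch of how the disease probability $w_j$ could be evaluated for MLM-1 through MLM-3, the function below mixes the exposed and unexposed logistic risks with weight $P_j$, following equation [1]. The function name, argument names, and parameter values are illustrative assumptions; this is not the dissertation's software.

```python
import math

def expit(x):
    """Logistic function e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlm_prob(poe, alpha, beta, gammas=(), covariates=()):
    """Disease probability w_j for one subject.

    With no covariates this is MLM-1. Passing equal-length `gammas` and
    `covariates` gives MLM-2 (continuous values X_rj) or MLM-3
    (0/1 indicators V_mj).
    """
    if len(gammas) != len(covariates):
        raise ValueError("gammas and covariates must have the same length")
    shift = sum(g * v for g, v in zip(gammas, covariates))
    risk_exposed = expit(alpha + beta + shift)
    risk_unexposed = expit(alpha + shift)
    return poe * risk_exposed + (1.0 - poe) * risk_unexposed

# Hypothetical parameter values, for illustration only.
print(mlm_prob(0.5, alpha=-2.0, beta=1.0))
print(mlm_prob(0.5, alpha=-2.0, beta=1.0, gammas=(0.3,), covariates=(1,)))
```

When `poe` is 0 or 1 the function reduces to the usual logistic probabilities, so the odds ratio between a certainly exposed and a certainly unexposed subject with the same covariates is $e^{\beta}$.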

Estimation of MLM Parameters

Considering the above specifications of the MLM, we are interested in devising a method to calculate unconditional maximum-likelihood estimates of the parameters. We begin by considering the application of MLM-1 to a hypothetical set of data. For a given set of $N$ observations, let us assume we have $n_i$ individuals all having the same POE value $\pi_i$, where $i = 0, 1, ..., I$, and $\sum_{i=0}^{I} n_i = N$. Within each of these $(I+1)$ POE groups we observe $x_i$ individuals with the disease of interest and $y_i$ individuals without that disease (as depicted in Table 1.2). For this situation, we cannot obtain model-specific estimates of $\alpha$ and $\beta$ in closed form. We can, however, use numerical methods to obtain estimates for these parameters.

Given one of the above specifications of the MLM, we can estimate the appropriate parameters by using unconditional maximum likelihood procedures. The unconditional likelihood for these data is given as

$$L(\theta; y) = \prod_{j=1}^{N} \left[ w_j^{z_j} (1-w_j)^{1-z_j} \right], \quad \text{where } z_j = \begin{cases} 1 & \text{if individual } j \text{ is diseased} \\ 0 & \text{otherwise} \end{cases} \qquad [6]$$

or

$$\ln L(\theta; y) = \sum_{j=1}^{N} \left[ z_j \ln w_j + (1-z_j) \ln(1-w_j) \right]. \qquad [7]$$

The specification of $w_j$ in the above likelihood is determined by the model being considered. For individual $j$, the data vector $y_j$ contains the POE value $P_j$, a value for the disease indicator variable $z_j$, and the values of the continuous and/or categorical covariates $X_{rj}$ and/or $V_{mj}$. The parameter vector $\theta$ specifies the parameters included in the particular model being considered. The elements of these vectors for models MLM-1, MLM-2, MLM-3, and MLM-4 are shown in Table 1.3.

Table 1.3: Specification of Data and Parameter Vectors for the MLM

    Model    Data Vector                                 Parameter Vector
    MLM-1    y_j' = (P_j, z_j)                           θ' = (α, β)
    MLM-2    y_j' = (P_j, z_j, X_1j, ..., X_Rj)          θ' = (α, β, γ_1, γ_2, ..., γ_R)
    MLM-3    y_j' = (P_j, z_j, V_1j, ..., V_(M-1)j)      θ' = (α, β, γ_1, γ_2, ..., γ_(M-1))
    MLM-4    y_j' = (P_j, z_j, V_1j, ..., V_(M-1)j)      θ' = (α, β, γ_1, ..., γ_(M-1), δ_1, ..., δ_(M-1))

One point to note in the above specifications of the data vectors is that the parameters are estimated using information from each individual in the study. Although the data layout in Table 1.2 and the general approach we present when discussing the MLM consider counts of individuals in various POE categories (for simplicity), the models being proposed apply directly to situations where the POE is a continuous variable.

We use Newton-Raphson and "direct search" iterative estimation algorithms to calculate maximum-likelihood estimates for the appropriate parameters. We also calculate the associated variance estimates for those parameters, and values of the log-likelihood. These algorithms are implemented in the computer program MAXLIK. Details of this program are given in Chapter II.

Specification and Parameter Estimation for Alternative Model 1: The "Usual" Logistic Model Applied To (01+) Data

As an alternative to the proposed MLM, we consider the application of the "usual" logistic model to the data under a recategorization of the POE variable. We create two exposure groups and classify the individuals in the sample into one of these two groups based on their values of $P_j$ as follows. If $P_j > 0.0$, then we classify individual $j$ as $E$ (i.e. $P_j^{+} = 1$). If $P_j = 0.0$, then we classify the individual as $\bar{E}$ (i.e. $P_j^{+} = 0$). We can then display the data in the form of Table 1.4.
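The (01+) recategorization just described amounts to thresholding each individual's POE at zero and tallying a 2x2 table. A minimal sketch follows, assuming the data arrive as hypothetical (POE, diseased) pairs; the function and key names are illustrative only.

```python
def table_01_plus(records):
    """Tally the four cells of the collapsed 2x2 table.

    records: iterable of (poe, diseased) pairs, with diseased True or False.
    Any subject with POE > 0 is classified as exposed (P_j^+ = 1);
    a subject with POE == 0 is classified as unexposed (P_j^+ = 0).
    """
    counts = {"x_plus": 0, "x_0": 0, "y_plus": 0, "y_0": 0}
    for poe, diseased in records:
        row = "x" if diseased else "y"          # x: diseased, y: not diseased
        column = "_plus" if poe > 0.0 else "_0"
        counts[row + column] += 1
    return counts

# Hypothetical records, for illustration only.
example = [(1.0, True), (0.5, False), (0.25, True), (0.0, False), (0.0, True)]
print(table_01_plus(example))
```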

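Table 1.4: Data Layout for Alternative Model 1

            E       Ē
    D       x+      x0      m1
    D̄       y+      y0      m0
            n+      n0      N

where $x_{+} = \sum_{i=1}^{I} x_i$ and $y_{+} = \sum_{i=1}^{I} y_i$ (using the notation specified in Table 1.2).

We see that application of the MLM under this setting leads to the expression

$$w_j^{+} = P_j^{+} \, \frac{e^{\alpha^{+}+\beta^{+}}}{1+e^{\alpha^{+}+\beta^{+}}} + (1-P_j^{+}) \, \frac{e^{\alpha^{+}}}{1+e^{\alpha^{+}}} \qquad [8]$$

and, since $P_j^{+}$ is 0 or 1, we are in the "usual" logistic setting. By specifying the log-likelihood equation as in [7] for the data in Table 1.4, and by then solving the equations

$$\frac{\partial \ln L(\alpha^{+}, \beta^{+}; P^{+}, z)}{\partial \alpha^{+}} = 0 \quad \text{and} \quad \frac{\partial \ln L(\alpha^{+}, \beta^{+}; P^{+}, z)}{\partial \beta^{+}} = 0,$$

we can obtain explicit expressions for the MLE's of $\alpha^{+}$ and $\beta^{+}$, for the value of $\ln L(\hat{\alpha}^{+}, \hat{\beta}^{+}; P^{+}, z)$, and for $\widehat{\mathrm{Var}}(\hat{\beta}^{+})$. These expressions are as follows:

$$\hat{\alpha}^{+} = \ln\left[\frac{x_0}{y_0}\right], \qquad \hat{\beta}^{+} = \ln\left[\frac{x_{+}\, y_0}{x_0\, y_{+}}\right], \qquad [9, 10]$$

$$\ln L(\hat{\alpha}^{+}, \hat{\beta}^{+}; P^{+}, z) = x_{+} \ln\left[\frac{x_{+}}{x_{+}+y_{+}}\right] + x_0 \ln\left[\frac{x_0}{x_0+y_0}\right] + y_{+} \ln\left[\frac{y_{+}}{x_{+}+y_{+}}\right] + y_0 \ln\left[\frac{y_0}{x_0+y_0}\right], \qquad [11]$$

and

$$\widehat{\mathrm{Var}}(\hat{\beta}^{+}) = \frac{1}{x_{+}} + \frac{1}{x_0} + \frac{1}{y_{+}} + \frac{1}{y_0}. \qquad [12]$$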
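For a 2x2 layout such as Table 1.4, the usual logistic model with a single 0/1 exposure indicator has well-known closed-form maximum-likelihood results: the intercept estimate is the log odds of disease among the unexposed, the slope estimate is the log odds ratio, and the large-sample variance of the slope is the sum of the reciprocal cell counts. The sketch below computes these quantities; the function name and the example counts are hypothetical, and all four cells are assumed to be positive.

```python
import math

def logistic_2x2_mle(x_exp, x_unexp, y_exp, y_unexp):
    """Closed-form MLEs for the usual logistic model on a 2x2 table.

    x_exp, x_unexp: diseased counts among exposed and unexposed subjects.
    y_exp, y_unexp: non-diseased counts among exposed and unexposed subjects.
    All four counts must be positive.
    """
    alpha_hat = math.log(x_unexp / y_unexp)                    # log odds, unexposed
    beta_hat = math.log((x_exp * y_unexp) / (x_unexp * y_exp)) # log odds ratio
    var_beta = 1.0 / x_exp + 1.0 / x_unexp + 1.0 / y_exp + 1.0 / y_unexp
    n_exp = x_exp + y_exp
    n_unexp = x_unexp + y_unexp
    loglik = (x_exp * math.log(x_exp / n_exp)
              + y_exp * math.log(y_exp / n_exp)
              + x_unexp * math.log(x_unexp / n_unexp)
              + y_unexp * math.log(y_unexp / n_unexp))
    return alpha_hat, beta_hat, var_beta, loglik

# Hypothetical counts, for illustration only.
alpha_hat, beta_hat, var_beta, loglik = logistic_2x2_mle(20, 10, 80, 90)
print(round(alpha_hat, 4), round(beta_hat, 4), round(var_beta, 4), round(loglik, 4))
```

The same function applies to any of the 2x2 layouts in this chapter by passing the corresponding cell counts.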

This alternative model is often used "implicitly" when individuals are designated $E$ because they have an attribute which serves as a surrogate for the true exposure of interest. If the surrogate is imperfect, we will have an "exposed" group that contains several individuals that are not truly exposed to the agent of interest. Using the A-T data as an example, we may classify individuals as $E$ if they are related to the A-T proband, and as $\bar{E}$ if they are the spouse controls. The "exposed" group would then contain some individuals with the A-T gene and many without the A-T gene. Since "relationship-to-proband" is not a perfect predictor of the "presence of the A-T gene", there will be some misclassification error.

In the example given with the A-T data, we know that using relationship-to-proband as a risk factor would not be a reasonable approach, considering the underlying genetic mechanisms that control inheritance of traits and transfer of genetic information. However, in other research settings where such mechanisms are not so well defined, we make similar types of assumptions so that we can employ the (usual) logistic model. In these situations, we are probably operating quite often under the setting of this alternative model.

Specification and Parameter Estimation for Alternative Model 2: The "Usual" Logistic Model Applied To (01) Data

Another alternative to implementation of the proposed MLM is to consider only a subset of the data in deriving parameter estimates and making statistical inferences about E-D relationships. Specifically, we may consider ignoring the data where exposure is uncertain and using only data for individuals with $P_j = 0$ or $P_j = 1$ (i.e. data only for individuals who are known to be truly exposed and for those known to be truly unexposed). The data layout for such a model is depicted in Table 1.5. Note that this table is merely the 2x2 subtable formed using the first two rows and columns of Table 1.2.
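A small sketch of the (01) subsetting step, which also reports how much of the sample is retained. The record layout, the function name, and the example values are assumptions made only for illustration.

```python
def subset_01(records):
    """Keep only subjects whose exposure status is certain (POE of 0 or 1).

    records: list of (poe, diseased) pairs.
    Returns the retained records and the fraction of the sample retained.
    """
    kept = [(poe, diseased) for poe, diseased in records if poe in (0.0, 1.0)]
    retained = len(kept) / len(records) if records else 0.0
    return kept, retained

# Hypothetical records, for illustration only.
example = [(1.0, True), (0.5, False), (0.25, True), (0.0, False), (0.0, True)]
kept, retained = subset_01(example)
print(kept, retained)
```

The retained fraction makes the cost of this alternative visible: every subject with an intermediate POE value is dropped before the usual logistic model is fitted.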

Table 1.5: Data Layout for Alternative Model 2

            E       Ē
    D       x1      x0      m1^01
    D̄       y1      y0      m0^01
            n1      n0      N^01

We would expect the parameter estimates obtained from this model to be statistically reliable only when the four cell counts are large. We would not generally expect this model to perform well, since we will typically be disregarding a considerable amount of the data. For the analysis setting in which we are particularly interested (analysis of the A-T data), we would actually be ignoring a majority of the data collected. The application of the MLM to this setting leads to the formulation

$$w_j^{01} = P_j^{01} \, \frac{e^{\alpha^{01}+\beta^{01}}}{1+e^{\alpha^{01}+\beta^{01}}} + (1-P_j^{01}) \, \frac{e^{\alpha^{01}}}{1+e^{\alpha^{01}}}. \qquad [13]$$

As with Alternative Model 1 [8], the above model [13] is equivalent to a specification of the "usual" logistic model, since $P_j^{01} = 0$ when individual $j$ has a value of $P_j = 0$ and $P_j^{01} = 1$ when $P_j = 1$ for the $j$th individual in Table 1.5. As before, we can give explicit expressions for the MLE's of $\alpha^{01}$ and $\beta^{01}$, and for $\ln L(\hat{\alpha}^{01}, \hat{\beta}^{01}; P^{01}, z)$ and $\widehat{\mathrm{Var}}(\hat{\beta}^{01})$. In particular, we have

$$\hat{\alpha}^{01} = \ln\left[\frac{x_0}{y_0}\right], \qquad \hat{\beta}^{01} = \ln\left[\frac{x_1\, y_0}{x_0\, y_1}\right], \qquad [14, 15]$$

$$\ln L(\hat{\alpha}^{01}, \hat{\beta}^{01}; P^{01}, z) = x_1 \ln\left[\frac{x_1}{x_1+y_1}\right] + x_0 \ln\left[\frac{x_0}{x_0+y_0}\right] + y_1 \ln\left[\frac{y_1}{x_1+y_1}\right] + y_0 \ln\left[\frac{y_0}{x_0+y_0}\right], \qquad [16]$$

and

$$\widehat{\mathrm{Var}}(\hat{\beta}^{01}) = \frac{1}{x_1} + \frac{1}{x_0} + \frac{1}{y_1} + \frac{1}{y_0}. \qquad [17]$$

Specification and Parameter Estimation for the "Gold Standard" Logistic Model: The "Usual" Logistic Model Applied To Data with Known Exposure Status

If we know the true exposure status of each individual, then application of the usual logistic model, specified as

$$w_j^{g} = \frac{e^{\alpha^{g}+\beta^{g} E_j}}{1+e^{\alpha^{g}+\beta^{g} E_j}}, \quad E_j = \begin{cases} 1 & \text{if individual } j \text{ is exposed} \\ 0 & \text{otherwise} \end{cases} \quad j = 1, ..., N, \qquad [18]$$

would yield reliable estimates of the E-D relationship (given relatively large $N$). We will consider this model when we examine the simulated data. Since we will know the true exposure status for all individuals in the simulated data, we can apply this "usual" logistic model [18]. The resulting parameter estimates can then be informatively contrasted with estimates obtained from the other models, which utilize POE information. Since this model [18] involves no misclassification error for the exposure variable, we fully expect it to be more statistically reliable than any of the models that involve potential exposure misclassification.

It should be emphasized that the "usual" logistic model [18] can only be applied if we know each individual's true exposure status. For the data we are considering, this will not be the situation. Thus, the "usual" logistic model [18] cannot actually be used to analyze the data under consideration. It is included as a "gold standard" model against which we may compare the performance of the other proposed models.

The data layout for this model is the 2x2 table specified in Table 1.1. For this "gold standard" model, we specify the parameters as $\alpha^{g}$ and $\beta^{g}$. For the data in Table 1.1, expressions for the MLE's of $\alpha^{g}$ and $\beta^{g}$, and for $\ln L(\hat{\alpha}^{g}, \hat{\beta}^{g}; E, z)$ and $\widehat{\mathrm{Var}}(\hat{\beta}^{g})$, are given below.

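$$\hat{\alpha}^{g} = \ln\left[\frac{b}{d}\right], \qquad \hat{\beta}^{g} = \ln\left[\frac{a\,d}{b\,c}\right], \qquad [19, 20]$$

$$\ln L(\hat{\alpha}^{g}, \hat{\beta}^{g}; E, z) = a \ln\left[\frac{a}{a+c}\right] + b \ln\left[\frac{b}{b+d}\right] + c \ln\left[\frac{c}{a+c}\right] + d \ln\left[\frac{d}{b+d}\right], \qquad [21]$$

and

$$\widehat{\mathrm{Var}}(\hat{\beta}^{g}) = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}. \qquad [22]$$

The MLM Under POE Recategorizations C and D

For situations where we define POE categories from continuous or discrete measures of POE, selection of the category specifications may influence the estimation properties of the MLM. As will be discussed later (Section 1.4 and Chapter III), categories that contain too few individuals (especially those with disease) may make reliable parameter estimation more difficult. We therefore consider the application of the MLM to data which have alternative specifications of the POE categories. First, we consider a model which combines individuals from a POE category with few observations into the next higher POE category, and then we consider a model that moves these individuals into the next lower POE category.

Let us specify $P_j^{C}$ as a new POE measure for individuals $j = 1, 2, ..., N$. Let us also assume that there are very few individuals with $P_j = \pi_2$ and $P_j = \pi_4$. We may then assign values to $P_j^{C}$ as follows:

    P_j^C = P_j   if individual j has a value of P_j equal to π0, π1, π3, π5, π6, ..., πI;
    P_j^C = π1    if individual j has a value of P_j equal to π2; and,
    P_j^C = π3    if individual j has a value of P_j equal to π4.

Recall our earlier assumption that $\pi_1 > \pi_2 > ... > \pi_I > \pi_0$, with $\pi_1 = 1.0$ and $\pi_0 = 0.0$. Then, we see that the $P_j^{C}$ values are determined by recategorizing individuals with POE values of $\pi_2$ and $\pi_4$ into their next higher POE groups, namely, the groups with POE values of $\pi_1$ and $\pi_3$, respectively. If we fit MLM-1 to these data, we have the representation

$$w_j^{C} = P_j^{C} \, \frac{e^{\alpha^{C}+\beta^{C}}}{1+e^{\alpha^{C}+\beta^{C}}} + (1-P_j^{C}) \, \frac{e^{\alpha^{C}}}{1+e^{\alpha^{C}}}. \qquad [23]$$

We compute the MLE's of $\alpha^{C}$ and $\beta^{C}$, and other associated statistics, by considering the following likelihood equations:

$$L(\alpha^{C}, \beta^{C}; P^{C}, z) = \prod_{j=1}^{N} \left[ (w_j^{C})^{z_j} (1-w_j^{C})^{1-z_j} \right] \qquad [24]$$

or

$$\ln L(\alpha^{C}, \beta^{C}; P^{C}, z) = \sum_{j=1}^{N} \left[ z_j \ln w_j^{C} + (1-z_j) \ln(1-w_j^{C}) \right], \qquad [25]$$

where $z_j = 1$ if individual $j$ is diseased and $z_j = 0$ otherwise. The appropriate MLEs ($\hat{\alpha}^{C}$ and $\hat{\beta}^{C}$), variance and covariance estimates ($\widehat{\mathrm{Var}}(\hat{\alpha}^{C})$, $\widehat{\mathrm{Var}}(\hat{\beta}^{C})$, and $\widehat{\mathrm{Cov}}(\hat{\alpha}^{C}, \hat{\beta}^{C})$), and $\ln L(\hat{\alpha}^{C}, \hat{\beta}^{C}; P^{C}, z)$ are calculated using MAXLIK.

Similarly, we can consider recategorizing individuals in POE groups $\pi_2$ and $\pi_4$ into their next lower POE groups. We do this by specifying $P_j^{D}$ as follows:

    P_j^D = P_j   if individual j has a value of P_j equal to π0, π1, π3, π5, π6, ..., πI;
    P_j^D = π3    if individual j has a value of P_j equal to π2; and,
    P_j^D = π5    if individual j has a value of P_j equal to π4.

We then specify

$$w_j^{D} = P_j^{D} \, \frac{e^{\alpha^{D}+\beta^{D}}}{1+e^{\alpha^{D}+\beta^{D}}} + (1-P_j^{D}) \, \frac{e^{\alpha^{D}}}{1+e^{\alpha^{D}}} \qquad [26]$$

and

$$\ln L(\alpha^{D}, \beta^{D}; P^{D}, z) = \sum_{j=1}^{N} \left[ z_j \ln w_j^{D} + (1-z_j) \ln(1-w_j^{D}) \right]. \qquad [27]$$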
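The recategorization and fitting steps above can be sketched on grouped data. MAXLIK itself is not reproduced here: the maximizer below is a simple compass-style direct search, used only as a stand-in for the Newton-Raphson and "direct search" algorithms the text mentions. All function names and counts are hypothetical, and the sketch assumes every POE group has a fitted probability strictly between 0 and 1.

```python
import math

def expit(x):
    """Logistic function e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))

def recategorize(groups, mapping):
    """Pool POE groups: each (pi, x, y) whose pi is a key of `mapping`
    is merged into the group with POE value mapping[pi]."""
    merged = {}
    for pi, x, y in groups:
        target = mapping.get(pi, pi)
        cx, cy = merged.get(target, (0, 0))
        merged[target] = (cx + x, cy + y)
    return [(pi, x, y) for pi, (x, y) in merged.items()]

def mlm1_loglik(alpha, beta, groups):
    """MLM-1 log-likelihood for grouped data (pi_i, x_i diseased, y_i not)."""
    risk_exposed = expit(alpha + beta)
    risk_unexposed = expit(alpha)
    total = 0.0
    for pi, x, y in groups:
        w = pi * risk_exposed + (1.0 - pi) * risk_unexposed
        total += x * math.log(w) + y * math.log(1.0 - w)
    return total

def direct_search(groups, start=(0.0, 0.0), step=1.0, tol=1e-6, max_iter=10000):
    """Compass search: try a step along each axis, halve the step on failure."""
    alpha, beta = start
    best = mlm1_loglik(alpha, beta, groups)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for da, db in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            value = mlm1_loglik(alpha + da, beta + db, groups)
            if value > best:
                alpha, beta, best = alpha + da, beta + db, value
                improved = True
        if not improved:
            step /= 2.0
    return alpha, beta, best

# Hypothetical grouped counts (pi_i, x_i, y_i); not data from the dissertation.
groups = [(1.0, 12, 38), (0.75, 2, 3), (0.5, 30, 170), (0.0, 25, 475)]
groups_c = recategorize(groups, {0.75: 1.0})  # sparse group moved to next higher POE
groups_d = recategorize(groups, {0.75: 0.5})  # sparse group moved to next lower POE
for label, g in (("C", groups_c), ("D", groups_d)):
    alpha_hat, beta_hat, loglik = direct_search(g)
    print(label, round(alpha_hat, 3), round(beta_hat, 3), round(loglik, 3))
```

A direct search is slower than Newton-Raphson but needs no derivatives, which keeps the sketch short; it does not produce the variance and covariance estimates that the text obtains from MAXLIK.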

Effects of POE Error on Parameter Estimates in the MLM

Since we are considering the inclusion of information on POE in estimating E-D relationships, we must examine how parameter estimates and subsequent inferences are affected by error in this information. This error can be attributed to error in measurement of the POE variable, or to sampling error associated with taking finite samples from a population. The measurement error setting is of interest if the POE measure is not known but is instead estimated by some model. In this research, we assume that the levels of the POE variable ($P_j$) are known without error for each individual ($j$). We then concentrate on the examination of POE "misclassification" due to sampling error. This approach is more relevant to the examination of the A-T data. The measurement error situation will be a focus of future research.

The general concept of error or "misclassification" of the POE variable involves several underlying issues that need to be considered. In presenting these issues, we begin by clarifying the distinction between POE and exposure status. In the dichotomous exposure situation (i.e. exposed or not exposed), the $j$th individual drawn at random from a population will have one of two levels of exposure status ($E_j$), either exposed ($E_j = 1$) or not exposed ($E_j = 0$). We typically label these exposure status categories as $E$ for "exposed" and $\bar{E}$ for "not exposed". (This notation is not to be confused with the dichotomous random variable $E_j$, which indicates the exposure status for individual $j$.) A given individual ($j$) may, however, have an associated POE ($P_j$) equal to any value between 0 and 1 (if we assume POE takes on values from a continuous scale), or any one of $(I+1)$ possible POE values if we know that only certain (discrete) realizations of the POE variable are possible. To simplify notation and the concepts we are discussing, we only consider the discrete case. We then specify these $(I+1)$ possible values of the POE measure (i.e. POE categories) as $\pi_0, \pi_1, ..., \pi_I$.

To illustrate the concept of "misclassification" of the POE variable or, specifically, sampling error associated with the realization of the POE measures, let us consider the following situation. A group of $n_i$ individuals is drawn at random from a population containing individuals that are either truly exposed or truly unexposed to some agent or condition. Assume that these $n_i$ individuals all have the same POE value, given as $P_j = \pi_i$ for $j = 1, 2, \ldots, n_i$. Of these, we expect to have $n_i\pi_i$ exposed individuals and $n_i(1-\pi_i)$ unexposed individuals in the sample (for $i = 0, 1, 2, \ldots, I$). However, we would generally observe $n_i\pi_i'$ and $n_i(1-\pi_i')$ exposed and unexposed individuals, respectively, in the sample, where $\pi_i' \neq \pi_i$. The smaller the value of $n_i$, the more the sample would tend to disagree with the expected numbers of exposed and unexposed individuals.

Although we refer to this situation as "misclassification" of the POE variable, we should note that it is not strictly analogous to the usual exposure misclassification setting. In exposure misclassification, there is error in categorizing individuals into an exposure category. The POE variable introduces a further level of complexity. For example, if an individual ($j$) is drawn at random from the subpopulation of all individuals in the POE category $\pi_3 = 0.50$, the probability of misclassifying that individual as "exposed" when he or she is actually "not exposed" is equal to 0.50. The POE variable $P_j$ is, in this situation, an accurate measure of misclassification probability for that individual. If, however, the individual was not drawn from the entire subpopulation, but instead was drawn from a sample of the subpopulation in which (due to sampling variation) only 40% of all the individuals are truly exposed, then the true misclassification probability for each individual in that group should be 0.60. In this situation, $P_j$ is not an accurate measure of misclassification probability. Error in the POE variable is therefore more properly identified as error in assigned misclassification probabilities, which indirectly relates to error in exposure status classification.

In order to illustrate the effects of such misclassification on the estimates of the MLM parameters, we apply MLM-1 to three sets of contrived data. For these three analyses, no covariates are considered.

In the first analysis (Analysis 1), we apply MLM-1 to a relatively small sample of individuals ($N = 160$), all having known POE values. For simplicity, we restrict each individual to have one of five possible POE values ($\pi_0$, $\pi_1$, $\pi_2$, $\pi_3$, and $\pi_4$). In this analysis, we assume no POE-related sampling error. We also select $n_i$ so that application of the designated exposure probabilities and disease probabilities results in whole (observed) cell counts, thus eliminating computational round-off errors. This "ideal" analysis situation is created to examine the model's performance under the best of circumstances. Estimation problems in this setting would surely force us to reconsider the basic model specification.

The study sample for Analysis 1 is defined by specifying the numbers of individuals in the five POE groups: $n_0 = 80$, $n_1 = 15$, $n_2 = 15$, $n_3 = 30$, $n_4 = 20$; the disease probabilities in the population: $\Pr(D|E) = 0.60$, $\Pr(D|\bar{E}) = 0.20$ (and hence the values of the "true" OR and $\beta$ as 6.00 and 1.7918, respectively); and the probability values associated with the POE groups: $\pi_0 = 0.0$, $\pi_1 = 1.0$, $\pi_2 = 0.67$, $\pi_3 = 0.50$, $\pi_4 = 0.25$.

The hypothetical data generated for Analysis 1 are detailed in Figure 1.1. Table 1.6 depicts the layout of these data in the form of a $2 \times (I+1)$ table. Table 1.7 shows counts of individuals by disease status and true exposure status. Below Table 1.7 we show the results from fitting MLM-1 to these data. The matrix $\hat{\Sigma}$ is the estimated variance-covariance matrix for $(\hat{\alpha}, \hat{\beta})$. The standard errors reported are simply the square roots of the appropriate elements of $\hat{\Sigma}$. These results are obtained by implementation of the MAXLIK computer program.

We see from Table 1.7 that $\hat{\beta}^g = 1.7918$ ($\widehat{OR}^g = 6.00$) exactly agrees with the true value of $\beta$ (and OR) calculated from the known population disease probabilities. The value of $\beta$ (and OR) estimated by applying MLM-1 to the data, given as $\hat{\beta} = 1.7899$ ($\widehat{OR} = 5.99$), is also equal to the known value of $\beta$ (except for numerical roundoff error and limited precision of the computer program).

For the purpose of comparison, we calculate the estimate of $\beta$ under Alternative Model 1 [8] for these analyses. Alternative Model 2 [13] is not informative in these analyses since we have constrained the counts in $\pi_1$ and $\pi_0$ to exactly reflect the population probabilities. Other categorizations of POE are also not considered for these analyses. Applying Alternative Model 1, where we classify those individuals with $P_j > 0.0$ as exposed and the individuals with $P_j = 0.0$ as unexposed, we generate Table 1.8.

In contrast to the estimate of $\beta$ obtained from application of MLM-1, we see for these data that $\hat{\beta}^{+}$ is biased toward the null (an odds ratio of one). This is what would be expected in a nondifferential misclassification situation where we misclassify truly unexposed individuals as "exposed". In fact, by recategorizing the individuals from the five POE groups into two exposure groups, we are introducing exactly such a bias. For this "ideal" setting, the estimates given by the MLM are less biased than those from Alternative Model 1.
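The comparison between the true-exposure layout and the dichotomized layout of Alternative Model 1 can be reproduced from the Analysis 1 specification alone. The sketch below is illustrative rather than the MAXLIK computation: the helper names are hypothetical, and the cell counts are generated by rounding the products of the stated group sizes and probabilities to whole numbers, as the text describes.

```python
import math

# Analysis 1 specification, as stated in the text.
n = {0: 80, 1: 15, 2: 15, 3: 30, 4: 20}            # group sizes n_i
pi = {0: 0.0, 1: 1.0, 2: 0.67, 3: 0.50, 4: 0.25}   # POE values pi_i
P_D_EXPOSED, P_D_UNEXPOSED = 0.60, 0.20            # Pr(D|E), Pr(D|not E)

def expected_cells(n_i, pi_i):
    """Whole-number counts: (exposed D, exposed not-D, unexposed D, unexposed not-D)."""
    exposed = round(n_i * pi_i)
    unexposed = n_i - exposed
    exp_d = round(exposed * P_D_EXPOSED)
    unexp_d = round(unexposed * P_D_UNEXPOSED)
    return exp_d, exposed - exp_d, unexp_d, unexposed - unexp_d

cells = {i: expected_cells(n[i], pi[i]) for i in n}

def log_odds_ratio(a, b, c, d):
    """ln OR for a 2x2 table: a, b = D, not-D (exposed); c, d = D, not-D (unexposed)."""
    return math.log((a * d) / (b * c))

# Layout by true exposure status.
a = sum(cell[0] for cell in cells.values())
b = sum(cell[1] for cell in cells.values())
c = sum(cell[2] for cell in cells.values())
d = sum(cell[3] for cell in cells.values())
beta_true = log_odds_ratio(a, b, c, d)

# Alternative Model 1: everyone with P_j > 0 is treated as exposed.
a_plus = sum(cell[0] + cell[2] for i, cell in cells.items() if pi[i] > 0)
b_plus = sum(cell[1] + cell[3] for i, cell in cells.items() if pi[i] > 0)
c_plus = sum(cell[0] + cell[2] for i, cell in cells.items() if pi[i] == 0)
d_plus = sum(cell[1] + cell[3] for i, cell in cells.items() if pi[i] == 0)
beta_plus = log_odds_ratio(a_plus, b_plus, c_plus, d_plus)

print(f"true-exposure table: {a} {b} {c} {d}, beta = {beta_true:.4f}")
print(f"dichotomized table:  {a_plus} {b_plus} {c_plus} {d_plus}, beta+ = {beta_plus:.4f}")
```

Because the dichotomized "exposed" row pools truly unexposed individuals with the truly exposed, its log odds ratio falls below the value obtained from the true-exposure table.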

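Figure 1.1: Data for Analysis 1

($\pi_1 = 1.0$)   $n_1 = 15$:   15 $E$ (9 $D$, 6 $\bar{D}$);    0 $\bar{E}$
($\pi_2 = 0.67$)  $n_2 = 15$:   10 $E$ (6 $D$, 4 $\bar{D}$);    5 $\bar{E}$ (1 $D$, 4 $\bar{D}$)
($\pi_3 = 0.50$)  $n_3 = 30$:   15 $E$ (9 $D$, 6 $\bar{D}$);   15 $\bar{E}$ (3 $D$, 12 $\bar{D}$)
($\pi_4 = 0.25$)  $n_4 = 20$:    5 $E$ (3 $D$, 2 $\bar{D}$);   15 $\bar{E}$ (3 $D$, 12 $\bar{D}$)
($\pi_0 = 0.0$)   $n_0 = 80$:    0 $E$;                        80 $\bar{E}$ (16 $D$, 64 $\bar{D}$)

Table 1.6: POE by Disease Status for Analysis 1

             Probability of Exposure Category
             $\pi_1$   $\pi_0$   $\pi_2$   $\pi_3$   $\pi_4$
$D$             9        16         7        12         6
$\bar{D}$       6        64         8        18        14

Table 1.7: Exposure by Disease Status for Analysis 1

             $D$    $\bar{D}$
$E$           27       18
$\bar{E}$     23       92

RESULTS FROM MLM-1: $\hat{\beta} = 1.7899$, $\widehat{OR} = 5.99$. The remaining entries (s.e.($\hat{\beta}$), $\hat{\alpha}$, s.e.($\hat{\alpha}$), $\hat{\Sigma}$, and $\ln L(\hat{\alpha}, \hat{\beta}; P, z)$) are not legible in the source.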

Table 1.8: Application of Alternative Model 1 to Analysis 1 Data

             $D$    $\bar{D}$
$E$           34       46
$\bar{E}$     16       64

$\hat{\beta}^{+} = 1.0840$

The second and third analyses examine data containing POE-related sampling error. We examine a data set that is generated with the same specifications as those used in the prior analysis (i.e., the same values of $n_i$, $\Pr(D|E)$, $\Pr(D|\bar{E})$, and designated levels of $\pi_i$), but which has fewer truly exposed individuals in some POE groups than are expected based on the $\pi_i$ values associated with those groups (Analysis 2). The converse situation, where there are a greater number of truly exposed individuals in some POE groups than expected based on the values of $\pi_i$ for those groups (Analysis 3), is also examined.

In Analysis 2, POE-related sampling error is present in two of the five POE groups. Of the 15 individuals sampled from POE group 2, only 5 are truly exposed; however, we would expect 10 to be truly exposed if there were no sampling error (as in Analysis 1). Also, in POE group 3, only 10 individuals are truly exposed. Since $\pi_3 = 0.50$, we would expect 15 of the 30 individuals in this group to be exposed. Using the notation specified earlier, the above situation may be stated as

$$\pi_0' = \pi_0, \quad \pi_1' = \pi_1, \quad \pi_2' < \pi_2, \quad \pi_3' < \pi_3, \quad \text{and} \quad \pi_4' = \pi_4.$$

A summary of the true exposure-disease status and the specification of the values of $\pi_i'$ in this sample are depicted in Figure 1.2. As we see from Table 1.10, the cell counts comprising this table differ from those of Analysis 1 (Table 1.7); however, the true $\beta$ still equals 1.7918 (OR = 6.0).
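The realized proportions and the unchanged true odds ratio for Analysis 2 follow from simple arithmetic on the counts described above. The totals below assume that the disease probabilities 0.60 and 0.20 apply exactly within the 35 truly exposed and 125 truly unexposed individuals, which is how the contrived data appear to be built; they should be checked against Figure 1.2.

```latex
\begin{aligned}
\pi_2' &= \tfrac{5}{15} \approx 0.33 < \pi_2 = 0.67, \qquad
\pi_3' = \tfrac{10}{30} \approx 0.33 < \pi_3 = 0.50, \\
\text{exposed: } & 15 + 5 + 10 + 5 = 35 \;\Rightarrow\; 21\ D,\ 14\ \bar{D}, \\
\text{unexposed: } & 0 + 10 + 20 + 15 + 80 = 125 \;\Rightarrow\; 25\ D,\ 100\ \bar{D}, \\
OR &= \frac{21 \times 100}{14 \times 25} = 6.0, \qquad \beta = \ln 6 \approx 1.7918 .
\end{aligned}
```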

The estimate of $\beta$ under MLM-1 for Analysis 2 corresponds to $\widehat{OR} = 4.49$ (Figure 1.2), which is biased towards the null. Under Alternative Model 1 we get $\hat{\beta}^{+} = 0.8755$, which is more biased than the estimate obtained from MLM-1. Again, these biases agree with what we would expect in a situation where we have nondifferential misclassification of the exposure variable.

In Analysis 3, POE-related sampling error is present in POE groups 2 and 3, but in a direction opposite to that in Analysis 2. For POE group 2, there are 15 individuals truly exposed, whereas we would expect only 10 to be exposed if there were no sampling error. In POE group 3, there are 20 instead of 15 individuals truly exposed. As before, using $\pi_i'$ to indicate the proportion of exposed individuals in this sample, we specify

$$\pi_0' = \pi_0, \quad \pi_1' = \pi_1, \quad \pi_2' > \pi_2, \quad \pi_3' > \pi_3, \quad \text{and} \quad \pi_4' = \pi_4.$$

A summary of these data is presented in Figure 1.3 and Tables 1.11 and 1.12. When MLM-1 is fitted to these data, we obtain $\hat{\beta} = 2.0669$, reflecting a bias away from the null and an overestimate of the odds ratio attributable to exposure. Under Alternative Model 1, $\hat{\beta}^{+} = 1.2862$, which is biased towards the null. We note, however, that the magnitude of the bias for the dichotomized treatment of the data is greater than the bias using the MLM.

Analysis 3 illustrates a curious property of the MLM. The data are constructed in a manner that seems to follow a pattern of nondifferential misclassification of the POE variable, but the estimate obtained from the MLM does not behave as we would expect in such a situation. The fact that $\hat{\beta}$ is less biased than $\hat{\beta}^{+}$ is encouraging, but the anticonservative direction of the bias may be of concern. What we have called "misclassification" in the POE measure, as it applies to the MLM, may actually be a different phenomenon.
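The direction of the three MLM-1 estimates can be explored numerically. The sketch below rests on an assumption that is not confirmed by this section: that MLM-1 (equation [1], not reproduced here) takes the mixture form $\Pr(D \mid P_j) = P_j\,p_1 + (1 - P_j)\,p_0$, with $p_0 = \text{expit}(\alpha)$ and $p_1 = \text{expit}(\alpha + \beta)$. It also substitutes a simple EM iteration for MAXLIK, and the function name `fit_mlm1` is hypothetical. The grouped counts (POE value, diseased, not diseased) are the column totals implied by the three contrived data sets.

```python
import math

def fit_mlm1(groups, iters=5000):
    """EM estimate of (alpha, beta) under the assumed mixture form of MLM-1.

    groups: list of (pi, diseased, not_diseased), one tuple per POE category.
    """
    p0, p1 = 0.3, 0.5  # arbitrary interior starting values
    for _ in range(iters):
        exp_d = exp_nd = unexp_d = unexp_nd = 0.0
        for pi, d, nd in groups:
            # E-step: posterior probability of true exposure given disease status.
            w_d = pi * p1 / (pi * p1 + (1 - pi) * p0)
            w_nd = pi * (1 - p1) / (pi * (1 - p1) + (1 - pi) * (1 - p0))
            exp_d += d * w_d
            unexp_d += d * (1 - w_d)
            exp_nd += nd * w_nd
            unexp_nd += nd * (1 - w_nd)
        # M-step: disease proportions among the expected exposed and unexposed.
        p1 = exp_d / (exp_d + exp_nd)
        p0 = unexp_d / (unexp_d + unexp_nd)
    alpha = math.log(p0 / (1 - p0))
    beta = math.log(p1 / (1 - p1)) - alpha
    return alpha, beta

data = {
    "Analysis 1": [(0.0, 16, 64), (1.0, 9, 6), (0.67, 7, 8), (0.50, 12, 18), (0.25, 6, 14)],
    "Analysis 2": [(0.0, 16, 64), (1.0, 9, 6), (0.67, 5, 10), (0.50, 10, 20), (0.25, 6, 14)],
    "Analysis 3": [(0.0, 16, 64), (1.0, 9, 6), (0.67, 9, 6), (0.50, 14, 16), (0.25, 6, 14)],
}

betas = {name: fit_mlm1(groups)[1] for name, groups in data.items()}
for name, beta in betas.items():
    print(f"{name}: beta = {beta:.4f}, OR = {math.exp(beta):.2f}")
```

If the assumed form matches equation [1], the fitted odds ratios should fall close to the values reported for the three analyses, with Analysis 2 below and Analysis 3 above the true value of 6.0. EM is used here only because it needs no external optimizer; any maximizer of the same likelihood should give the same estimates.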

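Figure 1.2: Data for Analysis 2

($\pi_1 = 1.0$)   $n_1 = 15$:   15 $E$ (9 $D$, 6 $\bar{D}$);    0 $\bar{E}$                          ($\pi_1' = 1.0$)
($\pi_2 = 0.67$)  $n_2 = 15$:    5 $E$ (3 $D$, 2 $\bar{D}$);   10 $\bar{E}$ (2 $D$, 8 $\bar{D}$)     ($\pi_2' = 0.33$)
($\pi_3 = 0.50$)  $n_3 = 30$:   10 $E$ (6 $D$, 4 $\bar{D}$);   20 $\bar{E}$ (4 $D$, 16 $\bar{D}$)    ($\pi_3' = 0.33$)
($\pi_4 = 0.25$)  $n_4 = 20$:    5 $E$ (3 $D$, 2 $\bar{D}$);   15 $\bar{E}$ (3 $D$, 12 $\bar{D}$)    ($\pi_4' = 0.25$)
($\pi_0 = 0.0$)   $n_0 = 80$:    0 $E$;                        80 $\bar{E}$ (16 $D$, 64 $\bar{D}$)   ($\pi_0' = 0.0$)

Table 1.9: POE by Disease Status for Analysis 2

             Probability of Exposure Category
             $\pi_1$   $\pi_0$   $\pi_2$   $\pi_3$   $\pi_4$
$D$             9        16         5        10         6
$\bar{D}$       6        64        10        20        14

Table 1.10: Exposure by Disease Status for Analysis 2

             $D$    $\bar{D}$
$E$           21       14
$\bar{E}$     25      100

RESULTS FROM MLM-1: $\widehat{OR} = 4.49$. The remaining entries ($\hat{\beta}$, s.e.($\hat{\beta}$), $\hat{\alpha}$, s.e.($\hat{\alpha}$), $\hat{\Sigma}$, and $\ln L(\hat{\alpha}, \hat{\beta}; P, z)$) are not legible in the source.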

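Figure 1.3: Data for Analysis 3

($\pi_1 = 1.0$)   $n_1 = 15$:   15 $E$ (9 $D$, 6 $\bar{D}$);    0 $\bar{E}$                          ($\pi_1' = 1.0$)
($\pi_2 = 0.67$)  $n_2 = 15$:   15 $E$ (9 $D$, 6 $\bar{D}$);    0 $\bar{E}$                          ($\pi_2' = 1.0$)
($\pi_3 = 0.50$)  $n_3 = 30$:   20 $E$ (12 $D$, 8 $\bar{D}$);  10 $\bar{E}$ (2 $D$, 8 $\bar{D}$)     ($\pi_3' = 0.67$)
($\pi_4 = 0.25$)  $n_4 = 20$:    5 $E$ (3 $D$, 2 $\bar{D}$);   15 $\bar{E}$ (3 $D$, 12 $\bar{D}$)    ($\pi_4' = 0.25$)
($\pi_0 = 0.0$)   $n_0 = 80$:    0 $E$;                        80 $\bar{E}$ (16 $D$, 64 $\bar{D}$)   ($\pi_0' = 0.0$)

Table 1.11: POE by Disease Status for Analysis 3

             Probability of Exposure Category
             $\pi_1$   $\pi_0$   $\pi_2$   $\pi_3$   $\pi_4$
$D$             9        16         9        14         6
$\bar{D}$       6        64         6        16        14

Table 1.12: Exposure by Disease Status for Analysis 3

             $D$    $\bar{D}$
$E$           33       22
$\bar{E}$     21       84

RESULTS FROM MLM-1: $\hat{\beta} = 2.0669$, $\widehat{OR} = 7.90$. The remaining entries (s.e.($\hat{\beta}$), $\hat{\alpha}$, s.e.($\hat{\alpha}$), $\hat{\Sigma}$, and $\ln L(\hat{\alpha}, \hat{\beta}; P, z)$) are not legible in the source.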

Hypothesis Testing and Confidence Interval Estimation

After examining the statistical properties of the parameter estimates from the suggested models, our focus turns to the inferential properties associated with these estimates. We apply common inferential procedures using the estimates obtained and examine the behavior of these methods.

The likelihood ratio test is performed to test the null hypothesis $H_0: \beta = 0$, incorporating the various estimators of $\beta$ discussed in Section 1.2 ($\hat{\beta}$, $\hat{\beta}^{+}$, $\hat{\beta}^{01}$, $\hat{\beta}^{g}$, $\hat{\beta}^{c}$, and $\hat{\beta}^{d}$). We define the likelihood ratio statistic as

$$\lambda = \frac{L(\hat{\theta}_0; y)}{L(\hat{\theta}_A; y)}$$

or equivalently

$$-2\ln\lambda = -2\left[\ln L(\hat{\theta}_0; y) - \ln L(\hat{\theta}_A; y)\right].$$

The log-likelihood under the alternative hypothesis, $\ln L(\hat{\theta}_A; y)$, is calculated for each of the suggested models as detailed in equations [7, 11, 16, 21, 25, and 27]. The value of $\ln L(\hat{\theta}_0; y)$ is calculated using the cell counts in the appropriate contingency tables.

We develop the procedure for calculating $\ln L(\hat{\theta}_0; y)$ by first considering the "usual" logistic model and then noting the direct extensions to the MLM and the alternative models. The "usual" logistic model, as specified by equation [18], reduces to

$$\Pr(D) = \frac{1}{1 + \exp(-\alpha_0^g)}$$

when $\beta^g = 0$. We also see that, under $H_0$,

$$L(\alpha_0^g; E, z) = \prod_{j=1}^{N} \left[\frac{e^{\alpha_0^g}}{1 + e^{\alpha_0^g}}\right]^{z_j} \left[\frac{1}{1 + e^{\alpha_0^g}}\right]^{1 - z_j} = \frac{e^{m_1 \alpha_0^g}}{\left(1 + e^{\alpha_0^g}\right)^{m_1 + m_0}}$$ [28]

where $m_1$ and $m_0$ are the numbers of individuals with and without disease, respectively, and $z_j = 1$ if individual $j$ is diseased, $z_j = 0$ otherwise. Setting

$$\frac{\partial \ln L(\alpha_0^g; E, z)}{\partial \alpha_0^g} = 0$$

and solving for $\alpha_0^g$, we specify the MLE of $\alpha_0^g$ as

$$\hat{\alpha}_0^g = \ln\left[\frac{m_1}{m_0}\right].$$

If we examine the MLM (equation [1]), we see that under the null hypothesis the model reduces to the "usual" logistic model. This also holds true for the MLM applied to data under recategorizations C [23] and D [26] and for the model which collapses across all POE groups that are greater than zero (Alternative Model 1 [8]). For these models, the parameter estimate under the null hypothesis is specified, respectively, as

$$\hat{\alpha}_0 = \hat{\alpha}_0^c = \hat{\alpha}_0^d = \hat{\alpha}_0^{+} = \ln\left[\frac{m_1}{m_0}\right].$$ [29]

We then specify (in general terms) the quantities given by equations [30, 31].

We note that the values $m_1$ and $m_0$ are identical for all data layouts except for Alternative Model 2 (Table 1.5). In that model, the parameter estimates are based on consideration of only part of the data. The row totals in Table 1.5 are denoted by $m_1^{01}$ and $m_0^{01}$ to distinguish them from the marginals in the data layout tables for the other suggested models. For Alternative Model 2 we specify

$$\hat{\alpha}_0^{01} = \ln\left[\frac{m_1^{01}}{m_0^{01}}\right]$$ [32]

and equations [33, 34].
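The null-model quantities above can be evaluated directly from table margins. The sketch below is illustrative only: it uses the true-exposure counts implied by the Analysis 1 specification (27 and 18 among the exposed, 23 and 92 among the unexposed), and it relies on the standard fact that the usual logistic model with a single dichotomous exposure reproduces the observed row proportions of a 2x2 table, so its maximized log-likelihood can be written from the cell counts. The function names are hypothetical, and all cells are assumed to be nonzero.

```python
import math

def null_fit(m1, m0):
    """Intercept MLE and maximized log-likelihood under H0: beta = 0."""
    n = m1 + m0
    alpha0 = math.log(m1 / m0)
    loglik0 = m1 * math.log(m1 / n) + m0 * math.log(m0 / n)
    return alpha0, loglik0

def alternative_loglik(a, b, c, d):
    """Maximized log-likelihood of the usual logistic model for a 2x2 table.

    a, b: diseased / not diseased among the exposed;
    c, d: diseased / not diseased among the unexposed.
    """
    loglik = 0.0
    for x, y in ((a, b), (c, d)):
        row_total = x + y
        loglik += x * math.log(x / row_total) + y * math.log(y / row_total)
    return loglik

a, b, c, d = 27, 18, 23, 92              # illustrative true-exposure counts
alpha0, loglik0 = null_fit(a + c, b + d)  # m1 = 50 diseased, m0 = 110 not diseased
loglik_a = alternative_loglik(a, b, c, d)
lr_stat = -2 * (loglik0 - loglik_a)

print(f"alpha0 = {alpha0:.4f}, lnL0 = {loglik0:.4f}, lnLA = {loglik_a:.4f}")
print(f"-2 ln(lambda) = {lr_stat:.3f}")
```

For the MLM and its recategorized versions, only `alternative_loglik` would change: the log-likelihood under the alternative comes from the fitted model rather than from row proportions, while the null fit depends on the data only through $m_1$ and $m_0$.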

For all models except Alternative Model 2, we use equations [29, 30, and 31] to derive

$$\ln L(\hat{\theta}_0; y) = \ln L(\hat{\alpha}_0; P, z) = m_1 \ln\left[\frac{m_1}{N}\right] + m_0 \ln\left[\frac{m_0}{N}\right]$$ [35]

(shown here for $\hat{\theta}_0 = \hat{\alpha}_0$, with obvious extensions for $\hat{\alpha}_0^{+}$, $\hat{\alpha}_0^{g}$, $\hat{\alpha}_0^{c}$, and $\hat{\alpha}_0^{d}$). Similarly, for Alternative Model 2, we have

$$\ln L(\hat{\theta}_0; y) = \ln L(\hat{\alpha}_0^{01}; P, z) = m_1^{01} \ln\left[\frac{m_1^{01}}{N^{01}}\right] + m_0^{01} \ln\left[\frac{m_0^{01}}{N^{01}}\right].$$ [36]

Once we have calculated $\ln L(\hat{\theta}_0; y)$ and $\ln L(\hat{\theta}_A; y)$ for each model, we calculate

$$-2\ln\lambda = -2\left[\ln L(\hat{\theta}_0; y) - \ln L(\hat{\theta}_A; y)\right].$$

We reject the hypothesis $H_0: \beta = 0$ at the $\alpha = .05$ level if $-2\ln\lambda > 3.84$ (the 95th percentile value of a central $\chi^2_1$ variable).

Another approach to studying the inferential properties of the estimates obtained under the suggested models is to examine the structure of the confidence intervals for $\beta$. In general, we specify a $100(1-\alpha)\%$ large-sample confidence interval for $\beta$ as

$$C.I.(\hat{\beta}) = \hat{\beta} \pm z_{1-\alpha/2} \sqrt{\widehat{Var}(\hat{\beta})}.$$ [37]

For the parameter estimates obtained from MAXLIK ($\hat{\beta}$, $\hat{\beta}^{c}$, and $\hat{\beta}^{d}$), $\widehat{Var}(\hat{\beta})$ is the appropriate element of the inverse of the observed information matrix and is given as part of the computer output. For the models based on the logistic model with data layouts in the form of $2 \times 2$ contingency tables (those involving $\hat{\beta}^{+}$, $\hat{\beta}^{01}$, and $\hat{\beta}^{g}$), we can calculate $\widehat{Var}(\hat{\beta})$ explicitly, using the variance estimates specified earlier. Using these estimates (and the estimates from MAXLIK) and equation [37], we calculate $C.I.(\hat{\beta})$, $C.I.(\hat{\beta}^{+})$, $C.I.(\hat{\beta}^{01})$, $C.I.(\hat{\beta}^{g})$, $C.I.(\hat{\beta}^{c})$, and $C.I.(\hat{\beta}^{d})$. For a given model, we reject $H_0: \beta = 0$ if the lower bound of the respective confidence interval exceeds zero.

1.5 Research Outline

The concentration of effort in this research is on examining the properties of the parameter estimates obtained using the Modified Logistic Model. The lack of previous research on models of this type has necessitated a complete consideration of these properties. Since this is the first investigation of these models, we focus on a close examination of basic properties of the simplest forms of the model, with the intention of continuing this research to examine more complex extensions to the model in the future.

The complexity of the equations specifying the model, and the fact that the solutions to these equations may not be specified in closed form, dictate that the approach to examining these models must involve simulation studies. In applying the models under investigation to data that are drawn from hypothetical populations with known parameters, we can closely monitor the behavior of the resulting parameter estimates and obtain an understanding of the factors that affect their properties. Unfortunately, a limitation of the simulation study approach is that not all possible permutations of influencing factors can be examined. Our approach, therefore, is to concentrate on a few of what we believe are the most important factors.

This research is based on analyses of three sets of simulated data. The first simulation set, which we call Simulation Set A, is a set of three simulation studies which vary only by our specification of sample size. In this simulation set, we specify a positive E-D relationship and do not consider covariate information. In Simulation Set B, we perform three simulation studies (again, varying by specification of sample size), in which we sample from a population that has been created with the same POE structure as the population in Simulation Set A, but where there is no relationship between exposure and disease status. Finally, in Simulation Set C, we adapt the models that are being investigated to consider a dichotomous covariate that is acting as a confounder. Analyses are conducted on two sets of simulated data with the same sample sizes and confounding effect of the covariate. The first simulation study in this Simulation Set is based on a sample from a population where there is a positive E-D relationship (the value of $\beta$ is equivalent to that used for Simulation Set A), and the second simulation study specifies no E-D relationship ($\beta = 0$). Details of these simulation studies are given in Chapter II.

After the results of the simulation studies are completely examined (reported in Chapter III), we should have a good understanding of the factors that affect parameter estimation and statistical inferences. At this point, we apply the appropriate models to the A-T data and report the conclusions in conjunction with consideration of the information gained by examination of the simulated data. The details and results of this part of the research are presented in Chapter IV. Extensions to the proposed MLM and target areas for future research are given in Chapter V.

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Matched-Pair Case-Control Studies when Risk Factors are Correlated within the Pairs

Matched-Pair Case-Control Studies when Risk Factors are Correlated within the Pairs International Journal of Epidemiology O International Epidemlologlcal Association 1996 Vol. 25. No. 2 Printed In Great Britain Matched-Pair Case-Control Studies when Risk Factors are Correlated within

More information

TESTS FOR EQUIVALENCE BASED ON ODDS RATIO FOR MATCHED-PAIR DESIGN

TESTS FOR EQUIVALENCE BASED ON ODDS RATIO FOR MATCHED-PAIR DESIGN Journal of Biopharmaceutical Statistics, 15: 889 901, 2005 Copyright Taylor & Francis, Inc. ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543400500265561 TESTS FOR EQUIVALENCE BASED ON ODDS RATIO

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38 BIO5312 Biostatistics Lecture 11: Multisample Hypothesis Testing II Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/8/2016 1/38 Outline In this lecture, we will continue to

More information

Person-Time Data. Incidence. Cumulative Incidence: Example. Cumulative Incidence. Person-Time Data. Person-Time Data

Person-Time Data. Incidence. Cumulative Incidence: Example. Cumulative Incidence. Person-Time Data. Person-Time Data Person-Time Data CF Jeff Lin, MD., PhD. Incidence 1. Cumulative incidence (incidence proportion) 2. Incidence density (incidence rate) December 14, 2005 c Jeff Lin, MD., PhD. c Jeff Lin, MD., PhD. Person-Time

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

Factor Analytic Models of Clustered Multivariate Data with Informative Censoring (refer to Dunson and Perreault, 2001, Biometrics 57, )

Factor Analytic Models of Clustered Multivariate Data with Informative Censoring (refer to Dunson and Perreault, 2001, Biometrics 57, ) Factor Analytic Models of Clustered Multivariate Data with Informative Censoring (refer to Dunson and Perreault, 2001, Biometrics 57, 302-308) Consider data in which multiple outcomes are collected for

More information

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Overview In observational and experimental studies, the goal may be to estimate the effect

More information

Lecture 25. Ingo Ruczinski. November 24, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University

Lecture 25. Ingo Ruczinski. November 24, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University Lecture 25 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University November 24, 2015 1 2 3 4 5 6 7 8 9 10 11 1 Hypothesis s of homgeneity 2 Estimating risk

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 3: Bivariate association : Categorical variables Proportion in one group One group is measured one time: z test Use the z distribution as an approximation to the binomial

More information

Part IV Statistics in Epidemiology

Part IV Statistics in Epidemiology Part IV Statistics in Epidemiology There are many good statistical textbooks on the market, and we refer readers to some of these textbooks when they need statistical techniques to analyze data or to interpret

More information

Correlations with Categorical Data

Correlations with Categorical Data Maximum Likelihood Estimation of Multiple Correlations and Canonical Correlations with Categorical Data Sik-Yum Lee The Chinese University of Hong Kong Wal-Yin Poon University of California, Los Angeles

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Logistic Regression Models for Multinomial and Ordinal Outcomes

Logistic Regression Models for Multinomial and Ordinal Outcomes CHAPTER 8 Logistic Regression Models for Multinomial and Ordinal Outcomes 8.1 THE MULTINOMIAL LOGISTIC REGRESSION MODEL 8.1.1 Introduction to the Model and Estimation of Model Parameters In the previous

More information

Lecture 25: Models for Matched Pairs

Lecture 25: Models for Matched Pairs Lecture 25: Models for Matched Pairs Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture

More information

Ignoring the matching variables in cohort studies - when is it valid, and why?

Ignoring the matching variables in cohort studies - when is it valid, and why? Ignoring the matching variables in cohort studies - when is it valid, and why? Arvid Sjölander Abstract In observational studies of the effect of an exposure on an outcome, the exposure-outcome association

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Equivalence of random-effects and conditional likelihoods for matched case-control studies

Equivalence of random-effects and conditional likelihoods for matched case-control studies Equivalence of random-effects and conditional likelihoods for matched case-control studies Ken Rice MRC Biostatistics Unit, Cambridge, UK January 8 th 4 Motivation Study of genetic c-erbb- exposure and

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Power and sample size calculations for designing rare variant sequencing association studies.

Power and sample size calculations for designing rare variant sequencing association studies. Power and sample size calculations for designing rare variant sequencing association studies. Seunggeun Lee 1, Michael C. Wu 2, Tianxi Cai 1, Yun Li 2,3, Michael Boehnke 4 and Xihong Lin 1 1 Department

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Chapter 12 Comparing Two or More Means

Chapter 12 Comparing Two or More Means 12.1 Introduction 277 Chapter 12 Comparing Two or More Means 12.1 Introduction In Chapter 8 we considered methods for making inferences about the relationship between two population distributions based

More information

Lecture Discussion. Confounding, Non-Collapsibility, Precision, and Power Statistics Statistical Methods II. Presented February 27, 2018

Lecture Discussion. Confounding, Non-Collapsibility, Precision, and Power Statistics Statistical Methods II. Presented February 27, 2018 , Non-, Precision, and Power Statistics 211 - Statistical Methods II Presented February 27, 2018 Dan Gillen Department of Statistics University of California, Irvine Discussion.1 Various definitions of

More information

Marginal, crude and conditional odds ratios

Marginal, crude and conditional odds ratios Marginal, crude and conditional odds ratios Denitions and estimation Travis Loux Gradute student, UC Davis Department of Statistics March 31, 2010 Parameter Denitions When measuring the eect of a binary

More information

Lecture 8: Summary Measures

Lecture 8: Summary Measures Lecture 8: Summary Measures Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 8:

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke BIOL 51A - Biostatistics 1 1 Lecture 1: Intro to Biostatistics Smoking: hazardous? FEV (l) 1 2 3 4 5 No Yes Smoke BIOL 51A - Biostatistics 1 2 Box Plot a.k.a box-and-whisker diagram or candlestick chart

More information

Chapter 2: Describing Contingency Tables - II

Chapter 2: Describing Contingency Tables - II : Describing Contingency Tables - II Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu]

More information

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21 Sections 2.3, 2.4 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 21 2.3 Partial association in stratified 2 2 tables In describing a relationship

More information

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2)







