Logistic Regression Model for Analyzing Extended Haplotype Data


Genetic Epidemiology 15: (1998). Wiley-Liss, Inc.

Sylvan Wallenstein,1* Susan E. Hodge,3 and Ainsley Weston2

1 Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, New York
2 Department of Community Medicine, Mount Sinai School of Medicine, New York, New York
3 New York State Psychiatric Institute and Department of Psychiatry, Columbia University, New York, New York

Recently, there has been increased interest in evaluating extended haplotypes in p53 as risk factors for cancer. An allele-specific polymerase chain reaction (PCR) method, confirmed by restriction analysis, has been used to determine absolute extended haplotypes in diploid genomes. We describe statistical analyses for comparing cases and controls, or comparing different ethnic groups, with respect to haplotypes composed of several biallelic loci, especially in the presence of other covariates. Tests based on cross-tabulating all possible genotypes by disease state can have limited power due to the large number of possible genotypes. Tests based simply on cross-tabulating all possible haplotypes by disease state cannot be extended to account for other variables measured on the individual. We propose imposing an assumption of additivity upon the haplotype-based analysis. This yields a logistic regression in which the outcome is case or control, and the predictor variables include the number of copies (0, 1, or 2) of each haplotype, as well as other explanatory variables. In a case-control study, the model can be constructed so that each coefficient gives the log odds ratio for disease for an individual with a single copy of the suspect haplotype and another copy of the most common haplotype, relative to an individual with two copies of the most common haplotype. We illustrate the method with published data on p53 and breast cancer. The method can also be applied to any polymorphic system, whether multiple alleles at a single locus or multiple haplotypes over several loci.

*Correspondence to: Sylvan Wallenstein, Box 1023, Department of Biomathematical Sciences, Mount Sinai School of Medicine, 1 Gustave Levy Place, New York, NY. E-mail: wallenst@msvax.mssm.edu

Received 10 April 1997; Revised 18 June 1997; Accepted 19 June

Key words: case-control studies; association studies; disease-marker associations; HLA

INTRODUCTION

Recently, there has been increased interest in evaluating haplotypes, rather than alleles at individual loci, as risk factors for disease. For example, the inheritance of three biallelic polymorphisms of p53 has been proposed as a risk factor for various cancers, including colorectal cancer [Själander et al., 1995a] and breast cancer [Själander et al., 1996]. Själander et al. [1995b] conclude their paper by noting that "... extended haplotypes would be more informative [than individual alleles at individual loci] in studies of population differences and associations between p53 germline mutations and cancer." In these reports, the frequencies of the three-locus extended haplotypes were not physically measured. They were obtained by first estimating the frequency of pairwise haplotypes for each of the three combinations, using the frequency distribution in homozygotes, and then estimating three-way combinations from these two-way frequencies.

Recently, Weston et al. [1997] developed an allele-specific PCR protocol, confirmed by restriction analysis, to determine absolute extended p53 haplotypes directly in diploid genomes. They compared breast cancer cases with controls in three different ethnic groups with respect to these extended haplotypes.

We propose a logistic regression procedure for evaluating differences between two or more groups with respect to extended haplotype distributions, and show that it has advantages over simple cross-tabulation of genotypes or haplotypes.
The methodology is similar to that described by Smouse and Williams [1982] in a slightly different context, but the current paper also describes optimality properties of the test, discusses estimation in addition to testing, and allows for additional predictor variables, thus paving the way for studies of gene-environment interactions. We discuss application to both cohort studies (e.g., different ethnic groups) and case-control studies.

METHODS

We denote a haplotype based on $m$ biallelic marker loci as $x$, where each $x$ is a row vector of length $m$ composed of 1's and 2's. For concreteness and ease of exposition, we let 1 denote the more common (major) allele at each locus. Thus, for example, for $m = 3$ there are eight possible haplotypes: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 2-1-1, 2-1-2, 2-2-1, and 2-2-2. Let $H$ denote the number of haplotypes under consideration. The maximum value of $H$ is $2^m$; however, in a particular example $H$ may be smaller than $2^m$, either because not all haplotypes occur biologically or are observed, or because we choose to combine two or more rarely observed haplotypes into a category designated "other."

We denote the genotype, i.e., the pair of haplotypes for an individual, as $(x;y)$. For example, when $m = 3$, an individual with the 1-1-2 and 1-2-1 haplotypes would be denoted $x = (1,1,2)$, $y = (1,2,1)$. The number of genotypes under consideration is denoted by $G$. The maximum value of $G$, given $H$, is $G = H(H + 1)/2$, but a smaller number may occur either because a certain genotype is lethal or because it did not happen to occur in the observed sample. In general, $G$ can be quite large, taking a maximum value of $2^{m-1}(2^m + 1)$; in particular applications it can be appreciably smaller, though still considerably larger than $H$.

We first describe two cross-tabulation procedures for analyzing these data, one genotype based and the other haplotype based. We then present our proposed logistic regression method.

The genotype-based cross-tabulation method is based on a $G \times 2$ table recording the $n$ patients according to genotype and disease status (case or control). In this unstructured model, no relationship is postulated among the $G$ different values of $\pi(x;y) = P(\text{disease} \mid \text{haplotype } x \text{ and haplotype } y)$. The null hypothesis, $H_{0G}$: $\pi(x;y)$ = constant, can be tested by the ordinary Pearson chi-square statistic with $G - 1$ degrees of freedom. Additional variables, such as smoking in lung cancer studies, could be added to the model using logistic regression [Sugimura et al., 1994]. However, especially for $m > 2$, the large value of $G$ means that this $\chi^2$ statistic could lack power if the difference between groups is due to a haplotype rather than just one genotype.

Therefore we discuss two haplotype-based procedures, one based on a cross-tabulation and the other on logistic regression, that reduce the number of parameters involved in the test of no genetic effect from $G - 1$ to $H - 1$. Both procedures test the same hypothesis $H_{0G}$ as above but, implicitly or explicitly, make an assumption of additivity on some scale. To apply these asymptotic procedures, we would strive to retain in the analysis haplotypes that occurred in a minimum of five individuals, and combine all the rest into an "other" group, which should itself also contain at least five individuals.

The first haplotype-based procedure, performed for example by Weston et al. [1997], tabulates the total number of times each haplotype occurs in the sample, counting each haplotype twice for homozygotes and once for heterozygotes, for a total of $2n$ haplotypes in a sample of $n$ individuals. (This is the method of allele counting; see, e.g., Cavalli-Sforza and Bodmer [1971] or Edwards [1992].) We will refer to this method as the haplotype-based cross-tabulation, in which the $H$ different haplotypes are cross-tabulated by the two levels of disease status (e.g., case and control). The hypothesis of no association between disease and haplotype is tested conventionally using the Pearson chi-square statistic for $H \times 2$ tables. Because the test is based on $2n$ haplotype observations rather than $n$ individuals, it is difficult to extend this analysis to other covariates measured at the individual level.

Our proposed method, which we term haplotype-based logistic regression, uses logistic regression rather than cross-tabulation, and is based on the number of copies (0, 1, or 2) of each of the $H$ haplotypes that each individual has. It imposes an explicit assumption of additivity on the logit scale: the logit of each heterozygote is assumed to be halfway between the logits of the two corresponding homozygotes. We can express this assumption via one of two parameterizations that, while mathematically equivalent, shed different light on the nature of the assumptions.

The first, the homozygous parameterization of the model, useful in a cohort study, represents all $G$ values of $\pi(x;y)$ in terms of the $H$ probabilities $\pi(x;x)$ of

disease occurrence in homozygotes. Rather than employ these $H$ parameters directly, we transform them to a logit scale, defining $\theta(x)$ such that

$$\ln\frac{\pi(x;x)}{1-\pi(x;x)} = 2\theta(x), \quad\text{or}\quad \pi(x;x) = \frac{e^{2\theta(x)}}{1+e^{2\theta(x)}}.$$

The assumption of additivity mentioned above is expressed by

$$\ln\frac{\pi(x;y)}{1-\pi(x;y)} = \theta(x) + \theta(y), \tag{1}$$

or equivalently,

$$\pi(x;y) = \frac{\exp[\theta(x)+\theta(y)]}{1+\exp[\theta(x)+\theta(y)]}.$$

The null hypothesis, $H_{0H}$: $\theta(x)$ = constant for all $x$, is evaluated in terms of the difference between the log likelihood for the model with $H$ parameters and no intercept, and the log likelihood of the model under $H_{0H}$ with only a single (intercept) term. For $n$ large, under $H_{0H}$, twice this difference has a chi-square distribution with $H - 1$ degrees of freedom [Hosmer and Lemeshow, 1989]. Note that $H_{0H}$ together with (1) is equivalent to $H_{0G}$. In a cohort study, $2\theta(x)$ is the log odds of disease for individuals homozygous for haplotype $x$.

The second, the baseline-haplotype parameterization, selects a baseline haplotype, $z$, as the basis for comparison. (The choice can be arbitrary, but generally will be the most common haplotype.) The parameterization is particularly appropriate for a case-control study, or when there is a major haplotype (perhaps occurring in over 80% of all individuals). The model expresses all $G$ values of $\pi(x;y)$ in terms of the $H$ probabilities $\pi(x;z)$, or equivalently in terms of an intercept, $\alpha$, and $H - 1$ coefficients $\beta(x)$, where

$$\ln\frac{\pi(x;z)}{1-\pi(x;z)} = \begin{cases} \alpha, & x = z \\ \alpha + \beta(x), & \text{otherwise.} \end{cases} \tag{2}$$

Each parameter $\beta(x)$ gives the log odds ratio for disease for an individual with a single copy of the haplotype $x$ and another copy of the base haplotype, relative to an individual with two copies of the base haplotype. The assumption of additivity states that for $x, y \neq z$,

$$\ln\frac{\pi(x;y)}{1-\pi(x;y)} = \alpha + \beta(x) + \beta(y). \tag{3}$$

The null hypothesis $H^*_{0H}$: $\beta(x) = 0$ for all $x$, would be evaluated by twice the difference between the log likelihood for the model with $H$ parameters and the log likelihood of the model under $H^*_{0H}$, i.e., with only a single (intercept) term. Setting $\alpha = 2\theta(z)$ and $\beta(x) = \theta(x) - \theta(z)$ for $x \neq z$, equation (1) gives $\theta(x) + \theta(z) = \alpha + \beta(x)$ and $\theta(x) + \theta(y) = \alpha + \beta(x) + \beta(y)$, so the two parameterizations describe the same model and $H^*_{0H}$ is equivalent to $H_{0H}$.

Alternatively, it is possible to reject $H_{0H}$ based on the maximum $Z$ statistic for comparing any haplotype to the baseline haplotype. The test statistic is $\max_x Z(x)$, where $Z(x) = \hat\beta(x)/\mathrm{s.e.}(\hat\beta(x))$ and $\hat\beta(x)$ is the maximum likelihood estimate of $\beta(x)$. Under the null hypothesis, $Z(x)$ asymptotically has a standard normal distribution. To find a conservative estimate of the P value associated with the maximum over the $H - 1$ values of $x$, use the Bonferroni correction: multiply the smallest P value associated with $Z(x)$ by $H - 1$.

Conceptually, it is possible to test the additivity assumptions (1) or (3) by taking the difference between the log likelihood of the genotype model with $G - 1$ parameters and that of the model according to equation (1) or (3) with $H - 1$ parameters. If the proposed additivity assumption is true, twice the difference in log likelihoods, or more simply, the difference between the chi-square statistic used to test $H_{0G}$ and the one used to test $H_{0H}$, will asymptotically have a chi-square distribution with $G - H$ degrees of freedom. This lack-of-fit test is included in many statistical software programs. However, asymptotic results based on $G - H$ degrees of freedom are questionable if some of the $G$ genotypes are noted in only three or fewer patients.

The haplotype-based logistic regression retains the identity of the particular genetic configuration and of the individual and can thus include demographic information, environmental risk factors, or environment-gene interactions, by simply adding them into the regression.
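The equivalence of the two parameterizations can be checked numerically. The following is a minimal Python sketch, not the authors' code: the haplotype labels, the choice of 1-1-1 as the baseline haplotype z, and the θ values are illustrative assumptions, and the function names are hypothetical. It builds the copy-count covariates for every genotype and confirms that setting α = 2θ(z) and β(x) = θ(x) − θ(z) gives the same logit as equation (1).

```python
from itertools import combinations_with_replacement

# Assumed haplotype categories; "other" pools rarely observed haplotypes.
HAPLOTYPES = ["1-1-1", "2-1-1", "2-2-2", "other"]
BASELINE = "1-1-1"  # assumed reference haplotype z


def copy_counts(genotype, haplotypes=HAPLOTYPES):
    """Number of copies (0, 1, or 2) of each haplotype in a genotype (x; y)."""
    return [sum(h == hap for h in genotype) for hap in haplotypes]


def logit_homozygous(genotype, theta):
    """Equation (1): logit = theta(x) + theta(y), with no intercept."""
    counts = copy_counts(genotype)
    return sum(c * theta[h] for h, c in zip(HAPLOTYPES, counts))


def logit_baseline(genotype, alpha, beta):
    """Equations (2)-(3): intercept plus beta for each non-baseline copy."""
    counts = copy_counts(genotype)
    return alpha + sum(
        c * beta[h] for h, c in zip(HAPLOTYPES, counts) if h != BASELINE
    )


# Illustrative values only; these are NOT estimates from the paper.
theta = {"1-1-1": -0.4, "2-1-1": -0.3, "2-2-2": 0.5, "other": 0.1}
alpha = 2 * theta[BASELINE]
beta = {h: theta[h] - theta[BASELINE] for h in HAPLOTYPES if h != BASELINE}

# All unordered pairs of haplotypes: G = H(H + 1)/2 genotypes.
genotypes = list(combinations_with_replacement(HAPLOTYPES, 2))

for g in genotypes:
    difference = logit_homozygous(g, theta) - logit_baseline(g, alpha, beta)
    assert abs(difference) < 1e-12
```

In a real analysis the rows returned by `copy_counts` (dropping the baseline column when an intercept is used) would be passed, together with any covariates, to whatever logistic regression routine is at hand.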
To test whether there is a common non-zero effect of a collection of $k$ such variables ($k \geq 1$), after adjusting for genetic effects, compare twice the difference in log likelihood between a model containing such effects and a model that does not contain them with a $\chi^2$ distribution with $k$ degrees of freedom. The presence of gene-environment interaction is most easily evaluated in a model with an intercept (i.e., the second parameterization), by first constructing $H - 1$ new variables by multiplying each of the $H - 1$ haplotype counts by the variable indicating environmental exposure, and then examining the difference in log likelihoods as described above.

EXAMPLE

We illustrate the application of these methods for the 182 Caucasian women in the case-control study of breast cancer described by Weston et al. [1997]. There were $m = 3$ biallelic loci, so the maximum number of possible haplotypes is $2^3 = 8$, although only six were observed in the full sample of 284 women in all three racial groups studied. The most frequently occurring haplotype is designated 1-1-1; the second most prevalent was 2-1-1; but, as evidence of linkage disequilibrium, the third was 2-2-2. In Caucasians, only two additional haplotypes were observed, one in four individuals and one in a single (case) patient. Thus we chose to collapse these two categories into an "other" category for the statistical analysis, so that $H = 4$ and $G = H(H + 1)/2 = 10$. We now illustrate analysis of these data by the different methods discussed here.

Genotype-Based Cross-Tabulation

For the 182 Caucasian women, Table I cross-tabulates the $G = 10$ genotypes by disease status (case or control). The ordinary Pearson chi-square statistic with nine degrees of freedom is $\chi^2$ = (P = 0.194). Alternatively, the chi-square statistic based on the difference in log likelihood is 13.97, also with nine degrees of freedom.

TABLE I. Cross-Tabulation of Genotype by Disease Group*
Columns: Genotype pattern; Number of outcomes (Cases, Controls); Number of haplotypes (by haplotype, Other, Total)
*In pattern 7, other = 2-2-1; for all other patterns, other refers to As previously noted, the nomenclature is not consistent with Själander et al., or other previous papers. The first and third indices are switched, so what we call Själander et al. [1995a,b; 1996] call 2-1-2, our is their 1-1-2, and our is their

Haplotype-Based Cross-Tabulation

This analysis is based on the 364 haplotypes, rather than the 182 individuals. Table II cross-tabulates the four haplotypes under consideration by disease group. For this table, Pearson's $\chi^2$ = 7.03, with three degrees of freedom, so that P = . Thus, the P value obtained by this haplotype-based cross-tabulation (and, as we note below, by the haplotype-based logistic regression) was smaller than that obtained by the genotype-based analysis.

TABLE II. Cross-Tabulation of Haplotype by Disease Group

| | 1-1-1 | 2-1-1 | 2-2-2 | Other | Total |
|---|---|---|---|---|---|
| Cases | 89 (68%) | 14 (11%) | 24 (18%) | 3 (2%) | 130 |
| Controls | 183 (78%) | 26 (11%) | 22 (9%) | 3 (1%) | 234 |

Haplotype-Based Logistic Regression

Logistic regression is usually performed by entering data on each individual separately, with the outcome case or control. In a model evaluating genetic effects only, the covariates are the number of copies (0, 1, or 2) of each haplotype pattern. In our example with $H = 4$, either three haplotypes are used with an intercept, or the intercept is suppressed and all four are entered. Applied to our example, the procedure yields a chi-square statistic of 6.91 with three degrees of freedom, so that P = . We used JMP (SAS Institute) for the calculations, but any conventional logistic regression program will suffice.

If the data had come from a cohort study, one could have used the homozygous parameterization to estimate the probability of disease for any specific genotype. These estimates can be derived from the output as given in the top panel of Table III. For example, for an individual with two copies of the haplotype, the probability of disease would be estimated as exp( )/[1 + exp( )] = . Similarly, for an individual with one copy of and one copy of the probability of disease is estimated as exp( )/[1 + exp( )] = .

TABLE III. Parameter Estimates Obtained by Logistic Regression
Columns: Parameter; Estimate; se(est.)
Homozygous parameterization, no intercept: θ(1,1,1), θ(2,1,1), θ(2,2,2), θ(other)
Baseline-haplotype parameterization with the 1-1-1 haplotype as the base haplotype: Intercept, β(2,1,1), β(2,2,2), β(other)

For a case-control study, these probabilities are not estimable [Prentice and Pyke, 1979], so the baseline-haplotype parameterization would be appropriate. The bottom panel of Table III gives the estimates for that parameterization. For example, the estimate for $\beta(2,2,2)$ indicates that a genotype with $x = (2,2,2)$, $y = (1,1,1)$ is estimated to multiply the odds of cancer by exp(0.856) = 2.35, as compared to two copies of the baseline haplotype, 1-1-1. The 95% confidence interval for this odds ratio (ignoring issues of simultaneous testing) is exp[0.856 ± 1.96(0.339)], or 1.21 to 4.57.

The test for the assumption of additivity has $G - H = 10 - 4 = 6$ degrees of freedom, and the $\chi^2$ statistic is given by 13.97 - 6.91 = 7.06, thus giving us no reason to question the assumption of additivity. However, since three of the ten genotypes were noted in only one individual, and two other genotypes were noted in only two individuals, it is important to stress that this test merely fails to contradict the assumption of additivity; it does not confirm the assumption.
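The arithmetic in this example can be reproduced from the published numbers. The sketch below is illustrative Python, not the authors' JMP analysis: the function names are hypothetical, and the closed-form chi-square tail function (valid for integer degrees of freedom) is an assumed stand-in for a statistics library. It recomputes the Pearson statistic from the Table II counts, the odds ratio and interval from the reported estimate of β(2,2,2) and its standard error, and the additivity statistic as the difference of the two likelihood-ratio statistics.

```python
import math

# Haplotype counts from Table II, in the column order printed there
# (three named haplotypes followed by "other").
cases = [89, 14, 24, 3]
controls = [183, 26, 22, 3]


def pearson_chi_square(row_a, row_b):
    """Pearson chi-square statistic for a 2 x H table of counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        column = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        terms = [half ** i / math.factorial(i) for i in range(df // 2)]
        return math.exp(-half) * sum(terms)
    terms = [
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    ]
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


# Haplotype-based cross-tabulation: the text reports 7.03 on 3 df.
chi2_haplotype = pearson_chi_square(cases, controls)
p_haplotype = chi_square_sf(chi2_haplotype, df=len(cases) - 1)

# Odds ratio and 95% interval from the reported beta(2,2,2) and its s.e.
beta_hat, se = 0.856, 0.339
odds_ratio = math.exp(beta_hat)
ci_low = math.exp(beta_hat - 1.96 * se)
ci_high = math.exp(beta_hat + 1.96 * se)

# Additivity check: genotype-model statistic minus haplotype-model statistic,
# referred to a chi-square distribution with G - H = 6 degrees of freedom.
lack_of_fit = 13.97 - 6.91
p_lack_of_fit = chi_square_sf(lack_of_fit, df=6)

print(round(chi2_haplotype, 2), round(odds_ratio, 2), round(ci_low, 2))
```

The tail probabilities are left unrounded here; any standard chi-square routine could replace `chi_square_sf` without changing the rest of the sketch.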
DISCUSSION

The logistic regression method presented here is straightforward, is commonly used in epidemiologic calculations, lends itself to interpretation of the magnitude of effects, is easily extended to control for other variables, and gives an overall test with fewer degrees of freedom than the model based on all genotypes. Extension to several groups, e.g., comparing the three racial groups in our example, or subdividing controls into benign breast disease and other controls, is also straightforward. It is also straightforward to incorporate multiple alleles at one or more loci.

The impetus for proposing this logistic regression arose from the breast cancer work referred to above, and from the fact that with the allele-specific PCR protocol, it is now possible to directly determine haplotypes in some cases. However, the method could equally well be applied to any polymorphic system, whether multiple alleles at a single locus and/or multiple haplotypes over several loci.

Smouse and Williams [1982] propose a general framework for constructing multivariate tests in this situation. (We discuss only their one-locus case, since that corresponds to our situation, in which haplotypes can be determined directly.) In particular, they suggest allelic counting as a scoring convention and point out that this gives double weight to homozygotes. (Compare the method of allele counting; see above.) These allele counts then yield $H - 1$ genetic score variables, which are essentially identical to the input to our logistic regression using the second parameterization. They then construct a multivariate test statistic (their equation (5)), which they show is identical to a haplotype-based contingency-table $\chi^2$. Smouse and Williams [1982] also mention the possibility of using log-linear models, similar to the logistic regression approach proposed here, but they do not elaborate on this procedure in the context of comparing different groups (e.g., cases vs. controls), as we do here. Not only do we address a different application, we also focus on the nature of the association between genotype and disease that would make the haplotype-based tests optimal; we exploit the logistic model to estimate parameters of interest; and we discuss more general models, which include covariates.

It is also interesting to note that the haplotype-based cross-tabulation, although it may appear assumption-free, actually implicitly incorporates an additivity assumption similar (although not identical) to that incorporated in the logistic regression. This arises because the haplotypes are counted up as they appear in cases and controls, so that homozygotes are given twice the weight given to heterozygotes, as also pointed out by Smouse and Williams [1982].

In another procedure used to test for HLA-disease associations, Meddeb-Garnaoui et al. [1995] performed $H$ separate evaluations comparing the proportion of individuals with each haplotype in cases and controls.
The procedure does not distinguish between homozygotes and heterozygotes with respect to the haplotype in question. Even without an adjustment for multiplicity of tests, the procedure can give very large confidence intervals for the odds ratios and can obscure any relationship between a particular haplotype and disease, if such a relationship exists.

If one knew the exact mechanism of disease causation (e.g., an extension of the classical dominant or recessive mode of inheritance), other methods of statistical analysis would be more appropriate. However, in the absence of this knowledge, our procedure gives a test that is consistent (power for a fixed alternative approaches 1.0 as sample size increases) for a large range of patterns, including situations in which either a single copy or two copies of a certain haplotype are associated with increased disease risk.

In the example given, the P values for the haplotype-based tests were similar and were smaller than those for the genotype-based tests. We conjecture that this might be true in general. We have shown that our test is optimal under certain conditions and also indicated similarity with other haplotype-based tests discussed by Smouse and Williams [1982]. Thus, when these conditions are satisfied, or nearly satisfied, we would expect the haplotype-based procedures to give smaller P values than the genotype-based procedures.

Lastly, we note that in the limiting case of a biallelic marker at a single locus ($m = 1$) without covariates, the procedures based on cross-tabulation yield tests that are commonly used, whereas our logistic regression procedure results in a test statistic that can be viewed as a compromise between a model designed to detect a dominant relationship and one for a recessive relationship. The genotype-based cross-tabulation yields a $3 \times 2$ table cross-tabulating, for $n$ subjects, the three possible genotypes against disease status. The haplotype-based cross-tabulation reduces to a $2 \times 2$ cross-tabulation, for the $2n$ alleles, of allele present or absent by disease group. Our proposed method uses logistic regression to find the relationship between the outcome status (case or control) measured in $n$ individuals and the covariate: number of copies of the allele (0, 1, or 2). In contrast, a test procedure optimal to detect a dominant disease would use as covariate 0 if there were no copies of the allele and 1 if there were one or more copies, while one optimal to detect a recessive disease would use as covariate 1 if there were two copies of the allele and 0 otherwise. Our coding scheme is somewhere in between: it is optimal for some intermediate level of penetrance, but is consistent for both a dominant and a recessive model.

ACKNOWLEDGMENTS

This work was supported by NIH grants RR 00071, MH-48858, MH-52841, MH-28274, MH-36197, DK-31813, and CA/ES

REFERENCES

Cavalli-Sforza LL, Bodmer WF (1971): The Genetics of Human Populations. San Francisco: WH Freeman, p 43.
Edwards AWF (1992): Likelihood, expanded edition. Baltimore: Johns Hopkins, p 19.
Hosmer DW, Lemeshow S (1989): Applied Logistic Regression. New York: John Wiley and Sons, p 32.
Meddeb-Garnaoui A, Zeliszewski D, Mougenot JF, Djilali-Saiah I, Caillat-Zucman S, Dormoy A, Gaudebout C, Tongio MM, Baudon JJ, Sterkers G (1995): Reevaluation of the relative risk to susceptibility to celiac disease of HLA-DRB1, -DQA1, -DQB1, -DPB1 and -TAP2 alleles in a French population. Hum Immunol 43:
Prentice RL, Pyke R (1979): Logistic regression incidence models and case control studies. Biometrika 66:
Själander A, Birgander R, Athlin L, Stenling R, Rutegard J, Beckman L, Beckman G (1995a): P53 germ line haplotypes associated with increased risk for colorectal cancer. Carcinogenesis 16:
Själander A, Birgander R, Kivelä A, Beckman G (1995b): P53 polymorphisms and haplotypes in different ethnic groups. Hum Hered 45:
Själander A, Birgander R, Hallmans G, Cajander S, Lenner P, Athlin L, Beckman L, Beckman G (1996): P53 polymorphisms and haplotypes in breast cancer. Carcinogenesis 17:
Smouse PE, Williams RC (1982): Multivariate analysis of HLA-disease associations. Biometrics 38:
Sugimura H, Suzuki I, Hamada GS, Iwase T, Takahashi T, Nagura K, Iwata H, Watanabe S, Kino I, Tsugane S (1994): Cytochrome p-450 1A1 genotype in lung cancer patients and controls in Rio de Janeiro, Brazil. Cancer Epidemiol Biomarkers Prev 3:
Weston A, Pan C, Ksieski B, Wallenstein S, Berkowitz G, Tartter P, Bleiweiss I, Brower S, Senie R, Wolff M (1997): p53 haplotype determination in breast cancer. Cancer Epidemiol Biomarkers Prev 6:

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015 Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2015 1 / 1 Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits.

More information

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin CHAPTER 1 1.2 The expected homozygosity, given allele

More information

2. Map genetic distance between markers

2. Map genetic distance between markers Chapter 5. Linkage Analysis Linkage is an important tool for the mapping of genetic loci and a method for mapping disease loci. With the availability of numerous DNA markers throughout the human genome,

More information

Tests for the Odds Ratio in a Matched Case-Control Design with a Quantitative X

Tests for the Odds Ratio in a Matched Case-Control Design with a Quantitative X Chapter 157 Tests for the Odds Ratio in a Matched Case-Control Design with a Quantitative X Introduction This procedure calculates the power and sample size necessary in a matched case-control study designed

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important?

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important? Statistical Genetics Agronomy 65 W. E. Nyquist March 004 EXERCISES FOR CHAPTER 3 Exercise 3.. a. Define random mating. b. Discuss what random mating as defined in (a) above means in a single infinite population

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs. Supplementary Figure 1 Number of cases and proxy cases required to detect association at designs. = 5 10 8 for case control and proxy case control The ratio of controls to cases (or proxy cases) is 1.

More information

Topic 21 Goodness of Fit

Topic 21 Goodness of Fit Topic 21 Goodness of Fit Contingency Tables 1 / 11 Introduction Two-way Table Smoking Habits The Hypothesis The Test Statistic Degrees of Freedom Outline 2 / 11 Introduction Contingency tables, also known

More information

Introduction to Analysis of Genomic Data Using R Lecture 6: Review Statistics (Part II)

Introduction to Analysis of Genomic Data Using R Lecture 6: Review Statistics (Part II) 1/45 Introduction to Analysis of Genomic Data Using R Lecture 6: Review Statistics (Part II) Dr. Yen-Yi Ho (hoyen@stat.sc.edu) Feb 9, 2018 2/45 Objectives of Lecture 6 Association between Variables Goodness

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions

Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions Nilanjan Chatterjee, Zeynep Kalaylioglu 2, Roxana Moslehi, Ulrike Peters 3, Sholom Wacholder

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

Basic Medical Statistics Course

Basic Medical Statistics Course Basic Medical Statistics Course S7 Logistic Regression November 2015 Wilma Heemsbergen w.heemsbergen@nki.nl Logistic Regression The concept of a relationship between the distribution of a dependent variable

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Definitions: Genes, Loci and Alleles. A gene codes for a protein. Proteins do everything.

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Friday Harbor 2017. From Genetics to GWAS (Genome-wide Association Study). Sept 7 2017. David Fardo. Purpose: prepare for tomorrow's tutorial. Genetic Variants, Quality Control, Imputation, Association, Visualization

Stat 642, Lecture notes for 04/12/05 96

Hosmer-Lemeshow Statistic. The Hosmer-Lemeshow Statistic is another measure of lack of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

Published in final edited form as: Stat Sin. 2012; 22: 1041-1074. ON MODEL SELECTION STRATEGIES TO IDENTIFY GENES UNDERLYING BINARY TRAITS USING GENOME-WIDE ASSOCIATION

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Statistics in medicine

Lecture 4: and multivariable regression. Fatma Shebl, MD, MS, MPH, PhD, Assistant Professor, Chronic Disease Epidemiology Department, Yale School of Public Health, Fatma.shebl@yale.edu

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Ruth Pfeiffer, Ph.D., Mitchell Gail. Biostatistics Branch, Division of Cancer Epidemiology & Genetics, National

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2)

B.H. Robbins Scholars Series, June 23, 2010. Outline: Z-test, χ²-test, Confidence Interval, Sample size and power, Relative effect

Case-Control Association Testing. Case-Control Association Testing

Introduction. Association mapping is now routinely being used to identify loci that are involved with complex traits. Technological advances have made it feasible to perform case-control association studies

Lecture 14: Introduction to Poisson Regression

Ani Manichaikul, amanicha@jhsph.edu, 8 May 2007. Overview: Modelling counts, Contingency tables, Poisson regression models. Modelling counts I: Why

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts I. Ani Manichaikul, amanicha@jhsph.edu. Why count data? Number of traffic accidents per day; mortality counts in a given neighborhood, per week

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

In logistic regression, there is a categorical response variable, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

csabatti@mednet.ucla.edu. Hardy-Weinberg Hypotheses: infinite populations; no inbreeding;

University of California, Berkeley

U.C. Berkeley Division of Biostatistics Working Paper Series, Year 2008, Paper 241. A Note on Risk Prediction for Case-Control Studies. Sherri Rose, Mark J. van der Laan. Division

Partitioning Genetic Variance

PSYC 510 (09/17/03). Here, mathematical models are developed for the computation of different types of genetic variance. Several substantive

10: Crosstabs & Independent Proportions

Background: two independent groups; binary outcome; compare binomial proportions. Illustrative example (oswege.sav): food poisoning following church

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526, Spring 2011, Midterm 1, Wednesday February 2, 2011. Time: 2 hours. Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Timothy Thornton and Katie Kerr. Introduction to Quantitative

Estimating direct effects in cohort and case-control studies

Ghent University. Direct effects: Introduction; Motivation; The problem of standard approaches; Controlled direct effect models. In many research

Three-Way Contingency Tables

Newsom, PSY 50/60 Categorical Data Analysis, Fall 06. Three-way contingency tables involve three binary or categorical variables. I will stick mostly to the binary case to keep

Optimal Allele-Sharing Statistics for Genetic Mapping Using Affected Relatives

Genetic Epidemiology 16:225-249 (1999). Mary Sara McPeek, Department of Statistics, University of Chicago, Chicago, Illinois

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Mathieu Emily. November 18, 2014, Caen. mathieu.emily@agrocampus-ouest.fr - Agrocampus

Linear Regression (1/1/17)

STA613/CBB540: Statistical methods in computational biology. Lecturer: Barbara Engelhardt. Scribe: Ethan Hada. 1. Linear regression. 1.1. Linear regression basics. Linear regression

Weierstraß-Institut. für Angewandte Analysis und Stochastik. Leibniz-Institut im Forschungsverbund Berlin e. V. Preprint ISSN

Preprint ISSN 2198-5855. On an extended interpretation of linkage disequilibrium in genetic

WORKSHOP 3 Measuring Association

Concepts. Analysing Categorical Data: Testing of Proportions; Contingency Tables & Tests; Odds Ratios. Linear Association Measures: Correlation; Simple Linear Regression

Statistics in medicine

Lecture 3: Bivariate association: Categorical variables. Proportion in one group. One group is measured one time: z test. Use the z distribution as an approximation to the binomial

For 5% confidence χ² with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

STAT 550 Homework 6, Anton Amirov. 1. This question relates to the same study you saw in Homework-4, by Dr. Arno Motulsky and coworkers, and published in Thompson et al. (1988; Am.J.Hum.Genet, 42, 113-124).

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Saurabh Ghosh, Human Genetics Unit, Indian Statistical Institute, Kolkata. Most common diseases are caused by

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Communications of the Korean Statistical Society 2009, Vol 16, No 4, 697-705. Kwang Mo Jeong, Hyun Yung Lee. Department

Introduction to the Logistic Regression Model

CHAPTER 1. 1.1 INTRODUCTION. Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

Yuehua Cui and Dong-Yun Kim. Department of Statistics and Probability, Michigan State University,

Logistic Regression. Fitting the Logistic Regression Model BAL040-A.A.-10-MAJ

The goal of a logistic regression analysis is to find the best fitting and most parsimonious, yet biologically reasonable, model to describe the relationship between an outcome (dependent

More Statistics tutorial at Logistic Regression and the new:

Logistic Regression and the new: Residual Logistic Regression. Outline: 1. Logistic Regression 2. Confounding Variables 3. Controlling for Confounding Variables 4. Residual Linear Regression 5. Residual

LOGISTIC REGRESSION Joseph M. Hilbe

Arizona State University. Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

Categorical data analysis Chapter 5

Interpreting parameters in logistic regression. The sign of β determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases

Confidence Intervals for the Odds Ratio in Logistic Regression with Two Binary X s

Chapter 866. Introduction. Logistic regression expresses the relationship between a binary response variable and one or

Two Correlated Proportions Non-Inferiority, Superiority, and Equivalence Tests

Chapter 59. Introduction. This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

Chapter 19: Logistic regression

Self-test answers. SELF-TEST: Rerun this analysis using a stepwise method (Forward: LR) entry method of analysis. The main analysis: To open the main Logistic Regression dialog

1. Understand the methods for analyzing population structure in genomes

MSCBIO 2070/02-710: Computational Genomics, Spring 2016. HW3: Population Genetics. Due: 24:00 EST, April 4, 2016 by autolab. Your goals in this assignment are to 1. Understand the methods for analyzing population

Interpretation of the Fitted Logistic Regression Model

CHAPTER 3. 3.1 INTRODUCTION. In Chapters 1 and 2 we discussed the methods for fitting and testing for the significance of the logistic regression model.

Problems for 3505 (2011)

1. In the simplex of genotype distributions x + y + z = 1, for two alleles, the Hardy-Weinberg distributions x = p², y = 2pq, z = q² (p + q = 1) are characterized by y² = 4xz.

Chapter 20: Logistic regression for binary response variables

In 1846, the Donner and Reed families left Illinois for California by covered wagon (87 people, 20 wagons). They attempted a new and untried

Outline of lectures 3-6

GENOME 453, J. Felsenstein, Evolutionary Genetics, Autumn, 007. Population genetics. 1. We want to know what theory says about the reproduction of genotypes in a population. This results

SNP Association Studies with Case-Parent Trios

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health. September 3, 2009. Population-based Association Studies. Balding (2006). Nature

Modeling IBD for Pairs of Relatives. Biostatistics 666 Lecture 17

Previously: Linkage Analysis of Relative Pairs. IBS Methods: compare observed and expected sharing. IBD Methods: account for frequency of shared

GOODNESS-OF-FIT FOR GEE: AN EXAMPLE WITH MENTAL HEALTH SERVICE UTILIZATION

STATISTICS IN MEDICINE. NICHOLAS J. HORTON, JUDITH D. BEBCHUK, CHERYL L. JONES, STUART R. LIPSITZ, PAUL J. CATALANO, GWENDOLYN

A consideration of the chi-square test of Hardy-Weinberg equilibrium in a non-multinomial situation

Ann. Hum. Genet., Lond. (1975), 39, 141. Printed in Great Britain. BY CHARLES F. SING AND EDWARD D.

Faculty of Health Sciences. Regression models. Counts, Poisson regression, Lene Theil Skovgaard. Dept. of Biostatistics

Counts, Poisson regression, 27-5-2013. Count outcome: PKA & LTS, Sect. 7.2. Poisson regression. The Binomial

Asymptotic equivalence of paired Hotelling test and conditional logistic regression

Félix Balazard. arXiv:1610.06774v1 [math.ST] 21 Oct 2016. Sorbonne Universités, UPMC Univ Paris 06, CNRS

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Maximum Likelihood Estimation of Allele Frequencies. Find parameter estimates which make observed data most likely. General approach, as long as

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics, BTRY 4830/6830; PBSB.5201.01. Lecture 20: Epistasis and Alternative Tests in GWAS. Jason Mezey, jgm45@cornell.edu. April 16, 2016 (Th) 8:40-9:55. None Announcements Summary

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

http://isi.cbs.nl/glossary/index.htm Adjusted r²: Adjusted R squared measures the proportion of the

Additive and multiplicative models for the joint effect of two risk factors

Biostatistics (2005), 6, 1, pp. 1-9. doi: 10.1093/biostatistics/kxh024. A. BERRINGTON DE GONZÁLEZ, Cancer Research UK Epidemiology

Generalized logit models for nominal multinomial responses. Local odds ratios

Categorical Data Analysis, Summer 2015. Local odds ratios: cross-classification of X (rows) by Y (columns 1 to 4), with cell probabilities π_ij and row totals π_i+. Row 1: π_11, π_12, π_13, π_14, π_1+. Row 2: π_21, π_22, π_23, π_24, π_2+. Row 3: π_31, π_32, π

Tests for the Odds Ratio in Logistic Regression with One Binary X (Wald Test)

Chapter 861. Introduction. Logistic regression expresses the relationship between a binary response variable and one or more

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Timothy Thornton and Michael Wu. Lecture Outline: Beyond main SNP effects; Introduction to Concept of Statistical Interaction

The Quantitative TDT

(Quantitative Transmission Disequilibrium Test). Warren J. Ewens. NUS, Singapore, 10 June, 2009. The initial aim of the (QUALITATIVE) TDT was to test for linkage between a marker locus

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

Models For Nominal Responses. Y is nominal with J categories. Let {π_1, …, π_J} denote the response probabilities

Cohen's Kappa and Log-linear Models

HRP 261, 03/03/03, 10-11 am. 1. Cohen's Kappa. Actual agreement = sum of the proportions found on the diagonals, Σ π_ii. Cohen: Compare the actual agreement with the chance

STA6938-Logistic Regression Model

Dr. Ying Zhang. Topic 2-Multiple Logistic Regression Model. Outlines: 1. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

Chapter 22: Log-linear regression for Poisson counts

Exposure to ionizing radiation is recognized as a cancer risk. In the United States, EPA sets guidelines specifying upper limits on the amount of exposure

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white

Outline: segregation of alleles in single trait crosses; independent assortment of alleles; using probability to predict outcomes; statistical analysis of hypotheses; conditional probability in multi-generation

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Ronald Heck. This week we'll take up a couple of issues. The first is working with a probit link function. While

Multinomial Logistic Regression Models

Stat 544, Lecture 19. Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r > 2 categories. (Note: The word

Introduction to logistic regression

Tuan V. Nguyen, Professor and NHMRC Senior Research Fellow, Garvan Institute of Medical Research, University of New South Wales, Sydney, Australia. What we are going to learn

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Usual linear regression (repetition): $y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + e_i$, $e_i \sim N(0, \sigma^2)$, or: $y_i \sim N(b_0 + b_1 x_{1i} + b_2 x_{2i}, \sigma^2)$. Example (DGA, p. 336): E(PEmax) = 47.355 + 1.024

Review of Statistics 101

We review some important themes from the course. 1. Introduction. Statistics: Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

Power and sample size calculations for designing rare variant sequencing association studies.

Seunggeun Lee, Michael C. Wu, Tianxi Cai, Yun Li, Michael Boehnke and Xihong Lin. Department

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Glenn Heller and Jing Qin. Department of Epidemiology and Biostatistics, Memorial

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Instructors: Timothy Thornton and Michael Wu. Introduction to Quantitative Trait Mapping

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Overview: Data Types; Contingency Tables; Logit Models (Binomial, Ordinal, Nominal). Things not

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

John Novembre and Montgomery Slatkin. Supplementary Methods. To

Package LBLGXE. R topics documented: July 20, Type Package

July 20, 2015. Type: Package. Title: Bayesian Lasso for detecting Rare (or Common) Haplotype Association and their interactions with Environmental Covariates. Version: 1.2. Date: 2015-07-09. Author

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Interaction. Outline: Definition of interaction; Additive versus multiplicative

Linkage and Linkage Disequilibrium

Summer Institute in Statistical Genetics 2014, Module 10, Topic 3. Linkage in a simple genetic cross. In the early 1900s Bateson and Punnett conducted genetic studies

Confidence Intervals for the Interaction Odds Ratio in Logistic Regression with Two Binary X s

Chapter 867. Introduction. Logistic regression expresses the relationship between a binary response variable

SNP-SNP Interactions in Case-Parent Trios

Detection of SNP-SNP Interactions in Case-Parent Trios. Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health. June 2, 2009. Karyotypes: http://ghr.nlm.nih.gov/ Single Nucleotide Polymorphisms

Affected Sibling Pairs. Biostatistics 666

Today: Discussion of linkage analysis using affected sibling pairs. Our exploration will include several components we have seen before: A simple disease model; IBD

The universal validity of the possible triangle constraint for Affected-Sib-Pairs

The Canadian Journal of Statistics, Vol. 31, No.?, 2003, Pages???-???. La revue canadienne de statistique. Zeny Z. Feng, Jiahua

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

Ingo Langner, Ralf Bender, Rebecca Lenz-Tönjes, Helmut Küchenhoff, Maria Blettner

Known unknowns : using multiple imputation to fill in the blanks for missing data

James Stanley, Department of Public Health, University of Otago, Wellington. james.stanley@otago.ac.nz. Acknowledgments: Cancer

Efficient designs of gene environment interaction studies: implications of Hardy Weinberg equilibrium and gene environment independence

Special Issue Paper. Received 7 January 20, Accepted 28 September 20. Published online 24 February 202 in Wiley Online Library (wileyonlinelibrary.com). DOI: 0.002/sim.4460

Outline of lectures 3-6

GENOME 453, J. Felsenstein, Evolutionary Genetics, Autumn, 009. Population genetics. 1. We want to know what theory says about the reproduction of genotypes in a population. This results

BMI 541/699 Lecture 22

Where we are: 1. Introduction and Experimental Design 2. Exploratory Data Analysis 3. Probability 4. T-based methods for continuous variables 5. Power and sample size for t-based

Statistical Analysis for QBIC Genetics Adapted by Ellen G. Dow 2017

I. χ² or chi-square test. Objectives: Compare how close an experimentally derived value agrees with an expected value. One method to