Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

2 Confounding in genetic association studies
- What is it?
- What is the effect?
- How to detect it?
- What to do about it?

3 What is it? A major threat to the validity of nonrandomized studies is confounding. What is confounding? What is a potential confounding variable for a genetic study? What are the different kinds of population substructure?

4 Population Substructure: features of a population that result in variation of allele and/or genotype frequencies across individuals in the population.

5 Population Substructure: Stratification / Admixture / Inbreeding
- Population stratification: distinct subgroups within a population.
- Population admixture: mating among individuals of different genetic origin over multiple generations. Usually occult.
- Inbreeding: mating between close relatives.

6 Inbreeding. Inbreeding implies a positive probability that an individual inherits the exact same allele (A or a) from both parents, i.e. the parents had a common ancestor. The inbreeding coefficient, $F$, measures the degree of inbreeding in a population: $F = P(\text{a random individual inherits the same ancestral allele from both parents})$. The extreme form of inbreeding ($F = 1$) is pure strains as used in breeding experiments. In randomly mating populations, $F = 0$.

7 Admixture: Gila River American Indian Community. Strong correlation between Gm 3;5,13,14 allele frequency and strata of Indian heritage (measured with number of great-grandparents).

Indian heritage    Gm 3;5,13,14 (%)    Diabetes, age adjusted (%)
0                  69%                 18.5%
4                  45%                 28.6%
8                  .01%                39.2%

Adapted from Knowler, 1988. Gm 3;5,13,14 allele frequency is a marker of Indian heritage. Admixture usually creates a gradient of allele frequencies.

8 What is the effect? Any form of population substructure causes HWD in the population. From Chapter 3: with HWE, $X \sim B(2, p)$, where $X$ is the number of A alleles, so $E(X) = 2p$ and $\mathrm{var}(X) = 2p(1-p)$. With inbreeding (HWD), $E(X) = 2p$ and $\mathrm{var}(X) = 2p(1-p)(1+F)$, where $0 < F < 1$ is the Wahlund coefficient. $F$ is also the coefficient of inbreeding, and the correlation between alleles transmitted by parents. This expression still assumes $p$ is uniform in the population.
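
The variance formula on this slide can be checked by direct enumeration. The sketch below assumes the usual genotype probabilities under inbreeding, $P(AA) = p^2 + Fpq$, $P(Aa) = 2pq(1-F)$, $P(aa) = q^2 + Fpq$ with $q = 1-p$; the slide does not state these, and the function names are illustrative only.

```python
def genotype_probs(p, F):
    """Return (P(X=0), P(X=1), P(X=2)) for allele frequency p and inbreeding
    coefficient F. The parametrization is an assumption, not from the slide."""
    q = 1.0 - p
    return (q * q + F * p * q, 2.0 * p * q * (1.0 - F), p * p + F * p * q)


def mean_and_variance(p, F):
    """Exact mean and variance of X, the number of A alleles per subject."""
    probs = genotype_probs(p, F)
    mean = sum(x * pr for x, pr in enumerate(probs))
    var = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))
    return mean, var


if __name__ == "__main__":
    p, F = 0.3, 0.1
    mean, var = mean_and_variance(p, F)
    print(mean, 2 * p)                      # E(X) is unchanged by inbreeding
    print(var, 2 * p * (1 - p) * (1 + F))   # variance is inflated by (1 + F)
```

Setting `F = 0` recovers the binomial (HWE) variance $2p(1-p)$.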

9 Effect of Substructure. Suppose we have $K$ strata, each with allele frequency $p_k$, $k = 1, \dots, K$, and HWE within each stratum. Then $E(X) = 2p$, where $p$ is the average allele frequency in the population. However, $\mathrm{var}(X)$ is bigger than the binomial variance: $\mathrm{var}(X) = 2pq + 2\,\mathrm{var}(p_k)$, with $q = 1 - p$ (Chapter 3). Both inbreeding and population stratification can cause variance inflation. Almost all tests of association are based on estimated allele frequencies, so population substructure has consequences for inference about sample proportions.
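
As a worked check of $\mathrm{var}(X) = 2pq + 2\,\mathrm{var}(p_k)$, the following sketch computes the variance of $X$ in a mixture of strata in two ways: from the law of total variance and from the slide's formula. The stratum weights and the function names are assumptions made for illustration.

```python
def mixture_variance(strata):
    """strata: list of (weight, allele_freq) pairs, weights summing to 1,
    with HWE assumed inside each stratum. Uses the law of total variance."""
    p_bar = sum(w * p for w, p in strata)
    within = sum(w * 2.0 * p * (1.0 - p) for w, p in strata)       # E[var(X | stratum)]
    between = sum(w * (2.0 * p - 2.0 * p_bar) ** 2 for w, p in strata)  # var(E[X | stratum])
    return within + between


def slide_formula(strata):
    """2pq + 2 var(p_k), with p the weighted average allele frequency."""
    p_bar = sum(w * p for w, p in strata)
    var_p = sum(w * (p - p_bar) ** 2 for w, p in strata)
    return 2.0 * p_bar * (1.0 - p_bar) + 2.0 * var_p


if __name__ == "__main__":
    strata = [(0.5, 0.2), (0.5, 0.6)]   # hypothetical two-stratum population
    print(mixture_variance(strata), slide_formula(strata))
```

Writing $F = \mathrm{var}(p_k)/(pq)$ turns the same expression into $2pq(1+F)$, which is why stratification and inbreeding inflate the variance in the same way.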

10 General Background on Test Statistics. Many test statistics can be expressed as $Z = T/SE(T)$, where $T$ is some statistic of interest and $SE(T)$, the standard error, is the square root of the estimated variance of $T$. Under $H_0$, $E(T) = 0$, $\mathrm{var}(Z) = 1$, and $Z \sim N(0,1)$ or $Z^2 \sim \chi^2$ on 1 df (approximately, in large samples). E.g. trend test: $T = n(\bar{X}_{\text{cases}} - \bar{X}_{\text{controls}})/2$; allele test: $T = n(\hat{p}_{\text{cases}} - \hat{p}_{\text{controls}})$.

11 Effect of Population Substructure in Genetic Association Studies. Population substructure can 1) bias the numerator ($E(T) \neq 0$ under $H_0$), 2) inflate $\mathrm{var}(T)$ (so the estimated $SE(T)$ is too small), or 3) do both. Both of these problems lead to an increased false positive rate. In genetic studies, it is often argued that variance inflation is the worst. A confounder biases the numerator and causes variance inflation.

12 Consider Trend and Alleles Test. Notation: $X$ is the number of A alleles for one subject, $p$ is the proportion of A alleles in the population, $\bar{X}$ and $\hat{p}$ denote sample means and proportions in each group, and SE is the estimated standard error. Assume $r = s = n/2$. The alleles test and the trend test have the same $T$: $T = n(\bar{X}_{\text{cases}} - \bar{X}_{\text{controls}})/2 = n(\hat{p}_{\text{cases}} - \hat{p}_{\text{controls}})$. $SE(T)$ assuming HWE gives the alleles test; $SE(T)$ without HWE gives the trend test. THE TREND TEST CORRECTS FOR HWD (but not population stratification or admixture).
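
A minimal sketch of the two statistics, under the slide's assumption of equal numbers of cases and controls ($r = s = n/2$). With equal groups, $T$ is simply the case allele count minus the control allele count, and $\mathrm{var}(T) = n\,\mathrm{var}(X)$ under $H_0$; the trend test plugs in the empirical variance of $X$ in the pooled sample, while the alleles test plugs in the HWE value $2\hat{p}(1-\hat{p})$. The function name and the genotype-count input format are assumptions.

```python
import math


def trend_and_allele_z(case_counts, control_counts):
    """case_counts / control_counts: genotype counts (n_aa, n_Aa, n_AA).
    Returns (T, Z_trend, Z_allele). Assumes equal group sizes, as on the slide."""
    r, s = sum(case_counts), sum(control_counts)
    if r != s:
        raise ValueError("this sketch assumes r = s = n/2")
    n = r + s
    # T = n(p_cases - p_controls); with r = s = n/2 this is a difference of allele counts
    T = (case_counts[1] + 2 * case_counts[2]) - (control_counts[1] + 2 * control_counts[2])
    pooled = [a + b for a, b in zip(case_counts, control_counts)]
    mean_x = (pooled[1] + 2 * pooled[2]) / n
    var_empirical = (pooled[1] + 4 * pooled[2]) / n - mean_x ** 2   # no HWE assumed
    p_hat = mean_x / 2.0
    var_hwe = 2.0 * p_hat * (1.0 - p_hat)                            # HWE assumed
    z_trend = T / math.sqrt(n * var_empirical)
    z_allele = T / math.sqrt(n * var_hwe)
    return T, z_trend, z_allele


if __name__ == "__main__":
    # hypothetical genotype counts, 100 cases and 100 controls
    print(trend_and_allele_z((30, 50, 20), (40, 50, 10)))
```

The two $Z$ values share a numerator and differ only through the variance estimate, which is the point of the slide.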

13 Consider Trend Test and Population Stratification, 2 strata. Consider 2 strata ($K = 2$), where $p_1$ and $p_2$ are the A allele frequencies in the two strata. Disease rates may also vary over strata: $K_1$, $K_2$. Under $H_0$: $p_{1,\text{cases}} = p_{1,\text{controls}} = p_1$ and $p_{2,\text{cases}} = p_{2,\text{controls}} = p_2$. Thus there is no difference in allele frequencies within strata ($H_0$ is true in both strata).

14 Problem with Population Stratification. If you do not sample cases and controls according to strata, then they will be unbalanced. Suppose $K_1 > K_2$. If strata sizes are equal, expect more cases from stratum 1 and more controls from stratum 2. Ignoring strata in the analysis leads to systematic bias. Let $c$ = proportion of cases from stratum 1 and $d$ = proportion of controls from stratum 1. Then under $H_0$: $E(T) = n\,E(\bar{X}_{\text{cases}} - \bar{X}_{\text{controls}})/2 = n(c - d)(p_1 - p_2)$.

15 Bias in Numerator. $E(T) = n(c - d)(p_1 - p_2)$. To eliminate bias, we need $E(T) = 0$, i.e. $p_1 = p_2$ or $c = d$. $E(c - d) = S_1 S_2 (K_1 - K_2) / (K(1 - K))$, where $S_1$ and $S_2$ are the proportions of the population in each stratum and $K$ is the disease rate in the overall population. $K_1 = K_2$ means $c = d$ on average. The absence of variation in either the disease rate or the allele frequency over strata is sufficient to eliminate the bias.

16 Bias in Numerator. General formula for $K$ strata (indexed by $k$): $E(T) = n\,\mathrm{cov}(p_k, K_k) / (K(1 - K))$. In general, there must be a systematic association between disease rates and allele frequencies ($K_k$ and $p_k$) to get a non-zero covariance. Although rare, strong covariances have been documented, and are more likely with a small number of strata, or admixtures of a small number of populations.
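
The bias formulas can be checked numerically. The sketch below computes $E(T)$ directly, from the share of cases and of controls that each stratum contributes, and again from the covariance formula. It assumes that stratum $k$ makes up a proportion $S_k$ of the population, so that cases come from stratum $k$ with probability $S_k K_k / K$ and controls with probability $S_k (1 - K_k)/(1 - K)$; the function names and the example numbers are hypothetical.

```python
def expected_T_direct(n, strata):
    """strata: list of (S_k, p_k, K_k) = (population share, allele freq, disease rate).
    E(T) = n * (expected case allele freq - expected control allele freq)."""
    K = sum(S * Kk for S, _, Kk in strata)          # overall disease rate
    diff = 0.0
    for S, p, Kk in strata:
        case_share = S * Kk / K                     # proportion of cases from stratum k
        control_share = S * (1.0 - Kk) / (1.0 - K)  # proportion of controls from stratum k
        diff += (case_share - control_share) * p
    return n * diff


def expected_T_cov(n, strata):
    """E(T) = n cov(p_k, K_k) / (K (1 - K)), covariance weighted by S_k."""
    K = sum(S * Kk for S, _, Kk in strata)
    p_bar = sum(S * p for S, p, _ in strata)
    cov = sum(S * (p - p_bar) * (Kk - K) for S, p, Kk in strata)
    return n * cov / (K * (1.0 - K))


if __name__ == "__main__":
    strata = [(0.5, 0.2, 0.10), (0.5, 0.6, 0.05)]   # hypothetical values
    print(expected_T_direct(1000, strata), expected_T_cov(1000, strata))
```

With two strata the direct calculation reduces to $n(c - d)(p_1 - p_2)$, and the bias vanishes whenever all $p_k$ are equal or all $K_k$ are equal.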

17 Effect of Population Substructure on Variance. Trend test: under $H_0$ of no difference in allele frequency, treat the data as a single sample and estimate $\mathrm{var}(X)$ empirically. With admixture, $\mathrm{var}(X) = 2p(1-p)(1+F)$, where $F$ is the Wahlund effect, the coefficient of inbreeding, or the correlation between parental alleles. If there is no population substructure, the empirical variance will estimate $\mathrm{var}(X)$ correctly. However, $F$ accounts for the correlation of alleles WITHIN an individual, but ignores the covariance induced by stratification.

18 Variance Inflation for Trend Test. The trend test, $T = n(\bar{X}_{\text{cases}} - \bar{X}_{\text{controls}})/2$, assumes no substructure but non-zero $F$. It assumes subjects are independent, but alleles within a subject are not. For the alleles test, $(1 + F)$ is omitted; thus the alleles test statistic is bigger because its variance is smaller. With population stratification in case-control designs there is additional variance inflation. That is, the actual variation is bigger than what you assume because the allele frequency varies by stratum.

19 Variance Inflation (Devlin and Roeder, 1999). True $\mathrm{var}(T)$ under stratification: $\mathrm{Var}(T) = 2np(1-p)\,[1 - F + nF(c - d)^2]$. Variance inflation: true $\mathrm{Var}(T \text{ with population substructure}) = \lambda\,\mathrm{var}(T \text{ for simple trend test/allele test})$. Variance inflation is substantially larger than bias because it depends on $n$ (Devlin et al., 2000). Variance inflation increases with $n$, and does not depend on allele frequency. Similar result for the alleles test: $\lambda = 1 - F + nF(c - d)^2$. Estimation of $\lambda$ is the basis for Genomic Control.
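
To see how quickly the inflation grows with sample size, the factor $\lambda = 1 - F + nF(c - d)^2$ can be evaluated for a few values of $n$. This is plain arithmetic on the slide's formula; the values of $F$, $c$ and $d$ below are hypothetical.

```python
def inflation_factor(n, F, c, d):
    """lambda = 1 - F + n F (c - d)^2, the variance inflation factor from the slide."""
    return 1.0 - F + n * F * (c - d) ** 2


if __name__ == "__main__":
    F, c, d = 0.01, 0.55, 0.45   # hypothetical: modest differentiation, 10-point imbalance
    for n in (100, 1000, 10000):
        print(n, round(inflation_factor(n, F, c, d), 3))
```

With balanced sampling ($c = d$) the factor is $1 - F$ regardless of $n$; the dependence on $n$ comes entirely from the imbalance term.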

20 Variance Inflation Factor. Estimate $\lambda$ to check whether $\lambda > 1$. This is the basis for genomic control. How to detect it? Q-Q plots: since p-values follow a uniform distribution between 0 and 1 under the null hypothesis, we can plot the observed p-values against the quantiles of a uniform distribution.
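
A minimal sketch of how the points of such a Q-Q plot can be computed. It uses the plotting positions $(i + 0.5)/L$ for the expected uniform quantiles and a $-\log_{10}$ scale on both axes; both are common conventions chosen here for illustration, not something the slide specifies.

```python
import math


def qq_points(p_values):
    """Return a list of (expected, observed) pairs on the -log10 scale,
    ordered from the smallest p-value to the largest."""
    L = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 0.5) / L for i in range(L)]   # uniform quantiles (assumed convention)
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


if __name__ == "__main__":
    # hypothetical p-values; points above the diagonal indicate excess small p-values
    for e, o in qq_points([0.001, 0.02, 0.3, 0.5, 0.9]):
        print(round(e, 3), round(o, 3))
```

Points that rise above the line "observed = expected" across the whole range, rather than only in the tail, are the usual sign of inflation from substructure.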

21 What to do about it?
- Match or stratify on ethnic ancestry
  - Self report may not be accurate
  - Ethnicity (e.g. "race") may not be a good surrogate for ancestry
  - Difficult to match admixed-ancestry subjects
- Use family-based controls
  - Siblings (conditional logistic)
  - Case-parent pseudocontrols (TDT, FBAT, etc.)
- Adjust using multiple unlinked markers
  - Genomic control: estimate the inflation factor and adjust accordingly
  - Population components: infer population substructure using continuous axes of variation
  - Linear mixed effect models

22 Genomic Control (Devlin and Roeder, 1999). Idea: null markers should have the correct chi-squared distribution if there is no substructure. Use data at multiple markers (which should be unlinked and null) to estimate $\lambda$. How to estimate $\lambda$? Let $\chi^2_1, \chi^2_2, \dots, \chi^2_L$ denote the trend test statistics calculated for the $L$ markers. Then $\chi^2_1, \chi^2_2, \dots, \chi^2_L$ should look like a sample of $L$ chi-square random variables with one df. The sample $\mathrm{var}(\chi^2_j) = 2$ if there is no variance inflation, so one could calculate $\mathrm{var}(\chi^2_j)$ and compare it to 2.

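23 Genomic Control (Devlin and Roeder, 1999). However, the variance is very sensitive to outliers, and one bad test can ruin the estimate, so use $\hat{\lambda} = \mathrm{median}[\chi^2_1, \chi^2_2, \dots, \chi^2_L] / 0.456$, where 0.456 is approximately the median of a $\chi^2_1$ distribution. Use the estimated $\lambda$ to adjust the variance of the trend test: $\chi^2_T / \hat{\lambda} \sim$ approximately $\chi^2_1$ under $H_0$. This assumes $\lambda$ is constant across loci, similar mutation rates, limited selection at the marker, and limited variation in $F$ across subpopulations. Note: the book uses $1/\lambda$.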
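
The estimate and the adjustment can be sketched in a few lines. The divisor is the median of a $\chi^2_1$ distribution, which the code derives from the standard normal: if $Z \sim N(0,1)$ then $Z^2 \sim \chi^2_1$, so the median is $[\Phi^{-1}(0.75)]^2 \approx 0.455$. The function name is illustrative, and the input is assumed to be a list of 1-df trend statistics from null, unlinked markers.

```python
from statistics import NormalDist, median

# Median of a chi-square with 1 df, from the normal quantile (about 0.455)
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control(chi2_stats):
    """Return (lambda_hat, adjusted statistics) for a list of 1-df chi-square statistics."""
    lam = median(chi2_stats) / CHI2_1_MEDIAN
    adjusted = [s / lam for s in chi2_stats]
    return lam, adjusted


if __name__ == "__main__":
    stats = [0.1, 0.3, 0.7, 1.2, 4.0]    # hypothetical null-marker statistics
    lam, adjusted = genomic_control(stats)
    print(round(lam, 3), [round(a, 3) for a in adjusted])
```

In practice the estimate is often floored at 1 so that statistics are never inflated by the correction; that refinement is left out of the sketch.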

24 How large a problem? (Freedman et al., 2004) Examined 11 case-control/cohort studies for evidence of excess variation ($\lambda > 1$) (Reich and Goldstein, 2001). No evidence of excess variation for the original range of sample sizes ( cases) and number of SNPs (24-48). Expanded two case-cohort studies of prostate cancer.

25 Prostate Cancer Studies (Freedman et al., 2004). Prostate cancer among African Americans.
Columns: Cases/Controls, SNPs, P-value ($\lambda = 1$), Estimated $\lambda$
Original: 90/69, 48, <
Expanded: 469/, <.04
AIM* included: 474/, <.0001
*SNPs chosen to discriminate ancestry are included.

26 Prostate Cancer in African Americans (Freedman et al.). Removed 40/474 cases and 48/476 controls with self-report of non-African-American ancestry. The P-value for all 211 markers is $10^{-7}$; $\lambda$ is 1.5 (extrapolated to 1000 cases and controls). Typical-sized studies may not detect variance inflation, and using self-reported heritage may not solve the problem. Larger studies will likely have more substructure. With GWAS data, different approaches can be used to adjust for substructure.

27 Second approach: Principal Components, designed for GWAS. With a GWAS, we get 500K or more markers, essentially enough to nail down an individual's genetic signature (do not use imputed SNPs). Form the variance-covariance matrix of individual genetic variation, denoted by $C$. Use PCs to determine the major axes, or continuous axes, of genetic variation.

28 Box 7.2 Principal component adjustment of association studies.
Notation: $M$ = number of SNPs; $N$ = number of subjects; $X = (z_{ij})$ is an $M \times N$ matrix of standardized genotypes coded for the additive model for the $i$th SNP in the $j$th proband, i.e. $z_{ij} = (X_{ij} - \bar{X}_{i\cdot}) / \sqrt{p_i (1 - p_i)}$, where $p_i$ estimates the population frequency of the $i$th SNP.
The algorithm:
Step 1: Compute the variance-covariance matrix for the probands as $C = (X^T X)/(N - 1)$.
Step 2: Compute the eigenvalue decomposition of the covariance matrix.
Step 3: Select the top $K$ eigenvalues that are statistically significant.
Step 4: Include the significant eigenvectors in the linear regression of the phenotype on the marker, or use the significant eigenvectors to match cases and controls, and do a matched-pair analysis.
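
Steps 1 and 2 of the box can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the book's implementation: genotypes are standardized with the square root of $p_i(1-p_i)$, monomorphic SNPs are dropped to avoid division by zero, and Step 3 is replaced by simply keeping a fixed number `k` of leading eigenvectors (no significance test is performed). `simulate_two_populations` is a hypothetical helper used only to generate toy data.

```python
import numpy as np


def genotype_pcs(G, k=2):
    """G: M x N array of allele counts (0/1/2), SNPs in rows, subjects in columns.
    Returns the k largest eigenvalues of C and the matching eigenvectors (N x k)."""
    G = np.asarray(G, dtype=float)
    N = G.shape[1]
    p = G.mean(axis=1) / 2.0                      # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)                      # drop monomorphic SNPs
    G, p = G[keep], p[keep]
    X = (G - 2.0 * p[:, None]) / np.sqrt(p * (1.0 - p))[:, None]   # standardized genotypes
    C = X.T @ X / (N - 1)                         # Step 1: N x N covariance among probands
    eigvals, eigvecs = np.linalg.eigh(C)          # Step 2: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]         # keep the k largest (stand-in for Step 3)
    return eigvals[order], eigvecs[:, order]


def simulate_two_populations(n_snps=300, n_per_pop=20, seed=0):
    """Toy data: two populations whose allele frequencies differ by 0.4 at every SNP."""
    rng = np.random.default_rng(seed)
    p1 = rng.uniform(0.1, 0.4, size=n_snps)
    p2 = p1 + 0.4
    G1 = rng.binomial(2, p1[:, None], size=(n_snps, n_per_pop))
    G2 = rng.binomial(2, p2[:, None], size=(n_snps, n_per_pop))
    labels = np.repeat([0, 1], n_per_pop)
    return np.hstack([G1, G2]), labels


if __name__ == "__main__":
    G, labels = simulate_two_populations()
    vals, vecs = genotype_pcs(G, k=2)
    print(vals)
    print(vecs[labels == 0, 0].mean(), vecs[labels == 1, 0].mean())
```

In the toy data the first eigenvector separates the two simulated populations, which is the behaviour Step 4 relies on when the eigenvectors are used as covariates or for matching.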

29 Principal Components of $C$. Let $j$ index people and $i$ index SNPs. Diagonal: $C_{jj} = \mathrm{var}(g_{ij})$, the variance over SNPs for person $j$. Off-diagonal: $C_{jj'} = \mathrm{cor}(g_{ij}, g_{ij'})$ for pairs $j$ and $j'$; a high correlation indicates that 2 individuals have very similar SNP profiles, i.e. a common genetic background. Eigenvalues and eigenvectors of $C$: eigenvectors are linear combinations of SNPs that maximize differences in people, or maximize variance in SNPs across people; eigenvalues give how much of the total variation is explained by each eigenvector.

30 [Figure: scatter plot of cases (+) and controls (x) plotted against eigenvector 1.]

31 PCA on the 1000 Genomes Project

32 Principal Components: Designed for GWAS. Let $Z_{1j}, Z_{2j}, \dots$ denote the eigenvectors for the $j$th person (use only the top 10 PCs or so): $g(E(Y \mid X)) = a + bX + c_1 Z_1 + c_2 Z_2 + \dots$ One can also use $Z_{1j}, Z_{2j}, \dots$ to cluster, or stratify, individuals into homogeneous strata.
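
A small sketch of the adjustment, taking $g$ to be the identity link (ordinary least squares) for simplicity; a case-control analysis would normally use a logistic link instead. The function name and the toy data are assumptions. In the toy data the phenotype depends only on the PC score, so the unadjusted marker effect is spurious and disappears once the PC is included.

```python
import numpy as np


def marker_effect(y, x, pcs=None):
    """Least-squares estimate of b in y = a + b*x + c1*Z1 + c2*Z2 + ...
    pcs: optional array with one column per principal component."""
    y = np.asarray(y, dtype=float)
    cols = [np.ones_like(y), np.asarray(x, dtype=float)]
    if pcs is not None:
        pcs = np.asarray(pcs, dtype=float).reshape(len(y), -1)
        cols.extend(pcs.T)                       # one covariate column per PC
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1]                               # coefficient on the marker


if __name__ == "__main__":
    z = np.array([-1, -1, -1, -1, 1, 1, 1, 1])   # hypothetical PC score (two ancestry groups)
    x = np.array([0, 0, 1, 1, 1, 1, 2, 2])       # genotype, more A alleles in the second group
    y = 3.0 * z                                  # phenotype driven by ancestry only
    print(marker_effect(y, x))                   # spurious association
    print(marker_effect(y, x, pcs=z))            # essentially zero after adjustment
```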

33 Linear mixed models. PCA does not account for cryptic relatedness between individuals, which may also inflate the tests. Therefore, use a linear mixed model: $Y = XB + \mu + \varepsilon$, with $\mathrm{Var}(\mu) = \sigma_g^2 K$. Here $\mu$ is the random effect, which represents the heritable component of random variation, and $K$ is the genetic similarity/relatedness matrix, built from the pairwise genotype similarity of individuals; it is influenced by population structure, family structure and cryptic relatedness. Software: EMMAX, BOLT-LMM, etc.
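
The pieces of this model can be sketched as follows. The code is not how EMMAX or BOLT-LMM work internally; it only illustrates the structure. It assumes $K$ is computed as the average cross-product of genotypes standardized by $\sqrt{2p(1-p)}$ (one common choice), that $\mathrm{Var}(\varepsilon) = \sigma_e^2 I$, and that the variance components are already known, whereas real software estimates them (for example by REML). All function names are hypothetical.

```python
import numpy as np


def relatedness_matrix(G):
    """G: M x N allele counts (0/1/2), all SNPs polymorphic. Returns an N x N matrix K."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=1) / 2.0
    Z = (G - 2.0 * p[:, None]) / np.sqrt(2.0 * p * (1.0 - p))[:, None]
    return Z.T @ Z / G.shape[0]


def gls_fixed_effects(y, X, K, sigma_g2, sigma_e2):
    """Generalized least squares estimate of B in Y = XB + mu + eps,
    with Var(Y) = sigma_g2 * K + sigma_e2 * I (variance components assumed known)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    V = sigma_g2 * K + sigma_e2 * np.eye(len(y))
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)


if __name__ == "__main__":
    G = np.array([[0, 1, 2, 1], [2, 1, 0, 1], [0, 0, 1, 2]])   # toy genotypes, 3 SNPs x 4 people
    K = relatedness_matrix(G)
    X = np.column_stack([np.ones(4), [0, 1, 2, 1]])            # intercept + tested marker
    y = np.array([1.0, 2.0, 2.5, 1.5])
    print(gls_fixed_effects(y, X, K, sigma_g2=0.5, sigma_e2=0.5))
```

When $K$ is the identity matrix the estimate reduces to ordinary least squares, which makes explicit that the relatedness matrix is what distinguishes the mixed model from a plain regression.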

34 Consequences of Substructure. If both strata allele frequencies AND disease rates vary, then $E(T) \neq 0$ and there is bias. If only allele frequencies vary, but the sample is unbalanced ($c \neq d$), there is variance inflation. This is the biggest concern, and the problem increases as $n$ increases. Using PCs may NOT fix variance inflation.

35 Summary
- What is population substructure?
- What is the effect? $E(T)$ biased, $\mathrm{Var}(T)$ inflated.
- How to detect it? QQ plot, genomic control.
- What to do about it? Genomic control, PCA, linear mixed models.
Active research on correcting for population substructure in sequencing data.
Reading: Price, AL, et al. New approaches to population stratification in genome-wide association studies. Nat Rev Genet.

Methods for Cryptic Structure. Methods for Cryptic Structure

Methods for Cryptic Structure. Methods for Cryptic Structure Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases

More information

(Genome-wide) association analysis

(Genome-wide) association analysis (Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55 Announcements I April

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim Evolu&on, Popula&on Gene&cs, and Natural Selec&on 02-710 Computa.onal Genomics Seyoung Kim Phylogeny of Mammals Phylogene&cs vs. Popula&on Gene&cs Phylogene.cs Assumes a single correct species phylogeny

More information

Recombina*on and Linkage Disequilibrium (LD)

Recombina*on and Linkage Disequilibrium (LD) Recombina*on and Linkage Disequilibrium (LD) A B a b r = recombina*on frac*on probability of an odd Number of crossovers occur Between our markers 0

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

Asymptotic distribution of the largest eigenvalue with application to genetic data

Asymptotic distribution of the largest eigenvalue with application to genetic data Asymptotic distribution of the largest eigenvalue with application to genetic data Chong Wu University of Minnesota September 30, 2016 T32 Journal Club Chong Wu 1 / 25 Table of Contents 1 Background Gene-gene

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 18: Introduction to covariates, the QQ plot, and population structure II + minimal GWAS steps Jason Mezey jgm45@cornell.edu April

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

Breeding Values and Inbreeding. Breeding Values and Inbreeding

Breeding Values and Inbreeding. Breeding Values and Inbreeding Breeding Values and Inbreeding Genotypic Values For the bi-allelic single locus case, we previously defined the mean genotypic (or equivalently the mean phenotypic values) to be a if genotype is A 2 A

More information

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

Partitioning the Genetic Variance

Partitioning the Genetic Variance Partitioning the Genetic Variance 1 / 18 Partitioning the Genetic Variance In lecture 2, we showed how to partition genotypic values G into their expected values based on additivity (G A ) and deviations

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs. Supplementary Figure 1 Number of cases and proxy cases required to detect association at designs. = 5 10 8 for case control and proxy case control The ratio of controls to cases (or proxy cases) is 1.

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

1. Understand the methods for analyzing population structure in genomes

1. Understand the methods for analyzing population structure in genomes MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW3: Population Genetics Due: 24:00 EST, April 4, 2016 by autolab Your goals in this assignment are to 1. Understand the methods for analyzing population

More information

Overview of clustering analysis. Yuehua Cui

Overview of clustering analysis. Yuehua Cui Overview of clustering analysis Yuehua Cui Email: cuiy@msu.edu http://www.stt.msu.edu/~cui A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Heritability and the response to selec2on

Heritability and the response to selec2on Heritability and the response to selec2on Resemblance between rela2ves in Quan2ta2ve traits A trait with L loci Each segregating an allele A 1 at freq. p l Each copy of the A 1 allele at a locus increasing

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#5:(Mar-21-2010) Genome Wide Association Studies 1 Experiments on Garden Peas Statistical Significance 2 The law of causality...

More information

The Quantitative TDT

The Quantitative TDT The Quantitative TDT (Quantitative Transmission Disequilibrium Test) Warren J. Ewens NUS, Singapore 10 June, 2009 The initial aim of the (QUALITATIVE) TDT was to test for linkage between a marker locus

More information

Genetic Association Studies in the Presence of Population Structure and Admixture

Genetic Association Studies in the Presence of Population Structure and Admixture Genetic Association Studies in the Presence of Population Structure and Admixture Purushottam W. Laud and Nicholas M. Pajewski Division of Biostatistics Department of Population Health Medical College

More information

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Hui Zhou, Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 April 30,

More information

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo Friday Harbor 2017 From Genetics to GWAS (Genome-wide Association Study) Sept 7 2017 David Fardo Purpose: prepare for tomorrow s tutorial Genetic Variants Quality Control Imputation Association Visualization

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Data Processing Techniques

Data Processing Techniques Universitas Gadjah Mada Department of Civil and Environmental Engineering Master in Engineering in Natural Disaster Management Data Processing Techniques Hypothesis Tes,ng 1 Hypothesis Testing Mathema,cal

More information

Efficient Bayesian mixed model analysis increases association power in large cohorts

Efficient Bayesian mixed model analysis increases association power in large cohorts Linear regression Existing mixed model methods New method: BOLT-LMM Time O(MM) O(MN 2 ) O MN 1.5 Corrects for confounding? Power Efficient Bayesian mixed model analysis increases association power in large

More information

Sample sta*s*cs and linear regression. NEU 466M Instructor: Professor Ila R. Fiete Spring 2016

Sample sta*s*cs and linear regression. NEU 466M Instructor: Professor Ila R. Fiete Spring 2016 Sample sta*s*cs and linear regression NEU 466M Instructor: Professor Ila R. Fiete Spring 2016 Mean {x 1,,x N } N samples of variable x hxi 1 N NX i=1 x i sample mean mean(x) other notation: x Binned version

More information

Linear Regression and Correla/on. Correla/on and Regression Analysis. Three Ques/ons 9/14/14. Chapter 13. Dr. Richard Jerz

Linear Regression and Correla/on. Correla/on and Regression Analysis. Three Ques/ons 9/14/14. Chapter 13. Dr. Richard Jerz Linear Regression and Correla/on Chapter 13 Dr. Richard Jerz 1 Correla/on and Regression Analysis Correla/on Analysis is the study of the rela/onship between variables. It is also defined as group of techniques

More information

Linear Regression and Correla/on

Linear Regression and Correla/on Linear Regression and Correla/on Chapter 13 Dr. Richard Jerz 1 Correla/on and Regression Analysis Correla/on Analysis is the study of the rela/onship between variables. It is also defined as group of techniques

More information

Differen'al Privacy with Bounded Priors: Reconciling U+lity and Privacy in Genome- Wide Associa+on Studies

Differen'al Privacy with Bounded Priors: Reconciling U+lity and Privacy in Genome- Wide Associa+on Studies Differen'al Privacy with Bounded Priors: Reconciling U+lity and Privacy in Genome- Wide Associa+on Studies Florian Tramèr, Zhicong Huang, Erman Ayday, Jean- Pierre Hubaux ACM CCS 205 Denver, Colorado,

More information

Correla'on. Keegan Korthauer Department of Sta's'cs UW Madison

Correla'on. Keegan Korthauer Department of Sta's'cs UW Madison Correla'on Keegan Korthauer Department of Sta's'cs UW Madison 1 Rela'onship Between Two Con'nuous Variables When we have measured two con$nuous random variables for each item in a sample, we can study

More information

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda 1 Population Genetics with implications for Linkage Disequilibrium Chiara Sabatti, Human Genetics 6357a Gonda csabatti@mednet.ucla.edu 2 Hardy-Weinberg Hypotheses: infinite populations; no inbreeding;

More information

Garvan Ins)tute Biosta)s)cal Workshop 16/6/2015. Tuan V. Nguyen. Garvan Ins)tute of Medical Research Sydney, Australia

Garvan Ins)tute Biosta)s)cal Workshop 16/6/2015. Tuan V. Nguyen. Garvan Ins)tute of Medical Research Sydney, Australia Garvan Ins)tute Biosta)s)cal Workshop 16/6/2015 Tuan V. Nguyen Tuan V. Nguyen Garvan Ins)tute of Medical Research Sydney, Australia Introduction to linear regression analysis Purposes Ideas of regression

More information

Case-Control Association Testing. Case-Control Association Testing

Case-Control Association Testing. Case-Control Association Testing Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits. Technological advances have made it feasible to perform case-control association studies

More information

PCA, admixture proportions and SFS for low depth NGS data. Anders Albrechtsen

PCA, admixture proportions and SFS for low depth NGS data. Anders Albrechtsen PCA, admixture proportions and SFS for low depth NGS data Anders Albrechtsen Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele

More information

PCA vignette Principal components analysis with snpstats

PCA vignette Principal components analysis with snpstats PCA vignette Principal components analysis with snpstats David Clayton October 30, 2018 Principal components analysis has been widely used in population genetics in order to study population structure

More information

Population Structure

Population Structure Ch 4: Population Subdivision Population Structure v most natural populations exist across a landscape (or seascape) that is more or less divided into areas of suitable habitat v to the extent that populations

More information

Sta$s$cs for Genomics ( )

Sta$s$cs for Genomics ( ) Sta$s$cs for Genomics (140.688) Instructor: Jeff Leek Slide Credits: Rafael Irizarry, John Storey No announcements today. Hypothesis testing Once you have a given score for each gene, how do you decide

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans University of Bristol This Session Identity by Descent (IBD) vs Identity by state (IBS) Why is IBD important? Calculating IBD probabilities Lander-Green Algorithm

More information

Class Notes. Examining Repeated Measures Data on Individuals

Class Notes. Examining Repeated Measures Data on Individuals Ronald Heck Week 12: Class Notes 1 Class Notes Examining Repeated Measures Data on Individuals Generalized linear mixed models (GLMM) also provide a means of incorporang longitudinal designs with categorical

More information

FaST Linear Mixed Models for Genome-Wide Association Studies

FaST Linear Mixed Models for Genome-Wide Association Studies FaST Linear Mixed Models for Genome-Wide Association Studies Christoph Lippert 1-3, Jennifer Listgarten 1,3, Ying Liu 1, Carl M. Kadie 1, Robert I. Davidson 1, and David Heckerman 1,3 1 Microsoft Research

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values. Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 2013

Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values. Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 2013 Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 013 1 Estimation of Var(A) and Breeding Values in General Pedigrees The classic

More information

Linkage and Linkage Disequilibrium
