1. Understand the methods for analyzing population structure in genomes



MSCBIO 2070/02-710: Computational Genomics, Spring 2016
HW3: Population Genetics
Due: 24:00 EST, April 4, 2016, by Autolab

Your goals in this assignment are to

1. Understand the methods for analyzing population structure in genomes
2. Understand the methods for identifying disease loci in genomes
3. Explore the approach for identifying structural variants in genomes

What to hand in.

One report (in PDF format) addressing each of the following questions, including the figures generated by R when appropriate.

All source code for the R exercises. We should be able to run the source code and produce the figures requested.

Submit a zip file containing the completed code (if any) and the PDF file (if any) to Autolab. The zip file should have the following structure:

./s2016hw3.pdf
./q3/    put all code related to Q3 here, if any

1. [15 points] Hardy-Weinberg Equilibrium

(a) (5 points) Show that the Hardy-Weinberg equilibrium holds for three alleles. [Hint: Assume allele frequencies $p$, $q$, and $r$ ($p + q + r = 1$) for the three alleles $A_1$, $A_2$, and $A_3$.]

From the allele frequencies we can derive the genotype frequencies in the offspring. For genotype $A_1 A_2$,

$$P(A_1 A_2) = pq + qp = 2pq.$$

The frequencies of all possible genotypes are given in the following table:

Genotype    Frequency
A_1 A_1     p^2
A_1 A_2     2pq
A_1 A_3     2pr
A_2 A_2     q^2
A_2 A_3     2qr
A_3 A_3     r^2

From these genotype frequencies we can calculate the allele frequencies $p'$, $q'$, and $r'$ in the offspring. Each genotype carries two alleles, so

$$p' = \frac{2p^2 + 2pq + 2pr}{2} = p(p + q + r) = p$$
$$q' = \frac{2q^2 + 2pq + 2qr}{2} = q(p + q + r) = q$$
$$r' = \frac{2r^2 + 2pr + 2qr}{2} = r(p + q + r) = r$$

Thus the Hardy-Weinberg equilibrium holds for three alleles.

(b) (5 points) The numbers of individuals with genotypes AA, Aa, and aa at a locus are given as 232, 36, and 6, respectively. Perform a chi-square test to see if the Hardy-Weinberg Equilibrium holds for this locus at significance level $\alpha = 0.05$. Use 1 degree of freedom.

We use $p$ to represent the allele frequency of A, and $q$ that of a. The total number of observations is $232 + 36 + 6 = 274$.

$$p = \frac{2 \times 232 + 36}{2 \times 274} = 0.912, \qquad q = \frac{2 \times 6 + 36}{2 \times 274} = 0.088$$

Then we calculate the expected genotype counts:

$$E(AA) = 274 \times 0.912^2 = 227.9$$
$$E(aa) = 274 \times 0.088^2 = 2.1$$
$$E(Aa) = 274 \times 2 \times 0.912 \times 0.088 = 44.0$$

Afterwards we calculate the $\chi^2$ statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(232 - 227.9)^2}{227.9} + \frac{(6 - 2.1)^2}{2.1} + \frac{(36 - 44)^2}{44} = 8.77$$

From the $\chi^2$ distribution table, $\chi^2_{0.05,\,df=1} = 3.841$. Since $8.77 > 3.841$, we reject the null hypothesis: the Hardy-Weinberg Equilibrium does not hold at this locus.
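The test in (b) can be reproduced in a few lines of R. This is a minimal sketch rather than the graders' reference code, and the variable names are illustrative. It keeps the allele frequencies at full precision, so the statistic comes out at about 8.68 instead of the 8.77 obtained with frequencies rounded to three decimals; the conclusion is the same.

```r
# Observed genotype counts from part (b)
obs <- c(AA = 232, Aa = 36, aa = 6)
n <- sum(obs)

# Allele frequencies: each AA carries two copies of A, each Aa carries one
p <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * n)
q <- 1 - p

# Expected genotype counts under Hardy-Weinberg proportions
expected <- n * c(AA = p^2, Aa = 2 * p * q, aa = q^2)

# Pearson chi-square statistic and the 5% critical value with 1 degree of freedom
chi_sq <- sum((obs - expected)^2 / expected)
critical <- qchisq(0.95, df = 1)
reject_hwe <- chi_sq > critical

print(round(expected, 2))
print(c(chi_sq = chi_sq, critical = critical))
```

One degree of freedom is used because there are three genotype classes, one constraint from the total count, and one estimated parameter (the allele frequency).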

(c) (5 points) The Wright-Fisher model as illustrated in the lecture notes can be considered a Markov chain. If we denote the number of copies of allele A in the population in generation $n$ by $X_n$, then the sequence $X_0, X_1, \ldots$ is a Markov chain whose set of possible outcomes is $\{0, 1, 2, \ldots, 2N\}$. The transition matrix of the chain is given by the binomial distribution $B(2N, i/2N)$:

$$p_{ij} = P(X_{n+1} = j \mid X_n = i) = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N - j}$$

Show that $E(X_{n+1} \mid X_n = i) = i$. How does it relate to Hardy-Weinberg equilibrium?

Since $X_{n+1} \mid X_n = i \sim B(2N, \frac{i}{2N})$, the mean of the binomial gives

$$E(X_{n+1} \mid X_n = i) = 2N \cdot \frac{i}{2N} = i.$$

Since $E(X_{n+1} \mid X_n) = X_n$, the expected allele count, and hence the expected allele frequency, does not change from one generation to the next. The Hardy-Weinberg equilibrium therefore holds for the expected frequencies.

2. [5 points] HMM in PHASE and STRUCTURE

Assuming $K$ ancestral chromosomes, the transition probabilities in the hidden Markov models in PHASE, as well as those embedded in the linkage model extension of STRUCTURE, model the presence or absence of recombination events between locus $l$ and locus $l + 1$ separated by distance $d_l$. The transition probabilities from ancestral chromosome state label $z_l = k$ to $z_{l+1} = k'$ for $k, k' \in \{1, \ldots, K\}$ are given as

$$P(z_{l+1} = k' \mid z_l = k) = \begin{cases} \exp(-d_l r) + (1 - \exp(-d_l r))\, q_{k'} & \text{if } k' = k \\ (1 - \exp(-d_l r))\, q_{k'} & \text{otherwise} \end{cases}$$

where $r$ is the per-basepair recombination rate and the $q_i$ for $i = 1, \ldots, K$ are prior probabilities for each of the $K$ states, summing to 1. Consider the case where the underlying genome block structure has $z_l = z_{l+1} = k$ but has a small segment from ancestral chromosome $m \neq k$ inserted between loci $l$ and $l + 1$. How is this scenario modeled by the transition probabilities above?

We use $R$ to denote the number of recombination events between loci $l$ and $l + 1$, which follows a Poisson distribution. Since $r$ is the per-basepair recombination rate and the distance between locus $l$ and locus $l + 1$ is $d_l$, the mean of the Poisson distribution is $d_l r$. Thus the probability mass function of $R$ is

$$p(R) = \frac{(d_l r)^R e^{-d_l r}}{R!}$$
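Suppose $z_l = z_{l+1} = k$. Then $P(z_{l+1} = k \mid z_l = k)$ can be calculated as follows. Here $p(R = n)$ is the Poisson probability of exactly $n$ recombination events, and $P(R = n)$ is the probability of exactly $n$ recombination events after which the chain is in state $k$ at locus $l + 1$.

$$
\begin{aligned}
P(z_{l+1} = k \mid z_l = k) &= P(R = 0) + P(R > 0) && (1) \\
&= p(R = 0) + P(R = 1) + P(R = 2) + P(R = 3) + \cdots && (2) \\
&= p(R = 0) + p(R = 1)\, q_k + p(R = 2) \sum_{i=1}^{K} q_i q_k + p(R = 3) \sum_{i=1}^{K} \sum_{j=1}^{K} q_i q_j q_k + \cdots && (3) \\
&= p(R = 0) + p(R = 1)\, q_k + p(R = 2)\, q_k \sum_{i=1}^{K} q_i + p(R = 3)\, q_k \sum_{i=1}^{K} \sum_{j=1}^{K} q_i q_j + \cdots && (4) \\
&= p(R = 0) + p(R = 1)\, q_k + p(R = 2)\, q_k + p(R = 3)\, q_k + \cdots \quad \Big(\text{because } \sum_{i=1}^{K} q_i = 1\Big) && (5) \\
&= p(R = 0) + q_k \,(1 - p(R = 0)) && (6) \\
&= e^{-d_l r} + (1 - e^{-d_l r})\, q_k && (7)
\end{aligned}
$$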
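The collapse of the infinite sum in (2) to the closed form in (7) can be checked numerically. The sketch below assumes illustrative values for $d_l$, $r$, and the priors $q$ (none of them come from the assignment) and truncates the Poisson sum at 50 events, which is far more than enough for a mean of 0.05. It also assembles the full $K \times K$ transition matrix and confirms that each row sums to 1.

```r
# Hypothetical parameters, chosen only for illustration
d_l <- 5000                 # distance between loci l and l+1, in bp
r   <- 1e-5                 # per-basepair recombination rate
q   <- c(0.5, 0.3, 0.2)     # prior over K = 3 ancestral chromosomes, sums to 1
k   <- 2                    # state at both locus l and locus l+1
K   <- length(q)
lambda <- d_l * r           # Poisson mean of the number of recombination events

# Closed form, equation (7)
closed_form <- exp(-lambda) + (1 - exp(-lambda)) * q[k]

# Series, equations (2) to (5): with 0 events the state stays k with probability 1,
# with n >= 1 events the final state is drawn from q, so it is k with probability q[k]
R_max <- 50
pois <- dpois(0:R_max, lambda)
end_in_k <- c(1, rep(q[k], R_max))
series <- sum(pois * end_in_k)

# Full transition matrix: row = current state, column = next state
trans <- (1 - exp(-lambda)) * matrix(q, nrow = K, ncol = K, byrow = TRUE) +
  exp(-lambda) * diag(K)

print(c(series = series, closed_form = closed_form))
print(rowSums(trans))
```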

If $z_l = z_{l+1} = k$ and there is one small segment from ancestral chromosome $m \neq k$ inserted between loci $l$ and $l + 1$, the probability of this scenario can be calculated more explicitly. It requires exactly two recombination events, the first switching to $m$ and the second switching back to $k$:

$$P(\text{one small insertion}) = p(R = 2)\, q_m q_k, \quad m \neq k$$

This probability is one part of the term $P(R = 2)$ in equation (2).

3. [10 points] PCA and Population Structure

Consider the SNP genotype data for 5912 loci on chromosome 2 from 423 individuals provided with this homework in file snp.txt. Each individual is from one of the following six populations:

CEU: Utah residents with Northern and Western European ancestry from the CEPH collection
CHB: Han Chinese in Beijing, China
JPT: Japanese in Tokyo, Japan
LWK: Luhya in Webuye, Kenya
MEX: Mexican ancestry in Los Angeles, California
YRI: Yoruba in Ibadan, Nigeria

The ancestry labels for each individual are provided in file sample_names_with_population_labels.txt.

(a) (5 points) Perform PCA and plot the ancestry of the individuals in 2 dimensions using the first two principal components, as was discussed in class. Use different colors for different true ancestries. Include your plot and code.

If you perform PCA on the SNP matrix without scaling, or perform PCA on the covariance matrix constructed as $\frac{1}{n} X^\top X$ without scaling:

[Figure: PC1 vs. PC2 scatter plot, points colored by population (CEU, CHB, JPT, LWK, MEX, YRI)]
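The two unscaled routes just mentioned, PCA on the matrix itself and eigendecomposition of $\frac{1}{n} X^\top X$, can be compared on synthetic data. This is a minimal sketch under stated assumptions: $X$ is taken to be the column-centered individuals-by-SNPs matrix (the text does not say whether $X$ is centered), and the toy genotype matrix stands in for snp.txt. All variable names are illustrative.

```r
set.seed(1)
n <- 50   # hypothetical number of individuals
m <- 8    # hypothetical number of SNPs

# Toy genotype matrix: minor-allele counts 0/1/2, individuals in rows
X <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n)

# Route 1: PCA directly on the unscaled matrix (prcomp centers the columns)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Route 2: eigendecomposition of (1/n) X^T X on the centered matrix
Xc  <- scale(X, center = TRUE, scale = FALSE)
eig <- eigen(crossprod(Xc) / n, symmetric = TRUE)
scores_eig <- Xc %*% eig$vectors[, 1:2]

# The projections agree up to the sign of each component
max_diff <- max(abs(abs(scores_eig) - abs(pca$x[, 1:2])))

# The eigenvalues differ only by the factor (n - 1) / n,
# because prcomp divides by n - 1 rather than n
eig_ratio <- eig$values / pca$sdev^2

print(max_diff)
print(eig_ratio)
```

Because the two routes differ only by a constant factor on the eigenvalues, the resulting PC1 vs. PC2 plots have the same shape.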

If you perform PCA on the covariance matrix constructed by the cov() function:

[Figure: PC1 vs. PC2 scatter plot, points colored by population (CEU, CHB, JPT, LWK, MEX, YRI)]

If you scale the original SNP matrix and perform PCA on it:

[Figure: PC1 vs. PC2 scatter plot, points colored by population (CEU, CHB, JPT, LWK, MEX, YRI)]

    library(ggplot2)

    # Population label for each sample, one row per individual
    popdata <- read.table("sample_names_with_population_labels.txt", header = FALSE)
    colnames(popdata) <- c("sample.id", "pop_code")

    # SNP matrix with loci in rows and individuals in columns
    snpdata <- read.table("snp.txt", header = FALSE)

    # Transpose so individuals are rows, then center and scale each SNP
    pcadata <- prcomp(t(snpdata), center = TRUE, scale. = TRUE)

    # Keep the first two principal components and attach the population labels
    tmppcadata <- cbind(as.data.frame(pcadata$x[, 1:2]), popdata$pop_code)
    colnames(tmppcadata) <- c("PC1", "PC2", "Populations")
    tmppcadata$Populations <- factor(tmppcadata$Populations)

    p <- ggplot(tmppcadata, aes(x = PC1, y = PC2, colour = Populations))
    p + geom_point(size = 2)

If you scale the original SNP matrix and perform PCA on the covariance matrix constructed as $\frac{1}{n} X^\top X$:

[Figure: PC1 vs. PC2 scatter plot, points colored by population (CEU, CHB, JPT, LWK, MEX, YRI)]

(b) (5 points) Which ethnic groups are similar in terms of their genomes? Which ethnic groups are different in terms of their genomes?

The plot shows three clusters, each formed by two ethnic groups. Ethnic groups that fall in different clusters are quite different.

(1) The CHB (China) and JPT (Japan) groups overlap almost completely.
(2) The majority of the LWK (Kenya) and YRI (Nigeria) groups overlap.
(3) The CEU (Utah, European ancestry) and MEX (California, Mexican ancestry) groups share only a small intersection.

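Any other pairs of ethnic groups are very different in terms of their genomes.

4. [10 points] Linkage Analysis

Compute the probabilities of the following pedigrees, assuming the penetrance model is $p(\text{affected} \mid dd) = 0.1$, $p(\text{affected} \mid Dd) = 0.2$, $p(\text{affected} \mid DD) = 0.7$. The allele frequency of D is 0.02. Shaded means affected, blank means unaffected.

(a) (5 points)

Since the allele frequency of D is 0.02, the allele frequency of d is $1 - 0.02 = 0.98$. From these we calculate the genotype frequencies:

$$P(DD) = 0.02^2 = 0.0004$$
$$P(dd) = 0.98^2 = 0.9604$$
$$P(Dd) = 2 \times 0.02 \times 0.98 = 0.0392$$

From the genotypes of the offspring, we can infer that the genotype of M1 is either dd or Dd.

$$
\begin{aligned}
P(\text{pedigree}, \text{M1 is } Dd) &= P(Dd)\, P(Dd)\, P(dd \mid Dd, Dd)\, P(Dd \mid Dd, Dd) \\
&\quad \times P(\text{unaffected} \mid Dd)\, P(\text{affected} \mid Dd)\, P(\text{unaffected} \mid dd)\, P(\text{affected} \mid Dd) \\
&= 0.0392 \times 0.0392 \times \tfrac{1}{4} \times \tfrac{1}{2} \times (1 - 0.2) \times 0.2 \times (1 - 0.1) \times 0.2 \\
&= 5.53 \times 10^{-6}
\end{aligned}
$$

$$
\begin{aligned}
P(\text{pedigree}, \text{M1 is } dd) &= P(dd)\, P(Dd)\, P(dd \mid dd, Dd)\, P(Dd \mid dd, Dd) \\
&\quad \times P(\text{unaffected} \mid dd)\, P(\text{affected} \mid Dd)\, P(\text{unaffected} \mid dd)\, P(\text{affected} \mid Dd) \\
&= 0.9604 \times 0.0392 \times \tfrac{1}{2} \times \tfrac{1}{2} \times (1 - 0.1) \times 0.2 \times (1 - 0.1) \times 0.2 \\
&= 3.05 \times 10^{-4}
\end{aligned}
$$

Summing these two probabilities gives the probability of the pedigree:

$$P_{\text{pedigree}} = P(\text{pedigree}, \text{M1 is } Dd) + P(\text{pedigree}, \text{M1 is } dd) = 3.10 \times 10^{-4}$$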
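The sum over M1's possible genotypes in (a) can be written out in R. This is a sketch under assumptions read off the factors in the formula above: two founders (M1, unaffected, genotype dd or Dd; the other parent, Dd and affected) and two children (one dd and unaffected, one Dd and affected). The Mendelian transmission probabilities are entered by hand, and the variable names are illustrative.

```r
freq_D <- 0.02
freq_d <- 1 - freq_D

# Founder genotype frequencies under Hardy-Weinberg proportions
prior <- c(DD = freq_D^2, Dd = 2 * freq_D * freq_d, dd = freq_d^2)

# Penetrance: probability of being affected given the genotype
pen <- c(dd = 0.1, Dd = 0.2, DD = 0.7)

# Case 1: M1 is Dd (unaffected), other parent Dd (affected)
# Dd x Dd gives a dd child with probability 1/4 and a Dd child with probability 1/2
term_Dd <- prior[["Dd"]] * prior[["Dd"]] * (1 / 4) * (1 / 2) *
  (1 - pen[["Dd"]]) * pen[["Dd"]] * (1 - pen[["dd"]]) * pen[["Dd"]]

# Case 2: M1 is dd (unaffected), other parent Dd (affected)
# dd x Dd gives a dd child with probability 1/2 and a Dd child with probability 1/2
term_dd <- prior[["dd"]] * prior[["Dd"]] * (1 / 2) * (1 / 2) *
  (1 - pen[["dd"]]) * pen[["Dd"]] * (1 - pen[["dd"]]) * pen[["Dd"]]

p_pedigree <- term_Dd + term_dd
print(signif(c(M1_Dd = term_Dd, M1_dd = term_dd, total = p_pedigree), 3))
```

The dd case dominates the total because dd is by far the most common founder genotype when D has frequency 0.02.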

(b) (5 points)

From the genotypes of the offspring, we can infer that the only possible genotype of M1 is Dd.

$$
\begin{aligned}
P_{\text{pedigree}} &= P(\text{pedigree}, \text{M1 is } Dd) \\
&= P(Dd)\, P(Dd)\, P(dd \mid Dd, Dd)\, P(Dd \mid Dd, Dd)\, P(DD \mid Dd, Dd) \\
&\quad \times P(\text{unaffected} \mid Dd)\, P(\text{affected} \mid Dd)\, P(\text{unaffected} \mid dd)\, P(\text{unaffected} \mid Dd)\, P(\text{affected} \mid DD) \\
&= 0.0392 \times 0.0392 \times \tfrac{1}{4} \times \tfrac{1}{2} \times \tfrac{1}{4} \times (1 - 0.2) \times 0.2 \times (1 - 0.1) \times (1 - 0.2) \times 0.7 \\
&= 3.87 \times 10^{-6}
\end{aligned}
$$

5. [23 points] Genome-wide Association Studies

(a) (5 points) Given the following data, perform chi-square tests to test the association between a given locus and case/control status.

                           Control   Case
Major allele homozygous       65      20
Heterozygous                  41      35
Minor allele homozygous       20      70

The null hypothesis $H_0$ is that there is no association between the locus and case/control status. Suppose the two alleles are A (major) and a (minor). The total number of control samples is 126 and the total number of case samples is 125. We first calculate the allele frequencies under the null hypothesis. The total numbers of major allele homozygous, heterozygous, and minor allele homozygous individuals are 85, 76, and 90, respectively.

$$P(A) = \frac{2 \times 85 + 76}{2 \times (126 + 125)} = 0.49, \qquad P(a) = \frac{2 \times 90 + 76}{2 \times (126 + 125)} = 0.51$$

Allele based

The observed allele count table is as follows:

                    Control                  Case
Major allele (A)    2 x 65 + 41 = 171        2 x 20 + 35 = 75
Minor allele (a)    2 x 20 + 41 = 81         2 x 70 + 35 = 175

The expected allele count table is as follows,

                    Control                  Case
Major allele (A)    252 x 0.49 = 123.48      250 x 0.49 = 122.50
Minor allele (a)    252 x 0.51 = 128.52      250 x 0.51 = 127.50

Then we calculate the $\chi^2$ test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(171 - 123.48)^2}{123.48} + \frac{(75 - 122.5)^2}{122.5} + \frac{(81 - 128.52)^2}{128.52} + \frac{(175 - 127.5)^2}{127.5} = 71.97$$

From the $\chi^2$ distribution table, $\chi^2_{0.05,\,df=1} = 3.84$. Since $71.97 > 3.84$, we reject the null hypothesis: there is an association between the locus and case/control status.

Genotype based

The expected genotype count table is as follows:

                           Control                     Case
Major allele homozygous    85 x 126 / 251 = 42.67      85 x 125 / 251 = 42.33
Heterozygous               76 x 126 / 251 = 38.15      76 x 125 / 251 = 37.85
Minor allele homozygous    90 x 126 / 251 = 45.18      90 x 125 / 251 = 44.82

Then we calculate the $\chi^2$ test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(65 - 42.67)^2}{42.67} + \frac{(20 - 42.33)^2}{42.33} + \frac{(41 - 38.15)^2}{38.15} + \frac{(35 - 37.85)^2}{37.85} + \frac{(20 - 45.18)^2}{45.18} + \frac{(70 - 44.82)^2}{44.82} = 52.07$$

From the $\chi^2$ distribution table, $\chi^2_{0.05,\,df=2} = 5.99$. Since $52.07 > 5.99$, we reject the null hypothesis: there is an association between the locus and case/control status.

Allele+Genotype based

Although you could get the same answer, this is not the right way to do it, because it relies on the Hardy-Weinberg Equilibrium and we do not know whether it holds for the current generation.

(b) (3 points) Assume the chi-square test in (a) above is one of 100,000 loci that were tested for association. What is the adjusted p-value after Bonferroni correction?

Allele based

The p-value for the $\chi^2$ test statistic is $p(71.97, df = 1) = p_0 \approx 0$ (on the order of $10^{-17}$). The adjusted p-value after Bonferroni correction is $10^5 \times p_0$, which is still $\approx 0$ (on the order of $10^{-12}$).

Genotype based

The p-value for the $\chi^2$ test statistic is $p(52.07, df = 2) = e^{-52.07/2} \approx 4.9 \times 10^{-12}$. The adjusted p-value after Bonferroni correction is $10^5 \times 4.9 \times 10^{-12} = 4.9 \times 10^{-7}$.

(c) (5 points) Bonferroni correction is effective when all the statistical tests are independent of each other. Consider performing case/control genome-wide association studies for type II diabetes based on African individuals. Consider performing the same type of study on a European population. In general, the African population is more ancient and African genomes have weaker linkage disequilibrium than European genomes. Would Bonferroni correction be more effective in the African or in the European population? Why?

Since African genomes have weaker linkage disequilibrium than European genomes, loci in African genomes are more likely to be independent of each other. Thus the Bonferroni correction would be more effective in the African population.
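The allele-based and genotype-based tests in (a) and the Bonferroni adjustment in (b) can be reproduced with base R. This is a sketch, not the reference solution. The genotype counts are an assumption: they are the values consistent with the totals (126 controls, 125 cases; 85, 76, 90 per genotype), the case allele counts (75 and 175), and the two test statistics reported in the solution. `chisq.test` computes expected counts from the exact margins, whereas the hand calculation uses frequencies rounded to two decimals, and the continuity correction is switched off to match it.

```r
# Genotype counts (assumed, see the note above); rows = genotype, columns = status
geno <- matrix(c(65, 41, 20,    # control
                 20, 35, 70),   # case
               nrow = 3,
               dimnames = list(genotype = c("AA", "Aa", "aa"),
                               status = c("control", "case")))

# Allele counts: a homozygote contributes two copies, a heterozygote one of each
allele <- rbind(A = 2 * geno["AA", ] + geno["Aa", ],
                a = 2 * geno["aa", ] + geno["Aa", ])

allele_test <- chisq.test(allele, correct = FALSE)   # 2 x 2 table, df = 1
geno_test   <- chisq.test(geno)                      # 3 x 2 table, df = 2

# Bonferroni: multiply by the number of tests and cap at 1
n_tests <- 1e5
p_raw <- c(allele = allele_test$p.value, genotype = geno_test$p.value)
p_adj <- pmin(1, n_tests * p_raw)

print(allele)
print(c(allele = unname(allele_test$statistic),
        genotype = unname(geno_test$statistic)))
print(p_adj)
```

Both adjusted p-values remain far below 0.05, so the association survives the correction for 100,000 tests.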

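(d) (5 points) Given the following data, perform chi-square tests to test the association between a given locus and case/control status.

                           Control   Case
Major allele homozygous      105     100
Heterozygous                   1       2
Minor allele homozygous        1       2

The null hypothesis $H_0$ is that there is no association between the locus and case/control status. Suppose the two alleles are A (major) and a (minor). The total number of control samples is 107 and the total number of case samples is 104. We first calculate the allele frequencies under the null hypothesis. The total numbers of major allele homozygous, heterozygous, and minor allele homozygous individuals are 205, 3, and 3, respectively.

$$P(A) = \frac{2 \times 205 + 3}{2 \times (107 + 104)} = 0.98, \qquad P(a) = \frac{2 \times 3 + 3}{2 \times (107 + 104)} = 0.02$$

Allele based

The observed allele count table is as follows:

                    Control                 Case
Major allele (A)    2 x 105 + 1 = 211       2 x 100 + 2 = 202
Minor allele (a)    2 x 1 + 1 = 3           2 x 2 + 2 = 6

The expected allele count table is as follows:

                    Control                 Case
Major allele (A)    214 x 0.98 = 209.72     208 x 0.98 = 203.84
Minor allele (a)    214 x 0.02 = 4.28       208 x 0.02 = 4.16

Then we calculate the $\chi^2$ test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(211 - 209.72)^2}{209.72} + \frac{(202 - 203.84)^2}{203.84} + \frac{(3 - 4.28)^2}{4.28} + \frac{(6 - 4.16)^2}{4.16} = 1.22$$

From the $\chi^2$ distribution table, $\chi^2_{0.05,\,df=1} = 3.84$. Since $1.22 < 3.84$, we fail to reject the null hypothesis: the data show no evidence of an association between the locus and case/control status.

Genotype based

The expected genotype count table is as follows:

                           Control                      Case
Major allele homozygous    205 x 107 / 211 = 103.96     205 x 104 / 211 = 101.04
Heterozygous               3 x 107 / 211 = 1.52         3 x 104 / 211 = 1.48
Minor allele homozygous    3 x 107 / 211 = 1.52         3 x 104 / 211 = 1.48

Then we calculate the $\chi^2$ test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(105 - 103.96)^2}{103.96} + \frac{(100 - 101.04)^2}{101.04} + \frac{(1 - 1.52)^2}{1.52} + \frac{(2 - 1.48)^2}{1.48} + \frac{(1 - 1.52)^2}{1.52} + \frac{(2 - 1.48)^2}{1.48} = 0.74$$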

From the $\chi^2$ distribution table, $\chi^2_{0.05,\,df=2} = 5.99$. Since $0.74 < 5.99$, we fail to reject the null hypothesis: the data show no evidence of an association between the locus and case/control status.

Allele+Genotype based

As before, we do not know whether the Hardy-Weinberg Equilibrium holds for the current generation. If you do the calculation this way, you will find that it leads to a wrong conclusion.

(e) (5 points) In (b), what is the minor allele frequency in the whole population including all samples? Can you reliably conclude on the significance of the association? Why?

For the data used in (a) and (b), the minor allele frequency is $\frac{2 \times 90 + 76}{2 \times (126 + 125)} = 0.51$. This allele frequency is fairly large, which makes the significance of the association reliable. In (d), however, the minor allele frequency is $\frac{2 \times 3 + 3}{2 \times (107 + 104)} = 0.02$. The number of samples carrying the minor allele is too small for the association study in (d), with expected cell counts below 5, so the significance of the association is not reliable.

6. [7 points] Haplotypes and Genome-wide Association Studies

Consider the genome data below collected from case (patient) and control (normal healthy) individuals. Our goal is to see if the haplotypes formed by the three SNPs influence the disease susceptibility.

Case:
Individual 1    ...C...T..G.
                ...C...T..G.
Individual 2    ...T...G..A.
                ...C...T..G.
Individual 3    ...C...T..A.
                ...C...T..G.

Control:
Individual 4    ...T...G..A.
                ...T...G..A.
Individual 5    ...C...T..A.
                ...C...T..A.

(a) (2 points) List haplotype alleles.

The haplotype alleles are CTG, TGA, and CTA.

(b) (5 points) Create a contingency table that you can use for a chi-square test.

The contingency table, counting each of the two haplotypes carried by every individual, is as follows:

         Case   Control   Total
CTG        4       0        4
TGA        1       2        3
CTA        1       2        3
Total      6       4       10

7. [10 points] Structural Variants

Assume you are performing paired-end sequencing of a region of your own genome to see if it contains an insertion or deletion compared to the reference genome. Assume the distribution of bp distances between the two sequenced fragments (or insert sizes) in each mate pair (collected genome-wide) is given as in the lecture notes.

(a) (5 points) If there was a homozygous insertion of length 100 bp in your genome, what would be the distribution of the distances between the two sequenced fragments in each mate pair from your own genome?

Suppose the mean of the real distribution is 400:

[Figure: density against distance for two groups, the measured distribution and the real distribution]

(b) (5 points) If there was a heterozygous insertion of length 100 bp in your genome, what would be the distribution of the distances between the two sequenced fragments in each mate pair from your own genome?

Suppose the mean of the real distribution is 400:

[Figure: density against distance for two groups, the measured distribution and the real distribution]
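The two scenarios can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the original figures: insert sizes are taken to be normal with mean 400 and a standard deviation of 20 (the standard deviation is a guess consistent with a peak density near 0.02), every simulated mate pair is assumed to span the insertion site, and a pair sequenced from a chromosome carrying the 100 bp insertion is assumed to map 100 bp closer together on the reference, which lacks the inserted sequence. All names are illustrative.

```r
set.seed(42)
n_pairs       <- 10000
real_mean     <- 400    # mean of the real insert-size distribution (from the text)
real_sd       <- 20     # assumed standard deviation
insertion_len <- 100

# Real fragment lengths in the sequenced genome
real <- rnorm(n_pairs, mean = real_mean, sd = real_sd)

# (a) Homozygous insertion: every spanning pair maps insertion_len closer on the reference
measured_hom <- real - insertion_len

# (b) Heterozygous insertion: about half of the pairs come from the chromosome with the insertion
from_insertion_copy <- rbinom(n_pairs, size = 1, prob = 0.5) == 1
measured_het <- real - ifelse(from_insertion_copy, insertion_len, 0)

print(c(hom_mean = mean(measured_hom), het_mean = mean(measured_het)))

# Histograms: one shifted peak for (a), two peaks for (b)
hist(measured_hom, breaks = 50, main = "Homozygous insertion", xlab = "distance")
hist(measured_het, breaks = 50, main = "Heterozygous insertion", xlab = "distance")
```

Under these assumptions the homozygous case gives a single peak shifted from 400 to about 300, while the heterozygous case gives a two-peaked mixture with modes near 300 and 400.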
