Humans have two copies of each chromosome, one inherited from the mother and one from the father. Genotyping technologies do not maintain the phase


Gyles Hardy
5 years ago

2 Humans have two copies of each chromosome, one inherited from the mother and one from the father. Genotyping technologies do not maintain the phase.

3 Genotyping technologies do not maintain the phase

4 Recall that proximal SNPs are in LD, so only a few distinct haplotypes should be observed. Ex: T G G C is the best phasing of the first two columns. This phasing is most reliable for short regions, because LD decays at around 10 kb.

5 If the child is heterozygous and a parent is homozygous, we know which allele comes from which parent.
Father: A/T----C/C
Mother: A/A----C/G
Child: A/T----C/G
Location 1: A from mother, T from father.
Location 2: C from father, G from mother.
Child haplotypes: (TC, AG).
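The trio rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `phase_child_site` and the tuple encoding of genotypes are assumptions made for the example.

```python
def phase_child_site(child, father, mother):
    """Phase one site of a child using the parental genotypes.

    Each genotype is an unordered pair of alleles, e.g. ("A", "T").
    Returns (paternal_allele, maternal_allele), or None when the trio
    does not determine the phase (or is not Mendelian-consistent).
    """
    a, b = child
    # Try both ways of splitting the child's alleles between the parents
    # and keep only the splits that each parent could have transmitted.
    options = {(p, m) for p, m in ((a, b), (b, a)) if p in father and m in mother}
    return options.pop() if len(options) == 1 else None


# Trio from the slide (hypothetical encoding): two sites per individual.
father = [("A", "T"), ("C", "C")]
mother = [("A", "A"), ("C", "G")]
child = [("A", "T"), ("C", "G")]

phased = [phase_child_site(c, f, m) for c, f, m in zip(child, father, mother)]
paternal = "".join(p for p, _ in phased)
maternal = "".join(m for _, m in phased)
```

When all three individuals are heterozygous at a site, both splits remain possible and the function returns None, which is the case trio data cannot resolve.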

8 [Figure: sequenced fragments aligned to the reference sequence.] The fragments are aligned to the unphased reference. Uninformative fragments and columns are removed. Tri-allelic SNP columns are removed. Relabel the two alleles using 0/1.

9 [Figure: fragment alignment.] Goal: Reconstruct the phased binary string, and its complement, given substrings.

10 Consider a binary string, (and its complement)

11 The string is revealed to us only through a collection of substrings of the string, and its complement. Given the substrings, can the string be reconstructed?

12 The error-free reconstruction is unique. The problem becomes much harder if some of the substrings have errors (do not match the consensus). Define MEC: the minimum number of calls that need to be corrected for a match. MEC reconstruction: find the string that minimizes the MEC error. MEC reconstruction is NP-hard even when all fragments have length 2!
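To make the MEC definition concrete, here is a small Python sketch that scores a candidate haplotype against a set of fragments. The encoding is an assumption for the example: every fragment is a string over `0`, `1`, and `-` (column not covered), padded to the full number of columns. Each fragment is charged the smaller of its mismatch counts against the haplotype and against its complement.

```python
def complement(hap):
    """Swap 0 and 1 in a binary haplotype string."""
    return "".join("1" if c == "0" else "0" for c in hap)


def mismatches(fragment, hap):
    """Count covered columns where the fragment disagrees with hap."""
    return sum(1 for f, h in zip(fragment, hap) if f != "-" and f != h)


def mec(fragments, hap):
    """Minimum number of calls to correct so every fragment matches
    either hap or its complement."""
    comp = complement(hap)
    return sum(min(mismatches(f, hap), mismatches(f, comp)) for f in fragments)


# Hypothetical fragment matrix over 4 SNP columns.
fragments = ["01--", "-10-", "--11", "10--"]
score_good = mec(fragments, "0100")
score_bad = mec(fragments, "0110")
```

Scoring a given haplotype is cheap; the hard part stated on the slide is searching over all haplotypes for the one with the smallest score.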

13 Greedily select a fragment that extends the current haplotype

14 Some fragments will not match without error. These are assigned & corrected greedily

15 A greedy scheme such as this was employed for JCV's genome. MEC: the minimum number of base calls that need to be flipped for an error-free assignment. Goal: reconstruct a haplotype that minimizes MEC. [Figure: mismatching calls marked X.]
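A greedy scheme of this general kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure actually used in the work cited on the slide: fragments are strings over `0`, `1`, `-`, they are processed left to right, each one is oriented (as-is or complemented) to conflict least with the haplotype built so far, and conflicting calls are counted as corrections. All function names are hypothetical.

```python
def flip_calls(fragment):
    """Complement the 0/1 calls of a fragment, leaving gaps untouched."""
    swap = {"0": "1", "1": "0"}
    return "".join(swap.get(c, c) for c in fragment)


def first_covered(fragment):
    """Index of the first covered column (len(fragment) if none)."""
    return next((i for i, c in enumerate(fragment) if c != "-"), len(fragment))


def conflicts(fragment, hap):
    """Calls that disagree with already-assigned haplotype columns."""
    return sum(1 for f, h in zip(fragment, hap) if f != "-" and h != "-" and f != h)


def greedy_phase(fragments):
    """Extend a haplotype one fragment at a time, left to right."""
    hap = ["-"] * len(fragments[0])
    corrected = 0
    for frag in sorted(fragments, key=first_covered):
        flipped = flip_calls(frag)
        # Pick the orientation that needs fewer corrections right now.
        best = frag if conflicts(frag, hap) <= conflicts(flipped, hap) else flipped
        corrected += conflicts(best, hap)
        for i, c in enumerate(best):
            if c != "-" and hap[i] == "-":
                hap[i] = c  # extend the haplotype with new columns only
    return "".join(hap), corrected


clean = greedy_phase(["01--", "-01-", "--01"])
noisy = greedy_phase(["011-", "-111", "-10-"])
```

Because each decision is made once and never revisited, an early wrong orientation can propagate, which is why the correction count returned here is only an upper bound on the true MEC.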

16 The greedy approach often leads to suboptimal solutions. A local flipping of the current haplotype might improve the MEC. [Figure: mismatching calls marked X.]

17 The haplotype change also involves a reassignment of fragments. Here, the MEC error is reduced to 2. This suggests a generic strategy: start with a haplotype, and move to a new one if it can improve the MEC. [Figure: mismatching calls marked X.]
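The generic strategy (start with a haplotype, move to a neighbor if the MEC improves) can be sketched as a simple hill climb. The neighborhood used here, flipping one column at a time, is an assumption chosen for brevity; the fragment encoding (`0`, `1`, `-`) and the function names are likewise illustrative.

```python
def complement(hap):
    return "".join("1" if c == "0" else "0" for c in hap)


def mec(fragments, hap):
    """MEC score: each fragment is matched to hap or its complement,
    whichever needs fewer corrected calls."""
    comp = complement(hap)

    def mism(frag, h):
        return sum(1 for f, c in zip(frag, h) if f != "-" and f != c)

    return sum(min(mism(f, hap), mism(f, comp)) for f in fragments)


def local_search(fragments, start):
    """Flip single columns while doing so strictly lowers the MEC."""
    hap = list(start)
    best = mec(fragments, "".join(hap))
    improved = True
    while improved:
        improved = False
        for j in range(len(hap)):
            hap[j] = "1" if hap[j] == "0" else "0"      # tentative flip
            score = mec(fragments, "".join(hap))
            if score < best:
                best, improved = score, True            # keep the flip
            else:
                hap[j] = "1" if hap[j] == "0" else "0"  # undo it
    return "".join(hap), best


fragments = ["01--", "-10-", "--11", "10--"]
result = local_search(fragments, "0110")
```

Fragment reassignment happens implicitly: every call to `mec` re-decides, for each fragment, whether it belongs to the haplotype or its complement.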

19 Error probability q. We can compute the likelihood of X given H and q: $\Pr(X \mid H, q) = \prod_i \Pr(X_i \mid H, q)$, where $H = (h, \bar{h})$ is the haplotype pair and $X = (X_1, X_2, X_3, \ldots, X_n)$ is the fragment matrix, one row per fragment. The goal is either to compute the H that maximizes the likelihood, or to sample H from $\Pr(X \mid H, q)$.
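The product over fragments can be evaluated directly once a per-fragment model is fixed. The slide does not spell out $\Pr(X_i \mid H, q)$, so the sketch below assumes a common choice: each fragment comes from $h$ or $\bar{h}$ with probability 1/2, and each call is wrong independently with probability q. Fragments are strings over `0`, `1`, `-`; the function names are hypothetical.

```python
import math


def complement(hap):
    return "".join("1" if c == "0" else "0" for c in hap)


def fragment_likelihood(fragment, hap, q):
    """Pr(X_i | H, q) under the assumed model: the fragment is read from
    hap or its complement with equal probability, with i.i.d. call errors."""

    def prob_given(h):
        p = 1.0
        for f, c in zip(fragment, h):
            if f != "-":
                p *= (1.0 - q) if f == c else q
        return p

    return 0.5 * prob_given(hap) + 0.5 * prob_given(complement(hap))


def log_likelihood(fragments, hap, q):
    """log Pr(X | H, q) as a sum of per-fragment log-likelihoods (0 < q < 1)."""
    return sum(math.log(fragment_likelihood(f, hap, q)) for f in fragments)


fragments = ["01", "10"]
ll_consistent = log_likelihood(fragments, "01", 0.1)
ll_inconsistent = log_likelihood(fragments, "00", 0.1)
```

Working in log space keeps the product numerically stable when there are many fragments.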

20 Move from H to H' with probability $\Pr[H \to H'] = \min\left(1, \frac{\Pr(X \mid H', q)}{\Pr(X \mid H, q)}\right)$.

21 A simple neighborhood is defined by flipping one column at a time (Ex: col. 11, S = {11}) (Waterman and Churchill).
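A single-column flip chain with the Metropolis acceptance rule can be sketched as below. This is an illustrative sketch, not the authors' implementation: it assumes fragments are strings over `0`, `1`, `-`, that a fragment is read from either haplotype with probability 1/2 with independent call errors at rate q, and that the column to flip is chosen uniformly (a symmetric proposal). The function names are hypothetical.

```python
import math
import random


def log_likelihood(fragments, hap, q):
    """log Pr(X | H, q) under the assumed per-fragment mixture model."""
    comp = ["1" if c == "0" else "0" for c in hap]
    total = 0.0
    for frag in fragments:
        p_h = p_c = 1.0
        for f, h, c in zip(frag, hap, comp):
            if f != "-":
                p_h *= (1.0 - q) if f == h else q
                p_c *= (1.0 - q) if f == c else q
        total += math.log(0.5 * p_h + 0.5 * p_c)
    return total


def metropolis_flip_chain(fragments, start, q, steps, rng):
    """Run `steps` Metropolis updates, each proposing one column flip."""
    hap = list(start)
    cur = log_likelihood(fragments, hap, q)
    for _ in range(steps):
        j = rng.randrange(len(hap))                  # neighborhood S = {j}
        hap[j] = "1" if hap[j] == "0" else "0"
        new = log_likelihood(fragments, hap, q)
        # Accept with probability min(1, Pr(X|H',q) / Pr(X|H,q)).
        if new >= cur or rng.random() < math.exp(new - cur):
            cur = new
        else:
            hap[j] = "1" if hap[j] == "0" else "0"   # reject: undo the flip
    return "".join(hap), cur


rng = random.Random(0)
fragments = ["01--", "-10-", "--00", "10--"]
final_hap, final_ll = metropolis_flip_chain(fragments, "0000", 0.05, 200, rng)
```

Passing an explicit `random.Random` instance keeps runs reproducible; the state after a finite number of steps is only an approximate sample.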

22 While this flip-update Markov chain has the right stationary distribution, it does not converge fast.

23 n columns, each spanned by d fragments. Two haplotypes $(H_1, H_2)$ are equally likely. It is hard to move from one good haplotype to another. [Figure: d fragments spanning n columns.]

24 Let p = 1 - q. Thm: Mixing time is $\Omega\left(n \left(\frac{p^2 + q^2}{2pq}\right)^{d}\right)$. Mixing time bound based on conductance arguments (Jerrum & Sinclair 92). Similar empirical results for hitting time [plot: n = 20]. Even modest values of n, d are problematic.

25 If we modify the neighborhood to include the first n/2 columns. Thm: the mixing time of the Markov chain is $O(n^6)$, independent of d. Proof uses canonical paths.

27 The theoretical analysis on examples tells us: choice of neighborhood (subsets of columns that are flipped) is important. Unfortunately, it is not easy to predict what the correct neighborhood should be for an arbitrary example.

28 Each column is a node. (x,y) is an edge if there is a fragment touching columns x and y
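The graph on this slide is easy to build from a fragment matrix. The sketch below uses a union-find structure to group columns into connected components; treating each component as a set of variants that can be phased relative to one another is an interpretation added here, not something stated on the slide. Fragments are assumed to be strings over `0`, `1`, `-`, and the function name is hypothetical.

```python
def connected_column_blocks(fragments, n_cols):
    """Group SNP columns into connected components of the fragment graph.

    Columns x and y share an edge when some fragment covers both, so it is
    enough to union every covered column of a fragment with its first one.
    """
    parent = list(range(n_cols))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for frag in fragments:
        cols = [j for j, c in enumerate(frag) if c != "-"]
        for j in cols[1:]:
            parent[find(j)] = find(cols[0])

    blocks = {}
    for j in range(n_cols):
        blocks.setdefault(find(j), []).append(j)
    return sorted(blocks.values())


short_reads = connected_column_blocks(["01---", "-10--", "---01"], 5)
with_mate_pair = connected_column_blocks(["01---", "-10--", "---01", "0---1"], 5)
```

In the second call, a single fragment that touches columns 0 and 4 (for example the two ends of a paired read) merges the two components into one.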

37 The Hapcut algorithm allows us (in a heuristic sense) to escape a current local minimum. It can be modified to sample from the space of haplotypes instead of outputting a single haplotype.

39 1.856M variants were used for haplotype assembly of HuRef (using an earlier version of the algorithm described here). Chr 22 stats: 25K variant sites; 53K useful rows; ~7 variants per fragment; 609 disjoint haplotypes (largest contains 1008 variants); N50 haplotype length = 350 kb (50% of the variant sites lie in haplotypes of 350 kb or greater). Paired-end sequencing is critical.
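The N50 definition in parentheses translates directly into a short computation. The sketch below assumes each assembled haplotype block is summarized as a `(span_in_bp, number_of_variants)` pair; that representation, the function name, and the sample numbers are made up for illustration.

```python
def haplotype_n50(blocks):
    """Largest span L such that blocks of span >= L hold at least half
    of all variant sites. `blocks` is a list of (span_bp, n_variants)."""
    total = sum(variants for _, variants in blocks)
    running = 0
    for span, variants in sorted(blocks, reverse=True):  # longest first
        running += variants
        if 2 * running >= total:
            return span
    return 0  # no blocks at all


# Hypothetical blocks: spans in bp, variant counts per block.
example_blocks = [(100_000, 30), (500_000, 40), (50_000, 10), (350_000, 20)]
n50 = haplotype_n50(example_blocks)
```

Here the two longest blocks already contain 60 of the 100 variants, so the N50 is the span of the second one.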

40 [Figure: comparison of Greedy, Hapcut, and HASH.] HASH and Hapcut have nearly identical performance. HASH is slower but allows for sampling of multiple haplotypes. Both offer >20% improvement over the naïve method.

42 Errors were simulated on HuRef sequences. Switch error in reconstruction is low, and decreases with increased depth. The computed MEC error tracks simulated errors.
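Switch error can be counted with a few lines of Python. The sketch uses the usual definition (the number of adjacent variant pairs at which the inferred phase changes relative to the truth); the slides do not give a formula, so treat the exact convention and the function names as assumptions. Both haplotypes are binary strings over the same heterozygous sites.

```python
def switch_errors(truth, inferred):
    """Number of positions where the inferred haplotype switches between
    agreeing with the true haplotype and agreeing with its complement."""
    agree = [t == i for t, i in zip(truth, inferred)]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


def switch_error_rate(truth, inferred):
    """Switches divided by the number of adjacent site pairs."""
    pairs = len(truth) - 1
    return switch_errors(truth, inferred) / pairs if pairs > 0 else 0.0


truth = "00000000"
one_switch = switch_errors(truth, "00011111")     # phase flips once
same_up_to_label = switch_errors(truth, "11111111")  # complement: no switch
point_error = switch_errors(truth, "00100000")    # isolated wrong call
```

Note that under this convention an isolated wrong call counts as two switches; some evaluations report such point mismatches separately.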

43 A switch with respect to the HapMap phasing indicates either low LD or a HASH error.

44 HASH mismatch rate is low ( ). Adjusted rate ( ). HapMap error rate is (CEU) to 0.02 (YRI), even with trios. Without trios, error is 0.05.

45 Haplotyping results would be very poor with next-gen sequencing. If 250,000 long reads were available, could we solve the assembly problem? [Plot: adjusted N50 (AN50) vs. read length.] Conclusion: read length helps, but we need very long reads to achieve a reasonable N50.

46 Accuracy of haplotypes: solved using computation (HASH/Hapcut). Length of haplotypes: described by the number of nodes in the graph; primarily a technology issue. Sanger sequencing: long haplotypes, prohibitively expensive. Next-gen sequencing: affordable, short haplotypes. Single-molecule sequencing: longer reads, and the ability to switch sequencing ON/OFF in a strobe mode.

47 Fix read length, coverage, and max insert size to keep cost constant. Varying the insert improves haplotype length. [Plot: AN50 for varying advance lengths, where x% of advance lengths = 9000 bp and (100-x)% of advance lengths = 3000 bp.]

49 We need to find a distribution (pdf) f(a) on advance length a that will optimize AN50. Q: Compute the shape that maximizes AN50. [Plot: f(a) vs. advance length a.]

50 Optimization over different shapes is difficult. Our solution: use the beta distribution. Pros: only two parameters. Cons: considers a limited sampling of the space.
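To see how two parameters already cover a range of shapes, advance lengths can be drawn from a scaled beta distribution with Python's standard `random.betavariate`. The scaling to a maximum advance length, the parameter values, and the function name are assumptions for illustration; the slides do not specify them.

```python
import random


def sample_advance_lengths(alpha, beta, max_advance, count, rng):
    """Draw `count` advance lengths in [0, max_advance] whose shape is
    controlled by the beta parameters (alpha > 0, beta > 0)."""
    return [round(max_advance * rng.betavariate(alpha, beta)) for _ in range(count)]


rng = random.Random(1)
# Hypothetical settings: symmetric, skewed-short, and skewed-long shapes.
symmetric = sample_advance_lengths(2.0, 2.0, 9000, 20000, rng)
mostly_short = sample_advance_lengths(1.0, 3.0, 9000, 20000, rng)
mostly_long = sample_advance_lengths(3.0, 1.0, 9000, 20000, rng)
```

A search over (alpha, beta) would then evaluate the resulting AN50 for each pair in simulation, which is a two-dimensional grid rather than an optimization over arbitrary shapes.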

51 Choosing the right distribution of advance lengths: varying the α, β parameters of the β-distribution. [Figure: AN50 as a function of alpha and beta, with panels showing the distribution of advance length for a=1.6 b= , for a=0.6 b= , and for a=1 b= .]

52 Isolate a single cell from suspension. Use protease digestion to release all DNA from the cytoplasm (46 chromosomes). Randomly partition the chromosomes into 48 chambers. Perform multiple strand displacement amplification of the single genomes in each tube (this is the tricky part). Test each tube for heterozygosity. Genotype each homozygous tube.

53 [Figure: tubes vs. loci for chr1, with X marks.] A small PCR-based test is used to see which chromosome is in which tube (marked by X). How many marks per locus should we see? We can combine the contents of two tubes if they do not contain homologous pairs. Finally, we genotype the remaining pairs or sequence them.
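A back-of-the-envelope calculation shows how often random partitioning separates homologs. The sketch assumes, purely for illustration, 23 homologous pairs and that every chromosome lands in one of the 48 chambers independently and uniformly at random; the function names are hypothetical.

```python
def prob_pair_separated(n_chambers):
    """Chance that the two homologs of one chromosome land in
    different chambers under uniform, independent placement."""
    return (n_chambers - 1) / n_chambers


def prob_all_pairs_separated(n_pairs, n_chambers):
    """Chance that no chamber holds both homologs of any pair."""
    return prob_pair_separated(n_chambers) ** n_pairs


def expected_shared_pairs(n_pairs, n_chambers):
    """Expected number of pairs whose homologs share a chamber."""
    return n_pairs / n_chambers


p_single = prob_pair_separated(48)
p_all = prob_all_pairs_separated(23, 48)
mean_shared = expected_shared_pairs(23, 48)
```

Under these assumptions a separated pair shows up as two marks at a locus (one tube per homolog), and roughly four cells in ten would have at least one pair sharing a chamber, which is what the per-tube heterozygosity test is meant to catch.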

55
1. We started (and finished) by considering sources of variation
2. Models of evolution of population under natural assumptions
   1. HW/Linkage equilibrium
   2. Efficient simulation of populations via coalescent theory
3. Detecting structural variation
4. Detection of regions under selection
5. Association testing
6. Population sub-structure
7. Haplotype phasing

56
1. Phylogeny reconstruction (perfect phylogeny, distance-based methods)
2. Optimization using (Integer-) Linear programming, simulated annealing and other paradigms. Maximum flows.
3. Greedy algorithms, dynamic programming
4. Stochastic sampling methods (MCMC, Gibbs sampling)
5. Statistical tests for significance
6. Efficient simulation techniques
   1. Coalescent for populations, including recombination, selection, etc.
   2. Simulating genotype/phenotype associations

57
1. We started (and finished) by considering sources of variation
2. Models of evolution of population under natural assumptions
   1. HW/Linkage equilibrium
   2. Efficient simulation of populations via coalescent theory
3. Detecting structural variation
4. Detection of regions under selection
5. Association testing
6. Population sub-structure
7. Evolution under recombinations/recombination hot-spot detection via counting of recombination events (partially)
8. Haplotype phasing (partially)

58
1. Phylogeny reconstruction (perfect phylogeny, distance-based methods)
2. Optimization using (Integer-) Linear programming, simulated annealing and other paradigms
3. Greedy algorithms, dynamic programming
4. Stochastic sampling methods (MCMC, Gibbs sampling)
5. Statistical tests for significance
6. Efficient simulation techniques
   1. Coalescent for populations
   2. Simulating genotype/phenotype associations
   3. Pairwise analysis
