Modelling Genetic Variations with Fragmentation-Coagulation Processes

1 Modelling Genetic Variations with Fragmentation-Coagulation Processes Yee Whye Teh, Charles Blundell, Lloyd Elliott Gatsby Computational Neuroscience Unit, UCL

2 Genetic Variations in Populations Inferring histories of human populations. Understanding fundamental genetic processes. Associating genetic with phenotypic variations. Discovering genetic causes of diseases.

3 Ancestral Tree [Kingman 1982]. Figure, backwards in time; observed sequences: GCAAAGGGTA, GCATCCGGTA, CAATCCTATA.

4 Ancestral Tree [Kingman 1982]. Figure, backwards in time; ancestral sequences: CCAACTGATA, GCAACGGGTA; observed sequences: GCAAAGGGTA, GCATCCGGTA, CAATCCTATA.

7 Ancestral Recombination Graph [Hudson 1983]. Figure, backwards in time; ancestral sequences: CCTACCACTA, CCTACCGGTA, CCTAAGGGTA, CCTTCCGGTA, GCATCCGGTA, CCATCCGGTA, GCATCCGGTA, CCTACCTATA; observed sequences: GCAAAGGGTA, GCATCCGGTA, CAATCCTATA.

8 Mosaic Structure Simplification: blocks of recurring segments; each DNA sequence is composed of multiple blocks. Hidden Markov models. Example sequences: GCAAAGGGTA, GCATCCGGTA, CAATCCTATA. Figure from [Daly et al 2001].

9 Fragmentation-Coagulation Processes No need for model selection: Bayesian nonparametric. No label switching problem: no labels. Idea: use unlabelled partitions of sequences as the basic representation, and a Markov process over partitions to model the changing partition structure. Partition: a set of clusters, e.g. one cluster holding two of the three sequences and another holding the third; the clusters are disjoint, non-empty, unlabelled, and together contain all sequences. Example sequences: GCAAAGGGTA, GCATCCGGTA, CAATCCTATA.
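
The "no labels" point can be illustrated with a small sketch. It assumes cluster assignments arrive as a list of arbitrary label names, one per sequence; the function name `labels_to_partition` is hypothetical and not taken from the talk. Two labelings that differ only by renaming the clusters map to the same unlabelled partition.

```python
def labels_to_partition(labels):
    """Map per-sequence cluster labels to an unlabelled partition.

    Sequences are indexed from 1. The result is a frozenset of frozensets,
    which forgets both the label names and the order of the clusters.
    """
    clusters = {}
    for index, label in enumerate(labels, start=1):
        clusters.setdefault(label, set()).add(index)
    return frozenset(frozenset(members) for members in clusters.values())


# Two labelings that differ only by swapping the names "a" and "b".
first = ["a", "a", "b"]
second = ["b", "b", "a"]
same_partition = labels_to_partition(first) == labels_to_partition(second)
```

Working directly with the partition means a sampler never has to move between such relabelled copies of the same configuration.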

10 Markov Process over Partitions. Example sequences: GCAAAGGGTA, GCATCCGGTA, CAATCCTATA. Partitions indexed discretely, $\{\Pi_t : t = 1, 2, \ldots, T\}$, or continuously, $\{\Pi_t : t \in [0, T]\}$.

17 Fragmentation-Coagulation Processes. Initial distribution: CRP(µ).

18 Fragmentation-Coagulation Processes. Rate of fragmenting a cluster c into clusters a and b: $R\,\Gamma(|a|)\,\Gamma(|b|)/\Gamma(|c|)$.

19 Fragmentation-Coagulation Processes. Rate of coagulating clusters a and b into c: $R/\mu$.
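
As a sanity check on these rates, the sketch below verifies detailed balance between one fragmentation and the matching coagulation. It assumes the rates are R Γ(|a|)Γ(|b|)/Γ(|c|) for fragmenting c into a and b and R/µ for coagulating a pair, and it uses the standard unnormalised CRP(µ) weight µ^K ∏ Γ(|c|) over the K clusters. The function names, the example clusters and the values of R and µ are illustrative choices, not taken from the slides.

```python
from math import exp, isclose, lgamma, log


def crp_log_weight(partition, mu):
    """Unnormalised log probability under CRP(mu): mu^K * prod_c Gamma(|c|)."""
    return len(partition) * log(mu) + sum(lgamma(len(c)) for c in partition)


def fragmentation_rate(a, b, R):
    """Rate of splitting c = a | b into a and b: R * G(|a|) * G(|b|) / G(|c|)."""
    return R * exp(lgamma(len(a)) + lgamma(len(b)) - lgamma(len(a) + len(b)))


def coagulation_rate(R, mu):
    """Rate at which any given pair of clusters merges."""
    return R / mu


# Arbitrary illustrative values and clusters.
R, mu = 2.0, 0.7
a, b = frozenset({1, 2}), frozenset({3})
rest = [frozenset({4, 5})]
merged = [a | b] + rest   # partition before the split
split = [a, b] + rest     # partition after the split

# Probability flow in each direction (the CRP normalising constant cancels,
# because both partitions are over the same n items).
flow_split = exp(crp_log_weight(merged, mu)) * fragmentation_rate(a, b, R)
flow_merge = exp(crp_log_weight(split, mu)) * coagulation_rate(R, mu)
balanced = isclose(flow_split, flow_merge)
```

The two flows agree, which is what reversibility with respect to CRP(µ) requires for this pair of moves.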

20 Fragmentation-Coagulation Processes Markov. Stationary, with CRP(µ) as equilibrium distribution. Reversible. Exchangeable. Dirichlet diffusion tree [Neal 00] and Kingman's coalescent. Simplest example of exchangeable fragmentation-coalescence processes [Berestycki 2004].
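
For readers unfamiliar with CRP(µ), here is a minimal sketch of the standard sequential seating construction: each new item joins an existing cluster with weight equal to the cluster's size, or starts a new cluster with weight µ. The function name `sample_crp` and the chosen n, µ and seed are illustrative assumptions.

```python
import random


def sample_crp(n, mu, rng):
    """Draw a partition of {1, ..., n} from CRP(mu) by sequential seating."""
    clusters = []
    for item in range(1, n + 1):
        # Existing cluster c has weight |c|; a new cluster has weight mu.
        weights = [len(c) for c in clusters] + [mu]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(clusters):
            clusters.append([item])
        else:
            clusters[choice].append(item)
    return clusters


rng = random.Random(0)
partition = sample_crp(9, mu=1.0, rng=rng)
```

Larger values of µ favour more, smaller clusters; the order in which items are seated does not affect the distribution of the resulting partition.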

21 Inference Gibbs sampling: resample the trajectory of one sequence over [0, T] at each iteration. Dealing with continuous-time dynamics: uniformisation-based auxiliary variable Gibbs [Rao & Teh UAI 2011]. Forward filtering-backward sampling.
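
The sketch below illustrates only the identity behind uniformisation, not the auxiliary variable Gibbs sampler itself. For a rate matrix A and any Ω at least as large as the largest exit rate, exp(At) equals a Poisson(Ωt) mixture of powers of the stochastic matrix B = I + A/Ω. The two-state rate matrix, the function name and all numerical values are assumptions chosen so that the result can be compared with the known closed form.

```python
from math import exp


def uniformised_transition(A, t, omega, terms=200):
    """Approximate exp(A t) as sum_k Poisson(k; omega * t) * B^k, B = I + A / omega."""
    n = len(A)
    B = [[(1.0 if i == j else 0.0) + A[i][j] / omega for j in range(n)]
         for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # B^0
    weight = exp(-omega * t)  # Poisson probability of zero candidate events
    result = [[weight * power[i][j] for j in range(n)] for i in range(n)]
    for k in range(1, terms):
        power = [[sum(power[i][m] * B[m][j] for m in range(n)) for j in range(n)]
                 for i in range(n)]
        weight *= omega * t / k  # Poisson probability of k candidate events
        for i in range(n):
            for j in range(n):
                result[i][j] += weight * power[i][j]
    return result


# Two-state chain with rates a (0 -> 1) and b (1 -> 0); values are illustrative.
a, b, t = 0.8, 0.3, 1.5
A = [[-a, a], [b, -b]]
P = uniformised_transition(A, t, omega=2.0)  # omega >= max exit rate

# Closed form for the probability of being in state 0 at time t, starting in 0.
expected_p00 = b / (a + b) + a / (a + b) * exp(-(a + b) * t)
```

Because the candidate event times form a Poisson process and the states at those times form a discrete-time Markov chain, discrete-time tools such as forward filtering-backward sampling become applicable.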

23 Imputation Results. Figure: accuracy (%) against proportion of held-out SNPs, comparing BEAGLE, FCP and fastPHASE.

24 Summary Modelling the mosaic structure of genetic variations. Fragmentation-coagulation processes. Bayesian nonparametrics. Label switching problem. State-of-the-art results. Future work: scaling up, and other statistical genetics applications. Poster T09.

25 Thank You! Poster T09. Thanks to Vinayak Rao and Andriy Mnih; Chris Holmes, Gil McVean, Lancelot James; NIPS organisers and audience.

26 Appendix

27 Imputation Experiments: Unphased Data. Figure: accuracy (%) against computation time (s); legend: BEAGLE, IMPUTE, fastPHASE, FCP 60, FCP 600, FCP.

28 Imputation Experiments: Unphased Data. Figures: accuracy (%) against proportion of held-out genotypes, and accuracy (%) against proportion of held-out SNPs.

29 Hidden Markov Models. For each sequence i: hidden states $s_{i1}, s_{i2}, \ldots, s_{i5}$ with observations $x_{i1}, x_{i2}, \ldots, x_{i5}$. [Daly et al 2001], [Scheet & Stephens 2006]

30 Hidden Markov Models. Transition probabilities $T$ and emission probabilities $E$. Typical: stationary, with shared transition and emission probabilities. [Daly et al 2001], [Scheet & Stephens 2006]

31 Hidden Markov Models. Transition probabilities $T_1, \ldots, T_5$ and emission probabilities $E_1, \ldots, E_5$. Typical: stationary, with shared transition and emission probabilities. Here: non-stationary, with location-specific transitions and emissions. [Daly et al 2001], [Scheet & Stephens 2006]
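
A minimal sketch of forward filtering-backward sampling for a discrete HMM with location-specific transition and emission tables follows. It is a generic textbook version, not the sampler used in the talk; the function name `ffbs`, the argument layout and the toy example are assumptions.

```python
import random


def ffbs(init, transitions, emissions, observations, rng):
    """Sample a hidden state path from its posterior in a non-stationary HMM.

    init[k]               : prior probability of state k at the first location
    transitions[t][j][k]  : P(s_{t+1} = k | s_t = j), specific to location t
    emissions[t][k][x]    : P(x_t = x | s_t = k), specific to location t
    """
    n_states = len(init)
    length = len(observations)

    # Forward filtering: alpha[t][k] is P(s_t = k | x_0, ..., x_t).
    alpha = []
    for t in range(length):
        row = []
        for k in range(n_states):
            if t == 0:
                prior = init[k]
            else:
                prior = sum(alpha[t - 1][j] * transitions[t - 1][j][k]
                            for j in range(n_states))
            row.append(prior * emissions[t][k][observations[t]])
        total = sum(row)
        alpha.append([v / total for v in row])

    # Backward sampling: draw the last state, then each earlier state given the next.
    path = [0] * length
    path[-1] = rng.choices(range(n_states), weights=alpha[-1])[0]
    for t in range(length - 2, -1, -1):
        weights = [alpha[t][j] * transitions[t][j][path[t + 1]]
                   for j in range(n_states)]
        path[t] = rng.choices(range(n_states), weights=weights)[0]
    return path


# Toy example: two states, three locations, noiseless emissions (x_t = s_t),
# so the posterior path must equal the observations.
noiseless = [[1.0, 0.0], [0.0, 1.0]]
emissions = [noiseless, noiseless, noiseless]
transitions = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.3, 0.7]]]
path = ffbs([0.5, 0.5], transitions, emissions, [0, 1, 1], random.Random(1))
```

Indexing the tables by location t is what makes the model non-stationary; passing the same table at every t recovers the typical stationary case.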

32 HMM Label Switching Problem. Figure (built up over slides 32 to 35): state labels a and b assigned along the sequences.

36 HMM Label Switching Problem. Figures: normalized log likelihood against MCMC iteration, and MCMC iterations until optimum, each comparing FCP and BHMM.

37 Chinese Restaurant Process Through Time. Figure (built up over slides 37 to 46): sequences followed over the interval [0, T], annotated with terms involving µ and with the rates R and R/µ.

47 Partitions. Set $[n] = \{1, 2, \ldots, n\}$ indexing n sequences. Partition of [n], e.g.: {{,,6},{,7},{4,5,8},{9}}. Non-empty; disjoint; union is [n]; and unlabelled.
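
The defining properties listed on this slide can be checked mechanically. The sketch below is illustrative only; the function name `is_partition` and the example clusters are assumptions.

```python
def is_partition(clusters, n):
    """Check that clusters form a partition of [n] = {1, ..., n}."""
    clusters = [set(c) for c in clusters]
    non_empty = all(len(c) > 0 for c in clusters)
    union = set().union(*clusters) if clusters else set()
    # Clusters are pairwise disjoint exactly when no element is counted twice.
    disjoint = sum(len(c) for c in clusters) == len(union)
    covers = union == set(range(1, n + 1))
    return non_empty and disjoint and covers


valid = is_partition([{1, 2}, {3, 5}, {4}], n=5)
overlapping = is_partition([{1, 2}, {2, 3}], n=3)   # element 2 appears twice
incomplete = is_partition([{1, 2}, {4}], n=4)       # element 3 is missing
```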

Bayesian Nonparametric Modelling of Genetic Variations using Fragmentation-Coagulation Processes

Journal of Machine Learning Research 1 (2) 1-48 Submitted 4/; Published 1/ Bayesian Nonparametric Modelling of Genetic Variations using Fragmentation-Coagulation Processes Yee Whye Teh y.w.teh@stats.ox.ac.uk