Exam Dates. February 19 March 1 If those dates don't work ad hoc in Heidelberg


1 Exam Dates: February 19, March 1. If those dates don't work: ad hoc in Heidelberg

2 Plan for next lectures
- Lecture 11: More on Likelihood Models & Parallel Computing in Phylogenetics
- Lecture 12: (Andre) Discrete Operations on Trees (Alexis will not be there)
- Lecture 13: (Pavlos & Alexis) Population genetics I
- Lecture 14: (Pavlos & Alexis) Population genetics II
- Lecture 15: (Alexis) Wrap-up & exam preparation

3 Likelihood-based Phylogenetic Inference so far
- Search Techniques
- The Maximum Likelihood Model
- Some simple DNA models: Jukes Cantor, General Time Reversible (GTR) Model

4 Outline for today
- More on Likelihood Models: Computing the likelihood on a tree; Rate heterogeneity among sites; Model Selection; Protein Substitution Models
- Optimizing and Parallelizing Likelihood Calculations: Parallelization; Load Balance; Saving RAM
- Course evaluation

5 Maximum Likelihood [Figure: alignment of Seq1 to Seq4, length m]

6 Maximum Likelihood [Figure: alignment as on slide 5; substitution model shown as a 4x4 matrix over A, C, G, T]

7 Maximum Likelihood [Figure as on slide 6] The substitution model is commonly denoted as the Q matrix: transition probabilities for time dt; for time t: P(t) = e^{Qt}

8 Maximum Likelihood [Figure as on slide 7] Prior probabilities (empirical base frequencies): π_A, π_C, π_G, π_T

9 Maximum Likelihood [Figure as on slide 8, plus a tree connecting Seq 1 to Seq 4 with branch lengths b1 to b5]

10 Maximum Likelihood [Figure as on slide 9, plus a virtual root vr]

11 Maximum Likelihood [Figure as on slide 10, plus likelihood vectors with entries P(A), P(C), P(G), P(T) for each of the m alignment positions]

12 Maximum Likelihood [Figure as on slide 11] Floating-point & memory intensive

13-17 Post-order Traversal [Figure, animated over slides 13 to 17: post-order traversal of the tree towards the virtual root; a tip holding the sequence AGCC is shown as its per-site states A, G, C, C; a :-) is added as the traversal progresses]

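18 What happens when we compute this inner vector? [Figure: at alignment position c, the inner vector L^(k) with entries P(A), P(C), P(G), P(T) is computed from the child vectors L^(i) and L^(j), using the 4x4 transition matrices P(b_i) and P(b_j) over A, C, G, T for the branches b_i and b_j]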
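A minimal sketch of this step for a single alignment position, assuming the Jukes-Cantor model (so that P(b) has a closed form) and tips encoded as 0/1 vectors. The names jc69_p, tip_vector and inner_vector are illustrative and not taken from any particular program.

```python
import math

STATES = "ACGT"

def jc69_p(t):
    """Jukes-Cantor transition matrix P(t) = e^{Qt} in closed form.
    t is the branch length in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

def tip_vector(base):
    """Likelihood vector of a tip: 1 for the observed state, 0 otherwise."""
    return [1.0 if s == base else 0.0 for s in STATES]

def inner_vector(l_i, b_i, l_j, b_j):
    """L^(k)[a] = (sum_x P(b_i)[a][x] * L^(i)[x]) * (sum_y P(b_j)[a][y] * L^(j)[y])."""
    p_i, p_j = jc69_p(b_i), jc69_p(b_j)
    return [
        sum(p_i[a][x] * l_i[x] for x in range(4))
        * sum(p_j[a][y] * l_j[y] for y in range(4))
        for a in range(4)
    ]

# Inner node above a tip showing A (branch 0.1) and a tip showing C (branch 0.2).
l_k = inner_vector(tip_vector("A"), 0.1, tip_vector("C"), 0.2)
```

With two tips as children, the entry for state A reduces to P_AA(b_i) * P_AC(b_j), which is a convenient sanity check.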

19-25 Post-order Traversal [Figure, animated over slides 19 to 25: the traversal continues towards the virtual root, with further :-) marks added]

26 Post-order Traversal [Figure as on slide 25] Overall likelihood: sum over the logs of the per-site likelihoods
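As a sketch, the overall score could be computed as follows, assuming the vectors at the virtual root have already been combined into one vector per site and that freqs holds the base frequencies π_A, π_C, π_G, π_T. The function name is hypothetical.

```python
import math

def log_likelihood(root_vectors, freqs):
    """Sum over all sites of log(sum_s freqs[s] * root_vector[s])."""
    total = 0.0
    for vec in root_vectors:
        site_likelihood = sum(pi * l for pi, l in zip(freqs, vec))
        total += math.log(site_likelihood)
    return total

# Two sites with made-up root vectors and uniform base frequencies.
score = log_likelihood(
    [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]],
    [0.25, 0.25, 0.25, 0.25],
)
```

Summing logs rather than multiplying the per-site likelihoods keeps the score in a numerically safe range.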

27 Post-order Traversal LS4

28 Post-order Traversal LS3 b1 b4 LS5

29 Post-order Traversal LS3 b4' b1' x LS5

30 Post-order Traversal [Figure: LS3, x, LS5] b4' := b1 + b4, b1' := 0

31 Maximum Likelihood [Figure: alignment (Seq1 to Seq4, length m), substitution model over A, C, G, T, prior probabilities (empirical base frequencies) π_A, π_C, π_G, π_T, and the tree on Seq 1 to Seq 4] optimize branch lengths

32 Branch Length Optimization starting branch

33 Branch Length Optimization starting branch Essentially we place the virtual root into this branch here

34-39 Branch Length Optimization [Figure, animated over slides 34 to 39: tree with the starting branch marked]

40 Branch Length Optimization starting branch iterate until convergence
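A sketch of optimizing a single branch with Newton-Raphson, under simplifying assumptions that are not from the slides: only two sequences joined by one branch, the Jukes-Cantor model, and an alignment summarized by its number of sites n and number of differing sites k. In this special case the optimum is known in closed form, which makes the iteration easy to check.

```python
import math

def jc69_derivatives(t, n, k):
    """First and second derivative of the two-sequence JC69 log likelihood
    with respect to the branch length t (n sites, k of them differing)."""
    e = math.exp(-4.0 * t / 3.0)
    same, d_same, dd_same = 0.25 + 0.75 * e, -e, 4.0 * e / 3.0
    diff, d_diff, dd_diff = 0.25 - 0.25 * e, e / 3.0, -4.0 * e / 9.0
    first = (n - k) * d_same / same + k * d_diff / diff
    second = ((n - k) * (dd_same / same - (d_same / same) ** 2)
              + k * (dd_diff / diff - (d_diff / diff) ** 2))
    return first, second

def optimize_branch(n, k, t=0.1, tol=1e-10, max_iter=50):
    """Newton-Raphson on d lnL / dt = 0; t is kept strictly positive."""
    for _ in range(max_iter):
        first, second = jc69_derivatives(t, n, k)
        # Only take the Newton step where the function is locally concave.
        t_new = t - first / second if second < 0.0 else 0.0
        if t_new <= 0.0:
            t_new = t / 2.0  # crude safeguard when the Newton step is unusable
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

t_hat = optimize_branch(n=100, k=10)
# Closed-form JC69 estimate for comparison: -3/4 * ln(1 - 4p/3) with p = k/n.
t_closed_form = -0.75 * math.log(1.0 - 4.0 * 10 / (3.0 * 100))
```

On a real tree the same iteration would be applied to one branch after another, repeating the sweep until the overall likelihood stops improving.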

41 Maximum Likelihood [Figure: alignment (Seq1 to Seq4, length m), substitution model over A, C, G, T, prior probabilities (empirical base frequencies) π_A, π_C, π_G, π_T, and the tree on Seq 1 to Seq 4] optimize model parameters

42 Maximum Likelihood [Figure as on slide 41] optimize model parameters. We don't need to re-estimate model params. for every tree topology as long as the tree is reasonable, i.e., non-random

43 Numerical Optimization Procedures See chapters 9 & 10 of: Numerical Recipes in C: The Art of Scientific Computing

44 Basic Operations Maximum Likelihood
- Compute Conditional Likelihood Array at an inner node
- Compute Likelihood at Virtual Root
- Optimize a Branch Length for a given Branch
- Optimize all Branch Lengths
- Optimize other Model Parameters

45 Basic Operations Maximum Likelihood
- Compute Conditional Likelihood Array at an inner node
- Compute Likelihood at Virtual Root
- Optimize a Branch Length for a given Branch
- Optimize all Branch Lengths
- Optimize other Model Parameters
Bayesian programs only require two operations

46 Rate Heterogeneity among Sites
[Figure: gene alignment]
ACGGGGGGGGGGGTTTTCCCCC
ATGGGGGGGGGGGTTTCCCCCC
ACCGGGGGGGGGGTTTTGCCCC
AGGGGGGGGGGGCTTTTCCCCC
- Among-site rate heterogeneity is a biological phenomenon: different sites/columns evolve at different speeds
- Need to accommodate this in our models

47 Γ-Distribution

48 Γ-Distribution Small α: high rate heterogeneity. Large α: low rate heterogeneity

49 Discrete Γ-Distribution r0 r1 r2 r3

50 An Abstract View of Γ [Figure: an alignment of length 7 evaluated once per rate category. rate 0: P(t) = e^{Q r_0 t}; rate 1: P(t) = e^{Q r_1 t}; rate 2: P(t) = e^{Q r_2 t}; rate 3: P(t) = e^{Q r_3 t}] This is the integral of the likelihood, which we approximate via discretization. Log likelihood at site i: LnL(i) = log(¼ * (L_0 + L_1 + L_2 + L_3)). All Γ rates have equal probability
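A small sketch of the averaging formula, again assuming a two-sequence Jukes-Cantor setting that is not from the slides: the per-rate likelihoods L_0 to L_3 are obtained by scaling the branch length t with each rate, and the site log likelihood is the log of their equally weighted mean. Function names and the example rates are illustrative.

```python
import math

def jc69_site_likelihood(same_state, t):
    """Likelihood of one site of a two-sequence alignment under JC69:
    base frequency 1/4 times the probability of staying or changing."""
    e = math.exp(-4.0 * t / 3.0)
    p = 0.25 + 0.75 * e if same_state else 0.25 - 0.25 * e
    return 0.25 * p

def gamma_site_log_likelihood(same_state, t, rates):
    """LnL(i) = log(mean of the per-rate likelihoods); all rates equally probable."""
    per_rate = [jc69_site_likelihood(same_state, r * t) for r in rates]
    return math.log(sum(per_rate) / len(per_rate))

# Four illustrative rates whose mean is 1.
example_rates = [0.1369, 0.4768, 1.0, 2.3863]
lnl = gamma_site_log_likelihood(True, 0.3, example_rates)
```

Each site is evaluated once per rate, which is where the fourfold cost in memory and floating point operations comes from.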

51 An Abstract View of Γ [Figure as on slide 50: four rates, alignment length 7] 4 times higher memory consumption

52 An Abstract View of Γ [Figure as on slide 50: four rates, alignment length 7] 4 times more floating point operations

53 Γ Model of Rate Heterogeneity with 4 discrete rates [Figure: the shape parameter α determines the four discrete rates r0, r1, r2, r3]
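A sketch of how four discrete rates could be derived from α, assuming a Γ distribution with mean 1 (shape α, rate α), equal-probability categories, and each category represented by its mean rate. The incomplete gamma function is evaluated with its power series and the quantiles by bisection, so that only the standard library is needed; all function names are illustrative.

```python
import math

def gamma_cdf(x, shape):
    """Regularized lower incomplete gamma function P(shape, x), power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / shape
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total) and n < 10000:
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))

def gamma_quantile(p, shape):
    """Quantile of Gamma(shape, rate=shape), i.e. mean 1, found by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(shape * hi, shape) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(shape * mid, shape) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k=4):
    """Mean rate of each of k equal-probability categories."""
    cuts = [0.0] + [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # For a mean-1 gamma, the partial mean up to x equals P(alpha + 1, alpha * x).
    cum = [gamma_cdf(alpha * c, alpha + 1.0) for c in cuts] + [1.0]
    return [k * (cum[i + 1] - cum[i]) for i in range(k)]

rates = discrete_gamma_rates(1.0)
```

For α = 1 this yields rates of roughly 0.137, 0.477, 1.000 and 2.386; a smaller α spreads the rates further apart, matching the statement that small α means high rate heterogeneity.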

54 Another Famous Computational Evolutionary Biologist: Ziheng Yang, the guy who has written this nice book. Z. Yang: "Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods", Journal of Molecular Evolution 39 (3). [Photo caption: At LMU Munich during student protests]

55 An Abstract View of CAT alignment length: 7

56 CAT: ML-based Optimization of Individual Rates per Site r0, r1, ..., rm

57 CAT Step 1: ML-based Optimization of Individual Rates per Site r0, r1, ..., rm. A. Stamatakis: "Phylogenetic models of rate heterogeneity: a high performance computing perspective", Proceedings of IPDPS

58 CAT Step 2: Order the list of r_i according to the site log likelihoods l_i. Before sorting: ([r0, -1.2], [r1, -5.6], [r2, -2.5], [r3, -0.3]). Sorted according to the log likelihood contributions of the sites: ([r1, -5.6], [r2, -2.5], [r0, -1.2], [r3, -0.3])

59 CAT Step 3: Categorize; Example with 2 Rate Categories (default: 25). Sorted list: ([r1, -5.6], [r2, -2.5], [r0, -1.2], [r3, -0.3]). Select the 2 rates with the highest likelihood contributions: rc0 := r1, category[1] := 0; rc1 := r2, category[2] := 1. The index refers to the alignment column position. Assign the remaining positions (with rates r0 and r3) to the rate category 0 or 1 which minimizes (δ(r1, r0), δ(r2, r0)) and (δ(r1, r3), δ(r2, r3)), respectively
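Steps 2 and 3 can be sketched as below. Two points are assumptions: δ is taken to be the absolute difference between two rates, and the per-site rate values are made up (the slides only give the log likelihoods). The function name categorize is hypothetical.

```python
def categorize(rates, site_lnl, k):
    """Sort sites by log likelihood, use the first k rates as category rates,
    and assign every remaining site to the category with the closest rate."""
    order = sorted(range(len(rates)), key=lambda i: site_lnl[i])
    seeds = order[:k]
    category_rates = [rates[i] for i in seeds]
    category = [None] * len(rates)
    for c, i in enumerate(seeds):
        category[i] = c
    for i in order[k:]:
        category[i] = min(range(k), key=lambda c: abs(category_rates[c] - rates[i]))
    return category_rates, category

# Log likelihoods from the slide example; the rates r0..r3 are invented here.
rates = [0.9, 0.2, 2.0, 1.8]
site_lnl = [-1.2, -5.6, -2.5, -0.3]
category_rates, category = categorize(rates, site_lnl, k=2)
```

As on the slide, r1 and r2 become the two category rates, and the positions with rates r0 and r3 join whichever category is closer.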

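60 ML-Loops

CAT-LOOP (one rate category per site)

for(i = 0; i < n; i++)
{
  cat = category[i];   /* rate category of site i */
  rc = rate[cat];      /* rate of that category; renamed from r to avoid clashing with the vector r[] */
  p[i] = f(q[i], pq, r[i], pr, rc);
}

Γ-LOOP (all four discrete rates per site)

for(i = 0; i < n; i++)
{
  p[i].g0 = f(q[i], pq, r[i], pr, r0);
  p[i].g1 = f(q[i], pq, r[i], pr, r1);
  p[i].g2 = f(q[i], pq, r[i], pr, r2);
  p[i].g3 = f(q[i], pq, r[i], pr, r3);
}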

61 715 Sequences under HKY85+Γ [Plot: axes "Log Likelihood Score under Γ" and "Execution Time"] C.E. Robertson et al. (2005): "Phylogenetic diversity and ecology of environmental Archaea", Current Opinion in Microbiology

62 Model Selection
- How do we select a model? How do we avoid over-parameterizing a model?
- Typically: a model with more parameters yields better likelihood scores
  - a better likelihood alone is not a criterion for selecting a more complex model
  - we may over-parameterize the model
  - key question: does a more complex model yield a different tree?
- Model selection criteria:
  - Likelihood-ratio test: can only be applied to nested models!
  - More flexible criteria not requiring nested models: AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion)
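A short sketch of the two information criteria in their standard form, AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the sample size (taken here to be the number of alignment sites, which is an assumption). The log likelihoods and parameter counts in the example are made up.

```python
import math

def aic(lnl, k):
    """Akaike Information Criterion; lower is better."""
    return 2.0 * k - 2.0 * lnl

def bic(lnl, k, n):
    """Bayesian Information Criterion; lower is better."""
    return k * math.log(n) - 2.0 * lnl

# Hypothetical comparison on 1000 sites: a simple model without extra free
# parameters versus a complex model with 8 extra free parameters.
n_sites = 1000
simple = {"lnl": -1050.0, "k": 0}
complex_model = {"lnl": -1040.0, "k": 8}

aic_prefers_complex = aic(complex_model["lnl"], complex_model["k"]) < aic(simple["lnl"], simple["k"])
bic_prefers_complex = bic(complex_model["lnl"], complex_model["k"], n_sites) < bic(simple["lnl"], simple["k"], n_sites)
```

In this invented example AIC favors the complex model while BIC, which penalizes parameters more heavily, favors the simple one: a better likelihood alone does not settle the choice.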

63 Protein Substitution Models
- The GTR Q matrix for protein data has 189 free parameters (20 * 19 / 2 - 1) instead of just 5 (4 * 3 / 2 - 1) for DNA
- Estimating 189 rate parameters is difficult, time-consuming, and may lead to over-parameterizing the model
- Instead, empirical models such as JTT, LG, WAG, MTMAM, etc. are used
- The Q matrices are obtained by jointly optimizing model parameters on a large collection of reference alignments
- The models differ with respect to:
  - the amount of data used to obtain them
  - the type of data on which the models have been optimized, e.g., dedicated models for HIV, FLU, Mammals
  - the numerical optimization methods used
- Examples of general models:
  - LG: Le & Gascuel: An Improved General Amino Acid Replacement Matrix
  - WAG: Whelan & Goldman: A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach

64 Floating Point Numbers Machine numbers are an imperfect mapping of the infinite real numbers to a finite number of machine values!

65 Floating Point Arithmetic: The Root of All Evil
- Computational science mostly relies on floating-point intensive codes
- How do we verify these codes?
- We stand on shaky ground
- Scientists using those codes assume that there are no bugs
- Double precision arithmetic is required for certain applications
- Who knows what de-normalized floating point numbers are?
- Please have a look at: J. Björndalen, O. Anshus: "Trusting floating point benchmarks - are your benchmarks really data-independent?" Applied Parallel Computing. State of the Art in Scientific Computing 2010, Springer.
- and at my micro-benchmark at:

66 Post-order Traversal [Figure: tree with virtual root] Values get smaller and smaller as we approach the root

67 Post-order Traversal [Figure: tree with virtual root] Values get smaller and smaller as we approach the root. We need to apply numerical scaling techniques to avoid underflow!
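One possible scaling scheme, sketched under assumptions: whenever a running value drops below a threshold it is multiplied by a large constant, the number of such scalings is counted, and the log of the true value is recovered at the end. The constant 2^256 and all names are illustrative choices, not a description of a specific program.

```python
import math

SCALE = 2.0 ** 256          # multiplying by a power of two is exact
THRESHOLD = 1.0 / SCALE

def naive_product(factors):
    """Plain product; underflows to 0.0 for long chains of small values."""
    value = 1.0
    for f in factors:
        value *= f
    return value

def scaled_product(factors):
    """Product with rescaling. Returns (scaled value, number of scalings)."""
    value, scalings = 1.0, 0
    for f in factors:
        value *= f
        while 0.0 < value < THRESHOLD:
            value *= SCALE
            scalings += 1
    return value, scalings

def log_of_scaled(value, scalings):
    """Undo the scaling in log space."""
    return math.log(value) - scalings * math.log(SCALE)

factors = [1e-5] * 100       # true product is 1e-500, below double precision range
value, scalings = scaled_product(factors)
log_value = log_of_scaled(value, scalings)
```

The naive product of these factors is exactly 0.0 in double precision, whereas the scaled version recovers the correct logarithm.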

68 Optimization Use SSE3 & AVX vector intrinsics. Special implementations for: TIP TIP, TIP INNER, INNER INNER. I will spare you the details. But: avoid redundant computations

69 Loop Level Parallelism [Figure: tree with virtual root; vector P is computed from the vectors Q and R] P[i] = f(Q[i], R[i])

70 Loop Level Parallelism [Figure as on slide 69] P[i] = f(Q[i], R[i]). This operation uses 95% of total execution time!

71-73 Loop Level Parallelism [Figure, animated over slides 71 to 73: tree with virtual root and the vectors P, Q, R]

74 Parallel Post-order Traversal [Figure: tree with virtual root] Only need to synchronize at the root: MPI_Reduce() to calculate Σ log(l_i)

75 Parallel Post-order Traversal [Figure: the partial sums Σ log(l_i) + Σ log(l_i) of the processes add up to the overall score]
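The reduction can be imitated without MPI. The sketch below assumes that the per-site likelihoods l_i are already known, distributes the sites cyclically over a few worker threads, lets each worker sum the logs of its own sites, and adds the partial sums at the end (the role MPI_Reduce() plays on the slide). Names are illustrative.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def local_log_sum(site_likelihoods, rank, workers):
    """Partial sum of log(l_i) over the sites rank, rank + workers, ..."""
    return sum(math.log(l) for l in site_likelihoods[rank::workers])

def parallel_log_likelihood(site_likelihoods, workers=4):
    """Each worker handles its own sites; one final sum acts as the reduction."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(
            lambda rank: local_log_sum(site_likelihoods, rank, workers),
            range(workers),
        )
        return sum(partial_sums)

site_likelihoods = [0.01 * (i + 1) for i in range(37)]
score = parallel_log_likelihood(site_likelihoods, workers=4)
```

Because sites are independent given the tree and model, the only point where workers need to exchange data is this final sum.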

76 Parallelization Outline

77 Loop-Level Parallelism on Distributed Memory Machines

78 Classic Fork-Join with Busy-Wait
[Figure: timeline of three fork-join rounds; threads busy-wait at the barrier between rounds]
- Trigger: broadcast traversal; compute all vectors in traversal descriptor; reduce Σ log(l_i); barrier
- Trigger: broadcast traversal; compute all vectors in traversal descriptor; reduce Σ log(l_i); barrier
- Trigger: broadcast new α params.; compute all vectors in traversal descriptor; reduce Σ log(l_i); barrier

79 Synchronizations in RAxML with Pthreads
RAxML Pthreads, for a run time of about 10 seconds on 16 cores/threads:
- 404 taxa, 7429 sites: 194,000 barriers
- 1481 taxa, 1241 sites: 739,000 barriers
A paper on the performance of alternative Pthreads barrier implementations: S.A. Berger, A. Stamatakis: "Assessment of Barrier Implementations for Fine-Grain Parallel Regions on Current Multi-core Architectures", IEEE Cluster 2010.

80 Classic Fork-Join with Busy-Wait [Figure: the three fork-join rounds (broadcast traversal or new α params., compute all vectors in traversal descriptor, reduce Σ log(l_i), barrier) with a worked example; the new α parameter broadcast in the third round is α := 0.2]

81 Classic Fork-Join with Busy-Wait [Figure as on slide 80] For good parallel performance: the broadcast must be fast!

82 Parallel Performance Problems They all start with partitioned datasets!

83 What is a partitioned dataset? Multi-gene or whole-genome alignment

84 What is a partitioned dataset? Multi-gene or whole-genome alignment: Gene 0, Gene 1, Gene 2, Gene 3, Gene 4

85 What is a partitioned dataset? Multi-gene or whole-genome alignment: Gene 0, Gene 1, Gene 2, Gene 3, Gene 4. We may also partition by 1st, 2nd, and 3rd codon position

86 What is a partitioned dataset? Each partition has its own model parameters: Gene 0: α0, GTR0; Gene 1: α1, GTR1; Gene 2: α2, GTR2; Gene 3: α3, GTR3; Gene 4: α4, GTR4

87 What is a partitioned dataset? Gene 0: α0, GTR0; Gene 1: α1, GTR1; Gene 2: α2, GTR2; Gene 3: α3, GTR3; Gene 4: α4, GTR4. Joint branch length estimate

88 Assigning Protein Models? Each of Gene 0 to Gene 4 can be assigned WAG, LG, JTT, ... Joint branch length estimate

89 Assigning Protein Models? Each of Gene 0 to Gene 4 can be assigned WAG, LG, JTT, ... Joint branch length estimate. There exist about 20 empirical protein models: 20^5 possible assignments!

90 Assigning Protein Models? Each of Gene 0 to Gene 4 can be assigned WAG, LG, JTT, ... Joint branch length estimate. Because partitions are linked via branch lengths, this is NP-hard. Proof: joint work with Kassian. Heuristics: a Master's thesis in our lab

91 What is a partitioned dataset? Gene 0: α0, GTR0; Gene 1: α1, GTR1; Gene 2: α2, GTR2; Gene 3: α3, GTR3; Gene 4: α4, GTR4. Per-partition branch length estimate

92 What is a partitioned dataset? [Figure: one tree per partition (Gene 0 to Gene 4, each with its own α and GTR parameters)] Same underlying tree topology!

93 What is a partitioned dataset? [Figure as on slide 92] Per-partition branch length estimate: increases the number of parameters in the likelihood model by 4 * (2n - 3), where n is the number of taxa

94 Parallel Performance Problems
- They all start with partitioned datasets!
- How do we distribute partitions to processors?
- How do we calculate parameter changes?
- How much time does our broadcast take?
- Goal: keep all processors busy all the time; minimize communication and synchronization!

95 Data Distribution [Figure: alignment of Orangutan, Gorilla, Chimp, and Homo Sapiens with two partitions (first partition: AACG, AAGG, A-GG, AGGA; second partition: T-only columns); two CPUs with caches attached to shared memory]

96 Data Distribution [Figure: the same alignment; each CPU with its cache is attached to its own distributed memory]

97 Data Distribution [Figure as on slide 96] Partitioned data distribution is not that trivial!

98 Data Distribution I [Figure: the alignment distributed across two CPUs with caches and distributed memory]

99 Data Distribution I [Figure as on slide 98] Works well when we have more partitions than processors. May lead to load imbalance: not all processors obtain an equal number of sites!

100 Data Distribution II [Figure: the alignment distributed across two CPUs with caches and distributed memory] Works well when we have more processors than partitions. However, we will need to compute P(t) = e^{Qt} for each partition at each processor!

101 Data Distribution II [Figure as on slide 100; both CPUs compute e^{Q_1 t} and e^{Q_2 t}] Works well when we have more processors than partitions. However, we will need to compute P(t) = e^{Qt} for each partition at each processor!

102 Data Distribution II [Figure as on slide 101] Performance impact depends on the number of states in the data/dimension of Q

103 Data Distribution II [Figure as on slide 101] How do we distribute partitions to processors?

104 Load Balance I [Figure: partitions G0, G1, G2, G3 and processors P0, P1]

105 Load Balance I [Figure: processors P0 and P1]

106 Load Balance I [Figure: processors P0 and P1] Find the partition-to-processor assignment such that the maximum number of sites per processor is minimized: this is NP-hard

107 Load Balance I
- The multiprocessor job scheduling problem in phylogenetics
- Problem when #partitions >> #cores
- Tested per-site (cyclic/modulo) data distribution versus per-partition data distribution
- We used the Longest Processing Time (LPT) heuristic for assigning partitions to processors
- 25 taxa, 220,000 sites, 100 genes:
  - GAMMA model: naïve 613 secs, LPT 550 secs
  - CAT model: naïve 298 secs, LPT 127 secs
- Larger protein dataset under Γ model of rate heterogeneity: 10-fold performance improvement!
- J. Zhang, A. Stamatakis: "The Multi-Processor Scheduling Problem in Phylogenetics", 11th IEEE HICOMB workshop (in conjunction with IPDPS 2012).

108 LPT heuristic for multi-processor scheduling
- Sort jobs (partitions) by processing length (partition length) in decreasing order
- Remove a job (partition) from the sorted list and assign it to the processor with the earliest end time (the smallest sum of partition lengths)
- Repeat until the sorted list is empty
- Upper bound: (4/3 - 1/(3p)) * OPT, where p is the number of processors. Graham, R. L.: "Bounds on Multiprocessing Timing Anomalies". SIAM Journal on Applied Mathematics 17 (2).
- Remark: LPT works surprisingly well (see our paper on the phylogenetic problem where we also tested other heuristics)
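The three steps can be sketched as follows, using a min-heap to find the least loaded processor. Partition lengths stand in for processing times, as on the slide; the function name and the example lengths are made up.

```python
import heapq

def lpt_schedule(partition_lengths, processors):
    """Longest Processing Time heuristic.
    Returns (assignment: processor -> list of partition indices, maximum load)."""
    loads = [(0, p) for p in range(processors)]   # (current load, processor id)
    heapq.heapify(loads)
    assignment = {p: [] for p in range(processors)}
    # Longest partitions first.
    by_length = sorted(enumerate(partition_lengths), key=lambda item: -item[1])
    for index, length in by_length:
        load, p = heapq.heappop(loads)            # least loaded processor
        assignment[p].append(index)
        heapq.heappush(loads, (load + length, p))
    return assignment, max(load for load, _ in loads)

lengths = [7, 5, 4, 4, 3, 3, 2]                   # hypothetical partition lengths
assignment, max_load = lpt_schedule(lengths, processors=2)
```

For these seven partitions on two processors, LPT ends with a load of 14 sites on each processor, which is optimal since the lengths add up to 28.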

109 Partitioned Branch Lengths & other parameters

110 Load-Balance II Zoom

111 Synchronization Points
- Assume 10 branches
- Each branch requires 10 Newton-Raphson iterations
- Each NR iteration requires a synchronization via a reduction operation
- One branch/partition at a time: 100 sync. points, less work (only one partition) per sync. point
- All branches concurrently: 10 sync. points, more work per sync. point
- Branches will need a distinct number of operations
- Add a convergence state bit vector
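The counting argument can be sketched as a small simulation. It assumes that the number of Newton-Raphson iterations each item needs is known in advance (in practice it is only discovered while iterating) and models the convergence state bit vector as a list of booleans. Names are illustrative.

```python
def sync_points_sequential(iterations_needed):
    """One item at a time: every iteration of every item is its own sync point."""
    return sum(iterations_needed)

def sync_points_concurrent(iterations_needed):
    """All items at once: one sync point per round, converged items are skipped."""
    remaining = list(iterations_needed)
    converged = [n == 0 for n in remaining]       # convergence state bit vector
    syncs = 0
    while not all(converged):
        for i, done in enumerate(converged):
            if not done:
                remaining[i] -= 1                 # one Newton-Raphson step
                converged[i] = remaining[i] == 0
        syncs += 1                                # a single reduction covers all active items
    return syncs

slide_sequential = sync_points_sequential([10] * 10)
slide_concurrent = sync_points_concurrent([10] * 10)
```

With 10 items needing 10 iterations each this gives 100 versus 10 sync points, as on the slide. With unequal counts such as [3, 4] (an assumed reading of a two-partition example) it gives 7 versus 4.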

112-116 Synchronization Points [Figure, animated over slides 112 to 116: alignment of Org1 (AC GT) and Org2 (AC TT) split into two partitions, with the convergence state bit vector (00) shown during the iterations]

117 Synchronization Points [Figure as on slide 116] In this example: 4 instead of 7 sync points!

118 Load Balance II A. Stamatakis, M. Ott: "Load Balance in the Phylogenetic Likelihood Kernel". Proceedings of ICPP 2009, Vienna, Austria, September 2009.

119 Classic Fork-Join with Busy-Wait [Figure: the three fork-join rounds (broadcast traversal or new α params., α := 0.2; compute all vectors in traversal descriptor; reduce Σ log(l_i); barrier)] For good parallel performance: the broadcast must be fast! Remember: 10 secs on 16 cores: approx. 500,000 times. What happens if we have 1000 partitions and propose 1000 new alpha parameters?

120 Alternative MPI parallelization [Figure: processes P0 and P1 over execution time, with the MPI_Reduce() and MPI_Bcast() calls marked ? ?]

121 Alternative MPI parallelization [Figure as on slide 120] I think this is the way we will have to do it in the future.

122 ExaML
- New code implementing this new parallelization scheme
- A. Stamatakis, A. J. Aberer: "Novel Parallelization Schemes for Large-Scale Likelihood-based Phylogenetic Inference", accepted for publication at IPDPS 2013, Boston, USA
- Up to 3 times faster than RAxML-Light (2012) on large, partitioned datasets
- Tested with up to 1536 cores on our cluster at HITS
- Future developments: 20,000,000 CPU hours on SuperMUC for improving scalability, implementing fault tolerance, and executing 1KITE tree inferences
- Further details: ask Andre Aberer
