Exam Dates: February 19, March 1. If those dates don't work: ad hoc in Heidelberg
Plan for next lectures
Lecture 11: More on Likelihood Models & Parallel Computing in Phylogenetics
Lecture 12 (Andre): Discrete Operations on Trees (Alexis will not be there)
Lecture 13 (Pavlos & Alexis): Population genetics I
Lecture 14 (Pavlos & Alexis): Population genetics II
Lecture 15 (Alexis): Wrap-up & exam preparation
Likelihood-based Phylogenetic Inference so far: Search Techniques; The Maximum Likelihood Model; Some simple DNA models: Jukes-Cantor, General Time Reversible (GTR) Model
Outline for today More on Likelihood Models Computing the likelihood on a tree Rate heterogeneity among sites Model Selection Protein Substitution Models Optimizing and Parallelizing Likelihood Calculations Parallelization Load Balance Saving RAM Course evaluation
Maximum Likelihood Length: m Seq1 Seq2 Seq3 Seq4 Alignment
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 Alignment A C G T Substitution model
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 Alignment A C G T Substitution model. Commonly denoted as Q matrix: instantaneous rates; transition probabilities for time t are obtained via the matrix exponential P(t) = e^(Qt)
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 Alignment A C G T Prior probabilities, empirical base frequencies: πa πc πg πt Substitution model
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 A C G T Alignment Seq 1 Prior probabilities, empirical base frequencies: πa πc πg πt Substitution model Seq 3 b3 b1 b5 b2 Seq 2 b4 Seq 4
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 A C G T Alignment Seq 1 Prior probabilities, Empirical base frequencies πa πc πg πt Substitution model Seq 3 b3 b1 b5 b2 b4 Seq 2 Seq 4 virtual root: vr
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 A C G T Alignment Seq 1 Prior probabilities, empirical base frequencies: πa πc πg πt Substitution model Seq 3 b3 b1 vr b5 b2 b4 Seq 2 P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T) m Seq 4
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 A C G T Alignment Seq 1 b1 Prior probabilities, Empirical base frequencies πa πc πg πt Substitution model Floating-point & memory b3 intensive vr Seq 3 b5 b2 b4 Seq 2 P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T) m Seq 4
Post-order Traversal virtual root
Post-order Traversal virtual root
Post-order Traversal virtual root. Tip sequence AGCC encoded as per-site probability vectors (state order A, C, G, T): A → 1.0 0.0 0.0 0.0; G → 0.0 0.0 1.0 0.0; C → 0.0 1.0 0.0 0.0; C → 0.0 1.0 0.0 0.0
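Building these tip vectors can be sketched as follows; each observed nucleotide maps to a unit vector in state order A, C, G, T (the function name `tip_vector` is a hypothetical choice):

```c
#include <assert.h>

/* Map a DNA character at a tip to its conditional likelihood
   vector: probability 1.0 for the observed state, 0.0 otherwise.
   State order: A, C, G, T.                                       */
void tip_vector(char c, double v[4])
{
    int i;
    for (i = 0; i < 4; i++)
        v[i] = 0.0;
    switch (c) {
        case 'A': v[0] = 1.0; break;
        case 'C': v[1] = 1.0; break;
        case 'G': v[2] = 1.0; break;
        case 'T': v[3] = 1.0; break;
    }
}
```

Real implementations also handle ambiguous IUPAC characters (e.g., N → all four entries set to 1.0), which is omitted here.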
Post-order Traversal virtual root
Post-order Traversal virtual root :-)
What happens when we compute this inner vector? At alignment position c, the conditional likelihood vector L^(k) = (P(A), P(C), P(G), P(T)) at node k is computed from the vectors L^(i) and L^(j) of its two child nodes, via the transition probability matrices P(b_i) and P(b_j) for the branches b_i and b_j leading to them.
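This step, the Felsenstein pruning recursion, can be sketched for a single alignment position as follows (4-state DNA case; the function name and fixed-size arrays are illustrative assumptions):

```c
#include <assert.h>

/* Compute the conditional likelihood vector L^(k) at an inner node k
   from its two children i and j (Felsenstein pruning):
     L^(k)[x] = ( sum_y P_xy(b_i) * L^(i)[y] ) *
                ( sum_y P_xy(b_j) * L^(j)[y] )
   pi and pj are the 4x4 transition probability matrices P(b_i), P(b_j). */
void inner_vector(double pi[4][4], const double li[4],
                  double pj[4][4], const double lj[4],
                  double lk[4])
{
    int x, y;
    for (x = 0; x < 4; x++) {
        double si = 0.0, sj = 0.0;
        for (y = 0; y < 4; y++) {
            si += pi[x][y] * li[y];   /* contribution from child i */
            sj += pj[x][y] * lj[y];   /* contribution from child j */
        }
        lk[x] = si * sj;
    }
}
```

This is the per-site kernel that dominates the floating-point and memory cost noted on the earlier slide: it runs for every inner node and every alignment position.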
Post-order Traversal virtual root :-)
Post-order Traversal virtual root :-)
Post-order Traversal virtual root :-)
Post-order Traversal virtual root :-) :-)
Post-order Traversal virtual root :-) :-)
Post-order Traversal virtual root :-) :-)
Post-order Traversal virtual root :-) :-) :-)
Post-order Traversal Overall likelihood: sum over logs of per-site likelihoods virtual root :-) :-) :-)
Post-order Traversal LS4
Post-order Traversal LS3 b1 b4 LS5
Post-order Traversal LS3 b4' b1' x LS5
Post-order Traversal b4':=b1+b4 b1' := 0 LS3 x LS5
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 Alignment A C G T Prior probabilities, empirical base frequencies: πa πc πg πt Substitution model Seq 3 Seq 1 Seq 2 Seq 4 optimize branch lengths
Branch Length Optimization starting branch
Branch Length Optimization starting branch Essentially we place the virtual root into this branch here
Branch Length Optimization starting branch
Branch Length Optimization starting branch
Branch Length Optimization starting branch
Branch Length Optimization starting branch
Branch Length Optimization starting branch
Branch Length Optimization starting branch
Branch Length Optimization starting branch iterate until convergence
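Each single branch is typically optimized with the Newton-Raphson method: starting from the current length b, iterate b ← b − f′(b)/f″(b) until the step size falls below a tolerance, where f is the log likelihood as a function of that branch length. A generic sketch with hypothetical derivative callbacks (the real code evaluates these derivatives from the conditional likelihood vectors at both ends of the branch):

```c
#include <assert.h>
#include <math.h>

/* Generic Newton-Raphson iteration for branch length optimization:
   d1 and d2 are callbacks for the first and second derivative of the
   log likelihood with respect to the branch length (hypothetical
   signatures, for illustration only).                              */
double newton_branch(double b, double eps, int max_iter,
                     double (*d1)(double), double (*d2)(double))
{
    int i;
    for (i = 0; i < max_iter; i++) {
        double step = d1(b) / d2(b);
        b -= step;
        if (fabs(step) < eps)   /* converged */
            break;
    }
    return b;
}

/* toy example: log likelihood -(b - 0.1)^2, maximized at b = 0.1 */
static double toy_d1(double b) { return -2.0 * (b - 0.1); }
static double toy_d2(double b) { (void)b; return -2.0; }
```

On the toy quadratic, Newton-Raphson converges in a single step; on the real likelihood surface several iterations per branch are needed, which is why the synchronization cost discussed later matters.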
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 Alignment A C G T Prior probabilities, empirical base frequencies: πa πc πg πt Substitution model optimize model parameters Seq 1 Seq 2 Seq 3 Seq 4
Maximum Likelihood Length: m A C G T Seq1 Seq2 Seq3 Seq4 Alignment A C G T Prior probabilities, empirical base frequencies: πa πc πg πt Substitution model optimize model parameters Seq 1 Seq 2 Seq 3 Seq 4. We don't need to re-estimate model params. for every tree topology as long as the tree is reasonable, i.e., non-random
Numerical Optimization Procedures See chapters 9 & 10 of: Numerical Recipes in C The Art of Scientific Computing
Basic Operations Maximum Likelihood Compute Conditional Likelihood Array at an inner node Compute Likelihood at Virtual Root Optimize a Branch Length for a given Branch Optimize all Branch Lengths Optimize other Model Parameters
Basic Operations Maximum Likelihood Compute Conditional Likelihood Array at an inner node Compute Likelihood at Virtual Root Optimize a Branch Length for a given Branch Optimize all Branch Lengths Optimize other Model Parameters Bayesian programs only require two operations
Rate Heterogeneity among Sites Gene ACGGGGGGGGGGGTTTTCCCCC ATGGGGGGGGGGGTTTCCCCCC ACCGGGGGGGGGGTTTTGCCCC AGGGGGGGGGGGCTTTTCCCCC Among-site rate heterogeneity Biological phenomenon different sites/columns evolve at different speeds Need to accommodate this in our models
Γ-Distribution
Γ-Distribution: small α → high rate heterogeneity; large α → low rate heterogeneity
Discrete Γ-Distribution r0 r1 r2 r3
An Abstract View of Γ rate 0: P(t) = e^(Q r0 t) rate 1: P(t) = e^(Q r1 t) rate 2: P(t) = e^(Q r2 t) rate 3: P(t) = e^(Q r3 t) length 7. This is the integral of the likelihood we approximate via discretization: LnL(i) = log(¼ · (L0 + L1 + L2 + L3)) is the log likelihood at site i. All Γ rates have equal probability.
An Abstract View of Γ rate 0: P(t) = e^(Q r0 t) rate 1: P(t) = e^(Q r1 t) rate 2: P(t) = e^(Q r2 t) rate 3: P(t) = e^(Q r3 t) length 7. 4 times higher memory consumption
An Abstract View of Γ rate 0: P(t) = e^(Q r0 t) rate 1: P(t) = e^(Q r1 t) rate 2: P(t) = e^(Q r2 t) rate 3: P(t) = e^(Q r3 t) length 7. 4 times more floating point operations
Γ Model of Rate Heterogeneity with 4 discrete rates α r0 r1 r2 r3............ r0 r1 r2 r3
Another Famous Computational Evolutionary Biologist: Ziheng Yang, the guy who has written this nice book. Z. Yang: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods, Journal of Molecular Evolution 39(3):306-314, 1994. At LMU Munich during student protests
An Abstract View of CAT alignment length: 7
CAT: ML-based Optimization of Individual Rates per Site r0, r1, …, rm
CAT Step 1: ML-based Optimization of Individual Rates per Site. A. Stamatakis: Phylogenetic models of rate heterogeneity: a high performance computing perspective, Proceedings of IPDPS 2006. r0, r1, …, rm
CAT Step 2: Order List of ri according to site log likelihoods li ([r0, -1.2], [r1, -5.6], [r2, -2.5], [r3,-0.3]) Sort according to log likelihood contributions of sites ([r1, -5.6], [r2, -2.5], [r0, -1.2], [r3,-0.3])
CAT Step 3: Categorize; Example with 2 Rate Categories (default: 25) ([r1, -5.6], [r2, -2.5], [r0, -1.2], [r3, -0.3]) Select the 2 rates with the highest likelihood contributions: rc0 := r1, category[1] := 0; rc1 := r2, category[2] := 1. The index refers to the alignment column position. Assign the remaining rates of positions r0, r3 to the rate category 0 or 1 that minimizes (δ(r1, r0), δ(r2, r0)) and (δ(r1, r3), δ(r2, r3)), respectively
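The categorization step can be sketched for the general case: pick the k sites with the largest (most negative) likelihood contributions as category rates, then assign every site to the closest category rate. This is an illustrative simplification (function `cat_assign` and the fixed-size bookkeeping are my own), not the RAxML implementation:

```c
#include <assert.h>
#include <math.h>

/* CAT sketch: given per-site rates r[] and per-site log likelihoods
   lnl[], choose the k sites with the most negative lnl as category
   rates rc[], then assign every site to the nearest category rate. */
void cat_assign(const double r[], const double lnl[], int n,
                double rc[], int category[], int k)
{
    int used[64] = {0};   /* marks sites already chosen (n <= 64 here) */
    int c, i;
    for (c = 0; c < k; c++) {          /* pick k most negative lnl */
        int best = -1;
        for (i = 0; i < n; i++)
            if (!used[i] && (best < 0 || lnl[i] < lnl[best]))
                best = i;
        used[best] = 1;
        rc[c] = r[best];
    }
    for (i = 0; i < n; i++) {          /* nearest category rate */
        int bestc = 0;
        for (c = 1; c < k; c++)
            if (fabs(r[i] - rc[c]) < fabs(r[i] - rc[bestc]))
                bestc = c;
        category[i] = bestc;
    }
}
```

With the slide's example (lnl = -1.2, -5.6, -2.5, -0.3), the two category rates become those of sites 1 and 2, and sites 0 and 3 are assigned to whichever of the two is closer in rate.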
ML-Loops

CAT-LOOP
for(i = 0; i < n; i++)
{
  cat  = category[i];
  r    = rate[cat];
  p[i] = f(q[i], pq, r[i], pr, r);
}

Γ-LOOP
for(i = 0; i < n; i++)
{
  p[i].g0 = f(q[i], pq, r[i], pr, r0);
  p[i].g1 = f(q[i], pq, r[i], pr, r1);
  p[i].g2 = f(q[i], pq, r[i], pr, r2);
  p[i].g3 = f(q[i], pq, r[i], pr, r3);
}
715 Sequences under HKY85+Γ Log Likelihood Score under Γ C.E. Robertson et al (2005) Phylogenetic diversity and ecology of environmental Archaea, In Current Opinion in Microbiology. Execution Time
Model Selection How do we select a model? How do we avoid over-parameterizing a model? Typically: a model with more parameters yields better likelihood scores better likelihood is not a criterion for selecting a more complex model we may over-parametrize the model key question: Does a more complex model yield a different tree? Model selection criteria: Likelihood-ratio test can only be applied to nested models! More flexible criteria not requiring nested models AIC: Akaike Information Criterion BIC: Bayesian Information Criterion
Protein Substitution Models The GTR Q matrix for protein data has 189 free parameters instead of just 5 (DNA) Estimating 189 rate parameters is difficult, time-consuming, and may lead to overparameterizing the model Instead, empirical models such as JTT, LG, WAG, MTMAM, etc. are used The Q matrices are obtained by jointly optimizing model parameters on a large collection of reference alignments The models differ with respect to: the amount of data used to obtain them the type of data on which the models have been optimized e.g., dedicated models for HIV, FLU, Mammals the numerical optimization methods used Examples of general models: LG: Le & Gascuel: An Improved General Amino Acid Replacement Matrix WAG: Whelan & Goldman: A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach
Floating Point Numbers Machine numbers are an imperfect mapping of the infinite real numbers to a finite number of machine values!
Floating Point Arithmetic: The Root of All Evil. Computational science mostly relies on floating-point intensive codes. How do we verify these codes? We stand on shaky ground: scientists using those codes assume that there are no bugs. Double precision arithmetic is required for certain applications. Who knows what de-normalized floating point numbers are? Please have a look at: J. Björndalen, O. Anshus: Trusting floating point benchmarks: are your benchmarks really data-independent? Applied Parallel Computing. State of the Art in Scientific Computing 2010, pp. 178-188, Springer. And at my micro-benchmark at: https://github.com/stamatak/denormalizedfloatingpointnumbers
Post-order Traversal virtual root. Values get smaller and smaller as we approach the root: 0.1, 0.1, 0.01, 0.001
Post-order Traversal. We need to apply numerical scaling techniques to avoid underflow! virtual root. Values get smaller and smaller as we approach the root: 0.1, 0.1, 0.01, 0.001
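A common scaling technique works as follows: whenever all entries of a conditional likelihood vector fall below a threshold, multiply the vector by a large constant and remember how often this happened; at the root, the accumulated log of the scaling factors is subtracted from the site log likelihood. A sketch with assumed threshold and factor constants (real implementations choose values tied to the floating-point format):

```c
#include <assert.h>

#define SCALE_THRESHOLD 1e-256   /* hypothetical threshold */
#define SCALE_FACTOR    1e256    /* hypothetical scaler    */

/* If every entry of the 4-entry conditional likelihood vector is
   below SCALE_THRESHOLD, rescale the vector and count the event.
   Returns 1 if scaling was applied, 0 otherwise.                */
int scale_vector(double v[4], int *scale_count)
{
    int i, small = 1;
    for (i = 0; i < 4; i++)
        if (v[i] >= SCALE_THRESHOLD)
            small = 0;
    if (small) {
        for (i = 0; i < 4; i++)
            v[i] *= SCALE_FACTOR;
        (*scale_count)++;
        return 1;
    }
    return 0;
}
```

At the root, the correction is lnL_site = log(scaled likelihood) - scale_count * log(SCALE_FACTOR), so the final score is unaffected by the rescaling.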
Optimization Use SSE3 & AVX vector intrinsics Special implementations for: TIP TIP TIP INNER INNER I will spare you the details But, avoid redundant computations INNER
Loop Level Parallelism virtual root P Q R P[i] = f(q[i], R[i])
Loop Level Parallelism virtual root P This operation uses 95% of total execution time! Q R P[i] = f(q[i], R[i])
Loop Level Parallelism virtual root P Q R
Loop Level Parallelism virtual root P Q R
Loop Level Parallelism virtual root P Q R
Parallel Post-order Traversal Only need to synchronize at the root MPI_Reduce() to calculate: virtual root Σ log(li)
Parallel Post-order Traversal Overall Score + Σ log(li) Σ log(li)
Parallelization Outline
Loop-Level Parallelism on Distributed Memory Machines
Classic Fork-Join with Busy-Wait (threads 0, 1, 2, 3): Broadcast traversal → trigger → compute all vectors in traversal descriptor → barrier → reduce Σ log(li). Broadcast traversal → trigger → compute all vectors in traversal descriptor → barrier → reduce Σ log(li). Broadcast new α params. → trigger → compute all vectors in traversal descriptor → barrier → reduce Σ log(li). Worker threads busy-wait between these phases.
Synchronizations in RAxML with Pthreads RAxML Pthreads for a run time of about 10 seconds on 16 cores/threads 404 taxa 7429 sites: 194,000 Barriers 1481 taxa 1241 sites: 739,000 Barriers A paper on performance of alternative PThreads barrier implementations: S.A. Berger, A. Stamatakis: "Assessment of Barrier Implementions for Fine-Grain Parallel Regions on Current Multi-core Architectures", IEEE Cluster 2010.
Classic Fork-Join with Busy-Wait (threads 0, 1, 2, 3; conditional likelihood vectors A, B, C, D in the traversal descriptor): Broadcast traversal → trigger → compute all vectors in traversal descriptor → barrier → reduce Σ log(li). Broadcast traversal → trigger → compute all vectors in traversal descriptor → barrier → reduce Σ log(li). Broadcast new α params. (e.g., α := 0.2) → trigger → compute all vectors in traversal descriptor → barrier → reduce Σ log(li).
Classic Fork-Join with Busy-Wait: for good parallel performance, the broadcast must be fast!
Parallel Performance Problems They all start with partitioned datasets!
What is a partitioned dataset? Multi-gene or whole-genome alignment
What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Multi-gene or whole-genome alignment Gene 4
What is a partitioned dataset? We may also partition by 1st, 2nd, and 3rd codon position Gene 0 Gene 1 Gene 2 Gene 3 Multi-gene or whole-genome alignment Gene 4
What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4 α0 GTR0 α1 GTR1 α2 GTR2 α3 GTR3 α4 GTR4
What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4 α0 GTR0 α1 GTR1 α2 GTR2 α3 GTR3 α4 GTR4 Joint branch length estimate
Assigning Protein Models? Gene 0 Gene 1 WAG LG JTT... WAG LG JTT... Gene 2 WAG LG JTT... Gene 3 WAG LG JTT... Gene 4 WAG LG JTT... Joint branch length estimate
Assigning Protein Models? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4, each with WAG LG JTT... There exist about 20 empirical protein models → 20^5 possible assignments! Joint branch length estimate
Assigning Protein Models? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4, each with WAG LG JTT... Because partitions are linked via branch lengths, this is NP-hard (proof: joint work with Kassian; heuristics: a Master's thesis in our lab). Joint branch length estimate
What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4 α0 GTR0 α1 GTR1 α2 GTR2 α3 GTR3 α4 GTR4 per-partition branch length estimate
What is a partitioned dataset? Gene 0: α0 GTR0; Gene 1: α1 GTR1; Gene 2: α2 GTR2; Gene 3: α3 GTR3; Gene 4: α4 GTR4. Same underlying tree topology!
What is a partitioned dataset? Gene 0: α0 GTR0; Gene 1: α1 GTR1; Gene 2: α2 GTR2; Gene 3: α3 GTR3; Gene 4: α4 GTR4. Per-partition branch length estimation increases the number of parameters in the likelihood model by 4 · (2n - 3), where n is the number of taxa
Parallel Performance Problems They all start with partitioned datasets! How do we distribute partitions to processors? How do we calculate parameter changes? How much time does our broadcast take? Goal: Keep all processors busy all the time minimize communication and synchronization!
Data Distribution Orangutan Gorilla Chimp Homo Sapiens AACG AAGG A-GG AGGA CPU cache Shared TTTT TTTTTTT TTTT CPU cache memory
Data Distribution Orangutan Gorilla Chimp Homo Sapiens AACG AAGG A-GG AGGA CPU TTTT TTTTTTT TTTT CPU cache cache Distributed memory Distributed memory
Data Distribution Orangutan Gorilla Chimp Homo Sapiens AACG AAGG A-GG AGGA CPU Partitioned data distribution is not that trivial! TTTT TTTTTTT TTTT CPU cache cache Distributed memory Distributed memory
Data Distribution I Orangutan Gorilla Chimp Homo Sapiens AACG AAGG A-GG AGGA CPU TTTT TTTTTTT TTTT CPU cache cache Distributed memory Distributed memory
Data Distribution I Orangutan Gorilla Chimp Homo Sapiens AACG AAGG A-GG AGGA TTTT TTTTTTT TTTT CPU cache CPU cache Distributed memory Distributed memory. Works well when we have more partitions than processors. May lead to load imbalance: not all processors obtain an equal number of sites!
Data Distribution II Orangutan Gorilla Chimp Homo Sapiens. Works well when we have more processors than partitions. However, we will need to compute P(t) = e^(Qt) for each partition at each processor! AACG AAGG A-GG AGGA TTTT TTTTTTT TTTT CPU cache CPU cache Distributed memory Distributed memory
Data Distribution II Orangutan Gorilla Chimp Homo Sapiens. Works well when we have more processors than partitions. However, we will need to compute P(t) = e^(Q1 t) and P(t) = e^(Q2 t) for each partition at each processor! AACG AAGG A-GG AGGA TTTT TTTTTTT TTTT CPU cache CPU cache Distributed memory Distributed memory
Data Distribution II: Performance impact depends on the number of states in the data / the dimension of Q. Works well when we have more processors than partitions. However, we will need to compute P(t) = e^(Q1 t) and P(t) = e^(Q2 t) for each partition at each processor! AACG AAGG A-GG AGGA TTTT TTTTTTT TTTT CPU cache CPU cache Distributed memory Distributed memory
Data Distribution II: How do we distribute partitions to processors? Works well when we have more processors than partitions. However, we will need to compute P(t) = e^(Q1 t) and P(t) = e^(Q2 t) for each partition at each processor! AACG AAGG A-GG AGGA TTTT TTTTTTT TTTT CPU cache CPU cache Distributed memory Distributed memory
Load Balance I G0 P0 G1 G2 P1 G3
Load Balance I P0 P1
Load Balance I P0 P1 Find the partition-to-processor assignment such that the maximum number of sites per processor is minimized this is NP-hard
Load Balance I: The multiprocessor job scheduling problem in phylogenetics. Problem when #partitions >> #cores. Tested per-site (cyclic/modulo) data distribution versus per-partition data distribution. We used the Longest Processing Time (LPT) heuristics for assigning partitions to processors. 25 taxa, 220,000 sites, 100 genes: GAMMA model naïve 613 secs, LPT 550 secs; CAT model naïve 298 secs, LPT 127 secs. Larger protein dataset under the Γ model of rate heterogeneity: 10-fold performance improvement! J. Zhang, A. Stamatakis: "The Multi-Processor Scheduling Problem in Phylogenetics", 11th IEEE HICOMB workshop (in conjunction with IPDPS 2012).
LPT heuristics for multi-processor scheduling: Sort jobs (partitions) by processing length (partition length) in decreasing order. Remove a job (partition) from the sorted list and assign it to the processor with the earliest end time (the smallest sum of partition lengths). Repeat until the sorted list is empty. Upper bound: (4/3 - 1/(3p)) · OPT, where p is the number of processors. Graham, R. L.: "Bounds on Multiprocessing Timing Anomalies". SIAM Journal on Applied Mathematics 17(2):416-429, 1969. Remark: LPT works surprisingly well (see our paper on the phylogenetic problem where we also tested other heuristics)
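The steps above can be sketched in a few lines (the function name and fixed-size bookkeeping are my own):

```c
#include <assert.h>

/* LPT sketch: assign n partitions (represented only by their site
   counts len[]) to p processors, longest first, each onto the
   currently least-loaded processor.                               */
void lpt_schedule(const int len[], int n, int assign[], long load[], int p)
{
    int done[64] = {0};                    /* n <= 64 in this sketch */
    int k, i;
    for (i = 0; i < p; i++)
        load[i] = 0;
    for (k = 0; k < n; k++) {
        int longest = -1, target = 0;
        for (i = 0; i < n; i++)            /* pick longest remaining job */
            if (!done[i] && (longest < 0 || len[i] > len[longest]))
                longest = i;
        for (i = 1; i < p; i++)            /* pick least-loaded processor */
            if (load[i] < load[target])
                target = i;
        done[longest] = 1;
        assign[longest] = target;
        load[target] += len[longest];
    }
}
```

For partition lengths 7, 5, 4, 3, 1 on two processors, LPT produces loads 10 and 10, which here is also optimal.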
Partitioned Branch Lengths & other parameters
Load-Balance II Zoom
Synchronization Points Assume 10 branches Each branch requires 10 Newton-Raphson Iterations Each NR Iteration requires a synchronization via a reduction operation One branch/partition at a time: 100 sync. points, less work (only one partition) per sync. point All branches concurrently: 10 sync. points, more work per sync. point Branches will need distinct number of operations Add convergence state bit vector
Synchronization Points Org1 AC GT Org2 AC TT
Synchronization Points Org1 AC GT Org2 AC TT
Synchronization Points Org1 AC GT Org2 AC TT 00
Synchronization Points Org1 AC GT Org2 AC TT 00 00
Synchronization Points Org1 AC GT Org2 AC TT 00 00 01
Synchronization Points Org1 AC GT Org2 AC TT 00 00 01 11 In this example: 4 instead of 7 sync points!
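The convergence state bit vector can be sketched as follows: bit i is set once the Newton step for branch i falls below the tolerance, so converged branches drop out of the remaining synchronized rounds while the others keep iterating (function names assumed):

```c
#include <assert.h>

/* Set bit i of the convergence state once the Newton step for
   branch i is below the tolerance eps; converged branches are then
   skipped in subsequent synchronized rounds.                      */
unsigned int mark_converged(unsigned int state, const double step[],
                            int n, double eps)
{
    int i;
    for (i = 0; i < n; i++)
        if (step[i] < eps && step[i] > -eps)
            state |= 1u << i;
    return state;
}

/* All n branches converged when all n low bits are set. */
int all_converged(unsigned int state, int n)
{
    return state == (1u << n) - 1u;
}
```

One reduction per round then covers all still-active branches at once, which is how the example above gets by with 4 instead of 7 synchronization points.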
Load Balance II A. Stamatakis, M. Ott: "Load Balance in the Phylogenetic Likelihood Kernel". Proceedings of ICPP 2009, Vienna, Austria, September 2009.
Classic Fork-Join with Busy-Wait: for good parallel performance, the broadcast must be fast! Remember: 10 secs on 16 cores means approx. 500,000 broadcasts. What happens if we have 1000 partitions and propose 1000 new alpha parameters?
Alternative MPI parallelization P0 P1 MPI_Reduce() MPI_Bcast() -55000? -55000? Execution time -55001 MPI_Reduce() MPI_Bcast() -55001
Alternative MPI parallelization P0: -55000, P1: -55001, MPI_Reduce() MPI_Bcast(), Execution time. I think this is the way we will have to do it in the future.
ExaML: new code implementing this new parallelization scheme. https://github.com/stamatak/examl A. Stamatakis, A. J. Aberer: "Novel Parallelization Schemes for Large-Scale Likelihood-based Phylogenetic Inference", accepted for publication at IPDPS 2013, Boston, USA, 2013. Up to 3 times faster than RAxML-Light (2012) on large, partitioned datasets. Tested with up to 1536 cores on our cluster at HITS. Future developments (20,000,000 CPU hours on SuperMUC): improving scalability, implementing fault tolerance, executing 1KITE tree inferences. Further details: ask Andre Aberer