Exam Dates: February 19, March 1. If those dates don't work: ad hoc in Heidelberg.

Plan for next lectures
Lecture 11: More on Likelihood Models & Parallel Computing in Phylogenetics
Lecture 12: (Andre) Discrete Operations on Trees; Alexis will not be there
Lecture 13: (Pavlos & Alexis) Population genetics I
Lecture 14: (Pavlos & Alexis) Population genetics II
Lecture 15: (Alexis) Wrap-up & exam preparation

Likelihood-based Phylogenetic Inference so far
Search techniques
The maximum likelihood model
Some simple DNA models: Jukes-Cantor, General Time Reversible (GTR) model

Outline for today
More on likelihood models: computing the likelihood on a tree, rate heterogeneity among sites, model selection, protein substitution models
Optimizing and parallelizing likelihood calculations: parallelization, load balance, saving RAM
Course evaluation

Maximum Likelihood: an alignment of sequences Seq1-Seq4 with length m

Maximum Likelihood: the alignment plus a substitution model over the states A, C, G, T

Maximum Likelihood: the substitution model is commonly denoted as the Q matrix (transition probabilities for an infinitesimal time dt); for time t: P(t) = e^(Qt)

Maximum Likelihood: in addition, prior probabilities (empirical base frequencies) π_A, π_C, π_G, π_T

Maximum Likelihood: an unrooted tree connecting Seq1-Seq4 via branches b1-b5

Maximum Likelihood: a virtual root (vr) is placed into one of the branches

Maximum Likelihood: at the inner nodes we store vectors of conditional probabilities P(A), P(C), P(G), P(T), one entry per alignment site (m sites)

Maximum Likelihood: computing these vectors is floating-point & memory intensive
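To make the relation P(t) = e^(Qt) concrete, here is a minimal sketch (my own illustration, not from the slides) for the Jukes-Cantor model, where the matrix exponential has a well-known closed form; the function name and array layout are assumptions:

#include <math.h>

/* Hypothetical helper: fills P with the Jukes-Cantor transition
   probabilities P(t) = e^(Qt); for JC69 the matrix exponential has a
   closed form, so no numerical exponentiation is needed.              */
void jc69_transition_matrix(double t, double P[4][4])
{
    double e    = exp(-4.0 * t / 3.0); /* t = branch length in expected substitutions per site */
    double same = 0.25 + 0.75 * e;     /* probability of observing the same state after time t */
    double diff = 0.25 - 0.25 * e;     /* probability of observing a particular different state */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            P[i][j] = (i == j) ? same : diff;
}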

Post-order Traversal: start at the virtual root and traverse the tree towards the tips

Post-order Traversal: at the tips, the observed sequence (here AGCC) is encoded as per-site vectors with a 1.0 for the observed state and 0.0 elsewhere, e.g. A = (1.0, 0.0, 0.0, 0.0), G = (0.0, 0.0, 1.0, 0.0), C = (0.0, 1.0, 0.0, 0.0)

Post-order Traversal: once both child vectors are available, the conditional likelihood vector of the inner node above them can be computed (marked :-) on the slides)

What happens when we compute this inner vector? For each alignment position c and each state A, C, G, T, the entry of the conditional likelihood vector L^(k) at the parent node is computed from the child vectors L^(i) and L^(j) and the transition probability matrices P(b_i) and P(b_j) for the two child branches b_i and b_j.
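A minimal sketch of this computation for a single site and the four DNA states, assuming the standard pruning formula L^(k)[x] = (Σ_y P(b_i)[x][y] * L^(i)[y]) * (Σ_y P(b_j)[x][y] * L^(j)[y]); all names are illustrative:

/* Conditional likelihood vector of node k at one alignment site:
   Lk[x] = ( sum_y P_bi[x][y] * Li[y] ) * ( sum_y P_bj[x][y] * Lj[y] )
   where P_bi and P_bj are the transition matrices for the child
   branches b_i and b_j, and Li, Lj are the children's vectors.      */
void inner_vector_site(double P_bi[4][4], const double Li[4],
                       double P_bj[4][4], const double Lj[4],
                       double Lk[4])
{
    for (int x = 0; x < 4; x++) {
        double left = 0.0, right = 0.0;
        for (int y = 0; y < 4; y++) {
            left  += P_bi[x][y] * Li[y];
            right += P_bj[x][y] * Lj[y];
        }
        Lk[x] = left * right;
    }
}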

Post-order Traversal: the traversal continues bottom-up; each inner vector is computed as soon as both of its children are available (marked :-) slide by slide)

Post-order Traversal: at the virtual root, the overall likelihood is obtained as the sum over the logs of the per-site likelihoods
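A hedged sketch of that final step, assuming the root vectors and base frequencies are already available; names and array layout are my own:

#include <math.h>

/* Overall log likelihood at the virtual root: weight the root vector of
   each site by the prior base frequencies pi[] and sum the logs of the
   per-site likelihoods (m = alignment length).                          */
double tree_log_likelihood(double (*Lroot)[4], const double pi[4], int m)
{
    double lnL = 0.0;
    for (int c = 0; c < m; c++) {
        double site = 0.0;
        for (int x = 0; x < 4; x++)
            site += pi[x] * Lroot[c][x];   /* per-site likelihood */
        lnL += log(site);
    }
    return lnL;
}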

Post-order Traversal: placing the virtual root x between the conditional likelihood vectors LS3 and LS5; the branch lengths are adjusted such that b4' := b1 + b4 and b1' := 0 (LS3, LS4, LS5 denote conditional likelihood vectors)

Maximum Likelihood: given the alignment (length m), the substitution model, and the prior probabilities (empirical base frequencies) π_A, π_C, π_G, π_T, we optimize the branch lengths of the tree connecting Seq1-Seq4

Branch Length Optimization: pick a starting branch; essentially we place the virtual root into this branch

Branch Length Optimization: optimize this branch length, then move on to the next branch, proceeding over all branches of the tree

Branch Length Optimization: iterate over all branches until convergence

Maximum Likelihood: given the alignment, the base frequencies π_A, π_C, π_G, π_T, and the substitution model, we also optimize the model parameters

Maximum Likelihood: we don't need to re-estimate the model parameters for every tree topology, as long as the tree is reasonable, i.e., non-random

Numerical Optimization Procedures: see chapters 9 & 10 of Numerical Recipes in C: The Art of Scientific Computing
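As a rough illustration of the kind of routine described there, here is a generic Newton-Raphson iteration for a single branch length; the derivative callbacks are placeholders, since the actual derivatives of the log likelihood are model-specific and not given on the slides:

#include <math.h>

/* Generic Newton-Raphson update for a single branch length b:
   iterate b <- b - dLnL(b) / d2LnL(b) until the step is tiny.
   dLnL and d2LnL are the first/second derivatives of the log
   likelihood with respect to b (assumed to be provided).        */
double optimize_branch(double b,
                       double (*dLnL)(double),
                       double (*d2LnL)(double))
{
    const double eps   = 1e-8;   /* convergence threshold           */
    const double b_min = 1e-8;   /* keep the branch length positive */

    for (int iter = 0; iter < 100; iter++) {
        double step = dLnL(b) / d2LnL(b);
        b -= step;
        if (b < b_min) b = b_min;
        if (fabs(step) < eps) break;
    }
    return b;
}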

Basic Operations in Maximum Likelihood
Compute the conditional likelihood array at an inner node
Compute the likelihood at the virtual root
Optimize the branch length of a given branch
Optimize all branch lengths
Optimize the other model parameters

Note: Bayesian programs only require two of these operations.

Rate Heterogeneity among Sites
Example gene:
ACGGGGGGGGGGGTTTTCCCCC
ATGGGGGGGGGGGTTTCCCCCC
ACCGGGGGGGGGGTTTTGCCCC
AGGGGGGGGGGGCTTTTCCCCC
Among-site rate heterogeneity is a biological phenomenon: different sites/columns evolve at different speeds. We need to accommodate this in our models.

Γ-Distribution

Γ-Distribution: small α means high rate heterogeneity; large α means low rate heterogeneity

Discrete Γ-Distribution: the continuous distribution is approximated by 4 discrete rates r0, r1, r2, r3

An Abstract View of Γ: for an alignment of length 7, each site is evaluated under four rates, using P(t) = e^(Q r0 t), e^(Q r1 t), e^(Q r2 t), e^(Q r3 t); this approximates, via discretization, the integral of the likelihood over the rate distribution. All Γ rates have equal probability, so the log likelihood at site i is LnL(i) = log(¼ * (L0 + L1 + L2 + L3)).
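A minimal sketch of that per-site formula (the likelihoods L0..L3 of the site under the four discrete rates are assumed to be precomputed):

#include <math.h>

/* Per-site log likelihood under the discrete Γ model with 4 rates:
   each rate has equal probability 1/4, so the site likelihood is the
   average of the four per-rate likelihoods (as on the slide).        */
double site_lnL_gamma(const double L[4])
{
    double site = 0.25 * (L[0] + L[1] + L[2] + L[3]);
    return log(site);
}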

An Abstract View of Γ: 4 times higher memory consumption (four conditional likelihood entries per site)

An Abstract View of Γ: 4 times more floating point operations

Γ Model of Rate Heterogeneity with 4 discrete rates: r0, r1, r2, r3, all derived from the single parameter α

Another Famous Computational Evolutionary Biologist: Ziheng Yang, the guy who has written this nice book. Z. Yang: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods, Journal of Molecular Evolution 39(3): 306-314, 1994. (Photo: at LMU Munich during student protests.)

An Abstract View of CAT alignment length: 7

CAT: ML-based optimization of individual rates per site r0, r1, ..., rm

CAT Step 1: ML-based optimization of individual rates per site r0, r1, ..., rm. A. Stamatakis: Phylogenetic models of rate heterogeneity: a high performance computing perspective, Proceedings of IPDPS 2006.

CAT Step 2: Order the list of rates ri according to the site log likelihoods li: ([r0, -1.2], [r1, -5.6], [r2, -2.5], [r3, -0.3]) sorted according to the log likelihood contributions of the sites yields ([r1, -5.6], [r2, -2.5], [r0, -1.2], [r3, -0.3])

CAT Step 3: Categorize; example with 2 rate categories (default: 25)
([r1, -5.6], [r2, -2.5], [r0, -1.2], [r3, -0.3])
Select the 2 rates with the highest likelihood contributions:
rc0 := r1, category[1] := 0
rc1 := r2, category[2] := 1
(the index refers to the alignment column position)
Assign the remaining rates r0 and r3 to the rate category 0 or 1 that minimizes min(δ(r1, r0), δ(r2, r0)) and min(δ(r1, r3), δ(r2, r3)), respectively
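A rough sketch of steps 2 and 3, under the assumption that "highest likelihood contribution" means the most negative per-site log likelihoods (as in the example above); struct and function names are my own:

#include <math.h>
#include <stdlib.h>

typedef struct { double rate, lnl; int column; } SiteRate;   /* column = original alignment position */

static int cmp_by_lnl(const void *a, const void *b)
{
    const SiteRate *x = a, *y = b;
    return (x->lnl > y->lnl) - (x->lnl < y->lnl);   /* ascending: worst-fitting sites first */
}

/* Sort sites by their log likelihood contribution, take the rates of the
   first k sites as category rates, then assign every alignment column to
   the category whose rate is closest (minimal delta).                    */
void cat_categorize(SiteRate *s, int m, int k, double *cat_rate, int *category)
{
    qsort(s, m, sizeof(SiteRate), cmp_by_lnl);
    for (int c = 0; c < k; c++)
        cat_rate[c] = s[c].rate;                    /* category representatives */

    for (int i = 0; i < m; i++) {
        int best = 0;
        for (int c = 1; c < k; c++)
            if (fabs(s[i].rate - cat_rate[c]) < fabs(s[i].rate - cat_rate[best]))
                best = c;
        category[s[i].column] = best;               /* index by original column position */
    }
}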

ML-Loops

CAT-LOOP (one likelihood entry per site, using the site's individual rate):

for (i = 0; i < n; i++) {
    cat    = category[i];
    rate_i = rate[cat];    /* the site's scalar rate; renamed from 'r' to avoid clashing with the child vector r[] */
    p[i]   = f(q[i], pq, r[i], pr, rate_i);
}

Γ-LOOP (four likelihood entries per site, one per discrete rate r0..r3):

for (i = 0; i < n; i++) {
    p[i].g0 = f(q[i], pq, r[i], pr, r0);
    p[i].g1 = f(q[i], pq, r[i], pr, r1);
    p[i].g2 = f(q[i], pq, r[i], pr, r2);
    p[i].g3 = f(q[i], pq, r[i], pr, r3);
}

715 Sequences under HKY85+Γ: plot of the log likelihood score under Γ versus execution time. Data: C.E. Robertson et al. (2005) Phylogenetic diversity and ecology of environmental Archaea, Current Opinion in Microbiology.

Model Selection
How do we select a model? How do we avoid over-parameterizing a model?
Typically, a model with more parameters yields better likelihood scores
A better likelihood is therefore not a criterion for selecting a more complex model; we may over-parameterize the model
Key question: does a more complex model yield a different tree?
Model selection criteria:
Likelihood-ratio test: can only be applied to nested models!
More flexible criteria not requiring nested models: AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion)
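For reference, the standard AIC and BIC formulas as tiny helper functions (lower values are better; k = number of free model parameters, n = number of alignment sites); this is generic, not tied to any particular program:

#include <math.h>

/* lnL is the maximized log likelihood of the fitted model. */
double aic(double lnL, int k)         { return 2.0 * k - 2.0 * lnL; }
double bic(double lnL, int k, long n) { return k * log((double) n) - 2.0 * lnL; }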

Protein Substitution Models
The GTR Q matrix for protein data has 189 free parameters instead of just 5 (DNA)
Estimating 189 rate parameters is difficult, time-consuming, and may lead to over-parameterizing the model
Instead, empirical models such as JTT, LG, WAG, MTMAM, etc. are used
Their Q matrices are obtained by jointly optimizing model parameters on a large collection of reference alignments
The models differ with respect to:
the amount of data used to obtain them
the type of data on which the models have been optimized, e.g., dedicated models for HIV, influenza, mammals
the numerical optimization methods used
Examples of general models:
LG: Le & Gascuel: An Improved General Amino Acid Replacement Matrix
WAG: Whelan & Goldman: A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach

Floating Point Numbers: machine numbers are an imperfect mapping of the infinitely many real numbers onto a finite set of machine values!

Floating Point Arithmetic: The Root of All Evil
Computational science mostly relies on floating-point intensive codes
How do we verify these codes? We stand on shaky ground
Scientists using those codes assume that there are no bugs
Double precision arithmetic is required for certain applications
Who knows what denormalized floating point numbers are?
Please have a look at: J. Björndalen, O. Anshus: Trusting floating point benchmarks - are your benchmarks really data-independent? Applied Parallel Computing: State of the Art in Scientific Computing 2010, pp. 178-188, Springer.
And at my micro-benchmark at: https://github.com/stamatak/denormalizedfloatingpointnumbers

Post-order Traversal: the conditional likelihood values get smaller and smaller as we approach the virtual root (e.g. 0.1, 0.01, 0.001)

Post-order Traversal: we need to apply numerical scaling techniques to avoid underflow!
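One common way to do such scaling, sketched with assumed names and constants (not the exact RAxML code): if all entries of a site's conditional likelihood vector fall below a threshold, multiply them by a large constant and remember how often this happened; at the root, scale_count * log(SCALE_FACTOR) is subtracted from that site's log likelihood again.

#define SCALE_THRESHOLD (1.0e-256)
#define SCALE_FACTOR    (1.0e+256)

/* Rescale one site's conditional likelihood vector if it is about to underflow. */
void scale_site(double L[4], int *scale_count)
{
    if (L[0] < SCALE_THRESHOLD && L[1] < SCALE_THRESHOLD &&
        L[2] < SCALE_THRESHOLD && L[3] < SCALE_THRESHOLD) {
        for (int x = 0; x < 4; x++)
            L[x] *= SCALE_FACTOR;
        (*scale_count)++;   /* undo later: lnL_site -= scale_count * log(SCALE_FACTOR) */
    }
}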

Optimization: use SSE3 & AVX vector intrinsics; special implementations for the TIP/TIP, TIP/INNER, and INNER/INNER node cases (I will spare you the details); but: avoid redundant computations

Loop Level Parallelism: for an inner vector P with children Q and R, the per-site loop computes P[i] = f(Q[i], R[i]) for every site i

Loop Level Parallelism: this operation uses 95% of the total execution time!

Loop Level Parallelism: the iterations of this per-site loop can be distributed across processors/threads

Parallel Post-order Traversal: we only need to synchronize at the virtual root, using MPI_Reduce() to calculate Σ log(li)

Parallel Post-order Traversal: each process contributes its partial Σ log(li); the reduction adds them up into the overall score
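A minimal MPI sketch of this single synchronization point, assuming each rank already holds the per-site likelihoods of its own alignment columns; names are illustrative:

#include <mpi.h>
#include <math.h>

/* Each rank sums the logs of its own per-site likelihoods; one MPI_Reduce
   combines the partial sums into the overall score on rank 0.             */
double parallel_lnL(const double *my_site_likelihoods, int my_m)
{
    double local = 0.0, global = 0.0;
    for (int c = 0; c < my_m; c++)
        local += log(my_site_likelihoods[c]);

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return global;   /* meaningful on rank 0 only */
}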

Parallelization Outline

Loop-Level Parallelism on Distributed Memory Machines

Classic Fork-Join with Busy-Wait: master and worker processes/threads 0-3. Repeated pattern: broadcast the traversal descriptor (trigger), all workers compute the vectors in the traversal descriptor, reduce Σ log(li), barrier; then, e.g., broadcast new α parameters while the workers busy-wait, and repeat.

Synchronizations in RAxML with Pthreads: for a run time of about 10 seconds on 16 cores/threads:
404 taxa, 7,429 sites: 194,000 barriers
1,481 taxa, 1,241 sites: 739,000 barriers
A paper on the performance of alternative Pthreads barrier implementations: S.A. Berger, A. Stamatakis: "Assessment of Barrier Implementations for Fine-Grain Parallel Regions on Current Multi-core Architectures", IEEE Cluster 2010.

Classic Fork-Join with Busy-Wait: the same pattern, now with the conditional likelihood vectors A-D of the traversal descriptor distributed across the workers; each broadcast/trigger is followed by computing the local parts of all vectors, a reduction of Σ log(li), and a barrier (e.g. after broadcasting a new α := 0.2)

Classic Fork-Join with Busy-Wait: for good parallel performance, the broadcast must be fast!

Parallel Performance Problems They all start with partitioned datasets!

What is a partitioned dataset? Multi-gene or whole-genome alignment

What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Multi-gene or whole-genome alignment Gene 4

What is a partitioned dataset? We may also partition by 1st, 2nd, and 3rd codon position Gene 0 Gene 1 Gene 2 Gene 3 Multi-gene or whole-genome alignment Gene 4

What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4 α0 GTR0 α1 GTR1 α2 GTR2 α3 GTR3 α4 GTR4

What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4 α0 GTR0 α1 GTR1 α2 GTR2 α3 GTR3 α4 GTR4 Joint branch length estimate

Assigning Protein Models? Gene 0 Gene 1 WAG LG JTT... WAG LG JTT... Gene 2 WAG LG JTT... Gene 3 WAG LG JTT... Gene 4 WAG LG JTT... Joint branch length estimate

Assigning Protein Models? There exist about 20 empirical protein models, so for the 5 genes (Gene 0 - Gene 4, each: WAG, LG, JTT, ...) there are 20^5 possible assignments! Joint branch length estimate.

Assigning Protein Models? Because the partitions are linked via the joint branch length estimate, finding the best assignment is NP-hard (proof: joint work with Kassian; heuristics: a Master's thesis in our lab).

What is a partitioned dataset? Gene 0 Gene 1 Gene 2 Gene 3 Gene 4 α0 GTR0 α1 GTR1 α2 GTR2 α3 GTR3 α4 GTR4 per-partition branch length estimate

What is a partitioned dataset? Gene 0 - Gene 4 with per-partition parameters α0/GTR0 ... α4/GTR4, but the same underlying tree topology!

What is a partitioned dataset? A per-partition branch length estimate increases the number of parameters in the likelihood model by 4 * (2n - 3) in this example with 5 partitions, where n is the number of taxa.

Parallel Performance Problems
They all start with partitioned datasets!
How do we distribute partitions to processors?
How do we calculate parameter changes?
How much time does our broadcast take?
Goal: keep all processors busy all the time; minimize communication and synchronization!

Data Distribution: alignment (Orangutan, Gorilla, Chimp, Homo Sapiens) on a shared-memory system: two CPUs with their caches access the same shared memory

Data Distribution: on a distributed-memory system, each CPU (with its cache) has its own distributed memory

Data Distribution: partitioned data distribution is not that trivial!

Data Distribution I: works well when we have more partitions than processors; may lead to load imbalance, since not all processors obtain an equal number of sites!

Data Distribution II: works well when we have more processors than partitions; however, we will then need to compute P(t) = e^(Qt) for each partition at each processor!

Data Distribution II: the performance impact depends on the number of states in the data, i.e., the dimension of Q

Data Distribution II: so how do we distribute partitions to processors?

Load Balance I: genes G0-G3 must be assigned to processors P0 and P1

Load Balance I: find the partition-to-processor assignment such that the maximum number of sites per processor is minimized; this is NP-hard

Load Balance I: the multiprocessor job scheduling problem in phylogenetics
Problem when #partitions >> #cores
We tested per-site (cyclic/modulo) data distribution versus per-partition data distribution
We used the Longest Processing Time (LPT) heuristic for assigning partitions to processors
25 taxa, 220,000 sites, 100 genes:
GAMMA model: naïve 613 secs, LPT 550 secs
CAT model: naïve 298 secs, LPT 127 secs
Larger protein dataset under the Γ model of rate heterogeneity: 10-fold performance improvement!
J. Zhang, A. Stamatakis: "The Multi-Processor Scheduling Problem in Phylogenetics", 11th IEEE HICOMB workshop (in conjunction with IPDPS 2012).

LPT heuristic for multi-processor scheduling
Sort the jobs (partitions) by processing time (partition length) in decreasing order
Remove a job (partition) from the sorted list and assign it to the processor with the earliest end time (the smallest sum of partition lengths)
Repeat until the sorted list is empty
Upper bound: makespan <= (4/3 - 1/(3p)) * OPT, where p is the number of processors
Graham, R. L.: "Bounds on Multiprocessing Timing Anomalies". SIAM Journal on Applied Mathematics 17(2): 416-429, 1969.
Remark: LPT works surprisingly well (see our paper on the phylogenetic problem, where we also tested other heuristics)
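A small sketch of the LPT assignment step (the sort by decreasing partition length is assumed to have been done by the caller; names are my own):

/* LPT: always give the next (longest remaining) partition to the
   processor with the smallest current load.                      */
void lpt_assign(const int *sorted_len, int num_partitions,
                long *load, int *owner, int num_procs)
{
    for (int p = 0; p < num_procs; p++)
        load[p] = 0;

    for (int i = 0; i < num_partitions; i++) {
        int best = 0;
        for (int p = 1; p < num_procs; p++)
            if (load[p] < load[best])
                best = p;
        owner[i] = best;              /* partition i goes to processor 'best' */
        load[best] += sorted_len[i];
    }
}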

Partitioned Branch Lengths & other parameters

Load-Balance II Zoom

Synchronization Points
Assume 10 branches; each branch requires 10 Newton-Raphson iterations; each NR iteration requires a synchronization via a reduction operation
One branch/partition at a time: 100 synchronization points, less work (only one partition) per synchronization point
All branches concurrently: 10 synchronization points, more work per synchronization point
But: branches will need a distinct number of iterations, so we add a convergence-state bit vector (see the example and the sketch below)

Synchronization Points: example with Org1 (AC GT) and Org2 (AC TT); the convergence-state bit vector progresses 00, 00, 01, 11 across the iterations, and converged entries are skipped. In this example: 4 instead of 7 sync points!
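A sequential sketch of the convergence-state idea, assuming hypothetical per-branch derivative callbacks; in the parallel code, each iteration of the while loop would correspond to one synchronization:

#include <math.h>

/* Optimize all branch lengths concurrently: run one Newton-Raphson step
   per branch per round and flag converged branches so they are skipped,
   giving one synchronization per round instead of one per branch & step. */
void optimize_all_branches(double *b, int num_branches,
                           double (*dLnL)(int, double),
                           double (*d2LnL)(int, double))
{
    const double eps = 1e-8;
    char converged[num_branches];   /* the convergence "bit vector", one flag per branch */
    int  remaining = num_branches;

    for (int i = 0; i < num_branches; i++) converged[i] = 0;

    while (remaining > 0) {         /* one sync point per round in the parallel version */
        for (int i = 0; i < num_branches; i++) {
            if (converged[i]) continue;
            double step = dLnL(i, b[i]) / d2LnL(i, b[i]);
            b[i] -= step;
            if (fabs(step) < eps) { converged[i] = 1; remaining--; }
        }
        /* in the parallel code, the reduction/synchronization happens here */
    }
}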

Load Balance II A. Stamatakis, M. Ott: "Load Balance in the Phylogenetic Likelihood Kernel". Proceedings of ICPP 2009, Vienna, Austria, September 2009.

Classic Fork-Join with Busy-Wait: for good parallel performance, the broadcast must be fast! Remember: a run time of about 10 secs on 16 cores means approximately 500,000 traversal broadcasts. What happens if we have 1000 partitions and propose 1000 new alpha parameters?

Alternative MPI parallelization: processes P0 and P1, per-process likelihood scores (e.g. -55000, -55001) combined via MPI_Reduce() and MPI_Bcast() over the course of the execution time

Alternative MPI parallelization: I think this is the way we will have to do it in the future.

ExaML: a new code implementing this new parallelization scheme
https://github.com/stamatak/examl
A. Stamatakis, A.J. Aberer: "Novel Parallelization Schemes for Large-Scale Likelihood-based Phylogenetic Inference", accepted for publication at IPDPS 2013, Boston, USA, 2013.
Up to 3 times faster than RAxML-Light (2012) on large, partitioned datasets
Tested with up to 1,536 cores on our cluster at HITS
Future developments: 20,000,000 CPU hours on SuperMUC for improving scalability, implementing fault tolerance, and executing the 1KITE tree inferences
Further details: ask Andre Aberer