Ultra- large Mul,ple Sequence Alignment

Ultra- large Mul,ple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana- Champaign hep://tandy.cs.illinois.edu

Phylogeny (evolu,onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Phylogenies and Applications Basic Biology: How did life evolve? Applica,ons of phylogenies to: protein structure and func,on popula,on gene,cs human migra,ons metagenomics

Computational Phylogenetics and Metagenomics Courtesy of the Tree of Life project

Hard Computational Problems NP- hard problems Large datasets 100,000+ sequences thousands of genes Big data complexity: model misspecifica,on fragmentary sequences errors in input data streaming data

Warnow Research Goal: improve accuracy, speed, robustness, or mathematical guarantees of computational methods, to enable highly accurate analyses of real datasets Techniques: divide-and-conquer, iteration, chordal graph theory, and probability theory Evaluation: synthetic and real data; collaborations with biologists and linguists Examples: Historical linguistics, 1994-present Absolute fast converging methods 1997-2002 Phylogenetic networks, 2003-2005 Genome rearrangements, 2000-2006 Multiple sequence alignment, 2009-present (many papers, including SATé-1 (Science), SATé-2 (Syst Biol), PASTA (RECOMB and J Comp Biol), and UPP (Genome Biology)) Supertree methods, 2009-present Metagenomic analysis, 2014-present Coalescent-based species tree estimation (2011-present, including Science 2014a, Science 2014b, PNAS 2014)

DNA Sequence Evolution AAGGCCT AAGACTT TGGACTT -3 mil yrs -2 mil yrs AGGGCAT TAGCCCT AGCACTT -1 mil yrs AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT today

Phylogeny Problem U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

Markov Model of Site Evolu,on Simplest (Jukes- Cantor, 1969): The model tree T is binary and has subs,tu,on probabili,es p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleo,des) If a site (posi,on) changes on an edge, it changes with equal probability to each of the remaining states. The evolu,onary process is Markovian. More complex models (such as the General Time Reversible model, or the General Markov model) are also considered, o`en with liele change to the theory.

Quan,fying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Sta,s,cal Consistency error Data Maximum likelihood is sta,s,cally consistent under standard models (e.g., GTR)

Mathema,cal Ques,ons Is the model tree iden,fiable? Which es,ma,on methods are sta,s,cally consistent under this model? How much data does the method need to es,mate the model tree correctly (with high probability)? What is the impact of model misspecifica,on? What is the computa,onal complexity of an es,ma,on problem?

The Classical Phylogeny Problem U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

Much is known about this problem from a mathema,cal and empirical viewpoint U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

However U V W X Y AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT U X Y V W

Indels (insertions and deletions) Deletion Mutation ACGGTGCAGTTACCA ACCAGTCACCA

Deletion Substitution ACGGTGCAGTTACCA Insertion ACCAGTCACCTA ACGGTGCAGTTACC-A AC----CAGTCACCTA The true mul*ple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden,fy orthologs Compute mul,ple sequence alignments for each locus, and construct gene trees Compute species tree or network: Combine the es,mated gene trees, OR Es,mate a tree from a concatena,on of the mul,ple sequence alignments Get sta,s,cal support on each branch (e.g., bootstrapping) Es,mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden,fy orthologs Compute mul,ple sequence alignments for each locus, and construct gene trees Coalescent- based Compute species tree or network: species tree Combine the es,mated gene trees, OR es,ma,on! Es,mate a tree from a concatena,on of the mul,ple sequence alignments Get sta,s,cal support on each branch (e.g., bootstrapping) Es,mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Large-scale Alignment Estimation Many genes are considered unalignable due to high rates of evolu,on Only a few methods can analyze large datasets iplant (NSF Plant Biology Collabora,ve) and other projects planning to construct phylogenies with 500,000 taxa

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin Plus many many other people l First study (WickeE, Mirarab, et al., PNAS 2014) had ~100 species and ~800 genes, gene trees and alignments es,mated using SATé, and a coalescent- based species tree es,mated using ASTRAL l Second study: Plant Tree of Life based on transcriptomes of ~1200 species, and more than 13,000 gene families (most not single copy) Gene Challenges: Tree Incongruence Species tree estimation from conflicting gene trees Gene tree estimation of datasets with > 100,000 sequences

Multiple Sequence Alignment (MSA): a scientific grand challenge 1 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC Sn = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- Sn = -------TCAC--GACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

This talk Big data multiple sequence alignment SATé (Science 2009, Systematic Biology 2012) and PASTA (RECOMB and J Comp Biol 2015), methods for co-estimation of alignments and trees UPP (Genome Biology 2015): ultra-large multiple sequence alignment, using the Ensemble of HMMs technique. Other applications of the ehmm technique: phylogenetic placement (SEPP, PSB 2012) metagenomic taxon identification (TIPP, Bioinformatics 2014) protein structure and function classification gene binning

Mul,ple Sequence Alignment

First Align, then Compute the Tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Co- es,ma,on would be much beeer!!! S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3 True tree and alignment Compare S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA S1 S4 S2 S3 Estimated tree and alignment

Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Two- phase es,ma,on Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di- align T- Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. RAxML: heuris>c for large- scale ML op>miza>on

1000- taxon models, ordered by difficulty (Liu et al., 2009)

SATé Family of methods Itera,ve divide- and- conquer methods Each itera,on re- aligns the sequences using the current tree, running preferred MSA methods on small local subsets, and merging subset alignments Each itera,on computes an ML tree on the current alignment, under the GTR (Generalized Time Reversible) Markov model of evolu,on Note: these methods are MSA boosters, designed to improve accuracy and/or scalability of the base method We show results using MAFFT- l- ins- i to align subsets

Re-aligning on a tree A B C D Decompose dataset A C B D Align subsets A C B D Estimate ML tree on merged alignment ABCD Merge sub-alignments

SATé and PASTA Algorithms Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Tree Use tree to compute new alignment Alignment Repeat un,l termina,on condi,on, and return the alignment/tree pair with the best ML score

SATé- 1 (Science 2009) performance 1000- taxon models, ordered by difficulty rate of evolu,on generally increases from le` to right SATé- 1 24 hour analysis, on desktop machines (Similar improvements for biological datasets) SATé- 1 can analyze up to about 8,000 sequences.

SATé- 1 and SATé- 2 (Systema*c Biology, 2012) SATé- 1: up to 8K SATé- 2: up to ~50K 1000- taxon models ranked by difficulty

SATé variants differ only in the decomposition strategy A B C D Decompose dataset A C B D Align subsets A C B D Estimate ML tree on merged alignment ABCD Merge sub-alignments

SATé- II: centroid edge decomposi,on A C ABCDE D ABC DE AB C D E B E A B SATé- II makes all subsets small (user parameter), and can analyze 50K sequences, SATé- I decomposi,on produced clades and had bigger subsets; limited to 8K sequences

SATé: merger strategy A C ABCDE D ABC DE AB C D E B E A B Both SATé s use the same hierarchical merger strategy. On large (50K) datasets, the last pairwise merger can use more than 70% of the running,me

PASTA merging: Step 1 A C C B D E A B D E Compute a spanning tree connec,ng alignment subsets

PASTA merging: Step 2 AB A B BD C CD DE D E A AB B BD C CD D DE E Use Opal (or Muscle) to merge adjacent subset alignments in the spanning tree

PASTA merging: Step 3 C AB + BD = ABD ABD + CD = ABCD ABCD + DE = ABCDE A AB B BD CD D DE E Use transi,vity to merge all pairwise- merged alignments from Step 2 into final an alignment on en,re dataset Overall: O(n log(n) + L)

PASTA vs. SATé- II profiling and scaling 8 PASTA Speedup 6 4 SATe2 2 1 1 2 4 6 8 10 12 Number of Threads

PASTA Running Time and Scalability 1250 Running time (minutes) 1000 750 500 250 One itera,on Using 12 cpus 1 node on Lonestar TACC Maximum 24 GB memory Showing wall clock running,me ~ 1 hour for 10k taxa ~ 17 hours for 200k taxa 0 10,000 50,000 100,000 200,000 Number of Sequences

Tree accuracy 1 million sequences: PASTA finished one iteration in 15 days PASTA tree had 6% error, compared to 5.6% when using true alignment Starting tree had 8.4% error 10

PASTA vs. SATé- II Difference is how subset alignments are merged together (transi,vity instead of Opal/Muscle). As expected, PASTA is faster and can analyze larger datasets. Unexpected: PASTA produces more accurate alignments and trees (on both simulated and biological data, including DNA, RNA, and AA sequences). Thus, transi,vity applied to compa,ble and overlapping alignments gives a surprisingly accurate technique for merging a collec,on of alignments.

PASTA and SATé- II: MSA boosters PASTA and SATé- II are techniques for improving the scalability of MSA methods to large datasets. We showed results here using MAFFT- l- ins- i to align small subsets with 200 sequences. We have also explored results using other MSA methods (e.g., Prank, Clustal, Bali- Phy), and obtain similar improvements in accuracy and/or scalability.

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin l Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) Gene Tree Incongruence Plus many many other people Challenge: Massive gene tree conflict consistent with ILS Alignment of datasets with > 100,000 sequences

12000 Mean:317 10000 Median:266 1KP dataset: more than 100,000 p450 amino- acid sequences, many fragmentary 8000 Counts 6000 4000 2000 0 0 500 1000 1500 2000 Length

12000 Mean:317 10000 Median:266 1KP dataset: more than 100,000 p450 amino- acid sequences, many fragmentary 8000 Counts 6000 4000 2000 0 All standard mul>ple sequence alignment methods we tested performed poorly on datasets with fragments. 0 500 1000 1500 2000 Length

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen UIUC UT-Austin UT-Austin l Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) Gene Tree Incongruence Plus many many other people Challenge: Alignment of datasets with > 100,000 sequences with many fragmentary sequences

UPP UPP = Ultra- large mul,ple sequence alignment using Phylogeny- aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large- scale mul,ple sequence alignments, even in the presence of fragmentary sequences.

Simple idea (not UPP) Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

One Hidden Markov Model for the entire alignment?

This approach works well if the dataset is small and has low evolu,onary rates, but is not very accurate otherwise. Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

One Hidden Markov Model for the en,re alignment? HMM 1

Or 2 HMMs? HMM 1 HMM 2

Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4

Or all 7 HMMs? HMM 1 HMM 2 HMM 4 m HMM 7 HMM 3 HMM 5 HMM 6

UPP Algorithmic Approach 1. Select random subset of full- length sequences, and build backbone alignment 2. Construct an Ensemble of Hidden Markov Models on the backbone alignment 3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

UPP Algorithmic Approach 1. Select random subset of full- length sequences, and build backbone alignment and backbone tree Notes: Need to avoid fragments in the backbone We show results using PASTA for the backbone alignment and tree, but are other methods can be used we have explored BAli- Phy, a powerful Bayesian sta,s,cal method Random is good when taxonomic sampling is rela,vely uniform, but directed sampling can improve accuracy We explored backbones with 100 and 1000 sequences, even when the full dataset is very big (1,000,000 one million)

UPP Algorithmic Approach 2. Construct an Ensemble of Hidden Markov Models on the backbone alignment Technique: Create set of subsets (using the tree). Then, for each subset, build an HMM on the induced alignment on each subset. Note: Different subset sizes are good for different situa,ons, and the ensemble technique is more accurate than disjoint sets

UPP Algorithmic Approach 3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs For each of the remaining sequences s, find H, the HMM from the ensemble that has the best score (i.e., HMM maximizing Pr(s H)) Use HMMER code and H to add s into the backbone alignment

Evalua,on Simulated datasets (some have fragmentary sequences): 10K to 1,000,000 sequences in RNASim complex RNA sequence evolu,on simula,on 1000- sequence nucleo,de datasets from SATé papers 5000- sequence AA datasets (from FastTree paper) 10,000- sequence Indelible nucleo,de simula,on Biological datasets: Proteins: largest BaliBASE and HomFam RNA: 3 CRW datasets up to 28,000 sequences

RNASim: alignment error All methods given 24 hrs on a 12- core machine Note: Maz was run under default se{ngs for 10K and 50K sequences and under ParEree for 100K sequences, and fails to complete under any se{ng For 200K sequences. Clustal- Omega only completes on 10K dataset.

RNASim: tree error All methods given 24 hrs on a 12- core machine Note: Maz was run under default se{ngs for 10K and 50K sequences and under ParEree for 100K sequences, and fails to complete under any se{ng For 200K sequences. Clustal- Omega only completes on 10K dataset.

RNASim Million Sequences: alignment error Notes: We show alignment error using average of SP-FN and SP-FP. UPP variants have better alignment scores than PASTA. (Not shown: Total Column Scores PASTA more accurate than UPP) No other methods tested could complete on these data PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP alignments were closer in

RNASim Million Sequences: tree error Using 12 processors: UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days

UPP vs. PASTA: impact of fragmenta,on Mean alignment error 0.6 0.4 0.2 0.0 0 12.5 25 50 % Fragmentary PASTA UPP(Default) (a) Average alignment error Under high rates of evolu,on, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolu,on, PASTA can s,ll be highly accurate (data not shown). Delta FN tree error 0.4 0.2 0.0 0 12.5 25 50 % Fragmentary UPP con,nues to have good accuracy even on datasets with many fragments under all rates of evolu,on. PASTA UPP(Default) (b) Average tree error Performance on fragmentary datasets of the 1000M2 model condi,on

UPP Running Time Wall clock align time (hr) 15 10 5 0 50000 100000 150000 200000 Number of sequences UPP(Fast) Wall- clock,me used (in hours) given 12 processors

Current Related Research UPP: Using itera,on Using other MSA models (not just HMMs) within the Ensemble Using structural alignments for the backbone Using powerful sta,s,cal methods to produce the backbone alignment and tree PASTA : Using powerful sta,s,cal methods for the subset alignments Improving the pairwise merging technique ENSEMBLE OF HMMS: Metagenomic taxon iden,fica,on (collabora,on with Mihai Pop, Maryland) Gene binning (joint with Jian Peng, UIUC, and with Jim Leebens- Mack, Georgia) Protein structure and func,on classifica,on (collabora,ons with Mar,n Weigt, Paris, and Jian Peng, UIUC)

Acknowledgments PhD students: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD) Undergrad: Keerthana Kumar Current NSF grants: ABI- 1458652 (mul,ple sequence alignment) III:AF:1513629 (metagenomics collabora,ve with Mihai Pop, University of Maryland) DBI:1461364 (phylogenomics collabora,ve with Rice and Stanford) CCF:1535977 (graph algorithms to improve phylogene,c es,ma,on collabora,ve with Berkeley) Other recent or current support: Guggenheim Founda,on, NSF DEB:0733029, NSF DBI:1062335, Microso` Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Compu,ng Center), the University of Alberta (Canada), Grainger Founda,on (at UIUC), and UIUC TACC, UTCS, and UIUC computa,onal resources

Alignment Accuracy Correct columns Starting Starting Alignment Alignment ClustalW ClustalW Mafft"Profile Mafft"Profile Muscle Muscle SATe2 SATe2 PASTA PASTA Alignment Accuracy (TC) 800 600 400 200 0 RNASim 125 Gutell 100 75 50 25 0.00 0 10000 50000 100000 200000 16S.3 16S.T 16S.B.ALL

TIPP: high accuracy taxonomic identification of metagenomic data TIPP: taxon identification and phylogenetic profiling (Bioinformatics, 2014) Technique: combines UPP alignments and phylogenetic placement algorithms, and considers statistical uncertainty. Results: better accuracy than all current methods, even for sequencing technologies producing high indel rates Research funded by new NSF grant III:AF:1513629 (collabora,ve with Mihai Pop at University of Maryland)

Metagenomic taxonomic iden*fica*on and phylogene*c profiling Metagenomics, Venter et al., Exploring the Sargasso Sea: Scien*sts Discover One Million New Genes in Ocean Microbes

Basic Ques,ons 1. What is this fragment? (Classify each fragment as well as possible.) 2. What is the taxonomic distribu,on in the dataset? (Note: helpful to use marker genes.) 3. What are the organisms in this metagenomic sample doing together?

The Tree of Life: Multiple Challenges Scientific challenges: Ultra-large multiple-sequence alignment Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics

High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and motu terminates with an error message on all the high indel datasets.

TIPP vs. other abundance profilers TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads. All other methods have some vulnerability (e.g., motu is only accurate for short reads and is impacted by high indel rates).

Metagenomic Taxon Iden,fica,on Objec,ve: classify short reads in a metagenomic sample

Abundance Profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. For example: The distribution of the sample at the species-level is: 50% species A 20% species B 15% species C 14% species D 1% species E

Abundance Profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. Leading techniques: PhymmBL (Brady & Salzberg, Nature Methods 2009) NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011) MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard motu (Bork et al., Nature Methods 2013) MetaPhyler, MetaPhlAn, and motu are marker-based techniques (but use different marker genes). Marker gene are single-copy, universal, and resistant to horizontal transmission.

Novel genome datasets Note: motu terminates with an error message on the long fragment datasets and high indel datasets.

Summary SATé- 1 (Science 2009), SATé- 2 (Systema,c Biology 2012), co- es,ma,on of alignments and trees. SATé- 2 is well established in the biology community. PASTA (RECOMB 2014 and J Comp Biol 2015) is the replacement for SATé. PASTA can analyze up to 1,000,000 sequences. UPP (ultra- large mul,ple sequence alignment), Genome Biology 2015. Uses a collec,on of HMMs to represent a backbone alignment. Improves alignment and also detec,on of remote homology compared to a single HMM. UPP produces highly accurate alignments, even in the presence of fragmentary sequences. Can analyze datasets with 1,000,000 sequences. Other applica,ons of the Ensemble of HMMs technique TIPP (metagenomic taxon iden,fica,on and abundance profiling), Bioinforma,cs 2014. SEPP (phylogene,c placement), PSB 2012. Protein sequence analysis (collabora,ons with Mar,n Weigt, Paris, and Jian Peng, UIUC)

SEPP SEPP: SATé-enabled Phylogenetic Placement, by Mirarab, Nguyen, and Warnow. Pacific Symposium on Biocomputing, 2012, special session on the Human Microbiome Objective: phylogenetic analysis of single-gene datasets with fragmentary sequences Introduces HMM Family technique

Phylogenetic Placement Fragmentary sequences from some gene Full- length sequences for same gene, and an alignment and a tree ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG... ACCT AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

Phylogenetic Placement Step 1: Align each query sequence to backbone alignment Step 2: Place each query sequence into backbone tree, using extended alignment

HMMER vs. PaPaRa Alignments 0.0 Increasing rate of evolution

Align Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S1 S4 S2 S3

Align Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S4 S2 S3

Phylogenetic Placement Align each query sequence to backbone alignment HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree Pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011) Note: pplacer and EPA use maximum likelihood, and are reported to have the same accuracy.

Place Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S4 Q1 S2 S3

HMMER+pplacer: 1) build one HMM for the entire alignment 2) Align fragment to the HMM, and insert into alignment 3) Insert fragment into tree to optimize likelihood

Using SEPP for taxon identification Fragmentary sequences from some gene Full- length sequences for same gene, and an alignment and a tree ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG... ACCT AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

SEPP(10%), based on ~10 HMMs 0.0 0.0 Increasing rate of evolution

SEPP vs. HMMER+pplacer SEPP produced more accurate phylogenetic placements than HMMER+pplacer. The only difference is the use of a Family of HMMs instead of one HMM. The biggest differences are for datasets with high rates of evolution.

The Tree of Life: Multiple Challenges Scientific challenges: Ultra-large multiple-sequence alignment Gene tree estimation Metagenomic classification Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: applied probability theory, graph theory, supercomputing, and heuristics Testing: simulations and real data

SATé- II running,me profiling

3 FIG. 1. Algorithmic design of PASTA. The first six boxes show the steps involved in one iteration of PASTA. The last two boxes show the meaning of transitivity for homologies defined by a column of an MSA, and how the concept of transitivity can be used to merge two compatible and overlapping alignments. MSA, multiple sequence alignment. Figure from Mirarab et al., J. Computa,onal Biology 2014