Ultra- large Mul,ple Sequence Alignment

Size: px
Start display at page:

Download "Ultra- large Mul,ple Sequence Alignment"

Transcription

1 Ultra- large Mul,ple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana- Champaign hep://tandy.cs.illinois.edu

2 Phylogeny (evolu,onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

3 Phylogenies and Applications Basic Biology: How did life evolve? Applica,ons of phylogenies to: protein structure and func,on popula,on gene,cs human migra,ons metagenomics

4

5 Computational Phylogenetics and Metagenomics Courtesy of the Tree of Life project

6 Hard Computational Problems NP- hard problems Large datasets 100,000+ sequences thousands of genes Big data complexity: model misspecifica,on fragmentary sequences errors in input data streaming data

7 Warnow Research Goal: improve accuracy, speed, robustness, or mathematical guarantees of computational methods, to enable highly accurate analyses of real datasets Techniques: divide-and-conquer, iteration, chordal graph theory, and probability theory Evaluation: synthetic and real data; collaborations with biologists and linguists Examples: Historical linguistics, 1994-present Absolute fast converging methods Phylogenetic networks, Genome rearrangements, Multiple sequence alignment, 2009-present (many papers, including SATé-1 (Science), SATé-2 (Syst Biol), PASTA (RECOMB and J Comp Biol), and UPP (Genome Biology)) Supertree methods, 2009-present Metagenomic analysis, 2014-present Coalescent-based species tree estimation (2011-present, including Science 2014a, Science 2014b, PNAS 2014)

8 Warnow Research Goal: improve accuracy, speed, robustness, or mathematical guarantees of computational methods, to enable highly accurate analyses of real datasets Techniques: divide-and-conquer, iteration, chordal graph theory, and probability theory Evaluation: synthetic and real data; collaborations with biologists and linguists Examples: Historical linguistics, 1994-present Absolute fast converging methods Phylogenetic networks, Genome rearrangements, Multiple sequence alignment, 2009-present (many papers, including SATé-1 (Science), SATé-2 (Syst Biol), PASTA (RECOMB and J Comp Biol), and UPP (Genome Biology)) Supertree methods, 2009-present Metagenomic analysis, 2014-present Coalescent-based species tree estimation (2011-present, including Science 2014a, Science 2014b, PNAS 2014)

9 Warnow Research Goal: improve accuracy, speed, robustness, or mathematical guarantees of computational methods, to enable highly accurate analyses of real datasets Techniques: divide-and-conquer, iteration, chordal graph theory, and probability theory Evaluation: synthetic and real data; collaborations with biologists and linguists Examples: Historical linguistics, 1994-present Absolute fast converging methods Phylogenetic networks, Genome rearrangements, Multiple sequence alignment, 2009-present (many papers, including SATé-1 (Science), SATé-2 (Syst Biol), PASTA (RECOMB and J Comp Biol), and UPP (Genome Biology)) Supertree methods, 2009-present Metagenomic analysis, 2014-present Coalescent-based species tree estimation (2011-present, including Science 2014a, Science 2014b, PNAS 2014)

10 DNA Sequence Evolution AAGGCCT AAGACTT TGGACTT -3 mil yrs -2 mil yrs AGGGCAT TAGCCCT AGCACTT -1 mil yrs AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT today

11 Phylogeny Problem U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

12 Markov Model of Site Evolu,on Simplest (Jukes- Cantor, 1969): The model tree T is binary and has subs,tu,on probabili,es p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleo,des) If a site (posi,on) changes on an edge, it changes with equal probability to each of the remaining states. The evolu,onary process is Markovian. More complex models (such as the General Time Reversible model, or the General Markov model) are also considered, o`en with liele change to the theory.

13 Quan,fying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

14 Sta,s,cal Consistency error Data Maximum likelihood is sta,s,cally consistent under standard models (e.g., GTR)

15 Mathema,cal Ques,ons Is the model tree iden,fiable? Which es,ma,on methods are sta,s,cally consistent under this model? How much data does the method need to es,mate the model tree correctly (with high probability)? What is the impact of model misspecifica,on? What is the computa,onal complexity of an es,ma,on problem?

16 The Classical Phylogeny Problem U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

17 Much is known about this problem from a mathema,cal and empirical viewpoint U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

18 However U V W X Y AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT U X Y V W

19 Indels (insertions and deletions) Deletion Mutation ACGGTGCAGTTACCA ACCAGTCACCA

20 Deletion Substitution ACGGTGCAGTTACCA Insertion ACCAGTCACCTA ACGGTGCAGTTACC-A AC----CAGTCACCTA The true mul*ple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree

21 Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

22 Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA

23 Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 S2 S4 S3

24 Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden,fy orthologs Compute mul,ple sequence alignments for each locus, and construct gene trees Compute species tree or network: Combine the es,mated gene trees, OR Es,mate a tree from a concatena,on of the mul,ple sequence alignments Get sta,s,cal support on each branch (e.g., bootstrapping) Es,mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

25 Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden,fy orthologs Compute mul,ple sequence alignments for each locus, and construct gene trees Coalescent- based Compute species tree or network: species tree Combine the es,mated gene trees, OR es,ma,on! Es,mate a tree from a concatena,on of the mul,ple sequence alignments Get sta,s,cal support on each branch (e.g., bootstrapping) Es,mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

26 Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden,fy orthologs Compute mul,ple sequence alignments for each locus, and construct gene trees Compute species tree or network: Combine the es,mated gene trees, OR Es,mate a tree from a concatena,on of the mul,ple sequence alignments Get sta,s,cal support on each branch (e.g., bootstrapping) Es,mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

27 Large-scale Alignment Estimation Many genes are considered unalignable due to high rates of evolu,on Only a few methods can analyze large datasets iplant (NSF Plant Biology Collabora,ve) and other projects planning to construct phylogenies with 500,000 taxa

28 1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin Plus many many other people l First study (WickeE, Mirarab, et al., PNAS 2014) had ~100 species and ~800 genes, gene trees and alignments es,mated using SATé, and a coalescent- based species tree es,mated using ASTRAL l Second study: Plant Tree of Life based on transcriptomes of ~1200 species, and more than 13,000 gene families (most not single copy) Gene Challenges: Tree Incongruence Species tree estimation from conflicting gene trees Gene tree estimation of datasets with > 100,000 sequences

29 Multiple Sequence Alignment (MSA): a scientific grand challenge 1 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC Sn = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- Sn = TCAC--GACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

30 This talk Big data multiple sequence alignment SATé (Science 2009, Systematic Biology 2012) and PASTA (RECOMB and J Comp Biol 2015), methods for co-estimation of alignments and trees UPP (Genome Biology 2015): ultra-large multiple sequence alignment, using the Ensemble of HMMs technique. Other applications of the ehmm technique: phylogenetic placement (SEPP, PSB 2012) metagenomic taxon identification (TIPP, Bioinformatics 2014) protein structure and function classification gene binning

31 Mul,ple Sequence Alignment

32 First Align, then Compute the Tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 S2 S4 S3

33 Co- es,ma,on would be much beeer!!! S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 S2 S4 S3

34 Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 S2 S4 S3 True tree and alignment Compare S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA S1 S4 S2 S3 Estimated tree and alignment

35 Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

36 Two- phase es,ma,on Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di- align T- Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. RAxML: heuris>c for large- scale ML op>miza>on

37 1000- taxon models, ordered by difficulty (Liu et al., 2009)

38 SATé Family of methods Itera,ve divide- and- conquer methods Each itera,on re- aligns the sequences using the current tree, running preferred MSA methods on small local subsets, and merging subset alignments Each itera,on computes an ML tree on the current alignment, under the GTR (Generalized Time Reversible) Markov model of evolu,on Note: these methods are MSA boosters, designed to improve accuracy and/or scalability of the base method We show results using MAFFT- l- ins- i to align subsets

39 Re-aligning on a tree A B C D Decompose dataset A C B D Align subsets A C B D Estimate ML tree on merged alignment ABCD Merge sub-alignments

40 SATé and PASTA Algorithms Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Tree Use tree to compute new alignment Alignment Repeat un,l termina,on condi,on, and return the alignment/tree pair with the best ML score

41 SATé- 1 (Science 2009) performance taxon models, ordered by difficulty rate of evolu,on generally increases from le` to right SATé hour analysis, on desktop machines (Similar improvements for biological datasets) SATé- 1 can analyze up to about 8,000 sequences.

42 SATé- 1 and SATé- 2 (Systema*c Biology, 2012) SATé- 1: up to 8K SATé- 2: up to ~50K taxon models ranked by difficulty

43 SATé variants differ only in the decomposition strategy A B C D Decompose dataset A C B D Align subsets A C B D Estimate ML tree on merged alignment ABCD Merge sub-alignments

44 SATé- II: centroid edge decomposi,on A C ABCDE D ABC DE AB C D E B E A B SATé- II makes all subsets small (user parameter), and can analyze 50K sequences, SATé- I decomposi,on produced clades and had bigger subsets; limited to 8K sequences

45 SATé: merger strategy A C ABCDE D ABC DE AB C D E B E A B Both SATé s use the same hierarchical merger strategy. On large (50K) datasets, the last pairwise merger can use more than 70% of the running,me

46 PASTA merging: Step 1 A C C B D E A B D E Compute a spanning tree connec,ng alignment subsets

47 PASTA merging: Step 2 AB A B BD C CD DE D E A AB B BD C CD D DE E Use Opal (or Muscle) to merge adjacent subset alignments in the spanning tree

48 PASTA merging: Step 3 C AB + BD = ABD ABD + CD = ABCD ABCD + DE = ABCDE A AB B BD CD D DE E Use transi,vity to merge all pairwise- merged alignments from Step 2 into final an alignment on en,re dataset Overall: O(n log(n) + L)

49 PASTA vs. SATé- II profiling and scaling 8 PASTA Speedup 6 4 SATe Number of Threads

50 PASTA Running Time and Scalability 1250 Running time (minutes) One itera,on Using 12 cpus 1 node on Lonestar TACC Maximum 24 GB memory Showing wall clock running,me ~ 1 hour for 10k taxa ~ 17 hours for 200k taxa 0 10,000 50, , ,000 Number of Sequences

51 Tree accuracy 1 million sequences: PASTA finished one iteration in 15 days PASTA tree had 6% error, compared to 5.6% when using true alignment Starting tree had 8.4% error 10

52 PASTA vs. SATé- II Difference is how subset alignments are merged together (transi,vity instead of Opal/Muscle). As expected, PASTA is faster and can analyze larger datasets. Unexpected: PASTA produces more accurate alignments and trees (on both simulated and biological data, including DNA, RNA, and AA sequences). Thus, transi,vity applied to compa,ble and overlapping alignments gives a surprisingly accurate technique for merging a collec,on of alignments.

53 PASTA and SATé- II: MSA boosters PASTA and SATé- II are techniques for improving the scalability of MSA methods to large datasets. We showed results here using MAFFT- l- ins- i to align small subsets with 200 sequences. We have also explored results using other MSA methods (e.g., Prank, Clustal, Bali- Phy), and obtain similar improvements in accuracy and/or scalability.

54 1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin l Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) Gene Tree Incongruence Plus many many other people Challenge: Massive gene tree conflict consistent with ILS Alignment of datasets with > 100,000 sequences

55 12000 Mean: Median:266 1KP dataset: more than 100,000 p450 amino- acid sequences, many fragmentary 8000 Counts Length

56 12000 Mean: Median:266 1KP dataset: more than 100,000 p450 amino- acid sequences, many fragmentary 8000 Counts All standard mul>ple sequence alignment methods we tested performed poorly on datasets with fragments Length

57 1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen UIUC UT-Austin UT-Austin l Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) Gene Tree Incongruence Plus many many other people Challenge: Alignment of datasets with > 100,000 sequences with many fragmentary sequences

58 UPP UPP = Ultra- large mul,ple sequence alignment using Phylogeny- aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, Purpose: highly accurate large- scale mul,ple sequence alignments, even in the presence of fragmentary sequences.

59 UPP UPP = Ultra- large mul,ple sequence alignment using Phylogeny- aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, Purpose: highly accurate large- scale mul,ple sequence alignments, even in the presence of fragmentary sequences. Uses an ensemble of HMMs

60 Simple idea (not UPP) Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

61 One Hidden Markov Model for the entire alignment?

62 Simple idea (not UPP) Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

63 This approach works well if the dataset is small and has low evolu,onary rates, but is not very accurate otherwise. Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

64 One Hidden Markov Model for the en,re alignment? HMM 1

65 Or 2 HMMs? HMM 1 HMM 2

66 Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4

67 Or all 7 HMMs? HMM 1 HMM 2 HMM 4 m HMM 7 HMM 3 HMM 5 HMM 6

68 UPP Algorithmic Approach 1. Select random subset of full- length sequences, and build backbone alignment 2. Construct an Ensemble of Hidden Markov Models on the backbone alignment 3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

69 UPP Algorithmic Approach 1. Select random subset of full- length sequences, and build backbone alignment and backbone tree Notes: Need to avoid fragments in the backbone We show results using PASTA for the backbone alignment and tree, but are other methods can be used we have explored BAli- Phy, a powerful Bayesian sta,s,cal method Random is good when taxonomic sampling is rela,vely uniform, but directed sampling can improve accuracy We explored backbones with 100 and 1000 sequences, even when the full dataset is very big (1,000,000 one million)

70 UPP Algorithmic Approach 2. Construct an Ensemble of Hidden Markov Models on the backbone alignment Technique: Create set of subsets (using the tree). Then, for each subset, build an HMM on the induced alignment on each subset. Note: Different subset sizes are good for different situa,ons, and the ensemble technique is more accurate than disjoint sets

71 UPP Algorithmic Approach 3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs For each of the remaining sequences s, find H, the HMM from the ensemble that has the best score (i.e., HMM maximizing Pr(s H)) Use HMMER code and H to add s into the backbone alignment

72 Evalua,on Simulated datasets (some have fragmentary sequences): 10K to 1,000,000 sequences in RNASim complex RNA sequence evolu,on simula,on sequence nucleo,de datasets from SATé papers sequence AA datasets (from FastTree paper) 10,000- sequence Indelible nucleo,de simula,on Biological datasets: Proteins: largest BaliBASE and HomFam RNA: 3 CRW datasets up to 28,000 sequences

73 RNASim: alignment error All methods given 24 hrs on a 12- core machine Note: Maz was run under default se{ngs for 10K and 50K sequences and under ParEree for 100K sequences, and fails to complete under any se{ng For 200K sequences. Clustal- Omega only completes on 10K dataset.

74 RNASim: tree error All methods given 24 hrs on a 12- core machine Note: Maz was run under default se{ngs for 10K and 50K sequences and under ParEree for 100K sequences, and fails to complete under any se{ng For 200K sequences. Clustal- Omega only completes on 10K dataset.

75 RNASim Million Sequences: alignment error Notes: We show alignment error using average of SP-FN and SP-FP. UPP variants have better alignment scores than PASTA. (Not shown: Total Column Scores PASTA more accurate than UPP) No other methods tested could complete on these data PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP alignments were closer in

76 RNASim Million Sequences: tree error Using 12 processors: UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days

77 UPP vs. PASTA: impact of fragmenta,on Mean alignment error % Fragmentary PASTA UPP(Default) (a) Average alignment error Under high rates of evolu,on, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolu,on, PASTA can s,ll be highly accurate (data not shown). Delta FN tree error % Fragmentary UPP con,nues to have good accuracy even on datasets with many fragments under all rates of evolu,on. PASTA UPP(Default) (b) Average tree error Performance on fragmentary datasets of the 1000M2 model condi,on

78 UPP Running Time Wall clock align time (hr) Number of sequences UPP(Fast) Wall- clock,me used (in hours) given 12 processors

79 Current Related Research UPP: Using itera,on Using other MSA models (not just HMMs) within the Ensemble Using structural alignments for the backbone Using powerful sta,s,cal methods to produce the backbone alignment and tree PASTA : Using powerful sta,s,cal methods for the subset alignments Improving the pairwise merging technique ENSEMBLE OF HMMS: Metagenomic taxon iden,fica,on (collabora,on with Mihai Pop, Maryland) Gene binning (joint with Jian Peng, UIUC, and with Jim Leebens- Mack, Georgia) Protein structure and func,on classifica,on (collabora,ons with Mar,n Weigt, Paris, and Jian Peng, UIUC)

80 Acknowledgments PhD students: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD) Undergrad: Keerthana Kumar Current NSF grants: ABI (mul,ple sequence alignment) III:AF: (metagenomics collabora,ve with Mihai Pop, University of Maryland) DBI: (phylogenomics collabora,ve with Rice and Stanford) CCF: (graph algorithms to improve phylogene,c es,ma,on collabora,ve with Berkeley) Other recent or current support: Guggenheim Founda,on, NSF DEB: , NSF DBI: , Microso` Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Compu,ng Center), the University of Alberta (Canada), Grainger Founda,on (at UIUC), and UIUC TACC, UTCS, and UIUC computa,onal resources

81 Alignment Accuracy Correct columns Starting Starting Alignment Alignment ClustalW ClustalW Mafft"Profile Mafft"Profile Muscle Muscle SATe2 SATe2 PASTA PASTA Alignment Accuracy (TC) RNASim 125 Gutell S.3 16S.T 16S.B.ALL

82 TIPP: high accuracy taxonomic identification of metagenomic data TIPP: taxon identification and phylogenetic profiling (Bioinformatics, 2014) Technique: combines UPP alignments and phylogenetic placement algorithms, and considers statistical uncertainty. Results: better accuracy than all current methods, even for sequencing technologies producing high indel rates Research funded by new NSF grant III:AF: (collabora,ve with Mihai Pop at University of Maryland)

83 Metagenomic taxonomic iden*fica*on and phylogene*c profiling Metagenomics, Venter et al., Exploring the Sargasso Sea: Scien*sts Discover One Million New Genes in Ocean Microbes

84 Basic Ques,ons 1. What is this fragment? (Classify each fragment as well as possible.) 2. What is the taxonomic distribu,on in the dataset? (Note: helpful to use marker genes.) 3. What are the organisms in this metagenomic sample doing together?

85 The Tree of Life: Multiple Challenges Scientific challenges: Ultra-large multiple-sequence alignment Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics

86 High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and motu terminates with an error message on all the high indel datasets.

87 TIPP vs. other abundance profilers TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads. All other methods have some vulnerability (e.g., motu is only accurate for short reads and is impacted by high indel rates).

88 Metagenomic Taxon Iden,fica,on Objec,ve: classify short reads in a metagenomic sample

89 Abundance Profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. For example: The distribution of the sample at the species-level is: 50% species A 20% species B 15% species C 14% species D 1% species E

90 Abundance Profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. Leading techniques: PhymmBL (Brady & Salzberg, Nature Methods 2009) NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011) MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard motu (Bork et al., Nature Methods 2013) MetaPhyler, MetaPhlAn, and motu are marker-based techniques (but use different marker genes). Marker gene are single-copy, universal, and resistant to horizontal transmission.

91 Novel genome datasets Note: motu terminates with an error message on the long fragment datasets and high indel datasets.

92 Summary SATé- 1 (Science 2009), SATé- 2 (Systema,c Biology 2012), co- es,ma,on of alignments and trees. SATé- 2 is well established in the biology community. PASTA (RECOMB 2014 and J Comp Biol 2015) is the replacement for SATé. PASTA can analyze up to 1,000,000 sequences. UPP (ultra- large mul,ple sequence alignment), Genome Biology Uses a collec,on of HMMs to represent a backbone alignment. Improves alignment and also detec,on of remote homology compared to a single HMM. UPP produces highly accurate alignments, even in the presence of fragmentary sequences. Can analyze datasets with 1,000,000 sequences. Other applica,ons of the Ensemble of HMMs technique TIPP (metagenomic taxon iden,fica,on and abundance profiling), Bioinforma,cs SEPP (phylogene,c placement), PSB Protein sequence analysis (collabora,ons with Mar,n Weigt, Paris, and Jian Peng, UIUC)

93 SEPP SEPP: SATé-enabled Phylogenetic Placement, by Mirarab, Nguyen, and Warnow. Pacific Symposium on Biocomputing, 2012, special session on the Human Microbiome Objective: phylogenetic analysis of single-gene datasets with fragmentary sequences Introduces HMM Family technique

94 Phylogenetic Placement Fragmentary sequences from some gene Full- length sequences for same gene, and an alignment and a tree ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG... ACCT AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

95 Phylogenetic Placement Step 1: Align each query sequence to backbone alignment Step 2: Place each query sequence into backbone tree, using extended alignment

96 HMMER vs. PaPaRa Alignments 0.0 Increasing rate of evolution

97 Align Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S1 S4 S2 S3

98 Align Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC S1 S4 S2 S3

99 Phylogenetic Placement Align each query sequence to backbone alignment HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree Pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011) Note: pplacer and EPA use maximum likelihood, and are reported to have the same accuracy.

100 Place Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = T-A--AAAC S1 S4 Q1 S2 S3

101 HMMER+pplacer: 1) build one HMM for the entire alignment 2) Align fragment to the HMM, and insert into alignment 3) Insert fragment into tree to optimize likelihood

102 Using SEPP for taxon identification Fragmentary sequences from some gene Full- length sequences for same gene, and an alignment and a tree ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG... ACCT AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

103 Using SEPP for taxon identification Fragmentary sequences from some gene Full- length sequences for same gene, and an alignment and a tree ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG... ACCT AGG...GCAT TAGC...CCA TAGA...CTT AGC...ACA ACT..TAGA..A

104 SEPP(10%), based on ~10 HMMs Increasing rate of evolution

105

106 SEPP vs. HMMER+pplacer SEPP produced more accurate phylogenetic placements than HMMER+pplacer. The only difference is the use of a Family of HMMs instead of one HMM. The biggest differences are for datasets with high rates of evolution.

107 The Tree of Life: Multiple Challenges Scientific challenges: Ultra-large multiple-sequence alignment Gene tree estimation Metagenomic classification Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: applied probability theory, graph theory, supercomputing, and heuristics Testing: simulations and real data

108 The Tree of Life: Multiple Challenges Scientific challenges: Ultra-large multiple-sequence alignment Gene tree estimation Metagenomic classification Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: applied probability theory, graph theory, supercomputing, and heuristics Testing: simulations and real data

109 SATé- II running,me profiling

110 3 FIG. 1. Algorithmic design of PASTA. The first six boxes show the steps involved in one iteration of PASTA. The last two boxes show the meaning of transitivity for homologies defined by a column of an MSA, and how the concept of transitivity can be used to merge two compatible and overlapping alignments. MSA, multiple sequence alignment. Figure from Mirarab et al., J. Computa,onal Biology 2014

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas Phylogeny (evolutionary tree) Orangutan From the Tree of the Life Website, University of Arizona Gorilla

More information

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

More information

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes

More information

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website,

More information

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu Mul$ple Sequence Alignment Methods Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu Species Tree Orangutan Gorilla Chimpanzee Human From the Tree of the Life

More information

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign CS 581 Algorithmic Computational Genomics Tandy Warnow University of Illinois at Urbana-Champaign Today Explain the course Introduce some of the research in this area Describe some open problems Talk about

More information

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign Phylogenomics, Multiple Sequence Alignment, and Metagenomics Tandy Warnow University of Illinois at Urbana-Champaign Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the

More information

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign CS 581 Algorithmic Computational Genomics Tandy Warnow University of Illinois at Urbana-Champaign Course Staff Professor Tandy Warnow Office hours Tuesdays after class (2-3 PM) in Siebel 3235 Email address:

More information

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012 CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012 Biology: 21st Century Science! When the human genome was sequenced seven years ago, scientists knew that most of the major scientific

More information

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees 394C, October 2, 2013 Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees Mul9ple Sequence Alignment Mul9ple Sequence Alignments and Evolu9onary Histories (the meaning of homologous

More information

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science!1 Adapted from U.S. Department of Energy Genomic Science

More information

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n From Gene Trees to Species Trees Tandy Warnow The University of Texas at Aus

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois The Mathema)cs of Es)ma)ng the Tree of Life Tandy Warnow The University of Illinois Phylogeny (evolu9onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

More information

Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign Constrained Exact Op1miza1on in Phylogene1cs Tandy Warnow The University of Illinois at Urbana-Champaign Phylogeny (evolu1onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website,

More information

Computa(onal Challenges in Construc(ng the Tree of Life

Computa(onal Challenges in Construc(ng the Tree of Life Computa(onal Challenges in Construc(ng the Tree of Life Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign h?p://tandy.cs.illinois.edu Phylogeny (evolueonary tree)

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki Phylogene)cs IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, 2016 Joyce Nzioki Phylogenetics The study of evolutionary relatedness of organisms. Derived from two Greek words:» Phle/Phylon: Tribe/Race» Genetikos:

More information

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Bioinformatics tools for phylogeny and visualization. Yanbin Yin Bioinformatics tools for phylogeny and visualization Yanbin Yin 1 Homework assignment 5 1. Take the MAFFT alignment http://cys.bios.niu.edu/yyin/teach/pbb/purdue.cellwall.list.lignin.f a.aln as input and

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support Siavash Mirarab University of California, San Diego Joint work with Tandy Warnow Erfan Sayyari

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Phylogenetic Geometry

Phylogenetic Geometry Phylogenetic Geometry Ruth Davidson University of Illinois Urbana-Champaign Department of Mathematics Mathematics and Statistics Seminar Washington State University-Vancouver September 26, 2016 Phylogenies

More information

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

CSCI1950 Z Computa4onal Methods for Biology Lecture 5 CSCI1950 Z Computa4onal Methods for Biology Lecture 5 Ben Raphael February 6, 2009 hip://cs.brown.edu/courses/csci1950 z/ Alignment vs. Distance Matrix Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGCGACGTAACAAACGC

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Fast coalescent-based branch support using local quartet frequencies

Fast coalescent-based branch support using local quartet frequencies Fast coalescent-based branch support using local quartet frequencies Molecular Biology and Evolution (2016) 33 (7): 1654 68 Erfan Sayyari, Siavash Mirarab University of California, San Diego (ECE) anzee

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees Concatenation Analyses in the Presence of Incomplete Lineage Sorting May 22, 2015 Tree of Life Tandy Warnow Warnow T. Concatenation Analyses in the Presence of Incomplete Lineage Sorting.. 2015 May 22.

More information

Recent Advances in Phylogeny Reconstruction

Recent Advances in Phylogeny Reconstruction Recent Advances in Phylogeny Reconstruction from Gene-Order Data Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131 Department Colloqium p.1/41 Collaborators

More information

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1 Ben Raphael January 21, 2009 Course Par3culars Three major topics 1. Phylogeny: ~50% lectures 2. Func3onal Genomics: ~25% lectures

More information

CSCI1950 Z Computa3onal Methods for Biology Lecture 24. Ben Raphael April 29, hgp://cs.brown.edu/courses/csci1950 z/ Network Mo3fs

CSCI1950 Z Computa3onal Methods for Biology Lecture 24. Ben Raphael April 29, hgp://cs.brown.edu/courses/csci1950 z/ Network Mo3fs CSCI1950 Z Computa3onal Methods for Biology Lecture 24 Ben Raphael April 29, 2009 hgp://cs.brown.edu/courses/csci1950 z/ Network Mo3fs Subnetworks with more occurrences than expected by chance. How to

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois New methods for es-ma-ng species trees from genome-scale data Tandy Warnow The University of Illinois Phylogeny (evolu9onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website,

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego Upcoming challenges in phylogenomics Siavash Mirarab University of California, San Diego Gene tree discordance The species tree gene1000 Causes of gene tree discordance include: Incomplete Lineage Sorting

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

How to read and make phylogenetic trees Zuzana Starostová

How to read and make phylogenetic trees Zuzana Starostová How to read and make phylogenetic trees Zuzana Starostová How to make phylogenetic trees? Workflow: obtain DNA sequence quality check sequence alignment calculating genetic distances phylogeny estimation

More information

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression) Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE Manmeet Kaur 1, Navneet Kaur Bawa 2 1 M-tech research scholar (CSE Dept) ACET, Manawala,Asr 2 Associate Professor (CSE Dept) ACET, Manawala,Asr

More information

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE) Reconstruction of species trees from gene trees using ASTRAL Siavash Mirarab University of California, San Diego (ECE) Phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human

More information

A phylogenomic toolbox for assembling the tree of life

A phylogenomic toolbox for assembling the tree of life A phylogenomic toolbox for assembling the tree of life or, The Phylota Project (http://www.phylota.org) UC Davis Mike Sanderson Amy Driskell U Pennsylvania Junhyong Kim Iowa State Oliver Eulenstein David

More information

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1 BIOINFORMATICS Vol. 1 no. 1 2003 Pages 1 8 Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data Jijun Tang 1 and Bernard M.E. Moret 1 1 Department of Computer Science, University of New

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Parameter Es*ma*on: Cracking Incomplete Data

Parameter Es*ma*on: Cracking Incomplete Data Parameter Es*ma*on: Cracking Incomplete Data Khaled S. Refaat Collaborators: Arthur Choi and Adnan Darwiche Agenda Learning Graphical Models Complete vs. Incomplete Data Exploi*ng Data for Decomposi*on

More information

Gene Regulatory Networks II Computa.onal Genomics Seyoung Kim

Gene Regulatory Networks II Computa.onal Genomics Seyoung Kim Gene Regulatory Networks II 02-710 Computa.onal Genomics Seyoung Kim Goal: Discover Structure and Func;on of Complex systems in the Cell Identify the different regulators and their target genes that are

More information

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on: 17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

More information

MiGA: The Microbial Genome Atlas

MiGA: The Microbial Genome Atlas December 12 th 2017 MiGA: The Microbial Genome Atlas Jim Cole Center for Microbial Ecology Dept. of Plant, Soil & Microbial Sciences Michigan State University East Lansing, Michigan U.S.A. Where I m From

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence Are directed quartets the key for more reliable supertrees? Patrick Kück Department of Life Science, Vertebrates Division,

More information

Basics on bioinforma-cs Lecture 7. Nunzio D Agostino

Basics on bioinforma-cs Lecture 7. Nunzio D Agostino Basics on bioinforma-cs Lecture 7 Nunzio D Agostino nunzio.dagostino@entecra.it; nunzio.dagostino@gmail.com Multiple alignments One sequence plays coy a pair of homologous sequence whisper many aligned

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

7. Tests for selection

7. Tests for selection Sequence analysis and genomics 7. Tests for selection Dr. Katja Nowick Group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute for Brain Research www. nowicklab.info

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim Evolu&on, Popula&on Gene&cs, and Natural Selec&on 02-710 Computa.onal Genomics Seyoung Kim Phylogeny of Mammals Phylogene&cs vs. Popula&on Gene&cs Phylogene.cs Assumes a single correct species phylogeny

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Mitochondrial Genome Annotation

Mitochondrial Genome Annotation Protein Genes 1,2 1 Institute of Bioinformatics University of Leipzig 2 Department of Bioinformatics Lebanese University TBI Bled 2015 Outline Introduction Mitochondrial DNA Problem Tools Training Annotation

More information

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16 Logis@cs CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Sign

More information

Phylogeny: building the tree of life

Phylogeny: building the tree of life Phylogeny: building the tree of life Dr. Fayyaz ul Amir Afsar Minhas Department of Computer and Information Sciences Pakistan Institute of Engineering & Applied Sciences PO Nilore, Islamabad, Pakistan

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Multiple sequence alignment

Multiple sequence alignment Multiple sequence alignment Multiple sequence alignment: today s goals to define what a multiple sequence alignment is and how it is generated; to describe profile HMMs to introduce databases of multiple

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships Chapter 26: Phylogeny and the Tree of Life You Must Know The taxonomic categories and how they indicate relatedness. How systematics is used to develop phylogenetic trees. How to construct a phylogenetic

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

PHYLOGENY AND SYSTEMATICS

PHYLOGENY AND SYSTEMATICS AP BIOLOGY EVOLUTION/HEREDITY UNIT Unit 1 Part 11 Chapter 26 Activity #15 NAME DATE PERIOD PHYLOGENY AND SYSTEMATICS PHYLOGENY Evolutionary history of species or group of related species SYSTEMATICS Study

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois WABI 2017 Factsheet 55 proceedings paper submissions, 27 accepted (49% acceptance rate) 9 additional poster-only submissions (+8 posters accompanying papers) 56 PC members Each paper reviewed by at least

More information

Multiple Sequence Alignment. Sequences

Multiple Sequence Alignment. Sequences Multiple Sequence Alignment Sequences > YOR020c mstllksaksivplmdrvlvqrikaqaktasglylpe knveklnqaevvavgpgftdangnkvvpqvkvgdqvl ipqfggstiklgnddevilfrdaeilakiakd > crassa mattvrsvksliplldrvlvqrvkaeaktasgiflpe

More information

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1 Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S1 page 1 Smith, Stephen A., Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue. 2011. Understanding angiosperm

More information

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES INTRODUCTION CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES This worksheet complements the Click and Learn developed in conjunction with the 2011 Holiday Lectures on Science, Bones, Stones, and Genes:

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

Networks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource

Networks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks in biology Protein-Protein Interaction Network of Yeast Transcriptional regulatory network of E.coli Experimental

More information

CS 394C March 21, Tandy Warnow Department of Computer Sciences University of Texas at Austin

CS 394C March 21, Tandy Warnow Department of Computer Sciences University of Texas at Austin CS 394C March 21, 2012 Tandy Warnow Department of Computer Sciences University of Texas at Austin Phylogeny From the Tree of the Life Website, University of Arizona Orangutan Gorilla Chimpanzee Human Phylogeny

More information

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

More information

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Sequence Analysis '17- lecture 8. Multiple sequence alignment Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Workshop III: Evolutionary Genomics

Workshop III: Evolutionary Genomics Identifying Species Trees from Gene Trees Elizabeth S. Allman University of Alaska IPAM Los Angeles, CA November 17, 2011 Workshop III: Evolutionary Genomics Collaborators The work in today s talk is joint

More information

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Minimum Edit Distance. Defini'on of Minimum Edit Distance Minimum Edit Distance Defini'on of Minimum Edit Distance How similar are two strings? Spell correc'on The user typed graffe Which is closest? graf gra@ grail giraffe Computa'onal Biology Align two sequences

More information

Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria

Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria Seminar presentation Pierre Barbera Supervised by:

More information

CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on

CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on Professor Wei-Min Shen Week 13.1 and 13.2 1 Status Check Extra credits? Announcement Evalua/on process will start soon

More information