Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Similar documents
Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Ultra- large Mul,ple Sequence Alignment

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

SUPPLEMENTARY INFORMATION

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

PHYLOGENY AND SYSTEMATICS

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Multiple sequence alignment

EECS730: Introduction to Bioinformatics

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

BLAST. Varieties of BLAST

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

SUPPLEMENTARY INFORMATION

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Week 10: Homology Modelling (II) - HHpred

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

Dr. Amira A. AL-Hosary

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Sequence Alignment Techniques and Their Uses

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Gibbs Sampling Methods for Multiple Sequence Alignment

MiGA: The Microbial Genome Atlas

Similarity searching summary (2)

An Introduction to Sequence Similarity ( Homology ) Searching

Large-Scale Genomic Surveys

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

O 3 O 4 O 5. q 3. q 4. Transition

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Effects of Gap Open and Gap Extension Penalties

Tools and Algorithms in Bioinformatics

Microbial analysis with STAMP

Phylogenetic inference

Mitochondrial Genome Annotation

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Pairwise sequence alignment and pair hidden Markov models

Taxonomical Classification using:

Hands-On Nine The PAX6 Gene and Protein

A phylogenomic toolbox for assembling the tree of life

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Multiple Sequence Alignment: HMMs and Other Approaches

Phylogenetic Networks, Trees, and Clusters

Comparison of Three Fugal ITS Reference Sets. Qiong Wang and Jim R. Cole

Overview of Research at Bioinformatics Lab

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

Basic Local Alignment Search Tool

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Session 5: Phylogenomics

Hidden Markov Models and Their Applications in Biological Sequence Analysis

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Whole Genome Alignments and Synteny Maps

HMMs and biological sequence analysis

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Probalign: Multiple sequence alignment using partition function posterior probabilities

How to read and make phylogenetic trees Zuzana Starostová

Supplementary Information

Phylogenetics in the Age of Genomics: Prospects and Challenges

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Hidden Markov Models

Evaluating the usefulness of alignment filtering methods to reduce the impact of errors on evolutionary inferences

Phylogenetic Tree Reconstruction

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

Quantifying sequence similarity

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

8/23/2014. Phylogeny and the Tree of Life

Motifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Model Accuracy Measures

Markov Models & DNA Sequence Evolution

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Hidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Transcription:

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

From the Tree of the Life Website, University of Arizona Phylogenomics

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, UIUC S. Mirarab, UCSD N. Nguyen UCSD Plus many many other people Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) Gene Tree Incongruence l Challenge: Alignments and trees on > 100,000 sequences

1000-taxon models, ordered by difficulty (Liu et al., Science 324(5934):1561-1564, 2009)

Re-aligning on a tree A B C D Decompose dataset A C B D Align subsets A C B D Estimate ML tree on merged alignment ABCD Merge sub-alignments

SATé and PASTA Algorithms Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Tree Use tree to compute new alignment Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score

1000 taxon models, ordered by difficulty, Liu et al., Science 324(5934):1561-1564, 2009 24 hour SATé-I analysis, on desktop machines (Similar improvements for biological datasets)

SATé-2 better than SATé-1 1000-taxon models ranked by difficulty SATé-1 (Liu et al., Science 2009): can analyze up to 8K sequences SATé-2 (Liu et al., Systematic Biology 2012): can analyze up to ~50K sequences

PASTA: even better than SATé-2 RNASim 0.20 Tree Error (FN Rate) 0.15 0.10 0.05 Clustal Omega Muscle Mafft Starting Tree SATe2 PASTA Reference Alignment 0.00 10000 50000 100000 200000 PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015 Simulated RNASim datasets from 10K to 200K taxa Limited to 24 hours using 12 CPUs Not all methods could run (missing bars could not finish)

0.4 Total Column Score PASTA+BAli Phy Better Comparing default PASTA to PASTA+BAli-Phy on simulated datasets with 1000 sequences PASTA+BAli Phy 0.3 0.2 0.1 0.0 PASTA Better 0.0 0.1 0.2 0.3 0.4 PASTA data Indelible M2 RNAsim Rose L1 Rose M1 Rose S1 Recall (SP Score) Tree Error: Delta RF (RAxML) 1.0 PASTA+BAli Phy Better 0.15 PASTA+BAli Phy Better PASTA+BAli Phy 0.9 0.8 0.7 0.6 PASTA Better PASTA 0.10 0.05 0.00 PASTA Better 0.6 0.7 0.8 0.9 1.0 PASTA 0.00 0.05 0.10 0.15 PASTA+BAli Phy Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, UIUC S. Mirarab, UCSD N. Nguyen UCSD Plus many many other people Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) Gene Tree Incongruence l Challenge: Alignments and trees on > 100,000 sequences

12000 Mean:317 10000 Median:266 1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary 8000 Counts 6000 4000 2000 0 0 500 1000 1500 2000 Length

12000 Mean:317 10000 Median:266 1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary 8000 Counts 6000 4000 2000 0 All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments. 0 500 1000 1500 2000 Length

Profile Hidden Markov Models Probabilistic model to represent a family of sequences, represented by a multiple sequence alignment Introduced for sequence analysis in Krogh et al. 1994. Popularized in Eddy 1996 and textbook Durbin et al. 1998 Fundamental part of HMMER and other protein databases Used for: homology detection, protein family assignment, multiple sequence alignment, phylogenetic placement, protein structure prediction, alignment segmentation, etc.

Profile HMMs l l Generative model for representing a MSA Consists of: l l l Set of states (Match, insertion, and deletion) Transition probabilities Emission probabilities

Profile Hidden Markov Model for DNA sequence alignment

HMMs for MSA l l Given seed alignment (e.g., in PFAM) and a collection of sequences for the protein family: l l l Represent seed alignment using HMM Align each additional sequence to the HMM Use transitivity to obtain MSA Can we do something like this without a seed alignment?

HMMs for MSA l l Given seed alignment (e.g., in PFAM) and a collection of sequences for the protein family: l l l Represent seed alignment using HMM Align each additional sequence to the HMM Use transitivity to obtain MSA Can we do something like this without a seed alignment?

UPP UPP = Ultra-large multiple sequence alignment using Phylogeny-aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

UPP UPP = Ultra-large multiple sequence alignment using Phylogeny-aware Profiles Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences. Uses an ensemble of HMMs

Simple idea (not UPP) Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

PASTA: even better than SATé-2 0.20 RNASim Starting tree is based on UPP(simple): one profile HMM for Backbone alignment Tree Error (FN Rate) 0.15 0.10 0.05 Clustal Omega Muscle Mafft Starting Tree SATe2 PASTA Reference Alignment 0.00 10000 50000 100000 200000 PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015 Simulated RNASim datasets from 10K to 200K taxa Limited to 24 hours using 12 CPUs Not all methods could run (missing bars could not finish)

This approach works well if the dataset is small and has low evolutionary rates, but is not very accurate otherwise. Select random subset of sequences, and build backbone alignment Construct a Hidden Markov Model (HMM) on the backbone alignment Add all remaining sequences to the backbone alignment using the HMM

One Hidden Markov Model for the entire alignment?

One Hidden Markov Model for the backbone alignment? HMM 1

Or 2 HMMs? HMM 1 HMM 2

Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4

Or all 7 HMMs? HMM 1 HMM 2 HMM 4 m HMM 7 HMM 3 HMM 5 HMM 6

UPP Algorithmic Approach 1. Select random subset of full-length sequences, and build backbone alignment 2. Construct an Ensemble of Hidden Markov Models on the backbone alignment 3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

Evaluation Simulated datasets (some have fragmentary sequences): 10K to 1,000,000 sequences in RNASim complex RNA sequence evolution simulation 1000-sequence nucleotide datasets from SATé papers 5000-sequence AA datasets (from FastTree paper) 10,000-sequence Indelible nucleotide simulation Biological datasets: Proteins: largest BaliBASE and HomFam RNA: 3 CRW datasets up to 28,000 sequences

RNASim Million Sequences: tree error Using 12 processors: UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days

UPP is very robust to fragmentary sequences Mean alignment error Delta FN tree error 0.6 0.4 0.2 0.0 0.4 0.2 0.0 0 12.5 25 50 % Fragmentary PASTA UPP(Default) (a) Average alignment error 0 12.5 25 50 % Fragmentary Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution. PASTA UPP(Default) (b) Average tree error Performance on fragmentary datasets of the 1000M2 model condition

UPP Running Time Wall clock align time (hr) 15 10 5 0 50000 100000 150000 200000 Number of sequences UPP(Fast) Wall-clock time used (in hours) given 12 processors

Other Applications of the Ensemble of HMMs SEPP (phylogenetic placement, Mirarab, Nguyen, and Warnow PSB 2014) TIPP (metagenomic taxon identification, Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014) HIPPI (protein classification and remote homology detection), RECOMB-CG 2016 and BMC Genomics 2016 (Nguyen, Nute, Mirarab, and Warnow)

Abundance Profiling Objective: Distribution of the species (or genera, or families, etc.) within the sample. For example: The distribution of the sample at the species-level is: 50% species A 20% species B 15% species C 14% species D 1% species E

TIPP pipeline Input: set of reads from a shotgun sequencing experiment of a metagenomic sample 1. Assign reads to marker genes using BLAST 2. For reads assigned to marker genes, perform taxonomic analysis 3. Combine analyses from Step 2

High indel datasets containing known genomes Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and motu terminates with an error message on all the high indel datasets.

Novel genome datasets Note: motu terminates with an error message on the long fragment datasets and high indel datasets.

Protein Family Assignment Input: new AA sequence (might be fragmentary) and database of protein families (e.g., PFAM) Output: assignment (if justified) of the sequence to an existing family in the database

HIPPI HIerarchical Profile HMMs for Protein family Identification Nguyen, Nute, Mirarab, and Warnow, RECOMB-CG 2016 and BMC-Genomics 2016 Uses an ensemble of HMMs to classify protein sequences Tested on HMMER

Seq Length: Full Seq Length: 50% Seq Length: 25%.99.99.98 Precision.98.98.98.97.95.92.97.96.75.80.85.90.95 1.00.50.60.70.80.90 Recall.20.40.60.80 Method HIPPI HMMER BLAST HHsearch: 1 Iteration HHsearch: 2 Iterations

Four Problems Phylogenetic Placement (SEPP, PSB 2012) Multiple sequence alignment (UPP, RECOMB 2014 and Genome Biology 2014) Metagenomic taxon identification (TIPP, Bioinformatics 2014) Gene family assignment and homology detection (HIPPI, RECOMB-CG 2016 and BMC Genomics 2016) A unifying technique is the Ensemble of Hidden Markov Models (introduced by Mirarab et al., 2012)

Summary Using an ensemble of HMMs tends to improve accuracy, for a cost of running time. Applications so far to taxonomic placement (SEPP), multiple sequence alignment (UPP), protein family classification (HIPPI). Improvements are mostly noticeable for large diverse datasets. Phylogenetically-based construction of the ensemble helps accuracy (note: the decompositions we produce are not clade-based), but the design and use of these ensembles is still in its infancy. (Many relatively similar approaches have been used by others, including Sci-Phy and FlowerPower by Sjolander.) The basic idea can be used with any kind of probabilistic model, doesn t have to be restricted to profile HMMs.

Basic question: why does it help? Using an ensemble of HMMs tends to improve accuracy, for a cost of running time. Applications so far to taxonomic placement (SEPP), multiple sequence alignment (UPP), protein family classification (HIPPI). Improvements are mostly noticeable for large diverse datasets. Phylogenetically-based construction of the ensemble helps accuracy (note: the decompositions we produce are not clade-based), but the design and use of these ensembles is still in its infancy. (Many relatively similar approaches have been used by others, including Sci-Phy and FlowerPower by Sjolander.) The basic idea can be used with any kind of probabilistic model, doesn t have to be restricted to profile HMMs.

The Tree of Life: Multiple Challenges Scientific challenges: Ultra-large multiple-sequence alignment Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics

Acknowledgments PASTA and UPP: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD), undergrad: Keerthana Kumar (at UT-Austin) PASTA+BAli-Phy: Mike Nute (PhD student at UIUC) Current NSF grants: ABI-1458652 (multiple sequence alignment) Grainger Foundation (at UIUC), and UIUC TACC, UTCS, Blue Waters, and UIUC campus cluster PASTA, UPP, SEPP, and TIPP are available on github at https://github.com/smirarab/; see also PASTA+BAli-Phy at http://github.com/mgnute/pasta

Nguyen et al. Genome Biology (2015) 16:124 Page 6 of 15 Table 2 Average alignment SP-error, tree error, and TC score across most full-length datasets Method ROSE RNASim Indelible ROSE CRW 10 AA HomFam HomFam NT 10K 10K AA (17) (2) Average alignment SP-error UPP 7.8 (1) 9.5 (1) 1.7 (2) 2.9 (1) 12.5 (1) 24.2 (1) 23.3 (1) 20.8 (2) PASTA 7.8 (1) 15.0 (2) 0.4 (1) 3.1 (1) 12.8 (1) 24.0 (1) 22.5 (1) 17.3 (1) MAFFT 20.6 (2) 25.5 (3) 41.4 (3) 4.9 (2) 28.3 (2) 23.5 (1) 25.3 (2) 20.7 (2) Muscle 20.6 (2) 64.7 (5) 62.4 (4) 5.5 (3) 30.7 (3) 30.2 (2) 48.1 (4) X Clustal 49.2 (3) 35.3 (4) X 6.5 (4) 43.3 (4) 24.3 (1) 27.7 (3) 29.4 (3) Average FN error UPP 1.3 (1) 0.8 (1) 0.3 (1) 1.8 (1) 7.8 (2) 3.4 (2) NA NA PASTA 1.3 (1) 0.4 (1) <0.1 (1) 1.3 (1) 5.1 (1) 3.3 (1) NA NA MAFFT 5.8 (2) 3.5 (2) 24.8 (3) 4.5 (3) 10.1 (3) 2.3 (1) NA NA Muscle 8.4 (3) 7.3 (3) 32.5 (4) 3.1 (2) 5.5 (1) 12.6 (3) NA NA Clustal 24.3 (4) 10.4 (4) X 4.2 (3) 34.1 (4) 3.5 (2) NA NA Average TC score UPP 37.8 (1) 0.5 (2) 11.0 (3) 2.6 (2) 1.4 (1) 11.4 (1) 47.3 (1) 40.3 (3) PASTA 37.8 (1) 2.3 (1) 48.0 (1) 5.4 (1) 2.3 (1) 12.1 (1) 46.1 (2) 50.0 (1) MAFFT 31.4 (2) 0.4 (2) 7.8 (4) 0.6 (3) 0.7 (2) 12.1 (1) 45.5 (2) 46.9 (2) Muscle 9.8 (3) <0.0 (2) 18.3 (2) 2.7 (2) 0.7 (2) 10.5 (2) 27.7 (4) X Clustal 5.7 (4) 0.2 (2) X 3.1 (2) 0.1 (2) 11.8 (1) 38.6 (3) 31.0 (4) We report the average alignment SP-error (the average of SPFN and SPFP errors) (top), average FN error (middle), and average TC score (bottom), for the collection of full-length datasets. All scores represent percentages and so are out of 100. Results marked with an X indicate that the method failed to terminate withinthe time limit (24 hours on a 12-core machine). Muscle failed to align two of the HomFam datasets; we report separate average results on the 17 HomFam datasets for all methods and the two HomFam datasets for all but Muscle. We did not test tree error on the HomFam datasets (therefore, the FN error is indicated by NA ). The tier ranking for each method is shown parenthetically

Precision 1.0.95.90.85.80.75 1.0.95.90.85.80.75 1.0.90.80 1.0.90.80 1.0.90.80.70 1.0.90.80.70.25.50.75 1.0.25.50.75 1.0.00.25.50.75 1.0.00.25.50.75 1.0.00.20.40.60.80 Avg. Pairwise Sequence Identiy > 30% 20 30% < 20% 1.0.95.90.85.80.75 1.0.95.90.85.80.75 1.0.90.80 1.0.90.80 1.0.90.80.70 1.0.90.80.70.25.50.75 1.0.25.50.75 1.0.00.25.50.75 1.0.00.25.50.75 1.0.00.20.40.60.80 1.0.95.90.85.80.75 1.0.95.90.85.80.75 1.0.90.80 1.0.90.80 1.0.90.80.70 1.0.90.80.70.25.50.75 1.0.25.50.75 1.0.00.25.50.75 1.0.00.25.50.75 1.0.00.20.40.60.80 Size: 0 100 Size: > 100 Size: 0 100 Size: > 100 Size: 0 100 Size: > 100 Sequence Length: Full Sequence Length: 50% Sequence Length: 25%.00.20.40.60.80.00.20.40.60.80 Recall.00.20.40.60.80 Method HIPPI HMMER BLAST HHsearch: 1 Iteration HHsearch: 2 Iterations

1. Pre-processing seed alignments Family A Family B Family N...... i) Compute an ML tree on each seed alignment ii) Build ensemble of HMMs for each seed alignment, using its ML tree HMM A 1 HMM A3 HMM A2 HMM N 1 HMM N4 HMM B 1 HMM N3 HMM N 5 iii) Collect HMMs into database HMM database HMM N 2 HMM A 1 HMM A2 HMM A3 HMM B 1... HMM N 1 HMM N 2 HMM N3 HMM N4 HMM N 5 2. Classification of query sequences Query sequence HMM database HMM A 1 HMM A2 HMM A3 HMM B 1... HMM N 1 HMM N 2 HMM N3 HMM N4 HMM N 5 i) Score query sequence against all HMMs in database, keeping only scores above inclusion threshold HMM A1 = 8.9 HMM A2 = 7.3 HMM A3 = 9.4 HMM B1 = 5.6... HMM N1 = 4.4 HMM N2 = 5.6 HMM N3 = 4.3 HMM N4 = 4.7 HMM N5 = 6.6 ii) Rank families by best scoring HMM within family 1) Family A = 9.4 2) Family N = 6.6... 8) Family B = 5.6... iii) Assign query sequence to top ranking family Family A Query