CS 581 / BIOE 540: Algorithmic Computational Genomics
Tandy Warnow
Departments of Bioengineering and Computer Science
http://tandy.cs.illinois.edu

Course Details
- Office hours: Tuesdays 12:30-1:30 (Siebel 3235)
- Course webpage: http://tandy.cs.illinois.edu/581-2017.html
- Textbook: Computational Phylogenetics, available for download at http://tandy.cs.illinois.edu/textbook.pdf
- TA: Pranjal Vachaspati (to be confirmed)

Today
- Describe some important problems in computational biology, for which students in this course could develop improved methods
- Explain how the course will be run
- Answer questions

This Course
- Topics: computational and statistical problems in sequence analysis (e.g., multiple sequence alignment, phylogeny estimation, and metagenomics).
- Focus: understanding the mathematical foundations, and designing algorithms with outstanding accuracy and speed on large, complex datasets.
- This is not a course about how to use the tools.

Prerequisites
No background in biology is needed. However, the course has the following prerequisites:
- CS 374: computational complexity, algorithm design techniques, and proving theorems about algorithms
- CS 361: probability and statistics
- By recursion, CS 225: programming

If you haven't satisfied the pre-reqs:
- You need permission to stay in the course.
- The first homework is due (by email) on Saturday at 1 PM. See the homework webpage http://tandy.cs.illinois.edu/cs581-2017-hw.html
- Then make an appointment to see me to review the homework.

This course
- Phylogeny estimation based on stochastic models of sequence evolution and genome evolution
- Multiple sequence alignment
- Applications to metagenomics, protein structure prediction, and other biological problems

Species Tree
(Figure: species tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.)

Evolution informs about everything in biology
- Big genome sequencing projects just produce data - so what?
- Evolutionary history relates all organisms and genes, and helps us understand and predict:
  - interactions between genes (genetic networks)
  - drug design
  - predicting functions of genes
  - influenza vaccine development
  - origins and spread of disease
  - origins and migrations of humans

Constructing the Tree of Life: Hard Computational Problems
- NP-hard problems
- Large datasets: 100,000+ sequences, thousands of genes
- "Big data" complexity: model misspecification, fragmentary sequences, errors in input data, streaming data

Phylogenomic pipeline
- Select taxon set and markers
- Gather and screen sequence data, possibly identify orthologs
- Compute multiple sequence alignments for each locus
- Compute species tree or network:
  - Compute gene trees on the alignments and combine the estimated gene trees, OR
  - Estimate a tree from a concatenation of the multiple sequence alignments
- Get statistical support on each branch (e.g., bootstrapping)
- Estimate dates on the nodes of the phylogeny
- Use species tree with branch support and dates to understand biology
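The branch-support step names bootstrapping as one option. The sketch below shows only the resampling part, assuming the standard nonparametric bootstrap over alignment columns. The names `bootstrap_replicate` and `toy_alignment` are illustrative and not from the course materials, and the tree estimation that would be run on each replicate is not shown.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw n_sites column indices uniformly at random, with replacement.
    sampled = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in sampled) for name in names}

# Hypothetical toy alignment (4 taxa, 6 sites), used only for illustration.
toy_alignment = {
    "S1": "ACGTAC",
    "S2": "ACGTTC",
    "S3": "AC-TTG",
    "S4": "ATGTTG",
}

rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(toy_alignment, rng) for _ in range(100)]
print(replicates[0])
```

In a full analysis, a tree would be estimated on each replicate, and the support for a branch would be the fraction of replicate trees that contain it.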

Research Strategies
Improved algorithms through:
- Divide-and-conquer
- Bin-and-conquer
- Iteration
- Bayesian statistics
- Hidden Markov Models
- Graph theory
- Combinatorial optimization
- Statistical modelling
- Massive simulations
- High Performance Computing

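DNA Sequence Evolution
(Figure: DNA sequences evolving down a tree, with time axis -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown above the leaves: AAGGCCT, AAGACTT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT. Sequences at the leaves (today): AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.)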

Phylogeny Problem
Input sequences:
U = AGGGCAT
V = TAGCCCA
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
(Figure: a tree with leaves X, U, Y, V, W.)

Performance criteria
- Running time
- Space
- Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution
- Topological accuracy with respect to the underlying true tree or true alignment, typically studied in simulation
- Accuracy with respect to a particular criterion (e.g., maximum likelihood score), on real data

Quantifying Error
- FN: false negative (missing edge)
- FP: false positive (incorrect edge)
(Figure: a true tree and an estimated tree, with one FN edge and one FP edge marked; 50% error rate.)
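The FN and FP rates can be computed by comparing the bipartitions (internal edges) of the two trees. The sketch below assumes trees are written as nested tuples of leaf names. The two topologies on U, V, W, X, Y are hypothetical, chosen so that one of the two internal edges is wrong, and the function names are illustrative.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Nontrivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so equal splits compare equal.
    """
    taxa = leaf_set(tree)
    ref = min(taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaf_set(node)
        side = taxa - below if ref in below else below
        # Keep only internal edges: both sides need at least two taxa.
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def fn_fp_rates(true_tree, est_tree):
    """FN rate = missing true edges / true edges; FP rate = wrong edges / estimated edges."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(est_tree)
    fn = len(true_splits - est_splits) / len(true_splits)
    fp = len(est_splits - true_splits) / len(est_splits)
    return fn, fp

# Hypothetical topologies, for illustration only.
true_tree = (("U", "X"), "Y", ("V", "W"))
est_tree = (("U", "Y"), "X", ("V", "W"))
fn_rate, fp_rate = fn_fp_rates(true_tree, est_tree)
print(fn_rate, fp_rate)
```

For two fully resolved trees on the same taxa, the FN and FP rates are equal, because both trees have the same number of internal edges.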

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.
3. Bayesian methods
(Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum.)
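A hill-climbing heuristic for Maximum Parsimony has to score each candidate tree. The sketch below covers only that scoring step, using Fitch's algorithm on a rooted binary tree written as nested pairs. The search over trees is not shown, and the function names and toy data are illustrative.

```python
def fitch_site(tree, states):
    """Return (candidate state set, number of changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is needed on this node's edges.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum of per-site Fitch costs over all alignment columns."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in seqs.items()}
        total += fitch_site(tree, column)[1]
    return total

# Hypothetical single-site data: a and b share A, c and d share C.
toy = {"a": "A", "b": "A", "c": "C", "d": "C"}
print(parsimony_score((("a", "b"), ("c", "d")), toy))  # tree grouping equal states
print(parsimony_score((("a", "c"), ("b", "d")), toy))  # tree splitting them up
```

A hill-climbing search would repeatedly propose a neighboring tree, rescore it this way, and keep it if the score improves, which is why such a search can stop at a local optimum.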

Solving maximum likelihood (and other hard optimization problems) is unlikely

# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
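The small entries of this table can be reproduced from the standard count of unrooted binary trees on n labeled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). The sketch below is an illustration of that formula; the function name is an assumption, not something from the course materials.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

for n in range(4, 11):
    print(n, num_unrooted_trees(n))
```

Python integers have arbitrary precision, so the same function can be evaluated exactly for much larger n.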

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!
(Figure: NJ error rate, 0 to 0.8, plotted against number of taxa, 0 to 1600.)

The Real Problem!
Input sequences (now of different lengths):
U = AGGGCATGA
V = AGAT
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
(Figure: a tree with leaves U, X, Y, V, W.)

(Figure: the sequence ACGGTGCAGTTACCA evolves into ACCAGTCACCTA through a deletion, a substitution, and an insertion.)

ACGGTGCAGTTACC-A
AC----CAGTCACCTA

The true multiple alignment
- Reflects historical substitution, insertion, and deletion events
- Defined using transitive closure of pairwise alignments computed on edges of the true tree

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Alignment
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct tree
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
(Figure: tree with leaves S1, S2, S4, S3.)

Two-phase estimation
Alignment methods:
- Clustal
- POY (and POY*)
- Probcons (and Probtree)
- Probalign
- MAFFT
- Muscle
- Di-align
- T-Coffee
- Prank (PNAS 2005, Science 2008)
- Opal (ISMB and Bioinf. 2007)
- FSA (PLoS Comp. Bio. 2009)
- Infernal (Bioinf. 2009)
- Etc.
Phylogeny methods:
- Bayesian MCMC
- Maximum parsimony
- Maximum likelihood
- Neighbor joining
- FastME
- UPGMA
- Quartet puzzling
- Etc.

Simulation Studies
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree and alignment (figure: tree with leaves S1, S2, S4, S3):
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Estimated tree and alignment (figure: tree with leaves S1, S4, S2, S3):
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
Compare the estimated tree and alignment to the true tree and alignment.
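The slide does not say how the two alignments are compared. One common choice, assumed here, is the sum-of-pairs (SP) score: the fraction of residue pairs aligned in the true alignment that are also aligned in the estimated one. The sketch below applies it to the two alignments on this slide; the function names are illustrative.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """Set of residue pairs placed in the same column.

    A residue is identified by (sequence name, index in the unaligned sequence).
    """
    names = sorted(alignment)
    n_cols = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(n_cols):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, next_index[name]))
                next_index[name] += 1
        # Every two residues sharing this column form one homologous pair.
        pairs.update(combinations(present, 2))
    return pairs

def sp_score(true_aln, est_aln):
    """Fraction of true homologous pairs recovered by the estimated alignment."""
    true_pairs = homologous_pairs(true_aln)
    est_pairs = homologous_pairs(est_aln)
    return len(true_pairs & est_pairs) / len(true_pairs)

true_aln = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
est_aln = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-C--T-----GACCGC--",
    "S4": "T---C-A-CGACCGA----CA",
}
print(sp_score(true_aln, est_aln))
```

The score is 1.0 only when every true pair is recovered. Here the S1 and S2 rows agree in both alignments, while S3 and S4 are placed differently, so the score falls strictly between 0 and 1.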

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Multiple Sequence Alignment (MSA): another grand challenge (1)
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
...
Sn = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
...
Sn = -------TCAC--GACCGACA
- Novel techniques needed for scalability and accuracy
- NP-hard problems and large datasets
- Current methods do not provide good accuracy
- Few methods can analyze even moderately large datasets
- Many important applications besides phylogenetic estimation
(1) Frontiers in Massive Data Analysis, National Academies Press, 2013

Major Challenges Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements) Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

(Phylogenetic estimation from whole genomes)

Species Tree Estimation requires multiple genes!
(Figure: species tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.)

Two basic approaches for species tree estimation
- Concatenate ("combine") sequence alignments for different genes, and run phylogeny estimation methods
- Compute trees on individual genes and combine gene trees

Using multiple genes
gene 1:
S1 TCTAATGGAA
S2 GCTAAGGGAA
S3 TCTAAGGGAA
S4 TCTAACGGAA
S7 TCTAATGGAC
S8 TATAACGGAA
gene 2:
S4 GGTAACCCTC
S5 GCTAAACCTC
S6 GGTGACCATC
S7 GCTAAACCTC
gene 3:
S1 TATTGATACA
S3 TCTTGATACC
S4 TAGTGATGCA
S7 TAGTGATGCA
S8 CATTCATACC

Concatenation

      gene 1      gene 2      gene 3
S1    TCTAATGGAA  ??????????  TATTGATACA
S2    GCTAAGGGAA  ??????????  ??????????
S3    TCTAAGGGAA  ??????????  TCTTGATACC
S4    TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5    ??????????  GCTAAACCTC  ??????????
S6    ??????????  GGTGACCATC  ??????????
S7    TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
S8    TATAACGGAA  ??????????  CATTCATACC
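The concatenated matrix can be built mechanically: for each gene, append a taxon's sequence if it has one and a run of `?` of the same length if it does not. The sketch below uses the sequences from this slide; the function name `concatenate` and the dictionary layout are illustrative choices.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate per-gene alignments, padding absent taxa with missing data."""
    taxa = sorted(set().union(*gene_alignments))
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in gene_alignments:
        n_sites = len(next(iter(aln.values())))
        for taxon in taxa:
            # A taxon without this gene gets a block of '?' characters.
            supermatrix[taxon] += aln.get(taxon, missing * n_sites)
    return supermatrix

genes = [
    {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
     "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"},
    {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
     "S7": "GCTAAACCTC"},
    {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
     "S7": "TAGTGATGCA", "S8": "CATTCATACC"},
]

supermatrix = concatenate(genes)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

Only S4 and S7 have data for all three genes, so most rows of the result are partly missing data.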

Red gene tree ≠ species tree (green gene tree okay)

Gene Tree Incongruence
Gene trees can differ from the species tree due to:
- Duplication and loss
- Horizontal gene transfer
- Incomplete lineage sorting (ILS)

Incomplete Lineage Sorting (ILS)
Confounds phylogenetic analysis for many groups:
- Hominids
- Birds
- Yeast
- Animals
- Toads
- Fish
- Fungi
There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

Lineage Sorting
- Population-level process, also called the "Multi-species coalescent" (Kingman, 1982)
- Gene trees can differ from species trees due to short times between speciation events or large population size; this is called "Incomplete Lineage Sorting" or "Deep Coalescence".

The Coalescent
(Figure courtesy James Degnan; axis labels: Past, Present.)

Gene tree in a species tree
(Figure courtesy James Degnan.)

Key observation: Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees.
(Figure courtesy James Degnan.)
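The smallest case makes this concrete. For a rooted three-species tree ((A,B),C) whose internal branch has length t in coalescent units, with one sampled lineage per species, the standard result is that the gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two other topologies has probability (1/3)e^(-t). The sketch below encodes that formula; the function names are illustrative.

```python
import math

def gene_tree_probs(t):
    """Gene tree distribution for species tree ((A,B),C), internal branch length t."""
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

def branch_length_from_match_prob(p_match):
    """Invert p = 1 - (2/3) e^(-t) to recover the internal branch length."""
    return -math.log(1.5 * (1.0 - p_match))

for t in (0.1, 1.0, 3.0):
    print(t, gene_tree_probs(t))
```

For any t > 0 the matching topology is the most probable one, so the most frequent gene tree reveals the species tree, and the frequency of the matching topology determines t. This is the three-taxon case of identifiability. As t approaches 0 all three topologies approach probability 1/3, which is the short-branch regime where incomplete lineage sorting is strongest.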

Two competing approaches
(Figure: alignments for gene 1, gene 2, ..., gene k across the species are either combined by Concatenation, or analyzed separately and then combined by a Summary Method.)

Species tree estimation: difficult, even for small datasets!
(Figure: species tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.)

Major Challenges: large datasets, fragmentary sequences
- Multiple sequence alignment: few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
- Gene tree estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
- Species tree estimation: gene tree incongruence makes accurate estimation of the species tree challenging.
- Phylogenetic network estimation: horizontal gene transfer and hybridization require non-tree models of evolution.
- Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Avian Phylogenomics Project
Erich Jarvis (HHMI), MTP Gilbert (Copenhagen), G Zhang (BGI), T. Warnow (UT-Austin), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), plus many many other people
- Approx. 50 species, whole genomes
- 14,000 loci
Challenges:
- Species tree estimation under the multi-species coalescent model, from 14,000 poorly estimated gene trees, all with different topologies (we used "statistical binning")
- Maximum likelihood estimation on a million-site genome-scale alignment: 250 CPU years
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project
G. Ka-Shu Wong (U Alberta), J. Leebens-Mack (U Georgia), N. Wickett (Northwestern), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin), plus many many other people
- Plant Tree of Life based on transcriptomes of ~1200 species
- More than 13,000 gene families (most not single copy)
- First paper: PNAS 2014 (~100 species and ~800 loci)
(Figure label: Gene Tree Incongruence)
Upcoming Challenges (~1200 species, ~400 loci):
- Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL: Mirarab et al. 2014, Mirarab & Warnow 2015)
- Multiple sequence alignment of >100,000 sequences (with lots of fragments!); we will use UPP (Nguyen et al., 2015)

Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Metagenomic data analysis
- NGS data produce fragmentary sequence data
- Metagenomic analyses include unknown species
- Taxon identification: given short sequences, identify the species for each fragment
- Applications: Human Microbiome
- Issues: accuracy and speed
(Mihai Pop, Univ Maryland)

Metagenomic taxon identification
Objective: classify short reads in a metagenomic sample

Possible Indo-European tree (Ringe, Warnow and Taylor 2000) Anatolian Vedic Iranian Greek Italic Celtic Tocharian Armenian Germanic Baltic Slavic Albanian

Perfect Phylogenetic Network for IE (Nakhleh et al., Language 2005) Anatolian Vedic Iranian Greek Italic Celtic Tocharian Armenian Germanic Baltic Slavic Albanian

Grading
- Homework: 25% (one hw dropped)
- Midterm: 40% (March 30)
- Final Project: 25% (due May 6)
- Course Participation: 10%
No final exam.

Homework Assignments
- Homework assignments are listed at http://tandy.cs.illinois.edu/cs581-2017-hw.html and are due at 1 PM (in person or via email).
- Late homeworks have reduced credit and will not be accepted after 48 hours past the deadline.
- You are encouraged to work with others on your homework, but you must write up solutions by yourself and indicate who you worked with on each homework.

Course Schedule
A detailed course schedule is at http://tandy.cs.illinois.edu/cs581-2017-detailedsyllabus.html
This schedule includes material you are expected to have looked at before coming to class:
- assigned reading (from textbook and/or scientific literature)
- PPT and/or PDF of my lecture

Final Project and Class Presentation
- Either a research project (can be with another student) or a survey paper (done by yourself).
- Many interesting and publishable problems to address: see http://tandy.cs.illinois.edu/topics.html
- Your class presentation should be related to your final project.

Academic Integrity
Please see the course website at http://tandy.cs.illinois.edu/581-2017.html and also http://tandy.cs.illinois.edu/ethics.pdf
For this course:
- Examine the policy about collaboration
- Learn and understand what plagiarism is (and then don't do it). This applies to homework, all writing assignments, and the final project.

Course Research Projects
- Evaluating existing methods on simulated and real (biological or linguistic) datasets
- Designing a new method, and establishing its performance (using theory and data)
- Analyzing a biological dataset using several different methods, to address biology

Examples of published course projects
- Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S7.
- T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S11.
- J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". RECOMB-Comparative Genomics and BMC Genomics, 2015, 16(Suppl 10): S2.
- P. Vachaspati and T. Warnow (2016). "FastRFS: Fast and Accurate Robinson-Foulds Supertrees using Constrained Exact Optimization". Bioinformatics 2016; doi: 10.1093/bioinformatics/btw600. (Special issue for papers from RECOMB-CG)

Research Projects you could join
- Phylogenomics projects (Avian and the 1KP)
  - Species tree and network estimation from conflicting genes
  - Large-scale multiple sequence alignment
  - Large-scale maximum likelihood tree estimation
  - Improving gene tree estimation using whole genomes
- Metagenomics (with Mihai Pop, University of Maryland, and Bill Gropp)
  - Identifying genes and taxa from short sequences
  - Metagenomic assembly
  - Applications to clinical diagnostics
- Protein Sequence Analysis (with Jian Peng)
  - What function and structure does this protein have?
  - How did structure and function evolve?
- Historical Linguistics (with Donald Ringe, UPenn)
  - How did Indo-European evolve?
  - Designing and implementing statistical estimation methods for language phylogenies