CS 581 / BIOE 540: Algorithmic Computational Genomics. Tandy Warnow, Departments of Bioengineering and Computer Science, http://tandy.cs.illinois.

CS 581 / BIOE 540: Algorithmic Computational Genomics
Tandy Warnow
Departments of Bioengineering and Computer Science
http://tandy.cs.illinois.edu

Course Details
- Office hours: Tuesdays 12:30-1:30 (Siebel 3235)
- Course webpage: http://tandy.cs.illinois.edu/581-2017.html
- Textbook: Computational Phylogenetics, available for download at http://tandy.cs.illinois.edu/textbook.pdf
- TA: Pranjal Vachaspati (to be confirmed)

Today
- Describe some important problems in computational biology for which students in this course could develop improved methods
- Explain how the course will be run
- Answer questions

This Course
- Topics: computational and statistical problems in sequence analysis (e.g., multiple sequence alignment, phylogeny estimation, and metagenomics).
- Focus: understanding the mathematical foundations, and designing algorithms with outstanding accuracy and speed on large, complex datasets.
- This is not a course about how to use the tools.

Prerequisites
No background in biology is needed. However, the course has the following prerequisites:
- CS 374: computational complexity, algorithm design techniques, and proving theorems about algorithms
- CS 361: probability and statistics
- By recursion, CS 225: programming

If you haven't satisfied the pre-reqs:
- You need permission to stay in the course.
- The first homework is due (by email) on Saturday at 1 PM. See the homework webpage http://tandy.cs.illinois.edu/cs581-2017-hw.html
- Then make an appointment to see me to review the homework.

This course
- Phylogeny estimation based on stochastic models of sequence evolution and genome evolution
- Multiple sequence alignment
- Applications to metagenomics, protein structure prediction, and other biological problems

Species Tree
[Figure: species tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Evolution informs about everything in biology
- Big genome sequencing projects just produce data - so what?
- Evolutionary history relates all organisms and genes, and helps us understand and predict:
  - interactions between genes (genetic networks)
  - drug design
  - predicting functions of genes
  - influenza vaccine development
  - origins and spread of disease
  - origins and migrations of humans

Constructing the Tree of Life: Hard Computational Problems
- NP-hard problems
- Large datasets: 100,000+ sequences, thousands of genes
- Big data complexity: model misspecification, fragmentary sequences, errors in input data, streaming data

Phylogenomic pipeline
- Select taxon set and markers
- Gather and screen sequence data, possibly identify orthologs
- Compute multiple sequence alignments for each locus
- Compute species tree or network:
  - Compute gene trees on the alignments and combine the estimated gene trees, OR
  - Estimate a tree from a concatenation of the multiple sequence alignments
- Get statistical support on each branch (e.g., bootstrapping)
- Estimate dates on the nodes of the phylogeny
- Use species tree with branch support and dates to understand biology

Research Strategies
Improved algorithms through:
- Divide-and-conquer
- Bin-and-conquer
- Iteration
- Bayesian statistics
- Hidden Markov Models
- Graph theory
- Combinatorial optimization
- Statistical modelling
- Massive Simulations
- High Performance Computing

Avian Phylogenomics Project
- Erich Jarvis (HHMI); MTP Gilbert (Copenhagen); G. Zhang (BGI); T. Warnow, S. Mirarab, and Md. S. Bayzid (UT-Austin); plus many, many other people
- Approx. 50 species, whole genomes
- 14,000 loci
- Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project
- G. Ka-Shu Wong (U Alberta); J. Leebens-Mack (U Georgia); N. Wickett (Northwestern); N. Matasci (iPlant); T. Warnow (UIUC); S. Mirarab and N. Nguyen (UT-Austin); plus many, many other people
- Plant Tree of Life based on transcriptomes of ~1200 species
- More than 13,000 gene families (most not single copy)
- First paper: PNAS 2014 (~100 species and ~800 loci)
- Gene Tree Incongruence
- Upcoming Challenges (~1200 species, ~400 loci)

DNA Sequence Evolution
[Figure: a tree of DNA sequences evolving over time]
- -3 mil yrs and -2 mil yrs: AAGGCCT, AAGACTT, TGGACTT
- -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT
- today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Phylogeny Problem
U = AGGGCAT
V = TAGCCCA
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
[Figure: a tree on the leaves X, U, Y, V, W]

Performance criteria
- Running time
- Space
- Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution
- Topological accuracy with respect to the underlying true tree or true alignment, typically studied in simulation
- Accuracy with respect to a particular criterion (e.g., maximum likelihood score), on real data

Quantifying Error
- FN: false negative (missing edge)
- FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with the FN and FP edges marked; 50% error rate]
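The FN and FP rates can be computed by comparing the internal edges (bipartitions) of the two trees. The following is a minimal sketch, assuming both trees have the same leaf set and are written as nested tuples of leaf names; the function names and the five-leaf example are illustrative and are not the trees drawn on the slide.

```python
def leaf_set(tree):
    """Leaves below a node; a leaf is a string, an internal node is a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions (one per internal edge) of an unrooted tree."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaf_set(node)
        # Internal edges separate at least two leaves from at least two others.
        if 2 <= len(below) <= len(all_leaves) - 2:
            # Store both sides so that A|B and B|A compare equal.
            splits.add(frozenset([below, all_leaves - below]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def error_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) of the estimated tree against the true tree."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    fn = len(true_splits - est_splits)  # true edges missing from the estimate
    fp = len(est_splits - true_splits)  # estimated edges not in the true tree
    fn_rate = fn / len(true_splits) if true_splits else 0.0
    fp_rate = fp / len(est_splits) if est_splits else 0.0
    return fn_rate, fp_rate


# Hypothetical five-leaf example (not the trees shown on the slide).
true_tree = (("A", "B"), "C", ("D", "E"))
estimated_tree = (("A", "C"), "B", ("D", "E"))
print(error_rates(true_tree, estimated_tree))
```

In this example each tree has two internal edges and they share only the edge separating D and E from the rest, so one edge is missing and one is incorrect, giving a 50% FN rate and a 50% FP rate.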

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
   [Figure: cost over the space of phylogenetic trees, showing a local optimum and the global optimum]
2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.
3. Bayesian methods

Solving maximum likelihood (and other hard optimization problems) is unlikely
# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
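The small entries of this table can be checked with the double-factorial formula (2n-5)!! for the number of unrooted binary trees on n labelled leaves. This formula is a standard combinatorial fact that the slide itself does not state, and the function name below is illustrative.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary tree topologies on n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # (2n-5)!! = 3 * 5 * 7 * ... * (2n - 5); the product is empty for n = 3.
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in range(4, 11):
    print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the same function can be called for larger n without overflow; the growth is super-exponential, which is why exhaustive search over tree space is infeasible.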

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
- Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!
[Figure: NJ error rate (0 to 0.8) plotted against number of taxa (0 to 1600)]

Major Challenges
- Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

Phylogeny Problem
U = AGGGCAT
V = TAGCCCA
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
[Figure: a tree on the leaves X, U, Y, V, W]

The Real Problem!
U = AGGGCATGA
V = AGAT
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
[Figure: a tree on the leaves U, X, Y, V, W]

[Figure: the sequence ACGGTGCAGTTACCA evolves into ACCAGTCACCTA through a deletion, a substitution, and an insertion]
ACGGTGCAGTTACC-A
AC----CAGTCACCTA
The true multiple alignment
- Reflects historical substitution, insertion, and deletion events
- Defined using transitive closure of pairwise alignments computed on edges of the true tree

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Alignment
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct tree
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree on the leaves S1, S2, S4, S3]

Two-phase estimation
Alignment methods:
- Clustal
- POY (and POY*)
- Probcons (and Probtree)
- Probalign
- MAFFT
- Muscle
- Di-align
- T-Coffee
- Prank (PNAS 2005, Science 2008)
- Opal (ISMB and Bioinf. 2007)
- FSA (PLoS Comp. Bio. 2009)
- Infernal (Bioinf. 2009)
- Etc.
Phylogeny methods:
- Bayesian MCMC
- Maximum parsimony
- Maximum likelihood
- Neighbor joining
- FastME
- UPGMA
- Quartet puzzling
- Etc.

Simulation Studies
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: true tree on the leaves S1, S2, S4, S3]
Estimated tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
[Figure: estimated tree on the leaves S1, S4, S2, S3]
Compare the estimated tree and alignment against the true tree and alignment.
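One common way to carry out the alignment half of this comparison is to count how many pairs of residues that share a column in the true alignment also share a column in the estimated alignment (often called the sum-of-pairs or SP-score). The slide does not name a specific measure, so the sketch below is an assumption about how the comparison could be done; the function names are illustrative, and the data are the two alignments from the slide.

```python
def homologous_pairs(alignment):
    """Pairs of residues placed in the same column.

    alignment maps a sequence name to its gapped row. A residue is identified
    by (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    columns = {}
    for name in names:
        row = alignment[name]
        if len(row) != length:
            raise ValueError("all rows must have the same length")
        position, cols = 0, []
        for ch in row:
            if ch == "-":
                cols.append(None)       # gap: no residue in this column
            else:
                cols.append(position)   # index within the ungapped sequence
                position += 1
        columns[name] = cols
    pairs = set()
    for col in range(length):
        present = [(n, columns[n][col]) for n in names
                   if columns[n][col] is not None]
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def sp_score(true_alignment, estimated_alignment):
    """Fraction of true homologous pairs recovered by the estimate."""
    true_pairs = homologous_pairs(true_alignment)
    est_pairs = homologous_pairs(estimated_alignment)
    return len(true_pairs & est_pairs) / len(true_pairs)


true_alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
estimated_alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-C--T-----GACCGC--",
    "S4": "T---C-A-CGACCGA----CA",
}
print(sp_score(true_alignment, estimated_alignment))
```

The score is 1.0 when the estimated alignment reproduces every true homology and falls toward 0 as more true pairs are missed. Here the rows for S3 and S4 differ between the two alignments, so the score is strictly below 1.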

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Multiple Sequence Alignment (MSA): another grand challenge [1]
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
...
Sn = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
...
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy:
- NP-hard problems and large datasets
- Current methods do not provide good accuracy
- Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013

Major Challenges
- Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
- Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

(Phylogenetic estimation from whole genomes)

Species Tree Estimation requires multiple genes!
[Figure: species tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Two basic approaches for species tree estimation
- Concatenate ("combine") sequence alignments for different genes, and run phylogeny estimation methods
- Compute trees on individual genes and combine gene trees

Using multiple genes
gene 1
S1 TCTAATGGAA
S2 GCTAAGGGAA
S3 TCTAAGGGAA
S4 TCTAACGGAA
S7 TCTAATGGAC
S8 TATAACGGAA
gene 2
S4 GGTAACCCTC
S5 GCTAAACCTC
S6 GGTGACCATC
S7 GCTAAACCTC
gene 3
S1 TATTGATACA
S3 TCTTGATACC
S4 TAGTGATGCA
S7 TAGTGATGCA
S8 CATTCATACC

Concatenation
     gene 1      gene 2      gene 3
S1   TCTAATGGAA  ??????????  TATTGATACA
S2   GCTAAGGGAA  ??????????  ??????????
S3   TCTAAGGGAA  ??????????  TCTTGATACC
S4   TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5   ??????????  GCTAAACCTC  ??????????
S6   ??????????  GGTGACCATC  ??????????
S7   TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
S8   TATAACGGAA  ??????????  CATTCATACC
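The concatenation step shown here can be sketched in a few lines: each species gets its per-gene rows joined end to end, and a gene in which the species is absent is filled with "?" for missing data. This is a minimal sketch using the sequences from the slide; the function and variable names are illustrative.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments is a list of dicts mapping species name to aligned row.
    Species absent from a gene are padded with the missing-data character.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    rows = {taxon: "" for taxon in taxa}
    for aln in gene_alignments:
        # All rows of one gene alignment have the same length.
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            rows[taxon] += aln.get(taxon, missing * length)
    return rows


gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
         "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "TAGTGATGCA", "S8": "CATTCATACC"}

supermatrix = concatenate([gene1, gene2, gene3])
for taxon, row in supermatrix.items():
    print(taxon, row)
```

Only S4 and S7 are present in all three genes, so every other row of the supermatrix contains at least one block of missing data.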

Red gene tree ≠ species tree (green gene tree okay)

Gene Tree Incongruence
Gene trees can differ from the species tree due to:
- Duplication and loss
- Horizontal gene transfer
- Incomplete lineage sorting (ILS)

Incomplete Lineage Sorting (ILS)
- Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi
- There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

Lineage Sorting
- Population-level process, also called the Multi-species coalescent (Kingman, 1982)
- Gene trees can differ from species trees due to short times between speciation events or large population size; this is called Incomplete Lineage Sorting or Deep Coalescence.

The Coalescent
[Figure: lineages traced from Present back to Past. Courtesy James Degnan]

Gene tree in a species tree
[Figure courtesy James Degnan]

Key observation: Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees
[Figure courtesy James Degnan]
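The smallest case makes this observation concrete. For a rooted three-species tree ((A,B),C) with one sampled lineage per species and an internal branch of length t in coalescent units, a standard result (not stated on the slide) is that the gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two other topologies has probability (1/3)e^(-t). The sketch below computes this distribution and inverts it; the function names are illustrative.

```python
import math


def triplet_gene_tree_probs(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


def branch_length_from_match_prob(p_match):
    """Invert the formula: recover t from the probability of the matching topology."""
    if not (1.0 / 3.0 <= p_match < 1.0):
        raise ValueError("p_match must lie in [1/3, 1)")
    return -math.log(1.5 * (1.0 - p_match))


for t in (0.1, 1.0, 3.0):
    print(t, triplet_gene_tree_probs(t))
```

For any t > 0 the matching topology is strictly the most probable, so the species tree topology can be read off the distribution, and the inverse function recovers the branch length. Short branches (small t) push all three probabilities toward 1/3, which is the regime where incomplete lineage sorting is most severe.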

Two competing approaches
[Figure: alignments for gene 1, gene 2, ..., gene k across the species are either combined by Concatenation, or analyzed separately and then combined by a Summary Method]

Species tree estimation: difficult, even for small datasets!
[Figure: species tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Major Challenges: large datasets, fragmentary sequences
- Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
- Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
- Species Tree Estimation: gene tree incongruence makes accurate estimation of the species tree challenging.
- Phylogenetic Network Estimation: Horizontal gene transfer and hybridization require non-tree models of evolution.
Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Avian Phylogenomics Project
- Erich Jarvis (HHMI); MTP Gilbert (Copenhagen); G. Zhang (BGI); T. Warnow, S. Mirarab, and Md. S. Bayzid (UT-Austin); plus many, many other people
- Approx. 50 species, whole genomes
- 14,000 loci
Challenges:
- Species tree estimation under the multi-species coalescent model, from 14,000 poorly estimated gene trees, all with different topologies (we used "statistical binning")
- Maximum likelihood estimation on a million-site genome-scale alignment: 250 CPU years
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project
- G. Ka-Shu Wong (U Alberta); J. Leebens-Mack (U Georgia); N. Wickett (Northwestern); N. Matasci (iPlant); T. Warnow (UIUC); S. Mirarab and N. Nguyen (UT-Austin); plus many, many other people
- Plant Tree of Life based on transcriptomes of ~1200 species
- More than 13,000 gene families (most not single copy)
- First paper: PNAS 2014 (~100 species and ~800 loci)
- Gene Tree Incongruence
Upcoming Challenges (~1200 species, ~400 loci):
- Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL: Mirarab et al. 2014, Mirarab & Warnow 2015)
- Multiple sequence alignment of >100,000 sequences (with lots of fragments!); we will use UPP (Nguyen et al., 2015)

Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Metagenomic data analysis
- NGS data produce fragmentary sequence data
- Metagenomic analyses include unknown species
- Taxon identification: given short sequences, identify the species for each fragment
- Applications: Human Microbiome
- Issues: accuracy and speed
Mihai Pop, Univ Maryland

Metagenomic taxon identification
Objective: classify short reads in a metagenomic sample

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)
[Figure: tree on Anatolian, Vedic, Iranian, Greek, Italic, Celtic, Tocharian, Armenian, Germanic, Baltic, Slavic, Albanian]

Perfect Phylogenetic Network for IE (Nakhleh et al., Language 2005)
[Figure: network on Anatolian, Vedic, Iranian, Greek, Italic, Celtic, Tocharian, Armenian, Germanic, Baltic, Slavic, Albanian]

Grading
- Homework: 25% (one hw dropped)
- Midterm: 40% (March 30)
- Final Project: 25% (due May 6)
- Course Participation: 10%
No final exam.

Homework Assignments
- Homework assignments are listed at http://tandy.cs.illinois.edu/cs581-2017-hw.html and are due at 1 PM (in person or via email).
- Late homeworks have reduced credit and will not be accepted after 48 hours past the deadline.
- You are encouraged to work with others on your homework, but you must write up solutions by yourself and indicate who you worked with on each homework.

Course Schedule
A detailed course schedule is at http://tandy.cs.illinois.edu/cs581-2017-detailedsyllabus.html
This schedule includes material you are expected to have looked at before coming to class:
- assigned reading (from textbook and/or scientific literature)
- PPT and/or PDF of my lecture

Final Project and Class Presentation
- Either research project (can be with another student) or survey paper (done by yourself).
- Many interesting and publishable problems to address: see http://tandy.cs.illinois.edu/topics.html
- Your class presentation should be related to your final project.

Academic Integrity
Please see the course website at http://tandy.cs.illinois.edu/581-2017.html and also http://tandy.cs.illinois.edu/ethics.pdf
For this course:
- Examine the policy about collaboration
- Learn and understand what plagiarism is (and then don't do it). This applies to homework, all writing assignments, and the final project.

Course Research Projects
- Evaluating existing methods on simulated and real (biological or linguistic) datasets
- Designing a new method, and establishing its performance (using theory and data)
- Analyzing a biological dataset using several different methods, to address biology

Examples of published course projects
- Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S7.
- T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S11.
- J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". RECOMB-Comparative Genomics and BMC Genomics 2015, 16(Suppl 10): S2.
- P. Vachaspati and T. Warnow (2016). "FastRFS: Fast and Accurate Robinson-Foulds Supertrees using Constrained Exact Optimization". Bioinformatics 2016; doi: 10.1093/bioinformatics/btw600. (Special issue for papers from RECOMB-CG)

Research Projects you could join
- Phylogenomics projects (Avian and the 1KP)
  - Species tree and network estimation from conflicting genes
  - Large-scale multiple sequence alignment
  - Large-scale maximum likelihood tree estimation
  - Improving gene tree estimation using whole genomes
- Metagenomics (with Mihai Pop, University of Maryland, and Bill Gropp)
  - Identifying genes and taxa from short sequences
  - Metagenomic assembly
  - Applications to clinical diagnostics
- Protein Sequence Analysis (with Jian Peng)
  - What function and structure does this protein have?
  - How did structure and function evolve?
- Historical Linguistics (with Donald Ringe, UPenn)
  - How did Indo-European evolve?
  - Designing and implementing statistical estimation methods for language phylogenies