CSCI 4181 / CSCI 6802 Algorithms in Bioinformatics

"In science there is only physics; all the rest is stamp collecting." - Ernest Rutherford

Was I a stamp collector? (Alan Turing)

Inte säkert, men vad om mig???* (Carolus Linnaeus)
* Not sure, but what about me?

More generally: why is Ernest Rutherford calling biology non-science?
Science
- Whether Turing was a scientist depends on your definition of science
- Mathematics and theoretical CS do not typically use the scientific method, and are not falsifiable in the sense that is normally applied today

And what about biology?
- There are stamp-collecting elements (and phases) in biology
- But how could this be otherwise?
- And there is always some notion of an underlying theory
Disciplines of Biology
Developmental biology
Richardson M K, Hanken J, Selwood L, Wright G M, Richards R J, Pieau C, Raynaud A. Haeckel, embryos, and evolution. Science. 1998; 280: 983-984.
Molecular Biology - DNA, RNA, proteins and other molecules in the cell
Ecology - Interactions between organisms and the environment (including other organisms!) http://www.absc.usgs.gov/research/seabird_foragefish/marinehabitat/home.html
Evolution - Changes that occur in living systems through time (From http://www.nbii.gov/)
Stamp collecting perceptions
- Because biology is horrendously (er, amazingly) complex!
- Data collection can be tricky too (although we can concede the same for physics)

Lesser Rutherford: "All science is either physics or stamp collecting."

Before we Leave Rutherford
"If your result needs a statistician then you should design a better experiment."
- In general, biological experiments have too many random factors and uncontrolled variables to give neat results
- So we need STATISTICS to test hypotheses about the natural world. And statistics alone aren't enough (of which more later)
- Many advances in the field of statistics over the last 100 years have been driven by biological (ecological and molecular) questions

One More
"We don't have the money, so we have to think." (Speaking about the experiments he carried out)
For our purposes:
- money: infinite CPU cycles
- think: design efficient experiments ("efficient" = data set selection + algorithms)
Example: the Global Ocean Survey (2004-2006)
- Over 6,000,000 protein sequences identified from the first phase alone
- All-versus-all comparisons using BLAST: >10^6 CPU hours
- Clustering with CD-HIT: about 500 hours on one machine
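To see why all-versus-all comparison is so expensive, it helps to count the pairs. The sketch below computes the number of unordered sequence pairs, n(n-1)/2. The function name and the throughput figure are assumptions made only for illustration; the throughput is not a measured BLAST speed.

```python
def all_vs_all_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def cpu_hours(pairs: int, comparisons_per_cpu_second: float) -> float:
    """CPU hours needed at an assumed, constant comparison rate."""
    return pairs / comparisons_per_cpu_second / 3600


n_sequences = 6_000_000
pairs = all_vs_all_pairs(n_sequences)
print(f"{n_sequences:,} sequences -> {pairs:,} pairwise comparisons")

# Hypothetical rate, chosen only to illustrate the scale of the problem.
assumed_rate = 5_000
print(f"about {cpu_hours(pairs, assumed_rate):,.0f} CPU hours at {assumed_rate:,} comparisons/s")
```

The pair count grows quadratically: doubling the number of sequences roughly quadruples the work, which is why heuristics and careful data set selection matter.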
- Now, we would rather use BLAST than CD-HIT*
- But we would rather use Smith-Waterman (better, slower) than BLAST!
- But we CANNOT use BLAST or S-W if we want to compare this huge dataset against everything else we know about
* except we shouldn't, because it has some pretty awful bugs
Example 2: Cancer genomics
[Figure labels: Human genome, Bacterial genome, Typical gene. Source: Stratton et al., Nature (2009)]
- Project to sequence "the human genome": 1990-2003 (ish), $3B
- Current cost of sequencing a human genome: $10,000
- Storage requirements for human genome in plain text: 1.5 GB
"It will soon be cheaper to resequence a nucleotide [i.e., a "letter"] of DNA than to store it" - Francis Ouellette, 2011
Bioinformatics
- The development and use of computational and statistical methods to manage and analyze biological data
- Biological data most often means molecular biological (DNA, protein) data, but the discipline is broader than this, and blurs into ecology, physiology and other disciplines

Algorithms in Bioinformatics
Instructor: Dr. Robert Beiko
Schedule: 10:35-11:55, Mondays and Wednesdays
Location: Mona Campbell #1107 (tutorials to be determined)

Purpose
- Identify the key DATA TYPES in the biological domain
- Introduce the KEY QUESTIONS we want to ask of these data
- Examine representative ALGORITHMS for biological data analysis
- Consider the use of appropriate STATISTICAL MODELS of biology
- Think about the TRADEOFFS between exhaustive analysis and efficient heuristics
Component                     Undergraduate   Graduate
Tutorial 1                    15%             10%
Tutorial 2                    15%             10%
Tutorial 3                    15%             10%
Tutorial 4                    15%             10%
Project: Proposal             10%             10%
Project: Literature review    N/A             10%
Project: Methods              10%             10%
Project: Oral presentation    N/A             10%
Project: Final report         20%             20%
Critical Skills
Data acquisition from online sources. Examples:
- National Center for Biotechnology Information (ncbi.nlm.nih.gov)
- US Department of Energy Joint Genome Institute (jgi.doe.gov)

Critical Skills
Abstractions of biological data. For instance:
- Evolutionary relationships as trees and graphs
- Biological sequences as strings
- Related sequences as matrices
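The three abstractions listed above map onto ordinary data structures. The following is a minimal sketch; the sequences, the taxon names, and the helper `leaves` are made up for illustration and do not come from any real data set.

```python
# A biological sequence as a string.
sequence = "ATGCGTTACTTC"

# Related (aligned) sequences as a matrix:
# each row is a sequence, each column is an aligned position.
alignment = [
    "ATGCGT",
    "ATGCAT",
    "ATGAGT",
]
matrix = [list(row) for row in alignment]
column_3 = [row[3] for row in matrix]  # residues at aligned position 3

# Evolutionary relationships as a tree, here as nested tuples.
# The taxa are hypothetical placeholders.
tree = (("human", "chimp"), "mouse")


def leaves(node):
    """Return the leaf labels of a nested-tuple tree, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node:
        result.extend(leaves(child))
    return result


print(column_3)
print(leaves(tree))
```

Real analyses typically use richer representations (for example, trees with branch lengths), but the underlying idea is the same.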
Critical Skills
Use and understand different methods
- How much accuracy do we lose when we choose heuristic vs. exact methods?
- Do different methods treat biological data in more-or-less appropriate ways?
- Model-based vs. model-free methods (and differences among models)

Critical Skills
The assessment of statistically significant differences between data sets
- Parametric vs. non-parametric tests
- Assumptions of different tests
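As one illustration of a non-parametric test, the sketch below runs an exact two-sided permutation test on the difference in group means. It assumes only that group labels are exchangeable under the null hypothesis, and makes no assumption of normality. The function name and the data are hypothetical, and enumerating every relabeling is only feasible for small samples.

```python
from itertools import combinations
from statistics import mean


def permutation_test(a, b):
    """Exact two-sided permutation test on the difference in means.

    Returns the fraction of relabelings whose absolute mean difference
    is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Made-up measurements for two small groups.
group_1 = [1, 2, 3, 4]
group_2 = [11, 12, 13, 14]
print(permutation_test(group_1, group_2))
```

A parametric alternative such as a t-test would additionally assume approximately normal data within each group; checking that kind of assumption is part of choosing a test.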
Un-Critical Skills
Programming / Scripting
- File format conversions
- Automation: repeat analysis of many data sets
- Simple string processing and extraction
Commonly used tools
- Perl (including BioPerl)
- Python (ditto BioPython)
- C/C++/Java
Not essential! But very helpful

BUT Everything in Context
- We will approach all of this in an APPLIED way
- You will learn it when you need to know it, and understand why it is relevant
THE PROJECT (Data, Method(s), Interpretation)
Can play to your background strengths, but should show what you've learned

THE PROJECT
Data: choose an interesting data set

THE PROJECT
Method(s): apply one or more methods (possibly with modifications)

THE PROJECT
Interpretation: compare the results obtained for different data sets or methods
THE PROJECT
I can help point you in the right direction, but I encourage you to share ideas and resources. For instance:
- "How do I do a t-test in R?"
- "My for loop isn't working!"
- "These results make no sense!"
References
No textbook per se
- Different texts address different parts of the course
- Textbooks are out-of-date as soon as they appear!
- Some information will be given as handouts
- See syllabus for recommendations

References
Scientific publications
- Particularly when we look at specific methods in depth
Course Overview
- Three modules (about one month each), each illustrating a different challenge in bioinformatics and different solutions
- Four tutorials: get your hands into it
The three modules are:
- BIOLOGICAL SEQUENCE CLASSIFICATION
- SEQUENCE ALIGNMENT
- PHYLOGENETIC ANALYSIS
Module 1: Sequence classification
Sequences -> a bunch of numbers
A bunch of numbers -> insight, via:
- Decision trees
- Statistical classification
- Artificial neural networks
- Support vector machines
Module 2: Sequence alignment
- Types of alignment problems
- Dynamic programming
- Hidden Markov models
- Heuristics
- Variations: Bayesian, progressive, graph-based approaches

Module 3: Phylogenetic analysis
- Distance matrix methods
- Character-based methods
- Searching vs. sampling tree space
- Statistical support
Organisms, Genomes, Sequences, and so on: Life at Different Resolutions

Essential properties of an organism: Reproduction
- Sexual: Tetrahymena thermophila (www.isleepinadrawer.com)
- Asexual: Amoeba proteus (www.teachnet.ie)

Essential properties of an organism: Cellularity
- Unicellular: Treponema pallidum (www.teachersource.com)
- Multicellular: Caenorhabditis elegans (959 cells) (www.ucl.ac.uk)

Essential properties of an organism: Biochemical processes
- Fermentation
- Antibiotic synthesis
The capacity to do all of these things comes from the GENOME of an organism
Genome = the complete set of genetic material (DNA for all known organisms)

Prokaryotes / Eukaryotes (image: espacial.org)

The Human Genome
- 23 linear chromosomes
- ~3 billion DNA residues
- ~20,000 genes (controversy!)

Escherichia coli strain K12
- 1 circular chromosome, 2 plasmids
- ~5.6 million DNA residues
- 5326 genes

Genes on the main chromosome
Gene order
The DNA sequence of a gene
5'- ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA -3'
3'- TAC GCA ATG AAG CTT TAC CGT TGG GTG AGC CCC TGA AGG AGG TTG CCA ACT -5'
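The two strands on this slide are related by base pairing (A with T, C with G). A minimal sketch of computing the complementary strand and the reverse complement follows; the function names are illustrative, not taken from any particular library.

```python
# Translation table for Watson-Crick base pairing.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Complement each base, keeping the original left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """The opposite strand, read in its own 5'-to-3' direction."""
    return complement(seq)[::-1]


# Top strand from the slide, written 5' to 3' (spaces separate codons).
top = "ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA".replace(" ", "")
# Bottom strand from the slide, written 3' to 5'.
bottom = "TAC GCA ATG AAG CTT TAC CGT TGG GTG AGC CCC TGA AGG AGG TTG CCA ACT".replace(" ", "")

print(complement(top) == bottom)
print(reverse_complement(top))
```

Because the slide writes the bottom strand 3' to 5', it matches `complement(top)` directly; `reverse_complement(top)` gives the same strand in the conventional 5'-to-3' orientation.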
DNA to protein sequence
ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA
 M   R   Y   F   E   M   A   T   H   S   G   T   S   S   N   G   *

Protein sequence and structure
M R Y F E M A T H S G T S S N G *
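Translation from DNA to protein can be sketched as a codon-by-codon table lookup. The example below builds the standard genetic code from its usual compact listing (bases ordered T, C, A, G) and translates the gene shown on the slides. The function name is illustrative. Note that the second codon, CGT, encodes arginine (R) in the standard code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon; '*' marks a stop codon.

    Assumes the sequence starts in frame; trailing bases that do not
    fill a complete codon are ignored.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


gene = "ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA".replace(" ", "")
print(translate(gene))
```

This sketch translates a single reading frame of one strand; real gene finding also has to consider the other frames and the opposite strand.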
Metabolism
Proteins working together

Pathways (metabolism + self-replication + signalling) =

Communities of organisms
http://www.noaanews.noaa.gov/stories2006/s2644.htm