CSCI 4181 / CSCI 6802 Algorithms in Bioinformatics


"In science there is only physics; all the rest is stamp collecting." - Ernest Rutherford

"Was I a stamp collector?" - Alan Turing

"Not sure, but what about me???" - Carolus Linnaeus

More generally: why is Ernest Rutherford calling biology non-science?

Science
Whether Turing was a scientist depends on your definition of science.
Mathematics and theoretical CS do not typically use the scientific method, and are not falsifiable in the sense that is normally applied today.

And what about biology?
There are stamp-collecting elements (and phases) in biology.
But how could this be otherwise?
And there is always some notion of an underlying theory.

Disciplines of Biology
Developmental biology
Richardson M K, Hanken J, Selwood L, Wright G M, Richards R J, Pieau C, Raynaud A. Haeckel, embryos, and evolution. Science. 1998; 280: 983-984.

Molecular Biology - DNA, RNA, proteins and other molecules in the cell

Ecology - Interactions between organisms and the environment (including other organisms!)
http://www.absc.usgs.gov/research/seabird_foragefish/marinehabitat/home.html

Evolution - Changes that occur in living systems through time
From http://www.nbii.gov/

Stamp collecting perceptions
Because biology is horrendously (er, amazingly) complex!
Data collection can be tricky too (although we can concede the same for physics)

Lesser Rutherford: "All science is either physics or stamp collecting."

Before we Leave Rutherford
"If your result needs a statistician then you should design a better experiment."

In general, biological experiments have too many random factors and uncontrolled variables to give neat results.
So we need STATISTICS to test hypotheses about the natural world.
And statistics alone aren't enough (of which more later).
Many advances in the field of statistics over the last 100 years have been driven by biological (ecological and molecular) questions.

One More
"We don't have the money, so we have to think" (speaking about the experiments he carried out)
For our purposes:
  money: infinite CPU cycles
  think: design efficient experiments ("efficient" = data set selection + algorithms)

Example: the Global Ocean Survey (2004-2006)

Over 6,000,000 protein sequences identified from the first phase alone
All-versus-all comparisons using BLAST: >10^6 CPU hours
Clustering with CD-HIT: about 500 hours on one machine
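
To see why all-versus-all comparison gets expensive, it helps to count the pairs. The sketch below is only back-of-the-envelope arithmetic: the function names are made up for illustration, and spreading the quoted CPU hours evenly over every pair is a simplifying assumption (BLAST does not literally score each pair in turn).

```python
def all_vs_all_pairs(n_sequences: int) -> int:
    """Count unordered pairs of distinct sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def implied_seconds_per_pair(cpu_hours: float, n_sequences: int) -> float:
    """Average CPU seconds per pair if cpu_hours were spread evenly over all pairs."""
    return cpu_hours * 3600 / all_vs_all_pairs(n_sequences)


n = 6_000_000  # sequence count quoted on the slide
print(f"{all_vs_all_pairs(n):,} pairs")
print(f"{implied_seconds_per_pair(1e6, n):.1e} s per pair at 10^6 CPU hours")
```

The number of pairs grows quadratically, so doubling the data set roughly quadruples the work for any method that looks at every pair.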

Now, we would rather use BLAST than CD-HIT*
But we would rather use Smith-Waterman (better, slower) than BLAST!
But we CANNOT use BLAST or S-W if we want to compare this huge dataset against everything else we know about
* except we shouldn't, because it has some pretty awful bugs

Example 2: Cancer genomics
[Figure labels: Human genome, Bacterial genome, Typical gene] Stratton et al., Nature (2009)

Project to sequence "the human genome": 1990-2003 (ish), $3B
Current cost of sequencing a human genome: $10,000
Storage requirements for human genome in plain text: 1.5 GB
"It will soon be cheaper to resequence a nucleotide [i.e., a "letter"] of DNA than to store it" - Francis Ouellette, 2011

Bioinformatics
The development and use of computational and statistical methods to manage and analyze biological data.
"Biological data" most often means molecular biological (DNA, protein) data, but the discipline is broader than this, and blurs into ecology, physiology and other disciplines.

Algorithms in Bioinformatics
Instructor: Dr. Robert Beiko
Schedule: 10:35-11:55, Mondays and Wednesdays
Location: Mona Campbell #1107 (tutorials to be determined)

Purpose
Identify the key DATA TYPES in the biological domain
Introduce the KEY QUESTIONS we want to ask of these data
Examine representative ALGORITHMS for biological data analysis
Consider the use of appropriate STATISTICAL MODELS of biology
Think about the TRADEOFFS between exhaustive analysis and efficient heuristics

Component                     Undergraduate   Graduate
Tutorial 1                    15%             10%
Tutorial 2                    15%             10%
Tutorial 3                    15%             10%
Tutorial 4                    15%             10%
Project: Proposal             10%             10%
Project: Literature review    N/A             10%
Project: Methods              10%             10%
Project: Oral presentation    N/A             10%
Project: Final report         20%             20%

Critical Skills
Data acquisition from online sources
Examples:
  National Center for Biotechnology Information (ncbi.nlm.nih.gov)
  US Department of Energy Joint Genome Institute (jgi.doe.gov)

Critical Skills
Abstractions of biological data
For instance:
  Evolutionary relationships as trees and graphs
  Biological sequences as strings
  Related sequences as matrices
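
As a rough illustration of these three abstractions, here is one way they might look in plain Python. The taxon names, the toy alignment, and the helper functions are all placeholders invented for this sketch; real tools use richer data structures.

```python
# A biological sequence as a string.
sequence = "ATGCGTTACTTC"

# Evolutionary relationships as a graph: each internal node maps to its children.
# Node and taxon names are placeholders.
tree = {
    "root": ["ancestor_1", "taxon_C"],
    "ancestor_1": ["taxon_A", "taxon_B"],
}


def leaves(tree, node="root"):
    """Return the leaf names below a node, left to right."""
    children = tree.get(node, [])
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


# Related (aligned) sequences as a matrix: rows are sequences, columns are sites.
alignment = ["ATGCGT", "ATGCAT", "ATGAGT"]


def column(alignment, index):
    """Return one alignment column (the same site across all sequences)."""
    return [row[index] for row in alignment]


print(leaves(tree))
print(column(alignment, 4))
```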

Critical Skills
Use and understand different methods
How much accuracy do we lose when we choose heuristic rather than exact methods?
Do different methods treat biological data in more-or-less appropriate ways?
Model-based vs. model-free methods (and differences among models)

Critical Skills
The assessment of statistically significant differences between data sets
Parametric vs. non-parametric tests
Assumptions of different tests
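
One example of a non-parametric test is a permutation test on the difference in means, which makes no assumption about the shape of the underlying distributions (a t-test would be a parametric counterpart). The sketch below is a minimal version using only the standard library; the function name, the default number of permutations, and the fixed seed are choices made for illustration.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


group_1 = [1, 2, 3, 4, 5]
group_2 = [101, 102, 103, 104, 105]
print(permutation_test(group_1, group_2, n_permutations=2000))
```

The p-value is the fraction of label shufflings that produce a difference at least as large as the observed one, so it depends on the data alone rather than on a distributional model.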

Un-Critical Skills
Programming / Scripting
  File format conversions
  Automation: repeat analysis of many data sets
  Simple string processing and extraction
Commonly used tools
  Perl (including BioPerl)
  Python (ditto Biopython)
  C/C++/Java
Not essential! But very helpful
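
As an example of the "simple string processing and extraction" mentioned above, here is a minimal FASTA reader in plain Python. It is a sketch that assumes well-formed input and keeps only the first word of each header line; the record names in the example are made up, and libraries such as Biopython provide more robust parsers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


example = """>seq1 hypothetical example record
ATGCGT
TACTTC
>seq2
GGGACT
"""
print(parse_fasta(example))
```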

BUT Everything in Context
We will approach all of this in an APPLIED way
You will learn it when you need to know it, and understand why it is relevant

THE PROJECT
Can play to your background strengths, but should show what you've learned
[Diagram: Data, Method(s), Interpretation]

THE PROJECT
Data: Choose an interesting data set

THE PROJECT
Methods: Apply one or more methods (possibly with modifications)

THE PROJECT
Interpretation: Compare the results obtained for different data sets or methods

THE PROJECT
I can help point you in the right direction, but I encourage you to share ideas and resources. For instance:
  How do I do a t-test in R?
  My for loop isn't working!
  These results make no sense!

References
No textbook per se
Different texts address different parts of the course
Textbooks are out-of-date as soon as they appear!
Some information will be given as handouts
See syllabus for recommendations

References
Scientific publications, particularly when we look at specific methods in depth

Course Overview
Three modules (about one month each), each illustrating a different challenge in bioinformatics and different solutions
Four tutorials: get your hands into it
The three modules are:
  BIOLOGICAL SEQUENCE CLASSIFICATION
  SEQUENCE ALIGNMENT
  PHYLOGENETIC ANALYSIS

Module 1: Sequence classification
Sequences -> A bunch of numbers
A bunch of numbers -> insight, via:
  Decision trees
  Statistical classification
  Artificial neural networks
  Support vector machines
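
One common way to turn a sequence into "a bunch of numbers" is to count its k-mers (substrings of length k). The sketch below is an illustrative choice of features, not necessarily the one used in the course; the function name and defaults are assumptions.

```python
from itertools import product


def kmer_vector(sequence, k=2, alphabet="ACGT"):
    """Return k-mer counts as a fixed-length list, ordered by k-mer."""
    kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # ignore k-mers with letters outside the alphabet
            counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]


vector = kmer_vector("ATGCGTTACTTC", k=2)
print(len(vector), sum(vector))
```

Every sequence maps to a vector of the same length (4^k for DNA), which is the shape of input that decision trees, neural networks, and support vector machines expect.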

Module 2: Sequence alignment
Types of alignment problems
Dynamic programming
Hidden Markov models
Heuristics
Variations: Bayesian, progressive, graph-based approaches
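
To give a flavour of dynamic programming for alignment, here is a minimal sketch that computes the Smith-Waterman local alignment score of two sequences. The scoring values (match, mismatch, and a linear gap penalty) are arbitrary assumptions chosen for illustration, and the sketch returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution, # align a[i-1] with b[j-1]
                previous[j] + gap,              # gap in b
                current[j - 1] + gap,           # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

The table has len(a) x len(b) cells, which is why exact alignment is slower than heuristics such as BLAST on large data sets.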

Module 3: Phylogenetic analysis
Distance matrix methods
Character-based methods
Searching vs. sampling tree space
Statistical support
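
Distance matrix methods start from pairwise distances between sequences. A minimal sketch, assuming the sequences are already aligned to the same length, uses the p-distance (the proportion of sites that differ). The function names and toy sequences are illustrative; real analyses usually apply a substitution model to correct these raw distances.

```python
def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_matrix(aligned):
    """Symmetric matrix of p-distances between all pairs of aligned sequences."""
    return [[p_distance(a, b) for b in aligned] for a in aligned]


for row in distance_matrix(["ACGT", "ACGA", "TCGA"]):
    print(row)
```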

Organisms, Genomes, Sequences, and so on: Life at Different Resolutions

Essential properties of an organism
Reproduction
  Sexual: Tetrahymena thermophila (www.isleepinadrawer.com)
  Asexual: Amoeba proteus (www.teachnet.ie)

Essential properties of an organism
Cellularity
  Unicellular: Treponema pallidum (www.teachersource.com)
  Multicellular: Caenorhabditis elegans (959 cells) (www.ucl.ac.uk)

Essential properties of an organism
Biochemical processes
  Fermentation
  Antibiotic synthesis

The capacity to do all of these things comes from the GENOME of an organism
Genome = the complete set of genetic material (DNA for all known organisms)

Prokaryotes and Eukaryotes (image: espacial.org)

The Human Genome
23 linear chromosomes
~3 billion DNA residues
~20,000 genes (controversy!)

Escherichia coli strain K12
1 circular chromosome, 2 plasmids
~5.6 million DNA residues
5326 genes

Genes on the main chromosome
Gene order

The DNA sequence of a gene
5'-ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA-3'
3'-TAC GCA ATG AAG CTT TAC CGT TGG GTG AGC CCC TGA AGG AGG TTG CCA ACT-5'
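
The second strand follows from the first by base pairing (A with T, C with G). A small Python sketch of that relationship, with function names chosen for illustration:

```python
# Watson-Crick pairing for unambiguous DNA letters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Complementary strand, written in the same left-to-right order (3' to 5')."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complementary strand, written in the conventional 5' to 3' direction."""
    return complement(dna)[::-1]


top_strand = (
    "ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA"
).replace(" ", "")

print(complement(top_strand))
print(reverse_complement(top_strand))
```

Because the two strands run in opposite directions, tools that read sequence 5' to 3' usually work with the reverse complement rather than the plain complement.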

DNA to protein sequence
ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA
 M   R   Y   F   E   M   A   T   H   S   G   T   S   S   N   G   *
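
Translation reads the DNA three letters (one codon) at a time and looks each codon up in the genetic code. The sketch below builds the standard genetic code table from a compact string and translates the sequence above; the function and variable names are illustrative, and it assumes the reading frame starts at the first base.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop codon, 'X' an unknown codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


gene = (
    "ATG CGT TAC TTC GAA ATG GCA ACC CAC TCG GGG ACT TCC TCC AAC GGT TGA"
).replace(" ", "")
print(translate(gene))
```

Note that the second codon, CGT, encodes arginine (R) in the standard code, and the final codon, TGA, is a stop.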

Protein sequence and structure
M R Y F E M A T H S G T S S N G *

Metabolism
Proteins working together

Pathways (metabolism + self-replication + signalling) =

Communities of organisms
http://www.noaanews.noaa.gov/stories2006/s2644.htm
