Chapter 2: Sequence Alignment

2.2 Scoring Matrices
Prof. Yechiam Yemini (YY), Computer Science Department, Columbia University
COMS4761-2007

Overview
- PAM: scoring based on evolutionary statistics
- Elementary Markovian models of evolution
- BLOSUM: tuning scoring to evolutionary conservation
- Recommended brief tutorial on sequence alignment techniques: http://www.psc.edu/biomed/training/tutorials/sequence/db/

Introduction to Log-Likelihood Estimation
- Driving question: how likely is an alignment (X, Y), where X = ..a.. is aligned against Y = ..b..? That is, how likely is evolution to substitute (a, b)?
- Define:
  p(a) = probability of a symbol a
  p(a,b) = probability of the evolutionary substitution (a,b)
- Likelihood ratio: $L(a,b) = \frac{p(a,b)}{p(a)\,p(b)}$. It compares an evolutionary substitution against a random one.
- Log-odds discrimination: $s(a,b) = 10 \log \frac{p(a,b)}{p(a)\,p(b)}$. Only orders-of-magnitude discrimination is considered.

Log-Odds Scoring
- Scoring matrix: $s(a,b) = 10 \log \frac{p(a,b)}{p(a)\,p(b)}$
- The score of an alignment (X, Y) is given by: $S(X,Y) = 10 \log \prod_i \frac{p(x_i,y_i)}{p(x_i)\,p(y_i)} = \sum_i s(x_i,y_i)$
- S(X, Y) discriminates between random and evolutionarily likely alignments.
- DP maximizes the log-odds score of an alignment.
- But how do we acquire the probabilities p(a) and p(a,b)?
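The log-odds definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the two-symbol alphabet and its probabilities are made up for the example, the function names are hypothetical, and the logarithm is assumed to be base 10 (the slide writes only "log").

```python
import math

def log_odds_matrix(p_joint, p_single):
    """s(a,b) = 10 * log10( p(a,b) / (p(a) * p(b)) ) for every listed pair."""
    return {
        (a, b): 10 * math.log10(p_ab / (p_single[a] * p_single[b]))
        for (a, b), p_ab in p_joint.items()
    }

def alignment_score(x, y, scores):
    """S(X,Y) = sum of s(x_i, y_i) over the columns of a gap-free alignment."""
    return sum(scores[(a, b)] for a, b in zip(x, y))

# Hypothetical toy probabilities (not from the slides).
P_SINGLE = {"a": 0.5, "b": 0.5}
P_JOINT = {("a", "a"): 0.4, ("b", "b"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1}

SCORES = log_odds_matrix(P_JOINT, P_SINGLE)
TOTAL = alignment_score("aab", "abb", SCORES)
```

Because the logarithm turns the product of likelihood ratios into a sum, `TOTAL` equals 10 times the log of the product of the three column ratios, which is what lets dynamic programming work column by column.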

Digression: Markovian Models
- How do we model statistical sequence changes?
- Start with a finite state machine (FSM) (regular grammar)
- An FSM path generates a unique sequence of changes
- Example, with S(t) = state at time t over the states A, C, G, T: S(0)=T, S(1)=A, S(2)=G, S(3)=C, S(4)=G

Markov Chain: FSM + Transition Probabilities
Finite, homogeneous Markov chain assumptions:
- Markovian: transition probabilities depend on the current state only
  $P[S_{t+1}=j \mid S_t=i, S_{t-1}=i_{t-1}, \ldots, S_0=i_0] = P[S_{t+1}=j \mid S_t=i]$
- Homogeneity: transition probabilities are time independent
  $P[S_{t+1}=j \mid S_t=i] = a(i,j)$
Notes:
- The transition matrix A is stochastic: $a(i,j) \ge 0$, $\sum_j a(i,j) = 1$ (that is, $A \ge 0$, $A\mathbf{1} = \mathbf{1}$)
- The diagonal elements of A are given by: $a(i,i) = 1 - \sum_{k \ne i} a(i,k)$

[Figure: state-transition diagram over the states A, C, G, T with edge probabilities]

Transition matrix A (rows = current state, columns = next state; values as transcribed):

        A     G     C     T
  A    0.3   0.4   0.2   0.1
  G    0.5   0.1   0.1   0.3
  C    0.1   0.2   0.1   0.6
  T    0.2   0.1   0.3   0.5

Note: as transcribed, row T sums to 1.1 rather than 1, so one of its entries is presumably a slide or transcription error.
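The two Markov-chain conditions can be checked mechanically. The sketch below assumes the 16 transcribed matrix values are laid out row by row in the order A, G, C, T, and reads the blank S(1) in the example path as A (the extraction appears to have dropped that letter); both readings are assumptions, and the function names are hypothetical.

```python
STATES = "AGCT"

# Transition matrix as transcribed, assumed row-major with rows/columns A, G, C, T.
A_MATRIX = [
    [0.3, 0.4, 0.2, 0.1],
    [0.5, 0.1, 0.1, 0.3],
    [0.1, 0.2, 0.1, 0.6],
    [0.2, 0.1, 0.3, 0.5],
]

def non_stochastic_rows(matrix, states, tol=1e-9):
    """Return {state: row_sum} for rows that violate a(i,j) >= 0 or sum_j a(i,j) = 1."""
    bad = {}
    for state, row in zip(states, matrix):
        total = sum(row)
        if any(x < 0 for x in row) or abs(total - 1.0) > tol:
            bad[state] = total
    return bad

def path_probability(path, matrix, states):
    """Probability of the transitions along a path, given its starting state."""
    index = {s: i for i, s in enumerate(states)}
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= matrix[index[current]][index[nxt]]
    return prob

BAD_ROWS = non_stochastic_rows(A_MATRIX, STATES)
EXAMPLE_PROB = path_probability("TAGCG", A_MATRIX, STATES)
```

Under this reading the check flags only row T, whose entries sum to 1.1, and the example path T, A, G, C, G has transition probability a(T,A) a(A,G) a(G,C) a(C,G) = 0.2 x 0.4 x 0.1 x 0.2.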

Computing Evolution Probabilities
- Suppose the initial symbol probability is $\pi_1 = (\pi_1(A), \pi_1(G), \pi_1(C), \pi_1(T))$
- Evolution:
  After one generation: $\pi_2(y) = \sum_x a(x,y)\,\pi_1(x)$, that is, $\pi_2 = \pi_1 A$
  After n generations: $\pi_{n+1} = \pi_1 A^n$
- The transition matrix for n generations is $A^n$, where A is the one-generation transition matrix

Scoring Matrices: PAM [Dayhoff, 1978]
- PAM: Percent Accepted Mutations
- Key idea: estimate probabilities of evolutionary change
- Model evolution as a Markov chain of substitutions; PAM = unit of evolutionary time, ~$10^7$ years
- Evolution performs Markovian substitutions at each PAM
- Consider protein families with high (99%) sequence similarity
- Dayhoff carefully selected and studied 71 protein families
  - Assure that only one generation of changes is reflected
  - Eliminate noisy estimates
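The recurrence for the symbol distribution can be sketched with plain Python lists. The 4-state matrix below is a hypothetical, symmetric toy (0.7 on the diagonal, 0.1 elsewhere), chosen only because its rows sum to 1; it is not the matrix from the slides, and the helper names are illustrative.

```python
def mat_mul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mat_pow(A, n):
    """A raised to the n-th power by repeated multiplication (n >= 0)."""
    size = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, A)
    return result

def evolve(pi, A, n):
    """Distribution after n generations: the row vector pi times A^n."""
    return mat_mul([pi], mat_pow(A, n))[0]

# Hypothetical toy transition matrix; every row sums to 1.
A_TOY = [[0.7 if i == j else 0.1 for j in range(4)] for i in range(4)]
PI_1 = [1.0, 0.0, 0.0, 0.0]  # start with certainty in the first state

PI_3 = evolve(PI_1, A_TOY, 2)      # pi_3 = pi_1 A^2
PI_LATE = evolve(PI_1, A_TOY, 250)  # many generations later
```

For this toy matrix the probability of still being in the starting state after two generations is 0.7 x 0.7 + 3 x 0.1 x 0.1 = 0.52, and after many generations the distribution flattens toward 0.25 per state, which illustrates why high matrix powers describe distant relationships.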

PAM Computations
- Data: sequences of a protein family
- Build a phylogenetic tree
- Count frequencies: f(a,b) = frequency of the (a,b) substitution; p(a) = frequency of a, with $p(a) = \sum_x f(a,x)$
- Score: $s(a,b) = 10 \log \frac{f(a,b)}{p(a)\,p(b)}$

Scoring Matrices: PAM1, PAM60, PAM250
- Assume evolution is Markovian
- Define PAMn(a,b) = probability of changing a to b through n mutations, where n = evolutionary time measured in PAM units
- $\mathrm{PAM}n = (\mathrm{PAM}1)^n$
- Compute PAMn for n = 1, 60, 80, 120, 250
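The counting step can be sketched as follows. This is only an illustration of the slide's formulas, not Dayhoff's full procedure: the pair counts are invented, the function name is hypothetical, each unordered observation is split evenly between (a,b) and (b,a) so that f is symmetric (an assumption), and the logarithm is taken as base 10.

```python
import math
from collections import Counter

def pam_style_scores(pair_counts):
    """From unordered substitution counts, compute f(a,b), p(a) and s(a,b)."""
    total = sum(pair_counts.values())
    f = Counter()
    for (a, b), count in pair_counts.items():
        # Split each observation evenly between (a,b) and (b,a); for a == b
        # the two halves land on the same key and add back up to count/total.
        f[(a, b)] += count / (2 * total)
        f[(b, a)] += count / (2 * total)
    p = Counter()
    for (a, _), value in f.items():
        p[a] += value  # p(a) = sum over x of f(a,x)
    scores = {
        (a, b): 10 * math.log10(value / (p[a] * p[b]))
        for (a, b), value in f.items()
    }
    return f, p, scores

# Hypothetical counts of aligned residue pairs read off a toy tree.
TOY_COUNTS = {("L", "L"): 6, ("L", "I"): 2, ("I", "I"): 2}
F, P, S = pam_style_scores(TOY_COUNTS)
```

With these toy counts, p(L) = 0.7 and p(I) = 0.3; the conserved pairs score positive and the L/I substitution scores negative because it is observed less often than chance would predict.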

PAM250 Measures Evolutionary Similarity
- PAM250 scores similarity among distantly related proteins
- Example: $\mathrm{PAM1}^{250}[\mathrm{Tyr},\mathrm{Phe}] = 5.7$; $\mathrm{PAM1}^{250}[\mathrm{Phe},\mathrm{Tyr}] = 8.3$
- The symmetric score averages the two directions: $\mathrm{PAM250}(\mathrm{Phe},\mathrm{Tyr}) = (5.7 + 8.3)/2 = 7$

[Figure: two short aligned peptide fragments in which the aligned F/Y pair receives the score 7]

How Good Is the Markovian Model?
- Is evolution similar across protein families?
- Does it change symbols independently of location?
- Is evolution homogeneous in time?
- Does it change symbols independently of each other?

Proteins Evolve at Different Rates (T. Speed Slides)

[Figure, after Dickerson (1971): corrected amino acid changes per 100 residues (vertical axis) versus millions of years since divergence (horizontal axis, annotated with geological periods and divergence points such as Mammals, Birds/Reptiles, Reptiles/Fish, Carp/Lamprey and Vertebrates/Insects). Three curves are shown: fibrinopeptides (1.1 MY), hemoglobin (5.8 MY) and cytochrome c (20.0 MY).]

Proteins Evolve at Different Rates

  Protein                               PAMs (a) / 100 residues / 10^8 years   Theoretical lookback time (b)
  Pseudogenes                           400                                    45 (c)
  Fibrinopeptides                       90                                     200 (c)
  Lactalbumins                          27                                     670 (c)
  Lysozymes                             24                                     750 (c)
  Ribonucleases                         21                                     850 (c)
  Hemoglobins                           12                                     1.5 (d)
  Acid proteases                        8                                      2.3 (d)
  Triosephosphate isomerase             3                                      6 (d)
  Phosphoglyceraldehyde dehydrogenase   2                                      9 (d)
  Glutamate dehydrogenase               1                                      18 (d)

(a) PAMs: accepted point mutations. (b) Useful lookback time = 360 PAMs. (c) Million years. (d) Billion years.
From Doolittle 1986.
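The lookback column can be cross-checked against the rate column with a short calculation. The sketch below assumes that the 360-PAM useful lookback is shared between two diverging lineages, so each lineage accumulates 180 PAMs; that factor of 2 is inferred from the tabulated numbers and is not stated in the source. The function name is hypothetical.

```python
def lookback_years(rate_per_1e8_years, useful_pams=360.0):
    """Years until two lineages are `useful_pams` apart, assuming each lineage
    accumulates half of that distance (the halving is an inferred assumption)."""
    return (useful_pams / 2.0) / rate_per_1e8_years * 1e8

# Rates from the table, in PAMs per 100 residues per 10^8 years.
RATES = {
    "Pseudogenes": 400,
    "Fibrinopeptides": 90,
    "Hemoglobins": 12,
    "Glutamate dehydrogenase": 1,
}

LOOKBACK = {protein: lookback_years(rate) for protein, rate in RATES.items()}
```

Under that assumption the calculation gives 45 million years for pseudogenes, 200 million for fibrinopeptides, 1.5 billion for hemoglobins and 18 billion for glutamate dehydrogenase, matching the tabulated values: slowly evolving proteins remain informative over far longer time spans.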

BLOSUM: Henikoff & Henikoff 1992
- BLOSUM: Block Substitution Matrices
- Motivation: PAM's use of matrix powers can result in large errors
- Key idea: consider conserved patterns (blocks) of a large sample of proteins
  - Classify protein families (over 500 families)
  - Each family has characteristic patterns (signatures) that are conserved
  - Use these conserved patterns to compute substitution probabilities
  - p(a,b) = probability of the (a,b) substitution; p(a) = probability of a
- Example block:

  Bpi Bovine   npgivaritqkgldyacqqgvltlqkele
  Bpi Human    npgvvvrisqkgldyasqqgtaalqkelk
  Cept Human   eagivcritkpallvlnhetakviqtafq
  Lbp Human    npglvaritdkglqyaaqegllalqsell
  Lbp Rabbit   npglitritdkgleyaaregllalqrkll

Scoring Matrices (e.g., BLOSUM)
- BLOSUMx = based on patterns that are x% similar
- The level of x% can provide different performance in identifying similarity
- BLOSUM62 provides good scoring (used as default)
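The block-counting idea can be sketched on the five-sequence block above. This is a simplified illustration, not the published BLOSUM pipeline: it skips the step that clusters sequences at x% identity, uses a single block, and scores pairs as 2 x log2(observed / expected) rounded to an integer (half-bit units). The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

BLOCK = [
    "npgivaritqkgldyacqqgvltlqkele",  # Bpi Bovine
    "npgvvvrisqkgldyasqqgtaalqkelk",  # Bpi Human
    "eagivcritkpallvlnhetakviqtafq",  # Cept Human
    "npglvaritdkglqyaaqegllalqsell",  # Lbp Human
    "npglitritdkgleyaaregllalqrkll",  # Lbp Rabbit
]

def pair_counts(block):
    """Count unordered residue pairs over all sequence pairs in every column."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts

def blosum_like_scores(block):
    """Rounded 2*log2(observed/expected) per residue pair seen in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    q = {pair: c / total for pair, c in counts.items()}  # observed pair frequencies
    p = Counter()  # residue frequencies implied by the observed pairs
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(q_ab / expected))
    return counts, p, scores

COUNTS, P_RES, SCORES = blosum_like_scores(BLOCK)
```

Each of the 29 columns contributes 10 sequence pairs, so 290 pairs are counted in total. Pairs never observed in the block get no score here; a real matrix pools many blocks so that every pair is observed.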