Bioinformatics. Molecular Biophysics & Biochemistry 447b3 / 747b3. Class 3, 1/19/98. Mark Gerstein. Yale University



Aligning Text Strings

Raw Data???
    T C A T G
    C A T T G

2 matches, 0 gaps
    T C A T G
    C A T T G

3 matches (2 end gaps)
    T C A T G .
    . C A T T G

4 matches, 1 insertion
    T C A - T G
    . C A T T G

4 matches, 1 insertion
    T C A T - G
    . C A T T G
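The ungapped cases above amount to sliding one string against the other and counting identical positions. A minimal sketch, assuming a hypothetical helper name `matches_at_offset` and an offset convention (how far the second string is shifted right) that are not part of the original slide:

```python
def matches_at_offset(a, b, offset):
    """Count identical characters when b is shifted right by `offset` against a."""
    count = 0
    for j, ch in enumerate(b):
        i = j + offset  # position in a that b[j] is placed against
        if 0 <= i < len(a) and a[i] == ch:
            count += 1
    return count


a, b = "TCATG", "CATTG"
for offset in range(-2, 3):
    print(offset, matches_at_offset(a, b, offset))
```

Offsets 0 and 1 correspond to the "2 matches, 0 gaps" and "3 matches (2 end gaps)" cases. The two 4-match alignments need an internal gap, which no single offset can produce.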

Dynamic Programming

What to do for a bigger string?

    SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGGREGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEP
    KPNEPRGDILLPTVGHALAFIERLERPELYGVNPEVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRTEDFDGVWAS

Needleman-Wunsch (1970) provided the first automatic method: dynamic programming to find a global alignment.

Their test data (J->Y):
    ABCNYRQCLCRPM
    AYCYNRCKCRBP

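Step 1 -- Make a Dot Plot (Similarity Matrix)

Put 1's where characters are identical (empty cells are shown as ".").

      A B C N Y R Q C L C R P M
    A 1 . . . . . . . . . . . .
    Y . . . . 1 . . . . . . . .
    C . . 1 . . . . 1 . 1 . . .
    Y . . . . 1 . . . . . . . .
    N . . . 1 . . . . . . . . .
    R . . . . . 1 . . . . 1 . .
    C . . 1 . . . . 1 . 1 . . .
    K . . . . . . . . . . . . .
    C . . 1 . . . . 1 . 1 . . .
    R . . . . . 1 . . . . 1 . .
    B . 1 . . . . . . . . . . .
    P . . . . . . . . . . . 1 .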
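A dot plot for the test data can be built in a couple of lines. This is a minimal sketch; the function name `dot_plot` and the choice of rows for the second sequence and columns for the first are assumptions made to match the slide layout.

```python
def dot_plot(a, b):
    """Return a matrix with 1 where b[r] == a[c] and 0 elsewhere (rows: b, columns: a)."""
    return [[1 if y == x else 0 for x in a] for y in b]


a, b = "ABCNYRQCLCRPM", "AYCYNRCKCRBP"
dots = dot_plot(a, b)
print("  " + " ".join(a))
for label, row in zip(b, dots):
    print(label + " " + " ".join("1" if v else "." for v in row))
```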

A More Interesting Dot Matrix

[Figure: dot matrix, adapted from R Altman]

Step 2 -- Start Computing the Sum Matrix

new_value_cell(R,C) <= cell(R,C)                      { Old value, either 1 or 0 }
                       + Max[ cell(R+1, C+1),              { Diagonally down, no gaps }
                              cells(R+1, C+2 to C_max),    { Down a row, making col. gap }
                              cells(R+2 to R_max, C+1)     { Down a col., making row gap }
                            ]

Start from the dot plot of Step 1 and work upward from the bottom row. After the first few updates the bottom rows read as follows (rows above still hold their dot-plot values; "." marks cells not yet updated):

      A B C N Y R Q C L C R P M
    R . . . . . 1 . . . . 2 0 0
    B 1 2 1 1 1 1 1 1 1 1 1 0 0
    P 0 0 0 0 0 0 0 0 0 0 0 1 0
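The update rule can be sketched directly in Python. This is an illustrative, unoptimized version, not the lecture's own code: the name `sum_matrix` is hypothetical, and it assumes the column move looks at column C+1 from row R+2 downward, which is what the matrix values on the slides imply.

```python
def sum_matrix(a, b):
    """Fill the sum matrix bottom-up (rows: b, columns: a), with no gap penalty."""
    n_rows, n_cols = len(b), len(a)
    # Start from the dot plot: 1 for identical characters, 0 otherwise.
    S = [[1 if b[r] == a[c] else 0 for c in range(n_cols)] for r in range(n_rows)]
    # The last row and last column keep their dot-plot values.
    for r in range(n_rows - 2, -1, -1):
        for c in range(n_cols - 2, -1, -1):
            row_part = S[r + 1][c + 1:]  # diagonal cell plus the rest of row r+1
            col_part = [S[k][c + 1] for k in range(r + 2, n_rows)]  # rest of column c+1
            S[r][c] += max(row_part + col_part)
    return S


S = sum_matrix("ABCNYRQCLCRPM", "AYCYNRCKCRBP")
for label, row in zip("AYCYNRCKCRBP", S):
    print(label, *row)
```

Each cell scans a whole row segment and column segment, so this form is cubic in sequence length; it is meant to mirror the rule, not to be fast.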

Step 3 -- Keep Going

Continuing upward, the bottom seven rows now read as follows (rows above still hold their dot-plot values; "." marks cells not yet updated):

      A B C N Y R Q C L C R P M
    R . . . . . 5 4 3 3 2 2 0 0
    C 3 3 4 3 3 3 3 4 3 3 1 0 0
    K 3 3 3 3 3 3 3 3 3 2 1 0 0
    C 2 2 3 2 2 2 2 3 2 3 1 0 0
    R 2 1 1 1 1 2 1 1 1 1 2 0 0
    B 1 2 1 1 1 1 1 1 1 1 1 0 0
    P 0 0 0 0 0 0 0 0 0 0 0 1 0

Step 4 -- Sum Matrix All Done

Alignment score is 8 matches.

      A B C N Y R Q C L C R P M
    A 8 7 6 6 5 4 4 3 3 2 1 0 0
    Y 7 7 6 6 6 4 4 3 3 2 1 0 0
    C 6 6 7 6 5 4 4 4 3 3 1 0 0
    Y 6 6 6 5 6 4 4 3 3 2 1 0 0
    N 5 5 5 6 5 4 4 3 3 2 1 0 0
    R 4 4 4 4 4 5 4 3 3 2 2 0 0
    C 3 3 4 3 3 3 3 4 3 3 1 0 0
    K 3 3 3 3 3 3 3 3 3 2 1 0 0
    C 2 2 3 2 2 2 2 3 2 3 1 0 0
    R 2 1 1 1 1 2 1 1 1 1 2 0 0
    B 1 2 1 1 1 1 1 1 1 1 1 0 0
    P 0 0 0 0 0 0 0 0 0 0 0 1 0

Step 5 -- Traceback

Find the best score (8) and trace back.

    A B C N Y - R Q C L C R - P M
    A Y C - Y N R - C K C R B P

(The path is traced through the completed sum matrix of Step 4.)
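One way to sketch the traceback: start at the best cell in the first row or column, then repeatedly jump to the highest-scoring cell that the update rule was allowed to look at. The names `sum_matrix` and `traceback` are hypothetical, and the tie-breaking (first maximum, diagonal checked first) is an assumption; other tie-breaks give the alternate tracebacks.

```python
def sum_matrix(a, b):
    """Sum matrix without gap penalty (rows: b, columns: a)."""
    n_rows, n_cols = len(b), len(a)
    S = [[1 if b[r] == a[c] else 0 for c in range(n_cols)] for r in range(n_rows)]
    for r in range(n_rows - 2, -1, -1):
        for c in range(n_cols - 2, -1, -1):
            col_part = [S[k][c + 1] for k in range(r + 2, n_rows)]
            S[r][c] += max(S[r + 1][c + 1:] + col_part)
    return S


def traceback(S):
    """Return the list of aligned (row, column) index pairs along one best path."""
    n_rows, n_cols = len(S), len(S[0])
    starts = [(0, c) for c in range(n_cols)] + [(r, 0) for r in range(1, n_rows)]
    r, c = max(starts, key=lambda rc: S[rc[0]][rc[1]])
    path = [(r, c)]
    while r + 1 < n_rows and c + 1 < n_cols:
        candidates = [(r + 1, k) for k in range(c + 1, n_cols)]
        candidates += [(k, c + 1) for k in range(r + 2, n_rows)]
        r, c = max(candidates, key=lambda rc: S[rc[0]][rc[1]])
        path.append((r, c))
    return path


a, b = "ABCNYRQCLCRPM", "AYCYNRCKCRBP"
path = traceback(sum_matrix(a, b))
for r, c in path:
    print(b[r], a[c], "match" if b[r] == a[c] else "mismatch")
```

Positions skipped between consecutive pairs correspond to the "-" characters in the written alignment.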

Step 6 -- Alternate Tracebacks

    A B C - N Y R Q C L C R - P M
    A Y C Y N - R - C K C R B P

(Same sum matrix as in Step 4; a different path through it gives the same score of 8.)

Key Idea in Dynamic Programming

The score of the best alignment that ends at a given pair of positions (i and j) in the two sequences is the score of the best alignment previous to this position PLUS the score for aligning those two positions.

An example below:
o Aligning R to K does not affect the alignment of the previous N-terminal residues. Once this is done it is fixed. Then go on to align D to E.
o How could this be violated? Aligning R to K changes the best alignment in the box.

    ACSQRP--LRV-SH RSENCV
    A-SNKPQLVKLMTH VKDFCV

    ACSQRP--LRV-SH -R SENCV
    A-SNKPQLVKLMTH VK DFCV

Computational Complexity

Basic NW algorithm is O(n^2) (in speed):
o M x N squares to fill.
o At each square need to look back (M + N) "black" squares to find the max in the block: M x N x (M + N) -> O(n^3).
o However, max values in the block can be cached, so the algorithm is really only O(n^2).

O(n^2) in memory too!

Improvements can (effectively) reduce sequence comparison to O(n) in both.

[Figure: partially filled sum matrix with axes labelled M and N, showing the block searched for the max]
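The caching remark can be illustrated by comparing a direct implementation with one that keeps running maxima. Both functions below are hypothetical sketches (the names and the particular caching scheme, a suffix maximum for the row below plus a running maximum per column, are assumptions); the slide does not say how the caching was done.

```python
def sum_matrix_naive(a, b):
    """Cubic-time version: rescans a row segment and a column segment per cell."""
    n_rows, n_cols = len(b), len(a)
    S = [[1 if b[r] == a[c] else 0 for c in range(n_cols)] for r in range(n_rows)]
    for r in range(n_rows - 2, -1, -1):
        for c in range(n_cols - 2, -1, -1):
            col_part = [S[k][c + 1] for k in range(r + 2, n_rows)]
            S[r][c] += max(S[r + 1][c + 1:] + col_part)
    return S


def sum_matrix_cached(a, b):
    """Quadratic-time version: the block maxima are cached instead of rescanned."""
    n_rows, n_cols = len(b), len(a)
    S = [[0] * n_cols for _ in range(n_rows)]
    col_max = [0] * (n_cols + 1)  # col_max[c]: max of S[k][c] over rows already filled
    for r in range(n_rows - 1, -1, -1):
        # row_suffix[c]: max of S[r+1][c:], all zeros when there is no row below
        row_suffix = [0] * (n_cols + 1)
        if r + 1 < n_rows:
            for c in range(n_cols - 1, -1, -1):
                row_suffix[c] = max(S[r + 1][c], row_suffix[c + 1])
        for c in range(n_cols):
            dot = 1 if b[r] == a[c] else 0
            S[r][c] = dot + max(row_suffix[c + 1], col_max[c + 1])
        for c in range(n_cols):
            col_max[c] = max(col_max[c], S[r][c])
    return S


a, b = "ABCNYRQCLCRPM", "AYCYNRCKCRBP"
print(sum_matrix_naive(a, b) == sum_matrix_cached(a, b))
```

The cached version does a constant amount of work per cell plus one linear pass per row, which is where the O(n^2) bound comes from. It relies on all scores being non-negative, which holds here because there is no gap penalty.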

Gap Penalties

The score at a position can also factor in a penalty for introducing gaps (i.e., not going from i, j to i-1, j-1).

Gap penalties are often of linear form:

    GAP = a + bn

    GAP = the gap penalty
    a = cost of opening a gap
    b = cost of extending the gap by one (affine)
    n = length of the gap

(Here assume b=0, a=1/2, so GAP = 1/2 regardless of length.)
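As a small illustration of the formula, a gap-penalty function might look like the sketch below. The function name and the default values are assumptions; the defaults are chosen to reproduce the slide's simplification (b=0, a=1/2).

```python
def gap_penalty(n, a=0.5, b=0.0):
    """Penalty for a gap of length n: opening cost a plus extension cost b per position."""
    return a + b * n


# With the slide's assumption (b = 0, a = 1/2) the penalty ignores gap length.
for length in (1, 2, 5):
    print(length, gap_penalty(length))

# A length-dependent example with made-up parameter values.
print(gap_penalty(3, a=1.0, b=0.25))
```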

Step 2 -- Computing the Sum Matrix with Gaps

new_value_cell(R,C) <= cell(R,C)                           { Old value, either 1 or 0 }
                       + Max[ cell(R+1, C+1),                   { Diagonally down, no gaps }
                              cells(R+1, C+2 to C_max) - GAP,   { Down a row, making col. gap }
                              cells(R+2 to R_max, C+1) - GAP    { Down a col., making row gap }
                            ]

GAP = 1/2

Starting again from the dot plot of Step 1, the bottom rows as shown on the slide (rows above still hold their dot-plot values; "." marks cells not yet updated):

        A   B   C   N   Y   R   Q   C   L   C   R   P   M
    R   .   .   .   .   .   1   .   .   .   . 1.5   0   0
    B   1   2   1   1   1   1   1   1   1   1   1   0   0
    P   0   0   0   0   0   0   0   0   0   0   0   1   0
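A sketch of the same fill with a constant gap penalty, under the slide's assumption GAP = 1/2. The function name `sum_matrix_gapped` is hypothetical, and the penalty is subtracted from every non-diagonal candidate regardless of gap length.

```python
def sum_matrix_gapped(a, b, gap=0.5):
    """Sum matrix where any non-diagonal move costs a fixed gap penalty."""
    n_rows, n_cols = len(b), len(a)
    S = [[1.0 if b[r] == a[c] else 0.0 for c in range(n_cols)] for r in range(n_rows)]
    for r in range(n_rows - 2, -1, -1):
        for c in range(n_cols - 2, -1, -1):
            candidates = [S[r + 1][c + 1]]  # diagonal, no gap
            candidates += [S[r + 1][k] - gap for k in range(c + 2, n_cols)]  # column gap
            candidates += [S[k][c + 1] - gap for k in range(r + 2, n_rows)]  # row gap
            S[r][c] += max(candidates)
    return S


a, b = "ABCNYRQCLCRPM", "AYCYNRCKCRBP"
S = sum_matrix_gapped(a, b)
# Second R of the row sequence against the second R of the column sequence.
print(S[9][10])
```

The printed cell is the one shown as 1.5 on the slide: its own match (1) plus the P/P match reached through one gap (1 - 1/2). Setting `gap=0` recovers the ungapped sum matrix.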

Similarity (Substitution) Matrix

Identity matrix:
    Match L with L => 1
    Match L with D => 0
    Match L with V => 0 ??

S(aa-1,aa-2):
    Match L with L => 1
    Match L with D => 0
    Match L with V => .5

A number of common ones: PAM, BLOSUM, Gonnet

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
    R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
    N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
    D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
    C    0  -3  -3  -3   8  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
    Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
    E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
    G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
    H   -2   0   1  -1  -3   0   0  -2   7  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
    I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
    L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
    K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
    M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
    F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
    P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   6  -1  -1  -4  -3  -2
    S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
    T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
    W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  10   2  -3
    Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   6  -1
    V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
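The step from an identity matrix to a general S(aa-1,aa-2) can be sketched with a lookup that falls back to identity scoring. The names `PARTIAL_SCORES` and `similarity` are hypothetical, and the table holds only the single L/V example from the slide, not a real substitution matrix.

```python
# Hypothetical partial-credit table: unordered residue pairs mapped to a score.
PARTIAL_SCORES = {frozenset("LV"): 0.5}


def similarity(x, y, partial=PARTIAL_SCORES):
    """Score two residues: 1 for identity, a table value if listed, 0 otherwise."""
    if x == y:
        return 1.0
    return partial.get(frozenset((x, y)), 0.0)


for pair in ("LL", "LD", "LV", "VL"):
    print(pair, similarity(pair[0], pair[1]))

# Passing an empty table gives the plain identity matrix.
print(similarity("L", "V", partial={}))
```

Using an unordered pair as the key keeps the lookup symmetric, matching the symmetric layout of the matrix above.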

Where do matrices come from?

1. Manually align protein structures (or, more risky, sequences).
2. Look at the frequency of a.a. substitutions at structurally constant sites.
3. Compute log-odds:

    S(aa-1,aa-2) = log ( freq(O) / freq(E) )

    O = observed exchanges, E = expected exchanges

    +  ->  More likely than random
    0  ->  At random base rate
    -  ->  Less likely than random
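The log-odds step can be sketched as below. The function name, the base-2 logarithm, and the example frequencies are assumptions for illustration; the slide does not specify the base or any scaling.

```python
import math


def log_odds(freq_observed, freq_expected):
    """Log-odds score: positive if a substitution is seen more often than expected."""
    return math.log2(freq_observed / freq_expected)


# Made-up frequencies, chosen only to show the three sign cases.
print(log_odds(8, 2))  # seen more often than expected: positive
print(log_odds(3, 3))  # seen at the random base rate: zero
print(log_odds(1, 4))  # seen less often than expected: negative
```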

The Score

    $S = \sum_{i,j} S(i,j) - nG$

    S = total score
    S(i,j) = similarity matrix score for aligning i and j
    The sum is carried out over all aligned i and j
    n = number of gaps (assuming no gap ext. penalty)
    G = gap penalty
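The total score can be sketched for an alignment written as two gapped strings. Several things here are assumptions: the function names, identity scoring as the default similarity, counting each run of "-" as one gap (consistent with no extension penalty), and padding the shorter row of the Step 5 alignment with a trailing "-" that is also counted as a gap.

```python
def count_gaps(row):
    """Count runs of '-' in one row of an alignment."""
    n, in_gap = 0, False
    for ch in row:
        is_gap = ch == "-"
        if is_gap and not in_gap:
            n += 1
        in_gap = is_gap
    return n


def alignment_score(top, bottom, gap_penalty, sim=lambda x, y: 1.0 if x == y else 0.0):
    """S = sum of sim over aligned pairs minus (number of gaps) * gap_penalty."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    pair_sum = sum(sim(x, y) for x, y in zip(top, bottom) if x != "-" and y != "-")
    n = count_gaps(top) + count_gaps(bottom)
    return pair_sum - n * gap_penalty


top = "ABCNY-RQCLCR-PM"
bottom = "AYC-YNR-CKCRBP-"  # trailing '-' added here so both rows have equal length
print(alignment_score(top, bottom, gap_penalty=0.0))
print(alignment_score(top, bottom, gap_penalty=0.5))
```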

Molecular Biology Information: Macromolecular Structure

DNA/RNA/Protein: almost all protein

[Figure: macromolecular structures. RNA adapted from D Soll web page; right-hand top protein from M Levitt web page.]

Molecular Biology Information: Protein Structure Details

Statistics on number of XYZ triplets:
o 200 residues/domain -> 200 CA atoms, separated by 3.8 Å
o Avg. residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic Å
  => ~1500 xyz triplets (=8x200) per protein domain
o 10 K known domains, ~300 folds

    ATOM 1 C ACE 0 9.401 30.166 60.595 1.00 49.88 1GKY 67
    ATOM 2 O ACE 0 10.432 30.832 60.722 1.00 50.35 1GKY 68
    ATOM 3 CH3 ACE 0 8.876 29.767 59.226 1.00 50.04 1GKY 69
    ATOM 4 N SER 1 8.753 29.755 61.685 1.00 49.13 1GKY 70
    ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62 1GKY 71
    ATOM 6 C SER 1 10.453 29.500 63.579 1.00 41.99 1GKY 72
    ATOM 7 O SER 1 10.593 29.607 64.814 1.00 43.24 1GKY 73
    ATOM 8 CB SER 1 8.052 30.189 63.974 1.00 53.00 1GKY 74
    ATOM 9 OG SER 1 7.294 31.409 63.930 1.00 57.79 1GKY 75
    ATOM 10 N ARG 2 11.360 28.819 62.827 1.00 36.48 1GKY 76
    ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20 1GKY 77
    ATOM 12 C ARG 2 13.502 29.501 63.500 1.00 25.54 1GKY 78
    ...
    ATOM 1444 CB LYS 186 13.836 22.263 57.567 1.00 55.06 1GKY1510
    ATOM 1445 CG LYS 186 12.422 22.452 58.180 1.00 53.45 1GKY1511
    ATOM 1446 CD LYS 186 11.531 21.198 58.185 1.00 49.88 1GKY1512
    ATOM 1447 CE LYS 186 11.452 20.402 56.860 1.00 48.15 1GKY1513
    ATOM 1448 NZ LYS 186 10.735 21.104 55.811 1.00 48.41 1GKY1514
    ATOM 1449 OXT LYS 186 16.887 23.841 56.647 1.00 62.94 1GKY1515
    TER 1450 LYS 186 1GKY1516
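The 3.8 Å spacing between consecutive CA atoms can be checked against two of the records above. This is a minimal sketch: the field names in `parse_atom` are hypothetical, and splitting on whitespace is a simplification that works for the records as printed here, whereas real PDB files use fixed-width columns.

```python
import math

RECORDS = """\
ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62 1GKY 71
ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20 1GKY 77
"""


def parse_atom(line):
    """Split one whitespace-separated ATOM record into a small dictionary."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "res_name": f[3],
        "res_seq": int(f[4]),
        "xyz": (float(f[5]), float(f[6]), float(f[7])),
    }


atoms = [parse_atom(line) for line in RECORDS.splitlines() if line.startswith("ATOM")]
ca = [atom["xyz"] for atom in atoms if atom["name"] == "CA"]
distance = math.dist(ca[0], ca[1])  # Euclidean distance, Python 3.8+
print(round(distance, 2))
```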

Sperm Whale Myoglobin

[Figure]

Structural Alignment of Two Globins

[Figure]

Immunoglobulin Alignment (Harder)

[Figure]

Some Similarities Are Readily Apparent, Others Are More Subtle

Easy: Globins
Tricky: Ig C, Ig V
Very subtle: G3P-dehydrogenase, C-term. domain

Automatic Structural Alignment

[Figure]

Generalized Similarity Matrix

PAM(A,V) = 0.5 applies at every position.

S(aa @ i, aa @ J): a specific matrix for each pair of residues i in protein 1 and J in protein 2.

Example: a Y near the N-term. matches any C-term. residue (Y at J=2).

S(i,J) doesn't need to depend on a.a. identities at all! Just need to make up a score for matching residue i in protein 1 with residue J in protein 2.

[Figure: dot matrix with columns J = 1 to 13 (A B C N Y R Q C L C R P M) and rows i = 1 to 12; row i = 2 holds six entries of 5 in place of the usual 1's]

Structural Alignment

Similarity matrix for structural alignment:
o Similarity matrix S(i,J) depends on the 3D coordinates of residues i and J.
o Distance between the CA of i and the CA of J:

    $d = \sqrt{(x_i - x_J)^2 + (y_i - y_J)^2 + (z_i - z_J)^2}$

    $M(i,J) = 100 / (5 + d^2)$

Threading:
o S(i,J) depends on how well the amino acid at position i in protein 1 fits into the 3D structural environment at position J of protein 2.
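The distance-based score can be sketched as follows. The function name and the example coordinates are made up for illustration, and the coordinates are assumed to be CA positions of two already superimposed structures.

```python
import math


def structural_similarity(ca_i, ca_j):
    """M(i,J) = 100 / (5 + d^2), with d the distance between two CA positions."""
    d = math.dist(ca_i, ca_j)
    return 100.0 / (5.0 + d * d)


# Coincident atoms give the maximum score of 20; the score falls off with distance.
print(structural_similarity((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
print(structural_similarity((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```

A matrix of these values over all residue pairs can be used in the same dynamic programming scheme in place of the identity or substitution scores.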