Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Similar documents
Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Advanced topics in bioinformatics

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Pairwise sequence alignment

Practical considerations of working with sequencing data

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Tools and Algorithms in Bioinformatics

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Scoring Matrices. Shifra Ben-Dor Irit Orr

Single alignment: Substitution Matrix. 16 march 2017

Computational Biology

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

BLAST: Target frequencies and information content Dannie Durand

Pairwise & Multiple sequence alignments

Introduction to Bioinformatics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

In-Depth Assessment of Local Sequence Alignment

Algorithms in Bioinformatics

Quantifying sequence similarity

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Bioinformatics for Biologists

Lecture 5,6 Local sequence alignment

An Introduction to Sequence Similarity ( Homology ) Searching

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Sequence analysis and comparison

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence comparison: Score matrices

Local Alignment Statistics

Basic Local Alignment Search Tool

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Tools and Algorithms in Bioinformatics

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Scoring Matrices. Shifra Ben Dor Irit Orr

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Local Alignment: Smith-Waterman algorithm

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Similarity searching summary (2)

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Global alignments - review

Alignment & BLAST. By: Hadi Mozafari KUMS

Large-Scale Genomic Surveys

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Chapter 7: Rapid alignment methods: FASTA and BLAST

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Introduction to protein alignments

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

PAM-1 Matrix 10,000. From: Ala Arg Asn Asp Cys Gln Glu To:

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Substitution matrices

Motivating the need for optimal sequence alignments...

Pairwise sequence alignments

CSE 549: Computational Biology. Substitution Matrices

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

EECS730: Introduction to Bioinformatics

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Similarity or Identity? When are molecules similar?

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Sequence Alignment (chapter 6)

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Exercise 5. Sequence Profiles & BLAST

Bioinformatics and BLAST

BLAST: Basic Local Alignment Search Tool

Homology Modeling. Roberto Lins EPFL - summer semester 2005

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Today s Lecture: HMMs

Position-specific scoring matrices (PSSM)

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

bioinformatics 1 -- lecture 7

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Moreover, the circular logic

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

SEQUENCE alignment is an underlying application in the

Sequence Alignment Techniques and Their Uses

EECS730: Introduction to Bioinformatics

Lecture Notes: Markov chains

Sequence analysis and Genomics

1.5 Sequence alignment

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 5: September Time Complexity Analysis of Local Alignment

Do Aligned Sequences Share the Same Fold?

Fundamentals of database searching

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Sequence Comparison. mouse human

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Week 10: Homology Modelling (II) - HHpred

Transcription:

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1

Local sequence alignments Local sequence alignments are necessary for cases of: Modular organization of genes and proteins (exons, domains, etc.) Repeats Sequences diverged so that similarity was retained, or can be detected, just in some sub-regions 2

Modular organization of genes gene A gene B gene C gene W gene X gene Y gene Z 3

Modular protein organization IG domain Adapted from Henikoff et al Science 278:609, 97 IG domain IG domain IG domain Kringle domain TLK receptor tyrosine-kinase Protein-kinase domain FN3 domain FN3 domain IG domain FN3 domain TEK receptor tyrosine-kinase EGF domain EGF domain EGF domain IG domain 4

Modular protein organization 1KAP secreted calcium-binding alkaline-protease Calcium-binding repeats Protease domain 5

Local sequence alignment 6

Local sequence alignment For local sequence alignment we wish to find what regions (sub-sequences) in the compared pair of sequences will give the best alignment scores with the parameters we supply (substitution matrix, gap penalty and gap scoring model. The aligned regions may be anywhere along the sequences. More then one region might be aligned with a score above the threshold. 7

Local sequence alignment Smith-Waterman algorithm σ[ a ] b : score of aligning a pair of residues a and b -q : gap penalty S (i,j) : optimal score of an alignment ending at residues i,j best : highest score in the scores-matrix (S) 8

best 0 for j 1 to N do S (0,j) 0 Local sequence alignment Smith-Waterman algorithm for i 1 to M do { S (i,0) 0 } for j 1 to N do S (i,j) max (S (i-1, j-1) + σ[ a i b j ], max {S (0, j)...s(i-1, j)} -q, max {S (i, 0)...S(i, j-1)} -q, 0) best max (S (i, j), best) 9 Pearson & Miller Meth Enz 210:575, 92

Local sequence alignment Smith-Waterman algorithm Finding the optimal alignment A C G T A 1-1 -1-1 C -1 1-1 -1 G -1-1 1-1 T -1-1 -1 1 A T C A G A G T C 0 0 0 0 0 0 0 0 0 0 G 0 0 0 0 0 1 0 1 0 0 T 0 0 1 0 0 0 0 0 2 0 C 0 0 0 2 0 0 0 0 0 3 A 0 1 0 0 3 1 1 1 1 1 Gap penalty -2 The optimal local alignment is: G 0 0 0 0 1 4 2 2 2 2 T 0 0 1 0 1 2 3 1 3 1 C 0 0 0 2 1 2 1 2 1 4 A 0 1 0 0 3 2 3 1 1 2 ATCAGAGTC G TCAG--TCA ++++^^++ : 1+1+1+1-2+1+1=4 10

Local sequence alignment Smith-Waterman algorithm Finding the sub-optimal alignment A C G T A 1-1 -1-1 C -1 1-1 -1 G -1-1 1-1 T -1-1 -1 1 Gap penalty -2 A T C A G A G T C 0 0 0 0 0 0 0 0 0 0 G 0 0 0 0 0 1 0 1 0 0 T 0 0 1 0 0 0 0 0 2 0 C 0 0 0 2 0 0 0 0 0 3 A 0 1 0 0 3 1 1 1 1 1 G 0 0 0 0 1 4 2 2 2 2 T 0 0 1 0 1 2 3 1 3 1 C 0 0 0 2 1 2 1 2 1 4 A 0 1 0 0 3 2 3 1 1 2 Score threshold 3 11

Local sequence alignment Smith-Waterman algorithm Finding the sub-optimal alignment A C G T A 1-1 -1-1 C -1 1-1 -1 G -1-1 1-1 T -1-1 -1 1 A T C A G A G T C 0 0 0 0 0 0 0 0 0 0 G 0-1 -1-1 -1 1-1 1-1 -1 T 0-1 0-1 - 1-1 - 1-1 1-1 C 0-1 - 1 0-1 - 1-1 - 1-1 1 A 0 1-1 - 1 0-1 1-1 - 1-1 G 0-1 -1-1 -1 0-1 1-1 -1 Gap penalty -2 T 0-1 1-1 - 1-1 - 1-1 0-1 C 0-1 - 1 1-1 - 1-1 - 1-1 0 A 0 1-1 - 1 1-1 1-1 - 1-1 Remove scores of the current optimal ATCAGAGTC alignment and then recalculate the GTCAG--TCA matrix to find the next best alignment /s 12

Local sequence alignment Smith-Waterman algorithm Finding the sub-optimal alignment A C G T A 1-1 -1-1 C -1 1-1 -1 G -1-1 1-1 T -1-1 -1 1 Gap penalty -2 A T C A G A G T C 0 0 0 0 0 0 0 0 0 0 G 0 0 0 0 0 1 0 1 0 0 T 0 0 0 0 0 0 0 0 2 0 C 0 0 0 0 0 0 0 0 0 3 A 0 1 0 0 0 0 1 0 0 0 G 0 0 0 0 0 0 0 2 0 0 T 0 0 1 0 0 0 0 0 0 0 C 0 0 0 2 0 0 0 0 0 0 A 0 1 0 0 3 1 1 1 1 1 A TCAGAGTC GTCAGTCA +++ : 1+1+1 =3 Score threshold 3 13

Local sequence alignment Smith-Waterman algorithm In order for the algorithm to identify local alignments the score for aligning unrelated sequence segments should typically be negative. Otherwise true optimal local alignments will be extended beyond their correct ends or have lower scores then longer alignments between unrelated regions. Alignment scores are determined by substitution matrix and by the gap penalties and gap scoring model. 14

Alignment scoring schemes: gap models Gap scoring by a constant relation to the gap length: σ -q g (g is the number ATCACA σ -3q of gapped residues) T---CA Gap scoring by a constant relation to the gap length: σ -q ATCACA σ -q T---CA Affine gap scoring (opening [d] and extending gap penalties [e]): σ -(d + e (g-1)) ATCACA σ -(d + 2e) T---CA 15

Local sequence alignment Smith-Waterman algorithm If alignment scores of unrelated sequences are mainly or solely determined by the substitution scores then such alignments would have negative scores if the sum of expected substitution scores would be negative: Σ i,j p i p j s ij < 0 i & j - residues, p i - frequency of residue i s ij - score of aligning residues i and j 16

Local sequence alignment Smith-Waterman algorithm We can easily identify substitution matrices that will not give positive scores to random alignments. However, we have no analytical way for finding which gap scores will satisfy the demand for random alignment scores to be less or equal to zero and produce local sequence alignments. Nevertheless, certain sets of scoring schemes (substitution matrix and gap scores) were found to give satisfactory local alignments. 17

Sequence alignment DNA substitution matrix A C G T A 5-4 -4-4 C -4 5-4 -4 G -4-4 5-4 T -4-4 -4 5 Typical gap penalties for local alignment algorithms of DNA sequences are 16 for opening a gap & 4 for extending it 18

Alignment scoring schemes: substitution matrices Unitary substitution matrix - two scores are used, one for matches and one mismatches. Practical usage of such matrices is for nucleotide alphabets. In protein sequence alignments there are 20 types of residues (amino acids - aa) with complex relations by size, charge, genetic code, and chemistry. Unitary aa substitution matrices are outperformed by matrices that can have different scores for the 210 aa pairs. These matrices are calculated by scoring the relation between different of aa according to some of their features and/or which substitutions occur in correct alignments and what is the probability of having them by chance. 19

Altschul JMB 219:555, 91 Alignment scoring schemes: substitution matrices Every substitution matrix is either explicitly calculated from target frequencies of aligned residues (q ij ) and the frequencies of the residues (p i ), or these target and observed frequencies are implicit and can be back-calculated from the substitution scores. The ratio of a target frequency to the frequencies it will occur by chance compares the probability an event will occur under two alternative hypotheses - q ij /(p i p j ). This is called a likelihood, or odds, ratio. Such probabilities should be multiplied to get the probability of their independent occurrence, or their log can be added. Log-odds score - s ij = (ln q ij /(p i p j )) / λ (λ determines the base of the logarithm) 20

Sequence alignment BLOSUM62 in 1/2 Bit Units amino acids substitution matrix A R N D C Q E G H I L K M F P S T W Y V X A 4 R -1 5 N -2 0 6 D -2-2 1 6 C 0-3 -3-3 9 Q -1 1 0 0-3 5 E -1 0 0 2-4 2 5 G 0-2 0-1 -3-2 -2 6 H -2 0 1-1 -3 0 0-2 8 I -1-3 -3-3 -1-3 -3-4 -3 4 L -1-2 -3-4 -1-2 -3-4 -3 2 4 K -1 2 0-1 -3 1 1-2 -1-3 -2 5 The expected score per aligned position (Σ i,j p i p j s ij ) is -0.52. Thus, this matrix is suitable for finding local sequence alignments. M -1-1 -2-3 -1 0-2 -3-2 1 2-1 5 F -2-3 -3-3 -2-3 -3-3 -1 0 0-3 0 6 P -1-2 -2-1 -3-1 -1-2 -2-3 -3-1 -2-4 7 S 1-1 1 0-1 0 0 0-1 -2-2 0-1 -2-1 4 T 0-1 0-1 -1-1 -1-2 -2-1 -1-1 -1-2 -1 1 5 W -3-3 -4-4 -2-2 -3-2 -2-3 -2-3 -1 1-4 -3-2 11 Y -2-2 -2-3 -2-1 -2-3 2-1 -1-2 -1 3-3 -2-2 2 7 V 0-3 -3-3 -1-2 -2-3 -3 3 1-2 1-1 -2-2 0-3 -1 4 X 0-1 -1-1 -2-1 -1-1 -1-1 -1-1 -1-1 -2 0 0-2 -1-1 -1 21

Sequence alignment BLOSUM62 in 1/2 Bit Units amino acids substitution matrix A R N D C Q E G H I L K M F P S T W Y V X A 4 R -1 5 P i P j Q ij Q ij /P i P j 2log 2 (Q ij /P i P j ) N -2 0 6 A:A 0.074 0.074 0.0215 3.926 3.946 D -2-2 1 6 R:R 0.074 0.052 0.0023 0.598-1.485 C 0-3 -3-3 9 Q -1 1 0 0-3 5 E -1 0 0 2-4 2 5 G 0-2 0-1 -3-2 -2 6 H -2 0 1-1 -3 0 0-2 8 I -1-3 -3-3 -1-3 -3-4 -3 4 L -1-2 -3-4 -1-2 -3-4 -3 2 4 K -1 2 0-1 -3 1 1-2 -1-3 -2 5 M -1-1 -2-3 -1 0-2 -3-2 1 2-1 5 F -2-3 -3-3 -2-3 -3-3 -1 0 0-3 0 6 P -1-2 -2-1 -3-1 -1-2 -2-3 -3-1 -2-4 7 S 1-1 1 0-1 0 0 0-1 -2-2 0-1 -2-1 4 T 0-1 0-1 -1-1 -1-2 -2-1 -1-1 -1-2 -1 1 5 W -3-3 -4-4 -2-2 -3-2 -2-3 -2-3 -1 1-4 -3-2 11 Y -2-2 -2-3 -2-1 -2 See -3 ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blosum/readme 2-1 -1-2 -1 3-3 -2-2 2 7 V 0-3 -3-3 -1-2 -2 and -3 ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blosum/blosum/ -3 3 1-2 1-1 -2-2 0-3 -1 4 22 X 0-1 -1-1 -2-1 -1-1 -1-1 -1-1 -1-1 -2 0 0-2 -1-1 -1

Altschul JMB 219:555, 91 Alignment scoring schemes: substitution matrices Substitution matrices are characterized by their average score per residue pair H = Σ i,j q ij s ij = Σ i,j q ij log 2 (q ij /p i p j ) H is the information, in bit units, per aligned residue pair. It depends on the target frequencies (q ij ) - calculated from what we think are correct alignments - and on the alignments that would occur by chance (p i p j ). It is termed the relative entropy of the matrix. H measures the information provided by the matrix to distinguish correct alignments from chance ones. Matrices with lower values will identify more distant sequence relationships that produce weaker 23 alignments.

Alignment scoring schemes: substitution matrices Substitution matrices differ by the models and data used for their calculation. Each is suitable for identifying alignments of sequences with different evolutionary distances. Nevertheless, longer alignments are needed to identify the relationship between more distant sequences. The scale of the substitution matrix (base of the log) is arbitrary. However, matrices must be in the same scale to be compared to each other, and gap penalties are specific to the matrix and scale used. Typical penalties for local alignment with the BLOSUM62 matrix in half-bit units are 12 for opening a gap and 2 for extending it. 24

More details, sources and things to do for next lecture Sources: Pearson & Miller "Dynamic programming algorithms for biological sequence comparison." Methods in Enz., 210:575-601 (1992), Altschul Amino acid substitution matrices from an information theoretic perspective J Mol Biol 219:555-565 (1991), Henikoff Scores for sequence searches and alignments Curr Opin Struct Biol 6:353-360 (1996). Assignment: Read the source articles for this lecture. They have more details on the material we covered and introduce topics for next lectures. Calculate the q ij target frequencies of the DNA substitution matrix shown in class for equal nucleotide frequencies, and for p A = p T =0.3 & 25 p G = p C =0.2.

More details, sources and things to do for next lecture For those who are no acquainted with information theory or want to be certain they know the basics of it: An information theory primer for molecular biologistshttp://www.lecb.ncifcrf.gov/~toms/paper/primer 26