Centre for Integrative Bioinformatics VU. Introduction to bioinformatics 2007. Lecture 5: Pair-wise Sequence Alignment


Bioinformatics
"Nothing in Biology makes sense except in the light of evolution" (Theodosius Dobzhansky, 1900-1975)
"Nothing in bioinformatics makes sense except in the light of Biology"

Divergent evolution
Ancestral sequence: ABCD
- Descendant 1: ACCD (B → C, mutation)
- Descendant 2: ABD (C → ø, deletion)
Pairwise alignment of the two descendants:
  ACCD      ACCD
  AB-D  or  A-BD
The first is the true alignment: it pairs the mutated C with the ancestral B and places the gap at the deleted C.

What can be observed about divergent evolution
1: ACCTGTAATC
2: ACGTGCGATC
     *  **
D = 3/10 (fraction of different sites (nucleotides))
[Figure: substitution histories leading from an ancestral sequence to sequence 1 and sequence 2]
- One substitution: one visible
- Two substitutions: none visible
- Two substitutions: one visible
- Back mutation: not visible
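The observed distance D on this slide can be sketched in a few lines. This is a minimal illustration, assuming the two sequences are already aligned without gaps; the function name `fraction_different` is hypothetical.

```python
def fraction_different(seq1, seq2):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same (aligned) length")
    # Count the columns where the two letters disagree.
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


d = fraction_different("ACCTGTAATC", "ACGTGCGATC")
print(d)  # 0.3, i.e. D = 3/10
```

D only counts visible differences, so multiple substitutions at one site and back mutations make it an underestimate of the true number of events.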

Convergent evolution
- Often with shorter motifs (e.g. active sites)
- Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds
- Sequences and associated structures remain different, but the (functional) motif can become identical
- Classical example: the serine proteinases subtilisin and chymotrypsin

Serine proteinase (subtilisin) and chymotrypsin
- Different evolutionary origins
- Similarities in the reaction mechanisms
Chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common: serine acts as a nucleophile, aspartate as an electrophile, and histidine as a base. The geometric orientations of the catalytic residues are similar between families, despite different protein folds. The linear arrangements of the catalytic residues reflect different family relationships. For example, the catalytic triad in the chymotrypsin clan is ordered HDS, but is ordered DHS in the subtilisin clan and SDH in the carboxypeptidase clan.

Serine proteinase (subtilisin) and chymotrypsin
Catalytic triads:
  H D S   chymotrypsin
  D H S   serine proteinase (subtilisin)
  S D H   carboxypeptidase C
Read http://www.ebi.ac.uk/interpro/potm/2003_5/page1.htm


Serine proteinase (subtilisin) and chymotrypsin
There is also divergent evolution:
Håkansson K, Wang AH-J, Miller CG (2000) The structure of aspartyl dipeptidase reveals a unique fold with a Ser-His-Glu catalytic triad. Proc Natl Acad Sci U S A 97(26):14097-14102.

A protein sequence alignment (* marks identical columns)
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

A DNA sequence alignment (* marks identical columns)
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****

Searching for similarities
What is the function of the new gene?
The lazy investigation (i.e., no biological experiments, just bioinformatics techniques):
- Find a set of protein sequences similar to the unknown sequence
- Identify similarities and differences
- For long proteins: first identify domains

Intermezzo: what is a domain?
A domain is a:
- Compact, semi-independent unit (Richardson, 1981).
- Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
- Recurring functional and evolutionary module (Bork, 1992).
"Nature is a tinkerer and not an inventor" (Jacob, 1977).

Protein domains recur in different combinations
The DEATH Domain (DD)
- Present in a variety of eukaryotic proteins involved with cell death.
- Six helices enclose a tightly packed hydrophobic core.
- Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson

Structural domain organisation can be intricate
Pyruvate kinase (phosphotransferase):
- β barrel regulatory domain
- α/β barrel catalytic substrate binding domain
- α/β nucleotide binding domain
1 continuous + 2 discontinuous domains

Evolutionary and functional relationships
Reconstruct evolutionary relation:
- Based on sequence: identity (simplest method), similarity
- Homology (common ancestry: the ultimate goal)
- Other (e.g., 3D structure)
Functional relation: Sequence → Structure → Function

Searching for similarities
Common ancestry is more interesting: it makes it more likely that genes share the same function.
Homology: sharing a common ancestor
- a binary property (yes/no)
- it's a nice tool: when (an unknown) gene X is homologous to (a known) gene G, we gain a lot of information on X, because what we know about G can be transferred to X as a good suggestion.

How to go from DNA to protein sequence
A piece of double stranded DNA:
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
DNA direction is from 5' to 3'

How to go from DNA to protein sequence
6-frame translation using the codon table (last lecture):
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
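A six-frame translation of the DNA fragment above can be sketched as follows. This is an illustrative sketch that assumes the standard genetic code; the names `CODON_TABLE`, `reverse_complement`, `translate` and `six_frame_translation` are hypothetical, stop codons are written as `*`, and letters other than A, C, G, T are not handled.

```python
from itertools import product

# Standard genetic code, built with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three reading frames on each strand (labelled +1..+3 and -1..-3)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("attcgttggcaaatcgcccctatccggc").items():
    print(name, protein)
```

The three forward frames start at bases 1, 2 and 3 of the top strand; the three reverse frames do the same on the bottom strand read 5' to 3'.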

Evolution and three-dimensional protein structure information
Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)
Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000

Bioinformatics tool
[Figure: Data → tool (algorithm) → biological interpretation (model)]

Example today: pairwise sequence alignment needs a sense of evolution
Global dynamic programming
- Sequences MDAGSTVILCFVG and MDASTILCGS are compared in a search matrix
- Evolution enters through the amino acid exchange matrix and the gap penalties (open, extension)
- Resulting alignment:
  MDAGSTVILCFVG-
  MDAAST-ILC--GS

How to determine similarity
Frequent evolutionary events at the DNA level:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We will restrict ourselves to substitutions, insertions and deletions.


Dynamic programming: scoring alignments
- Substitution (or match/mismatch) scores: DNA, proteins
- Gap penalty:
  Linear: gp(k) = ak
  Affine: gp(k) = b + ak
  Concave, e.g.: gp(k) = log(k)
The score for an alignment is the sum of the scores of all alignment columns.
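The three gap penalty shapes can be compared side by side. A minimal sketch: the parameter values a = 1 and b = 10 are illustrative assumptions, and the function names are hypothetical.

```python
import math


def linear_gap(k, a):
    """gp(k) = a*k: every gap position costs the same."""
    return a * k


def affine_gap(k, b, a):
    """gp(k) = b + a*k: a fixed opening cost b plus a per-position cost a."""
    return b + a * k


def concave_gap(k):
    """gp(k) = log(k): each additional gap position costs less than the previous one."""
    return math.log(k)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k, a=1), affine_gap(k, b=10, a=1), round(concave_gap(k), 2))
```

With an affine penalty, one long gap is cheaper than several short gaps of the same total length, because the opening cost b is paid only once.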

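Dynamic programming: scoring alignments

$$S_{a,b} = \sum_{l} s(a_i, b_j) - \sum_{k} N_k \cdot gp(k)$$

The first sum runs over the l matched positions of sequences a and b; N_k is the number of gaps of length k.
Affine gap penalties: gp(k) = gap_init + k · gap_extension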

DNA: define a score for match/mismatch of letters
Simple:
       A     C     G     T
  A    1    -1    -1    -1
  C   -1     1    -1    -1
  G   -1    -1     1    -1
  T   -1    -1    -1     1
Used in genome alignments:
       A     C     G     T
  A   91  -114   -31  -123
  C -114   100  -125   -31
  G  -31  -125   100  -114
  T -123   -31  -114    91

Dynamic programming: scoring alignments
  T D W V T A L K
  T D W L - - I K
- Amino acid exchange matrix (20 × 20)
- Affine gap penalties (open, extension): 10, 1
Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) - P_o - 2P_x + s(L,I) + s(K,K)
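The score expression on this slide can be evaluated directly. This sketch assumes the exchange values are the PAM250 entries for the six residue pairs involved, and that "10, 1" means gap open P_o = 10 and gap extension P_x = 1. The names `PAM250_SUBSET` and `score_alignment` are hypothetical, and consecutive gap columns are treated as one gap regardless of which sequence carries them.

```python
# PAM250 values for the residue pairs that occur in this alignment only.
PAM250_SUBSET = {
    ("T", "T"): 3, ("D", "D"): 4, ("W", "W"): 17,
    ("V", "L"): 2, ("L", "I"): 2, ("K", "K"): 5,
}


def pair_score(a, b):
    """Look up a symmetric exchange value."""
    if (a, b) in PAM250_SUBSET:
        return PAM250_SUBSET[(a, b)]
    return PAM250_SUBSET[(b, a)]


def score_alignment(row1, row2, gap_open, gap_ext):
    """Sum the column scores; a gap of length k costs gap_open + k * gap_ext."""
    score = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            if not in_gap:
                score -= gap_open  # opening cost, paid once per gap
                in_gap = True
            score -= gap_ext       # extension cost, paid per gap position
        else:
            score += pair_score(a, b)
            in_gap = False
    return score


print(score_alignment("TDWVTALK", "TDWL--IK", gap_open=10, gap_ext=1))  # 21
```

The six matched columns contribute 3 + 4 + 17 + 2 + 2 + 5 = 33, and the single gap of length 2 costs 10 + 2 · 1 = 12.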

Amino acid exchange matrices
- How do we get one (20 × 20)?
- And how do we get associated gap penalties?
First systematic method to derive a.a. exchange matrices: Margaret Dayhoff et al. (1968), Atlas of Protein Sequence and Structure

PAM250 matrix: amino acid exchange matrix (log odds)

A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z

Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to mutations that are avoided compared to the randomly expected situation.

Amino acid exchange matrices
Amino acids are not equal:
1. Some are easily substituted because they have similar physico-chemical properties or structure
2. Some mutations between amino acids occur more often due to similar codons
These two observations give us ways to define substitution matrices.

Pair-wise alignment: combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

  T D W V T A L K
  T D W L - - I K

By this formula:
- 2 sequences of 300 a.a.: ~10^179 alignments
- 2 sequences of 1000 a.a.: ~10^600 alignments!
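The size of the search space can be checked numerically. A minimal sketch, assuming the count of alignments of two sequences of length n is taken to be the binomial coefficient C(2n, n) from the slide; the function names are hypothetical and `math.comb` requires Python 3.8 or later.

```python
import math


def alignment_count_exponent(n):
    """Power of ten of C(2n, n), computed exactly with integer arithmetic."""
    return len(str(math.comb(2 * n, n))) - 1


def stirling_log10(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


for n in (10, 100, 300, 1000):
    print(n, alignment_count_exponent(n), round(stirling_log10(n), 2))
```

Even for short proteins the number of candidate alignments is far beyond exhaustive enumeration, which is the motivation for dynamic programming.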

Technique to overcome the combinatorial explosion: Dynamic Programming
- Alignment is modelled as a Markov process: all sequence positions are seen as independent
- Chances of sequence events are independent, therefore probabilities per aligned position need to be multiplied
- Amino acid exchange matrices contain so-called log-odds values (logarithms of odds ratios; base 10 in PAM250), so scores can be summed instead of multiplying probabilities

To say the same more statistically
To perform statistical analyses on messages or sequences, we need a reference model. The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment.
Given a probability distribution, P_i, for the letters in an i.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ..., n is simply P_i P_j P_k ... P_n. As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log P_i + log P_j + log P_k + ... + log P_n.
In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score.
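The equivalence between multiplying probabilities and summing logarithms is easy to verify. A minimal sketch, assuming a hypothetical uniform letter distribution over the DNA alphabet; the function names are illustrative.

```python
import math

# Hypothetical i.i.d. model: each nucleotide is equally likely.
LETTER_PROBS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def sequence_probability(seq, probs):
    """Multiply the per-letter probabilities."""
    p = 1.0
    for letter in seq:
        p *= probs[letter]
    return p


def sequence_log_probability(seq, probs):
    """Sum the per-letter log probabilities instead."""
    return sum(math.log(probs[letter]) for letter in seq)


direct = sequence_probability("ACGT", LETTER_PROBS)
via_logs = math.exp(sequence_log_probability("ACGT", LETTER_PROBS))
print(direct, via_logs)
```

Working in log space also avoids numerical underflow: the product of many small probabilities quickly becomes too small to represent, while the sum of their logarithms stays well behaved.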

Sequence alignment: history of the Dynamic Programming algorithm
- 1970: Needleman-Wunsch global pair-wise alignment
  Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443-53.
- 1981: Smith-Waterman local pair-wise alignment
  Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195-197.

Pairwise sequence alignment: global dynamic programming
- Sequences MDASTILCGS and MDAGSTVILCFVG are compared in a search matrix
- Evolution enters through the amino acid exchange matrix and the gap penalties (open, extension)
- Resulting alignment:
  MDAGSTVILCFVG-
  MDAAST-ILC--GS

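Global dynamic programming

$$H(i,j) = \max \begin{cases} H(i-1,j-1) + S(i,j) & \text{(diagonal)} \\ H(i-1,j) - g & \text{(vertical)} \\ H(i,j-1) - g & \text{(horizontal)} \end{cases}$$

S(i,j) is the value from the residue exchange matrix and g is the gap penalty. This is a recursive formula.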

Global dynamic programming: PAM250, Gap = 6 (linear)
Sequences: SHAKE (horizontal) and SPEARE (vertical). Higgs & Attwood, p. 124

Exchange values (copied from the PAM250 matrix, see earlier slide):
    S   H   A   K   E
S   2  -1   1   0   0
P   1   0   1  -1  -1
E   0   1   0   0   4
A   1  -1   2  -1   0
R   0   2  -2   3  -1
E   0   1   0   0   4

DP matrix:
           S    H    A    K    E
      0   -6  -12  -18  -24  -30
S    -6    2   -4  -10  -16  -22  -30
P   -12   -4    2   -3   -9  -15  -24
E   -18  -10   -3    2   -3   -5  -18
A   -24  -16   -9   -1    1   -3  -12
R   -30  -22  -14   -7    2    0   -6
E   -36  -28  -20  -13   -4    6    0
         -24  -18  -12   -6    0

The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps.
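The forward step for this example can be reproduced with a short script. This is a sketch of the linear-gap recursion, assuming the exchange values are the PAM250 entries for the residues involved; the names `SCORES` and `global_dp_matrix` are hypothetical.

```python
# PAM250 exchange values: outer key = residue of SPEARE, inner key = residue of SHAKE.
SCORES = {
    "S": {"S": 2, "H": -1, "A": 1, "K": 0, "E": 0},
    "P": {"S": 1, "H": 0, "A": 1, "K": -1, "E": -1},
    "E": {"S": 0, "H": 1, "A": 0, "K": 0, "E": 4},
    "A": {"S": 1, "H": -1, "A": 2, "K": -1, "E": 0},
    "R": {"S": 0, "H": 2, "A": -2, "K": 3, "E": -1},
}


def global_dp_matrix(rows_seq, cols_seq, gap):
    """Fill H(i,j) = max(diagonal + S(i,j), vertical - g, horizontal - g)."""
    n, m = len(rows_seq), len(cols_seq)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = -gap * i  # leading gaps in the column sequence
    for j in range(1, m + 1):
        H[0][j] = -gap * j  # leading gaps in the row sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = SCORES[rows_seq[i - 1]][cols_seq[j - 1]]
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
    return H


H = global_dp_matrix("SPEARE", "SHAKE", gap=6)
for row in H:
    print(row)
print("global alignment score:", H[-1][-1])  # 6
```

Each cell depends only on its three neighbours above and to the left, so the whole matrix is filled in time proportional to the product of the sequence lengths.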

Global dynamic programming: affine gap penalties

$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)\,p_x \} \\ S_{i-1,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)\,p_x \} \end{cases}$$

P_i is the gap opening penalty and p_x is the gap extension penalty.

Global dynamic programming: worked example with Gap_o = 10, Gap_e = 2
Sequences: DWVTALK (horizontal) and TDWVLK (vertical).
[Figure: exchange value table and DP matrix for the two sequences]
The exchange values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
The extra bottom row and rightmost column give the final global alignment scores.

Easy DP recipe for using affine gap penalties
M[i,j] is the optimal alignment (highest scoring alignment until [i,j]). Check:
- the preceding row until j-2: apply appropriate gap penalties
- the preceding column until i-2: apply appropriate gap penalties
- cell[i-1, j-1]: apply score for cell[i-1, j-1]

DP is a two-step process
- Forward step: calculate scores
- Trace back: start at highest score and reconstruct the path leading to the highest score
These two steps lead to the highest scoring alignment (the optimal alignment). This is guaranteed when you use DP!
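The two steps can be sketched together for global alignment. This is a simplified illustration, assuming a match/mismatch score of +1/-1 and a linear gap penalty of 2 (values chosen for the example, not taken from the slides). Because end gaps are penalised inside the matrix here, the traceback starts at the bottom-right cell. The function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=2):
    """Return (score, aligned_a, aligned_b) for a global alignment with linear gaps."""
    n, m = len(a), len(b)

    def s(x, y):
        return match if x == y else mismatch

    # Forward step: fill the score matrix.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = -gap * i
    for j in range(1, m + 1):
        H[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)

    # Trace back: repeatedly step to a predecessor cell that explains the current score.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return H[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 1
print(top)     # ACGT
print(bottom)  # A-GT
```

When several paths give the same score, this sketch reports only one of them (it prefers the diagonal step); all such paths are equally optimal.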


Semi-global pairwise alignment
- Global alignment: all gaps are penalised
- Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised
  MSTGAVLIY--TS-----
  ---GGILLFHRTSGTSNS
  (the leading and trailing gaps are end-gaps)

Semi-global dynamic programming: two examples with different gap penalties
[Figure: two DP matrices]
These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
Global score is 65 - 10 - 1*2 - 10 - 2*2

Semi-global pairwise alignment
Applications of semi-global alignment:
- Finding a gene in a genome
- Placing a marker onto a chromosome
- One sequence much longer than the other
Danger: if gap penalties are high, really bad alignments for divergent sequences

Local dynamic programming (Smith & Waterman, 1981)
- Sequences EDASTILCGS and LCFVMLAGSTVIVGTR are compared in a search matrix
- Amino acid exchange matrix (negative numbers)
- Gap penalties (open, extension)
- Resulting local alignment:
  AGSTVIVG
  A-STILCG

Local dynamic programming (Smith & Waterman, 1981)

$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)\,p_x \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)\,p_x \} \\ 0 \end{cases}$$

P_i is the gap opening penalty and p_x is the gap extension penalty.
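The effect of the extra 0 in the recursion can be seen in a simplified implementation. Note the swap: this sketch uses a linear gap penalty rather than the affine penalties of the formula above, with an assumed match/mismatch score of +1/-1 and a gap penalty of 2. The function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=2):
    """Return (score, aligned_a, aligned_b) for the best local alignment (linear gaps)."""
    n, m = len(a), len(b)

    def s(x, y):
        return match if x == y else mismatch

    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The 0 lets an alignment restart whenever the running score turns negative.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACGTCCC", "TTACGTAA"))  # (4, 'ACGT', 'ACGT')
```

Local alignment only works if the scoring scheme contains negative values: with non-negative scores every extension would increase the score and the result would always span the full sequences.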


Dot plots
- Way of representing (visualising) sequence similarity without doing dynamic programming (DP)
- Make the same matrix, but locally represent sequence similarity by averaging using a window

Comparing two sequences We want to be able to choose the best alignment between two sequences. A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.

Dot plots can be filtered by window approaches (to calculate running averages) and by applying a threshold. They can identify insertions, deletions and inversions.
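A filtered dot plot can be sketched as follows. This is an illustrative version that assumes simple identity scoring (1 for identical letters, 0 otherwise) and counts matches inside a window instead of averaging exchange values; the function name `dot_plot` and the default parameters are hypothetical.

```python
def dot_plot(seq_x, seq_y, window=1, threshold=1):
    """Return the plot as text rows: seq_x runs horizontally, seq_y vertically.

    A dot ('*') is placed at (i, j) when at least `threshold` of the `window`
    letter pairs starting at seq_x[i] and seq_y[j] are identical.
    """
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = []
        for i in range(len(seq_x) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


for line in dot_plot("MDAGSTVILCFVG", "MDASTILCGS", window=3, threshold=2):
    print(line)
```

Similar regions show up as diagonal runs of dots; a shift of a run from one diagonal to another indicates an insertion or deletion. Raising the window size and threshold removes isolated chance matches at the cost of missing short similarities.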