Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties


Computational sequence analysis

The major goal of computational sequence analysis is to predict the function and structure of genes and proteins from their sequence. This is made possible since organisms evolve by mutation, duplication and selection of their genes. Thus, sequence similarity often indicates functional and structural similarity.

5' ATCAGAGTC 3'
5' TTCAGTC 3'

ATC CTA AG GA etc.

ATCAGAGTC
TTCA--GTC
 +++^^+++

We wish to identify which regions are most similar to each other in the two sequences. The sequences are shifted relative to one another and gaps are introduced, to cover all possible alignments. The shifts and gaps provide the steps by which one sequence can be converted into the other.

Dot-plot

[Figure: dot-plot of the sequences ATCAGAGTC and TTCAGTC, with the alignment TCAGAGTC / TCA--GTC marked.]
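A dot-plot can be sketched in a few lines. The function name `dot_plot` and the choice of `*` and `.` as marks are assumptions made for illustration only; the slide shows only the picture.

```python
def dot_plot(col_seq, row_seq):
    """Return one text row per residue of row_seq.

    A '*' marks a position where the two residues are identical,
    a '.' marks every other position.
    """
    return [
        "".join("*" if c == r else "." for c in col_seq)
        for r in row_seq
    ]


col_seq, row_seq = "ATCAGAGTC", "TTCAGTC"
print("  " + col_seq)
for residue, row in zip(row_seq, dot_plot(col_seq, row_seq)):
    print(residue + " " + row)
```

Runs of `*` along a diagonal correspond to stretches that align without gaps; a jump from one diagonal to another corresponds to a gap.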

Scoring

Substitution matrix: the similarity value between each pair of residues

    A  C  G  T
A   2  0  0  0
C   0  2  0  0
G   0  0  2  0
T   0  0  0  2

Gap penalty: the cost of introducing gaps

Gap penalty -2

ATCAGAGTC
TTCA--GTC
 +++^^+++  : 0+2+2+2-2-2+2+2+2 = 8
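The arithmetic above can be checked with a small sketch. The function name `alignment_score` and the constant names are hypothetical; the values (match 2, mismatch 0, gap -2) are those given on the slide.

```python
MATCH, MISMATCH, GAP = 2, 0, -2  # scoring scheme from the slide


def alignment_score(row_a, row_b):
    """Sum the column scores of an alignment given as two gapped strings."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += GAP        # every gapped column costs the same
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


print(alignment_score("ATCAGAGTC", "TTCA--GTC"))  # 8
```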

Pairwise residue scores for ATCAGAGTC (columns) and TTCAGTC (rows), with match 2, mismatch 0 and gap penalty -2:

    A  T  C  A  G  A  G  T  C
T   0  2  0  0  0  0  0  2  0
T   0  2  0  0  0  0  0  2  0
C   0  0  2  0  0  0  0  0  2
A   2  0  0  2  0  2  0  0  0
G   0  0  0  0  2  0  2  0  0
T   0  2  0  0  0  0  0  2  0
C   0  0  2  0  0  0  0  0  2

Initialization

Position 3,2 can be reached in three ways:

[T2 T1] + [a b]    [C3 T1] + [- b]    [T2 T2] + [a -]
ATC                ATC-               ATC
-TT                --TT               TT-

Initialization (match 2, mismatch 0, gap penalty -2): the first row and the first column hold the cumulative gap penalties.

        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   2   0   0   0   0   0   2   0
T  -4   0   2   0   0   0   0   0   2   0
C  -6   0   0   2   0   0   0   0   0   2
A  -8   2   0   0   2   0   2   0   0   0
G -10   0   0   0   0   2   0   2   0   0
T -12   0   2   0   0   0   0   0   2   0
C -14   0   0   2   0   0   0   0   0   2

Directionality of score calculation: each cell is extended from one of three neighbours, [a b], [a -] or [- b].

Filled score matrix (match 2, mismatch 0, gap penalty -2):

        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6  -4   0   4   2   0  -2  -4  -6  -6
A  -8  -4  -2   2   6   4   2   0  -2  -4
G -10  -6  -4   0   4   8   6   4   2   0
T -12  -8  -4  -2   2   6   8   6   6   4
C -14 -10  -6  -2   0   4   6   8   6   8

Needleman-Wunsch algorithm

σ[a b] : score of aligning a pair of residues a and b
σ[a -] : score of aligning residue a with a gap (gap penalty: -q)
S : score matrix
S(i,j) : optimal score of aligning residues at positions 1 to i of one sequence with residues at positions 1 to j of the other sequence

Needleman-Wunsch algorithm

S(0,0) ← 0
for j ← 1 to N do
    S(0,j) ← S(0,j-1) + σ[- b_j]
for i ← 1 to M do {
    S(i,0) ← S(i-1,0) + σ[a_i -]
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ[a_i b_j],
                      S(i-1,j) + σ[a_i -],
                      S(i,j-1) + σ[- b_j] )
}

Pearson & Miller, Meth Enz 210:575, 92
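The pseudocode translates almost line for line into Python. This is a minimal sketch, not the authors' implementation: the function name `nw_matrix` and its parameter names are assumptions, and the default scores are the ones used in the class example (match 2, mismatch 0, gap -2).

```python
def nw_matrix(row_seq, col_seq, match=2, mismatch=0, gap=-2):
    """Fill the Needleman-Wunsch score matrix with a linear gap penalty.

    S[i][j] is the best score for aligning row_seq[:i] with col_seq[:j].
    """
    m, n = len(row_seq), len(col_seq)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):               # first row: cumulative gap penalties
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap         # first column: cumulative gap penalties
        for j in range(1, n + 1):
            sub = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sub,      # align the two residues
                S[i - 1][j] + gap,          # residue of row_seq against a gap
                S[i][j - 1] + gap,          # residue of col_seq against a gap
            )
    return S


S = nw_matrix("TTCAGTC", "ATCAGAGTC")
for row in S:
    print(" ".join("%3d" % value for value in row))
print("optimal score:", S[-1][-1])
```

With TTCAGTC as rows and ATCAGAGTC as columns, the printed matrix has the same layout as the filled matrix shown in class, and the bottom-right cell holds the optimal score.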

The algorithm finds the optimal score; further steps are needed to recover the corresponding alignment(s). This is a time-saving property in database searches and other applications where only the score is required, since only a single pass through the alignment matrix is needed.

The traceback

Starting from the bottom-right cell of the filled matrix (score 8), the steps that produced the optimal score are followed back to the top-left cell. Three alignments share the optimal score:

ATCAGAGTC
TTCAG--TC
 ++++^^++  : 0+2+2+2+2-2-2+2+2 = 8

ATCAGAGTC
TTC--AGTC
 ++^^++++  : 0+2+2-2-2+2+2+2+2 = 8

ATCAGAGTC
TTCA--GTC
 +++^^+++  : 0+2+2+2-2-2+2+2+2 = 8
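One way to recover every optimal alignment is to fill the matrix and then walk back from the last cell, following each neighbour that could have produced the cell's score. The sketch below assumes the class scoring scheme (match 2, mismatch 0, gap -2); the function name `nw_all_alignments` and the recursive helper `walk` are hypothetical names chosen for illustration.

```python
def nw_all_alignments(a, b, match=2, mismatch=0, gap=-2):
    """Return (optimal score, list of all optimal global alignments of a and b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)

    results = []

    def walk(i, j, top, bottom):
        # top and bottom are built back to front and reversed at the end
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                walk(i - 1, j - 1, top + a[i - 1], bottom + b[j - 1])
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, top + a[i - 1], bottom + "-")
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, top + "-", bottom + b[j - 1])

    walk(m, n, "", "")
    return S[m][n], results


score, alignments = nw_all_alignments("ATCAGAGTC", "TTCAGTC")
print("optimal score:", score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

The number of optimal alignments can grow quickly with sequence length, so practical tools usually report a single traceback path rather than all of them.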

The algorithm calculates the score(s) of optimal global sequence alignments, penalizes end gaps, and penalizes each residue in a gap equally.

ATCAGAGTC   has a lower score than   CAGAGTC
--TTCAGTC                            TTCAGTC

ATCACAGTC   has the same score as    ATCACAGTC
T-C--AGTC                            T---CAGTC

ATCACAGTC   has a lower score than   ACACAGTC
T---CAGTC                            T--CAGTC

In order to apply a gap penalty q that is independent of the gap length, i.e. so that

ACACAGTC   ATCACAGTC   AGCTTTCACAGTC
T--CAGTC   T---CAGTC   T-------CAGTC

all have the same score, the algorithm we presented is modified to extend alignments in more than the three ways we considered.

[Figure: the pairwise residue score matrix for ATCAGAGTC (columns) and TTCAGTC (rows), annotated with the possible extensions into a cell: [a b] from the diagonal neighbour, and [a -] or [- b] from any earlier cell in the same row or column.]

Needleman-Wunsch algorithm (gap penalty independent of gap length)

S(0,0) ← 0
for j ← 1 to N do
    S(0,j) ← -q
for i ← 1 to M do {
    S(i,0) ← -q
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ[a_i b_j],
                      max{S(0,j) ... S(i-1,j)} - q,
                      max{S(i,0) ... S(i,j-1)} - q )
}

Pearson & Miller, Meth Enz 210:575, 92
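A direct, unoptimized sketch of this variant follows. The function name `nw_constant_gap` is an assumption, as are the default values (match 2, mismatch 0, q = 2), which simply reuse the numbers from the class example. The inner `max` calls scan a whole column and a whole row, so this version takes more time per cell than the three-way recursion.

```python
def nw_constant_gap(a, b, match=2, mismatch=0, q=2):
    """Optimal global score when every gap costs q, whatever its length."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = -q                                   # one leading gap
    for i in range(1, m + 1):
        S[i][0] = -q                                   # one leading gap
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best_in_column = max(S[k][j] for k in range(i))   # S(0,j) ... S(i-1,j)
            best_in_row = max(S[i][k] for k in range(j))      # S(i,0) ... S(i,j-1)
            S[i][j] = max(
                S[i - 1][j - 1] + sub,
                best_in_column - q,                    # gap of any length ending here
                best_in_row - q,
            )
    return S[m][n]


for seq in ("ACACAGTC", "ATCACAGTC", "AGCTTTCACAGTC"):
    print(seq, nw_constant_gap(seq, "TCAGTC"))
```

The three sequences from the example above give the same optimal score against TCAGTC, because the single gap costs q regardless of its length.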

Caveats

Every algorithm is limited by the model it is built upon. For example, the NW dynamic programming algorithm guarantees optimal global alignments for the parameters we supply (substitution matrix, gap penalty and gap scoring). However:

Different parameters can give different alignments.
The correct alignment might not be the optimal one.
The correct alignment might correspond to only part of the global alignment.

More details, sources and things to do for next class

Source: Pearson WR & Miller W, "Dynamic programming algorithms for biological sequence comparison." Methods in Enzymology, 210:575-601 (1992).

Assignment: Calculate NW alignments with a constant gap penalty, observing the effect of different gap penalties and match/mismatch scores. In all cases use substitution matrices that have only two types of scores: a value for an exact match and a lower value for mismatches. Try the nucleotide sequences used in class and the following amino acid sequences: ACDGSMF & AMDFR.