CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Similar documents
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Algorithms in Bioinformatics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sequence analysis and Genomics

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

In-Depth Assessment of Local Sequence Alignment

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Large-Scale Genomic Surveys

Pairwise & Multiple sequence alignments

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Local Alignment: Smith-Waterman algorithm

Bioinformatics and BLAST

Sequence analysis and comparison

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Tools and Algorithms in Bioinformatics

Substitution matrices

Practical considerations of working with sequencing data

Pairwise sequence alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Computational Biology

Single alignment: Substitution Matrix. 16 march 2017

Scoring Matrices. Shifra Ben-Dor Irit Orr

Pairwise sequence alignments

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Sequence Alignment Techniques and Their Uses

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Tools and Algorithms in Bioinformatics

Introduction to Bioinformatics

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Quantifying sequence similarity

Alignment & BLAST. By: Hadi Mozafari KUMS

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Basic Local Alignment Search Tool

Pairwise Sequence Alignment

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

1.5 Sequence alignment

Fundamentals of database searching

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Motivating the need for optimal sequence alignments...

Sequence Alignment (chapter 6)

Similarity or Identity? When are molecules similar?

Bio nformatics. Lecture 3. Saad Mneimneh

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan

Algorithms in Bioinformatics I, ZBIT, Uni Tübingen, Daniel Huson, WS 2003/4 1

Bioinformatics for Biologists

Moreover, the circular logic

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Sequence Comparison. mouse human

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

Lecture 5,6 Local sequence alignment

Collected Works of Charles Dickens

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Lecture 2: Pairwise Alignment. CG Ron Shamir

Global alignments - review

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

String Matching Problem

Week 10: Homology Modelling (II) - HHpred

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

CSE 549: Computational Biology. Substitution Matrices

EECS730: Introduction to Bioinformatics

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

Analysis and Design of Algorithms Dynamic Programming

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

Local Alignment Statistics

Protein function prediction based on sequence analysis

Introduction to Bioinformatics Online Course: IBT

Sequence Database Search Techniques I: Blast and PatternHunter tools

Lecture 4: September 19

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Comparing whole genomes

Exercise 5. Sequence Profiles & BLAST

Lecture 5: September Time Complexity Analysis of Local Alignment

Tutorial 4 Substitution matrices and PSI-BLAST

BIOINFORMATICS: An Introduction

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Sequence comparison: Score matrices

Administration. ndrew Torda April /04/2008 [ 1 ]

Substitution Matrices

Introduction to Sequence Alignment. Manpreet S. Katari

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

A profile-based protein sequence alignment algorithm for a domain clustering database

Copyright 2000 N. AYDIN. All rights reserved. 1

Heuristic Alignment and Searching

Transcription:

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST 1

Sequence Alignment Motivation Sequence assembly: reconstructing long DNA sequences from overlapping sequence fragments Annotation: assign functions to newly discovered genes Raw genomic (DNA) sequences -> coding sequences (CDS), candidate for genes -> protein sequence -> function Terminologies: cdna, RNA, mrna Evolution: mutation -> sequence diversity (vs homology) -> (new) phenotype? Corner stone: sequence similarity -> sequence homology -> same function Alignment algorithms What is an alignment? A one-to-one matching of two sequences so that each character in a pair of sequences is associated with a single character of the other sequence or with a null character (gap). Alignments are often displayed as two rows with an optional third row in between pointing out regions of similarity. Example: Types of alignment: pairwise vs multiple; global vs local Algorithms Rigorous heuristic 2

PAM matrices (Margaret Dayhoff, 1978) point accepted mutation or percent accepted mutation unit of measurement of evolutionary divergence between two amino acid sequences substitute matrices (scoring matrices) 1 PAM = one accepted point-mutation event per onehundred amino acids PAM (cont d) caveat: Sequences s1 and s2 are x PAM divergent does not imply s1 and s2 have x percent sequence difference ( should be equal or less). facts: 1. even amino acid sequences that have diverged by 200 PAM units are expected to be identical in about 25% of their positions. 2. sequences that are 250 PAM units diverged can generally be distinguished from a pair of random sequences. 3

PAM matrix is a 20 by 20 matrix, and each element p ij represents the expected evolutionary exchange between the two corresponding amino acids for sequences that are a specific number of PAM units diverged. That is, p ij = log[f(i,j)/f(i)f(j)] where f(i) and f(j) are the frequencies that amino acids A i and A j appear in the sequences, and f(i,j) the frequency that A i and A j are aligned. By construction, PAM n matrices with n > 1 are extrapolated from those with lower n. Namely, it assumes that the frequencies of the amino acids remain constant over time and that the mutational processes causing replacements in an interval of one PAM unit operate the same for longer periods. Ideally, to align two sequences that are x PAM units diverged, the PAM x matrix should be used. Practically, since you do not know a priori how much diverged given two sequences, you may need to try different PAM matrices. Empirically, PAM 250 matrix is reported to be very effective for aligning sequences used in evolutionary studies. 4

BLOSUM matrices [Steven and Jorja Henikoff] - BLOSUM x matrix is a 20 by 20 matrix. Its elements are defined like those of PAM matrices but the frequencies are collected from sequences in BLOCKs database that are less than x percent identical (generally x is between 50 and 80). - By their construction, BLOSUM matrices are believed to be more effectively detect distant homology. - in place of PAM 250, BLOSUM 62 is the default matrix used in database search. 5

PROSITE is a "dictionary of sites and patterns in proteins" that is linked to the protein sequence database Swiss-Prot. The signature patterns in PROSITE are written as regular expressions. For example, G-[GN]-[SGA]-G-x-R-[SGA]-C-x(2)-[IV] is a PROSITE pattern. BLOCKS is a database of protein motifs that is derived from the PROSITE library. Needleman-Wunsch algorithm (Global Pairwise optimal alignment, 1970) To align two sequences x[1...n] and y[1...m], i) if x at i aligns with y at j, a score s(x i, y j ) is added; if either x i or y j is a gap, a score of d is subtracted (penalty). ii) The best score up to (i,j) will be F(i,j) = max { F(i-1, j-1) + s(x i, y j ), } F(i-1,j) - d, F(i, j-1) - d 6

Needleman-Wunsch (cont d) iii) Tabular computing to get F(i,j) for all 1<i<n and i<j<m Draw a diagram: F(i-1, j-1) F(i, j-1) s(x i,y j ) -d F(i-1, j) -d F(i,j) By definition, F(n,m) gives the best score for an alignment of x[1...n] and y[1...m]. iv) Trace-back To find the alignment itself, we must find the path of choices (in applying the formulae of ii) when tabular computing that led to this final value. > Vertical move is gap in the column sequence. > Horizontal move is gap in the row sequence. > Diagonal move is a match. 7

Example: Align HEAGAWGHEE and PAWHEAE. Use BLOSUM 50 for substitution matrix and d=-8 for gap penalty. (page 21 in the text) H E A G A W G H E E 0-8 -16-24 -32-40 -48-56 -64-72 -80 P -8-2 -9-17 -25-33 -42-49 -57-65 -73 A -16-10 -3-4 -12-20 -28-36 -44-52 -60 W -24-18 -11-6 -7-15 -5-13 -21-29 -37 H -32-14 -18-13 -8-9 -13-7 -3-11 -19 E -40-22 -8-16 -16-9 -12-15 -7 3-5 A -48-30 -16-3 -11-11 -12-12 -15-5 2 E -56-38 -24-11 -6-12 -14-15 -12-9 1 Time complexity: O(nm) Space complexity: O(nm) Big-O notation: f(x) = O(g(x)) => f is upper bound by g f(x) = Omega(g(x)) => f is lower bound by g f(x) = Theta(g(x)) => f is bound to g within constant factors 8

Local pairwise optimal alignment why need local alignment (vs global)? - mosaic structure ( functioning domains) of proteins e.g., are these three sequences similar or not? s1 s2 s3 Local alignment naive algorithm: there are (n 2 m 2 ) pairs of substrings; to align each pair as a global alignment problem will take O(nm); the optimal local alignment will therefore take O(n 3 m 3 ). Smith-Waterman algorithm (dynamic programming) recurrence relationship F(i,j) = max { 0, F(i-1, j-1) + s(x i, y j ), F(i-1,j) - d, F(i, j-1) - d } Notes: 1) For this to work, the random match model must have a negative score. 2) The time complexity of Smith-Waterman is (n 2 m 2 ). 9

Example: Align HEAGAWGHEE and PAWHEAE. Use BLOSUM 50 for substitution matrix and d=-8 for gap penalty. (page 23 in the text) H E A G A W G H E E 0 0 0 0 0 0 0 0 0 0 0 P 0 0 0 0 0 0 0 0 0 0 0 A 0 0 0 5 0 5 0 0 0 0 0 W 0 0 0 0 2 0 20 12 0 0 0 H 0 10 2 0 0 0 12 18 22 14 6 E 0 2 16 8 0 0 4 10 18 28 20 A 0 0 8 21 13 5 0 4 10 20 27 E 0 0 0 13 18 12 4 0 4 16 26 Heuristic alignment algorithms - motivation: speed sequence DB ~ O(100,000,000) basepair query sequence 1000 basepair O(nm) time complexity => 10 11 matrix cells in dynamic programming table if 10,000,000 cells/second => 10000 seconds ~ 3 hours. - heuristic versus rigorous 10