STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Similar documents
CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum

UC Berkeley Department of Electrical Engineering and Computer Sciences. EECS 126: Probability and Random Processes

Sequence analysis and Genomics

Pairwise Sequence Alignment. Robert Giegerich and David Wheeler. Version 2.01 (2 typos corrected w.r.t Version 2.0) May 21, 1996.

Quantifying sequence similarity

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

MMSE Equalizer Design

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

3 - Vector Spaces Definition vector space linear space u, v,

Lecture 8 January 30, 2014

Introduction to spectral alignment

= Prob( gene i is selected in a typical lab) (1)

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Consider this problem. A person s utility function depends on consumption and leisure. Of his market time (the complement to leisure), h t

Bloom Filters and Locality-Sensitive Hashing

Bio nformatics. Lecture 3. Saad Mneimneh

Some Classes of Invertible Matrices in GF(2)

Long run input use-input price relations and the cost function Hessian. Ian Steedman Manchester Metropolitan University

Chapter 3. Systems of Linear Equations: Geometry

Introduction To Resonant. Circuits. Resonance in series & parallel RLC circuits

and sinθ = cosb =, and we know a and b are acute angles, find cos( a+ b) Trigonometry Topics Accuplacer Review revised July 2016 sin.

Computational Biology

A Generalization of a result of Catlin: 2-factors in line graphs

Business Cycles: The Classical Approach

Algebra II Solutions

Announcements Wednesday, September 06

Bucket handles and Solenoids Notes by Carl Eberhart, March 2004

y(x) = x w + ε(x), (1)

Algorithms in Bioinformatics

Outline of lectures 3-6

1. Suppose preferences are represented by the Cobb-Douglas utility function, u(x1,x2) = Ax 1 a x 2

COS 424: Interacting with Data. Lecturer: Rob Schapire Lecture #15 Scribe: Haipeng Zheng April 5, 2007

A proof of topological completeness for S4 in (0,1)

Minimize Cost of Materials

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Pairwise & Multiple sequence alignments

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Fundamentals of Similarity Search

Why do Golf Balls have Dimples on Their Surfaces?

1 Effects of Regularization For this problem, you are required to implement everything by yourself and submit code.

Randomized Smoothing Networks

Polynomial and Rational Functions

Sequence Comparison. mouse human

NUMB3RS Activity: DNA Sequence Alignment. Episode: Guns and Roses

The Probability of Pathogenicity in Clinical Genetic Testing: A Solution for the Variant of Uncertain Significance

FIBONACCI NUMBERS AND DECIMATION OF BINARY SEQUENCES

Bivariate Uniqueness in the Logistic Recursive Distributional Equation

Sequence Alignment. Johannes Starlinger

Neural Networks. Associative memory 12/30/2015. Associative memories. Associative memories

Polynomial and Rational Functions

Dynamic Programming: Edit Distance

Introduction to Bioinformatics Algorithms Homework 3 Solution

Regression coefficients may even have a different sign from the expected.

Enumeration and symmetry of edit metric spaces. Jessie Katherine Campbell. A dissertation submitted to the graduate faculty

Lecture 3 Frequency Moments, Heavy Hitters

CS583 Lecture 11. Many slides here are based on D. Luebke slides. Review: Dynamic Programming

THE LINEAR BOUND FOR THE NATURAL WEIGHTED RESOLUTION OF THE HAAR SHIFT

Possible Cosmic Influences on the 1966 Tashkent Earthquake and its Largest Aftershocks

PHY 335 Data Analysis for Physicists

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh

Early & Quick COSMIC-FFP Analysis using Analytic Hierarchy Process

Lecture 4: September 19

Language Recognition Power of Watson-Crick Automata

Comparing whole genomes

reader is comfortable with the meaning of expressions such as # g, f (g) >, and their

Bioinformatics and BLAST

EECS730: Introduction to Bioinformatics

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Ch. 2 Math Preliminaries for Lossless Compression. Section 2.4 Coding

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Science Degree in Statistics at Iowa State College, Ames, Iowa in INTRODUCTION

Announcements Wednesday, September 06. WeBWorK due today at 11:59pm. The quiz on Friday covers through Section 1.2 (last weeks material)

Lecture 2: Pairwise Alignment. CG Ron Shamir

Effects of Gap Open and Gap Extension Penalties

Artificial Neural Networks. Part 2

BLAST: Basic Local Alignment Search Tool

An Investigation of the use of Spatial Derivatives in Active Structural Acoustic Control

The spatial arrangement of cones in the fovea: Bayesian analysis

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM

Intermediate 2 Revision Unit 3. (c) (g) w w. (c) u. (g) 4. (a) Express y = 4x + c in terms of x. (b) Express P = 3(2a 4d) in terms of a.

OUTLINING PROOFS IN CALCULUS. Andrew Wohlgemuth

Analysis and Design of Algorithms Dynamic Programming

Knots, Coloring and Applications

Enhancing Generalization Capability of SVM Classifiers with Feature Weight Adjustment

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

1 Maintaining a Dictionary

Lecture 5,6 Local sequence alignment

Outline. Similarity Search. Outline. Motivation. The String Edit Distance

Multidimensional Possible-world Semantics for Conditionals

These notes give a quick summary of the part of the theory of autonomous ordinary differential equations relevant to modeling zombie epidemics.

On A Comparison between Two Measures of Spatial Association

Approximation: Theory and Algorithms

Average Coset Weight Distributions of Gallager s LDPC Code Ensemble

Toppling Conjectures

BIOINFORMATICS: An Introduction

The Dot Product

Effects of Asymmetric Distributions on Roadway Luminance

TABLEAUX VARIANTS OF SOME MODAL AND RELEVANT SYSTEMS

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Transcription:

STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity is an important issue in sequence alignments. In molecular biology, proteins and DNA can be similar ith respect to their function, their structure, or their primary sequence of amino or nucleic acids. The general rule is that sequence determines shape, and shape determines function. So hen e study sequence similarity, e eventually hope to discover or validate similarity in shape and function. This approach is often successful. Hoever, there are many examples here to sequences have little or no similarity, but still the molecules fold into the same shape and share the same function. This lecture speaks of neither shape nor function. We are focusing on DNA sequences. Sequences are seen as strings of characters. In fact, the ideas and techniques e discuss in this lecture have important applications in text processing, too. Similarity has both a quantitative and a qualitative aspect: A similarity measure gives a quantitative anser, saying that to sequences sho a certain degree of similarity. An alignment is a mutual arrangement of to sequences hich is a sort of qualitative anser; it exhibits here the to sequences are similar, and here they differ. An optimal alignment, of course, is one that exhibits the most correspondences, and the least differences. Edit Distance To ays are used to quantify similarity of to sequences: A similarity measure is a function that associates a numeric value ith a pair of sequences, ith the idea that a higher value indicates greater similarity. Beyond this, similarity measures vary idely, and care must be taken hen interpreting similarity measures. The notion of distance is somehat dual to similarity. It treats sequences as points in a metric space. A distance measure is a function that also associates a numeric value ith a pair of sequences, but ith the idea that the larger the distance, the smaller the similarity, and vice versa. Distance measures usually satisfy the mathematical axioms of a metric. In particular, distance values are never negative. In most cases, distance and similarity measures are interchangeable in the sense that a small distance means high similarity, and vice versa. Hamming Distance For to sequences of equal length, e ust count the character positions in hich they differ. For example: sequence s AAT AGCAT GATCGTG sequence t TAA ACAAT GTCGTGG Hamming distance(s, t) 2 2 5 This distance measure is very useful in some cases, but in general it is not flexible enough. First of all, the sequences may have different length. Second, there is generally no fixed correspondence beteen their character positions. In the mechanism of DNA replication, errors like deleting or inserting a nucleotide are not unusual. Although the rest of the sequences are identical, such a shift of position leads to exaggerated values in the Hamming distance. Look at the rightmost example above. The Hamming distance says that s and t are apart by 5 characters (out of 7). On the other hand, by deleting the second character A from s and the last character G from t, both become equal to GTCGTG. In this sense, they are only to characters apart! Let us model the distance of s and t by introducing a gap character "-" and edition operations including insertion, deletion, match, and

replacement (mismatch). We say that the pair (a, a) denotes a match (no change from s to t), (a, -) denotes deletion of character a in s, (a, b) denotes replacement of a (in s) by b (in t), here a b, (-, b) denotes insertion of character b (in s). Since the problem is symmetric in s and t, a deletion in s can be seen as an insertion in t, and vice versa. An alignment of to sequences s and t is an arrangement of s and t by position, here s and t can be padded ith gap symbols to achieve the same length: s : G A T C G T G -- or s : G A T C G T -- G Next e turn the edit protocol into a measure of distance by assigning a "cost" or "eight" to each operation (deletion, insertion, match, replacement). We define (a, a) = 0 (a, b) = 1 for a b (a, - ) = ( -, a) = 1 This scheme is knon as the Levenshtein Distance, also called unit cost model. Its predominant virtue is its simplicity. In general, more sophisticated cost models must be used. For example, replacing an amino acid by a biochemically similar one should eight less than a replacement by an amino acid ith totally different properties. We ill discuss those things later. No e are ready to define the most important notion for sequence analysis: The cost of an alignment of to sequences s and t is the sum of the costs of all the edit operations that lead from s to t. An optimal alignment of s and t is an alignment hich has minimal cost among all possible alignments. The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function. We denote it by d( s, t ). Using the unit cost model for in our previous example, e obtain the folloing cost: s : G A T C G T G -- cost: 2 d (, ) s t = 2

Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (II) Calculating Edit Distances and Optimal Alignments The number of possible alignments beteen to sequences is gigantic, and unless the eight function is very simple, it may seem difficult to pick out an optimal alignment. But fortunately, there is an easy and systematic ay to find it. The algorithm described no is very famous in biocomputing, it is usually called the dynamic programming algorithm. For s = s... 1s2 sm and t = t... 1t2 tn let us consider to prefixes 0 : s : i = s... 1s2 si and 0 : t : = t t t ith i, 1. We assume e already kno optimal alignments beteen all shorter 1 2... prefixes of and, in particular of 1. 0 : s : (i - 1)and 0 : t : ( - 1), of 2. 0 : s : (i - 1)and 0 : t :, and of 3. 0 : s : i and 0 : t : ( - 1). An optimal alignment of 0 : s : i and 0 : t : must be an extension of one of the above by 1. a Replacement (mismatch) (si, t ), or a Match (si, t ), depending on hether si = t. 2. a Deletion (si, ), or 3. an Insertion (, t ). Edit Distance (Recursion) We simply have to choose the minimum: d ( 0 : s : i, 0 : t : ) = min{d ( 0 : s : (i - 1), 0 : t : ( - 1) )+ (s,t ), i d ( 0 : s : (i - 1), 0 : t : )+ (s, ), d ( 0 : s : i, 0 : t : ( - 1) )+ (, t )} There is no choice hen one of the prefixes is empty, i.e. i = 0, or = 0, or both: d ( 0 : s : 0, 0 : t : 0 )= 0 d ( 0 : s : i, 0 : t : 0 )= d ( 0 : s : (i - 1), 0 : t : 0 )+ (s, ) for i = 1,..., m i d ( 0 : s : 0, 0 : t : )= d ( 0 : s : 0, 0 : t : ( - 1) )+ (, t ) for = 1,..., n According to this scheme, and for a given, the edit distances of all prefixes of i and define an (m+ 1) (n+ 1) distance matrix D = (d i ) ith d i = d ( 0 : s : i, 0 : t : ). The three-ay choice in the minimization formula for d i leads to the folloing pattern of dependencies beteen matrix elements: i The bottom right corner of the distance matrix contains the desired result: d mn = d ( ) ( 0 : s : m, 0 : t : n )= d s,t.

Edit Distance (Matrix) Belo is the distance matrix for the example ith s = AGCACACA, t = ACACACTA (the cost for a match is 0; the cost for a deletion or an insertion or a replacement is 1): In the second diagram, e have dran a path through the distance matrix indicating hich case as chosen hen taking the minimum. A diagonal line means Replacement (Mismatch) or Match, a vertical line means Deletion, and a horizontal line means Insertion. Thus, this path indicates the edit operation protocol of the optimal alignment ith d ( s,t )= 2. Please note that in some cases the minimal choice is not unique and different paths could have been dran hich indicate alternative optimal alignments. Here is an example here the optimal alignment is not unique. Let s = TTCC, t = AATT. In hich order should e calculate the matrix entries? The only constraint is the above pattern of dependencies. The most common order of calculation is line by line (each line from left to right), or column by column (each column from top-to-bottom). A Word on the Dynamic Programming Paradigm Dynamic Programming is a very general programming technique. It is applicable hen a large search space can be structured into a succession of stages, such that The initial stage contains trivial solutions to sub-problems, Each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in an earlier stage,

The final stage contains the overall solution. This applies to our distance matrix: The columns are the stages, the first column is trivial, the final one contains the overall result. A matrix entry d i, is the partial solution d ( 0 : s : i, 0 : t : ) and can be determined from to solutions in the previous column d i- 1, -1 and d i, -1 plus one in the same column, namely d i- 1,. Since calculating edit distances is the predominant approach to sequence comparison, some people simply call this THE dynamic programming algorithm. Just note that the dynamic programming paradigm has many other applications as ell, even ithin bioinformatics. A Word on Scoring Functions and Related Notions Many authors use different ords for essentially the same idea: scores, eights, costs, distance and similarity functions all attribute a numeric value to a pair of sequences. distance should only be used hen the metric axioms are satisfied. In particular, distance values are never negative. The optimal alignment minimizes distance. The term costs usually implies positive values, ith the overall cost to be minimized. Hoever, metric axioms are not assumed. eights and scores can be positive or negative. The most popular use is that a high score is good, i.e. it indicates a lot of similarity. Hence, the optimal alignments maximize scores. The term similarity immediately implies that large values are good, i.e. an optimal alignment maximizes similarity. Intuitively, one ould expect that similarity values should not be negative (hat is less than zero similarity?). But don't be surprised to see negative similarity scores shortly. Mathematically, distances are a little more tractable than the others. In terms of programming, general scoring functions are a little more flexible. Let us close ith another caveat concerning the influence of sequence length on similarity. Let us ust count exact matches and let us assume that to sequences of length n and m, respectively, have 99 exact matches. Let c ( s,t) be the similarity score calculated for s and t under this cost model. So, c ( ) 99 s,t =. What this means depending on n and m: If n = m = 100, the sequences are very similar - almost identical. If n = m = 1000, e have only 10% identity! So if e relate sequences of varying length, it makes sense to use length-relative scores - rather than c( s,t) e use c ( ) /( ) s,t n + m for sequence comparison.