Computational Biology Lecture 8: Substitution matrices Saad Mneimneh

As introduced last time, simple scoring schemes like +1 for a match, -1 for a mismatch and -2 for a gap are not justifiable biologically, especially for amino acid sequences (proteins). Instead, more elaborate scoring functions are used. These scores are usually obtained by analyzing chemical properties and statistical data for amino acids and DNA sequences. For example, it is known that amino acids of the same size are more likely to be substituted by one another. Similarly, amino acids with the same affinity to water are likely to serve the same purpose in some cases. On the other hand, some mutations are not acceptable (they may lead to the demise of the organism). PAM and BLOSUM matrices are among the results of such analysis. We will see the techniques through which PAM and BLOSUM matrices are obtained.

Substitution matrices

Chemical properties of amino acids govern how the amino acids substitute one another. In principle, a substitution matrix $s$, where $s_{ij}$ is used to score aligning character $i$ with character $j$, should reflect the probability of two characters substituting one another. The question is how to build such a probability matrix that closely maps reality. Different strategies result in different matrices, but the central idea is the same. If we go back to the concept of a high scoring segment pair, theory tells us that the (ungapped) alignment given by such a segment is governed by a limiting distribution such that

$$q_{ij} = p_i p_j e^{\lambda s_{ij}}$$

where:

- $s$ is the substitution matrix used
- $q_{ij}$ is the probability of observing character $i$ aligned with character $j$
- $p_i$ is the probability of occurrence of character $i$

Therefore,

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$$

This formula for $s_{ij}$ suggests a way to construct the matrix $s$. If high scoring alignments are to be real, $\{q_{ij}\}$ represent the desired probabilities of substitutions, while $\{p_i\}$ represent the background probabilities of occurrence. By observing related families of sequences, one could estimate $q$ and $p$ and hence obtain the matrix $s$ by using some scaling factor $\lambda$.
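
As a minimal sketch of this construction, the Python below turns estimated pair probabilities and background probabilities into a score matrix. The two-letter alphabet, the numbers in `p` and `q`, and the function name `log_odds_matrix` are illustrative assumptions, not values from the lecture.

```python
import math


def log_odds_matrix(q, p, lam=1.0):
    """Return s[a][b] = (1/lam) * ln(q[a][b] / (p[a] * p[b]))."""
    return {
        a: {b: math.log(q[a][b] / (p[a] * p[b])) / lam for b in p}
        for a in p
    }


# Hypothetical two-letter alphabet; the probabilities are made up for illustration.
# The rows of q sum to the background probabilities in p.
p = {"A": 0.5, "B": 0.5}
q = {"A": {"A": 0.4, "B": 0.1}, "B": {"A": 0.1, "B": 0.4}}

# Choosing lam = ln 2 expresses the scores in bits.
s = log_odds_matrix(q, p, lam=math.log(2))
```

With these numbers, pairs that occur more often than chance (A with A) get a positive score and pairs that occur less often than chance (A with B) get a negative one; changing `lam` only rescales the matrix.
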
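Note that

$$\lambda \sum_{i,j} p_i p_j s_{ij} = \sum_{i,j} p_i p_j \ln \frac{q_{ij}}{p_i p_j}$$

Information theory tells us that the above sum is strictly less than 0 if $p \neq q$ (that is, if $q_{ij} \neq p_i p_j$ for some pair $i, j$), which is a desired property: the expected score of a random alignment is negative. Also, note that the score $S'$ in bits of a given segment with raw score $S$ (see previous lecture) is

$$S' = \log_2 \frac{e^{\lambda S}}{K}$$

Since $S$ is a sum of terms of the form $\frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$, $e^{\lambda S}$ is a product of terms of the form $\frac{q_{ij}}{p_i p_j}$. Therefore, $S'$ reflects the log likelihood of observing the alignment due to substitution (governed probabilistically by $q$) relative to simply by chance (governed probabilistically by $p$). Division by the constant $K$ adjusts the score to account for the rate of observing maximal scoring segments, as described in the previous lecture.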
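
Both observations can be checked numerically. The sketch below reuses a hypothetical two-letter alphabet (the probabilities and the sequences `x` and `y` are assumptions for illustration). It computes the expected score of a random pair under the background model and confirms that the exponentiated segment score equals the product of the per-column odds.

```python
import math

# Hypothetical background and pair probabilities (illustration only).
p = {"A": 0.5, "B": 0.5}
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
lam = 1.0


def s(a, b):
    """Score of aligning a with b: (1/lam) * ln(q_ab / (p_a * p_b))."""
    return math.log(q[(a, b)] / (p[a] * p[b])) / lam


# Expected score when the two characters are drawn independently from p.
expected = sum(p[a] * p[b] * s(a, b) for a in p for b in p)

# Raw score of an ungapped segment, and the product of its per-column odds.
x, y = "AABA", "ABBA"
S = sum(s(a, b) for a, b in zip(x, y))
odds = math.prod(q[(a, b)] / (p[a] * p[b]) for a, b in zip(x, y))
```

Here `expected` is negative because `q` differs from the product of the background probabilities, and `math.exp(lam * S)` agrees with `odds` up to floating-point error.
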

Here is another intuitive approach that justifies the above scheme. To construct a substitution matrix to score protein alignments, a family of proteins can be considered, and a multiple alignment of all the protein sequences in the family is obtained. Again, we are considering alignments with no gaps; therefore, we assume sequences have the same length (a valid assumption if they are related) and, therefore, the multiple alignment is trivial. For any pair of amino acids $i$ and $j$ we need $q_{ij}$, the probability of observing $i$ aligned with $j$ (which is the same as $q_{ji}$), and $p_i$, the probability of observing an $i$.

The question is: in an (ungapped) alignment of sequences $x$ and $y$ that aligns two of their amino acids $i$ and $j$, did this happen by chance, or was it indeed because of a mutation from $i$ to $j$ or vice versa? To capture these complementary behaviors, we consider two models: $M$, where $x$ and $y$ are related and obtained according to the joint probabilities $q_{ij}$, and $R$, where $x$ and $y$ are unrelated and obtained independently at random according to the individual probabilities $p_i$ and $p_j$. Considering this, the score is now the likelihood that the sequences are related relative to them being unrelated. This is called the odds ratio and is mathematically expressed as:

$$\text{score}(x, y) = \frac{P(x, y \mid M)}{P(x, y \mid R)} = \prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

This formula says that the score of the (ungapped) alignment is the probability that the symbols of $x$ and $y$ are aligned because they are related, relative to the probability of their symbols being aligned just by chance.

For two aligned amino acids $i$ and $j$, take $i$'s point of view: what is the probability of seeing a $j$ on the other side? This is the probability that an $i$ mutated into a $j$, $p(i \to j)$. However, there is a mere chance of $p_j$ for a random occurrence of a $j$ as well. Hence, the probability ratio $\frac{p(i \to j)}{p_j}$ reflects how much $i$ believes that this $j$ is related to it. Now, since $q_{ij} = p_i \, p(i \to j)$ (the probability of observing an $i$ and its mutated form), we can also express the likelihood as:

$$\frac{q_{ij}}{p_i p_j} = \frac{p(i \to j)}{p_j}$$
By doing this for every pair of aligned symbols in the alignment and taking the product of the terms, we obtain the formula above, which reflects how much we believe that the two entire sequences are related. In all the alignment algorithms we have seen so far, we relied on the fact that the score is additive, and this was a key property for the dynamic programming to work. In this case, when the score is computed as $\prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$, it is multiplicative. In order to make it additive we can take the log and compute:

$$\log \prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}} = \sum_i \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

This is called the log-odds ratio. Therefore, the values making up the sum will be the individual scores found in the matrix, hence $s_{ij} = \log \frac{q_{ij}}{p_i p_j}$ up to some scaling factor. Note that this is symmetric, so scoring $i$ aligned with $j$ is the same as scoring $j$ aligned with $i$; hence, the direction of the alignment is not important (but one could in principle make a distinction if needed). Now the important question: how to compute $p_i$, $p_j$, and $q_{ij}$? We are going to look at two ways of computing them: PAM and BLOSUM matrices.

PAM (Point Accepted Mutations) matrices

PAM stands for Point Accepted Mutations. An accepted mutation is defined as a mutation that was positively selected by the environment and did not cause the demise of the organism. A PAM matrix $M$ holds the probability of $i$ being replaced by $j$ in a certain evolutionary time period. The longer the evolutionary period of time, the more difficult it is to determine the correct values. The reason is that $i$ could mutate several times before becoming a $j$, and it will be hard to capture all these intermediate mutations, since we only observe $i$ and $j$. What we are going to do is look at mutations that occurred in a relatively short evolutionary period of time. One unit of evolution is defined to be the amount of evolution that changes, on the average, 1 in 100 amino acids. Considering this unit, a 1-PAM matrix is first computed. Using this as a starting point, a k-PAM matrix can be generated from the 1-PAM matrix.
For a 1-PAM matrix $M$, $M_{ij}$ is going to be $p(i \to j)$ scaled by a factor, such that the expected number of mutations is 0.01; in other words, it is the same as having a probability of 1 in 100 for a mutation to occur. The computational steps that lead to the 1-PAM matrix are:

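1. Compute $p_i$ for every $i$.
2. Compute $p(i \to j)$ for every pair $i$ and $j$ and let $M_{ij} = p(i \to j)$.
3. Scale $M$ such that the expected number of mutations $\sum_i p_i (1 - M_{ii})$ is 0.01.
4. Use $s_{ij} = 10 \log_{10} \frac{M_{ij}}{p_j}$ to obtain the additive scores. $s_{ij}$ is rounded to an integer; here, the scaling factor 10 is used just to provide a better integer approximation.

Next we will take a closer look at each of these steps. Let the frequency count $f_{ij}$ be the number of times $i$ is aligned with $j$, counting both directions. Then, let the number of occurrences of $i$ be $f_i = \sum_j f_{ij}$, and the count of all characters be $f = \sum_i f_i$. Now, we can estimate $p_{ij} = \frac{f_{ij}}{f}$, whose meaning is simply the rate at which $i$ was found to be aligned with $j$. Similarly, the rate of finding an occurrence of $i$ is $p_i = \frac{f_i}{f}$. Now, having both $p_{ij}$ and $p_i$ determined, the elements of the matrix $M$ are computed as:

$$M_{ij} = p(i \to j) = \frac{p_{ij}}{p_i}$$

$M$ is indeed a probability matrix, and this can be proved by noting that $\sum_j M_{ij} = 1$.

To illustrate this step of computing a matrix $M$, let us have a quick example. Let the alignment be:

    A B
    A A

In this case, the frequencies are:

$$f_{AB} = 1, \quad f_{BA} = 1, \quad f_{AA} = 2$$
$$f_A = \sum_X f_{AX} = f_{AB} + f_{AA} = 3, \quad f_B = \sum_X f_{BX} = f_{BA} = 1, \quad f = \sum_X f_X = 4$$

hence the estimated probabilities:

$$p_{AB} = \frac{f_{AB}}{f} = \frac{1}{4}, \quad p_{BA} = \frac{f_{BA}}{f} = \frac{1}{4}, \quad p_{AA} = \frac{f_{AA}}{f} = \frac{1}{2}$$
$$p_A = \frac{f_A}{f} = \frac{3}{4}, \quad p_B = \frac{f_B}{f} = \frac{1}{4}$$
$$p(A \to B) = \frac{p_{AB}}{p_A} = \frac{1}{3}, \quad p(B \to A) = \frac{p_{BA}}{p_B} = 1$$

The matrix $M$ will be:

$$M = \begin{bmatrix} 2/3 & 1/3 \\ 1 & 0 \end{bmatrix}$$

The expected number of mutations is

$$\sum_X p_X (1 - M_{XX}) = p_A (1 - M_{AA}) + p_B (1 - M_{BB}) = \frac{3}{4}\left(1 - \frac{2}{3}\right) + \frac{1}{4}(1 - 0) = 0.5 = 50\%$$

The next step is the scaling of $M$ such that it is consistent with the definition of a 1-PAM matrix: 1 in 100 expected mutations. Suppose the elements $M_{ij}$ of matrix $M$ are scaled by a factor $\alpha$. In this case the new values become $M'_{ij} = \alpha M_{ij}$. This will change the values of the row sums such that $\sum_j M'_{ij} = \alpha$. Since we want a probability matrix (every row sums up to 1), a small adjustment is needed: we will add $1 - \alpha$ to every element on the main diagonal:

$$M'_{ij} = \alpha M_{ij}, \quad i \neq j$$
$$M'_{ii} = \alpha M_{ii} + 1 - \alpha$$

This will restore the property of a probability matrix. Now what should $\alpha$ be? Let us compute the new expected number of mutations:

$$\sum_i p_i (1 - M'_{ii}) = \sum_i p_i (1 - \alpha M_{ii} - 1 + \alpha) = \alpha \sum_i p_i (1 - M_{ii})$$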

This is just $\alpha$ multiplied by the old expected number of mutations. Therefore, we can set $\alpha$ appropriately. For instance, in the example above, $\alpha = 0.01 / 0.5 = 0.02$.

Having a 1-PAM matrix computed, the question is how to compute a 2-PAM matrix. In other words, what is the probability $p_2(i \to j)$ of $i$ mutating into $j$ in two units of evolution? This is the probability of $i$ mutating into $k$, for some $k$, in the first unit of evolution, and then $k$ mutating into $j$ in the second unit of evolution. Mathematically, this can be expressed as:

$$p_2(i \to j) = \sum_k p(i \to k)\, p(k \to j) = \sum_k M_{ik} M_{kj}$$

This is the formula used to obtain the entry corresponding to the pair $i$ and $j$ when multiplying $M$ by itself. Hence, the 2-PAM matrix is just $M^2$. An analogous step shows that the k-PAM matrix is $M^k$. When working with a k-PAM probability matrix the score is computed in the same way:

$$s^k_{ij} = 10 \log_{10} \frac{M^k_{ij}}{p_j}$$

The only change is that now the values of $M^k$ are plugged in instead of those of $M$.

BLOSUM (BLOCKS Substitution Matrices) matrices

As mentioned earlier, BLOSUM matrices are another type of matrix used in scoring sequence alignments. They are intended to be used for scoring similarities of protein sequences that are evolutionarily far apart (distant). Computing their values is done using the information stored in a database of blocks (called the BLOCKS database), where each block is a multiple ungapped alignment of related protein sequences. The sequences of each block are clustered, putting two sequences into the same cluster if their percentage of matching aligned residues, or level of similarity, is above a certain threshold L%. We define two sequences to be distant if they fall in different clusters. Therefore, two distant sequences differ by at least (100 - L)%. The computation of BLOSUM-L, for a particular value of L, is based on counting the number of mutations among distant sequences only. Therefore, lower values of L correspond to longer evolutionary times, and are applicable for more distant sequences.
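
Before going into the BLOSUM counting details, the PAM pipeline described earlier (frequency counts, the matrix $M$, scaling by $\alpha$, matrix powers, and the $10 \log_{10}$ scores) can be sketched in Python on the two-column A/B example. All function names here are hypothetical helpers, and the sketch assumes every letter occurs at least once and that every matrix entry fed to the logarithm is positive.

```python
import math
from collections import Counter


def mutation_matrix(columns, alphabet):
    """Estimate p_i and M_ij = p_ij / p_i from aligned pairs, counting both directions."""
    f = Counter()
    for a, b in columns:
        f[(a, b)] += 1
        f[(b, a)] += 1
    total = sum(f.values())
    f_i = {a: sum(f[(a, b)] for b in alphabet) for a in alphabet}
    p = [f_i[a] / total for a in alphabet]
    # Assumes f_i > 0 for every letter; otherwise the row is undefined.
    M = [[f[(a, b)] / f_i[a] for b in alphabet] for a in alphabet]
    return p, M


def expected_mutations(p, M):
    """Sum over i of p_i * (1 - M_ii)."""
    return sum(p[i] * (1.0 - M[i][i]) for i in range(len(p)))


def scale_to_one_pam(p, M, rate=0.01):
    """Scale off-diagonal entries by alpha and add 1 - alpha on the diagonal."""
    alpha = rate / expected_mutations(p, M)
    n = len(p)
    M1 = [[alpha * M[i][j] + ((1.0 - alpha) if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    return M1, alpha


def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def pam_k(M1, k):
    """k-PAM probability matrix as the k-th power of the 1-PAM matrix (k >= 1)."""
    Mk = M1
    for _ in range(k - 1):
        Mk = mat_mul(Mk, M1)
    return Mk


def pam_scores(Mk, p):
    """s_ij = 10 * log10(Mk_ij / p_j), rounded; requires Mk_ij > 0."""
    n = len(p)
    return [[round(10 * math.log10(Mk[i][j] / p[j])) for j in range(n)] for i in range(n)]


alphabet = "AB"
columns = [("A", "A"), ("B", "A")]  # the two columns of the example alignment
p, M = mutation_matrix(columns, alphabet)
M1, alpha = scale_to_one_pam(p, M)
M2 = pam_k(M1, 2)
scores = pam_scores(M1, p)
```

On this input the unscaled matrix has an expected mutation rate of 0.5, so `alpha` comes out as 0.02, matching the value quoted in the text, and every row of `M1` and `M2` still sums to 1.
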
As explained above, in computing the entries of a BLOSUM-L matrix, we want to count the number of mutations between distant sequences only: the ones that are less than L% similar. The value $f_{ab}$ is the relative frequency of seeing $a$ aligned with $b$. Whenever such an alignment is observed for two sequences that are in different clusters, $f_{ab}$ is incremented by $\frac{1}{n_1 n_2}$, where $n_1$ and $n_2$ are the sizes of the two clusters (we scale by the size of the cluster since larger clusters are more likely to contain mutations). The steps through which a matrix is computed are:

1. Estimate $p_i = \frac{\sum_j f_{ij}}{\sum_{k,l} f_{kl}}$.
2. Estimate $q_{ij} = \frac{f_{ij}}{\sum_{k,l} f_{kl}}$.
3. BLOSUM-L$(i, j) = \log \frac{q_{ij}}{p_i p_j}$ with some scaling factor $\lambda$.

Consider an example where sequences are generated at random (so we are not using the BLOCKS database here) such that $p_A = p_G = p_C = p_T = \frac{1}{4}$ and the level of similarity is 50%, i.e. the probability that two aligned residues are the same is 0.5. Then, if L = 50%, we expect to have one cluster, where:

$$p_{AA} = p_{GG} = p_{CC} = p_{TT} = \frac{0.5}{4} = \frac{1}{8}$$

and

$$p_{AG} = p_{AC} = p_{AT} = p_{GA} = p_{GC} = p_{GT} = p_{CA} = p_{CG} = p_{CT} = p_{TA} = p_{TG} = p_{TC} = \frac{0.5}{12} = \frac{1}{24}$$

Then a match will have a score

$$m = \log_2 \frac{1/8}{1/4 \cdot 1/4} = 1$$

and a mismatch will have a score

$$s = \log_2 \frac{1/24}{1/4 \cdot 1/4} = -0.585$$
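
The cluster-weighted counting and the two example scores can be sketched as follows. The helper name `blosum_counts`, the toy block, and its clustering are assumptions for illustration; the sketch also assumes each observed pair is counted in both directions so that the counts stay symmetric, which the text does not state explicitly for BLOSUM.

```python
import math
from collections import defaultdict
from itertools import combinations


def blosum_counts(block, clusters):
    """Weighted pair counts f[(a, b)] over sequences lying in different clusters.

    block: list of equal-length sequences (an ungapped multiple alignment).
    clusters: list of lists of indices into block.
    Each aligned pair contributes 1 / (n1 * n2), where n1 and n2 are the cluster sizes.
    """
    f = defaultdict(float)
    for c1, c2 in combinations(clusters, 2):
        weight = 1.0 / (len(c1) * len(c2))
        for i in c1:
            for j in c2:
                for a, b in zip(block[i], block[j]):
                    f[(a, b)] += weight
                    f[(b, a)] += weight  # assumption: count both directions
    return f


# Toy block: the first two sequences form one cluster, the third is on its own.
block = ["AC", "AC", "AG"]
clusters = [[0, 1], [2]]
f = blosum_counts(block, clusters)

# Scores for the random-sequence example in the text, using base-2 logarithms.
p_background = 1 / 4
q_match = 0.5 / 4      # 1/8
q_mismatch = 0.5 / 12  # 1/24
match_score = math.log2(q_match / (p_background * p_background))
mismatch_score = math.log2(q_mismatch / (p_background * p_background))
```

The two scores reproduce the values in the example (1 for a match and about -0.585 for a mismatch), which also confirms that the logarithm there is taken base 2.
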
