Introduction to Computation & Pairwise Alignment

Introduction to Computation & Pairwise Alignment Eunok Paek eunokpaek@hanyang.ac.kr

Algorithm: what you already know about programming
Pan-Fried Fish with Spicy Dipping Sauce
This spicy fish dish is quick to prepare and cooks in about 8 minutes.
Ingredients: ½ c mayonnaise, ½ t salt, ½ t cayenne pepper, ¼ t ground black pepper, 2 T lemon juice, 2 eggs (beaten), 4 white fish fillets (6 oz.), 1 c bread crumbs, 3 T vegetable oil
Directions: In a small bowl whisk together mayonnaise, cayenne pepper and lemon juice; set aside. Season fish fillets with salt and pepper to taste. Dip in beaten egg and coat evenly with bread crumbs. Heat a large, nonstick skillet over medium-high heat. Add oil and when hot, but not smoking, saute fish until golden brown and thoroughly cooked, about 4 minutes per side. Serve warm with reserved spicy dipping sauce.

Algorithm: what you already know about programming
Recipes have to be refined
- A new recipe is rarely right on the first attempt.
- Modifications are made as necessary.
- Trying the recipe on the intended audience may yield further modifications.
- The recipe can be adapted for new ingredients.
Writing a program is a lot like writing a recipe.

Algorithm: Definition
An algorithm is a finite set of precise instructions for performing a computation or for solving a problem.
Example: find the maximum value in a finite sequence of integers
1. Set the temporary maximum equal to the first integer in the sequence.
2. Compare the next integer in the sequence to the temporary maximum; if it is larger, set the temporary maximum equal to this integer.
3. Repeat the previous step if there are more integers in the sequence.
4. Stop when there are no integers left in the sequence. The temporary maximum at this point is the largest integer in the sequence.

Algorithm: Pseudo code
function max (a_1, a_2, ..., a_n : integer)        (function name, arguments, data type)
    max = a_1;                                      (assignment: variable = value)
    for i = 2 to n                                  (repetition, i.e. iteration)
        if max < a_i then max = a_i;
    return max;                                     (function value)
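The pseudo code above translates almost line for line into Python. The sketch below is one possible rendering; the name `find_max` and the error raised for an empty sequence are choices made here, not part of the slide.

```python
def find_max(values):
    """Return the largest integer in a non-empty sequence."""
    if not values:
        # The slide assumes n >= 1; rejecting empty input is an added safeguard.
        raise ValueError("sequence must be non-empty")
    current_max = values[0]          # step 1: temporary maximum = first integer
    for v in values[1:]:             # steps 2-3: compare each remaining integer
        if current_max < v:
            current_max = v
    return current_max               # step 4: nothing left to compare


print(find_max([3, 9, 2, 7]))
```

The loop body runs n - 1 times, so the running time grows linearly with the length of the sequence.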

Algorithm: Pseudo code
function binary search (x: integer, a_1, a_2, ..., a_n : increasing integers)
    i = 1; j = n;
    while i < j                                     (repetition)
    begin
        m = floor((i + j) / 2);
        if x > a_m then i = m + 1 else j = m;
    end
    if x = a_i then location = i else location = 0;
    return location;
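A runnable version of the binary search, kept close to the pseudo code. It is a sketch under two assumptions made here: the list is already sorted in increasing order, and locations are reported 1-based with 0 meaning "not found", as on the slide.

```python
def binary_search(x, a):
    """Return the 1-based location of x in the increasing list a, or 0 if absent."""
    if not a:
        return 0
    i, j = 1, len(a)
    while i < j:
        m = (i + j) // 2             # floor of the midpoint
        if x > a[m - 1]:             # a_m in the slide is a[m - 1] in Python
            i = m + 1
        else:
            j = m
    return i if a[i - 1] == x else 0


print(binary_search(7, [1, 3, 5, 7, 9]))
```

Each pass halves the interval between i and j, so about log2(n) comparisons are needed.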

Algorithm: Pseudo code
function n_choose_k (n, k: integers)
    return Factorial(n) / (Factorial(n - k) * Factorial(k));      (calling another function)
function Factorial (n: integer)
    temp = 1;
    for i = 2 to n
        temp = temp * i;
    return temp;
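The same two functions in Python, as a minimal sketch. Integer division (`//`) is used here so the result stays an exact integer; that detail is not spelled out on the slide.

```python
def factorial(n):
    """Product 1 * 2 * ... * n, with factorial(0) == 1."""
    temp = 1
    for i in range(2, n + 1):
        temp *= i
    return temp


def n_choose_k(n, k):
    """Number of ways to choose k items out of n, by calling factorial()."""
    return factorial(n) // (factorial(n - k) * factorial(k))


print(n_choose_k(5, 2))
```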

Algorithm: Recursion
function fibonacci (n: nonnegative integer)
    if n = 0 then return 0
    else if n = 1 then return 1
    else return fibonacci(n-1) + fibonacci(n-2);      (recursive call)
[Figure: tree of recursive calls for F4, in which F2, F1, and F0 are computed more than once]

Algorithm: Iteration & Memory
function fibonacci (n: nonnegative integer)
    if n = 0 then return 0
    else begin
        fn_2 = 0;
        fn_1 = 1;
        for i = 1 to n-1
        begin
            fn = fn_1 + fn_2;
            fn_2 = fn_1;
            fn_1 = fn;
        end
    end
    return fn_1;
What if n = 1? The loop body never runs, and fn_1 = 1 is returned.
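To make the contrast between recursion and iteration concrete, the sketch below implements both versions. The call counter passed to `fib_recursive` is an illustrative addition (not on the slides) used only to show how much work the recursive version repeats.

```python
def fib_recursive(n, counter):
    """Naive recursive Fibonacci; counter[0] tallies the number of calls."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)


def fib_iterative(n):
    """Iterative Fibonacci that remembers only the last two values."""
    if n == 0:
        return 0
    fn_2, fn_1 = 0, 1
    for _ in range(n - 1):           # runs zero times when n == 1
        fn_2, fn_1 = fn_1, fn_1 + fn_2
    return fn_1


calls = [0]
print(fib_recursive(4, calls), "computed with", calls[0], "calls")
print(fib_iterative(4))
```

The recursive version makes a number of calls that grows exponentially with n, while the iterative version does n - 1 additions.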

Computation: Running Time
Two ways to measure the relative efficiency of an algorithm:
- Mathematical analysis
- Empirical analysis
Mathematical analysis of the running time:
- Running time is measured by the number of basic steps (e.g., the number of Python statements) that the algorithm executes.
- Running time is described as a function of the input size n, written T(n).
- We are usually interested in the worst-case or average-case running time.

Computation: Big-Oh (O) Notation
Example: $T(n) = 13n^3 + 42n^2 + 2n\log n + 4n$
- As n grows larger, $n^3$ is MUCH larger than $n^2$, $n\log n$, and $n$, so it dominates T(n).
- The constant factor 13 can be ignored, since it depends on the compiler used, the machine speed, etc.
- The running time grows roughly on the order of $n^3$.
- Notationally, $T(n) = O(n^3)$.
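A quick numerical check of the dominance argument: dividing T(n) by n^3 should approach the constant 13 as n grows. The sketch assumes the logarithm is base 2, which the slide does not specify; the choice of base does not affect the conclusion.

```python
import math


def T(n):
    """Example running-time function from the slide (log taken as base 2 here)."""
    return 13 * n**3 + 42 * n**2 + 2 * n * math.log2(n) + 4 * n


for n in (10, 100, 1000, 10000):
    # The ratio shrinks toward 13 because the lower-order terms become negligible.
    print(n, T(n) / n**3)
```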

Computation: Complexity
[Figure: g(n) plotted for n from 5 to 20 (vertical axis up to 3000), comparing $2^n$, $n^3/2$, $5n^2$, and $100n$]

Computation: Complexity
Approximate values of each function:
Function      n = 10       n = 100        n = 1000
n log n       33           664            9966
n^3           1,000        1,000,000      10^9
10^6 n^8      10^14        10^22          10^30
2^n           1024         1.27x10^30     1.05x10^301
n^(log n)     2099         1.93x10^13     7.89x10^29
n!            3,628,800    10^158         4x10^2567

Computation: Complexity
Function      Size of instance solved in one day    Size solved on a computer 10 times faster
n             10^12                                 10^13
n log n       0.948x10^11                           0.87x10^12
n^2           10^6                                  3.16x10^6
n^3           10^4                                  2.15x10^4
10^8 n^4      10                                    18
2^n           40                                    43
10^n          12                                    13
n^(log n)     79                                    95
n!            14                                    15

Take Home Message
- There can be many ways to solve the same problem.
- Running time can often be estimated mathematically, as a function of the input size n.
- What matters is the order of growth of the computation time.

Sequence Alignment
[Figure: example alignment of two DNA sequences that are 70% identical, with a mismatch and an indel labeled. Residues shown: A C C T G A G A G A C G T G G C A G]

Sequence Alignment: Eye of the tiger
* In 1994, Walter Gehring et al. (University of Basel) turned the gene eyeless on in various places on Drosophila melanogaster.
* Result: eyes formed in multiple places.
* eyeless is a master regulatory gene that controls approximately 2000 other genes.
* Switching eyeless on induces formation of an eye.

Sequence Alignment Eyeless Drosophila

Sequence Alignment Homeoboxes & Master regulatory genes

Sequence Alignment Homeoboxes & Master regulatory genes HOMEO BOX A homeobox is a DNA sequence found within genes that are involved in the regulation of development (morphogenesis) of animals, fungi and plants.

Sequence Alignment
Sequence alignment is important for:
* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly

Growth of GenBank and WGS

Pairwise Alignment
- Dot matrix
- Dynamic programming
  - Needleman-Wunsch: optimal global alignment
  - Smith-Waterman: optimal local alignment

Pairwise Alignment: Types of Sequence Alignment
Number of sequences
- pairwise alignment: compare two sequences
- multiple alignment: compare multiple sequences
Portion of sequences aligned
- global alignment: align sequences over their entire lengths
- local alignment: find the longest/best subsequence pairs that give maximum similarity
Algorithmic approach
- optimal methods: Needleman-Wunsch, Smith-Waterman
- heuristic methods: FASTA, BLAST

Pairwise Alignment, Dot Matrix
A dot matrix is a visual depiction of the relationship between 2 sequences.
- Reveals insertions/deletions
- Finds direct or inverted repeats
Steps
- create a 2D matrix
- place one sequence along the top and the other along the left side
- for each cell of the matrix, place a dot if the two corresponding residues match
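The steps above can be sketched in a few lines of Python. The function names `dot_matrix` and `render` are illustrative choices, and the two sequences are the example pair used later in the deck for the window-size slides.

```python
def dot_matrix(top, side):
    """Return a 2D list with 1 where side[i] == top[j] and 0 elsewhere."""
    return [[1 if s == t else 0 for t in top] for s in side]


def render(top, side):
    """Format the dot matrix as text, with '.' for empty cells."""
    lines = ["  " + " ".join(top)]
    for s, row in zip(side, dot_matrix(top, side)):
        lines.append(s + " " + " ".join("1" if cell else "." for cell in row))
    return "\n".join(lines)


print(render("PVILEPMMKVTIEMP", "PVILEPIMRVEVTTP"))
```

Filling the matrix visits every cell once, which is the O(mn) cost for sequences of lengths m and n.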

Pairwise Alignment Dot Matrix Running Time of Dot Matrix Lengths of sequences: m, n O(mn)

Pairwise Alignment Dot Matrix DNA sequences protein sequences

Pairwise Alignment, Dot Matrix: Random Matches in Dot Matrix
- When comparing DNA sequences, random matches occur with probability 1/4.
- When comparing protein sequences, random matches occur with probability 1/20.
- Thus, for comparisons of protein-coding DNA sequences, we should translate them to amino acids first.

Pairwise Alignment, Dot Matrix: To Reduce Random Noise in Dot Matrix
- Specify a window size, w.
- Take w residues from each of the two sequences.
- Among the w pairs of residues, count how many pairs are matches.
- Specify a stringency: place a dot only if the number of matches in the window is at least the stringency.
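A minimal sketch of the window and stringency filter. It assumes each window starts at cell (i, j) and runs down the diagonal for w residues, and it only scores cells where a full window fits; the slides do not say how windows are anchored or how edges are handled, so both are assumptions here.

```python
def windowed_dot_matrix(top, side, w=3, stringency=2):
    """Count matches in each diagonal window of length w; keep counts >= stringency."""
    rows = len(side) - w + 1
    cols = len(top) - w + 1
    result = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Compare the w residue pairs side[i+k], top[j+k] for k = 0..w-1.
            matches = sum(1 for k in range(w) if side[i + k] == top[j + k])
            row.append(matches if matches >= stringency else 0)
        result.append(row)
    return result


grid = windowed_dot_matrix("PVILEPMMKVTIEMP", "PVILEPIMRVEVTTP", w=3, stringency=2)
for residue, row in zip("PVILEPIMRVEVTTP", grid):
    print(residue, " ".join(str(c) if c else "." for c in row))
```

With w = 1 and stringency = 1 this reduces to the simple dot matrix; raising either parameter removes isolated chance matches while keeping diagonal runs.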

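Pairwise Alignment, Dot Matrix: Simple dot matrix, Window size 1
    P V I L E P M M K V T I E M P
P   1 . . . . 1 . . . . . . . . 1
V   . 1 . . . . . . . 1 . . . . .
I   . . 1 . . . . . . . . 1 . . .
L   . . . 1 . . . . . . . . . . .
E   . . . . 1 . . . . . . . 1 . .
P   1 . . . . 1 . . . . . . . . 1
I   . . 1 . . . . . . . . 1 . . .
M   . . . . . . 1 1 . . . . . 1 .
R   . . . . . . . . . . . . . . .
V   . 1 . . . . . . . 1 . . . . .
E   . . . . 1 . . . . . . . 1 . .
V   . 1 . . . . . . . 1 . . . . .
T   . . . . . . . . . . 1 . . . .
T   . . . . . . . . . . 1 . . . .
P   1 . . . . 1 . . . . . . . . 1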

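Pairwise Alignment, Dot Matrix: Window size is 3
Top sequence: P V I L E P M M K V T I E M P
Each residue of the side sequence is followed by the nonzero window counts in its row, left to right:
P: 3 1 1 1 1
V: 3 1 1
I: 3 1 1 1 1
L: 3 1 1 1
E: 1 2 1 1 1
P: 1 1 1 2 1 1 1
I: 1 1 1 1 1
M: 1 2 1
R: 1 1 1 1 1 1
V: 1 1 1 1 1 1
E: 1 1 2 1
V: 1 1 2
T: 1 1 1 1
T: 1 1 2 2 1
P: 1 1 1 1 1 1 1 3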

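Pairwise Alignment, Dot Matrix: Window size is 3; Stringency is 2
Top sequence: P V I L E P M M K V T I E M P
Each residue of the side sequence is followed by the window counts that remain in its row (counts below 2 are dropped):
P: 3
V: 3
I: 3
L: 3
E: 2
P: 2
I: (none)
M: 2
R: (none)
V: (none)
E: 2
V: 2
T: (none)
T: 2 2
P: 3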

Pairwise Alignment Dot Matrix DNA Sequences single residue identity 16 out of 23 identical

Pairwise Alignment Dot Matrix Protein Sequences single residue identity 6 out of 23 identical

Pairwise Alignment Dot Matrix Insertion/Deletion, Inversion

Pairwise Alignment Dot Matrix ABCDEFGEFGHIJKLMNO tandem duplication compared to no duplication tandem duplication compared to self

Pairwise Alignment, Dot Matrix: What Is This?
5' GGCGG 3'
Palindrome (intrastrand)

Pairwise Alignment, Global Alignment: Optimal Alignment
- Consider two sequences, both of length n.
- If no gaps are allowed, there is only one alignment, which is optimal.
- If n gaps are allowed, the number of possible alignments is
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
- How to find the optimal ones?
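To get a feel for how fast this count grows, the sketch below evaluates the exact binomial coefficient and the approximation 2^(2n)/sqrt(pi n) for a few sequence lengths. The function names are illustrative.

```python
import math


def alignment_count(n):
    """Exact value of (2n)! / (n!)^2."""
    return math.comb(2 * n, n)


def alignment_count_approx(n):
    """Stirling-based approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 50, 100):
    exact = alignment_count(n)
    # The ratio approaches 1 as n grows.
    print(n, exact, alignment_count_approx(n) / exact)
```

Already at n = 100 the count is far beyond anything that could be enumerated, which motivates dynamic programming.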

Pairwise Alignment, Global Alignment: First, Define Optimality
- Scoring scheme: a scoring matrix and gap penalties
- Examples of scoring schemes
  - amino acids: PAM250 or BLOSUM62; -13 for gap opening, -2 for gap extension
  - nucleotides: the matrix below; -8 for gap opening, -6 for gap extension
      A    C    G    T
A     2   -7   -6   -7
C    -7    2   -7   -6
G    -6   -7    2   -7
T    -7   -6   -7    2

Pairwise Alignment, Global Alignment: Intuition of Dynamic Programming
If we already have the optimal solution to:
    XY
    AB
then we know the next pair of characters will either be:
    XYZ      XY-      XYZ
    ABC  or  ABC  or  AB-
(where - indicates a gap). So we can extend the match by determining which of these has the highest score.

Pairwise Alignment, Global Alignment: Recursive Definition of Dynamic Programming
Notations:
- $F(i,j)$: the accumulated score of aligning $x_1, x_2, \ldots, x_i$ to $y_1, \ldots, y_j$
- $s(x,y)$: the score of matching residue x to residue y, from the scoring matrix
- $\gamma(k)$: the penalty for a gap of length k
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ F(k,j) + \gamma(i-k), & k = 0,\ldots,i-1, \\ F(i,k) + \gamma(j-k), & k = 0,\ldots,j-1. \end{cases}$$

Pairwise Alignment Global Alignment Illustration of Dynamic Programming X Y Z U V W

Pairwise Alignment, Global Alignment: Dynamic Programming: Units of Operations
        Y_1   Y_2   Y_3   Y_4   ...   Y_n     total
X_1     1     1     1     1     ...   1       n
X_2     1     3     4     5     ...   n+1     (n+4)(n-1)/2 + 1 = (n^2+3n-4)/2 + 1
X_3     1     4     5     6     ...   n+2     (n+6)(n-1)/2 + 1 = (n^2+5n-6)/2 + 1
X_4     1     5     6     7     ...   n+3     (n+8)(n-1)/2 + 1 = (n^2+7n-8)/2 + 1
...
X_n     1     n+1   n+2   n+3   ...   2n-1    (n+2n)(n-1)/2 + 1 = (n^2+(2n-1)n-2n)/2 + 1
Sum of all rows: [n^2(n-1) + n(n+1)(n-1) - (n+2)(n-1)]/2 + 2n - 1 = [2n^3 - 2n^2 - 2n + 2]/2 + 2n - 1 = n^3 - n^2 + n
O(n^3) units of operations

Pairwise Alignment, Global Alignment: The Needleman-Wunsch Algorithm
- The method described in the previous slides is the Needleman-Wunsch (1970) algorithm.
- It computes the optimal global alignment between two sequences.
- Optimality is defined in terms of a scoring scheme (a scoring matrix plus gap penalties).
- The running time is $O(n^3)$.

Pairwise Alignment, Global Alignment: Needleman-Wunsch Implementation Details
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ F(k,j) + \gamma(i-k), & k = 0,\ldots,i-1, \\ F(i,k) + \gamma(j-k), & k = 0,\ldots,j-1. \end{cases}$$
- At each cell of the matrix, keep track of how the maximum is arrived at.
- After the entire matrix is filled, do a traceback from the bottom right corner to the top left corner.
[Figure: traceback path through a matrix whose axes are labeled A B C ... I J and 0 1 2 ... 8 9, giving the alignment]
    01-23456789
    ABCDEFG-HIJ

Pairwise Alignment, Global Alignment: Gap Penalties
Above, the gap penalty function can take any form:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ F(k,j) + \gamma(i-k), & k = 0,\ldots,i-1, \\ F(i,k) + \gamma(j-k), & k = 0,\ldots,j-1. \end{cases}$$
Below, using a simple gap penalty (-d for each gap position), we can speed up the alignment algorithm:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases}$$
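The simple-gap recurrence, together with the traceback described earlier, can be sketched as follows. This is a minimal illustration, not a production aligner: the function name, the match/mismatch scoring in place of a full scoring matrix, and the tie-breaking order in the traceback (diagonal, then up, then left) are all assumptions made here.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment with a linear gap penalty of -d per gap position."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i                      # x_1..x_i aligned to gaps
    for j in range(1, m + 1):
        F[0][j] = -d * j                      # y_1..y_j aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback from the bottom right corner to the top left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("AGCT", "ACGT")
print(score)
print(top)
print(bottom)
```

Each cell now looks at only three neighbors, so filling the matrix takes O(n^2) operations instead of O(n^3).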

Pairwise Alignment, Global Alignment: Illustration of Gotoh's Algorithm
            X     Y     Z
     0     -d    -2d   -3d
U   -d
V   -2d
W   -3d

Pairwise Alignment, Global Alignment: Example: match 1, mismatch -1, gap -1
          A    C    G    T
     0   -1   -2   -3   -4
A   -1    1    0   -1   -2
G   -2    0    0    1    0
C   -3   -1    1    0    0
T   -4   -2    0    0    1

Pairwise Alignment, Global Alignment: Gotoh's Algorithm: Units of Operations
- O(n^2) units of operations to fill the matrix
- O(n) units to trace back
         -     Y_1   Y_2   Y_3   Y_4   ...   Y_n   total
-        1     1     1     1     1     ...   1     n+1
X_1      1     3     3     3     3     ...   3     3n+1
X_2      1     3     3     3     3     ...   3     3n+1
X_3      1     3     3     3     3     ...   3     3n+1
X_4      1     3     3     3     3     ...   3     3n+1
...
X_n      1     3     3     3     3     ...   3     3n+1
Sum of all rows: 3n^2 + 2n + 1

Pairwise Alignment, Global Alignment: Affine Gap Penalties
- -d for gap opening, -e for gap extension: $\gamma(k) = -d - e(k-1)$
- Running time is still $O(n^2)$
- Described in Gotoh (1982)
- Optimal global alignment
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ R(i-1,j-1) + s(x_i,y_j), \\ C(i-1,j-1) + s(x_i,y_j). \end{cases}$$
$$R(i,j) = \max \begin{cases} F(i-1,j) - d, \\ R(i-1,j) - e. \end{cases}$$
$$C(i,j) = \max \begin{cases} F(i,j-1) - d, \\ C(i,j-1) - e. \end{cases}$$
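A score-only sketch of the three-matrix affine-gap recurrence. The boundary conditions are not given on the slide, so the initialization used here (leading gaps charged -d - e(k-1), unreachable cells set to minus infinity) is an assumption, as are the function name and the match/mismatch scoring.

```python
NEG_INF = float("-inf")


def gotoh_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Optimal global alignment score with gap penalty -d - e*(k-1) for a gap of length k."""
    n, m = len(x), len(y)
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned to y_j
    R = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned to a gap
    C = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # y_j aligned to a gap
    F[0][0] = 0
    for i in range(1, n + 1):
        R[i][0] = -d - e * (i - 1)            # assumed boundary: one leading gap
    for j in range(1, m + 1):
        C[0][j] = -d - e * (j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1], R[i - 1][j - 1], C[i - 1][j - 1]) + s
            R[i][j] = max(F[i - 1][j] - d, R[i - 1][j] - e)
            C[i][j] = max(F[i][j - 1] - d, C[i][j - 1] - e)
    return max(F[n][m], R[n][m], C[n][m])


print(gotoh_score("ACGT", "AT"))
```

Because extending a gap (-e) is cheaper than opening a new one (-d), the two missing residues in the example are placed in a single gap of length 2 rather than two separate gaps.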

Pairwise Alignment, Local Alignment: Smith-Waterman
- Running time is $O(n^2)$
- Described in Smith and Waterman (1981)
- Optimal local alignment
- Traceback is different
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ R(i-1,j-1) + s(x_i,y_j), \\ C(i-1,j-1) + s(x_i,y_j), \\ 0. \end{cases}$$
$$R(i,j) = \max \begin{cases} F(i-1,j) - d, \\ R(i-1,j) - e. \end{cases}$$
$$C(i,j) = \max \begin{cases} F(i,j-1) - d, \\ C(i,j-1) - e. \end{cases}$$
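The sketch below shows the two ideas that distinguish local alignment: scores are floored at 0, and the traceback starts at the highest-scoring cell and stops at the first 0. For brevity it uses a linear gap penalty (-d per gap position) rather than the affine recurrences on the slide; that simplification, the function name, and the match/mismatch scoring are assumptions of this example.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment of x and y under a linear gap penalty (simplified sketch)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Traceback from the best cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("GGACGTGG", "CCACGTCC"))
```

In the usage example only the shared core ACGT is reported, because extending the alignment into the flanking mismatches would lower the score.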

Pairwise Alignment, Local Alignment: Global versus Local Alignments
    LGPSSKQTGKGS-SRIWDN
    LN-ITKSAGKGAIMRLGDA    (Global)

    -------TGKG--------
    -------AGKG--------    (Local)

Pairwise Alignment, Local Alignment: Smith-Waterman Traceback
          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    5    0    0    0    0    0
W    0    0    0    0    2    0   20   12    4    0    0
H    0   10    2    0    0    0   12   18   22   14    6
E    0    2   16    8    0    0    4   10   18   28   20
A    0    0    8   21   13    5    0    4   10   20   27
E    0    0    6   13   18   12    4    0    4   16   26

Pairwise Alignment, Significance of Alignment: Probability of Random Alignments
- Suppose we have a tetrahedron-shaped die whose four faces are labeled A, C, G, T. Throw the die twice, and record the labels facing down.
- Probability of getting one particular identical pair (e.g., A and A): ¼ × ¼. There are 4 possible identical pairs, so the probability of a match is 4 × ¼ × ¼ = ¼.
- Probability of 6 identical pairs in a row: (1/4)^6 = 2.4E-4.
- Probability of getting a mismatch: 1 - ¼ = ¾.
- Probability of 6 mismatched pairs in a row: (3/4)^6 = 0.178.

Pairwise Alignment, Significance of Alignment: If A, C, G, T are not of Equal Proportions
- Probability of drawing an identical pair: $p = p_A^2 + p_C^2 + p_G^2 + p_T^2$, where $p_x$ is the proportion of nucleotide x.
- Probability of drawing a mismatch: $1 - p$.
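The match probability for any base composition is the sum of squared frequencies, and the equal-frequency case recovers the value ¼ from the die example. In the sketch below the skewed composition (30% A, 20% C, 20% G, 30% T) is a made-up illustration, not data from the slides.

```python
def match_probability(freqs):
    """Probability that two independently drawn nucleotides are identical."""
    return sum(p * p for p in freqs.values())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}   # hypothetical composition

p = match_probability(uniform)
print(p, p**6, (1 - p) ** 6)        # single match, 6 matches, 6 mismatches
print(match_probability(skewed))
```

Any departure from equal frequencies raises the chance of a random match above ¼.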

Pairwise Alignment, Significance of Alignment: Longest Run of Heads in Coin Toss
HTTHHHTHHTHHHTTTHHHHHHHTTTHHT
- Probability of a head is p. We are looking at a sequence of length n.
- At a given position, the probability of seeing a run of 5 heads is $p^5$.
- There are $n - 4$ such positions, so the expected number of such runs is $p^5 (n - 4)$.
- In general, for runs of length K: $p^K (n - (K - 1))$.
- Expected length of the longest run of heads (Erdős-Rényi law, 1970): for large n, $K = \log_{1/p} n$, the length at which the expected number of runs drops to about 1.
- If p = 0.5, after 100 tosses the expected longest run is $\log_2 100 \approx 6.64$.
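Both quantities on this slide are easy to compute: the observed longest run in a toss sequence, and the large-n estimate log base 1/p of n. The function names are illustrative.

```python
import math


def longest_run(tosses, symbol="H"):
    """Length of the longest unbroken run of `symbol` in the string `tosses`."""
    best = current = 0
    for t in tosses:
        current = current + 1 if t == symbol else 0
        best = max(best, current)
    return best


def expected_longest_run(n, p):
    """Large-n estimate of the longest run: log to base 1/p of n."""
    return math.log(n) / math.log(1 / p)


tosses = "HTTHHHTHHTHHHTTTHHHHHHHTTTHHT"
print(longest_run(tosses), expected_longest_run(len(tosses), 0.5))
print(expected_longest_run(100, 0.5))
```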

Pairwise Alignment, Significance of Alignment: M: Longest Run in Random Alignment
- Sequence lengths: m, n
- p: probability of a match; q = 1 - p
- γ: the Euler-Mascheroni constant, 0.577
$$E(M) \approx \log_{1/p}(mn) + \log_{1/p}(q) + \gamma \log_{1/p}(e) - \tfrac{1}{2}, \quad \text{for large } m, n$$
- If a local alignment is longer than E(M), then it is significant.
- How significant?
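A small sketch that evaluates the expected longest match run. It assumes every logarithm in the formula, including the one applied to e, is taken to base 1/p; the slide leaves the base of that last term unstated, so treat this reading as an assumption.

```python
import math

EULER_GAMMA = 0.5772156649   # Euler-Mascheroni constant


def expected_longest_match(m, n, p):
    """E(M) ~ log_{1/p}(mn) + log_{1/p}(q) + gamma*log_{1/p}(e) - 1/2, with q = 1 - p."""
    q = 1 - p
    ln_base = math.log(1 / p)            # converts natural logs to base 1/p
    return (math.log(m * n) + math.log(q) + EULER_GAMMA) / ln_base - 0.5


# Two random DNA sequences of length 100 with match probability 1/4.
print(expected_longest_match(100, 100, 0.25))
```

Under these assumptions, two random 100-nucleotide sequences are expected to share a run of roughly six consecutive matches by chance alone.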

Pairwise Alignment, Significance of Alignment: Significance of Local Alignment
- In biological experiments, after a set of values of an entity is obtained, we usually calculate the mean and variance.
  - Assume the data follow the normal distribution.
  - The mean and variance are of interest.
  - For example: is the mean different from zero at the significance level of 0.05?
- This is not what we want in local alignment.
  - We want the significance of the highest scores, not the mean score.

Pairwise Alignment, Significance of Alignment: Distribution of Scores
- The score of a pair of sequences is compared to the scores of pairs of random sequences of the same length and composition.
- The distribution of random sequence scores follows the Gumbel extreme value distribution.
  - Similar to the normal distribution, but with a positively skewed tail.
  - The score must be greater than expected from a normal distribution to achieve the same level of significance.

Pairwise Alignment, Significance of Alignment: Normal Distribution versus Extreme Value Distribution
- Normal distribution: $y = \exp(-x^2/2) / \sqrt{2\pi}$
- Extreme value distribution: $y = \exp(-x - \exp(-x))$
[Figure: the two densities plotted for x from -4 to 4, with y from 0.0 to 0.4]
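Evaluating the two densities at a few points shows the heavier right tail of the extreme value distribution. The sketch uses the standard (location 0, scale 1) forms of both densities; the function names are illustrative.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def extreme_value_pdf(x):
    """Standard Gumbel (extreme value) density."""
    return math.exp(-x - math.exp(-x))


for x in (-2, 0, 2, 3, 4):
    print(x, round(normal_pdf(x), 5), round(extreme_value_pdf(x), 5))
```

At x = 3 the extreme value density is roughly ten times the normal density, which is why a score must be higher to reach the same significance level.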

Pairwise Alignment, Substitution Matrices: DNA PAM 1 Matrix
- PAM 1 corresponds to 1% mutations, 99% conservation.
- Assume the 4 nucleotides are present at equal frequencies.
- Assume all mutations from any nucleotide to any other are equally likely.
A uniform model M:
      A        C        G        T
A     0.99     0.0033   0.0033   0.0033
C     0.0033   0.99     0.0033   0.0033
G     0.0033   0.0033   0.99     0.0033
T     0.0033   0.0033   0.0033   0.99

Pairwise Alignment, Substitution Matrices: Transitions and Transversions
- Purines: A and G
- Pyrimidines: C and T
- Transitions (more frequent): purine to purine, or pyrimidine to pyrimidine
- Transversions (less frequent): purine to pyrimidine, or pyrimidine to purine

Pairwise Alignment, Substitution Matrices: Another DNA PAM 1 Matrix
- Assume the 4 nucleotides are present at equal frequencies.
- Assume each transition is 3 times as likely as each transversion.
A biased model:
      A       C       G       T
A     0.99    0.002   0.006   0.002
C     0.002   0.99    0.002   0.006
G     0.006   0.002   0.99    0.002
T     0.002   0.006   0.002   0.99

Pairwise Alignment, Substitution Matrices: The Meaning of the Score of an Alignment
Assume ACGT is aligned to CCGT, given a model (matrix) M.
Want: the odds ratio, i.e. the probability of the aligned pairs given the model,
$$\Pr(A \to C)\,\Pr(C \to C)\,\Pr(G \to G)\,\Pr(T \to T) = (P_A M_{AC})(P_C M_{CC})(P_G M_{GG})(P_T M_{TT}),$$
divided by the probability that the same pairs happened by chance,
$$(P_A P_C)(P_C P_C)(P_G P_G)(P_T P_T).$$
Compute: let $S_{XY} = \log_2 \frac{P_X M_{XY}}{P_X P_Y}$. Then $S = S_{AC} + S_{CC} + S_{GG} + S_{TT}$ is the log odds ratio, and $2^S$ is what we want (the odds ratio).

Pairwise Alignment, Substitution Matrices: From PAM 1 Mutation Probability Matrix to PAM 1 Log Odds Ratio Matrix
Mutation probability matrix:
      A        C        G        T
A     0.99     0.0033   0.0033   0.0033
C     0.0033   0.99     0.0033   0.0033
G     0.0033   0.0033   0.99     0.0033
T     0.0033   0.0033   0.0033   0.99
Log odds ratio matrix:
      A    C    G    T
A     2   -6   -6   -6
C    -6    2   -6   -6
G    -6   -6    2   -6
T    -6   -6   -6    2

Pairwise Alignment, Substitution Matrices: From Another PAM 1 Mutation Probability Matrix to PAM 1 Log Odds Ratio Matrix
Mutation probability matrix:
      A       C       G       T
A     0.99    0.002   0.006   0.002
C     0.002   0.99    0.002   0.006
G     0.006   0.002   0.99    0.002
T     0.002   0.006   0.002   0.99
Log odds ratio matrix:
      A    C    G    T
A     2   -7   -5   -7
C    -7    2   -7   -5
G    -5   -7    2   -7
T    -7   -5   -7    2
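The conversion from mutation probabilities to log odds scores can be reproduced with the definition S_XY = log2(P_X M_XY / (P_X P_Y)), rounded to the nearest integer. The sketch below assumes equal nucleotide frequencies of 0.25, as the slides do; the function and variable names are illustrative.

```python
import math

NUCLEOTIDES = "ACGT"


def log_odds_matrix(M, freqs):
    """Rounded log2 odds scores S[x][y] for a mutation probability matrix M."""
    size = len(M)
    return [
        [round(math.log2(freqs[x] * M[x][y] / (freqs[x] * freqs[y]))) for y in range(size)]
        for x in range(size)
    ]


def alignment_score(a, b, S):
    """Sum of log odds scores over the aligned (ungapped) residue pairs."""
    return sum(S[NUCLEOTIDES.index(p)][NUCLEOTIDES.index(q)] for p, q in zip(a, b))


freqs = [0.25] * 4
uniform = [[0.99 if i == j else 0.0033 for j in range(4)] for i in range(4)]
biased = [
    [0.99, 0.002, 0.006, 0.002],
    [0.002, 0.99, 0.002, 0.006],
    [0.006, 0.002, 0.99, 0.002],
    [0.002, 0.006, 0.002, 0.99],
]

for row in log_odds_matrix(biased, freqs):
    print(row)
print(alignment_score("ACGT", "CCGT", log_odds_matrix(uniform, freqs)))
```

Under the uniform model, the ACGT/CCGT alignment scores -6 + 2 + 2 + 2 = 0, which corresponds to an odds ratio of about 2^0 = 1 after rounding.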

Pairwise Alignment, Substitution Matrices: From PAM 1 to PAM 2
- PAM 2 = PAM 1 * PAM 1 = (PAM 1)^2
- PAM 2 (A → C) = PAM 1 (A → A) * PAM 1 (A → C) + PAM 1 (A → C) * PAM 1 (C → C) + PAM 1 (A → G) * PAM 1 (G → C) + PAM 1 (A → T) * PAM 1 (T → C)
- Markov process: the probability of change from nucleotide A to nucleotide C is the same, regardless of previous changes at the site or the position of the site in the sequence.
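The step from PAM 1 to PAM 2 is ordinary matrix multiplication, which the sketch below spells out in plain Python using the uniform DNA PAM 1 matrix from the earlier slide. The helper names `mat_mul` and `mat_power` are illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_power(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Uniform DNA PAM 1 matrix, rows and columns in the order A, C, G, T.
PAM1 = [[0.99 if i == j else 0.0033 for j in range(4)] for i in range(4)]
PAM2 = mat_power(PAM1, 2)
print(PAM2[0][1])   # A -> C after two PAM units
print(PAM2[0][0])   # A -> A after two PAM units
```

The A → C entry sums over all four possible intermediate nucleotides, exactly as in the expansion on the slide.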

Pairwise Alignment, Substitution Matrices: Amino Acid PAM Matrices
- Percent Accepted Mutation
- Dayhoff (1978): 1572 changes in 71 families of proteins, at least 85% similar
- For each amino acid, count 20 numbers. For example, how many F (phenylalanine) stay the same, and how many change to each of the other 19 amino acids.
- Normalize: divide each of these 20 numbers by the sum of the 20 numbers.
- PAM 1: 1% probability of change

Pairwise Alignment, Substitution Matrices: The Column/Row of F in PAM 1
F to A: 0.0002   F to R: 0.0001   F to N: 0.0001   F to D: 0.0000   F to C: 0.0000
F to Q: 0.0000   F to E: 0.0000   F to G: 0.0001   F to H: 0.0002   F to I: 0.0007
F to L: 0.0013   F to K: 0.0000   F to M: 0.0001   F to F: 0.9946   F to P: 0.0001
F to S: 0.0003   F to T: 0.0001   F to W: 0.0001   F to Y: 0.0021   F to V: 0.0001

Pairwise Alignment, Substitution Matrices: Compute PAM250
- PAM 2 = PAM 1 * PAM 1 = (PAM 1)^2
- PAM 250 = (PAM 1)^250
Convert to log odds:
- PAM 250 (F → Y) = 0.15
- Divide by the frequency of F, 0.04: 0.15/0.04 = 3.75, and log10(3.75) = 0.57
- Similarly for Y → F: 0.2/0.03 ≈ 6.7, and log10(6.7) ≈ 0.83
- So PAM250(F, Y) = 10*(0.57+0.83)/2 = 7
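The arithmetic on this slide can be checked directly. The sketch below uses only the four numbers quoted above (0.15, 0.04, 0.2, 0.03); the function name and the decision to round at the end rather than at each step are choices made here.

```python
import math


def pam_log_odds(p_xy, freq_x, p_yx, freq_y):
    """Average the two directional log10 odds and scale by 10, as on the slide."""
    forward = math.log10(p_xy / freq_x)     # e.g. F -> Y
    backward = math.log10(p_yx / freq_y)    # e.g. Y -> F
    return 10 * (forward + backward) / 2


score = pam_log_odds(0.15, 0.04, 0.2, 0.03)
print(score, round(score))
```

Averaging the two directions makes the resulting score matrix symmetric, so the F/Y and Y/F entries are the same.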

Pairwise Alignment, Substitution Matrices: BLOSUM
- BLOcks of amino acid SUbstitution Matrices
- Start with highly conserved patterns (blocks) in a large set of closely related proteins.
- Use the substitution frequencies found in those sequences to create a substitution probability matrix.
- BLOSUM-n means that sequences more than n% identical were clustered together before substitutions were counted.
- BLOSUM62 is the standard.