Global alignments - review Take two sequences: X[j] and Y[j] M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 The best alignment for X[1 i] and Y[1 j] is called M[i, j] X[j] Initiation: M[,]= pply the equation Find the alignment with backtracing Y[i] 1 2 3 4 5 6 7 - G G G C G - 1 G 2 3 G 4 T 5 G 6 2
lgorithm time/space complexity - Big-O Notation a simple description of complexity: constant O(1), linear O(n), quadratic O(n 2 ), cubic O(n 3 )... asymptotic upper bound read: order of f n=o gn iff. simple,e.g.n 2 x, c f x cgx x x 2 lnx f x lnx x =17
Big-O Notation Example Time complexity of global alignment: M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 nm 1 init 1nm 1 =Onm calcm print
Global alignment linear space M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 We need O(nm) time, but only O(m) space how? problem with backtracking
Global alignment linear space, recursion LSpacei,i', j, j': i j k 1 j' LSpacei, n 2 1, j, k 1 n/2 LSpacei, n 2 1, j,k 2 i' k 2 space complexity: O(m)
Global alignment linear space, algorithm LSpacei,i ', j, j': i j k 1 j' return if areai,i', j, j'empty h:= i ' i 2 calc.musing Ommemory plusfind pathl h crossing therow h LSpacei,h 1,k 1, j' print L h LSpacei,h1,k 2, j' h=n/2 i' L h k 2 log 2 n time complexity: nm 2nm i= 2 i
lignments: Local alignment
Local alignment: Smith-Waterman algorithm Seq X: What s local? Seq Y: llow only parts of the sequence to match Locally maximal: can not make it better by trimming/extending the alignment
Local alignment Seq X: Seq Y: Why local? Parts of sequence diverge faster evolutionary pressure does not put constraints on the whole sequence seq X: seq Y: Proteins have modular construction sharing domains between sequences seq X: seq Y:
Domains - example Immunoglobulin domain
Global local alignment a) global align Take the global equation Look at the result of the global alignment - s e q - s b) retrieve the result e CGCCTTGGTTCTCG- C-C-----GTTCGT-G c) sum score along the result q u e sum align. pos.
Local alignment breaking the alignment recipe Just don t let the score go below Start the new alignment when it happens Where is the result in the matrix? Before: fter: - s e q u e - s e q u e - - s s e e q q sum sum sum align. pos. align. pos. align. pos.
Local alignment the equation M[i-1, j-1] + score(x[i],y[j]) M[i, j] = max M[i, j-1] g M[i-1, j] g Init the boundaries with s Great contribution to science! Run the algorithm Read the maximal value from anywhere in the matrix - s e q u e Find the result with backtracking - s e q
Finding second best alignment We can find the best local alignment in the sequence But where is the second best one? Scoring: 1 for match -2 for a gap G Best alignment G 18 16 14 16 19 17 14 17 2 15 18 16 G 15 18 clump 13 16
Clump of an alignment lignments sharing at least one aligned pair G 18 16 G 16 19 2 G clump
Clumps gene X gene Y
Finding second best alignment Don t let any matched pair to contribute to the next alignment G Clear the best alignment G 1 1 1 Recalculate the clump G 2 3 1 1
Extraction of local alignments Waterman- Eggert algorithm 1. Repeat a. Calc M without taking cells into account b. Retrieve the highest scoring alignment c. Set it s trace to
Clumps gene X gene Y
Clumps gene X gene Y
Low complexity regions Local composition bias Replication slippage: e.g. triplet repeats Results in spurious hits Breaks down statistical models Different proteins reported as hits due to similar composition Up to ¼ of the sequence can be biased Huntington s disease Huntingtin gene of unknown (!) function Repeats #: 6-35: normal; 36-12: disease
Pitfalls of alignments lignment is not a reconstruction of the evolution (common ancestor is extinct by the time of alignment)
Pitfalls of alignments Matches to the same fragment rbitrary poor alignment regions seq X: seq Y:
What s the number of steps in these algorithms? How much memory is used? Summary 1. Global a.k.a. Needleman- Wunsch algorithm 2. Global-local 3. Local a.k.a. Smith- Waterman algorithm 4. Many local alignments a.k.a. Waterman- Eggert algorithm seq X: seq Y: seq X: seq Y: seq X: seq Y: seq X: seq Y:
mino acid substitution matrices
Percent ccepted Mutations distance and matrices ccepted by natural selection not lethal not silent Def.: S 1 and S 2 are PM 1 distant if on avg. there was one mutation per 1 aa Q.: If the seqs are PM 8 distant, how many residues may be diffent?
PM matrix Created from easy alignments pairwise gapless 85% id f proline f proline, valine i.e. frequency of occurenceof proline frequency of substitution proline with valine,for PM1 counta,b aligns f a,b= aligns c,d!=a,b M symmetricmatr ix, f a,a f a i.e. M=[ f b,a f b,b] countc,d
PM matrix How to calculte M for PM 2 distance? Take more distant seqs or extrapolate... M 2 =[ f a,a f b,a =[ f a f a,af a,af a,bf b,a f a,af a,bf a,bf b,b f b,b]2 ]
PM log odds matrix Making of the PM N matrix observed PMN[a, b] = log 2 f am N [a,b] f af b By chance alone Why log? Mutations and chance: More freq: PM N[a,b] > Less freq: PM N[a,b] < = log 2 M N [a,b] f b odds log odds
PM 25 matrix
BLOSUM matrix BLOcks SUbstitution Matrix Based on gapless alignments More often used than PM
BLOSUM N matrix Cluster together sequences N% identity f(a,b) frequency of occurrence a and b in the same column f a, b f a = f a,a a b 2 e(a,b) chance alone ea,b = 1 a b ea,a = f a 2 ea, b = 2 f af b for a b BLOSUM N [a,b] = log 2 f a,b ea,b log odds
BLOSUM 62 matrix # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Blocks Database = /data/blocks_5./blocks.dat # Cluster Percentage: >= 62 # Entropy =.6979, Expected = -.529 R N D C Q E G H I L K M F P S T W Y V 4-1 -2-2 -1-1 -2-1 -1-1 -1-2 -1 1-3 -2 R -1 5-2 -3 1-2 -3-2 2-1 -3-2 -1-1 -3-2 -3 N -2 6 1-3 1-3 -3-2 -3-2 1-4 -2-3 D -2-2 1 6-3 2-1 -1-3 -4-1 -3-3 -1-1 -4-3 -3 C -3-3 -3 9-3 -4-3 -3-1 -1-3 -1-2 -3-1 -1-2 -2-1 Q -1 1-3 5 2-2 -3-2 1-3 -1-1 -2-1 -2 E -1 2-4 2 5-2 -3-3 1-2 -3-1 -1-3 -2-2 G -2-1 -3-2 -2 6-2 -4-4 -2-3 -3-2 -2-2 -3-3 H -2 1-1 -3-2 8-3 -3-1 -2-1 -2-1 -2-2 2-3 I -1-3 -3-3 -1-3 -3-4 -3 4 2-3 1-3 -2-1 -3-1 3 L -1-2 -3-4 -1-2 -3-4 -3 2 4-2 2-3 -2-1 -2-1 1 K -1 2-1 -3 1 1-2 -1-3 -2 5-1 -3-1 -1-3 -2-2 M -1-1 -2-3 -1-2 -3-2 1 2-1 5-2 -1-1 -1-1 1 F -2-3 -3-3 -2-3 -3-3 -1-3 6-4 -2-2 1 3-1 P -1-2 -2-1 -3-1 -1-2 -2-3 -3-1 -2-4 7-1 -1-4 -3-2 S 1-1 1-1 -1-2 -2-1 -2-1 4 1-3 -2-2 T -1-1 -1-1 -1-2 -2-1 -1-1 -1-2 -1 1 5-2 -2 W -3-3 -4-4 -2-2 -3-2 -2-3 -2-3 -1 1-4 -3-2 11 2-3 Y -2-2 -2-3 -2-1 -2-3 2-1 -1-2 -1 3-3 -2-2 2 7-1 V -3-3 -3-1 -2-2 -3-3 3 1-2 1-1 -2-2 -3-1 4
PM vs BLOSUM http://www.ncbi.nih.gov/education/blstinfo/scoring2.html PM is extrapolation from closely related seqs We are interested more distant relationships