Alignment of Long Sequences
BMI/CS 776
www.biostat.wisc.edu/bmi776/
Spring 2016
Anthony Gitter
gitter@biostat.wisc.edu
Goals for Lecture
Key concepts:
- how large-scale alignment differs from the simple case
- the canonical three-step approach of large-scale aligners
- using suffix trees to find maximal unique matching subsequences (MUMs)
If time permits:
- using tries and threaded tries to find alignment seeds
- constrained dynamic programming to align between/around anchors
- using sparse dynamic programming (DP) to find a chain of local alignments
Pairwise Large-Scale Alignment: Task Definition
Given:
- a pair of large-scale sequences (e.g. chromosomes)
- a method for scoring the alignment (e.g. substitution matrices, insertion/deletion parameters)
Do:
- construct a global alignment: identify all matching positions between the two sequences
Large-Scale Alignment Example: Mouse Chr6 vs. Human Chr12
Figure from: Delcher et al., Nucleic Acids Research 27, 1999
Why the Problem is Challenging
- Sequences are too big to make O(n^2) dynamic-programming methods practical
- Long sequences are less likely to be colinear because of rearrangements
  - initially we'll assume colinearity
  - we'll consider rearrangements in the next lecture
General Strategy
Figure from: Brudno et al., Genome Research, 2003
1. perform pattern matching to find seeds for global alignment
2. find a good chain of anchors
3. fill in the remainder with a standard but constrained alignment method
The MUMmer System
Delcher et al., Nucleic Acids Research, 1999
Given: genomes A and B
1. find all maximal unique matching subsequences (MUMs)
2. extract the longest possible set of matches that occur in the same order in both genomes
3. close the gaps
Step 1: Finding Seeds in MUMmer
Maximal unique match:
- occurs exactly once in both genomes A and B
- not contained in any longer MUM
[Figure: a MUM with the flanking mismatches marked]
Key insight: a significantly long MUM is certain to be part of the global alignment
Suffix Trees
Substring problem:
- given text S of length m
- preprocess S in O(m) time
- such that, given a query string Q of length n, we can find an occurrence (if any) of Q in S in O(n) time
Suffix trees solve this problem and others
Suffix Tree Definition
A suffix tree T for a string S of length m is a tree with the following properties:
- rooted and directed
- m leaves, labeled 1 to m
- each edge labeled by a substring of S
- concatenation of edge labels on the path from the root to leaf i is suffix i of S (we will denote this by S_{i...m}) (key property)
- each internal non-root node has at least two children
- edges out of a node must begin with different characters
Suffixes
S = "banana$"
suffixes of S:
  $
  a$
  na$
  ana$
  nana$
  anana$
  banana$
Suffix Tree Example
S = "banana$"
Add '$' to the end so that the suffix tree exists (no suffix is a prefix of another suffix)
[Figure: suffix tree for banana$; leaves from left to right are 7, 2, 4, 6, 1, 3, 5]
Solving the Substring Problem
Assume we have suffix tree T
FindMatch(Q, T):
- follow the (unique) path down from the root of T according to the characters in Q
- if all of Q is found to be a prefix of such a path, return the label of some leaf below this path
- else, return no match found
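The FindMatch procedure above can be sketched in a few lines. This is a minimal illustration, not the linear-time construction the slides assume: it builds an uncompressed suffix trie (one node per character, quadratic time and space) instead of a true suffix tree, and the names `Node`, `build_suffix_trie`, and `find_match` are hypothetical.

```python
class Node:
    """One node of an uncompressed suffix trie (an assumption; not a compact suffix tree)."""

    def __init__(self):
        self.children = {}  # next character -> child Node
        self.leaf = None    # 1-based suffix number, set only at leaves


def build_suffix_trie(s):
    """Insert every suffix of s character by character (quadratic, for illustration only)."""
    if not s.endswith("$"):
        raise ValueError("expected a string terminated by '$'")
    root = Node()
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = start + 1  # leaf i spells suffix i of s
    return root


def find_match(q, root):
    """Follow q down from the root; return the label of some leaf below, or None."""
    node = root
    for ch in q:
        node = node.children.get(ch)
        if node is None:
            return None  # fell off the trie: no match found
    while node.leaf is None:  # descend to any leaf below the matched path
        node = next(iter(node.children.values()))
    return node.leaf


trie = build_suffix_trie("banana$")
print(find_match("nan", trie))   # 3
print(find_match("anab", trie))  # None
```

The query walk itself takes O(n) steps for a query of length n, which is the property the slides rely on; only the construction here is slower than a real suffix tree.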
Solving the Substring Problem
Q = nan: the path for Q ends inside the tree; return 3
Q = anab: the walk STOPs at the mismatching character; return no match found
[Figure: the suffix tree for banana$ shown twice, once per query, with the followed path highlighted]
MUMs and Generalized Suffix Trees
- Build one suffix tree for both genomes A and B
- Label each leaf node with the genome it represents
Genome A: ccacg#
Genome B: cct$
[Figure: generalized suffix tree for A and B with leaves A,1 to A,5 and B,1 to B,3]
- each internal node represents a repeated sequence
- each leaf represents a suffix and its position in the sequence
MUMs and Suffix Trees
- Unique match: internal node with 2 children, leaf nodes from different genomes
- But these matches are not necessarily maximal
Genome A: ccacg#
Genome B: cct$
[Figure: the same generalized suffix tree, with the internal node that represents a unique match marked]
MUMs and Suffix Trees
- To identify maximal matches, we can compare the suffixes following unique match nodes
Genome A: acat#
Genome B: acaa$
[Figure: generalized suffix tree with leaves A,1 to A,4 and B,1 to B,4]
- the suffixes following these two match nodes are the same; the left one represents a longer match (aca)
Using Suffix Trees to Find MUMs
- O(n) time to construct the suffix tree for both sequences (each of length at most n)
- O(n) time to find MUMs: one scan of the tree (which is O(n) in size)
- O(n) possible MUMs, in contrast to O(n^2) possible exact matches
- Main parameter of the approach: length of the shortest MUM that should be identified (20 to 50 bases)
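To make the MUM definition concrete, here is a brute-force sketch. It deliberately does not use a suffix tree, so it runs in polynomial rather than linear time and is only suitable for toy inputs. The function names and the `min_len` parameter are assumptions made for illustration. It relies on the fact that a match unique in both genomes has exactly one occurrence pair, so checking that the pair cannot be extended left or right is enough to establish maximality.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def find_mums_bruteforce(a, b, min_len=1):
    """Return (pos_in_a, pos_in_b, match) for every MUM, with 1-based positions."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip pairs that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            # Keep the match only if it is unique in both genomes.
            if (k >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i + 1, j + 1, match))
    return mums


print(find_mums_bruteforce("acat", "acaa"))  # [(1, 1, 'aca')]
print(find_mums_bruteforce("ccacg", "cct"))  # [(1, 1, 'cc')]
```

On real genomes `min_len` would be set to something like the 20 to 50 bases mentioned above, and the suffix-tree scan replaces the nested loops.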
Step 2: Chaining in MUMmer
- Sort MUMs according to position in genome A
- Solve a variation of the Longest Increasing Subsequence (LIS) problem to find sequences in ascending order in both genomes
Figure from: Delcher et al., Nucleic Acids Research 27, 1999
Finding Longest Subsequence
- Unlike ordinary LIS problems, MUMmer takes into account
  - lengths of the sequences represented by MUMs
  - overlaps
- Requires O(k log k) time, where k is the number of MUMs
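The chaining step can be illustrated with the ordinary LIS algorithm. This sketch is a simplification: it treats each MUM as a point (start in A, start in B) and ignores the MUM lengths and overlaps that MUMmer's variant accounts for. The function name and input format are assumptions; start positions are assumed distinct, which holds for unique matches.

```python
from bisect import bisect_left


def longest_increasing_chain(mums):
    """mums: list of (pos_a, pos_b). Return a longest chain increasing in both genomes."""
    mums = sorted(mums)            # sort by position in genome A
    tail_vals = []                 # tail_vals[r]: smallest pos_b ending a chain of length r + 1
    tails = []                     # index into mums of the MUM that realizes tail_vals[r]
    prev = [None] * len(mums)      # back pointers for reconstructing the chain
    for idx, (_, pos_b) in enumerate(mums):
        r = bisect_left(tail_vals, pos_b)          # O(log k) binary search
        prev[idx] = tails[r - 1] if r > 0 else None
        if r == len(tails):
            tails.append(idx)
            tail_vals.append(pos_b)
        else:
            tails[r] = idx
            tail_vals[r] = pos_b
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:         # walk the back pointers from the end of the best chain
        chain.append(mums[idx])
        idx = prev[idx]
    return chain[::-1]


print(longest_increasing_chain([(1, 1), (2, 5), (3, 2), (4, 3), (5, 6), (6, 4)]))
# [(1, 1), (3, 2), (4, 3), (6, 4)]
```

Sorting plus one binary search per MUM gives the O(k log k) bound quoted above.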
Types of Gaps in MUMmer Alignment
Figure from: Delcher et al., Nucleic Acids Research 27, 1999
Step 3: Close the Gaps
- SNPs:
  - between MUMs: trivial to detect
  - otherwise: handle like repeats
- Insertions:
  - transpositions (subsequences that were deleted from one location and inserted elsewhere): look for out-of-sequence MUMs
  - simple insertions: trivial to detect
Step 3: Close the Gaps
- Polymorphic regions:
  - short ones: align them with a dynamic programming method
  - long ones: call MUMmer recursively with a reduced minimum MUM length
- Repeats:
  - detected by overlapping MUMs
Figure from: Delcher et al., Nucleic Acids Research 27, 1999
The LAGAN Method
Brudno et al., Genome Research, 2003

Given: genomes A and B
anchors = find_anchors(A, B)
step 3: finish global alignment with DP constrained by anchors

find_anchors(A, B)
    step 1: find local alignments by matching, chaining k-mer seeds
    step 2: anchors = highest-weight sequence of local alignments
    for each pair of adjacent anchors a1, a2 in anchors
        if a1, a2 are more than d bases apart
            A', B' = sequences between a1, a2
            sub-anchors = find_anchors(A', B')
            insert sub-anchors between a1, a2 in anchors
    return anchors
Step 1: Finding Seeds in LAGAN
- Degenerate k-mers: matching k-long sequences with a small number of mismatches allowed
- By default, LAGAN uses 10-mers and allows 1 mismatch
[Figure: two sequences containing 10-mers that agree at all but one position]
Finding Seeds in LAGAN
Example: a trie to represent all 3-mers of the sequence gaaccgacct
[Figure: trie of 3-mers; leaves from left to right hold positions 2; 3, 7; 4; 8; 5; 1; 6]
- One sequence is used to build the trie
- The other sequence (the query) is "walked" through to find matching k-mers
Allowing Degenerate Matches
Suppose we're allowing 1 base to mismatch in looking for matches to the 3-mer acc; we need to explore the green nodes
[Figure: the 3-mer trie with the nodes reachable within one mismatch shown in green]
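The k-mer trie and the degenerate search can be sketched together. This is an illustrative toy using nested dictionaries, not LAGAN's data structure; the names `build_kmer_trie` and `degenerate_matches` are assumptions. The search explores a branch only while the number of mismatches so far stays within budget, which corresponds to the set of nodes that must be visited.

```python
def build_kmer_trie(seq, k):
    """Nested-dict trie of all k-mers in seq; depth-k entries are lists of 1-based starts."""
    root = {}
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        node = root
        for ch in kmer[:-1]:
            node = node.setdefault(ch, {})
        node.setdefault(kmer[-1], []).append(start + 1)
    return root


def degenerate_matches(node, kmer, max_mismatches):
    """Yield (matched_kmer, positions) for trie k-mers within max_mismatches of kmer."""
    if not kmer:
        yield "", node  # at depth k the "node" is the list of positions
        return
    for ch, child in node.items():
        cost = 0 if ch == kmer[0] else 1
        if cost <= max_mismatches:  # prune branches that exceed the mismatch budget
            for rest, positions in degenerate_matches(child, kmer[1:],
                                                      max_mismatches - cost):
                yield ch + rest, positions


trie = build_kmer_trie("gaaccgacct", 3)
print(dict(degenerate_matches(trie, "acc", 0)))  # {'acc': [3, 7]}
print(dict(degenerate_matches(trie, "acc", 1)))  # {'aac': [2], 'acc': [3, 7]}
```

With one mismatch allowed, the search reports the exact hit at positions 3 and 7 plus the near match aac at position 2.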
LAGAN Uses Threaded Tries
In a threaded trie, each leaf for word w_1...w_p has a back pointer to the node for w_2...w_p
[Figure: the 3-mer trie with back pointers drawn from the leaves]
Traversing a Threaded Trie
Consider traversing the trie to find 3-mer matches for the query sequence: accgt
[Figure: the threaded 3-mer trie with the traversal for the query marked]
- Usually requires following only two pointers to match against the next k-mer, instead of traversing the tree from the root for each
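A threaded trie can be sketched as follows. This is a minimal illustration under assumed names (`Node`, `descend`, `build_threaded_trie`, `find_kmer_hits`), not LAGAN's implementation, and it handles exact matches only. After a k-mer hit, the next query window is resolved with two pointer follows (the back pointer, then one child edge); only after a miss does the walk restart from the root.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.positions = []  # 1-based k-mer start positions (depth-k nodes only)
        self.back = None     # leaves only: node spelling w_2...w_k, if it exists


def descend(root, word):
    """Walk word down from root; return the node reached, or None."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def build_threaded_trie(seq, k):
    root, leaves = Node(), {}
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        node = root
        for ch in kmer:
            node = node.children.setdefault(ch, Node())
        node.positions.append(start + 1)
        leaves[kmer] = node
    for kmer, leaf in leaves.items():
        leaf.back = descend(root, kmer[1:])  # thread: leaf for w_1..w_k -> node for w_2..w_k
    return root


def find_kmer_hits(root, query, k):
    """Return (query_pos, kmer, trie_positions) for each exact k-mer hit."""
    hits, leaf = [], None
    for start in range(len(query) - k + 1):
        window = query[start:start + k]
        if leaf is not None and leaf.back is not None:
            leaf = leaf.back.children.get(window[-1])  # two pointer follows
        else:
            leaf = descend(root, window)               # restart from the root
        if leaf is not None:
            hits.append((start + 1, window, leaf.positions))
    return hits


trie = build_threaded_trie("gaaccgacct", 3)
print(find_kmer_hits(trie, "accgt", 3))  # [(1, 'acc', [3, 7]), (2, 'ccg', [4])]
```

In the example, acc is found from the root, ccg is reached through the back pointer of the acc leaf, and cgt fails because the node for cg has no child t.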
Step 1b: Chaining Seeds in LAGAN
- Can chain seeds s_1 and s_2 if
  - the indices of s_1 > indices of s_2 (for both sequences)
  - s_1 and s_2 are near each other
- Keep track of seeds in the "search box" as the query sequence is processed
Figure from: Brudno et al., BMC Bioinformatics, 2003
Step 2: Chaining in LAGAN
- Use sparse dynamic programming to chain local alignments
The Problem: Find a Chain of Local Alignments
- (x, y) -> (x', y') requires x < x' and y < y'
- Each local alignment has a weight
- FIND the chain with the highest total weight
Slide from Serafim Batzoglou, Stanford University
Sparse DP for rectangle chaining
- 1, ..., N: rectangles
- (h_j, l_j): y-coordinates of rectangle j
- w(j): weight of rectangle j
- V(j): optimal score of a chain ending in j
- L: list of triplets (l_j, V(j), j)
  - L is sorted by l_j: smallest (North) to largest (South) value
  - L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked]
Slide from Serafim Batzoglou, Stanford University
Sparse DP for rectangle chaining
Main idea:
- Sweep through x-coordinates
- To the right of b, anything chainable to a is chainable to b
- Therefore, if V(b) > V(a), rectangle a is "useless" for subsequent chaining
- In L, keep rectangles j sorted with increasing l_j-coordinates; they are then also sorted with increasing V(j) score
[Figure: rectangles a and b with scores V(a) and V(b)]
Slide from Serafim Batzoglou, Stanford University
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L, with largest l_k <= l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) <= V(i) and l_j >= l_i
[Figure: rectangles j, k, and i in the sweep]
Slide from Serafim Batzoglou, Stanford University
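The sweep above can be sketched directly. This is a minimal illustration under stated assumptions: each rectangle is a tuple (x_left, x_right, h, l, weight) with h < l, weights are non-negative, and L is kept as a sorted Python list searched with `bisect` instead of a balanced binary tree, so inserts and deletes cost O(N) here rather than O(log N). The function name and the rectangle coordinates in the example are hypothetical.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """Return (best_score, chain) where chain lists rectangle indices in order."""
    if not rects:
        return 0, []
    events = []
    for i, (x_left, x_right, h, l, w) in enumerate(rects):
        events.append((x_left, 0, i))   # kind 0: left end (processed first on ties)
        events.append((x_right, 1, i))  # kind 1: right end
    events.sort()

    ls, entries = [], []                # L: ls[p] = l_j and entries[p] = (V(j), j)
    V = [0] * len(rects)
    prev = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)      # entries before p have l_j < h_i
            if p > 0:
                V[i] = w + entries[p - 1][0]
                prev[i] = entries[p - 1][1]
            else:
                V[i] = w                # nothing chainable: start a new chain
        else:
            p = bisect_right(ls, l)     # entries before p have l_k <= l_i
            if p == 0 or V[i] > entries[p - 1][0]:
                start = bisect_left(ls, l)
                end = start             # dominated entries are consecutive from start
                while end < len(ls) and entries[end][0] <= V[i]:
                    end += 1
                del ls[start:end]
                del entries[start:end]
                ls.insert(start, l)
                entries.insert(start, (V[i], i))

    best = max(range(len(rects)), key=lambda i: V[i])
    score, chain = V[best], []
    while best is not None:
        chain.append(best)
        best = prev[best]
    return score, chain[::-1]


rects = [(1, 2, 1, 2, 5), (3, 5, 3, 4, 6), (3, 4, 6, 7, 3), (6, 7, 5, 6, 4)]
print(chain_rectangles(rects))  # (15, [0, 1, 3])
```

In the example, rectangle 2 is inserted into L first and then removed when rectangle 1 ends, because rectangle 1 sits further north with a higher score; that removal is the "useless rectangle" rule in action.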
Example
[Figure: five rectangles a to e in the x-y plane (x-axis ticks: 2, 5, 6, 9, 10, 11, 12, 14, 15, 16), with the table L of triplets (l_i, V(i), i) filled in by the sweep]
Weights: a: 5, b: 6, c: 3, d: 4, e: 2
V: a = 5, b = 11, c = 8, d = 12, e = 13
Slide from Serafim Batzoglou, Stanford University
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
   - Searching L takes log N
   - Inserting to L takes log N
   - All deletions are consecutive, so log N per deletion
   - Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree
Slide from Serafim Batzoglou, Stanford University
Constrained Dynamic Programming
If we know that the i-th element in one sequence must align with the j-th element in the other, we can ignore two rectangles in the DP matrix
[Figure: DP matrix with row i and column j marked and the two excluded rectangles shaded]
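The saving from a single anchor can be sketched with a plain global alignment score. This is an illustration under assumed scoring values (match +1, mismatch -1, linear gap -1) and hypothetical function names; it computes scores only, not the alignment itself. Forcing position i of one sequence to align with position j of the other splits the problem into two independent sub-matrices, so the upper-right and lower-left rectangles are never filled.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scoring scheme, for illustration only


def nw_score(a, b):
    """Global alignment score via Needleman-Wunsch, keeping one DP row at a time."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * GAP]
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + sub, prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


def anchored_score(a, b, i, j):
    """Best score given that a[i] must align with b[j] (1-based positions)."""
    sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
    return nw_score(a[:i - 1], b[:j - 1]) + sub + nw_score(a[i:], b[j:])


def cells_filled(a, b, i, j):
    """Return (cells with the anchor, cells without it) as a rough cost comparison."""
    constrained = (i - 1) * (j - 1) + (len(a) - i) * (len(b) - j)
    return constrained, len(a) * len(b)


a = b = "acgtacgt"
print(nw_score(a, b), anchored_score(a, b, 4, 4))  # 8 8
print(cells_filled(a, b, 4, 4))                    # (25, 64)
```

A correct anchor leaves the score unchanged while cutting the work; a wrong anchor can only lower the score, which is one reason to allow some slack around anchors.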
Step 3: Computing the Global Alignment in LAGAN
- Given an anchor that starts at (i, j) and ends at (i', j'), LAGAN limits the DP to the unshaded regions
- Thus anchors are somewhat flexible
Figure from: Brudno et al., Genome Research, 2003
Step 3: Computing the Global Alignment in LAGAN
Figures from: Brudno et al., Genome Research, 2003
Example Alignment: E. coli O157:H7 vs. E. coli K-12
Figure from: Perna et al., Nature, 2001
Comparison of Large-Scale Alignment Methods

Method                 Pattern matching                        Chaining
MUMmer                 suffix tree - MUMs                      LIS variant
AVID (not discussed)   suffix tree - exact & wobble matches    Smith-Waterman variant
LAGAN                  k-mer trie, inexact matches             sparse DP