Exact Matching. Exact Matching Algorithms 5/19/2015. Exact Matching Problem: search pattern P in text T (P,T are strings)


Exact Matching

Exact Matching Problem: search pattern P in text T (P, T are strings).

Exact Matching Algorithms:
- Knuth-Morris-Pratt: preprocess pattern P
- Aho-Corasick: preprocess a pattern P of several strings, $P = \{P_1, \dots, P_r\}$
- Suffix Trees: preprocess text T, or several texts (a database of texts)

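Exact Matching: Preprocessing Pattern(s)

- Knuth-Morris-Pratt (KMP)
- Aho-Corasick

Searching Pattern P in Text T

Naive: after a mismatch, shift P by one position and compare again.
Better: after a mismatch, shift P further, reusing what has already been matched.
[Figure: pattern P aligned under text T at successive shifts, four alignments for the naive method and three for the better one.]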

KMP: Preprocessed P (example)

Pattern $P \in \{a, b, \dots\}^*$.
[Figure: the preprocessed pattern shown as an automaton (top) and as a table (bottom), with failure links.]

Failure links (suffix = prefix):
- how to determine?
- how to use?

KMP: computing failure links

A failure link gives the new best match after a mismatch, or 0.

    Flink[1] = 0;
    for k from 2 to PatLen do
        fail = Flink[k-1]
        while ( fail > 0 and P[fail] ≠ P[k-1] ) do
            fail = Flink[fail];
        od
        Flink[k] = fail + 1;
    od

KMP: computing failure links (trace)

Tracing the value of fail in the loop for the first positions of the example pattern:
- k = 2: fail = 0; the while test is false, so Flink[2] = 0 + 1 = 1.
- k = 3: fail = 1; the while test is true, so fail = 0; the test is now false, so Flink[3] = 1.
- k = 4: fail = 1; the while test is false, so Flink[4] = 2.

Prefixes via failure links

Flink[k] = r means that $P_1 \dots P_{r-1} = P_{k-r+1} \dots P_{k-1}$ (prefix = suffix) with r < k maximal.

All such values r are found by following the links from k: $r_1, r_2, r_3, r_4, \dots$ with $\mathrm{Flink}[r_1] = r_2$, because
$P_1 \dots P_{r_2-1} = P_{k-r_2+1} \dots P_{k-1} = P_{r_1-r_2+1} \dots P_{r_1-1}$.
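The failure-link loop above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the names `failure_links` and `kmp_search` are hypothetical, the array keeps the slides' 1-indexed convention (index 0 is unused), and one extra entry `Flink[PatLen+1]` is computed as an assumption so that the search can continue after a full match.

```python
def failure_links(p):
    """Return 1-indexed failure links Flink[1..len(p)+1]; index 0 is unused."""
    n = len(p)
    flink = [0] * (n + 2)
    for k in range(2, n + 2):
        fail = flink[k - 1]
        # P[j] in the 1-indexed slide notation is p[j - 1] here.
        while fail > 0 and p[fail - 1] != p[k - 2]:
            fail = flink[fail]
        flink[k] = fail + 1
    return flink


def kmp_search(p, t):
    """Return the 0-indexed start positions of all occurrences of p in t."""
    n = len(p)
    if n == 0:
        return []
    flink = failure_links(p)
    hits = []
    k = 1  # 1-indexed pattern position to compare next
    for i, ch in enumerate(t):
        while k > 0 and p[k - 1] != ch:
            k = flink[k]      # mismatch: fall back along the failure links
        k += 1                # either P[k] matched, or k was 0 (restart at P[1])
        if k == n + 1:        # the whole pattern matched, ending at t[i]
            hits.append(i - n + 1)
            k = flink[n + 1]
    return hits
```

For a pattern starting with "aba", `failure_links` gives Flink[2] = 1, Flink[3] = 1 and Flink[4] = 2, which agrees with the trace.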

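Other methods

Boyer-Moore: compare the pattern with the text working backwards, e.g. T = marktkoopman, P = schoenveter.

Karp-Rabin: compare fingerprints (hash values) of the pattern $p_1 \dots p_n$ and of each length-n window of the text. The hash value of the window starting at position i is
$a_i B^{n-1} + a_{i+1} B^{n-2} + \dots + a_{i+n-1} B^{0}$
and that of the next window is
$a_{i+1} B^{n-1} + \dots + a_{i+n-1} B^{1} + a_{i+n} B^{0}$.

Exact matching with a set of patterns

- $P = \{P_1, \dots, P_r\}$, total length m
- find all occurrences in text T, length n
- Aho-Corasick generalizes KMP
- failure links: longest suffix that is a prefix (perhaps in another string)
- assume no subwords within P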
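The Karp-Rabin fingerprint above can be updated in constant time when the window moves one position: subtract the leading term, multiply by B, and add the new letter. The sketch below assumes a base `base` and a prime modulus `mod` chosen only for illustration (the slides do not fix them), and it verifies every fingerprint hit by direct comparison because different windows can share a hash value.

```python
def rabin_karp(p, t, base=256, mod=1_000_003):
    """Return 0-indexed start positions of p in t using a rolling hash."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    high = pow(base, n - 1, mod)          # B^(n-1), weight of the leading letter
    hp = ht = 0
    for i in range(n):
        hp = (hp * base + ord(p[i])) % mod    # fingerprint of the pattern
        ht = (ht * base + ord(t[i])) % mod    # fingerprint of the first window
    hits = []
    for i in range(m - n + 1):
        # Equal fingerprints are only a candidate; confirm by comparison.
        if hp == ht and t[i:i + n] == p:
            hits.append(i)
        if i + n < m:
            # Drop t[i], shift by one power of B, append t[i + n].
            ht = ((ht - ord(t[i]) * high) * base + ord(t[i + n])) % mod
    return hits
```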

Keyword tree (trie)

- edges ~ letters
- leaves ~ keywords
[Figure: trie for the keywords { potato, poetry, pottery, science, school }.]

Failure links

[Figure: trie for { potato, tattoo, theater, other } with failure links, for example from potato into other and from potato into tattoo.]
Failure links lead into other branches, and therefore into other patterns.

Adding failure links

Algorithm for adding the failure links: follow the existing failure links.
- For a new edge with label a: follow existing failure links, starting at the parent node of the edge, until an outgoing edge with label a is found.
- Process the tree breadth first (level by level).
[Figure: trie for { potato, tattoo, theater, other } while failure links are being added.]
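The breadth-first rule above can be sketched as a small Aho-Corasick automaton. This is an illustrative sketch under assumptions: the node layout (parallel lists `goto`, `fail`, `out`) and the function names are hypothetical, and each node also inherits the keywords of its failure target so that a keyword ending inside another one would still be reported.

```python
from collections import deque


def build_automaton(patterns):
    """Build the keyword tree and add failure links level by level."""
    goto, fail, out = [{}], [0], [[]]     # node 0 is the root
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    queue = deque(goto[0].values())       # children of the root fail to the root
    while queue:
        parent = queue.popleft()
        for ch, child in goto[parent].items():
            f = fail[parent]
            # Follow existing failure links until an edge labelled ch is found.
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (start, keyword) pairs for all keyword occurrences in text."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            hits.append((i - len(p) + 1, p))
    return hits
```

With the keyword set from the slide, scanning a text that contains "potatother" follows the failure link from the end of potato into the branch of other, so both keywords are found in one pass.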

Failure links

[Figure: trie for { potato, tattoo, theater, other } with all failure links.]
Shortcuts: root to child of root (see for example "po") [single letter].

Preprocess text T: Suffix Trees

Trie vs. suffix tree

[Figure: a text string T over {a, b} with all its suffixes, stored once as a trie and once as a suffix tree.]

Trie:
- Given text T: $|\mathrm{Trie}(T)| = O(|T|^2)$, quadratic
- bad example: $T = a^n b^n$
- Trie(T) is like a DFA (Deterministic Finite Automaton) for the suffixes of T
- minimizing the DFA gives the directed acyclic word graph (also used by Ukkonen)

Suffix tree:
- only branching nodes and leaves are represented
- edges are labeled by substrings of T
- leaves correspond to suffixes
- $|T|$ leaves, hence fewer than $|T|$ internal nodes
- $|\mathrm{Tree}(T)| = O(|T| + \mathrm{size}(\text{edge labels}))$, linear

Suffix tree of text T

[Figure: text T with its suffixes and the corresponding suffix tree; each leaf carries the position of its suffix in T, which is also the order of insertion.]

Implementation: edge labels refer to positions of the substrings in T (for example 3-3, 4-5, 7-11) instead of storing the substrings themselves.

Linear time construction

- Weiner (1973): "algorithm of the year"
- McCreight (1976)
- on-line algorithm (Ukkonen 1992)

Failure link to maximal prefix of suffix in T.

[Figure: online construction of the suffix trie for a text T over {a, b}, with suffix links. When the next symbol is b, the construction follows suffix links until it reaches a node from which b already exists.]

Suffix tree application: full text index

- P occurs in T if and only if P is a prefix of a suffix of T
- the subtree under P gives the locations of P

[Figure: example of finding "itt" in the suffix tree of T; the leaves below the matched path give the positions.]

Suffix tree application: longest common substring

Example strings: apples, plate.
- Construct a generalized suffix tree containing the suffixes of both T and T'.
- Mark the T and T' suffixes.

Application: counting motifs

[Figure: suffix tree of T with a count at each node.]
The suffix tree now contains counts for the number of suffixes in each subtree.

Suffix Trees Experiments

Motifs and repeats in DNA, as reported by Ukkonen:
- human chromosome 3, the first [number lost] bases: 31 min CPU time (8 processors, 4 GB)
- human genome: $3 \times 10^9$ bases; a suffix tree for the human genome is feasible

Longest repeat? Length: 2559.
[Sequence listing and occurrence positions not recoverable from the transcription.]

Ten occurrences? Length: 277.
[Sequence listing and occurrence positions, including those in the reversed complement, not recoverable from the transcription.]

Suffix Tree Remarks

- suffix tree: efficient (linear) storage, but a large constant overhead (±40)
- suffix array: constant overhead ±5, hence more practical, but it has its own complications
- the naive n log(n) algorithm is in some cases not too bad (next slide)

Suffix Array

The suffixes of T in lexicographic order.
[Figure: the suffixes of the example text, unsorted and sorted.]

Sources

- Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Lists many applications for suffix trees (and extended implementation details).
- Slides on suffix trees based on/copied from Esko Ukkonen, Univ Helsinki (Erice School, 30 Oct 2005).
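A suffix array is simply the list of suffix start positions in lexicographic order of the suffixes, and a pattern can then be located by binary search. The sketch below uses the naive construction (sort the suffixes directly), which is simple but compares whole suffixes and is not the linear-time approach. The function names and the example text "banana" are assumptions for illustration; they do not come from the slides.

```python
def suffix_array(t):
    """Start positions of the suffixes of t, in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def find(t, sa, p):
    """Return the sorted start positions of p in t via two binary searches."""
    def prefix(rank):
        start = sa[rank]
        return t[start:start + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

All suffixes that start with P are adjacent in the array, which is the array counterpart of "the subtree under P" in the suffix tree.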

Next Generation Sequencing

Based on Lecture Notes by R. Shamir [7]
E.M. Bakker

Overview
- Introduction
- Next Generation Technologies
- The Mapping Problem
- The MAQ Algorithm
- The Bowtie Algorithm
- Burrows-Wheeler Transform
- Sequence Assembly Problem
- De Bruijn Graphs
- Other Assembly Algorithms

Introduction

1953, Watson and Crick: the structure of the DNA molecule. With DNA identified as the carrier of the genetic information, the challenge of reading the DNA sequence became central to biological research.
- Methods for DNA sequencing were extremely inefficient, laborious and costly.
- Holley: reliably sequencing the yeast gene for tRNA-Ala required the equivalent of a full year's work per person per base pair (bp) sequenced (1 bp/person-year).

Two classical methods for sequencing DNA fragments, by Sanger and Gilbert.

Sanger sequencing (sketch):
a. The polymerase extends the labeled primer, randomly incorporating either a normal dCTP base or a modified ddCTP base. At every position where ddCTP is inserted, polymerization terminates. The result is a population of fragments, where the length of each fragment is a function of the relative distance from the modified base to the primer.
b. Electrophoretic separation of the products of each of the four reaction tubes (ddG, ddA, ddT, and ddC), run in individual lanes. The bands on the gel represent the respective fragments (labelled strands); read from bottom to top, they give the complement of the original template.

Introduction

1980s: methods were augmented by
- partial automation
- the cloning method, which allowed fast and exponential replication of a DNA fragment

Start of the human genome project: sequencing efficiency reached 200,000 bp/person-year.
End of the human genome project: 50,000,000 bp/person-year. (Total cost summed to $3 billion.)

Note: Moore's Law is a doubling of transistor count every 2 years. Here: a doubling of the number of base pairs/person-year every 1.5 years.

Introduction

Recently ( /2012): new sequencing technologies, called next generation sequencing (NGS) or deep sequencing.
- reliable sequencing of $100 \times 10^9$ bp/person-year
- NGS now allows compiling the full DNA sequence of a person for ~$10,000
- within 3-5 years the cost is expected to drop to ~$1000
- (2013) full human genome scan with coverage of 3 for around 2000 euro

[Figure from: E.C. Hayden, The 1000$ Genome. Nature, March.]

Next Generation Sequencing Technologies

Next generation sequencing technology of Illumina, one of the leading companies in the field:
- DNA is replicated, giving millions of copies.
- All DNA copies are randomly shredded into various fragments using restriction enzymes or mechanical means.
- The fragments form the input for the sequencing phase.
- Hundreds of copies of each fragment are generated in one spot (cluster) on the surface of a huge matrix.
- Initially it is not known which fragment sits where.

Note: There are quite a few other technologies available and currently under development.

Next generation sequencing technology of Illumina (continued):
- A DNA replication enzyme is added to the matrix along with 4 slightly modified nucleotides.
- The 4 nucleotides are slightly modified chemically so that each emits a unique color when excited by a laser, and each terminates the replication.
- Hence, the growing complementary DNA strand on each fragment in each cluster is extended by one nucleotide at a time.

Next generation sequencing technology of Illumina (continued):
- A laser is used to obtain a read of the actual nucleotide that is added in each cluster. (The multiple copies in a cluster provide an amplified signal.)
- The solution is washed away together with the chemical modification on the last nucleotide, which prevented further elongation and emitted the unique signal.
- Millions of fragments are sequenced efficiently and in parallel.
- The resulting fragment sequences are called reads.
- Reads form the input for further sequence mapping and assembly.

Next Generation Sequencing Technologies

[Figure, four panels:]
(1) A modified nucleotide is added to the complementary DNA strand by a DNA polymerase enzyme.
(2) A laser is used to obtain a read of the nucleotide just added.
(3) The full sequence of a fragment is thus determined through successive iterations of the process.
(4) A visualization of the matrix where fragment clusters are attached to flow cells.

The Short Read Mapping Problem

The Mapping Problem
INPUT: m reads $S_1, \dots, S_m$ of length l and an approximate reference genome R.
QUESTION: What are the positions $x_1, \dots, x_m$ along R where each read $S_1, \dots, S_m$ matches, respectively?

For example: after sequencing the genome of a person we want to map it to an existing sequence of the human genome.

The new sample will not be 100% identical to the reference genome:
- natural variation in the population: some reads may align to their position in the reference with mismatches or gaps
- sequencing errors
- repetitive regions in the genome make it difficult to decide where to map the read
- humans are diploid organisms, with different alleles on the maternal and paternal chromosomes: two slightly different reads may map to the same location (some perhaps with mismatches)

Possible Solutions for the Mapping Problem

The problem of aligning a short read to a long genome sequence is exactly the problem of local alignment. But in the case of the human genome:
- the number of reads m is usually [value lost]
- the length of a read l is [value lost] bp
- the length of the genome $|R|$ is $3 \times 10^9$ bp (or twice that, for the diploid genome)

Solutions for the Mapping Problem

Naive algorithm:
- for each $S_i$, scan the reference string R, match the read at each position p and pick the best match
- Time complexity: $O(m \cdot l \cdot |R|)$ for exact or inexact matching (m is the number of reads, l the read length)
- Considering the parameters for our problem, this is clearly impractical.

Less naive solution:
- use the Knuth-Morris-Pratt algorithm to match each $S_i$ to R
- Time complexity: $O(m (l + |R|)) = O(ml + m|R|)$ for exact matching
- A substantial improvement, but still not enough.

The Mapping Problem: Suffix Trees

1. Build a suffix tree for R.
2. For each $S_i$ (i in {1, ..., m}) find matches by traversing the tree from the root.

Time complexity: $O(ml + |R|)$, assuming the tree is built using Ukkonen's linear-time algorithm.

Notes:
- The time complexity is practical.
- We only need to build the tree for the reference genome once. The tree can be saved and used repeatedly for mapping new sequences, until an updated version of the genome is issued.

The Mapping Problem: Suffix Trees

Space complexity now becomes the obstacle:
- The leaves of the suffix tree also hold the indices where the suffixes begin, so the tree requires $O(|R| \log |R|)$ bits just for coding the indices.
- The constants are large due to the additional layers of information required for the tree (such as suffix links).
- The original human genome reference sequence demands just $|R| \log(\#\text{symbols})$ bits, so we can store the human genome using ~750 MB, but we need ~64 GB for the suffix tree (2015: doable).

Final problem: suffix trees allow only for exact matching.

The Mapping Problem: Hashing

Preprocess the reference genome R into a hash table H.
- The keys of the hash are all the substrings of length l in R.
- For each key, table H contains the position p in R where the substring ends.

Matching: given $S_i$ the algorithm returns $H(S_i)$.

Time complexity: $O(m \cdot l + l \cdot |R|)$

Note:
- The space complexity is $O(l \cdot |R| + |R| \log |R|)$, since we must also hold the binary representation of each substring's position. (Still too high.)
- Packing the substrings into bit-vectors, representing each nucleotide as a 2-bit code, reduces the space complexity by a factor of 4.
- Practical solution: partition the genome into several chunks, each of the size of the cache memory, and run the algorithm on each chunk in turn.

Again, only exact matching is allowed.

The Mapping Problem: The MAQ Algorithm

The MAQ (Mapping and Alignment with Qualities) algorithm by Li, Ruan and Durbin, 2008 [5]:
- An efficient solution to the short read mapping problem.
- Allows for inexact matching, up to a predetermined maximum number of mismatches.
- Memory efficient.
- Measures for base and read quality, and for identifying sequence variants.

The MAQ Algorithm

A key insight: if a read in its correct position on the reference genome has one mismatch, then partitioning that read into two makes sure one part is still exactly matched. Hence, if we perform an exact match twice, each time using only a subset of the read bases (a template), one of the subsets is enough to find the match.

For 2 mismatches, 6 templates:
- some are noncontiguous
- each template covers half the read
- each template has a partner template that complements it to form the full read
If the read has no more than two mismatches, at least one template will not be affected by any mismatch.

For 3 mismatches, 20 templates guarantee at least one fully matched template. Etc.

The MAQ Algorithm: Templates

[Figure: A and B are reads, where purple boxes indicate mismatches with respect to the reference. The numbered templates have blue boxes in the positions they cover.]
- 1 mismatch: comparing read A through template 1 gives no full match; template 2 is fully matched.
- 2 mismatches: read B is fully matched with template 6.

Any combination of up to two mismatched positions will be avoided by at least one of the 6 templates.

Note: the 6 templates guarantee that 57% of all 3-mismatch cases will also have at least one fully matching template.
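The template counts above (2 templates for one mismatch, 6 for two, 20 for three) are consistent with a simple pigeonhole construction: split the window into 2k blocks and let each template cover k of them. The sketch below checks the guarantee exhaustively at the level of blocks. It is an assumption that this matches MAQ's actual template layout; the slides only state the counts and that each template covers half the read. All function names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement


def make_templates(max_mismatches):
    """Each template is the set of blocks it covers: k out of 2k blocks."""
    k = max_mismatches
    return [frozenset(c) for c in combinations(range(2 * k), k)]


def has_clean_template(templates, mismatch_blocks):
    """True if some template covers no block that contains a mismatch."""
    hit = set(mismatch_blocks)
    return any(not (t & hit) for t in templates)


def guarantee_holds(max_mismatches):
    """Try every way of placing up to k mismatches into the 2k blocks."""
    k = max_mismatches
    templates = make_templates(k)
    return all(
        has_clean_template(templates, placement)
        for size in range(k + 1)
        for placement in combinations_with_replacement(range(2 * k), size)
    )
```

The argument behind the check: k mismatches touch at most k of the 2k blocks, so at least k blocks stay clean, and some template consists of exactly those.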

The MAQ Algorithm

The algorithm:
- Generate the number of templates required to guarantee at least one full match for the desired maximal number of mismatches.
- For each read: match the read against the templates, and use the exactly matching template as a seed for extending across the full read.

Identifying the exact matches is done by hashing the read templates into template-specific hash tables, and scanning the reference genome R against each table.

Note: It is not necessary to generate templates and hash tables for the full read, since the initial seed will undergo extension, e.g. using the Smith-Waterman algorithm. Therefore the algorithm initially processes only the first 28 bases of each read. (The first bases are the most accurate bases of the reads.)

The MAQ Algorithm

Algorithm (for the case of up to two mismatches):
1. Index the first 28 bases of each read with complementary templates 1 and 2, thereby generating hash tables H1 and H2, respectively.
2. Scan the reference genome R: for each position and each orientation, query a 28-bp window through templates 1 and 2 against the appropriate tables H1 and H2, respectively. If hit, extend and score the complete read based on mismatches. For each read, keep the two best scoring hits and the number of mismatches therein.
3. Repeat steps 1+2 with complementary templates 3, 4, then 5, 6.

Remark: The reason for indexing against a pair of complementary templates each time has to do with a feature of the algorithm regarding paired-end reads (see the original paper for details).

The MAQ Algorithm: Complexity

Time complexity:
- $O(m \cdot l)$ for generating the hash tables in Step 1, and
- $O(l \cdot |R|)$ for scanning the genome in Step 2.
- Repeating Steps 1+2 three times in this implementation has no effect on the asymptotic complexity.

Space complexity:
- $O(m \cdot l)$ for holding the hash tables in Step 1, and
- $O(m \cdot l + |R|)$ total space for Step 2, but only $O(m \cdot l)$ space in main memory at any one time.

The MAQ Algorithm: Complexity

Note that not the full read length l is used, but a window of length $l' \le l$ (28 bp in the implementation presented above).
- The size of each key in a template hash table is only $l'/2$, since each template covers half the window.
- Hence the time and space complexity of Step 1 is $O(2 \cdot m \cdot l'/2) = O(m l')$.
- This reduction in space makes running the algorithm on a PC feasible.

The time and space required for extending a match in Step 2 using the Smith-Waterman algorithm, where p is the probability of a hit (using the $l'$-length window):
- Time complexity: $O(l' |R| + p |R| l^2)$.
- Space complexity: $O(m l' + l^2)$.

Note: The value of p is small, and decreases drastically the longer $l'$ is, since most $l'$-long windows in the reference genome will likely not capture the exact coordinates of a true read.

Read Mapping Qualities

The MAQ algorithm provides a measure of the confidence level for the mapping of each read, denoted by $Q_S$ and defined as:
$Q_S = -10 \log_{10}(\Pr\{\text{read } S \text{ is wrongly mapped}\})$

For example: $Q_S = 30$ if the probability of incorrect mapping of read S is 1/1000.

This confidence measure is called phred-scaled quality, following the scaling scheme originally introduced by Phil Green and colleagues [3] for the human genome project.

The Bowtie Algorithm

Another way to map reads to a reference genome is given by the Bowtie algorithm, presented in 2009 by Langmead et al. [1].
- It solves the mapping problem using a space-efficient indexing scheme.
- The indexing scheme used is called the Burrows-Wheeler transform [2] and was originally developed for data compression purposes.
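The phred-scaled mapping quality defined earlier on this slide is a one-line computation in each direction. A small sketch, with hypothetical function names:

```python
import math


def mapping_quality(p_wrong):
    """Phred-scaled quality: Q = -10 * log10(probability of a wrong mapping)."""
    return -10.0 * math.log10(p_wrong)


def error_probability(quality):
    """Inverse of the phred scale: the probability that the mapping is wrong."""
    return 10.0 ** (-quality / 10.0)
```

Every 10 points on this scale correspond to a tenfold drop in the error probability, so a quality of 30 stands for 1 in 1000 and a quality of 20 for 1 in 100.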

The Burrows-Wheeler Transform

Applying the Burrows-Wheeler transform BW(T) to the text T = "the next text that i index.":
1. First, we generate all cyclic shifts of T.
2. Next, we sort these shifts lexicographically.
   - We define the character '.' as the minimum and assume that it appears exactly once, as the last symbol in the text.
   - It is followed lexicographically by ' ' (space),
   - followed by the English letters according to their natural ordering.
3. Call the resulting matrix M.

The transform BW(T) is defined as the sequence of the last characters in the rows of M.

Note that the last column is a permutation of all characters in the text, since each character appears in the last position in exactly one cyclic shift.

Burrows-Wheeler Transform

[Figure: some of the cyclic shifts of T sorted lexicographically and indexed by the last character.]
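The definition above (sort all cyclic shifts, read off the last column) translates directly into a few lines of Python. This sketch is for illustration only: it materializes every rotation, so it needs quadratic space, and it relies on Python's code-point order, in which the assumed end marker '$' sorts before the letters. The slide's own ordering for the sentence example ('.' before space) would need a custom sort key. The function name `bwt` is hypothetical.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform of text, using end as a unique smallest marker."""
    if end in text:
        raise ValueError("the end marker must not occur in the text")
    t = text + end
    # All cyclic shifts of t, sorted lexicographically: the matrix M.
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    # BW(T) is the last column of M.
    return "".join(row[-1] for row in rows)
```

For the short DNA example used later in the slides, `bwt("acaacg")` yields `gc$aaac`.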

Burrows-Wheeler Transform

Storing BW(T) requires the same space as the size of the text T, since it is simply a permutation of T. In the case of the human genome, each character can be represented by 2 bits, so storing the permutation requires 2 times $3 \times 10^9$ bits (instead of ~30 times $3 \times 10^9$ for storing all indices of T).

Burrows-Wheeler Transform

The following holds for BW(T):
1. The number of occurrences in T of the character c equals the number of occurrences of c in BW(T) (which is a permutation of T).
2. The first column of the matrix M can be obtained by sorting BW(T) lexicographically.
3. The number of occurrences of the substring 'xt' in T can be found as follows:
   - BW(T) is the last column of the lexicographical sorting of the shifts.
   - The character at the last position of a row appears in the text T immediately prior to the first character in the same row (each row is a cyclical shift).
   - So consider the interval of 't' in the first column, and check whether any of these rows have an 'x' at the last position.

[Figure: recovering the first column (left, sorted BW(T)) by sorting the last column (BW(T)).]

Burrows-Wheeler Transform

Given BW(T), the second column can also be derived:
- 'xt' appears twice in the text, and we see that 3 rows start with an 'x'.
- Two of those must be followed by 't', where the lexicographical sorting determines which 'x'.
- The third 'x' is followed by '.' (see first row). So '.' must follow the first 'x' in the first column, since '.' is smaller lexicographically than 't'.
- The second and third occurrences of 'x' in the first column are therefore followed by 't'.

Note: We can use the same process to recover the characters at the second column for each interval, and then the third, etc.

[Figure: columns F and L. Last-first mapping: each 't' character in L is linked to its position in F and no crossed links are possible. The j-th occurrence of character X in L corresponds to the same text character as the j-th occurrence of X in F.]

Burrows-Wheeler Transform

The previous two central properties of the BW-transform are captured in the Lemma by Ferragina and Manzini [4]:

Lemma 12.1 (Last-First Mapping): Let M be the matrix whose rows are all cyclical shifts of T sorted lexicographically, and let L(i) be the character at the last column of row i and F(i) be the first character in that row. Then:
1. In row i of M, L(i) precedes F(i) in the original text: $T = \dots L(i)\, F(i) \dots$
2. The j-th occurrence of character X in L corresponds to the same text character as the j-th occurrence of X in F.

[Figure: columns F and L. Last-first mapping: the j-th occurrence of character 't' in L is linked to its j-th occurrence in F, and no crossed links are possible.]

Burrows-Wheeler Transform

Proof:
1. Follows directly from the fact that each row in M is a cyclical shift.
2. Let $X_j$ denote the j-th occurrence of character X in L, let α be the character following $X_j$ in the text and β the character following $X_{j+1}$. Since $X_j$ appears above $X_{j+1}$ in L, α appears at the beginning of a row above the row that starts with β. The rows are lexicographically ordered, hence α must be equal to or lexicographically smaller than β. Now clearly $X\alpha \le X\beta$ holds lexicographically. Hence, as the rows are lexicographically ordered, the occurrence of $X_j$ in F, which is followed by α, lies above $X_{j+1}$, which is followed by β. This proves the Lemma.

Reconstructing the Text

Algorithm UNPERMUTE for reconstructing a text T from its Burrows-Wheeler transform BW(T), utilizing Lemma 12.1 [4]:
- assume the actual text T is of length u
- append a unique $ character (= '.') at the end, which is the smallest lexicographically

UNPERMUTE[BW(T)]
1. Compute the array $C[1, \dots, |\Sigma|]$, where C(c) is the number of characters {$, 1, ..., c-1} in T, i.e., the number of characters that are lexicographically smaller than c.
2. Construct the last-first mapping LF, tracing every character in L to its corresponding position in F:
   LF[i] = C(L[i]) + r(L[i], i) + 1,
   where r(c, i) is the number of occurrences of character c in the prefix L[1, i-1].
3. Reconstruct T backwards:
       s = 1; T(u) = L[1];
       for i = u-1, ..., 1 do
           s = LF[s];
           T[i] = L[s];
       od

Notice that C(e) + 1 = 9 is the position of the first occurrence of 'e' in F: C(e) = 8, as there are 8 characters in T that are smaller than 'e'.

[Figure: columns C, F and L. Last-first mapping: each 't' character in L is linked to its position in F and no crossed links are possible.]

Example

T = acaacg$ (u = 6) is transformed to BW(T) = gc$aaac, and we now wish to reconstruct T from BW(T) using UNPERMUTE:
1. Compute the array C. For example: C(c) = 4 since there are 4 occurrences of characters smaller than 'c' in T ('$' and 3 occurrences of 'a'). Now C(c) + 1 = 5 is the position of the first occurrence of 'c' in F.
2. Perform the LF mapping. For example, $LF[c_2] = C(c) + r(c, 7) + 1 = 6$, and indeed the second occurrence of 'c' in F is at F[6]. Here r(c, 7) is the number of occurrences of 'c' in the prefix of L of length 7 - 1, used as an offset to obtain the right 'c'.
3. Determine the last character in T: T(6) = L(1) = 'g'.
4. Iterate backwards over all positions using the LF mapping. For example, to recover the character T(5), we use the LF mapping to trace L(1) to F(7), and then T(5) = L(7) = 'c'. That is, we determine the location of 'g' in F using C(.) and r(., .), and find its predecessor in L.

Remark: We do not actually hold F in memory; we only keep the array C defined above, of size $|\Sigma|$, which we can easily obtain from L.

[Figure: columns First and Last. Example of running UNPERMUTE to recover the original text. Source: [1].]
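The UNPERMUTE steps can be sketched in Python with 0-indexed arrays, so every position is one less than in the slides (the slide's LF value 6 for the second 'c' becomes 5 here). The function names are hypothetical, and the sketch assumes that the end marker is the smallest character under Python's ordering, as '$' is for lowercase letters.

```python
from collections import Counter


def lf_mapping(bw):
    """0-indexed last-first mapping: LF[i] = C(L[i]) + r(L[i], i)."""
    counts = Counter(bw)
    c, total = {}, 0
    for ch in sorted(counts):       # C(ch) = number of characters smaller than ch
        c[ch] = total
        total += counts[ch]
    seen = Counter()                # seen[ch] = occurrences of ch in bw[:i]
    lf = []
    for ch in bw:
        lf.append(c[ch] + seen[ch])
        seen[ch] += 1
    return lf


def unpermute(bw):
    """Reconstruct the text (without its end marker) from its BWT."""
    lf = lf_mapping(bw)
    u = len(bw) - 1                 # length of the text without the marker
    out = [""] * u
    s = 0                           # row 0 starts with the marker; L[0] is T's last char
    for i in range(u - 1, -1, -1):
        out[i] = bw[s]
        s = lf[s]
    return "".join(out)
```

Only the array C and the column L are used; the first column F is never stored, as the remark above points out.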

Exact Matching

EXACTMATCH: exact matching of a query string P to T, given BW(T) [4].
- It is similar to UNPERMUTE; we use the same C and r(c, i).
- Denote by sp the position of the first row in the interval of rows in M we are currently considering.
- Denote by ep the position of the first row beyond this interval of rows.
- So the interval of rows is defined by the rows sp, ..., ep - 1.

EXACTMATCH[P[1, ..., p], BW(T)]
1. c = P[p]; sp = C[c] + 1; ep = C[c+1] + 1; i = p - 1;
2. while sp < ep and i >= 1
       c = P[i];
       sp = C[c] + r(c, sp) + 1;
       ep = C[c] + r(c, ep) + 1;
       i = i - 1;
3. if (sp == ep) return "no match"; else return sp, ep;

[Figure: example of running EXACTMATCH to find the query string P = aac in the text. Source: [1].]
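EXACTMATCH can be sketched in Python on the same small example. The sketch is 0-indexed and returns a half-open row interval, so the slide's 1-indexed interval [2, 3) appears as (1, 2). The name `exact_match` is hypothetical, and r(c, i) is computed by a plain scan for clarity; a practical index would precompute occurrence counts instead.

```python
from collections import Counter


def exact_match(p, bw):
    """Return the 0-indexed half-open row interval (sp, ep) of rows of M that
    begin with p, or None if p does not occur in the text."""
    counts = Counter(bw)
    c, total = {}, 0
    for ch in sorted(counts):       # C(ch) = number of characters smaller than ch
        c[ch] = total
        total += counts[ch]

    def r(ch, i):
        """Occurrences of ch in the prefix bw[:i]."""
        return bw[:i].count(ch)

    if not p or p[-1] not in c:
        return None
    last = p[-1]
    sp, ep = c[last], c[last] + counts[last]   # rows beginning with the last char
    for ch in reversed(p[:-1]):                # extend the match to the left
        if sp >= ep:
            break
        if ch not in c:
            return None
        sp = c[ch] + r(ch, sp)
        ep = c[ch] + r(ch, ep)
    return (sp, ep) if sp < ep else None
```

The size of the returned interval, `ep - sp`, is the number of occurrences of P in T.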

Inexact Matching (Sketch)

- Each character in a read has a numeric quality value, with lower values indicating a higher likelihood of a sequencing error.
- Similar to EXACTMATCH, the algorithm calculates matrix intervals for successively longer query suffixes.
- If the range becomes empty (a suffix does not occur in the text), then the algorithm selects an already-matched query position and substitutes a different base there, introducing a mismatch into the alignment.
- The EXACTMATCH search resumes from just after the substituted position.
- The algorithm selects only those substitutions that are consistent with the alignment policy and yield a modified suffix that occurs at least once in the text.
- If there are multiple candidate substitution positions, then the algorithm greedily selects a position with a minimal quality value, i.e. the base most likely to be a sequencing error.

The Assembly Problem

How to assemble an unknown genome based on many highly overlapping short reads from it?

Problem: Sequence assembly
INPUT: m l-long reads $S_1, \dots, S_m$.
QUESTION: What is the sequence of the full genome?

The crucial difference between the problems of mapping and assembly is that now we do not have a reference genome, and we must assemble the full sequence directly from the reads.

De Bruijn Graphs

Definition: A k-dimensional de Bruijn graph of n symbols is a directed graph representing overlaps between sequences of symbols. It has $n^k$ vertices, consisting of all possible k-tuples of the given symbols. The same symbol may appear multiple times in a tuple.

If we have the set of symbols $A = \{a_1, \dots, a_n\}$ then the set of vertices is:
$V = \{(a_1, \dots, a_1, a_1), (a_1, \dots, a_1, a_2), \dots, (a_1, \dots, a_1, a_n), (a_1, \dots, a_2, a_1), \dots, (a_n, \dots, a_n, a_n)\}$

If a vertex w can be expressed by shifting all symbols of another vertex v by one place to the left and adding a new symbol at the end, then v has a directed edge to w. Thus the set of directed edges E is:
$E = \{((v_1, v_2, \dots, v_k), (w_1, w_2, \dots, w_k)) \mid v_2 = w_1, v_3 = w_2, \dots, v_k = w_{k-1}, \text{ and } w_k \text{ new}\}$

[Figure: a portion of a 4-dimensional de Bruijn graph.]
The vertex CAAA has a directed edge to vertex AAAC, because if we shift the label CAAA to the left and add the symbol C we get the label AAAC. In this way the word CAAAC defines a directed edge.

De Bruijn Graphs

- Given a read, every contiguous (k+1)-long word in it corresponds to an edge in the de Bruijn graph.
- Form a subgraph G of the full de Bruijn graph by introducing only the edges that correspond to (k+1)-long words in some read.
- A path in this graph defines a potential subsequence in the genome.
- Hence, we can convert a read to its corresponding path in the constructed subgraph G.

[Figure: using the graph from the previous slide, we construct the path corresponding to CCAACAAAAC: shift the sequence one position to the left each time, marking the vertex with the label of the first 4 nucleotides.]
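Converting reads to paths, and collecting only the edges that some read supports, can be sketched as follows. The function names are hypothetical, and the graph is stored as a plain adjacency mapping from each k-mer to the set of its successors; this is an illustration of the construction, not of how any particular assembler stores its graph.

```python
from collections import defaultdict


def read_to_path(read, k):
    """The vertices (k-mers) visited by a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Subgraph of the de Bruijn graph with one edge per (k+1)-mer seen in a read."""
    graph = defaultdict(set)
    for read in reads:
        path = read_to_path(read, k)
        # Consecutive k-mers overlap in k-1 symbols: together they spell a (k+1)-mer.
        for v, w in zip(path, path[1:]):
            graph[v].add(w)
    return graph
```

For the read CCAACAAAAC with k = 4, the path visits CCAA, CAAC, AACA, ACAA, CAAA, AAAA and AAAC, one vertex per shift.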

De Bruijn Graphs

- Form paths for all the reads.
- Identify common vertices on different paths.
- Merge different read-paths through these common vertices of the paths into one long sequence.

[Figure: the two reads in (a) are converted to paths in the graph. The common vertex TGAG is identified, and the two paths are combined into (b).]

De Bruijn Graphs

Velvet (2008), by Zerbino and Birney [6], is an algorithm which uses de Bruijn graphs to assemble reads, with the following difficulties:
- Repeats will show up as cycles in the merged path that we form.
- We do not know how many times each cycle must be traversed in order to form the full sequence of the genome.
- If we have two cycles starting at the same vertex, we cannot tell which one to traverse first.

Velvet attempts to address this issue by utilizing the extra information we have in the case of paired-end sequencing (both ends of a DNA fragment are sequenced, in Read 1 and Read 2 respectively, and the distance between each paired read is known).

Other Assembly Algorithms

- HMM based
- Majority based
- Etc.

Bibliography

[1] Ben Langmead, Cole Trapnell, Mihai Pop and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology.
[2] Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation.
[3] Brent Ewing and Phil Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research.
[4] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. FOCS '00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science.
[5] Heng Li, Jue Ruan and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research.
[6] Daniel R. Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research.
[7] Ron Shamir. Computational Genomics, Fall Semester 2010, Lecture 12: Algorithms for Next Generation Sequencing Data, January 6, 2011. Scribe: Anat Gluzman and Eran Mick.

Appendix

Some basic terminology and techniques.

DNA Replication

- DNA polymerase is a family of enzymes carrying out DNA replication.
- A primer is a short fragment of DNA or RNA that pairs with a template DNA.

Polymerase Chain Reaction
- A mixture of templates and primers is heated such that the templates separate into strands.
- Uses a pair of primers spanning a target region in the template DNA.
- Polymerizes partner strands in each direction from the primers using a thermostable DNA polymerase.
- Repeat.

BWT Exact Matching: Example

Use T = acaacg while searching for P = 'aac':
1. First, we initialize sp and ep to define the interval of rows beginning with the last character in P, which is 'c':
   sp = C(c) + 1 = 5.
   ep = C(g) + 1 = 7.
2. Next, we consider the preceding character in P, namely 'a'. We redefine the interval as the rows that begin with 'ac', utilizing the LF mapping. Specifically:
   sp = C(a) + r(a, 5) + 1 = 1 + 1 + 1 = 3.
   ep = C(a) + r(a, 7) + 1 = 1 + 3 + 1 = 5.
3. Now, we consider the next preceding character, namely 'a'. We now redefine the interval as the rows that begin with 'aac'. Specifically:
   sp = C(a) + r(a, 3) + 1 = 1 + 0 + 1 = 2.
   ep = C(a) + r(a, 5) + 1 = 1 + 1 + 1 = 3.
4. Having thus covered all positions of P, we return the final interval calculated (whose size equals the number of occurrences of P in T).

45 BWT Exact Matching
EXACTMATCH returns the indices of the rows in M that begin with the query; it does not provide the offset of each match in T.
If we kept the position in T corresponding to the start of each row, we would waste a lot of space.
Instead, we can mark only some rows with pre-calculated offsets. Then, if the row where EXACTMATCH found the query has such an offset, we can return it immediately.
Otherwise, we use the LF mapping to find a row that has a pre-calculated offset, and add the number of times the procedure was applied to obtain the position in which we are interested.
=> There is a time-space trade-off associated with this process.

DNA Replication Enzyme
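The offset lookup described above can be sketched as follows. This is a hedged illustration under stated assumptions: rows are 0-based here (unlike the 1-based rows of the worked example), only rows whose text offset is a multiple of a hypothetical parameter sample_every keep their offset, and the LF mapping is computed by naive counting rather than with the rank tables a real index would use. All function names are made up for this sketch.

```python
def build_index(text, sample_every=3):
    """Return the BWT of text + '$' and a sparse {row: offset} table."""
    t = text + "$"
    # Suffix start position for each row of M ('$' sorts first).
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # Character preceding each suffix; t[-1] is '$' for the suffix at 0.
    bwt = "".join(t[i - 1] for i in sa)
    # Keep offsets only for rows whose position is a multiple of sample_every.
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_every == 0}
    return bwt, samples


def lf(bwt, row):
    """LF mapping: row of the suffix that starts one position earlier in T."""
    c = bwt[row]
    smaller = sum(1 for x in bwt if x < c)   # characters smaller than c
    return smaller + bwt[:row].count(c)      # plus rank of this c


def locate(bwt, samples, row):
    """Offset in T of the suffix at the given row."""
    steps = 0
    while row not in samples:     # walk left in T until a marked row
        row = lf(bwt, row)
        steps += 1
    return samples[row] + steps   # each LF step moved one position left


bwt, samples = build_index("acaacg", sample_every=3)
print(bwt, samples)
print(locate(bwt, samples, 1))   # row of the suffix 'aacg$' (0-based)
```

Offset 0 is always sampled, so the walk terminates after at most sample_every - 1 steps. A larger sample_every stores fewer offsets but makes each lookup slower, which is the time-space trade-off mentioned on the slide.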

46 Primers

47 DNA Restriction
DNA restriction enzymes:
Recognize a specific nucleotide sequence (4-8 bp long, often palindromic).
Make a double-stranded DNA cut.
For example, HaeIII recognizes the following sequence and cuts between the adjacent G and C, giving a blunt end:
5'-GGCC-3'
3'-CCGG-5'
Restriction endonucleases (restriction enzymes) that cut with an offset, and thus produce sticky ends, also exist.

Restriction Enzymes
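As a small illustration of the HaeIII example, the sketch below cuts one strand of a sequence at every GGCC site, between the adjacent G and C, and checks that the site is palindromic in the DNA sense (equal to its own reverse complement). The function names and the cut_offset parameter are hypothetical. Only a single strand is modelled, which is enough for a blunt cut.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A recognition site is palindromic if it equals its reverse complement."""
    return site == reverse_complement(site)


def blunt_digest(seq, site="GGCC", cut_offset=2):
    """Split seq at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset            # GG^CC: cut between G and C
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)     # look for the next site
    fragments.append(seq[start:])         # remainder after the last cut
    return fragments


print(is_palindromic("GGCC"))
print(blunt_digest("AAGGCCTTGGCCA"))
```

An enzyme that leaves sticky ends would need a different cut position on each strand, which this single-strand sketch does not model.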

48 Read Mapping Qualities
The MAQ algorithm provides a measure of the confidence level for the mapping of each read, denoted by QS and defined as:
QS = -10 log10(Pr{read S is wrongly mapped})
For example: QS = 30 if the probability of incorrect mapping of read S is 1/1000.
This confidence measure is called phred-scaled quality, following the scaling scheme originally introduced by Phil Green and colleagues [3] for the human genome project.

QS Confidence Levels
A simplified statistical model for inferring QS:
Assume we know the reads are supposed to match the reference precisely, so any mismatches must be the result of sequencing errors.
Assuming that sequencing errors along a read are independent, we can define the probability p(z | x, u) of obtaining the read z, given that its position in the reference x is u, as the product of the error probabilities of the observed mismatches in the alignment of the read at that position.
These base error probabilities are empirically determined during the sequencing phase, based on the characteristics of the emitted signal at each step.
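The phred scaling and the simplified likelihood above translate directly into code. The sketch below is illustrative only, and the function names are hypothetical. It converts between error probabilities and phred-scaled qualities, and computes p(z | x, u) as the product of the error probabilities of the mismatching bases.

```python
import math


def prob_to_phred(p_error):
    """Phred-scaled quality: Q = -10 log10(p_error)."""
    return -10 * math.log10(p_error)


def phred_to_prob(q):
    """Inverse of the phred scaling: p_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_likelihood(mismatch_qualities):
    """p(z | x, u) under the simplified model: product of the error
    probabilities of the mismatching bases (1.0 for a perfect match)."""
    return math.prod(phred_to_prob(q) for q in mismatch_qualities)


print(prob_to_phred(1 / 1000))     # quality for a 1-in-1000 error
print(read_likelihood([20, 30]))   # two mismatches with qualities 20 and 30
```

Because the probabilities multiply, the phred qualities of the mismatching bases simply add in the exponent.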

49 QS Example
If read z aligned at position u yields 2 mismatches that have phred-scaled base qualities of 10 and 20, then:
p(z | x, u) = 10^(-(20+10)/10) = 0.001
Note that we can calculate the posterior probability p_s(u | x, z) of position u being the correct mapping for an observed read z using Bayes' theorem (see text).
Being able to calculate this posterior probability, we can now give our confidence measure as:
Q_s(u | x, z) = -10 log10(1 - p_s(u | x, z))
Note: MAQ uses a more sophisticated statistical model to deal with the issues we neglected in this simplification, such as true mismatches that are not the result of sequencing errors (SNPs = single nucleotide polymorphisms).
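The slide leaves the Bayes step to the text. As a rough sketch, assume a uniform prior over the candidate positions, so that the posterior of position u is its likelihood divided by the sum of the likelihoods of all candidates. That assumption, the function names and the second candidate position below are all illustrative, not taken from the slides.

```python
import math


def likelihood(mismatch_qualities):
    """p(z | x, u): product of the mismatch error probabilities."""
    return math.prod(10 ** (-q / 10) for q in mismatch_qualities)


def mapping_quality(candidates, best):
    """Q_s(u | x, z) = -10 log10(1 - p_s(u | x, z)) for position `best`.

    candidates maps each candidate position to the phred qualities of
    its mismatching bases. A uniform prior over positions is assumed.
    """
    likes = {u: likelihood(qs) for u, qs in candidates.items()}
    posterior = likes[best] / sum(likes.values())
    if posterior >= 1.0:          # only one candidate: no evidence of error
        return math.inf
    return -10 * math.log10(1 - posterior)


# Position 100 is the slide's example (mismatch qualities 10 and 20);
# position 250 is a made-up competitor with one extra mismatch.
candidates = {100: [10, 20], 250: [10, 20, 20]}
print(likelihood(candidates[100]))
print(mapping_quality(candidates, best=100))
```

The quality drops when a competing position has a likelihood close to that of the best hit, which is how repeats in the reference end up with low mapping qualities.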

E.M. Bakker. Several slides are based on/taken from [7].

E.M. Bakker. Several slides are based on/taken from [7]. Next Genertion Sequencing E.M. Bkker Severl slides re bsed on/tken from [7]. Overview Introduction Next Genertion Technologies The Mpping Problem The MAQ Algorithm The Bowtie Algorithm Burrows-Wheeler

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries nd String Mtching CS 240 - Dt Structures nd Dt Mngement Sjed Hque Veronik Irvine Tylor Smith Bsed on lecture notes by mny previous cs240 instructors Dvid R. Cheriton School of Computer

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries nd String Mtching CS 240 - Dt Structures nd Dt Mngement Sjed Hque Veronik Irvine Tylor Smith Bsed on lecture notes by mny previous cs240 instructors Dvid R. Cheriton School of Computer

More information

Alignment of Long Sequences. BMI/CS Spring 2016 Anthony Gitter

Alignment of Long Sequences. BMI/CS Spring 2016 Anthony Gitter Alignment of Long Sequences BMI/CS 776 www.biostt.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostt.wisc.edu Gols for Lecture Key concepts how lrge-scle lignment differs from the simple cse the

More information

Where did dynamic programming come from?

Where did dynamic programming come from? Where did dynmic progrmming come from? String lgorithms Dvid Kuchk cs302 Spring 2012 Richrd ellmn On the irth of Dynmic Progrmming Sturt Dreyfus http://www.eng.tu.c.il/~mi/cd/ or50/1526-5463-2002-50-01-0048.pdf

More information

Balanced binary search trees

Balanced binary search trees 02110 Inge Li Gørtz Overview Blnced binry serch trees: Red-blck trees nd 2-3-4 trees Amortized nlysis Dynmic progrmming Network flows String mtching String indexing Computtionl geometry Introduction to

More information

Algorithms in Computational. Biology. More on BWT

Algorithms in Computational. Biology. More on BWT Algorithms in Computtionl Biology More on BWT tody Plese Lst clss! don't forget to submit And by next (vi emil, repo ) implementtion week or shre prgectfltw get Not I would like reding overview! Discuss

More information

1 APL13: Suffix Arrays: more space reduction

1 APL13: Suffix Arrays: more space reduction 1 APL13: Suffix Arrys: more spce reduction In Section??, we sw tht when lphbet size is included in the time nd spce bounds, the suffix tree for string of length m either requires Θ(m Σ ) spce or the minimum

More information

Convert the NFA into DFA

Convert the NFA into DFA Convert the NF into F For ech NF we cn find F ccepting the sme lnguge. The numer of sttes of the F could e exponentil in the numer of sttes of the NF, ut in prctice this worst cse occurs rrely. lgorithm:

More information

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies Stte spce systems nlysis (continued) Stbility A. Definitions A system is sid to be Asymptoticlly Stble (AS) when it stisfies ut () = 0, t > 0 lim xt () 0. t A system is AS if nd only if the impulse response

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

Formal languages, automata, and theory of computation

Formal languages, automata, and theory of computation Mälrdlen University TEN1 DVA337 2015 School of Innovtion, Design nd Engineering Forml lnguges, utomt, nd theory of computtion Thursdy, Novemer 5, 14:10-18:30 Techer: Dniel Hedin, phone 021-107052 The exm

More information

CMSC 330: Organization of Programming Languages. DFAs, and NFAs, and Regexps (Oh my!)

CMSC 330: Organization of Programming Languages. DFAs, and NFAs, and Regexps (Oh my!) CMSC 330: Orgniztion of Progrmming Lnguges DFAs, nd NFAs, nd Regexps (Oh my!) CMSC330 Spring 2018 Types of Finite Automt Deterministic Finite Automt (DFA) Exctly one sequence of steps for ech string All

More information

1 From NFA to regular expression

1 From NFA to regular expression Note 1: How to convert DFA/NFA to regulr expression Version: 1.0 S/EE 374, Fll 2017 Septemer 11, 2017 In this note, we show tht ny DFA cn e converted into regulr expression. Our construction would work

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

p-adic Egyptian Fractions

p-adic Egyptian Fractions p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction

More information

CS 275 Automata and Formal Language Theory

CS 275 Automata and Formal Language Theory CS 275 Automt nd Forml Lnguge Theory Course Notes Prt II: The Recognition Problem (II) Chpter II.5.: Properties of Context Free Grmmrs (14) Anton Setzer (Bsed on book drft by J. V. Tucker nd K. Stephenson)

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

Tests for the Ratio of Two Poisson Rates

Tests for the Ratio of Two Poisson Rates Chpter 437 Tests for the Rtio of Two Poisson Rtes Introduction The Poisson probbility lw gives the probbility distribution of the number of events occurring in specified intervl of time or spce. The Poisson

More information

Computing the Optimal Global Alignment Value. B = n. Score of = 1 Score of = a a c g a c g a. A = n. Classical Dynamic Programming: O(n )

Computing the Optimal Global Alignment Value. B = n. Score of = 1 Score of = a a c g a c g a. A = n. Classical Dynamic Programming: O(n ) Alignment Grph Alignment Mtrix Computing the Optiml Globl Alignment Vlue An Introduction to Bioinformtics Algorithms A = n c t 2 3 c c 4 g 5 g 6 7 8 9 B = n 0 c g c g 2 3 4 5 6 7 8 t 9 0 2 3 4 5 6 7 8

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

DIRECT CURRENT CIRCUITS

DIRECT CURRENT CIRCUITS DRECT CURRENT CUTS ELECTRC POWER Consider the circuit shown in the Figure where bttery is connected to resistor R. A positive chrge dq will gin potentil energy s it moves from point to point b through

More information

Anatomy of a Deterministic Finite Automaton. Deterministic Finite Automata. A machine so simple that you can understand it in less than one minute

Anatomy of a Deterministic Finite Automaton. Deterministic Finite Automata. A machine so simple that you can understand it in less than one minute Victor Admchik Dnny Sletor Gret Theoreticl Ides In Computer Science CS 5-25 Spring 2 Lecture 2 Mr 3, 2 Crnegie Mellon University Deterministic Finite Automt Finite Automt A mchine so simple tht you cn

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 2013 Outline 1 Riemnn Sums 2 Riemnn Integrls 3 Properties

More information

CS 275 Automata and Formal Language Theory

CS 275 Automata and Formal Language Theory CS 275 Automt nd Forml Lnguge Theory Course Notes Prt II: The Recognition Problem (II) Chpter II.6.: Push Down Automt Remrk: This mteril is no longer tught nd not directly exm relevnt Anton Setzer (Bsed

More information

CS375: Logic and Theory of Computing

CS375: Logic and Theory of Computing CS375: Logic nd Theory of Computing Fuhu (Frnk) Cheng Deprtment of Computer Science University of Kentucky 1 Tble of Contents: Week 1: Preliminries (set lgebr, reltions, functions) (red Chpters 1-4) Weeks

More information

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes Jim Lmbers MAT 169 Fll Semester 2009-10 Lecture 4 Notes These notes correspond to Section 8.2 in the text. Series Wht is Series? An infinte series, usully referred to simply s series, is n sum of ll of

More information

1 Online Learning and Regret Minimization

1 Online Learning and Regret Minimization 2.997 Decision-Mking in Lrge-Scle Systems My 10 MIT, Spring 2004 Hndout #29 Lecture Note 24 1 Online Lerning nd Regret Minimiztion In this lecture, we consider the problem of sequentil decision mking in

More information

1.4 Nonregular Languages

1.4 Nonregular Languages 74 1.4 Nonregulr Lnguges The number of forml lnguges over ny lphbet (= decision/recognition problems) is uncountble On the other hnd, the number of regulr expressions (= strings) is countble Hence, ll

More information

Minimal DFA. minimal DFA for L starting from any other

Minimal DFA. minimal DFA for L starting from any other Miniml DFA Among the mny DFAs ccepting the sme regulr lnguge L, there is exctly one (up to renming of sttes) which hs the smllest possile numer of sttes. Moreover, it is possile to otin tht miniml DFA

More information

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by. NUMERICAL INTEGRATION 1 Introduction The inverse process to differentition in clculus is integrtion. Mthemticlly, integrtion is represented by f(x) dx which stnds for the integrl of the function f(x) with

More information

Quadratic Forms. Quadratic Forms

Quadratic Forms. Quadratic Forms Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte

More information

Finite Automata. Informatics 2A: Lecture 3. John Longley. 22 September School of Informatics University of Edinburgh

Finite Automata. Informatics 2A: Lecture 3. John Longley. 22 September School of Informatics University of Edinburgh Lnguges nd Automt Finite Automt Informtics 2A: Lecture 3 John Longley School of Informtics University of Edinburgh jrl@inf.ed.c.uk 22 September 2017 1 / 30 Lnguges nd Automt 1 Lnguges nd Automt Wht is

More information

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4 Intermedite Mth Circles Wednesdy, Novemer 14, 2018 Finite Automt II Nickols Rollick nrollick@uwterloo.c Regulr Lnguges Lst time, we were introduced to the ide of DFA (deterministic finite utomton), one

More information

Math& 152 Section Integration by Parts

Math& 152 Section Integration by Parts Mth& 5 Section 7. - Integrtion by Prts Integrtion by prts is rule tht trnsforms the integrl of the product of two functions into other (idelly simpler) integrls. Recll from Clculus I tht given two differentible

More information

We partition C into n small arcs by forming a partition of [a, b] by picking s i as follows: a = s 0 < s 1 < < s n = b.

We partition C into n small arcs by forming a partition of [a, b] by picking s i as follows: a = s 0 < s 1 < < s n = b. Mth 255 - Vector lculus II Notes 4.2 Pth nd Line Integrls We begin with discussion of pth integrls (the book clls them sclr line integrls). We will do this for function of two vribles, but these ides cn

More information

1.3 Regular Expressions

1.3 Regular Expressions 56 1.3 Regulr xpressions These hve n importnt role in describing ptterns in serching for strings in mny pplictions (e.g. wk, grep, Perl,...) All regulr expressions of lphbet re 1.Ønd re regulr expressions,

More information

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages CMSC 330: Orgniztion of Progrmming Lnguges Finite Automt 2 CMSC 330 1 Types of Finite Automt Deterministic Finite Automt (DFA) Exctly one sequence of steps for ech string All exmples so fr Nondeterministic

More information

Designing finite automata II

Designing finite automata II Designing finite utomt II Prolem: Design DFA A such tht L(A) consists of ll strings of nd which re of length 3n, for n = 0, 1, 2, (1) Determine wht to rememer out the input string Assign stte to ech of

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

A recursive construction of efficiently decodable list-disjunct matrices

A recursive construction of efficiently decodable list-disjunct matrices CSE 709: Compressed Sensing nd Group Testing. Prt I Lecturers: Hung Q. Ngo nd Atri Rudr SUNY t Bufflo, Fll 2011 Lst updte: October 13, 2011 A recursive construction of efficiently decodble list-disjunct

More information

On Suffix Tree Breadth

On Suffix Tree Breadth On Suffix Tree Bredth Golnz Bdkoeh 1,, Juh Kärkkäinen 2, Simon J. Puglisi 2,, nd Bell Zhukov 2, 1 Deprtment of Computer Science University of Wrwick Conventry, United Kingdom g.dkoeh@wrwick.c.uk 2 Helsinki

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

1 Structural induction, finite automata, regular expressions

1 Structural induction, finite automata, regular expressions Discrete Structures Prelim 2 smple uestions s CS2800 Questions selected for spring 2017 1 Structurl induction, finite utomt, regulr expressions 1. We define set S of functions from Z to Z inductively s

More information

First Midterm Examination

First Midterm Examination Çnky University Deprtment of Computer Engineering 203-204 Fll Semester First Midterm Exmintion ) Design DFA for ll strings over the lphet Σ = {,, c} in which there is no, no nd no cc. 2) Wht lnguge does

More information

Z b. f(x)dx. Yet in the above two cases we know what f(x) is. Sometimes, engineers want to calculate an area by computing I, but...

Z b. f(x)dx. Yet in the above two cases we know what f(x) is. Sometimes, engineers want to calculate an area by computing I, but... Chpter 7 Numericl Methods 7. Introduction In mny cses the integrl f(x)dx cn be found by finding function F (x) such tht F 0 (x) =f(x), nd using f(x)dx = F (b) F () which is known s the nlyticl (exct) solution.

More information

Math 8 Winter 2015 Applications of Integration

Math 8 Winter 2015 Applications of Integration Mth 8 Winter 205 Applictions of Integrtion Here re few importnt pplictions of integrtion. The pplictions you my see on n exm in this course include only the Net Chnge Theorem (which is relly just the Fundmentl

More information

University of Alabama Department of Physics and Astronomy. PH126: Exam 1

University of Alabama Department of Physics and Astronomy. PH126: Exam 1 University of Albm Deprtment of Physics nd Astronomy PH 16 LeClir Fll 011 Instructions: PH16: Exm 1 1. Answer four of the five questions below. All problems hve equl weight.. You must show your work for

More information

Math 520 Final Exam Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008

Math 520 Final Exam Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008 Mth 520 Finl Exm Topic Outline Sections 1 3 (Xio/Dums/Liw) Spring 2008 The finl exm will be held on Tuesdy, My 13, 2-5pm in 117 McMilln Wht will be covered The finl exm will cover the mteril from ll of

More information

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives Block #6: Properties of Integrls, Indefinite Integrls Gols: Definition of the Definite Integrl Integrl Clcultions using Antiderivtives Properties of Integrls The Indefinite Integrl 1 Riemnn Sums - 1 Riemnn

More information

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary Outline Genetic Progrmming Evolutionry strtegies Genetic progrmming Summry Bsed on the mteril provided y Professor Michel Negnevitsky Evolutionry Strtegies An pproch simulting nturl evolution ws proposed

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment

More information

DATA Search I 魏忠钰. 复旦大学大数据学院 School of Data Science, Fudan University. March 7 th, 2018

DATA Search I 魏忠钰. 复旦大学大数据学院 School of Data Science, Fudan University. March 7 th, 2018 DATA620006 魏忠钰 Serch I Mrch 7 th, 2018 Outline Serch Problems Uninformed Serch Depth-First Serch Bredth-First Serch Uniform-Cost Serch Rel world tsk - Pc-mn Serch problems A serch problem consists of:

More information

SOLUTIONS FOR ADMISSIONS TEST IN MATHEMATICS, COMPUTER SCIENCE AND JOINT SCHOOLS WEDNESDAY 5 NOVEMBER 2014

SOLUTIONS FOR ADMISSIONS TEST IN MATHEMATICS, COMPUTER SCIENCE AND JOINT SCHOOLS WEDNESDAY 5 NOVEMBER 2014 SOLUTIONS FOR ADMISSIONS TEST IN MATHEMATICS, COMPUTER SCIENCE AND JOINT SCHOOLS WEDNESDAY 5 NOVEMBER 014 Mrk Scheme: Ech prt of Question 1 is worth four mrks which re wrded solely for the correct nswer.

More information

Exam 1 Solutions (1) C, D, A, B (2) C, A, D, B (3) C, B, D, A (4) A, C, D, B (5) D, C, A, B

Exam 1 Solutions (1) C, D, A, B (2) C, A, D, B (3) C, B, D, A (4) A, C, D, B (5) D, C, A, B PHY 249, Fll 216 Exm 1 Solutions nswer 1 is correct for ll problems. 1. Two uniformly chrged spheres, nd B, re plced t lrge distnce from ech other, with their centers on the x xis. The chrge on sphere

More information

Compiler Design. Fall Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Fall Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz University of Southern Cliforni Computer Science Deprtment Compiler Design Fll Lexicl Anlysis Smple Exercises nd Solutions Prof. Pedro C. Diniz USC / Informtion Sciences Institute 4676 Admirlty Wy, Suite

More information

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties; Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning Solution for Assignment 1 : Intro to Probbility nd Sttistics, PAC lerning 10-701/15-781: Mchine Lerning (Fll 004) Due: Sept. 30th 004, Thursdy, Strt of clss Question 1. Bsic Probbility ( 18 pts) 1.1 (

More information

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.)

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.) CS 373, Spring 29. Solutions to Mock midterm (sed on first midterm in CS 273, Fll 28.) Prolem : Short nswer (8 points) The nswers to these prolems should e short nd not complicted. () If n NF M ccepts

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

Fingerprint idea. Assume:

Fingerprint idea. Assume: Fingerprint ide Assume: We cn compute fingerprint f(p) of P in O(m) time. If f(p) f(t[s.. s+m 1]), then P T[s.. s+m 1] We cn compre fingerprints in O(1) We cn compute f = f(t[s+1.. s+m]) from f(t[s.. s+m

More information

XPath Node Selection over Grammar-Compressed Trees

XPath Node Selection over Grammar-Compressed Trees XPth Node Selection over Grmmr-Compressed Trees Sebstin Mneth University of Edinburgh with Tom Sebstin (INRIA Lille) XPth Node Selection Given: XPth query Q nd tree T Question: wc-time to evlute Q over

More information

Math 1B, lecture 4: Error bounds for numerical methods

Math 1B, lecture 4: Error bounds for numerical methods Mth B, lecture 4: Error bounds for numericl methods Nthn Pflueger 4 September 0 Introduction The five numericl methods descried in the previous lecture ll operte by the sme principle: they pproximte the

More information

The use of a so called graphing calculator or programmable calculator is not permitted. Simple scientific calculators are allowed.

The use of a so called graphing calculator or programmable calculator is not permitted. Simple scientific calculators are allowed. ERASMUS UNIVERSITY ROTTERDAM Informtion concerning the Entrnce exmintion Mthemtics level 1 for Interntionl Bchelor in Communiction nd Medi Generl informtion Avilble time: 2 hours 30 minutes. The exmintion

More information

Math 61CM - Solutions to homework 9

Math 61CM - Solutions to homework 9 Mth 61CM - Solutions to homework 9 Cédric De Groote November 30 th, 2018 Problem 1: Recll tht the left limit of function f t point c is defined s follows: lim f(x) = l x c if for ny > 0 there exists δ

More information

The steps of the hypothesis test

The steps of the hypothesis test ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of

More information

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. NFA for (a b)*abb.

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. NFA for (a b)*abb. CMSC 330: Orgniztion of Progrmming Lnguges Finite Automt 2 Types of Finite Automt Deterministic Finite Automt () Exctly one sequence of steps for ech string All exmples so fr Nondeterministic Finite Automt

More information

8 Laplace s Method and Local Limit Theorems

8 Laplace s Method and Local Limit Theorems 8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved

More information

How to simulate Turing machines by invertible one-dimensional cellular automata

How to simulate Turing machines by invertible one-dimensional cellular automata How to simulte Turing mchines by invertible one-dimensionl cellulr utomt Jen-Christophe Dubcq Déprtement de Mthémtiques et d Informtique, École Normle Supérieure de Lyon, 46, llée d Itlie, 69364 Lyon Cedex

More information

Chapter 7 Notes, Stewart 8e. 7.1 Integration by Parts Trigonometric Integrals Evaluating sin m x cos n (x) dx...

Chapter 7 Notes, Stewart 8e. 7.1 Integration by Parts Trigonometric Integrals Evaluating sin m x cos n (x) dx... Contents 7.1 Integrtion by Prts................................... 2 7.2 Trigonometric Integrls.................................. 8 7.2.1 Evluting sin m x cos n (x)......................... 8 7.2.2 Evluting

More information

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. Comparing DFAs and NFAs (cont.) Finite Automata 2

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. Comparing DFAs and NFAs (cont.) Finite Automata 2 CMSC 330: Orgniztion of Progrmming Lnguges Finite Automt 2 Types of Finite Automt Deterministic Finite Automt () Exctly one sequence of steps for ech string All exmples so fr Nondeterministic Finite Automt

More information

4.4 Areas, Integrals and Antiderivatives

4.4 Areas, Integrals and Antiderivatives . res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

Spanning tree congestion of some product graphs

Spanning tree congestion of some product graphs Spnning tree congestion of some product grphs Hiu-Fi Lw Mthemticl Institute Oxford University 4-9 St Giles Oxford, OX1 3LB, United Kingdom e-mil: lwh@mths.ox.c.uk nd Mikhil I. Ostrovskii Deprtment of Mthemtics

More information

Java II Finite Automata I

Java II Finite Automata I Jv II Finite Automt I Bernd Kiefer Bernd.Kiefer@dfki.de Deutsches Forschungszentrum für künstliche Intelligenz Finite Automt I p.1/13 Processing Regulr Expressions We lredy lerned out Jv s regulr expression

More information

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 UNIFORM CONVERGENCE Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 Suppose f n : Ω R or f n : Ω C is sequence of rel or complex functions, nd f n f s n in some sense. Furthermore,

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying Vitli covers 1 Definition. A Vitli cover of set E R is set V of closed intervls with positive length so tht, for every δ > 0 nd every x E, there is some I V with λ(i ) < δ nd x I. 2 Lemm (Vitli covering)

More information

19 Optimal behavior: Game theory

19 Optimal behavior: Game theory Intro. to Artificil Intelligence: Dle Schuurmns, Relu Ptrscu 1 19 Optiml behvior: Gme theory Adversril stte dynmics hve to ccount for worst cse Compute policy π : S A tht mximizes minimum rewrd Let S (,

More information

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018 Finite Automt Theory nd Forml Lnguges TMV027/DIT321 LP4 2018 Lecture 10 An Bove April 23rd 2018 Recp: Regulr Lnguges We cn convert between FA nd RE; Hence both FA nd RE ccept/generte regulr lnguges; More

More information

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite

More information

Bases for Vector Spaces

Bases for Vector Spaces Bses for Vector Spces 2-26-25 A set is independent if, roughly speking, there is no redundncy in the set: You cn t uild ny vector in the set s liner comintion of the others A set spns if you cn uild everything

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information
