E.M. Bakker. Several slides are based on/taken from [7].

Similar documents
Exact Matching. Exact Matching Algorithms 5/19/2015. Exact Matching Problem: search pattern P in text T (P,T are strings)

Module 9: Tries and String Matching

Module 9: Tries and String Matching

Alignment of Long Sequences. BMI/CS Spring 2016 Anthony Gitter

Where did dynamic programming come from?

Algorithms in Computational. Biology. More on BWT

Balanced binary search trees

1 APL13: Suffix Arrays: more space reduction

Convert the NFA into DFA

The Regulated and Riemann Integrals

p-adic Egyptian Fractions

Chapter 0. What is the Lebesgue integral about?

Tests for the Ratio of Two Poisson Rates

DIRECT CURRENT CIRCUITS

CMSC 330: Organization of Programming Languages. DFAs, and NFAs, and Regexps (Oh my!)

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies

Math& 152 Section Integration by Parts

1 From NFA to regular expression

Formal languages, automata, and theory of computation

We partition C into n small arcs by forming a partition of [a, b] by picking s i as follows: a = s 0 < s 1 < < s n = b.

1 Online Learning and Regret Minimization

Computing the Optimal Global Alignment Value. B = n. Score of = 1 Score of = a a c g a c g a. A = n. Classical Dynamic Programming: O(n )

Minimal DFA. minimal DFA for L starting from any other

XPath Node Selection over Grammar-Compressed Trees

CS 275 Automata and Formal Language Theory

Anatomy of a Deterministic Finite Automaton. Deterministic Finite Automata. A machine so simple that you can understand it in less than one minute

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.

SOLUTIONS FOR ADMISSIONS TEST IN MATHEMATICS, COMPUTER SCIENCE AND JOINT SCHOOLS WEDNESDAY 5 NOVEMBER 2014

CS375: Logic and Theory of Computing

Finite Automata. Informatics 2A: Lecture 3. John Longley. 22 September School of Informatics University of Edinburgh

Math 8 Winter 2015 Applications of Integration

Math 1B, lecture 4: Error bounds for numerical methods

CS 188: Artificial Intelligence Spring 2007

Student Activity 3: Single Factor ANOVA

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4

Riemann Sums and Riemann Integrals

1 Structural induction, finite automata, regular expressions

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus

Riemann Sums and Riemann Integrals

Z b. f(x)dx. Yet in the above two cases we know what f(x) is. Sometimes, engineers want to calculate an area by computing I, but...

Chapter 14. Matrix Representations of Linear Transformations

5.7 Improper Integrals

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes

CMSC 330: Organization of Programming Languages

4.4 Areas, Integrals and Antiderivatives

Designing finite automata II

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives

1.4 Nonregular Languages

Section 6: Area, Volume, and Average Value

The steps of the hypothesis test

Math 520 Final Exam Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008

7.2 The Definite Integral

A recursive construction of efficiently decodable list-disjunct matrices

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

Quadratic Forms. Quadratic Forms

1.3 Regular Expressions

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.)

Compiler Design. Fall Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Fingerprint idea. Assume:

DISCRETE MATHEMATICS HOMEWORK 3 SOLUTIONS

On Suffix Tree Breadth

The use of a so called graphing calculator or programmable calculator is not permitted. Simple scientific calculators are allowed.

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2016

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Uninformed Search Lecture 4

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

1B40 Practical Skills

Math 61CM - Solutions to homework 9

New data structures to reduce data size and search time

Surface maps into free groups

CS103B Handout 18 Winter 2007 February 28, 2007 Finite Automata

How to simulate Turing machines by invertible one-dimensional cellular automata

First Midterm Examination

Spanning tree congestion of some product graphs

8 Laplace s Method and Local Limit Theorems

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. NFA for (a b)*abb.

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. Comparing DFAs and NFAs (cont.) Finite Automata 2

CS 275 Automata and Formal Language Theory

Lecture 14: Quadrature

Chapter 5 : Continuous Random Variables

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

Precalculus Spring 2017

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

University of Alabama Department of Physics and Astronomy. PH126: Exam 1

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3

Chapter 7 Notes, Stewart 8e. 7.1 Integration by Parts Trigonometric Integrals Evaluating sin m x cos n (x) dx...

Numerical Integration

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary

Parsing and Pattern Recognition

DATA Search I 魏忠钰. 复旦大学大数据学院 School of Data Science, Fudan University. March 7 th, 2018

Population bottleneck : dramatic reduction of population size followed by rapid expansion,

Theoretical foundations of Gaussian quadrature

19 Optimal behavior: Game theory

Bases for Vector Spaces

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER /2019

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2014

Transcription:

Next Genertion Sequencing E.M. Bkker Severl slides re bsed on/tken from [7]. Overview Introduction Next Genertion Technologies The Mpping Problem The MAQ Algorithm The Bowtie Algorithm Burrows-Wheeler Trnsform Sequence Assembly Problem De Bruijn Grphs Other Assembly Algorithms 1

Introduction 1953 Wtson nd Crick: the structure of the DNA molecule DNA crrier of the genetic informtion, the chllenge of reding the DNA sequence becme centrl to biologicl reserch. methods for DNA sequencing were extremely inefficient, lborious nd costly. 1965 Holley: relibly sequencing the yest gene for trna Al required the equivlent of full yer's work per person per bse pir (bp) sequenced (1bp/person yer). 1970 Two clssicl methods for sequencing DNA frgments by Snger nd Gilbert. Snger sequencing (sketch): 1. The polymerse extends the lbeled primer, rndomly by either: norml dctp bse, or modified ddctp bse At every position where ddctp is inserted, polymeriztion termintes! => mix of frgments, where the length of ech frgment is function of the reltive distnce from the modified bse to the primer. 2. In 4 lnes electrophoretic seprtion of the mix of frgments in ech of the four tubes (ddg, dda, ddt, nd ddc) is estblished. Now the bnds on the gel ccumulte equl length frgments representing the Lbelled Strnds shown to the right. => This llows us to red the complement of the originl templte from bottom to top. 1. 2. 2

Introduction 1980 In the 1980s methods were ugmented by prtil utomtion the cloning method, which llowed fst nd exponentil repliction of DNA frgment. 1990 Strt of the humn genome project: sequencing efficiency reched 200,000 bp/person-yer. 2002 End of the humn genome project: 50,000,000 bp/person-yer. (Totl cost summed to $3 billion.) Note: Moore s Lw doubling trnsistor count every 2 yers Here: doubling of #bse pirs/person-yer every 1.5 yers Introduction Recently (2007 2011/2012): new sequencing technologies next genertion sequencing (NGS) or deep sequencing relible sequencing of 100x10 9 bp/person-yer. NGS now llows compiling the full DNA sequence of person for ~ $10,000 within 3-5 yers cost is expected to drop to ~$1000. 2017: Verits Genetics uses Illumin HiSeq X Ten system with n verge coverge depth of 30X. Whole genome sequencing in 8-10 weeks t price of $999. Supplies limited to 5,000 in 2016. (See http://www.nnlyze.com/2016/03/does-full-genomesequencing-relly-cost-1000-now/) Illumin sys it cn deliver $100 genome soon (HiSeq s successor NovSeq) 3

100$ Genome scn soon! From: https://www.genome.gov/sequencingcosts/ Next Genertion Sequencing Technologies Next Genertion Sequencing technology of Illumin, one of the leding compnies in the field: DNA is replicted => millions of copies. All DNA copies re rndomly shredded into vrious frgments using restriction enzymes or mechnicl mens => frgments form the input for the sequencing phse. Hundreds of copies of ech frgment re generted/ssembled t one spot (cluster) on the surfce of huge mtrix. Initilly it is not known which frgment sits where. Note: There re quite few other technologies vilble nd/or currently under development. 4

Next Genertion Sequencing technology of Illumin (continued): A DNA repliction enzyme is dded to the mtrix long with 4 slightly modified nucleotides. The 4 nucleotides re slightly modified chemiclly so tht: ech would emit unique color when excited by lser, ech one would terminte the repliction. hence, the growing complementry DNA strnd on ech frgment in ech cluster is extended by one nucleotide t time. Next Genertion Sequencing technology of Illumin (continued): A lser is used to obtin red of the ctul nucleotide tht is dded in ech cluster. (The multiple copies in cluster provide n mplified signl.) The solution is wshed wy together with the chemicl modifiction on the lst nucleotide which prevented further elongtion nd emitted the unique signl. Millions of frgments re sequenced efficiently nd in prllel. The resulting frgment sequences re clled reds. Reds form the input for further sequence mpping nd ssembly. 5

Next Genertion Sequencing Technologies (1) A modified nucleotide is dded to the complementry DNA strnd by DNA polymerse enzyme. (2) A lser is used to obtin red of the nucleotide just dded. (3) The full sequence of frgment thus determined through successive itertions of the process. (4) A visuliztion of the mtrix where frgment clusters re ttched to flow cells. MinION: nnopores Phys.org Red length ~10kb Longest reported230 300 kb 500 bps per pore Nture.com 6

The Mpping Problem nd the Assembly Problem The Mpping Problem INPUT: m reds S 1,, S m of length l nd n pproximte reference genome R. QUESTION: Wht re the positions x 1,, x m long R where ech red S 1,, S m mtches, respectively? The Assembly Problem INPUT: m reds S 1,, S m of length l. QUESTION: Wht is the sequence of the full genome? The crucil difference between the problems of mpping nd ssembly is tht in cse of ssembly we do not hve reference genome, nd we must ssemble the full sequence directly from the reds. The Mpping Problem The Short Red Mpping Problem on reference genome R INPUT: QUESTION: m reds S 1,, S m of length l nd n pproximte reference genome R. Wht re the positions x 1,, x m long R where ech red S 1,, S m mtches, respectively? For exmple: fter sequencing the genome of person we wnt to mp it to n existing sequence of the humn genome. The new smple will not be 100% identicl to the reference genome: nturl vrition in the popultion mismtches or gps sequencing errors repetitive regions Humns re diploid orgnisms different lleles on the mternl nd pternl chromosomes two slightly different reds mpping to the sme loction (some with mismtches) 7

Solutions for the Mpping Problem Nive lgorithm for ech S i scn the reference string R mtch the red t ech position p nd pick the best mtch Time complexity: O(m l R ) for exct or inexct mtching. m number of reds, l red length Considering the prmeters for our problem (m ~10 8, l ~10 2, R ~3 10 9 ), this is imprcticl. Less nive solution use the Knuth-Morris-Prtt lgorithm to mtch ech S i to R Time complexity: O(m (l+ R )) = O(ml + m R ) for exct mtching. A substntil improvement but still not enough Pttern P ϵ {,b, } * : KMP Preprocessed P exmple Automton: b b b Tble: 1 2 3 4 5 6 7 8 b b b 0 1 1 2 2 3 4 3 filure links (suffix = prefix): top: s utomton bottom: s tble how to determine? how to use? P: P: P: P: P: P: 6 b b b b b b 3 7 b b b b b b 4 8 b b b b b... 3 8

The Mpping Problem: Suffix Trees 1. Build suffix tree for R. 2. For ech S i (i in {1,,m})find mtches by trversing the tree from the root. Time complexity: O(ml+ R ), ssuming the tree is built using Ukkonen's liner-time lgorithm. Notes: Time complexity is prcticl. Only build the tree for the reference genome once. The tree cn be sved nd used repetedly. Suffix Tree exmple: find itt in nitty nitty itty tty ty y ritty itty tty ty y 1 2 3 4 5 6 7 8 9 10 11 1 nitty 2 8 ε itty ty ε ritty t y y ε 3 9 4 10 7 ε 5 11 6 positions 9

The Mpping Problem: Suffix Trees Spce complexity my be n obstcle: The lefs of the suffix tree lso hold the indices where the suffixes begin => the tree requires O( R log R ) bits just for coding of the indices. The constnts re lrge due to the dditionl lyers of informtion required for the tree (such s suffix links, etc.). The originl humn genome reference sequence demnds just R log(#symbols) bits => we cn store the humn genome using ~750MB But! ~64GB for the Suffix Tree (2016: very doble)! Finl problem: suffix trees llow only for exct mtching. The Mpping Problem: Hshing Preprocess the reference genome R into hsh tble H. The keys of the hsh re ll the substrings of length l in R: Tble H contins: the position p in R where the substring ends. Mtching: given S i the lgorithm returns H(S i ). Time complexity: O(m l+l R ) Note: The spce complexity is O(l R + R log R ) since we must lso hold the binry representtion of ech substring's position. (Still quite high.) Represent ech nucleotide s 2-bit code: fctor of 4 reduction. Prcticl solution: Prtition the genome into severl chunks. Agin, only exct mtching llowed. 10

The Mpping Problem: The MAQ Algorithm The MAQ (Mpping nd Alignment with Qulities) lgorithm by Li, Run nd Durbin, 2008 [5] An efficient solution to the short red mpping problem Allows for inexct mtching, up to predetermined mximum number of mismtches. Memory efficient. Mesures for bse nd red qulity, nd for identifying sequence vrints. The MAQ Algorithm A key insight: If red in its correct position to the reference genome hs one mismtch, then prtitioning tht red into two mkes sure one prt is still exctly mtched. Hence, if we perform n exct mtch twice using only subset of the red bses ( templte), one of the subsets is enough to find the mtch. For 2 mismtches, 6 templtes: some templtes re noncontiguous ech templte covers hlf the red Ech templte hs prtner templte tht complements it to form the full red If the red hs no more thn two mismtches, t lest one templte will not be ffected by ny mismtch. For 3-mismtches, 20 templtes gurntee t lest one fully mtched templte. Etc. 11

mismtch The MAQ Algorithm: Templtes Red: Red: A nd B re reds, where purple boxes indicte mismtches with respect to the reference. The numbered templtes hve blue boxes in the positions they cover. 1 mismtch: compring red A through templte 1 => no full mtch templte 2 fully mtched Templte # 2 mismtches: red B is fully mtched with templte 6 Any combintion of up to two mismtched positions will be voided by t lest one of the 6 templtes. Note: 6 templtes gurntee 57% of ll 3 mismtches will hve t lest one fully mtching templte. The MAQ Algorithm The lgorithm: Generte the number of templtes required to gurntee t lest one full mtch for the desired mximl number of mismtches, nd For ech red: mtch the red ginst the templtes use the exct mtching templte s seed for extending cross the full red. Identifying the exct mtches is done by hshing the red templtes into templte-specific hsh tbles, nd scnning the reference genome R ginst ech tble. Note: It is not necessry to generte templtes nd hsh tbles for the full red, since the initil seed will undergo extension, e.g. using the Smith- Wtermn lgorithm. Therefore the lgorithm initilly processes only the first 28 bses of ech red. (The first bses re the most ccurte bses of the reds.) 12

The MAQ Algorithm Algorithm (for the cse of up to two mismtches): 1. Index the first 28 bses of ech red with complementry templtes 1, nd 2, thereby generting hsh tbles H1 nd H2, respectively. 2. Scn the reference genome R: for ech position nd ech orienttion, query 28-bp window through templtes 1, 2 ginst the pproprite tbles H1, H2, respectively. If hit, extend nd score the complete red bsed on mismtches. For ech red, keep the two best scoring hits nd the number of mismtches therein. 3. Repet steps 1+2 with complementry templtes 3, 4, then 5, 6. Remrk: The reson for indexing ginst pir of complementry templtes ech time hs to do with feture of the lgorithm regrding pired-end reds (see originl pper for detils). The MAQ Algorithm Complexity Time Complexity: O(m l) for generting the hsh tbles in Step 1, nd O(l R ) for scnning the genome in Step 2. Repeting Steps 1+2 three times in this implementtion hs no effect on the symptotic complexity. Spce Complexity: O(m l) for holding the hsh tbles in Step 1, nd O(m l+ R ) totl spce for Step 2, but only O(m l) spce in min memory t ny one time. 13

The MAQ Algorithm Complexity Note tht, not the full red length l is used, but window of length l' <= l ( l' = 28-bp in the implementtion presented bove ). The size of ech key in templte hsh tble is only l' /2, since ech templte covers hlf the window. => Time nd spce complexity of step 1 is: O(2 m l' /2) = O(ml' ). This spce reduction mkes running the lgorithm on PC very fesible. The time nd spce required for extending mtch in Step 2 using the Smith-Wtermn lgorithm, where p the probbility of hit (using the l length window): The time complexity : O(l' R + p R l 2 ). The spce complexity: O(ml' + l 2 ). Note: The vlue of p is smll, nd decreses drsticlly the longer l' is, since most (other) l' -long windows in the reference genome will likely not cpture the exct coordintes of true red. Red Mpping Qulities The MAQ lgorithm provides mesure of the confidence level for the mpping of ech red, denoted by QS nd defined s: QS = -10 log 10 (Pr{red S is wrongly mpped}) For exmple: QS = 30 if the probbility of incorrect mpping of red S is 1/1000. This confidence mesure is clled phred-scled qulity, similr to the scling scheme originlly introduced by Phil Green et l. [3] for the humn genome project. Exmple: Bse clling with Phred Qulity Score = 30 => Probbility of incorrect bse cll is 1/1000. 14

The Bowtie Algorithm Another wy to mp reds to reference genome is given by the Bowtie lgorithm, presented in 2009 by Lngmed et l.[1]. It solves the mpping problem using spce-efficient indexing scheme. The indexing scheme used is clled the Burrows- Wheeler Trnsform [2] nd ws originlly developed for dt compression purposes. The Burrows-Wheeler Trnsform Applying the Burrows-Wheeler trnsform BW(T) to the text T = "the next text tht i index.": 1. First, we generte ll cyclic shifts of T. 2. Next, we sort these shifts lexicogrphiclly. define the chrcter '.' s the minimum nd we ssume tht it ppers exctly once, s the lst symbol in the text. followed lexicogrphiclly by ' (spce) followed by the English letters ccording to their nturl ordering. Cll the resulting mtrix M. The trnsform BW(T) is defined s the sequence of the lst chrcters in the rows of M. Note tht, the lst column is permuttion of ll chrcters in the text since ech chrcter ppers in the lst position in exctly one cyclic shift. 15

Burrows-Wheeler Trnsform Some of the cyclic shifts of T sorted lexicogrphiclly nd indexed by the lst chrcter. Burrows-Wheeler Trnsform Storing BW(T) requires the sme spce s the size of the text T since it is permuttion of T. BW(the humn genome) Ech bse {A,C,T,G} represented by 2 bits => storing the permuttion requires 2 times 3x10 9 bits (insted of ~30 times 3x10 9 for storing ll indices of T). 16

Burrows-Wheeler Trnsform The following holds for BW(T): 1. # occurrences of chr c in T = # occurrences of chr c in BW(T) (BW(T) permuttion of the T). 2. The first column of the mtrix M cn be obtined by sorting BW(T) lexicogrphiclly. 3. Determine the number of occurrences of the substring 'xt' in T: BW(T) is the lst column of the lexicogrphicl sorting of the shifts. The chrcter t the lst position of row ppers in the text T immeditely prior to the first chrcter in the sme row (ech row is cyclicl shift). => consider the intervl of 't' in the first column, nd check how mny of these rows hve n 'x t the lst position. Burrows-Wheeler Trnsform 1. Some of the cyclic shifts of T sorted lexicogrphiclly nd indexed by the lst chrcter. 17

Burrows-Wheeler Trnsform The following holds for BW(T): 1. # occurrences of chr c in T = # occurrences of chr c in BW(T) (BW(T) permuttion of the T). 2. The first column of the mtrix M cn be obtined by sorting BW(T) lexicogrphiclly. 3. Determine the number of occurrences of the substring 'xt' in T: BW(T) is the lst column of the lexicogrphicl sorting of the shifts. The chrcter t the lst position of row ppers in the text T immeditely prior to the first chrcter in the sme row (ech row is cyclicl shift). => consider the intervl of 't' in the first column, nd check how mny of these rows hve n 'x t the lst position. Sorted BW(T) BW(T) 2. Recovering the first column (left) by sorting the lst column. 3. Determine the number of occurrences of the substring 'xt' in T 18

Burrows-Wheeler Trnsform Given BW(T) lso the second column cn be derived: 'xt' ppers twice in the text, nd three rows strt with n 'x'. Two of the three must be followed by 't, where the lexicogrphicl sorting determines which 'x'. The third 'x' is followed by '.' (see first row) => '.' must follow the first 'x' in the first column since '.' is smller lexicogrphiclly thn 't'. The second nd third occurrences of 'x' in the first column re therefore followed by 't'. Note: We cn use the sme process to recover the chrcters t the second column for ech intervl, nd then the third, etc. t text tht i index. the nex t tht i index. the next tex text tht i index.the next tht i index. the next text the next text tht i index. We hve ctully: F L Lst-first mpping: Ech 't' chrcter in L is linked to its position in F nd no crossed links re possible. The j-th occurrence of chrcter X in L corresponds to the sme text chrcter s the j-th occurrence of X in F. 19

Burrows-Wheeler Trnsform The previous two centrl properties of the BW-trnsform re cptured in the Lemm by Ferrgin nd Mnzini[4]: Lemm 12.1 (Lst-First Mpping): Let M be the mtrix whose rows re ll cyclicl shifts of T sorted lexicogrphiclly, nd let L(i) be the chrcter t the lst column of row i nd F(i) be the first chrcter in tht row. Then: 1. In row i of M, L(i) precedes F(i) in the originl text: T = L(i) F(i) 2. The j-th occurrence of chrcter X in L corresponds to the sme text chrcter s the j-th occurrence of X in F. F L Lst-first mpping: j-th occurrence of chrcter 't' chrcter in L is linked to its j-th position in F nd no crossed links re possible. 20

Burrows-Wheeler Trnsform Proof: 1. Follows directly from the fct tht ech row in M is cyclicl shift. 2. Let X j denote the j-th occurrence of chrcter X in L, nd let α be the chrcter following X j in the text nd β the chrcter following X j+1. Then, since X j ppers bove X j+1 in L, α ppers t the beginning of row bove the row tht strts with β. The rows re lexicogrphiclly ordered, hence α must be equl or lexicogrphiclly smller thn β. Now clerly X α lexicogrphiclly X β holds. Hence, s the rows re lexicogrphiclly ordered, if chrcter X j ppers in F it is followed by α, nd thus will be bove X j+1 which is followed by β. Thus proofing the Lemm. Reconstructing the Text Algorithm UNPERMUTE for reconstructing text T from its Burrows-Wheeler trnsform BW(T) utilizing Lemm 12.1 [4]: ssume the ctul text T is of length u ppend unique $ chrcter (=. ) t the end, which is the smllest lexicogrphiclly UNPERMUTE[BW(T)] 1. Compute the rry C[1,, Σ ] : where C(c) is the number of chrcters {$, 1,,c-1} in T, i.e., the number of chrcters tht re lexicogrphiclly smller thn c 2. Construct the lst-first mpping LF, trcing every chrcter in L to its corresponding position in F: LF[i] = C(L[i]) + r(l[i], i) + 1, where r(c, i) is the number of occurrences of chrcter c in the prefix L[1, i -1] 3. Reconstruct T bckwrds: s = 1, T(u) = L[1]; for i = u 1,, 1 do s = LF[s]; T[i] = L[s]; od; 21

Notice, tht C(e) + 1 = 9 is the position of the first occurrence of e' in F. C(e) = 8, s there re 8 chrcters in T tht re smller thn e. C 0 1 1 1 1 1 6 7 8 F L Lst-first mpping: Ech 't' chrcter in L is linked to its position in F nd no crossed links re possible. Exmple T = ccg$ (u = 6) is trnsformed to BW(T) = gc$c, nd we now wish to reconstruct T from BW(T) using UNPERMUTE: 1. Compute rry C. For exmple: C(c) = 4 since there re 4 occurrences of chrcters smller thn 'c' in T ('$' nd 3 occurrences of ''). Now C(c) + 1 = 5 is the position of the first occurrence of 'c' in F. 2. Perform the LF mpping. For exmple, LF[c 2 ] = C(c) + r(c,7) + 1 = 6, nd indeed the second occurrence of 'c' in F is t F[6]. 3. Determine the lst chrcter in T: T(6) = L(1) = 'g'. 4. Iterte bckwrds over ll positions using the LF mpping. For exmple, to recover the chrcter T(5), we use the LF mpping to trce L(1) to F(7), nd then T(5) = L(7) = 'c'. Remrk We do not ctully hold F in memory, we only keep the rry C defined bove, of size Σ, which we cn esily obtined from L. r(c,7) = # of occurences of c in length 7-1 prefix, used s n offset to obtin the right c. 22

Determine its loction in F nd find its corresponding predecessor in L using C(.) nd r(.,.) i.e., find corresponding g using the lst-first mpping Note: $ =. First Lst Exmple of running UNPERMUTE to recover the originl text. Source: [1]. Exct Mtching EXACTMATCH exct mtching of query string P to T, given BW(T)[4] is similr to UNPERMUTE, we use the sme C nd r(c, i) denote by sp the position of the first row in the intervl of rows in M we re currently considering denote by ep the position of the first row beyond this intervl of rows => the intervl of rows is defined by the rows sp,, ep - 1. EXACTMATCH[P[1,,p], BW(T)] 1. c = P[p]; sp = C[c] + 1; ep = C[c+1] + 1; i = p - 1; 2. while sp < ep nd i >= 1 c = P[i]; sp = C[c] + r(c, sp) + 1; ep = C[c] + r(c, ep) + 1; i = i - 1; 3. if (sp == ep) return "no mtch"; else return sp, ep; 23

P = c Exmple of running EXACTMATCH to find query string in the text [1]. Inexct Mtching Sketch Ech chrcter in red hs numeric qulity vlue, with lower vlues indicting higher likelihood of sequencing error. Similr to EXACTMATCH, clculting mtrix intervls for successively longer query suffixes. If the rnge becomes empty ( suffix does not occur in the text), then the lgorithm selects n lredy-mtched query position nd substitute different bse there, introducing mismtch into the lignment. The EXACTMATCH serch resumes from just fter the substituted position. The lgorithm selects only those substitutions tht re consistent with the lignment policy nd yield modified suffix tht occurs t lest once in the text. If there re multiple cndidte substitution positions, then the lgorithm greedily selects position with mximl qulity vlue. 24

P = gc No g => substitute nd proceed. g g Exmple of running INEXACTMATCH to find query string in the text [1]. The Assembly Problem How to ssemble n unknown genome bsed on mny highly overlpping short reds from it? Problem Sequence ssembly INPUT: m l-long reds S 1,, S m. QUESTION: Wht is the sequence of the full genome? The crucil difference between the problems of mpping nd ssembly is tht now we do not hve reference genome, nd we must ssemble the full sequence directly from the reds. 25

De Bruijn Grphs Definition A k-dimensionl de Bruijn grph of n symbols is directed grph representing overlps between sequences of symbols. It hs n k vertices, consisting of ll possible k-tuples of the given symbols. Note: the sme symbol my pper multiple times in tuple. If we hve the set of symbols A = { 1,, n } then the set of vertices is: V = { ( 1,, 1, 1 ), ( 1,, 1, 2 ),, ( 1,, 1, n ), ( 1,, 2, 1 ),, ( n,, n, n )} If vertice w cn be expressed by shifting ll symbols of nother vertex v by one plce to the left nd dding new symbol t the end, then v hs directed edge to w. Thus the set of directed edges E is: E = {( (v 1, v 2,, v k ), (w 1,w 2,, w k ) ) v 2 = w 1, v 3 = w 2,, v k = w k-1, nd w k new} A portion of 4-dimensionl de Bruijn grph. The vertex CAAA hs directed edge to vertex AAAC, becuse if we shift the lbel CAAA to the left nd dd the symbol C we get the lbel AAAC. In this wy the word CAAAC defines directed edge 26

De Bruijn Grphs Given red, every contiguous (k+1)-long word in it corresponds to n edge in the k-dimensionl de Bruijn grph of the symbols {A,C,T,G}. Form subgrph G of the full de Bruijn grph by introducing only the edges tht correspond to (k+1)-long words in some red. A pth in this grph defines potentil subsequence in the genome. Hence, we cn convert red to its corresponding pth in the constructed subgrph G Using the grph from the previous slide, we construct the pth corresponding to CCAACAAAAC: shift the sequence one position to the left ech time, nd mrking the vertex with the lbel of the 4 first nucleotides. 27

De Bruijn Grphs Form pths for ll the reds Identify common vertices on different pths Merge different red-pths through these common vertices of the pths into one long sequence The two reds in () re converted to pths in the grph The common vertex TGAG is identified Combine these two pths into (b). 28

De Bruijn Grphs Velvet (2008), by Zerbino nd Birney[6], Velvet GUI (2012), is n lgorithm which uses de Bruijn grphs to ssemble reds; with the following difficulties: repets will show up s cycles in the merged pth tht we form. We do not know how mny times ech cycle must be trversed in order to form the full sequence of the genome. If we hve two cycles strting t the sme vertex, we cnnot tell which one to trverse first. Velvet ttempts to ddress this issue by utilizing the extr informtion we hve in the cse of pired-ends sequencing (both ends of DNA frgment re sequenced in Red 1 nd Red 2, respectively. The distnce between ech pired red is known). Velvet (Wikipedi) 29

Remove tips if the edges re wek. Exmple of bubble: Other Assembly Algorithms HMM bsed Mjority bsed Etc. Long reds: string grphs 30

Other Assembly Algorithms K.R. Brdnn et l. Assemblthon 2: evluting de novo methods of genome ssembly in three vertebrte species (http://gigscience.biomedcentrl.com/rticles/10.1186/2047-217x-2-10, 2013) Mny current genome ssemblers produced useful ssemblies, contining significnt representtion of their genes nd overll genome structure. However, the high degree of vribility between the entries suggests tht there is still much room for improvement in the field of genome ssembly nd tht pproches which work well in ssembling the genome of one species my not necessrily work well for nother. Other Assembly Algorithms Slzberg SL, Phillippy AM, Zimin A, Puiu D, Mgoc T, Koren S, Trengen TJ, Schtz MC, Delcher AL, Roberts M, et l. Gge: A criticl evlution of genome ssemblies nd ssembly lgorithms. Genome Res. 2012; 22(3):557 67. Three conclusions: 1.Qulity: dt qulity, rther thn the ssembler itself, hs drmtic effect on the qulity of n ssembled genome 2.Vribility: the degree of contiguity of n ssembly vries enormously mong different ssemblers nd different genomes 3.Correctness: the correctness of n ssembly lso vries widely nd is not well correlted with sttistics on contiguity. 31

Other Assembly Algorithms Scffolding nd completing genome ssemblies in rel-time with nnopore sequencing By Minh Duc Co, Son Hong Nguyen, Devik Gnesmoorthy, Alysh G. Elliott, Mtthew A. Cooper & Lchln J. M. Coin Nture Communictions 8, Article number: 14515 (2017) Long red sequencing technologies, for exmple Pcific Biosciences (PcBio) nd Oxford Nnopore MinION sequencing, llow users to generte reds spnning most repetitive sequences, which cn be used to close gps in frgmented ssemblies. Imge from reference [8]. 32

Bibliogrphy [1] Ben Lngmed, Cole Trpnell, Mihi Pop nd Steven L. Slzberg. Ultrfst nd memory-effcient lignment of short DNA sequences to the humn genome. Genome Biology, 2009. [2] Michel Burrows nd Dvid Wheeler. A block sorting lossless dt compression lgorithm. Technicl Report 124, Digitl Equipment Corportion, 1994. [3] Brent Ewing nd Phil Green. Bse-clling of utomted sequencer trces using phred. II. Error probbilities. Genome Reserch, 1998. [4] Pulo Ferrgin nd Giovni Mnzini. Opportunistic dt structures with pplictions. FOCS '00 Proceedings of the 41st Annul Symposium on Foundtions of Computer Science, 2000. [5] Heng Li, Jue Run nd Richrd Durbin. Mpping short DNA sequencing reds nd clling vrints using mpping qulity scores. Genome Reserch, 2008. [6] Dniel R. Zerbino nd Ewn Birney. Velvet: lgorithms for de novo short red ssembly using de Bruijn grphs. Genome Reserch, 2008. [7] Ron Shmir, Computtionl Genomics Fll Semester, 2010 Lecture 12: Algorithms for Next Genertion Sequencing Dt, Jnury 6, 2011 Scribe: Ant Gluzmn nd Ern Mick [8] Minh Duc Co, Son Hong Nguyen, Devik Gnesmoorthy, Alysh G. Elliott, Mtthew A. Cooper & Lchln J. M. Coin. Scffolding nd completing genome ssemblies in reltime with nnopore sequencing. Nture Communictions 8, Article number: 14515, 2017. Exct Mtching 33

Exct Mtching Algorithms Exct Mtching Problem: serch pttern P in text T (P,T re strings) Knuth Morris Prtt preprocess pttern P Aho Corsick preprocess pttern P of severl strings P = { P 1,, P r } Suffix Trees preprocess text T or severl texts, dtbse of texts Exct Mtching: Preprocessing Pttern(s) Knuth-Morris-Prtt (KMP) Aho-Corsick 34

Serching Pttern P in Text T Nive: T: neeneeneeneeeneneeeneeneneeneneen P: neeneeen neeneeen neeneeen neeneeen Better: T: neeneeneeneeeneneeeneeneneeneneen P: neeneeen neeneeen neeneeen Pttern P ϵ {,b, } * : KMP Preprocessed P exmple Automton: b b b Tble: 1 2 3 4 5 6 7 8 b b b 0 1 1 2 2 3 4 3 filure links (suffix = prefix): top: s utomton bottom: s tble how to determine? how to use? P: P: P: P: P: P: 6 b b b b b b 3 7 b b b b b b 4 8 b b b b b... 3 35

KMP computing filure links filure link ~ new best mtch (fter mismtch) òr 0 Pttern P: k-1 b b k Flink[1] = 0; for k from 2 to PtLen do fil = Flink[k-1] while ( fil>0 nd P[fil] P[k-1] ) do fil = Flink[fil]; od Flink[k] = fil+1; od KMP computing filure links Flink[1] = 0; for k from 2 to PtLen do fil = Flink[k-1] while ( fil>0 nd P[fil] P[k-1] ) do fil = Flink[fil]; od Flink[k] = fil+1; od Tble: 1 2 3 4 5 6 7 8 b b b 0 0 1 0 1 1 0 1 1 2 0 1 1 2 2 3 4 3 fil: 0 0->F->Flink[2]=0+1 1->T->0->F->Flink[3]=1 1->F->Flink[4]=1+1 0 1 1 2 2 3 4 3 36

prefixes vi filure links Pttern P: P 1 P r-1 P k-r+1 P k-1 Flink[k]=r r k P r P 1 P r-1 = P k-r+1 P k-1 mximl r<k prefix ll such vlues r: suffix r 4 r 3 r 2 r 1 k P 1 P r2-1 = P k-r2 +1 P k-1 = P r1 -r 2 +1 P r1-1 Flink[r 1 ]=r 2 Boyer-Moore other methods T = mrktkoopmn P = schoenveter schoe work bckwrds Krp-Rbin fingerprint i-1 i i+n-1 i+n p 1 p n hsh-vlue i B n-1 + i+1 B n-2 + i+n-1 B 0 i+1 B n-1 + + i+n-1 B 1 + i+n B 0 37

exct mtching with set of ptterns P = { P 1,, P r } ll occurrences in text T totl length m length n AHO CORASICK generlizes KMP filure links: longest suffix tht is prefix (perhps in nother string) > ssume no subwords within P keyword tree - trie edges ~ letters t t o p o e t t e r r s y i c e n h c o { potto, poetry, pottery, science, school } o l 1 2 3 4 5 1 y e 2 5 3 4 leves ~ keywords 38

filure links o t t o p t t o h e t o o t t h e r e { potto, tttoo, theter, other } r potto other potto tttoo filure links into other brnches! => into other ptterns Algorithm for dding the filure links: follow the links existing filure links new edge with lbel : follow existing filure links strting t the prent node of the edge until n outgoing edge with lbel is found 39

Adding filure links o t t o p t t t o o h o e t t h e r e { potto, tttoo, theter, other } r potto other t heter ttoo bredth first (level-by-level) filure links o t t o p t o h t e h t t t o e o r e { potto, tttoo, theter, other } r shortcuts root to child of root (see for exmple po ) [single letter] 40

Preprocess text T: Suffix Trees trie vs. suffix tree Text T: bb Suffixes: bb b b b b b Text string T (= bb ) + ll its suffixes b b b b b b b b trie suffix tree From: www.cs.helsinki.fi/u/ukkonen/erice2005.ppt 41

Trie vs. Suffix Tree Trie: b b b b b Suffix Tree: b b b b b Given text T: Trie(T) = O( T ) 2 qudrtic bd exmple: T = n b n Trie(T) like DFA for the suffixes of T minimize DFA directed cyclic word grph DFA = Deterministic Finite Automton (lso used by Ukkonen) only brnching nodes nd leves represented edges lbeled by substrings of T correspondence of leves nd suffixes T leves, hence < T internl nodes Tree(T) = O( T + size(edge lbels)) liner nitty Order of insertion is position of suffix in Suffix Tree of text T: nitty itty tty ty y ritty itty tty ty y 1 2 3 4 5 6 7 8 9 10 11 1 nitty 2 8 ε itty ty ε 3 9 t y y ε ritty 4 10 ε 5 11 7 6 42

Text T nd its suffixes: nitty nitty itty tty ty y ritty itty tty ty y 1 2 3 4 5 6 7 8 9 10 11 1 nitty 1-11 6-11 ε itty 2-5 6-11 t 3-3 ty 4-5 y 5-5 6-11 2 8 3 9 4 10 ε ritty 7-11 y 7 5-5 6-11 6-11 ε ε 5 11 6 implementtion: refer to positions of the substrings in T liner time construction Text T nd its suffixes: nitty itty tty ty y ritty itty tty ty y Weiner (1973) lgorithm of the yer McCreight (1976) on-line lgorithm (Ukkonen 1992) 43

Filure link to mximl prefix of suffix in T online construction of suffix trie for T = bb suffix links b b bb bb b b εb ε b b b b b b ε ε b b b b b next symbol = b from here b lredy exists Suffix Tree ppliction: full text index T pos P pos P in T P is prefix of suffix of T P subtree under P ~ loctions of P pos pos 44

exmple: find itt in nitty nitty itty tty ty y ritty itty tty ty y 1 2 3 4 5 6 7 8 9 10 11 1 nitty 2 8 ε itty ty ε ritty t y y ε 3 9 4 10 7 ε 5 11 6 positions Suffix Tree ppliction: longest common substring T P pples plte T pos pos P Construct generlized suffix tree Contining suffixes of both T nd T. Mrk T nd T suffixes: pos pos And find the longest common prefixes of suffixes from different texts stored in the Suffix Tree. 45

ppliction: counting motifs nitty itty tty ty y ritty itty tty ty y 1 2 3 4 5 6 7 8 9 10 11 1 nitty 2 8 ε itty ty 2 2 ε 11 y t 4 y 2 ε 3 9 4 10 ritty 7 2 ε 5 11 6 The suffix tree now contins counts for the number of suffixes in sub-tree Suffix Trees Experiments: motifs, repets in DNA s reported by Ukkonen humn chromosome 3 the first 48 999 930 bses 31 min cpu time (8 processors, 4 GB) humn genome: 3x10 9 bses suffix tree for Humn Genome is fesible 46

longest repet? Occurrences t: 28395980, 28401554r Length: 2559 ttgggtctgtgcccgtgcggtttgttcttgttccgtgcctgtggtgtgctgccccttctcgtctttgcgtt ggtttctccgtgcttccctcccccctccccccccccccgtccccggtgtgtgtgttccccttcctgtgtcctgtgttctc ttgttcttcccccttggtggctgcggtgtttggttttttgtccttgcggtttgctggtgtggtttccgcttctcct tccctcggctgctctcttttttttggctgctgtttcctggtgtttgtgcccttttcttcccgtctcccttgttg gctctgggttggttccgtctttgctttgtgtgtgccgctctcgtgtgctgtgtcttttgcgctgtttttcc tttgggtttcccgttgggtggctgggtctggttttctgttctgtccctgggtccccctgcttccctggt tgctgtttcgtcccgccgttccttttctccctcctctccgccctgttgtttcctgcttttttgtcgccttctctg gtgtggtggttctcttgtggttttgtttgctttctctgtggccgtgtgtggcttttttctgtgttttttggctgcttgtctt cttttgggtgtctgttcttccttcgccccttttgtggggttgtttgtttttttcttgttttgttgggttcttgtgttctgggttt gccctttgtcgtggtggttgcttttctcccttctgtggttgcctgttcctctgtggtggtttcttctgctgtgcggctct ttgtttttgtccctttgtcttttggcttttgttgcctgcttttggtgttttgctggtccttgccctgccttgtcctgtg gtttgcctggttttcttctgggttttttggttttggtctctgtgtcttttcctcttgtttttggtgtttttggtg ttttggtgtttttttttttggtgtttttttggtgtgggggtccgtttcgctttctcttggctg ccgttttccctgcccttttttgggtcctttccccttgcttgtttttgtcggtttgtcgtcgtgttgtgttgcgg ctttttctggggctctgttctgttccttggtctttctctgttttggtccgtcctgctgttttggttctgtgccttgtgttgttt ggtcggtgcgtgtggttccgctttgttcttttggcttggttgcttggctgtgggctcttttttggttccttgctttgt gttttttccttctgtggttcttggtgcttgtggggtggcttgtcttttccctgggcgttggccttttcc tttgtcttcctccctggcgtgtctgttcttcctttgtttgttcctctttttttcttggcgtggtttgtgttctccttggggt ccttcctcccttgtgttggttcctggtttttttctctttggcttgtgtggggttcctctgtttgctctctgtttgtctg ttttggtgttgtgcttgtgtttttgccttgttttgttcctggctttgctggttgctttcgcttgggttttgggctg gcgtggggttttctgttctctgtctctgccgggctttgcttcctcttttcctttgtcccgtttttccctctc ctgcctgttgccctggccgcttccccttgttgtgggtggtggggggctccctgtcttgtgccgttttcggg tgcttccgtttttgtccttcgttgtttggctgtgggtttgtctgtgctcttttttttggtctccctctcctttt ttgggtttttgctgggttcttgttttgtcggccttttctgctcttttggttctgtggtttctgtctttggttctgtttt tgctgggtcgtttttgttttcgttgttgccgccttgctcccgggtggccccttgtctggtggtgctttttgtgt gctgctggttcggtttgccgtttttttgggtttctgctcgtgttctcggtttggtctttctctttttttgttgtgtctctgt cggctttggttcggtgtgctggcctcttggttgg ten occurrences? ttttttttttttttggcgggtctcgctctgtcgcccggctgggtgcgtg gcgggtctcggctcctgcgctccgcctcccgggttccgccttct cctgcctcgcctcccgtgctgggctcggcgcccgccctcg cccggctttttttgttttttgtggcggggtttcccgttttgccgg gtggtctcgtctcctgcctcgtgtccgcccgcctcggcctcccg tgctgggttcggcgt Length: 277 Occurrences t: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125 In the reversed complement t: 17858493, 41463059, 42431718, 42580925 47

Suffix Tree Remrks suffix tree efficient (liner) storge, but constnt ±40, lrge, overhed suffix rry hs constnt ±5 overhed hence more prcticl but hs its own complictions nïve n log(n) lgorithm is in some cses not too bd (next slide) Suffix Arry nitty itty tty ty y ritty itty tty ty y 1 2 3 4 5 6 7 8 9 10 11 itty itty nitty ritty tty tty ty ty y y 6 8 2 1 7 9 3 10 4 11 5 lexicogrphic order of the suffixes 48

References Dn Gusfield Algorithms on Strings, Trees, nd Sequences Computer Science nd Computtionl Biology This book lists mny pplictions for suffix trees nd hs extended implementtion detils. Severl slides on suffix-trees re bsed on nd/or copied from Esko Ukkonen, Univ Helsinki (Erice School, 30 Oct 2005) 49