Bioinformatics Lecture 16

DNA sequencing
To sequence a DNA molecule is to obtain the string of bases that it contains. It is impossible to sequence the whole molecule directly. We may, however, obtain a piece of a certain length cut at random and sequence it. Such a piece is called a fragment. By using cloning and cutting techniques we can obtain a large number of sequenced fragments. The goal is to reconstruct the DNA molecule from the overlaps between the fragments.

Ideal case
We know the length of the DNA (e.g. 10 bases).
There are no errors in sequencing the fragments.
Fragments: ACCGT, CGTGC, TTAC, TACCGT
Align sequences ignoring end gaps:
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
Find consensus by majority voting:
TTACCGTGC

Insertion errors
Fragments: ACCGT, CAGTGC, TTAC, TACCGT
--ACC-GT--
----CAGTGC
TTAC------
-TACC-GT--
Consensus: TTACC-GTGC
An A was inserted in the second fragment.
The gap in the consensus is discarded, giving TTACCGTGC.
In this example, it still works because of majority voting.

Deletion error
Fragments: ACCGT, CGTGC, TTAC, TACGT (TACCGT with one C deleted)
--ACCGT--
----CGTGC
TTAC-----
-TAC-GT--
Consensus: TTACCGTGC
The first C was deleted from the 4th fragment.
The consensus still works.

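The majority vote used in the ideal, insertion, and deletion cases can be sketched in a few lines of Python. This is only an illustration: the function name `consensus` and the convention that rows are pre-aligned strings padded with `-` are assumptions, not part of the lecture. Gaps outside a fragment's span are treated as end gaps and do not vote; gaps inside the span do vote.

```python
from collections import Counter

def consensus(rows, gap="-"):
    """Majority-vote consensus of equal-length aligned rows, ignoring end gaps."""
    width = len(rows[0])
    votes = [Counter() for _ in range(width)]
    for row in rows:
        covered = [i for i, ch in enumerate(row) if ch != gap]
        if not covered:
            continue
        # Only columns between the first and last base of a fragment vote;
        # gaps outside that span are end gaps and are ignored.
        for i in range(covered[0], covered[-1] + 1):
            votes[i][row[i]] += 1
    columns = [v.most_common(1)[0][0] for v in votes if v]
    # Gap columns in the consensus are discarded.
    return "".join(ch for ch in columns if ch != gap)

ideal = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TACCGT--"]
insertion = ["--ACC-GT--", "----CAGTGC", "TTAC------", "-TACC-GT--"]
deletion = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TAC-GT--"]
for rows in (ideal, insertion, deletion):
    print(consensus(rows))
```

All three alignments from the slides give the same consensus, because the single erroneous fragment is outvoted in its column.
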
Chimeric fragment
Two disjoint pieces of the DNA join to form one fragment that is not originally part of the DNA.
Fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
Consensus: TTACCGTGC
Chimeric fragment: TTA---TGC

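Unknown orientation
Which strand does a particular fragment belong to?
Fragments: CACGT, ACGT, ACTACG, GTACT, ACTGA, CTGA
CACGT
-ACGT
--CGTAGT        (reverse complement of ACTACG)
-----AGTAC      (reverse complement of GTACT)
--------ACTGA
---------CTGA
Consensus: CACGTAGTACTGA
With n fragments we have $2^n$ possible orientation assignments.
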
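As a small illustration of the orientation problem, the sketch below computes reverse complements and enumerates all orientation assignments of a fragment set. The helper names `reverse_complement` and `orientations` are made up for this example, and only the alphabet A, C, G, T is assumed.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def orientations(fragments):
    """Yield every way of choosing f or its reverse complement for each fragment."""
    choices = [(f, reverse_complement(f)) for f in fragments]
    return product(*choices)

print(reverse_complement("ACTACG"))  # the third row of the alignment above
print(reverse_complement("GTACT"))   # the fourth row of the alignment above
```

Enumerating `orientations(fragments)` produces $2^n$ tuples for n fragments, which is why trying every assignment quickly becomes impractical.
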
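Repeats
Repeats of the form X ... X ... X
With three copies of a repeat X, the segments between the copies can be exchanged, so two layouts are consistent with the same fragments:
A X B X C X D
A X C X B X D
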
Repeats
Repeats of the form X ... Y ... X ... Y
With two interleaved repeats X and Y, the segments B and D can be exchanged:
A X B Y C X D Y E
A X D Y C X B Y E

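Inverted repeats
An inverted repeat is a region X that occurs again as its reverse complement, e.g. X = CGA and its reverse complement TCG.
[Figure: a region X and an inverted copy of X]
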
Lack of coverage
[Figure: two contigs separated by an uncovered area]
We have more than one contig.

Number of fragments
It is important to know how many fragments we need to generate in order to achieve a certain coverage. Let T denote the length of the DNA. Assume all fragments have length l and that we can detect overlaps of at least t bases. If we sample n fragments at random, what is the expected number of contigs?
$E[\#\,\text{contigs}] \approx n e^{-n(l - t)/T}$

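To get a feel for the formula, it can be evaluated for a range of n. The function name `expected_contigs` and the numbers used below (T = 1000, l = 100, t = 20) are arbitrary illustrative choices, not values from the lecture.

```python
import math

def expected_contigs(n, l, t, T):
    """Expected number of contigs: n * exp(-n (l - t) / T)."""
    return n * math.exp(-n * (l - t) / T)

# Hypothetical setting: DNA of length 1000, fragments of length 100,
# detectable overlap of at least 20 bases.
for n in (5, 10, 20, 50, 100):
    print(n, round(expected_contigs(n, l=100, t=20, T=1000), 3))
```

The value first grows with n (more fragments, more isolated contigs) and then falls as the fragments start to overlap and merge.
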
Alternative methods
Shortest common superstring (SCS): "An elegant theoretical abstraction, but fundamentally flawed" (R. Karp)
Generalized SCS: models errors and orientations
Multicontig: models errors, orientations, and coverage

SCS
Given a set of fragments F, find the shortest string s that contains every $f \in F$ as a substring.
This is NP-hard.
The SCS might not be what we really want.

Bad example (repeats)
[Figure: a target sequence containing two copies of a repeat X, and the layout that the shortest common superstring gives for its fragments]

Generalized SCS
Given a set of fragments F, find the shortest string s that contains either $f$ or $\bar{f}$ (the reverse complement of f) as a substring, for every $f \in F$.
Now it models orientations.

Generalized SCS (cont.)
Given a set of fragments F, $\epsilon > 0$, and a distance function d, find the shortest string s that contains, for every $f \in F$, a substring x such that $\min[d(f, x), d(\bar{f}, x)] \le \epsilon |f|$.
Now it models both orientations and errors.

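The condition in the generalized SCS definition can be checked for a candidate string s. The sketch below assumes that d is edit distance (the lecture leaves d unspecified) and computes the minimum over all substrings x of s with the usual approximate-matching dynamic program, in which a match may start anywhere in s at no cost. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def best_substring_distance(f, s):
    """Minimum edit distance between f and any substring of s."""
    prev = [0] * (len(s) + 1)        # the empty prefix of f matches anywhere at cost 0
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)     # f[:i] against the empty substring costs i
        for j in range(1, len(s) + 1):
            cur[j] = min(prev[j] + 1,                              # drop f[i-1]
                         cur[j - 1] + 1,                           # extra base s[j-1]
                         prev[j - 1] + (f[i - 1] != s[j - 1]))     # match or substitute
        prev = cur
    return min(prev)                 # the match may end anywhere in s

def is_generalized_superstring(s, fragments, eps):
    """True if every f (or its reverse complement) is within eps*|f| of a substring of s."""
    return all(
        min(best_substring_distance(f, s),
            best_substring_distance(reverse_complement(f), s)) <= eps * len(f)
        for f in fragments
    )

frags = ["ACCGT", "CAGTGC", "TTAC", "TACCGT"]   # second fragment has an inserted A
print(is_generalized_superstring("TTACCGTGC", frags, eps=0.2))
```

This only verifies a given s; finding the shortest such s is the hard part of the problem.
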
Multicontig
For a given set of fragments F, a contig is a multiple alignment containing either $f$ or $\bar{f}$ for every $f \in F$.
A contig has an $\epsilon$-consensus iff each fragment $f$ (or $\bar{f}$) differs from its image in the consensus by at most $\epsilon |f|$.
A contig is a t-contig if the smallest overlap that is not contained in any fragment is at least t (t is a measure of coverage).

Multicontig
Given a set of fragments F, $\epsilon > 0$, and $t > 0$, partition F into a minimum number of subsets such that each subset has a t-contig with an $\epsilon$-consensus.
This is NP-hard.
This models errors, orientations, and coverage.

Solving SCS
We are going to consider a Hamiltonian path approach to solving the SCS problem.

Overlap graph
Consider the complete directed weighted graph G = (V, E), called the overlap graph:
V = F (each fragment is a vertex)
$(u, v) \in E$ with weight $-t$ iff t is the length of the maximal suffix of u that is a prefix of v
We allow self loops and zero weight edges.

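A direct, quadratic-time way to build the overlap graph is sketched below. The names `overlap` and `overlap_graph` are illustrative, and the choice to use a proper overlap for self loops (so that a fragment does not overlap itself completely) is an assumption of this sketch.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    max_t = min(len(u), len(v))
    if u == v:
        max_t -= 1                   # assumption: a self loop uses a proper overlap
    for t in range(max_t, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def overlap_graph(fragments):
    """Complete directed graph as a dict {(u, v): weight} with weight = -overlap."""
    return {(u, v): -overlap(u, v) for u in fragments for v in fragments}

graph = overlap_graph(["TACGA", "ACCC", "CTAAAG", "GACA"])
for (u, v), w in graph.items():
    if w != 0:
        print(u, "->", v, w)
```

For the four fragments used here, only five edges have nonzero weight; every other edge, including the self loops, has weight 0.
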
Example
a = TACGA, b = ACCC, c = CTAAAG, d = GACA
Edges (0 weight edges not shown):
a → d: -2
a → b: -1
b → c: -1
c → d: -1
d → b: -1

A path defines a superstring
Every simple path P in the overlap graph involving a set of vertices (fragments) A defines a superstring s(P) for the set A. Therefore, a Hamiltonian path in the overlap graph defines a superstring for the set of fragments F. A Hamiltonian path must exist because the graph is complete (how many do we have?).

Example
a = TACGA, b = ACCC, c = CTAAAG, d = GACA
P = a d b c
s(P):
TACGA
   GACA
      ACCC
         CTAAAG
---------------
TACGACACCCTAAAG

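The way a path spells out a superstring can be written as a short routine: start with the first fragment and append, for each edge (u, v), the part of v that is not covered by the overlap. The function names are illustrative, and the sketch assumes consecutive fragments are merged by their maximal overlap.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    max_t = min(len(u), len(v))
    if u == v:
        max_t -= 1                   # assumption: proper overlap for identical strings
    for t in range(max_t, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def superstring(path):
    """The string s(P) spelled by a path P given as a sequence of fragments."""
    s = path[0]
    for u, v in zip(path, path[1:]):
        s += v[overlap(u, v):]       # append only the non-overlapping tail of v
    return s

a, b, c, d = "TACGA", "ACCC", "CTAAAG", "GACA"
print(superstring([a, d, b, c]))
```

The result has length 15: the 19 bases of the four fragments minus the total overlap of 4 along the path.
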
Does a superstring define a path?
We have seen that every Hamiltonian path corresponds to a superstring. Is the converse true?
No: a superstring can contain arbitrary characters that are not present in any fragment.
Does a shortest superstring correspond to a Hamiltonian path?
Yes, if F is substring-free, i.e. no fragment in F is contained in another.

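Example
a = AGC, b = G, c = CT
The edge a → c has weight -1; all other edges have weight 0.
The shortest superstring is AGCT.
There is no Hamiltonian path P such that s(P) = AGCT, because every Hamiltonian path must also visit b = G, which is already contained in AGC.
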
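Substring-free collection F
Let F be a substring-free set. Then for every shortest superstring s, there is a Hamiltonian path P such that s(P) = s.
Proof: assume the fragments appear in s in the order a, b, c, d, ... (there are no gaps, and no fragment can be contained in another). The overlap $t_1$ between a and b in s must be the maximal overlap between a and b, and likewise $t_2$ for b and c, and so on.
Hamiltonian path: a → b → c → d → ..., with edge weights $-t_1$, $-t_2$, ... (0 where consecutive fragments do not overlap).
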
Non substring-free F
If F is not substring-free, then we can remove from F all fragments that are substrings of other fragments.
We end up with a substring-free set F'.
But any superstring of F' is a superstring of F.
Therefore, we can use F'.

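The reduction to a substring-free set is easy to sketch. The function name `substring_free` is made up for this example, and dropping exact duplicates as well is an added assumption (two equal fragments are substrings of each other).

```python
def substring_free(fragments):
    """Remove fragments that occur inside another fragment (and exact duplicates)."""
    unique = list(dict.fromkeys(fragments))      # keep first occurrence, preserve order
    return [f for f in unique
            if not any(f != g and f in g for g in unique)]

print(substring_free(["AGC", "G", "CT"]))
```

On the earlier counterexample, G is removed because it is contained in AGC, and the remaining set has AGCT as the superstring of a Hamiltonian path.
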
Length of string vs. weight of path
Let P be a Hamiltonian path and let w(P) be the weight of P.
Let $\|F\| = \sum_{a \in F} |a|$.
Then $|s(P)| = \|F\| + w(P)$ [proof is simple].
Therefore, the shortest common superstring corresponds to the Hamiltonian path with minimum weight.

Proof
Claim: if P is a Hamiltonian path with minimum weight, then s(P) is a shortest superstring.
Suppose not, and let s' be a shortest superstring with $|s'| < |s(P)|$.
Then there is a Hamiltonian path P' such that s' = s(P').
$|s(P')| = \|F\| + w(P') < |s(P)| = \|F\| + w(P)$
Therefore $w(P') < w(P)$, a contradiction.

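For very small inputs, the correspondence between minimum weight Hamiltonian paths and shortest superstrings can be checked by brute force over all orderings of the fragments. This is an exponential-time sketch meant only as a sanity check; it assumes a substring-free set, and all names are illustrative.

```python
from itertools import permutations

def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    max_t = min(len(u), len(v))
    if u == v:
        max_t -= 1                   # assumption: proper overlap for identical strings
    for t in range(max_t, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def path_weight(path):
    """w(P): sum of the (negative) overlaps along the path."""
    return sum(-overlap(u, v) for u, v in zip(path, path[1:]))

def superstring(path):
    """s(P): merge consecutive fragments by their maximal overlap."""
    s = path[0]
    for u, v in zip(path, path[1:]):
        s += v[overlap(u, v):]
    return s

def shortest_superstring(fragments):
    """Try every Hamiltonian path (every ordering) and keep the lightest one."""
    best = min(permutations(fragments), key=path_weight)
    return superstring(best)

frags = ["ATGC", "TGCAT", "GCC"]
total = sum(len(f) for f in frags)
for p in permutations(frags):
    assert len(superstring(p)) == total + path_weight(p)   # |s(P)| = ||F|| + w(P)
print(shortest_superstring(frags))
```

With n fragments there are n! orderings, so this is only usable for a handful of fragments.
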
Hamiltonian path approach
Finding a minimum weight Hamiltonian path is NP-hard (you can reduce HAMPATH to it).
Unfortunately, we cannot expect a fundamentally more efficient exact approach, because SCS itself is NP-hard.
Let's consider a greedy algorithm for finding a Hamiltonian path.

Greedy algorithm
Greedy: start with an empty path and repeatedly add the least weighted available edge until you get a Hamiltonian path.
Every time we add an edge (u, v), we need to check that:
- (u, v) does not create a cycle with the previously added edges
- u has no previously added outgoing edge
- v has no previously added incoming edge

Greedy algorithm
    sort edges by their weight: e_1, e_2, ..., e_|E|
    for all v ∈ V
        in(v) ← 0
        out(v) ← 0
    H ← ∅
    i ← 1
    while |H| < |F| - 1
        (u, v) ← e_i
        if out(u) = 0 and in(v) = 0 then
            if H ∪ {e_i} does not contain a cycle    [disjoint set data structure]
                H ← H ∪ {e_i}
                out(u) ← 1
                in(v) ← 1
        i ← i + 1
With n = |F|:
To build the graph: $O(n \|F\|)$ (could be done optimally in $O(n^2 + \|F\|)$ using suffix trees)
To run the algorithm: $O(n^2 \log n)$

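Example
Fragments: ATGC, TGCAT, GCC
Edges: ATGC → TGCAT: -3, TGCAT → ATGC: -2, ATGC → GCC: -2 (the remaining edges have weight 0)
Greedy algorithm will choose: ATGC → TGCAT → GCC, giving ATGCATGCC
Optimal is: TGCAT → ATGC → GCC, giving TGCATGCC
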
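The greedy procedure can be sketched in Python as follows. This is one possible rendering, not the lecture's reference implementation: self loops are skipped (they always form a cycle), ties between equal-weight edges are broken by fragment index, overlaps are computed naively rather than with suffix trees, and the function names are invented.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    max_t = min(len(u), len(v))
    if u == v:
        max_t -= 1                   # assumption: proper overlap for identical strings
    for t in range(max_t, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    n = len(fragments)
    # Edges (weight, u, v) between distinct vertices, most negative weight first.
    edges = sorted((-overlap(fragments[u], fragments[v]), u, v)
                   for u in range(n) for v in range(n) if u != v)
    parent = list(range(n))          # disjoint-set forest for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    succ = {}                        # chosen outgoing edge of each vertex
    has_in = set()                   # vertices that already have an incoming edge
    for _, u, v in edges:
        if len(succ) == n - 1:       # Hamiltonian path complete
            break
        if u in succ or v in has_in or find(u) == find(v):
            continue                 # out(u) used, in(v) used, or edge closes a cycle
        succ[u] = v
        has_in.add(v)
        parent[find(u)] = find(v)

    # Walk the path from its only vertex without an incoming edge.
    u = next(v for v in range(n) if v not in has_in)
    s = fragments[u]
    while u in succ:
        v = succ[u]
        s += fragments[v][overlap(fragments[u], fragments[v]):]
        u = v
    return s

print(greedy_superstring(["ATGC", "TGCAT", "GCC"]))
```

On the example above the sketch commits to the weight -3 edge first and ends with a superstring of length 9, one base longer than the optimal TGCATGCC.
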
Sequencing By Hybridization (SBH)
Use all possible probes of length l and obtain hybridization data with the DNA. If there are no errors, we have all substrings of length l. We would like to reconstruct the DNA from those substrings. We can formalize this as SCS and solve it as before, but we can simplify a little bit.

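In the error-free case, the hybridization data is simply the set of all length-l substrings of the DNA. The sketch below simulates that data for a known sequence; the name `spectrum` is an assumption of this example, and representing the data as a set means a substring that occurs twice is recorded only once.

```python
def spectrum(dna, l):
    """All distinct substrings of length l, in sorted order."""
    return sorted({dna[i:i + l] for i in range(len(dna) - l + 1)})

print(spectrum("ATGCGTGGCA", 3))
```

The reconstruction problem is the inverse: given only this collection, recover a sequence that produces it.
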
SBH and SCS
SBH is a special case of the SCS problem where all fragments of F have the same length l.
In the overlap graph, we keep only the edges with weight $-(l - 1)$.
By construction of these fragments, we know that there must be a Hamiltonian path in this modified overlap graph.
All Hamiltonian paths now have the same weight, $-(n - 1)(l - 1)$.
Thus we only need to find a Hamiltonian path (still NP-complete).

Example
l = 3
Fragments: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Idea
Instead of representing fragments as vertices, represent them as edges. Then, instead of looking for a Hamiltonian path (a path that goes through each vertex once), look for an Euler path (a path that goes through each edge once). An Euler path can be found in linear time.

Fragments as edges
Construct a directed graph G = (V, E):
V: the fragments of length $l - 1$ (these can be obtained from our set F by considering the first and last $l - 1$ characters of each fragment)
E: a directed edge (u, v) for each fragment in F that starts with u and ends with v

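Example
l = 3
Fragments: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Vertices: AT, TG, GG, GC, CA, CG, GT
Edges: AT → TG (ATG), TG → GG (TGG), TG → GC (TGC), GT → TG (GTG), GG → GC (GGC), GC → CA (GCA), GC → CG (GCG), CG → GT (CGT)
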
Euler Cycle
By construction of the fragments, we know that the graph will have all vertices balanced except possibly for two unbalanced vertices (each occurrence of an $(l - 1)$-length fragment is shared by two fragments of length l, except possibly for the first and last one).
By adding an edge between the two unbalanced vertices we can make the graph balanced.
Then we can find an Euler cycle in the graph (since it is balanced, there is one).

Example
l = 3
Fragments: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Euler path: AT → TG → GC → CG → GT → TG → GG → GC → CA
Reconstructed sequence: ATGCGTGGCA

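The fragments-as-edges construction and the Euler traversal can be combined into one short routine. Note the substitution: instead of adding an edge to balance the graph and finding an Euler cycle, this sketch starts Hierholzer's algorithm at the vertex with one extra outgoing edge and finds an Euler path directly. It assumes error-free data in which such a path exists; the function name `reconstruct` is illustrative.

```python
from collections import defaultdict

def reconstruct(fragments):
    """Rebuild a sequence from its length-l fragments via an Euler path."""
    graph = defaultdict(list)        # (l-1)-mer -> list of successor (l-1)-mers
    balance = defaultdict(int)       # out-degree minus in-degree
    for f in fragments:
        u, v = f[:-1], f[1:]         # each fragment is an edge from its prefix to its suffix
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one;
    # otherwise the graph is balanced and any vertex with an edge will do.
    start = next((u for u, b in balance.items() if b == 1), fragments[0][:-1])
    stack, path = [start], []
    while stack:                     # iterative Hierholzer traversal
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Spell the sequence: first vertex, then the last character of each later vertex.
    return path[0] + "".join(v[-1] for v in path[1:])

frags = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(reconstruct(frags))
```

The graph in this example has two Euler paths (spelling ATGCGTGGCA and ATGGCGTGCA), so the reconstruction is not unique; which one is returned depends on the order in which edges are visited.
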