Lecture 4: DNA Restriction Mapping

Similar documents
8/29/13 Comp 555 Fall

9/2/2009 Comp /Comp Fall

Lecture 4: DNA Restriction Mapping

An Algorithmic Problem in Molecular Biology Partial Digest Problem

Math for Liberal Studies

Lecture 3 : Probability II. Jonathan Marchini

Physical Mapping Restriction Mapping

Bioinformatics Algorithms. Physical Mapping Restriction Mapping

Set Notation and Axioms of Probability NOT NOT X = X = X'' = X

Lecture 2 31 Jan Logistics: see piazza site for bootcamps, ps0, bashprob

Do in calculator, too.

Chapter 1 Problem Solving: Strategies and Principles

10.1. Randomness and Probability. Investigation: Flip a Coin EXAMPLE A CONDENSED LESSON

Discrete Probability Distributions

Finding extremal voting games via integer linear programming

Exhaustive search. CS 466 Saurabh Sinha

Discrete Probability Distributions

12.1. Randomness and Probability

discrete probability distributions

Partial restriction digest

Mathematics for large scale tensor computations

arxiv: v1 [math.ra] 11 Aug 2010

CS 484 Data Mining. Association Rule Mining 2

secretsaremadetobefoundoutwithtime UGETGVUCTGOCFGVQDGHQWPFQWVYKVJVKOG Breaking the Code

MATH 1823 Honors Calculus I Permutations, Selections, the Binomial Theorem

Stat 198 for 134 Instructor: Mike Leong

BHP BILLITON UNIVERSITY OF MELBOURNE SCHOOL MATHEMATICS COMPETITION, 2003: INTERMEDIATE DIVISION

Planck blackbody function

Locally Connected Recurrent Networks. Lai-Wan CHAN and Evan Fung-Yu YOUNG. Computer Science Department, The Chinese University of Hong Kong

Physical Mapping. Restriction Mapping. Lecture 12. A physical map of a DNA tells the location of precisely defined sequences along the molecule.

Dragan Mašulović. The Discrete Charm of Discrete Mathematics

Self Organization and Coordination

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Identities relating the Jordan product and the associator in the free nonassociative algebra

Executive Assessment. Executive Assessment Math Review. Section 1.0, Arithmetic, includes the following topics:

arxiv: v1 [physics.chem-ph] 1 May 2018

Data Mining Concepts & Techniques

Dynamic Programming( Weighted Interval Scheduling)

ALGEBRAS, DIALGEBRAS, AND POLYNOMIAL IDENTITIES. Murray R. Bremner

Dragan Mašulović. Introduction to Discrete Mathematics

arxiv: v1 [q-bio.pe] 27 Oct 2011

Key-recovery Attacks on Various RO PUF Constructions via Helper Data Manipulation

Fractional designs and blocking.

How many hours would you estimate that you spent on this assignment?

Lecture Notes for Chapter 6. Introduction to Data Mining

Key-recovery Attacks on Various RO PUF Constructions via Helper Data Manipulation

Mathematical Background

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

Lecture 2: Divide and conquer and Dynamic programming

COMP 5331: Knowledge Discovery and Data Mining

Lecture 6 & 7. Shuanglin Shao. September 16th and 18th, 2013

Discrete Mathematics & Mathematical Reasoning Chapter 6: Counting

Dynamic Programming: Interval Scheduling and Knapsack

Lecture 10: Powers of Matrices, Difference Equations

ORF 245 Fundamentals of Engineering Statistics. Midterm Exam 1

ST3232: Design and Analysis of Experiments

Some Structural Properties of AG-Groups

2.1 Computational Tractability. Chapter 2. Basics of Algorithm Analysis. Computational Tractability. Polynomial-Time

If the objects are replaced there are n choices each time yielding n r ways. n C r and in the textbook by g(n, r).

DATA MINING - 1DL360

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019

Mathematical Background. e x2. log k. a+b a + b. Carlos Moreno uwaterloo.ca EIT e π i 1 = 0.

1 Quick Sort LECTURE 7. OHSU/OGI (Winter 2009) ANALYSIS AND DESIGN OF ALGORITHMS

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Catie Baker Spring 2015

Elementary Linear Algebra

Chapter 2: The Basics. slides 2017, David Doty ECS 220: Theory of Computation based on The Nature of Computation by Moore and Mertens

Week 5: Quicksort, Lower bound, Greedy

Lecture 4: Counting, Pigeonhole Principle, Permutations, Combinations Lecturer: Lale Özkahya

221B Lecture Notes Many-Body Problems I

Mathematical Induction

O Notation (Big Oh) We want to give an upper bound on the amount of time it takes to solve a problem.

The Integers. Peter J. Kahn

Computational Statistics and Data Analysis. Mixtures of weighted distance-based models for ranking data with applications in political studies

DATA MINING - 1DL360

On r-dynamic Coloring of Grids

Bargaining, Coalitions and Externalities: A Comment on Maskin

9A. Iteration with range. Topics: Using for with range Summation Computing Min s Functions and for-loops A Graphics Applications

The Integers. Math 3040: Spring Contents 1. The Basic Construction 1 2. Adding integers 4 3. Ordering integers Multiplying integers 12

3.4 Introduction to power series

Divide and Conquer Problem Solving Method

Copyright 2000, Kevin Wayne 1

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

14 Higher order forms; divergence theorem

Condorcet Domains, Median Graphs and the Single Crossing Property

Properties of Consensus Methods for Inferring Species Trees from Gene Trees

COMP 355 Advanced Algorithms

MODEL ANSWERS TO THE FIRST QUIZ. 1. (18pts) (i) Give the definition of a m n matrix. A m n matrix with entries in a field F is a function

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

Intractable Problems Part Two

x + 2y + 3z = 8 x + 3y = 7 x + 2z = 3

Introduction to Computer Science and Programming for Astronomers

Lecture: Expanders, in theory and in practice (2 of 2)

Introduction and basic definitions

arxiv: v1 [math-ph] 18 May 2017

14 Classic Counting Problems in Combinatorics

Supplementary Material for MTH 299 Online Edition

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 2

COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background

CS 231: Algorithmic Problem Solving

Transcription:

Lecture 4: DNA Restriction Mapping Study Chapter 4.1-4.3 9/3/2013 COMP 465 Fall 2013 1

Recall Restriction Enzymes (from Lecture 2) Restriction enzymes break DNA whenever they encounter specific base sequences They occur reasonably frequently within long sequences (a 6-base sequence target appears, on average, 1:4096 bases) Can be used as molecular scissors EcoRI cggtacgtggtggtg aattctgtaagccgattccgcttcggggag aattccatgccatcatgggcgttgc gccatgcaccaccacttaa gacattcggctaaggcgaagcccctcttaa ggtacggtagtacccgcaacg EcoRI 9/3/2013 COMP 465 Fall 2013 2

Restriction Enzyme Uses Recombinant DNA technology make novel DNA constructs, add fluorophores add other probes Digesting DNA into pieces that can be efficiently and reliably replicated through PCR (Polymerase Chain Reaction) Cutting DNA for genotyping via Microarrays Sequence Cloning Inserting sequences into a host cell, via vectors cdna/genomic library construction Coding DNA, is a byproduct of transcription Targeted sequencing (ex. RRBS) DNA restriction mapping A rough map of a DNA fragment 9/3/2013 COMP 465 Fall 2013 3

DNA Restriction Maps A map of the restriction sites in a DNA sequence If the DNA sequence is known, then constructing a restriction map is trivial Restriction maps are a cheap alternative to sequencing for unknown sequences 9/3/2013 COMP 465 Fall 2013 4

Consider the DNA Mapping Problem Begin with an isolated strand of DNA Digest it with restriction enzymes Breaks strand it in variable length fragments Use gel electrophoresis to sort fragments according to size Can accurately sort DNA fragments that differ in length by a single nucleotide, and estimate their relative abundance Use fragment lengths to reassemble a map of the original strand Smaller fragments move farther 9/3/2013 COMP 465 Fall 2013 5

Single Enzyme Digestion What can be learned from a single complete digest? 0 1 3 7 12 1 2 4 5 Not much. There are many possible answers 0 4 5 10 12 0 2 6 11 12 0 5 6 8 12 9/3/2013 COMP 465 Fall 2013 6

Double Enzyme Digestion An alternative approach is to digest with two different enzymes in three stages First, with restriction enzyme A Second, with restriction enzyme B Third, with both enzymes, A & B 0 4 5 10 12 0 3 9 12 0 3 4 5 9 10 12 3 1 1 4 1 2 The inputs are three sets of restriction fragment lengths [1,2,4,5], [3,3,6], [1,1,1,2,3,4] 9/3/2013 COMP 465 Fall 2013 7

Double Digest Problem Given two sets of intervals on a common line segment between two disjoint interior point sets, and a third set of intervals between all points, reconstruct the positions of the points. Input: da fragment lengths from the digest with enzyme A. db fragment lengths from the digest with enzyme B. dx fragment lengths from the digest with both A and B. Output: A location of the cuts for the enzyme A. B location of the cuts for the enzyme B. 9/3/2013 COMP 465 Fall 2013 8

Class Exercise Suppose you are asked to assemble a map from three digests A = [1,2,3] B = [2,4] AB = [1,1,2,2] How do you solve for the map? How do you state your strategy as a general purpose algorithm? 9/3/2013 COMP 465 Fall 2013 9

Set Permutations Given a set [A,B,C,D] find all permutations [A,B,C,D] [A,B,D,C] [A,C,B,D] [A,C,D,B] [A,D,B,C] [A,D,C,B] [B,A,C,D] [B,A,D,C] [B,C,A,D] [B,C,D,A] [B,D,A,C] [B,D,C,A] [C,A,B,D] [C,A,D,B] [C,B,A,D] [C,B,D,A] [C,D,A,B] [C,D,B,A] [D,A,B,C] [D,A,C,B] [D,B,A,C] [D,B,C,A] [D,C,A,B] [D,C,B,A] How many? 1 st choice = n 2 nd choice = n-1 3 rd choice = n-2 N! permutations of N elements 10! = 3628800 24! = 620448401733239439360000 9/3/2013 COMP 465 Fall 2013 10

A Brute Force Solution Test all permutations of A and B checking they are compatible with some permuation of AB def doubledigest(seta, setb, setab, circular = False): a = Permute(seta) while (a.permutationsremain()): ab = Permute(setab) while (ab.permutationsremain()): if compatible(a.order, ab.order): b = Permute(setb) while (b.permutationsremain()): if (circular): for i in xrange(len(setab)): abshift = shift(ab.order, i) len(b)! if compatible(b.order, abshift): return (a.order, b.order, ab.order, i) else: len(ab)! if compatible(b.order, ab.order): return (a.order, b.order, ab.order, 0) return (astate, bstate, abstate, -1) len(a)! 9/3/2013 COMP 465 Fall 2013 11

How to Improve Performance? What strategy can we use to solve the double restriction map problem faster? Is there a branch-and-bound strategy? Does the given code *really* test every permutation? How does compatible( ) help? Does the order of the loops help? Could you do all permutations of A and B, then compute the intervals and compare to AB? The double digest problem is truly a hard problem (NPcomplete). No one knows an algorithm whose execution time does not grow slower than some exponent in the size of the inputs. If one is found, then an entire set of problems will suddenly also be solvable in less than exponential time. 9/3/2013 COMP 465 Fall 2013 12

Partial Digestion Problem Another way to construct a restriction map Expose DNA to the restriction enzyme for a limited amount of time to prevent it from cutting at all restriction sites (partial digestion) Generates the set of all possible restriction fragments between every pair of (not necessarily consecutive) points The set of fragment sizes is used to determine the positions of the restriction sites We assume that the multiplicity of a repeated fragment can be determined, i.e., multiple restriction fragments of the same length can be determined (e.g., by observing twice as much fluorescence for a double fragment than for a single fragment) 9/3/2013 COMP 465 Fall 2013 13

Partial Digestion Illustration A complete set of pairwise distances between points. In the following example a set of 10 fragments is generated. L = {3, 5, 5, 8, 9, 14, 14, 17, 19, 22} 9/3/2013 COMP 465 Fall 2013 14

Pairwise Distance Matrix Often useful to consider 0 5 14 19 22 partial digests in a distance matrix form 0-5 14 19 22 Each entry is the distance 5-9 14 17 between a pair of point 14-5 8 positions labeled on the 19-3 rows and columns 22 - The distance matrix for n points has n(n-1)/2 entries, therefore we expect that many digest values as inputs Largest value in L establishes the segment length Actual non-zero point values are a subset of L 9/3/2013 COMP 465 Fall 2013 15

Partial Digest Problem Given all pairwise distances between points on a line, reconstruct the positions of those points. Input: A multiset of pairwise distances L, n( n 1) containing elements 2 Output: A set X, of n integers, such that the set of pairwise distances X = L 9/3/2013 COMP 465 Fall 2013 16

Homometric Solutions 0 1 3 4 5 7 12 13 15 0 1 3 4 5 7 12 13 15 1 2 3 4 6 11 12 14 3 1 2 4 9 10 12 4 1 3 8 9 11 5 2 7 8 10 7 5 6 8 12 1 3 13 2 15 0 1 3 8 9 11 12 13 15 0 1 3 8 9 11 12 13 15 1 2 7 8 10 11 12 14 3 5 6 8 9 10 12 8 1 3 4 5 7 9 2 3 4 6 11 1 2 4 12 1 3 13 2 15 The solution of a PDP is not always unique Two distinct point sets, A and B, can lead to indistinguishable distance multisets, A = B 9/3/2013 COMP 465 Fall 2013 17

Brute Force PDP Algorithm Basic idea: Construct all combinations of n - 2 integers between 0 and max(l), and check to see if the pairwise distances match. def bruteforcepdp(l, n): L.sort() M = max(l) X = intsbetween(0,m,n-2) while (X.combinationsRemain()): dx = allpairsdist(x.intset()) dx.sort() if (dx == L): print "X =", X.intSet() Compare this Python code to the pseudocode on page 88 in the book 9/3/2013 COMP 465 Fall 2013 18

Set Combinations Combinations of A things taken B at a time Order is unimportant [A,B,C] [A,C,B] [B,A,C] [B,C,A] [C,A,B] [C,B,A] All combinations of n items in k positions [1,1,0,0], [1,0,1,0],[1,0,0,1],[0,1,1,0],[0,1,0,1],[0,0,1,1] Smaller than a factorial Interesting relation n k n k 0 n! k!( n k)! 9/3/2013 COMP 465 Fall 2013 19 n k 2 n

BruteForcePDP Performance BruteForcePDP takes O(max(L) n-2 ) time since it must examine all possible sets of positions. The problem scales with the size of the largest pairwise distance Suppose we multiply each element in L by a constant factor? Should we consider every possible combination of n - 2 points? (Consider our observations concerning distance matrices) 9/3/2013 COMP 465 Fall 2013 20

Another Brute Force PDP Approach Recall that the actual point values are a subset of L s values. Thus, rather than consider all combinations of possible points, we need only consider n 2 combinations of values from L. def anotherbruteforcepdp(l, n): L.sort() M = max(l) X = intsfroml(l,n-2) while (X.combinationsRemain()): dx = allpairsdist(x.intset()) dx.sort() if (dx == L): print "X = ", X.intSet() Compare this Python code to the pseudocode on page 88 in the book 9/3/2013 COMP 465 Fall 2013 21

Efficiency of AnotherBruteForcePDP It s more efficient, but still slow If L = {2, 998, 1000} (n = 3, M = 1000), BruteForcePDP will be extremely slow, but AnotherBruteForcePDP will be quite fast Fewer sets are examined, but runtime is still exponential: O(n 2n-4 ) Is there a better way? 9/3/2013 COMP 465 Fall 2013 22

A Practical PDP Algorithm 1. Begin with X = {0} 2. Remove the largest element in L and place it in X 3. See if the element fits on the right or left side of the restriction map 4. When it fits, find the other lengths it creates and remove those from L 5. Go back to step 3 until L is empty 9/3/2013 COMP 465 Fall 2013 23

Defining delta(y, X) Before describing PartialDigest, we first define a helper function: delta(y, X) as the multiset of all distances between point y and the points in the set X delta(y, X) = { y x 1, y x 2,, y x n } ex. [3,6,11] = delta(8,[5,14,19]) 9/3/2013 COMP 465 Fall 2013 24

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0 } 9/3/2013 COMP 465 Fall 2013 25

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0 } Remove 10 from L and insert it into X. We know this must be the total length of the DNA sequence because it is the largest fragment. 9/3/2013 COMP 465 Fall 2013 26

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 10 } 9/3/2013 COMP 465 Fall 2013 27

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 10 } Remove 8 from L and make y = 2 or 8. But since the two cases are symmetric, we can assume y = 2. 9/3/2013 COMP 465 Fall 2013 28

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 10 } Find the distances from y = 2 to other elements in X. D(y, X) = {8, 2}, so we remove {8, 2} from L and add 2 to X. 9/3/2013 COMP 465 Fall 2013 29

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } 9/3/2013 COMP 465 Fall 2013 30

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } Next, remove 7 from L and make y = 7 or y = 10 7 = 3. We explore y = 7 first, so delta(y, X ) = {7, 5, 3}. 9/3/2013 COMP 465 Fall 2013 31

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } For y = 7 first, delta(y, X ) = {7, 5, 3}. Therefore, we remove {7, 5,3} from L and add 7 to X. D(y, X) = {7, 5, 3} = { 7 0, 7 2, 7 10 } 9/3/2013 COMP 465 Fall 2013 32

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } 9/3/2013 COMP 465 Fall 2013 33

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } Next, take 6 from L and make y = 6. Unfortunately, delta(y, X) = {6, 4, 1,4}, which is not a subset of L. Therefore, we won t explore this branch. 6 9/3/2013 COMP 465 Fall 2013 34

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } This time make y = 4. delta(y, X) = {4, 2, 3,6}, which is a subset of L, so we explore this branch. We remove {4, 2, 3,6} from L and add 4 to X. 9/3/2013 COMP 465 Fall 2013 35

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 4, 7, 10 } 9/3/2013 COMP 465 Fall 2013 36

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 4, 7, 10 } L is now empty, so we have a solution, which is X. 9/3/2013 COMP 465 Fall 2013 37

An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } To find other solutions, we backtrack (remove old insertions and try different ones). 9/3/2013 COMP 465 Fall 2013 38

Implementation def partialdigest(l): width = max(l) L.remove(width) X = [0, width] Place(L, X) def Place(L, X): if (len(l) == 0): print X return y = max(l) dyx = delta(y, X) if (dyx.subset(l)): X.append(y); map(l.remove, dyx.items) Place(L, X) X.remove(y); map(l.append, dyx.items) w = max(x) - y dwx = delta(w, X) if (dwx.subset(l)): X.append(w); map(l.remove, dwx.items) Place(L,X) X.remove(w); map(l.append, dwx.items) return Checks distances from the 0 end Checks distances from the width end This PDP algorithm outputs all solutions. In fact, it might even repeat solutions 9/3/2013 COMP 465 Fall 2013 39

Analysis Let T(n) be the maximum time that partialdigest takes to solve an n-point instance of PDP If, at every step, there is only one viable solution, then partialdigest reduces the size of the problem by one on each recursive call T(n) = T(n-1) + O(n) O(n 2 ) However, if there are two alternatives then T(n) = 2T(n-1) + O(n) O(2 n ) 9/3/2013 COMP 465 Fall 2013 40

Comments & Next Time In the book there is a reference to a polynomial algorithm for solving PDP (pg. 115). The authors of this paper have since posted a clarification that their solution does not suggest a polynomial algorithm. Therefore, the complexity of the PDP is still unknown. Next Time: More Exhaustive Search problems Next Time: The Motif Finding Problem 9/3/2013 COMP 465 Fall 2013 41