Dynamic programming in less time and space

Similar documents
Hidden Markov Models

Approximate Matching

Analysis and Design of Algorithms Dynamic Programming

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Lecture 2: Pairwise Alignment. CG Ron Shamir

Bioinformatics and BLAST

Bioinformatics I, WS 16/17, D. Huson, January 30,

CS473 - Algorithms I

Sequence Comparison. mouse human

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

Sequence analysis and Genomics

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Integer Sorting on the word-ram

Linear-Space Alignment

CSE370 HW3 Solutions (Winter 2010)

Lecture 4: September 19

Elementary Algebra Study Guide Some Basic Facts This section will cover the following topics

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

Module 9: Tries and String Matching

Great Theoretical Ideas in Computer Science

String Matching Problem

Thanks. You Might Also Like. I look forward helping you focus your instruction and save time prepping.

Samson Zhou. Pattern Matching over Noisy Data Streams

Implementing Approximate Regularities

ECE 571 Advanced Microprocessor-Based Design Lecture 10

Alignment Algorithms. Alignment Algorithms

Algorithm Design and Analysis

Pairwise sequence alignment

More Dynamic Programming

Searching Sear ( Sub- (Sub )Strings Ulf Leser

On Pattern Matching With Swaps

More Dynamic Programming

Experiment 1: The Same or Not The Same?

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem

GAMINGRE 8/1/ of 7

On-line String Matching in Highly Similar DNA Sequences

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

convert a two s complement number back into a recognizable magnitude.

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Dynamic Programming. Prof. S.J. Soni

Linear Programming and its Extensions Prof. Prabha Shrama Department of Mathematics and Statistics Indian Institute of Technology, Kanpur

CUTTING Shadows SOME CUTOUT DIALS FOR YOUR AMUSEMENT. ALL YOU NEED ARE SOME SCISSORS, AND YOUR LATITUDE, AND LONGITUDE.

Hidden Markov Models. Three classic HMM problems

EECS730: Introduction to Bioinformatics

2. Exact String Matching

6.6 Sequence Alignment

A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem

MEAM 520. More Velocity Kinematics

arxiv: v1 [cs.ds] 9 Apr 2018

Algorithms (II) Yu Yu. Shanghai Jiaotong University

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence

Local Alignment: Smith-Waterman algorithm

Whole Genome Alignments and Synteny Maps

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Computing & Telecommunications Services

Efficient High-Similarity String Comparison: The Waterfall Algorithm

Computing & Telecommunications Services Monthly Report January CaTS Help Desk. Wright State University (937)

4-2: Organizing the Elements. 8 th Grade Physical Sciences

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

Performance, Power & Energy

Table of Contents. Table of Contents Converting lattices: Rhombohedral to hexagonal and back

CMPSCI 311: Introduction to Algorithms Second Midterm Exam

Tree Building Activity

Efficient Approximation of Large LCS in Strings Over Not Small Alphabet

Lecture 2e Row Echelon Form (pages 73-74)

Creating a Bathymetric Database & Datum Conversion

S95 INCOME-TESTED ASSISTANCE RECONCILIATION WORKSHEET (V3.1MF)

Forces and motion 3: Friction

MITOCW watch?v=fkfsmwatddy

Lecture 13. More dynamic programming! Longest Common Subsequences, Knapsack, and (if time) independent sets in trees.

Introduction to Sequence Alignment. Manpreet S. Katari

36 What is Linear Algebra?

Remainders. We learned how to multiply and divide in elementary

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Genome 373: Hidden Markov Models II. Doug Fowler

Journal of Theoretical and Applied Information Technology 30 th November Vol.96. No ongoing JATIT & LLS QUERYING RDF DATA

Parallelism in Computer Arithmetic: A Historical Perspective

WeatherHawk Weather Station Protocol

ACTIVITY 2: Motion and Energy

WHEN IS IT EVER GOING TO RAIN? Table of Average Annual Rainfall and Rainfall For Selected Arizona Cities

Algorithms for Molecular Biology

! Where are we on course map? ! What we did in lab last week. " How it relates to this week. ! Compression. " What is it, examples, classifications

EECS730: Introduction to Bioinformatics

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Structure of Materials Prof. Anandh Subramaniam Department of Material Science and Engineering Indian Institute of Technology, Kanpur

Lecture 1. ASTR 111 Section 002 Introductory Astronomy Solar System. Dr. Weigel. Outline. Course Overview Topics. Course Overview General Information

Note: k = the # of conditions n = # of data points in a condition N = total # of data points

You Might Also Like. I look forward helping you focus your instruction while saving tons of time. Kesler Science Station Lab Activities 40%+ Savings!

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak

NP-Completeness I. Lecture Overview Introduction: Reduction and Expressiveness

CSC 1700 Analysis of Algorithms: Warshall s and Floyd s algorithms

Algebra II Polynomials: Operations and Functions

Unit 8: Sequ. ential Circuits

Tandem Mass Spectrometry: Generating function, alignment and assembly

arxiv: v2 [cs.ds] 16 Mar 2015

Descriptive Statistics-I. Dr Mahmoud Alhussami

Transcription:

Dynamic programming in less time and space Ben Langmead Department of omputer Science You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me briefly how you re using them. For original Keynote files, email me.

Space usage revisited We said the fill step requires O(mn) space ϵ T A T G T A T G ϵ 0 8 16 24 32 40 48 56 64 72 80 T 8 0 8 16 24 32 40 48 56 64 72 A 16 8 0 8 16 24 32 40 48 56 64 24 16 8 2 10 18 24 32 40 48 56 G 32 24 16 10 2 10 18 26 34 40 48 T 40 32 24 16 10 2 10 18 26 34 42 48 40 32 24 18 10 2 10 18 26 34 A 56 48 40 32 26 18 10 2 10 18 26 G 64 56 48 40 32 26 18 10 6 10 18 72 64 56 48 40 34 26 18 12 10 10 an we do better? Assume we re only interested in cost / score in lower righthand cell

Space usage revisited ϵ T A T G T A T G ϵ 0 8 16 24 32 40 48 56 64 72 80 T 8 0 8 16 24 32 40 48 56 64 72 A G T A G

Space usage revisited Idea: just store current and previous rows. Discard older rows as we go. (Likewise for columns or antidiagonals.) ϵ T A T G T A T G ϵ 0 8 16 24 32 40 48 56 64 72 80 T 8 0 8 16 24 32 40 48 56 64 72 A? G T Discard this row......once we begin this row Only keeping O(1) rows at a time, space bound becomes O(min(n, m)) -- linear space A G

Space usage revisited Idea: just store current and previous rows. Discard older rows as we go. (Likewise for columns or antidiagonals.) ϵ T A T G T A T G ϵ T A G T A G 64 56 48 40 32 26 18 10 6 10 18 72 64 56 48 40 34 26 18 12 10 10 We get desired value / score, by looking in the lower right cell (global alignment)

Space usage revisited More savings: discard elements as soon as they re no longer needed ϵ T A T G T A T G Discard this element ϵ T 24 32 40 48 56 64 72 A 16 8 0 8 16 24 Once we fill this element G T A G

Space usage revisited an we get both the optimal score and the alignment in linear space? For global alignment, we can...

Space usage revisited: subdividing matrix Assume global alignment. Idea: Split matrix into left half, filled as usual, and right half, filled backwards. In both halves, only store current, previous columns. Find V[i, j]s for successively longer prefixes of x and y V[i, j]s V [i, j]s Find V [i, j]s for successively longer suffixes of x and y

Space usage revisited: subdividing matrix ]s V [i, j]s 5 4 5 5 4 6 2 6 4 3 5 2 1 3 1 4 4 1 5 4 5 5 3 6 4 7 4 1 2 3 4 3 5 3 5 3 After fill, we have the center 2 columns Given nearby cells from each column, we can calculate value for optimal global alignment passing through those cells Opimal such value indicates where alignment crosses the center Repeat recursively to solve the entire problem, including backtrace, in O(mn) time and linear space Hirschberg s algorithm See Gusfield 12.1

Data parallelism: SIMD operations SIMD: Single Instruction, Multple Data A SIMD operation performs several operations at once on vectors of operands One instruction on a modern PU can add two vectors of 8 16-bit numbers quickly: http://www.coins-project.org/international/oinsdoc.en/simd/simd.html 134 45 14 73 86 782 67 36 + 7 952 65 33 6 56 5 3 = 141 997 79 106 92 838 72 39

Data parallelism an we take advantage of these operations when filling the matrix? - T A T G T A T G - 0 8 16 24 32 40 48 56 64 72 80 T 8 A 16 24 G 32 T 40 48 A 56 G 64 72

Data parallelism Yes, dynamic programming has data parallelism E.g. cells in red are calculated in the same way: different inputs but same operations. None depend on the others. Things we do when filling in a cell: add, max, etc, can be packed into vectors and done for many cells in parallel: - T A T G T A T G - 0 8 16 24 32 40 48 56 64 72 80 T 8 0 8 16 24 32 A 16 8 0 8 16 24 16 8 2 G 32 24 16 T 40 32 48 A 56 G 64 72

Data parallelism Variations on this idea are quite practical and used a lot in practice A Wozniak A. Using video-oriented instructions to speed up sequence comparison. omput Appl Biosci. 1997 Apr;13(2):145-50. B Rognes T, Seeberg E. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000 Aug;16(8):699-706. Farrar M. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics. 2007 Jan 15;23(2): 156-61. D Rognes T. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BM Bioinformatics. 2011 Jun 1;12:221. Figure from: Rognes T. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BM Bioinformatics. 2011 Jun 1;12:221.

Dynamic programming summary Edit distance is harder to calculate than Hamming distance, but there is a O(mn) time dynamic progamming algorithm Global alignment generalizes edit distange to use a cost function Slight tweaks to global alignment turn it into an algorithm for: Longest ommon Subsequence Finding approximate occurrences of P in T Local alignment also has a O(mn)-time dynamic programming solution Further efficiencies are possible: If no alignment is needed, global/local alignment can be made linear-space Even if alignment is needed, global alignment can be made linear-space with Hirschberg SIMD instructions can fill in chunks of cells at a time More ideas: http://en.wikipedia.org/wiki/smith-waterman_algorithm#accelerated_versions