Alignment of Long Sequences. BMI/CS Spring 2016 Anthony Gitter


Alignment of Long Sequences
BMI/CS 776
www.biostat.wisc.edu/bmi776/
Spring 2016
Anthony Gitter
gitter@biostat.wisc.edu

Goals for Lecture
Key concepts:
- how large-scale alignment differs from the simple case
- the canonical three-step approach of large-scale aligners
- using suffix trees to find maximal unique matching subsequences (MUMs)
If time permits:
- using tries and threaded tries to find alignment seeds
- constrained dynamic programming to align between/around anchors
- using sparse dynamic programming (DP) to find a chain of local alignments

Pairwise Large-Scale Alignment: Task Definition
Given:
- a pair of large-scale sequences (e.g. chromosomes)
- a method for scoring the alignment (e.g. substitution matrices, insertion/deletion parameters)
Do:
- construct a global alignment: identify all matching positions between the two sequences

Large-Scale Alignment Example: Mouse Chr6 vs. Human Chr12
Figure from: Delcher et al., Nucleic Acids Research 27, 1999

Why the Problem is Challenging
- Sequences are too big to make O(n^2) dynamic-programming methods practical
- Long sequences are less likely to be colinear because of rearrangements
  - initially we'll assume colinearity
  - we'll consider rearrangements in the next lecture

General Strategy
Figure from: Brudno et al., Genome Research, 2003
1. perform pattern matching to find seeds for global alignment
2. find a good chain of anchors
3. fill in the remainder with a standard but constrained alignment method

The MUMmer System
Delcher et al., Nucleic Acids Research, 1999
Given: genomes A and B
1. find all maximal unique matching subsequences (MUMs)
2. extract the longest possible set of matches that occur in the same order in both genomes
3. close the gaps

Step 1: Finding Seeds in MUMmer
Maximal unique match:
- occurs exactly once in both genomes A and B
- not contained in any longer MUM
[Figure: a MUM bounded by mismatches]
Key insight: a significantly long MUM is certain to be part of the global alignment

Suffix Trees
Substring problem:
- given text S of length m
- preprocess S in O(m) time
- such that, given a query string Q of length n, we can find an occurrence (if any) of Q in S in O(n) time
Suffix trees solve this problem and others

Suffix Tree Definition
A suffix tree T for a string S of length m is a tree with the following properties:
- rooted and directed
- m leaves, labeled 1 to m
- each edge labeled by a substring of S
- concatenation of edge labels on the path from the root to leaf i is suffix i of S (we will denote this by S_i...m); this is the key property
- each internal non-root node has at least two children
- edges out of a node must begin with different characters

Suffixes
S = banana$
Suffixes of S:
$
a$
na$
ana$
nana$
anana$
banana$

Suffix Tree Example
S = banana$
Add '$' to the end so that the suffix tree exists (no suffix is a prefix of another suffix)
[Figure: suffix tree for banana$; the seven leaves are labeled 1 to 7 by the start position of the suffix they represent]

Solving the Substring Problem
Assume we have suffix tree T
FindMatch(Q, T):
- follow the (unique) path down from the root of T according to the characters in Q
- if all of Q is found to be a prefix of such a path, return the label of some leaf below this path
- else, return no match found

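Solving the Substring Problem
[Figure: two walks through the suffix tree for banana$]
- Q = nan: the walk consumes all of Q and stops partway along an edge; the leaf below is 3, so return 3
- Q = anab: the walk fails at the final character b, so return no match found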
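The substring lookup described above can be sketched in a few lines. This is a minimal illustration, not a real suffix tree: it builds an uncompressed suffix trie from nested dictionaries, which takes O(m^2) time and space instead of the O(m) of a true suffix tree. The function names `build_suffix_trie` and `find_match` and the `"leaf"` key are assumptions made for this sketch.

```python
def build_suffix_trie(s):
    """Naive suffix trie: one root-to-leaf path per suffix of s.

    Assumes s ends with a unique terminator such as '$', so that no
    suffix is a prefix of another suffix.
    """
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based suffix index, as on the slides
    return root


def find_match(q, trie):
    """Return the label of some leaf below the path spelled by q, or None."""
    node = trie
    for ch in q:
        if ch not in node:
            return None  # fell off the tree: q is not a substring
        node = node[ch]
    # All of q was matched; walk down to any leaf below this point.
    while "leaf" not in node:
        node = next(iter(node.values()))
    return node["leaf"]


trie = build_suffix_trie("banana$")
print(find_match("nan", trie))   # 3
print(find_match("anab", trie))  # None
```

The query step costs O(n) character comparisons plus the walk down to a leaf; a compressed suffix tree that stores a leaf label at each internal node would avoid that final walk.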

MUMs and Generalized Suffix Trees
- Build one suffix tree for both genomes A and B
- Label each leaf node with the genome it represents
Genome A: ccacg#
Genome B: cct$
[Figure: generalized suffix tree for A and B; each internal node represents a repeated sequence, and each leaf represents a suffix and its position in its sequence, e.g. (A, 1) or (B, 1)]

MUMs and Suffix Trees
- Unique match: an internal node with 2 children, which are leaf nodes from different genomes
- But these matches are not necessarily maximal
Genome A: ccacg#
Genome B: cct$
[Figure: the same generalized suffix tree; the internal node whose two children are the leaves (A, 1) and (B, 1) represents a unique match]

MUMs and Suffix Trees
- To identify maximal matches, we can compare the suffixes following unique match nodes
Genome A: acat#
Genome B: acaa$
[Figure: generalized suffix tree for A and B; the suffixes following two of the unique match nodes are the same, and the left one represents the longer match (aca)]

Using Suffix Trees to Find MUMs
- O(n) time to construct the suffix tree for both sequences (of lengths ≤ n)
- O(n) time to find MUMs: one scan of the tree (which is O(n) in size)
- O(n) possible MUMs, in contrast to O(n^2) possible exact matches
- Main parameter of the approach: length of the shortest MUM that should be identified (20 to 50 bases)
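To make the definition of a MUM concrete, here is a brute-force sketch that checks the two conditions directly: the match cannot be extended in either direction, and it occurs exactly once in each sequence. It is far slower than the linear-time suffix-tree scan described above and is only meant for tiny inputs. The function names and the `min_len` parameter (standing in for the minimum MUM length) are assumptions for illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=1):
    """Brute-force maximal unique matches between strings a and b.

    Returns (position in a, position in b, matched string) triples,
    with 1-based positions.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            s = a[i:i + k]
            unique = count_occurrences(a, s) == 1 and count_occurrences(b, s) == 1
            if k >= min_len and unique:
                mums.append((i + 1, j + 1, s))
    return mums


print(find_mums("acat", "acaa"))  # [(1, 1, 'aca')]
print(find_mums("ccacg", "cct"))  # [(1, 1, 'cc')]
```

Raising `min_len` filters out short matches that are unique only by chance, which is the role the minimum MUM length plays in practice.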

Step 2: Chaining in MUMmer
- Sort MUMs according to position in genome A
- Solve a variation of the Longest Increasing Subsequence (LIS) problem to find sequences in ascending order in both genomes
Figure from: Delcher et al., Nucleic Acids Research 27, 1999

Finding Longest Subsequence
- Unlike ordinary LIS problems, MUMmer takes into account
  - lengths of the sequences represented by MUMs
  - overlaps
- Requires O(k log k) time, where k is the number of MUMs
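The chaining step can be illustrated with a length-weighted variant of LIS. The sketch below is a simplification under stated assumptions: it uses a quadratic O(k^2) dynamic program rather than the O(k log k) method mentioned above, it ignores overlaps between MUMs, and the input triples `(pos_a, pos_b, length)` are made-up values rather than real MUMmer output.

```python
def chain_mums(mums):
    """Pick the heaviest set of MUMs that is increasing in both genomes.

    mums: list of (pos_a, pos_b, length) triples.
    Returns the chosen MUMs in order of position in genome A.
    """
    if not mums:
        return []
    mums = sorted(mums)                 # sort by position in genome A
    k = len(mums)
    best = [m[2] for m in mums]         # best[i]: max total length of a chain ending at i
    prev = [-1] * k                     # back pointers for the traceback
    for i in range(k):
        for j in range(i):
            ordered = mums[j][0] < mums[i][0] and mums[j][1] < mums[i][1]
            if ordered and best[j] + mums[i][2] > best[i]:
                best[i] = best[j] + mums[i][2]
                prev[i] = j
    # Trace back from the chain end with the highest total length.
    i = max(range(k), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(mums[i])
        i = prev[i]
    return chain[::-1]


example = [(1, 1, 20), (30, 60, 25), (40, 30, 30), (80, 90, 22), (100, 70, 50)]
print(chain_mums(example))  # [(1, 1, 20), (40, 30, 30), (100, 70, 50)]
```

Weighting by length means a chain with fewer but longer MUMs can beat a chain with more short ones, which is the first difference from ordinary LIS noted above.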

Types of Gaps in MUMmer Alignment
Figure from: Delcher et al., Nucleic Acids Research 27, 1999

Step 3: Close the Gaps
SNPs:
- between MUMs: trivial to detect
- otherwise: handle like repeats
Insertions:
- transpositions (subsequences that were deleted from one location and inserted elsewhere): look for out-of-sequence MUMs
- simple insertions: trivial to detect

Step 3: Close the Gaps
Polymorphic regions:
- short ones: align them with a dynamic programming method
- long ones: call MUMmer recursively with a reduced minimum MUM length
Repeats:
- detected by overlapping MUMs
Figure from: Delcher et al., Nucleic Acids Research 27, 1999

The LAGAN Method
Brudno et al., Genome Research, 2003
Given: genomes A and B

anchors = find_anchors(A, B)
step 3: finish global alignment with DP constrained by anchors

find_anchors(A, B)
    step 1: find local alignments by matching, chaining k-mer seeds
    step 2: anchors = highest-weight sequence of local alignments
    for each pair of adjacent anchors a1, a2 in anchors
        if a1, a2 are more than d bases apart
            A', B' = sequences between a1, a2
            sub-anchors = find_anchors(A', B')
            insert sub-anchors between a1, a2 in anchors
    return anchors

Step 1: Finding Seeds in LAGAN
- Degenerate k-mers: matching k-long sequences with a small number of mismatches allowed
- By default, LAGAN uses 10-mers and allows 1 mismatch
[Figure: two sequences sharing a 10-mer that differs at one position]

Finding Seeds in LAGAN
Example: a trie to represent all 3-mers of the sequence gaaccgacct
[Figure: trie whose leaves, in alphabetical order of their 3-mers, hold the start positions 2; 3, 7; 4; 8; 5; 1; 6]
- One sequence is used to build the trie
- The other sequence (the query) is walked through to find matching k-mers
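The k-mer trie can be sketched with nested dictionaries. This is an illustrative stand-in, not LAGAN's implementation: the function names and the `"positions"` key are assumptions, and the query is looked up from the root for every k-mer (no threading). The example uses the 10-base sequence gaaccgacct, whose 3-mer acc occurs at positions 3 and 7.

```python
def build_kmer_trie(seq, k):
    """Trie of all k-mers of seq; each leaf stores 1-based start positions."""
    root = {}
    for start in range(len(seq) - k + 1):
        node = root
        for ch in seq[start:start + k]:
            node = node.setdefault(ch, {})
        node.setdefault("positions", []).append(start + 1)
    return root


def lookup(trie, kmer):
    """Return the start positions of kmer in the indexed sequence."""
    node = trie
    for ch in kmer:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("positions", [])


def find_seeds(trie, query, k):
    """Walk the query and report (indexed position, query position) pairs."""
    seeds = []
    for q in range(len(query) - k + 1):
        for p in lookup(trie, query[q:q + k]):
            seeds.append((p, q + 1))
    return seeds


trie = build_kmer_trie("gaaccgacct", 3)
print(lookup(trie, "acc"))           # [3, 7]
print(find_seeds(trie, "accgt", 3))  # [(3, 1), (7, 1), (4, 2)]
```

Each seed pairs a position in the indexed sequence with a position in the query; these pairs are the raw material for the chaining step.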

Allowing Degenerate Matches
Suppose we are allowing 1 base to mismatch in looking for matches to the 3-mer acc; we need to explore the green nodes
[Figure: the 3-mer trie with the nodes reachable within one mismatch highlighted in green]
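A degenerate lookup can be sketched as a trie walk that carries a mismatch budget and abandons a branch once the budget is spent, so only part of the trie is explored. This is a minimal sketch under assumptions: the nested-dictionary trie, the function names, and the example sequence gaaccgacct are chosen for illustration and are not LAGAN's own data structures.

```python
def build_kmer_trie(seq, k):
    """Trie of all k-mers of seq; each leaf stores 1-based start positions."""
    root = {}
    for start in range(len(seq) - k + 1):
        node = root
        for ch in seq[start:start + k]:
            node = node.setdefault(ch, {})
        node.setdefault("positions", []).append(start + 1)
    return root


def degenerate_lookup(trie, kmer, k, max_mismatches=1):
    """Find indexed k-mers within max_mismatches substitutions of kmer.

    Returns a sorted list of (indexed k-mer, positions) pairs.
    """
    if len(kmer) != k:
        raise ValueError("query k-mer must have length k")
    hits = []

    def walk(node, depth, mismatches, prefix):
        if depth == k:
            hits.append((prefix, node["positions"]))
            return
        for ch, child in node.items():
            cost = 0 if ch == kmer[depth] else 1
            if mismatches + cost <= max_mismatches:  # prune over-budget branches
                walk(child, depth + 1, mismatches + cost, prefix + ch)

    walk(trie, 0, 0, "")
    return sorted(hits)


trie = build_kmer_trie("gaaccgacct", 3)
print(degenerate_lookup(trie, "acc", 3))  # [('aac', [2]), ('acc', [3, 7])]
```

With a budget of zero the walk follows a single path, which reduces to the exact lookup.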

LAGAN Uses Threaded Tries
In a threaded trie, each leaf for word w_1...w_p has a back pointer to the node for w_2...w_p
[Figure: the 3-mer trie with back pointers drawn from each leaf]

Traversing a Threaded Trie
Consider traversing the trie to find 3-mer matches for the query sequence: accgt
[Figure: the threaded 3-mer trie with the traversal for the query marked]
Usually requires following only two pointers to match against the next k-mer, instead of traversing the tree from the root for each

Step 1b: Chaining Seeds in LAGAN
- Can chain seeds s1 and s2 if
  - the indices of s1 > indices of s2 (for both sequences)
  - s1 and s2 are near each other
- Keep track of seeds in the search box as the query sequence is processed
Figure from: Brudno et al., BMC Bioinformatics, 2003

Step 2: Chaining in LAGAN
Use sparse dynamic programming to chain local alignments

The Problem: Find a Chain of Local Alignments
- (x, y) → (x', y') requires x < x' and y < y'
- Each local alignment has a weight
- FIND the chain with the highest total weight
Slide from Serafim Batzoglou, Stanford University

Sparse DP for rectangle chaining
- 1, ..., N: rectangles
- (h_j, l_j): y-coordinates of rectangle j
- w(j): weight of rectangle j
- V(j): optimal score of a chain ending in j
- L: list of triplets (l_j, V(j), j)
  - L is sorted by l_j: smallest (North) to largest (South) value
  - L is implemented as a balanced binary tree
Slide from Serafim Batzoglou, Stanford University

Sparse DP for rectangle chaining
Main idea:
- Sweep through the x-coordinates
- To the right of b, anything chainable to a is chainable to b
- Therefore, if V(b) > V(a), rectangle a is useless for subsequent chaining
- In L, keep rectangles j sorted with increasing l_j coordinates => sorted with increasing V(j) score
Slide from Serafim Batzoglou, Stanford University

Sparse DP for rectangle chaining
Go through the rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L, with largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L, with largest l_k ≤ l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) & l_j ≥ l_i
Slide from Serafim Batzoglou, Stanford University

Example
[Figure: five rectangles with endpoints at x = 2, 5, 6, 9, 10, 11, 12, 14, 15, 16 and weights a: 5, b: 6, c: 3, d: 4, e: 2]

i      a   b   c   d   e
l_i    5  11   9  15  16
V(i)   5  11   8  12  13

Slide from Serafim Batzoglou, Stanford University

Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through the x-coords: N steps
3. Each of the N steps requires O(log N) time:
   - Searching L takes log N
   - Inserting into L takes log N
   - All deletions are consecutive, so log N per deletion
   - Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE, and SUCCESSOR take O(log N) time in a balanced binary search tree
Slide from Serafim Batzoglou, Stanford University
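The sweep described on the preceding slides can be sketched as follows. Two things here are assumptions rather than part of the lecture. First, L is kept as sorted Python lists searched with `bisect`, so insertions and deletions cost O(N) instead of the O(log N) of a balanced binary tree. Second, the rectangle coordinates are hypothetical: the weights and l values follow the example slide, but the h values and the assignment of x-endpoints to rectangles are guesses chosen to be consistent with it.

```python
import bisect


def chain_rectangles(rects):
    """Sparse DP for rectangle chaining.

    rects: dict name -> (x_left, x_right, h, l, w), with h <= l
           (y grows from North to South).
    Returns (V, chain): V[i] is the best score of a chain ending in i,
    chain is the list of names on the best chain.
    """
    events = []
    for name, (x1, x2, h, l, w) in rects.items():
        events.append((x1, 0, name))  # left ends sort before right ends at equal x
        events.append((x2, 1, name))
    events.sort()

    ls, vs, names = [], [], []  # L as parallel lists: sorted by l, V strictly increasing
    V, prev = {}, {}
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            pos = bisect.bisect_left(ls, h) - 1   # j with largest l_j < h_i
            V[i] = w + (vs[pos] if pos >= 0 else 0)
            prev[i] = names[pos] if pos >= 0 else None
        else:
            pos = bisect.bisect_right(ls, l) - 1  # k with largest l_k <= l_i
            if pos < 0 or V[i] > vs[pos]:
                start = bisect.bisect_left(ls, l)
                end = start
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1                      # dominated entries are consecutive
                ls[start:end] = [l]               # REMOVE dominated, INSERT i
                vs[start:end] = [V[i]]
                names[start:end] = [i]

    chain, i = [], (max(V, key=V.get) if V else None)
    while i is not None:
        chain.append(i)
        i = prev[i]
    return V, chain[::-1]


rects = {  # hypothetical coordinates: (x_left, x_right, h, l, w)
    "a": (2, 5, 2, 5, 5),
    "b": (6, 10, 6, 11, 6),
    "c": (9, 11, 7, 9, 3),
    "d": (12, 15, 10, 15, 4),
    "e": (14, 16, 12, 16, 2),
}
V, chain = chain_rectangles(rects)
print(V)      # {'a': 5, 'b': 11, 'c': 8, 'd': 12, 'e': 13}
print(chain)  # ['a', 'b', 'e']
```

Swapping the lists for a balanced search tree would restore the O(N log N) total bound from the time analysis; the logic of the two event types stays the same.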

Constrained Dynamic Programming
If we know that the i-th element in one sequence must align with the j-th element in the other, we can ignore two rectangles in the DP matrix
[Figure: DP matrix with cell (i, j) marked and two rectangles excluded]
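One way to see why the constraint removes two rectangles: if position i must align with position j, the alignment splits into an independent prefix problem and suffix problem, and only those two blocks of the matrix are ever filled. The sketch below scores this with a simple Needleman-Wunsch recurrence. The scoring values (match +1, mismatch -1, linear gap -1) and the function names are assumptions for illustration, not the parameters of any particular aligner.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b, keeping only one DP row in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def anchored_score(a, b, i, j, match=1, mismatch=-1, gap=-1):
    """Best global score given that a[i] must align with b[j] (1-based i, j).

    Only the block before the anchor and the block after it are computed;
    the other two rectangles of the full matrix are never touched.
    """
    pair = match if a[i - 1] == b[j - 1] else mismatch
    before = nw_score(a[:i - 1], b[:j - 1], match, mismatch, gap)
    after = nw_score(a[i:], b[j:], match, mismatch, gap)
    return before + pair + after


print(nw_score("ACGT", "AGT"))              # 2
print(anchored_score("ACGT", "AGT", 3, 2))  # 2, the anchor G-G lies on an optimal path
print(anchored_score("ACGT", "AGT", 2, 2))  # 0, forcing C-G costs score
```

A good anchor loses nothing relative to the unconstrained optimum while cutting the number of cells computed; a bad anchor lowers the score, which is why anchor quality matters.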

Step 3: Computing the Global Alignment in LAGAN
- Given an anchor that starts at (i, j) and ends at (i', j'), LAGAN limits the DP to the unshaded regions
- Thus anchors are somewhat flexible
Figure from: Brudno et al., Genome Research, 2003

Step 3: Computing the Global Alignment in LAGAN
Figures from: Brudno et al., Genome Research, 2003

Example Alignment: E. coli O157:H7 vs. E. coli K-12
Figure from: Perna et al., Nature, 2001

Comparison of Large-Scale Alignment Methods

Method                 Pattern matching                        Chaining
MUMmer                 suffix tree - MUMs                      LIS variant
AVID (not discussed)   suffix tree - exact & wobble matches    Smith-Waterman variant
LAGAN                  k-mer trie, inexact matches             sparse DP