Introduction to Bioinformatics


Outline
- Method without considering the background distribution
- General approach considering the background distribution
- Ways to speed up the algorithm

Transcription Factor Binding Sites (TFBSs) 7/20/17

Transcription Factor Binding Sites (TFBSs)
- DNA sequence segments that transcription factors (TFs) bind to are called transcription factor binding sites (TFBSs).
- TFs interact with their TFBSs using a combination of electrostatic and Van der Waals forces.
- Most TFs bind DNA in a motif-specific manner, i.e. a TF can bind to a list of similar DNA sequence segments.

Transcription Factor Binding Sites (TFBSs)
- Transcription factor binding sites are usually short (around 5-15 bp).
- They are frequently degenerate sequence motifs.
  - The sequence degeneracy confers different levels of regulation.
- Given a genome, the prediction of TFBSs is a difficult and risky task.

Identification of TFBSs
- Experimental methods
  - Traditional methods:
    - Foot-printing methods
    - Nitrocellulose binding assays
    - Gel-shift analysis
    - Southwestern blotting
- TFBSs in silico
  - Aim: to identify more candidate target TFBSs.
  - Degenerate consensus sequences. (Drawback: they do not contain precise likelihood information.)
- High-throughput methods
  - Finding high-affinity binding sequences in vitro (SELEX)
  - High-throughput method in vivo: ChIP-chip
- A position weight matrix (PWM), also called a PSSM (position-specific scoring matrix), is a common approach to this problem.

Position Weight Matrices (PWMs)

Position weight matrix (PWM) / Position-specific weight matrix (PSSM)
A PWM is a commonly used representation of motifs (patterns) in biological sequences.
Imagine two experimentally determined TF binding sites for one TF:
Seq1: ATTGAGTCGCAGTGACTCAAG
Seq2: CTTGAGTCAGGCAGGCTCAAT
[Figure: construction of the position weight matrix (PWM) from the two sites.]
[Figure: PWM of "better quality", constructed using 33 TF binding sites for one TF.]
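The construction step can be sketched in a few lines of Python. This is only an illustration: the function name `count_matrix` and the dictionary layout (one list of column counts per nucleotide) are assumptions made here, not part of the slides.

```python
def count_matrix(sites, alphabet="ATCG"):
    """Count how often each nucleotide occurs in each column of aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    matrix = {nt: [0] * length for nt in alphabet}
    for site in sites:
        for j, nt in enumerate(site):
            matrix[nt][j] += 1
    return matrix

# The two binding sites shown on the slide.
sites = ["ATTGAGTCGCAGTGACTCAAG", "CTTGAGTCAGGCAGGCTCAAT"]
counts = count_matrix(sites)
print(counts["T"][:4])  # counts of T in the first four columns
```

With only two sites every column holds two counts, which is why a matrix built from 33 sites is of "better quality".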

Definitions:
- Length of PWM (number of columns): $M$
- absolute PWM (count matrix): $f^*_{i,j} \in \mathbb{N}$ with $i \in \{A, T, C, G\}$ and $j \in [0, M-1]$
- relative PWM (frequency matrix): $f'_{i,j} = \dfrac{f^*_{i,j}}{\sum_{k \in \{A,T,C,G\}} f^*_{k,j}}$

Example of a relative PWM:
     1     2     3
A    0.8   0     ..
T    0.1   0     ..
C    0.1   0.2   ..
G    0     0.8   ..

A simple TFBS matching tool

Naïve method without considering the background distribution
MATCH™: a tool for searching transcription factor binding sites in DNA sequences (A.E. Kel et al. 2003)
- Input: (1) DNA sequences containing potential TF binding sites (2) PWM
- Output: a list of found potential sites.
- Two types of scores are calculated:
  - Core Similarity Score (CSS): calculated only for the core, the first five consecutive conserved positions.
  - Matrix Similarity Score (MSS): calculated for all positions.

Naïve method without considering the background distribution
MATCH™: a tool for searching transcription factor binding sites in DNA sequences (A.E. Kel et al. 2003)

$\mathrm{MSS}\ (\mathrm{CSS}) = \dfrac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}, \qquad \mathrm{MSS}\ (\mathrm{CSS}) \in [0, 1]$

$\mathrm{Current} = \sum_{j=0}^{L-1} I(j)\, f'_{\mathrm{nu}(j),j}$, where nu(j) refers to the nucleotide with index j and $f'_{i,j} = f^*_{i,j} / \sum_{k \in \{A,T,C,G\}} f^*_{k,j}$.

$\mathrm{Max} = \sum_{j=0}^{L-1} I(j)\, f^{\max}_j$ with $f^{\max}_j = \max_i \{ f^*_{i,j} \} / \sum_{k \in \{A,T,C,G\}} f^*_{k,j}$ (highest frequency of a nucleotide in position j in the matrix)

$\mathrm{Min} = \sum_{j=0}^{L-1} I(j)\, f^{\min}_j$ with $f^{\min}_j = \min_i \{ f^*_{i,j} \} / \sum_{k \in \{A,T,C,G\}} f^*_{k,j}$ (lowest frequency of a nucleotide in position j in the matrix)

Information vector: $I(j) = \sum_{i \in \{A,T,G,C\}} f'_{i,j} \ln(4 f'_{i,j}), \quad j = 0, 1, \dots, L-1$

- Two cutoffs are kept, for the CSS and MSS scores respectively.
Procedure:
- A window consisting of five nucleotides is moved along the sequence.
- The CSS (core similarity score) is calculated.
- For each CSS higher than the CSS cutoff, the window is prolonged at both ends to fit the matrix length. Then the MSS is calculated.
- If both scores are higher than their cutoffs, the site is output as a "yes" instance.
[Figure: a window (TCGA) slides along ATCGTACTAGCTACGATCAA. Calculate the CSS and check whether it is above the CSS threshold; prolong the window; calculate the MSS and check whether it is above the MSS threshold.]
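The two scores and the two-stage procedure can be sketched as follows. This is a minimal sketch and not the MATCH implementation: the names `similarity_score` and `match_scan`, the `core_start` parameter (where the five core columns sit inside the matrix) and the toy matrix are assumptions made for illustration, and 0·ln(0) is taken as 0.

```python
import math

def information_vector(freqs):
    """I(j) = sum_i f[i][j] * ln(4 * f[i][j]), taking 0 * ln(0) as 0."""
    length = len(next(iter(freqs.values())))
    return [sum(f[j] * math.log(4 * f[j]) for f in freqs.values() if f[j] > 0)
            for j in range(length)]

def similarity_score(segment, freqs):
    """(Current - Min) / (Max - Min) for a segment as long as the matrix."""
    info = information_vector(freqs)
    cols = range(len(info))
    current = sum(info[j] * freqs[nt][j] for j, nt in enumerate(segment))
    best = sum(info[j] * max(f[j] for f in freqs.values()) for j in cols)
    worst = sum(info[j] * min(f[j] for f in freqs.values()) for j in cols)
    if best == worst:          # uninformative matrix, avoid division by zero
        return 0.0
    return (current - worst) / (best - worst)

def match_scan(seq, freqs, core_start, css_cutoff, mss_cutoff, core_len=5):
    """Slide a core-sized window; extend to the full matrix only if CSS passes."""
    length = len(next(iter(freqs.values())))
    core = {nt: f[core_start:core_start + core_len] for nt, f in freqs.items()}
    hits = []
    for pos in range(len(seq) - core_len + 1):
        css = similarity_score(seq[pos:pos + core_len], core)
        if css <= css_cutoff:
            continue
        start = pos - core_start                  # prolong at both ends
        if start < 0 or start + length > len(seq):
            continue
        mss = similarity_score(seq[start:start + length], freqs)
        if mss > mss_cutoff:
            hits.append((start, css, mss))
    return hits

# Toy frequency matrix with consensus TTGAGT (hypothetical data).
consensus = "TTGAGT"
freqs = {nt: [0.7 if nt == c else 0.1 for c in consensus] for nt in "ATCG"}
hits = match_scan("CCTTGAGTCC", freqs, core_start=1,
                  css_cutoff=0.9, mss_cutoff=0.9)
print(hits)
```

Computing the cheap five-column score first means the full-length score is only evaluated for windows that already look promising.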

Incorporating the background

Background model:
- Some nucleotides in the PWM count more than others.
[Figures: nucleotide contents of C. efficiens (noncoding, total, coding).]

Definitions:
- Length of PWM (number of columns): $M$
- Background model: $p_i \in [0, 1]$ with $i \in \{A, T, C, G\}$ and $\sum_i p_i = 1$

  A 0.180
  T 0.182
  C 0.330
  G 0.308

- absolute PWM (count matrix): $f^*_{i,j} \in \mathbb{N}$ with $i \in \{A, T, C, G\}$ and $j \in [0, M-1]$
- relative PWM (frequency matrix): $f'_{i,j} = \dfrac{f^*_{i,j}}{\sum_{k \in \{A,T,C,G\}} f^*_{k,j}}$
- Pseudo-counts per column (avoid overfitting): $f^p_{i,j} = f^*_{i,j} + c\, p_i$, e.g. $c = 4$, and then $f_{i,j} = \dfrac{f^p_{i,j}}{\sum_{k \in \{A,T,C,G\}} f^p_{k,j}}$

Example of a relative PWM:
     1     2     3
A    0.8   0     ..
T    0.1   0     ..
C    0.1   0.2   ..
G    0     0.8   ..
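A short sketch of how a frequency matrix with pseudo-counts could be derived from a count matrix. The function name `relative_pwm`, the `weight` parameter that scales the background probabilities, and the example counts (chosen to reproduce the 0.8 / 0.1 / 0.1 / 0 column of the table) are assumptions for illustration.

```python
def relative_pwm(counts, background, weight=0.0):
    """Normalise each column; add weight * background[nt] as pseudo-counts."""
    length = len(next(iter(counts.values())))
    freqs = {nt: [0.0] * length for nt in counts}
    for j in range(length):
        adjusted = {nt: counts[nt][j] + weight * background[nt] for nt in counts}
        total = sum(adjusted.values())
        for nt in counts:
            freqs[nt][j] = adjusted[nt] / total
    return freqs

# Background frequencies from the slide; the counts are hypothetical.
background = {"A": 0.180, "T": 0.182, "C": 0.330, "G": 0.308}
counts = {"A": [8, 0], "T": [1, 0], "C": [1, 2], "G": [0, 8]}

plain = relative_pwm(counts, background)                 # no pseudo-counts
smoothed = relative_pwm(counts, background, weight=4.0)  # with pseudo-counts
print(plain["G"][0], smoothed["G"][0])
```

Without pseudo-counts a nucleotide that was never observed gets frequency 0, which would make any log-odds score undefined; the smoothed matrix avoids this.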

Definitions:
- Scoring function (log-odds score): $S_{\mathrm{startidx},\mathrm{endidx}} = \sum_{j=\mathrm{startidx}}^{\mathrm{endidx}} \ln \dfrac{f_{\mathrm{nu}(j),j}}{p_{\mathrm{nu}(j)}}$, where nu(j) = nucleotide with index j

Matching procedure:
Seq = A G C A A T T A A A T T G G A T A A C ..
[Figure: the PWM is aligned with a window of the sequence, giving the score $S_{0,M-1}$.]
- Calculate the score for every position of the sliding window.
- Report every match with $S_{0,M-1} > th$ (th is the threshold for being a signal).
But how to set a good threshold value?
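The sliding-window procedure can be sketched as below. The helper names and the toy matrix are assumptions; the frequencies are assumed to be strictly positive (for example after adding pseudo-counts), otherwise the logarithm is undefined.

```python
import math

def log_odds_matrix(freqs, background):
    """scores[nt][j] = ln(f[nt][j] / p[nt]); requires all frequencies > 0."""
    return {nt: [math.log(f / background[nt]) for f in freqs[nt]]
            for nt in freqs}

def scan_log_odds(seq, scores, th):
    """Score every window of length M and report those with score > th."""
    m = len(next(iter(scores.values())))
    hits = []
    for start in range(len(seq) - m + 1):
        score = sum(scores[seq[start + j]][j] for j in range(m))
        if score > th:
            hits.append((start, score))
    return hits

# Toy matrix with consensus TAG and a uniform background (hypothetical data).
consensus = "TAG"
freqs = {nt: [0.7 if nt == c else 0.1 for c in consensus] for nt in "ATCG"}
background = {nt: 0.25 for nt in "ATCG"}
scores = log_odds_matrix(freqs, background)
hits = scan_log_odds("CCTAGCC", scores, th=2.0)
print(hits)
```

Every window costs M additions, so scanning a sequence of length N takes O(MN) time.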

Score distribution:
- $B(X)$: score distribution of the PWM calculated with random sequences according to the background model.
- $T(X)$: score distribution calculated with random sequences according to the PWM model.
- $P_Z(X = s)$: probability of observing score s under distribution Z.
We are interested in:
- $P_Z(X \ge s)$: probability of observing at least score s, with $P_Z(X \ge s) = \sum_{i=s}^{\max} P_Z(X = i)$

p-value: $p = P_B(X \ge s) = \sum_{x=s}^{\max} P_B(X = x)$
Probability of observing at least score s by chance.
→ Set th = s, with $P_B(X \ge s) = p$ for a given p
(pictures: assuming a standard normal distribution)
Set equal false positive and false negative errors:
- Set s, where $P_B(X \ge s) = P_T(X \le s)$
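One way to obtain the score distribution without sampling is a column-by-column dynamic program over discretised scores. This is a sketch under assumptions: scores are rounded to integers after multiplying by `scale`, columns are treated as independent, and all names and the toy matrix are illustrative rather than taken from the slides.

```python
import math
from collections import defaultdict

def score_distribution(scores, column_probs, scale=100):
    """Exact distribution of the rounded total score.

    column_probs[nt][j] is the probability of nucleotide nt at column j
    (the background model or the PWM model itself).
    """
    length = len(next(iter(scores.values())))
    dist = {0: 1.0}
    for j in range(length):
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for nt in scores:
                step = round(scores[nt][j] * scale)
                nxt[total + step] += prob * column_probs[nt][j]
        dist = dict(nxt)
    return dist

def tail_probability(dist, s):
    """P(X >= s)."""
    return sum(prob for score, prob in dist.items() if score >= s)

def threshold_for_pvalue(dist, p):
    """Smallest score s with P(X >= s) <= p, or None if there is none."""
    tail, best = 0.0, None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p:
            break
        best = score
    return best

consensus = "TAG"
freqs = {nt: [0.7 if nt == c else 0.1 for c in consensus] for nt in "ATCG"}
background = {nt: 0.25 for nt in "ATCG"}
scores = {nt: [math.log(f / background[nt]) for f in freqs[nt]] for nt in freqs}

dist_b = score_distribution(scores, {nt: [background[nt]] * 3 for nt in scores})
dist_t = score_distribution(scores, freqs)
th = threshold_for_pvalue(dist_b, 0.02)
print(th, tail_probability(dist_b, th), tail_probability(dist_t, th))
```

The returned threshold is on the scaled integer grid; dividing by `scale` gives it back in log-odds units. Comparing the two tails at the same score shows the trade-off between false positives (background) and sensitivity (PWM model).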


Methods to speed up the general matching approach
- The general matching approach aims at finding binding sites by moving a window of length M along a sequence of length N.
- The time complexity of a straightforward implementation is O(MN).
- Several methods have been implemented to speed up PWM/PSSM matching:
  - Lookahead algorithm
  - Permuted lookahead algorithm
  - Suffix tree
  - Enhanced suffix array

Let's speed it up!
Kirk: How much time do you need, Scotty?
Scotty: Gimme 20 minutes.
Kirk: You got 10.
Scotty: OK. I'll do it in 5.
Two minutes later...

Lookahead algorithm
- The motivation: given a segment of a sequence, we want to know as early as possible whether we can reject it as a signal.
  - For a given sequence segment of length M, we have the score function:
    $S_{0,M-1} = \sum_{j=0}^{M-1} \ln\left( f_{\mathrm{nu}(j),j} / p_{\mathrm{nu}(j)} \right)$ ---(1)
  - We define the minimum and maximum score for a given PWM:
    $S_{\min}(0, M-1) = \sum_{j=0}^{M-1} \min_{a \in \{A,T,G,C\}} \ln\left( f_{a,j} / p_a \right)$ ---(2)
    $S_{\max}(0, M-1) = \sum_{j=0}^{M-1} \max_{a \in \{A,T,G,C\}} \ln\left( f_{a,j} / p_a \right)$ ---(3)

Lookahead algorithm
  - For any d with $0 \le d \le M-1$, we also define the prefix score of depth d:
    $\mathrm{pfxs}_d = S_{0,d} = \sum_{j=0}^{d} \ln\left( f_{\mathrm{nu}(j),j} / p_{\mathrm{nu}(j)} \right)$ ---(4)
  - And the maximal score in the last M-d-1 positions of the PWM:
    $s_d = S_{\max}(d+1, M-1) = \sum_{j=d+1}^{M-1} \max_{a \in \{A,T,G,C\}} \ln\left( f_{a,j} / p_a \right)$ ---(5)
  - Finally, we can calculate the intermediate threshold at position d:
    $th_d = th - s_d$ ---(6)

Lookahead algorithm
- Therefore, the following statements are equivalent:
  $S_{0,M-1} \ge th \iff \mathrm{pfxs}_d \ge th_d$ for all d ($0 \le d \le M-1$)
- Basically, when a prefix has a score so low that the score for the whole segment stays below the threshold even if the rest of the segment achieves the maximal score, then we must reject it.
[Figure: a window on ATGCGCTTAAGTCTGTGGTCAAATGCTAGCTACGTACGATCGATC, split at position d into the prefix score pfxs_d and the maximal remaining score s_d.]
Check whether the prefix score is above th_d at every position; if not, reject the segment.
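The early-rejection idea can be sketched as follows. The toy score matrix over a two-letter alphabet, the example sequence and the function names are assumptions for illustration; the scores stand in for the log-odds values ln(f/p).

```python
def intermediate_thresholds(scores, th):
    """th_d = th - (sum of the column maxima after position d)."""
    m = len(next(iter(scores.values())))
    col_max = [max(col[j] for col in scores.values()) for j in range(m)]
    thresholds = [0.0] * m
    remaining = 0.0                       # s_d, best score still obtainable
    for d in range(m - 1, -1, -1):
        thresholds[d] = th - remaining
        remaining += col_max[d]
    return thresholds

def lookahead_scan(seq, scores, th):
    """Report windows with score >= th, abandoning a window as soon as
    its prefix score drops below the intermediate threshold."""
    m = len(next(iter(scores.values())))
    th_d = intermediate_thresholds(scores, th)
    hits = []
    for start in range(len(seq) - m + 1):
        pfxs = 0.0
        for d in range(m):
            pfxs += scores[seq[start + d]][d]
            if pfxs < th_d[d]:
                break                     # cannot reach th any more
        else:
            hits.append((start, pfxs))
    return hits

# Hypothetical score matrix of length 3 over the alphabet {a, c}.
scores = {"a": [1, 3, 2], "c": [3, 2, 1]}
print(intermediate_thresholds(scores, th=7))
hits = lookahead_scan("caaaaccacac", scores, th=7)
print(hits)
```

The worst case is still O(MN), but windows that start badly are dropped after one or two columns.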

Permuted lookahead algorithm
- With the lookahead algorithm, the sooner we reject a segment, the better the running time.
- Therefore, it makes sense to first check the positions in the PWM at which a segment is more likely to be rejected by the lookahead algorithm. We implement this idea by a permutation of the PWM:
- Each column of the PWM has a highest score:
  $M_j = S_{\max}(j, j) = \max_{a \in \{A,T,G,C\}} \ln\left( f_{a,j} / p_a \right)$
- and an expectation of the score if the residue is generated by the background model:
  $E_j = \sum_{a \in \{A,T,G,C\}} p_a S_{a,j} = \sum_{a \in \{A,T,G,C\}} p_a \ln\left( f_{a,j} / p_a \right)$

Permuted lookahead algorithm
- We focus on the difference between M_j and E_j. If the expectation for a column is low compared to its highest score, then it is more likely that the segment is rejected at this column.
- Therefore, we order the matrix by (M_j - E_j) and compute the most dangerous column first.

Before permutation:
Position     0 1 2 3 4 5 6 7 8
Sequence     A T G C G A T C G
Difference   1 3 4 6 2 7 5 9 8

Ordered by difference (permuted):
Position     0 4 1 2 6 3 5 8 7
Sequence     A G T G T C A G C
Difference   1 2 3 4 5 6 7 8 9

[Figure: the prefix score pfxs_d and the remaining maximal score s_d are computed along the permuted order.]
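A sketch of the permutation idea. It assumes that "most dangerous first" means sorting columns by decreasing M_j - E_j; the function names, the toy score matrix and the uniform background are illustrative assumptions.

```python
def column_order(scores, background):
    """Columns sorted by decreasing (M_j - E_j)."""
    m = len(next(iter(scores.values())))

    def danger(j):
        best = max(scores[nt][j] for nt in scores)                      # M_j
        expected = sum(background[nt] * scores[nt][j] for nt in scores) # E_j
        return best - expected

    return sorted(range(m), key=danger, reverse=True)

def permuted_lookahead_score(segment, scores, th, order):
    """Score a segment column by column in the given order.

    Returns the total score if it reaches th, or None as soon as the
    partial score can no longer reach th.
    """
    col_max = [max(scores[nt][j] for nt in scores) for j in range(len(order))]
    remaining = sum(col_max)      # best score obtainable from unseen columns
    partial = 0.0
    for j in order:
        remaining -= col_max[j]
        partial += scores[segment[j]][j]
        if partial < th - remaining:
            return None
    return partial

# Hypothetical scores over the alphabet {a, c} with a uniform background.
scores = {"a": [1, 5, 2], "c": [3, 2, 1]}
background = {"a": 0.5, "c": 0.5}
order = column_order(scores, background)
print(order)
print(permuted_lookahead_score("caa", scores, 8, order))
print(permuted_lookahead_score("cca", scores, 8, order))
```

The set of reported matches is the same as with the plain lookahead; only the order in which columns are inspected changes, so poor segments tend to be rejected after fewer columns.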

Suffix tree
1. A suffix tree is a data structure that presents all the suffixes of a given string.
2. A suffix tree for a string w is a tree whose edges are labeled with substrings. Each suffix of w corresponds to exactly one path from the tree's root to a leaf.
3. A suffix tree is a special data structure that allows a number of string operations to be carried out in an efficient way.
[Figure: suffix tree for the example string, which ends with a terminator character. The 12 paths from the root to a leaf correspond to the 12 suffixes.]

Suffix tree
[Table: the 12 suffixes of the example string, numbered 0 to 11.]

Suffix tree
Key features: a suffix tree T for a string w[0, m-1] is a rooted tree with:
1. m leaves numbered from 0 to m-1
2. At least two children for each internal node (except the root)
3. Each edge label represents a nonempty substring of w
4. No two edges out of the same node begin with the same character

Applications of suffix trees
- One of the simplest applications of a suffix tree is to check whether a string P of length m is a substring of a given string w in O(m) time.
- Construct the suffix tree T of the string w, and match the string P along a path starting from the root.
- If there exists a complete match, then P is a substring of w; otherwise it is not.
[Figure: checking whether a pattern is a substring of the text by walking down the suffix tree.]
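The substring check can be illustrated with a simplified stand-in for the suffix tree: an uncompressed suffix trie built from nested dictionaries. This is not a real suffix tree (it needs quadratic space and construction time), but the query walks from the root in exactly the same way and takes O(m) steps. The example string and the function names are assumptions.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def is_substring(trie, pattern):
    """Follow pattern from the root; it is a substring iff the walk succeeds."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("xabxac")
print(is_substring(trie, "bxa"), is_substring(trie, "xc"))
```

Every substring of w is a prefix of some suffix of w, which is why a successful walk from the root is sufficient.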

Applications of suffix trees
- Besides, there are many other applications of suffix trees. Given a suffix tree of a string w of length n:
  1. Find the first occurrence of the patterns P_1, ..., P_q of total length m in O(m) time.
  2. Search for a regular expression P in time expected sublinear in n.
  3. Find the longest common substrings of strings w_i and w_j in Θ(n_i + n_j) time.
  4. Find the longest repeated substring in Θ(n) time.
  5. ...

How to grow a suffix tree (naïve method)
- The running time for the naïve construction of a suffix tree is O(n^2) (n: text size).
- For example, we want to construct the suffix tree of the string xabxac.
1. Start with the whole string (leaf number 1) and connect the root with the leaf.
[Figure: root with a single edge labeled xabxac.]

How to grow a suffix tree (naïve method)
2. Generate the suffixes w[1...n-1], w[2...n-1], ..., w[n-1], and push them into the tree one by one.
Suffixes:
- abxac
- bxac
- xac
- ac
- c
[Figure: root with a single edge labeled xabxac.]

How to grow a suffix tree (naïve method)
3. To insert Sfx_i = w[i...n-1], follow the path from the root, matching characters of Sfx_i until the first mismatch at the character Sfx_i[j]. There are two cases:
  i. If the matching cannot continue from a node (which means the mismatch happens at the beginning of the next edge), then create a new node and label the edge with the corresponding substring.
[Figure: inserting the second and third suffixes, abxac and bxac, as new edges from the root.]

How to grow a suffix tree (naïve method)
  ii. If the mismatch occurs in the middle of an edge e = (u, v), let the label of the edge be $a_0 \dots a_{l-1}$. Let the mismatch occur at $a_k$; then create a new node w, and replace edge e by the edges (u, w) and (w, v), labeled $a_0 \dots a_{k-1}$ and $a_k \dots a_{l-1}$. Then create another new node to store the rest of the newly inserted suffix.
[Figure: the insertion of xac splits the first edge.]

How to grow a suffix tree (naïve method)
The same thing happens when inserting ac. After inserting ac and c, the suffix tree is complete.
Finally, in both cases, a new leaf is created, numbered i.
[Figure: the complete suffix tree with leaves 1 to 6.]
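The naïve insertion with its two cases can be sketched as below. The sketch assumes that the last character of the text occurs nowhere else (as a terminator would), so that no suffix is a prefix of another; the example string xabxac, the `Node` class, the helper names and the 1-based leaf numbers are assumptions made for illustration.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first character -> (edge label, child node)
        self.leaf = leaf     # leaf number, or None for internal nodes

def insert_suffix(root, text, i):
    """Insert the suffix text[i:] as leaf number i + 1."""
    node, rest = root, text[i:]
    while True:
        first = rest[0]
        if first not in node.children:
            # Case i: mismatch at the beginning of an edge, add a new leaf.
            node.children[first] = (rest, Node(leaf=i + 1))
            return
        label, child = node.children[first]
        k = 0
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k == len(label):
            node, rest = child, rest[k:]     # edge fully matched, go deeper
            continue
        # Case ii: mismatch inside the edge, split it at position k.
        middle = Node()
        middle.children[label[k]] = (label[k:], child)
        middle.children[rest[k]] = (rest[k:], Node(leaf=i + 1))
        node.children[first] = (label[:k], middle)
        return

def build_suffix_tree(text):
    root = Node()
    for i in range(len(text)):
        insert_suffix(root, text, i)
    return root

def leaf_paths(node, prefix=""):
    """Map each leaf number to the string spelled from the root to it."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    paths = {}
    for label, child in node.children.values():
        paths.update(leaf_paths(child, prefix + label))
    return paths

text = "xabxac"
tree = build_suffix_tree(text)
paths = leaf_paths(tree)
print(sorted(paths.items()))
```

Each insertion may compare up to n characters, which gives the O(n^2) bound for the naïve method.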

PWM/PSSM using a suffix tree
- How can a suffix tree accelerate the process of matching?
(1) We first find the proper length of a target sequence segment. The length can be decided based on the memory size.
(2) Then we construct suffix trees from the target sequence.

PWM/PSSM using a suffix tree
(3) Then a depth-first traversal of the tree is performed, calculating all the prefix scores (pfxs_d) for the edge labels.
Suppose we have score functions like the following:
$S_{a,0} = 1,\ S_{a,1} = 3,\ S_{c,0} = 3,\ S_{c,1} = 2$
for a given threshold: th = 6
We have the intermediate thresholds: th_0 = 3, th_1 = 6
Afterwards, we calculate all the prefix scores for the edge labels.
[Figure: suffix tree with the prefix scores written at its nodes.]

PWM/PSSM using a suffix tree
(4) Finally, analyze the scores and check whether either of two cases happens:
  i. If the score at some node in the tree reaches the threshold, then all of the substrings represented by the subtree below it reach the threshold as well.
  ii. Similarly, if any score falls below the intermediate threshold, then the whole branch below it can be ignored.
[Figure: the red zone shows the branches having scores below the intermediate threshold; the green zone shows the branches having scores above the intermediate threshold.]

Suffix tree → Suffix array

Enhanced suffix array
- Main features:
- M. Beckstette et al. (2006) brought forward a PWM-based searching method using enhanced suffix arrays.
- In their study, they focused on improving space efficiency when searching with a PWM. Their method is similar to the suffix tree approach discussed in the previous slides.
Three arrays are kept for different usages:
1. suf array: specifies the first index of each suffix.
2. lcp array: stores the length of the longest common prefix of two adjacent suffixes according to leaf numbers.
3. skp array: a little bit complex; we talk about it in the following slides.

Enhanced suffix array (array suf)
The suf array specifies the first index of each suffix.
S_suf[0], S_suf[1], ..., S_suf[n] is the sequence of suffixes of S in ascending lexicographic order, where S_suf[i] = S[suf[i] ... n]; i is the index of the suffix when ordered lexicographically.

i    suf[i]
6    0
0    1
1    2
2    3
4    4
9    5
7    6
3    7
8    8
5    9
10   10
11   11

Enhanced suffix array (array lcp)
Array lcp is an array ranging from 0 to n with the following features:
(1) lcp[0] = 0
(2) lcp[i] stores the length of the longest common prefix of S_suf[i-1] and S_suf[i].
The common prefix of S_suf[0] and S_suf[1] has length 3, so lcp[1] = 3.
The common prefix of S_suf[2] and S_suf[3] has length 1, so lcp[3] = 1.

i    lcp[i]
0    0
1    3
2    2
3    1
4    2
5    2
6    0
7    2
8    3
9    1
10   1
11   0
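Both arrays can be computed naïvely by sorting the suffixes, as sketched below. The example text `caaaaccacac$` is an assumption: it is a 12-character string over {a, c} whose lcp values agree with the table, with the terminator `$` sorted after all letters. The function names are also assumptions, and the quadratic construction is for illustration only.

```python
def suffix_array(text, terminator="$"):
    """Start positions of all suffixes in lexicographic order.

    The terminator is treated as larger than every other character.
    """
    def key(i):
        return [(ch == terminator, ch) for ch in text[i:]]
    return sorted(range(len(text)), key=key)

def lcp_array(text, suf):
    """lcp[i] = length of the longest common prefix of the suffixes
    at ranks i - 1 and i; lcp[0] = 0."""
    lcp = [0] * len(suf)
    for i in range(1, len(suf)):
        a, b = text[suf[i - 1]:], text[suf[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

text = "caaaaccacac$"          # assumed example string
suf = suffix_array(text)
lcp = lcp_array(text, suf)
print(suf)
print(lcp)
```

Here `suf[i]` is the start position of the suffix with lexicographic index i, so `suf[6] == 0` corresponds to the table row that pairs index 6 with start position 0.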

Enhanced suffix array (array skp)
Array skp ranges from 0 to n such that
$\mathrm{skp}[i] = \min\left( \{ n+1 \} \cup \{ j \in [i+1, n] \mid \mathrm{lcp}[i] > \mathrm{lcp}[j] \} \right)$
Geometrically, skp[i] denotes the next leaf that does not occur in the subtree below the branching node corresponding to the longest common prefix of S_suf[i-1] and S_suf[i].
→ skp[i] is the next index j where lcp[j] < lcp[i]

i    lcp[i]   skp[i]
0    0        12
1    3        2
2    2        3
3    1        6
4    2        6
5    2        6
6    0        12
7    2        9
8    3        9
9    1        11
10   1        11
11   0        12
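The definition of skp translates directly into a short function. This is a quadratic sketch for clarity, with `skp_array` as an assumed name; the lcp values are the ones from the table.

```python
def skp_array(lcp):
    """skp[i] = smallest j > i with lcp[j] < lcp[i], or n + 1 if none."""
    n = len(lcp) - 1
    skp = []
    for i in range(n + 1):
        target = n + 1
        for j in range(i + 1, n + 1):
            if lcp[j] < lcp[i]:
                target = j
                break
        skp.append(target)
    return skp

lcp = [0, 3, 2, 1, 2, 2, 0, 2, 3, 1, 1, 0]
skp = skp_array(lcp)
print(skp)
```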

Enhanced suffix array (array skp)
The longest common prefix of S_suf[2] and S_suf[3] has length 1, so lcp[3] = 1.
The red edge indicates the common prefix.
[Figure: suffix tree with leaves 0 to 11.]

Enhanced suffix array (array skp)
Similarly, we can find out that lcp[4] = lcp[5] = 2.
The red edge indicates the common prefix.
[Figure: suffix tree with leaves 0 to 11.]

Enhanced suffix array (array skp)
We cannot find a common prefix between S_suf[5] and S_suf[6], so lcp[6] = 0.
Therefore, skp[3] = skp[4] = skp[5] = 6.
In the graph, we can easily tell that S_suf[6] is the first node (colored in green) not occurring in the branch of S_suf[3], S_suf[4] and S_suf[5] (colored in purple).
[Figure: suffix tree with leaves 0 to 11.]

Enhanced suffix array (array skp)
Starting from S_suf[6], no node occurs in another branch (a branch not involved with the current suffix). Therefore, skp[6] = 12.
[Figure: suffix tree with leaves 0 to 11.]

References
- A.E. Kel et al. MATCH™: a tool for searching transcription factor binding sites in DNA sequences. (2003) Nucleic Acids Research Vol. 31 No. 13
- M. Beckstette et al. PoSSuMsearch: Fast and Sensitive Matching of Position Specific Scoring Matrices using Enhanced Suffix Arrays. (2004)
- M. Beckstette et al. Fast index based algorithms and software for matching position specific scoring matrices. (2006) BMC Bioinformatics
- S. Rahmann et al. On the Power of Profiles for Transcription Factor Binding Site Detection. (2003) Statistical Applications in Genetics and Molecular Biology
- B. Dorohonceanu et al. Accelerating Protein Classification Using Suffix Trees. (2000)

Thank you!

Suffix arrays/trees and PWM matching

Enhanced suffix array
- Definition (1): prefix score for a sequence w
  $\mathrm{pfxs}_d(w) = \sum_{j=0}^{d} \ln\left( f_{w(j),j} / p_{w(j)} \right)$, with $w(j) \in \{A, T, G, C\}$ for all j,
  where w is a sequence segment and w(j) is the character of w at index j. Denote $l_i = \min\{ M, |S_{suf[i]}| \} - 1$.
- Definition (2): d_i is the largest depth of the suffix that satisfies the intermediate threshold
  $d_i = \max\left( \{ -1 \} \cup \{ d \in [0, l_i] \mid \mathrm{pfxs}_d(S_{suf[i]}) \ge th_d \} \right)$
- Definition (3): C_i[d] is the prefix score of S_suf[i] with depth d
  $C_i[d] = \mathrm{pfxs}_d(S_{suf[i]})$ for all $d \in [0, d_i]$

Enhanced suffix array
Notice that, for each S_suf[i], the following statements are equivalent:
$d_i = M - 1 \iff \mathrm{pfxs}_{M-1}(S_{suf[i]}) = C_i[M-1] \ge th_{M-1}$
where M is the length of the PWM.
That is, S_suf[i] satisfies the threshold iff the largest depth satisfying the intermediate threshold equals M - 1, the last index of the PWM.
We will show the algorithm by an example. Suppose we have the following score function:

S_{i,j}   Index 0   Index 1   Index 2
a         1         3         2
c         3         2         1

Suppose we have the following threshold: th = 7
Intermediate thresholds: th_0 = 2, th_1 = 5, th_2 = 7.

Enhanced suffix array
Algorithm:
1. First compute C_0 and d_0 to see if the first suffix satisfies the threshold.
For S_suf[0], we have C_0[0] = pfxs_0(S_suf[0]) = 1, which is below the threshold th_0.
$d_0 = \max\left( \{ -1 \} \cup \{ d \in [0, l_0] \mid \mathrm{pfxs}_d(S_{suf[0]}) \ge th_d \} \right)$
Hence we have d_0 = -1, meaning no prefix satisfies the threshold.
[Figure: suffix tree with leaves 0 to 11; the branch of S_suf[0] is marked "Below th_0".]

Enhanced suffix array
2. Afterwards comes the VERY tricky part. Based on the skp array, we can actually JUMP over some suffixes.
For each S_suf[i], whether or not it satisfies the threshold, we try to find the first k with d_i + 1 >= lcp[k], by the following jumping cascade:
let k_0 = i+1, k_1 = skp[k_0], ..., k_m = skp[k_{m-1}] such that
d_i + 1 < lcp[k_0], d_i + 1 < lcp[k_1], ..., d_i + 1 < lcp[k_{m-1}] and d_i + 1 >= lcp[k_m].
k_m is the k we want. All suffixes within the jump range satisfy the threshold if S_suf[i] satisfies it, and do not satisfy it if S_suf[i] does not.

Enhanced suffix array
i. In the first step, we have d_0 = -1.
ii. We try to find the first k such that d_0 + 1 = 0 >= lcp[k].
iii. By making three jumps based on the skp array, we find k_3 = 6 satisfying our case.
First jump: k_1 = skp[k_0] = skp[1] = 2 (k_0 = 0 + 1 = 1), and d_0 + 1 = 0 < lcp[k_1] = 2
Second jump: k_2 = skp[k_1] = 3, and d_0 + 1 = 0 < lcp[k_2] = 1
Third jump: k_3 = skp[k_2] = 6, and d_0 + 1 = 0 >= lcp[k_3] = 0. YEAH, we got it!!!

i    lcp[i]   skp[i]
0    0        12
1    3        2
2    2        3
3    1        6
4    2        6
5    2        6
6    0        12
7    2        9
8    3        9
9    1        11
10   1        11
11   0        12

Enhanced suffix array
Since S_suf[0] does not satisfy the threshold, S_suf[1], ..., S_suf[5] cannot satisfy the threshold either.
i. In the first step, we have d_0 = -1.
ii. We try to find the first k such that d_0 + 1 >= lcp[k].
iii. By making three jumps based on the skp array, we find k_3 = 6 satisfying our case.
First jump: k_1 = skp[k_0] = 2, and d_0 + 1 = 0 < lcp[k_1] = 2
Second jump: k_2 = skp[k_1] = 3, and d_0 + 1 = 0 < lcp[k_2] = 1
Third jump: k_3 = skp[k_2] = 6, and d_0 + 1 = 0 >= lcp[k_3] = 0. YEAH, we got it!!!
[Figure: the leaves S_suf[0] to S_suf[5] are marked "Below th_0".]

Enhanced suffix array
Next we compute C_6 and d_6:
$d_i = \max\left( \{ -1 \} \cup \{ d \in [0, l_i] \mid \mathrm{pfxs}_d(S_{suf[i]}) \ge th_d \} \right)$
$C_i[d] = \mathrm{pfxs}_d(S_{suf[i]})$ for all $d \in [0, d_i]$
We obtain d_6 = 2 and C_6[0] = 3, C_6[1] = 6, C_6[2] = 8, satisfying all intermediate thresholds. Therefore, S_suf[6] is a signal.

S_{i,j}   Index 0   Index 1   Index 2
a         1         3         2
c         3         2         1

Threshold: th = 7
Intermediate thresholds: th_0 = 2, th_1 = 5, th_2 = 7.

Enhanced suffix array
Similarly, we try to find the first k such that d_6 + 1 = 3 >= lcp[k].
We find k_0 = 6 + 1 = 7, satisfying d_6 + 1 = 2 + 1 = 3 >= lcp[k_0] = lcp[7] = 2.
Therefore, only S_suf[6] satisfies the threshold in this round.
No JUMP here; we only move to the next node.
Next we continue to compute C_7 and d_7.

Enhanced suffix array
By a similar approach, we obtain d_7 = 2 and C_7[0] = 3, C_7[1] = 6, C_7[2] = 7, satisfying all intermediate thresholds.
Similarly, we try to find the first k such that d_7 + 1 = 3 >= lcp[k].
We find k_0 = 7 + 1 = 8, satisfying d_7 + 1 = 2 + 1 = 3 >= lcp[k_0] = lcp[8] = 3.
Therefore, only S_suf[7] satisfies the threshold in this round.
No JUMP here; we only move to the next node.

Enhanced suffix array
By a similar approach, we obtain that S_suf[8] and S_suf[9] satisfy the threshold, while S_suf[10] and S_suf[11] do not.
(Algorithm ends)
[Figure: suffix tree with leaves 0 to 11 and the prefix scores of the matching branches.]

Enhanced suffix array (algorithm)
1. Compute d_0, and C_0[d] for any $d \in [0, d_0]$.
2. Assume d_{i-1} and C_{i-1}[d] have been determined; then we calculate d_i and C_i[d] from d_{i-1} and C_{i-1}[d]:
Since S_suf[i-1] and S_suf[i] have a common prefix of length lcp[i], we have C_{i-1}[d] = C_i[d] for all $d \in [0, \mathrm{lcp}[i] - 1]$.
To calculate C_i[d] for all $d \in [0, d_i]$, the following two cases need to be considered:
(1) d_{i-1} + 1 >= lcp[i]
Then compute C_i[d] for d >= lcp[i] while d <= l_i and C_i[d] >= th_d.

Enhanced suffix array (algorithm)
(2) d_{i-1} + 1 < lcp[i]
Let j be the minimum value in [i+1, n+1] such that all suffixes S_suf[i], S_suf[i+1], ..., S_suf[j-1] have a common prefix of length d_{i-1} + 1. Then, according to the definition:
  i. If d_{i-1} = M - 1, then there are signals at all positions suf[r] for i <= r <= j-1.
  ii. If d_{i-1} < M - 1, then there are no signals at any of the positions suf[r].
We obtain j by following a chain of entries in the array skp, computing a chain of values j_0 = i, j_1 = skp[j_0], ..., j_k = skp[j_{k-1}] such that d_{i-1} + 1 < lcp[j_0], ..., d_{i-1} + 1 < lcp[j_{k-1}] and d_{i-1} + 1 >= lcp[j_k]; then j = j_k.
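The whole search can be sketched as a single loop over the suffix array. This is a simplified reading of the steps above, not the published implementation: the text `caaaaccacac$` and its `suf` array are assumptions chosen to be consistent with the lcp and skp tables, the score matrix is a toy example over {a, c}, and the function name `esa_search` is hypothetical.

```python
def esa_search(text, suf, lcp, skp, scores, th):
    """Return the start positions of all windows with score >= th."""
    n = len(suf) - 1
    m = len(next(iter(scores.values())))
    # Intermediate thresholds th_d = th - (best score after position d).
    col_max = [max(col[j] for col in scores.values()) for j in range(m)]
    th_d, remaining = [0] * m, 0
    for d in range(m - 1, -1, -1):
        th_d[d] = th - remaining
        remaining += col_max[d]

    hits = []
    C = [0] * m                  # prefix scores of the last evaluated suffix
    i = 0
    while i <= n:
        start = suf[i]
        l_i = min(m, len(text) - 1 - start) - 1   # the terminator is excluded
        d = lcp[i]               # C[0 .. lcp[i]-1] carries over unchanged
        d_i = d - 1
        while d <= l_i:
            C[d] = (C[d - 1] if d > 0 else 0) + scores[text[start + d]][d]
            if C[d] < th_d[d]:
                break
            d_i = d
            d += 1
        is_match = d_i == m - 1
        if is_match:
            hits.append(start)
        # Jump over all suffixes sharing more than d_i + 1 characters.
        k = i + 1
        while k <= n and d_i + 1 < lcp[k]:
            k = skp[k]
        if is_match:
            hits.extend(suf[i + 1:k])   # skipped suffixes match as well
        i = k
    return sorted(hits)

text = "caaaaccacac$"                                   # assumed example
suf = [1, 2, 3, 7, 4, 9, 0, 6, 8, 5, 10, 11]
lcp = [0, 3, 2, 1, 2, 2, 0, 2, 3, 1, 1, 0]
skp = [12, 2, 3, 6, 6, 6, 12, 9, 9, 11, 11, 12]
scores = {"a": [1, 3, 2], "c": [3, 2, 1]}
matches = esa_search(text, suf, lcp, skp, scores, th=7)
print(matches)
```

On this example the loop evaluates the suffixes with lexicographic indices 0, 6, 7, 8, 9, 10 and 11 and jumps over indices 1 to 5, mirroring the walk-through; the reported start positions correspond to S_suf[6], S_suf[9], S_suf[7] and S_suf[8].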