BDD-Based Analysis of Gapped q-gram Filters


Marc Fontaine¹, Stefan Burkhardt² and Juha Kärkkäinen²

¹ Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
e-mail: stburk@mpi-sb.mpg.de
e-mail: fontaine@studcs.uni-sb.de

² Department of Computer Science, P.O. Box 68 (Gustaf Hällströmin katu 2 B), FI-00014 University of Helsinki, Finland
e-mail: Juha.Karkkainen@cs.helsinki.fi

Abstract. Recently, there has been a surge of interest in gapped q-gram filters for approximate string matching. Important design parameters for filters are, for example, the value of q, the filter threshold and in particular the shape (aka seed) of the filter. A good choice of parameters can improve the performance of a q-gram filter by orders of magnitude, and optimising these parameters is a nontrivial combinatorial problem. We describe a new method for analysing gapped q-gram filters. This method is simple and generic. It applies to a variety of filters, overcomes many restrictions that are present in existing algorithms and can easily be extended to new filter variants. To implement our approach, we use an extended version of BDDs (Binary Decision Diagrams), a data structure that efficiently represents sets of bit-strings. In a second step, we define a new class of multi-shape filters and analyse these filters with the BDD-based approach. Experiments show that multi-shape filters can outperform the best single-shape filters, which are currently in use, in many aspects. The BDD-based algorithm is crucial for the design and analysis of these new and better multi-shape filters. Our results apply to the k-mismatches problem, i.e. approximate string matching with Hamming distance.

1 Introduction

String matching involves searching a given string or textual database T for occurrences of substrings that match a search pattern P. The approximate string matching problem allows the search pattern and the matches to have some difference or distance according to a given distance function.

Many applications depend on efficient solutions of this problem, especially in the field of bio-informatics, where databases may consist of sequences of $10^9$ nucleotides of DNA or of long sequences of amino acids.

Footnote: This work was conducted in part at the MPI for computer science, Saarbrücken, with support from the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT), and at the University of Helsinki, supported by the Academy of Finland grant 201560.

Filter algorithms are a common approach for approximate string matching. They speed up string matching by quickly generating a set of potential matches and discarding the rest of the database. The true matches can then be found in a second step, the verification phase, by inspecting all potential matches. Designing a good filter usually means optimising the tradeoff between the complexity of the filtration phase and the efficiency of the filter.

Many efficient filters work with precomputed indexes, in particular indexes that are based on gapped q-grams or shapes. For example, the three 3-grams of the string ACAGCT for shape ##-# are AC-G, CA-C and AG-T. A matching pair of q-grams between a pattern and a substring of T is called a hit. The q-gram index stores the positions of all q-grams of the database and allows hits to be found efficiently. If the number of hits between the search pattern and a substring of the database reaches a certain threshold t, that substring is called a potential match.

As first shown in [6, 7], the performance of the filter depends crucially on the shape. Good shapes are found by analysing large sets of shapes, because no method for directly generating good shapes has been found yet. Even analysing a single shape is non-trivial and a lot of effort has gone into developing methods for this purpose. Recently gapped shapes have been the focus of quite a bit of attention [6, 9,, 2].

In [6, 7], Burkhardt and Kärkkäinen compute the optimal threshold. It is the highest threshold that still allows the filter to return all true matches, i.e., substrings that are within a fixed Hamming distance from the pattern. They also compute a measure called the minimum coverage, which provides a rough estimate of how many false matches get through the filter. The true positive rates and the false positive rates were determined experimentally for selected shapes.

If the threshold or Hamming distance is increased, the filter also discards some true matches, which are then called false negatives. In [5] an abstract measure for the false negative probability, the so-called recognition rate, was defined and analysed experimentally.

The exact computation of both false positive and false negative rates was done by Ma, Tromp and Li [5]. However, their algorithms are restricted to filters that count only non-overlapping hits. This is a significant restriction, as the lower correlation of overlapping q-grams is a big advantage of gapped q-grams over ungapped ones [7].

Brejová, Brown and Vinař [, 2] develop new variants of the algorithm of Ma, Tromp and Li. In [], a true match is defined not directly by a Hamming distance but as a probability distribution represented by a hidden Markov model. In [2], they further generalise the approach to approximate hits and multiple shapes. However, the restriction to non-overlapping hits remains in all of their work.

Very recently, we became aware of several papers using multiple gapped shapes for approximate string matching in various different approaches [3, 7, 8, 4].

We present a new and flexible method for computing various properties of q-gram filters. This algorithm is based on a simple, natural abstraction of the problem and applies to a general class of filters. At the same time, it overcomes many restrictions present in previous algorithms, in particular the non-overlapping hit restriction.

Proceedings of the Prague Stringology Conference '04

Our method consists of two steps. The first step is an algorithm based on sets of bit-strings. These sets can be of exponential size and we have to use a compact and efficient representation of the sets to actually implement our algorithm. A data structure called BDDs [3, 4] (Binary decision diagrams or Binary decomposition diagrams) implements such a representation of sets. Only the use of a data structure like BDDs makes it feasible to run our algorithm. In the second step, the BDDs generated in the first step can be used to efficiently compute interesting properties of the sets they represent. Most properties can be computed in time linear in the size of the BDDs. This clear split of the problem into two steps distinguishes our method from previous algorithms, which were mostly based on dynamic programming.

In the second part of this work, we apply our method to design new and better filters. The basic idea in this part is to filter with a set of shapes simultaneously. Multi-shape filters have been used before [8]. Our new idea is to use a carefully selected set of shapes together with a specifically computed filtration criterion. This filtration criterion replaces the optimal threshold of a single-shape filter. The BDD-based algorithm allows us to compute the best filtration criterion for a set of shapes and at the same time determine the important quality measures of the resulting multi-shape filter.

To investigate the potential of multi-shape filters based on a specific filtration criterion, we analyse large sets of randomly generated filters. These experiments show that good multi-shape filters are very rare, but the experiments also yield filters that are superior to single-shape filters in several important aspects.

2 Representing Match-Mismatch-Patterns with BDDs

Let A and B be two strings of length $l$. We call the bit-string $p(A, B) \in \{0, 1\}^l$ the match-mismatch-pattern of the two strings. A 1 in $p(A, B)$ denotes a matching position and a 0 a mismatch. The Hamming distance of A and B is then the number of zeros in $p(A, B)$. We represent the number of zeros and ones in a bit-string $p$ by $|p|_0$ and $|p|_1$. The k-differences version of approximate string matching allows a pattern string and a match to have a Hamming distance of at most k.
For most filters, the match-mismatch-pattern of an alignment between the pattern string and the database at some position x contains enough information to decide whether x is returned as a potential match or not. Therefore it is, in principle, sufficient to enumerate all possible match-mismatch-patterns to analyse the performance of filters for the k-differences problem.

A drawback of this brute-force approach is that there are $2^l$ possible match-mismatch-patterns for strings of length $l$, and realistic filters usually work with pattern lengths around $l = 50$. To overcome this complexity problem we use a data structure called BDDs. BDDs allow a compact and efficient representation of sets of equal-length bit-strings. They can be seen as an abstract data structure that supports the following operations:

- Creation of a new BDD for the base cases $\emptyset$ and $\{\epsilon\}$.
- Composition of two BDDs: $S = \mathrm{comp}(S_0, S_1)$.
- Decomposition of a BDD into two BDDs according to the first position of the bit-strings in the BDD.
- Computing $\cup$ and $\cap$ of two BDDs and the complement of a BDD.

The composition $\mathrm{comp}(S_0, S_1)$ represents the set:

$$\mathrm{comp}(S_0, S_1) = \{ s \mid (s = 0a \wedge a \in S_0) \vee (s = 1b \wedge b \in S_1) \}$$

For two sets A and B that are given as decompositions $A = \mathrm{comp}(A_0, A_1)$ and $B = \mathrm{comp}(B_0, B_1)$, $A \cup B$ and $A \cap B$ can be computed recursively as:

$$A \cup B = \mathrm{comp}(A_0 \cup B_0,\; A_1 \cup B_1)$$

and:

$$A \cap B = \mathrm{comp}(A_0 \cap B_0,\; A_1 \cap B_1)$$

BDDs are implemented with DAGs (Directed Acyclic Graphs) and they are similar to finite automata without loops. A BDD always uses the minimal, smallest possible DAG to represent a set of bit-strings, and equal sets are represented by one canonical node of the DAG. A collection of BDDs can share the structure of a single DAG, and BDD implementations usually make use of hash tables to maintain the canonical-representation property. Hash tables are also used to avoid re-computations during the computation of $\cup$ and $\cap$.

[Figure: A simple BDD and the set it represents]

BDDs have many different applications where they often make it possible to handle exponential-size sets within non-exponential complexity. An introduction to BDDs can be found in [3, 4], where BDDs are used to represent boolean functions over a finite set of variables.

The actual performance of BDDs in an application depends on the structure of the sets they represent. For sets of bit-strings of length $l$, the space complexity of BDDs can range from very small up to $\Theta(2^l)$. It is important to note that we use BDDs only to analyse q-gram filters. This is a one-time computation and the complexity of the BDDs does not interfere with the complexity of the filters under consideration. The theoretical complexity of BDDs is therefore secondary in our application. Our experience is that BDDs work well to reduce the complexity of filter analysis. They make it possible to analyse all interesting filters within reasonable time and space limits.

In [] Fontaine described an extension of standard BDDs, which uses $\{0, 1\}$ as an additional base case for the decomposition (so-called […]-BDDs). This extension allows a more compact representation than standard BDDs and it was used for all our experiments. For code listings and runtime measurements of a prototype […]-BDD implementation see []. The prototype implementation only consists of about […] kB of C++ code, and computing the filter properties for a typical shape only takes a few seconds.
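The operations listed in Section 2 can be illustrated with a minimal Python sketch. It is not the prototype described in the text: the names `Node`, `comp`, `union`, `intersect`, `singleton` and `count` are hypothetical, it handles plain sets of equal-length bit-strings only, and it omits the complement operation and the extended base case. A unique table keeps one canonical node per set, and memoisation avoids recomputation in union and intersection.

```python
from functools import lru_cache

class Node:
    """DAG node for a set of equal-length bit-strings (hashed by identity)."""
    __slots__ = ("lo", "hi")
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # lo: strings after a leading 0, hi: after a leading 1

EMPTY = Node(None, None)   # base case: the empty set
EPS = Node(None, None)     # base case: the set containing only the empty string
_unique = {}               # unique table: (lo, hi) -> canonical node

def comp(lo, hi):
    """Canonical node for {0a | a in lo} union {1b | b in hi}."""
    if lo is EMPTY and hi is EMPTY:
        return EMPTY
    key = (lo, hi)
    node = _unique.get(key)
    if node is None:
        node = _unique[key] = Node(lo, hi)
    return node

@lru_cache(maxsize=None)
def union(a, b):
    if a is EMPTY:
        return b
    if b is EMPTY or a is b:
        return a
    return comp(union(a.lo, b.lo), union(a.hi, b.hi))

@lru_cache(maxsize=None)
def intersect(a, b):
    if a is EMPTY or b is EMPTY:
        return EMPTY
    if a is b:
        return a
    return comp(intersect(a.lo, b.lo), intersect(a.hi, b.hi))

@lru_cache(maxsize=None)
def count(a):
    """Number of bit-strings in the set, one visit per distinct node."""
    if a is EMPTY:
        return 0
    if a is EPS:
        return 1
    return count(a.lo) + count(a.hi)

def singleton(bits):
    """Node for the set containing exactly one bit-string."""
    node = EPS
    for b in reversed(bits):
        node = comp(EMPTY, node) if b == "1" else comp(node, EMPTY)
    return node

a = union(singleton("101"), singleton("111"))
b = union(singleton("111"), singleton("000"))
print(count(union(a, b)), count(intersect(a, b)))
```

Because nodes are canonical, testing two sets for equality reduces to an identity check, for example `intersect(a, b) is singleton("111")`.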

3 q-gram Similarity-Based Filters

We use strings over {#, -} to denote different shapes. # stands for a position that must match, whereas - is a don't care or wild-card position. $\mathrm{span}(s) = |s|$ is the span of a shape s. Let A and B be two strings of length $l$ and s a shape. A position $i \le l - \mathrm{span}(s)$ is a hit of shape s if

$$\forall n < \mathrm{span}(s):\; s(n) = \# \Rightarrow A(i + n) = B(i + n).$$

The number of different hits of a shape s for two strings A and B is called the q-gram similarity $\mathrm{qgs}_s(A, B)$ of the two strings. For strings of length $l$ the q-gram similarity can be at most $l - \mathrm{span}(s) + 1$.

    A           = ACTGTACTGCCGTACT
    B           = ACTGTAATGCAGTACT
    p(A, B)     = 1111110111011111
    shape s     = ###---##--##
    qgs_s(A, B) = 2

    ACTGTACTGCCGTACT
    ###---?#--?#
     ###---##--##     <- hit
      ###---##--##    <- hit
       ###---#?--##
        ##?---?#--##
    ACTGTAATGCAGTACT

    Figure: Match-mismatch-pattern and q-gram similarity

A q-gram filter computes the set of potential matches with the help of a threshold t. A potential match is the position of a substring in the database with a q-gram similarity of at least t with the pattern string. Increasing the threshold of a filter reduces the number of potential matches at the cost of a decreased filter sensitivity, i.e. the filter is more likely to overlook true matches.

The match-mismatch-pattern of two strings contains sufficient information to compute their q-gram similarity. Therefore a filter can be analysed by looking at all possible match-mismatch-patterns. We can partition the set of all possible match-mismatch-patterns according to the q-gram similarity they represent for a given shape. For any fixed shape s and any $h, l \in \mathbb{N}$ we define:

$$P^l_h = \{ p \in \{0, 1\}^l \mid s \text{ produces exactly } h \text{ hits in } p \}$$

It follows that the set $PM$ of match-mismatch-patterns that represent a potential match is:

$$PM = \bigcup_{h \ge t} P^l_h$$

A set $P^l_h$ can easily be computed based on the sets $P^{l-1}_{h-1}$ and $P^{l-1}_h$. A match-mismatch-pattern $p \in \{0, 1\}^l$ is in $P^l_h$ if either:

- its suffix of length $l - 1$ is in $P^{l-1}_{h-1}$ and it has an additional hit of shape s at position 0, or
- its suffix of length $l - 1$ is in $P^{l-1}_h$ and it does not have an additional hit at position 0.

This algorithm can be formulated as a simple equation for sets:

$$P^l_h = \left( \mathrm{expand}(P^{l-1}_{h-1}) \cap S^l(s) \right) \cup \left( \mathrm{expand}(P^{l-1}_h) \cap \overline{S^l(s)} \right)$$

with the following three definitions:

$$S^l(s) = \{ p \in \{0, 1\}^l \mid s \text{ has a hit in } p \text{ at position } 0 \}$$

$$\overline{S^l(s)} = \{ p \in \{0, 1\}^l \mid s \text{ does not have a hit in } p \text{ at position } 0 \}$$

$$\mathrm{expand}(M) = \{ x \mid x = 0m \vee x = 1m,\; m \in M \}$$

BDDs directly support $\cup$ and $\cap$, and $\mathrm{expand}(M)$ can be implemented as $\mathrm{expand}(M) = \mathrm{comp}(M, M)$. BDDs also support the creation of $S^l(s)$ and $\overline{S^l(s)}$ for any shape s. $S^l(s)$ can be computed recursively as:

$$S^l(s) = \begin{cases} \{0, 1\}^l & \text{if } s = \epsilon \\ \emptyset & \text{if } l < \mathrm{span}(s) \\ \mathrm{comp}(\emptyset,\; S^{l-1}(r)) & \text{if } s = \#r \\ \mathrm{comp}(S^{l-1}(r),\; S^{l-1}(r)) & \text{if } s = \text{-}r \end{cases}$$

    Shape = #-##
    P^4_1 = S^4     = {1011, 1111}
    expand(P^4_1)   = {01011, 01111, 11011, 11111}
    P^5_1           = {01011, 01111, 10110, 10111, 11011, 11110}
    P^5_2           = {11111}
    P^5_0           = {00000, 00001, 00010, ...}

    Figure: P^l_h and S^l(s) for shape #-##

As an alternative to our definition of the q-gram similarity $\mathrm{qgs}(A, B)$ it is possible to require individual hits to be non-overlapping [5, ]. For such filters the set $PM$ can be computed with an algorithm similar to the one described above. (Compute the sets $P^l_{(h,i)}$, where i is the offset of the first hit.)

4 Filter Analysis with BDDs

The algorithm described in the previous section allows us to generate BDD representations for the sets $P^l_h$. These BDD representations can be used to compute many interesting properties of the sets and thereby of the underlying filters. Note that the computation of the various properties is independent of what filter the sets $P^l_h$ represent and how they were computed. This is in contrast to previous approaches using dynamic programming, where the filter definition is deeply involved in the property computation.

4.1 Specificity

The specificity of a filter describes its ability to reduce a large database to a small set of potential matches. For a given random model, the filter specificity is equivalent to the probability that a random substring of length $l$ is a potential match of a random search pattern.
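The recursion for $P^l_h$ from Section 3 can be checked on small inputs with explicit Python sets in place of BDDs. This is only a sketch for illustration: the helper names `hit_at_0`, `expand`, `hit_classes`, `mm_pattern` and `qgs` are hypothetical, and the explicit sets grow as $2^l$, which is exactly the blow-up that BDDs are meant to avoid.

```python
def hit_at_0(shape, p):
    """True if the shape has a hit at position 0 of bit-string p."""
    return len(p) >= len(shape) and all(
        c != "#" or p[n] == "1" for n, c in enumerate(shape))

def expand(m):
    """Prepend 0 and 1 to every string of the set m."""
    return {bit + x for x in m for bit in "01"}

def hit_classes(shape, length):
    """Return a dict h -> P_h for match-mismatch-patterns of the given length."""
    classes = {0: {""}}   # length 0: the empty pattern has zero hits
    for _ in range(length):
        new = {}
        for h, pats in classes.items():
            for p in expand(pats):
                # a hit at position 0 moves p from class h to class h + 1
                k = h + 1 if hit_at_0(shape, p) else h
                new.setdefault(k, set()).add(p)
        classes = new
    return classes

def mm_pattern(a, b):
    """Match-mismatch-pattern of two equal-length strings."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))

def qgs(shape, a, b):
    """q-gram similarity: number of positions where the shape has a hit."""
    p = mm_pattern(a, b)
    return sum(hit_at_0(shape, p[i:]) for i in range(len(p) - len(shape) + 1))

classes = hit_classes("#-##", 5)
print(sorted(classes[1]))
print(qgs("###---##--##", "ACTGTACTGCCGTACT", "ACTGTAATGCAGTACT"))
```

The first call reproduces the sets shown for shape #-##, the second the q-gram similarity of the alignment example.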

Every match-mismatch-pattern p describes one possible event that can occur while aligning a database and a search pattern, and we can use several probability models to assign probabilities to these events. We can then simply extend these probabilities from one match-mismatch-pattern to sets of match-mismatch-patterns by summing up the probabilities of the elements of the sets.

For example, to analyse a filter for a DNA database, we might assume that the database and pattern string are independent random strings with an even distribution of the letters {A, C, G, T}. It follows that every single character has a chance of $\frac{1}{4}$ of being a match, and the probability of any match-mismatch-pattern p is:

$$\mathrm{prob}(p) = \left(\tfrac{1}{4}\right)^{|p|_1} \left(\tfrac{3}{4}\right)^{|p|_0}$$

With this we can compute the probability of a potential match, i.e. the specificity of the filter, as:

$$\mathrm{specificity} = \sum_{p \in PM} \mathrm{prob}(p)$$

Given the binary decomposition $\mathrm{comp}(P_0, P_1)$ of a set P, the probability $\mathrm{Prob}(P)$ of the set is:

$$\mathrm{Prob}(P) = \tfrac{3}{4}\, \mathrm{Prob}(P_0) + \tfrac{1}{4}\, \mathrm{Prob}(P_1)$$

The base cases for the binary decomposition are also the base cases for this recursion:

$$\mathrm{Prob}(\emptyset) = 0 \qquad \mathrm{Prob}(\{\epsilon\}) = 1$$

This shows that, if BDDs are used to represent the sets, $\mathrm{Prob}(P) = \sum_{p \in P} \mathrm{prob}(p)$ can be computed in time linear in the size of the BDDs. A similar approach allows the probabilities of sets to be computed efficiently for many different probability models. In particular it is also possible to use hidden Markov models (HMMs) as the probability model. HMMs have been used in [] to model real DNA sequences of different species.

4.2 Recognition Rate

For approximate string matching with Hamming distance we can define the recognition rate $r(j)$ of a filter as the expected fraction of potential matches among substrings of the database with Hamming distance exactly j. The match-mismatch-patterns of length $l$ and Hamming distance j can easily be computed with the single-character shape # as $P^l_{l-j}(\#)$.

It follows that a filter with potential matches $PM$ has the recognition rate:

$$r(j) = \frac{\mathrm{Prob}(PM \cap P^l_{l-j}(\#))}{\mathrm{Prob}(P^l_{l-j}(\#))}$$

Recognition rates have been defined and determined experimentally in [5].

4.3 Threshold

The set of potential matches of a filter with shape s, and with it the recognition rates of the filter, depends heavily on the threshold t.
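The specificity and recognition-rate formulas of Sections 4.1 and 4.2 can be tried out on a toy filter. The sketch below is an illustration under stated assumptions, not the BDD computation itself: the function names are hypothetical, sets are explicit Python sets built by enumeration (feasible only for small $l$), and `prob_set` follows the recursion $\mathrm{Prob}(P) = \frac{3}{4}\mathrm{Prob}(P_0) + \frac{1}{4}\mathrm{Prob}(P_1)$ by splitting on the first bit.

```python
from fractions import Fraction
from itertools import product

def hits(shape, p):
    """Number of positions at which the shape has a hit in bit-string p."""
    return sum(all(c != "#" or p[i + n] == "1" for n, c in enumerate(shape))
               for i in range(len(p) - len(shape) + 1))

def all_patterns(length):
    return {"".join(bits) for bits in product("01", repeat=length)}

def potential_matches(shape, length, t):
    """PM: all match-mismatch-patterns with at least t hits."""
    return {p for p in all_patterns(length) if hits(shape, p) >= t}

def prob_set(pats, p_match=Fraction(1, 4)):
    """Prob(P) via the binary decomposition on the first bit."""
    if not pats:
        return Fraction(0)          # Prob(empty set) = 0
    if pats == {""}:
        return Fraction(1)          # Prob({epsilon}) = 1
    p0 = {p[1:] for p in pats if p[0] == "0"}
    p1 = {p[1:] for p in pats if p[0] == "1"}
    return (1 - p_match) * prob_set(p0, p_match) + p_match * prob_set(p1, p_match)

def recognition_rate(pm, length, j):
    """r(j): fraction of the probability mass at Hamming distance j kept by the filter."""
    dist_j = {p for p in all_patterns(length) if p.count("0") == j}
    return prob_set(pm & dist_j) / prob_set(dist_j)

pm = potential_matches("#-##", 5, 1)
print(prob_set(pm))                 # specificity of the toy filter
print(recognition_rate(pm, 5, 1))   # r(1)
```

For shape #-##, $l = 5$ and $t = 1$ the specificity is $2 \cdot 4^{-3} - 4^{-5} = 31/1024$ by inclusion-exclusion over the two hit positions, which agrees with the recursion.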

A filter is lossless for a threshold t and Hamming distance k if $\forall j \le k: r(j) = 1$, otherwise it is lossy. If one is interested in a fixed maximal Hamming distance k and lossless filtering, then there exists an optimal threshold $t_{best}$. A dynamic programming algorithm for computing $t_{best}$ is described in [7]. BDD-based threshold computation is also possible. For each set $P_h$ we compute:

$$m(P) = \min_{p \in P} |p|_0$$

We use the notation $|p|_0$ for the number of occurrences of 0 in string p. $m(P_h)$ is the minimum number of mismatching positions of any match-mismatch-pattern $p \in P_h$. This minimum can be found in time linear in the size of the BDD. Any set $P_h$ with $m(P_h) \le k$ contains at least one match-mismatch-pattern with Hamming distance at most k. The optimal threshold $t_{best}$ for a lossless filter is the smallest h such that $m(P_h) \le k$.

    shape s = #-#---#-#-#------#
    span(s) = 18, pattern length l = 50, number of hits h in {0, ..., 33}

    h       0  1  2  3  4  5  6  7  8  9  ...  31  32  33
    m(P_h)  8  7  7  6  6  6  5  5  5  4  ...

    k = 7: t_best = 1
    k = 6: t_best = 3
    k = 5: t_best = 6
    k = 4: t_best = 9

    Figure: Computing the threshold t_best

5 Multi-shape Filters

Shapes can be better than contiguous q-grams because they introduce irregularity in the way the mismatching positions affect the q-grams. For good shapes, only a few worst-case configurations of the mismatching characters affect many q-grams. A reasonable approach to further improve the performance of filters is therefore to use two or more somehow orthogonal shapes in parallel. The idea is that those configurations of mismatches that are particularly bad for one shape are better covered by a second shape and vice versa.

Designing a good multi-shape filter is a nontrivial combinatorial problem, just like finding good individual shapes. One could assume that the best individual shapes also form the best multi-shape filter; however, our experiments suggest that this is often not the case.

Multi-shape filters are the most important application for our BDD-based approach. The extension of our algorithm to multi-shape filters is straightforward and it leads to a new concept: the generic filtration criterion C. The generic filtration criterion C replaces the threshold t of a single-shape filter. It enables a multi-shape filter to make full use of the relations between the single shapes.
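The threshold rule of Section 4.3, $t_{best} = \min \{ h \mid m(P_h) \le k \}$, can be reproduced by brute force for short patterns. This sketch enumerates all $2^l$ match-mismatch-patterns instead of taking the minimum over a BDD, so it only illustrates the definition; the names `min_mismatches` and `t_best` are hypothetical.

```python
from itertools import product

def hits(shape, p):
    """Number of positions at which the shape has a hit in bit-string p."""
    return sum(all(c != "#" or p[i + n] == "1" for n, c in enumerate(shape))
               for i in range(len(p) - len(shape) + 1))

def min_mismatches(shape, length):
    """m[h]: minimum number of 0s over all patterns with exactly h hits."""
    m = {}
    for bits in product("01", repeat=length):
        p = "".join(bits)
        h = hits(shape, p)
        m[h] = min(m.get(h, length + 1), p.count("0"))
    return m

def t_best(shape, length, k):
    """Smallest h such that some pattern with at most k mismatches has exactly h hits."""
    m = min_mismatches(shape, length)
    return min(h for h, zeros in m.items() if zeros <= k)

for k in range(3):
    print(k, t_best("#-##", 8, k))
```

For shape #-## and $l = 8$, a single mismatch at position 3 or 4 destroys three of the five hits, so the lossless threshold for $k = 1$ is 2. For $k = 2$, mismatches at positions 3 and 4 destroy all hits and the threshold drops to 0, meaning that no lossless filtering is possible with this shape.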

A filter with n shapes $s_1 \ldots s_n$ can use the q-gram similarities $h_1 = \mathrm{qgs}_{s_1}(M, P), \ldots, h_n = \mathrm{qgs}_{s_n}(M, P)$ to decide whether M is a potential match or not. (P is the pattern string and M is any substring of the database.) We call a set $C \subseteq \mathbb{N}^n$ a filtration criterion for the shapes $s_1 \ldots s_n$ and define:

$$M \text{ is a potential match} \iff (h_1, \ldots, h_n) \in C$$

This generic filtration criterion C can model many different strategies for multi-shape filters. For example, it can model filters that require at least one hit of one shape, filters that require one hit of each shape, filters that sum up the hits of the shapes, or filters that use each shape with its individual threshold $t_{best}$.

In Section 3 we used the notation $P_h(s)$ for the set of all match-mismatch-patterns with exactly h hits of a single fixed shape s. To analyse multi-shape filters we extend this notation to sets of shapes $\{s_1, \ldots, s_n\}$. We define $P_{(h_1, \ldots, h_n)}(s_1, \ldots, s_n)$ as the set of all match-mismatch-patterns with exactly $h_i$ hits of shape $s_i$ ($1 \le i \le n$). The sets $P_{(h_1, \ldots, h_n)}(s_1, \ldots, s_n)$ can be computed as:

$$P_{(h_1, \ldots, h_n)}(s_1, \ldots, s_n) = \bigcap_{1 \le i \le n} P_{h_i}(s_i)$$

With this, the set of match-mismatch-patterns that represent a potential match according to a filtration criterion C is:

$$PM = \bigcup_{(h_1, \ldots, h_n) \in C} P_{(h_1, \ldots, h_n)}(s_1, \ldots, s_n)$$

Together with the set $PM$, all statistical performance measures (recognition rate, specificity), which we computed for single-shape filters in Section 4, are now also available for our model of multi-shape filters.

The definition of $P_{(h_1, \ldots, h_n)}(s_1, \ldots, s_n)$ also makes it possible to compute an optimal filtration criterion $C_{best}$ for a lossless filter with some fixed Hamming distance k. It is:

$$C_{best} = \{ (h_1, \ldots, h_n) \mid m(P_{(h_1, \ldots, h_n)}(s_1, \ldots, s_n)) \le k \}$$

$C_{best}$ replaces the threshold $t_{best}$ of single-shape filters. To reduce the high complexity involved in the computation of $C_{best}$, Fontaine [] describes a straightforward approximation.
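The definition of $C_{best}$ says that a hit tuple is accepted exactly when some match-mismatch-pattern with at most k mismatches produces it. A brute-force Python sketch of that definition for tiny inputs is given below; the name `c_best` and the two toy shapes are assumptions made for illustration, and the enumeration stands in for the BDD-based computation.

```python
from itertools import product

def hits(shape, p):
    """Number of positions at which the shape has a hit in bit-string p."""
    return sum(all(c != "#" or p[i + n] == "1" for n, c in enumerate(shape))
               for i in range(len(p) - len(shape) + 1))

def c_best(shapes, length, k):
    """All hit tuples (h_1, ..., h_n) that occur for a pattern with at most k zeros."""
    return {tuple(hits(s, p) for s in shapes)
            for p in map("".join, product("01", repeat=length))
            if p.count("0") <= k}

criterion = c_best(("##", "#-#"), 4, 1)
print(sorted(criterion))
```

For the shapes ## and #-# with $l = 4$ and $k = 1$, the criterion contains (3, 2) for the mismatch-free pattern, (2, 1) for a mismatch at either end and (1, 1) for a mismatch at an inner position. The tuple (0, 0) is absent, so the resulting filter rejects every alignment without hits while remaining lossless.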
6 Designing Better Filters

The design of a filter is always a compromise between three objectives:

- high sensitivity
- fast filtration phase
- high specificity of the filter, i.e. a fast verification phase

There are several trade-offs between these objectives. For example, a higher sensitivity is usually at the cost of a lower specificity, and a faster filtration often yields lower sensitivities and specificities [5, 5, 8].

Using a well-chosen shape for the q-grams and the appropriate threshold can greatly improve overall filter performance compared to filtering with ungapped q-grams [6, 7]. In this section we will show that multi-shape filters with a carefully selected set of shapes and a specifically computed filtration criterion can further boost filter performance for all three objectives compared to single-shape filters.

A good estimate for the runtime of a q-gram filter is the number of hits in the database that have to be processed. It is roughly proportional to $|\Sigma|^{-q}$. (This assumes a database with a random distribution of letters from $\Sigma$, and it is also a good estimate, for example, for DNA sequences [7].) High values of q are desirable because they make the filtration fast; however, they also mean lower sensitivities.

In this section we only consider q-gram filters that work losslessly for a fixed Hamming distance k, and we use k to compare the sensitivities of such filters (a higher value of k means a higher sensitivity). To compare the specificities of different filters, we always use the shapes with the optimal threshold $t_{best}$ (the optimal filtration criterion $C_{best}$ for multi-shape filters) that still guarantees lossless filtering for the fixed k. For all experiments in this section, we use a pattern length $l = 50$ and assume a DNA-like database with $|\Sigma| = 4$.

There is a trade-off between k and the highest value of q that can be used for lossless filtering. For example, for pattern length $l = 50$ and $k = 5$ the highest possible q for a lossless single-shape filter is q = […], for $k = 6$ it is $q = 9$. Similar constraints between q and k also exist for multi-shape filters. However, we found that they can have higher values of both q and k than is possible for single shapes.
Therefore multi-shape filters make it possible to increase q, which makes them faster, or to increase k, i.e. the sensitivity. In some cases it is even possible to increase q and k at the same time. This is not at the cost of a lower specificity; instead it is even possible to improve the specificity as well.

Pairs of shapes: q = 10, k = 6

Consider for example the following three lossless two-shape filters for $k = 6$:

    Three good two-shape filters (k = 6, l = 50, |Σ| = 4)

         s_1                s_2                    specificity
    a)   ##-##---##-####    ###-#-###----#--##     8.9782 · 10^-8
    b)   #-##-###-####      ####----###--##-#      9.36443 · 10^-8
    c)   #-##-##--#####     ####--#--#---##--##    7.76365 · 10^-8

    Complement of C_best (the rejected hit tuples):
    a) and c)   {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)}    not (0, 2)!
    b)          {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)}

The best single-shape filter for this problem is ######-#-## with $q = 9$, $t_{best} = 2$ and a specificity of $3.83835 \cdot 10^{-6}$. Compared to this single-shape filter, each of these three multi-shape filters with two (q = 10)-shapes improves the specificity by a factor of 50. The runtime of a multi-shape filter is approximately the sum of the run-times computed for its individual shapes. This means that the filters with two (q = 10)-shapes for $|\Sigma| = 4$ are also about two times as fast as a $q = 9$ single-shape filter.

The three multi-shape filters of this example were found by scanning 5000 pairs of random (q = 10)-shapes. In this sample set, good pairs were extremely rare. 3439 of the pairs, i.e. more than two-thirds, did not yield a lossless filter for $k = 6$ at all.

It is interesting that none of the six shapes that comprise the three best pairs we found works particularly well as a single-shape filter. This suggests that combining good single-shape filters is not necessarily the best method to construct a good multi-shape filter. Also note that 5000 pairs of shapes is a relatively small random set. It is very likely that much better two-shape filters can be found with more extensive experiments.

4-tuples of shapes: q = 10, k = 8

In a second experiment we fixed q to 10 and tried to increase k. We generated 1,500,000 filters with 4-tuples of random (q = 10)-shapes and analysed each of these 4-tuples with our algorithm. In this sample set, the good 4-tuple filters were again rare. Nevertheless, we found 5 lossless filters for $k = 8$ with specificities of about […]. For comparison, the highest possible k for a lossless single-shape filter with $q = 9$ is $k = 6$. (A single-shape filter with $q = 9$ is about as fast as our 4-tuple filters.) The filtration criterion of the 5 4-tuple filters for $k = 8$ is that they require at least one hit of any of the four shapes.

4-tuples of shapes: q = 10, k = 7

Alternatively, each of the 5 good 4-tuple filters we found can also be used for $k = 7$ with a stricter filtration criterion. Although the computation of the exact filtration criterion $C_{best}$ for this problem has a high complexity, it is easy to compute a suitable approximation $C_{approx}$ []. The complement $\overline{C_{approx}}$ of one such approximation consists of 40 elements. This filtration criterion $C_{approx}$ guarantees lossless filtering for $k = 7$ and a specificity of $5.2288 \cdot 10^{-8}$.
The best set of four shapes out of,5, random ( = 5, k = 7, specificity = 5.228823e 8)

s1 = ##-#-###---#--#----##
s2 = ##-#--#--#-#-#--###
s3 = ###-#-##-#-###
s4 = ##-###-----##--#-#--#

C for the best 4-tuple filter and k = 7:

{(,,, ), (,,, ), (,,, 2), (,,, 3),(,,, 4), (,,, 5), (,,,), (,,, ), (,,, 2), (,, 2, ), (,, 2, ),(,,3, ),(,, 3, ), (,, 4,), (,, 4, ), (,, 5, ), (,,, ), (,,, ),(,,, 2),(,,, ), (,,,), (,, 2, ), (,, 3, ), (,, 4, ), (, 2,, ),(, 2,, ),(, 2,, ), (,3,,), (,,, ), (,,, ), (,,, 2), (,,, 4),(,,, ),(,, 2, ), (,, 4,), (,,, ), (,,, ), (, 2,, ), (2,,,), (5,,,)}

The experiments show that multi-shape filters can have significantly better specificities and work for higher values of k than single-shape filters. At the same time they can also speed up the filtration. It remains an open question whether there is an algorithm to construct good sets of shapes for multi-shape filters.
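To make the notion of a filtration criterion over several shapes concrete, the following Python sketch counts the hits of each shape on a single match-mismatch pattern and rejects the alignment when the resulting hit-count vector lies in a set of rejected vectors. The function names, the encoding of patterns as strings with '1' for a match and '0' for a mismatch, and the pattern length of 50 are assumptions made for this illustration only. The four shapes and the "at least one hit of any of the four shapes" criterion are the ones reported above for k = 8.

```python
from typing import AbstractSet, Sequence, Tuple

HitVector = Tuple[int, ...]


def count_hits(pattern: str, shape: str) -> int:
    """Count the start positions where every '#' of `shape` covers a match.

    `pattern` is a match-mismatch pattern; '1' marks a match and '0' a
    mismatch (this encoding is an assumption of the sketch).
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    last_start = len(pattern) - len(shape)
    return sum(
        all(pattern[start + o] == "1" for o in offsets)
        for start in range(last_start + 1)
    )


def hit_vector(pattern: str, shapes: Sequence[str]) -> HitVector:
    """One hit count per shape, in the order of `shapes`."""
    return tuple(count_hits(pattern, s) for s in shapes)


def passes_filter(
    pattern: str, shapes: Sequence[str], rejected: AbstractSet[HitVector]
) -> bool:
    """Accept unless the hit-count vector is one of the rejected vectors."""
    return hit_vector(pattern, shapes) not in rejected


# The four (q = 10)-shapes of the best 4-tuple listed above.
SHAPES = (
    "##-#-###---#--#----##",
    "##-#--#--#-#-#--###",
    "###-#-##-#-###",
    "##-###-----##--#-#--#",
)

# "At least one hit of any of the four shapes": only the all-zero
# hit-count vector is rejected.
AT_LEAST_ONE_HIT = frozenset({(0, 0, 0, 0)})

if __name__ == "__main__":
    m = 50  # hypothetical pattern length, chosen only for the demo
    print(passes_filter("1" * m, SHAPES, AT_LEAST_ONE_HIT))  # True
    print(passes_filter("0" * m, SHAPES, AT_LEAST_ONE_HIT))  # False
```

A stricter criterion, such as the one used for k = 7, would be expressed in this sketch simply by passing a larger set of rejected hit-count vectors.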

7 Conclusion

We described a new method for the analysis of gapped q-gram filters. This method uses bit-strings, which we call match-mismatch patterns, to describe possible alignments between the database and the search patterns. Sets of match-mismatch patterns provide a simple abstraction of filter algorithms for the k-differences problem.

The first step of our approach is to generate sets of match-mismatch patterns, in particular the set of match-mismatch patterns representing the potential matches. To implement this step efficiently, we use BDDs as a data structure to represent sets of bit-strings. In the second step, we use these BDD representations to compute many interesting properties of filters, such as the recognition rate and the specificity for various probability models.

Our approach is simple and general and applies to a variety of filter algorithms. For example, it can model single-shape filters with any threshold as well as generic multi-shape filters. Previous algorithms for filter-related problems were often based on dynamic programming. Compared to dynamic programming, our approach is more general and more natural, and it allows many interesting extensions.

The most important application of our approach is the analysis of multi-shape filters, which work with a set of shapes in parallel. For any set of shapes, our approach can compute an optimal filtration criterion C_best, which guarantees lossless filtering for the k-differences problem, as well as the sensitivities and specificities of the resulting multi-shape filter. We found that good multi-shape filters with a carefully selected set of shapes and a specifically computed filtration criterion C_best are much better than single-shape filters. They allow higher specificities and sensitivities than single-shape filters, and higher values of k are possible (for lossless filtering). Multi-shape filters can also be faster than single-shape filters, because they still work with higher values of q.
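The two-step scheme summarised above (first build the set of match-mismatch patterns of the potential matches, then derive filter properties from that set) can be imitated for toy parameters without BDDs. The following Python sketch is only an illustration under stated assumptions: it replaces the BDD representation by explicit enumeration, which is feasible only for very small pattern lengths. It encodes patterns as strings with '1' for a match and '0' for a mismatch, and it reads "lossless" as "no potential match produces a rejected hit-count vector". All function names are hypothetical.

```python
from itertools import combinations
from typing import AbstractSet, Iterator, Sequence, Set, Tuple

HitVector = Tuple[int, ...]


def count_hits(pattern: str, shape: str) -> int:
    """Count the start positions where every '#' of `shape` covers a '1'."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    last_start = len(pattern) - len(shape)
    return sum(
        all(pattern[start + o] == "1" for o in offsets)
        for start in range(last_start + 1)
    )


def potential_matches(m: int, k: int) -> Iterator[str]:
    """Step 1: all length-m patterns with at most k mismatches ('0').

    Explicit enumeration stands in for the BDD here (an assumption of the
    sketch); the number of patterns grows quickly with m and k.
    """
    for r in range(min(k, m) + 1):
        for mismatches in combinations(range(m), r):
            bits = ["1"] * m
            for i in mismatches:
                bits[i] = "0"
            yield "".join(bits)


def occurring_hit_vectors(shapes: Sequence[str], m: int, k: int) -> Set[HitVector]:
    """Step 2a: hit-count vectors produced by at least one potential match."""
    return {
        tuple(count_hits(p, s) for s in shapes)
        for p in potential_matches(m, k)
    }


def best_threshold(shape: str, m: int, k: int) -> int:
    """Step 2b: largest hit threshold that no potential match falls below."""
    return min(count_hits(p, shape) for p in potential_matches(m, k))


def is_lossless(
    shapes: Sequence[str], m: int, k: int, rejected: AbstractSet[HitVector]
) -> bool:
    """True if no potential match yields a rejected hit-count vector."""
    return occurring_hit_vectors(shapes, m, k).isdisjoint(rejected)


if __name__ == "__main__":
    print(best_threshold("###", 10, 1))   # contiguous 3-gram
    print(best_threshold("##-#", 10, 1))  # gapped shape of the same weight
    print(is_lossless(("###",), 10, 2, {(0,), (1,)}))
```

For the contiguous shape ### with pattern length 10 and one mismatch, the sketch yields a threshold of 5, which agrees with the classical q-gram lemma value 10 - 3 + 1 - 1·3. The point of the BDD representation in the paper is to avoid exactly the explicit enumeration used here.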
The BDD-based approach makes it possible to find good multi-shape filters by scanning a large number of randomly generated candidates. However, only a small fraction of these candidates show the desired properties. Since full enumeration as for single-shape filters [6] is not possible for multi-shape filters, a constructive algorithm to generate good sets of shapes remains an interesting open problem.

References

[1] B. Brejová, D. G. Brown, and T. Vinař. Optimal spaced seeds for hidden Markov models, with applications to homologous coding regions. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching, volume 2676 of LNCS, pages 42-54. Springer, 2003.

[2] B. Brejová, D. G. Brown, and T. Vinař. Vector seeds: an extension to spaced seeds allows substantial improvements in sensitivity and specificity. In Proc. 3rd International Workshop on Algorithms and Bioinformatics, volume 2812 of Lecture Notes in Bioinformatics, pages 39-54. Springer, 2003.

[3] R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, 35:677-691, 1986(8).

[4] R. E. Bryant. Symbolic boolean manipulation with ordered binary-decision diagrams. ACM Computing Surveys, 24:293-318, 1992(3).

[5] S. Burkhardt. Filter Algorithms for Approximate String Matching. PhD thesis, Department of Computer Science, Saarland University, 2002. http://www.mpisb.mpg.de/ stburk/thesis.ps.

[6] S. Burkhardt and J. Kärkkäinen. Better filtering with gapped q-grams. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching, volume 2089 of LNCS, pages 73-85. Springer, 2001.

[7] S. Burkhardt and J. Kärkkäinen. Better filtering with gapped q-grams. Fundamenta Informaticae, 56(1-2):51-70, 2003.

[8] A. Califano and I. Rigoutsos. FLASH: A fast look-up algorithm for string homology. In Proc. 1st International Conference on Intelligent Systems for Molecular Biology, pages 56-64. AAAI Press, 1993.

[9] K. P. Choi, F. Zeng, and L. Zhang. Good spaced seeds for homology search. Bioinformatics, 20(7):54 59, 2004.

[10] K. P. Choi and L. Zhang. Sensitivity analysis and efficient method for identifying optimal spaced seeds. Journal of Computer and System Sciences, 68:22-40, 2004.

[11] M. Fontaine. Computing the filtration efficiency of shape-index-filters for approximate string matching. Master's thesis, Dept. of Computer Science, Saarland University, Nov 2003. http://www.mpi-sb.mpg.de/ fontaine/thesis.ps.

[12] U. Keich, M. Li, B. Ma, and J. Tromp. On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3):253-263, 2004.

[13] G. Kucherov, L. Noé, and M. Roytberg. Multi-seed lossless filtration. To appear in CPM 2004.

[14] M. Li, B. Ma, D. Kisman, and J. Tromp. PatternHunter II: Highly Sensitive and Fast Homology Search. Journal of Bioinformatics and Computational Biology, 2004. To appear. Early version in GIW 2003.

[15] B. Ma, J. Tromp, and M. Li. Patternhunter: faster and more sensitive homology search. Bioinformatics, 18:440-445, 2002.

[16] L. Noé and G. Kucherov. YASS: Similarity search in DNA sequences. Technical report, INRIA Tech report 4852, 2003.

[17] Y. Sun and J. Buhler. Designing multiple simultaneous seeds for DNA similarity search. In Proceedings of the eighth annual international conference on Computational molecular biology, pages 76-84, 2004.

[18] J. Xu, D. Brown, M. Li, and B. Ma. Optimizing multiple spaced seeds for homology search. To appear in CPM 2004.