BDD-Based Analysis of Gapped q-gram Filters

Size: px
Start display at page:

Download "BDD-Based Analysis of Gapped q-gram Filters"

Transcription

1 BDD-Based Anaysis of Gapped q-gram Fiters Marc Fontaine, Stefan Burkhardt 2 and Juha Kärkkäinen 2 Max-Panck-Institut für Informatik Stuhsatzenhausweg 85, 6623 Saarbrücken, Germany e-mai: stburk@mpi-sb.mpg.de e-mai: fontaine@studcs.uni-sb.de 2 Department of Computer Science P.O.Box 68 (Gustaf Häströmin katu 2 B) FI-4 University of Hesinki, Finand e-mai: Juha.Karkkainen@cs.hesinki.fi Abstract. Recenty, there has been a surge of interest in gapped q-gram fiters for approximate string matching. Important design parameters for fiters are for exampe the vaue of q, the fiter-threshod and in particuar the shape (aka seed) of the fiter. A good choice of parameters can improve the performance of a q-gram fiter by orders of magnitude and optimising these parameters is a nontrivia combinatoria probem. We describe a new method for anaysing gapped q-gram fiters. This method is simpe and generic. It appies to a variety of fiters, overcomes many restrictions that are present in existing agorithms and can easiy be extended to new fiter variants. To impement our approach, we use an extended version of BDDs (Binary Decision Diagrams), a data structure that efficienty represents sets of bit-strings. In a second step, we define a new cass of muti-shape fiters and anayse these fiters with the BDD-based approach. Experiments show that muti-shape fiters can outperform the best singe-shape fiters, which are currenty in use, in many aspects. The BDD-based agorithm is crucia for the design and anaysis of these new and better muti-shape fiters. Our resuts appy to the k-mismatches probem, i.e. approximate string matching with Hamming distance. Introduction String matching invoves searching a given string or textua database T for occurrences of substrings that match a search pattern P. The approximate string matching probem aows the search pattern and the matches to have some difference or distance according to a given distance function. Many appications depend on efficient soutions of this probem, especiay in the fied of bio-informatics, where databases may consist of sequences of 9 nuceotides of DNA or of ong sequences of amino acids. This work was conducted in part at the MPI for computer science, Saarbrücken with support from the Future and Emerging Technoogies programme of the EU under contract number IST (ALCOM-FT) and at the University of Hesinki supported by the Academy of Finand grant

2 BDD-Based Anaysis of Gapped q-gram Fiters Fiter agorithms are a common approach for approximate string matching. They speedup string matching by quicky generating a set of potentia matches and discarding the rest of the database. The true matches can then be found in a second step, the verification phase, by inspecting a potentia matches. Designing a good fiter usuay means optimising the tradeoff between the compexity of the fitration phase and the efficiency of the fiter. Many efficient fiters work with precomputed indexes, in particuar, indexes that are based on gapped q-grams or shapes. For exampe the three 3-grams of string ACAGCT for shape ##-# are AC-G,CA-C and AG-T. A matching pair of q-grams between a pattern and a substring of T is caed a hit. The q-gram index stores the positions of a q-grams of the database and aows to find hits efficienty. If the number of hits between the search pattern and a substring of the database exceeds a certain threshod t, that substring is caed a potentia match. As first shown in [6, 7], the performance of the fiter depends cruciay on the shape. Good shapes are found by anaysing arge sets of shapes, because no method for directy generating good shapes has been found yet. Even anaysing a singe shape is non-trivia and a ot of effort has gone into deveoping methods for this purpose. Recenty gapped shapes have been the focus of quite a bit of attention [6, 9,, 2]. In [6, 7], Burkhardt and Kärkkäinen compute the optima threshod. It is the highest threshod that sti aows the fiter to return a true matches, i.e., substrings that are within a fixed Hamming distance from the pattern. They aso compute a measure caed the minimum coverage, which provides a rough estimate of how many fase matches get through the fiter. The true positive rates and the fase positive rates were determined experimentay for seected shapes. If the threshod or Hamming distance is increased, the fiter aso discards some true matches, which are then caed fase negatives. In [5] an abstract measure for the fase negative probabiity, the so-caed recognition rate was defined and anaysed experimentay. The exact computation of both fase positive and fase negative rates was done by Ma, Tromp and Li [5]. However, their agorithms are restricted to fiters that count ony non-overapping hits. This is a significant restriction as the ower correation of overapping q-grams is a big advantage of gapped q-grams over ungapped ones [7]. Brejová, Brown and Vinař [, 2] deveop new variants of the agorithm of Ma, Tromp and Li. In [], a true match is defined not directy by a Hamming distance but as a probabiity distribution represented by a hidden Markov mode. In [2], they further generaise the approach to approximate hits and mutipe shapes. However, the restriction to non-overapping hits remains in a of their work. Very recenty, we became aware of severa papers using mutipe gapped shapes for approximate string matching in various different approaches [3, 7, 8, 4]. We present a new and fexibe method for computing various properties of q-gram fiters. This agorithm is based on a simpe natura abstraction of the probem and appies to a genera cass of fiters. At the same time, it overcomes many restrictions present in previous agorithms in particuar the non-overapping hit restriction. Our method consists of two steps. The first step is an agorithm based on sets of bit-strings. These sets can be of exponentia size and we have to use a compact and efficient representation of the sets to actuay impement our agorithm. A data 57

3 Proceedings of the Prague Stringoogy Conference 4 structure caed BDDs [3, 4] (Binary decision diagrams or Binary decomposition diagrams) impements such a representation of sets. Ony the use of a data structure ike BDDs makes it feasibe to run our agorithm. In the second step the BDDs generated in the first step can be used to efficienty compute interesting properties of the sets they represent. Most properties can be computed in inear time of the size of the BDDs. This cear spit of the probem into two steps distinguishes our method from previous agorithms, which were mosty based on dynamic programming. In the second part of this work, we appy our method to design new and better fiters. The basic idea in this part is to fiter with a set of shapes simutaneousy. Muti-shape fiters have been used before [8]. Our new idea is to use a carefuy seected set of shapes together with a specificay computed fitration criterion. This fitration criterion repaces the optima threshod of a singe shape fiter. The BDD-based agorithm aows us to compute the best fitration criterion for a set of shapes and at the same time determine the important quaity measures of the resuting muti-shape fiter. To investigate the potentia of muti-shape fiters based on a specific fitration criterion, we anayse arge sets of randomy generated fiters. These experiments show that good muti-shape fiters are very rare, but the experiments aso yied fiters that are superior to singe-shape fiters in severa important aspects. 2 Representing Match-Mismatch-Patterns with BDDs Let A and B be two strings of ength. We ca the bit-string p(a, B) {, } the match-mismatch-pattern of the two strings. A in p(a, B) denotes a matching position and a mismatch. The Hamming distance of A and B is then the number of zeros in p(a, B). We represent the number of zeros and ones in a bit-string p by p and p. The k-differences version of approximate string matching aows a pattern string and a match to have a Hamming distance of at most k. For most fiters, the match-mismatch-pattern of an aignment between the pattern string and the database at some position x contains enough information to decide whether x is returned as a potentia match or not. Therefore it is, in principe, sufficient to enumerate a possibe match-mismatch-patterns to anayse the performance of fiters for the k-differences probem. A drawback of this brute-force-approach is, that there are 2 possibe matchmismatch-patterns for strings of ength and reaistic fiters usuay work with pattern ength 5. To overcome this compexity-probem we use a data structure caed BDDs. BDDs aow a compact and efficient representation of sets of equa-ength bit-strings. They can be seen as an abstract data structure that supports the foowing operations. Creation of a new BDD for the base-cases and {ǫ}. Composition of two BDDs S = comp(s, S ) Decomposition of a BDD into two BDDs according to the first position of the bit-strings in the BDD. 58

4 BDD-Based Anaysis of Gapped q-gram Fiters Computing and of two BDDs and the compement of a BDD The composition comp(s, S ) represents the set: comp(s, S ) = {s (s = a a S ) (s = b b S )} For two sets A and B that are given as decompositions A = comp(a, A ) and B = comp(b, B ), A B and A B can be computed recursivey as: and: A B = comp(a B, A B ) A B = comp(a B, A B ) BDDs are impemented with DAGs (Directed Acycic Graphs) and they are simiar to finite automata without oops. In a BDD aways the minima, smaest possibe DAG is used to represent a set of bit-strings and equa sets are represented by one canonica node of the DAG. A coection of BDDs can share the structure of a singe DAG and BDD-impementations usuay make use of hash-tabes to maintain the canonica-representation-property. Hash tabes are aso used to avoid re-computations during the computation of and. ǫ ǫ ǫ A simpe BDD and the set it represents BDDs have many different appications where they often make it possibe to hande exponentia size sets within non-exponentia compexity. An introduction to BDDs can be found in [3, 4], where BDDs are used to represent booean functions over a finite set of variabes. The actua performance of BDDs in an appication depends on the structure of the sets they represent. For sets of size 2 the space compexity of BDDs can range from O() to Θ(2 ). It is important to note, that we use BBDs ony to anayse q-gram fiters. This is a one-time computation and the compexity of the BDDs does not interfere with the compexity of the fiters under consideration. The theoretica compexity of BDDs is therefore secondary in our appication. Our experience is, that BDDs work we to reduce the compexity of fiter anaysis. They make is possibe to anayse a interesting fiters within reasonabe time and space imits. In [] Fontaine described a extension of standard BDDs, which uses {, } as an additiona base case for the decomposition (so caed -BDDs). This extension aows more compact representation than standard BDDs and it was used for a our experiments. For code istings and runtime measurements of a prototype -BDD impementation see []. The prototype impementation ony consists of about kb of C++ code and computing the fiter properties for a typica shape ony takes a few seconds. 59

5 Proceedings of the Prague Stringoogy Conference 4 3 q-gram Simiarity-Based Fiters We use strings from {#,-} to denote different shapes. # stands for a position that must match, whereas - is a don t care or wid-card position. span(s) = s is the span of a shape s. Let A and B be two strings of ength and s a shape. A position i span(s) is a hit of shape s if n < span(s) : s(n) = # A(i + n) = B(i + n). The number of different hits of a shape s for two strings A and B is caed the q-gram simiarity qgs s (A, B) of the two strings. For strings of ength the q-gramsimiarity can be at most span(s) +. A= ACTGTACTGCCGTACT B= ACTGTAATGCAGTACT p(a, B)= shape s= ###---##--## qgs s (A, B)= 2 ACTGTACTGCCGTACT ###---?#--?# ###---##--## <- hit ###---##--## <- hit ###---#?--## ##?---?#--## ACTGTAATGCAGTACT Match-mismatch-pattern and q-gram-simiarity A q-gram fiter computes the set of potentia matches with the hep of a threshod t. A potentia match is the position of a substring in the database with a q-gram simiarity of at east t with the pattern string. Increasing the threshod of a fiter reduces the number of potentia matches at the cost of a decreased fiter sensitivity, i.e. the fiter is more ikey to overook true matches. The match-mismatch-pattern of two strings contains sufficient information to compute their q-gram-simiarity. Therefore a fiter can be anaysed by ooking at a possibe match-mismatch-patterns. We can partition the set of a possibe matchmismatch-patterns according to the q-gram-simiarity they represent for a given shape. For any fixed shape s and any h, N we define: P h = {p {, } s produces exacty h hits in p} It foows that the set P M of match-mismatch-pattern that represent a potentia match is: PM = P h h t A set P h can easiy be computed based on the sets P h if: and P h. A matchmismatch-pattern p {, } is in P h either: its suffix of ength is in P h position or: its suffix of ength is in P h. and it has an additiona hit of shape s at and it does not have an additiona hit at position This agorithm can be formuated as a simpe equation for sets P h = (expand(p h ) S (s)) (expand(p h ) S (s)) 6

6 BDD-Based Anaysis of Gapped q-gram Fiters with the foowing three definitions: S (s) = {p {, } s has a hit in p at position } S (s) = {p {, } s does not have a hit in p at position } expand(m) = {x x = m x = m, m M} BDDs directy support and, and expand(m) can be impemented as expand(m)= comp(m, M). BDDs aso support the creation of S (s) and S (s) for any shape s. S (s) can be computed recursivey as: {, } if s = ǫ if < span(s) S (s) = comp(, S (r)) if s = #r comp(s (r), S (r)) if s = -r Shape=#-## P4 = S 4 = {, } expand(p4 ) = {,,, } P5 = {,,,,, } P5 2 = {} P5 = {,,,...} P h and S (s) for shape #-## S {, } span(s) As an aternative to our definition of the q-gram-simiarity qgs(a, B) it is possibe to require individua hits to be non-overapping [5, ]. For such fiters the set PM can be computed with an agorithm simiar to the one described above. (Compute the sets P (h,i), where i is the offset of the first hit.) 4 Fiter Anaysis with BDDs The agorithm described in the previous section aows us to generate BDDrepresentations for the sets P h. These BDD-representations can be used to compute many interesting properties of the sets and thereby the underying fiters. Note that the computation of the various properties is independent of what fiter the sets P h represent and how they were computed. This is in contrast to previous approaches using dynamic programming where the fiter definition is deepy invoved in the property computation. 4. Specificity The specificity of a fiter describes its abiity to reduce a arge database to a sma set of potentia matches. For a given random mode, the fiter specificity is equivaent to the probabiity that a random substring of ength is a potentia match of a random search pattern. 6

7 Proceedings of the Prague Stringoogy Conference 4 Every match-mismatch-pattern p describes one possibe event that can occur whie aigning a database and a search pattern and we can use severa probabiity modes to assign probabiities to these events. We can then simpy extend these probabiities from one match-mismatch-pattern to sets of match-mismatch-patterns by summing up the probabiities of the eements of the sets. For exampe, to anayse a fiter for a DNA database, we might assume that the database and pattern string are independent random strings with an even distribution of the etters {A, C, G, T }. It foows that every singe character has a chance of 4 being a match and the probabiity of any match-mismatch-pattern p is: prob(p) = ( 4 ) p ( 3 4 ) p With this we can compute the probabiity of a potentia match, i.e the specificity of the fiters as: specificity = prob(p) p PM Given the binary decomposition comp(p, P ) of a set P the probabiity Prob(P) of the set is: Prob(P) = ( 3 4 ) Prob(P ) + ( 4 ) Prob(P ) The base-cases for the binary decomposition are aso the base-cases for this recursion: Prob( ) = Prob(ǫ) = This shows that, if BDDs are used to represent the sets, Prob(P) = p P prob(p) can be computed in inear time of the size of the BDDs. It can be seen that a simiar approach aows to compute the probabiities of sets for many different probabiity modes efficienty. In particuar it is aso possibe to use hidden Markov modes (HMMs) as probabiity mode. HMMs have been used in [] to mode rea DNA sequences of different species. 4.2 Recognition Rate For approximate string matching with Hamming distance we can define the recognition rate r(j) of a fiter as the expected fraction of potentia matches among substrings of the database with exacty Hamming distance j. The match-mismatch-patterns of ength and Hamming distance j can easiy be computed with the singe-character shape # as P j (#). It foows that a fiter with potentia matches P M has the recognition rate: r(j) = Prob(PM P j (#)) Prob(P j (#)) Recognition rates have been defined and determined experimentay in [5]. 4.3 Threshod The set of potentia matches of a fiter with shape s, and with it the recognition rates of the fiter, heaviy depends on the threshod t. 62

8 BDD-Based Anaysis of Gapped q-gram Fiters A fiter is ossess for a threshod t and Hamming distance k if j k : r(j) =, otherwise it is ossy. If one is interested in a fixed maxima Hamming distance k and ossess fitering, then there exists an optima threshod t best. A dynamic programming agorithm for computing t best is described in [7]. BDD-based threshod computation is aso possibe. For each set P h we compute: m(p) = min p P p We use the notation p for the number of occurrences of in string p. m(p h ) is the minimum number of mismatching positions of any match-mismatch-pattern p P h This minimum can be found in inear time in the size of the BDD. Any set P h with m(p h ) k contains at east one match-mismatch-pattern with Hamming distance at most k. The optima threshod t best for a ossess fiter is the smaest h such that m(p h ) k. shape s = #-#---#-#-#------# span(s) = 8 pattern ength = 5 number of hits h {,..., 33} h m(p h) k = 7 t best = k = 6 t best = 3 k = 5 t best = 6 k = 4 t best = 9 Computing the threshod t best. 5 Muti-shape Fiters Shapes can be better than contiguous q-grams because they introduce irreguarity in the way the mismatching positions affect the q-grams. For good shapes, ony a few worst case configurations of the mismatching characters affect many q-grams. A reasonabe approach to further improve the performance of fiters is therefore to use two or more somehow orthogona shapes in parae. The idea is, that those configurations of mismatches, that are particuary bad for one shape, are better covered by a second shape and vice versa. Designing a good muti-shape fiter is a nontrivia combinatoria probem, just ike finding good individua shapes. One coud assume that the best individua shapes aso form the best muti-shape fiter, however our experiments suggest that this is often not the case. Muti-shape fiters are the most important appication for our BDD-based approach. The extension of our agorithm to muti-shape fiters is straight-forward and it eads to a new concept: the generic fitration criterion C. The generic fitration criterion C repaces the threshod t of a singe-shape fiter. It enabes a muti-shape fiter to make fu use of the reations between the singe shapes. 63

9 Proceedings of the Prague Stringoogy Conference 4 A fiter with n shapes s...s n can use the q-gram simiarities h = qgs s (M, P)... h n = qgs sn (M, P) to decide whether M is a potentia match or not. (P is the pattern string and M is any substring of the database.) We ca a set C N n a fitration criterion for the shapes s...s n and define: M is a potentia match (h,...,h n ) C This generic fitration criterion C can mode many different strategies for mutishape fiters. For exampe it can mode fiters that require at east one hit of one shape, fiters the require one hit of each shape, fiters that sum up the hits of the shapes, or fiters that use each shape with its individua threshod t best. In Section 3 we used the notation P h (s) for the set of a match-mismatch-patterns with exacty h hits of a singe fixed shape s. To anayse muti-shape fiters we extend this notation to sets of shapes {s,...,s n }. We define P (h,...,h n) (s,...,s n ) as the set of a match match-mismatch-patterns with exacty h i hits of shape s i ( i n). The sets P (h,...,h n) (s,...,s n ) can be computed as: P (h,...,h n) (s,..., s n ) = P h i (s i ) i n With this, the set of match-mismatch-patterns, that represent a potentia match according to a fitration criterion C is: PM = (h,...,h n) C P (h,...,h n) (s,...,s n ) Together with the set P M, a statistica performance measures (recognition rate, specificity), which we computed for singe-shape fiters in Section 3, are now aso avaiabe for our mode of muti-shape fiters. The definition of P (h,...,h n) (s,...,s n ) aso makes it possibe to compute a optima fitration criterion C best for a ossess fiter with some fixed Hamming distance k. It is: C best = {(h... h n ) m(p (h,...,h n) (s,...,s n )) k} C best repaces the threshod t best of singe shape fiters. To reduce the high compexity invoved in the computation of C best Fontaine [] describes a straight forward approximation. 6 Designing Better Fiters The design of a fiter is aways a compromise between three objectives: high sensitivity fast fitration phase high specificity of the fiter, i.e. a fast verification phase 64

10 BDD-Based Anaysis of Gapped q-gram Fiters There are severa trade-offs between these objectives. For exampe, a higher sensitivity is usuay at the cost of a ower specificity and a faster fitration often yieds ower sensitivities and specificities [5, 5, 8]. Using a we chosen shape for the q-grams and the appropriate threshod can greaty improve overa fiter performance compared to fitering with ungapped q- grams [6, 7]. In this section we wi show that muti-shape fiters with a carefuy seected set of shapes and a specificay computed fitration criterion can further boost fiter performance for a three objectives compared to singe-shape fiters. A good estimate for the runtime of a q-gram fiter is the number of hits in the database that have to be processed. It is roughy proportiona to Σ q. (This assumes a database with a random distribution of etters from Σ and it is aso a good estimate for exampe for DNA sequences [7].) High vaues of q are desirabe because they make the fitration fast however they aso mean ower sensitivities. In this section we ony consider q-gram fiters that work ossess for a fixed Hamming distance k and we use k to compare the sensitivities of such fiters (a higher vaue of k means a higher sensitivity). To compare the specificities of different fiters, we aways use the shapes with the optima threshod t best (the optima fitration criterion C best for muti-shape fiters) that sti guarantees ossess fitering for the fixed k. For a experiments in this section, we use a pattern ength = 5 and assume a DNA-ike database with Σ = 4. There is a trade-off between k and the highest vaue of q that can be used for ossess fitering. For exampe for pattern ength = 5 and k = 5 the highest possibe q for a ossess singe-shape fiter is q =, for k = 6 it is q = 9. Simiar constraints between q and k aso exist for muti-shape fiters. However we found that they can have higher vaues of both q and k than is possibe for singe shapes. Therefore muti-shape fiters make it possibe to increase q, which makes them faster, or increase k, i.e the sensitivity. In some cases it is even possibe to increase q and k at the same time. This is not at the cost of a ower specificity, but instead it is even possibe to increase the specificity aso. Pairs of shapes: q =, k = 6 Consider for exampe the foowing three ossess two-shape fiters for k = 6: Three good two-shapes fiters k = 6, = 5, Σ = 4 s s 2 specificity a) ##-##---##-#### ###-#-###----#--## b) #-##-###-#### ####----###--##-# c) #-##-##--##### ####--#--#---##--## C best a) and c) {(, ), (, ), (, ), (, ), (2, )} not (, 2)! b) {(, ), (, ), (, 2), (, ), (, ), (2, )} The best singe shape fiter for this probem is ######-#-## with q = 9, t best = 2 and a specificity of Compared to this singe shape fiter, each of these three muti-shape fiters with two (q = )-shapes improves the specificity by a factor of 5. The runtime of a muti-shape fiter is approximatey the sum of the 65

11 Proceedings of the Prague Stringoogy Conference 4 run-times computed for its individua shapes. This means that the fiters with two (q = )-shapes for Σ = 4 are aso about two times as fast as a q = 9 singe-shape fiter. The three muti-shape fiters of this exampe were found by scanning 5 pairs of random (q = )-shapes. In this sampe set, good pairs were extremey rare of the pairs, i.e. more than two-thirds, did not yied a ossess fiter for k = 6 at a. It is interesting that none of the six shapes that comprise the three best pairs, we found, work particuary we as a singe-shape fiter. This suggests that combining good singe-shape fiters is not necessariy the best method to construct a good mutishape fiter. Aso note, that 5 pairs of shapes is a reativey sma random set. It is very ikey that much better two-shape fiters can be found with more extensive experiments. 4-tupes of shapes: q =, k = 8 In a second experiment we fixed q to and tried to increase k. We generated, 5, fiters with 4-tupes of random (q = )-shapes and anaysed each of these 4-tupes with our agorithm. In this sampe set, the good 4-tupe fiters were again rare. Nevertheess, we found 5 ossess fiters for k = 8 with specificities of about 3. For comparison, the highest possibe k for a ossess singe-shape fiter with q = 9 is k = 6. (A singe-shape fiter with q = 9, is about as fast as our 4-tupe fiters.) The fitration criterion of the 5 4-tupes fiters for k = 8 is that they require at east one hit of any of the four shapes. 4-tupes of shapes: q =, k = 7 Aternativey, each of the 5 good 4-tupe fiters, we found, can aso be used for k = 7 with a stricter fitration criterion. Athough the computation of the exact fitration criterion C best for this probem has a high compexity, it is easy to compute an suitabe approximation C approx []. The compement C approx of one such approximation consists of 4 eements. This fitration criterion C approx guarantees ossess fitering for k = 7 and a specificity of The best set of four shapes out of,5, random = 5, k = 7, specificity = e 8 s =##-#-###---#--#----## s 2 =##-#--#--#-#-#--### s 3 =###-#-##-#-### s 4 =##-###-----##--#-#--# C for the best 4-tupe fiter and k = 7 {(,,, ), (,,, ), (,,, 2), (,,, 3),(,,, 4), (,,, 5), (,,,), (,,, ), (,,, 2), (,, 2, ), (,, 2, ),(,,3, ),(,, 3, ), (,, 4,), (,, 4, ), (,, 5, ), (,,, ), (,,, ),(,,, 2),(,,, ), (,,,), (,, 2, ), (,, 3, ), (,, 4, ), (, 2,, ),(, 2,, ),(, 2,, ), (,3,,), (,,, ), (,,, ), (,,, 2), (,,, 4),(,,, ),(,, 2, ), (,, 4,), (,,, ), (,,, ), (, 2,, ), (2,,,), (5,,,)} The experiments show that muti-shape fiters can have significanty better specificities and work for higher vaues of k than singe-shape fiters. At the same time they can aso speed up the fitration. It remains an open question if there is an agorithm to construct good sets of shapes for muti-shape fiters. 66

12 BDD-Based Anaysis of Gapped q-gram Fiters 7 Concusion We described a new method for the anaysis of gapped q-gram fiters. This method uses bit-strings, we ca them match-mismatch-patterns, to describe possibe aignments between the database and the search patterns. Sets of match-mismatchpatterns provide a simpe abstraction of fiter agorithms for the k-differences probem. The first step of our approach is to generate sets of match-mismatch-patterns, in particuar the set of match-mismatch-patterns representing the potentia matches. To impement this step efficienty, we use BDDs as a data structure to represent sets of bit-strings. In the second step, we can then use these BDD representations to compute many interesting properties of fiters ike the recognition rate and specificity for various probabiity modes. Our approach is simpe and genera and appies to a variety of fiter agorithms. For exampe, it can mode singe-shape fiters with any threshod and generic mutishape fiters. Previous agorithms for fiter-reated probems were often based on dynamic programming. Compared to dynamic programming, our approach is more genera and more natura and aows many interesting extensions. The most important appication of our approach is the anaysis of muti-shape fiters, which work with a set of shapes in parae. For any set of shapes, our approach can compute an optima fitration criterion C best, which guarantees ossess fitering for the k-differences probem and aso the sensitivities and specificities of the resuting muti-shape fiter. We found, that good muti-shape fiters with a carefuy seected set of shapes and a specificay computed fitration criterion C best are much better than singe-shape fiters. They aow higher specificities and sensitivities than singe shape fiters and higher vaues of k are possibe (for ossess fitering). Muti-shape fiters can aso be faster than singe shape fiters, because they sti work with higher vaues of q. The BDD-based approach makes it possibe to find good muti-shape fiters by scanning a arge number of randomy generated candidates. However, ony a sma fraction of these candidates show the desired properties. Since fu enumeration as for singe-shape fiters [6] is not possibe for muti-shape fiters, a constructive agorithm to generate good sets of shapes remains an interesting open probem. References [] B. Brejová, D. G. Brown, and T. Vinař. Optima spaced seeds for hidden Markow modes, with appications to homoogous coding regions. In Proc. 4th Annua Symposium on Combinatoria Pattern Matching, voume 2676 of LNCS, pages Springer, 23. [2] B. Brejová, D. G. Brown, and T. Vinař. Vector seeds: an extension to spaced seeds aows substantia improvements in sensitivity and specificity. In Proc. 3rd Internationa Workshop on Agorithms and Bioinformatics, voume 282 of Lecture Notes in Bioinformatics, pages Springer, 23. [3] R. E. Bryant. Graph-based agorithms for booean function manipuation. IEEE Transactions on Computers, 35:677 69, 986(8). 67

13 Proceedings of the Prague Stringoogy Conference 4 [4] R. E. Bryant. Symboic booean manipuation with ordered binary-decision diagrams. ACM Computing Surveys, 24:293 38, 992(3). [5] S. Burkhardt. Fiter Agorithms for Approximate String Matching. PhD thesis, Department of Computer Science, Saarand University, stburk/thesis.ps. [6] S. Burkhardt and J. Kärkkäinen. Better fitering with gapped q-grams. In Proc. 2th Annua Symposium on Combinatoria Pattern Matching, voume 289 of LNCS, pages Springer, 2. [7] S. Burkhardt and J. Kärkkäinen. Better fitering with gapped q-grams. Fundamenta Informaticae, 56( 2):5 7, 23. [8] A. Caifano and I. Rigoutsos. FLASH: A fast ook-up agorithm for string homoogy. In Proc. st Internationa Conference on Inteigent Systems for Moecuar Bioogy, pages AAAI Press, 993. [9] K. P. Choi, F. Zeng, and L. Zhang. Good spaced seeds for homoogy search. Bioinformatics, 2(7):54 59, 24. [] K. P. Choi and L. Zhang. Sensitivity anaysis and efficient method for identifying optima spaced seeds. Journa of Computer and System Sciences, 68:22 4, 24. [] M. Fontaine. Computing the fitration efficiency of shape-index-fiters for approximate string matching. Master s thesis, Dept. of Computer Science, Saarand University, Nov fontaine/thesis.ps. [2] U. Keich, M. Li, B. Ma, and J. Tromp. On spaced seeds for simiarity search. Discrete Appied Mathematics, 38(3): , 24. [3] G. Kucherov, L. Noé, and M. Roytberg. Muti-seed ossess fitration. To appear in CPM 24. [4] M. Li, B. Ma, D. Kisman, and J. Tromp. PatternHunter II: Highy Sensitive and Fast Homoogy Search. Journa of Bioinformatics and Computationa Bioogy, 24. To appear. Eary version in GIW 23. [5] B. Ma, J. Tromp, and M. Li. Patternhunter: faster and more sensitive homoogy search. Bioinformatics, 8:44 445, 22. [6] L. Noè and G. Kucherov. YASS: Simiarity search in DNA sequences. Technica report, INRIA Tech report 4852, 23. [7] Y. Sun and J. Buher. Designing mutipe simutaneous seeds for DNA simiarity search. In Proceedings of the eighth annua internationa conference on Computationa moecuar bioogy, pages 76 84, 24. [8] J. Xu, D. Brown, M. Li, and B. Ma. Optimizing mutipe spaced seeds for homoogy search. To appear in CPM

A Brief Introduction to Markov Chains and Hidden Markov Models

A Brief Introduction to Markov Chains and Hidden Markov Models A Brief Introduction to Markov Chains and Hidden Markov Modes Aen B MacKenzie Notes for December 1, 3, &8, 2015 Discrete-Time Markov Chains You may reca that when we first introduced random processes,

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

A. Distribution of the test statistic

A. Distribution of the test statistic A. Distribution of the test statistic In the sequentia test, we first compute the test statistic from a mini-batch of size m. If a decision cannot be made with this statistic, we keep increasing the mini-batch

More information

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After

More information

Combining reaction kinetics to the multi-phase Gibbs energy calculation

Combining reaction kinetics to the multi-phase Gibbs energy calculation 7 th European Symposium on Computer Aided Process Engineering ESCAPE7 V. Pesu and P.S. Agachi (Editors) 2007 Esevier B.V. A rights reserved. Combining reaction inetics to the muti-phase Gibbs energy cacuation

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

8 Digifl'.11 Cth:uits and devices

8 Digifl'.11 Cth:uits and devices 8 Digif'. Cth:uits and devices 8. Introduction In anaog eectronics, votage is a continuous variabe. This is usefu because most physica quantities we encounter are continuous: sound eves, ight intensity,

More information

THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS

THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS ECCM6-6 TH EUROPEAN CONFERENCE ON COMPOSITE MATERIALS, Sevie, Spain, -6 June 04 THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS M. Wysocki a,b*, M. Szpieg a, P. Heström a and F. Ohsson c a Swerea SICOMP

More information

Target Location Estimation in Wireless Sensor Networks Using Binary Data

Target Location Estimation in Wireless Sensor Networks Using Binary Data Target Location stimation in Wireess Sensor Networks Using Binary Data Ruixin Niu and Pramod K. Varshney Department of ectrica ngineering and Computer Science Link Ha Syracuse University Syracuse, NY 344

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

An Algorithm for Pruning Redundant Modules in Min-Max Modular Network

An Algorithm for Pruning Redundant Modules in Min-Max Modular Network An Agorithm for Pruning Redundant Modues in Min-Max Moduar Network Hui-Cheng Lian and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University 1954 Hua Shan Rd., Shanghai

More information

Explicit overall risk minimization transductive bound

Explicit overall risk minimization transductive bound 1 Expicit overa risk minimization transductive bound Sergio Decherchi, Paoo Gastado, Sandro Ridea, Rodofo Zunino Dept. of Biophysica and Eectronic Engineering (DIBE), Genoa University Via Opera Pia 11a,

More information

Count-Min Sketches for Estimating Password Frequency within Hamming Distance Two

Count-Min Sketches for Estimating Password Frequency within Hamming Distance Two Count-Min Sketches for Estimating Password Frequency within Hamming Distance Two Leah South and Dougas Stebia Schoo of Mathematica Sciences, Queensand University of Technoogy, Brisbane, Queensand, Austraia

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

Collective organization in an adaptative mixture of experts

Collective organization in an adaptative mixture of experts Coective organization in an adaptative mixture of experts Vincent Vigneron, Christine Fuchen, Jean-Marc Martinez To cite this version: Vincent Vigneron, Christine Fuchen, Jean-Marc Martinez. Coective organization

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm 1 Asymptotic Properties of a Generaized Cross Entropy Optimization Agorithm Zijun Wu, Michae Koonko, Institute for Appied Stochastics and Operations Research, Caustha Technica University Abstract The discrete

More information

Statistical Learning Theory: A Primer

Statistical Learning Theory: A Primer Internationa Journa of Computer Vision 38(), 9 3, 2000 c 2000 uwer Academic Pubishers. Manufactured in The Netherands. Statistica Learning Theory: A Primer THEODOROS EVGENIOU, MASSIMILIANO PONTIL AND TOMASO

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

CS229 Lecture notes. Andrew Ng

CS229 Lecture notes. Andrew Ng CS229 Lecture notes Andrew Ng Part IX The EM agorithm In the previous set of notes, we taked about the EM agorithm as appied to fitting a mixture of Gaussians. In this set of notes, we give a broader view

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997 Stochastic utomata etworks (S) - Modeing and Evauation Pauo Fernandes rigitte Pateau 2 May 29, 997 Institut ationa Poytechnique de Grenobe { IPG Ecoe ationae Superieure d'informatique et de Mathematiques

More information

#A48 INTEGERS 12 (2012) ON A COMBINATORIAL CONJECTURE OF TU AND DENG

#A48 INTEGERS 12 (2012) ON A COMBINATORIAL CONJECTURE OF TU AND DENG #A48 INTEGERS 12 (2012) ON A COMBINATORIAL CONJECTURE OF TU AND DENG Guixin Deng Schoo of Mathematica Sciences, Guangxi Teachers Education University, Nanning, P.R.China dengguixin@ive.com Pingzhi Yuan

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

arxiv: v1 [math.ca] 6 Mar 2017

arxiv: v1 [math.ca] 6 Mar 2017 Indefinite Integras of Spherica Besse Functions MIT-CTP/487 arxiv:703.0648v [math.ca] 6 Mar 07 Joyon K. Boomfied,, Stephen H. P. Face,, and Zander Moss, Center for Theoretica Physics, Laboratory for Nucear

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Agorithmic Operations Research Vo.4 (29) 49 57 Approximated MLC shape matrix decomposition with intereaf coision constraint Antje Kiese and Thomas Kainowski Institut für Mathematik, Universität Rostock,

More information

Traffic data collection

Traffic data collection Chapter 32 Traffic data coection 32.1 Overview Unike many other discipines of the engineering, the situations that are interesting to a traffic engineer cannot be reproduced in a aboratory. Even if road

More information

A Novel Learning Method for Elman Neural Network Using Local Search

A Novel Learning Method for Elman Neural Network Using Local Search Neura Information Processing Letters and Reviews Vo. 11, No. 8, August 2007 LETTER A Nove Learning Method for Eman Neura Networ Using Loca Search Facuty of Engineering, Toyama University, Gofuu 3190 Toyama

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees Improving the Accuracy of Booean Tomography by Expoiting Path Congestion Degrees Zhiyong Zhang, Gaoei Fei, Fucai Yu, Guangmin Hu Schoo of Communication and Information Engineering, University of Eectronic

More information

Paragraph Topic Classification

Paragraph Topic Classification Paragraph Topic Cassification Eugene Nho Graduate Schoo of Business Stanford University Stanford, CA 94305 enho@stanford.edu Edward Ng Department of Eectrica Engineering Stanford University Stanford, CA

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information

AST 418/518 Instrumentation and Statistics

AST 418/518 Instrumentation and Statistics AST 418/518 Instrumentation and Statistics Cass Website: http://ircamera.as.arizona.edu/astr_518 Cass Texts: Practica Statistics for Astronomers, J.V. Wa, and C.R. Jenkins, Second Edition. Measuring the

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

Chemical Kinetics Part 2

Chemical Kinetics Part 2 Integrated Rate Laws Chemica Kinetics Part 2 The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates the rate

More information

Pairwise RNA Edit Distance

Pairwise RNA Edit Distance Pairwise RNA Edit Distance In the foowing: Sequences S 1 and S 2 associated structures P 1 and P 2 scoring of aignment: different edit operations arc atering arc removing 1) ACGUUGACUGACAACAC..(((...)))...

More information

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is

More information

NIKOS FRANTZIKINAKIS. N n N where (Φ N) N N is any Følner sequence

NIKOS FRANTZIKINAKIS. N n N where (Φ N) N N is any Følner sequence SOME OPE PROBLEMS O MULTIPLE ERGODIC AVERAGES IKOS FRATZIKIAKIS. Probems reated to poynomia sequences In this section we give a ist of probems reated to the study of mutipe ergodic averages invoving iterates

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/Q10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of using one or two eyes on the perception

More information

A Statistical Framework for Real-time Event Detection in Power Systems

A Statistical Framework for Real-time Event Detection in Power Systems 1 A Statistica Framework for Rea-time Event Detection in Power Systems Noan Uhrich, Tim Christman, Phiip Swisher, and Xichen Jiang Abstract A quickest change detection (QCD) agorithm is appied to the probem

More information

Haar Decomposition and Reconstruction Algorithms

Haar Decomposition and Reconstruction Algorithms Jim Lambers MAT 773 Fa Semester 018-19 Lecture 15 and 16 Notes These notes correspond to Sections 4.3 and 4.4 in the text. Haar Decomposition and Reconstruction Agorithms Decomposition Suppose we approximate

More information

New Efficiency Results for Makespan Cost Sharing

New Efficiency Results for Makespan Cost Sharing New Efficiency Resuts for Makespan Cost Sharing Yvonne Beischwitz a, Forian Schoppmann a, a University of Paderborn, Department of Computer Science Fürstenaee, 3302 Paderborn, Germany Abstract In the context

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

From Margins to Probabilities in Multiclass Learning Problems

From Margins to Probabilities in Multiclass Learning Problems From Margins to Probabiities in Muticass Learning Probems Andrea Passerini and Massimiiano Ponti 2 and Paoo Frasconi 3 Abstract. We study the probem of muticass cassification within the framework of error

More information

arxiv: v1 [cs.lg] 31 Oct 2017

arxiv: v1 [cs.lg] 31 Oct 2017 ACCELERATED SPARSE SUBSPACE CLUSTERING Abofaz Hashemi and Haris Vikao Department of Eectrica and Computer Engineering, University of Texas at Austin, Austin, TX, USA arxiv:7.26v [cs.lg] 3 Oct 27 ABSTRACT

More information

arxiv: v1 [cs.ds] 12 Nov 2018

arxiv: v1 [cs.ds] 12 Nov 2018 Quantum-inspired ow-rank stochastic regression with ogarithmic dependence on the dimension András Giyén 1, Seth Loyd Ewin Tang 3 November 13, 018 arxiv:181104909v1 [csds] 1 Nov 018 Abstract We construct

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/P10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of temperature on the rate of photosynthesis

More information

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method High Spectra Resoution Infrared Radiance Modeing Using Optima Spectra Samping (OSS) Method J.-L. Moncet and G. Uymin Background Optima Spectra Samping (OSS) method is a fast and accurate monochromatic

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search Two Birds With One Stone: An Efficient Hierarchica Framework for Top-k and Threshod-based String Simiarity Search Jin Wang Guoiang Li Dong Deng Yong Zhang Jianhua Feng Department of Computer Science and

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

FRST Multivariate Statistics. Multivariate Discriminant Analysis (MDA)

FRST Multivariate Statistics. Multivariate Discriminant Analysis (MDA) 1 FRST 531 -- Mutivariate Statistics Mutivariate Discriminant Anaysis (MDA) Purpose: 1. To predict which group (Y) an observation beongs to based on the characteristics of p predictor (X) variabes, using

More information

<C 2 2. λ 2 l. λ 1 l 1 < C 1

<C 2 2. λ 2 l. λ 1 l 1 < C 1 Teecommunication Network Contro and Management (EE E694) Prof. A. A. Lazar Notes for the ecture of 7/Feb/95 by Huayan Wang (this document was ast LaT E X-ed on May 9,995) Queueing Primer for Muticass Optima

More information

The EM Algorithm applied to determining new limit points of Mahler measures

The EM Algorithm applied to determining new limit points of Mahler measures Contro and Cybernetics vo. 39 (2010) No. 4 The EM Agorithm appied to determining new imit points of Maher measures by Souad E Otmani, Georges Rhin and Jean-Marc Sac-Épée Université Pau Veraine-Metz, LMAM,

More information

Maintenance activities planning and grouping for complex structure systems

Maintenance activities planning and grouping for complex structure systems Maintenance activities panning and grouping for compex structure systems Hai Canh u, Phuc Do an, Anne Barros, Christophe Bérenguer To cite this version: Hai Canh u, Phuc Do an, Anne Barros, Christophe

More information

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems Consistent inguistic fuzzy preference reation with muti-granuar uncertain inguistic information for soving decision making probems Siti mnah Binti Mohd Ridzuan, and Daud Mohamad Citation: IP Conference

More information

Feasible Itemset Distributions in Data Mining: Theory and Application

Feasible Itemset Distributions in Data Mining: Theory and Application Feasibe Itemset Distributions in Data Mining: Theory and Appication Ganesh Ramesh Dept. of Computer Science University at Abany, SUNY Abany, NY 12222, USA ganesh@cs.abany.edu Wiiam A. Maniatty Dept. of

More information

The influence of temperature of photovoltaic modules on performance of solar power plant

The influence of temperature of photovoltaic modules on performance of solar power plant IOSR Journa of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vo. 05, Issue 04 (Apri. 2015), V1 PP 09-15 www.iosrjen.org The infuence of temperature of photovotaic modues on performance

More information

Formulas for Angular-Momentum Barrier Factors Version II

Formulas for Angular-Momentum Barrier Factors Version II BNL PREPRINT BNL-QGS-06-101 brfactor1.tex Formuas for Anguar-Momentum Barrier Factors Version II S. U. Chung Physics Department, Brookhaven Nationa Laboratory, Upton, NY 11973 March 19, 2015 abstract A

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

MULTI-PERIOD MODEL FOR PART FAMILY/MACHINE CELL FORMATION. Objectives included in the multi-period formulation

MULTI-PERIOD MODEL FOR PART FAMILY/MACHINE CELL FORMATION. Objectives included in the multi-period formulation ationa Institute of Technoogy aicut Department of echanica Engineering ULTI-PERIOD ODEL FOR PART FAILY/AHIE ELL FORATIO Given a set of parts, processing requirements, and avaiabe resources The objective

More information

Chemical Kinetics Part 2. Chapter 16

Chemical Kinetics Part 2. Chapter 16 Chemica Kinetics Part 2 Chapter 16 Integrated Rate Laws The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates

More information

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks ower Contro and Transmission Scheduing for Network Utiity Maximization in Wireess Networks Min Cao, Vivek Raghunathan, Stephen Hany, Vinod Sharma and. R. Kumar Abstract We consider a joint power contro

More information

Integrating Factor Methods as Exponential Integrators

Integrating Factor Methods as Exponential Integrators Integrating Factor Methods as Exponentia Integrators Borisav V. Minchev Department of Mathematica Science, NTNU, 7491 Trondheim, Norway Borko.Minchev@ii.uib.no Abstract. Recenty a ot of effort has been

More information

Adjustment of automatic control systems of production facilities at coal processing plants using multivariant physico- mathematical models

Adjustment of automatic control systems of production facilities at coal processing plants using multivariant physico- mathematical models IO Conference Series: Earth and Environmenta Science AER OEN ACCESS Adjustment of automatic contro systems of production faciities at coa processing pants using mutivariant physico- mathematica modes To

More information

Construction of Supersaturated Design with Large Number of Factors by the Complementary Design Method

Construction of Supersaturated Design with Large Number of Factors by the Complementary Design Method Acta Mathematicae Appicatae Sinica, Engish Series Vo. 29, No. 2 (2013) 253 262 DOI: 10.1007/s10255-013-0214-6 http://www.appmath.com.cn & www.springerlink.com Acta Mathema cae Appicatae Sinica, Engish

More information

MODELING OF A THREE-PHASE APPLICATION OF A MAGNETIC AMPLIFIER

MODELING OF A THREE-PHASE APPLICATION OF A MAGNETIC AMPLIFIER TH INTERNATIONAL CONGRESS OF THE AERONAUTICAL SCIENCES MODELING OF A THREE-PHASE APPLICATION OF A MAGNETIC AMPLIFIER L. Austrin*, G. Engdah** * Saab AB, Aerosystems, S-81 88 Linköping, Sweden, ** Roya

More information

On a geometrical approach in contact mechanics

On a geometrical approach in contact mechanics Institut für Mechanik On a geometrica approach in contact mechanics Aexander Konyukhov, Kar Schweizerhof Universität Karsruhe, Institut für Mechanik Institut für Mechanik Kaiserstr. 12, Geb. 20.30 76128

More information

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn Automobie Prices in Market Equiibrium Berry, Pakes and Levinsohn Empirica Anaysis of demand and suppy in a differentiated products market: equiibrium in the U.S. automobie market. Oigopoistic Differentiated

More information

Testing for the Existence of Clusters

Testing for the Existence of Clusters Testing for the Existence of Custers Caudio Fuentes and George Casea University of Forida November 13, 2008 Abstract The detection and determination of custers has been of specia interest, among researchers

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

Fast Blind Recognition of Channel Codes

Fast Blind Recognition of Channel Codes Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

On colorings of the Boolean lattice avoiding a rainbow copy of a poset arxiv: v1 [math.co] 21 Dec 2018

On colorings of the Boolean lattice avoiding a rainbow copy of a poset arxiv: v1 [math.co] 21 Dec 2018 On coorings of the Booean attice avoiding a rainbow copy of a poset arxiv:1812.09058v1 [math.co] 21 Dec 2018 Baázs Patkós Afréd Rényi Institute of Mathematics, Hungarian Academy of Scinces H-1053, Budapest,

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

Recursive Constructions of Parallel FIFO and LIFO Queues with Switched Delay Lines

Recursive Constructions of Parallel FIFO and LIFO Queues with Switched Delay Lines Recursive Constructions of Parae FIFO and LIFO Queues with Switched Deay Lines Po-Kai Huang, Cheng-Shang Chang, Feow, IEEE, Jay Cheng, Member, IEEE, and Duan-Shin Lee, Senior Member, IEEE Abstract One

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

MassJoin: A MapReduce-based Method for Scalable String Similarity Joins MassJoin: A MapReduce-based Method for Scaabe String Simiarity Joins Dong Deng, Guoiang Li, Shuang Hao, Jiannan Wang, Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China {dd11,

More information

Paper presented at the Workshop on Space Charge Physics in High Intensity Hadron Rings, sponsored by Brookhaven National Laboratory, May 4-7,1998

Paper presented at the Workshop on Space Charge Physics in High Intensity Hadron Rings, sponsored by Brookhaven National Laboratory, May 4-7,1998 Paper presented at the Workshop on Space Charge Physics in High ntensity Hadron Rings, sponsored by Brookhaven Nationa Laboratory, May 4-7,998 Noninear Sef Consistent High Resoution Beam Hao Agorithm in

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

Sardinas-Patterson like algorithms in coding theory

Sardinas-Patterson like algorithms in coding theory A.I.Cuza University of Iaşi Facuty of Computer Science Sardinas-Patterson ike agorithms in coding theory Author, Bogdan Paşaniuc Supervisor, Prof. dr. Ferucio Laurenţiu Ţipea Iaşi, September 2003 Contents

More information

Rate-Distortion Theory of Finite Point Processes

Rate-Distortion Theory of Finite Point Processes Rate-Distortion Theory of Finite Point Processes Günther Koiander, Dominic Schuhmacher, and Franz Hawatsch, Feow, IEEE Abstract We study the compression of data in the case where the usefu information

More information

Universal Consistency of Multi-Class Support Vector Classification

Universal Consistency of Multi-Class Support Vector Classification Universa Consistency of Muti-Cass Support Vector Cassification Tobias Gasmachers Dae Moe Institute for rtificia Inteigence IDSI, 6928 Manno-Lugano, Switzerand tobias@idsia.ch bstract Steinwart was the

More information

Appendix A: MATLAB commands for neural networks

Appendix A: MATLAB commands for neural networks Appendix A: MATLAB commands for neura networks 132 Appendix A: MATLAB commands for neura networks p=importdata('pn.xs'); t=importdata('tn.xs'); [pn,meanp,stdp,tn,meant,stdt]=prestd(p,t); for m=1:10 net=newff(minmax(pn),[m,1],{'tansig','purein'},'trainm');

More information

School of Electrical Engineering, University of Bath, Claverton Down, Bath BA2 7AY

School of Electrical Engineering, University of Bath, Claverton Down, Bath BA2 7AY The ogic of Booean matrices C. R. Edwards Schoo of Eectrica Engineering, Universit of Bath, Caverton Down, Bath BA2 7AY A Booean matrix agebra is described which enabes man ogica functions to be manipuated

More information

SydU STAT3014 (2015) Second semester Dr. J. Chan 18

SydU STAT3014 (2015) Second semester Dr. J. Chan 18 STAT3014/3914 Appied Stat.-Samping C-Stratified rand. sampe Stratified Random Samping.1 Introduction Description The popuation of size N is divided into mutuay excusive and exhaustive subpopuations caed

More information