OBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS

1 OBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS
Tuğkan Batu (a), Funda Ergun (b), and Cenk Sahinalp (b)
(a) London School of Economics
(b) Simon Fraser University
LSE CDAM Seminar, October 19, 2006

2 EDIT DISTANCE
Let S and T be strings over an alphabet Σ.
Edit distance D(S,T): the minimum number of character insertions, deletions, and substitutions required to transform S into T.
Many string similarity problems are based on edit distance.
Used in text processing, analysis of genomic sequences, ...
Variants: non-uniform costs, block operations, ...
Exact computation: [Masek Paterson 80] gave an $O(n^2/\log n)$-time algorithm for constant-size alphabets.
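As a point of reference for the definition above, here is a minimal sketch of the textbook quadratic-time dynamic program for unit-cost edit distance. It is not the [Masek Paterson 80] algorithm; the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance via the classic O(|s| * |t|) dynamic program.

    Only two rows of the DP table are kept in memory.
    """
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, start=1):
        cur = [i]  # transforming s[:i] into the empty string costs i deletions
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # substitute (free if the characters match)
            ))
        prev = cur
    return prev[len(t)]
```

For the two example strings used later in the talk (T is S with a `d` prepended and the final `c` dropped), this returns 2.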

3 EDIT DISTANCE APPROXIMATION
γ-approximation: given strings S, T and γ > 1, output a value d such that $D(S,T) \le d \le \gamma \cdot D(S,T)$.
[Bar-Yossef et al. 04]: $n^{3/7}$-approximation in near-linear time ($\tilde{O}(n)$ time).
Our results:
  $n^{1/3+o(1)}$-approximation in near-linear time.
  $n^{(1-\epsilon)/3+o(1)}$-approximation in $\tilde{O}(n^{1+\epsilon})$ time.
Block edit distance variants: almost logarithmic approximation factors [Muthukrishnan Sahinalp 00], [Cormode Muthukrishnan 02].

4 STRING (TO STRING) EMBEDDINGS
Def. Let Σ and Γ be two alphabets. A string embedding with distortion d is a mapping $\varphi : \Sigma^* \to \Gamma^*$ such that
  $D(S,T)/d_1 \le D(\varphi(S),\varphi(T)) \le d_2 \cdot D(S,T)$ for all S, T, with $d_1 \cdot d_2 \le d$.
Interesting if φ reduces string length (dimensionality reduction):
Def. A string embedding with reduction r > 1 maps a string of length n to a string of length at most n/r.
Warning: This is not a compression. We require Σ Γ.

5 STRING EMBEDDINGS (CONT.)
Lemma. A string embedding with reduction r has distortion at least r.
"Proof." The maximum edit distance shrinks by a factor of r, while the minimum edit distance stays at 1.
Our result: a string embedding with reduction r and distortion $r^{1+o(1)}$ can be computed in $\tilde{O}(n)$ time.
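The proof sketch of the lemma can be spelled out as follows. This is one possible reading, assuming Σ contains two distinct characters a and b, and using the definition of distortion with factors $d_1$ and $d_2$ from the previous slide.

```latex
\begin{align*}
&\text{Take } S = a^n,\ T = b^n:\quad D(S,T) = n,\qquad
  D(\varphi(S),\varphi(T)) \le \max(|\varphi(S)|, |\varphi(T)|) \le n/r \\
&\qquad\Longrightarrow\quad n/d_1 \le n/r \quad\Longrightarrow\quad d_1 \ge r. \\
&\text{Take } S, T \text{ with } D(S,T) = 1:\quad
  D(\varphi(S),\varphi(T)) \ge 1/d_1 > 0
  \ \Longrightarrow\ D(\varphi(S),\varphi(T)) \ge 1
  \ \Longrightarrow\ d_2 \ge 1. \\
&\text{Hence } d \ \ge\ d_1 \cdot d_2 \ \ge\ r.
\end{align*}
```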

6 STRING EMBEDDINGS VIA PARTITIONING
  S: adebccebaaadebcbadec  ->  adebc cebaa adebc badec  ->  α β α δ
  T: dadebccebaaadebcbade  ->  dadeb cceba aadeb cbade  ->  ε γ κ λ
We want: consistency.
  S:   adebc cebaa adebc badec
  T: d adebc cebaa adebc bade
Maybe too strong to ask! Need to use content! Locality.
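The failure of position-based partitioning on this slide can be reproduced with a short sketch. The helper `fixed_blocks` is a name introduced here for illustration; it cuts a string into blocks of a fixed length regardless of content.

```python
def fixed_blocks(s: str, size: int) -> list[str]:
    """Cut s into consecutive blocks of the given size, ignoring content."""
    return [s[i:i + size] for i in range(0, len(s), size)]


S = "adebccebaaadebcbadec"
T = "dadebccebaaadebcbade"  # S with 'd' inserted in front and the last 'c' deleted

blocks_S = fixed_blocks(S, 5)
blocks_T = fixed_blocks(T, 5)

# A single insertion at the front shifts every block boundary of T,
# so the two block sequences have no block in common.
shared = set(blocks_S) & set(blocks_T)
```

Here `shared` is empty even though S and T are within two edits of each other, which is why the partition has to depend on content rather than on position.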

7 LOCALLY CONSISTENT PARSING (LCP) [SAHINALP VISHKIN 95, 96]
Partitions depend only on local string content.
Consistency condition: if a sufficiently long substring w occurs more than once in S, then most blocks in w are set identically in each occurrence of w (in S or in any other string).
"Most blocks" = all but some boundary blocks.

8 PARSING CONSISTENTLY
Convention: a > b > c > d > e > ... > z
Assume no character is repeated consecutively.
  S: ... dcadecbgabcdefghkmopstuadecbhc ...
We will partition S into blocks of size 2 or 3.
1. Mark every local maximum in a sliding window of 3 (= a character larger than its immediate neighbors). (Primary markers; no two are consecutive.)
  S: ... dc adec bg abcdefghkmopstu adec bhc ...
2. Partition the segments into blocks of size 2 (or 3 if necessary). (Secondary markers)
  S: ... dc ad ec bg ab cd ef gh km op stu ad ec bhc ...
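The two steps above can be sketched as follows. This is a simplified illustration of the slide's procedure only, not the full LCP of [Sahinalp Vishkin]; the function names are hypothetical, and the handling of the string's first and last characters (never primary markers) is an assumption made here.

```python
def primary_markers(s: str) -> list[int]:
    """Indices of local maxima under the convention a > b > ... > z.

    'a' is the largest character, so "larger" means a smaller code point.
    The first and last positions are never marked (assumption).
    """
    return [i for i in range(1, len(s) - 1)
            if s[i] < s[i - 1] and s[i] < s[i + 1]]


def split_segment(seg: str) -> list[str]:
    """Cut a segment into blocks of size 2, ending with a block of size 3 if needed."""
    blocks, i = [], 0
    while len(seg) - i > 3:
        blocks.append(seg[i:i + 2])
        i += 2
    blocks.append(seg[i:])
    return blocks


def parse(s: str) -> list[str]:
    """Partition s: every primary marker starts a new segment, segments are cut into blocks."""
    starts = [0] + primary_markers(s) + [len(s)]
    blocks = []
    for lo, hi in zip(starts, starts[1:]):
        if lo < hi:
            blocks.extend(split_segment(s[lo:hi]))
    return blocks
```

On the slide's example string, `parse` yields the same blocks as shown in step 2. A segment of length 1 can only occur at the very start of the string and is kept as a single short block in this sketch.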

9 CONSISTENT, BUT IS IT LOCAL?
  dcadecbgabcdefghkmopstuadecbhc
  dc adec bg abcdefghkmopstu adec bhc
  dc ad ec bg ab cd ef gh km op stu ad ec bhc
Problem: primary markers can be far apart (as far as twice the alphabet size). Hence, marker locations can depend on far-away characters.

10 ALPHABET REDUCTION
Note that all we need for parsing are no repetitions and a total order on the characters.
We will reduce the alphabet size to avoid far-apart primary markers.
  Σ = {a = 1111, b = 1110, c = 1101, d = 1011, e = 1010, ...}
Assign a tag (with shorter bit complexity) to each character.
Tag of the i-th character: the rightmost bit position where S[i] and S[i-1] differ, concatenated with the value of that bit in S[i].
  S:   a e b
  tag:
Remark. Still no repetitions in the tags.
Reduction in alphabet: k bits to log k + 1 bits.
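A minimal sketch of the tag computation described above. Several details are assumptions made here rather than taken from the slide: bit positions are counted from 0 at the rightmost bit, the pair (position, bit) is packed into the integer `2 * position + bit`, and the first character, which has no predecessor, receives no tag. The five codes are the ones listed on the slide.

```python
# Character codes as listed on the slide.
CODE = {"a": 0b1111, "b": 0b1110, "c": 0b1101, "d": 0b1011, "e": 0b1010}


def tag(cur: int, prev: int) -> int:
    """Tag of a character code given its predecessor's code.

    Packs (rightmost differing bit position, value of that bit in cur)
    into one integer; the packing is an illustrative choice.
    """
    diff = cur ^ prev
    if diff == 0:
        raise ValueError("consecutive characters must differ")
    pos = (diff & -diff).bit_length() - 1  # index of the lowest set bit of diff
    bit = (cur >> pos) & 1
    return 2 * pos + bit


def tags(s: str) -> list[int]:
    """Tags for s[1:], each computed from a character and its left neighbor."""
    codes = [CODE[ch] for ch in s]
    return [tag(c, p) for p, c in zip(codes, codes[1:])]
```

The remark on the slide follows from the definition: if two consecutive characters had the same tag (p, b), both would carry bit b at position p, contradicting the fact that position p is where they differ.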

11 ITERATIVE ALPHABET REDUCTION
Apply alphabet reduction $\log^* k$ times (k: initial alphabet size); the tags then have constant size.
Use the LCP procedure (on the tags) to partition the string into blocks of size 2 and 3.
Each marker is set based on $O(\log^* k)$ locations in the string.
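To illustrate how quickly repeated reduction shrinks the alphabet, the sketch below reapplies the tag step until all values fit in a constant range. It uses the same illustrative packing `2 * position + bit` as an assumption; the stopping threshold 6 is the fixed point of that packing (3-bit values produce tags in 0..5), and all names are hypothetical.

```python
def tag(cur: int, prev: int) -> int:
    """(rightmost differing bit position, bit value in cur), packed as 2*pos + bit."""
    diff = cur ^ prev
    if diff == 0:
        raise ValueError("consecutive values must differ")
    pos = (diff & -diff).bit_length() - 1
    return 2 * pos + ((cur >> pos) & 1)


def reduce_once(codes: list[int]) -> list[int]:
    """One round of alphabet reduction; the first element has no predecessor and is dropped."""
    return [tag(c, p) for p, c in zip(codes, codes[1:])]


def reduce_until_small(codes: list[int], limit: int = 6) -> tuple[list[int], int]:
    """Repeat the reduction until every value is below limit; return (values, rounds)."""
    rounds = 0
    while len(codes) > 1 and max(codes) >= limit:
        codes = reduce_once(codes)
        rounds += 1
    return codes, rounds
```

Starting from 32-bit values, the value ranges shrink as 32 bits, then at most 63, 11, 7, and finally 5, so four rounds suffice in this sketch. Each round makes a tag depend on one more position to its left, which matches the slide's statement that a marker depends on as many locations as there are rounds.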

12 OBTAINING LONGER BLOCKS
Problem. Blocks are of size 2 and 3. What if we need longer blocks?
1. Label the blocks (with a new alphabet), using the same character for different occurrences of the same block.
2. Partition this new string into blocks of size 2 and 3.
  Step 1:  ad ec bg ab cd ef gh km op stu ad ec
           α  β  χ  δ  ε  φ  ν  ß  κ  λ   α  β
  Step 2:  αβ   χδε    φν   ßκλ     αβ
           adec bgabcd efgh kmopstu adec
Block sizes are between $2^2 = 4$ and $3^2 = 9$.
After t repetitions: block sizes are between $2^t$ and $3^t$.
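The label-and-reparse loop can be sketched as below, using the simplified marker rule from the earlier slide. Two choices here are assumptions, not taken from the talk: each block is labeled by its own content (so equal blocks get equal labels), and labels are ordered lexicographically under the same "a is largest" convention. The slide does not specify the order on the new alphabet, so the level-2 grouping produced here need not match the slide's illustration.

```python
def parse(items: list[str]) -> list[list[str]]:
    """One parsing round over a sequence of labels; returns blocks of 2 or 3 labels."""
    n = len(items)
    # Local maxima: a lexicographically smaller label counts as "larger".
    markers = [i for i in range(1, n - 1)
               if items[i] < items[i - 1] and items[i] < items[i + 1]]
    starts = [0] + markers + [n]
    blocks = []
    for lo, hi in zip(starts, starts[1:]):
        i = lo
        while hi - i > 3:
            blocks.append(items[i:i + 2])
            i += 2
        if i < hi:
            blocks.append(items[i:hi])
    return blocks


def iterate_parse(s: str, t: int) -> list[str]:
    """Apply t rounds of parse, relabeling each block by its content."""
    items = list(s)
    for _ in range(t):
        items = ["".join(block) for block in parse(items)]
    return items
```

For the example string and t = 2 this gives blocks of length at most $3^2 = 9$. The first block can fall short of the lower bound $2^2 = 4$ because of the boundary handling in this sketch.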

13 LCP(C)
We generalize the above technique to obtain blocks of size between $c$ and $2c-1$.
Same underlying ideas.
Various periodicities are handled separately.
Alphabet reduction is achieved by comparing substrings of length $2c-3$ instead.
Computed in $O(c^2 n)$ time.
Markers depend on roughly $O(c^{2c})$ locations.
Lemma. One edit operation on a string can change at most $O(c^{2c})$ markers in LCP(c).

14 ITERATIVE APPLICATION OF LCP(C)
1. Label the blocks (with a new alphabet), using the same character for different occurrences of the same block.
2. Partition this new string into blocks of size between $c$ and $2c-1$.
  Step 1:  ad ec bg ab cd ef gh km op stu ad ec
           α  β  χ  δ  ε  φ  ν  ß  κ  λ   α  β
  Step 2:  αβ   χδε    φν   ßκλ     αβ
           adec bgabcd efgh kmopstu adec
Block sizes are between $c^2$ and $(2c-1)^2$.
After t repetitions: block sizes are between $c^t$ and $(2c-1)^t$.

15 STRING EMBEDDINGS VIA LCP(C)
String embedding φ(S) with reduction r = Ω(log n):
1. Choose $c = \log\log n / \log\log\log n$ and t such that $c^t = r$.
2. Apply LCP(c) t times to partition S; block sizes are between $c^t = r$ and $(2c-1)^t = r^{1+o(1)}$.
3. Label the blocks with a new alphabet Γ to obtain the string φ(S).
Lemma. $D(S,T)/(2c-1)^t \le D(\varphi(S),\varphi(T)) \le O(c^{2c}) \cdot D(S,T)$.
Corollary. The embedding φ has distortion $r^{1+o(1)}$.

16 LCP(2) VS LCP(C)
When we use LCP(c) to get reduction r, we set $c^t = r$.
The distortion is the largest block size: $(2c-1)^t \le (2c)^t = 2^t r$.
Overhead of LCP(c) on the distortion: $2^t = r^{\log_c 2}$.
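The overhead formula follows directly from $c^t = r$; the short derivation below spells out the steps, and the comparison with c = 2 uses the block sizes 2 and 3 stated earlier in the talk.

```latex
\begin{align*}
c^t = r \quad &\Longrightarrow\quad t = \log_c r = \frac{\log_2 r}{\log_2 c}, \\
2^t &= 2^{\log_2 r / \log_2 c} = r^{1/\log_2 c} = r^{\log_c 2}, \\
(2c-1)^t &\le (2c)^t = 2^t \cdot c^t = r^{1 + \log_c 2}.
\end{align*}
```

For c = 2 the largest block has size $3^t = r^{\log_2 3} \approx r^{1.585}$, whereas for growing c the exponent $\log_c 2 = 1/\log_2 c$ tends to 0, which is where the $r^{1+o(1)}$ bound comes from.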

17 APPROXIMATING EDIT DISTANCE
A naive idea: given S, T, and γ > 1,
1. Let S' = φ(S) and T' = φ(T) for a suitable reduction r (see below).
2. Calculate D(S', T') using dynamic programming (in $O((n/r)^2)$ time).
3. Translate the result into an approximation of D(S,T) using the distortion of φ.
Setting $(n/r)^2 = n$ gives $r = \sqrt{n}$.
This does not yield better than a $\sqrt{n}$-approximation in linear time!

18 HOW TO CALCULATE D(S', T') MORE EFFICIENTLY
We can exploit properties of S' and T' to compute D(S', T').
If $D(S,T) \le k$, then the number of insertions and deletions (from S' to T') is less than k/r.
Observation. During the computation of D(S', T'), we do not have to compare far-away locations in S' and T'.
Hence, we can restrict the algorithm to a "narrow" band along the diagonal of the DP table.
Hence, we can make S' and T' longer (read: smaller r) for better accuracy. Say, use $r = n^{1/3}$.
Result: $n^{(1-\epsilon)/3+o(1)}$-approximation in $\tilde{O}(n^{1+\epsilon})$ time.
(We can get even better if $D(S,T) < n^{2/3}$.)
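The band restriction can be illustrated with a generic banded dynamic program. This is a standard technique sketched under assumptions, not the talk's exact algorithm: the function name and the `band` parameter are hypothetical, and the sketch works on plain strings rather than on embedded strings S' and T'.

```python
def banded_edit_distance(s: str, t: str, band: int):
    """Edit distance restricted to DP cells (i, j) with |i - j| <= band.

    Returns the exact distance if it is at most band, otherwise None.
    Runs in O(len(s) * band) time instead of O(len(s) * len(t)).
    """
    n, m = len(s), len(t)
    if abs(n - m) > band:
        return None  # at least |n - m| insertions or deletions are needed
    inf = float("inf")
    prev = {j: j for j in range(min(m, band) + 1)}  # row 0 inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(
                prev.get(j - 1, inf) + (s[i - 1] != t[j - 1]),  # match or substitute
                prev.get(j, inf) + 1,                           # delete from s
                cur.get(j - 1, inf) + 1,                        # insert into s
            )
        prev = cur
    d = prev.get(m, inf)
    return d if d <= band else None
```

The value computed inside the band is always the cost of a valid alignment, so it never underestimates the true distance; an alignment of cost at most `band` never leaves the band, so the result is exact whenever it is returned.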

19 FUTURE DIRECTIONS
Better edit distance approximation.
Low-distortion $L_1$ embeddings for strings: $\varphi : \Sigma^* \to L_1^d$.
  $\Omega(\log n)$ distortion lower bound [Krauthgamer Ostrovsky 06].
  $2^{O(\sqrt{\log n \log\log n})}$ best known [Ostrovsky Rabani 05].
Other string-similarity problems.


More information

Name Geometry Common Core Regents Review Packet - 3. Topic 1 : Equation of a circle

Name Geometry Common Core Regents Review Packet - 3. Topic 1 : Equation of a circle Name Geometry Common Core Regents Review Packet - 3 Topic 1 : Equation of a circle Equation with center (0,0) and radius r Equation with center (h,k) and radius r ( ) ( ) 1. The endpoints of a diameter

More information

arxiv: v1 [cs.cc] 15 Nov 2016

arxiv: v1 [cs.cc] 15 Nov 2016 Diploid Alignment is NP-hard Romeo Rizzi 1, Massimo Cairo 1, Veli Mäkinen 2, and Daniel Valenzuela 2 1 Department of Computer Science, University of Verona, Italy 2 Helsinki Institute for Information echnology,

More information

The Intractability of Computing the Hamming Distance

The Intractability of Computing the Hamming Distance The Intractability of Computing the Hamming Distance Bodo Manthey and Rüdiger Reischuk Universität zu Lübeck, Institut für Theoretische Informatik Wallstraße 40, 23560 Lübeck, Germany manthey/reischuk@tcs.uni-luebeck.de

More information

Streaming algorithms for embedding and computing edit distance in the low distance regime

Streaming algorithms for embedding and computing edit distance in the low distance regime Electronic Colloquium on Computational Complexity, Revision 1 of Report No. 111 (2015) Streaming algorithms for embedding and computing edit distance in the low distance regime Diptarka Chakraborty Department

More information

BOUNDS ON ZIMIN WORD AVOIDANCE

BOUNDS ON ZIMIN WORD AVOIDANCE BOUNDS ON ZIMIN WORD AVOIDANCE JOSHUA COOPER* AND DANNY RORABAUGH* Abstract. How long can a word be that avoids the unavoidable? Word W encounters word V provided there is a homomorphism φ defined by mapping

More information

Testing random variables for independence and identity

Testing random variables for independence and identity Testing random variables for independence and identity Tuğkan Batu Eldar Fischer Lance Fortnow Ravi Kumar Ronitt Rubinfeld Patrick White January 10, 2003 Abstract Given access to independent samples of

More information

CS 455/555: Mathematical preliminaries

CS 455/555: Mathematical preliminaries CS 455/555: Mathematical preliminaries Stefan D. Bruda Winter 2019 SETS AND RELATIONS Sets: Operations: intersection, union, difference, Cartesian product Big, powerset (2 A ) Partition (π 2 A, π, i j

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Discrete Mathematics & Mathematical Reasoning Chapter 6: Counting

Discrete Mathematics & Mathematical Reasoning Chapter 6: Counting Discrete Mathematics & Mathematical Reasoning Chapter 6: Counting Kousha Etessami U. of Edinburgh, UK Kousha Etessami (U. of Edinburgh, UK) Discrete Mathematics (Chapter 6) 1 / 39 Chapter Summary The Basics

More information

Title. Author(s) 花田, 博幸. Issue Date DOI. Doc URL. Type. File Information. The q-gram Distance as an Approximation of the Edit

Title. Author(s) 花田, 博幸. Issue Date DOI. Doc URL. Type. File Information. The q-gram Distance as an Approximation of the Edit Title The q-gram Distance as an Approximation of the Edit Author(s) 花田, 博幸 Issue Date 2014-06-30 DOI 10.14943/doctoral.k11490 Doc URL http://hdl.handle.net/2115/64515 Type theses (doctoral) File Information

More information

CS 530: Theory of Computation Based on Sipser (second edition): Notes on regular languages(version 1.1)

CS 530: Theory of Computation Based on Sipser (second edition): Notes on regular languages(version 1.1) CS 530: Theory of Computation Based on Sipser (second edition): Notes on regular languages(version 1.1) Definition 1 (Alphabet) A alphabet is a finite set of objects called symbols. Definition 2 (String)

More information

Longest Gapped Repeats and Palindromes

Longest Gapped Repeats and Palindromes Discrete Mathematics and Theoretical Computer Science DMTCS vol. 19:4, 2017, #4 Longest Gapped Repeats and Palindromes Marius Dumitran 1 Paweł Gawrychowski 2 Florin Manea 3 arxiv:1511.07180v4 [cs.ds] 11

More information

Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression. Sergio De Agostino Sapienza University di Rome

Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression. Sergio De Agostino Sapienza University di Rome Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression Sergio De Agostino Sapienza University di Rome Parallel Systems A parallel random access machine (PRAM)

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /

More information

The Smoothed Complexity of Edit Distance 1

The Smoothed Complexity of Edit Distance 1 0 The Smoothed Complexity of Edit Distance 1 Alexandr Andoni 2, Microsoft Research SVC (andoni@microsoft.com) Robert Krauthgamer 3, The Weizmann Institute of Science (robert.krauthgamer@weizmann.ac.il)

More information

arxiv: v2 [cs.ds] 28 Jan 2009

arxiv: v2 [cs.ds] 28 Jan 2009 Minimax Trees in Linear Time Pawe l Gawrychowski 1 and Travis Gagie 2, arxiv:0812.2868v2 [cs.ds] 28 Jan 2009 1 Institute of Computer Science University of Wroclaw, Poland gawry1@gmail.com 2 Research Group

More information

Proofs of Proximity for Context-Free Languages and Read-Once Branching Programs

Proofs of Proximity for Context-Free Languages and Read-Once Branching Programs Proofs of Proximity for Context-Free Languages and Read-Once Branching Programs Oded Goldreich Weizmann Institute of Science oded.goldreich@weizmann.ac.il Ron D. Rothblum Weizmann Institute of Science

More information

arxiv:cs/ v1 [cs.dm] 7 May 2006

arxiv:cs/ v1 [cs.dm] 7 May 2006 arxiv:cs/0605026v1 [cs.dm] 7 May 2006 Strongly Almost Periodic Sequences under Finite Automata Mappings Yuri Pritykin April 11, 2017 Abstract The notion of almost periodicity nontrivially generalizes the

More information

Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts

Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts Theoretical Computer Science 410 (2009) 4402 4413 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Dynamic rank/select structures with

More information

Finite State Automata Design

Finite State Automata Design Finite State Automata Design Nicholas Mainardi 1 Dipartimento di Elettronica e Informazione Politecnico di Milano nicholas.mainardi@polimi.it March 14, 2017 1 Mostly based on Alessandro Barenghi s material,

More information

Exam 1 CSU 390 Theory of Computation Fall 2007

Exam 1 CSU 390 Theory of Computation Fall 2007 Exam 1 CSU 390 Theory of Computation Fall 2007 Solutions Problem 1 [10 points] Construct a state transition diagram for a DFA that recognizes the following language over the alphabet Σ = {a, b}: L 1 =

More information

the subset partial order Paul Pritchard Technical Report CIT School of Computing and Information Technology

the subset partial order Paul Pritchard Technical Report CIT School of Computing and Information Technology A simple sub-quadratic algorithm for computing the subset partial order Paul Pritchard P.Pritchard@cit.gu.edu.au Technical Report CIT-95-04 School of Computing and Information Technology Grith University

More information

CONWAY S COSMOLOGICAL THEOREM

CONWAY S COSMOLOGICAL THEOREM CONWAY S COSMOLOGICAL THEOREM R.A. LITHERLAND. Introduction In [C], Conway introduced an operator on strings (finite sequences) of positive integers, the audioactive (or look and say ) operator. Usually,

More information

Efficient (δ, γ)-pattern-matching with Don t Cares

Efficient (δ, γ)-pattern-matching with Don t Cares fficient (δ, γ)-pattern-matching with Don t Cares Yoan José Pinzón Ardila Costas S. Iliopoulos Manolis Christodoulakis Manal Mohamed King s College London, Department of Computer Science, London WC2R 2LS,

More information

How do regular expressions work? CMSC 330: Organization of Programming Languages

How do regular expressions work? CMSC 330: Organization of Programming Languages How do regular expressions work? CMSC 330: Organization of Programming Languages Regular Expressions and Finite Automata What we ve learned What regular expressions are What they can express, and cannot

More information