Fast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200
1 Fast String Kernels
Alexander J. Smola
Machine Learning Group, RSISE, The Australian National University, Canberra, ACT 0200
Joint work with S.V.N. Vishwanathan
Slides (soon) available at smola/stringkernel/
2 Overview
- Kernels on Strings
- Kernels on Trees
- Weighting Schemes
- Tree to String Conversion
- Suffix Trees
- Definition and Examples
- Matching Statistics
- Counting Substrings
- Weights and Kernels
- Annotation and Weighting Function
- Linear Time Prediction
- Extensions and Future Work
Alex Smola: Fast String Kernels, smola/stringkernels/ Page 2
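3 String Kernel Basics
Some Notation
- Alphabet: what we build strings from
- Sentinel character: usually $, it terminates the string
- Concatenation: xy, obtained by assembling strings x and y
- Prefix / Suffix: if x = yz, then y is a prefix and z is a suffix of x
Exact Matching Kernels
\[ k(x, x') := \sum_{s \sqsubseteq x,\; s' \sqsubseteq x'} w_s \, \delta_{s,s'} = \sum_{s \in A^*} \mathrm{num}_s(x) \, \mathrm{num}_s(x') \, w_s . \]
Here \( s \sqsubseteq x \) means that s is a substring of x, and \( \mathrm{num}_s(x) \) is the number of occurrences of s in x.
Inexact Matching Kernels
\[ k(x, x') := \sum_{s \sqsubseteq x,\; s' \sqsubseteq x'} w_{s,s'} = \sum_{s, s' \in A^*} \mathrm{num}_s(x) \, \mathrm{num}_{s'}(x') \, w_{s,s'} . \]
These account for mismatches. They are much more expensive to compute and are not the topic of this talk.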
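The exact matching kernel can be checked on small inputs by enumerating substrings directly. The sketch below is a brute-force reference only (quadratically many substrings per string), not the fast algorithm of this talk; the function names and the `weight` callback are illustrative assumptions.

```python
from collections import Counter

def substring_counts(x):
    """Return num_s(x) for every non-empty contiguous substring s of x."""
    counts = Counter()
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            counts[x[i:j]] += 1
    return counts

def exact_kernel(x, y, weight):
    """k(x, y) = sum over shared substrings s of num_s(x) * num_s(y) * w_s."""
    cx, cy = substring_counts(x), substring_counts(y)
    return sum(cx[s] * cy[s] * weight(s) for s in cx if s in cy)

# All weights equal to 1: shared substrings of "abba" and "ababc" are a, b, ab, ba.
print(exact_kernel("abba", "ababc", lambda s: 1.0))  # 11.0
```

Passing a different `weight` callback (for example one that returns 0 for all substrings longer than one character) gives the special cases of the kernel without touching the enumeration.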
4 String Kernel Examples
- Bag of Characters: \( w_s = 0 \) for all \( |s| > 1 \) counts single characters. Linear-time kernel computation and linear-time prediction (Joachims, 1999).
- Bag of Words: s is bounded by whitespace. Linear time (Joachims, 1999).
- Limited Range Correlations: setting \( w_s = 0 \) for all \( |s| > n \) yields limited range correlations of length n.
- K-spectrum kernel: takes into account only substrings of length k (Eskin et al., 2002), i.e. \( w_s = 0 \) for all \( |s| \neq k \). Linear-time kernel computation and quadratic-time prediction.
- General Case: quadratic-time kernel computation (Haussler, 1998; Watkins, 1998), cubic-time prediction.
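As an illustration of the k-spectrum case, the following sketch counts length-k substrings with a hash table and takes the inner product of the two count vectors. It assumes unit weight for every substring of length k; the names `kmer_counts` and `spectrum_kernel` are made up for this example.

```python
from collections import Counter

def kmer_counts(x, k):
    """Count every substring of x that has length exactly k."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y (w_s = 1 iff |s| = k)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    if len(cx) > len(cy):          # iterate over the smaller table
        cx, cy = cy, cx
    return sum(n * cy[s] for s, n in cx.items())

print(spectrum_kernel("abba", "ababc", 2))  # 3
```

With k = 1 the same function gives the bag-of-characters kernel.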
5 Tree Kernels
Definition (Collins and Duffy, 2001)
Denote by T, T' trees and by \( t \models T \) a subtree t of T. Then
\[ k(T, T') = \sum_{t \models T,\; t' \models T'} w_t \, \delta_{t,t'} . \]
We count matching subtrees (other definitions are possible; we will come to that later).
Problem: We want permutation invariance of unordered trees.
Solution: Sort trees before computing the kernel (good for any tree operation).
6 Sorting Trees
Sorting Rules
- Assume the existence of a lexicographic order on labels.
- Introduce symbols [ and ] that satisfy [ < ], and ], [ < label(n) for all labels.
Algorithm
- For an unlabeled leaf n define tag(n) := [].
- For a labeled leaf n define tag(n) := [label(n)].
- For an unlabeled node n with children \( n_1, \ldots, n_c \), sort the tags of the children in lexicographical order such that \( \mathrm{tag}(n_i) \le \mathrm{tag}(n_j) \) if i < j, and define tag(n) = [tag(n_1) tag(n_2) ... tag(n_c)].
- For a labeled node perform the same operations as above and set tag(n) = [label(n) tag(n_1) tag(n_2) ... tag(n_c)].
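A minimal recursive sketch of the tagging rules follows. It assumes a tree is given as a pair `(label, children)` with `label = ""` for an unlabeled node, and that labels are lowercase ASCII strings, so that Python's built-in string order already satisfies [ < ] < label. This representation is an assumption made for the example, not something prescribed by the slides.

```python
def tag(node):
    """Return the canonical bracket string of an unordered tree.

    node is a pair (label, children); label is "" for unlabeled nodes.
    """
    label, children = node
    child_tags = sorted(tag(child) for child in children)  # canonical child order
    return "[" + label + "".join(child_tags) + "]"

# Two orderings of the same unordered tree a(b, c(d)).
t1 = ("a", [("b", []), ("c", [("d", [])])])
t2 = ("a", [("c", [("d", [])]), ("b", [])])
print(tag(t1))            # [a[b][c[d]]]
print(tag(t1) == tag(t2)) # True
```

Because children are sorted at every level, any permutation of siblings yields the same string, which is what makes a string kernel on tags permutation invariant.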
7 Sorting Trees in Linear Time
Example
The trees (figure not reproduced in this transcription) have label [[][[][]]].
Theorem (T is a binary tree with l nodes; λ is the maximum length of a label)
1. tag(root) can be computed in \( (\lambda + 2)(l \log_2 l) \) time and linear storage in l.
2. Substrings s of tag(root) starting with [ and ending with a balanced ] correspond to subtrees T' of T, where s is the tag on T'.
3. Arbitrary substrings s of tag(root) correspond to subset trees T' of T.
4. tag(root) is invariant under permutations of the leaves and allows the reconstruction of a unique element of the equivalence class (under permutation).
Proof of 1. by induction. Extension to k-ary trees is straightforward. The rest follows from the definition.
8 Tree to String Conversion
Consequence
We can compute the tree kernel by
1. Converting trees to strings
2. Computing string kernels
Advantages
- More general subtree operations are possible: we may include non-balanced subtrees (cutting a slice from a tree).
- Simple storage and simple implementation (a dynamic array suffices).
- All speedups for strings work for tree kernels, too (XML documents, etc.).
9 Suffix Trees
Definition
Compact tree built from all the suffixes of a word.
Suffix tree of ababc (edge labels recovered from the figure):
root
  ab
    abc$
    c$
  b
    abc$
    c$
  c$
Properties
- Can be built and stored in linear time (Ukkonen, 1995).
- At most twice as many nodes as characters in the string.
- Leaves on a subtree give the number of matching substrings.
Suffix Links
Connections across the tree. Vital for parsing strings (e.g., if we parsed abracadabra, this speeds up the parsing of bracadabra).
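The leaf-counting property can be tried out on a toy structure. The sketch below builds an uncompressed suffix trie with nested dictionaries, which takes quadratic time and space; it is not Ukkonen's linear-time construction and has no suffix links. All names are illustrative.

```python
def build_suffix_trie(text, sentinel="$"):
    """Insert every suffix of text + sentinel into a nested-dict trie."""
    text += sentinel
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def count_leaves(node):
    """Number of leaves below node; a leaf is an empty dict."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())

def occurrences(trie, w):
    """Occurrences of the non-empty string w = leaves below the node spelling w."""
    node = trie
    for ch in w:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)

trie = build_suffix_trie("ababc")
print(occurrences(trie, "ab"), occurrences(trie, "c"), occurrences(trie, "bb"))  # 2 1 0
```

The sentinel guarantees that no suffix is a prefix of another, so every suffix ends in its own leaf and the leaf count below a node equals the number of occurrences of the string that leads to it.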
10 Matching Statistics
Definition
Given strings x, y with |x| = n and |y| = m, the matching statistics of x with respect to y are defined by \( v, c \in \mathbb{N}^n \), where
- \( v_i \) is the length of the longest substring of y matching a prefix of x[i : n],
- \( \bar{v}_i := i + v_i - 1 \),
- \( c_i \) is a pointer to \( \mathrm{ceil}(x[i : \bar{v}_i]) \) in S(y).
This can be computed in linear time (Chang and Lawler, 1994).
Example
Matching statistics of abba with respect to S(ababc):
String      a    b    b      a
v_i         2    1    2      1
ceil(c_i)   ab   b    babc$  ab
Figure: suffix tree S(ababc), with root edges ab, b, c$ and child edges abc$, c$ below both ab and b.
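For small inputs the lengths v_i can be computed directly from the definition. This naive sketch uses Python's substring test and 0-based positions; it only illustrates what the numbers mean and is not the linear-time suffix-tree algorithm of Chang and Lawler. The function name is an assumption.

```python
def matching_statistics(x, y):
    """v[i] = length of the longest prefix of x[i:] that occurs somewhere in y."""
    v = []
    for i in range(len(x)):
        length = 0
        # Extend the match one character at a time while it still occurs in y.
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        v.append(length)
    return v

print(matching_statistics("abba", "ababc"))  # [2, 1, 2, 1]
```

Dropping the first character of a match leaves a string that still occurs in y, so consecutive entries satisfy v[i+1] >= v[i] - 1. That is the property a linear-time algorithm can exploit when it moves from position i to i + 1 via suffix links.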
11 Matching Substrings
Prefixes
w is a substring of x iff there is an i such that w is a prefix of x[i : n]. The number of occurrences of w in x can be calculated by finding all such i.
Substrings
The set of matching substrings of x and y is the set of all prefixes of \( x[i : \bar{v}_i] \).
Next Step
If we have a substring w of x, prefixes of w may occur in x with higher frequency. We need an efficient computation scheme.
12 Key Trick
Theorem
Let x and y be strings and let c and v be the matching statistics of x with respect to y. Assume that
\[ W(y, t) = \sum_{s \in \mathrm{prefix}(v)} w_{us} - w_u , \quad \text{where } u = \mathrm{ceil}(t) \text{ and } t = uv , \]
can be computed in constant time for any t. Then k(x, y) can be computed in \( O(|x| + |y|) \) time as
\[ k(x, y) = \sum_{i=1}^{|x|} \mathrm{val}(x[i : \bar{v}_i]) = \sum_{i=1}^{|x|} \Big[ \mathrm{val}(c_i) + \mathrm{lvs}(\mathrm{floor}(x[i : \bar{v}_i])) \, W(y, x[i : \bar{v}_i]) \Big] , \]
where \( \mathrm{val}(t) := \mathrm{lvs}(\mathrm{floor}(t)) \cdot W(y, t) + \mathrm{val}(\mathrm{ceil}(t)) \) and \( \mathrm{val}(\mathrm{root}) := 0 \).
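The decomposition behind the theorem can be verified numerically: summing, for every start position i in x, the weighted occurrence counts in y of all matching prefixes gives the same value as the substring-by-substring definition of the kernel. The sketch below unrolls val naively instead of using an annotated suffix tree, so it is far from linear time; function names and the `weight` callback are assumptions.

```python
def count_occurrences(y, s):
    """Number of (possibly overlapping) occurrences of s in y."""
    return sum(1 for j in range(len(y) - len(s) + 1) if y[j:j + len(s)] == s)

def kernel_by_positions(x, y, weight):
    """Sum over positions i of x and over all prefixes of x[i:] that occur in y."""
    total = 0.0
    for i in range(len(x)):
        length = 1
        while i + length <= len(x) and x[i:i + length] in y:
            s = x[i:i + length]
            total += weight(s) * count_occurrences(y, s)
            length += 1
    return total

def kernel_brute_force(x, y, weight):
    """Direct definition: sum_s num_s(x) * num_s(y) * w_s."""
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    return sum(weight(s) * count_occurrences(x, s) * count_occurrences(y, s)
               for s in subs)

print(kernel_by_positions("abba", "ababc", lambda s: 1.0))  # 11.0
print(kernel_brute_force("abba", "ababc", lambda s: 1.0))   # 11.0
```

In the fast algorithm the inner loop is replaced by a constant-time lookup, since the partial sums along each path of the suffix tree are precomputed.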
13 Computing W(y, t) in Constant Time
Length-Dependent Weights
Assume that \( w_s = w_{|s|} \). Then
\[ W(y, t) = \sum_{j = |\mathrm{ceil}(t)|}^{|t|} w_j - w_{|\mathrm{ceil}(t)|} = \omega_{|t|} - \omega_{|\mathrm{ceil}(t)|} , \quad \text{where } \omega_j := \sum_{i=1}^{j} w_i . \]
Generic Weights
- Simple option: pre-compute the annotation of all suffix trees beforehand.
- Better: build a suffix tree on all strings (linear time) and annotate this tree.
- Simplifying assumption for TFIDF weights, \( w_s = \phi(|s|) \, \psi(\# s) \):
\[ W(y, t) = \sum_{s \in \mathrm{prefix}(t)} w_s - \sum_{s \in \mathrm{prefix}(\mathrm{ceil}(t))} w_s = \psi(\mathrm{freq}(t)) \sum_{i = |\mathrm{ceil}(t)| + 1}^{|t|} \phi(i) . \]
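For length-dependent weights the constant-time lookup is just a difference of prefix sums. The sketch below assumes the weights w_1, w_2, ... are given as a list and that the caller passes the length of t and the length of the node above it; `make_W` is a hypothetical helper name.

```python
from itertools import accumulate

def make_W(length_weights):
    """length_weights[j - 1] holds w_j. Returns W(top_len, t_len) in O(1) per call."""
    # omega[j] = w_1 + ... + w_j, with omega[0] = 0.
    omega = [0.0] + list(accumulate(length_weights))

    def W(top_len, t_len):
        # Sum of w_j for top_len < j <= t_len.
        return omega[t_len] - omega[top_len]

    return W

W = make_W([1.0, 2.0, 3.0, 4.0, 5.0])
print(W(2, 4))  # 7.0  (w_3 + w_4)
```

The table `omega` is built once in time linear in the maximum string length, after which each query costs two array reads and one subtraction.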
14 Linear Time Prediction
Problem
- For prediction we need to compute \( f(x) = \sum_i \alpha_i k(x_i, x) \).
- This depends on the number of SVs. Bad for large databases (e.g., spam filtering): the classifier gets slower the more data we have.
- We are repeatedly parsing s
Idea
- We can merge matching weights from all the SVs. All we need is a compressed lookup function.
15 Linear Time Prediction
- Merge all SVs into one suffix tree Σ.
- Compute the matching statistics of x with respect to Σ.
- Update the weights on every node of Σ as
\[ \mathrm{weight}(w) = \sum_{i=1}^{m} \alpha_i \, \mathrm{lvs}_{x_i}(w) . \]
- Extend the definition of val(x) to Σ via
\[ \mathrm{val}_\Sigma(t) := \mathrm{weight}(\mathrm{floor}(t)) \cdot W(\Sigma, t) + \mathrm{val}_\Sigma(\mathrm{ceil}(t)) \quad \text{and} \quad \mathrm{val}_\Sigma(\mathrm{root}) := 0 . \]
  Here W(Σ, t) denotes the sum of weights between ceil(t) and t, with respect to Σ rather than S(y).
- We only need to sum over \( \mathrm{val}_\Sigma(x[i : \bar{v}_i]) \) to compute f.
We can classify texts in linear time regardless of the size of the SV set!
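The effect of merging can be illustrated with a plain hash table in place of the annotated suffix tree Σ. The sketch assumes the support vectors and their coefficients are given as two lists; it stores one merged weight per substring, so prediction no longer loops over support vectors. It uses quadratic storage and is only meant to show that the merged lookup reproduces the sum over kernels. All names are illustrative.

```python
from collections import Counter, defaultdict

def substring_counts(x):
    """num_s(x) for every non-empty substring s of x."""
    return Counter(x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1))

def kernel(x, y, weight):
    """Reference kernel: sum_s num_s(x) * num_s(y) * w_s."""
    cx, cy = substring_counts(x), substring_counts(y)
    return sum(cx[s] * cy[s] * weight(s) for s in cx if s in cy)

def merge_support_vectors(svs, alphas, weight):
    """table[s] = w_s * sum_i alpha_i * num_s(x_i)."""
    table = defaultdict(float)
    for x_i, a_i in zip(svs, alphas):
        for s, n in substring_counts(x_i).items():
            table[s] += a_i * n * weight(s)
    return dict(table)

def predict(table, x):
    """f(x): add the merged weight of every matching prefix at every position."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            s = x[i:j]
            if s not in table:   # longer prefixes cannot match either
                break
            total += table[s]
    return total

svs, alphas = ["abba", "ababc"], [0.5, -1.0]
table = merge_support_vectors(svs, alphas, lambda s: 1.0)
print(predict(table, "bab"))  # -6.0
```

The early `break` is valid because the set of substrings of the support vectors is closed under taking prefixes, which mirrors walking down a single path in Σ.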
16 Summary and Extensions
- Reduction of tree kernels to string kernels (heaps, stacks, bags, etc. are trivial).
- Linear prediction and kernel computation time (previously quadratic or cubic). Makes things practical.
- Storage of SVs is needed. It can be greatly reduced if redundancies abound in the SV set. E.g., for anagram and analphabet we need only analphabet and gram.
- Coarsening for trees (can be done in linear time, too)
- Approximate matching and wildcards
- Automata and dynamical systems
- Do expensive things with string kernel classifiers.
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More informationp 3 p 2 p 4 q 2 q 7 q 1 q 3 q 6 q 5
Discrete Fréchet distance Consider Professor Bille going for a walk with his personal dog. The professor follows a path of points p 1,..., p n and the dog follows a path of points q 1,..., q m. We assume
More informationString Indexing for Patterns with Wildcards
MASTER S THESIS String Indexing for Patterns with Wildcards Hjalte Wedel Vildhøj and Søren Vind Technical University of Denmark August 8, 2011 Abstract We consider the problem of indexing a string t of
More informationStructure-Based Comparison of Biomolecules
Structure-Based Comparison of Biomolecules Benedikt Christoph Wolters Seminar Bioinformatics Algorithms RWTH AACHEN 07/17/2015 Outline 1 Introduction and Motivation Protein Structure Hierarchy Protein
More informationSection Summary. Relations and Functions Properties of Relations. Combining Relations
Chapter 9 Chapter Summary Relations and Their Properties n-ary Relations and Their Applications (not currently included in overheads) Representing Relations Closures of Relations (not currently included
More information2 Permutation Groups
2 Permutation Groups Last Time Orbit/Stabilizer algorithm: Orbit of a point. Transversal of transporter elements. Generators for stabilizer. Today: Use in a ``divide-and-conquer approach for permutation
More informationEquivalence of Regular Expressions and FSMs
Equivalence of Regular Expressions and FSMs Greg Plaxton Theory in Programming Practice, Spring 2005 Department of Computer Science University of Texas at Austin Regular Language Recall that a language
More information