Fast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200

Similar documents
Fast Kernels for String and Tree Matching

Mining Frequent Closed Unordered Trees Through Natural Representations

Intrusion Detection and Malware Analysis

arxiv: v1 [cs.ds] 15 Feb 2012

Hierarchical Overlap Graph

1 Alphabets and Languages

Define M to be a binary n by m matrix such that:

Lecture 1 : Data Compression and Entropy

Small-Space Dictionary Matching (Dissertation Proposal)

Computing Longest Common Substrings Using Suffix Arrays

Context-Free Languages

CS5371 Theory of Computation. Lecture 9: Automata Theory VII (Pumping Lemma, Non-CFL)

Kernel Methods. Outline

Computational Models - Lecture 4

Theory of Computation

What is this course about?

Pattern Matching (Exact Matching) Overview

arxiv: v1 [cs.ds] 9 Apr 2018

Smaller and Faster Lempel-Ziv Indices

Ukkonen's suffix tree construction algorithm

Computation Theory Finite Automata

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Lexical Analysis. Reinhard Wilhelm, Sebastian Hack, Mooly Sagiv Saarland University, Tel Aviv University.

Alphabet Friendly FM Index

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) October,

Part 3 out of 5. Automata & languages. A primer on the Theory of Computation. Last week, we learned about closure and equivalence of regular languages

Pushdown Automata. We have seen examples of context-free languages that are not regular, and hence can not be recognized by finite automata.

An O(N) Semi-Predictive Universal Encoder via the BWT

THEORY OF COMPUTATION (AUBER) EXAM CRIB SHEET

Module 9: Tries and String Matching

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Compact Indexes for Flexible Top-k Retrieval

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd

Mining Emerging Substrings

This lecture covers Chapter 7 of HMU: Properties of CFLs

UNIT I INFORMATION THEORY. I k log 2

CSE 202 Homework 4 Matthias Springer, A

Algorithms Design & Analysis. String matching

Binary Decision Diagrams

MA/CSSE 474 Theory of Computation

Lecture 4 : Adaptive source coding algorithms

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

CS5371 Theory of Computation. Lecture 9: Automata Theory VII (Pumping Lemma, Non-CFL, DPDA PDA)

arxiv: v5 [cs.fl] 21 Feb 2012

Kernel Methods. Charles Elkan October 17, 2007

Theoretical Computer Science

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

BLAST: Basic Local Alignment Search Tool

Finite Automata - Deterministic Finite Automata. Deterministic Finite Automaton (DFA) (or Finite State Machine)

Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT)

Lecture 18 April 26, 2012

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

CS6901: review of Theory of Computation and Algorithms

SUFFIX TREE. SYNONYMS Compact suffix trie

Topics COSC Administrivia. Topics Today. Administrivia (II) Acknowledgements. Slides presented May 9th and 16th,

Streaming Tree Transducers

CS1802 Week 11: Algorithms, Sums, Series, Induction

Self-Indexed Grammar-Based Compression

UNIT II REGULAR LANGUAGES

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

Automata Theory and Formal Grammars: Lecture 1

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

Chapter 0 Introduction. Fourth Academic Year/ Elective Course Electrical Engineering Department College of Engineering University of Salahaddin

What we have done so far

String Regularities and Degenerate Strings

Self-Indexed Grammar-Based Compression

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) September,

Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE

Tasks of lexer. CISC 5920: Compiler Construction Chapter 2 Lexical Analysis. Tokens and lexemes. Buffering

This lecture covers Chapter 5 of HMU: Context-free Grammars

Representing structured relational data in Euclidean vector spaces. Tony Plate

Slides for CIS 675. Huffman Encoding, 1. Huffman Encoding, 2. Huffman Encoding, 3. Encoding 1. DPV Chapter 5, Part 2. Encoding 2

Even More on Dynamic Programming

Mathematical Preliminaries. Sipser pages 1-28

McCreight's suffix tree construction algorithm

1 More finite deterministic automata

Non-context-Free Languages. CS215, Lecture 5 c

Introduction to information theory and coding

Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility

Advanced Automata Theory 11 Regular Languages and Learning Theory

Automata Theory - Quiz II (Solutions)

6.1 The Pumping Lemma for CFLs 6.2 Intersections and Complements of CFLs

An Introduction to Machine Learning

arxiv: v2 [cs.ds] 16 Mar 2015

Pushdown Automata. Notes on Automata and Theory of Computation. Chia-Ping Chen

Note: In any grammar here, the meaning and usage of P (productions) is equivalent to R (rules).

Compressed Index for Dynamic Text

Converting SLP to LZ78 in almost Linear Time

d(ν) = max{n N : ν dmn p n } N. p d(ν) (ν) = ρ.

Pushdown Automata: Introduction (2)

Concatenation. The concatenation of two languages L 1 and L 2

Integer Sorting on the word-ram

CIS 520: Machine Learning Oct 09, Kernel Methods

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

p 3 p 2 p 4 q 2 q 7 q 1 q 3 q 6 q 5

String Indexing for Patterns with Wildcards

Structure-Based Comparison of Biomolecules

Section Summary. Relations and Functions Properties of Relations. Combining Relations

2 Permutation Groups

Equivalence of Regular Expressions and FSMs

Transcription:

Fast String Kernels Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200 Alex.Smola@anu.edu.au joint work with S.V.N. Vishwanathan Slides (soon) available at http://mlg.anu.edu.au/ smola/stringkernel/

Overview Kernels on Strings Kernels on Trees Weighting Schemes Tree to String Conversion Suffix Trees Definition and Examples Matching Statistics Counting Substrings Weights and Kernels Annotation and Weighting Function Linear Time Prediction Extensions and Future Work Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 2

String Kernel Basics Some Notation Alphabet: what we build strings from Sentinel Character: usually $, it terminates the string Concatenation: xy obtained by assembling strings x, y Prefix / Sufix: If x = yz then y is a prefix and z is a suffix Exact Matching Kernels k(x, x ) := s x,s x w s δ s,s = Inexact Matching Kernels k(x, x ) := w s,s = s x,s x s A num s (x) num s (x )w s. s A num s (x) num s (x )w s,s. Accounting for mismatch. Much more expensive to compute and not topic of this talk. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 3

String Kernel Examples Bag of Characters w s = 0 for all s > 1 counts single characters. Can be computed in linear time and linear-time predictions (Joachims, 1999). Bag of Words s is bounded by whitespace. Linear time (Joachims, 1999). Limited Range Correlations Setting w s = 0 for all s > n yields limited range correlations of length n. K-spectrum kernel This takes into account substrings of length k (Eskin et al., 2002), where w s = 0 for all s k. Linear time kernel computation, and quadratic time prediction. General Case Quadratic time kernel computation (Haussler, 1998, Watkins, 1998), cubic time prediction. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 4

Tree Kernels Definition (Colins and Duffy, 2001) Denote by T, T trees and denote by t = T a subtree of T, then k(t, T ) = w t δ t,t. t =T,t =T We count matching subtrees (other definitions possible, will come to that later). Problem We want permutation invariance of unordered trees. Solution Sort trees before computing kernel (good for any tree operation). Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 5

Sorting Trees Sorting Rules Assume existence of lexicographic on labels Introduce symbols [, ] satisfy [ < ], and that ], [ < label(n) for all labels. Algorithm For an unlabeled leaf n define tag(n) := []. For a labeled leaf n define tag(n) := [ label(n)]. For an unlabeled node n with children n 1,..., n c sort the tags of the children in lexicographical order such that tag(n i ) tag(n j ) if i < j and define tag(n) = [ tag(n 1 ) tag(n 2 )... tag(n c )]. For a labeled node perform the same operations as above and set tag(n) = [ label(n) tag(n 1 ) tag(n 2 )... tag(n c )]. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 6

Sorting Trees in Linear Time Example The trees Theorem have label [[][[][]]]. 1. tag(root) can be computed in (λ + 2)(l log 2 l) time and linear storage in l. 2. Substrings s of tag(root) starting with [ and ending with a balanced ] correspond to subtrees T of T where s is the tag on T. 3. Arbitrary substrings s of tag(root) correspond to subset trees T of T. 4. tag(root) is invariant under permutations of the leaves and allows the reconstruction of an unique element of the equivalence class (under permutation). Proof of 1. by induction. Extension to k-ary trees straightforward. Rest follows from definition. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 7

Tree to String Conversion Consequence We can compute tree kernel by 1. Converting trees to strings 2. Computing string kernels Advantages More general subtree operations possible: we may include non-balanced subtrees (cutting a slice from a tree). Simple storage and simple implementation (dynamic array suffices) All speedups for strings work for kernels, too (XML documents, etc.) Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 8

Suffix Trees Definition Compact tree build from all the suffixes of a word. Suffix tree of ababc Properties ab c$ b c$ abc$ c$ abc$ Can be built and stored in linear time (Ukkonen, 1995) Twice as many nodes as characters in string Leaves on subtree give number of matching substrings Suffix Links Connections across the tree. Vital for parsing strings (e.g., if we parsed abracadabra this speeds up the parsing of bracadabra) Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 9

Matching Statistics Definition Given strings x, y with x = n and y = m, the matching statistics of x with respect to y are defined by v, c N n, where v i is the length of the longest substring of y matching a prefix of x[i : n] v i := i + v i 1 c i is a pointer to ceil(x[i : v i ]) in S(y). This can be computed in linear time (Chang and Lawler, 1994). Example Matching statistic of abba with respect to S(ababc). String a b b a v i 2 1 2 1 ceil(c i ) ab b babc$ ab ab c$ b c$ abc$ c$ abc$ Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 10

Matching Substrings Prefixes w is a substring of x iff there is an i such that w is a prefix of x[i : n]. The number of occurrences of w in x can be calculated by finding all such i. Substrings The set of matching substrings of x and y is the set of all prefixes of x[i : v i ]. Next Step If we have a substring w of x, prefixes of w may occur in x with higher frequency. We need an efficient computation scheme. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 11

Key Trick Theorem Let x and y be strings and c and v be the matching statistics of x with respect to y. Assume that W (y, t) = w us w u where u = ceil(t) and t = uv. s prefix(v) can be computed in constant time for any t. O( x + y ) time as k(x, y) = x i=1 val(x[i : v i ]) = x i=1 Then k(x, y) can be computed in val(c i ) + lvs(floor(x[i : v i ]))W (y, x[i : v i ]) where val(t) := lvs(floor(t)) W (y, t) + val(ceil(t)) and val(root) := 0. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 12

Computing W (y, t) in Constant Time Length-Dependent Weights Assume that w s = w s, then W (y, t) = Generic Weights t j= ceil(t) w j w ceil(t) = ω t ω ceil(t) where ω j := Simple option: pre-compute the annotation of all suffix trees beforehand. Better: build suffix tree on all strings (linear time) and annotate this tree. Simplifying assumption for TFIDF weights, w s = φ( s )ψ(#s) W (y, t) = s prefix(t) w s s prefix(ceil(t)) w s = φ(freq(t)) t j i=1 i= ceil(t) +1 w j φ(i) Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 13

Linear Time Prediction Problem For prediction we need to compute f(x) = i α ik(x i, x). This depends on the number of SVs. Bad for large databases (e.g., spam filtering). The classifier degrades in runtime, the more data we have. We are repeatedly parsing s Idea We can merge matching weights from all the SVs. All we need is a compressed lookup function. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 14

Linear Time Prediction Merge all SVs into one suffix tree Σ. Compute matching statistics of x wrt. Sigma. Update weights on every node of Σ as weight(w) = Extend the definition of val(x) to Σ via m α i lvs xi (w) i=1 val Σ (t) := weight(floor(t)) W (Σ, t) + weight(ceil(t)) and val Σ (root) := 0. Here W (Σ, t) denotes the sum of weights between ceil(t) and t, with respect to Σ rather than S(y). We only need to sum over val Σ (x[i : v i ]) to compute f. We can classify texts in linear time regardless of the size of the SV set! Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 15

Summary and Extensions Redux of Tree to String kernels (heaps, stacks, bags, etc. trivial) Linear prediction and kernel computation time (previously quadratic or cubic). Makes things practical. Storage of SVs needed. Can be greatly reduced if redundancies abound in SV set. E.g. for anagram and analphabet we need only analphabet and gram. Coarsening for trees (can be done in linear time, too) Approximate matching and wildcards Automata and dynamical systems Do expensive things with string kernel classifiers. Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/ smola/stringkernels/ Page 16