Fast String Kernels
Alexander J. Smola
Machine Learning Group, RSISE
The Australian National University, Canberra, ACT 0200
Alex.Smola@anu.edu.au
Joint work with S.V.N. Vishwanathan
Slides (soon) available at http://mlg.anu.edu.au/~smola/stringkernel/
Overview
- Kernels on Strings
- Kernels on Trees: Weighting Schemes, Tree to String Conversion
- Suffix Trees: Definition and Examples, Matching Statistics, Counting Substrings
- Weights and Kernels: Annotation and Weighting Function, Linear Time Prediction
- Extensions and Future Work
Alex Smola: Fast String Kernels, http://mlg.anu.edu.au/~smola/stringkernels/ Page 2
String Kernel Basics
Some Notation
- Alphabet A: what we build strings from
- Sentinel character: usually $; it terminates the string
- Concatenation: xy is obtained by appending y to x
- Prefix / Suffix: if x = yz, then y is a prefix and z is a suffix of x
Exact Matching Kernels
  k(x, x') := Σ_{s ⊑ x, s' ⊑ x'} w_s δ_{s,s'} = Σ_{s ∈ A*} num_s(x) num_s(x') w_s
where num_s(x) denotes the number of occurrences of s in x.
Inexact Matching Kernels
  k(x, x') := Σ_{s, s' ∈ A*} num_s(x) num_{s'}(x') w_{s,s'}
Accounting for mismatches. Much more expensive to compute and not the topic of this talk.
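As a concrete reference point, the exact matching kernel can be evaluated directly from the definition by counting substring occurrences. This brute-force sketch (function names are mine, not from the talk) is only for illustration and is far from the linear-time methods the talk develops:

```python
from collections import Counter

def substr_counts(x):
    # num_s(x): occurrence count of every non-empty substring s of x
    return Counter(x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1))

def exact_match_kernel(x, y, w=lambda s: 1.0):
    # k(x, y) = sum_s num_s(x) * num_s(y) * w_s, summed over common substrings
    nx, ny = substr_counts(x), substr_counts(y)
    return sum(c * ny[s] * w(s) for s, c in nx.items() if s in ny)
```

With the uniform weight w_s = 1, exact_match_kernel("aa", "aa") gives 5: the substring a occurs twice and aa once, contributing 2·2 + 1·1.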
String Kernel Examples
- Bag of Characters: w_s = 0 for all |s| > 1 counts single characters. Linear-time kernel computation and linear-time prediction (Joachims, 1999).
- Bag of Words: s is bounded by whitespace. Linear time (Joachims, 1999).
- Limited Range Correlations: setting w_s = 0 for all |s| > n yields limited-range correlations of length n.
- k-Spectrum Kernel: takes into account substrings of length k (Eskin et al., 2002), i.e. w_s = 0 for all |s| ≠ k. Linear-time kernel computation, quadratic-time prediction.
- General Case: quadratic-time kernel computation (Haussler, 1998; Watkins, 1998), cubic-time prediction.
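For instance, the k-spectrum kernel can be written down in a few lines by comparing k-mer counts directly (a sketch of my own, not code from the talk):

```python
from collections import Counter

def k_spectrum_kernel(x, y, k):
    # w_s = 0 unless |s| = k: only substrings of length exactly k contribute
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # inner product of the two k-mer count vectors
    return sum(c * cy[s] for s, c in cx.items())
```

For example, k_spectrum_kernel("abab", "abab", 2) is 5: the 2-mer ab occurs twice and ba once, giving 2·2 + 1·1.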
Tree Kernels
Definition (Collins and Duffy, 2001)
Denote by T, T' trees, and write t ⊑ T for a subtree t of T. Then
  k(T, T') = Σ_{t ⊑ T, t' ⊑ T'} w_t δ_{t,t'}
i.e. we count matching subtrees (other definitions are possible; we will come to that later).
Problem: we want permutation invariance for unordered trees.
Solution: sort the trees before computing the kernel (useful for any tree operation).
Sorting Trees
Sorting Rules
Assume a lexicographic order on labels exists. Introduce symbols [ and ] satisfying [ < ], with ], [ < label(n) for all labels.
Algorithm
- For an unlabeled leaf n define tag(n) := [].
- For a labeled leaf n define tag(n) := [label(n)].
- For an unlabeled node n with children n_1, ..., n_c, sort the tags of the children in lexicographic order such that tag(n_i) ≤ tag(n_j) if i < j, and define tag(n) := [tag(n_1) tag(n_2) ... tag(n_c)].
- For a labeled node, perform the same operations and set tag(n) := [label(n) tag(n_1) tag(n_2) ... tag(n_c)].
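A minimal sketch of the tagging algorithm, assuming trees encoded as (label, children) tuples and relying on Python's built-in string ordering, under which '[' < ']' and both precede lowercase labels, as required above:

```python
def tag(node):
    # node = (label_or_None, [child nodes]); returns the canonical bracket string
    label, children = node
    # sort the children's tags lexicographically for permutation invariance
    inner = sorted(tag(c) for c in children)
    return "[" + (label or "") + "".join(inner) + "]"
```

Two unordered trees that differ only by a permutation of children then receive identical tags, so the string kernel on tags is well defined.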
Sorting Trees in Linear Time
Example: the trees shown both have tag [[][[][]]].
Theorem
1. tag(root) can be computed in (λ + 2)(l log₂ l) time with storage linear in l.
2. Substrings s of tag(root) starting with [ and ending with a balanced ] correspond to subtrees T' of T, where s is the tag of T'.
3. Arbitrary substrings s of tag(root) correspond to subset trees T' of T.
4. tag(root) is invariant under permutations of the leaves and allows reconstruction of a unique element of the equivalence class (under permutation).
Proof of 1 by induction; extension to k-ary trees is straightforward. The rest follows from the definition.
Tree to String Conversion
Consequence: we can compute a tree kernel by
1. converting the trees to strings, and
2. computing string kernels on the results.
Advantages
- More general subtree operations become possible: we may include non-balanced subtrees (cutting a slice from a tree).
- Simple storage and simple implementation (a dynamic array suffices).
- All speedups for string kernels carry over to tree kernels (XML documents, etc.).
Suffix Trees
Definition: the compact tree built from all the suffixes of a word.
[Figure: suffix tree of ababc$, with edge labels ab, b, c$, abc$, ...]
Properties
- Can be built and stored in linear time (Ukkonen, 1995).
- Has at most twice as many nodes as there are characters in the string.
- The number of leaves below a node gives the number of occurrences of the corresponding substring.
Suffix Links: connections across the tree, vital for parsing strings (e.g., having parsed abracadabra speeds up the parsing of bracadabra).
Matching Statistics
Definition: given strings x, y with |x| = n and |y| = m, the matching statistics of x with respect to y are v, c ∈ N^n, where
- v_i is the length of the longest substring of y matching a prefix of x[i : n], i.e. x[i : i + v_i − 1] is the longest prefix of x[i : n] occurring in y;
- c_i is a pointer to ceil(x[i : i + v_i − 1]) in S(y), the suffix tree of y.
Here ceil(t) denotes the last node on the root-to-t path whose string is a prefix of t, and floor(t) the first node at or below t.
Matching statistics can be computed in linear time (Chang and Lawler, 1994).
Example: matching statistics of abba with respect to S(ababc):
  string:                  a   b   b      a
  v_i:                     2   1   2      1
  floor(x[i : i+v_i−1]):   ab  b   babc$  ab
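The v-part of the matching statistics can be checked against a brute-force version (a quadratic-space sketch of my own; the suffix-tree algorithm of Chang and Lawler achieves linear time):

```python
def matching_statistics(x, y):
    # v[i] = length of the longest prefix of x[i:] that occurs somewhere in y
    subs = {y[i:j] for i in range(len(y)) for j in range(i + 1, len(y) + 1)}
    v = []
    for i in range(len(x)):
        k = 0
        while i + k < len(x) and x[i:i + k + 1] in subs:
            k += 1
        v.append(k)
    return v
```

On the slide's example this reproduces v = (2, 1, 2, 1) for abba with respect to ababc.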
Matching Substrings
Prefixes: w is a substring of x iff there is an i such that w is a prefix of x[i : n]. The number of occurrences of w in x can be calculated by finding all such i.
Substrings: the set of matching substrings of x and y is the set of all prefixes of the x[i : i + v_i − 1].
Next Step: if we have a matching substring w, prefixes of w may occur with higher frequency. We need an efficient computation scheme.
Key Trick
Theorem: let x and y be strings and let c and v be the matching statistics of x with respect to y. Assume that
  W(y, t) = Σ_{s ∈ prefix(v)} w_{us} − w_u, where u = ceil(t) and t = uv,
can be computed in constant time for any t. Then k(x, y) can be computed in O(|x| + |y|) time as
  k(x, y) = Σ_{i=1}^{|x|} val(x[i : i + v_i − 1])
          = Σ_{i=1}^{|x|} val(c_i) + lvs(floor(x[i : i + v_i − 1])) W(y, x[i : i + v_i − 1])
where val(t) := lvs(floor(t)) W(y, t) + val(ceil(t)) and val(root) := 0. Here lvs(n) denotes the number of leaves below node n.
Computing W(y, t) in Constant Time
Length-Dependent Weights: assume w_s = w_{|s|}. Then
  W(y, t) = Σ_{j=|ceil(t)|+1}^{|t|} w_j = ω_{|t|} − ω_{|ceil(t)|}, where ω_j := Σ_{i=1}^{j} w_i
and the ω_j can be precomputed.
Generic Weights
- Simple option: pre-compute the annotation of all suffix trees beforehand.
- Better: build a suffix tree on all strings (linear time) and annotate this tree.
TFIDF Weights: with the simplifying assumption w_s = φ(|s|) ψ(freq(s)),
  W(y, t) = Σ_{s ∈ prefix(t)} w_s − Σ_{s ∈ prefix(ceil(t))} w_s = ψ(freq(t)) Σ_{i=|ceil(t)|+1}^{|t|} φ(i)
since all strings between ceil(t) and t occur equally often.
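For length-dependent weights, the constant-time lookup amounts to prefix sums over the w_j; a small sketch (names are mine, not from the talk):

```python
def make_W_length(w, max_len):
    # precompute omega[j] = w_1 + ... + w_j for length-dependent weights w_s = w_{|s|}
    omega = [0.0]
    for j in range(1, max_len + 1):
        omega.append(omega[-1] + w(j))

    def W(len_t, len_ceil_t):
        # W(y, t) = omega_{|t|} - omega_{|ceil(t)|}, a constant-time difference
        return omega[len_t] - omega[len_ceil_t]

    return W
```

With uniform weights w_j = 1 this simply counts the substrings between ceil(t) and t.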
Linear Time Prediction
Problem: for prediction we need to compute f(x) = Σ_i α_i k(x_i, x). This depends on the number of SVs, which is bad for large databases (e.g., spam filtering): the runtime of the classifier degrades the more data we have, and we parse x repeatedly, once per SV.
Idea: merge the matching weights of all the SVs. All we need is a compressed lookup function.
Linear Time Prediction
- Merge all SVs into one suffix tree Σ.
- Compute the matching statistics of x with respect to Σ.
- Update the weight on every node w of Σ as
    weight(w) := Σ_{i=1}^{m} α_i lvs_{x_i}(w).
- Extend the definition of val to Σ via
    val_Σ(t) := weight(floor(t)) W(Σ, t) + val_Σ(ceil(t)) and val_Σ(root) := 0.
  Here W(Σ, t) denotes the sum of weights between ceil(t) and t, taken with respect to Σ rather than S(y).
- We only need to sum the val_Σ(x[i : i + v_i − 1]) to compute f(x).
- We can classify texts in linear time, regardless of the size of the SV set!
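The merging idea can be illustrated in the k-spectrum special case, with a hash map standing in for the annotated suffix tree Σ (a simplification of the scheme above, not the actual data structure): prediction cost then depends on |x| and k, not on the number of SVs.

```python
from collections import Counter

def merge_svs(svs, alphas, k):
    # weight(s) = sum_i alpha_i * num_s(x_i), merged once over all SVs
    merged = Counter()
    for x, a in zip(svs, alphas):
        for i in range(len(x) - k + 1):
            merged[x[i:i + k]] += a
    return merged

def predict(merged, x, k):
    # f(x) = sum over k-mers of x of the merged weight; one lookup per position
    return sum(merged[x[i:i + k]] for i in range(len(x) - k + 1))
```

Building `merged` is done once; each subsequent classification is a single pass over x.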
Summary and Extensions
- Reduction of tree kernels to string kernels (heaps, stacks, bags, etc. are trivial).
- Linear-time prediction and kernel computation (previously quadratic or cubic) makes things practical.
- Storage of SVs is needed, but can be greatly reduced if redundancies abound in the SV set: e.g., for anagram and analphabet we need only analphabet and gram.
- Coarsening for trees (can be done in linear time, too).
- Approximate matching and wildcards.
- Automata and dynamical systems.
- Do expensive things with string-kernel classifiers.