Kernels for Structured Natural Language Data


Jun Suzuki, Yutaka Sasaki, and Eisaku Maeda
NTT Communication Science Laboratories, NTT Corp.
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237 Japan
{jun, sasaki, maeda}@cslab.kecl.ntt.co.jp

Abstract

This paper devises a novel kernel function for structured natural language data. In the field of Natural Language Processing, feature extraction consists of the following two steps: (1) syntactically and semantically analyzing raw data, i.e., character strings, then representing the results as discrete structures, such as parse trees and dependency graphs with part-of-speech tags; (2) creating (possibly high-dimensional) numerical feature vectors from the discrete structures. The new kernels, called Hierarchical Directed Acyclic Graph (HDAG) kernels, directly accept DAGs whose nodes can contain DAGs. HDAG data structures are needed to fully reflect the syntactic and semantic structures that natural language data inherently have. In this paper, we define the kernel function and show how it permits efficient calculation. Experiments demonstrate that the proposed kernels are superior to existing kernel functions, e.g., sequence kernels, tree kernels, and bag-of-words kernels.

1 Introduction

Recent developments in kernel technology enable us to handle discrete structures, such as sequences, trees, and graphs. Kernel functions suitable for Natural Language Processing (NLP) have recently been proposed. Convolution Kernels [4, 12] demonstrate how to build kernels over discrete structures. Since texts can be analyzed as discrete structures, these discrete kernels have been applied to NLP tasks, such as sequence kernels [8, 9] for text categorization and tree kernels [1, 2] for (shallow) parsing.

In this paper, we focus on tasks in the application areas of NLP, such as Machine Translation, Text Summarization, Text Categorization and Question Answering. In these tasks, richer types of information within texts, such as syntactic and semantic information, are required for higher performance. However, syntactic and semantic information are formed by very complex structures that cannot be written in simple structures such as sequences and trees. The motivation of this paper is to propose kernels specifically suited to structured natural language data. The proposed kernels can handle several of the structures found within texts and calculate kernels with regard to these structures at a practical cost and time. Accordingly, these kernels can be efficiently applied to learning and clustering problems in NLP applications.

Figure 1: Examples of structures within texts as determined by basic NLP tools, for the sentence "Junichiro Koizumi is prime minister of Japan": (1) result of a part-of-speech tagger, (2) result of a noun phrase chunker, (3) result of a named entity tagger, (4) result of a dependency structure analyzer, (5) semantic information from a dictionary (e.g., WordNet).

2 Structured Natural Language Data for Application Tasks in NLP

In general, natural language data contain many kinds of syntactic and semantic structures. For example, texts have several levels of syntactic and semantic chunks, such as part-of-speech (POS) chunks, named entities (NEs), noun phrase (NP) chunks, sentences, and discourse segments, and these are bound by relation structures, such as dependency structures, anaphora, discourse relations and coreference. These syntactic and semantic structures can provide important information for understanding natural language and, moreover, for tackling real tasks in application areas of NLP. The accuracies of basic NLP tools such as POS taggers, NP chunkers, NE taggers, and dependency structure analyzers have improved to the point that they can help to develop real applications.

This paper proposes a method to handle these syntactic and semantic structures in a single framework: we combine the results of basic NLP tools to make one hierarchically structured data set. Figure 1 shows an example of structures within texts analyzed by basic NLP tools that are currently available and that offer easy use and high performance. As shown in Figure 1, structures in texts can be hierarchical or recursive "graphs in graphs": a certain node can be constructed or characterized by other graphs. Nodes usually have several kinds of attributes, such as words, POS tags, semantic information such as WordNet [3], and classes of named entities. Moreover, the relations between nodes are usually directed. Therefore, we should employ a (1) directed, (2) multi-labeled, and (3) hierarchically structured graph to model structured natural language data.

Let V be a set of vertices (or nodes) and E be a set of edges (or links). Then, a graph G = (V, E) is called a directed graph if E is a set of directed links, E ⊆ V × V.

Definition 1 (Multi-Labeled Graph) Let Γ be a set of labels (or attributes) and M ⊆ V × Γ be label allocations. Then, G = (V, E, M) is called a multi-labeled graph.

Definition 2 (Hierarchically Structured Graph) Let G_i = (V_i, E_i) be a subgraph in G = (V, E), where V_i ⊆ V and E_i ⊆ E, and let 𝒢 = {G_1, ..., G_n} be a set of subgraphs in G. F ⊆ V × 𝒢 represents a set of vertical links from a node v ∈ V to a subgraph G_i ∈ 𝒢. Then, G = (V, E, 𝒢, F) is called a hierarchically structured graph if each node has at most one vertical edge.

Intuitively, a vertical link f_{i,j} ∈ F from node v_i to graph G_j indicates that node v_i contains graph G_j. Finally, in this paper, we successfully represent structured natural language data by using a multi-labeled hierarchical directed graph.

Definition 3 (Multi-Labeled Hierarchical Directed Graph) G = (V, E, M, 𝒢, F) is a multi-labeled hierarchical directed graph.
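To make Definition 3 concrete, the sketch below shows one possible in-memory encoding of a multi-labeled hierarchical directed graph. The class and field names are illustrative assumptions of ours, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# A minimal, hypothetical encoding of Definition 3: G = (V, E, M, 𝒢, F).
# V/M  -> `nodes` (each node carries its set of attributes)
# E    -> `edges` (directed links between node ids)
# 𝒢    -> `subgraphs` (named subgraphs, themselves HDGraphs)
# F    -> `vertical` (node id -> subgraph name; at most one per node)

@dataclass
class Node:
    node_id: str
    attributes: Set[str] = field(default_factory=set)

@dataclass
class HDGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)
    subgraphs: Dict[str, "HDGraph"] = field(default_factory=dict)
    vertical: Dict[str, str] = field(default_factory=dict)

    def add_vertical(self, node_id: str, subgraph_name: str) -> None:
        # Definition 2: each node has at most one vertical edge.
        if node_id in self.vertical:
            raise ValueError(f"node {node_id} already has a vertical link")
        self.vertical[node_id] = subgraph_name

# Example: an NP chunk node whose vertical link points to its word-level subgraph.
words = HDGraph(
    nodes={"w1": Node("w1", {"prime", "JJ"}), "w2": Node("w2", {"minister", "NN"})},
    edges={("w1", "w2")},
)
chunks = HDGraph(nodes={"np1": Node("np1", {"NP"})}, subgraphs={"G1": words})
chunks.add_vertical("np1", "G1")
```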

Figure 2: Examples of Hierarchical Directed Graph structures (these are also HDAGs); each letter represents an attribute.

Figure 2 shows examples of multi-labeled hierarchical directed graphs. In this paper, we call a multi-labeled hierarchical directed graph simply a hierarchical directed graph.

3 Kernels on Hierarchical Directed Acyclic Graphs

At first, in order to calculate kernels efficiently, we add one constraint: the hierarchical directed graph has no cyclic paths. First, we define a path on a hierarchical directed graph. If a node has no vertical link, then the node is called a terminal node, denoted as T ⊆ V; otherwise it is a non-terminal node, denoted as T̄ ⊆ V.

Definition 4 (Hierarchical Path (HiP)) Let p = ⟨v_i, e_{i,j}, v_j, ..., v_k, e_{k,l}, v_l⟩ be a path. Let Υ(v) be a function that returns the subgraph G_i linked with v by a vertical link if v ∈ T̄. Let P(G) be a function that returns the set of all HiPs in G, where links between v ∈ G and v' ∉ G are ignored. Then, p_h = ⟨h(v_i), e_{i,j}, h(v_j), ..., h(v_k), e_{k,l}, h(v_l)⟩ is defined as a HiP, where h(v) returns v⟨p_h^x⟩ with p_h^x ∈ P(G_x) s.t. G_x = Υ(v) if v ∈ T̄, and returns v otherwise. Intuitively, a HiP is constructed by a path within a path structure, e.g., p_h = ⟨v_i, e_{i,j}, v_j⟨v_m, e_{m,n}, v_n⟩, ..., v_k, e_{k,l}, v_l⟩.

Definition 5 (Hierarchical Directed Acyclic Graph (HDAG)) A hierarchical directed graph G = (V, E, M, 𝒢, F) is an HDAG if there is no HiP from any node v to the same node v.

A primitive feature for defining kernels on HDAGs is a hierarchical attribute subsequence.

Definition 6 (Hierarchical Attribute Subsequence (HiAS)) A HiAS is defined as a list of attributes with hierarchical information extracted from the nodes on HiPs.

For example, let p_h = ⟨v_i, e_{i,j}, v_j⟨v_m, e_{m,n}, v_n⟩, ..., v_k, e_{k,l}, v_l⟩ be a HiP; then the HiASs in p_h are written as τ*(p_h) = {⟨a_i, a_j⟨a_m, a_n⟩, ..., a_k, a_l⟩}, that is, all combinations over all a_i ∈ τ(v_i), where τ(v) is a function that returns the set of attributes allocated to node v, and τ*(p_h) is a function that returns all possible HiASs extracted from HiP p_h. Γ* denotes all possible HiASs constructed from the attributes in Γ, and γ_i ∈ Γ* denotes the i-th HiAS.

An explicit representation of a feature vector of an HDAG kernel is defined as φ(G) = (φ_1(G), ..., φ_{|Γ*|}(G)), where φ represents the explicit feature mapping from an HDAG to the numerical feature space. The value of φ_i(G) is the weighted number of occurrences of γ_i in G. According to this approach, the HDAG kernel, K(G^1, G^2) = Σ_{i=1}^{|Γ*|} φ_i(G^1) φ_i(G^2), calculates the inner product of the weighted common HiASs in two HDAGs, G^1 and G^2.
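Before turning to the efficient computation, the explicit form above can be read as a plain sparse inner product. The sketch below assumes the feature maps φ are already given as toy inputs of ours; building φ by enumerating HiPs explicitly is exactly what the recursive formulas later in the paper avoid.

```python
from collections import Counter

# Minimal sketch of the explicit HDAG kernel: phi maps each HiAS gamma_i to its
# weighted number of occurrences in a graph; the kernel is the inner product.
def hdag_kernel_explicit(phi1: Counter, phi2: Counter) -> float:
    return sum(w * phi2[gamma] for gamma, w in phi1.items() if gamma in phi2)

# HiASs written as nested tuples, e.g. ('a', ('b',), 'c') standing for <a, <b>, c>.
phi_G1 = Counter({("a",): 1.0, ("a", ("b",), "c"): 0.5})
phi_G2 = Counter({("a",): 2.0, ("c",): 1.0, ("a", ("b",), "c"): 0.5})
print(hdag_kernel_explicit(phi_G1, phi_G2))  # 1.0*2.0 + 0.5*0.5 = 2.25
```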

Figure 3: An example of a Hierarchical Directed Graph with weight factors (weights are attached to nodes v_i, directed links e_{i,j}, vertical links f_{i,j}, and labels L_i).

In this paper, we write the condition directly under a summation sign to stand for "such that," since it is simpler.

$$K_{HDAG}(G^1, G^2) = \sum_{\gamma_i \in \Gamma^*} \;\; \sum_{\substack{p_h^1 \in P(G^1) \\ \gamma_i \in \tau^*(p_h^1)}} \;\; \sum_{\substack{p_h^2 \in P(G^2) \\ \gamma_i \in \tau^*(p_h^2)}} W_{\gamma_i}(p_h^1)\, W_{\gamma_i}(p_h^2), \qquad (1)$$

where W_{γ_i}(p_h) represents the weight value of HiAS γ_i in HiP p_h. The weight of HiAS γ_i in HiP p_h is determined by

$$W_{\gamma_i}(p_h) = \prod_{v \in V(p_h)} W_V(v) \prod_{e_{i,j} \in E(p_h)} W_E(v_i, v_j) \prod_{f_{i,j} \in F(p_h)} W_F(v_i, G_j) \prod_{a \in \tau(\gamma_i)} W_\Gamma(a), \qquad (2)$$

where W_V(v), W_E(v_i, v_j), W_F(v_i, G_j), and W_Γ(a) represent the weight of node v, of the link from v_i to v_j, of the vertical link from v_i to subgraph G_j, and of attribute a, respectively. An example of how each weight factor is given is shown in Figure 3. In the case of NLP data, for example, W_Γ(a) might be given by a tf·idf score computed from large-scale document collections, W_V(v) by the type of chunk such as word, phrase or named entity, W_E(v_i, v_j) by the type of relation between v_i and v_j, and W_F(v_i, G_j) by the number of nodes in G_j.

Soft Structural Matching Frameworks

Since HDAG kernels permit not only exact matching of substructures but also approximate matching, we add the frameworks of "node skip" and "relaxation of hierarchical information."

First, we discuss the framework of the node skip. We introduce a decay function Λ_V(v) (0 < Λ_V(v) ≤ 1), which represents the cost of skipping node v when extracting HiASs from the HiPs; this is almost the same architecture as [8]. For example, a HiAS under node skips is written as ⟨a_1, *⟨a_3⟩, *, a_5⟩ from HiP ⟨v_1⟨v_2, v_3⟩, v_4, v_5⟩, where * is the explicit representation of a node that is skipped.

Next, in the case of the relaxation of hierarchical information, we perform two processes: (1) we form one hierarchy if there is multiple hierarchy information at the same point, for example, ⟨a_i, ⟨a_j⟩, ⟨a_k⟩⟩ becomes ⟨a_i, ⟨a_j, a_k⟩⟩; and (2) we delete hierarchical information if there exists only one node, for example, ⟨a_i, ⟨a_j⟩, a_k⟩ becomes ⟨a_i, a_j, a_k⟩. These two frameworks achieve approximate substructure matching automatically.

Table 1 shows an explicit representation of the common HiASs (features) of G^1 and G^2 in Figure 2. For the sake of simplicity, all the weights W_V(v), W_E(v_i, v_j), W_F(v_i, G_j), and W_Γ(a) are taken as 1, and for all v, Λ_V(v) = λ if v has at least one attribute, otherwise Λ_V(v) = 1.
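As a concrete reading of equation (2), the weight of one HiAS occurrence is simply a product over the elements it touches. The sketch below uses hypothetical weight tables of ours (w_node, w_edge, w_vert, w_attr stand for W_V, W_E, W_F, W_Γ); it is not code from the paper.

```python
# Sketch of equation (2): the weight of a HiAS occurring in a HiP is the product
# of the weights of the nodes, directed links, and vertical links on the HiP and
# of the attributes that make up the HiAS.  All tables below are toy inputs.
def hias_weight(hip_nodes, hip_edges, hip_verticals, hias_attrs,
                w_node, w_edge, w_vert, w_attr):
    w = 1.0
    for v in hip_nodes:                  # W_V(v)
        w *= w_node[v]
    for vi, vj in hip_edges:             # W_E(v_i, v_j)
        w *= w_edge[(vi, vj)]
    for vi, gj in hip_verticals:         # W_F(v_i, G_j)
        w *= w_vert[(vi, gj)]
    for a in hias_attrs:                 # W_Gamma(a)
        w *= w_attr[a]
    return w

# A two-node HiP with one directed link; a skipped node would instead contribute
# its decay factor Lambda_V(v) rather than its node and attribute weights.
print(hias_weight(["v1", "v2"], [("v1", "v2")], [], ["prime", "minister"],
                  {"v1": 1.0, "v2": 0.9}, {("v1", "v2"): 0.7}, {},
                  {"prime": 0.4, "minister": 0.5}))  # 1.0*0.9*0.7*0.4*0.5
```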

Table 1: Common HiASs of G^1 and G^2 in Figure 2 (NS represents the node skip, HR represents the relaxation of hierarchical information).

Efficient Recursive Computation

In general, when the dimension of the feature space |Γ*| becomes very high, it is computationally infeasible to generate the feature vector φ(G) explicitly. We define an efficient calculation formula between HDAGs G^1 and G^2, which is written as

$$K_{HDAG}(G^1, G^2) = \sum_{q \in Q} \sum_{r \in R} K(q, r), \qquad (3)$$

where Q = {q_1, ..., q_{|Q|}} and R = {r_1, ..., r_{|R|}} represent the nodes in G^1 and G^2, respectively, and K(q, r) represents the sum of the weighted common HiASs that are extracted from the HiPs whose sink nodes are q and r.

$$K(q, r) = J_{1,1}(q, r)\,H(q, r) + \hat{H}(q, r)\,I(q, r) + I(q, r) \qquad (4)$$

Function I(q, r) returns the weighted number of common attributes of nodes q and r,

$$I(q, r) = W_V(q)\,W_V(r) \sum_{a_1 \in \tau(q)} \sum_{a_2 \in \tau(r)} W_\Gamma(a_1)\,W_\Gamma(a_2)\,\delta(a_1, a_2), \qquad (5)$$

where δ(a_1, a_2) = 1 if a_1 = a_2, and 0 otherwise. Let H(q, r) be a function that returns the sum of the weighted common HiASs between q and r, including Υ(q) and Υ(r):

$$H(q, r) = \begin{cases} I(q, r) + \bigl(I(q, r) + \Lambda_V(q)\,\Lambda_V(r)\bigr)\,\hat{H}(q, r), & \text{if } q, r \in \bar{T} \\ I(q, r), & \text{otherwise} \end{cases} \qquad (6)$$

$$\hat{H}(q, r) = \sum_{\substack{s \in G_i \\ G_i = \Upsilon(q)}} \; \sum_{\substack{t \in G_j \\ G_j = \Upsilon(r)}} W_F(q, G_i)\,W_F(r, G_j)\,J_{i,j}(s, t) \qquad (7)$$

Let J_{x,y}(q, r), J'_{x,y}(q, r), and J''_{x,y}(q, r), where x and y are (sub)graphs, be recursive functions used to calculate H(q, r) and K(q, r):

$$J_{x,y}(q, r) = J''_{x,y}(q, r)\,H(q, r) + H(q, r) \qquad (8)$$

$$J'_{x,y}(q, r) = \begin{cases} \displaystyle\sum_{t \in \psi(r) \cap V(y)} W_E(q, t)\bigl(\Lambda'_V(t)\,J'_{x,y}(q, t) + J_{x,y}(q, t)\bigr), & \text{if } \psi(r) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

$$J''_{x,y}(q, r) = \begin{cases} \displaystyle\sum_{s \in \psi(q) \cap V(x)} W_E(s, r)\bigl(\Lambda'_V(s)\,J''_{x,y}(s, r) + J'_{x,y}(s, r)\bigr), & \text{if } \psi(q) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

where Λ'_V(v) = Λ_V(v) ∏_{t ∈ G_i, G_i = Υ(v)} Λ'_V(t) if v ∈ T̄, and Λ'_V(v) = Λ_V(v) otherwise. Function ψ(q) returns the set of nodes that have direct links to node q; ψ(q) = ∅ means that no node has a direct link to q.
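Equation (5) above collapses to a sum over the attributes that q and r share. A minimal sketch, with hypothetical weight tables standing in for W_V and W_Γ, follows.

```python
# Sketch of equation (5): the weighted number of attributes shared by nodes q
# and r.  Since delta(a1, a2) is nonzero only when a1 == a2, the double sum
# reduces to a sum over the intersection of the two attribute sets.
def I(q, r, tau, w_node, w_attr):
    shared = tau[q] & tau[r]
    return w_node[q] * w_node[r] * sum(w_attr[a] ** 2 for a in shared)

tau = {"q": {"Japan", "NNP", "Country"}, "r": {"Japan", "NNP"}}
w_node = {"q": 1.0, "r": 1.0}
w_attr = {"Japan": 1.0, "NNP": 0.5, "Country": 1.0}
print(I("q", "r", tau, w_node, w_attr))  # 1.0*1.0 + 0.5*0.5 = 1.25
```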

Next, we show the formulas used when the framework of relaxation of hierarchical information is applied. The functions have the same meanings as in the previous formulas. We denote H̄(q, r) = H(q, r) + H'(q, r).

$$K(q, r) = J_{1,1}(q, r)\,\bar{H}(q, r) + \bigl(H'(q, r) + H''(q, r)\bigr)\,I(q, r) + I(q, r) \qquad (11)$$

$$H(q, r) = \bigl(H'(q, r) + H''(q, r)\bigr)\,I(q, r) + H''(q, r) + I(q, r) \qquad (12)$$

$$H'(q, r) = \sum_{\substack{t \in G_j \\ G_j = \Upsilon(r)}} W_F(r, G_j)\,\bar{H}(q, t), \quad \text{if } r \in \bar{T} \qquad (13)$$

$$H''(q, r) = \begin{cases} \displaystyle\sum_{\substack{s \in G_i \\ G_i = \Upsilon(q)}} W_F(q, G_i)\,\bar{H}(s, r) + \hat{H}(q, r), & \text{if } q, r \in \bar{T} \\ \displaystyle\sum_{\substack{s \in G_i \\ G_i = \Upsilon(q)}} W_F(q, G_i)\,\bar{H}(s, r), & \text{if } q \in \bar{T} \end{cases} \qquad (14)$$

$$J_{x,y}(q, r) = J''_{x,y}(q, r)\,\bar{H}(q, r) \qquad (15)$$

$$J'_{x,y}(q, r) = \sum_{t \in \psi(r) \cap V(y)} W_E(q, t)\bigl(\Lambda'_V(t)\,J'_{x,y}(q, t) + J_{x,y}(q, t) + \bar{H}(q, t)\bigr), \quad \text{if } \psi(r) \neq \emptyset \qquad (16)$$

Functions I(q, r), J''_{x,y}(q, r), and Ĥ(q, r) are the same as those shown above.

According to equation (3), given the recursive definition of K(q, r), the kernel value between two HDAGs can be calculated in time O(|Q||R|). In actual use, we may want to evaluate only the subset of all HiASs whose sizes are under n when determining the kernel value, because of the problem discussed in [1]. This can simply be realized by not calculating those HiASs whose sizes exceed n when calculating K(q, r); the calculation cost then becomes O(n|Q||R|).

Finally, we normalize the values of the HDAG kernels to remove any bias introduced by the number of nodes in the graphs. This normalization corresponds to the standard unit-norm normalization of examples in the feature space corresponding to the kernel space, K̂(x, y) = K(x, y) · (K(x, x)K(y, y))^{-1/2} [4].

We will now elucidate an efficient processing algorithm. First, as a pre-process, the nodes are sorted under two conditions, V(Υ(v)) ≺ v and Ψ(v) ≺ v, where Ψ(v) represents all nodes that have a path to v. The dynamic programming technique can then be used to compute HDAG kernels very efficiently: by following the sorted order, the values needed to calculate K(q, r) have already been computed in the previous steps.
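Before moving to the experiments, note that the unit-norm normalization above is a one-liner once the raw kernel is available; a minimal sketch (the function names are ours):

```python
import math

# K_hat(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y)): removes the bias introduced
# by the number of nodes in each graph (larger graphs otherwise get larger raw
# kernel values simply because they contain more HiASs).
def normalized_kernel(kernel, x, y):
    return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))

# e.g. with a plain dot-product kernel on toy vectors
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
print(normalized_kernel(dot, (1.0, 2.0), (2.0, 4.0)))  # 1.0 (parallel vectors)
```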

4 Experiments

Our aim was to test the efficiency of using the richer syntactic and semantic structures available within texts, which can be treated for the first time by our proposed method. We evaluated the performance of the proposed method on the actual NLP task of Question Classification, which is similar to the Text Classification task except that it requires many more semantic features within texts [7, 10].

Figure 4: Examples of input data of the comparison methods, for the question "Who is prime minister of Japan?": word order of attributes (Seq-K), hierarchical chunks and their relations (HDAG-K), and dependency structures of attributes (DS-K, DAG-K).

Table 2: Results of question classification by SVM with the comparison kernel functions, evaluated by F-measure.

We used three different QA data sets written in Japanese [10]. We compared the performance of the proposed kernel, the HDAG Kernel (HDAG-K), with DAG kernels (DAG-K), Dependency Structure kernels (DS-K) [1], and sequence kernels (Seq-K) [9]. Moreover, we evaluated the bag-of-words kernel (BOW-K) [6], that is, bag-of-words with polynomial kernels, as the baseline method. The main difference between the methods is their ability to treat syntactic and semantic information within texts. Figure 4 shows the differences in the input objects of each method; for better understanding, these examples are shown in English. We used words, named entity tags, and semantic information [5] as attributes. Seq-K only treats word order, DS-K and DAG-K treat dependency structures, and HDAG-K treats the NP and NE chunks with their dependency structures. We used the same formulas as our proposed method for DAG-K; comparing HDAG-K to DAG-K shows the difference in performance between handling the hierarchical structures and not handling them. We extended Seq-K and DS-K to improve their total performance and to establish a more equal evaluation, under the same conditions, against our proposed method. Note that though DAG-K and DS-K handle input objects of the same form, their kernel calculation methods differ, as do their return values. We used the node skip parameter Λ_V(v) = 0.5 for all nodes v in each comparison method.

We used SVM [11] as the kernel-based machine learning algorithm. We evaluated the performance of the comparison methods on the question types TIME TOP, LOCATION, ORGANIZATION, and NUMEX, which are defined in the CRL QA-data (http://www.cs.nyu.edu/~sekine/project/crlqa/). Table 2 shows the average F-measure as evaluated by 5-fold cross validation. n in Table 2 indicates the threshold on the number of attributes, that is, we evaluated only those HiASs that contain fewer than n attributes for each kernel calculation. As shown in the table, HDAG-K showed the best performance in the experiments.

The experiments in this paper were designed to investigate how to improve performance by using the richer syntactic and semantic structures within texts. In the task of Question Classification, a given question is classified into a Question Type, which reflects the intention of the question.
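As a practical note on this kind of setup, a precomputed kernel matrix such as K̂ can be fed directly to an off-the-shelf SVM. The sketch below uses scikit-learn's SVC with a precomputed kernel and a fake Gram matrix; it only illustrates the plumbing and is not the SVM implementation or data used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for the normalized HDAG kernel values K_hat(G_i, G_j):
# a random PSD Gram matrix over 8 "questions" with binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
gram = X @ X.T
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

train, test = slice(0, 6), slice(6, 8)
clf = SVC(kernel="precomputed").fit(gram[train, train], y[train])
pred = clf.predict(gram[test, train])   # rows: test items, columns: training items
print(pred)
```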

These results indicate that our approach of incorporating richer structural features within texts is well suited to the tasks in these NLP applications. The original DS-K requires exact matching of the tree structure, even when it is extended for more flexible matching; this is why DS-K showed the worst performance in our experiments. The sequence, DAG, and HDAG kernels offer approximate matching through the node skip framework, which produces better performance in tasks that evaluate the intention of the texts. The structure of an HDAG approaches that of a DAG if we do not consider the hierarchical structure. In addition, the structures of sequences and trees are entirely included in that of DAGs. Thus, the HDAG kernel subsumes some of the discrete kernels, such as sequence, tree, and graph kernels.

5 Conclusions

This paper proposed HDAG kernels, which can handle more of the rich syntactic and semantic information present within texts. Our proposed method is a very general framework for handling structured natural language data. We evaluated the performance of HDAG kernels on the real NLP task of question classification. Our experiments showed that HDAG kernels offer better performance than sequence kernels, tree kernels, and the baseline bag-of-words kernels when the target task requires the use of the richer information within texts.

References

[1] M. Collins and N. Duffy. Convolution Kernels for Natural Language. In Proc. of Neural Information Processing Systems (NIPS 2001), 2001.
[2] M. Collins and N. Duffy. Parsing with a Single Neuron: Convolution Kernels for Natural Language Problems. Technical Report UCS-CRL-01-01, UC Santa Cruz, 2001.
[3] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[4] D. Haussler. Convolution Kernels on Discrete Structures. Technical Report UCS-CRL-99-10, UC Santa Cruz, 1999.
[5] S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo, H. Nakaiwa, K. Ogura, Y. Oyama, and Y. Hayashi, editors. The Semantic Attribute System, Goi-Taikei: A Japanese Lexicon, volume 1. Iwanami Publishing, 1997. (in Japanese).
[6] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proc. of the European Conference on Machine Learning (ECML '98), pages 137-142, 1998.
[7] X. Li and D. Roth. Learning Question Classifiers. In Proc. of the 19th International Conference on Computational Linguistics (COLING 2002), pages 556-562, 2002.
[8] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text Classification Using String Kernels. Journal of Machine Learning Research, 2:419-444, 2002.
[9] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders. Word-Sequence Kernels. Journal of Machine Learning Research, 3:1059-1082, 2003.
[10] J. Suzuki, H. Taira, Y. Sasaki, and E. Maeda. Question Classification using HDAG Kernel. In Workshop on Multilingual Summarization and Question Answering (2003), pages 61-68, 2003.
[11] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[12] C. Watkins. Dynamic Alignment Kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, Computer Science Department, 1999.