MACHINE LEARNING. Kernel Methods. Alessandro Moschitti


1 MACHINE LEARNING Kernel Methods Alessandro Moschitti Department of Information and Communication Technology, University of Trento

2 Communications Next Lectures will be on February 11, 13, 16, 18 We start at 12:00 (sharp)

3 Outline (1) PART I: Theory Motivations Kernel Based Machines Kernel Methods Kernel Trick Mercer's conditions Kernel operators Basic Kernels Linear Kernel Polynomial Kernel Lexical kernel String Kernel

4 Outline (2) Tree kernels subtree, subset tree and partial tree kernels PART II: Applications to Question Answering Question Classification Kernel Combinations Answer Classification Experiments and Results Conclusions

5 Motivation (1) Feature design is the most difficult aspect of designing a learning system: it is a complex and difficult phase, e.g., for structural feature representation; deep knowledge and intuitions are required; design problems arise when the phenomenon is described by many features

6 Motivation (2) Kernel methods alleviate such problems: structures are represented in terms of their substructures; high-dimensional, implicit and abstract feature spaces; a high number of features is generated, and Support Vector Machines select the relevant ones; automatic feature engineering as a side effect

7 Theory Kernel Trick Kernel Based Machines Basic Kernel Properties Kernel Types

8 The main idea of Kernel Functions Mapping vectors into a space where they are linearly separable, x → φ(x). (The figure shows the points x and o, not separable in the input space, becoming separable as φ(x) and φ(o).)

9 A mapping example Given two masses m_1 and m_2, where one is constrained, apply a force f_a to the mass m_1. Experiment features: m_1, m_2 and f_a. We want to learn a classifier that tells when the mass m_1 will get far away from m_2. If we consider the gravitational Newton law f(m_1, m_2, r) = C m_1 m_2 / r^2, we need to find when f(m_1, m_2, r) < f_a.

10 Mapping Example x = (x_1, ..., x_n) → φ(x) = (φ_1(x), ..., φ_n(x)). The gravitational law is not linear, so we need to change the space: (f_a, m_1, m_2, r) → (k, x, y, z) = (ln f_a, ln m_1, ln m_2, ln r). As ln f(m_1, m_2, r) = ln C + ln m_1 + ln m_2 - 2 ln r = c + x + y - 2z, we need the hyperplane ln f_a - ln m_1 - ln m_2 + 2 ln r - ln C = 0, i.e. (1, 1, -2)·(x, y, z) - ln f_a + ln C = 0. With it we can decide without error whether the mass will get far away or not.

11 A kernel-based Machine. Perceptron training:
w_0 ← 0; b_0 ← 0; k ← 0; R ← max_{1 ≤ i ≤ ℓ} ||x_i||
repeat
    for i = 1 to ℓ
        if y_i (w_k · x_i + b_k) ≤ 0 then
            w_{k+1} = w_k + η y_i x_i
            b_{k+1} = b_k + η y_i R^2
            k = k + 1
        endif
    endfor
until no error is found
return k, (w_k, b_k)
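A minimal C sketch of this primal perceptron (the dense data layout, function name and parameters are illustrative, not from the slides; as in the pseudocode, the loop terminates only if the data is linearly separable):

#include <math.h>

/* Primal perceptron: on a mistake, w <- w + eta*y_i*x_i, b <- b + eta*y_i*R^2.
   X is l x n row-major, y[i] in {-1,+1}; returns the number of updates k. */
int perceptron_train(const double *X, const int *y, int l, int n,
                     double eta, double *w, double *b) {
    double R = 0.0;
    int k = 0, errors = 1;
    for (int i = 0; i < l; i++) {                 /* R = max_i ||x_i|| */
        double nrm = 0.0;
        for (int j = 0; j < n; j++) nrm += X[i*n+j] * X[i*n+j];
        if (sqrt(nrm) > R) R = sqrt(nrm);
    }
    for (int j = 0; j < n; j++) w[j] = 0.0;
    *b = 0.0;
    while (errors) {                              /* until no error is found */
        errors = 0;
        for (int i = 0; i < l; i++) {
            double s = *b;
            for (int j = 0; j < n; j++) s += w[j] * X[i*n+j];
            if (y[i] * s <= 0) {                  /* mistake: update w and b */
                for (int j = 0; j < n; j++) w[j] += eta * y[i] * X[i*n+j];
                *b += eta * y[i] * R * R;
                k++;
                errors = 1;
            }
        }
    }
    return k;
}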

15 Changing Gradient Representation. At each step of the perceptron, only training data is added, with a certain weight:
w = Σ_{j=1..ℓ} α_j y_j x_j
So the classification function becomes
sgn(w · x + b) = sgn( Σ_{j=1..ℓ} α_j y_j (x_j · x) + b )
Note that the data only appears in the scalar product.

16 Dual Perceptron algorithm and Kernel functions. We can rewrite the classification function as
h(x) = sgn(w · φ(x) + b) = sgn( Σ_{j=1..ℓ} α_j y_j φ(x_j) · φ(x) + b ) = sgn( Σ_{j=1..ℓ} α_j y_j k(x_j, x) + b )
as well as the updating function:
if y_i ( Σ_{j=1..ℓ} α_j y_j k(x_j, x_i) + b ) ≤ 0 then α_i = α_i + η
The learning rate η does not affect the algorithm, so we set η = 1.
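The same algorithm in dual form, as a hedged C sketch: the data enters only through a kernel function (a function pointer standing in for any valid k), and a mistake on example i simply does α_i += 1 (η = 1):

typedef double (*kernel_fn)(const double *x, const double *z, int n);

/* Dual perceptron: h(x) = sgn(sum_j alpha_j y_j k(x_j, x) + b).
   R2 is R^2 from the primal version; X is l x n row-major. */
void dual_perceptron_train(const double *X, const int *y, int l, int n,
                           kernel_fn k, double R2, double *alpha, double *b) {
    int errors = 1;
    for (int i = 0; i < l; i++) alpha[i] = 0.0;
    *b = 0.0;
    while (errors) {
        errors = 0;
        for (int i = 0; i < l; i++) {
            double s = *b;
            for (int j = 0; j < l; j++)            /* data only via k(x_j, x_i) */
                s += alpha[j] * y[j] * k(&X[j*n], &X[i*n], n);
            if (y[i] * s <= 0) {                   /* mistake on example i */
                alpha[i] += 1.0;                   /* eta = 1 as on the slide */
                *b += (double)y[i] * R2;
                errors = 1;
            }
        }
    }
}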

17 Kernels in Support Vector Machines. In Soft Margin SVMs we maximize:
Σ_{i=1..ℓ} α_i - ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_i · x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
By using kernel functions we rewrite the problem as:
Σ_{i=1..ℓ} α_i - ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j k(x_i, x_j)

18 Kernel Function Definition. Kernels are products of mapping functions: for x ∈ R^n, φ(x) = (φ_1(x), φ_2(x), ..., φ_m(x)) ∈ R^m, and k(x, z) = φ(x) · φ(z).

19 The Kernel Gram Matrix. With KM-based learning, the sole information used from the training data set is the Kernel Gram Matrix K, with K_ij = k(x_i, x_j), i, j = 1, .., ℓ. If the kernel is valid, K is symmetric and positive semi-definite.

20 Valid Kernels

21 Valid Kernels cont d If the matrix is positive semi-definite then we can find a mapping φ implementing the kernel function

22 Valid Kernel operations
k(x,z) = k_1(x,z) + k_2(x,z)
k(x,z) = k_1(x,z) · k_2(x,z)
k(x,z) = α k_1(x,z), α > 0
k(x,z) = f(x) f(z)
k(x,z) = k_1(φ(x), φ(z))
k(x,z) = x′ B z, with B positive semi-definite
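A quick check of the first two rules (a standard argument, not spelled out on the slide) writes the corresponding feature maps explicitly:

% Sum: concatenate the feature maps of k_1 and k_2.
\phi(x) = \big(\phi_1(x),\, \phi_2(x)\big) \;\Rightarrow\;
\phi(x)\cdot\phi(z) = \phi_1(x)\cdot\phi_1(z) + \phi_2(x)\cdot\phi_2(z) = k_1(x,z) + k_2(x,z)

% Product: take the tensor product of the feature maps.
\phi(x) = \phi_1(x)\otimes\phi_2(x) \;\Rightarrow\;
\phi(x)\cdot\phi(z) = \big(\phi_1(x)\cdot\phi_1(z)\big)\,\big(\phi_2(x)\cdot\phi_2(z)\big) = k_1(x,z)\,k_2(x,z)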

23 Basic Kernels for unstructured data Linear Kernel Polynomial Kernel Lexical kernel String Kernel

24 A simple classification problem: Text Categorization. Words such as Berlusconi, Bush, Totti, Inzaghi, wonderful, yesterday, acquires, declares, war, match, before, elections link documents to the categories Politics (C_1), Economics (C_2), ..., Sport (C_n).

25 Text Classification Problem. Given a set of target categories C = {C_1, .., C_n} and the set T of documents, define f : T → 2^C. VSM (Salton, 1989): features are dimensions of a vector space; documents and categories are vectors of feature weights; d is assigned to C_i if d · C_i > th_i.

26 The Vector Space Model. d_1 (Politics): "Bush declares war. Berlusconi gives support". d_2 (Sport): "Wonderful Totti in the yesterday match against Berlusconi's Milan". d_3 (Economics): "Berlusconi acquires Inzaghi before elections". C_1: Politics category; C_2: Sport category. (The figure plots the document and category vectors on the axes Berlusconi, Totti, Bush.)

27 Linear Kernel. In Text Categorization documents are word vectors:
Φ(d_x) = x = (0,..,1,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,1): buy, acquisition, stocks, sell, market
Φ(d_z) = z = (0,..,1,..,0,..,1,..,0,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,0): buy, company, stocks, sell
The dot product x · z counts the number of features in common; this provides a sort of similarity.

28 Feature Conjunction (Polynomial Kernel). The initial vectors are mapped into a higher-dimensional space, which is more expressive, as it encodes conjunctions such as Stock+Market vs. Downtown+Market. We can smartly compute the scalar product:
Φ((x_1, x_2)) = (x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1)
Φ(x) · Φ(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (x_1 z_1 + x_2 z_2 + 1)^2 = (x · z + 1)^2 = K_Poly(x, z)
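A tiny C check of this identity (the two vectors are arbitrary test values; only the degree-2 mapping from the slide is used):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x[2] = {1.0, 2.0}, z[2] = {3.0, 0.5};
    /* explicit mapping (x1^2, x2^2, sqrt(2)x1x2, sqrt(2)x1, sqrt(2)x2, 1) */
    double px[6] = {x[0]*x[0], x[1]*x[1], sqrt(2)*x[0]*x[1], sqrt(2)*x[0], sqrt(2)*x[1], 1};
    double pz[6] = {z[0]*z[0], z[1]*z[1], sqrt(2)*z[0]*z[1], sqrt(2)*z[0], sqrt(2)*z[1], 1};
    double explicit_dot = 0.0;
    for (int i = 0; i < 6; i++) explicit_dot += px[i] * pz[i];
    /* kernel trick: (x.z + 1)^2, no explicit mapping needed */
    double dot = x[0]*z[0] + x[1]*z[1];
    printf("explicit = %f, kernel = %f\n", explicit_dot, (dot + 1) * (dot + 1));
    return 0;   /* both print 25.000000 */
}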

29 Using character sequences φ("bank") = x = (0,..,1,..,0,..,1,..,0,...1,..,0,..,1,..,0,..,1,..,0) bank ank bnk bk b φ("rank") = z = (1,..,0,..,0,..,1,..,0,...0,..,1,..,0,..,1,..,0,..,1) rank ank rnk rk r x z counts the number of common substrings x z = φ("bank") φ("rank") = k("bank","rank")

30 String Kernel Given two strings, the number of matches between their substrings is evaluated E.g. Bank and Rank B, a, n, k, Ba, Ban, Bank, Bk, an, ank, nk,.. R, a, n, k, Ra, Ran, Rank, Rk, an, ank, nk,.. String kernel over sentences and texts Huge space but there are efficient algorithms

31 Formal Definition. A subsequence u occurs in a string s with index vector i = (i_1, .., i_|u|), i_1 < .. < i_|u|; its span is l(i) = i_|u| - i_1 + 1. Each occurrence is weighted by λ^l(i), so φ_u(s) = Σ_{i: u = s[i]} λ^l(i) and SK(s, t) = Σ_u φ_u(s) φ_u(t).

32 Kernel between Bank and Rank

33 An example of string kernel computation

34 Efficient Evaluation. Dynamic Programming technique: evaluate the spectrum string kernels (substrings of size p) and sum the contributions of the different spectra.

35 Efficient Evaluation

36 An example: SK("Gatta", "Cata"). First, evaluate the SK with size p = 1, i.e., the single-character matches a, a, t, t, a, a; store them in the table SK_{p=1}.

37 Evaluating DP2 Predicted weight of the string of size p multiplied by the number of substrings of size p-1

38 Evaluating the Predictive DP on strings of size 2 (second row). Let's consider substrings of size 2 and suppose that: we have matched the first 'a'; we will match the next character that we add to the two strings. We compute the weights of future matches at different string positions with some not-yet-known character '?'. If the match occurs immediately after 'a', the weight will be λ^(1+1) × λ^(1+1) = λ^4, and we store just λ^2 in the DP entry ['a','a'].

39 Evaluating the DP wrt different positions (second row). If the match for 'gatta' occurs after 't', the weight will be λ^(1+2) (× λ^2 = λ^5), since the substring for it will be 'a?' with a gap (λ^3). We write such a prediction in the entry ['a','t']. Same rationale for a match after the second 't': we have the substring 'a??' (matching with 'a?' from 'cata') for a weight of λ^(3+1) (× λ^2).

40 Evaluating the DP wrt different positions (third row). If the match of the second character occurs after the 't' of 'cata', the weight will be λ^(2+1) (× λ^2 = λ^5), since it will be with the string 'a?', with a weight of λ^3. If the match occurs after the 't' of both 'gatta' and 'cata', there are two ways to compose substrings of size two: 'a?' with weight λ^4 or 't?' with weight λ^2; the total is λ^2 + λ^4.

41 Evaluating the DP wrt different positions (third row). Finally, consider a match after the last 't' of both 'cata' and 'gatta'. There are three possible substrings of 'gatta': 'a??', 't?', 't?', with weights λ^3, λ^2 and λ, respectively; there are two possible substrings of 'cata': 'a?', 't?', with weights λ^2 and λ. Pairing the same-character substrings gives weights λ^5, λ^3 and λ^2, so the total weight is λ^5 + λ^3 + λ^2.

42 Evaluating SK of size 2 using DP2 (table SK_{p=2}). The number/weight of substrings of size 2 between 'gat' and 'cat' is λ^4 = λ^2 (the ['a','a'] entry of DP) × λ^2 (the cost of one character per string). Between 'gatta' and 'cata' it is λ^7 + λ^5 + λ^4, i.e., the weighted matches of the common subsequences 'aa', 'at' and 'ta'.
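The walkthrough can be condensed into a direct (unoptimized) C sketch of the gap-weighted subsequence kernel, following the standard recursion of Lodhi et al. (2002); its auxiliary tables play the role of the DP tables above, though the bookkeeping convention may differ in minor details from the slides' example:

#include <math.h>
#include <stdlib.h>
#include <string.h>

/* SK_p(s, t): weighted count of common subsequences of length p,
   each occurrence weighted by lambda^span. */
double subseq_kernel(const char *s, const char *t, int p, double lambda) {
    int n = (int)strlen(s), m = (int)strlen(t);
    /* Kp[l][i][j] = K'_l over prefixes s[0..i), t[0..j): weight of common
       subsequences of length l, charged up to the ends of both prefixes. */
    double ***Kp = malloc((size_t)p * sizeof *Kp);
    for (int l = 0; l < p; l++) {
        Kp[l] = malloc((size_t)(n + 1) * sizeof *Kp[l]);
        for (int i = 0; i <= n; i++) {
            Kp[l][i] = calloc((size_t)(m + 1), sizeof *Kp[l][i]);
            if (l == 0)
                for (int j = 0; j <= m; j++) Kp[l][i][j] = 1.0;  /* base case */
        }
    }
    for (int l = 1; l < p; l++)
        for (int i = l; i <= n; i++)
            for (int j = l; j <= m; j++) {
                Kp[l][i][j] = lambda * Kp[l][i-1][j];
                for (int q = 1; q <= j; q++)        /* matches of s[i-1] in t */
                    if (t[q-1] == s[i-1])
                        Kp[l][i][j] += Kp[l-1][i-1][q-1] * pow(lambda, j - q + 2);
            }
    double K = 0.0;                                  /* close the p-th matched pair */
    for (int i = 1; i <= n; i++)
        for (int q = 1; q <= m; q++)
            if (t[q-1] == s[i-1])
                K += Kp[p-1][i-1][q-1] * lambda * lambda;
    for (int l = 0; l < p; l++) {
        for (int i = 0; i <= n; i++) free(Kp[l][i]);
        free(Kp[l]);
    }
    free(Kp);
    return K;
}

For example, subseq_kernel("a", "a", 1, lambda) gives lambda^2, the λ^(1+1) weight of a single-character match used in the slides.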

43 Document Similarity. Doc 1 and Doc 2 are compared by linking related words (e.g., industry, company, telephone, product, market) across the two documents.

44 Lexical Semantic Kernel [CoNLL 2005]. The document similarity is the SK function:
SK(d_1, d_2) = Σ_{w_1 ∈ d_1, w_2 ∈ d_2} s(w_1, w_2)
where s is any similarity function between words, e.g. WordNet similarity [Basili et al., 2005] or LSA [Cristianini et al., 2002]. Good results when training data is scarce.
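A hedged C sketch of this summation; word_sim is a hypothetical callback standing in for any WordNet- or LSA-based similarity (not part of the slides):

typedef double (*word_sim)(const char *w1, const char *w2);

/* SK(d1, d2) = sum over all word pairs (w1 in d1, w2 in d2) of s(w1, w2). */
double lexical_semantic_kernel(const char **d1, int n1,
                               const char **d2, int n2, word_sim s) {
    double sk = 0.0;
    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++)
            sk += s(d1[i], d2[j]);   /* any word similarity: WordNet, LSA, ... */
    return sk;
}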

45 Mercer's Theorem (finite space). Let us consider K = ( K(x_i, x_j) )_{i,j=1}^n. K symmetric implies there is a V such that K = V Λ V′ (Takagi factorization of a complex-symmetric matrix), where Λ is the diagonal matrix of the eigenvalues λ_t of K and v_t = (v_ti)_{i=1}^n are the eigenvectors, i.e., the columns of V. Let us assume the eigenvalues are non-negative; then
φ : x_i → ( √λ_t v_ti )_{t=1}^n ∈ R^n, i = 1, .., n

46 Mercer's Theorem (sufficient conditions). Therefore
Φ(x_i) · Φ(x_j) = Σ_{t=1}^n λ_t v_ti v_tj = (V Λ V′)_ij = K_ij = K(x_i, x_j),
which implies that K is a kernel function.

47 Mercer's Theorem (necessary conditions). Suppose we have a negative eigenvalue λ_s with eigenvector v_s. The point
z = Σ_{i=1}^n v_si Φ(x_i) = √Λ V′ v_s
has the following norm:
||z||^2 = z · z = (√Λ V′ v_s) · (√Λ V′ v_s) = v_s′ V √Λ √Λ V′ v_s = v_s′ K v_s = v_s′ λ_s v_s = λ_s ||v_s||^2 < 0,
which contradicts the geometry of the space.

48 Is it a valid kernel? It may not be a kernel, so we can use M M′ (which is positive semi-definite by construction).

49 Tree kernels Subtree, Subset Tree, Partial Tree kernels Efficient computation

50 Example of a parse tree for "John delivers a talk in Rome":
(S (N John) (VP (V delivers) (NP (D a) (N talk)) (PP (IN in) (N Rome))))

51 The Syntactic Tree Kernel (STK) [Collins and Duffy, 2002]. Example fragment: (VP (V delivers) (NP (D a) (N talk)))

52 The overall fragment set, e.g. (VP (V)(NP)), (NP (D)(N)), (D a), ... Children are not divided: a fragment contains either all children of a node or none.

53 Explicit kernel space φ(t x ) = x = (0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0) φ(t z ) = z = (1,..,0,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,0,..,1,..,0,..,0) x z counts the number of common substructures

54 Efficient evaluation of the scalar product:
x · z = φ(T_x) · φ(T_z) = K(T_x, T_z) = Σ_{n_x ∈ T_x} Σ_{n_z ∈ T_z} Δ(n_x, n_z)

55 Efficient evaluation of the scalar product:
x · z = φ(T_x) · φ(T_z) = K(T_x, T_z) = Σ_{n_x ∈ T_x} Σ_{n_z ∈ T_z} Δ(n_x, n_z)
[Collins and Duffy, ACL 2002] evaluate Δ in O(n^2):
Δ(n_x, n_z) = 0, if the productions are different; else
Δ(n_x, n_z) = 1, if pre-terminals; else
Δ(n_x, n_z) = Π_{j=1}^{nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))

56 Other Adjustments. Decay factor:
Δ(n_x, n_z) = λ, if pre-terminals; else
Δ(n_x, n_z) = λ Π_{j=1}^{nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))
Normalization:
K′(T_x, T_z) = K(T_x, T_z) / √( K(T_x, T_x) K(T_z, T_z) )
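A compact C sketch of the Δ recursion with the decay factor λ; the Node structure and its field names are illustrative (production equality is abstracted into an integer id):

typedef struct Node {
    int production_id;        /* id of the CFG production rooted at this node */
    int is_preterminal;       /* 1 if the node dominates only a leaf */
    int nc;                   /* number of children */
    struct Node **children;
} Node;

/* Delta(n_x, n_z) of the STK [Collins and Duffy, 2002] with decay lambda;
   K(T_x, T_z) is then the sum of delta over all node pairs. */
double delta(const Node *nx, const Node *nz, double lambda) {
    if (nx->production_id != nz->production_id)
        return 0.0;                              /* different productions */
    if (nx->is_preterminal)
        return lambda;                           /* matching pre-terminals */
    double d = lambda;
    for (int j = 0; j < nx->nc; j++)             /* same production => same nc */
        d *= 1.0 + delta(nx->children[j], nz->children[j], lambda);
    return d;
}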

57 SubTree (ST) Kernel [Vishwanathan and Smola, 2002]. Fragments are complete subtrees down to the leaves, e.g. (VP (V delivers) (NP (D a) (N talk))), (NP (D a) (N talk)), (D a), (N talk), (V delivers)

58 Evaluation. Given the equation for the SST kernel:
Δ(n_x, n_z) = 0, if the productions are different; else
Δ(n_x, n_z) = 1, if pre-terminals; else
Δ(n_x, n_z) = Π_{j=1}^{nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))

59 Evaluation. For the ST kernel, the recursion drops the (1 + ·) term:
Δ(n_x, n_z) = Π_{j=1}^{nc(n_x)} Δ(ch(n_x, j), ch(n_z, j))
so only complete subtrees are counted.

60 Fast Evaluation of STK [EACL 2006]:
K(T_x, T_z) = Σ_{⟨n_x, n_z⟩ ∈ NP} Δ(n_x, n_z), where
NP = { ⟨n_x, n_z⟩ ∈ T_x × T_z : Δ(n_x, n_z) ≠ 0 } = { ⟨n_x, n_z⟩ ∈ T_x × T_z : P(n_x) = P(n_z) },
and P(n_x), P(n_z) are the production rules used at nodes n_x and n_z.

61 Algorithm

63 Observations. We order the production rules used in T_x and T_z at loading time. At learning time we may evaluate NP in O(|T_x| + |T_z|) running time. If T_x and T_z are generated by only one production rule, the running time degenerates to O(|T_x| × |T_z|): very unlikely! A sketch of the merge is shown below.
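A C sketch of that merge, reusing the Node struct from the STK sketch above and assuming both node arrays were sorted by production_id at loading time:

/* Emit all node pairs with equal productions (the set NP). Runs in about
   |Tx| + |Tz| steps when productions repeat rarely; degenerates to
   O(|Tx| * |Tz|) only if all nodes share one production. */
void node_pairs(Node **tx, int nx, Node **tz, int nz,
                void (*emit)(Node *, Node *)) {
    int i = 0, j = 0;
    while (i < nx && j < nz) {
        if (tx[i]->production_id < tz[j]->production_id) i++;
        else if (tx[i]->production_id > tz[j]->production_id) j++;
        else {
            int j0 = j;                          /* start of the equal block in tz */
            while (i < nx && tx[i]->production_id == tz[j0]->production_id) {
                for (j = j0; j < nz && tz[j]->production_id == tx[i]->production_id; j++)
                    emit(tx[i], tz[j]);
                i++;
            }
        }
    }
}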

64 Labeled Ordered Tree Kernel. SST satisfies the constraint "remove 0 or all children at a time". If we relax such a constraint we get more general substructures [Kashima and Koyanagi, 2002], e.g. fragments of (VP (V gives) (NP (D a) (N talk))) in which any subset of children may be removed.

65 Weighting Problems. Both matched pairs, e.g. (VP (V gives)(NP (D a)(N talk))) matched against (VP (V gives)(NP (D a)(JJ good)(N talk))) and against (VP (V gives)(NP (D a)(JJ bad)(N talk))), give the same contribution. Gap-based weighting is needed; a novel efficient evaluation has to be defined.

66 Partial Trees [ECML 2006]. SST + String Kernel with weighted gaps on nodes' children, e.g. fragments of (VP (V brought) (NP (D a) (N cat))) in which subsequences of children are allowed.

67 Partial Tree Kernel. By adding two decay factors, μ for the height of the fragments and λ for the gaps in the children subsequences, we obtain the PTK Δ recursion.

68 Efficient Evaluation (2). The complexity of finding the subsequences is O(p ρ^2); therefore the overall complexity is O(p ρ^2 |N_{T_x}| |N_{T_z}|), where ρ is the maximum branching factor (p = ρ in the worst case).

69 Running Time of Tree Kernel Functions

70 Kernel Combinations, an example:
K = K_Tree + K_p
K = K_Tree × K_p
K = γ K_Tree + K_p
where K_Tree is the tree kernel and K_p is a polynomial kernel (degree 3) over flat features.

71 Question Classification Definition: What does HTML stand for? Description: What's the final line in the Edgar Allan Poe poem "The Raven"? Entity: What foods can cause allergic reaction in people? Human: Who won the Nobel Peace Prize in 1992? Location: Where is the Statue of Liberty? Manner: How did Bob Marley die? Numeric: When was Martin Luther King Jr. born? Organization: What company makes Bentley cars?

72 Question Classifier based on Tree Kernels. Question dataset [Li and Roth, 2005], distributed on 6 categories: Abbreviations, Descriptions, Entity, Human, Location, and Numeric. Fixed split: 5500 training and 500 test questions; cross-validation (10 folds). Using the whole question parse trees (constituent parsing). Example: "What is an offer of direct stock purchase plan?"

74 Kernels. BOW and POS are obtained with a simple tree, e.g. (BOX (What *)(is *)(an *)(offer *)...); PT (parse tree); PAS (predicate argument structure).

75 Question classification

76 TASK: Automatic Classification. The classifier detects if a pair (question, answer) is correct or not. A representation for the pair is needed. The classifier can be used to re-rank the output of a basic QA system.

77 Dataset 2: TREC data. 138 TREC 2001 test questions labeled as "description". 2,256 sentences extracted from the best-ranked paragraphs (using a basic QA system based on the Lucene search engine on the TREC dataset), 216 of which labeled as correct by one annotator.

78 Dataset 2: TREC data. Note that a question is linked to many answers: all its derived pairs cannot be shared by training and test sets.

79 Computational Representations and Feature Spaces

80 Bags of words (BOW) and POS-tags (POS). To save time, apply STK to these trees:
(BOX (What *)(is *)(an *)(offer *)(of *))
(BOX (WHNP *)(VBZ *)(DT *)(NN *)(IN *))

81 Word and POS Sequences.
Word sequence (WSK): "What is an offer of ?" → features such as What_is_offer, What_is
POS sequence (POSSK): WHNP VBZ DT NN IN → features such as WHNP_VBZ_NN, WHNP_NN_IN

82 Syntactic Parse Trees (PT)

84 Predicate Argument Classification. In an event: target words describe a relation among different entities; the participants are often seen as the predicate's arguments. Example: [Arg0 Paul] [predicate gives] [Arg1 a lecture] [ArgM in Rome]

85 Predicate Argument Structure for the Partial Tree Kernel (PAS-PTK).
[ARG1 Antigens] were [AM-TMP originally] [rel defined] [ARG2 as nonself molecules].
[ARG0 Researchers] [rel describe] [ARG1 antigens] [ARG2 as foreign molecules] [ARGM-LOC in the body].

86 Kernels and Combinations. Exploiting the property k(x,z) = k_1(x,z) + k_2(x,z): basic kernels BOW, POS, WSK, POSSK, PT, PAS-PTK, and combinations BOW+POS, BOW+PT, PT+POS, ... (a normalized-sum sketch follows).
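In C, such a combination is just a sum of the individual kernels, each typically normalized first so that no single kernel dominates; the six kernel values are assumed to be computed elsewhere:

#include <math.h>

/* k'(x,z) = k(x,z) / sqrt(k(x,x) k(z,z)); the sum of two valid kernels
   is itself a valid kernel (closure under addition). */
double combined_kernel(double kA_xz, double kA_xx, double kA_zz,
                       double kB_xz, double kB_xx, double kB_zz) {
    double a = kA_xz / sqrt(kA_xx * kA_zz);
    double b = kB_xz / sqrt(kB_xx * kB_zz);
    return a + b;
}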

87-93 Results on TREC Data (5-fold cross validation): charts of F1-measure by kernel type (figure-only slides). The final chart compares the BOW baseline against POSSK+STK+PAS-PTK and reports the percentage of improvement.

94 Conclusions. Dealing with noise and errors of NLP modules requires robust approaches. SVMs are robust to noise, and kernel methods allow for: syntactic information via STK; shallow semantic information via PTK; word/POS sequences via String Kernels. When the IR task is complex, syntax and semantics are essential: great improvement in Q/A classification. SVM-Light-TK: an efficient tool to use them.

95 SVM-light-TK Software. Encodes ST, SST and combination kernels in SVM-light [Joachims, 1999]. Available at. Handles tree forests and vector sets. New extensions: the PT kernel will be released asap.

96 Data Format. "What does Html stand for?"
1 |BT| (SBARQ (WHNP (WP What))(SQ (AUX does)(NP (NNP S.O.S.))(VP (VB stand)(PP (IN for))))(. ?)) |BT| (BOW (What *)(does *)(S.O.S. *)(stand *)(for *)(? *)) |BT| (BOP (WP *)(AUX *)(NNP *)(VB *)(IN *)(. *)) |BT| (PAS (ARG0 (R-A1 (What *)))(ARG1 (A1 (S.O.S. NNP)))(ARG2 (rel stand))) |ET| 1:1 21: E-4 23:1 30:1 36:1 39:1 41:1 46:1 49:1 66:1 152:1 274:1 333:1 |BV| 2:1 21: E-4 23:1 31:1 36:1 39:1 41:1 46:1 49:1 52:1 66:1 152:1 246:1 333:1 392:1 |EV|

97 Basic Commands.
Training and classification:
./svm_learn -t 5 -C T train.dat model
./svm_classify test.dat model
Learning with a vector sequence:
./svm_learn -t 5 -C V train.dat model
Learning with the sum of vector and kernel sequences:
./svm_learn -t 5 -C + train.dat model

98 Custom Kernel (kernel.h):
double custom_kernel(KERNEL_PARM *kernel_parm, DOC *a, DOC *b);

if (a->num_of_trees && b->num_of_trees &&
    a->forest_vec[i] != NULL && b->forest_vec[i] != NULL) {
    /* proceed only if the i-th trees of instances a and b are both non-empty */

99 Custom Kernel: tree kernel.
k1 = /* summation of tree kernels */
     tree_kernel(kernel_parm, a, b, i, i) /       /* tree kernel between the two i-th trees */
     sqrt(tree_kernel(kernel_parm, a, a, i, i) *
          tree_kernel(kernel_parm, b, b, i, i));  /* normalize with respect to both i-th trees */

100 Custom Kernel: polynomial kernel.
if (a->num_of_vectors && b->num_of_vectors &&
    a->vectors[i] != NULL && b->vectors[i] != NULL) {
    /* check that the i-th vectors are not empty */
    k2 = /* summation of vectors */
         basic_kernel(kernel_parm, a, b, i, i) /  /* standard kernel, selected by the "second_kernel" parameter */

101 Custom Kernel: polynomial kernel (cont'd).
         sqrt(basic_kernel(kernel_parm, a, a, i, i) *
              basic_kernel(kernel_parm, b, b, i, i));  /* normalize vectors */
return k1 + k2;
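Putting slides 98-101 together, a plausible complete body of custom_kernel would look like the following sketch; the header name, loop bounds and guard conditions are reconstructed from the fragments above, so treat it as illustrative rather than as the shipped source:

#include <math.h>
#include "svm_common.h"   /* SVM-light types DOC and KERNEL_PARM (assumed header name) */

double custom_kernel(KERNEL_PARM *kernel_parm, DOC *a, DOC *b) {
    double k1 = 0.0, k2 = 0.0;
    for (int i = 0; i < a->num_of_trees && i < b->num_of_trees; i++)
        if (a->forest_vec[i] != NULL && b->forest_vec[i] != NULL)  /* skip empty trees */
            k1 += tree_kernel(kernel_parm, a, b, i, i) /
                  sqrt(tree_kernel(kernel_parm, a, a, i, i) *
                       tree_kernel(kernel_parm, b, b, i, i));
    for (int i = 0; i < a->num_of_vectors && i < b->num_of_vectors; i++)
        if (a->vectors[i] != NULL && b->vectors[i] != NULL)        /* skip empty vectors */
            k2 += basic_kernel(kernel_parm, a, b, i, i) /
                  sqrt(basic_kernel(kernel_parm, a, a, i, i) *
                       basic_kernel(kernel_parm, b, b, i, i));
    return k1 + k2;
}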

102 Conclusions. Kernel methods and SVMs are useful tools to design language applications. Kernel design still requires some level of expertise. Engineering approaches to tree kernels: basic combinations; canonical mappings, e.g. node marking; merging of kernels into more complex kernels. State of the art in SRL and QC. An efficient tool to use them.

103 Thank you

104 References
Alessandro Moschitti, Silvia Quarteroni, Roberto Basili and Suresh Manandhar, Exploiting Syntactic and Shallow Semantic Kernels for Question/Answer Classification. In Proceedings of the 45th Conference of the Association for Computational Linguistics (ACL), Prague, June 2007.
Alessandro Moschitti and Fabio Massimo Zanzotto, Fast and Effective Kernels for Relational Learning from Texts. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), Corvallis, OR, USA, 2007.
Daniele Pighin, Alessandro Moschitti and Roberto Basili, RTV: Tree Kernels for Thematic Role Classification. In Proceedings of the 4th International Workshop on Semantic Evaluation (SemEval-4), English Semantic Labeling, Prague, June 2007.
Stephan Bloehdorn and Alessandro Moschitti, Combined Syntactic and Semantic Kernels for Text Classification. To appear in the 29th European Conference on Information Retrieval (ECIR), Rome, Italy, April 2007.
Fabio Aiolli, Giovanni Da San Martino, Alessandro Sperduti and Alessandro Moschitti, Efficient Kernel-based Learning for Trees. To appear in the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Honolulu, Hawaii, 2007.

105 An introductory book on SVMs, Kernel methods and Text Categorization

106 References
Roberto Basili and Alessandro Moschitti, Automatic Text Categorization: from Information Retrieval to Support Vector Learning. Aracne editrice, Rome, Italy, 2005.
Alessandro Moschitti, Efficient Convolution Kernels for Dependency and Constituent Syntactic Trees. In Proceedings of the 17th European Conference on Machine Learning (ECML 2006), Berlin, Germany, 2006.
Alessandro Moschitti, Daniele Pighin and Roberto Basili, Tree Kernel Engineering for Proposition Re-ranking. In Proceedings of Mining and Learning with Graphs (MLG 2006), workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.
Elisa Cilia, Alessandro Moschitti, Sergio Ammendola and Roberto Basili, Structured Kernels for Automatic Detection of Protein Active Sites. In Proceedings of Mining and Learning with Graphs (MLG 2006), workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.

107 References
Fabio Massimo Zanzotto and Alessandro Moschitti, Automatic learning of textual entailments with cross-pair similarities. In Proceedings of COLING-ACL, Sydney, Australia, 2006.
Alessandro Moschitti, Making tree kernels practical for natural language learning. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento, Italy, 2006.
Alessandro Moschitti, Daniele Pighin and Roberto Basili, Semantic Role Labeling via Tree Kernel joint inference. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL 2006), New York, USA, 2006.
Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili, Semantic Tree Kernels to classify Predicate Argument Structures. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI 2006), Riva del Garda, Italy, 2006.

108 References
Alessandro Moschitti and Roberto Basili, A Tree Kernel approach to Question and Answer Classification in Question Answering Systems. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Genova, Italy, 2006.
Ana-Maria Giuglea and Alessandro Moschitti, Semantic Role Labeling via FrameNet, VerbNet and PropBank. In Proceedings of the Joint 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, 2006.
Roberto Basili, Marco Cammisa and Alessandro Moschitti, Effective use of WordNet semantics via kernel-based learning. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005), Ann Arbor (MI), USA, 2005.

109 References
Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola and Roberto Basili, Hierarchical Semantic Role Labeling. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), Ann Arbor (MI), USA, 2005.
Roberto Basili, Marco Cammisa and Alessandro Moschitti, A Semantic Kernel to classify texts with very few training examples. In Proceedings of the Workshop on Learning in Web Search, at the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005.
Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili, Engineering of Syntactic Features for Shallow Semantic Parsing. In Proceedings of the ACL 2005 Workshop on Feature Engineering for Machine Learning in Natural Language Processing, Ann Arbor (MI), USA, 2005.

110 References
Alessandro Moschitti, A study on Convolution Kernels for Shallow Semantic Parsing. In Proceedings of ACL 2004, Barcelona, Spain, 2004.
Alessandro Moschitti and Cosmin Adrian Bejan, A Semantic Kernel for Predicate Argument Classification. In Proceedings of CoNLL 2004, Boston, MA, USA, 2004.
M. Collins and N. Duffy, New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL 2002.
S.V.N. Vishwanathan and A.J. Smola, Fast kernels on strings and trees. In Proceedings of Neural Information Processing Systems (NIPS), 2002.

111 References
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
Xavier Carreras and Lluís Màrquez, Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In Proceedings of CoNLL 2005.
Sameer Pradhan, Kadri Hacioglu, Valeri Krugler, Wayne Ward, James H. Martin and Daniel Jurafsky, Support vector learning for semantic argument classification. To appear in Machine Learning Journal.

112 The Impact of SSTK in Answer Classification (F1-measure by combination):
Q(BOW)+A(BOW); Q(BOW)+A(PT,BOW); Q(PT)+A(PT,BOW); Q(BOW)+A(BOW,PT,PAS); Q(BOW)+A(BOW,PT,PAS_N); Q(PT)+A(PT,BOW,PAS); Q(BOW)+A(BOW,PAS); Q(BOW)+A(BOW,PAS_N)

113 Mercer's conditions (1)

114 Mercer's conditions (2). If the Gram matrix G = ( k(x_i, x_j) )_{i,j} is positive semi-definite, there is a mapping φ that produces the target kernel function.

115 The lexical semantic kernel is not always a kernel. It may not be a kernel, so we can use M M′, where M is the initial similarity matrix.

116 Efficient Evaluation (1). In [Shawe-Taylor and Cristianini, 2004], sequence kernels with weighted gaps are factorized with respect to the different subsequence sizes (the D_p tables). We treat children as sequences and apply the same theory.
