MACHINE LEARNING. Kernel Methods. Alessandro Moschitti

1 MACHINE LEARNING Kernel Methods Alessandro Moschitti Department of Information and Communication Technology, University of Trento

2 Outline (1) PART I: Theory Motivations Kernel Based Machines Kernel Methods Kernel Trick Mercer's conditions Kernel operators Basic Kernels Linear Kernel Polynomial Kernel Lexical kernel String Kernel

3 Outline (2) Tree kernels subtree, subset tree and partial tree kernels PART II: Kernel engineering for Natural Language Applications Object Transformation and Kernel Combinations: The Semantic Role Labeling case Node marking Shallow Semantic Tree Kernels Experiments

4 Outline (3) Merging of Kernels and Kernel Combinations: Question/Answer Classification The Syntactic/Semantic Tree Kernel Advanced Kernel combinations Experiments Example of Custom Kernel in SVM-LIGHT-TK Conclusions

5 Kernels for Natural Language Processing (1) Typical tasks: from text categorization to syntactic and semantic parsing. Feature design for NLP applications is a complex and difficult phase, e.g., for structural feature representations: deep knowledge and intuitions are required, and design problems arise when the phenomenon is described by many features.

6 Kernels for Natural Language Processing (2) Kernel methods alleviate such problems: structures are represented in terms of their substructures; high-dimensional, implicit and abstract feature spaces; a high number of features is generated, and Support Vector Machines select the relevant ones; automatic feature engineering as a side-effect.

7 Kernel Engineering Kernel design still requires noticeable creativity and expertise. An engineering approach is needed: starting from basic and well-known kernel functions; basic combinations; canonical mappings, e.g. object transformations; merging of kernels.

8 Theory Kernel Trick Kernel Based Machines Basic Kernel Properties Kernel Types

9 The main idea of Kernel Functions Mapping vectors into a space where they are linearly separable: x → φ(x). [figure: positive (x) and negative (o) points, not linearly separable in the input space, become separable after the mapping φ]

10 A mapping example Given two masses m_1 and m_2, one of which is constrained, apply a force f_a to the mass m_1. Experiment features: m_1, m_2 and f_a. We want to learn a classifier that tells when the mass m_1 will get far away from m_2. If we consider the gravitational Newton law f(m_1, m_2, r) = C m_1 m_2 / r², we need to find when f(m_1, m_2, r) < f_a.

11 Mapping Example x = (x_1, ..., x_n) → φ(x) = (φ_1(x), ..., φ_n(x)). The gravitational law is not linear, so we change space: (f_a, m_1, m_2, r) → (k, x, y, z) = (ln f_a, ln m_1, ln m_2, ln r). Since ln f(m_1, m_2, r) = ln C + ln m_1 + ln m_2 − 2 ln r = c + x + y − 2z, we need the hyperplane ln m_1 + ln m_2 − 2 ln r − ln f_a + ln C = 0, i.e. x + y − 2z − k + ln C = 0: on one side of it the mass gets far away, on the other it does not, so we can decide without error.

12 Classification function in the dual space sgn(w·x + b) = sgn( Σ_{j=1..l} α_j y_j x_j·x + b ). Note that the data only appears in the scalar product. The matrix G = (x_i · x_j)_{i,j=1..l} is called the Gram matrix.

13 Dual Perceptron algorithm and Kernel functions We can rewrite the classification function as h(x) = sgn(w·φ(x) + b) = sgn( Σ_j α_j y_j φ(x_j)·φ(x) + b ) = sgn( Σ_j α_j y_j k(x_j, x) + b ), as well as the updating rule: if y_i ( Σ_j α_j y_j k(x_j, x_i) + b ) ≤ 0 then α_i = α_i + η. The learning rate η does not affect the algorithm, so we set η = 1.
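As a concrete illustration, here is a minimal sketch of the dual (kernel) perceptron above, with a toy dataset and a degree-2 polynomial kernel chosen only for the example (the data, epochs and names are illustrative, not from the slides):

```python
# Minimal dual (kernel) perceptron sketch; dataset and kernel choice are illustrative.
import numpy as np

def kernel_perceptron(X, y, k, epochs=100):
    """X: list of inputs, y: labels in {-1, +1}, k: kernel function k(x, z)."""
    n = len(X)
    alpha = np.zeros(n)          # one dual coefficient per training example
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            # the prediction uses only kernel evaluations (the "kernel trick")
            s = sum(alpha[j] * y[j] * k(X[j], X[i]) for j in range(n)) + b
            if y[i] * s <= 0:    # mistake: update alpha_i (learning rate fixed to 1)
                alpha[i] += 1.0
                b += y[i]
    return alpha, b

def predict(x, X, y, alpha, b, k):
    s = sum(alpha[j] * y[j] * k(X[j], x) for j in range(len(X))) + b
    return 1 if s > 0 else -1

if __name__ == "__main__":
    # XOR-like toy data, separable with a degree-2 polynomial kernel
    X = [np.array(p) for p in [(0, 0), (1, 1), (0, 1), (1, 0)]]
    y = [-1, -1, 1, 1]
    poly = lambda a, c: (np.dot(a, c) + 1) ** 2
    alpha, b = kernel_perceptron(X, y, poly)
    print([predict(x, X, y, alpha, b, poly) for x in X])   # [-1, -1, 1, 1]
```

Only the kernel values k(x_j, x) are ever needed; the mapping φ is never computed explicitly.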

14 Kernels in Support Vector Machines In Soft Margin SVMs we maximize the dual objective; by using kernel functions we rewrite the problem with k(x_i, x_j) in place of x_i·x_j (see below).
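For reference, a textbook form of the two objectives (a standard reconstruction; the slide's own notation may differ):

$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j
\qquad \text{s.t. } 0\le\alpha_i\le C,\ \ \sum_{i=1}^{l}\alpha_i y_i=0$$

and, replacing the scalar product with a kernel,

$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j\,k(\mathbf{x}_i,\mathbf{x}_j)
\qquad \text{s.t. } 0\le\alpha_i\le C,\ \ \sum_{i=1}^{l}\alpha_i y_i=0$$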

15 Kernel Function Definition Kernels are scalar products of mapping functions: for x ∈ R^n, φ(x) = (φ_1(x), φ_2(x), ..., φ_m(x)) ∈ R^m, and k(x, z) = φ(x)·φ(z).

16 Valid Kernels

17 Valid Kernels cont'd If the Gram matrix is positive semi-definite, then we can find a mapping φ implementing the kernel function.

18 Valid Kernel operations
k(x,z) = k_1(x,z) + k_2(x,z)
k(x,z) = k_1(x,z) · k_2(x,z)
k(x,z) = α k_1(x,z)
k(x,z) = f(x) f(z)
k(x,z) = k_1(φ(x), φ(z))
k(x,z) = x′Bz, with B a symmetric positive semi-definite matrix
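A small sketch of these closure properties written as kernel combinators (all names are illustrative; any valid base kernels k1, k2 can be plugged in):

```python
# Illustrative kernel combinators mirroring the closure properties above.
import numpy as np

def k_sum(k1, k2):            # k(x,z) = k1(x,z) + k2(x,z)
    return lambda x, z: k1(x, z) + k2(x, z)

def k_prod(k1, k2):           # k(x,z) = k1(x,z) * k2(x,z)
    return lambda x, z: k1(x, z) * k2(x, z)

def k_scale(alpha, k1):       # k(x,z) = alpha * k1(x,z), alpha >= 0
    return lambda x, z: alpha * k1(x, z)

def k_from_f(f):              # k(x,z) = f(x) * f(z)
    return lambda x, z: f(x) * f(z)

def k_compose(k1, phi):       # k(x,z) = k1(phi(x), phi(z))
    return lambda x, z: k1(phi(x), phi(z))

linear = lambda x, z: float(np.dot(x, z))
poly2  = lambda x, z: (float(np.dot(x, z)) + 1) ** 2

combined = k_sum(k_scale(0.5, linear), poly2)
x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(combined(x, z))   # 0.5 * (x.z) + (x.z + 1)^2
```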

19 Basic Kernels for unstructured data Linear Kernel Polynomial Kernel Lexical kernel String Kernel

20 Linear Kernel In Text Categorization documents are word vectors:
Φ(d_x) = x = (0,..,1,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,1)   with active features buy, acquisition, stocks, sell, market
Φ(d_z) = z = (0,..,1,..,0,..,1,..,0,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,0)   with active features buy, company, stocks, sell
The dot product x·z counts the number of features in common; this provides a sort of similarity.
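A minimal bag-of-words sketch of this dot product, with toy documents standing in for the vectors above:

```python
# Toy bag-of-words linear kernel: the dot product counts shared features.
from collections import Counter

def bow(doc):
    return Counter(doc.lower().split())

def linear_kernel(d1, d2):
    x, z = bow(d1), bow(d2)
    return sum(x[w] * z[w] for w in x if w in z)

d_x = "buy acquisition stocks sell market"
d_z = "buy company stocks sell"
print(linear_kernel(d_x, d_z))   # 3 shared words: buy, stocks, sell
```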

21 Feature Conjunction (polynomial kernel) The initial vectors are mapped into a higher-dimensional space: Φ(⟨x_1, x_2⟩) = ⟨x_1², x_2², √2 x_1 x_2, √2 x_1, √2 x_2, 1⟩. More expressive, as it encodes conjunctions such as Stock+Market vs. Downtown+Market. We can smartly compute the scalar product:
Φ(x)·Φ(z) = ⟨x_1², x_2², √2 x_1 x_2, √2 x_1, √2 x_2, 1⟩ · ⟨z_1², z_2², √2 z_1 z_2, √2 z_1, √2 z_2, 1⟩
          = x_1² z_1² + x_2² z_2² + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1
          = (x_1 z_1 + x_2 z_2 + 1)² = (x·z + 1)² = K_Poly(x, z)
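A quick numerical check of this identity for two-dimensional inputs, comparing the explicit degree-2 mapping with the implicit computation (x·z + 1)² (the vectors are arbitrary examples):

```python
import numpy as np

def phi2(v):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x = np.array([3.0, -1.0])
z = np.array([2.0,  4.0])
explicit = float(np.dot(phi2(x), phi2(z)))
implicit = (float(np.dot(x, z)) + 1) ** 2
print(explicit, implicit)   # both equal (x.z + 1)^2 = (3*2 + (-1)*4 + 1)^2 = 9
```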

22 Using character sequences φ("bank") = x = (0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0) bank ank bnk bk b φ("rank") = z = (1,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,1,..,0,..,1) rank ank rnk rk r x·z counts the number of common substrings x·z = φ("bank")·φ("rank") = k("bank","rank")

23 String Kernel Given two strings, the number of matches between their substrings is evaluated E.g. Bank and Rank B, a, n, k, Ba, Ban, Bank, Bk, an, ank, nk,.. R, a, n, k, Ra, Ran, Rank, Rk, an, ank, nk,.. String kernel over sentences and texts Huge space but there are efficient algorithms
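The intuition can be checked by brute force on very short strings: enumerate every (possibly gapped) subsequence of both strings and count the matches. This naive sketch is exponential and only meant to make the feature space concrete; the efficient algorithms mentioned above avoid the enumeration:

```python
from collections import Counter
from itertools import combinations

def subsequences(s):
    """All non-empty (possibly gapped) subsequences of s, one per occurrence."""
    subs = []
    for r in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), r):
            subs.append("".join(s[i] for i in idx))
    return subs

def naive_string_kernel(s, t):
    """Number of matching pairs of subsequence occurrences between s and t."""
    cs, ct = Counter(subsequences(s)), Counter(subsequences(t))
    return sum(cs[u] * ct[u] for u in cs if u in ct)

print(naive_string_kernel("Bank", "Rank"))   # common: a, n, k, an, ak, nk, ank -> 7
```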

24 Formal Definition φ_u(s) = Σ_{i: u = s[i]} λ^{l(i)}, where the sum ranges over all index sequences i = (i_1, ..., i_|u|) such that s[i] = u, and l(i) = i_|u| − i_1 + 1 is the span of the subsequence in s (λ ≤ 1 penalizes gaps; λ = 1 gives pure counting). The kernel is K(s, t) = Σ_u φ_u(s) φ_u(t).

25 Kernel between Bank and Rank

26 An example of string kernel computation

27 Intuition behind efficient evaluation Dynamic programming: given two strings of size n and m, NS(n,m,ρ) = NS(n,m,ρ−1) + NS(n−1,m,ρ) + NS(n,m−1,ρ) − NS(n−1,m−1,ρ)
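A minimal dynamic-programming sketch that counts matching subsequence pairs between two strings; it uses a standard recurrence rather than the exact length-bounded NS formulation on the slide, but illustrates the same inclusion-exclusion idea:

```python
def count_common_subsequences(s, t):
    """C[i][j] = number of matching pairs of subsequence occurrences
    in s[:i] and t[:j], including the empty pair."""
    n, m = len(s), len(t)
    C = [[1] * (m + 1) for _ in range(n + 1)]   # only the empty pair at the borders
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # inclusion-exclusion over whether position i (resp. j) is used
            C[i][j] = C[i-1][j] + C[i][j-1] - C[i-1][j-1]
            if s[i-1] == t[j-1]:
                C[i][j] += C[i-1][j-1]          # extend every shorter match with this pair
    return C[n][m] - 1                          # drop the empty pair

print(count_common_subsequences("Bank", "Rank"))   # 7, as in the brute-force check above
```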

28 Document Similarity [figure: the words of Doc 1 and Doc 2 (industry, company, telephone, product, market) linked by pairwise similarity edges]

29 Lexical Semantic Kernel [CoNLL 2005] The document similarity is the SK function: SK(d_1, d_2) = Σ_{w_1∈d_1, w_2∈d_2} s(w_1, w_2), where s is any similarity function between words, e.g. WordNet similarity [Basili et al., 2005] or LSA [Cristianini et al., 2002]. Good results when training data is scarce.
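A minimal sketch of SK with a hand-made word-similarity table standing in for a WordNet- or LSA-based s(w1, w2) (the words and similarity values are purely illustrative):

```python
# Toy lexical semantic kernel: sum of word-pair similarities between two documents.
# The table below is a made-up stand-in for a WordNet/LSA-based similarity s(w1, w2).
SIM = {
    ("industry", "company"): 0.8,
    ("market", "company"):   0.4,
    ("telephone", "product"): 0.3,
}

def s(w1, w2):
    if w1 == w2:
        return 1.0
    return SIM.get((w1, w2)) or SIM.get((w2, w1)) or 0.0

def SK(d1, d2):
    return sum(s(w1, w2) for w1 in d1 for w2 in d2)

doc1 = ["industry", "telephone", "market"]
doc2 = ["company", "product"]
print(SK(doc1, doc2))   # 0.8 + 0.3 + 0.4 = 1.5
```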

30 Is it a valid kernel? It may not be a kernel, so we can use M·Mᵀ (which is always positive semi-definite).

31 Tree kernels Subtree, Subset Tree, Partial Tree kernels Efficient computation

32 Example of a parse tree "John delivers a talk in Rome": (S (N John) (VP (V delivers) (NP (D a) (N talk)) (PP (IN in) (N Rome))))

33 The Syntactic Tree Kernel (STK) [Collins and Duffy, 2002] [figure: the subtree (VP (V delivers) (NP (D a) (N talk)))]

34 The overall fragment set [figure: all the fragments of (VP (V delivers) (NP (D a) (N talk))), e.g. (NP (D a) N), (D a), ...]. Children are not divided: a node is expanded either with all of its children or with none of them.

35 Explicit kernel space φ(T_x) = x = (0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0) φ(T_z) = z = (1,..,0,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,0,..,1,..,0,..,0) x·z counts the number of common substructures

36 Efficient evaluation of the scalar product x·z = φ(T_x)·φ(T_z) = K(T_x, T_z) = Σ_{n_x∈T_x} Σ_{n_z∈T_z} Δ(n_x, n_z)

37 Efficient evaluation of the scalar product x·z = φ(T_x)·φ(T_z) = K(T_x, T_z) = Σ_{n_x∈T_x} Σ_{n_z∈T_z} Δ(n_x, n_z). [Collins and Duffy, ACL 2002] evaluate Δ in O(n²):
Δ(n_x, n_z) = 0, if the productions are different;
Δ(n_x, n_z) = 1, if the nodes are pre-terminals (with the same production);
otherwise Δ(n_x, n_z) = Π_{j=1..nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))
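A compact sketch of this recursion on toy trees encoded as nested tuples (label, child1, child2, ...); it follows the Δ definition above, without the decay factor introduced later. For the two VPs below, which differ only in the noun, the kernel counts the 10 fragments they share.

```python
# Toy SST kernel following the Collins & Duffy Delta recursion.
# Trees are nested tuples: (label, child1, child2, ...); leaves are plain strings.

def children(n):
    return n[1:] if isinstance(n, tuple) else ()

def production(n):
    """The production at node n: its label plus the labels of its children."""
    labels = tuple(c[0] if isinstance(c, tuple) else c for c in children(n))
    return (n[0], labels)

def is_preterminal(n):
    return all(not isinstance(c, tuple) for c in children(n))

def nodes(t):
    yield t
    for c in children(t):
        if isinstance(c, tuple):
            yield from nodes(c)

def delta(nx, nz):
    if production(nx) != production(nz):
        return 0
    if is_preterminal(nx):
        return 1
    prod = 1
    for cx, cz in zip(children(nx), children(nz)):
        prod *= 1 + delta(cx, cz)
    return prod

def tree_kernel(tx, tz):
    return sum(delta(nx, nz) for nx in nodes(tx) for nz in nodes(tz))

vp1 = ("VP", ("V", "delivers"), ("NP", ("D", "a"), ("N", "talk")))
vp2 = ("VP", ("V", "delivers"), ("NP", ("D", "a"), ("N", "paper")))
print(tree_kernel(vp1, vp2))   # 10 shared fragments
```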

38 Explicit kernel space x = (0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0), where each dimension corresponds to a tree fragment such as (VP V NP), (VP (V delivers) NP), (NP (D a) (N talk)), (NP D N), (D a), (N talk), ... The dot product x·z counts the number of common substructures.

39 Implicit Representation x·z = φ(T_x)·φ(T_z) = K(T_x, T_z) = Σ_{n_x∈T_x} Σ_{n_z∈T_z} Δ(n_x, n_z)

40 Implicit Representation x·z = φ(T_x)·φ(T_z) = K(T_x, T_z) = Σ_{n_x∈T_x} Σ_{n_z∈T_z} Δ(n_x, n_z). [Collins and Duffy, ACL 2002] evaluate Δ in O(n²):
Δ(n_x, n_z) = 0, if the productions are different;
Δ(n_x, n_z) = 1, if the nodes are pre-terminals (with the same production);
otherwise Δ(n_x, n_z) = Π_{j=1..nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))

41 Other Adjustments Decay factor: Δ(n_x, n_z) = λ, if pre-terminals; otherwise Δ(n_x, n_z) = λ Π_{j=1..nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j))). Normalization: K′(T_x, T_z) = K(T_x, T_z) / √( K(T_x, T_x) · K(T_z, T_z) )

42 Fast SST Evaluation [EACL 2006] K(T_x, T_z) = Σ_{⟨n_x, n_z⟩ ∈ NP} Δ(n_x, n_z), where NP = {⟨n_x, n_z⟩ ∈ T_x × T_z : Δ(n_x, n_z) ≠ 0} = {⟨n_x, n_z⟩ ∈ T_x × T_z : P(n_x) = P(n_z)}, and P(n_x), P(n_z) are the production rules used at nodes n_x and n_z.

43 Observations We order the production rules used in T_x and T_z at loading time. At learning time we may evaluate NP in O(|T_x| + |T_z|) running time. If T_x and T_z are generated by only one production rule, the cost is O(|T_x| · |T_z|).

44 Observations We order the production rules used in T_x and T_z at loading time. At learning time we may evaluate NP in O(|T_x| + |T_z|) running time. If T_x and T_z are generated by only one production rule, the cost is O(|T_x| · |T_z|), which is very unlikely!
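A sketch of the underlying idea: index the nodes of one tree by production rule and scan the (sorted) productions of the other, so that Δ is computed only on pairs that can be non-zero. This illustrates the strategy rather than the exact EACL 2006 algorithm; productions are plain strings here.

```python
from collections import defaultdict

def node_pairs(prods_x, prods_z):
    """prods_x / prods_z: list of (production, node_id) per tree, e.g. ("NP->D N", 2).
    Returns the pairs with identical productions, i.e. the set NP above."""
    by_prod = defaultdict(list)
    for p, nz in prods_z:
        by_prod[p].append(nz)
    pairs = []
    for p, nx in sorted(prods_x):          # productions ordered once, at loading time
        for nz in by_prod.get(p, ()):
            pairs.append((nx, nz))
    return pairs

tx = [("S->N VP", 0), ("VP->V NP", 1), ("NP->D N", 2)]
tz = [("S->N VP", 0), ("VP->V PP", 1), ("NP->D N", 2)]
print(node_pairs(tx, tz))   # [(2, 2), (0, 0)]: only these pairs need Delta
```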

45 SubTree (ST) Kernel [Vishwanathan and Smola, 2002] [figure: the subtrees of (VP (V delivers) (NP (D a) (N talk))), e.g. (NP (D a) (N talk)), (D a), (N talk), (V delivers): each subtree is complete down to the leaves]

46 Evaluation Given the equation for the SST kernel:
Δ(n_x, n_z) = 0, if the productions are different;
Δ(n_x, n_z) = 1, if the nodes are pre-terminals;
otherwise Δ(n_x, n_z) = Π_{j=1..nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))

47 Evaluation Given the equation for the SST kernel, the ST kernel is obtained by dropping the "1 +" term in the recursive step:
Δ(n_x, n_z) = 0, if the productions are different;
Δ(n_x, n_z) = 1, if the nodes are pre-terminals;
otherwise Δ(n_x, n_z) = Π_{j=1..nc(n_x)} Δ(ch(n_x, j), ch(n_z, j))

48 Labeled Ordered Tree Kernel SST satisfies the constraint "remove 0 or all children at a time". If we relax such constraint we get more general substructures [Kashima and Koyanagi, 2002]. [figure: fragments of (VP (V gives) (NP (D a) (N talk))) in which individual children may be removed, e.g. (VP (NP (D a) (N talk))), (VP (NP D N)), (NP D), ...]

49 Weighting Problems Both matched pairs give the same contribution; gap-based weighting is needed, and a novel efficient evaluation has to be defined. [figure: (VP (V gives) (D a) (NP (N talk))) matched in both trees, and (VP (V gives) (NP (D a) (JJ good) (N talk))) vs. (VP (V gives) (NP (D a) (JJ bad) (N talk)))]

50 Partial Trees, [ECML 2006] SST + String Kernel with weighted gaps on node children sequences. [figure: partial-tree fragments of (VP (V brought) (NP (D a) (N cat)))]

51 Partial Tree Kernel By adding two decay factors we obtain:
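For reference, the partial tree kernel Δ is usually written along the following lines (after Moschitti, ECML 2006; this is a reconstruction and may differ in details from the original slide):

$$\Delta(n_x,n_z)=\mu\Big(\lambda^2+\sum_{\vec{J}_x,\vec{J}_z,\ |\vec{J}_x|=|\vec{J}_z|}\lambda^{d(\vec{J}_x)+d(\vec{J}_z)}\prod_{i=1}^{|\vec{J}_x|}\Delta\big(c_{n_x}[\vec{J}_{x i}],\,c_{n_z}[\vec{J}_{z i}]\big)\Big)$$

where J⃗_x and J⃗_z range over index sequences of the children of n_x and n_z, d(J⃗) is the span covered by a sequence (so gaps are penalized), and μ and λ are the two decay factors, for depth and for gaps respectively.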

52 Efficient Evaluation (1) In [Shawe-Taylor and Cristianini, 2004], sequence kernels with weighted gaps are factorized with respect to different subsequence sizes. We treat children as sequences and apply the same theory.

53 Efficient Evaluation (2) The complexity of finding the subsequences is Therefore the overall complexity is where ρ is the maximum branching factor (p = ρ)

54 Running Time of Tree Kernel Functions [plot: running time in microseconds vs. number of tree nodes for FTK-SST, QTK-SST and FTK-PT]

55 Gold Standard Tree Experiments PropBank and Penn Treebank, about 53,700 sentences; sections 2-21 for training, 23 for testing, 1 and 22 for development. Arguments from Arg0 to Arg5, ArgA and ArgM, for a total of 122,774 and 7,359.

56 Argument Classification Accuracy [plot: accuracy vs. % of training data for the ST, Linear, SST and PT kernels]

57 PART II: Kernel Engineering for Language Applications Basic Combinations Canonical Mappings, e.g. object transformations Merging of Kernels

58 Kernel Combinations, an example. Typical combinations of a tree kernel K_Tree and a polynomial kernel K_p (degree 3, over flat features) are the sum K_Tree + K_p, the product K_Tree · K_p, and weighted sums such as K_Tree + γ K_p.
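A small sketch of such a combination, normalizing each kernel before summing (as in the normalization slide above); the tree kernel used here is a trivial placeholder, and gamma, the data and all names are illustrative:

```python
import math
import numpy as np

def normalize(k):
    """k'(a, b) = k(a, b) / sqrt(k(a, a) * k(b, b)), as in the normalization formula above."""
    return lambda a, b: k(a, b) / math.sqrt(k(a, a) * k(b, b))

# Stand-ins: any structural kernel (e.g. the Delta-based sketch above) and a degree-3 polynomial.
poly3 = lambda x, z: (float(np.dot(x, z)) + 1) ** 3
def toy_tree_kernel(t1, t2):          # placeholder "structural" kernel: shared node labels
    return len(set(t1) & set(t2))

def combined(gamma):
    kt, kp = normalize(toy_tree_kernel), normalize(poly3)
    # each instance is a pair (tree_representation, flat_feature_vector)
    return lambda a, b: kt(a[0], b[0]) + gamma * kp(a[1], b[1])

a = ({"S", "VP", "NP", "V"}, np.array([1.0, 0.0, 2.0]))
b = ({"S", "VP", "PP", "V"}, np.array([0.5, 1.0, 1.0]))
print(combined(gamma=0.4)(a, b))
```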

59 Object Transformation [CLJ 2008] Canonical Mapping φ_M(): object transformation, e.g. of a syntactic parse tree into a verb subcategorization frame tree. Feature Extraction φ_E(): maps the canonical structure into all its fragments, in different fragment spaces, e.g. ST, SST and PT.
K(O_1, O_2) = φ(O_1)·φ(O_2) = φ_E(φ_M(O_1))·φ_E(φ_M(O_2)) = φ_E(S_1)·φ_E(S_2) = K_S(S_1, S_2)

60 Object Transformation and Kernel Combinations: Semantic Role Labeling Kernel Engineering via node marking Kernels for the whole predicate argument structures Kernels for re-ranking

61 Predicate Argument Classification In an event: target words describe relation among different entities the participants are often seen as predicate's arguments. Example: Paul gives a lecture in Rome

62 Predicate Argument Classification In an event: target words describe relation among different entities the participants are often seen as predicate's arguments. Example: [ Arg0 Paul] [ predicate gives ] [ Arg1 a lecture] [ ArgM in Rome]

63 Predicate Argument Structures and syntax Semantics is connected to syntax via parse trees. [figure: (S (N Paul) (VP (V gives) (NP (D a) (N lecture)) (PP (IN in) (N Rome)))), with Paul = Arg.0, gives = Predicate, a lecture = Arg.1, in Rome = Arg.M]

64 Object Transformation: tree tailoring [ACL 2004] [figure: the same parse tree for "Paul gives a lecture in Rome", tailored around the predicate and its arguments (Arg.0, Predicate, Arg.1, Arg.M)]

65 Object Transformation: Node Marking

66 Node Marking Effect

67 Different tailoring and marking MMST CMST

68 Experiments PropBank and Penn Treebank, about 53,700 sentences; Charniak trees from CoNLL 2005. Boundary detection: Section 2 for training, Section 24 for testing. PAF and MPAF.

69 Number of examples/nodes of Section 2

70 Predicate Argument Feature (PAF) vs. Marked PAF (MPAF) [ACL-ws-2005]

71 More general mappings: Semantic structures for re-ranking [CoNLL 2006]

72 Other Shallow Semantic structures [ARG1 Antigens] were [AM-TMP originally] [rel defined] [ARG2 as nonself molecules]. [ARG0 Researchers] [rel describe] [ARG1 antigens] [ARG2 as foreign molecules] [ARGM-LOC in the body]

73 Shallow Semantic Trees for SST kernel [ACL 2007]

74 Merging of Kernels [ECIR 2007]: Question/Answer Classification Syntactic/Semantic Tree Kernel Kernel Combinations Experiments

75 Merging of Kernels

76 Merging of Kernels [figure: (VP (V gives) (NP (D a) (N good) (N talk))) and (VP (V gives) (NP (D a) (N solid) (N talk))): two trees that differ only in one, semantically similar, word]

77 Delta Evaluation is very simple

78 Question Classification Definition: What does HTML stand for? Description: What's the final line in the Edgar Allan Poe poem "The Raven"? Entity: What foods can cause allergic reaction in people? Human: Who won the Nobel Peace Prize in 1992? Location: Where is the Statue of Liberty? Manner: How did Bob Marley die? Numeric: When was Martin Luther King Jr. born? Organization: What company makes Bentley cars?

79 Question Classifier based on Tree Kernels Question dataset [Li and Roth, 2005], distributed over 6 categories: Abbreviations, Descriptions, Entity, Human, Location, and Numeric. Fixed split: 5500 training and 500 test questions; cross-validation (10 folds). Using the whole question parse trees (constituent parsing). Example: What is an offer of direct stock purchase plan?

80

81 Kernels BOW and POS are obtained with a simple flat tree, e.g. (BOW (What *)(is *)(an *)(offer *) ...); PT (parse tree); PAS (predicate argument structure).

82 Question classification

83 Similarity based on WordNet

84 Question Classification with S/STK

85 Answer Classification data-set 138 TREC 2001 test questions labeled as "description"; 1309 sentences, extracted from the best ranked paragraphs (using a basic QA system based on the Google search engine); 416 of which labeled as correct by two annotators.

86 Impact of BOW and PT in Answer classification [plot: F1-measure vs. the j parameter for Q(BOW)+A(BOW), Q(BOW)+A(PT,BOW), Q(PT)+A(PT), Q(PT)+A(BOW), Q(BOW)+A(PT), Q(PT,BOW)+A(PT,BOW), Q(PT)+A(PT,BOW)]

87 The Impact of SSTK in Answer Classification [plot: F1-measure vs. the j parameter for Q(BOW)+A(BOW), Q(BOW)+A(PT,BOW), Q(BOW)+A(BOW,PT,PAS), Q(BOW)+A(BOW,PAS)]

88 SVM-light-TK Software Encodes ST, SST and combination kernels in SVM-light [Joachims, 1999] Available at Tree forests, vector sets New extensions: the PT kernel will be released asap

89 Data Format What does Html stand for? 1 BT (SBARQ (WHNP (WP What))(SQ (AUX does)(NP (NNP S.O.S.))(VP (VB stand)(PP (IN for))))(. ?)) BT (BOW (What *)(does *)(S.O.S. *)(stand *)(for *)(? *)) BT (BOP (WP *)(AUX *)(NNP *)(VB *)(IN *)(. *)) BT (PAS (ARG0 (R-A1 (What *)))(ARG1 (A1 (S.O.S. NNP)))(ARG2 (rel stand))) ET 1:1 21: E-4 23:1 30:1 36:1 39:1 41:1 46:1 49:1 66:1 152:1 274:1 333:1 BV 2:1 21: E-4 23:1 31:1 36:1 39:1 41:1 46:1 49:1 52:1 66:1 152:1 246:1 333:1 392:1 EV

90 Basic Commands
Training and classification:
./svm_learn -t 5 -C T train.dat model
./svm_classify test.dat model
Learning with a vector sequence:
./svm_learn -t 5 -C V train.dat model
Learning with the sum of vector and kernel sequences:
./svm_learn -t 5 -C + train.dat model

91 Custom Kernel (kernel.h)
double custom_kernel(KERNEL_PARM *kernel_parm, DOC *a, DOC *b);
if(a->num_of_trees && b->num_of_trees && a->forest_vec[i]!=NULL && b->forest_vec[i]!=NULL){ // test whether the i-th trees of instances a and b are both non-empty

92 Custom Kernel: tree kernel
k1 = // summation of tree kernels
     tree_kernel(kernel_parm, a, b, i, i) /   // evaluate the tree kernel between the two i-th trees
     sqrt(tree_kernel(kernel_parm, a, a, i, i) * tree_kernel(kernel_parm, b, b, i, i));   // normalize with respect to both i-th trees

93 Custom Kernel: polynomial kernel
if(a->num_of_vectors && b->num_of_vectors && a->vectors[i]!=NULL && b->vectors[i]!=NULL){   // check that the i-th vectors are both non-empty
    k2 = // summation of vectors
         basic_kernel(kernel_parm, a, b, i, i) /   // compute the standard kernel (selected according to the "second_kernel" parameter)

94 Custom Kernel: polynomial kernel (cont'd)
         sqrt(basic_kernel(kernel_parm, a, a, i, i) * basic_kernel(kernel_parm, b, b, i, i));   // normalize with respect to both i-th vectors
return k1 + k2;

95 Conclusions Kernel methods and SVMs are useful tools to design language applications. Kernel design still requires some level of expertise. Engineering approaches to tree kernels: basic combinations; canonical mappings, e.g. node marking; merging of kernels into more complex kernels. State of the art in SRL and QC. An efficient tool to use them.

96 Thank you

97 References
Alessandro Moschitti, Silvia Quarteroni, Roberto Basili and Suresh Manandhar, Exploiting Syntactic and Shallow Semantic Kernels for Question/Answer Classification, Proceedings of the 45th Conference of the Association for Computational Linguistics (ACL), Prague, June 2007.
Alessandro Moschitti and Fabio Massimo Zanzotto, Fast and Effective Kernels for Relational Learning from Texts, Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), Corvallis, OR, USA.
Daniele Pighin, Alessandro Moschitti and Roberto Basili, RTV: Tree Kernels for Thematic Role Classification, Proceedings of the 4th International Workshop on Semantic Evaluation (SemEval-4), English Semantic Labeling, Prague, June 2007.
Stephan Bloehdorn and Alessandro Moschitti, Combined Syntactic and Semantic Kernels for Text Classification, to appear in the 29th European Conference on Information Retrieval (ECIR), April 2007, Rome, Italy.
Fabio Aiolli, Giovanni Da San Martino, Alessandro Sperduti, and Alessandro Moschitti, Efficient Kernel-based Learning for Trees, to appear in the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Honolulu, Hawaii, 2007.

98 An introductory book on SVMs, Kernel methods and Text Categorization

99 References
Roberto Basili and Alessandro Moschitti, Automatic Text Categorization: from Information Retrieval to Support Vector Learning, Aracne editrice, Rome, Italy.
Alessandro Moschitti, Efficient Convolution Kernels for Dependency and Constituent Syntactic Trees. In Proceedings of the 17th European Conference on Machine Learning, Berlin, Germany, 2006.
Alessandro Moschitti, Daniele Pighin, and Roberto Basili, Tree Kernel Engineering for Proposition Re-ranking. In Proceedings of Mining and Learning with Graphs (MLG 2006), Workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.
Elisa Cilia, Alessandro Moschitti, Sergio Ammendola, and Roberto Basili, Structured Kernels for Automatic Detection of Protein Active Sites. In Proceedings of Mining and Learning with Graphs (MLG 2006), Workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.

100 References
Fabio Massimo Zanzotto and Alessandro Moschitti, Automatic learning of textual entailments with cross-pair similarities. In Proceedings of COLING-ACL, Sydney, Australia, 2006.
Alessandro Moschitti, Making tree kernels practical for natural language learning. In Proceedings of the Eleventh Conference of the European Association for Computational Linguistics (EACL), Trento, Italy, 2006.
Alessandro Moschitti, Daniele Pighin and Roberto Basili, Semantic Role Labeling via Tree Kernel joint inference. In Proceedings of the 10th Conference on Computational Natural Language Learning, New York, USA, 2006.
Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili, Semantic Tree Kernels to classify Predicate Argument Structures. In Proceedings of the 17th European Conference on Artificial Intelligence, Riva del Garda, Italy, 2006.

101 References
Alessandro Moschitti and Roberto Basili, A Tree Kernel approach to Question and Answer Classification in Question Answering Systems. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Genova, Italy, 2006.
Ana-Maria Giuglea and Alessandro Moschitti, Semantic Role Labeling via FrameNet, VerbNet and PropBank. In Proceedings of the Joint 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, 2006.
Roberto Basili, Marco Cammisa and Alessandro Moschitti, Effective use of WordNet semantics via kernel-based learning. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005), Ann Arbor (MI), USA, 2005.

102 References
Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola and Roberto Basili, Hierarchical Semantic Role Labeling. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), Ann Arbor (MI), USA, 2005.
Roberto Basili, Marco Cammisa and Alessandro Moschitti, A Semantic Kernel to classify texts with very few training examples. In Proceedings of the Workshop on Learning in Web Search, at the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005.
Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili, Engineering of Syntactic Features for Shallow Semantic Parsing. In Proceedings of the ACL05 Workshop on Feature Engineering for Machine Learning in Natural Language Processing, Ann Arbor (MI), USA, 2005.

103 References
Alessandro Moschitti, A study on Convolution Kernels for Shallow Semantic Parsing. In Proceedings of ACL-2004, Spain, 2004.
Alessandro Moschitti and Cosmin Adrian Bejan, A Semantic Kernel for Predicate Argument Classification. In Proceedings of CoNLL-2004, Boston, MA, USA, 2004.
M. Collins and N. Duffy, New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In ACL 2002.
S.V.N. Vishwanathan and A.J. Smola, Fast kernels on strings and trees. In Proceedings of Neural Information Processing Systems, 2002.

104 References
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods), Cambridge University Press.
Xavier Carreras and Lluís Màrquez, Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In Proceedings of CoNLL 2005.
Sameer Pradhan, Kadri Hacioglu, Valeri Krugler, Wayne Ward, James H. Martin, and Daniel Jurafsky, Support vector learning for semantic argument classification. To appear in the Machine Learning Journal.

105 Algorithm

106 The Impact of SSTK in Answer Classification [plot: F1-measure vs. the j parameter for Q(BOW)+A(BOW), Q(BOW)+A(PT,BOW), Q(PT)+A(PT,BOW), Q(BOW)+A(BOW,PT,PAS), Q(BOW)+A(BOW,PT,PAS_N), Q(PT)+A(PT,BOW,PAS), Q(BOW)+A(BOW,PAS), Q(BOW)+A(BOW,PAS_N)]

107 Mercer's conditions (1)

108 Mercer's conditions (2) If the Gram matrix G = ( k(x_i, x_j) )_{i,j} is positive semi-definite, there is a mapping φ that produces the target kernel function.

109 The lexical semantic kernel is not always a kernel It may not be a kernel, so we can use M·Mᵀ, where M is the initial similarity matrix.
