MACHINE LEARNING. Kernel Methods. Alessandro Moschitti


1 MACHINE LEARNING Kernel Methods Alessandro Moschitti Department of Information and Communication Technology, University of Trento

2 Communications Next Lectures will be on February 11, 13, 16, 18 We start at 12:00 (sharp)

3 Outline (1) PART I: Theory Motivations Kernel Based Machines Kernel Methods Kernel Trick Mercer's conditions Kernel operators Basic Kernels Linear Kernel Polynomial Kernel Lexical kernel String Kernel

4 Outline (2) Tree kernels subtree, subset tree and partial tree kernels PART II: Applications to Question Answering Question Classification Kernel Combinations Answer Classification Experiments and Results Conclusions

5 Motivation (1) Feature design is the most difficult aspect of designing a learning system: it is a complex and difficult phase, e.g., for structural feature representation; deep knowledge and intuitions are required; design problems arise when the phenomenon is described by many features

6 Motivation (2) Kernel methods alleviate such problems: structures are represented in terms of their substructures; high-dimensional, implicit and abstract feature spaces; a high number of features is generated, and Support Vector Machines select the relevant ones; automatic feature engineering as a side effect

7 Theory Kernel Trick Kernel Based Machines Basic Kernel Properties Kernel Types

8 The main idea of Kernel Functions Mapping vectors into a space where they are linearly separable, x → φ(x). (The figure shows the points x and o, not separable in the input space, becoming separable as φ(x) and φ(o).)

9 A mapping example Given two masses m_1 and m_2, where one is constrained, apply a force f_a to the mass m_1. Experiment features: m_1, m_2 and f_a. We want to learn a classifier that tells when the mass m_1 will get far away from m_2. If we consider the gravitational Newton law f(m_1, m_2, r) = C m_1 m_2 / r^2, we need to find when f(m_1, m_2, r) < f_a.

10 Mapping Example x = (x_1, ..., x_n) → φ(x) = (φ_1(x), ..., φ_n(x)). The gravitational law is not linear, so we need to change the space: (f_a, m_1, m_2, r) → (k, x, y, z) = (ln f_a, ln m_1, ln m_2, ln r). As ln f(m_1, m_2, r) = ln C + ln m_1 + ln m_2 - 2 ln r = c + x + y - 2z, we need the hyperplane ln f_a - ln m_1 - ln m_2 + 2 ln r - ln C = 0, i.e. (1, 1, -2)·(x, y, z) - ln f_a + ln C = 0. With it we can decide without error whether the mass will get far away or not.

11 A kernel-based Machine. Perceptron training:
w_0 ← 0; b_0 ← 0; k ← 0; R ← max_{1 ≤ i ≤ ℓ} ||x_i||
repeat
    for i = 1 to ℓ
        if y_i (w_k · x_i + b_k) ≤ 0 then
            w_{k+1} = w_k + η y_i x_i
            b_{k+1} = b_k + η y_i R^2
            k = k + 1
        endif
    endfor
until no error is found
return k, (w_k, b_k)
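A minimal C sketch of this primal perceptron (the dense data layout, function name and parameters are illustrative, not from the slides; as in the pseudocode, the loop terminates only if the data is linearly separable):

#include <math.h>

/* Primal perceptron: on a mistake, w <- w + eta*y_i*x_i, b <- b + eta*y_i*R^2.
   X is l x n row-major, y[i] in {-1,+1}; returns the number of updates k. */
int perceptron_train(const double *X, const int *y, int l, int n,
                     double eta, double *w, double *b) {
    double R = 0.0;
    int k = 0, errors = 1;
    for (int i = 0; i < l; i++) {                 /* R = max_i ||x_i|| */
        double nrm = 0.0;
        for (int j = 0; j < n; j++) nrm += X[i*n+j] * X[i*n+j];
        if (sqrt(nrm) > R) R = sqrt(nrm);
    }
    for (int j = 0; j < n; j++) w[j] = 0.0;
    *b = 0.0;
    while (errors) {                              /* until no error is found */
        errors = 0;
        for (int i = 0; i < l; i++) {
            double s = *b;
            for (int j = 0; j < n; j++) s += w[j] * X[i*n+j];
            if (y[i] * s <= 0) {                  /* mistake: update w and b */
                for (int j = 0; j < n; j++) w[j] += eta * y[i] * X[i*n+j];
                *b += eta * y[i] * R * R;
                k++;
                errors = 1;
            }
        }
    }
    return k;
}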

15 Changing Gradient Representation. At each step of the perceptron, only training data is added, with a certain weight:
w = Σ_{j=1..ℓ} α_j y_j x_j
So the classification function becomes
sgn(w · x + b) = sgn( Σ_{j=1..ℓ} α_j y_j (x_j · x) + b )
Note that the data only appears in the scalar product.

16 Dual Perceptron algorithm and Kernel functions. We can rewrite the classification function as
h(x) = sgn(w · φ(x) + b) = sgn( Σ_{j=1..ℓ} α_j y_j φ(x_j) · φ(x) + b ) = sgn( Σ_{j=1..ℓ} α_j y_j k(x_j, x) + b )
as well as the updating function:
if y_i ( Σ_{j=1..ℓ} α_j y_j k(x_j, x_i) + b ) ≤ 0 then α_i = α_i + η
The learning rate η does not affect the algorithm, so we set η = 1.
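The same algorithm in dual form, as a hedged C sketch: the data enters only through a kernel function (a function pointer standing in for any valid k), and a mistake on example i simply does α_i += 1 (η = 1):

typedef double (*kernel_fn)(const double *x, const double *z, int n);

/* Dual perceptron: h(x) = sgn(sum_j alpha_j y_j k(x_j, x) + b).
   R2 is R^2 from the primal version; X is l x n row-major. */
void dual_perceptron_train(const double *X, const int *y, int l, int n,
                           kernel_fn k, double R2, double *alpha, double *b) {
    int errors = 1;
    for (int i = 0; i < l; i++) alpha[i] = 0.0;
    *b = 0.0;
    while (errors) {
        errors = 0;
        for (int i = 0; i < l; i++) {
            double s = *b;
            for (int j = 0; j < l; j++)            /* data only via k(x_j, x_i) */
                s += alpha[j] * y[j] * k(&X[j*n], &X[i*n], n);
            if (y[i] * s <= 0) {                   /* mistake on example i */
                alpha[i] += 1.0;                   /* eta = 1 as on the slide */
                *b += (double)y[i] * R2;
                errors = 1;
            }
        }
    }
}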

17 Kernels in Support Vector Machines. In Soft Margin SVMs we maximize:
Σ_{i=1..ℓ} α_i - ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_i · x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
By using kernel functions we rewrite the problem as:
Σ_{i=1..ℓ} α_i - ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j k(x_i, x_j)

18 Kernel Function Definition. Kernels are products of mapping functions: for x ∈ R^n, φ(x) = (φ_1(x), φ_2(x), ..., φ_m(x)) ∈ R^m, and k(x, z) = φ(x) · φ(z).

19 The Kernel Gram Matrix. With KM-based learning, the sole information used from the training data set is the Kernel Gram Matrix K, with K_ij = k(x_i, x_j), i, j = 1, .., ℓ. If the kernel is valid, K is symmetric and positive semi-definite.

20 Valid Kernels

21 Valid Kernels cont d If the matrix is positive semi-definite then we can find a mapping φ implementing the kernel function

22 Valid Kernel operations
k(x,z) = k_1(x,z) + k_2(x,z)
k(x,z) = k_1(x,z) · k_2(x,z)
k(x,z) = α k_1(x,z), α > 0
k(x,z) = f(x) f(z)
k(x,z) = k_1(φ(x), φ(z))
k(x,z) = x′ B z, with B positive semi-definite
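A quick check of the first two rules (a standard argument, not spelled out on the slide) writes the corresponding feature maps explicitly:

% Sum: concatenate the feature maps of k_1 and k_2.
\phi(x) = \big(\phi_1(x),\, \phi_2(x)\big) \;\Rightarrow\;
\phi(x)\cdot\phi(z) = \phi_1(x)\cdot\phi_1(z) + \phi_2(x)\cdot\phi_2(z) = k_1(x,z) + k_2(x,z)

% Product: take the tensor product of the feature maps.
\phi(x) = \phi_1(x)\otimes\phi_2(x) \;\Rightarrow\;
\phi(x)\cdot\phi(z) = \big(\phi_1(x)\cdot\phi_1(z)\big)\,\big(\phi_2(x)\cdot\phi_2(z)\big) = k_1(x,z)\,k_2(x,z)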

23 Basic Kernels for unstructured data Linear Kernel Polynomial Kernel Lexical kernel String Kernel

24 A simple classification problem: Text Categorization. Words such as Berlusconi, Bush, Totti, Inzaghi, wonderful, yesterday, acquires, declares, war, match, before, elections link documents to the categories Politics (C_1), Economics (C_2), ..., Sport (C_n).

25 Text Classification Problem. Given a set of target categories C = {C_1, .., C_n} and the set T of documents, define f : T → 2^C. VSM (Salton, 1989): features are dimensions of a vector space; documents and categories are vectors of feature weights; d is assigned to C_i if d · C_i > th_i.

26 The Vector Space Model. d_1 (Politics): "Bush declares war. Berlusconi gives support". d_2 (Sport): "Wonderful Totti in the yesterday match against Berlusconi's Milan". d_3 (Economics): "Berlusconi acquires Inzaghi before elections". C_1: Politics category; C_2: Sport category. (The figure plots the document and category vectors on the axes Berlusconi, Totti, Bush.)

27 Linear Kernel. In Text Categorization documents are word vectors:
Φ(d_x) = x = (0,..,1,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,1): buy, acquisition, stocks, sell, market
Φ(d_z) = z = (0,..,1,..,0,..,1,..,0,..,0,..,0,..,1,..,0,..,0,..,1,..,0,..,0): buy, company, stocks, sell
The dot product x · z counts the number of features in common; this provides a sort of similarity.

28 Feature Conjunction (Polynomial Kernel). The initial vectors are mapped into a higher-dimensional space, which is more expressive, as it encodes conjunctions such as Stock+Market vs. Downtown+Market. We can smartly compute the scalar product:
Φ((x_1, x_2)) = (x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1)
Φ(x) · Φ(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (x_1 z_1 + x_2 z_2 + 1)^2 = (x · z + 1)^2 = K_Poly(x, z)
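A tiny C check of this identity (the two vectors are arbitrary test values; only the degree-2 mapping from the slide is used):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x[2] = {1.0, 2.0}, z[2] = {3.0, 0.5};
    /* explicit mapping (x1^2, x2^2, sqrt(2)x1x2, sqrt(2)x1, sqrt(2)x2, 1) */
    double px[6] = {x[0]*x[0], x[1]*x[1], sqrt(2)*x[0]*x[1], sqrt(2)*x[0], sqrt(2)*x[1], 1};
    double pz[6] = {z[0]*z[0], z[1]*z[1], sqrt(2)*z[0]*z[1], sqrt(2)*z[0], sqrt(2)*z[1], 1};
    double explicit_dot = 0.0;
    for (int i = 0; i < 6; i++) explicit_dot += px[i] * pz[i];
    /* kernel trick: (x.z + 1)^2, no explicit mapping needed */
    double dot = x[0]*z[0] + x[1]*z[1];
    printf("explicit = %f, kernel = %f\n", explicit_dot, (dot + 1) * (dot + 1));
    return 0;   /* both print 25.000000 */
}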

29 Using character sequences φ("bank") = x = (0,..,1,..,0,..,1,..,0,...1,..,0,..,1,..,0,..,1,..,0) bank ank bnk bk b φ("rank") = z = (1,..,0,..,0,..,1,..,0,...0,..,1,..,0,..,1,..,0,..,1) rank ank rnk rk r x z counts the number of common substrings x z = φ("bank") φ("rank") = k("bank","rank")

30 String Kernel Given two strings, the number of matches between their substrings is evaluated E.g. Bank and Rank B, a, n, k, Ba, Ban, Bank, Bk, an, ank, nk,.. R, a, n, k, Ra, Ran, Rank, Rk, an, ank, nk,.. String kernel over sentences and texts Huge space but there are efficient algorithms

31 Formal Definition. A subsequence u occurs in a string s with index vector i = (i_1, .., i_|u|), i_1 < .. < i_|u|; its span is l(i) = i_|u| - i_1 + 1. Each occurrence is weighted by λ^l(i), so φ_u(s) = Σ_{i: u = s[i]} λ^l(i) and SK(s, t) = Σ_u φ_u(s) φ_u(t).

32 Kernel between Bank and Rank

33 An example of string kernel computation

34 Efficient Evaluation. Dynamic Programming technique: evaluate the spectrum string kernels (substrings of size p) and sum the contributions of the different spectra.

35 Efficient Evaluation

36 An example: SK("Gatta", "Cata"). First, evaluate the SK with size p = 1, i.e., the single-character matches a, a, t, t, a, a; store them in the table SK_{p=1}.

37 Evaluating DP2 Predicted weight of the string of size p multiplied by the number of substrings of size p-1

38 Evaluating the Predictive DP on strings of size 2 (second row). Let's consider substrings of size 2 and suppose that: we have matched the first 'a'; we will match the next character that we add to the two strings. We compute the weights of future matches at different string positions with some not-yet-known character '?'. If the match occurs immediately after 'a', the weight will be λ^(1+1) × λ^(1+1) = λ^4, and we store just λ^2 in the DP entry ['a','a'].

39 Evaluating the DP wrt different positions (second row). If the match for 'gatta' occurs after 't', the weight will be λ^(1+2) (× λ^2 = λ^5), since the substring for it will be 'a?' with a gap (λ^3). We write such a prediction in the entry ['a','t']. Same rationale for a match after the second 't': we have the substring 'a??' (matching with 'a?' from 'cata') for a weight of λ^(3+1) (× λ^2).

40 Evaluating the DP wrt different positions (third row). If the match of the second character occurs after the 't' of 'cata', the weight will be λ^(2+1) (× λ^2 = λ^5), since it will be with the string 'a?', with a weight of λ^3. If the match occurs after the 't' of both 'gatta' and 'cata', there are two ways to compose substrings of size two: 'a?' with weight λ^4 or 't?' with weight λ^2; the total is λ^2 + λ^4.

41 Evaluating the DP wrt different positions (third row). Finally, consider a match after the last 't' of both 'cata' and 'gatta'. There are three possible substrings of 'gatta': 'a??', 't?', 't?', with weights λ^3, λ^2 and λ, respectively; there are two possible substrings of 'cata': 'a?', 't?', with weights λ^2 and λ. Pairing the same-character substrings gives weights λ^5, λ^3 and λ^2, so the total weight is λ^5 + λ^3 + λ^2.

42 Evaluating SK of size 2 using DP2 (table SK_{p=2}). The number/weight of substrings of size 2 between 'gat' and 'cat' is λ^4 = λ^2 (the ['a','a'] entry of DP) × λ^2 (the cost of one character per string). Between 'gatta' and 'cata' it is λ^7 + λ^5 + λ^4, i.e., the weighted matches of the common subsequences 'aa', 'at' and 'ta'.
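The walkthrough can be condensed into a direct (unoptimized) C sketch of the gap-weighted subsequence kernel, following the standard recursion of Lodhi et al. (2002); its auxiliary tables play the role of the DP tables above, though the bookkeeping convention may differ in minor details from the slides' example:

#include <math.h>
#include <stdlib.h>
#include <string.h>

/* SK_p(s, t): weighted count of common subsequences of length p,
   each occurrence weighted by lambda^span. */
double subseq_kernel(const char *s, const char *t, int p, double lambda) {
    int n = (int)strlen(s), m = (int)strlen(t);
    /* Kp[l][i][j] = K'_l over prefixes s[0..i), t[0..j): weight of common
       subsequences of length l, charged up to the ends of both prefixes. */
    double ***Kp = malloc((size_t)p * sizeof *Kp);
    for (int l = 0; l < p; l++) {
        Kp[l] = malloc((size_t)(n + 1) * sizeof *Kp[l]);
        for (int i = 0; i <= n; i++) {
            Kp[l][i] = calloc((size_t)(m + 1), sizeof *Kp[l][i]);
            if (l == 0)
                for (int j = 0; j <= m; j++) Kp[l][i][j] = 1.0;  /* base case */
        }
    }
    for (int l = 1; l < p; l++)
        for (int i = l; i <= n; i++)
            for (int j = l; j <= m; j++) {
                Kp[l][i][j] = lambda * Kp[l][i-1][j];
                for (int q = 1; q <= j; q++)        /* matches of s[i-1] in t */
                    if (t[q-1] == s[i-1])
                        Kp[l][i][j] += Kp[l-1][i-1][q-1] * pow(lambda, j - q + 2);
            }
    double K = 0.0;                                  /* close the p-th matched pair */
    for (int i = 1; i <= n; i++)
        for (int q = 1; q <= m; q++)
            if (t[q-1] == s[i-1])
                K += Kp[p-1][i-1][q-1] * lambda * lambda;
    for (int l = 0; l < p; l++) {
        for (int i = 0; i <= n; i++) free(Kp[l][i]);
        free(Kp[l]);
    }
    free(Kp);
    return K;
}

For example, subseq_kernel("a", "a", 1, lambda) gives lambda^2, the λ^(1+1) weight of a single-character match used in the slides.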

43 Document Similarity. Doc 1 and Doc 2 are compared by linking related words (e.g., industry, company, telephone, product, market) across the two documents.

44 Lexical Semantic Kernel [CoNLL 2005]. The document similarity is the SK function:
SK(d_1, d_2) = Σ_{w_1 ∈ d_1, w_2 ∈ d_2} s(w_1, w_2)
where s is any similarity function between words, e.g. WordNet similarity [Basili et al., 2005] or LSA [Cristianini et al., 2002]. Good results when training data is scarce.
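A hedged C sketch of this summation; word_sim is a hypothetical callback standing in for any WordNet- or LSA-based similarity (not part of the slides):

typedef double (*word_sim)(const char *w1, const char *w2);

/* SK(d1, d2) = sum over all word pairs (w1 in d1, w2 in d2) of s(w1, w2). */
double lexical_semantic_kernel(const char **d1, int n1,
                               const char **d2, int n2, word_sim s) {
    double sk = 0.0;
    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++)
            sk += s(d1[i], d2[j]);   /* any word similarity: WordNet, LSA, ... */
    return sk;
}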

45 Mercer's Theorem (finite space). Let us consider K = ( K(x_i, x_j) )_{i,j=1}^n. K symmetric implies there is a V such that K = V Λ V′ (Takagi factorization of a complex-symmetric matrix), where Λ is the diagonal matrix of the eigenvalues λ_t of K and v_t = (v_ti)_{i=1}^n are the eigenvectors, i.e., the columns of V. Let us assume the eigenvalues are non-negative; then
φ : x_i → ( √λ_t v_ti )_{t=1}^n ∈ R^n, i = 1, .., n

46 Mercer's Theorem (sufficient conditions). Therefore
Φ(x_i) · Φ(x_j) = Σ_{t=1}^n λ_t v_ti v_tj = (V Λ V′)_ij = K_ij = K(x_i, x_j),
which implies that K is a kernel function.

47 Mercer's Theorem (necessary conditions). Suppose we have a negative eigenvalue λ_s with eigenvector v_s. The point
z = Σ_{i=1}^n v_si Φ(x_i) = √Λ V′ v_s
has the following norm:
||z||^2 = z · z = (√Λ V′ v_s) · (√Λ V′ v_s) = v_s′ V √Λ √Λ V′ v_s = v_s′ K v_s = v_s′ λ_s v_s = λ_s ||v_s||^2 < 0,
which contradicts the geometry of the space.

48 Is it a valid kernel? It may not be a kernel, so we can use M M′ (which is positive semi-definite by construction).

49 Tree kernels Subtree, Subset Tree, Partial Tree kernels Efficient computation

50 Example of a parse tree for "John delivers a talk in Rome":
(S (N John) (VP (V delivers) (NP (D a) (N talk)) (PP (IN in) (N Rome))))

51 The Syntactic Tree Kernel (STK) [Collins and Duffy, 2002]. Example fragment: (VP (V delivers) (NP (D a) (N talk)))

52 The overall fragment set, e.g. (VP (V)(NP)), (NP (D)(N)), (D a), ... Children are not divided: a fragment contains either all children of a node or none.

53 Explicit kernel space φ(t x ) = x = (0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,1,..,0) φ(t z ) = z = (1,..,0,..,0,..,1,..,0,..,1,..,0,..,1,..,0,..,0,..,1,..,0,..,0) x z counts the number of common substructures

54 Efficient evaluation of the scalar product:
x · z = φ(T_x) · φ(T_z) = K(T_x, T_z) = Σ_{n_x ∈ T_x} Σ_{n_z ∈ T_z} Δ(n_x, n_z)

55 Efficient evaluation of the scalar product:
x · z = φ(T_x) · φ(T_z) = K(T_x, T_z) = Σ_{n_x ∈ T_x} Σ_{n_z ∈ T_z} Δ(n_x, n_z)
[Collins and Duffy, ACL 2002] evaluate Δ in O(n^2):
Δ(n_x, n_z) = 0, if the productions are different; else
Δ(n_x, n_z) = 1, if pre-terminals; else
Δ(n_x, n_z) = Π_{j=1}^{nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))

56 Other Adjustments. Decay factor:
Δ(n_x, n_z) = λ, if pre-terminals; else
Δ(n_x, n_z) = λ Π_{j=1}^{nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))
Normalization:
K′(T_x, T_z) = K(T_x, T_z) / √( K(T_x, T_x) K(T_z, T_z) )
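A compact C sketch of the Δ recursion with the decay factor λ; the Node structure and its field names are illustrative (production equality is abstracted into an integer id):

typedef struct Node {
    int production_id;        /* id of the CFG production rooted at this node */
    int is_preterminal;       /* 1 if the node dominates only a leaf */
    int nc;                   /* number of children */
    struct Node **children;
} Node;

/* Delta(n_x, n_z) of the STK [Collins and Duffy, 2002] with decay lambda;
   K(T_x, T_z) is then the sum of delta over all node pairs. */
double delta(const Node *nx, const Node *nz, double lambda) {
    if (nx->production_id != nz->production_id)
        return 0.0;                              /* different productions */
    if (nx->is_preterminal)
        return lambda;                           /* matching pre-terminals */
    double d = lambda;
    for (int j = 0; j < nx->nc; j++)             /* same production => same nc */
        d *= 1.0 + delta(nx->children[j], nz->children[j], lambda);
    return d;
}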

57 SubTree (ST) Kernel [Vishwanathan and Smola, 2002]. Fragments are complete subtrees down to the leaves, e.g. (VP (V delivers) (NP (D a) (N talk))), (NP (D a) (N talk)), (D a), (N talk), (V delivers)

58 Evaluation. Given the equation for the SST kernel:
Δ(n_x, n_z) = 0, if the productions are different; else
Δ(n_x, n_z) = 1, if pre-terminals; else
Δ(n_x, n_z) = Π_{j=1}^{nc(n_x)} (1 + Δ(ch(n_x, j), ch(n_z, j)))

59 Evaluation. For the ST kernel, the recursion drops the (1 + ·) term:
Δ(n_x, n_z) = Π_{j=1}^{nc(n_x)} Δ(ch(n_x, j), ch(n_z, j))
so only complete subtrees are counted.

60 Fast Evaluation of STK [EACL 2006]:
K(T_x, T_z) = Σ_{⟨n_x, n_z⟩ ∈ NP} Δ(n_x, n_z), where
NP = { ⟨n_x, n_z⟩ ∈ T_x × T_z : Δ(n_x, n_z) ≠ 0 } = { ⟨n_x, n_z⟩ ∈ T_x × T_z : P(n_x) = P(n_z) },
and P(n_x), P(n_z) are the production rules used at nodes n_x and n_z.

61 Algorithm

63 Observations. We order the production rules used in T_x and T_z at loading time. At learning time we may evaluate NP in O(|T_x| + |T_z|) running time. If T_x and T_z are generated by only one production rule, the running time degenerates to O(|T_x| × |T_z|): very unlikely! A sketch of the merge is shown below.
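A C sketch of that merge, reusing the Node struct from the STK sketch above and assuming both node arrays were sorted by production_id at loading time:

/* Emit all node pairs with equal productions (the set NP). Runs in about
   |Tx| + |Tz| steps when productions repeat rarely; degenerates to
   O(|Tx| * |Tz|) only if all nodes share one production. */
void node_pairs(Node **tx, int nx, Node **tz, int nz,
                void (*emit)(Node *, Node *)) {
    int i = 0, j = 0;
    while (i < nx && j < nz) {
        if (tx[i]->production_id < tz[j]->production_id) i++;
        else if (tx[i]->production_id > tz[j]->production_id) j++;
        else {
            int j0 = j;                          /* start of the equal block in tz */
            while (i < nx && tx[i]->production_id == tz[j0]->production_id) {
                for (j = j0; j < nz && tz[j]->production_id == tx[i]->production_id; j++)
                    emit(tx[i], tz[j]);
                i++;
            }
        }
    }
}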

64 Labeled Ordered Tree Kernel. SST satisfies the constraint "remove 0 or all children at a time". If we relax such a constraint we get more general substructures [Kashima and Koyanagi, 2002], e.g. fragments of (VP (V gives) (NP (D a) (N talk))) in which any subset of children may be removed.

65 Weighting Problems. Both matched pairs, e.g. (VP (V gives)(NP (D a)(N talk))) matched against (VP (V gives)(NP (D a)(JJ good)(N talk))) and against (VP (V gives)(NP (D a)(JJ bad)(N talk))), give the same contribution. Gap-based weighting is needed; a novel efficient evaluation has to be defined.

66 Partial Trees [ECML 2006]. SST + String Kernel with weighted gaps on nodes' children, e.g. fragments of (VP (V brought) (NP (D a) (N cat))) in which subsequences of children are allowed.

67 Partial Tree Kernel. By adding two decay factors, μ for the height of the fragments and λ for the gaps in the children subsequences, we obtain the PTK Δ recursion.

68 Efficient Evaluation (2). The complexity of finding the subsequences is O(p ρ^2); therefore the overall complexity is O(p ρ^2 |N_{T_x}| |N_{T_z}|), where ρ is the maximum branching factor (p = ρ in the worst case).

69 Running Time of Tree Kernel Functions

70 Kernel Combinations, an example:
K = K_Tree + K_p
K = K_Tree × K_p
K = γ K_Tree + K_p
where K_Tree is the tree kernel and K_p is a polynomial kernel (degree 3) over flat features.

71 Question Classification Definition: What does HTML stand for? Description: What's the final line in the Edgar Allan Poe poem "The Raven"? Entity: What foods can cause allergic reaction in people? Human: Who won the Nobel Peace Prize in 1992? Location: Where is the Statue of Liberty? Manner: How did Bob Marley die? Numeric: When was Martin Luther King Jr. born? Organization: What company makes Bentley cars?

72 Question Classifier based on Tree Kernels. Question dataset [Li and Roth, 2005], distributed on 6 categories: Abbreviations, Descriptions, Entity, Human, Location, and Numeric. Fixed split: 5500 training and 500 test questions; cross-validation (10 folds). Using the whole question parse trees (constituent parsing). Example: "What is an offer of direct stock purchase plan?"

74 Kernels. BOW and POS are obtained with a simple tree, e.g. (BOX (What *)(is *)(an *)(offer *)...); PT (parse tree); PAS (predicate argument structure).

75 Question classification

76 TASK: Automatic Classification. The classifier detects if a pair (question, answer) is correct or not. A representation for the pair is needed. The classifier can be used to re-rank the output of a basic QA system.

77 Dataset 2: TREC data. 138 TREC 2001 test questions labeled as "description". 2,256 sentences extracted from the best-ranked paragraphs (using a basic QA system based on the Lucene search engine on the TREC dataset), 216 of which labeled as correct by one annotator.

78 Dataset 2: TREC data. Note that a question is linked to many answers: all its derived pairs cannot be shared by training and test sets.

79 Computational Representations and Feature Spaces

80 Bags of words (BOW) and POS-tags (POS). To save time, apply STK to these trees:
(BOX (What *)(is *)(an *)(offer *)(of *))
(BOX (WHNP *)(VBZ *)(DT *)(NN *)(IN *))

81 Word and POS Sequences.
Word sequence (WSK): "What is an offer of ?" → features such as What_is_offer, What_is
POS sequence (POSSK): WHNP VBZ DT NN IN → features such as WHNP_VBZ_NN, WHNP_NN_IN

82 Syntactic Parse Trees (PT)

84 Predicate Argument Classification. In an event: target words describe a relation among different entities; the participants are often seen as the predicate's arguments. Example: [Arg0 Paul] [predicate gives] [Arg1 a lecture] [ArgM in Rome]

85 Predicate Argument Structure for the Partial Tree Kernel (PAS-PTK).
[ARG1 Antigens] were [AM-TMP originally] [rel defined] [ARG2 as nonself molecules].
[ARG0 Researchers] [rel describe] [ARG1 antigens] [ARG2 as foreign molecules] [ARGM-LOC in the body].

86 Kernels and Combinations. Exploiting the property k(x,z) = k_1(x,z) + k_2(x,z): basic kernels BOW, POS, WSK, POSSK, PT, PAS-PTK, and combinations BOW+POS, BOW+PT, PT+POS, ... (a normalized-sum sketch follows).
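In C, such a combination is just a sum of the individual kernels, each typically normalized first so that no single kernel dominates; the six kernel values are assumed to be computed elsewhere:

#include <math.h>

/* k'(x,z) = k(x,z) / sqrt(k(x,x) k(z,z)); the sum of two valid kernels
   is itself a valid kernel (closure under addition). */
double combined_kernel(double kA_xz, double kA_xx, double kA_zz,
                       double kB_xz, double kB_xx, double kB_zz) {
    double a = kA_xz / sqrt(kA_xx * kA_zz);
    double b = kB_xz / sqrt(kB_xx * kB_zz);
    return a + b;
}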

87-93 Results on TREC Data (5-fold cross validation): charts of F1-measure by kernel type (figure-only slides). The final chart compares the BOW baseline against POSSK+STK+PAS-PTK and reports the percentage of improvement.

94 Conclusions. Dealing with noise and errors of NLP modules requires robust approaches. SVMs are robust to noise, and kernel methods allow for: syntactic information via STK; shallow semantic information via PTK; word/POS sequences via String Kernels. When the IR task is complex, syntax and semantics are essential: great improvement in Q/A classification. SVM-Light-TK: an efficient tool to use them.

95 SVM-light-TK Software. Encodes ST, SST and combination kernels in SVM-light [Joachims, 1999]. Available at. Handles tree forests and vector sets. New extensions: the PT kernel will be released asap.

96 Data Format. "What does Html stand for?"
1 |BT| (SBARQ (WHNP (WP What))(SQ (AUX does)(NP (NNP S.O.S.))(VP (VB stand)(PP (IN for))))(. ?)) |BT| (BOW (What *)(does *)(S.O.S. *)(stand *)(for *)(? *)) |BT| (BOP (WP *)(AUX *)(NNP *)(VB *)(IN *)(. *)) |BT| (PAS (ARG0 (R-A1 (What *)))(ARG1 (A1 (S.O.S. NNP)))(ARG2 (rel stand))) |ET| 1:1 21: E-4 23:1 30:1 36:1 39:1 41:1 46:1 49:1 66:1 152:1 274:1 333:1 |BV| 2:1 21: E-4 23:1 31:1 36:1 39:1 41:1 46:1 49:1 52:1 66:1 152:1 246:1 333:1 392:1 |EV|

97 Basic Commands.
Training and classification:
./svm_learn -t 5 -C T train.dat model
./svm_classify test.dat model
Learning with a vector sequence:
./svm_learn -t 5 -C V train.dat model
Learning with the sum of vector and kernel sequences:
./svm_learn -t 5 -C + train.dat model

98 Custom Kernel (kernel.h):
double custom_kernel(KERNEL_PARM *kernel_parm, DOC *a, DOC *b);

if (a->num_of_trees && b->num_of_trees &&
    a->forest_vec[i] != NULL && b->forest_vec[i] != NULL) {
    /* proceed only if the i-th trees of instances a and b are both non-empty */

99 Custom Kernel: tree kernel.
k1 = /* summation of tree kernels */
     tree_kernel(kernel_parm, a, b, i, i) /       /* tree kernel between the two i-th trees */
     sqrt(tree_kernel(kernel_parm, a, a, i, i) *
          tree_kernel(kernel_parm, b, b, i, i));  /* normalize with respect to both i-th trees */

100 Custom Kernel: polynomial kernel.
if (a->num_of_vectors && b->num_of_vectors &&
    a->vectors[i] != NULL && b->vectors[i] != NULL) {
    /* check that the i-th vectors are not empty */
    k2 = /* summation of vectors */
         basic_kernel(kernel_parm, a, b, i, i) /  /* standard kernel, selected by the "second_kernel" parameter */

101 Custom Kernel: polynomial kernel (cont'd).
         sqrt(basic_kernel(kernel_parm, a, a, i, i) *
              basic_kernel(kernel_parm, b, b, i, i));  /* normalize vectors */
return k1 + k2;
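Putting slides 98-101 together, a plausible complete body of custom_kernel would look like the following sketch; the header name, loop bounds and guard conditions are reconstructed from the fragments above, so treat it as illustrative rather than as the shipped source:

#include <math.h>
#include "svm_common.h"   /* SVM-light types DOC and KERNEL_PARM (assumed header name) */

double custom_kernel(KERNEL_PARM *kernel_parm, DOC *a, DOC *b) {
    double k1 = 0.0, k2 = 0.0;
    for (int i = 0; i < a->num_of_trees && i < b->num_of_trees; i++)
        if (a->forest_vec[i] != NULL && b->forest_vec[i] != NULL)  /* skip empty trees */
            k1 += tree_kernel(kernel_parm, a, b, i, i) /
                  sqrt(tree_kernel(kernel_parm, a, a, i, i) *
                       tree_kernel(kernel_parm, b, b, i, i));
    for (int i = 0; i < a->num_of_vectors && i < b->num_of_vectors; i++)
        if (a->vectors[i] != NULL && b->vectors[i] != NULL)        /* skip empty vectors */
            k2 += basic_kernel(kernel_parm, a, b, i, i) /
                  sqrt(basic_kernel(kernel_parm, a, a, i, i) *
                       basic_kernel(kernel_parm, b, b, i, i));
    return k1 + k2;
}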

102 Conclusions. Kernel methods and SVMs are useful tools to design language applications. Kernel design still requires some level of expertise. Engineering approaches to tree kernels: basic combinations; canonical mappings, e.g. node marking; merging of kernels into more complex kernels. State of the art in SRL and QC. An efficient tool to use them.

103 Thank you

104 References
Alessandro Moschitti, Silvia Quarteroni, Roberto Basili and Suresh Manandhar, Exploiting Syntactic and Shallow Semantic Kernels for Question/Answer Classification. In Proceedings of the 45th Conference of the Association for Computational Linguistics (ACL), Prague, June 2007.
Alessandro Moschitti and Fabio Massimo Zanzotto, Fast and Effective Kernels for Relational Learning from Texts. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), Corvallis, OR, USA, 2007.
Daniele Pighin, Alessandro Moschitti and Roberto Basili, RTV: Tree Kernels for Thematic Role Classification. In Proceedings of the 4th International Workshop on Semantic Evaluation (SemEval-4), English Semantic Labeling, Prague, June 2007.
Stephan Bloehdorn and Alessandro Moschitti, Combined Syntactic and Semantic Kernels for Text Classification. To appear in the 29th European Conference on Information Retrieval (ECIR), Rome, Italy, April 2007.
Fabio Aiolli, Giovanni Da San Martino, Alessandro Sperduti and Alessandro Moschitti, Efficient Kernel-based Learning for Trees. To appear in the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Honolulu, Hawaii, 2007.

105 An introductory book on SVMs, Kernel methods and Text Categorization

106 References
Roberto Basili and Alessandro Moschitti, Automatic Text Categorization: from Information Retrieval to Support Vector Learning. Aracne editrice, Rome, Italy, 2005.
Alessandro Moschitti, Efficient Convolution Kernels for Dependency and Constituent Syntactic Trees. In Proceedings of the 17th European Conference on Machine Learning (ECML 2006), Berlin, Germany, 2006.
Alessandro Moschitti, Daniele Pighin and Roberto Basili, Tree Kernel Engineering for Proposition Re-ranking. In Proceedings of Mining and Learning with Graphs (MLG 2006), workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.
Elisa Cilia, Alessandro Moschitti, Sergio Ammendola and Roberto Basili, Structured Kernels for Automatic Detection of Protein Active Sites. In Proceedings of Mining and Learning with Graphs (MLG 2006), workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.

107 References
Fabio Massimo Zanzotto and Alessandro Moschitti, Automatic learning of textual entailments with cross-pair similarities. In Proceedings of COLING-ACL, Sydney, Australia, 2006.
Alessandro Moschitti, Making tree kernels practical for natural language learning. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento, Italy, 2006.
Alessandro Moschitti, Daniele Pighin and Roberto Basili, Semantic Role Labeling via Tree Kernel joint inference. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL 2006), New York, USA, 2006.
Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili, Semantic Tree Kernels to classify Predicate Argument Structures. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI 2006), Riva del Garda, Italy, 2006.

108 References
Alessandro Moschitti and Roberto Basili, A Tree Kernel approach to Question and Answer Classification in Question Answering Systems. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Genova, Italy, 2006.
Ana-Maria Giuglea and Alessandro Moschitti, Semantic Role Labeling via FrameNet, VerbNet and PropBank. In Proceedings of the Joint 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, 2006.
Roberto Basili, Marco Cammisa and Alessandro Moschitti, Effective use of WordNet semantics via kernel-based learning. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005), Ann Arbor (MI), USA, 2005.

109 References
Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola and Roberto Basili, Hierarchical Semantic Role Labeling. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), Ann Arbor (MI), USA, 2005.
Roberto Basili, Marco Cammisa and Alessandro Moschitti, A Semantic Kernel to classify texts with very few training examples. In Proceedings of the Workshop on Learning in Web Search, at the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005.
Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili, Engineering of Syntactic Features for Shallow Semantic Parsing. In Proceedings of the ACL 2005 Workshop on Feature Engineering for Machine Learning in Natural Language Processing, Ann Arbor (MI), USA, 2005.

110 References
Alessandro Moschitti, A study on Convolution Kernels for Shallow Semantic Parsing. In Proceedings of ACL 2004, Barcelona, Spain, 2004.
Alessandro Moschitti and Cosmin Adrian Bejan, A Semantic Kernel for Predicate Argument Classification. In Proceedings of CoNLL 2004, Boston, MA, USA, 2004.
M. Collins and N. Duffy, New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL 2002.
S.V.N. Vishwanathan and A.J. Smola, Fast kernels on strings and trees. In Proceedings of Neural Information Processing Systems (NIPS), 2002.

111 References
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
Xavier Carreras and Lluís Màrquez, Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In Proceedings of CoNLL 2005.
Sameer Pradhan, Kadri Hacioglu, Valeri Krugler, Wayne Ward, James H. Martin and Daniel Jurafsky, Support vector learning for semantic argument classification. To appear in Machine Learning Journal.

112 The Impact of SSTK in Answer Classification (F1-measure by combination):
Q(BOW)+A(BOW); Q(BOW)+A(PT,BOW); Q(PT)+A(PT,BOW); Q(BOW)+A(BOW,PT,PAS); Q(BOW)+A(BOW,PT,PAS_N); Q(PT)+A(PT,BOW,PAS); Q(BOW)+A(BOW,PAS); Q(BOW)+A(BOW,PAS_N)

113 Mercer's conditions (1)

114 Mercer's conditions (2). If the Gram matrix G = ( k(x_i, x_j) )_{i,j} is positive semi-definite, there is a mapping φ that produces the target kernel function.

115 The lexical semantic kernel is not always a kernel. It may not be a kernel, so we can use M M′, where M is the initial similarity matrix.

116 Efficient Evaluation (1). In [Shawe-Taylor and Cristianini, 2004], sequence kernels with weighted gaps are factorized with respect to the different subsequence sizes (the D_p tables). We treat children as sequences and apply the same theory.
