Introduction to Data-Driven Dependency Parsing


1 Introduction to Data-Driven Dependency Parsing
Introductory Course, ESSLLI 2007
Ryan McDonald (Google Inc., New York, USA, ryanmcd@google.com)
Joakim Nivre (Uppsala University and Växjö University, Sweden, nivre@msi.vxu.se)

2 From Last Lecture: Formal Conditions on Dependency Graphs
For a dependency graph G = (V, A) with label set L = {l_1, ..., l_|L|}:
G is (weakly) connected: if i, j ∈ V, then i ↔* j.
G is acyclic: if i → j, then not j →* i.
G obeys the single-head constraint: if i → j, then not i′ → j, for any i′ ≠ i.
G is projective: if i → j, then i →* i′, for any i′ such that i < i′ < j or j < i′ < i.

3 From Last Lecture: Dependency Graphs as Trees
Consider a dependency graph G = (V, A) satisfying:
G is (weakly) connected: if i, j ∈ V, then i ↔* j.
G obeys the single-head constraint: if i → j, then not i′ → j, for any i′ ≠ i.
G obeys the single-root constraint: if there is no i such that i → j, then for any j′ ≠ j there is some i′ such that i′ → j′ (w_0 = root is always this node).
Such a dependency graph is by definition a tree.
For the rest of the course we assume that all dependency graphs are trees.

4 From Last Lecture: Dependency Graphs as Trees
Satisfies: connected, single-head.
(Example dependency graph over "Economic news had little effect on financial markets." with arc labels sbj, obj, nmod, pc.)

5 From Last Lecture: Dependency Graphs as Trees
Satisfies: connected, single-head, single-root.
(Same example with an artificial root node added: "root Economic news had little effect on financial markets ." with arc labels root, pred, p, sbj, obj, nmod, pc.)

6 Introduction: Overview of the Course
Dependency parsing (Joakim)
Machine learning methods (Ryan)
Transition-based models (Joakim)
Graph-based models (Ryan)
Loose ends (Joakim, Ryan): other approaches, empirical results, available software

7 Introduction: Data-Driven Parsing
Data-driven ≈ machine learning: parameterize a model
Supervised: learn parameters from annotated data
Unsupervised: induce parameters from large unannotated corpora
Data-driven vs. grammar-driven: can parse all sentences vs. generates a specific language
Data-driven = grammar of Σ* (any string over the vocabulary can be analyzed)

8 Introduction: Lecture 2 Outline
Feature representations
Linear classifiers: perceptron, large-margin classifiers (SVMs, MIRA), others
Non-linear classifiers: k-NNs and memory-based learning, kernels
Structured learning: structured perceptron, large-margin methods, others

9 Introduction: Important Message
This lecture contains a lot of details; it is not important if you do not follow all the proofs and maths.
What is important:
Understand the basic representation of data as features
Understand the basic goal and structure of classifiers
Understand the important distinctions: linear vs. non-linear, binary vs. multiclass, multiclass vs. structured, etc.
Interested in ML for NLP? Check out the afternoon course "Machine learning methods for NLP".

10 Feature Representations
Input: x ∈ X, e.g., a document or a sentence with some words x = w_1 ... w_n, or a series of previous parsing actions
Output: y ∈ Y, e.g., a dependency tree, a document class, part-of-speech tags, the next parsing action
We assume a mapping from x to a high-dimensional feature vector, f(x) : X → R^m
Sometimes it will be easier to think of a mapping from an input/output pair to a feature vector, f(x, y) : X × Y → R^m
For any vector v ∈ R^m, let v_j be the j-th value

11 Feature Representations: Examples
x is a document:
f_j(x) = 1 if x contains the word "interest", 0 otherwise
f_j(x) = the percentage of words that contain punctuation
x is a word and y is a part-of-speech tag:
f_j(x, y) = 1 if x = "bank" and y = Verb, 0 otherwise

12 Feature Representations: Example 2
f_0(x) = 1 if x contains the word "John", 0 otherwise
f_1(x) = 1 if x contains the word "Mary", 0 otherwise
f_2(x) = 1 if x contains the word "Harry", 0 otherwise
f_3(x) = 1 if x contains the word "likes", 0 otherwise
x = "John likes Mary": f(x) = [1 1 0 1]
x = "Mary likes John": f(x) = [1 1 0 1]
x = "Harry likes Mary": f(x) = [0 1 1 1]
x = "Harry likes Harry": f(x) = [0 0 1 1]
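
To make the mapping concrete, here is a minimal sketch (not part of the original slides) of how such indicator features could be computed in code; the vocabulary and feature order follow the example above.

```python
# Minimal sketch (assumed example): binary indicator features over a fixed vocabulary.
VOCAB = ["John", "Mary", "Harry", "likes"]

def f(x):
    """Map a sentence x (a string) to its binary feature vector."""
    words = set(x.split())
    return [1 if w in words else 0 for w in VOCAB]

print(f("John likes Mary"))    # [1, 1, 0, 1]
print(f("Mary likes John"))    # [1, 1, 0, 1]  -- same vector: word order is lost
print(f("Harry likes Harry"))  # [0, 0, 1, 1]
```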

13 Linear Classifiers
Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
Let w ∈ R^m be a high-dimensional weight vector
If we assume that w is known, then we can define two kinds of linear classifiers
Reminder (dot product): v · v′ = Σ_j v_j v′_j ∈ R
Binary classification: Y = {-1, 1}, y = sign(w · f(x))
Multiclass classification: Y = {0, 1, ..., N}, y = argmax_y w · f(x, y)
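
As an illustration (not from the slides), the two decision rules can be written directly as code; `feats` below stands for a hypothetical joint feature function f(x, y).

```python
# Minimal sketch (assumed example): linear decision rules.
import numpy as np

def predict_binary(w, fx):
    """y = sign(w . f(x)) for Y = {-1, +1}."""
    return 1 if np.dot(w, fx) >= 0 else -1

def predict_multiclass(w, feats, x, labels):
    """y = argmax_y w . f(x, y); feats(x, y) returns a feature vector."""
    return max(labels, key=lambda y: np.dot(w, feats(x, y)))
```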

14 Linear Classifiers: Binary Linear Classifier
Divides all points: y = sign(w · f(x))

15 Linear Classifiers: Multiclass Linear Classifier
Defines regions of space: y = argmax_y w · f(x, y)

16 Linear Classifiers: Separability
A set of points is separable if there exists a w such that classification is perfect
(Figures: a "Separable" data set and a "Not Separable" data set.)

17 Linear Classifiers: Supervised Learning
Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
Input: feature representation f
Output: w that maximizes/minimizes some important function on the training set
minimize error (perceptron, SVMs, boosting)
maximize likelihood of data (logistic regression, CRFs)
Assumption: the training data is separable
Not necessary, just makes life easier
There is a lot of good work in machine learning to tackle the non-separable case

18 Linear Classifiers: Perceptron
Minimize error:
Binary classification, Y = {-1, 1}: w = argmin_w Σ_t 1[y_t ≠ sign(w · f(x_t))]
Multiclass classification, Y = {0, 1, ..., N}: w = argmin_w Σ_t 1[y_t ≠ argmax_y w · f(x_t, y)]
where 1[p] = 1 if p is true, 0 otherwise

19 Linear Classifiers: Perceptron Learning Algorithm (multiclass)
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     Let y′ = argmax_{y′} w^(i) · f(x_t, y′)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)
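
A minimal runnable sketch (not from the slides) of the algorithm above, assuming a hypothetical joint feature function `feats(x, y)` that returns a numpy vector of fixed dimension m.

```python
# Minimal sketch (assumed example): the multiclass perceptron training loop.
import numpy as np

def perceptron(train, feats, labels, m, epochs=10):
    """train: list of (x, y) pairs; returns the learned weight vector w."""
    w = np.zeros(m)
    for _ in range(epochs):                                              # 2. for n : 1..N
        for x, y in train:                                               # 3. for t : 1..|T|
            y_hat = max(labels, key=lambda yy: np.dot(w, feats(x, yy)))  # 4. argmax
            if y_hat != y:                                               # 5. mistake?
                w = w + feats(x, y) - feats(x, y_hat)                    # 6. additive update
    return w
```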

20 Linear Classifiers: Perceptron Learning Algorithm (multiclass)
Given a training instance (x_t, y_t), define Ȳ_t = Y − {y_t}
A training set T is separable with margin γ > 0 if there exists a vector u with ‖u‖ = 1 such that:
u · f(x_t, y_t) − u · f(x_t, y′) ≥ γ for all y′ ∈ Ȳ_t, where ‖u‖ = √(Σ_j u_j²)
Assumption: the training set is separable with margin γ

21 Linear Classifiers: Perceptron Learning Algorithm (multiclass)
Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:
number of training errors ≤ R²/γ²
where R ≥ ‖f(x_t, y_t) − f(x_t, y′)‖ for all (x_t, y_t) ∈ T and y′ ∈ Ȳ_t
Thus, after a finite number of training iterations, the error on the training set will converge to zero
Let's prove it! (proof taken from Collins 02)

22 Linear Classifiers: Perceptron Learning Algorithm (multiclass)
(The training algorithm is repeated as on slide 19.)
Let w^(k−1) be the weights before the k-th mistake, and suppose the k-th mistake is made at the t-th example (x_t, y_t):
y′ = argmax_{y′} w^(k−1) · f(x_t, y′), with y′ ≠ y_t
w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y′)
Now: u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y′)) ≥ u · w^(k−1) + γ
Now: w^(0) = 0 and u · w^(0) = 0, so by induction on k, u · w^(k) ≥ kγ
Now: since u · w^(k) ≤ ‖u‖ ‖w^(k)‖ and ‖u‖ = 1, we get ‖w^(k)‖ ≥ kγ
Now: ‖w^(k)‖² = ‖w^(k−1)‖² + ‖f(x_t, y_t) − f(x_t, y′)‖² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y′))
so ‖w^(k)‖² ≤ ‖w^(k−1)‖² + R² (since R ≥ ‖f(x_t, y_t) − f(x_t, y′)‖ and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y′) ≤ 0)

23 Linear Classifiers: Perceptron Learning Algorithm (multiclass)
We have just shown that ‖w^(k)‖ ≥ kγ and ‖w^(k)‖² ≤ ‖w^(k−1)‖² + R²
By induction on k, and since w^(0) = 0 and ‖w^(0)‖² = 0: ‖w^(k)‖² ≤ kR²
Therefore: k²γ² ≤ ‖w^(k)‖² ≤ kR²
Solving for k: k ≤ R²/γ²
Therefore the number of errors is bounded!

24 Linear Classifiers: Margin
Denote the value of the margin by γ
(Figures: the margin illustrated on training data and on testing data.)

25 Linear Classifiers: Margin
Intuitively, maximizing the margin makes sense
More importantly, generalization error on unseen test data is proportional to the inverse of the margin: ε ∝ R² / (γ² |T|)
Perceptron: we have shown that if a training set is separable by some margin, the perceptron will find a w that separates the data
However, it does not pick w to maximize the margin!

26 Linear Classifiers: Max Margin = Min Norm
Let γ > 0.
Max margin: max_{‖w‖≤1} γ such that y_t (w · f(x_t)) ≥ γ for all (x_t, y_t) ∈ T
= Min norm: min ½‖w‖² such that y_t (w · f(x_t)) ≥ 1 for all (x_t, y_t) ∈ T
‖w‖ is bounded, since scaling trivially produces a larger margin: y_t ([βw] · f(x_t)) ≥ βγ for any β ≥ 1
Instead of bounding ‖w‖ we fix the margin γ = 1

27 Linear Classifiers: Support Vector Machines
Binary: min ½‖w‖² such that y_t (w · f(x_t)) ≥ 1 for all (x_t, y_t) ∈ T
Multiclass: min ½‖w‖² such that w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1 for all (x_t, y_t) ∈ T and y′ ∈ Ȳ_t
Both are quadratic programming problems, a well known class of convex optimization problems
Can be solved with out-of-the-box algorithms
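
As a hedged illustration (not from the slides), a binary SVM of this form can be trained with an off-the-shelf solver such as scikit-learn; a large C value approximates the hard-margin problem stated above.

```python
# Hedged sketch (assumed example): a binary linear SVM with an off-the-shelf solver.
# A large C approximates the hard-margin problem min 1/2 ||w||^2 s.t. y_t (w . f(x_t)) >= 1.
from sklearn.svm import SVC
import numpy as np

X = np.array([[1, 1], [2, 2], [-1, -1], [-2, -1]])  # toy feature vectors f(x_t)
y = np.array([1, 1, -1, -1])                        # labels in {-1, +1}

clf = SVC(kernel="linear", C=1e6)                   # hard-margin approximation
clf.fit(X, y)
print(clf.coef_, clf.intercept_)                    # learned w and bias
print(clf.predict([[1.5, 1.0]]))                    # -> [1]
```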

28 Linear Classifiers: Support Vector Machines
Binary: min ½‖w‖² such that y_t (w · f(x_t)) ≥ 1 for all (x_t, y_t) ∈ T
Problem: sometimes |T| is far too large, and the number of constraints might make solving the quadratic programming problem very difficult
Most common technique: Sequential Minimal Optimization (SMO)
Sparse: the solution only depends on the support vectors

29 Linear Classifiers: Margin Infused Relaxed Algorithm (MIRA)
Another option: maximize the margin using an online algorithm
Batch vs. online:
Batch: update parameters based on the entire training set (e.g., SVMs)
Online: update parameters based on a single training instance at a time (e.g., perceptron)
MIRA can be thought of as a max-margin perceptron or an online SVM

30 Linear Classifiers: MIRA (multiclass)
Batch (SVMs): min ½‖w‖² such that w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1 for all (x_t, y_t) ∈ T and y′ ∈ Ȳ_t
Online (MIRA): Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     w^(i+1) = argmin_{w*} ‖w* − w^(i)‖ such that w* · f(x_t, y_t) − w* · f(x_t, y′) ≥ 1 for all y′ ∈ Ȳ_t
5.     i = i + 1
6. return w^(i)
MIRA has much smaller optimizations, with only |Ȳ_t| constraints
Cost: sub-optimal optimization
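
A hedged sketch (not from the slides) of the simplest online variant, where only the single most-violating label is used as a constraint; with one constraint the update has a closed form.

```python
# Hedged sketch (assumed example): a 1-best MIRA update.
# With a single constraint the solution is w <- w + tau * d, where
#   d   = f(x, y) - f(x, y_hat)
#   tau = max(0, 1 - w . d) / ||d||^2
import numpy as np

def mira_update(w, feats, x, y, labels):
    y_hat = max((yy for yy in labels if yy != y),
                key=lambda yy: np.dot(w, feats(x, yy)))   # most violating label
    d = feats(x, y) - feats(x, y_hat)
    violation = 1.0 - np.dot(w, d)                        # how far the margin constraint is violated
    if violation > 0 and np.dot(d, d) > 0:
        tau = violation / np.dot(d, d)
        w = w + tau * d                                   # smallest change that satisfies the constraint
    return w
```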

31 Linear Classifiers: Summary
What we have covered: feature-based representations; linear classifiers (perceptron, large-margin SVMs (batch) and MIRA (online))
What is next: non-linear classifiers

32 Non-Linear Classifiers
Some data sets require more than a linear classifier to be correctly modeled
There are a lot of models out there: k-nearest neighbours, decision trees, kernels, neural networks
We will only discuss a couple due to time constraints

33 Non-Linear Classifiers: K-Nearest Neighbours
Simplest form: for a given test point x, find the k nearest neighbours in the training set
The neighbours vote for the classification
Distance is Euclidean distance: d(x_t, x_r) = √(Σ_j (f_j(x_t) − f_j(x_r))²)
No linear classifier can correctly label the data set in the figure, but 3-nearest neighbours does.
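
A minimal sketch (not from the slides) of k-NN classification with Euclidean distance and majority voting.

```python
# Minimal sketch (assumed example): k-nearest-neighbour classification.
from collections import Counter
import numpy as np

def knn_predict(train, fx, k=3):
    """train: list of (feature_vector, label); fx: feature vector of the test point."""
    nearest = sorted(train, key=lambda ex: np.linalg.norm(np.array(ex[0]) - np.array(fx)))
    votes = Counter(label for _, label in nearest[:k])   # the k closest points vote
    return votes.most_common(1)[0][0]

train = [([0, 0], "A"), ([0, 1], "A"), ([5, 5], "B"), ([6, 5], "B")]
print(knn_predict(train, [1, 0], k=3))  # -> "A"
```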

34 Non-Linear Classifiers: K-Nearest Neighbours
A training set T, a distance function d, and a value K define a non-linear classification boundary

35 Non-Linear Classifiers: K-Nearest Neighbours
K-NN is often called a lazy learning algorithm or memory-based learning (MBL)
K-NN is generalized in the Tilburg Memory Based Learning package:
different distance functions, different voting schemes for classification, tie-breaking, memory representations

36 Non-Linear Classifiers: Kernels
A kernel is a similarity function between two points that is symmetric and positive semi-definite, which we denote by φ(x_t, x_r) ∈ R
Mercer's Theorem: for any kernel φ, there exists an f such that φ(x_t, x_r) = f(x_t) · f(x_r)

37 Non-Linear Classifiers: Kernel Trick — Perceptron Algorithm
(The perceptron algorithm from slide 19, repeated.)
Each feature vector f(x_t, y_t) is added to w and f(x_t, y) is subtracted from w, say, α_{y,t} times
α_{y,t} is the number of times during learning that label y is predicted for example t
Thus, w = Σ_{t,y} α_{y,t} [f(x_t, y_t) − f(x_t, y)]

38 Non-Linear Classifiers: Kernel Trick — Perceptron Algorithm
We can re-write the argmax function as:
y* = argmax_{y′} w^(i) · f(x_t, y′)
   = argmax_{y′} Σ_{t,y} α_{y,t} [f(x_t, y_t) − f(x_t, y)] · f(x_t, y′)
   = argmax_{y′} Σ_{t,y} α_{y,t} [f(x_t, y_t) · f(x_t, y′) − f(x_t, y) · f(x_t, y′)]
   = argmax_{y′} Σ_{t,y} α_{y,t} [φ((x_t, y_t), (x_t, y′)) − φ((x_t, y), (x_t, y′))]
We can then re-write the perceptron algorithm strictly with kernels

39 Non-Linear Classifiers: Kernel Trick — Perceptron Algorithm
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. for all y, t: set α_{y,t} = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     Let y* = argmax_{y′} Σ_{t,y} α_{y,t} [φ((x_t, y_t), (x_t, y′)) − φ((x_t, y), (x_t, y′))]
5.     if y* ≠ y_t
6.       α_{y*,t} = α_{y*,t} + 1
Given a new instance x: y* = argmax_{y′} Σ_{t,y} α_{y,t} [φ((x_t, y_t), (x, y′)) − φ((x_t, y), (x, y′))]
But it seems like we have just complicated things???
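
To see why storing the α counts is enough, here is a hedged sketch (not from the slides) of a kernelized perceptron; for simplicity it is the binary version rather than the multiclass one above.

```python
# Hedged sketch (assumed example): a binary kernelized perceptron. Instead of
# storing w, we store a count alpha_t per training example and score new points
# purely through kernel evaluations.
import numpy as np

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d

def kernel_perceptron(X, y, kernel, epochs=10):
    """X: list of feature vectors, y: labels in {-1, +1}. Returns dual weights alpha."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for t in range(len(X)):
            score = sum(alpha[s] * y[s] * kernel(X[s], X[t]) for s in range(len(X)))
            if y[t] * score <= 0:          # mistake: remember this example once more
                alpha[t] += 1.0
    return alpha

def predict(alpha, X, y, kernel, x_new):
    score = sum(alpha[s] * y[s] * kernel(X[s], x_new) for s in range(len(X)))
    return 1 if score >= 0 else -1
```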

40 Non-Linear Classifiers: Kernels = Tractable Non-Linearity
A linear classifier in a higher-dimensional feature space is a non-linear classifier in the original space
Computing a non-linear kernel is often computationally cheaper than calculating the corresponding dot product in the high-dimensional feature space
Thus, kernels allow us to learn non-linear classifiers efficiently

41 Non-Linear Classifiers: Linear Classifiers in High Dimension

42 Non-Linear Classifiers: Example — Polynomial Kernel
f(x) ∈ R^M, d ≥ 2: φ(x_t, x_s) = (f(x_t) · f(x_s) + 1)^d
O(M) to calculate for any d!!
But in the original feature space (primal space): consider d = 2, M = 2, and f(x_t) = [x_{t,1}, x_{t,2}]
(f(x_t) · f(x_s) + 1)² = ([x_{t,1}, x_{t,2}] · [x_{s,1}, x_{s,2}] + 1)²
= (x_{t,1} x_{s,1} + x_{t,2} x_{s,2} + 1)²
= (x_{t,1} x_{s,1})² + (x_{t,2} x_{s,2})² + 2(x_{t,1} x_{s,1}) + 2(x_{t,2} x_{s,2}) + 2(x_{t,1} x_{t,2} x_{s,1} x_{s,2}) + 1
which equals:
[(x_{t,1})², (x_{t,2})², √2 x_{t,1}, √2 x_{t,2}, √2 x_{t,1} x_{t,2}, 1] · [(x_{s,1})², (x_{s,2})², √2 x_{s,1}, √2 x_{s,2}, √2 x_{s,1} x_{s,2}, 1]
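
A small numerical check (not from the slides) that the kernel value and the dot product of the explicit feature map above agree.

```python
# Minimal sketch (assumed example): the degree-2 polynomial kernel equals the
# dot product of the explicit 6-dimensional feature map derived above.
import numpy as np

def phi_explicit(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2, 1.0])

xt = np.array([0.5, -1.0])
xs = np.array([2.0, 3.0])

kernel_value = (np.dot(xt, xs) + 1.0) ** 2                     # O(M) kernel computation
explicit_value = np.dot(phi_explicit(xt), phi_explicit(xs))    # dot product in the expanded space
print(kernel_value, explicit_value)                            # the two values agree
```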

43 Non-Linear Classifiers: Popular Kernels
Polynomial kernel: φ(x_t, x_s) = (f(x_t) · f(x_s) + 1)^d
Gaussian radial basis kernel (infinite feature space representation!): φ(x_t, x_s) = exp(−‖f(x_t) − f(x_s)‖² / 2σ)
String kernels [Lodhi et al. 2002, Collins and Duffy 2002]
Tree kernels [Collins and Duffy 2002]

44 Structured Learning
Sometimes our output space Y is not simply a set of categories
Examples:
Parsing: for a sentence x, Y is the set of possible parse trees
Sequence tagging: for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags, named-entity tags
Machine translation: for a source sentence x, Y is the set of possible target language sentences
Can't we just use our multiclass learning algorithms?
In all these cases, the size of the set Y is exponential in the length of the input x
It is often non-trivial to run our learning algorithms in such cases

45 Structured Learning: Perceptron
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     Let y′ = argmax_{y′} w^(i) · f(x_t, y′) (**)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)
(**) Solving the argmax requires a search over an exponential space of outputs!

46 Structured Learning: Large-Margin Classifiers
Batch (SVMs): min ½‖w‖² such that w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1 for all (x_t, y_t) ∈ T and y′ ∈ Ȳ_t (**)
Online (MIRA): Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     w^(i+1) = argmin_{w*} ‖w* − w^(i)‖ such that w* · f(x_t, y_t) − w* · f(x_t, y′) ≥ 1 for all y′ ∈ Ȳ_t (**)
5.     i = i + 1
6. return w^(i)
(**) There are exponentially many constraints in the size of each input!!

47 Structured Learning: Factor the Feature Representations
We can make the assumption that our feature representations factor relative to the output
Example, context-free parsing: f(x, y) = Σ_{A→BC ∈ y} f(x, A→BC)
Example, sequence analysis with Markov assumptions: f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)
These kinds of factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax function

48 Structured Learning: Structured Perceptron
Exactly like the original perceptron
Except now the argmax function uses a factored feature representation
All of the original analysis for the multiclass perceptron carries over!!
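
A hedged sketch (not from the slides) of a structured perceptron for sequence tagging: the features factor over adjacent tag pairs, so the argmax can be computed with Viterbi. All function and feature names here are illustrative assumptions.

```python
# Hedged sketch (assumed example): structured perceptron with Viterbi decoding.
# Features factor as f(x, y) = sum_i f(x, y_{i-1}, y_i).
from collections import defaultdict

def local_feats(x, i, prev_tag, tag):
    """Hypothetical factored features for position i: emission + transition."""
    return {("word-tag", x[i], tag): 1.0, ("tag-tag", prev_tag, tag): 1.0}

def score(w, feats):
    return sum(w[k] * v for k, v in feats.items())

def viterbi(w, x, tags):
    """argmax_y sum_i w . f(x, y_{i-1}, y_i) by dynamic programming."""
    n = len(x)
    delta = [{t: score(w, local_feats(x, 0, "<start>", t)) for t in tags}]
    back = [{t: None for t in tags}]
    for i in range(1, n):
        delta.append({})
        back.append({})
        for t in tags:
            best_prev, best_score = None, -float("inf")
            for p in tags:
                s = delta[i - 1][p] + score(w, local_feats(x, i, p, t))
                if s > best_score:
                    best_prev, best_score = p, s
            delta[i][t], back[i][t] = best_score, best_prev
    y = [max(tags, key=lambda t: delta[n - 1][t])]     # best final tag
    for i in range(n - 1, 0, -1):                      # follow back-pointers
        y.append(back[i][y[-1]])
    return list(reversed(y))

def global_feats(x, y):
    feats, prev = defaultdict(float), "<start>"
    for i, t in enumerate(y):
        for k, v in local_feats(x, i, prev, t).items():
            feats[k] += v
        prev = t
    return feats

def structured_perceptron(train, tags, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in train:
            y_hat = viterbi(w, x, tags)
            if y_hat != y:                             # additive update on factored features
                for k, v in global_feats(x, y).items():
                    w[k] += v
                for k, v in global_feats(x, y_hat).items():
                    w[k] -= v
    return w

train = [(["the", "dog", "barks"], ["DT", "NN", "VB"]),
         (["the", "cat", "sleeps"], ["DT", "NN", "VB"])]
w = structured_perceptron(train, tags=["DT", "NN", "VB"])
print(viterbi(w, ["the", "dog", "sleeps"], ["DT", "NN", "VB"]))  # expected ['DT', 'NN', 'VB']
```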

49 Structured Learning: Structured SVMs
min ½‖w‖² such that w · f(x_t, y_t) − w · f(x_t, y′) ≥ L(y_t, y′) for all (x_t, y_t) ∈ T and y′ ∈ Ȳ_t (**)
(**) Still an exponential number of constraints
Feature factorizations also allow for solutions:
Maximum Margin Markov Networks (Taskar et al. 03)
Structured SVMs (Tsochantaridis et al. 04)
Note: the old fixed margin of 1 is now a loss L(y_t, y′) between two structured outputs
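
As an illustration (not from the slides) of the loss-scaled margin, here is a small sketch of the structured hinge term with Hamming loss standing in for L(y_t, y′); the scores are assumed to be the model scores w · f(x_t, y_t) and w · f(x_t, y′).

```python
# Hedged sketch (assumed example): margin scaled by a structured loss.
def hamming_loss(y_gold, y_pred):
    return sum(1 for a, b in zip(y_gold, y_pred) if a != b)

def structured_hinge(score_gold, score_pred, y_gold, y_pred):
    # violated if the score margin is smaller than the structural loss L(y, y')
    return max(0.0, hamming_loss(y_gold, y_pred) - (score_gold - score_pred))

print(structured_hinge(4.0, 3.5, ["DT", "NN", "VB"], ["DT", "VB", "VB"]))  # margin 0.5 < loss 1 -> 0.5
```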

50 Structured Learning: Online Structured SVMs (or Online MIRA)
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     w^(i+1) = argmin_{w*} ‖w* − w^(i)‖ such that w* · f(x_t, y_t) − w* · f(x_t, y′) ≥ L(y_t, y′) for all y′ ∈ Ȳ_t with y′ ∈ k-best(x_t, w^(i)) (**)
5.     i = i + 1
6. return w^(i)
k-best(x_t, w^(i)) is the set of k outputs with the highest scores using weight vector w^(i)
(**) Simple solution: only consider the outputs y′ ∈ Ȳ_t that currently have the highest scores

51 Wrap Up: Main Points of Lecture
Feature representations
Choose feature weights w to maximize/minimize some function (min error, max margin)
Batch learning (SVMs) versus online learning (perceptron, MIRA)
Linear versus non-linear classifiers
Structured learning

52 References and Further Reading
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1).
P. M. Camerini, L. Fratta, and F. Maffioli. The k best spanning arborescences of a network. Networks, 10(2).
Y. J. Chu and T. H. Liu. On the shortest arborescence of a directed graph. Science Sinica, 14.
M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proc. ACL.
M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel based vector machines. JMLR.
K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. JMLR.
K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. In Proc. NIPS.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR.
J. Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B.
J. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.
Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3).
T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer.
D. Klein and C. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. ACL.
T. Koo, A. Globerson, X. Carreras, and M. Collins. Structured prediction models via the matrix-tree theorem. In Proc. EMNLP.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.
H. Lodhi, C. Saunders, J. Shawe-Taylor, and N. Cristianini. Classification with string kernels. Journal of Machine Learning Research.
A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML.
R. McDonald and F. Pereira. Online learning of approximate dependency parsing algorithms. In Proc. EACL.
R. McDonald and G. Satta. On the complexity of non-projective data-driven dependency parsing. In Proc. IWPT.
R. McDonald, K. Crammer, and F. Pereira. Online large-margin training of dependency parsers. In Proc. ACL.
K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Neural Networks, 12(2).
M. A. Paskin. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report, Computer Science Division, University of California Berkeley.
K. Sagae and A. Lavie. Parser combination by reparsing. In Proc. HLT/NAACL.
F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proc. HLT/NAACL.
N. Smith and J. Eisner. Guiding unsupervised grammar induction using contrastive estimation. In Working Notes of the IJCAI Workshop on Grammatical Inference Applications.
D. A. Smith and N. A. Smith. Probabilistic models of nonprojective dependency trees. In Proc. EMNLP.
C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning. MIT Press.
R. E. Tarjan. Finding optimum branchings. Networks, 7.
B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Proc. NIPS.
B. Taskar. Learning Structured Prediction Models: A Large Margin Approach. Ph.D. thesis, Stanford.
I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector learning for interdependent and structured output spaces. In Proc. ICML.
W. T. Tutte. Graph Theory. Cambridge University Press.
D. Yuret. Discovery of linguistic relations using lexical attraction. Ph.D. thesis, MIT.
