Introduction to Data-Driven Dependency Parsing


Introductory Course, ESSLLI 2007

Ryan McDonald, Google Inc., New York, USA. E-mail: ryanmcd@google.com
Joakim Nivre, Uppsala University and Växjö University, Sweden. E-mail: nivre@msi.vxu.se

From Last Lecture: Formal Conditions on Dependency Graphs

For a dependency graph G = (V, A) with label set L = {l_1, ..., l_|L|}:
- G is (weakly) connected: if i, j ∈ V, then i ↔* j.
- G is acyclic: if i → j, then not j →* i.
- G obeys the single-head constraint: if i → j, then not i' → j, for any i' ≠ i.
- G is projective: if i → j, then i →* i', for any i' such that i < i' < j or j < i' < i.

From Last Lecture: Dependency Graphs as Trees

Consider a dependency graph G = (V, A) satisfying:
- G is (weakly) connected: if i, j ∈ V, then i ↔* j.
- G obeys the single-head constraint: if i → j, then not i' → j, for any i' ≠ i.
- G obeys the single-root constraint: if there is no i such that i → j, then there is no i' such that i' → j', for any j' ≠ j (w_0 = root is always this node).

Such a dependency graph is by definition a tree. For the rest of the course we assume that all dependency graphs are trees.

From Last Lecture: Dependency Graphs as Trees

Example (satisfies: connected, single-head): a labeled dependency graph for "Economic news had little effect on financial markets." with arcs labeled sbj, obj, pc, and nmod.

From Last Lecture: Dependency Graphs as Trees

Example (satisfies: connected, single-head, single-root): the same sentence with an artificial root node w_0 = root, adding pred and p arcs from the root: "root Economic news had little effect on financial markets."

Introduction: Overview of the Course

- Dependency parsing (Joakim)
- Machine learning methods (Ryan)
- Transition-based models (Joakim)
- Graph-based models (Ryan)
- Loose ends (Joakim, Ryan): other approaches, empirical results, available software

Introduction: Data-Driven Parsing

Data-driven ≈ machine learning:
- Parameterize a model
- Supervised: learn parameters from annotated data
- Unsupervised: induce parameters from large unannotated corpora

Data-driven vs. grammar-driven: a data-driven parser can parse all sentences, whereas a grammar generates a specific language (data-driven ≈ a grammar of Σ*).

Introduction: Lecture 2 Outline

- Feature representations
- Linear classifiers: perceptron; large-margin classifiers (SVMs, MIRA); others
- Non-linear classifiers: k-NNs and memory-based learning; kernels
- Structured learning: structured perceptron; large-margin methods; others

Introduction: Important Message

This lecture contains a lot of details. It is not important that you follow all the proofs and maths. What is important:
- Understand the basic representation of data as features
- Understand the basic goal and structure of classifiers
- Understand important distinctions: linear vs. non-linear, binary vs. multiclass, multiclass vs. structured, etc.

If you are interested in ML for NLP, check out the afternoon course "Machine learning methods for NLP".

Feature Representations

Input: x ∈ X, e.g., a document or a sentence with some words x = w_1 ... w_n, or a series of previous parsing actions.
Output: y ∈ Y, e.g., a dependency tree, a document class, a part-of-speech tag sequence, or the next parsing action.

We assume a mapping from x to a high-dimensional feature vector, f(x) : X → R^m. Sometimes it is easier to think of a mapping from an input/output pair to a feature vector, f(x, y) : X × Y → R^m. For any vector v ∈ R^m, let v_j be its j-th value.

Feature Representations: Examples

x is a document:
  f_j(x) = 1 if x contains the word "interest", 0 otherwise
  f_j(x) = the percentage of words in x that contain punctuation

x is a word and y is a part-of-speech tag:
  f_j(x, y) = 1 if x = "bank" and y = Verb, 0 otherwise

Feature Representations: Example 2

  f_0(x) = 1 if x contains the word "John", 0 otherwise
  f_1(x) = 1 if x contains the word "Mary", 0 otherwise
  f_2(x) = 1 if x contains the word "Harry", 0 otherwise
  f_3(x) = 1 if x contains the word "likes", 0 otherwise

  x = "John likes Mary"   →  f(x) = [1 1 0 1]
  x = "Mary likes John"   →  f(x) = [1 1 0 1]
  x = "Harry likes Mary"  →  f(x) = [0 1 1 1]
  x = "Harry likes Harry" →  f(x) = [0 0 1 1]
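To make this concrete, here is a minimal Python sketch of the indicator features above; the vocabulary list and the function name are illustrative, not part of the lecture.

```python
# Minimal sketch of the indicator features above; the vocabulary order
# ["John", "Mary", "Harry", "likes"] mirrors f_0..f_3 on the slide.
VOCAB = ["John", "Mary", "Harry", "likes"]

def feature_vector(x):
    """Map a sentence x (a string) to the binary feature vector f(x)."""
    words = set(x.split())
    return [1 if w in words else 0 for w in VOCAB]

print(feature_vector("John likes Mary"))    # [1, 1, 0, 1]
print(feature_vector("Harry likes Harry"))  # [0, 0, 1, 1]
```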

Linear Classifiers

A linear classifier bases the score (or probability) of a particular classification on a linear combination of features and their weights. Let w ∈ R^m be a high-dimensional weight vector. If we assume that w is known, we can define two kinds of linear classifiers.

Reminder: the dot product is u · v = Σ_j u_j v_j ∈ R.

Binary classification, Y = {-1, 1}:
  y = sign(w · f(x))

Multiclass classification, Y = {0, 1, ..., N}:
  y = argmax_y w · f(x, y)
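A minimal sketch of the two prediction rules, assuming the feature maps f(x) and f(x, y) are given as plain Python functions returning lists; the names are illustrative stand-ins.

```python
# Both linear prediction rules with plain Python lists; f and f_xy are
# assumed feature functions returning fixed-length vectors.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def predict_binary(w, f, x):
    """y = sign(w . f(x)), with Y = {-1, +1}."""
    return 1 if dot(w, f(x)) >= 0 else -1

def predict_multiclass(w, f_xy, x, labels):
    """y = argmax_y w . f(x, y) over a finite label set."""
    return max(labels, key=lambda y: dot(w, f_xy(x, y)))
```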

Binary Linear Classifier

A binary linear classifier divides all points into two halves: y = sign(w · f(x)).

Multiclass Linear Classifier

A multiclass linear classifier defines regions of the space: y = argmax_y w · f(x, y).

Separability

A set of points is separable if there exists a w such that classification is perfect. (Figure: a separable data set vs. a non-separable one.)

Supervised Learning

Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|} and a feature representation f.
Output: a w that maximizes/minimizes some important function on the training set:
- minimize error (Perceptron, SVMs, Boosting)
- maximize likelihood of the data (Logistic Regression, CRFs)

Assumption: the training data is separable. This is not necessary, it just makes life easier; there is a lot of good work in machine learning on the non-separable case.

Perceptron

Goal: minimize error.

Binary classification, Y = {-1, 1}:
  w = argmin_w Σ_t (1 − 1[y_t = sign(w · f(x_t))])

Multiclass classification, Y = {0, 1, ..., N}:
  w = argmin_w Σ_t (1 − 1[y_t = argmax_y w · f(x_t, y)])

where 1[p] = 1 if p is true and 0 otherwise.

Perceptron Learning Algorithm (multiclass)

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = argmax_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)
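Below is a runnable sketch of this multiclass perceptron, assuming a joint feature function f_xy(x, y) that returns a sparse dict of feature values; the data layout and names are assumptions for illustration.

```python
# A sketch of the multiclass perceptron above, using dicts of sparse
# feature values; f_xy(x, y) and the label set are assumed inputs.
from collections import defaultdict

def sparse_dot(w, feats):
    return sum(w[k] * v for k, v in feats.items())

def perceptron_train(data, labels, f_xy, epochs=10):
    """data: list of (x, y) pairs; f_xy(x, y): dict of feature -> value."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: sparse_dot(w, f_xy(x, y)))
            if y_pred != y_gold:
                # w += f(x, y_gold) - f(x, y_pred)
                for k, v in f_xy(x, y_gold).items():
                    w[k] += v
                for k, v in f_xy(x, y_pred).items():
                    w[k] -= v
    return w
```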

Perceptron Learning Algorithm (multiclass): Separability

Given a training instance (x_t, y_t), define Ȳ_t = Y − {y_t}.

A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that

  u · f(x_t, y_t) − u · f(x_t, y') ≥ γ   for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t,

where ||u|| = sqrt(Σ_j u_j²).

Assumption: the training set is separable with margin γ.

Perceptron Learning Algorithm (multiclass): Mistake Bound

Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:

  number of training errors ≤ R² / γ²

where R ≥ ||f(x_t, y_t) − f(x_t, y')|| for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t. Thus, after a finite number of training iterations, the error on the training set converges to zero. Let's prove it! (Proof taken from Collins 02.)

Perceptron Learning Algorithm (multiclass): Proof (Part 1)

Let w^(k−1) be the weights before the k-th mistake, and suppose the k-th mistake is made at the t-th example (x_t, y_t), with y' = argmax_y w^(k−1) · f(x_t, y) and y' ≠ y_t, so that

  w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y').

Now:
  u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y')) ≥ u · w^(k−1) + γ.

Since w^(0) = 0 and u · w^(0) = 0, by induction on k, u · w^(k) ≥ kγ. And since u · w^(k) ≤ ||u|| ||w^(k)|| with ||u|| = 1, we get ||w^(k)|| ≥ kγ.

Now:
  ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y')||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y'))
            ≤ ||w^(k−1)||² + R²

(since R ≥ ||f(x_t, y_t) − f(x_t, y')|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y') ≤ 0, because y' was the argmax).

Perceptron Learning Algorithm (multiclass): Proof (Part 2)

We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R². By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0,

  ||w^(k)||² ≤ kR².

Therefore
  k²γ² ≤ ||w^(k)||² ≤ kR²,

and solving for k,
  k ≤ R² / γ².

Therefore the number of errors is bounded!

Margin

(Figure: training and testing data with a separating hyperplane.) Denote the value of the margin by γ.

Margin

Intuitively, maximizing the margin makes sense. More importantly, the generalization error on unseen test data is proportional to the inverse of the margin:

  ε ∝ R² / (γ² |T|)

Perceptron: we have shown that if a training set is separable by some margin, the perceptron will find a w that separates the data. However, it does not pick the w that maximizes the margin!

Max Margin = Min Norm

Let γ > 0. The two problems are equivalent:

Max margin:
  max_{||w|| ≤ 1} γ
  such that: y_t (w · f(x_t)) ≥ γ for all (x_t, y_t) ∈ T

Min norm:
  min ½ ||w||²
  such that: y_t (w · f(x_t)) ≥ 1 for all (x_t, y_t) ∈ T

||w|| must be bounded in the max-margin formulation, since scaling w trivially produces a larger margin: y_t ([βw] · f(x_t)) ≥ βγ for any β ≥ 1. In the min-norm formulation we instead fix the margin at γ = 1 and minimize ||w||.

Support Vector Machines

Binary:
  min ½ ||w||²
  such that: y_t (w · f(x_t)) ≥ 1 for all (x_t, y_t) ∈ T

Multiclass:
  min ½ ||w||²
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1 for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t

Both are quadratic programming problems, a well-known class of convex optimization problems, and can be solved with out-of-the-box algorithms.
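As a rough illustration of the binary QP (not part of the lecture), the hard-margin problem can be handed to an off-the-shelf convex solver; the toy data and the choice of cvxpy are assumptions.

```python
# A minimal sketch of the binary hard-margin QP with cvxpy; the toy data
# points (rows of X = f(x_t)) and labels are made up for illustration.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])  # f(x_t)
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}

w = cp.Variable(X.shape[1])
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w) >= 1]   # y_t (w . f(x_t)) >= 1
cp.Problem(objective, constraints).solve()

print("learned w:", w.value)
```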

Support Vector Machines

Binary:
  min ½ ||w||²  such that: y_t (w · f(x_t)) ≥ 1 for all (x_t, y_t) ∈ T

Problem: sometimes |T| is far too large, and the number of constraints can make the quadratic programming problem very difficult to solve. The most common technique is Sequential Minimal Optimization (SMO). The solution is sparse: it depends only on the support vectors.

Margin Infused Relaxed Algorithm (MIRA)

Another option is to maximize the margin using an online algorithm.

Batch vs. online:
- Batch: update parameters based on the entire training set (e.g., SVMs)
- Online: update parameters based on a single training instance at a time (e.g., Perceptron)

MIRA can be thought of as a max-margin perceptron or an online SVM.

MIRA (multiclass)

Batch (SVMs):
  min ½ ||w||²
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1 for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t

Online (MIRA):
  Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
             such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ 1 for all y' ∈ Ȳ_t
  5.     i = i + 1
  6. return w^(i)

MIRA solves much smaller optimizations, with only |Ȳ_t| constraints per instance. The cost is that the overall optimization is sub-optimal.
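The following sketch shows a MIRA-style online update. For simplicity it enforces only the single constraint for the current highest-scoring wrong label, which admits a closed-form step size; the slide's formulation keeps all |Ȳ_t| constraints and would need a small QP per instance. The feature function and data layout are assumptions.

```python
# MIRA-style online update with a single-best-constraint simplification.
from collections import defaultdict

def sparse_dot(w, feats):
    return sum(w[k] * v for k, v in feats.items())

def mira_train(data, labels, f_xy, epochs=10):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            wrong = [y for y in labels if y != y_gold]
            y_hat = max(wrong, key=lambda y: sparse_dot(w, f_xy(x, y)))
            delta = dict(f_xy(x, y_gold))            # f(x, y_t) - f(x, y_hat)
            for k, v in f_xy(x, y_hat).items():
                delta[k] = delta.get(k, 0.0) - v
            margin = sparse_dot(w, delta)            # w.f(x,y_t) - w.f(x,y_hat)
            norm_sq = sum(v * v for v in delta.values())
            if margin < 1.0 and norm_sq > 0:
                tau = (1.0 - margin) / norm_sq       # smallest step enforcing margin 1
                for k, v in delta.items():
                    w[k] += tau * v
    return w
```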

Summary

What we have covered: feature-based representations; linear classifiers; the perceptron; large-margin methods, SVMs (batch) and MIRA (online).
What is next: non-linear classifiers.

Non-Linear Classifiers

Some data sets require more than a linear classifier to be modeled correctly. There are many models out there: k-nearest neighbours, decision trees, kernels, neural networks. Due to time constraints we will only discuss a couple of them.

K-Nearest Neighbours

Simplest form: for a given test point x, find the k nearest neighbours in the training set and let the neighbours vote on the classification. Distance is Euclidean distance:

  d(x_t, x_r) = sqrt( Σ_j (f_j(x_t) − f_j(x_r))² )

(Figure: a data set that no linear classifier can label correctly, but that 3-nearest neighbours does.)
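A minimal sketch of k-NN with Euclidean distance and majority voting; the (vector, label) data layout is an assumption.

```python
# k-NN classification: Euclidean distance plus majority vote.
from collections import Counter
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(train, x, k=3):
    """train: list of (feature_vector, label) pairs; x: feature vector."""
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```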

K-Nearest Neighbours

A training set T, a distance function d, and a value k together define a non-linear classification boundary. (Figure: the resulting decision regions.)

K-Nearest Neighbours

k-NN is often called a lazy learning algorithm or memory-based learning (MBL). It is generalized in the Tilburg Memory Based Learning package, which offers different distance functions, different voting schemes for classification, tie-breaking, and different memory representations.

Kernels

A kernel is a similarity function between two points that is symmetric and positive semi-definite; we denote it by φ(x_t, x_r) ∈ R.

Mercer's Theorem: for any kernel φ, there exists a feature mapping f such that

  φ(x_t, x_r) = f(x_t) · f(x_r).

Kernel Trick: Perceptron Algorithm

Recall the perceptron algorithm above: whenever the model predicts y ≠ y_s for example s, the update adds f(x_s, y_s) to w and subtracts f(x_s, y). Let α_{y,s} be the number of times label y is predicted for example s during learning. Thus

  w = Σ_{s,y} α_{y,s} [f(x_s, y_s) − f(x_s, y)].

Kernel Trick: Perceptron Algorithm

We can rewrite the argmax function as (writing s for the index over training examples):

  y* = argmax_{y'} w^(i) · f(x_t, y')
     = argmax_{y'} Σ_{s,y} α_{y,s} [f(x_s, y_s) − f(x_s, y)] · f(x_t, y')
     = argmax_{y'} Σ_{s,y} α_{y,s} [f(x_s, y_s) · f(x_t, y') − f(x_s, y) · f(x_t, y')]
     = argmax_{y'} Σ_{s,y} α_{y,s} [φ((x_s, y_s), (x_t, y')) − φ((x_s, y), (x_t, y'))]

We can then rewrite the perceptron algorithm strictly in terms of kernels.

Kernel Trick: Perceptron Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. for all y, t: set α_{y,t} = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = argmax_{y'} Σ_{s,y} α_{y,s} [φ((x_s, y_s), (x_t, y')) − φ((x_s, y), (x_t, y'))]
5.     if y' ≠ y_t
6.       α_{y',t} = α_{y',t} + 1

Given a new instance x:
  y* = argmax_{y'} Σ_{s,y} α_{y,s} [φ((x_s, y_s), (x, y')) − φ((x_s, y), (x, y'))]

But it seems like we have just complicated things!
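Here is a sketch of the kernelized perceptron above. The joint kernel φ((x, y), (x', y')) is an assumption; for illustration it is built from an input kernel and a label match, which is one standard construction, not necessarily the one intended in the lecture.

```python
# Kernelized multiclass perceptron: weights are represented implicitly by
# the mistake counts alpha[(y, s)], and scoring uses only kernel values.
def make_joint_kernel(input_kernel):
    """phi((x, y), (x2, y2)) = input_kernel(x, x2) if y == y2 else 0."""
    return lambda a, b: input_kernel(a[0], b[0]) if a[1] == b[1] else 0.0

def kernel_perceptron_train(data, labels, phi, epochs=10):
    """data: list of (x, y) pairs; phi: joint kernel over (x, y) pairs."""
    alpha = {}                                   # alpha[(y, s)] = mistake count

    def score(x, y_prime):
        return sum(c * (phi((data[s][0], data[s][1]), (x, y_prime))
                        - phi((data[s][0], y), (x, y_prime)))
                   for (y, s), c in alpha.items())

    for _ in range(epochs):
        for t, (x_t, y_t) in enumerate(data):
            y_pred = max(labels, key=lambda y: score(x_t, y))
            if y_pred != y_t:
                alpha[(y_pred, t)] = alpha.get((y_pred, t), 0) + 1
    return alpha, score
```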

Kernels = Tractable Non-Linearity

A linear classifier in a higher-dimensional feature space is a non-linear classifier in the original space. Computing a non-linear kernel is often computationally cheaper than calculating the corresponding dot product in the high-dimensional feature space. Thus, kernels allow us to learn non-linear classifiers efficiently.

Linear Classifiers in High Dimension

(Figure.)

Example: Polynomial Kernel

For f(x) ∈ R^M and d ≥ 2:

  φ(x_t, x_s) = (f(x_t) · f(x_s) + 1)^d

This is O(M) to calculate for any d! But what does it correspond to in the original feature space (the primal space)? Consider d = 2, M = 2, and f(x_t) = [x_{t,1}, x_{t,2}]:

  (f(x_t) · f(x_s) + 1)² = ([x_{t,1}, x_{t,2}] · [x_{s,1}, x_{s,2}] + 1)²
    = (x_{t,1} x_{s,1} + x_{t,2} x_{s,2} + 1)²
    = (x_{t,1} x_{s,1})² + (x_{t,2} x_{s,2})² + 2(x_{t,1} x_{s,1}) + 2(x_{t,2} x_{s,2}) + 2(x_{t,1} x_{t,2} x_{s,1} x_{s,2}) + 1

which equals:

  [(x_{t,1})², (x_{t,2})², √2 x_{t,1}, √2 x_{t,2}, √2 x_{t,1} x_{t,2}, 1] · [(x_{s,1})², (x_{s,2})², √2 x_{s,1}, √2 x_{s,2}, √2 x_{s,1} x_{s,2}, 1]
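A quick numeric check (not from the slides) that the d = 2 kernel value matches the dot product in the expanded six-dimensional space; the test vectors are arbitrary.

```python
# Verify (f(xt).f(xs) + 1)^2 equals the dot product of the explicit d=2 map.
import math

def poly_kernel(u, v, d=2):
    return (sum(a * b for a, b in zip(u, v)) + 1) ** d

def expand(x):
    """Explicit feature map for d = 2, M = 2."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [x1 * x1, x2 * x2, r2 * x1, r2 * x2, r2 * x1 * x2, 1.0]

xt, xs = [3.0, -1.0], [0.5, 2.0]
lhs = poly_kernel(xt, xs)
rhs = sum(a * b for a, b in zip(expand(xt), expand(xs)))
print(lhs, rhs)   # both evaluate to the same value
```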

Popular Kernels

Polynomial kernel:
  φ(x_t, x_s) = (f(x_t) · f(x_s) + 1)^d

Gaussian radial basis kernel (an infinite-dimensional feature space representation!):
  φ(x_t, x_s) = exp(−||f(x_t) − f(x_s)||² / 2σ²)

String kernels [Lodhi et al. 2002, Collins and Duffy 2002]
Tree kernels [Collins and Duffy 2002]
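For completeness, a sketch of the Gaussian RBF kernel as a plain function over explicit feature vectors; the 1/(2σ²) scaling follows the standard parameterization assumed in the reconstruction above.

```python
# Gaussian radial basis kernel over explicit feature vectors f(x).
import math

def gaussian_rbf(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```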

Structured Learning

Sometimes our output space Y is not simply a set of categories. Examples:
- Parsing: for a sentence x, Y is the set of possible parse trees
- Sequence tagging: for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags or named-entity tags
- Machine translation: for a source sentence x, Y is the set of possible target-language sentences

Can't we just use our multiclass learning algorithms? In all these cases the size of Y is exponential in the length of the input x, and it is often non-trivial to run our learning algorithms in such settings.

Perceptron

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)

(**) Solving the argmax requires a search over an exponential space of outputs!

Large-Margin Classifiers

Batch (SVMs):
  min ½ ||w||²
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1 for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t   (**)

Online (MIRA):
  Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
             such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ 1 for all y' ∈ Ȳ_t   (**)
  5.     i = i + 1
  6. return w^(i)

(**) There are exponentially many constraints in the size of each input!

Factoring the Feature Representations

We can assume that our feature representations factor relative to the output. Examples:

Context-free parsing:
  f(x, y) = Σ_{A→BC ∈ y} f(x, A→BC)

Sequence analysis (Markov assumption):
  f(x, y) = Σ_{i=1}^{|y|} f(x, y_{i−1}, y_i)

These kinds of factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax function, as in the sketch below.
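To make the factored argmax concrete, here is a Viterbi sketch for the sequence-tagging case; the local feature function f_local(x, i, y_prev, y), the START symbol, and the dict-based weight vector are illustrative assumptions, not part of the lecture.

```python
# Viterbi computation of argmax_y w . f(x, y) under the first-order Markov
# factorization f(x, y) = sum_i f(x, y_{i-1}, y_i).
START = "<s>"

def viterbi_argmax(x, labels, w, f_local):
    """Return the tag sequence maximizing sum_i w . f_local(x, i, y[i-1], y[i])."""
    def local_score(i, y_prev, y):
        return sum(w.get(k, 0.0) * v for k, v in f_local(x, i, y_prev, y).items())

    n = len(x)
    best = {START: 0.0}          # best score of a prefix ending in this tag
    back = []                    # backpointers per position
    for i in range(n):
        new_best, pointers = {}, {}
        for y in labels:
            prev, score = max(((p, s + local_score(i, p, y)) for p, s in best.items()),
                              key=lambda pair: pair[1])
            new_best[y], pointers[y] = score, prev
        best, back = new_best, back + [pointers]

    # Recover the best final tag and follow the backpointers.
    y_last = max(best, key=best.get)
    tags = [y_last]
    for pointers in reversed(back[1:]):
        tags.append(pointers[tags[-1]])
    tags.reverse()
    return tags
```

Plugging this argmax into the perceptron training loop gives the structured perceptron described on the next slide.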

Structured Perceptron

Exactly like the original perceptron, except that the argmax function now uses a factored feature representation. All of the original analysis for the multiclass perceptron carries over!

Structured SVMs

  min ½ ||w||²
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ L(y_t, y') for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t   (**)

(**) There is still an exponential number of constraints. Feature factorizations again allow for solutions:
- Maximum Margin Markov Networks (Taskar et al. 03)
- Structured SVMs (Tsochantaridis et al. 04)

Note: the old fixed margin of 1 is now a loss L(y_t, y') between two structured outputs.

Online Structured SVMs (or Online MIRA)

Simple solution: only consider the outputs y' ∈ Ȳ_t that currently have the highest scores.

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
           such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ L(y_t, y')
           for all y' ∈ Ȳ_t such that y' ∈ k-best(x_t, w^(i))   (**)
5.     i = i + 1
6. return w^(i)

(**) k-best(x_t, w^(i)) is the set of k outputs with the highest scores under the weight vector w^(i).

Wrap Up: Main Points of the Lecture

- Feature representations
- Choose feature weights w to optimize some function (minimize error, maximize margin)
- Batch learning (SVMs) versus online learning (perceptron, MIRA)
- Linear versus non-linear classifiers
- Structured learning

References and Further Reading

A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1).

P. M. Camerini, L. Fratta, and F. Maffioli. 1980. The k best spanning arborescences of a network. Networks, 10(2):91–110.

Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

M. Collins and N. Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proc. ACL.

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP.

K. Crammer and Y. Singer. 2001. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR.

K. Crammer and Y. Singer. 2003. Ultraconservative online algorithms for multiclass problems. JMLR.

K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. 2003. Online passive-aggressive algorithms. In Proc. NIPS.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. 2006. Online passive-aggressive algorithms. JMLR.

J. Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

J. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.

Y. Freund and R. E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.

T. Joachims. 2002. Learning to Classify Text using Support Vector Machines. Kluwer.

D. Klein and C. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. ACL.

T. Koo, A. Globerson, X. Carreras, and M. Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proc. EMNLP.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

H. Lodhi, C. Saunders, J. Shawe-Taylor, and N. Cristianini. 2002. Classification with string kernels. Journal of Machine Learning Research.

A. McCallum, D. Freitag, and F. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. EACL.

R. McDonald and G. Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proc. IWPT.

R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proc. ACL.

K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. 2001. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.

M. A. Paskin. 2001. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report UCB/CSD-01-1148, Computer Science Division, University of California Berkeley.

K. Sagae and A. Lavie. 2006. Parser combination by reparsing. In Proc. HLT/NAACL.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proc. HLT/NAACL, pages 213–220.

N. Smith and J. Eisner. 2005. Guiding unsupervised grammar induction using contrastive estimation. In Working Notes of the International Joint Conference on Artificial Intelligence Workshop on Grammatical Inference Applications.

D. A. Smith and N. A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proc. EMNLP.

C. Sutton and A. McCallum. 2006. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning. MIT Press.

R. E. Tarjan. 1977. Finding optimum branchings. Networks, 7:25–35.

B. Taskar, C. Guestrin, and D. Koller. 2003. Max-margin Markov networks. In Proc. NIPS.

B. Taskar. 2004. Learning Structured Prediction Models: A Large Margin Approach. Ph.D. thesis, Stanford University.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proc. ICML.

W. T. Tutte. 1984. Graph Theory. Cambridge University Press.

D. Yuret. 1998. Discovery of linguistic relations using lexical attraction. Ph.D. thesis, MIT.