Machine Learning for Structured Prediction


Grzegorz Chrupała, National Centre for Language Technology, School of Computing, Dublin City University. NCLT Seminar, 2006.

Structured vs. non-structured prediction

In supervised learning we try to learn a function h : X → Y, where x ∈ X are inputs and y ∈ Y are outputs.
- Binary classification: Y = {−1, +1}
- Multiclass classification: Y = {1, ..., K} (a finite set of labels)
- Regression: Y = ℝ
In contrast, in structured prediction the elements of Y are complex objects.
The prediction is based on a feature function Φ : X → F, where usually F = ℝ^D (a D-dimensional vector space).

Structured prediction tasks

- Sequence labeling: e.g. POS tagging.
- Parsing: given an input sequence, build a tree whose yield (leaves) are the elements of the sequence and whose structure obeys some grammar.
- Collective classification: e.g. relational learning problems, such as labeling web pages given link information.
- Bipartite matching: given a bipartite graph, find the best possible matching, e.g. word alignment in NLP and protein structure prediction in computational biology.

Linear classifiers

In binary classification, linear classifiers separate the classes in a multidimensional input space with a hyperplane.
The classification function giving the output is

  y = f(w · x + b) = f(Σ_{d=1}^{D} w_d x_d + b)

where x is the input feature vector (the result of applying Φ to the input), w is a real vector of feature weights, b is the bias term, and f maps the score to the output (e.g. f(z) = −1 if z < 0, +1 otherwise).
The use of kernels makes it possible to apply linear classifiers to problems that are not linearly separable, such as the XOR function.
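A minimal sketch of this scoring rule, assuming the feature vector Φ(x) has already been computed; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def predict(x, w, b):
    """Binary linear classifier: map the score w·x + b to a label in {-1, +1}."""
    score = np.dot(w, x) + b
    return -1 if score < 0 else +1

# Toy usage with a 3-dimensional feature vector x = Phi(input).
w = np.array([0.5, -1.0, 0.25])
b = 0.1
x = np.array([1.0, 0.0, 2.0])
print(predict(x, w, b))  # score = 0.5 + 0.0 + 0.5 + 0.1 = 1.1, so the output is +1
```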

Linear classifier hyperplanes

Figure: three separating hyperplanes in 2-dimensional space (from Wikipedia).

Perceptron

The Perceptron iterates I times through the training examples 1, ..., N, updating the parameters (weights and bias) whenever the current example n is classified incorrectly:

  w ← w + y_n x_n
  b ← b + y_n

where y_n is the output for the n-th training example.
The Averaged Perceptron, which generalizes better to unseen examples, does weight averaging: the final weights returned are the average of all weight vectors used during the run of the algorithm.
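A minimal sketch of the Averaged Perceptron described above, on dense NumPy feature vectors; the toy data and hyperparameters are illustrative.

```python
import numpy as np

def train_averaged_perceptron(X, y, n_iters=10):
    """Binary perceptron; returns the average of all weight vectors used during training."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    w_sum, b_sum, count = np.zeros(d), 0.0, 0
    for _ in range(n_iters):
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:  # mistake: apply the update rule
                w += y[i] * X[i]
                b += y[i]
            w_sum += w                             # accumulate for averaging
            b_sum += b
            count += 1
    return w_sum / count, b_sum / count

# Toy usage on a linearly separable problem.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = train_averaged_perceptron(X, y)
print(np.sign(X @ w + b))  # [ 1.  1. -1. -1.]
```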

Perceptron algorithm

Figure: the Perceptron algorithm, from Daumé III (2006).

Structured Perceptron

For structured prediction the feature function is Φ : X × Y → F (the second argument is the hypothesized output).
Structured prediction algorithms need Φ to admit tractable search.
The argmax problem:

  ŷ = argmax_{y ∈ Y} w · Φ(x, y)

Structured Perceptron: for each example n, whenever the prediction ŷ_n for x_n differs from y_n, the weights are updated:

  w ← w + Φ(x_n, y_n) − Φ(x_n, ŷ_n)

Similar to the plain Perceptron; the key difference is that ŷ is computed using the argmax.
The Incremental Perceptron is a variant used in cases where the exact argmax isn't available; a beam search algorithm is used instead.
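A sketch of the Structured Perceptron loop for sequence labeling. To keep it tiny it uses only per-position (word, tag) features, so the argmax decomposes into independent per-position choices; with richer features (e.g. tag bigrams) a Viterbi-style search would be needed. All names and the toy data are illustrative.

```python
from collections import defaultdict

def phi(xs, ys):
    """Joint feature map Phi(x, y): counts of (word, tag) pairs."""
    feats = defaultdict(float)
    for word, tag in zip(xs, ys):
        feats[(word, tag)] += 1.0
    return feats

def argmax(xs, tags, w):
    """With per-position features only, the argmax decomposes position by position."""
    return [max(tags, key=lambda t: w[(word, t)]) for word in xs]

def train(data, tags, n_iters=5):
    w = defaultdict(float)
    for _ in range(n_iters):
        for xs, ys in data:
            y_hat = argmax(xs, tags, w)
            if y_hat != ys:                        # mistake-driven update
                for f, v in phi(xs, ys).items():
                    w[f] += v                      # promote features of the true output
                for f, v in phi(xs, y_hat).items():
                    w[f] -= v                      # demote features of the predicted output
    return w

# Toy usage: two training "sentences" with POS-like tags.
data = [(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
        (["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"])]
tags = ["DET", "NOUN", "VERB"]
w = train(data, tags)
print(argmax(["the", "dog", "sleeps"], tags, w))   # ['DET', 'NOUN', 'VERB']
```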

Logistic regression / Maximum Entropy

The conditional probability of class y is proportional to exp f(x):

  p(y | x; w, b) = (1 / Z_{x;w,b}) exp[y(w · Φ(x) + b)] = 1 / (1 + exp[−2y(w · Φ(x) + b)])

For training we try to find w and b which maximize the likelihood of the training data.
A Gaussian prior is used to penalize large weights.
The Generalized Iterative Scaling algorithm is used to solve the maximization problem.
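The slide names Generalized Iterative Scaling as the optimizer; purely as an illustration of maximizing the penalized conditional likelihood, the sketch below uses plain gradient ascent with a Gaussian prior (L2 penalty) instead. Labels are in {−1, +1} as above; the toy data, step size, and prior variance are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, n_iters=500, lr=0.05, prior_var=10.0):
    """Maximize sum_n log p(y_n | x_n; w, b) - ||w||^2 / (2 * prior_var) by gradient ascent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        s = X @ w + b
        # With y in {-1, +1}, p(y | x) = sigmoid(2*y*s), so d log p / d s = 2*y*(1 - p).
        g = 2 * y * (1.0 - sigmoid(2 * y * s))
        w += lr * (X.T @ g - w / prior_var)   # Gaussian prior = L2 penalty on the weights
        b += lr * g.sum()
    return w, b

# Toy usage.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, b = train_logreg(X, y)
print(sigmoid(2 * (X @ w + b)))  # p(y = +1 | x): high for the first two points, low for the rest
```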

Maximum Entropy Markov Models

MEMMs are an extension of MaxEnt to sequence labeling.
They replace p(observation | state) with p(state | observation).
For a first-order MEMM, the conditional distribution of the label y_n given the full input x, the previous label y_{n−1}, the feature function Φ and weights w is:

  p(y_n | x, y_{n−1}; w) = (1 / Z_{x; y_{n−1}; w}) exp[w · Φ(x, y_n, y_{n−1})]

When the MEMM is trained, the true label y_{n−1} is used.
At prediction time, Viterbi is applied to solve the argmax; the predicted y_{n−1} label is used in classifying the n-th element of the sequence.
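A sketch of Viterbi decoding over locally normalized, MEMM-style scores. It assumes the per-position log-probabilities log p(y_n | x, y_{n−1}) have already been computed and stored in a table; in a real MEMM they would come from the MaxEnt model above. The dummy start state and the random table are assumptions made for the example.

```python
import numpy as np

def viterbi(log_probs):
    """
    log_probs: shape (T, K, K); log_probs[t, i, j] = log p(y_t = j | x, y_{t-1} = i).
    Row i = 0 at t = 0 plays the role of a dummy start state.  Returns the best label sequence.
    """
    T, K, _ = log_probs.shape
    delta = np.full((T, K), -np.inf)      # best score of any path ending in label j at step t
    back = np.zeros((T, K), dtype=int)    # backpointers
    delta[0] = log_probs[0, 0]            # transitions out of the start state
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_probs[t]   # (K, K): previous label x next label
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    y = [int(delta[T - 1].argmax())]      # best final label, then follow the backpointers
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

# Toy usage: T = 3 positions, K = 2 labels, random but properly normalized distributions.
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(2), size=(3, 2)))
print(viterbi(log_probs))
```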

Conditional Random Fields

An alternative generalization of MaxEnt to structured prediction.
Main difference: while the MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, the CRF has a single exponential model for the joint probability of the whole sequence of labels.
The formulation is:

  p(y | x; w) = (1 / Z_{x;w}) exp[w · Φ(x, y)]
  Z_{x;w} = Σ_{y′ ∈ Y} exp[w · Φ(x, y′)]

where the sum in the partition function Z ranges over all candidate outputs y′ ∈ Y.
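For a linear-chain CRF the sum over all label sequences in Z need not be enumerated explicitly: it can be computed with the forward algorithm. A sketch under that assumption, with the per-position scores ψ_t(i, j) = w · Φ(x, y_t = j, y_{t−1} = i) precomputed into an array; the brute-force version is included only to check the recursion on a tiny example.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def log_partition(psi):
    """
    psi: shape (T, K, K); psi[t, i, j] = w . Phi(x, y_t = j, y_{t-1} = i).
    Row i = 0 at t = 0 plays the role of a dummy start state.  Returns log Z_{x;w}.
    """
    T, K, _ = psi.shape
    alpha = psi[0, 0]                                   # log-scores of length-1 prefixes
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + psi[t], axis=0)
    return logsumexp(alpha)

def log_partition_bruteforce(psi):
    """Enumerate all K^T label sequences (feasible only for tiny T and K)."""
    T, K, _ = psi.shape
    scores = []
    for y in itertools.product(range(K), repeat=T):
        prev, s = 0, 0.0
        for t, label in enumerate(y):
            s += psi[t, prev, label]
            prev = label
        scores.append(s)
    return logsumexp(scores)

psi = np.random.default_rng(1).normal(size=(4, 3, 3))   # T = 4 positions, K = 3 labels
print(np.allclose(log_partition(psi), log_partition_bruteforce(psi)))  # True
```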

CRF: experimental comparison to HMMs and MEMMs

Figure: results from J. Lafferty, A. McCallum, and F. Pereira, Proc. 18th International Conference on Machine Learning, 2001.

Large margin models

Choose the parameter settings which maximize the distance between the hyperplane separating the classes and the nearest data points on either side: Support Vector Machines.
For data separable with margin 1:

  minimize_{w,b}  (1/2) ‖w‖²
  subject to      ∀n, y_n [w · Φ(x_n) + b] ≥ 1

SVM

Soft-margin formulation: don't enforce perfect classification of the training examples.
The slack variable ξ_n measures how far an example is from satisfying the hard-margin constraint:

  minimize_{w,b}  (1/2) ‖w‖² + C Σ_{n=1}^{N} ξ_n
  subject to      ∀n, y_n [w · Φ(x_n) + b] ≥ 1 − ξ_n,  ξ_n ≥ 0

The parameter C trades off fitting the training data against keeping the weight vector small.
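The slides state the constrained primal problem; an equivalent unconstrained view replaces the constraints with the hinge loss max(0, 1 − y_n[w · Φ(x_n) + b]). The sketch below minimizes that objective by plain subgradient descent, which is not how standard SVM packages solve it, just a compact illustration; data and hyperparameters are arbitrary.

```python
import numpy as np

def train_svm_subgradient(X, y, C=1.0, n_iters=2000, lr=0.001):
    """Minimize 0.5*||w||^2 + C * sum_n max(0, 1 - y_n*(w.x_n + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        viol = margins < 1                    # examples inside the margin or misclassified
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on a separable problem; C trades off data fit against ||w||.
X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = train_svm_subgradient(X, y)
print(np.round(y * (X @ w + b), 2))  # margins: about 1 for support vectors, larger for the rest
```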

Margins and support vectors

Figure: SVM maximum-margin hyperplane for binary classification; the examples lying on the margin hyperplanes are the support vectors (from Wikipedia).

Maximum margin models for structured prediction

Maximum Margin Markov Networks (M³N):
- The difference in score between the true output y and any incorrect output ŷ must be at least l(x, y, ŷ), where l is the loss function.
- M³N scales the margin to be proportional to the loss.
Support Vector Machines for Interdependent and Structured Outputs (SVMstruct):
- Instead of scaling the margin, scale the slack variables by the loss.
The two approaches differ in the optimization techniques used (SVMstruct constrains the loss function less but is more difficult to optimize).
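A compact way to write the two constraint sets contrasted above, in the notation of the earlier slides (ξ_n are slack variables, l is the loss); this is how the margin-rescaling and slack-rescaling constraints are usually stated in the literature rather than a verbatim quote from the slides.

```latex
% Margin rescaling (M^3N): the required margin grows with the loss.
\forall n,\ \forall \hat{y} \in \mathcal{Y}:\quad
  \mathbf{w} \cdot \Phi(x_n, y_n) - \mathbf{w} \cdot \Phi(x_n, \hat{y})
  \;\ge\; \ell(x_n, y_n, \hat{y}) - \xi_n

% Slack rescaling (SVM^{struct}): the slack is divided by the loss instead.
\forall n,\ \forall \hat{y} \in \mathcal{Y}:\quad
  \mathbf{w} \cdot \Phi(x_n, y_n) - \mathbf{w} \cdot \Phi(x_n, \hat{y})
  \;\ge\; 1 - \frac{\xi_n}{\ell(x_n, y_n, \hat{y})}
```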

Summary of structured prediction algorithms

Figure from Daumé III (2006).

Other approaches to structured outputs

- Searn (Search + Learn)
- Analogical learning
- Miscellaneous hacks: reranking, stacking, class n-grams

References

- M. Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP.
- A. McCallum, D. Freitag, and F. Pereira. 2000. Maximum Entropy Markov Models for Information Extraction and Segmentation. ICML.
- J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML.
- B. Taskar, C. Guestrin, and D. Koller. 2003. Max-Margin Markov Networks. NIPS.
- I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2005. Large Margin Methods for Structured and Interdependent Output Variables. JMLR.
- Hal Daumé III. 2006. Practical Structured Learning Techniques for Natural Language Processing. PhD thesis.