Lab 12: Structured Prediction


December 4, 2014

Lecture plan: structured perceptron; application: confusing messages; application: dependency parsing; structured SVM.

Class review: from modelization to classification. What does learning mean? Density estimation: accurately represent the distribution of the data points. The most straightforward loss for a family of distributions $F$: $\min_{p \in F} D(p, \hat p)$. Taking $D(p, \hat p) = D_{KL}(p \,\|\, \hat p)$ gives the I-projection; $D(p, \hat p) = D_{KL}(\hat p \,\|\, p)$ gives the M-projection, which becomes (see lecture 3, slide 7) maximum likelihood: $\max_{p \in F} p(D)$.

Class review: from modelization to classification. What does learning mean? Density estimation. Structure or knowledge discovery: use the learned model structure and parameters to get insights into the data. For example, LDA gives the topics of a corpus; QMR-DT identifies syndromes. There is no a priori best loss; maximizing the likelihood is a good idea.

Class review: from modelization to classification. What does learning mean? Density estimation. Structure or knowledge discovery. Specific prediction tasks: predict the most likely value of a latent variable; this becomes a classification problem. Classification loss: $\min_{p \in F} \sum_{(x,y)} \mathbb{I}\left[\exists\, y' \neq y : p(y' \mid x) > p(y \mid x)\right]$

Class review: from modelization to classification. Classification loss, equivalently: $\min_{p \in F} \sum_{(x,y)} \mathbb{I}\left[\exists\, y' \neq y : \log p(y' \mid x) - \log p(y \mid x) > 0\right]$. Recall the log-linear formulation: $\log p(y' \mid x) - \log p(y \mid x) = w^\top \left( \sum_c f_c(x, y'_c) - \sum_c f_c(x, y_c) \right)$. Perfect solution: solve an LP with $|D|\,|\mathcal{Y}|$ constraints, but $\mathcal{Y}$ is combinatorial: instead, structured algorithms.

Structured algorithms. Why structured? Start from: $\mathbb{I}\left[\exists\, y' \neq y : \log p(y' \mid x) - \log p(y \mid x) > 0\right] = \mathbb{I}\left[y \neq \arg\max_u p(u \mid x)\right]$, with $\arg\max_u p(u \mid x)$ obtained by marginal inference.

Structured algorithms. Structured perceptron: the same as the regular perceptron, but the prediction for $y$ is structured. In case of a violation, update: $w \leftarrow w + f(x_t, y_t) - f(x_t, \hat y)$. If the data is separable with margin $\gamma$, the algorithm converges after at most $\left( \frac{2 \max_{m,y} \|f(x_m, y)\|}{\gamma} \right)^2$ mistakes. Averaging the weights acts as regularization.
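
To make the update rule concrete, here is a minimal Python sketch (not from the lecture): one structured perceptron step, where `features` and `predict` are assumed, user-supplied helpers and `predict` solves the structured argmax (e.g. by Viterbi or MST).

```python
from collections import defaultdict

def perceptron_step(w, x, y_gold, features, predict):
    """One structured perceptron update (sketch).

    w        : dict mapping feature name -> weight (modified in place)
    x        : input (e.g. a sentence)
    y_gold   : gold structured output
    features : function (x, y) -> dict of feature counts
    predict  : function (w, x) -> argmax_y of w . f(x, y)  (Viterbi, MST, ...)
    """
    y_hat = predict(w, x)
    if y_hat != y_gold:
        # w <- w + f(x, y_gold) - f(x, y_hat)
        for name, value in features(x, y_gold).items():
            w[name] += value
        for name, value in features(x, y_hat).items():
            w[name] -= value
    return y_hat

# Weights are conveniently kept in a defaultdict so unseen features start at 0.
w = defaultdict(float)
```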

Simple example: Confusing messages

Simple example: Confusing messages. Three impatient individuals fight for a token (the right to speak): whoever gets the token says one word, then loses the token. They are all talking about different things. We observe who said what, and train a perceptron to track the token (i.e., to un-mix the sentences).

Simple example: Confusing messages. Example: 'This I birds ate painting meat is fly beautiful', with token sequence [1, 2, 3, 2, 1, 2, 1, 3, 1].

Simple example: Confusing messages What model do we use? What are the features? How do we learn?

Simple example: Confusing messages. 'This I birds ate painting meat is fly beautiful'. Model: a Maximum Entropy Markov Model. Features?

Simple example: Confusing messages. 'This I birds ate painting meat is fly beautiful', annotated with more information. Features?

Simple example: Confusing messages. Features: $f(y_t, y_{t+1})$: tags; $f(y_t, x_t)$: tag, word, topic; $f(y_t, y_{t+2}, x_t, x_{t+2})$: tags, POS, topics, (words?). Prediction: dynamic programming (complexity?). Learning: perceptron.
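
For the prediction step, a minimal Viterbi sketch in Python, assuming a first-order model (the $f(y_t, y_{t+2}, \ldots)$ skip features would require a second-order variant); `score` is an assumed helper that sums the weights of the features active at position $t$.

```python
def viterbi(w, x, tags, score):
    """Highest-scoring tag sequence for sentence x (first-order sketch).

    w     : weight vector (whatever object `score` knows how to use)
    x     : list of observed words
    tags  : the possible labels, e.g. [1, 2, 3] for the three speakers
    score : function (w, prev_tag, tag, x, t) -> local score of assigning
            `tag` at position t after `prev_tag` (None at t = 0)
    Runs in O(len(x) * |tags|^2), the usual first-order complexity.
    """
    n = len(x)
    best = [{} for _ in range(n)]   # best[t][tag] = best score ending in tag
    back = [{} for _ in range(n)]   # back-pointers for recovering the path
    for tag in tags:
        best[0][tag] = score(w, None, tag, x, 0)
    for t in range(1, n):
        for tag in tags:
            cands = [(best[t - 1][prev] + score(w, prev, tag, x, t), prev)
                     for prev in tags]
            best[t][tag], back[t][tag] = max(cands)
    # follow the back-pointers from the best final tag
    last = max(tags, key=lambda tag: best[n - 1][tag])
    seq = [last]
    for t in range(n - 1, 0, -1):
        last = back[t][last]
        seq.append(last)
    return list(reversed(seq))
```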

Simple example: Confusing messages. One iteration. Training datum: 'This I birds ate painting meat is fly beautiful', [1, 2, 3, 2, 1, 2, 1, 3, 1]. Prediction: 'This I birds ate painting meat is fly beautiful', [2, 2, 3, 2, 1, 1, 1, 3, 1]. Update: $w \leftarrow w + f(1, 2) - f(2, 2) + f(1, \text{This}, \text{Topic}_{\text{This}}) - f(2, \text{This}, \text{Topic}_{\text{This}}) + f(1, 3, \text{DET}, \text{NNS}, \text{Topic}_{\text{This}}, \text{Topic}_{\text{birds}}) - f(2, 3, \text{DET}, \text{NNS}, \text{Topic}_{\text{This}}, \text{Topic}_{\text{birds}}) + \dots$

Simple example: Confusing messages Repeat until convergence.
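
Putting the pieces together, a hedged sketch of the full training loop with weight averaging (the regularization mentioned earlier); `features` and `predict` are assumed helpers, as before.

```python
from collections import defaultdict

def train_structured_perceptron(data, features, predict, epochs=10):
    """Averaged structured perceptron (sketch; helpers are assumed).

    data     : list of (x, y_gold) pairs
    features : function (x, y) -> dict of feature counts
    predict  : function (w, x) -> argmax_y of w . f(x, y)
    Returns averaged weights; averaging acts as a simple regularizer.
    """
    w = defaultdict(float)       # current weights
    total = defaultdict(float)   # running sum of weights, for averaging
    seen = 0
    for _ in range(epochs):      # "repeat until convergence" (or a fixed budget)
        for x, y_gold in data:
            y_hat = predict(w, x)
            if y_hat != y_gold:
                for name, value in features(x, y_gold).items():
                    w[name] += value
                for name, value in features(x, y_hat).items():
                    w[name] -= value
            for name, value in w.items():
                total[name] += value
            seen += 1
    return {name: value / max(seen, 1) for name, value in total.items()}
```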

Example of structured perceptron: dependency parsing. Outline: dependency parsing; incremental parsing; features, perceptron; exact search, beam search.

Example of structured perceptron: dependency parsing. Figure: a dependency (head-modifier) parse and a constituency (grammar) parse.

Example of structured perceptron: dependency parsing. For each pair of words, is one the head of the other? An edge classification problem? (Figure: candidate head-modifier edges, with a Root node, over the sentence 'We have already discussed this question at length'.)

Example of structured perceptron: dependency parsing. Each word has only one head, so the resulting graph is a tree: structured prediction.

Example of structured perceptron: dependency parsing. First approach: suppose the presence of an edge depends only on the sentence. Example features for $f(e_{i,j}, w)$: the words $w_i, w_j$; the POS tags of $w_i, w_j$; the POS tags of $w_{i-1}, w_{i+1}, w_{j-1}, w_{j+1}$. Inference reduces to a Maximum Spanning Tree problem.
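
As a concrete illustration of this arc-factored first approach, a hedged Python sketch (all function and feature names are illustrative, not from the lecture): features and scores for every candidate head-modifier edge, plus a deliberately simplified decoder that gives each word its best head independently, with a comment marking where a real implementation would run the Chu-Liu/Edmonds MST algorithm to enforce the tree constraint. Index 0 is treated as the ROOT token.

```python
def edge_features(words, tags, i, j):
    """Features for a candidate edge with head i and modifier j.

    words, tags : sentence tokens and their POS tags, with index 0 = ROOT
    Mirrors the slide: the two words, their POS tags, and the surrounding POS.
    """
    def tag(k):
        return tags[k] if 0 <= k < len(tags) else "<PAD>"
    return {
        f"head_word={words[i]}|mod_word={words[j]}": 1.0,
        f"head_pos={tag(i)}|mod_pos={tag(j)}": 1.0,
        f"head_ctx={tag(i-1)},{tag(i+1)}|mod_ctx={tag(j-1)},{tag(j+1)}": 1.0,
    }

def score_edges(w, words, tags):
    """Score every candidate (head, modifier) pair under weights w (a dict)."""
    scores = {}
    for j in range(1, len(words)):           # modifier: every real word
        for i in range(len(words)):          # head: ROOT or any other word
            if i != j:
                feats = edge_features(words, tags, i, j)
                scores[(i, j)] = sum(w.get(name, 0.0) * v
                                     for name, v in feats.items())
    return scores

def decode_heads(scores, n_words):
    """Pick the best head for each word independently.

    NOTE: this greedy stand-in can produce cycles; the actual inference on
    the slide is a Maximum Spanning Tree (Chu-Liu/Edmonds), which enforces
    that the result is a tree.
    """
    heads = {}
    for j in range(1, n_words):
        heads[j] = max((i for i in range(n_words) if i != j),
                       key=lambda i: scores[(i, j)])
    return heads
```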

Example of structured perceptron: dependency parsing. First approach: score every candidate edge, e.g. $f(e_{\text{Root},\text{We}}, w)$ and $f(e_{\text{have},\text{We}}, w)$, over the sentence 'We have already discussed this question at length'; then MST, followed by a perceptron update.

Example of structured perceptron: dependency parsing. A straightforward algorithm, but it does not capture dependencies between edges.

Example of structured perceptron: dependency parsing. For projective grammars, parsing can be done incrementally: map the set of edges to a sequence of actions, and have the perceptron predict the next action given the partial parse tree. http://stp.lingfil.uu.se/~nivre/docs/eacl3.pdf

Example of structured perceptron: dependency parsing. Possible actions: SHIFT: moves a word from the buffer to the stack; LEFT-ARC: the first stack item points to the second stack item; RIGHT-ARC: the second stack item points to the first stack item. Features: words (lexicalized), POS, surrounding POS, the head of the current word, edge types.
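
A minimal sketch of applying these actions, using one common stack/buffer convention that matches the slide's wording (conventions differ across transition systems; `parse_with_actions` and the toy action sequence are illustrative). In the actual parser, the perceptron scores the legal next actions from features of the current configuration.

```python
def parse_with_actions(words, actions):
    """Apply a sequence of SHIFT / LEFT-ARC / RIGHT-ARC actions.

    words   : list of tokens (index 0 is treated as ROOT)
    actions : list of action names
    Returns a dict modifier_index -> head_index.
    Convention used here (as worded on the slide): LEFT-ARC makes the first
    (topmost) stack item the head of the second; RIGHT-ARC does the reverse.
    """
    stack = []                        # holds word indices
    buffer = list(range(len(words)))  # indices still to be shifted
    heads = {}
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            first, second = stack[-1], stack[-2]
            heads[second] = first     # first stack item points to second
            del stack[-2]
        elif action == "RIGHT-ARC":
            first, second = stack[-1], stack[-2]
            heads[first] = second     # second stack item points to first
            stack.pop()
    return heads

# Toy example (illustrative sentence and gold action sequence):
heads = parse_with_actions(
    ["ROOT", "We", "have", "discussed", "it"],
    ["SHIFT", "SHIFT", "SHIFT", "SHIFT", "LEFT-ARC", "LEFT-ARC",
     "SHIFT", "RIGHT-ARC", "RIGHT-ARC"],
)
# heads == {2: 3, 1: 3, 4: 3, 3: 0}
```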

Example of structured perceptron: dependency parsing. How do we do supervised training? We need an inference algorithm. Greedy search: take the highest-scoring action. Exact inference: complexity $O(n^5)$, which can be reduced to $O(n^3)$ with some restrictions. Beam search: a middle ground. We also need to choose which weights to update. http://www.aclweb.org/anthology/n12-1015

Example of structured perceptron: dependency parsing. Beam search: keep the $k$ highest-scoring sequences of actions. Early update: update the perceptron weights only on the prefix up to the point where the gold sequence falls off the beam. Max-violation: update on the prefix at which the gold sequence falls farthest below the beam.
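
A hedged sketch of the beam itself: keeping the k highest-scoring action sequences. `legal_actions`, `apply_action`, and `score_action` are assumed helpers, not from the lecture; the early-update and max-violation strategies are indicated only as comments, since they additionally track the gold action sequence.

```python
def beam_search(x, start_state, legal_actions, apply_action, score_action,
                n_steps, k=8):
    """Keep the k highest-scoring action sequences (sketch; helpers assumed).

    legal_actions(state)           -> iterable of actions allowed in this state
    apply_action(state, action)    -> the resulting next state
    score_action(x, state, action) -> model score (w . f) of taking this action
    """
    beam = [(0.0, start_state, [])]   # (total score, state, actions so far)
    for _ in range(n_steps):
        candidates = []
        for total, state, actions in beam:
            for action in legal_actions(state):
                candidates.append((total + score_action(x, state, action),
                                   apply_action(state, action),
                                   actions + [action]))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:k]         # prune to the k best expansions
        # Early update: if the gold action prefix has fallen off the beam,
        # stop here and update the perceptron on this prefix.
        # Max-violation: instead, keep going and update at the step where the
        # gold prefix's score is farthest below the beam.
    return beam[0]
```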

Example of structured perceptron: dependency parsing. Figure: choosing what to update.

Example of structured perceptron: dependency parsing. Figure: effect of max-violation.

Structured algorithms: SVM. Structured SVM: slack variables. We have $w^\top (f(x_t, y_t) - f(x_t, y)) > 0$, or with a margin, $w^\top (f(x_t, y_t) - f(x_t, y)) \geq 1$. If the data is separable: perceptron. If not, minimize the violations: $\min_{\xi \geq 0} \sum_t \xi_t$ s.t. $w^\top (f(x_t, y_t) - f(x_t, y)) \geq 1 - \xi_t$. The regularized version gives the structural SVM: $\min_{w,\, \xi \geq 0} \sum_t \xi_t + C \|w\|^2$.

Structured algorithms: SVM. Using ALL constraints: $\xi_t = \max\left(0,\; \max_{y \in \mathcal{Y}} 1 - w^\top (f(x_t, y_t) - f(x_t, y))\right)$. Hence the whole objective becomes: $\min_w \sum_t \max\left(0,\; \max_{y \in \mathcal{Y}} 1 - w^\top (f(x_t, y_t) - f(x_t, y))\right) + C \|w\|^2$. As before, the $\max_{y \in \mathcal{Y}}$ is a MAP inference problem, hence a structured method. We can easily replace the uniform margin of 1 with another metric, as long as inference remains easy; for example, the Hamming loss.
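
As an illustration of the resulting per-example term, a hedged Python sketch of the structured hinge loss and its subgradient; `features`, `argmax_y` (loss-augmented MAP inference), and `delta` (the margin/metric, e.g. a uniform 1 or the Hamming distance) are assumed helpers, not from the lecture.

```python
def structured_hinge(w, x, y_gold, features, argmax_y, delta):
    """Structured hinge loss for one example, plus a subgradient (sketch).

    features(x, y)                -> dict of feature counts
    argmax_y(w, x, y_gold, delta) -> y maximizing  delta(y_gold, y) + w . f(x, y)
                                      (loss-augmented MAP inference)
    delta(y_gold, y)              -> required margin for y, e.g.
                                      1.0 if y != y_gold else 0.0, or Hamming
    """
    def dot(feats):
        return sum(w.get(name, 0.0) * value for name, value in feats.items())

    y_hat = argmax_y(w, x, y_gold, delta)
    violation = delta(y_gold, y_hat) - (dot(features(x, y_gold)) -
                                        dot(features(x, y_hat)))
    loss = max(0.0, violation)

    # Subgradient of the hinge term (before C * ||w||^2): f(x, y_hat) - f(x, y_gold)
    grad = {}
    if loss > 0.0:
        for name, value in features(x, y_hat).items():
            grad[name] = grad.get(name, 0.0) + value
        for name, value in features(x, y_gold).items():
            grad[name] = grad.get(name, 0.0) - value
    return loss, grad
```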

Structured prediction. Next time: pseudo-likelihood; preparation exercises.