Algorithms for NLP. Classification II. Taylor Berg-Kirkpatrick, CMU. Slides: Dan Klein, UC Berkeley.


Minimize Training Error? A loss function declares how costly each mistake is, e.g. 0 loss for a correct label and 1 loss for a wrong label. Mistakes can be weighted differently (e.g. false positives worse than false negatives, or Hamming distance over structured labels). We could, in principle, minimize training loss directly, but this is a hard, discontinuous optimization problem.
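The training-loss objective the slide refers to (the formula itself is not in the transcription; this is the standard form, written in the notation used on later slides):

$$\min_w \; \sum_i \ell_i\big(\hat{y}_i(w)\big), \qquad \hat{y}_i(w) = \arg\max_y \, w^\top f_i(y)$$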

Examples: Perceptron (Separable Case)

Examples: Perceptron (Non-Separable Case)

Objective Functions. What do we want from our weights? It depends! So far: minimize (training) errors. This is the zero-one loss: it is discontinuous, and minimizing it exactly is NP-hard. Maximum entropy and SVMs have other objectives related to the zero-one loss.

Linear Models: Maximum Entropy. Maximum entropy (logistic regression): use the scores as probabilities by exponentiating to make them positive and normalizing, then maximize the (log) conditional likelihood of the training data.
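A minimal sketch of that recipe in code (exponentiate to make scores positive, normalize, then sum the log probabilities of the true labels); the helpers feat, labels, and data are illustrative placeholders, not from the slides:

```python
import numpy as np

def probs(w, x, labels, feat):
    """Turn linear scores w.f(x,y) into a distribution: exponentiate, then normalize."""
    scores = np.array([w @ feat(x, y) for y in labels])
    scores -= scores.max()              # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def log_likelihood(w, data, labels, feat):
    """Sum of log P(y_i | x_i; w) over the training data: the quantity we maximize."""
    total = 0.0
    for x, y in data:
        p = probs(w, x, labels, feat)
        total += np.log(p[labels.index(y)])
    return total
```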

Maximum Entropy II. Motivation for maximum entropy: a connection to the maximum entropy principle (sort of), and the idea that we might want to do a good job of being uncertain on noisy cases (in practice, though, posteriors are pretty peaked). Regularization (smoothing).

Log-Loss. If we view maxent as a minimization problem, it minimizes the log loss on each example. One view: log loss is an upper bound on zero-one loss.
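The per-example log loss, written out in the slides' notation (a standard reconstruction; the formula is elided in the transcription):

$$\ell_{\log}(w; i) \;=\; -\log P(y_i \mid x_i; w) \;=\; \log \sum_y \exp\big(w^\top f_i(y)\big) \;-\; w^\top f_i(y_i)$$

The upper-bound view: whenever the classifier gets example $i$ wrong, the true label gets probability at most $1/2$, so the base-2 log loss is at least 1, i.e. at least the zero-one loss.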

Maximum Margin: Non-Separable SVMs. Add slack to the constraints and make the objective pay (linearly) for slack. (Note: other choices of how to penalize slacks exist!) C is called the capacity of the SVM; it is the smoothing knob. Learning: you can still stick this into Matlab if you want, but constrained optimization is hard; there are better methods! We'll come back to this later.
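The soft-margin objective the slide describes, reconstructed in the notation of the later "Structured Margin" slides:

$$\min_{w,\,\xi \ge 0} \;\; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_i \xi_i \qquad \text{s.t.} \quad w^\top f_i(y_i) \;\ge\; w^\top f_i(y) + \ell_i(y) - \xi_i \quad \forall i,\; \forall y$$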

Remember SVMs: we had a constrained minimization, but we can solve for the slack variables ξ_i, giving the unconstrained (hinge-loss) form on the next slide.

Hinge Loss (the plot is really only right in the binary case). Consider the per-instance objective: this is called the hinge loss. Unlike maxent / log loss, you stop gaining objective once the true label wins by enough. You can start from here and derive the SVM objective. It can be solved directly with sub-gradient descent (e.g. Pegasos: Shalev-Shwartz et al. 07).
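A minimal Pegasos-style sketch for the binary case (stochastic sub-gradient descent on the L2-regularized hinge loss); the data format (feature vectors x with labels y in {-1, +1}) and the regularization constant lam are assumptions, not from the slides:

```python
import random
import numpy as np

def pegasos(data, dim, lam=0.01, epochs=5):
    w = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                 # Pegasos step size schedule
            if y * (w @ x) < 1:                   # hinge active: sub-gradient includes -y*x
                w = (1 - eta * lam) * w + eta * y * x
            else:                                 # hinge inactive: only the regularizer
                w = (1 - eta * lam) * w
    return w
```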

Max vs Soft-Max Margin. SVMs vs. maxent: you can make the SVM's (hinge) term zero, but not the maxent (log) term. They are very similar! Both try to make the true score better than a function of the other scores: the SVM tries to beat the augmented runner-up, while the maxent classifier tries to beat the soft-max.
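The two per-example objectives being compared (elided in the transcription; reconstructed in the slides' notation):

$$\text{SVM (hinge):} \quad \max_y \big( w^\top f_i(y) + \ell_i(y) \big) \;-\; w^\top f_i(y_i) \qquad \text{Maxent (log loss):} \quad \log \sum_y \exp\big( w^\top f_i(y) \big) \;-\; w^\top f_i(y_i)$$

The hinge term can be driven to zero (the max includes $y_i$ itself, with $\ell_i(y_i) = 0$), while the soft-max always strictly exceeds $w^\top f_i(y_i)$, so the log loss is always positive; this is the "you can make this zero but not this one" remark.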

Loss Functions: Comparison. [Plot comparing the zero-one, hinge, and log losses.]

Structure

Handwriting recognition: the input x is an image of handwritten characters, the output y is the string "brace". Sequential structure. [Slides: Taskar and Klein 05]

CFG Parsing: the input x is the sentence "The screen was a sea of red", the output y is its parse tree. Recursive structure.

Bilingual Word Alignment: the input x is a sentence pair, the output y is a word alignment between them. Combinatorial structure. English: "What is the anticipated cost of collecting fees under the new proposal?" French: "En vertu de les nouvelle propositions, quel est le coût prévu de perception de le droits?"

Structured Models. Prediction is an argmax over the space of feasible outputs. Assumption: the score is a sum of local part scores (parts = nodes, edges, productions).
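The decomposition the slide assumes, written in the notation of the surrounding slides:

$$\hat{y}(x) \;=\; \arg\max_{y \in \mathcal{Y}(x)} \, w^\top f(x, y), \qquad f(x, y) \;=\; \sum_{p \,\in\, \text{parts}(x, y)} f(x, p)$$

where $\mathcal{Y}(x)$ is the space of feasible outputs.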

Named Entity Recognition:
Apple/ORG Computer/ORG bought/--- Smart/ORG Systems/ORG Inc./ORG located/--- in/--- Arkansas/LOC

Bilingual word alignment: score each candidate link between English position j ("What is the anticipated cost of collecting fees under the new proposal?") and French position k ("En vertu de les nouvelle propositions, quel est le coût prévu de perception de le droits?") with features such as association, position, and orthography.

Efficient Decoding. Common case: you have a black box which computes the argmax, at least approximately, and you want to learn w. The easiest option is the structured perceptron [Collins 01]. Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*). Prediction is structured; the learning update is not.
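A minimal structured perceptron sketch [Collins 01]. The decoder decode(x, w) stands in for the black-box argmax the slide mentions (Viterbi, CKY, matching, ...), and feat(x, y) returns the global feature vector; both names are illustrative assumptions:

```python
import numpy as np

def structured_perceptron(data, decode, feat, dim, epochs=5):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(x, w)                  # structured prediction (combinatorial search)
            if y_hat != y_gold:                   # simple additive update when wrong
                w += feat(x, y_gold) - feat(x, y_hat)
    return w
```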

Structured Margin (Primal). Remember our primal margin objective?

$$\min_w \;\; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_i \Big[ \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i) \Big]$$

It still applies with a structured output space!

Structured Margin (Primal). Just need an efficient loss-augmented decode:

$$\bar{y} \;=\; \arg\max_y \; w^\top f_i(y) + \ell_i(y)$$

$$\min_w \;\; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_i \Big( w^\top f_i(\bar{y}) + \ell_i(\bar{y}) - w^\top f_i(y_i) \Big)$$

$$\nabla_w \;=\; w \;+\; C \sum_i \big( f_i(\bar{y}) - f_i(y_i) \big)$$

Still use general subgradient descent methods! (Adagrad)
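A sketch of one subgradient step for the primal above; loss_aug_decode(x, y_gold, w) returns argmax_y of w.f(x,y) + loss(y, y_gold) and feat(x, y) is the global feature vector, both assumed helpers rather than anything from the slides:

```python
import numpy as np

def structured_svm_step(w, batch, loss_aug_decode, feat, C=1.0, lr=0.01):
    grad = w.copy()                                # gradient of (1/2)||w||^2
    for x, y_gold in batch:
        y_bar = loss_aug_decode(x, y_gold, w)      # loss-augmented decode
        grad += C * (feat(x, y_bar) - feat(x, y_gold))
    return w - lr * grad                           # plain subgradient step; Adagrad also works
```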

Structured Margin (Dual). Remember the constrained version of the primal: the dual has a variable for every constraint.

Full Margin: OCR. We want the score of the true output "brace" to beat the score of every other string. Equivalently: score("brace") > score("aaaaa"), score("brace") > score("aaaab"), ..., score("brace") > score("zzzzz"): a lot of constraints!

Parsing example: we want the score of the correct parse of "It was red" to beat the score of every other tree. Equivalently, one constraint per alternative parse: a lot of constraints!

Alignment example: we want the score of the correct alignment of "What is the" / "Quel est le" to beat the score of every other alignment. Equivalently, one constraint per alternative alignment: a lot of constraints!

Cutting Plane (Dual). A constraint induction method [Joachims et al 09]. It exploits the fact that the number of constraints you actually need per instance is typically very small, and it requires (loss-augmented) primal decoding only. Repeat: find the most violated constraint for an instance, add this constraint, and re-solve the (non-structured) QP (e.g. with SMO or another QP solver).
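A high-level sketch of that loop. Here solve_qp and loss_aug_decode are placeholders (not from the slides): solve_qp fits the weights and per-example slacks subject only to the constraints in the working set, and loss_aug_decode finds the most violated constraint for an example.

```python
def cutting_plane(data, loss_aug_decode, solve_qp, feat, loss, eps=1e-3, max_iters=100):
    working_set = []                       # constraints added so far, as (example index, output) pairs
    w = None
    for _ in range(max_iters):
        w, slack = solve_qp(working_set)   # re-solve the small (non-structured) QP
        added = 0
        for i, (x, y_gold) in enumerate(data):
            y_bar = loss_aug_decode(x, y_gold, w)        # most violated constraint for example i
            violation = (w @ feat(x, y_bar) + loss(y_bar, y_gold)
                         - w @ feat(x, y_gold))
            if violation > slack[i] + eps:               # add only meaningfully violated constraints
                working_set.append((i, y_bar))
                added += 1
        if added == 0:                     # no new violated constraints: done
            return w
    return w
```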

Cutting Plane (Dual): some issues. You can easily spend too much time solving QPs, and it doesn't exploit shared constraint structure. In practice it works pretty well: fast like perceptron/MIRA, more stable, and no averaging needed.

Likelihood, Structured. Structure is needed to compute the log-normalizer and the expected feature counts. E.g. if a feature is an indicator of DT-NN, then we need to compute posterior marginals P(DT-NN | sentence) for each position and sum them. This also works with latent variables (more later).
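Written out in the slides' notation (a standard reconstruction, not shown in the transcription), the gradient of the log conditional likelihood is observed minus expected feature counts, and the log-normalizer is the soft-max over all structures:

$$\frac{\partial}{\partial w} \log P(y_i \mid x_i; w) \;=\; f_i(y_i) \;-\; \sum_y P(y \mid x_i; w)\, f_i(y), \qquad \log Z_i \;=\; \log \sum_y \exp\big( w^\top f_i(y) \big)$$

Both the sum over $y$ and the expectation are computed with dynamic programming over the structure (e.g. forward-backward or inside-outside), which is where posterior marginals like $P(\text{DT-NN} \mid \text{sentence})$ come in.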

Comparison

Option 0: Reranking [e.g. Charniak and Johnson 05]. Pipeline: the input x = "The screen was a sea of red." goes to a baseline parser, which produces an N-best list (e.g. n = 100); a non-structured classifier then picks the output.

Reranking. Advantages: directly reduces to the non-structured case; no locality restriction on features. Disadvantages: stuck with the errors of the baseline parser, and the baseline system must produce n-best lists (but feedback is possible [McClosky, Charniak, Johnson 2006]).

M3Ns. Another option: express all constraints in a packed form. Maximum margin Markov networks [Taskar et al 03] integrate the solution structure deeply into the problem structure. Steps: express inference over constraints as an LP; use duality to transform the minimax formulation into min-min; constraints factor in the dual along the same structure as the primal (the alphas essentially act as a dual distribution); various optimization possibilities exist in the dual.

Example: Kernels. Quadratic kernels.

Non-Linear Separators. Another view: kernels map an original feature space to some higher-dimensional feature space where the training set is (more) separable, via a map Φ: y ↦ φ(y).

Why Kernels? Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)? Yes, in principle: just compute them; no need to modify any algorithms. But the number of features can get large (or infinite), and some kernels are not as usefully thought of in their expanded representation, e.g. RBF or data-defined kernels [Henderson and Titov 05]. Kernels let us compute with these features implicitly. Example: the implicit dot product in the quadratic kernel takes much less space and time per dot product. Of course, there's the cost of using the pure dual algorithms.
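A small illustration of the implicit dot product, assuming the common quadratic kernel K(x, z) = (x·z + 1)^2: it equals the dot product of explicit expanded feature vectors (bias, scaled originals, and all pairwise products), but costs only O(d) instead of O(d^2) per evaluation. The specific scaling constants are the standard ones for this kernel, not taken from the slides.

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_kernel(x, z):
    return (x @ z + 1.0) ** 2                # implicit: O(d) time, no expansion

def quadratic_features(x):
    feats = [1.0]                            # bias term
    feats += list(np.sqrt(2.0) * x)          # scaled original features
    for i, j in combinations_with_replacement(range(len(x)), 2):
        scale = 1.0 if i == j else np.sqrt(2.0)
        feats.append(scale * x[i] * x[j])    # all pairwise products: O(d^2) features
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(quadratic_kernel(x, z))                         # implicit dot product
print(quadratic_features(x) @ quadratic_features(z))  # explicit expansion, same value
```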