Algorithms for NLP. Classification III. Taylor Berg-Kirkpatrick, CMU. Slides: Dan Klein, UC Berkeley.


The Perceptron, Again. Start with zero weights. Visit training instances one by one and try to classify each. If correct, no change! If wrong: adjust the weights by adding the mistake vector (the difference between the correct and the predicted feature vectors).
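A minimal sketch of this update rule for the multiclass case, assuming candidates are scored by a dot product between the weight vector and a feature function (the names `feat`, `candidates`, and the data format are illustrative, not from the slides):

```python
import numpy as np

def perceptron_train(data, feat, num_feats, epochs=5):
    """data: list of (x, gold_y, candidates); feat(x, y) -> np.array of length num_feats."""
    w = np.zeros(num_feats)                        # start with zero weights
    for _ in range(epochs):
        for x, gold_y, candidates in data:         # visit training instances one by one
            pred_y = max(candidates, key=lambda y: w @ feat(x, y))  # try to classify
            if pred_y != gold_y:                   # if wrong: add the mistake vector
                w += feat(x, gold_y) - feat(x, pred_y)
    return w
```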

Perceptron Weights. What is the final value of w? Can it be an arbitrary real vector? No! It is built by adding up feature vectors (mistake vectors), weighted by mistake counts. We can therefore reconstruct the weight vector (the primal representation) from the per-example update counts α_i (the dual representation).

Dual Perceptron. Track mistake counts rather than weights. Start with zero counts (α). For each instance x, try to classify. If correct, no change! If wrong: raise the mistake count for this example and prediction.

Dual / Kernelized Perceptron. How do we classify an example x? If someone tells us the value of the kernel K for each pair of candidates, we never need to build the weight vectors explicitly.
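A sketch of dual-perceptron prediction under this view: each candidate is scored by summing mistake counts times kernel values against the stored mistakes, so the weight vector is never constructed. The mistake-record format and the kernel signature are illustrative assumptions:

```python
def dual_perceptron_score(x, y, mistakes, K):
    """mistakes: list of (alpha, x_j, gold_j, pred_j) recorded during training.
    K(x1, y1, x2, y2): kernel between candidate (x1, y1) and candidate (x2, y2)."""
    # w = sum_j alpha_j * (f(x_j, gold_j) - f(x_j, pred_j)), so w . f(x, y) expands
    # into kernel evaluations -- the weight vector itself is never built.
    return sum(alpha * (K(x, y, x_j, gold_j) - K(x, y, x_j, pred_j))
               for alpha, x_j, gold_j, pred_j in mistakes)

def dual_perceptron_classify(x, candidates, mistakes, K):
    return max(candidates, key=lambda y: dual_perceptron_score(x, y, mistakes, K))
```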

Issues with Dual Perceptron. Problem: to score each candidate, we may have to compare to all training candidates. Very, very slow compared to a primal dot product! One bright spot: for the perceptron, we only need to consider candidates we made mistakes on during training. Slightly better for SVMs, where the alphas are (in theory) sparse. This problem is serious: fully dual methods (including kernel methods) tend to be extraordinarily slow. Of course, we can (so far) also accumulate our weights as we go...

Kernels: Who Cares? So far: a very strange way of doing a very simple calculation. The "kernel trick": we can substitute any* similarity function in place of the dot product, which lets us learn new kinds of hypotheses. *Fine print: if your kernel doesn't satisfy certain technical requirements, lots of proofs break (e.g. convergence, mistake bounds). In practice, illegal kernels sometimes work (but not always).

Some Kernels. Kernels implicitly map original vectors to higher-dimensional spaces, take the dot product there, and hand the result back. Linear kernel: K(x, x') = x · x'. Quadratic kernel: K(x, x') = (x · x' + 1)². RBF kernel: K(x, x') = exp(−‖x − x'‖² / 2σ²), an infinite-dimensional representation. Discrete kernels: e.g. string kernels, tree kernels.
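For concreteness, here are the standard forms of these kernels; the quadratic kernel is shown in its common (x·x′ + 1)² variant, and the exact constants on the original slides may differ:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def quadratic_kernel(x, z):
    # corresponds to an implicit feature space of all products of up to two features
    return (x @ z + 1.0) ** 2

def rbf_kernel(x, z, sigma=1.0):
    # the implicit feature space here is infinite-dimensional
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))
```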

Tree Kernels [Collins and Duffy 01]. We want to compute the number of common subtrees between trees T and T′. Add up counts over all pairs of nodes n, n′. Base case: if n and n′ have different root productions, or are depth 0, the count is 0. Recursive case: if n and n′ share the same root production, the count is a product over their aligned children.
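A sketch of the Collins-Duffy counting recursion described above, assuming tree nodes with a `production` label, a `children` list, and an `is_preterminal()` test (all illustrative, not a specific library API):

```python
def common_subtrees(n1, n2):
    """Number of matching subtree fragments rooted at n1 and n2 (Collins-Duffy C(n1, n2))."""
    if n1.production != n2.production:        # different root productions: no common fragment
        return 0
    if n1.is_preterminal():                   # same production at a pre-terminal: exactly one
        return 1
    count = 1
    for c1, c2 in zip(n1.children, n2.children):
        count *= 1 + common_subtrees(c1, c2)  # each child: stop the fragment there, or extend it
    return count

def tree_kernel(T1, T2):
    # the kernel sums the pairwise counts over all node pairs of the two trees
    return sum(common_subtrees(n1, n2) for n1 in T1.nodes() for n2 in T2.nodes())
```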

Kernelized SVM (trust me). Primal formulation:
$$\min_{w} \; \|w\|^2 + C \sum_i \Big[ \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big]$$
Dual formulation:
$$\min_{0 \le \alpha} \; \Big\| \sum_{i,y} \alpha_i(y)\,\big( f_i(y_i^*) - f_i(y) \big) \Big\|^2 - \sum_{i,y} \alpha_i(y)\,\ell_i(y) \qquad \text{s.t.} \;\; \forall i:\; \sum_y \alpha_i(y) = C$$

Dual Formulation for SVMs. We want to optimize (the separable case for now): minimize ‖w‖² subject to the margin constraints. This is hard because of the constraints. Solution: the method of Lagrange multipliers. The Lagrangian representation of this problem adds a multiplier for each constraint. All we've done is express the constraints as an adversary which leaves our objective alone if we obey the constraints but ruins our objective if we violate any of them.

Lagrange Duality. We start out with a constrained optimization problem and form the Lagrangian Λ. This is useful because the constrained solution is a saddle point of Λ (this is a general property): minimizing over w gives the primal problem in w, and maximizing over α gives the dual problem in α.

Dual Formulation II. Duality tells us that the min-over-w of the max-over-α has the same value as the max-over-α of the min-over-w. This is useful because if we think of the α's as constants, we have an unconstrained min in w that we can solve analytically. Then we end up with an optimization over α instead of w (easier).

Dual Formulation III. Minimize the Lagrangian for fixed α's: setting the gradient with respect to w to zero expresses w as a weighted sum of the training feature vectors. Substituting back, we have the Lagrangian as a function of only the α's.
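A worked version of this step for the familiar binary, separable SVM without a bias term (a standard derivation included here for reference, not copied from the slides; the structured case on the surrounding slides follows the same pattern):

```latex
\begin{align*}
\Lambda(w, \alpha) &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big( y_i\, w^\top x_i - 1 \big),
  \qquad \alpha_i \ge 0 \\
\nabla_w \Lambda = 0 \;\Rightarrow\; w &= \sum_i \alpha_i y_i x_i
  && \text{(w is a weighted sum of training vectors)} \\
\Lambda(\alpha) &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j
  && \text{(only dot products of inputs appear: kernelizable)}
\end{align*}
```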

Primal vs Dual SVM. Primal formulation:
$$\min_{w} \; \|w\|^2 + C \sum_i \Big[ \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big]$$
Dual formulation:
$$\min_{0 \le \alpha} \; \Big\| \sum_{i,y} \alpha_i(y)\,\big( f_i(y_i^*) - f_i(y) \big) \Big\|^2 - \sum_{i,y} \alpha_i(y)\,\ell_i(y) \qquad \text{s.t.} \;\; \forall i:\; \sum_y \alpha_i(y) = C$$

Learning SVMs (Primal). Primal formulation:
$$\min_{w} \; \|w\|^2 + C \sum_i \Big[ \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big]$$

Learning SVMs (Primal). Primal formulation:
$$\min_{w} \; \|w\|^2 + C \sum_i \Big[ \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big]$$
Loss-augmented decode:
$$\bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y)$$
$$\min_{w} \; \|w\|^2 + C \sum_i \Big[ w^\top f_i(\bar{y}) + \ell_i(\bar{y}) - w^\top f_i(y_i^*) \Big]$$
$$\nabla_w = w + C \sum_i \big( f_i(\bar{y}) - f_i(y_i^*) \big)$$
Use general subgradient descent methods! (Adagrad)
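A minimal sketch of this primal (sub)gradient computation, assuming a black-box `loss_augmented_decode(w, x, gold_y)` that returns the argmax above; plain gradient steps stand in for Adagrad:

```python
import numpy as np

def train_svm_primal(data, feat, num_feats, loss_augmented_decode,
                     C=1.0, lr=0.1, iters=100):
    """data: list of (x, gold_y). feat(x, y) -> np.array of length num_feats."""
    w = np.zeros(num_feats)
    for _ in range(iters):
        grad = w.copy()                                   # gradient of the ||w||^2 term (as on the slide)
        for x, gold_y in data:
            y_bar = loss_augmented_decode(w, x, gold_y)   # argmax_y  w . f_i(y) + loss_i(y)
            grad += C * (feat(x, y_bar) - feat(x, gold_y))
        w -= lr * grad                                    # subgradient step
    return w
```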

Learning SVMs (Dual). We want to find the α which minimize
$$\min_{0 \le \alpha} \; \Big\| \sum_{i,y} \alpha_i(y)\,\big( f_i(y_i^*) - f_i(y) \big) \Big\|^2 - \sum_{i,y} \alpha_i(y)\,\ell_i(y) \qquad \text{s.t.} \;\; \forall i:\; \sum_y \alpha_i(y) = C$$
This is a quadratic program. It can be solved with general QP or convex optimizers, but they don't scale well to large problems. Cf. maxent models, which work fine with general optimizers (e.g. CG, L-BFGS). How would a special-purpose optimizer work?

Coordinate Descent I (Dual).
$$\min_{0 \le \alpha} \; \Big\| \sum_{i,y} \alpha_i(y)\,\big( f_i(y_i^*) - f_i(y) \big) \Big\|^2 - \sum_{i,y} \alpha_i(y)\,\ell_i(y)$$
Despite all the mess, the dual objective is just a quadratic in each α_i(y). Coordinate descent: optimize one variable at a time. If the unconstrained argmin on a coordinate is negative, just clip to zero.

Coordinate Descent II (Dual). Ordinarily, treating coordinates independently is a bad idea, but here the update is very fast and simple. So we visit each axis many times, but each visit is quick. This approach works fine for the separable case. For the non-separable case, we just gain a simplex constraint, and so we need slightly more complex methods (SMO, exponentiated gradient).
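A toy illustration of this style of update: coordinate descent on a generic non-negative quadratic program, solving each one-dimensional quadratic in closed form and clipping negative minimizers to zero (not the SVM dual itself, just the mechanic):

```python
import numpy as np

def coord_descent_nonneg_quadratic(Q, b, iters=100):
    """Minimize 0.5 * a^T Q a - b^T a  subject to a >= 0, one coordinate at a time.
    Q: positive-definite np.array (nonzero diagonal), b: np.array."""
    a = np.zeros(len(b))
    for _ in range(iters):
        for k in range(len(b)):
            # holding the other coordinates fixed, the objective is a 1-D quadratic in a[k]
            residual = b[k] - Q[k] @ a + Q[k, k] * a[k]
            a[k] = max(0.0, residual / Q[k, k])   # clip a negative argmin to zero
    return a
```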

What are the Alphas? Each candidate corresponds to a primal constraint. In the solution, an α_i(y) will be: zero if that constraint is inactive, positive if that constraint is active, i.e. positive on the support vectors. The support vectors contribute to the weights: w = Σ_{i,y} α_i(y) [f_i(y_i^*) − f_i(y)].

Structure

Handwriting recognition. x: an image of a handwritten word; y: the string it spells, e.g. "brace". Sequential structure. [Slides: Taskar and Klein 05]

CFG Parsing. x: a sentence, e.g. "The screen was a sea of red"; y: its parse tree. Recursive structure.

Bilingual Word Alignment. x: a sentence pair, e.g. "What is the anticipated cost of collecting fees under the new proposal?" / "En vertu de les nouvelle propositions, quel est le côut prévu de perception de le droits?"; y: the word-to-word alignment between them. Combinatorial structure.

Structured Models. Prediction is an argmax of the score over the space of feasible outputs. Assumption: the score is a sum of local part scores, where parts = nodes, edges, productions.

Bilingual word alignment. For a pair of positions j, k in the two sentences ("What is the anticipated cost of collecting fees under the new proposal?" / "En vertu de les nouvelle propositions, quel est le côut prévu de perception de le droits?"), local features score the link: association, position, orthography.

Efficient Decoding. Common case: you have a black box which computes the argmax (at least approximately), and you want to learn w. The easiest option is the structured perceptron [Collins 02]. Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*, ...). Prediction is structured; the learning update is not.
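A sketch of the structured perceptron in this setting, with `decode(w, x)` standing in for the combinatorial black box (all names are illustrative):

```python
import numpy as np

def structured_perceptron(data, feat, num_feats, decode, epochs=5):
    """data: list of (x, gold_y). decode(w, x) returns (approximately) argmax_y w . feat(x, y)."""
    w = np.zeros(num_feats)
    for _ in range(epochs):
        for x, gold_y in data:
            pred_y = decode(w, x)          # structured prediction (DP, matching, ILP, A*, ...)
            if pred_y != gold_y:           # the learning update itself is not structured
                w += feat(x, gold_y) - feat(x, pred_y)
    return w
```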

Structured Margin (Primal). Remember our primal margin objective?
$$\min_{w} \; \|w\|^2 + C \sum_i \Big[ \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big]$$
Still applies with a structured output space!

Structured Margin (Primal). We just need an efficient loss-augmented decode:
$$\bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y)$$
$$\min_{w} \; \|w\|^2 + C \sum_i \Big[ w^\top f_i(\bar{y}) + \ell_i(\bar{y}) - w^\top f_i(y_i^*) \Big]$$
$$\nabla_w = w + C \sum_i \big( f_i(\bar{y}) - f_i(y_i^*) \big)$$
Still use general subgradient descent methods! (Adagrad)

Structured Margin (Dual). Remember the constrained version of the primal: the dual has a variable α_i(y) for every constraint here.

Full Margin: OCR. We want the gold output "brace" to outscore every other string. Equivalently: "brace" must outscore "aaaaa", "brace" must outscore "aaaab", ..., "brace" must outscore "zzzzz". That is a lot of constraints!

Parsing example. We want the gold parse of "It was red" to outscore every other tree. Equivalently: one constraint against each alternative tree (differing in structure and nonterminal labels). That is a lot of constraints!

Alignment example. We want the gold alignment of "What is the" / "Quel est le" to outscore every other alignment. Equivalently: one constraint against each alternative alignment. That is a lot of constraints!

Cutting Plane (Dual). A constraint induction method [Joachims et al 09]. It exploits the fact that the number of constraints you actually need per instance is typically very small, and it requires only (loss-augmented) primal decoding. Repeat: find the most violated constraint for an instance, add this constraint, and re-solve the (non-structured) QP (e.g. with SMO or another QP solver).
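A schematic of this loop, assuming a loss-augmented decoder that returns the most violated candidate and a generic `solve_qp` over the current working set; both are placeholders rather than a specific solver API:

```python
def cutting_plane_train(data, feat, loss, loss_augmented_decode, solve_qp,
                        w_init, epsilon=1e-3, max_rounds=50):
    """Working-set training: keep only a small set of active constraints per instance."""
    w = w_init
    working_set = {i: [] for i in range(len(data))}            # constraints kept per instance
    for _ in range(max_rounds):
        added = False
        for i, (x, gold_y) in enumerate(data):
            y_bar = loss_augmented_decode(w, x, gold_y)        # most violated constraint
            margin = w @ (feat(x, gold_y) - feat(x, y_bar))
            if loss(gold_y, y_bar) - margin > epsilon:         # violated by more than epsilon?
                working_set[i].append(y_bar)                   # add it to the working set
                added = True
        if not added:
            break                                              # no new violations: done
        w = solve_qp(data, feat, loss, working_set)            # re-solve the small QP
    return w
```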

Comparison

Option 0: Reranking [e.g. Charniak and Johnson 05]. Input x = "The screen was a sea of red." goes to a baseline parser, which produces an n-best list (e.g. n = 100); a non-structured classifier then picks the output from that list.

Reranking. Advantages: directly reduces to the non-structured case; no locality restriction on features. Disadvantages: stuck with the errors of the baseline parser; the baseline system must produce n-best lists. But feedback is possible [McClosky, Charniak, Johnson 06].
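A minimal sketch of the reranking reduction: each candidate in the n-best list is scored independently with a plain linear model, so arbitrarily global features of a candidate parse are allowed (the `feat` function and n-best format are illustrative):

```python
import numpy as np

def rerank(nbest, w, feat):
    """nbest: list of candidate parses from the baseline parser for one sentence.
    feat(y) -> np.array of features of the whole candidate (no locality restriction).
    Returns the candidate with the highest linear score."""
    return max(nbest, key=lambda y: w @ feat(y))
```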