Multiclass and Introduction to Structured Prediction

Multiclass and Introduction to Structured Prediction
David S. Rosenberg
New York University
DS-GA 1003 / CSCI-GA 2567
March 27, 2018

Contents
1 Introduction
2 Reduction to Binary Classification
3 Linear Classifiers: Binary and Multiclass
4 Multiclass Predictors
5 A Linear Multiclass Hypothesis Space
6 Linear Multiclass SVM
7 Interlude: Is This Worth the Hassle Compared to One-vs-All?
8 Introduction to Structured Prediction

Introduction

Multiclass Setting
Input space: X. Output space: Y = {1, ..., k}.
Our approaches to multiclass problems so far: multinomial / softmax logistic regression. Soon: trees and random forests.
Today we consider linear methods specifically designed for multiclass problems. But the main takeaway will be an approach that generalizes to situations where k is exponentially large, i.e. too large to enumerate.

Reduction to Binary Classification

One-vs-All / One-vs-Rest
[Figure: one-vs-all decision regions. Plot courtesy of David Sontag.]

One-vs-All / One-vs-Rest
Train k binary classifiers, one for each class: train the ith classifier to distinguish class i from the rest.
Suppose h_1, ..., h_k : X → R are our binary classifiers. They can output hard classifications in {−1, 1} or scores in R.
Final prediction is
h(x) = argmax_{i ∈ {1,...,k}} h_i(x).
Ties can be broken arbitrarily.
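To make the reduction concrete, here is a minimal one-vs-all sketch in Python (my own illustration, not the lecture's code). It assumes a hypothetical binary learner train_binary(X, z) that takes ±1 labels and returns a real-valued score function.

```python
import numpy as np

def train_one_vs_all(X, y, k, train_binary):
    """Train k binary scorers; the i-th one distinguishes class i from the rest."""
    return [train_binary(X, np.where(y == i, 1, -1)) for i in range(k)]

def predict_one_vs_all(scorers, X):
    """Predict argmax_i h_i(x); ties are broken arbitrarily by np.argmax."""
    scores = np.column_stack([h(X) for h in scorers])  # shape (n, k)
    return np.argmax(scores, axis=1)
```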

Linear Classifiers: Binary and Multiclass

Linear Binary Classifier Review
Input space: X = R^d. Output space: Y = {−1, 1}.
Linear classifier score function: f(x) = ⟨w, x⟩ = wᵀx.
Final classification prediction: sign(f(x)).
Geometrically, when is sign(f(x)) = +1 and when is sign(f(x)) = −1?

Linear Binary Classifier Review
Suppose ‖w‖ > 0 and ‖x‖ > 0. Then
f(x) = ⟨w, x⟩ = ‖w‖ ‖x‖ cos θ,
where θ is the angle between w and x. So
f(x) > 0  ⟺  cos θ > 0  ⟺  θ ∈ (−90°, 90°)
f(x) < 0  ⟺  cos θ < 0  ⟺  θ ∉ [−90°, 90°]

Three Class Example
Base hypothesis space: H = {f(x) = wᵀx | w ∈ R²}.
Note: the separating boundary always contains the origin.
Example based on Shalev-Shwartz and Ben-David's Understanding Machine Learning, Section 17.1.

Three Class Example: One-vs-Rest
Class 1 vs Rest: f_1(x) = w_1ᵀx.

Three Class Example: One-vs-Rest
Examine Class 2 vs Rest: it predicts everything to be "Not 2". If it predicted some points as 2, it would get many more "Not 2" points incorrect.

One-vs-Rest: Predictions
Score for class i is
f_i(x) = ⟨w_i, x⟩ = ‖w_i‖ ‖x‖ cos θ_i,
where θ_i is the angle between x and w_i.
Predict the class i that has the highest f_i(x).

One-vs-Rest: Class Boundaries
For simplicity, we've assumed ‖w_1‖ = ‖w_2‖ = ‖w_3‖. Then ‖w_i‖ ‖x‖ is the same for all scores,
so x is classified by whichever w_i has the largest cos θ_i (i.e. θ_i closest to 0).

One-vs-Rest: Class Boundaries
This approach doesn't work well in this instance. How can we fix this?

The Linear Multiclass Hypothesis Space
Base hypothesis space: H = {x ↦ wᵀx | w ∈ R^d}.
Linear multiclass hypothesis space (for k classes):
F = {x ↦ argmax_i h_i(x) | h_1, ..., h_k ∈ H}.
What's the action space here?

One-vs-Rest: Class Boundaries
Recall: a learning algorithm chooses the hypothesis from the hypothesis space.
Is this a failure of the hypothesis space or of the learning algorithm?

A Solution with Linear Functions
This works... so the problem is not with the hypothesis space. How can we get a solution like this?

Multiclass Predictors

Multiclass Hypothesis Space
Base hypothesis space: H = {h : X → R} ("score functions").
Multiclass hypothesis space (for k classes):
F = {x ↦ argmax_i h_i(x) | h_1, ..., h_k ∈ H}.
h_i(x) scores how likely x is to be from class i.
Issue: we need to learn (and represent) k functions. This doesn't scale to very large k.

Multiclass Hypothesis Space: Reframed
General [discrete] output space: Y (e.g. Y = {1, ..., k} for multiclass).
New idea: rather than a score function for each class, use one function h(x, y) that gives a compatibility score between input x and output y.
Final prediction is the y ∈ Y that is most compatible with x:
f(x) = argmax_{y ∈ Y} h(x, y).
This subsumes the framework with class-specific score functions: given class-specific score functions h_1, ..., h_k, we could define the compatibility function as h(x, i) = h_i(x), for i = 1, ..., k.
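To illustrate that last point, here is a tiny sketch (my own, with hypothetical names) wrapping class-specific scorers into a single compatibility function and predicting with an argmax over Y:

```python
def as_compatibility(scorers):
    """Wrap class-specific scorers h_1, ..., h_k into one function h(x, y) = h_y(x)."""
    return lambda x, y: scorers[y](x)

def predict(h, x, label_set):
    """f(x) = argmax over y in label_set of the compatibility score h(x, y)."""
    return max(label_set, key=lambda y: h(x, y))
```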

Multiclass Hypothesis Space: Reframed
General [discrete] output space: Y.
Base hypothesis space: H = {h : X × Y → R}, where h(x, y) gives the compatibility score between input x and output y.
Multiclass hypothesis space:
F = {x ↦ argmax_{y ∈ Y} h(x, y) | h ∈ H}.
The final prediction function is an f ∈ F. For each f ∈ F there is an underlying compatibility score function h ∈ H.

Learning in a Multiclass Hypothesis Space: In Words
Base hypothesis space: H = {h : X × Y → R}.
Training data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
The learning process chooses h ∈ H.
We want the compatibility h(x, y) to be large when x has label y, and small otherwise.

Learning in a Multiclass Hypothesis Space: In Math
h(x, y) classifies (x_i, y_i) correctly iff
h(x_i, y_i) > h(x_i, y) for all y ≠ y_i,
i.e. h should give a higher score for the correct y_i than for every other y ∈ Y. An equivalent condition is
h(x_i, y_i) > max_{y ≠ y_i} h(x_i, y).
If we define
m_i = h(x_i, y_i) − max_{y ≠ y_i} h(x_i, y),
then classification is correct iff m_i > 0. Generally we want m_i to be large. Sound familiar?
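A minimal numeric check of this margin condition (my own sketch, assuming a compatibility function h(x, y) is available):

```python
def margin(h, x_i, y_i, label_set):
    """m_i = h(x_i, y_i) - max_{y != y_i} h(x_i, y); positive iff (x_i, y_i) is classified correctly."""
    best_other = max(h(x_i, y) for y in label_set if y != y_i)
    return h(x_i, y_i) - best_other
```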

A Linear Multiclass Hypothesis Space

Linear Multiclass Prediction Function
A linear class-sensitive score function is given by
h(x, y) = ⟨w, Ψ(x, y)⟩,
where Ψ : X × Y → R^d is a class-sensitive feature map.
Ψ(x, y) extracts features relevant to how compatible y is with x. The final compatibility score is a linear function of Ψ(x, y).
Linear multiclass hypothesis space:
F = {x ↦ argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩ | w ∈ R^d}.

Example: X = R², Y = {1, 2, 3}
w_1 = (√2/2, √2/2), w_2 = (0, 1), w_3 = (−√2/2, √2/2).
Prediction function: (x_1, x_2) ↦ argmax_{i ∈ {1,2,3}} ⟨w_i, (x_1, x_2)⟩.
How can we get this into the form x ↦ argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩?

The Multivector Construction
What if we stack the w_i's together:
w = (√2/2, √2/2, 0, 1, −√2/2, √2/2) = (w_1, w_2, w_3),
and then define Ψ : R² × {1, 2, 3} → R⁶ by
Ψ(x, 1) := (x_1, x_2, 0, 0, 0, 0)
Ψ(x, 2) := (0, 0, x_1, x_2, 0, 0)
Ψ(x, 3) := (0, 0, 0, 0, x_1, x_2)?
Then ⟨w, Ψ(x, y)⟩ = ⟨w_y, x⟩, which is what we want.
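A short numpy sketch of the multivector construction (my own illustration; the weight values simply mirror the running example, and classes are indexed 0, 1, 2):

```python
import numpy as np

def psi_multivector(x, y, k):
    """Class-sensitive feature map: place x in the block belonging to class y."""
    d = len(x)
    out = np.zeros(k * d)
    out[y * d:(y + 1) * d] = x
    return out

w = np.array([np.sqrt(2)/2, np.sqrt(2)/2, 0.0, 1.0, -np.sqrt(2)/2, np.sqrt(2)/2])
x = np.array([1.0, 0.2])
scores = [w @ psi_multivector(x, y, k=3) for y in range(3)]  # <w, Psi(x,y)> = <w_y, x>
print(np.argmax(scores))  # index of the predicted class
```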

NLP Example: Part-of-Speech Classification
X = {all possible words}. Y = {NOUN, VERB, ADJECTIVE, ADVERB, ARTICLE, PREPOSITION}.
Features of x ∈ X: [the word itself], ENDS_IN_ly, ENDS_IN_ness, ...
Ψ(x, y) = (ψ_1(x, y), ψ_2(x, y), ψ_3(x, y), ..., ψ_d(x, y)):
ψ_1(x, y) = 1(x = apple AND y = NOUN)
ψ_2(x, y) = 1(x = run AND y = NOUN)
ψ_3(x, y) = 1(x = run AND y = VERB)
ψ_4(x, y) = 1(x ENDS_IN_ly AND y = ADVERB)
...
e.g. Ψ(x = run, y = NOUN) = (0, 1, 0, 0, ...).
After training, what would you guess the corresponding w_1, w_2, w_3, w_4 to be?

NLP Example: How Does It Work?
Ψ(x, y) = (ψ_1(x, y), ψ_2(x, y), ψ_3(x, y), ..., ψ_d(x, y)) ∈ R^d:
ψ_1(x, y) = 1(x = apple AND y = NOUN)
ψ_2(x, y) = 1(x = run AND y = NOUN)
...
After training, we've learned w ∈ R^d. Say w = (5, −3, 1, 4, ...).
To predict the label for x = apple, we compute a compatibility score for each y ∈ Y:
⟨w, Ψ(apple, NOUN)⟩, ⟨w, Ψ(apple, VERB)⟩, ⟨w, Ψ(apple, ADVERB)⟩, ...
and predict the class that gives the highest score.
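Here is a small dictionary-based sketch of this kind of scoring (my own example; the feature templates and weights are made up for illustration):

```python
def pos_features(word, tag):
    """Class-sensitive indicator features, mirroring the psi_j templates above."""
    feats = {f"word={word}&tag={tag}": 1.0}
    if word.endswith("ly"):
        feats[f"ENDS_IN_ly&tag={tag}"] = 1.0
    return feats

def score(w, word, tag):
    """Compatibility score <w, Psi(x, y)> with a sparse feature representation."""
    return sum(w.get(name, 0.0) * val for name, val in pos_features(word, tag).items())

# Hypothetical learned weights, just to show the argmax prediction.
w = {"word=apple&tag=NOUN": 5.0, "word=run&tag=NOUN": -3.0, "word=run&tag=VERB": 1.0}
tags = ["NOUN", "VERB", "ADJECTIVE", "ADVERB", "ARTICLE", "PREPOSITION"]
print(max(tags, key=lambda t: score(w, "run", t)))  # -> "VERB"
```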

Another Approach: Use Label Features
What if we have a very large number of classes? Make features for the classes themselves. This is common in advertising:
X: user and user context.
Y: a large set of banner ads.
Suppose user x is shown many banner ads. We want to predict which one the user will click on.
Possible compatibility features:
ψ_1(x, y) = 1(x is interested in sports AND y is relevant to sports)
ψ_2(x, y) = 1(x is in the target demographic group of y)
ψ_3(x, y) = 1(x previously clicked on an ad from the company sponsoring y)

Linear Multiclass SVM

The Margin for Multiclass
Let h : X × Y → R be our compatibility score function. Define a margin between the correct class and each other class:
Definition. The [class-specific] margin of score function h on the ith example (x_i, y_i), for class y, is
m_{i,y}(h) = h(x_i, y_i) − h(x_i, y).
We want m_{i,y}(h) to be large and positive for all y ≠ y_i.
For our linear hypothesis space, the margin is
m_{i,y}(w) = ⟨w, Ψ(x_i, y_i)⟩ − ⟨w, Ψ(x_i, y)⟩.

Multiclass SVM with Hinge Loss
Recall the binary SVM (without bias term):
min_{w ∈ R^d} (1/2)‖w‖² + (c/n) Σ_{i=1}^n max(0, 1 − y_i wᵀx_i),
where y_i wᵀx_i is the margin on example i.
Multiclass SVM (Version 1):
min_{w ∈ R^d} (1/2)‖w‖² + (c/n) Σ_{i=1}^n max_{y ≠ y_i} max(0, 1 − m_{i,y}(w)),
where m_{i,y}(w) = ⟨w, Ψ(x_i, y_i)⟩ − ⟨w, Ψ(x_i, y)⟩.
As in the binary SVM, we've taken the value 1 as our target margin for each i, y.

Class-Sensitive Loss
In multiclass problems, some misclassifications may be worse than others. Rather than the 0/1 loss, we may be interested in a more general loss Δ : Y × A → [0, ∞). We can use this as our target margin for the multiclass SVM.
Multiclass SVM (Version 2):
min_{w ∈ R^d} (1/2)‖w‖² + (c/n) Σ_{i=1}^n max_{y ≠ y_i} max(0, Δ(y_i, y) − m_{i,y}(w)).
If the margin m_{i,y}(w) meets or exceeds its target Δ(y_i, y) for all y ≠ y_i, then there is no loss on example i.
Note: if Δ(y, y) = 0 for all y ∈ Y, then we can replace max_{y ≠ y_i} with max_y.
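A numpy sketch of the (unregularized) generalized hinge term from Version 2, under the multivector feature map; Version 1 is recovered by taking Δ(y, y') = 1 for all y' ≠ y. This is my own illustration, not the lecture's code.

```python
import numpy as np

def generalized_hinge(w, psi, x_i, y_i, labels, delta):
    """max_{y != y_i} max(0, Delta(y_i, y) - m_{i,y}(w)),
    where m_{i,y}(w) = <w, Psi(x_i, y_i)> - <w, Psi(x_i, y)>."""
    correct = w @ psi(x_i, y_i)
    return max(max(0.0, delta(y_i, y) - (correct - w @ psi(x_i, y)))
               for y in labels if y != y_i)

def psi_multivector(x, y, k=3):
    """Multivector feature map for k classes (indexed 0..k-1) with x in R^d."""
    out = np.zeros(k * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

w = np.zeros(6)                                   # hypothetical weight vector
x_i, y_i = np.array([1.0, 0.2]), 0
loss_v1 = generalized_hinge(w, psi_multivector, x_i, y_i, range(3), lambda y, yp: 1.0)
```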

Interlude: Is This Worth the Hassle Compared to One-vs-All?

Recap: What Have We Got?
Problem: multiclass classification with Y = {1, ..., k}.
Solution 1: One-vs-All. Train k models h_1(x), ..., h_k(x) : X → R and predict with argmax_{y ∈ Y} h_y(x). We gave a simple example where this fails for linear classifiers.
Solution 2: Multiclass. Train one model h(x, y) : X × Y → R; prediction involves solving argmax_{y ∈ Y} h(x, y).

Does It Work Better in Practice?
Paper by Rifkin and Klautau: "In Defense of One-Vs-All Classification" (2004).
Extensive experiments, carefully done, albeit on relatively small UCI datasets.
Suggests one-vs-all works just as well in practice (or at least, the advantages claimed by earlier papers for multiclass methods were not compelling).
Compared many multiclass frameworks (including the one we discuss), one-vs-all for SVMs with an RBF kernel, and one-vs-all for square loss with an RBF kernel (for classification!). All performed roughly the same.

Why Are We Bothering with Multiclass?
The framework we have developed for multiclass (compatibility features / score functions, the multiclass margin, the target margin) generalizes to situations where k is very large and one-vs-all is intractable.
The key point is that we can generalize across outputs y by using features of y.

Introduction to Structured Prediction

Part-of-Speech (POS) Tagging
Given a sentence, give a part-of-speech tag for each word:
x = (x_0, x_1, x_2, x_3) = ([START], He, eats, apples)
y = (y_0, y_1, y_2, y_3) = ([START], Pronoun, Verb, Noun)
V = {all English words} ∪ {[START], "."}
P = {[START], Pronoun, Verb, Noun, Adjective}
X = V^n, n = 1, 2, 3, ... [word sequences of any length]
Y = P^n, n = 1, 2, 3, ... [part-of-speech sequences of any length]

Structured Prediction
A structured prediction problem is a multiclass problem in which Y is very large, but has (or we assume it has) a certain structure.
For POS tagging, Y grows exponentially in the length of the sentence.
Typical structure assumption: the POS labels form a Markov chain, i.e. the distribution of y_{n+1} given y_n, y_{n−1}, ..., y_0 is the same as the distribution of y_{n+1} given y_n alone.

Local Feature Functions: Type 1
A type 1 local feature depends only on the label at a single position, say y_i (the label of the ith word), and on x at any position.
Examples:
φ_1(i, x, y_i) = 1(x_i = runs) 1(y_i = Verb)
φ_2(i, x, y_i) = 1(x_i = runs) 1(y_i = Noun)
φ_3(i, x, y_i) = 1(x_{i−1} = He) 1(x_i = runs) 1(y_i = Verb)

Local Feature Functions: Type 2
A type 2 local feature depends only on the labels at 2 consecutive positions, y_{i−1} and y_i, and on x at any position.
Examples:
θ_1(i, x, y_{i−1}, y_i) = 1(y_{i−1} = Pronoun) 1(y_i = Verb)
θ_2(i, x, y_{i−1}, y_i) = 1(y_{i−1} = Pronoun) 1(y_i = Noun)

Local Feature Vector and Compatibility Score
At each position i in the sequence, define the local feature vector
Ψ_i(x, y_{i−1}, y_i) = (φ_1(i, x, y_i), φ_2(i, x, y_i), ..., θ_1(i, x, y_{i−1}, y_i), θ_2(i, x, y_{i−1}, y_i), ...).
The local compatibility score for (x, y) at position i is ⟨w, Ψ_i(x, y_{i−1}, y_i)⟩.

Sequence Compatibility Score
The compatibility score for the pair of sequences (x, y) is the sum of the local compatibility scores:
Σ_i ⟨w, Ψ_i(x, y_{i−1}, y_i)⟩ = ⟨w, Σ_i Ψ_i(x, y_{i−1}, y_i)⟩ = ⟨w, Ψ(x, y)⟩,
where we define the sequence feature vector by
Ψ(x, y) = Σ_i Ψ_i(x, y_{i−1}, y_i).
So we see this is a special case of linear multiclass prediction.
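A small sketch of the sequence feature vector and score, using sparse dictionaries and made-up type 1 / type 2 indicator features (my own illustration, not the lecture's code):

```python
from collections import Counter

def local_features(i, x, y_prev, y_cur):
    """Psi_i(x, y_{i-1}, y_i): a word/tag indicator (type 1) and a tag-bigram indicator (type 2)."""
    return Counter({f"word={x[i]}&tag={y_cur}": 1.0,
                    f"prev={y_prev}&tag={y_cur}": 1.0})

def sequence_features(x, y):
    """Psi(x, y) = sum_i Psi_i(x, y_{i-1}, y_i)."""
    total = Counter()
    for i in range(1, len(x)):
        total.update(local_features(i, x, y[i - 1], y[i]))
    return total

def sequence_score(w, x, y):
    """<w, Psi(x, y)> as a sparse dot product."""
    return sum(w.get(name, 0.0) * val for name, val in sequence_features(x, y).items())

x = ["[START]", "He", "eats", "apples"]
y = ["[START]", "Pronoun", "Verb", "Noun"]
w = {"word=eats&tag=Verb": 2.0, "prev=Pronoun&tag=Verb": 1.0}  # hypothetical weights
print(sequence_score(w, x, y))  # -> 3.0
```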

Sequence Target Loss
How do we assess the loss of a predicted sequence y' on example (x, y)?
Hamming loss is common:
Δ(y, y') = (1/|y|) Σ_{i=1}^{|y|} 1(y_i ≠ y'_i).
We could generalize this as
Δ(y, y') = (1/|y|) Σ_{i=1}^{|y|} δ(y_i, y'_i).
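For concreteness, a one-line normalized Hamming loss between two equal-length label sequences (my own sketch):

```python
def hamming_loss(y_true, y_pred):
    """Delta(y, y') = (1/|y|) * sum_i 1(y_i != y'_i)."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)
```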

What Remains to Be Done?
To compute predictions, we need to find
argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩.
This is straightforward when Y is small; now Y is exponentially large.
Because Ψ breaks down into local functions depending only on 2 adjacent labels, we can solve this efficiently using dynamic programming (similar to Viterbi decoding).
Learning can be done with SGD and a similar dynamic program.
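To make the dynamic program concrete, here is a Viterbi-style sketch (my own illustration, not the lecture's code) for the argmax over tag sequences when the score decomposes into local scores ⟨w, Ψ_i(x, y_{i−1}, y_i)⟩. It takes an arbitrary local_score(i, x, y_prev, y_cur) function, such as a sparse dot product built from the local features sketched above.

```python
def viterbi_decode(x, tags, local_score, start="[START]"):
    """argmax over tag sequences of sum_i local_score(i, x, y_{i-1}, y_i),
    with position 0 fixed to the start symbol. Runs in O(n * |tags|^2)."""
    best = {start: (0.0, [start])}          # tag -> (best score so far, best path)
    for i in range(1, len(x)):
        new_best = {}
        for t in tags:
            prev, score = max(((p, s + local_score(i, x, p, t))
                               for p, (s, _) in best.items()),
                              key=lambda pair: pair[1])
            new_best[t] = (score, best[prev][1] + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]

# Example usage with the sparse scorer from the previous sketch:
# local_score = lambda i, x, yp, yc: sum(w.get(name, 0.0) * val
#                                        for name, val in local_features(i, x, yp, yc).items())
# viterbi_decode(["[START]", "He", "eats", "apples"], ["Pronoun", "Verb", "Noun"], local_score)
```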