Multiclass Classification

David Rosenberg, New York University
DS-GA 1003, March 7, 2017

Introduction

Multiclass Setting
Input space: X
Output space: Y = {1,...,k}
Today we consider linear methods specifically designed for the multiclass setting.

Reduction to Binary Classification

One-vs-All / One-vs-Rest
(Figure: plot courtesy of David Sontag.)

One-vs-All / One-vs-Rest
Train k binary classifiers, one for each class.
Train the ith classifier to distinguish class i from the rest.
Suppose h_1,...,h_k : X → R are our binary classifiers. They can output hard classifications in {-1, 1} or scores in R.
The final prediction is
    h(x) = argmax_{i ∈ {1,...,k}} h_i(x)
Ties can be broken arbitrarily.
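A minimal sketch of this reduction in Python. The least-squares "binary trainer" and the toy data are illustrative stand-ins for any binary learner that returns a real-valued scorer:

```python
import numpy as np

def train_one_vs_all(X, y, k, train_binary):
    """Train k binary scorers; scorer i treats class i as +1 and all other classes as -1."""
    return [train_binary(X, np.where(y == i, 1.0, -1.0)) for i in range(k)]

def predict_one_vs_all(scorers, x):
    """Predict the class whose binary scorer assigns the highest real-valued score."""
    scores = [h(x) for h in scorers]
    return int(np.argmax(scores))  # ties broken arbitrarily (first maximum)

# Toy "binary trainer": least squares on +/-1 labels, returning a scoring function.
def train_binary_ls(X, y_pm):
    w, *_ = np.linalg.lstsq(X, y_pm, rcond=None)
    return lambda x, w=w: x @ w

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([0, 1, 2])
scorers = train_one_vs_all(X, y, k=3, train_binary=train_binary_ls)
print(predict_one_vs_all(scorers, np.array([0.9, -0.1])))  # -> 0
```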

Linear Classifiers: Binary and Multiclass

Linear Binary Classifier Review
Input space: X = R^d
Output space: Y = {-1, 1}
Linear classifier score function: f(x) = ⟨w, x⟩ = w^T x
Final classification prediction: sign(f(x))
Geometrically, when is sign(f(x)) = +1 and when is sign(f(x)) = -1?

Linear Binary Classifier Review
Suppose ‖w‖ > 0 and ‖x‖ > 0:
    f(x) = ⟨w, x⟩ = ‖w‖ ‖x‖ cos θ
    f(x) > 0 ⟺ cos θ > 0 ⟺ θ ∈ (-90°, 90°)
    f(x) < 0 ⟺ cos θ < 0 ⟺ θ ∉ [-90°, 90°]

Three Class Example
Base hypothesis space: H = {f(x) = w^T x | w ∈ R^2}.
Note: the separating boundary always contains the origin.
Example based on Shalev-Shwartz and Ben-David's Understanding Machine Learning, Section 17.1.

Three Class Example: One-vs-Rest
Class 1 vs Rest: f_1(x) = w_1^T x

Three Class Example: One-vs-Rest
Examine Class 2 vs Rest: it predicts everything to be "Not 2".
If it predicted some points as 2, it would get many more "Not 2" points wrong.

One-vs-Rest: Predictions
Score for class i is f_i(x) = ⟨w_i, x⟩ = ‖w_i‖ ‖x‖ cos θ_i, where θ_i is the angle between x and w_i.
Predict the class i with the highest f_i(x).

One-vs-Rest: Class Boundaries
For simplicity, we've assumed ‖w_1‖ = ‖w_2‖ = ‖w_3‖.
Then ‖w_i‖ ‖x‖ is the same for every score, so x is classified by whichever w_i has the largest cos θ_i (i.e. θ_i closest to 0).

One-vs-Rest: Class Boundaries
This approach doesn't work well in this instance.
Can we fix it by changing our base hypothesis space?

The Linear Multiclass Hypothesis Space
Base hypothesis space: H = {x ↦ w^T x | w ∈ R^d}.
Linear multiclass hypothesis space (for k classes):
    F = { x ↦ argmax_i h_i(x) | h_1,...,h_k ∈ H }
What's the action space here?

One-vs-Rest: Class Boundaries
Is this a failure of the hypothesis space or of the learning algorithm?
(A learning algorithm chooses the hypothesis from the hypothesis space.)

A Solution with Linear Functions
This works... so the problem is not with the hypothesis space.
How can we get a solution like this?

Multiclass Predictors

Multiclass Hypothesis Space
Base hypothesis space: H = {h : X → R} ("score functions").
Multiclass hypothesis space (for k classes):
    F = { x ↦ argmax_i h_i(x) | h_1,...,h_k ∈ H }
h_i(x) scores how likely x is to be from class i.

Multiclass Hypothesis Space: Reframed
A slight reframing turns out to be more convenient down the line.
General output space: Y, e.g. Y = {1,...,k} for multiclass.
Base hypothesis space: H = {h : X × Y → R}, where h gives a compatibility score between an input x and an output y.
Multiclass hypothesis space:
    F = { x ↦ argmax_{y ∈ Y} h(x, y) | h ∈ H }
Now we're back to a single score function: it takes x and y and evaluates their compatibility.

Learning in a Multiclass Hypothesis Space: In Words
Base hypothesis space: H = {h : X × Y → R}
Training data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
The learning process chooses some h ∈ H. What type of h do we want?
We want h(x, y) to be large when x has label y, and small otherwise.

Learning in a Multiclass Hypothesis Space: In Math
h classifies (x_i, y_i) correctly iff
    h(x_i, y_i) > h(x_i, y) for all y ≠ y_i,
i.e. h should give a higher score to the correct y than to every other y ∈ Y. An equivalent condition is
    h(x_i, y_i) > max_{y ≠ y_i} h(x_i, y).
First idea for an objective function:
    min_{h ∈ H} Σ_{i=1}^n ℓ( h(x_i, y_i) - max_{y ≠ y_i} h(x_i, y) )
for some loss function ℓ.
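As a concrete illustration, here is a minimal sketch of this objective in Python; the score function h, the choice of loss ℓ, and the data are hypothetical:

```python
import numpy as np

def multiclass_objective(h, data, classes, loss):
    """Sum over examples of loss( h(x_i, y_i) - max_{y != y_i} h(x_i, y) )."""
    total = 0.0
    for x, y_true in data:
        margin = h(x, y_true) - max(h(x, y) for y in classes if y != y_true)
        total += loss(margin)
    return total

hinge = lambda m: max(0.0, 1.0 - m)  # one possible choice of loss on the margin
# Hypothetical score function: class 0 likes direction (1, -1), class 1 the opposite.
h = lambda x, y: float(x @ np.array([1.0, -1.0])) if y == 0 else float(x @ np.array([-1.0, 1.0]))
data = [(np.array([2.0, 0.0]), 0), (np.array([0.0, 2.0]), 1)]
print(multiclass_objective(h, data, classes=[0, 1], loss=hinge))  # 0.0: both margins exceed 1
```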

Linear Hypothesis Space

Linear Multiclass Prediction Function
A linear class-sensitive score function is given by
    h(x, y) = ⟨w, Ψ(x, y)⟩,
where Ψ : X × Y → R^d is a class-sensitive feature map.
Linear multiclass hypothesis space:
    F = { x ↦ argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩ | w ∈ R^d }
Ψ(x, y) extracts features relevant to how compatible y is with x.
The final compatibility score must be a linear function of Ψ(x, y).

Example: X = R^2, Y = {1, 2, 3}
    w_1 = (-√2/2, √2/2),  w_2 = (0, 1),  w_3 = (√2/2, √2/2)
Prediction function: (x_1, x_2) ↦ argmax_{i ∈ {1,2,3}} ⟨w_i, (x_1, x_2)⟩.
How can we get this into the form x ↦ argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩?

The Multivector Construction
What if we stack the w_i's together:
    w = (w_1, w_2, w_3) = (-√2/2, √2/2, 0, 1, √2/2, √2/2)
And then define Ψ : R^2 × {1,2,3} → R^6 by
    Ψ(x, 1) := (x_1, x_2, 0, 0, 0, 0)
    Ψ(x, 2) := (0, 0, x_1, x_2, 0, 0)
    Ψ(x, 3) := (0, 0, 0, 0, x_1, x_2)
Then ⟨w, Ψ(x, y)⟩ = ⟨w_y, x⟩, which is what we want.
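A minimal sketch of the multivector construction in Python; the weight values mirror the example above, and the test point is arbitrary:

```python
import numpy as np

def psi_multivector(x, y, k):
    """Class-sensitive feature map: copy x into the block of coordinates owned by class y."""
    d = len(x)
    out = np.zeros(k * d)
    out[y * d:(y + 1) * d] = x
    return out

w1 = np.array([-np.sqrt(2) / 2, np.sqrt(2) / 2])
w2 = np.array([0.0, 1.0])
w3 = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2])
w = np.concatenate([w1, w2, w3])  # stacked weight vector

x = np.array([0.3, 2.0])
for y, wy in enumerate([w1, w2, w3]):
    # <w, Psi(x, y)> recovers the per-class score <w_y, x>.
    assert np.isclose(w @ psi_multivector(x, y, k=3), wy @ x)
scores = [w @ psi_multivector(x, y, k=3) for y in range(3)]
print(int(np.argmax(scores)) + 1)  # predicted class in {1, 2, 3}; here -> 2
```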

Natural Language Processing Example
X = {all possible words}.
Y = {NOUN, VERB, ADJECTIVE, ADVERB, ARTICLE, PREPOSITION}.
Features of x ∈ X: [the word itself], ENDS_IN_ly, ENDS_IN_ness, ...
Ψ(x, y) = (ψ_1(x,y), ψ_2(x,y), ψ_3(x,y), ..., ψ_d(x,y)):
    ψ_1(x,y) = 1(x = apple AND y = NOUN)
    ψ_2(x,y) = 1(x = run AND y = NOUN)
    ψ_3(x,y) = 1(x = run AND y = VERB)
    ψ_4(x,y) = 1(x ENDS_IN_ly AND y = ADVERB)
    ...
e.g. Ψ(x = run, y = NOUN) = (0, 1, 0, 0, ...)
After training, what would you guess the corresponding w_1, w_2, w_3, w_4 to be?

NLP Example: How Does It Work?
Ψ(x, y) = (ψ_1(x,y), ψ_2(x,y), ψ_3(x,y), ..., ψ_d(x,y)) ∈ R^d:
    ψ_1(x,y) = 1(x = apple AND y = NOUN)
    ψ_2(x,y) = 1(x = run AND y = NOUN)
    ...
After training, we've learned w ∈ R^d; say w = (5, 3, 1, 4, ...).
To predict the label for x = apple, we compute a score for each y ∈ Y:
    ⟨w, Ψ(apple, NOUN)⟩, ⟨w, Ψ(apple, VERB)⟩, ⟨w, Ψ(apple, ADVERB)⟩, ...
and predict the class that gives the highest score.
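A small sketch of how scoring with these indicator features might look, using Python dicts as sparse vectors; the feature names and the weights are illustrative, not learned:

```python
def psi(x, y):
    """Class-sensitive indicator features for a single word x and candidate tag y."""
    feats = {f"word={x}_AND_tag={y}": 1.0}
    if x.endswith("ly"):
        feats[f"ENDS_IN_ly_AND_tag={y}"] = 1.0
    return feats

def score(w, feats):
    """Sparse dot product <w, Psi(x, y)>."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

w = {"word=run_AND_tag=NOUN": 3.0, "word=run_AND_tag=VERB": 5.0,
     "ENDS_IN_ly_AND_tag=ADVERB": 4.0}  # pretend these weights were learned
tags = ["NOUN", "VERB", "ADJECTIVE", "ADVERB"]
print(max(tags, key=lambda y: score(w, psi("run", y))))      # -> VERB
print(max(tags, key=lambda y: score(w, psi("quickly", y))))  # -> ADVERB
```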

Another Approach: Use Label Features
What if we have a very large number of classes? Make features for the classes.
This is common in advertising:
    X: user and user context
    Y: a large set of banner ads
Suppose user x is shown many banner ads; we want to predict which one the user will click on.
Possible features:
    ψ_1(x,y) = 1(x is interested in sports AND y is relevant to sports)
    ψ_2(x,y) = 1(x is in the target demographic group of y)
    ψ_3(x,y) = 1(x previously clicked on an ad from the company sponsoring y)

TF-IDF Features for News Article Classification
X = {news articles}
Y = {politics, sports, entertainment, world news, local news}  [topics]
We want to use the words in article x to predict its topic y.
The term frequency of word w in document x, denoted TF(w, x), is the number of times word w occurs in document x.

TF-IDF Features for News Article Classification
The document frequency of word w with respect to class y, denoted DF(w, y), is the number of documents containing word w that are NOT in topic y.
The TF-IDF feature for word w is then defined as
    ψ_w(x, y) = TF(w, x) · log( m / DF(w, y) ),
where m is the total number of documents in the training set.
(Note: there are many other variations of TF-IDF.)
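A minimal sketch of this feature with made-up counts; here df[(word, y)] stands for DF(w, y), the number of training documents containing the word that are outside class y:

```python
import math

def tf(word, doc):
    """Term frequency: number of times `word` occurs in document `doc` (a list of tokens)."""
    return doc.count(word)

def tfidf_feature(word, doc, y, df, m):
    """psi_w(x, y) = TF(word, x) * log(m / DF(word, y)); one of many TF-IDF variants."""
    return tf(word, doc) * math.log(m / df[(word, y)])

m = 1000  # total number of training documents (toy number)
df = {("election", "politics"): 20, ("election", "sports"): 400}  # docs with the word NOT in class y
doc = "the election results of the election".split()
print(tfidf_feature("election", doc, "politics", df, m))  # 2 * log(50): word concentrated in politics
print(tfidf_feature("election", doc, "sports", df, m))    # 2 * log(2.5): much weaker for sports
```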

TF-IDF: Things to Note
Suppose we have d words in our vocabulary and k topic classes, with a TF-IDF feature for each word (and no other features). What's the dimension of Ψ(x, y)? We have one TF-IDF feature per word.
Recall our multivector-style NLP features:
    ψ_2(x,y) = 1(x = run AND y = NOUN)
    ψ_3(x,y) = 1(x = run AND y = VERB)
If we made these TF-IDF style, they would look like
    ψ_{x=run}(x, y) = 1(x = run) · (compatibility of "run" with class y)

Linear Multiclass SVM

The Margin for Multiclass
Let h : X × Y → R be our class-sensitive score function. Define a margin between the correct class and each other class:
Definition. The margin of score function h on the ith example (x_i, y_i) for class y is
    m_{i,y}(h) = h(x_i, y_i) - h(x_i, y).
We want m_{i,y}(h) to be large and positive for all y ≠ y_i.
For our linear hypothesis space, the margin is
    m_{i,y}(w) = ⟨w, Ψ(x_i, y_i)⟩ - ⟨w, Ψ(x_i, y)⟩.

Multiclass SVM with Hinge Loss
Recall the binary SVM (without bias term), writing (z)_+ = max(0, z):
    min_{w ∈ R^d} (1/2)‖w‖^2 + (c/n) Σ_{i=1}^n ( 1 - y_i w^T x_i )_+     [y_i w^T x_i is the margin]
Multiclass SVM (Version 1):
    min_{w ∈ R^d} (1/2)‖w‖^2 + (c/n) Σ_{i=1}^n max_{y ≠ y_i} ( 1 - m_{i,y}(w) )_+
where m_{i,y}(w) = ⟨w, Ψ(x_i, y_i)⟩ - ⟨w, Ψ(x_i, y)⟩.

Class-Sensitive Loss
In multiclass problems, some misclassifications may be worse than others. Rather than 0/1 loss, we may be interested in a more general loss Δ : Y × A → [0, ∞).
We can use Δ as our target margin for the multiclass SVM.
Multiclass SVM (Version 2):
    min_{w ∈ R^d} (1/2)‖w‖^2 + (c/n) Σ_{i=1}^n max_{y ∈ Y} ( Δ(y_i, y) - m_{i,y}(w) )_+
We can think of Δ(y_i, y) as the target margin for example i and class y: if each margin m_{i,y}(w) meets or exceeds its corresponding target Δ(y_i, y), then we incur no loss on example i.
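A minimal sketch of the per-example generalized hinge term, reusing the multivector Ψ from earlier; Version 1 corresponds to Δ(y_i, y) = 1 for y ≠ y_i and 0 otherwise, and the weights and data below are illustrative:

```python
import numpy as np

def psi(x, y, k):
    """Multivector feature map from earlier in the lecture."""
    out = np.zeros(k * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def generalized_hinge(w, x, y_true, k, delta):
    """max over y of ( Delta(y_true, y) - m_{i,y}(w) )_+ for a single example (x, y_true)."""
    margins = [w @ psi(x, y_true, k) - w @ psi(x, y, k) for y in range(k)]
    return max(max(0.0, delta(y_true, y) - m) for y, m in enumerate(margins))

# Version 1 target margin: 1 for every wrong class, 0 for the correct one.
delta01 = lambda y, y2: 0.0 if y == y2 else 1.0

w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])  # stacked (w_1, w_2, w_3), illustrative
x = np.array([2.0, 0.5])
print(generalized_hinge(w, x, y_true=0, k=3, delta=delta01))  # 0.0: all margins exceed their targets
```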

Geometric Interpretation
The prediction is given by argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩. Note that it is unchanged if we replace w by w/‖w‖.
For simplicity, let's assume ‖w‖ = 1. Then the score function
    ⟨w, Ψ(x, y)⟩ = ‖Ψ(x, y)‖ cos θ
is the (signed) length of the projection of Ψ(x, y) onto w.

Geometric Interpretation
Ψ maps each x ∈ X to |Y| different vectors in R^d.
For an example (x, y), we want the margins over all y' ≠ y to exceed the target margin:
    ⟨w, Ψ(x, y)⟩ - ⟨w, Ψ(x, y')⟩ ≥ Δ(y, y')
(Figure from Section 17.2.4 of Shalev-Shwartz and Ben-David's Understanding Machine Learning.)

Interlude: Is This Worth The Hassle Compared to One-vs-All?

Recap: What Have We Got?
Problem: multiclass classification with Y = {1,...,K}.
Solution 1: One-vs-All. Train K models h_1(x),...,h_K(x) : X → R and predict with argmax_{y ∈ Y} h_y(x). We gave a simple example where this fails for linear classifiers.
Solution 2: Multiclass. Train one model h : X × Y → R; prediction involves solving argmax_{y ∈ Y} h(x, y).

Does It Work Better in Practice?
Paper by Rifkin & Klautau: "In Defense of One-Vs-All Classification" (2004). Extensive, carefully done experiments, albeit on relatively small UCI datasets.
It suggests one-vs-all works just as well in practice (or at least, the advantages claimed by earlier papers for multiclass methods were not compelling).
They compared many multiclass frameworks (including the one we discuss), one-vs-all for SVMs with an RBF kernel, and one-vs-all with square loss and an RBF kernel (for classification!). All performed roughly the same.

Why Are We Bothering with Multiclass?
The framework we have developed for multiclass (class-sensitive score functions, the multiclass margin, and target margins) generalizes to situations where one-vs-all is computationally intractable.

Introduction to Structured Prediction

Part-of-Speech (POS) Tagging
Given a sentence, assign a part-of-speech tag to each word:
    x = ( [START], He, eats, apples ) = (x_0, x_1, x_2, x_3)
    y = ( [START], Pronoun, Verb, Noun ) = (y_0, y_1, y_2, y_3)
V = {all English words} ∪ {[START], "."}
P = {START, Pronoun, Verb, Noun, Adjective}
X = V^n, n = 1, 2, 3, ...  [word sequences of any length]
Y = P^n, n = 1, 2, 3, ...  [part-of-speech sequences of any length]

Structured Prediction
A structured prediction problem is a multiclass problem in which Y is very large but has (or we assume it has) a certain structure.
For POS tagging, |Y| grows exponentially in the length of the sentence.
Typical structure assumption: the POS labels form a Markov chain, i.e. the distribution of y_{n+1} given y_n, y_{n-1}, ..., y_0 is the same as the distribution of y_{n+1} given y_n.
More on this in several weeks, or in DS-GA 1005.

Local Feature Functions: Type 1
A type-1 local feature depends only on the label at a single position, say y_i (the label of the ith word), and on x at any position.
Examples:
    φ_1(i, x, y_i) = 1(x_i = runs) · 1(y_i = Verb)
    φ_2(i, x, y_i) = 1(x_i = runs) · 1(y_i = Noun)
    φ_3(i, x, y_i) = 1(x_{i-1} = He) · 1(x_i = runs) · 1(y_i = Verb)

Local Feature Functions: Type 2
A type-2 local feature depends only on the labels at two consecutive positions, y_{i-1} and y_i, and on x at any position.
Examples:
    θ_1(i, x, y_{i-1}, y_i) = 1(y_{i-1} = Pronoun) · 1(y_i = Verb)
    θ_2(i, x, y_{i-1}, y_i) = 1(y_{i-1} = Pronoun) · 1(y_i = Noun)

Local Feature Vector and Compatibility Score
At each position i in the sequence, define the local feature vector
    Ψ_i(x, y_{i-1}, y_i) = ( φ_1(i, x, y_i), φ_2(i, x, y_i), ..., θ_1(i, x, y_{i-1}, y_i), θ_2(i, x, y_{i-1}, y_i), ... ).
The local compatibility score for (x, y) at position i is ⟨w, Ψ_i(x, y_{i-1}, y_i)⟩.

Sequence Compatibility Score
The compatibility score for the pair of sequences (x, y) is the sum of the local compatibility scores:
    Σ_i ⟨w, Ψ_i(x, y_{i-1}, y_i)⟩ = ⟨w, Σ_i Ψ_i(x, y_{i-1}, y_i)⟩ = ⟨w, Ψ(x, y)⟩,
where we define the sequence feature vector by
    Ψ(x, y) = Σ_i Ψ_i(x, y_{i-1}, y_i).
So we see this is a special case of linear multiclass prediction.
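A minimal sketch of this decomposition; the local features below are simplified indicators over tags only (in the lecture they may also depend on x), so this is an illustration of the linearity, not the full feature set:

```python
import numpy as np

def local_psi(i, x, y_prev, y_curr, tags):
    """Concatenate type-1 (current tag) and type-2 (tag pair) indicator features at position i."""
    n = len(tags)
    type1 = np.zeros(n)        # 1(y_i = t) for each tag t (word features omitted for brevity)
    type1[tags.index(y_curr)] = 1.0
    type2 = np.zeros(n * n)    # 1(y_{i-1} = s AND y_i = t) for each tag pair (s, t)
    type2[tags.index(y_prev) * n + tags.index(y_curr)] = 1.0
    return np.concatenate([type1, type2])

def sequence_psi(x, y, tags):
    """Sequence feature vector: sum of local feature vectors over positions."""
    return sum(local_psi(i, x, y[i - 1], y[i], tags) for i in range(1, len(y)))

tags = ["START", "Pronoun", "Verb", "Noun"]
x = ["[START]", "He", "eats", "apples"]
y = ["START", "Pronoun", "Verb", "Noun"]
w = np.random.randn(len(tags) + len(tags) ** 2)
# <w, Psi(x, y)> equals the sum of the local compatibility scores:
local_sum = sum(w @ local_psi(i, x, y[i - 1], y[i], tags) for i in range(1, len(y)))
assert np.isclose(w @ sequence_psi(x, y, tags), local_sum)
```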

Sequence Target Loss
How do we assess the loss for a predicted sequence y' on example (x, y)?
Hamming loss is common:
    Δ(y, y') = (1/|y|) Σ_{i=1}^{|y|} 1(y_i ≠ y'_i)
We could generalize this as
    Δ(y, y') = (1/|y|) Σ_{i=1}^{|y|} δ(y_i, y'_i)

What Remains to Be Done?
To compute predictions, we need to find
    argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩.
This is straightforward when Y is small; now Y is exponentially large.
Because Ψ breaks down into local functions that depend only on two adjacent labels, we can solve this efficiently using dynamic programming (similar to Viterbi decoding).
Learning can be done with SGD and a similar dynamic program.
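A minimal sketch of the dynamic program; the local scores here are toy transition bonuses, whereas in the lecture the local score at position i would be ⟨w, Ψ_i(x, y_{i-1}, y_i)⟩:

```python
import numpy as np

def viterbi_argmax(n_positions, tags, local_score):
    """argmax over tag sequences of sum_i local_score(i, y_prev, y_curr), via dynamic programming."""
    k = len(tags)
    best = np.zeros(k)  # best score of any prefix ending in each tag (position 0 unconstrained here)
    back = []           # backpointers: for each position and current tag, the best previous tag
    for i in range(1, n_positions):
        scores = np.array([[best[p] + local_score(i, tags[p], tags[c]) for p in range(k)]
                           for c in range(k)])  # scores[c][p]: end in tag c coming from tag p
        back.append(scores.argmax(axis=1))
        best = scores.max(axis=1)
    # Recover the best sequence by following backpointers from the best final tag.
    y = [int(best.argmax())]
    for bp in reversed(back):
        y.append(int(bp[y[-1]]))
    y.reverse()
    return [tags[t] for t in y]

# Toy local scores (illustrative): reward Pronoun -> Verb -> Noun transitions after [START].
trans = {("START", "Pronoun"): 2.0, ("Pronoun", "Verb"): 2.0, ("Verb", "Noun"): 2.0}
local_score = lambda i, yp, yc: trans.get((yp, yc), 0.0)
print(viterbi_argmax(4, ["START", "Pronoun", "Verb", "Noun"], local_score))
# -> ['START', 'Pronoun', 'Verb', 'Noun']
```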