Multiclass and Introduction to Structured Prediction

Multiclass and Introduction to Structured Prediction. David S. Rosenberg, Bloomberg ML EDU. November 28, 2017.

Introduction

Multiclass Setting. Input space: X. Output space: Y = {1,...,k}. Our approaches to multiclass problems so far: multinomial / softmax logistic regression, trees, and random forests. Today we consider linear methods specifically designed for multiclass. But the main takeaway will be an approach that generalizes to situations where k is exponentially large, too large to enumerate.

Reduction to Binary Classification

One-vs-All / One-vs-Rest. Plot courtesy of David Sontag.

One-vs-All / One-vs-Rest. Train k binary classifiers, one for each class: train the ith classifier to distinguish class i from the rest. Suppose $h_1,\dots,h_k : X \to \mathbb{R}$ are our binary classifiers. They can output hard classifications in $\{-1,1\}$ or scores in $\mathbb{R}$. The final prediction is $h(x) = \arg\max_{i \in \{1,\dots,k\}} h_i(x)$, with ties broken arbitrarily.
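To make the reduction concrete, here is a minimal one-vs-rest sketch (mine, not the lecture's). It assumes numeric features in a NumPy array X and integer labels y in {0, ..., k-1}, and uses scikit-learn's LogisticRegression purely as a stand-in binary scorer; any classifier that produces real-valued scores would do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X, y, k):
    """Train k binary classifiers; the i-th one distinguishes class i from the rest."""
    classifiers = []
    for i in range(k):
        clf = LogisticRegression()
        clf.fit(X, (y == i).astype(int))  # relabel: class i vs. everything else
        classifiers.append(clf)
    return classifiers

def predict_one_vs_rest(classifiers, X):
    """h(x) = arg max_i h_i(x); ties are broken by argmax toward the lowest index."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return np.argmax(scores, axis=1)
```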

Linear Classifiers: Binary and Multiclass

Linear Binary Classifier Review. Input space: $X = \mathbb{R}^d$. Output space: $Y = \{-1, 1\}$. Linear classifier score function: $f(x) = \langle w, x \rangle = w^T x$. Final classification prediction: $\operatorname{sign}(f(x))$. Geometrically, when is $\operatorname{sign}(f(x)) = +1$ and when is $\operatorname{sign}(f(x)) = -1$?

Linear Binary Classifier Review. Suppose $\|w\| > 0$ and $\|x\| > 0$. Then $f(x) = \langle w, x \rangle = \|w\|\,\|x\|\cos\theta$, where $\theta$ is the angle between $w$ and $x$. So $f(x) > 0 \iff \cos\theta > 0 \iff \theta \in (-90^\circ, 90^\circ)$, and $f(x) < 0 \iff \cos\theta < 0 \iff \theta \notin [-90^\circ, 90^\circ]$.

Three Class Example. Base hypothesis space: $H = \{\, f(x) = w^T x \mid w \in \mathbb{R}^2 \,\}$. Note: the separating boundary always contains the origin. Example based on Shalev-Shwartz and Ben-David's Understanding Machine Learning, Section 17.1.

Three Class Example: One-vs-Rest. Class 1 vs Rest: $f_1(x) = w_1^T x$.

Three Class Example: One-vs-Rest. Examine Class 2 vs Rest: it predicts everything to be "Not 2". If it predicted some points as 2, then it would get many more "Not 2" points incorrect.

One-vs-Rest: Predictions. The score for class i is $f_i(x) = \langle w_i, x \rangle = \|w_i\|\,\|x\|\cos\theta_i$, where $\theta_i$ is the angle between $x$ and $w_i$. Predict the class $i$ that has the highest $f_i(x)$.

One-vs-Rest: Class Boundaries. For simplicity, we've assumed $\|w_1\| = \|w_2\| = \|w_3\|$. Then $\|w_i\|\,\|x\|$ is the same for all scores, so $x$ is classified by whichever class has the largest $\cos\theta_i$ (i.e., the $\theta_i$ closest to 0).

One-vs-Rest: Class Boundaries. This approach doesn't work well in this instance. How can we fix this?

The Linear Multiclass Hypothesis Space. Base hypothesis space: $H = \{\, x \mapsto w^T x \mid w \in \mathbb{R}^d \,\}$. Linear multiclass hypothesis space (for k classes): $F = \{\, x \mapsto \arg\max_i h_i(x) \mid h_1,\dots,h_k \in H \,\}$. What's the action space here?

One-vs-Rest: Class Boundaries. Recall: A learning algorithm chooses the hypothesis from the hypothesis space. Is this a failure of the hypothesis space or the learning algorithm?

A Solution with Linear Functions. This works... so the problem is not with the hypothesis space. How can we get a solution like this?

Multiclass Predictors

Multiclass Hypothesis Space. Base hypothesis space: $H = \{h : X \to \mathbb{R}\}$ ("score functions"). Multiclass hypothesis space (for k classes): $F = \{\, x \mapsto \arg\max_i h_i(x) \mid h_1,\dots,h_k \in H \,\}$. Here $h_i(x)$ scores how likely $x$ is to be from class $i$. Issue: we need to learn (and represent) k functions, which doesn't scale to very large k.

Multiclass Hypothesis Space: Reframed. General [discrete] output space: Y (e.g. Y = {1,...,k} for multiclass). New idea: rather than a score function for each class, use one function h(x,y) that gives a compatibility score between input x and output y. The final prediction is the $y \in Y$ that is most compatible with x: $f(x) = \arg\max_{y \in Y} h(x,y)$. This subsumes the framework with class-specific score functions: given class-specific score functions $h_1,\dots,h_k$, we can define the compatibility function as $h(x,i) = h_i(x)$ for $i = 1,\dots,k$.
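As a tiny sketch of the prediction rule (my own, assuming some compatibility function h(x, y) and a finite label set are already available):

```python
def predict(h, x, labels):
    """f(x) = arg max_{y in Y} h(x, y), for any compatibility score function h."""
    return max(labels, key=lambda y: h(x, y))
```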

Multiclass Hypothesis Space: Reframed. General [discrete] output space: Y. Base hypothesis space: $H = \{h : X \times Y \to \mathbb{R}\}$, where h(x,y) gives a compatibility score between input x and output y. Multiclass hypothesis space: $F = \{\, x \mapsto \arg\max_{y \in Y} h(x,y) \mid h \in H \,\}$. The final prediction function is an $f \in F$; for each $f \in F$ there is an underlying compatibility score function $h \in H$.

Learning in a Multiclass Hypothesis Space: In Words. Base hypothesis space: $H = \{h : X \times Y \to \mathbb{R}\}$. Training data: $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$. The learning process chooses an $h \in H$. We want the compatibility $h(x,y)$ to be large when x has label y, and small otherwise.

Learning in a Multiclass Hypothesis Space: In Math. $h$ classifies $(x_i, y_i)$ correctly iff $h(x_i, y_i) > h(x_i, y)$ for all $y \neq y_i$: h should give a higher score for the correct label than for every other $y \in Y$. An equivalent condition is $h(x_i, y_i) > \max_{y \neq y_i} h(x_i, y)$. If we define $m_i = h(x_i, y_i) - \max_{y \neq y_i} h(x_i, y)$, then classification is correct iff $m_i > 0$. Generally we want $m_i$ to be large. Sound familiar?
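A short sketch of that margin computation (assuming a hypothetical compatibility function h and a finite label set):

```python
def multiclass_margin(h, x_i, y_i, labels):
    """m_i = h(x_i, y_i) - max_{y != y_i} h(x_i, y); classified correctly iff m_i > 0."""
    best_other = max(h(x_i, y) for y in labels if y != y_i)
    return h(x_i, y_i) - best_other
```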

A Linear Multiclass Hypothesis Space

Linear Multiclass Prediction Function. A linear compatibility score function is given by $h(x,y) = \langle w, \Psi(x,y) \rangle$, where $\Psi : X \times Y \to \mathbb{R}^d$ is a compatibility feature map (called a class-sensitive score function and feature map in our SSBD reference). $\Psi(x,y)$ extracts features relevant to how compatible y is with x, and the final compatibility score is a linear function of $\Psi(x,y)$. Linear multiclass hypothesis space: $F = \{\, x \mapsto \arg\max_{y \in Y} \langle w, \Psi(x,y) \rangle \mid w \in \mathbb{R}^d \,\}$.

Example: $X = \mathbb{R}^2$, $Y = \{1,2,3\}$, with $w_1 = (\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2})$, $w_2 = (0, 1)$, $w_3 = (-\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2})$. Prediction function: $(x_1, x_2) \mapsto \arg\max_{i \in \{1,2,3\}} \langle w_i, (x_1, x_2) \rangle$. How can we get this into the form $x \mapsto \arg\max_{y \in Y} \langle w, \Psi(x,y) \rangle$?

The Multivector Construction. What if we stack the $w_i$'s together: $w = (\underbrace{\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}}_{w_1},\ \underbrace{0, 1}_{w_2},\ \underbrace{-\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}}_{w_3})$, and then define $\Psi : \mathbb{R}^2 \times \{1,2,3\} \to \mathbb{R}^6$ by
$\Psi(x,1) := (x_1, x_2, 0, 0, 0, 0)$
$\Psi(x,2) := (0, 0, x_1, x_2, 0, 0)$
$\Psi(x,3) := (0, 0, 0, 0, x_1, x_2)$
Then $\langle w, \Psi(x,y) \rangle = \langle w_y, x \rangle$, which is what we want.
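A minimal sketch of the multivector construction for general d and k (my own code, not the lecture's): x is copied into the block of a length k*d vector selected by the class label, so that with w the concatenation of (w_1, ..., w_k), the dot product with Ψ(x, y) equals ⟨w_y, x⟩.

```python
import numpy as np

def psi_multivector(x, y, k):
    """Multivector feature map: place x into the y-th block of a length k*d vector."""
    d = x.shape[0]
    out = np.zeros(k * d)
    out[y * d:(y + 1) * d] = x  # classes assumed to be labeled 0, ..., k-1
    return out

# Example: with w = np.concatenate([w1, w2, w3]),
# w @ psi_multivector(x, 1, 3) equals w2 @ x.
```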

NLP Example: Part-of-speech classification. $X = \{\text{all possible words}\}$, $Y = \{\text{NOUN, VERB, ADJECTIVE, ADVERB, ARTICLE, PREPOSITION}\}$. Features of $x \in X$: [the word itself], ENDS_IN_ly, ENDS_IN_ness, ... Let $\Psi(x,y) = (\psi_1(x,y), \psi_2(x,y), \psi_3(x,y), \dots, \psi_d(x,y))$, with
$\psi_1(x,y) = 1(x = \text{apple AND } y = \text{NOUN})$
$\psi_2(x,y) = 1(x = \text{run AND } y = \text{NOUN})$
$\psi_3(x,y) = 1(x = \text{run AND } y = \text{VERB})$
$\psi_4(x,y) = 1(x \text{ ENDS_IN_ly AND } y = \text{ADVERB})$
...
For example, $\Psi(x = \text{run}, y = \text{NOUN}) = (0, 1, 0, 0, \dots)$. After training, what would you guess the corresponding $w_1, w_2, w_3, w_4$ to be?
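Sparse indicator features like these are typically stored as dictionaries from feature names to values. Here is a minimal sketch (the feature names and helper functions are my own, hypothetical ones):

```python
def pos_features(word, tag):
    """A few indicator features psi_j(x, y), returned as a sparse dict {name: 1.0}."""
    feats = {f"word={word}|tag={tag}": 1.0}
    if word.endswith("ly"):
        feats[f"ENDS_IN_ly|tag={tag}"] = 1.0
    if word.endswith("ness"):
        feats[f"ENDS_IN_ness|tag={tag}"] = 1.0
    return feats

def compatibility(w, word, tag):
    """<w, Psi(x, y)>, with the weight vector w also stored as a dict {name: weight}."""
    return sum(w.get(name, 0.0) * v for name, v in pos_features(word, tag).items())
```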

NLP Example: How does it work? $\Psi(x,y) = (\psi_1(x,y), \psi_2(x,y), \psi_3(x,y), \dots, \psi_d(x,y)) \in \mathbb{R}^d$, with $\psi_1(x,y) = 1(x = \text{apple AND } y = \text{NOUN})$, $\psi_2(x,y) = 1(x = \text{run AND } y = \text{NOUN})$, ... After training, we've learned $w \in \mathbb{R}^d$; say $w = (5, -3, 1, 4, \dots)$. To predict the label for $x = \text{apple}$, we compute the compatibility score for each $y \in Y$: $\langle w, \Psi(\text{apple}, \text{NOUN}) \rangle$, $\langle w, \Psi(\text{apple}, \text{VERB}) \rangle$, $\langle w, \Psi(\text{apple}, \text{ADVERB}) \rangle$, ..., and predict the class that gives the highest score.

Another Approach: Use Label Features. What if we have a very large number of classes? Make features for the classes. This is common in advertising: X is the user and user context, and Y is a large set of banner ads. Suppose user x is shown many banner ads; we want to predict which one the user will click on. Possible compatibility features:
$\psi_1(x,y) = 1(x \text{ interested in sports AND } y \text{ relevant to sports})$
$\psi_2(x,y) = 1(x \text{ is in the target demographic group of } y)$
$\psi_3(x,y) = 1(x \text{ previously clicked on an ad from the company sponsoring } y)$

Linear Multiclass SVM

The Margin for Multiclass. Let $h : X \times Y \to \mathbb{R}$ be our compatibility score function. Define a margin between the correct class and each other class. Definition: the [class-specific] margin of score function h on the ith example $(x_i, y_i)$ for class y is $m_{i,y}(h) = h(x_i, y_i) - h(x_i, y)$. We want $m_{i,y}(h)$ to be large and positive for all $y \neq y_i$. For our linear hypothesis space, the margin is $m_{i,y}(w) = \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle$.

Multiclass SVM with Hinge Loss. Recall the binary SVM (without bias term): $\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|w\|^2 + \tfrac{c}{n}\sum_{i=1}^n \max(0,\, 1 - \underbrace{y_i w^T x_i}_{\text{margin}})$. Multiclass SVM (Version 1): $\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|w\|^2 + \tfrac{c}{n}\sum_{i=1}^n \max_{y \neq y_i} \max(0,\, 1 - m_{i,y}(w))$, where $m_{i,y}(w) = \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle$. As in the binary SVM, we've taken the value 1 as our target margin for each $i, y$.

Class-Sensitive Loss. In multiclass, some misclassifications may be worse than others. Rather than the 0/1 loss, we may be interested in a more general loss $\Delta : Y \times A \to [0, \infty)$, which we can use as our target margin for the multiclass SVM. Multiclass SVM (Version 2): $\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|w\|^2 + \tfrac{c}{n}\sum_{i=1}^n \max_{y \neq y_i} \max(0,\, \Delta(y_i, y) - m_{i,y}(w))$. If the margin $m_{i,y}(w)$ meets or exceeds its target $\Delta(y_i, y)$ for all $y \neq y_i$, then there is no loss on example i. Note: if $\Delta(y, y) = 0$ for all $y \in Y$, then we can replace $\max_{y \neq y_i}$ with $\max_y$.
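A sketch of the per-example term in this objective (the generalized hinge loss), assuming a hypothetical feature map psi(x, y) that returns NumPy vectors and a target-margin function delta(y_i, y); Version 1 is recovered by taking delta to be identically 1.

```python
import numpy as np

def generalized_hinge_loss(w, psi, x_i, y_i, labels, delta):
    """max_{y != y_i} max(0, Delta(y_i, y) - m_{i,y}(w)),
    where m_{i,y}(w) = <w, psi(x_i, y_i)> - <w, psi(x_i, y)>."""
    score_true = w @ psi(x_i, y_i)
    losses = [max(0.0, delta(y_i, y) - (score_true - w @ psi(x_i, y)))
              for y in labels if y != y_i]
    return max(losses, default=0.0)
```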

Interlude: Is This Worth The Hassle Compared to One-vs-All?

Recap: What Have We Got? Problem: multiclass classification with $Y = \{1,\dots,k\}$. Solution 1: One-vs-All. Train k models $h_1(x), \dots, h_k(x) : X \to \mathbb{R}$ and predict with $\arg\max_{y \in Y} h_y(x)$. We gave a simple example where this fails for linear classifiers. Solution 2: Multiclass. Train one model $h(x,y) : X \times Y \to \mathbb{R}$; prediction involves solving $\arg\max_{y \in Y} h(x,y)$.

Does it work better in practice? Paper by Rifkin & Klautau: "In Defense of One-Vs-All Classification" (2004). Extensive experiments, carefully done, albeit on relatively small UCI datasets. It suggests one-vs-all works just as well in practice (or at least, the advantages claimed by earlier papers for multiclass methods were not compelling). They compared many multiclass frameworks (including the one we discuss), one-vs-all for SVMs with RBF kernel, and one-vs-all for square loss with RBF kernel (for classification!). All performed roughly the same.

Why Are We Bothering with Multiclass? The framework we have developed for multiclass (compatibility features / score functions, multiclass margin, target margin) generalizes to situations where k is very large and one-vs-all is intractable. The key point is that we can generalize across outputs y by using features of y.

Introduction to Structured Prediction

Part-of-speech (POS) Tagging. Given a sentence, give a part-of-speech tag for each word, e.g. $x = (x_0, x_1, x_2, x_3) = ([\text{START}], \text{He}, \text{eats}, \text{apples})$ and $y = (y_0, y_1, y_2, y_3) = ([\text{START}], \text{Pronoun}, \text{Verb}, \text{Noun})$. Let $V = \{\text{all English words}\} \cup \{[\text{START}], \text{"."}\}$ and $P = \{\text{START}, \text{Pronoun}, \text{Verb}, \text{Noun}, \text{Adjective}\}$. Then $X = V^n$, $n = 1,2,3,\dots$ [word sequences of any length] and $Y = P^n$, $n = 1,2,3,\dots$ [part-of-speech sequences of any length].

Structured Prediction. A structured prediction problem is a multiclass problem in which Y is very large but has (or we assume it has) a certain structure. For POS tagging, Y grows exponentially in the length of the sentence. Typical structure assumption: the POS labels form a Markov chain, i.e. the distribution of $y_{n+1}$ given $y_n, y_{n-1}, \dots, y_0$ is the same as the distribution of $y_{n+1}$ given $y_n$.

Local Feature Functions: Type 1. A type-1 local feature depends only on the label at a single position, say $y_i$ (the label of the ith word), and on x at any position. Examples:
$\phi_1(i, x, y_i) = 1(x_i = \text{runs}) \cdot 1(y_i = \text{Verb})$
$\phi_2(i, x, y_i) = 1(x_i = \text{runs}) \cdot 1(y_i = \text{Noun})$
$\phi_3(i, x, y_i) = 1(x_{i-1} = \text{He}) \cdot 1(x_i = \text{runs}) \cdot 1(y_i = \text{Verb})$

Local Feature Functions: Type 2. A type-2 local feature depends only on the labels at two consecutive positions, $y_{i-1}$ and $y_i$, and on x at any position. Examples:
$\theta_1(i, x, y_{i-1}, y_i) = 1(y_{i-1} = \text{Pronoun}) \cdot 1(y_i = \text{Verb})$
$\theta_2(i, x, y_{i-1}, y_i) = 1(y_{i-1} = \text{Pronoun}) \cdot 1(y_i = \text{Noun})$
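For concreteness, here is a hypothetical local feature function (names and interface are my own) that mixes a couple of type-1 indicators with a type-2 tag-transition indicator, stored as a sparse dict; the sketches after the next slides reuse this interface.

```python
def local_feats(i, x, y_prev, y_cur):
    """A few type-1 and type-2 local features at position i, as a sparse dict {name: 1.0}."""
    feats = {
        f"word={x[i]}|tag={y_cur}": 1.0,        # type 1: current word with current tag
        f"prev_tag={y_prev}|tag={y_cur}": 1.0,  # type 2: tag transition
    }
    if x[i].endswith("s"):
        feats[f"ENDS_IN_s|tag={y_cur}"] = 1.0   # type 1: suffix with current tag
    return feats
```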

Local Feature Vector and Compatibility Score. At each position i in the sequence, define the local feature vector $\Psi_i(x, y_{i-1}, y_i) = (\phi_1(i,x,y_i), \phi_2(i,x,y_i), \dots, \theta_1(i,x,y_{i-1},y_i), \theta_2(i,x,y_{i-1},y_i), \dots)$. The local compatibility score for (x,y) at position i is $\langle w, \Psi_i(x, y_{i-1}, y_i) \rangle$.

Sequence Compatibility Score. The compatibility score for the pair of sequences (x,y) is the sum of the local compatibility scores: $\sum_i \langle w, \Psi_i(x, y_{i-1}, y_i) \rangle = \langle w, \sum_i \Psi_i(x, y_{i-1}, y_i) \rangle = \langle w, \Psi(x,y) \rangle$, where we define the sequence feature vector by $\Psi(x,y) = \sum_i \Psi_i(x, y_{i-1}, y_i)$. So we see this is a special case of linear multiclass prediction.
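With local features stored as sparse dicts (as in the hypothetical local_feats above) and the weight vector stored the same way, the sequence compatibility score is just a sum of local scores; a minimal sketch:

```python
def sequence_score(w, local_feats, x, y):
    """<w, Psi(x, y)>, where Psi(x, y) = sum_i Psi_i(x, y_{i-1}, y_i)."""
    total = 0.0
    for i in range(1, len(y)):  # position 0 holds the [START] symbol
        for name, value in local_feats(i, x, y[i - 1], y[i]).items():
            total += w.get(name, 0.0) * value
    return total
```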

Sequence Target Loss. How do we assess the loss of a predicted sequence y' for example (x,y)? Hamming loss is common: $\Delta(y, y') = \frac{1}{|y|}\sum_{i=1}^{|y|} 1(y_i \neq y'_i)$. We could generalize this as $\Delta(y, y') = \frac{1}{|y|}\sum_{i=1}^{|y|} \delta(y_i, y'_i)$.
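A one-line sketch of the Hamming loss between two equal-length label sequences:

```python
def hamming_loss(y_true, y_pred):
    """Delta(y, y') = (1/|y|) * sum_i 1(y_i != y'_i)."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)
```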

What remains to be done? To compute predictions, we need to find $\arg\max_{y \in Y} \langle w, \Psi(x,y) \rangle$. This is straightforward when Y is small, but now Y is exponentially large. Because Ψ breaks down into local functions depending only on 2 adjacent labels, we can solve this efficiently using dynamic programming (similar to Viterbi decoding). Learning can be done with SGD and a similar dynamic program.
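A sketch of that dynamic-programming argmax (Viterbi-style decoding), using the same hypothetical local_feats / weight-dict interface as above; it assumes x[0] is the [START] token and that there is at least one word after it.

```python
def viterbi_decode(w, local_feats, x, tags):
    """arg max over tag sequences of sum_i <w, Psi_i(x, y_{i-1}, y_i)>, by dynamic programming."""
    n = len(x)  # x[0] is [START]; positions 1..n-1 are words

    def local_score(i, y_prev, y_cur):
        return sum(w.get(name, 0.0) * v
                   for name, v in local_feats(i, x, y_prev, y_cur).items())

    # best[i][t]: best score of any tagging of positions 1..i that ends with tag t
    best = [{t: float("-inf") for t in tags} for _ in range(n)]
    back = [{t: None for t in tags} for _ in range(n)]
    for t in tags:
        best[1][t] = local_score(1, "[START]", t)
    for i in range(2, n):
        for t in tags:
            for prev in tags:
                s = best[i - 1][prev] + local_score(i, prev, t)
                if s > best[i][t]:
                    best[i][t], back[i][t] = s, prev

    # trace back the highest-scoring tag sequence
    last = max(tags, key=lambda t: best[n - 1][t])
    seq = [last]
    for i in range(n - 1, 1, -1):
        seq.append(back[i][seq[-1]])
    return ["[START]"] + seq[::-1]
```

The run time is on the order of n times |P| squared local-score evaluations, which is what makes the exponentially large Y tractable.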