Universität Potsdam, Institut für Informatik, Lehrstuhl
Linear Classifiers IV
Blaine Nelson, Tobias Scheffer
Contents
Classification Problem: Bayesian Classifier, Decision
Linear Classifiers, MAP Models: Logistic Regression
Regularized Empirical Risk Minimization: Kernel Perceptron, Support Vector Machine, Ridge Regression, LASSO
Representer Theorem: Dualized Perceptron, Dual SVM, Mercer Map
Learning with Structured Input & Output: Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm
Recall: Binary SVM
Classification for two classes: $y(x) = \mathrm{sign}\, f_\theta(x)$ with $f_\theta(x) = \phi(x)^T\theta$ and a parameter vector $\theta$.
Optimization problem: $\min_{\theta,\xi}\ \lambda\sum_{i=1}^{n}\xi_i + \frac{1}{2}\theta^T\theta$ such that $\xi_i \geq 0$ and $y_i\,\phi(x_i)^T\theta \geq 1 - \xi_i$.
Does this generalize to k classes?
Recall: Multiclass Logistic Regression
In the multi-class case, the linear model has a decision function $f_\theta(x,y) = \phi(x)^T\theta_y + b_y$ and a classifier $y(x) = \mathrm{argmax}_{z \in Y}\, f_\theta(x,z)$.
Logistic model for multiclass: $P(y \mid x,\theta) = \frac{\exp(\phi(x)^T\theta_y + b_y)}{\sum_{z \in Y}\exp(\phi(x)^T\theta_z + b_z)}$.
The prior is a normal distribution, $p(\theta) = N(\theta; 0, \Sigma)$.
The Maximum-A-Posteriori parameter is $\theta_{MAP} = \mathrm{argmin}_\theta\ \sum_{i=1}^{n}\left(\log\sum_{z \in Y}\exp f_\theta(x_i,z) - f_\theta(x_i,y_i)\right) + \frac{\theta^T\Sigma^{-1}\theta}{2}$ (loss plus regularizer).
Generalizing to Multiclass Problems
Recall multiclass logistic regression: $y(x) = \mathrm{sign}\, f_\theta(x)$ became $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, and the single parameter vector $\theta$ became $\theta = (\theta_1, \ldots, \theta_k)^T$ with a component for each class.
For an SVM with k classes:
Regularizer: $\frac{1}{2}\theta^T\theta$ becomes $\frac{1}{2}\sum_{j=1}^{k}\theta_j^T\theta_j$.
Margin constraints: $y_i\,\phi(x_i)^T\theta \geq 1 - \xi_i$ becomes $\forall y \neq y_i:\ f_\theta(x_i,y_i) \geq f_\theta(x_i,y) + 1 - \xi_i$.
Multiclass SVM
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \phi(x)^T\theta_y$ and a parameter vector for each of the k classes, $\theta = (\theta_1, \ldots, \theta_k)^T$.
Optimization problem: $\min_{\theta,\xi}\ \lambda\sum_{i=1}^{n}\xi_i + \frac{1}{2}\sum_{j=1}^{k}\theta_j^T\theta_j$ such that $\xi_i \geq 0$ and $\forall y \neq y_i:\ f_\theta(x_i,y_i) \geq f_\theta(x_i,y) + 1 - \xi_i$.
[J. Weston, C. Watkins, 1999]
Multiclass Feature-Mapping
The different weight vectors can be seen as different slices of a single weight vector $\theta = (\theta_1, \ldots, \theta_k)^T$.
Joint representation of input and output: for class $y = 2$, $\Phi(x,y) = (0, \phi(x), 0, \ldots, 0)^T$, i.e. k concatenated feature vectors with $\phi(x)$ placed in the block of class y.
Multiclass Feature-Mapping
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$.
Multiclass kernels: $\Lambda(y) = (I[y=1], \ldots, I[y=k])^T$ and $\Phi(x,y) = \phi(x) \otimes \Lambda(y) = (\phi(x)\,I[y=1], \ldots, \phi(x)\,I[y=k])^T$.
Multiclass Kernel Encoding
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$.
Multiclass kernels: with $\phi(x) = (x_1, x_2)^T$ and two classes, $\Phi(x,y) = (x_1, x_2)^T \otimes (I[y=1], I[y=2])^T = (x_1 I[y=1],\ x_2 I[y=1],\ x_1 I[y=2],\ x_2 I[y=2])^T$.
$k((x_i,y_i),(x_j,y_j)) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = I[y_i = y_j]\,k(x_i,x_j)$, where $I[y_i = y_j]$ is 1 if $y_i = y_j$ and 0 otherwise.
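A minimal numerical illustration of this factorization (not from the slides; the helper names and the linear kernel are our assumptions):

import numpy as np

def multiclass_kernel(x_i, y_i, x_j, y_j, k):
    # k((x_i, y_i), (x_j, y_j)) = I[y_i = y_j] * k(x_i, x_j)
    return float(y_i == y_j) * k(x_i, x_j)

# With the linear kernel this equals Phi(x_i, y_i)^T Phi(x_j, y_j) from the slide above.
linear = lambda a, b: a @ b
print(multiclass_kernel(np.array([1.0, 2.0]), 0, np.array([3.0, 1.0]), 0, linear))  # 5.0 (same class)
print(multiclass_kernel(np.array([1.0, 2.0]), 0, np.array([3.0, 1.0]), 1, linear))  # 0.0 (different classes)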
Classification with Information over Classes
Classes have their own features: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$ and $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$.
$k((x_i,y_i),(x_j,y_j)) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = \phi(x_i)^T\phi(x_j)\ \Lambda(y_i)^T\Lambda(y_j) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j)$, where $\Lambda(y_i)^T\Lambda(y_j)$ expresses the correspondence between classes.
Correspondence of the classes $\Lambda(y_i)^T\Lambda(y_j)$: e.g., the inner product of the class descriptions, with $\Lambda(y_i)$ a term-frequency (TF) vector over all words.
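A small sketch of how such a class correspondence could be computed from textual class descriptions (whitespace tokenization and a fixed vocabulary are assumptions for illustration, not from the slides):

import numpy as np
from collections import Counter

def tf_vector(description, vocabulary):
    # Lambda(y): term-frequency vector of a class description over a fixed vocabulary
    counts = Counter(description.lower().split())
    return np.array([counts[w] for w in vocabulary], dtype=float)

def class_correspondence(desc_i, desc_j, vocabulary):
    # Lambda(y_i)^T Lambda(y_j): inner product of the two TF vectors
    return tf_vector(desc_i, vocabulary) @ tf_vector(desc_j, vocabulary)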
Multiclass SVM
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$; f now has two arguments.
Shared feature representation for input and output: $f_\theta(x,y) = \Phi(x,y)^T\theta$.
The same approach is used for multiclass, sequence and structured learning, and ranking.
Multiclass SVM Example
$x$ is the encoded input, e.g., a document, and $y = 2$ is the encoded class (out of six classes 1, ..., 6).
$\Phi(x,y) = (I[y=1]\,x,\ I[y=2]\,x,\ I[y=3]\,x,\ I[y=4]\,x,\ I[y=5]\,x,\ I[y=6]\,x)^T = (0, x, 0, 0, 0, 0)^T$.
Multiclass SVM
Classification for more than two classes (per-class formulation, restated for comparison): $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \phi(x)^T\theta_y$ and a parameter vector for each of the k classes, $\theta = (\theta_1, \ldots, \theta_k)^T$.
Optimization problem: $\min_{\theta,\xi}\ \lambda\sum_{i=1}^{n}\xi_i + \frac{1}{2}\sum_{j=1}^{k}\theta_j^T\theta_j$ such that $\xi_i \geq 0$ and $\forall y \neq y_i:\ f_\theta(x_i,y_i) \geq f_\theta(x_i,y) + 1 - \xi_i$.
Multiclass SVM
Classification for more than two classes with the joint feature map: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$; the per-class vectors are stacked into one vector $\theta = (\theta_1, \ldots, \theta_k)^T$.
Optimization problem: $\min_{\theta,\xi}\ \lambda\sum_{i=1}^{n}\xi_i + \frac{1}{2}\theta^T\theta$ such that $\xi_i \geq 0$ and $\forall y \neq y_i:\ f_\theta(x_i,y_i) \geq f_\theta(x_i,y) + 1 - \xi_i$.
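A minimal sketch of the joint feature map and the resulting argmax prediction rule (the function and variable names are illustrative, not from the slides):

import numpy as np

def joint_feature_map(phi_x, y, n_classes):
    # Phi(x, y) = phi(x) (tensor) Lambda(y): place phi(x) in the block belonging to class y
    Phi = np.zeros(n_classes * len(phi_x))
    Phi[y * len(phi_x):(y + 1) * len(phi_x)] = phi_x
    return Phi

def predict(phi_x, theta, n_classes):
    # y(x) = argmax_y Phi(x, y)^T theta
    scores = [joint_feature_map(phi_x, y, n_classes) @ theta for y in range(n_classes)]
    return int(np.argmax(scores))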
STRUCTURED MULTICLASS OUTPUT
Taxonomic Output Structure
Suppose the k classes are related by an underlying tree structure of depth d, e.g. the root $v_1^1$ = Homininae with children $v_1^2$ = Hominini and $v_2^2$ = Gorillini, and leaves $v_1^3$ = Pan, $v_2^3$ = Homo, $v_3^3$ = Gorilla.
Each class is a path in the tree, $y = (y^1, \ldots, y^d)$:
Chimpanzee = (Homininae, Hominini, Pan)
Human = (Homininae, Hominini, Homo)
W. Gorilla = (Homininae, Gorillini, Gorilla)
Taxonomic Output Structure
Classes in a tree structure (depth d): the class for each x is a path in the class tree, $y = (y^1, \ldots, y^d)$.
How do we encode the common features of input and output?
Classification with a Taxonomy
Classes in a tree structure (depth d): $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$ and $y = (y^1, \ldots, y^d)$.
Per-level indicator vectors $\Lambda(y^i) = (I[y^i = v_1^i], \ldots, I[y^i = v_{n_i}^i])^T$ are stacked into $\Lambda(y) = (\Lambda(y^1), \ldots, \Lambda(y^d))^T$.
$\Phi(x,y) = \phi(x) \otimes \Lambda(y) = \phi(x) \otimes (I[y^1 = v_1^1], \ldots, I[y^1 = v_{n_1}^1], \ldots, I[y^d = v_1^d], \ldots, I[y^d = v_{n_d}^d])^T$.
Classification with a Taxonomy: Example
$x$ is the encoded input, e.g., a document, and $y = (v_1^1, v_2^2, v_3^3)^T$ is a path, e.g., in a topic tree.
$\Phi(x,y) = (I[y^1 = v_1^1]\,x,\ I[y^2 = v_1^2]\,x,\ I[y^2 = v_2^2]\,x,\ I[y^3 = v_1^3]\,x,\ I[y^3 = v_2^3]\,x,\ I[y^3 = v_3^3]\,x)^T = (x, 0, x, 0, 0, x)^T$.
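A minimal sketch of this taxonomy feature map; the Homininae tree and the helper names are illustrative and follow the example above:

import numpy as np

# Nodes per level of the class tree (depth d = 3), as in the Homininae example
levels = [["Homininae"], ["Hominini", "Gorillini"], ["Pan", "Homo", "Gorilla"]]

def taxonomy_lambda(path):
    # Lambda(y) stacks one indicator block per level: I[y^l = v] for every node v on level l
    blocks = [np.array([float(node == path[l]) for node in nodes])
              for l, nodes in enumerate(levels)]
    return np.concatenate(blocks)

def taxonomy_feature_map(phi_x, path):
    # Phi(x, y) = phi(x) (tensor) Lambda(y): one copy of phi(x) per node on the path, zeros elsewhere
    return np.kron(taxonomy_lambda(path), phi_x)

print(taxonomy_lambda(["Homininae", "Hominini", "Pan"]))  # [1. 1. 0. 1. 0. 0.]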
Classification with a Taxonomy: Kernelization
Classes in a tree structure (depth d), $y = (y^1, \ldots, y^d)$: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$.
$k((x_i,y_i),(x_j,y_j)) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = \phi(x_i)^T\phi(x_j)\ \Lambda(y_i)^T\Lambda(y_j) = k(x_i,x_j)\sum_{l=1}^{d}\Lambda(y_i^l)^T\Lambda(y_j^l)$; the sum counts the tree levels on which the two class paths agree.
Classification with a Taxonomy: Decoding / Prediction
Classes in a tree structure (depth d): with $f_\theta(x,y) = \Phi(x,y)^T\theta$ written in its dual form,
$y(x) = \mathrm{argmax}_y f_\theta(x,y) = \mathrm{argmax}_y \sum_{i=1}^{n}\alpha_i\,k((x_i,y_i),(x,y)) = \mathrm{argmax}_y \sum_{i=1}^{n}\alpha_i\,k(x_i,x)\,\Lambda(y_i)^T\Lambda(y) = \mathrm{argmax}_y \sum_{i=1}^{n}\alpha_i\,k(x_i,x)\sum_{l=1}^{d}\Lambda(y_i^l)^T\Lambda(y^l)$.
Structured Classification
The structure of the output space Y is encoded by defining $\Lambda(y)$.
The joint feature map $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$ captures the information from both the input and the class structure, allowing learning to utilize that structure.
This defines a multiclass kernel on the input and output: $k((x_i,y_i),(x_j,y_j)) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j)$.
COMPLEX STRUCTURES
Complex Structured Output
Suppose the output space Y contains complex objects. Can they be represented as a combination of binary prediction problems?
Examples: part-of-speech tagging and named entity recognition, natural language parsing, sequence alignment.
Complex Output: Tagging
Sequential input / output.
Part-of-speech tagging: x = "Curiosity kills the cat", y = (Noun, Verb, Determiner, Noun).
Named entity recognition, information extraction: x = "Barbie meets Ken", y = (Person, -, Person).
Complex Output: Parsing
Natural language parsing: x = "Curiosity killed the cat."
y is the parse tree: S spans NP and VP, with NP → N ("Curiosity"), VP → V NP ("killed"), NP → Det N ("the cat").
Complex Output: Alignments
Sequence alignment: we are given two sequences, and the prediction is an alignment between them.
x: s = (ABJLHBNJYAUGAI), t = (BHJKBNYGU)
y: an alignment such as
AB-JLHBNJYAUGAI
BHJK-BN-YGU
Complex Structured Output
The output space Y contains complex objects; a multistage process propagates its errors. Why isn't this just a simple multiclass problem?
(Illustration: candidate parse trees $y_1, y_2, \ldots, y_k$ for "The dog chased the cat", one class per possible tree.)
Learning with Complex Structured Output
Example: POS tagging of the sentence x = "Curiosity killed the cat". We need to compute $\mathrm{argmax}_y\,\Phi(x,y)^T\theta$, here (N, V, Det, N).
Doing this explicitly means comparing $\Phi(x,(N,V,Det,N))^T\theta$ against $\Phi(x,(N,N,N,N))^T\theta$, $\Phi(x,(N,N,N,V))^T\theta$, $\Phi(x,(N,N,V,N))^T\theta$, $\Phi(x,(N,N,V,V))^T\theta$, $\Phi(x,(N,V,N,N))^T\theta$, and so on for every possible tag sequence. Too many!
Complex Structured Output
The output space Y contains complex objects; a multistage process propagates its errors.
The exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm.
Complex Structured Output
The exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm.
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y) = \mathrm{argmax}_y \Phi(x,y)^T\theta$.
To reduce the number of parameters, one needs an efficient representation $\Phi(x,y)$ of the input and output. This representation depends on the concrete problem definition.
Example: Sequential Input & Output (Feature-Mapping)
Tokens $x_1, \ldots, x_4$ = "Curiosity killed the cat" with labels $y_1, \ldots, y_4$.
An attribute for every pair of adjacent labels $y_t$ and $y_{t+1}$: $\phi_{N,V}(y_t, y_{t+1}) = I[y_t = \mathrm{Noun} \wedge y_{t+1} = \mathrm{Verb}]$.
An attribute for every pair of input and output: $\phi_{\mathrm{cat},N}(x_t, y_t) = I[x_t = \mathrm{cat} \wedge y_t = \mathrm{Noun}]$.
Label-label counts $\Phi_i = \sum_t \phi_i(y_t, y_{t+1})$ and label-observation counts $\Phi_j = \sum_t \phi_j(x_t, y_t)$.
Joint feature representation: $\Phi(x,y) = (\ldots, \sum_t \phi_{N,V}(y_t,y_{t+1}), \ldots, \sum_t \phi_{\mathrm{cat},N}(x_t,y_t), \ldots)^T$; weight vector $w = (\ldots, w_{N,V}, \ldots, w_{\mathrm{cat},N}, \ldots)^T$.
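A minimal sketch of this joint feature map: transition counts and emission counts collected into one sparse vector keyed by feature name (the tag set and helper names are our assumptions, not the slides' code):

from collections import Counter

def sequence_features(tokens, tags):
    # Phi(x, y): counts of adjacent label pairs and of (token, label) pairs
    feats = Counter()
    for t in range(len(tags) - 1):
        feats[("trans", tags[t], tags[t + 1])] += 1   # sum_t phi_i(y_t, y_{t+1})
    for token, tag in zip(tokens, tags):
        feats[("emit", token.lower(), tag)] += 1      # sum_t phi_j(x_t, y_t)
    return feats

def sequence_score(tokens, tags, w):
    # f_theta(x, y) = Phi(x, y)^T theta with a sparse weight dictionary w
    return sum(w.get(f, 0.0) * c for f, c in sequence_features(tokens, tags).items())

print(sequence_features(["Curiosity", "killed", "the", "cat"], ["N", "V", "Det", "N"]))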
Example: Sequential Input & Output (Decoding / Prediction)
To classify a sequence, we must compute $y(x) = \mathrm{argmax}_y f_\theta(x,y)$. The argmax runs over all possible label sequences, exponentially many in the sequence length.
$f_\theta(x,y) = \Phi(x,y)^T\theta$ sums over the features of adjacent label-label pairs and the features of $(x_t, y_t)$ pairs; the summands only differ where the y-sequences also differ.
Using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm).
Example: Sequential Input & Output (Decoding / Prediction)
To classify a sequence, we must compute $y(x) = \mathrm{argmax}_y f_\theta(x,y)$; using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm).
Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at time t-1. With labels N, V, D, the scores $\gamma_{t-1}(N), \gamma_{t-1}(V), \gamma_{t-1}(D)$ already computed, and $x_t = \mathrm{cat}$:
$\gamma_t(N) = \max_z\,(w_{z,N} + \gamma_{t-1}(z)) + w_{\mathrm{cat},N}$, with back pointer $p_t(N) = \mathrm{argmax}_z\,(w_{z,N} + \gamma_{t-1}(z))$,
$\gamma_t(V) = \max_z\,(w_{z,V} + \gamma_{t-1}(z)) + w_{\mathrm{cat},V}$, with back pointer $p_t(V) = \mathrm{argmax}_z\,(w_{z,V} + \gamma_{t-1}(z))$,
$\gamma_t(D) = \max_z\,(w_{z,D} + \gamma_{t-1}(z)) + w_{\mathrm{cat},D}$, with back pointer $p_t(D) = \mathrm{argmax}_z\,(w_{z,D} + \gamma_{t-1}(z))$.
Example: Sequential Input & Output (Decoding / Prediction)
Once $\gamma_t$ has been computed for the entire sequence, maximize $\gamma_T$ over the labels and follow the back pointers $p_t$ backwards to find the maximizing sequence; for x = "Curiosity killed the cat" this yields y = (N, V, D, N).
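A compact Viterbi sketch for this linear model, assuming the sparse weight dictionary and feature keys from the feature-map sketch above (not the slides' own code):

def viterbi(tokens, tags, w):
    # gamma[t][y]: score of the best label subsequence ending in label y at position t
    # back[t][y]: the maximizing predecessor label (back pointer p_t(y))
    emit = lambda token, y: w.get(("emit", token.lower(), y), 0.0)
    trans = lambda z, y: w.get(("trans", z, y), 0.0)
    gamma = [{y: emit(tokens[0], y) for y in tags}]
    back = [{}]
    for t in range(1, len(tokens)):
        gamma.append({})
        back.append({})
        for y in tags:
            z_best = max(tags, key=lambda z: gamma[t - 1][z] + trans(z, y))
            back[t][y] = z_best
            gamma[t][y] = gamma[t - 1][z_best] + trans(z_best, y) + emit(tokens[t], y)
    # Backtrace: start from the best final label and follow the back pointers
    path = [max(tags, key=lambda y: gamma[-1][y])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))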
Structured Output Space
Sequential input / output. Decoding: Viterbi algorithm.
Feature mapping: entries for adjacent states, $\phi_{N,V}(y_t, y_{t+1}) = I[y_t = \mathrm{Noun} \wedge y_{t+1} = \mathrm{Verb}]$, and entries for observations in a state, $\phi_{\mathrm{cat},N}(x_t, y_t) = I[x_t = \mathrm{cat} \wedge y_t = \mathrm{Noun}]$.
Structured Output Space Example: Natural Language Parsing
x = "Curiosity killed the cat"; y is its parse tree (S → NP VP, NP → N, VP → V NP, N → Curiosity, ..., N → cat).
Features indicate which grammar productions appear in the tree, $\phi_{NP \to N}(y) = I[(NP \to N) \in y]$, including lexical productions such as $\phi_{N \to \mathrm{cat}}(x,y) = I[(N \to \mathrm{cat}) \in y]$; over the rules (S → NP VP, NP → N, VP → V, VP → V NP, N → ate, N → cat) the tree above gives $\Phi(x,y) = (1, 1, 0, 1, 0, 1)^T$.
Joint feature representation: $\Phi(x,y) = (\ldots, \phi_{NP \to N}(y), \ldots, \phi_{N \to \mathrm{cat}}(x,y), \ldots)^T$; weight vector $w = (\ldots, w_{NP \to N}, \ldots, w_{N \to \mathrm{cat}}, \ldots)^T$.
Decoding with dynamic programming: CKY parser, $O(n^3)$.
Structured Output Space Example: Collective Classification
Linked instances $x_1, \ldots, x_4$ with labels $y_1, \ldots, y_4$ connected in a graph.
An attribute for every pair of adjacent labels $y_i$ and $y_j$: $\phi_{123}(y_i, y_j)$.
An attribute for every pair of input and output: $\phi_{\mathrm{cat},N}(x_t, y_t) = I[y_t = \mathrm{Institute} \wedge x_t = \ldots]$.
Decoder: message passing algorithm.
Complex Structured Output
The exponential number of classes potentially leads to exponentially many parameters to estimate (handled by an efficient encoding), inefficient predictions, and an inefficient learning algorithm.
Optimization problem: $\min_{\theta,\xi}\ \lambda\sum_{i=1}^{n}\xi_i + \frac{1}{2}\|\theta\|_2^2$ such that $\xi_i \geq 0$ and $\forall y \neq y_i:\ f_\theta(x_i,y_i) \geq f_\theta(x_i,y) + 1 - \xi_i$ (exponentially many constraints).
Learning with Complex Structured Output
Optimization problem: $\min_{\theta,\xi}\ \lambda\sum_{i=1}^{n}\xi_i + \frac{1}{2}\|\theta\|_2^2$ such that $\xi_i \geq 0$ and $\forall y \neq y_i:\ f_\theta(x_i,y_i) \geq f_\theta(x_i,y) + 1 - \xi_i$ (exponentially many constraints).
Optimization through iterative training: negative constraints are added whenever an error occurs during training. [Tsochantaridis et al., 2004]
Learning with Complex Structured Output: Cutting Plane Algorithm
Given: $L = \{(x_1,y_1), \ldots, (x_n,y_n)\}$.
Repeat until all sequences are correctly predicted:
Iterate over all examples $(x_i, y_i)$:
Compute $\bar{y} = \mathrm{argmax}_{y \neq y_i}\,\Phi(x_i,y)^T\theta$.
If $\Phi(x_i,y_i)^T\theta < \Phi(x_i,\bar{y})^T\theta + 1$ (margin violation), then add the constraint $\Phi(x_i,y_i)^T\theta \geq \Phi(x_i,\bar{y})^T\theta + 1 - \xi_i$ to the working set of constraints and solve the optimization problem for input $x_i$, output $y_i$, and the negative pseudo-examples $\bar{y}$ in the working set.
Return the learned $\theta$.
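A schematic sketch of this loop, not a reference implementation: joint_features, decode_best_wrong, and solve_qp are placeholders for the problem-specific feature map, the argmax decoder, and the QP solver over the working set.

def cutting_plane(data, joint_features, decode_best_wrong, solve_qp, theta0, max_epochs=50):
    # data: list of (x_i, y_i); the working set collects violated margin constraints
    theta = theta0
    working_set = []
    for _ in range(max_epochs):
        violations = 0
        for x_i, y_i in data:
            y_bar = decode_best_wrong(x_i, y_i, theta)   # argmax over y != y_i of Phi(x_i, y)^T theta
            if joint_features(x_i, y_i) @ theta < joint_features(x_i, y_bar) @ theta + 1:
                # margin violation: add Phi(x_i, y_i)^T theta >= Phi(x_i, y_bar)^T theta + 1 - xi_i
                working_set.append((x_i, y_i, y_bar))
                theta = solve_qp(working_set)            # re-solve over the current working set
                violations += 1
        if violations == 0:                              # every example is predicted correctly
            break
    return theta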
Structured Output Spaces: General Framework
Customized in terms of: the loss function, the joint representation of the input and output, and the decoder $\bar{y} = \mathrm{argmax}_{y \neq y_i}\,\Phi(x_i,y)^T\theta$, optionally with a loss function.
Implementation: http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html
Complex Structured Output
The output space Y contains complex objects; a multistage process propagates its errors. There is an exponential number of classes, but fewer parameters to estimate, efficient predictions, and an efficient learning algorithm:
the feature mapping reduces the number of parameters for every class,
the problem-specific encoding enables efficient decoding,
and the cutting plane algorithm makes learning efficient.
RANKING
Ranking
Can we also learn with other types of structure? In ranking, every prediction is an ordering; we want to use this ordering information to improve our learning algorithm.
Ranking
Instances should be placed in the correct order, e.g., relevance ranking of search results or of product recommendations.
Samples are pairs: $L = \{f(x_i) > f(x_j), \ldots\}$, meaning $x_i$ should appear before $x_j$ in the ranking.
Ranking
Relevance ranking of search results: website $x_i$, search query $q$.
Joint feature representation $\Phi(x_i, q)$ for websites and search queries: $\Phi(x_i, q)$ can be a vector of features that appear in both $x_i$ and $q$ (i.e., their correspondence), e.g., the count of corresponding words, quasi-correspondence in H1 neighborhoods, PageRank, ...
Samples are pairs: $L = \{f(x_i, q) > f(x_j, q), \ldots\}$: $x_i$ should appear before $x_j$ in the ranking for query $q$.
Ranking
Relevance ranking of search results; samples are pairs $L = \{f(x_i, q) > f(x_j, q), \ldots\}$, meaning $x_i$ should appear before $x_j$ in the ranking for query $q$.
Samples can be taken from click data: a user issues query $q$, receives a list of results, then clicks on the i-th element $x_i$ of the list. Implicitly, the user has rejected list elements 1, ..., i-1; for this user and query $q$, $x_i$ should have been placed first.
Ranking
Relevance ranking of search results: a user issues query $q$, receives a list of results, then clicks on the i-th element $x_i$; for this user and query $q$, $x_i$ should have been placed first.
From this encounter, we infer the samples $L_q = \{f(x_i, q) > f(x_1, q), \ldots, f(x_i, q) > f(x_{i-1}, q)\}$.
Ranking
Given samples
$L = \{f(x_{1 i_1}, q_1) > f(x_{11}, q_1), \ldots, f(x_{1 i_1}, q_1) > f(x_{1, i_1 - 1}, q_1), \ldots, f(x_{m i_m}, q_m) > f(x_{m 1}, q_m), \ldots, f(x_{m i_m}, q_m) > f(x_{m, i_m - 1}, q_m)\}$
Solution: $\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{j}\sum_{i} \xi_{ji}$ s.t. $\forall j\ \forall i < i_j:\ w^T(\Phi(x_{j i_j}, q_j) - \Phi(x_{j i}, q_j)) \geq 1 - \xi_{ji}$ and $\forall j, i:\ \xi_{ji} \geq 0$.
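A sketch of how these pairwise preferences could be turned into difference vectors for a standard linear SVM; the feature function phi(x, q) returning a numpy vector is an assumption for illustration, not part of the slides.

import numpy as np

def ranking_pairs_from_click(clicked_idx, results, q, phi):
    # One preference per rejected result shown above the clicked one:
    # w^T (Phi(x_clicked, q) - Phi(x_rejected, q)) >= 1 - xi
    x_clicked = phi(results[clicked_idx], q)
    diffs = [x_clicked - phi(results[i], q) for i in range(clicked_idx)]
    # For an off-the-shelf binary SVM, label each difference +1 and its negation -1.
    X = np.array(diffs + [-d for d in diffs])
    y = np.array([1] * len(diffs) + [-1] * len(diffs))
    return X, y

The weight vector learned from such pairs then ranks documents for a query by $w^T\Phi(x, q)$.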
Structured Classification - Summary
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$ with $f_\theta(x,y) = \Phi(x,y)^T\theta$; learning is formulated as a multiclass SVM.
The structure of Y is captured by $\Lambda(y)$, giving $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$ and $k((x_i,y_i),(x_j,y_j)) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j)$.
Structured input & output: Y is high dimensional, but the feature mapping leaves fewer parameters to estimate, the problem-specific encoding gives efficient prediction, and the cutting plane algorithm gives an efficient learning algorithm.
Other structured problems: ranking uses order constraints.