Linear Classifiers IV
Transcription
1 Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen. Linear Classifiers IV. Blaine Nelson, Tobias Scheffer
2 Contents
Classification Problem: Bayesian Classifier, Decision, Linear Classifiers, MAP Models, Logistic Regression
Regularized Empirical Risk Minimization: Kernel Perceptron, Support Vector Machine, Ridge Regression, LASSO
Representer Theorem: Dualized Perceptron, Dual SVM, Mercer Map
Learning with Structured Input & Output: Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm
3 Recall: Binary SVM
Classification for two classes: $y(x) = \mathrm{sign}(f_\theta(x))$, $f_\theta(x) = \phi(x)^T \theta$.
A parameter vector $\theta$.
Optimization problem:
$$\min_{\theta,\xi}\; \lambda \sum_{i=1}^n \xi_i + \tfrac{1}{2}\theta^T\theta \quad \text{such that } \xi_i \ge 0 \text{ and } y_i\,\phi(x_i)^T\theta \ge 1 - \xi_i$$
Does this generalize to k classes?
4 Recall: Multiclass Logistic Regression
In the multiclass case, the linear model has a decision function $f_\theta(x,y) = \phi(x)^T\theta_y + b_y$ and a classifier $y(x) = \mathrm{argmax}_{z \in Y} f_\theta(x,z)$.
Logistic model for multiclass:
$$P(y \mid x, \theta) = \frac{\exp(\phi(x)^T\theta_y + b_y)}{\sum_{z \in Y} \exp(\phi(x)^T\theta_z + b_z)}$$
The prior is a normal distribution: $p(\theta) = N(\theta; 0, \Sigma)$.
The maximum-a-posteriori parameter is
$$\theta_{MAP} = \mathrm{argmin}_\theta \underbrace{\sum_{i=1}^n \Big[\log \sum_{z \in Y} \exp\big(f_\theta(x_i,z)\big) - f_\theta(x_i,y_i)\Big]}_{\text{loss}} + \underbrace{\frac{\theta^T \Sigma^{-1} \theta}{2}}_{\text{regularizer}}$$
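The MAP objective above is easy to state in code. A minimal NumPy sketch (not from the slides; it assumes $\Sigma = I$ and omits the biases $b_y$, which could be folded into the feature vector):

```python
import numpy as np

def map_objective(theta, X, y):
    """Negative log-posterior for multiclass logistic regression.

    theta: (k, d) matrix, one weight vector per class (biases omitted).
    X: (n, d) feature matrix; y: (n,) integer labels in {0, ..., k-1}.
    Assumes a standard-normal prior, i.e. Sigma = I.
    """
    scores = X @ theta.T                             # f_theta(x_i, z) for all classes z
    log_norm = np.logaddexp.reduce(scores, axis=1)   # log sum_z exp f_theta(x_i, z)
    loss = np.sum(log_norm - scores[np.arange(len(y)), y])
    regularizer = 0.5 * np.sum(theta ** 2)           # theta^T Sigma^{-1} theta / 2
    return loss + regularizer
```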
5 Generalizing to Multiclass Problems
Recall multiclass logistic regression:
$y(x) = \mathrm{sign}(f_\theta(x))$ became $y(x) = \mathrm{argmax}_y f_\theta(x,y)$.
The single parameter vector $\theta$ became $\theta = (\theta_1, \dots, \theta_k)^T$ with a component for each class.
For an SVM with k classes:
Regularizer: $\tfrac{1}{2}\theta^T\theta$ becomes $\tfrac{1}{2}\sum_{j=1}^k \theta_j^T\theta_j$.
Margin constraints?
6 Generalizing to Multiclass Problems
Recall multiclass logistic regression:
$y(x) = \mathrm{sign}(f_\theta(x))$ became $y(x) = \mathrm{argmax}_y f_\theta(x,y)$.
The single parameter vector $\theta$ became $\theta = (\theta_1, \dots, \theta_k)^T$ with a component for each class.
For an SVM with k classes:
Regularizer: $\tfrac{1}{2}\theta^T\theta$ becomes $\tfrac{1}{2}\sum_{j=1}^k \theta_j^T\theta_j$.
Margin constraints: $y_i\,\phi(x_i)^T\theta \ge 1 - \xi_i$ becomes $\forall y \ne y_i:\; f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i$.
7 Multiclass SVM
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \phi(x)^T\theta_y$.
A parameter vector for each of the k classes: $\theta = (\theta_1, \dots, \theta_k)^T$.
Optimization problem:
$$\min_{\theta,\xi}\; \lambda \sum_{i=1}^n \xi_i + \tfrac{1}{2}\sum_{j=1}^k \theta_j^T\theta_j \quad \text{such that } \xi_i \ge 0 \text{ and } \forall y \ne y_i:\; f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i$$
[J. Weston, C. Watkins, 1999]
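For any fixed $\theta$, the smallest feasible slack of an example under these constraints can be read off directly, which is how the multiclass hinge loss is usually evaluated. A small sketch (illustrative, with a user-supplied feature map `phi`):

```python
import numpy as np

def multiclass_slack(theta, x, y_true, phi):
    """Smallest feasible slack xi = max(0, 1 - (f(x, y_true) - max_{y != y_true} f(x, y))).

    theta: (k, d) matrix of per-class weight vectors.
    phi:   feature map returning a d-dimensional vector for x.
    """
    scores = theta @ phi(x)        # f_theta(x, y) for every class y
    true_score = scores[y_true]
    scores[y_true] = -np.inf       # exclude y_true from the max
    margin = true_score - scores.max()
    return max(0.0, 1.0 - margin)
```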
8 Multiclass Feature-Mapping
Different weight vectors can be seen as different slices of a single weight vector: $\theta = (\theta_1, \dots, \theta_k)^T$.
Joint representation of input and output, e.g. for $y = 2$:
$$\Phi(x, 2) = (0, \phi(x), 0, \dots, 0)^T$$
i.e., k concatenated feature vectors, with $\phi(x)$ in the slot of class y and zeros elsewhere.
9 Multiclass Feature-Mapping
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$.
Multiclass kernels: $f_\theta(x,y) = \Phi(x,y)^T\theta$ with
$$\Lambda(y) = \begin{pmatrix} I(y=1) \\ \vdots \\ I(y=k) \end{pmatrix}, \qquad \Phi(x,y) = \phi(x) \otimes \Lambda(y) = \begin{pmatrix} \phi(x)\, I(y=1) \\ \vdots \\ \phi(x)\, I(y=k) \end{pmatrix}$$
10 Multiclass Kernel Encoding
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$.
Multiclass kernels: $f_\theta(x,y) = \Phi(x,y)^T\theta$ with, e.g. for $x = (x_1, x_2)^T$ and $k = 2$ classes,
$$\Phi(x,y) = \begin{pmatrix} x_1\, I(y=1) \\ x_2\, I(y=1) \\ x_1\, I(y=2) \\ x_2\, I(y=2) \end{pmatrix}$$
$$k\big((x_i, y_i), (x_j, y_j)\big) = \Phi(x_i,y_i)^T \Phi(x_j,y_j) = \underbrace{I(y_i = y_j)}_{1 \text{ if } y_i = y_j,\; 0 \text{ otherwise}}\, k(x_i, x_j)$$
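The feature map and the kernel identity above can be checked numerically. A short sketch (assuming $\phi(x) = x$; `np.kron` reproduces the block layout from slide 9):

```python
import numpy as np

def joint_feature_map(x, y, k):
    """Phi(x, y): phi(x) = x placed into the y-th of k stacked blocks."""
    indicator = np.eye(k)[y]          # indicator vector (I(y=1), ..., I(y=k))
    return np.kron(indicator, x)      # blocks x * I(y=j) for j = 1..k

# Kernel identity: Phi(x_i,y_i)^T Phi(x_j,y_j) = I(y_i = y_j) * x_i^T x_j
x_i, x_j = np.array([1.0, 2.0]), np.array([3.0, 4.0])
for y_i, y_j in [(0, 0), (0, 1)]:
    lhs = joint_feature_map(x_i, y_i, 3) @ joint_feature_map(x_j, y_j, 3)
    rhs = (y_i == y_j) * (x_i @ x_j)
    assert lhs == rhs
```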
12 Classification with Information over Classes
Classes have their own features: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$, $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$.
$$k\big((x_i,y_i),(x_j,y_j)\big) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = \phi(x_i)^T\phi(x_j)\; \Lambda(y_i)^T\Lambda(y_j) = k(x_i,x_j)\, \underbrace{\Lambda(y_i)^T\Lambda(y_j)}_{\text{correspondence between classes}}$$
13 Classification with Information over Classes
Classes have their own features: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$, $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$.
$$k\big((x_i,y_i),(x_j,y_j)\big) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = \phi(x_i)^T\phi(x_j)\; \Lambda(y_i)^T\Lambda(y_j) = k(x_i,x_j)\, \Lambda(y_i)^T\Lambda(y_j)$$
Correspondence of the classes $\Lambda(y_i)^T\Lambda(y_j)$: e.g., the inner product of the class descriptions, with $\Lambda(y_i)$ a term-frequency (TF) vector over all words.
14 Multiclass SVM
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$; f now has two arguments.
Shared feature representation for input & output: $f_\theta(x,y) = \Phi(x,y)^T\theta$.
The same approach is used for multiclass, sequence, and structured learning, and for ranking.
15 Multiclass SVM Example
x is encoded, e.g., as a document; y = 2 is the encoded class.
$$\Phi(x, 2) = \begin{pmatrix} I(y=1)\,x \\ I(y=2)\,x \\ I(y=3)\,x \\ I(y=4)\,x \\ I(y=5)\,x \\ I(y=6)\,x \end{pmatrix} = \begin{pmatrix} 0 \\ x \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$
17 Multiclass SVM
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \phi(x)^T\theta_y$.
A parameter vector for each of the k classes: $\theta = (\theta_1, \dots, \theta_k)^T$.
Optimization problem:
$$\min_{\theta,\xi}\; \lambda \sum_{i=1}^n \xi_i + \tfrac{1}{2}\sum_{j=1}^k \theta_j^T\theta_j \quad \text{such that } \xi_i \ge 0 \text{ and } \forall y \ne y_i:\; f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i$$
18 Multiclass SVM
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$.
A parameter vector for each of the k classes, stacked into a single vector: $\theta = (\theta_1, \dots, \theta_k)^T$.
Optimization problem:
$$\min_{\theta,\xi}\; \lambda \sum_{i=1}^n \xi_i + \tfrac{1}{2}\theta^T\theta \quad \text{such that } \xi_i \ge 0 \text{ and } \forall y \ne y_i:\; f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i$$
19 STRUCTURED MULTICLASS OUTPUT
20 Taxonomic Output Structure
Suppose the k classes are related by an underlying tree structure (depth d).
(Tree: root Homininae = $v^1_1$; children Hominini = $v^2_1$ and Gorillini = $v^2_2$; leaves Pan = $v^3_1$, Homo = $v^3_2$, Gorilla = $v^3_3$.)
Each class is a path in the tree: $y = (y^1, \dots, y^d)$.
21 Taxonomic Output Structure
Suppose the k classes are related by an underlying tree structure (depth d); same tree as above.
Each class is a path in the tree: $y = (y^1, \dots, y^d)$.
Chimpanzee = (Homininae, Hominini, Pan)
22 Taxonomic Output Structure
Suppose the k classes are related by an underlying tree structure (depth d); same tree as above.
Each class is a path in the tree: $y = (y^1, \dots, y^d)$.
Human = (Homininae, Hominini, Homo)
23 Taxonomic Output Structure
Suppose the k classes are related by an underlying tree structure (depth d); same tree as above.
Each class is a path in the tree: $y = (y^1, \dots, y^d)$.
W. Gorilla = (Homininae, Gorillini, Gorilla)
24 Taxonomic Output Structure
Classes in a tree structure (depth d): the class for each x is a path in a class tree, $y = (y^1, \dots, y^d)$.
(Generic tree with nodes $v^1_1$; $v^2_1$, $v^2_2$; $v^3_1$, $v^3_2$, $v^3_3$.)
Encoding of the common features of input and output?
25 Classification with a Taxonomy
Classes in a tree structure (depth d): $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$, $y = (y^1, \dots, y^d)$.
$$\Lambda(y) = \begin{pmatrix} \Lambda^1(y^1) \\ \vdots \\ \Lambda^d(y^d) \end{pmatrix}, \qquad \Lambda^i(y^i) = \begin{pmatrix} I(y^i = v^i_1) \\ \vdots \\ I(y^i = v^i_{n_i}) \end{pmatrix}$$
$$\Phi(x,y) = \phi(x) \otimes \Lambda(y) = \phi(x) \otimes \big(I(y^1 = v^1_1), \dots, I(y^1 = v^1_{n_1}), \dots, I(y^d = v^d_1), \dots, I(y^d = v^d_{n_d})\big)^T$$
26 Classification with a Taxonomy Example
x is encoded, e.g., as a document; $y = (v^1_1, v^2_2, v^3_3)^T$ is a path, e.g., in a topic tree.
$$\Phi(x,y) = \begin{pmatrix} I(y^1 = v^1_1)\,x \\ I(y^2 = v^2_1)\,x \\ I(y^2 = v^2_2)\,x \\ I(y^3 = v^3_1)\,x \\ I(y^3 = v^3_2)\,x \\ I(y^3 = v^3_3)\,x \end{pmatrix} = \begin{pmatrix} x \\ 0 \\ x \\ 0 \\ 0 \\ x \end{pmatrix}$$
28 Classification with a Taxonomy: Kernelization
Classes in a tree structure (depth d): $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$, $y = (y^1, \dots, y^d)$.
$$k\big((x_i,y_i),(x_j,y_j)\big) = \Phi(x_i,y_i)^T\Phi(x_j,y_j) = \phi(x_i)^T\phi(x_j)\; \Lambda(y_i)^T\Lambda(y_j) = k(x_i,x_j) \sum_{l=1}^d \Lambda^l(y_i^l)^T \Lambda^l(y_j^l)$$
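Since $\Lambda(y_i)^T\Lambda(y_j)$ is just the number of nodes the two paths share, the taxonomy kernel is short to write down. A sketch using the hominid example (representing paths as tuples of node names is an assumption of this sketch, not slide notation):

```python
def taxonomy_kernel(x_i, y_i, x_j, y_j, k_x):
    """k((x_i,y_i),(x_j,y_j)) = k(x_i,x_j) * Lambda(y_i)^T Lambda(y_j).

    Paths are tuples of node names; the inner product of the stacked
    per-level indicator vectors counts the levels on which they agree.
    """
    common = sum(a == b for a, b in zip(y_i, y_j))
    return k_x(x_i, x_j) * common

human = ("Homininae", "Hominini", "Homo")
chimp = ("Homininae", "Hominini", "Pan")
gorilla = ("Homininae", "Gorillini", "Gorilla")

dot = lambda a, b: sum(u * v for u, v in zip(a, b))
x = (1.0, 0.5)
print(taxonomy_kernel(x, human, x, chimp, dot))    # paths share 2 of 3 nodes
print(taxonomy_kernel(x, human, x, gorilla, dot))  # paths share only the root
```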
29 Classification with a Taxonomy: Decoding / Prediction
Classes in a tree structure (depth d):
$$y(x) = \mathrm{argmax}_y\, f_\theta(x,y) = \mathrm{argmax}_y\, \Phi(x,y)^T\theta = \mathrm{argmax}_y \sum_{i=1}^n \alpha_i\, k\big((x_i,y_i),(x,y)\big)$$
$$= \mathrm{argmax}_y \sum_{i=1}^n \alpha_i\, k(x_i,x)\, \Lambda(y_i)^T\Lambda(y) = \mathrm{argmax}_y \sum_{i=1}^n \alpha_i\, k(x_i,x) \sum_{l=1}^d \Lambda^l(y_i^l)^T \Lambda^l(y^l)$$
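With a dual solution $(\alpha_i)$, decoding amounts to scoring every root-to-leaf path, of which there are only k. A sketch under the same tuple-of-nodes path convention as above (`support` holding the $(\alpha_i, x_i, y_i)$ triples is an illustrative convention):

```python
def decode(x, paths, support, k_x):
    """argmax over root-to-leaf paths y of sum_i alpha_i k(x_i, x) Lambda(y_i)^T Lambda(y)."""
    def score(y):
        return sum(alpha * k_x(x_i, x) * sum(a == b for a, b in zip(y_i, y))
                   for alpha, x_i, y_i in support)
    return max(paths, key=score)
```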
30 Structured Classification
The structure of the output space Y is encoded by defining $\Lambda(y)$.
The joint feature map $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$ captures the information from both the input and the class structure, allowing learning to utilize that structure.
This defines a multiclass kernel on the input & output: $k\big((x_i,y_i),(x_j,y_j)\big) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j)$.
31 COMPLEX STRUCTURES
32 Complex Structured Output
Suppose the output space Y contains complex objects. Can they be represented as a combination of binary prediction problems?
Examples: part-of-speech tagging and named-entity recognition, natural language parsing, sequence alignment.
33 Complex Output: Tagging
Sequential input / output.
Part-of-speech tagging: x = Curiosity kills the cat, y = (Noun, Verb, Determiner, Noun).
Named entity recognition, information extraction: x = Barbie meets Ken, y = (Person, -, Person).
34 Complex Output: Parsing
Natural language parsing: x = Curiosity killed the cat.
y = (S (NP (N Curiosity)) (VP (V killed) (NP (Det the) (N cat))))
35 Complex Output: Alignments
Sequence alignment: we are given two sequences; the prediction is an alignment between them.
x = (s = ABJLHBNJYAUGAI, t = BHJKBNYGU)
y = AB-JLHBNJYAUGAI
    BHJK-BN-YGU
36 Complex Structured Output
The output space Y contains complex objects. A multistage process propagates its errors. Why isn't this just a simple multiclass problem?
(Figure: candidate parse trees $y_1, y_2, \dots, y_k$ for the sentence "The dog chased the cat"; every complete parse would be its own class.)
37 Learning with Complex Structured Output
Example: POS tagging of the sentence x = Curiosity killed the cat.
We need to compute $\mathrm{argmax}_y\, \Phi(x,y)^T\theta$, here (N, V, Det, N).
Explicit enumeration would compare $\Phi(x,(N,V,Det,N))^T\theta$ against $\Phi(x,(N,N,N,N))^T\theta$, $\Phi(x,(N,N,N,V))^T\theta$, $\Phi(x,(N,N,V,N))^T\theta$, $\Phi(x,(N,N,V,V))^T\theta$, $\Phi(x,(N,V,N,N))^T\theta$, and so on for every label sequence. Too many!
38 Complex Structured Output
The output space Y contains complex objects. A multistage process propagates its errors.
An exponential number of classes potentially leads to: exponentially many parameters to estimate; inefficient predictions; an inefficient learning algorithm.
39 Complex Structured Output
An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm.
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y) = \mathrm{argmax}_y \Phi(x,y)^T\theta$.
To reduce the number of parameters, one needs an efficient representation of the input & output, $\Phi(x,y)$. This representation depends on the concrete problem definition.
40 Example: Sequential Input & Output (Feature-Mapping)
(Figure: a chain of labels $y_1, \dots, y_4$ over the observations $x_1, \dots, x_4$ = Curiosity killed the cat.)
An attribute for every pair of adjacent labels $y_t$ and $y_{t+1}$: $\phi_{N,V}(y_t, y_{t+1}) = I(y_t = \text{Noun} \wedge y_{t+1} = \text{Verb})$.
An attribute for every pair of input and output: $\phi_{cat,N}(x_t, y_t) = I(x_t = \text{cat} \wedge y_t = \text{Noun})$.
Label-label counts: $\Phi_i = \sum_t \phi_i(y_t, y_{t+1})$. Label-observation counts: $\Phi_j = \sum_t \phi_j(x_t, y_t)$.
Joint feature representation: $\Phi(x,y) = \big(\dots, \sum_t \phi_{N,V}(y_t,y_{t+1}), \dots, \sum_t \phi_{cat,N}(x_t,y_t), \dots\big)^T$.
Weight vector: $w = (\dots, w_{N,V}, \dots, w_{cat,N}, \dots)^T$.
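A sketch of this joint feature map as a sparse count vector; the ('trans', ...) and ('emit', ...) dictionary keys are an illustrative convention, not notation from the slides:

```python
from collections import Counter

def joint_features(x, y):
    """Phi(x, y) as a sparse vector of counts.

    Label-label features count adjacent tag pairs (y_t, y_{t+1});
    label-observation features count word/tag pairs (x_t, y_t).
    """
    phi = Counter()
    for t in range(len(y) - 1):
        phi[("trans", y[t], y[t + 1])] += 1
    for t in range(len(y)):
        phi[("emit", x[t], y[t])] += 1
    return phi

x = ["Curiosity", "killed", "the", "cat"]
y = ["N", "V", "Det", "N"]
print(joint_features(x, y))
# e.g. ('trans', 'N', 'V') -> 1, ('emit', 'cat', 'N') -> 1, ...
```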
41 Example: Sequential Input & Output (Decoding / Prediction)
To classify a sequence, we must compute $y(x) = \mathrm{argmax}_y f_\theta(x,y)$.
The argmax ranges over all possible sequences (exponentially many in the sequence length).
$f_\theta(x,y) = \Phi(x,y)^T\theta$ sums over the features of adjacent label-label pairs and the features of $(x_t, y_t)$ pairs. The summands only differ where the y-sequences also differ.
Using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm).
42 Example: Sequential Input & Output (Decoding / Prediction)
To classify a sequence, we must compute $y(x) = \mathrm{argmax}_y f_\theta(x,y)$. Using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm).
Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at time t-1 (tables $\gamma^{t-1}_N, \gamma^{t-1}_V, \gamma^{t-1}_D$ and $\gamma^t_N, \gamma^t_V, \gamma^t_D$ over the labels $y_t \in \{N, V, D\}$). For $x_t = \text{cat}$:
$$\gamma^t_N = \max_z\big(w_{z,N} + \gamma^{t-1}_z\big) + w_{cat,N}, \qquad p^t_N = \mathrm{argmax}_z\big(w_{z,N} + \gamma^{t-1}_z\big)$$
44 Example: Sequential Input & Output (Decoding / Prediction)
The recursion continues in the same way for the other labels. For $x_t = \text{cat}$:
$$\gamma^t_V = \max_z\big(w_{z,V} + \gamma^{t-1}_z\big) + w_{cat,V}, \qquad p^t_V = \mathrm{argmax}_z\big(w_{z,V} + \gamma^{t-1}_z\big)$$
45 Example: Sequential Input & Output (Decoding / Prediction)
And for the last label. For $x_t = \text{cat}$:
$$\gamma^t_D = \max_z\big(w_{z,D} + \gamma^{t-1}_z\big) + w_{cat,D}, \qquad p^t_D = \mathrm{argmax}_z\big(w_{z,D} + \gamma^{t-1}_z\big)$$
46 Example: Sequential Input & Output (Decoding / Prediction)
Once $\gamma^t$ has been computed for the entire sequence, maximize $\gamma^T$ over the final labels and follow the pointers $p^t$ back to find the maximizing sequence.
(Figure: the recovered tagging $y = (N, V, D, N)$ for $x_1, \dots, x_4$ = Curiosity killed the cat.)
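A minimal Viterbi sketch for this linear model, assuming the weight vector is stored as a dict keyed by the same ('trans', ...)/('emit', ...) convention as the feature sketch after slide 40; start/stop scores are omitted for brevity:

```python
def viterbi(x, labels, w):
    """Compute argmax_y sum_t w[('emit', x_t, y_t)] + sum_t w[('trans', y_t, y_{t+1})].

    gamma[t][z] is the best score of any label subsequence ending in label z
    at position t; back[t][z] stores the maximizing predecessor (the pointer p_t).
    """
    gamma = [{z: w.get(("emit", x[0], z), 0.0) for z in labels}]
    back = [{}]
    for t in range(1, len(x)):
        gamma.append({})
        back.append({})
        for z in labels:
            best_prev = max(
                labels, key=lambda u: gamma[t - 1][u] + w.get(("trans", u, z), 0.0))
            gamma[t][z] = (gamma[t - 1][best_prev]
                           + w.get(("trans", best_prev, z), 0.0)
                           + w.get(("emit", x[t], z), 0.0))
            back[t][z] = best_prev
    # maximize over the final label, then follow the pointers p_t back
    z = max(labels, key=lambda u: gamma[-1][u])
    y = [z]
    for t in range(len(x) - 1, 0, -1):
        z = back[t][z]
        y.append(z)
    return list(reversed(y))
```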
47 Structured Output Space
Sequential input / output. Decoding: Viterbi algorithm.
Feature mapping:
Entries for adjacent states: $\phi_{N,V}(y_t, y_{t+1}) = I(y_t = \text{Noun} \wedge y_{t+1} = \text{Verb})$.
Entries for observations in a state: $\phi_{cat,N}(x_t, y_t) = I(x_t = \text{cat} \wedge y_t = \text{Noun})$.
48 Structured Output Space
Example: natural language parsing.
x = Curiosity killed the cat; y = (S (NP (N Curiosity)) (VP (V killed) (NP (Det the) (N cat)))).
An attribute for every production of the grammar (S → NP VP, NP → N, VP → V, VP → V NP, N → ate, N → cat, ...), e.g. $\phi_{NP \to N}(y) = I(\text{production } NP \to N \text{ occurs in } y)$ and $\phi_{N \to cat}(x,y) = I(\text{production } N \to cat \text{ occurs in } y)$.
Joint feature representation: $\Phi(x,y) = (\dots, \phi_{NP \to N}(y), \dots, \phi_{N \to cat}(x,y), \dots)^T$.
Weight vector: $w = (\dots, w_{NP \to N}, \dots, w_{N \to cat}, \dots)^T$.
Decoding with dynamic programming: CKY parser, $O(n^3)$.
49 Structured Output Space
Example: collective classification.
(Figure: a graph over instances $x_1, \dots, x_4$ with labels $y_1, \dots, y_4$.)
An attribute for every pair of adjacent labels $y_i$ and $y_j$: $\phi_{123}(y_i, y_j)$.
An attribute for every pair of input and output: $\phi_{cat,N}(x_t, y_t) = I(y_t = \text{Institute} \wedge x_t = \dots)$.
Decoder: message passing algorithm.
50 Complex Structured Output
An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm.
Optimization problem:
$$\min_{\theta,\xi}\; \lambda \sum_{i=1}^n \xi_i + \tfrac{1}{2}\theta^T\theta \quad \text{such that } \xi_i \ge 0 \text{ and } \forall y \ne y_i:\; f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i$$
Exponentially many constraints; an efficient encoding of $\Phi(x,y)$ is needed.
51 Learning with Complex Structured Output
Optimization problem:
$$\min_{\theta,\xi}\; \lambda \sum_{i=1}^n \xi_i + \tfrac{1}{2}\theta^T\theta \quad \text{such that } \xi_i \ge 0 \text{ and } \forall y \ne y_i:\; f_\theta(x_i,y_i) \ge f_\theta(x_i,y) + 1 - \xi_i$$
There are exponentially many constraints. Optimization through iterative training: negative constraints are added when an error occurs during training. [Tsochantaridis et al., 2004]
52 Learning with Complex Structured Output: Cutting Plane Algorithm
Given: $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$.
Repeat until all sequences are correctly predicted:
Iterate over all examples $(x_i, y_i)$:
Compute $\hat y = \mathrm{argmax}_{y \ne y_i} \Phi(x_i, y)^T \theta$.
If $\Phi(x_i, y_i)^T \theta < \Phi(x_i, \hat y)^T \theta + 1$ (margin violation), then add the constraint $\Phi(x_i, y_i)^T \theta \ge \Phi(x_i, \hat y)^T \theta + 1 - \xi_i$ to the working set of constraints.
Solve the optimization problem for the inputs $x_i$, outputs $y_i$, and the negative pseudo-examples $\hat y$ in the working set.
Return the learned $\theta$.
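A sketch of this loop; `score`, `decode_excluding`, and `solve_qp` are hypothetical callables standing in for the scoring function $\Phi(x,y)^T\theta$, the problem-specific decoder, and the QP solver, not library functions:

```python
def cutting_plane(examples, score, decode_excluding, solve_qp, max_epochs=50):
    """Iterative training: add the most violated constraint, then re-solve.

    score(x, y, theta):             Phi(x, y)^T theta
    decode_excluding(x, y, theta):  argmax_{y' != y} Phi(x, y')^T theta
    solve_qp(working_set):          solve the SVM QP subject to the margin
                                    constraints in the working set; each
                                    entry is an (x, y_true, y_neg) triple.
    """
    working_set = []
    theta = solve_qp(working_set)  # initial solution, no constraints yet
    for _ in range(max_epochs):
        any_violation = False
        for x, y_true in examples:
            y_hat = decode_excluding(x, y_true, theta)
            # margin violation: f(x, y_true) < f(x, y_hat) + 1
            if score(x, y_true, theta) < score(x, y_hat, theta) + 1:
                working_set.append((x, y_true, y_hat))
                theta = solve_qp(working_set)
                any_violation = True
        if not any_violation:
            break  # all examples predicted correctly with margin
    return theta
```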
53 Structured Output Spaces: General Framework
Customized in terms of:
the loss function;
the joint representation of the input and output;
the decoder $\hat y = \mathrm{argmax}_{y \ne y_i} \Phi(x_i, y)^T \theta$, optionally with a loss function.
Implementation:
54 Complex Structured Output
The output space Y contains complex objects. A multistage process propagates its errors.
An exponential number of classes, but: fewer parameters to estimate (the feature mapping reduces the number of parameters for every class); efficient predictions (problem-specific encoding); an efficient learning algorithm (cutting plane algorithm).
55 RANKING
56 Ranking
Can we also learn with other types of structure?
Ranking: every prediction is an ordering. We want to use this ordering information to improve our learning algorithm.
57 Ranking
Instances should be placed in the correct order; e.g., relevance ranking of search results, or relevance ranking of product recommendations.
Samples are pairs: $L = \{f(x_i) > f(x_j)\}$, meaning $x_i$ should appear before $x_j$ in the ranking.
58 Ranking
Relevance ranking of search results: website $x_i$, search query q.
Joint feature representation $\Phi(x_i, q)$ for websites and search queries. $\Phi(x_i, q)$ can be a vector of different features that appear in both $x_i$ and q (i.e., their correspondence), e.g. the count of corresponding words, quasi-correspondence in H1 neighborhoods, PageRank, ...
Samples are pairs: $L = \{f(x_i, q) > f(x_j, q)\}$: $x_i$ should appear before $x_j$ in the ranking for query q.
59 Ranking
Relevance ranking of search results. Samples are pairs: $L = \{f(x_i, q) > f(x_j, q)\}$: $x_i$ should appear before $x_j$ in the ranking for query q.
Samples can be taken from click data: a user issues query q, receives a list of results, then clicks on the i-th element of the list, $x_i$. Implicitly, the user has rejected list elements 1, ..., i-1. For this user and query q, $x_i$ should have been placed first.
60 Ranking
Relevance ranking of search results. Samples are pairs: $L = \{f(x_i, q) > f(x_j, q)\}$.
A user issues query q, receives a list of results, then clicks on the i-th element of the list, $x_i$. For this user and query q, $x_i$ should have been placed first.
From this encounter, we infer the samples: $L_q = \{f(x_i, q) > f(x_1, q),\; \dots,\; f(x_i, q) > f(x_{i-1}, q)\}$.
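Extracting these implicit preference pairs from one click is straightforward; a small sketch (0-based indices; all names illustrative):

```python
def preference_pairs(results, clicked_index):
    """A click on position i yields pairs (x_i preferred over x_1, ..., x_{i-1})."""
    clicked = results[clicked_index]
    return [(clicked, rejected) for rejected in results[:clicked_index]]

results = ["page_a", "page_b", "page_c", "page_d"]
print(preference_pairs(results, 2))
# [('page_c', 'page_a'), ('page_c', 'page_b')]
```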
61 Ranking
Given samples
$$L = \big\{f(x_{1 i_1}, q_1) > f(x_{11}, q_1),\; \dots,\; f(x_{1 i_1}, q_1) > f(x_{1, i_1 - 1}, q_1),\; \dots,\; f(x_{m i_m}, q_m) > f(x_{m1}, q_m),\; \dots,\; f(x_{m i_m}, q_m) > f(x_{m, i_m - 1}, q_m)\big\}$$
Solution:
$$\min_{w,\xi}\; \tfrac{1}{2}\|w\|^2 + C \sum_{j,i} \xi_{ji} \quad \text{s.t. } \forall j\; \forall i < i_j:\; w^T\big(\Phi(x_{j i_j}, q_j) - \Phi(x_{j i}, q_j)\big) \ge 1 - \xi_{ji} \;\text{ and }\; \xi_{ji} \ge 0$$
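This objective can be read as a binary SVM over the difference vectors $\Phi(x_{j i_j}, q_j) - \Phi(x_{j i}, q_j)$. A hedged sketch of the corresponding regularized loss, assuming the pairwise feature vectors have been precomputed:

```python
import numpy as np

def ranking_svm_loss(w, pairs, C):
    """(1/2)||w||^2 + C * sum of hinge losses over preference pairs.

    pairs: list of (phi_preferred, phi_rejected) feature-vector pairs,
    i.e. precomputed Phi(x, q) for both documents of each constraint.
    """
    hinge = sum(max(0.0, 1.0 - w @ (a - b)) for a, b in pairs)
    return 0.5 * (w @ w) + C * hinge
```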
62 Structured Classification - Summary
Classification for more than two classes: $y(x) = \mathrm{argmax}_y f_\theta(x,y)$, $f_\theta(x,y) = \Phi(x,y)^T\theta$. Learning is formulated as a multiclass SVM.
The structure of Y is captured by $\Lambda(y)$, giving $\Phi(x,y) = \phi(x) \otimes \Lambda(y)$ and $k\big((x_i,y_i),(x_j,y_j)\big) = k(x_i,x_j)\,\Lambda(y_i)^T\Lambda(y_j)$.
Structured input & output: high-dimensional Y, but fewer parameters to estimate (feature mapping), efficient prediction (problem-specific encoding), and an efficient learning algorithm (cutting plane algorithm).
Other structured problems: ranking uses order constraints.