Linear Classifiers IV


1 Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen. Linear Classifiers IV. Blaine Nelson, Tobias Scheffer

2 Contents: Classification Problem; Bayesian Classifier, Decision Theory; Linear Classifiers, MAP Models; Logistic Regression; Regularized Empirical Risk Minimization; Kernel Perceptron, Support Vector Machine; Ridge Regression, LASSO; Representer Theorem; Dualized Perceptron, Dual SVM; Mercer Map; Learning with Structured Input & Output (Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm)

3 Recall: Binary SVM. Classification for two classes: y(x) = sign f_θ(x), with f_θ(x) = φ(x)^T θ and a single parameter vector θ. Optimization problem: min_{θ,ξ} λ Σ_{i=1}^n ξ_i + 1/2 θ^T θ such that ξ_i ≥ 0 and y_i φ(x_i)^T θ ≥ 1 - ξ_i for all i. Does this generalize to k classes?

4 Recall: Multiclass Logistic Regression. In the multiclass case, the linear model has a decision function f_θ(x, y) = φ(x)^T θ_y + b_y and a classifier y(x) = argmax_{z∈Y} f_θ(x, z). Logistic model for multiclass: P(y | x, θ) = exp(φ(x)^T θ_y + b_y) / Σ_{z∈Y} exp(φ(x)^T θ_z + b_z). The prior is a normal distribution, p(θ) = N(θ; 0, Σ). The Maximum-A-Posteriori parameter is θ_MAP = argmin_θ Σ_{i=1}^n [log Σ_{z∈Y} exp f_θ(x_i, z) - f_θ(x_i, y_i)] (loss) + 1/2 θ^T Σ^{-1} θ (regularizer).
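
A minimal numpy sketch of this MAP objective, under the assumption of an isotropic prior Σ = σ²I and no bias terms; the arrays Theta (stacked class weight vectors), Phi (stacked feature vectors φ(x_i)) and y (integer labels) are hypothetical names for illustration:

```python
import numpy as np

def multiclass_logreg_map_objective(Theta, Phi, y, sigma2=1.0):
    """Negative log-posterior: sum_i [log sum_z exp f(x_i,z) - f(x_i,y_i)] + theta^T Sigma^{-1} theta / 2."""
    scores = Phi @ Theta.T                                    # scores[i, z] = f_theta(x_i, z)
    m = scores.max(axis=1, keepdims=True)                     # stabilized log-sum-exp over classes
    log_norm = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).ravel()
    loss = np.sum(log_norm - scores[np.arange(len(y)), y])
    regularizer = 0.5 / sigma2 * np.sum(Theta ** 2)           # isotropic Gaussian prior
    return loss + regularizer
```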

5 Generalizing to Multiclass Problems. Recall multiclass logistic regression: y(x) = sign f_θ(x) became y(x) = argmax_y f_θ(x, y); the single parameter vector θ became θ = (θ_1, …, θ_k)^T with a component for each class. For an SVM with k classes: Regularizer? 1/2 θ^T θ becomes 1/2 Σ_{j=1}^k θ_j^T θ_j. Margin constraints?

6 Generalizing to Multiclass Problems. Recall multiclass logistic regression: y(x) = sign f_θ(x) became y(x) = argmax_y f_θ(x, y); the single parameter vector θ became θ = (θ_1, …, θ_k)^T with a component for each class. For an SVM with k classes: Regularizer: 1/2 θ^T θ becomes 1/2 Σ_{j=1}^k θ_j^T θ_j. Margin constraints: y_i φ(x_i)^T θ ≥ 1 - ξ_i becomes f_θ(x_i, y_i) ≥ f_θ(x_i, y) + 1 - ξ_i for all y ≠ y_i.

7 Multiclass SVM. Classification for more than two classes: y(x) = argmax_y f_θ(x, y), with f_θ(x, y) = φ(x)^T θ_y and a parameter vector for each of the k classes, θ = (θ_1, …, θ_k)^T. Optimization problem: min_{θ,ξ} λ Σ_{i=1}^n ξ_i + 1/2 Σ_{j=1}^k θ_j^T θ_j such that ξ_i ≥ 0 and f_θ(x_i, y_i) ≥ f_θ(x_i, y) + 1 - ξ_i for all y ≠ y_i. [J. Weston, C. Watkins, 1999]
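
As a rough numpy sketch (not the optimizer itself, just the objective value): the smallest feasible slack for example i is ξ_i = max(0, max_{y≠y_i} f_θ(x_i, y) + 1 - f_θ(x_i, y_i)); Theta, Phi and y are the same hypothetical arrays as above:

```python
import numpy as np

def multiclass_svm_objective(Theta, Phi, y, lam=1.0):
    """lam * sum_i xi_i + 1/2 * sum_j theta_j^T theta_j with the minimal feasible slacks."""
    scores = Phi @ Theta.T                       # scores[i, j] = f_theta(x_i, j)
    idx = np.arange(len(y))
    true_scores = scores[idx, y]
    scores_wrong = scores.copy()
    scores_wrong[idx, y] = -np.inf               # exclude the true class from the max
    xi = np.maximum(0.0, scores_wrong.max(axis=1) + 1.0 - true_scores)
    return lam * xi.sum() + 0.5 * np.sum(Theta ** 2)
```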

8 Multiclass Feature-Mapping. The different weight vectors can be seen as different slices of a single weight vector θ = (θ_1, …, θ_k)^T. Joint representation of input and output: Φ(x, y) consists of k concatenated feature vectors, with φ(x) placed in the block of class y and zeros elsewhere; e.g., Φ(x, 2) = (0, φ(x), 0, …, 0)^T.

9 Multiclass Feature-Mapping. Classification for more than two classes: y(x) = argmax_y f_θ(x, y). Multiclass kernels: f_θ(x, y) = Φ(x, y)^T θ, with Λ(y) = (I(y = 1), …, I(y = k))^T and Φ(x, y) = φ(x) ⊗ Λ(y) = (φ(x) I(y = 1), …, φ(x) I(y = k))^T.

10 Multiclass Kernel Encoding. Classification for more than two classes: y(x) = argmax_y f_θ(x, y). Multiclass kernels: f_θ(x, y) = Φ(x, y)^T θ, with Φ(x, y) = (x_1, x_2)^T ⊗ (I(y = 1), I(y = 2))^T = (x_1 I(y = 1), x_2 I(y = 1), x_1 I(y = 2), x_2 I(y = 2))^T. Hence k((x_i, y_i), (x_j, y_j)) = Φ(x_i, y_i)^T Φ(x_j, y_j) = I(y_i = y_j) k(x_i, x_j), i.e. k(x_i, x_j) if y_i = y_j and 0 otherwise.

11 Multiclass Kernel Encoding. Classification for more than two classes: y(x) = argmax_y f_θ(x, y). Multiclass kernels: f_θ(x, y) = Φ(x, y)^T θ, with Φ(x, y) = (x_1, x_2)^T ⊗ (I(y = 1), I(y = 2))^T = (x_1 I(y = 1), x_2 I(y = 1), x_1 I(y = 2), x_2 I(y = 2))^T. Hence k((x_i, y_i), (x_j, y_j)) = Φ(x_i, y_i)^T Φ(x_j, y_j) = I(y_i = y_j) k(x_i, x_j), i.e. k(x_i, x_j) if y_i = y_j and 0 otherwise.
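
The kernel identity on this slide translates directly into code; a small sketch with an arbitrary base kernel on the inputs (a linear kernel is used here only as a stand-in):

```python
import numpy as np

def linear_kernel(x_i, x_j):
    return float(np.dot(x_i, x_j))

def multiclass_kernel(x_i, y_i, x_j, y_j, k=linear_kernel):
    """k((x_i, y_i), (x_j, y_j)) = I[y_i = y_j] * k(x_i, x_j)."""
    return float(y_i == y_j) * k(x_i, x_j)
```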

12 Classification with Information over Classes. Classes have their own features: y(x) = argmax_y f_θ(x, y), f_θ(x, y) = Φ(x, y)^T θ, Φ(x, y) = φ(x) ⊗ Λ(y). Then k((x_i, y_i), (x_j, y_j)) = Φ(x_i, y_i)^T Φ(x_j, y_j) = φ(x_i)^T φ(x_j) Λ(y_i)^T Λ(y_j) = k(x_i, x_j) Λ(y_i)^T Λ(y_j), where Λ(y_i)^T Λ(y_j) is the correspondence between the classes.

13 Classification with Information over Classes. Classes have their own features: y(x) = argmax_y f_θ(x, y), f_θ(x, y) = Φ(x, y)^T θ, Φ(x, y) = φ(x) ⊗ Λ(y). Then k((x_i, y_i), (x_j, y_j)) = Φ(x_i, y_i)^T Φ(x_j, y_j) = φ(x_i)^T φ(x_j) Λ(y_i)^T Λ(y_j) = k(x_i, x_j) Λ(y_i)^T Λ(y_j), where Λ(y_i)^T Λ(y_j) is the correspondence between the classes. Correspondence of the classes Λ(y_i)^T Λ(y_j): e.g., the inner product of the class descriptions, with Λ(y_i) a term-frequency (TF) vector over all words.
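
If each class carries a description vector Λ(y), the joint kernel factorizes as above. A sketch assuming a hypothetical mapping class_description from class labels to, e.g., TF vectors of the class descriptions:

```python
import numpy as np

def class_correspondence_kernel(x_i, y_i, x_j, y_j, k, class_description):
    """k((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) * Lambda(y_i)^T Lambda(y_j)."""
    correspondence = float(np.dot(class_description[y_i], class_description[y_j]))
    return k(x_i, x_j) * correspondence
```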

14 Multiclass SVM. Classification for more than two classes: y(x) = argmax_y f_θ(x, y); f now has two arguments. Shared feature representation for input & output: f_θ(x, y) = Φ(x, y)^T θ. The same approach is used for multiclass, sequence and structured learning, and ranking.

15 Multiclass SVM Example. x is the encoded input, e.g., a document; y = 2 is the encoded class. Φ(x, y) = (I(y = 1) x, I(y = 2) x, I(y = 3) x, I(y = 4) x, I(y = 5) x, I(y = 6) x)^T = (0, x, 0, 0, 0, 0)^T.

16 Multiclass SVM Example. x is the encoded input, e.g., a document; y = 2 is the encoded class. Φ(x, y) = (I(y = 1) x, I(y = 2) x, I(y = 3) x, I(y = 4) x, I(y = 5) x, I(y = 6) x)^T = (0, x, 0, 0, 0, 0)^T.
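
A sketch of this block construction: φ(x) is copied into the block belonging to class y and all other blocks stay zero (classes are indexed 0, …, k-1 here, so the slide's y = 2 corresponds to index 1):

```python
import numpy as np

def joint_feature_map(phi_x, y, k):
    """Phi(x, y): phi(x) placed in block y of a k-block vector, all other blocks zero."""
    phi_x = np.asarray(phi_x, dtype=float)
    d = phi_x.shape[0]
    Phi = np.zeros(k * d)
    Phi[y * d:(y + 1) * d] = phi_x
    return Phi
```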

17 Multiclass SVM. Classification for more than two classes: y(x) = argmax_y f_θ(x, y), with f_θ(x, y) = φ(x)^T θ_y and a parameter vector for each of the k classes, θ = (θ_1, …, θ_k)^T. Optimization problem: min_{θ,ξ} λ Σ_{i=1}^n ξ_i + 1/2 Σ_{j=1}^k θ_j^T θ_j such that ξ_i ≥ 0 and f_θ(x_i, y_i) ≥ f_θ(x_i, y) + 1 - ξ_i for all y ≠ y_i.

18 Multiclass SVM. Classification for more than two classes: y(x) = argmax_y f_θ(x, y), with f_θ(x, y) = Φ(x, y)^T θ and the parameter vectors of the k classes stacked as θ = (θ_1, …, θ_k)^T. Optimization problem: min_{θ,ξ} λ Σ_{i=1}^n ξ_i + 1/2 θ^T θ such that ξ_i ≥ 0 and f_θ(x_i, y_i) ≥ f_θ(x_i, y) + 1 - ξ_i for all y ≠ y_i.

19 STRUCTURED MULTICLASS OUTPUT

20 Taxonomic Output Structure. Suppose the k classes are related by an underlying tree structure (depth d). [Tree figure: root Homininae (v_1^1) with children Hominini (v_1^2) and Gorillini (v_2^2); leaves Pan (v_1^3), Homo (v_2^3), Gorilla (v_3^3).] Each class is a path in the tree: y = (y^1, …, y^d).

21 Taxonomic Output Structure. Suppose the k classes are related by an underlying tree structure (depth d). [Same tree figure as above.] Each class is a path in the tree, y = (y^1, …, y^d); e.g., Chimpanzee = (Homininae, Hominini, Pan).

22 Taxonomic Output Structure. Suppose the k classes are related by an underlying tree structure (depth d). [Same tree figure as above.] Each class is a path in the tree, y = (y^1, …, y^d); e.g., Human = (Homininae, Hominini, Homo).

23 Taxonomic Output Structure. Suppose the k classes are related by an underlying tree structure (depth d). [Same tree figure as above.] Each class is a path in the tree, y = (y^1, …, y^d); e.g., W. Gorilla = (Homininae, Gorillini, Gorilla).

24 Taxonomic Output Structure. Classes in a tree structure (depth d): the class of each x is a path in the class tree, y = (y^1, …, y^d). [Tree figure with nodes v_1^1; v_1^2, v_2^2; v_1^3, v_2^3, v_3^3.] How do we encode the common features of input and output?

25 Classification with a Taxonomy. Classes in a tree structure (depth d): y(x) = argmax_y f_θ(x, y), f_θ(x, y) = Φ(x, y)^T θ, y = (y^1, …, y^d). The label encoding stacks one indicator block per level: Λ(y) = (Λ(y^1), …, Λ(y^d))^T with Λ(y^l) = (I(y^l = v_1^l), …, I(y^l = v_{n_l}^l))^T, and Φ(x, y) = φ(x) ⊗ Λ(y) = φ(x) ⊗ (I(y^1 = v_1^1), …, I(y^1 = v_{n_1}^1), …, I(y^d = v_1^d), …, I(y^d = v_{n_d}^d))^T. [Same tree figure as above.]

26 Classification with a Taxonomy Example. x is the encoded input, e.g., a document; y = (v_1^1, v_2^2, v_3^3)^T is a path, e.g., in a topic tree. Φ(x, y) = (I(y^1 = v_1^1) x, I(y^2 = v_1^2) x, I(y^2 = v_2^2) x, I(y^3 = v_1^3) x, I(y^3 = v_2^3) x, I(y^3 = v_3^3) x)^T = (x, 0, x, 0, 0, x)^T. [Same tree figure as above.]

27 Classification with a Taxonomy Example. x is the encoded input, e.g., a document; y = (v_1^1, v_2^2, v_3^3)^T is a path, e.g., in a topic tree. Φ(x, y) = (I(y^1 = v_1^1) x, I(y^2 = v_1^2) x, I(y^2 = v_2^2) x, I(y^3 = v_1^3) x, I(y^3 = v_2^3) x, I(y^3 = v_3^3) x)^T = (x, 0, x, 0, 0, x)^T. [Same tree figure as above.]

28 Classification with a Taxonomy Kernelization. Classes in a tree structure (depth d), y = (y^1, …, y^d): y(x) = argmax_y f_θ(x, y), f_θ(x, y) = Φ(x, y)^T θ. Kernel: k((x_i, y_i), (x_j, y_j)) = Φ(x_i, y_i)^T Φ(x_j, y_j) = φ(x_i)^T φ(x_j) Λ(y_i)^T Λ(y_j) = k(x_i, x_j) Σ_{l=1}^d Λ(y_i^l)^T Λ(y_j^l). [Same tree figure as above.]
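
Since Λ(y_i^l)^T Λ(y_j^l) is 1 exactly when the two paths pass through the same node at level l, the label part of this kernel simply counts the levels on which the paths agree (for a tree, the length of their common prefix). A small sketch, assuming classes are given as node paths from the root:

```python
def taxonomy_kernel(x_i, path_i, x_j, path_j, k):
    """k((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) * sum_l I[y_i^l = y_j^l] for tree-path labels."""
    shared_levels = sum(1 for a, b in zip(path_i, path_j) if a == b)
    return k(x_i, x_j) * shared_levels
```

For instance, the paths (Homininae, Hominini, Homo) and (Homininae, Hominini, Pan) agree on two levels, so the base kernel value is multiplied by 2.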

29 Classification with a Taxonomy Decoding / Prediction. Classes in a tree structure (depth d): f_θ(x, y) = Φ(x, y)^T θ, and y(x) = argmax_y f_θ(x, y) = argmax_y Σ_{i=1}^n α_i k((x_i, y_i), (x, y)) = argmax_y Σ_{i=1}^n α_i k(x_i, x) Λ(y_i)^T Λ(y) = argmax_y Σ_{i=1}^n α_i k(x_i, x) Σ_{l=1}^d Λ(y_i^l)^T Λ(y^l). [Same tree figure as above.]

30 Structured Classification. The structure of the output space Y is encoded by defining Λ(y). The joint feature map Φ(x, y) = φ(x) ⊗ Λ(y) captures the information from both the input and the class structure, allowing learning to utilize that structure. This defines a multiclass kernel on the input & output: k((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) Λ(y_i)^T Λ(y_j).

31 COMPLEX STRUCTURES

32 Complex Structured Output. Suppose the output space Y contains complex objects. Can they be represented as a combination of binary prediction problems? Examples: part-of-speech and named-entity recognition, natural language parsing, sequence alignment.

33 Complex Output Tagging. Sequential input / output. Part-of-speech recognition: x = (Curiosity, kills, the, cat), y = (Noun, Verb, Determiner, Noun). Named entity recognition, information extraction: x = (Barbie, meets, Ken), y = (Person, -, Person).

34 Complex Output Parsing. Natural language parsing: x = Curiosity killed the cat. [Parse tree figure: S → NP VP; NP → N (Curiosity); VP → V (killed) NP; NP → Det (the) N (cat).]

35 Complex Output Alignments. Sequence alignment: we are given two sequences, and the prediction is an alignment between them. x = (s = ABJLHBNJYAUGAI, t = BHJKBNYGU); predicted alignment: AB-JLHBNJYAUGAI over BHJK-BN-YGU.

36 Complex Structured Output. Output space Y contains complex objects. A multistage process propagates its errors. Why isn't this just a simple multiclass problem? [Figure: candidate parse trees y_1, y_2, …, y_k for the sentence "The dog chased the cat".]

37 Learning with Complex Structured Output. Example: POS tagging. Sentence x = Curiosity killed the cat; target y = (N, V, Det, N). We need to compute argmax_y Φ(x, y)^T θ. Explicitly, this means comparing Φ(x, (N, V, Det, N))^T θ against Φ(x, (N, N, N, N))^T θ, Φ(x, (N, N, N, V))^T θ, Φ(x, (N, N, V, N))^T θ, Φ(x, (N, N, V, V))^T θ, Φ(x, (N, V, N, N))^T θ, and so on: too many!

38 Complex Structured Output. Output space Y contains complex objects; a multistage process propagates its errors. An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm.

39 Complex Structured Output. An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm. Classification for more than two classes: y(x) = argmax_y f_θ(x, y) = argmax_y Φ(x, y)^T θ. To reduce the number of parameters, one needs an efficient representation Φ(x, y) of the input & output. This representation depends on the concrete problem definition.

40 Example: Sequential Input & Output (Feature-Mapping). [Figure: label chain y_1, y_2, y_3, y_4 over x = (Curiosity, killed, the, cat).] An attribute for every pair of adjacent labels y_t and y_{t+1}: φ_{N,V}(y_t, y_{t+1}) = I(y_t = Noun ∧ y_{t+1} = Verb). An attribute for every pair of input and output: φ_{cat,N}(x_t, y_t) = I(x_t = cat ∧ y_t = Noun). Label-label counts: Φ_i = Σ_t φ_i(y_t, y_{t+1}); label-observation counts: Φ_j = Σ_t φ_j(x_t, y_t). Joint feature representation: Φ(x, y) = (…, Σ_t φ_{N,V}(y_t, y_{t+1}), …, Σ_t φ_{cat,N}(x_t, y_t), …)^T; weight vector: w = (…, w_{N,V}, …, w_{cat,N}, …)^T.
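
A sketch of this joint feature map as a sparse dictionary of counts; the keys ('trans', …) and ('emit', …) are hypothetical names for the label-label and observation-label entries:

```python
from collections import Counter

def sequence_joint_features(words, tags):
    """Phi(x, y): counts of adjacent label pairs and of (word, label) pairs."""
    feats = Counter()
    for t in range(len(tags) - 1):
        feats[('trans', tags[t], tags[t + 1])] += 1     # e.g. phi_{N,V}
    for word, tag in zip(words, tags):
        feats[('emit', word, tag)] += 1                 # e.g. phi_{cat,N}
    return feats

# Example: sequence_joint_features("Curiosity killed the cat".split(), ["N", "V", "Det", "N"])
# yields counts for ('trans','N','V'), ('trans','V','Det'), ('trans','Det','N'),
# ('emit','Curiosity','N'), ..., ('emit','cat','N').
```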

41 Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute y(x) = argmax_y f_θ(x, y). The argmax goes over all possible sequences (exponentially many in the sequence length). f_θ(x, y) = Φ(x, y)^T θ sums over the features of adjacent label-label pairs and the features of (x_t, y_t) pairs. The summands only differ where the y-sequences also differ. Using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm).

42 Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute y(x) = argmax_y f_θ(x, y); using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm). Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at t - 1. [Trellis figure: scores γ_{t-1}(N), γ_{t-1}(V), γ_{t-1}(D) for y_{t-1} ∈ {N, V, D} and γ_t(N), γ_t(V), γ_t(D) for y_t ∈ {N, V, D}.] For x_t = cat: γ_t(N) = max_z [w_{z,N} + γ_{t-1}(z)] + w_{cat,N}, p_t(N) = argmax_z [w_{z,N} + γ_{t-1}(z)].

43 Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute y(x) = argmax_y f_θ(x, y); using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm). Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at t - 1. [Same trellis figure as above.] For x_t = cat: γ_t(N) = max_z [w_{z,N} + γ_{t-1}(z)] + w_{cat,N}, p_t(N) = argmax_z [w_{z,N} + γ_{t-1}(z)].

44 Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute y(x) = argmax_y f_θ(x, y); using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm). Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at t - 1. [Same trellis figure as above.] For x_t = cat: γ_t(V) = max_z [w_{z,V} + γ_{t-1}(z)] + w_{cat,V}, p_t(V) = argmax_z [w_{z,V} + γ_{t-1}(z)].

45 Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute y(x) = argmax_y f_θ(x, y); using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm). Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at t - 1. [Same trellis figure as above.] For x_t = cat: γ_t(D) = max_z [w_{z,D} + γ_{t-1}(z)] + w_{cat,D}, p_t(D) = argmax_z [w_{z,D} + γ_{t-1}(z)].

46 Example: Sequential Input & Output (Decoding / Prediction). To classify a sequence, we must compute y(x) = argmax_y f_θ(x, y); using dynamic programming, the argmax can be computed in linear time (Viterbi algorithm). Idea: we can compute the maximizing subsequences at time t given the maximizing subsequences at t - 1. Once γ_t has been computed for the entire sequence, maximize γ_T and follow the pointers p_t back to find the maximizing sequence. [Figure: x = (Curiosity, killed, the, cat) tagged y = (N, V, D, N).]
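
A compact sketch of this dynamic program over the sparse feature weights from the map above; w is assumed to be a dict keyed like those features (missing entries score 0), and the score at the first position uses only the emission weight:

```python
def viterbi(words, labels, w):
    """argmax over label sequences of sum_t [w_trans(y_{t-1}, y_t) + w_emit(x_t, y_t)]."""
    n = len(words)
    # gamma[t][y]: best score of a partial labeling ending in label y at position t
    gamma = [{y: w.get(('emit', words[0], y), 0.0) for y in labels}]
    back = [{}]
    for t in range(1, n):
        gamma.append({})
        back.append({})
        for y in labels:
            best_z = max(labels, key=lambda z: gamma[t - 1][z] + w.get(('trans', z, y), 0.0))
            gamma[t][y] = (gamma[t - 1][best_z] + w.get(('trans', best_z, y), 0.0)
                           + w.get(('emit', words[t], y), 0.0))
            back[t][y] = best_z
    # follow the back-pointers from the best final label
    y_T = max(labels, key=lambda y: gamma[n - 1][y])
    path = [y_T]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

The running time is linear in the sequence length (times the squared number of labels), in line with the slide's claim.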

47 Structured Output Space. Sequential input / output. Decoding: Viterbi algorithm. Feature mapping: entries for adjacent states, φ_{N,V}(y_t, y_{t+1}) = I(y_t = Noun ∧ y_{t+1} = Verb), and entries for observations in a state, φ_{cat,N}(x_t, y_t) = I(x_t = cat ∧ y_t = Noun).

48 Structured Output Space. Example: natural language parsing. x = Curiosity killed the cat; y is a parse tree. [Parse tree figure: S → NP VP; NP → N (Curiosity); VP → V (killed) NP; NP → Det (the) N (cat).] Features count productions in the tree, e.g. φ_{NP→N}(y) = Σ_{p∈y} I(p = NP→N) and φ_{N→cat}(x, y) = Σ_{p∈y} I(p = N→cat), for productions such as S → NP VP, NP → N, VP → V, VP → V NP, N → ate, N → cat. Joint feature representation: Φ(x, y) = (…, φ_{NP→N}(y), …, φ_{N→cat}(x, y), …)^T; weight vector: w = (…, w_{NP→N}, …, w_{N→cat}, …)^T. Decoding with dynamic programming: CKY parser, O(n^3).
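
A sketch that extracts such production-rule counts from a parse tree, assuming a hypothetical nested-tuple encoding of the tree (the feature names match the production counts used above):

```python
from collections import Counter

def parse_tree_features(tree):
    """Counts phi_{A -> beta}: one feature per production occurring in the tree."""
    feats = Counter()

    def visit(node):
        if not isinstance(node, tuple):
            return                                            # a leaf word, no production
        head, children = node[0], node[1:]
        rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        feats[(head, rhs)] += 1                               # production head -> rhs
        for child in children:
            visit(child)

    visit(tree)
    return feats

# Example: parse_tree_features(('S', ('NP', ('N', 'Curiosity')),
#                                    ('VP', ('V', 'killed'),
#                                           ('NP', ('Det', 'the'), ('N', 'cat')))))
# counts ('S', ('NP', 'VP')), ('NP', ('N',)), ('VP', ('V', 'NP')), ('N', ('cat',)), ...
```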

49 Structured Output Space. Example: collective classification. [Figure: a graph of instances x_1, …, x_4 with interconnected labels y_1, …, y_4.] An attribute for every pair of adjacent labels y_i and y_j: φ_123(y_i, y_j). An attribute for every pair of input and output: φ_{cat,N}(x_t, y_t) = I(y_t = Institute ∧ x_t = …). Decoder: message passing algorithm.

50 Complex Structured Output. An exponential number of classes potentially leads to exponentially many parameters to estimate, inefficient predictions, and an inefficient learning algorithm. Optimization problem: min_{θ,ξ} λ Σ_{i=1}^n ξ_i + 1/2 θ^T θ such that ξ_i ≥ 0 and f_θ(x_i, y_i) ≥ f_θ(x_i, y) + 1 - ξ_i for all y ≠ y_i (exponentially many constraints; f_θ uses an efficient encoding Φ(x, y)).

51 Learning with Complex Structured Output. Optimization problem: min_{θ,ξ} λ Σ_{i=1}^n ξ_i + 1/2 θ^T θ such that ξ_i ≥ 0 and f_θ(x_i, y_i) ≥ f_θ(x_i, y) + 1 - ξ_i for all y ≠ y_i (exponentially many constraints). Optimization through iterative training: negative constraints are added whenever an error occurs during training. [Tsochantaridis et al., 2004]

52 Learning with Complex Structured Output: Cutting Plane Algorithm. Given: L = {(x_1, y_1), …, (x_n, y_n)}. Repeat until all sequences are correctly predicted: iterate over all examples (x_i, y_i); compute ŷ = argmax_{y ≠ y_i} Φ(x_i, y)^T θ; if Φ(x_i, y_i)^T θ < Φ(x_i, ŷ)^T θ + 1 (margin violation), then add the constraint Φ(x_i, y_i)^T θ ≥ Φ(x_i, ŷ)^T θ + 1 - ξ_i to the working set of constraints and solve the optimization problem for input x_i, output y_i, and the negative pseudo-examples ŷ in the working set. Return the learned θ.
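
A schematic rendering of this loop, assuming two hypothetical helpers: decode_best_wrong(theta, x, y), which returns argmax_{y' ≠ y} Φ(x, y')^T θ (e.g. via Viterbi), and solve_qp(working_set), which re-solves the SVM over the current constraints:

```python
import numpy as np

def cutting_plane_training(data, Phi, decode_best_wrong, solve_qp, max_iter=100):
    """Grow a working set of violated margin constraints and re-solve the QP each time."""
    working_set = []
    theta = solve_qp(working_set)
    for _ in range(max_iter):
        any_violation = False
        for x_i, y_i in data:
            y_hat = decode_best_wrong(theta, x_i, y_i)
            # margin violation: the best wrong output comes within 1 of the true output
            if np.dot(Phi(x_i, y_i), theta) < np.dot(Phi(x_i, y_hat), theta) + 1.0:
                working_set.append((x_i, y_i, y_hat))      # new negative pseudo-example
                theta = solve_qp(working_set)
                any_violation = True
        if not any_violation:        # all examples predicted with sufficient margin
            return theta
    return theta
```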

53 Structured Output Spaces. General framework, customized in terms of: the loss function; the joint representation Φ(x, y) of the input and output; the decoder ŷ = argmax_{y ≠ y_i} Φ(x_i, y)^T θ, optionally augmented with a loss function. Implementation:

54 Complex Structured Output. Output space Y contains complex objects; a multistage process propagates its errors. There is an exponential number of classes, but: fewer parameters to estimate (the feature mapping reduces the number of parameters per class); efficient predictions (problem-specific encoding and decoder); an efficient learning algorithm (cutting plane algorithm).

55 RANKING

56 Ranking. Can we also learn with other types of structure? Ranking: every prediction is an ordering. We want to use this ordering information to improve our learning algorithm.

57 Ranking. Instances should be placed in the correct order, e.g., relevance ranking of search results or of product recommendations. Samples are pairs: L = {f(x_i) > f(x_j), …}, meaning x_i should appear before x_j in the ranking.

58 Ranking. Relevance ranking of search results: website x_i, search query q. Joint feature representation Φ(x_i, q) for websites and search queries: Φ(x_i, q) can be a vector of features that appear in both x_i and q (i.e., their correspondence), e.g. the count of corresponding words, quasi-correspondence in H1-neighborhoods, PageRank, … Samples are pairs: L = {f(x_i, q) > f(x_j, q), …}, meaning x_i should appear before x_j in the ranking for query q.

59 Ranking. Relevance ranking of search results. Samples are pairs: L = {f(x_i, q) > f(x_j, q), …}, meaning x_i should appear before x_j in the ranking for query q. Samples can be taken from click data: a user issues query q, receives a list of results, and clicks on the i-th element x_i in the list. Implicitly, the user has rejected list elements 1, …, i - 1; for this user and query q, x_i should have been placed first.

60 Ranking. Relevance ranking of search results. Samples are pairs: L = {f(x_i, q) > f(x_j, q), …}. A user issues query q, receives a list of results, and clicks on the i-th element x_i in the list; for this user and query q, x_i should have been placed first. From this encounter, we infer the samples L_q = {f(x_i, q) > f(x_1, q), …, f(x_i, q) > f(x_{i-1}, q)}.

61 Ranking. Given: samples L = {f(x_{1,i_1}, q_1) > f(x_{1,1}, q_1), …, f(x_{1,i_1}, q_1) > f(x_{1,i_1-1}, q_1), …, f(x_{m,i_m}, q_m) > f(x_{m,1}, q_m), …, f(x_{m,i_m}, q_m) > f(x_{m,i_m-1}, q_m)}. Solution: min_{w,ξ} 1/2 ||w||^2 + C Σ_{j,i} ξ_{ji} s.t. for all j and all i < i_j: w^T (Φ(x_{j,i_j}, q_j) - Φ(x_{j,i}, q_j)) ≥ 1 - ξ_{ji}, and ξ_{ji} ≥ 0 for all j, i.
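
A sketch of this pairwise objective; pairs is a hypothetical list of (Φ(x_better, q), Φ(x_worse, q)) feature-vector pairs, e.g. derived from click data as on the previous slides, and the minimal feasible slack per pair is max(0, 1 - w^T (Φ_better - Φ_worse)):

```python
import numpy as np

def ranking_svm_objective(w, pairs, C=1.0):
    """1/2 ||w||^2 + C * sum of slacks over all preference pairs."""
    slacks = [max(0.0, 1.0 - float(np.dot(w, a - b))) for a, b in pairs]
    return 0.5 * float(np.dot(w, w)) + C * sum(slacks)
```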

62 Structured Classification - Summary. Classification for more than two classes: y(x) = argmax_y f_θ(x, y), f_θ(x, y) = Φ(x, y)^T θ. Learning is formulated as a multiclass SVM. The structure of Y is captured by Λ(y), giving Φ(x, y) = φ(x) ⊗ Λ(y) and k((x_i, y_i), (x_j, y_j)) = k(x_i, x_j) Λ(y_i)^T Λ(y_j). Structured input & output: Y is high dimensional, but the feature mapping leaves fewer parameters to estimate, the problem-specific encoding gives efficient prediction, and the cutting plane algorithm gives an efficient learning algorithm. Other structured problems: ranking uses order constraints.
