Algorithms for Predicting Structured Data

1 Algorithms for Predicting Structured Data 1 / 70 Thomas Gärtner / Shankar Vembu, Fraunhofer IAIS / UIUC, ECML PKDD 2010

2 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure and dependencies Contrast with single-valued prediction problems like binary classification and regression

3 Part-of-Speech Tagging 3 / 70 Input: sentence. Output: part-of-speech tags. Example: The/DT brown/JJ fox/N chased/VBD the/DT lazy/JJ dog/N

4 Linguistic Parsing 4 / 70 Input: sentence. Output: parse tree. [Figure: parse tree for "The brown fox chased the lazy dog", with S → NP VP, NP → DT JJ N, VP → VBD NP]

5 Machine Learning Problems 5 / 70 Multi-class prediction Multi-label classification Hierarchical classification Label ranking...

6 Real World Applications 6 / 70 Natural language processing Computational biology Bioinformatics Computer vision Robotics...

7 Outline 7 / 70 Part I: From Binary to Structured Prediction (Loss Functions, Algorithms). Part II: Exact and Approximate Inference (Approximate Inference: Positive and Negative Results, Complexity of Learning, New Training Methods). Part III: Weakly Supervised Learning

8 Tutorial Focus 8 / 70 Discriminative methods in structured prediction Algorithmic techniques

9 Topics Not Covered 9 / 70 Advanced optimization methods Connections to reinforcement learning Learning theory / Generalization bounds Inference in graphical models X-supervised learning

10 Outline 10 / 70 Part I: Binary to Structured Loss Functions Algorithms

11 Road Map 11 / 70 Perceptron → Structured perceptron; Regression → Kernel dependency estimation; Support Vector Machines → Structured SVMs; Logistic Regression → Conditional random fields

12 Preliminaries 12 / 70

13 Supervised Learning 13 / 70 Input and output spaces X, Y. Classification: Y = {−1, +1}; regression: Y = ℝ. Training data X = (x_1, ..., x_m) ∈ X^m, Y = (y_1, ..., y_m) ∈ Y^m, drawn i.i.d. from an unknown distribution. Function approximation f : X → Y, aiming for good performance on unseen examples, where performance is measured w.r.t. a loss function

14 Generative and Discriminative Learning 14 / 70 Generative learning (e.g., Naïve Bayes, HMM): model the joint probability p(x, y); predict using Bayes' rule, p(y | x) = p(x, y) / Σ_{z ∈ Y} p(x, z) = p(x | y) p(y) / Σ_{z ∈ Y} p(x, z). Discriminative learning (e.g., logistic regression, CRF): model the conditional probability p(y | x) OR a mapping from inputs to outputs f : X → Y

15 Regularized Risk Minimization 15 / 70 argmin_{f ∈ H} Σ_{i=1}^m ℓ(f(x_i), y_i) + λ Ω(f), where ℓ : Y × Y → ℝ_+ is a non-negative loss function, Ω : H → ℝ_+ is a regularization function, and λ > 0 is the regularization parameter

16 Regularized Risk Minimization 16 / 70 Example: linear classifiers f(x) = ⟨w, φ(x)⟩. argmin_{w ∈ ℝ^n} Σ_{i=1}^m ℓ(⟨w, φ(x_i)⟩, y_i) + λ‖w‖_2², where φ : X → ℝ^n is the feature map

17 Structured Prediction 17 / 70 Input and output spaces X, Y. Training data X = (x_1, ..., x_m) ∈ X^m, Y = (y_1, ..., y_m) ∈ Y^m. Joint scoring function f : X × Y → ℝ. Prediction: ŷ = argmax_{y ∈ Y(x)} f(x, y) = argmax_{y ∈ Y(x)} ⟨w, φ(x, y)⟩

18 Joint Feature Maps 18 / 70 Extract features from inputs AND outputs. Feature map φ : X × Y → H. Joint input-output kernel k[(x, y), (x′, y′)] = ⟨φ(x, y), φ(x′, y′)⟩

19 Joint Feature Maps 19 / 70 Example: multi-label classification, hierarchical classification. x ∈ ℝ^n, y ∈ {0, 1}^d, φ(x, y) = x ⊗ y. Tensor product kernel: k[(x, y), (x′, y′)] = k_X(x, x′) k_Y(y, y′)
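
The identity behind the tensor product kernel, ⟨x ⊗ y, x′ ⊗ y′⟩ = ⟨x, x′⟩⟨y, y′⟩, can be checked numerically; a minimal sketch assuming plain linear kernels on both spaces (helper names are illustrative):

```python
import numpy as np

def joint_feature(x, y):
    """Joint feature map phi(x, y) = x (tensor) y, flattened to a vector."""
    return np.outer(x, y).ravel()

def tensor_product_kernel(x, y, x2, y2):
    """k[(x, y), (x2, y2)] = k_X(x, x2) * k_Y(y, y2) with linear kernels."""
    return np.dot(x, x2) * np.dot(y, y2)

rng = np.random.default_rng(0)
x, x2 = rng.normal(size=4), rng.normal(size=4)
y, y2 = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])

# <phi(x, y), phi(x2, y2)> coincides with the tensor product kernel
assert np.isclose(joint_feature(x, y) @ joint_feature(x2, y2),
                  tensor_product_kernel(x, y, x2, y2))
```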

20 20 / 70 Loss Functions (Binary → Structured)

21 Regularized Risk Minimization 21 / 70 argmin_{f ∈ H} Σ_{i=1}^m ℓ(f(x_i), y_i) + λ Ω(f), where ℓ : Y × Y → ℝ_+ is a non-negative loss function, Ω : H → ℝ_+ is a regularization function, and λ > 0 is the regularization parameter

22 Zero-One Loss 22 / 70 Binary: ℓ_{0-1}(y, z) = 0 if y = z, 1 otherwise. Structured: ℓ_max(f, (x, y)) = Δ(argmax_{z ∈ Y} f(x, z), y). Non-convex, non-differentiable, hard to optimize. Use surrogate losses instead: convex upper bounds

23 Squared Loss 23 / 70 ℓ_square(y, z) = (y − z)², Y = ℝ. Used in regression problems. Extension to structured prediction is non-trivial due to inter-dependencies among the multiple outputs (more on this later)

24 Hinge Loss 24 / 70 Binary: ℓ_hinge(y, z) = max(0, 1 − yz). Structured: ℓ_hinge(f, (x, y)) = max_{z ∈ Y} [Δ(z, y) + f(x, z) − f(x, y)]

25 Logistic Loss 25 / 70 Binary: ℓ_log(y, z) = ln(1 + exp(−yz)). Structured: ℓ_log(f, (x, y)) = ln[Σ_{z ∈ Y} exp(f(x, z))] − f(x, y)

26 Exponential Loss 26 / 70 Binary: ℓ_exp(y, z) = exp(−yz). Structured: ℓ_exp(f, (x, y)) = Σ_{z ∈ Y} exp[f(x, z) − f(x, y)]

27 Loss Functions 27 / 70 [Figure: plot of the square, hinge, log, and exponential losses as functions of the margin t = y·f(x)]
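
All of the binary surrogates above can be written as functions of the margin t = y·f(x); a small sketch for computing them (function names are illustrative):

```python
import numpy as np

def zero_one(t):     return (t <= 0).astype(float)      # 0-1 loss
def square(t):       return (1.0 - t) ** 2              # (y - f(x))^2 with y in {-1, +1}
def hinge(t):        return np.maximum(0.0, 1.0 - t)    # hinge loss
def logistic(t):     return np.log1p(np.exp(-t))        # logistic loss
def exponential(t):  return np.exp(-t)                  # exponential loss

t = np.linspace(-2.0, 2.0, 9)
for name, loss in [("0-1", zero_one), ("square", square), ("hinge", hinge),
                   ("log", logistic), ("exp", exponential)]:
    print(f"{name:>6}: {np.round(loss(t), 2)}")
```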

28 28 / 70 Learning Algorithms (parameter estimation)

29 Perceptron → Structured perceptron 29 / 70

30 Perceptron 30 / 70 Online learning algorithm. Prediction: ŷ = f(x) = sgn⟨w, x⟩. Algorithm: initialize w = 0; for t = 1...T: 1. receive input x_t; 2. predict ŷ_t = sgn⟨w, x_t⟩; 3. receive true label y_t ∈ {−1, +1}; 4. if y_t ≠ ŷ_t, then w ← w + y_t x_t
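
A minimal sketch of this algorithm on toy separable data (variable names are illustrative):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Online perceptron: predict sgn<w, x>; on a mistake, update w <- w + y*x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t >= 0 else -1
            if y_hat != y_t:
                w = w + y_t * x_t
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1, -1)        # linearly separable labels
w = perceptron(X, y)
print("training accuracy:", np.mean(np.where(X @ w >= 0, 1, -1) == y))
```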

31 Mistake Bound 31 / 70 Separable data [Block, 62; Novikoff, 62]. Theorem: Let (x_1, y_1), ..., (x_m, y_m) be a sequence of training examples with ‖x_i‖ ≤ R for all i ∈ [[m]]. Suppose there exists a unit norm vector u such that y_i ⟨u, x_i⟩ ≥ γ for all the examples. Then the number of mistakes made by the perceptron algorithm on this sequence is at most (R/γ)².

32 Mistake Bound 32 / 70 Inseparable data [Freund and Schapire, 99]. Theorem: Let (x_1, y_1), ..., (x_m, y_m) be a sequence of training examples with ‖x_i‖ ≤ R for all i ∈ [[m]]. Let u be any unit norm weight vector and let γ > 0. Define the deviation of each example as d_i = max[0, γ − y_i ⟨u, x_i⟩] and define D = √(Σ_{i=1}^m d_i²). Then the number of mistakes made by the perceptron algorithm on this sequence of examples is at most (R + D)² / γ².

33 Constraint Classification 33 / 70 (Har-Peled et al., ALT 2002) Generalizes the perceptron for various problems including multiclass prediction, multilabel classification, and label ranking. Example: multiclass prediction, Y = {1, ..., d}. Weights (w_1, ..., w_d) ∈ ℝ^{nd}. Prediction: ŷ = f(x) = argmax_{i ∈ [[d]]} ⟨w_i, x⟩

34 Constraint Classification 34 / 70 Algorithm: initialize (w_1, ..., w_d) = 0; for t = 1...T: 1. receive input x_t and label y_t ∈ [[d]]; 2. construct ỹ_t = {(y_t, i) : i ∈ [[d]] \ {y_t}}; 3. for all (p, q) ∈ ỹ_t, if ⟨w_p − w_q, x_t⟩ < 0, then w_p ← w_p + x_t, w_q ← w_q − x_t
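
A sketch of these pairwise updates for the multiclass case, with one weight vector per class (toy data; names are illustrative):

```python
import numpy as np

def constraint_classification(X, y, d, epochs=10):
    """Multiclass perceptron with constraint-classification updates:
    whenever <w_{y_t} - w_q, x_t> < 0 for some q != y_t, promote w_{y_t} and demote w_q."""
    W = np.zeros((d, X.shape[1]))
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            for q in range(d):
                if q != y_t and (W[y_t] - W[q]) @ x_t < 0:
                    W[y_t] += x_t
                    W[q] -= x_t
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, :3].argmax(axis=1)                # 3 classes, determined by the first 3 features
W = constraint_classification(X, y, d=3)
print("training accuracy:", np.mean((X @ W.T).argmax(axis=1) == y))
```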

35 Structured Perceptron 35 / 70 (Collins, EMNLP 2002) Linear scoring function f(x, y) = ⟨w, φ(x, y)⟩. Algorithm: initialize w = 0; for t = 1...T: 1. receive input x_t; 2. predict ŷ_t = argmax_{y ∈ Y} f(x_t, y); 3. receive true label y_t ∈ Y; 4. w ← w + φ(x_t, y_t) − φ(x_t, ŷ_t)
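
A generic sketch of the algorithm: it only needs a joint feature map and an argmax routine, shown here with multiclass prediction as the simplest special case (helper names and toy data are illustrative):

```python
import numpy as np

def structured_perceptron(examples, phi, argmax, n_features, epochs=5):
    """Structured perceptron: predict with argmax, update towards the gold output."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x_t, y_t in examples:
            y_hat = argmax(w, x_t)
            if y_hat != y_t:
                w = w + phi(x_t, y_t) - phi(x_t, y_hat)
    return w

# toy usage: multiclass prediction with Y = {0, 1, 2} and per-class stacked features
LABELS, DIM = [0, 1, 2], 4
def phi(x, y):
    v = np.zeros(len(LABELS) * DIM)
    v[y * DIM:(y + 1) * DIM] = x
    return v
def argmax(w, x):
    return max(LABELS, key=lambda y: w @ phi(x, y))

rng = np.random.default_rng(0)
X = rng.normal(size=(150, DIM))
Y = X[:, :3].argmax(axis=1)
w = structured_perceptron(list(zip(X, Y)), phi, argmax, n_features=len(LABELS) * DIM)
print("training accuracy:", np.mean([argmax(w, x) == y for x, y in zip(X, Y)]))
```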

36 Mistake Bound 36 / 70 Separable data [Collins, 02]. Theorem: Let (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) be a sequence of training examples which is separable with margin γ. Let R denote a constant that satisfies ‖φ(x_i, y_i) − φ(x_i, z)‖ ≤ R for all i ∈ [[m]] and for all z ∈ GEN(x_i). Then the number of mistakes made by the perceptron algorithm on this sequence of examples is at most R² / γ².

37 Mistake Bound 37 / 70 Inseparable data [Collins, 02]. For a training example (x_i, y_i) and a (v, γ) pair, define m_i = ⟨v, φ(x_i, y_i)⟩ − max_{z ∈ GEN(x_i)} ⟨v, φ(x_i, z)⟩ and ε_i = max{0, γ − m_i}, and define D_{v,γ} = √(Σ_{i=1}^m ε_i²). Then the number of mistakes made by the structured perceptron is at most min_{v,γ} (R + D_{v,γ})² / γ².

38 Regression → KDE 38 / 70

39 Regularized Least-Squares Regression 39 / 70 OPT: min_{w ∈ ℝ^n} λ‖w‖² + Σ_{i=1}^m (⟨w, x_i⟩ − y_i)². Solution: w = (XᵀX + λI)⁻¹ Xᵀy, where X ∈ ℝ^{m×n} is the data/input matrix, y ∈ ℝ^{m×1} is the label/output vector, and I is the identity matrix
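
The closed-form solution is a few lines of numpy (a sketch; a linear solve is used instead of an explicit matrix inverse):

```python
import numpy as np

def rlsr_fit(X, y, lam):
    """Regularized least squares: w = (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(rlsr_fit(X, y, lam=0.1))        # close to the true weights (1, -2, 0.5)
```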

40 Kernelized Version 40 / 70 OPT1: min_{f ∈ H} λ‖f‖² + Σ_{i=1}^m (f(x_i) − y_i)². Representer theorem: f = Σ_{i=1}^m c_i k(x_i, ·). OPT2: min_{c ∈ ℝ^m} λcᵀKc + ‖y − Kc‖². Solution: c = (K + λI)⁻¹ y
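
The kernelized solution only needs the Gram matrix; a sketch assuming an RBF kernel (kernel choice, toy data, and names are illustrative):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_rlsr_fit(K, y, lam):
    """c = (K + lam*I)^{-1} y, so that f(x) = sum_i c_i k(x_i, x)."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
c = kernel_rlsr_fit(rbf_kernel(X, X), y, lam=0.1)
X_test = np.array([[0.0], [1.5]])
print(rbf_kernel(X_test, X) @ c)      # predictions at the test points
```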

41 Kernel Dependency Estimation 41 / 70 (Weston et al., NIPS 2002) Output feature map ψ : Y → H_Y, k_Y(y, y′) = ⟨ψ(y), ψ(y′)⟩. Note: possible to consider a large class of non-linear loss functions in the output space. KDE uses kernel PCA to "decorrelate" the outputs, and trains independent RLSRs

42 Kernel Dependency Estimation 42 / 70 (Weston et al., NIPS 2002) Algorithm: 1. decompose {ψ(y_i) : i ∈ [[m]]} into k orthogonal directions using KPCA; 2. learn f_j : X → ℝ for each direction j ∈ [[k]] using RLSR; 3. predict by solving the pre-image problem: ŷ(x) = argmin_{y ∈ Y} ‖[v_1ᵀψ(y), ..., v_kᵀψ(y)] − [f_1(x), ..., f_k(x)]‖²

43 Kernel Dependency Estimation 43 / 70 (Weston et al., NIPS 2002) Note: the projection of ψ(y) onto the p-th principal component v_p = Σ_{i=1}^m α_i^p ψ(y_i) is given by v_pᵀψ(y) = Σ_{i=1}^m α_i^p k_Y(y_i, y), where α^p is the p-th eigenvector of the kernel matrix K_Y (after centering it)

44 SVM → Structured SVM 44 / 70

45 Support Vector Machine 45 / 70 Linear large-margin classifier f(x) = ⟨w, φ(x)⟩. Minimizes the hinge loss. OPT: min_w λ‖w‖² + (1/m) Σ_{i=1}^m max{0, 1 − y_i ⟨w, φ(x_i)⟩}. OPT with slack variables: min_{w,ξ} λ‖w‖² + (1/m) Σ_{i=1}^m ξ_i s.t. y_i ⟨w, φ(x_i)⟩ ≥ 1 − ξ_i and ξ_i ≥ 0, for all i ∈ [[m]]

46 Support Vector Machine 46 / 70 SVM dual: min_{c ∈ ℝ^m} (1/2) cᵀYKYc − 1ᵀc s.t. 0 ≤ c_i ≤ 1/(mλ), for all i ∈ [[m]]. Representer theorem: f(·) = Σ_{i=1}^m c_i k(x_i, ·)

47 Structured SVM / Max-margin Markov Network 47 / 70 Minimizes structured hinge loss Two formulations: 1. Slack rescaling 2. Margin rescaling

48 Structured SVM 48 / 70 Slack rescaling: min_{w,ξ} λ‖w‖² + (1/m) Σ_{i=1}^m ξ_i s.t. ⟨w, φ(x_i, y_i)⟩ − ⟨w, φ(x_i, z)⟩ ≥ 1 − ξ_i / Δ(y_i, z), for all z and all i; ξ_i ≥ 0, for all i ∈ [[m]]

49 Structured SVM 49 / 70 Margin rescaling: min_{w,ξ} λ‖w‖² + (1/m) Σ_{i=1}^m ξ_i s.t. ⟨w, φ(x_i, y_i)⟩ − ⟨w, φ(x_i, z)⟩ ≥ Δ(y_i, z) − ξ_i, for all z and all i; ξ_i ≥ 0, for all i ∈ [[m]]

50 Solving the OPT 50 / 70 Major issue: exponential number of constraints. It suffices to design a sub-routine (loss-augmented inference) to compute ŷ = argmax_{z ∈ Y} [1 − ⟨w, φ(x, y) − φ(x, z)⟩] Δ(z, y) (slack rescaling) or ŷ = argmax_{z ∈ Y} [Δ(z, y) − ⟨w, φ(x, y) − φ(x, z)⟩] (margin rescaling). Iteratively add constraints
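
When Y is small enough to enumerate, both loss-augmented argmax problems can be solved by brute force; a sketch (real structured outputs would replace the enumeration with dynamic programming or another combinatorial solver; names and toy data are illustrative):

```python
import numpy as np

def loss_augmented_argmax(w, x, y, phi, labels, delta, rescaling="margin"):
    """Brute-force loss-augmented inference over an explicit candidate set."""
    def score(z):
        gap = w @ phi(x, z) - w @ phi(x, y)       # -<w, phi(x, y) - phi(x, z)>
        if rescaling == "margin":
            return delta(y, z) + gap              # margin rescaling objective
        return (1.0 + gap) * delta(y, z)          # slack rescaling objective
    return max(labels, key=score)

# toy usage: three candidate labels, stacked per-class features, 0-1 loss
DIM = 4
def phi(x, z):
    v = np.zeros(3 * DIM); v[z * DIM:(z + 1) * DIM] = x; return v
delta = lambda y, z: float(y != z)

rng = np.random.default_rng(0)
w, x = rng.normal(size=3 * DIM), rng.normal(size=DIM)
print(loss_augmented_argmax(w, x, y=0, phi=phi, labels=[0, 1, 2], delta=delta))
print(loss_augmented_argmax(w, x, y=0, phi=phi, labels=[0, 1, 2], delta=delta,
                            rescaling="slack"))
```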

51 Cutting-Plane Method 51 / 70 (Tsochantaridis et al., JMLR 2005) 1: Input: (x_1, y_1), ..., (x_m, y_m), λ, ε. 2: S_i ← ∅ for all i ∈ [[m]]. 3: repeat 4: for i = 1, ..., m do 5: H(y) := Δ(y_i, y) + wᵀφ(x_i, y) − wᵀφ(x_i, y_i) 6: compute ŷ = argmax_{y ∈ Y} H(y) 7: compute ξ_i = max{0, max_{y ∈ S_i} H(y)} 8: if H(ŷ) > ξ_i + ε then 9: S_i ← S_i ∪ {ŷ} 10: w ← optimize primal over ∪_i S_i 11: end if 12: end for 13: until no S_i has changed during an iteration

52 Theoretical Guarantees with Exact Inference 52 / 70 (Tsochantaridis et al., 2005; Finley and Joachims, ICML 2008) Polynomial time termination: the algorithm terminates in a polynomial number of iterations. Correctness: the algorithm solves OPT accurately to a desired precision ε. Empirical risk bound: (1/m) Σ_{i=1}^m ξ_i upper bounds the empirical risk

53 Kernelized SSVM 53 / 70 min_{α ∈ ℝ^{m·|Y|}} (1/2) Σ_{i,j ∈ [[m]], z,z′ ∈ Y} α_{iz} α_{jz′} k[(x_i, z), (x_j, z′)] − Σ_{i,z} Δ(y_i, z) α_{iz} s.t. Σ_{z ∈ Y} α_{iz} ≤ λ, for all i ∈ [[m]]; α_{iz} ≥ 0, for all i ∈ [[m]], z ∈ Y. Representer theorem: f(·, ·) = Σ_{i ∈ [[m]], z ∈ Y} α_{iz} k[(x_i, z), (·, ·)]

54 Structured Perceptron Revisited 54 / 70 Linear scoring function f(x, y) = ⟨w, φ(x, y)⟩. Algorithm: initialize w = 0; for t = 1...T: 1. receive input x_t; 2. predict ŷ_t = argmax_{y ∈ Y} f(x_t, y); 3. receive true label y_t ∈ Y; 4. w ← w + φ(x_t, y_t) − φ(x_t, ŷ_t). Note: the update rule is not influenced by the loss function Δ(y, z)

55 Structured Perceptron Revisited 55 / 70 Replace inference with loss-augmented inference. Algorithm: initialize w = 0; for t = 1...T: 1. receive example pair (x_t, y_t); 2. predict ŷ_t = argmax_{y ∈ Y} [f(x_t, y) + Δ(y_t, y)]; 3. w ← w + η_t (φ(x_t, y_t) − φ(x_t, ŷ_t))

56 Min-Max Formulation 56 / 70 (Taskar et al., ICML 2005) Recall brute-force enumeration: min_{w,ξ} λ‖w‖² + (1/m) Σ_{i=1}^m ξ_i s.t. ⟨w, φ(x_i, y_i)⟩ − ⟨w, φ(x_i, z)⟩ ≥ Δ(y_i, z) − ξ_i, for all z and all i; ξ_i ≥ 0, for all i ∈ [[m]]

57 Min-Max Formulation 57 / 70 (Taskar et al., ICML 2005) min_{w,ξ} λ‖w‖² + (1/m) Σ_{i=1}^m ξ_i s.t. ⟨w, φ(x_i, y_i)⟩ + ξ_i ≥ max_{z ∈ Y} [⟨w, φ(x_i, z)⟩ + Δ(y_i, z)], for all i ∈ [[m]]; ξ_i ≥ 0, for all i ∈ [[m]]. Key steps: plug in LP inference; use LP duality to replace the max with a min; rewrite the OPT as a concise QP with a polynomial number of variables and constraints (depending on the output structure)

58 Logistic Regression → CRF 58 / 70

59 Logistic Regression 59 / 70 Probabilistic (binary) classifier. Likelihood function: p(y | x, w) = exp(y ⟨φ(x), w⟩) / [exp(y ⟨φ(x), w⟩) + exp(−y ⟨φ(x), w⟩)]. Estimate parameters w by minimizing the negative log-likelihood
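
A sketch of the binary case trained by gradient descent: with the likelihood above, the negative log-likelihood of an example simplifies to ln(1 + exp(−2y⟨φ(x), w⟩)) (toy data; names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logreg_fit(X, y, lam=0.1, lr=0.1, steps=500):
    """Gradient descent on the mean negative log-likelihood
    ln(1 + exp(-2 y <x, w>)) plus an L2 penalty lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X * (2 * y * sigmoid(-2 * margins))[:, None]).mean(axis=0) + 2 * lam * w
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X @ np.array([1.0, -1.0, 0.5]) + 0.2 * rng.normal(size=200) > 0, 1, -1)
w = logreg_fit(X, y)
print("training accuracy:", np.mean(np.where(X @ w > 0, 1, -1) == y))
```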

60 Exponential Family 60 / 70 Family of probability distributions p(x; w) = exp(⟨φ(x), w⟩ − ln Z(w)). φ(x): sufficient statistics of x. Z(w): partition function, Z(w) = ∫_x exp(⟨φ(x), w⟩) dx

61 Log-Partition Function 61 / 70 g(w) = ln Z(w) = ln ∫_x exp(⟨φ(x), w⟩) dx. g(w) is a cumulant generator, i.e., ∇_w g(w) = ∫ φ(x) exp(⟨φ(x), w⟩) dx / ∫ exp(⟨φ(x), w⟩) dx = E_{x ∼ p(x;w)}[φ(x)], and ∇²_w g(w) = Cov_{x ∼ p(x;w)}[φ(x)]. Note: g(w) is convex!
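
For a finite sample space, the identity ∇_w g(w) = E[φ(x)] can be verified numerically with finite differences; a toy sketch (the sufficient statistics below are made up for illustration):

```python
import numpy as np

phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]])   # phi(x) for 4 outcomes
w = np.array([0.3, -0.7])

def log_partition(w):
    return np.log(np.exp(phi @ w).sum())          # g(w) = ln sum_x exp(<phi(x), w>)

p = np.exp(phi @ w - log_partition(w))            # p(x; w)
expected_phi = p @ phi                            # E_{x ~ p(x; w)}[phi(x)]

eps = 1e-6                                        # finite-difference gradient of g
grad = np.array([(log_partition(w + eps * np.eye(2)[j]) -
                  log_partition(w - eps * np.eye(2)[j])) / (2 * eps) for j in range(2)])

print(np.allclose(grad, expected_phi, atol=1e-6)) # True: grad g(w) = E[phi(x)]
```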

62 Examples 62 / 70 Binomial Multinomial Gaussian Laplace Poisson Dirichlet...

63 (Conditional) Exponential Family 63 / 70 Family of conditional distributions p(y | x, w) = exp(⟨φ(x, y), w⟩ − ln Z(w | x)). φ(x, y): joint sufficient statistics of x and y. Z(w | x): partition function, Z(w | x) = Σ_{y ∈ Y} exp(⟨φ(x, y), w⟩)

64 MAP Estimation 64 / 70 p(w | Y, X) ∝ p(w, Y | X) = p(Y | X, w) p(w). Point estimate with a Gaussian prior on w, p(w) ∝ exp(−λ‖w‖²): ŵ = argmax_w [ln p(w, Y | X)] = argmax_w [(1/m) Σ_{i=1}^m ln p(y_i | x_i, w) − λ‖w‖²]. Note: convex optimization problem

65 Linear-Chain Conditional Random Fields 65 / 70 [Figure: chain-structured graphical model with label nodes Y_{i−1}, Y_i, Y_{i+1} connected to observations X_{i−1}, X_i, X_{i+1}] Cliques: (y_t, x_t), (y_{t−1}, y_t), for all t. p(y | x, w) = exp(Σ_t [⟨φ(y_t, x_t), w_xy⟩ + ⟨φ(y_{t−1}, y_t), w_yy⟩] − ln Z(w | x)). Efficient inference using dynamic programming. MLE / MAP estimation of parameters
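
The dynamic program behind "efficient inference" is the forward algorithm: Z(w | x) sums over exponentially many label sequences but factorizes along the chain. A sketch in log space, with arbitrary emission and transition score matrices standing in for ⟨φ(y_t, x_t), w_xy⟩ and ⟨φ(y_{t−1}, y_t), w_yy⟩:

```python
import numpy as np
from itertools import product

def log_partition(emit, trans):
    """Forward algorithm in log space.
    emit[t, s]   = score of label s at position t     (stands in for <phi(y_t, x_t), w_xy>)
    trans[s, s2] = score of transition s -> s2         (stands in for <phi(y_{t-1}, y_t), w_yy>)"""
    T, S = emit.shape
    alpha = emit[0].copy()
    for t in range(1, T):
        # alpha[s2] = emit[t, s2] + log sum_s exp(alpha[s] + trans[s, s2])
        alpha = emit[t] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
    return np.logaddexp.reduce(alpha)

rng = np.random.default_rng(0)
T, S = 5, 3
emit, trans = rng.normal(size=(T, S)), rng.normal(size=(S, S))

# brute-force check: enumerate all S^T label sequences
brute = np.logaddexp.reduce([
    sum(emit[t, y[t]] for t in range(T)) + sum(trans[y[t - 1], y[t]] for t in range(1, T))
    for y in product(range(S), repeat=T)])
print(np.isclose(log_partition(emit, trans), brute))   # True
```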

66 HMMs and CRFs 66 / 70 [Figure: directed graphical model of an HMM and undirected graphical model of a CRF over the same chain of labels Y_{i−1}, Y_i, Y_{i+1} and observations X_{i−1}, X_i, X_{i+1}] Hidden Markov Model: maximize p(x, y | w). Conditional Random Field: maximize p(y | x, w)

67 Learning Reductions 67 / 70 (ICML 2009 tutorial) Reductions: transform complex learning problems into simpler, core problems. Desideratum: good performance on the core problem should imply good performance on the complex problem. Examples: multi-class prediction → binary classification; ranking → classification; cost-sensitive classification → binary classification; structured prediction → binary classification (SEARN)

68 Structured → Binary (SEARN) 68 / 70 (Daumé et al., 2009) Decomposable outputs y = (y_1, ..., y_T). SEARN learns a policy π that maps tuples (x, y_1, ..., y_{t−1}) to y_t. Reduces structured prediction to cost-sensitive (multi-class) classification. Good performance on cost-sensitive classification implies good performance on the structured prediction problem. Note: no argmax problem!

69 Reduction to Cost-Sensitive Classification 69 / 70 Distribution over cost-sensitive examples (input, c_1, ..., c_K). For every structured example (x, y): 1. sample t uniformly from [[T]]; 2. run policy π for t − 1 steps to yield (ŷ_1, ..., ŷ_{t−1}); 3. input: (x, ŷ_1, ..., ŷ_{t−1}); 4. costs: c_k = E_{ŷ_{t+1}, ..., ŷ_T ∼ π} ℓ(y, (ŷ_1, ..., ŷ_{t−1}, k, ŷ_{t+1}, ..., ŷ_T)), for k ∈ [[K]]
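
A sketch of this sampling procedure for sequence labeling with Hamming loss and a deterministic stand-in policy (with a deterministic policy the expectation over roll-outs collapses to a single completion; all names and the toy data are illustrative):

```python
import numpy as np

def rollout(x, prefix, policy, T):
    """Complete a partial output to length T by following the policy."""
    out = list(prefix)
    while len(out) < T:
        out.append(policy(x, out))
    return out

def cost_sensitive_example(x, y, policy, K, loss, rng):
    """One cost-sensitive example generated from a structured example (x, y)."""
    T = len(y)
    t = rng.integers(1, T + 1)                      # 1. sample t uniformly from [[T]]
    prefix = rollout(x, [], policy, t - 1)          # 2. roll in with the policy for t-1 steps
    costs = np.array([loss(y, rollout(x, prefix + [k], policy, T))   # 4. cost of action k
                      for k in range(K)])
    return (x, tuple(prefix)), costs                # 3. the cost-sensitive input, plus its costs

# toy setup: inputs are per-position score matrices; the policy greedily picks the best label
hamming = lambda y, y_hat: float(np.sum(np.asarray(y) != np.asarray(y_hat)))
policy = lambda x, prefix: int(np.argmax(x[len(prefix)]))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))                         # sequence of length 6, K = 3 labels
y = x.argmax(axis=1)                                # "gold" labels for the toy example
print(cost_sensitive_example(x, y, policy, K=3, loss=hamming, rng=rng))
```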

70 Intermediate Summary 70 / 70 Loss functions (binary → structured). Discriminative structured prediction algorithms: Perceptron → Structured perceptron; RLSR → KDE; SVM → Struct-SVM, M3N; LogReg → CRF; Structured prediction → Cost-sensitive classification
