Algorithms for Predicting Structured Data
Slide 1: Algorithms for Predicting Structured Data
Thomas Gärtner (Fraunhofer IAIS) / Shankar Vembu (UIUC), ECML PKDD 2010

Slide 2: Structured Prediction
Predicting multiple outputs with complex internal structure and dependencies. Contrast with single-valued prediction problems such as binary classification and regression.

Slide 3: Part-of-Speech Tagging
Input: a sentence. Output: part-of-speech tags.
Example: "The brown fox chased the lazy dog" → DT JJ N VBD DT JJ N

Slide 4: Linguistic Parsing
Input: a sentence. Output: a parse tree.
Example: (S (NP (DT The) (JJ brown) (N fox)) (VP (VBD chased) (NP (DT the) (JJ lazy) (N dog))))

Slide 5: Machine Learning Problems
Multi-class prediction, multi-label classification, hierarchical classification, label ranking, ...

Slide 6: Real-World Applications
Natural language processing, computational biology, bioinformatics, computer vision, robotics, ...

Slide 7: Outline
Part I: From Binary to Structured Prediction (loss functions; algorithms)
Part II: Exact and Approximate Inference (positive and negative results for approximate inference; complexity of learning; new training methods)
Part III: Weakly Supervised Learning
Slide 8: Tutorial Focus
Discriminative methods in structured prediction; algorithmic techniques.

Slide 9: Topics Not Covered
Advanced optimization methods, connections to reinforcement learning, learning theory / generalization bounds, inference in graphical models, X-supervised learning.

Slide 10: Outline
Part I: Binary to Structured (loss functions; algorithms)

Slide 11: Road Map
Perceptron → structured perceptron
Regression → kernel dependency estimation
Support vector machines → structured SVMs
Logistic regression → conditional random fields
Slide 12: Preliminaries

Slide 13: Supervised Learning
Input and output spaces $\mathcal{X}$, $\mathcal{Y}$. Classification: $\mathcal{Y} = \{-1, +1\}$; regression: $\mathcal{Y} = \mathbb{R}$.
Training data $X = (x_1, \ldots, x_m) \in \mathcal{X}^m$, $Y = (y_1, \ldots, y_m) \in \mathcal{Y}^m$, drawn i.i.d. from an unknown distribution.
Function approximation $f : \mathcal{X} \to \mathcal{Y}$, aiming for good performance on unseen examples, where performance is measured w.r.t. a loss function.

Slide 14: Generative and Discriminative Learning
Generative learning (e.g., naive Bayes, HMM): model the joint probability $p(x, y)$ and predict using Bayes' rule,
$p(y \mid x) = \frac{p(x, y)}{\sum_{z \in \mathcal{Y}} p(x, z)} = \frac{p(x \mid y)\, p(y)}{\sum_{z \in \mathcal{Y}} p(x, z)}$.
Discriminative learning (e.g., logistic regression, CRF): model the conditional probability $p(y \mid x)$, or learn a mapping from inputs to outputs $f : \mathcal{X} \to \mathcal{Y}$ directly.

Slide 15: Regularized Risk Minimization
$\operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \lambda\, \Omega(f)$
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a non-negative loss function, $\Omega : \mathcal{H} \to \mathbb{R}_+$ is a regularization function, and $\lambda > 0$ is the regularization parameter.

Slide 16: Regularized Risk Minimization
Example: linear classifiers $f(x) = \langle w, \phi(x) \rangle$:
$\operatorname{argmin}_{w \in \mathbb{R}^n} \sum_{i=1}^{m} \ell(\langle w, \phi(x_i) \rangle, y_i) + \lambda \|w\|_2^2$
where $\phi : \mathcal{X} \to \mathbb{R}^n$ is a feature map.

Slide 17: Structured Prediction
Input and output spaces $\mathcal{X}$, $\mathcal{Y}$. Training data $X = (x_1, \ldots, x_m) \in \mathcal{X}^m$, $Y = (y_1, \ldots, y_m) \in \mathcal{Y}^m$.
Joint scoring function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$.
Prediction: $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}(x)} f(x, y) = \operatorname{argmax}_{y \in \mathcal{Y}(x)} \langle w, \phi(x, y) \rangle$.

Slide 18: Joint Feature Maps
Extract features from inputs AND outputs. Feature map $\phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$.
Joint input-output kernel: $k[(x, y), (x', y')] = \langle \phi(x, y), \phi(x', y') \rangle$.

Slide 19: Joint Feature Maps
Example (multi-label classification, hierarchical classification): $x \in \mathbb{R}^n$, $y \in \{0, 1\}^d$, $\phi(x, y) = x \otimes y$.
Tensor product kernel: $k[(x, y), (x', y')] = k_X(x, x')\, k_Y(y, y')$.
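To make the tensor-product construction concrete, here is a minimal NumPy sketch (the toy dimensions and names are mine, not from the tutorial): it builds $\phi(x, y) = x \otimes y$ via np.kron and checks that the joint kernel factorizes into $k_X \cdot k_Y$ for linear input and output kernels.

```python
import numpy as np

def joint_feature_map(x, y):
    """Tensor-product joint feature map phi(x, y) = x (x) y."""
    return np.kron(y, x)  # length n*d vector; block j equals y[j] * x

# Toy example: n = 3 input features, d = 2 output bits.
x, xp = np.array([1.0, 2.0, 0.5]), np.array([0.0, 1.0, 1.0])
y, yp = np.array([1.0, 0.0]), np.array([1.0, 1.0])

# Joint kernel computed two ways: explicitly, and factored as k_X * k_Y.
k_joint = joint_feature_map(x, y) @ joint_feature_map(xp, yp)
k_factored = (x @ xp) * (y @ yp)
assert np.isclose(k_joint, k_factored)  # tensor product kernel identity
```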
Slide 20: Loss Functions (Binary → Structured)

Slide 21: Regularized Risk Minimization
$\operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \lambda\, \Omega(f)$
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a non-negative loss function, $\Omega : \mathcal{H} \to \mathbb{R}_+$ is a regularization function, and $\lambda > 0$ is the regularization parameter.

Slide 22: Zero-One Loss
Binary: $\ell_{0\text{-}1}(y, z) = 0$ if $y = z$, and $1$ otherwise.
Structured: $\ell_{\max}(f, (x, y)) = \Delta(\operatorname{argmax}_{z \in \mathcal{Y}} f(x, z),\, y)$.
Non-convex, non-differentiable, hard to optimize. Use surrogate losses instead: convex upper bounds.

Slide 23: Squared Loss
$\ell_{\mathrm{square}}(y, z) = (y - z)^2$, with $\mathcal{Y} = \mathbb{R}$. Used in regression problems. Its extension to structured prediction is non-trivial due to inter-dependencies among the multiple outputs (more on this later).

Slide 24: Hinge Loss
Binary: $\ell_{\mathrm{hinge}}(y, z) = \max(0, 1 - yz)$.
Structured: $\ell_{\mathrm{hinge}}(f, (x, y)) = \max_{z \in \mathcal{Y}} [\Delta(z, y) + f(x, z) - f(x, y)]$.

Slide 25: Logistic Loss
Binary: $\ell_{\log}(y, z) = \ln(1 + \exp(-yz))$.
Structured: $\ell_{\log}(f, (x, y)) = \ln\left[\sum_{z \in \mathcal{Y}} \exp(f(x, z))\right] - f(x, y)$.

Slide 26: Exponential Loss
Binary: $\ell_{\exp}(y, z) = \exp(-yz)$.
Structured: $\ell_{\exp}(f, (x, y)) = \sum_{z \in \mathcal{Y}} \exp[f(x, z) - f(x, y)]$.

Slide 27: Loss Functions
[Figure: the squared, hinge, logistic, and exponential losses plotted as functions of the margin $t = y \cdot f(x)$.]
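The four binary surrogates can be evaluated directly; a small sketch (my code, reproducing the quantities behind the plotted curves):

```python
import numpy as np

def surrogate_losses(t):
    """Binary surrogate losses as functions of the margin t = y * f(x)."""
    return {
        "square": (1.0 - t) ** 2,           # with y in {-1,+1}: (y - f)^2 = (1 - t)^2
        "hinge": np.maximum(0.0, 1.0 - t),  # hinge loss
        "log": np.log1p(np.exp(-t)),        # logistic loss
        "exp": np.exp(-t),                  # exponential loss
    }

for t in [-1.0, 0.0, 1.0, 2.0]:
    print(t, {k: round(float(v), 3) for k, v in surrogate_losses(t).items()})
```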
Slide 28: Learning Algorithms (parameter estimation)

Slide 29: Perceptron → Structured Perceptron

Slide 30: Perceptron
Online learning algorithm. Prediction: $\hat{y} = f(x) = \operatorname{sgn}\langle w, x \rangle$.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$
2. predict $\hat{y}_t = \operatorname{sgn}\langle w, x_t \rangle$
3. receive true label $y_t \in \{-1, +1\}$
4. if $y_t \neq \hat{y}_t$, update $w \leftarrow w + y_t x_t$
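A runnable version of this update rule, as a sketch on synthetic linearly separable data (the data generation and names are my assumptions):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Online perceptron: w <- w + y_t * x_t on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:  # mistake: sgn<w, x_t> != y_t
                w += y_t * x_t
    return w

# Linearly separable toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 1e-9)
w = perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```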
Slide 31: Mistake Bound (separable data) [Block, 62; Novikoff, 62]
Theorem. Let $(x_1, y_1), \ldots, (x_m, y_m)$ be a sequence of training examples with $\|x_i\| \leq R$ for all $i \in [[m]]$. Suppose there exists a unit-norm vector $u$ such that $y_i \langle u, x_i \rangle \geq \gamma$ for all the examples. Then the number of mistakes made by the perceptron algorithm on this sequence is at most $(R/\gamma)^2$.

Slide 32: Mistake Bound (inseparable data) [Freund and Schapire, 99]
Theorem. Let $(x_1, y_1), \ldots, (x_m, y_m)$ be a sequence of training examples with $\|x_i\| \leq R$ for all $i \in [[m]]$. Let $u$ be any unit-norm weight vector and let $\gamma > 0$. Define the deviation of each example as $d_i = \max(0, \gamma - y_i \langle u, x_i \rangle)$ and let $D = \sqrt{\sum_{i=1}^{m} d_i^2}$. Then the number of mistakes made by the perceptron algorithm on this sequence of examples is at most $\left(\frac{R + D}{\gamma}\right)^2$.

Slide 33: Constraint Classification (Har-Peled et al., ALT 2002)
Generalizes the perceptron to various problems including multi-class prediction, multi-label classification, and label ranking.
Example: multi-class prediction, $\mathcal{Y} = \{1, \ldots, d\}$. Weights $(w_1, \ldots, w_d) \in \mathbb{R}^{nd}$. Prediction: $\hat{y} = f(x) = \operatorname{argmax}_{i \in [[d]]} \langle w_i, x \rangle$.

Slide 34: Constraint Classification
Algorithm: initialize $(w_1, \ldots, w_d) = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$ and label $y_t \in [[d]]$
2. construct $\tilde{y}_t = \{(y_t, i) : i \in [[d]] \setminus \{y_t\}\}$
3. for all $(p, q) \in \tilde{y}_t$: if $\langle w_p - w_q, x_t \rangle < 0$, then $w_p \leftarrow w_p + x_t$ and $w_q \leftarrow w_q - x_t$

Slide 35: Structured Perceptron (Collins, EMNLP 2002)
Linear scoring function $f(x, y) = \langle w, \phi(x, y) \rangle$.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$
2. predict $\hat{y}_t = \operatorname{argmax}_{y \in \mathcal{Y}} f(x_t, y)$
3. receive true label $y_t \in \mathcal{Y}$
4. update $w \leftarrow w + \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)$
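A sketch of the structured perceptron on a toy multi-class problem, reusing the tensor-product feature map from above so that the argmax over $\mathcal{Y}$ is a plain enumeration (the setup is illustrative, not from the slides):

```python
import numpy as np

def phi(x, y, d):
    """Joint feature map: tensor product of x with a one-hot encoding of y."""
    e = np.zeros(d); e[y] = 1.0
    return np.kron(e, x)

def structured_perceptron(X, Y, d, epochs=10):
    w = np.zeros(d * X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, Y):
            # Inference: enumerate the (small) output space.
            y_hat = max(range(d), key=lambda z: w @ phi(x_t, z, d))
            if y_hat != y_t:
                w += phi(x_t, y_t, d) - phi(x_t, y_hat, d)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.repeat(np.eye(2) * 3, 100, axis=0)  # two clusters
Y = np.repeat([0, 1], 100)
w = structured_perceptron(X, Y, d=2)
preds = [max(range(2), key=lambda z: w @ phi(x, z, 2)) for x in X]
print("accuracy:", np.mean(np.array(preds) == Y))
```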
Slide 36: Mistake Bound (separable data) [Collins, 02]
Theorem. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$ be a sequence of training examples which is separable with margin $\gamma$. Let $R$ be a constant satisfying $\|\phi(x_i, y_i) - \phi(x_i, z)\| \leq R$ for all $i \in [[m]]$ and all $z \in \mathrm{GEN}(x_i)$. Then the number of mistakes made by the perceptron algorithm on this sequence of examples is at most $R^2 / \gamma^2$.

Slide 37: Mistake Bound (inseparable data) [Collins, 02]
For a training example $(x_i, y_i)$ and a $(v, \gamma)$ pair, define $m_i = \langle v, \phi(x_i, y_i) \rangle - \max_{z \in \mathrm{GEN}(x_i)} \langle v, \phi(x_i, z) \rangle$ and $\epsilon_i = \max\{0, \gamma - m_i\}$, and define $D_{v,\gamma} = \sqrt{\sum_{i=1}^{m} \epsilon_i^2}$. Then the number of mistakes made by the structured perceptron is at most $\min_{v,\gamma} \frac{(R + D_{v,\gamma})^2}{\gamma^2}$.

Slide 38: Regression → KDE

Slide 39: Regularized Least-Squares Regression
OPT: $\min_{w \in \mathbb{R}^n} \lambda \|w\|^2 + \sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2$
Solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$, where $X \in \mathbb{R}^{m \times n}$ is the data/input matrix, $y \in \mathbb{R}^{m \times 1}$ is the label/output vector, and $I$ is the identity matrix.
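The closed-form solution in a few lines of NumPy (a sketch; the data and the value of $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                # data matrix, m x n
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)  # noisy targets

lam = 0.1
n = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)  # (X'X + lam*I)^{-1} X'y
print("recovered weights:", np.round(w, 2))
```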
Slide 40: Kernelized Version
OPT1: $\min_{f \in \mathcal{H}} \lambda \|f\|^2 + \sum_{i=1}^{m} (f(x_i) - y_i)^2$
Representer theorem: $f = \sum_{i=1}^{m} c_i\, k(x_i, \cdot)$
OPT2: $\min_{c \in \mathbb{R}^m} \lambda\, c^\top K c + \|y - Kc\|^2$
Solution: $c = (K + \lambda I)^{-1} y$
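The kernelized solution, here sketched with a Gaussian RBF kernel (the kernel choice and bandwidth are my assumptions, not from the slide):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

lam = 0.1
K = rbf_kernel(X, X)
c = np.linalg.solve(K + lam * np.eye(len(X)), y)  # c = (K + lam*I)^{-1} y

X_test = np.linspace(-3, 3, 5)[:, None]
f_test = rbf_kernel(X_test, X) @ c                # f(x) = sum_i c_i k(x_i, x)
print(np.round(f_test, 2))
```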
Slide 41: Kernel Dependency Estimation (Weston et al., NIPS 2002)
Output feature map $\psi : \mathcal{Y} \to \mathcal{H}_Y$, with $k_Y(y, y') = \langle \psi(y), \psi(y') \rangle$.
Note: this makes it possible to consider a large class of non-linear loss functions in the output space.
KDE uses kernel PCA to "decorrelate" the outputs, and trains independent RLSRs.

Slide 42: Kernel Dependency Estimation (Weston et al., NIPS 2002)
Algorithm:
1. decompose $\{\psi(y_i) : i \in [[m]]\}$ into $k$ orthogonal directions using kernel PCA
2. learn $f_j : \mathcal{X} \to \mathbb{R}$ for each direction $j \in [[k]]$ using RLSR
3. predict by solving the pre-image problem:
$\hat{y}(x) = \operatorname{argmin}_{y \in \mathcal{Y}} \left\| [v_1^\top \psi(y), \ldots, v_k^\top \psi(y)] - [f_1(x), \ldots, f_k(x)] \right\|^2$

Slide 43: Kernel Dependency Estimation (Weston et al., NIPS 2002)
Note: the projection of $\psi(y)$ onto the $p$-th principal component $v_p = \sum_{i=1}^{m} \alpha_i^p \psi(y_i)$ is given by $v_p^\top \psi(y) = \sum_{i=1}^{m} \alpha_i^p k_Y(y_i, y)$, where $\alpha^p$ is the $p$-th eigenvector of the kernel matrix $K_Y$ (after centering it).
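An end-to-end KDE sketch on a toy problem with an output space small enough to brute-force the pre-image step. For simplicity it skips the kernel-matrix centering mentioned on the slide; the data, kernels, and candidate set are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam = 80, 2, 2, 0.1
X = rng.normal(size=(m, n))
Y = (X > 0).astype(float)                        # toy outputs in {0,1}^2

# Kernel PCA on the (uncentered, for simplicity) linear output kernel K_Y = Y Y^T.
Ky = Y @ Y.T
eigval, eigvec = np.linalg.eigh(Ky)
alphas = eigvec[:, -k:] / np.sqrt(np.maximum(eigval[-k:], 1e-12))

def project(y):
    """v_p^T psi(y) = sum_i alpha_i^p k_Y(y_i, y), for p = 1..k."""
    return alphas.T @ (Y @ y)

# Targets: projections of the training outputs; one ridge regression per direction.
P = np.array([project(y) for y in Y])            # m x k
W = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ P)

# Pre-image by enumeration over the small candidate set {0,1}^2.
candidates = [np.array(c, dtype=float) for c in [(0, 0), (0, 1), (1, 0), (1, 1)]]
def predict(x):
    f = W.T @ x
    return min(candidates, key=lambda y: np.sum((project(y) - f) ** 2))

acc = np.mean([np.array_equal(predict(x), y) for x, y in zip(X, Y)])
print("exact-match accuracy:", acc)
```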
Slide 44: SVM → Structured SVM

Slide 45: Support Vector Machine
Linear large-margin classifier $f(x) = \langle w, \phi(x) \rangle$; minimizes the hinge loss.
OPT: $\min_w \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i \langle w, \phi(x_i) \rangle\}$
OPT with slack variables:
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $y_i \langle w, \phi(x_i) \rangle \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 46: Support Vector Machine
SVM dual:
$\min_{c \in \mathbb{R}^m} \frac{1}{2} c^\top Y K Y c - \mathbf{1}^\top c$
s.t. $0 \leq c_i \leq \frac{1}{m\lambda}$ for all $i \in [[m]]$.
Representer theorem: $f(\cdot) = \sum_{i=1}^{m} c_i\, k(x_i, \cdot)$

Slide 47: Structured SVM / Max-Margin Markov Network
Minimizes the structured hinge loss. Two formulations: (1) slack rescaling, (2) margin rescaling.

Slide 48: Structured SVM (slack rescaling)
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle - \langle w, \phi(x_i, z) \rangle \geq 1 - \frac{\xi_i}{\Delta(y_i, z)}$ for all $z, i$; $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 49: Structured SVM (margin rescaling)
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle - \langle w, \phi(x_i, z) \rangle \geq \Delta(y_i, z) - \xi_i$ for all $z, i$; $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 50: Solving the OPT
Major issue: an exponential number of constraints. It suffices to design a sub-routine (loss-augmented inference) that computes
$\hat{y} = \operatorname{argmax}_{z \in \mathcal{Y}} \left[1 - \langle w, \phi(x, y) - \phi(x, z) \rangle\right] \Delta(z, y)$ (slack rescaling)
$\hat{y} = \operatorname{argmax}_{z \in \mathcal{Y}} \left[\Delta(z, y) - \langle w, \phi(x, y) - \phi(x, z) \rangle\right]$ (margin rescaling)
and to iteratively add constraints.
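When the output decomposes over positions and $\Delta$ is the Hamming loss, margin-rescaled loss-augmented inference for a model with only per-position (unary) scores reduces to a position-wise argmax. A sketch under exactly those assumptions (mine, not stated on the slide):

```python
import numpy as np

def loss_augmented_inference(scores, y_true):
    """Margin rescaling: argmax_z [ Hamming(z, y) + score(x, z) ] for a
    sequence model with only unary scores; decomposes per position."""
    T, L = scores.shape               # positions x labels
    aug = scores + 1.0                # +1 for every label...
    aug[np.arange(T), y_true] -= 1.0  # ...except the true one (Hamming loss)
    return aug.argmax(axis=1)

scores = np.array([[2.0, 1.9], [0.5, 0.0], [0.0, 3.0]])
y_true = np.array([0, 0, 1])
print(loss_augmented_inference(scores, y_true))  # most violating output
```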
Slide 51: Cutting-Plane Method (Tsochantaridis et al., JMLR 2005)
Input: $(x_1, y_1), \ldots, (x_m, y_m)$, $\lambda$, $\epsilon$
$S_i \leftarrow \emptyset$ for all $i \in [[m]]$
repeat
  for $i = 1, \ldots, m$ do
    $H(y) := \Delta(y_i, y) + w^\top \phi(x_i, y) - w^\top \phi(x_i, y_i)$
    compute $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} H(y)$
    compute $\xi_i = \max\{0, \max_{y \in S_i} H(y)\}$
    if $H(\hat{y}) > \xi_i + \epsilon$ then
      $S_i \leftarrow S_i \cup \{\hat{y}\}$
      $w \leftarrow$ optimize the primal over $\bigcup_i S_i$
    end if
  end for
until no $S_i$ has changed during an iteration
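A runnable toy rendering of this loop for multi-class classification, where the "optimize primal" step is approximated by subgradient descent over the active working sets rather than the QP solve used in the paper; everything here is a simplified sketch under my assumptions:

```python
import numpy as np

def phi(x, z, d=2):
    """One-hot tensor-product joint feature map."""
    e = np.zeros(d); e[z] = 1.0
    return np.kron(e, x)

def delta(y, z):
    return float(y != z)  # zero-one label loss

def optimize_primal(Xs, Ys, S, lam, steps=300, lr=0.1):
    """Approximate primal solve by subgradient descent over the working sets."""
    w = np.zeros(phi(Xs[0], Ys[0]).shape)
    m = len(Xs)
    for _ in range(steps):
        g = 2.0 * lam * w
        for i, (x, y) in enumerate(zip(Xs, Ys)):
            viol = [(delta(y, z) + w @ (phi(x, z) - phi(x, y)), z) for z in S[i]]
            if viol:
                v, z = max(viol)
                if v > 0:
                    g += (phi(x, z) - phi(x, y)) / m
        w = w - lr * g
    return w

def cutting_plane(Xs, Ys, labels=(0, 1), lam=0.01, eps=1e-3):
    w = np.zeros(phi(Xs[0], Ys[0]).shape)
    S = [set() for _ in Xs]
    changed = True
    while changed:
        changed = False
        for i, (x, y) in enumerate(zip(Xs, Ys)):
            H = lambda z: delta(y, z) + w @ (phi(x, z) - phi(x, y))
            y_hat = max(labels, key=H)             # loss-augmented inference
            xi = max([0.0] + [H(z) for z in S[i]])
            if H(y_hat) > xi + eps:                # a new most-violated constraint
                S[i].add(y_hat)
                w = optimize_primal(Xs, Ys, S, lam)
                changed = True
    return w

rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal(size=(20, 2)) + [3, 0], rng.normal(size=(20, 2)) + [0, 3]])
Ys = [0] * 20 + [1] * 20
w = cutting_plane(Xs, Ys)
preds = [max((0, 1), key=lambda z: w @ phi(x, z)) for x in Xs]
print("accuracy:", np.mean([p == y for p, y in zip(preds, Ys)]))
```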
Slide 52: Theoretical Guarantees with Exact Inference (Tsochantaridis et al., 2005; Finley and Joachims, ICML 2008)
Polynomial-time termination: the algorithm terminates in a polynomial number of iterations.
Correctness: the algorithm solves OPT accurately to a desired precision $\epsilon$.
Empirical risk bound: $\frac{1}{m} \sum_{i=1}^{m} \xi_i$ upper-bounds the empirical risk.

Slide 53: Kernelized SSVM
$\min_{\alpha} \sum_{i, j \in [[m]],\, z, z' \in \mathcal{Y}} \alpha_{iz}\, \alpha_{jz'}\, k[(x_i, z), (x_j, z')] - \sum_{i, z} \Delta(y_i, z)\, \alpha_{iz}$
s.t. $\sum_{z \in \mathcal{Y}} \alpha_{iz} \leq \lambda$ for all $i \in [[m]]$; $\alpha_{iz} \geq 0$ for all $i \in [[m]]$, $z \in \mathcal{Y}$.
Representer theorem: $f(\cdot, \cdot) = \sum_{i \in [[m]],\, z \in \mathcal{Y}} \alpha_{iz}\, k[(x_i, z), (\cdot, \cdot)]$

Slide 54: Structured Perceptron Revisited
Linear scoring function $f(x, y) = \langle w, \phi(x, y) \rangle$.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive input $x_t$
2. predict $\hat{y}_t = \operatorname{argmax}_{y \in \mathcal{Y}} f(x_t, y)$
3. receive true label $y_t \in \mathcal{Y}$
4. update $w \leftarrow w + \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)$
Note: the update rule is not influenced by the loss function $\Delta(y, z)$.

Slide 55: Structured Perceptron Revisited
Replace inference with loss-augmented inference.
Algorithm: initialize $w = 0$. For $t = 1, \ldots, T$:
1. receive example pair $(x_t, y_t)$
2. predict $\hat{y}_t = \operatorname{argmax}_{y \in \mathcal{Y}} [f(x_t, y) + \Delta(y_t, y)]$
3. update $w \leftarrow w + \eta_t (\phi(x_t, y_t) - \phi(x_t, \hat{y}_t))$
Slide 56: Min-Max Formulation (Taskar et al., ICML 2005)
Recall the brute-force enumeration:
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle - \langle w, \phi(x_i, z) \rangle \geq \Delta(y_i, z) - \xi_i$ for all $z, i$; $\xi_i \geq 0$ for all $i \in [[m]]$.

Slide 57: Min-Max Formulation (Taskar et al., ICML 2005)
$\min_{w, \xi} \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$
s.t. $\langle w, \phi(x_i, y_i) \rangle + \xi_i \geq \max_{z \in \mathcal{Y}} [\langle w, \phi(x_i, z) \rangle + \Delta(y_i, z)]$ for all $i \in [[m]]$; $\xi_i \geq 0$ for all $i \in [[m]]$.
Key steps: plug in LP inference; use LP duality to replace the max with a min; rewrite the OPT as a concise QP with a polynomial number of variables and constraints (depending on the output structure).
Slide 58: Logistic Regression → CRF

Slide 59: Logistic Regression
Probabilistic (binary) classifier. Likelihood function:
$p(y \mid x, w) = \frac{\exp(y \langle \phi(x), w \rangle)}{\exp(y \langle \phi(x), w \rangle) + \exp(-y \langle \phi(x), w \rangle)}$
Estimate the parameters $w$ by minimizing the negative log-likelihood.

Slide 60: Exponential Family
Family of probability distributions
$p(x; w) = \exp(\langle \phi(x), w \rangle - \ln Z(w))$
where $\phi(x)$ are the sufficient statistics of $x$ and $Z(w)$ is the partition function, $Z(w) = \int_x \exp(\langle \phi(x), w \rangle)\, dx$.

Slide 61: Log-Partition Function
$g(w) = \ln Z(w) = \ln \int_x \exp(\langle \phi(x), w \rangle)\, dx$
$g(w)$ is a cumulant generator, i.e.,
$\nabla_w g(w) = \frac{\int \phi(x) \exp(\langle \phi(x), w \rangle)\, dx}{\int \exp(\langle \phi(x), w \rangle)\, dx} = \mathbb{E}_{x \sim p(x; w)}[\phi(x)]$
$\nabla_w^2 g(w) = \mathrm{Cov}_{x \sim p(x; w)}[\phi(x)]$
Note: $g(w)$ is convex!
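For a finite sample space the integral becomes a sum, and the identity $\nabla_w g(w) = \mathbb{E}[\phi(x)]$ can be verified numerically. A sketch on a discrete toy family (my construction):

```python
import numpy as np
from scipy.special import logsumexp

# Discrete exponential family over 4 outcomes with 2-dim sufficient statistics.
phi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([0.5, -1.0])

g = logsumexp(phi @ w)   # log-partition g(w) = ln sum_x exp(<phi(x), w>)
p = np.exp(phi @ w - g)  # p(x; w)
grad_g = p @ phi         # E_p[phi(x)]

# Finite-difference check that grad g(w) = E[phi(x)].
eps = 1e-6
fd = [(logsumexp(phi @ (w + eps * e)) - g) / eps for e in np.eye(2)]
print(np.round(grad_g, 4), np.round(fd, 4))
```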
Slide 62: Examples
Binomial, multinomial, Gaussian, Laplace, Poisson, Dirichlet, ...

Slide 63: (Conditional) Exponential Family
Family of conditional distributions
$p(y \mid x, w) = \exp(\langle \phi(x, y), w \rangle - \ln Z(w \mid x))$
where $\phi(x, y)$ are the joint sufficient statistics of $x$ and $y$, and $Z(w \mid x)$ is the partition function, $Z(w \mid x) = \sum_{y \in \mathcal{Y}} \exp(\langle \phi(x, y), w \rangle)$.

Slide 64: MAP Estimation
$p(w \mid Y, X) \propto p(w, Y \mid X) = p(Y \mid X, w)\, p(w)$
Point estimate with a Gaussian prior on $w$, $p(w) \propto \exp(-\lambda \|w\|^2)$:
$\hat{w} = \operatorname{argmax}_w \ln p(w, Y \mid X) = \operatorname{argmax}_w \left[\frac{1}{m} \sum_{i=1}^{m} \ln p(y_i \mid x_i, w)\right] - \lambda \|w\|^2$
Note: a convex optimization problem.

Slide 65: Linear-Chain Conditional Random Fields
[Graph: a chain $Y_{i-1} - Y_i - Y_{i+1}$, with each label $Y_i$ connected to its observation $X_i$.]
Cliques: $(y_t, x_t)$ and $(y_{t-1}, y_t)$ for all $t$.
$p(y \mid x, w) = \exp\left( \sum_t \left[ \langle \phi(y_t, x_t), w_{xy} \rangle + \langle \phi(y_{t-1}, y_t), w_{yy} \rangle \right] - \ln Z(w \mid x) \right)$
Efficient inference using dynamic programming; MLE / MAP estimation of parameters.
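The log-partition $\ln Z(w \mid x)$ of a linear-chain model is computable by the forward algorithm in $O(T \cdot L^2)$ time. A log-space sketch, checked against brute-force enumeration (the score matrices are random placeholders):

```python
import numpy as np
from itertools import product
from scipy.special import logsumexp

def crf_log_partition(unary, trans):
    """Forward algorithm: log Z(w|x) for a linear-chain CRF.
    unary[t, y] = <phi(y_t=y, x_t), w_xy>; trans[y', y] = <phi(y_{t-1}=y', y_t=y), w_yy>."""
    T, L = unary.shape
    alpha = unary[0].copy()  # log-potentials of length-1 prefixes
    for t in range(1, T):
        # alpha[y] = logsumexp_{y'} (alpha[y'] + trans[y', y]) + unary[t, y]
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return logsumexp(alpha)

rng = np.random.default_rng(0)
T, L = 4, 3
unary, trans = rng.normal(size=(T, L)), rng.normal(size=(L, L))

# Brute-force check: sum over all L^T label sequences.
brute = logsumexp([unary[range(T), list(y)].sum() +
                   sum(trans[y[t - 1], y[t]] for t in range(1, T))
                   for y in product(range(L), repeat=T)])
print(np.isclose(crf_log_partition(unary, trans), brute))
```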
Slide 66: HMMs and CRFs
[Graphs: the same chain over $Y_{i-1}, Y_i, Y_{i+1}$ with observations $X_{i-1}, X_i, X_{i+1}$, directed for the HMM and undirected for the CRF.]
Hidden Markov model: maximize $p(x, y \mid w)$. Conditional random field: maximize $p(y \mid x, w)$.

Slide 67: Learning Reductions (ICML 2009 tutorial)
Reductions transform complex learning problems into simpler, core problems. Desideratum: good performance on the core problem should imply good performance on the complex problem.
Examples:
Multi-class prediction → binary classification
Ranking → classification
Cost-sensitive classification → binary classification
Structured prediction → binary classification (SEARN)

Slide 68: Structured → Binary (SEARN) (Daumé et al., 2009)
Decomposable outputs $y = (y_1, \ldots, y_T)$. SEARN learns a policy $\pi$ that maps tuples $(x, y_1, \ldots, y_{t-1})$ to $y_t$.
Reduces structured prediction to cost-sensitive (multi-class) classification: good performance on cost-sensitive classification implies good performance on the structured prediction problem.
Note: no argmax problem!

Slide 69: Reduction to Cost-Sensitive Classification
Distribution over cost-sensitive examples $(\text{input}, c_1, \ldots, c_K)$. For every structured example $(x, y)$:
1. sample $t$ uniformly from $[[T]]$
2. run the policy $\pi$ for $t - 1$ steps to yield $(\hat{y}_1, \ldots, \hat{y}_{t-1})$
3. input: $(x, \hat{y}_1, \ldots, \hat{y}_{t-1})$
4. costs: $c_k = \mathbb{E}_{\hat{y}_{t+1}, \ldots, \hat{y}_T \sim \pi}\; \ell(y, (\hat{y}_1, \ldots, \hat{y}_{t-1}, k, \hat{y}_{t+1}, \ldots, \hat{y}_T))$ for all $k \in [[K]]$
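A sketch of this example-generation procedure, estimating each expectation with a single policy rollout (the toy policy, loss, and names are mine, not from the tutorial):

```python
import numpy as np

def make_cs_example(x, y, policy, loss, K, rng):
    """Generate one cost-sensitive example from a structured example (x, y).
    policy(x, prefix) -> next label; loss(y, y_hat) -> scalar."""
    T = len(y)
    t = rng.integers(1, T + 1)          # 1. sample t uniformly from [[T]]
    prefix = []
    for _ in range(t - 1):              # 2. run the policy for t-1 steps
        prefix.append(policy(x, prefix))
    costs = np.zeros(K)
    for k in range(K):                  # 4. cost of taking action k at step t
        rollout = list(prefix) + [k]
        while len(rollout) < T:         # complete the output with the policy
            rollout.append(policy(x, rollout))
        costs[k] = loss(y, rollout)     # single rollout estimates the expectation
    return (x, tuple(prefix)), costs

# Toy usage: labels 0/1, Hamming loss, and a policy that always predicts 0.
rng = np.random.default_rng(0)
x, y = "dummy input", [0, 1, 0, 1]
policy = lambda x, prefix: 0
loss = lambda y, z: sum(a != b for a, b in zip(y, z))
print(make_cs_example(x, y, policy, loss, K=2, rng=rng))
```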
Slide 70: Intermediate Summary
Loss functions (binary → structured). Discriminative structured prediction algorithms:
Perceptron → structured perceptron
RLSR → KDE
SVM → Struct-SVM, M3N
LogReg → CRF
Structured prediction → cost-sensitive classification