Probabilistic Graphical Models
1 Probabilistic Graphical Models

David Sontag, New York University
Lecture 12, April 23, 2013
2 What notion of "best" should learning be optimizing?

This depends on what we want to do:
1 Density estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
2 Specific prediction tasks: we are using the distribution to make a prediction
3 Structure or knowledge discovery: we are interested in the model itself
3 Density estimation for conditional models

Suppose we want to predict a set of variables Y given some others X, e.g., for segmentation or stereo vision:

    input: two images → output: disparity

We concentrate on predicting p(y | x), and use a conditional loss function

    loss(x, y, \hat{M}) = -\log \hat{p}(y | x).

Since the loss function only depends on \hat{p}(y | x), it suffices to estimate the conditional distribution, not the joint.
4 Density estimation for conditional models

CRF:

    p(y | x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c(x, y_c), \quad Z(x) = \sum_{\hat{y}} \prod_{c \in C} \phi_c(x, \hat{y}_c)

Parameterization as log-linear model: weights w \in \mathbb{R}^d, feature vectors f_c(x, y_c) \in \mathbb{R}^d,

    \phi_c(x, y_c; w) = \exp(w \cdot f_c(x, y_c))

Empirical risk minimization with CRFs, i.e. \min_{\hat{M}} E_D[loss(x, y, \hat{M})]:

    w_{ML} = \arg\min_w -\frac{1}{|D|} \sum_{(x,y) \in D} \log p(y | x; w)
           = \arg\max_w \sum_{(x,y) \in D} \Big( \sum_c \log \phi_c(x, y_c; w) - \log Z(x; w) \Big)
           = \arg\max_w \sum_{(x,y) \in D} \Big( w \cdot \sum_c f_c(x, y_c) - \log Z(x; w) \Big)
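To make the log-linear parameterization concrete, here is a minimal Python sketch that evaluates \log p(y | x; w) by brute-force enumeration of Z(x). The helper name feature_fn and its contract (returning the aggregated feature vector \sum_c f_c(x, y_c) as a NumPy array) are illustrative assumptions, not part of the lecture; enumeration is only feasible for tiny label spaces and short sequences.

```python
import itertools
import math

import numpy as np

def crf_log_prob(w, x, y, label_set, feature_fn):
    # log p(y|x; w) for a log-linear CRF; Z(x) is computed by brute-force
    # enumeration over all labelings, so this only works for tiny problems.
    score = float(np.dot(w, feature_fn(x, y)))      # w . sum_c f_c(x, y_c)
    log_z = -math.inf
    for y_hat in itertools.product(label_set, repeat=len(y)):
        # log-sum-exp accumulation of exp(w . f(x, y_hat)) over all labelings
        log_z = np.logaddexp(log_z, np.dot(w, feature_fn(x, list(y_hat))))
    return score - log_z
```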
5 Application: Part-of-speech tagging

    United flies some large jet
      N      V    D     A    N

(one variable per word: United_1 flies_2 some_3 large_4 jet_5)
6 Graphical model formulation of POS tagging

Given: a sentence of length n and a tag set T
- one variable for each word, taking values in T
- edge potentials \theta_{i-1,i}(t', t) for all i \in [n], t, t' \in T

Example: United_1 flies_2 some_3 large_4 jet_5, with T = {A, D, N, V}

Note: for a probabilistic HMM, \theta_{i-1,i}(t', t) = \log( p(w_i | t) \, p(t | t') )
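The HMM correspondence in the note can be written in a few lines. This is a sketch under the assumption that the HMM tables are stored as nested dicts, p_word_given_tag[t][w] and p_tag_given_prev[t'][t]; those names and that layout are hypothetical.

```python
import math

def hmm_edge_potential(p_word_given_tag, p_tag_given_prev, sentence):
    # Returns theta(i, t_prev, t) = log( p(w_i | t) p(t | t_prev) ), i.e. the
    # HMM rewritten as pairwise potentials on the tagging chain.
    def theta(i, t_prev, t):
        return (math.log(p_word_given_tag[t][sentence[i]])
                + math.log(p_tag_given_prev[t_prev][t]))
    return theta
```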
7 Features for POS tagging

Edge potentials: fully parameterize (|T| \cdot |T| features and weights), i.e.

    \theta_{i-1,i}(t', t) = w^T_{t',t}

where the superscript T denotes that these are the weights for the transitions.

Node potentials: introduce features for the presence or absence of certain attributes of each word (e.g., initial letter capitalized, suffix is "ing"), for each possible tag (|T| \cdot \#attributes features and weights). This part is conditional on the input sentence!

Edge potentials are the same for all edges; likewise for node potentials.
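As an illustration of input-conditional node features versus shared transition features, here is a hedged sketch using a sparse dict representation; the particular attributes chosen (lowercased word identity, capitalization, 3-character suffix) are example choices, not necessarily the ones used in the lecture.

```python
def node_features(x, i, t):
    # Sparse indicator features for tagging word x[i] with tag t. These
    # depend on the input sentence, unlike the transition features.
    w = x[i]
    return {
        (t, "word=" + w.lower()): 1.0,
        (t, "capitalized"): float(w[0].isupper()),
        (t, "suffix=" + w[-3:]): 1.0,
    }

def edge_features(t_prev, t):
    # One indicator per tag pair, shared across all edges in the chain.
    return {("trans", t_prev, t): 1.0}
```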
8 Structured prediction

Often we learn a model for the purpose of structured prediction, in which, given x, we predict y by finding the MAP assignment:

    \arg\max_y \hat{p}(y | x)

Rather than learn using log-loss (density estimation), we use a loss function better suited to the specific task. One reasonable choice would be the classification error:

    E_{(x,y) \sim p}\big[ \mathbb{1}[\exists y' \neq y \text{ s.t. } \hat{p}(y' | x) \geq \hat{p}(y | x)] \big]

which is the probability, over all (x, y) pairs sampled from p, that our classifier selects a wrong labeling.

If p is in the model family, training with log-loss (density estimation) and with classification error would perform similarly (given sufficient data). Otherwise, it is better to directly optimize what we care about (classification error).
9 Structured prediction

Consider the empirical risk for 0-1 loss (classification error):

    \frac{1}{|D|} \sum_{(x,y) \in D} \mathbb{1}[\exists y' \neq y \text{ s.t. } \hat{p}(y' | x) \geq \hat{p}(y | x)]

Each constraint \hat{p}(y' | x) \geq \hat{p}(y | x) is equivalent to

    w \cdot \sum_c f_c(x, y'_c) - \log Z(x; w) \geq w \cdot \sum_c f_c(x, y_c) - \log Z(x; w)

The log-partition function cancels out on both sides. Re-arranging, we have:

    w \cdot \Big( \sum_c f_c(x, y'_c) - \sum_c f_c(x, y_c) \Big) \geq 0

Said differently, the empirical risk is zero when, \forall (x, y) \in D and y' \neq y,

    w \cdot \Big( \sum_c f_c(x, y_c) - \sum_c f_c(x, y'_c) \Big) > 0.
10 Structured prediction

Empirical risk is zero when, \forall (x, y) \in D and y' \neq y,

    w \cdot \Big( \sum_c f_c(x, y_c) - \sum_c f_c(x, y'_c) \Big) > 0.

In the simplest setting, learning corresponds to finding a weight vector w that satisfies all of these constraints (when possible).

This is a linear program (LP)! How many constraints does it have? |D| \cdot |Y| — exponentially many! Thus, we must avoid explicitly representing this LP.

This lecture is about algorithms for solving this LP (or some variant) in a tractable manner.
11 Structured perceptron algorithm

Input: training examples D = \{(x^m, y^m)\}

Let f(x, y) = \sum_c f_c(x, y_c). Then, the constraints that we want to satisfy are

    w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) > 0, \quad \forall y \neq y^m

The perceptron algorithm uses MAP inference in its inner loop:

    MAP(x^m; w) = \arg\max_{y \in Y} w \cdot f(x^m, y)

The maximization can often be performed efficiently by using the structure!

The perceptron algorithm is then:
1 Start with w = 0
2 While the weight vector is still changing:
3   For m = 1, ..., |D|:
4     y \leftarrow MAP(x^m; w)
5     w \leftarrow w + f(x^m, y^m) - f(x^m, y)
12 Structured perceptron algorithm

If the training data is separable, the perceptron algorithm is guaranteed to find a weight vector which perfectly classifies all of the data.

When separable with margin \gamma, the number of iterations is at most

    \Big( \frac{2R}{\gamma} \Big)^2, \quad \text{where } R = \max_{m,y} \| f(x^m, y) \|_2

In practice, one stops after a certain number of outer iterations (called epochs), and uses the average of all weights. The averaging can be understood as a type of regularization to prevent overfitting.
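Here is a compact sketch of the algorithm from slide 11, including the weight averaging described above. It assumes user-supplied feature_fn(x, y) = \sum_c f_c(x, y_c) (returning a NumPy vector) and map_fn(x, w) = arg max_y w · f(x, y) (e.g., Viterbi for chains); both names are illustrative.

```python
import numpy as np

def averaged_structured_perceptron(data, feature_fn, map_fn, n_epochs=10):
    # data: list of (x, y) pairs; map_fn does MAP inference in the inner loop.
    x0, y0 = data[0]
    w = np.zeros_like(feature_fn(x0, y0), dtype=float)
    w_sum, steps = np.zeros_like(w), 0
    for _ in range(n_epochs):
        for x, y in data:
            y_hat = map_fn(x, w)
            if y_hat != y:                       # mistake-driven update
                w = w + feature_fn(x, y) - feature_fn(x, y_hat)
            w_sum, steps = w_sum + w, steps + 1
    return w_sum / steps                         # averaged weights
```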
13 Allowing slack

We can equivalently write the constraints as

    w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \geq 1, \quad \forall y \neq y^m

Suppose there do not exist weights w that satisfy all constraints. Introduce slack variables \xi_m \geq 0, one per data point, to allow for constraint violations:

    w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \geq 1 - \xi_m, \quad \forall y \neq y^m

Then, minimize the sum of the slack variables, \min_{\xi \geq 0} \sum_m \xi_m, subject to the above constraints.
14 Structural SVM (support vector machine)

    \min_{w, \xi} \sum_m \xi_m + C \|w\|^2

subject to:

    w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \geq 1 - \xi_m, \quad \forall m, y \neq y^m
    \xi_m \geq 0, \quad \forall m

This is a quadratic program (QP). Solving for the slack variables in closed form, we obtain

    \xi^*_m = \max\Big( 0, \max_{y \neq y^m} \big( 1 - w \cdot ( f(x^m, y^m) - f(x^m, y) ) \big) \Big)

Thus, we can re-write the whole optimization problem as

    \min_w \sum_m \max\Big( 0, \max_{y \neq y^m} \big( 1 - w \cdot ( f(x^m, y^m) - f(x^m, y) ) \big) \Big) + C \|w\|^2
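For tiny label spaces, the closed-form slack can be evaluated directly. A minimal sketch, again assuming an aggregated feature_fn and brute-force enumeration over labelings:

```python
import itertools

import numpy as np

def structural_hinge(w, x, y_true, label_set, feature_fn):
    # xi* = max(0, max_{y != y_true} 1 - w.(f(x, y_true) - f(x, y))),
    # computed by enumerating all labelings (exponential; illustration only).
    true_score = np.dot(w, feature_fn(x, y_true))
    worst = max(
        1.0 - (true_score - np.dot(w, feature_fn(x, list(y))))
        for y in itertools.product(label_set, repeat=len(y_true))
        if list(y) != y_true
    )
    return max(0.0, worst)
```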
15 Hinge loss

We can view \max\big( 0, \max_{y \neq y^m} ( 1 - w \cdot ( f(x^m, y^m) - f(x^m, y) ) ) \big) as a loss function, called hinge loss.

When w \cdot f(x^m, y^m) \geq w \cdot f(x^m, y) for all y (i.e., correct prediction), this takes a value between 0 and 1.

When \exists y such that w \cdot f(x^m, y) \geq w \cdot f(x^m, y^m) (i.e., incorrect prediction), this takes a value \geq 1.

Thus, the hinge loss always upper bounds the 0-1 loss! Minimizing hinge loss is good because it minimizes an upper bound on the 0-1 loss (prediction error).
16 Better metrics

It doesn't always make sense to penalize all incorrect predictions equally! We can change the constraints to

    w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \geq \Delta(y, y^m) - \xi_m, \quad \forall y,

where \Delta(y, y^m) \geq 0 is a measure of how far the assignment y is from the true assignment y^m. This is called margin scaling (as opposed to slack scaling).

We assume that \Delta(y, y) = 0, which allows us to say that the constraint holds for all y, rather than just y \neq y^m.

A frequently used metric for MRFs is Hamming distance, where

    \Delta(y, y^m) = \sum_{i \in V} \mathbb{1}[ y_i \neq y^m_i ]
17 Structural SVM with margin scaling

    \min_w \sum_m \max_{y \in Y} \big( \Delta(y, y^m) - w \cdot ( f(x^m, y^m) - f(x^m, y) ) \big) + C \|w\|^2

How to solve this? Many methods!
1 Cutting-plane algorithm (Tsochantaridis et al., 2005)
2 Stochastic subgradient method (Ratliff et al., 2007)
3 Dual Loss Primal Weights algorithm (Meshi et al., 2010)
4 Frank-Wolfe algorithm (Lacoste-Julien et al., 2013)
18 Stochastic subgradient method

    \min_w \sum_m \max_{y \in Y} \big( \Delta(y, y^m) - w \cdot ( f(x^m, y^m) - f(x^m, y) ) \big) + C \|w\|^2

Although this objective is convex, it is not differentiable everywhere. We can use a subgradient method to minimize it (instead of gradient descent).

The subgradient of \max_{y \in Y} \big( \Delta(y, y^m) - w \cdot ( f(x^m, y^m) - f(x^m, y) ) \big) at w^{(t)} is f(x^m, \hat{y}) - f(x^m, y^m), where \hat{y} is one of the maximizers with respect to w^{(t)}, i.e.

    \hat{y} = \arg\max_{y \in Y} \Delta(y, y^m) + w^{(t)} \cdot f(x^m, y)

This maximization is called loss-augmented MAP inference.
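One step of the method might look as follows. This is a sketch: loss_aug_map is an assumed helper implementing the loss-augmented MAP above, and folding the full regularizer gradient 2Cw into every per-example step is a simplification (in practice one would scale it by 1/|D| and/or use a decaying step size).

```python
import numpy as np

def subgradient_step(w, example, feature_fn, loss_aug_map, C, step_size):
    # One stochastic subgradient step on the margin-scaled structural SVM.
    # loss_aug_map(x, y_true, w) returns arg max_y Delta(y, y_true) + w.f(x, y).
    x, y_true = example
    y_hat = loss_aug_map(x, y_true, w)
    # Subgradient of the hinge term, plus the regularizer gradient 2Cw
    # (folded into each per-example step for simplicity).
    g = feature_fn(x, y_hat) - feature_fn(x, y_true) + 2.0 * C * w
    return w - step_size * g
```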
19 Loss-augmented inference

    \hat{y} = \arg\max_{y \in Y} \Delta(y, y^m) + w^{(t)} \cdot f(x^m, y)

When \Delta(y, y^m) = \sum_{i \in V} \mathbb{1}[ y_i \neq y^m_i ], this corresponds to adding additional single-node potentials \theta_i(y_i) = 1 if y_i \neq y^m_i, and 0 otherwise.

If MAP inference was previously exactly solvable by a combinatorial algorithm, loss-augmented MAP inference typically is too.

The Hamming distance pushes the MAP solution away from the true assignment y^m.
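Concretely, the Hamming augmentation only touches the node potentials, so any existing MAP solver can be reused unchanged. A sketch, assuming node potentials are stored as a list of {label: value} dicts (an illustrative representation):

```python
def hamming_augment(theta_node, y_true):
    # Add theta_i(y_i) = 1 if y_i != y^m_i (else 0) to the node potentials,
    # so ordinary MAP inference on the augmented potentials solves the
    # loss-augmented problem.
    return [
        {t: pot + (1.0 if t != y_true[i] else 0.0) for t, pot in pots.items()}
        for i, pots in enumerate(theta_node)
    ]
```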
20 Cutting-plane algorithm

    \min_{w, \xi} \sum_m \xi_m + C \|w\|^2

subject to:

    w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \geq \Delta(y, y^m) - \xi_m, \quad \forall m, y \in Y^m
    \xi_m \geq 0, \quad \forall m

Start with Y^m = \{y^m\}. Solve for the optimal w, \xi. Then, look to see whether any of the unused constraints are violated. To find a violated constraint for data point m, simply solve the loss-augmented inference problem:

    \hat{y} = \arg\max_{y \in Y} \Delta(y, y^m) + w \cdot f(x^m, y)

If \hat{y} \in Y^m, do nothing. Otherwise, let Y^m = Y^m \cup \{\hat{y}\}.

Repeat until no new constraints are added. Then we are optimal!
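A sketch of the working-set loop described above. Here solve_qp is a hypothetical helper that solves the QP restricted to the constraints currently in the working sets and returns (w, \xi); loss_aug_map is the loss-augmented MAP routine from the previous slides.

```python
def cutting_plane(data, loss_aug_map, solve_qp):
    # Y^m starts as {y^m}; repeatedly solve the restricted QP, then add the
    # most violated constraint (found by loss-augmented MAP) per example.
    working_sets = [{tuple(y)} for _, y in data]
    while True:
        w, xi = solve_qp(data, working_sets)
        added = False
        for m, (x, y) in enumerate(data):
            y_hat = tuple(loss_aug_map(x, y, w))   # most violated candidate
            if y_hat not in working_sets[m]:
                working_sets[m].add(y_hat)
                added = True
        if not added:
            return w                               # no violated constraints left
```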
21 Cutting-plane algorithm

One can prove that solving the structural SVM up to \epsilon (additive) accuracy takes only a polynomial number of iterations. In practice, the algorithm terminates very quickly.
22 Block-Coordinate Frank-Wolfe Optimization for Structural SVMs

Summary of convergence rates (Table 1 of Lacoste-Julien et al., 2013): convergence rates given in the number of calls to the oracles for different optimization algorithms for the structural SVM objective, in the case of a Markov random field structure, to reach a specific accuracy \epsilon measured for different types of gaps, in terms of the number of training examples n, regularization parameter \lambda, size of the label space |Y|, and maximum feature norm R := \max_{i,y} \|\psi_i(y)\|_2 (some minor terms ignored for succinctness). Table inspired by (Zhang et al., 2011). Notice that only stochastic subgradient and block-coordinate Frank-Wolfe have rates independent of n.

Optimization algorithm | Online | Primal/Dual | Type of guarantee | Oracle type | # Oracle calls
dual extragradient (Taskar et al., 2006) | no | primal-dual | saddle point gap | Bregman projection | O(nR \log|Y| / \epsilon)
online exponentiated gradient (Collins et al., 2008) | yes | dual | expected dual error | expectation | O((n + \log|Y|) R^2 / (\lambda\epsilon))
excessive gap reduction (Zhang et al., 2011) | no | primal-dual | duality gap | expectation | O(nR \sqrt{\log|Y| / (\lambda\epsilon)})
BMRM (Teo et al., 2010) | no | primal | primal error | maximization | O(nR^2 / (\lambda\epsilon))
1-slack SVM-Struct (Joachims et al., 2009) | no | primal-dual | duality gap | maximization | O(nR^2 / (\lambda\epsilon))
stochastic subgradient (Shalev-Shwartz et al., 2010a) | yes | primal | primal error w.h.p. | maximization | Õ(R^2 / (\lambda\epsilon))
block-coordinate Frank-Wolfe (this paper, Thm. 3) | yes | primal-dual | expected duality gap | maximization | O(R^2 / (\lambda\epsilon))

Here n = number of training examples, and \lambda is the regularization constant (corresponding to 2C/n).
23 Application to segmentation & support inference

(Joint work with Nathan Silberman & Rob Fergus)

[Figure: pipeline overview from the ECCV 2012 submission. Given an input RGB image with raw and inpainted depth maps, the system computes surface normals, aligns the point cloud to the room by finding three dominant orthogonal directions, extracts regions and features, and then performs segmentation and support classification, producing 3D planes, structure labels, and support relations.]
24 Application to machine translation

Word alignment between languages (Taskar, Lacoste-Julien, and Klein, 2005)

[Figure: word alignment matrices between a French sentence ("le un de les grands objectifs de les consultations est de faire en sorte que la relance profite également à tous.") and its English translation ("one of the major objectives of these consultations is to make sure that the recovery benefits all."), comparing alignments obtained with Dice and distance features against alignments obtained with all features.]