Support Vector Machines and Bayes Regression

Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering binary, linear classification. In this problem, we are trying to map elements from our feature space into the set { 1, 1}. When we have two features, F 1, F 2 R, we can think of the problem graphically. In the following figure, we represent the points with a label of 1 as s, we the points with a label of 1 as s. We wish to find a linear decision boundary (a straight line) that separates the points from each class. The decision boundary is shown as the dotted line. F2 F1 Figure 1: Linear binary classification To represent this notion mathematically, we denote the i th training point as f i, and the class of this point as y i. The linear classification problem requires us to find a set of weights w such that: i y i w T f i 0 (1) Note that the decision boundary need not pass through the origin. We often use a dummy variable F 0 = c for constant c 0, which allows us to use w 0 as an offset. ne issue with this formulation of the problem is that it has a trivial solution. Specifically, if we set all of our weights to 0, we will have y i w T f i = 0 for all points. To address this issue, we modify 1 Some content adapted from previous scribes: Alan Kraut, Robert Fisher, Heather Knight, Ben injilefu, Kevin Lipkin, Alvaro Collet-Romea, Gandalf, Laura Lindzey and Hans Pirnay 1

To address this issue, we modify our constraints to be

$\forall i: \quad y_i w^T f_i \geq \text{margin}$   (2)

This can have the added benefit of improving generalization. Once again, we can represent this graphically: the space between the dotted line and the decision boundary represents the margin.

[Figure 2: Linear binary classification with margins in the $(F_1, F_2)$ plane.]

We further see that the magnitude of our weight vector does not matter; only its orientation affects classification. Therefore we can restrict $w$ to $\|w\|_2 = 1$. Taken together, this yields the following formulation:

maximize    margin
subject to  $\forall i: \; y_i w^T f_i \geq \text{margin}$
            $\|w\|_2 \leq 1$

We use $\|w\|_2 \leq 1$ instead of $\|w\|_2 = 1$ to make sure that our problem remains convex. Note that we do not include a bias term; we can incorporate a bias by simply adding a $1$ at the end of each input vector. This means we are regularizing the bias in this formulation of the SVM, though the original SVM papers did not. This problem is generally solved in an equivalent form that is easier to optimize, where we hold the margin fixed (at $1 > 0$) and minimize the magnitude of the weights:

minimize    $\|w\|^2$
subject to  $\forall i: \; y_i w^T f_i \geq 1$

This last formulation is an example of a quadratic program, which we are able to solve efficiently. Unfortunately, the hard-margin SVM requires that the data be linearly separable, which is almost never the case.
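The notes do not prescribe a particular QP solver, so the following is only a sketch of the hard-margin problem above using scipy's general-purpose SLSQP method (assuming scipy is available); the toy data is hypothetical, and the last column is the dummy bias feature.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data; last column is the constant dummy feature.
f = np.array([[2.0, 1.0, 1.0], [3.0, 2.5, 1.0], [-1.0, -2.0, 1.0], [-2.5, -0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hard-margin SVM as a small quadratic program:
#   minimize ||w||^2   subject to   y_i w^T f_i >= 1  for all i.
res = minimize(
    fun=lambda w: w @ w,                       # objective ||w||^2
    x0=np.ones(f.shape[1]),                    # a feasible starting point for this toy data
    constraints=[{"type": "ineq",              # SLSQP expects constraints as g(w) >= 0
                  "fun": lambda w: y * (f @ w) - 1.0}],
    method="SLSQP",
)
w_star = res.x
print("weights:", w_star)
print("margins:", y * (f @ w_star))            # all entries should be >= 1
```

Dedicated SVM or QP solvers are used in practice; this is only meant to make the optimization problem concrete.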

To address this issue, we introduce a slack variable, $\xi$. The problem now becomes:

minimize    $\lambda \|w\|^2 + \sum_i \xi_i$
subject to  $\forall i: \; y_i w^T f_i \geq 1 - \xi_i$
            $\xi_i \geq 0$

There is one slack variable $\xi_i$ corresponding to each of the $T$ data points that we are considering. Many SVM solvers further transform this optimization into a dual form, where the weight vector $w$ is written as a linear combination of the support vectors, and this linear combination is optimized rather than optimizing $w$ directly. This allows us to formulate nonlinear versions of the SVM, but that is beyond the scope of the class.

1.1 Gradient Descent

To perform gradient descent, we first observe that $\xi_i = \max(0, 1 - y_i w^T f_i)$, because the slack variable is $0$ if the point is labeled correctly with sufficient margin. This allows us to write the loss function

$\ell_t = \lambda \|w\|^2 + \max(0, 1 - y_t w^T f_t)$   (3)

Our update for $w$ is now as follows. If the output was correct by at least the margin, the only contribution to the loss is the magnitude of $w$:

$w \leftarrow w - 2\alpha_t \lambda w$   (4)

And if the output for this time step was incorrect or not outside the margin ($\xi > 0$),

$w \leftarrow w - 2\alpha_t \lambda w + \alpha_t y_t f_t$   (5)

We note that this loss function is not minimizing the number of mistakes made by the algorithm. In fact, solving this problem with a 0-1 loss function (which would minimize the number of mistakes) is known to be NP-hard. We can visualize the difference between these loss functions with the following figure, in which $Z = w^T f_i$:

[Figure 3: Loss functions — hinge loss and mistake (0-1) loss plotted against $Z$.]

Our loss function, the hinge loss, is convex, while the mistake (0-1) loss is not. However, we do see intuitively that the loss function we are using is the best convexification of the NP-hard problem. In practice the convexified loss is more appropriate anyway, since we generally want to minimize how wrong we are when we do mislabel a point.

1.2 Selecting $\alpha_t$

A first idea is to set $\alpha_t$ proportional to $1/\sqrt{t}$. If we have $T$ elements, each with a maximum value of $F$, and the maximum gradient is $G$, this results in a regret of $R \leq F G \sqrt{T}$. This is not as good as we could do.

Notice that $\ell_t$ is an extremely good convex function: it is a quadratic plus a convex function. In the same way that every convex function lies above a line (a subgradient) from every point, $\ell_t$ lies above a quadratic from every point. Specifically, if it is always the case that

$f(y) \geq f(x) + \frac{H}{2}\|y - x\|^2 + \nabla f(x)^T (y - x)$   (6)

then $f$ is said to be $H$-strongly convex. In this case $\ell_t$ is $\lambda$-strongly convex. If $\alpha_t = \frac{1}{Ht}$, then the regret is bounded by $\frac{G^2}{H}(1 + \log t)$. A $\log t$ regret is really good, and this learning rate and algorithm are essentially the current best for this class of problem.
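As a sketch of Sections 1.1-1.2, the following Python code runs the subgradient updates (4) and (5) on the regularized hinge loss (3), with step size $\alpha_t = 1/(\lambda t)$ in the spirit of the $1/(Ht)$ schedule discussed above; the toy data, the constant $\lambda = 0.1$, and the number of epochs are hypothetical choices, not values from the notes.

```python
import numpy as np

def train_svm_sgd(f, y, lam=0.1, epochs=20, rng=np.random.default_rng(0)):
    """Stochastic subgradient descent on l_t = lam * ||w||^2 + max(0, 1 - y_t w^T f_t)."""
    n, d = f.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            alpha = 1.0 / (lam * t)             # step size ~ 1/(lam * t) for the lam-strongly convex loss
            if y[i] * (f[i] @ w) >= 1:          # correct by at least the margin: only ||w||^2 contributes
                w = w - 2 * alpha * lam * w                          # update (4)
            else:                               # inside the margin or misclassified (xi > 0)
                w = w - 2 * alpha * lam * w + alpha * y[i] * f[i]    # update (5)
    return w

# Hypothetical toy data, with a constant dummy feature in the last column.
f = np.array([[2.0, 1.0, 1.0], [3.0, 2.5, 1.0], [-1.0, -2.0, 1.0], [-2.5, -0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_svm_sgd(f, y)
print("learned weights:", w, "margins:", y * (f @ w))
```

The $1/(\lambda t)$ choice mirrors the $1/(Ht)$ rate for an $H$-strongly convex loss; with a $1/\sqrt{t}$ schedule the same loop would instead correspond to the $\sqrt{T}$-style regret mentioned at the start of Section 1.2.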

2 Bayes Linear Regression

In linear regression, the goal is to predict a continuous outcome variable. In particular, let:

$\theta$ = parameter vector of the learned model
$x_t \in \mathbb{R}^n$ = set of features at every timestep, used for prediction
$y_t \in \mathbb{R}$ = true outcome

Then our model is as follows:

$y_t = \theta^T x_t + \epsilon_t$

where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is noise that is independent of everything else.

[Figure 4: Graphical model of BLR — inputs $x_0, x_1, x_2, x_3$ and outputs $y_0, y_1, y_2, y_3$, all tied to the shared parameter $\theta$.]

Thus the likelihood when $\theta$ is known may be written:

$p(y_t \mid x_t, \theta) = \frac{1}{Z} \exp\left\{ -\frac{(y_t - \theta^T x_t)^2}{2\sigma^2} \right\}$

In BLR, we maintain a distribution over the weight vector $\theta$ to represent our beliefs about what $\theta$ is likely to be. The math is easiest if we restrict this distribution to be a Gaussian: $\theta \sim \mathcal{N}(\mu_0, \Sigma_0)$. This has the following form:

$p(\theta) = \frac{1}{Z} \exp\left\{ -\frac{1}{2} (\theta - \mu_0)^T \Sigma_0^{-1} (\theta - \mu_0) \right\}$   (7)

This is called the moment parameterization of a Gaussian. It is the most intuitive parameterization, but it is actually more computationally convenient to use the natural parameterization. In the natural parameterization, the multiplication operation required in Bayes' rule becomes a simple sum:

$p(\theta) = \frac{1}{Z} \exp\left\{ J_0^T \theta - \frac{1}{2} \theta^T P_0 \theta \right\}$   (8)

where

$P_0 = \Sigma_0^{-1}$   (9)
$J_0 = P_0 \mu_0$   (10)

Note that for these to be probability distributions, $\Sigma_0$ must be positive-definite; otherwise the distribution will have an infinite integral and cannot be normalized. It is also generally constrained to be symmetric, as correlations between random variables are symmetric.

Recall that if we have a distribution over $\theta$ and we receive a new $x_{t+1}$, we can compute the distribution over $\tilde{y}_{t+1}$ by integrating out $\theta$:

$p(\tilde{y}_{t+1} \mid x_{t+1}, D) = \int p(\tilde{y}_{t+1} \mid x_{t+1}, D, \theta)\, p(\theta \mid D)\, d\theta$   (11)
$\phantom{p(\tilde{y}_{t+1} \mid x_{t+1}, D)} = \int p(\tilde{y}_{t+1} \mid x_{t+1}, \theta)\, p(\theta \mid D)\, d\theta$   (12)
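As a quick sanity check on equations (9) and (10), here is a small numpy sketch (the example prior is hypothetical) converting a moment-parameterized Gaussian prior to natural parameters and back.

```python
import numpy as np

# Hypothetical prior in moment parameterization: theta ~ N(mu0, Sigma0).
mu0 = np.array([0.0, 1.0])
Sigma0 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])        # symmetric, positive-definite

# Natural parameterization, equations (9) and (10).
P0 = np.linalg.inv(Sigma0)             # precision matrix P0 = Sigma0^{-1}
J0 = P0 @ mu0                          # J0 = P0 mu0

# Converting back recovers the moment parameters.
Sigma_back = np.linalg.inv(P0)
mu_back = Sigma_back @ J0
assert np.allclose(Sigma_back, Sigma0) and np.allclose(mu_back, mu0)
```

The payoff of the $(J, P)$ representation appears in the next derivation: conditioning on a new data point simply adds a term to each.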

We want to derive how to update our estimate of $\theta$ given a series of data $D = \{x, y\}$ up to timestep $t$ and a new input $x_{t+1}$. We can calculate $p(\theta \mid D)$ recursively using Bayes' theorem:

$p(\theta \mid D_t) = \frac{1}{Z}\, p(y_t \mid \theta, x_t)\, p(\theta \mid D_{t-1})$   (13)

where we know the likelihood $p(y_t \mid D, \theta) = p(y_t \mid x_t, \theta)$ is given by a Gaussian $\mathcal{N}(\theta^T x_t, \sigma^2)$:

$p(y_t \mid x_t, \theta) = \frac{1}{Z} \exp\left\{ -\frac{(\theta^T x_t - y_t)^2}{2\sigma^2} \right\}$   (14)

The updating rule for $p(\theta \mid D)$ at timestep $t$ is thus a product of two exponential functions. Adding the exponent of the prior to that of the likelihood and collecting terms yields the updates for $J_t$ and $P_t$:

$-\frac{1}{2\sigma^2}\left( \theta^T x_t - y_t \right)^2 + J_{t-1}^T \theta - \frac{1}{2} \theta^T P_{t-1} \theta$   (15)
$= -\frac{1}{2\sigma^2}\left( \theta^T x_t x_t^T \theta - 2 \theta^T x_t y_t + y_t^2 \right) + J_{t-1}^T \theta - \frac{1}{2} \theta^T P_{t-1} \theta$   (16)
$= \left( \frac{y_t x_t^T}{\sigma^2} + J_{t-1}^T \right) \theta - \frac{1}{2} \theta^T \left( \frac{x_t x_t^T}{\sigma^2} + P_{t-1} \right) \theta - \frac{y_t^2}{2\sigma^2}$   (17)

Since this all happens in the exponent of an exponential function, the constant $\frac{y_t^2}{2\sigma^2}$ term can be shifted into the normalizing constant $Z$. Thus, the update rules for $J_t$ and $P_t$ are

$J_t = \frac{y_t x_t}{\sigma^2} + J_{t-1}$   (18)
$P_t = \frac{x_t x_t^T}{\sigma^2} + P_{t-1}$   (19)

A few remarks:

1. In a Gaussian model, a new data point always lowers the variance; this downgrading of the variance does not always make sense.
2. If you believe there are outliers, this model won't work for you.
3. The variance is not a function of $y_t$: the precision is affected only by the input, not the output. This is a consequence of having the same $\sigma$ (observation error) everywhere in space.
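To make the recursive updates (18) and (19) above concrete, here is a minimal numpy sketch; the true parameters, prior, noise level, and number of steps are hypothetical. It processes a stream of $(x_t, y_t)$ pairs in natural parameters and recovers the posterior mean and covariance at the end via $\mu = P^{-1} J$ and $\Sigma = P^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                                    # observation noise std (assumed known)
theta_true = np.array([1.5, -0.7])             # hypothetical ground-truth parameters

# Prior N(mu0, Sigma0) in natural parameters: P = Sigma^{-1}, J = P mu.
P = np.eye(2)                                  # Sigma0 = I, mu0 = 0
J = np.zeros(2)

for t in range(50):
    x = rng.normal(size=2)                     # features at this timestep
    y = theta_true @ x + sigma * rng.normal()  # y_t = theta^T x_t + eps_t
    J = J + y * x / sigma**2                   # update (18)
    P = P + np.outer(x, x) / sigma**2          # update (19)

Sigma_post = np.linalg.inv(P)                  # posterior covariance
mu_post = Sigma_post @ J                       # posterior mean
print("posterior mean:", mu_post)              # close to theta_true
```

Keeping $(J, P)$ rather than $(\mu, \Sigma)$ means each observation contributes a simple additive term, which is exactly the advantage of the natural parameterization in equation (8).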