An Analytical Comparison between Bayes Point Machines and Support Vector Machines

Ashish Kapoor
Massachusetts Institute of Technology, Cambridge, MA

Abstract

This paper analyzes the relationship and the differences between two variants of kernel machines, namely the Bayes Point Machine (BPM) and the Support Vector Machine (SVM). We pose both BPM and SVM as estimation problems in a probabilistic framework. Given training data and a loss function, a posterior probability distribution on the space of functions is induced. The BPM solution is shown to be the mean of this posterior, whereas the SVM solution is shown to be the maximum a posteriori (MAP) solution when using the hinge loss function.

1 Introduction

Kernel machines have been the subject of a great deal of research. The Support Vector Machine (SVM) [1] was inspired by statistical learning theory, whereas the Bayes Point Machine (BPM) is an approximation to Bayesian classification. The support vector machine looks at the learning problem from an optimization perspective, whereas the Bayesian perspective relies on sampling from probability distributions. Despite these differences there seems to be a close relationship between the SVM and the BPM. Herbrich et al. [2] have highlighted some of these similarities. This paper aims to pose BPM and SVM in a single framework and to analyze the similarities and the differences between the two approaches.

The next section provides a very quick overview of the BPM. After that we discuss the SVM as a special case of Tikhonov regularization and give a probabilistic interpretation. In Section 4 we pose both SVM and BPM in a probabilistic framework, which allows us to compare them effectively. We then conclude with future work.

We limit this discussion to two-class classification problems. We denote the data points using bold letters $\mathbf{x}$ and the corresponding class labels by $y$. The training set $D$ comprises tuples, that is, $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where $n$ is the number of training samples.
The classifiers are denoted by $f$ and belong to a fixed hypothesis space $\mathcal{H}$. Further, we restrict ourselves to linear kernels without bias for simplicity, so classifiers are of the type $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$. Finally, $L(y, \hat{y})$ is a loss function that represents the loss incurred in estimating $y$ as $\hat{y}$. Although we limit ourselves to the linear kernel here, the discussion can easily be extended to non-linear kernels by considering the high-dimensional feature space onto which the data points are projected.

Figure 1: Difference between SVM and Bayesian Classification Strategy (figure labels: SVM, Bayes)

2 Bayes Point Machine

This section provides a very brief overview of the Bayes point machine. Readers wishing for more details should consult [2, 3].

2.1 The Bayes Classification Strategy

Given a test point $\mathbf{x}$, the Bayes classification strategy looks at how $\mathbf{x}$ is classified by all possible classifiers $f \in \mathcal{H}$ and weighs each classification according to the posterior $p(f \mid D)$. The following equation depicts the Bayesian classification strategy:

$$\mathrm{Bayes}(\mathbf{x}) = \int_{f \in \mathcal{H}} f(\mathbf{x})\, p(f \mid D)\, df \qquad (1)$$

This classification strategy, which is the Bayesian averaging of linear classifiers, has been proven both theoretically and empirically to be optimal on average in terms of generalization performance [4, 5]. However, the Bayes classification strategy is computationally very demanding; moreover, it is only a strategy and in general may not correspond to any single classifier within the hypothesis space considered.

2.2 Bayes Point

The Bayes point is the classifier that best mimics the Bayesian classification strategy. It was shown elsewhere [2, 5] that under certain mild assumptions the average classifier $\mathbb{E}_{f \mid D}[f]$ converges very quickly to the Bayes point. The Bayes Point Machine is an algorithm that aims at returning the center of mass under the posterior $p(f \mid D)$. We can think of the center of mass as approximating the Bayes classification strategy as follows:

$$\mathrm{Bayes}(\mathbf{x}) = \int_{f \in \mathcal{H}} f(\mathbf{x})\, p(f \mid D)\, df = \mathbb{E}_{f \mid D}[f(\mathbf{x})] \approx \mathbb{E}_{f \mid D}[f](\mathbf{x}) \qquad (2)$$

Hence, the Bayes point machine returns the average classifier $\mathbb{E}_{f \mid D}[f]$, which very closely approximates the Bayesian classification strategy. Figure 1 shows the difference between support vector machine classification and the Bayesian classification strategy. The Bayes point machine approximates a vote between all linear separators [2, 4], whereas the support vector machine aims to maximize the margin [1].

Computing the center of mass is a difficult task, and many authors have used sampling methods to recover the mean of the posterior [2, 3]. To compute the center of mass we can write the posterior $p(f \mid D)$ as:

$$p(f \mid D) \propto p(D \mid f) \cdot p(f) \qquad (3)$$

Here $p(f)$ represents the prior on the space of possible classifiers, and most authors have restricted themselves to a uniform prior. Since our classification function is of the form $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$, only the direction vector $\mathbf{w}$ characterizes the function. Further, as the magnitude of $\mathbf{w}$ is irrelevant for classification, we can just look at all the $\mathbf{w}$'s of a fixed length. As there is no reason for us to prefer any single classifier prior to looking at the data, our prior is uniform over all $\mathbf{w}$ of a fixed length. This distribution is fair to all the classifiers.

Now, $p(D \mid f)$ in equation (3) is the likelihood of the training data given $f$, and one of the possible forms is:

$$p(D \mid f) = \prod_{(\mathbf{x}_i, y_i) \in D} L_{0\text{-}1}(y_i, f(\mathbf{x}_i)) \qquad (4)$$

where $L_{0\text{-}1}(\cdot, \cdot)$ is the zero-one loss, written here as an indicator of a correct prediction:

$$L_{0\text{-}1}(y, \hat{y}) = \begin{cases} 1 & \text{if } \hat{y} = y \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

Figure 2: SVM and BPM in the Version Space, from PhD Thesis, Tom Minka [4]
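As a rough illustration of the center-of-mass computation described above, the following sketch approximates the Bayes point by naive rejection sampling: directions are drawn uniformly on the unit sphere (the uniform prior), those that misclassify any training point are discarded (the 0-1 likelihood of equation (4)), and the survivors are averaged. The toy data set, the labels in {-1, +1}, the function names and the choice of rejection sampling are all assumptions made for illustration; this is not the sampler used in [2, 3].

```python
import math
import random


def sample_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def classify(w, x):
    """Linear classifier without bias: f(x) = sign(w . x)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def in_version_space(w, data):
    """0-1 likelihood: True (1) if w classifies every training point correctly."""
    return all(classify(w, x) == y for x, y in data)


def bayes_point(data, dim, n_samples=20000, seed=0):
    """Approximate the center of mass of the version space by rejection sampling."""
    rng = random.Random(seed)
    candidates = (sample_unit_vector(dim, rng) for _ in range(n_samples))
    accepted = [w for w in candidates if in_version_space(w, data)]
    if not accepted:
        raise ValueError("no sampled classifier separates the data")
    mean = [sum(w[k] for w in accepted) / len(accepted) for k in range(dim)]
    return mean, accepted


def bayes_vote(accepted, x):
    """Monte Carlo version of equation (1): a vote over the accepted classifiers."""
    return 1 if sum(classify(w, x) for w in accepted) >= 0 else -1


# Hypothetical, linearly separable toy data: (point, label) with labels in {-1, +1}.
train = [((1.0, 0.2), 1), ((0.8, 1.0), 1), ((-1.0, -0.3), -1), ((-0.5, -1.0), -1)]
w_bp, samples = bayes_point(train, dim=2)
```

Because the version space is the intersection of half-spaces, the average of the accepted directions again separates the training data, so `w_bp` can be used as a single classifier in place of the full vote `bayes_vote(samples, x)`.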

The likelihood in equation (4) is 1 if $f$ perfectly classifies the training data and 0 otherwise. Under this likelihood the posterior assigns equal non-zero probability to all classifiers of a fixed length that perfectly separate the data. The Bayes point machine is expected to return the average of all these classifiers.

We can also consider a graphical interpretation of the BPM. In the feature space the data points are plotted as points and the classifiers as hyperplanes. Conversely, we can consider a parameter space, where the classifiers are plotted as points and the data points as hyperplanes. In the parameter space the set of classifiers that correctly classify all the data points forms a convex set, called the version space, bounded by the hyperplanes corresponding to the data points. Limiting our classifiers to a fixed length $l$ corresponds to looking at a sphere of radius $l$ in the parameter space. Under a uniform prior and 0-1 loss the BPM returns the center of mass of the version space. Figure 2 (from [4]) shows this interpretation.

3 Support Vector Machines

Classifiers based on the support vector machine (SVM) perform binary classification by first projecting the data points into a high-dimensional feature space and then using a hyperplane that is maximally separated from the nearest positive and negative data points [1]. In this discussion we restrict ourselves to linear kernels, but the discussion extends to non-linear kernels by considering the high-dimensional feature space onto which the data points are projected. For a linear kernel without bias (i.e. $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$), the quadratic programming problem for an SVM can be written as follows:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|\mathbf{w}\|^2$$

subject to:

$$y_i f(\mathbf{x}_i) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad \text{for } i = 1, \ldots, n$$

Here $C$ is a user-specified constant. This can be rewritten as follows:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \; C \sum_{i=1}^{n} \left(1 - y_i f(\mathbf{x}_i)\right)_+ + \frac{1}{2} \|\mathbf{w}\|^2 \qquad (6)$$

where:

$$(x)_+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

As mentioned in Herbrich et al. [2], the support vector machine classifier can be thought of as the center of the largest ball that can be inscribed in the version space. Figure 2 (from [4]) shows this graphical interpretation.

Further, Evgeniou et al. [6] have shown that SVM classification is an instance of the more general Tikhonov regularization. Tikhonov regularization is a general approach to learning, where the aim is to find a function $f$ that minimizes the training error while simultaneously attempting to minimize its norm in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$. Tikhonov regularization can be written as:

$$f^* = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_K^2 \qquad (8)$$

Here, $L(y, f(\mathbf{x}))$ is the loss function as defined earlier, $\lambda$ is the user-specified regularization parameter and $\|f\|_K^2$ is the norm in the RKHS defined by the positive definite kernel function $K$. The first term in the Tikhonov regularization denotes the empirical error and the second term denotes the complexity of our solution. Tikhonov regularization represents the trade-off in choosing functions that are not only simple but also represent the training data well. The regularization parameter $\lambda$ adjusts the preference for simple functions relative to the preference for functions that best fit the data.

By changing the form of the loss function $L(y, f(\mathbf{x}))$ and the kernel $K$, a number of popular classification and regression schemes can be derived. Evgeniou et al. [6] have discussed standard regularization networks, SVM classification and SVM regression in detail as different cases of Tikhonov regularization that arise from different choices of $L$ and $K$. For example, algorithms for standard regularization networks can be derived by using a squared loss function, i.e. $L(y_i, f(\mathbf{x}_i)) = (y_i - f(\mathbf{x}_i))^2$. Further, in this discussion we have restricted ourselves to linear kernels, which implies that $\|f\|_K^2 = \|\mathbf{w}\|^2$. With the hinge loss, multiplying the objective in (8) by $1/(2\lambda)$ gives exactly the objective in (6) with $C = 1/(2\lambda n)$, so the two problems share the same minimizer. For details please refer to [2, 6].

3.1 Maximum A Posteriori Interpretation of SVM Classification

The solution to the optimization problem for SVM classification (and, more generally, for Tikhonov regularization) can be interpreted as the mode of a posterior probability distribution. Consider the interpretation of the empirical loss $\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i))$ and the stabilizer $\|f\|_K^2$ as:

$$p(f) \propto e^{-\lambda \|f\|_K^2} \qquad (9)$$

$$p(D \mid f) \propto e^{-\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i))} \qquad (10)$$

$p(f)$ denotes the prior probability; under this interpretation, functions with a small norm are more likely than functions with a larger norm. $p(D \mid f)$ denotes the likelihood of the data and is often referred to as the noise model. The product of (9) and (10) is proportional to the exponential of minus the objective in (8), so the function that maximizes the posterior is the one that minimizes (8). For standard regularization networks the noise model is Gaussian, and for SVM regression it is a mixture of Gaussians. Readers are referred to Girosi et al. [7] and Pontil et al. [8] for more details.
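To make the MAP reading concrete, the sketch below treats the SVM objective (6) as the negative logarithm of an unnormalized posterior (hinge-loss likelihood times Gaussian prior) and searches for its mode by plain subgradient descent. The toy data, the step-size schedule `1/t`, the function names and the choice `C = 1` are assumptions for illustration only; a practical SVM would use a dedicated quadratic programming solver.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def neg_log_posterior(w, data, C=1.0):
    """Objective (6): C * sum_i (1 - y_i w.x_i)_+ + 0.5 * ||w||^2.

    Up to an additive constant this is minus the log of the posterior obtained
    from a hinge-loss likelihood and a zero-mean spherical Gaussian prior.
    """
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in data)
    return C * hinge + 0.5 * dot(w, w)


def map_estimate(data, dim, C=1.0, steps=2000):
    """Subgradient descent with step size 1/t, returning the best iterate seen."""
    w = [0.0] * dim
    best_w, best_val = list(w), neg_log_posterior(w, data, C)
    for t in range(1, steps + 1):
        grad = list(w)  # gradient of the prior term 0.5 * ||w||^2
        for x, y in data:
            if y * dot(w, x) < 1.0:  # margin violated: the hinge term is active
                for k in range(dim):
                    grad[k] -= C * y * x[k]
        w = [wk - gk / t for wk, gk in zip(w, grad)]
        val = neg_log_posterior(w, data, C)
        if val < best_val:
            best_w, best_val = list(w), val
    return best_w, best_val


# Hypothetical, linearly separable toy data with labels in {-1, +1}.
train = [((1.0, 0.2), 1), ((0.8, 1.0), 1), ((-1.0, -0.3), -1), ((-0.5, -1.0), -1)]
w_map, objective = map_estimate(train, dim=2)
```

The returned `w_map` is an approximate mode of the posterior; because the objective is convex, any further decrease of `objective` can only move the estimate closer to the unique SVM solution.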
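It is clear that, given the quadratic programming problem (6) corresponding to the SVM, we can interpret its solution as the mode of the probability distribution:

$$p(f \mid D) \propto p(D \mid f) \cdot p(f) \qquad (11)$$

$$\propto e^{-C \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i))} \cdot e^{-\frac{1}{2} \|f\|_K^2} \qquad (12)$$

$$= e^{-\left[ C \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)) + \frac{1}{2} \|f\|_K^2 \right]} \qquad (13)$$

$$= e^{-\left[ C \sum_{i=1}^{n} \left(1 - y_i f(\mathbf{x}_i)\right)_+ + \frac{1}{2} \|\mathbf{w}\|^2 \right]} \qquad (14)$$

where $L$ is the hinge loss and the constants $C$ and $\frac{1}{2}$ are those of (6). Maximizing (14) over $\mathbf{w}$ is the same as minimizing the objective in (6).

4 BPMs and SVMs

As shown earlier, the Bayes point machine returns the mean of the posterior $p(f \mid D)$. In the last section we showed that the support vector machine solution is the mode of a posterior. The posteriors $p(f \mid D)$ for SVMs and BPMs are not the same, and in this section we analyze the differences as well as the similarities.

4.1 The Priors

Many authors working with BPMs restrict themselves to classifiers $\mathbf{w}$ of fixed equal length and assign a uniform prior over all directions. As mentioned in [4], this can also be achieved using a zero-mean spherical Gaussian distribution as the prior. That is,

$$p(f) \propto e^{-\frac{1}{2} \|\mathbf{w}\|^2} = e^{-\frac{1}{2} \|f\|_K^2}$$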
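The use of a spherical Gaussian in place of a uniform prior on a sphere rests on rotational symmetry: the direction $\mathbf{w} / \|\mathbf{w}\|$ of a Gaussian draw is uniformly distributed. A minimal numerical check in two dimensions is sketched below; the sample size, bin count, seed and function name are arbitrary choices made for illustration.

```python
import math
import random


def direction_histogram(n_samples=40000, n_bins=8, seed=0):
    """Draw w ~ N(0, I) in 2-D, keep only its direction, and histogram the angle."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_samples):
        w = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        angle = math.atan2(w[1], w[0]) % (2.0 * math.pi)  # angle of w in [0, 2*pi)
        index = min(int(angle / (2.0 * math.pi) * n_bins), n_bins - 1)
        counts[index] += 1
    return [c / n_samples for c in counts]


# Each angular sector should receive roughly 1 / n_bins of the draws.
fractions = direction_histogram()
```

Since only the direction of $\mathbf{w}$ matters for classification, every direction receives the same prior weight regardless of the length of the sampled vector.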

Table 1: Bayes Point Machines and Support Vector Machines. Given the posterior $p(f \mid D) \propto p(D \mid f)\, p(f)$:

| Classification Type | Cost function $L(y, f(\mathbf{x}))$ | Computation Criteria |
|---|---|---|
| Bayes Point Machine | Can be any reasonable cost function, e.g. the 0-1 loss | Mean of $p(f \mid D)$, i.e. $\mathbb{E}_{f \mid D}[f]$ |
| Tikhonov Regularization | Can be any reasonable cost function | Mode of $p(f \mid D)$ |
| SVM Classification | $L(y, f(\mathbf{x})) = (1 - y f(\mathbf{x}))_+$ | $f^* = \arg\max_f p(f \mid D)$ |
| Regularization Network | $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$ | $f^* = \arg\max_f p(f \mid D)$ |

This prior assigns a uniform distribution to all the classifiers $\mathbf{w}$ lying on a sphere of a fixed radius. The beauty of this prior is that it allows us to drop the restriction that all our classifiers are of the same length. Further, this prior is exactly the prior used to compute the posterior in the SVM case.

4.2 The Likelihood

Much of the BPM literature focuses on the 0-1 loss with the likelihood $p(D \mid f)$ given in equation (4). The SVM, on the other hand, has the likelihood $p(D \mid f) \propto e^{-C \sum_{i=1}^{n} (1 - y_i f(\mathbf{x}_i))_+}$ with the hinge loss, i.e. $L(y, \hat{y}) = (1 - y \hat{y})_+$.

Using a hard 0-1 loss in the BPM corresponds to focusing only on the region that perfectly classifies the training data. We can refine this hard 0-1 loss to admit the possibility of error by using linear slack. We can use the following as our loss function:

$$L(y_i, f(\mathbf{x}_i)) = C \cdot \left(1 - y_i f(\mathbf{x}_i)\right)_+ \qquad (15)$$

Given this loss function, we can write the likelihood for the BPM as

$$p(D \mid f) \propto e^{-C \sum_{i=1}^{n} \left(1 - y_i f(\mathbf{x}_i)\right)_+} \qquad (16)$$

Here, $C$ is a constant that determines how hard the boundaries are. We recover the likelihood given in equation (4) when the constant $C$ tends to infinity.

4.3 Main Result

The main result is shown in Table 1, and we can state the following:

• The solution obtained using the BPM is the mean of the posterior $p(f \mid D)$. One of the possible likelihoods for the BPM is given by $p(D \mid f) \propto e^{-C \sum_{i=1}^{n} (1 - y_i f(\mathbf{x}_i))_+}$.

• The solution obtained by the SVM, on the other hand, is the mode of the posterior, with the likelihood function given exactly by $p(D \mid f) \propto e^{-C \sum_{i=1}^{n} (1 - y_i f(\mathbf{x}_i))_+}$.

• The prior $p(f) \propto e^{-\frac{1}{2} \|\mathbf{w}\|^2}$ is the same for both the BPM and the SVM.

5 Conclusion and Future Work

We have posed the SVM and the BPM in a probabilistic framework. Under this framework, the BPM finds the mean of the posterior distribution over functions, whereas the SVM finds the MAP estimate when using the hinge loss function. The question of which is better therefore boils down to the choice of mean versus mode and the choice of loss function. The BPM has the advantage that it usually works with the 0-1 loss, which is a natural choice of loss function for classification. It has been proved elsewhere that the BPM converges to the Bayes point, which is the projection of the Bayesian classification strategy. The SVM, on the other hand, has been shown to work very well in many applications and is much easier to compute than the BPM.

The open questions include the performance of a classifier that is the mean of the posterior when using a hinge loss. Further, no one has yet answered the questions regarding stability, consistency and convergence of the BPM. Also, it has been shown that the mean of the posterior converges to the Bayes point under mild assumptions, so an interesting question is how the mode relates to the Bayes point.

Acknowledgments

Thanks to Tom Minka for the Bayes Point Machine code and to Sayan Mukherjee, Yuan Qi and Rosalind W. Picard for insightful discussions.

References

[1] Christopher J. C. Burges. A tutorial on support vector machines for pattern classification. Data Mining and Knowledge Discovery, 2(2).
[2] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines. Journal of Machine Learning Research, 1.
[3] P. Rujan. Playing billiards in version space. Neural Computation, 9:99-122.
[4] Thomas P. Minka. Chapter 5: A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology.
[5] T. Watkin. Optimal learning with a neural network. Europhysics Letters, 21.
[6] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50.
[7] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7.
[8] M. Pontil, S. Mukherjee, and F. Girosi. On noise model of support vector machine regression. A.I. Memo 1651, Massachusetts Institute of Technology, A.I. Lab, October 1998.


More information

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Steven Bergner, Chris Demwell Lecture notes for Cmpt 882 Machine Learning February 19, 2004 Abstract In these notes, a

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Short Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning

Short Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning Short Course Robust Optimization and 3. Optimization in Supervised EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 Outline Overview of Supervised models and variants

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

Chapter 6 Classification and Prediction (2)

Chapter 6 Classification and Prediction (2) Chapter 6 Classification and Prediction (2) Outline Classification and Prediction Decision Tree Naïve Bayes Classifier Support Vector Machines (SVM) K-nearest Neighbors Accuracy and Error Measures Feature

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 Some images from this lecture are

More information

Neutron inverse kinetics via Gaussian Processes

Neutron inverse kinetics via Gaussian Processes Neutron inverse kinetics via Gaussian Processes P. Picca Politecnico di Torino, Torino, Italy R. Furfaro University of Arizona, Tucson, Arizona Outline Introduction Review of inverse kinetics techniques

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

ECE-271B. Nuno Vasconcelos ECE Department, UCSD

ECE-271B. Nuno Vasconcelos ECE Department, UCSD ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative

More information

Overview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation

Overview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Overview Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Probabilistic Interpretation: Linear Regression Assume output y is generated

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Lecture 3: Multiclass Classification

Lecture 3: Multiclass Classification Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen This Lecture: Advanced Machine Learning Regression

More information

Lecture 10: Support Vector Machine and Large Margin Classifier

Lecture 10: Support Vector Machine and Large Margin Classifier Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail:

More information

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2 Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal

More information

SVMs, Duality and the Kernel Trick

SVMs, Duality and the Kernel Trick SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today

More information

LECTURE 7 Support vector machines

LECTURE 7 Support vector machines LECTURE 7 Support vector machines SVMs have been used in a multitude of applications and are one of the most popular machine learning algorithms. We will derive the SVM algorithm from two perspectives:

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Infinite Ensemble Learning with Support Vector Machinery

Infinite Ensemble Learning with Support Vector Machinery Infinite Ensemble Learning with Support Vector Machinery Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology ECML/PKDD, October 4, 2005 H.-T. Lin and L. Li (Learning Systems

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology SE lecture revision 2013 Outline 1. Bayesian classification

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Slides adapted from Jordan Boyd-Graber Machine Learning: Chenhao Tan Boulder 1 of 39 Recap Supervised learning Previously: KNN, naïve

More information

Data Mining Part 4. Prediction

Data Mining Part 4. Prediction Data Mining Part 4. Prediction 4.3. Fall 2009 Instructor: Dr. Masoud Yaghini Outline Introduction Bayes Theorem Naïve References Introduction Bayesian classifiers A statistical classifiers Introduction

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information