An Analytical Comparison between Bayes Point Machines and Support Vector Machines
Ashish Kapoor
Massachusetts Institute of Technology, Cambridge, MA

Abstract

This paper analyzes the relationship and the differences between two variants of kernel machines, namely the Bayes Point Machine (BPM) and the Support Vector Machine (SVM). We pose both the BPM and the SVM as estimation problems in a probabilistic framework. Given training data and a loss function, a posterior probability distribution on the space of functions is induced. The BPM solution is shown to be the mean of this posterior, whereas the SVM solution is shown to be the Maximum A Posteriori (MAP) solution when using the hinge loss function.

1 Introduction

There has been a great deal of research directed at kernel machines. The Support Vector Machine (SVM) [1] is inspired by statistical learning theory, whereas the Bayes Point Machine (BPM) is a Bayesian approximation to classification. The SVM looks at the learning problem from an optimization perspective, whereas the Bayesian perspective relies on sampling from probability distributions. Despite these differences there seems to be a close relationship between the SVM and the BPM. Herbrich et al. [2] have highlighted some of these similarities. This paper aims to pose the BPM and the SVM in one single framework and to analyze the similarities and the differences between the two approaches.

The next section provides a very quick overview of the BPM. Following that, we discuss the SVM as a special case of Tikhonov regularization and give it a probabilistic interpretation. In Section 4 we pose both the SVM and the BPM in a probabilistic framework, which allows us to compare them effectively. We then conclude with future work.

We limit this discussion to two-class classification problems. We denote the data points by $\mathbf{x} \in \mathbb{R}^d$ and the corresponding class labels by $y \in \{-1, +1\}$. The training set comprises $n$ tuples, that is, $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $n$ is the number of training samples.
The classifiers are denoted by $f$ and belong to a fixed hypothesis space $\mathcal{H}$. Further, we restrict ourselves to linear kernels without bias for simplicity, so classifiers are of the form $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$. Finally, $\ell(y, \hat{y})$ is a loss function that represents the loss incurred in estimating $y$ as $\hat{y}$. Though we limit ourselves here to the linear kernel, the discussion extends easily to non-linear kernels by considering the high-dimensional feature space onto which the data points are projected.
[Figure 1: Difference between the SVM and the Bayesian classification strategy.]

2 Bayes Point Machine

This section provides a very brief overview of the Bayes Point Machine. Readers wishing for more details should look into [2, 3].

2.1 The Bayes Classification Strategy

Given a test point $\mathbf{x}$, the Bayes classification strategy looks at how $\mathbf{x}$ is classified by all the possible classifiers $f \in \mathcal{H}$ and weighs each vote according to the posterior $p(f \mid D)$. The following equation depicts the Bayesian classification strategy:

$h_{\mathrm{Bayes}}(\mathbf{x}) = \mathrm{sign}\left( \int_{\mathcal{H}} f(\mathbf{x})\, p(f \mid D)\, df \right) \quad (1)$

This classification strategy, which is the Bayesian averaging of linear classifiers, has been proven both theoretically and empirically optimal on average in terms of generalization performance [4, 5]. But the Bayes classification strategy is computationally very demanding, and further, it is just a strategy and in general may not correspond to any single classifier within the hypothesis space considered.

2.2 Bayes Point

The Bayes point is the classifier that best mimics the Bayes classification strategy. It was shown elsewhere [2, 5] that under certain mild assumptions the average classifier $\mathbf{w}_{cm} = \mathbb{E}_{p(\mathbf{w} \mid D)}[\mathbf{w}]$ converges very quickly to the Bayes point. The Bayes Point Machine is an algorithm that aims at returning this center of mass under the posterior. We can think of the center of mass approximating the Bayes classification strategy as follows:
$h_{\mathrm{Bayes}}(\mathbf{x}) = \mathrm{sign}\left( \int_{\mathcal{H}} f(\mathbf{x})\, p(f \mid D)\, df \right) = \mathrm{sign}\left( \mathbb{E}_{p(\mathbf{w} \mid D)}[\mathrm{sign}(\mathbf{w} \cdot \mathbf{x})] \right) \approx \mathrm{sign}(\mathbf{w}_{cm} \cdot \mathbf{x}) \quad (2)$

Hence, the Bayes Point Machine returns the average classifier $\mathbf{w}_{cm}$, which very closely approximates the Bayesian classification strategy. Figure 1 shows the difference between support vector machine classification and the Bayesian classification strategy: the Bayes Point Machine approximates a vote among all linear separators [2, 4], whereas the support vector machine aims to maximize the margin [1].

Computing the center of mass is a difficult task, and many authors have used sampling methods to recover the mean of the posterior [2, 3]. To compute the center of mass we can write the posterior $p(\mathbf{w} \mid D)$ as:

$p(\mathbf{w} \mid D) \propto p(D \mid \mathbf{w})\, p(\mathbf{w}) \quad (3)$

Here $p(\mathbf{w})$ represents the prior on the space of possible classifiers, and most authors have restricted it to a uniform prior. Since our classification function is of the form $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$, only the direction vector $\mathbf{w}$ characterizes the function. Further, as the magnitude of $\mathbf{w}$ is irrelevant for classification, we can just look at all $\mathbf{w}$ of a fixed length. As there is no reason for us to prefer any single classifier prior to looking at the data, our prior is uniform over all $\mathbf{w}$ of a fixed length; this distribution is fair to all the classifiers.

Now, $p(D \mid \mathbf{w})$ in equation (3) is the likelihood of the training data given $\mathbf{w}$, and one of its possible forms is:

$p(D \mid \mathbf{w}) = \prod_{i=1}^{n} \big(1 - \ell_{0\text{-}1}(y_i, f(\mathbf{x}_i))\big) \quad (4)$

where $\ell_{0\text{-}1}(y, \hat{y})$ is the zero-one loss, defined as:

$\ell_{0\text{-}1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases} \quad (5)$

[Figure 2: SVM and BPM in the version space, from the PhD thesis of Tom Minka [4].]
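As an illustration of the 0-1 likelihood and the center-of-mass computation (this sketch is not from the paper; the toy data set and the sample size are invented for the example), one can approximate the Bayes point by sampling directions uniformly from the unit sphere, keeping only those inside the version space, and averaging:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented, linearly separable toy data (labels consistent with w* = (1, 1)).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

def sample_unit_vectors(n, d, rng):
    """Draw n directions uniformly from the unit sphere in R^d (the uniform prior)."""
    w = rng.standard_normal((n, d))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

# 0-1 likelihood of equation (4): keep only classifiers that separate the
# training data perfectly, i.e. the samples that land inside the version space.
W = sample_unit_vectors(20000, X.shape[1], rng)
consistent = W[np.all(np.sign(W @ X.T) == y, axis=1)]

# The BPM estimate is the center of mass of the surviving samples.
w_bp = consistent.mean(axis=0)
w_bp /= np.linalg.norm(w_bp)
```

With enough samples the averaged direction approaches the center of mass of the version space; rejection sampling is used here only for clarity, while the works cited above use more efficient sampling schemes.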
The likelihood given in equation (4) is 1 if $\mathbf{w}$ perfectly classifies the training data and zero otherwise. Under this likelihood the posterior assigns equal non-zero probability to all classifiers of a fixed length that perfectly separate the data, and the Bayes Point Machine is expected to return the average of all these classifiers.

We can also consider a graphical interpretation of the BPM. In the feature space the data points are plotted as points and the classifiers as hyperplanes. Alternatively, we can consider a parameter space, in which the classifiers are plotted as points and the data points as hyperplanes. In the parameter space the set of classifiers that correctly classify all the data points forms a convex set, called the version space, bounded by the hyperplanes corresponding to the data points. Limiting our classifiers to a fixed length $r$ corresponds to looking at a sphere of radius $r$ in the parameter space. Under a uniform prior and 0-1 loss, the BPM returns the center of mass of the version space. Figure 2 (from [4]) shows this interpretation.

3 Support Vector Machines

Classifiers based on the support vector machine (SVM) perform binary classification by first projecting the data points into a high-dimensional feature space and then using a hyperplane that is maximally separated from the nearest positive and negative data points [1]. In this discussion we have restricted ourselves to linear kernels, but everything extends easily to non-linear kernels by considering the high-dimensional feature space onto which the data points are projected. For a linear kernel without bias (i.e., $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$), the quadratic programming problem for an SVM can be written as follows:

$\mathbf{w} = \arg\min_{\mathbf{w}, \boldsymbol{\xi}} \; C \sum_{i=1}^{n} \xi_i + \|\mathbf{w}\|^2$

subject to: $y_i (\mathbf{w} \cdot \mathbf{x}_i) \ge 1 - \xi_i$ and $\xi_i \ge 0$, for $i = 1, \ldots, n$.

Here $C$ is a user-specified constant. This can be rewritten as:

$\mathbf{w} = \arg\min_{\mathbf{w}} \; C \sum_{i=1}^{n} \big(1 - y_i(\mathbf{w} \cdot \mathbf{x}_i)\big)_{+} + \|\mathbf{w}\|^2 \quad (6)$
where $(z)_{+}$ is defined as:

$(z)_{+} = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases} \quad (7)$

As mentioned in Herbrich et al. [2], the support vector machine classifier can be thought of as the center of the maximally inscribable ball in the version space. Figure 2 (from [4]) shows this graphical interpretation. Further, Evgeniou et al. [6] have shown that SVM classification is an instance of the more general Tikhonov regularization. Tikhonov regularization is a general approach to learning in which the aim is to find a function that minimizes the training error while simultaneously attempting to minimize its norm in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$. The Tikhonov regularization problem can be written as:

$f = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_K^2 \quad (8)$

Here $\ell(y, f(\mathbf{x}))$ is the loss function as defined earlier, $\lambda$ is the user-specified regularization parameter, and $\|\cdot\|_K$ is the norm in the RKHS defined by the positive definite kernel function
$K$. The first term in the Tikhonov functional denotes the empirical error and the second term denotes the complexity of the solution. Tikhonov regularization thus represents the trade-off involved in choosing functions that are not only simple but also represent the training data well; $\lambda$ is the regularization parameter that adjusts the preference for simple functions over functions that best fit the data. By changing the form of the loss function $\ell(y, f(\mathbf{x}))$ and the kernel $K$, a number of popular classification and regression schemes can be derived. Evgeniou et al. [6] have discussed standard regularization networks, SVM classification and SVM regression in detail as different cases of Tikhonov regularization that arise from different choices of $\ell$ and $K$. For example, the standard regularization network is obtained by using a squared loss function, i.e., $\ell(y_i, f(\mathbf{x}_i)) = (y_i - f(\mathbf{x}_i))^2$. Further, in this discussion we have restricted ourselves to linear kernels, which implies that $\|f\|_K^2 = \|\mathbf{w}\|^2$. For details please refer to [2, 6].

3.1 Maximum A Posteriori Interpretation of SVM Classification

The solution to the optimization problem for SVM classification (and, more generally, Tikhonov regularization) can be interpreted as the mode of a posterior probability distribution. Consider the interpretation of the empirical loss $\sum_{i=1}^{n} \ell(y_i, f(\mathbf{x}_i))$ and the stabilizer $\|f\|_K^2$ as:

$p(f) \propto e^{-\|f\|_K^2} \quad (9)$

$p(D \mid f) \propto e^{-\frac{1}{\lambda n} \sum_{i=1}^{n} \ell(y_i, f(\mathbf{x}_i))} \quad (10)$

$p(f)$ denotes the prior probability; under this interpretation it says that functions with a small norm are more likely than functions with a larger norm. $p(D \mid f)$ denotes the likelihood of the data and is often referred to as the noise model. For the standard regularization network the noise model is Gaussian, and for SVM regression it is a mixture of Gaussians. Readers are referred to Girosi et al. [7] and Pontil et al. [8] for more details.
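To make the MAP interpretation concrete, here is a small sketch (not from the paper; the toy data, the value of $\lambda$, and the iteration count are invented) that finds the posterior mode for the hinge loss by subgradient descent on the Tikhonov objective of equation (8), with $\|f\|_K^2 = \|\mathbf{w}\|^2$ for the linear kernel:

```python
import numpy as np

# Invented toy data; w is the linear classifier's direction vector.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

lam = 0.1          # regularization parameter lambda (chosen arbitrarily)
w = np.zeros(2)

# Subgradient descent on (1/n) sum_i (1 - y_i w.x_i)_+ + lam * ||w||^2.
# Maximizing the posterior exp(-(1/(lam*n)) sum_i loss - ||w||^2) is the
# same problem up to a positive rescaling of the objective.
for t in range(1, 2001):
    margins = y * (X @ w)
    active = margins < 1                      # points with non-zero hinge loss
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(X) + 2 * lam * w
    w -= grad / (lam * t)                     # 1/(lam*t) decreasing step size
```

The resulting $\mathbf{w}$ is the MAP classifier under the hinge loss, which coincides with the SVM solution of equation (6) for a corresponding choice of $C$.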
Given the quadratic programming problem corresponding to the SVM, it is clear that we can interpret its solution as the mode of the probability distribution:

$p(\mathbf{w} \mid D) \propto p(D \mid \mathbf{w})\, p(\mathbf{w}) \quad (11)$

$\propto e^{-\frac{1}{\lambda n} \sum_{i=1}^{n} \ell(y_i, \mathbf{w} \cdot \mathbf{x}_i)} \cdot e^{-\|\mathbf{w}\|^2} \quad (12)$

$= e^{-\frac{1}{\lambda n} \sum_{i=1}^{n} \ell(y_i, \mathbf{w} \cdot \mathbf{x}_i) - \|\mathbf{w}\|^2} \quad (13)$

4 BPMs and SVMs

As shown earlier, the Bayes Point Machine is the mean of the posterior $p(\mathbf{w} \mid D)$. In the last section we showed that the support vector machine solution is the mode of the posterior. The posteriors $p(\mathbf{w} \mid D)$ for the SVM and the BPM are not the same, and in this section we analyze the differences as well as the similarities.

4.1 The Priors

Many authors working with the BPM restrict to classifiers $\mathbf{w}$ of a fixed length and assign a uniform prior over all directions. As mentioned in [4], this can also be achieved using a zero-mean spherical Gaussian distribution as the prior, that is,

$p(\mathbf{w}) \propto e^{-\|\mathbf{w}\|^2} \quad (14)$
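The mean-versus-mode distinction at the heart of this section can be made concrete numerically. The following sketch (illustrative only; the data set and the likelihood "hardness" constant are invented) discretizes the unit-length linear classifiers in two dimensions, forms a hinge-loss posterior over directions, and reads off both the BPM-style posterior mean and the SVM-style posterior mode:

```python
import numpy as np

# Invented, linearly separable toy data.
X = np.array([[1.0, 0.2], [0.6, 1.0], [-1.0, -0.1]])
y = np.array([1.0, 1.0, -1.0])

C = 5.0  # likelihood hardness; C -> infinity recovers the hard 0-1 likelihood

# Discretize all unit-length classifiers w(theta) = (cos theta, sin theta).
thetas = np.linspace(-np.pi, np.pi, 100000, endpoint=False)
W = np.column_stack([np.cos(thetas), np.sin(thetas)])

# Unnormalized hinge-loss posterior over directions (uniform prior on the circle).
hinge = np.maximum(0.0, 1.0 - y * (W @ X.T)).sum(axis=1)
post = np.exp(-C * hinge)
post /= post.sum()

w_map = W[np.argmax(post)]        # SVM-style answer: the posterior mode
w_mean = post @ W                 # BPM-style answer: the posterior mean
w_mean /= np.linalg.norm(w_mean)  # project back to unit length
```

Both directions separate the toy data, but the mean integrates over the whole version space while the mode sits at the single most probable classifier; in general the two need not coincide.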
Table 1: Bayes Point Machines and Support Vector Machines, given the training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$.

  Classification Type     | Cost function $\ell(y, \mathbf{w} \cdot \mathbf{x})$ | Computation criterion
  Bayes Point Machine     | any reasonable cost function, e.g. the 0-1 loss      | mean of $p(\mathbf{w} \mid D)$, i.e. $\mathbb{E}_{p(\mathbf{w} \mid D)}[\mathbf{w}]$
  Tikhonov Regularization | any reasonable cost function                         | mode of $p(\mathbf{w} \mid D)$, i.e. $\arg\max_{\mathbf{w}} p(\mathbf{w} \mid D)$
  SVM Classification      | hinge loss $(1 - y(\mathbf{w} \cdot \mathbf{x}))_+$  | mode of $p(\mathbf{w} \mid D)$
  Regularization Network  | squared loss $(y - \mathbf{w} \cdot \mathbf{x})^2$   | mode of $p(\mathbf{w} \mid D)$

The spherical Gaussian prior of equation (14) assigns a uniform distribution to all classifiers $\mathbf{w}$ lying on a sphere of any fixed radius. The beauty of this prior is that it allows us to drop the restriction that all classifiers have the same length. Further, it is exactly the prior used to compute the posterior in the SVM case.

4.2 The Likelihood

Much of the BPM literature focuses on the 0-1 loss, with the likelihood $p(D \mid \mathbf{w})$ given in equation (4). The SVM, on the other hand, has the likelihood

$p(D \mid \mathbf{w}) \propto e^{-\frac{1}{\lambda n} \sum_{i=1}^{n} (1 - y_i(\mathbf{w} \cdot \mathbf{x}_i))_+}$

corresponding to the hinge loss, i.e., $\ell(y, t) = (1 - yt)_+$. Using a hard 0-1 loss in the BPM corresponds to focusing only on the region that perfectly classifies the training data. We can refine this hard 0-1 loss to admit the possibility of error by using a linear slack. We can use the following loss function:

$\ell(y_i, \mathbf{w} \cdot \mathbf{x}_i) = C\,(1 - y_i(\mathbf{w} \cdot \mathbf{x}_i))_+ \quad (15)$

Given this loss function, we can write the likelihood for the BPM as

$p(D \mid \mathbf{w}) \propto e^{-C \sum_{i=1}^{n} (1 - y_i(\mathbf{w} \cdot \mathbf{x}_i))_+} \quad (16)$

Here $C$ is a constant that determines how hard the boundaries are; we recover the likelihood given in equation (4) when $C$ tends to infinity.

4.3 Main Result

The main result is summarized in Table 1, and we can state the following:

- The solution obtained by the BPM is the mean of the posterior $p(\mathbf{w} \mid D)$. One possible likelihood for the BPM is $p(D \mid \mathbf{w}) \propto e^{-C \sum_{i} (1 - y_i(\mathbf{w} \cdot \mathbf{x}_i))_+}$.
- The solution obtained by the SVM, on the other hand, is the mode of the posterior, with the likelihood function given exactly by $p(D \mid \mathbf{w}) \propto e^{-\frac{1}{\lambda n} \sum_{i} (1 - y_i(\mathbf{w} \cdot \mathbf{x}_i))_+}$.
- The prior is $p(\mathbf{w}) \propto e^{-\|\mathbf{w}\|^2}$ for both the BPM and the SVM.

5 Conclusion and Future Work

We have posed the SVM and the BPM in a probabilistic framework. Under this framework, the BPM finds the mean of the posterior distribution over functions, whereas the SVM finds the MAP estimate when using the hinge loss function. So the question of which is better boils down to the choice of mean versus mode and the choice of loss function. The BPM has the advantage that it usually works with the 0-1 loss, which is a natural choice of loss function, and it has been proved elsewhere that the BPM converges to the Bayes point, which is the projection of the Bayesian classification strategy onto the hypothesis space. The SVM, on the other hand, has been shown to work very well in many applications and is much easier to compute than the BPM.

Open questions include the performance of a classifier that is the mean of the posterior when using a hinge loss. Further, the questions of stability, consistency and convergence of the BPM remain unanswered. Also, since the mean of the posterior converges to the Bayes point under mild assumptions, an interesting question is how the mode relates to the Bayes point.

Acknowledgments

Thanks to Tom Minka for the Bayes Point Machine code, and to Sayan Mukherjee, Yuan Qi and Rosalind W. Picard for insightful discussions.

References

[1] Christopher J. C. Burges. A tutorial on support vector machines for pattern classification. Data Mining and Knowledge Discovery, 2(2).

[2] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines. Journal of Machine Learning Research, 1.

[3] P. Rujan. Playing billiards in version space. Neural Computation, 9:99-122.

[4] Thomas P. Minka. A Family of Algorithms for Approximate Bayesian Inference, Chapter 5. PhD thesis, Massachusetts Institute of Technology.

[5] T. Watkin.
Optimal learning with a neural network. Europhysics Letters, 21.

[6] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50.

[7] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7.

[8] M. Pontil, S. Mukherjee, and F. Girosi. On the noise model of support vector machine regression. A.I. Memo 1651, Massachusetts Institute of Technology, A.I. Lab, October 1998.
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationNeutron inverse kinetics via Gaussian Processes
Neutron inverse kinetics via Gaussian Processes P. Picca Politecnico di Torino, Torino, Italy R. Furfaro University of Arizona, Tucson, Arizona Outline Introduction Review of inverse kinetics techniques
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationECE-271B. Nuno Vasconcelos ECE Department, UCSD
ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative
More informationOverview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation
Overview Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Probabilistic Interpretation: Linear Regression Assume output y is generated
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationLecture 3: Multiclass Classification
Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationLecture 10: Support Vector Machine and Large Margin Classifier
Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationSVMs, Duality and the Kernel Trick
SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today
More informationLECTURE 7 Support vector machines
LECTURE 7 Support vector machines SVMs have been used in a multitude of applications and are one of the most popular machine learning algorithms. We will derive the SVM algorithm from two perspectives:
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationMLCC 2017 Regularization Networks I: Linear Models
MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational
More informationSupport Vector Machines and Bayes Regression
Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationInfinite Ensemble Learning with Support Vector Machinery
Infinite Ensemble Learning with Support Vector Machinery Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology ECML/PKDD, October 4, 2005 H.-T. Lin and L. Li (Learning Systems
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationMining Classification Knowledge
Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology SE lecture revision 2013 Outline 1. Bayesian classification
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationNonlinear Support Vector Machines through Iterative Majorization and I-Splines
Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support
More informationScale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract
Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Slides adapted from Jordan Boyd-Graber Machine Learning: Chenhao Tan Boulder 1 of 39 Recap Supervised learning Previously: KNN, naïve
More informationData Mining Part 4. Prediction
Data Mining Part 4. Prediction 4.3. Fall 2009 Instructor: Dr. Masoud Yaghini Outline Introduction Bayes Theorem Naïve References Introduction Bayesian classifiers A statistical classifiers Introduction
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationChapter 9. Support Vector Machine. Yongdai Kim Seoul National University
Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More information