# 9.2 Support Vector Machines


## Kernel Methods

We now have all the tools together to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of the form (9.5), where $y(x) = w^T \phi(x) + b$ is linear; examples include the soft margin SVM and MAP for logistic regression. Here is a mad idea. Suppose we use a huge number of features $p$, maybe even infinitely many. Before figuring out how this could be done, let us first see whether this makes any sense in principle. After all, we have to store $w \in \mathbb{R}^p$ and evaluate $\phi(x) \in \mathbb{R}^p$. Do we? In the previous section, we learned that we can always represent $w = \Phi^T \alpha$, where $\alpha \in \mathbb{R}^n$, and our dataset is finite. Moreover, the error function in (9.5) depends on $y = \Phi w + b = \Phi \Phi^T \alpha + b$ only, and $\Phi \Phi^T$ is just an $\mathbb{R}^{n \times n}$ matrix. Next, the Tikhonov regularizer is given by $\frac{\nu}{2} \|w\|^2 = \frac{\nu}{2} \|\Phi^T \alpha\|^2 = \frac{\nu}{2} \alpha^T \Phi \Phi^T \alpha$, so it also depends on $\Phi \Phi^T$ only. Finally, once we are done and have found $(\alpha_*, b_*)$, where $w_* = \Phi^T \alpha_*$, we can predict on new inputs $x$ with $y_*(x) = w_*^T \phi(x) + b_* = \alpha_*^T \Phi \phi(x) + b_*$. We need finite quantities only in order to make our idea work, namely the matrix $\Phi \Phi^T$ during training, and the mapping $\Phi \phi(x)$ for predictions later on, to be evaluated at finitely many $x$. This is the basic observation which makes kernelization work. The entries of $\Phi \Phi^T$ are $\phi(x_i)^T \phi(x_j)$, while $[\Phi \phi(x)] = [\phi(x_i)^T \phi(x)]$. We can write $K(x, x') = \phi(x)^T \phi(x')$, a *kernel function*. It is now clear that given the kernel function $K(x, x')$, we never need to access the underlying $\phi(x)$. In fact, we can forget about the dimensionality $p$ and vectors of this size altogether.

What makes $K(x, x')$ a kernel function? It must be the inner product in some feature space, but what does that imply? Let us work out some properties. First, a kernel function is obviously symmetric: $K(x', x) = K(x, x')$. Second, consider some arbitrary set $\{x_i\}$ of $n$ input points and construct the kernel matrix $K = [K(x_i, x_j)] \in \mathbb{R}^{n \times n}$.
Also, denote $\Phi = [\phi(x_1), \dots, \phi(x_n)]^T \in \mathbb{R}^{n \times p}$. Then, $\alpha^T K \alpha = \alpha^T \Phi \Phi^T \alpha = \|\Phi^T \alpha\|^2 \ge 0$. In other words, the kernel matrix $K$ is positive semidefinite (see Section 6.3). This property defines kernel functions: $K(x, x')$ is a kernel function if the kernel matrix $K = [K(x_i, x_j)]$ for any finite set of points $\{x_i\}$ is symmetric positive semidefinite. An important subfamily are the infinite-dimensional, or *positive definite*, kernel functions. A member $K(x, x')$ of this subfamily is defined by all its kernel matrices $K = [K(x_i, x_j)]$ being positive definite, for any set $\{x_i\}$ of any size. In particular, all its kernel matrices are invertible. As we will see shortly, it is positive definite kernel functions which give rise to infinite-dimensional feature spaces, and therefore to nonlinear kernel methods.
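The observation above can be made concrete in a few lines of code. The sketch below is a minimal kernelized trainer and predictor, assuming squared loss as the error function in (9.5) and dropping the bias $b$ for brevity; under these assumptions the coefficients have the closed form $\alpha = (K + \nu I)^{-1} y$ (the standard kernel ridge / regularized least squares solution, not the SVM itself). The point is that training touches only $K = \Phi\Phi^T$ and prediction touches only the kernel evaluations $\Phi\phi(x)$ — the feature map never appears explicitly.

```python
import numpy as np

def linear_kernel(X1, X2):
    # K(x, x') = x^T x': the entries of Phi Phi^T for the identity feature map
    return X1 @ X2.T

def fit_alpha(K, y, nu):
    # Squared loss plus the Tikhonov regularizer (nu/2)||w||^2 yields the
    # closed form alpha = (K + nu I)^{-1} y; only the kernel matrix K is needed.
    return np.linalg.solve(K + nu * np.eye(K.shape[0]), y)

def predict(kernel_fn, X_train, alpha, X_new):
    # y(x) = alpha^T Phi phi(x): a vector of kernel evaluations suffices
    return kernel_fn(X_new, X_train) @ alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5])          # noise-free linear targets

alpha = fit_alpha(linear_kernel(X, X), y, nu=1e-6)
y_hat = predict(linear_kernel, X, alpha, X)
print(np.max(np.abs(y_hat - y)))            # tiny: training data fit almost exactly
```

Swapping `linear_kernel` for any other kernel function changes nothing else in the code — exactly the modularity that kernelization buys.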

**Problem:** Can I use astronomically many features $p$? How about $p = \infty$?

**Approach:** No problem! As long as you can efficiently compute the kernel function $K(x, x') = \phi(x)^T \phi(x')$, the representer theorem saves the day.

## Hilbert Spaces and All That (*)

Before we give examples of kernel functions, a comment for meticulous readers (all others can safely skip this paragraph and move on to the examples). How can we even talk about $\phi(x)^T \phi(x')$ if $p = \infty$? Even worse, what is $\Phi \in \mathbb{R}^{n \times p}$ in this case? In the best case, all this involves infinite sums, which may not converge. Rest assured that all of this can be made rigorous within the framework of Hilbert function and functional spaces. In short, infinite-dimensional vectors become functions, their transposes become functionals, and matrices become linear operators. A key result is Mercer's theorem for positive semidefinite kernel functions, which provides a construction for a feature map. However, with the exception of certain learning-theoretical questions, the importance of all this function space mathematics for down-to-earth machine learning is very limited. Historically, the point of the efforts of mathematicians like Hilbert, Schmidt and Riesz was to find conditions under which function spaces could be treated in the same simple way as finite-dimensional vector spaces, working out analogies for positive definite matrices, quadratic functions, eigendecompositions, and so on. Moreover, the function spaces governing kernel methods are of the particularly simple reproducing kernel Hilbert type, where common pathologies like delta functions do not even arise. You may read about all that in [39] or other kernel literature; it will not play a role in this course. Just one warning, which you will not find spelled out much in the SVM literature: the geometry in huge or infinite-dimensional spaces is dramatically different from anything we can draw or imagine.
For example, in Mercer's construction of $K(x, x') = \sum_{j \ge 1} \phi_j(x) \phi_j(x')$, the different feature dimensions $j = 1, 2, \dots$ are by no means on equal terms, as far as concepts like distance or volume are concerned. For most commonly used infinite-dimensional kernel functions, the contributions $\phi_j(x) \phi_j(x')$ rapidly become extremely small, and only a small number of initial features determine most of the predictions. A good intuition about kernel methods is that they behave like (easy to use) linear methods of flexible dimensionality. As the number of data points $n$ grows, a larger (but finite) number of the feature space dimensions will effectively be used.

## Examples of Kernel Functions

Let us look at some examples. Maybe the simplest kernel function is $K(x, x') = x^T x'$, the standard inner product. Moreover, for any finite-dimensional feature map $\phi(x) \in \mathbb{R}^p$ ($p < \infty$), $K(x, x') = \phi(x)^T \phi(x')$ is a kernel function. Since any kernel matrix of this type can at most have rank $p$, such kernel functions are positive semidefinite, but not positive definite. However, even for finite-dimensional kernels, it can be much simpler to work with $K(x, x')$ directly than to evaluate $\phi(x)$. For example, recall polynomial regression estimation from Section 4.1, giving rise to a polynomial feature map $\phi(x) = [1, x, \dots, x^r]^T$ for $x \in \mathbb{R}$. Now, if $x \in \mathbb{R}^d$ is multivariate, a corresponding polynomial feature
map would consist of very many features. Is there a way around their explicit representation? Consider the *polynomial kernel*

$$K(x, x') = \left( x^T x' \right)^r = \sum_{j_1, \dots, j_r} (x_{j_1} \cdots x_{j_r}) \left( x'_{j_1} \cdots x'_{j_r} \right).$$

For example, if $d = 3$ and $r = 2$, then

$$K(x, x') = (x_1 x'_1 + x_2 x'_2 + x_3 x'_3)^2 = x_1^2 (x'_1)^2 + x_2^2 (x'_2)^2 + x_3^2 (x'_3)^2 + 2 (x_1 x_2)(x'_1 x'_2) + 2 (x_1 x_3)(x'_1 x'_3) + 2 (x_2 x_3)(x'_2 x'_3),$$

a feature map of which is

$$\phi(x) = \left[ x_1^2,\; x_2^2,\; x_3^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1 x_3,\; \sqrt{2}\, x_2 x_3 \right]^T.$$

If $x \in \mathbb{R}^d$, $K(x, x')$ is evaluated in $O(d)$, independent of $r$. Yet it is based on a feature map $\phi(x) \in \mathbb{R}^{d^r}$, whose dimensionality⁶ scales exponentially in $r$. A variant is given by $K(x, x') = \left( x^T x' + \varepsilon \right)^r$, $\varepsilon > 0$, which can be obtained by replacing $x$ by $[x^T, \sqrt{\varepsilon}]^T$ above. The feature map now runs over all subranges $1 \le j_1 \le \dots \le j_k \le d$, $0 \le k \le r$. A frequently used infinite-dimensional (positive definite) kernel is the *Gaussian* (or radial basis function, or squared exponential) kernel:

$$K(x, x') = e^{-\frac{\tau}{2} \|x - x'\|^2}, \quad \tau > 0. \qquad (9.6)$$

We establish it as a kernel function in Section 9.2.4. The Gaussian is an example of a *stationary* kernel: these depend on $x - x'$ only. We can weight each dimension differently:

$$K(x, x') = e^{-\frac{1}{2} \sum_{j=1}^d \tau_j (x_j - x'_j)^2}, \quad \tau_1, \dots, \tau_d > 0.$$

Free parameters in kernels are hyperparameters, much like $C$ in the soft margin SVM or the noise variance $\sigma^2$ in Gaussian linear regression; choosing them is a model selection problem (Chapter 10). Choosing the right kernel is much like choosing the right model. In order to do it well, you need to know your options. Kernels can be combined from others in many ways; [6, ch. 6.2] gives a good overview. It is also important to understand the statistical properties implied by a kernel. For example, the Gaussian kernel produces extremely smooth solutions, while other kernels from the Matérn family are more flexible. Most books on kernel methods will provide some overview; see also [41].
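The $d = 3$, $r = 2$ computation above is easy to verify numerically. The sketch below compares the $O(d)$ kernel evaluation $(x^T x')^2$ against the inner product of the explicit six-dimensional feature map, and also evaluates the Gaussian kernel (9.6); the random test points are arbitrary.

```python
import numpy as np

def phi_poly(x):
    # explicit feature map for K(x, x') = (x^T x')^2 with d = 3
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1 * x3,
                     np.sqrt(2) * x2 * x3])

def gauss_kernel(x, xp, tau=1.0):
    # K(x, x') = exp(-tau/2 ||x - x'||^2), Eq. (9.6)
    d = x - xp
    return np.exp(-0.5 * tau * (d @ d))

rng = np.random.default_rng(1)
x, xp = rng.standard_normal(3), rng.standard_normal(3)

k_direct = (x @ xp) ** 2              # O(d), no feature map needed
k_feature = phi_poly(x) @ phi_poly(xp)
print(k_direct, k_feature)            # agree up to rounding
print(gauss_kernel(x, x))             # K(x, x) = 1: the Gaussian kernel is normalized
```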
One highly successful application domain for kernel methods concerns problems where the input points $x$ have combinatorial structure, such as chains, trees, or graphs. Applications range from bioinformatics over computational chemistry to structured objects in computer vision. The rationale is that it is often simpler and much more computationally efficient to devise a kernel function $K(x, x')$ than a feature map $\phi(x)$. This field was seeded by independent work of David Haussler [22] and Chris Watkins.

⁶ More economically, we can run over all $1 \le j_1 \le \dots \le j_r \le d$.

A final remark concerns normalization. As noted in Chapter 2 and above in Section 9.1, it is often advantageous to use normalized feature maps, $\|\phi(x)\| = 1$. What does this mean for a kernel? $K(x, x) = \phi(x)^T \phi(x) = 1$. Therefore, a kernel function gives rise to a normalized feature map if its diagonal entries $K(x, x)$ are all equal to 1. For example, the Gaussian kernel (9.6) is normalized. Moreover, if $K(x, x')$ is a kernel, then so is

$$\frac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}}$$

(see Section 9.2.4), and the latter is normalized. It is a good idea to use normalized kernels in practice.

## 9.2.4 Techniques: Properties of Kernels (*)

In this section, we review a few properties of kernel functions and look at some more examples. The class of kernel functions has formidable closedness properties. If $K_1(x, x')$ and $K_2(x, x')$ are kernels, then so are $c K_1$ for $c > 0$, $K_1 + K_2$, and $K_1 K_2$. You will have no problem confirming the first two; the third is shown at the end of this section. Moreover, $f(x) K_1(x, x') f(x')$ is a kernel function as well, for any $f(x)$. This justifies kernel normalization, as discussed at the end of the previous section. If $K_r(x, x')$ is a sequence of kernel functions converging pointwise to $K(x, x') = \lim_{r \to \infty} K_r(x, x')$, then $K(x, x')$ is a kernel function as well. Finally, if $K(x, x')$ is a kernel and $\psi(y)$ is some mapping into $\mathbb{R}^d$, then $(y, y') \mapsto K(\psi(y), \psi(y'))$ is a kernel as well. Let us show that the Gaussian kernel (9.6) is a valid kernel function.
First, $(x^T x')^r$ is a kernel for every $r = 0, 1, 2, \dots$, namely the polynomial kernel from above. By the way, $K(x, x') = 1$ is a kernel function, since its kernel matrices $\mathbf{1} \mathbf{1}^T$ are positive semidefinite. Therefore, $K_r(x, x') = \sum_{j=0}^r \frac{1}{j!} (x^T x')^j$ are all kernels, and so is the limit $e^{x^T x'} = \lim_{r \to \infty} K_r(x, x')$. More generally, if $K(x, x')$ is a kernel, so is $e^{K(x, x')}$. Now,

$$e^{-\frac{\tau}{2} \|x - x'\|^2} = e^{-\frac{\tau}{2} \|x\|^2} \, e^{\tau x^T x'} \, e^{-\frac{\tau}{2} \|x'\|^2}.$$

The middle factor is a kernel, and we apply our normalization rule with $f(x) = e^{-\frac{\tau}{2} \|x\|^2}$. The Gaussian kernel is even infinite-dimensional (positive definite), although we will not show this here.
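The factorization used in this argument is easy to check numerically: with $f(x) = e^{-\tau\|x\|^2/2}$, the Gaussian kernel is exactly $f(x)\, e^{\tau x^T x'} f(x')$, and the result has unit diagonal, as a normalized kernel should. A quick check ($\tau$ and the random points are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
tau = 0.7
x, xp = rng.standard_normal(4), rng.standard_normal(4)

f = lambda z: np.exp(-0.5 * tau * (z @ z))        # normalization factor f(x)
lhs = np.exp(-0.5 * tau * np.sum((x - xp) ** 2))  # Gaussian kernel value
rhs = f(x) * np.exp(tau * (x @ xp)) * f(xp)       # f(x) e^{tau x^T x'} f(x')
print(lhs, rhs)                                   # agree up to rounding

# the result is normalized: K(x, x) = f(x)^2 e^{tau x^T x} = 1
print(f(x) * np.exp(tau * (x @ x)) * f(x))
```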

Another way to think about kernels is in terms of covariance functions. A random process is a set of random variables $a(x)$, one for each $x \in \mathbb{R}^d$. Its covariance function is

$$K(x, x') = \mathrm{Cov}[a(x), a(x')] = \mathrm{E}\left[ (a(x) - \mathrm{E}[a(x)])(a(x') - \mathrm{E}[a(x')]) \right].$$

Covariance functions are kernel functions. For some set $\{x_i\}$, let $a = [a(x_i) - \mathrm{E}[a(x_i)]] \in \mathbb{R}^n$ be a random vector. Then, for any $v \in \mathbb{R}^n$:

$$v^T K v = v^T \mathrm{E}\left[ a a^T \right] v = \mathrm{E}\left[ (v^T a)^2 \right] \ge 0.$$

Finally, there are some symmetric functions which are not kernels. One example is $K(x, x') = \tanh\left( \alpha x^T x' + \beta \right)$. In an attempt to make SVMs look like multi-layer perceptrons, this non-kernel was suggested and is shipped to this day in many SVM toolboxes⁷. Running the SVM with kernels like this spells trouble: soft margin SVM is a convex optimization problem only if kernel matrices are positive semidefinite, and codes will typically crash if that is not the case. A valid neural networks kernel is found in [47], derived from the covariance function perspective.

Finally, why is $K_1(x, x') K_2(x, x')$ a kernel? This argument is for interested readers only; it can be skipped at no loss. We have to show that for two positive semidefinite kernel matrices $K_1, K_2 \in \mathbb{R}^{n \times n}$, the Schur (or Hadamard) product $K_1 \circ K_2 = [K_1(x_i, x_j) K_2(x_i, x_j)]$ (Section 2.4.3) is positive semidefinite as well. To this end, we consider the Kronecker product $K_1 \otimes K_2 = [K_1(x_i, x_j) K_2] \in \mathbb{R}^{n^2 \times n^2}$. This is positive semidefinite as well. Namely, we can write $K_1 = V_1 V_1^T$, $K_2 = V_2 V_2^T$, then $K_1 \otimes K_2 = (V_1 \otimes V_2)(V_1 \otimes V_2)^T$. But the Schur product is a square submatrix of $K_1 \otimes K_2$, a so-called minor. In other words, for some index set $J \subset \{1, \dots, n^2\}$ of size $|J| = n$: $K_1 \circ K_2 = (K_1 \otimes K_2)_J$, so that for any $v \in \mathbb{R}^n$:

$$v^T (K_1 \circ K_2) v = z^T (K_1 \otimes K_2) z \ge 0, \quad z = I_{\cdot, J}\, v \in \mathbb{R}^{n^2},$$

where $I_{\cdot, J}$ collects the columns of the identity matrix indexed by $J$, so that $z$ is $v$ embedded into $\mathbb{R}^{n^2}$.
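Both claims in this paragraph can be observed numerically. With $\beta < 0$, the tanh "kernel" even has $K(x, x) < 0$, so its kernel matrices cannot be positive semidefinite; and the Schur product of two PSD matrices stays PSD. A small sketch (the specific points and parameters are arbitrary):

```python
import numpy as np

# tanh "kernel": not a valid kernel function
alpha, beta = 1.0, -1.0
x = np.array([-1.0, 0.0, 1.0])
K_tanh = np.tanh(alpha * np.outer(x, x) + beta)
print(np.linalg.eigvalsh(K_tanh).min())  # clearly negative: not PSD

# Schur (elementwise) product of two PSD matrices stays PSD
rng = np.random.default_rng(4)
V1, V2 = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
K1, K2 = V1 @ V1.T, V2 @ V2.T            # any V V^T is PSD
print(np.linalg.eigvalsh(K1 * K2).min())  # nonnegative (up to rounding)
```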
The same proof works to show that the positive definiteness of $K_1$, $K_2$ implies the positive definiteness of $K_1 \circ K_2$, a result due to Schur.

## Summary

Let us summarize the salient points leading up to support vector machine binary classification. We started with the observation that for a linearly separable dataset, many different separating hyperplanes result in zero training error. Among all those potential solutions, which could arise as the outcome of the perceptron algorithm, there is one which exhibits maximum stability against small displacements of the patterns $x_i$, by attaining the maximum margin $\gamma_D(w, b)$. Pursuing this lead, we addressed a number of problems:

⁷ It is even on Wikipedia (en.wikipedia.org/wiki/Support_vector_machine).


### Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

### Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

### Relevance Vector Machines

LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian

### Inner product spaces. Layers of structure:

Inner product spaces Layers of structure: vector space normed linear space inner product space The abstract definition of an inner product, which we will see very shortly, is simple (and by itself is pretty

### Support Vector and Kernel Methods

SIGIR 2003 Tutorial Support Vector and Kernel Methods Thorsten Joachims Cornell University Computer Science Department tj@cs.cornell.edu http://www.joachims.org 0 Linear Classifiers Rules of the Form:

### TDT4173 Machine Learning

TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### Introduction to Gaussian Process

Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

### Machine Learning Lecture 7

Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

### The Kernel Trick. Carlos C. Rodríguez October 25, Why don t we do it in higher dimensions?

The Kernel Trick Carlos C. Rodríguez http://omega.albany.edu:8008/ October 25, 2004 Why don t we do it in higher dimensions? If SVMs were able to handle only linearly separable data, their usefulness would

### Random Projections. Lopez Paz & Duvenaud. November 7, 2013

Random Projections Lopez Paz & Duvenaud November 7, 2013 Random Outline The Johnson-Lindenstrauss Lemma (1984) Random Kitchen Sinks (Rahimi and Recht, NIPS 2008) Fastfood (Le et al., ICML 2013) Why random

### Support Vector Machines. Maximizing the Margin

Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

### Statistical Pattern Recognition

Statistical Pattern Recognition Support Vector Machine (SVM) Hamid R. Rabiee Hadi Asheri, Jafar Muhammadi, Nima Pourdamghani Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Introduction

### Regularization Networks and Support Vector Machines

Advances in Computational Mathematics x (1999) x-x 1 Regularization Networks and Support Vector Machines Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio Center for Biological and Computational Learning

### LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

### Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

### UNIVERSITY OF MANITOBA

Question Points Score INSTRUCTIONS TO STUDENTS: This is a 6 hour examination. No extra time will be given. No texts, notes, or other aids are permitted. There are no calculators, cellphones or electronic

### Machine Learning, Midterm Exam

10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

### LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input

### Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper