CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning, Oct 09, 2017
Kernel Methods
Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
- Nonlinear models via basis functions
- Closer look at the SVM dual: kernel functions, kernel SVM
- RKHSs and Representer Theorem
- Kernel logistic regression
- Kernel ridge regression

1 Nonlinear Models via Basis Functions

Let $\mathcal{X} = \mathbb{R}^d$. We have seen methods for learning linear models of the form $h(x) = \mathrm{sign}(w^\top x + b)$ for binary classification (such as logistic regression and SVMs) and $f(x) = w^\top x + b$ for regression (such as linear least squares regression and SVR). What if we want to learn a nonlinear model? What would be a simple way to achieve this using the methods we have seen so far?

One way to achieve this is to map instances $x \in \mathbb{R}^d$ to some new feature vectors $\phi(x) \in \mathbb{R}^n$ via some nonlinear feature mapping $\phi : \mathbb{R}^d \to \mathbb{R}^n$, and then to learn a linear model in this transformed space. For example, if one maps instances $x \in \mathbb{R}^d$ to $n = 1 + 2d + \binom{d}{2}$-dimensional feature vectors

$$\phi(x) = \big(1,\ x_1, \ldots, x_d,\ x_1 x_2, \ldots, x_{d-1} x_d,\ x_1^2, \ldots, x_d^2\big)^\top,$$

then learning a linear model in the transformed space is equivalent to learning a quadratic model in the original instance space. In general, one can choose any basis functions $\phi_1, \ldots, \phi_n : \mathcal{X} \to \mathbb{R}$, and learn a linear model over these: $w^\top \phi(x) + b$, where $w \in \mathbb{R}^n$ (in fact, one can do this for $\mathcal{X} \ne \mathbb{R}^d$ as well). For example, in least squares regression applied to a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \mathbb{R})^m$, one would simply replace the matrix $X \in \mathbb{R}^{m \times d}$ with the design matrix $\Phi \in \mathbb{R}^{m \times n}$, where $\Phi_{ij} = \phi_j(x_i)$.

What is a potential difficulty in doing this? If $n$ is large (e.g. as would be the case if the feature mapping $\phi$ corresponded to a high-degree polynomial), then the above approach can be computationally expensive. In this lecture we look at a technique that allows one to implement the above idea efficiently for many algorithms. We start by taking a closer look at the SVM dual which we derived in the last lecture.

2 Closer Look at the SVM Dual: Kernel Functions, Kernel SVM

Recall the form of the dual we derived for the (soft-margin) linear SVM:

$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \, (x_i^\top x_j) + \sum_{i=1}^m \alpha_i \tag{1}$$
$$\text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 \tag{2}$$
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, m. \tag{3}$$

If we implement this on feature vectors $\phi(x_i) \in \mathbb{R}^n$ in place of $x_i \in \mathbb{R}^d$, we get the following optimization problem:

$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \, \big(\phi(x_i)^\top \phi(x_j)\big) + \sum_{i=1}^m \alpha_i \tag{4}$$
$$\text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 \tag{5}$$
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, m. \tag{6}$$

This involves computing dot products between vectors $\phi(x_i), \phi(x_j)$ in $\mathbb{R}^n$. Similarly, using the learned model to make predictions on a new test point $x \in \mathbb{R}^d$ also involves computing dot products between vectors in $\mathbb{R}^n$:

$$h(x) = \mathrm{sign}\Big( \sum_{i \in SV} \alpha_i y_i \, \phi(x_i)^\top \phi(x) + b \Big).$$

For example, as we saw above, one can learn a quadratic classifier in $\mathcal{X} = \mathbb{R}^2$ by learning a linear classifier in $\phi(\mathbb{R}^2) \subset \mathbb{R}^6$, where

$$\phi\big((x_1, x_2)^\top\big) = \big(1,\ x_1,\ x_2,\ x_1 x_2,\ x_1^2,\ x_2^2\big)^\top;$$

clearly, a straightforward approach to learning an SVM classifier in this space (and applying it to a new test point) will involve computing dot products in $\mathbb{R}^6$ (more generally, when learning a degree-$q$ polynomial in $\mathbb{R}^d$, such a straightforward approach will involve computing dot products in $\mathbb{R}^n$ for $n = O(d^q)$).
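To make the basis-function approach of Section 1 concrete, here is a minimal sketch (not part of the original notes) that builds the quadratic feature map, stacks it into a design matrix $\Phi$, and fits least squares in the transformed space. The helper names `quadratic_features` and `design_matrix` and the toy data are illustrative assumptions. Because the feature vector includes the constant 1, the bias is absorbed into the weight vector here.

```python
import numpy as np
from itertools import combinations

def quadratic_features(x):
    """Map x in R^d to (1, x_1..x_d, products x_i x_j for i < j, squares x_i^2)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], x, cross, x ** 2))

def design_matrix(X):
    """Stack phi(x_i) as rows, so that Phi[i, j] = phi_j(x_i)."""
    return np.vstack([quadratic_features(x) for x in X])

# Toy data (made up for illustration): y is an exact quadratic function of x in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] * X[:, 1] - X[:, 1] ** 2

Phi = design_matrix(X)                       # shape (m, n) with n = 1 + 2d + d(d-1)/2
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # linear least squares in feature space
```

The cost of forming `Phi` grows with $n$, which is the difficulty the kernel approach is meant to avoid.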
Now, consider replacing dot products $\phi(x)^\top \phi(x')$ in the above example with $K(x, x')$, where for all $x, x' \in \mathbb{R}^2$,

$$K(x, x') = (x^\top x' + 1)^2.$$

It can be verified (exercise!) that $K(x, x') = \phi_K(x)^\top \phi_K(x')$, where

$$\phi_K\big((x_1, x_2)^\top\big) = \big(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ x_1^2,\ x_2^2\big)^\top.$$

Thus, using $K(x, x')$ above instead of $\phi(x)^\top \phi(x')$ implicitly computes dot products in $\mathbb{R}^6$, with computation of dot products required only in $\mathbb{R}^2$!

In fact, one can use any symmetric, positive semidefinite kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (also called a Mercer kernel function) in the SVM algorithm directly, even if the feature space implemented by the kernel function cannot be described explicitly. Any such kernel function yields a convex dual problem; if $K$ is positive definite, then $K$ also corresponds to inner products in some inner product space $V$ (i.e. $K(x, x') = \langle \phi(x), \phi(x') \rangle$ for some $\phi : \mathcal{X} \to V$).

For Euclidean instance spaces $\mathcal{X} = \mathbb{R}^d$, examples of commonly used kernel functions include the polynomial kernel $K(x, x') = (x^\top x' + 1)^q$, which results in learning a degree-$q$ polynomial threshold classifier, and the Gaussian kernel, also known as the radial basis function (RBF) kernel,

$$K(x, x') = \exp\Big( -\frac{\|x - x'\|_2^2}{2\sigma^2} \Big)$$

(where $\sigma > 0$ is a parameter of the kernel), which effectively implements dot products in an infinite-dimensional inner product space; in both cases, evaluating the kernel $K(x, x')$ at any two points $x, x'$ requires only $O(d)$ computation time. Kernel functions can also be used for non-vectorial data ($\mathcal{X} \ne \mathbb{R}^d$); for example, kernel functions are often used to implicitly embed instance spaces containing strings, trees etc. into an inner product space, and to implicitly learn a linear classifier in this space. Intuitively, it is helpful to think of kernel functions as capturing some sort of similarity between pairs of instances in $\mathcal{X}$.

To summarize, given a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \{\pm 1\})^m$, in order to learn a kernel SVM classifier using a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, one simply solves the kernel SVM dual given by

$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \, K(x_i, x_j) + \sum_{i=1}^m \alpha_i \tag{7}$$
$$\text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 \tag{8}$$
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, m, \tag{9}$$

and then predicts the label of a new instance $x \in \mathcal{X}$ according to

$$h(x) = \mathrm{sign}\Big( \sum_{i \in SV} \alpha_i y_i \, K(x_i, x) + b \Big),$$

where

$$b = \frac{1}{|SV_1|} \sum_{i \in SV_1} \Big( y_i - \sum_{j \in SV} \alpha_j y_j \, K(x_i, x_j) \Big).$$

Here, as for the linear soft-margin SVM, $SV$ denotes the indices of the support vectors ($\alpha_i > 0$) and $SV_1 \subseteq SV$ those with $\alpha_i$ strictly between $0$ and $C$.
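The kernel identity and the prediction rule above can be checked numerically. The following is a minimal sketch, not the notes' own code: the function names are illustrative assumptions, and the dual variables in the toy example were derived by hand for a two-point problem (with $y_1 = +1$, $y_2 = -1$, constraint (8) forces $\alpha_1 = \alpha_2 = a$, and maximizing (7) gives $a = 2 / (K_{11} + K_{22} - 2K_{12})$) rather than obtained from a QP solver.

```python
import numpy as np

def poly_kernel(x, z, q=2):
    """Polynomial kernel (x^T z + 1)^q."""
    return (np.dot(x, z) + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def phi_K(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 ** 2, x2 ** 2])

def svm_decision(x, X_train, y_train, alpha, b, kernel):
    """sum_{i in SV} alpha_i y_i K(x_i, x) + b, where SV = {i : alpha_i > 0}."""
    sv = np.flatnonzero(alpha > 0)
    return sum(alpha[i] * y_train[i] * kernel(X_train[i], x) for i in sv) + b

def svm_bias(X_train, y_train, alpha, C, kernel, tol=1e-8):
    """Average of y_i - sum_j alpha_j y_j K(x_i, x_j) over i with 0 < alpha_i < C."""
    margin = np.flatnonzero((alpha > tol) & (alpha < C - tol))
    return np.mean([
        y_train[i] - svm_decision(X_train[i], X_train, y_train, alpha, 0.0, kernel)
        for i in margin
    ])

# Kernel value in R^2 versus explicit dot product in R^6.
x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k_implicit = poly_kernel(x, z)
k_explicit = phi_K(x) @ phi_K(z)

# Two-point toy problem; alpha = 2 / (4 + 4 - 0) = 0.25 is hand-derived (assumption: C = 1).
X_train = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
b = svm_bias(X_train, y_train, alpha, C=1.0, kernel=poly_kernel)
score = svm_decision(np.array([2.0, 0.5]), X_train, y_train, alpha, b, poly_kernel)
label = np.sign(score)
```

Swapping `poly_kernel` for `rbf_kernel` changes nothing else in the prediction code, which is the practical appeal of writing the algorithm purely in terms of $K$.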
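3 RKHSs and Representer Theorem

Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric positive definite kernel function. Let

$$\mathcal{F}_K^0 = \Big\{ f : \mathcal{X} \to \mathbb{R} \;\Big|\; f(x) = \sum_{i=1}^r \alpha_i K(x_i, x) \text{ for some } r \in \mathbb{Z}_+,\ \alpha_i \in \mathbb{R},\ x_i \in \mathcal{X} \Big\}.$$

For $f, g \in \mathcal{F}_K^0$ with $f(x) = \sum_{i=1}^r \alpha_i K(x_i, x)$ and $g(x) = \sum_{j=1}^s \beta_j K(x'_j, x)$, define

$$\langle f, g \rangle_K = \sum_{i=1}^r \sum_{j=1}^s \alpha_i \beta_j K(x_i, x'_j) \tag{10}$$
$$\|f\|_K = \sqrt{\langle f, f \rangle_K}. \tag{11}$$

Let $\mathcal{F}_K$ be the completion of $\mathcal{F}_K^0$ under the metric induced by the above norm [1]. Then $\mathcal{F}_K$ is called the reproducing kernel Hilbert space (RKHS) associated with $K$ [2].

Note that the SVM classifier learned using kernel $K$ is of the form

$$h(x) = \mathrm{sign}(f(x) + b),$$

where $f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x)$, i.e. where $f \in \mathcal{F}_K$. In fact, consider the following optimization problem:

$$\min_{f \in \mathcal{F}_K,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \big( 1 - y_i (f(x_i) + b) \big)_+ + \lambda \|f\|_K^2.$$

It turns out that the above SVM solution (with $C = \frac{1}{2\lambda m}$) is a solution to this problem, i.e. the kernel SVM solution minimizes the RKHS-norm regularized hinge loss over all functions of the form $f(x) + b$ for $f \in \mathcal{F}_K$, $b \in \mathbb{R}$.

More generally, we have the following result:

Theorem (Representer Theorem). Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel function. Let $\mathcal{Y} \subseteq \mathbb{R}$. Let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$. Let $L : \mathbb{R}^m \times \mathcal{Y}^m \to \mathbb{R}$. Let $\Omega : \mathbb{R}_+ \to \mathbb{R}_+$ be a monotonically increasing function. Then for $\lambda > 0$, there is a solution to the optimization problem

$$\min_{f \in \mathcal{F}_K,\, b \in \mathbb{R}} \; L\big( (f(x_1) + b, \ldots, f(x_m) + b),\ (y_1, \ldots, y_m) \big) + \lambda\, \Omega\big( \|f\|_K^2 \big)$$

of the form

$$f(x) = \sum_{i=1}^m \alpha_i K(x_i, x) \quad \text{for some } \alpha_1, \ldots, \alpha_m \in \mathbb{R}.$$

If $\Omega$ is strictly increasing, then all solutions have this form.

The above result tells us that even if $\mathcal{F}_K$ is an infinite-dimensional space, any optimization problem resulting from minimizing a loss over a finite training sample regularized by some increasing function of the RKHS-norm is effectively a finite-dimensional optimization problem, and moreover, the solution to this problem can be written as a kernel expansion over the training points. In particular, minimizing any other loss over $\mathcal{F}_K$ (regularized by the RKHS-norm) will also yield a solution of this form!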
Exercise. Show that linear functions $f : \mathbb{R}^d \to \mathbb{R}$ of the form $f(x) = w^\top x$ form an RKHS with linear kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ given by $K(x, x') = x^\top x'$ and with $\|f\|_K^2 = \|w\|_2^2$.

[1] The metric induced by the norm $\|\cdot\|_K$ is given by $d_K(f, g) = \|f - g\|_K$. The completion of $\mathcal{F}_K^0$ is simply $\mathcal{F}_K^0$ plus any limit points of Cauchy sequences in $\mathcal{F}_K^0$ under this metric.

[2] The name reproducing kernel Hilbert space comes from the following reproducing property: for any $x \in \mathcal{X}$, define $K_x : \mathcal{X} \to \mathbb{R}$ as $K_x(x') = K(x, x')$; then for any $f \in \mathcal{F}_K$, we have $\langle f, K_x \rangle_K = f(x)$.
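For functions given as finite kernel expansions, the inner product $\langle f, g \rangle_K$ and the reproducing property can be evaluated directly from kernel values. The sketch below is an illustration only: the helper names (`gram`, `evaluate`, `inner_K`) and the random data are assumptions, and the linear-kernel check is a numerical sanity test, not a proof of the exercise.

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def linear(x, z):
    return float(np.dot(x, z))

def gram(A, B, kernel):
    """Matrix of kernel values K(a, b) for a in A (rows) and b in B (columns)."""
    return np.array([[kernel(a, b) for b in B] for a in A])

def evaluate(coef, centers, x, kernel):
    """f(x) for the kernel expansion f = sum_i coef_i K(centers_i, .)."""
    return sum(c * kernel(p, x) for c, p in zip(coef, centers))

def inner_K(alpha, Xa, beta, Xb, kernel):
    """<f, g>_K = sum_i sum_j alpha_i beta_j K(x_i, x'_j)."""
    return alpha @ gram(Xa, Xb, kernel) @ beta

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 3))
alpha = rng.normal(size=5)
x = rng.normal(size=3)

# Reproducing property: K_x is the expansion with one center x and coefficient 1,
# so <f, K_x>_K should equal f(x).
lhs = inner_K(alpha, centers, np.array([1.0]), x[None, :], rbf)
rhs = evaluate(alpha, centers, x, rbf)

# Linear kernel: f(x) = w^T x with w = sum_i alpha_i x_i, and ||f||_K^2 = ||w||_2^2.
w = centers.T @ alpha
norm_sq = inner_K(alpha, centers, alpha, centers, linear)
```

Note that $\|f\|_K^2 = \alpha^\top K \alpha$ for such an expansion, which is the form the regularizer takes in the kernel logistic and ridge regression objectives.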
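4 Kernel Logistic Regression

Given a training sample $S \in (\mathcal{X} \times \{\pm 1\})^m$ and kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the kernel logistic regression classifier is given by the solution to the following optimization problem:

$$\min_{f \in \mathcal{F}_K,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\big( 1 + e^{-y_i (f(x_i) + b)} \big) + \lambda \|f\|_K^2.$$

Since we know from the Representer Theorem that the solution has the form $f(x) = \sum_{i=1}^m \alpha_i K(x_i, x)$, we can write the above as an optimization problem over $\alpha, b$:

$$\min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\Big( 1 + e^{-y_i \big( \sum_{j=1}^m \alpha_j K(x_j, x_i) + b \big)} \Big) + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j).$$

This is of a similar form as in standard logistic regression, with $m$ basis functions $\phi_j(x) = K(x_j, x)$ for $j \in [m]$ (and with $\alpha$ playing the role of $w$)! In particular, define $K \in \mathbb{R}^{m \times m}$ as $K_{ij} = K(x_i, x_j)$ (this is often called the Gram matrix), and let $k_i$ denote the $i$-th column of this matrix. Then we can write the above as simply

$$\min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\big( 1 + e^{-y_i (\alpha^\top k_i + b)} \big) + \lambda\, \alpha^\top K \alpha,$$

which is similar to the form for standard linear logistic regression (with feature vectors $k_i$) except for the regularizer being $\alpha^\top K \alpha$ rather than $\|\alpha\|_2^2$, and can be solved similarly as before, using similar numerical optimization methods.

We note that unlike SVMs, here in general, the solution has $\alpha_i \ne 0$ for all $i \in [m]$. A variant of logistic regression called the import vector machine (IVM) adopts a greedy approach to find a subset $IV \subseteq [m]$ such that the function $f'(x) + b = \sum_{i \in IV} \alpha_i K(x_i, x) + b$ gives good performance. Compared to SVMs, IVMs can provide more natural class probability estimates, as well as more natural extensions to multiclass classification.

5 Kernel Ridge Regression

Given a training sample $S \in (\mathcal{X} \times \mathbb{R})^m$ and kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, consider first a kernel ridge regression formulation for learning a function $f \in \mathcal{F}_K$:

$$\min_{f \in \mathcal{F}_K} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - f(x_i) \big)^2 + \lambda \|f\|_K^2.$$

Again, since we know from the Representer Theorem that the solution has the form $f(x) = \sum_{i=1}^m \alpha_i K(x_i, x)$, we can write the above as an optimization problem over $\alpha$:

$$\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \Big( y_i - \sum_{j=1}^m \alpha_j K(x_j, x_i) \Big)^2 + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j),$$

or in matrix notation,

$$\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - \alpha^\top k_i \big)^2 + \lambda\, \alpha^\top K \alpha.$$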
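As an aside on Section 4, the Gram-matrix form of the kernel logistic regression objective can be minimized with plain gradient descent over $(\alpha, b)$. The sketch below is one possible way to do this under stated assumptions, not the method prescribed by the notes: the function names, the RBF kernel choice, the toy two-blob data, and the fixed step size (one over an upper bound on the gradient's Lipschitz constant) are all illustrative.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix of the Gaussian kernel between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def klr_objective(alpha, b, K, y, lam):
    """(1/m) sum_i ln(1 + exp(-y_i (alpha^T k_i + b))) + lam * alpha^T K alpha."""
    z = y * (K @ alpha + b)
    return np.mean(np.logaddexp(0.0, -z)) + lam * alpha @ K @ alpha

def fit_klr(K, y, lam=0.01, n_iter=500):
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    lam_max = np.linalg.eigvalsh(K)[-1]
    # Step size 1/L, where L upper-bounds the Lipschitz constant of the gradient.
    step = 1.0 / (0.25 * (lam_max ** 2 + m) / m + 2.0 * lam * lam_max)
    for _ in range(n_iter):
        z = y * (K @ alpha + b)
        s = np.exp(-np.logaddexp(0.0, z))        # 1 / (1 + e^z), computed stably
        g = -(y * s) / m                         # d(loss)/d(score_i)
        alpha = alpha - step * (K @ g + 2.0 * lam * (K @ alpha))
        b = b - step * g.sum()
    return alpha, b

# Toy data (made up): two well-separated Gaussian blobs with labels -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)), rng.normal(2.0, 0.5, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

K = rbf_gram(X, X)
alpha, b = fit_klr(K, y)
obj_start = klr_objective(np.zeros(len(y)), 0.0, K, y, 0.01)
obj_end = klr_objective(alpha, b, K, y, 0.01)
train_acc = np.mean(np.sign(K @ alpha + b) == y)
```

A new point `x_new` (a row vector) would be scored as `rbf_gram(x_new[None, :], X) @ alpha + b`, and class probabilities follow by passing that score through the logistic function.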
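Again, this kernel ridge regression objective is of the same form as standard linear ridge regression, with feature vectors $k_i$ and with regularizer $\alpha^\top K \alpha$ rather than $\|\alpha\|_2^2$. If $K$ is positive definite, in which case the Gram matrix $K$ is invertible, then setting the gradient of the objective above w.r.t. $\alpha$ to zero can be seen to yield

$$\hat{\alpha} = \big( K + \lambda m I_m \big)^{-1} y,$$

where as before $I_m$ is the $m \times m$ identity matrix and $y = (y_1, \ldots, y_m)^\top \in \mathbb{R}^m$.

Exercise. Show that if $\mathcal{X} = \mathbb{R}^d$ and one wants to explicitly include a bias term $b$ in the linear ridge regression solution which is not included in the regularization, then defining

$$\tilde{X} = \begin{bmatrix} x_1^\top & 1 \\ \vdots & \vdots \\ x_m^\top & 1 \end{bmatrix}, \qquad \tilde{w} = \begin{bmatrix} w \\ b \end{bmatrix}, \qquad L = \begin{bmatrix} I_d & 0 \\ 0 & 0 \end{bmatrix},$$

one gets the solution

$$\hat{\tilde{w}} = \big( \tilde{X}^\top \tilde{X} + \lambda m L \big)^{-1} \tilde{X}^\top y.$$

How would you extend this to learning a function of the form $f(x) + b$ for $f \in \mathcal{F}_K$, $b \in \mathbb{R}$ in the kernel ridge regression setting?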
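The closed form $\hat{\alpha} = (K + \lambda m I_m)^{-1} y$ translates into a single linear solve. The following minimal sketch (function names and toy data are assumptions made for illustration) fits kernel ridge regression without a bias term and, as a sanity check, compares the linear-kernel case against ordinary linear ridge regression with the same $\lambda m$ scaling.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix of the Gaussian kernel between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def fit_krr(K, y, lam):
    """Solve (K + lam * m * I) alpha = y, avoiding an explicit matrix inverse."""
    m = len(y)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

# Toy regression data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)
lam = 0.1

# Linear kernel: predictions should match linear ridge regression (no bias term).
K_lin = X @ X.T
alpha_lin = fit_krr(K_lin, y, lam)
w_ridge = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(3), X.T @ y)
pred_kernel = K_lin @ alpha_lin
pred_linear = X @ w_ridge

# RBF kernel: predict at new points via f(x) = sum_i alpha_i K(x_i, x).
K_rbf = rbf_gram(X, X, sigma=1.0)
alpha_rbf = fit_krr(K_rbf, y, lam)
X_new = rng.normal(size=(5, 3))
y_new = rbf_gram(X_new, X, sigma=1.0) @ alpha_rbf
```

The kernel solve costs $O(m^3)$ regardless of the feature-space dimension, whereas the explicit linear ridge solve costs $O(n^3)$ in the number of features; which is cheaper depends on whether $m$ or $n$ is larger.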
More informationReal Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report
Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Hujia Yu, Jiafu Wu [hujiay, jiafuwu]@stanford.edu 1. Introduction Housing prices are an important
More informationLoss Functions and Optimization. Lecture 31
Lecture 3: Loss Functions and Optimization Lecture 31 Administrative Assignment 1 is released: http://cs231n.github.io/assignments2017/assignment1/ Due Thursday April 20, 11:59pm on Canvas (Extending
More informationBehavioral Data Mining. Lecture 7 Linear and Logistic Regression
Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L6: Structured Estimation Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January
More informationSubstitution Matrix based Kernel Functions for Protein Secondary Structure Prediction
Substitution Matrix based Kernel Functions for Protein Secondary Structure Prediction Bram Vanschoenwinkel Vrije Universiteit Brussel Computational Modeling Lab Pleinlaan 2, 1050 Brussel, Belgium Email:
More informationMidterm Exam, Spring 2005
10701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name
More informationLearning with Transformation Invariant Kernels
Learning with Transformation Invariant Kernels Christian Walder Max Planck Institute for Biological Cybernetics 72076 Tübingen, Germany christian.walder@tuebingen.mpg.de Olivier Chapelle Yahoo! Research
More informationGeorgia Department of Education Common Core Georgia Performance Standards Framework CCGPS Advanced Algebra Unit 2
Polynomials Patterns Task 1. To get an idea of what polynomial functions look like, we can graph the first through fifth degree polynomials with leading coefficients of 1. For each polynomial function,
More informationIntroduction to Machine Learning Fall 2017 Note 5. 1 Overview. 2 Metric
CS 189 Introduction to Machine Learning Fall 2017 Note 5 1 Overview Recall from our previous note that for a fixed input x, our measurement Y is a noisy measurement of the true underlying response f x):
More informationLogistic Regression and Boosting for Labeled Bags of Instances
Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In
More informationTopics in Machine Learning (CS729) Instructor: Saketh
Topics in Machine Learning (CS729) Instructor: Saketh Contents Contents 1 1 Introduction 3 2 Supervised Inductive Learning 5 2.1 Statistical Learning Theory (for SIL case)............... 6 2.1.1 ERM Consistency
More information10701 Recitation 5 Duality and SVM. Ahmed Hefny
10701 Recitation 5 Duality and SVM Ahmed Hefny Outline Langrangian and Duality The Lagrangian Duality Eamples Support Vector Machines Primal Formulation Dual Formulation Soft Margin and Hinge Loss Lagrangian
More informationHow to learn from very few examples?
How to learn from very few examples? Dengyong Zhou Department of Empirical Inference Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tuebingen, Germany Outline Introduction Part A
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More information