Hilbert Space Methods in Learning

Size: px
Start display at page:

Download "Hilbert Space Methods in Learning"

Transcription

1 Hilbert Space Methods in Learning guest lecturer: Risi Kondor 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15,

2 1. A general formulation of the learning problem Empirical and true errors over tting Error bounds and what they tell us about the design of algorithms 2. Hilbert space methods Reproducing Kernel Hilbert Spaces Kernels Algorithms: SVM, Gaussian Processes, Kernel PCA Tutorial online at 2

3 The Learning Problem 3

4 Regression Learn a function f : x y Linear functions, order p polynomials, splines, etc. Examples: Boston housing problem, robot grasps, motorcycle data, etc. 4

5 Classi cation Separate +1 labeled points from 1 labeled points Examples: Face recognition, DNA splice site identi cation, document classi cation, call type classi cation 5

6 Supervised learning Input space: X e.g. X =R n Output space: Y Y ={ 1, +1} for classi cation Y = R for regression Training set: S = (x 1, y 1 ), (x 2, y 2 ),..., (x m, y m ) x i X, y t Y Truth : Deterministic: y = f 0 (x) Probabilistic: y p ( y x ) (more general) Goal: construct hypothesis f : X Y to predict y given x. 6

7 The Empirical Risk Empirical risk (training error): R emp [f] = 1 m L(f(x i ), y i ) where L : Y Y R is the loss function. Zero-one loss for classi cation: L(ŷ, y) = 1 if ŷ y 0 otherwise. Squared error loss for regression: L(ŷ, y) = (y ŷ) 2. 7

8 A Bad Learning Algorithm (memorization algorithm) Set 1 when x=x i and y i =1 f(x) = 1 otherwise. For zero-one loss perfect performance on training data! R emp [f] = 1 m L(f(x i ), y i ) = 0 Will it generalize well to testing examples? Why not? 8

9 The True Risk Assume some distribution on inputs: p(x) Distribution on (x, y) examples: p(x, y) = δ(y f 0 (x)) p(x) p(x, y) = p ( y x ) p(x) or True risk: R[f] = E [L(f(x), y)] = X Y p(x, y) L(f(x), y) dx dy. This is what we really want to minimize in discriminative learning. 9

10 True Risk vs. Empirical Risk R[f] = E [L(f(x), y)] R emp [f] = 1 m L(f(x i ), y i ). Just minimizing R emp is BAD (see previous algorithm). Optimizing the training error at the expense of the testing error is called over tting. But we do not know p(x, y)!!! Can we still do anything? 10

11 Bounding the True Risk For many practical learning algorithms R[f] = E [L(f(x), y)] R emp [f] = 1 m L(f(x i ), y i ). Uniform error bounds: For any distribution D, with probability 1 δ (over the choice of training set) R[f] R emp [f] ɛ for all hypotheses f F simultaneously. PAC bound: probably approximately correct 11

12 KEY CONCEPT: Capacity Control [ R[f] P Remp [f] ] ɛ 1 δ Generally, ɛ is a complicated function of δ depending crucially on F (hypothesis class) that f is chosen from. Compromise: model exibility generalization performance large F want small ɛ complexity generality 12

13 Capacity control Too in exible? Just right. Over tting? How do we quantify complexity of f? 13

14 Uniform Error Bounds [ P R[f] R emp [f] ɛ f F ] 1 δ. [ ] P sup [R[f] R emp [f] ] ɛ f F 1 δ, Not equivalent to: [ ] P R[f] R emp [f] ɛ 1 δ f F. 14

15 Vapnik-Chervonenkis type bounds With probability 1 δ [ sup R[f] Remp [f] ] h (log (2m/h) + 1) log (δ/4) f F m where h is the VC dimension of F. Linear discriminators in R n : h = n with margin γ in ball of radius D : h = min ( n, D 2 /γ 2 ) + 1 Large margin is good! 15

16 Covering number bounds With probability 1 δ [ sup R[f] Remp [f] ] log (12m E N1 (S, ɛ/8)) log δ 16M f F m where M is an upper bound on L(f(x), y). The covering number N 1 (S, ɛ) is the number of vectors v 1, v 2,..., v n to ensure that for any f F, there is a v k such that needed L(f(x i ), v k ) ɛ. 16

17 Stability-based bounds If f 1 δ is the hypothesis returned by a β-stable algorithm, then with probability R[f ] R emp [f ] β + 2 (mβ + M) 2 log (δ/2) m where M is an upper bound on L(f(x), y). An algorithm is β-stable if for all training sets, and any example (x, y), L(f (x), y) changes by at most β when we replace any one of the training examples by any other example. 17

18 Rademacher bounds For Ra r [f] = r with probability 1 δ [ ( R[f] inf (1+α) R emp [f] ) ( )] 31r log 2 b α>0 4α r + 50bɛ. n Rademacher average: Ra r [f] = E S,σ [ where P [σ = 1] = P [σ = 1] = 1/2. sup f F : EL(f(x),y) r ] σ i L (f(x i ), y i ) 18

19 Structural Risk Minimization If we have bound of form [ ] P sup [R[f] R emp [f] ] ɛ F f F 1 δ 1. Fix δ 2. Compute f F = arg min f F [R emp [f] + ɛ F ] for a sequence of spaces F 1 F 2... F k 3. Return f i with smallest R emp [f i ] + ɛ F i Does this work? 19

20 The problem with error bounds Most bounds are hopelessly loose. Typically, we get for 1 δ =.95 ɛ = Main culprit is the uniformity requirement. Can we still use them for anything or are they just a weird sport? Form of bounds is important, even if their value is not. In particular, large margin is good. 20

21 Hilbert Space Methods 21

22 SVM s: the old story Kernel k : X X R pos.def. similarity measure Feature map Φ : X F obeys k(x, x ) = Φ(x), Φ(x ) e.g. Gaussian Kernel: k(x, x ) = e x x 2 /(2σ 2 ) Find maximum margin separating hyperplane in high dimensional space! f(x) = sgn [ b + ] α i k(x i, x) 22

23 Want more general story behind Hilbert space methods. How do we tell what is a good kernel, anyway? Want large margin. What kernel will give us large margin? Lessons so far: capacity control is crucial; large margin is good; pursue abstract approach looking for general f : X Y, worry about actual algorithm later. 23

24 Regularized Risk Motivated by form of error bounds, minimize R reg [f] = 1 L(f(x i ), y i ) m }{{} R emp [f] over some large space of functions H. + Ω[f] }{{} regularizer Ω[f] is a penalty term penalizing hypotheses that are too complex. Effectively SRM. See Regularization networks of Poggio & Girosi. 24

25 Regularized Spaces of Functions Given {(x 1, y 1 ),..., (x m, y m )} look for f : X Y in some linear space of functions H minimizing R reg [f] = 1 m L(f(x i ), y i ) + f 2 H 25

26 Regularized Spaces of Functions Given {(x 1, y 1 ),..., (x m, y m )} look for f : X Y in some linear space of functions H minimizing R reg [f] = 1 m R reg [f] = 1 m L(f(x i ), y i ) + f 2 H L(f(x i ), y i ) + f, f H Hilbert space 26

27 Regularized Spaces of Functions Given {(x 1, y 1 ),..., (x m, y m )} look for f : X Y in some linear space of functions H minimizing R reg [f] = 1 m R reg [f] = 1 m R reg [f] = 1 m L(f(x i ), y i ) + f 2 H L(f(x i ), y i ) + f, f H L( f, k xi, y i ) + f, f H Hilbert space RKHS 27

28 Regularized Spaces of Functions Given {(x 1, y 1 ),..., (x m, y m )} look for f : X Y in some linear space of functions H minimizing R reg [f]. R reg [f] = 1 m R reg [f] = 1 m R reg [f] = 1 m L(f(x i ), y i ) + f 2 H L(f(x i ), y i ) + f, f H L( f, k xi, y i ) + f, f H Hilbert space RKHS The k x are prototypical functions s.t. f(x) = f, k x. 28

29 Representer Theorem Minimizer of R reg [f] = 1 m will be in the span of k x1, k x2,..., k xm! L( f, k xi, y i ) + f, f H The hypothesis can be written f(x) = f, k x = where k(x, x ) = k x, k x. α i k xi, k x = α i k(x, x i ). All we need to nd are α 1, α 2,..., α m. How do we construct the RKHS? 29

30 Constructing the RKHS f(x) = f, k x Bootstrap everything from k(x, x ) = k x, k x for x, x X! 1. Anything outside span { k x x X } is uninteresting, so f = 2. To evaluate f(x ) use f(x) = f, k x = X β(x) k x, k x dx = X X β(x) k(x, x ) dx β(x) k x dx. 3. To compute f, f use f, f = β(x) β(x ) k x, k x dxdx = X X X X β(x) β(x ) k(x, x ) dxdx 4. Note that k x (x ) = k x, k x = k(x, x ) so we simply have k x = k(x, x ). 5. H is a particular instance of a feature space F if we set Φ(x) = k x. 30

31 Correspondence R reg [f] = 1 m L( f, k xi, y i ) + f, f H Kernel methods make sense from Regularization Theory point of view if kernel corresponds to sensible operator Ω[f] = f, f H. 31

32 Fourier regularization Fourier transform on R n : ˆf(ω) = 1 (2π) n/2 R n f(x) e iω x dx Inverse trasform: f(x) = 1 (2π) n/2 C n ˆf(x) e iω x dω Fourier regularization: Ω[f] = f, f H = e σ2 ω 2 /2 ˆf(ω) 2 dω Corresponding kernel: k(x, x ) = e x x 2 /(2σ 2 ) The Gaussian kernel will heavily penalize non-smooth functions! 32

33 Other kernels Homogeneous polynomial: k(x, x ) = (x x ) p Non-homogeneous polynomial: k(x, x ) = (x x + 1) p tanh kernel: k(x, x ) = tanh(κ (() x x ) + δ) Triangular kernel: k(x, x ) = 1 (x x ) /d String kernels: k(string 1, string 2 ) Kernels on distributions: Fisher, etc. Diffusion kernels: k(x, x ) = [ e β ] x,x Similarity measure Regularization 33

34 Algorithms 34

35 Modularity of Hilbert space methods f = arg min 1 ftinh m L( f, k xi, y i ) + f, f }{{} H }{{} Determines algorithm Determines kernel Same algorithm (SVM) can be used in very different contexts by changing the kernel kernel engineering Regularization scheme can be studied independent of application (classi cation, regression, etc.) ANY kernel method can be formulated as one of these minimization problems 35

36 Soft margin SVM s Relax problem to learning continuous functions f : X R with hinge loss Then L(f(x), y) = C max (0, y f(x) + 1) f = arg min f H reduces to soft margin SVM [ f = arg min f H [ f, f + C 1 m ] L( f, k xi, y i ) + f, f H ] ξ i subject to Probably the most popular algorithm for classi cation. y i f(x i ) 1 ξ i 36

37 Kernel Regression If we set then f = arg min f H [ 1 m ] L( f, k xi, y i ) + f, f H reduces to soft kernel regression f = arg min f H [ f, f + C ] (ξ i + ξ i) subject to y i f(x i ) ɛ + ξ y i f(x i ) ɛ ξ 37

38 38

39 39

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

The Learning Problem and Regularization

The Learning Problem and Regularization 9.520 Class 02 February 2011 Computational Learning Statistical Learning Theory Learning is viewed as a generalization/inference problem from usually small sets of high dimensional, noisy data. Learning

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Statistical Learning Reading Assignments

Statistical Learning Reading Assignments Statistical Learning Reading Assignments S. Gong et al. Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (Chapt. 3, hard copy). T. Evgeniou, M. Pontil, and T. Poggio, "Statistical

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

A Kernel Between Sets of Vectors

A Kernel Between Sets of Vectors A Kernel Between Sets of Vectors Risi Kondor Tony Jebara Columbia University, New York, USA. 1 A Kernel between Sets of Vectors In SVM, Gassian Processes, Kernel PCA, kernel K de nes feature map Φ : X

More information

Lecture 10: Support Vector Machine and Large Margin Classifier

Lecture 10: Support Vector Machine and Large Margin Classifier Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

Non-linear Support Vector Machines

Non-linear Support Vector Machines Non-linear Support Vector Machines Andrea Passerini passerini@disi.unitn.it Machine Learning Non-linear Support Vector Machines Non-linearly separable problems Hard-margin SVM can address linearly separable

More information

ECE-271B. Nuno Vasconcelos ECE Department, UCSD

ECE-271B. Nuno Vasconcelos ECE Department, UCSD ECE-271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative

More information

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes About this class Maximum margin classifiers SVMs: geometric derivation of the primal problem Statement of the dual problem The kernel trick SVMs as the solution to a regularization problem Maximizing the

More information

Support Vector Machines. Machine Learning Fall 2017

Support Vector Machines. Machine Learning Fall 2017 Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce

More information

A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable problems Hard-margin SVM can address linearly separable problems Soft-margin SVM can address linearly separable problems with outliers Non-linearly

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Overview Motivation

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Support Vector Machine for Classification and Regression

Support Vector Machine for Classification and Regression Support Vector Machine for Classification and Regression Ahlame Douzal AMA-LIG, Université Joseph Fourier Master 2R - MOSIG (2013) November 25, 2013 Loss function, Separating Hyperplanes, Canonical Hyperplan

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

TUM 2016 Class 1 Statistical learning theory

TUM 2016 Class 1 Statistical learning theory TUM 2016 Class 1 Statistical learning theory Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Machine learning applications Texts Images Data: (x 1, y 1 ),..., (x n, y n ) Note: x i s huge dimensional! All

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Lecture 10: A brief introduction to Support Vector Machine

Lecture 10: A brief introduction to Support Vector Machine Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

Computational Learning Theory

Computational Learning Theory Computational Learning Theory Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Ordibehesht 1390 Introduction For the analysis of data structures and algorithms

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee The Learning Problem and Regularization 9.520 Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

More information

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003 Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)

More information

Part of the slides are adapted from Ziko Kolter

Part of the slides are adapted from Ziko Kolter Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,

More information

LECTURE NOTE #8 PROF. ALAN YUILLE. Can we find a linear classifier that separates the position and negative examples?

LECTURE NOTE #8 PROF. ALAN YUILLE. Can we find a linear classifier that separates the position and negative examples? LECTURE NOTE #8 PROF. ALAN YUILLE 1. Linear Classifiers and Perceptrons A dataset contains N samples: { (x µ, y µ ) : µ = 1 to N }, y µ {±1} Can we find a linear classifier that separates the position

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science Neural Networks Prof. Dr. Rudolf Kruse Computational Intelligence Group Faculty for Computer Science kruse@iws.cs.uni-magdeburg.de Rudolf Kruse Neural Networks 1 Supervised Learning / Support Vector Machines

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

More information

This is an author-deposited version published in : Eprints ID : 17710

This is an author-deposited version published in :   Eprints ID : 17710 Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Regularization Networks and Support Vector Machines

Regularization Networks and Support Vector Machines Advances in Computational Mathematics x (1999) x-x 1 Regularization Networks and Support Vector Machines Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio Center for Biological and Computational Learning

More information

Spectral Regularization

Spectral Regularization Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

Kernel Methods. Outline

Kernel Methods. Outline Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert

More information

Support Vector Machines with Example Dependent Costs

Support Vector Machines with Example Dependent Costs Support Vector Machines with Example Dependent Costs Ulf Brefeld, Peter Geibel, and Fritz Wysotzki TU Berlin, Fak. IV, ISTI, AI Group, Sekr. FR5-8 Franklinstr. 8/9, D-0587 Berlin, Germany Email {geibel

More information

Direct Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

Content. Learning. Regression vs Classification. Regression a.k.a. function approximation and Classification a.k.a. pattern recognition

Content. Learning. Regression vs Classification. Regression a.k.a. function approximation and Classification a.k.a. pattern recognition Content Andrew Kusiak Intelligent Systems Laboratory 239 Seamans Center The University of Iowa Iowa City, IA 52242-527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Introduction to learning

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March 23, 1999

More information

Graphs in Machine Learning

Graphs in Machine Learning Graphs in Machine Learning Michal Valko Inria Lille - Nord Europe, France TA: Pierre Perrault Partially based on material by: Mikhail Belkin, Jerry Zhu, Olivier Chapelle, Branislav Kveton October 30, 2017

More information

Representer theorem and kernel examples

Representer theorem and kernel examples CS81B/Stat41B Spring 008) Statistical Learning Theory Lecture: 8 Representer theorem and kernel examples Lecturer: Peter Bartlett Scribe: Howard Lei 1 Representer Theorem Recall that the SVM optimization

More information

Regularization via Spectral Filtering

Regularization via Spectral Filtering Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444 Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

On the V γ Dimension for Regression in Reproducing Kernel Hilbert Spaces. Theodoros Evgeniou, Massimiliano Pontil

On the V γ Dimension for Regression in Reproducing Kernel Hilbert Spaces. Theodoros Evgeniou, Massimiliano Pontil MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1656 May 1999 C.B.C.L

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information