TDT 4173 Machine Learning and Case-Based Reasoning. Helge Langseth and Agnar Aamodt. NTNU IDI, Section for Intelligent Systems

TDT 4173 Machine Learning and Case-Based Reasoning, Lecture 6: Support Vector Machines. Ensemble Methods. Helge Langseth and Agnar Aamodt, NTNU IDI, Section for Intelligent Systems

Outline
1 Wrap-up from last time
2 Support Vector Machines: Background; Linear separators; The dual problem; Non-separable subspaces; Nonlinearity and kernels
3 Ensemble methods: Background; Bagging; Boosting

Support Vector Machines: Support Vector Machines (SVMs) and Kernel Methods. Paper by Bennett and Campbell.

Background: Description of the task
Data:
1. We have a set of data $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$. The instances are described by $x_i$; the class is $y_i$.
2. The data is generated by some unknown probability distribution $P(x, y)$.
Task:
1. Be able to guess $y$ at a new location $x$.
2. For SVMs one typically states this as: find an unknown function $f(x)$ that estimates $y$ at $x$.
3. Note! In this lesson we look at binary classification, and let $y \in \{-1, +1\}$ denote the classes.
4. We will look for linear functions, i.e., $f(x) = b + w^\top x = b + \sum_{i=1}^{m} w_i x_i$.

Linear separators: How to find the best linear separator
- We are looking for a linear separator for this data.
- There are so many solutions...
- But only one is considered the best!
- SVMs are called large margin classifiers...
- ... and the data points touching the margin lines are the support vectors.

Linear separators: The geometry of the problem
The separator and its margins are the hyperplanes $\{x : b + w^\top x = -1\}$, $\{x : b + w^\top x = 0\}$ and $\{x : b + w^\top x = +1\}$, with normal vector $w$.
Note! Since one margin line has $b + w^\top x = +1$ and the other has $b + w^\top x = -1$, the distance between them is $2/\|w\|$.
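For completeness, a short check of the stated distance (this step is not spelled out on the slide): take $x_+$ on the plane $b + w^\top x = +1$ and $x_-$ on $b + w^\top x = -1$; subtracting the two plane equations gives $w^\top(x_+ - x_-) = 2$, so the separation measured along the unit normal $w/\|w\|$ is $w^\top(x_+ - x_-)/\|w\| = 2/\|w\|$.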

Linear separators: An optimisation problem
Optimisation criteria: The distance between the margins is $2/\|w\|$, so that is what we want to maximise. Equivalently, we can minimise $\|w\|/2$. For simplicity of the mathematics, we will rather minimise $\|w\|^2/2$.
Constraints: The margin separates all data observations correctly:
- $b + w^\top x_i \leq -1$ for $y_i = -1$.
- $b + w^\top x_i \geq +1$ for $y_i = +1$.
Alternative (equivalent) constraint set: $y_i(b + w^\top x_i) \geq 1$.

Linear separators: An optimisation problem (2)
Mathematical programming setting: Combining the above requirements we obtain
minimize wrt. $w$ and $b$: $\frac{1}{2}\|w\|^2$
subject to $y_i(b + w^\top x_i) - 1 \geq 0$, $i = 1, \ldots, m$.
Properties:
- The problem is convex.
- Hence it has a unique minimum.
- Efficient algorithms for solving it exist.
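As an aside (not part of the lecture slides), here is a minimal sketch of this primal problem in Python, using the cvxpy convex-optimisation library on made-up, linearly separable toy data; all names and parameter values below are illustrative assumptions.

```python
# A minimal sketch: the hard-margin primal problem solved with cvxpy on toy data.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+3.0, size=(20, 2)),    # class +1
               rng.normal(-3.0, size=(20, 2))])   # class -1
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (b + w^T x_i) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # the maximum-margin separator for this toy data
```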

The dual problem: The dual problem and the convex hull
The convex hull of $\{x_j\}$: the smallest subset of the instance space that is convex and contains all elements of $\{x_j\}$ is the convex hull of $\{x_j\}$. Find it by drawing lines between all $x_j$ and choosing the outermost boundary.

The dual problem: The dual problem and the convex hull (2)
Look at the difference between the closest points of the two convex hulls, $c$ and $d$. The decision line must be orthogonal to the line between these two closest points. So, we want to minimise $\|c - d\|$. $c$ can be written as a weighted sum of all elements in the green class: $c = \sum_{i: y_i = \text{Class 1}} \alpha_i x_i$, and similarly for $d$.

The dual problem: The dual problem and the convex hull (3)
Minimising $\|c - d\|$ is (modulo a constant) equivalent to this formulation:
minimize wrt. $\alpha$: $\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^\top x_j - \sum_{i=1}^{m} \alpha_i$
subject to $\sum_{i=1}^{m} y_i \alpha_i = 0$ and $\alpha_i \geq 0$, $i = 1, \ldots, m$.
Properties:
- The problem is convex, hence has a unique minimum.
- It is a quadratic programming problem, so known solution methods exist.
- At the solution: $\alpha_i > 0$ only if $x_i$ is a support vector.
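A companion sketch of the dual (again an illustration, not from the lecture): the double-sum objective equals $\frac{1}{2}\|\sum_i \alpha_i y_i x_i\|^2 - \sum_i \alpha_i$, which is the form used below so that cvxpy accepts it directly.

```python
# Sketch of the hard-margin dual with cvxpy (same made-up toy data as above).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+3.0, size=(20, 2)), rng.normal(-3.0, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m = len(y)

alpha = cp.Variable(m)
# 0.5 * || sum_i alpha_i y_i x_i ||^2  -  sum_i alpha_i
objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

support = alpha.value > 1e-6                 # alpha_i > 0 only for support vectors
print(int(support.sum()), "support vectors")
w = X.T @ (alpha.value * y)                  # recover w from the dual solution
print(w)
```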

The dual problem: Theoretical foundation
1. Formal proofs of SVM properties are available (but out of scope for us).
2. Large separators are smart: if we have small variations in $x$, we will still classify correctly.
3. There are many skinny margin planes, but only one if you look for the fattest plane; the solution is thus more robust.

Non-separable subspaces: What if the convex hulls are overlapping?
- If the convex hulls are overlapping, we cannot find a linear separator.
- To handle this, we optimise a criterion where we maximise the distance between the margin lines minus a penalty for misclassifications.
- This is equivalent to scaling the convex hulls and proceeding as before on the reduced convex hulls.

Non-separable subspaces: What if the convex hulls are overlapping? (2)
The problem with scaling is (modulo a constant) equivalent to this formulation:
minimize wrt. $\alpha$: $\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^\top x_j - \sum_{i=1}^{m} \alpha_i$
subject to $\sum_{i=1}^{m} y_i \alpha_i = 0$ and $C \geq \alpha_i \geq 0$, $i = 1, \ldots, m$.
Properties:
- The problem is as before, but $C$ introduces the scaling; this is equivalent to incurring a cost for misclassification.
- Still solvable using standard methods.
Demo: Different values of C: http://svm.dcs.rhbnc.ac.uk/pagesnew/gpat.shtml
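To mirror the demo (again a sketch, not part of the lecture): scikit-learn's SVC solves this box-constrained dual, so varying C on made-up overlapping data shows the same trade-off.

```python
# Effect of C on overlapping (non-separable) made-up data, via scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1.0, size=(30, 2)), rng.normal(-1.0, size=(30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])   # classes overlap: not separable

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C tolerates more misclassification and typically keeps more support vectors.
    print(C, len(clf.support_vectors_))
```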

Support Vector Machines: Statistical Learning Theory
- Misclassification error and the function complexity bound generalization error.
- Maximizing margins minimizes complexity. Eliminates overfitting.
- Solution depends only on support vectors, not the number of attributes.

Nonlinearity and kernels: Nonlinear problems, when scaling does not make sense
The problem is difficult to solve when $x = (r, s)$ has only two dimensions... but if we blow it up to five dimensions, $\theta(x) = (r, s, rs, r^2, s^2)$, i.e. invent the mapping $\theta(\cdot): \mathbb{R}^2 \to \mathbb{R}^5$, and try to find the linear separator in $\mathbb{R}^5$, then everything is OK.
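A tiny sketch of the mapping from the slide (the function name theta is just a label for this illustration):

```python
# The explicit feature map theta: R^2 -> R^5 from the slide, with x = (r, s).
import numpy as np

def theta(x):
    r, s = x
    return np.array([r, s, r * s, r ** 2, s ** 2])

x = np.array([0.5, -1.0])
print(theta(x))   # [ 0.5  -1.   -0.5   0.25  1.  ]
```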

Nonlinearity and kernels: Solving the problem in higher dimensions
We solve this as before, but remembering to look in the higher dimension:
minimize wrt. $\alpha$: $\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \theta(x_i)^\top \theta(x_j) - \sum_{i=1}^{m} \alpha_i$
subject to $\sum_{i=1}^{m} y_i \alpha_i = 0$ and $C \geq \alpha_i \geq 0$, $i = 1, \ldots, m$.
Note that:
- We do not need to evaluate $\theta(x)$ directly, only $\theta(x_i)^\top \theta(x_j)$.
- If we find a clever way of evaluating $\theta(x_i)^\top \theta(x_j)$ (i.e., independent of the size of the target space), we can solve the problem easily, without even thinking about what $\theta(x)$ means.
- We define $K(x_i, x_j) = \theta(x_i)^\top \theta(x_j)$ and focus on finding $K(\cdot, \cdot)$ instead of the mapping. $K$ is called a kernel.
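A numerical illustration of this idea (the specific feature map below is an assumption, chosen because its inner product reproduces the degree-2 polynomial kernel; it is not given on the slides):

```python
# Verifying K(x, z) = phi(x)^T phi(z) for the degree-2 polynomial kernel in R^2.
import numpy as np

def phi(x):
    r, s = x
    return np.array([1.0,
                     np.sqrt(2) * r, np.sqrt(2) * s,
                     r ** 2, s ** 2,
                     np.sqrt(2) * r * s])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])

kernel_value = (x @ z + 1.0) ** 2      # evaluated in R^2, no mapping needed
explicit_value = phi(x) @ phi(z)       # evaluated via the explicit map into R^6

print(kernel_value, explicit_value)    # both are (approximately) 42.25
```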

Nonlinearity and kernels: Kernel functions
Kernel                      $K(x_i, x_j)$
Degree-d polynomial         $(x_i^\top x_j + 1)^d$
Radial Basis Functions      $\exp\!\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$
Two-layer Neural Network    $\mathrm{sigmoid}(\eta\, x_i^\top x_j + c)$
Different kernels have different properties, and finding the right kernel is a difficult task and can be hard to visualise.
Example: The RBF kernel uses (implicitly) an infinite-dimensional representation for $\theta(\cdot)$.
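For reference, the three kernels from the table written as plain functions (a sketch; the hyperparameter values are arbitrary, and tanh is used as the sigmoid):

```python
# The kernels from the table as plain Python functions (parameter values are examples).
import numpy as np

def poly_kernel(xi, xj, d=2):
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def nn_kernel(xi, xj, eta=0.1, c=0.0):
    return np.tanh(eta * (xi @ xj) + c)   # tanh as the sigmoid

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(xi, xj), rbf_kernel(xi, xj), nn_kernel(xi, xj))
```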

Nonlinearity and kernels: SVMs, algorithmic summary
- Select the parameter $C$ (tradeoff between minimising training set error and maximising the margin).
- Select the kernel function and its associated parameters (e.g., $\sigma$ for RBF).
- Solve the optimisation problem using quadratic programming.
- Find the value $b$ by using the support vectors.
- Classify a new point $x$ using $f(x) = \mathrm{sign}\left(\sum_{i=1}^{m} y_i \alpha_i K(x, x_i) + b\right)$.
Demo: Different kernels: http://svm.dcs.rhbnc.ac.uk/pagesnew/gpat.shtml
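A compact end-to-end sketch of these steps using scikit-learn's SVC (the dataset and the particular values of C and gamma are made up; the lecture does not prescribe this library):

```python
# End-to-end: choose C and kernel, fit (QP solved internally, b included), classify.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # nonlinear (circular) boundary

clf = SVC(C=1.0, kernel="rbf", gamma=0.5)   # gamma corresponds to 1/(2*sigma^2)
clf.fit(X, y)                               # solves the QP and determines b

x_new = np.array([[0.2, 0.1], [2.0, -1.5]])
print(clf.predict(x_new))                   # sign of the decision function, e.g. [-1  1]
print(len(clf.support_vectors_))            # only support vectors define the solution
```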

Support Vector Machines: SVM Extensions
- Regression
- Variable Selection
- Boosting
- Density Estimation
- Unsupervised Learning
- Novelty/Outlier Detection
- Feature Detection
- Clustering

Support Vector Machines: Many Other Applications
- Speech Recognition
- Data Base Marketing
- Quark Flavors in High Energy Physics
- Dynamic Object Recognition
- Knock Detection in Engines
- Protein Sequence Problem
- Text Categorization
- Breast Cancer Diagnosis
See: http://www.clopinet.com/isabelle/projects/SVM/applist.html

Support Vector Machines: Hallelujah!
- Generalization theory and practice meet.
- General methodology for many types of problems.
- Same Program + New Kernel = New method.
- No problems with local minima.
- Few model parameters. Selects capacity.
- Robust optimization methods.
- Successful applications.
BUT

Support Vector Machines: HYPE?
- Will SVMs beat my best hand-tuned method Z for X?
- Do SVMs scale to massive datasets?
- How to choose C and the kernel?
- What is the effect of attribute scaling?
- How to handle categorical variables?
- How to incorporate domain knowledge?
- How to interpret results?