IAML: Support Vector Machines


1 / 22 IAML: Support Vector Machines. Charles Sutton and Victor Lavrenko. School of Informatics. Semester 1.

2 / 22 Outline. Separating hyperplane with maximum margin. Non-separable training data. Expanding the input into a high-dimensional space. Support vector regression. Reading: W & F sec 6.3 (maximum margin hyperplane, nonlinear class boundaries), SVM handout. SV regression is not examinable.

3 / 22 Overview. Support vector machines are one of the most effective and widely used classification algorithms. SVMs are the combination of two ideas: maximum margin classification and the kernel trick. SVMs are a linear classifier, like logistic regression.

4 / 22 Recall: Dot Products. w · x is the length of the projection of x onto w (if w is a unit vector). (If you do not remember this, see the supplementary maths notes on the course Web site.)
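A quick numeric check of this fact (the vectors here are made up purely for illustration):

```python
import numpy as np

# Hypothetical vectors, chosen only to illustrate the projection identity.
w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)   # make w a unit vector
x = np.array([2.0, 1.0])

# The length of the projection of x onto the unit vector w is the dot product w . x
print(np.dot(w, x))         # 0.6*2 + 0.8*1 = 2.0
```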

5 / 22 Separating Hyperplane. For any linear classifier: training instances (x_i, y_i), i = 1, ..., n, with y_i ∈ {−1, +1}. The hyperplane is w · x + w_0 = 0; points with w · x + w_0 > 0 lie on one side and points with w · x + w_0 < 0 on the other.
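A minimal sketch of this decision rule (the hyperplane (w, w_0) and the points are invented for illustration):

```python
import numpy as np

# Hypothetical hyperplane parameters and test points.
w, w0 = np.array([1.0, -2.0]), 0.5
X = np.array([[3.0, 0.0],    # w . x + w0 = +3.5  -> class +1
              [0.0, 2.0]])   # w . x + w0 = -3.5  -> class -1

# A linear classifier labels each point by the side of the hyperplane it falls on.
print(np.sign(X @ w + w0))   # [ 1. -1.]
```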

6 / 22 Maximum margin. Let the perpendicular distance from the hyperplane to the nearest +1-class point be d_+. Similarly, for the nearest −1-class point the perpendicular distance is d_−. The margin is defined as min(d_+, d_−). The support vector machine algorithm looks for the (w, w_0) that gives rise to the maximum margin. At the max-margin solution, it must be true that d_+ = d_−.

7 / 22 Illustration of the margin. (Figure: the separating hyperplane, the weight vector w, and the margin.)

8 / 22 How to compute the margin using dot products.

9 / 22 Max-margin as an optimization problem. Our goal will be to come up with a constrained optimization problem, because then we can use standard technology to solve it. (By standard technology I mean fancy versions of the algorithms we learned in the optimization lecture.) At a high level, what we want is an optimization problem that says: find the w with maximum margin, subject to the constraints that all of the training examples are classified correctly. You could try to do this naively, e.g., maximize d_+ + d_−, etc. Instead we're going to do something a bit more clever. The reason is that optimizers like to see smooth, convex objective functions and constraints. Linear is even better; non-differentiable is to be avoided if possible.

10 / 22 Our first clever trick. Note that (w, w_0) and (c w, c w_0) define the same hyperplane. This is like saying a margin of 1000 mm > 1 m. Remove the rescaling freedom by demanding that min_i |w · x_i + w_0| = 1. With this normalization the nearest points sit exactly on the margin boundaries, so instead of maximizing min(d_+, d_−) we can maximize d_+ + d_−. Now we have three types of constraints on w: w · x_i + w_0 ≥ 0 for y_i = +1; w · x_i + w_0 ≤ 0 for y_i = −1; and min_i |w · x_i + w_0| = 1. We can simplify these in one fell swoop: these three constraints are equivalent to the much simpler y_i (w · x_i + w_0) ≥ +1 for all i.

11 / 22 A second trick. It turns out that the margin is 1/||w||. Proof: for two points x_+ and x_− on the margin boundaries, w · x_+ + w_0 = 1 and w · x_− + w_0 = −1, thus w · (x_+ − x_−) = 2 and (w/||w||) · (x_+ − x_−) = 2/||w||, i.e. d_+ + d_− = 2/||w||. Note that we have assumed that the constraints of the optimization problem are satisfied. (We don't care what the margin is if they aren't, since that won't be a solution.)
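A small numeric check of this argument (the values of w, w_0 and the boundary points are made up, chosen so that the constraints hold with equality):

```python
import numpy as np

# Hypothetical normalized hyperplane: the nearest points sit on w . x + w0 = +/-1.
w, w0 = np.array([2.0, 0.0]), 0.0
x_plus  = np.array([ 0.5, 1.0])   # w . x_plus  + w0 = +1
x_minus = np.array([-0.5, 3.0])   # w . x_minus + w0 = -1

# Projecting the difference onto the unit normal w/||w|| gives d+ + d-.
gap = (w / np.linalg.norm(w)) @ (x_plus - x_minus)
print(gap, 2 / np.linalg.norm(w))   # both 1.0, i.e. d+ + d- = 2/||w||
```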

12 / 22 The SVM optimization problem. Note that maximizing 2/||w|| is equivalent to minimizing ||w||². So the SVM weights are determined by solving the optimization problem:
min_w ||w||²  subject to  y_i (w · x_i + w_0) ≥ +1 for all i.
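In practice this problem is handed to a solver. A sketch using scikit-learn (an assumption: the library is available; the toy data are made up), where a very large C approximates the hard-margin problem on separable data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, invented for illustration.
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [3.0, 0.5], [4.0, 1.0], [3.5, -0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# With a very large C, SVC approximates min ||w||^2 s.t. y_i(w . x_i + w_0) >= 1
# (libsvm solves the corresponding dual QP internally).
clf = SVC(kernel='linear', C=1e6).fit(X, y)
w, w0 = clf.coef_[0], clf.intercept_[0]

print(y * (X @ w + w0))          # all >= 1 (up to numerical tolerance)
print(2 / np.linalg.norm(w))     # the resulting margin width d+ + d-
```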

13 / 22 Finding the optimum. The optimal hyperplane can be computed from a quadratic programming problem using Lagrange multipliers: w = Σ_i α_i y_i x_i. This uses fancy numerical techniques from the optimization literature. The optimal hyperplane is determined by just a few examples: call these support vectors; α_i = 0 for non-support patterns. The optimization problem has no local minima (like logistic regression). Prediction on a new data point x:
f(x) = sgn(w · x + w_0) = sgn(Σ_i α_i y_i (x_i · x) + w_0).
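scikit-learn exposes the dual solution, so the prediction formula above can be reproduced directly (a sketch on the same toy data as above; the attribute names are scikit-learn's, and dual_coef_ stores α_i y_i for the support vectors only):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [3.0, 0.5], [4.0, 1.0], [3.5, -0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)

sv      = clf.support_vectors_   # the few x_i with alpha_i > 0
alpha_y = clf.dual_coef_[0]      # alpha_i * y_i for those support vectors
w0      = clf.intercept_[0]

# f(x) = sgn( sum_i alpha_i y_i (x_i . x) + w_0 ), summing over support vectors only.
x_new = np.array([1.5, 1.5])
print(np.sign(alpha_y @ (sv @ x_new) + w0), clf.predict([x_new]))   # the two agree
```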

14 / 22 Non-separable training sets. If the data set is not linearly separable, the optimization problem above has no solution. Solution: add a slack variable ξ_i ≥ 0 for each training example. The new optimization problem is to minimize ||w||² + C (Σ_i ξ_i^k) subject to the constraints w · x_i + w_0 ≥ 1 − ξ_i for y_i = +1 and w · x_i + w_0 ≤ −1 + ξ_i for y_i = −1. Usually we set k = 1. C is a trade-off parameter, picked by hand (see below); a large C gives a large penalty to errors.
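A sketch of the trade-off that C controls (scikit-learn's C is this same parameter; the overlapping data are generated at random purely for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, non-separable data: two overlapping Gaussian clouds.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [1.5, 1.5]])
y = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w, w0 = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0, 1 - y * (X @ w + w0))    # xi_i = max(0, 1 - y_i(w . x_i + w_0))
    print(C, 2 / np.linalg.norm(w), slack.sum())   # margin width vs total slack
```

Larger C typically buys less total slack at the price of a narrower margin.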

15 / 22 (Figure: the soft-margin hyperplane, the weight vector w, the margin, and slack ξ for points that violate the margin.)

16 / 22 Non-linear SVMs. Transform x to φ(x). The linear algorithm depends only on x · x_i; hence the transformed algorithm depends only on φ(x) · φ(x_i). Use a kernel function k(x_i, x_j) such that k(x_i, x_j) = φ(x_i) · φ(x_j). (This is called the kernel trick, and it can be used with a wide variety of learning algorithms, not just max margin.) Example 1: for a 2-d input space, φ(x) = (x_1², √2 x_1 x_2, x_2²), with k(x_i, x_j) = (x_i · x_j)².
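Example 1 can be checked numerically; a quick sketch (the two points are arbitrary) showing that the explicit map φ and the kernel give the same inner product:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-d input: (x1^2, sqrt(2) x1 x2, x2^2).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(xi, xj):
    # The kernel computes the same inner product without ever building phi.
    return (xi @ xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(xi) @ phi(xj), k(xi, xj))   # both 1.0
```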

17 / 22 (Figure: the map φ from input space to feature space. Figure credit: Bernhard Schölkopf.) Example 2: k(x_i, x_j) = exp(−||x_i − x_j||² / α²). In this case the dimension of φ is infinite. To test a new input x:
f(x) = sgn(Σ_i α_i y_i k(x_i, x) + w_0).
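The same prediction formula can be reproduced for this kernel with scikit-learn (a sketch on made-up data; note that scikit-learn parameterizes the RBF kernel as exp(−γ ||x_i − x_j||²), so γ plays the role of 1/α²):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Made-up data with a non-linear boundary: points outside the unit circle are +1.
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

gamma = 0.5                                   # gamma = 1/alpha^2 in the slide's notation
clf = SVC(kernel='rbf', gamma=gamma, C=10.0).fit(X, y)

# f(x) = sgn( sum_i alpha_i y_i k(x_i, x) + w_0 ), computed by hand from the dual solution.
x_new = np.array([[0.2, 0.1]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)   # k(x_i, x_new) for each SV
print(np.sign(clf.dual_coef_[0] @ K[:, 0] + clf.intercept_[0]), clf.predict(x_new))
```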

18 / 22 Prediction / Applications. (Figure credit: Bernhard Schölkopf.) For a new example x, classification is f(x) = sgn(Σ_i λ_i k(x, x_i) + b): the input vector x is compared to the support vectors x_1, ..., x_4 via the kernel k(x, x_i), the results are multiplied by the weights λ_1, ..., λ_4, summed, and thresholded. Example kernels: k(x, x_i) = (x · x_i)^d, k(x, x_i) = exp(−||x − x_i||² / c), k(x, x_i) = tanh(κ (x · x_i) + θ).

19 / 22 Choosing φ and C. There are theoretical results; however, in practice cross-validation methods are commonly used.
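With scikit-learn this is typically a cross-validated grid search over C and the kernel parameter (a sketch; the grid values and the toy data set are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Made-up non-linear data; in practice substitute your own training set.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

param_grid = {'C': [0.1, 1, 10, 100],          # slack penalty
              'gamma': [0.01, 0.1, 1, 10]}     # RBF kernel width
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```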

20 / 22 Example application. US Postal Service digit data (7291 examples, 16 × 16 images). Three SVMs using polynomial, RBF and MLP-type kernels were used (see Schölkopf and Smola, Learning with Kernels, 2002, for details). They use almost the same (≈ 90% overlap) small sets (≈ 4% of the database) of SVs. All systems perform well (≈ 4% error). Many other applications, e.g. text categorization, face detection, DNA analysis.
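The USPS data are not bundled with scikit-learn, but the flavour of the experiment can be sketched on the bundled 8 × 8 digits set as a stand-in (this is not the experiment from the slide, and no particular error rate is implied):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for the USPS digits: scikit-learn's 8x8 handwritten digits (1797 examples).
X, y = load_digits(return_X_y=True)

for kernel in ['poly', 'rbf', 'sigmoid']:      # polynomial, RBF, and MLP-type kernels
    scores = cross_val_score(SVC(kernel=kernel, gamma='scale'), X, y, cv=5)
    print(kernel, scores.mean())
```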

21 / 22 Comparison with linear and logistic regression. The underlying basic idea of linear prediction is the same, but the error functions differ. Logistic regression (non-sparse solution) vs SVM (hinge loss, sparse solution). Linear regression (squared error) vs ε-insensitive error. Linear regression and logistic regression can be kernelized too.

22 / 22 SVM summary. SVMs are the combination of max-margin and the kernel trick. They learn linear decision boundaries (like logistic regression and perceptrons), picking the hyperplane that maximizes the margin and using slack variables to deal with non-separable data. The optimal hyperplane can be written in terms of support patterns. Transform to a higher-dimensional space using kernel functions. Good empirical results on many problems. SVMs appear to avoid overfitting in high-dimensional spaces (cf. regularization).