IAML: Support Vector Machines II
Nigel Goddard
School of Informatics
Semester 1

In SVM I we saw:
- The max margin trick
- Geometry of the margin and how to compute it
- Finding the max margin hyperplane using a constrained optimization problem
- Max margin = min norm

This time:
- The SVM optimization problem
- Non-separable data
- The kernel trick

The SVM optimization problem

Last time: the max margin weights can be computed by solving a constrained optimization problem:

min_w ||w||^2   s.t.   y_i (w · x_i + w_0) >= +1 for all i

Many algorithms have been proposed to solve this. One of the earliest efficient algorithms is called SMO [Platt, 1998]. This is outside the scope of the course, but it does explain the name of the SVM method in Weka.
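
As an illustration only (not the SMO algorithm the slide refers to), the sketch below hands the same constrained problem to a generic solver on a tiny made-up data set; the data, the variable names and the use of scipy's SLSQP method are assumptions for the example.

```python
# A minimal sketch: solve the hard-margin problem
#     min ||w||^2   s.t.   y_i (w · x_i + w_0) >= 1 for all i
# with a generic constrained optimizer on a tiny, made-up separable data set.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:-1]                  # params = (w_1, w_2, w_0)
    return w @ w                     # ||w||^2; w_0 is not penalized

def margin_constraints(params):
    w, w0 = params[:-1], params[-1]
    return y * (X @ w + w0) - 1.0    # each entry must end up >= 0

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, w0 = res.x[:-1], res.x[-1]
print("w =", w, " w_0 =", w0)
print("margins:", y * (X @ w + w0))  # all >= 1 (up to solver tolerance)
```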

Finding the optimum

If you go through some advanced maths (Lagrange multipliers, etc.), it turns out that you can show something remarkable. The optimal parameters look like

w = Σ_i α_i y_i x_i

Furthermore, the solution is sparse. The optimal hyperplane is determined by just a few examples: call these the support vectors.

- α_i = 0 for non-support patterns
- The optimization problem to find the α_i has no local minima (like logistic regression)

Why a solution of this form? If you move the points not on the marginal hyperplanes, the solution doesn't change; therefore those points don't matter.

Prediction on one data point:

f(x) = sign(w · x + w_0) = sign( Σ_{i=1}^n α_i y_i (x · x_i) + w_0 )
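
The sparsity claim and the dual form of the prediction are easy to check numerically. A minimal sketch, assuming scikit-learn and a made-up two-blob data set (a linear kernel with a very large C stands in for the hard-margin problem); none of this is from the lecture:

```python
# A minimal sketch of "the solution is sparse" and of the dual-form prediction.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Sparsity: dual_coef_ stores α_i y_i for the support vectors only; every
# other training point has α_i = 0 and does not affect the hyperplane.
print("support vectors:", len(clf.support_), "out of", len(X))

# w = Σ_i α_i y_i x_i, reconstructed from the support vectors, matches the
# weight vector the solver reports.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))            # True

# Prediction on one data point: sign(Σ_i α_i y_i (x · x_i) + w_0).
x_new = np.array([1.0, 1.5])
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(np.sign(score)[0] == clf.predict([x_new])[0])   # True
```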

Non-separable training sets

If the data set is not linearly separable, the optimization problem that we have given has no solution:

min_w ||w||^2   s.t.   y_i (w · x_i + w_0) >= +1 for all i

Why?

Solution: Don't require that we classify all points correctly. Allow the algorithm to choose to ignore some of the points. This is obviously dangerous (why not ignore all of them?) so we need to give it a penalty for doing so.

Slack

Solution: Add a slack variable ξ_i >= 0 for each training example. If the slack variable is high, we get to relax the constraint, but we pay a price. The new optimization problem is to minimize

||w||^2 + C (Σ_{i=1}^n ξ_i)^k

subject to the constraints

w · x_i + w_0 >= 1 - ξ_i    for y_i = +1
w · x_i + w_0 <= -1 + ξ_i   for y_i = -1

Usually we set k = 1. C is a trade-off parameter: a large C gives a large penalty to errors. The solution has the same form, but the support vectors now also include all points where ξ_i > 0. Why?

Think about ridge regression again

Our max margin + slack optimization problem is to minimize ||w||^2 + C (Σ_{i=1}^n ξ_i)^k subject to the same constraints as above. This looks even more like ridge regression than the non-slack problem:
- C (Σ_i ξ_i)^k measures how well we fit the data
- ||w||^2 penalizes weight vectors with a large norm

So C can be viewed as a regularization parameter, like λ in ridge regression or regularized logistic regression. You're allowed to make this tradeoff even when the data set is separable!
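
To see what the C knob does in practice, here is a minimal sketch using scikit-learn's soft-margin SVC on made-up overlapping data; the data, the values of C, and the slack computation are assumptions for illustration, not from the lecture.

```python
# A minimal sketch of the C trade-off on overlapping data (no hard margin
# exists here, so slack is required).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 2) + [1, 1], rng.randn(100, 2) - [1, 1]])
y = np.hstack([np.ones(100), -np.ones(100)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, w0 = clf.coef_[0], clf.intercept_[0]
    # ξ_i = max(0, 1 - y_i (w · x_i + w_0)) at the optimum
    slack = np.maximum(0.0, 1.0 - y * (X @ w + w0))
    print(f"C={C:>6}: ||w|| = {np.linalg.norm(w):.2f}, "
          f"support vectors = {len(clf.support_)}, total slack = {slack.sum():.1f}")
```

Small C tolerates a lot of slack (large total ξ, many support vectors, small ||w||); large C drives the slack down at the cost of a larger ||w||.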

Why you might want slack in a separable data set

[Figure: example data set]

Non-linear SVMs

SVMs can be made nonlinear just like any other linear algorithm we've seen (i.e., using a basis expansion). But in an SVM, the basis expansion is implemented in a very special way, using something called a kernel. The reason for this is that kernels can be faster to compute with if the expanded feature space is very high dimensional (even infinite)! This is a fairly advanced topic mathematically, so we will just go through a high-level version.

Kernel

A kernel is in some sense an alternate API for specifying to the classifier what your expanded feature space is. Up to now, we have always given the classifier a new set of training vectors φ(x_i) for all i, e.g., just as a list of numbers.

φ : R^d → R^D

If D is large, this will be expensive; if D is infinite, this will be impossible.

Non-linear SVMs

Transform x to φ(x). The linear algorithm depends only on x · x_i, hence the transformed algorithm depends only on φ(x) · φ(x_i). Use a kernel function k(x_i, x_j) such that

k(x_i, x_j) = φ(x_i) · φ(x_j)

(This is called the kernel trick, and can be used with a wide variety of learning algorithms, not just max margin.)
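
As a concrete check of this identity, the sketch below trains one SVM on explicitly expanded features φ(x) and another with the matching kernel k(x, z) = (x · z)^2 (the quadratic map that appears as Example 1 below) and confirms they make the same predictions. The toy data, the scikit-learn calls, and the callable-kernel usage are assumptions for illustration.

```python
# A minimal sketch of the kernel trick: explicit basis expansion vs. kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(2)
X = rng.randn(80, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular boundary: nonlinear in x

def phi(X):
    # φ(x) = (x_1^2, √2 x_1 x_2, x_2^2), so φ(x) · φ(z) = (x · z)^2
    return np.column_stack([X[:, 0] ** 2,
                            np.sqrt(2.0) * X[:, 0] * X[:, 1],
                            X[:, 1] ** 2])

# (a) explicit basis expansion, then a linear SVM in the 3-d feature space
svm_explicit = SVC(kernel="linear", C=1.0).fit(phi(X), y)

# (b) kernel SVM: never builds φ(x), only evaluates (x · z)^2 for pairs of points
svm_kernel = SVC(kernel=lambda A, B: (A @ B.T) ** 2, C=1.0).fit(X, y)

X_test = rng.randn(10, 2)
print(svm_explicit.predict(phi(X_test)))
print(svm_kernel.predict(X_test))   # same predictions (up to numerical tolerance)
```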

Example of kernel

Example 1: for a 2-d input space, let

φ(x_i) = ( x_{i,1}^2,  √2 x_{i,1} x_{i,2},  x_{i,2}^2 )^T

Then

k(x_i, x_j) = φ(x_i) · φ(x_j) = (x_i · x_j)^2

Kernels, dot products, and distance

The squared Euclidean distance between two vectors can be computed using dot products:

d(x_1, x_2) = (x_1 - x_2)^T (x_1 - x_2) = x_1^T x_1 - 2 x_1^T x_2 + x_2^T x_2

Using a linear kernel k(x_1, x_2) = x_1^T x_2 we can rewrite this as

d(x_1, x_2) = k(x_1, x_1) - 2 k(x_1, x_2) + k(x_2, x_2)

Any kernel gives you an associated distance measure this way. Think of a kernel as an indirect way of specifying distances.

Support Vector Machine

A support vector machine is a kernelized maximum margin classifier.

[Figure: the kernelized classifier f(x) = sgn(Σ_i λ_i k(x, x_i) + b) drawn as a network: an input vector x is compared to the support vectors x_1 ... x_4 via k(x, x_i), and the comparisons are weighted by λ_1 ... λ_4 and summed. Example kernels: k(x, x_i) = (x · x_i)^d, k(x, x_i) = exp(-||x - x_i||^2 / c), k(x, x_i) = tanh((x · x_i) + θ). Figure credit: Bernhard Schölkopf]

Prediction on one example

For max margin remember that we had the magic property w = Σ_i α_i y_i x_i. This means we would predict the label of a test example x as

ŷ = sign[ w^T x + w_0 ] = sign[ Σ_i α_i y_i x_i^T x + w_0 ]

Kernelizing this we get

ŷ = sign[ Σ_i α_i y_i k(x_i, x) + b ]
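
Both facts above, the kernel-induced distance and the kernelized prediction, can be verified numerically. A minimal sketch, assuming scikit-learn, made-up data, and an RBF kernel with gamma = 1 (none of these choices come from the lecture):

```python
# (1) kernel-induced distance; (2) hand-written kernelized prediction vs. SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(3)

# (1) d(x_1, x_2) = k(x_1, x_1) - 2 k(x_1, x_2) + k(x_2, x_2), with k(a, b) = a · b
x1, x2 = rng.randn(2), rng.randn(2)
k = lambda a, b: a @ b
d = k(x1, x1) - 2.0 * k(x1, x2) + k(x2, x2)
print(np.isclose(d, np.sum((x1 - x2) ** 2)))             # True: squared Euclidean distance

# (2) ŷ = sign(Σ_i α_i y_i k(x_i, x) + b), written out from the stored support vectors
X = rng.randn(60, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

x_new = rng.randn(2)
k_vals = np.exp(-1.0 * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))  # RBF, gamma = 1
score = clf.dual_coef_[0] @ k_vals + clf.intercept_[0]    # dual_coef_ holds α_i y_i
print(np.sign(score) == clf.predict([x_new])[0])          # True
```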

[Figure: input space mapped to feature space. Figure credit: Bernhard Schölkopf]

Example 2

k(x_i, x_j) = exp( -||x_i - x_j||^2 / α^2 )

In this case the dimension of φ is infinite, i.e., it can be shown that no φ that maps into a finite-dimensional space will give you this kernel. We can never calculate φ(x), but the algorithm only needs us to calculate k for different pairs of points. To test a new input x:

f(x) = sgn( Σ_i α_i y_i k(x_i, x) + w_0 )

Choosing φ, C

There are theoretical results, but we will not cover them. (If you want to look them up, there are actually upper bounds on the generalization error: look for VC-dimension and structural risk minimization.) However, in practice cross-validation methods are commonly used.

Example application

US Postal Service digit data (7291 examples, 16 x 16 images). Three SVMs using polynomial, RBF and MLP-type kernels were used (see Schölkopf and Smola, Learning with Kernels, 2002 for details):
- They use almost the same (~90% overlapping) small sets (~4% of the database) of SVs
- All systems perform well (~4% error)

Many other applications, e.g. text categorization, face detection, DNA analysis.

Comparison with linear and logistic regression

The underlying basic idea of linear prediction is the same, but the error functions differ:
- Logistic regression (non-sparse) vs SVM (hinge loss, sparse solution)
- Linear regression (squared error) vs ε-insensitive error

Linear regression and logistic regression can be kernelized too.
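
In that spirit, here is a minimal sketch of choosing C and the RBF kernel width by cross-validated grid search, assuming scikit-learn; its small built-in digits set stands in for the USPS data mentioned above, and the parameter grid is an arbitrary choice.

```python
# A minimal sketch: pick C and gamma for an RBF-kernel SVM by cross-validation.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_train, y_train)

print("best parameters: ", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```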

SVM summary

SVMs are the combination of max margin and the kernel trick:
- Learn linear decision boundaries (like logistic regression, perceptrons)
- Pick the hyperplane that maximizes the margin
- Use slack variables to deal with non-separable data
- The optimal hyperplane can be written in terms of the support patterns
- Transform to a higher-dimensional space using kernel functions

Good empirical results on many problems. Appears to avoid overfitting in high-dimensional spaces (cf. regularization). Sorry for all the maths!