COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines. Instructor: (jpineau@cs.mcgill.ca). Class web page: www.cs.mcgill.ca/~jpineau/comp551. Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

Today's quiz:
- In the random forest approach proposed by Breiman, how many hyper-parameters need to be specified? 1, 2, 3, 4, 5
- What is the complexity of each iteration of AdaBoost, assuming your weak learner is a decision stump and you have all binary variables? Let M be the number of features and N be the number of examples. O(M), O(N), O(MN), O(MN^2)
- Which of the two ensemble strategies is most effective for high-variance base classifiers? Bagging, Boosting

Project #2

Outline:
- Perceptrons: definition, perceptron learning rule, convergence
- Margin & max margin classifiers
- Linear Support Vector Machines: formulation as an optimization problem, generalized Lagrangian and dual
- Non-linear Support Vector Machines (next class)

A simple linear classifier. Given a binary classification task {x_i, y_i}, i=1:n, with y_i in {-1, +1}, the perceptron (Rosenblatt, 1957) is a classifier of the form: h_w(x) = sign(w^T x) = {+1 if w^T x >= 0; -1 otherwise}. The decision boundary is w^T x = 0. An example <x_i, y_i> is classified correctly if and only if y_i (w^T x_i) > 0. (Figure: inputs x_1, ..., x_m plus a constant input 1, with weights w_1, ..., w_m and bias w_0, combined linearly and passed through a threshold to produce the output y.)

Perceptron learning rule (Rosenblatt, 1957). Consider the following procedure: Initialize w_j, j=0:m, randomly. While any training examples remain incorrectly classified: loop through all misclassified examples x_i and perform the update w <- w + α y_i x_i, where α is the learning rate (or step size). Intuition: for misclassified positive examples, increase w^T x, and reduce it for negative examples.
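A minimal NumPy sketch of this procedure (the toy dataset, learning rate, and epoch cap below are illustrative assumptions, not part of the slides):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """Perceptron learning rule. X is n x (m+1) with a leading column of 1s
    (bias folded into w_0); y holds labels in {-1, +1}."""
    w = 0.01 * np.random.randn(X.shape[1])           # random initialization of w_j, j=0:m
    for _ in range(max_epochs):
        misclassified = [i for i in range(len(y)) if y[i] * (w @ X[i]) <= 0]
        if not misclassified:                        # every example classified correctly
            break
        for i in misclassified:                      # loop through misclassified examples
            w = w + alpha * y[i] * X[i]              # update: w <- w + alpha * y_i * x_i
    return w

# Hypothetical usage on a small separable dataset:
X = np.array([[1., 2.0, 2.0], [1., 1.5, 2.5], [1., -1.0, -2.0], [1., -2.0, -1.5]])
y = np.array([+1., +1., -1., -1.])
w = perceptron_train(X, y)
predictions = np.sign(X @ w)                         # h_w(x) = sign(w^T x)
```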

Gradient-descent learning. The perceptron learning rule can be interpreted as a gradient-descent procedure, with optimization criterion: Err(w) = Σ_{i=1:n} { 0 if y_i w^T x_i >= 0; -y_i w^T x_i otherwise }. For correctly classified examples, the error is zero. For incorrectly classified examples, the error tells by how much w^T x_i is on the wrong side of the decision boundary. The error is zero when all examples are classified correctly.
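This criterion is straightforward to evaluate directly; a short sketch (with hypothetical arrays X, y and weight vector w) that also notes the connection to the update rule:

```python
import numpy as np

def perceptron_error(w, X, y):
    """Err(w) = sum_i [ 0 if y_i w^T x_i >= 0, else -y_i w^T x_i ]."""
    scores = y * (X @ w)                          # y_i * w^T x_i for every example
    return np.sum(np.where(scores >= 0, 0.0, -scores))

# On a misclassified example the gradient of its term w.r.t. w is -y_i x_i,
# so a gradient step with rate alpha is exactly w <- w + alpha * y_i * x_i.
```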

Linear separability. The data is linearly separable if and only if there exists a w such that for all examples y_i w^T x_i > 0, or, equivalently, the 0-1 loss is zero for some set of parameters w. (Figure: two scatter plots in the (x1, x2) plane, one linearly separable and one not linearly separable.)

Perceptron convergence theorem. The basic theorem: if the perceptron learning rule is applied to a linearly separable dataset, a solution will be found after some finite number of updates. Additional comments: the number of updates depends on the dataset, on the learning rate, and on the initial weights. If the data is not linearly separable, there will be oscillation (which can be detected automatically). Decreasing the learning rate to 0 can cause the oscillation to settle on some particular solution.

Perceptron learning example. (Figure: training data plotted in the (x1, x2) plane, both axes from 0 to 1.)

Perceptron learning example. (Figure.)

Weight as a combination of input vectors. Recall the perceptron learning rule: w <- w + α y_i x_i. If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors of the examples: w = Σ_{i=1:n} α_i y_i x_i, where α_i is the sum of the step sizes used for all updates applied to example i. By the end of training, some examples may have never participated in an update, and so will have α_i = 0. This is called the dual representation of the classifier.
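A sketch of the same perceptron written in this dual form, tracking the α_i directly (zero initial weights assumed; the step size, epoch cap, and data shapes are illustrative):

```python
import numpy as np

def dual_perceptron_train(X, y, step=1.0, max_epochs=100):
    """Track alpha_i (total step size applied to example i) instead of w.
    With zero initial weights, w = sum_i alpha_i * y_i * x_i at every step."""
    n = len(y)
    alpha = np.zeros(n)
    G = X @ X.T                                   # Gram matrix of dot products x_i . x_j
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            score = np.sum(alpha * y * G[:, i])   # w^T x_i, written in the dual form
            if y[i] * score <= 0:                 # example i is misclassified
                alpha[i] += step                  # same effect as w <- w + step * y_i * x_i
                updated = True
        if not updated:
            break
    w = (alpha * y) @ X                           # recover w = sum_i alpha_i y_i x_i
    return alpha, w
```

Examples whose α_i stays at zero never participated in an update.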

Perceptron learning example. Examples used in updates (bold) and not used (faint). What do you notice? (Figure.)

Perceptron learning example. Solutions are often non-unique. The solution depends on the set of instances and the order of sampling in updates. (Figure.)

A few comments on the Perceptron. Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule. The logistic function offers a smooth version of the perceptron. Two issues: solutions are non-unique, and what about non-linearly separable data? (Topic for next class.) Perhaps the data can be linearly separated in a different feature space? Perhaps we can relax the criterion of separating all the data?

The non-uniqueness issue. Consider a linearly separable binary classification dataset. There are infinitely many hyper-planes that separate the classes: which plane is best? Related question: for a given plane, for which points should we be most confident in the classification? (Figure: a separable two-class dataset with several candidate separating hyper-planes.)

Linear Support Vector Machine (SVM). A linear SVM is a perceptron for which we choose w such that the margin is maximized. For a given separating hyper-plane, the margin is twice the (Euclidean) distance from the hyper-plane to the nearest training example, i.e. the width of the strip around the decision boundary that contains no training examples. (Figure: a separating hyper-plane with the empty margin strip around it.)

Distance to the decision boundary. Suppose we have a decision boundary that separates the data, with w^T x > 0 on the Class 1 side and w^T x < 0 on the Class 2 side. Assuming y_i in {-1, +1}, confidence = y_i w^T x_i.

Distance to the decision boundary. Suppose we have a decision boundary that separates the data. Let γ_i be the distance from instance x_i to the decision boundary. Define the vector w to be the normal to the decision boundary. (Figure: point x_i, its projection x_i0 onto the boundary, the distance γ_i, and the normal vector w.)

Distance to the decision boundary. How can we write γ_i in terms of x_i, y_i, w? Let x_i0 be the point on the decision boundary nearest x_i. The vector from x_i0 to x_i is γ_i w/||w||: γ_i is a scalar (the distance from x_i to x_i0), and w/||w|| is the unit normal. So we can define x_i0 = x_i - γ_i w/||w||. As x_i0 is on the decision boundary, we have w^T (x_i - γ_i w/||w||) = 0. Solving for γ_i yields, for a positive example, γ_i = w^T x_i / ||w||, or, for examples of both classes, γ_i = y_i w^T x_i / ||w||.
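A small sketch that evaluates γ_i = y_i w^T x_i / ||w|| for every example and reports the margin of a given separating w (the arrays here are hypothetical):

```python
import numpy as np

def distances_and_margin(w, X, y):
    """Signed distance of each example to the boundary w^T x = 0, and the margin."""
    gamma = y * (X @ w) / np.linalg.norm(w)   # gamma_i = y_i w^T x_i / ||w||
    margin = 2 * gamma.min()                  # margin = twice the distance to the nearest example
    return gamma, margin
```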

Optimization. First suggestion: maximize M with respect to w, subject to y_i w^T x_i / ||w|| >= M for all i. This is not very convenient for optimization: w appears nonlinearly in the constraints, and the problem is under-constrained: if (w, M) is optimal, so is (βw, M), for any β > 0. Add a constraint: ||w|| M = 1. Instead try: minimize ||w|| with respect to w, subject to y_i w^T x_i >= 1.

Final formulation. Let's minimize ½||w||² instead of ||w||. (Taking the square is a monotone transform, as ||w|| is positive, so it doesn't change the optimal solution; the ½ is for mathematical convenience.) This gets us to: min ½||w||² w.r.t. w, s.t. y_i w^T x_i >= 1. This can be solved! How? It is a quadratic programming (QP) problem, a standard type of optimization problem for which many efficient packages are available. Better yet, it's a convex (positive semidefinite) QP.
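As a sketch of how this primal could be handed to an off-the-shelf QP solver (assuming the cvxopt package is available and the data is linearly separable; the bias is treated as folded into w through a constant feature, as in the earlier perceptron slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_primal(X, y):
    """Solve min (1/2)||w||^2 s.t. y_i w^T x_i >= 1, as a QP in w."""
    n, d = X.shape
    P = matrix(np.eye(d))                             # quadratic term: (1/2) w^T I w
    q = matrix(np.zeros(d))                           # no linear term
    G = matrix(-(y[:, None] * X).astype(float))       # -y_i x_i^T w <= -1  <=>  y_i w^T x_i >= 1
    h = matrix(-np.ones(n))
    sol = solvers.qp(P, q, G, h)                      # cvxopt solves min (1/2)x^T P x + q^T x, Gx <= h
    return np.array(sol['x']).ravel()                 # optimal w*
```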

Constrained optimization. (Picture from: http://www.cs.cmu.edu/~aarti/class/10701_spring14/)

!"+ Example -./.0''"*+)+.'#"&!%%1...-!./.!'#"+'*$,#!"&!"*!"%!")!"$!"(!"#!"'!!!"#!"$!"%!"& ',' We have a unique slutin, but n supprt vectrs yet. Recall the dual slutin fr the Perceptrn: Extend fr the margin case. 32

Lagrange multipliers. Consider the following optimization problem, called the primal: min_w f(w) s.t. g_i(w) <= 0, i=1...k. We define the generalized Lagrangian: L(w, α) = f(w) + Σ_{i=1:k} α_i g_i(w), where the α_i, i=1...k, are the Lagrange multipliers. (Figure: find x and y to maximize f(x, y) subject to a constraint (shown in red) g(x, y) = c. From: https://en.wikipedia.org/wiki/Lagrange_multiplier)

Lagrangian optimization. Consider P(w) = max_{α: α_i >= 0} L(w, α) (P stands for "primal"). Observe that the following is true: P(w) = { f(w), if all constraints are satisfied; +∞, otherwise }. Hence, instead of computing min_w f(w) subject to the original constraints, we can compute: p* = min_w P(w) = min_w max_{α: α_i >= 0} L(w, α) (the primal). Alternately, invert max and min to get: d* = max_{α: α_i >= 0} min_w L(w, α) (the dual).

Maximum Margin Perceptron. We wanted to solve: min ½||w||² w.r.t. w, s.t. y_i w^T x_i >= 1. The Lagrangian is: L(w, α) = ½||w||² + Σ_i α_i (1 - y_i (w^T x_i)). The primal problem is: min_w max_{α: α_i >= 0} L(w, α). The dual problem is: max_{α: α_i >= 0} min_w L(w, α).

Dual optimization problem. Consider both solutions: p* = min_w max_{α: α_i >= 0} L(w, α) (primal); d* = max_{α: α_i >= 0} min_w L(w, α) (dual). If f and the g_i are convex and the g_i can all be satisfied simultaneously for some w, then we have equality: d* = p* = L(w*, α*), where w* is the optimal weight vector (the primal solution) and α* is the optimal vector of Lagrange multipliers (the dual solution), whose nonzero entries identify the support vectors. For SVMs we have a quadratic objective and linear constraints, so both f and the g_i are convex; for linearly separable data, all g_i can be satisfied simultaneously. Note: w*, α* solve the primal and dual if and only if they satisfy the Karush-Kuhn-Tucker conditions (see suggested readings).

Solving the dual. Take the derivative of L(w, α) with respect to w, set it to 0, and solve for w: L(w, α) = ½||w||² + Σ_i α_i (1 - y_i (w^T x_i)); ∂L/∂w = w - Σ_i α_i y_i x_i = 0, so w* = Σ_i α_i y_i x_i. Just like for the perceptron with zero initial weights, the optimal solution w* is a linear combination of the x_i. Plugging this back into L we get the dual: max_α Σ_i α_i - ½ Σ_{i,j} y_i y_j α_i α_j (x_i · x_j), with constraints α_i >= 0 and Σ_i α_i y_i = 0. This is a quadratic programming problem. Complexity of solving a quadratic program? Polynomial time, O(|v|³), where |v| is the number of variables in the optimization (here |v| = n). Fast approximations exist.
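The dual stated on this slide can be passed to the same kind of QP solver by negating the objective; a sketch assuming cvxopt and linearly separable data:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_dual(X, y):
    """Maximize sum_i alpha_i - (1/2) sum_{i,j} y_i y_j alpha_i alpha_j (x_i . x_j)
    s.t. alpha_i >= 0 and sum_i alpha_i y_i = 0, via the equivalent minimization."""
    n = X.shape[0]
    K = X @ X.T                                        # Gram matrix of dot products
    P = matrix((np.outer(y, y) * K).astype(float))     # quadratic term of the negated objective
    q = matrix(-np.ones(n))                            # linear term: -sum_i alpha_i
    G = matrix(-np.eye(n))                             # -alpha_i <= 0  <=>  alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.astype(float).reshape(1, -1))         # equality constraint: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ X                                # recover w* = sum_i alpha_i y_i x_i
    return alpha, w
```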

The support vectors. Suppose we find the optimal α's (e.g. using a QP package). Constraint i is active when α_i > 0. This corresponds to the points for which (1 - y_i w^T x_i) = 0, i.e. the points lying on the edge of the margin. We call them support vectors. They define the decision boundary. The output of the classifier for a query point x is computed as h_w(x) = sign( Σ_{i=1:n} α_i y_i (x_i · x) ), so it is determined by computing the dot product of the query point with the support vectors.
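Given a dual solution, only the examples with α_i > 0 matter at prediction time; a minimal sketch of the classifier output (the tolerance and variable names are illustrative):

```python
import numpy as np

def svm_predict(x_query, alpha, X, y, tol=1e-8):
    """h_w(x) = sign( sum_i alpha_i y_i (x_i . x) ), using only the support vectors."""
    sv = alpha > tol                                   # active constraints = support vectors
    score = np.sum(alpha[sv] * y[sv] * (X[sv] @ x_query))
    return np.sign(score)
```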

Example. (Figure: the maximum-margin separating hyper-plane; support vectors are shown in bold.)

What you should know. From today: the perceptron algorithm; the margin definition for linear SVMs; the use of Lagrange multipliers to transform optimization problems; the primal and dual optimization problems for SVMs. After the next class: the non-linearly separable case; the feature-space version of SVMs; the kernel trick and examples of common kernels.