Support Vector Machines (SVMs).

Transcription:

Support Vector Machines (SVMs). Semi-Supervised Learning. Semi-Supervised SVMs. Maria-Florina Balcan 3/25/2015

Support Vector Machines (SVMs). One of the most theoretically well-motivated and practically effective classification algorithms in machine learning. Directly motivated by Margins and Kernels!

Geometric Margin. WLOG consider homogeneous linear separators [w_0 = 0]. Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0. If ||w|| = 1, the margin of x w.r.t. w is |x · w|. [Figure: margins of examples x_1 and x_2 w.r.t. the separator w.]

Geometric Margin. Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0. Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S. Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w. [Figure: γ and γ_w.]
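These definitions translate directly into code. Below is a minimal NumPy sketch (my own illustration, not part of the lecture; the function names are invented):

```python
import numpy as np

def example_margin(w, x):
    """Distance from x to the homogeneous hyperplane w . x = 0."""
    return abs(np.dot(w, x)) / np.linalg.norm(w)

def set_margin(w, X):
    """gamma_w: the smallest margin of any example in X w.r.t. the separator w."""
    return min(example_margin(w, x) for x in X)

# Toy usage with a unit-norm separator, so the margin is simply |x . w|.
X = np.array([[2.0, 1.0], [0.5, 3.0]])
w = np.array([1.0, 0.0])
print([example_margin(w, x) for x in X])  # [2.0, 0.5]
print(set_margin(w, X))                   # 0.5  (= gamma_w for this S and w)
```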

Margin: Important Theme in ML. Both sample-complexity and algorithmic implications. Sample/Mistake-bound complexity: If the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the space)! If the margin γ is large and the algorithm produces a large-margin classifier, then the amount of data needed depends only on R/γ [Bartlett & Shawe-Taylor '99]. Algorithmic implications: suggests searching for a large-margin classifier, which leads to SVMs.
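To make the mistake-bound remark concrete, here is a minimal Perceptron sketch (my own illustration, not from the lecture) that counts mistakes; on linearly separable data the classic bound says the count is at most (R/γ)², independent of the dimension:

```python
import numpy as np

def perceptron_mistakes(X, y, max_passes=100):
    """Homogeneous Perceptron; returns the learned w and the number of mistakes.

    X: array of shape (m, d); y: labels in {-1, +1}.
    On data separable with margin gamma, mistakes <= (R / gamma)^2,
    where R = max_i ||x_i||.
    """
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # mistake (or point on the boundary)
                w += y_i * x_i              # standard Perceptron update
                mistakes += 1
                clean_pass = False
        if clean_pass:                      # no mistakes in a full pass: done
            break
    return w, mistakes
```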

Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. First, assume we know a lower bound on the margin γ. Input: γ, S = {(x_1, y_1), …, (x_m, y_m)}. Find: some w with ||w||_2 = 1 such that, for all i, y_i w · x_i ≥ γ. Output: w, a separator of margin γ over S. (Realizable case, where the data is linearly separable by margin γ.)

Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. E.g., search for the best possible γ. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find: some w and the maximum γ such that ||w||_2 = 1 and, for all i, y_i w · x_i ≥ γ. Output: the maximum-margin separator over S.

Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ under the constraints: ||w||_2 = 1; for all i, y_i w · x_i ≥ γ.

Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ under the constraints: ||w||_2 = 1; for all i, y_i w · x_i ≥ γ. This is a constrained optimization problem (an objective function plus constraints). A famous example of constrained optimization is linear programming, where the objective function is linear and the constraints are linear (in)equalities.

Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ under the constraints: ||w||_2 = 1; for all i, y_i w · x_i ≥ γ. The constraint ||w||_2 = 1 is non-linear. In fact, it is not even convex: the average (w_1 + w_2)/2 of two unit-norm vectors w_1 and w_2 generally has norm smaller than 1.

Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ under the constraints: ||w||_2 = 1; for all i, y_i w · x_i ≥ γ. Let w' = w/γ; then maximizing γ is equivalent to minimizing ||w'||² (since ||w'||² = 1/γ²). So, dividing both sides of each constraint by γ and writing everything in terms of w', we get: Input: S = {(x_1, y_1), …, (x_m, y_m)}. Minimize ||w'||² under the constraints: for all i, y_i w' · x_i ≥ 1. [Figure: separator w with margin boundaries w · x = 1 and w · x = −1.]
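A short worked version of this rescaling argument (my own restatement of the slide's claim, assuming the realizable setting with ||w||_2 = 1):

```latex
\begin{align*}
&\text{Start from: } \max_{\gamma,\ \|w\|_2 = 1} \gamma
  \quad \text{s.t. } y_i\,(w \cdot x_i) \ge \gamma \ \ \forall i.\\
&\text{Let } w' = w/\gamma.\ \text{Since } \|w\|_2 = 1,\ \|w'\|_2 = 1/\gamma,
  \ \text{so } \|w'\|_2^2 = 1/\gamma^2.\\
&\text{Dividing each constraint by } \gamma:\quad y_i\,(w' \cdot x_i) \ge 1 \ \ \forall i.\\
&\text{Maximizing } \gamma \text{ is thus equivalent to minimizing } \|w'\|_2^2:\quad
  \min_{w'} \|w'\|_2^2 \ \ \text{s.t. } y_i\,(w' \cdot x_i) \ge 1 \ \ \forall i.
\end{align*}
```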

Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: SVMs. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_w ||w||² s.t.: for all i, y_i w · x_i ≥ 1. This is a constrained optimization problem. The objective is convex (quadratic) and all constraints are linear, so it can be solved efficiently (in polynomial time) using standard quadratic programming (QP) software.
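For concreteness, a minimal sketch of this QP using the cvxpy modeling library (the toy data and variable names are mine, not the lecture's):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
# Hard-margin primal: minimize ||w||^2 subject to y_i (w . x_i) >= 1 for all i.
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])
problem.solve()

print(w.value)                       # maximum-margin separator (homogeneous, as in the slides)
print(1 / np.linalg.norm(w.value))   # achieved geometric margin gamma = 1 / ||w||
```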

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Issue 1: we now have two objectives: maximize the margin and minimize the number of misclassifications. Answer 1: optimize their sum: minimize ||w||² + C · (# misclassifications), where C is some trade-off constant. Issue 2: this is computationally hard (NP-hard), even if we didn't care about the margin and only minimized the number of mistakes [Guruswami-Raghavendra '06].

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace the number of mistakes with an upper bound called the hinge loss. Old formulation: Input: S = {(x_1, y_1), …, (x_m, y_m)}; minimize ||w||² under the constraints: for all i, y_i w · x_i ≥ 1. New formulation: Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_{w, ξ_1, …, ξ_m} ||w||² + C Σ_i ξ_i s.t.: for all i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0 (the ξ_i are slack variables).

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace the number of mistakes with an upper bound called the hinge loss. Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_{w, ξ_1, …, ξ_m} ||w||² + C Σ_i ξ_i s.t.: for all i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0 (the ξ_i are slack variables). C controls the relative weighting between the twin goals of making ||w||² small (making the margin large) and ensuring that most examples have functional margin at least 1. Hinge loss: l(w, x, y) = max(0, 1 − y w · x).

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace the number of mistakes with an upper bound called the hinge loss. Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_{w, ξ_1, …, ξ_m} ||w||² + C Σ_i ξ_i s.t.: for all i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0. This replaces the objective ||w||² + C · (# misclassifications) with its hinge-loss upper bound, l(w, x, y) = max(0, 1 − y w · x).

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace the number of mistakes with an upper bound called the hinge loss. Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_{w, ξ_1, …, ξ_m} ||w||² + C Σ_i ξ_i s.t.: for all i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Σ_i ξ_i is the total amount we have to move the points to get them on the correct side of the lines w · x = +1 / −1, where the distance between the lines w · x = 0 and w · x = 1 counts as one unit. Hinge loss: l(w, x, y) = max(0, 1 − y w · x).
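At the optimum each slack variable equals the hinge loss, ξ_i = max(0, 1 − y_i w · x_i), so the problem can be written as an unconstrained hinge-loss objective. Below is a rough subgradient-descent sketch of that equivalent form (my own illustration with made-up step sizes, not the lecture's algorithm):

```python
import numpy as np

def hinge_loss(w, X, y):
    """Per-example hinge loss max(0, 1 - y_i w . x_i)."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize ||w||^2 + C * sum_i max(0, 1 - y_i w . x_i) by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        violating = y * (X @ w) < 1                      # examples with margin < 1
        # Subgradient: 2w from the norm term, minus C * y_i x_i for each violator.
        grad = 2 * w - C * (y[violating, None] * X[violating]).sum(axis=0)
        w -= lr * grad
    return w
```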

What if the data is far from being linearly separable? Example: [figure: two classes of images]; there is no good linear separator in the pixel representation. SVM philosophy: use a kernel.

Support Vector Machines (SVMs). Primal form: Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_{w, ξ_1, …, ξ_m} ||w||² + C Σ_i ξ_i s.t.: for all i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0. This is equivalent to the Lagrangian dual: Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i · x_j) − Σ_i α_i s.t.: 0 ≤ α_i ≤ C for all i, and Σ_i y_i α_i = 0.

SVMs (Lagrangian Dual). Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i · x_j) − Σ_i α_i s.t.: 0 ≤ α_i ≤ C for all i, and Σ_i y_i α_i = 0. The final classifier is w = Σ_i α_i y_i x_i. The points x_i for which α_i ≠ 0 are called the support vectors.

Kernelizing the Dual SVMs. Input: S = {(x_1, y_1), …, (x_m, y_m)}; find argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i · x_j) − Σ_i α_i s.t.: 0 ≤ α_i ≤ C for all i, and Σ_i y_i α_i = 0. Replace x_i · x_j with K(x_i, x_j). The final classifier is w = Σ_i α_i y_i x_i, and the points x_i for which α_i ≠ 0 are called the support vectors. With a kernel, classify x using the sign of Σ_i α_i y_i K(x, x_i).
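A rough sketch of the kernelized dual using cvxpy with an RBF kernel (the kernel choice, helper names, and ridge term are my assumptions for illustration; production code would use a dedicated solver such as libsvm):

```python
import cvxpy as cp
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def kernel_dual_svm_fit(X, y, C=1.0):
    """Solve min_a 1/2 a'Qa - sum(a) s.t. 0 <= a <= C, y'a = 0, with Q_ij = y_i y_j K(x_i, x_j)."""
    m = len(y)
    Q = np.outer(y, y) * rbf_kernel(X, X)
    Q += 1e-8 * np.eye(m)             # tiny ridge so the solver sees a numerically PSD matrix
    alpha = cp.Variable(m)
    problem = cp.Problem(cp.Minimize(0.5 * cp.quad_form(alpha, Q) - cp.sum(alpha)),
                         [alpha >= 0, alpha <= C, y @ alpha == 0])
    problem.solve()
    return alpha.value

def kernel_dual_svm_predict(alpha, X_train, y_train, X_new):
    # Classify each new point by the sign of sum_i alpha_i y_i K(x, x_i).
    return np.sign(rbf_kernel(X_new, X_train) @ (alpha * y_train))
```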

Support Vector Machines (SVMs). One of the most theoretically well-motivated and practically effective classification algorithms in machine learning. Directly motivated by Margins and Kernels!

What you should know: the importance of margins in machine learning; the primal form of the SVM optimization problem; the dual form of the SVM optimization problem; kernelizing SVM. Think about how it's related to regularized logistic regression.

Modern (Partially) Supervised Machine Learning: Using Unlabeled Data and Interaction for Learning

The Classic Paradigm Is Insufficient Nowadays. Modern applications involve massive amounts of raw data, and only a tiny fraction can be annotated by human experts. Examples: protein sequences, billions of webpages, images.

Semi-Supervised Learning. [Figure: pipeline from raw data, through an expert labeler producing labeled data (face / not face), to a classifier.]

Active Learning. [Figure: pipeline in which the learner selects raw examples to send to the expert labeler for labels (face / not face) and then outputs a classifier.]

Semi-Supervised Learning. A prominent paradigm in machine learning over the past 15 years. Most applications have lots of unlabeled data, but labeled data is rare or expensive: web page and document classification, computer vision, computational biology, …

Semi-Supervised Learning. [Figure: the data source supplies unlabeled examples to the learning algorithm; an expert/oracle supplies labeled examples; the algorithm outputs a classifier.] Labeled sample S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, with each x_i drawn i.i.d. from D and y_i = c*(x_i). Unlabeled sample S_u = {x_1, …, x_{m_u}}, drawn i.i.d. from D. Goal: output h with small error over D, where err_D(h) = Pr_{x~D}(h(x) ≠ c*(x)).

Semi-supervised learning: no querying, just lots of additional unlabeled data. A bit puzzling, since it is unclear what unlabeled data can do for you. Key insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.

Combining Labeled and Unlabeled Data. Several methods have been developed to try to use unlabeled data to improve performance, e.g.: Transductive SVM [Joachims '99]; Co-training [Blum & Mitchell '98]; Graph-based methods [B&C '01], [ZGL '03]. (Test-of-time awards at ICML!) Workshops: [ICML '03, ICML '05, …]. Books: Semi-Supervised Learning, MIT Press, 2006, O. Chapelle, B. Scholkopf and A. Zien (eds.); Introduction to Semi-Supervised Learning, Morgan & Claypool, 2009, Zhu & Goldberg.

Example of a typical assumption: margins. The separator goes through low-density regions of the space (large margin). Assume we are looking for a linear separator; belief: there should exist one with large separation. [Figure: SVM trained on labeled data only vs. transductive SVM.]

Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. both the labeled and the unlabeled data [Joachims '99]. Input: S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, S_u = {x_1, …, x_{m_u}}. argmin_w ||w||² s.t.: y_i w · x_i ≥ 1 for all i ∈ {1, …, m_l}; y_u w · x_u ≥ 1 for all u ∈ {1, …, m_u}; y_u ∈ {−1, 1} for all u ∈ {1, …, m_u}. Find a labeling of the unlabeled sample and a w such that w separates both labeled and unlabeled data with maximum margin.

Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. both the labeled and the unlabeled data [Joachims '99]. Input: S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, S_u = {x_1, …, x_{m_u}}. argmin_w ||w||² + C Σ_i ξ_i + C Σ_u ξ_u s.t.: y_i w · x_i ≥ 1 − ξ_i for all i ∈ {1, …, m_l}; y_u w · x_u ≥ 1 − ξ_u for all u ∈ {1, …, m_u}; y_u ∈ {−1, 1} for all u ∈ {1, …, m_u}. Find a labeling of the unlabeled sample and a w such that w separates both labeled and unlabeled data with maximum margin.

Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. both the labeled and the unlabeled data. Input: S_l = {(x_1, y_1), …, (x_{m_l}, y_{m_l})}, S_u = {x_1, …, x_{m_u}}. argmin_w ||w||² + C Σ_i ξ_i + C Σ_u ξ_u s.t.: y_i w · x_i ≥ 1 − ξ_i for all i ∈ {1, …, m_l}; y_u w · x_u ≥ 1 − ξ_u for all u ∈ {1, …, m_u}; y_u ∈ {−1, 1} for all u ∈ {1, …, m_u}. NP-hard: the problem is convex only after you have guessed the labels, and there are too many possible guesses.

Transductive Support Vector Machines. Optimize for the separator with large margin w.r.t. both the labeled and the unlabeled data. Heuristic (Joachims), high-level idea: first maximize the margin over the labeled points; use the resulting separator to give initial labels to the unlabeled points; try flipping labels of unlabeled points to see if doing so can increase the margin; keep going until there are no more improvements. This finds a locally optimal solution.
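A highly simplified sketch of this label-flipping idea (my own paraphrase, not Joachims' SVMlight implementation; fit_svm stands in for any soft-margin SVM trainer, e.g. the soft_margin_svm sketch shown earlier):

```python
import numpy as np

def tsvm_heuristic(X_l, y_l, X_u, fit_svm, C=1.0, max_rounds=20):
    """Greedy transductive SVM heuristic: guess labels, then flip them while the objective improves."""
    def objective(w, y_all, X_all):
        hinge = np.maximum(0.0, 1.0 - y_all * (X_all @ w))
        return np.dot(w, w) + C * hinge.sum()

    # Step 1: maximize the margin over the labeled points, then label the unlabeled points.
    w = fit_svm(X_l, y_l, C)
    y_u = np.where(X_u @ w >= 0, 1.0, -1.0)

    X_all = np.vstack([X_l, X_u])
    # Step 2: flip unlabeled labels one at a time; keep a flip if retraining lowers the objective.
    for _ in range(max_rounds):
        improved = False
        for u in range(len(X_u)):
            y_try = y_u.copy()
            y_try[u] = -y_try[u]
            w_try = fit_svm(X_all, np.concatenate([y_l, y_try]), C)
            if objective(w_try, np.concatenate([y_l, y_try]), X_all) < \
               objective(w, np.concatenate([y_l, y_u]), X_all):
                w, y_u, improved = w_try, y_try, True
        if not improved:
            break
    return w, y_u  # locally optimal separator and the guessed labels
```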

Experiments [Joachims '99].

Transductive Support Vector Machines. [Figure: a helpful distribution (highly compatible) vs. non-helpful distributions, e.g., 1/γ² clusters with all partitions separable by large margin.]

Semi-Supervised Learning. A prominent paradigm in machine learning over the past 15 years. Key insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution. Prominent techniques: Transductive SVM [Joachims '99]; Co-training [Blum & Mitchell '98]; Graph-based methods [B&C '01], [ZGL '03].