The Perceptron Algorithm, Margins

Maria-Florina Balcan, 08/29/2018

The Perceptron Algorithm. A simple learning algorithm for supervised classification, analyzed via geometric margins in the 1950s [Rosenblatt '57]. Originally introduced in the online learning setting. Outline: the Online Learning Model; the Perceptron Algorithm; its guarantees under large margins.

The Online Learning Model. Examples arrive sequentially; we must make a prediction, and afterwards we observe the outcome. Phase i, for i = 1, 2, ...: example x_i arrives; the online algorithm predicts h(x_i); then the true label of x_i is revealed. Mistake bound model. Goal: minimize the number of mistakes. Analysis-wise, we make no distributional assumptions and consider a worst-case sequence of examples.

The Online Learning Model: Motivation. Email classification (the distribution of both spam and regular mail changes over time, but the target function stays fixed: last year's spam still looks like spam). Recommendation systems: recommending movies, etc. Predicting whether a user will be interested in a new news article or not. Ad placement in a new market.

Linear Separators. Feature space X = R^d. Hypothesis class: linear decision surfaces in R^d, h(x) = w·x + w_0; if h(x) ≥ 0, label x as +, otherwise label it as −. [Figure: positive (x) and negative (o) points separated by a hyperplane with normal vector w.] Trick: without loss of generality, w_0 = 0. Proof: we can simulate a nonzero threshold with a dummy input feature x_0 that is always set to 1: x = (x_1, ..., x_d) becomes x̃ = (x_1, ..., x_d, 1), and w·x + w_0 ≥ 0 iff (w_1, ..., w_d, w_0)·x̃ ≥ 0, where w = (w_1, ..., w_d).
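
As an illustration of the dummy-feature trick (not from the slides), a minimal NumPy sketch; the separator, threshold, and example points are hypothetical:

```python
import numpy as np

# Absorb the threshold w_0 into the weight vector by appending a
# constant feature 1 to every input.
w, w0 = np.array([2.0, -1.0]), 0.5           # hypothetical separator and threshold
X = np.array([[1.0, 2.0], [-1.0, 0.0]])      # hypothetical examples

X_aug = np.hstack([X, np.ones((len(X), 1))])  # x -> (x_1, ..., x_d, 1)
w_aug = np.append(w, w0)                      # w -> (w_1, ..., w_d, w_0)

# The two formulations give identical decisions.
assert np.array_equal(X @ w + w0 >= 0, X_aug @ w_aug >= 0)
```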

Linear Separators: Perceptron Algorithm. Set t = 1, start with the all-zeroes vector w_1. Given example x, predict positive iff w_t·x ≥ 0. On a mistake, update as follows: mistake on positive, then update w_{t+1} ← w_t + x; mistake on negative, then update w_{t+1} ← w_t − x. Natural greedy procedure: if the true label of x is +1 and w_t is incorrect on x, we have w_t·x < 0, and w_{t+1}·x = (w_t + x)·x = w_t·x + ∥x∥², so there is more chance that w_{t+1} classifies x correctly. Similarly for mistakes on negative examples.

Linear Separators: Perceptron Algorithm. Set t = 1, start with the all-zeroes vector w_1. Given example x, predict positive iff w_t·x ≥ 0. On a mistake, update as follows: mistake on positive, then update w_{t+1} ← w_t + x; mistake on negative, then update w_{t+1} ← w_t − x. Note: w_t is a weighted sum of the incorrectly classified examples, w_t = a_{i_1} x_{i_1} + ... + a_{i_k} x_{i_k}, so w_t·x = a_{i_1} (x_{i_1}·x) + ... + a_{i_k} (x_{i_k}·x). Important when we talk about kernels.
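
A minimal sketch of the online update rule above (not from the slides); labels are assumed to be in {+1, −1}, and `examples` is a hypothetical stream of (x, y) pairs:

```python
import numpy as np

def perceptron_online(examples, d):
    """Run the Perceptron on a stream of (x, y) pairs with y in {+1, -1}."""
    w = np.zeros(d)                          # w_1: the all-zeroes vector
    mistakes = 0
    for x, y in examples:
        pred = 1 if w @ x >= 0 else -1       # predict positive iff w_t . x >= 0
        if pred != y:
            # +x on a positive mistake, -x on a negative mistake: both are w + y*x.
            w = w + y * x
            mistakes += 1
    return w, mistakes
```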

Perceptron Algorithm: Example. [Figure: a worked run of the Perceptron on a sequence of six labeled points in the plane; the three points on which mistakes are made are marked X.] Algorithm: set t = 1, start with the all-zeroes weight vector w_1. Given example x, predict positive iff w_t·x ≥ 0. On a mistake, update as follows: mistake on positive, update w_{t+1} ← w_t + x; mistake on negative, update w_{t+1} ← w_t − x. Starting from w_1 = (0,0), the three mistakes produce updated weight vectors w_2, w_3, and w_4.

Geometric Margin. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w·x = 0. [Figure: the margin of a positive example x_1 and of a negative example x_2 with respect to the separator w.]

Geometric Margin. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w·x = 0. Definition: the margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.

Geometric Margin. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w·x = 0. Definition: the margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S. Definition: the margin γ of a set of examples S is the maximum γ_w over all linear separators w.
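
A small sketch of these definitions (not from the slides), with a hypothetical array `X` of points and separator `w`:

```python
import numpy as np

def point_margin(x, w):
    """Distance from x to the hyperplane w . x = 0."""
    return abs(w @ x) / np.linalg.norm(w)

def set_margin(X, w):
    """gamma_w: the smallest margin over points in X w.r.t. separator w."""
    return min(point_margin(x, w) for x in X)

# Hypothetical usage:
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -3.0]])
w = np.array([1.0, 1.0])
print(set_margin(X, w))
```

Maximizing γ_w over all separators w is exactly the SVM problem discussed later in these slides.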

Perceptron: Mistake Bound. Guarantee: if the data has margin γ and all points lie inside a ball of radius R, then the Perceptron makes at most (R/γ)² mistakes. (Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.) [Figure: points with margin γ inside a ball of radius R, with max-margin separator w*.]

Perceptron Algorithm: Analysis. Guarantee: if the data has margin γ and all points lie inside a ball of radius R, then the Perceptron makes at most (R/γ)² mistakes. Update rule: mistake on positive, w_{t+1} ← w_t + x; mistake on negative, w_{t+1} ← w_t − x. Proof idea: analyze w_t·w* and ∥w_t∥, where w* is the max-margin separator with ∥w*∥ = 1. Claim 1: w_{t+1}·w* ≥ w_t·w* + γ (because l(x) (x·w*) ≥ γ). Claim 2: ∥w_{t+1}∥² ≤ ∥w_t∥² + R² (since w_{t+1} = w_t + l(x) x and, on a mistake, l(x)(w_t·x) ≤ 0, so ∥w_{t+1}∥² = ∥w_t∥² + 2 l(x)(w_t·x) + ∥x∥² ≤ ∥w_t∥² + R²). After M mistakes: w_{M+1}·w* ≥ γM (by Claim 1); ∥w_{M+1}∥ ≤ R√M (by Claim 2); and w_{M+1}·w* ≤ ∥w_{M+1}∥ (since w* is unit length). So γM ≤ R√M, hence M ≤ (R/γ)².
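
The chain of inequalities in the proof, written out (a restatement of the argument above, with w* the unit-length max-margin separator and M the number of mistakes):

```latex
\begin{align*}
\gamma M \;&\le\; w_{M+1}\cdot w^{*}  && \text{(Claim 1, applied $M$ times)}\\
           &\le\; \|w_{M+1}\|         && \text{(Cauchy--Schwarz, $\|w^{*}\| = 1$)}\\
           &\le\; R\sqrt{M}           && \text{(Claim 2, applied $M$ times)}\\[2pt]
\Rightarrow\; M \;&\le\; (R/\gamma)^{2}.
\end{align*}
```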

Finding a Consistent Separator. We can use the Perceptron in a batch setting too, to find a consistent linear separator given a set S of labeled examples that is linearly separable by margin γ. We repeatedly feed the whole set S of labeled examples into the Perceptron algorithm, for up to (R/γ)² + 1 rounds, until we reach a point where the current hypothesis is consistent/correct on the whole set S. Note that we are guaranteed to reach such a point, since the total number of mistakes is at most (R/γ)². The running time is then polynomial in (R/γ)² and |S|.
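
A minimal sketch of this batch use (not from the slides), in the same style as the `perceptron_online` sketch above; `X`, `y` are a hypothetical linearly separable dataset with labels in {+1, −1}:

```python
import numpy as np

def perceptron_batch(X, y, max_rounds):
    """Cycle over the whole set until the current w is consistent with it."""
    w = np.zeros(X.shape[1])
    for _ in range(max_rounds):
        made_mistake = False
        for x_i, y_i in zip(X, y):
            if (1 if w @ x_i >= 0 else -1) != y_i:
                w = w + y_i * x_i
                made_mistake = True
        if not made_mistake:          # consistent/correct on the whole set S
            return w
    return w  # if data has margin gamma, max_rounds ~ (R/gamma)^2 + 1 suffices
```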

What if there is no perfect separator? The Perceptron can be adapted to the case where there is no perfect separator, as long as the so-called hinge loss (i.e., the total distance the points would need to be moved to be classified correctly by a large margin γ) is small. Let TD = the total distance needed to move the points so that they are classified correctly by margin γ. After M mistakes: w_{M+1}·w* ≥ γM − TD (new Claim 1); ∥w_{M+1}∥ ≤ R√M (by Claim 2). So γM − TD ≤ R√M, which gives M ≤ (R/γ)² + (2/γ)·TD.

Margin Perceptron for Approximately Maximizing the Margins. Set t = 1, w_1 = l(x)·x, where x is the first example. Given example x: predict positive if w_t·x / ∥w_t∥ ≥ γ/2; predict negative if w_t·x / ∥w_t∥ ≤ −γ/2; declare a margin mistake if w_t·x / ∥w_t∥ ∈ (−γ/2, γ/2). On a mistake (including a margin mistake), update: w_{t+1} ← w_t + l(x)·x. Guarantee: if the data has margin γ and all points lie inside a ball of radius R, then the Margin Perceptron makes at most 8(R/γ)² mistakes.
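
A sketch of the margin-mistake rule described above (not from the slides); `gamma` is an assumed known lower bound on the margin, labels are ±1, and `examples` is a hypothetical stream of (x, y) pairs:

```python
import numpy as np

def margin_perceptron(examples, gamma):
    """Update on true mistakes and on margin mistakes (|normalized score| < gamma/2)."""
    examples = iter(examples)
    x0, y0 = next(examples)
    w = y0 * x0                               # w_1 = l(x) x for the first example
    for x, y in examples:
        score = (w @ x) / np.linalg.norm(w)   # normalized prediction score
        if y * score < gamma / 2:             # true mistake or margin mistake
            w = w + y * x                     # w_{t+1} = w_t + l(x) x
    return w
```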

Perceptron Discussion Simple, but very useful in applications like branch prediction; it also has interesting extensions to structured prediction. Can be kernelized to handle nonlinear decision boundaries! See future lecture!!!

Margin: Important Theme in ML. If the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the ambient space)! A large margin can help prevent overfitting: if the data has large margin γ and the algorithm produces a large margin classifier, then the amount of data needed depends only on R/γ [Bartlett & Shawe-Taylor '99]. Why not directly search for a large margin classifier? Support Vector Machines (SVMs).

Geometric Margin. WLOG consider homogeneous linear separators [w_0 = 0]. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w·x = 0. If ∥w∥ = 1, the margin of x w.r.t. w is |x·w|. [Figure: margins of examples x_1 and x_2 with respect to the separator w.]

Geometric Margin. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w·x = 0. Definition: the margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S. Definition: the margin γ of a set of examples S is the maximum γ_w over all linear separators w.

Support Vector Machines (SVMs). Directly optimize for the maximum margin separator: SVMs. First, assume we know a lower bound γ on the margin. Input: γ, S = {(x_1, y_1), ..., (x_m, y_m)}. Find: some w where ∥w∥₂ = 1 and, for all i, y_i w·x_i ≥ γ. Output: w, a separator of margin γ over S. This is the case where the data is truly linearly separable by margin γ.

Support Vector Machines (SVMs). Directly optimize for the maximum margin separator: SVMs. E.g., search for the best possible γ. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find: some w and the maximum γ where ∥w∥₂ = 1 and, for all i, y_i w·x_i ≥ γ. Output: maximum margin separator over S.

Support Vector Machines (SVMs). Directly optimize for the maximum margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ under the constraints: ∥w∥₂ = 1; for all i, y_i w·x_i ≥ γ.

Support Vector Machines (SVMs). Directly optimize for the maximum margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ (the objective function) under the constraints: ∥w∥₂ = 1; for all i, y_i w·x_i ≥ γ. This is a constrained optimization problem. Famous example of constrained optimization: linear programming, where the objective function is linear and the constraints are linear (in)equalities.

Support Vector Machines (SVMs). Directly optimize for the maximum margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ under the constraints: ∥w∥₂ = 1; for all i, y_i w·x_i ≥ γ. The constraint ∥w∥₂ = 1 is non-linear. In fact, it's even non-convex: if w_1 and w_2 both satisfy it, their midpoint (w_1 + w_2)/2 in general does not.

Support Vector Machines (SVMs). Directly optimize for the maximum margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Maximize γ under the constraints: ∥w∥₂ = 1; for all i, y_i w·x_i ≥ γ. Change of variables: let w' = w/γ; then maximizing γ is equivalent to minimizing ∥w'∥² (since ∥w'∥² = 1/γ²). So, dividing both sides of the constraints by γ and writing them in terms of w', we get: Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Minimize ∥w'∥² under the constraint: for all i, y_i w'·x_i ≥ 1. [Figure: separator w' with supporting hyperplanes w'·x = 1 and w'·x = −1.]

Support Vector Machines (SVMs). Directly optimize for the maximum margin separator: SVMs. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. Find argmin_w ∥w∥² s.t.: for all i, y_i w·x_i ≥ 1. This is a constrained optimization problem. The objective is convex (quadratic) and all constraints are linear, so it can be solved efficiently (in polynomial time) using standard quadratic programming (QP) software.
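
A minimal sketch of this QP using the off-the-shelf solver cvxpy (not from the slides; `X`, `y` are a hypothetical linearly separable toy dataset with labels ±1):

```python
import cvxpy as cp
import numpy as np

# Hypothetical separable toy data, labels in {+1, -1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(X.shape[1])
# Hard-margin SVM: minimize ||w||^2 subject to y_i (w . x_i) >= 1 for all i.
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])
problem.solve()
print(w.value)   # max-margin separator; its margin is 1 / ||w||
```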

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Issue 1: we now have two objectives: maximize the margin and minimize the number of misclassifications. Answer 1: let's optimize their sum: minimize ∥w∥² + C·(# misclassifications), where C is some tradeoff constant. Issue 2: this is computationally very hard (NP-hard), even if we didn't care about the margin and just minimized the number of mistakes.

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace the number of mistakes with an upper bound called the hinge loss. Separable case: Input: S = {(x_1, y_1), ..., (x_m, y_m)}; minimize ∥w∥² under the constraint: for all i, y_i w·x_i ≥ 1. With slack: Input: S = {(x_1, y_1), ..., (x_m, y_m)}; find argmin_{w, ξ_1, ..., ξ_m} ∥w∥² + C Σ_i ξ_i s.t.: for all i, y_i w·x_i ≥ 1 − ξ_i and ξ_i ≥ 0. The ξ_i are slack variables.

Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace the number of mistakes with an upper bound called the hinge loss. Input: S = {(x_1, y_1), ..., (x_m, y_m)}; find argmin_{w, ξ_1, ..., ξ_m} ∥w∥² + C Σ_i ξ_i s.t.: for all i, y_i w·x_i ≥ 1 − ξ_i and ξ_i ≥ 0. The ξ_i are slack variables. C controls the relative weighting between the twin goals of making ∥w∥² small (so the margin is large) and ensuring that most examples have functional margin at least 1. The hinge loss is ℓ(w, x, y) = max(0, 1 − y w·x).
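
Extending the cvxpy sketch above with slack variables gives the soft-margin version (again a sketch under the same assumptions; `C` is the tradeoff constant from the slide):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5], [-0.5, -0.5]])
y = np.array([1, 1, -1, -1, 1])   # last point cannot be separated from the negatives
C = 1.0

w = cp.Variable(X.shape[1])
xi = cp.Variable(X.shape[0], nonneg=True)       # slack variables xi_i >= 0
# Soft-margin SVM: minimize ||w||^2 + C * sum_i xi_i
# subject to y_i (w . x_i) >= 1 - xi_i.
problem = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi)),
                     [cp.multiply(y, X @ w) >= 1 - xi])
problem.solve()
print(w.value, xi.value)
```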

What you should know Perceptron simple online algo for learning linear separators with good guarantees when data has large geometric margin. The importance of margins in machine learning. The Support Vector Machines (SVM) algorithm.