Kernelized Perceptron and Support Vector Machines


Kernelized Perceptron and Support Vector Machines
Emily Fox, University of Washington, February 13, 2017

What is the perceptron optimizing?

The perceptron algorithm [Rosenblatt '58, '62]
Classification setting: y in {-1, +1}
Linear model
- Prediction: y-hat = sign(w·x)
Training:
- Initialize weight vector: w^(0) = 0
- At each time step t:
  - Observe features: x_t
  - Make prediction: y-hat_t = sign(w^(t)·x_t)
  - Observe true class: y_t
  - Update model: if the prediction is not equal to the truth, set w^(t+1) = w^(t) + y_t x_t; otherwise keep w^(t+1) = w^(t)

Perceptron prediction: the score w·x gives not just a class sign(w·x) but also a margin of confidence; the larger |w·x|, the more confident the prediction. (A sketch of the algorithm follows below.)
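As a concrete illustration, here is a minimal NumPy sketch of the training loop above (function and variable names are mine, not from the slides):

```python
import numpy as np

def perceptron(X, y, n_passes=10):
    """Online perceptron. X is (n, d); y has entries in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)                    # initialize weight vector w^(0) = 0
    for _ in range(n_passes):          # sweep over the data a few times
        for t in range(n):
            y_hat = np.sign(w @ X[t])  # predict with current weights
            if y_hat != y[t]:          # mistake: prediction differs from truth
                w = w + y[t] * X[t]    # perceptron update
    return w
```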

Hinge loss
Perceptron prediction: y-hat_t = sign(w·x_t)
Makes a mistake when: y_t(w·x_t) ≤ 0
Hinge loss: max(0, -y_t(w·x_t)) (the same margin-maximizing idea used by SVMs)

Minimizing hinge loss in the batch setting
Given a dataset {(x_1, y_1), ..., (x_n, y_n)}:
Minimize the average hinge loss: min_w (1/n) Σ_j max(0, -y_j(w·x_j))
How do we compute the gradient? The max makes the loss non-differentiable at some points, which is where subgradients come in.

Subgradients of convex functions
Gradients lower bound convex functions: if f is convex and differentiable at x, then f(y) ≥ f(x) + ∇f(x)·(y - x) for all y
Gradients are unique at x if the function is differentiable at x
Subgradients generalize gradients to non-differentiable points:
- Any plane that lower bounds the function, i.e., any g with f(y) ≥ f(x) + g·(y - x) for all y, is a subgradient at x

Subgradient of the hinge
Hinge loss: max(0, -y_t(w·x_t))
Subgradient of the hinge loss with respect to w:
- If y_t(w·x_t) > 0: 0
- If y_t(w·x_t) < 0: -y_t x_t
- If y_t(w·x_t) = 0: anything between 0 and -y_t x_t
- In one line: -1[y_t(w·x_t) ≤ 0] y_t x_t
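A direct translation of the one-line subgradient into NumPy (a sketch; at the kink y_t(w·x_t) = 0 we pick -y_t x_t, which is one valid choice):

```python
import numpy as np

def hinge_subgradient(w, x_t, y_t):
    """A subgradient of max(0, -y_t (w . x_t)) with respect to w."""
    if y_t * (w @ x_t) <= 0:     # loss is active (or we are at the kink)
        return -y_t * x_t
    return np.zeros_like(w)      # loss is identically zero here
```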

Subgradient descent for hinge minimization
Given data {(x_j, y_j)}, we want to minimize: (1/n) Σ_j max(0, -y_j(w·x_j))
Subgradient descent works the same as gradient descent: w^(t+1) = w^(t) - η g^(t), where g^(t) is a subgradient at w^(t)
- If there are multiple subgradients at a point, just pick (any) one

Perceptron revisited
Perceptron update: w^(t+1) = w^(t) + 1[y_t(w^(t)·x_t) ≤ 0] y_t x_t
Batch hinge minimization update: w^(t+1) = w^(t) + η Σ_j 1[y_j(w^(t)·x_j) ≤ 0] y_j x_j
Difference? The perceptron processes one point at a time with step size 1 (stochastic), while the batch update sums the hinge subgradients over the whole dataset. A sketch of the batch version follows below.
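To make the batch update concrete, here is a sketch of full-batch subgradient descent on the average hinge loss (eta and the iteration count are assumed values, not from the slides):

```python
import numpy as np

def batch_hinge_descent(X, y, eta=0.1, n_iters=100):
    """Full-batch subgradient descent on (1/n) sum_j max(0, -y_j (w . x_j))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)                # y_j (w . x_j) for all j
        active = margins <= 0                # points where the hinge is active
        g = -(y[active][:, None] * X[active]).sum(axis=0) / n
        w = w - eta * g                      # subgradient step
    return w
```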

The kernelized perceptron
What if the data are not linearly separable? Use features of features of features of features...
But the feature space can get really large really quickly!

Higher order polynomials
Let d be the input dimension and p the degree of the polynomial. The number of monomial terms of degree exactly p is C(d + p - 1, p), which grows fast in both d and p: for p = 6 and d = 100 it is about 1.6 billion terms. [Slide plot: number of monomial terms vs. input dimension d, for p = 2, 3, 4.]

Perceptron revisited
Given weight vector w^(t), predict point x by: y-hat = sign(w^(t)·x)
Mistake at time t: w^(t+1) = w^(t) + y_t x_t
Thus, we can write the weight vector in terms of the mistaken data points only:
- Let M^(t) be the set of time steps up to t when mistakes were made; then w^(t) = Σ_{j in M^(t)} y_j x_j
Prediction rule now: y-hat = sign(Σ_{j in M^(t)} y_j (x_j·x))
When using high-dimensional features φ(x): y-hat = sign(Σ_{j in M^(t)} y_j (φ(x_j)·φ(x)))
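A quick sanity check of that count, using the stars-and-bars formula C(d + p - 1, p) for the number of monomials of degree exactly p (the formula is my addition; the slides only state the roughly 1.6 billion figure):

```python
from math import comb

d, p = 100, 6
n_terms = comb(d + p - 1, p)  # monomials of degree exactly p in d variables
print(n_terms)                # 1609344100, i.e., about 1.6 billion
```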

Dot product of polynomials
For a feature map φ consisting of all polynomials of degree exactly p (with suitably chosen coefficients), the dot product collapses to a simple expression: φ(u)·φ(v) = (u·v)^p

Finally, the kernel trick!
Kernelized perceptron:
- Every time you make a mistake, remember (x_t, y_t)
- Kernelized perceptron prediction for x: y-hat = sign(Σ_{j in M} y_j K(x_j, x)), where K(u, v) = φ(u)·φ(v) is the kernel
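Here is a minimal sketch of the kernelized perceptron, storing mistakes and predicting through the kernel (function names are mine, not from the slides):

```python
import numpy as np

def kernelized_perceptron(X, y, kernel, n_passes=10):
    """Kernelized perceptron: remember mistakes, predict via kernel sums."""
    mistakes = []                                   # list of (x_j, y_j) pairs
    for _ in range(n_passes):
        for x_t, y_t in zip(X, y):
            score = sum(y_j * kernel(x_j, x_t) for x_j, y_j in mistakes)
            if y_t * score <= 0:                    # mistake: remember this point
                mistakes.append((x_t, y_t))
    return mistakes

def kp_predict(mistakes, kernel, x):
    """Predict a new point x from the stored mistakes."""
    return np.sign(sum(y_j * kernel(x_j, x) for x_j, y_j in mistakes))
```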

Polynomial kernels
All monomials of degree exactly p in O(d) operations: K(u, v) = (u·v)^p, corresponding to the φ of polynomials of degree exactly p
How about all monomials of degree up to p?
- Solution 0: sum a kernel for each degree, K(u, v) = Σ_{q=0}^{p} (u·v)^q
- Better solution: K(u, v) = (u·v + 1)^p, still O(d) operations

Common kernels
- Polynomials of degree exactly p: K(u, v) = (u·v)^p
- Polynomials of degree up to p: K(u, v) = (u·v + 1)^p
- Gaussian (squared exponential) kernel: K(u, v) = exp(-||u - v||² / (2σ²))
- Sigmoid: K(u, v) = tanh(η u·v + ν)
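Sketches of the kernels listed above; these plug straight into the kernelized perceptron sketch, e.g. kernel=lambda u, v: poly_up_to(u, v, 3). The parameters sigma, eta, and nu are hyperparameters, and all names here are mine:

```python
import numpy as np

def poly_exact(u, v, p):
    """Polynomials of degree exactly p."""
    return (u @ v) ** p

def poly_up_to(u, v, p):
    """Polynomials of degree up to p."""
    return (u @ v + 1.0) ** p

def gaussian(u, v, sigma):
    """Gaussian (squared exponential) kernel."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, eta, nu):
    """Sigmoid kernel."""
    return np.tanh(eta * (u @ v) + nu)
```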

What you need to know
- Linear separability in a higher-dimensional feature space
- The kernel trick
- The kernelized perceptron
- How to derive the polynomial kernel
- Common kernels

Support vector machines (SVMs)

Linear classifiers
Which line is better? Pick the one with the largest margin!

Maximize the margin
But there are many planes that separate the data.

Review: the normal to a plane
A convention: the normalized margin. Scale w and b so that the closest positive point x+ and the closest negative point x- satisfy w·x+ + b = +1 and w·x- + b = -1; these two hyperplanes are the canonical hyperplanes.

Margin maximization using canonical hyperplanes
Unnormalized problem: maximize the margin γ subject to y_j(w·x_j + b) ≥ γ for all j (with the scale of w fixed)
Normalized problem: min_{w,b} ||w||² subject to y_j(w·x_j + b) ≥ 1 for all j

Support vector machines (SVMs)
Solve efficiently by many methods, e.g.:
- quadratic programming (QP), with well-studied solution algorithms
- stochastic gradient descent
The hyperplane is defined by the support vectors: the points whose margin constraints are tight.
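As an illustration of the QP route (not the lecture's own code), scikit-learn's SVC wraps such a solver; a large C approximates the hard-margin problem, and the toy data here is my own:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C ~ hard margin
clf.fit(X, y)
print(clf.support_vectors_)        # the points that define the hyperplane
```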

What if the data are not linearly separable? Use features of features of features of features...
What if the data are still not linearly separable?
If the data are not linearly separable, some points don't satisfy the margin constraint: we relax it to y_j(w·x_j + b) ≥ 1 - ξ_j with slack ξ_j ≥ 0
How bad is the violation? The slack ξ_j measures it.
Trade off margin violations against ||w||: min_{w,b,ξ} ||w||² + C Σ_j ξ_j

SVMs for non-linearly separable data, meet my friend the perceptron
The perceptron was minimizing the hinge loss: Σ_t max(0, -y_t(w·x_t))
SVMs minimize the regularized hinge loss: Σ_t max(0, 1 - y_t(w·x_t)) + λ||w||²

Stochastic gradient descent for SVMs
Perceptron minimization: Σ_t max(0, -y_t(w·x_t))
SVM minimization: Σ_t max(0, 1 - y_t(w·x_t)) + λ||w||²
SGD for the perceptron: on a mistake (y_t(w·x_t) ≤ 0), w^(t+1) = w^(t) + y_t x_t
SGD for SVMs: w^(t+1) = (1 - ηλ) w^(t) + η 1[y_t(w^(t)·x_t) < 1] y_t x_t
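The SVM update above, as a minimal Pegasos-style sketch in code (lam, eta, and the pass count are assumed hyperparameters, not values from the slides):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.1, n_passes=10):
    """SGD on the regularized hinge loss: sum_t max(0, 1 - y_t (w . x_t)) + lam ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(n_passes):
        for t in rng.permutation(n):      # visit points in random order
            margin = y[t] * (w @ X[t])
            w *= (1.0 - eta * lam)        # shrink: subgradient of the L2 regularizer
            if margin < 1:                # hinge active: add its subgradient step
                w += eta * y[t] * X[t]
    return w
```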

[Figure: mixture model example. From the Hastie, Tibshirani, Friedman book.]
[Figure: mixture model example with kernels. From the Hastie, Tibshirani, Friedman book.]

What you need to know
- Maximizing the margin
- Derivation of the SVM formulation
- The non-linearly separable case:
  - hinge loss
  - a.k.a. adding slack variables
- SVMs = perceptron + L2 regularization
- SVMs can be optimized with SGD; many other approaches are possible