Kernels. Machine Learning CSE446, Carlos Guestrin, University of Washington. October 28, 2013.


Linear Separability: More Formally, Using Margin
The data is linearly separable with margin $\gamma > 0$ if there exists a weight vector $w$ (say with $\|w\| = 1$) such that $y^{(t)}\, (w \cdot x^{(t)}) \geq \gamma$ for every labeled example $(x^{(t)}, y^{(t)})$.

Perceptron Analysis: Linearly Separable Case
Theorem [Block, Novikoff]: Given a sequence of labeled examples in which each feature vector has bounded norm, $\|x^{(t)}\| \leq R$, and the dataset is linearly separable with margin $\gamma$, the number of mistakes made by the online perceptron on any such sequence is bounded by $(R / \gamma)^2$.

Beyond the Linearly Separable Case
The perceptron algorithm is super cool!
- No assumption about the data distribution: it could be generated by an oblivious adversary, no need to be i.i.d.
- It makes a bounded number of mistakes, and then it is done forever, even if you see infinite data.
However, the real world is not linearly separable:
- We can't expect to never make mistakes again.
- The analysis extends to the non-linearly separable case with a very similar bound; see Freund & Schapire.
- The algorithm still converges, but ultimately may not give good accuracy (it can make many, many, many mistakes).
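As a concrete illustration of the mistake-driven update analyzed above (a minimal sketch of my own in Python with made-up toy data, not code from the lecture), the online perceptron below counts its mistakes; when the data is separable with margin $\gamma$ and $\|x\| \leq R$, the Block-Novikoff bound says the count never exceeds $(R/\gamma)^2$.

```python
import numpy as np

def online_perceptron(X, y, passes=10):
    """Mistake-driven perceptron: update only when sign(w.x) is wrong."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(passes):
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:   # mistake (or no decision yet)
                w += y_t * x_t              # w^(t+1) <- w^(t) + y^(t) x^(t)
                mistakes += 1
    return w, mistakes

# Toy separable data with labels in {-1, +1}; for data separable with margin gamma
# and max norm R, the Block-Novikoff bound says mistakes <= (R / gamma)^2.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, -1, -1])
w, m = online_perceptron(X, y)
print("learned w:", w, "mistakes:", m)
```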

What If the Data Is Not Linearly Separable?
Use features of features of features of features... but the feature space can get really large really quickly!

Higher-Order Polynomials
With $m$ input features and a polynomial of degree $d$, the number of monomial terms of degree exactly $d$ is $\binom{m + d - 1}{d}$, which grows fast: for $d = 6$ and $m = 100$ it is about 1.6 billion terms.
[Figure: number of monomial terms vs. number of input dimensions, with one curve for each of d = 2, 3, 4.]
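To make the growth concrete, here is a small sketch (mine, not from the slides) that counts monomials with the stars-and-bars formula; it reproduces the roughly 1.6 billion figure for $d = 6$, $m = 100$.

```python
from math import comb

def num_monomials_exactly(m, d):
    """Number of monomials of degree exactly d in m variables: C(m + d - 1, d)."""
    return comb(m + d - 1, d)

def num_monomials_up_to(m, d):
    """Number of monomials of degree at most d (including the constant): C(m + d, d)."""
    return comb(m + d, d)

print(num_monomials_exactly(100, 6))  # 1609344100, about 1.6 billion
print(num_monomials_up_to(100, 6))    # about 1.7 billion
```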

Perceptron Revisited
Given the weight vector $w^{(t)}$, predict the label of a point $x$ by $\hat{y} = \operatorname{sign}(w^{(t)} \cdot x)$.
On a mistake at time $t$: $w^{(t+1)} \leftarrow w^{(t)} + y^{(t)} x^{(t)}$.
Thus we can write the weight vector in terms of the mistaken data points only. Let $M^{(t)}$ be the set of time steps up to $t$ at which mistakes were made; then $w^{(t)} = \sum_{j \in M^{(t)}} y^{(j)} x^{(j)}$.
The prediction rule is now $\operatorname{sign}\big(\sum_{j \in M^{(t)}} y^{(j)}\, x^{(j)} \cdot x\big)$.
When using high-dimensional features $\phi(x)$, it becomes $\operatorname{sign}\big(\sum_{j \in M^{(t)}} y^{(j)}\, \phi(x^{(j)}) \cdot \phi(x)\big)$.

Dot Product of Polynomials
For the feature map consisting of all monomials of degree exactly $d$ (with the appropriate multinomial coefficients), the dot product in feature space can be computed without ever expanding the features: $\phi(u) \cdot \phi(v) = (u \cdot v)^d$.
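A quick numerical check of this identity (my own sketch, for $m = 2$ and $d = 2$): the explicit degree-2 feature map, with a $\sqrt{2}$ weight on the cross term, has exactly the same dot product as the kernel $(u \cdot v)^2$.

```python
import numpy as np

def phi_deg2(x):
    """All monomials of degree exactly 2 in two variables, with multinomial weights."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = np.dot(phi_deg2(u), phi_deg2(v))   # dot product in feature space
kernel   = np.dot(u, v) ** 2                  # (u . v)^2, no feature expansion
print(explicit, kernel)                       # both equal 1.0 here
```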

Finally, the Kernel Trick!!! (Kernelized Perceptron)
Every time you make a mistake, remember $(x^{(t)}, y^{(t)})$.
Kernelized perceptron prediction for $x$ (a code sketch of this predictor follows after the polynomial kernel discussion below):
$\operatorname{sign}(w^{(t)} \cdot \phi(x)) = \operatorname{sign}\Big(\sum_{j \in M^{(t)}} y^{(j)}\, \phi(x^{(j)}) \cdot \phi(x)\Big) = \operatorname{sign}\Big(\sum_{j \in M^{(t)}} y^{(j)}\, k(x^{(j)}, x)\Big)$

Polynomial Kernels
All monomials of degree exactly $d$ in $O(d)$ operations: $k(u, v) = (u \cdot v)^d$.
How about all monomials of degree up to $d$?
- Solution 0: add up the "degree exactly $d'$" kernels for each $d' \leq d$.
- Better solution: $k(u, v) = (u \cdot v + 1)^d$.
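Below is a minimal sketch of the kernelized perceptron just described (my own implementation with an assumed polynomial kernel and toy XOR-style data, not code from the lecture): it stores only the mistaken examples and predicts with the sign of the kernel sum.

```python
import numpy as np

def poly_kernel(u, v, d=3):
    """Polynomial kernel covering all monomials of degree up to d."""
    return (np.dot(u, v) + 1.0) ** d

class KernelizedPerceptron:
    def __init__(self, kernel=poly_kernel):
        self.kernel = kernel
        self.mistakes_x = []   # x^(j) for each mistake
        self.mistakes_y = []   # y^(j) for each mistake

    def predict(self, x):
        s = sum(y_j * self.kernel(x_j, x)
                for x_j, y_j in zip(self.mistakes_x, self.mistakes_y))
        return 1 if s > 0 else -1

    def fit(self, X, y, passes=5):
        for _ in range(passes):
            for x_t, y_t in zip(X, y):
                if self.predict(x_t) != y_t:      # mistake: remember (x^(t), y^(t))
                    self.mistakes_x.append(x_t)
                    self.mistakes_y.append(y_t)
        return self

# Toy usage: XOR-like data that no linear perceptron can separate.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([+1, +1, -1, -1])
clf = KernelizedPerceptron().fit(X, y)
print([clf.predict(x) for x in X])   # [1, 1, -1, -1]
```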

Common Kernels
- Polynomials of degree exactly $d$: $k(u, v) = (u \cdot v)^d$
- Polynomials of degree up to $d$: $k(u, v) = (u \cdot v + 1)^d$
- Gaussian (squared exponential) kernel: $k(u, v) = \exp\!\big(-\|u - v\|^2 / (2\sigma^2)\big)$
- Sigmoid: $k(u, v) = \tanh(\eta\, u \cdot v + \nu)$
(These four kernels are written out as a short code sketch after the summary below.)

What You Need to Know
- The notion of online learning
- The perceptron algorithm
- Mistake bounds and proofs
- The kernel trick
- The kernelized perceptron
- How to derive the polynomial kernel
- Common kernels
- In online learning, report the averaged weights at the end
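For concreteness, the four common kernels written out as small functions (a sketch of my own; the parameter names sigma, eta, and nu are assumed notation, not the lecture's).

```python
import numpy as np

def poly_exact(u, v, d=2):
    """Polynomial kernel, monomials of degree exactly d: (u . v)^d."""
    return np.dot(u, v) ** d

def poly_up_to(u, v, d=2):
    """Polynomial kernel, monomials of degree up to d: (u . v + 1)^d."""
    return (np.dot(u, v) + 1.0) ** d

def gaussian(u, v, sigma=1.0):
    """Gaussian / squared-exponential kernel: exp(-||u - v||^2 / (2 sigma^2))."""
    diff = u - v
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=0.0):
    """Sigmoid kernel: tanh(eta * u . v + nu)."""
    return np.tanh(eta * np.dot(u, v) + nu)

u, v = np.array([1.0, 0.5]), np.array([0.2, -0.3])
print(poly_exact(u, v), poly_up_to(u, v), gaussian(u, v), sigmoid(u, v))
```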

Your Midterm
Content: everything up to last Wednesday (Perceptron).
Only 80 minutes, so arrive early and settle down quickly; we'll start and end on time.
Open book: textbook, books, course notes, personal notes. Bring a calculator that can do log.
Not allowed: computers, tablets, phones, other materials, internet devices, wireless telepathy, or wandering eyes.
The exam covers key concepts and ideas; work on understanding the big picture and the differences between methods.

Support Vector Machines
Machine Learning CSE446, Carlos Guestrin, University of Washington
October 28, 2013

Linear Classifiers
Which line is better? Pick the one with the largest margin!
The decision boundary is the hyperplane $w \cdot x + w_0 = 0$, and the confidence in the prediction for example $j$ is $\text{confidence} = y_j (w \cdot x_j + w_0)$.

Maximize the Margin
$\max_{\gamma, w, w_0} \gamma$ subject to $y_j (w \cdot x_j + w_0) \geq \gamma, \ \forall j \in \{1, \ldots, N\}$

But There Are Many Planes
The same separating hyperplane $w \cdot x + w_0 = 0$ corresponds to many different scalings of $(w, w_0)$, so we need a convention to pin one down.

Review: Normal to a Plane
For the hyperplane $w \cdot x + w_0 = 0$, the vector $w$ is normal to the plane: any point can be written as $x_j = x_j^{\perp} + \lambda \frac{w}{\|w\|}$, where $x_j^{\perp}$ lies on the plane and $\lambda$ is the signed distance from $x_j$ to the plane.

A Convention: Normalized Margin (Canonical Hyperplanes)
Scale $(w, w_0)$ so that the closest positive point $x^{+}$ and the closest negative point $x^{-}$ satisfy $w \cdot x^{+} + w_0 = +1$ and $w \cdot x^{-} + w_0 = -1$, with the decision boundary $w \cdot x + w_0 = 0$ in between. The margin between these two canonical hyperplanes is $2\gamma = \frac{2}{\|w\|}$.
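This value of the margin follows in a few lines; a short worked derivation added here for completeness, using the slide's picture in which $x^{+}$ and $x^{-}$ differ by $2\gamma$ along the normal direction, $x^{+} = x^{-} + 2\gamma \frac{w}{\|w\|}$:

$$w \cdot x^{+} + w_0 = +1, \qquad w \cdot x^{-} + w_0 = -1.$$

Subtracting the two equations gives $w \cdot (x^{+} - x^{-}) = 2$, and substituting $x^{+} - x^{-} = 2\gamma \frac{w}{\|w\|}$ gives $2\gamma \frac{w \cdot w}{\|w\|} = 2\gamma \|w\| = 2$, hence $2\gamma = \frac{2}{\|w\|}$.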

Margin Maximization Using Canonical Hyperplanes
Unnormalized problem: $\max_{\gamma, w, w_0} \gamma$ subject to $y_j (w \cdot x_j + w_0) \geq \gamma, \ \forall j \in \{1, \ldots, N\}$
Normalized problem: $\min_{w, w_0} \|w\|_2^2$ subject to $y_j (w \cdot x_j + w_0) \geq 1, \ \forall j \in \{1, \ldots, N\}$
With canonical hyperplanes the margin is $2\gamma = 2 / \|w\|$, so maximizing the margin is equivalent to minimizing $\|w\|_2^2$.

Support Vector Machines (SVMs)
$\min_{w, w_0} \|w\|_2^2$ subject to $y_j (w \cdot x_j + w_0) \geq 1, \ \forall j \in \{1, \ldots, N\}$
- Solve efficiently by many methods, e.g., quadratic programming (QP): well-studied solution algorithms (see the sketch after this list)
- Stochastic gradient descent
- The hyperplane is defined by the support vectors
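As a usage sketch (assuming scikit-learn is available; this is my own example, not part of the lecture), a hard-margin-style SVM can be fit by giving SVC a linear kernel and a very large C, after which the support vectors and the margin $2/\|w\|$ can be read off the fitted model.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin problem min ||w||^2 s.t. y_j (w.x_j + w_0) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # learned weight vector w
w0 = clf.intercept_[0]    # learned offset w_0
print("support vectors:\n", clf.support_vectors_)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
```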

What If the Data Is Not Linearly Separable?
Use features of features of features of features...

What If the Data Is Still Not Linearly Separable?
$\min_{w, w_0} \|w\|_2^2$ subject to $y_j (w \cdot x_j + w_0) \geq 1, \ \forall j$
If the data is not linearly separable, some points don't satisfy the margin constraint. How bad is the violation? Measure it per point with a slack $\xi_j = \max\big(0,\ 1 - y_j (w \cdot x_j + w_0)\big)$, and trade off margin violations against $\|w\|$.
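A small sketch (mine, with a made-up $w$ and toy data) of the violation measure just described: the slack of each point and the resulting soft-margin objective $\|w\|_2^2 + C \sum_j \xi_j$.

```python
import numpy as np

def slacks(w, w0, X, y):
    """Slack xi_j = max(0, 1 - y_j (w . x_j + w_0)) for each training point."""
    margins = y * (X @ w + w0)
    return np.maximum(0.0, 1.0 - margins)

def soft_margin_objective(w, w0, X, y, C=1.0):
    """Regularized objective ||w||_2^2 + C * sum_j xi_j."""
    return np.dot(w, w) + C * slacks(w, w0, X, y).sum()

X = np.array([[2.0, 1.0], [0.2, 0.1], [-1.5, -2.0]])
y = np.array([+1, +1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0
print(slacks(w, w0, X, y))                 # only the middle point violates the margin
print(soft_margin_objective(w, w0, X, y))  # ||w||^2 = 2 plus the one slack of 0.7
```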

SVMs for Non-Linearly Separable Data: Meet My Friend the Perceptron
The perceptron was minimizing the hinge loss:
$\sum_{j=1}^{N} \big(-y_j (w \cdot x_j + w_0)\big)_{+}$
SVMs minimize the regularized hinge loss!!
$\|w\|_2^2 + C \sum_{j=1}^{N} \big(1 - y_j (w \cdot x_j + w_0)\big)_{+}$

Stochastic Gradient Descent for SVMs
Perceptron minimization: $\sum_{j=1}^{N} \big(-y_j (w \cdot x_j + w_0)\big)_{+}$
SGD for the perceptron: $w^{(t+1)} \leftarrow w^{(t)} + \mathbb{1}\big[y^{(t)} (w^{(t)} \cdot x^{(t)}) \leq 0\big]\, y^{(t)} x^{(t)}$
SVM minimization: $\|w\|_2^2 + C \sum_{j=1}^{N} \big(1 - y_j (w \cdot x_j + w_0)\big)_{+}$
SGD for SVMs: take the analogous per-example (sub)gradient step on the regularized hinge loss, where the indicator condition becomes $y^{(t)} (w^{(t)} \cdot x^{(t)} + w_0) < 1$ and the regularizer contributes an extra step in the direction of $-w^{(t)}$.
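The sketch below (my own, Pegasos-style; the decaying step-size schedule and the way the regularization constant is set are assumptions, not taken from the slides) runs stochastic subgradient descent on a regularized hinge loss: shrink $w$ toward zero at every step, and add $\eta\, y^{(t)} x^{(t)}$ only when the example violates the margin, i.e. $y^{(t)} (w \cdot x^{(t)} + w_0) < 1$.

```python
import numpy as np

def svm_sgd(X, y, lam=0.1, epochs=20, seed=0):
    """Stochastic subgradient descent on lam/2 ||w||^2 + per-example hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size (assumed schedule)
            margin = y[i] * (np.dot(w, X[i]) + w0)
            w *= (1.0 - eta * lam)                 # step from the L2 regularizer
            if margin < 1:                         # hinge is active: subgradient is -y_i x_i
                w += eta * y[i] * X[i]
                w0 += eta * y[i]
    return w, w0

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w, w0 = svm_sgd(X, y)
print("w =", w, "w0 =", w0, "predictions:", np.sign(X @ w + w0))
```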

What You Need to Know
- Maximizing the margin
- Derivation of the SVM formulation
- The non-linearly separable case: hinge loss, a.k.a. adding slack variables
- SVMs = Perceptron + L2 regularization
- Can also use kernels with SVMs
- Can optimize SVMs with SGD; many other approaches possible