CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

Linear Classification. Given labeled data (x_i, y_i), i = 1, ..., n, where x_i is a feature vector and y_i is a label equal to +1 or -1, find a hyperplane that separates the positive examples from the negative ones.

The Linear Classification Process. [Diagram: Training Data -> Classification Algorithm -> Prediction Rule: a vector w.] How do you classify a test example x? Output y = sign(⟨w, x⟩).

Linear Classification. Given labeled data (x_i, y_i), i = 1, ..., n, where y_i is +1 or -1, find a hyperplane through the origin that separates the positive examples from the negative ones. Here w is the normal vector to the hyperplane: for a point x on one side of the hyperplane, ⟨w, x⟩ > 0; for a point x on the other side, ⟨w, x⟩ < 0.

The Perceptron Algorithm. Goal: given labeled data (x_i, y_i), i = 1, ..., n, where y_i is +1 or -1, find a vector w such that the corresponding hyperplane separates the positive examples from the negative ones. Perceptron Algorithm: 1. Initially, w_1 = y_1 x_1. 2. For t = 2, 3, ...: if y_t ⟨w_t, x_t⟩ < 0 then w_{t+1} = w_t + y_t x_t; else w_{t+1} = w_t.
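A minimal NumPy sketch of this perceptron through the origin (one pass over the data in the given order; the function name and the +1/-1 label convention are assumptions for illustration, not part of the slides):

import numpy as np

def perceptron(X, y):
    """Single pass of the perceptron through the origin.
    X: (n, d) array of feature vectors; y: (n,) array of +1/-1 labels."""
    w = y[0] * X[0].astype(float)            # initialize with the first example: w_1 = y_1 x_1
    for t in range(1, len(X)):
        if y[t] * np.dot(w, X[t]) < 0:       # mistake: current w misclassifies x_t
            w = w + y[t] * X[t]              # update: w_{t+1} = w_t + y_t x_t
        # otherwise keep w unchanged
    return w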

Linear Separability. [Figures: a dataset that is linearly separable and a dataset that is not linearly separable.]

Measure of Separability: Margin. For a unit vector w and training set S, the margin of w is defined as γ = min_{x ∈ S} |⟨w, x⟩| / ||x||, i.e., the minimum (unsigned) cosine of the angle between w and x over all x in S. [Figures: a low-margin separator and a high-margin separator.]

Real Examples: Margin. For a unit vector w and training set S, the margin of w is γ = min_{x ∈ S} |⟨w, x⟩| / ||x||, the minimum (unsigned) cosine of the angle between w and x over the data. [Figures: real datasets with a low-margin and a high-margin separator.]
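A small sketch (an assumed helper, not from the slides) that computes this margin for a given vector w over a data matrix X:

import numpy as np

def margin(w, X):
    """Margin of w on data X: min over rows x of |<w, x>| / ||x||."""
    w = w / np.linalg.norm(w)                            # make sure w is a unit vector
    cosines = np.abs(X @ w) / np.linalg.norm(X, axis=1)  # unsigned cosine per example
    return cosines.min()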

When does the Perceptron work? Theorem: if the training points are linearly separable with margin γ, and ||x_i|| = 1 for all x_i in the training data, then the number of mistakes made by the perceptron is at most 1/γ². Observations: 1. The lower the margin, the more mistakes. 2. More than one pass over the training data may be needed to obtain a classifier that makes no mistakes on the training set.

What if the data is not linearly separable? Ideally, we want to find the linear separator that makes the minimum number of mistakes on the training data, but this is NP-hard.

Voted Perceptron. Perceptron: 1. Initially: m = 1, w_1 = y_1 x_1. 2. For t = 2, 3, ...: if y_t ⟨w_m, x_t⟩ ≤ 0, then w_{m+1} = w_m + y_t x_t and m = m + 1. 3. Output w_m. Voted Perceptron: 1. Initially: m = 1, w_1 = y_1 x_1, c_1 = 1. 2. For t = 2, 3, ...: if y_t ⟨w_m, x_t⟩ ≤ 0, then w_{m+1} = w_m + y_t x_t, m = m + 1, c_m = 1; else c_m = c_m + 1. 3. Output (w_1, c_1), (w_2, c_2), ..., (w_m, c_m).

Voted Perceptron. Voted Perceptron: 1. Initially: m = 1, w_1 = y_1 x_1, c_1 = 1. 2. For t = 2, 3, ...: if y_t ⟨w_m, x_t⟩ ≤ 0, then w_{m+1} = w_m + y_t x_t, m = m + 1, c_m = 1; else c_m = c_m + 1. 3. Output (w_1, c_1), (w_2, c_2), ..., (w_m, c_m). How to classify an example x? Output sign( Σ_{i=1}^m c_i sign(⟨w_i, x⟩) ). Problem: we have to store all the classifiers.
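A minimal NumPy sketch of the voted perceptron as described above (single pass; the function names are assumptions for illustration):

import numpy as np

def voted_perceptron(X, y):
    """Single pass of the voted perceptron. Returns the lists of w_i and c_i."""
    weights = [y[0] * X[0].astype(float)]              # w_1 = y_1 x_1
    counts = [1]                                       # c_1 = 1
    for t in range(1, len(X)):
        if y[t] * np.dot(weights[-1], X[t]) <= 0:      # mistake
            weights.append(weights[-1] + y[t] * X[t])  # w_{m+1} = w_m + y_t x_t
            counts.append(1)                           # c_{m+1} = 1
        else:
            counts[-1] += 1                            # current w_m survives one more example
    return weights, counts

def voted_predict(weights, counts, x):
    """sign( sum_i c_i * sign(<w_i, x>) )"""
    votes = sum(c * np.sign(np.dot(w, x)) for w, c in zip(weights, counts))
    return np.sign(votes)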

Averaged Perceptron. Averaged Perceptron (same training procedure as the voted perceptron): 1. Initially: m = 1, w_1 = y_1 x_1, c_1 = 1. 2. For t = 2, 3, ...: if y_t ⟨w_m, x_t⟩ ≤ 0, then w_{m+1} = w_m + y_t x_t, m = m + 1, c_m = 1; else c_m = c_m + 1. 3. Output (w_1, c_1), (w_2, c_2), ..., (w_m, c_m). How to classify an example x? Output sign( Σ_{i=1}^m c_i ⟨w_i, x⟩ ).
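Because the averaged rule is linear in x, the weighted sum collapses into a single vector, so only one classifier needs to be stored. A sketch reusing the voted_perceptron output assumed above:

import numpy as np

def averaged_predict(weights, counts, x):
    """sign( sum_i c_i <w_i, x> ) = sign( < sum_i c_i w_i, x > )"""
    w_avg = sum(c * w for w, c in zip(weights, counts))  # a single vector suffices
    return np.sign(np.dot(w_avg, x))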

Voted vs. Averaged. How to classify an example x? Averaged: sign( Σ_{i=1}^m c_i ⟨w_i, x⟩ ). Voted: sign( Σ_{i=1}^m c_i sign(⟨w_i, x⟩) ). Example where voted and averaged differ: w_1 = x_1, w_2 = x_2, w_3 = x_3, c_1 = c_2 = c_3 = 1. Averaged predicts +1 when ⟨x_1 + x_2 + x_3, x⟩ > 0; voted predicts +1 when at least two of ⟨x_1, x⟩, ⟨x_2, x⟩, ⟨x_3, x⟩ are > 0.

Perceptron: An Example. Data: ((4, 0), +1), ((1, 1), -1), ((0, 1), -1), ((2, 2), -1). Step 1: w_1 = y_1 x_1 = (4, 0), c_1 = 1. Step 2: (x_2, y_2) = ((1, 1), -1); y_2 ⟨w_1, x_2⟩ < 0, a mistake, so update w_2 = w_1 + y_2 x_2 = (3, -1), c_2 = 1. Step 3: (x_3, y_3) = ((0, 1), -1); y_3 ⟨w_2, x_3⟩ > 0, correct, so update c_2 = 2. Step 4: (x_4, y_4) = ((2, 2), -1); y_4 ⟨w_2, x_4⟩ < 0, a mistake, so update w_3 = w_2 + y_4 x_4 = (1, -3), c_3 = 1. Final voted perceptron: sign( sign(⟨w_1, x⟩) + 2 sign(⟨w_2, x⟩) + sign(⟨w_3, x⟩) ). Final averaged perceptron: sign(⟨w_1 + 2 w_2 + w_3, x⟩). [The accompanying figures show the data points and the decision boundary of the current w after each step.]
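A short script, reusing the voted_perceptron sketch assumed above, that reproduces this trace:

import numpy as np

X = np.array([[4, 0], [1, 1], [0, 1], [2, 2]], dtype=float)
y = np.array([1, -1, -1, -1])

weights, counts = voted_perceptron(X, y)
# weights should come out as (4, 0), (3, -1), (1, -3) and counts as 1, 2, 1,
# matching w_1, w_2, w_3 and c_1, c_2, c_3 from the worked example.
print(weights, counts)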

Summary: linear classification; the Perceptron; the voted and averaged Perceptron; some issues.

Margins. [Figure: a separator and the max-margin separator.] Suppose the data is linearly separable by a margin. The perceptron stops as soon as it finds some linear separator, but sometimes we would like to compute the max-margin separator. This is done by Support Vector Machines (SVMs).

More General Linear Classification. Given labeled data (x_i, y_i), i = 1, ..., n, where y_i is +1 or -1, find a hyperplane (not necessarily through the origin) that separates the positive examples from the negative ones. We can reduce general linear classification to linear classification through the origin as follows: for each (x_i, y_i), where x_i is d-dimensional, construct (z_i, y_i), where z_i is the (d+1)-dimensional vector [1, x_i]. Any hyperplane through the origin in z-space corresponds to a hyperplane (not necessarily through the origin) in x-space, and vice versa.
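A one-line sketch of this lifting step in NumPy (the helper name is an assumption):

import numpy as np

def add_bias_feature(X):
    """Map each d-dimensional x_i to the (d+1)-dimensional z_i = [1, x_i]."""
    return np.hstack([np.ones((X.shape[0], 1)), X])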

Categorical Features. These are features with no inherent ordering. Examples: Gender = {Male, Female}; State = {Alaska, Hawaii, California}. Convert a categorical feature with k categories into a k-dimensional vector: e.g., Alaska maps to [1, 0, 0], Hawaii to [0, 1, 0], and California to [0, 0, 1].
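A minimal one-hot encoding sketch for such a feature (an assumed helper, not from the slides):

def one_hot(value, categories):
    """Encode a categorical value as a k-dimensional 0/1 vector."""
    return [1 if value == c else 0 for c in categories]

# Example: one_hot("Hawaii", ["Alaska", "Hawaii", "California"]) -> [0, 1, 0]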

Multiclass Linear Classification. Main idea: reduce to multiple binary classification problems. One-vs-all: for k classes, solve k binary classification problems: class 1 vs. the rest, class 2 vs. the rest, ..., class k vs. the rest. All-vs-all: for k classes, solve k(k-1)/2 binary classification problems: class i vs. class j, for i = 1, ..., k and j = i+1, ..., k. To classify, combine the binary classifiers by voting (a one-vs-all sketch follows).
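A one-vs-all sketch built on the perceptron sketch above; as a common variant of the voting rule, it picks the class whose classifier gives the largest score (all names here are illustrative assumptions):

import numpy as np

def one_vs_all_train(X, y, classes):
    """Train one perceptron per class: class c vs. the rest."""
    return {c: perceptron(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_all_predict(models, x):
    """Predict the class whose classifier gives the largest score <w_c, x>."""
    return max(models, key=lambda c: np.dot(models[c], x))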

The Confusion Matrix. M_ij = number of examples that have label j but are classified as i; N_j = number of examples with label j; C_ij = M_ij / N_j. The matrix indicates which classes are easy and which are hard to classify: a high diagonal entry means the class is easy to classify; a high off-diagonal entry means there is high scope for confusion between those two classes. [Figure: a 6 x 6 confusion matrix with rows and columns indexed by classes 1-6.]
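A small sketch computing C from true and predicted labels as defined above (an assumed helper; labels are taken to be integers 0, ..., k-1):

import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """C[i, j] = fraction of examples with true label j that are predicted as i."""
    M = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        M[p, t] += 1                              # M[i, j]: true label j, predicted i
    return M / M.sum(axis=0, keepdims=True)       # divide each column j by N_j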

Kernels

Feature Maps. [Figure: data in (X_1, X_2) space mapped by φ to (Z_1, Z_2, Z_3) space.] Data that is not linearly separable may be linearly separable in another feature space. In the figure, the data is linearly separable in z-space, where (Z_1, Z_2, Z_3) = (X_1², √2 X_1 X_2, X_2²).

Feature Maps. Let φ(x) be a feature map, for example φ(x) = (x_1², x_1 x_2, x_2²). We want to run the perceptron in the feature space φ(x).

Feature Maps. [Figure: a two-dimensional dataset in (X_1, X_2) space.] What feature map will make this dataset linearly separable?

Feature Maps. Let φ(x) be a feature map, for example φ(x) = (x_1², x_1, x_2). We want to run the perceptron in the feature space φ(x). Many linear classification algorithms, including the perceptron, SVMs, etc., can be written purely in terms of dot products ⟨x, z⟩, so we can simply replace these dot products by ⟨φ(x), φ(z)⟩. For a feature map φ we define the corresponding kernel K by K(x, z) = ⟨φ(x), φ(z)⟩. Thus we can write the perceptron algorithm in terms of K (a sketch follows).
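Since every w produced by the perceptron has the form Σ y_i φ(x_i), summed over the first example and the examples on which it made a mistake, the score ⟨w, φ(x)⟩ can be evaluated as Σ y_i K(x_i, x) without ever forming φ explicitly. A kernelized perceptron sketch under these assumptions (names are illustrative):

import numpy as np

def kernel_perceptron(X, y, K):
    """Single pass of the perceptron written only in terms of a kernel K(x, z).
    Returns the indices S of examples whose y_i phi(x_i) make up w."""
    S = [0]                                              # w_1 = y_1 phi(x_1)
    for t in range(1, len(X)):
        score = sum(y[i] * K(X[i], X[t]) for i in S)     # <w, phi(x_t)>
        if y[t] * score < 0:                             # mistake
            S.append(t)                                  # w <- w + y_t phi(x_t)
    return S

def kernel_predict(X, y, S, K, x):
    """sign(<w, phi(x)>) = sign( sum_{i in S} y_i K(x_i, x) )"""
    return np.sign(sum(y[i] * K(X[i], x) for i in S))

# Example kernel from the next slide: K(x, z) = (<x, z>)^2
quad_kernel = lambda a, b: np.dot(a, b) ** 2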

Kernels. For a feature map φ we define the corresponding kernel K by K(x, z) = ⟨φ(x), φ(z)⟩. Examples: 1. K(x, z) = (⟨x, z⟩)².

Kernels. For a feature map φ we define the corresponding kernel K by K(x, z) = ⟨φ(x), φ(z)⟩. Examples: 1. K(x, z) = (⟨x, z⟩)². Expanding, K(x, z) = ( Σ_{i=1}^d x_i z_i )² = Σ_{i=1}^d Σ_{j=1}^d x_i x_j z_i z_j = Σ_{i,j=1}^d (x_i x_j)(z_i z_j). For d = 3, the corresponding feature map is φ(x) = [x_1², x_1 x_2, x_1 x_3, x_2 x_1, x_2², x_2 x_3, x_3 x_1, x_3 x_2, x_3²]ᵀ. Time to compute K directly: O(d). Time to compute K through the feature map: O(d²).
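A quick numerical check of this identity for d = 3 (the helper name and test vectors are assumptions for illustration):

import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (<x, z>)^2 in d = 3 dimensions."""
    return np.array([x[i] * x[j] for i in range(3) for j in range(3)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])
assert np.isclose(np.dot(x, z) ** 2, np.dot(phi(x), phi(z)))   # both equal 32^2 = 1024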

Kernels. For a feature map φ we define the corresponding kernel K by K(x, z) = ⟨φ(x), φ(z)⟩. Examples: 1. K(x, z) = (⟨x, z⟩)². 2. K(x, z) = (⟨x, z⟩ + c)². In-class problem: what is the feature map for this kernel?