Binary Classification / Perceptron

Binary Classification / Perceptron Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate

Supervised Learning. Input: (x_1, y_1), ..., (x_n, y_n), where x_i is the i-th data item and y_i is the i-th label. Goal: find a function f such that f(x_i) is a good approximation to y_i. Can use it to predict y values for previously unseen x values.

Examples of Supervised Learning Spam email detection Handwritten digit recognition Stock market prediction More?

Supervised Learning. Hypothesis space: set of allowable functions f: X → Y. Goal: find the best element of the hypothesis space. How do we measure the quality of f?

Regression y x Hypothesis class: linear functions f x = ax b Squared loss function used to measure the error of the approximation

Linear Regression In typical regression applications, measure the fit using a squared loss function L f, y i = f x i y i 2 Want to minimize the average loss on the training data For linear regression, the learning problem is then min a,b 1 n i ax i b y i 2 For an unseen data point, x, the learning algorithm predicts f(x)

Binary Classification. Input: (x^(1), y_1), ..., (x^(n), y_n) with x^(i) ∈ R^m and y_i ∈ {−1, 1}. We can think of the observations as points in R^m with an associated sign (either + or −, given by the label). An example with m = 2.

Binary Classification. Input: (x^(1), y_1), ..., (x^(n), y_n) with x^(i) ∈ R^m and y_i ∈ {−1, 1}. We can think of the observations as points in R^m with an associated sign (either + or −, given by the label). An example with m = 2. What is a good hypothesis space for this problem?

Binary Classification. Input: (x^(1), y_1), ..., (x^(n), y_n) with x^(i) ∈ R^m and y_i ∈ {−1, 1}. We can think of the observations as points in R^m with an associated sign (either + or −, given by the label). An example with m = 2. What is a good hypothesis space for this problem? In this case, we say that the observations are linearly separable.

Linear Separators. In n dimensions, a hyperplane is the set of solutions to the equation w^T x + b = 0, with w ∈ R^n and b ∈ R. Hyperplanes divide R^n into two distinct sets of points (called halfspaces): w^T x + b > 0 and w^T x + b < 0.
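
As a quick illustration (w and b below are arbitrary values, not from the slides), the sign of w^T x + b tells us which halfspace a point lies in:

import numpy as np

w = np.array([1.0, -2.0])   # illustrative weight vector in R^2
b = 0.5                     # illustrative offset

def side(x):
    """Return +1 or -1 according to the halfspace containing x."""
    return 1 if w @ x + b > 0 else -1

print(side(np.array([3.0, 0.0])))   # w^T x + b = 3.5 > 0  -> +1
print(side(np.array([0.0, 1.0])))   # w^T x + b = -1.5 < 0 -> -1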

The Linearly Separable Case. Input: (x^(1), y_1), ..., (x^(n), y_n) with x^(i) ∈ R^m and y_i ∈ {−1, 1}. Hypothesis space: separating hyperplanes, f(x) = w^T x + b. How should we choose the loss function?

The Linearly Separable Case. Input: (x^(1), y_1), ..., (x^(n), y_n) with x^(i) ∈ R^m and y_i ∈ {−1, 1}. Hypothesis space: separating hyperplanes, f_{w,b}(x) = w^T x + b. How should we choose the loss function? Count the number of misclassifications: loss = Σ_i 1[y_i ≠ sign(f_{w,b}(x^(i)))]. Tough to optimize: the gradient contains no information.

The Linearly Separable Case. Input: (x^(1), y_1), ..., (x^(n), y_n) with x^(i) ∈ R^m and y_i ∈ {−1, 1}. Hypothesis space: separating hyperplanes, f_{w,b}(x) = w^T x + b. How should we choose the loss function? Penalize each misclassification by the size of the violation: perceptron loss = Σ_i max(0, −y_i f_{w,b}(x^(i))). Modified hinge loss (this loss is convex, but not differentiable).
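
A small sketch of the two candidate losses just discussed (the function and variable names are my own; X is an n×m data matrix, y a vector of ±1 labels):

import numpy as np

def zero_one_loss(w, b, X, y):
    """Number of misclassifications: sum_i 1[y_i != sign(w^T x^(i) + b)]."""
    preds = np.sign(X @ w + b)
    return int(np.sum(preds != y))

def perceptron_loss(w, b, X, y):
    """sum_i max(0, -y_i (w^T x^(i) + b)): zero on correctly classified
    points, the size of the violation on misclassified ones."""
    margins = y * (X @ w + b)
    return float(np.sum(np.maximum(0.0, -margins)))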

The Perceptron Algorithm. Try to minimize the perceptron loss using (sub)gradient descent: ∇_w (perceptron loss) = Σ_{i : y_i f_{w,b}(x^(i)) ≤ 0} −y_i x^(i), and ∂/∂b (perceptron loss) = Σ_{i : y_i f_{w,b}(x^(i)) ≤ 0} −y_i.

Subgradients. For a convex function g(x), a subgradient at a point x_0 is any tangent line/plane through the point x_0 that underestimates the function everywhere. (Figure: a convex function g(x) with such a line at x_0.)

Subgradients. For a convex function g(x), a subgradient at a point x_0 is any tangent line/plane through the point x_0 that underestimates the function everywhere. (Figure: a convex function g(x) with such a line at x_0.) If 0 is a subgradient at x_0, then x_0 is a global minimum.
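
A short worked example (not on the slides, but using only the definitions above): take g(z) = max(0, −z), the per-example piece of the perceptron loss with z = y_i f_{w,b}(x^(i)). For z < 0, g is differentiable with slope −1; for z > 0 the slope is 0; at z = 0, any slope s ∈ [−1, 0] gives a line through (0, 0) that stays below g, so every such s is a subgradient there. Choosing s = −1 whenever z ≤ 0 and applying the chain rule in w gives −y_i x^(i) for each point with y_i f_{w,b}(x^(i)) ≤ 0, which is exactly the choice of subgradient used in the perceptron updates.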

The Perceptron Algorithm. Try to minimize the perceptron loss using (sub)gradient descent: w^(t+1) = w^(t) + γ_t Σ_{i : y_i f_{w^(t),b^(t)}(x^(i)) ≤ 0} y_i x^(i), and b^(t+1) = b^(t) + γ_t Σ_{i : y_i f_{w^(t),b^(t)}(x^(i)) ≤ 0} y_i, with step size γ_t (sometimes called the learning rate).
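
A sketch of this full-batch update in numpy (the step size, iteration cap, and stopping rule below are my own choices; the slides specify only the update itself):

import numpy as np

def perceptron_batch(X, y, step=1.0, iters=100):
    """Full-batch subgradient descent on the perceptron loss.
    X: (n, m) data matrix; y: (n,) labels in {-1, +1}."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins <= 0                  # misclassified or on the boundary
        if not viol.any():
            break                            # zero perceptron loss: done
        w = w + step * (y[viol][:, None] * X[viol]).sum(axis=0)
        b = b + step * y[viol].sum()
    return w, b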

Stochastic Gradient Descent. To make the training more practical, stochastic gradient descent is used instead of standard gradient descent. Approximate the gradient of a sum by sampling a few indices (as few as one) uniformly at random and averaging: ∇_x Σ_{i=1}^n g_i(x) ≈ (1/K) Σ_{k=1}^K ∇_x g_{i_k}(x), where each i_k is sampled uniformly at random from {1, ..., n}. Stochastic gradient descent converges under certain assumptions on the step size.

Stochastic Gradient Descent. Setting K = 1, we can simply pick a random observation i and perform the following update if the i-th data point is misclassified: w^(t+1) = w^(t) + γ_t y_i x^(i) and b^(t+1) = b^(t) + γ_t y_i; and otherwise: w^(t+1) = w^(t), b^(t+1) = b^(t). Sometimes, you will see the perceptron algorithm specified with γ_t = 1 for all t.
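
A sketch of this stochastic version with γ_t = 1, i.e. the classic perceptron algorithm (the epoch count and random visiting order are my own choices):

import numpy as np

def perceptron_sgd(X, y, epochs=10, seed=0):
    """Single-example perceptron updates: on a misclassified point i,
    set w <- w + y_i x^(i) and b <- b + y_i; otherwise do nothing."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) <= 0:   # misclassified or on the boundary
                w = w + y[i] * X[i]
                b = b + y[i]
    return w, b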

Application of Perceptron. Spam email classification. Represent emails as vectors of counts of certain words (e.g., sir, madam, Nigerian, prince, money, etc.). Apply the perceptron algorithm to the resulting vectors. To predict the label of an unseen email: construct its vector representation x, then check whether w^T x + b is positive or negative.
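
A sketch of this bag-of-words featurization and prediction step (the word list is a toy example, and perceptron_sgd refers to the illustrative helper sketched above):

import numpy as np

VOCAB = ["sir", "madam", "nigerian", "prince", "money"]   # toy word list

def email_to_vector(text):
    """Count how often each vocabulary word appears in the email."""
    words = text.lower().split()
    return np.array([words.count(v) for v in VOCAB], dtype=float)

# Training: stack the vectors of labeled emails (spam = +1, not spam = -1)
# X = np.stack([email_to_vector(e) for e in emails])
# w, b = perceptron_sgd(X, y)
# Prediction: spam if w^T x + b > 0
# label = 1 if email_to_vector(new_email) @ w + b > 0 else -1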

Perceptron Learning. Drawbacks: No convergence guarantees if the observations are not linearly separable. Can overfit. There are a number of perfect classifiers, but the perceptron algorithm doesn't have any mechanism for choosing between them.

What If the Data Isn't Separable? Input: (x^(1), y_1), ..., (x^(n), y_n) with x^(i) ∈ R^m and y_i ∈ {−1, 1}. We can think of the observations as points in R^m with an associated sign (either + or −, given by the label). An example with m = 2. What is a good hypothesis space for this problem?

Adding Features. The perceptron algorithm only works for linearly separable data. Can add features to make the data linearly separable over a larger space!

Adding Features. The idea: given the observations x^(1), ..., x^(n), construct a feature vector φ(x). Use φ(x^(1)), ..., φ(x^(n)) instead of x^(1), ..., x^(n) in the learning algorithm. The goal is to choose φ so that φ(x^(1)), ..., φ(x^(n)) are linearly separable. Learn linear separators of the form w^T φ(x) (instead of w^T x). Warning: more expressive features can lead to overfitting.

Adding Features. Examples: φ(x_1, x_2) = (x_1, x_2): this is just the input data, without modification. φ(x_1, x_2) = (1, x_1, x_2, x_1^2, x_2^2): this corresponds to a second-degree polynomial separator, or equivalently, elliptical separators in the original space.
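
A sketch of the second feature map (phi and the surrounding code are illustrative names, not from the slides):

import numpy as np

def phi(x):
    """Quadratic feature map: (x1, x2) -> (1, x1, x2, x1^2, x2^2).
    A linear separator in this feature space corresponds to a conic
    section (e.g., an axis-aligned ellipse) in the original (x1, x2) plane."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2])

# Run the perceptron on phi(x^(1)), ..., phi(x^(n)) instead of the raw inputs:
# X_feat = np.stack([phi(x) for x in X])
# w, b = perceptron_sgd(X_feat, y)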

Adding Features (figure: example data in the (x_1, x_2) plane).

Support Vector Machines How can we decide between two perfect classifiers? What is the practical difference between these two solutions?