Linear Classifiers. Michael Collins. January 18, 2012

Today's Lecture

- Binary classification problems
- Linear classifiers
- The perceptron algorithm

Classification Problems: An Example

Goal: build a system that automatically determines whether an image is a human face or not.
- Each image is 100 × 100 pixels, where each pixel takes a grey-scale value in the set {0, 1, 2, ..., 255}.
- We represent an image as a point x ∈ R^d, where d = 100 × 100 = 10,000.
- We have n = 50 training examples, where each training example is an input point x ∈ R^10000 paired with a label y, with y = +1 if the training example contains a face and y = −1 otherwise.
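As a concrete illustration of this representation, here is a minimal NumPy sketch (the array shapes and variable names are assumptions for illustration, not part of the original notes):

import numpy as np

# A 100 x 100 grey-scale image with pixel values in {0, 1, ..., 255}
image = np.random.randint(0, 256, size=(100, 100))

# Flatten it into a single point x in R^d with d = 100 * 100 = 10000
x = image.reshape(-1).astype(float)
print(x.shape)  # (10000,)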

Binary Classification Problems

Goal: learn a function f : R^d → {−1, +1}.
- We have n training examples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.
- Each x_i is a point in R^d.
- Each y_i is either +1 or −1.

Supervised Learning Problems

Goal: learn a function f : X → Y.
- We have n training examples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each x_i ∈ X and each y_i ∈ Y.
- Often (but not always) X = R^d for some integer d.
- Some possibilities for Y:
  - Y = {−1, +1} (binary classification)
  - Y = {1, 2, ..., k} for some k > 2 (multi-class classification)
  - Y = R (regression)

A Second Example: Spam Filtering

Goal: build a system that predicts whether an email message is spam or not.
- Training examples: (x_i, y_i) for i = 1 ... n.
- Each y_i is +1 if the message is spam, −1 otherwise.
- Each x_i is a vector in R^d representing a document.

What Kind of Solution Would Suffice?

Say we have n = 50 training examples, and each pixel can take 256 values. It's possible that some pixel, say pixel number 3, has a different value on every one of the 50 training examples.
- Define x_{t,3} for t = 1 ... n to be the value of pixel 3 on the t'th training example.
- A possible function f(x) learned from the training set:
    For t = 1 ... 50: if x_3 = x_{t,3} then return y_t
    Return −1
- This classifies the training examples perfectly, but does it generalize to new examples? (A code sketch of this memorizing rule follows below.)
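A minimal Python sketch of that memorizing rule (the function name, the list-of-pairs representation of the training set, and the pixel index used are assumptions for illustration):

def memorizing_classifier(x, training_examples):
    # training_examples: list of (x_t, y_t) pairs, each x_t a sequence of pixel values
    for x_t, y_t in training_examples:
        if x[3] == x_t[3]:      # compare only "pixel number 3"
            return y_t
    return -1

By construction it is correct on every training example whenever pixel 3 takes a different value on each of them, yet it says nothing useful about new images.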

Model Selection

How can we find classifiers that generalize well? Key point: we must constrain the set of possible functions that we entertain.
- If our set of possible functions is too large, we risk finding a trivial function that works perfectly on the training data but does not generalize well.
- If our set of possible functions is too small, we may not even be able to find a function that works well on the training data.
- Later in the course we'll introduce a formal (statistical) analysis relating the size of a set of functions to the generalization properties of a learning algorithm.

Linear Classifiers through the Origin

Model form: f(x; θ) = sign(θ_1 x_1 + ... + θ_d x_d) = sign(x · θ)
- θ is a vector of real-valued parameters.
- The functions in our class are parameterized by θ ∈ R^d.
- sign(z) = +1 if z ≥ 0, and −1 otherwise.
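A minimal NumPy sketch of this model form (the function and variable names are illustrative, not taken from the notes):

import numpy as np

def predict_through_origin(x, theta):
    # f(x; theta) = sign(x . theta), with sign(0) defined as +1
    return 1 if np.dot(x, theta) >= 0 else -1

theta = np.array([2.0, -1.0])
print(predict_through_origin(np.array([1.0, 0.5]), theta))   # 1, since 2*1.0 - 1*0.5 = 1.5 >= 0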

Linear Classifiers through the Origin: Geometric Intuition

- Each point x is in R^d.
- The parameters θ specify a hyperplane (linear separator) that separates points into −1 vs. +1.
- Specifically, the hyperplane passes through the origin, with the vector θ as its normal.

Linear Classifiers (General Form)

Model form: f(x; θ, θ_0) = sign(x · θ + θ_0)
- θ is a vector of real-valued parameters, and θ_0 is a bias parameter.
- The functions in our class are parameterized by θ ∈ R^d and θ_0 ∈ R.

Linear Classifiers (General Form): Geometric Intuition

- Each point x is in R^d.
- The parameters θ, θ_0 specify a hyperplane (linear separator) that separates points into −1 vs. +1.
- Specifically, the hyperplane has the vector θ as its normal, and lies at distance |θ_0| / ‖θ‖ from the origin, where ‖θ‖ is the norm (length) of θ.
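A small numeric check of this geometry (a sketch; the particular θ and θ_0 values are made up for illustration):

import numpy as np

theta = np.array([3.0, 4.0])    # normal vector of the hyperplane
theta0 = 10.0                   # bias parameter

# Distance from the origin to the hyperplane {x : x . theta + theta0 = 0}
print(abs(theta0) / np.linalg.norm(theta))        # 2.0, since |10| / 5 = 2

# The point on the hyperplane closest to the origin
closest = -theta0 * theta / np.dot(theta, theta)
print(np.dot(closest, theta) + theta0)            # 0.0: it lies on the hyperplane
print(np.linalg.norm(closest))                    # 2.0: its length equals the distance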

A Learning Algorithm: The Perceptron

We've chosen a function class (the class of linear separators through the origin). The estimation problem: choose a specific function in this class (i.e., a setting for the parameters θ) on the basis of the training set.

One suggestion: find a value for θ that minimizes the number of training errors,

    Ê(θ) = (1/n) Σ_{t=1}^{n} (1 − δ(y_t, f(x_t; θ))) = (1/n) Σ_{t=1}^{n} Loss(y_t, f(x_t; θ))

where δ(y, y′) is 1 if y = y′, and 0 otherwise. Other definitions of Loss are possible.
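A minimal NumPy sketch of this training-error (0-1 loss) criterion for the through-the-origin classifier (function and variable names are assumptions for illustration):

import numpy as np

def training_error(theta, X, y):
    # X: n x d array of inputs; y: length-n array of +1/-1 labels
    # Fraction of training examples on which sign(x_t . theta) disagrees with y_t
    predictions = np.where(X @ theta >= 0, 1, -1)
    return np.mean(predictions != y)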

The Perceptron Algorithm

Initialization: θ = 0 (i.e., all parameters are set to 0).

Repeat until convergence:
  For t = 1 ... n:
    1. y′ = sign(x_t · θ)
    2. If y′ ≠ y_t, then θ = θ + y_t x_t; else leave θ unchanged.

Convergence occurs when the parameter vector θ remains unchanged for an entire pass over the training set. At that point, all training examples are classified correctly.
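A minimal NumPy implementation of the algorithm (a sketch: the pass cap, function name, and data layout are assumptions; if the data are linearly separable through the origin the loop terminates with zero training errors):

import numpy as np

def perceptron(X, y, max_passes=1000):
    # X: n x d array of inputs; y: length-n array of +1/-1 labels
    n, d = X.shape
    theta = np.zeros(d)                              # initialization: theta = 0
    for _ in range(max_passes):                      # repeat until convergence (or the pass cap)
        updated = False
        for t in range(n):
            y_pred = 1 if X[t] @ theta >= 0 else -1  # y' = sign(x_t . theta)
            if y_pred != y[t]:
                theta = theta + y[t] * X[t]          # mistake: theta = theta + y_t x_t
                updated = True
        if not updated:                              # a full pass with no updates: converged
            break
    return theta

# Tiny separable example
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
theta = perceptron(X, y)
print(np.all(np.where(X @ theta >= 0, 1, -1) == y))  # True: all training examples classified correctly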

More about the Perceptron

Analysis: if there exists a parameter setting θ that correctly classifies all training examples, the algorithm will converge. Otherwise, the algorithm will not converge.

Intuition: suppose we make a mistake on x_t, and let θ′ = θ + y_t x_t be the updated parameter vector. Then:

    y_t (θ′ · x_t) = y_t ((θ + y_t x_t) · x_t) = y_t (θ · x_t) + y_t² (x_t · x_t) = y_t (θ · x_t) + ‖x_t‖²

Hence y_t (θ · x_t) increases by ‖x_t‖² after the update.
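A quick numeric check of this calculation (the particular θ, x_t, and y_t values are made up for illustration):

import numpy as np

theta = np.array([0.5, -1.0])
x_t = np.array([2.0, 1.0])
y_t = -1                             # suppose this example is misclassified

before = y_t * (theta @ x_t)         # y_t (theta . x_t) before the update
theta_new = theta + y_t * x_t        # perceptron update on a mistake
after = y_t * (theta_new @ x_t)      # y_t (theta . x_t) after the update

print(after - before)                # 5.0
print(x_t @ x_t)                     # 5.0 = ||x_t||^2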

The Perceptron Convergence Theorem

Assume there exists some parameter vector θ* and some γ > 0 such that for all t = 1 ... n,

    y_t (x_t · θ*) ≥ γ

Assume in addition that for all t = 1 ... n, ‖x_t‖ ≤ R. Then the perceptron algorithm makes at most

    R² ‖θ*‖² / γ²

updates before convergence.
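For reference, the usual argument behind this bound, written as a brief LaTeX sketch (an assumption: this mirrors the standard proof rather than reproducing one given verbatim in these notes; θ^{(k)} denotes the parameter vector after k updates):

\begin{align*}
\theta^{(k)} \cdot \theta^* &= \theta^{(k-1)} \cdot \theta^* + y_t (x_t \cdot \theta^*)
  \;\ge\; \theta^{(k-1)} \cdot \theta^* + \gamma
  &&\Rightarrow\quad \theta^{(k)} \cdot \theta^* \ge k\gamma \\
\|\theta^{(k)}\|^2 &= \|\theta^{(k-1)}\|^2 + 2\, y_t (\theta^{(k-1)} \cdot x_t) + \|x_t\|^2
  \;\le\; \|\theta^{(k-1)}\|^2 + R^2
  &&\Rightarrow\quad \|\theta^{(k)}\|^2 \le k R^2 \\
k\gamma \;&\le\; \theta^{(k)} \cdot \theta^*
  \;\le\; \|\theta^{(k)}\| \, \|\theta^*\|
  \;\le\; \sqrt{k}\, R\, \|\theta^*\|
  &&\Rightarrow\quad k \le \frac{R^2 \|\theta^*\|^2}{\gamma^2}
\end{align*}

The second line uses the fact that an update happens only on a mistake, so y_t (θ^{(k−1)} · x_t) ≤ 0; the last line is the Cauchy–Schwarz inequality.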

A Geometric Interpretation

Assume there exists some parameter vector θ* and some γ > 0 such that for all t = 1 ... n,

    y_t (x_t · θ*) ≥ γ

The ratio γ / ‖θ*‖ is the smallest distance of any point x_t to the hyperplane defined by θ*.
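A small numeric illustration of this ratio (the data and θ* are made up for illustration; γ is taken as the largest value satisfying the assumption):

import numpy as np

theta_star = np.array([1.0, 1.0])
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])

margins = y * (X @ theta_star)        # y_t (x_t . theta*) for each example
gamma = margins.min()                 # largest gamma with y_t (x_t . theta*) >= gamma for all t

distances = np.abs(X @ theta_star) / np.linalg.norm(theta_star)   # distance of each x_t to the hyperplane
print(gamma / np.linalg.norm(theta_star))   # 2.1213... = 3 / sqrt(2)
print(distances.min())                      # matches: the smallest point-to-hyperplane distance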