Linear models: the perceptron and closest centroid algorithms. 9/3/13. Preliminaries: Chapter 1, 7.

Preliminaries
Definition: The Euclidean dot product between two vectors is the expression w^T x = Σ_{i=1}^d w_i x_i. The dot product is also referred to as inner product or scalar product. It is sometimes denoted as w · x (hence the name dot product).

Geometric interpretation: the dot product between two unit vectors is the cosine of the angle between them. The dot product between a vector and a unit vector is the length of its projection in that direction. And in general: w^T x = ||w|| ||x|| cos(θ). The norm of a vector: ||x|| = √(x^T x).

Labeled data
A labeled dataset: D = {(x_i, y_i)}_{i=1}^n, where the x_i ∈ R^d are d-dimensional vectors. The labels y_i are discrete for classification problems (e.g. 1, -1 for binary classification).
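As a quick illustration of these definitions, here is a minimal NumPy sketch (the vectors w and x are made-up examples, not from the lecture) that computes the dot product, the norms, and the cosine of the angle between the two vectors.

import numpy as np

# Two example vectors in R^3 (made-up values)
w = np.array([1.0, 2.0, -1.0])
x = np.array([0.5, -1.0, 2.0])

dot = w @ x                  # Euclidean dot product: sum_i w_i * x_i
norm_w = np.sqrt(w @ w)      # norm of w: ||w|| = sqrt(w^T w)
norm_x = np.linalg.norm(x)   # same norm, via the built-in helper

# From w^T x = ||w|| ||x|| cos(theta):
cos_theta = dot / (norm_w * norm_x)
print(dot, norm_w, norm_x, cos_theta)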

Labeled data
A labeled dataset: D = {(x_i, y_i)}_{i=1}^n, where the x_i ∈ R^d are d-dimensional vectors. The labels are continuous values for a regression problem (estimating a linear function). [Figure: a scatter plot of example regression data.]

Linear models (linear decision boundaries)
[Figure: a linear decision boundary separating the region where w^T x - b > 0 from the region where w^T x - b < 0.]

Discriminant/scoring function: f(x) = w^T x - b, where w is the weight vector and b is the bias. Decision boundary: all x such that f(x) = w^T x - b = 0. For linear models the decision boundary is a line in 2-d, a plane in 3-d, and a hyperplane in higher dimensions.
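To make the discriminant concrete, here is a minimal sketch (the weight vector, bias, and test points are made up) that evaluates f(x) = w^T x - b and reports which side of the decision boundary each point falls on.

import numpy as np

def discriminant(w, b, x):
    # Scoring function f(x) = w^T x - b of a linear model
    return w @ x - b

# Made-up parameters and inputs for illustration
w = np.array([2.0, -1.0])
b = 1.0

for x in [np.array([1.5, 0.5]), np.array([0.0, 2.0])]:
    score = discriminant(w, b, x)
    side = "w^T x - b > 0" if score > 0 else "w^T x - b <= 0"
    print(x, "f(x) =", score, "->", side)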

Using the discriminant to make a prediction: ŷ = sign(w^T x - b), where the sign function equals 1 when its argument is positive and -1 otherwise. Decision boundary: all x such that f(x) = w^T x - b = 0. What can you say about the decision boundary when b = 0?

Linear models for regression
When using a linear model for regression the scoring function is the prediction: ŷ = w^T x - b.

Why linear?
It's a good baseline: always start simple.
Linear models are stable.
Linear models are less likely to overfit the training data because they have relatively few parameters; they can sometimes underfit.
Often all you need when the data is high dimensional.
Lots of scalable algorithms.
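A small sketch of both uses of the scoring function, again with made-up numbers: the sign of w^T x - b is the class prediction, while the raw value is used directly as the regression prediction.

import numpy as np

w = np.array([0.8, -0.3])   # made-up weight vector
b = 0.2                     # made-up bias
x = np.array([1.0, 1.0])    # made-up input

score = w @ x - b
y_class = 1 if score > 0 else -1   # classification: y_hat = sign(w^T x - b)
y_reg = score                      # regression: y_hat = w^T x - b
print("score:", score, "class prediction:", y_class, "regression prediction:", y_reg)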

Large margin classification
[Figure: a linear decision boundary with its margin.]

Non-linear large margin classifiers
[Figure: a non-linear decision boundary.]

From linear to non-linear
There is a neat mathematical trick that will enable us to use linear classifiers to create non-linear decision boundaries!
Original data: not linearly separable. Transformed data: (x', y') = (x², y²).
[Figures: the original data and the transformed data, which is linearly separable.]
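A minimal sketch of that trick, assuming the squared-coordinates map (x, y) -> (x², y²): points inside and outside a circle are not separable by a line in the original coordinates, but they are after the transformation.

import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-d data: label +1 inside the unit circle, -1 outside it
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)

# The transformation: map each point (x1, x2) to (x1^2, x2^2)
Z = X ** 2

# In the transformed space the classes are separated by the line z1 + z2 = 1,
# i.e. a linear decision boundary with w = (1, 1) and b = 1
w, b = np.array([1.0, 1.0]), 1.0
pred = np.where(Z @ w - b < 0, 1, -1)
print("accuracy of the linear rule in the transformed space:", (pred == y).mean())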

Define:
μ(+) = (1/|Pos|) Σ_{i : y_i = +1} x_i
μ(-) = (1/|Neg|) Σ_{i : y_i = -1} x_i
where Pos/Neg is the number of positive/negative examples. This is the center of mass of the positive/negative examples. Classify an input x according to which center of mass it is closest to.

Let's express this as a linear classifier! (See the textbook.) Our hyperplane is going to be perpendicular to the vector that connects the two centers of mass. Therefore: w = μ(+) - μ(-).
[Figure: the two centers of mass μ(+) and μ(-), their midpoint, and the hyperplane perpendicular to w = μ(+) - μ(-).]

To find the bias term we use the fact that the midpoint between the means is on the hyperplane, i.e. w^T (μ(+) + μ(-))/2 - b = 0. With a little algebra (recall that the norm of a vector is ||x|| = √(x^T x)):
b = (1/2) (μ(+) - μ(-))^T (μ(+) + μ(-)) = (1/2) (||μ(+)||² - ||μ(-)||²).
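A minimal sketch of the closest centroid classifier expressed as a linear model, following the formulas above (the toy dataset is made up):

import numpy as np

def fit_closest_centroid(X, y):
    # w = mu(+) - mu(-); b places the midpoint of the two centroids on the hyperplane
    mu_plus = X[y == 1].mean(axis=0)    # center of mass of the positive examples
    mu_minus = X[y == -1].mean(axis=0)  # center of mass of the negative examples
    w = mu_plus - mu_minus
    b = 0.5 * (mu_plus @ mu_plus - mu_minus @ mu_minus)  # (||mu(+)||^2 - ||mu(-)||^2) / 2
    return w, b

# Made-up toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(50, 2)), rng.normal(-2, 1, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

w, b = fit_closest_centroid(X, y)
pred = np.sign(X @ w - b)
print("training accuracy:", (pred == y).mean())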

One of the first classifiers proposed goes back to 1957! (Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.)

Linearly separable data
Linearly separable data: there exists a linear decision boundary separating the classes. [Figure: an example, with the regions w^T x - b > 0 and w^T x - b < 0 on either side of the boundary.]

The bias and homogeneous coordinates
In some cases we will use algorithms that learn a discriminant function without a bias term. This does not reduce the expressivity of the model, because we can obtain a bias using the following trick: add another dimension x_0 to each input and set it to 1. Learn a weight vector of dimension d+1 in this extended space, and interpret w_0 as the bias term. With the notation w = (w_1, ..., w_d), w̃ = (w_0, w_1, ..., w_d) and x̃ = (1, x_1, ..., x_d), we have that w̃^T x̃ = w_0 + w^T x. See page 4 in the book.

Idea: iterate over the training examples, and update the weight vector in a way that makes x_i more likely to be correctly classified. (Note: we are learning a classifier without a bias term.) Let's assume that x_i is misclassified and is a positive example, i.e. w^T x_i < 0. We would like to update w to w' such that w'^T x_i > w^T x_i. This can be achieved by choosing w' = w + η x_i, where 0 < η ≤ 1 is the learning rate. Section 7 in the book.
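A short sketch of both ideas with made-up numbers: the homogeneous-coordinates trick folds the bias into the weight vector, and a single perceptron update on a misclassified positive example increases w^T x_i.

import numpy as np

# Homogeneous coordinates: prepend a constant 1 so that w_0 acts as the bias term
x = np.array([1.5, -0.5])           # made-up d-dimensional input
x_ext = np.concatenate(([1.0], x))  # extended input (1, x_1, ..., x_d)
w_ext = np.array([0.2, -1.0, 0.5])  # extended weights (w_0, w_1, ..., w_d), made up

eta = 1.0                               # learning rate, 0 < eta <= 1
print("score before:", w_ext @ x_ext)   # negative, so this positive example is misclassified
if w_ext @ x_ext < 0:                   # misclassified positive example (y_i = +1)
    w_ext = w_ext + eta * x_ext         # update: w' = w + eta * x_i
print("score after:", w_ext @ x_ext)    # larger than before, as the update guarantees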

If x_i is a negative example, the update needs to be in the opposite direction. Overall, we can summarize the two cases as: w' = w + η y_i x_i.

Input: labeled data D in homogeneous coordinates
Output: weight vector w
w = 0
converged = false
while not converged:
    converged = true
    for i in 1, ..., |D|:
        if x_i is misclassified: update w and set converged = false

The algorithm makes sense, but let's try to derive it in a more principled way. The algorithm is trying to find a weight vector w that separates positive from negative examples. We can express that as: y_i w^T x_i > 0 for i = 1, ..., n. For a given weight vector w, the degree to which this does not hold can be expressed as E(w) = -Σ_{i : x_i is misclassified} y_i w^T x_i. Do we want to find the w that minimizes or maximizes this criterion?

Digression: gradient descent
Given a function E(w), the gradient is the direction of steepest ascent. Therefore, to minimize E(w), take a step in the direction of the negative of the gradient. Notice that the gradient is perpendicular to contours of equal E(w). (Images from http://en.wikipedia.org/wiki/gradient_descent)

Gradient descent
We can now express gradient descent as w(t+1) = w(t) - η ∇E(w(t)), where ∇E(w) = ∂E(w)/∂w = (∂E(w)/∂w_1, ..., ∂E(w)/∂w_d) and w(t) is the weight vector at iteration t. The constant η is called the step size (learning rate when used in the context of machine learning).
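Here is a minimal sketch of the training loop described above, for data already in homogeneous coordinates (the toy dataset and the cap on the number of passes are not in the slides; the cap stops the loop on non-separable data):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    # X: n x d data in homogeneous coordinates, y: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):          # cap the passes in case the data is not separable
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:       # x_i is misclassified (or on the boundary)
                w = w + eta * yi * xi    # w' = w + eta * y_i * x_i
                converged = False
        if converged:
            break
    return w

# Made-up linearly separable data, with a constant-1 feature prepended
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2, 1, size=(30, 2)), rng.normal(-2, 1, size=(30, 2))])
X = np.hstack([np.ones((60, 1)), X])
y = np.array([1] * 30 + [-1] * 30)

w = perceptron(X, y)
print("training accuracy:", (np.sign(X @ w) == y).mean())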

Let s apply gradient descent to the perceptron criterion: E() = y i x i i: x i is misclassified @E() @ = i: x i is misclassified (t 1) = (t) @E() @ = (t) Which is exactly the perceptron algorithm! y i x i i: x i is misclassified y i x i The algorithm is guaranteed to converge if the data is linearly separable, and does not converge otherise. Issues ith the algorithm: The algorithm chooses an arbitrary hyperplane that separates the to classes. It may not be the best one from the learning perspective. Does not converge if the data is not separable (can halt after a fixed number of iterations). There are variants of the algorithm that address these issues (to some extent). 9 30 Perceptron for regression Replace the update equation ith: 0 = (y i ŷ i ) x i This is not likely to converge so the algorithm is run for a fixed number of training epochs Training epoch one complete run through the training data 31 8