
Linear models and the perceptron algorithm
Chapters 1, 7

Preliminaries

Definition: The Euclidean dot product between two vectors is the expression

    w^T x = \sum_{i=1}^{d} w_i x_i

The dot product is also referred to as the inner product or scalar product. It is sometimes denoted w \cdot x (hence the name dot product).

Geometric interpretation: the dot product between two unit vectors is the cosine of the angle between them. The dot product between a vector and a unit vector is the length of its projection in that direction. And in general:

    w^T x = \|w\| \|x\| \cos(\theta)

The norm of a vector: \|x\| = \sqrt{x^T x}.

Labeled data

A labeled dataset: D = \{(x_i, y_i)\}_{i=1}^{n}, where the x_i \in \mathbb{R}^d are d-dimensional vectors. The labels y_i are discrete for classification problems (e.g. +1, -1 for binary classification) and are continuous values for a regression problem.

Linear models

Linear models for classification have linear decision boundaries: they predict one class in the region where w^T x + b > 0 and the other class where w^T x + b < 0. Linear models for regression estimate a linear function of the input.
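As a concrete illustration of these preliminaries, here is a minimal NumPy sketch (the vectors and the bias below are made-up values, not from the slides) computing the dot product, the norm, the cosine of the angle between two vectors, and the sign of a linear discriminant w^T x + b:

    import numpy as np

    # Two example vectors in R^3 (values chosen arbitrarily for illustration)
    w = np.array([2.0, -1.0, 0.5])
    x = np.array([1.0, 3.0, -2.0])

    dot = w @ x                      # Euclidean dot product w^T x
    norm_x = np.sqrt(x @ x)          # norm of x: sqrt(x^T x)
    norm_w = np.linalg.norm(w)       # the same quantity via the library routine

    # Geometric interpretation: w^T x = ||w|| ||x|| cos(theta)
    cos_theta = dot / (norm_w * norm_x)

    # A linear model classifies x according to the sign of w^T x + b
    b = 0.5
    prediction = np.sign(w @ x + b)  # +1 or -1 (0 only exactly on the boundary)

    print(dot, norm_x, cos_theta, prediction)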

Discriminant/scoring function

The discriminant (or scoring) function of a linear model is

    f(x) = w^T x + b

where w is the weight vector and b is the bias.

Decision boundary: all x such that f(x) = w^T x + b = 0. For linear models the decision boundary is a line in 2-d, a plane in 3-d, and a hyperplane in higher dimensions. (What can you say about the decision boundary when b = 0? It passes through the origin.)

Using the discriminant to make a prediction:

    y = sign(w^T x + b)

where the sign function equals +1 when its argument is positive and -1 otherwise.

Linear models for regression

When using a linear model for regression, the scoring function itself is the prediction:

    y = w^T x + b

Why linear?

- It's a good baseline: always start simple.
- Linear models are stable.
- Linear models are less likely to overfit the training data because they have relatively few parameters; they can sometimes underfit.
- They are often all you need when the data is high dimensional.
- There are lots of scalable algorithms.

Least squares linear regression (slide credit: Malik Magdon-Ismail, Learning From Data, Linear Classification and Regression)

Noisy target: P(y|x), with y = f(x) + \epsilon. For a hypothesis h(x) = w^T x, the in-sample and out-of-sample errors are

    E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(x_n) - y_n)^2
    E_{out}(h) = E_x[(h(x) - y)^2]

Matrix representation: collecting the inputs as the rows of a matrix X and the targets in a vector y, E_{in}(w) = \frac{1}{N} \|Xw - y\|^2.
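The matrix representation makes the least squares fit a one-line computation. Below is a minimal sketch (with synthetic data; the generating weights and noise level are our own choices, not from the slides) that absorbs the bias into the weight vector and solves the least squares problem with np.linalg.lstsq:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic regression data: y = 1 + 3*x1 - 2*x2 + noise
    n, d = 100, 2
    X = rng.normal(size=(n, d))
    y = 1.0 + X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=n)

    # Absorb the bias by prepending a column of ones to each input
    X_aug = np.hstack([np.ones((n, 1)), X])

    # Least squares: minimize E_in(w) = (1/N) * ||X_aug w - y||^2
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    print(w)  # approximately [1, 3, -2]: the bias followed by the two weights

    # In-sample error of the fitted model
    E_in = np.mean((X_aug @ w - y) ** 2)
    print(E_in)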

From linear to non-linear

Linearly separable data: there exists a linear decision boundary separating the classes, i.e. a w and b such that w^T x + b > 0 for all positive examples and w^T x + b < 0 for all negative examples.

There is a neat mathematical trick that will enable us to use linear classifiers to create non-linear decision boundaries! Example: the original data is not linearly separable, but the transformed data (x'_1, x'_2) = (x_1^2, x_2^2) is.

The bias and homogeneous coordinates

In some cases we will use algorithms that learn a discriminant function without a bias term. This does not reduce the expressivity of the model, because we can obtain a bias using the following trick: add another dimension x_0 to each input and set it to 1; learn a weight vector of dimension d+1 in this extended space, and interpret w_0 as the bias term. With the notation

    w = (w_1, ..., w_d),   \tilde{w} = (w_0, w_1, ..., w_d),   \tilde{x} = (1, x_1, ..., x_d)

we have that

    \tilde{w} \cdot \tilde{x} = w_0 + w \cdot x

(see the book).

Finding a good hyperplane (Rosenblatt, 1957)

We would like a classifier that fits the data, i.e. we would like to find a vector w that minimizes

    E_{in}(h) = \frac{1}{n} \sum_{i=1}^{n} [h(x_i) \neq f(x_i)] = \frac{1}{n} \sum_{i=1}^{n} [sign(w \cdot x_i) \neq y_i]

This is a difficult problem because of the discrete nature of the indicator and sign functions (it is known to be NP-hard).

Idea: iterate over the training examples, and update the weight vector in a way that makes x_i more likely to be correctly classified. (Note: we're learning a classifier without a bias term.) Assume that x_i is misclassified and is a positive example, i.e. w \cdot x_i < 0. We would like to update w to w' such that

    w' \cdot x_i > w \cdot x_i

This can be achieved by choosing

    w' = w + \eta x_i,   with 0 < \eta

where \eta is the learning rate.

Rosenblatt, Frank (1957), The Perceptron: A Perceiving and Recognizing Automaton. Report 85-460-1, Cornell Aeronautical Laboratory.
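A minimal sketch of the two tricks just described, using a made-up dataset (points inside vs. outside the unit circle; none of the names or values below come from the slides): a nonlinear feature transform makes the data linearly separable, and a leading coordinate fixed to 1 lets a single weight vector also encode the bias.

    import numpy as np

    rng = np.random.default_rng(1)

    # Points inside a circle of radius 1 are positive, points outside are negative.
    # This data is NOT linearly separable in the original coordinates (x1, x2).
    X = rng.uniform(-2, 2, size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

    # Nonlinear feature transform: (x1, x2) -> (x1^2, x2^2).
    # In the transformed space the classes are separated by the line z1 + z2 = 1.
    Z = X ** 2

    # Homogeneous coordinates: prepend x0 = 1 so a single weight vector
    # (w0, w1, ..., wd) also encodes the bias term w0.
    Z_aug = np.hstack([np.ones((Z.shape[0], 1)), Z])

    # The weight vector w = (1, -1, -1) implements sign(1 - z1 - z2),
    # which classifies this dataset perfectly in the transformed space.
    w = np.array([1.0, -1.0, -1.0])
    predictions = np.sign(Z_aug @ w)
    print(np.mean(predictions == y))  # fraction correctly classified (should be 1.0)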

If x_i is a negative example, the update needs to be in the opposite direction. Overall, we can summarize the two cases as:

    w' = w + \eta y_i x_i

The perceptron algorithm

    Output: weight vector w
    w = 0
    while not converged:
        for i in 1, ..., |D|:
            if x_i is misclassified:
                update w and set converged = false
    return w

Since the algorithm is not guaranteed to converge, you need to set a limit on the number of iterations:

    Output: weight vector w
    w = 0
    while not converged and number of iterations < T:
        for i in 1, ..., |D|:
            if x_i is misclassified:
                update w and set converged = false
    return w

The algorithm makes sense, but let's try to derive it in a more principled way. The algorithm is trying to find a vector w that separates the positive from the negative examples. We can express that as:

    y_i (w \cdot x_i) > 0,   i = 1, ..., n

For a given weight vector w, the degree to which this does not hold can be expressed as:

    E(w) = -\sum_{i \in M} y_i (w \cdot x_i)

where M is the set of misclassified examples. Do we want to find the w that minimizes or maximizes this criterion? (Minimizes: every misclassified example contributes a positive term to E(w).)

Digression: gradient descent

Given a function E(w), the gradient is the direction of steepest ascent. Therefore, to minimize E(w), take a step in the direction of the negative of the gradient. Notice that the gradient is perpendicular to contours of equal E(w). We can now express gradient descent as:

    w(t+1) = w(t) - \eta \nabla E(w(t))

where

    \nabla E(w) = \left( \frac{\partial E(w)}{\partial w_1}, ..., \frac{\partial E(w)}{\partial w_d} \right)

and w(t) is the weight vector at iteration t. The constant \eta is called the step size (or the learning rate when used in the context of machine learning). (Gradient descent illustrations: http://en.wikipedia.org/wiki/gradient_descent)

Let's apply gradient descent to the perceptron criterion:

    E(w) = -\sum_{i \in M} y_i (w \cdot x_i)
    \frac{\partial E(w)}{\partial w} = -\sum_{i \in M} y_i x_i
    w(t+1) = w(t) - \eta \frac{\partial E(w)}{\partial w} = w(t) + \eta \sum_{i \in M} y_i x_i

which is exactly the perceptron algorithm: applied to one misclassified example at a time, the update is w' = w + \eta y_i x_i.
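The pseudocode above translates directly into a short NumPy implementation. This is a sketch of the perceptron learning algorithm as described in the slides (no bias term; prepend a constant 1 to each input, as described earlier, if a bias is needed); the function name, the default learning rate, and the toy dataset are our own choices.

    import numpy as np

    def perceptron(X, y, eta=1.0, max_iter=1000):
        """Perceptron learning algorithm.

        X : (n, d) array of inputs (prepend a column of ones to learn a bias).
        y : (n,) array of labels in {-1, +1}.
        Returns the learned weight vector w.
        """
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(max_iter):           # limit on the number of iterations
            converged = True
            for i in range(n):
                if y[i] * (w @ X[i]) <= 0:      # x_i is misclassified
                    w = w + eta * y[i] * X[i]   # update: w' = w + eta * y_i * x_i
                    converged = False
            if converged:                   # one full pass with no mistakes
                break
        return w

    # Usage on a small linearly separable dataset (values made up for illustration)
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.5]])
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # homogeneous coordinates
    y = np.array([1, 1, -1, -1])
    w = perceptron(X, y)
    print(w, np.sign(X @ w))   # learned weights and the (correct) predictions

Treating points with y_i (w \cdot x_i) = 0 as misclassified (the <= 0 test) also handles the initial weight vector w = 0.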

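For completeness, here is a sketch of the batch form of this derivation: gradient descent on the perceptron criterion E(w) = -\sum_{i \in M} y_i (w \cdot x_i). Again, the function name, step size, and toy data are our own choices, not from the slides.

    import numpy as np

    def perceptron_criterion_gd(X, y, eta=0.1, steps=100):
        """Batch gradient descent on E(w) = -sum_{i in M} y_i (w . x_i),
        where M is the current set of misclassified examples."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            margins = y * (X @ w)
            M = margins <= 0                          # misclassified examples
            if not M.any():                           # all examples correct: E(w) = 0
                break
            grad = -(y[M, None] * X[M]).sum(axis=0)   # dE/dw = -sum_{i in M} y_i x_i
            w = w - eta * grad                        # w(t+1) = w(t) - eta * grad
        return w

    # Example run on a small separable dataset
    X = np.hstack([np.ones((4, 1)),
                   np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.5]])])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    print(perceptron_criterion_gd(X, y))

Taken one misclassified example at a time instead of summing over M, this gradient step reduces to the perceptron update above.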
The algorithm is guaranteed to converge if the data is linearly separable, and does not converge otherwise.

Issues with the algorithm:
- The algorithm chooses an arbitrary hyperplane that separates the two classes. It may not be the best one from the learning perspective.
- It does not converge if the data is not separable (it can be halted after a fixed number of iterations).

There are variants of the algorithm that address these issues (to some extent).

The pocket algorithm

    Output: a weight vector w_pocket
    w = 0, w_pocket = 0
    while not converged and number of iterations < T:
        for i in 1, ..., |D|:
            if x_i is misclassified:
                update w and set converged = false
                if w leads to a better E_in than w_pocket:
                    w_pocket = w
    return w_pocket

Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179-191.

Image classification with intensity and symmetry features

A feature is an important property of the input that you think is useful for classification (dictionary.com: "a prominent or conspicuous part or characteristic"). In this case we consider the level of symmetry of an image (comparing the image with its flipped version) and its overall intensity (the fraction of pixels that are dark):

    x = (1, x_1, x_2),   w = (w_0, w_1, w_2)

a linear model with d_{vc} = 3.

Pocket vs. perceptron on the digits data (slide credit: Malik Magdon-Ismail, Learning From Data, Linear Classification and Regression)

Comparison on image data: distinguishing between the digits 1 and 5 with PLA and with the pocket algorithm (see the book). [Figure: E_in and E_out, error on a log scale, versus the iteration number t, for PLA and for the pocket algorithm.]
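A sketch of the pocket algorithm in the same style as the perceptron code above: perform perceptron updates, but keep "in the pocket" the best weight vector seen so far according to the in-sample error E_in. The helper names and the noisy toy dataset are our own illustration.

    import numpy as np

    def in_sample_error(w, X, y):
        """Fraction of training examples misclassified by sign(w . x)."""
        return np.mean(np.sign(X @ w) != y)

    def pocket(X, y, eta=1.0, max_iter=1000):
        """Pocket algorithm: perceptron updates, returning the best w seen so far."""
        n, d = X.shape
        w = np.zeros(d)
        w_pocket = w.copy()
        best_error = in_sample_error(w_pocket, X, y)
        for _ in range(max_iter):
            converged = True
            for i in range(n):
                if y[i] * (w @ X[i]) <= 0:          # x_i is misclassified
                    w = w + eta * y[i] * X[i]       # perceptron update
                    converged = False
                    error = in_sample_error(w, X, y)
                    if error < best_error:          # better E_in than the pocket?
                        best_error = error
                        w_pocket = w.copy()
            if converged:
                break
        return w_pocket

    # On non-separable data the perceptron never converges, but the pocket
    # algorithm still returns the best hyperplane encountered within max_iter.
    rng = np.random.default_rng(2)
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
    y = np.sign(X[:, 1] - X[:, 2])
    y[rng.random(200) < 0.1] *= -1          # flip 10% of the labels: not separable
    w_best = pocket(X, y, max_iter=50)
    print(in_sample_error(w_best, X, y))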