Linear models and the perceptron algorithm


Preliminaries

Definition: the Euclidean dot product between two vectors $\mathbf{w}$ and $\mathbf{x}$ is the expression
$$\mathbf{w}^T \mathbf{x} = \sum_{i=1}^{d} w_i x_i.$$
The dot product is also referred to as the inner product or scalar product. It is sometimes denoted $\mathbf{w} \cdot \mathbf{x}$ (hence the name dot product).

Geometric interpretation: the dot product between two unit vectors is the cosine of the angle between them, and the dot product between a vector and a unit vector is the length of its projection in that direction. In general, $\mathbf{w}^T \mathbf{x} = \lVert\mathbf{w}\rVert\,\lVert\mathbf{x}\rVert \cos(\theta)$, where the norm of a vector is $\lVert\mathbf{x}\rVert = \sqrt{\mathbf{x}^T \mathbf{x}}$.

Labeled data

A labeled dataset is $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where the inputs $\mathbf{x}_i \in \mathbb{R}^d$ are d-dimensional vectors. The labels are discrete for classification problems (e.g. +1 and -1 for binary classification).
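As a quick sanity check of these definitions, here is a minimal NumPy sketch (the vectors are made up for illustration, not part of the lecture) that computes the dot product, the norm, and the cosine of the angle between two vectors:

    import numpy as np

    w = np.array([1.0, 2.0, 0.5])
    x = np.array([0.5, -1.0, 2.0])

    dot = w @ x                  # Euclidean dot product: sum_i w_i * x_i
    norm_x = np.sqrt(x @ x)      # norm of x; same as np.linalg.norm(x)
    cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))  # w.x = ||w|| ||x|| cos(theta)

    print(dot, norm_x, cos_theta)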

Labeled data (continued)

For a regression problem the labels are continuous values, and the goal is to estimate a (linear) function of the inputs. (Figure: a scatter plot of data with a fitted regression line.)

Linear models (linear decision boundaries)

A linear classifier splits the input space into two regions: the points with $\mathbf{w}^T\mathbf{x} + b > 0$ and those with $\mathbf{w}^T\mathbf{x} + b < 0$.

Discriminant/scoring function: $\mathbf{w}^T\mathbf{x} + b$, where $\mathbf{w}$ is the weight vector and $b$ is the bias. The decision boundary is the set of all $\mathbf{x}$ such that $\mathbf{w}^T\mathbf{x} + b = 0$. For linear models the decision boundary is a line in 2-d, a plane in 3-d, and a hyperplane in higher dimensions.
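The following small sketch (the weights, bias, and points are hypothetical) evaluates the scoring function for a few 2-d points and reports which side of the decision boundary each one lies on:

    import numpy as np

    w = np.array([2.0, -1.0])          # weight vector (hypothetical)
    b = 0.5                            # bias (hypothetical)

    X = np.array([[1.0, 1.0],
                  [-2.0, 0.5],
                  [0.0, 1.0]])

    scores = X @ w + b                 # discriminant/scoring function w.x + b per point
    sides = np.sign(scores)            # +1 / -1: which side of the hyperplane w.x + b = 0
    print(scores, sides)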

Using the discriminant to make a prediction

The predicted label is $y = \mathrm{sign}(\mathbf{w}^T\mathbf{x} + b)$, where the sign function equals +1 when its argument is positive and -1 otherwise. The decision boundary is the set of all $\mathbf{x}$ such that $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = 0$. What can you say about the decision boundary when $b = 0$? (It passes through the origin.)

Linear models for regression

When using a linear model for regression, the scoring function itself is the prediction: $y = \mathbf{w}^T\mathbf{x} + b$. In least squares linear regression we assume a noisy target $P(y \mid \mathbf{x})$ with $y = f(\mathbf{x}) + \epsilon$, use a linear hypothesis $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$, and measure the in-sample error
$$E_{\mathrm{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} \big(h(\mathbf{x}_n) - y_n\big)^2$$
and the out-of-sample error $E_{\mathrm{out}}(h) = \mathbb{E}_{\mathbf{x}}\big[(h(\mathbf{x}) - y)^2\big]$.

Why linear?

It's a good baseline: always start simple. Linear models are stable, and they are less likely to overfit the training data because they have fewer parameters (though they can sometimes underfit). They are often all you need when the data is high dimensional, and there are lots of scalable algorithms.
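To make the least squares setup concrete, here is a short illustrative sketch (synthetic data, no bias term, variable names are my own) that fits $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ to a noisy linear target and evaluates the in-sample error:

    import numpy as np

    rng = np.random.default_rng(0)

    # Noisy target: y = f(x) + eps with f linear.
    N, d = 100, 3
    X = rng.normal(size=(N, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=N)

    # Least squares fit of h(x) = w.x.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    E_in = np.mean((X @ w_hat - y) ** 2)   # in-sample squared error
    print(w_hat, E_in)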

From linear to non-linear

There is a neat mathematical trick that will enable us to use linear classifiers to create non-linear decision boundaries! Original data: not linearly separable. Transformed data: after applying a fixed non-linear transformation to the coordinates, the classes become linearly separable. (Figure: the original data and its transformed version.)

Linearly separable data

Linearly separable data: there exists a linear decision boundary separating the classes.

The bias and homogeneous coordinates

Formulating a model that does not have a bias term does not reduce the expressivity of the model, because we can obtain a bias using the following trick: add another dimension $x_0$ to each input and set it to 1, learn a weight vector of dimension $d+1$ in this extended space, and interpret $w_0$ as the bias term. With the notation $\mathbf{w} = (w_1, \ldots, w_d)$, $\tilde{\mathbf{w}} = (w_0, w_1, \ldots, w_d)$ and $\tilde{\mathbf{x}} = (1, x_1, \ldots, x_d)$, we have $\tilde{\mathbf{w}}^T\tilde{\mathbf{x}} = w_0 + \mathbf{w}^T\mathbf{x}$ (see the book).
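A minimal sketch of the homogeneous-coordinates trick (array names are my own): prepend a constant 1 to each input so that the first weight acts as the bias.

    import numpy as np

    X = np.array([[0.5, 1.2],
                  [-1.0, 0.3]])                     # original d-dimensional inputs

    X_h = np.hstack([np.ones((X.shape[0], 1)), X])  # extended inputs (1, x_1, ..., x_d)

    w_ext = np.array([0.7, 2.0, -1.0])              # (w_0, w_1, ..., w_d); w_0 is the bias
    scores = X_h @ w_ext                            # equals w_0 + w.x for each input
    print(scores)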

Finding a good hyperplane

We would like a classifier that fits the data, i.e. we would like to find a weight vector $\mathbf{w}$ that minimizes
$$E_{\mathrm{in}} = \frac{1}{N}\sum_{i=1}^{N} \big[h(\mathbf{x}_i) \neq f(\mathbf{x}_i)\big] = \frac{1}{N}\sum_{i=1}^{N} \big[\mathrm{sign}(\mathbf{w}^T\mathbf{x}_i) \neq y_i\big].$$
This is a difficult problem because of the discrete nature of the indicator and sign functions (it is known to be NP-hard).

The perceptron algorithm (Rosenblatt, 1957)

Idea: iterate over the training examples, and update the weight vector in a way that makes $\mathbf{x}_i$ more likely to be correctly classified. Note: we are learning a classifier without a bias term (homogeneous coordinates). Assume $\mathbf{x}_i$ is misclassified and is a positive example, i.e. $\mathbf{w}^T\mathbf{x}_i < 0$. We would like to update $\mathbf{w}$ to $\mathbf{w}'$ such that $\mathbf{w}'^T\mathbf{x}_i > \mathbf{w}^T\mathbf{x}_i$. This can be achieved by choosing
$$\mathbf{w}' = \mathbf{w} + \eta\, \mathbf{x}_i, \qquad 0 < \eta \leq 1,$$
where $\eta$ is the learning rate. (Figure: toy data plotted with Age and Income as the two features.)

Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.

If $\mathbf{x}_i$ is a negative example, the update needs to go in the opposite direction. Overall, we can summarize the two cases as
$$\mathbf{w}' = \mathbf{w} + \eta\, y_i \mathbf{x}_i.$$

The perceptron algorithm

Input: labeled data D in homogeneous coordinates
Output: weight vector w

    w = 0
    converged = false
    while not converged:
        converged = true
        for i in 1,...,N:
            if x_i is misclassified: update w and set converged = false
    return w

Since the algorithm is not guaranteed to converge if the data is not linearly separable, you need to set a limit T on the number of iterations:

    w = 0
    converged = false
    while not converged and number of iterations < T:
        converged = true
        for i in 1,...,N:
            if x_i is misclassified: update w and set converged = false
    return w
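Below is one possible Python implementation of this pseudocode (a sketch, assuming the data is already in homogeneous coordinates with labels in {+1, -1}); it performs perceptron updates until a full pass makes no mistakes or the iteration limit T is reached.

    import numpy as np

    def perceptron(X, y, eta=1.0, T=1000):
        """Perceptron algorithm on data in homogeneous coordinates.

        X: (N, d+1) array whose first column is all ones; y: array of +/-1 labels.
        Returns the final weight vector, which need not separate the data if it
        is not linearly separable and the iteration limit T is reached.
        """
        w = np.zeros(X.shape[1])
        for _ in range(T):
            converged = True
            for x_i, y_i in zip(X, y):
                if np.sign(w @ x_i) != y_i:     # x_i is misclassified (a score of 0 counts too)
                    w = w + eta * y_i * x_i     # update: w' = w + eta * y_i * x_i
                    converged = False
            if converged:
                break
        return w

On linearly separable data the loop exits as soon as a full pass makes no updates; otherwise it simply stops after T passes, which is why the pocket variant below keeps the best weights seen so far.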

The perceptron algorithm: convergence

The algorithm is guaranteed to converge if the data is linearly separable, and does not converge otherwise. Issues with the algorithm: it chooses an arbitrary hyperplane that separates the two classes, which may not be the best one from the learning perspective, and it does not converge if the data is not separable (it can be halted after a fixed number of iterations). There are variants of the algorithm that address these issues (to some extent).

The pocket algorithm

Input: labeled data D in homogeneous coordinates
Output: a weight vector w_pocket

    w = 0, w_pocket = 0
    converged = false
    while not converged and number of iterations < T:
        converged = true
        for i in 1,...,N:
            if x_i is misclassified: update w and set converged = false
            if w leads to a better E_in than w_pocket: w_pocket = w
    return w_pocket

(A runnable sketch of this algorithm appears at the end of this section.)

Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179-191.

Image classification: intensity and symmetry features

A feature is an important property of the input that you think is useful for classification (dictionary.com: a prominent or conspicuous part or characteristic). In this case we consider the level of symmetry (comparing the image with its flipped version) and the overall intensity (fraction of pixels that are dark), so each input is $\mathbf{x} = (1, x_1, x_2)$ and the linear model is $\mathbf{w} = (w_0, w_1, w_2)$, with $d_{vc} = 3$.

Pocket vs perceptron

Comparison on image data: distinguishing between the digits 1 and 5 (see page 83 in the book). (Figure: error $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ on a log scale versus iteration number $t$, for PLA and for the pocket algorithm.)
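A corresponding sketch of the pocket algorithm (same assumptions as the perceptron sketch above; the helper names are my own): it performs the same updates but returns the weight vector with the lowest in-sample error seen so far.

    import numpy as np

    def pocket(X, y, eta=1.0, T=1000):
        """Pocket algorithm: perceptron updates, but return the best weights seen."""
        def E_in(w):
            # Fraction of misclassified training examples (0/1 in-sample error).
            return np.mean(np.sign(X @ w) != y)

        w = np.zeros(X.shape[1])
        w_pocket = w.copy()
        best = E_in(w_pocket)
        for _ in range(T):
            converged = True
            for x_i, y_i in zip(X, y):
                if np.sign(w @ x_i) != y_i:
                    w = w + eta * y_i * x_i
                    converged = False
                    err = E_in(w)
                    if err < best:          # w leads to a better E_in than the pocket
                        best, w_pocket = err, w.copy()
            if converged:
                break
        return w_pocket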