Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron. COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Andrey Kan. Copyright: University of Melbourne.

This lecture:
Notes on linear algebra
  Vectors and dot products
  Hyperplanes and vector normals
Perceptron
  Introduction to artificial neural networks
  The perceptron model
  Stochastic gradient descent

Notes on Linear Algebra: link between the geometric and algebraic interpretations of ML methods.

What are vectors? Suppose $u = (u_1, u_2)$. What does $u$ really represent? An ordered set of numbers $\{u_1, u_2\}$; the Cartesian coordinates of a point; or a direction. [Figure: the point/direction $(u_1, u_2)$ drawn in the plane.]

Dot product: algebraic definition. Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv u'v \equiv \sum_{i=1}^{m} u_i v_i$. E.g., a weighted sum of terms is a dot product $x'w$. Verify that if $k$ is a constant and $a$, $b$ and $c$ are vectors of the same size, then $(ka) \cdot b = k(a \cdot b) = a \cdot (kb)$ and $a \cdot (b + c) = a \cdot b + a \cdot c$.
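
As a quick numerical sanity check of these identities (a minimal sketch using NumPy; the particular vectors and constant are arbitrary choices, not from the slides):

```python
import numpy as np

# Arbitrary example vectors and scalar (not from the slides).
a = np.array([1.0, -2.0, 3.0])
b = np.array([0.5, 4.0, -1.0])
c = np.array([2.0, 0.0, 1.5])
k = 3.0

dot = np.sum(a * b)                      # algebraic definition: sum_i a_i * b_i
assert np.isclose(dot, a @ b)            # matches NumPy's built-in dot product

assert np.isclose((k * a) @ b, k * (a @ b))    # (ka) . b = k (a . b)
assert np.isclose((k * a) @ b, a @ (k * b))    # ... = a . (kb)
assert np.isclose(a @ (b + c), a @ b + a @ c)  # a . (b + c) = a . b + a . c
```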

Dot product: geometric definition. Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv u'v \equiv \|u\| \|v\| \cos\theta$. Here $\|u\|$ and $\|v\|$ are the L2 norms (i.e., Euclidean lengths) of vectors $u$ and $v$, and $\theta$ is the angle between the vectors. The scalar projection of $u$ onto $v$ is given by $u_v = \|u\| \cos\theta$. Thus the dot product is $u \cdot v = u_v \|v\| = v_u \|u\|$.

Equivalence of definitions. Lemma: The algebraic and geometric definitions are identical. Proof sketch: Express the vectors using the standard vector basis $e_1, \dots, e_m$ in $\mathbb{R}^m$: $u = \sum_{i=1}^{m} u_i e_i$ and $v = \sum_{i=1}^{m} v_i e_i$. The vectors $e_i$ form an orthonormal basis: they have unit length and are orthogonal to each other, so $e_i \cdot e_i = 1$ and $e_i \cdot e_j = 0$ for $i \neq j$. Then $u \cdot v = u \cdot \left(\sum_{i=1}^{m} v_i e_i\right) = \sum_{i=1}^{m} v_i (u \cdot e_i) = \sum_{i=1}^{m} v_i \|u\| \cos\theta_i = \sum_{i=1}^{m} v_i u_i$.
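
The equivalence can also be checked numerically in 2D, where the angle between two vectors can be computed independently of the dot product (an illustrative sketch; the vectors are arbitrary):

```python
import numpy as np

# Two arbitrary 2-D vectors (any vectors work; these are not from the slides).
u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])

# Angle between the vectors, from their polar angles (no dot product used),
# then the geometric formula ||u|| ||v|| cos(theta).
theta = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])
geometric = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(theta)

# Algebraic definition: sum of element-wise products.
algebraic = np.sum(u * v)

assert np.isclose(geometric, algebraic)   # the two definitions agree
```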

Geometric properties of the dot product. If the two vectors are orthogonal then $u \cdot v = 0$. If the two vectors are parallel then $u \cdot v = \|u\| \|v\|$; if they are anti-parallel then $u \cdot v = -\|u\| \|v\|$. Also $u \cdot u = \|u\|^2$, so $\|u\| = \sqrt{u_1^2 + \dots + u_m^2}$ defines the Euclidean vector length.

Hyperplanes and normal vectors. A hyperplane defined by parameters $w$ and $b$ is a set of points $x$ that satisfy $x \cdot w + b = 0$. In 2D, a hyperplane is a line: a line is a set of points that satisfy $w_1 x_1 + w_2 x_2 + b = 0$. A normal vector for a hyperplane is a vector perpendicular to that hyperplane.

Hyperplanes and normal vectors. Consider a hyperplane defined by parameters $w$ and $b$; note that $w$ is itself a vector. Lemma: Vector $w$ is a normal vector to the hyperplane. Proof sketch: Choose any two points $u$ and $v$ on the hyperplane; the vector $u - v$ lies on the hyperplane. Consider the dot product $(u - v) \cdot w = u \cdot w - v \cdot w = (u \cdot w + b) - (v \cdot w + b) = 0$. Thus $u - v$ lies on the hyperplane and is perpendicular to $w$, and so $w$ is a normal vector.
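
A small numerical illustration of the lemma in 2D (a sketch with arbitrarily chosen $w$ and $b$; `point_on_line` is just a helper for this example):

```python
import numpy as np

# A line in 2D: the set of points x with x . w + b = 0 (parameters chosen arbitrarily).
w = np.array([2.0, -1.0])
b = 3.0

def point_on_line(x1):
    """Return a point (x1, x2) satisfying w . x + b = 0."""
    x2 = -(w[0] * x1 + b) / w[1]
    return np.array([x1, x2])

u = point_on_line(0.0)
v = point_on_line(5.0)

assert np.isclose(u @ w + b, 0.0) and np.isclose(v @ w + b, 0.0)  # both lie on the line
assert np.isclose((u - v) @ w, 0.0)                               # u - v is perpendicular to w
```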

Example in 2D. Consider a line defined by $w_1$, $w_2$ and $b$. Vector $w = (w_1, w_2)$ is a normal vector. The line itself can be written as $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$.

Is logistic regression a linear method? [Figure: the logistic function $f(s)$, mapping reals $s$ to probabilities between 0 and 1.]

Logistic regression is a linear classifier. Logistic regression model: $P(Y = 1 \mid x) = \frac{1}{1 + \exp(-x'w)}$. Classification rule: if $P(Y = 1 \mid x) > \frac{1}{2}$ then class 1, else class 0. Decision boundary: $\frac{1}{1 + \exp(-x'w)} = \frac{1}{2}$, i.e., $\exp(-x'w) = 1$, i.e., $x'w = 0$.
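
The equivalence of the probability rule and the sign rule is easy to verify numerically (a sketch with arbitrary weights and points):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Arbitrary weights and points for illustration.
w = np.array([1.0, -2.0])
X = np.array([[0.5, 0.1], [-1.0, 0.3], [2.0, 1.0]])

s = X @ w
# P(Y=1|x) > 1/2 exactly when x . w > 0, so the decision boundary is linear.
assert np.array_equal(sigmoid(s) > 0.5, s > 0)
```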

The Perceptron Model: a building block for artificial neural networks, yet another linear classifier.

Biological inspiration. Humans perform well at many tasks that matter. Originally, neural networks were an attempt to mimic the human brain.

Artificial neural network. As a crude approximation, the human brain can be thought of as a mesh of interconnected processing nodes (neurons) that relay electrical signals. An artificial neural network is a network of processing elements. Each element converts inputs to an output. The output is a function (called the activation function) of a weighted sum of the inputs.

Outline. In order to use an ANN we need (a) to design the network topology and (b) to adjust the weights to the given data. In this course, we will exclusively focus on task (b) for a particular class of networks called feed-forward networks. Training an ANN means adjusting the weights for the training data, given a pre-defined network topology. We will come back to ANNs and discuss ANN training in the next lecture. Right now we turn our attention to an individual network element, because it is an interesting model in itself.

Perceptron model. [Diagram: inputs $x_1$, $x_2$ and a constant input 1 are multiplied by weights $w_1$, $w_2$, $w_0$, summed to give $s$, and passed through $f$ to produce the output $f(s)$.] Compare this model to logistic regression. Here $x_1, x_2$ are the inputs, $w_1, w_2$ are synaptic weights, $w_0$ is the bias weight, and $f$ is the activation function.

Perceptron is a linear binary classifier. The perceptron is a binary classifier: predict class A if $s \geq 0$ and class B if $s < 0$, where $s = \sum_{i=0}^{m} x_i w_i$ (with $x_0 = 1$ so that $w_0$ is the bias weight). The perceptron is a linear classifier: $s$ is a linear function of the inputs, and the decision boundary is linear. [Figure, for 2D data: the plane of values of $s$ over the plane of data points, with the decision boundary where $s = 0$.]
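
A minimal prediction function matching this rule (a sketch; the helper name `perceptron_predict` and the example weights are illustrative, not from the slides):

```python
import numpy as np

def perceptron_predict(x, w):
    """Return +1 (class A) if s >= 0, else -1 (class B), with s = sum_{i=0}^{m} x_i w_i.

    A constant x_0 = 1 is prepended so that w[0] acts as the bias weight.
    """
    s = np.concatenate(([1.0], x)) @ w
    return 1 if s >= 0 else -1

# Illustrative weights [w0, w1, w2] and points (arbitrary, not from the slides).
w = np.array([-0.5, 1.0, 1.0])
print(perceptron_predict(np.array([0.2, 0.1]), w))   # -1 (s = -0.2)
print(perceptron_predict(np.array([1.0, 1.0]), w))   # +1 (s =  1.5)
```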

Exercise: find weights of a perceptron capable of perfect classification of the following dataset:

x1  x2  y
0   0   Class B
0   1   Class B
1   0   Class B
1   1   Class A
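
For checking an answer afterwards: one possible set of weights (chosen here purely for illustration; the solution is not unique, and the exercise is best attempted first):

```python
# One candidate solution (there are infinitely many): predict class A when
# s = w0 + w1*x1 + w2*x2 >= 0, and class B otherwise.
w0, w1, w2 = -1.5, 1.0, 1.0

data = [((0, 0), 'B'), ((0, 1), 'B'), ((1, 0), 'B'), ((1, 1), 'A')]
for (x1, x2), label in data:
    s = w0 + w1 * x1 + w2 * x2
    predicted = 'A' if s >= 0 else 'B'
    print((x1, x2), label, predicted)   # every predicted label matches
```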

Loss function for the perceptron. Recall that training means finding weights that minimise some loss. Therefore, we proceed by considering the loss function for the perceptron. Our task is binary classification. Let's arbitrarily encode one class as +1 and the other as -1. So each training example is now $\{x, y\}$, where $y$ is either +1 or -1. Recall that, in a perceptron, $s = \sum_{i=0}^{m} x_i w_i$, and the sign of $s$ determines the predicted class: +1 if $s > 0$, and -1 if $s < 0$. Consider a single training example. If $y$ and $s$ have the same sign then the example is classified correctly. If $y$ and $s$ have different signs, the example is misclassified.

Loss function for the perceptron. Consider a single training example. If $y$ and $s$ have the same sign then the example is classified correctly. If $y$ and $s$ have different signs, the example is misclassified. The perceptron uses a loss function in which there is no penalty for correctly classified examples, while the penalty (loss) is equal to $|s|$ for misclassified examples.* Formally: $L(s, y) = 0$ if $s$ and $y$ have the same sign; $L(s, y) = |s|$ if $s$ and $y$ have different signs. This can be re-written as $L(s, y) = \max(0, -sy)$. [Figure: $L(s, y)$ plotted against $sy$.] * This is similar, but not identical, to another widely used loss function called the hinge loss.
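
In code, the loss is a one-liner (an illustrative sketch):

```python
def perceptron_loss(s, y):
    """Perceptron loss L(s, y) = max(0, -s*y), for labels y in {-1, +1}."""
    return max(0.0, -s * y)

print(perceptron_loss(2.0, +1))    # 0.0 : correctly classified, no penalty
print(perceptron_loss(-0.5, +1))   # 0.5 : misclassified, penalty |s|
print(perceptron_loss(0.3, -1))    # 0.3 : misclassified, penalty |s|
```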

Stochastic gradient descent
Split all training examples into $B$ batches
Choose initial $\theta^{(1)}$
For $i$ from 1 to $T$ (iterations over the entire dataset are called epochs)
    For $j$ from 1 to $B$
        Do a gradient descent update using the data from batch $j$
Advantage of such an approach: computational feasibility for large datasets
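
A minimal sketch of this batched update loop in Python (the function signature and the generic `grad` argument are assumptions for illustration, not part of the slides):

```python
import numpy as np

def sgd(theta, X, y, grad, eta=0.1, T=10, B=5):
    """Mini-batch stochastic gradient descent (sketch).

    theta : initial parameter vector
    grad  : function grad(theta, X_batch, y_batch) -> gradient estimate
    eta   : learning rate; T : number of epochs; B : number of batches
    """
    batches = np.array_split(np.random.permutation(len(y)), B)
    for _ in range(T):                     # one pass over all batches = one epoch
        for idx in batches:
            theta = theta - eta * grad(theta, X[idx], y[idx])
    return theta
```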

Perceptron training algorithm
Choose initial guess $w^{(0)}$, set $k = 0$
For $i$ from 1 to $T$ (epochs)
    For $j$ from 1 to $N$ (training examples)
        Consider example $(x_j, y_j)$
        Update*: $w^{(k+1)} = w^{(k)} - \eta \nabla L(w^{(k)})$
Here $L(w) = \max(0, -sy)$, $s = \sum_{i=0}^{m} x_i w_i$, and $\eta$ is the learning rate.
*There is no derivative when $s = 0$, but this case is handled explicitly in the algorithm; see the next slides.

Perceptron training rule. We have $\frac{\partial L}{\partial w_i} = 0$ when $sy > 0$: we don't need to do an update when an example is correctly classified. We have $\frac{\partial L}{\partial w_i} = -x_i$ when $y = 1$ and $s < 0$, and $\frac{\partial L}{\partial w_i} = x_i$ when $y = -1$ and $s > 0$. Here $s = \sum_{i=0}^{m} x_i w_i$. [Figure: $L(s, y)$ plotted against $sy$.]

Perceptron training algorithm. When an example is classified correctly, the weights are unchanged. When it is misclassified, update $w^{(k+1)} = w^{(k)} - \eta(\pm x)$, where $\eta > 0$ is called the learning rate: if $y = 1$ but $s < 0$, then $w_i \leftarrow w_i + \eta x_i$ and $w_0 \leftarrow w_0 + \eta$; if $y = -1$ but $s \geq 0$, then $w_i \leftarrow w_i - \eta x_i$ and $w_0 \leftarrow w_0 - \eta$. Convergence theorem: if the training data is linearly separable, the algorithm is guaranteed to converge to a solution. That is, there exists a finite $K$ such that $L(w^{(K)}) = 0$.
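
Putting the update rules together gives a complete training loop (a sketch, assuming labels in {-1, +1} and a bias handled via a constant first feature; the function name and defaults are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Train a perceptron with the update rules above.

    X : (N, m) array of inputs; y : labels in {-1, +1}.
    Returns weights [w0, w1, ..., wm], with w0 the bias weight.
    """
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 for the bias
    w = np.zeros(Xb.shape[1])                   # initial guess
    for _ in range(epochs):
        for xj, yj in zip(Xb, y):
            s = xj @ w
            if yj == 1 and s < 0:               # misclassified positive: add eta * x
                w = w + eta * xj
            elif yj == -1 and s >= 0:           # misclassified negative: subtract eta * x
                w = w - eta * xj
    return w

# The linearly separable dataset from the earlier exercise.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
s = np.hstack([np.ones((len(X), 1)), X]) @ w
print(np.where(s >= 0, 1, -1))                  # matches y once training has converged
```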

Perceptron convergence theorem. Assumptions. Linear separability: there exists $w^{*}$ such that $y_i (w^{*} \cdot x_i) \geq \gamma$ for all training data $i = 1, \dots, N$ and some positive margin $\gamma$. Bounded data: $\|x_i\| \leq R$ for $i = 1, \dots, N$ and some finite $R$. Proof sketch. Establish that $w^{*} \cdot w^{(k)} \geq w^{*} \cdot w^{(k-1)} + \gamma$; it then follows that $w^{*} \cdot w^{(k)} \geq k\gamma$. Establish that $\|w^{(k)}\|^2 \leq k R^2$. Note that $\cos(w^{*}, w^{(k)}) = \frac{w^{*} \cdot w^{(k)}}{\|w^{*}\| \, \|w^{(k)}\|}$ and use the fact that $\cos(\cdot) \leq 1$. Arrive at $k \leq \frac{R^2 \|w^{*}\|^2}{\gamma^2}$.
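
Making the final step explicit, the two bounds combine with $\cos(\cdot) \leq 1$ as follows (a standard way to finish the sketch, with $w^{*}$, $\gamma$ and $R$ as in the assumptions):

$$k\gamma \;\le\; w^{*} \cdot w^{(k)} \;\le\; \|w^{*}\|\,\|w^{(k)}\| \;\le\; \|w^{*}\|\sqrt{k}\,R \quad\Longrightarrow\quad \sqrt{k} \;\le\; \frac{R\,\|w^{*}\|}{\gamma} \quad\Longrightarrow\quad k \;\le\; \frac{R^{2}\,\|w^{*}\|^{2}}{\gamma^{2}}$$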

Pros and cons of perceptron learning. If the data is linearly separable, the perceptron training algorithm will converge to a correct solution; there is a formal proof (good!). It will converge to some solution (separating boundary), one of infinitely many possible (bad!). However, if the data is not linearly separable, the training will fail completely rather than give some approximate solution (ugly).

Perceptron learning example: basic setup. $s = w_1 x_1 + w_2 x_2 + w_0$, with output $f(s) = -1$ if $s < 0$ and $f(s) = 1$ if $s \geq 0$ (learning rate $\eta = 0.1$).

Perceptron learning example: start with random weights, $w_1 = 0.2$, $w_2 = 0.0$, $w_0 = -0.1$ (learning rate $\eta = 0.1$). [Plot: data points and the initial decision boundary, the vertical line $x_1 = 0.5$.]

Perceptron learning example: consider training example 1, the point $(1, 1)$ with class -1. With $w_1 = 0.2$, $w_2 = 0.0$, $w_0 = -0.1$: $s = 0.2 + 0.0 - 0.1 > 0$, so the example is misclassified. Update: $w_1 \leftarrow w_1 - \eta = 0.1$, $w_2 \leftarrow w_2 - \eta = -0.1$, $w_0 \leftarrow w_0 - \eta = -0.2$.

Perceptron learning example: updated weights $w_1 = 0.1$, $w_2 = -0.1$, $w_0 = -0.2$. [Plot: the point $(1, 1)$, class -1, and the updated decision boundary.]

Perceptron learning example: consider training example 2, the point $(2, 1)$ with class 1. With $w_1 = 0.1$, $w_2 = -0.1$, $w_0 = -0.2$: $s = 0.1 \cdot 2 - 0.1 \cdot 1 - 0.2 < 0$, so the example is misclassified. Update: $w_1 \leftarrow w_1 + \eta x_1 = 0.3$, $w_2 \leftarrow w_2 + \eta x_2 = 0.0$, $w_0 \leftarrow w_0 + \eta = -0.1$.

Perceptron learning example: updated weights $w_1 = 0.3$, $w_2 = 0.0$, $w_0 = -0.1$. [Plot: the point $(2, 1)$, class 1, and the updated decision boundary, the vertical line $x_1 = 1/3$.]

Perceptron learning example: further examples. With $w_1 = 0.3$, $w_2 = 0.0$, $w_0 = -0.1$, the 3rd point gives $s > 0$ and is correctly classified; the 4th point, $(1.5, 0.5)$, is incorrectly classified, so the weights are updated, and so on.

Perceptron learning example: further examples. Eventually, all the data will be correctly classified (provided it is linearly separable).
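
The first two updates of this worked example can be reproduced directly (a sketch; only the first two training points are fully specified on the slides, so the rest of the dataset is omitted here):

```python
import numpy as np

eta = 0.1
w = np.array([-0.1, 0.2, 0.0])   # [w0, w1, w2] -- the initial weights from the slides

# The two fully specified training points: (1, 1) with class -1, (2, 1) with class +1.
examples = [(np.array([1.0, 1.0]), -1), (np.array([2.0, 1.0]), +1)]

for x, y in examples:
    s = w[0] + w[1:] @ x
    if y == 1 and s < 0:
        w = w + eta * np.concatenate(([1.0], x))
    elif y == -1 and s >= 0:
        w = w - eta * np.concatenate(([1.0], x))
    print(w)   # approx. [-0.2, 0.1, -0.1] after the first update, [-0.1, 0.3, 0.0] after the second
```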

This lecture:
Notes on linear algebra
  Vectors and dot products
  Hyperplanes and vector normals
Perceptron
  Introduction to artificial neural networks
  The perceptron model
  Training algorithm