Chapter ML:VI. Neural Networks: Perceptron Learning, Gradient Descent, Multilayer Perceptron, Radial Basis Functions (Stein 2005-2018)

The Biological Model

[Figure: simplified model of a neuron, showing dendrites, cell body, axon, and synapse.]

The Biological Model (continued)

Neuron characteristics:
The numerous dendrites of a neuron serve as its input channels for electrical signals.
At particular contact points between the dendrites, the so-called synapses, electrical signals can be initiated. A synapse can initiate signals of different strengths, where the strength is encoded by the frequency of a pulse train.
The cell body of a neuron accumulates the incoming signals. If a particular stimulus threshold is exceeded, the cell body generates a signal, which is output via the axon.
The processing of the signals is unidirectional (from left to right in the figure).

History

1943  Warren McCulloch and Walter Pitts present a model of the neuron.
1949  Donald Hebb postulates a new learning paradigm: reinforcement only for active neurons (those neurons that are involved in a decision process).
1958  Frank Rosenblatt develops the perceptron model.
1962  Rosenblatt proves the perceptron convergence theorem.
1969  Marvin Minsky and Seymour Papert publish a book on the limitations of the perceptron model.
1970 ... 1985
1986  David Rumelhart and James McClelland present the multilayer perceptron.

The Perceptron of Rosenblatt [1958]

[Figure: a perceptron with inputs x_1, ..., x_p, weights w_1, ..., w_p, a summation unit Σ with threshold θ, and output y.]

x_j, w_j ∈ R, j = 1, ..., p

Remarks:
The perceptron of Rosenblatt is based on the neuron model of McCulloch and Pitts.
The perceptron is a feed-forward system.

Specification of Classification Problems [ML Introduction]

Characterization of the model (model world):
X is a set of feature vectors, also called feature space, X ⊆ R^p.
C = {0, 1} is a set of classes (C = {−1, 1} in the regression setting).
c : X → C is the ideal classifier for X; c is approximated by y (the perceptron).
D = {(x_1, c(x_1)), ..., (x_n, c(x_n))} ⊆ X × C is a set of examples.

What could the hypothesis space H look like?

Computation in the Perceptron [Regression]

If ∑_{j=1}^{p} w_j x_j ≥ θ, then y(x) = 1, and
if ∑_{j=1}^{p} w_j x_j < θ, then y(x) = 0,

where ∑_{j=1}^{p} w_j x_j = w^T x (or other notations for the scalar product).

[Figure: plot of the threshold function, output 1 if ∑_{j=1}^{p} w_j x_j ≥ θ, output 0 otherwise.]

A hypothesis is determined by θ, w_1, ..., w_p.

Computation in the Perceptron (continued)

y(x) = heaviside(∑_{j=1}^{p} w_j x_j − θ) = heaviside(∑_{j=0}^{p} w_j x_j), with w_0 = −θ, x_0 = 1.

[Figure: the extended perceptron with inputs x_0 = 1, x_1, ..., x_p, weights w_0 = −θ, w_1, ..., w_p, a summation unit Σ with threshold 0, and output y.]

A hypothesis is determined by w_0, w_1, ..., w_p.
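The following short Python/NumPy sketch (not part of the original slides) illustrates the computation with the extended vectors; the function names and numeric values are illustrative choices:

```python
import numpy as np

def heaviside(z):
    """Unit step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron_output(w, x):
    """y(x) = heaviside(w^T x) for the extended vectors:
    w = (w_0, w_1, ..., w_p) with w_0 = -theta, x = (1, x_1, ..., x_p)."""
    return heaviside(np.dot(w, x))

# Illustrative values: p = 2, theta = 0.5, w_1 = 1.0, w_2 = -0.5
w = np.array([-0.5, 1.0, -0.5])   # w_0 = -theta
x = np.array([1.0, 0.8, 0.3])     # x_0 = 1
print(perceptron_output(w, x))    # 1, since w^T x = -0.5 + 0.8 - 0.15 = 0.15 >= 0
```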

Remarks:
If the weight vector is extended by w_0 = −θ and the feature vectors are extended by the constant feature x_0 = 1, the learning algorithm gets a canonical form. Implementations of neural networks often introduce this extension implicitly.
Be careful with regard to the dimensionality of the weight vector: it is always denoted as w here, regardless of whether the w_0 dimension, with w_0 = −θ, is included.
The function heaviside is named after the mathematician Oliver Heaviside. [Heaviside step function]

Weight Adaptation [IGD Algorithm]

Algorithm: PT (Perceptron Training)
Input:    D      Training examples (x, c(x)) with |x| = p + 1, c(x) ∈ {0, 1}.
          η      Learning rate, a small positive constant.
Internal: y(D)   Set of y(x) values computed from the elements x in D given some w.
Output:   w      Weight vector.

PT(D, η)
 1. initialize_random_weights(w), t = 0
 2. REPEAT
 3.   t = t + 1
 4.   (x, c(x)) = random_select(D)
 5.   error = c(x) − heaviside(w^T x)   // c(x) ∈ {0, 1}, heaviside ∈ {0, 1}, error ∈ {0, 1, −1}
 6.   Δw = η · error · x
 7.   w = w + Δw
 8. UNTIL(convergence(D, y(D)) OR t > t_max)
 9. return(w)
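As an illustration (not from the slides), a minimal Python/NumPy sketch of the PT Algorithm under the conventions above; the convergence check and the helper names are my own simplifications:

```python
import numpy as np

def heaviside(z):
    return 1 if z >= 0 else 0

def pt(D, eta=0.1, t_max=10_000, rng=None):
    """Perceptron training (PT) sketch.

    D: list of (x, c) pairs, x an array of length p+1 with x[0] = 1, c in {0, 1}.
    Returns the learned weight vector w.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = rng.uniform(-0.5, 0.5, size=len(D[0][0]))        # 1. initialize random weights
    for t in range(t_max):                               # 2.-3. REPEAT
        x, c = D[rng.integers(len(D))]                   # 4. random_select(D)
        error = c - heaviside(w @ x)                     # 5. error in {-1, 0, 1}
        w = w + eta * error * x                          # 6.-7. weight adaptation
        if all(heaviside(w @ x) == c for x, c in D):     # convergence: all examples correct
            break
    return w

# Toy usage: two linearly separable examples (extended with x_0 = 1)
D = [(np.array([1.0, 2.0, 1.0]), 1), (np.array([1.0, -1.0, -2.0]), 0)]
print(pt(D))
```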

Remarks:
The variable t denotes the time. At each point in time the learning algorithm gets an example presented and, as a consequence, may adapt the weight vector.
The weight adaptation rule compares the true class c(x) (the ground truth) to the class computed by the perceptron. In case of a wrong classification of a feature vector x, error is either −1 or +1, regardless of the exact numeric difference between c(x) and w^T x.
y(D) is the set of y(x) values given w for the elements x in D.

Weight Adaptation: Illustration in Input Space

[Figure: a hyperplane in the (x_1, x_2) input space with normal vector n and offset d.]

Definition of an (affine) hyperplane: L = { x | n^T x = d }  [Wikipedia]
n denotes a normal vector that is perpendicular to the hyperplane L.
If ‖n‖ = 1, then n^T x − d gives the distance of any point x to L.
If sgn(n^T x_1 − d) = sgn(n^T x_2 − d), then x_1 and x_2 lie on the same side of the hyperplane.
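A small Python sketch (added here for illustration, assuming NumPy) of the two statements about the distance to, and the sides of, a hyperplane:

```python
import numpy as np

def signed_distance(n, d, x):
    """Signed distance of point x to the hyperplane L = {x | n^T x = d}.

    n is normalized inside the function, so the result is a true distance."""
    n = np.asarray(n, dtype=float)
    return (n @ x - d) / np.linalg.norm(n)

def same_side(n, d, x1, x2):
    """True if x1 and x2 lie on the same side of the hyperplane."""
    return np.sign(n @ x1 - d) == np.sign(n @ x2 - d)

n, d = np.array([1.0, 1.0]), 1.0
print(signed_distance(n, d, np.array([2.0, 2.0])))                   # 3/sqrt(2), approx. 2.12
print(same_side(n, d, np.array([2.0, 2.0]), np.array([0.0, 0.0])))   # False
```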

Weight Adaptation: Illustration in Input Space (continued)

[Figure: the same hyperplane, now with normal vector (w_1, ..., w_p)^T and offset determined by θ.]

Definition of an (affine) hyperplane: w^T x = 0, i.e., ∑_{j=1}^{p} w_j x_j = θ = −w_0
(hyperplane definition as before, with notation taken from the classification problem setting)

Remarks:
A perceptron defines a hyperplane that is perpendicular (= normal) to (w_1, ..., w_p)^T.
θ or w_0 specify the offset of the hyperplane from the origin, along (w_1, ..., w_p)^T and as a multiple of 1/‖(w_1, ..., w_p)^T‖.
The set of possible weight vectors w = (w_0, w_1, ..., w_p)^T forms the hypothesis space H.
Weight adaptation means learning, and the shown learning paradigm is supervised.
For the weight adaptation in Lines 6-7 of the PT Algorithm, note that if some x_j is zero, Δw_j will be zero as well. Keyword: Hebbian learning [Hebb 1949]
Note that here (and in the following illustrations) the hyperplane movement is not the result of solving a regression problem in the (p + 1)-dimensional input-output space, where the residuals are to be minimized. Instead, the PT Algorithm takes each misclassified example as a trigger to correct the hyperplane's normal vector, without taking the effect on the other residuals into account.

Example

The examples are presented to the perceptron. The perceptron computes a value that is interpreted as a class label.

Example (continued)

Encoding: The encoding of the examples is based on expressive features such as the number of line crossings, the most acute angle, the longest line, etc. The class label, c(x), is encoded as a number: examples from the one class are labeled with 1, examples from the other class with 0.

(x_11, x_12, ..., x_1p)^T, ..., (x_k1, x_k2, ..., x_kp)^T    Class c(x) = 1
(x_l1, x_l2, ..., x_lp)^T, ..., (x_m1, x_m2, ..., x_mp)^T    Class c(x) = 0
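For illustration only, a hypothetical encoding in Python/NumPy; the concrete feature values are invented and merely show the shape of the example set D used by the PT Algorithm:

```python
import numpy as np

# Hypothetical encoding: each example is described by three features
# (line crossings, most acute angle in degrees, longest line length),
# extended with the constant feature x_0 = 1, plus a class label c(x).
examples_class_1 = [np.array([1.0, 3.0, 35.0, 4.2]),
                    np.array([1.0, 3.0, 40.0, 3.9])]
examples_class_0 = [np.array([1.0, 1.0, 60.0, 5.1]),
                    np.array([1.0, 2.0, 75.0, 4.8])]

# D as a list of (feature vector, class label) pairs
D = [(x, 1) for x in examples_class_1] + [(x, 0) for x in examples_class_0]
for x, c in D:
    print(x, "-> c(x) =", c)
```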

Example: Illustration in Input Space [PT Algorithm]

[Figure sequence: a possible configuration of the encoded objects of both classes in the feature space X. The hyperplane with normal vector (w_1, ..., w_p)^T is adjusted by the PT Algorithm step by step, whenever a presented example x is misclassified, until the two classes are separated.]

Perceptron Convergence Theorem

Questions:
1. Which kinds of learning tasks can be addressed with the functions in the hypothesis space H?
2. Can the PT Algorithm construct such a function for a given task?

Theorem 1 (Perceptron Convergence [Rosenblatt 1962])
Let X_0 and X_1 be two finite sets with vectors of the form x = (1, x_1, ..., x_p)^T, let X_1 ∩ X_0 = ∅, and let ŵ define a separating hyperplane with respect to X_0 and X_1. Moreover, let D be a set of examples of the form (x, 0), x ∈ X_0, and (x, 1), x ∈ X_1. Then the following holds: if the examples in D are processed with the PT Algorithm, the constructed weight vector w will converge within a finite number of iterations.

Perceptron Convergence Theorem: Proof

Preliminaries:
The sets X_1 and X_0 are separated by a hyperplane ŵ. The proof requires that for all x ∈ X_1 the inequality ŵ^T x > 0 holds. This condition can always be fulfilled, as the following consideration shows. Let x ∈ X_1 with ŵ^T x = 0. Since X_0 is finite, the members x ∈ X_0 have a minimum positive distance δ with regard to the hyperplane ŵ. Hence, ŵ can be moved by δ/2 towards X_0, resulting in a new hyperplane ŵ' that still fulfills (ŵ')^T x < 0 for all x ∈ X_0, but that now also fulfills (ŵ')^T x > 0 for all x ∈ X_1.

By defining X = X_1 ∪ { −x | x ∈ X_0 }, the searched w fulfills w^T x > 0 for all x ∈ X. Then, with c(x) = 1 for all x ∈ X, error ∈ {0, 1} (instead of {0, 1, −1}). [PT Algorithm, Line 5]

The PT Algorithm performs a number of iterations, where w(t) denotes the weight vector in iteration t, which forms the basis for the weight vector w(t + 1). x(t) ∈ X denotes the feature vector chosen in round t. The first (and randomly chosen) weight vector is denoted as w(0).

Recall the Cauchy-Schwarz inequality: ‖a‖² ‖b‖² ≥ (a^T b)², where ‖x‖ := √(x^T x) denotes the Euclidean norm.

Perceptron Convergence Theorem: Proof (continued)

Line of argument:
(a) We state a lower bound for how much w must change from its initial value after n iterations (to become a separating hyperplane). The derivation of this lower bound exploits the presupposed linear separability of X_0 and X_1.
(b) We state an upper bound for how much w can change from its initial value after n iterations. The derivation of this upper bound exploits the finiteness of X_0 and X_1, which in turn guarantees the existence of an upper bound for the norm of the maximum feature vector.
(c) We observe that the lower bound grows quadratically in n, whereas the upper bound grows linearly. From the relation lower bound ≤ upper bound we derive a finite upper bound for n.

Perceptron Convergence Theorem: Proof (continued)

1. The PT Algorithm computes in iteration t the scalar product w(t)^T x(t). If classified correctly, w(t)^T x(t) > 0 and w is unchanged. Otherwise, w(t + 1) = w(t) + η x(t). [Lines 5-7]

2. A sequence of n incorrectly classified feature vectors, (x(t)), along with the weight adaptation w(t + 1) = w(t) + η x(t), results in the series w(n):

   w(1) = w(0) + η x(0)
   w(2) = w(1) + η x(1) = w(0) + η x(0) + η x(1)
   ...
   w(n) = w(0) + η x(0) + ... + η x(n−1)

3. The hyperplane defined by ŵ separates X_1 and X_0: ∀x ∈ X: ŵ^T x > 0.
   Let δ := min_{x ∈ X} ŵ^T x. Observe that δ > 0 holds.

4. Analyze the scalar product of w(n) and ŵ:

   ŵ^T w(n) = ŵ^T w(0) + η ŵ^T x(0) + ... + η ŵ^T x(n−1)
   ŵ^T w(n) ≥ ŵ^T w(0) + nηδ ≥ 0   (for n ≥ n_0 with a sufficiently large n_0 ∈ N)
   (ŵ^T w(n))² ≥ (ŵ^T w(0) + nηδ)²

5. Apply the Cauchy-Schwarz inequality:

   ‖ŵ‖² ‖w(n)‖² ≥ (ŵ^T w(n))² ≥ (ŵ^T w(0) + nηδ)²
   ‖w(n)‖² ≥ (ŵ^T w(0) + nηδ)² / ‖ŵ‖²

Perceptron Convergence Theorem: Proof (continued)

6. Consider again the weight adaptation w(t + 1) = w(t) + η x(t):

   ‖w(t + 1)‖² = ‖w(t) + η x(t)‖²
               = (w(t) + η x(t))^T (w(t) + η x(t))
               = w(t)^T w(t) + η² x(t)^T x(t) + 2η w(t)^T x(t)
               ≤ ‖w(t)‖² + η² ‖x(t)‖²    (since w(t)^T x(t) < 0)

7. Consider the series w(n) from Step 2:

   ‖w(n)‖² ≤ ‖w(n−1)‖² + η² ‖x(n−1)‖²
           ≤ ‖w(n−2)‖² + η² ‖x(n−2)‖² + η² ‖x(n−1)‖²
           ...
           ≤ ‖w(0)‖² + η² ‖x(0)‖² + ... + η² ‖x(n−1)‖²
           = ‖w(0)‖² + η² ∑_{i=0}^{n−1} ‖x(i)‖²

8. With ε := max_{x ∈ X} ‖x‖² it follows that ‖w(n)‖² ≤ ‖w(0)‖² + nη²ε.

Perceptron Convergence Theorem: Proof (continued)

9. Both inequalities (see Step 5 and Step 8) must be fulfilled:

   ‖w(n)‖² ≥ (ŵ^T w(0) + nηδ)² / ‖ŵ‖²   and   ‖w(n)‖² ≤ ‖w(0)‖² + nη²ε

   (ŵ^T w(0) + nηδ)² / ‖ŵ‖²  ≤  ‖w(n)‖²  ≤  ‖w(0)‖² + nη²ε
   (ŵ^T w(0) + nηδ)² / ‖ŵ‖²  ≤  ‖w(0)‖² + nη²ε

   Set w(0) = 0:

   n²η²δ² / ‖ŵ‖² ≤ nη²ε
   n ≤ (ε / δ²) ‖ŵ‖²

   The PT Algorithm terminates within a finite number of iterations.

   Observe: (ŵ^T w(0) + nηδ)² / ‖ŵ‖² ∈ Θ(n²), whereas ‖w(0)‖² + nη²ε ∈ Θ(n).

Perceptron Convergence Theorem: Discussion

If a separating hyperplane between X_0 and X_1 exists, the PT Algorithm will converge. If no such hyperplane exists, convergence cannot be guaranteed.
A separating hyperplane can be found in polynomial time with linear programming. The PT Algorithm, however, may require an exponential number of iterations.

Classification problems with noise (right-hand side) are problematic:

[Figure: two scatter plots in the (x_1, x_2) input space; left: classes that can be separated by a hyperplane, right: noisy classes that cannot.]
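As an illustration of the linear-programming remark (not part of the slides), a sketch that casts the search for a separating hyperplane as a linear feasibility problem; the margin constant 1 and the use of scipy.optimize.linprog are assumptions of this sketch:

```python
import numpy as np
from scipy.optimize import linprog

def separating_hyperplane(X1, X0):
    """Try to find w with w^T x >= 1 for x in X1 and w^T x <= -1 for x in X0.

    X1, X0: arrays of shape (n_i, p+1) holding extended feature vectors (x_0 = 1).
    Returns w, or None if the LP is infeasible (classes not linearly separable
    with this margin)."""
    # Constraints in the form A_ub @ w <= b_ub:
    #   -x^T w <= -1 for x in X1,   x^T w <= -1 for x in X0
    A_ub = np.vstack([-X1, X0])
    b_ub = -np.ones(len(X1) + len(X0))
    p1 = X1.shape[1]
    res = linprog(c=np.zeros(p1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * p1)   # pure feasibility problem
    return res.x if res.success else None

X1 = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 1.0]])
X0 = np.array([[1.0, -1.0, -1.0], [1.0, 0.0, -2.0]])
print(separating_hyperplane(X1, X0))
```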

Gradient Descent: Classification Error

Gradient descent considers the true error (better: the hyperplane distance) and will converge even if X_1 and X_0 cannot be separated by a hyperplane. However, this convergence process is of an asymptotic nature and no finite iteration bound can be stated.
Gradient descent applies the so-called delta rule, which will be derived in the following. The delta rule forms the basis of the backpropagation algorithm.

Consider the linear perceptron without a threshold function [Heaviside]:

   y(x) = w^T x = ∑_{j=0}^{p} w_j x_j

The classification error Err(w) of a weight vector (= hypothesis) w with regard to D can be defined as follows [Singleton error]:

   Err(w) = 1/2 ∑_{(x,c(x)) ∈ D} (c(x) − y(x))²
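A one-function Python sketch (illustrative only, assuming NumPy) of this error definition:

```python
import numpy as np

def err(w, D):
    """Squared-error objective Err(w) = 1/2 * sum over D of (c(x) - w^T x)^2."""
    return 0.5 * sum((c - w @ x) ** 2 for x, c in D)

# Toy usage with extended feature vectors (x_0 = 1)
D = [(np.array([1.0, 2.0, 1.0]), 1.0), (np.array([1.0, -1.0, -2.0]), 0.0)]
w = np.array([0.0, 0.5, -0.5])
print(err(w, D))   # 0.5 * ((1 - 0.5)^2 + (0 - 0.5)^2) = 0.25
```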

Gradient Descent: Classification Error (continued)

[Figure: error surface Err(w) plotted over the weight space (w_1, w_2).]

The gradient of Err(w), ∇Err(w), defines the direction of steepest ascent or descent:

   ∇Err(w) = ( ∂Err(w)/∂w_0, ∂Err(w)/∂w_1, ..., ∂Err(w)/∂w_p )

Gradient Descent: Weight Adaptation

   w ← w + Δw,   where Δw = −η ∇Err(w)

Componentwise (j = 0, ..., p) weight adaptation [PT Algorithm]:

   w_j ← w_j + Δw_j,   where Δw_j = −η ∂Err(w)/∂w_j = η ∑_{(x,c(x)) ∈ D} (c(x) − w^T x) x_j

Derivation of the partial derivative:

   ∂Err(w)/∂w_j = ∂/∂w_j [ 1/2 ∑_{(x,c(x)) ∈ D} (c(x) − y(x))² ]
                = 1/2 ∑_{(x,c(x)) ∈ D} ∂/∂w_j (c(x) − y(x))²
                = 1/2 ∑_{(x,c(x)) ∈ D} 2 (c(x) − y(x)) · ∂/∂w_j (c(x) − w^T x)
                = ∑_{(x,c(x)) ∈ D} (c(x) − w^T x) · (−x_j)
                = − ∑_{(x,c(x)) ∈ D} (c(x) − w^T x) x_j

Gradient Descent: Weight Adaptation: Batch Gradient Descent [IGD Algorithm]

Algorithm: GD (Batch Gradient Descent)
Input:    D      Training examples (x, c(x)) with |x| = p + 1, c(x) ∈ {0, 1} (alternatively c(x) ∈ {−1, 1}).
          η      Learning rate, a small positive constant.
Internal: y(D)   Set of y(x) values computed from the elements x in D given some w.
Output:   w      Weight vector.

GD(D, η)
 1. initialize_random_weights(w), t = 0
 2. REPEAT
 3.   t = t + 1
 4.   Δw = 0
 5.   FOREACH (x, c(x)) ∈ D DO
 6.     error = c(x) − w^T x
 7.     Δw = Δw + η · error · x
 8.   ENDDO
 9.   w = w + Δw
10. UNTIL(convergence(D, y(D)) OR t > t_max)
11. return(w)
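A minimal Python/NumPy sketch of the GD Algorithm (not from the slides); the convergence criterion based on the norm of the weight change is one possible choice:

```python
import numpy as np

def gd(D, eta=0.01, t_max=1000, eps=1e-6, rng=None):
    """Batch gradient descent (GD) sketch for the linear unit y(x) = w^T x.

    D: list of (x, c) pairs, x an array of length p+1 with x[0] = 1.
    Stops when the total weight change is small or t_max is reached."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.uniform(-0.5, 0.5, size=len(D[0][0]))   # 1. initialize random weights
    for t in range(t_max):                          # 2.-3. REPEAT
        dw = np.zeros_like(w)                       # 4. Δw = 0
        for x, c in D:                              # 5. FOREACH example (the "batch")
            error = c - w @ x                       # 6. residual, no threshold function
            dw += eta * error * x                   # 7. accumulate Δw
        w = w + dw                                  # 9. one weight update per pass
        if np.linalg.norm(dw) < eps:                # 10. convergence: step has become tiny
            break
    return w

D = [(np.array([1.0, 2.0, 1.0]), 1.0), (np.array([1.0, -1.0, -2.0]), 0.0)]
print(gd(D))
```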

Remarks:
Δw is set to −η ∇Err(w) (and not to +η ∇Err(w)) in order to descend to the minimum of the error function.
Each GD iteration (REPEAT ... UNTIL) corresponds to finding the direction of steepest error descent, −∇Err(w_t) = ∑_{(x,c(x)) ∈ D} (c(x) − w_t^T x) x, and updating w_t by taking a step of size η along this direction.
Using a constant step size η can severely impair the speed of convergence. When taking the optimal step size η_t := argmin_η Err(w_t − η ∇Err(w_t)) at each iteration t, it can be shown that gradient descent (merely) has a linear rate of convergence. [Meza 2010]
As criterion for the convergence function may serve the global error, quantified either as the sum of the squared residuals, Err(w_t), or as the norm of the error gradient, ‖∇Err(w_t)‖, which is compared to some small positive bound ε.

Gradient Descent: Weight Adaptation: Delta Rule

The weight adaptation in the GD Algorithm is set-based: before modifying a weight component in w, the total error of all examples (the "batch") is computed.

Weight adaptation with regard to a single example (x, c(x)) ∈ D:

   Δw = η (c(x) − w^T x) x

This adaptation rule is known under different names: delta rule, Widrow-Hoff rule, adaline rule, least mean squares (LMS) rule.

The classification error Err_d(w) of a weight vector (= hypothesis) w with regard to a single example d ∈ D, d = (x, c(x)), is given as [Batch error]:

   Err_d(w) = 1/2 (c(x) − w^T x)²

Gradient Descent: Weight Adaptation: Incremental Gradient Descent [Algorithms: LMS, GD, PT]

Algorithm: IGD (Incremental Gradient Descent)
Input:    D      Training examples (x, c(x)) with |x| = p + 1, c(x) ∈ {0, 1} (alternatively c(x) ∈ {−1, 1}).
          η      Learning rate, a small positive constant.
Internal: y(D)   Set of y(x) values computed from the elements x in D given some w.
Output:   w      Weight vector.

IGD(D, η)
 1. initialize_random_weights(w), t = 0
 2. REPEAT
 3.   t = t + 1
 4.   FOREACH (x, c(x)) ∈ D DO
 5.     error = c(x) − w^T x
 6.     Δw = η · error · x
 7.     w = w + Δw
 8.   ENDDO
 9. UNTIL(convergence(D, y(D)) OR t > t_max)
10. return(w)
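Correspondingly, a minimal Python/NumPy sketch of the IGD Algorithm (illustrative only); here the global squared error serves as the convergence criterion:

```python
import numpy as np

def igd(D, eta=0.01, t_max=1000, eps=1e-6, rng=None):
    """Incremental (stochastic) gradient descent sketch: delta-rule update per example.

    D: list of (x, c) pairs, x an array of length p+1 with x[0] = 1."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.uniform(-0.5, 0.5, size=len(D[0][0]))   # 1. initialize random weights
    for t in range(t_max):                          # 2.-3. REPEAT
        for x, c in D:                              # 4. FOREACH example
            error = c - w @ x                       # 5. residual for this single example
            w = w + eta * error * x                 # 6.-7. immediate update (delta rule)
        total_err = 0.5 * sum((c - w @ x) ** 2 for x, c in D)
        if total_err < eps:                         # 9. convergence: global error small
            break
    return w

D = [(np.array([1.0, 2.0, 1.0]), 1.0), (np.array([1.0, -1.0, -2.0]), 0.0)]
print(igd(D))
```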

Remarks:
The classification error Err of incremental gradient descent is specific for each training example d ∈ D, d = (x, c(x)): Err_d(w) = 1/2 (c(x) − w^T x)².
The sequence of incremental weight adaptations approximates the gradient descent of the batch approach. If η is chosen sufficiently small, this approximation can be made arbitrarily accurate.
The computation of the total error in batch gradient descent enables larger weight adaptation increments.
Compared to batch gradient descent, the example-based weight adaptation of incremental gradient descent can better avoid getting stuck in a local minimum of the error function.
Incremental gradient descent is also called stochastic gradient descent.

Remarks (continued):
When, as is done here, the residual sum of squares (RSS) is chosen as error (loss) function, the incremental gradient descent algorithm [IGD] corresponds to the least mean squares algorithm [LMS].
The incremental gradient descent algorithm [IGD] looks similar to the perceptron training algorithm [PT], since the two algorithms differ only in the error computation (Line 5), where the latter applies the Heaviside function. However, this subtle syntactic difference is a significant conceptual difference, entailing a number of consequences:
Gradient descent is a regression approach and exploits the residuals, which are provided by an error function of choice and whose differential is evaluated to control the hyperplane movement. The PT Algorithm is not based on residuals (in the (p + 1)-dimensional input-output space) but refers to the input space only, where it simply evaluates the side of the hyperplane as a binary feature (correct side or not).
Provided linear separability, the PT Algorithm will converge within a finite number of iterations, which, however, cannot be guaranteed for gradient descent. Gradient descent will converge even if the data is not linearly separable.
Data sets can be constructed whose classes are linearly separable, but where gradient descent will not determine a hyperplane that classifies all examples correctly (whereas the PT Algorithm of course does).