Chapter ML:VI (continued)

VI. Neural Networks: Perceptron Learning, Gradient Descent, Multilayer Perceptron, Radial Basis Functions

Definition 1 (Linear Separability)
Two sets of feature vectors, $X_0, X_1$, of a $p$-dimensional feature space are called linearly separable if $p+1$ real numbers, $w_0, w_1, \ldots, w_p$, exist such that the following holds:

1. $\forall \mathbf{x} \in X_0: \ \sum_{j=0}^{p} w_j x_j < 0$
2. $\forall \mathbf{x} \in X_1: \ \sum_{j=0}^{p} w_j x_j \geq 0$

[Figure: two point sets in the $(x_1, x_2)$-plane; left: linearly separable, right: not linearly separable]
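To make Definition 1 concrete, the following sketch evaluates the decision sum for two small point sets; the weights and points are hypothetical choices for illustration, not taken from the slides, and each point carries $x_0 = 1$ as the constant bias component.

```python
# Checking the linear-separability criterion for two toy point sets.
w = [-1.0, 1.0, 1.0]                          # w_0, w_1, w_2: hyperplane x_1 + x_2 = 1

X0 = [[1, 0.0, 0.0], [1, 0.5, 0.2]]           # should satisfy sum_j w_j x_j < 0
X1 = [[1, 1.0, 1.0], [1, 0.4, 0.9]]           # should satisfy sum_j w_j x_j >= 0

def decision_sum(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

separable = all(decision_sum(w, x) < 0 for x in X0) and \
            all(decision_sum(w, x) >= 0 for x in X1)
print(separable)                              # True for this particular weight choice
```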

Separability
The XOR function defines the smallest example of two sets that are not linearly separable:

  x_1  x_2  XOR (class)
   0    0    0
   1    0    1
   0    1    1
   1    1    0

[Figure: the four XOR points in the $(x_1, x_2)$-plane; no single hyperplane separates the two classes]

Handling XOR requires the specification of several hyperplanes, i.e., a combination of several perceptrons.

Separability (continued)
Layered combination of several perceptrons: the multilayer perceptron.

[Figure: minimum multilayer perceptron that is able to handle the XOR problem, with inputs $x_0 = 1, x_1, x_2$, two hidden units, and one output unit]
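As a quick sanity check, the sketch below hard-codes one possible weight assignment for such a minimal two-layer perceptron with Heaviside threshold units; the particular weights are an illustrative choice, not taken from the slides, and they reproduce the XOR table.

```python
# Minimal two-layer perceptron for XOR with Heaviside threshold units.
# Hidden unit 1 acts like OR, hidden unit 2 like AND, and the output
# unit computes "OR and not AND", which is XOR.

def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h1 = step(-0.5 + 1.0 * x1 + 1.0 * x2)     # fires for OR
    h2 = step(-1.5 + 1.0 * x1 + 1.0 * x2)     # fires for AND
    return step(-0.5 + 1.0 * h1 - 1.0 * h2)   # XOR = OR and not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))        # prints the XOR truth table
```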

Remarks:
- The multilayer perceptron was presented by Rumelhart and McClelland in 1986. Earlier, but unnoticed, was similar research work of Werbos and Parker [1974, 1982].
- Compared to a single perceptron, the multilayer perceptron poses a significantly more challenging training (learning) problem, which requires continuous (and non-linear) threshold functions along with sophisticated learning strategies.
- Marvin Minsky and Seymour Papert showed in 1969, using the XOR problem, the limitations of single perceptrons. Moreover, they assumed that extensions of the perceptron architecture (such as the multilayer perceptron) would be similarly limited as a single perceptron, a fatal mistake. In fact, they brought the research in this field to a halt that lasted 17 years. [Berkeley] [Marvin Minsky: MIT Media Lab, Wikipedia]

Computation in the Network  [Heaviside]
A perceptron with a continuous and non-linear threshold function:

[Figure: perceptron with inputs $x_0 = 1, x_1, \ldots, x_p$, weights $w_0, w_1, \ldots, w_p$, and output $y$]

The sigmoid function $\sigma(z)$ as threshold function:

$\sigma(z) = \dfrac{1}{1 + e^{-z}}, \quad \text{where} \quad \dfrac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$

Computation in the Network (continued)
Computation of the perceptron output $y(\mathbf{x})$ via the sigmoid function $\sigma$:

$y(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \dfrac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}$

[Plot: $\sigma$ as a function of $\sum_{j=0}^{p} w_j x_j$, ranging from 0 to 1]

An alternative to the sigmoid function is the tanh function:

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \dfrac{e^{2x} - 1}{e^{2x} + 1}$

[Plot: $\tanh$ as a function of $\sum_{j=0}^{p} w_j x_j$]
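A minimal sketch of these two threshold functions in Python (NumPy); the example weight and input vectors are made up for illustration, and the sigmoid derivative is included because it reappears in the backpropagation rule below.

```python
import numpy as np

def sigmoid(z):
    """Logistic threshold function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative d sigma / dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Perceptron output y(x) = sigma(w^T x); the concrete numbers are illustrative.
w = np.array([-0.5, 1.0, 1.0])    # w_0 is the bias weight
x = np.array([1.0, 0.3, 0.8])     # x_0 = 1 by convention
print(sigmoid(w @ x))             # value in (0, 1)
print(np.tanh(w @ x))             # alternative output in (-1, 1)
```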

Computation in the Network (continued)
Distinguish units (nodes, perceptrons) of type input, hidden, and output:

[Figure: network built up layer by layer: input units $U_I$ ($x_0 = 1, x_1, \ldots, x_p$), hidden units $U_H$ (with constant unit $y_{H0} = 1$), and output units $U_O$ ($y_1, y_2, \ldots, y_k$)]

Notation:
  U_I, U_H, U_O      Sets with units of type input, hidden, and output
  w_jk, Δw_jk        Weight and weight adaptation for the edge connecting the units j and k
  x_{j→k}            Input value (single incoming edge) for unit k, provided at the output of unit j
  y_k, δ_k           Output value and classification error of unit k
  w_k                Weight vector (all incoming edges) of unit k
  x                  Input vector for a unit of the hidden layer
  y_H, y_O           Output vector of the hidden layer and the output layer respectively
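The following sketch performs one forward pass through such a network in this notation; the matrix names W_H and W_O and all dimensions are illustrative assumptions, with the weight vectors w_k stored as rows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_H, W_O):
    """Forward pass through one hidden layer and one output layer.

    x   : input vector with x[0] = 1 (bias component), shape (p+1,)
    W_H : weights of the hidden units, one row per hidden unit
    W_O : weights of the output units, one row per output unit
    """
    y_H = sigmoid(W_H @ x)                      # outputs of the hidden units
    y_H = np.concatenate(([1.0], y_H))          # prepend the constant unit y_H0 = 1
    y_O = sigmoid(W_O @ y_H)                    # outputs of the output units
    return y_H, y_O

# Example dimensions are made up: p = 2 inputs, 3 hidden units, 2 output units.
rng = np.random.default_rng(0)
W_H = rng.normal(scale=0.1, size=(3, 3))        # 3 hidden units, p+1 = 3 inputs each
W_O = rng.normal(scale=0.1, size=(2, 4))        # 2 output units, 3+1 incoming edges each
x = np.array([1.0, 0.2, 0.7])
print(forward(x, W_H, W_O)[1])
```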

Remarks:
- The units of the input layer, U_I, perform no computations at all; they only distribute the input values to the next layer.
- The network topology corresponds to a complete bipartite graph between the units in U_I and U_H, as well as between the units in U_H and U_O.
- The non-linear characteristic of the sigmoid function allows for networks that approximate every (computable) function. For this capability only three active layers are required, i.e., two layers with hidden units and one layer with output units. Keyword: universal approximator [Kolmogorov Theorem, 1957]
- Multilayer perceptrons are also called multilayer networks or (artificial) neural networks, NN for short.

Classification Error
The classification error Err(w) is computed as a sum over the $|U_O| = k$ network outputs:

$Err(\mathbf{w}) = \dfrac{1}{2} \sum_{(\mathbf{x},\,c(\mathbf{x})) \in D} \ \sum_{v \in U_O} \big(c_v(\mathbf{x}) - y_v(\mathbf{x})\big)^2$

Due to its complex form, Err(w) may contain various local minima:

[Plot: error surface of Err(w) over two weight dimensions, showing several local minima]
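A direct transcription of Err(w) as a sum over the training set D and the output units; the toy data and the constant "network" are made up for illustration.

```python
import numpy as np

def classification_error(D, predict):
    """Err(w) = 1/2 * sum_{(x,c) in D} sum_{v in U_O} (c_v(x) - y_v(x))^2.

    D       : list of (x, c) pairs, where c is the target vector over the output units
    predict : function mapping x to the network output vector y_O(x)
    """
    return 0.5 * sum(np.sum((c - predict(x)) ** 2) for x, c in D)

# Tiny illustration with a made-up constant "network":
D = [(np.array([1.0, 0.0]), np.array([1.0, 0.0])),
     (np.array([1.0, 1.0]), np.array([0.0, 1.0]))]
predict = lambda x: np.array([0.8, 0.2])      # pretend network output, for illustration
print(classification_error(D, predict))
```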

Weight Adaptation: Incremental Gradient Descent  [network]

Algorithm:  MPT (Multilayer Perceptron Training)
Input:      D    Training examples (x, c(x)) with x ∈ ℝ^{p+1}, c(x) ∈ {0,1}^k (or c(x) ∈ {-1,1}^k)
            η    Learning rate, a small positive constant
Output:     w    Weights of the units in U_I, U_H, U_O

 1  initialize_random_weights(U_I, U_H, U_O), t = 0
 2  REPEAT
 3    t = t + 1
 4    FOREACH (x, c(x)) ∈ D DO
 5      FOREACH u ∈ U_H DO  y_u = σ(w_u^T x)                          // compute output of layer 1
 6      FOREACH v ∈ U_O DO  y_v = σ(w_v^T y_H)                        // compute output of layer 2
 7      FOREACH v ∈ U_O DO  δ_v = y_v (1 - y_v) (c_v(x) - y_v)        // backpropagate layer 2
 8      FOREACH u ∈ U_H DO  δ_u = y_u (1 - y_u) Σ_{v ∈ U_O} w_uv δ_v  // backpropagate layer 1
 9      FOREACH w_jk, (j, k) ∈ (U_I × U_H) ∪ (U_H × U_O) DO
10        Δw_jk = η δ_k x_{j→k}
11        w_jk = w_jk + Δw_jk
12      ENDDO
13    ENDDO
14  UNTIL( convergence(D, y_O(D)) OR t > t_max )
15  return(w)
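The following is a compact NumPy sketch of the MPT loop above for a single hidden layer; the hyperparameters, the initialization scale, and the convergence test are illustrative choices, and with an unlucky initialization the network may still settle in a local minimum of Err(w), which is exactly what the error-surface slide illustrates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mpt_train(D, n_hidden, eta=0.5, t_max=10000, seed=0):
    """Incremental gradient descent for a two-layer sigmoid network (sketch of MPT).

    D : list of (x, c) pairs; x includes the bias component x[0] = 1, c is the target vector.
    """
    rng = np.random.default_rng(seed)
    p1 = len(D[0][0])                                        # p + 1 input components
    k  = len(D[0][1])                                        # number of output units
    W_H = rng.normal(scale=0.1, size=(n_hidden, p1))         # hidden-unit weights
    W_O = rng.normal(scale=0.1, size=(k, n_hidden + 1))      # output-unit weights (+ bias)

    for t in range(t_max):
        err = 0.0
        for x, c in D:
            y_H = np.concatenate(([1.0], sigmoid(W_H @ x)))  # layer-1 outputs, y_H0 = 1
            y_O = sigmoid(W_O @ y_H)                         # layer-2 outputs
            delta_O = y_O * (1 - y_O) * (c - y_O)            # backpropagate layer 2 (line 7)
            delta_H = y_H[1:] * (1 - y_H[1:]) * (W_O[:, 1:].T @ delta_O)  # layer 1 (line 8)
            W_O += eta * np.outer(delta_O, y_H)              # Δw_jk = η δ_k x_{j→k}
            W_H += eta * np.outer(delta_H, x)
            err += 0.5 * np.sum((c - y_O) ** 2)
        if err < 1e-3:                                       # simple convergence test (illustrative)
            break
    return W_H, W_O

# XOR with bias component x[0] = 1 and a single output unit:
D = [(np.array([1., 0., 0.]), np.array([0.])),
     (np.array([1., 0., 1.]), np.array([1.])),
     (np.array([1., 1., 0.]), np.array([1.])),
     (np.array([1., 1., 1.]), np.array([0.]))]
W_H, W_O = mpt_train(D, n_hidden=2)
```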

Remarks:
- The generic delta rule (Lines 7 and 8 of the MPT algorithm) allows for a backpropagation of the classification error and hence the training of multi-layered networks.
- Gradient descent is based on the classification error of the entire network and hence considers the entire network weight vector.

Weight Adaptation: Momentum Term
Momentum idea: a weight adaptation in iteration t considers the adaptation in iteration t-1:

$\Delta w_{uv}(t) = \eta\, \delta_v\, x_{u \to v} + \alpha\, \Delta w_{uv}(t-1)$

The term $\alpha$, $0 \leq \alpha < 1$, is called momentum.

Effects:
- Due to the adaptation inertia, local minima can be overcome.
- If the direction of the descent does not change, the adaptation increment and, as a consequence, the speed of convergence are increased.
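A minimal sketch of the momentum-augmented update; the matrix shape, learning rate, and momentum value are illustrative. Because the pretend gradient term keeps the same sign, the increments grow from step to step, illustrating the acceleration effect.

```python
import numpy as np

def momentum_step(W, grad_term, prev_delta, eta=0.5, alpha=0.9):
    """One momentum-augmented update: Δw(t) = η * (δ · x) + α * Δw(t-1).

    grad_term  : the plain gradient-descent increment δ_k x_{j→k} for each weight
    prev_delta : Δw(t-1), the increment applied in the previous iteration
    """
    delta = eta * grad_term + alpha * prev_delta
    return W + delta, delta                     # return Δw(t) to reuse in the next step

# Illustration with made-up numbers: a single 2x3 weight matrix.
W = np.zeros((2, 3))
prev = np.zeros_like(W)
for _ in range(3):
    grad = np.ones_like(W) * 0.1                # pretend gradient term δ_k x_{j→k}
    W, prev = momentum_step(W, grad, prev)      # increments: 0.05, 0.095, 0.1355
print(W)
```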