
Chapter ML:VI (continued)

VI. Neural Networks
- Perceptron Learning
- Gradient Descent
- Multilayer Perceptron
- Radial Basis Functions

Definition 1 (Linear Separability)

Two sets of feature vectors, X_0, X_1, of a p-dimensional feature space are called linearly separable if p + 1 real numbers, θ, w_1, ..., w_p, exist such that the following holds:

  1. ∀x ∈ X_0 : Σ_{j=1}^{p} w_j x_j < θ
  2. ∀x ∈ X_1 : Σ_{j=1}^{p} w_j x_j ≥ θ

[Figure: two point configurations in the (x_1, x_2) plane, one linearly separable, one not linearly separable.]
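To make the definition concrete, here is a minimal Python sketch, not part of the slides, that checks whether given weights w and threshold θ satisfy conditions 1 and 2 for two point sets (the function name and the example data are illustrative):

    import numpy as np

    def is_linearly_separated(X0, X1, w, theta):
        """Check conditions 1 and 2 of Definition 1 for the given w and theta."""
        X0, X1, w = np.asarray(X0, float), np.asarray(X1, float), np.asarray(w, float)
        below = np.all(X0 @ w < theta)    # every x in X0: sum_j w_j x_j <  theta
        above = np.all(X1 @ w >= theta)   # every x in X1: sum_j w_j x_j >= theta
        return below and above

    # AND is linearly separable, e.g. with w = (1, 1) and theta = 1.5 ...
    X0 = [(0, 0), (1, 0), (0, 1)]         # class 0 of AND
    X1 = [(1, 1)]                         # class 1 of AND
    print(is_linearly_separated(X0, X1, w=(1, 1), theta=1.5))   # True

    # ... whereas no choice of (w, theta) works for XOR (see the next slides).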

Separability

The XOR function defines the smallest example of two sets that are not linearly separable:

  x_1   x_2   XOR (class)
   0     0     0
   1     0     1
   0     1     1
   1     1     0

[Figure: the four XOR points plotted in the unit square (x_1, x_2 ∈ {0, 1}); no single line separates class 0 from class 1.]

Handling such sets requires the specification of several hyperplanes, i.e., the combination of several perceptrons.

Separability (continued)

Layered combination of several perceptrons: the multilayer perceptron.

The minimum multilayer perceptron that is able to handle the XOR problem:

[Figure: the inputs x_1 and x_2 feed two threshold units (Σ, θ) in a hidden layer; their outputs feed a single threshold unit (Σ, θ) in the output layer, which yields the class decision.]
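A minimal sketch of such a two-layer network of Heaviside threshold units that computes XOR; the weights and thresholds below are chosen by hand for illustration and are not taken from the slides:

    import numpy as np

    def heaviside_unit(x, w, theta):
        """Output 1 if the weighted input sum reaches the threshold, else 0."""
        return 1 if np.dot(w, x) >= theta else 0

    def xor_mlp(x1, x2):
        # Hidden layer: two threshold units realizing two hyperplanes.
        h1 = heaviside_unit([x1, x2], w=[1, 1], theta=0.5)      # x1 OR x2
        h2 = heaviside_unit([x1, x2], w=[1, 1], theta=1.5)      # x1 AND x2
        # Output layer: one threshold unit combining the hidden outputs.
        return heaviside_unit([h1, h2], w=[1, -1], theta=0.5)   # OR but not AND

    for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(x1, x2, xor_mlp(x1, x2))   # reproduces the XOR truth table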

Remarks:

- The multilayer perceptron was presented by Rumelhart and McClelland in 1986. Earlier, but unnoticed, was similar research work by Werbos and Parker [1974, 1982].
- Compared to a single perceptron, the multilayer perceptron poses a significantly more challenging training (= learning) problem, which requires continuous threshold functions and sophisticated learning strategies.
- Marvin Minsky and Seymour Papert showed in 1969, using the XOR problem, the limitations of single perceptrons. Moreover, they assumed that extensions of the perceptron architecture (such as the multilayer perceptron) would be similarly limited as a single perceptron. A fatal mistake: in fact, they brought the research in this field to a halt that lasted 17 years. [Berkeley] [Marvin Minsky]

Computation in the Network

A perceptron with a continuous, non-linear threshold function in place of the [Heaviside] step function:

[Figure: a perceptron with inputs x_0 = 1, x_1, x_2, ..., x_p, weights w_0 = −θ, w_1, w_2, ..., w_p, a summation unit Σ, and output y.]

The sigmoid function σ(z) as threshold function:

  σ(z) = 1 / (1 + e^{−z}),   where   dσ(z)/dz = σ(z) · (1 − σ(z))
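A small Python sketch of the sigmoid together with a numerical check of the stated derivative identity (function names are illustrative, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_derivative(z):
        s = sigmoid(z)
        return s * (1.0 - s)          # dσ/dz = σ(z)(1 − σ(z))

    # Check the identity against a central difference quotient.
    z = np.linspace(-5, 5, 11)
    eps = 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(np.allclose(numeric, sigmoid_derivative(z), atol=1e-8))   # True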

Computation in the Network (continued)

Computation of the perceptron output y(x) via the sigmoid function σ:

  y(x) = σ(w^T x) = 1 / (1 + e^{−w^T x})

[Figure: σ plotted over the weighted sum Σ_{j=0}^{p} w_j x_j, rising from 0 to 1.]

An alternative to the sigmoid function is the tanh function:

  tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) = (e^{2x} − 1) / (e^{2x} + 1)

[Figure: tanh plotted over the weighted sum Σ_{j=0}^{p} w_j x_j, rising from −1 to 1.]
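A short illustration of the output computation; the weight vector and input below are made-up values, with x_0 = 1 carrying the threshold as w_0:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([-0.5, 1.0, 2.0])    # w_0 = -theta, w_1, w_2 (illustrative values)
    x = np.array([1.0, 0.3, 0.8])     # x_0 = 1 (constant), x_1, x_2
    y = sigmoid(w @ x)                # y(x) = σ(w^T x), a value in (0, 1)
    y_tanh = np.tanh(w @ x)           # alternative: tanh unit, a value in (-1, 1)
    print(y, y_tanh)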

Computation in the Network (continued)

Distinguish units (nodes, perceptrons) of type input, hidden, and output:

[Figure: a feed-forward network. The input units U_I (x_0 = 1, x_1, ..., x_p) feed a layer of hidden units U_H, whose outputs feed the output units U_O, which produce y_1, y_2, ..., y_k.]

Notation:

  U_I, U_H, U_O      Sets with units of type input, hidden, and output
  x_{u→v}            Input value for unit v, provided via the output of unit u
  w_{uv}, Δw_{uv}    Weight and weight adaptation for the edge connecting units u and v
  δ_u                Classification error of unit u
  y_u                Output value of unit u
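The per-unit computation can be sketched as follows. The name propagate is the one used in the training algorithm below; the simplified signature here (explicit inputs and weights instead of a unit object) and the sigmoid body are assumptions consistent with y_u = σ(w^T x):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def propagate(inputs, weights):
        """Output y_u of a single hidden or output unit u.

        inputs  -- the values x_{v->u} arriving at u (including the constant 1)
        weights -- the weights w_{vu} of the edges leading into u
        """
        return sigmoid(np.dot(weights, inputs))

    print(propagate([1.0, 0.3, 0.8], [-0.5, 1.0, 2.0]))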

Remarks:

- The units of the input layer, U_I, perform no computations at all: they merely distribute the input values to the next layer.
- The non-linear characteristic of the sigmoid function makes it possible to build networks that can approximate virtually any function. To achieve this flexibility, only three active layers are required, i.e., two layers with hidden units and one layer with output units. Keywords: universal approximator, [Kolmogorov theorem, 1957]
- The generalized delta rule allows for a backpropagation of the classification error and hence the training of multi-layered networks. Gradient descent is based on the classification error of the entire network and hence considers the entire network weight vector.
- Multilayer perceptrons are also called multilayer networks or (artificial) neural networks, abbreviated as ANN.

Classification Error

The classification error Err(w) is computed as a sum over the |U_O| = k network outputs:

  Err(w) = 1/2 · Σ_{(x, c(x)) ∈ D}  Σ_{v ∈ U_O}  (c_v(x) − y_v(x))^2

Due to its complex form, Err(w) may contain various local minima:

[Figure: plot of an error surface Err(w), exhibiting several local minima.]
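A direct transcription of Err(w) as code; the forward function stands for the full network computation and is assumed here, not given on the slides:

    import numpy as np

    def classification_error(D, forward):
        """Err(w) = 1/2 * sum over examples and output units of (c_v(x) - y_v(x))^2.

        D       -- list of pairs (x, c) with c the target vector in {0,1}^k
        forward -- function mapping an input x to the k network outputs y(x)
        """
        return 0.5 * sum(np.sum((np.asarray(c) - forward(x)) ** 2) for x, c in D)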

Weight Adaptation: Incremental Gradient Descent   [network]

Algorithm: MPT (Multilayer Perceptron Training)

Input:   D    Training examples of the form (x, c(x)) with |x| = p + 1, c(x) ∈ {0,1}^k.
         η    Learning rate, a small positive constant.
Output:  w    Weights for the units in U_I, U_H, U_O.

 1. initialize_random_weights(U_I, U_H, U_O), t = 0
 2. REPEAT
 3.   t = t + 1
 4.   FOREACH (x, c(x)) ∈ D DO
 5.     FOREACH u ∈ U_H DO y_u = propagate(x, u)                            // layer 1
 6.     FOREACH v ∈ U_O DO y_v = propagate(y_H, v)                          // layer 2
 7.     FOREACH v ∈ U_O DO δ_v = y_v · (1 − y_v) · (c_v(x) − y_v)           // backpropagate layer 2
 8.     FOREACH u ∈ U_H DO δ_u = y_u · (1 − y_u) · Σ_{v ∈ U_O} w_{uv} δ_v   // backpropagate layer 1
 9.     FOREACH w_{jk}, (j, k) ∈ (U_I × U_H) ∪ (U_H × U_O) DO
10.       Δw_{jk} = η · δ_k · x_{j→k}
11.       w_{jk} = w_{jk} + Δw_{jk}
12.     ENDDO
13.   ENDDO
14. UNTIL(convergence(D, y_O(D)) OR t > t_max)
15. return(w)
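A compact Python sketch of the algorithm for one hidden layer, using weight matrices instead of per-edge loops. The layer sizes, the fixed epoch count in place of the convergence test, the output-layer bias unit, and all names are assumptions for illustration, not part of the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mpt_train(D, p, m, k, eta=0.5, t_max=5000, seed=0):
        """Incremental gradient descent for a network with one hidden layer.

        D -- training examples (x, c) with x of length p+1 (x[0] = 1) and c in {0,1}^k
        m -- number of hidden units (assumed architecture parameter)
        """
        rng = np.random.default_rng(seed)
        W1 = rng.uniform(-0.5, 0.5, (m, p + 1))     # weights input -> hidden
        W2 = rng.uniform(-0.5, 0.5, (k, m + 1))     # weights hidden -> output
        for t in range(t_max):                      # fixed epoch count instead of convergence(...)
            for x, c in D:
                x, c = np.asarray(x, float), np.asarray(c, float)
                y_H = sigmoid(W1 @ x)                                # layer 1
                y_Hb = np.concatenate(([1.0], y_H))                  # constant input for layer 2 (assumption)
                y_O = sigmoid(W2 @ y_Hb)                             # layer 2
                delta_O = y_O * (1 - y_O) * (c - y_O)                # backpropagate layer 2
                delta_H = y_H * (1 - y_H) * (W2[:, 1:].T @ delta_O)  # backpropagate layer 1
                W2 += eta * np.outer(delta_O, y_Hb)                  # Δw = η δ_k x_{j→k}, layer 2
                W1 += eta * np.outer(delta_H, x)                     # Δw = η δ_k x_{j→k}, layer 1
        return W1, W2

    # Example: training on the XOR data (inputs carry the constant x_0 = 1).
    D = [([1, 0, 0], [0]), ([1, 1, 0], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [0])]
    W1, W2 = mpt_train(D, p=2, m=3, k=1, eta=1.0)
    for x, c in D:
        y_H = np.concatenate(([1.0], sigmoid(W1 @ np.asarray(x, float))))
        print(x, c, np.round(sigmoid(W2 @ y_H), 2))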

Weight Adaptation: Momentum Term

Momentum idea: a weight adaptation in iteration t considers the adaptation in iteration t − 1:

  Δw_{uv}(t) = η · δ_v · x_{u→v} + α · Δw_{uv}(t − 1)

The term α, 0 ≤ α < 1, is called momentum.

Effects:
- Due to the adaptation inertia, local minima can be overcome.
- If the direction of the descent does not change, the adaptation increment and, as a consequence, the speed of convergence are increased.
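In code the change is confined to the weight adaptation step. A minimal sketch, with an illustrative function name, that could replace the plain updates in the training sketch above; dW_prev holds Δw(t − 1) and would be initialized to a zero matrix before the first iteration:

    import numpy as np

    def momentum_update(W, dW_prev, delta, inputs, eta, alpha):
        """One weight adaptation with momentum: Δw(t) = η·δ·x + α·Δw(t−1)."""
        dW = eta * np.outer(delta, inputs) + alpha * dW_prev
        return W + dW, dW      # new weights and Δw(t), to be remembered for iteration t + 1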

Neural Networks: Additional Sources on the Web

Application and implementation:

- JavaNNS. Java Neural Network Simulator.
  http://www-ra.informatik.uni-tuebingen.de/software/javanns
- SNNS. Stuttgart Neural Network Simulator.
  http://www-ra.informatik.uni-tuebingen.de/software/snns