Computational Intelligence Winter Term 2017/18
Prof. Dr. Günter Rudolph
Lehrstuhl für Algorithm Engineering (LS 11)
Fakultät für Informatik
TU Dortmund

Plan for Today
- Single-Layer Perceptron: Accelerated Learning, Online- vs. Batch-Learning
- Multi-Layer Perceptron: Model, Backpropagation

Single-Layer Perceptron (SLP)

Acceleration of Perceptron Learning

Assumption: x ∈ {0, 1}^n  ⇒  ‖x‖ ≥ 1 for all x ≠ (0, ..., 0)

Let B = P ∪ { -x : x ∈ N } (only positive examples).

If the classification is incorrect, then w'x < 0. Consequently, the size of the error is just δ = -w'x > 0.

⇒ w_{t+1} = w_t + (δ + ε) x for ε > 0 (small) corrects the error in a single step, since

w'_{t+1} x = (w_t + (δ + ε) x)' x = w'_t x + (δ + ε) x'x = -δ + δ ‖x‖² + ε ‖x‖² = δ (‖x‖² - 1) + ε ‖x‖² > 0,

because ‖x‖² - 1 ≥ 0 and ε ‖x‖² > 0.
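The following is a minimal sketch of this one-step correction (NumPy assumed; function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def correct_in_one_step(w, x, eps=1e-3):
    """One corrective update w <- w + (delta + eps) * x for a misclassified
    example x with ||x|| >= 1; afterwards w'x > 0 holds."""
    delta = -np.dot(w, x)            # size of the error (> 0 iff x is misclassified)
    if delta <= 0:
        return w                     # x already classified correctly
    return w + (delta + eps) * x

w = np.array([-1.0, 0.5])
x = np.array([1.0, 0.0])             # binary-style input, so ||x|| >= 1
w = correct_in_one_step(w, x)
assert np.dot(w, x) > 0              # corrected in a single step
```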

Single-Layer Perceptron (SLP)

Generalization:

Assumption: x ∈ ℝ^n  ⇒  ‖x‖ > 0 for all x ≠ (0, ..., 0)

As before: w_{t+1} = w_t + (δ + ε) x for ε > 0 (small) and δ = -w'_t x > 0.

⇒ w'_{t+1} x = δ (‖x‖² - 1) + ε ‖x‖², which can be < 0 if ‖x‖² < 1; only ε ‖x‖² > 0 is guaranteed!

Idea: Scaling of the data does not alter the classification task (if the threshold is 0)!

Let ℓ = min { ‖x‖ : x ∈ B } > 0 and set x̂ = x / ℓ  ⇒  set of scaled examples B̂.

⇒ ‖x̂‖ ≥ 1  ⇒  ‖x̂‖² - 1 ≥ 0  ⇒  w'_{t+1} x̂ > 0

Single-Layer Perceptron (SLP)

There exist numerous variants of perceptron learning methods.

Theorem: (Duda & Hart 1973)
If the rule for correcting the weights is w_{t+1} = w_t + γ_t x (if w'_t x < 0) and
1. ∀ t ≥ 0: γ_t ≥ 0
2. Σ_{t=0}^∞ γ_t = ∞
3. Σ_{t=0}^T γ_t² / ( Σ_{t=0}^T γ_t )² → 0 for T → ∞
then w_t → w* for t → ∞ with ∀x: x'w* > 0.

e.g.: γ_t = γ > 0 or γ_t = γ / (t+1) for γ > 0

Single-Layer Perceptron (SLP)

As yet: Online Learning → update of the weights after each training pattern (if necessary).

Now: Batch Learning → update of the weights only after testing all training patterns.

Update rule:  w_{t+1} = w_t + γ Σ_{x ∈ B : w'_t x < 0} x   (γ > 0)

Vague assessment in the literature:
- advantage: usually faster
- disadvantage: needs more memory; in fact, just a single additional vector!

Single-Layer Perceptron (SLP)

Find weights by means of optimization:

Let F(w) = { x ∈ B : w'x < 0 } be the set of patterns incorrectly classified by weight w.

Objective function:  f(w) = - Σ_{x ∈ F(w)} w'x  →  min!

Optimum: f(w) = 0 iff F(w) is empty.

Possible approach: gradient method

w_{t+1} = w_t - γ ∇f(w_t)   (γ > 0)

converges to a local minimum (depending on w_0).

Single-Layer Perceptron (SLP)

Gradient method:  w_{t+1} = w_t - γ ∇f(w_t)

The gradient

∇f(w) = ( ∂f/∂w_1, ∂f/∂w_2, ..., ∂f/∂w_n )'

points in the direction of the steepest ascent of the function f(·).

Caution: The indices i of w_i here denote the components of the vector w; they are not iteration counters!

Single-Layer Perceptron (SLP)

Gradient method, thus:

∇f(w) = - Σ_{x ∈ F(w)} x

⇒ the gradient method  w_{t+1} = w_t + γ Σ_{x ∈ F(w_t)} x  is exactly batch learning.
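As a small illustration (a sketch only; NumPy assumed, all names invented here), one gradient step on f(w) indeed reduces to adding up the misclassified examples, i.e. the batch rule from above:

```python
import numpy as np

def misclassified(w, B):
    return [x for x in B if np.dot(w, x) < 0]            # F(w)

def f(w, B):
    return -sum(np.dot(w, x) for x in misclassified(w, B))

def gradient_step(w, B, gamma=0.1):
    F = misclassified(w, B)
    if not F:
        return w                                          # f(w) = 0, nothing to correct
    grad = -np.sum(F, axis=0)                             # gradient of f at w
    return w - gamma * grad                               # = w + gamma * sum of misclassified x

B = [np.array([1.0, 0.2]), np.array([0.8, -0.1]), np.array([0.9, 0.3])]
w = np.array([-1.0, 0.0])
for _ in range(100):
    w = gradient_step(w, B)
print(f(w, B))                                            # 0.0 once all of B is classified correctly
```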

Single-Layer Perceptron (SLP)

How difficult is it
(a) to find a separating hyperplane, provided it exists?
(b) to decide that there is no separating hyperplane?

Let B = P ∪ { -x : x ∈ N } (only positive examples), w_i ∈ ℝ, θ ∈ ℝ, |B| = m.

For every example x_i ∈ B it should hold that:

x_{i1} w_1 + x_{i2} w_2 + ... + x_{in} w_n ≥ θ

The trivial solution w_i = θ = 0 must be excluded! Therefore additionally, for some η ∈ ℝ:

x_{i1} w_1 + x_{i2} w_2 + ... + x_{in} w_n - θ - η ≥ 0

Idea: maximize η. If η* > 0, then a solution has been found.

Single-Layer Perceptron (SLP)

Matrix notation, as a Linear Programming problem:

f(z_1, z_2, ..., z_n, z_{n+1}, z_{n+2}) = z_{n+2}  →  max!

s.t.  Az ≥ 0

solvable e.g. by Karmarkar's algorithm in polynomial time.

If z_{n+2} = η > 0, then weights and threshold are given by z. Otherwise a separating hyperplane does not exist!
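A sketch of this LP with an off-the-shelf solver (here scipy.optimize.linprog rather than Karmarkar's algorithm; that choice, the box bounds that keep the LP bounded, and all names are assumptions of this sketch):

```python
import numpy as np
from scipy.optimize import linprog

def separating_hyperplane(B):
    """B: examples as rows (negatives already mirrored as -x).
    Variables z = (w_1, ..., w_n, theta, eta); maximize eta s.t. x'w - theta - eta >= 0."""
    B = np.asarray(B, dtype=float)
    m, n = B.shape
    A_ub = np.hstack([-B, np.ones((m, 1)), np.ones((m, 1))])   # -x'w + theta + eta <= 0
    b_ub = np.zeros(m)
    c = np.zeros(n + 2)
    c[-1] = -1.0                                               # minimize -eta  <=>  maximize eta
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(-1, 1)] * (n + 2))
    w, theta, eta = res.x[:n], res.x[n], res.x[n + 1]
    return (w, theta) if eta > 0 else None                     # None: no separating hyperplane

print(separating_hyperplane([[1.0, 0.2], [0.8, -0.1], [0.9, 0.3]]))
```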

What can be achieved by adding a layer?

Single-layer perceptron (SLP): a hyperplane separates the space into two subspaces N and P.

Two-layer perceptron: arbitrary convex sets can be separated; the half-spaces of the 1st layer are connected by an AND gate in the 2nd layer.

Three-layer perceptron: arbitrary sets can be separated (depending on the number of neurons); several convex sets are representable by the 2nd layer, and these convex sets of the 2nd layer are combined by an OR gate in the 3rd layer.

⇒ more than 3 layers are not necessary!

XOR with 3 neurons in 2 steps

First layer: neuron y_1 fires iff x_1 + x_2 ≥ 1, neuron y_2 fires iff x_1 + x_2 ≤ 1 (i.e. -x_1 - x_2 ≥ -1).
Second layer: neuron z connects them with an AND gate: z fires iff y_1 + y_2 ≥ 2.

x_1  x_2 | y_1  y_2 | z
 0    0  |  0    1  | 0
 0    1  |  1    1  | 1
 1    0  |  1    1  | 1
 1    1  |  1    0  | 0

The region accepted by z is the intersection of two half-spaces, i.e. a convex set.

XOR with 3 neurons in 2 layers

First layer: neuron y_1 fires iff x_1 - x_2 ≥ 1, neuron y_2 fires iff x_2 - x_1 ≥ 1.
Second layer: neuron z fires iff y_1 + y_2 ≥ 1, i.e. without an AND gate in the 2nd layer.

x_1  x_2 | y_1  y_2 | z
 0    0  |  0    0  | 0
 0    1  |  0    1  | 1
 1    0  |  1    0  | 1
 1    1  |  0    0  | 0

Thus z = (x_1 ∧ ¬x_2) ∨ (¬x_1 ∧ x_2) = x_1 XOR x_2.

XOR can be realized with only 2 neurons!

Neuron y fires iff x_1 + x_2 ≥ 2; the output neuron z additionally receives x_1 and x_2 directly and fires iff x_1 - 2y + x_2 ≥ 1.

x_1  x_2 | y  -2y  x_1 - 2y + x_2 | z
 0    0  | 0   0         0        | 0
 0    1  | 0   0         1        | 1
 1    0  | 0   0         1        | 1
 1    1  | 1  -2         0        | 0

BUT: this is not a layered network (no MLP)!
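A quick check of this 2-neuron construction against the truth table (plain Python, names illustrative):

```python
def xor_two_neurons(x1, x2):
    y = 1 if x1 + x2 >= 2 else 0            # inner neuron: fires only for (1, 1)
    z = 1 if x1 - 2 * y + x2 >= 1 else 0    # output neuron also sees x1 and x2 directly
    return z

for x1 in (0, 1):
    for x2 in (0, 1):
        assert xor_two_neurons(x1, x2) == (x1 ^ x2)
```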

Evidently: MLPs can handle significantly more difficult problems than SLPs!

But: How can we adjust all these weights and thresholds? Is there an efficient learning algorithm for MLPs?

History: The lack of an efficient learning algorithm for MLPs held the field back ... until Rumelhart, Hinton and Williams (1986): Backpropagation. It had actually been proposed by Werbos (1974) ... but remained unknown to ANN researchers (it was a PhD thesis).

Quantification of the classification error of an MLP

Total Sum Squared Error (TSSE):

TSSE = Σ_{(x, z*) ∈ B} Σ_k ( z_k(x; w) - z*_k )²

where z(x; w) is the output of the net for weights w and input x, and z* is the target output of the net for input x.

Total Mean Squared Error (TMSE):

TMSE = TSSE / (#training patterns · #output neurons)

The constant factor leads to the same solution as TSSE.
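A sketch of both error measures (NumPy assumed; outputs and targets as (#patterns x #output neurons) arrays, names illustrative):

```python
import numpy as np

def tsse(outputs, targets):
    outputs, targets = np.asarray(outputs), np.asarray(targets)
    return np.sum((outputs - targets) ** 2)        # sum over patterns and output neurons

def tmse(outputs, targets):
    m, k = np.asarray(outputs).shape               # m training patterns, k output neurons
    return tsse(outputs, targets) / (m * k)
```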

Learning algorithms for the Multi-Layer Perceptron (here: 2 layers)

Idea: minimize the error!  f(w_t, u_t) = TSSE → min!

Gradient method:
u_{t+1} = u_t - γ ∇_u f(w_t, u_t)
w_{t+1} = w_t - γ ∇_w f(w_t, u_t)

(Network diagram: inputs x_1, ..., x_n, first-layer weights w_11, ..., w_nm to m hidden neurons, second-layer weights u_1, ..., u_m to the output.)

BUT: f(w, u) cannot be differentiated! Why? Because of the discontinuous activation function a(·) in the neurons, which jumps from 0 to 1 at the threshold θ.

Idea: find a smooth activation function similar to the original one!

Learning algorithms for the Multi-Layer Perceptron (here: 2 layers)

Good idea: a sigmoid activation function (instead of the step function):
- monotone increasing
- differentiable
- non-linear
- output in [0, 1] instead of {0, 1}
- threshold θ integrated in the activation function
- values of the derivative directly determinable from the function values

e.g. the logistic function a(x) = 1 / (1 + e^{-(x - θ)}), with a'(x) = a(x) · (1 - a(x)).
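A sketch of this activation and its derivative property (NumPy assumed; names illustrative):

```python
import numpy as np

def sigmoid(x, theta=0.0):
    return 1.0 / (1.0 + np.exp(-(x - theta)))     # threshold folded into the activation

def sigmoid_derivative(x, theta=0.0):
    a = sigmoid(x, theta)
    return a * (1.0 - a)                          # derivative read off the function value

# numerical spot check of a'(x) = a(x) * (1 - a(x))
h, x = 1e-6, 0.3
assert abs((sigmoid(x + h) - sigmoid(x - h)) / (2 * h) - sigmoid_derivative(x)) < 1e-8
```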

Learning algorithms for the Multi-Layer Perceptron (here: 2 layers)

Gradient method:  f(w_t, u_t) = TSSE
u_{t+1} = u_t - γ ∇_u f(w_t, u_t)
w_{t+1} = w_t - γ ∇_w f(w_t, u_t)

(Network diagram: inputs x_1, ..., x_I; first layer with weights w_ij producing y_1, ..., y_J; second layer with weights u_jk producing z_1, ..., z_K.)

x_i: inputs
y_j = h(·): values after the first layer
z_k = a(·): values after the second layer

Output of neuron j after the 1st layer:

y_j = h( Σ_i w_ij x_i )

Output of neuron k after the 2nd layer:

z_k = a( Σ_j u_jk y_j )

Error of input x: difference between the output z(x) of the net and the target output z* for input x.
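A sketch of this forward pass (NumPy assumed; W has shape I x J, U has shape J x K, both layers use a logistic activation here; names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, W, U, h=sigmoid, a=sigmoid):
    y = h(x @ W)        # y_j = h(sum_i w_ij x_i), values after the first layer
    z = a(y @ U)        # z_k = a(sum_j u_jk y_j), values after the second layer
    return y, z

# toy usage with I = 4 inputs, J = 3 hidden neurons, K = 2 outputs
rng = np.random.default_rng(0)
x = rng.random(4)
W, U = rng.random((4, 3)), rng.random((3, 2))
y, z = forward(x, W, U)
```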

Error for input x and target output z*:

e(x; w, u) = Σ_{k=1}^{K} ( z_k(x) - z*_k )²

Total error for all training patterns (x, z*) ∈ B:

E(w, u) = Σ_{(x, z*) ∈ B} e(x; w, u)   (TSSE)

Gradient of the total error:

∇E(w, u): the vector of partial derivatives w.r.t. the weights u_jk and w_ij.

thus:

∂E/∂u_jk = Σ_{(x, z*) ∈ B} ∂e(x; w, u)/∂u_jk   and   ∂E/∂w_ij = Σ_{(x, z*) ∈ B} ∂e(x; w, u)/∂w_ij

Assume that both a(·) and h(·) are logistic functions, i.e. a(t) = 1 / (1 + e^{-t}),

and therefore a'(t) = a(t) · (1 - a(t)) (and likewise for h).

Chain rule of differential calculus:

( f(g(x)) )' = f'(g(x)) · g'(x)   (outer derivative times inner derivative)

Partial derivative w.r.t. u_jk:

∂e/∂u_jk = 2 (z_k - z*_k) · a'( Σ_j u_jk y_j ) · y_j = 2 (z_k - z*_k) · z_k (1 - z_k) · y_j = δ_k · y_j

with error signal  δ_k = 2 (z_k - z*_k) · z_k (1 - z_k).

Partial derivative w.r.t. w_ij:

∂e/∂w_ij = Σ_k 2 (z_k - z*_k) · z_k (1 - z_k) · u_jk · y_j (1 - y_j) · x_i

(factors reordered)

= ( Σ_k δ_k u_jk ) · y_j (1 - y_j) · x_i = δ_j · x_i

with error signal δ_k from the previous (output) layer and error signal δ_j = y_j (1 - y_j) Σ_k δ_k u_jk from the current layer.
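Putting both partial derivatives together, a sketch of the gradient computation for a single pattern (x, z*) (NumPy, logistic activations in both layers; names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_gradients(x, z_star, W, U):
    y = sigmoid(x @ W)                            # first layer
    z = sigmoid(y @ U)                            # second layer
    delta_k = 2 * (z - z_star) * z * (1 - z)      # error signals of the output layer
    delta_j = (U @ delta_k) * y * (1 - y)         # error signals of the hidden layer
    grad_U = np.outer(y, delta_k)                 # de/du_jk = delta_k * y_j
    grad_W = np.outer(x, delta_j)                 # de/dw_ij = delta_j * x_i
    return grad_W, grad_U
```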

Generalization (> 2 layers)

Let the neural network have L layers S_1, S_2, ..., S_L. Let the neurons of all layers be numbered from 1 to N. All weights w_ij are gathered in a weight matrix W. Let o_j be the output of neuron j (j ∈ S_m means: neuron j is in the m-th layer).

Error signal:

δ_j = 2 (o_j - z*_j) · o_j (1 - o_j)                    if j ∈ S_L (output layer)
δ_j = o_j (1 - o_j) · Σ_{k ∈ S_{m+1}} δ_k w_jk          if j ∈ S_m, m < L (inner layer)

Correction:

w_ij ← w_ij - γ · o_i · δ_j

In case of online learning: correction after each presented test pattern.
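A sketch of this generalized scheme for an arbitrary number of layers (online learning, logistic activations, thresholds omitted for brevity; the list-of-matrices representation and all names are assumptions of this sketch):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_online_step(x, target, Ws, gamma=0.1):
    """One online correction for a single pattern; Ws[m] maps layer m to layer m+1."""
    outputs = [np.asarray(x, dtype=float)]        # outputs o of every layer, input first
    for W in Ws:
        outputs.append(sigmoid(outputs[-1] @ W))  # forward sweep
    o_out = outputs[-1]
    delta = 2 * (o_out - target) * o_out * (1 - o_out)   # error signals of the output layer
    new_Ws = list(Ws)
    for m in reversed(range(len(Ws))):
        new_Ws[m] = Ws[m] - gamma * np.outer(outputs[m], delta)  # w_ij <- w_ij - gamma * o_i * delta_j
        if m > 0:                                 # error signals of the preceding layer
            o = outputs[m]
            delta = (Ws[m] @ delta) * o * (1 - o)
    return new_Ws
```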

The error signal of a neuron in an inner layer is determined by the error signals of all neurons of the subsequent layer and the weights of the associated connections.

⇒ First determine the error signals of the output neurons, then use them to calculate the error signals of the preceding layer, then of the layer before that, and so forth, until the first inner layer is reached.

⇒ Thus, the error is propagated backwards from the output layer to the first inner layer: backpropagation (of error).

Other optimization algorithms are deployable! In addition to backpropagation (gradient descent), also:

Backpropagation with Momentum: take into account also the previous change of the weights:
Δw_ij(t) = -γ · o_i · δ_j + α · Δw_ij(t-1)

QuickProp: assumption: the error function can be approximated locally by a quadratic function; the update rule uses the weights of the last two steps, t-1 and t-2.

Resilient Propagation (RPROP): exploits the sign of the partial derivatives: two times negative or positive → increase the step size; change of sign → reset the last step and decrease the step size. Typical values: factor for decreasing 0.5, factor for increasing 1.2.

Evolutionary algorithms: individual = weight matrix.

More about this later!
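As a rough sketch of the RPROP sign heuristic for a single weight (a simplified variant; the complete algorithm keeps one step size per weight and clips it to [step_min, step_max]; the names and the exact handling of the sign change are assumptions here):

```python
def rprop_step(w, grad, prev_grad, step, eta_minus=0.5, eta_plus=1.2):
    """Return the updated weight, the adapted step size, and the gradient to remember."""
    if grad * prev_grad > 0:          # same sign twice: increase step size
        step *= eta_plus
    elif grad * prev_grad < 0:        # sign change: decrease step size, skip this update
        step *= eta_minus
        return w, step, 0.0           # forget the gradient so the next step proceeds normally
    sign = (grad > 0) - (grad < 0)
    return w - step * sign, step, grad
```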