Christian Mohr 20.12.2011

Recurrent Networks. Networks in which units may have connections to units in the same or preceding layers; connections from a unit to itself are also possible. Already covered: Hopfield nets (not general RNNs) and Boltzmann machines.

Recurrent Networks. Arbitrary connection weights are allowed. As seen in Hopfield nets and Boltzmann machines, symmetric weights guarantee that the network settles down to a stable state. Nets with non-symmetric weights are not necessarily unstable, and unstable nets can be used to recognize or reproduce time sequences.

Recurrent Back-Propagation. Back-propagation can be extended to arbitrary networks provided they converge to stable states (we use continuous-valued units, so these states are called stable points). A modified version of the network itself is used to calculate the weight updates.

Recurrent Back-Propagation. Consider N continuous-valued units V_i with weights w_ij and activation function g(h). Some units are input units whose input ξ_i^µ is specified by pattern µ; all other units receive input 0. Some units are output units, with target (desired) output values denoted ζ_i^µ.
[Figure: example net with units V_1 ... V_5, inputs ξ_1 ... ξ_4 and outputs ζ_1, ζ_2.]

Recurrent Back-Propagation. Consider the net to evolve according to the rule

\tau \frac{dV_i}{dt} = -V_i + g\Big(\sum_j w_{ij} V_j + \xi_i\Big)

which is the differential equation for continuous-valued nets that also update continuously. Stable points are where dV_i/dt = 0, i.e. where

V_i = g\Big(\sum_j w_{ij} V_j + \xi_i\Big)

Recurrent Back-Propagation. We assume at least one stable point. As error measure we use

E = \frac{1}{2} \sum_k E_k^2, \qquad E_k = \begin{cases} \zeta_k - V_k & \text{if } k \text{ is an output unit} \\ 0 & \text{otherwise} \end{cases}

Recurrent Back-Propagation. The delta rule for recurrent back-propagation is

\Delta w_{ij} = \eta \, g'(h_i) \, Y_i V_j

where h_i is the net input to unit i and g' is the derivative of g. To find the Y_i, the following differential equation has to be solved:

\tau \frac{dY_i}{dt} = -Y_i + \sum_j g'(h_j) \, w_{ji} Y_j + E_i

which can be done by evolving a new error-propagation network.

Recurrent Back-Propagation. The error-propagation network has the same topology as the original net. Weights: w̃_ij = g'(h_i) w_ij, used in the transposed direction. Transfer function: g(x) = x (linear). Inputs: the errors E_i of the units i in the original network.
[Figure: error-propagation net with units Y_1 ... Y_5 and error inputs E_1, E_2.]

Recurrent Back-Propagation. Training procedure:
1. Relax the original net to find the V_i.
2. Compute the errors E_i.
3. Relax the error-propagation network to find the Y_i.
4. Update the weights using Δw_ij = η g'(h_i) Y_i V_j.
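
To make the procedure concrete, here is a minimal NumPy sketch of one recurrent back-propagation training step, assuming tanh units and simple Euler relaxation of both networks; the helper names, step sizes, and iteration counts are illustrative and not part of the original algorithm description.

```python
import numpy as np

def g(h):            # activation function
    return np.tanh(h)

def g_prime(h):      # its derivative
    return 1.0 - np.tanh(h) ** 2

def relax_forward(w, xi, steps=200, dt=0.1):
    """Relax tau dV/dt = -V + g(W V + xi) to a fixed point (Euler steps)."""
    V = np.zeros(len(xi))
    for _ in range(steps):
        h = w @ V + xi
        V += dt * (-V + g(h))
    return V, w @ V + xi                        # stable V_i and net inputs h_i

def relax_error(w, h, E, steps=200, dt=0.1):
    """Relax the linear error net tau dY_i/dt = -Y_i + sum_j g'(h_j) w_ji Y_j + E_i."""
    Y = np.zeros(len(E))
    for _ in range(steps):
        Y += dt * (-Y + (g_prime(h) * Y) @ w + E)
    return Y

def rbp_step(w, xi, zeta, output_mask, eta=0.05):
    """One training step for one pattern (xi: external inputs, zeta: targets)."""
    V, h = relax_forward(w, xi)                 # 1. relax original net
    E = np.where(output_mask, zeta - V, 0.0)    # 2. errors E_k
    Y = relax_error(w, h, E)                    # 3. relax error-propagation net
    w += eta * np.outer(g_prime(h) * Y, V)      # 4. delta rule: dw_ij = eta g'(h_i) Y_i V_j
    return w
```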

Recurrent Back-Propagation. Recurrent back-propagation scales as N² for a net with N units. The use of recurrent nets gives a large improvement in performance over normal feed-forward nets for a number of problems, such as pattern completion.

Temporal Sequences of Patterns. So far we have considered only fixed patterns. Extension: sequences of patterns. There is no stable state; instead the net goes through a predetermined sequence of states.

Temporal Sequences of Patterns. Simple example of sequence generation: synchronous updating and equal weights; just turn on the first unit. Only simple sequences are possible and the scheme is not very robust.
[Figure: the pattern of active units at times t_1, t_2, t_3.]

Temporal Sequences of Patterns. For more arbitrary sequences with asynchronous updating we need asymmetric connections. Instead of using

w_{ij} = \frac{1}{N} \sum_\mu \xi_i^\mu \xi_j^\mu

add an additional term:

w_{ij} = \frac{1}{N} \sum_\mu \xi_i^\mu \xi_j^\mu + \frac{\lambda}{N} \sum_\mu \xi_i^{\mu+1} \xi_j^\mu

The additional term uses information from the next pattern.
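
As a small illustration of these weight formulas, here is a sketch assuming bipolar patterns stored row-wise in an array xi and used cyclically; the function name is hypothetical.

```python
import numpy as np

def sequence_weights(xi, lam=1.0):
    """Symmetric Hebb term plus asymmetric term pointing to the next pattern.
    xi: array of shape (p, N) with bipolar (+1/-1) patterns, used cyclically."""
    p, N = xi.shape
    w_sym = xi.T @ xi / N                         # (1/N) sum_mu xi_i^mu xi_j^mu
    w_seq = np.roll(xi, -1, axis=0).T @ xi / N    # (1/N) sum_mu xi_i^{mu+1} xi_j^mu
    return w_sym + lam * w_seq
```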

Temporal Sequences of Patterns. Asynchronous updating tends to dephase the system: the net reaches states that overlap several consecutive patterns, so the method is only usable for short sequences. Possible solution: fast and slow connections,

h_i(t) = \sum_j w_{ij}^S V_j(t) + \lambda \sum_j w_{ij}^L \bar V_j(t)

where w^S are the fast and w^L the slow connections; the slow ones effectively act on a delayed, time-averaged activity \bar V_j.

Conclusion: Temporal Sequences of Patterns. Neither feed-forward nets nor nets with symmetric weights can produce pattern sequences. Moreover, the sequences above are calculated (the weights are set by a fixed rule), not learned. How can a recurrent net learn a pattern sequence?

Learning Time Sequences. Three distinct tasks:
Sequence recognition: produce an output pattern from an input pattern sequence.
Sequence reproduction: pattern completion with dynamic patterns.
Temporal association: produce an output pattern sequence from an input pattern sequence.

Tapped Delay Lines. The easiest way of doing sequence recognition (not recurrent): turn the time pattern into a spatial pattern, so that several time steps are presented to the net simultaneously (see the sketch below).
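
A minimal sketch of the idea, assuming a one-dimensional signal and an illustrative window length D; each window is the spatial pattern presented to an ordinary feed-forward net.

```python
import numpy as np

def tapped_delay_windows(signal, D):
    """Turn a 1-D time signal into overlapping windows of the D most recent samples,
    so each window can be fed to a feed-forward net as a spatial pattern."""
    return np.stack([signal[t - D:t] for t in range(D, len(signal) + 1)])

x = np.sin(np.linspace(0, 10, 50))
windows = tapped_delay_windows(x, D=5)   # shape (46, 5): one row per time step
```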

Tapped Delay Lines. Widely applied to speech recognition tasks: Time-Delay Neural Networks [Waibel et al. 89]. Vectors of spectral coefficients serve as the time signal (a two-dimensional time-frequency plane); the network is trained on the spectra of phonemes.

Tapped Delay Lines. The general approach has several drawbacks for sequence recognition: the maximum length of a possible sequence has to be chosen in advance; the high number of units makes computation slow; and the timing has to be very accurate.

Tapped Delay Lines. Solution to the timing problem: use filters that broaden the time signal instead of fixed delays; the longer the delay, the broader the filter.

Context Units. Partially recurrent networks: mainly feed-forward, but with a carefully chosen set of feedback connections. The feedback weights are mostly fixed, so training is not complicated. Updating is synchronous (one update for all units at each discrete time step). Also referred to as sequential units.

Context Units. Different architectures exist, with all or part of a layer consisting of context units. Context units receive some signals from the net at time t and feed them back at t+1, so the net remembers some aspects of previous time steps: the state of the net depends on past states as well as the current input. The net can recognize sequences based on its state at the end of a sequence (and in some cases generate them).

Context Units: Elman Nets. [Figure: input, hidden, and output layers; the context units receive a fixed copy (weight 1) of the hidden layer and feed back into it.] The context units hold a copy of the output of the hidden layer. All modifiable connections are feed-forward (trained with back-propagation). Usable for recognition and short continuations; can also mimic finite state machines.
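
A minimal sketch of one Elman step, assuming tanh hidden units; the class name, layer sizes, and initialization are illustrative. The context is simply a copy of the previous hidden activation; only the forward pass is shown, the modifiable weights would be trained with ordinary back-propagation.

```python
import numpy as np

class ElmanNet:
    """Minimal Elman network: the hidden layer sees the input plus a copy of its
    own previous activation (the context units)."""
    def __init__(self, n_in, n_hidden, n_out, rng=np.random.default_rng(0)):
        self.W_in  = rng.normal(0, 0.1, (n_hidden, n_in))
        self.W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))
        self.context = np.zeros(n_hidden)

    def step(self, x):
        h = np.tanh(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h.copy()          # context = copy of hidden layer at t, used at t+1
        return self.W_out @ h

net = ElmanNet(n_in=3, n_hidden=5, n_out=2)
outputs = [net.step(x) for x in np.random.default_rng(1).normal(size=(10, 3))]
```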

Context Units: Jordan Nets. [Figure: input, hidden, and output layers; the context units receive the output layer with weight 1 and themselves with weight α.] The context units hold a copy of the output layer and of themselves. The self-connection gives each context unit an individual memory or inertia; with fixed outputs, the context activity decays exponentially (decay units).

Context Units: Jordan Nets. The context units compute a weighted moving average, or trace, of the outputs:

C_i(t+1) = \alpha C_i(t) + O_i(t) = O_i(t) + \alpha O_i(t-1) + \dots = \sum_{t'=0}^{t} \alpha^{\,t-t'} O_i(t') \approx \int_0^t e^{(\log\alpha)(t-t')} O_i(t')\,dt' \quad \text{(continuous-time limit)}
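
A tiny sketch verifying that the recursive update and the explicit weighted sum agree; α and the output sequence are arbitrary illustrative values.

```python
import numpy as np

alpha, O = 0.7, np.array([1.0, 0.5, -0.2, 0.8])   # output of one unit over time

C = 0.0
for o in O:                   # recursive form: C(t+1) = alpha*C(t) + O(t)
    C = alpha * C + o

trace = sum(alpha ** (len(O) - 1 - t) * O[t] for t in range(len(O)))  # explicit sum
assert np.isclose(C, trace)
```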

Context Units: Jordan Nets. Usable for generating a set of sequences with different fixed inputs, and for recognizing different input sequences.

Context Units: More Architectures. [Figures: two variants in which the input reaches the net only via the context units, one with a single feedback parameter α and one with individual parameters α_i.] A context unit with self-connection α acts as an IIR filter with transfer function

\frac{1}{1 - \alpha z^{-1}}

Context Units: More Architectures. [Figures: the same two variants; the one on the right has modifiable feedback weights.] The variant with modifiable feedback weights is comparable to real-time recurrent networks. Both work better on recognition than on generation or reconstruction.

Back-Propagation Through Time. Use fully connected units (each unit also connects to itself). Units are updated synchronously. Every unit may be an input, an output, both, or neither.
[Figure: two units V_1, V_2 with weights w_11, w_12, w_21, w_22 and external inputs ξ_1(t), ξ_2(t).]

Back-Propagation Through Time. At each time step all units are updated synchronously, depending on the current input and the state of the net. The final output is the classification result. How can the weights be trained?

Back-Propagation Through Time. For a time sequence of length T, copy all units T times (unroll the net in time). [Figure: the two-unit net unrolled over several time steps, with the same weights w_11, w_12, w_21, w_22 between consecutive copies.] The unrolled net behaves identically to the recurrent net for T time steps, and the weights must be the same at every step (i.e. time independent).

Back-Propagation Through Time. Train the weights on the equivalent feed-forward net and use them in the recurrent network. Problem: back-propagation would not produce equal weights for each time step. Solution: add up all the Δw_ij and change all copies of the same weight by the same amount (see the sketch below).
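
A minimal sketch of back-propagation through time for a small fully connected net like the one above, assuming tanh units and a target given only at the final time step; the gradient contributions of all copies of each weight are summed and applied once. Names and the learning rate are illustrative.

```python
import numpy as np

def bptt_step(w, xi_seq, target, eta=0.1):
    """One BPTT update for a small fully connected net V(t) = tanh(w V(t-1) + xi(t-1)).
    xi_seq: external inputs of shape (T, N); target: desired V at the final step."""
    T, N = xi_seq.shape
    V = np.zeros((T + 1, N))
    for t in range(T):                                   # forward: unroll T copies
        V[t + 1] = np.tanh(w @ V[t] + xi_seq[t])

    dw = np.zeros_like(w)
    delta = (V[T] - target) * (1.0 - V[T] ** 2)          # error at the last copy
    for t in reversed(range(T)):                         # backward through the copies
        dw += np.outer(delta, V[t])                      # same weight, copy at step t
        if t > 0:
            delta = (w.T @ delta) * (1.0 - V[t] ** 2)
    return w - eta * dw                                  # apply the summed change once
```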

Back-Propagation Through Time. Needs a great deal of resources in training, and also a lot of training data; it is completely impractical for long or even unknown sequence lengths. Largely superseded by the other approaches.

Real-Time Recurrent Learning. A learning rule for pattern sequences in recurrent networks (cf. recurrent back-propagation). One version can be run online (while sequences are presented, not after they are finished) and can deal with sequences of arbitrary length.

Real-Time Recurrent Learning. Assume the same dynamics as in back-propagation through time:

V_i(t) = g(h_i(t-1)) = g\Big(\sum_j w_{ij} V_j(t-1) + \xi_i(t-1)\Big)

With target outputs ζ_k(t) for some units at some time steps, we get the error measure

E_k(t) = \begin{cases} \zeta_k(t) - V_k(t) & \text{if } k \text{ is an output unit at time } t \\ 0 & \text{otherwise} \end{cases}

Real-Time Recurrent Learning. The cost function is the sum of the per-time-step cost over all time steps,

E = \sum_{t=0}^{T} E(t) = \frac{1}{2} \sum_{t=0}^{T} \sum_k E_k(t)^2

Due to the time dependency we get a time-dependent Δw_ij:

\Delta w_{ij}(t) = \eta \sum_k E_k(t) \frac{\partial V_k(t)}{\partial w_{ij}}, \qquad \frac{\partial V_k(t)}{\partial w_{ij}} = g'(h_k(t-1)) \Big[ \delta_{ki} V_j(t-1) + \sum_p w_{kp} \frac{\partial V_p(t-1)}{\partial w_{ij}} \Big]

Real-Time Recurrent Learning. There are no stable points in general, so the derivatives depend on the derivatives at the preceding time step:

\frac{\partial V_k(t)}{\partial w_{ij}} = g'(h_k(t-1)) \Big[ \delta_{ki} V_j(t-1) + \sum_p w_{kp} \frac{\partial V_p(t-1)}{\partial w_{ij}} \Big]

But the net is time-discrete, so we can calculate the derivatives iteratively; we just need the initial condition

\frac{\partial V_k(0)}{\partial w_{ij}} = 0

Real-Time Recurrent Learning. Since all derivatives can be computed iteratively, the time-dependent Δw_ij(t) can be found: just iterate through all time steps, sum up all partial weight changes to get the total change, and repeat until the net remembers the correct sequence.

Real-Time Recurrent Learning. The algorithm needs a great deal of computation time and memory: for N fully recurrent units there are N³ derivatives to be maintained, and updating each is proportional to N, so the algorithm's complexity is of order N⁴ per time step. However, if η is small the weights can be updated after each time step; this online variant is what gives Real-Time Recurrent Learning its name (see the sketch below).
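
A minimal sketch of the online RTRL update, assuming tanh units. It maintains the table P[k, i, j] = ∂V_k/∂w_ij (N³ entries) and updates the weights after every time step; P should be initialized to zeros before the first step. Names and the learning rate are illustrative.

```python
import numpy as np

def rtrl_step(w, V, P, xi, zeta, output_mask, eta=0.01):
    """One online RTRL step.
    w: (N, N) weights, V: activations at time t-1, P[k, i, j] = dV_k/dw_ij at t-1,
    xi: external input, zeta: targets, output_mask: which units have targets now."""
    N = len(V)
    h = w @ V + xi
    V_new = np.tanh(h)
    gp = 1.0 - V_new ** 2                               # g'(h_k)

    # P_new[k,i,j] = g'(h_k) * (delta_ki * V_j + sum_p w_kp * P[p,i,j])
    P_new = np.einsum('kp,pij->kij', w, P)
    P_new[np.arange(N), np.arange(N), :] += V           # delta_ki V_j term
    P_new *= gp[:, None, None]

    E = np.where(output_mask, zeta - V_new, 0.0)        # errors at this time step
    w = w + eta * np.einsum('k,kij->ij', E, P_new)      # dw_ij = eta sum_k E_k dV_k/dw_ij
    return w, V_new, P_new
```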

Real-Time Recurrent Learning. Works well for sequence recognition, but simpler nets can do that as well. It can learn a flip-flop task (output a signal only after a symbol A has occurred and until another symbol B occurs) and can learn finite state machines. With some modifications (teacher forcing) the algorithm can be used to train a square-wave or sine-wave oscillator.

Time-Dependent Recurrent Back-Propagation. A related algorithm for time-continuous recurrent nets with dynamics

\tau_i \frac{dV_i}{dt} = -V_i + g\Big(\sum_j w_{ij} V_j + \xi_i(t)\Big)

The sum over time steps in the error function becomes an integral,

E = \frac{1}{2} \int_0^T \sum_k \big[V_k(t) - \zeta_k(t)\big]^2 \, dt

and again there is a second differential equation, for the error signals Y_i:

\frac{dY_i}{dt} = \frac{1}{\tau_i} Y_i - \sum_j \frac{1}{\tau_j} w_{ji}\, g'(h_j)\, Y_j + E_i(t)

Time-Dependent Recurrent Back-Propagation. Integrate the differential equation of the original net forward from t = 0 to T to get the V_i. Integrate the second differential equation backward from t = T to 0 to get the Y_i. Obtain the weight changes from Δw_ij = -η ∂E/∂w_ij, since

\frac{\partial E}{\partial w_{ij}} = \frac{1}{\tau_i} \int_0^T Y_i \, g'(h_i) \, V_j \, dt

Time-Dependent Recurrent Back-Propagation. This was used to train a net with two outputs to follow a two-dimensional trajectory, including circles and figure eights; the net was able to return to the trajectory even after a disturbance. It is the best approach unless online learning is needed.

Radial Basis Function Networks. Networks that use radial basis functions as activation functions. Used for function approximation, time-series prediction, and control. They typically have a hidden layer with radial basis functions as activation functions and a linear output layer.

Models for Function Approximation. Train a model to approximate a function f(x) by a linear combination of a set of fixed functions (basis functions):

f(x) = \sum_{j=1}^{m} w_j h_j(x)

The model is linear if the parameters of the basis functions are fixed and only the linear parameters w (the weights) are trained.

Radial Basis Functions. Functions whose response decreases monotonically with the distance from a central point. Most common is the Gaussian

h(x) = e^{-\|x - \mu\|^2 / \sigma^2}

with center (mean) µ and width (variance) σ².

Radial Basis Functions. They lead to a hyperelliptic decision surface instead of a hyperplane, so we get local units:

y_j = \tanh\Big(\sum_i w_{ij} x_i\Big) \qquad \text{vs.} \qquad y_j = e^{-\|x - \mu_j\|^2 / \sigma_j^2}
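
A tiny sketch contrasting the two unit types; the weights, center, and width are arbitrary illustrative values. The tanh unit still responds strongly far from the training region along its weight direction, while the RBF unit responds only near its center.

```python
import numpy as np

w, mu, sigma = np.array([1.0, -0.5]), np.array([0.2, 0.3]), 0.5

def sigmoid_unit(x):                     # y = tanh(sum_i w_i x_i): global response
    return np.tanh(w @ x)

def rbf_unit(x):                         # y = exp(-||x - mu||^2 / sigma^2): local response
    return np.exp(-np.sum((x - mu) ** 2) / sigma ** 2)

print(sigmoid_unit(np.array([5.0, 5.0])), rbf_unit(np.array([5.0, 5.0])))  # far input: ~1 vs ~0
```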

Local Activation with Linear Units. Possible, but it requires a multilayer perceptron.

Tile the Input Space. Receptive fields overlap a bit, so there is usually more than one unit active. But for a given input, the total number of active units will be small. The locality property of RBFs makes them similar to Parzen windows.

Training RBF Nets. The training scheme for RBF nets is hybrid:
Use unsupervised learning for the center points (and perhaps also the variances): either the k-means algorithm, initialized with randomly chosen points from the training set, or a Kohonen SOFM (Self-Organizing Feature Map) to map the space, taking selected units' weight vectors as the RBF centers.
Use the Least Mean Square algorithm to train the output weights.
The first step can be skipped if the input space is tiled equidistantly and the variances are fixed, but then the number of units may be unnecessarily high. A sketch of the hybrid scheme follows below.
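
A minimal sketch of this hybrid scheme, assuming a hand-rolled k-means for the centers (initialized from randomly chosen training points, as above), a single fixed width σ, and a direct least-squares solve in place of an iterative LMS loop; all names are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=50, rng=np.random.default_rng(0)):
    """Plain k-means, initialized from randomly chosen training points."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

def train_rbf(X, y, k=10, sigma=1.0):
    """Hybrid RBF training: unsupervised centers, then linear least squares
    for the output weights (in place of an iterative LMS loop)."""
    centers = kmeans(X, k)
    Phi = np.exp(-((X[:, None, :] - centers[None]) ** 2).sum(-1) / sigma ** 2)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # output weights
    return centers, w

def predict_rbf(X, centers, w, sigma=1.0):
    Phi = np.exp(-((X[:, None, :] - centers[None]) ** 2).sum(-1) / sigma ** 2)
    return Phi @ w

# toy usage: approximate a 1-D function
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(2 * X[:, 0])
centers, w = train_rbf(X, y, k=15, sigma=0.6)
preds = predict_rbf(X, centers, w, sigma=0.6)
```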

Example I: Input Space

Example I: While Training

Example I: After Training

Example II: Input Space

Example II: After Training

RBF Nets. In low dimensions we could also use a Parzen window (for classification) or a table-lookup interpolation scheme, but in higher dimensions RBF nets are much better, since units can be placed only where they are needed.