Lecture 5: Recurrent Neural Networks


Slide 1 (1/25): Lecture 5: Recurrent Neural Networks. Nima Mohajerin, University of Waterloo, WAVE Lab, July 4, 2017.

Slide 2 (2/25): Overview
1. Recap
2. RNN Architectures for Learning Long Term Dependencies
3. Other RNN Architectures
4. System Identification with RNNs

Slide 3 (3/25), Recap: RNNs
RNNs deal with sequential information. RNNs are dynamic systems; frequently their dynamics are represented via state-space equations.

Slide 4 (4/25), Recap: A Simple RNN
A simple RNN in the discrete-time domain:

x(k) = f(A x(k−1) + B u(k) + b_x)
y(k) = g(C x(k) + b_y)

where
x(k) ∈ R^s: RNN state vector; number of states = number of hidden neurons
y(k) ∈ R^n: RNN output vector; number of output neurons = n
u(k) ∈ R^m: input vector to the RNN (independent input)
A ∈ R^{s×s}: state feedback weight matrix
B ∈ R^{s×m}: input weight matrix
b_x ∈ R^s: state bias term
C ∈ R^{n×s}: state-to-output weight matrix
b_y ∈ R^n: output bias
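To make the state-space form concrete, here is a minimal NumPy sketch of one forward pass through this simple RNN; the sizes, and the choice of tanh for f and identity for g, are illustrative assumptions, not from the slides:

```python
import numpy as np

def simple_rnn_forward(U, A, B, C, b_x, b_y, f=np.tanh, g=lambda v: v):
    """Run x(k) = f(A x(k-1) + B u(k) + b_x), y(k) = g(C x(k) + b_y).

    U: (T, m) input sequence; returns states (T, s) and outputs (T, n).
    """
    s = A.shape[0]
    x = np.zeros(s)                      # x(0): initial state
    xs, ys = [], []
    for u in U:                          # u = u(k), k = 1..T
        x = f(A @ x + B @ u + b_x)       # state update
        ys.append(g(C @ x + b_y))        # output equation
        xs.append(x)
    return np.array(xs), np.array(ys)

# Example with s=4 states, m=2 inputs, n=1 output, T=10 steps (arbitrary sizes).
rng = np.random.default_rng(0)
s, m, n, T = 4, 2, 1, 10
A = 0.5 * rng.standard_normal((s, s))
B, C = rng.standard_normal((s, m)), rng.standard_normal((n, s))
b_x, b_y = np.zeros(s), np.zeros(n)
states, outputs = simple_rnn_forward(rng.standard_normal((T, m)), A, B, C, b_x, b_y)
```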

Slide 5 (5/25), Recap: Back-Propagation Through Time
One data sample:
Input: U = [u(k_0+1), u(k_0+2), ..., u(k_0+T)]
Target output: Y_t = [y_t(k_0+1), y_t(k_0+2), ..., y_t(k_0+T)]

SSE cost (per sample):
L = 0.5 Σ_{k=1}^{T} e^T(k_0+k) e(k_0+k) = 0.5 Σ_{k=1}^{T} Σ_{i=1}^{n} ( y_i(k_0+k) − y_{t,i}(k_0+k) )^2

Batch cost (batch size = D):
L = 0.5 Σ_{d=1}^{D} Σ_{k=1}^{T} e_d^T(k_0+k) e_d(k_0+k)
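A direct transcription of the per-sample SSE cost, assuming predictions and targets are stored as (T, n) arrays (names are illustrative):

```python
import numpy as np

def sse_cost(Y, Y_t):
    """L = 0.5 * sum_k e(k)^T e(k), with e(k) = y(k) - y_t(k)."""
    E = Y - Y_t                # (T, n) error sequence
    return 0.5 * np.sum(E * E)
```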

Slide 6 (6/25), Recap: Gradients
To perform a derivative-based optimization we need the gradient of L, e.g. with respect to an entry a_ij of A:

∂L/∂a_ij = Σ_{k=1}^{T} e^T(k_0+k) ∂e(k_0+k)/∂a_ij = Σ_{k=1}^{T} e^T(k_0+k) ∂y(k_0+k)/∂a_ij

∂y(k)/∂a_ij = ∂g(C x(k) + b_y)/∂a_ij = ( C ∂x(k)/∂a_ij ) ⊙ g′(C x(k) + b_y)

∂x(k)/∂a_ij = ∂v(k)/∂a_ij ⊙ f′(v(k)),   v(k) = A x(k−1) + B u(k) + b_x

∂v(k)/∂a_ij = (∂A/∂a_ij) x(k−1) + A ∂x(k−1)/∂a_ij = [0, ..., x_j(k−1), ..., 0]^T + A ∂x(k−1)/∂a_ij

where [0, ..., x_j(k−1), ..., 0]^T is an s×1 vector with x_j(k−1) in its i-th entry.
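The recursion above can be coded directly. Below is a sketch that accumulates ∂L/∂a_ij for a single entry of A by propagating the state sensitivity forward in time, assuming f = tanh and g = identity as in the forward-pass sketch:

```python
import numpy as np

def dL_dA_entry(U, Y_t, A, B, C, b_x, b_y, i, j):
    """Gradient of the SSE cost w.r.t. the single weight a_ij of A,
    using the forward sensitivity recursion from the slide.
    Assumes f = tanh (so f'(v) = 1 - tanh(v)^2) and g = identity."""
    s = A.shape[0]
    x = np.zeros(s)          # x(0)
    dx = np.zeros(s)         # dx(k)/da_ij, zero at k = 0
    grad = 0.0
    for u, y_t in zip(U, Y_t):
        v = A @ x + B @ u + b_x
        dv = A @ dx
        dv[i] += x[j]                    # (dA/da_ij) x(k-1): x_j(k-1) in row i
        x_new = np.tanh(v)
        dx = (1.0 - x_new**2) * dv       # f'(v) ⊙ dv(k)/da_ij
        y = C @ x_new + b_y              # g = identity
        dy = C @ dx                      # dy(k)/da_ij
        grad += (y - y_t) @ dy           # e(k)^T dy(k)/da_ij
        x = x_new
    return grad
```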

Slide 7 (7/25): Section 2: RNN Architectures for Learning Long Term Dependencies

Slide 8 (8/25): Gated Architectures
g(x) = x ⊙ σ(A x + b), where ⊙ denotes element-wise multiplication
x ∈ R^n, g(x) ∈ R^n, A ∈ R^{n×n}

Slide 9 (9/25): Preserve Information with Gated Architectures
The idea is that if a neuron has a self-feedback connection with weight equal to one, its information is retained for an infinite amount of time when the network is unfolded. Some information should decay; some should not. With a gate, the intention is to control the self-feedback weight.

Slide 10 (10/25): Gated Recurrent Unit

g_f(k) = σ( W_f^i u(k) + W_f^o y_h(k−1) )
g_i(k) = σ( W_i^i u(k) + W_i^o y_h(k−1) )
m(k) = h( W_y^i u(k) + W_y^y (g_f(k) ⊙ y_h(k−1)) )
y_h(k) = g_i(k) ⊙ y_h(k−1) + (1 − g_i(k)) ⊙ m(k)
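A minimal sketch of one GRU step as written on the slide, assuming h = tanh; the weight-matrix names mirror the superscripts above and are otherwise arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(u, y_prev, W_fi, W_fo, W_ii, W_io, W_yi, W_yy):
    """One GRU step: g_f gates the fed-back state inside the candidate m(k);
    g_i interpolates between the old state and the candidate."""
    g_f = sigmoid(W_fi @ u + W_fo @ y_prev)           # reset-style gate
    g_i = sigmoid(W_ii @ u + W_io @ y_prev)           # update-style gate
    m = np.tanh(W_yi @ u + W_yy @ (g_f * y_prev))     # candidate state m(k)
    return g_i * y_prev + (1.0 - g_i) * m             # new hidden state y_h(k)
```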

Slide 11 (11/25): Long Short-Term Memory Cell

g_i(k) = σ( W_i^i u(k) + W_i^o y_h(k−1) + W_i^c c(k−1) + b_i )
g_f(k) = σ( W_f^i u(k) + W_f^o y_h(k−1) + W_f^c c(k−1) + b_f )
g_o(k) = σ( W_o^i u(k) + W_o^o y_h(k−1) + W_o^c c(k−1) + b_o )
c(k) = g_i(k) ⊙ f( W_c^i u(k) + W_c^o y_h(k−1) ) + g_f(k) ⊙ c(k−1)
m(k) = c(k) ⊙ g_o(k)
y_h(k) = h( W_y m(k−1) + b_y )
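And the corresponding sketch of one LSTM cell step, again assuming f = h = tanh; W and b are dictionaries of the slide's weight matrices and biases (illustrative naming). The external output y_h(k) = h(W_y m(k−1) + b_y) would then be computed from m by the output layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(u, y_prev, c_prev, W, b):
    """One step of the peephole LSTM cell from the slide."""
    g_i = sigmoid(W['ii'] @ u + W['io'] @ y_prev + W['ic'] @ c_prev + b['i'])  # input gate
    g_f = sigmoid(W['fi'] @ u + W['fo'] @ y_prev + W['fc'] @ c_prev + b['f'])  # forget gate
    g_o = sigmoid(W['oi'] @ u + W['oo'] @ y_prev + W['oc'] @ c_prev + b['o'])  # output gate
    c = g_i * np.tanh(W['ci'] @ u + W['co'] @ y_prev) + g_f * c_prev           # cell state c(k)
    m = c * g_o                                                                # gated cell output m(k)
    return m, c
```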

Slide 12 (12/25): Learning Long-Term Dependencies
One way to avoid exploding gradients is to clip the gradient:
if ‖g‖ > v:  g ← g·v/‖g‖
One way to address vanishing gradients is to use a regularizer that maintains the magnitude of the gradient vector as it propagates back through the states (Pascanu et al., 2013):

Ω = Σ_k ( ‖ (∂L/∂x(k)) (∂x(k)/∂x(k−1)) ‖ / ‖ ∂L/∂x(k) ‖ − 1 )^2
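Norm clipping is a one-liner; a sketch assuming the gradient has been flattened into a single NumPy vector:

```python
import numpy as np

def clip_gradient(g, v):
    """Rescale g to norm v when ||g|| exceeds the threshold v, else leave it."""
    norm = np.linalg.norm(g)
    return g * (v / norm) if norm > v else g
```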

Slide 13 (13/25): Section 3: Other RNN Architectures

Slide 14 (14/25): RNNs as Associative Memories
An RNN is a nonlinear, potentially chaotic system and can have many attractors in its phase space. The Hopfield (1985) model is the most popular one: a fully connected recurrent model whose feedback weight matrix is symmetric with zero diagonal elements. The Hopfield model is stable in the Lyapunov sense if the output neurons are updated one at a time. (Refer to Du KL, Swamy MNS (2006), Neural Networks in a Softcomputing Framework, for further discussion.)

Slide 15 (15/25): RNNs as Associative Memories (figure only; not transcribed)

Slide 16 (16/25): RNNs as Associative Memories (figure only; not transcribed)

Slide 17 (17/25): Deep RNNs (figure only; not transcribed)

Slide 18 (18/25): Deep RNNs
We can generalize this idea and create a connection matrix:

Slide 19 (19/25): Deep RNNs
Example: a block connection matrix C with rows indexed by source (from input, from L1, from L2, from L3) and columns indexed by destination (to L1, to L2, to L3, to output). (The matrix entries were shown as a figure.)

Slide 20 (20/25): Reservoir Computing
One approach to cope with the difficulty of training RNNs: use a very large RNN as a reservoir that transforms the input. The transformed input is then linearly combined to form the output. Only the linear output weights are trained, while the reservoir (RNN) weights stay fixed. Echo State Networks use continuous output neurons; Liquid State Machines use spiking (binary) neurons.

Slide 21 (21/25): Reservoir Computing
How should the reservoir weights be set? Set them so that the RNN is at the edge of stability: place the eigenvalues of the state Jacobian close to one,

J(k) = ∂x(k)/∂x(k−1)
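A compact echo-state-network sketch under this recipe: the reservoir is random and fixed, rescaled to a spectral radius just below one, and only the linear readout is fit. The sizes and the plain least-squares readout are illustrative choices (ridge regression would be common in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
s, m = 200, 2                               # reservoir size, input size (arbitrary)

# Random reservoir, rescaled so its spectral radius is just under one,
# i.e. the network sits near the edge of stability.
A = rng.standard_normal((s, s))
A *= 0.95 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.standard_normal((s, m))

def reservoir_states(U):
    """Collect reservoir states for an input sequence U of shape (T, m)."""
    x, X = np.zeros(s), []
    for u in U:
        x = np.tanh(A @ x + B @ u)          # fixed, untrained dynamics
        X.append(x)
    return np.array(X)

# Only the linear readout is trained, here by least squares on toy data.
U, Y_t = rng.standard_normal((500, m)), rng.standard_normal((500, 1))
X = reservoir_states(U)
W_out, *_ = np.linalg.lstsq(X, Y_t, rcond=None)
Y = X @ W_out                               # reservoir output
```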

Slide 22 (22/25): Section 4: System Identification with RNNs

Slide 23 (23/25): Reconstructing the System States
We have a set of observations, i.e., measurements of a dynamic system's inputs and outputs (states), and we want to learn the system dynamics. RNNs are universal approximators for dynamic systems (K. Funahashi and Y. Nakamura, 1993). The delay embedding theorem (Takens' theorem) states that a chaotic dynamical system can be reconstructed from a sequence of observations of the system. It leads to Auto-Regressive with eXogenous inputs (ARX) models.

Slide 24 (24/25): Nonlinear Auto-Regressive with eXogenous Inputs (NARX)

y(k) = F( u(k), u(k−1), ..., u(k−d_x), y(k−1), ..., y(k−d_y) )

F can be constructed using a neural network; typically an MLP is used.
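A sketch of how such a NARX predictor is used, with F standing in for any trained regressor such as an MLP; the helper names, the delay handling, and the scalar-output simplification are assumptions for illustration:

```python
import numpy as np

def narx_predict(F, u_hist, y_hist):
    """One NARX prediction y(k) = F(u(k), ..., u(k-d_x), y(k-1), ..., y(k-d_y)).
    u_hist holds u(k-d_x)..u(k); y_hist holds y(k-d_y)..y(k-1)."""
    phi = np.concatenate([np.ravel(u_hist), np.ravel(y_hist)])  # regressor vector
    return F(phi)

def free_run(F, U, d_x, d_y, y_init):
    """Multi-step simulation: the model's own past predictions are fed back.
    Scalar outputs are assumed; y_init supplies the first d_y of them."""
    Y = list(y_init)
    for k in range(d_x, len(U)):
        Y.append(narx_predict(F, U[k - d_x:k + 1], Y[-d_y:]))
    return np.array(Y)
```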

Slide 25 (25/25): Teacher Forcing (Parallel Training)
Teacher forcing is mainly used in NARX architectures: substitute the network's past predictions with the targets,

y(k) = F( u(k), u(k−1), ..., u(k−d_x), y_t(k−1), ..., y_t(k−d_y) )

This converts the RNN into a feed-forward network (single-step prediction).
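A sketch of how teacher forcing turns training into a single-step, feed-forward problem: the regressor vectors are built from past targets y_t rather than past predictions (array shapes and names are assumptions):

```python
import numpy as np

def teacher_forced_dataset(U, Y_t, d_x, d_y):
    """Build (regressor, target) pairs for single-step training: past *target*
    outputs replace past predictions, so the NARX model trains like a
    feed-forward network. U is (T, m), Y_t is (T, n)."""
    X, Y = [], []
    for k in range(max(d_x, d_y), len(U)):
        phi = np.concatenate([U[k - d_x:k + 1].ravel(),   # u(k-d_x)..u(k)
                              Y_t[k - d_y:k].ravel()])    # y_t(k-d_y)..y_t(k-1)
        X.append(phi)
        Y.append(Y_t[k])
    return np.array(X), np.array(Y)
```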
