Dominick McCormick
5 years ago
1 Class 15 - Long Short-Term Memory (LSTM) Study materials: Chapter 10 of textbook

2 RNN concepts Based on Christopher Olah's blog on LSTM. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence. For example, imagine you want to classify what kind of event is happening at every point in a movie. It's unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

3 RNN concepts That's why RNNs have loops, allowing information to persist.

4 In the diagram, A is a neural network that takes an input $x_t$ at time $t$ and outputs $h_t$. A loop allows information to be passed from one step of the network to the next.

5 RNN concepts A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

6 The problem of long-term dependencies One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. Sometimes, we only need to look at recent information to perform the present task.

7 The problem of long-term dependencies I was up all night wondering where the Sun had gone. Then it dawned on me. Sky is

8 The problem of long-term dependencies Answer to the life, the universe and everything is _

9 The problem of long-term dependencies Due to poor grades in high school, Steven Spielberg was rejected from the University of Southern California three times. He was awarded an honorary degree in 1994 and became a trustee of the university in "Since 1980, I've been trying to be associated with this school," joked the 62-year-old filmmaker. "I eventually had to buy my way in," he told the Los Angeles Times. Spielberg has to date directed 51 films and won three Oscars. Forbes Magazine puts Spielberg's wealth at $3 billion. He is

10 The problem of long-term dependencies In such cases, where the gap between the relevant information and the place that it's needed is small, RNNs can learn to use the past information.

11 The problem of long-term dependencies Unfortunately, as that gap grows, RNNs become unable to learn to connect the information, because of the "vanishing gradient" problem. LSTMs are designed to address this problem.
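
To make the "vanishing gradient" idea concrete, here is a minimal sketch using a scalar RNN $s_t = \tanh(u x_t + w s_{t-1})$. The function name and the values of `w`, `u`, and the inputs are assumptions chosen for illustration, not taken from the slides. By the chain rule, the gradient of $s_t$ with respect to the initial state is a product of per-step factors $w (1 - s_k^2)$, and when each factor is smaller than 1 in magnitude the product shrinks quickly as the gap grows.

```python
import math

def gradient_through_time(w, u, xs, s_init=0.0):
    """Return |d s_t / d s_init| after each step of a scalar tanh RNN.

    Hypothetical helper for illustration: s_t = tanh(u * x_t + w * s_{t-1}).
    """
    s = s_init
    grad = 1.0
    magnitudes = []
    for x in xs:
        s = math.tanh(u * x + w * s)
        # Chain rule: d s_t / d s_{t-1} = w * (1 - s_t^2)
        grad *= w * (1.0 - s * s)
        magnitudes.append(abs(grad))
    return magnitudes

# Illustrative values only: recurrent weight 0.9, constant input 0.5, 50 steps.
grads = gradient_through_time(w=0.9, u=1.0, xs=[0.5] * 50)
print("after 1 step:  ", grads[0])
print("after 10 steps:", grads[9])
print("after 50 steps:", grads[49])
```

The printed magnitudes decrease with every step, so an error signal from step 50 has almost no influence on what happened at step 1.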

12 RNN RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a memory which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps.

13 RNN $x_t$: input at time step $t$. $s_t$: hidden state value at time step $t$. It is the memory of the network, calculated from the previous hidden state and the input at the current step: $s_t = f(U x_t + W s_{t-1})$. Here, the function $f$ is usually a non-linearity such as sigmoid, tanh, or ReLU. $s_{-1}$: used to compute the value of the first hidden state, $s_0$; typically $s_{-1}$ is initialized to 0. $o_t$: output at step $t$. It is calculated only from the memory at time $t$: $o_t = \mathrm{softmax}(V s_t)$.
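
The recurrence on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the dimensions, and the random weights are made up, $f$ is taken to be tanh, and bias terms are omitted because the slide's equations have none.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V, f=np.tanh):
    """Run s_t = f(U x_t + W s_{t-1}) and o_t = softmax(V s_t) over a sequence."""
    s_prev = np.zeros(U.shape[0])  # s_{-1}, initialized to 0
    states, outputs = [], []
    for x_t in xs:
        s_t = f(U @ x_t + W @ s_prev)  # new memory from input and old memory
        o_t = softmax(V @ s_t)         # output depends only on the memory at t
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t
    return np.array(states), np.array(outputs)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, steps = 4, 3, 5, 6
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
xs = rng.normal(size=(steps, input_dim))

states, outputs = rnn_forward(xs, U, W, V)
print(states.shape, outputs.shape)
```

Note that the same `U`, `W`, and `V` are reused at every step of the loop; only `x_t` and `s_prev` change.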

15 RNN An RNN shares the same parameters U, V, W across all time steps. It is like performing the same task at each step, just with different inputs. This property greatly reduces the total number of parameters we need to learn. The RNN shown here has outputs at each time step, but depending on the task this may not be necessary. (Remember the types of RNN: one-to-one, one-to-many, and many-to-many.)
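
A quick count shows how much parameter sharing saves. The sketch below assumes the bias-free formulation used in these slides and compares it with a hypothetical network that learns separate U, V, W for each time step; the vocabulary size, hidden size, and sequence length are illustrative numbers only.

```python
def rnn_param_count(input_dim, hidden_dim, output_dim):
    """Parameters in U (hidden x input), W (hidden x hidden), V (output x hidden)."""
    return hidden_dim * input_dim + hidden_dim * hidden_dim + output_dim * hidden_dim

def unshared_param_count(input_dim, hidden_dim, output_dim, steps):
    """Hypothetical alternative: a separate U, W, V for every time step."""
    return steps * rnn_param_count(input_dim, hidden_dim, output_dim)

# Illustrative sizes: 8000-word vocabulary, 100 hidden units, 10 time steps.
shared = rnn_param_count(8000, 100, 8000)
unshared = unshared_param_count(8000, 100, 8000, steps=10)
print(shared, unshared)
```

With sharing, the parameter count does not depend on the sequence length at all.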

16 Language modeling and generating text Given a sequence of words, we want to predict the probability of each word given the previous words. Suppose we have a sentence of $m$ words. A language model allows us to predict the probability of observing the sentence: $P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$. For three events, this is the chain rule: $P(ABC) = P(AB)\,P(C \mid AB) = P(A)\,P(B \mid A)\,P(C \mid AB)$.
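
The factorization above can be sketched as a loop over the sentence that accumulates one conditional probability per word. The `cond_prob` callback and the uniform toy model are hypothetical stand-ins for a trained language model; log-probabilities are summed rather than multiplying raw probabilities, a common way to avoid numerical underflow on long sentences.

```python
import math

def sentence_log_prob(words, cond_prob):
    """Sum log P(w_i | w_1..w_{i-1}) over the sentence (chain rule in log space)."""
    total = 0.0
    for i, word in enumerate(words):
        history = words[:i]  # all words before position i
        total += math.log(cond_prob(word, history))
    return total

# Toy model (an assumption for illustration): every word is equally likely,
# regardless of history.
VOCAB = ["the", "cat", "sat", "down"]

def uniform_cond_prob(word, history):
    return 1.0 / len(VOCAB)

log_p = sentence_log_prob(["the", "cat", "sat"], uniform_cond_prob)
p = math.exp(log_p)
print(p)
```

In an RNN language model, the role of `cond_prob` would be played by the softmax output at each step, with the hidden state summarizing the history.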

17 How to train RNN The parameters U, V, W are shared across time steps: $s_t = \tanh(U x_t + W s_{t-1})$, $\hat{y}_t = \mathrm{softmax}(V s_t)$.

18 How to train RNN

19 The loss at time step $t$ is the cross-entropy loss: $E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t$. Therefore, the total error is just the sum of errors at each time step: $E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t) = -\sum_t y_t \log \hat{y}_t$. Here, $y_t$ is the correct word at time step $t$, and $\hat{y}_t$ is the corresponding prediction.
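
As a small worked example of this loss, the sketch below treats each $y_t$ as a one-hot vector and each $\hat{y}_t$ as a probability vector, so $y_t \log \hat{y}_t$ is a dot product. The function name and the numbers are made up for illustration, and the predictions are assumed to be strictly positive so the logarithm is defined.

```python
import numpy as np

def cross_entropy_per_step(y_true, y_pred):
    """E_t = -sum_c y_t[c] * log(y_hat_t[c]) for each time step t.

    y_true: (T, C) one-hot targets; y_pred: (T, C) predicted probabilities,
    assumed strictly positive.
    """
    return -np.sum(y_true * np.log(y_pred), axis=1)

# Two time steps, three-word vocabulary (illustrative values).
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

per_step = cross_entropy_per_step(y_true, y_pred)  # [-log 0.7, -log 0.8]
total = per_step.sum()                             # E = sum over t of E_t
print(per_step, total)
```

Because the targets are one-hot, each $E_t$ reduces to the negative log-probability the model assigned to the correct word.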

20 How to train RNN We need to compute gradients of the error with respect to the parameters, U, V, W, and adjust the parameters using stochastic gradient descent:

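21 How to train RNN $\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$, $\frac{\partial E}{\partial U} = \sum_t \frac{\partial E_t}{\partial U}$, $\frac{\partial E}{\partial V} = \sum_t \frac{\partial E_t}{\partial V}$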

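22 Computing $\frac{\partial E_t}{\partial V}$: $\frac{\partial E_3}{\partial V} = \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial V} = \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial z_3} \frac{\partial z_3}{\partial V} = (\hat{y}_3 - y_3) \otimes s_3$. Here, $z_3 = V s_3$, and $\otimes$ is the outer product between two vectors.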
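
One way to gain confidence in this result is to compare the outer-product formula against a finite-difference estimate of the gradient. The sketch below does that for a single time step with a one-hot target; the helper names, sizes, and random values are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def step_loss(V, s, y):
    """Cross-entropy E_t = -y . log(softmax(V s)) for one time step."""
    return -np.sum(y * np.log(softmax(V @ s)))

def grad_V(V, s, y):
    """Analytic gradient from the slide: (y_hat - y) outer s."""
    return np.outer(softmax(V @ s) - y, s)

# Hypothetical hidden state, weights, and one-hot target.
rng = np.random.default_rng(0)
s3 = np.tanh(rng.normal(size=3))
V = rng.normal(scale=0.5, size=(4, 3))
y3 = np.zeros(4)
y3[2] = 1.0

analytic = grad_V(V, s3, y3)

# Central finite differences, one entry of V at a time.
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(V.shape[0]):
    for j in range(V.shape[1]):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[i, j] += eps
        V_minus[i, j] -= eps
        numeric[i, j] = (step_loss(V_plus, s3, y3) - step_loss(V_minus, s3, y3)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

The gradient with respect to V at step 3 depends only on values at that step ($\hat{y}_3$, $y_3$, $s_3$), which is why no recursion through earlier time steps appears here.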

Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling