CSCI 315: Artificial Intelligence through Deep Learning W&L Winter Term 2017 Prof. Levy Recurrent Neural Networks (Chapter 7)
Recall our first-week discussion...
How do we know stuff?
Intelligence as Prediction
Simple Recurrent Network (Elman 1990) The context layer is just a copy of the hidden layer at the previous time step. Like the input layer, it is fully connected to the hidden layer. It acts as an additional input that provides a history (context) for the current input. So the C in ABCABC looks different to the hidden layer than the C in BACBAC.
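A minimal sketch of one forward step of such a network, assuming a NumPy implementation (layer names, sizes, and the helper srn_step are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srn_step(x, context, W_xh, W_ch, W_hy, b_h, b_y):
    """One Elman-style SRN time step.
    x       : current input vector
    context : copy of the hidden vector from the previous time step
    W_xh, W_ch, W_hy : input->hidden, context->hidden, hidden->output weights
    """
    # Hidden layer sees both the current input and the history (context)
    hidden = sigmoid(W_xh @ x + W_ch @ context + b_h)
    output = sigmoid(W_hy @ hidden + b_y)
    # Context for the next step is simply a copy of the current hidden layer
    return output, hidden.copy()
```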
Experiment #1: Sequential XOR Recall XOR (Exclusive OR) as a minimal test case for nontrivial machine learning:
Input    Output
0 0      0
0 1      1
1 0      1
1 1      0
This is a purely static learning task: for a given input, the output is always the same. We can turn XOR into a prediction task by repeating a sequence consisting of two random bits, followed by their XOR: 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1...
Target sequence is the input sequence shifted left by one time step:
input:  1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1
target: 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1
The training sequence was 3000 bits long. The SRN had one input / target unit and two hidden / context units. 600 iterations of back-prop (each iteration = one pass through the entire sequence). What sort of results do we expect: how good will the network be at predicting each next target?
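Before looking at results, here is a sketch of how the training data described above might be generated (the 3000-bit length and shift-by-one target follow the slide; the function name and use of NumPy are illustrative):

```python
import numpy as np

def make_xor_sequence(n_triples=1000, seed=0):
    """Build a sequence of (bit, bit, XOR) triples; 1000 triples = 3000 bits."""
    rng = np.random.default_rng(seed)
    bits = []
    for _ in range(n_triples):
        a, b = rng.integers(0, 2, size=2)
        bits += [a, b, a ^ b]          # two random bits, then their XOR
    seq = np.array(bits)
    # Target is the input shifted left by one time step
    return seq[:-1], seq[1:]

inputs, targets = make_xor_sequence()   # 2999 (input, target) pairs
```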
Experiment #4: Discovering Lexical Classes Lexical class: noun, verb, adjective, etc. These can be further analyzed into animate nouns (person, animal), inanimate nouns (car, rock), transitive verbs (take, see), intransitive verbs (leave, go), etc.
Elman used a small grammar to generate simple English sentences from these words:
- 31 words, each represented by a one-hot code
- 150 hidden / context units
- Training set of sentences 27,354 words long
- Six passes through the training sequence
Instead of looking at the error signal, Elman looked at the average vector of hidden-unit activations computed when the network was presented with each word. With 150 hidden units, it is difficult to see coherent patterns directly, so Elman used Hierarchical Cluster Analysis to group the hidden-layer vectors recursively:
1) Create a distance matrix of the Euclidean distance of each vector from the others
2) If two words have a small distance, put them in the same group
3) The average vector for a group can be used to represent the group as a whole, enabling distance comparisons between groups.
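A sketch of this kind of analysis using SciPy's hierarchical-clustering routines (an illustrative assumption; Elman's original analysis was not done with these tools):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

def cluster_words(hidden_vectors, words):
    """hidden_vectors: one averaged hidden-layer vector per word, e.g. shape (31, 150).
    words: matching list of word strings."""
    # 'average' linkage: groups are compared via their average vectors,
    # using Euclidean distance, and the closest groups are merged recursively.
    Z = linkage(hidden_vectors, method='average', metric='euclidean')
    dendrogram(Z, labels=words)        # tree of discovered lexical classes
    plt.show()
    return Z
```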
Hypothetical distance matrix:
        break  cat   man   move  plate  smash
cat      .9
man      .8    .1
move     .2    .8    .9
plate    .7    .3    .4    .8
smash    .1    .9    .8    .2    .8
woman    .8    .2    .1    .8    .4     .9
Results
Responding to Novelty
SRN: Summing Up The SRN revived the nature / nurture debate in cognitive science, revealing a huge amount of learnable hidden structure in word sequences. Elman (1990) became one of the top-cited papers in psychology and cognitive science. It led researchers to wonder whether the training algorithm could be beefed up to deal with real-world, practical sequence tasks (translation, part-of-speech tagging).
From SRN to BPTT SRN back-prop is truncated in that it uses only the most recent hidden-layer activations. Hence errors computed at previous time steps are rapidly discounted. By unrolling (copying) the net over time, we can make better use of errors computed farther back in time: hence, Back-Prop Through Time (BPTT). In essence, we turn a simple recurrent net into a complicated non-recurrent net.
Back-Prop Through Time During training, we maintain a distinct copy of the weights at each time step, and modify them independently using backprop. After training, we re-roll the unrolled network back into its original form, averaging trained weight copies to get the final weights. https://www.researchgate.net/publication/2903062_a_guide_to_recurrent_neural_networks_and_backpropagation
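A sketch of the weight update described on this slide (keep a separate copy of the weights at each unrolled time step, update each with its own gradient, then average the copies back into one matrix). The names are illustrative; many implementations instead share one weight matrix and sum its gradients over time steps:

```python
import numpy as np

def bptt_update(W, grads_per_step, lr=0.1):
    """W: recurrent weight matrix; grads_per_step: one gradient per unrolled step."""
    # One independent copy of W per time step, each updated with its own gradient
    copies = [W - lr * g for g in grads_per_step]
    # "Re-roll" the network: average the trained copies into the final weights
    return np.mean(copies, axis=0)
```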
BPTT: The Vanishing Gradient Problem Even with an unrolled network, the error gradient will eventually become too small to be useful for weight updates. The figure at right illustrates this for a toy network of three units (input, hidden, output), but it is true of any BPTT network in general.
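A tiny numeric illustration (the values are made up): the error reaching a step k time steps back is scaled by a product of k per-step factors, each typically less than 1 in magnitude (recurrent weight times the activation function's slope), so it shrinks geometrically:

```python
# e.g. recurrent weight 0.9, tanh slope 0.5 at the operating point
per_step_factor = 0.9 * 0.5
for k in (1, 5, 10, 20):
    print(k, per_step_factor ** k)   # 0.45, ~0.018, ~3.4e-4, ~1.2e-7
```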
Long Short-Term Memory: A Solution to the Vanishing Gradient LSTM: Invented by Hochreiter & Schmidhuber in 1997, but citations have exploded recently thanks to Deep Learning. Best intro: google "colah LSTM". Consider a general RNN architecture, shown at right: http://colah.github.io/posts/2015-08-understanding-lstms/
General RNN Architecture
Unrolled
SRN Revisited As in the previous SRN illustration, the hidden layer is computed from the current input and the previous hidden layer. But the modern approach uses the hyperbolic tangent tanh() here, instead of the logistic sigmoid.
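In symbols (the standard formulation used in the colah post linked above, with [h_{t-1}, x_t] denoting the concatenation of the previous hidden vector and the current input):

h_t = \tanh\left( W \cdot [\, h_{t-1},\ x_t \,] + b \right)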
tanh()
LSTM Legend for the diagrams: box = neural-network layer, × = elementwise multiply, + = elementwise add, σ = logistic sigmoid
Cell State gate
Cell State gate transistor
Forget Gate Layer ft will be 0 where we want to forget a vector component, 1 where we want to remember it. For example, introduction of a singular noun in xt might override a previous plural noun, for verb agreement later. So we forget the plural noun.
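In symbols (the standard LSTM forget-gate equation, as in the colah post):

f_t = \sigma\left( W_f \cdot [\, h_{t-1},\ x_t \,] + b_f \right)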
Input Gate Layer it will be 1 where we want to learn a vector component, 0 where we don't. For example, the introduction of a singular noun in xt might override a previous plural noun, for verb agreement later. So we remember the singular noun. Ĉt is the actual new data that we want to remember.
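In symbols (standard LSTM equations; Ĉt is written \hat{C}_t here):

i_t = \sigma\left( W_i \cdot [\, h_{t-1},\ x_t \,] + b_i \right), \qquad \hat{C}_t = \tanh\left( W_C \cdot [\, h_{t-1},\ x_t \,] + b_C \right)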
Cell State Update The new cell state Ct is what remains of the previous state after forgetting, plus what we want to remember from the new information.
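In symbols (⊙ is elementwise multiplication, the × of the legend above):

C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t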
Output ot is yet another sigmoidal gate that allows us to select the components of the output. Finally, we pass the current cell state Ct through tanh(), to keep it in [-1,+1].
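A minimal sketch of one complete LSTM cell step pulling the four gates together, assuming a NumPy implementation and concatenated [h_{t-1}, x_t] inputs (the weight names Wf, Wi, Wc, Wo match the summary on the next slide; everything else is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    z = np.concatenate([h_prev, x])    # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)           # forget gate
    i = sigmoid(Wi @ z + bi)           # input gate
    C_hat = np.tanh(Wc @ z + bc)       # candidate values to remember
    C = f * C_prev + i * C_hat         # new cell state
    o = sigmoid(Wo @ z + bo)           # output gate
    h = o * np.tanh(C)                 # new hidden state, kept in [-1, +1]
    return h, C
```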
LSTM vs. SRN: Summary SRN has two layers of weights:
- input, context → hidden
- hidden → output
LSTM has four:
- Wf
- Wi
- Wc
- Wo
Traditional SRN used the logistic sigmoid everywhere. LSTM uses tanh(), with the sigmoid reserved for gating. So how does LSTM avoid the vanishing gradient?