CSCI 252: Neural Networks and Graphical Models. Fall Term 2016 Prof. Levy. Architecture #7: The Simple Recurrent Network (Elman 1990)




2 Part I Multi-layer Neural Nets

3 Taking Stock: What can we do with neural nets?
- SOM: Reveal hidden patterns in a large data set
- Hopfield: Restore degraded patterns
- SDM: Learn arbitrary associations (key/value pairs), or restore degraded patterns
- LSA: Reveal hidden patterns among words or texts
- VSA: Model structures and perform nontrivial cognitive tasks on them (analogy) in an interesting (holistic) way
So what have we missed?

5 Prediction: The Essence of Mind (?)
Recall Jeff Hawkins's claim that prediction is the essence of what makes the human mind unique. Although this is probably an overstatement, everyone agrees that without the ability to predict, you miss too much (language, music, etc.).
So we will turn our attention to the most popular neural-network models of prediction. But first let's review auto-associators.

6 Auto-Association Revisited
Recall the auto-associator concept: N inputs, N outputs. The goal is to obtain a set of network weights (a matrix) that gives you the same output as the input. With no further requirements, we could just map each input component directly to the corresponding output with a weight of 1 ($y_k = x_k$):
x1 -> y1, x2 -> y2, ..., xn -> yn
Q: So why did we bother with Hopfield Nets and SDM to build auto-associators?


7 Auto-Association Revisited
A: Simple one-to-one element-wise mappings cannot
- Restore a noisy pattern to its original form (Hopfield, SDM)
- Perform a hetero-associative mapping (SDM)
- Yield any insights into how the brain might encode and solve problems
Q: So why not just use SDM for sequences? (Denning 1989)

8 The Importance of History / Context
Fill in the blanks:
Today was hot, but tomorrow will be even ____.
Today was cold, but tomorrow will be even ____.
So we clearly take more than just the current item into account when trying to predict the next. In 1990, psychologist Jeffrey Elman invented the Simple Recurrent Network (SRN) to model such phenomena. The SRN builds on top of an existing auto-associative model called a three-layer perceptron, so we'll look at that first.

9 Three-Layer Perceptron
[Figure: http://neuroph.sourceforge.net/tutorials/images/mlp.jpg]
Note the use of a hidden layer between the input layer and the output layer. This layer allows the perceptron to compute nontrivial mappings.
[Figure: three Input/Output mapping tables]

10 Three-Layer Perceptron: Another Dot-Product Engine!
Inputs x1, x2 reach the output unit y through weights w1, w2: $y = f(x \cdot w)$

11 Three-Layer Perceptron: History
The two-layer perceptron was invented by F. Rosenblatt. Its advantage was that it could learn certain mappings, rather than having them pre-programmed. But (as we saw), Minsky & Papert pointed out its limitations. It wasn't until the PDP revolution (1986) that the problem of learning in three-layer perceptrons was solved, via the back-propagation algorithm (Rumelhart & McClelland 1986), eventually leading to the multi-layer Deep Learning networks of today.

12 Simple Recurrent Network
The context layer is just a copy (COPY in the diagram) of the hidden layer at the previous time step. Like the input layer, it is fully connected to the hidden layer. It acts as an additional input that provides a history (context) for the current input. So the C in ABCABC looks different to the hidden layer than the C in BACBAC.
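The copy-back mechanism can be sketched in a few lines. This is only an illustration under stated assumptions: logistic hidden units, a context that starts at zero, and hand-picked weights; the names `srn_step`, `run`, `W_XH`, and `W_CH` are hypothetical and do not come from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def srn_step(x, context, w_xh, w_ch):
    """One SRN time step: hidden = f(W_xh . x + W_ch . context).

    The returned hidden vector is what gets copied into the context layer.
    """
    hidden = []
    for j in range(len(w_xh)):
        net = sum(w * xi for w, xi in zip(w_xh[j], x))
        net += sum(w * ci for w, ci in zip(w_ch[j], context))
        hidden.append(sigmoid(net))
    return hidden

# Hypothetical weights: 3 input units, 2 hidden/context units.
W_XH = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]
W_CH = [[1.2, -0.7], [0.4, 1.5]]
ONE_HOT = {"A": [0, 0, 1], "B": [0, 1, 0], "C": [1, 0, 0]}

def run(sequence):
    """Feed a string of symbols through the net; return the final hidden vector."""
    context = [0.0, 0.0]  # assumption: context starts at zero
    for symbol in sequence:
        # COPY: the new hidden state becomes the next step's context.
        context = srn_step(ONE_HOT[symbol], context, W_XH, W_CH)
    return context

h_abc = run("ABC")  # hidden state on C, preceded by A then B
h_bac = run("BAC")  # hidden state on C, preceded by B then A
print(h_abc, h_bac)
```

Both runs end on the same input symbol C, yet the two hidden vectors differ, because the context layer carries a different history in each case.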

13 SRN Training via Back-Propagation
Basic idea: input = current symbol; target (desired output) = next symbol.
Represent each symbol by a one-in-n ("one-hot") code; e.g., for learning sequences using symbols A, B, C:
A = 001
B = 010
C = 100
Example sequence: BACBACBACBAC... (at each step the input is the current symbol and the target is the symbol that follows it)
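As a small sketch of how such training pairs might be built, the snippet below uses the three codes from the slide; the helper name `make_pairs` is hypothetical.

```python
# One-in-n ("one-hot") codes as given on the slide.
CODES = {"A": [0, 0, 1], "B": [0, 1, 0], "C": [1, 0, 0]}

def make_pairs(sequence):
    """Pair each symbol's code (input) with the next symbol's code (target)."""
    return [(CODES[cur], CODES[nxt]) for cur, nxt in zip(sequence, sequence[1:])]

sequence = "BACBAC"
pairs = make_pairs(sequence)
for (inp, tgt), cur, nxt in zip(pairs, sequence, sequence[1:]):
    print(cur, inp, "->", nxt, tgt)
```

A sequence of length n yields n - 1 pairs, since the last symbol has no successor to serve as its target.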

14 SRN Training via Back-Propagation
Recall that each output (and hidden) unit computes a function f of the dot product of its inputs with its weights. To keep the computed output between 0 and 1, we use a squashing function like $f(x) = 1 / (1 + e^{-x})$, a.k.a. the logistic sigmoid, or $f(x) = \tanh(x)$ (in which case we use -1 and +1 instead of 0 and 1). More modern (deep learning) nets use a softmax function, which normalizes the outputs so that they sum to 1, pushing the largest toward 1 and the others toward 0.
All such functions are differentiable (smooth) variants of the familiar threshold function, allowing us to use calculus to change the weights gradually over many (thousands of) iterations: this is the back-propagation algorithm, a variety of gradient descent.
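The three squashing functions named above can be written out directly. This is a minimal sketch using only the standard library; the helper `unit_output` and the example inputs and weights are made up for illustration.

```python
import math

def logistic(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Normalize a list of scores into positive values that sum to 1."""
    m = max(xs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def unit_output(inputs, weights, f=logistic):
    """Output of one unit: f applied to the dot product of inputs and weights."""
    return f(sum(x * w for x, w in zip(inputs, weights)))

# Illustrative values only.
y_logistic = unit_output([1.0, 0.5], [2.0, -1.0])           # lies in (0, 1)
y_tanh = unit_output([1.0, 0.5], [2.0, -1.0], f=math.tanh)  # lies in (-1, 1)
probs = softmax([2.0, 1.0, 0.1])                            # largest score gets the largest share
print(y_logistic, y_tanh, probs)
```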

16 Deriving Back-Prop Intuitively
Consider a simple two-layer auto-associator / predictor network: input x, weights w, output y, target t, error e (target minus output).

17 Deriving Back-Prop Intuitively
Let's focus on just one weight for now: input x, weight w, output $y = f(x w)$. Example: y = 0.9, t = 0, so e = t - y = -0.9.
Intuitively, we should change w in proportion to its contribution to the error e. Calculus gives us a Delta Rule for weights:
$\Delta w = a \, e \, f'(x w) \, x$
where f' is the first derivative of f, and a is a constant of proportionality called the learning rate.
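A worked instance of the Delta Rule, using the slide's values y = 0.9 and t = 0. The remaining numbers are assumptions made for illustration: f is taken to be the logistic sigmoid, x = 1, w is chosen so that f(x w) = 0.9, and the learning rate is a = 0.5.

```python
import math

def f(z):
    """Logistic sigmoid (assumed squashing function)."""
    return 1.0 / (1.0 + math.exp(-z))

def f_prime(z):
    """Derivative of the logistic sigmoid: f(z) * (1 - f(z))."""
    return f(z) * (1.0 - f(z))

def delta_rule(x, w, t, a):
    """Weight change: a * e * f'(x*w) * x, with e = t - y."""
    y = f(x * w)
    e = t - y
    return a * e * f_prime(x * w) * x

x, t = 1.0, 0.0
w = math.log(9.0)  # assumption: chosen so that f(x*w) = 0.9, as on the slide
a = 0.5            # assumed learning rate

dw = delta_rule(x, w, t, a)   # 0.5 * (-0.9) * 0.09 * 1 = -0.0405
y_before = f(x * w)
y_after = f(x * (w + dw))
print(dw, y_before, y_after)
```

The negative weight change lowers the output, moving it toward the target of 0; repeating the step many times shrinks the error gradually.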

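18 Error on hidden units
Problem: a hidden unit has no explicit target, hence no apparent error. The network is x -> h -> y, with weights $w_{xh}$ and $w_{hy}$:
$h = f(x \, w_{xh})$
$y = f(h \, w_{hy})$
The back-propagation trick, discovered independently by several researchers in the 1970s-1980s, says that the hidden-unit error can be computed by using the error on the output:
$\Delta w_{hy} = a \, e_y \, f'(h \, w_{hy}) \, h = a \, \delta_y \, h$
$e_h = \delta_y \, w_{hy}$
$\Delta w_{xh} = a \, e_h \, f'(x \, w_{xh}) \, x = a \, \delta_h \, x$
where $\delta_y = e_y \, f'(h \, w_{hy})$ and $\delta_h = e_h \, f'(x \, w_{xh})$.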
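The update equations for the x -> h -> y chain can be sketched as below. The logistic squashing function, the starting weights, the training pair, and the learning rate are all assumptions chosen for illustration; `backprop_step` is a hypothetical helper name.

```python
import math

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_prime(z):
    return f(z) * (1.0 - f(z))

def backprop_step(x, t, w_xh, w_hy, a):
    """One back-prop update for the chain x -> h -> y (one weight per layer)."""
    # Forward pass.
    h = f(x * w_xh)
    y = f(h * w_hy)
    # Output unit: the error is target minus actual value.
    e_y = t - y
    delta_y = e_y * f_prime(h * w_hy)
    # Hidden unit: error is the output delta passed back through w_hy.
    e_h = delta_y * w_hy
    delta_h = e_h * f_prime(x * w_xh)
    # Delta Rule applied to both weights.
    return w_xh + a * delta_h * x, w_hy + a * delta_y * h, e_y

w_xh, w_hy = 0.5, -0.5  # arbitrary starting weights
x, t, a = 1.0, 1.0, 0.5  # illustrative training pair and learning rate
errors = []
for _ in range(200):
    w_xh, w_hy, e_y = backprop_step(x, t, w_xh, w_hy, a)
    errors.append(abs(e_y))
print(errors[0], errors[-1])
```

With these settings the absolute output error after 200 updates is smaller than at the start, which is the gradual improvement the slides describe.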

19 Multi-layer neural nets: Summary
- Hidden units allow neural nets to learn interesting / nontrivial mappings
- Adding a recurrent context layer allows the net to keep a history of its previous hidden-unit values
- Back-prop says: Adjust a weight in proportion to its contribution to the error on the unit to which it connects (Delta Rule).
- The error on an output unit is its target minus its actual value
- The error on a hidden unit is proportional to the error on the unit to which it connects, multiplied by the connecting weight

20 Part II Finding Structure in Time (Elman 1990)

21 Experiment #1: Sequential XOR
Recall XOR (Exclusive OR) as a minimal test case for nontrivial machine learning:
Input  Output
00     0
01     1
10     1
11     0
This is a purely static learning task: for a given input, the output is always the same. We can turn XOR into a prediction task by repeating a sequence consisting of two random bits, followed by their XOR.
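One way to build such a sequence is sketched below. The function name `xor_sequence` and the fixed seed are assumptions; 1000 triples gives the 3000-bit training length reported for this experiment.

```python
import random

def xor_sequence(n_triples, seed=0):
    """Concatenate triples (a, b, a XOR b), with a and b drawn at random."""
    rng = random.Random(seed)
    bits = []
    for _ in range(n_triples):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        bits.extend([a, b, a ^ b])
    return bits

seq = xor_sequence(1000)  # 3000 bits
print(seq[:12])
```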

22 Experiment #1: Sequential XOR
- Target sequence is the input sequence shifted left by one time step.
- Training sequence was 3000 bits long.
- SRN had one input / target unit; two hidden / context units.
- 600 iterations of back-prop (each iteration = one pass through the entire sequence)
What sort of results do we expect: how good will the network be at predicting each next target?
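The question of what to expect can be explored without training a network. The sketch below is not an SRN: it scores an idealized predictor that always guesses the XOR of the two most recent inputs, to show which positions are predictable in principle. The seed and the helper `hit_rate` are assumptions.

```python
import random

# Build a 3000-bit sequential-XOR stream (two random bits, then their XOR).
rng = random.Random(1)
seq = []
for _ in range(1000):
    a, b = rng.randint(0, 1), rng.randint(0, 1)
    seq += [a, b, a ^ b]

# Target sequence = input sequence shifted left by one time step.
inputs, targets = seq[:-1], seq[1:]

def hit_rate(phase):
    """Fraction of correct guesses at time steps t with t % 3 == phase."""
    steps = [t for t in range(1, len(inputs)) if t % 3 == phase]
    hits = sum((inputs[t - 1] ^ inputs[t]) == targets[t] for t in steps)
    return hits / len(steps)

rates = {phase: hit_rate(phase) for phase in range(3)}
print(rates)
```

Only at phase 1 (when the two most recent inputs are the random pair and the target is their XOR) is the guess always right; at the other two phases the target is a fresh random bit, so no predictor can do better than chance there.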

24 Experiment #2: Letter Sequences
A made-up language with three syllables: ba, dii, guuu
Training sequence randomly chosen from these syllables:
input:  d i i b a b a g u u u b a d i i g u u
target: i i b a b a g u u u b a d i i g u u
Represent each letter using a six-bit phonetic code:
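A sketch of how such a letter stream and its shifted target might be generated. The six-bit phonetic codes are not reproduced here, so this works on raw letters; `letter_stream`, `VOWELS_AFTER`, and the seed are hypothetical names and choices.

```python
import random

SYLLABLES = ["ba", "dii", "guuu"]
# The vowel run that must follow each consonant in this made-up language.
VOWELS_AFTER = {"b": "a", "d": "ii", "g": "uuu"}

def letter_stream(n_syllables, seed=0):
    """Concatenate randomly chosen syllables into one list of letters."""
    rng = random.Random(seed)
    return list("".join(rng.choice(SYLLABLES) for _ in range(n_syllables)))

letters = letter_stream(1000)
inputs, targets = letters[:-1], letters[1:]  # target = next letter

# Vowel targets follow from the preceding consonant; consonant targets
# start a new, randomly chosen syllable.
vowel_targets = sum(1 for t in targets if t in "aiu")
consonant_targets = len(targets) - vowel_targets
print("".join(letters[:20]), vowel_targets, consonant_targets)
```

The split between vowel and consonant targets hints at the error pattern one might look for: the consonants are random, while the vowels that follow them are fixed by the syllable.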

25 Experiment #2: Letter Sequences
- So, 6 input/target units
- 20 hidden/context units
- Training set was 1,000 syllables
- 200 passes of back-prop through training set

27 Experiment #3: Discovering the Notion "Word"

29 Experiment #4: Discovering Lexical Classes Lexical class: noun, verb, adjective, etc. Can be further analyzed into animate noun (person, animal), inanimate noun (car, rock), transitive verb (take, see), intransitive (leave, go), etc.

30 Used a little grammar to generate simple English sentences from these words:

31 Experiment #4: Discovering Lexical Classes
- 31 words, each represented by a one-in-n code
- 150 hidden/context units
- Training set of sentences 27,354 words long
- Six passes through the training sequence

32 Instead of looking at the error signal, Elman looked at the average values (vector) computed by the hidden units when presented with each word. With 150 hidden units, it is difficult to see coherent patterns. So Elman used Hierarchical Cluster Analysis to group hidden-layer vectors recursively:
1) Create a distance matrix of the Euclidean distance of each vector from the others
2) If two words have a small distance, put them in the same group
3) The average vector for a group can be used to represent the group as a whole, enabling distance comparisons between groups.
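The three steps above can be sketched as a small agglomerative clustering routine. This is an illustration under stated assumptions, not Elman's actual procedure or data: the 2-D vectors in `VECTORS` are invented stand-ins for 150-dimensional hidden-layer averages, and `cluster` is a hypothetical helper that always merges the two groups whose average vectors are closest.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_vector(vectors):
    """Average vector representing a group as a whole."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster(word_vectors):
    """Repeatedly merge the two closest groups; return the tree as nested tuples."""
    # Each group is (tree, list of member vectors).
    groups = [(word, [vec]) for word, vec in word_vectors.items()]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = euclidean(mean_vector(groups[i][1]), mean_vector(groups[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((groups[i][0], groups[j][0]), groups[i][1] + groups[j][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0][0]

# Invented 2-D vectors; real hidden-layer vectors would have 150 components.
VECTORS = {
    "cat": [0.9, 0.1], "man": [0.8, 0.2], "woman": [0.85, 0.25],
    "break": [0.1, 0.9], "smash": [0.15, 0.8], "move": [0.3, 0.7],
}
tree = cluster(VECTORS)
print(tree)
```

With these invented vectors, the top-level split of the tree separates the three nouns from the three verbs, which is the kind of structure the cluster analysis is meant to expose.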

33 Hypothetical distance matrix
[Table: pairwise distances, with rows and columns labeled break, cat, man, move, plate, smash, woman]

34 Results

35 Responding to Novelty

36 Rationalism vs. Empiricism Revisited
