Neural Turing Machine. Authors: Alex Graves, Greg Wayne, Ivo Danihelka. Presented by: Tinghui Wang (Steve)


Introduction. Neural Turing Machine (NTM): couple a neural network with external memory resources. The combined system is analogous to a Turing Machine, but is differentiable end to end. The NTM can infer simple algorithms!

Finite State Machine (in computation theory): a tuple (Σ, S, s₀, δ, F). Σ: alphabet; S: finite set of states; s₀: initial state; δ: state transition function, δ: S × Σ → P(S); F: set of final states.

Push-Down Automaton (in computation theory): a tuple (Q, Σ, Γ, δ, q₀, Z, F). Q: finite set of states; q₀: initial state; Σ: input alphabet; Γ: stack alphabet; Z: initial stack symbol; δ: state transition function, δ: Q × Σ × Γ → P(Q × Γ); F: set of final states.

Turing Machine: a tuple (Q, Γ, b, Σ, δ, q₀, F). Q: finite set of states; q₀: initial state; Γ: finite tape alphabet; b: blank symbol (the only symbol that may occur infinitely often on the tape); Σ: input alphabet (Σ = Γ \ {b}); δ: transition function, δ: (Q \ F) × Γ → Q × Γ × {L, R}; F: set of final (accepting) states.
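To make the transition-function notation concrete, here is a minimal Turing Machine simulator sketch in Python. It is purely illustrative (the run_tm helper, the bit-flipping example, and the step budget are my own choices, not material from the slides); the transition table mirrors δ: (Q \ F) × Γ → Q × Γ × {L, R} above.

```python
from collections import defaultdict

def run_tm(delta, q0, finals, tape, blank="b", max_steps=1000):
    # Tape as a dict of cells defaulting to the blank symbol b.
    cells = defaultdict(lambda: blank, enumerate(tape))
    state, head = q0, 0
    for _ in range(max_steps):
        if state in finals:
            return state, cells
        # delta: (state, symbol) -> (new_state, write_symbol, move)
        state, cells[head], move = delta[(state, cells[head])]
        head += 1 if move == "R" else -1
    raise RuntimeError("no accepting state reached within the step budget")

# Toy machine: flip bits until the first blank, then accept.
delta = {("flip", "0"): ("flip", "1", "R"),
         ("flip", "1"): ("flip", "0", "R"),
         ("flip", "b"): ("accept", "b", "R")}
state, cells = run_tm(delta, "flip", {"accept"}, list("0110"))
```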

Neural Turing Machine: add learning to the Turing Machine! The finite state machine (the fixed program) is replaced by a neural network, which can learn.

A Little History of Neural Networks. 1950s: Frank Rosenblatt's perceptron, classification based on a linear predictor. 1969: Minsky and Papert prove the perceptron's limits; it cannot learn XOR. 1980s: back-propagation. 1990s: recurrent neural networks, Long Short-Term Memory. Late 2000s to now: deep learning and fast computers.

Feed-Forward Neural Net and Back-Propagation. Total input of unit j: $x_j = \sum_i y_i w_{ji}$. Output of unit j (logistic function): $y_j = \frac{1}{1 + e^{-x_j}}$. Total error (mean square): $E_{total} = \frac{1}{2} \sum_c \sum_j (d_{c,j} - y_{c,j})^2$. Gradient descent follows the chain rule: $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial w_{ji}}$. Reference: Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors". Nature 323(6088): 533-536.
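As a sanity check on these equations, here is a small NumPy sketch of the forward pass and the back-propagated gradient for a single logistic unit (illustrative only; the function names and toy inputs are mine, not the presenter's).

```python
import numpy as np

def forward(y_in, w):
    x_j = y_in @ w                      # x_j = sum_i y_i * w_ji
    y_j = 1.0 / (1.0 + np.exp(-x_j))    # logistic output
    return x_j, y_j

def backprop_grad(y_in, w, d):
    _, y_j = forward(y_in, w)
    err = y_j - d                       # dE/dy_j for E = 0.5 * (d - y_j)^2
    dy_dx = y_j * (1.0 - y_j)           # derivative of the logistic function
    return err * dy_dx * y_in           # chain rule: dE/dw_ji

y_in = np.array([0.2, 0.7, 0.1])        # outputs of upstream units i
w = np.zeros(3)
for _ in range(100):                    # plain gradient descent on one example
    w -= 0.5 * backprop_grad(y_in, w, d=1.0)
```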

Recurrent Neural Network. $y(t)$: n-tuple of outputs at time t. $x^{net}(t)$: m-tuple of external inputs at time t. U: set of indices k such that $x_k$ is the output of a unit in the network; I: set of indices k such that $x_k$ is an external input. Thus $x_k(t) = x_k^{net}(t)$ if $k \in I$, and $x_k(t) = y_k(t)$ if $k \in U$. Dynamics: $s_k(t+1) = \sum_{i \in I \cup U} w_{ki}\, x_i(t)$ and $y_k(t+1) = f_k(s_k(t+1))$, where $f_k$ is a differentiable squashing function. Reference: R. J. Williams and D. Zipser. "Gradient-based learning algorithms for recurrent networks and their computational complexity". In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
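A minimal NumPy sketch of one step of this fully connected recurrent network (variable names and sizes are my own; the weight matrix has one column per index in I ∪ U):

```python
import numpy as np

def rnn_step(W, x_net, y_prev, squash=np.tanh):
    x = np.concatenate([x_net, y_prev])   # x_k: external inputs (k in I) and unit outputs (k in U)
    s = W @ x                              # s_k(t+1) = sum_i w_ki * x_i(t)
    return squash(s)                       # y_k(t+1) = f_k(s_k(t+1))

rng = np.random.default_rng(0)
m, n = 3, 4                                # m external inputs, n units
W = rng.normal(scale=0.1, size=(n, m + n))
y = np.zeros(n)
for t in range(10):                        # unroll over a short input sequence
    y = rnn_step(W, x_net=rng.normal(size=m), y_prev=y)
```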

Recurrent Neural Network: Time Unrolling. Error function: $J(t) = \frac{1}{2} \sum_{k \in U} e_k(t)^2 = \frac{1}{2} \sum_{k \in U} (d_k(t) - y_k(t))^2$, and total error $J^{total}(t', t) = \sum_{\tau = t'+1}^{t} J(\tau)$. Back-propagation through time: $\varepsilon_k(t) = -\frac{\partial J(t)}{\partial y_k(t)} = e_k(t) = d_k(t) - y_k(t)$; $\delta_k(\tau) = -\frac{\partial J(t)}{\partial s_k(\tau)} = f_k'(s_k(\tau))\, \varepsilon_k(\tau)$; $\varepsilon_k(\tau - 1) = -\frac{\partial J(t)}{\partial y_k(\tau - 1)} = \sum_{l \in U} w_{lk}\, \delta_l(\tau)$; and the weight gradient is $\frac{\partial J^{total}}{\partial w_{ji}} = \sum_{\tau = t'+1}^{t} \frac{\partial J(\tau)}{\partial s_j(\tau)} \frac{\partial s_j(\tau)}{\partial w_{ji}}$.
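The following NumPy sketch applies these recursions for a target injected only at the final step (my own code, using the slide's ε/δ convention; the accumulated quantity is therefore the negative gradient, i.e. the gradient-descent update direction):

```python
import numpy as np

def bptt(W, x_net_seq, d):
    """Back-propagation through time for the fully connected RNN above (f = tanh)."""
    m = x_net_seq.shape[1]
    n = W.shape[0]
    xs, ss, y = [], [], np.zeros(n)
    for x_net in x_net_seq:                       # forward pass, storing activations
        x = np.concatenate([x_net, y])
        s = W @ x
        xs.append(x); ss.append(s); y = np.tanh(s)
    update = np.zeros_like(W)
    eps = d - y                                   # eps_k(t) = d_k(t) - y_k(t)
    for x, s in zip(reversed(xs), reversed(ss)):  # backward pass through time
        delta = (1.0 - np.tanh(s) ** 2) * eps     # delta_k(tau) = f'(s_k) * eps_k(tau)
        update += np.outer(delta, x)              # accumulates delta_j * x_i = -dJ/dw_ji
        eps = W[:, m:].T @ delta                  # eps_k(tau-1) = sum_l w_lk * delta_l(tau)
    return update                                 # apply as W += learning_rate * update

W = np.random.default_rng(0).normal(scale=0.1, size=(4, 3 + 4))
update = bptt(W, x_net_seq=np.ones((10, 3)), d=np.zeros(4))
```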

RNN & Long Short-Term Memory. RNNs are Turing complete: with a proper configuration, an RNN can simulate any sequence generated by a Turing Machine. Vanishing and exploding errors: consider a single unit with a self loop; then $\delta_k(\tau) = -\frac{\partial J(t)}{\partial s_k(\tau)} = f_k'(s_k(\tau))\, \varepsilon_k(\tau) = f_k'(s_k(\tau))\, w_{kk}\, \delta_k(\tau + 1)$, so the back-propagated error is rescaled by $f_k'(s_k)\, w_{kk}$ at every step and either vanishes or explodes over long time lags. Long Short-Term Memory was introduced to address this. Reference: Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735-1780.
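A tiny numeric illustration of that recursion (toy numbers of my own choosing): with a logistic unit, $f'(0) = 0.25$, so the per-step factor is $0.25\, w_{kk}$, which vanishes for $w_{kk} = 1$ and explodes for $w_{kk} = 8$.

```python
import numpy as np

def logistic_prime(s):
    y = 1.0 / (1.0 + np.exp(-s))
    return y * (1.0 - y)

# delta(tau) = f'(s_k(tau)) * w_kk * delta(tau+1), applied `steps` times
def backprop_factor(w_kk, s, steps):
    delta = 1.0
    for _ in range(steps):
        delta *= logistic_prime(s) * w_kk
    return delta

print(backprop_factor(w_kk=1.0, s=0.0, steps=50))   # ~0.25**50: gradient vanishes
print(backprop_factor(w_kk=8.0, s=0.0, steps=50))   # ~2.0**50: gradient explodes
```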

Purpose of the Neural Turing Machine: enrich the capabilities of a standard RNN to simplify the solution of algorithmic tasks.

NTM Read/Write Operations. Memory is a bank of N locations, each holding an M-size vector; $w_t$ is a vector of weights over the N locations emitted by a read or write head at time t. Read: $r_t = \sum_i w_t(i)\, M_t(i)$. Erase: $M_t(i) = M_{t-1}(i)\,[1 - w_t(i)\, e_t]$, where $e_t$ is the erase vector emitted by the head. Add: $M_t(i) \leftarrow M_t(i) + w_t(i)\, a_t$, where $a_t$ is the add vector emitted by the head.
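A short NumPy sketch of these read, erase, and add operations (illustrative; the array shapes and function names are my own, not the authors' code):

```python
import numpy as np

def ntm_read(M, w):
    # r_t = sum_i w_t(i) * M_t(i); M is (N, M_size), w is (N,)
    return w @ M

def ntm_write(M_prev, w, e, a):
    # Erase: M(i) = M_prev(i) * (1 - w(i) * e), then Add: M(i) += w(i) * a
    M = M_prev * (1.0 - np.outer(w, e))
    return M + np.outer(w, a)

N, M_size = 8, 4
M = np.zeros((N, M_size))
w = np.eye(N)[2]                      # a sharp weighting focused on location 2
M = ntm_write(M, w, e=np.ones(M_size), a=np.array([1., 0., 1., 0.]))
print(ntm_read(M, w))                 # -> [1. 0. 1. 0.]
```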

Attention & Focus. Attention and focus are adjusted by the weight vector over the whole memory bank.

Content-Based Addressing: find similar data in memory. $k_t$: key vector; $\beta_t$: key strength. $w_t^c(i) = \frac{\exp\left(\beta_t K[k_t, M_t(i)]\right)}{\sum_j \exp\left(\beta_t K[k_t, M_t(j)]\right)}$, where $K[u, v] = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}$ is the cosine similarity.
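This is a softmax over key-strength-scaled cosine similarities; a small NumPy sketch (function names are mine):

```python
import numpy as np

def cosine_similarity(u, V, eps=1e-8):
    # K[u, M(i)] for every row M(i) of V
    return (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + eps)

def content_weights(M, k, beta):
    scores = beta * cosine_similarity(k, M)       # beta_t * K[k_t, M_t(i)]
    scores -= scores.max()                        # numerical stability
    e = np.exp(scores)
    return e / e.sum()                            # softmax over the N locations

M = np.random.default_rng(0).normal(size=(8, 4))  # 8 locations, 4-dim contents
w_c = content_weights(M, k=M[3], beta=5.0)        # key equal to row 3 -> peak at i=3
```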

Gating Content Addressing: addressing may not depend on content alone. $g_t$: interpolation gate in the range (0, 1). $w_t^g = g_t\, w_t^c + (1 - g_t)\, w_{t-1}$.

Location Addressing: Convolutional Shift. $s_t$: shift weighting, a convolutional vector of size N. $\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$, where the index arithmetic is taken modulo N (a circular convolution).

Location Addressing: Sharpening after the shift. $\gamma_t \geq 1$: sharpening factor. $w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$, which counteracts the blurring introduced by the convolutional shift.
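Putting the last three steps together (interpolation gate, circular convolutional shift, sharpening), here is a combined NumPy sketch of the addressing pipeline. It is illustrative only; the shift kernel and gate values in the example are my own.

```python
import numpy as np

def address(w_c, w_prev, g, s, gamma):
    w_g = g * w_c + (1.0 - g) * w_prev                   # gating: w_t^g
    N = len(w_g)
    w_shift = np.array([sum(w_g[j] * s[(i - j) % N]      # circular convolution:
                            for j in range(N))           # w~_t(i) = sum_j w_g(j) s_t(i-j)
                        for i in range(N)])
    w_sharp = w_shift ** gamma                           # sharpening, gamma_t >= 1
    return w_sharp / w_sharp.sum()

# With g = 1 and a shift kernel peaked at +1, a focus on location 1 moves to location 2.
w = address(w_c=np.eye(5)[1], w_prev=np.full(5, 0.2),
            g=1.0, s=np.array([0.05, 0.9, 0.05, 0.0, 0.0]), gamma=3.0)
```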

Neural Turing Machine: Experiments. Goal: demonstrate that the NTM is able to solve the problems by learning compact internal programs. Three architectures: NTM with a feedforward controller, NTM with an LSTM controller, and a standard LSTM network. Tasks: Copy, Repeat Copy, Associative Recall, Dynamic N-Grams, Priority Sort.

NTM Experiments: Copy. Task: store and then recall a long sequence of arbitrary information. Training: 8-bit random vectors with sequence lengths from 1 to 20; the network receives no inputs while it generates the targets. (Figure: copy outputs for NTM and LSTM.)
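For concreteness, a small sketch of how such a copy-task example could be generated (the delimiter-channel convention is my own assumption; the slide only specifies 8-bit random vectors of length 1 to 20 and a no-input recall phase):

```python
import numpy as np

def make_copy_example(rng, width=8, max_len=20):
    length = rng.integers(1, max_len + 1)
    seq = rng.integers(0, 2, size=(length, width)).astype(float)
    inputs = np.zeros((2 * length + 1, width + 1))   # extra channel for a delimiter
    inputs[:length, :width] = seq
    inputs[length, width] = 1.0                       # delimiter marks end of the input phase
    targets = np.zeros_like(inputs[:, :width])
    targets[length + 1:] = seq                        # recall the sequence with no input
    return inputs, targets

inputs, targets = make_copy_example(np.random.default_rng(0))
```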

NTM Experiments: Copy Learning Curve

NTM Experiments: Repeat Copy. Task: repeat a sequence a specified number of times. Training: random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies.

NTM Experiments: Repeat Copy Learning Curve

NTM Experiments: Associative Recall. Task: after a sequence of items has been propagated to the network, show one item and ask the network to produce the next item. Training: each item is composed of three six-bit binary vectors, with 2 to 8 items per episode.

NTM Experiments: Dynamic N-Grams. Task: learn an N-gram model, i.e., rapidly adapt to a new predictive distribution. Training: 6-gram distributions over binary sequences; each sequence is 200 successive bits generated from a look-up table of 32 probabilities drawn from a Beta(0.5, 0.5) distribution. Comparison against the optimal Bayesian estimator: $P(B = 1 \mid N_1, N_0, c) = \frac{N_1 + 0.5}{N_1 + N_0 + 1}$, where $N_1$ and $N_0$ are the numbers of ones and zeros observed so far following the five-bit context c. (Figure annotations: c = 01111, c = 00010.)
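A short sketch of that optimal estimator (the helper name and the toy four-bit context are my own; in the task itself the context is the previous five bits):

```python
def optimal_estimate(bits, context):
    """P(next bit = 1 | context) = (N1 + 0.5) / (N1 + N0 + 1)."""
    n1 = n0 = 0
    k = len(context)
    for i in range(k, len(bits)):
        if bits[i - k:i] == context:       # count bits that followed this context
            n1 += bits[i]
            n0 += 1 - bits[i]
    return (n1 + 0.5) / (n1 + n0 + 1)

bits = [0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
print(optimal_estimate(bits, context=[1, 1, 1, 1]))  # -> 0.375 on this toy sequence
```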

NTM Experiments: Priority Sort. Task: sort binary vectors according to their priorities. Training: inputs are 20 binary vectors with corresponding scalar priorities; the target output is the 16 highest-priority vectors.

NTM Experiments: Learning Curve

Some Detailed Parameters (table of experimental settings for the three architectures: NTM with a feedforward controller, NTM with an LSTM controller, and standard LSTM).

Additional Information. Literature references:
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533-536.
Williams, R. J. and Zipser, D. (1994). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735-1780.
Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (eds.), Large-Scale Kernel Machines. MIT Press.
NTM implementation: https://github.com/shawntan/neural-turing-machines
NTM Reddit discussion: https://www.reddit.com/r/machinelearning/comments/2m9mga/implementation_of_neural_turing_machines/
