NLP Programming Tutorial 8 - Recurrent Neural Nets

Graham Neubig
Nara Institute of Science and Technology (NAIST)

Feed-Forward Neural Nets: all connections point forward, from the input features ϕ(x) to the output y, so the network is a directed acyclic graph (DAG).

Recurrent Neural Nets (RNN): part of the node outputs return as input, so the previous hidden state h_{t-1} is fed back together with the current features ϕ_t(x) to produce the output y. Why? This makes it possible to memorize information from earlier in the sequence.

RNN in Sequence Modeling

[Figure: inputs x_1 ... x_4 each pass through the same NET, producing outputs y_1 ... y_4, with the hidden state passed from one time step to the next.]

Example: POS Tagging

[Figure: the input words "natural language processing is" are fed through NET at each step, producing the tags JJ NN NN VBZ.]

Multi-class Prediction with Neural Networks

Review: Prediction Problems. Given x, predict y:
- A book review ("Oh, man I love this book!" / "This book is so boring...") -> is it positive? (yes / no): binary prediction (2 choices)
- A tweet ("On the way to the park!" / "公園に行くなう!", Japanese for "heading to the park now!") -> what is its language? (English / Japanese): multi-class prediction (several choices)
- A sentence ("I read a book") -> what is its syntactic parse? (a tree over N, VBD, DET, NN, NP, VP, S): structured prediction (millions of choices)

Review: Sigmoid Function. The sigmoid softens the step function:

  P(y=1|x) = e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})

[Figure: plots of the step function and the sigmoid function, p(y|x) against w·ϕ(x).]

softmax Function. The sigmoid function generalized to multiple classes:

  P(y|x) = e^{w·ϕ(x,y)} / Σ_ỹ e^{w·ϕ(x,ỹ)}

where the numerator scores the current class and the denominator sums over all classes ỹ. This can be expressed using matrix/vector operations:

  r = exp(w·ϕ(x))
  p = r / Σ_r̃ r̃
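
As a small illustration, here is one way to compute the softmax with numpy (a sketch; the function name softmax and the max-subtraction for numerical stability are additions, not part of the slides):

  import numpy as np

  def softmax(scores):
      # scores = w . phi(x): one unnormalized score per class
      r = np.exp(scores - np.max(scores))   # subtracting the max is an extra stability trick
      return r / np.sum(r)                  # p = r divided by the sum of all elements of r

  p = softmax(np.array([1.0, 2.0, 0.5]))
  print(p, p.sum())                         # a probability distribution that sums to 1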

Selecting the Best Value from a Probability Distribution. Find the index y with the highest probability:

  find_best(p):
      y = 0
      for each element i in 1 .. len(p)-1:
          if p[i] > p[y]:
              y = i
      return y

softmax Function Gradient. The error is the difference between the true and estimated probability distributions:

  d err / d ϕ_out = p' - p

The true distribution p' is a vector with only the y-th element set to 1 (a one-hot vector):

  p' = {0, 0, ..., 1, ..., 0}

Creating a 1-hot Vector

  create_one_hot(id, size):
      vec = np.zeros(size)
      vec[id] = 1
      return vec
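
Combining the last two slides, a short sketch of how the one-hot vector and the predicted distribution give the output error (the concrete numbers are only illustrative):

  import numpy as np

  def create_one_hot(id, size):
      vec = np.zeros(size)
      vec[id] = 1
      return vec

  p = np.array([0.1, 0.2, 0.6, 0.1])    # estimated distribution p over 4 classes
  p_true = create_one_hot(2, 4)         # true distribution p' for the correct label y = 2
  error = p_true - p                    # d err / d phi_out = p' - p
  print(error)                          # [-0.1 -0.2  0.4 -0.1]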

Forward Propagation in Recurrent Nets

Review: Forward Propagation Code

  forward_nn(network, φ_0):
      φ = [ φ_0 ]                              # Output of each layer
      for each layer i in 1 .. len(network):
          w, b = network[i-1]
          # Calculate the value based on the previous layer
          φ[i] = np.tanh( np.dot( w, φ[i-1] ) + b )
      return φ                                 # Return the values of all layers

RNN Calculation

[Figure: at each time step, the input x_t and the previous hidden state h_{t-1} feed a tanh layer (weights w_{r,x}, w_{r,h}, bias b_r), and the hidden state feeds a softmax output layer (weights w_{o,h}, bias b_o) that produces p_t.]

  h_t = tanh(w_{r,h} h_{t-1} + w_{r,x} x_t + b_r)
  p_t = softmax(w_{o,h} h_t + b_o)
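
A minimal numpy sketch of a single time step of these two equations; the dimensions and random initialization are illustrative assumptions, not part of the slides:

  import numpy as np

  num_hidden, num_classes, num_feats = 4, 3, 5      # illustrative sizes
  w_rx = np.random.randn(num_hidden, num_feats)     # input -> hidden weights
  w_rh = np.random.randn(num_hidden, num_hidden)    # hidden -> hidden (recurrent) weights
  b_r  = np.zeros(num_hidden)
  w_oh = np.random.randn(num_classes, num_hidden)   # hidden -> output weights
  b_o  = np.zeros(num_classes)

  x_t    = np.random.randn(num_feats)               # input features at time t
  h_prev = np.zeros(num_hidden)                     # h_{t-1} (zeros at the first step)

  h_t = np.tanh(np.dot(w_rh, h_prev) + np.dot(w_rx, x_t) + b_r)
  scores = np.dot(w_oh, h_t) + b_o
  p_t = np.exp(scores) / np.sum(np.exp(scores))     # softmax over the output scores
  print(h_t, p_t)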

RNN Forward Calculation

  forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x):
      h = [ ]     # Hidden layers (at each time t)
      p = [ ]     # Output probability distributions (at each time t)
      y = [ ]     # Output values (at each time t)
      for each time t in 0 .. len(x)-1:
          if t > 0:
              h[t] = tanh(w_rx x[t] + w_rh h[t-1] + b_r)
          else:
              h[t] = tanh(w_rx x[t] + b_r)
          p[t] = softmax(w_oh h[t] + b_o)
          y[t] = find_best(p[t])
      return h, p, y

Review: Back Propagation in Feed-forward Nets

Stochastic Gradient Descent. Online training algorithm for probabilistic models (including logistic regression):

  w = 0
  for I iterations:
      for each labeled pair x, y in the data:
          w += α * dP(y|x)/dw

In other words: for every training example, calculate the gradient (the direction that will increase the probability of y) and move in that direction, multiplied by the learning rate α.

Gradient of the Sigmoid Function. Take the derivative of the probability:

  d/dw P(y=1|x)  = d/dw [ e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)}) ]
                 = ϕ(x) e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})^2

  d/dw P(y=-1|x) = d/dw [ 1 - e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)}) ]
                 = -ϕ(x) e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})^2

[Figure: plot of dP(y|x)/dw against w·ϕ(x); the gradient is largest near w·ϕ(x) = 0 and approaches 0 at the extremes.]
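
A quick numerical check of this derivative, treating w and ϕ(x) as single scalars (an illustrative sketch, not part of the tutorial):

  import numpy as np

  def p_pos(w, phi):
      # P(y=1|x) = e^{w*phi} / (1 + e^{w*phi})
      return np.exp(w * phi) / (1 + np.exp(w * phi))

  w, phi, eps = 0.5, 2.0, 1e-6
  analytic = phi * np.exp(w * phi) / (1 + np.exp(w * phi)) ** 2
  numeric  = (p_pos(w + eps, phi) - p_pos(w - eps, phi)) / (2 * eps)
  print(analytic, numeric)   # the two values should match to several decimal places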

Learning: We Don't Know the Derivative for Hidden Units! For NNs, we only know the correct tag for the last layer:

[Figure: the input ϕ(x) feeds hidden units h(x) through weights w_1, w_2, w_3, which feed the output y=1 through w_4.]

  dP(y=1|x)/dw_4 = h(x) e^{w_4·h(x)} / (1 + e^{w_4·h(x)})^2

but dP(y=1|x)/dw_1 = ?, dP(y=1|x)/dw_2 = ?, dP(y=1|x)/dw_3 = ?

Answer: Back-Propagation. Calculate the derivative with the chain rule:

  dP(y=1|x)/dw_1 = dP(y=1|x)/d(w_4·h(x)) · d(w_4·h(x))/dh_1(x) · dh_1(x)/dw_1

where dP(y=1|x)/d(w_4·h(x)) = e^{w_4·h(x)} / (1 + e^{w_4·h(x)})^2 is the error of the next unit (δ_4), d(w_4·h(x))/dh_1(x) = w_{1,4} is the weight, and dh_1(x)/dw_1 is the gradient of this unit.

In general, calculate the gradient for unit i based on the next units j:

  dP(y=1|x)/dw_i = dh_i(x)/dw_i · Σ_j δ_j w_{i,j}

Conceptual Picture: send errors back through the net.

[Figure: the error δ_4 at the output y is sent backward through the weights w_1, w_2, w_3 to give the errors δ_1, δ_2, δ_3 at the hidden units fed by ϕ(x).]

Back Propagation in Recurrent Nets

What Errors Do We Know?

[Figure: the unrolled RNN over inputs x_1 ... x_4 and outputs y_1 ... y_4; each output has an error δ_{o,1} ... δ_{o,4}, and the recurrent connections between steps carry errors δ_{r,1} ... δ_{r,3}.]

We know the output errors δ_o; we must use back-propagation to find the recurrent errors δ_r.

How to Back-Propagate?
- Standard back-propagation through time (BPTT): for each δ_o, calculate n steps of δ_r.
- Full gradient calculation: use dynamic programming to calculate the gradient over the whole sequence.

Back Propagation through Time

[Figure: starting from a single output error (here δ_{o,3}), the error is propagated backward through the unrolled net and stopped after n steps.]

Use only one output error at a time, and stop after n steps (here, n=2).

Full Gradient Calculation

[Figure: all output errors δ_{o,1} ... δ_{o,4} are used; the net is first run forward over the whole sequence, then the errors are propagated backward through every step.]

First, calculate the whole net result forward; then, calculate the result backwards over the entire sequence.

BPTT or Full Gradient?
- Full gradient: + faster, no limit on the number of time steps; - must keep the results for the whole sequence in memory.
- BPTT: + only needs to remember the results of the past few steps; - slower, and less accurate for long-distance dependencies.

Vanishing Gradient in Neural Nets

[Figure: the output error δ_{o,4} at the last step shrinks as it is propagated backward: medium, then small, then tiny, then very tiny by the first step.]

As the error travels back through time it becomes smaller and smaller, so early time steps receive almost no learning signal. Long Short-Term Memory (LSTM) is designed to solve this.
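
The shrinking can be seen in a tiny sketch: each backward step multiplies the error by the recurrent weight and the tanh gradient (1 - h^2), so it decays geometrically (the weight and activation values below are illustrative assumptions):

  delta = 1.0     # output error at the last time step
  w_rh  = 0.5     # an illustrative recurrent weight
  h     = 0.8     # an illustrative hidden activation
  for step in range(10):
      delta *= w_rh * (1 - h ** 2)    # one backward step through time
      print(step, delta)
  # after 10 steps delta is on the order of 1e-8: almost no signal reaches early steps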

RNN Full Gradient Calculation

  gradient_rnn(w_rx, w_rh, b_r, w_oh, b_o, x, h, p, y'):
      initialize Δw_rx, Δw_rh, Δb_r, Δw_oh, Δb_o to zero
      δ_r' = np.zeros(len(b_r))                          # Error from the following time step
      for each time t in len(x)-1 .. 0:
          p' = create_one_hot(y'[t], len(b_o))
          δ_o' = p' - p[t]                               # Output error
          Δw_oh += np.outer(δ_o', h[t]); Δb_o += δ_o'    # Output gradient
          δ_r = np.dot(δ_r', w_rh) + np.dot(δ_o', w_oh)  # Backprop
          δ_r' = δ_r * (1 - h[t]**2)                     # tanh gradient
          Δw_rx += np.outer(δ_r', x[t]); Δb_r += δ_r'    # Hidden gradient
          if t != 0:
              Δw_rh += np.outer(δ_r', h[t-1])            # Recurrent gradient
      return Δw_rx, Δw_rh, Δb_r, Δw_oh, Δb_o

Weight Update

  update_weights(w_rx, w_rh, b_r, w_oh, b_o, Δw_rx, Δw_rh, Δb_r, Δw_oh, Δb_o, λ):
      w_rx += λ * Δw_rx
      w_rh += λ * Δw_rh
      b_r  += λ * Δb_r
      w_oh += λ * Δw_oh
      b_o  += λ * Δb_o

Overall Training Algorithm

  # Create features
  create map x_ids, map y_ids, and array data
  for each labeled pair x, y in the training file:
      add ( create_ids(x, x_ids), create_ids(y, y_ids) ) to data
  initialize net randomly

  # Perform training
  for I iterations:
      for each labeled pair x, y' in data:
          h, p, y = forward_rnn(net, x)
          Δ = gradient_rnn(net, x, h, p, y')
          update_weights(net, Δ, λ)
  print net to weight_file
  print x_ids, y_ids to id_file
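
The create_ids helper used above is not defined on the slides; here is a minimal sketch of one possible version, assuming whitespace-separated tokens and a dict that assigns each new symbol the next integer ID:

  def create_ids(tokens, ids):
      # Map each token to an integer ID, adding unseen tokens to the dict
      result = []
      for token in tokens:
          if token not in ids:
              ids[token] = len(ids)
          result.append(ids[token])
      return result

  x_ids = {}
  print(create_ids("natural language processing is".split(), x_ids))   # e.g. [0, 1, 2, 3]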

Exercise

Exercise: create an RNN for sequence labeling.
- Write a training program train-rnn and a testing program test-rnn.
- Test: use the same data as for POS tagging. Input: test/05-{train,test}-input.txt; reference: test/05-{train,test}-answer.txt.
- Train a model with data/wiki-en-train.norm_pos and predict for data/wiki-en-test.norm.
- Evaluate the POS tagging performance and compare with the HMM: script/gradepos.pl data/wiki-en-test.pos my_answer.pos