Recurrent Neural Networks. COMP-550 Oct 5, 2017


Outline. Introduction to neural networks and deep learning; feedforward neural networks; recurrent neural networks.

Classification Review. y = f(x): x is the input, f is the classifier, and y is the output label. Represent the input x (e.g., a text document) as a list of feature values: x_1, x_2, ..., x_8 = 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0.

Logistic Regression. Linear regression: y = a_1 x_1 + a_2 x_2 + ... + a_n x_n + b. Intuition: linear regression gives us continuous values in (-∞, ∞); let's squish the values to be in [0, 1]! The function that does this is the logistic function: P(y|x) = (1/Z) exp(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b), where Z is a normalizing constant that ensures this is a probability distribution. (a.k.a. a maximum entropy or MaxEnt classifier.) N.B.: don't be confused by the name; this method is most often used to solve classification problems.
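As a rough illustration (not from the slides), a binary logistic regression prediction can be sketched in Python/NumPy as follows; the weight vector a and bias b here are hypothetical, untrained values:

```python
import numpy as np

def logistic_predict(x, a, b):
    """P(y=1 | x) under a binary logistic regression model."""
    score = np.dot(a, x) + b              # a_1 x_1 + ... + a_n x_n + b
    return 1.0 / (1.0 + np.exp(-score))   # sigmoid squashes the score into [0, 1]

# Example with the 8-feature input from the classification review slide;
# the weights are random placeholders.
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
a = np.random.randn(8)
print(logistic_predict(x, a, b=0.0))
```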

Linear Models. Logistic regression, support vector machines, etc. are examples of linear models: P(y|x) = (1/Z) exp(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b) is a linear combination of feature weights and values. Linear models cannot learn complex, non-linear functions from input features to output labels (without adding features), e.g., "starts with a capital AND is not at the beginning of a sentence -> proper noun".

(Artificial) Neural Networks. A kind of learning model which automatically learns non-linear functions from input to output. Biologically inspired metaphor: a network of computational units called neurons. Each neuron takes scalar inputs and produces a scalar output, very much like a logistic regression model: Neuron(x) = g(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b). As a whole, the network can theoretically compute any computable function, given enough neurons. (These notions can be formalized.)
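A minimal sketch of a single neuron in NumPy (my own illustration; the nonlinearity g is left as a parameter):

```python
import numpy as np

def neuron(x, a, b, g=np.tanh):
    """A single artificial neuron: nonlinearity g applied to a weighted sum of inputs."""
    return g(np.dot(a, x) + b)

# Toy usage with made-up weights; with g = sigmoid this is exactly logistic regression.
print(neuron(np.array([1.0, 0.0, 2.0]), np.array([0.5, -1.0, 0.25]), b=0.1))
```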

Responsible For: AlphaGo (Google, 2016), which beat Go champion Lee Sedol in a series of 5 matches, 4-1; an Atari game-playing bot (Google, 2015). The above results use NNs in conjunction with reinforcement learning. State of the art in: speech recognition, machine translation, object detection, and other NLP tasks.

Feedforward Neural Networks. All connections flow forward (no loops); each layer of hidden units is fully connected to the next. Figure from Goldberg (2015).

Inference in a FF Neural Network. Perform computations forwards through the graph: h_1 = g_1(x W_1 + b_1); h_2 = g_2(h_1 W_2 + b_2); y = h_2 W_3. Note that we are now representing each layer as a vector, combining all of the weights in a layer across the units into a weight matrix.
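A minimal NumPy sketch of this forward pass (an illustration; the choices of g_1 and g_2 are assumptions, since the slide leaves them abstract):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, W3, g1=np.tanh, g2=np.tanh):
    """Forward pass through the two-hidden-layer network above."""
    h1 = g1(x @ W1 + b1)     # first hidden layer
    h2 = g2(h1 @ W2 + b2)    # second hidden layer
    y = h2 @ W3              # linear output layer (scores)
    return h1, h2, y
```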

Activation Function. In one unit: a nonlinearity applied to a linear combination of inputs and weight values, h_1 = g_1(x W_1 + b_1). Popular choices: the sigmoid function (just like logistic regression!), the tanh function, and the rectifier/ramp function g(x) = max(0, x). Why do we need the non-linearity?
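The three activation functions mentioned, as NumPy one-liners (illustration only):

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))   # squashes values into (0, 1)
tanh = np.tanh                                 # squashes values into (-1, 1)
relu = lambda v: np.maximum(0.0, v)            # rectifier/ramp: max(0, v)
```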

Softmax Layer. In NLP, we often care about discrete outcomes, e.g., words, POS tags, topic labels. The output layer can be constructed such that the output values sum to one: let x = (x_1, ..., x_k); softmax(x_i) = exp(x_i) / sum_{j=1}^{k} exp(x_j). Interpretation: unit x_i represents the probability that the outcome is i. Essentially, the last layer is like a multinomial logistic regression.
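A NumPy sketch of a softmax layer (the max-subtraction for numerical stability is my addition, not from the slide):

```python
import numpy as np

def softmax(x):
    """Turn a vector of scores into a probability distribution."""
    z = x - np.max(x)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # sums to 1.0
```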

Loss Function. A neural network is optimized with respect to a loss function, which measures how much error it is making on predictions: y is the correct, gold-standard distribution over class labels; ŷ is the system's predicted distribution over class labels; L(ŷ, y) is the loss function between the two. A popular choice for classification (usually with a softmax output layer) is cross-entropy: L_CE(ŷ, y) = -sum_i y_i log(ŷ_i).
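Cross-entropy in NumPy (the small epsilon to avoid log(0) is my addition):

```python
import numpy as np

def cross_entropy(y_hat, y):
    """L_CE(y_hat, y) = -sum_i y_i * log(y_hat_i)."""
    return -np.sum(y * np.log(y_hat + 1e-12))  # epsilon avoids log(0)

# With a one-hot gold vector, this is just -log of the predicted probability
# of the correct class.
print(cross_entropy(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0])))
```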

Training Neural Networks. Typically done by stochastic gradient descent: for one training example, find the gradient of the loss function with respect to the parameters of the network (i.e., the weights of each layer), then travel along in that direction. The network has very many parameters! Backpropagation (Rumelhart et al., 1986) is an efficient algorithm to compute the gradient with respect to all parameters. It boils down to an efficient way to use the chain rule of derivatives to propagate the error signal from the loss function backwards through the network to the inputs.

SGD Overview. Inputs: the function computed by the neural network, f(x; θ); training samples {(x_k, y_k)}; a loss function L. Repeat for a while: sample a training case (x_k, y_k); compute the loss L(f(x_k; θ), y_k) (forward pass); compute the gradient ∇L(x_k) with respect to the parameters θ (by backpropagation); update θ ← θ - η ∇L(x_k). Return θ.
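A self-contained toy version of this loop (my own illustration): SGD for binary logistic regression with a cross-entropy loss, where the gradient for one example has the closed form (p - y) x:

```python
import numpy as np

def sgd_train(samples, dim, eta=0.5, epochs=500):
    """Minimal SGD loop for binary logistic regression."""
    theta = np.zeros(dim + 1)                       # weights plus a bias term
    for _ in range(epochs):
        np.random.shuffle(samples)
        for x, y in samples:                        # sample a training case
            x1 = np.append(x, 1.0)                  # fold the bias into the input
            p = 1.0 / (1.0 + np.exp(-theta @ x1))   # forward pass
            grad = (p - y) * x1                     # gradient of the loss wrt theta
            theta -= eta * grad                     # theta <- theta - eta * grad
    return theta

# Toy usage: learn the OR of two binary features.
data = [(np.array([i, j]), float(i or j)) for i in (0, 1) for j in (0, 1)]
print(sgd_train(data, dim=2))
```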

Example: Forward Pass. h_1 = g_1(x W_1 + b_1); h_2 = g_2(h_1 W_2 + b_2); f(x) = ŷ = g_3(h_2) = h_2 W_3. Loss function: L(ŷ, y_gold). Save the values for h_1, h_2, ŷ too!

Example Cont'd: Backpropagation. Need to compute ∂L/∂W_i for each weight matrix. By calculus and the chain rule, since f(x) = g_3(g_2(g_1(x))): ∂L/∂W_3 = (∂L/∂g_3)(∂g_3/∂W_3); ∂L/∂W_2 = (∂L/∂g_3)(∂g_3/∂g_2)(∂g_2/∂W_2); ∂L/∂W_1 = (∂L/∂g_3)(∂g_3/∂g_2)(∂g_2/∂g_1)(∂g_1/∂W_1). Notice the overlapping computations across ∂L/∂W_3, ∂L/∂W_2, ∂L/∂W_1? Be sure to do this in a smart order to avoid redundant computations!
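A compact NumPy sketch of this forward + backward computation. The choices of tanh hidden activations and a softmax + cross-entropy loss are assumptions to make the example concrete; the slides leave g_1, g_2, g_3, and L abstract:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def forward_backward(x, y_gold, W1, b1, W2, b2, W3):
    """One forward and backward pass for the 3-layer network on the slides.

    x: input vector, y_gold: index of the correct class.
    """
    # Forward pass: save h1, h2 for reuse in the backward pass.
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    y = h2 @ W3
    p = softmax(y)
    loss = -np.log(p[y_gold] + 1e-12)

    # Backward pass: chain rule from the loss back toward the inputs.
    d_y = p.copy(); d_y[y_gold] -= 1.0       # dL/dy for softmax + cross-entropy
    dW3 = np.outer(h2, d_y)                  # dL/dW3
    d_h2 = (W3 @ d_y) * (1.0 - h2 ** 2)      # chain through W3 and tanh
    dW2, db2 = np.outer(h1, d_h2), d_h2      # dL/dW2, dL/db2 (reuses d_h2)
    d_h1 = (W2 @ d_h2) * (1.0 - h1 ** 2)     # shared factors reused again
    dW1, db1 = np.outer(x, d_h1), d_h1       # dL/dW1, dL/db1
    return loss, (dW1, db1, dW2, db2, dW3)
```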

Example: Time-Delay Neural Network. Let's draw a neural network architecture for POS tagging using a feedforward neural network. We'll construct a context window around each word, and predict the POS tag of that word as the output. Limitations of this approach?
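One way the windowed input might be assembled, as a NumPy sketch (my own illustration; the zero-vector padding and all names here are assumptions):

```python
import numpy as np

def window_input(embeddings, sent_ids, i, width=2):
    """Concatenate embeddings of the words in a window around position i.

    embeddings: (vocab_size, d) matrix; sent_ids: word ids for one sentence.
    Out-of-range positions are padded with a zero vector.
    """
    d = embeddings.shape[1]
    vecs = []
    for j in range(i - width, i + width + 1):
        if 0 <= j < len(sent_ids):
            vecs.append(embeddings[sent_ids[j]])
        else:
            vecs.append(np.zeros(d))
    return np.concatenate(vecs)  # feed this into a feedforward POS classifier
```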

Recurrent Neural Networks. A neural network sequence model: RNN(s_0, x_1:n) = s_1:n, y_1:n, where s_i = R(s_i-1, x_i) (s_i: state vector) and y_i = O(s_i) (y_i: output vector). R and O are parts of the neural network that compute the next state vector and the output vector.
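A minimal NumPy sketch of a simple (Elman-style) RNN, one concrete choice for the abstract R and O above (parameter names are mine):

```python
import numpy as np

def rnn_step(s_prev, x, W_s, W_x, b, W_o):
    """One recurrence step: s_i = R(s_{i-1}, x_i), y_i = O(s_i)."""
    s = np.tanh(s_prev @ W_s + x @ W_x + b)   # next state vector
    y = s @ W_o                               # output vector
    return s, y

def rnn_run(s0, xs, params):
    """Unroll the RNN over an input sequence xs, starting from state s0."""
    s, outputs = s0, []
    for x in xs:
        s, y = rnn_step(s, x, *params)
        outputs.append(y)
    return s, outputs
```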

Long-Term Dependencies in Language. There can be dependencies between words that are arbitrarily far apart: "I will look the word that you have described that doesn't make sense to her up." Can you think of some other examples in English of long-range dependencies? These cannot easily be modelled with HMMs or even LC-CRFs, but can with RNNs.

Vanishing and Exploding Gradients. If R and O are simple fully connected layers, we have a problem. In the unrolled network, the gradient signal can get lost on its way back to the words far in the past. Suppose it is W_1 that we want to modify, and there are N layers between it and the loss function: ∂L/∂W_1 = (∂L/∂g_N)(∂g_N/∂g_N-1) ... (∂g_2/∂g_1)(∂g_1/∂W_1). If the gradient norms are small (<1), the gradient will vanish to near-zero (or explode to near-infinity if >1). This happens especially because we have repeated applications of the same weight matrices in the recurrence.
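A quick NumPy illustration (mine, not from the slides) of how repeatedly multiplying by the same small weight matrix shrinks a gradient-like vector toward zero; scaling the matrix up instead makes the norm blow up:

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(50, 50) * 0.05    # a "small" recurrent weight matrix
g = np.random.randn(50)               # stand-in for the gradient at the loss
for step in range(100):
    g = W.T @ g                       # repeated application of the same matrix
    if step % 25 == 0:
        print(step, np.linalg.norm(g))  # norm shrinks toward zero
```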

Long Short-Term Memory Networks. Currently one of the most popular RNN architectures for NLP (Hochreiter and Schmidhuber, 1997). Explicitly models a memory cell (i.e., a hidden-layer vector), and how it is updated as a response to the current input and the previous state of the memory cell. Visual step-by-step explanation: http://colah.github.io/posts/2015-08-understanding-lstms/
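A NumPy sketch of one LSTM step in the standard formulation (parameter names and shapes are my assumptions, not from the slides):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(c_prev, h_prev, x, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step: update the memory cell c and the hidden state h."""
    z = np.concatenate([h_prev, x])              # previous state and current input
    f = sigmoid(z @ Wf + bf)                     # forget gate: what to erase
    i = sigmoid(z @ Wi + bi)                     # input gate: what to write
    o = sigmoid(z @ Wo + bo)                     # output gate: what to expose
    c = f * c_prev + i * np.tanh(z @ Wc + bc)    # memory cell update
    h = o * np.tanh(c)                           # new hidden state
    return c, h
```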

Fix for Vanishing Gradients. It is the fact that the LSTM can propagate a cell state directly that fixes the vanishing gradient problem: there is no repeated weight application between the internal cell states across time!

Hardware for NNs. Common operations in inference and learning: matrix multiplication and component-wise operations (e.g., activation functions). These operations are highly parallelizable! Graphics processing units (GPUs) are specifically designed to perform this type of computation efficiently.

Packages for Implementing NNs. TensorFlow https://www.tensorflow.org/, PyTorch http://pytorch.org/, Caffe http://caffe.berkeleyvision.org/, Theano http://deeplearning.net/software/theano/. These packages support GPU and CPU computation; you write interface code in a high-level programming language, like Python.
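As a rough sketch of what such interface code looks like, here is one training step for a small feedforward classifier in PyTorch (one of the listed packages); the layer sizes and hyperparameters are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# A small feedforward classifier: two linear layers with a tanh nonlinearity.
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.Tanh(),
    nn.Linear(50, 10),
)
loss_fn = nn.CrossEntropyLoss()                      # softmax + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 100)                             # a batch of 32 random inputs
y = torch.randint(0, 10, (32,))                      # gold class labels
loss = loss_fn(model(x), y)                          # forward pass
loss.backward()                                      # backpropagation
optimizer.step()                                     # SGD update
```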

Summary: Advantages of NNs. They learn relationships between inputs and outputs: complex features, and dependencies between inputs and states over long ranges with no fixed-horizon assumption (i.e., non-Markovian). They reduce the need for feature engineering and make more efficient use of input data via weight sharing. The architecture is highly flexible and generic: multi-task learning (jointly train a model that solves multiple tasks simultaneously) and transfer learning (take part of a neural network used for an initial task and use it as an initialization for a second, related task).

Summary: Challenges of NNs. Complex models may need a lot of training data. There are many fiddly hyperparameters to tune, with little guidance on how to do so except empirically or through experience: learning rate, number of hidden units, number of hidden layers, how to connect units, non-linearity, loss function, how to sample data, training procedure, etc. It can be difficult to interpret the output of a system: why did the model predict a certain label? You have to examine the weights in the network. This matters when you need to convince people to act on the outputs of the model.

NNs for NLP. Neural networks have taken over mainstream NLP since 2014; most empirical work at recent conferences uses them in some way. Lots of interesting open research questions: How can linguistic structure (e.g., word senses, parses, other resources) be used with NNs, either as input or output? When is linguistic feature engineering a good idea, rather than just throwing more data with a simple representation at the NN to learn the features? Multi-task and transfer learning for NLP. Defining and solving new, challenging NLP tasks.