Neural Networks for NLP COMP-599 Nov 30, 2016

Outline: Neural networks and deep learning, an introduction; feedforward neural networks; word2vec; complex neural network architectures: convolutional neural networks; recurrent and recursive neural networks.

Reference: Many of the figures and explanations are drawn from this reference: A Primer on Neural Network Models for Natural Language Processing. Yoav Goldberg. 2015. http://u.cs.biu.ac.il/~yogo/nnlp.pdf

Classification Review: y = f(x), where y is the output label, f is the classifier, and x is the input. Represent the input x as a list of features; for example, a document is mapped to feature values x_1, ..., x_8 = 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0.

Logistic Regression: Linear regression: y = a_1 x_1 + a_2 x_2 + ... + a_n x_n + b. Intuition: linear regression gives us continuous values in (-∞, ∞); let's squish the values to be in [0, 1]! The function that does this is the logistic function: P(y|x) = (1/Z) exp(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b), where Z is a normalizing constant that ensures this is a probability distribution. (a.k.a. the maximum entropy or MaxEnt classifier.) N.B.: don't be confused by the name; this method is most often used to solve classification problems.
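
As a concrete illustration, here is a minimal NumPy sketch (mine, not code from the lecture) of the probability a binary logistic regression classifier assigns, using the 8-dimensional feature vector from the classification review slide; the all-zero weights are a placeholder a real model would learn.

    import numpy as np

    def logistic_regression_prob(x, a, b):
        """P(y=1|x) for binary logistic regression with weights a and bias b."""
        z = np.dot(a, x) + b             # linear combination a_1*x_1 + ... + a_n*x_n + b
        return 1.0 / (1.0 + np.exp(-z))  # logistic (sigmoid) squishes the score into (0, 1)

    x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0])  # example feature vector
    a = np.zeros(8)                                          # hypothetical weights
    print(logistic_regression_prob(x, a, b=0.0))             # 0.5 with all-zero weights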

Linear Model: Logistic regression, support vector machines, etc. are examples of linear models: P(y|x) = (1/Z) exp(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b) is a linear combination of feature weights and values. Linear models cannot learn complex, non-linear functions from input features to output labels (without adding features), e.g., "starts with a capital AND is not at the beginning of the sentence -> proper noun".

(Artificial) Neural Networks: A kind of learning model which automatically learns non-linear functions from input to output. Biologically inspired metaphor: a network of computational units called neurons. Each neuron takes scalar inputs and produces a scalar output, very much like a logistic regression model: Neuron(x) = g(a_1 x_1 + a_2 x_2 + ... + a_n x_n + b). As a whole, the network can theoretically compute any computable function, given enough neurons. (These notions can be formalized.)
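
A single neuron is just a linear combination followed by a nonlinearity; a minimal NumPy sketch (illustrative, with tanh as an assumed choice of g):

    import numpy as np

    def neuron(x, a, b, g=np.tanh):
        """One unit: g(a_1*x_1 + ... + a_n*x_n + b)."""
        return g(np.dot(a, x) + b)

    print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.5]), b=0.1))  # tanh(-0.4) ≈ -0.38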

Responsible For: AlphaGo (Google, 2016), which beat Go champion Lee Sedol in a series of 5 matches, 4-1; the Atari game-playing bot (Google, 2015); state of the art in speech recognition, machine translation, object detection, and many other NLP tasks.

Feedforward Neural Networks: All connections flow forward (no loops); each layer of hidden units is fully connected to the next. Figure from Goldberg (2015).

Inference in a FF Neural Network: Perform computations forwards through the graph: h_1 = g_1(x W_1 + b_1), h_2 = g_2(h_1 W_2 + b_2), y = h_2 W_3. Note that we are now representing each layer as a vector, combining all of the weights of the units in a layer into a weight matrix.
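
The forward pass above, written out in NumPy as a sketch (the layer sizes and random weights are arbitrary assumptions for illustration; the row-vector convention matches the slide's x W + b form):

    import numpy as np

    def forward(x, W1, b1, W2, b2, W3, g1=np.tanh, g2=np.tanh):
        """Forward pass of a two-hidden-layer feedforward network."""
        h1 = g1(x @ W1 + b1)    # first hidden layer
        h2 = g2(h1 @ W2 + b2)   # second hidden layer
        y = h2 @ W3             # output layer (no nonlinearity here)
        return y

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1, 8))                     # one input with 8 features
    W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
    W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
    W3 = rng.normal(size=(16, 3))                   # 3 output classes
    print(forward(x, W1, b1, W2, b2, W3).shape)     # (1, 3)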

Activation Function: In one unit: a linear combination of inputs and weight values, followed by a nonlinearity: h_1 = g_1(x W_1 + b_1). Popular choices: the sigmoid function (just like logistic regression!), the tanh function, and the rectifier/ramp function g(x) = max(0, x). Why do we need the non-linearity?
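
Without a nonlinearity, stacking layers would collapse into a single linear transformation, so the network could only learn linear functions. The three popular choices above, sketched in NumPy (my illustration, not lecture code):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))  # squashes to (0, 1)

    def tanh(x):
        return np.tanh(x)                # squashes to (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)        # rectifier/ramp: max(0, x)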

Softmax Layer: In NLP, we often care about discrete outcomes, e.g., words, POS tags, topic labels. The output layer can be constructed such that the output values sum to one: let x = (x_1, ..., x_k); then softmax(x_i) = exp(x_i) / sum_j exp(x_j). Interpretation: unit x_i represents the probability that the outcome is i. Essentially, the last layer is like a multinomial logistic regression.
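
A small NumPy sketch of the softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick I have added, not something the slide discusses, and it does not change the result.

    import numpy as np

    def softmax(x):
        """Map a score vector x to a probability distribution."""
        exps = np.exp(x - np.max(x))   # shift for numerical stability
        return exps / np.sum(exps)     # exp(x_i) / sum_j exp(x_j)

    print(softmax(np.array([1.0, 2.0, 3.0])))  # sums to 1; the largest score gets the highest probability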

Loss Function: A neural network is optimized with respect to a loss function, which measures how much error it is making on predictions: y is the correct, gold-standard distribution over class labels; ŷ is the system's predicted distribution over class labels; L(y, ŷ) is the loss function between the two. A popular choice for classification (usually with a softmax output layer) is cross entropy: L_ce(y, ŷ) = -sum_i y_i log(ŷ_i).
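
Cross entropy in NumPy, as a sketch (the small epsilon guarding against log(0) is my addition):

    import numpy as np

    def cross_entropy(y_true, y_pred, eps=1e-12):
        """L_ce(y, y_hat) = -sum_i y_i * log(y_hat_i)."""
        return -np.sum(y_true * np.log(y_pred + eps))

    y_true = np.array([0.0, 1.0, 0.0])    # gold label is class 1 (one-hot)
    y_pred = np.array([0.1, 0.7, 0.2])    # system's predicted distribution
    print(cross_entropy(y_true, y_pred))  # -log(0.7) ≈ 0.357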

Training Neural Networks: Typically done by stochastic gradient descent. For one training example, find the gradient of the loss function with respect to the parameters of the network (i.e., the weights of each layer), and take a step in the direction that decreases the loss. The network has very many parameters! There is an efficient algorithm to compute the gradient with respect to all parameters: backpropagation (Rumelhart et al., 1986). It boils down to an efficient way to use the chain rule of derivatives to propagate the error signal from the loss function backwards through the network, back to the inputs.

SGD Overview: Inputs: the function computed by the neural network, f(x; θ); training samples {(x_k, y_k)}; a loss function L. Repeat for a while: sample a training case (x_k, y_k); compute the loss L(f(x_k; θ), y_k); compute the gradient of L with respect to the parameters θ (by backpropagation); update θ ← θ - η ∇_θ L. Return θ.
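
The loop in Python, as a minimal sketch; `grad_loss` is a hypothetical function standing in for the gradient computed by backpropagation, and `theta` is assumed to be a NumPy array of parameters.

    import random

    def sgd(theta, data, grad_loss, lr=0.1, steps=1000):
        """Minimal stochastic gradient descent loop."""
        for _ in range(steps):
            x_k, y_k = random.choice(data)    # sample a training case
            g = grad_loss(theta, x_k, y_k)    # gradient of the loss wrt theta (via backpropagation)
            theta = theta - lr * g            # step against the gradient
        return theta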

Example: Time-Delay Neural Network: Let's draw a neural network architecture for POS tagging using a feedforward neural network. We'll construct a context window around each word, and predict the POS tag of that word as the output.
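
One way to realize the context-window idea (a sketch, not the lecture's exact architecture): concatenate the embeddings of the words in a window around position i and feed the resulting vector to a feedforward network that outputs a distribution over POS tags.

    import numpy as np

    def context_window_features(embeddings, i, width=2):
        """Concatenate the embeddings of the words within +/- width of position i.
        `embeddings` is a (sentence_length, dim) array; positions outside the
        sentence are padded with zero vectors."""
        n, dim = embeddings.shape
        parts = []
        for j in range(i - width, i + width + 1):
            if 0 <= j < n:
                parts.append(embeddings[j])
            else:
                parts.append(np.zeros(dim))   # padding at sentence boundaries
        return np.concatenate(parts)          # shape: ((2*width + 1) * dim,)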

Unsupervised Learning: Neural networks can be used for unsupervised learning. The usual approach: the outputs to be predicted are derived from the input. A popular task is representation learning: learn an embedding or representation of some vocabulary of objects using a neural network. Usually, these embeddings are vectors which are taken from a hidden layer of the network. These embeddings can then be used to compute similarity or relatedness, or in downstream tasks (remember distributional semantics).

word2vec (Mikolov et al., 2013): Learn vector representations of words. There are actually two models: continuous bag of words (CBOW), which uses the context words to predict a target word, and skip-gram, which uses the target word to predict the context words. In both cases, the representation that is associated with the target word is the embedding that is learned.
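
To make the skip-gram setup concrete, here is a sketch of how (target, context) training pairs can be generated from a token sequence; the window size is a hyperparameter, and this is illustrative rather than Mikolov et al.'s actual implementation.

    def skipgram_pairs(tokens, window=2):
        """Yield (target, context) pairs: skip-gram predicts each context word from the target."""
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield target, tokens[j]

    print(list(skipgram_pairs(["the", "cat", "sat"], window=1)))
    # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]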

word2vec Architectures: figure from Mikolov et al. (2013).

Validation: Word similarity; the analogical reasoning task: a : b :: c : d, what is d? New York : New York Times :: Baltimore : ? (Answer: Baltimore Sun). man : king :: woman : ? (Answer: queen). Solve by ranking vocabulary items according to the assumption v_a - v_b = v_c - v_d, or v_d = v_b - v_a + v_c (? = king - man + woman).
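
A sketch of the ranking procedure, assuming a dictionary `vectors` mapping words to NumPy arrays; ranking by cosine similarity and excluding the query words follow the standard evaluation setup, and the function name is my own.

    import numpy as np

    def solve_analogy(a, b, c, vectors):
        """Return the word d maximizing cosine similarity to v_b - v_a + v_c."""
        target = vectors[b] - vectors[a] + vectors[c]
        best_word, best_sim = None, -np.inf
        for word, v in vectors.items():
            if word in (a, b, c):
                continue                    # exclude the query words themselves
            sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # e.g., solve_analogy("man", "king", "woman", vectors) should rank "queen" highly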

Impact of Hyperparameters: There are many hyperparameters that are important to the performance of word2vec, e.g., the weighting of the relative importance of context words, and the procedure for sampling target and context words during training. These have a large impact on performance (Levy et al., 2015). Applying analogous changes to previous approaches such as Singular Value Decomposition results in similar performance on word similarity tasks.

Use of Word Embeddings in NLP: You can download pretrained word2vec embeddings from the web and use the word vectors as features in other NLP tasks; this is now very widespread. Advantages: the vectors are trained on a large amount of general-domain text, a proxy for background knowledge, and they are cheap and easy to try. Disadvantages: it doesn't always work (especially if you have a lot of task-specific data).

Other Neural Network Architectures: convolutional neural networks, recurrent neural networks, and recursive neural networks.

Convolutional Neural Networks (CNNs): Apply a neural network to a window around elements in a sequence, then pool the results into a single vector. Convolution layer: much like the TDNN. Pooling layer: combines the output vectors of the convolution layer. Figure from Goldberg (2015).
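
A minimal sketch of a 1-D convolution over a sequence of word vectors followed by max pooling (illustrative NumPy with assumed shapes and a tanh nonlinearity, not the exact architecture in the figure):

    import numpy as np

    def conv_and_pool(X, W, b, width=3):
        """X: (sequence_length, dim) word vectors; W: (width * dim, num_filters); b: (num_filters,).
        Apply the same filter weights to every window, then max-pool over positions."""
        n, dim = X.shape
        outputs = []
        for i in range(n - width + 1):
            window = X[i:i + width].reshape(-1)      # concatenate the window's word vectors
            outputs.append(np.tanh(window @ W + b))  # convolution layer (much like the TDNN)
        return np.max(np.stack(outputs), axis=0)     # pooling layer: one value per filter

    rng = np.random.default_rng(0)
    X = rng.normal(size=(7, 4))                      # 7 words, 4-dimensional embeddings
    W, b = rng.normal(size=(3 * 4, 5)), np.zeros(5)  # width-3 filters, 5 feature maps
    print(conv_and_pool(X, W, b).shape)              # (5,)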

Uses of CNNs: Widely used in image processing for object detection. Intuition: convolution + pooling is a kind of search through the data for the features that make up some object. In NLP, CNNs are effective at similar tasks, which require searching for an indicative part of a sentence, e.g., Kalchbrenner et al. (2014) apply CNNs to sentiment analysis and question type classification.

Recurrent Neural Networks: A neural network sequence model: RNN(s_0, x_{1:n}) = s_{1:n}, y_{1:n}, where s_i = R(s_{i-1}, x_i) (s_i: state vector) and y_i = O(s_i) (y_i: output vector). R and O are the parts of the neural network that compute the next state vector and the output vector.
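
A simple (Elman-style) instantiation of R and O, as a sketch; the tanh recurrence and linear output layer are standard choices I am assuming, not necessarily the ones in the lecture.

    import numpy as np

    def rnn(x_seq, s0, W_s, W_x, b, W_o):
        """Run a simple RNN: s_i = tanh(s_{i-1} W_s + x_i W_x + b), y_i = s_i W_o."""
        states, outputs = [], []
        s = s0
        for x in x_seq:
            s = np.tanh(s @ W_s + x @ W_x + b)  # R: next state from previous state and input
            states.append(s)
            outputs.append(s @ W_o)             # O: output from the current state
        return states, outputs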

Long Short-Term Memory Networks: Currently one of the most popular RNN architectures for NLP (Hochreiter and Schmidhuber, 1997). It explicitly models a memory cell (i.e., a hidden-layer vector) and how it is updated in response to the current input and the previous state of the memory cell. Visual step-by-step explanation: http://colah.github.io/posts/2015-08-understanding-lstms/

Recursive Neural Networks: Similar to recurrent neural networks, but with a (static) tree structure. What makes it recursive is that the composition function at each step is the same (the weights are tied) (Socher et al., 2013).

Advantages of Neural Networks: They learn relationships between inputs and outputs: complex features and dependencies between inputs and states over long ranges, with no fixed-horizon assumption (i.e., non-Markovian). They reduce the need for feature engineering and make more efficient use of input data via weight sharing. The architecture is highly flexible and generic, supporting multi-task learning (jointly train a model that solves multiple tasks simultaneously) and transfer learning (take part of a neural network used for an initial task and use it as the initialization for a second, related task).

Challenges of Neural Networks in NLP: Complex models may need a lot of training data. There are many fiddly hyperparameters to tune, with little guidance on how to do so except empirically or through experience: learning rate, number of hidden units, number of hidden layers, how to connect units, non-linearity, loss function, how to sample data, training procedure, etc. It can be difficult to interpret the output of a system: why did the model predict a certain label? You have to examine the weights in the network. Interpretability is important for convincing people to act on the outputs of the model!

NNs for NLP: Neural networks have taken over mainstream NLP since 2014; most empirical work at recent conferences uses them in some way. There are lots of interesting open research questions: How to use linguistic structure (e.g., word senses, parses, other resources) with NNs, either as input or output? When is linguistic feature engineering a good idea, rather than just throwing more data with a simple representation at the NN and letting it learn the features? Multi-task and transfer learning for NLP. Defining and solving new, challenging NLP tasks.

References: Mikolov, Sutskever, Chen, Corrado, and Dean. Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013. Hochreiter and Schmidhuber. Long Short-Term Memory. Neural Computation, 1997. Kalchbrenner, Grefenstette, and Blunsom. A Convolutional Neural Network for Modelling Sentences. ACL 2014. https://arxiv.org/pdf/1404.2188v1.pdf Levy, Goldberg, and Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. TACL 2015. Socher, Perelygin, Wu, Chuang, Manning, Ng, and Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.