Deep Learning for NLP


Deep Learning for NLP. Instructor: Wei Xu, Ohio State University, CSE 5525. Many slides from Greg Durrett.

Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

Sentiment Analysis: "the movie was very good"

Sentiment Analysis with Linear Models. Example, feature (feature type):
the movie was very good: I[good] (unigram)
the movie was very bad: I[bad] (unigram)
the movie was not bad: I[not bad] (bigram)
the movie was not very good: I[not very good] (trigram)
the movie was not really very enjoyable: a 4-gram!
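
To make the feature templates concrete, here is a minimal sketch (not from the slides; the tokenization and feature naming are illustrative) of extracting indicator n-gram features for a linear sentiment classifier:

```python
# Minimal sketch: indicator n-gram features for a linear sentiment classifier.
# Feature names like "I[not bad]" mirror the notation on the slide; the
# classifier that would consume them is assumed, not shown.
from collections import Counter

def ngram_features(tokens, max_n=2):
    """Return a sparse feature vector of indicator n-gram counts."""
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats["I[" + " ".join(tokens[i:i + n]) + "]"] += 1
    return feats

print(ngram_features("the movie was not bad".split()))
# includes I[not bad], I[was not], ... as well as the unigrams
```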

Drawbacks: more complex features capture interactions but scale badly (13M unigrams, 1.3B 4-grams in Google n-grams). Can we do better than seeing every n-gram once in the training data ("not very good" vs. "not so great", "the movie was not really very enjoyable")? Instead of more complex linear functions, let's use simpler nonlinear functions, namely neural networks.

Neural Networks: XOR. Let's see how we can use neural nets to learn a simple nonlinear function. Inputs x1, x2 (generally x = (x1, ..., xm)); output y (generally y = (y1, ..., yn)). Here y = x1 XOR x2:
x1 x2 | x1 XOR x2
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

Neural Networks: XOR. A purely linear function y = a1 x1 + a2 x2 cannot represent XOR, so we add a nonlinear term: y = a1 x1 + a2 x2 + a3 tanh(x1 + x2). (tanh looks like the action potential in a neuron.)

Neural Networks: XOR. Choosing concrete coefficients, e.g. y = -x1 - x2 + 2 tanh(x1 + x2), gives a function whose output is clearly higher exactly when x1 XOR x2 = 1.

Neural Networks: XOR. The same trick works for sentiment features: with x1 = I[not] and x2 = I[good], a unit such as y = -2x1 - x2 + 2 tanh(x1 + x2) scores "good" alone positively and "not good" negatively.
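
As a quick sanity check, here is a minimal sketch of the hand-built XOR unit; the coefficients (-1, -1, 2) are the ones reconstructed above, not taken verbatim from the slides:

```python
# Minimal sketch (assumes the reconstructed coefficients -1, -1, 2): a single
# tanh unit added to a linear function approximates XOR.
import math

def xor_unit(x1, x2):
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, x1 ^ x2, round(xor_unit(x1, x2), 3))
# (0,0) -> 0.0, (0,1)/(1,0) -> ~0.52, (1,1) -> ~-0.07:
# the output is clearly higher exactly when XOR is 1.
```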

Neural Networks. Linear model: y = w·x + b. Neural network: y = g(w·x + b): the weights w warp the space, the bias b shifts it, and g applies a pointwise nonlinear transformation. Taken from http://colah.github.io/posts/2014-03-nn-manifolds-topology/

Neural Networks. Linear classifier vs. neural network: the neural network succeeds because we transformed the space! Taken from http://colah.github.io/posts/2014-03-nn-manifolds-topology/

Deep Neural Networks (this was our neural net from the XOR example): y1 = g(w1 · x + b1). Adapted from Chris Dyer.

Deep Neural Networks: y1 = g(w1 · x + b1). Adapted from Chris Dyer.

Deep Neural Networks (input, hidden layer, output): z = g(V g(W x + b) + c). Writing y = g(W x + b) for the output of the first (hidden) layer, this is z = g(V y + c). Adapted from Chris Dyer.
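
A minimal numpy sketch of this two-layer computation (the layer sizes and the choice of tanh for g are illustrative assumptions):

```python
# Minimal sketch: the two-layer network z = g(V g(Wx + b) + c) as a forward pass.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 10, 4, 2

W, b = rng.normal(size=(d_hid, d_in)), np.zeros(d_hid)   # input -> hidden
V, c = rng.normal(size=(d_out, d_hid)), np.zeros(d_out)  # hidden -> output
g = np.tanh                                              # elementwise nonlinearity

x = rng.normal(size=d_in)   # some input vector
y = g(W @ x + b)            # output of the first (hidden) layer
z = g(V @ y + c)            # output layer
print(y, z)
```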

Neural Networks. Linear classifier vs. neural network: the neural network succeeds because we transformed the space! Taken from http://colah.github.io/posts/2014-03-nn-manifolds-topology

Deep Neural Networks Taken from http://colah.github.io/posts/2014-03-nn-manifolds-topology/

Deep Neural Networks (input, hidden layer, output): z = g(V g(W x + b) + c), where g(W x + b) is the output of the first layer. With no nonlinearity g, this collapses to z = VWx + Vb + c, which is equivalent to a single linear layer z = Ux + d: stacking linear layers buys nothing without the nonlinearity. Adapted from Chris Dyer.
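
A quick numerical check of that collapse (the random shapes are arbitrary):

```python
# Minimal sketch: without a nonlinearity, two stacked linear layers are
# exactly one linear layer with U = V @ W and d = V @ b + c.
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.normal(size=(4, 10)), rng.normal(size=4)
V, c = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=10)

two_layers = V @ (W @ x + b) + c
one_layer = (V @ W) @ x + (V @ b + c)
print(np.allclose(two_layers, one_layer))  # True
```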

Deep Neural Networks (input, hidden layer, output). Nodes in the hidden layer can learn interactions or conjunctions of features: with x1 = I[not] and x2 = I[good], the unit y = -2x1 - x2 + 2 tanh(x1 + x2) contains a hidden node tanh(x1 + x2) that fires for "not OR good".

Learning Neural Networks (input, hidden layer, output). Training requires gradients: the change in the output w.r.t. the hidden layer, the change in the hidden layer w.r.t. the input, and, by chaining those two, the change in the output w.r.t. the input. Computing these looks like running this network in reverse (backpropagation).
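
Here is a minimal backpropagation sketch for the two-layer network; the squared-error loss, linear output layer, and tanh hidden layer are illustrative assumptions rather than anything specified on the slides:

```python
# Minimal backprop sketch: gradients flow output -> hidden -> input,
# i.e. the network "run in reverse".
import numpy as np

rng = np.random.default_rng(2)
W, b = rng.normal(size=(4, 10)), np.zeros(4)
V, c = rng.normal(size=(2, 4)), np.zeros(2)
x, t = rng.normal(size=10), np.array([1.0, 0.0])   # input and target

# forward pass
a = W @ x + b          # hidden pre-activation
y = np.tanh(a)         # hidden layer
z = V @ y + c          # output
loss = 0.5 * np.sum((z - t) ** 2)

# backward pass
dz = z - t                      # change in loss w.r.t. output
dV, dc = np.outer(dz, y), dz    # gradients for the output layer
dy = V.T @ dz                   # change in loss w.r.t. hidden layer
da = dy * (1 - y ** 2)          # through tanh: tanh'(a) = 1 - tanh(a)^2
dW, db = np.outer(da, x), da    # gradients for the input layer
dx = W.T @ da                   # change in loss w.r.t. the input
print(loss, dW.shape, dV.shape)
```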

Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

Feedforward Bag-of-Words: y = g(Wx + b), where x is a binary vector of length = vocabulary size (entries are indicators such as I[good], I[not], I[bad], I[a], I[to]) and W is a real-valued matrix with dims = vocabulary size (~10k) x hidden layer size (~100).
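
A minimal sketch of the input side (the tiny vocabulary is an illustrative assumption): turning a sentence into the binary bag-of-words vector x that feeds y = g(Wx + b):

```python
# Minimal sketch: binary bag-of-words input for a feedforward layer.
import numpy as np

vocab = {w: i for i, w in enumerate(
    ["the", "movie", "was", "not", "very", "good", "bad", "a", "to"])}

def bow_vector(tokens, vocab):
    x = np.zeros(len(vocab))
    for tok in tokens:
        if tok in vocab:
            x[vocab[tok]] = 1.0   # indicator I[tok]
    return x

x = bow_vector("the movie was not good".split(), vocab)
W = np.random.default_rng(3).normal(size=(5, len(vocab)))  # hidden x vocab; ~100 x ~10k in practice
y = np.tanh(W @ x)   # hidden layer of the feedforward bag-of-words model
```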

Drawbacks to FFBoW: lots of parameters to learn, and it doesn't preserve ordering in the input. Given "really not very good" and "really not very enjoyable", we don't know the relationship between good and enjoyable.

Word Embeddings. word2vec: turn each word into a 100-dimensional vector. Context-based embeddings: find a vector predictive of a word's context. Words in similar contexts will end up with similar vectors (plotted: dog, great, good, enjoyable, bad, is).

Feedforward with word vectors: "the movie was good ." / "the movie was great ." y = g(Wx + b), where x is the concatenation of the word vectors (a real-valued vector of length = sentence length (~10) x vector size (~100)) and W has dims = hidden layer size (~100) x (sentence length x vector size). Each component of x now represents multiple bits of input, so the model can capture word similarity.
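
A minimal sketch of this setup (the 4-dimensional toy embeddings are made up; real word2vec vectors would have ~100 dimensions):

```python
# Minimal sketch: feed a sentence to a feedforward layer by concatenating
# its word vectors.
import numpy as np

emb = {                       # toy stand-ins for ~100-dim word2vec vectors
    "the":   np.array([0.1, 0.0, 0.2, 0.0]),
    "movie": np.array([0.0, 0.5, 0.1, 0.3]),
    "was":   np.array([0.2, 0.1, 0.0, 0.0]),
    "good":  np.array([0.9, 0.8, 0.1, 0.2]),
    "great": np.array([0.8, 0.9, 0.2, 0.1]),  # close to "good"
    ".":     np.array([0.0, 0.0, 0.0, 0.1]),
}

def sentence_input(tokens):
    return np.concatenate([emb[t] for t in tokens])  # length = len(tokens) x dim

x = sentence_input("the movie was good .".split())
W = np.random.default_rng(4).normal(size=(8, x.size))
y = np.tanh(W @ x)
# Note: each weight in W is tied to a particular position in the sentence,
# which is exactly the drawback the next slides point out.
```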

Feedforward with word vectors: compare "the movie was good ." with "the movie was very good": the extra word shifts every following word vector to a different position in x, so y = g(Wx + b) sees them differently. We need our model to be shift-invariant, like bag-of-words is.

Comparing Architectures. Instead of more complex linear functions, let's use simpler nonlinear functions. Feedforward bag-of-words: didn't take advantage of word similarity, lots of parameters to learn. Feedforward with word vectors: our parameters are attached to particular indices in a sentence. Solution: convolutional neural nets.

Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

Convolutional Networks: "the movie was good ." A "good" filter slides over the sentence, producing outputs 0.03, 0.02, 0.1, 1.1, 0.0 at the five positions; max pooling keeps max = 1.1.

Convolutional Networks: "the movie was good ." with additional filters for "bad", "okay", and "terrible". The "good" filter still pools to max = 1.1, while the other filters pool to small values (0.1, 0.3, 0.1).

Convolutional Networks. Input: n word vectors of length m each; k filters of length m each produce k filter outputs of length 1 each (after taking the max over positions). This takes variable-length input and turns it into fixed-length output. Filters are initialized randomly and then learned. The pooled values (e.g. 1.1, 0.1, 0.3, 0.1) become features for a classifier, or input to another neural net layer.
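
A minimal numpy sketch of this convolution-plus-max-pooling operation (the random embeddings, the filter width of 2, and the filter count are illustrative assumptions):

```python
# Minimal sketch: k convolutional filters over a sentence of word vectors,
# followed by max pooling, giving a fixed-length output.
import numpy as np

rng = np.random.default_rng(5)
n, m, k, width = 6, 8, 3, 2               # sentence length, vec size, filters, filter width
sentence = rng.normal(size=(n, m))        # n word vectors of length m each
filters = rng.normal(size=(k, width, m))  # k filters spanning `width` words

def conv_max_pool(sentence, filters):
    n = len(sentence)
    k, width, _ = filters.shape
    pooled = np.full(k, -np.inf)
    for j in range(k):
        for i in range(n - width + 1):        # slide the filter over positions
            window = sentence[i:i + width]    # `width` consecutive word vectors
            pooled[j] = max(pooled[j], np.sum(filters[j] * window))
    return pooled                             # k outputs, fixed length

print(conv_max_pool(sentence, filters))  # same length no matter how long the sentence is
```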

Convolutional Networks: "the movie was great ." The same filter produces outputs 0.03, 0.02, 0.1, 1.8, 0.0, pooling to max = 1.8: word vectors for similar words are similar, so convolutional filters will have similar outputs.

Convolutional Networks: "the movie was not good ." A filter of width 2 spans the pair "not good", producing outputs 0.03, 0.05, 0.2, 1.5, 0.1 and pooling to max = 1.5. Analogous to bigram features in bag-of-words models.

Comparing Architectures. Instead of more complex linear functions, let's use simpler nonlinear functions. Convolutional networks let us take advantage of word similarity. Convolutional networks are translation-invariant like bag-of-words. Convolutional networks can capture local interactions with filters of width > 1 (e.g. "not good").

Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks to NLP Convolutional neural networks Application examples Tools

Sentence Classification: "the movie was not good ." passes through a convolutional layer, then a fully connected layer, to make a prediction.

Object Recognition: AlexNet (2012), a stack of convolutional layers followed by fully connected layers.

Neural networks: NNs are built from convolutional layers, fully connected layers, and some other types. We can chain these together into various architectures, and any neural network built this way can be learned from data!

Sentence Classification: "the movie was not good ." passes through a convolutional layer, then a fully connected layer, to make a prediction.

Sentence Classification: movie review sentiment, subjectivity/objectivity detection, product reviews, question type classification. Outperforms a highly-tuned bag-of-words model. Taken from Kim (2014).

Entity Linking. "Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999-2005." Candidates: Lance Edward Armstrong (an American former professional road cyclist)? Armstrong County (a county in Pennsylvania)? Conventional approach: compare tf-idf feature vectors for overlap. Convolutional networks can capture many of the same effects: they distill notions of topic from n-grams. Francis-Landau, Durrett, and Klein (NAACL 2016)

Entity Linking. The mention context and each candidate entity's description are run through convolutional layers to produce topic vectors: a similar topic vector indicates a probable link (Lance Edward Armstrong, the cyclist), a dissimilar one an improbable link (Armstrong County, Pennsylvania). Francis-Landau, Durrett, and Klein (NAACL 2016)

Syntactic Parsing. "He wrote a long report on Mars." is ambiguous: "on Mars" can attach to the noun phrase (a report on Mars, i.e. about Mars) or to the verb phrase (he wrote it while on Mars); the two parse trees differ in whether the PP joins the NP or the VP.

Chart Parsing. For each labeled span, chart value = score(rule) + chart(left child) + chart(right child); e.g. an NP over "a long report on Mars" is built by the rule NP -> NP PP from the chart entries for "a long report" and "on Mars".
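
To make the recurrence concrete, here is a minimal Viterbi chart parsing sketch; the toy grammar, lexicon, and constant rule score are assumptions, and in the parser described on the next slides the score would come from discrete features and/or a neural net:

```python
# Minimal sketch: chart[i,j,X] = max over rules X -> Y Z and split points k of
# score(rule, span) + chart[i,k,Y] + chart[k,j,Z].
from collections import defaultdict

def viterbi_chart(words, lexicon, rules, rule_score):
    n = len(words)
    chart = defaultdict(lambda: float("-inf"))     # (i, j, symbol) -> best score
    for i, w in enumerate(words):                  # width-1 spans from the lexicon
        for sym, s in lexicon.get(w, []):
            chart[i, i + 1, sym] = max(chart[i, i + 1, sym], s)
    for width in range(2, n + 1):                  # build wider spans bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):              # split point
                for parent, (left, right) in rules:
                    s = (rule_score(parent, left, right, i, k, j)
                         + chart[i, k, left] + chart[k, j, right])
                    chart[i, j, parent] = max(chart[i, j, parent], s)
    return chart

words = "He wrote a report".split()
lexicon = {"He": [("NP", 0.0)], "wrote": [("V", 0.0)],
           "a": [("D", 0.0)], "report": [("N", 0.0)]}
rules = [("NP", ("D", "N")), ("VP", ("V", "NP")), ("S", ("NP", "VP"))]
score = lambda *args: 1.0          # stand-in for w.f(rule, span) or a neural score
print(viterbi_chart(words, lexicon, rules, score)[0, 4, "S"])  # 3.0
```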

Syntactic Parsing: score(NP -> NP PP, (2, 5, 7)) = w^T f(NP -> NP PP, (2, 5, 7)) over "He wrote a long report on Mars.", with features such as feat = I[left child's last word = report, rule = NP -> NP PP]. Features need to combine surface information and syntactic information, but looking at words directly ends up being very sparse.

Scoring parses with neural nets: score(NP -> NP PP, (2, 5, 7)) = s^T v, where v is a vector representation of the rule being applied, computed by a neural network from word vectors in the span of "He wrote a long report on Mars."; this lets the model generalize, e.g. from Mars to Jupiter. Durrett and Klein (ACL 2015)

Syntactic Parsing, discrete + continuous. Parsing a sentence: feedforward pass on the nets, discrete feature computation, then run the CKY dynamic program. Durrett and Klein (ACL 2015)

Machine Translation: a sequence-to-sequence model built from long short-term memory units maps between "le chat a mangé" and "the cat ate" (ending with a STOP symbol).

Long Short-Term Memory Networks: map a sequence of inputs to a sequence of outputs. Taken from http://colah.github.io/posts/2015-08-understanding-lstms/
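
For reference, a minimal sketch of one LSTM step using the standard gate equations (the parameter shapes and random initialization are assumptions, not tied to any particular toolkit):

```python
# Minimal sketch of one LSTM step: input, forget, and output gates plus a
# candidate cell update, applied across a sequence of inputs.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, params):
    W, U, b = params                     # parameters stacked for the 4 gates
    a = W @ x + U @ h_prev + b
    i, f, o, g = np.split(a, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g               # cell state: forget old, write new
    h = o * np.tanh(c)                   # hidden state / output at this step
    return h, c

rng = np.random.default_rng(6)
d_in, d_hid = 8, 16
params = (rng.normal(size=(4 * d_hid, d_in)),
          rng.normal(size=(4 * d_hid, d_hid)),
          np.zeros(4 * d_hid))
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):     # map a sequence of inputs to outputs
    h, c = lstm_step(x, h, c, params)
```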

Machine Translation ("le chat a mangé" / "the cat ate STOP"): Google is moving towards this architecture, and performance is constantly improving compared to phrase-based methods.

Neural Network Toolkits
TensorFlow: https://www.tensorflow.org/ (by Google, actively maintained, bindings for many languages)
Theano: http://deeplearning.net/software/theano/ (University of Montreal, less and less maintained)
Torch: http://torch.ch/ (Facebook AI Research, Lua)

Neural Network Toolkits: an example Theano computation graph, from http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png

Word Vector Tools
Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html and https://code.google.com/archive/p/word2vec/ (Python code, actively maintained)
GloVe: http://nlp.stanford.edu/projects/glove/ (word vectors trained on very large corpora)
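
A minimal usage sketch with gensim's word2vec loader (the pretrained file name is an assumption; any word2vec-format file works, and the calls shown are the standard gensim KeyedVectors API):

```python
# Minimal sketch: load pretrained word2vec vectors and query similarities.
from gensim.models import KeyedVectors

# File name is an assumption; substitute any word2vec-format vector file.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.similarity("good", "enjoyable"))   # words in similar contexts
print(vectors.most_similar("movie", topn=5))     # nearest neighbors in the space
```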

Convolutional Networks
CNNs for sentence classification: https://github.com/yoonkim/cnn_sentence (based on the tutorial at http://deeplearning.net/tutorial/lenet.html; Python code, trains very quickly)

Takeaways. Neural networks have several advantages for NLP: we can use simpler nonlinear functions instead of more complex linear functions; we can take advantage of word similarity; we can build models that are both position-dependent (feedforward neural networks) and position-independent (convolutional networks). NNs have natural applications to many problems. While conventional linear models often still do well, neural nets are increasingly the state of the art for many tasks.