Deep Learning for Natural Language Processing Sidharth Mudgal April 4, 2017

Table of contents 1. Intro 2. Word Vectors 3. Word2Vec 4. Char Level Word Embeddings 5. Application: Entity Matching 6. Conclusion 1

Intro

Intro Deep Learning

Core Idea Given data with complex intrinsic structure that is hard to describe, learn hierarchical representations of the data points to perform a task. Important: DL usually works well only when the data has unknown hidden structure. It is not a silver bullet for all tasks. 2

Typical flow Input representation -> Hidden representations (higher-level features) -> Output prediction 3

Main Challenge Assisting the neural network to learn to create good representations. How? Encoding domain knowledge (to make training easier) and improved optimization techniques (for better gradient flow) 4

Model Engineering DL represents a shift in the way we think about ML. Traditional ML: encode domain knowledge using feature engineering. Deep ML: encode domain knowledge using model engineering. 5

Intro Natural Language Processing

What is NLP? (from Stanford CS224n) 6

Applications (from Stanford CS224n) 7

The NLP Formula: Hierarchical Feature Generation Input representation (word vectors) -> Hidden representations (typically using LSTM RNNs) -> Output prediction (typically using a variant of softmax) 8

Word Vectors

Word Vectors Motivation

What do we want? We want a representation of words that: Captures meaning Is efficient 9

Past attempts to capture meaning of words - WordNet 10

A simple idea You shall know a word by the company it keeps (J. R. Firth 1957: 11) 11

Ways to create word embeddings Traditional count-based methods (SVD, PPMI, etc.), neural network based methods (Word2Vec et al.), and hybrid methods (GloVe) 12

Word Vectors Language Modeling

What is a Language Model? A probability distribution over sequences of words: $P(w_1, \dots, w_m) = \prod_{t=1}^{m} P(w_t \mid w_{t-1}, \dots, w_1)$. In practice, we only condition on the last M words: $P(w_1, \dots, w_m) \approx \prod_{t=1}^{m} P(w_t \mid w_{t-1}, \dots, w_{t-M})$ 13
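
To make the chain-rule factorization concrete, here is a minimal Python sketch (not from the talk) that scores a sentence with a toy bigram model, i.e. a Markov window of M = 1; the probability table and the <s> start symbol are invented for illustration.

```python
# Minimal sketch (not from the talk): score a sentence with a toy bigram model,
# i.e. condition each word only on the previous M = 1 words. The probability
# table and the <s> start symbol are invented for illustration.
import math

bigram_prob = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3,
}

def sentence_log_prob(words, probs, unk=1e-6):
    # log P(w_1, ..., w_m) = sum_t log P(w_t | w_{t-1})
    log_p, prev = 0.0, "<s>"
    for w in words:
        log_p += math.log(probs.get((prev, w), unk))
        prev = w
    return log_p

print(sentence_log_prob(["the", "cat", "sat"], bigram_prob))
```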

Training word vectors using LMs (Bengio et al. 2003) 14

Word2Vec

Core Idea Parameterize words as vectors. Use an NN to predict nearby words (Skip-Gram or CBOW). Hope that the learned word vectors are semantically meaningful. 15

Skip Gram Model 16

Objective Function (from Stanford CS224n) 17

How do we predict nearby words? (from Stanford CS224n) 18

Step by step (from Stanford CS224n) 19

The problem of Softmax Scalability w.r.t. vocabulary size: do we really need to compute the probability of ALL words during training? Do we really need to push down the probability of ALL wrong words during training? Solutions: Hierarchical Softmax and Negative Sampling 20

The problem of Softmax 21

Hierarchical Softmax - Simple version Preprocess step: Randomly split the vocabulary V into $\sqrt{|V|}$ clusters consisting of approximately $\sqrt{|V|}$ words each. I.e., each word w in V is permanently assigned to a cluster C(w) at the beginning of training 22

Hierarchical Softmax - Simple version The Idea Compute the probability $p(w_{t+1} \mid c_t)$ in three steps: First, compute the probability of the word $w_{t+1}$ belonging to its true cluster $C(w_{t+1})$, i.e., $p(C(w_{t+1}) \mid c_t)$. Next, compute the probability of the word within its true cluster $C(w_{t+1})$, i.e., $p(w_{t+1} \mid C(w_{t+1}), c_t)$. Then, $p(w_{t+1} \mid c_t)$ is the product of the two probabilities: $p(w_{t+1} \mid c_t) = p(w_{t+1}, C(w_{t+1}) \mid c_t) = p(C(w_{t+1}) \mid c_t)\, p(w_{t+1} \mid C(w_{t+1}), c_t)$ 23

Hierarchical Softmax - Simple version Computing the cluster probability $p(C(w_{t+1}) \mid c_t)$: Use a regular softmax NN with $\sqrt{|V|}$ classes, 1 for each cluster. 24

Hierarchical Softmax - Simple version Computing the word-in-cluster probability $p(w_{t+1} \mid C(w_{t+1}), c_t)$: To do so, each word cluster has a dedicated softmax NN with approximately $\sqrt{|V|}$ classes, 1 for each word in the cluster. 25
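
Putting the two steps together, the following toy NumPy sketch (my own construction, with invented sizes and random weights standing in for the trained softmax NNs) computes p(w | c) as the product of the cluster probability and the within-cluster probability, touching only on the order of sqrt(|V|) scores instead of |V|.

```python
# Toy sketch of the two-level softmax: p(w | c) = p(C(w) | c) * p(w | C(w), c).
# Random weights stand in for the two trained softmax networks.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 10000, 50                                    # vocabulary size, context dimension
n_clusters = int(np.sqrt(V))                        # ~sqrt(|V|) clusters of ~sqrt(|V|) words
cluster_of = rng.integers(0, n_clusters, size=V)    # random assignment (preprocess step)

W_cluster = rng.normal(size=(n_clusters, d))        # softmax over clusters
W_word = rng.normal(size=(V, d))                    # per-word rows, used within a cluster

def word_prob(word_id, context):
    c = cluster_of[word_id]
    p_cluster = softmax(W_cluster @ context)[c]     # p(C(w) | c_t)
    members = np.where(cluster_of == c)[0]          # words in the same cluster
    within = softmax(W_word[members] @ context)     # p(. | C(w), c_t)
    p_word_given_cluster = within[np.where(members == word_id)[0][0]]
    return p_cluster * p_word_given_cluster         # ~O(sqrt(|V|)) work instead of O(|V|)

context = rng.normal(size=d)
print(word_prob(42, context))
```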

Hierarchical Softmax Time complexity of the simple version = $O(\sqrt{|V|})$. Can do much better ($O(\log |V|)$) with more complex versions: frequency-based clustering of words, Brown clustering, etc. 26

Negative Sampling Radical alternative: instead of literally predicting nearby words, we learn to distinguish true nearby words from randomly selected words. Consider the example of predicting $w_{t+1}$. Sample $k$ negative words $n^1_{t+1}, \dots, n^k_{t+1} \in V$, i.e., $k$ words in V other than $w_{t+1}$. Compute dot products only for the target word $w_{t+1}$ and the $k$ negative samples. Instead of computing a probability using a softmax, the model is supposed to make a binary decision for each word using a sigmoid, i.e., predict 1 for the word $w_{t+1}$ and 0 for the words $n^1_{t+1}, \dots, n^k_{t+1}$ 27
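
As a rough illustration (my own, not the speaker's code), the per-example negative-sampling objective reduces to one sigmoid per word: a binary cross-entropy that pushes the true nearby word toward 1 and the k sampled negatives toward 0.

```python
# Rough illustration (my construction): the per-example negative-sampling loss.
# One sigmoid per word replaces a softmax over the whole vocabulary.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, k = 50, 5
context = rng.normal(size=d)             # context vector c_t
true_word = rng.normal(size=d)           # embedding of the true nearby word w_{t+1}
negatives = rng.normal(size=(k, d))      # embeddings of the k sampled negative words

# Binary cross-entropy: push the true word toward 1, the negatives toward 0.
loss = -np.log(sigmoid(true_word @ context))
loss -= np.log(sigmoid(-(negatives @ context))).sum()
print(loss)
```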

Results E.g.: Paris - France + Italy ≈ Rome 28
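
For illustration only, this is how such an analogy query is typically evaluated: find the nearest neighbour, by cosine similarity, of vec(paris) - vec(france) + vec(italy). The vectors below are random, so the returned word is meaningless here; with trained embeddings the expected answer is "rome".

```python
# Illustration only: evaluating an analogy query as a nearest-neighbour search.
# Random vectors are used, so the result is meaningless; trained word2vec
# embeddings would return "rome".
import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "italy", "rome", "berlin", "madrid"]
vectors = {w: rng.normal(size=50) for w in vocab}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vectors):
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vectors[w], query))

print(analogy("paris", "france", "italy", vectors))
```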

Drawbacks Learns word vectors independently for each word Cannot estimate word embeddings of out-of-vocabulary (OOV) words 29

Char Level Word Embeddings

Core Idea Make use of the fact that words are composed of characters. Parameterize characters as vectors. Compose char vectors to create word vectors (can use a CNN or RNN). Use an NN to predict nearby words. Hope that the word vectors are semantically meaningful. 30

The Word Embedding Network (WEN) 31

Step 1: Lookup Char Embeddings Same as word vector lookup layer 32

Step 2: Temporal Convolution over Characters Assumption: Words are made of smaller units of character n-grams (morphemes). How do we encode this knowledge? Use k temporal CNNs with k different kernel widths. Intuition: A CNN with kernel width j detects the presence of character n-grams of length j 33

Step 2: Temporal Convolution over Characters Example: Temporal convolution with kernel width = 3 and two feature maps sensitive to the 3-grams "mac" and "est" 34

Step 2: Temporal Convolution over Characters The output of step 2 consists of two pieces of information: Which character n-grams are present in a given input word? Where do they occur in the word? 35

Step 3: Max-Pooling For each feature map in each CNN, output only the maximum activation. Intuition: Get rid of the position information of character n-grams. This yields a fixed-dimensional vector 36
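
Steps 1-3 can be summarized in a short PyTorch sketch (my own, with invented sizes): look up character embeddings, apply temporal convolutions with several kernel widths, and max-pool over positions to obtain a fixed-dimensional word vector.

```python
# Sketch (my own, invented sizes) of Steps 1-3: char embedding lookup,
# temporal convolutions with several kernel widths, max-pooling over positions.
import torch
import torch.nn as nn

n_chars, char_dim = 100, 15
kernel_widths = [2, 3, 4]            # detect character 2-, 3- and 4-grams
n_filters = 25                        # feature maps per kernel width

char_emb = nn.Embedding(n_chars, char_dim)
convs = nn.ModuleList(
    nn.Conv1d(char_dim, n_filters, kernel_size=w) for w in kernel_widths
)

def word_vector(char_ids):
    # char_ids: LongTensor of shape (1, word_length)
    x = char_emb(char_ids).transpose(1, 2)                   # (1, char_dim, word_length)
    pooled = [conv(x).max(dim=2).values for conv in convs]   # max over positions
    return torch.cat(pooled, dim=1)                          # (1, n_filters * len(kernel_widths))

print(word_vector(torch.randint(0, n_chars, (1, 8))).shape)  # torch.Size([1, 75])
```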

Step 4: Fully Connected Layers 3-layer highway network Intuition: Compose a word vector using information about which character n-grams are present in the word 37

Step 4: Highway NN Layer Details Core Idea: Provide a means to allow inputs to flow unaltered through an NN layer. This improves gradient flow. Alternate pathways (highways) conditionally allow the input to pass through unchanged to the corresponding outputs, as if the layer did not exist. Gates determine whether to modify each dimension of the input 38

Step 4: Highway NN Layer Details 39

Step 4: Highway NN Layer Details Given an input vector q, the output of the n-th highway layer with input $x_n$ and output $y_n$ is computed as follows: $x_n = \begin{cases} q & \text{if } n = 1 \\ y_{n-1} & \text{if } n > 1 \end{cases}$, $\quad H(x_n) = W^{(H)}_n x_n + b^{(H)}_n$, $\quad T(x_n) = W^{(T)}_n x_n + b^{(T)}_n$, $\quad y_n = H(x_n) \odot T(x_n) + x_n \odot (1 - T(x_n))$. Here H applies an affine transformation to $x_n$ and T acts as the gate deciding which inputs should be transformed and by how much. 40
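
A minimal PyTorch rendering of one highway layer following these equations (my sketch; the sigmoid on the gate T is my addition, since the transcribed equations show only the affine part):

```python
# Minimal sketch of one highway layer. The sigmoid on the gate T is my
# addition so that the gate stays in [0, 1]; the slide shows only the affine map.
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # H(x) = W_H x + b_H
        self.T = nn.Linear(dim, dim)   # T(x) = W_T x + b_T (gate)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))                 # per-dimension gate in [0, 1]
        return self.H(x) * t + x * (1.0 - t)         # y = H(x)*T(x) + x*(1 - T(x)), elementwise

layer = HighwayLayer(50)
print(layer(torch.randn(1, 50)).shape)               # torch.Size([1, 50])
```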

The Word Embedding Network (WEN) 41

Training the Word Embedding Network How can we train this model to produce good word embeddings? Train it as part of a recurrent neural language model 42

Training 43

Results 44

Application: Entity Matching

The Task Given text descriptions of two entities, determine if they refer to the same real world entity Binary classification problem 45

The Task Typical approach: Information Extraction (IE) followed by training a traditional classifier (SVM, Random Forest, etc.). Proposed alternate approach: Neural Entity Matching Model (Nemo). Use deep learning to perform this end-to-end: train a single model to perform extraction and entity matching 46

Nemo Overview 47

Baseline - Magellan Generates string similarity metrics based on heuristics (Jaccard, Cosine, Levenshtein, etc.). Picks the best standard ML model using cross-validation (SVM, Random Forest, etc.) 48
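
As a tiny, hypothetical example of one such heuristic (mine, not Magellan code), Jaccard similarity over word tokens of two entity descriptions:

```python
# Tiny example (not Magellan code): Jaccard similarity over word tokens,
# one of the string-similarity heuristics mentioned above.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Illustrative product descriptions (made up).
print(jaccard("Sony MDR-7506 Headphones", "Sony MDR 7506 Professional Headphones"))
```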

Results 49

Results - Extensive extraction 50

Understanding the model Compute first-order saliency scores for word embeddings. Essentially the derivative of the model's output with respect to the word embeddings. A measure of sensitivity 51

Understanding the model Example: Assume we have an entity pair $(e_1, e_2)$. We want to compute the saliency of the $t$-th word in $e_1$. If $y_L$ represents the score that Nemo assigns to the correct label $L$ (match or mismatch) for the pair, $x_t$ represents Nemo's input units for the $t$-th word, and $w_t$ is the word embedding of the $t$-th word in $e_1$, then the saliency $s$ of the word is calculated as $s = \left. \frac{\partial y_L}{\partial x_t} \right|_{x_t = w_t}$ 52
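
A hedged sketch of this computation (my own, with a stand-in linear scorer in place of the actual Nemo architecture): take the gradient of the correct-label score with respect to the word embeddings and read off one sensitivity value per word.

```python
# Sketch of first-order saliency with a stand-in model (not Nemo itself):
# gradient of the correct-label score y_L w.r.t. the word embeddings x_t,
# evaluated at the actual embeddings w_t.
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 50)
scorer = nn.Linear(50, 2)                      # stand-in for Nemo's match/mismatch scores

word_ids = torch.tensor([[3, 17, 42]])         # words of entity description e_1
x = emb(word_ids)                              # word embeddings, shape (1, 3, 50)
x.retain_grad()                                # keep gradients on the embedding output
y = scorer(x.mean(dim=1))                      # (1, 2) label scores
y_L = y[0, 1]                                  # score of the correct label L
y_L.backward()

saliency = x.grad.norm(dim=2)                  # gradient magnitude per word
print(saliency)
```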

Saliency Visualization 53

Entity Embedding Visualization 54

Entity Embedding Visualization - Headphones 55

Entity Embedding Visualization - Cables 56

Entity Embedding Visualization - Shirts 57

Entity Embedding Visualization - Bags 58

Entity Embedding Visualization - Desks 59

Conclusion

Deep Learning: An incredibly promising tool. Has produced breakthroughs in several domains, but has a lot of open problems: the need for large quantities of labeled data, huge numbers of parameters, and optimization challenges 60