Natural Language Processing with Deep Learning. CS224N/Ling284


Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 4: Word Window Classification and Neural Networks Christopher Manning and Richard Socher

Overview Today: Classification background. Updating word vectors for classification. Window classification & cross-entropy error derivation tips. A single layer neural network! Max-margin loss and backprop. This lecture will help a lot with PSet1 :)

Classification setup and notation. Generally we have a training dataset consisting of samples {x_i, y_i}, i = 1, ..., N. x_i: inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc. y_i: labels we try to predict, for example a class (sentiment, named entities, buy/sell decision), other words, or, later, multi-word sequences.

Classification intuition. Training data: {x_i, y_i}, i = 1, ..., N. Simple illustration case: fixed 2-d word vectors to classify, using logistic regression → linear decision boundary. General ML: assume x is fixed and train only the logistic regression weights W → we only modify the decision boundary. Visualizations with ConvNetJS by Karpathy: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html Goal: predict for each x the probability of its class y: p(y | x) = exp(W_y · x) / Σ_{c=1}^{C} exp(W_c · x)

Details of the softmax. We can tease p(y | x) apart into two steps: 1. Take the y-th row of W and multiply that row with x: f_y = W_y · x = Σ_i W_{yi} x_i. Compute all f_c for c = 1, ..., C. 2. Normalize to obtain a probability with the softmax function: p(y | x) = exp(f_y) / Σ_{c=1}^{C} exp(f_c) = softmax(f)_y
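As a concrete illustration of these two steps, here is a minimal NumPy sketch; the values of W, x, and the class index y are hypothetical, chosen just for the example.

    import numpy as np

    np.random.seed(0)
    C, d = 3, 5                       # number of classes, input dimension
    W = np.random.randn(C, d)         # one row of weights per class
    x = np.random.randn(d)            # a fixed input vector (e.g. a word vector)

    f = W @ x                         # step 1: unnormalized scores f_c = W_c . x
    p = np.exp(f - f.max())           # subtract the max for numerical stability
    p /= p.sum()                      # step 2: softmax normalization

    y = 1                             # hypothetical correct class index
    print(p[y])                       # predicted probability of class y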

The softmax and cross-entropy error. For each training example {x, y}, our objective is to maximize the probability of the correct class y. Hence, we minimize the negative log probability of that class: J = -log p(y | x) = -log ( exp(f_y) / Σ_{c=1}^{C} exp(f_c) )

Background: why cross-entropy error. Assume a ground truth (or gold, or target) probability distribution that is 1 at the right class and 0 everywhere else, p = [0, ..., 0, 1, 0, ..., 0], and let our computed probability be q. Then the cross entropy is H(p, q) = -Σ_{c=1}^{C} p(c) log q(c). Because p is one-hot, the only term left is the negative log probability of the true class.
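A small sketch of the same point in NumPy (the probabilities q and the true class index are made up for illustration): with a one-hot p, the cross entropy reduces to the negative log probability of the true class.

    import numpy as np

    q = np.array([0.2, 0.7, 0.1])            # hypothetical softmax output
    p = np.array([0.0, 1.0, 0.0])            # one-hot gold distribution, true class y = 1

    cross_entropy = -np.sum(p * np.log(q))   # H(p, q) = -sum_c p(c) log q(c)
    neg_log_prob = -np.log(q[1])             # -log q(y)
    assert np.isclose(cross_entropy, neg_log_prob)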

Sidenote: the KL divergence. Cross-entropy can be rewritten in terms of the entropy and the Kullback-Leibler divergence between the two distributions: H(p, q) = H(p) + D_KL(p || q). Because H(p) is zero in our case (and even if it weren't, it would be fixed and contribute nothing to the gradient), minimizing this is equivalent to minimizing the KL divergence between p and q. The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q: D_KL(p || q) = Σ_c p(c) log ( p(c) / q(c) )

Classification over a full dataset. The cross-entropy loss function over the full dataset {x_i, y_i}, i = 1, ..., N, is J(θ) = (1/N) Σ_{i=1}^{N} -log ( exp(f_{y_i}) / Σ_{c=1}^{C} exp(f_c) ). Instead of f_y = W_y · x we will write f in matrix notation: f = Wx. We can still index elements of it based on class.

Classification: regularization! The really full loss function over any dataset includes regularization over all parameters θ: J(θ) = (1/N) Σ_{i=1}^{N} -log ( exp(f_{y_i}) / Σ_{c=1}^{C} exp(f_c) ) + λ Σ_k θ_k^2. Regularization will prevent overfitting when we have a lot of features (or, later, a very powerful/deep model). [Figure: x-axis: more powerful model or more training iterations; blue: training error, red: test error.]
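A minimal sketch of the regularized loss, assuming hypothetical per-example losses, a flattened parameter vector theta, and a regularization strength lam:

    import numpy as np

    losses = np.array([0.3, 1.2, 0.7])   # hypothetical per-example cross-entropy losses
    theta = np.random.randn(10)          # all model parameters, flattened
    lam = 1e-4                           # regularization strength (lambda)

    J = losses.mean() + lam * np.sum(theta ** 2)   # data term + L2 penalty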

Details: general ML optimization. For general machine learning, θ usually consists only of the columns of W, so we only update the decision boundary. Visualizations with ConvNetJS by Karpathy.

Classification difference with word vectors. Common in deep learning: learn both W and the word vectors x. Very large number of parameters: overfitting danger!

Losing generalization by re-training word vectors. Setting: training logistic regression for movie review sentiment on single words, and in the training data we have "TV" and "telly"; in the testing data we have "television". Originally they were all similar (from pre-trained word vectors). What happens when we train the word vectors? [Figure: "telly", "TV", "television" close together in vector space.]

Losing generalization by re-training word vectors. What happens when we train the word vectors? Those that are in the training data move around; words from pre-training that do NOT appear in training stay where they were. Example: in the training data: "TV" and "telly"; only in the testing data: "television". [Figure: "telly" and "TV" have moved away, leaving "television" behind :(]

Losing generalization by re-training word vectors. Take-home message: if you only have a small training data set, don't train the word vectors. If you have a very large dataset, it may work better to train the word vectors to the task.

Side note on word vector notation. The word vector matrix L is also called the lookup table. Word vectors = word embeddings = word representations (mostly), mostly from methods like word2vec or GloVe. L is a d × |V| matrix with one column per vocabulary word (aardvark, a, ..., meta, ..., zebra). These columns are the word features x_word from now on. New development (later in the class): character models :o

Window classification. Classifying single words is rarely done. Interesting problems like ambiguity arise in context! Example, auto-antonyms: "to sanction" can mean "to permit" or "to punish"; "to seed" can mean "to place seeds" or "to remove seeds." Example, ambiguous named entities: Paris → Paris, France vs. Paris Hilton; Hathaway → Berkshire Hathaway vs. Anne Hathaway.

Window classification. Idea: classify a word in its context window of neighboring words. For example, named entity recognition into 4 classes: person, location, organization, none. Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information.

Window classification. Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it. Example: classify "Paris" in the context of this sentence with window length 2: "museums in Paris are amazing". x_window = [ x_museums ; x_in ; x_Paris ; x_are ; x_amazing ]. The resulting vector x_window = x ∈ R^{5d} is a column vector!
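A minimal NumPy sketch of building x_window by concatenation; the lookup table L, the word-to-index mapping, and the dimensions are hypothetical.

    import numpy as np

    d, V = 4, 6
    L = np.random.randn(d, V)                  # word vector matrix, one column per word
    idx = {"museums": 0, "in": 1, "Paris": 2, "are": 3, "amazing": 4}

    window = ["museums", "in", "Paris", "are", "amazing"]
    x_window = np.concatenate([L[:, idx[w]] for w in window])   # vector in R^{5d}
    assert x_window.shape == (5 * d,)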

Simplest window classifier: softmax. With x = x_window we can use the same softmax classifier as before to get the predicted model output probability ŷ = p(y | x) = softmax(Wx), with the same cross-entropy error as before. But how do you update the word vectors?

Updating concatenated word vectors. Short answer: just take derivatives as before. Long answer: let's go over the steps together (helpful for PSet 1). Define: ŷ: the softmax probability output vector (see previous slide); t: the target probability distribution (all 0's except a 1 at the ground-truth index of class y); f = Wx, with f_c the c-th element of the f vector. This is hard the first time, hence some tips now :)

Updating concatenated word vectors. Tip 1: Carefully define your variables and keep track of their dimensionality! Tip 2: Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then dy/dx = (dy/du)(du/dx). Simple example: if y = u^3 and u = x^2 + 1, then dy/dx = 3u^2 · 2x = 6x(x^2 + 1)^2.

Updating concatenated word vectors. Tip 2 continued: know thy chain rule. Don't forget which variables depend on what, and that x appears inside all elements of f. Tip 3: For the softmax part of the derivative: first take the derivative with respect to f_c when c = y (the correct class), then take the derivative with respect to f_c when c ≠ y (all the incorrect classes).

Updating concatenated word vectors. Tip 4: When you take the derivative with respect to one element of f, try to see if you can assemble, in the end, a gradient that includes all the partial derivatives. Tip 5: To later not go insane (and for the implementation!), express the results in terms of vector operations and define single index-able vectors.

Updating concatenated word vectors. Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of, e.g., x_i or W_ij. Tip 7: To clean it up for even more complex functions later: know the dimensionality of your variables and simplify into matrix notation. Tip 8: Write this out in full sums if it's not clear!

Updating concatenated word vectors. What is the dimensionality of the window vector gradient? x is the entire window of 5 d-dimensional word vectors, so the derivative with respect to x has to have the same dimensionality: ∂J/∂x ∈ R^{5d}.

Updating concatenated word vectors. The gradient that arrives at and updates the word vectors can simply be split up for each word vector. Let δ_window = ∂J/∂x ∈ R^{5d}. With x_window = [ x_museums ; x_in ; x_Paris ; x_are ; x_amazing ], we have ∇_{x_museums} J = δ_window[1:d], ∇_{x_in} J = δ_window[d+1:2d], and so on.
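A small sketch of splitting the window gradient back into per-word gradients (delta_window and the dimensions are hypothetical):

    import numpy as np

    d = 4
    delta_window = np.random.randn(5 * d)        # dJ/dx_window
    per_word_grads = delta_window.reshape(5, d)  # row k is the gradient for the k-th window word

    # e.g. per_word_grads[2] is the update direction for x_Paris in this example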

Updating concatenated word vectors. This will push word vectors into areas such that they will be helpful in determining named entities. For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location.

What's missing for training the window model? The gradient of J with respect to the softmax weights W! Similar steps: write down the partial derivative with respect to W_ij first, then we have the full ∇_W J.

A note on matrix implementations. There are two expensive operations in the softmax: the matrix multiplication and the exp. A for loop is never as efficient, when you implement it, as a large matrix multiplication! Example code →

A note on matrix implementations. Looping over word vectors, instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix: 1000 loops, best of 3: 639 µs per loop (looping) vs. 10000 loops, best of 3: 53.8 µs per loop (single matrix multiplication).
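A sketch of the comparison described above, using hypothetical sizes and the standard timeit module; the exact numbers will differ from the slide's timings depending on the machine.

    import numpy as np
    import timeit

    C, d, N = 5, 300, 500
    W = np.random.randn(C, d)                        # softmax weights
    wordvectors = [np.random.randn(d) for _ in range(N)]

    def loop_version():
        return [W @ v for v in wordvectors]          # one matrix-vector product per word

    def matrix_version():
        X = np.stack(wordvectors, axis=1)            # d x N matrix of word vectors
        return W @ X                                 # a single C x N matrix product

    print("loop:  ", timeit.timeit(loop_version, number=100))
    print("matrix:", timeit.timeit(matrix_version, number=100))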

A note on matrix implementations. The result of the faster method is a C × N matrix: each column is an f(x) in our notation (unnormalized class scores). Matrices are awesome! You should speed-test your code a lot too.

Softmax (= logistic regression) alone is not very powerful. Softmax only gives linear decision boundaries in the original space. With little data that can be a good regularizer; with more data it is very limiting!

Softmax (= logistic regression) is not very powerful. Softmax gives only linear decision boundaries → lame when the problem is complex. Wouldn't it be cool to get these correct?

Neural nets for the win! Neural networks can learn much more complex functions and nonlinear decision boundaries!

From logistic regression to neural nets

Demystifying neural networks. Neural networks come with their own terminological baggage. A single neuron is a computational unit with n (e.g. 3) inputs and 1 output, and parameters W, b. But if you understand how softmax models work, then you already understand the operation of a basic neuron! [Diagram: inputs → activation function → output; the bias unit corresponds to the intercept term.]

A neuron is essentially a binary logistic regression unit: h_{w,b}(x) = f(w^T x + b), with f(z) = 1 / (1 + e^{-z}). b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term. w, b are the parameters of this neuron, i.e., of this logistic regression model.
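A minimal sketch of such a neuron in NumPy (w, b, and x are hypothetical example values):

    import numpy as np

    def f(z):                            # logistic (sigmoid) activation
        return 1.0 / (1.0 + np.exp(-z))

    d = 3
    w = np.random.randn(d)               # weights
    b = 0.1                              # bias (intercept) term
    x = np.random.randn(d)               # input vector

    h = f(w @ x + b)                     # h_{w,b}(x), a value in (0, 1)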

A neural network = running several logistic regressions at the same time. If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs. But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

A neural network = running several logistic regressions at the same time ... which we can feed into another logistic regression function. It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

A neural network = running several logistic regressions at the same time. Before we know it, we have a multilayer neural network.

Matrix notation for a layer. We have a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1), a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2), etc. In matrix notation: z = Wx + b, a = f(z), where f is applied element-wise: f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)].
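The same layer as a short NumPy sketch (sizes and parameter values are hypothetical):

    import numpy as np

    def f(z):
        return 1.0 / (1.0 + np.exp(-z))  # element-wise logistic

    W = np.random.randn(2, 3)            # 2 hidden units, 3 inputs
    b = np.random.randn(2)
    x = np.random.randn(3)

    z = W @ x + b                        # z = Wx + b
    a = f(z)                             # a = f(z), applied element-wise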

Non-linearities (f): why they're needed. Example use: function approximation, e.g. regression or classification. Without non-linearities, deep neural networks can't do anything more than a linear transform: extra layers could just be compiled down into a single linear transform, W_1 W_2 x = Wx. With more layers, they can approximate more complex functions!

A more powerful, neural net window classifier. Revisiting x_window = [ x_museums ; x_in ; x_Paris ; x_are ; x_amazing ]. Assume we want to classify whether the center word is a location or not.

A single layer neural network. A single layer is a combination of a linear layer and a nonlinearity: z = Wx + b, a = f(z). The neural activations a can then be used to compute some output: for instance, a probability via softmax, p(y | x) = softmax(Wa), or an unnormalized score (even simpler), s = U^T a.

Summary: feed-forward computation. Computing a window's score with a 3-layer neural net: s = score("museums in Paris are amazing") = U^T f(Wx + b), with x = x_window = [ x_museums ; x_in ; x_Paris ; x_are ; x_amazing ].
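A minimal feed-forward sketch of this window score, assuming a hidden layer of hypothetical size and randomly initialized parameters:

    import numpy as np

    def f(z):
        return 1.0 / (1.0 + np.exp(-z))

    d, n_window, hidden = 4, 5, 8
    x = np.random.randn(n_window * d)    # concatenated window vector x_window
    W = np.random.randn(hidden, n_window * d)
    b = np.random.randn(hidden)
    U = np.random.randn(hidden)

    a = f(W @ x + b)                     # hidden activations
    s = U @ a                            # scalar window score s = U^T f(Wx + b)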

Main intuition for the extra layer. The layer learns non-linear interactions between the input word vectors. Example: only if "museums" is the first vector should it matter that "in" is in the second position. x_window = [ x_museums ; x_in ; x_Paris ; x_are ; x_amazing ]

The max-margin loss. s = score(museums in Paris are amazing), s_c = score(not all museums in Paris). Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize J = max(0, 1 - s + s_c). This is continuous → we can use SGD.

Max-margin objective function. Objective for a single window: J = max(0, 1 - s + s_c). Each window with a location at its center should have a score +1 higher than any window without a location at its center. For the full objective function: sample several corrupt windows per true one, and sum over all training windows.
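A small sketch of the single-window objective with several sampled corrupt windows; the scores here are hypothetical numbers, whereas in practice each would come from the feed-forward computation above.

    def hinge(s_true, s_corrupt):
        # J = max(0, 1 - s + s_c) for one true/corrupt pair
        return max(0.0, 1.0 - s_true + s_corrupt)

    s = 2.3                              # score of a true window (location at center)
    corrupt_scores = [1.9, 0.4, 2.8]     # scores of sampled corrupt windows
    J = sum(hinge(s, s_c) for s_c in corrupt_scores)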

Training with backpropagation. Assuming the cost J is > 0, compute the derivatives of s and s_c with respect to all the involved variables: U, W, b, x.

Training with backpropagation. Let's consider the derivative of a single weight W_ij. It only appears inside a_i: for example, W_23 is only used to compute a_2. [Diagram: network with inputs x_1, x_2, x_3, +1, hidden units a_1, a_2, weight W_23, bias b_2, and output weights U feeding the score s.]

Training with backpropagation. Derivative of the weight W_ij: ∂s/∂W_ij = U_i f'(z_i) x_j.

Training with backpropagation. Derivative of a single weight W_ij: ∂s/∂W_ij = δ_i x_j, where δ_i = U_i f'(z_i) is the local error signal and x_j is the local input signal; for logistic f, f'(z) = f(z)(1 - f(z)).

Training with backpropagation. From a single weight W_ij to the full W: we want all combinations of i = 1, 2 and j = 1, 2, 3. Solution: the outer product ∂s/∂W = δ x^T, where δ is the "responsibility" or error signal coming from each activation a.

Training with backpropagation. For the biases b, we get: ∂s/∂b = δ.

Training with backpropagation. That's almost backpropagation: it's taking derivatives and using the chain rule. Remaining trick: we can re-use derivatives computed for higher layers when computing derivatives for lower layers! Example: the last derivatives of the model, the word vectors in x.

Training with backpropagation. Take the derivative of the score with respect to a single element of a word vector. Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above, and hence x_j influences the overall score through all of them: ∂s/∂x_j = Σ_i ∂s/∂a_i · ∂a_i/∂x_j = Σ_i δ_i W_ij (re-using the δ from the previous derivative).

Training with backpropagation. With δ = U ⊙ f'(z), what is the full gradient? ∂s/∂x = W^T δ. Observation: the error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer.
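Putting the pieces together, here is a sketch of the gradients of the score s = U^T f(Wx + b) with respect to U, W, b, and x, re-using the error signal δ; all sizes and values are hypothetical.

    import numpy as np

    def f(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden, nd = 8, 20
    x = np.random.randn(nd)
    W = np.random.randn(hidden, nd)
    b = np.random.randn(hidden)
    U = np.random.randn(hidden)

    z = W @ x + b
    a = f(z)
    s = U @ a

    grad_U = a                           # ds/dU
    delta = U * a * (1 - a)              # ds/dz, using f'(z) = f(z)(1 - f(z))
    grad_W = np.outer(delta, x)          # ds/dW = delta x^T
    grad_b = delta                       # ds/db = delta
    grad_x = W.T @ delta                 # ds/dx = W^T delta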

Putting all gradients together. Remember: the full objective function for each window was J = max(0, 1 - s + s_c). For example, the gradient for U (when J > 0): ∂J/∂U = -a + a_c, where a and a_c are the hidden activations for the true and corrupt windows.

Summary. Congrats! Super useful basic components and a real model: word vector training, windows, softmax and cross-entropy error → PSet1; scores and max-margin loss; neural network → PSet1. One more half of a math-heavy lecture, then the rest will be easier and more applied :)

Next lecture: Project advice. Taking more and deeper derivatives → full backprop. Then we have all the basic tools in place to learn about more complex models and have some fun :)