Lecture 5 Neural models for NLP

CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

Logistics

Schedule
Weeks 1-4: Lectures
Paper presentations: Lectures 9-28
1. Word embeddings
2. Language models
3. More on RNNs for NLP
4. CNNs for NLP
5. Multitask learning for NLP
6. Syntactic parsing
7. Information extraction
8. Semantic parsing
9. Coreference resolution
10. Machine translation I
11. Machine translation II
12. Generation
13. Discourse
14. Dialog I
15. Dialog II
16. Multimodal NLP
17. Question answering
18. Entailment recognition
19. Reading comprehension
20. Knowledge graph modeling

Machine learning fundamentals

Learning scenarios
Supervised learning: learning to predict labels/structures from correctly annotated data
Unsupervised learning: learning to find hidden structure (e.g. clusters) in unannotated input data
Semi-supervised learning: learning to predict labels/structures from (a little) annotated and (a lot of) unannotated data
Reinforcement learning: learning to act through feedback for actions (rewards/punishments) from the environment

Input x ∈ X: an item x drawn from an input space X. System: y = f(x). Output y ∈ Y: an item y drawn from an output space Y. In (supervised) machine learning, we deal with systems whose f(x) is learned from (labeled) examples.

Supervised learning
Input x ∈ X: an item x drawn from an instance space X. Target function: y = f(x). Learned model: y = g(x). Output y ∈ Y: an item y drawn from a label space Y.
You will often see f̂(x) instead of g(x), but PowerPoint can't really typeset that, so g(x) will have to do.

Supervised learning
Regression: Y is continuous
Classification: Y is discrete (and finite)
- Binary classification: Y = {0,1} or {+1,-1}
- Multiclass classification: Y = {1,...,K} (with K > 2)
Structured prediction: Y consists of structured objects; Y often has some sort of compositional structure and may be infinite

Supervised learning: Training
Labeled training data D_train: (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) → Learning algorithm → Learned model g(x)
Give the learner the examples in D_train; the learner returns a model g(x).

Supervised learning: Testing
Labeled test data D_test: (x_1, y_1), (x_2, y_2), ..., (x_M, y_M)
Reserve some labeled data for testing.

Supervised learning: Testing
Split the labeled test data D_test = (x_1, y_1), ..., (x_M, y_M) into the raw test data X_test = x_1, x_2, ..., x_M and the test labels Y_test = y_1, y_2, ..., y_M.

Supervised learning: Testing
Apply the model to the raw test data: the learned model g(x) maps X_test = x_1, x_2, ..., x_M to the predicted labels g(X_test) = g(x_1), g(x_2), ..., g(x_M), to be compared against the test labels Y_test = y_1, y_2, ..., y_M.

Supervised learning: Testing
Evaluate the model by comparing the predicted labels g(x_1), ..., g(x_M) against the test labels y_1, ..., y_M.
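In code, this comparison is just an accuracy computation; a minimal numpy sketch (the function name and toy labels are illustrative):

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of test items whose predicted label matches the gold label."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return np.mean(y_pred == y_true)

# Compare predicted labels g(x_1)..g(x_M) against test labels y_1..y_M
y_pred = np.array([+1, -1, +1, +1])
y_true = np.array([+1, -1, -1, +1])
print(accuracy(y_pred, y_true))  # 0.75
```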

Design decisions
What data do you use to train/test your system? Do you have enough training data? How noisy is it?
What evaluation metrics do you use to test your system? Do they correlate with what you want to measure?
What features do you use to represent your data X? (Feature engineering used to be really important.)
What kind of a model do you want to use? What network architecture do you want to use?
What learning algorithm do you use to train your system? How do you set the hyperparameters of the algorithm?

Linear classifiers: f(x) = w_0 + w·x
Linear classifiers are defined over vector spaces. Every hypothesis f(x) = w_0 + w·x defines a hyperplane f(x) = 0, also called the decision boundary, which splits the space into a region where f(x) > 0 and a region where f(x) < 0.
Assign ŷ = +1 to all x where f(x) > 0; assign ŷ = -1 to all x where f(x) < 0. That is, ŷ = sgn(f(x)).
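For concreteness, a minimal numpy sketch of such a linear classifier; the weights and example points are made up for illustration:

```python
import numpy as np

def predict(w0, w, x):
    """Linear classifier: return sgn(w0 + w.x) as +1 or -1."""
    f_x = w0 + np.dot(w, x)
    return 1 if f_x > 0 else -1

# Hypothetical 2-d example: decision boundary x1 + x2 - 1 = 0
w0, w = -1.0, np.array([1.0, 1.0])
print(predict(w0, w, np.array([2.0, 1.0])))   # f(x) =  2.0 > 0 -> +1
print(predict(w0, w, np.array([0.0, 0.5])))   # f(x) = -0.5 < 0 -> -1
```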

Learning a linear classifier
Input: labeled training data D = {(x_1, y_1), ..., (x_D, y_D)} plotted in the sample space X = R^2, with each point labeled y_i = +1 or y_i = -1.
Output: a decision boundary f(x) = 0 that separates the training data, i.e. y_i f(x_i) > 0 for all i.

Which model should we pick?
We need a metric (aka an objective function). We would like to minimize the probability of misclassifying unseen examples, but we can't measure that probability. Instead: minimize the number of misclassified training examples.

Which model should we pick?
We need a more specific metric: there may be many models that are consistent with the training data. Loss functions provide such metrics.

y·f(x) > 0: Correct classification
An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
Case 1 (y = +1 = ŷ): f(x) > 0, so y·f(x) > 0
Case 2 (y = -1 = ŷ): f(x) < 0, so y·f(x) > 0
Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0, so y·f(x) < 0
Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0, so y·f(x) < 0

Loss functions for classification
Loss = what penalty do we incur if we misclassify x?
L(y, f(x)) is the loss (aka cost) of classifier f on example x when the true label of x is y. We assign the label ŷ = sgn(f(x)) to x.
Plots of L(y, f(x)): the x-axis is typically y·f(x).
Today: 0-1 loss and square loss (more loss functions later).

0-1 Loss
L(y, f(x)) = 0 iff y = ŷ; = 1 iff y ≠ ŷ.
Written as a function of the margin y·f(x): L(y·f(x)) = 0 iff y·f(x) > 0 (correctly classified); = 1 iff y·f(x) < 0 (misclassified).

Square Loss: (y - f(x))^2
L(y, f(x)) = (y - f(x))^2
Note: L(-1, f(x)) = (-1 - f(x))^2 = (1 + f(x))^2 = L(1, -f(x)), so the loss curve for y = -1 is the mirror image of the loss curve for y = +1.
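A small Python sketch of both losses (function names are mine, not from the slides):

```python
import numpy as np

def zero_one_loss(y, f_x):
    """0-1 loss: 1 if the example is misclassified (y*f(x) < 0), else 0."""
    return 0.0 if y * f_x > 0 else 1.0

def square_loss(y, f_x):
    """Square loss: (y - f(x))^2."""
    return (y - f_x) ** 2

# The square loss upper-bounds the 0-1 loss (cf. the next slide)
for f_x in [-2.0, -0.5, 0.5, 2.0]:
    y = 1
    print(f_x, zero_one_loss(y, f_x), square_loss(y, f_x))
```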

The square loss is a convex upper bound on the 0-1 loss.

Batch learning: Gradient Descent for Least Mean Squares (LMS)

Gradient Descent
Iterative batch learning algorithm: the learner updates the hypothesis based on the entire training data, and has to go multiple times over the training data.
Goal: minimize the training error/loss.
At each step: move w in the direction of steepest descent along the error/loss surface.

Gradient Descent
Error(w): error of w on the training data; w_i: weight at iteration i.
[Figure: the error surface Error(w), with successive weight vectors w_1, w_2, w_3, w_4 moving down toward the minimum.]

Least Mean Square Error
$$\mathrm{Err}(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}(y_d - \hat{y}_d)^2$$
LMS error: the sum of the square loss over all training items (multiplied by 1/2 for convenience). D is fixed, so there is no need to divide by its size.
Goal of learning: find $\mathbf{w}^* = \arg\min_{\mathbf{w}} \mathrm{Err}(\mathbf{w})$.

Iterative batch learning
Initialization: initialize w_0 (the initial weight vector).
For each iteration i = 0 ... T:
- Determine by how much to change w based on the entire data set D: Δw = computeDelta(D, w_i)
- Update w: w_{i+1} = update(w_i, Δw)

Gradient Descent: Update
1. Compute $\nabla \mathrm{Err}(\mathbf{w}_i)$, the gradient of the training error at $\mathbf{w}_i$. This requires going over the entire training data:
$$\nabla \mathrm{Err}(\mathbf{w}) = \left( \frac{\partial \mathrm{Err}(\mathbf{w})}{\partial w_0}, \frac{\partial \mathrm{Err}(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial \mathrm{Err}(\mathbf{w})}{\partial w_N} \right)^{T}$$
2. Update w: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha \nabla \mathrm{Err}(\mathbf{w}_i)$, where α > 0 is the learning rate.

What's a gradient?
$$\nabla \mathrm{Err}(\mathbf{w}) = \left( \frac{\partial \mathrm{Err}(\mathbf{w})}{\partial w_0}, \frac{\partial \mathrm{Err}(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial \mathrm{Err}(\mathbf{w})}{\partial w_N} \right)^{T}$$
The gradient is a vector of partial derivatives. It indicates the direction of steepest increase in Err(w), hence the minus sign in the update rule: $\mathbf{w}_i - \alpha \nabla \mathrm{Err}(\mathbf{w}_i)$.

Computing $\nabla \mathrm{Err}(\mathbf{w}_i)$
$$\frac{\partial \mathrm{Err}(\mathbf{w})}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2}\sum_{d \in D}(y_d - f(\mathbf{x}_d))^2 = \frac{1}{2}\sum_{d \in D} 2\,(y_d - f(\mathbf{x}_d))\,\frac{\partial}{\partial w_i}(y_d - \mathbf{w} \cdot \mathbf{x}_d) = -\sum_{d \in D}(y_d - f(\mathbf{x}_d))\,x_{di}$$

Gradient descent (batch)
Initialize w_0 randomly
for i = 0 ... T:
    Δw = (0, ..., 0)
    for every training item d = 1 ... D:
        f(x_d) = w_i · x_d
        for every component j = 0 ... N of w:
            Δw_j += α (y_d - f(x_d)) x_dj
    w_{i+1} = w_i + Δw
return w_{i+1} when it has converged

The batch update rule for each component of w:
$$\Delta w_i = \alpha \sum_{d=1}^{D} (y_d - \mathbf{w} \cdot \mathbf{x}_d)\, x_{di}$$
Implementing gradient descent: as you go through the training data, you can just accumulate the change in each component w_i of w.
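Putting the batch algorithm and this update rule together, a minimal runnable numpy sketch; the learning rate, iteration count, and toy data are illustrative assumptions:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.005, iterations=1000):
    """Minimize the LMS error 0.5 * sum_d (y_d - w.x_d)^2 by batch gradient descent.
    X: (D, N) array of inputs (include a column of 1s for a bias), y: (D,) targets."""
    D, N = X.shape
    w = np.zeros(N)
    for _ in range(iterations):
        f = X @ w                         # predictions f(x_d) = w.x_d for all d
        delta_w = alpha * X.T @ (y - f)   # accumulate alpha * (y_d - f(x_d)) * x_d over d
        w = w + delta_w
    return w

# Toy example: learn y = 2*x1 - x2 + 1 (last input column is the bias feature)
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = 2 * X[:, 0] - X[:, 1] + 1
print(batch_gradient_descent(X, y))       # approximately [ 2. -1.  1.]
```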

Learning rate and convergence
The learning rate is also called the step size. More sophisticated algorithms (e.g. conjugate gradient) choose the step size automatically and converge faster.
When the learning rate is too small, convergence is very slow; when the learning rate is too large, we may oscillate (overshoot the global minimum). You have to experiment to find the right learning rate for your task.

Online learning with Stochastic Gradient Descent

Stochastic Gradient Descent
Online learning algorithm: the learner updates the hypothesis with each training example. There is no assumption that we will see the same training examples again.
Like batch gradient descent, except we update after seeing each example.

Why online learning?
Too much training data: we can't afford to iterate over everything.
Streaming scenario: new data will keep coming; you can't assume you have seen everything.
Also useful for adaptation (e.g. user-specific spam detectors).

Stochastic Gradient Descent (online)
Initialize w_0 randomly
for m = 0 ... M:
    f(x_m) = w_m · x_m
    for every component j = 0 ... N of w: Δw_j = α (y_m - f(x_m)) x_mj
    w_{m+1} = w_m + Δw
return the final w when it has converged
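The corresponding online version as a numpy sketch, under the same illustrative assumptions as the batch version above:

```python
import numpy as np

def sgd_lms(X, y, alpha=0.01, epochs=10):
    """Stochastic gradient descent for the LMS objective:
    update w after every single training example (x_m, y_m)."""
    D, N = X.shape
    w = np.zeros(N)
    for _ in range(epochs):
        for m in range(D):
            f_xm = w @ X[m]                       # current prediction for x_m
            w = w + alpha * (y[m] - f_xm) * X[m]  # online (per-example) LMS update
    return w

# Toy usage: learn y = 2*x1 - x2 + 1 (same set-up as the batch example)
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = 2 * X[:, 0] - X[:, 1] + 1
print(sgd_lms(X, y))                              # approximately [ 2. -1.  1.]
```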

Online perceptron
Assumptions: class labels y ∈ {+1, -1}; learning rate α > 0.
Initial weight vector w_0 := (0, ..., 0); i = 0
for m = 0 ... M:
    if y_m f(x_m) = y_m (w_i · x_m) < 0:   (x_m is misclassified: add α y_m x_m to w!)
        w_{i+1} := w_i + α y_m x_m   (perceptron rule)
        i := i + 1
return the final weight vector when all examples are correctly classified
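A runnable sketch of this perceptron loop. The repeated passes over the data, the toy data set, and the use of <= 0 in the misclassification test (so that the all-zero initial weight vector still triggers updates) are my additions for illustration:

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=100):
    """Online perceptron: labels y in {+1, -1}; update only on misclassified examples."""
    D, N = X.shape
    w = np.zeros(N)
    for _ in range(max_epochs):
        mistakes = 0
        for m in range(D):
            if y[m] * (w @ X[m]) <= 0:        # x_m misclassified (or on the boundary)
                w = w + alpha * y[m] * X[m]   # perceptron rule
                mistakes += 1
        if mistakes == 0:                     # all examples correctly classified
            return w
    return w

# Linearly separable toy data (last column = bias feature)
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))                         # [ 1.  1. -1. -1.]
```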

Neural Networks

What are neural nets?
Simplest variant: the single-layer feedforward net.
For binary classification tasks: a single output unit (a scalar y); return 1 if y > 0.5, return 0 otherwise.
For multiclass classification tasks: K output units (a vector y), where output unit y_i corresponds to class i; return argmax_i(y_i).
In both cases the input layer is a vector x.

Multiclass models: softmax(y_i)
Multiclass classification = predict one of K classes; return the class i with the highest score: argmax_i(y_i).
In neural networks, this is typically done by using the softmax function, which maps a real-valued score vector into a probability distribution over the K outputs. For a vector z = (z_0, ..., z_K):
$$P(i) = \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{k=0}^{K} \exp(z_k)}$$
(NB: this is just logistic regression.)
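A minimal numpy implementation of the softmax; subtracting the maximum is a standard numerical-stability trick, not something the slide discusses:

```python
import numpy as np

def softmax(z):
    """Map a real-valued score vector z to a probability distribution over its entries."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()                # stability: softmax is invariant to shifting z
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())                  # highest probability for the highest score; sums to 1
print(int(np.argmax(p)))           # predicted class: 0
```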

Single-layer feedforward networks
Single-layer (linear) feedforward network: y = wx + b (binary classification), where w is a weight vector and b is a bias term (a scalar). This is just a linear classifier (aka perceptron): the output y is a linear function of the input x.
Single-layer non-linear feedforward networks: pass wx + b through a non-linear activation function, e.g. y = tanh(wx + b).

Nonlinear activation functions
Sigmoid (logistic function): σ(x) = 1/(1 + e^(-x)). Useful for output units (probabilities): [0,1] range.
Hyperbolic tangent: tanh(x) = (e^(2x) - 1)/(e^(2x) + 1). Useful for internal units: [-1,1] range.
Hard tanh (approximates tanh): htanh(x) = -1 for x < -1, 1 for x > 1, x otherwise.
Rectified Linear Unit: ReLU(x) = max(0, x). Useful for internal units.
[Plots of sigmoid(x), tanh(x), hardtanh(x), and ReLU(x) over x ∈ [-6, 6].]
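All four activations are one-liners in numpy; a quick sketch for reference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes to (-1, 1)

def hardtanh(x):
    return np.clip(x, -1.0, 1.0)       # -1 for x < -1, 1 for x > 1, x otherwise

def relu(x):
    return np.maximum(0.0, x)          # 0 for negative inputs, identity otherwise

x = np.linspace(-6, 6, 5)
print(sigmoid(x), tanh(x), hardtanh(x), relu(x), sep="\n")
```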

Multi-layer feedforward networks
We can generalize this to multi-layer feedforward nets: an input layer (vector x), one or more hidden layers (vectors h_1, ..., h_n), and an output layer (vector y).
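As an illustration, a forward pass through a net with two hidden layers in numpy; the layer sizes and random weights are placeholders (a trained model would learn them):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b, x, activation):
    """One feedforward layer: activation(W x + b)."""
    return activation(W @ x + b)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Placeholder dimensions: input 4, hidden layers 8 and 8, output 3 classes
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(3, 8)), np.zeros(3)

x = rng.normal(size=4)                # input layer: vector x
h1 = layer(W1, b1, x, np.tanh)        # hidden layer: vector h1
h2 = layer(W2, b2, h1, np.tanh)       # hidden layer: vector h2
y = softmax(W3 @ h2 + b3)             # output layer: distribution over 3 classes
print(y, y.sum())
```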

Why neural approaches to NLP?

Motivation for neural approaches to NLP: Features can be brittle
Word-based features: how do we handle unseen/rare words?
Many features are produced by other NLP systems (POS tags, dependencies, NER output, etc.). These systems are often trained on labeled data, and producing labeled data can be very expensive. We typically don't have enough labeled data from the domain of interest, so we might not get accurate features for our domain of interest.

Features in neural approaches
Many of the currently successful neural approaches to NLP do not use traditional discrete features. Words in the input are often represented as dense vectors (aka word embeddings, e.g. word2vec).
Traditional approaches: each word in the vocabulary is a separate feature, so there is no generalization across words that have similar meanings.
Neural approaches: words with similar meanings have similar vectors, so models generalize across words with similar meanings.
Other kinds of features (POS tags, dependencies, etc.) are often ignored.

Motivation for neural approaches to NLP: Markov assumptions
Traditional sequence models (n-gram language models, HMMs, MEMMs, CRFs) make rigid Markov assumptions (bigram/trigram/n-gram). Recurrent neural nets (RNNs, LSTMs) can capture arbitrary-length histories without requiring more parameters.

Neural approaches to NLP

Challenges in using NNs for NLP
Our input and output variables are discrete: words, labels, structures. NNs work best with continuous vectors. We typically want to learn a mapping (embedding) from discrete words (input) to dense vectors. We can do this with (simple) neural nets and related methods.
The input to a NN is (traditionally) a fixed-length vector. How do you represent a variable-length sequence as a vector? Use recurrent neural nets: read in one word at a time to predict a vector, then use that vector and the next word to predict a new vector, etc.

How does NLP use NNs?
Word embeddings (word2vec, GloVe, etc.): train a NN to predict a word from its context (or the context from a word). This gives a dense vector representation of each word.
Neural language models: use recurrent neural networks (RNNs) to predict word sequences. More advanced: use LSTMs (a special case of RNNs).
Sequence-to-sequence (seq2seq) models: from machine translation: use one RNN to encode the source string, and another RNN to decode this into a target string. Also used for automatic image captioning, etc.
Recursive neural networks: used for parsing.

Neural Language Models
LMs define a distribution over strings: P(w_1 ... w_k). LMs factor P(w_1 ... w_k) into the probability of each word:
P(w_1 ... w_k) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_k | w_1 ... w_{k-1})
A neural LM needs to define a distribution over the V words in the vocabulary, conditioned on the preceding words.
Output layer: V units (one per word in the vocabulary) with a softmax to get a distribution.
Input: represent each preceding word by its d-dimensional embedding.
- Fixed-length history (n-gram): use the preceding n-1 words
- Variable-length history: use a recurrent neural net
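A minimal sketch of the fixed-length-history (n-gram) variant: embed the preceding n-1 words, pass them through a hidden layer, and apply a softmax over the V-word vocabulary. The sizes and random weights below are placeholder assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, n, hidden = 10000, 50, 3, 100     # vocab size, embedding dim, n-gram order, hidden size
E = rng.normal(0, 0.1, size=(V, d))     # word embedding matrix (one d-dim row per word)
W_h = rng.normal(0, 0.1, size=(hidden, (n - 1) * d))
b_h = np.zeros(hidden)
W_o = rng.normal(0, 0.1, size=(V, hidden))
b_o = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(context_ids):
    """P(w | preceding n-1 words), given the context as a list of n-1 word ids."""
    x = np.concatenate([E[i] for i in context_ids])   # concatenated embeddings of the history
    h = np.tanh(W_h @ x + b_h)                        # hidden layer
    return softmax(W_o @ h + b_o)                     # distribution over the V-word vocabulary

p = next_word_distribution([42, 1337])                # two preceding word ids (n-1 = 2)
print(p.shape, p.sum())                               # (10000,) ~1.0
```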

Recurrent neural networks (RNNs)
Basic RNN: modify the standard feedforward architecture (which predicts a string w_0 ... w_n one word at a time) such that the output of the current step (for w_i) is given as additional input to the next time step (when predicting the output for w_{i+1}). The "output" that is fed forward is typically the (last) hidden layer.
[Diagram: a feedforward net (input, hidden, output) vs. a recurrent net, whose hidden layer also feeds into the next time step.]
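A minimal sketch of one recurrent step (an Elman-style RNN); the dimensions and weights are placeholders. The hidden state from the previous step is fed in alongside the current input, so the final state summarizes a variable-length sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_h = 50, 100                              # input (embedding) size, hidden size
W_xh = rng.normal(0, 0.1, size=(d_h, d_in))      # input-to-hidden weights
W_hh = rng.normal(0, 0.1, size=(d_h, d_h))       # hidden-to-hidden (recurrent) weights
b_h = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Run over a variable-length sequence of input vectors (e.g. word embeddings)
sequence = [rng.normal(size=d_in) for _ in range(7)]
h = np.zeros(d_h)                                # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h)                         # h now summarizes the whole prefix
print(h.shape)                                   # (100,)
```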

Word Embeddings (e.g. word2vec)
Main idea: if you use a feedforward network to predict the probability of words that appear in the context of (near) an input word, the hidden layer of that network provides a dense vector representation of the input word.
Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations.
These models can be trained on large amounts of raw text (and pretrained embeddings can be downloaded).

Sequence-to-sequence (seq2seq) models
Task (e.g. machine translation): given one variable-length sequence as input, return another variable-length sequence as output.
Main idea: use one RNN to encode the input sequence (the "encoder"), and feed its last hidden state as input to a second RNN (the "decoder") that then generates the output sequence.
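A bare-bones sketch of this encoder-decoder wiring, reusing the recurrent step from above; the greedy decoding loop, the weight shapes, and the fixed output length are simplifying assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_h, V_out = 50, 100, 8000                 # embedding size, hidden size, target vocab

# Encoder and decoder each have their own recurrent parameters (placeholders)
enc_Wxh, enc_Whh = rng.normal(0, 0.1, (d_h, d_emb)), rng.normal(0, 0.1, (d_h, d_h))
dec_Wxh, dec_Whh = rng.normal(0, 0.1, (d_h, d_emb)), rng.normal(0, 0.1, (d_h, d_h))
W_out = rng.normal(0, 0.1, (V_out, d_h))          # hidden -> target vocabulary scores
E_out = rng.normal(0, 0.1, (V_out, d_emb))        # target-side word embeddings

def step(Wxh, Whh, x, h):
    return np.tanh(Wxh @ x + Whh @ h)

def encode(source_vectors):
    """Encoder RNN: read the source sequence; return the last hidden state."""
    h = np.zeros(d_h)
    for x in source_vectors:
        h = step(enc_Wxh, enc_Whh, x, h)
    return h

def decode(h, max_len=10):
    """Decoder RNN: start from the encoder's last state, greedily emit target word ids."""
    output, x = [], np.zeros(d_emb)               # first decoder input: a dummy start symbol
    for _ in range(max_len):
        h = step(dec_Wxh, dec_Whh, x, h)
        w = int(np.argmax(W_out @ h))             # pick the highest-scoring target word
        output.append(w)
        x = E_out[w]                              # feed the chosen word back in
    return output

source = [rng.normal(size=d_emb) for _ in range(5)]
print(decode(encode(source)))                     # 10 (untrained, hence arbitrary) word ids
```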