CSC321 Lecture 7 Neural language models

Roger Grosse and Nitish Srivastava. February 1, 2015.

Overview. We've talked about neural nets and backpropagation in the abstract. Now let's see our first real example: a neural language model. This also serves to introduce two big ideas: (1) probabilistic modeling, where the network learns to predict a probability distribution; the cross-entropy loss function encourages the model to assign high probability to the observed data, and it also avoids the problem of saturation in the output units. (2) Dimensionality reduction using linear hidden units (i.e. reducing the number of trainable parameters).

Language modeling. Motivation: suppose we want to build a speech recognition system. We'd like to be able to infer a likely sentence s given the observed speech signal a. The generative approach is to build two components: an observation model, represented as p(a | s), which tells us how likely the sentence s is to lead to the acoustic signal a; and a prior, represented as p(s), which tells us how likely a given sentence s is. E.g., it should know that "recognize speech" is more likely than "wreck a nice beach". Given these components, we can use Bayes' Rule to infer a posterior distribution over sentences given the speech signal: $p(s \mid a) = \frac{p(s)\, p(a \mid s)}{\sum_{s'} p(s')\, p(a \mid s')}$.
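To make the Bayes' Rule computation concrete, here is a minimal NumPy sketch that scores two candidate sentences; the sentences, prior values, and observation-model values are made up for illustration and are not from the lecture.

```python
import numpy as np

# Hypothetical candidate sentences with made-up prior and observation-model
# probabilities, just to illustrate the Bayes' Rule computation of p(s | a).
candidates = ["recognize speech", "wreck a nice beach"]
prior = np.array([0.7, 0.3])        # p(s): "recognize speech" is the likelier sentence
likelihood = np.array([0.2, 0.25])  # p(a | s): both sentences explain the audio similarly well

joint = prior * likelihood          # p(s) p(a | s)
posterior = joint / joint.sum()     # normalize over all candidate sentences

for s, p in zip(candidates, posterior):
    print(f"p({s!r} | a) = {p:.3f}")
```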

Language modeling. In this lecture, we focus on learning a good distribution p(s) over sentences. This problem is known as language modeling. Assume we have a corpus of sentences $s^{(1)}, \ldots, s^{(N)}$. The maximum likelihood criterion says we want our model to maximize the probability it assigns to the observed sentences. We assume the sentences are independent, so that their probabilities multiply: $\max_p \prod_{i=1}^N p(s^{(i)})$.

Language modeling. In maximum likelihood training, we want to maximize $\prod_{i=1}^N p(s^{(i)})$. The probability of generating the whole training corpus is vanishingly small, like monkeys typing all of Shakespeare. The log probability is something we can work with more easily. It also conveniently decomposes as a sum: $\log \prod_{i=1}^N p(s^{(i)}) = \sum_{i=1}^N \log p(s^{(i)})$. Let's use negative log probabilities, so that we're working with positive numbers. Intuition: slightly better trained monkeys are slightly more likely to type Hamlet!
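A small sketch of why the switch to negative log probabilities matters in practice; the per-sentence probabilities below are placeholder values rather than model outputs. The raw product underflows, while the sum of negative logs stays well behaved.

```python
import numpy as np

# Hypothetical per-sentence probabilities; real values would come from the model.
sentence_probs = np.full(1000, 1e-5)   # 1000 sentences, each with probability 1e-5

product = np.prod(sentence_probs)            # underflows to 0.0 in float64
total_nll = -np.sum(np.log(sentence_probs))  # stable: sum of negative log probabilities

print(product)    # 0.0 -- vanishingly small, like monkeys typing Shakespeare
print(total_nll)  # about 11512.9
```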

Neural language model. Probability of a sentence? What does that even mean? A sentence is a sequence of words $w_1, w_2, \ldots, w_M$. Using the chain rule of conditional probability, we can decompose the probability as $p(s) = p(w_1, \ldots, w_M) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_M \mid w_1, \ldots, w_{M-1})$. Therefore, the language modeling problem is equivalent to being able to predict the next word! We typically make a Markov assumption, i.e. that the distribution over the next word depends only on the preceding few words. E.g., if we use a context of length 3, $p(w_i \mid w_1, \ldots, w_{i-1}) = p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$. Now it's like a supervised prediction problem: the inputs are $(w_{i-3}, w_{i-2}, w_{i-1})$, and the target is $w_i$.
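A minimal sketch of what this supervised setup looks like with a context of length 3; the tiny corpus is invented for illustration.

```python
# Turn a corpus into supervised (context, target) pairs under a
# context-length-3 Markov assumption.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

context_len = 3
pairs = []
for sentence in corpus:
    for i in range(context_len, len(sentence)):
        context = tuple(sentence[i - context_len:i])   # (w_{i-3}, w_{i-2}, w_{i-1})
        target = sentence[i]                           # w_i
        pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```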

Neural language model. Bengio's neural language model (from Geoff's lecture). [Figure: the indices of the words at t-2 and t-1 are each mapped by a table look-up to a learned distributed encoding; units learn to predict the output word from features of the input words; softmax output units (one per possible next word); skip-layer connections.]
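For concreteness, here is a rough NumPy sketch of the forward pass the figure describes, using the sizes quoted later in the lecture (1000-word vocabulary, 20-dimensional embeddings, 300 hidden units). The skip-layer connections are omitted, the weights are random placeholders, and the layer details are assumptions rather than the exact architecture from Geoff's lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, num_hidden, context_len = 1000, 20, 300, 3

R = rng.normal(0, 0.1, (embed_dim, vocab_size))                # embedding table (look-up)
W = rng.normal(0, 0.1, (num_hidden, context_len * embed_dim))  # hidden-layer weights
U = rng.normal(0, 0.1, (vocab_size, num_hidden))               # output weights (one row per word)

def forward(word_indices):
    """Predict a distribution over the next word from 3 context word indices."""
    embeddings = [R[:, i] for i in word_indices]   # table look-up, one column per context word
    x = np.concatenate(embeddings)                 # 60-dimensional input to the hidden layer
    h = 1.0 / (1.0 + np.exp(-(W @ x)))             # logistic hidden units
    z = U @ h                                      # logits, one per possible next word
    z -= z.max()                                   # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()             # softmax over the vocabulary

probs = forward([12, 7, 431])    # arbitrary context word indices
print(probs.shape, probs.sum())  # (1000,) 1.0
```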

Neural language model. When we train a neural language model, is that supervised or unsupervised learning? Does it have elements of both?

A good loss function. Squared error worked nicely for regression. What loss function should we use when predicting probabilities? Consider these two functions: squared error, $C_{\mathrm{sq}} = \frac{1}{2}(y - t)^2$, and cross-entropy, $C_{\mathrm{CE}} = -t \log y - (1 - t) \log(1 - y)$. Here t and y are probabilities, so they lie in [0, 1]. If y = t = 0 or y = t = 1, both loss functions are zero. Suppose t = 1, and consider these two situations: the model predicts y = 0.01, giving $C_{\mathrm{sq}} = \frac{1}{2}(0.99)^2 \approx 0.49$ and $C_{\mathrm{CE}} = \log 100 \approx 4.6$; or the model predicts y = 0.0001, giving $C_{\mathrm{sq}} = \frac{1}{2}(0.9999)^2 \approx 0.5$ and $C_{\mathrm{CE}} = \log 10000 \approx 9.2$. The first case is a lot better than the second: the model assigns a hundred times more probability to the correct answer. So we need to be a lot more unhappy about the second case than the first. Which loss function does that?
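The arithmetic behind these numbers, as a quick check (a sketch for verification, not part of the original slides):

```python
import numpy as np

def squared_error(y, t):
    return 0.5 * (y - t) ** 2

def cross_entropy(y, t):
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

t = 1.0
for y in (0.01, 0.0001):
    print(f"y = {y}: C_sq = {squared_error(y, t):.4f}, C_CE = {cross_entropy(y, t):.2f}")

# y = 0.01:   C_sq ~ 0.49, C_CE = log(100)   ~ 4.61
# y = 0.0001: C_sq ~ 0.50, C_CE = log(10000) ~ 9.21
```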

Question 1: Squared error vs. cross-entropy. Geoff said that squared error is the wrong cost function to use with logistic or softmax outputs because of saturation. Let's analyze this in the case of a logistic unit. Recall the logistic activation function: $y = \sigma(z) = \frac{1}{1 + e^{-z}}$. Suppose our target is t = 1. Sketch C and dC/dz as a function of z for the cost functions: squared error, $C_{\mathrm{sq}} = \frac{1}{2}(y - t)^2$, and cross-entropy, $C_{\mathrm{CE}} = -t \log y - (1 - t) \log(1 - y)$. Hint: consider the behavior as $z \to \pm\infty$.

Question 1: Squared error vs. cross-entropy. [Figure: plots of C and dC/dz versus z for the two cost functions with t = 1; the squared-error cost flattens out for very negative z.]

Question 1: Squared error vs. cross-entropy. The region where $C_{\mathrm{sq}}$ is flat for negative z is a plateau. A function C is convex if line segments joining points on its graph lie above the graph. Mathematically, $C(\lambda w_1 + (1 - \lambda) w_2) \le \lambda C(w_1) + (1 - \lambda) C(w_2)$ for $\lambda \in [0, 1]$. Convex cost functions are usually easier to optimize because there aren't any local optima or plateaux. [Figure: as a function of z, the squared-error cost is not convex, while the cross-entropy cost is convex.]
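To see the plateau numerically, here is a small sketch that evaluates dC/dz for both costs at t = 1, using the standard logistic-unit derivatives $\frac{dC_{\mathrm{sq}}}{dz} = (y - t)\, y (1 - y)$ and $\frac{dC_{\mathrm{CE}}}{dz} = y - t$; the specific z values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t = 1.0
for z in (-10.0, -5.0, 0.0, 5.0):
    y = sigmoid(z)
    dsq_dz = (y - t) * y * (1 - y)   # squared error: gradient vanishes as z -> -inf (plateau)
    dce_dz = y - t                   # cross-entropy: gradient stays near -1 while y is wrong
    print(f"z = {z:6.1f}: dC_sq/dz = {dsq_dz:10.2e}, dC_CE/dz = {dce_dz:8.4f}")
```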

Question 1: Squared error vs. cross-entropy. We just saw that plateaux are a problem for logistic output units with squared error loss, but not cross-entropy. We often use logistic hidden units in multilayer neural nets. Do you think we have plateaux when training these units?

Question 2: Linear hidden units. The neural language model is the first example we've seen of an embedding layer. Geoff describes it as a lookup table. But we can also think of it as a linear hidden layer: multiplying R by an indicator vector selects a column of R.
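A two-line NumPy check of this claim (the matrix size and word index are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(20, 1000))   # embedding matrix: one 20-dimensional column per word

word_index = 437
one_hot = np.zeros(1000)
one_hot[word_index] = 1.0

# The matrix-vector product selects exactly the column for word_index.
assert np.allclose(R @ one_hot, R[:, word_index])
```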

Question 2: Linear hidden units. Here we have two architectures. Model A is similar to the neural language model, while Model B eliminates the embedding layer. Show that Model B can compute any function that Model A can compute: given weight matrices R and W for Model A, compute an equivalent weight matrix V for Model B. Solution: Model B can match Model A's function by setting $V = W \tilde{R}$, where $\tilde{R}$ is the block-diagonal matrix $\begin{pmatrix} R & 0 & 0 \\ 0 & R & 0 \\ 0 & 0 & R \end{pmatrix}$, i.e. it applies R separately to each of the three concatenated indicator vectors.
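A numerical sketch of this equivalence, under the block-diagonal reading of the solution above; the sizes match the ones used on the next slide and the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, embed, hidden = 1000, 20, 300

R = rng.normal(size=(embed, vocab))        # shared embedding matrix
W = rng.normal(size=(hidden, 3 * embed))   # hidden weights acting on 3 concatenated embeddings

R_tilde = np.kron(np.eye(3), R)            # (60, 3000): block-diagonal, R on each block
V = W @ R_tilde                            # (300, 3000): Model B's equivalent weight matrix

# Check on a random 3-word context encoded as concatenated one-hot vectors.
idx = rng.integers(0, vocab, size=3)
x = np.zeros(3 * vocab)
for k, i in enumerate(idx):
    x[k * vocab + i] = 1.0

h_A = W @ np.concatenate([R[:, i] for i in idx])   # Model A: embed, concatenate, then W
h_B = V @ x                                        # Model B: one big matrix-vector product
assert np.allclose(h_A, h_B)
```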

Question 2: Linear hidden units. Here we have two architectures. Model A is similar to the neural language model, while Model B eliminates the embedding layer. Assume there are 1000 words in the vocabulary, 20-dimensional embeddings, and 300 hidden units. (1) Compute the number of trainable parameters in the layers shown for each model. (2) Approximately compute the number of arithmetic operations required to compute the hidden activations for a given input. Assume we compute the matrix-vector products explicitly rather than using a lookup table. Hint: how many operations are needed to compute a matrix-vector product?

Question 2: Linear hidden units. Number of trainable parameters: Model A: R is a matrix of size 20 × 1000 and W is of size 300 × 60, for 20(1000) + 300(60) = 20,000 + 18,000 = 38,000 learnable parameters. Model B: V is a matrix of size 300 × 3000, for 900,000 learnable parameters. Number of computations: a matrix-vector product where the matrix is of size M × N involves approximately MN adds and MN multiplies. Model A: three matrix-vector products with a 20 × 1000 matrix and one with a 300 × 60 matrix, for a total of 3(20,000) + 18,000 = 78,000 multiplies and adds. Model B: one matrix-vector product with a 300 × 3000 matrix, for 900,000 multiplies and adds. Note: since the inputs are indicator vectors, in practice you would use a lookup table. But we asked about explicit multiplications since most situations where we use linear hidden layers won't involve indicator vectors.
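The same counts, spelled out as a few lines of arithmetic (a sketch for checking, not from the slides):

```python
vocab, embed, hidden, context = 1000, 20, 300, 3

# Trainable parameters
params_A = embed * vocab + hidden * (context * embed)   # R plus W: 20,000 + 18,000 = 38,000
params_B = hidden * (context * vocab)                   # V alone: 900,000

# Multiply-add counts for one input, treating every layer as an explicit matrix-vector product
ops_A = context * (embed * vocab) + hidden * (context * embed)   # 60,000 + 18,000 = 78,000
ops_B = hidden * (context * vocab)                               # 900,000

print(params_A, params_B)   # 38000 900000
print(ops_A, ops_B)         # 78000 900000
```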

Question 2: Linear hidden units. To recap: nonlinear hidden units allow a network to compute more complex functions. Linear hidden units don't increase the expressive power of a network, but they can introduce a bottleneck which reduces the number of learnable parameters or the number of computations required. This corresponds to a low-rank factorization of the weight matrix.