Deep Learning Basics Lecture 10: Neural Language Models. Princeton University COS 495 Instructor: Yingyu Liang


Natural Language Processing (NLP) The processing of human languages by computers. One of the oldest AI tasks, one of the most important AI tasks, and one of the hottest AI tasks nowadays.

Difficulty Difficulty 1: language is ambiguous and typically has no formal description. Example: We saw her duck. 1. We looked at a duck that belonged to her. 2. We watched her quickly squat down to avoid something. 3. We used a saw to cut her duck.

Difficulty Difficulty 2: computers do not have human concepts. Example: She likes little animals. For example, yesterday we saw her duck. 1. We looked at a duck that belonged to her. 2. We watched her quickly squat down to avoid something. 3. We used a saw to cut her duck.

Statistical language model

Probabilistic view Use a probability distribution to model the language. Dates back to Shannon (information theory; bits in the message).

Statistical language model Language model: probability distribution over sequences of tokens. Typically, tokens are words, and the distribution is discrete. Tokens can also be characters or even bytes. Sentence: the quick brown fox jumps over the lazy dog. Tokens: x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9

Statistical language model For simplicity, consider a fixed-length sequence of tokens (a sentence): (x_1, x_2, x_3, ..., x_{τ-1}, x_τ). Probabilistic model: P[x_1, x_2, x_3, ..., x_{τ-1}, x_τ]

N-gram model

n-gram model n-gram: a sequence of n tokens. n-gram model: defines the conditional probability of the n-th token given the preceding n-1 tokens (Markovian assumption): P[x_1, x_2, ..., x_τ] = P[x_1, ..., x_{n-1}] ∏_{t=n}^{τ} P[x_t | x_{t-n+1}, ..., x_{t-1}]

Typical n-gram model n = 1: unigram n = 2: bigram n = 3: trigram

Training n-gram model Straightforward counting: count the co-occurrences of the grams. For all grams (x_{t-n+1}, ..., x_{t-1}, x_t): 1. count and estimate P[x_{t-n+1}, ..., x_{t-1}, x_t] 2. count and estimate P[x_{t-n+1}, ..., x_{t-1}] 3. compute P[x_t | x_{t-n+1}, ..., x_{t-1}] = P[x_{t-n+1}, ..., x_{t-1}, x_t] / P[x_{t-n+1}, ..., x_{t-1}]

A simple trigram example Sentence: the dog ran away. P[the dog ran away] = P[the dog ran] · P[away | dog ran] P[the dog ran away] = P[the dog ran] · P[dog ran away] / P[dog ran]
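A minimal counting-based trigram estimator, sketched in Python; the tiny corpus and the helper names (train_trigram, cond_prob) are illustrative, not from the lecture:

```python
from collections import Counter

def train_trigram(corpus):
    """Estimate P[x_t | x_{t-2}, x_{t-1}] by counting trigram and context-bigram occurrences."""
    bigrams, trigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - 2):
            bigrams[tuple(tokens[i:i+2])] += 1      # bigram counted as a context
            trigrams[tuple(tokens[i:i+3])] += 1
    def cond_prob(w, context):                      # P[w | context], context = (x_{t-2}, x_{t-1})
        if bigrams[context] == 0:
            return 0.0                              # sparsity: unseen context
        return trigrams[context + (w,)] / bigrams[context]
    return cond_prob

corpus = ["the dog ran away", "the dog ran home", "a cat ran away"]
p = train_trigram(corpus)
print(p("away", ("dog", "ran")))   # count(dog ran away) / count(dog ran) = 1/2
```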

Drawback Sparsity issue: many of these probabilities are likely to be 0. Bad case: dog ran away never appears in the training corpus, so P[dog ran away] = 0. Even worse: dog ran never appears in the training corpus, so P[dog ran] = 0.

Rectify: smoothing Basic method: add non-zero probability mass to zero entries. Back-off methods: resort to lower-order statistics. Example: if P[away | dog ran] does not work, use P[away | ran] as a replacement. Mixture methods: use a linear combination of P[away | ran] and P[away | dog ran] (sketched below).
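A sketch of the mixture (interpolation) method in Python, assuming counting-based estimators p_tri, p_bi, p_uni like the one above; the interpolation weights are illustrative:

```python
def interpolated_prob(w, context, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Mixture smoothing: linear combination of trigram, bigram, and unigram estimates.

    p_tri(w, (u, v)) ~ P[w | u v], p_bi(w, v) ~ P[w | v], p_uni(w) ~ P[w];
    the lambdas are illustrative interpolation weights and should sum to 1.
    """
    l3, l2, l1 = lambdas
    u, v = context
    return l3 * p_tri(w, (u, v)) + l2 * p_bi(w, v) + l1 * p_uni(w)
```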

Drawback High dimension: the number of grams is too large. Vocabulary size: about 10k ≈ 2^14. Number of trigrams: about (2^14)^3 = 2^42.

Rectify: clustering Class-based language models: cluster tokens into classes and replace each token with its class. Significantly reduces the vocabulary size; also addresses the sparsity issue. Combinations of smoothing and clustering are also possible.

Neural language model

Neural Language Models Language models designed for modeling natural language sequences by using a distributed representation of words. Distributed representation: embed each word as a real vector (also called a word embedding). Language model: functions that act on the vectors.

Distributed vs Symbolic representation Symbolic representation: can be viewed as a one-hot vector. Token i in the vocabulary is represented as e_i = (0, ..., 0, 1, 0, ..., 0), with the 1 in the i-th entry. Can be viewed as a special case of distributed representation.

Distributed vs Symbolic representation Word embeddings: used for real-valued computation (instead of logic/grammar derivation or a discrete probabilistic model). Hope that the real-valued computation corresponds to semantics. Example: inner products correspond to token similarities. One-hot vectors: every pair of distinct words has inner product 0.
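A small numpy illustration of this contrast; the vocabulary indices and the randomly drawn embedding matrix are made up for the example:

```python
import numpy as np

vocab_size, dim = 10, 4
e_cat, e_dog = np.eye(vocab_size)[2], np.eye(vocab_size)[7]   # one-hot vectors
print(e_cat @ e_dog)        # 0.0: distinct one-hot vectors are always orthogonal

rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, dim))   # a made-up embedding matrix
v_cat, v_dog = E[2], E[7]
cosine = v_cat @ v_dog / (np.linalg.norm(v_cat) * np.linalg.norm(v_dog))
print(cosine)               # real-valued similarity; training makes this reflect semantics
```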

Co-occurrence Firth's Hypothesis (1957): the meaning of a word is defined by the company it keeps. Co-occurrence matrix: entry P[w, w'] for a word pair (w, w'). Use the co-occurrences of the word as its vector: v_w = P[w, :]

Co-occurrence Firth's Hypothesis (1957): the meaning of a word is defined by the company it keeps. Co-occurrence matrix: entry P[w, c] for a word w and a context c; the context can be another word or a phrase. Use the co-occurrences of the word as its vector: v_w = P[w, :]
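A sketch of building co-occurrence counts with a symmetric window, in Python; the window size and the function name are illustrative:

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Count co-occurrences of (word, context word) within a +/- `window` window;
    normalizing the counts gives the P[w, w'] entries used as word vectors."""
    counts = defaultdict(float)
    for sentence in corpus:
        tokens = sentence.split()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1.0
    return counts
```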

Drawback High dimensionality: equals the vocabulary size (~10k); can be even higher if contexts are used.

Latent semantic analysis (LSA) LSA (Deerwester et al., 1990): low-rank approximation of the co-occurrence matrix, M ≈ XY, where M has entries P[w, w']; the row of X corresponding to word w is its vector.

Variants Low-rank approximation of a transformed co-occurrence matrix, M ≈ XY, where M has entries given by a transformation of P[w, w'], e.g. PMI(w, w') = ln( P[w, w'] / (P[w] P[w']) ); the row of X corresponding to word w is its vector.
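A sketch of an LSA-style embedding in numpy: apply the PMI transformation to a co-occurrence count matrix, then take a truncated SVD as the low-rank approximation M ≈ XY. The input matrix and the dimension are assumptions for illustration:

```python
import numpy as np

def pmi_embeddings(cooc, dim=50, eps=1e-8):
    """Transform a (V x V) co-occurrence count matrix with PMI, then take a
    rank-`dim` truncated SVD; row i of the returned matrix is the vector of word i."""
    P = cooc / cooc.sum()                         # joint P[w, w']
    Pw = P.sum(axis=1, keepdims=True)             # marginal P[w], shape (V, 1)
    pmi = np.log((P + eps) / (Pw @ Pw.T + eps))   # PMI(w, w') = ln P[w,w'] / (P[w] P[w'])
    U, S, Vt = np.linalg.svd(pmi, full_matrices=False)
    X = U[:, :dim] * S[:dim]                      # low-rank factor: M ≈ X Y
    return X
```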

State-of-the-art word embeddings Updated on April 2016

Word2vec Continuous Bag-Of-Words (CBOW). Figure from Efficient Estimation of Word Representations in Vector Space, by Mikolov, Chen, Corrado, Dean. P[w_t | w_{t-2}, ..., w_{t+2}] ∝ exp[ ⟨v_{w_t}, mean(v_{w_{t-2}}, ..., v_{w_{t+2}})⟩ ]
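A minimal numpy sketch of the CBOW scoring rule above (full softmax over the vocabulary, no training); the embedding matrices V_in and V_out are assumed to be given:

```python
import numpy as np

def cbow_prob(target, context, V_in, V_out):
    """P[w_t | context] proportional to exp(<v_out[w_t], mean of context input vectors>).
    V_in, V_out: (vocab_size x dim) input/output embedding matrices; `context` is a
    list of word indices, `target` a single word index (all illustrative inputs)."""
    h = V_in[context].mean(axis=0)                   # mean of context word vectors
    scores = V_out @ h                               # inner product with every output vector
    scores -= scores.max()                           # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary
    return probs[target]
```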

Linear structure for analogies Semantic: man:woman::king:queen, v_man − v_woman ≈ v_king − v_queen. Syntactic: run:running::walk:walking, v_run − v_running ≈ v_walk − v_walking.
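A sketch of exploiting this linear structure to answer analogy queries by nearest neighbor in cosine similarity; the `vectors` matrix and `vocab` word-to-index map are assumed inputs:

```python
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """Solve a : b :: c : ? by finding the word whose vector is closest (cosine
    similarity) to v_b - v_a + v_c, excluding the query words themselves."""
    target = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
    target /= np.linalg.norm(target)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ target
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(sims))]   # e.g. analogy("man", "woman", "king", ...) -> "queen"
```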

GloVe: Global Vectors Suppose the co-occurrence count between word i and word j is X_ij. The word vectors for word i are w_i and w̃_i. The GloVe objective function is Σ_{i,j} f(X_ij) (⟨w_i, w̃_j⟩ + b_i + b̃_j − ln X_ij)^2, where the b_i's are bias terms and f(x) = min{x, 100}^{3/4}.
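A sketch of evaluating this objective in numpy for given parameters (not the training procedure); the weighting follows the slide's f(x) = min{x, 100}^{3/4}, which is the GloVe paper's weighting up to a constant factor:

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Compute sum_{i,j} f(X_ij) (<w_i, w~_j> + b_i + b~_j - ln X_ij)^2 for a
    (V x V) co-occurrence count matrix X; only nonzero entries contribute.
    W, W_tilde: (V x dim) vector matrices; b, b_tilde: (V,) bias vectors."""
    i, j = np.nonzero(X)
    f = np.minimum(X[i, j], 100.0) ** 0.75    # weighting function f(x) = min{x, 100}^(3/4)
    diff = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(X[i, j])
    return float((f * diff ** 2).sum())
```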

Advertisement Lots of mysterious things. What are the reasons behind: the weird transformation on the co-occurrence? the model of word2vec? the objective of GloVe? the hyperparameters (weights, biases, etc.)? What are the connections between them? A unified framework? Why do the word vectors have linear structure for analogies?

Advertisement We proposed a generative model with theoretical analysis: RAND-WALK: A Latent Variable Model Approach to Word Embeddings. Next lecture by Tengyu Ma, presenting this work. Can't miss it!