Natural Language Processing, Info 159/259
Lecture 7: Language models 2 (Sept 14, 2017)
David Bamman, UC Berkeley

Language Model

The vocabulary $\mathcal{V}$ is a finite set of discrete symbols (e.g., words, characters); $|\mathcal{V}| = V$. $\mathcal{V}^+$ is the infinite set of sequences of symbols from $\mathcal{V}$; each sequence ends with STOP, and $x \in \mathcal{V}^+$.

Language Model

Language modeling is the task of estimating $P(w)$.

Language Model

arma virumque cano
(arms man and I sing)

"Arms, and the man I sing" [Dryden]
"I sing of arms and a man"

When we have choices to make about different ways to say something, a language model gives us a formal way of operationalizing their fluency (in terms of how likely they are to exist in the language).

Markov assumption

bigram model (first-order Markov):
$\prod_{i=1}^{n} P(w_i \mid w_{i-1}) \cdot P(\text{STOP} \mid w_n)$

trigram model (second-order Markov):
$\prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}) \cdot P(\text{STOP} \mid w_{n-1}, w_n)$
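A minimal sketch (not from the slides) of how the bigram factorization scores a sentence, assuming a hypothetical bigram_prob table estimated elsewhere:

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); in practice these would be
# estimated from a corpus (e.g., smoothed relative frequencies).
bigram_prob = {
    ("<s>", "the"): 0.20, ("the", "dog"): 0.05,
    ("dog", "walked"): 0.10, ("walked", "STOP"): 0.30,
}

def bigram_log_prob(words):
    """Log P(w) under a first-order Markov (bigram) model, including STOP."""
    history = "<s>"
    logp = 0.0
    for w in words + ["STOP"]:
        logp += math.log(bigram_prob.get((history, w), 1e-12))  # tiny floor for unseen pairs
        history = w
    return logp

print(bigram_log_prob(["the", "dog", "walked"]))
```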

Classification

A mapping h from input data x (drawn from instance space $\mathcal{X}$) to a label (or labels) y from some enumerable output space $\mathcal{Y}$.

$\mathcal{X}$ = set of all documents
$\mathcal{Y}$ = {english, mandarin, greek, ...}
x = a single document
y = ancient greek

Classification

A mapping h from input data x (drawn from instance space $\mathcal{X}$) to a label (or labels) y from some enumerable output space $\mathcal{Y}$.

$\mathcal{Y}$ = {the, of, a, dog, iphone, ...}
x = (context)
y = word

Logistic regression

$P(y = 1 \mid x, \beta) = \dfrac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}$

output space $\mathcal{Y} = \{0, 1\}$

x = feature vector, β = coefficients

Feature    Value        Feature    β
the        0            the         0.01
and        0            and         0.03
bravest    0            bravest     1.4
love       0            love        3.1
loved      0            loved       1.2
genius     0            genius      0.5
not        0            not        -3.0
fruit      1            fruit      -0.8
BIAS       1            BIAS       -0.1

$P(Y = 1 \mid X = x; \beta) = \dfrac{\exp(x \cdot \beta)}{1 + \exp(x \cdot \beta)} = \dfrac{\exp(x \cdot \beta)}{\exp(x \cdot 0) + \exp(x \cdot \beta)}$

Logistic regression

$P(Y = y \mid X = x; \beta) = \dfrac{\exp(x \cdot \beta_y)}{\sum_{y' \in \mathcal{Y}} \exp(x \cdot \beta_{y'})}$

output space $\mathcal{Y} = \{1, \ldots, K\}$
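A minimal sketch, not from the slides, of the multiclass (softmax) logistic regression above, with made-up feature and weight values:

```python
import numpy as np

def softmax_prob(x, B):
    """P(Y = y | X = x; B) for every class y, where B has one weight column per class."""
    scores = x @ B                      # one score x . beta_y per class
    scores -= scores.max()              # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical example: 3 features, 4 classes.
x = np.array([0.0, 1.0, 1.0])           # feature vector (last entry could be a BIAS feature)
B = np.random.randn(3, 4)               # coefficients beta_y, one column per class
print(softmax_prob(x, B))               # sums to 1 over the 4 classes
```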

x = feature vector, β = coefficients

Feature    Value    β1      β2      β3      β4      β5
the        0        1.33   -0.80   -0.54    0.87    0
and        0        1.21   -1.73   -1.57   -0.13    0
bravest    0        0.96   -0.05    0.24    0.81    0
love       0        1.49    0.53    1.01    0.64    0
loved      0       -0.52   -0.02    2.21   -2.53    0
genius     0        0.98    0.77    1.53   -0.95    0
not        0       -0.96    2.14   -0.71    0.43    0
fruit      1        0.59   -0.76    0.93    0.03    0
BIAS       1       -1.92   -0.70    0.94   -0.63    0

Language Model

We can use multiclass logistic regression for language modeling by treating the vocabulary as the output space: $\mathcal{Y} = \mathcal{V}$.

Unigram LM

A unigram language model here would have just one feature: a bias term.

Feature    βthe    βof     βa      βdog    βiphone
BIAS       -1.92   -0.70   0.94    -0.63   0

Bigram LM

Feature         Value    βthe    βof     βa      βdog    βiphone
wi-1=the        1        1.33   -0.80   -0.54    0.87    0
wi-1=and        0        1.21   -1.73   -1.57   -0.13    0
wi-1=bravest    0        0.96   -0.05    0.24    0.81    0
wi-1=love       0        1.49    0.53    1.01    0.64    0
wi-1=loved      0       -0.52   -0.02    2.21   -2.53    0
wi-1=genius     0        0.98    0.77    1.53   -0.95    0
wi-1=not        0       -0.96    2.14   -0.71    0.43    0
wi-1=fruit      0        0.59   -0.76    0.93    0.03    0
BIAS            1       -1.92   -0.70    0.94   -0.63    0

$P(w_i = \text{dog} \mid w_{i-1} = \text{the})$

(computed with the bigram feature table above: the active features are wi-1=the and BIAS)
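As a quick worked check (added here; restricting the softmax to just the five classes shown in the table), the score for each class is the dot product of the two active features (wi-1=the and BIAS) with that class's β column:

$x \cdot \beta_{\text{dog}} = 0.87 + (-0.63) = 0.24,\quad x \cdot \beta_{\text{the}} = -0.59,\quad x \cdot \beta_{\text{of}} = -1.50,\quad x \cdot \beta_{\text{a}} = 0.40,\quad x \cdot \beta_{\text{iphone}} = 0$

$P(w_i = \text{dog} \mid w_{i-1} = \text{the}) = \dfrac{e^{0.24}}{e^{-0.59} + e^{-1.50} + e^{0.40} + e^{0.24} + e^{0}} \approx 0.28$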

Trigram LM

$P(w_i = \text{dog} \mid w_{i-2} = \text{and}, w_{i-1} = \text{the})$

Feature                    Value
wi-2=the ^ wi-1=the        0
wi-2=and ^ wi-1=the        1
wi-2=bravest ^ wi-1=the    0
wi-2=love ^ wi-1=the       0
wi-2=loved ^ wi-1=the      0
wi-2=genius ^ wi-1=the     0
wi-2=not ^ wi-1=the        0
wi-2=fruit ^ wi-1=the      0
BIAS                       1

Parameters

How many parameters do we have with a log-linear trigram language model? How many do we have with a generative trigram LM?
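One rough way to count, added here as a sketch rather than an answer from the slides, assuming the log-linear model uses an indicator feature for every possible $(w_{i-2}, w_{i-1})$ history plus a bias, each with its own weight for every output word:

$\text{log-linear trigram LM: } (V^2 + 1) \cdot V \text{ weights} \qquad \text{generative trigram LM: } V^3 \text{ conditional probabilities } P(w_i \mid w_{i-2}, w_{i-1})$

Adding the first-order backoff features from the next slide would raise the log-linear count to roughly $(V^2 + V + 1) \cdot V$.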

Smoothing

$P(w_i = \text{dog} \mid w_{i-2} = \text{and}, w_{i-1} = \text{the})$

Feature                    Value
second-order features:
wi-2=the ^ wi-1=the        0
wi-2=and ^ wi-1=the        1
wi-2=bravest ^ wi-1=the    0
wi-2=love ^ wi-1=the       0
first-order features:
wi-1=the                   1
wi-1=and                   0
wi-1=bravest               0
wi-1=love                  0
BIAS                       1
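A minimal sketch (function name and feature format hypothetical) of how such a feature vector could be built, combining second-order (conjunction) features with first-order backoff features:

```python
def trigram_features(w_prev2, w_prev1):
    """Sparse indicator features for a log-linear trigram LM with backoff features."""
    return {
        f"wi-2={w_prev2} ^ wi-1={w_prev1}": 1,   # second-order (conjunction) feature
        f"wi-1={w_prev1}": 1,                     # first-order (backoff) feature
        "BIAS": 1,
    }

print(trigram_features("and", "the"))
# {'wi-2=and ^ wi-1=the': 1, 'wi-1=the': 1, 'BIAS': 1}
```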

L2 regularization

$F(\beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \beta) - \eta \sum_{j=1}^{F} \beta_j^2$

We want the log likelihood (the first term) to be high, but we want the penalty term to be small. We do this by changing the function we're trying to optimize, adding a penalty for values of β that are large. This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0. η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).

L1 regularization

$F(\beta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \beta) - \eta \sum_{j=1}^{F} |\beta_j|$

Again we want the log likelihood to be high and the penalty term to be small. L1 regularization encourages coefficients to be exactly 0. η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).
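A minimal numpy sketch (not from the slides) of the two regularized objectives for the multiclass model, written as functions to be maximized on toy data:

```python
import numpy as np

def log_likelihood(X, y, B):
    """Sum of log P(y_i | x_i; B) under multiclass logistic regression."""
    scores = X @ B                                    # (N, K)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()

def objective_l2(X, y, B, eta):
    return log_likelihood(X, y, B) - eta * np.sum(B ** 2)

def objective_l1(X, y, B, eta):
    return log_likelihood(X, y, B) - eta * np.sum(np.abs(B))

# Hypothetical toy data: 4 examples, 3 features, 2 classes.
X = np.array([[1., 0., 1.], [0., 1., 1.], [1., 1., 1.], [0., 0., 1.]])
y = np.array([0, 1, 0, 1])
B = np.zeros((3, 2))
print(objective_l2(X, y, B, eta=0.1), objective_l1(X, y, B, eta=0.1))
```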

Richer representations

Log-linear models give us the flexibility to encode richer representations of the context we are conditioning on. We can reason about any observations from the entire history, not just the local context.

JACKSONVILLE, Fla. Stressed and exhausted families across the Southeast were assessing the damage from Hurricane Irma on Tuesday, even as flooding from the storm continued to plague some areas, like Jacksonville, and the worst of its wallop was being revealed in others, like the Florida Keys. Officials in Florida, Georgia and South Carolina tried to prepare residents for the hardships of recovery from the

https://www.nytimes.com/2017/09/12/us/irma-jacksonville-florida.html

Hillary Clinton seemed to add Benghazi to her already-long list of culprits to blame for her upset loss to Donald

http://www.foxnews.com/politics/2017/09/13/clinton-laments-how-benghazi-tragedy-hurt-her-politically.html

Hillary Clinton seemed to add Benghazi to her already-long list of culprits to blame for her upset loss to Donald

feature class                          example
ngrams (wi-1, wi-2:wi-1, wi-3:wi-1)    wi-2=to, wi=donald
gappy ngrams                           w1=hillary and wi-1=donald
spelling, capitalization               wi-1 is capitalized and wi is capitalized
class/gazetteer membership             wi-1 in list of names and wi in list of names

Classes

bigram       count        bigram          count
black car    100          <COLOR> car     137
blue car     37
red car      0

Goldberg 2017

Tradeoffs

Richer representations = more parameters, higher likelihood of overfitting. Much slower to train than estimating the parameters of a generative model.

$P(Y = y \mid X = x; \beta) = \dfrac{\exp(x \cdot \beta_y)}{\sum_{y' \in \mathcal{Y}} \exp(x \cdot \beta_{y'})}$

Neural LM

Simple feed-forward multilayer perceptron (e.g., one hidden layer). The input x is the vector concatenation of a conditioning context of fixed size k:

$x = [v(w_1); \ldots; v(w_k)]$

Bengio et al. 2003

$x = [v(w_1); \ldots; v(w_k)]$

[Figure: the context words w1 = tried, w2 = to, w3 = prepare, w4 = residents, each shown first as a one-hot encoding v(w) and then as a low-dimensional distributed representation; the representations are concatenated to form x. Bengio et al. 2003]

$x = [v(w_1); \ldots; v(w_k)]$
$h = g(x W_1 + b_1)$
$\hat{y} = \mathrm{softmax}(h W_2 + b_2)$

$W_1 \in \mathbb{R}^{kd \times H}$,  $W_2 \in \mathbb{R}^{H \times |\mathcal{V}|}$,  $b_1 \in \mathbb{R}^{H}$,  $b_2 \in \mathbb{R}^{|\mathcal{V}|}$

Bengio et al. 2003
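A minimal numpy sketch of this forward pass for one conditioning context; all dimensions and values are made up:

```python
import numpy as np

k, d, H, V = 3, 50, 100, 10000   # context size, embedding dim, hidden dim, vocab size (made up)

E = np.random.randn(V, d) * 0.01          # word embeddings v(w)
W1 = np.random.randn(k * d, H) * 0.01
b1 = np.zeros(H)
W2 = np.random.randn(H, V) * 0.01
b2 = np.zeros(V)

def forward(context_ids):
    """P(w_i | context) for a fixed-size context of k word ids."""
    x = np.concatenate([E[w] for w in context_ids])   # x = [v(w1); ...; v(wk)]
    h = np.tanh(x @ W1 + b1)                           # h = g(x W1 + b1)
    scores = h @ W2 + b2
    scores -= scores.max()                             # numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()      # softmax over the vocabulary
    return y_hat

probs = forward([17, 42, 7])     # hypothetical word ids for "tried to prepare"
print(probs.shape, probs.sum())  # (10000,) 1.0
```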

Softmax

$P(Y = y \mid X = x; \beta) = \dfrac{\exp(x \cdot \beta_y)}{\sum_{y' \in \mathcal{Y}} \exp(x \cdot \beta_{y'})}$

Neural LM

[Figure: the conditioning context "tried to prepare residents for the hardships of recovery from the" feeding into the network to predict the next word y.]

Recurrent neural network

RNNs allow arbitrarily-sized conditioning contexts; they condition on the entire sequence history.

Recurrent neural network Goldberg 2017

Recurrent neural network

Each time step has two inputs:
$x_i$ (the observation at time step i): a one-hot vector, feature vector, or distributed representation
$s_{i-1}$ (the output of the previous state); base case: $s_0$ = the zero vector

Recurrent neural network

$s_i = R(x_i, s_{i-1})$    (R computes the output state as a function of the current input and the previous state)

$y_i = O(s_i)$    (O computes the output as a function of the current output state)

Continuous bag of words

$s_i = R(x_i, s_{i-1}) = s_{i-1} + x_i$    (each state is the sum of all previous inputs)

$y_i = O(s_i) = s_i$    (the output is the identity)

Simple RNN

$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)$,  with g = tanh or ReLU

Different weight matrices W transform the previous state and the current input before combining them:
$W^s \in \mathbb{R}^{H \times H}$,  $W^x \in \mathbb{R}^{D \times H}$,  $b \in \mathbb{R}^{H}$

$y_i = O(s_i) = s_i$

Elman 1990, Mikolov 2012

RNN LM

The output state $s_i$ is an H-dimensional real vector; we can turn it into a probability distribution by passing it through an additional linear transformation followed by a softmax:

$y_i = O(s_i) = \mathrm{softmax}(s_i W^o + b^o)$

Elman 1990, Mikolov 2012
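A minimal numpy sketch of one step of this RNN language model; dimensions and initialization are made up:

```python
import numpy as np

D, H, V = 50, 100, 10000          # input dim, hidden dim, vocab size (made up)
Ws = np.random.randn(H, H) * 0.01
Wx = np.random.randn(D, H) * 0.01
b  = np.zeros(H)
Wo = np.random.randn(H, V) * 0.01
bo = np.zeros(V)

def rnn_step(x_i, s_prev):
    """One step: new state s_i and next-word distribution y_i."""
    s_i = np.tanh(s_prev @ Ws + x_i @ Wx + b)      # s_i = g(s_{i-1} Ws + x_i Wx + b)
    scores = s_i @ Wo + bo
    scores -= scores.max()                          # numerical stability
    y_i = np.exp(scores) / np.exp(scores).sum()     # softmax over the vocabulary
    return s_i, y_i

s = np.zeros(H)                                     # base case: s_0 = 0
x = np.random.randn(D)                              # embedding of the current word (hypothetical)
s, y = rnn_step(x, s)
print(y.shape, y.sum())                             # (10000,) 1.0
```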

Training RNNs

Given this definition of an RNN:

$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)$
$y_i = O(s_i) = \mathrm{softmax}(s_i W^o + b^o)$

we have five sets of parameters to learn: $W^s, W^x, W^o, b, b^o$.

Training RNNs

[Figure: the RNN unrolled over the sequence "we tried to prepare residents".] At each time step, we make a prediction and incur a loss; we know the true y (the word we see in that position).

[Figure: a per-step loss gradient $\partial L(\hat{y}_t)/\partial W^s$ at each position of "we tried to prepare residents".] Training here is standard backpropagation, taking the derivative of the loss we incur at step t with respect to the parameters we want to update.
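Summarizing in the notation above (added here, not on the slide): with true word $w_t$ and predicted distribution $\hat{y}_t$ at step t, the total loss is the sum of the per-step cross-entropy losses, and gradients with respect to $W^s, W^x, W^o, b, b^o$ flow through the unrolled network (backpropagation through time):

$\mathcal{L}(\theta) = \sum_{t=1}^{n} -\log \hat{y}_t[w_t], \qquad \theta \leftarrow \theta - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}$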

Generation

As we sample, the words we generate form the new context we condition on.

context1    context2    generated word
START       START       The
START       The         dog
The         dog         walked
dog         walked      in
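A minimal sketch of this sampling loop, using a hypothetical next_word_distribution as a stand-in for whatever trained model provides P(w | context):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["The", "dog", "walked", "in", "STOP"]        # toy vocabulary (made up)

def next_word_distribution(context):
    """Hypothetical stand-in for any trained LM: P(w | context) over vocab."""
    return np.ones(len(vocab)) / len(vocab)           # uniform, just for illustration

def generate(max_len=20):
    context, generated = ["START", "START"], []
    for _ in range(max_len):
        probs = next_word_distribution(context)
        w = rng.choice(vocab, p=probs)                # sample the next word
        if w == "STOP":
            break
        generated.append(w)
        context = [context[-1], w]                    # the sampled word becomes new context
    return generated

print(generate())
```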

Generation

Conditioned generation

In a basic RNN, the input at each timestep is a representation of the word at that position:

$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)$

But we can also condition on any arbitrary context (topic, author, date, metadata, dialect, etc.):

$s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + [x_i; c] W^x + b)$
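A minimal sketch of that change (shapes made up): the context vector c is concatenated to the word input at every step, so $W^x$ simply grows to accept the extra dimensions:

```python
import numpy as np

D, C, H = 50, 10, 100                       # word dim, context dim, hidden dim (made up)
Ws = np.random.randn(H, H) * 0.01
Wx = np.random.randn(D + C, H) * 0.01       # W^x now maps the concatenated [x_i; c]
b  = np.zeros(H)

def conditioned_step(x_i, c, s_prev):
    """s_i = g(s_{i-1} Ws + [x_i; c] Wx + b)"""
    return np.tanh(s_prev @ Ws + np.concatenate([x_i, c]) @ Wx + b)

s = conditioned_step(np.random.randn(D), np.random.randn(C), np.zeros(H))
print(s.shape)   # (100,)
```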

[Figure: the sequence "we tried to prepare residents" with a state vector at each position.] Each state $s_i$ encodes information seen until time i, and its structure is optimized to predict the next word.

Character LM

The vocabulary $\mathcal{V}$ is a finite set of discrete characters. When the output space is small, you're putting a lot of the burden on the structure of the model, which must encode long-range dependencies (suffixes depend on prefixes, word boundaries, etc.).

Character LM http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Drawbacks

Very expensive to train (especially for large vocabularies, though tricks exist, cf. hierarchical softmax). Backpropagation through long histories leads to vanishing gradients (cf. LSTMs in a few weeks). But RNN LMs consistently have some of the strongest performance in perplexity evaluations.

Mikolov 2010

Melis, Dyer and Blunsom 2017

Distributed representations

Some of the greatest power in neural network approaches is in the representation of words (and contexts) as low-dimensional vectors. We'll talk much more about that on Tuesday.