
Applied Natural Language Processing Info 256 Lecture 20: Sequence labeling (April 9, 2019) David Bamman, UC Berkeley

POS tagging: labeling each word with the tag that's correct for its context. In "Fruit flies like a banana" and "Time flies like an arrow", each word has several candidate Penn Treebank tags (e.g., NN, NNP, VB, VBZ, VBP, JJ, IN, DT, FW, SYM, LS), and only the context determines which is correct. (Just tags in evidence within the Penn Treebank; more are possible!)

Named entity recognition: tim/B-PERS cook/I-PERS is/O the/O ceo/O of/O apple/B-ORG. 3- or 4-class tagsets: person, location, organization, (misc); 7-class: person, location, organization, time, money, percent, date.
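To make the BIO encoding concrete, here is a small Python sketch (not from the lecture; the function name bio_to_entities and the tag strings are just illustrative) that groups BIO-tagged tokens back into typed entities:

def bio_to_entities(tokens, tags):
    # group BIO-tagged tokens into (entity_type, text) chunks
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity starts
            if current:
                entities.append((etype, " ".join(current)))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:   # the current entity continues
            current.append(tok)
        else:                                    # "O", or an I- tag with no open entity
            if current:
                entities.append((etype, " ".join(current)))
            current, etype = [], None
    if current:
        entities.append((etype, " ".join(current)))
    return entities

print(bio_to_entities("tim cook is the ceo of apple".split(),
                      ["B-PERS", "I-PERS", "O", "O", "O", "O", "B-ORG"]))
# [('PERS', 'tim cook'), ('ORG', 'apple')]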

Supersense tagging: The/O station/B-artifact wagons/I-artifact arrived/B-motion at/O noon/B-time a/O long/O shining/O line/B-group that/O coursed/B-motion through/O the/O west/B-location campus/I-location. Noun supersenses (Ciaramita and Altun 2003).

Segmentation: B = character is the start of a new word, I = character is inside an existing word. For the characters of "#blacklivesmatter", the tag sequence B B I I I I B I I I I B I I I I I yields "# black lives matter", while B B I I I I B I I I B I I I I I I yields "# black live smatter".
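A similar sketch for the character-level case (illustrative only; segment is a made-up helper) that applies B/I tags to recover the word boundaries shown above:

def segment(chars, tags):
    # recover words from per-character B/I segmentation tags
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B" and current:   # a new word starts, so flush the previous one
            words.append(current)
            current = ch
        else:                        # "I" (or the very first "B"): keep extending the word
            current += ch
    if current:
        words.append(current)
    return words

chars = list("#blacklivesmatter")
print(segment(chars, list("BBIIIIBIIIIBIIIII")))   # ['#', 'black', 'lives', 'matter']
print(segment(chars, list("BBIIIIBIIIBIIIIII")))   # ['#', 'black', 'live', 'smatter']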

Sequence labeling: x = {x_1, ..., x_n}, y = {y_1, ..., y_n}. For a sequence of inputs x with n time steps, we predict one corresponding label y_i for each x_i.

Sequence labeling models (form; label dependency; rich features?):
Hidden Markov Models: ∏_{i=1}^N P(x_i | y_i) P(y_i | y_{i-1}); Markov assumption; no
MEMM: ∏_{i=1}^N P(y_i | y_{i-1}, x, β); Markov assumption; yes
CRF: P(y | x, β); pairwise through the entire sequence; yes
RNN: ∏_{i=1}^N P(y_i | x_{1:i}, β); none; distributed representations
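To make the HMM row concrete, a tiny sketch that scores a word/tag sequence under the factorization ∏_i P(x_i | y_i) P(y_i | y_{i-1}); the probability tables here are invented, purely to illustrate the form:

import math

def hmm_log_prob(words, tags, emission, transition, start="<s>"):
    # log of the product over i of P(x_i | y_i) * P(y_i | y_{i-1})
    logp, prev = 0.0, start
    for w, t in zip(words, tags):
        logp += math.log(transition[(prev, t)]) + math.log(emission[(t, w)])
        prev = t
    return logp

# invented toy tables, just to show how the factorization is used
transition = {("<s>", "NN"): 0.4, ("NN", "VBZ"): 0.3}
emission = {("NN", "fruit"): 0.01, ("VBZ", "flies"): 0.005}
print(hmm_log_prob(["fruit", "flies"], ["NN", "VBZ"], emission, transition))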

Back to RNNs. RNNs allow arbitrarily-sized conditioning contexts: they condition on the entire sequence history. We used RNNs for document classification to generate a representation of a sequence that we can then use for prediction.

Figure: an RNN unrolled over "I loved the movie !", with inputs x1-x5, hidden states s1-s5, and outputs y1-y5.

Figure: averaging the word representations of "I loved the movie !" into a single vector used to predict y. Iyyer et al. (2015), Deep Unordered Composition Rivals Syntactic Methods for Text Classification (ACL).

Figure: a weighted sum x_1 a_1 + x_2 a_2 + x_3 a_3 + x_4 a_4 + x_5 a_5 of the word representations of "I loved the movie !" used to predict y.

Recurrent neural network. Each time step has two inputs: x_i (the observation at time step i), a one-hot vector, feature vector, or word embedding; and s_{i-1} (the output of the previous state), with base case s_0 = the zero vector.

Recurrent neural network. s_i = R(x_i, s_{i-1}): R computes the output state as a function of the current input and the previous state. y_i = O(s_i): O computes the output as a function of the current output state.

Simple RNN: s_i = R(x_i, s_{i-1}) = g(s_{i-1} W_s + x_i W_x + b), where g is tanh or ReLU. Different weight matrices transform the previous state and the current input before combining them: W_s ∈ R^{H×H}, W_x ∈ R^{D×H}, b ∈ R^H. The output is simply the state: y_i = O(s_i) = s_i. (Elman 1990, Mikolov 2012)

Recurrent neural networks are often used for sequential prediction tasks: language modeling (predicting the next symbol, word or character, in a sequence), machine translation (predicting a sequence of words in language f conditioned on a sentence in language e), and sequence labeling (POS tagging, NER).

RNNs for sequence labeling. The output state s_i is an H-dimensional real vector; we can transform it into a probability distribution over labels by passing it through an additional linear transformation followed by a softmax: y_i = O(s_i) = softmax(s_i W_o + b_o). (Elman 1990, Mikolov 2012)

Training RNNs. Given this definition of an RNN, s_i = R(x_i, s_{i-1}) = g(s_{i-1} W_s + x_i W_x + b) and y_i = O(s_i) = softmax(s_i W_o + b_o), we have five sets of parameters to learn: W_s, W_x, W_o, b, b_o.
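A minimal numpy sketch of this forward pass (the dimensions, the random initialization, and the tagset size of 45 are toy choices, and training is omitted), just to make the five parameter sets concrete:

import numpy as np

D, H, K = 50, 64, 45                      # toy sizes: embedding dim, hidden dim, number of tags
rng = np.random.default_rng(0)

# the five sets of parameters: W_s, W_x, W_o, b, b_o (random here, learned in practice)
W_s = rng.normal(scale=0.1, size=(H, H))
W_x = rng.normal(scale=0.1, size=(D, H))
W_o = rng.normal(scale=0.1, size=(H, K))
b, b_o = np.zeros(H), np.zeros(K)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_tag(x):                           # x: (n, D) array of word vectors
    s, ys = np.zeros(H), []
    for x_i in x:
        s = np.tanh(s @ W_s + x_i @ W_x + b)       # s_i = g(s_{i-1} W_s + x_i W_x + b)
        ys.append(softmax(s @ W_o + b_o))          # y_i = softmax(s_i W_o + b_o)
    return np.stack(ys)                   # (n, K): one distribution over tags per time step

print(rnn_tag(rng.normal(size=(5, D))).shape)      # (5, 45)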

For POS tagging, predict the tag from the tagset Y conditioned on the context: I/PRP loved/VBD the/DT movie/NN !/.

RNNs for POS: The/DT horse/NN raced/VBD past/IN the/DT barn/NN fell/??? To make a prediction for y_t, RNNs condition on all the input seen through time t (x_1, ..., x_t). But knowing something about the future (x_{t+1}, ..., x_n) can help.

Bidirectional RNN. A powerful alternative is to make predictions conditioning on both the past and the future: two RNNs, one running left-to-right and one right-to-left, each produce an output vector at every time step, and we concatenate the two.

Figure: a forward RNN and a backward RNN run over "I loved the movie !"; their output vectors at each time step are concatenated and used to predict the tags PRP VBD DT NN .

Training BiRNNs. Given this definition of a BiRNN, s_i^b = R_b(x_i, s_{i+1}^b) = g(s_{i+1}^b W_s^b + x_i W_x^b + b^b), s_i^f = R_f(x_i, s_{i-1}^f) = g(s_{i-1}^f W_s^f + x_i W_x^f + b^f), and y_i = softmax([s_i^f ; s_i^b] W_o + b_o), we have eight sets of parameters to learn (three for each RNN plus two for the final layer).
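The same sketch extended to the bidirectional case (again with toy sizes, random parameters, and no training), showing the concatenation [s_i^f ; s_i^b] feeding a shared output layer:

import numpy as np

D, H, K = 50, 64, 45                          # toy sizes: embedding dim, hidden dim, number of tags
rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# three parameter sets per direction, plus the shared output layer (eight in total)
W_s_f, W_x_f, b_f = rng.normal(scale=0.1, size=(H, H)), rng.normal(scale=0.1, size=(D, H)), np.zeros(H)
W_s_b, W_x_b, b_b = rng.normal(scale=0.1, size=(H, H)), rng.normal(scale=0.1, size=(D, H)), np.zeros(H)
W_o, b_o = rng.normal(scale=0.1, size=(2 * H, K)), np.zeros(K)

def birnn_tag(x):                             # x: (n, D) array of word vectors
    n = len(x)
    fwd, bwd = [], [None] * n
    s = np.zeros(H)
    for i in range(n):                        # forward: s_i^f = g(s_{i-1}^f W_s^f + x_i W_x^f + b^f)
        s = np.tanh(s @ W_s_f + x[i] @ W_x_f + b_f)
        fwd.append(s)
    s = np.zeros(H)
    for i in reversed(range(n)):              # backward: s_i^b = g(s_{i+1}^b W_s^b + x_i W_x^b + b^b)
        s = np.tanh(s @ W_s_b + x[i] @ W_x_b + b_b)
        bwd[i] = s
    # y_i = softmax([s_i^f ; s_i^b] W_o + b_o)
    return np.stack([softmax(np.concatenate([fwd[i], bwd[i]]) @ W_o + b_o) for i in range(n)])

print(birnn_tag(rng.normal(size=(5, D))).shape)   # (5, 45)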

Figure: error propagation during training (Goldberg 2017).

How do we fix this? I/PRP loved/VBD the/DT movie/NN bigly/???: the unseen word "bigly" is mapped to an all-zero embedding.

Character BiLSTM for each word: concatenate the final state of a forward LSTM over the word's characters, the final state of a backward LSTM, and the word embedding as the representation for the word (shown for "movie"). Lample et al. (2016), Neural Architectures for Named Entity Recognition.

The same character BiLSTM still produces an informative representation for an unseen word like "bigly", even though its word embedding is all zeros. Lample et al. (2016), Neural Architectures for Named Entity Recognition.
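A hedged Keras sketch of this word representation in the spirit of Lample et al. (2016); the layer sizes, vocabulary sizes, and variable names here are invented placeholders, not the paper's settings:

from keras.layers import Input, Embedding, LSTM, Bidirectional, Concatenate, Flatten
from keras.models import Model

max_word_len, n_chars, n_words = 20, 100, 10000          # invented sizes

char_ids = Input(shape=(max_word_len,), dtype="int32")   # one word as a sequence of character ids
word_id = Input(shape=(1,), dtype="int32")               # the same word as a single id

char_emb = Embedding(n_chars, 25, mask_zero=True)(char_ids)
char_rep = Bidirectional(LSTM(25))(char_emb)              # final forward + backward states (50 dims)

word_emb = Flatten()(Embedding(n_words, 100)(word_id))    # 100-dim word embedding
word_rep = Concatenate()([char_rep, word_emb])            # 150-dim representation of the word

model = Model([char_ids, word_id], word_rep)
model.summary()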

Character CNN for each word: run a convolution over character embeddings followed by max pooling, and concatenate the CNN output with the word embedding as the representation for the word (shown for "bigly"). Chiu and Nichols (2016), Named Entity Recognition with Bidirectional LSTM-CNNs.
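The convolutional variant looks similar; again a sketch with invented sizes (the filter width of 3 and 30 filters are placeholders, not necessarily what Chiu and Nichols use):

from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Concatenate, Flatten
from keras.models import Model

max_word_len, n_chars, n_words = 20, 100, 10000   # invented sizes

char_ids = Input(shape=(max_word_len,), dtype="int32")
word_id = Input(shape=(1,), dtype="int32")

char_emb = Embedding(n_chars, 25)(char_ids)                                    # (max_word_len, 25)
conv = Conv1D(filters=30, kernel_size=3, padding="same", activation="tanh")(char_emb)
char_rep = GlobalMaxPooling1D()(conv)                                          # max over character positions
word_emb = Flatten()(Embedding(n_words, 100)(word_id))
word_rep = Concatenate()([char_rep, word_emb])                                 # 130-dim representation

model = Model([char_ids, word_id], word_rep)
model.summary()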

LSTM/RNN: an RNN doesn't use the dependencies between nearby labels in making predictions.

Fruit/NN flies/VBZ like/VB a banana: the distribution at t2 might be VBZ 0.51, NNS 0.48, JJ 0.01, NN 0. The information that's passed between states is not the categorical choice (VBZ) but the hidden state that generated that distribution. If we knew the categorical choice of VBZ at t2, P(VB) at t3 would be much lower.

TimeDistributed. In Keras, the TimeDistributed wrapper applies the same operation to every time step in a sequence (e.g., the same Dense layer with the same parameters).

Figure: the same output weights W_O ∈ R^{5×45} and a softmax are applied at every time step over "I loved the movie !" to predict the tags PRP VBD DT NN .

TimeDistributed
lstm_output = LSTM(lstm_size, return_sequences=True)(embedded_sequences)
preds = TimeDistributed(Dense(output_dim, activation="softmax"))(lstm_output)
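Putting the pieces together, a minimal sketch of a BiLSTM tagger along these lines; the vocabulary size, dimensions, sequence length, and tagset size are placeholders, and the activity notebook's actual setup may differ:

from keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, TimeDistributed
from keras.models import Model

vocab_size, embedding_dim, lstm_size, output_dim, max_len = 10000, 100, 50, 45, 60   # placeholders

word_ids = Input(shape=(max_len,), dtype="int32")
embedded_sequences = Embedding(vocab_size, embedding_dim, mask_zero=True)(word_ids)
lstm_output = Bidirectional(LSTM(lstm_size, return_sequences=True))(embedded_sequences)
preds = TimeDistributed(Dense(output_dim, activation="softmax"))(lstm_output)     # one tag distribution per token

model = Model(word_ids, preds)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()

With padded word-id inputs of shape (batch, max_len) and integer tag labels of the same shape, model.fit trains the whole tagger end to end.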

Activity: 12.ner/SequenceLabelingBiLSTM_TODO