CS388: Natural Language Processing. Lecture 9: CNNs, Neural CRFs. Greg Durrett.


Administrivia
Project 1 due today at 5pm. Mini 2 out tonight, due in two weeks.

This Lecture
CNNs; CNNs for sentiment; neural CRFs.

Convolutional Layer
Applies a filter over patches of the input and returns that filter's activations. Convolution: take the dot product of the filter with a patch of the input. For an image of size n x n x k and a filter of size m x m x k, the activations are (n - m + 1) x (n - m + 1) x 1:

    \mathrm{activation}_{ij} = \sum_{i_o=0}^{m-1} \sum_{j_o=0}^{m-1} \mathrm{image}(i + i_o, j + j_o) \cdot \mathrm{filter}(i_o, j_o)

i.e., sum over dot products at each pair of offsets (i_o, j_o). Each of these cells is a vector with multiple values: images have RGB values (3 dims); text has a word vector per token (50+ dims).

Convolutions for NLP
Input and filter are 2-dimensional instead of 3-dimensional. Sentence: n words x k vector dims ("the movie was good", a vector for each word); filter: m x k; activations: (n - m + 1) x 1. This combines evidence locally in a sentence and produces a new (but still variable-length) representation. (A minimal code sketch follows below.)

Compare: CNNs vs. LSTMs
On an n x k input ("the movie was good"), c filters of size m x k produce an O(n) x c output; a BiLSTM with hidden size c produces an n x 2c output. Both LSTMs and convolutional layers transform the input using context. LSTM: globally looks at the entire sentence (but local for many problems). CNN: local, depending on filter width and number of layers.
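Below is a minimal sketch (not from the lecture) of the convolutions-for-NLP picture above, written in PyTorch; the sentence length, embedding size, filter count, and width are illustrative assumptions.

```python
import torch
import torch.nn as nn

SENT_LEN, EMB_DIM = 4, 50        # n words, k-dimensional word vectors
NUM_FILTERS, WIDTH = 8, 2        # c filters, each m x k

# "the movie was good": one k-dim vector per word -> (batch, n, k)
sentence = torch.randn(1, SENT_LEN, EMB_DIM)

# Conv1d treats the embedding dimension as input channels, so transpose to (batch, k, n)
conv = nn.Conv1d(in_channels=EMB_DIM, out_channels=NUM_FILTERS, kernel_size=WIDTH)
activations = conv(sentence.transpose(1, 2))   # (batch, c, n - m + 1)

print(activations.shape)  # torch.Size([1, 8, 3]): n - m + 1 activations per filter
```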

CNNs for Sentiment Analysis
Pipeline: n x k input ("the movie was good", one vector per word) -> c filters, m x k each -> n x c activations -> max pooling over the sentence -> c-dimensional vector -> projection W + softmax -> P(y | x). (A sketch of this pipeline follows below.)
Max pooling: return the max activation of a given filter over the entire sentence; it acts like a logical OR (sum pooling is like logical AND).

Understanding CNNs for Sentiment
[Figure: a "good" filter's activations over "the movie was good ." are near zero on most words and peak on "good" (max = 1.1); words like "bad", "okay", and "terrible" produce low or moderate activations.] The filter looks like the things that will cause it to have high activation.
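A minimal sketch of that conv + max-pool + softmax pipeline, assuming PyTorch; the sizes (emb_dim, num_filters, width, num_classes) are illustrative, not from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSentimentClassifier(nn.Module):
    def __init__(self, emb_dim=50, num_filters=100, width=3, num_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=width)
        self.proj = nn.Linear(num_filters, num_classes)      # the W projection

    def forward(self, word_vectors):                          # (batch, n, k)
        acts = self.conv(word_vectors.transpose(1, 2))        # (batch, c, n - m + 1)
        pooled = acts.max(dim=2).values                       # max over the sentence -> (batch, c)
        return F.log_softmax(self.proj(pooled), dim=-1)       # log P(y | x)

# usage: CNNSentimentClassifier()(torch.randn(1, 4, 50)) gives log-probabilities over 2 classes
```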

Understanding CNNs for Sentiment (continued)
The max-pooled filter activations become the features for the classification layer (or for more NN layers). This takes variable-length input and turns it into a fixed-length output. Filters are initialized randomly and then learned. Word vectors for similar words are similar, so convolutional filters will have similar outputs: the "good" filter that peaks at 1.1 on "the movie was good ." peaks at about 1.8 on "the movie was great ."
A width-2 filter can likewise fire on "not good" (max = 1.5 over "the movie was not good ."), analogous to bigram features in bag-of-words models: an indicator feature for text containing a bigram corresponds to max pooling of a filter that matches that bigram.

What can CNNs learn? Examples: "the movie was not good"; "the movie was not really all that good"; "the cinematography was good, the music great, but the movie was bad"; "I entered the theater in the bloom of youth and left as an old man".

Deep Convolutional Networks
Low-level filters extract low-level features from the data; high-level filters match larger and more semantic patterns. [Zeiler and Fergus (2014)]

CNNs: Implementation
Input is a batch_size x n x k matrix; filters are a c x m x k matrix (c filters). Typically use filters with m ranging from 1 to 5 or so, with multiple filter widths in a single convnet (sketched below). All computation graph libraries support efficient convolution operations.

CNNs for Sentence Classification
Question classification, sentiment, etc. Conv + pool, then use feedforward layers (W) to classify. Can use multiple types of input vectors (fixed initialization and learned). [Kim (2014)]
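A sketch of the multiple-filter-width idea (Kim 2014-style), with assumed sizes: each width is max-pooled over the sentence and the pooled features are concatenated.

```python
import torch
import torch.nn as nn

class MultiWidthConvEncoder(nn.Module):
    def __init__(self, emb_dim=50, num_filters=100, widths=(1, 2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, kernel_size=m) for m in widths
        )

    def forward(self, word_vectors):                      # (batch, n, k), n >= max width (pad otherwise)
        x = word_vectors.transpose(1, 2)                  # (batch, k, n)
        # max-pool each filter width over the sentence, then concatenate
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                   # (batch, num_filters * len(widths))
```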

Sentence Classification
Tasks: movie review sentiment, subjectivity/objectivity detection, product reviews, question type classification. Also effective at document-level text classification. [Kim (2014)]

Neural CRF Basics

NER Revisited
Features in CRFs: I[tag=B-LOC & curr_word=Hangzhou], I[tag=B-LOC & prev_word=to], I[tag=B-LOC & curr_prefix=Han]. This is a linear model over features. Downsides: lexical features mean that words need to be seen in the training data, and a linear model can't capture feature conjunctions as effectively (can't look at more than two words with a single feature).

LSTMs for NER
Transducer (LM-like model): Barack/B-PER Obama/I-PER will/O travel/O to/O Hangzhou/B-LOC. What are the strengths and weaknesses of this model compared to CRFs?

LSTMs for NER
Bidirectional transducer model over "Barack Obama will travel to Hangzhou" (B-PER I-PER O O O B-LOC). What are the strengths and weaknesses of this model compared to CRFs?

Neural CRFs
Neural CRFs: bidirectional LSTMs (or some NN) compute the emission potentials; structural constraints are captured in the transition potentials. The factor graph has an emission potential \phi_e on each y_i and a transition potential \phi_t between adjacent y_{i-1}, y_i:

    P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))

Conventional: \phi_e(y_i, i, x) = w^\top f_e(y_i, i, x)
Neural: \phi_e(y_i, i, x) = W_{y_i}^\top f(i, x), where W is a num_tags x len(f) matrix and f(i, x) could be the output of a feedforward neural network looking at the words around position i, or the i-th output of an LSTM, ...
The neural network computes unnormalized potentials that are consumed and normalized by a structured model. Inference: compute f, then use Viterbi (a sketch follows below).

Computing Gradients
For the emission potentials, the error signal is computed with forward-backward:

    \frac{\partial L}{\partial \phi_e(y_i = s, i, x)} = I[s \text{ is gold}] - P(y_i = s \mid x)

For the linear model, \frac{\partial \phi_e}{\partial w} = f_e(y_i, i, x); the chain rule says to multiply these together, which gives our update. For the neural model: compute the gradient of \phi_e with respect to the parameters of the neural net (backprop).
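A sketch of neural CRF inference as described above: an assumed BiLSTM computes the emission potentials \phi_e, and Viterbi combines them with learned transition potentials \phi_t. All sizes and names here are illustrative assumptions, not the lecture's code.

```python
import torch
import torch.nn as nn

class NeuralCRFEncoder(nn.Module):
    def __init__(self, emb_dim=50, hidden=100, num_tags=9):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.emission = nn.Linear(2 * hidden, num_tags)                   # W: num_tags x len(f)
        self.transitions = nn.Parameter(torch.zeros(num_tags, num_tags))  # phi_t(y_{i-1}, y_i)

    def forward(self, word_vectors):             # (1, n, emb_dim)
        f, _ = self.lstm(word_vectors)           # f(i, x): (1, n, 2*hidden)
        return self.emission(f).squeeze(0)       # phi_e: (n, num_tags)

def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence under phi_e + phi_t."""
    n, num_tags = emissions.shape
    score = emissions[0].clone()                 # best score of any path ending in each tag
    backptrs = []
    for i in range(1, n):
        # best path ending at (i, t): max_s score[s] + phi_t(s, t) + phi_e(i, t)
        total = score.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        score, ptr = total.max(dim=0)
        backptrs.append(ptr)
    best_tag = int(score.argmax())
    path = [best_tag]
    for ptr in reversed(backptrs):               # follow backpointers to recover the sequence
        best_tag = int(ptr[best_tag])
        path.append(best_tag)
    return list(reversed(path))

# usage: model = NeuralCRFEncoder(); tags = viterbi(model(torch.randn(1, 6, 50)), model.transitions)
```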

Neural CRFs
Training recipe over "Barack Obama will travel to Hangzhou": 1) compute f(x); 2) run forward-backward; 3) compute the error signal; 4) backprop (no knowledge of sequential structure required).

FFNN Neural CRF for NER
A feedforward network over the previous, current, and next word (e.g., "to Hangzhou today") computes the emission potentials \phi_e(to), \phi_e(Hangzhou), \phi_e(today):

    \phi_e = W g(V f(x, i)),   f(x, i) = [\mathrm{emb}(x_{i-1}), \mathrm{emb}(x_i), \mathrm{emb}(x_{i+1})]

(A sketch of this emission network follows below.)

LSTM Neural CRFs
Bidirectional LSTMs compute the emission (or transition) potentials for "Barack Obama will travel to Hangzhou" (B-PER I-PER O O O B-LOC).

LSTMs for NER
How does the LSTM transducer compare to the neural CRF?
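A minimal sketch of the FFNN emission computation \phi_e = W g(V f(x, i)) above, assuming PyTorch and illustrative sizes; boundary handling (padding the sentence with start/end tokens) is left to the caller.

```python
import torch
import torch.nn as nn

class FFNNEmissions(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=50, hidden=200, num_tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.V = nn.Linear(3 * emb_dim, hidden)   # over [prev, curr, next] embeddings
        self.W = nn.Linear(hidden, num_tags)

    def forward(self, word_ids):                  # (n,) word ids, padded at sentence boundaries
        e = self.emb(word_ids)                    # (n, emb_dim)
        # f(x, i) = [emb(x_{i-1}), emb(x_i), emb(x_{i+1})] for each interior position i
        f = torch.cat([e[:-2], e[1:-1], e[2:]], dim=1)
        return self.W(torch.tanh(self.V(f)))      # phi_e for each tag: (n - 2, num_tags)
```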

NLP (Almost) From Scratch: CNN Neural CRFs
WLL: independent classification; SLL: neural CRF. Architecture: conv + pool + FFNN over the sentence ("travel to Hangzhou today for"). Append to each word vector an embedding of the relative position of that word (-2 -1 0 1 2 for "travel to Hangzhou today for"); convolution over the sentence then produces a position-dependent representation (sketched below). LM2: word vectors learned from a precursor to word2vec/GloVe, trained for 2 weeks (!) on Wikipedia. [Collobert, Weston, et al. 2008, 2011]

CNN NCRFs vs. FFNN NCRFs
The sentence approach (CNNs) is comparable to the window approach (FFNNs), except for SRL, where they claim it works much better. [Collobert and Weston 2008, 2011]

Neural CRFs with LSTMs
Neural CRF using character LSTMs to compute word representations. [Chiu and Nichols (2015), Lample et al. (2016)]
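A sketch of the relative-position trick above: append to each word vector an embedding of its position relative to the word of interest, then convolve. The class name, sizes, and clamping distance are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionAugmentedConv(nn.Module):
    def __init__(self, emb_dim=50, pos_dim=5, max_dist=10, num_filters=100, width=3):
        super().__init__()
        self.pos_emb = nn.Embedding(2 * max_dist + 1, pos_dim)  # distances -max_dist .. +max_dist
        self.max_dist = max_dist
        self.conv = nn.Conv1d(emb_dim + pos_dim, num_filters, kernel_size=width, padding=width // 2)

    def forward(self, word_vectors, target_index):              # (1, n, emb_dim), int
        n = word_vectors.size(1)
        rel = torch.arange(n) - target_index                    # e.g. -2 -1 0 1 2
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        pos = self.pos_emb(rel).unsqueeze(0)                    # (1, n, pos_dim)
        x = torch.cat([word_vectors, pos], dim=2)               # position-augmented word vectors
        return self.conv(x.transpose(1, 2))                     # position-dependent representation
```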

Neural CRFs with LSTMs (continued)
Chiu+Nichols: character CNNs instead of LSTMs. Lin/Passos/Luo: use external resources like Wikipedia. The LSTM-CRF captures the important aspects of NER: word context (LSTM), sub-word features (character LSTMs), and outside knowledge (word embeddings). [Chiu and Nichols (2015), Lample et al. (2016)]

Takeaways
CNNs are a flexible way of extracting features analogous to bag-of-n-grams, and can also encode positional information. All kinds of NNs can be integrated into CRFs for structured inference; this can be applied to NER, other tagging, parsing, ... This concludes the ML/DL-heavy portion of the course. Starting Tuesday: syntax, then semantics.