Natural Language Processing


Natural Language Processing Info 159/259 Lecture 4: Text classification 3 (Sept 5, 2017) David Bamman, UC Berkeley

1. https://www.forbes.com/sites/kevinmurnane/2016/04/01/what-is-deep-learning-and-how-is-it-useful

History of NLP: Foundational insights, 1940s/1950s; Two camps (symbolic/stochastic), 1957-1970; Four paradigms (stochastic, logic-based, NLU, discourse modeling), 1970-1983; Empiricism and FSM (1983-1993); Field comes together (1994-1999); Machine learning (2000-today); Neural networks (~2014-today). [J&M 2008, ch. 1]

Neural networks in NLP: Language modeling [Mikolov et al. 2010]; Text classification [Kim 2014; Iyyer et al. 2015]; Syntactic parsing [Chen and Manning 2014, Dyer et al. 2015, Andor et al. 2016]; CCG supertagging [Lewis and Steedman 2014]; Machine translation [Cho et al. 2014, Sutskever et al. 2014]; Dialogue agents [Sordoni et al. 2015, Vinyals and Le 2015, Ji et al. 2016]. (For an overview, see Goldberg 2017, 1.3.)

Neural networks: discrete, high-dimensional representations of inputs (one-hot vectors) -> low-dimensional distributed representations; non-linear interactions of input features; multiple layers to capture hierarchical structure.

Neural network libraries

Logistic regression: $\hat{y} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}$. Example features and weights: not ($x = 1$, $\beta = -0.5$), bad ($x = 1$, $\beta = -1.7$), movie ($x = 0$, $\beta = 0.3$).
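
A minimal numpy sketch (not from the slides) of this computation, using the toy feature values and weights above:

```python
import numpy as np

# Toy example from the slide: features (not, bad, movie)
x = np.array([1.0, 1.0, 0.0])        # feature values
beta = np.array([-0.5, -1.7, 0.3])   # weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat = sigmoid(x @ beta)            # P(y = 1 | x)
print(y_hat)                         # ~0.10
```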

SGD: Calculate the derivative of some loss function with respect to the parameters we can change, and update them accordingly to make the predictions on the training data a little less wrong next time.
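
A sketch of one SGD update for logistic regression with log loss (my own illustration; the learning rate and the zero initialization are arbitrary choices):

```python
import numpy as np

def sgd_step(x, y, beta, lr=0.1):
    """One SGD update for logistic regression with log loss.
    The gradient of -log P(y | x) with respect to beta is (y_hat - y) * x."""
    y_hat = 1.0 / (1.0 + np.exp(-x @ beta))
    return beta - lr * (y_hat - y) * x

beta = np.zeros(3)                               # start from zero weights
x, y = np.array([1.0, 1.0, 0.0]), 0              # one <x, y> training pair
beta = sgd_step(x, y, beta)
```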

Logistic regression as a network: inputs $x_1, x_2, x_3$ connect directly to the output $y$ with weights $\beta_1, \beta_2, \beta_3$; $\hat{y} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}$. Example features and weights: not ($x = 1$, $\beta = -0.5$), bad ($x = 1$, $\beta = -1.7$), movie ($x = 0$, $\beta = 0.3$).

Neural networks, two core ideas: non-linear activation functions and multiple layers.

*For simplicity, we're leaving out the bias term, but assume most layers have one as well. [Diagram: inputs $x_1, x_2, x_3$, hidden nodes $h_1, h_2$, output $y$; weights $W_{1,1}, W_{1,2}, W_{2,1}, W_{2,2}, W_{3,1}, W_{3,2}$ connect the input layer to the hidden layer, and $V_1, V_2$ connect the hidden layer to the output layer.]

[Diagram: the same network, with a table of example values for $W$ (one row per feature: not, bad, movie) and for $V$.]

$h_j = f\left(\sum_{i=1}^{F} x_i W_{i,j}\right)$: the hidden nodes are completely determined by the input and the weights.

$h_1 = f\left(\sum_{i=1}^{F} x_i W_{i,1}\right)$

Activation functions: $\sigma(z) = \frac{1}{1 + \exp(-z)}$. [Plot of the sigmoid for $z \in [-10, 10]$.]

Logistic regression: $\hat{y} = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)} = \sigma\left(\sum_{i=1}^{F} x_i \beta_i\right)$. We can think about logistic regression as a neural network with no hidden layers.

Activation functions: $\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$. [Plot of tanh for $z \in [-10, 10]$.]

Activation functions: $\mathrm{rectifier}(z) = \max(0, z)$. [Plot of the rectifier for $z \in [-10, 10]$.]
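
The three activations above, as a quick numpy sketch (not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # equivalent to np.tanh(z)

def rectifier(z):
    return np.maximum(0.0, z)

z = np.linspace(-10, 10, 5)
print(sigmoid(z), tanh(z), rectifier(z))
```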

$h_1 = \sigma\left(\sum_{i=1}^{F} x_i W_{i,1}\right)$, $h_2 = \sigma\left(\sum_{i=1}^{F} x_i W_{i,2}\right)$, $\hat{y} = \sigma\left(V_1 h_1 + V_2 h_2\right)$

$\hat{y} = \sigma\left(V_1 \sigma\left(\sum_{i=1}^{F} x_i W_{i,1}\right) + V_2 \sigma\left(\sum_{i=1}^{F} x_i W_{i,2}\right)\right)$: we can express $\hat{y}$ as a function only of the input $x$ and the weights $W$ and $V$.

$\hat{y} = \sigma\Big(V_1 \underbrace{\sigma\big(\textstyle\sum_{i=1}^{F} x_i W_{i,1}\big)}_{h_1} + V_2 \underbrace{\sigma\big(\textstyle\sum_{i=1}^{F} x_i W_{i,2}\big)}_{h_2}\Big)$. This is hairy, but differentiable. Backpropagation: given training samples of <x,y> pairs, we can use stochastic gradient descent to find the values of W and V that minimize the loss.
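
A minimal sketch (mine, not the course code) of one forward pass plus one SGD/backpropagation step for this one-hidden-layer network; it minimizes the negative log likelihood (so the gradients have the opposite sign from the likelihood-maximizing derivation on the next slides), and the sizes and learning rate are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, y, W, V, lr=0.1):
    """One SGD step for y_hat = sigmoid(sigmoid(x W) . V) with log loss."""
    h = sigmoid(x @ W)                 # hidden layer, shape (2,)
    y_hat = sigmoid(h @ V)             # scalar prediction
    d_out = y_hat - y                  # dL/d(hV) for the negative log likelihood
    grad_V = d_out * h                 # dL/dV
    d_h = d_out * V * h * (1 - h)      # backpropagate through the hidden sigmoid
    grad_W = np.outer(x, d_h)          # dL/dW, shape (F, 2)
    return W - lr * grad_W, V - lr * grad_V

F = 3
rng = np.random.default_rng(0)
W, V = rng.normal(size=(F, 2)), rng.normal(size=2)
x, y = np.array([1.0, 1.0, 0.0]), 0.0
W, V = sgd_step(x, y, W, V)
```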

Neural networks are a series of functions chained together: $xW \rightarrow \sigma(xW) \rightarrow \sigma(xW)\,V \rightarrow \sigma\big(\sigma(xW)\,V\big)$. The loss is another function chained on top: $\log\big(\sigma(\sigma(xW)\,V)\big)$.

Chain rule: $\frac{\partial}{\partial V} \log\big(\sigma(\sigma(xW)V)\big)$. Let's take the likelihood for a single training example with label $y = 1$; we want this value to be as high as possible. $= \underbrace{\frac{\partial \log(\sigma(\sigma(xW)V))}{\partial\, \sigma(\sigma(xW)V)}}_{A} \cdot \underbrace{\frac{\partial\, \sigma(\sigma(xW)V)}{\partial\, \sigma(xW)V}}_{B} \cdot \underbrace{\frac{\partial\, \sigma(xW)V}{\partial V}}_{C} = \underbrace{\frac{\partial \log(\sigma(hV))}{\partial\, \sigma(hV)}}_{A} \cdot \underbrace{\frac{\partial\, \sigma(hV)}{\partial\, hV}}_{B} \cdot \underbrace{\frac{\partial\, hV}{\partial V}}_{C}$

Chain rule: $A \cdot B \cdot C = \frac{\partial \log(\sigma(hV))}{\partial\, \sigma(hV)} \cdot \frac{\partial\, \sigma(hV)}{\partial\, hV} \cdot \frac{\partial\, hV}{\partial V} = \frac{1}{\sigma(hV)} \cdot \sigma(hV)\big(1 - \sigma(hV)\big) \cdot h = \big(1 - \sigma(hV)\big)\, h = (1 - \hat{y})\, h$
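
A quick finite-difference check (my own illustration) that the gradient of $\log \sigma(hV)$ with respect to $V$ really is $(1 - \hat{y})\,h$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h, V = rng.normal(size=2), rng.normal(size=2)

analytic = (1 - sigmoid(h @ V)) * h          # (1 - y_hat) * h from the derivation

eps = 1e-6
numeric = np.array([
    (np.log(sigmoid(h @ (V + eps * e))) - np.log(sigmoid(h @ (V - eps * e)))) / (2 * eps)
    for e in np.eye(2)                       # perturb one coordinate of V at a time
])
print(np.allclose(analytic, numeric))        # True
```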

Neural networks: tremendous flexibility in design choices (we exchange feature engineering for model engineering). Articulate the model structure and use the chain rule to derive the parameter updates.

Neural network structures: a single output node $y$; output one real value.

Neural network structures: three output nodes $y_1, y_2, y_3$. Multiclass: output 3 values, only one $= 1$ in the training data (e.g., $y_1 = 0$, $y_2 = 1$, $y_3 = 0$).

Neural network structures: three output nodes $y_1, y_2, y_3$; output 3 values, several $= 1$ in the training data (e.g., $y_1 = 1$, $y_2 = 1$, $y_3 = 0$).
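
One way to realize these three output structures (a sketch with my own names, not from the slides): a single sigmoid output, a softmax over three mutually exclusive classes, and independent sigmoids when several labels can be on at once:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
h = np.array([0.2, -1.3])              # a hidden layer like the ones above

V_single = rng.normal(size=2)          # hidden -> one output
V_multi = rng.normal(size=(2, 3))      # hidden -> three outputs

print(sigmoid(h @ V_single))           # one real value in (0, 1)
print(softmax(h @ V_multi))            # three values that sum to 1 (only one label true)
print(sigmoid(h @ V_multi))            # three independent values (several labels can be 1)
```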

Regularization: increasing the number of parameters = increasing the possibility of overfitting to the training data.

Regularization. L2 regularization: penalize W and V for being too large. Dropout: when training on an <x,y> pair, randomly remove some nodes and weights. Early stopping: stop backpropagation before the training error gets too small.
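
A sketch of the first two in numpy (my own illustration; the penalty strength and keep probability are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_gradient(grad_V, V, lam=0.01):
    """Gradient of the loss plus the L2 penalty lam * ||V||^2."""
    return grad_V + 2 * lam * V

def dropout(h, keep_prob=0.5, train=True):
    """Inverted dropout: randomly zero out hidden units during training."""
    if not train:
        return h
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob        # rescale so the expected value is unchanged

h = np.array([0.3, -1.2, 0.8, 0.1])
print(dropout(h))
```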

Deeper networks: [Diagram: inputs $x_1, x_2, x_3$, a first hidden layer (weights $W_1$), a second hidden layer (weights $W_2$), and output $y$ (weights $V$).]

Densely connected layer: every input node $x_1 \ldots x_7$ is connected to every hidden node; $h = f(xW)$.

Convolutional networks: with convolutional networks, the same operation (i.e., the same set of parameters) is applied to different regions of the input.

2D Convolution: [Example: a convolution kernel applied to an image for blurring.] http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ https://docs.gimp.org/en/plug-in-convmatrix.html

1D Convolution: convolving a sequence $x$ with the kernel $K = [1/3, 1/3, 1/3]$ (each output is the dot product of $K$ with a length-3 window of $x$) = a moving average.
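
A tiny numpy sketch of this moving-average reading (the example sequence is mine):

```python
import numpy as np

x = np.array([1.0, 3.0, -1.0, 4.0, 0.0, 2.0])   # an arbitrary example sequence
K = np.ones(3) / 3.0                            # kernel [1/3, 1/3, 1/3]

# Each output is the dot product of K with a length-3 window of x,
# i.e., a 3-point moving average.
print(np.convolve(x, K, mode="valid"))          # [1.0, 2.0, 1.0, 2.0]
```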

Convolutional networks: for the sequence "I hated it I really hated it" ($x_1 \ldots x_7$): $h_1 = f(\text{I}, \text{hated}, \text{it})$, $h_2 = f(\text{it}, \text{I}, \text{really})$, $h_3 = f(\text{really}, \text{hated}, \text{it})$; $h_1 = f(x_1 W_1 + x_2 W_2 + x_3 W_3)$, $h_2 = f(x_3 W_1 + x_4 W_2 + x_5 W_3)$, $h_3 = f(x_5 W_1 + x_6 W_2 + x_7 W_3)$.

Indicator vector: every token is a V-dimensional vector (the size of the vocab) with a single 1 identifying the word (e.g., a 0, aa 0, aal 0, aalii 0, aam 0, aardvark 1, aardwolf 0, aba 0). We'll get to distributed representations of words on 9/19.

Convolutional networks: the convolutional window size is 3 ($x_1 \ldots x_3$); each $x_i$ and each of $W_1, W_2, W_3$ has the size of the vocab. $h_1 = f(x_1 W_1 + x_2 W_2 + x_3 W_3)$, $h_2 = f(x_3 W_1 + x_4 W_2 + x_5 W_3)$, $h_3 = f(x_5 W_1 + x_6 W_2 + x_7 W_3)$.

Convolutional networks: for indicator vectors, we're just adding these numbers together: $h_1 = f\big(W_{1, x_1^{id}} + W_{2, x_2^{id}} + W_{3, x_3^{id}}\big)$ (where $x_n^{id}$ specifies the location of the 1 in the vector, i.e., the vocabulary id).
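
A sketch of why that works (my own example vocabulary size and ids; tanh stands in for $f$):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, window = 8, 3

# One weight vector per window position (a single filter).
W = rng.normal(size=(window, vocab_size))

token_ids = [5, 1, 2]                        # vocabulary ids of a 3-token window

# Full computation with one-hot vectors...
X = np.eye(vocab_size)[token_ids]            # shape (window, vocab_size)
h_full = np.tanh(sum(X[n] @ W[n] for n in range(window)))

# ...reduces to looking up one number per window position.
h_lookup = np.tanh(sum(W[n, token_ids[n]] for n in range(window)))

print(np.isclose(h_full, h_lookup))          # True
```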

Pooling: down-samples a layer by selecting a single point from some set. Max-pooling selects the largest value. [Example: the maximum of each group of values is kept.]

Convolutional networks: [the convolution over $x_1 \ldots x_7$ produces one value per window; max pooling then selects the largest]. This defines one filter.

Vectorize: concatenate the three window vectors $x_1, x_2, x_3$ into one long vector and stack $W_1, W_2, W_3$ into one weight vector $W$, so the window computation becomes a single product: $h_1 = f(x W)$.

We can specify multiple filters; each filter is a separate set of parameters to be learned: with filters $W_a, W_b, W_c, W_d$, $h_1 = f(x W) \in \mathbb{R}^4$.

Convolutional networks: with max pooling, we select a single number for each filter over all tokens (e.g., with 100 filters, the output of the max pooling stage = a 100-dimensional vector). If we specify multiple filters, we can also scope each filter over different window sizes.

Zhang and Wallace 2016, "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"

[Diagram: 4 filters $W_a, W_b, W_c, W_d$ applied over the tokens $x_1 \ldots x_7$, followed by max pooling and output weights $V$ producing $y$.] Convolution: $\tanh(x W)$ for each filter and window; max pooling over positions; the pooled values feed the output $y$ through $V$.
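
A compact end-to-end sketch of this pipeline (my own sizes, ids, and names; stride 1, one-hot inputs, and a sigmoid output):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, window, n_filters = 10, 3, 4

W = rng.normal(size=(window * vocab, n_filters))   # one weight vector per filter
V = rng.normal(size=n_filters)                     # pooled features -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(token_ids):
    X = np.eye(vocab)[token_ids]                   # one-hot tokens, shape (T, vocab)
    windows = np.stack([X[i:i + window].reshape(-1)            # concatenated windows
                        for i in range(len(token_ids) - window + 1)])
    H = np.tanh(windows @ W)                       # (num_windows, n_filters)
    pooled = H.max(axis=0)                         # max pooling over positions
    return sigmoid(pooled @ V)                     # a single output probability

print(forward([3, 7, 1, 3, 9, 7, 1]))              # ids standing in for "I hated it I really hated it"
```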

CNN as important ngram detector: higher-order ngrams are much more informative than just unigrams (e.g., "i don't like this movie" vs. [i, don't, like, this, movie]). We can think about a CNN as providing a mechanism for detecting important (sequential) ngrams without having the burden of creating them as unique features; the number of unique ngram types (1- through 4-grams) in the Cornell movie review dataset grows from tens of thousands of unigrams to over a million 4-grams.

CNN Backprop (V): $L(W, V) = \underbrace{y \log o}_{A} + \underbrace{(1 - y) \log(1 - o)}_{B}$; $\frac{\partial L(W, V)}{\partial V} = \frac{\partial (A + B)}{\partial V} = \frac{\partial A}{\partial V} + \frac{\partial B}{\partial V}$

You'll derive and implement updates for the rest of the parameters in homework 2.

Generative vs. Discriminative models: Generative models specify a joint distribution over the labels and the data; with this you could generate new data: $P(x, y) = P(y)\, P(x \mid y)$. Discriminative models specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes: $P(y \mid x)$.

259 project proposal due 9/26. Final project involving 1 or 2 students involving natural language processing -- either focusing on core NLP methods or using NLP in support of an empirical research question. Proposal (2 pages): outline the work you're going to undertake; motivate its rationale as an interesting question worth asking; assess its potential to contribute new knowledge by situating it within related literature in the scientific community (cite 5 relevant sources); who is the team and what are each of your responsibilities (everyone gets the same grade).

Thursday: Read Hovy and Spruit (2016) and come prepared to discuss!