Deep Learning for NLP

Deep Learning for NLP. CS224N. Christopher Manning. (Many slides borrowed from ACL 2012/NAACL 2013 tutorials by me, Richard Socher, and Yoshua Bengio.)

Machine Learning and NLP. Usually machine learning works well because of human-designed representations and input features; machine learning then becomes just optimizing weights to best make a final prediction. Representation learning attempts to automatically learn good features or representations. (Slide figure: examples of such hand-built resources and systems, e.g., NER, WordNet, SRL, parsers; cf. PA1 and PA3.)

What is Deep Learning? Three senses of "deep learning":
#1 Learning a data-dependent hierarchy of multiple intermediate levels of representation of increasing abstraction.
#2 Use of distributed representations and neural net learning methods, even when the architecture isn't technically deep.
#3 Anything to do with artificial intelligence: "Deep learning is the idea that machines could eventually learn by themselves to perform functions like speech recognition" (http://readwrite.com/2014/01/29/).

Deep Learning and its Architectures. Deep learning attempts to learn multiple levels of representation. Focus: multi-layer neural networks. In the slide's figure: the output layer predicts a supervised target; the hidden layers learn more abstract representations as you head up; the input layer holds (roughly) raw sensory inputs.

Part 1.1: The Basics. Advantages of Deep Learning (Part 1)

#1 Learning representations. Handcrafting features is time-consuming, and the features are often both over-specified and incomplete; the work has to be done again for each task/domain/… Humans develop representations for learning and reasoning; our computers should do the same. Deep learning provides a way of doing this.

#2 The need for distributed representations. Current NLP systems are incredibly fragile because of their atomic symbol representations. (Slide example: "Crazy sentential complement, such as for 'likes [(being) crazy]'.")

Distributed representations deal with the curse of dimensionality. Generalizing locally (e.g., nearest neighbors) requires representative examples for all relevant variations! Classic solutions: manual feature design; assuming a smooth target function (e.g., linear models); kernel methods (linear in terms of a kernel based on the data points). Neural networks parameterize and learn a "similarity kernel".

#3 Learning multiple levels of representation [Lee et al. ICML 2009; Lee et al. NIPS 2009]. Successive model layers learn deeper intermediate representations, up to high-level linguistic representations. (Slide figure: visualizations of Layer 1, Layer 2, and Layer 3 features.)

Part 1.2: The Basics. From logistic regression to neural nets

Demystifying neural networks: terminology. The math is similar to maxent/logistic regression. A single neuron is a computational unit with n (here 3) inputs, 1 output, and parameters W, b. (Slide figure labels: inputs, activation function, output.) The bias unit corresponds to the intercept term.

From Maxent Classifiers to Neural Networks. Refresher: the maxent/softmax classifier (from last lecture). Supervised learning gives us a distribution for datum d over the classes c in C:

P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c' \in C} \exp \sum_i \lambda_i f_i(c', d)}

Vector form:

P(c \mid d, \lambda) = \frac{e^{\lambda^\top f(c, d)}}{\sum_{c'} e^{\lambda^\top f(c', d)}}

Such a classifier is used as-is in a neural network (a "softmax layer"), often as the top layer. But for now we'll derive a two-class logistic model for one neuron.
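
To make the softmax classifier concrete, here is a minimal NumPy sketch (NumPy and the names `lam` and `feature_matrix` are my own illustrative choices, not from the slides): it exponentiates the score \lambda^\top f(c, d) for each class and normalizes.

```python
import numpy as np

def softmax_probs(lam, feature_matrix):
    """P(c | d, lam) for each class c.

    feature_matrix has shape (num_classes, num_features); row c holds
    f(c, d) for the datum d, and lam holds the weights."""
    scores = feature_matrix @ lam      # lambda^T f(c, d), one score per class
    scores -= scores.max()             # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example: 3 classes, 4 features for one datum d (values made up).
lam = np.array([0.5, -1.0, 0.2, 0.0])
feature_matrix = np.array([
    [1.0, 0.0, 1.0, 0.0],   # f(c1, d)
    [0.0, 1.0, 0.0, 1.0],   # f(c2, d)
    [1.0, 1.0, 0.0, 0.0],   # f(c3, d)
])
print(softmax_probs(lam, feature_matrix))  # a distribution that sums to 1
```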

From Maxent Classifiers to Neural Networks. Start from the vector form and make it two-class:

P(c_1 \mid d, \lambda) = \frac{e^{\lambda^\top f(c_1, d)}}{e^{\lambda^\top f(c_1, d)} + e^{\lambda^\top f(c_2, d)}}
                       = \frac{1}{1 + e^{\lambda^\top [f(c_2, d) - f(c_1, d)]}}
                       = \frac{1}{1 + e^{-\lambda^\top x}} = f(\lambda^\top x)
\quad \text{for } x = f(c_1, d) - f(c_2, d),

where f(z) = 1/(1 + \exp(-z)) is the logistic function, a sigmoid non-linearity.
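
A quick numeric check of this reduction (the feature vectors below are invented for the check, not from the slides): the two-class softmax and the logistic of \lambda^\top x, with x = f(c_1, d) - f(c_2, d), give the same probability.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

lam  = np.array([0.3, -0.7, 1.2])
f_c1 = np.array([1.0, 0.0, 1.0])   # f(c1, d), made up for the check
f_c2 = np.array([0.0, 1.0, 0.5])   # f(c2, d)

# Two-class softmax.
s1, s2 = lam @ f_c1, lam @ f_c2
p_softmax = np.exp(s1) / (np.exp(s1) + np.exp(s2))

# Logistic form with x = f(c1, d) - f(c2, d).
x = f_c1 - f_c2
p_logistic = logistic(lam @ x)

print(np.isclose(p_softmax, p_logistic))  # True
```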

This is exactly what an artificial neuron computes:

h_{w,b}(x) = f(w^\top x + b), \qquad f(z) = \frac{1}{1 + e^{-z}}

For b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term. Here w, b are the parameters of this neuron, i.e., of this logistic regression model.
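
As a sketch (NumPy assumed; the particular values of `x`, `w`, and `b` are illustrative), a single neuron is just this one computation:

```python
import numpy as np

def neuron(x, w, b):
    """h_{w,b}(x) = f(w^T x + b), with f the logistic sigmoid."""
    z = w @ x + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # 3 inputs
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias (intercept) term
print(neuron(x, w, b))           # a value in (0, 1)
```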

A neural network = running several logistic regressions at the same time. If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs. But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

A neural network = running several logistic regressions at the same time… which we can feed into another logistic regression function. It is the training criterion that will decide what those intermediate binary variables should be, so as to do a good job of predicting the targets for the next layer, etc.

A neural network = running several logistic regressions at the same time. Before we know it, we have a multilayer neural network.

Matrix notation for a layer. We have

a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)
a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)
\text{etc.}

In matrix notation,

z = W x + b, \qquad a = f(z),

where f is applied element-wise: f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)].
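
The same layer in NumPy (a sketch with made-up shapes: 3 inputs, 3 hidden units): compute z = Wx + b, then apply f element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    """a = f(Wx + b), with f applied element-wise."""
    z = W @ x + b
    return sigmoid(z)

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 0.25])   # input vector (x1, x2, x3)
W = rng.normal(size=(3, 3))       # W[i, j] multiplies x_j for hidden unit i
b = np.zeros(3)                   # one bias per hidden unit
print(layer(x, W, b))             # activations (a1, a2, a3)
```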

How do we train the weights W? For a supervised neural net, we can train the model just like a maxent model: we calculate and use gradients. Stochastic gradient descent (SGD) is the usual choice; conjugate gradient or L-BFGS is sometimes used, though it is slightly dangerous since the space may not be convex. We can use the same ideas and techniques, just without guarantees. This leads to backpropagation, which we cover later.
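
As a toy illustration of the idea (not the lecture's backpropagation derivation, which comes later): one SGD step for a single logistic neuron with cross-entropy loss, using the hand-derived gradient dL/dz = p - y. The training example and learning rate are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x, y, lr=0.1):
    """One SGD step for a single logistic neuron with cross-entropy loss."""
    p = sigmoid(w @ x + b)
    grad_z = p - y              # dL/dz for cross-entropy + sigmoid
    w = w - lr * grad_z * x     # dL/dw = (p - y) * x
    b = b - lr * grad_z         # dL/db = (p - y)
    return w, b

w, b = np.zeros(3), 0.0
x, y = np.array([1.0, -2.0, 0.5]), 1.0   # one (made-up) training example
for _ in range(100):
    w, b = sgd_step(w, b, x, y)
print(sigmoid(w @ x + b))  # prediction moves toward the target label 1.0
```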

Non-linearities: why they're needed. For logistic regression, the non-linearity maps scores to probabilities. Here the goal is function approximation, e.g., regression or classification. Without non-linearities, deep neural networks can't do anything more than a linear transform: extra layers could just be compiled down into a single linear transform. People often use other non-linearities, such as tanh and ReLU.
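
A small check of the "extra layers compile down to a single linear transform" claim (random matrices, purely for illustration): stacking two linear layers with no non-linearity equals one linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_linear_layers   = W2 @ (W1 @ x + b1) + b2
single_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_linear_layers, single_linear_layer))  # True
```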

Non-linearities: what's used. Logistic ("sigmoid") and tanh. tanh is just a rescaled and shifted sigmoid:

\tanh(z) = 2\,\mathrm{logistic}(2z) - 1

tanh is often used and often performs better for deep nets; its output is symmetric around 0.
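
A quick numeric check of that identity (a NumPy sketch):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))  # True
```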

Non-linearities: other choices. Hard tanh, softsign, and the linear rectifier (ReLU):

\mathrm{softsign}(z) = \frac{z}{1 + |z|}, \qquad \mathrm{rect}(z) = \max(z, 0)

Hard tanh is similar to, but computationally cheaper than, tanh, and it saturates hard. [Glorot and Bengio, AISTATS 2010, 2011] discuss softsign and the rectifier. The rectified linear unit is now common: it transfers a linear signal if active.
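
Sketches of these non-linearities, written directly from their definitions (the hard-tanh clipping bounds of ±1 are the usual convention and are assumed here):

```python
import numpy as np

def softsign(z):
    return z / (1.0 + np.abs(z))

def rect(z):                      # rectified linear unit (ReLU)
    return np.maximum(z, 0.0)

def hard_tanh(z):                 # linear in [-1, 1], saturates hard outside
    return np.clip(z, -1.0, 1.0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(softsign(z), rect(z), hard_tanh(z), sep="\n")
```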

Summary: knowing the meaning of words! You now understand the basics and the relation to other models. Neuron = logistic regression or a similar function. Input layer = input training/test vector. Bias unit = intercept term / always-on feature. Activation = response/output. Activation function = a logistic or other nonlinearity. Backpropagation = running stochastic gradient descent through a multilayer network efficiently. Weight decay = regularization / Bayesian prior smoothing.

Part 1.3: The Basics. Word Representations

The standard word representation. The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk. In vector space terms, this is a vector with one 1 and a lot of zeroes:

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

Dimensionality: 20K (speech), 50K (PTB), 500K (big vocab), 13M (Google 1T). We call this a "one-hot" representation. Its problem:

motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
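
The one-hot problem in a few lines (the toy vocabulary is invented for illustration): any two distinct one-hot vectors have dot product 0, so "motel" and "hotel" look no more similar than any other pair of words.

```python
import numpy as np

vocab = ["the", "a", "motel", "hotel", "walk"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("motel") @ one_hot("hotel"))  # 0.0 -- no notion of similarity
print(one_hot("motel") @ one_hot("motel"))  # 1.0 -- only identical words match
```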

Distributional similarity based representations. You can get a lot of value by representing a word by means of its neighbors: "You shall know a word by the company it keeps" (J. R. Firth 1957: 11). This is one of the most successful ideas of modern statistical NLP. For example:

government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge

The surrounding words will represent "banking". You can vary whether you use local or large context to get a more syntactic or semantic clustering.
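
A miniature illustration of the idea (the two-sentence "corpus" is just the slide's example fragments; the window size is an arbitrary choice): count the context words around "banking" to form its distributional representation.

```python
from collections import Counter

corpus = [
    "government debt problems turning into banking crises as has happened in".split(),
    "saying that europe needs unified banking regulation to replace the hodgepodge".split(),
]

def context_counts(target, window=2):
    """Counts of words appearing within `window` tokens of `target`."""
    counts = Counter()
    for tokens in corpus:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

print(context_counts("banking"))  # neighbors such as 'into', 'crises', 'unified', 'regulation'
```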

Two traditional word representations: class-based and soft clustering. Class-based models learn word classes of similar words based on distributional information (~ class HMM): Brown clustering (Brown et al. 1992, Liang 2005) and exchange clustering (Martin et al. 1998, Clark 2003). Soft clustering models learn, for each cluster/topic, a distribution over words giving how likely each word is in that cluster: Latent Semantic Analysis (LSA/LSI), random projections, Latent Dirichlet Allocation (LDA), HMM clustering.

Neural word embeddings as a distributed representation. The idea is similar: combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Huang et al. 2012). In all of these approaches, including deep learning models, a word is represented as a dense vector, e.g.

linguistics = [0.286 0.792 0.177 0.107 0.109 0.542 0.349 0.271]
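
With dense vectors, similarity between words becomes meaningful. As a sketch, cosine similarity is one standard measure; the embeddings below are entirely made up, just to show the computation.

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up dense embeddings for illustration only.
linguistics = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271])
phonetics   = np.array([0.300, 0.700, -0.200, -0.100, 0.150, -0.500, 0.300, 0.250])
banking     = np.array([-0.400, 0.100, 0.600, 0.500, -0.300, 0.200, -0.100, 0.050])

print(cosine(linguistics, phonetics))  # high: related words end up nearby
print(cosine(linguistics, banking))    # lower: unrelated words are farther apart
```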

Neural word embeddings: visualization. (Slide figure: a visualization of word embeddings; Kulkarni, Al-Rfou, Perozzi, & Skiena, 2015.)

Advantages of the neural word embedding approach. Compared to other methods, neural word embeddings can become more meaningful through adding supervision from one or multiple tasks. For instance, sentiment is usually not captured in unsupervised word embeddings but can be in neural word vectors. We can also build compositional vector representations for longer phrases (next lecture).