
Deep Learning and Lexical, Syntactic and Semantic Analysis Wanxiang Che and Yue Zhang 2016-10

Part 2: Introduction to Deep Learning

Part 2.1: Deep Learning Background

What is Machine Learning? From Data to Knowledge. Traditional program: Input + Algorithm -> Output. ML program: Input + Output -> Algorithm.

A Standard Example of ML: the MNIST (Modified NIST) database for hand-written digit recognition. Publicly available, and a huge amount is known about how well various ML methods do on it. 60,000 training + 10,000 test hand-written digits (28x28 pixels each).

Very hard to say what makes a 2

Traditional Model (before 2012): fixed/engineered features + trainable classifier. Designing a feature extractor requires considerable effort by experts (e.g. SIFT, GIST, shape context).

Deep Learning (after 2012) Learning Hierarchical Representations DEEP means more than one stage of non-linear feature transformation

Deep Learning Architecture

Deep Learning is Not New 1980s technology (Neural Networks)

Pros and Cons of Neural Networks. Pros: simple to learn p(y|x); results OK for shallow nets. Cons: does not learn p(x); trouble with > 3 layers; overfits; slow to train.

Deep Learning beats NN by addressing each con: overfitting -> Dropout, Maxout, stochastic pooling; slow training -> GPUs; not learning p(x) -> unsupervised feature learning (RBMs, DAEs); trouble with > 3 layers -> new activation functions (ReLU) and gating mechanisms. The pros (simple to learn p(y|x), results OK for shallow nets) remain.

Results on MNIST (accuracy)
Naïve neural network: 96.59%
SVM (default settings for libsvm): 94.35%
Optimal SVM [Andreas Mueller]: 98.56%
State of the art, convolutional NN (2013): 99.79%

Deep Learning Wins
9. MICCAI 2013 Grand Challenge on Mitosis Detection
8. ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images
7. ISBI 2012 Brain Image Segmentation Challenge (with superhuman pixel error rate)
6. IJCNN 2011 Traffic Sign Recognition Competition (only our method achieved superhuman results)
5. ICDAR 2011 offline Chinese Handwriting Competition
4. Online German Traffic Sign Recognition Contest
3. ICDAR 2009 Arabic Connected Handwriting Competition
2. ICDAR 2009 Handwritten Farsi/Arabic Character Recognition Competition
1. ICDAR 2009 French Connected Handwriting Competition
Compare the overview page on handwriting recognition: http://people.idsia.ch/~juergen/deeplearning.html

Deep Learning for Speech Recognition

Deep Learning for NLP (overview diagram): applications (word segmentation, POS tagging, parsing, MT, dialogue, MCTest QA, caption generation) are built on a semantic vector space learned by deep models (recurrent NN, convolutional NN, recursive NN) from big data (monolingual, multi-lingual, multi-modal); example sentence 我喜欢红苹果 ('I like red apples').

Part 2.2: Feedforward Neural Networks

The Traditional Paradigm for ML
1. Convert the raw input vector into a vector of feature activations; use hand-written programs based on common sense to define the features
2. Learn how to weight each of the feature activations to get a single scalar quantity
3. If this quantity is above some threshold, decide that the input vector is a positive example of the target class

The Standard Perceptron Architecture (diagram): input units (the sentence 'IPhone is very good.') feed hand-coded programs that produce feature units (good, very/good, very); learned weights (0.8, 0.9, 0.1) combine them into a decision unit with a threshold (> 5?).

The Limitations of Perceptrons. The hand-coded features have a great influence on performance, and finding suitable features is costly. A perceptron is a linear classifier with a hyperplane, so it cannot separate non-linear data; for example, the XOR function over the points (0,0), (0,1), (1,0), (1,1) cannot be learned by a single-layer perceptron because the positive and negative cases cannot be separated by a plane.

Learning with Non-linear Hidden Layers

Feedforward Neural Networks, also called Multi-layer Perceptrons (MLP) (diagram: inputs x1 ... xn, 1st hidden layer, 2nd hidden layer, output layer). The information is propagated from the inputs to the outputs; time plays no role (no cycle between outputs and inputs). Learning the weights of hidden units is equivalent to learning features. Networks without hidden layers are very limited in the input-output mappings they can model; more layers of linear units do not help (it is still linear), and fixed output non-linearities are not enough.
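
As a concrete illustration (not from the slides), a minimal NumPy sketch of the forward pass through two hidden layers; the layer sizes and the tanh/softmax choices are assumptions:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 4, 8, 8, 3          # illustrative layer sizes
W1, b1 = rng.normal(size=(n_h1, n_in)), np.zeros(n_h1)
W2, b2 = rng.normal(size=(n_h2, n_h1)), np.zeros(n_h2)
W3, b3 = rng.normal(size=(n_out, n_h2)), np.zeros(n_out)

x  = rng.normal(size=n_in)                    # input vector x1..xn
h1 = np.tanh(W1 @ x + b1)                     # 1st hidden layer
h2 = np.tanh(W2 @ h1 + b2)                    # 2nd hidden layer
y  = softmax(W3 @ h2 + b3)                    # output layer (class probabilities)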

Multiple Layer Neural Networks What are those hidden neurons doing? Maybe represent outlines

General Optimizing (Learning) Algorithms Gradient Descent Stochastic Gradient Descent (SGD) Minibatch SGD (m > 1), Online GD (m = 1)
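
A minimal sketch of these update rules, assuming the caller supplies the loss gradient; m = 1 recovers online GD and m > 1 minibatch SGD (the function name and arguments are illustrative):

import numpy as np

def sgd(w, grad_fn, data, lr=0.1, m=32, epochs=5, seed=0):
    """grad_fn(w, batch) must return dLoss/dw averaged over the batch."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), m):      # m = 1 -> online GD, m > 1 -> minibatch SGD
            batch = data[i:i + m]
            w = w - lr * grad_fn(w, batch)    # gradient descent step
    return w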

Computational/Flow Graphs: describing mathematical expressions. For example, e = (a + b) * (b + 1) decomposes into c = a + b, d = b + 1, e = c * d. If a = 2 and b = 1, then c = 3, d = 2, e = 6.

Derivatives on Computational Graphs

Computational Graph Backward Pass (Backpropagation)
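
For the example e = (a + b) * (b + 1) above, the backward pass can be written out by hand; a small sketch in plain Python, with the derivatives obtained from the chain rule:

# Forward pass for e = (a + b) * (b + 1) with a = 2, b = 1
a, b = 2.0, 1.0
c = a + b          # c = 3
d = b + 1          # d = 2
e = c * d          # e = 6

# Backward pass (reverse order of the graph)
de_dc = d          # de/dc = d
de_dd = c          # de/dd = c
de_da = de_dc * 1.0                 # de/da = de/dc * dc/da = 2
de_db = de_dc * 1.0 + de_dd * 1.0   # de/db = de/dc + de/dd = 5 (sum over both paths)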

An FNN POS Tagger
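
The slide shows such a tagger as a figure; as a hedged sketch, one common design is a window-based feedforward tagger in the spirit of Collobert & Weston, where the embeddings of a word window are concatenated and fed through one hidden layer (all sizes below are illustrative assumptions, not taken from the slides):

import numpy as np

V, d, win, n_h, n_tags = 10000, 50, 5, 100, 45   # assumed vocabulary, dims, window, tagset
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, d))          # word embedding table
W1 = rng.normal(scale=0.1, size=(n_h, win * d))  # hidden layer
W2 = rng.normal(scale=0.1, size=(n_tags, n_h))   # output layer

def tag_scores(word_ids):
    """word_ids: indices of the window centred on the word to tag."""
    x = E[word_ids].reshape(-1)                  # concatenate the window's embeddings
    h = np.tanh(W1 @ x)
    z = W2 @ h
    return np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax over tags

probs = tag_scores(np.array([12, 7, 301, 42, 9]))  # hypothetical window of 5 word ids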

Part 2.3: Word Embeddings

Typical Approaches for Word Representation: 1-hot representation (orthogonal vectors), bag-of-words model. star = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...], sun = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...], so sim(star, sun) = 0.

Distributed Word Representation. Each word is associated with a low-dimensional (compressed, 50-1000), dense (non-sparse), real-valued (continuous) vector, its word embedding. Word vectors are learned through supervised models. The key property: semantic similarity is modelled as vector similarity.
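
A minimal illustration of 'semantic similarity as vector similarity' using cosine similarity (the three-dimensional vectors are made up for the example):

import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

star = np.array([0.8, 0.1, 0.6])   # hypothetical embeddings
sun  = np.array([0.7, 0.2, 0.5])
moon = np.array([0.6, 0.1, 0.7])
print(cosine(star, sun), cosine(star, moon))   # both high, unlike the 1-hot case (sim = 0)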

How to obtain Word Embedding

Neural Network Language Models (NNLM): feed-forward (Bengio et al. 2003), trained by maximum-likelihood estimation with back-propagation. Input: the (n-1) previous words' embeddings.

Predict Word Vector Directly SENNA (Collobert and Weston, 2008) word2vec (Mikolov et al. 2013)

Word2vec: CBOW (Continuous Bag-of-Words). Adds inputs from words within a short window to predict the current word; the weights for different positions are shared. Computationally much more efficient than a normal NNLM because the hidden layer is just linear. Each word is an embedding v(w); each context is an embedding v'(c).

Word2vec: Skip-Gram. Predicts surrounding words using the current word; similar performance to CBOW. Each word is an embedding v(w); each context is an embedding v'(c).

Word2vec Training: SGD + backpropagation. Most of the computational cost is a function of the vocabulary size (millions). Training acceleration: negative sampling (Mikolov et al. 2013); hierarchical decomposition (Morin and Bengio 2005; Mnih and Hinton 2008; Mikolov et al. 2013); Graphics Processing Units (GPUs).
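
A rough sketch of the negative-sampling loss for a single (word, context) pair in the spirit of Mikolov et al. 2013; the uniform sampling of negatives and all sizes are simplifying assumptions (word2vec actually samples negatives from a smoothed unigram distribution):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, k = 10000, 100, 5                        # vocab size, dim, negatives per pair (assumed)
rng = np.random.default_rng(0)
Vw = rng.normal(scale=0.1, size=(V, d))        # word embeddings v(w)
Vc = rng.normal(scale=0.1, size=(V, d))        # context embeddings v'(c)

def sgns_loss(w, c):
    """Negative-sampling loss for one (word id, context id) pair."""
    neg = rng.integers(0, V, size=k)           # simplified: uniform negative sampling
    pos_term = -np.log(sigmoid(Vw[w] @ Vc[c]))
    neg_term = -np.log(sigmoid(-Vw[w] @ Vc[neg].T)).sum()
    return pos_term + neg_term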

Word Analogy
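
A rough sketch of answering an analogy query (e.g. man : king :: woman : ?) by vector arithmetic and nearest-neighbour search; the embedding matrix E and vocabulary list are assumed to come from an already trained model:

import numpy as np

def analogy(E, vocab, a, b, c):
    """Return the word d maximising cos(v(b) - v(a) + v(c), v(d))."""
    ids = {w: i for i, w in enumerate(vocab)}
    q = E[ids[b]] - E[ids[a]] + E[ids[c]]
    sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    for i in np.argsort(-sims):
        if vocab[i] not in (a, b, c):          # exclude the query words themselves
            return vocab[i]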

Part 2.4: Recurrent and Other Neural Networks

Language Models. A language model computes a probability for a sequence of words, P(w_1, ..., w_T), or predicts a probability for the next word, P(w_{T+1} | w_1, ..., w_T). Useful for machine translation, speech recognition, and so on. Word ordering: P(the cat is small) > P(small the is cat). Word choice: P(there are four cats) > P(there are for cats).

Traditional Language Models. An incorrect but necessary Markov assumption: the probability is usually conditioned on a window of n previous words, P(w_1, ..., w_T) = Π_{i=1}^{T} P(w_i | w_1, ..., w_{i-1}) ≈ Π_{i=1}^{T} P(w_i | w_{i-(n-1)}, ..., w_{i-1}). How to estimate the probabilities: p(w_2 | w_1) = count(w_1, w_2) / count(w_1), p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2). Performance improves by keeping around higher n-gram counts and doing smoothing, such as backoff (e.g. if a 4-gram is not found, try the 3-gram, etc.). Disadvantages: there are A LOT of n-grams, and the model cannot see a long history, e.g. choosing between 坐 and 作 in 坐/作了一整天的火车/作业 ('rode the train all day' vs. 'did homework all day') depends on the distant final word.
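
To make the count-based estimates concrete, a minimal sketch with a toy corpus (the corpus and the absence of smoothing are illustrative simplifications):

from collections import Counter

corpus = "the cat is small the cat is cute".split()   # toy corpus
unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """p(w2 | w1) = count(w1, w2) / count(w1)"""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("cat", "the"))   # 1.0 in this toy corpus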

Recurrent Neural Networks (RNNs). Condition the neural network on all previous inputs; the RAM requirement only scales with the number of inputs. (Diagram: the network unrolled over time, with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1}, and outputs y_{t-1}, y_t, y_{t+1}; the weights W_1 (recurrent), W_2 (input) and W_3 (output) are shared across all time steps.)

Recurrent Neural Networks (RNNs). At a single time step t: h_t = tanh(W_1 h_{t-1} + W_2 x_t), ŷ_t = softmax(W_3 h_t).
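
A direct NumPy transcription of these two equations (the dimensions and random initialization are illustrative assumptions):

import numpy as np

d_h, d_x, d_y = 16, 8, 4                       # assumed sizes
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_h, d_h))    # recurrent weights
W2 = rng.normal(scale=0.1, size=(d_h, d_x))    # input weights
W3 = rng.normal(scale=0.1, size=(d_y, d_h))    # output weights

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W1 @ h_prev + W2 @ x_t)      # h_t = tanh(W_1 h_{t-1} + W_2 x_t)
    z = W3 @ h_t
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # y_t = softmax(W_3 h_t)
    return h_t, y_t

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):            # unroll over a sequence of 5 inputs
    h, y = rnn_step(h, x)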

Training RNNs is hard. Ideally, inputs from many time steps ago can modify the output y, for example an input 2 time steps back. (Diagram: the unrolled network with x_{t-1} influencing y_{t+1} through h_{t-1}, h_t, h_{t+1}.)

BackPropagation Through Time (BPTT). The total error is the sum of the errors at each time step t: ∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W. The term ∂E_t/∂W_3 is easy to calculate, but ∂E_t/∂W_1 = (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(∂h_t/∂W_1) is hard (likewise for W_2), because h_t = tanh(W_1 h_{t-1} + W_2 x_t) depends on h_{t-1}, which depends on W_1 and h_{t-2}, and so on. So ∂E_t/∂W_1 = Σ_{k=1}^{t} (∂E_t/∂ŷ_t)(∂ŷ_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_1).

The vanishing gradient problem. With h_t = tanh(W_1 h_{t-1} + W_2 x_t), the factor ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j-1} = Π_{j=k+1}^{t} W_1^T diag[tanh'(·)]. Each factor is bounded: ‖∂h_j/∂h_{j-1}‖ ≤ γ ‖W_1‖ ≤ γ λ_1, where γ bounds ‖diag[tanh'(·)]‖ and λ_1 is the largest singular value of W_1. Therefore ‖∂h_t/∂h_k‖ ≤ (γ λ_1)^{t-k} -> 0 if γ λ_1 < 1. This product can become very small or very large quickly -> vanishing or exploding gradients. Trick for exploding gradients: the clipping trick (set a threshold).
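
A minimal sketch of the clipping trick, here clipping by global norm with an arbitrary threshold (the exact variant is an assumption; the slide only says 'set a threshold'):

import numpy as np

def clip_gradients(grads, threshold=5.0):
    """Rescale a list of gradient arrays if their global norm exceeds the threshold."""
    norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads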

A solution. Intuition: ensure γ λ_1 ≥ 1 to prevent vanishing gradients. So: proper initialization of the W matrices, and using ReLU instead of tanh or sigmoid activation functions.

A better solution. Recall the original transition equation h_t = tanh(W_1 h_{t-1} + W_2 x_t). We can instead update the state additively: u_t = tanh(W_1 h_{t-1} + W_2 x_t), h_t = h_{t-1} + u_t. Then ∂h_t/∂h_{t-1} = 1 + ∂u_t/∂h_{t-1}. On the other hand, h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = ...

A better solution (cont.). Interpolate between the old state and the new state ('choosing to forget'): f_t = σ(W_f x_t + U_f h_{t-1}), h_t = f_t ⊙ h_{t-1} + (1 - f_t) ⊙ u_t. Introduce a separate input gate i_t: i_t = σ(W_i x_t + U_i h_{t-1}), h_t = f_t ⊙ h_{t-1} + i_t ⊙ u_t. Selectively expose a memory cell c_t with an output gate o_t: o_t = σ(W_o x_t + U_o h_{t-1}), c_t = f_t ⊙ c_{t-1} + i_t ⊙ u_t, h_t = o_t ⊙ tanh(c_t).

Long Short-Term Memory (LSTM), Hochreiter & Schmidhuber, 1997. LSTM = additive updates + gating. (Diagram: an LSTM cell taking x_t, h_{t-1} and c_{t-1}, applying the three σ gates and tanh non-linearities, and producing c_t and h_t.)
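
A NumPy sketch of a single LSTM step following the gate equations above (biases are omitted and sizes are assumed for brevity):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_h, d_x = 16, 8
rng = np.random.default_rng(0)
Wf, Uf = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))
Wi, Ui = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))
Wo, Uo = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))
Wu, Uu = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate
    u = np.tanh(Wu @ x_t + Uu @ h_prev)        # candidate update
    c = f * c_prev + i * u                     # additive cell update
    h = o * np.tanh(c)
    return h, c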

Gated Recurrent Units, GRU (Cho et al. 2014). Main ideas: keep around memories to capture long-distance dependencies, and allow error messages to flow at different strengths depending on the inputs. Update gate, based on the current input and hidden state: z_t = σ(W_z x_t + U_z h_{t-1}). Reset gate, computed similarly but with different weights: r_t = σ(W_r x_t + U_r h_{t-1}).

GRU. New memory content: h̃_t = tanh(W x_t + r_t ⊙ U h_{t-1}). The update gate z controls how much of the past state should matter now: if z is close to 1, we can copy information in that unit through many time steps -> less vanishing gradient! If the reset gate r is close to 0, the unit ignores the previous memory and only stores the new input information -> this allows the model to drop information that is irrelevant in the future. Units with long-term dependencies have active update gates z; units with short-term dependencies often have very active reset gates r. The final memory at time step t combines the current and previous time steps: h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t.
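
The same GRU equations as a NumPy sketch (biases omitted, sizes assumed):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_h, d_x = 16, 8
rng = np.random.default_rng(0)
Wz, Uz = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))
Wr, Ur = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))
W,  U  = rng.normal(scale=0.1, size=(d_h, d_x)), rng.normal(scale=0.1, size=(d_h, d_h))

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # new memory content
    return z * h_prev + (1 - z) * h_tilde           # interpolate old and new state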

LSTM vs. GRU No clear winner! Tuning hyperparameters like layer size is probably more important than picking the ideal architecture GRUs have fewer parameters and thus may train a bit faster or need less data to generalize If you have enough data, the greater expressive power of LSTMs may lead to better results.

More RNNs: Bidirectional RNN; Stacked Bidirectional RNN.

Tree-LSTMs Traditional Sequential Composition Tree-Structured Composition

More Applications of RNN Neural Machine Translation Handwriting Generation Image Caption Generation...

Neural Machine Translation. RNN trained end-to-end: encoder-decoder. (Diagram: the encoder reads 'I love you' into hidden states h_1, h_2, h_3; the final encoder state, which needs to capture the entire sentence, is passed to the decoder, which generates 我 ...)

Attention Mechanism: scoring (Bahdanau et al. 2015). (Diagram: the decoder state h_t is scored against each encoder state, α = score(h_t, h̄_s); the resulting weights, e.g. 0.3, 0.6, 0.1 over h_1, h_2, h_3, form a context vector c that is fed to the decoder when generating 我 ...)
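
A rough sketch of computing attention weights and a context vector over the encoder states; for simplicity a dot-product score is assumed here, whereas Bahdanau et al. 2015 use an additive (MLP) score:

import numpy as np

def attention_context(h_dec, H_enc):
    """h_dec: decoder state (d,); H_enc: encoder states (T, d). Returns (context, weights)."""
    scores = H_enc @ h_dec                     # score(h_t, h_s), here a dot product
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                # attention weights, e.g. [0.3, 0.6, 0.1]
    c = alpha @ H_enc                          # weighted sum of encoder states
    return c, alpha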

Convolutional Neural Networks: convolution and pooling. (Figure from CS231n: Convolutional Neural Networks for Visual Recognition.)

CNN for NLP. Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.

Recursive Neural Network Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Network. NIPS.