Neural Architectures for Image, Language, and Speech Processing
Karl Stratos
June 26, 2018

Overview
- Feedforward Networks
- Need for Specialized Architectures
- Convolutional Neural Networks (CNNs)
- Recurrent Neural Networks (RNNs)
- Long Short-Term Memory Networks (LSTMs)
- Example: Bidirectional LSTM Network for POS Tagging
- Encoder-Decoder Models
- Example: RNN-Based Seq2Seq
- Bonus: Connectionist Temporal Classification (CTC)

What's a Neural Network?

Just a composition of linear/nonlinear functions:

    f(x) = W^(L) tanh(W^(L-1) ··· tanh(W^(1) x) ··· )

More like a paradigm, not a specific model:
1. Transform your input: x -> f(x).
2. Define a loss between f(x) and the target label y.
3. Train the parameters by minimizing the loss.

You've Already Seen Some Neural Networks...

A log-linear model is a neural network with 0 hidden layers and a softmax output layer:

    p(y|x) := exp([Wx]_y) / Σ_{y'} exp([Wx]_{y'}) = softmax_y(Wx)

Get W by minimizing L(W) = −Σ_i log p(y_i | x_i).

Linear regression is a neural network with 0 hidden layers and the identity output layer:

    f(x) := Wx

Get W by minimizing L(W) = Σ_i (y_i − f(x_i))^2.
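To make the two cases concrete, here is a minimal numpy sketch of one loss term for each; the dimensions, random data, and variable names are placeholders for the example, not anything taken from the slides.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

d, K = 5, 3
x = np.random.randn(d)

# Log-linear model: p(y|x) = softmax_y(Wx); the training loss is the negative log-likelihood.
W = np.random.randn(K, d)
y = 1
nll = -np.log(softmax(W @ x)[y])            # one term of L(W) = -sum_i log p(y_i | x_i)

# Linear regression: f(x) = Wx with the identity output layer; the loss is squared error.
W_reg = np.random.randn(1, d)
target = 0.7
sq_err = float((target - W_reg @ x) ** 2)   # one term of L(W) = sum_i (y_i - f(x_i))^2
```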

Feedforward Network

Think: log-linear with extra transformation.

With 1 hidden layer:

    h^(1) = tanh(W^(1) x)
    p(y|x) = softmax_y(h^(1))

With 2 hidden layers:

    h^(1) = tanh(W^(1) x)
    h^(2) = tanh(W^(2) h^(1))
    p(y|x) = softmax_y(h^(2))

Again, get the parameters W^(l) by minimizing −Σ_i log p(y_i | x_i).

Q. What's the catch?
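A minimal numpy sketch of the 2-hidden-layer case, following the formulas above literally (so the second hidden layer is sized to the number of classes); the sizes and random weights are assumptions made only for the example.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def feedforward(x, W1, W2):
    h1 = np.tanh(W1 @ x)        # h^(1) = tanh(W^(1) x)
    h2 = np.tanh(W2 @ h1)       # h^(2) = tanh(W^(2) h^(1))
    return softmax(h2)          # p(.|x) = softmax(h^(2))

d, d1, K = 10, 32, 4            # input dim, hidden dim, number of classes (assumed)
W1, W2 = np.random.randn(d1, d), np.random.randn(K, d1)
p = feedforward(np.random.randn(d), W1, W2)
print(p.shape, p.sum())         # (4,) 1.0 (up to floating point)
```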

Training = Loss Minimization

We can decrease any continuous loss by following the subgradient:
1. Differentiate the loss w.r.t. the model parameters (backprop).
2. Take a gradient step.
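For concreteness, a toy sketch of this loop for the log-linear loss from earlier, using its analytic gradient; the data, learning rate, and step count are all invented for the example.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, K, n = 5, 3, 50
X, Y = rng.normal(size=(n, d)), rng.integers(K, size=n)   # toy labeled data
W, lr = np.zeros((K, d)), 0.1

for step in range(200):
    grad = np.zeros_like(W)
    for x, y in zip(X, Y):
        p = softmax(W @ x)
        p[y] -= 1.0                 # d(-log p(y|x)) / d(Wx) = p - onehot(y)
        grad += np.outer(p, x)      # chain rule through the linear layer
    W -= lr * grad / n              # gradient step on L(W)
```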

Neural Networks are (Finite-Sample) Universal Learners!

Theorem (Zhang et al., 2016). Give me any
1. set of n samples S = {x^(1), ..., x^(n)} ⊂ ℝ^d, and
2. function f: S -> ℝ that assigns some arbitrary value f(x^(i)) to each i = 1 ... n.
Then I can specify a 1-hidden-layer feedforward network C: S -> ℝ with 2n + d parameters such that C(x^(i)) = f(x^(i)) for all i = 1 ... n.

Proof. Define

    C(x) = w^T relu((a^T x, ..., a^T x) − b)

where w, b ∈ ℝ^n and a ∈ ℝ^d are the network parameters. Choose a, b so that the matrix A_{i,j} := max{0, a^T x^(i) − b_j} is triangular. Solve for w in

    (f(x^(1)), ..., f(x^(n)))^T = Aw
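The construction is easy to carry out numerically. The sketch below (random data and a small safety margin when choosing b are assumptions of the example) fits n arbitrary target values exactly with a 1-hidden-layer ReLU network of 2n + d parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))             # n samples in R^d
f = rng.normal(size=n)                  # arbitrary target values f(x^(i))

a = rng.normal(size=d)                  # generic direction: projections are distinct w.h.p.
proj = X @ a
order = np.argsort(proj)                # sort samples by their projection
X, f, proj = X[order], f[order], proj[order]

eps = 0.5 * np.diff(proj).min()         # smaller than any gap between consecutive projections
b = proj - eps                          # so that a.x^(i) - b_j > 0 iff i >= j

A = np.maximum(0.0, proj[:, None] - b[None, :])   # lower triangular with positive diagonal
w = np.linalg.solve(A, f)               # 2n + d parameters in total: w, b, a

C = lambda x: w @ np.maximum(0.0, x @ a - b)      # the 1-hidden-layer ReLU network
print(np.allclose([C(x) for x in X], f))          # True: exact fit on the n samples
```

Because A is triangular with a nonzero diagonal, the linear system always has a solution, which is the whole point of choosing b this way.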

So Why Not Use a Simple Feedforward for Everything?

Computational reasons. For example, using a GIANT feedforward to cover instances of different sizes is clearly inefficient.

Empirical reasons. In principle, we can learn any function, but this tells us nothing about how to get there. How many samples do we need? How can we find the right parameters? Specializing an architecture to a particular type of computation allows us to incorporate inductive bias. The right architecture is absolutely critical in practice.

Image = Cube

X ∈ ℝ^{w×h×d}: width w, height h, depth d (e.g., 3 RGB values).

(Scene from Your Name (2016).)

Convolutional Neural Network (CNN)

Think: slide various types of filters across the image.

We say we apply a filter F^i ∈ ℝ^{f×f×d} with stride s ∈ ℕ and zero-padding size p ∈ ℕ to an image X ∈ ℝ^{w×h×d} and obtain a slice Z^i ∈ ℝ^{w'×h'×1}, where

    w' = (w − f + 2p)/s + 1    and    h' = (h − f + 2p)/s + 1

Each entry of Z^i is given by a dot product between F^i and the corresponding receptive field in X (with zero-padding):

    Z^i_{t,t',1} = Σ_{a,b,c} F^i_{a,b,c} X_{t+a, t'+b, c}

We use multiple filters and stack the slices: if m filters are used, the output is Z ∈ ℝ^{w'×h'×m}.
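A direct (and deliberately slow) numpy transcription of this definition; the image size, number of filters, stride, and padding below are hypothetical, and the code assumes the output dimensions come out to integers.

```python
import numpy as np

def conv2d(X, filters, stride=1, pad=0):
    """Apply m filters of shape (f, f, d) to an image X of shape (w, h, d)."""
    w, h, d = X.shape
    m, f, _, _ = filters.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))        # zero-padding
    w_out = (w - f + 2 * pad) // stride + 1                  # w' from the formula above
    h_out = (h - f + 2 * pad) // stride + 1                  # h'
    Z = np.zeros((w_out, h_out, m))
    for i in range(m):                                       # filters[i] plays the role of F^i
        for t in range(w_out):
            for tp in range(h_out):
                field = Xp[t*stride:t*stride+f, tp*stride:tp*stride+f, :]
                Z[t, tp, i] = np.sum(filters[i] * field)     # dot product with the receptive field
    return Z                                                 # stacked slices: (w', h', m)

X = np.random.randn(32, 32, 3)
F = np.random.randn(8, 5, 5, 3)       # m = 8 filters, f = 5
Z = conv2d(X, F, stride=1, pad=2)
print(Z.shape)                        # (32, 32, 8): w' = (32 - 5 + 2*2)/1 + 1 = 32
```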

An Example ConvNet Architecture

- Input: image X ∈ ℝ^{w×h×d}.
- Convolutional layer: use m filters to obtain Z ∈ ℝ^{w'×h'×m}.
- Relu layer: apply element-wise relu to Z.
- Max pooling layer: use a 2×2 filter with stride 2 and appropriate zero-padding size to downsample Z to P ∈ ℝ^{(w'/2)×(h'/2)×m}.
- Fully-connected layer: equivalent to a convolutional layer with K giant filters Q^k ∈ ℝ^{(w'/2)×(h'/2)×m}.
- Output: vector v ∈ ℝ^{1×1×K} used for K-way classification of X.
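A shape-level sketch of the relu / max-pooling / fully-connected steps on a hypothetical 32×32×8 convolutional output; the sizes are invented, and the "fully-connected layer" is written directly as K dot products with the flattened P rather than as K giant filters.

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

def maxpool2x2(Z):
    """2x2 max pooling with stride 2, applied to each of the m slices independently."""
    w, h, m = Z.shape
    return Z[:w - w % 2, :h - h % 2, :].reshape(w // 2, 2, h // 2, 2, m).max(axis=(1, 3))

wp, hp, m, K = 32, 32, 8, 10                 # hypothetical sizes: w', h', m filters, K classes
Z = np.random.randn(wp, hp, m)               # pretend output of the convolutional layer
P = maxpool2x2(relu(Z))                      # shape (w'/2, h'/2, m) = (16, 16, 8)
Q = np.random.randn(K, P.size)               # K "giant filters", flattened
v = Q @ P.ravel()                            # K-way classification scores
print(P.shape, v.shape)                      # (16, 16, 8) (10,)
```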

In Practice, Many Such Layers

(Figure: http://cs231n.github.io/convolutional-networks)

Filters Correspond to Patterns

(Figure: http://cs231n.github.io/convolutional-networks)

Filters at Lower Layer vs Higher Layer

(Figure: Zeiler and Fergus, 2013)

Vanishing/Exploding Gradient Problem in Deep Networks

    h_l = Conv(Conv(Conv(Conv(Conv(··· (x) ···)))))

ResNet (He et al., 2015): at each layer l,

    h_l = Conv(h_{l−1}) + h_{l−1}
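A toy sketch of the residual connection, with a tanh-linear map standing in for Conv(·) (an assumption made only to keep the example short):

```python
import numpy as np

def conv(h, W):
    # Stand-in for Conv(.): any per-layer transform works for illustrating the skip connection.
    return np.tanh(W @ h)

d, L = 16, 20
h = np.random.randn(d)
Ws = [0.1 * np.random.randn(d, d) for _ in range(L)]

for W in Ws:
    h = conv(h, W) + h        # residual connection: h_l = Conv(h_{l-1}) + h_{l-1}
```

The identity term gives gradients a direct path back to earlier layers, which is what mitigates the vanishing-gradient problem in very deep stacks.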

Language/Speech = Sequence

x_1 ... x_N ∈ ℝ^d: each x_i is a d-dimensional embedding of the i-th input.
- Text: x_i is a word embedding (also to be learned).
- Speech: x_i is an acoustic feature vector.

Two properties of this type of data:
1. The length N is not fixed.
2. It is typically processed from left to right (most of the time).

Recurrent Neural Network (RNN)

Think: HMM (or Kalman filter) with extra transformation.

Input: sequence x_1 ... x_N ∈ ℝ^d. For i = 1 ... N,

    h_i = tanh(W x_i + V h_{i−1})

Output: sequence h_1 ... h_N ∈ ℝ^d.

RNN ≈ Deep Feedforward

Unroll the expression for the last output vector h_N:

    h_N = tanh(W x_N + V tanh(W x_{N−1} + ··· + V tanh(W x_1 + V h_0) ···))

It's just a deep feedforward network with one important difference: the parameters are reused, i.e., (V, W) are applied N times.

Training: do backprop on this unrolled network, update the parameters.
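A minimal sketch of the recurrence, making explicit that the same (W, V) are reused at every step of the unrolled computation; dimensions and inputs are placeholders.

```python
import numpy as np

def rnn_forward(xs, W, V, h0):
    """Plain RNN: h_i = tanh(W x_i + V h_{i-1}), with the same parameters at every step."""
    h = h0
    hs = []
    for x in xs:                          # unrolling over the sequence
        h = np.tanh(W @ x + V @ h)
        hs.append(h)
    return hs

d, N = 4, 7
xs = [np.random.randn(d) for _ in range(N)]
W, V = np.random.randn(d, d), np.random.randn(d, d)
hs = rnn_forward(xs, W, V, np.zeros(d))
print(len(hs), hs[-1].shape)              # 7 (4,)
```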

LSTM

An RNN produces a sequence of output vectors:

    x_1 ... x_N  ->  h_1 ... h_N

An LSTM produces memory cell vectors along with the output:

    x_1 ... x_N  ->  c_1 ... c_N, h_1 ... h_N

These c_1 ... c_N enable the network to keep or drop information from previous states.

LSTM: Details

At each time step i:

1. Compute a masking vector for the memory cell:

    q_i = σ(U^q x_i + V^q h_{i−1} + W^q c_{i−1}) ∈ [0, 1]^d

2. Use q_i to keep/forget dimensions in the previous memory cell:

    c_i = (1 − q_i) ⊙ c_{i−1} + q_i ⊙ tanh(U^c x_i + V^c h_{i−1})

3. Compute another masking vector for the output:

    o_i = σ(U^o x_i + V^o h_{i−1} + W^o c_i) ∈ [0, 1]^d

4. Use o_i to keep/forget dimensions in the current memory cell:

    h_i = o_i ⊙ tanh(c_i)
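A direct numpy transcription of these four equations; the weight shapes (all d×d, so the input and state share dimension d), the initialization, and the toy sequence are assumptions of the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of the LSTM as written above (coupled keep/forget mask q_i)."""
    Uq, Vq, Wq, Uc, Vc, Uo, Vo, Wo = params
    q = sigmoid(Uq @ x + Vq @ h_prev + Wq @ c_prev)            # memory-cell mask in [0, 1]^d
    c = (1 - q) * c_prev + q * np.tanh(Uc @ x + Vc @ h_prev)   # keep/forget + write new content
    o = sigmoid(Uo @ x + Vo @ h_prev + Wo @ c)                 # output mask in [0, 1]^d
    h = o * np.tanh(c)                                         # expose part of the memory cell
    return h, c

d = 8
params = tuple(0.1 * np.random.randn(d, d) for _ in range(8))
h, c = np.zeros(d), np.zeros(d)
for x in np.random.randn(5, d):         # run over a length-5 toy sequence
    h, c = lstm_step(x, h, c, params)
```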

Build Your Network

Model parameters:
- Embedding e_c for each character type c
- 2 LSTMs (forward/backward) at the character level
- Embedding e_w for each word type w
- 2 LSTMs (forward/backward) at the word level
- Feedforward network (U, V) at the output

Running example: "the dog saw the cat"

1. Character-Level LSTMs

Forward character-level LSTM over e_d, e_o, e_g: produces f^c_1, f^c_2, f^c_3 ∈ ℝ^{d_c}.
Backward character-level LSTM over e_g, e_o, e_d: produces b^c_1, b^c_2, b^c_3 ∈ ℝ^{d_c}.

Get a character-aware encoding of "dog":

    x_dog = [f^c_3; b^c_3; e_dog] ∈ ℝ^{2 d_c + d_w}

2. Word-Level LSTMs

Forward word-level LSTM over x_the, x_dog, x_saw, x_the, x_cat: produces f^w_1 ... f^w_5 ∈ ℝ^d.
Backward word-level LSTM over x_cat, x_the, x_saw, x_dog, x_the: produces b^w_1 ... b^w_5 ∈ ℝ^d.

Get a sentence-aware encoding of "dog":

    z_dog = [f^w_2; b^w_4] ∈ ℝ^{2d}
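Both encodings are plain concatenations. Here is a shape-only sketch with random stand-ins where the actual LSTM states and embeddings would go; the dimensions d_c, d_w, d are made up.

```python
import numpy as np

dc, dw, d = 5, 10, 16                                       # hypothetical dimensions
f_c3, b_c3 = np.random.randn(dc), np.random.randn(dc)       # final forward/backward char-LSTM states for "dog"
e_dog = np.random.randn(dw)                                 # word embedding of "dog"
x_dog = np.concatenate([f_c3, b_c3, e_dog])                 # character-aware encoding, in R^{2*dc + dw}

f_w2, b_w4 = np.random.randn(d), np.random.randn(d)         # word-LSTM states at the position of "dog"
z_dog = np.concatenate([f_w2, b_w4])                        # sentence-aware encoding, in R^{2d}
print(x_dog.shape, z_dog.shape)                             # (20,) (32,)
```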

3. Feedforward

The final vector h_dog ∈ ℝ^m has dimension equal to the number of tag types m, and is computed by the feedforward network:

    h_dog = V tanh(U z_dog)

[h_dog]_N = score of tag N for "dog".

Greedy Model

Define a distribution over tags at each position as:

    p(N | dog) = softmax_N(h_dog)

Given tagged words (x^(1), y^(1)) ... (x^(n), y^(n)), minimize the loss

    L(Θ) = −Σ_{i=1}^{n} log p(y^(i) | x^(i))
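A small sketch of the scoring and per-word loss; the dimensions, random parameters, the stand-in encoding z (which would come from the BiLSTM above), and the gold tag index are hypothetical.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

d, m = 16, 12                               # 2d = dim of z_w, m = number of tag types (assumed)
U, V = np.random.randn(d, 2 * d), np.random.randn(m, d)

def tag_scores(z_w):
    return V @ np.tanh(U @ z_w)             # h_w; [h_w]_N is the score of tag N for word w

z, y = np.random.randn(2 * d), 3            # stand-in sentence-aware encoding and gold tag
loss = -np.log(softmax(tag_scores(z))[y])   # this word's contribution to L(Theta)
```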

References

CNN tutorial: http://cs231n.github.io/convolutional-networks/
LSTM tutorial: http://colah.github.io/posts/2015-08-Understanding-LSTMs/