Recurrent Neural Networks (RNNs) - Lecture 9: Networks for Sequential Data (RNNs & LSTMs)


DD2424, September 6, 2017

RNNs are a family of networks for processing sequential data.
- An RNN applies the same function recursively when traversing the network's graph structure.
- An RNN encodes a sequence x_1, x_2, ..., x_τ into a fixed-length hidden vector h_τ. The size of h_τ is independent of τ.
- Amazingly flexible and powerful high-level architecture.

RNN with no outputs

The graph displays the processing of information at each time step: information from the input x is incorporated into the state h (through a delay of one time step), and the state h is passed forward in time.

RNN: how the hidden states are generated

Most recurrent neural networks have a function f,

    h_t = f(h_{t-1}, x_t; θ)

that defines their hidden state over time, where
- h_t is the hidden state at time t (a vector),
- x_t is the input vector at time t,
- θ is the parameters of f.

θ remains constant as t changes: the same function f with the same parameter values is applied at each iteration. Unrolling the RNN in time gives the chain ... -> h_{t-1} -> h_t -> h_{t+1} -> ..., with f applied (and x_t fed in) at every step.
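To make the recurrence concrete, here is a minimal numpy sketch (illustrative code, not from the lecture; the function and variable names are my own). The particular f used in the example anticipates the vanilla tanh update introduced on the next slides.

```python
import numpy as np

def run_rnn(f, h0, xs, theta):
    """Apply the same transition function f with the same parameters theta
    at every time step: h_t = f(h_{t-1}, x_t, theta)."""
    h = h0
    hidden_states = []
    for x in xs:
        h = f(h, x, theta)
        hidden_states.append(h)
    return hidden_states            # h_1, ..., h_tau; h_tau summarizes the whole sequence

# Example with the vanilla tanh update as f (weights are random placeholders).
m, d, tau = 4, 3, 5
rng = np.random.default_rng(0)
theta = {"W": 0.1 * rng.normal(size=(m, m)),   # hidden-to-hidden
         "U": 0.1 * rng.normal(size=(m, d)),   # input-to-hidden
         "b": np.zeros((m, 1))}
f = lambda h, x, p: np.tanh(p["W"] @ h + p["U"] @ x + p["b"])
hs = run_rnn(f, np.zeros((m, 1)), [rng.normal(size=(d, 1)) for _ in range(tau)], theta)
print(len(hs), hs[-1].shape)        # 5 (4, 1)
```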

RNN with outputs

Usually we also predict an output vector o_t at each time step; in the unrolled visualization each hidden state h_t produces its own output o_t.

Use cases of RNNs: output generation, representation/prediction, encoding/decoding, continuous prediction.

Back to the vanilla RNN

The state consists of a single hidden vector h_t. The initial hidden state h_0 is assumed given. For t = 1, ..., τ the RNN equations are

    a_t = W h_{t-1} + U x_t + b
    h_t = tanh(a_t)
    o_t = V h_t + c
    p_t = softmax(o_t)

Network's input:
- h_0: initial hidden state, of size m x 1
- x_t: input vector at time t, of size d x 1

Vanilla RNN equations

The state consists of a single hidden vector h_t, the initial hidden state h_0 is assumed given, and for t = 1, ..., τ the equations above apply.

Parameters of the network:
- W: weight matrix of size m x m applied to h_{t-1} (hidden-to-hidden connection)
- U: weight matrix of size m x d applied to x_t (input-to-hidden connection)
- V: weight matrix of size C x m applied to h_t (hidden-to-output connection)
- b: bias vector of size m x 1 in the equation for a_t
- c: bias vector of size C x 1 in the equation for o_t

Network's output and hidden vectors:
- a_t: hidden state at time t before the non-linearity, of size m x 1
- h_t: hidden state at time t, of size m x 1
- o_t: output vector (unnormalized log probabilities for each class) at time t, of size C x 1
- p_t: output probability vector at time t, of size C x 1

Character-level language model example
Vocabulary: [h, e, l, o]. Example training sequence: "hello".
(Slides adapted from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.)
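A minimal numpy sketch of this forward pass for one sequence (illustrative code, not from the lecture; the matrix names and shapes follow the slide notation, everything else is an assumption of the sketch):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def vanilla_rnn_forward(X, h0, W, U, V, b, c):
    """Forward pass of the vanilla RNN for one sequence.
    X: list of tau input vectors of shape (d, 1); h0: (m, 1).
    W: (m, m), U: (m, d), V: (C, m), b: (m, 1), c: (C, 1)."""
    h = h0
    A, H, O, P = [], [], [], []
    for x in X:
        a = W @ h + U @ x + b          # a_t = W h_{t-1} + U x_t + b
        h = np.tanh(a)                 # h_t = tanh(a_t)
        o = V @ h + c                  # o_t = V h_t + c
        p = softmax(o)                 # p_t = softmax(o_t)
        A.append(a); H.append(h); O.append(o); P.append(p)
    return A, H, O, P

# Tiny smoke test with m=5 hidden units, d=4 input dims, C=4 classes, tau=3 steps.
rng = np.random.default_rng(1)
m, d, C, tau = 5, 4, 4, 3
W, U, V = 0.1 * rng.normal(size=(m, m)), 0.1 * rng.normal(size=(m, d)), 0.1 * rng.normal(size=(C, m))
b, c = np.zeros((m, 1)), np.zeros((C, 1))
X = [rng.normal(size=(d, 1)) for _ in range(tau)]
_, _, _, P = vanilla_rnn_forward(X, np.zeros((m, 1)), W, U, V, b, c)
print(P[-1].ravel(), P[-1].sum())      # a probability vector summing to 1
```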

Character-level language model example (continued)

Vocabulary: [h, e, l, o]. Example training sequence: "hello". At each step the network receives the current character, and its output p_t is a distribution over which character comes next.

Extend this simple approach to the full alphabet and punctuation characters.
(Figures from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.)
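Below is a small illustrative numpy sketch of how such a character-level model is sampled from once trained (the weights here are random placeholders, so the output is gibberish; helper names such as sample_chars are my own):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
C = len(vocab)

def one_hot(i, n):
    v = np.zeros((n, 1)); v[i] = 1.0
    return v

def sample_chars(seed_char, n_steps, W, U, V, b, c, rng):
    """Feed the seed character, then repeatedly sample the next character
    from p_t and feed it back in as the next input."""
    h = np.zeros((W.shape[0], 1))
    x = one_hot(char_to_ix[seed_char], C)
    out = [seed_char]
    for _ in range(n_steps):
        h = np.tanh(W @ h + U @ x + b)
        o = V @ h + c
        p = np.exp(o - o.max()); p /= p.sum()
        ix = rng.choice(C, p=p.ravel())
        out.append(vocab[ix])
        x = one_hot(ix, C)          # sampled character becomes the next input
    return ''.join(out)

rng = np.random.default_rng(2)
m = 8                               # placeholder hidden size
W, U, V = 0.1 * rng.normal(size=(m, m)), 0.1 * rng.normal(size=(m, C)), 0.1 * rng.normal(size=(C, m))
print(sample_chars('h', 10, W, U, V, np.zeros((m, 1)), np.zeros((C, 1)), rng))
```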

(Figure: text generated by the character-level model at different stages of training - "at first" the samples are noise, and after training more and more they look increasingly like English.)

How do we train a vanilla RNN?

Supervised learning via a loss function & mini-batch gradient descent.

Loss defined for one training sequence:
- Have a sequence x_1, x_2, ..., x_τ of input vectors.
- For each x_t in the sequence have a target output y_t.
- Define a loss l_t between the target y_t and the prediction p_t for each t.
- Sum the loss over all time steps:

    L(x_1, ..., x_τ, y_1, ..., y_τ, W, U, V, b, c) = Σ_{t=1}^{τ} l_t
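As a tiny illustration (not lecture code), the summed sequence loss for a list of predictions, using the cross-entropy form of l_t introduced on the next slide:

```python
import numpy as np

def sequence_loss(P, Y):
    """Sum of the per-step losses l_t over one labelled sequence.
    P: list of predicted probability vectors p_t of shape (C, 1);
    Y: list of one-hot target vectors y_t of shape (C, 1).
    Uses the cross-entropy loss l_t = -log(y_t^T p_t)."""
    return sum(-np.log((y.T @ p).item()) for p, y in zip(P, Y))

# Example: three time steps, three classes, targets 0, 2, 1.
P = [np.array([[0.7], [0.2], [0.1]]),
     np.array([[0.1], [0.3], [0.6]]),
     np.array([[0.2], [0.5], [0.3]])]
Y = [np.eye(3)[:, [i]] for i in (0, 2, 1)]
print(sequence_loss(P, Y))   # -log(0.7) - log(0.6) - log(0.5) ≈ 1.56
```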

Loss function for a vanilla RNN

It is common to use the cross-entropy loss

    l_t = -log(y_t^T p_t)

and thus

    L(x_{1:τ}, y_{1:τ}, W, U, V, b, c) = Σ_{t=1}^{τ} -log(y_t^T p_t)

where x_{1:τ} = {x_1, ..., x_τ} and y_{1:τ} = {y_1, ..., y_τ}.

To implement mini-batch gradient descent we need to compute the gradients of L(x_{1:τ}, y_{1:τ}, W, U, V, b, c) w.r.t. W, U, V, b and c. You've guessed it: use back-prop.

(Figure: computational graph for the vanilla RNN loss on one labelled training sequence x_1, ..., x_τ. The chain h_0 -> a_1 -> h_1 -> ... -> a_τ -> h_τ takes in x_1, ..., x_τ; each h_t feeds o_t -> p_t, which is compared to y_t, and the per-step losses are summed into L. Bias vectors have been omitted for clarity.)

Back-prop for a vanilla RNN

Gradient of the loss for the cross-entropy & softmax layers: we know from prior dealings with the cross-entropy loss that, for t = 1, ..., τ,

    ∂l_t/∂p_t = -y_t^T / (y_t^T p_t)
    ∂p_t/∂o_t = diag(p_t) - p_t p_t^T

which combine (for a one-hot y_t) to ∂l_t/∂o_t = -(y_t - p_t)^T.
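A quick numerical sanity check of that combined gradient (illustrative code, not from the lecture): compare the analytic ∂l_t/∂o_t = (p_t - y_t)^T against central finite differences.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

# Analytic gradient: dl_t/do_t = -(y_t - p_t)^T, obtained by combining
# dl_t/dp_t = -y_t^T / (y_t^T p_t) with dp_t/do_t = diag(p_t) - p_t p_t^T.
rng = np.random.default_rng(3)
C = 4
o = rng.normal(size=(C, 1))
y = np.eye(C)[:, [2]]                        # one-hot target
p = softmax(o)
analytic = (p - y).T                         # shape (1, C)

# Numerical check by central finite differences.
eps = 1e-6
numeric = np.zeros((1, C))
for i in range(C):
    op, om = o.copy(), o.copy()
    op[i] += eps; om[i] -= eps
    lp = -np.log((y.T @ softmax(op)).item())
    lm = -np.log((y.T @ softmax(om)).item())
    numeric[0, i] = (lp - lm) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # ~1e-10: the two gradients agree
```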

Gradient of the loss w.r.t. V

The children of the node V in the computational graph are o_1, o_2, ..., o_τ. Thus

    ∂L/∂vec(V) = Σ_{t=1}^{τ} (∂L/∂o_t) (∂o_t/∂vec(V))

We know o_t = V h_t + c, so ∂o_t/∂vec(V) = I_C ⊗ h_t^T. From prior reshapings we then get

    ∂L/∂V = Σ_{t=1}^{τ} g_t^T h_t^T,  where g_t = ∂L/∂o_t

This is one of the gradients needed for training the network.

Gradient of the loss w.r.t. h_τ

The last hidden state h_τ has only one child, o_τ, thus

    ∂L/∂h_τ = (∂L/∂o_τ) (∂o_τ/∂h_τ)

We know o_τ = V h_τ + c, so ∂o_τ/∂h_τ = V and therefore

    ∂L/∂h_τ = (∂L/∂o_τ) V

Gradient of the loss w.r.t. h_t

If 1 <= t <= τ-1 then h_t has two children, o_t and a_{t+1}:

    ∂L/∂h_t = (∂L/∂o_t) (∂o_t/∂h_t) + (∂L/∂a_{t+1}) (∂a_{t+1}/∂h_t)

We know o_t = V h_t + c, so ∂o_t/∂h_t = V, and a_{t+1} = W h_t + U x_{t+1} + b, so ∂a_{t+1}/∂h_t = W. Thus

    ∂L/∂h_t = (∂L/∂o_t) V + (∂L/∂a_{t+1}) W

Gradient of the loss w.r.t. a_t

The gradient w.r.t. a_t is

    ∂L/∂a_t = (∂L/∂h_t) (∂h_t/∂a_t)

We know h_t = tanh(a_t), so ∂h_t/∂a_t = diag(tanh'(a_t)) = diag(1 - tanh^2(a_t)). Thus

    ∂L/∂a_t = (∂L/∂h_t) diag(1 - tanh^2(a_t))

The expression for ∂L/∂h_t involves two different time steps (it depends on ∂L/∂a_{t+1}), so we must iterate backwards in time to compute all the ∂L/∂h_t.

Recursively compute the gradients for all a_t and h_t

Assume ∂L/∂o_t has been calculated for 1 <= t <= τ.
- Calculate ∂L/∂h_τ = (∂L/∂o_τ) V and ∂L/∂a_τ = (∂L/∂h_τ) diag(1 - tanh^2(a_τ)).
- For t = τ-1, τ-2, ..., 1:
  - Compute ∂L/∂h_t = (∂L/∂o_t) V + (∂L/∂a_{t+1}) W
  - Compute ∂L/∂a_t = (∂L/∂h_t) diag(1 - tanh^2(a_t))

Gradient of the loss w.r.t. W

The children of the node W are a_1, ..., a_τ, thus

    ∂L/∂vec(W) = Σ_{t=1}^{τ} (∂L/∂a_t) (∂a_t/∂vec(W))

We know a_t = W h_{t-1} + U x_t + b, so ∂a_t/∂vec(W) = I_m ⊗ h_{t-1}^T. From prior reshapings we get

    ∂L/∂W = Σ_{t=1}^{τ} g_t^T h_{t-1}^T,  where g_t = ∂L/∂a_t
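Putting the whole recursion together, here is a compact numpy sketch of the forward pass plus back-propagation through time for one sequence (illustrative code following the slide equations, not the lecture's own implementation; it also accumulates the bias gradients, which the slides omit for clarity):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward_backward(X, Y, h0, W, U, V, b, c):
    """Forward pass plus back-propagation through time for one labelled sequence,
    following the recursions derived on these slides."""
    tau = len(X)
    # Forward pass.
    hs, as_, ps = [h0], [], []
    for x in X:
        a = W @ hs[-1] + U @ x + b
        h = np.tanh(a)
        p = softmax(V @ h + c)
        as_.append(a); hs.append(h); ps.append(p)
    L = sum(-np.log((y.T @ p).item()) for p, y in zip(ps, Y))

    # Backward pass (BPTT).
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dL_da_next = np.zeros((1, W.shape[0]))           # dL/da_{t+1}, zero beyond t = tau
    for t in reversed(range(tau)):
        g_o = (ps[t] - Y[t]).T                        # dL/do_t = (p_t - y_t)^T
        dV += g_o.T @ hs[t + 1].T                     # dL/dV  += g_t^T h_t^T
        dc += g_o.T
        dL_dh = g_o @ V + dL_da_next @ W              # dL/dh_t = dL/do_t V + dL/da_{t+1} W
        dL_da = dL_dh * (1 - np.tanh(as_[t]).T**2)    # dL/da_t = dL/dh_t diag(1 - tanh^2(a_t))
        dW += dL_da.T @ hs[t].T                       # dL/dW  += g_t^T h_{t-1}^T
        dU += dL_da.T @ X[t].T                        # dL/dU  += g_t^T x_t^T
        db += dL_da.T
        dL_da_next = dL_da
    return L, dict(W=dW, U=dU, V=dV, b=db, c=dc)

# Tiny run: m=3 hidden units, d=2 inputs, C=2 classes, tau=4 steps, random data.
rng = np.random.default_rng(4)
m, d, C, tau = 3, 2, 2, 4
W, U, V = 0.2 * rng.normal(size=(m, m)), 0.2 * rng.normal(size=(m, d)), 0.2 * rng.normal(size=(C, m))
b, c = np.zeros((m, 1)), np.zeros((C, 1))
X = [rng.normal(size=(d, 1)) for _ in range(tau)]
Y = [np.eye(C)[:, [rng.integers(C)]] for _ in range(tau)]
L, grads = forward_backward(X, Y, np.zeros((m, 1)), W, U, V, b, c)
print(L, grads["W"].shape, grads["U"].shape, grads["V"].shape)
```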

Gradient of the loss w.r.t. U

The gradient of the loss w.r.t. the node U: the children of U are a_1, ..., a_τ, thus

    ∂L/∂vec(U) = Σ_{t=1}^{τ} (∂L/∂a_t) (∂a_t/∂vec(U))

We know a_t = W h_{t-1} + U x_t + b, so ∂a_t/∂vec(U) = I_m ⊗ x_t^T. From prior reshapings we get

    ∂L/∂U = Σ_{t=1}^{τ} g_t^T x_t^T,  where g_t = ∂L/∂a_t

This is another gradient needed for training the network.

RNNs in translation applications

Language translation

Given a sentence in one language, translate it to another language.

RNN-based sentence representation (encoder): the source sentence, e.g. "le chien sur la plage", is read in word by word and summarized into a context vector.

RNN-based sentence generation (decoder): conditioned on that context, the decoder emits the target sentence one word at a time, starting from a START token and predicting P(next word is "dog"), P(next word is "on"), P(next word is "the"), P(next word is "beach"), ..., while carrying long-term information forward.

Encoder-Decoder architecture [Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014].

Image Captioning

Combine a Convolutional Neural Network (to encode the test image) with a Recurrent Neural Network (to generate the caption).

- Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
- Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
- Show and Tell: A Neural Image Caption Generator, Vinyals et al.
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
- Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

(Figures from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.)

Conditioning the RNN on the test image: the CNN's feature vector v for the image is injected into the recurrent update, and generation starts from a <START> token.

    before: h = tanh(W_xh x + W_hh h)
    now:    h = tanh(W_xh x + W_hh h + W_ih v)

At the first step the output distribution y_0 is computed and a word is sampled, e.g. "straw".

Caption generation then proceeds step by step on the test image: the sampled word ("straw") is fed back in as the next input, the next word ("hat") is sampled from y_1, and so on. Sampling the <END> token finishes the caption.

Evaluation of the translation results

- Tricky to do automatically! Ideally we want humans to evaluate, but what do you ask?
- Can't use human evaluation for validating models - too slow and expensive.
- Use standard machine translation metrics instead: BLEU, ROUGE, CIDEr, METEOR.

Image-sentence datasets: Microsoft COCO [Tsung-Yi Lin et al. 2014], mscoco.org - currently ~120K images with ~5 sentences each.

The problem of exploding and vanishing gradients in an RNN

Focus on the gradient of the loss w.r.t. W. Take a closer look at

    ∂L/∂W = Σ_{t=1}^{τ} g_t^T h_{t-1}^T,  where g_t = ∂L/∂a_t

so ∂L/∂W depends on ∂L/∂a_t for t = 1, ..., τ. Let's take a closer look at ∂L/∂a_t.

Denote g_{o,t} = ∂L/∂o_t and D(a_t) = diag(1 - tanh^2(a_t)). Remember the recursion

    ∂L/∂h_τ = g_{o,τ} V,  ∂L/∂a_τ = g_{o,τ} V D(a_τ)
    for t = τ-1, τ-2, ..., 1:
        ∂L/∂h_t = g_{o,t} V + (∂L/∂a_{t+1}) W
        ∂L/∂a_t = (∂L/∂h_t) D(a_t)

Then you can show by recursive substitution that

    ∂L/∂h_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=1}^{j-t} D(a_{t+k}) ) W^{j-t}
    ∂L/∂a_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=0}^{j-t} D(a_{t+k}) ) W^{j-t}

Focus on Π_k D(a_{t+k}): this product likely has small values on its diagonal. Why? Each matrix D(a_{t+k}) has tanh'(a_{t+k}) on its diagonal and 0 <= tanh'(a) <= 1, so (tanh'(a))^{j-t+1} is highly likely to be small even for a not-too-large j-t+1. As a result, ∂L/∂h_t effectively only depends on the first few terms of the sum.

Focus on W^n

The factor W^{j-t} potentially has very large or very small values. Why? Remember that W has size m x m. Assume W is diagonalizable and let its eigen-decomposition be W = Q Λ Q^T, where Q is orthogonal and Λ is a diagonal matrix containing the eigenvalues of W. Then

    W^n = Q Λ^n Q^T

Let λ_1, ..., λ_m be the eigenvalues of W. Then
- if |λ_i| > 1, λ_i^n will explode as n increases;
- if |λ_i| < 1, λ_i^n -> 0 as n increases.

Thus for sufficiently large j-t, the entries of W^{j-t} can either explode or vanish.

Summary on ∂L/∂W

With g_{o,t} = ∂L/∂o_t and D(a_t) = diag(1 - tanh^2(a_t)),

    ∂L/∂h_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=1}^{j-t} D(a_{t+k}) ) W^{j-t}
    ∂L/∂a_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=0}^{j-t} D(a_{t+k}) ) W^{j-t}

- If W^{j-t} explodes for j-t > N, then ∂L/∂a_t explodes and therefore ∂L/∂W explodes.
- If W^{j-t} vanishes for j-t > N, then ∂L/∂a_t only has contributions from the nearby g_{o,t'} with t <= t' <= t+N, so ∂L/∂W is based on an aggregation of gradients from subsets of temporally nearby states.
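A small numerical illustration of this effect (my own example, not from the slides): build W = Q Λ Q^T with a chosen largest eigenvalue and watch the norm of W^n explode or vanish.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 4
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))        # random orthogonal Q

for eigmax, label in [(1.3, "largest |eigenvalue| > 1"),
                      (0.7, "largest |eigenvalue| < 1")]:
    Lam = np.diag(np.linspace(0.3, eigmax, m))       # eigenvalues up to eigmax
    W = Q @ Lam @ Q.T                                # W = Q Lambda Q^T
    for n in (1, 10, 50):
        Wn = np.linalg.matrix_power(W, n)            # W^n = Q Lambda^n Q^T
        print(f"{label}: ||W^{n}|| = {np.linalg.norm(Wn):.3e}")
# With eigmax = 1.3 the norm grows like 1.3^n (explodes);
# with eigmax = 0.7 it shrinks like 0.7^n (vanishes).
```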

The consequence: the network cannot learn long-range dependencies between states.

Gradient clipping - an easy solution to exploding gradients

Let G = ∂L/∂W. Then clip the gradient by its norm:

    G <- (θ / ||G||) G   if ||G|| >= θ
    G <- G               otherwise

where θ is some sensible threshold. This simple heuristic was first introduced by Thomas Mikolov: rescale the gradient to a fixed size whenever its norm exceeds the threshold. (Pascanu et al.'s gradient-explosion visualization shows why it helps: without clipping, one update that hits the high-error "wall" of the error surface throws the parameters off to a far-away region, while the rescaled, clipped step lands somewhere close.)

Solutions to exploding & vanishing gradients

Easy partial solutions to vanishing gradients:
- Solution 1: Initialize W as the identity matrix, as opposed to a random initialization.
- Solution 2: Use ReLU instead of tanh as the non-linear activation function.
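A one-function numpy sketch of this norm-clipping rule (illustrative, not the lecture's code):

```python
import numpy as np

def clip_gradient(G, theta):
    """Norm clipping: rescale G to have norm theta whenever ||G|| exceeds theta,
    otherwise leave it untouched."""
    norm = np.linalg.norm(G)
    if norm >= theta:
        return (theta / norm) * G
    return G

G = np.array([[3.0, 4.0]])           # ||G|| = 5
print(clip_gradient(G, theta=1.0))   # rescaled to norm 1: [[0.6, 0.8]]
print(clip_gradient(G, theta=10.0))  # below the threshold: returned unchanged
```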

Long Short-Term Memories (LSTMs) - capturing long-range dependencies

Even with these fixes it is still hard for an RNN to capture long-term dependencies.

LSTMs' core idea: introduce a memory cell.

LSTMs are similar to RNNs, but alongside the hidden states h_t they introduce a memory cell state c_t (in the high-level graphic, a second chain ... -> c_{t-1} -> c_t -> c_{t+1} -> ... runs in parallel with the hidden states). LSTMs have the ability to remove or add information to c_t, regulated by structures called gates, based on context. The update of c_t is designed so that gradients flow through these nodes backward in time easily. c_t then controls what information from c_{t-1} should be used to generate h_t.

LSTMs - formal details

LSTMs (Hochreiter & Schmidhuber, 1997) are better at capturing long-term dependencies. They introduce gates to calculate h_t and c_t from c_{t-1}, h_{t-1} and x_t.

Formal description of an LSTM unit:

    i_t = σ(W^(i) x_t + U^(i) h_{t-1})        (input gate)
    f_t = σ(W^(f) x_t + U^(f) h_{t-1})        (forget gate)
    o_t = σ(W^(o) x_t + U^(o) h_{t-1})        (output/exposure gate)
    c̃_t = tanh(W^(c) x_t + U^(c) h_{t-1})     (new temporary memory cell)
    c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t           (final memory cell)
    h_t = o_t ∘ tanh(c_t)

where σ(·) is the sigmoid function and ∘ denotes element-by-element multiplication.

LSTM's basic unit - the role of each part:
- New temporary memory c̃_t: use x_t and h_{t-1} to generate a new memory that includes aspects of the input x_t.
- Input gate i_t: use x_t and h_{t-1} to determine whether the temporary memory c̃_t is worth preserving.
- Forget gate f_t: assess whether the past memory cell c_{t-1} should be included in c_t.
- Updated memory state: use the forget and input gates to combine the new temporary memory and the current memory cell state to get c_t.
- Output gate o_t: decides which part of c_t should be exposed in h_t.
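A minimal numpy sketch of one LSTM step implementing exactly these equations (illustrative code; parameter names such as Wi/Ui are my own shorthand for W^(i), U^(i), and biases are omitted as on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: compute h_t, c_t from x_t, h_{t-1}, c_{t-1}.
    params holds the gate matrices Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc."""
    i = sigmoid(params["Wi"] @ x + params["Ui"] @ h_prev)        # input gate
    f = sigmoid(params["Wf"] @ x + params["Uf"] @ h_prev)        # forget gate
    o = sigmoid(params["Wo"] @ x + params["Uo"] @ h_prev)        # output gate
    c_tilde = np.tanh(params["Wc"] @ x + params["Uc"] @ h_prev)  # new temporary memory
    c = f * c_prev + i * c_tilde                                 # final memory cell
    h = o * np.tanh(c)                                           # exposed hidden state
    return h, c

# Tiny example: hidden/cell size m=3, input size d=2, random placeholder weights.
rng = np.random.default_rng(6)
m, d = 3, 2
params = {k: 0.3 * rng.normal(size=(m, d) if k.startswith("W") else (m, m))
          for k in ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc")}
h, c = np.zeros((m, 1)), np.zeros((m, 1))
for t in range(4):
    h, c = lstm_step(rng.normal(size=(d, 1)), h, c, params)
print(h.ravel(), c.ravel())
```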

Can go deep with LSTMs.

Deep LSTM network: stack several LSTM layers so that the hidden-state sequence of one layer becomes the input sequence of the next, before producing the outputs y^(1), ..., y^(τ) and losses L^(1), ..., L^(τ).

Bi-directional LSTM network: run one LSTM over the sequence forwards and another backwards, and combine the two hidden states at each time step before producing the output y^(t) and loss L^(t).

Summary

- RNNs allow a lot of flexibility in architecture design.
- The backward flow of gradients in an RNN can explode or vanish.
- Vanilla RNNs are simple but find it hard to learn long-term dependencies.
- Exploding gradients are controlled with gradient clipping.
- Vanishing gradients are controlled with additive interactions (LSTM).
- It is common to use LSTMs: their additive interactions improve gradient flow.
