A Tutorial On Backward Propagation Through Time (BPTT) In The Gated Recurrent Unit (GRU) RNN

Minchen Li
Department of Computer Science
The University of British Columbia

Abstract

In this tutorial, we provide a thorough explanation of how BPTT in the GRU¹ is conducted. A MATLAB program which implements the entire BPTT for the GRU, together with pseudo-code describing the algorithms explicitly, will be presented. We provide two algorithms for BPTT: a direct but quadratic-time algorithm for easy understanding, and an optimized linear-time algorithm. This tutorial starts with a specification of the problem, followed by a mathematical derivation, before presenting the computational solutions.

1 Specification

We want to use a dataset containing n_s sentences, each with n_w words, to train a GRU language model, and our vocabulary size is n_v. Namely, we have input x ∈ R^{n_v × n_w × n_s} and label y ∈ R^{n_v × n_w × n_s}, both representing the n_s sentences. For simplicity, let's look at one sentence at a time. Within one sentence, the one-hot vector x_t ∈ R^{n_v × 1} represents the t-th word. For time step t, the GRU unit computes the output ŷ_t from the input x_t and the previous internal state s_{t-1} as follows:

    z_t = σ(U_z x_t + W_z s_{t-1} + b_z)
    r_t = σ(U_r x_t + W_r s_{t-1} + b_r)
    h_t = tanh(U_h x_t + W_h (s_{t-1} ∘ r_t) + b_h)
    s_t = (1 − z_t) ∘ h_t + z_t ∘ s_{t-1}
    ŷ_t = softmax(V s_t + b_V)                                            (1)

Here ∘ is the element-wise vector multiplication, σ(·) is the element-wise sigmoid function, and tanh(·) is the element-wise hyperbolic tangent function. The dimensions of the parameters are as follows:

    U_z, U_r, U_h ∈ R^{n_i × n_v}
    W_z, W_r, W_h ∈ R^{n_i × n_i}
    b_z, b_r, b_h ∈ R^{n_i × 1}
    V ∈ R^{n_v × n_i},  b_V ∈ R^{n_v × 1}

where n_i is the internal memory size set by the user.

¹ GRU is an improved version of the traditional RNN (Recurrent Neural Network); see WildML.com for an introduction. That site also provides an introduction to the GRU and some general discussion on BPTT and beyond.
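
To make Eq. (1) concrete, here is a minimal, self-contained MATLAB sketch of a single forward step. The sizes (n_v = 5, n_i = 3) and the random parameter values are made-up placeholders rather than part of the tutorial's setup; the full forward pass over a whole sentence is given in the implementation of Section 4.

    % One GRU forward step (Eq. 1) with made-up sizes and random parameters.
    n_v = 5; n_i = 3;
    U_z = rand(n_i, n_v); U_r = rand(n_i, n_v); U_h = rand(n_i, n_v);
    W_z = rand(n_i, n_i); W_r = rand(n_i, n_i); W_h = rand(n_i, n_i);
    b_z = rand(n_i, 1);   b_r = rand(n_i, 1);   b_h = rand(n_i, 1);
    V   = rand(n_v, n_i); b_V = rand(n_v, 1);

    x_t    = zeros(n_v, 1); x_t(2) = 1;        % one-hot input word
    s_prev = zeros(n_i, 1);                    % previous internal state s_{t-1}

    sigmoid = @(a) 1 ./ (1 + exp(-a));
    z_t = sigmoid(U_z*x_t + W_z*s_prev + b_z);        % update gate
    r_t = sigmoid(U_r*x_t + W_r*s_prev + b_r);        % reset gate
    h_t = tanh(U_h*x_t + W_h*(s_prev .* r_t) + b_h);  % candidate state
    s_t = (1 - z_t) .* h_t + z_t .* s_prev;           % new internal state
    a   = V*s_t + b_V;
    e   = exp(a - max(a));
    y_hat = e / sum(e);                               % softmax output for step t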

Then for step t, we can calculate the cross-entropy loss L_t as:

    L_t = −sum( y_t ∘ log(ŷ_t) )                                          (2)

where sum(·) adds up all elements of its vector argument and log is also an element-wise function. To train the GRU, we want to find the values of all parameters that minimize the total loss L = Σ_{t=1}^{n_w} L_t:

    argmin_Θ L

where Θ = {U_z, U_r, U_h, W_z, W_r, W_h, b_z, b_r, b_h, V, b_V}. This is a non-convex problem with huge input data, so people usually use Stochastic Gradient Descent² to solve it, which means we need to calculate ∂L/∂U_z, ∂L/∂U_r, ∂L/∂U_h, ∂L/∂W_z, ∂L/∂W_r, ∂L/∂W_h, ∂L/∂b_z, ∂L/∂b_r, ∂L/∂b_h, ∂L/∂V, and ∂L/∂b_V given a batch of sentences. (Note that the parameters stay the same across time steps.) In this tutorial we consider using only one sentence at a time to keep things concise.

2 Derivation

The best way to calculate gradients using the chain rule from output to input is to first draw the expression graph of the entire model, in order to figure out the relations between the output, the intermediate results, and the input³. Part of the expression graph of the GRU is drawn in Fig. 1.

Figure 1: The upper part of the expression graph describing the operations of the GRU. Note that the sub-graph which s_{t-1} depends on has the same structure as the sub-graph of s_t; this is what the red dashed lines indicate.

With this expression graph, the chain rule works if you go backwards along the edges (top-down). If a node X has multiple outgoing edges connecting it to the target node T, you need to sum the partial derivatives over each of those outgoing edges to obtain the gradient ∂T/∂X. We will illustrate these rules in the following paragraphs.

Let's take ∂L/∂U_z as the example here; the other gradients are similar. Since L = Σ_{t=1}^{n_w} L_t and the parameters stay the same in each step, we also have ∂L/∂U_z = Σ_{t=1}^{n_w} ∂L_t/∂U_z, so let's calculate each ∂L_t/∂U_z independently and sum them up.

² See Wikipedia for some background on Stochastic Gradient Descent.
³ See colah's blog and the Stanford CS231n course notes for some general introductions.
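
As a quick illustration of Eq. (2) and of the stochastic-gradient update that it feeds, here is a small self-contained MATLAB sketch. The labels y, the predictions yhat, the gradient dU_z, and the learning rate eta are random placeholders standing in for the quantities that the forward pass and BPTT would actually produce.

    % Per-step cross entropy (Eq. 2), total loss, and one SGD step on U_z.
    n_v = 5; n_w = 4; n_i = 3;                  % made-up sizes
    y    = eye(n_v, n_w);                       % one-hot labels (placeholder)
    yhat = rand(n_v, n_w);
    yhat = yhat ./ sum(yhat, 1);                % fake softmax outputs, columns sum to 1
    L_t  = -sum(y .* log(yhat), 1);             % cross-entropy loss of each step
    L    = sum(L_t);                            % total loss over the sentence

    U_z  = rand(n_i, n_v);
    dU_z = rand(n_i, n_v);                      % stand-in for dL/dU_z computed by BPTT
    eta  = 0.1;                                 % learning rate (assumed)
    U_z  = U_z - eta * dU_z;                    % one stochastic gradient descent step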

With the chain rule, we have:

    ∂L_t/∂U_z = (∂L_t/∂s_t)(∂s_t/∂U_z)                                    (3)

The first part is straightforward if you know how to differentiate the cross-entropy loss embedded with the softmax function:

    ∂L_t/∂s_t = V^T (ŷ_t − y_t)

For ∂s_t/∂U_z, similarly, some people might just derive (if they know how to differentiate the sigmoid function):

    ∂̄s_t/∂U_z = ((s_{t-1} − h_t) ∘ z_t ∘ (1 − z_t)) x_t^T                 (4)

Here there are two expressions, (1 − z_t) ∘ h_t and z_t ∘ s_{t-1}, influencing ∂s_t/∂z_t, as shown in our expression graph. The solution is to take the partial derivative through each edge and then add them up, which is exactly how we will deal with ∂s_t/∂s_{t-1} in the following paragraphs. However, Eq. 4 only calculates one part of the gradient, which is why we put a bar on top of the ∂; you will still find this quantity very useful in the following calculations. Note that s_{t-1} also depends on U_z, so we cannot treat it as a constant here. Moreover, this s_{t-1} will in turn introduce the influence of s_i for i = 1, ..., t−2. So, to be precise, we should expand Eq. 3 as:

    ∂L_t/∂U_z = (∂L_t/∂s_t) Σ_{i=1}^{t} (∂s_t/∂s_i)(∂̄s_i/∂U_z)
              = (∂L_t/∂s_t) Σ_{i=1}^{t} ( Π_{j=i}^{t-1} ∂s_{j+1}/∂s_j ) (∂̄s_i/∂U_z)      (5)

where ∂̄s_i/∂U_z is the gradient of s_i with respect to U_z while taking s_{i-1} as a constant, of which a similar example has been shown in Eq. 4 for step t.

The derivation of ∂s_t/∂s_{t-1} is similar to the derivation of ∂s_t/∂U_z discussed above. Since there are four outgoing edges from s_{t-1} to s_t, directly and indirectly through z_t, r_t, and h_t in the expression graph, we need to sum all four partial derivatives:

    ∂s_t/∂s_{t-1} = (∂̄s_t/∂h_t)(∂̄h_t/∂r_t)(∂r_t/∂s_{t-1}) + (∂̄s_t/∂h_t)(∂̄h_t/∂s_{t-1}) + (∂̄s_t/∂z_t)(∂z_t/∂s_{t-1}) + ∂̄s_t/∂s_{t-1}

where ∂̄s_t/∂s_{t-1} is the gradient of s_t with respect to s_{t-1} while taking h_t and z_t as constants. Similarly, ∂̄h_t/∂s_{t-1} is the gradient of h_t with respect to s_{t-1} while taking r_t as a constant. Plugging the intermediate results into the above formula, we get:

    ∂s_t/∂s_{t-1} = diag((1 − z_t) ∘ (1 − h_t ∘ h_t)) W_h [ diag(r_t) + diag(s_{t-1} ∘ r_t ∘ (1 − r_t)) W_r ]
                    + diag((s_{t-1} − h_t) ∘ z_t ∘ (1 − z_t)) W_z + diag(z_t)             (6)

where diag(v) denotes the diagonal matrix with the vector v on its diagonal. When the gradients ∂L_t/∂s_t are stored as column vectors, as in the MATLAB code of Section 4, they are propagated backwards by multiplying with the transpose of this matrix, i.e. ∂L_t/∂s_{t-1} = (∂s_t/∂s_{t-1})^T ∂L_t/∂s_t, which is where the W_h^T, W_r^T, and W_z^T factors in the code come from.

By now, we have covered all the components needed to calculate ∂L_t/∂U_z. The gradients of L_t with respect to the other parameters are similar. In the next chapter, we provide a more mechanical view of the calculation: the pseudo-code describing the algorithms used to calculate the gradients. In the last chapter of this tutorial, we provide the pure machine representation: a MATLAB program which implements the calculation and verification of BPTT. If you just want to understand the idea behind BPTT and plan to use a fully supported auto-differentiation package (such as Theano⁴) to build your own GRU, you can stop here. If you need to implement the exact chain rule as we do, or are just curious about what happens next, get ready to proceed!

⁴ Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
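
The following self-contained MATLAB sketch applies Eq. (6) once: given ∂L_t/∂s_t, it pushes the gradient back one step along the four paths (through h_t and r_t, through h_t with r_t fixed, through z_t, and directly) to obtain ∂L_t/∂s_{t-1}. All vectors and matrices are random placeholders of the right shapes; the implementation in Section 4 performs these same operations while also accumulating the parameter gradients.

    % One backward step through Eq. (6), with column-vector gradients.
    n_i = 3;                                   % internal memory size (made up)
    W_z = rand(n_i); W_r = rand(n_i); W_h = rand(n_i);
    z_t = rand(n_i, 1); r_t = rand(n_i, 1);    % gate values (placeholders in (0,1))
    h_t = tanh(randn(n_i, 1));                 % candidate state (placeholder)
    s_prev = randn(n_i, 1);                    % s_{t-1} (placeholder)
    dLds_t = randn(n_i, 1);                    % incoming gradient dL_t/ds_t

    d_h    = (1 - z_t) .* (1 - h_t.*h_t) .* dLds_t;                 % back through (1-z_t).*h_t and tanh
    via_s  = r_t .* (W_h' * d_h);                                   % h_t -> s_{t-1}, with r_t held fixed
    via_r  = W_r' * (s_prev .* r_t .* (1 - r_t) .* (W_h' * d_h));   % h_t -> r_t -> s_{t-1}
    via_z  = W_z' * ((s_prev - h_t) .* z_t .* (1 - z_t) .* dLds_t); % z_t -> s_{t-1}
    direct = z_t .* dLds_t;                                         % s_t -> s_{t-1} directly
    dLds_prev = via_s + via_r + via_z + direct;                     % dL_t/ds_{t-1}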

3 Algorithm

Here we again take ∂L/∂U_z as the example; the calculation of all the gradients will be provided in the next chapter. We present two algorithms: a direct algorithm which, as derived previously, calculates each ∂L_t/∂U_z and sums them up, taking O(n_w^2) time, and an optimized O(n_w)-time algorithm which we will see later.

Algorithm 1: A direct but O(n_w^2)-time algorithm to calculate ∂L/∂U_z (and beyond)

Input: The training data X, Y ∈ R^{n_v × n_w}, composed of the one-hot column vectors x_t, y_t ∈ R^{n_v × 1}, t = 1, 2, ..., n_w, representing the words in the sentence.
Input: A vector s_0 ∈ R^{n_i × 1} representing the initial internal state of the model (usually set to 0).
Input: The parameters Θ = {U_z, U_r, U_h, W_z, W_r, W_h, b_z, b_r, b_h, V, b_V} of the model.
Output: The total loss gradient ∂L/∂U_z.

1: % forward propagate to calculate the internal states S ∈ R^{n_i × n_w}, the predictions Ŷ ∈ R^{n_v × n_w}, the losses L_mtr ∈ R^{n_w × 1}, and the intermediate results Z, R, H ∈ R^{n_i × n_w} of each step:
2: [S, Ŷ, L_mtr, Z, R, H] = forward(X, Y, Θ, s_0)   % forward() can be implemented easily according to Eq. 1 and Eq. 2
3: dU_z = zeros(n_i, n_v)   % initialize a variable for ∂L/∂U_z
4: ∂L_mtr/∂S = V^T (Ŷ − Y)   % calculate ∂L_t/∂s_t for t = 1, 2, ..., n_w with one matrix operation
5: for t = 1 to n_w   % calculate each ∂L_t/∂U_z and accumulate
6:     for j = t down to 1   % calculate each (∂L_t/∂s_j)(∂̄s_j/∂U_z) and accumulate
7:         ∂L_t/∂z_j = ∂L_t/∂s_j ∘ (s_{j-1} − h_j)   % ∂s_j/∂z_j is (s_{j-1} − h_j); ∂L_t/∂s_j was calculated in the previous inner iteration or in Line 4
8:         ∂L_t/∂(U_z x_j + W_z s_{j-1} + b_z) = ∂L_t/∂z_j ∘ z_j ∘ (1 − z_j)   % ∂σ(x)/∂x = σ(x) ∘ (1 − σ(x))
9:         dU_z += (∂L_t/∂(U_z x_j + W_z s_{j-1} + b_z)) x_j^T   % accumulate
10:        calculate ∂L_t/∂s_{j-1} using ∂L_t/∂s_j and Eq. 6   % for the next inner iteration
11:    end for
12: end for
13: return dU_z   % ∂L/∂U_z

The direct algorithm above follows Eq. 5 to calculate each ∂L_t/∂U_z and then adds them up to form ∂L/∂U_z:

    ∂L/∂U_z = Σ_{t=1}^{n_w} ∂L_t/∂U_z
            = Σ_{t=1}^{n_w} (∂L_t/∂s_t) Σ_{i=1}^{t} (∂s_t/∂s_i)(∂̄s_i/∂U_z)
            = Σ_{t=1}^{n_w} (∂L_t/∂s_t) Σ_{i=1}^{t} ( Π_{j=i}^{t-1} ∂s_{j+1}/∂s_j ) (∂̄s_i/∂U_z)

If we just expand ∂L_t/∂U_z to the second line of the above equation and do some reordering, we get:

    ∂L/∂U_z = Σ_{t=1}^{n_w} Σ_{i=1}^{t} ( (∂L_t/∂s_t)(∂s_t/∂s_i) ) (∂̄s_i/∂U_z)
            = Σ_{t=1}^{n_w} Σ_{i=1}^{t} (∂L_t/∂s_i)(∂̄s_i/∂U_z)
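
The reordering that follows is nothing more than a change in the order of summation over the index pairs (t, i) with i ≤ t. The tiny MATLAB check below, in which a(t, i) is a random placeholder for the term (∂L_t/∂s_i)(∂̄s_i/∂U_z), confirms that summing row by row (over i for each t) and column by column (over t for each i) gives the same total.

    % Numeric sanity check of the summation reordering.
    n_w = 6;                                      % made-up sentence length
    a = tril(rand(n_w));                          % a(t,i) is nonzero only for i <= t
    rowwise = sum(sum(a, 2));                     % sum over t, then over i = 1..t
    colwise = sum(sum(a, 1));                     % sum over i, then over t = i..n_w
    assert(abs(rowwise - colwise) < 1e-12);       % both orders give the same total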

Right now the inner summation keeps the subscript of L_t fixed and iterates over s_i. If we further expand the inner summation and then re-sort the terms to iterate over L_i, we get:

    ∂L/∂U_z = Σ_{t=1}^{n_w} ( Σ_{i=t}^{n_w} ∂L_i/∂s_t ) (∂̄s_t/∂U_z)                      (7)

For the inner summation of Eq. 7, we have:

    Σ_{i=t}^{n_w} ∂L_i/∂s_t = ( Σ_{i=t+1}^{n_w} ∂L_i/∂s_{t+1} ) (∂s_{t+1}/∂s_t) + ∂L_t/∂s_t      (8)

This gives us an updating formula to calculate the inner summation for each step t incrementally, rather than running another for loop, which makes an O(n_w)-time algorithm possible!

Algorithm 2: An optimized O(n_w)-time algorithm to calculate ∂L/∂U_z (and beyond)

Input: The training data X, Y ∈ R^{n_v × n_w}, composed of the one-hot column vectors x_t, y_t ∈ R^{n_v × 1}, t = 1, 2, ..., n_w, representing the words in the sentence.
Input: A vector s_0 ∈ R^{n_i × 1} representing the initial internal state of the model (usually set to 0).
Input: The parameters Θ = {U_z, U_r, U_h, W_z, W_r, W_h, b_z, b_r, b_h, V, b_V} of the model.
Output: The total loss gradient ∂L/∂U_z.

1: % forward propagate to calculate the internal states S ∈ R^{n_i × n_w}, the predictions Ŷ ∈ R^{n_v × n_w}, the losses L_mtr ∈ R^{n_w × 1}, and the intermediate results Z, R, H ∈ R^{n_i × n_w} of each step:
2: [S, Ŷ, L_mtr, Z, R, H] = forward(X, Y, Θ, s_0)   % forward() can be implemented easily according to Eq. 1 and Eq. 2
3: dU_z = zeros(n_i, n_v)   % initialize a variable for ∂L/∂U_z
4: ∂L_mtr/∂S = V^T (Ŷ − Y)   % calculate ∂L_t/∂s_t for t = 1, 2, ..., n_w with one matrix operation
5: for t = n_w down to 1   % calculate each ( Σ_{i=t}^{n_w} ∂L_i/∂s_t )(∂̄s_t/∂U_z) and accumulate
6:     Σ_{i=t}^{n_w} ∂L_i/∂z_t = ( Σ_{i=t}^{n_w} ∂L_i/∂s_t ) ∘ (s_{t-1} − h_t)   % ∂s_t/∂z_t is (s_{t-1} − h_t); Σ_{i=t}^{n_w} ∂L_i/∂s_t was calculated in the previous iteration or in Line 4 (when t = n_w, Σ_{i=t}^{n_w} ∂L_i/∂s_t = ∂L_t/∂s_t)
7:     Σ_{i=t}^{n_w} ∂L_i/∂(U_z x_t + W_z s_{t-1} + b_z) = ( Σ_{i=t}^{n_w} ∂L_i/∂z_t ) ∘ z_t ∘ (1 − z_t)   % ∂σ(x)/∂x = σ(x) ∘ (1 − σ(x))
8:     dU_z += ( Σ_{i=t}^{n_w} ∂L_i/∂(U_z x_t + W_z s_{t-1} + b_z) ) x_t^T   % accumulate
9:     calculate Σ_{i=t-1}^{n_w} ∂L_i/∂s_{t-1} using Eq. 6 and Eq. 8   % for the next iteration
10: end for
11: return dU_z   % ∂L/∂U_z
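
Here is a small self-contained MATLAB sketch of the running sum of Eq. (8), using random placeholders: g(:,t) stands for ∂L_t/∂s_t, and J{t} stands for the Jacobian ∂s_{t+1}/∂s_t of Eq. (6). The transpose in the update reflects that the gradients are stored as column vectors, as in the implementation of Section 4.

    % Incremental accumulation of the inner sum of Eq. (7) via Eq. (8).
    n_i = 3; n_w = 5;                       % made-up sizes
    g = randn(n_i, n_w);                    % g(:,t) ~ dL_t/ds_t (placeholder)
    J = cell(1, n_w - 1);                   % J{t} ~ ds_{t+1}/ds_t (placeholder)
    for t = 1:n_w-1
        J{t} = randn(n_i);
    end

    delta = g(:, n_w);                      % at t = n_w the sum is just dL_{n_w}/ds_{n_w}
    for t = n_w-1:-1:1
        delta = g(:, t) + J{t}' * delta;    % Eq. (8): sum over i >= t of dL_i/ds_t
    end
    % delta now holds the inner sum for t = 1, obtained in O(n_w) steps overall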

4 Implementation

Here we provide the MATLAB program which calculates the gradients with respect to all the parameters of the GRU using our two proposed algorithms, and which checks these gradients against numerical results. The code is divided into two parts: the first part, presented below, contains the core functions implementing the BPTT of the GRU we just derived; the second part is composed of some functions that are less important to the topic of this tutorial. (Note that in the code the candidate-state parameters U_h, W_h, b_h are named U_c, W_c, b_c, and the candidate states h_t are stored in the variable c.)

Core Functions

% This program tests the BPTT process we manually developed for GRU.
% We calculate the gradients of GRU parameters with chain rule, and then
% compare them to the numerical gradients to check whether our chain rule
% derivation is correct.
%
% Here, we provided 2 versions of BPTT: backward_direct() and backward().
% The former one is the direct idea to calculate the gradient within each
% step and add them up (O(sentence_size^2) time). The latter one is
% optimized to calculate the contribution of each step to the overall
% gradient, which is only O(sentence_size) time.
%
% This is very helpful for people who want to implement GRU in Caffe since
% Caffe doesn't support auto-differentiation. This is also very helpful for
% people who want to know the details of the Backpropagation Through Time
% algorithm in Recurrent Neural Networks (such as GRU and LSTM) and also
% get a sense of how auto-differentiation is possible.
%
% NOTE: We didn't involve SGD training here. With SGD training, this
% program would become a complete implementation of GRU which can be
% trained with sequence data. However, since this is only a CPU serial
% Matlab version of GRU, applying it on large datasets will be dramatically
% slow.
%
% by Minchen Li, at The University of British Columbia

function testBPTT_GRU
    % set GRU and data scale
    vocabulary_size = 64;
    imem_size = 4;
    sentence_size = 20;     % number of words in a sentence
                            % (including start and end symbol)
                            % since we will only use one sentence for training,
                            % this is also the total steps during training.

    [x, y] = getTrainingData(vocabulary_size, sentence_size);

    % initialize parameters:
    % multiplier for input x_t of intermediate variables
    U_z = rand(imem_size, vocabulary_size);
    U_r = rand(imem_size, vocabulary_size);
    U_c = rand(imem_size, vocabulary_size);
    % multiplier for previous s of intermediate variables
    W_z = rand(imem_size, imem_size);
    W_r = rand(imem_size, imem_size);
    W_c = rand(imem_size, imem_size);
    % bias terms of intermediate variables
    b_z = rand(imem_size, 1);
    b_r = rand(imem_size, 1);
    b_c = rand(imem_size, 1);
    % decoder for generating output
    V = rand(vocabulary_size, imem_size);
    b_V = rand(vocabulary_size, 1);     % bias of decoder
    % previous s of step 1
    s_0 = rand(imem_size, 1);

    % calculate and check gradient
    tic
    [dV, db_V, dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, d_s0] = ...
        backward_direct(x, y, U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0);
    toc
    tic
    checkGradient_GRU(x, y, U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0, ...
        dV, db_V, dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, d_s0);
    toc

    tic
    [dV, db_V, dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, d_s0] = ...
        backward(x, y, U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0);
    toc
    tic
    checkGradient_GRU(x, y, U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0, ...
        dV, db_V, dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, d_s0);
    toc
end

% Forward propagate: calculate s, y_hat, loss and intermediate variables for each step
function [s, y_hat, L, z, r, c] = forward(x, y, U_z, U_r, U_c, W_z, W_r, W_c, ...
    b_z, b_r, b_c, V, b_V, s_0)
    % count sizes
    [vocabulary_size, sentence_size] = size(x);
    imem_size = size(V, 2);

    % initialize results
    s = zeros(imem_size, sentence_size);
    y_hat = zeros(vocabulary_size, sentence_size);
    L = zeros(sentence_size, 1);
    z = zeros(imem_size, sentence_size);
    r = zeros(imem_size, sentence_size);
    c = zeros(imem_size, sentence_size);

    % calculate result for step 1 since s_0 is not in s
    z(:,1) = sigmoid(U_z*x(:,1) + W_z*s_0 + b_z);
    r(:,1) = sigmoid(U_r*x(:,1) + W_r*s_0 + b_r);
    c(:,1) = tanh(U_c*x(:,1) + W_c*(s_0.*r(:,1)) + b_c);
    s(:,1) = (1-z(:,1)).*c(:,1) + z(:,1).*s_0;
    y_hat(:,1) = softmax(V*s(:,1) + b_V);
    L(1) = -sum(y(:,1).*log(y_hat(:,1)));
    % calculate results for step 2 - sentence_size similarly
    for wordI = 2:sentence_size
        z(:,wordI) = sigmoid(U_z*x(:,wordI) + W_z*s(:,wordI-1) + b_z);
        r(:,wordI) = sigmoid(U_r*x(:,wordI) + W_r*s(:,wordI-1) + b_r);
        c(:,wordI) = tanh(U_c*x(:,wordI) + W_c*(s(:,wordI-1).*r(:,wordI)) + b_c);
        s(:,wordI) = (1-z(:,wordI)).*c(:,wordI) + z(:,wordI).*s(:,wordI-1);
        y_hat(:,wordI) = softmax(V*s(:,wordI) + b_V);
        L(wordI) = -sum(y(:,wordI).*log(y_hat(:,wordI)));
    end
end

% Backward propagate to calculate gradient using chain rule
% (O(sentence_size) time)
function [dV, db_V, dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, d_s0] = ...
    backward(x, y, U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0)
    % forward propagate to get the intermediate and output results
    [s, y_hat, L, z, r, c] = forward(x, y, U_z, U_r, U_c, W_z, W_r, W_c, ...
        b_z, b_r, b_c, V, b_V, s_0);
    % count sentence size
    [~, sentence_size] = size(x);

    % calculate gradient using chain rule
    delta_y = y_hat - y;
    db_V = sum(delta_y, 2);

    dV = zeros(size(V));
    for wordI = 1:sentence_size
        dV = dV + delta_y(:,wordI)*s(:,wordI)';
    end

    d_s0 = zeros(size(s_0));
    dU_c = zeros(size(U_c));
    dU_r = zeros(size(U_r));
    dU_z = zeros(size(U_z));
    dW_c = zeros(size(W_c));
    dW_r = zeros(size(W_r));
    dW_z = zeros(size(W_z));
    db_z = zeros(size(b_z));
    db_r = zeros(size(b_r));
    db_c = zeros(size(b_c));
    ds_single = V'*delta_y;
    % calculate the derivative contribution of each step and add them up
    ds_cur = zeros(size(ds_single, 1), 1);
    for wordJ = sentence_size:-1:2
        ds_cur = ds_cur + ds_single(:,wordJ);
        ds_cur_bk = ds_cur;

        dtanhInput = ds_cur.*(1-z(:,wordJ)).*(1-c(:,wordJ).*c(:,wordJ));
        db_c = db_c + dtanhInput;
        dU_c = dU_c + dtanhInput*x(:,wordJ)';   % could be accelerated by avoiding adding 0
        dW_c = dW_c + dtanhInput*(s(:,wordJ-1).*r(:,wordJ))';
        dsr = W_c'*dtanhInput;
        ds_cur = dsr.*r(:,wordJ);
        dsigInput_r = dsr.*s(:,wordJ-1).*r(:,wordJ).*(1-r(:,wordJ));
        db_r = db_r + dsigInput_r;
        dU_r = dU_r + dsigInput_r*x(:,wordJ)';  % could be accelerated by avoiding adding 0
        dW_r = dW_r + dsigInput_r*s(:,wordJ-1)';
        ds_cur = ds_cur + W_r'*dsigInput_r;

        ds_cur = ds_cur + ds_cur_bk.*z(:,wordJ);
        dz = ds_cur_bk.*(s(:,wordJ-1)-c(:,wordJ));
        dsigInput_z = dz.*z(:,wordJ).*(1-z(:,wordJ));
        db_z = db_z + dsigInput_z;
        dU_z = dU_z + dsigInput_z*x(:,wordJ)';  % could be accelerated by avoiding adding 0
        dW_z = dW_z + dsigInput_z*s(:,wordJ-1)';
        ds_cur = ds_cur + W_z'*dsigInput_z;
    end

    % s_1
    ds_cur = ds_cur + ds_single(:,1);

    dtanhInput = ds_cur.*(1-z(:,1)).*(1-c(:,1).*c(:,1));
    db_c = db_c + dtanhInput;
    dU_c = dU_c + dtanhInput*x(:,1)';   % could be accelerated by avoiding adding 0
    dW_c = dW_c + dtanhInput*(s_0.*r(:,1))';
    dsr = W_c'*dtanhInput;
    d_s0 = d_s0 + dsr.*r(:,1);
    dsigInput_r = dsr.*s_0.*r(:,1).*(1-r(:,1));
    db_r = db_r + dsigInput_r;
    dU_r = dU_r + dsigInput_r*x(:,1)';  % could be accelerated by avoiding adding 0
    dW_r = dW_r + dsigInput_r*s_0';
    d_s0 = d_s0 + W_r'*dsigInput_r;

    d_s0 = d_s0 + ds_cur.*z(:,1);
    dz = ds_cur.*(s_0-c(:,1));
    dsigInput_z = dz.*z(:,1).*(1-z(:,1));
    db_z = db_z + dsigInput_z;
    dU_z = dU_z + dsigInput_z*x(:,1)';  % could be accelerated by avoiding adding 0
    dW_z = dW_z + dsigInput_z*s_0';
    d_s0 = d_s0 + W_z'*dsigInput_z;
end

% A more direct view of backward propagation to calculate the gradient using
% the chain rule (O(sentence_size^2) time).
% Instead of calculating how much derivative contribution each step has,
% here we calculate the gradient within every step.
function [dV, db_V, dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, d_s0] = ...
    backward_direct(x, y, U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0)
    % forward propagate to get the intermediate and output results
    [s, y_hat, L, z, r, c] = forward(x, y, U_z, U_r, U_c, W_z, W_r, W_c, ...
        b_z, b_r, b_c, V, b_V, s_0);
    % count sentence size
    [~, sentence_size] = size(x);

    % calculate gradient using chain rule
    delta_y = y_hat - y;
    db_V = sum(delta_y, 2);

    dV = zeros(size(V));
    for wordI = 1:sentence_size
        dV = dV + delta_y(:,wordI)*s(:,wordI)';
    end

    d_s0 = zeros(size(s_0));
    dU_c = zeros(size(U_c));
    dU_r = zeros(size(U_r));
    dU_z = zeros(size(U_z));
    dW_c = zeros(size(W_c));
    dW_r = zeros(size(W_r));
    dW_z = zeros(size(W_z));
    db_z = zeros(size(b_z));
    db_r = zeros(size(b_r));
    db_c = zeros(size(b_c));
    ds_single = V'*delta_y;
    % calculate the derivatives in each step and add them up
    for wordI = 1:sentence_size
        ds_cur = ds_single(:,wordI);
        % since in each step t the derivatives depend on s_0 - s_t,
        % we need to trace back from t to 0 each time
        for wordJ = wordI:-1:2
            ds_cur_bk = ds_cur;

            dtanhInput = ds_cur.*(1-z(:,wordJ)).*(1-c(:,wordJ).*c(:,wordJ));
            db_c = db_c + dtanhInput;
            dU_c = dU_c + dtanhInput*x(:,wordJ)';   % could be accelerated by avoiding adding 0
            dW_c = dW_c + dtanhInput*(s(:,wordJ-1).*r(:,wordJ))';
            dsr = W_c'*dtanhInput;
            ds_cur = dsr.*r(:,wordJ);
            dsigInput_r = dsr.*s(:,wordJ-1).*r(:,wordJ).*(1-r(:,wordJ));
            db_r = db_r + dsigInput_r;
            dU_r = dU_r + dsigInput_r*x(:,wordJ)';  % could be accelerated by avoiding adding 0
            dW_r = dW_r + dsigInput_r*s(:,wordJ-1)';
            ds_cur = ds_cur + W_r'*dsigInput_r;

            ds_cur = ds_cur + ds_cur_bk.*z(:,wordJ);
            dz = ds_cur_bk.*(s(:,wordJ-1)-c(:,wordJ));
            dsigInput_z = dz.*z(:,wordJ).*(1-z(:,wordJ));
            db_z = db_z + dsigInput_z;
            dU_z = dU_z + dsigInput_z*x(:,wordJ)';  % could be accelerated by avoiding adding 0
            dW_z = dW_z + dsigInput_z*s(:,wordJ-1)';
            ds_cur = ds_cur + W_z'*dsigInput_z;
        end

        % s_1
        dtanhInput = ds_cur.*(1-z(:,1)).*(1-c(:,1).*c(:,1));
        db_c = db_c + dtanhInput;
        dU_c = dU_c + dtanhInput*x(:,1)';   % could be accelerated by avoiding adding 0
        dW_c = dW_c + dtanhInput*(s_0.*r(:,1))';
        dsr = W_c'*dtanhInput;
        d_s0 = d_s0 + dsr.*r(:,1);
        dsigInput_r = dsr.*s_0.*r(:,1).*(1-r(:,1));
        db_r = db_r + dsigInput_r;
        dU_r = dU_r + dsigInput_r*x(:,1)';  % could be accelerated by avoiding adding 0
        dW_r = dW_r + dsigInput_r*s_0';
        d_s0 = d_s0 + W_r'*dsigInput_r;

        d_s0 = d_s0 + ds_cur.*z(:,1);
        dz = ds_cur.*(s_0-c(:,1));
        dsigInput_z = dz.*z(:,1).*(1-z(:,1));
        db_z = db_z + dsigInput_z;
        dU_z = dU_z + dsigInput_z*x(:,1)';  % could be accelerated by avoiding adding 0
        dW_z = dW_z + dsigInput_z*s_0';
        d_s0 = d_s0 + W_z'*dsigInput_z;
    end
end

% Sigmoid function for neural network
function val = sigmoid(x)
    val = sigmf(x, [1 0]);
end

testBPTT_GRU.m

Less Important Functions

% Fake a training data set: generate only one sentence for training.
% !!! Only for testing. Needs to be changed to read in training data from files.
function [x_t, y_t] = getTrainingData(vocabulary_size, sentence_size)
    assert(vocabulary_size > 2);    % for start and end of sentence symbol
    assert(sentence_size > 0);

    % define start and end of sentence in the vocabulary
    SENTENCE_START = zeros(vocabulary_size, 1);
    SENTENCE_START(1) = 1;
    SENTENCE_END = zeros(vocabulary_size, 1);
    SENTENCE_END(2) = 1;

    % generate sentence:
    x_t = zeros(vocabulary_size, sentence_size-1);  % leave one slot for SENTENCE_START
    for wordI = 1:sentence_size-1
        % generate a random word excluding start and end symbol
        x_t(randi(vocabulary_size-2, 1, 1)+2, wordI) = 1;
    end
    y_t = [x_t, SENTENCE_END];      % training output
    x_t = [SENTENCE_START, x_t];    % training input
end

% Use numerical differentiation to approximate the gradient of each
% parameter and calculate the difference between these numerical results
% and our results calculated by applying the chain rule.
function checkGradient_GRU(x, y, U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0, ...
    dV, db_V, dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, d_s0)
    % Here we use the centered difference formula:
    %   df(x)/dx = (f(x+h) - f(x-h)) / (2h)
    % It is a second order accurate method with error bounded by O(h^2).
    h = 1e-5;
    % NOTE: h couldn't be too large or too small since a large h will
    % introduce bigger truncation error and a small h will introduce bigger
    % round-off error.

    % parameters in the order of forward()'s argument list, the corresponding
    % analytic gradients computed by BPTT, and display labels:
    params   = {U_z, U_r, U_c, W_z, W_r, W_c, b_z, b_r, b_c, V, b_V, s_0};
    analytic = {dU_z, dU_r, dU_c, dW_z, dW_r, dW_c, db_z, db_r, db_c, dV, db_V, d_s0};
    names    = {'dU_z', 'dU_r', 'dU_c', 'dW_z', 'dW_r', 'dW_c', ...
                'db_z', 'db_r', 'db_c', 'dV', 'db_V', 'd_s0'};

    for pI = 1:numel(params)
        d_numerical = zeros(size(analytic{pI}));
        % calculate partial derivative element by element
        for eleI = 1:numel(d_numerical)
            args_plus = params;
            args_plus{pI}(eleI) = args_plus{pI}(eleI) + h;
            args_minus = params;
            args_minus{pI}(eleI) = args_minus{pI}(eleI) - h;
            [~, ~, L_plus] = forward(x, y, args_plus{:});
            [~, ~, L_minus] = forward(x, y, args_minus{:});
            d_numerical(eleI) = (sum(L_plus) - sum(L_minus)) / 2 / h;
        end
        % prevent dividing by 0 by adding h
        display(sum(sum(abs(d_numerical - analytic{pI}) ./ (abs(d_numerical) + h))), ...
            [names{pI} ' relative error']);
    end
end

testBPTT_GRU.m


More information

Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, Deep Lunch

Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, Deep Lunch Long Short- Term Memory (LSTM) M1 Yuichiro Sawai Computa;onal Linguis;cs Lab. January 15, 2015 @ Deep Lunch 1 Why LSTM? OJen used in many recent RNN- based systems Machine transla;on Program execu;on Can

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson

More information

CS224N: Natural Language Processing with Deep Learning Winter 2018 Midterm Exam

CS224N: Natural Language Processing with Deep Learning Winter 2018 Midterm Exam CS224N: Natural Language Processing with Deep Learning Winter 2018 Midterm Exam This examination consists of 17 printed sides, 5 questions, and 100 points. The exam accounts for 20% of your total grade.

More information

Error Functions & Linear Regression (1)

Error Functions & Linear Regression (1) Error Functions & Linear Regression (1) John Kelleher & Brian Mac Namee Machine Learning @ DIT Overview 1 Introduction Overview 2 Univariate Linear Regression Linear Regression Analytical Solution Gradient

More information

Deep Learning. Recurrent Neural Network (RNNs) Ali Ghodsi. October 23, Slides are partially based on Book in preparation, Deep Learning

Deep Learning. Recurrent Neural Network (RNNs) Ali Ghodsi. October 23, Slides are partially based on Book in preparation, Deep Learning Recurrent Neural Network (RNNs) University of Waterloo October 23, 2015 Slides are partially based on Book in preparation, by Bengio, Goodfellow, and Aaron Courville, 2015 Sequential data Recurrent neural

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks Philipp Koehn 4 April 205 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models can

More information

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation) Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

More information

Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

y(x n, w) t n 2. (1)

y(x n, w) t n 2. (1) Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,

More information

Computational Intelligence Winter Term 2017/18

Computational Intelligence Winter Term 2017/18 Computational Intelligence Winter Term 207/8 Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS ) Fakultät für Informatik TU Dortmund Plan for Today Single-Layer Perceptron Accelerated Learning

More information

Artificial Neuron (Perceptron)

Artificial Neuron (Perceptron) 9/6/208 Gradient Descent (GD) Hantao Zhang Deep Learning with Python Reading: https://en.wikipedia.org/wiki/gradient_descent Artificial Neuron (Perceptron) = w T = w 0 0 + + w 2 2 + + w d d where

More information

Sequence Modeling with Neural Networks

Sequence Modeling with Neural Networks Sequence Modeling with Neural Networks Harini Suresh y 0 y 1 y 2 s 0 s 1 s 2... x 0 x 1 x 2 hat is a sequence? This morning I took the dog for a walk. sentence medical signals speech waveform Successes

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline

More information

Neural Networks: Backpropagation

Neural Networks: Backpropagation Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Computational Intelligence

Computational Intelligence Plan for Today Single-Layer Perceptron Computational Intelligence Winter Term 00/ Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS ) Fakultät für Informatik TU Dortmund Accelerated Learning

More information

Error Backpropagation

Error Backpropagation Error Backpropagation Sargur Srihari 1 Topics in Error Backpropagation Terminology of backpropagation 1. Evaluation of Error function derivatives 2. Error Backpropagation algorithm 3. A simple example

More information

Convolutional Neural Networks

Convolutional Neural Networks Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

More information

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Neural Networks 2. 2 Receptive fields and dealing with image inputs CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Improved Learning through Augmenting the Loss

Improved Learning through Augmenting the Loss Improved Learning through Augmenting the Loss Hakan Inan inanh@stanford.edu Khashayar Khosravi khosravi@stanford.edu Abstract We present two improvements to the well-known Recurrent Neural Network Language

More information