RECURRENT NETWORKS I
Philipp Krähenbühl
RECAP: CLASSIFICATION
[figure: feed-forward network, conv 1 → conv 2 → conv 3 → conv 4, producing class scores]
RECAP: SEGMENTATION
[figure: fully convolutional network, conv 1 → conv 2 → conv 3 → conv 4]
RECAP: DETECTION
[figure: conv 1 → conv 2 → conv 3 → conv 4, producing detections]
RECAP: GENERATION
[figure: noise → conv 1 → conv 2 → conv 3 → conv 4 → image]
FEED FORWARD NETWORKS
[figure: conv 1 → conv 2 → conv 3 → conv 4, evaluated independently at time steps 1, 2, …, t]
Fixed order of computation: lower to upper layers
Once we have the result, all activations are discarded
WOULD YOU USE THIS TO DRIVE A CAR?
[figure: conv 1 → conv 2 → conv 3 → per-frame prediction]
Independent decision for each frame
No state or memory
Probably not in the real world; for SuperTuxKart it might still be OK
HOW DO WE KEEP A STATE AROUND?
[figure: separate conv 1 → conv 2 → conv 3 stacks, one per frame, with no connection between them]
RECURRENT NEURAL NETWORK (RNN)
State update: h_t = f_h(x_t, h_{t-1}, θ_h)
Output: y_t = f_y(x_t, h_t, θ_y)
[figure: x → h → y, with a recurrent connection from h back to itself]
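A minimal sketch of this recurrence as a Python loop; the functions f_h and f_y stand in for the state-update and output functions (their parameters θ are assumed to live inside them):

```python
# Illustrative sketch: run the recurrence h_t = f_h(x_t, h_{t-1}), y_t = f_y(x_t, h_t)
def run_rnn(xs, h0, f_h, f_y):
    h = h0
    ys = []
    for x in xs:              # one iteration per time step t
        h = f_h(x, h)         # state update, same parameters at every step
        ys.append(f_y(x, h))  # output at time t
    return ys, h              # the final state can seed the next sequence chunk
```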
ELMAN NETWORKS
State update: h_t = f_h(x_t, h_{t-1}, θ_h) = σ(U_h h_{t-1} + W_h x_t + b_h)
Output: y_t = f_y(x_t, h_t, θ_y) = σ(W_y h_t + b_y)
[figure: x → sigmoid h → sigmoid y, with a recurrent connection on h]
JORDAN NETWORKS
State update: h_t = f_h(x_t, y_{t-1}, θ_h) = σ(U_h y_{t-1} + W_h x_t + b_h)
Output: y_t = f_y(x_t, h_t, θ_y) = σ(W_y h_t + b_y)
[figure: x → sigmoid h → sigmoid y, with a recurrent connection from y back to h]
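A toy Elman step following the equations above (sizes and weight names are illustrative, not from the slides):

```python
import torch

x_dim, h_dim, y_dim = 4, 8, 2
W_h, U_h, b_h = torch.randn(h_dim, x_dim) * 0.1, torch.randn(h_dim, h_dim) * 0.1, torch.zeros(h_dim)
W_y, b_y = torch.randn(y_dim, h_dim) * 0.1, torch.zeros(y_dim)

def elman_step(x, h_prev):
    h = torch.sigmoid(U_h @ h_prev + W_h @ x + b_h)  # state update
    y = torch.sigmoid(W_y @ h + b_y)                 # output
    return h, y

# A Jordan step differs only in feeding the previous output y_{t-1}
# (instead of h_{t-1}) back into the state update.
h = torch.zeros(h_dim)
for x in torch.randn(5, x_dim):
    h, y = elman_step(x, h)
```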
HOW DO WE TRAIN RNNS?
State update: h_t = f_h(x_t, h_{t-1}, θ_h)
Output: y_t = f_y(x_t, h_t, θ_y)
[figure: x → h → y with recurrent connection]
UNROLLING THROUGH TIME
[figure: the RNN copied once per time step, inputs x_0 … x_t feeding states h_0 … h_t and outputs y_0 … y_t]
Unrolled RNN = feed-forward network
Shared parameters across time steps
Trained with back-prop
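A sketch of "unroll + back-prop" using PyTorch's RNNCell purely for illustration; the data, sizes, and squared-error loss are placeholders:

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=8)
head = nn.Linear(8, 2)
opt = torch.optim.SGD(list(cell.parameters()) + list(head.parameters()), lr=0.1)

xs, targets = torch.randn(20, 4), torch.randn(20, 2)  # one sequence, T = 20

h = torch.zeros(1, 8)
loss = 0.0
for t in range(xs.shape[0]):               # unroll the RNN over time
    h = cell(xs[t].unsqueeze(0), h)        # same parameters at every step
    loss = loss + ((head(h) - targets[t]) ** 2).mean()

opt.zero_grad()
loss.backward()                            # back-prop through the unrolled graph
opt.step()
```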
UNROLLING THROUGH TIME - ISSUES
Long unrolling: vanishing or exploding gradients
Very long unrolling: computationally expensive
VERY LONG UNROLLING
Solution (hack), during training: cut the RNN (set h = 0) after n time steps
Often still trains well in practice
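A sketch of this truncation hack: reset the state every n steps during training so the unrolled graph never spans more than n steps (shapes, names, and the loss are illustrative):

```python
import torch
import torch.nn as nn

cell, head = nn.RNNCell(4, 8), nn.Linear(8, 2)
opt = torch.optim.SGD(list(cell.parameters()) + list(head.parameters()), lr=0.1)

xs, targets = torch.randn(1000, 4), torch.randn(1000, 2)
n = 20                                          # truncation length
for start in range(0, xs.shape[0], n):
    h = torch.zeros(1, 8)                       # cut the RNN: set h = 0
    loss = 0.0
    for t in range(start, min(start + n, xs.shape[0])):
        h = cell(xs[t].unsqueeze(0), h)
        loss = loss + ((head(h) - targets[t]) ** 2).mean()
    opt.zero_grad()
    loss.backward()                             # back-prop only within the chunk
    opt.step()
# A common variant keeps the state value across chunks but detaches it
# (h = h.detach()), so gradients still stop at the chunk boundary.
```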
EXPLODING AND VANISHING GRADIENTS
Per-step gradient: ∂h_t/∂h_{t-1} ≈ α, so over n steps ∂h_n/∂h_0 ≈ α^n
Vanishing gradients: α < 1 ⟹ α^n → 0
Exploding gradients: α > 1 ⟹ α^n → ∞
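A tiny numeric illustration: a per-step factor α slightly below or above 1 vanishes or explodes over a long unroll (n = 100 here):

```python
for alpha in (0.9, 1.0, 1.1):
    print(alpha, alpha ** 100)   # ≈ 2.7e-05 (vanishes), 1.0, ≈ 1.4e+04 (explodes)
```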
EXPLODING AND VANISHING GRADIENTS
Exploding gradients: gradient clipping (hack)
∂l/∂h_{t-1} = clip(∂l/∂h_t · ∂h_t/∂h_{t-1}, -ε, ε)
Vanishing gradients: different RNN structure
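A clipping sketch: after backward(), clamp every gradient entry to [-ε, ε] before the optimizer step. The tiny model and loss are just placeholders; PyTorch also ships ready-made clip_grad_value_ / clip_grad_norm_ helpers:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=4, hidden_size=8)
out, h_n = model(torch.randn(50, 1, 4))     # sequence of length 50, batch of 1
loss = out.pow(2).mean()
loss.backward()

eps = 1.0
for p in model.parameters():
    if p.grad is not None:
        p.grad.clamp_(-eps, eps)            # element-wise gradient clipping
```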
LSTM
Long short-term memory
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t)
[figure: LSTM cell with gates f_t, i_t, o_t acting on cell state c and hidden state h]
LSTM
Cell state c
Allows information to flow through nearly unchanged
LSTM
Forget gate f
Clears the cell state
LSTM
Input gate i
Allows a state update (or not): gates the new input before it is added to the previous cell state
LSTM
Output gate o
Should we produce an output?
Output is the tanh of the cell state, gated by o_t
LSTM
Can learn to keep state for up to 100 time steps
Fewer vanishing gradients
Trained by unrolling through time
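A from-scratch sketch of one LSTM step, mirroring the equations a few slides back; weight names, sizes, and the random inputs are illustrative:

```python
import torch

x_dim, h_dim = 4, 8
W = {k: torch.randn(h_dim, x_dim) * 0.1 for k in "fioc"}   # input weights W_f, W_i, W_o, W_c
U = {k: torch.randn(h_dim, h_dim) * 0.1 for k in "fioc"}   # recurrent weights U_f, U_i, U_o, U_c
b = {k: torch.zeros(h_dim) for k in "fioc"}                # biases

def lstm_step(x, h_prev, c_prev):
    f = torch.sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])               # forget gate
    i = torch.sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])               # input gate
    o = torch.sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])               # output gate
    c = f * c_prev + i * torch.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"]) # cell state
    h = o * torch.tanh(c)                                                  # hidden state / output
    return h, c

h, c = torch.zeros(h_dim), torch.zeros(h_dim)
for x in torch.randn(10, x_dim):        # unroll over 10 time steps
    h, c = lstm_step(x, h, c)
```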
GRU
Gated Recurrent Unit
z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)
ĥ_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ ĥ_t
[figure: GRU cell with update gate z_t and reset gate r_t]
GRU
Similar performance to LSTM
Almost the same state update
Fewer gates
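A from-scratch sketch of one GRU step, mirroring the equations above; names and sizes are illustrative:

```python
import torch

x_dim, h_dim = 4, 8
W = {k: torch.randn(h_dim, x_dim) * 0.1 for k in "zrh"}
U = {k: torch.randn(h_dim, h_dim) * 0.1 for k in "zrh"}
b = {k: torch.zeros(h_dim) for k in "zrh"}

def gru_step(x, h_prev):
    z = torch.sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])           # update gate
    r = torch.sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])           # reset gate
    h_tilde = torch.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])  # candidate state
    return (1 - z) * h_prev + z * h_tilde                              # no separate cell state

h = torch.zeros(h_dim)
for x in torch.randn(10, x_dim):
    h = gru_step(x, h)
```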
SUMMARY
Training RNNs: unroll in time + back-prop
Exploding gradients: clip
Vanishing gradients (no long-term interactions): use LSTM or GRU