Deep Recurrent Neural Networks

Deep Recurrent Neural Networks. Artem Chernodub (ZZ Photo / IMMSP NASU). E-mail: a.chernodub@gmail.com, web: http://zzphoto.me

2 / 28 Neuroscience, biologically-inspired models, machine learning. Bayes' rule: $p(x \mid y) = p(y \mid x)\, p(x) / p(y)$

Problem domain: processing sequences. 3 / 28

4 / 28 Neural network models under consideration: Classic (shallow) Feedforward Neural Networks; Deep Feedforward Neural Networks; Recurrent Neural Networks.

Biological Neural Networks 5 / 28

6 / 28 Sequence processing problems: time series prediction, speech recognition, natural language processing, engine control.

Pen handwriting recognition (IAM-OnDB)
Model | # of parameters | Word Error Rate (WER)
HMM | - | 35.5%
CTC RNN | 13M | 20.4%
Training data: 20,000 most frequently occurring words written by 221 different writers.
A. Graves, S. Fernandez, M. Liwicki, H. Bunke, J. Schmidhuber. Unconstrained online handwriting recognition with recurrent neural networks // Advances in Neural Information Processing Systems 21 (NIPS), pp. 577-584, MIT Press, Cambridge, MA, 2008.
7 / 28

8 / 28 TIMIT Phoneme Recognition
Model | # of parameters | Error
Hidden Markov Model (HMM) | - | 27.3%
Deep Belief Network (DBN) | ~4M | 26.7%
Deep RNN | 4.3M | 17.7%
Training data: 462 speakers train / 24 speakers test, 3.16 / 0.14 hrs.
Graves, A., Mohamed, A.-R., and Hinton, G. E. (2013). Speech recognition with deep recurrent neural networks // IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649.
Mohamed, A. and Hinton, G. E. (2010). Phone recognition using restricted Boltzmann machines // IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4354-4357.

Google Large Vocabulary Speech Recognition
Model | # of parameters | Cross-entropy
ReLU DNN | 85M | 11.3
Deep Projection LSTM RNN | 13M | 10.7
Training data: 3M utterances (1900 hrs).
H. Sak, A. Senior, F. Beaufays. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling // INTERSPEECH 2014.
K. Vesely, A. Ghoshal, L. Burget, D. Povey. Sequence-discriminative training of deep neural networks // INTERSPEECH 2014.
9 / 28

Dynamic process identification scheme 10 / 28

11 / 28 Feedforward and Recurrent Neural Networks: popularity is roughly 90% (FFNN) vs. 10% (RNN).

12 / 28 Feedforward vs. Recurrent Neural Networks: Dynamic Multilayer Perceptron, DMLP (FFNN); Recurrent Multilayer Perceptron, RMLP (RNN).

Processing sequences: Single-Step-Ahead predictions. 13 / 28

14 / 28 Processing sequences: Multi-Step-Ahead predictions, autoregression.

15 / 28 Processing sequences: Multi-Step-Ahead predictions under control.
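To make the Single-Step-Ahead vs. Multi-Step-Ahead distinction concrete, here is a minimal Python/NumPy sketch (not from the slides; the one-step model `predict_next` is a hypothetical, already-trained helper) of the autoregressive loop: the model's own outputs are fed back into the input window to predict further ahead.

```python
import numpy as np

def multi_step_ahead(predict_next, history, n_steps):
    """Autoregressive Multi-Step-Ahead prediction.

    predict_next -- trained one-step model mapping a window of past values
                    to the next value (hypothetical helper).
    history      -- 1-D array with the most recent observations.
    n_steps      -- prediction horizon.
    """
    window = list(history)
    predictions = []
    for _ in range(n_steps):
        y_next = predict_next(np.array(window))  # Single-Step-Ahead prediction
        predictions.append(y_next)
        window = window[1:] + [y_next]           # feed the prediction back as an input
    return np.array(predictions)

# Toy usage: "predict" the next value as the mean of the window.
print(multi_step_ahead(lambda w: float(w.mean()),
                       history=np.array([0.1, 0.2, 0.3]), n_steps=5))
```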

16 / 28 Classic Feedforward Neural Networks (before 2006). Single hidden layer (the Kolmogorov-Cybenko Universal Approximation Theorem as the main hope). The vanishing gradients effect prevents using more layers. Fewer than 10K free parameters. The feature preprocessing stage is often critical.

17 / 28 Deep Feedforward Neural Networks. Number of hidden layers > 1 (usually 6-9). 100K to 100M free parameters. No (or reduced) feature preprocessing stage. Two-stage training process: i) unsupervised pre-training; ii) fine-tuning (the vanishing gradients problem is beaten!).

Training the Neural Network: derivative + optimization 18 / 28

19 / 28 DMLP: 1) forward propagation pass

$$z_j = f\Big(\sum_i w^{(1)}_{ji} x_i\Big), \qquad \tilde{y}(k+1) = g\Big(\sum_j w^{(2)}_j z_j\Big),$$

where z_j is the postsynaptic value for the j-th hidden neuron, w^(1) are the hidden layer's weights, f(.) are the hidden layer's activation functions, w^(2) are the output layer's weights, and g(.) are the output layer's activation functions.
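A minimal NumPy sketch of this forward pass (illustrative only; the layer sizes and the tanh/linear choice for f and g are assumptions, not taken from the slides):

```python
import numpy as np

def dmlp_forward(x, W1, W2, f=np.tanh, g=lambda a: a):
    """One-hidden-layer DMLP: z_j = f(sum_i W1[j,i] x_i), y = g(sum_j W2[:,j] z_j)."""
    z = f(W1 @ x)   # postsynaptic values of the hidden neurons
    y = g(W2 @ z)   # network output, tilde{y}(k+1)
    return y, z

# Example: 3 inputs, 5 hidden neurons, 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(5, 3))   # hidden-layer weights w^(1)
W2 = rng.normal(scale=0.1, size=(1, 5))   # output-layer weights w^(2)
y, z = dmlp_forward(np.array([0.1, -0.2, 0.3]), W1, W2)
```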

20 / 28 DMLP: 2) backpropagation pass

Local gradients calculation:
$$\delta^{OUT} = t(k+1) - \tilde{y}(k+1), \qquad \delta^{HID}_j = f'(z_j)\, w^{(2)}_j \delta^{OUT}.$$

Derivatives calculation:
$$\frac{\partial E(k)}{\partial w^{(2)}_j} = \delta^{OUT} z_j, \qquad \frac{\partial E(k)}{\partial w^{(1)}_{ji}} = \delta^{HID}_j x_i.$$
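Continuing the sketch above, the local gradients and weight derivatives could be computed as follows for a tanh hidden layer and a linear output (an illustrative assumption; the sign convention follows the slide, delta_OUT = t - y_tilde, so weights would be updated with w += eta * dE/dw):

```python
import numpy as np

def dmlp_backward(x, t, W1, W2):
    """Backpropagation pass for the one-hidden-layer DMLP sketched above."""
    a = W1 @ x        # hidden pre-activations
    z = np.tanh(a)    # hidden postsynaptic values z_j
    y = W2 @ z        # network output tilde{y}(k+1)

    delta_out = t - y                                # delta_OUT = t(k+1) - y~(k+1)
    delta_hid = (1.0 - z ** 2) * (W2.T @ delta_out)  # f'(z_j) * w^(2)_j * delta_OUT

    dE_dW2 = np.outer(delta_out, z)   # dE(k)/dw^(2)_j  = delta_OUT * z_j
    dE_dW1 = np.outer(delta_hid, x)   # dE(k)/dw^(1)_ji = delta_HID_j * x_i
    return dE_dW1, dE_dW2

# Usage with the same shapes as in the forward-pass example.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(5, 3))
W2 = rng.normal(scale=0.1, size=(1, 5))
dE_dW1, dE_dW2 = dmlp_backward(np.array([0.1, -0.2, 0.3]), np.array([1.0]), W1, W2)
```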

21 / 28 Backpropagation Through Time (BPTT). BPTT = unrolling the RNN into a Deep Feedforward Neural Network, where h is the truncation depth.

$$\frac{\partial E_{BPTT}(k)}{\partial w^{(1)}_{ji}} = \frac{1}{h}\sum_{n=0}^{h-1}\frac{\partial E(k-n)}{\partial w^{(1)}_{ji}}.$$
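A hedged sketch of truncated BPTT for a simple RMLP-style recurrence h(k) = tanh(W_in x(k) + W_rec h(k-1)) with a linear readout (the recurrence, loss and shapes are assumptions for illustration; only the unroll-over-h-steps-and-average structure follows the slide):

```python
import numpy as np

def truncated_bptt_grad(xs, ts, h0, W_in, W_rec, W_out, h_depth):
    """Gradient of the recurrent weights W_rec, averaged over the last
    h_depth unrolled steps: dE_BPTT/dW = (1/h) * sum_n dE(k-n)/dW."""
    # Forward pass over the truncation window, storing the hidden states.
    hs = [h0]
    for x in xs[-h_depth:]:
        hs.append(np.tanh(W_in @ x + W_rec @ hs[-1]))

    dW_rec = np.zeros_like(W_rec)
    carried = np.zeros_like(h0)            # error carried back through time
    # Backward pass through the unrolled network, newest step first.
    for n in range(1, h_depth + 1):
        h, h_prev = hs[-n], hs[-n - 1]
        delta_y = ts[-n] - W_out @ h                              # output error (slide's sign convention)
        delta_h = (W_out.T @ delta_y + carried) * (1.0 - h ** 2)  # tanh' at this step
        dW_rec += np.outer(delta_h, h_prev)
        carried = W_rec.T @ delta_h                               # propagate one step further back
    return dW_rec / h_depth                                       # the 1/h averaging from the slide

# Toy usage: 1-D inputs/targets, 4 hidden units, truncation depth h = 3.
rng = np.random.default_rng(0)
W_in, W_rec, W_out = rng.normal(size=(4, 1)), 0.5 * rng.normal(size=(4, 4)), rng.normal(size=(1, 4))
xs = [rng.normal(size=1) for _ in range(10)]
ts = [rng.normal(size=1) for _ in range(10)]
print(truncated_bptt_grad(xs, ts, np.zeros(4), W_in, W_rec, W_out, h_depth=3))
```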

22 / 28 Bad effect of vanishing (exploding) gradients: a problem.

$$\frac{\partial E(k)}{\partial w^{(m)}_{ji}} = \delta^{(m)}_j z^{(m-1)}_i, \qquad \delta^{(m)}_j = f'\big(z^{(m)}_j\big)\sum_i w^{(m+1)}_{ij}\,\delta^{(m+1)}_i,$$

so that

$$\frac{\partial E(k)}{\partial w^{(m)}_{ji}} \to 0$$

for layers far from the output (m -> 1).
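The effect is easy to reproduce numerically: the local gradient is repeatedly multiplied by the transposed weights and the activation derivative, so its norm shrinks or grows exponentially with the number of layers (or unrolled time steps). A tiny illustrative experiment (the sizes, scales and the constant 0.5 standing in for the average |f'| are assumptions):

```python
import numpy as np

def backprop_norms(weight_scale, depth=50, width=100, seed=0):
    """Norm of a backpropagated error vector after each of `depth` layers."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=weight_scale / np.sqrt(width), size=(width, width))
    delta = rng.normal(size=width)
    norms = []
    for _ in range(depth):
        delta = 0.5 * (W.T @ delta)   # delta^(m) ~ f'(z) * W^T delta^(m+1), with |f'| <= 1
        norms.append(np.linalg.norm(delta))
    return norms

print(backprop_norms(0.5)[-1])   # small weights: the norm collapses toward 0 (vanishing)
print(backprop_norms(4.0)[-1])   # large weights: the norm blows up (exploding)
```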

23 / 28 Vanishing gradients effect: solutions.

Accurate initialization of weights: if the weights are small, the gradients shrink exponentially; if the weights are big, the gradients grow exponentially.

Hessian-Free Optimization: use the second-order optimizer by Martens.
J. Martens. Deep Learning via Hessian-Free Optimization // Proc. ICML, pp. 735-742, 2010.
J. Martens and I. Sutskever. Learning Recurrent Neural Networks with Hessian-Free Optimization // Proc. ICML, 2011.

24 / 28 Time-series Single-Step-Ahead prediction problems A. Chernodub. Training Dynamic Neural Networks Using the Extended Kalman Filter for Multi-Step-Ahead Predictions // Artificial Neural Networks Methods and Applications in Bio-/Neuroinformatics. Petia Koprinkova-Hristova, Valeri Mladenov, Nikola K. Kasabov (Eds.), Springer, 2014, pp. 221-244.

25 / 28 Time-series Single-Step-Ahead predictions: the results

Mean NMSE of Single-Step-Ahead predictions for different neural network architectures:
Model | MG17 | ICE | SOLAR | CHICKENPOX | RIVER | LASER
DLNN (FF) | 0.0042 | 0.0244 | 0.1316 | 0.8670 | 4.3638 | 0.7711
DMLP (FF) | 0.0006 | 0.0376 | 0.1313 | 0.4694 | 0.9979 | 0.0576
RMLP (RNN) | 0.0050 | 0.0273 | 0.1493 | 0.7608 | 6.4790 | 0.2415
NARX (RNN) | 0.0010 | 0.1122 | 0.1921 | 1.4736 | 2.0685 | 0.1332

A. Chernodub. Training Dynamic Neural Networks Using the Extended Kalman Filter for Multi-Step-Ahead Predictions // Artificial Neural Networks Methods and Applications in Bio-/Neuroinformatics. Petia Koprinkova-Hristova, Valeri Mladenov, Nikola K. Kasabov (Eds.), Springer, 2014, pp. 221-244.

?!!

27 / 28 Multi-Step-Ahead prediction, Laser Dataset (figure: feedforward data flow vs. recurrent data flow). A. Chernodub. Training Dynamic Neural Networks Using the Extended Kalman Filter for Multi-Step-Ahead Predictions // Artificial Neural Networks Methods and Applications in Bio-/Neuroinformatics. Petia Koprinkova-Hristova, Valeri Mladenov, Nikola K. Kasabov (Eds.), Springer, 2014, pp. 221-244.

28 / 28 Catching long-term dependencies: T = 50, 100, 200. R. Pascanu, Y. Bengio. On the Difficulty of Training Recurrent Neural Networks // Tech. Rep. arXiv:1211.5063, Université de Montréal (2012).

RNN vs. FFNN

Property | Deep Feed-Forward NNs | Recurrent NNs
Pre-training stage | Yes | No (rarely)
Derivatives calculation | Backpropagation (BP) | Backpropagation Through Time (BPTT)
Multi-Step-Ahead prediction quality | Bad (-) | Good (+)
Sequence recognition quality | Bad (-) | Good (+)
Vanishing gradient effect | Present (-) | Present (-)

Thanks! e-mail: a.chernodub@gmail.com web: http://zzphoto.me