Deep Neural Networks

Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45

Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi Layer Perceptron Error Backpropagation Hybrid HMM-MLP Deep Learning (Initialization) Deep Neural Networks Restricted Boltzmann Machines Deep Belief Networks 2 / 45

State-to-Output Probability Model [figure: HMM with hidden states z_{n-1}, z_n, z_{n+1} emitting observations x_{n-1}, x_n, x_{n+1}] The state-to-output model is responsible for the discriminative power of the whole model. GMMs are used because they are easy to train and adapt; discriminative training can improve results. 4 / 45

State-to-Output Probability Model [figure: HMM with hidden states z_{n-1}, z_n, z_{n+1} emitting observations x_{n-1}, x_n, x_{n+1}] The state-to-output model is responsible for the discriminative power of the whole model. Alternatives: artificial neural networks (ANNs), deep neural networks (DNNs), support vector machines (SVMs, not used for ASR). 4 / 45

Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi Layer Perceptron Error Backpropagation Hybrid HMM-MLP Deep Learning (Initialization) Deep Neural Networks Restricted Boltzmann Machines Deep Belief Networks 5 / 45

Perceptron Known since the 1950s [4]. [figure: perceptron with inputs x_0 ... x_4, weights w_0 ... w_4, a summation node, transfer function f and output y] [4] F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Tech. rep. 85-460-1. Cornell Aeronautical Laboratory, 1957. 6 / 45

Perceptron input/output: y = f(b + \sum_i w_i x_i), where f(z) = 1 / (1 + e^{-z}) (sigmoid), f(z) = (e^z - e^{-z}) / (e^z + e^{-z}) (hyperbolic tangent), or f(z) = \max(0, z) (rectified linear unit). Equivalent to logistic regression (b = w_0 x_0 is the bias). 7 / 45
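
As a concrete illustration (a minimal sketch assuming NumPy; the input and weight values are made up), the perceptron output y = f(b + \sum_i w_i x_i) with the three transfer functions listed above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)             # (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0.0, z)

def perceptron(x, w, b, f=sigmoid):
    # y = f(b + sum_i w_i x_i)
    return f(b + np.dot(w, x))

x = np.array([0.5, -1.2, 3.0])    # hypothetical input
w = np.array([0.1, 0.4, -0.2])    # hypothetical weights
b = 0.05                          # bias (b = w_0 x_0 in the slide's notation)
print(perceptron(x, w, b, sigmoid), perceptron(x, w, b, tanh), perceptron(x, w, b, relu))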

Perceptron: Linear Classification Learning: adjust weights to correct errors. [figure: two linearly separable classes in the (y_1, y_2) plane separated by a linear decision boundary] 8 / 45

Multi-layer Perceptron [3] [figure: network with input layer x_0 ... x_4, hidden layer h_0 ... h_5 and output layer y_1 ... y_3] [3] F. Rosenblatt. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Tech. rep. DTIC Document, 1961. 9 / 45

Universal Approximation Theorem First proposed by Cybenko [1]: a single hidden layer with a finite but sufficiently large number of neurons can approximate any continuous function on compact subsets of R^N, under mild assumptions on the activation function. [1] G. Cybenko. Approximation by superpositions of a sigmoidal function. In: Mathematics of Control, Signals and Systems 2.4 (1989), pp. 303-314. 10 / 45

Multi-layer Perceptron: Training Backpropagation algorithm [5] [figure: network with input units x_0 ... x_d, a bias unit, a hidden layer and an output unit; the error E is propagated backwards from the output] [5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Tech. rep. DTIC Document, 1985. 11 / 45

Learning Criteria Ideally minimise the Expected Loss: J_{EL} = E[J(W, B, o, y)] = \int_o J(W, B, o, y) p(o) do, where o = features and y = labels, but we do not know p(o). Use empirical learning criteria instead: Mean Square Error (MSE), Cross Entropy (CE). 12 / 45

Mean Square Error Criterion J_{MSE} = \frac{1}{M} \sum_{m=1}^{M} J_{MSE}(W, B, o^m, y^m), with J_{MSE}(W, B, o^m, y^m) = \frac{1}{2} \| v^L - y \|^2 = \frac{1}{2} (v^L - y)^T (v^L - y). 13 / 45

Cross Entropy Criterion J_{CE} = \frac{1}{M} \sum_{m=1}^{M} J_{CE}(W, B, o^m, y^m), with J_{CE}(W, B, o^m, y^m) = -\sum_{i=1}^{C} y_i \log v_i^L. Equivalent to minimising the Kullback-Leibler divergence (KLD). 14 / 45
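
To make the two empirical criteria concrete, here is a small sketch (NumPy, hypothetical values) computing J_MSE and J_CE for a single sample, where v_L is the network output and y a one-hot label:

import numpy as np

def j_mse(v_L, y):
    # J_MSE = 1/2 * ||v^L - y||^2
    d = v_L - y
    return 0.5 * np.dot(d, d)

def j_ce(v_L, y, eps=1e-12):
    # J_CE = - sum_i y_i log v_i^L   (v^L assumed to be softmax outputs)
    return -np.sum(y * np.log(v_L + eps))

v_L = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output
y   = np.array([1.0, 0.0, 0.0])   # one-hot label
print(j_mse(v_L, y), j_ce(v_L, y))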

Update rules W^l_{t+1} \leftarrow W^l_t - \epsilon \nabla W^l_t and b^l_{t+1} \leftarrow b^l_t - \epsilon \nabla b^l_t. To compute \nabla W^l_t and \nabla b^l_t we need the gradient of the criterion function. Key trick: chain rule of gradients for f(g(x)): \partial f / \partial x = (\partial f / \partial g)(\partial g / \partial x). 15 / 45
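
As an illustration of the update rule and the chain rule (a sketch of my own, not code from the lecture), one stochastic gradient step for a single sigmoid layer trained with the MSE criterion:

import numpy as np

def sgd_step(W, b, x, y, eps=0.1):
    # forward pass: z = Wx + b, v = sigmoid(z)
    z = W @ x + b
    v = 1.0 / (1.0 + np.exp(-z))
    # chain rule for J_MSE = 1/2 ||v - y||^2:
    # dJ/dz = (v - y) * v * (1 - v),  dJ/dW = dJ/dz x^T,  dJ/db = dJ/dz
    dz = (v - y) * v * (1.0 - v)
    W_new = W - eps * np.outer(dz, x)   # W <- W - eps * grad
    b_new = b - eps * dz                # b <- b - eps * grad
    return W_new, b_new

W = 0.1 * np.random.randn(3, 4)   # hypothetical weights (3 outputs, 4 inputs)
b = np.zeros(3)
x = np.random.randn(4)
y = np.array([0.0, 1.0, 0.0])
W, b = sgd_step(W, b, x, y)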

Backpropagation: Properties weight updates only depend on neighbouring variables (computation is local); the algorithm finds a local optimum; it is sensitive to initialisation. 16 / 45

Practical Issues preprocessing: Cepstral Mean Normalisation; initialisation: random (symmetry breaking), within the linear range of the activation function; regularisation (weight decay, dropout); batch size selection; sample randomisation; momentum; learning rate and stopping criterion. 17 / 45

Output Layer Regression tasks: linear layer, v^L = z^L = W^L v^{L-1} + b^L. Classification tasks: softmax layer, v_i^L = softmax_i(z^L) = e^{z_i^L} / \sum_{j=1}^{C} e^{z_j^L}. 18 / 45
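
A sketch of the softmax output layer (the max-subtraction for numerical stability is a common implementation detail, not something stated in the slides):

import numpy as np

def softmax(z_L):
    # v_i^L = exp(z_i^L) / sum_j exp(z_j^L); subtracting max(z) avoids overflow
    e = np.exp(z_L - np.max(z_L))
    return e / np.sum(e)

z_L = np.array([2.0, 1.0, -0.5])   # hypothetical pre-activations z^L = W^L v^{L-1} + b^L
v_L = softmax(z_L)
print(v_L, v_L.sum())              # components in [0, 1], summing to 1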

Probabilistic Interpretation 1. v_i^L \in [0, 1] \forall i, 2. \sum_{i=1}^{C} v_i^L = 1. Output activations are posterior probabilities of the classes given the observations: v_i^L = P(i | o). In speech: P(state | sound). 19 / 45

Hybrid HMM+Multi Layer Perceptron Figure from Yu and Deng 20 / 45

Combining probabilities HMMs use likelihoods P(sound | state); MLPs and DNNs estimate posteriors P(state | sound). We can combine them with Bayes' rule: P(sound | state) = P(state | sound) P(sound) / P(state). P(state) can be estimated from the training set; P(sound) is constant and can be ignored. Use scaled likelihoods: P(sound | state) \propto P(state | sound) / P(state). 21 / 45
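
A sketch (hypothetical arrays, log domain) of how the posteriors produced by the network could be turned into scaled likelihoods by dividing by the state priors:

import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors, floor=1e-8):
    # log P(sound|state) + const = log P(state|sound) - log P(state)
    return log_posteriors - np.log(state_priors + floor)

log_post = np.log(np.array([0.80, 0.15, 0.05]))   # hypothetical DNN posteriors for one frame
priors   = np.array([0.50, 0.30, 0.20])           # hypothetical state priors from the training alignment
print(scaled_log_likelihoods(log_post, priors))   # fed to the HMM in place of GMM likelihoods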

Time Dependent and Recurrent ANNs 22 / 45

ANNs in ASR: Advantages discriminative in nature; powerful time models: Time-Delay Neural Networks (TDNNs), Recurrent Neural Networks (RNNs). 23 / 45

ANNs in ASR: Disadvantages training requires state-level annotations (no EM available); annotations are usually obtained with forced alignment; not easy to adapt. 24 / 45

Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi Layer Perceptron Error Backpropagation Hybrid HMM-MLP Deep Learning (Initialization) Deep Neural Networks Restricted Boltzmann Machines Deep Belief Networks 25 / 45

Deep Neural Network [figure: feed-forward network with input layer x_0 ... x_4, multiple hidden layers and output layer y_1 ... y_3] 26 / 45

DNN: Motivation depth gives increasing levels of abstraction; good initialisation (see later); fast computers and large datasets. 27 / 45

DNN and MLPs no conceptual difference from MLPs, but backpropagation alone is not powerful enough: local minima, vanishing gradients. 28 / 45

Deep Learning for Acoustic Modelling most promising technique at the moment; pioneered by Geoffrey Hinton (Univ. of Toronto); most of the large companies are using it (Microsoft, Google, Nuance, IBM); unifies properties of generative and discriminative models. 29 / 45

Deep Learning: Idea [figure: a discriminative model maps observations to labels; a generative model models how the observations are generated]... but HMMs with Gaussian Mixture Models are also generative: why is this better? 30 / 45

Deep Learning: Idea #2 1. initialise the DNN with Restricted Boltzmann Machines (RBMs) that can be trained unsupervised 2. use a fast learning procedure (Hinton) 3. use ridiculous amounts of unlabelled (cheap) data to train a ridiculous number of parameters in an unsupervised fashion 4. at the end, use small amounts of labelled (expensive) data and backpropagation to learn the labels 31 / 45

Restricted Boltzmann Machines (RBMs) First called Harmonium [6]. Binary nodes: Bernoulli distribution; continuous nodes: Gaussian-Bernoulli. [6] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Department of Computer Science, University of Colorado, Boulder, 1986. Chap. 6. 32 / 45

Restricted Boltzmann Machines (RBMs) Energy (Bernoulli): E(v, h) = -a^T v - b^T h - h^T W v. Energy (Gaussian-Bernoulli): E(v, h) = \frac{1}{2} (v - a)^T (v - a) - b^T h - h^T W v. 33 / 45

RBM: Probabilistic Interpretation P(v, h) = e^{-E(v,h)} / \sum_{v,h} e^{-E(v,h)}. Posteriors (conditional independence): P(h | v) = \prod_i P(h_i | v) and P(v | h) = \prod_i P(v_i | h). 34 / 45

Binary Units: Conditional Probability The posterior equals the sigmoid function! P(h_i = 1 | v) = e^{b_i \cdot 1 + 1 \cdot W_{i,\cdot} v} / (e^{b_i \cdot 1 + 1 \cdot W_{i,\cdot} v} + e^{b_i \cdot 0 + 0 \cdot W_{i,\cdot} v}) = e^{b_i + W_{i,\cdot} v} / (e^{b_i + W_{i,\cdot} v} + 1) = \sigma(b_i + W_{i,\cdot} v). Same as in a Multi Layer Perceptron (viable for initialisation!). 35 / 45
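
This identity is what makes the RBM usable for initialisation: P(h_i = 1 | v) is exactly the sigmoid layer of an MLP, so the learned W and b can be copied into a feed-forward layer. A minimal sketch under that assumption:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_hidden_prob(v, W, b):
    # P(h_i = 1 | v) = sigma(b_i + W_{i,.} v)
    return sigmoid(b + W @ v)

def mlp_layer(v, W, b):
    # the same computation used as a deterministic feed-forward layer
    return sigmoid(b + W @ v)

W = 0.01 * np.random.randn(6, 5)             # hypothetical RBM weights (6 hidden, 5 visible)
b = np.zeros(6)
v = (np.random.rand(5) > 0.5).astype(float)  # binary visible vector
assert np.allclose(rbm_hidden_prob(v, W, b), mlp_layer(v, W, b))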

Gaussian Units: Conditional Probability P(v | h) = N(v; \mu, \Sigma) with \mu = W^T h + a and \Sigma = I. 36 / 45

RBM Training Stochastic Gradient Descent (minimise the negative log likelihood): J_{NLL}(W, a, b, v) = -\log P(v) = F(v) + \log \sum_{\tilde v} e^{-F(\tilde v)}, where F(v) = -\log \sum_h e^{-E(v, h)} is the free energy of the system. BUT: the gradient cannot be computed exactly. 37 / 45

RBM Gradient \partial J_{NLL}(W, a, b, v) / \partial\theta = \partial F(v) / \partial\theta - \sum_{\tilde v} p(\tilde v) \, \partial F(\tilde v) / \partial\theta. The first term increases the probability of the training data; the second term decreases the probability density defined by the model. 38 / 45

RBM Stochastic Gradient The general form is: \nabla_\theta J_{NLL}(W, a, b, v) = \langle \partial E(v, h) / \partial\theta \rangle_{data} - \langle \partial E(v, h) / \partial\theta \rangle_{model}. Example (visible-to-hidden weights): \nabla_{w_{ij}} J_{NLL}(W, a, b, v) = -[\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}]. 39 / 45

Gibbs Sampling \langle v_i h_j \rangle_{model} is computed with sampling. Sample the joint distribution of N variables, one at a time: P(X_i | X_{-i}), where X_{-i} are all the other variables. BUT: it takes exponential time to compute exactly. 40 / 45

Contrastive Divergence Two tricks: 1. initialise the chain with a training sample 2. do not wait for convergence 41 / 45
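
A minimal CD-1 sketch for a Bernoulli-Bernoulli RBM (NumPy; variable names and learning rate are my own assumptions): the Gibbs chain is initialised with a training sample and only one reconstruction step is taken instead of waiting for convergence.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, a, b, eps=0.05, rng=np.random):
    # positive phase: sample h0 ~ P(h|v0)
    p_h0 = sigmoid(b + W @ v0)
    h0 = (rng.rand(*p_h0.shape) < p_h0).astype(float)
    # one Gibbs step: reconstruct v1 ~ P(v|h0), then compute P(h|v1)
    p_v1 = sigmoid(a + W.T @ h0)
    v1 = (rng.rand(*p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b + W @ v1)
    # approximate gradient: <v h>_data - <v h>_model
    W += eps * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
    a += eps * (v0 - v1)
    b += eps * (p_h0 - p_h1)
    return W, a, b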

RBMs and Deep Belief Networks Figure from Yu and Deng 42 / 45

Deep Belief Networks: Training Yee-Whye Teh (one of Hinton's students) observed that DBNs can be trained greedily, one layer at a time: 1. train an RBM unsupervised 2. excite the network with training data to produce outputs 3. use the outputs to train the next RBM 43 / 45
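
A sketch of the greedy layer-wise loop (reusing the hypothetical cd1_step from the contrastive divergence sketch above): each RBM is trained unsupervised on the activations produced by the layers trained so far, and its outputs become the training data for the next RBM.

import numpy as np

def pretrain_stack(data, layer_sizes, epochs=5, rng=np.random):
    # data: (n_samples, n_visible) array (binary or in [0, 1]); layer_sizes: hidden sizes per RBM
    layers, x = [], data
    for n_hidden in layer_sizes:
        n_visible = x.shape[1]
        W = 0.01 * rng.randn(n_hidden, n_visible)
        a, b = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):                       # 1. train this RBM unsupervised
            for v0 in x:
                W, a, b = cd1_step(v0, W, a, b)
        layers.append((W, b))
        x = 1.0 / (1.0 + np.exp(-(x @ W.T + b)))      # 2.-3. propagate data to train the next RBM
    return layers                                     # used as initial weights for backpropagation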

Final Step: Supervised Training [figure: the pre-trained deep network with input layer x_0 ... x_4, multiple hidden layers and output layer y_1 ... y_3] 44 / 45

Deep Learning: Performance state-of-the-art on most ASR tasks; made people from the University of Toronto, Microsoft, Google and IBM write a paper together [2]; experiments with learning the features directly from the speech signal; Yu and Deng's book has many examples. [2] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. In: IEEE Signal Processing Magazine (2012). 45 / 45