Speaker Representation and Verification Part II by Vasileios Vasilakakis

Outline
- Approaches of Neural Networks in Speaker/Speech Recognition
- Feed-Forward Neural Networks
- Training with Back-propagation
- Generative Pre-training: Restricted Boltzmann Machines
- RBMs: Bernoulli-Bernoulli RBM
- RBMs: Gaussian-Bernoulli RBM
- RBMs: tuning
- RBMs for Speaker Verification

Approaches of Neural Networks in Speaker Recognition

In the previous lecture we presented the GMM-UBM approach for speaker characterization and classification. Despite the success of the GMM-UBM approach in modeling small vectors of acoustic features, deep neural networks have proved more effective at exploiting the information embedded in a large window of frames. Additionally, Auto-Associative Neural Networks (AANNs), trained to reconstruct the input features, or simple neural network classifiers, have been used to compress into a bottleneck layer the information carried by a window spanning a wide context. AANNs have also been used without exploiting a wide input context, still using the compression layer as a feature extractor for training i-vector systems, or using the bottleneck weights as i-vectors.

Feed-Forward Neural Networks

A feed-forward neural network is an artificial neural network organized as a directed graph, with a set of inputs and a set of outputs, that represents a function over the inputs. Among feed-forward neural networks, the most commonly used are the Auto-Associative Neural Network (or autoencoder), which learns to reconstruct its input, and the Multilayer Perceptron, used for classification. A feed-forward neural network can have more than one layer of hidden units between its input and output. Each hidden unit j typically uses the logistic function to map its total input x_j from the layer below to its output y_j:

y_j = logistic(x_j) = 1 / (1 + exp(-x_j)), with x_j = b_j + sum_i y_i w_ij,

where b_j is the bias of unit j, i is an index over units in the layer below, and w_ij is the weight on the connection to unit j from unit i in the layer below.
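
As a minimal numpy sketch (not taken from the slides), the mapping of one hidden layer with logistic units could be written as follows; the names `logistic`, `hidden_layer`, `W` and `b` are illustrative.

```python
import numpy as np

def logistic(x):
    # Logistic (sigmoid) activation: maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def hidden_layer(y_below, W, b):
    # y_below: activations of the layer below, shape (n_below,)
    # W: connection weights, shape (n_below, n_hidden); b: biases, shape (n_hidden,)
    x = b + y_below @ W   # total input x_j = b_j + sum_i y_i * w_ij
    return logistic(x)    # hidden activations y_j
```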

Training Feed-Forward Neural Networks with Back-propagation

Traditionally, feed-forward neural networks are trained using the back-propagation algorithm, which minimizes a given error function over the training set. The error function can be expressed as

E = sum_p E_p,

where the error for a given pattern p is the sum of the errors of each output unit:

E_p = (1/2) sum_j (t_pj - y_pj)^2.

Back-propagation minimizes the error function by gradient descent in weight space. Weights are adjusted according to

delta_w_ij = -eta * dE/dw_ij,

where eta is the learning rate.
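
A hedged sketch of one gradient-descent update for a single-hidden-layer network under the squared-error criterion above; all function and variable names are illustrative, and `eta` plays the role of the learning rate.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, b1, W2, b2, eta=0.01):
    # Forward pass
    h = logistic(b1 + x @ W1)            # hidden activations
    y = logistic(b2 + h @ W2)            # output activations
    # Backward pass for E_p = 0.5 * sum_j (t_j - y_j)^2
    delta_out = (y - t) * y * (1 - y)    # error signal at the output units
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates: delta_w = -eta * dE/dw
    W2 -= eta * np.outer(h, delta_out)
    b2 -= eta * delta_out
    W1 -= eta * np.outer(x, delta_hid)
    b1 -= eta * delta_hid
    return W1, b1, W2, b2
```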

Generative Pre-training

Deep neural networks with many hidden layers are difficult to train with back-propagation alone:
- Back-propagation from a random starting point near the origin is not the best way to find a good set of weights, and unless the initial scales of the weights are carefully chosen, the back-propagated gradients have very different magnitudes in different layers.
- Back-propagation can get stuck in poor local optima.

Generative pre-training: the idea is to train each layer independently and then use the transformed input space as the input to the next hidden layer. Restricted Boltzmann Machines (RBMs) are usually used to learn each layer of feature transformations, whose output is then fed to the next layer. Hinton et al. showed that this procedure yields better initial values for the weights of the deep neural network than random initialization.
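
A schematic of the greedy layer-wise procedure, assuming a hypothetical `train_rbm` routine that returns the learned parameters and a function propagating data through the trained layer; this only illustrates the idea, not the authors' implementation.

```python
def pretrain_stack(data, layer_sizes, train_rbm):
    # data: (n_samples, n_visible) input features
    # layer_sizes: list of hidden-layer sizes, e.g. [1024, 1024, 1024]
    # train_rbm: hypothetical routine -> (W, hidden_bias, propagate_fn)
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_h, propagate = train_rbm(layer_input, n_hidden)
        weights.append((W, b_h))
        # The transformed data becomes the input of the next RBM
        layer_input = propagate(layer_input)
    return weights   # used to initialize the deep network before fine-tuning
```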

Restricted Boltzmann Machines (Bernoulli-Bernoulli)

A Restricted Boltzmann Machine (RBM) is a particular type of Markov Random Field with a two-layer architecture, in which visible binary stochastic units v are connected to hidden binary stochastic units h. The energy of the state {v, h} is given by

E(v, h) = - sum_i b_i v_i - sum_j a_j h_j - sum_{i,j} v_i h_j w_ij,

where {w, b, a} are the model parameters and w_ij represents the symmetric interaction term between visible unit i and hidden unit j. The joint distribution over the visible and hidden units is defined by

P(v, h) = exp(-E(v, h)) / Z, with Z = sum_{v,h} exp(-E(v, h)),

where Z is known as the partition function or normalizing constant.
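
A minimal sketch of the Bernoulli-Bernoulli energy and of a brute-force partition function; the exact sum over all states is shown only to make the definition concrete, since it is tractable only for toy model sizes. All names are illustrative.

```python
import numpy as np
from itertools import product

def energy(v, h, W, b_v, b_h):
    # E(v, h) = - b_v.v - b_h.h - v.W.h
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

def partition_function(W, b_v, b_h):
    # Brute-force Z: sums over all 2^(n_v + n_h) states; toy sizes only
    n_v, n_h = W.shape
    Z = 0.0
    for v in product([0, 1], repeat=n_v):
        for h in product([0, 1], repeat=n_h):
            Z += np.exp(-energy(np.array(v), np.array(h), W, b_v, b_h))
    return Z
```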

Restricted Boltzmann Machines (Bernoulli-Bernoulli)

Because there are no visible-visible or hidden-hidden connections, the conditional distributions over the hidden units h and the visible units v factorize and are given by the logistic sigmoid function:

p(h_j = 1 | v) = logistic(a_j + sum_i v_i w_ij),
p(v_i = 1 | h) = logistic(b_i + sum_j h_j w_ij).
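
The factorized conditionals can be sketched as follows (numpy only, illustrative names):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b_h):
    # p(h_j = 1 | v) = logistic(b_hj + sum_i v_i w_ij)
    return logistic(b_h + v @ W)

def p_v_given_h(h, W, b_v):
    # p(v_i = 1 | h) = logistic(b_vi + sum_j h_j w_ij)
    return logistic(b_v + h @ W.T)
```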

Restricted Boltzmann Machines (Gaussian-Bernoulli)

Because of the need to model real-valued inputs that follow a Gaussian distribution, the Gaussian-Bernoulli RBM was proposed in the literature. The energy of the Gaussian RBM is defined as

E(v, h) = sum_i (v_i - b_i)^2 / (2 s_i^2) - sum_j a_j h_j - sum_{i,j} (v_i / s_i) h_j w_ij,

where s_i is the standard deviation of visible unit i. The conditional distributions over the hidden units h and the visible units v are given by

p(h_j = 1 | v) = logistic(a_j + sum_i (v_i / s_i) w_ij),
p(v_i | h) = N(b_i + s_i sum_j h_j w_ij, s_i^2).
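
A corresponding sketch for the Gaussian-Bernoulli case; `sigma` is the vector of per-dimension standard deviations and all names are illustrative.

```python
import numpy as np

def gb_p_h_given_v(v, W, b_h, sigma):
    # p(h_j = 1 | v) = logistic(b_hj + sum_i (v_i / sigma_i) w_ij)
    return 1.0 / (1.0 + np.exp(-(b_h + (v / sigma) @ W)))

def gb_sample_v_given_h(h, W, b_v, sigma, rng=np.random.default_rng()):
    # p(v_i | h) = N(b_vi + sigma_i * sum_j h_j w_ij, sigma_i^2)
    mean = b_v + sigma * (h @ W.T)
    return rng.normal(mean, sigma)
```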

Restricted Boltzmann Machines (Tuning)

Restricted Boltzmann Machines, like traditional neural networks, have some hyper-parameters that must be carefully set and tuned during training:
- Learning rate
- Momentum
- Mini-batch size

Learning rate: the learning rate, or step size, determines how fast the parameters move towards the energy minimum. A higher learning rate means faster movement, but may drive the weights to extremely large values. On the other hand, if the learning rate is too small, learning slows down and more training time is needed to converge. It should be noted that when training a Gaussian-Bernoulli RBM, the learning rate is usually chosen to be much smaller (about 1000 times smaller) than for a Bernoulli-Bernoulli RBM.
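
A hedged sketch of a single CD-1 (contrastive divergence) parameter update, showing where the learning rate enters; CD-1 is the usual training rule for RBMs [3], but the code below is only an illustration, not the authors' implementation.

```python
import numpy as np

def cd1_update(v0, W, b_v, b_h, learning_rate=0.01, rng=np.random.default_rng()):
    logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Positive phase: hidden probabilities and a binary sample given the data
    p_h0 = logistic(b_h + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruction)
    p_v1 = logistic(b_v + h0 @ W.T)
    p_h1 = logistic(b_h + p_v1 @ W)
    # Approximate log-likelihood gradient, scaled by the learning rate
    W += learning_rate * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += learning_rate * (v0 - p_v1)
    b_h += learning_rate * (p_h0 - p_h1)
    return W, b_v, b_h
```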

Restricted Boltzmann Machines (Tuning)

Momentum: the momentum method simulates a heavy ball rolling down the error surface. The ball builds up velocity along the floor of a ravine, but not across the ravine, because the opposing gradients on opposite sides of the ravine cancel each other out over time. Instead of using the estimated gradient times the learning rate to increment the parameter values directly, the momentum method uses this quantity to increment the velocity, v, of the parameters, and the current velocity is then used as the parameter increment. The momentum parameter helps prevent the system from getting stuck in a poor local minimum or saddle point.
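
The momentum rule described above, as a short sketch (here `grad` is the estimated gradient used as the update direction, e.g. the approximate log-likelihood gradient of the CD-1 sketch):

```python
def momentum_update(param, velocity, grad, learning_rate=0.01, momentum=0.9):
    # The gradient times the learning rate increments the velocity,
    # and the velocity (not the raw gradient) increments the parameters.
    velocity = momentum * velocity + learning_rate * grad
    param = param + velocity
    return param, velocity
```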

Restricted Boltzmann Machines (Tuning)

Batch learning: during training, a single training case could be used to update the network parameters at each step. However, it is usually more efficient to divide the training set into small "mini-batches" of 10 to 100 cases. This allows matrix-matrix multiplies and makes each epoch quicker, although the network may require more epochs to reach convergence.

Initial values of weights and biases: the weights are typically initialized to small random values drawn from a zero-mean Gaussian with a standard deviation of about 0.01. Using larger random values can speed up the initial learning, but it may lead to a slightly worse final model. Care should be taken to ensure that the initial weight values do not allow typical visible vectors to drive the hidden unit probabilities very close to 1 or 0, as this significantly slows the learning.
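
A small sketch of the initialization and mini-batching described above; the layer sizes and batch size are illustrative (506 corresponds to the 11-frame x 46-parameter input context used in the next slide).

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, batch_size = 506, 1024, 100   # illustrative sizes

# Weights: small zero-mean Gaussian values, std ~ 0.01; biases start at zero
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

def minibatches(data, batch_size):
    # Split the (shuffled) training set into mini-batches of 10-100 cases
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]
```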

Speaker Recognition by means of Deep Belief Networks

In contrast with the previous approach based on GMM-UBM and i-vectors, a Deep Belief Network is proposed. Figure: topology of a 5-layer RBM stack, where the visible layer of the first RBM takes a context of 11 frames of 46 parameters each.
- Our proposal: train a deep belief network using as input data a wide context of the frames of several segments.
- Our assumption is that the probability distribution of the hidden units of the RBMs, i.e. its shape, carries information about the speaker identity, because for long enough utterances the phonetic content of the segment is averaged out.

Speaker Recognition by means of Deep Belief Networks

Three main sets of pseudo-i-vectors have been extracted using the same network:
- Empirical mean and variance
- Beta distribution fitting
- Legendre polynomial fitting

Empirical mean and variance: the simplest pseudo-i-vector extraction approach computes the average value of the probability that a given output unit j is active,

m_j = (1/T) sum_t o_jt,

where T is the number of frames of the speaker segment and o_jt is the activation probability of unit j at frame t, without performing any decorrelation of the obtained means. Appending to the feature set the vector of the variances of each output node probability gives far better results. In both cases a PCA projection was applied.
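
A hedged sketch of the empirical mean and variance pseudo-i-vector extraction, assuming `activations` holds the output-node probabilities o_jt of one segment; the PCA step is shown with scikit-learn purely as an illustration, with an arbitrary number of components.

```python
import numpy as np

def mean_var_pseudo_ivector(activations):
    # activations: array of shape (T, n_nodes) holding the output-node
    # probabilities o_jt of one speech segment
    means = activations.mean(axis=0)      # m_j = (1/T) * sum_t o_jt
    variances = activations.var(axis=0)   # per-node variance of o_jt
    return np.concatenate([means, variances])

# Stacking the pseudo-i-vectors of all segments and projecting with PCA
# (illustrative only):
#   from sklearn.decomposition import PCA
#   pseudo = np.stack([mean_var_pseudo_ivector(a) for a in segments])
#   projected = PCA(n_components=200).fit_transform(pseudo)
```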

Beta Distribution Fitting

Figure: distribution of the activation probability and the fitted Beta pdf for three output nodes. The activation probability of a specific node is concentrated near 0 or 1, and its distribution has a shape that is better fitted by a Beta distribution.

Beta Distribution Fitting

We assume that the activation probabilities o_jt are realizations of random variables and that each random variable follows a Beta distribution,

Beta(o; alpha_j, beta_j) = o^(alpha_j - 1) (1 - o)^(beta_j - 1) / B(alpha_j, beta_j).

The pairs of parameters (alpha_j, beta_j) of the Beta distributions can be estimated by maximizing the log-likelihood of the set of T observations of a speech segment:

L(alpha_j, beta_j) = (alpha_j - 1) sum_t log o_jt + (beta_j - 1) sum_t log(1 - o_jt) - T log B(alpha_j, beta_j).
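
A sketch of the per-node Beta fitting; `scipy.stats.beta.fit` is used here as one convenient way to maximize the log-likelihood and is an assumption, not necessarily the authors' exact estimation procedure.

```python
import numpy as np
from scipy.stats import beta

def beta_pseudo_ivector(activations, eps=1e-6):
    # activations: shape (T, n_nodes), probabilities o_jt in (0, 1)
    params = []
    for o_j in activations.T:
        o_j = np.clip(o_j, eps, 1.0 - eps)   # keep observations strictly inside (0, 1)
        # Maximum-likelihood estimate of (alpha_j, beta_j) with fixed support [0, 1]
        a_j, b_j, _, _ = beta.fit(o_j, floc=0.0, fscale=1.0)
        params.extend([a_j, b_j])
    return np.array(params)
```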

Legendre Polynomial Fitting

In this approximation, the weighting coefficients of a linear combination of Legendre polynomials are used to characterize the speaker. The first Legendre polynomials are P0(x) = 1, P1(x) = x, P2(x) = (3x^2 - 1)/2, P3(x) = (5x^3 - 3x)/2, and so on; we used Legendre polynomials up to order 13. Legendre polynomials have previously been used to model prosodic features for speaker verification.
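
A sketch of fitting each node's activation trajectory with Legendre polynomials up to order 13 using `numpy.polynomial.legendre.legfit`; mapping the frame index onto [-1, 1] is an assumption.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_pseudo_ivector(activations, degree=13):
    # activations: shape (T, n_nodes); fit each node's trajectory o_j(t)
    T = activations.shape[0]
    t = np.linspace(-1.0, 1.0, T)              # map frame index to [-1, 1]
    coeffs = [legendre.legfit(t, o_j, degree)  # weighting coefficients per node
              for o_j in activations.T]
    return np.concatenate(coeffs)              # (degree + 1) coefficients per node
```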

Results
- Reference PLDA system and first results
- Beta distribution fitting
- Performance of the mean, variance, and Legendre approximations

References
[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep Neural Networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[2] L. Deng, G. Hinton, and B. Kingsbury, "New types of Deep Neural Network learning for speech recognition and related applications: An overview," in Proceedings of ICASSP 2013, pp. 8599-8603, 2013.
[3] G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, pp. 1771-1800, 2002.
[4] National Institute of Standards and Technology, NIST speech group web: http://www.nist.gov/itl/iad/mig/upload/nist_sre12_evalplan-v17-r1.pdf

Thank you