Speaker Representation and Verification Part II by Vasileios Vasilakakis
Outline
- Approaches of Neural Networks in Speaker/Speech Recognition
- Feed-Forward Neural Networks
- Training with Back-propagation
- Generative Pre-training: Restricted Boltzmann Machines
- RBMs: Bernoulli-Bernoulli RBM
- RBMs: Gaussian-Bernoulli RBM
- RBMs: tuning
- RBMs for Speaker Verification
Approaches of Neural Networks in Speaker Recognition
In the previous lecture, we explained the GMM-UBM approach for speaker characterization and classification. Despite the success of the GMM-UBM approach in modeling small vectors of acoustic features, deep neural networks have proven more effective at exploiting the information embedded in a large window of frames. Additionally, Auto-Associative Neural Networks (AANNs), trained to reconstruct the input features, or simple neural network classifiers, have been used to compress into a bottleneck layer the information carried by a window spanning a wide context. AANNs have also been used without exploiting a wide input context, still using the compression layer as a feature extractor for training i-vector systems, or using the bottleneck weights as i-vectors.
Feed-Forward Neural Network
A feed-forward neural network is an artificial network whose directed graph, with a set of inputs and a set of outputs, represents a function over the inputs. Among feed-forward neural networks, the most commonly used are the Auto-Associative Neural Network (or autoencoder), which learns to reconstruct its input, and the Multilayer Perceptron, which performs classification. A feed-forward neural network can have more than one layer of hidden units between its input and output. Each hidden unit j typically uses the logistic function to map its total input x_j from the layer below to its output y_j:

y_j = logistic(x_j) = 1 / (1 + e^(-x_j)),    x_j = b_j + Σ_i y_i w_ij

where b_j is the bias of unit j, i is an index over units in the layer below, and w_ij is the weight on the connection to unit j from unit i in the layer below.
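The hidden-unit mapping above can be sketched in a few lines of numpy; the sizes, inputs, and small-random initialization are illustrative only:

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) activation used by each hidden unit."""
    return 1.0 / (1.0 + np.exp(-x))

def hidden_activations(y_below, W, b):
    """y_j = logistic(b_j + sum_i y_i * w_ij): maps the outputs of the
    layer below to the hidden-unit activations of the next layer."""
    return logistic(b + y_below @ W)

# Toy example: 3 inputs -> 2 hidden units (values are illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(3, 2))   # small zero-mean random init
b = np.zeros(2)
y = hidden_activations(np.array([0.2, -0.5, 0.9]), W, b)
```

Stacking several such layers, with the outputs of one layer fed as `y_below` to the next, gives the multi-layer network described above.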
Training Feed-Forward Neural Networks with Back-propagation
Traditionally, feed-forward neural networks are trained using the back-propagation algorithm, which minimizes a given error function over the input space. The error function can be expressed as:

E = Σ_p E_p

where the error for a given pattern p can be expressed as the sum of the errors of each output unit:

E_p = 1/2 Σ_j (t_j - y_j)^2

Back-propagation minimizes the error function using gradient descent in weight space. Weights are adjusted according to:

Δw_ij = -η ∂E/∂w_ij

where η is the learning rate.
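A minimal sketch of one back-propagation step for a single-hidden-layer network with logistic units and squared error; the network sizes, learning rate, and training pattern are illustrative assumptions:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, b1, W2, b2, eta=0.5):
    """One gradient-descent step on E_p = 1/2 * sum_j (t_j - y_j)^2
    for a single pattern (x, t); returns the updated parameters."""
    # forward pass
    h = logistic(b1 + x @ W1)
    y = logistic(b2 + h @ W2)
    # backward pass: error terms for logistic units (y' = y * (1 - y))
    dy = (y - t) * y * (1 - y)
    dh = (dy @ W2.T) * h * (1 - h)
    # gradient descent: delta_w = -eta * dE/dw
    W2 = W2 - eta * np.outer(h, dy)
    b2 = b2 - eta * dy
    W1 = W1 - eta * np.outer(x, dh)
    b1 = b1 - eta * dh
    return W1, b1, W2, b2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(scale=0.1, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.1, size=(4, 1)), np.zeros(1)
x, t = np.array([1.0, 0.0]), np.array([1.0])
for _ in range(200):
    W1, b1, W2, b2 = backprop_step(x, t, W1, b1, W2, b2)
```

After repeated steps the network output for `x` moves toward the target `t`, illustrating the gradient descent in weight space described above.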
Generative Pre-training
Deep neural networks with many hidden layers are difficult to train using back-propagation:
- BP from a random starting point near the origin is not the best way to find a good set of weights, and unless the initial scales of the weights are carefully chosen, the back-propagated gradients will have very different magnitudes in different layers.
- BP can get stuck in poor local optima.
The idea behind generative pre-training is to train each layer independently, and then use the transformed input space as input to the next hidden layer. Restricted Boltzmann Machines (RBMs) are usually used to learn each layer of feature transformations to be used for the next layer. Hinton et al. showed that generative pre-training provides the deep neural network with better initial weights than random initialization.
Restricted Boltzmann Machines (Bernoulli-Bernoulli)
A Restricted Boltzmann Machine (RBM) is a particular type of Markov Random Field with a two-layer architecture, in which visible binary stochastic units v are connected to hidden binary stochastic units h. The energy of the state {v, h} is given by:

E(v, h) = -Σ_i b_i v_i - Σ_j a_j h_j - Σ_i Σ_j v_i h_j w_ij

where θ = {W, a, b} are the model parameters and w_ij represents the symmetric interaction term between visible unit i and hidden unit j. The joint distribution over the visible and hidden units is defined by:

P(v, h) = (1/Z) e^(-E(v, h))

where:

Z = Σ_v Σ_h e^(-E(v, h))

is known as the partition function or normalizing constant.
Restricted Boltzmann Machines (Bernoulli-Bernoulli)
The conditional distributions over the hidden units h and the visible units v are given by the logistic sigmoid function σ(x) = 1/(1 + e^(-x)):

P(h_j = 1 | v) = σ(a_j + Σ_i v_i w_ij)
P(v_i = 1 | h) = σ(b_i + Σ_j h_j w_ij)
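These conditionals are what make RBM training tractable: reference [3] trains the model with contrastive divergence (CD-1), which alternates sampling from the two conditionals above. A minimal sketch, with illustrative sizes and learning rate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, eta=0.1, rng=None):
    """One CD-1 update of a Bernoulli-Bernoulli RBM, using
    P(h_j=1|v) = sigmoid(a_j + sum_i v_i w_ij) and
    P(v_i=1|h) = sigmoid(b_i + sum_j h_j w_ij)."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(a + v0 @ W)                       # up: hidden probs
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hiddens
    pv1 = sigmoid(b + h0 @ W.T)                     # down: reconstruction
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(a + v1 @ W)                       # up again
    # contrastive-divergence approximation to the likelihood gradient
    W = W + eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a = a + eta * (ph0 - ph1)
    b = b + eta * (v0 - v1)
    return W, a, b

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 3))   # 6 visible, 3 hidden units
a, b = np.zeros(3), np.zeros(6)
v0 = (rng.random(6) < 0.5).astype(float)  # one binary training case
W, a, b = cd1_step(v0, W, a, b, rng=rng)
```

In practice the update is averaged over a mini-batch of training cases rather than applied per case, as discussed in the tuning slides.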
Restricted Boltzmann Machines (Gaussian-Bernoulli)
Because of the need to model real-valued inputs following a Gaussian distribution, the Gaussian-Bernoulli RBM was proposed in the literature. The energy of the Gaussian RBM is defined as follows:

E(v, h) = Σ_i (v_i - b_i)^2 / (2σ_i^2) - Σ_j a_j h_j - Σ_i Σ_j (v_i / σ_i) h_j w_ij

The conditional distributions over the hidden units h and the visible units v are given by:

P(h_j = 1 | v) = σ(a_j + Σ_i (v_i / σ_i) w_ij)
P(v_i | h) = N(v_i ; b_i + σ_i Σ_j h_j w_ij, σ_i^2)
Restricted Boltzmann Machines (Tuning)
Like traditional neural networks, Restricted Boltzmann Machines have some hyper-parameters that should be carefully set and tuned during training:
- Learning rate
- Momentum
- Mini-batch size
Learning rate: The learning rate, or step size, determines how fast the parameters move towards the energy minimum. A higher learning rate means faster movement, but it may drive the weights to extremely large values. On the other hand, if the learning rate is too small, learning slows down and training takes longer to converge. Note that when training a Gaussian-Bernoulli RBM, the learning rate is usually chosen much smaller (about 1000 times smaller) than for a Bernoulli-Bernoulli RBM.
Restricted Boltzmann Machines (Tuning)
Momentum: The momentum method simulates a heavy ball rolling down a surface. The ball builds up velocity along the floor of a ravine, but not across the ravine, because the opposing gradients on the opposite sides of the ravine cancel each other out over time. Instead of using the estimated gradient times the learning rate to increment the parameter values directly, the momentum method uses this quantity to increment the velocity v of the parameters, and the current velocity is then used as the parameter increment. The momentum parameter helps prevent the system from getting stuck in a local minimum or saddle point.
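The velocity-based update described above can be written in two lines; the learning rate, momentum coefficient, and toy objective are illustrative assumptions:

```python
import numpy as np

def momentum_update(w, velocity, grad, eta=0.01, alpha=0.9):
    """Momentum method: the gradient increments a velocity, and the
    velocity (not the raw gradient) increments the parameters:
        v <- alpha * v - eta * dE/dw
        w <- w + v"""
    velocity = alpha * velocity - eta * grad
    return w + velocity, velocity

# Minimize a toy 1-D objective E(w) = w^2 (gradient dE/dw = 2w)
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_update(w, v, grad=2.0 * w)
```

With `alpha = 0` this reduces to plain gradient descent; values of `alpha` near 0.9 are a common choice once learning has stabilized.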
Restricted Boltzmann Machines (Tuning)
Mini-batch learning: During training, a single training case could be used to update the network parameters each time. However, it is more efficient to divide the training set into small "mini-batches" of 10 to 100 cases. This allows matrix-matrix multiplies and makes each epoch faster, but the network requires more epochs to reach convergence.
Initial values of weights and biases: The weights are typically initialized to small random values drawn from a zero-mean Gaussian with a standard deviation of about 0.01. Using larger random values can speed up the initial learning, but it may lead to a slightly worse final model. Care should be taken that the initial weight values do not allow typical visible vectors to drive the hidden-unit probabilities very close to 1 or 0, as this significantly slows the learning.
Speaker Recognition by means of Deep Belief Networks
In contrast with the previous approach of GMM-UBM and i-vectors, a Deep Belief Network is proposed.
[Figure: topology of a 5-layer RBM stack; the visible layer of the first RBM takes a context of 11 frames consisting of 46 parameters each.]
- Our proposal: train a deep belief network using as input data a wide context of the frames of several segments.
- Our assumption is that the shape of the probability distributions of the hidden units of the RBMs carries information about the speaker identity, because for long enough utterances the phonetic content of the segment is averaged out.
Speaker Recognition by means of Deep Belief Networks
Three main sets of pseudo-i-vectors have been extracted using the same network:
- Empirical mean and variance
- Beta distribution fitting
- Legendre polynomial fitting
Empirical mean and variance: The simplest pseudo-i-vector extraction approach computes the average value of the probability that a given output unit j is active:

m_j = (1/T) Σ_t P(h_j = 1 | v_t)

where T is the number of frames of the speaker segment, without performing decorrelation of the obtained means. Appending to the feature set the vector of the variances of each output-node probability gives far better results. In both cases a PCA projection was applied.
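The mean-and-variance pseudo-i-vector is straightforward to compute from the per-frame activation probabilities; the segment length, number of units, and random activations below are illustrative, and the final PCA projection is not shown:

```python
import numpy as np

def mean_var_pseudo_ivector(H):
    """H: (T, J) matrix of hidden-unit activation probabilities for the
    T frames of a segment. Returns the 2J-dimensional vector of the
    per-unit empirical means followed by the per-unit variances."""
    return np.concatenate([H.mean(axis=0), H.var(axis=0)])

rng = np.random.default_rng(0)
H = rng.random((100, 8))          # hypothetical activations: T=100, J=8
vec = mean_var_pseudo_ivector(H)  # 16-dimensional pseudo-i-vector
```

In the actual system `H` would be the top-layer activation probabilities of the trained DBN, and `vec` would then be reduced with PCA before PLDA scoring.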
Beta Distribution Fitting
[Figure: distribution of the activation probability and the fitted Beta pdf for three output nodes.]
The activation probability of a specific node is concentrated near 0 or 1, and its distribution has a shape that is better fit by a Beta distribution.
Beta Distribution Fitting
We assumed that the activation probabilities are realizations of random variables, and that each random variable follows a Beta distribution:

f(x; α, β) = x^(α-1) (1 - x)^(β-1) / B(α, β)

The pairs of parameters (α_j, β_j) of the Beta distribution can be estimated by maximizing the log-likelihood of the set of T observations of a speech segment.
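The slide estimates (α, β) by maximum likelihood; as a simpler, self-contained sketch, the method of moments (a common initializer for the ML solution) matches the empirical mean and variance of the activations to those of a Beta distribution. The synthetic data below is an assumption for illustration:

```python
import numpy as np

def fit_beta_moments(x):
    """Method-of-moments estimate of Beta(alpha, beta) parameters for
    activation probabilities x in (0, 1). Matches the Beta mean
    m = a/(a+b) and variance v = a*b / ((a+b)^2 * (a+b+1))."""
    m, v = x.mean(), x.var()
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Synthetic "activation probabilities" drawn from a known Beta(2, 5)
rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=10_000)
alpha, beta = fit_beta_moments(x)   # estimates close to (2, 5)
```

Per segment, one such (α_j, β_j) pair is fitted for each output node j, and the concatenated pairs form the pseudo-i-vector.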
Legendre Polynomial Fitting
In this approximation, the weighting coefficients of the linear combination of a set of Legendre polynomials are used to characterize the speaker. The Legendre polynomials up to order 5 are:

P_0(x) = 1
P_1(x) = x
P_2(x) = (3x^2 - 1) / 2
P_3(x) = (5x^3 - 3x) / 2
P_4(x) = (35x^4 - 30x^2 + 3) / 8
P_5(x) = (63x^5 - 70x^3 + 15x) / 8

We used Legendre polynomials up to order 13. Legendre polynomials have been used before to model prosodic features for speaker verification.
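A least-squares fit of a hidden unit's activation trajectory with Legendre polynomials can be done with numpy's `legfit`; the trajectory below and the mapping of frame index to [-1, 1] are illustrative assumptions:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_features(traj, order=13):
    """Fit the activation trajectory of one hidden unit (length T) with
    a linear combination of Legendre polynomials over t in [-1, 1];
    the order+1 fitted coefficients characterize the speaker (order 13
    as on the slide)."""
    t = np.linspace(-1.0, 1.0, len(traj))
    return legendre.legfit(t, traj, deg=order)

# Synthetic activation trajectory for one unit over T=200 frames
traj = 0.4 * np.sin(np.linspace(0.0, 3.0, 200)) + 0.5
coeffs = legendre_features(traj)   # 14 coefficients (orders 0..13)
```

Concatenating the coefficient vectors of all output units gives the Legendre pseudo-i-vector for the segment.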
Results
- Reference PLDA system and first results
- Beta distribution fitting
- Performance of mean, variance, and Legendre approximations
References
[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[2] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: An overview," in Proc. ICASSP 2013, pp. 8599-8603, 2013.
[3] G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, pp. 1771-1800, 2002.
[4] National Institute of Standards and Technology, NIST speech group web, http://www.nist.gov/itl/iad/mig/upload/nist_sre12_evalplan-v17-r1.pdf
Thank you