Heeyoul (Henry) Choi, Dept. of Computer Science, Texas A&M University, hchoi@cs.tamu.edu

Introduction. Speaker Adaptation. Eigenvoice. Comparison with others: MAP, MLLR, EMAP, RMP, CAT, RSW. Experiments. Future work. Summary.

Speaker-dependent (SD) system vs. speaker-independent (SI) system. Speaker adaptation: finding an SD system for a new speaker from a small amount of data. This paper is about making the adaptation faster, based on the eigenvoice approach.

Model-based algorithms adapt to a new speaker by modifying the parameters of the system's speaker model. Examples: maximum a posteriori (MAP), maximum likelihood linear regression (MLLR). They require significant amounts of adaptation data from the new speaker.

Speaker-space algorithms constrain the adapted model to be a linear combination of a small number of basis vectors derived from the reference speakers. Faster and more robust. Related to speaker clustering in that they reduce the dimension of the parameter space to search. Resemble extended MAP (EMAP) in that they use a priori information from the reference speakers; here, the prior information is used to reduce the parameter space. Eigenvoice is one of these algorithms.

Eigenvoice finds basis vectors that are orthogonal to each other and efficient in the sense of capturing the variation. It has all the properties of principal component analysis (PCA); PCA is applied to the parameter space.

Eigenvoice is analogous to eigenface in face images: a face is a weighted sum of eigenfaces, which are eigenvectors of the face-image covariance. PCA: components are ordered by eigenvalue and guarantee the minimum mean-square reconstruction error.
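To make the PCA property concrete, here is the standard reconstruction formula; the notation ($\bar{\mathbf{x}}$, $\mathbf{e}_j$, $w_j$) is mine, not from the slides:

\[
  \mathbf{x} \approx \bar{\mathbf{x}} + \sum_{j=1}^{K} w_j \mathbf{e}_j,
  \qquad w_j = \mathbf{e}_j^{\top}(\mathbf{x} - \bar{\mathbf{x}}),
\]

where $\mathbf{e}_1, \dots, \mathbf{e}_K$ are the eigenvectors of the data covariance with the $K$ largest eigenvalues. Among all $K$-dimensional linear reconstructions this choice minimizes the mean-square error, and the residual error equals the sum of the discarded eigenvalues $\sum_{j>K} \lambda_j$.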

Face recognition => speaker recognition? How? Speaker recognition vs. speech recognition: an efficient representation for speakers -> a good SD model for the new speaker -> better speech recognition of the new speaker.

HMMs, EM, PCA.

Supervector: the model parameters, i.e., the means of the HMM output Gaussians. Not the voice itself (so "eigen-model-parameter" would be a more accurate name). Instead of PCA, one could use independent component analysis (ICA) ("factorialvoice", as in factorialface) or linear discriminant analysis (LDA) (generalized eigenvoice). They used the correlation matrix instead of the covariance matrix.
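As a rough illustration of how the supervectors and eigenvoices could be computed (a minimal Python/NumPy sketch; the array shapes, function names, and the use of SVD are my assumptions, not details from the paper):

import numpy as np

def build_supervectors(sd_means):
    # sd_means: (R, G, d) -- for each of R reference speakers, the means of
    # the G HMM output Gaussians in d-dimensional feature space.
    # Concatenating one speaker's means gives one supervector of size D = G*d.
    R = sd_means.shape[0]
    return sd_means.reshape(R, -1)

def eigenvoices(supervectors, K):
    # PCA on the supervectors: return the mean supervector and the top-K
    # eigenvectors (the eigenvoices), ordered by decreasing variance.
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    # SVD of the R x D data matrix avoids forming the huge D x D covariance
    # (D = G*d can be thousands, while R is only ~100 reference speakers).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:K]

# Toy usage with shapes matching the experiment slide:
# R = 120 speakers, 26 letters x 6 states = 156 Gaussians, 18 features -> D = 2808.
rng = np.random.default_rng(0)
sd_means = rng.normal(size=(120, 156, 18))
mean_sv, E = eigenvoices(build_supervectors(sd_means), K=10)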

The hidden states of every SD model come from the SI model. There is insufficient data for an SD model, but enough for the SI model; (SI + a small amount of data) is enough to build an SD model. The hidden states act as a kind of speaker-invariant feature; each speaker's characteristics come from the mixture of Gaussians for each state after adaptation. So it makes sense to build the new speaker's model with the same states and transition probabilities as the SI model.

Now the adaptation procedure is redefined as finding K parameters. Computationally cheap: only the weights of the eigenvoices are estimated instead of all the parameters. It requires only a small amount of adaptation data (think of the curse of dimensionality) and is robust against noise: the discarded eigenvectors corresponding to small eigenvalues may mostly capture noise.

Initialization: the weights and the other parameters (variances and transition probabilities) come from the SI model. The parameters for the new speaker are a linear combination of the eigenvoices (see the formula below); the problem is to estimate the weights w(j) from the data. Maximum likelihood eigen-decomposition (MLED): Gaussian mean adaptation in a continuous density hidden Markov model (CDHMM).
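The formula this slide refers to is presumably the eigenvoice constraint itself; in the usual formulation (my indexing) the new speaker's mean supervector is

\[
  \hat{\mu} = \sum_{j=1}^{K} w(j)\, e(j),
\]

where $e(1), \dots, e(K)$ are the eigenvoices (one of them may simply be the mean supervector) and the only free parameters are the $K$ weights $w(j)$, estimated from the adaptation data by MLED.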

Likelihood ($\lambda$ = {means of the Gaussians}):

\[ P(O \mid \lambda) = \sum_{m,s} P(O, m, s \mid \lambda) = \sum_{m,s} P(m, s \mid \lambda)\, P(O \mid m, s, \lambda) \]

Auxiliary function:

\[ Q(\lambda, \hat{\lambda}) = \sum_{m,s} P(O, m, s \mid \lambda) \log P(O, m, s \mid \hat{\lambda}) \]

\[ P(O, m, s \mid \hat{\lambda}) = \prod_t P(o_t, m, s \mid \hat{\lambda}) = \prod_t P(o_t \mid m, s, \hat{\lambda})\, P(m, s \mid \hat{\lambda}) \]

where $P(o_t \mid m, s, \hat{\lambda})$ is a Gaussian distribution.

Finally, setting the derivative of Q with respect to each weight w(j) to zero yields a system of K linear equations, which is solved for the weights (the MLED estimate).

In the Gaussians, the mean estimates are the corresponding components of the weighted sum of eigenvoices. The parameters are thus reduced from the D-dimensional means to the K-dimensional weights (K << D).
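A minimal sketch of the MLED weight estimation for the diagonal-covariance case, assuming the E-step has already produced, for every Gaussian g, the occupancy count n[g] = sum_t gamma_t(g) and the first-order statistic f[g] = sum_t gamma_t(g) * o_t; the variable names and shapes are mine:

import numpy as np

def mled_weights(E, var, n, f):
    # Maximize the auxiliary function Q over the eigenvoice weights w, with
    # the new speaker's means constrained to mu_g = sum_k w[k] * E[k, g].
    # E   : (K, G, d) eigenvoices, reshaped per Gaussian
    # var : (G, d)    diagonal covariances of the output Gaussians
    # n   : (G,)      occupancy counts   sum_t gamma_t(g)
    # f   : (G, d)    first-order stats  sum_t gamma_t(g) * o_t
    # Setting dQ/dw(k) = 0 gives the K x K linear system A w = b below.
    K = E.shape[0]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for k in range(K):
        b[k] = np.sum(E[k] * f / var)
        for j in range(K):
            A[k, j] = np.sum(n[:, None] * E[k] * E[j] / var)
    return np.linalg.solve(A, b)

def adapted_means(E, w):
    # New speaker's Gaussian means: a weighted sum of the eigenvoices.
    return np.tensordot(w, E, axes=1)   # shape (G, d)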

Model-based algorithms. Maximum a posteriori (MAP): uses prior information about the parameters via Bayes' rule; updates only the parameters of Gaussians that have observations o_t; the number of parameters is large. Maximum likelihood linear regression (MLLR): updates all the parameters through a linear regression (transformation); much less constrained by prior knowledge (only a little, via the SI model); the number of parameters is small (the transformation matrix). Eigenvoice puts a heavy emphasis on prior knowledge through the eigenvectors and updates all the parameters via the weights (each eigenvector is D-dimensional); the number of parameters is even smaller than for MLLR.
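To make the comparison concrete, here is a rough count of the adapted parameters, assuming feature dimension d, G Gaussians, K eigenvoices, and a single MLLR regression class (my assumption; exact numbers depend on the variant used):

MAP:        up to G*d (one mean update per observed Gaussian)
MLLR:       $\hat{\mu} = A\mu + b$, i.e. d*(d+1) parameters for the transform
Eigenvoice: K weights

For the setup on the experiment slides (d = 18, G = 26*6 = 156, K around 5-10): G*d = 2808, d*(d+1) = 342, K = 5-10.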

Extended maximum a posteriori (EMAP): faster convergence by exploiting the correlations between observations; from one observation, all the correlated parameters are updated in proportion to how strongly they are related. Regression-based model prediction (RMP): EMAP applied to CDHMMs; still faster than MAP, with better performance than MAP; a kind of mixture of MLLR and MAP.

Hard speaker clustering: clusters of reference speakers, each yielding an SI-style model (cf. codewords); when the new speaker's data is available, choose one model, and then MLLR can be used. Soft speaker clustering: the new speaker's model is a linear combination of the reference speakers' models (clustering + MLLR); the clustering works as a prior for MLLR, which by itself has little prior information.

Reference speaker weighting (RSW): the new speaker's model is a linear combination of the reference models; the rest is the same as in the eigenvoice method. Works well with medium- or large-vocabulary systems. For class p and speaker r, c(p, r) is the centroid for each speaker (speaker-dependent), while v is speaker-independent. For the new speaker's model, the vector m(s) is obtained from m by maximum likelihood (equivalent to finding the weights; see the formula below). As the number R of reference speakers grows, RSW becomes more expensive in terms of memory and computation.
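Restating the RSW constraint from this slide in symbols (my notation for the weights): for each class p, the new speaker's mean is a weighted sum of the reference speakers' centroids,

\[
  m_p = \sum_{r=1}^{R} w_r\, c(p, r),
\]

with the R weights $w_r$ shared across classes and estimated by maximum likelihood on the adaptation data, just as the K eigenvoice weights are in MLED. This is why the cost grows with R: every centroid $c(p, r)$ must be stored and summed over, whereas eigenvoice keeps only K basis vectors, typically far fewer than R.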

Vowel classification: PCA is applied to the parameters of a vowel classifier (a mixture of Gaussians, MoG).

Database: R = 120 reference speakers and 30 test speakers. D = 2808 (parameter dimension) = 26 characters x 6 states (single Gaussian per state) x 18 features. PCA (0, ..., K eigenvoices) and MLED (2 iterations). In the figures, "MLED.5" means MLED with K = 5; "MLED.10 => MAP" means MAP with the MLED.10 model as the prior; otherwise the SI model is the prior.

Sensitivity to changes in the reference speaker models.

"...it is by no means universally true." (I. T. Jolliffe) 1st eigenvector: sex (strong). 2nd eigenvector: amplitude (strong). 3rd eigenvector: second formant (maybe). 4th eigenvector: changes in pitch (maybe).

Extensions of the eigenvoice approach: hybrids such as MLED+MAP in Fig. 2 (done). How about allowing K to rise as the amount of data increases? MLED + MLLR. Discriminative training of the reference SD models; environment adaptation. LDA rather than PCA. Learning the basis vectors by ML rather than by PCA. Eigenvoice adaptation of the state transition probabilities and the Gaussian standard deviations: I don't think so, although there are definitely some correlations between the states and the Gaussians.

Training models for large-vocabulary systems: there will be insufficient data per reference speaker, and the computational and storage requirements of a naive extension of the small-vocabulary methodology would be onerous. Principles: inter-speaker variability (in the K-space) vs. intra-speaker variability (by the Gaussians); efficient pooling of training data from different speakers; remove speaker-dependent characteristics from the training data at the beginning of training, rather than based on the SI model.

A fast way to do speaker adaptation, based on eigenvoices: maximum likelihood eigen-decomposition (MLED). Better than MLLR and MAP. For large-vocabulary data, MLED+MAP is best.