Why DNN Works for Acoustic Modeling in Speech Recognition?

Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang, Department of Computer Science and Engineering, York University, Toronto, Ont. M3J 1P3, Canada. Joint work with Y. Bao, J. Pan, and O. Abdel-Hamid.

Outline: Introduction; DNN/HMM for Speech; Bottleneck Features; Incoherent Training; Conclusions; Other DNN Projects. This talk is based on the following two papers: [1] J. Pan, C. Liu, Z. Wang, Y. Hu and H. Jiang, "Investigations of Deep Neural Networks for Large Vocabulary Continuous Speech Recognition," Proc. of the International Symposium on Chinese Spoken Language Processing (ISCSLP'2012), Hong Kong, December 2012. [2] Y. Bao, H. Jiang, L. Dai and C. Liu, "Incoherent Training of Deep Neural Networks to De-correlate Bottleneck Features for Speech Recognition," submitted to the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'13), Vancouver, Canada.

Introduction: ASR History
ASR formulation:
o GMM/HMM + n-gram + Viterbi search
Technical advances (incremental) over the past 10 years:
o Adaptation (speaker/environment): 5% rel. gain
o Discriminative training: 5-10% rel. gain
o Feature normalization: 5% rel. gain
o ROVER: 5% rel. gain
More and more data leads to better and better accuracy:
o read speech (>90%), telephony speech (>70%)
o meeting/voicemail recording (<60%)

Acoustic Modeling: Optimization
Acoustic modeling is a large-scale optimization problem:
o 2000+ hours of data: billions of training samples
o GMMs/HMM: 100+ million free parameters
Training methods:
o Maximum Likelihood Estimation (MLE)
o Discriminative Training (DT)
Engineering issues:
o Efficiency: feasible with 100-1000 CPUs
o Reliability: robust estimation of all parameters

Neural Networks for ASR
o 1990s: MLP for ASR (Bourlard and Morgan, 1994): NN/HMM hybrid model (worse than GMM/HMM)
o 2000s: TANDEM (Hermansky, Ellis, et al., 2000): use an MLP for feature extraction (5-10% rel. gain)
o 2006: DNN for small tasks (Hinton et al., 2006): RBM-based pre-training for DNNs
o 2010: DNN for small-scale ASR (Mohamed, Yi, et al., 2010)
o 2011: DNN for large-scale ASR: over 30% rel. gain on Switchboard (Seide et al., 2011)

ASR Frontend [diagram]: a sliding window over the waveform feeds feature extraction (linear prediction, filter bank) to produce feature vectors; related outputs include audio/speech coding (bit stream for transmission), audio segmentation (speech/music/noise), and speech recognition (words).

Short-time Analysis [diagram, repeated over four slides]: a sliding window advances along the waveform, yielding one feature vector per position.

ASR Frontend: GMM/HMM [diagram]: feature vectors from the sliding window feed the GMM/HMM system.

ASR Frontend: NN/HMM [diagram]: feature vectors from the sliding window feed the NN/HMM system.
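To make the sliding-window frontend concrete, here is a minimal numpy sketch of short-time analysis; the 25 ms frame length, 10 ms shift, Hamming window, and log-magnitude-spectrum features are illustrative assumptions, not the exact frontend used in the experiments.

```python
import numpy as np

def short_time_frames(waveform, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice a waveform into overlapping frames (sliding-window short-time analysis)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples (10 ms hop)
    n_frames = 1 + (len(waveform) - frame_len) // shift
    frames = np.stack([waveform[t * shift : t * shift + frame_len]
                       for t in range(n_frames)])
    return frames * np.hamming(frame_len)            # taper each frame

waveform = np.random.randn(16000)                    # 1 second of fake audio
frames = short_time_frames(waveform)
features = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)  # one vector per frame
print(frames.shape, features.shape)                  # (98, 400) (98, 201)
```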

NN for ASR: old and new
o Deeper networks: more hidden layers (1 to 6-7 layers)
o Wider networks: more hidden nodes; more output nodes (100 to 5-10K)
o More data: from 10-20 hours to 2-10k hours of training data

GMMs/HMM vs. DNN/HMM
o Different acoustic models: GMMs vs. DNN
o Different feature vectors: 1 frame vs. concatenated frames (11-15 frames)
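A minimal sketch of the frame concatenation contrasted above, assuming an 11-frame context window (5 frames on each side, edges padded by repetition) and 40-dimensional input features; the experiments use windows in the 11-15 frame range mentioned on the slide.

```python
import numpy as np

def splice_frames(features, left=5, right=5):
    """Concatenate each frame with its neighbours to form the DNN input window."""
    T, D = features.shape
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)], axis=0)
    return np.stack([padded[t : t + left + 1 + right].reshape(-1) for t in range(T)])

feats = np.random.randn(300, 40)       # 300 frames of 40-dim filter-bank features
spliced = splice_frames(feats)         # 11 concatenated frames per input vector
print(spliced.shape)                   # (300, 440)
```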

Experiment (I): GMMs/HMM vs. DNN/HMM
70-hour Chinese ASR task; 4000 tied HMM states. GMM: 30 Gaussians per state. DNN: pre-trained; 1024 nodes per layer; 1-6 hidden layers. Numbers are word error rates (%). NN-1: 1 hidden layer; DNN-6: 6 hidden layers. MPE GMM: discriminatively trained GMM/HMM.

Experiment (II): GMMs/HMM vs. DNN/HMM
320-hour English Switchboard task; 9000 tied HMM states. GMM: 40 Gaussians per state. DNN: pre-trained; 2000 nodes per layer; 1-5 hidden layers. Numbers are word error rates (%). NN-1: 1 hidden layer; DNN-3/5: 3/5 hidden layers. MPE GMM: discriminatively trained GMM/HMM.

Conclusions (I)
The gain of the DNN/HMM hybrid is almost entirely attributable to the concatenated frames:
o The concatenated features contain almost all of the additional information responsible for the gain.
o But they are highly correlated.
DNNs are powerful at leveraging highly correlated features.

What's next: how about GMM/HMM?
It is hard to exploit highly correlated features in GMMs: they require dimensionality reduction for de-correlation.
o Linear dimensionality reduction (PCA, LDA, KLT, ...): fails to compete with DNNs.
o Nonlinear dimensionality reduction: using an NN/DNN (Hinton et al.), a.k.a. bottleneck features; manifold learning (LLE, MDS, SNE, ...)?

Bottleneck (BN) Feature
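A minimal sketch of how bottleneck (BN) features can be read out of a network: forward-propagate the spliced input to the narrow hidden layer and take its activations as the new feature vector for the GMM/HMM system. The layer sizes, the sigmoid hidden layer, and the linear bottleneck here are illustrative assumptions, not the architecture used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy layer sizes: 440-dim spliced input -> 1024 hidden -> 40-dim bottleneck.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((440, 1024)) * 0.01, np.zeros(1024)
W_bn, b_bn = rng.standard_normal((1024, 40)) * 0.01, np.zeros(40)

def bottleneck_features(x):
    """Forward pass up to the bottleneck layer; its activations are the BN features."""
    h1 = sigmoid(x @ W1 + b1)
    return h1 @ W_bn + b_bn          # linear bottleneck outputs, fed to the GMM/HMM

x = rng.standard_normal((300, 440))  # a batch of spliced frames
bn = bottleneck_features(x)
print(bn.shape)                      # (300, 40)
```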

Experiment (I): Bottleneck (BN) Features 70-hour Chinese ASR Task (word error rate in %) MLE: maximum likelihood estimation MPE: discriminative training

Experiment (II): Bottleneck Features (BN) 320-hour English Switchboard Task (word error rate in %) MLE: maximum likelihood estimation MPE: discriminative training

Incoherent Training
Bottleneck (BN) features work, but:
o The BN layer hurts DNN performance a little.
o Correlation within the BN features goes up.
Can we do better? The idea: embed de-correlation into the back-propagation training of the DNN.
o De-correlate by constraining the columns of a weight matrix W.
o How to constrain?

Incoherent Training
Define the coherence of a DNN weight matrix W (with columns w_i) as

$$G(W) = \max_{i \neq j} |g_{ij}| = \max_{i \neq j} \frac{|w_i^\top w_j|}{\|w_i\|\,\|w_j\|}.$$

A matrix with smaller coherence has column vectors that are less similar to one another. To make G(W) smooth, the max is approximated using a soft-max.
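A numpy sketch of the coherence measure defined above, together with one possible soft-max (log-sum-exp) smoothing of the max; the smoothing constant beta and the 1024-by-40 matrix size are illustrative assumptions.

```python
import numpy as np

def coherence(W):
    """G(W) = max_{i != j} |w_i . w_j| / (||w_i|| ||w_j||) over the columns of W."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm columns
    g = np.abs(Wn.T @ Wn)                               # cosine similarities
    np.fill_diagonal(g, 0.0)                            # ignore i == j
    return g.max()

def soft_coherence(W, beta=20.0):
    """Differentiable soft-max surrogate of the max over |g_ij| (beta sets sharpness)."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    g = np.abs(Wn.T @ Wn)
    np.fill_diagonal(g, 0.0)
    return np.log(np.sum(np.exp(beta * g))) / beta

W = np.random.randn(1024, 40)       # e.g. weights feeding a 40-unit bottleneck layer
print(coherence(W), soft_coherence(W))
```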

Incoherent Training
All DNN weight matrices are optimized by minimizing a regularized objective function

$$F^{(\text{new})} = F^{(\text{old})} + \alpha \cdot \max_{W} G(W).$$

The derivatives of the coherence, $\partial G(W)/\partial w_k$, enter the gradient of the regularized objective, so back-propagation is still applicable.
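A sketch of the regularized objective for a single weight matrix, using the soft coherence as the penalty and a finite-difference gradient purely to show that the penalty term is differentiable; in practice the analytic derivative of G(W) with respect to w_k would be folded into back-propagation. The quadratic stand-in for the original training loss and the value of alpha are assumptions.

```python
import numpy as np

def soft_coherence(W, beta=20.0):
    """Soft-max surrogate of the coherence G(W) over the columns of W."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    g = np.abs(Wn.T @ Wn)
    np.fill_diagonal(g, 0.0)
    return np.log(np.sum(np.exp(beta * g))) / beta

def regularized_loss(W, base_loss, alpha=0.1):
    """F_new = F_old + alpha * G(W): original training loss plus the coherence penalty."""
    return base_loss(W) + alpha * soft_coherence(W)

def numerical_grad(f, W, eps=1e-5):
    """Finite-difference gradient, only to illustrate that the penalty is differentiable."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        grad[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return grad

W = np.random.randn(20, 8)                       # small matrix so the loop stays cheap
base = lambda M: 0.5 * np.sum(M ** 2)            # stand-in for the frame-level training loss
g = numerical_grad(lambda M: regularized_loss(M, base), W)
print(g.shape)                                   # (20, 8): same shape as W, usable in SGD
```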

Incoherent Training: De-correlation [plot]: effect of applying incoherent training to one weight matrix in the BN network.

Incoherent Training: Data-driven
If incoherent training is applied to only one weight matrix W, with outputs $Y = W^\top X + b$, the covariance matrix of Y is $C_Y = W^\top C_X W$. Correlation coefficients are then measured directly from this covariance matrix,

$$g_{ij} = \frac{[C_Y]_{ij}}{\sqrt{[C_Y]_{ii}\,[C_Y]_{jj}}}, \qquad G(W) = \max_{i \neq j} |g_{ij}|,$$

where $C_X$ is estimated from one mini-batch at a time.
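A numpy sketch of this data-driven variant: estimate C_X from one mini-batch, form C_Y = W^T C_X W, and read the coherence off the correlation coefficients of Y. The mini-batch size and layer dimensions are assumptions.

```python
import numpy as np

def data_driven_coherence(W, X_batch):
    """Coherence of Y = W^T X + b, measured from a mini-batch estimate of C_X."""
    Xc = X_batch - X_batch.mean(axis=0, keepdims=True)
    C_X = Xc.T @ Xc / (len(X_batch) - 1)         # mini-batch covariance of the inputs
    C_Y = W.T @ C_X @ W                          # covariance of the outputs (b drops out)
    d = np.sqrt(np.diag(C_Y))
    g = np.abs(C_Y / np.outer(d, d))             # correlation coefficients |g_ij|
    np.fill_diagonal(g, 0.0)
    return g.max()

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 40))              # weights feeding a 40-unit BN layer
X_batch = rng.standard_normal((256, 1024))       # one mini-batch of input activations
print(data_driven_coherence(W, X_batch))
```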

Incoherent Training: Data-driven
After the soft-max approximation, the derivatives $\partial G(W)/\partial w_k$ can again be computed and added to the gradient. Back-propagation still applies, except that $C_X$ is re-estimated for each mini-batch.

Incoherent Training: Data-driven De-correlation When applying Incoherent Training to one weight matrix

Experiment (I): Incoherent Training 70-hour Chinese ASR Task (word error rate in %) MLE: maximum likelihood estimation MPE: discriminative training DNN-HMM: 13.1%

Experiment (II): Incoherent Training 320-hour English Switchboard Task (word error rate in %) MLE: maximum likelihood estimation MPE: discriminative training DNN-HMM: 31.2% (Hub5e98) and 23.7% (Hub5e01)

Conclusions (II)
It is possible to compete with DNNs within the traditional GMMs/HMM framework, and it is promising to use bottleneck features learned with the proposed incoherent training. Benefits over DNN/HMM:
o Slightly better performance
o Compatible with other ASR techniques (adaptation, ...)
o Faster training
o Faster decoding

Future work
o Apply incoherent training to general DNN learning.
o Explore other nonlinear dimensionality reduction methods for concatenated features.