Deep Learning for Automatic Speech Recognition Part I

Deep Learning for Automatic Speech Recognition Part I
Xiaodong Cui
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Fall 2018

Outline
- A brief history of automatic speech recognition
- Speech production model
- Fundamentals of automatic speech recognition
- Context-dependent deep neural networks

Applications of Speech Technologies
- Speech recognition
- Speech synthesis
- Voice conversion
- Speech enhancement
- Speech coding
- Spoken term detection
- Speaker recognition/verification
- Speech-to-speech translation
- Dialogue systems
- ...

Some ASR Terminology
- Speaker dependent (SD) vs. speaker independent (SI)
- Isolated word recognition vs. continuous speech recognition
- Large-vocabulary continuous speech recognition (LVCSR)
  - natural speaking style
  - historically > 1,000 words, but far larger vocabularies nowadays (e.g., 30K-50K, sometimes reaching 100K)
- Speaker adaptation
  - supervised
  - unsupervised
- Speech input and channel
  - close-talking microphone
  - far-field microphone
  - microphone array
  - codec

Four Generations of ASR
Roughly four generations:
- 1st generation (1950s-1960s): exploratory work based on acoustic-phonetics
- 2nd generation (1960s-1970s): ASR based on template matching
- 3rd generation (1970s-2000s): ASR based on rigorous statistical modeling
- 4th generation (late 2000s-present): ASR based on deep learning
*adapted from S. Furui, "History and Development of Speech Recognition."

1st Generation: Early Attempts (1)
K. H. Davis, R. Biddulph, and S. Balashek, "Automatic Recognition of Spoken Digits," J. Acoust. Soc. Am., vol. 24, no. 6, pp. 627-642, 1952.

1st Generation: Early Attempts (2)
K. H. Davis, R. Biddulph, and S. Balashek, "Automatic Recognition of Spoken Digits," J. Acoust. Soc. Am., vol. 24, no. 6, pp. 627-642, 1952.

2nd Generation: Template Matching
- Linear predictive coding (LPC)
  - formulated independently by Atal and Itakura
  - an effective way to estimate the vocal tract response
- Dynamic programming
  - widely known as dynamic time warping (DTW) in the speech community
  - handles the temporal non-uniformity between two patterns
  - first proposed by Vintsyuk in the former Soviet Union, but not known to Western countries until the 1980s; Sakoe and Chiba in Japan independently proposed it in the late 1970s
- Isolated-word or connected-word recognition based on DTW and LPC (and its variants) under an appropriate distance measure

An Example Illustrating DTW
(figure: a test pattern aligned to a reference pattern along a warping path)
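
To make the alignment idea concrete, here is a minimal Python sketch of DTW between two feature sequences. The Euclidean local distance and the three-way step pattern are illustrative assumptions; template-matching systems of this era typically used LPC-based spectral distortion measures instead.

```python
import numpy as np

def dtw_distance(ref, test):
    """Classic dynamic time warping between two feature sequences.

    ref:  (n, d) array of reference-pattern frames
    test: (m, d) array of test-pattern frames
    Returns the accumulated distance of the optimal alignment path.
    """
    n, m = len(ref), len(test)
    # Local frame-to-frame distances (Euclidean here, an assumption).
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=-1)

    acc = np.full((n, m), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # vertical step
                acc[i, j - 1] if j > 0 else np.inf,                 # horizontal step
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # diagonal step
            )
            acc[i, j] = local[i, j] + best_prev
    return acc[-1, -1]
```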

3rd Generation: Hidden Markov Models
- Switched from template-based approaches to rigorous statistical modeling
- The path of HMMs becoming the dominant approach in ASR:
  - the earliest research dates back to the late 1960s by L. E. Baum and his colleagues at the Institute for Defense Analyses (IDA)
  - James Baker followed it up at CMU as a Ph.D. student in the 1970s
  - James and Janet Baker joined IBM and worked with Fred Jelinek on using HMMs for speech recognition in the 1970s
  - a workshop on HMMs held by IDA in 1980 resulted in the so-called "Blue Book," titled "Applications of Hidden Markov Models to Text and Speech," though the book was never widely distributed
  - a series of papers on the HMM methodology was published in the years after the IDA workshop, including the well-known 1989 IEEE Proceedings paper "A tutorial on hidden Markov models and selected applications in speech recognition"
- HMMs have since become the dominant approach to speech recognition

A Unified Speech Recognition Model
P(W|X) = P(X|W) P(W) / P(X)
- X = {x_1, x_2, ..., x_m} is a sequence of speech features
- W = {w_1, w_2, ..., w_n} is a sequence of words
- P(W) = P(w_1, w_2, ..., w_n) gives the probability of the word sequence; referred to as the language model (LM)
- P(X|W) = P(x_1, x_2, ..., x_m | w_1, w_2, ..., w_n) gives the probability of the feature sequence given the word sequence; referred to as the acoustic model (AM)

An Illustrative Example of HMMs
(figure: the utterance "How are you" as a chain of context-dependent phone HMMs)
SIL SIL-HH+AW HH-AW+AA AW-AA+R AA-R+Y R-Y+UW Y-UW+SIL SIL
phones: SIL HH AW AA R Y UW SIL

Two LVCSR Developments at IBM and Bell Labs
- IBM
  - focused on dictation systems
  - interested in a probabilistic structure of the language model with a large vocabulary: n-grams
    P(W) = P(w_1 w_2 ... w_n)
         = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_n|w_1 w_2 ... w_{n-1})
         ≈ P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1})   (bigram approximation)
- Bell Labs
  - focused on voice dialing and command and control for call routing
  - interested in speaker-independent systems that could handle acoustic variability from a large number of different speakers
  - Gaussian mixture models (GMMs) for the state observation distribution:
    P(x|s) = Σ_i c_i N(x; μ_i, Σ_i)

3rd Generation: Further Improvements
GMM-HMM acoustic models with n-gram language models became the "standard" LVCSR recipe. Significant improvements in the 1990s and 2000s:
- Speaker adaptation/normalization
  - vocal tract length normalization (VTLN)
  - maximum likelihood linear regression (MLLR)
  - speaker adaptive training (SAT)
- Noise robustness
  - parallel model combination (PMC)
  - vector Taylor series (VTS)
- Discriminative training
  - minimum classification error (MCE)
  - maximum mutual information (MMI)
  - minimum phone error (MPE)
  - large margin

Progress Made Before the Advent of Deep Learning
*X. Huang, J. Baker, and R. Reddy, "A Historical Perspective of Speech Recognition."

4th Generation: Deep Learning
G. E. Hinton, S. Osindero, and Y.-W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
- the ground-breaking paper on deep learning
  - layer-wise pre-training in an unsupervised fashion
  - fine-tuning afterwards with supervised training
- Changed the prevailing mindset that deep neural networks could never be trained effectively
- Around 2011 and 2012, Microsoft, Google, and IBM all reported significant improvements over their then state-of-the-art GMM-HMM-based ASR systems

4th Generation: Deep Learning
- Deep acoustic modeling
  - deep feedforward neural networks (DNN, CNN, ...)
  - deep recurrent neural networks (LSTM, GRU, ...)
  - end-to-end neural networks
- Deep language modeling
  - deep feedforward neural networks (DNN, CNN)
  - deep recurrent neural networks (RNN, LSTM, ...)
  - word embeddings

DNN-HMM vs. GMM-HMM
*G. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups."

Impact of Deep Learning on SWB2000
- Switchboard database
  - a popular public benchmark in the speech recognition community
  - human-human landline telephone conversations on directed topics
  - 300-hour and 2000-hour training sets

Progress Made at IBM on SWB2000
*estimate of human performance: 5.1% word error rate

Achieving "Human Parity" in ASR

Speech Production Model
*figure from the internet

Speech Production Model
*L. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition."

Speech Production Model
(figure: source-filter model; an impulse train drives voiced sounds and white noise drives unvoiced sounds, with gain G, through the vocal tract filter to produce the speech signal)
- time domain: s(t) = e(t) * v(t)   (convolution of excitation and vocal tract response)
- frequency domain: S(ω) = E(ω) V(ω)

Speech Production Model
*J. Picone, "Fundamentals of Speech Recognition: A Short Course."

An Example of Speech Spectra
(figure: spectra of the /uw/ sound from an adult male and a boy, with the pitch harmonics and the vocal tract envelope labeled)

Waveforms and Spectrograms
(figure: waveform plus narrowband and wideband spectrograms of the utterance "deep learning for computer vision, speech and language"; axes are time in seconds and frequency in Hz, up to 8000 Hz)

Wideband and Narrowband Spectrograms
(figure: wideband vs. narrowband spectrograms of "deep learning"; axes are time in seconds and frequency in Hz, up to 8000 Hz)

A Quick Walkthrough of GMM-HMM Based Speech Recognition
(figure: speech input → feature extraction → decoding network → text output; the decoding network is built from the acoustic model, language model, and dictionary; training and decoding stages are shown separately)

Feature Extraction
- Frame window length 20 ms with a shift of 10 ms
- Commonly used hand-crafted features:
  - Linear Predictive Cepstral Coefficients (LPCCs)
  - Mel-frequency Cepstral Coefficients (MFCCs)
  - Perceptual Linear Predictive (PLP) analysis
  - Mel-frequency filter banks (FBanks) (widely used in DNNs/CNNs)
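
As a concrete illustration of the framing step, the sketch below slices a waveform into 20 ms frames with a 10 ms shift. The 16 kHz sampling rate and the Hamming window are assumptions, not prescribed by the slides.

```python
import numpy as np

def frame_signal(x, sample_rate=16000, win_ms=20, shift_ms=10):
    """Slice a waveform into overlapping frames (20 ms window, 10 ms shift).

    Assumes len(x) >= one window of samples.
    """
    win = int(sample_rate * win_ms / 1000)      # e.g. 320 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples
    n_frames = 1 + (len(x) - win) // shift
    frames = np.stack([x[i * shift : i * shift + win] for i in range(n_frames)])
    # A tapering window is usually applied before the DFT.
    return frames * np.hamming(win)
```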

Cepstral Analysis
Why cepstral analysis? Deconvolution!
Pipeline: windowing → DFT → log → IDFT → liftering
  |S(ω)| = |E(ω)| |V(ω)|
  log|S(ω)| = log|E(ω)| + log|V(ω)|
  c(n) = IDFT(log|S(ω)|) = IDFT(log|E(ω)| + log|V(ω)|)
- cepstrum, quefrency, and liftering
- vocal tract components are represented by the slowly varying components concentrated at the lower quefrencies
- excitation components are represented by the fast varying components concentrated at the higher quefrencies
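
The deconvolution pipeline above maps directly to a few lines of numpy. This is a minimal sketch of the real cepstrum of one windowed frame; the FFT-based DFT/IDFT and the choice of 13 retained coefficients are assumptions.

```python
import numpy as np

def real_cepstrum(frame, n_cep=13):
    """Cepstrum of one windowed frame: DFT -> log magnitude -> inverse DFT.

    Liftering (keeping only the first n_cep coefficients) retains the
    slowly varying vocal-tract part and discards the excitation part.
    """
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_mag)
    return cepstrum[:n_cep]                      # low-quefrency lifter
```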

Mel-Frequency Filter Bank and MFCC
Pipeline: DFT → mel filter bank → log → DCT → MFCC
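
A numpy sketch of this DFT → mel → log → DCT pipeline follows. The filter count (40), FFT size (512), and number of retained coefficients (13) are typical but assumed values, and the triangular-filter construction is one common variant.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    """Triangular filters spaced uniformly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0), mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def mfcc(frame, fbank, n_mfcc=13):
    """DFT -> mel filterbank -> log -> DCT, as in the pipeline above."""
    power = np.abs(np.fft.rfft(frame, n=512)) ** 2
    log_energies = np.log(fbank @ power + 1e-10)   # these are the FBank features
    return dct(log_energies, type=2, norm='ortho')[:n_mfcc]
```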

Acoustic Units and Dictionary
Dictionary:
  HOW  HH AW
  ARE  AA R
  YOU  Y UW
- Context-independent (CI) phonemes: HH AW AA R Y UW
- Context-dependent (CD) phonemes
  - coarticulation: phonemes are different if they have different left and right contexts, written P_l - P_c + P_r
  - e.g., HH-AW+AA, P-AW+AA, and HH-AW+B are different CD phonemes
  - each CD phoneme is modeled by an HMM
- Other acoustic units: syllables, words, ...
- a tradeoff between modeling accuracy and data coverage

Context-Dependent Phonemes
- Advantages: model subtle acoustic characteristics given the acoustic contexts
- Disadvantages: give rise to a substantial number of CD phonemes, e.g., 45^3 = 91,125 triphones and 45^5 ≈ 1.8 × 10^8 quinphones
- Solution: parameter tying (vowels, stops, fricatives, nasals, ...)
- widely used as targets of DNN/CNN systems
*after HTK.

GMM-HMM: Mathematical Formulation
λ = {A, B, π}
- State transition probabilities A: a_ij = P(s_{t+1} = j | s_t = i)
- State observation PDFs B: b_i(o_t) = p(o_t | s_t = i)
- Initial state distribution π: π_i = P(s_1 = i)
An HMM is an extension of a Markov chain:
- a doubly embedded stochastic process
- an underlying stochastic process which is not directly observable
- another observable stochastic process generated from the hidden stochastic process
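
To fix notation, here is a toy λ = {A, B, π} in Python: a 3-state left-to-right HMM with one Gaussian per state (a degenerate single-component GMM). All numbers are made up for illustration.

```python
import numpy as np

# A 3-state left-to-right HMM, the standard topology for one CD phone.
A = np.array([[0.6, 0.4, 0.0],    # a_ij = P(s_{t+1} = j | s_t = i)
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])    # pi_i = P(s_1 = i): always start in state 0

# B: one GMM per state; b_i(o_t) = sum_k c_ik N(o_t; mu_ik, Sigma_ik).
# One diagonal Gaussian per state keeps the sketch short.
means = np.array([[0.0], [1.0], [2.0]])
variances = np.array([[1.0], [1.0], [1.0]])

def emission_prob(obs):
    """b_i(o) for every state i, for a single observation vector."""
    diff = obs - means
    return np.exp(-0.5 * np.sum(diff**2 / variances, axis=1)) / \
           np.sqrt(np.prod(2 * np.pi * variances, axis=1))
```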

GMM-HMM: Mathematical Formulation
(figure: the utterance "How are you" as a chain of CD-phone HMMs, as on the earlier slide)
Three fundamental problems for HMMs:
1. Given the observed sequence O = {o_1, ..., o_T} and a model λ = {A, B, π}, how do we evaluate the probability of the observation sequence P(O|λ) = Σ_S P(O, S|λ)?
2. How do we adjust the model parameters λ = {A, B, π} to maximize P(O|λ)?
3. Given the observed sequence O = {o_1, ..., o_T} and the model λ = {A, B, π}, how do we choose the most likely state sequence S = {s_1, ..., s_T}?

GMM-HMM: Acoustic Model Training
How do we compute the likelihood of a feature sequence given the acoustic model? The forward-backward algorithm.
Forward computation: α_t(i) = P(o_1 ... o_t, s_t = i | λ)
- Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ M
- Induction: α_{t+1}(j) = [ Σ_{i=1}^M α_t(i) a_ij ] b_j(o_{t+1}), 1 ≤ j ≤ M, 1 ≤ t ≤ T-1
- Termination: P(O|λ) = Σ_{i=1}^M α_T(i)
Backward computation: β_t(i) = P(o_{t+1} ... o_T | s_t = i, λ)
- Initialization: β_T(i) = 1, 1 ≤ i ≤ M
- Induction: β_t(i) = Σ_{j=1}^M a_ij b_j(o_{t+1}) β_{t+1}(j), 1 ≤ i ≤ M, t = T-1, ..., 1
Using both forward and backward variables: P(O|λ) = Σ_{i=1}^M α_t(i) β_t(i) for any t
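
A direct transcription of these recursions into Python might look like the following sketch. It assumes the emission likelihoods b_j(o_t) are precomputed, and it omits the scaling (or log-domain arithmetic) that any practical implementation needs to avoid underflow on long utterances.

```python
import numpy as np

def forward_backward(pi, A, B):
    """Forward-backward recursions for one utterance.

    pi: (M,) initial distribution, A: (M, M) transitions,
    B:  (T, M) precomputed emission likelihoods b_j(o_t).
    Returns alpha, beta, and the total likelihood P(O | lambda).
    """
    T, M = B.shape
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))

    alpha[0] = pi * B[0]                          # forward initialization
    for t in range(T - 1):                        # forward induction
        alpha[t + 1] = (alpha[t] @ A) * B[t + 1]

    for t in range(T - 2, -1, -1):                # backward induction
        beta[t] = A @ (B[t + 1] * beta[t + 1])

    likelihood = alpha[-1].sum()                  # termination
    # Sanity check: sum_i alpha_t(i) * beta_t(i) equals P(O|lambda) at every t.
    return alpha, beta, likelihood
```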

GMM-HMM: Acoustic Model Training
How do we estimate the model parameters λ given the data? Given the feature sequence O = {o_1, ..., o_T}, find
  λ* = argmax_λ P(O|λ)
where a GMM is used for B:
  P(x|s) = Σ_k c_k N(x; μ_k, Σ_k)
The Expectation-Maximization (EM) algorithm (also known as the Baum-Welch algorithm) yields the re-estimation formulas
  c_ik = Σ_t γ_ik(t) / Σ_t γ_t(i)
  μ_ik = Σ_t γ_ik(t) o_t / Σ_t γ_ik(t)
  Σ_ik = Σ_t γ_ik(t) (o_t - μ_ik)(o_t - μ_ik)^T / Σ_t γ_ik(t)
with the state and mixture-component occupancies
  γ_t(i) = P(s_t = i | O, λ) = α_t(i) β_t(i) / Σ_{j=1}^M α_t(j) β_t(j)
  γ_ik(t) = P(s_t = i, ζ_t = k | O, λ) = γ_t(i) · c_ik N(o_t; μ_ik, Σ_ik) / Σ_m c_im N(o_t; μ_im, Σ_im)
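
As a sketch of the E- and M-steps, the snippet below computes the state occupancies γ_t(i) from the forward/backward variables and re-estimates a single-Gaussian state from them. Extending it to mixtures means additionally splitting γ_t(i) across components as in the γ_ik(t) formula above; the function names are hypothetical.

```python
import numpy as np

def state_occupancies(alpha, beta):
    """E-step: gamma_t(i) = P(s_t = i | O, lambda) from forward/backward."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def reestimate_gaussian(obs, gamma):
    """M-step for a single-Gaussian state: occupancy-weighted mean/covariance.

    obs: (T, d) features; gamma: (T,) occupancies of this state.
    """
    w = gamma / gamma.sum()
    mu = (w[:, None] * obs).sum(axis=0)
    diff = obs - mu
    sigma = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return mu, sigma
```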

GMM-HMM: Language Model Training
n-gram: approximate the conditional probability with n-1 history words:
  P(w_1, w_2, ..., w_m) ≈ Π_{i=1}^m P(w_i | w_{i-n+1}, ..., w_{i-1})
Counting events in context using training data:
  P(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1}, ..., w_{i-1}, w_i) / C(w_{i-n+1}, ..., w_{i-1})
- unigram, bigram, trigram, 4-gram, ...
- data sparsity issue
- back-off strategies and interpolation
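
The counting formula translates directly into code. Below is a minimal maximum-likelihood bigram estimator; the sentence markers <s> and </s> are conventional additions, and a real system would add smoothing and back-off to cope with the data sparsity mentioned above.

```python
from collections import Counter

def train_bigram(sentences):
    """MLE bigram LM: P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ['<s>'] + words + ['</s>']
        unigrams.update(words[:-1])              # history counts C(w_{i-1})
        bigrams.update(zip(words[:-1], words[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

# Toy corpus echoing the decoding-network example later in the slides.
lm = train_bigram([['how', 'are', 'you'], ['how', 'is', 'your', 'car']])
print(lm[('how', 'are')])   # C(how, are) / C(how) = 1/2 = 0.5
```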

GMM-HMM: Decoding
How do we find the path for a feature sequence that gives the maximum likelihood given the model? Dynamic programming (Viterbi decoding).
Let φ_j(t) represent the maximum likelihood of observing the partial sequence o_1 ... o_t and being in state j at time t:
  φ_j(t) = max_{s_1, s_2, ..., s_{t-1}} P(s_1, s_2, ..., s_{t-1}, s_t = j, o_1, o_2, ..., o_t | λ)
By induction:
  φ_j(t) = max_i { φ_i(t-1) a_ij } b_j(o_t)
*after HTK.
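
The induction φ_j(t) = max_i{φ_i(t-1) a_ij} b_j(o_t) plus backpointers is a few lines of numpy. This sketch works in the log domain, an implementation choice not shown on the slide, and again assumes precomputed emission likelihoods.

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely state sequence via phi_j(t) = max_i{phi_i(t-1) a_ij} b_j(o_t).

    pi: (M,) initial distribution, A: (M, M) transitions,
    B:  (T, M) emission likelihoods. Log domain avoids underflow.
    """
    T, M = B.shape
    logA = np.log(A + 1e-300)
    phi = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)

    phi[0] = np.log(pi + 1e-300) + np.log(B[0] + 1e-300)
    for t in range(1, T):
        scores = phi[t - 1][:, None] + logA        # (M, M): from state i to j
        back[t] = scores.argmax(axis=0)            # best predecessor of each j
        phi[t] = scores.max(axis=0) + np.log(B[t] + 1e-300)

    # Trace back the best state sequence from the best final state.
    path = [int(phi[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], phi[-1].max()
```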

GMM-HMM: Decoding
(figure: a decoding network between a start node S and end node E, with silence, over word paths such as "how are you" and "how is it / your car")
- Inject the language model
- Inject the dictionary
- Inject CD-HMMs, with each CD-HMM state having a GMM distribution
- Compute the (log-)likelihood of each feature vector in each CD-HMM state:
  P(x|s) = Σ_i c_i N(x; μ_i, Σ_i)

GMM-HMM: Decoding
(figure: the decoding network expanded down to CD-HMM states; word transitions such as P(how|sil), P(are|how), P(you|are), P(is|how), P(car|your) connect chains of CD-HMM states, e.g. AA-B-1 ... AA-E-2 and SIL-E-7 ... SIL-E-9, each scored by its GMM)

GMM-HMM: Forced Alignment
(figure: the triphone chain for "How are you", as on the earlier slides)
Given the text label, how do we find the best underlying state sequence?
- same as decoding, except the label is known
- Viterbi algorithm (again)
- often referred to as Viterbi alignment in the speech community
- widely used in deep learning ASR to generate the target labels for the training data

Context-Dependent DNNs
(figure: the same expanded decoding network, with the per-state GMM scorer replaced by a single DNN that scores all CD-HMM states)

Context-Dependent DNNs
(figure: a DNN whose softmax outputs correspond to CD-HMM states such as AA-B-1 ... SIL-E-9)
The DNN produces the state posterior p(s|o). Decoding needs the likelihood, obtained via Bayes' rule:
  p(o|s) = p(s|o) p(o) / p(s) ∝ p(s|o) / p(s)
since p(o) does not depend on the state.
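
In code, the conversion from DNN posteriors to the scaled likelihoods consumed by Viterbi decoding is a one-liner in the log domain; the array shapes here are assumptions.

```python
import numpy as np

def posteriors_to_scaled_loglikes(log_posteriors, log_priors):
    """Hybrid-system trick: log p(o|s) ~ log p(s|o) - log p(s) + const.

    log_posteriors: (T, S) DNN softmax outputs in the log domain,
    log_priors:     (S,)   log p(s) estimated from training alignments.
    The constant log p(o) is shared by all states, so it cancels in decoding.
    """
    return log_posteriors - log_priors
```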

Two Families of DNN-HMM Acoustic Modeling
- Hybrid systems
  - the DNN outputs are directly connected to HMM states for acoustic modeling
- Bottleneck tandem systems
  - the DNN is used as a feature extractor
  - the extracted bottleneck features can be used to train GMM-HMMs or DNN-HMMs

Modeling with Hidden Variables
- Hidden variables (or latent variables) are crucial in acoustic modeling (also true for computer vision and NLP):
  - the hidden state or phone sequence in acoustic modeling
  - the hidden speaker transformation in speaker adaptation
  - the latent topic and word distributions in latent Dirichlet allocation in NLP
- The choice of hidden variables reflects your belief about how a system works (its internal mechanism)
- Deep neural networks use a large number of hidden variables:
  - the hidden variables are organized in a hierarchical fashion
  - they usually lack a straightforward interpretation (the well-known interpretability issue)

DNN-HMMs: A Typical Training Recipe
Preparation:
- 40-dim FBank features with ±4 adjacent frames (input dim = 40 × 9)
- use an existing model to generate alignments, which are then converted to 1-hot targets for each frame
- create training and validation data sets
- estimate priors p(s) of the CD states
Training:
- set the DNN configuration (multiple hidden layers, a softmax output layer, and a cross-entropy loss function)
- initialization
- optimization based on back-prop using SGD on the training set while monitoring the loss on the validation set
Test:
- push input features from test utterances through the DNN acoustic model to get their posteriors
- convert posteriors to likelihoods
- Viterbi decoding on the decoding network
- measure the word error rate
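
A compressed PyTorch sketch of this recipe is shown below. The hidden-layer sizes, sigmoid nonlinearities, learning rate, and the 9000-class CD-state output are illustrative assumptions; only the 40 × 9 spliced-FBank input and the softmax/cross-entropy/SGD setup come from the recipe itself.

```python
import torch
import torch.nn as nn

# A minimal hybrid acoustic model: 40-dim FBank with +/-4 context frames
# (input 360) mapped to CD-state posteriors. Sizes are assumptions.
model = nn.Sequential(
    nn.Linear(40 * 9, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 9000),               # one logit per CD state
)
loss_fn = nn.CrossEntropyLoss()          # softmax + cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(features, targets):
    """features: (batch, 360) spliced FBank; targets: (batch,) CD-state ids
    taken from forced alignments, as described in the recipe."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()                      # back-prop
    optimizer.step()                     # SGD update
    return loss.item()
```

At test time, the softmax outputs of this model would be divided by the estimated CD-state priors (see the scaled-likelihood snippet earlier) before Viterbi decoding.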

Training a Hybrid DNN-HMM System
(figure: pipeline. Attila/Kaldi performs feature extraction, CD-phone generation, and GMM-HMM training, producing CD phones (classes), alignments (labels), and features; Theano/Torch/TensorFlow trains the DNN-HMM on these; its posteriors feed HMM Viterbi decoding back in Attila/Kaldi)