Deep Learning for Automatic Speech Recognition Part I
1 Deep Learning for Automatic Speech Recognition Part I Xiaodong Cui IBM T. J. Watson Research Center Yorktown Heights, NY Fall, 2018
2 Outline A brief history of automatic speech recognition Speech production model Fundamentals of automatic speech recognition Context-dependent deep neural networks EECS 6894, Columbia University 2/50
3 Applications of Speech Technologies Speech recognition Speech synthesis Voice conversion Speech enhancement Speech coding Spoken term detection Speaker recognition/verification Speech-to-speech translation Dialogue systems...
4 Some ASR Terminology Speaker dependent (SD) vs. speaker independent (SI) Isolated word recognition vs. continuous speech recognition Large-vocabulary continuous speech recognition (LVCSR) natural speaking style; > 1000 words historically but way more nowadays (e.g. 30K - 50K, some may reach 100K) Speaker adaptation supervised unsupervised Speech input and channel close-talking microphone far-field microphone microphone array codec
5 Four Generations of ASR Roughly four generations: 1st generation (1950s-1960s): explorative work based on acoustic-phonetics 2nd generation (1960s-1970s): ASR based on template matching 3rd generation (1970s-2000s): ASR based on rigorous statistical modeling 4th generation (late 2000s-present): ASR based on deep learning *adapted from S. Furui, "History and Development of Speech Recognition."
6 1st Generation: Early Attempts (1) K. H. Davis, R. Biddulph, and S. Balashek, "Automatic Recognition of Spoken Digits," J. Acoust. Soc. Am., vol. 24, no. 6.
7 1st Generation: Early Attempts (2) K. H. Davis, R. Biddulph, and S. Balashek, "Automatic Recognition of Spoken Digits," J. Acoust. Soc. Am., vol. 24, no. 6.
8 2nd Generation: Template Matching Linear predictive coding (LPC) formulated independently by Atal and Itakura; an effective way to estimate the vocal tract response Dynamic programming widely known as dynamic time warping (DTW) in the speech community; deals with the non-uniform timing of two patterns; first proposed by Vintsyuk in the former Soviet Union, which was not known to Western countries until the 1980s; Sakoe and Chiba from Japan independently proposed it in the late 1970s Isolated-word or connected-word recognition based on DTW and LPC (and its variants) under an appropriate distance measure
9 An example illustrating DTW (figure: a test pattern aligned against a reference pattern on a dynamic-programming grid)
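The DTW alignment illustrated in the figure can be written down in a few lines. A minimal sketch in Python (the function name and the symmetric step pattern of horizontal, vertical and diagonal moves are my choices, not from the slides):

```python
import numpy as np

def dtw_distance(ref, test):
    """Align two feature sequences with dynamic time warping.

    ref, test: 2-D arrays (frames x feature_dim). Returns the minimum
    cumulative Euclidean frame distance over all monotonic alignments.
    """
    ref, test = np.atleast_2d(ref), np.atleast_2d(test)
    n, m = len(ref), len(test)
    # local frame-to-frame distance matrix
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=-1)
    # cumulative cost filled by dynamic programming
    cum = np.full((n + 1, m + 1), np.inf)
    cum[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cum[i, j] = local[i - 1, j - 1] + min(
                cum[i - 1, j],      # vertical step
                cum[i, j - 1],      # horizontal step
                cum[i - 1, j - 1],  # diagonal step
            )
    return cum[n, m]
```

Because the warping path may repeat frames, a time-stretched copy of a pattern matches its original with zero cost, which is exactly the non-uniformity the slide mentions.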
10 3rd Generation: Hidden Markov Models Switched from template-based approaches to rigorous statistical modeling. The path by which HMMs became the dominant approach in ASR: the earliest research dates back to the late 1960s by L. E. Baum and his colleagues at the Institute for Defense Analyses (IDA) James Baker followed it up at CMU as a Ph.D. student in the 1970s James and Janet Baker joined IBM and worked with Fred Jelinek on using HMMs for speech recognition in the 1970s A workshop on HMMs was held by IDA in 1980, which resulted in the so-called "Blue Book" titled "Applications of Hidden Markov Models to Text and Speech", but the book was never widely distributed A series of papers on the HMM methodology was published in the years after the IDA workshop, including the well-known IEEE Proceedings paper "A tutorial on hidden Markov models and selected applications in speech recognition" HMMs have since become the dominant approach for speech recognition.
11 A Unified Speech Recognition Model
P(W|X) = P(X|W) P(W) / P(X)
X = {x_1, x_2, ..., x_m} is a sequence of speech features; W = {w_1, w_2, ..., w_n} is a sequence of words
P(W) = P(w_1, w_2, ..., w_n) gives the probability of the sequence of words, referred to as the language model (LM)
P(X|W) = P(x_1, x_2, ..., x_m | w_1, w_2, ..., w_n) gives the probability of the sequence of speech features given the word sequence, referred to as the acoustic model (AM)
12 An Illustrative Example of HMMs (figure: the utterance "How are you" decomposed into the phone sequence SIL HH AW AA R Y UW SIL and modeled by the context-dependent HMMs SIL, SIL-HH+AW, HH-AW+AA, AW-AA+R, AA-R+Y, R-Y+UW, Y-UW+SIL)
13 Two LVCSR Developments at IBM and Bell Labs IBM focused on dictation systems, interested in a probabilistic structure of the language model with a large vocabulary: n-grams
P(W) = P(w_1 w_2 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_n|w_1 w_2 ... w_{n-1}) ≈ P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1}) (bigram approximation)
Bell Labs focused on voice dialing and command-and-control for call routing, interested in speaker-independent systems that could handle acoustic variability from a large number of different speakers: Gaussian mixture models (GMMs) for the state observation distribution
P(x|s) = Σ_i c_i N(x; μ_i, Σ_i)
14 3rd Generation: Further Improvements The GMM-HMM acoustic model with an n-gram language model became the "standard" LVCSR recipe. Significant improvements followed in the 1990s and 2000s: Speaker adaptation/normalization vocal tract length normalization (VTLN) maximum likelihood linear regression (MLLR) speaker adaptive training (SAT) Noise robustness parallel model combination (PMC) vector Taylor series (VTS) Discriminative training minimum classification error (MCE) maximum mutual information (MMI) minimum phone error (MPE) large margin
15 Progress Made Before the Advent of Deep Learning *X. Huang, J. Baker and R. Reddy, "A Historical Perspective of Speech Recognition."
16 4th Generation: Deep Learning G. E. Hinton, S. Osindero and Y.-W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, 18, the ground-breaking paper on deep learning: layer-wise pre-training in an unsupervised fashion, fine-tuned afterwards with supervised training. Changed the prevailing mindset that deep neural networks were no good and could never be trained. Microsoft, Google and IBM around 2011 and 2012 all reported significant improvements over their then state-of-the-art GMM-HMM-based ASR systems.
17 4th Generation: Deep Learning Deep acoustic modeling deep feedforward neural networks (DNN, CNN, ...) deep recurrent neural networks (LSTM, GRU, ...) end-to-end neural networks Deep language modeling deep feedforward neural networks (DNN, CNN) deep recurrent neural networks (RNN, LSTM, ...) word embedding
18 DNN-HMM vs. GMM-HMM *G. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups."
19 Impact of Deep Learning on SWB2000 Switchboard database: a popular public benchmark in the speech recognition community; human-human landline telephone conversations on directed topics; 300-hour and 2000-hour training sets
20 Progress Made at IBM on SWB2000 *estimate of human performance: 5.1%
21 Achieving "Human Parity" in ASR
22 Speech Production Model *from internet
23 Speech Production Model *L. Rabiner and B.-H. Juang, "Fundamentals of speech recognition".
24 Speech Production Model (diagram: an impulse train for voiced sounds or white noise for unvoiced sounds excites the vocal tract filter with gain G to produce the speech signal)
time domain: s(t) = e(t) * v(t) (convolution)
frequency domain: S(ω) = E(ω) V(ω)
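The source-filter relation on this slide, convolution in time equals multiplication in frequency, can be checked numerically. A toy sketch with an invented impulse-train "excitation" and a short FIR "vocal tract" (the signal values are illustrative only, not a physiological model):

```python
import numpy as np

fs = 8000                              # assumed sampling rate
e = np.zeros(200)
e[::50] = 1.0                          # impulse train, 160 Hz "pitch"
v = np.array([1.0, 0.6, 0.3, 0.1])     # decaying "vocal tract" response

# time domain: s(t) = e(t) * v(t)  (convolution)
s_time = np.convolve(e, v)

# frequency domain: S(w) = E(w) V(w)  (multiplication),
# zero-padded to the full linear-convolution length
n = len(e) + len(v) - 1
s_freq = np.fft.ifft(np.fft.fft(e, n) * np.fft.fft(v, n)).real
```

The two routes produce the same speech signal, which is the identity the cepstral analysis slide later exploits.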
25 Speech Production Model *J. Picone, "Fundamentals of speech recognition: a short course".
26 Speech Production Model *J. Picone, "Fundamentals of speech recognition: a short course".
27 An example of speech spectra (figure: spectra of the /uw/ sound from an adult male and a boy, showing the pitch harmonics and the vocal tract envelope)
28 Waveforms and Spectrograms (figure: the utterance "deep learning for computer vision, speech and language" shown as wideband and narrowband spectrograms, frequency vs. time)
29 Wideband and Narrowband Spectrograms (figure: "deep learning" shown as a wideband and a narrowband spectrogram, frequency 0-8000 Hz vs. time)
30 A Quick Walkthrough of GMM-HMM Based Speech Recognition (diagram: speech input goes through feature extraction into the decoding network, which outputs text; the decoding network is built from the acoustic model, language model and dictionary; the models are produced in training and used in decoding)
31 Feature Extraction Frames: window length 20 ms with a shift of 10 ms. Commonly used hand-crafted features: Linear Predictive Cepstral Coefficients (LPCCs) Mel-frequency Cepstral Coefficients (MFCCs) Perceptual Linear Predictive (PLP) analysis Mel-frequency Filter banks (FBanks) (widely used in DNNs/CNNs)
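Framing as described (20 ms windows every 10 ms) might look like the following sketch; `frame_signal` is a hypothetical helper, not from any toolkit:

```python
import numpy as np

def frame_signal(x, fs, win_ms=20.0, shift_ms=10.0):
    """Chop a waveform into overlapping analysis frames.

    20 ms windows every 10 ms, as in the slides. Trailing samples that
    do not fill a whole window are dropped for simplicity.
    """
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(x) - win) // shift
    # index matrix: row t selects samples t*shift .. t*shift + win - 1
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx]
```

At 16 kHz this gives 320-sample frames every 160 samples; each frame is then windowed (e.g. Hamming) before spectral analysis.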
32 Cepstral Analysis Why cepstral analysis? Deconvolution! Pipeline: windowing, DFT, log, IDFT, liftering
|S(ω)| = |E(ω)| |V(ω)|
log |S(ω)| = log |E(ω)| + log |V(ω)|
c(n) = IDFT(log |S(ω)|) = IDFT(log |E(ω)| + log |V(ω)|)
cepstrum, quefrency and liftering: vocal tract components are represented by the slowly varying components concentrated at the lower quefrencies; excitation components are represented by the fast varying components concentrated at the higher quefrencies
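The cepstrum pipeline above (DFT, log magnitude, IDFT, liftering) can be sketched directly; the function names, the epsilon guard and the liftering cutoff are illustrative choices:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: IDFT of the log magnitude spectrum of one frame."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)  # avoid log(0)
    return np.fft.ifft(log_mag).real

def lifter(cep, n_keep=20):
    """Keep only the low-quefrency part (the vocal tract envelope)."""
    out = np.zeros_like(cep)
    out[:n_keep] = cep[:n_keep]
    out[-(n_keep - 1):] = cep[-(n_keep - 1):]  # mirrored negative quefrencies
    return out

# Deconvolution property: for s with S(w) = E(w) V(w),
# cepstrum(s) = cepstrum(e) + cepstrum(v).
n = 64
e = 0.9 ** np.arange(n)                 # toy "excitation"
v = np.zeros(n)
v[:4] = [1.0, 0.5, 0.25, 0.125]         # toy "vocal tract" response
s = np.fft.ifft(np.fft.fft(e) * np.fft.fft(v)).real
```

The additivity of the cepstra of e and v is exactly the deconvolution the slide advertises: the two sources separate along the quefrency axis and liftering picks one of them out.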
33 Mel-Frequency Filter Bank and MFCC DFT -> Mel filter bank -> log -> DCT -> MFCC
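The DFT -> Mel -> log -> DCT chain might be sketched as follows, assuming the common mel-scale formula 2595 log10(1 + f/700) and a type-II DCT; the filter count and number of cepstra are illustrative defaults, not values from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising edge
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc(frame, fs, n_filters=26, n_ceps=13):
    """DFT -> mel filter bank -> log -> DCT, as in the slide's pipeline."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    log_fbank = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-10)
    # type-II DCT written out from its definition
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_fbank
```

The DCT at the end decorrelates the log filter-bank energies, which is what made diagonal-covariance GMMs workable; FBank features simply stop before this step.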
34 Acoustic Units and Dictionary Dictionary: HOW -> HH AW, ARE -> AA R, YOU -> Y UW Context-independent (CI) phonemes: HH AW AA R Y UW Context-dependent (CD) phonemes: coarticulation; phonemes are different if they have different left and right contexts, written P_l-P_c+P_r, e.g. HH-AW+AA, P-AW+AA, HH-AW+B are different CD phonemes; each CD phoneme is modeled by an HMM Other acoustic units: syllables, words, ...; a tradeoff between modeling accuracy and data coverage
35 Context-Dependent Phonemes Advantages: modeling subtle acoustic characteristics given the acoustic contexts Disadvantages: giving rise to a substantial number of CD phonemes, e.g. 45^3 = 91125 triphones from 45 CI phonemes Solution: parameter tying (vowels, stops, fricatives, nasals...), widely used for targets of DNN/CNN systems *after HTK.
36 GMM-HMM: Mathematical Formulation λ = {A, B, π} State transition probability A: a_ij = P(s_{t+1} = j | s_t = i) State observation PDF B: b_i(o_t) = p(o_t | s_t = i) Initial state distribution π: π_i = P(s_1 = i) An HMM is an extension of a Markov chain: a doubly embedded stochastic process; an underlying stochastic process which is not directly observable; another observable stochastic process generated from the hidden stochastic process
37 GMM-HMM: Mathematical Formulation Three fundamental problems for HMMs:
Given the observed sequence O = {o_1, ..., o_T} and a model λ = {A, B, π}, how to evaluate the probability of the observation sequence P(O|λ) = Σ_S P(O, S|λ)?
How do we adjust the model parameters λ = {A, B, π} to maximize P(O|λ)?
Given the observed sequence O = {o_1, ..., o_T} and the model λ = {A, B, π}, how to choose the most likely state sequence S = {s_1, ..., s_T}?
38 GMM-HMM: Acoustic Model Training How to compute the feature sequence likelihood given the acoustic model? The forward-backward algorithm.
Forward computation: α_t(i) = P(o_1 ... o_t, s_t = i | λ)
Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ M
Induction: α_{t+1}(j) = [Σ_{i=1}^M α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ j ≤ M, 1 ≤ t ≤ T-1
Termination: P(O|λ) = Σ_{i=1}^M α_T(i)
Backward computation: β_t(i) = P(o_{t+1} ... o_T | s_t = i, λ)
Initialization: β_T(i) = 1, 1 ≤ i ≤ M
Induction: β_t(i) = Σ_{j=1}^M a_ij b_j(o_{t+1}) β_{t+1}(j), 1 ≤ i ≤ M, T-1 ≥ t ≥ 1
Using both forward and backward variables: P(O|λ) = Σ_{i=1}^M α_t(i) β_t(i) for any t
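The forward and backward recursions can be sketched with the observation likelihoods b_j(o_t) precomputed into a matrix. A minimal sketch, not an optimized (scaled or log-domain) implementation:

```python
import numpy as np

def forward(pi, A, B):
    """Forward pass of the forward-backward algorithm.

    pi: (M,) initial state probabilities
    A:  (M, M) transition probabilities a_ij
    B:  (T, M) precomputed observation likelihoods b_j(o_t)
    Returns alpha (T, M) and the total likelihood P(O | lambda).
    """
    T, M = B.shape
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[0]                      # initialization
    for t in range(1, T):                     # induction
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    return alpha, alpha[-1].sum()             # termination

def backward(A, B):
    """Backward pass: beta_t(i) = P(o_{t+1} ... o_T | s_t = i, lambda)."""
    T, M = B.shape
    beta = np.zeros((T, M))
    beta[-1] = 1.0                            # initialization
    for t in range(T - 2, -1, -1):            # induction
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta
```

A useful sanity check is the identity from the slide: Σ_i α_t(i) β_t(i) equals P(O|λ) at every frame t.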
39 GMM-HMM: Acoustic Model Training How to estimate the model parameters λ given the data? Given the feature sequence O = {o_1, ..., o_T}, where a GMM is used for B:
λ* = argmax_λ P(O|λ), with P(x|s) = Σ_k c_sk N(x; μ_sk, Σ_sk)
the Expectation-Maximization (EM) algorithm (also known as the Baum-Welch algorithm), with updates
c_ik = Σ_t γ_ik(t) / Σ_t γ_i(t)
μ_ik = Σ_t γ_ik(t) o_t / Σ_t γ_ik(t)
Σ_ik = Σ_t γ_ik(t) (o_t - μ_ik)(o_t - μ_ik)^T / Σ_t γ_ik(t)
where the state occupation probability is γ_i(t) = P(s_t = i | O, λ) = α_t(i) β_t(i) / Σ_{i=1}^M α_t(i) β_t(i)
and γ_ik(t) = P(s_t = i, ζ_t = k | O, λ) is the probability of occupying state i with mixture component k at time t, obtained by splitting γ_i(t) in proportion to the component likelihoods c_ik N(o_t; μ_ik, Σ_ik)
40 GMM-HMM: Language Model Training n-gram: approximate the conditional probability with n-1 history words
P(w_1, w_2, ..., w_m) ≈ ∏_{i=1}^m P(w_i | w_{i-n+1}, ..., w_{i-1})
Counting events on context using training data:
P(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1}, ..., w_{i-1}, w_i) / C(w_{i-n+1}, ..., w_{i-1})
unigram, bigram, trigram, 4-gram, ...; the data sparsity issue; back-off strategies and interpolation
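Bigram training by counting, exactly the ratio-of-counts formula above, might look like this sketch (the sentence-boundary markers and the absence of smoothing are simplifying choices; unseen bigrams get probability 0, which is why the back-off and interpolation schemes mentioned above exist):

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model from whitespace-tokenised sentences.

    P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1}), with <s> and </s>
    marking sentence boundaries. Returns p(word, history).
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])               # histories only
        bigrams.update(zip(words[:-1], words[1:]))
    return lambda w, h: bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0

# toy corpus echoing the decoding examples in these slides
p = train_bigram(["how are you", "how is your car"])
```

With this corpus, "how" is followed by "are" and "is" once each, so both continuations get probability 1/2.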
41 GMM-HMM: Decoding How to find the path of a feature sequence that gives the maximum likelihood given the model? Dynamic programming (Viterbi decoding). Let φ_j(t) represent the maximum likelihood of observing the partial sequence from o_1 to o_t and being in state j at time t:
φ_j(t) = max_{s_1, s_2, ..., s_{t-1}} P(s_1, s_2, ..., s_{t-1}, s_t = j, o_1, o_2, ..., o_t | λ)
By induction: φ_j(t) = max_i {φ_i(t-1) a_ij} b_j(o_t)
*after HTK.
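The Viterbi recursion φ_j(t) = max_i {φ_i(t-1) a_ij} b_j(o_t), with backtracking to recover the state sequence; sketched in the log domain to avoid underflow on long utterances (a common practice, not something the slide specifies):

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely state sequence for likelihoods B (T, M), a sketch.

    pi: (M,) initial probabilities, A: (M, M) transitions,
    B: (T, M) observation likelihoods b_j(o_t).
    Returns (best path as a list of states, its log probability).
    """
    T, M = B.shape
    log_A = np.log(A + 1e-300)                 # tiny floor avoids log(0)
    phi = np.log(pi + 1e-300) + np.log(B[0] + 1e-300)
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        scores = phi[:, None] + log_A          # phi_i(t-1) + log a_ij
        back[t] = scores.argmax(axis=0)        # best predecessor of each j
        phi = scores.max(axis=0) + np.log(B[t] + 1e-300)
    # trace back the best path from the best final state
    path = [int(phi.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], phi.max()
```

The same routine with the state sequence constrained to a known transcript is the forced alignment described a few slides later.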
42 GMM-HMM: Decoding (figure: a decoding network from start node S to end node E through silence, with word arcs how, are, you, is, it, your, car) Inject the language model Inject the dictionary Inject CD-HMMs, with each CD-HMM state having a GMM distribution Compute the (log-)likelihood of each feature vector in each CD-HMM state: P(x|s) = Σ_i c_i N(x; μ_i, Σ_i)
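Scoring a feature vector under a state's GMM, P(x|s) = Σ_i c_i N(x; μ_i, Σ_i), is usually done in the log domain. A sketch assuming diagonal covariances (a standard simplification in GMM-HMM systems, not stated on the slide), using the log-sum-exp trick for stability:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    weights: (K,), means/variances: (K, d). Computes
    log sum_k c_k N(x; mu_k, diag(var_k)) via log-sum-exp.
    """
    d = x.shape[0]
    # per-component log c_k + log N(x; mu_k, Sigma_k)
    log_comp = (np.log(weights)
                - 0.5 * (d * np.log(2 * np.pi)
                         + np.sum(np.log(variances), axis=1)
                         + np.sum((x - means) ** 2 / variances, axis=1)))
    m = log_comp.max()                         # log-sum-exp trick
    return m + np.log(np.exp(log_comp - m).sum())
```

In decoding, this function is evaluated for every frame against every active CD-HMM state, which is exactly the "GMMs scorer" box that the DNN later replaces.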
43 GMM-HMM: Decoding (figure: the decoding network expanded with language model probabilities P(are|how), P(how|sil), P(you|are), P(car|your), P(is|how) and CD-HMM states AA-B-1 ... AA-M-4, AA-E-1, AA-E-2, ..., SIL-E-7 ... SIL-E-9, each scored by a GMM)
44 GMM-HMM: Forced Alignment Given the text label ("How are you", modeled by SIL, SIL-HH+AW, HH-AW+AA, AW-AA+R, AA-R+Y, R-Y+UW, Y-UW+SIL), how to find the best underlying state sequence? Same as decoding except the label is known: the Viterbi algorithm (again). Often referred to as Viterbi alignment in the speech community. Widely used in deep learning ASR to generate the target labels for the training data.
45 Context-Dependent DNNs (figure: the same decoding network as before, but the GMM scorer for the CD-HMM states AA-B-1 ... SIL-E-9 is replaced by a DNN)
46 Context-Dependent DNNs The DNN produces posteriors over CD-HMM states (AA-B-1 ... SIL-E-9), which are converted to scaled likelihoods:
P(o|s) = p(s|o) p(o) / p(s) ∝ p(s|o) / p(s)
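Converting DNN posteriors to the scaled likelihoods p(s|o)/p(s) might look like this sketch; the per-frame term p(o) is constant across states and cancels in Viterbi search, and the add-one smoothing of the prior counts is an illustrative choice:

```python
import numpy as np

def estimate_log_priors(alignments, n_states):
    """Estimate log state priors p(s) from frame-level forced alignments.

    alignments: 1-D integer array of per-frame state labels.
    """
    counts = np.bincount(alignments, minlength=n_states).astype(float)
    counts += 1.0                       # add-one smoothing for unseen states
    return np.log(counts / counts.sum())

def pseudo_log_likelihoods(log_posteriors, log_priors):
    """log p(o|s) up to a per-frame constant: log p(s|o) - log p(s).

    log_posteriors: (T, S) DNN outputs, log_priors: (S,).
    """
    return log_posteriors - log_priors[None, :]
```

Dividing by the prior keeps frequent states (such as silence) from dominating the search simply because they were common in the training alignments.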
47 Two Families of DNN-HMM Acoustic Modeling Hybrid systems: the DNN outputs are directly connected to HMMs for acoustic modeling Bottleneck tandem systems: the DNN is used as a feature extractor; the bottleneck features extracted can be used to train GMM-HMMs or DNN-HMMs
48 Modeling with Hidden Variables Hidden variables (or latent variables) are crucial in acoustic modeling (also true for computer vision and NLP): the hidden state or phone sequence in acoustic modeling; the hidden speaker transformation in speaker adaptation; the latent topic and word distributions in Latent Dirichlet Allocation in NLP. They reflect your belief about how a system works (its internal mechanism). Deep neural networks use a large number of hidden variables: hidden variables are organized in a hierarchical fashion; they usually lack a straightforward interpretation (the well-known interpretability issue).
49 DNN-HMMs: A Typical Training Recipe Preparation: 40-dim FBank features with ±4 adjacent frames (input dim = 40x9); use an existing model to generate alignments, which are then converted to 1-hot targets for each frame; create training and validation data sets; estimate the priors p(s) of the CD states. Training: set the DNN configuration (multiple hidden layers, a softmax output layer and the cross-entropy loss function); initialization; optimization based on back-prop using SGD on the training set while monitoring the loss on the validation set. Test: push input features from test utterances through the DNN acoustic model to get their posteriors; convert posteriors to likelihoods; Viterbi decoding on the decoding network; measure the word error rate.
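The ±4-frame input splicing from the recipe (40 x 9 = 360-dim DNN input) can be sketched as follows; `stack_frames` is a hypothetical helper, and replicating the edge frames is one common padding choice:

```python
import numpy as np

def stack_frames(feats, context=4):
    """Splice +/-`context` neighbouring frames onto each frame.

    feats: (T, d) feature matrix. Returns (T, (2*context + 1) * d),
    with the first and last frames replicated at the utterance edges.
    """
    T, d = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    # shifted copies of the feature matrix, concatenated along features
    return np.concatenate(
        [padded[i:i + T] for i in range(2 * context + 1)], axis=1)
```

Each spliced row holds frames t-4 ... t+4 in order, so the original frame t sits in the centre columns of its row.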
50 Training A Hybrid DNN-HMM System (diagram: feature extraction and CD-phone generation with a GMM-HMM system (Attila/Kaldi) produce the features, CD-phone classes and alignment labels; a DNN-HMM is trained on these (Theano/Torch/TensorFlow); its posteriors feed HMM Viterbi decoding (Attila/Kaldi))
More informationDynamic Time-Alignment Kernel in Support Vector Machine
Dynamic Time-Alignment Kernel in Support Vector Machine Hiroshi Shimodaira School of Information Science, Japan Advanced Institute of Science and Technology sim@jaist.ac.jp Mitsuru Nakai School of Information
More informationHidden Markov Models and Gaussian Mixture Models
Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian
More informationA New OCR System Similar to ASR System
A ew OCR System Similar to ASR System Abstract Optical character recognition (OCR) system is created using the concepts of automatic speech recognition where the hidden Markov Model is widely used. Results
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationHidden Markov Models in Language Processing
Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications
More informationHierarchical Multi-Stream Posterior Based Speech Recognition System
Hierarchical Multi-Stream Posterior Based Speech Recognition System Hamed Ketabdar 1,2, Hervé Bourlard 1,2 and Samy Bengio 1 1 IDIAP Research Institute, Martigny, Switzerland 2 Ecole Polytechnique Fédérale
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationStatistical Processing of Natural Language
Statistical Processing of Natural Language and DMKM - Universitat Politècnica de Catalunya and 1 2 and 3 1. Observation Probability 2. Best State Sequence 3. Parameter Estimation 4 Graphical and Generative
More informationSinger Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers
Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla
More informationCourse 495: Advanced Statistical Machine Learning/Pattern Recognition
Course 495: Advanced Statistical Machine Learning/Pattern Recognition Lecturer: Stefanos Zafeiriou Goal (Lectures): To present discrete and continuous valued probabilistic linear dynamical systems (HMMs
More informationLecture 11: Hidden Markov Models
Lecture 11: Hidden Markov Models Cognitive Systems - Machine Learning Cognitive Systems, Applied Computer Science, Bamberg University slides by Dr. Philip Jackson Centre for Vision, Speech & Signal Processing
More informationShankar Shivappa University of California, San Diego April 26, CSE 254 Seminar in learning algorithms
Recognition of Visual Speech Elements Using Adaptively Boosted Hidden Markov Models. Say Wei Foo, Yong Lian, Liang Dong. IEEE Transactions on Circuits and Systems for Video Technology, May 2004. Shankar
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationFACTORIAL HMMS FOR ACOUSTIC MODELING. Beth Logan and Pedro Moreno
ACTORIAL HMMS OR ACOUSTIC MODELING Beth Logan and Pedro Moreno Cambridge Research Laboratories Digital Equipment Corporation One Kendall Square, Building 700, 2nd loor Cambridge, Massachusetts 02139 United
More informationHidden Markov Models. Dr. Naomi Harte
Hidden Markov Models Dr. Naomi Harte The Talk Hidden Markov Models What are they? Why are they useful? The maths part Probability calculations Training optimising parameters Viterbi unseen sequences Real
More informationChapter 9 Automatic Speech Recognition DRAFT
P R E L I M I N A R Y P R O O F S. Unpublished Work c 2008 by Pearson Education, Inc. To be published by Pearson Prentice Hall, Pearson Education, Inc., Upper Saddle River, New Jersey. All rights reserved.
More informationAcoustic Modeling for Speech Recognition
Acoustic Modeling for Speech Recognition Berlin Chen 2004 References:. X. Huang et. al. Spoken Language Processing. Chapter 8 2. S. Young. The HTK Book (HTK Version 3.2) Introduction For the given acoustic
More informationMore on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013
More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative
More informationBayesian Hidden Markov Models and Extensions
Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling
More informationChapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang
Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check
More informationHidden Markov model. Jianxin Wu. LAMDA Group National Key Lab for Novel Software Technology Nanjing University, China
Hidden Markov model Jianxin Wu LAMDA Group National Key Lab for Novel Software Technology Nanjing University, China wujx2001@gmail.com May 7, 2018 Contents 1 Sequential data and the Markov property 2 1.1
More informationSpeech Recognition HMM
Speech Recognition HMM Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz FIT BUT Brno Speech Recognition HMM Jan Černocký, Valentina Hubeika, DCGM FIT BUT Brno 1/38 Agenda Recap variability
More informationFEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION
FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION Sarika Hegde 1, K. K. Achary 2 and Surendra Shetty 3 1 Department of Computer Applications, NMAM.I.T., Nitte, Karkala Taluk,
More informationDiscriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning
Discriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning Zaihu Pang 1, Xihong Wu 1, and Lei Xu 1,2 1 Speech and Hearing Research Center, Key Laboratory of Machine
More informationTinySR. Peter Schmidt-Nielsen. August 27, 2014
TinySR Peter Schmidt-Nielsen August 27, 2014 Abstract TinySR is a light weight real-time small vocabulary speech recognizer written entirely in portable C. The library fits in a single file (plus header),
More informationWe Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named
We Live in Exciting Times ACM (an international computing research society) has named CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Apr. 2, 2019 Yoshua Bengio,
More informationA Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,
More informationExemplar-based voice conversion using non-negative spectrogram deconvolution
Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore
More informationFoundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model
Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipop Koehn) 30 January
More informationMel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda
Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13,
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data Statistical Machine Learning from Data Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL),
More informationMultiscale Systems Engineering Research Group
Hidden Markov Model Prof. Yan Wang Woodruff School of Mechanical Engineering Georgia Institute of echnology Atlanta, GA 30332, U.S.A. yan.wang@me.gatech.edu Learning Objectives o familiarize the hidden
More informationAdaptive Boosting for Automatic Speech Recognition. Kham Nguyen
Adaptive Boosting for Automatic Speech Recognition by Kham Nguyen Submitted to the Electrical and Computer Engineering Department, Northeastern University in partial fulfillment of the requirements for
More informationNEURAL LANGUAGE MODELS
COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,
More informationSignal Modeling Techniques in Speech Recognition. Hassan A. Kingravi
Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction
More informationComparing linear and non-linear transformation of speech
Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,
More informationHidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208
Hidden Markov Model Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/19 Outline Example: Hidden Coin Tossing Hidden
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 24. Hidden Markov Models & message passing Looking back Representation of joint distributions Conditional/marginal independence
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More information