
Automatic Speech Recognition (CS753)
Lecture 17: Discriminative Training for HMMs
Instructor: Preethi Jyothi
Sep 28, 2017

Discriminative Training

Recall: MLE for HMMs

Maximum likelihood estimation (MLE) sets the HMM parameters so as to maximise the objective function:

$$\mathcal{L} = \sum_{i=1}^{N} \log P_\lambda(X_i \mid M_i)$$

where $X_1, \dots, X_N$ are the training utterances, $M_i$ is the HMM corresponding to the word sequence of $X_i$, and $\lambda$ corresponds to the HMM parameters.

What are some conceptual problems with this approach?
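As a minimal illustration of this objective (a sketch with hypothetical names, not the course's reference implementation), the snippet below just sums per-utterance log-likelihoods; `loglik_fn` stands in for whatever computes $\log P(X \mid M)$, e.g. the forward algorithm.

```python
def mle_objective(utterances, reference_hmms, loglik_fn):
    """L = sum_i log P(X_i | M_i), summed over the training set.

    utterances:     list of feature sequences X_i
    reference_hmms: list of HMMs M_i built from the reference word sequences
    loglik_fn:      function (X, M) -> log P(X | M), e.g. the forward algorithm
    """
    return sum(loglik_fn(X, M) for X, M in zip(utterances, reference_hmms))
```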

Discriminative Learning

Discriminative models directly model the class posterior probability, or learn the parameters of a joint probability model discriminatively so that classification errors are minimised. This is in contrast to generative models, which attempt to learn a probability model of the data distribution.

[Vapnik]: "one should solve the (classification/recognition) problem directly and never solve a more general problem as an intermediate step"

[Vapnik]: V. Vapnik, Statistical Learning Theory, 1998

Discriminative Learning

Two central issues in developing discriminative learning methods:
1. Constructing suitable objective functions for optimisation
2. Developing optimisation techniques for these objective functions

Discriminative Training

Maximum mutual information (MMI) estimation: MMI aims to directly maximise the posterior probability (this criterion is also referred to as conditional maximum likelihood):

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log P_\lambda(M_i \mid X_i) = \sum_{i=1}^{N} \log \frac{P_\lambda(X_i \mid M_i)\, P(W_i)}{\sum_{W'} P_\lambda(X_i \mid M_{W'})\, P(W')}$$

where P(W) is the language model probability.
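To make the criterion concrete, here is a small sketch (hypothetical names, not from the lecture) that evaluates one utterance's MMI term given acoustic and language-model log scores for a finite set of competing word sequences; the log-sum-exp implements the denominator sum in log space.

```python
import numpy as np
from scipy.special import logsumexp

def mmi_term(acoustic_logliks, lm_logprobs, ref_index):
    """One utterance's contribution to F_MMI.

    acoustic_logliks: array of log P(X_i | M_W') for each competing word sequence W'
    lm_logprobs:      array of log P(W') for the same hypotheses
    ref_index:        position of the correct word sequence W_i among the hypotheses
    """
    scores = np.asarray(acoustic_logliks) + np.asarray(lm_logprobs)
    # numerator (correct sequence) minus the log of the summed score of all sequences
    return scores[ref_index] - logsumexp(scores)
```

Summing this term over all training utterances gives F_MMI; in practice the denominator sum is restricted to the word sequences present in a lattice, as the following slides discuss.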

Why is it called MMI?

The mutual information I(X, W) between acoustic data X and word labels W is defined as:

$$I(X, W) = \sum_{X,W} \Pr(X, W) \log \frac{\Pr(X, W)}{\Pr(X)\Pr(W)} = \sum_{X,W} \Pr(X, W) \log \frac{\Pr(W \mid X)}{\Pr(W)} = H(W) - H(W \mid X)$$

where H(W) is the entropy of W and H(W | X) is the conditional entropy.
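As a quick sanity check of this identity (a toy example, not part of the lecture), the snippet below computes I(X, W) for a small joint distribution both from the definition and as H(W) - H(W | X); the two values agree.

```python
import numpy as np

# Toy joint distribution Pr(X, W): rows index X, columns index W
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
p_x = joint.sum(axis=1)
p_w = joint.sum(axis=0)

# Definition: I(X, W) = sum_{X,W} Pr(X,W) log [ Pr(X,W) / (Pr(X) Pr(W)) ]
mi_direct = np.sum(joint * np.log(joint / np.outer(p_x, p_w)))

# Entropy form: I(X, W) = H(W) - H(W | X)
h_w = -np.sum(p_w * np.log(p_w))
h_w_given_x = -np.sum(joint * np.log(joint / p_x[:, None]))
mi_entropy = h_w - h_w_given_x

print(mi_direct, mi_entropy)   # both ≈ 0.086 nats
```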

Why is it called MMI?

Assume H(W) is given (fixed) via the language model. Then maximising the mutual information becomes equivalent to minimising the conditional entropy:

$$H(W \mid X) \approx -\frac{1}{N} \sum_{i=1}^{N} \log \Pr(W_i \mid X_i) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\Pr(X_i \mid W_i)\,\Pr(W_i)}{\sum_{W'} \Pr(X_i \mid W')\,\Pr(W')}$$

Thus, MMI estimation is equivalent to maximising:

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log \frac{P(X_i \mid M_i)\, P(W_i)}{\sum_{W'} P(X_i \mid M_{W'})\, P(W')}$$

MMI estimation

How do we compute this?

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log \frac{P(X_i \mid M_i)\, P(W_i)}{\sum_{W'} P(X_i \mid M_{W'})\, P(W')}$$

Numerator: likelihood of the data given the correct word sequence
Denominator: total likelihood of the data given all possible word sequences

Recall: Word Lattices

A word lattice is a pruned version of the decoding graph for an utterance. It is an acyclic directed graph with arc costs computed from the acoustic model and language model scores. Lattice nodes implicitly capture information about time within the utterance.

[Figure: example word lattice over time, with arcs labelled SIL, I, HAVE, MOVE, IT, VERY, VEAL, OFTEN, FINE, FAST. Image from [GY08]: Gales & Young, Application of HMMs in speech recognition, NOW book, 2008]

MMI estimation

How do we compute this?

$$\mathcal{F}_{\text{MMI}} = \sum_{i=1}^{N} \log \frac{P(X_i \mid M_i)\, P(W_i)}{\sum_{W'} P(X_i \mid M_{W'})\, P(W')}$$

Numerator: likelihood of the data given the correct word sequence
Denominator: total likelihood of the data given all possible word sequences. Estimate it by generating lattices and summing over all the word sequences in the lattice.
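One way to picture the lattice-based estimate (a simplified sketch with assumed data structures, not an actual toolkit's lattice code) is a forward pass over the acyclic lattice that log-sum-exps the combined acoustic-plus-LM arc scores over all paths:

```python
import numpy as np
from collections import defaultdict

def lattice_total_logscore(arcs, start_node, end_node):
    """Log of the summed score of all word sequences (paths) in an acyclic lattice.

    arcs: list of (from_node, to_node, arc_logscore), where arc_logscore combines
          the acoustic and language-model log scores of the word on that arc.
          Nodes are assumed to be numbered in topological order.
    """
    alpha = defaultdict(lambda: -np.inf)   # forward log-score accumulated per node
    alpha[start_node] = 0.0
    for u, v, logscore in sorted(arcs):    # process arcs by source node (topological order)
        alpha[v] = np.logaddexp(alpha[v], alpha[u] + logscore)
    return alpha[end_node]
```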

MMI Training and Lattices

Computing the denominator: estimate it by generating lattices and summing over all the word sequences in the lattice.

Numerator lattices: restrict G to a linear chain acceptor representing the words in the correct word sequence.

Lattices are usually only computed once for MMI training. HMM parameter estimation for MMI uses the extended Baum-Welch algorithm [V96, WP00].

Like HMMs, can DNNs also be trained with an MMI-type objective function? Yes!

[V96]: Valtchev et al., Lattice-based discriminative training for large vocabulary speech recognition, 1996
[WP00]: Woodland and Povey, Large scale discriminative training for speech recognition, 2000
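For flavour, here is a hedged sketch of an extended Baum-Welch style mean update (parameter names and the smoothing constant D are illustrative assumptions; see [WP00] for the actual update rules). Numerator statistics come from the reference lattice and denominator statistics from the full lattice.

```python
def ebw_mean_update(num_occ, num_stats, den_occ, den_stats, old_mean, D):
    """Sketch of an extended Baum-Welch mean update for one Gaussian.

    num_occ, den_occ:     occupancies (summed state posteriors) accumulated from the
                          numerator and denominator lattices
    num_stats, den_stats: first-order statistics sum_t gamma(t) * o_t from the two lattices
    old_mean:             current Gaussian mean
    D:                    smoothing constant, chosen large enough to keep the update stable
    """
    return (num_stats - den_stats + D * old_mean) / (num_occ - den_occ + D)
```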

MMI results on Switchboard

Switchboard results on two eval sets (SWB, CHE), trained on 300 hours of speech, comparing maximum likelihood (ML) against discriminatively trained GMM systems and MMI-trained DNNs. Numbers are % WER.

System    | SWB  | CHE  | Total
GMM ML    | 21.2 | 36.4 | 28.8
GMM MMI   | 18.6 | 33.0 | 25.8
DNN CE    | 14.2 | 25.7 | 20.0
DNN MMI   | 12.9 | 24.6 | 18.8

[V et al.]: Vesely et al., Sequence-discriminative training of DNNs, Interspeech 2013

Another Discriminative Training Objective: Minimum Phone/Word Error (MPE/MWE)

MMI is an optimisation criterion at the sentence level. MPE/MWE changes the criterion so that it is directly related to sub-sentence (i.e. word or phone) error rate. The MPE/MWE objective function is defined as:

$$\mathcal{F}_{\text{MPE/MWE}} = \sum_{i=1}^{N} \log \frac{\sum_{W} P_\lambda(X_i \mid M_W)\, P(W)\, A(W, W_i)}{\sum_{W'} P_\lambda(X_i \mid M_{W'})\, P(W')}$$

where A(W, W_i) is the phone/word accuracy of the sentence W given the reference sentence W_i, i.e. the total phone count in W_i minus the sum of insertion, deletion and substitution errors of W.

MPE/MWE training

$$\mathcal{F}_{\text{MPE/MWE}} = \sum_{i=1}^{N} \log \frac{\sum_{W} P_\lambda(X_i \mid M_W)\, P(W)\, A(W, W_i)}{\sum_{W'} P_\lambda(X_i \mid M_{W'})\, P(W')}$$

The MPE/MWE criterion is a weighted average of the phone/word accuracy over all the training instances. A(W, W_i) can be computed either at the phone or the word level, for the MPE or MWE criterion respectively. The weighting given by MPE/MWE depends on the number of incorrect phones/words in the string, while MMI only looks at whether the entire sentence is correct or not.
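The raw accuracy A(W, W_i) defined above can be computed with a plain edit-distance alignment. The sketch below is illustrative only (real MPE implementations use an arc-level approximation inside the lattice); it returns the reference token count minus the insertion/deletion/substitution errors.

```python
def raw_accuracy(hyp, ref):
    """A(W, W_i) = len(ref) - (insertions + deletions + substitutions), via Levenshtein.

    hyp, ref: sequences of phones (MPE) or words (MWE)
    """
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all remaining reference tokens
    for j in range(n + 1):
        d[0][j] = j                       # insert all remaining hypothesis tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return m - d[m][n]

# Example: reference "a b c", hypothesis "a x c": 3 tokens, 1 substitution, so A = 2
print(raw_accuracy("a x c".split(), "a b c".split()))
```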

MPE results on Switchboard

Switchboard results on the SWB eval set, trained on 68 hours of speech, comparing maximum likelihood (MLE) against discriminatively trained (MMI/MPE/MWE) GMM systems.

System    | SWB (%WER) | %WER redn
GMM MLE   | 46.6       | -
GMM MMI   | 44.3       | 2.3
GMM MPE   | 43.1       | 3.5
GMM MWE   | 43.3       | 3.3

[V et al.]: Vesely et al., Sequence-discriminative training of DNNs, Interspeech 2013

How does this fit within an ASR system?

Estimating acoustic model parameters

If A is a speech utterance and O_A are the acoustic features corresponding to the utterance A, ASR decoding returns the word sequence that jointly assigns the highest probability to O_A:

$$W^* = \arg\max_W P_\lambda(O_A \mid W)\, P(W)$$

How do we estimate $\lambda$ in $P_\lambda(O_A \mid W)$? (Covered in this class; see the sketch below.)
- MLE estimation
- MMI estimation
- MPE/MWE estimation
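As a closing illustration of the decoding rule above (a toy sketch over an explicit candidate list with hypothetical scoring functions, rather than a real WFST decoder):

```python
def decode(candidates, acoustic_loglik, lm_logprob):
    """Return the word sequence W maximising log P(O_A | W) + log P(W).

    candidates:      iterable of candidate word sequences W
    acoustic_loglik: function W -> log P_lambda(O_A | W) under the acoustic model
    lm_logprob:      function W -> log P(W) under the language model
    """
    return max(candidates, key=lambda W: acoustic_loglik(W) + lm_logprob(W))
```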