Sparse Models for Speech Recognition

Sparse Models for Speech Recognition
Weibin Zhang and Pascale Fung
Human Language Technology Center, Hong Kong University of Science and Technology

Outline
- Introduction to speech recognition
- Motivations for sparse models
- Maximum likelihood training of sparse models
- ML training of sparse banded models
- Discriminative training of sparse models
- Conclusions

Speech Recognition & its Applications
1. Automatic Speech Recognition (ASR): convert a speech waveform into text automatically
2. Applications: office/business systems, manufacturing, telecommunications, mobile telephony, home automation, navigation

History of ASR -- Technical Point of View

ASR Research -- Overview
Statistical approaches lead in all areas. There is still a big gap between human and machine performance. However, useful systems have been built that are changing the way we interact with the world. "...within five years people will discard their keyboards and interact with computers using touch-screens and voice controls..." (Bill Gates, Feb 2008)

Statistical speech recognition system

Statistical speech recognition system
Language model: P("recognize speech") >> P("wreck a nice beach")
Dictionary: wreck -> r e k; beach -> b i ch
Acoustic model: P(O | "recognize speech")
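
For completeness, these components combine through the standard Bayes decision rule of statistical ASR (a textbook identity, not specific to this talk): the decoder searches for

$$\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid O) = \operatorname*{arg\,max}_{W} P(O \mid W)\, P(W)$$

where the acoustic model supplies $P(O \mid W)$ (with words expanded into phones via the dictionary) and the language model supplies $P(W)$.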

Statistical speech recognition system

Acoustic modeling
Left-to-right hidden Markov models (HMMs); GMM-HMM based acoustic models:
$b_j(\mathbf{o}_t) = p(\mathbf{o}_t \mid s_j) = \sum_m c_{jm}\, \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})$
$\Theta = \{a_{ij}, b_j(\mathbf{o}_t)\} = \{a_{ij}, c_{jm}, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}\}$
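
As an illustration of the emission density above (our sketch, not the authors' code; all parameter values are made-up placeholders), the mixture is best evaluated in the log domain for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_emission(o_t, weights, means, covs):
    """Log of b_j(o_t) = sum_m c_jm * N(o_t; mu_jm, Sigma_jm)."""
    # Evaluate each component's log density, then combine with
    # the log-sum-exp trick to avoid underflow.
    log_terms = np.array([
        np.log(c) + multivariate_normal.logpdf(o_t, mean=mu, cov=sigma)
        for c, mu, sigma in zip(weights, means, covs)
    ])
    m = np.max(log_terms)
    return m + np.log(np.sum(np.exp(log_terms - m)))

# Toy 2-component, 3-dimensional example with placeholder parameters.
rng = np.random.default_rng(0)
weights = [0.4, 0.6]
means = [np.zeros(3), np.ones(3)]
covs = [np.eye(3), 2.0 * np.eye(3)]
print(gmm_log_emission(rng.normal(size=3), weights, means, covs))
```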

Evaluation of ASR systems
Word error rate (WER) = 1 - accuracy
Real-time factor (RTF)
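
WER is conventionally computed from the minimum edit distance between the reference and hypothesis word strings; a minimal dynamic-programming sketch (our illustration, independent of the talk's toolkit):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("recognize speech", "wreck a nice beach"))  # 4 errors over 2 ref words = 2.0
```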

Covariance modeling

Covariance modeling [figure annotation: less training data]

ML training of sparse models
Maximum likelihood (ML) training: $\hat{\Theta} = \arg\max_{\Theta} \log p(O \mid \Theta)$
The proposed new objective function:
$L(\Theta) = \log p(O \mid \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \lVert C_{im} \rVert_1$
where $C_{im}$ is the precision matrix of mixture component $m$ of state $i$.

Maximizing the auxiliary function
The precision matrices can be updated using
$\hat{C}_{im} = \arg\max_{C_{im} \succ 0} \{\log\det C_{im} - \operatorname{trace}(S_{im} C_{im}) - \lambda \lVert C_{im} \rVert_1\}$
where $\lambda = 2\rho / \gamma_{im}$ and $S_{im}$ is the sample covariance matrix.
This can be solved by convex optimization or other more efficient methods (e.g., the graphical lasso).
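
For a single component, the update above is exactly the graphical lasso problem. A sketch using scikit-learn's off-the-shelf solver (our illustration; `alpha` plays the role of $\lambda$, and sklearn's convention leaves the diagonal unpenalized):

```python
import numpy as np
from sklearn.covariance import graphical_lasso

# Placeholder sample covariance standing in for S_im of one (state, mixture)
# pair; in training this would come from occupancy-weighted statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
S = np.cov(X, rowvar=False)

# Solve: argmax_{C > 0} log det C - trace(S C) - alpha * ||C||_1.
# graphical_lasso returns the (covariance, precision) pair.
cov_est, C = graphical_lasso(S, alpha=0.1)
print("nonzero off-diagonal entries:", np.sum(np.abs(C) > 1e-8) - C.shape[0])
```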

Experiments on the WSJ data
Experimental setup:
- Training, development and testing data sets
- Standard bigram language model
- Feature vector: 39-dimensional MFCCs
- 39 phonemes for English ($39^3$ possible triphones)
- 2843 tied HMM states

Tuning results on the dev. data set

WER on the testing data set
Our result of 8.77% WER is comparable to the 8.6% WER reported in (Ko & Mak, 2011) using a similar testing configuration, but using 70 hours of training data.

Model type | #Gaussians | WER (%) | Rel. improv. | Significant?
Full       | 2          | 10.5    | -7.1%        | No
Diagonal   | 10         | 9.84    | ----         | ----
Sparse     | 4          | 8.77    | 10.9%        | Yes

Sparse banded models
[Figure: sparse models + feature reordering -> sparse banded models]

Training of sparse banded models
Weighted lasso penalty:
$f(C_{im}) = \lVert H \odot C_{im} \rVert_1$, where $H(k, l) = \infty \Rightarrow C_{im}(k, l) = 0$
[Figure: precision structures ranging from diagonal through sparse banded to full]
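
A minimal numpy sketch (our construction, with placeholder dimensions and data) of how the weight matrix $H$ encodes the band: an infinite weight outside the band drives those entries of $C_{im}$ to zero under the soft-threshold (proximal) step of the weighted lasso:

```python
import numpy as np

def band_weight_matrix(dim, bandwidth, off_band_weight=np.inf):
    """H(k, l): weight 1 inside the band |k - l| <= bandwidth, inf outside."""
    idx = np.arange(dim)
    return np.where(np.abs(idx[:, None] - idx[None, :]) <= bandwidth,
                    1.0, off_band_weight)

def soft_threshold(C, lam, H):
    """Proximal step for the weighted lasso penalty lam * ||H o C||_1.
    Entries with H = inf are thresholded all the way to zero."""
    thresh = lam * H
    return np.sign(C) * np.maximum(np.abs(C) - thresh, 0.0)

rng = np.random.default_rng(0)
C = np.linalg.inv(np.cov(rng.normal(size=(200, 6)), rowvar=False))
H = band_weight_matrix(6, bandwidth=2)
print(soft_threshold(C, 0.05, H))  # off-band entries come out exactly zero
```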

Importance of the feature order
If $\mathbf{o} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $C = \Sigma^{-1}$, then $C_{ij} = 0$ iff $o_i$ and $o_j$ are conditionally independent (CI) given the other variables.
Rearrange the feature order so that $o_i$ and $o_j$ are CI if $|i - j| > b$.
Three orders are investigated (see the sketch after this list):
- HTK order: $m_1 \ldots m_{13}\; \Delta m_1 \ldots \Delta m_{13}\; \Delta\Delta m_1 \ldots \Delta\Delta m_{13}$
- Knowledge-based order: $m_1\, \Delta m_1\, \Delta\Delta m_1 \ldots m_{13}\, \Delta m_{13}\, \Delta\Delta m_{13}$
- Data-driven order: $m_1\, \Delta\Delta m_1\, \Delta m_6\, \Delta m_{10} \ldots$
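
The sketch below (our illustration on random placeholder data, not the talk's MFCC features) makes the effect of ordering measurable: permute the features, invert the covariance, and compute how much absolute precision mass falls outside the band $|i - j| > b$; a good order leaves little mass off-band.

```python
import numpy as np

def off_band_mass(cov, order, b):
    """Fraction of absolute precision mass outside the band |i - j| > b
    after reordering the features by `order`."""
    C = np.linalg.inv(cov[np.ix_(order, order)])
    idx = np.arange(len(order))
    off = np.abs(idx[:, None] - idx[None, :]) > b
    return np.abs(C[off]).sum() / np.abs(C).sum()

# Toy comparison of two orders on a random SPD covariance (placeholder data;
# in the talk the orders are over statics, deltas, and delta-deltas).
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 12))
cov = A @ A.T + 12 * np.eye(12)
block_order = list(range(12))                         # HTK-like: blocks of streams
interleaved = [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]  # knowledge-based style
print(off_band_mass(cov, block_order, b=3), off_band_mass(cov, interleaved, b=3))
```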

Results on the development data

Results on the test data

Model type | #Gaussians | WER (%) | Rel. improv. | Significant?
Full       | 2          | 10.5    | -7.1%        | No
Diagonal   | 10         | 9.84    | ----         | ----
Sparse     | 4          | 8.77    | 10.9%        | Yes
Band8      | 4          | 8.91    | 9.5%         | Yes

Decoding time
Sparse banded models are the fastest since: 1) they allow smaller search beamwidths; 2) they have fewer model parameters.

Discriminative training
MMI objective function: $\hat{\Theta} = \arg\max_{\Theta} \sum_r \log P(w_r \mid O_r, \Theta)$
New objective function:
$L(\Theta) = \sum_r \log P(w_r \mid O_r, \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \lVert C_{im} \rVert_1$
A valid weak-sense auxiliary function is
$Q(\Theta; \hat{\Theta}) = Q^n(\Theta; \hat{\Theta}) - Q^d(\Theta; \hat{\Theta}) + Q^s(\Theta; \hat{\Theta}) + Q^I(\Theta; \hat{\Theta}) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \lVert C_{im} \rVert_1$
where the numerator and denominator terms $Q^n$ and $Q^d$ are the same as in ML training, $Q^s$ ensures stability, $Q^I$ improves generalization, and the final term is the regularization.

Results on the WSJ testing data

Model type     | #Gaussians | ML training (WER %) | MMI (WER %)
Full           | 2          | 11.68               | 9.18
Diagonal       | 10         | 9.84                | 9.04
Diagonal + STC | 10         | 9.26                | 8.66
Sparse         | 4          | 8.55                | 8.05

Summary
Sparse models effectively address the problems that conventional diagonal and full covariance models face: computational cost, incorrect model assumptions, and over-fitting when training data is insufficient.
We derived the overall training process under the HMM framework using both maximum likelihood training and discriminative training.
The proposed sparse models subsume the traditional diagonal and full covariance models as special cases.

Thank you!