Augmented Statistical Models for Speech Recognition


Augmented Statistical Models for Speech Recognition
Mark Gales & Martin Layton
31 August 2005
Trajectory Models For Speech Processing Workshop

Overview

Dependency Modelling in Speech Recognition:
- latent variables
- exponential family

Augmented Statistical Models:
- augment standard models, e.g. GMMs and HMMs
- extend the representation of dependencies

Augmented Statistical Model Training:
- use maximum margin training
- relationship to dynamic kernels

Preliminary LVCSR experiments

Dependency Modelling

Speech data is dynamic - observations are not of a fixed length. Dependency modelling is an essential part of speech recognition:

    p(o_1, ..., o_T; λ) = p(o_1; λ) p(o_2 | o_1; λ) ... p(o_T | o_1, ..., o_{T-1}; λ)

- impractical to directly model in this form
- make extensive use of conditional independence

Two possible forms of conditional independence used:
- observed variables
- latent (unobserved) variables

Even given the dependencies (the form of the Bayesian network), need to determine how the dependencies interact.

Hidden Markov Model - A Dynamic Bayesian Network

[Figure: (a) standard HMM phone topology - 5-state left-to-right model with transitions a_12, a_23, a_34, a_45, self-loops a_22, a_33, a_44 and output distributions b_2(), b_3(), b_4(); (b) HMM dynamic Bayesian network over states q_t and observations o_t]

Notation for DBNs:
- circles - continuous variables; squares - discrete variables
- shaded - observed variables; non-shaded - unobserved variables

- Observations conditionally independent of other observations given the state.
- States conditionally independent of other states given the previous state.
- Poor model of the speech process - piecewise constant state-space.
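As an illustration of how these two conditional independence assumptions are exploited in practice, here is a minimal forward-algorithm sketch (not from the original slides; the per-frame output log-probabilities are assumed to be precomputed):

```python
import numpy as np

def hmm_log_likelihood(log_A, log_b, log_pi):
    """Forward algorithm: log p(o_1..o_T) for an HMM.

    The recursion is only valid because of the assumptions on this slide:
    each observation depends only on the current state, and each state
    only on the previous one.

    log_A  : (N, N) log transition matrix, log_A[i, j] = log a_ij
    log_b  : (T, N) log output probabilities, log_b[t, j] = log b_j(o_t)
    log_pi : (N,)   log initial state distribution
    """
    T, N = log_b.shape
    alpha = log_pi + log_b[0]                      # log p(o_1, q_1 = j)
    for t in range(1, T):
        # log-sum-exp over the previous state only (Markov assumption)
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)              # log p(o_1..o_T)
```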

Dependency Modelling using Latent Variables

Switching linear dynamical system (discrete and continuous state-spaces):
- observations conditionally independent given the continuous and discrete state
- approximate inference required, e.g. Rao-Blackwellised Gibbs sampling

[Figure: SLDS dynamic Bayesian network over discrete states q_t, continuous states x_t and observations o_t]

Multiple data stream DBN, e.g. factorial HMM / mixed memory model:
- asynchronous data common: speech and video/noise; speech and brain activation patterns
- observation depends on the state of both streams

[Figure: factorial HMM dynamic Bayesian network with two state streams q_t^(1), q_t^(2) and observations o_t]

SLDS Trajectory Modelling

[Figure: MFCC c_1 plotted against frame index for frames from the phrase "SHOW THE GRIDLEY'S ..." (phone sequence sh ow dh ax g r ih dd l iy z); legend: true, HMM, SLDS]

Unfortunately the SLDS doesn't currently classify better than an HMM!

Dependency Modelling using Observed Variables

[Figure: Bayesian network with dependencies between observations o_{t-1}, o_t, o_{t+1}, o_{t+2} and states q_{t-1}, q_t, q_{t+1}, q_{t+2}]

Commonly use a member (or mixture) of the exponential family:

    p(O; α) = (1/τ) h(O) exp( α' T(O) )

- h(O) is the reference distribution; τ is the normalisation term
- α are the natural parameters
- the function T(O) is a sufficient statistic

What is the appropriate form of statistics T(O)? This needs the DBN to be known; for example, in the diagram:

    T(O) = Σ_{t=1}^{T-2} o_t o_{t+1} o_{t+2}
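A small sketch (not from the slides) of how such a sufficient statistic might be computed for the dependency structure in the diagram; the scalar-observation case and the particular statistic below are illustrative assumptions:

```python
import numpy as np

def sufficient_statistic(obs):
    """T(O) = sum_{t=1}^{T-2} o_t * o_{t+1} * o_{t+2} for scalar observations,
    matching the third-order dependency drawn in the DBN above."""
    o = np.asarray(obs, dtype=float)
    return np.sum(o[:-2] * o[1:-1] * o[2:])

def unnormalised_log_density(obs, alpha, log_h):
    """log of h(O) exp(alpha * T(O)); the normaliser tau is omitted,
    since it is generally intractable for sequence data."""
    return log_h(obs) + alpha * sufficient_statistic(obs)
```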

Constrained Exponential Family

Could hypothesise all possible dependencies and prune:
- discriminative pruning found to be useful (buried Markov models)
- impractical for a wide range (and lengths) of dependencies

Consider a constrained form of statistics:
- local exponential approximation to the reference distribution
- ρth-order differential form considered (related to a Taylor series)

Distribution has two parts:
- reference distribution defines the latent variables
- local exponential model defines the statistics T(O)

A slightly more general form is the augmented statistical model:
- train all the parameters (including the reference, base, distribution)

Augmented Statistical Models

Augmented statistical models (related to fibre bundles):

    p(O; λ, α) = (1/τ) p̌(O; λ) exp( α' [ ∇_λ log p̌(O; λ) ;
                                          (1/2!) vec( ∇²_λ log p̌(O; λ) ) ;
                                          ... ;
                                          (1/ρ!) vec( ∇^ρ_λ log p̌(O; λ) ) ] )

Two sets of parameters:
- λ - parameters of the base distribution p̌(O; λ)
- α - natural parameters of the local exponential model

Normalisation term τ ensures that ∫_{R^{nT}} p(O; λ, α) dO = 1:
- can be very complex to estimate
- p(O; λ, α) = p̃(O; λ, α)/τ

Augmented Gaussian Mixture Model

Use a GMM as the base distribution:

    p̌(o; λ) = Σ_{m=1}^M c_m N(o; μ_m, Σ_m)

Considering only the first derivatives of the means:

    p(o; λ, α) = (1/τ) ( Σ_{m=1}^M c_m N(o; μ_m, Σ_m) ) exp( Σ_{n=1}^M P(n|o; λ) α_n' Σ_n^{-1} (o − μ_n) )

[Figure: simple two-component one-dimensional example, plotted for α = [0.0, 0.0] (the base GMM) and two non-zero α settings]
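A minimal sketch of the augmented GMM above, under simplifying assumptions: one-dimensional observations, first-derivative mean statistics only, and the unnormalised density (τ is deliberately not computed):

```python
import numpy as np
from scipy.stats import norm

def aug_gmm_unnorm_pdf(o, weights, means, variances, alpha):
    """Unnormalised augmented GMM density (1-D), first-order mean derivatives.

    base:      p(o) = sum_m c_m N(o; mu_m, var_m)
    augmented: p(o) * exp( sum_n P(n|o) * alpha_n * (o - mu_n) / var_n )
    """
    comp = weights * norm.pdf(o, loc=means, scale=np.sqrt(variances))
    base = comp.sum()
    post = comp / base                                  # component posteriors P(n|o)
    exponent = np.sum(post * alpha * (o - means) / variances)
    return base * np.exp(exponent)

# Two-component one-dimensional example, as on the slide
w = np.array([0.5, 0.5]); mu = np.array([-2.0, 2.0]); var = np.array([1.0, 1.0])
print(aug_gmm_unnorm_pdf(0.3, w, mu, var, alpha=np.array([0.0, 0.0])))  # equals the base GMM
```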

Augmented Model Dependencies

If the base distribution is a mixture of members of the exponential family

    p̌(O; λ) = Π_{t=1}^T Σ_{m=1}^M (c_m / τ^(m)) exp( Σ_{j=1}^J λ_j^(m) T_j^(m)(o_t) )

consider a first-order differential:

    ∂ log p̌(O; λ) / ∂λ_k^(n) = Σ_{t=1}^T P(n | o_t; λ) ( T_k^(n)(o_t) − ∂ log τ^(n) / ∂λ_k^(n) )

Augmented models of this form:
- keep the independence assumptions of the base distribution
- remove conditional independence assumptions of the base model - the local exponential model depends on a posterior ...

Augmented GMMs do not improve temporal modelling ...

Augmented HMM Dependencies

For an HMM:

    p̌(O; λ) = Σ_{θ∈Θ} Π_{t=1}^T ( a_{θ_{t-1} θ_t} Σ_{m∈θ_t} c_m N(o_t; μ_m, Σ_m) )

The derivative depends on the posterior γ_jm(t) = P(θ_t = {s_j, m} | O; λ):

    T(O) = Σ_{t=1}^T γ_jm(t) Σ_jm^{-1} (o_t − μ_jm)

- posterior depends on the complete observation sequence O
- introduces dependencies beyond conditional state independence
- compact representation of the effects of all observations

Higher-order derivatives incorporate higher-order dependencies:
- increasing order of derivatives - increasingly powerful trajectory model
- systematic approach to incorporating additional dependencies
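A sketch (not from the slides) of these first-order mean-derivative statistics, assuming the state/component posteriors γ_jm(t) have already been obtained from a forward-backward pass:

```python
import numpy as np

def mean_derivative_stats(obs, gammas, means, inv_covs):
    """First-order mean-derivative statistics for a Gaussian-HMM base model:
        T_jm(O) = sum_t gamma_jm(t) * Sigma_jm^{-1} (o_t - mu_jm)

    obs      : (T, d) observation sequence
    gammas   : (T, J, M) state/component posteriors from forward-backward
    means    : (J, M, d) component means
    inv_covs : (J, M, d, d) component inverse covariances
    Returns a (J, M, d) array; flattening it gives one block of the score-space.
    """
    diff = obs[:, None, None, :] - means[None]             # (T, J, M, d)
    proj = np.einsum('jmde,tjme->tjmd', inv_covs, diff)    # Sigma^{-1}(o_t - mu)
    return np.einsum('tjm,tjmd->jmd', gammas, proj)
```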

Augmented Model Summary

Extension to standard forms of statistical model. Consists of two parts:
- base distribution determines the latent variables
- local exponential distribution augments the base distribution

Base distribution:
- standard form of statistical model
- examples considered: Gaussian mixture models and hidden Markov models

Local exponential distribution:
- currently based on ρth-order differential form
- gives additional dependencies not present in the base distribution

Normalisation term may be highly complex to calculate:
- maximum likelihood training may be very awkward

Augmented Model Training

Only consider a simplified two-class problem. Bayes decision rule for the binary case (priors P(ω_1) and P(ω_2)):

    P(ω_1) τ^(2) p(O; λ^(1), α^(1))
    -------------------------------  ≷  1     (decide ω_1 if greater, ω_2 otherwise)
    P(ω_2) τ^(1) p(O; λ^(2), α^(2))

equivalently

    (1/T) log( p(O; λ^(1), α^(1)) / p(O; λ^(2), α^(2)) ) + b  ≷  0,
    b = (1/T) log( P(ω_1) τ^(2) / (P(ω_2) τ^(1)) )

- no need to explicitly calculate τ

Can express the decision rule as the following scalar product:

    [ w' b ] [ φ(O; λ) ; 1 ]  ≷  0

- form of score-space and linear decision boundary
- note: restrictions on the α's to ensure a valid distribution

Augmented Model Training - Binary Case (cont)

Generative score-space is given by (first-order derivatives):

    φ(O; λ) = (1/T) [  log p̌(O; λ^(1)) − log p̌(O; λ^(2)) ;
                        ∇_{λ^(1)} log p̌(O; λ^(1)) ;
                       −∇_{λ^(2)} log p̌(O; λ^(2)) ]

- only a function of the base-distribution parameters λ

Linear decision boundary given by:

    w = [ 1 ; α^(1) ; α^(2) ]

- only a function of the exponential model parameters α
- the bias is represented by b - depends on both α and λ

Possibly large number of parameters for the linear decision boundary:
- maximum margin (MM) estimation a good choice - SVM training
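To make the score-space concrete, a toy sketch (my construction, not from the slides): each class is modelled by a single one-dimensional Gaussian and only mean derivatives are used:

```python
import numpy as np
from scipy.stats import norm

def score_space(obs, mu1, var1, mu2, var2):
    """phi(O) = (1/T) [ log-likelihood ratio,
                        d/dmu1 log p(O; class 1),
                       -d/dmu2 log p(O; class 2) ]"""
    o = np.asarray(obs, dtype=float)
    T = len(o)
    ll1 = norm.logpdf(o, mu1, np.sqrt(var1)).sum()
    ll2 = norm.logpdf(o, mu2, np.sqrt(var2)).sum()
    d1 = np.sum((o - mu1) / var1)          # gradient of ll1 w.r.t. mu1
    d2 = np.sum((o - mu2) / var2)          # gradient of ll2 w.r.t. mu2
    return np.array([ll1 - ll2, d1, -d2]) / T

def classify(obs, w, b, **models):
    """Decision rule from the previous slide: w' phi(O) + b >< 0."""
    return 1 if w @ score_space(obs, **models) + b > 0 else 2
```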

Support Vector Machines

[Figure: maximum margin decision boundary, with the margin width and support vectors marked]

SVMs are a maximum margin, binary classifier:
- related to minimising generalisation error
- unique solution (compare to neural networks)
- may be kernelised - training/classification a function of the dot-product x_i . x_j

Can be applied to speech - use a kernel to map variable-length data to a fixed length.

Estimating Model Parameters

Two sets of parameters to be estimated using training data {O_1, ..., O_n}:
- base distribution (kernel): λ = {λ^(1), λ^(2)}
- direction of the decision boundary (y_i ∈ {−1, 1} is the label of training example i):

    w = Σ_{i=1}^n α_i^svm y_i G^{-1} φ(O_i; λ)

  - α^svm = {α_1^svm, ..., α_n^svm} is the set of SVM Lagrange multipliers
  - G is associated with the distance metric for the SVM kernel

Kernel parameters may be estimated using:
- maximum likelihood (ML) training
- discriminative training, e.g. maximum mutual information (MMI)
- maximum margin (MM) training (consistent with the α's)
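A sketch of this training recipe under simplifying assumptions (score-space features already extracted, G taken as the diagonal feature covariance, and scikit-learn's linear SVM standing in for the original trainer):

```python
import numpy as np
from sklearn.svm import SVC

def train_generative_kernel_svm(Phi, y, C=1.0):
    """Phi : (n, d) score-space features phi(O_i; lambda), one row per sequence
       y   : (n,) labels in {-1, +1}

    G is approximated here by the diagonal feature covariance, a common simple
    choice, so the kernel is k(O_i, O_j) = phi_i' G^{-1} phi_j."""
    G_diag = Phi.var(axis=0) + 1e-8
    Phi_w = Phi / np.sqrt(G_diag)                 # metric-whitened features
    svm = SVC(kernel='linear', C=C).fit(Phi_w, y)
    # Recover the score-space direction w = sum_i alpha_i y_i G^{-1} phi_i
    w = svm.coef_.ravel() / np.sqrt(G_diag)
    return svm, w, svm.intercept_[0]
```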

SVMs and Class Posteriors

Common objection to SVMs - no probabilistic interpretation:
- use of an additional sigmoidal mapping / relevance vector machines

Generative kernels - the distance from the decision boundary is the posterior ratio:

    (1/w_1) ( [ w' b ] [ φ(O; λ) ; 1 ] ) = (1/T) log( P(ω_1|O) / P(ω_2|O) )

- w_1 is required to ensure the first element of w is 1
- the augmented version of the kernel PDF becomes the class-conditional PDF

The decision boundary also yields the exponential natural parameters:

    [ 1 ; α^(1) ; α^(2) ] = (1/w_1) w = (1/w_1) Σ_{i=1}^n α_i^svm y_i G^{-1} φ(O_i; λ)

Relationship to Dynamic Kernels

Dynamic kernels are popular for applying SVMs to sequence data. Two standard kernels related to generative kernels are:
- Fisher kernel
- marginalised count kernel

Fisher kernel:
- equivalent to a generative kernel with the two base distributions the same, p̌(O; λ^(1)) = p̌(O; λ^(2)), and only using first-order derivatives
- useful with large amounts of unsupervised data
- can also be described as a marginalised count kernel

Marginalised Count Kernel

Another related kernel is the marginalised count kernel:
- used for discrete data (bioinformatics applications)
- score-space element for the second-order token pairing ab and states θ_a, θ_b:

    φ_ab(O; λ) = Σ_{t=1}^{T-1} I(o_t = a, o_{t+1} = b) P(θ_t = θ_a, θ_{t+1} = θ_b | O; λ)

- compare to an element of the second derivative of the PMF of a discrete HMM:

    φ_ab(O; λ) = Σ_{t=1}^T Σ_{τ=1}^T I(o_t = a, o_τ = b) P(θ_t = θ_a, θ_τ = θ_b | O; λ) + ...

- higher-order derivatives yield higher-order dependencies
- generative kernels allow continuous forms of count kernels
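A sketch of the first statistic above for discrete sequences; the pairwise state posteriors are assumed to come from a forward-backward pass over a discrete HMM (the helper below is mine, not from the slides):

```python
import numpy as np

def marginalised_count_features(obs, pair_post, n_symbols):
    """phi[a, b, i, j] = sum_t I(o_t = a, o_{t+1} = b) * P(theta_t = i, theta_{t+1} = j | O)

    obs       : (T,) integer symbol sequence
    pair_post : (T-1, S, S) posteriors P(theta_t = i, theta_{t+1} = j | O; lambda)
    Returns a (n_symbols, n_symbols, S, S) feature tensor (flatten it for an SVM)."""
    T1, S, _ = pair_post.shape
    phi = np.zeros((n_symbols, n_symbols, S, S))
    for t in range(T1):
        phi[obs[t], obs[t + 1]] += pair_post[t]
    return phi
```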

ISOLET E-Set Experiments

ISOLET - isolated letters from American English. E-set subset {B,C,D,E,G,P,T,V,Z} - highly confusable.

Standard MFCC_E_D_A features, 10 emitting-state HMMs, 2 components/state; first-order mean-derivative score-space for the A-HMM.

    Classifier | Training: Base (λ) | Training: Aug (α) | WER (%)
    HMM        | ML                 | --                | 8.7
    HMM        | MMI                | --                | 4.8
    A-HMM      | ML                 | MM                | 5.0
    A-HMM      | MMI                | MM                | 4.3

Augmented HMMs outperform HMMs for both ML and MMI trained systems:
- best performance using selection/more complex model - 3.2%

Binary Classifiers and LVCSR

Many classifiers (e.g. SVMs) are inherently binary:
- speech recognition has a vast number of possible classes
- how to map to a simple binary problem?

Use pruned confusion networks (Venkataramani et al., ASRU 2003):

[Figure: a word lattice for an utterance such as "BUT IT DIDN'T ELABORATE" converted to a confusion network and then to a pruned confusion network, leaving e.g. the arc pair TO/IN plus !NULL arcs]

- use a standard HMM decoder to generate a word lattice
- generate confusion networks (CN) from the word lattice - gives a posterior for each arc being correct
- prune the CN to a maximum of two arcs (based on posteriors)
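A toy sketch of the pruning step; the data structure is an assumption (each confusion-network slot represented as a dict mapping word to posterior):

```python
def prune_confusion_network(cn, max_arcs=2):
    """Keep only the most probable arcs in each confusion-network slot.

    cn : list of slots, each a dict {word: posterior}
    Slots reduced to a single arc need no rescoring; slots with two arcs
    become binary classification problems for the SVM/A-HMM rescorer."""
    pruned = []
    for slot in cn:
        top = sorted(slot.items(), key=lambda kv: kv[1], reverse=True)[:max_arcs]
        pruned.append(dict(top))
    return pruned

cn = [{"BUT": 0.9, "!NULL": 0.1},
      {"TO": 0.45, "IN": 0.4, "A": 0.15},
      {"IT": 0.97, "!NULL": 0.03}]
print(prune_confusion_network(cn))   # the second slot keeps only TO and IN
```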

LVCSR Experimental Setup

HMMs trained on 400 hours of conversational telephone speech (fsh2004sub):
- standard CUHTK CTS frontend (CMN/CVN/VTLN/HLDA)
- state-clustered triphones (~6000 states, 28 components/state)
- maximum likelihood training
- confusion networks generated for fsh2004sub

Perform 8-fold cross-validation on the 400 hours of training data:
- use CN to obtain highly confusable common word pairs
- ML/MMI-trained word HMMs - 3 emitting states, 4 components per state
- first-order derivative (prior/mean/variance) score-space A-HMMs

Evaluation on held-out data (eval03):
- 6 hours of test data decoded using the LVCSR trigram language model
- baseline using confusion network decoding

8-Fold Cross-Validation LVCSR Results

    Word Pair (Examples/class) | Classifier | Training: Base (λ) | Training: Aug (α) | WER (%)
    CAN/CAN'T (3761)           | HMM        | ML                 | --                | 11.0
                               | HMM        | MMI                | --                | 10.4
                               | A-HMM      | ML                 | MM                | 9.5
    KNOW/NO (4475)             | HMM        | ML                 | --                | 27.7
                               | HMM        | MMI                | --                | 27.1
                               | A-HMM      | ML                 | MM                | 23.8

A-HMM outperforms both ML and MMI HMMs:
- also outperforms when using an equivalent number of parameters
- difficult to split dependency-modelling gains from the change in training criterion

Incorporating Posterior Information

Useful to incorporate the arc log-posteriors (F(ω_1), F(ω_2)) into the decision process:
- the posterior contains e.g. N-gram LM and cross-word context acoustic information

Two simple approaches:
- combination of the two as independent sources (β empirically set):

    (1/T) log( p(O; λ^(1), α^(1)) / p(O; λ^(2), α^(2)) ) + b + β ( F(ω_1) − F(ω_2) )  ≷  0

- incorporate the posterior into the score-space (β obtained from the decision boundary):

    φ_cn(O; λ) = [ F(ω_1) − F(ω_2) ; φ(O; λ) ]

Incorporating in the score-space requires consistency between train/test posteriors.
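A minimal sketch of the two combination options; the function names and the pre-computed inputs (augmented log-likelihood ratio, arc log-posteriors, score-space vector) are assumptions:

```python
import numpy as np

def combine_independent(aug_llr, b, log_post_1, log_post_2, beta):
    """Option 1: add the scaled CN log-posterior difference to the augmented
    log-likelihood-ratio score; decide omega_1 if the result is positive."""
    return aug_llr + b + beta * (log_post_1 - log_post_2)

def cn_score_space(phi, log_post_1, log_post_2):
    """Option 2: prepend the posterior difference as an extra score-space
    dimension, so beta is learned as part of the SVM decision boundary."""
    return np.concatenate(([log_post_1 - log_post_2], phi))
```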

Evaluation Data LVCSR Results

Baseline performance (WER %) using Viterbi and Confusion Network decoding:

    Decoding          | trigram LM
    Viterbi           | 30.8
    Confusion Network | 30.1

Rescore word-pairs using a 3-state/4-component A-HMM + βCN:

    SVM Rescoring | #corrected/#pairs | % corrected
    10 SVMs       | 56/1250           | 4.5%

- β roughly set - error rate relatively insensitive to its exact value
- only 1.6% of the 76157 hypothesised words rescored - more SVMs required!
- more suitable to smaller tasks, e.g. digit recognition in low SNR conditions

Summary

Dependency modelling for speech recognition:
- use of latent variables
- use of sufficient statistics from the data

Augmented statistical models:
- allow simple combination of latent variables and sufficient statistics
- use of a constrained exponential model to define the statistics
- simple to train using an SVM - related to various dynamic kernels

Preliminary results on a large vocabulary speech recognition task:
- SVMs/augmented models possibly useful for speech recognition

Current work:
- maximum margin kernel parameter estimation
- use of weighted finite-state transducers for higher-order derivative calculation
- modified variable-margin training (constrains w_1 = 1)