Detection-Based Speech Recognition with Sparse Point Process Models

Detection-Based Speech Recognition with Sparse Point Process Models
Aren Jansen, Partha Niyogi
Human Language Technology Center of Excellence / Departments of Computer Science and Statistics
ICASSP 2010, Dallas, Texas

Are Frames the Optimal Level of Detail?

[Figure: the standard HMM view of ASR, with a hidden state sequence $q_{t-1}, q_t, q_{t+1}, \ldots$ emitting frame-level observations $x_{t-1}, x_t, x_{t+1}, \ldots$, aligned against an example digit string "six zero six three seven".]

A Unified Event-Driven Approach

Our Strategy: only model and explain the portions of the signal we are reasonably confident about.

Point Process Models (PPM) [Jansen & Niyogi (2009)]
1. Transform the signal into sparse temporal point patterns of acoustic events
2. Decode linguistic objects according to the temporal statistics of these events

Detection-Based ASR Architecture [Ma, Tsao, & Lee (2006)]
1. Run independent detectors for each word in the lexicon in parallel
2. Extract the word sequence from the combined detector output

Our Goal: translate the past robustness successes of point process word modeling to a standard small-vocabulary task.

The AURORA2 Evaluation

AURORA2 Task
Spoken digit sequences (8 kHz), both clean and mixed with additive noise at {20, 15, 10, 5, 0, -5} dB SNR
Stationary (mostly): subway, car, exhibition hall, and street noise
Non-stationary: babble, restaurant, airport, and train station
We consider the clean-training evaluation

Baseline HTK 3.4 Recognizer
MFCCs computed with the AURORA Front-End v2.0 (plus velocity and acceleration coefficients)
11 digit models, 16 states per model (left-to-right, no skip transitions) + 3 silence states = 179 states
3 GMM components per digit state, 6 GMM components per silence state

PPM-Based ASR Architecture

[Figure: block diagram. Speech is passed through a bank of feature detectors $D_{\phi_1}, \ldots, D_{\phi_n}$, producing point patterns $N_{\phi_1}, \ldots, N_{\phi_n}$ that together form the point process representation $R$; word detectors $d_{w_1}, \ldots, d_{w_m}$ operate on $R$ and feed a decoder that outputs the digit sequence.]

Definitions
$D_{\phi_i}$ = detector for feature $\phi_i$
$N_{\phi_i}$ = point pattern (event set) for feature $\phi_i$
$d_{w_j}$ = detector for word $w_j$

Hidden State Feature Detectors, $D_{\phi_i}$

1. Define one feature $\phi_i$ for each of the 179 HMM states
2. Define a detector function for each $\phi_i$:
$$g_{\phi_i}(x) = P(\phi_i \mid x) = \frac{P(x \mid \phi_i)\,P(\phi_i)}{\sum_{j=1}^{179} P(x \mid \phi_j)\,P(\phi_j)}$$
3. Threshold $g_{\phi_i}$ at $\delta_{\phi_i}$ and pick the times of local maxima as acoustic events for feature $\phi_i$: $N_{\phi_i} = \{t_1, t_2, \ldots\}$
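
The event extraction in step 3 can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code: the posteriorgram layout, the 100 Hz frame rate, and the simple peak-picking rule are my assumptions.

```python
import numpy as np

def extract_events(posteriors, thresholds, frame_rate_hz=100.0):
    """Convert a frame-level posteriorgram into sparse point patterns.

    posteriors : (num_frames, num_states) array of g_phi_i(x_t) = P(phi_i | x_t)
    thresholds : (num_states,) array of per-feature thresholds delta_phi_i
    Returns a list of event-time arrays (in seconds), one per feature phi_i.
    """
    num_frames, num_states = posteriors.shape
    events = []
    for i in range(num_states):
        g = posteriors[:, i]
        # local maxima: greater than the left neighbor, at least the right one
        is_peak = np.zeros(num_frames, dtype=bool)
        is_peak[1:-1] = (g[1:-1] > g[:-2]) & (g[1:-1] >= g[2:])
        # keep only the peaks that clear the detection threshold delta_phi_i
        peak_frames = np.where(is_peak & (g > thresholds[i]))[0]
        events.append(peak_frames / frame_rate_hz)  # frame index -> seconds
    return events
```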

Point Process Representation Example

[Figure: comparison for a single test utterance of roughly 1.4 s. Top: the HTK lattice, 26,134 real-valued likelihoods (179 states x 146 frames). Bottom: the point process representation of the same utterance, only 69 real-valued event times, plotted by digit model (one, two, ..., zero, oh, sil) against time.]

Sliding Model Word Detectors, $d_{w_j}$

1. Let $\theta_w : \mathbb{R} \to \{0,1\}$ be the indicator function of an occurrence of word $w$
2. Define the log-likelihood ratio detector function
$$f_w(t) = \log \frac{P(R \mid \theta_w(t)=1)}{P(R \mid \theta_w(t)=0)}$$
3. Introduce a duration latent variable $T$:
$$P(R \mid \theta_w) = \int P(R \mid T, \theta_w)\,P(T \mid \theta_w)\,dT$$
4. Partition $R$ into three subsets: $R_l = R \cap (0, t]$, $R_{t,T} = R \cap (t, t+T]$, and $R_r = R \cap (t+T, L]$. Then,
$$f_w(t) = \log \int \frac{P(R_{t,T} \mid T, \theta_w(t)=1)}{P(R_{t,T} \mid T, \theta_w(t)=0)}\,P(T \mid \theta_w(t)=1)\,dT.$$

Word Model, P(R t,t T, θ w (t)=1) Inhomogeneous Poisson Process Definition Memoryless point process with feature φ i arrival probability λ φi (t)dt in differential time element dt at time t 1 Normalize all t R t,t to the interval [0,1], yielding R = {N φ i } 179 i=1 2 Assume T -independence of R, independent feature detectors, and inhomogeneous Poisson process model for each N φ i : 1 179 P(R t,t T,θ w (t)=1) = e R 1 T Rt,T 0 λ φ i (s)ds λ φi (s), i=1 s N φ i 3 Rate functions {λ φi } 179 1=1 are estimated with parametric model or KDE (examples from HMM force-align)

Example: "seven" Poisson Process Model

[Figure: Poisson process rate parameters $\lambda_{\phi_i}(t)$ for the word "seven", shown as a heat map of feature $\phi_i$ (HMM states grouped by digit model: one, two, ..., zero, oh, sil) versus fraction of word duration, with rates ranging from 0 to roughly 7.]

Background Model, P(R t,t T, θ w (t)=0) Homogeneous Poisson Process Definition Memoryless point process with constant feature φ i arrival probability µ φi dt in any differential time element dt 1 No interval normalization necessary 2 If n φi is the number of events of type φ i in R t,t, then 179 P(R t,t T,θ w (t)=0) = [µ φi ] n φ ie µ φi T, 3 Background rate parameters {µ φi } 179 i=1 are estimated by counting in arbitrary background speech i=1

Graph-Based Decoder [Ma, Tsao, & Lee (2006)]

Input: the digit detectors produce a candidate detection set, along with confidence scores ($f_w$) and durations (the $\arg\max_T$ of the integrand)

Decoder DAG Definition
Vertices: a start vertex at $t = 0$, an end vertex at the end of the utterance, and two vertices for each digit detection (its left and right boundary)
1. Connect each vertex to the next left boundary vertex with weight 0
2. Connect each left boundary vertex to its right boundary vertex with weight $f_w(t)$
3. Connect each right boundary vertex to all left boundary vertices within 20 ms prior with weight 0 (no cycles)

[Figure: example decoder DAG laid out over time, with start vertex s, end vertex e, and left/right boundary pairs for candidate detections of "one", "six", "nine", and "two".]

Decode: find the min-cost path from start to end with Dijkstra's algorithm
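
A compact sketch of the decoding idea. Rather than building the DAG explicitly, it runs an equivalent dynamic program over detections sorted by start time (the slide uses Dijkstra on the explicit graph), and it treats higher $f_w$ as better, i.e. a min-cost path under weights $-f_w(t)$. That sign convention, the tuple format, and the reading of "within 20 ms prior" as an allowed overlap are my assumptions.

```python
def decode_detections(detections, overlap_tol=0.02):
    """Select the word sequence with the highest total detector score.

    detections : list of (word, t_start, t_end, score) candidate detects,
                 where score is the detector confidence f_w at that location
    Consecutive detections may overlap by at most overlap_tol seconds.
    Returns (total_score, list of words in time order).
    """
    # sort by start time so the sweep visits the DAG in topological order
    dets = sorted(detections, key=lambda d: d[1])
    # best[i] = (score, words) of the best sequence ending with detection i
    best = [(score, [word]) for word, _, _, score in dets]
    for i, (word_i, start_i, _, score_i) in enumerate(dets):
        for j in range(i):
            end_j = dets[j][2]
            # detection i may follow detection j if it starts no more than
            # overlap_tol seconds before detection j ends
            if start_i >= end_j - overlap_tol and best[j][0] + score_i > best[i][0]:
                best[i] = (best[j][0] + score_i, best[j][1] + [word_i])
    # the all-skip path (empty hypothesis) has total weight 0
    best_score, best_words = 0.0, []
    for score, words in best:
        if score > best_score:
            best_score, best_words = score, words
    return best_score, best_words
```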

What About Robustness?

[Figure: detector output $g_\phi(t) = P(\phi \mid x_t)$ for $\phi = \text{seven}_5$, plotted over a 0.32-0.44 s window for clean speech and for the same segment in subway noise at 20 dB SNR.]

Feature Detector Threshold Adaptation

1. Find the feature detector threshold $\delta^*_{\phi_i}$ that maintains the background firing rate observed on clean speech
2. Use the clean word/background models with the adapted feature detector thresholds

Underlying Assumptions
1. The times and relative strengths of the local maxima are preserved in noise
2. The background firing rate is an adequate statistic

[Figure: background (mean) firing rate in Hz as a function of detector threshold for feature $\phi = \text{seven}_5$, for clean speech and for 20 dB subway noise; the adapted threshold $\delta^*_\phi$ is the point where the noisy curve reproduces the clean-speech firing rate obtained at $\delta_\phi$.]

This method is entirely unsupervised.
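
A sketch of the adaptation rule for a single feature detector, assuming we have its output trace on a stretch of unlabeled noisy audio and the background firing rate measured on clean speech; the threshold grid, frame rate, and peak-picking rule are illustrative assumptions.

```python
import numpy as np

def adapt_threshold(noisy_detector_trace, clean_rate_hz, frame_rate_hz=100.0):
    """Choose delta*_phi so that the background firing rate on noisy speech
    matches the rate clean_rate_hz observed on clean speech at delta_phi."""
    g = np.asarray(noisy_detector_trace, dtype=float)
    duration_s = len(g) / frame_rate_hz
    # local maxima of the detector trace (same peak picking as event extraction)
    is_peak = np.zeros(len(g), dtype=bool)
    is_peak[1:-1] = (g[1:-1] > g[:-2]) & (g[1:-1] >= g[2:])
    peak_heights = g[is_peak]

    best_delta, best_gap = None, float("inf")
    for delta in np.linspace(0.05, 0.95, 91):          # candidate thresholds
        rate_hz = np.sum(peak_heights > delta) / duration_s
        gap = abs(rate_hz - clean_rate_hz)
        if gap < best_gap:
            best_delta, best_gap = delta, gap
    return best_delta
```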

Clean Speech Performance

HTK (% Acc)   PPM (% Acc)
99.0          98.3

Only a 0.7% (absolute) increase in WER after a 400x reduction in representational data. A possible explanation: the force-aligned digit training examples were imperfect.

Non-Stationary Noise Performance

Train: Clean, Test: Babble
SNR         HTK    PPM    Adapt PPM
20 dB       90.2   92.2   93.6
15 dB       73.8   83.1   89.7
10 dB       49.4   65.0   80.2
5 dB        26.8   43.8   62.5
0 dB         9.3   22.3   35.8
-5 dB        1.6   10.0   16.4
Avg (0-20)  49.9   61.3   72.4

Train: Clean, Test: Airport
SNR         HTK    PPM    Adapt PPM
20 dB       90.6   92.6   94.3
15 dB       77.0   84.3   90.4
10 dB       53.9   68.8   82.2
5 dB        30.3   46.0   66.7
0 dB        14.4   24.1   40.6
-5 dB        8.2   11.5   20.0
Avg (0-20)  53.2   63.2   74.8

The non-adapted PPM system is significantly more robust to non-stationary noise than the HMM system; unsupervised feature detector threshold adaptation provides further gains.

Stationary Noise Performance

Train: Clean, Test: Subway
SNR         HTK    PPM    Adapt PPM
20 dB       97.1   94.1   94.6
15 dB       93.5   86.0   89.6
10 dB       78.7   67.4   79.4
5 dB        52.2   40.4   61.3
0 dB        26.0   18.8   34.6
-5 dB       11.2    8.5   15.6
Avg (0-20)  69.5   61.3   71.9

Train: Clean, Test: Car
SNR         HTK    PPM    Adapt PPM
20 dB       97.4   94.6   95.1
15 dB       90.0   85.9   89.9
10 dB       67.0   63.0   76.8
5 dB        34.1   33.3   57.1
0 dB        14.5   13.6   29.0
-5 dB        9.4    6.5   12.6
Avg (0-20)  60.6   58.1   69.6

The non-adapted PPM system is less robust to stationary noise than the HTK system; a suboptimal feature detector threshold is the culprit. Unsupervised threshold adaptation improves robustness beyond HTK levels at the lower SNRs.

Conclusions

1. Discarding 99.7% of the HMM lattice results in a negligible loss in small-vocabulary recognition accuracy
2. Sparse point process word models combined with a detection-based ASR architecture improve robustness to all of the non-stationary noise sources in AURORA2
3. Unsupervised PPM adaptation (using only 1 minute of data) improves robustness to all noise sources
4. Our system is compatible with other noise robustness techniques (both front-end methods and GMM adaptation)
5. Sparse point process representations may supply the computational efficiency required to scale up detection-based ASR systems