Detection-Based Speech Recognition with Sparse Point Process Models


1 Detection-Based Speech Recognition with Sparse Point Process Models
Aren Jansen, Partha Niyogi
Human Language Technology Center of Excellence; Departments of Computer Science and Statistics
ICASSP 2010, Dallas, Texas

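2 Are Frames the Optimal Level of Detail?
[Figure: HMM graphical model with hidden states $q_{t-1}, q_t, q_{t+1}$ and frame-level observations $x_{t-1}, x_t, x_{t+1}$, shown against the digit sequence "... six zero six three seven ...".]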

3 A Unified Event-Driven Approach
Our Strategy: Only model and explain the portions of the signal we are reasonably confident about.
Point Process Models (PPM) [Jansen & Niyogi (2009)]
1. Transform the signal into sparse temporal point patterns of acoustic events
2. Decode linguistic objects according to the temporal statistics of these events
Detection-Based ASR Architecture [Ma, Tsao, & Lee (2006)]
1. Run independent detectors for each word in the lexicon in parallel
2. Extract the word sequence from the combined detector set output
Our Goal: Translate the past robustness success of point process word modeling to a standard small-vocabulary task.

4 The AURORA2 Evaluation
AURORA2 Task
Spoken digit sequences (8 kHz), both clean and mixed with additive noise at {20, 15, 10, 5, 0, -5} dB SNR
Stationary (mostly): subway, car, exhibition hall, and street noise
Non-stationary: babble, restaurant, airport, and train station
We consider the clean-training evaluation condition
Baseline HTK 3.4 Recognizer
MFCCs computed with AURORA Front-End v2.0 (plus velocity and acceleration coefficients)
11 digit models × 16 states/model (left-to-right, no skip transitions) + 3 silence states = 179 states
3 GMM components per digit state, 6 GMM components per silence state

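5 PPM-Based ASR Architecture
[Figure: block diagram. Speech is passed in parallel through the feature detectors $D_{\phi_1}, D_{\phi_2}, \ldots, D_{\phi_n}$, which emit the point patterns $N_{\phi_1}, N_{\phi_2}, \ldots, N_{\phi_n}$ (collectively $R$). These feed the word detectors $d_{w_1}, d_{w_2}, \ldots, d_{w_m}$, whose outputs go to a decoder that produces the digit sequence.]
Definitions
$D_{\phi_i}$ = detector for feature $\phi_i$
$N_{\phi_i}$ = point pattern (event set) for feature $\phi_i$
$d_{w_j}$ = detector for word $w_j$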

6 Hidden State Feature Detectors, $D_{\phi_i}$
1. Define one feature $\phi_i$ for each of the 179 HMM states
2. Define a detector function for each $\phi_i$:
   $g_{\phi_i}(x) = P(\phi_i \mid x) = \dfrac{P(x \mid \phi_i)\,P(\phi_i)}{\sum_{j=1}^{179} P(x \mid \phi_j)\,P(\phi_j)}$
3. Threshold $g_{\phi_i}$ at $\delta_{\phi_i}$ and pick the times of its local maxima as acoustic events for feature $\phi_i$: $N_{\phi_i} = \{t_1, t_2, \ldots\}$
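The detector steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the 3-state toy input, and the 10 ms `frame_period` are assumptions introduced here.

```python
def state_posteriors(likelihoods, priors):
    """Bayes rule: P(phi_i | x) from likelihoods P(x | phi_i) and priors P(phi_i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def detect_events(trace, threshold, frame_period=0.01):
    """Times (s) of local maxima of one detector trace that exceed the threshold.

    frame_period is an assumed frame step; the slides do not state one.
    """
    events = []
    for k in range(1, len(trace) - 1):
        if trace[k] > threshold and trace[k - 1] < trace[k] >= trace[k + 1]:
            events.append(k * frame_period)
    return events


# Toy 3-state example (the slides use 179 states).
post = state_posteriors([0.2, 0.5, 0.1], [0.5, 0.25, 0.25])

# Toy posterior trace g_phi(x_t) for a single feature over 9 frames.
trace = [0.1, 0.3, 0.9, 0.4, 0.2, 0.6, 0.8, 0.7, 0.1]
events = detect_events(trace, threshold=0.5)
```

Only the peak times are kept, so each feature is reduced from one real value per frame to a short list of event times.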


7 Point Process Representation Example
[Figure: two panels for the same example, labeled "926". Top panel, HTK lattice: real-valued likelihoods (179 states × 146 frames). Bottom panel, point process representation: 69 real-valued times. Vertical axis in both panels: HMM states grouped by model (one, two, three, four, five, six, seven, eight, nine, zero, oh, sil). Horizontal axis: Time (s).]
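The counts quoted for this example give a feel for how sparse the point process representation is. The short calculation below uses only the three numbers on this slide; treating this single example as representative of the whole task is an assumption.

```python
states, frames, events = 179, 146, 69

lattice_values = states * frames          # real-valued likelihoods in the HTK lattice
reduction = lattice_values / events       # lattice values per retained event time
discarded = 1 - events / lattice_values   # fraction of the lattice that is dropped

print(f"{lattice_values} values -> {events} event times")
print(f"reduction factor: {reduction:.1f}x, discarded: {100 * discarded:.2f}%")
```

For this example the reduction is a little under 400-fold, the same order as the 400X reduction and the 99.7% figure quoted later in the deck.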

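8 Sliding Model Word Detectors, $d_{w_j}$
1. Let $\theta_w : \mathbb{R} \to \{0,1\}$ be the indicator function of word occurrence
2. Define the log-likelihood ratio (LLR) detector function
   $f_w(t) = \log\left[\dfrac{P(R \mid \theta_w(t)=1)}{P(R \mid \theta_w(t)=0)}\right]$
3. Introduce a latent duration variable $T$:
   $P(R \mid \theta_w) = \int P(R \mid T, \theta_w)\,P(T \mid \theta_w)\,dT$
4. Partition $R$ into three subsets: $R_l = R \cap (0,t]$, $R_{t,T} = R \cap (t,t+T]$, and $R_r = R \cap (t+T,L]$. Then
   $f_w(t) = \log \int \dfrac{P(R_{t,T} \mid T, \theta_w(t)=1)}{P(R_{t,T} \mid T, \theta_w(t)=0)}\,P(T \mid \theta_w(t)=1)\,dT$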
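In practice the integral over the duration $T$ has to be approximated. The sketch below replaces it with a sum over a discrete grid of candidate durations and evaluates it in the log domain. The grid, the toy `log_ratio` function, and the name `detector_score` are assumptions made for illustration; the slides do not say how the integral is computed.

```python
import math


def detector_score(log_ratio, durations, duration_prior):
    """Approximate f_w(t) = log sum_k ratio(T_k) * P(T_k | theta_w = 1).

    log_ratio:      function T -> log[P(R_tT | T, 1) / P(R_tT | T, 0)]
    durations:      candidate durations T_k in seconds (assumed grid)
    duration_prior: P(T_k | theta_w = 1) on that grid, summing to 1
    Returns the score and the duration that maximizes the summand.
    """
    terms = [(log_ratio(T) + math.log(p), T)
             for T, p in zip(durations, duration_prior) if p > 0]
    peak, best_T = max(terms)
    # log-sum-exp keeps the sum stable when the log ratios are large.
    score = peak + math.log(sum(math.exp(x - peak) for x, _ in terms))
    return score, best_T


durations = [0.2, 0.3, 0.4]
prior = [0.25, 0.5, 0.25]

# Sanity check: a ratio of 1 for every duration gives a score of log(1) = 0.
flat_score, flat_T = detector_score(lambda T: 0.0, durations, prior)

# Toy log ratio that peaks at T = 0.3 s.
score, best_T = detector_score(lambda T: 2.0 - 50.0 * (T - 0.3) ** 2, durations, prior)
```

The maximizing duration is returned alongside the score because the decoder slide uses the arg max over $T$ of the integrand as the duration of each detection.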

9 Word Model, $P(R_{t,T} \mid T, \theta_w(t)=1)$
Inhomogeneous Poisson Process Definition: a memoryless point process with feature $\phi_i$ arrival probability $\lambda_{\phi_i}(t)\,dt$ in the differential time element $dt$ at time $t$
1. Normalize all $t \in R_{t,T}$ to the interval $[0,1]$, yielding $R' = \{N'_{\phi_i}\}_{i=1}^{179}$
2. Assume $T$-independence of $R'$, independent feature detectors, and an inhomogeneous Poisson process model for each $N'_{\phi_i}$:
   $P(R_{t,T} \mid T, \theta_w(t)=1) = \dfrac{1}{T^{|R_{t,T}|}} \prod_{i=1}^{179} e^{-\int_0^1 \lambda_{\phi_i}(s)\,ds} \prod_{s \in N'_{\phi_i}} \lambda_{\phi_i}(s)$
3. The rate functions $\{\lambda_{\phi_i}\}_{i=1}^{179}$ are estimated with a parametric model or KDE (training examples come from HMM forced alignment)
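As a rough illustration of the word model, the sketch below evaluates its log-likelihood with piecewise-constant rate functions on equal-width bins over $[0,1]$. The binned rates are an assumption chosen to keep the integral trivial (the slide mentions a parametric model or KDE instead), and the feature name and toy numbers are hypothetical.

```python
import math


def word_log_likelihood(events, t, T, rates):
    """log P(R_tT | T, theta_w(t)=1) for piecewise-constant rates.

    events: {feature: [event times in seconds]}
    rates:  {feature: [rate in each equal-width bin of normalized time [0, 1]]}
    """
    loglik = 0.0
    for feature, bins in rates.items():
        width = 1.0 / len(bins)
        loglik -= sum(bins) * width                # -integral_0^1 lambda(s) ds
        for time in events.get(feature, []):
            if t < time <= t + T:                  # event lies in (t, t + T]
                s = (time - t) / T                 # normalize to (0, 1]
                b = min(int(s / width), len(bins) - 1)
                # Each event contributes lambda(s) and one factor of 1/T.
                loglik += math.log(bins[b]) - math.log(T)
    return loglik


rates = {"seven_5": [2.0, 4.0]}                    # hypothetical feature, two bins
events = {"seven_5": [0.10, 0.35]}                 # event times in seconds
ll = word_log_likelihood(events, t=0.0, T=0.4, rates=rates)
```

Here the two events fall in the first and second bins, so the result is $-3 + \log 2 + \log 4 - 2\log 0.4$.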

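10 Example: "seven" Poisson Process Model
[Figure: Poisson process rate parameters $\lambda_{\phi_i}(t)$ for the word "seven". Vertical axis: feature $\phi_i$ (HMM state), grouped by model (one, two, three, four, five, six, seven, eight, nine, zero, oh, sil). Horizontal axis: fraction of word duration.]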

11 Background Model, $P(R_{t,T} \mid T, \theta_w(t)=0)$
Homogeneous Poisson Process Definition: a memoryless point process with constant feature $\phi_i$ arrival probability $\mu_{\phi_i}\,dt$ in any differential time element $dt$
1. No interval normalization is necessary
2. If $n_{\phi_i}$ is the number of events of type $\phi_i$ in $R_{t,T}$, then
   $P(R_{t,T} \mid T, \theta_w(t)=0) = \prod_{i=1}^{179} [\mu_{\phi_i}]^{n_{\phi_i}} e^{-\mu_{\phi_i} T}$
3. The background rate parameters $\{\mu_{\phi_i}\}_{i=1}^{179}$ are estimated by counting events in arbitrary background speech
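The background model is simple enough to write out directly. The sketch below estimates the rates by counting and evaluates the log of the product above; the feature names and counts are made up for illustration.

```python
import math


def estimate_background_rates(event_counts, total_seconds):
    """mu_i = (number of phi_i events) / (seconds of background speech)."""
    return {f: n / total_seconds for f, n in event_counts.items()}


def background_log_likelihood(counts, mu, T):
    """log P(R_tT | T, theta_w(t)=0) = sum_i [n_i * log(mu_i) - mu_i * T]."""
    return sum(counts.get(f, 0) * math.log(rate) - rate * T
               for f, rate in mu.items())


# Hypothetical counts from 60 s of background speech.
mu = estimate_background_rates({"seven_5": 30, "six_2": 60}, total_seconds=60.0)

# Two seven_5 events and no six_2 events in a 0.4 s window.
bg = background_log_likelihood({"seven_5": 2}, mu, T=0.4)
```

A feature with no events in the window still contributes its $-\mu_{\phi_i} T$ term, which is why the sum runs over all features rather than only the observed ones.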

12 Graph-Based Decoder [Ma, Tsao, & Lee (2006)]
Input: The digit detectors produce a set of candidate detections, along with confidence scores ($f_w$) and durations (arg max over $T$ of the integrand)
Decoder DAG Definition
Vertices: a start vertex at $t = 0$, an end vertex at $t = \infty$, and two for each digit detection (left and right boundary)
1. Connect each vertex to the next left boundary vertex with weight 0
2. Connect each left boundary vertex to its right boundary vertex with weight $f_w(t)$
3. Connect each right boundary vertex to all left boundaries within the 20 ms prior with weight 0 (no cycles)
[Figure: example decoder graph along a time axis, with start vertex s, end vertex e, and left (L) and right (R) boundary vertices for detections of "one", "six", "nine", and "two".]
Decode: Min-cost path from start to end with Dijkstra's algorithm
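The path search can be illustrated with a simplified stand-in. The sketch below does not build the graph or run Dijkstra's algorithm; it uses a dynamic program over detections sorted by right boundary (weighted interval scheduling), which picks a chain of detections in which each one starts no earlier than 20 ms before the previous one ends. It also treats $f_w$ as a reward to be maximized, which corresponds to a path cost of $-f_w$; that sign convention is an assumption, as are the detection tuples and the requirement that every detection is longer than the overlap tolerance.

```python
def decode(detections, overlap=0.02):
    """Pick the best-scoring chain of detections.

    detections: list of (word, start, end, score), with end - start > overlap
    Returns (total_score, [words in time order]).
    """
    dets = sorted(detections, key=lambda d: d[2])      # order by right boundary
    best = [(0.0, [])]                                 # best[k]: optimum over first k detections
    for i, (word, start, end, score) in enumerate(dets):
        # Earlier detections that end at most `overlap` after this one starts.
        p = sum(1 for d in dets[:i] if d[2] <= start + overlap)
        take = (best[p][0] + score, best[p][1] + [word])
        skip = best[i]                                 # bypass this detection at zero cost
        best.append(max(take, skip, key=lambda c: c[0]))
    return best[-1]


# Hypothetical detections: (word, start s, end s, LLR score).
detections = [
    ("six", 0.10, 0.50, 3.0),
    ("six", 0.12, 0.52, 2.0),    # weaker duplicate of the first detection
    ("nine", 0.51, 0.95, 4.0),
    ("two", 0.60, 0.90, -1.0),   # negative score, never worth taking
]
total, words = decode(detections)
```

Skipping a detection plays the role of the zero-weight edges in rule 1, and the `overlap` tolerance plays the role of rule 3.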

13 What About Robustness?
[Figure: detector output $g_\phi(t) = P(\phi \mid x_t)$ versus time (s) for $\phi = \text{seven}_5$, comparing clean speech with subway noise at 20 dB SNR.]

14 Feature Detector Threshold Adaptation
1. Find the feature detector threshold $\delta^*_{\phi_i}$ that maintains the background firing rate from clean speech
2. Use the clean word/background models with the adapted feature detector thresholds
Underlying Assumptions
1. Times and relative strengths of local maxima are preserved
2. The background rate is an adequate statistic
[Figure: detector behavior for feature $\phi = \text{seven}_5$. Background (mean) firing rate (Hz) versus detector threshold, for clean speech and for subway noise at 20 dB, with $\delta_\phi$, $\delta^*_\phi$, and $\mu_\phi$ marked.]
This method is entirely unsupervised
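One way to realize step 1 is to rank the local maxima of the noisy detector trace and place the threshold so that the number of surviving peaks per second matches the clean background rate. The sketch below does this for a single feature. The midpoint rule, the function names, and the 10 ms `frame_period` are assumptions; the slides do not specify the search procedure.

```python
def local_maxima(trace):
    """Heights of the local maxima of a detector trace."""
    return [trace[k] for k in range(1, len(trace) - 1)
            if trace[k - 1] < trace[k] >= trace[k + 1]]


def adapt_threshold(noisy_trace, target_rate_hz, frame_period=0.01):
    """Threshold whose firing rate on noisy_trace best matches target_rate_hz.

    No transcripts are needed: only the unlabeled noisy trace and the
    background rate measured on clean speech.
    """
    duration = len(noisy_trace) * frame_period
    peaks = sorted(local_maxima(noisy_trace), reverse=True)
    n_target = round(target_rate_hz * duration)    # number of events to keep
    if n_target <= 0:
        return 1.0          # a posterior never exceeds 1, so nothing fires
    if n_target >= len(peaks):
        return 0.0          # keep every positive peak
    # Midpoint between the weakest kept peak and the strongest rejected one.
    return 0.5 * (peaks[n_target - 1] + peaks[n_target])


# Toy noisy trace: 10 frames = 0.1 s, with peaks of height 0.2, 0.6, 0.4, 0.9.
noisy = [0.0, 0.2, 0.1, 0.6, 0.3, 0.4, 0.2, 0.9, 0.5, 0.0]
delta_star = adapt_threshold(noisy, target_rate_hz=20.0)
kept = sum(1 for p in local_maxima(noisy) if p > delta_star)
```

With a 20 Hz target over 0.1 s, two peaks should survive, so the threshold lands between the second and third strongest peaks.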

15 Clean Speech Performance
[Table: clean speech accuracy with columns HTK (% Acc) and PPM (% Acc); values missing.]
Only a 0.7% WER increase after a 400X reduction in representational data
Possible explanation: the forced-aligned digit training examples were imperfect

16 Non-Stationary Noise Performance
[Table: Train: Clean, Test: Babble. Columns: SNR, HTK, PPM, Adapt PPM. Rows: 20 dB down to -5 dB SNR, plus Avg. (0-20); values missing.]
[Table: Train: Clean, Test: Airport. Columns: SNR, HTK, PPM, Adapt PPM. Rows: 20 dB down to -5 dB SNR, plus Avg. (0-20); values missing.]
The non-adapted PPM system is significantly more robust than the HMM system to non-stationary noise
Unsupervised feature detector threshold adaptation provides further gains

17 Stationary Noise Performance
[Table: Train: Clean, Test: Subway. Columns: SNR, HTK, PPM, Adapt PPM. Rows: 20 dB down to -5 dB SNR, plus Avg. (0-20); values missing.]
[Table: Train: Clean, Test: Car. Columns: SNR, HTK, PPM, Adapt PPM. Rows: 20 dB down to -5 dB SNR, plus Avg. (0-20); values missing.]
The non-adapted PPM system is less robust than the HTK system to stationary noise
A suboptimal feature detector threshold is the culprit
Unsupervised threshold adaptation improves robustness over HTK levels at lower SNRs

18 Conclusions
1. Discarding 99.7% of the HMM lattice results in a negligible loss in small-vocabulary recognition accuracy
2. Sparse point process word models combined with a detection-based ASR architecture improve robustness to all non-stationary noise sources in AURORA2
3. Unsupervised PPM adaptation (only 1 minute of data) improves robustness to all noise sources
4. Our system is compatible with other noise robustness techniques (both front-end and GMM adaptation)
5. Sparse point process representations may supply the computational efficiency required to scale up detection-based ASR systems
