Object Tracking and Asynchrony in Audio-Visual Speech Recognition


1 Object Tracking and Asynchrony in Audio-Visual Speech Recognition
Mark Hasegawa-Johnson, AIVR Seminar, August 31, 2006
AVICAR exists thanks to Bowon Lee, Ming Liu, Camille Goudeseune, Suketu Kamdar, Carl Press, and Sarah Borys, and to the Motorola Communications Center.
Some experiments and most good ideas in this talk are thanks to Ming Liu, Karen Livescu, Kate Saenko, and Partha Lal.

2 Why AVSR is not like ASR
- Use of classifiers as features: e.g., the output of an AdaBoost lip tracker is a feature in a face constellation.
- Obstruction: the tongue is rarely visible, the glottis never.
- Asynchrony: visual evidence for a word can start long before the audio evidence. Which digit is she about to say?

3 Why ASR is like AVSR
- Use of classifiers as features: e.g., neural networks or SVMs transform audio spectra into a phonetic feature space.
- Obstruction: lip closure hides tongue closure; a glottal stop hides lip or tongue position.
- Asynchrony: tongue, lips, velum, and glottis can be out of sync, e.g., "every" pronounced as "ervy".

4 Discriminative Features in Face/Lip Tracking: AdaBoost
1. Each wavelet defines a weak classifier: h_i(x) = 1 iff f_i(x) > threshold, else h_i(x) = 0.
2. Start with equal weight for all M training tokens: w_m^(1) = 1/M, 1 ≤ m ≤ M.
3. For each learning iteration t:
   1. Find the feature i that minimizes the weighted training error ε_t.
   2. Down-weight w_m if token m was correctly classified, up-weight it otherwise, then renormalize.
   3. α_t = log((1 − ε_t) / ε_t)
4. The final strong classifier is H(x) = 1 iff Σ_t α_t h_t(x) > ½ Σ_t α_t.
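To make the loop above concrete, here is a minimal sketch of discrete AdaBoost with threshold weak classifiers, assuming the wavelet responses f_i(x) have already been collected into a feature matrix; the names and the brute-force threshold search are illustrative, not the tracker's actual code:

```python
import numpy as np

def adaboost_train(F, y, n_rounds=50):
    """Minimal discrete AdaBoost over threshold weak classifiers.

    F : (M, K) matrix of wavelet responses f_i(x), M tokens, K features.
    y : (M,) labels in {0, 1}.
    Returns a list of (feature index, threshold, alpha_t) triples.
    """
    M, K = F.shape
    w = np.full(M, 1.0 / M)                      # step 2: equal initial weights
    ensemble = []
    for t in range(n_rounds):
        best = None
        # Step 3.1: pick the (feature, threshold) with lowest weighted error.
        for i in range(K):
            for thr in np.unique(F[:, i]):
                h = (F[:, i] > thr).astype(int)  # weak classifier h_i(x)
                err = np.sum(w * (h != y))
                if best is None or err < best[0]:
                    best = (err, i, thr)
        eps, i, thr = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - eps) / eps)          # step 3.3
        h = (F[:, i] > thr).astype(int)
        # Step 3.2: up-weight misclassified tokens, then renormalize.
        w *= np.exp(alpha * (h != y))
        w /= w.sum()
        ensemble.append((i, thr, alpha))
    return ensemble

def adaboost_classify(ensemble, f):
    """Strong classifier: H(x) = 1 iff sum_t alpha_t h_t(x) > 1/2 sum_t alpha_t."""
    score = sum(a for (i, thr, a) in ensemble if f[i] > thr)
    return int(score > 0.5 * sum(a for (_, _, a) in ensemble))
```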

5 Example Haar Wavelet Features Selected by AdaBoost

6 AdaBoost in a Bayesian Context
The AdaBoost margin: M_D(x) = Σ_t α_t h_t(x) / Σ_t α_t
Guaranteed range: 0 ≤ M_D(x) ≤ 1
An inverse sigmoid transform, log(M_D(x) / (1 − M_D(x))), yields nearly normal distributions.
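A small sketch of the margin and its inverse-sigmoid (logit) transform, continuing the illustrative `ensemble` representation from the sketch above:

```python
import numpy as np

def adaboost_margin(ensemble, f):
    """M_D(x) = sum_t alpha_t h_t(x) / sum_t alpha_t, guaranteed in [0, 1]."""
    num = sum(a for (i, thr, a) in ensemble if f[i] > thr)
    den = sum(a for (_, _, a) in ensemble)
    return num / den

def margin_logit(m, eps=1e-6):
    """Inverse sigmoid of the margin; per the slide, the transformed
    values are distributed nearly normally within each class."""
    m = np.clip(m, eps, 1 - eps)
    return np.log(m / (1 - m))
```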

7 Prior: Relative Position of Lips in the Face
p(r = r_lips | M_D(x)) ∝ p(r = r_lips) · p(M_D(x) | r = r_lips)
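A sketch of this Bayesian combination: score each candidate lip position by a Gaussian prior on its position relative to the face times a Gaussian likelihood of the logit-transformed detector margin (slide 6 suggests the transformed margins are nearly normal); every function and parameter value here is an assumption for illustration:

```python
import numpy as np

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def lip_posterior(r_candidates, logit_margins,
                  prior_mu, prior_var, lik_mu, lik_var):
    """Score candidate lip positions r by p(r) * p(M_D(x) | r).

    r_candidates  : (N, 2) candidate (x, y) positions relative to the face box.
    logit_margins : (N,) inverse-sigmoid detector margins at those positions.
    """
    prior = (gauss(r_candidates[:, 0], prior_mu[0], prior_var[0])
             * gauss(r_candidates[:, 1], prior_mu[1], prior_var[1]))
    lik = gauss(logit_margins, lik_mu, lik_var)
    post = prior * lik
    return post / post.sum()        # normalized over the candidate set
```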

8 Lip Tracking: a few results

9 Pixel-Based Features

10 Pixel-Based Features: Dimension

11 Model-Based Correction for Head-Pose Variability
If the head is an ellipse, its measured width w_F(t) and height h_F(t) are functions of roll ρ, yaw ψ, pitch φ, true height ħ_F, and true width w_F; the measured width can usefully be approximated as w_F(t) ≈ w_F1 + ħ_F cos ψ(t) sin ρ(t), the form used on the next slide.

12 Robust Correction: Linear Regression
The additive random part of the lip width, w_L(t) = w_L1 + ħ_L cos ψ(t) sin ρ(t), is proportional to the similar additive variation in the head width, w_F(t) = w_F1 + ħ_F cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing w_L(t) to w_F(t).
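A minimal sketch of the regression step, assuming per-frame lip-width and head-width measurements over one utterance; the residual of regressing w_L(t) on w_F(t) is, by construction, orthogonal to the shared cos ψ(t) sin ρ(t) pose term:

```python
import numpy as np

def orthogonalize(w_lip, w_face):
    """Remove the pose-driven part of the lip width by regressing it
    onto the head width and keeping the residual.

    w_lip, w_face : (N,) per-frame width measurements for one utterance.
    """
    X = np.column_stack([np.ones_like(w_face), w_face])  # intercept + head width
    beta, *_ = np.linalg.lstsq(X, w_lip, rcond=None)     # least-squares fit
    return w_lip - X @ beta            # residual, orthogonal to w_face
```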

13 WER Results from AVICAR (testing on the training data; 34 talkers, continuous digits)
- LR = linear regression
- Model = model-based head-pose compensation
- LLR = log-linear regression
- 13+d+dd = 13 static features plus deltas and delta-deltas
- 39 = 39 static features
All systems use mean and variance normalization and MLLR. [Chart: WER by feature set and compensation method.]

14 Audio-Visual Asynchrony
For example, the tongue touches the teeth before the acoustic onset of the word "three"; the lips are already rounded in anticipation of the /r/.

15 Audio-Visual Asynchrony: The Coupled HMM is a Typical Phoneme-Viseme Model (Chu and Huang, 2002)
[Diagram: coupled acoustic-channel and visual-channel state chains over frames t = 1, 2, 3, ..., T.]

16 A Physical Model of Asynchrony (slide created by Karen Livescu)
Articulatory Phonology [Browman & Goldstein 90]: the following 8 tract variables are independently and asynchronously controlled:
- LIP-LOC: Protruded, Labial, Dental
- LIP-OP: CLosed, CRitical, Narrow, Wide
- TT-LOC: Dental, Alveolar, Palato-Alveolar, Retroflex
- TB-LOC: Palatal, Velar, Uvular, Pharyngeal
- TT-OP, TB-OP: CLosed, CRitical, Narrow, Mid-Narrow, Mid, Wide
- GLO: CLosed (stop), CRitical (voiced), Open (voiceless)
- VEL: CLosed (non-nasal), Open (nasal)
For speech recognition, we collapse these into 3 streams: lips, tongue, and glottis (LTG).

17 Motivation: Pronunciation Variation (slide created by Karen Livescu)
Baseform vs. surface (actual) pronunciations, with occurrence counts:
- probably: baseform p r aa b ax b l iy; surface (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy
- sense: baseform s eh n s; surface (1) s eh n t s, (1) s ih t s
- everybody: baseform eh v r iy b ah d iy; surface (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy
- don't: baseform d ow n t; surface (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw

18 Explanation: Asynchrony of Tract Variables (based on a slide created by Karen Livescu)
Dictionary form of "sense" (feature values per phone):
  phone: s | eh | n | s
  G:     open | critical | critical, nasal | open
  T:     crit/alveolar | mid/palatal | closed/alveolar | crit/alveolar
Surface variant #1, s eh n t s (example of feature asynchrony): the glottis opens for the final /s/ while the tongue still holds the alveolar closure, and the overlap surfaces as an epenthetic [t]:
  phone: s | eh | n | t | s
  G:     open | critical | critical, nasal | open | open
  T:     crit/alveolar | mid/palatal | closed/alveolar | closed/alveolar | crit/alveolar
Surface variant #2, s ih t s (example of feature asynchrony + substitution): the tongue substitutes narrow/palatal for mid/palatal, the nasal gesture overlaps the vowel, and the glottis again opens during the alveolar closure:
  phone: s | ih (nasalized) | t | s
  G:     open | critical, nasal | open | open
  T:     crit/alveolar | narrow/palatal | closed/alveolar | crit/alveolar
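The toy script below illustrates the mechanism: when the glottis stream runs one state ahead of the tongue stream, the frame where G has already opened while T still holds the alveolar closure surfaces as an epenthetic [t]. The gesture sequences follow the table above; the (G, T)-to-phone mapping is a hypothetical stand-in for the model's acoustic correspondence:

```python
# Per-stream gesture sequences for the word "sense" (s eh n s).
G_stream = ["open", "critical", "critical+nasal", "open"]
T_stream = ["crit/alv", "mid/palatal", "closed/alv", "crit/alv"]

def surface_phones(g_idx_seq, t_idx_seq):
    """Map per-frame (G, T) gesture pairs to surface phones.
    The mapping below is a hypothetical illustration only."""
    phone_of = {
        ("open", "crit/alv"): "s",
        ("critical", "mid/palatal"): "eh",
        ("critical+nasal", "closed/alv"): "n",
        ("open", "closed/alv"): "t",   # devoiced alveolar closure = epenthetic [t]
    }
    return [phone_of[(G_stream[g], T_stream[t])]
            for g, t in zip(g_idx_seq, t_idx_seq)]

# Synchronous alignment gives the canonical pronunciation.
print(surface_phones([0, 1, 2, 3], [0, 1, 2, 3]))        # ['s', 'eh', 'n', 's']
# Glottis one state ahead of the tongue during the closure: variant #1.
print(surface_phones([0, 1, 2, 3, 3], [0, 1, 2, 2, 3]))  # ['s', 'eh', 'n', 't', 's']
```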

19 Implementation: Multi-Stream DBN (slide created by Karen Livescu)
- Phone-based: q (phonetic state) generates o (observation vector).
- Articulatory feature-based: L (state of lips), T (state of tongue), and G (state of glottis) jointly generate o (observation vector).

20 Baseline: Audio-Only Phone-Based HMM (slide created by Partha Lal)
DBN variables: positionInWordA ∈ {0, 1, 2, ...}; stateTransitionA ∈ {0, 1}; phoneStateA ∈ {/t/1, /t/2, /t/3, /u/1, /u/2, /u/3, ...}; obsA.

21 Baseline: Video-Only Phone-Based HMM (slide created by Partha Lal)
DBN variables: positionInWordV ∈ {0, 1, 2, ...}; stateTransitionV ∈ {0, 1}; phoneStateV ∈ {/t/1, /t/2, /t/3, /u/1, /u/2, /u/3, ...}; obsV.

22 Audio-Visual HMM Without Asynchrony (slide created by Partha Lal)
DBN variables: a single chain positionInWord ∈ {0, 1, 2, ...}; stateTransition ∈ {0, 1}; phoneState ∈ {/t/1, /t/2, /t/3, /u/1, /u/2, /u/3, ...}; with two observation streams, obsA and obsV.

23 Phoneme-Viseme CHMM (slide created by Partha Lal)
Audio chain: positionInWordA ∈ {0, 1, 2, ...}; stateTransitionA ∈ {0, 1}; phoneStateA ∈ {/t/1, /t/2, /t/3, /u/1, /u/2, /u/3, ...}; obsA.
Video chain: positionInWordV ∈ {0, 1, 2, ...}; stateTransitionV ∈ {0, 1}; phoneStateV ∈ {/t/1, /t/2, /t/3, /u/1, /u/2, /u/3, ...}; obsV.

24 Articulatory Feature CHMM
Lips chain: positionInWordL ∈ {0, 1, 2, ...}; stateTransitionL ∈ {0, 1}; L ∈ {/OP/1, /OP/2, /RND/1, ...}.
Tongue chain: positionInWordT ∈ {0, 1, 2, ...}; stateTransitionT ∈ {0, 1}; T ∈ {/CL-ALV/1, /CL-ALV/2, /MID-UV/1, ...}.
Glottis chain: positionInWordG ∈ {0, 1, 2, ...}; stateTransitionG ∈ {0, 1}; G ∈ {/OP/1, /OP/2, /CRIT/1, ...}.
Observations: obsA and obsV.
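One way to picture the difference between these CHMMs and the synchronous HMM of slide 22 is the size of the allowed joint state space. The sketch below enumerates composite (per-stream position) states under a bound on inter-stream asynchrony, the same kind of constraint varied in the experiments that follow ("1 state async", "2 states async", unlimited); the function and its arguments are illustrative:

```python
from itertools import product

def composite_states(stream_lengths, max_async):
    """Enumerate joint (pos_1, ..., pos_S) states of a coupled HMM in
    which any two streams' positions in the word may differ by at most
    max_async states (0 = fully synchronous)."""
    ranges = [range(n) for n in stream_lengths]
    return [pos for pos in product(*ranges)
            if max(pos) - min(pos) <= max_async]

# Three streams (lips, tongue, glottis) with 5 states each:
print(len(composite_states([5, 5, 5], 0)))   # 5: fully synchronous
print(len(composite_states([5, 5, 5], 1)))   # larger: 1 state of asynchrony
print(len(composite_states([5, 5, 5], 5)))   # 125: unconstrained product space
```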

25 Asynchrony Experiments: CUAVE
- 169 utterances used, 10 digits each.
- NOISEX speech babble added at various SNRs.
- Experimental setup: training on clean data; number of Gaussians tuned on a clean dev set; audio/video weights tuned on noise-specific dev sets; uniform ("zero-gram") language model; decoding constrained to 10-word utterances (avoids language-model scale/penalty tuning).
- Thanks to Amar Subramanya at UW for the video observations, and to Kate Saenko at MIT for initial baselines and audio observations.

26 Results, part 1: Should we use video?
Answer: Fusion WER < single-stream WER. (Novelty: none; many authors have reported this.)
[Chart: WER for audio-only, video-only, and audiovisual systems at clean, 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB SNR.]

27 Results, part 2: Should the streams be asynchronous?
Answer: Asynchronous WER < synchronous WER (by 4% at mid SNRs). (Novelty: first phone-based AVSR with inter-phone asynchrony.)
[Chart: WER with no asynchrony, 1 state of asynchrony, 2 states of asynchrony, and unlimited asynchrony, at clean, 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB SNR.]

28 Results, part 3: Should asynchrony be modeled using articulatory features?
Answer: Articulatory-feature WER = phoneme-viseme WER. (Novelty: first articulatory-feature model for AVSR.)
[Chart: WER for the phoneme-viseme and articulatory-feature models at clean, 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB SNR.]

29 Results, part 4: Can the AF system help the CHMM to correct mistakes?
Answer: The combination AF + PV gives the best results on this database.
Details: systems vote to determine the label of each word (NIST ROVER); WER is measured on devtest, averaged across SNRs.
[Chart: WER for ROVER of the best three systems with AF, ROVER of the best three without AF, PV with 2 states of asynchrony, AF, and PV with 1 state of asynchrony. PV = phoneme-viseme; AF = articulatory features.]
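For intuition, here is a toy word-level vote in the spirit of NIST ROVER, assuming the systems' outputs have already been aligned word-for-word; real ROVER first aligns the hypotheses into a word transition network and can weight votes by confidence:

```python
from collections import Counter

def vote(aligned_hyps):
    """Pick the majority word at each slot of pre-aligned hypotheses.
    Toy version: assumes equal-length, already-aligned word strings."""
    return [Counter(words).most_common(1)[0][0]
            for words in zip(*aligned_hyps)]

hyps = [["one", "two", "three"],   # e.g., AF system
        ["one", "two", "tree"],    # e.g., PV, 1 state async
        ["won", "two", "three"]]   # e.g., PV, 2 states async
print(vote(hyps))                  # ['one', 'two', 'three']
```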

30 Conclusions
- Classifiers as features: AdaBoost margin outputs can be used as features in a Gaussian model of facial geometry.
- Head-pose correction in noise: the best correction algorithm uses linear regression followed by model-based correction.
- Asynchrony matters: the best phone-based recognizer is a CHMM with up to two states of asynchrony allowed between audio and video.
- Articulatory feature models complement phone models: the two system types have identical WER, and the best result is obtained when systems of both types are combined using ROVER.
