Discriminative Pronunciation Modeling: A Large-Margin Feature-Rich Approach
1 Discriminative Pronunciation Modeling: A Large-Margin Feature-Rich Approach Hao Tang Toyota Technological Institute at Chicago May 7, 2012 Joint work with Joseph Keshet and Karen Livescu To appear in ACL 2012
2 Outline Problem Model Features Dictionary Feature Function Length Feature Functions TF-IDF Feature Functions Phonetic Alignment Feature Functions Articulatory Feature Functions Experiments Conclusion Future Work
3 Problem: Pronunciation Variation probably
4 Problem: Pronunciation Variation probably /pcl p r aa bcl b ax bcl b l iy/
5 Problem: Pronunciation Variation probably /pcl p r aa bcl b ax bcl b l iy/ [p r aa b iy] [p r aa l iy] [p r ay] [p ow ih] [p aa iy]
7 Lexical Access: Definition [p r aa l iy]?
8 Lexical Access: Definition [p r aa l iy]? This is called Lexical Access.
9 Lexical Access: How hard is this task? In Switchboard, fewer than half of the word tokens are pronounced canonically.
10 Lexical Access: Related work Experiments on a subset of Switchboard.
Model — ER
lexicon lookup (from [Livescu, 2005]) — 59.3%
lexicon + Levenshtein distance — 41.8%
[Jyothi et al., 2011] — 29.1%
Our approach — you'll see...
11 Lexical Access: Goal [p r aa l iy] → probably
12 Lexical Access: Goal Find a function f : P* → V such that w = f(p), where
w — a word (e.g., probably)
p — a sequence of sub-word units (e.g., [p r aa l iy])
P* — the set of all sequences of sub-word units
V — the vocabulary
13 Model We model f as a linear function (in φ):
w = f(p) = argmax_{w ∈ V} θ · φ(p, w).
Note: Linearity is not so bad, because φ can be arbitrarily non-linear. For example, φ(p, w) can be the Levenshtein distance between p and the canonical pronunciation of w.
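The argmax above can be sketched directly in code. This is a minimal illustration of the linear model, not the paper's implementation: the feature map, dictionary, and weights below are toy stand-ins.

```python
# Minimal sketch of w = f(p) = argmax_{w in V} theta . phi(p, w).
# phi_toy, PRON_TOY, and theta are illustrative, not the paper's features.

def predict(p, vocab, theta, phi):
    """Return the word in vocab with the highest linear score theta . phi(p, w)."""
    def score(w):
        feats = phi(p, w)  # dict mapping feature name -> value
        return sum(theta.get(name, 0.0) * value for name, value in feats.items())
    return max(vocab, key=score)

# Toy example: a single dictionary-style feature.
PRON_TOY = {
    "probably": [("p", "r", "aa", "b", "iy")],
    "problem": [("p", "r", "aa", "b", "l", "ax", "m")],
}

def phi_toy(p, w):
    return {"dict": 1.0 if tuple(p) in PRON_TOY.get(w, []) else 0.0}

theta = {"dict": 1.0}
print(predict(["p", "r", "aa", "b", "iy"], ["problem", "probably"], theta, phi_toy))  # probably
```

Because the score is linear in φ, learning only has to pick the weights θ; all the modeling effort goes into designing φ, as the rest of the talk shows.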
14 Learning: Goal Given p = [pcl p r aa bcl b l iy], we want to find θ such that the score of the correct word (the "good guy") beats the scores of all competitors (the "bad guys") by a margin:
θ · φ(p, probably) ≥ margin + θ · φ(p, probable), θ · φ(p, problem), ...
15 Learning: Algorithm Process the stream (p_1, w_1), (p_2, w_2), (p_3, w_3), ..., updating θ_0 → θ_1 → θ_2 → ...
1: for t = 1, 2, ... do
2:   Sample (p_t, w_t) from S
3:   ŵ = argmax_{w ∈ V} [ 1_{w_t ≠ w} − θ_t · φ(p_t, w_t) + θ_t · φ(p_t, w) ]
4:   Δφ_t = φ(p_t, w_t) − φ(p_t, ŵ)
5:   Update, either
     PA:      α_t = min{ 1/λ, (1_{w_t ≠ ŵ} − θ_t · Δφ_t) / ‖Δφ_t‖² };  θ_{t+1} = θ_t + α_t Δφ_t
     Pegasos: η_t = 1/(λt);  θ_{t+1} = (1 − λ η_t) θ_t + η_t Δφ_t
6: end for
16 Learning: Algorithm Two Algorithms Passive Aggressive (PA) algorithm Pegasos
17 Learning: Algorithm Two Algorithms Passive Aggressive (PA) algorithm Pegasos Both have good theoretical guarantees. We happily treat them as black boxes.
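Even treated as a black box, the Pegasos-style update is short enough to sketch. This is an illustrative sketch of the update rule η_t = 1/(λt), θ_{t+1} = (1 − λη_t)θ_t + η_t Δφ_t with a loss-augmented prediction step; the feature map and data below are toy stand-ins, not the paper's setup.

```python
# Sketch of a Pegasos-style epoch for the structured objective in the slides.
# phi_toy is a made-up feature map; the real phi is the paper's feature vector.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pegasos_epoch(samples, vocab, phi, lam, theta, t0=1):
    t = t0
    for p, w_true in samples:
        # Loss-augmented prediction: argmax_w [1_{w != w_true} + theta . phi(p, w)]
        # (theta . phi(p, w_true) is constant in w, so it is dropped here).
        w_hat = max(vocab, key=lambda w: (0.0 if w == w_true else 1.0) + dot(theta, phi(p, w)))
        eta = 1.0 / (lam * t)
        dphi = [a - b for a, b in zip(phi(p, w_true), phi(p, w_hat))]
        theta = [(1.0 - lam * eta) * th + eta * d for th, d in zip(theta, dphi)]
        t += 1
    return theta

# Toy run: one indicator feature per word; the correct word's weight should win.
phi_toy = lambda p, w: [1.0, 0.0] if w == "a" else [0.0, 1.0]
theta = pegasos_epoch([(None, "a")] * 10, ["a", "b"], phi_toy, 1.0, [0.0, 0.0])
```

The PA variant differs only in the step size: it replaces η_t with the clipped α_t from the previous slide and skips the shrinkage factor.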
18 Features w = f(p) = argmax_{w ∈ V} θ · φ(p, w). We know how to learn θ. Let's talk about φ.
21 Features Given p = [pcl p r aa bcl b l iy], we want to design φ such that the correct word's score beats every competitor's by a margin:
θ · φ(p, probably) ≥ margin + θ · φ(p, probable), θ · φ(p, problem), ...
22 Dictionary Feature Function Given a pronunciation dictionary:
privacy  — pcl p r ay1 ay2 v ax s iy
private  — pcl p r ay1 ay2 v ax tcl t
pro      — pcl p r ow1 ow2
probably — pcl p r aa bcl b ax bcl b l iy
problem  — pcl p r aa bcl b l ax m
When you see p = [pcl p r aa bcl b ax bcl b l iy], intuitively, you want θ · φ(p, probably) > θ · φ(p, w′) for w′ ≠ probably.
23 Dictionary Feature Function Define the dictionary feature function as φ_dict(p, w) = 1_{p ∈ pron(w)}, where pron(w) is the set of baseforms of w in the dictionary.
24 Dictionary Feature Function Define the dictionary feature function as φ_dict(p, w) = 1_{p ∈ pron(w)}, where pron(w) is the set of baseforms of w in the dictionary. Then for the previous example,
θ_dict · φ_dict(p, probably) = θ_dict · 1 > θ_dict · 0 = θ_dict · φ_dict(p, w′)
for w′ ≠ probably, whenever θ_dict > 0.
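The dictionary feature reads off directly from its definition. This is a direct sketch of φ_dict(p, w) = 1_{p ∈ pron(w)}; the two-entry dictionary below is illustrative, while the real lexicon has thousands of words.

```python
# Sketch of the dictionary feature: 1 iff the surface form is a listed baseform of w.
PRON = {
    "probably": {("pcl", "p", "r", "aa", "bcl", "b", "ax", "bcl", "b", "l", "iy")},
    "problem":  {("pcl", "p", "r", "aa", "bcl", "b", "l", "ax", "m")},
}

def phi_dict(p, w):
    return 1.0 if tuple(p) in PRON.get(w, set()) else 0.0
```

The feature fires only on canonical pronunciations, which is exactly why the rest of the feature set is needed for the non-canonical majority.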
26 Length Feature Functions Suppose we have
w = probably
p = [pcl p r aa bcl b l iy]
pron(w) = /pcl p r aa bcl b ax bcl b l iy/
We want to see how the length of the surface form deviates from the baseform. In this case l = |p| − |pron(w)| = 8 − 11 = −3.
27 Length Feature Functions We can have φ_{a ≤ l < b}(p, w) = 1_{a ≤ l < b}.
28 Length Feature Functions We can have φ_{a ≤ l < b}(p, w) = 1_{a ≤ l < b}. Now φ_{−3 ≤ l < −2}(p, probably) = 1. But pron(academic) = /ae kcl k ax dcl d eh m ax kcl k/, so φ_{−3 ≤ l < −2}(p, academic) = 1 as well.
29 Length Feature Functions We can have φ_{a ≤ l < b}(p, w) = 1_{a ≤ l < b}. Now φ_{−3 ≤ l < −2}(p, probably) = 1, but also φ_{−3 ≤ l < −2}(p, academic) = 1, since pron(academic) = /ae kcl k ax dcl d eh m ax kcl k/. We want the feature to count deletions/insertions separately for different words.
30 Length Feature Functions Let e_probably be the one-hot vector over V: (e_probably)_w = 1 if w = probably and 0 otherwise (a → 0, ..., probable → 0, probably → 1, problem → 0, ..., zero → 0).
31 Length Feature Functions Combine the length indicator with the one-hot vector: 1_{−3 ≤ l < −2} · e_probably.
32 Length Feature Functions The length feature function is defined as
φ_{a ≤ l < b}(p, w) = 1_{a ≤ l < b} · e_w,
where l = |p| − |v| for some v ∈ pron(w), and e_w is the |V|-dimensional one-hot vector with (e_w)_i = 1 if w_i = w and 0 otherwise.
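The definition above can be sketched as a small function. This is an illustrative reading of φ_{a ≤ l < b}(p, w) = 1_{a ≤ l < b} · e_w; the two-word vocabulary and baseforms below are toy stand-ins with the lengths from the slides (8-phone surface form, 11-phone baseforms).

```python
# Sketch of the word-specific length feature: the indicator lands in w's own slot.
def phi_length(p, w, vocab, pron, a, b):
    vec = [0.0] * len(vocab)              # e_w scaffold: one slot per word in V
    for v in pron.get(w, []):
        l = len(p) - len(v)               # length deviation from this baseform
        if a <= l < b:
            vec[vocab.index(w)] = 1.0
    return vec

vocab = ["probably", "academic"]
pron = {
    "probably": [("pcl", "p", "r", "aa", "bcl", "b", "ax", "bcl", "b", "l", "iy")],
    "academic": [("ae", "kcl", "k", "ax", "dcl", "d", "eh", "m", "ax", "kcl", "k")],
}
p = ["pcl", "p", "r", "aa", "bcl", "b", "l", "iy"]  # 8 phones, so l = -3 for both words
```

Both words fire the same range, but in different coordinates of the vector, so learning can weight them differently.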
34 TF-IDF Feature Functions If I tell you /bcl b/ occurs at least once in the surface form, can you guess the word?
35 TF-IDF Feature Functions If I tell you /bcl b/ occurs at least once in the surface form, can you guess the word? about, book, ball, baseball, basketball, robot, problem, probably,...
36 TF-IDF Feature Functions If I tell you /bcl b/ occurs at least once in the surface form, can you guess the word? about, book, ball, baseball, basketball, robot, problem, probably,... What if /bcl b/ occurs twice?
37 TF-IDF Feature Functions If I tell you /bcl b/ occurs at least once in the surface form, can you guess the word? about, book, ball, baseball, basketball, robot, problem, probably,... What if /bcl b/ occurs twice? probably? baseball? basketball?
38 TF-IDF Feature Functions The term (sub-word unit) frequency is defined as
TF_u(p) = (1 / (|p| − |u| + 1)) Σ_{i=1}^{|p|−|u|+1} 1_{u = p_{i:i+|u|−1}}.
Suppose p = [p r aa l iy]. Then TF_{/l iy/}(p) = 1/4. Intuitively, if a sub-word unit has a high TF, then it is more discriminative.
39 TF-IDF Feature Functions If I tell you /ay1 ay2/ occurs in a surface form, can you guess the word?
40 TF-IDF Feature Functions If I tell you /ay1 ay2/ occurs in a surface form, can you guess the word? buy, cry, fly, flight, lie, light, like, I, right,...
41 TF-IDF Feature Functions If I tell you /ay1 ay2/ occurs in a surface form, can you guess the word? buy, cry, fly, flight, lie, light, like, I, right,... What if /z uw/ occurs?
42 TF-IDF Feature Functions If I tell you /ay1 ay2/ occurs in a surface form, can you guess the word? buy, cry, fly, flight, lie, light, like, I, right,... What if /z uw/ occurs? zoo? zoology?
43 TF-IDF Feature Functions The inverse document (word) frequency is defined as
IDF_u = log(|V| / |V_u|),
where V_u = {w ∈ V : ∃(p, w) ∈ S such that u occurs in p}. Intuitively, if a sub-word unit is found in a small, specific set of words, then it is more discriminative.
44 TF-IDF Feature Functions The final TF-IDF feature function for sub-word unit u is defined as
φ_u(p, w) = (TF_u(p) · IDF_u) · e_w.
This feature function is borrowed from [Zweig et al., 2010].
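The TF and IDF definitions above can be sketched together. This is an illustrative implementation: TF counts matches of u among the sliding windows of p, and IDF uses a toy stand-in training set S rather than the paper's Switchboard data.

```python
import math

# Sketch of TF_u(p) and IDF_u from the slides; units and surface forms are tuples/lists of phones.
def tf(u, p):
    m = len(u)
    windows = len(p) - m + 1              # number of length-|u| windows in p
    if windows <= 0:
        return 0.0
    return sum(1 for i in range(windows) if tuple(p[i:i + m]) == tuple(u)) / windows

def idf(u, samples, vocab):
    # samples: list of (surface form, word) training pairs, i.e. the set S
    def contains(p):
        m = len(u)
        return any(tuple(p[i:i + m]) == tuple(u) for i in range(len(p) - m + 1))
    v_u = {w for p, w in samples if contains(p)}  # words whose surface forms contain u
    return math.log(len(vocab) / len(v_u)) if v_u else 0.0
```

On the slide's example, tf(("l", "iy"), ["p", "r", "aa", "l", "iy"]) gives 1/4, matching TF_{/l iy/}(p) = 1/4.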
46 Phonetic Alignment Feature Functions Given p = [p r aa l iy] and pron(probably) = { /pcl p r aa bcl b l iy/, /pcl p r aa bcl b ax bcl b l iy/ }, we already have the dictionary feature function. Can we do something fuzzier?
47 Phonetic Alignment Feature Functions
Alignment 1: [p r aa l iy] against /pcl p r aa bcl b l iy/
Alignment 2: [p r aa l iy] against /pcl p r aa bcl b ax bcl b l iy/
48 Phonetic Alignment Feature Functions Turn these two alignments into counts of aligned pairs:
pcl → ε: 2, p → p: 2, r → r: 2, aa → aa: 2, bcl → ε: 3, b → ε: 3, ax → ε: 1, ...
49 Phonetic Alignment Feature Functions Each of those counts, normalized by some factor, is our feature:
φ_ph-align(p, w) = ( pcl → ε: 2/19, p → p: 1, r → r: 1, ..., bcl → ε: 3/19, ... ).
As a sanity check, we can expect to see high weights for l → l and iy → iy, or more generally, p → p for p ∈ P. We can also expect to see a high weight for t → dx, but a low weight for, say, aa → kcl.
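Turning alignments into pair counts is mechanical once the alignments exist. The sketch below writes out the two alignments from the slides by hand, with None standing in for ε (a deletion); a real system would produce them by dynamic programming, as described in the backup slides.

```python
from collections import Counter

# Hand-written alignments of [p r aa l iy] against the two baseforms of "probably";
# each pair is (surface phone, baseform phone), None = gap.
aln1 = [(None, "pcl"), ("p", "p"), ("r", "r"), ("aa", "aa"),
        (None, "bcl"), (None, "b"), ("l", "l"), ("iy", "iy")]
aln2 = [(None, "pcl"), ("p", "p"), ("r", "r"), ("aa", "aa"), (None, "bcl"),
        (None, "b"), (None, "ax"), (None, "bcl"), (None, "b"), ("l", "l"), ("iy", "iy")]

def pair_counts(alignments):
    """Count each aligned (surface, baseform) pair across all alignments."""
    counts = Counter()
    for aln in alignments:
        counts.update(aln)
    return counts

counts = pair_counts([aln1, aln2])
```

The resulting counts (pcl → ε: 2, bcl → ε: 3, b → ε: 3, ax → ε: 1, matches counted twice each) reproduce the tallies on the slide before normalization.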
51 Articulatory Feature Functions: Alignment Similarly, we can define alignment feature functions on the articulatory level:
φ_artic-align(p, w) = ( lip-loc-lab → lip-loc-den: 0.5, lip-open-clo → lip-open-wide: 0.1, tongue-tip-den → tongue-tip-alv: 0.3, vel-clo → vel-open: 0.2, ... ).
Note: the values of the articulatory variables are inferred, not observed.
52 Articulatory Feature Functions: Log-likelihood We also include the shifted and scaled log-likelihood of the alignment as a feature:
φ_LL(p, w) = (L(p, w) − h) / k,
where L(p, w) is the log-likelihood, h a shift, and k a scale.
53 Articulatory Feature Functions: Asynchrony Define the asynchrony feature functions among articulatory variables as
φ_{a ≤ async(F_1,F_2) < b}(p, w) = 1_{a ≤ async(F_1,F_2) < b},
where F_1 and F_2 are sets of articulatory variables and async(F_1, F_2) is the asynchrony between F_1 and F_2. This can help in examples such as sense /s eh n s/ → [s eh n t s].
54 Features: Big picture The first part of the feature vector stacks the dictionary, length, TF-IDF, and phonetic alignment features:
φ(p, w) = [ 1_{p ∈ pron(w)} ;  1_{a ≤ l < b} · e_w over all length ranges ;  TF_u(p) · IDF_u · e_w over all sub-word units u ;  the alignment counts pcl → ε, p → p, r → r, bcl → ε, ... ],
with block dimensions 1, (# of ranges) · |V|, (# of sub-word units) · |V|, and (|P| + 1)², respectively.
55 Features: Big picture The second part stacks the articulatory features:
φ(p, w) = [ lip-loc-lab → lip-loc-den, lip-open-clo → lip-open-wide, tongue-tip-den → tongue-tip-alv, vel-clo → vel-open, ... ;  (L(p, w) − h)/k ;  1_{a ≤ async(tongue tip, tongue body) < b}, 1_{a ≤ async(lip, tongue) < b}, ... ],
with block dimensions Σ_{i=1}^{7} |F_i|², 1, and (# of ranges) · (# of combinations), respectively.
56 Experiments: Setting
dataset: Switchboard; lexicon: 3328 words; total word tokens: 2893
length ranges: (−∞, −3), (−3, −2), (−2, −1), ..., (2, 3), (3, ∞)
asynchrony pairs: tongue tip vs. tongue body; lip vs. tongue; lip + tongue vs. glottis + velum
57 Experiments: Comparing with others Split the 2893 words into a 2492-word training set, a 165-word development set, and a 236-word test set.
Model — ER
lexicon lookup (from [Livescu, 2005]) — 59.3%
lexicon + Levenshtein distance — 41.8%
[Jyothi et al., 2011] — 29.1%
Our approach — 15.2%
58 Experiments: Comparing learning methods 5-fold cross-validation ER for different learning methods: lexicon lookup, lexicon + Levenshtein, CRF/DP+, PA/DP+, and Pegasos/DP+. DP+: dictionary, length, phone bigram TF-IDF, phonetic alignment.
59 Experiments: Comparing learning methods
Algorithm — # of non-zero entries in θ — time per epoch
CRF — 4,000,… — 45 min
PA and Pegasos — …,000 — 15 min
60 Experiments: Feature selection 5-fold cross-validation ER for different feature combinations: PA with phone bigram TF-IDF (p2), PA with dictionary (DP), PA p2 + DP, PA p2 + DP + length, PA DP+, and PA ALL.
61 Interesting Weights Some learned weights worth inspecting: θ_{p ∈ pron(w)}, θ_{p → p}, θ_{t → dx}, θ_{oy1 → oy_n}, θ_{oy2 → oy_n}, θ_{n → r}, θ_{f → kcl}, and the length weights θ_{l < −3}, θ_{−3 ≤ l < −2}, θ_{−2 ≤ l < −1}, θ_{−1 ≤ l < 0} for probably.
62 Conclusion We have a discriminative model. We use Passive-Aggressive or Pegasos to learn it. We have lots of features: Dictionary, Length, Phonetic alignment, Alignment of articulators, Asynchrony among articulators. It achieves good results.
63 Future Work Learning Kernel Direct loss minimization Learn feature weights and alignments jointly Features Features that span across articulatory variables Context-sensitive features from alignments Word Sequences Acoustics
64 References
K. Livescu. Feature-based Pronunciation Modeling for Automatic Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology, 2005.
P. Jyothi, K. Livescu, and E. Fosler-Lussier. Lexical access experiments with context-dependent articulatory feature-based models. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.
G. Zweig, P. Nguyen, and A. Acero. Continuous speech recognition with a TF-IDF acoustic model. In Proc. Interspeech, 2010.
65 [th ae ng kcl k] [y uw]
66 Learning Suppose (p, w) is drawn from a distribution ρ. The goal is to find a θ that minimizes the expected zero-one loss,
θ* = argmin_θ E_{(p,w)∼ρ}[ 1_{w ≠ f(p)} ].
Let S = {(p_1, w_1), ..., (p_m, w_m)}. We instead optimize
θ* = argmin_θ (λ/2) ‖θ‖²₂ + Σ_{i=1}^{m} l(θ; p_i, w_i),
where l(θ; p_i, w_i) = 1_{f(p_i) ≠ w_i}. We cannot optimize the zero-one loss directly. A common trick is to optimize the hinge loss,
l(θ; p_i, w_i) = max_{w ∈ V} [ 1_{w_i ≠ w} − θ · φ(p_i, w_i) + θ · φ(p_i, w) ].
67 Passive-Aggressive Algorithm We can also optimize
θ_{t+1} = argmin_θ (1/2) ‖θ − θ_t‖²₂  s.t.  θ · φ(p_t, w_t) − θ · φ(p_t, w) ≥ 1  for all w ∈ V ∖ {w_t}.
This is still hard to optimize, so we optimize this instead,
θ_{t+1} = argmin_θ (1/2) ‖θ − θ_t‖²₂  s.t.  θ · φ(p_t, w_t) − θ · φ(p_t, ŵ) ≥ 1,
where ŵ = argmax_{w ∈ V ∖ {w_t}} θ_t · φ(p_t, w).
68 Phonetic Alignment Feature Functions Given p, q ∈ P ∪ {ε}, we encode p and q as two four-tuples (s_1, s_2, s_3, s_4) and (t_1, t_2, t_3, t_4), representing consonant place, consonant manner, vowel place, and vowel manner. Define the similarity between p and q as
s(p, q) = 1 if p = q = ε;  Σ_{i=1}^{4} 1_{s_i = t_i} otherwise,
and run dynamic programming.
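The dynamic program itself is a standard maximum-similarity alignment. The sketch below uses a Needleman-Wunsch-style recurrence with the 4-tuple similarity; the phone encodings are made up for illustration, and gaps are scored 0 against any phone (a simplifying assumption, not the paper's exact gap score).

```python
# Sketch of DP alignment with the 4-tuple similarity s(p, q) = # of matching coordinates.
def sim(p, q):
    return sum(1 for a, b in zip(p, q) if a == b)

def align_score(xs, ys):
    """Maximize total similarity over monotone alignments; gaps contribute 0."""
    n, m = len(xs), len(ys)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]),
                          D[i - 1][j],   # xs[i-1] aligned to a gap
                          D[i][j - 1])   # ys[j-1] aligned to a gap
    return D[n][m]

# Toy encodings: two stops differing only in place of articulation.
t = ("alveolar", "stop", "-", "-")
p = ("labial",   "stop", "-", "-")
```

Backtracking through D recovers the alignment pairs that feed the count features of the previous slides.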
69 Phonetic Alignment Feature Functions The alignment feature function for p → q, for p, q ∈ P ∪ {ε}, is defined as
φ_{p→q}(p̄, w) = (1/Z_p) Σ_{k=1}^{K_w} Σ_{i=1}^{L_k} 1_{a_{k,i} = p, b_{k,i} = q},
where K_w = |pron(w)|, L_k is the length of the k-th alignment, p̄ is the surface form, and
Z_p = Σ_{k=1}^{K_w} Σ_{i=1}^{L_k} 1_{a_{k,i} = p} if p ∈ P;  |p̄| · K_w if p = ε.
70 Articulatory Feature Functions Let F be the set of articulatory variables: tongue tip location, tongue tip opening, tongue body location, tongue body opening, lip opening, glottis, and velum.
71 Articulatory Feature Functions Given values p, q of an articulatory variable F ∈ F, the feature function for articulatory alignment is defined as
φ_{p→q}(p̄, w) = (1/L) Σ_{i=1}^{L} 1_{a_i = p, b_i = q}.
72 Articulatory Feature Functions Example: frame-level streams for voicing, nasality, tongue body, and tongue tip, aligned against the surface form of sense, [s eh n t s], with per-frame asynchrony values; the epenthetic [t] appears where the streams are asynchronous.
73 Articulatory Feature Functions For F_h, F_k ∈ F, the asynchrony between F_h and F_k is defined as
async(F_h, F_k) = (1/L) Σ_{i=1}^{L} (t_{h,i} − t_{k,i}).
More generally, for sets F_1, F_2 ⊆ F, the asynchrony between F_1 and F_2 is defined as
async(F_1, F_2) = (1/L) Σ_{i=1}^{L} [ (1/|F_1|) Σ_{F_h ∈ F_1} t_{h,i} − (1/|F_2|) Σ_{F_k ∈ F_2} t_{k,i} ].
Define the asynchrony feature functions as
φ_{a ≤ async(F_1,F_2) ≤ b}(p, w) = 1_{a ≤ async(F_1,F_2) ≤ b}.
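The averaged-difference formula above can be sketched directly. This is an illustrative reading of the definition: per frame, average the position indices within each set of streams, take the difference, then average over frames. The integer streams below are toy stand-ins for the inferred articulatory trajectories.

```python
# Sketch of async(F1, F2) and the corresponding indicator feature.
def async_measure(streams1, streams2):
    L = len(streams1[0])
    total = 0.0
    for i in range(L):
        m1 = sum(s[i] for s in streams1) / len(streams1)  # mean index of F1 at frame i
        m2 = sum(s[i] for s in streams2) / len(streams2)  # mean index of F2 at frame i
        total += m1 - m2
    return total / L

def phi_async(streams1, streams2, a, b):
    # Indicator feature 1_{a <= async(F1, F2) <= b}
    return 1.0 if a <= async_measure(streams1, streams2) <= b else 0.0
```

A positive value means the first set of articulators runs ahead of the second on average, which is the pattern behind examples like sense → [s eh n t s].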
74 Experiments
Model — ER
lexicon lookup (from [Livescu, 2005]) — 59.3%
lexicon + Levenshtein distance — 41.8%
[Jyothi et al., 2011] — 29.1%
CRF/DP+ — 21.5%
PA/DP+ — 15.2%
Pegasos/DP+ — 14.8%
PA/ALL — 15.2%
DP+: dictionary, length, phone bigram TF-IDF, phonetic alignment.
ALL: DP+, alignment for articulatory variables, asynchrony between articulators, DBN log-likelihood.
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent
More informationLecture 9: Speech Recognition
EE E682: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 2 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis
More informationGenerative Adversarial Networks
Generative Adversarial Networks Stefano Ermon, Aditya Grover Stanford University Lecture 10 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 10 1 / 17 Selected GANs https://github.com/hindupuravinash/the-gan-zoo
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationDriving Semantic Parsing from the World s Response
Driving Semantic Parsing from the World s Response James Clarke, Dan Goldwasser, Ming-Wei Chang, Dan Roth Cognitive Computation Group University of Illinois at Urbana-Champaign CoNLL 2010 Clarke, Goldwasser,
More informationCS 5522: Artificial Intelligence II
CS 5522: Artificial Intelligence II Particle Filters and Applications of HMMs Instructor: Wei Xu Ohio State University [These slides were adapted from CS188 Intro to AI at UC Berkeley.] Recap: Reasoning
More informationLecture 6. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Lecture 6 pr@ n2nsi"eis@n "mad@lin Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com
More informationHidden Markov Modelling
Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models
More informationFEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION
FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION Sarika Hegde 1, K. K. Achary 2 and Surendra Shetty 3 1 Department of Computer Applications, NMAM.I.T., Nitte, Karkala Taluk,
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu What is Machine Learning? Overview slides by ETHEM ALPAYDIN Why Learn? Learn: programming computers to optimize a performance criterion using example
More informationLecture 3: Pattern Classification
EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures
More informationA fast and simple algorithm for training neural probabilistic language models
A fast and simple algorithm for training neural probabilistic language models Andriy Mnih Joint work with Yee Whye Teh Gatsby Computational Neuroscience Unit University College London 25 January 2013 1
More informationMachine Learning (CSE 446): Probabilistic Machine Learning
Machine Learning (CSE 446): Probabilistic Machine Learning oah Smith c 2017 University of Washington nasmith@cs.washington.edu ovember 1, 2017 1 / 24 Understanding MLE y 1 MLE π^ You can think of MLE as
More informationGraphical Models for Automatic Speech Recognition
Graphical Models for Automatic Speech Recognition Advanced Signal Processing SE 2, SS05 Stefan Petrik Signal Processing and Speech Communication Laboratory Graz University of Technology GMs for Automatic
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationStochastic Optimization for Deep CCA via Nonlinear Orthogonal Iterations
Stochastic Optimization for Deep CCA via Nonlinear Orthogonal Iterations Weiran Wang Toyota Technological Institute at Chicago * Joint work with Raman Arora (JHU), Karen Livescu and Nati Srebro (TTIC)
More informationStructured Prediction
Machine Learning Fall 2017 (structured perceptron, HMM, structured SVM) Professor Liang Huang (Chap. 17 of CIML) x x the man bit the dog x the man bit the dog x DT NN VBD DT NN S =+1 =-1 the man bit the
More informationA Linguistic Feature Representation of the Speech Waveform
September 1993 LIDS-TH-2195 Research Supported By: National Science Foundation NDSEG Fellowships A Linguistic Feature Representation of the Speech Waveform Ellen Marie Eide September 1993 LIDS-TH-2195
More informationAutomatic Speech Recognition (CS753)
Automatic Speech Recognition (CS753) Lecture 6: Hidden Markov Models (Part II) Instructor: Preethi Jyothi Aug 10, 2017 Recall: Computing Likelihood Problem 1 (Likelihood): Given an HMM l =(A, B) and an
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More informationMultimodal Deep Learning for Predicting Survival from Breast Cancer
Multimodal Deep Learning for Predicting Survival from Breast Cancer Heather Couture Deep Learning Journal Club Nov. 16, 2016 Outline Background on tumor histology & genetic data Background on survival
More informationFunctional principal component analysis of vocal tract area functions
INTERSPEECH 17 August, 17, Stockholm, Sweden Functional principal component analysis of vocal tract area functions Jorge C. Lucero Dept. Computer Science, University of Brasília, Brasília DF 791-9, Brazil
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationOur Status in CSE 5522
Our Status in CSE 5522 We re done with Part I Search and Planning! Part II: Probabilistic Reasoning Diagnosis Speech recognition Tracking objects Robot mapping Genetics Error correcting codes lots more!
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Hsuan-Tien Lin Learning Systems Group, California Institute of Technology Talk in NTU EE/CS Speech Lab, November 16, 2005 H.-T. Lin (Learning Systems Group) Introduction
More informationSVM optimization and Kernel methods
Announcements SVM optimization and Kernel methods w 4 is up. Due in a week. Kaggle is up 4/13/17 1 4/13/17 2 Outline Review SVM optimization Non-linear transformations in SVM Soft-margin SVM Goal: Find
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationIntroduction To Graphical Models
Peter Gehler Introduction to Graphical Models Introduction To Graphical Models Peter V. Gehler Max Planck Institute for Intelligent Systems, Tübingen, Germany ENS/INRIA Summer School, Paris, July 2013
More informationFeature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager
Feature Noising Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager Outline Part 0: Some backgrounds Part 1: Dropout as adaptive
More informationCS 136 Lecture 5 Acoustic modeling Phoneme modeling
+ September 9, 2016 Professor Meteer CS 136 Lecture 5 Acoustic modeling Phoneme modeling Thanks to Dan Jurafsky for these slides + Directly Modeling Continuous Observations n Gaussians n Univariate Gaussians
More informationEmpirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model
Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model (most slides from Sharon Goldwater; some adapted from Philipp Koehn) 5 October 2016 Nathan Schneider
More informationFoundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM
Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM Alex Lascarides (Slides from Alex Lascarides and Sharon Goldwater) 2 February 2019 Alex Lascarides FNLP Lecture
More informationThis is the author s version of a work that was submitted/accepted for publication in the following source:
This is the author s version of a work that was submitted/accepted for publication in the following source: Wallace, R., Baker, B., Vogt, R., & Sridharan, S. (2010) Discriminative optimisation of the figure
More information