Acoustic Unit Discovery (AUD) Models. Leda Sarı


1 Acoustic Unit Discovery (AUD) Models
Leda Sarı, Lucas Ondel and Lukáš Burget
A summary of AUD experiments from JHU Frederick Jelinek Summer Workshop 2016
lsari2@illinois.edu
November 07,

2 The goal of the workshop
- Building Speech Recognition System from Untranscribed Data [1]
[1] http://www. [...] building-speech-recognition-system-from-untranscribed-data/

3 Acoustic Unit Discovery (AUD)
A kind of clustering of audio:
- Segment
- Cluster
- Model the acoustics of each cluster
What if you do not know:
- The phone set
- The phone identities
- The number of units

4 Some solutions
- Use a Bayesian generative model
- Use a Dirichlet Process prior on the mixture of HMMs

5 Beta and Dirichlet distributions
$\mathrm{Beta}(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$
The Dirichlet is a multivariate generalization of the Beta distribution:
$\mathrm{Dir}(x; \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$
where $\sum_{i=1}^{K} x_i = 1$ and $x_i \in (0, 1)$.
Figure: taken from Wikipedia, "Beta distribution".
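The two densities above can be checked numerically. The sketch below is only an illustration using the Python standard library; the function names `beta_pdf` and `dirichlet_pdf` are chosen here and do not come from the slides. It evaluates both formulas in log space and shows that a Dirichlet with K = 2 agrees with the Beta density.

```python
import math

def beta_pdf(x, a, b):
    """Beta(x; a, b) density for 0 < x < 1, evaluated in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def dirichlet_pdf(xs, alphas):
    """Dir(x; alpha) density for a point xs on the simplex (entries sum to 1)."""
    log_norm = math.lgamma(sum(alphas)) - sum(math.lgamma(a) for a in alphas)
    log_kernel = sum((a - 1) * math.log(x) for x, a in zip(xs, alphas))
    return math.exp(log_norm + log_kernel)

# With K = 2 the Dirichlet density at (x, 1 - x) equals the Beta density at x.
x = 0.3
print(beta_pdf(x, 2.0, 5.0))
print(dirichlet_pdf([x, 1 - x], [2.0, 5.0]))
```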

6 Dirichlet Process
Given a base distribution H and a concentration parameter α:
1. As a generative process ("rich gets richer"):
   Sample $X_1 \sim H$. For $n > 1$:
   $X_n \sim H$ with probability $\frac{\alpha}{\alpha + n - 1}$
   $X_n = x$ with probability $\frac{n_x}{\alpha + n - 1}$, where $n_x = \sum_{j=1}^{n-1} I[X_j = x]$
2. Stick-breaking process:
   $\{x_k\}_{k=1}^{\infty} \sim H$
   $f(x) = \sum_{k=1}^{\infty} \beta_k \, \delta_{x_k}(x)$
   where $\beta_k = \beta'_k \prod_{i=1}^{k-1} (1 - \beta'_i)$ and $\beta'_k \sim \mathrm{Beta}(1, \alpha)$
   $f(x) \sim \mathrm{DP}(H, \alpha)$
Note that a sample from the process is itself a distribution.
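Both constructions of the Dirichlet process can be simulated in a few lines. The following is a minimal sketch using only the standard library; the names `sample_rich_gets_richer` and `stick_breaking_weights`, and the choice of a standard normal as the base distribution H, are assumptions made for illustration.

```python
import random

def sample_rich_gets_richer(n, alpha, base_sampler, rng):
    """Draw X_1..X_n: a fresh value from H w.p. alpha / (alpha + i), where i is
    the number of earlier draws, otherwise a copy of a uniformly chosen earlier
    draw (so value x is repeated w.p. n_x / (alpha + i))."""
    draws = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            draws.append(base_sampler(rng))
        else:
            draws.append(rng.choice(draws))
    return draws

def stick_breaking_weights(num_sticks, alpha, rng):
    """First num_sticks weights beta_k, plus the stick length left over."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        frac = rng.betavariate(1.0, alpha)   # beta'_k ~ Beta(1, alpha)
        weights.append(frac * remaining)     # beta_k = beta'_k * prod_{i<k}(1 - beta'_i)
        remaining *= 1.0 - frac
    return weights, remaining

rng = random.Random(0)
standard_normal = lambda r: r.gauss(0.0, 1.0)  # assumed base distribution H
draws = sample_rich_gets_richer(50, 2.0, standard_normal, rng)
print(len(set(draws)), "distinct values among", len(draws), "draws")
weights, remaining = stick_breaking_weights(10, 2.0, rng)
print(sum(weights) + remaining)
```

Larger α yields more distinct values in the first construction and a slower decay of the weights in the second.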

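7 Previous Studies
Lee and Glass 2012:
Figure: graphical model with variables $\theta_0$, $\theta_c$, $\gamma$, $\pi$, $c_{j,k}$, $s_t$, $m_t$, $x_t$, $d_{j,k}$, $\alpha_b$, $b_t$; plates over $T$ and over $j, k = g_q + 1, g_{q+1}$; $0 \le q < L$.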

8 Previous Studies
Lee and Glass 2012:
- Segment definition is based on boundary variables
- Arbitrary initialization
- Gibbs sampling is used for inference
  - Slow convergence
  - Not suitable for large datasets
- Hyperparameters are chosen to impose weak priors

9 Previous Studies
Ondel et al. 2016:
- Get rid of the boundary variables
- Use infinitely many units
Figure: AUD phone loop model with an infinite number of units

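10 Previous Studies
Ondel et al. 2016:
Figure: graphical model with variables $\theta_0$, $\theta_k$, $\gamma$, $\pi_k$, $c_i$, $s_{i,j}$, $m_{i,j}$, $x_{i,j}$; plates over $k = 1, \dots, \infty$; $j = 1, \dots, L_i$; $i = 1, \dots, N$.
- Hidden variables: $c_i$, $s_{i,j}$, $m_{i,j}$
- Observations: $x_{i,j}$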

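11 Previous Studies
Ondel et al. 2016:
$\nu_k \sim \mathrm{Beta}(1, \gamma)$
$\pi_k(\nu) = \nu_k \prod_{j=1}^{k-1} (1 - \nu_j)$ (stick-breaking)
$\theta_c = (A, b, \omega)$
$x_{i,j} \mid m_{i,j} \sim \mathcal{N}(\mu, \Sigma)$, with $\Sigma = \mathrm{diag}(\lambda)$
$\mu, \lambda \sim \mathcal{N}(\mu_0, (\kappa_0 \lambda)^{-1}) \, \mathrm{Gamma}(\alpha_0, \beta_0)$
$\pi \sim \mathrm{Dir}(\eta_0^{gmm})$
$A(r, :) \sim \mathrm{Dir}(\eta_0^{hmm,r})$
- Using conjugate priors and variational Bayes
- Hyperparameters of the priors: $\gamma$, $\mu_0$, $\kappa_0$, $\alpha_0$, $\beta_0$, $\eta_0^{gmm}$, $\eta_0^{hmm,r}$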

12 Training the model
Variational Bayesian (VB) inference:
- Assume independence between the hidden variables and parameters
- Use conjugate priors to get posteriors in closed form
Approximations:
- Truncate the infinite mixture
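Truncating the infinite mixture can be illustrated with the stick-breaking weights: forcing the last stick fraction to 1 assigns all remaining mass to the final unit, so a finite set of weights sums to one. The sketch below assumes this standard truncation; the function name `truncated_weights` and the example fractions are hypothetical.

```python
def truncated_weights(nu):
    """Weights pi_k = nu_k * prod_{j<k}(1 - nu_j) for a truncation level T = len(nu).

    The last fraction is overridden with 1.0, so pi_T absorbs the leftover mass.
    """
    fractions = list(nu[:-1]) + [1.0]
    weights, remaining = [], 1.0
    for frac in fractions:
        weights.append(frac * remaining)
        remaining *= 1.0 - frac
    return weights

# Hypothetical stick fractions; the final 0.5 is replaced by 1.0.
print(truncated_weights([0.5, 0.5, 0.5]))  # [0.5, 0.25, 0.25]
```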

13 VB Training
Assume a simpler distribution $q(c, S, M, \Theta)$ and maximize the lower bound on the log evidence of the observations:
$p(c, S, M, \Theta \mid X) \approx q(c, S, M, \Theta)$ and $\min \mathrm{KL}(q \,\|\, p)$ (1)
$\log p(X) \ge E_q[\log p(X, c, S, M, \Theta \mid \Phi_0)] - E_q[\log q(c, S, M, \Theta)]$ (2)
$q(c, S, M, \Theta) = q(c, S, M) \, q(\Theta)$ (3)
$\log q^*(c, S, M) = E_{q(\Theta)}[\log p(X, c, S, M, \Theta \mid \Phi_0)] + K_1$ (4)
$\log q^*(\Theta) = E_{q(c, S, M)}[\log p(X, c, S, M, \Theta \mid \Phi_0)] + K_2$ (5)
Set $\nu_T = 1$ in $\pi_k(\nu) = \nu_k \prod_{j=1}^{k-1} (1 - \nu_j)$ (truncated Dirichlet) (6)
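The relation between the lower bound in (2) and the KL divergence in (1) can be verified on a toy model with a single discrete hidden variable. This is only an illustrative sketch, not the AUD model itself; the function name `evidence_decomposition` and the joint probabilities are made up for the example.

```python
import math

def evidence_decomposition(joint, q):
    """Return (log p(x), lower bound, KL(q || p(z | x))) for a discrete latent z.

    joint[z] = p(x, z) for the observed x (all entries > 0); q[z] sums to 1.
    """
    evidence = sum(joint)
    log_evidence = math.log(evidence)
    # Lower bound: E_q[log p(x, z)] - E_q[log q(z)]
    bound = sum(qz * (math.log(pz) - math.log(qz)) for pz, qz in zip(joint, q) if qz > 0)
    posterior = [pz / evidence for pz in joint]
    kl = sum(qz * math.log(qz / post) for qz, post in zip(q, posterior) if qz > 0)
    return log_evidence, bound, kl

joint = [0.10, 0.05, 0.05]   # hypothetical p(x, z) for z = 0, 1, 2
q = [0.6, 0.3, 0.1]          # an arbitrary approximate posterior
log_evidence, bound, kl = evidence_decomposition(joint, q)
print(log_evidence, bound + kl)  # the two values agree up to rounding
```

Because the KL term is non-negative, the bound never exceeds log p(x), and it is tight exactly when q equals the true posterior.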

14 Bigram Hierarchical Pitman-Yor Language Model
- Generalization of the Dirichlet process
Pitman-Yor (PY) LM:
$X_{t+1} \sim G_0$ with probability $\frac{\theta + d t}{\theta + c}$
$X_{t+1} = x$ with probability $\frac{c_x - d}{\theta + c}$
where $c_x = \sum_{j=1}^{t} I[X_j = x]$
Hierarchy (Teh 2006):
$G_1 \sim \mathrm{PY}(G_0, \gamma_0, d_0)$
$G_{2,i} \sim \mathrm{PY}(G_1, \gamma_1, d_1)$
$c_t \sim G_{2, c_{t-1}}$
Then sample an HMM with parameters $\theta_{c_t}$.
Figure: HPYLM for AUD
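The Pitman-Yor predictive rule can be written out for a given set of counts. The sketch below makes two assumptions that the slide does not state explicitly: the multiplier of d in the first numerator is taken to be the number of distinct values seen so far, and c is the total count. The function name `pitman_yor_predictive` is hypothetical. Under these assumptions the probabilities sum to one, and d = 0 recovers the Dirichlet process rule.

```python
def pitman_yor_predictive(counts, theta, d):
    """Predictive probabilities given counts[x] = c_x.

    Assumes every draw from the base distribution is a new distinct value, so the
    number of distinct values plays the role of the multiplier of the discount d.
    """
    total = sum(counts.values())           # c
    num_distinct = len(counts)             # assumed multiplier of d
    p_new = (theta + d * num_distinct) / (theta + total)
    p_existing = {x: (c_x - d) / (theta + total) for x, c_x in counts.items()}
    return p_new, p_existing

# Hypothetical unit counts.
p_new, p_existing = pitman_yor_predictive({"a": 3, "b": 1}, theta=1.0, d=0.5)
print(p_new, p_existing)
```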

15 AUD Evaluation
- Frame- or segment-level comparison of acoustic units and the true phonetic units
- Normalized mutual information, with X: true labels, Y: units
  $\mathrm{NMI} = \frac{I(X; Y)}{H(X)}$ (7)
- Other measures:
  - Number of discovered units
  - Performance on different tasks (e.g. topic identification)
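Equation (7) can be computed directly from two aligned label sequences. The following is a minimal sketch with the standard library; the function name `nmi` and the toy label sequences are illustrative, and the normalization follows the slide (division by H(X) only), which differs from symmetric variants found in some toolkits.

```python
import math
from collections import Counter

def nmi(true_labels, units):
    """NMI = I(X; Y) / H(X) for aligned sequences of true labels X and units Y."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, units))
    count_x = Counter(true_labels)
    count_y = Counter(units)
    # I(X; Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )
    entropy_x = -sum((c / n) * math.log(c / n) for c in count_x.values())
    if entropy_x == 0.0:
        raise ValueError("H(X) is zero: the true labels contain a single class")
    return mutual_info / entropy_x

# Toy frame-level labels: units that mirror the phones vs. units unrelated to them.
print(nmi(["a", "a", "b", "b"], [1, 1, 2, 2]))
print(nmi(["a", "a", "b", "b"], [1, 2, 1, 2]))
```

The first call gives a value of 1 up to rounding, since the units determine the labels exactly; the second gives 0, since the two sequences are independent.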

16 Multilingual AUD
Motivation:
- Assume a universal phone set
- A common generative model
- Transfer knowledge from well-resourced languages to low-resource languages

17 Multilingual AUD
Methods:
1. Fully unsupervised: use only audio
   - Direct use of a mismatched model
   - Set the priors using the posterior estimates of the parameters of another AUD model
2. Use supervision on multilingual data
   - Train multilingual bottleneck (BN) networks
   - Extract BN features for the target data
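The idea of setting priors from another model's posterior can be illustrated with the simplest conjugate pair, a Dirichlet prior over categorical weights. This is a toy sketch of the general principle, not the workshop's actual procedure; the function names and all counts are hypothetical.

```python
def dirichlet_posterior(prior, counts):
    """Conjugate update: posterior concentration = prior concentration + counts."""
    return [a + n for a, n in zip(prior, counts)]

def dirichlet_mean(alphas):
    """Expected weights under Dir(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]

flat_prior = [1.0, 1.0, 1.0]
source_counts = [40, 10, 0]   # hypothetical statistics from a source language
target_counts = [2, 1, 0]     # hypothetical statistics from a small target set

# The posterior estimated on the source language becomes the prior for the target.
source_posterior = dirichlet_posterior(flat_prior, source_counts)
target_from_source = dirichlet_posterior(source_posterior, target_counts)
target_from_flat = dirichlet_posterior(flat_prior, target_counts)

print(dirichlet_mean(target_from_flat))
print(dirichlet_mean(target_from_source))
```

With little target data, the estimate that starts from the source posterior stays close to the source statistics, which helps only when the two languages are well matched.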

18 Multilingual AUD
Unsupervised:
Figure: Direct use
Figure: Posteriors to priors

19 Multilingual Bottleneck Features

20 Results - Unsupervised
- Multilingual dataset: 7 languages from the GlobalPhone dataset
- Target dataset: English (Wall Street Journal)
Table: NMI and # of units for two settings (direct use; posterior to prior), with rows English (WSJ), Czech, German, Portuguese, Russian, Spanish, Turkish, Vietnamese, and a final "languages" row. The numeric entries are missing.

21 Results - Supervised
Supervision through BNF
Table: NMI and # of units by feature and LM, with rows MFCC + Unigram, BNF + Unigram, and BNF + Bigram. The numeric entries are missing.

22 Conclusions
- Unsupervised approaches did not help much
- Using BNF as the input improves the NMI by 8% (absolute)

23 Thanks for listening.

24 References
1. Lee and Glass 2012: A non-parametric Bayesian Approach to Acoustic Model Discovery, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1.
2. Ondel et al. 2016: Variational Inference for Acoustic Unit Discovery, Procedia Computer Science, vol. 81.
3. Teh 2006: A hierarchical Bayesian language model based on Pitman-Yor processes, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
