Variational sampling approaches to word confusability


John R. Hershey, Peder A. Olsen and Ramesh A. Gopinath
IBM, T. J. Watson Research Center
Information Theory and Applications Workshop

Abstract

In speech recognition it is often useful to determine how confusable two words are. For speech models this comes down to computing the Bayes error between two HMMs. This problem is analytically and numerically intractable. A common alternative that is numerically approachable uses the KL divergence in place of the Bayes error. We present new approaches to approximating the KL divergence that combine variational methods with importance sampling. The Bhattacharyya distance, a closer cousin of the Bayes error, turns out to be even more amenable to our approach. Our experiments demonstrate an improvement of orders of magnitude in accuracy over conventional methods.

Outline

Acoustic Confusability
Divergence Measures for Distributions
KL Divergence: Prior Art
KL Divergence: Variational Approximations
KL Divergence: Empirical Evaluations
Bhattacharyya: Monte Carlo Approximation
Bhattacharyya: Variational Approximation
Bhattacharyya: Variational Monte Carlo Approximation
Bhattacharyya: Empirical Evaluations
Future Directions

A Toy Version of the Confusability Problem

[Figure: probability density plot of the two class pdfs $N(x; 2, 1)$ and $N(x; -2, 1)$ together with their geometric mean $(fg)^{1/2}$.]

X is an acoustic feature vector representing a speech class, say "E", with pdf $f(x) = N(x; 2, 1)$. Another speech class, "O", is described by pdf $g(x) = N(x; -2, 1)$.

The asymmetric error is the probability that one class is mistaken for the other under maximum-likelihood classification,
$$A_e(f, g) = \int f(x)\, 1_{g(x) \ge f(x)}(x)\, dx.$$

The Bayes error is the total classification error,
$$B_e(f, g) = \int \min(f(x), g(x))\, dx.$$
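
For concreteness, these quantities are easy to evaluate numerically in one dimension. The sketch below (an illustration, not code from the talk) computes the asymmetric error, the Bayes error, and the Bhattacharyya coefficient for the toy pair $f = N(2, 1)$, $g = N(-2, 1)$ on a grid.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy classes: f = N(x; 2, 1) and g = N(x; -2, 1)
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
f = normal_pdf(x, 2.0, 1.0)
g = normal_pdf(x, -2.0, 1.0)

A_e = np.sum(f * (g >= f)) * dx        # asymmetric error: mass of f where g >= f
B_e = np.sum(np.minimum(f, g)) * dx    # Bayes error: integral of min(f, g)
B   = np.sum(np.sqrt(f * g)) * dx      # Bhattacharyya coefficient

print(f"A_e = {A_e:.4f}, B_e = {B_e:.4f}, B = {B:.4f}")
# Expected: A_e ~ 0.0228, B_e = 2*Phi(-2) ~ 0.0455, B = exp(-2) ~ 0.1353
```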

Word Models

A word is modeled using its pronunciation(s) and an HMM. As an example, the word DIAL has the pronunciation D AY AX L and HMM F. CALL has the pronunciation K AO L and HMM G.

Each node in the HMM has a GMM associated with it. The word confusability is the Bayes error $B_e(F, G)$. This quantity is too hard to compute!

The Edit Distance

DIAL  CALL  edit op.      cost
D     K     substitution  1
AY    -     ins/del       1
AX    AO    substitution  1
L     L     none          0
            Total cost    3

The edit distance is the cost of the cheapest alignment, i.e. the shortest path through the alignment graph of the two phone strings D AY AX L and K AO L, from the initial node I to the final node F.
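
The total cost in the table can be reproduced with the standard dynamic program for edit distance. The sketch below uses unit costs and is illustrative, not code from the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences with unit costs."""
    m, n = len(a), len(b)
    # dp[i][j] = cost of aligning a[:i] with b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # deletions
    for j in range(n + 1):
        dp[0][j] = j          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a[i-1]
                           dp[i][j - 1] + 1,        # insert b[j-1]
                           dp[i - 1][j - 1] + sub)  # substitute / match
    return dp[m][n]

dial = ["D", "AY", "AX", "L"]
call = ["K", "AO", "L"]
print(edit_distance(dial, call))  # -> 3, matching the table above
```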

Better Ways

Other techniques use weights on the edges. Acoustic perplexity and the Average Divergence Distance are variants of this paradigm that use approximations to the KL divergence as edge weights.

Bayes Error

We use Bayes error approximations for each pair of GMMs in the Cartesian product of the two HMMs.

[Figure: the product HMM for DIAL and CALL, whose states are phone pairs such as D:K, AY:K, AX:AO, D:L, ..., L:L, connecting the initial state I to the final state F.]

Gaussian Mixture Models

Each node in the Cartesian HMM product corresponds to a pair of Gaussian mixture models f and g. We write
$$f(x) = \sum_a \pi_a f_a(x), \quad \text{where } f_a(x) = N(x; \mu_a, \Sigma_a),$$
and
$$g(x) = \sum_b \omega_b g_b(x), \quad \text{where } g_b(x) = N(x; \mu_b, \Sigma_b).$$
The high dimensionality of $x \in \mathbb{R}^d$, $d = 39$, makes numerical integration difficult.

Bayes, Bhattacharyya, Chernoff and Kullback-Leibler

Bayes error: $B_e(f, g) = \int \min(f(x), g(x))\,dx$

Kullback-Leibler divergence: $D(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx$

Bhattacharyya distance: $B(f, g) = \int \sqrt{f(x)\,g(x)}\,dx$

Chernoff distance: $C(f, g) = \min_{0 \le s \le 1} C_s(f, g)$, where $C_s(f, g) = \int f(x)^s g(x)^{1-s}\,dx$

Why these? For a pair of single Gaussians f and g we can compute $D(f\|g)$, $B(f, g)$ and $C_s(f, g)$ analytically.
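
These closed forms are what make the Gaussian components tractable. A sketch of the standard multivariate formulas is below; it is illustrative code, not code from the talk.

```python
import numpy as np

def gauss_kl(mu0, S0, mu1, S1):
    """KL divergence D(N(mu0, S0) || N(mu1, S1)) for full-covariance Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + logdet1 - logdet0)

def gauss_bhattacharyya(mu0, S0, mu1, S1):
    """Bhattacharyya coefficient B = integral sqrt(N0 * N1) = exp(-D_B)."""
    S = 0.5 * (S0 + S1)
    diff = mu1 - mu0
    _, logdetS = np.linalg.slogdet(S)
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    D_B = 0.125 * diff @ np.linalg.solve(S, diff) \
          + 0.5 * (logdetS - 0.5 * (logdet0 + logdet1))
    return np.exp(-D_B)

# The 1-D toy example from earlier: f = N(2, 1), g = N(-2, 1)
mu0, S0 = np.array([2.0]), np.array([[1.0]])
mu1, S1 = np.array([-2.0]), np.array([[1.0]])
print(gauss_kl(mu0, S0, mu1, S1))             # 8.0
print(gauss_bhattacharyya(mu0, S0, mu1, S1))  # exp(-2) ~ 0.1353
```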

Connections

Perimeter divergence (power mean): $P_\alpha(f, g) = \int \left(\tfrac{1}{2} f(x)^\alpha + \tfrac{1}{2} g(x)^\alpha\right)^{1/\alpha} dx$. We have $B_e(f, g) = P_{-\infty}(f, g)$ and $B(f, g) = P_0(f, g)$.

Rényi generalised divergence of order s: $D_s(f\|g) = \frac{1}{s-1}\log\int f(x)^s g(x)^{1-s}\,dx$. We have $D_1(f\|g) = D(f\|g)$ (as the limit $s \to 1$) and $D_s(f\|g) = \frac{1}{s-1}\log C_s(f, g)$.

Generalisation: $G_{\alpha,s}(f\|g) = \frac{1}{s-1}\log\int\left(s f(x)^\alpha + (1-s) g(x)^\alpha\right)^{1/\alpha} dx$. $G_{\alpha,s}(f\|g)$ connects $\log B_e(f, g)$, $D(f\|g)$, $\log B(f, g)$ and $C_s(f, g)$.

Relations and Inequalities

$G_{\alpha,1/2}(f\|g) = -2\log P_\alpha(f, g)$, $G_{0,s}(f\|g) = D_s(f\|g)$ and $G_{-\infty,1/2}(f\|g) = -2\log B_e(f, g)$.

$P_\alpha(f, g) \le P_\beta(f, g)$ for $\alpha \le \beta$, and $B(f, g) \le \sqrt{P_\alpha(f, g)\,P_{-\alpha}(f, g)}$.

$P_\alpha(f, f) = 1$, $G_{\alpha,s}(f\|f) = 0$, $P_{-\infty}(f, g) + P_{\infty}(f, g) = 2$, $B_e(f, g) = 2 - P_{\infty}(f, g)$.

$B(f, g) = B(g, f)$, $B_e(f, g) = B_e(g, f)$, but $D(f\|g) \ne D(g\|f)$.

$D(f\|g) \ge -2\log B(f, g)$ and $-2\log B_e(f, g) \ge -2\log B(f, g)$; $B_e(f, g) \le B(f, g) \le \sqrt{B_e(f, g)\,(2 - B_e(f, g))}$; $B_e(f, g) \le C(f, g) \le B(f, g)$; $C_0(f, g) = C_1(f, g) = 1$ and $B(f, g) = C_{1/2}(f, g)$; and so on.

The KL Divergence of a GMM

Monte Carlo sampling: draw n samples $\{x_i\}$ from f. Then
$$D(f\|g) \approx \frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i)}{g(x_i)},$$
with error $O(1/\sqrt{n})$.

Gaussian approximation: approximate f by a Gaussian $\hat{f}$ whose mean and covariance match the total mean and covariance of f, do the same for g and $\hat{g}$, and then use $D(f\|g) \approx D(\hat{f}\|\hat{g})$, where
$$\mu_{\hat{f}} = \sum_a \pi_a \mu_a, \qquad \Sigma_{\hat{f}} = \sum_a \pi_a\left(\Sigma_a + (\mu_a - \mu_{\hat{f}})(\mu_a - \mu_{\hat{f}})^T\right).$$
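
The plain Monte Carlo estimator is straightforward to implement. Below is a sketch for 1-D GMMs with made-up parameters (not the 39-dimensional acoustic models of the talk).

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_logpdf(x, w, mu, var):
    """Log density of a 1-D GMM (weights w, means mu, variances var) at points x."""
    comp = (-0.5 * (x[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(w))
    mx = comp.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(comp - mx).sum(axis=1, keepdims=True))).ravel()

def gmm_sample(n, w, mu, var):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], np.sqrt(var[idx]))

# Two small example GMMs f and g (made-up parameters)
wf, muf, varf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, varg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])

n = 100_000
x = gmm_sample(n, wf, muf, varf)
kl_mc = np.mean(gmm_logpdf(x, wf, muf, varf) - gmm_logpdf(x, wg, mug, varg))
print(f"Monte Carlo D(f||g) ~ {kl_mc:.4f} (error decays like 1/sqrt(n))")
```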

Unscented Approximation

It is possible to pick 2d sigma points $\{x_{a,k}\}_{k=1}^{2d}$ such that
$$\int f_a(x)\,h(x)\,dx = \frac{1}{2d}\sum_{k=1}^{2d} h(x_{a,k})$$
is exact for all quadratic functions h. One choice of sigma points is
$$x_{a,k} = \mu_a + \sqrt{d\,\lambda_{a,k}}\;e_{a,k}, \qquad x_{a,d+k} = \mu_a - \sqrt{d\,\lambda_{a,k}}\;e_{a,k},$$
where $\lambda_{a,k}$ and $e_{a,k}$ are the eigenvalues and eigenvectors of $\Sigma_a$. This is akin to Gaussian quadrature.
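
The sigma-point construction can be coded directly from the formula above. The sketch below estimates D(f||g) for small full-covariance GMMs with made-up parameters; it is an illustration, not the authors' implementation.

```python
import numpy as np

def mvn_logpdf(X, mu, S):
    """Log density of N(mu, S) evaluated at the rows of X."""
    d = len(mu)
    diff = X - mu
    _, logdet = np.linalg.slogdet(S)
    maha = np.einsum('ij,ij->i', diff, np.linalg.solve(S, diff.T).T)
    return -0.5 * (maha + d * np.log(2 * np.pi) + logdet)

def gmm_logpdf(X, weights, mus, covs):
    comp = np.stack([np.log(w) + mvn_logpdf(X, m, S)
                     for w, m, S in zip(weights, mus, covs)], axis=1)
    mx = comp.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(comp - mx).sum(axis=1, keepdims=True))).ravel()

def unscented_kl(wf, muf, covf, wg, mug, covg):
    """Unscented estimate of D(f||g): 2d sigma points per component of f."""
    d = len(muf[0])
    total = 0.0
    for w_a, mu_a, S_a in zip(wf, muf, covf):
        lam, e = np.linalg.eigh(S_a)            # eigenvalues / eigenvectors of Sigma_a
        offsets = (np.sqrt(d * lam) * e).T      # row k = sqrt(d * lam_k) * e_k
        pts = np.vstack([mu_a + offsets, mu_a - offsets])   # the 2d sigma points
        total += w_a * np.mean(gmm_logpdf(pts, wf, muf, covf)
                               - gmm_logpdf(pts, wg, mug, covg))
    return total

# Tiny 2-D example with made-up parameters
wf = np.array([0.5, 0.5]); muf = [np.zeros(2), np.array([2.0, 0.0])]
covf = [np.eye(2), np.diag([0.5, 1.5])]
wg = np.array([1.0]); mug = [np.array([1.0, 1.0])]; covg = [np.eye(2)]
print(unscented_kl(wf, muf, covf, wg, mug, covg))
```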

Matched Bound Approximation

Match the closest pairs of Gaussians, $m(a) = \arg\min_b\left(D(f_a\|g_b) - \log\omega_b\right)$. Goldberger's approximate formula is
$$D(f\|g) \approx D_{\text{Goldberger}}(f\|g) = \sum_a \pi_a\left(D(f_a\|g_{m(a)}) + \log\frac{\pi_a}{\omega_{m(a)}}\right),$$
analogous to the chain rule for relative entropy.

Min approximation: $D(f\|g) \approx \min_{a,b} D(f_a\|g_b)$ is an approximation in the same spirit.
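
A sketch of the matched-bound approximation for 1-D GMMs, reusing the closed-form Gaussian KL for the component pairs (made-up parameters, illustrative only).

```python
import numpy as np

def gauss_kl_1d(mu0, v0, mu1, v1):
    """D(N(mu0, v0) || N(mu1, v1)) for scalar variances."""
    return 0.5 * (v0 / v1 + (mu1 - mu0) ** 2 / v1 - 1.0 + np.log(v1 / v0))

def goldberger_kl(wf, muf, vf, wg, mug, vg):
    """Matched-bound approximation: match each f_a to its best g_b."""
    total = 0.0
    for pi_a, mu_a, v_a in zip(wf, muf, vf):
        # m(a) = argmin_b ( D(f_a || g_b) - log omega_b )
        scores = [gauss_kl_1d(mu_a, v_a, mu_b, v_b) - np.log(w_b)
                  for w_b, mu_b, v_b in zip(wg, mug, vg)]
        b = int(np.argmin(scores))
        total += pi_a * (gauss_kl_1d(mu_a, v_a, mug[b], vg[b])
                         + np.log(pi_a / wg[b]))
    return total

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
print(goldberger_kl(wf, muf, vf, wg, mug, vg))
```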

Variational Approximation

Let $\phi_{b|a} \ge 0$ with $\sum_b \phi_{b|a} = 1$ be free variational parameters. Then
$$\int f\log g = \sum_a \pi_a \int f_a \log\sum_b \phi_{b|a}\,\frac{\omega_b g_b}{\phi_{b|a}}
\ge \sum_a \pi_a \int f_a \sum_b \phi_{b|a}\log\frac{\omega_b g_b}{\phi_{b|a}}
= \sum_a \pi_a \sum_b \phi_{b|a}\left(\log\frac{\omega_b}{\phi_{b|a}} + \int f_a \log g_b\right),$$
where Jensen's inequality is used to interchange the log and the sum over b. Maximizing over $\phi_{b|a}$, and treating $\int f\log f$ in the same way, gives
$$D(f\|g) \approx D_{\text{var}}(f\|g) = \sum_a \pi_a \log\frac{\sum_{a'}\pi_{a'}\,e^{-D(f_a\|f_{a'})}}{\sum_b \omega_b\,e^{-D(f_a\|g_b)}}.$$
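
The closed-form variational approximation only needs the pairwise component KLs. A 1-D sketch with made-up parameters (illustrative, not the authors' code):

```python
import numpy as np

def gauss_kl_1d(mu0, v0, mu1, v1):
    return 0.5 * (v0 / v1 + (mu1 - mu0) ** 2 / v1 - 1.0 + np.log(v1 / v0))

def variational_kl(wf, muf, vf, wg, mug, vg):
    """D_var(f||g) = sum_a pi_a log [ sum_a' pi_a' e^{-D(f_a||f_a')} /
                                      sum_b  w_b   e^{-D(f_a||g_b)}  ]."""
    total = 0.0
    for pi_a, mu_a, v_a in zip(wf, muf, vf):
        num = sum(pi_ap * np.exp(-gauss_kl_1d(mu_a, v_a, mu_ap, v_ap))
                  for pi_ap, mu_ap, v_ap in zip(wf, muf, vf))
        den = sum(w_b * np.exp(-gauss_kl_1d(mu_a, v_a, mu_b, v_b))
                  for w_b, mu_b, v_b in zip(wg, mug, vg))
        total += pi_a * np.log(num / den)
    return total

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
print(variational_kl(wf, muf, vf, wg, mug, vg))
```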

Variational Upper Bound

Introduce variational parameters $\phi_{b|a} \ge 0$ and $\psi_{a|b} \ge 0$ with $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$, which replicate the Gaussians:
$$f = \sum_a \pi_a f_a = \sum_{ab}\phi_{b|a} f_a, \qquad g = \sum_b \omega_b g_b = \sum_{ab}\psi_{a|b} g_b.$$
Then
$$D(f\|g) = \int f\log\frac{f}{g} = -\int f\log\frac{\sum_{ab}\psi_{a|b} g_b}{f}
= -\int f\log\sum_{ab}\frac{\phi_{b|a} f_a}{f}\cdot\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}
\le -\sum_{ab}\int \phi_{b|a} f_a\log\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}\,dx
= D(\phi\|\psi) + \sum_{ab}\phi_{b|a}\,D(f_a\|g_b),$$
using Jensen's inequality to interchange the log and the weights $\phi_{b|a} f_a/f$, and where $D(\phi\|\psi) = \sum_{ab}\phi_{b|a}\log(\phi_{b|a}/\psi_{a|b})$. This is the chain rule for relative entropy, extended to mixtures with unequal numbers of components!

Optimize the variational bound $D(f\|g) \le D(\phi\|\psi) + \sum_{ab}\phi_{b|a} D(f_a\|g_b)$ subject to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$ by alternating two closed-form updates: fix $\phi$ and set
$$\psi_{a|b} = \frac{\omega_b\,\phi_{b|a}}{\sum_{a'}\phi_{b|a'}},$$
then fix $\psi$ and set
$$\phi_{b|a} = \frac{\pi_a\,\psi_{a|b}\,e^{-D(f_a\|g_b)}}{\sum_{b'}\psi_{a|b'}\,e^{-D(f_a\|g_{b'})}}.$$
Iterate a few times to find the optimal solution.
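
A sketch of the alternating updates for 1-D GMMs (made-up parameters). Any feasible phi, psi give a valid upper bound; the iteration tightens it.

```python
import numpy as np

def gauss_kl_1d(mu0, v0, mu1, v1):
    return 0.5 * (v0 / v1 + (mu1 - mu0) ** 2 / v1 - 1.0 + np.log(v1 / v0))

def variational_upper_bound(wf, muf, vf, wg, mug, vg, iters=10):
    """Upper bound D(f||g) <= D(phi||psi) + sum_ab phi[a,b] D(f_a||g_b)."""
    A, B = len(wf), len(wg)
    D = np.array([[gauss_kl_1d(muf[a], vf[a], mug[b], vg[b])
                   for b in range(B)] for a in range(A)])
    phi = np.outer(wf, wg)                              # sum_b phi[a,b] = pi_a
    for _ in range(iters):
        # psi[a,b] = omega_b * phi[a,b] / sum_a' phi[a',b]
        psi = wg * phi / phi.sum(axis=0, keepdims=True)
        # phi[a,b] = pi_a * psi[a,b] e^{-D_ab} / sum_b' psi[a,b'] e^{-D_ab'}
        t = psi * np.exp(-D)
        phi = wf[:, None] * t / t.sum(axis=1, keepdims=True)
    psi = wg * phi / phi.sum(axis=0, keepdims=True)     # final psi for current phi
    return np.sum(phi * (np.log(phi / psi) + D))

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
print(variational_upper_bound(wf, muf, vf, wg, mug, vg))
```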

Comparison of KL Divergence Methods

Plots showing histograms of the difference between Monte Carlo sampling with 1 million samples and the various methods.

[Figure: two histograms of the deviation from $D_{\text{MC}(1M)}$. Left panel: zero, product, min, gaussian, variational. Right panel: MC(2dn), unscented, goldberger, variational upper bound.]

Summary of KL Divergence Methods

Monte Carlo sampling: arbitrary accuracy, arbitrary cost
Gaussian approximation: not cheap, not good, closed-form
Min approximation: cheap, but not good
Unscented approximation: almost as good as MC at the same cost
Matched bound approximation: cheap and good, not differentiable
Variational approximation: cheap and good, closed-form
Variational upper bound: cheap and better, strict bound, iterative

Comparison of KL, Bhattacharyya, Bayes

[Figures comparing the KL divergence, Bhattacharyya and Bayes error measures; not transcribed.]

Bhattacharyya Distance

The Bhattacharyya distance, $B(f, g) = \int\sqrt{fg}$, can be estimated using Monte Carlo sampling from an arbitrary distribution h:
$$B(f, g) \approx \hat{B}_h = \frac{1}{n}\sum_{i=1}^{n}\frac{\sqrt{f(x_i)\,g(x_i)}}{h(x_i)},$$
where $\{x_i\}_{i=1}^n$ are sampled from h. The estimators are unbiased, $E[\hat{B}_h] = B(f, g)$, with variance
$$\text{var}(\hat{B}_h) = \frac{1}{n}\left(\int\frac{fg}{h} - B(f, g)^2\right).$$
Taking $h = f$ gives $\text{var}(\hat{B}_f) = \frac{1 - B(f, g)^2}{n}$. Taking $h = \frac{f+g}{2}$ gives $\text{var}(\hat{B}_{\frac{f+g}{2}}) = \frac{1}{n}\left(\int\frac{2fg}{f+g} - B(f, g)^2\right)$, and $\text{var}(\hat{B}_{\frac{f+g}{2}}) \le \text{var}(\hat{B}_f)$ (harmonic-arithmetic mean inequality).
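
A sketch comparing the two simple proposal distributions for 1-D GMMs (made-up parameters, illustrative only). Note that (f+g)/2 is itself a GMM, so it is easy to sample from and evaluate.

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, var):
    return np.sum(w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                  / np.sqrt(2 * np.pi * var), axis=1)

def gmm_sample(n, w, mu, var):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], np.sqrt(var[idx]))

wf, muf, varf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, varg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])

n = 100_000
# Proposal h = f
x = gmm_sample(n, wf, muf, varf)
est_f = np.mean(np.sqrt(gmm_pdf(x, wf, muf, varf) * gmm_pdf(x, wg, mug, varg))
                / gmm_pdf(x, wf, muf, varf))
# Proposal h = (f+g)/2: the pooled components with halved weights
wh = np.concatenate([wf, wg]) / 2
muh = np.concatenate([muf, mug])
varh = np.concatenate([varf, varg])
x = gmm_sample(n, wh, muh, varh)
est_mix = np.mean(np.sqrt(gmm_pdf(x, wf, muf, varf) * gmm_pdf(x, wg, mug, varg))
                  / gmm_pdf(x, wh, muh, varh))
print(f"B_hat (h=f)       = {est_f:.4f}")
print(f"B_hat (h=(f+g)/2) = {est_mix:.4f}")
```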

Best Sampling Distribution

We can find the best sampling distribution h by minimizing the variance of $\hat{B}_h$ subject to the constraints $h \ge 0$ and $\int h = 1$. The solution is
$$h^* = \frac{\sqrt{fg}}{\int\sqrt{fg}}, \qquad \text{var}(\hat{B}_{h^*}) = 0.$$
Unfortunately, using this h requires computing the quantity $B(f, g) = \int\sqrt{fg}$ that we are trying to compute in the first place, and sampling from $\sqrt{fg}$.

We will use variational techniques to approximate $\sqrt{fg}$ with an unnormalized $\hat{h}$ that can be analytically integrated to give a genuine pdf h.

Bhattacharyya Variational Upper Bound

As before, introduce variational parameters $\phi_{b|a} \ge 0$ and $\psi_{a|b} \ge 0$ with $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$, so that
$$f = \sum_a \pi_a f_a = \sum_{ab}\phi_{b|a} f_a, \qquad g = \sum_b \omega_b g_b = \sum_{ab}\psi_{a|b} g_b.$$
Then
$$B(f, g) = \int f\sqrt{\frac{g}{f}} = \int f\sqrt{\sum_{ab}\frac{\psi_{a|b} g_b}{f}}
= \int f\sqrt{\sum_{ab}\frac{\phi_{b|a} f_a}{f}\cdot\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}}
\ge \sum_{ab}\int \phi_{b|a} f_a\sqrt{\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}}\,dx
= \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}}\;B(f_a, g_b),$$
using Jensen's inequality (concavity of the square root) to interchange the square root and the weights $\phi_{b|a} f_a/f$. An inequality linking the mixture Bhattacharyya distance to the component distances! (It is a lower bound on $B(f, g)$, hence an upper bound on the Bhattacharyya divergence $-\log B(f, g)$.)

Optimize the variational bound $B(f, g) \ge \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}}\,B(f_a, g_b)$ subject to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$ by alternating: fix $\phi$ and set
$$\psi_{a|b} = \frac{\omega_b\,\phi_{b|a}\,B(f_a, g_b)^2}{\sum_{a'}\phi_{b|a'}\,B(f_{a'}, g_b)^2},$$
then fix $\psi$ and set
$$\phi_{b|a} = \frac{\pi_a\,\psi_{a|b}\,B(f_a, g_b)^2}{\sum_{b'}\psi_{a|b'}\,B(f_a, g_{b'})^2}.$$
Iterate to find the optimal solution!
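
A sketch of the alternating updates for 1-D GMMs, using the closed-form Gaussian Bhattacharyya coefficient for the component pairs (made-up parameters, illustrative only).

```python
import numpy as np

def gauss_bhatt_1d(mu1, v1, mu2, v2):
    """Bhattacharyya coefficient B(N(mu1,v1), N(mu2,v2)) = integral sqrt(N1 N2)."""
    return (np.sqrt(2.0 * np.sqrt(v1 * v2) / (v1 + v2))
            * np.exp(-(mu1 - mu2) ** 2 / (4.0 * (v1 + v2))))

def variational_bhattacharyya(wf, muf, vf, wg, mug, vg, iters=10):
    """Lower bound B(f,g) >= sum_ab sqrt(phi[a,b] psi[a,b]) B(f_a, g_b),
    maximized by alternating the closed-form updates from the slides."""
    A, B = len(wf), len(wg)
    Bab = np.array([[gauss_bhatt_1d(muf[a], vf[a], mug[b], vg[b])
                     for b in range(B)] for a in range(A)])
    phi = np.outer(wf, wg)                              # sum_b phi[a,b] = pi_a
    for _ in range(iters):
        t = phi * Bab ** 2
        psi = wg * t / t.sum(axis=0, keepdims=True)     # sum_a psi[a,b] = omega_b
        t = psi * Bab ** 2
        phi = wf[:, None] * t / t.sum(axis=1, keepdims=True)
    t = phi * Bab ** 2
    psi = wg * t / t.sum(axis=0, keepdims=True)         # final psi for current phi
    V = np.sum(np.sqrt(phi * psi) * Bab)
    return V, phi, psi

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
V, phi, psi = variational_bhattacharyya(wf, muf, vf, wg, mug, vg)
print(f"variational estimate V(f,g) = {V:.4f}")
# phi and psi also define the sampling distribution used on the next slide.
```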

Variational Monte Carlo Sampling

Write the variational estimate as
$$V(f, g) = \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}}\,B(f_a, g_b) = \sum_{ab}\int\sqrt{\phi_{b|a}\,\psi_{a|b}\,f_a\,g_b} = \int\hat{h}.$$
Here $\hat{h} = \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}\,f_a\,g_b}$ is an unnormalized approximation of the optimal sampling distribution $\sqrt{fg}/\int\sqrt{fg}$.

$h = \hat{h}/\int\hat{h}$ is a GMM, since $h_{ab} = \sqrt{f_a g_b}/\int\sqrt{f_a g_b}$ is a Gaussian and
$$h = \sum_{ab}\pi_{ab}\,h_{ab}, \qquad \pi_{ab} = \frac{\sqrt{\phi_{b|a}\,\psi_{a|b}}\,\int\sqrt{f_a g_b}}{V(f, g)}.$$
Thus, drawing samples $\{x_i\}_{i=1}^n$ from h, the estimate
$$\hat{V}_n = \frac{1}{n}\sum_{i=1}^{n}\frac{\sqrt{f(x_i)\,g(x_i)}}{h(x_i)}$$
is unbiased, and in experiments it is seen to be far superior to sampling from $(f+g)/2$.
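
Putting the pieces together for 1-D GMMs: optimize phi and psi, build the GMM h whose components are the normalized sqrt(f_a g_b), sample from it, and form the unbiased estimate. This is an illustrative sketch with made-up parameters, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_bhatt_1d(mu1, v1, mu2, v2):
    return (np.sqrt(2.0 * np.sqrt(v1 * v2) / (v1 + v2))
            * np.exp(-(mu1 - mu2) ** 2 / (4.0 * (v1 + v2))))

def gmm_pdf(x, w, mu, var):
    return np.sum(w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                  / np.sqrt(2 * np.pi * var), axis=1)

# Example 1-D GMMs (made-up parameters)
wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])

# Pairwise component Bhattacharyya coefficients and variational weights
Bab = gauss_bhatt_1d(muf[:, None], vf[:, None], mug[None, :], vg[None, :])
phi = np.outer(wf, wg)
for _ in range(10):
    t = phi * Bab ** 2
    psi = wg * t / t.sum(axis=0, keepdims=True)
    t = psi * Bab ** 2
    phi = wf[:, None] * t / t.sum(axis=1, keepdims=True)
V = np.sum(np.sqrt(phi * psi) * Bab)          # variational estimate of B(f, g)

# h is a GMM: component ab is the normalized sqrt(f_a g_b), a Gaussian with
# mean (mu_a v_b + mu_b v_a)/(v_a + v_b) and variance 2 v_a v_b/(v_a + v_b)
mu_h = (muf[:, None] * vg[None, :] + mug[None, :] * vf[:, None]) / (vf[:, None] + vg[None, :])
v_h = 2.0 * vf[:, None] * vg[None, :] / (vf[:, None] + vg[None, :])
w_h = np.sqrt(phi * psi) * Bab / V            # mixture weights pi_ab (sum to 1)

# Draw from h and form the unbiased importance-sampling estimate
n = 10_000
idx = rng.choice(w_h.size, size=n, p=w_h.ravel())
x = rng.normal(mu_h.ravel()[idx], np.sqrt(v_h.ravel()[idx]))
h_x = np.sum(w_h.ravel() * np.exp(-0.5 * (x[:, None] - mu_h.ravel()) ** 2 / v_h.ravel())
             / np.sqrt(2 * np.pi * v_h.ravel()), axis=1)
B_hat = np.mean(np.sqrt(gmm_pdf(x, wf, muf, vf) * gmm_pdf(x, wg, mug, vg)) / h_x)
print(f"variational bound V = {V:.4f}, variational IS estimate B_hat = {B_hat:.4f}")
```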

Bhattacharyya Distance: Monte Carlo Estimation

[Figure: histograms of the deviation from a reference Bhattacharyya estimate computed with 1M samples, for importance sampling from f(x), from (f(x)+g(x))/2, and for variational importance sampling, at sample sizes up to 100K.]

Importance sampling from f(x): slow convergence.
Importance sampling from (f(x)+g(x))/2: better convergence.
Variational importance sampling: fast convergence.

Comparison of KL, Bhattacharyya, Bayes

[Figures comparing the measures; not transcribed.]

Variational Monte Carlo Sampling: KL-Divergence

[Figure not transcribed.]

Future Directions

HMM variational KL divergence
HMM variational Bhattacharyya
Variational Chernoff distance
Variational sampling of Bayes error using the Chernoff approximation
Discriminative training using the Bhattacharyya divergence
Acoustic confusability using the Bhattacharyya divergence
Clustering of HMMs


More information

Stat410 Probability and Statistics II (F16)

Stat410 Probability and Statistics II (F16) Stat4 Probability and Statistics II (F6 Exponential, Poisson and Gamma Suppose on average every /λ hours, a Stochastic train arrives at the Random station. Further we assume the waiting time between two

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Particle Filtering Approaches for Dynamic Stochastic Optimization

Particle Filtering Approaches for Dynamic Stochastic Optimization Particle Filtering Approaches for Dynamic Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge I-Sim Workshop,

More information

Kernel Density Estimation

Kernel Density Estimation EECS 598: Statistical Learning Theory, Winter 2014 Topic 19 Kernel Density Estimation Lecturer: Clayton Scott Scribe: Yun Wei, Yanzhen Deng Disclaimer: These notes have not been subjected to the usual

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Unit 3: HW3.5 Sum and Product

Unit 3: HW3.5 Sum and Product Unit 3: HW3.5 Sum and Product Without solving, find the sum and product of the roots of each equation. 1. x 2 8x + 7 = 0 2. 2x + 5 = x 2 3. -7x + 4 = -3x 2 4. -10x 2 = 5x - 2 5. 5x 2 2x 3 4 6. 1 3 x2 3x

More information

Online Estimation of Discrete Densities using Classifier Chains

Online Estimation of Discrete Densities using Classifier Chains Online Estimation of Discrete Densities using Classifier Chains Michael Geilke 1 and Eibe Frank 2 and Stefan Kramer 1 1 Johannes Gutenberg-Universtität Mainz, Germany {geilke,kramer}@informatik.uni-mainz.de

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Expectation Propagation for Approximate Bayesian Inference

Expectation Propagation for Approximate Bayesian Inference Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Review: mostly probability and some statistics

Review: mostly probability and some statistics Review: mostly probability and some statistics C2 1 Content robability (should know already) Axioms and properties Conditional probability and independence Law of Total probability and Bayes theorem Random

More information

Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint

Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint Xia Hong 1, Sheng Chen 2, Chris J. Harris 2 1 School of Systems Engineering University of Reading, Reading RG6 6AY, UK E-mail: x.hong@reading.ac.uk

More information

Quick Tour of Basic Probability Theory and Linear Algebra

Quick Tour of Basic Probability Theory and Linear Algebra Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Some New Results on Information Properties of Mixture Distributions

Some New Results on Information Properties of Mixture Distributions Filomat 31:13 (2017), 4225 4230 https://doi.org/10.2298/fil1713225t Published by Faculty of Sciences and Mathematics, University of Niš, Serbia Available at: http://www.pmf.ni.ac.rs/filomat Some New Results

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

Probabilistic Models for Sequence Labeling

Probabilistic Models for Sequence Labeling Probabilistic Models for Sequence Labeling Besnik Fetahu June 9, 2011 Besnik Fetahu () Probabilistic Models for Sequence Labeling June 9, 2011 1 / 26 Background & Motivation Problem introduction Generative

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

A minimalist s exposition of EM

A minimalist s exposition of EM A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability

More information

(Multivariate) Gaussian (Normal) Probability Densities

(Multivariate) Gaussian (Normal) Probability Densities (Multivariate) Gaussian (Normal) Probability Densities Carl Edward Rasmussen, José Miguel Hernández-Lobato & Richard Turner April 20th, 2018 Rasmussen, Hernàndez-Lobato & Turner Gaussian Densities April

More information