Variational sampling approaches to word confusability


John R. Hershey, Peder A. Olsen and Ramesh A. Gopinath
IBM, T. J. Watson Research Center
Information Theory and Applications Workshop

Abstract

In speech recognition it is often useful to determine how confusable two words are. For speech models this comes down to computing the Bayes error between two HMMs. This problem is analytically and numerically intractable. A common alternative that is numerically approachable uses the KL divergence in place of the Bayes error. We present new approaches to approximating the KL divergence that combine variational methods with importance sampling. The Bhattacharyya distance, a closer cousin of the Bayes error, turns out to be even more amenable to our approach. Our experiments demonstrate an improvement of orders of magnitude in accuracy over conventional methods.

Outline

Acoustic Confusability
Divergence Measures for Distributions
KL Divergence: Prior Art
KL Divergence: Variational Approximations
KL Divergence: Empirical Evaluations
Bhattacharyya: Monte Carlo Approximation
Bhattacharyya: Variational Approximation
Bhattacharyya: Variational Monte Carlo Approximation
Bhattacharyya: Empirical Evaluations
Future Directions

A Toy Version of the Confusability Problem

[Figure: probability density plot of the two class pdfs $N(x; 2, 1)$ and $N(x; -2, 1)$ together with their geometric mean $(fg)^{1/2}$.]

X is an acoustic feature vector representing a speech class, say "E", with pdf $f(x) = N(x; 2, 1)$. Another speech class, "O", is described by pdf $g(x) = N(x; -2, 1)$.

The asymmetric error is the probability that one class is mistaken for the other under maximum-likelihood classification,
$$A_e(f, g) = \int f(x)\, 1_{g(x) \ge f(x)}(x)\, dx.$$

The Bayes error is the total classification error,
$$B_e(f, g) = \int \min(f(x), g(x))\, dx.$$
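
For concreteness, these quantities are easy to evaluate numerically in one dimension. The sketch below (an illustration, not code from the talk) computes the asymmetric error, the Bayes error, and the Bhattacharyya coefficient for the toy pair $f = N(2, 1)$, $g = N(-2, 1)$ on a grid.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy classes: f = N(x; 2, 1) and g = N(x; -2, 1)
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
f = normal_pdf(x, 2.0, 1.0)
g = normal_pdf(x, -2.0, 1.0)

A_e = np.sum(f * (g >= f)) * dx        # asymmetric error: mass of f where g >= f
B_e = np.sum(np.minimum(f, g)) * dx    # Bayes error: integral of min(f, g)
B   = np.sum(np.sqrt(f * g)) * dx      # Bhattacharyya coefficient

print(f"A_e = {A_e:.4f}, B_e = {B_e:.4f}, B = {B:.4f}")
# Expected: A_e ~ 0.0228, B_e = 2*Phi(-2) ~ 0.0455, B = exp(-2) ~ 0.1353
```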

Word Models

A word is modeled using its pronunciation(s) and an HMM. As an example, the word DIAL has the pronunciation D AY AX L and HMM F. CALL has the pronunciation K AO L and HMM G.

Each node in the HMM has a GMM associated with it. The word confusability is the Bayes error $B_e(F, G)$. This quantity is too hard to compute!

The Edit Distance

DIAL  CALL  edit op.      cost
D     K     substitution  1
AY    -     ins/del       1
AX    AO    substitution  1
L     L     none          0
            Total cost    3

The edit distance is the cost of the cheapest alignment, i.e. the shortest path through the alignment graph of the two phone strings D AY AX L and K AO L, from the initial node I to the final node F.
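
The total cost in the table can be reproduced with the standard dynamic program for edit distance. The sketch below uses unit costs and is illustrative, not code from the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences with unit costs."""
    m, n = len(a), len(b)
    # dp[i][j] = cost of aligning a[:i] with b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # deletions
    for j in range(n + 1):
        dp[0][j] = j          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a[i-1]
                           dp[i][j - 1] + 1,        # insert b[j-1]
                           dp[i - 1][j - 1] + sub)  # substitute / match
    return dp[m][n]

dial = ["D", "AY", "AX", "L"]
call = ["K", "AO", "L"]
print(edit_distance(dial, call))  # -> 3, matching the table above
```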

Better Ways

Other techniques use weights on the edges. Acoustic perplexity and the Average Divergence Distance are variants of this paradigm that use approximations to the KL divergence as edge weights.

Bayes Error

We use Bayes error approximations for each pair of GMMs in the Cartesian product of the two HMMs.

[Figure: the product HMM for DIAL and CALL, whose states are phone pairs such as D:K, AY:K, AX:AO, D:L, ..., L:L, connecting the initial state I to the final state F.]

Gaussian Mixture Models

Each node in the Cartesian HMM product corresponds to a pair of Gaussian mixture models f and g. We write
$$f(x) = \sum_a \pi_a f_a(x), \quad \text{where } f_a(x) = N(x; \mu_a, \Sigma_a),$$
and
$$g(x) = \sum_b \omega_b g_b(x), \quad \text{where } g_b(x) = N(x; \mu_b, \Sigma_b).$$
The high dimensionality of $x \in \mathbb{R}^d$, $d = 39$, makes numerical integration difficult.

Bayes, Bhattacharyya, Chernoff and Kullback-Leibler

Bayes error: $B_e(f, g) = \int \min(f(x), g(x))\,dx$

Kullback-Leibler divergence: $D(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx$

Bhattacharyya distance: $B(f, g) = \int \sqrt{f(x)\,g(x)}\,dx$

Chernoff distance: $C(f, g) = \min_{0 \le s \le 1} C_s(f, g)$, where $C_s(f, g) = \int f(x)^s g(x)^{1-s}\,dx$

Why these? For a pair of single Gaussians f and g we can compute $D(f\|g)$, $B(f, g)$ and $C_s(f, g)$ analytically.
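
These closed forms are what make the Gaussian components tractable. A sketch of the standard multivariate formulas is below; it is illustrative code, not code from the talk.

```python
import numpy as np

def gauss_kl(mu0, S0, mu1, S1):
    """KL divergence D(N(mu0, S0) || N(mu1, S1)) for full-covariance Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + logdet1 - logdet0)

def gauss_bhattacharyya(mu0, S0, mu1, S1):
    """Bhattacharyya coefficient B = integral sqrt(N0 * N1) = exp(-D_B)."""
    S = 0.5 * (S0 + S1)
    diff = mu1 - mu0
    _, logdetS = np.linalg.slogdet(S)
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    D_B = 0.125 * diff @ np.linalg.solve(S, diff) \
          + 0.5 * (logdetS - 0.5 * (logdet0 + logdet1))
    return np.exp(-D_B)

# The 1-D toy example from earlier: f = N(2, 1), g = N(-2, 1)
mu0, S0 = np.array([2.0]), np.array([[1.0]])
mu1, S1 = np.array([-2.0]), np.array([[1.0]])
print(gauss_kl(mu0, S0, mu1, S1))             # 8.0
print(gauss_bhattacharyya(mu0, S0, mu1, S1))  # exp(-2) ~ 0.1353
```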

Connections

Perimeter divergence (power mean): $P_\alpha(f, g) = \int \left(\tfrac{1}{2} f(x)^\alpha + \tfrac{1}{2} g(x)^\alpha\right)^{1/\alpha} dx$. We have $B_e(f, g) = P_{-\infty}(f, g)$ and $B(f, g) = P_0(f, g)$.

Rényi generalised divergence of order s: $D_s(f\|g) = \frac{1}{s-1}\log\int f(x)^s g(x)^{1-s}\,dx$. We have $D_1(f\|g) = D(f\|g)$ (as the limit $s \to 1$) and $D_s(f\|g) = \frac{1}{s-1}\log C_s(f, g)$.

Generalisation: $G_{\alpha,s}(f\|g) = \frac{1}{s-1}\log\int\left(s f(x)^\alpha + (1-s) g(x)^\alpha\right)^{1/\alpha} dx$. $G_{\alpha,s}(f\|g)$ connects $\log B_e(f, g)$, $D(f\|g)$, $\log B(f, g)$ and $C_s(f, g)$.

Relations and Inequalities

$G_{\alpha,1/2}(f\|g) = -2\log P_\alpha(f, g)$, $G_{0,s}(f\|g) = D_s(f\|g)$ and $G_{-\infty,1/2}(f\|g) = -2\log B_e(f, g)$.

$P_\alpha(f, g) \le P_\beta(f, g)$ for $\alpha \le \beta$, and $B(f, g) \le \sqrt{P_\alpha(f, g)\,P_{-\alpha}(f, g)}$.

$P_\alpha(f, f) = 1$, $G_{\alpha,s}(f\|f) = 0$, $P_{-\infty}(f, g) + P_{\infty}(f, g) = 2$, $B_e(f, g) = 2 - P_{\infty}(f, g)$.

$B(f, g) = B(g, f)$, $B_e(f, g) = B_e(g, f)$, but $D(f\|g) \ne D(g\|f)$.

$D(f\|g) \ge -2\log B(f, g)$ and $-2\log B_e(f, g) \ge -2\log B(f, g)$; $B_e(f, g) \le B(f, g) \le \sqrt{B_e(f, g)\,(2 - B_e(f, g))}$; $B_e(f, g) \le C(f, g) \le B(f, g)$; $C_0(f, g) = C_1(f, g) = 1$ and $B(f, g) = C_{1/2}(f, g)$; and so on.

The KL Divergence of a GMM

Monte Carlo sampling: draw n samples $\{x_i\}$ from f. Then
$$D(f\|g) \approx \frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i)}{g(x_i)},$$
with error $O(1/\sqrt{n})$.

Gaussian approximation: approximate f by a Gaussian $\hat{f}$ whose mean and covariance match the total mean and covariance of f, do the same for g and $\hat{g}$, and then use $D(f\|g) \approx D(\hat{f}\|\hat{g})$, where
$$\mu_{\hat{f}} = \sum_a \pi_a \mu_a, \qquad \Sigma_{\hat{f}} = \sum_a \pi_a\left(\Sigma_a + (\mu_a - \mu_{\hat{f}})(\mu_a - \mu_{\hat{f}})^T\right).$$
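
The plain Monte Carlo estimator is straightforward to implement. Below is a sketch for 1-D GMMs with made-up parameters (not the 39-dimensional acoustic models of the talk).

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_logpdf(x, w, mu, var):
    """Log density of a 1-D GMM (weights w, means mu, variances var) at points x."""
    comp = (-0.5 * (x[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(w))
    mx = comp.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(comp - mx).sum(axis=1, keepdims=True))).ravel()

def gmm_sample(n, w, mu, var):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], np.sqrt(var[idx]))

# Two small example GMMs f and g (made-up parameters)
wf, muf, varf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, varg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])

n = 100_000
x = gmm_sample(n, wf, muf, varf)
kl_mc = np.mean(gmm_logpdf(x, wf, muf, varf) - gmm_logpdf(x, wg, mug, varg))
print(f"Monte Carlo D(f||g) ~ {kl_mc:.4f} (error decays like 1/sqrt(n))")
```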

Unscented Approximation

It is possible to pick 2d sigma points $\{x_{a,k}\}_{k=1}^{2d}$ such that
$$\int f_a(x)\,h(x)\,dx = \frac{1}{2d}\sum_{k=1}^{2d} h(x_{a,k})$$
is exact for all quadratic functions h. One choice of sigma points is
$$x_{a,k} = \mu_a + \sqrt{d\,\lambda_{a,k}}\;e_{a,k}, \qquad x_{a,d+k} = \mu_a - \sqrt{d\,\lambda_{a,k}}\;e_{a,k},$$
where $\lambda_{a,k}$ and $e_{a,k}$ are the eigenvalues and eigenvectors of $\Sigma_a$. This is akin to Gaussian quadrature.
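
The sigma-point construction can be coded directly from the formula above. The sketch below estimates D(f||g) for small full-covariance GMMs with made-up parameters; it is an illustration, not the authors' implementation.

```python
import numpy as np

def mvn_logpdf(X, mu, S):
    """Log density of N(mu, S) evaluated at the rows of X."""
    d = len(mu)
    diff = X - mu
    _, logdet = np.linalg.slogdet(S)
    maha = np.einsum('ij,ij->i', diff, np.linalg.solve(S, diff.T).T)
    return -0.5 * (maha + d * np.log(2 * np.pi) + logdet)

def gmm_logpdf(X, weights, mus, covs):
    comp = np.stack([np.log(w) + mvn_logpdf(X, m, S)
                     for w, m, S in zip(weights, mus, covs)], axis=1)
    mx = comp.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(comp - mx).sum(axis=1, keepdims=True))).ravel()

def unscented_kl(wf, muf, covf, wg, mug, covg):
    """Unscented estimate of D(f||g): 2d sigma points per component of f."""
    d = len(muf[0])
    total = 0.0
    for w_a, mu_a, S_a in zip(wf, muf, covf):
        lam, e = np.linalg.eigh(S_a)            # eigenvalues / eigenvectors of Sigma_a
        offsets = (np.sqrt(d * lam) * e).T      # row k = sqrt(d * lam_k) * e_k
        pts = np.vstack([mu_a + offsets, mu_a - offsets])   # the 2d sigma points
        total += w_a * np.mean(gmm_logpdf(pts, wf, muf, covf)
                               - gmm_logpdf(pts, wg, mug, covg))
    return total

# Tiny 2-D example with made-up parameters
wf = np.array([0.5, 0.5]); muf = [np.zeros(2), np.array([2.0, 0.0])]
covf = [np.eye(2), np.diag([0.5, 1.5])]
wg = np.array([1.0]); mug = [np.array([1.0, 1.0])]; covg = [np.eye(2)]
print(unscented_kl(wf, muf, covf, wg, mug, covg))
```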

Matched Bound Approximation

Match the closest pairs of Gaussians, $m(a) = \arg\min_b\left(D(f_a\|g_b) - \log\omega_b\right)$. Goldberger's approximate formula is
$$D(f\|g) \approx D_{\text{Goldberger}}(f\|g) = \sum_a \pi_a\left(D(f_a\|g_{m(a)}) + \log\frac{\pi_a}{\omega_{m(a)}}\right),$$
analogous to the chain rule for relative entropy.

Min approximation: $D(f\|g) \approx \min_{a,b} D(f_a\|g_b)$ is an approximation in the same spirit.
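
A sketch of the matched-bound approximation for 1-D GMMs, reusing the closed-form Gaussian KL for the component pairs (made-up parameters, illustrative only).

```python
import numpy as np

def gauss_kl_1d(mu0, v0, mu1, v1):
    """D(N(mu0, v0) || N(mu1, v1)) for scalar variances."""
    return 0.5 * (v0 / v1 + (mu1 - mu0) ** 2 / v1 - 1.0 + np.log(v1 / v0))

def goldberger_kl(wf, muf, vf, wg, mug, vg):
    """Matched-bound approximation: match each f_a to its best g_b."""
    total = 0.0
    for pi_a, mu_a, v_a in zip(wf, muf, vf):
        # m(a) = argmin_b ( D(f_a || g_b) - log omega_b )
        scores = [gauss_kl_1d(mu_a, v_a, mu_b, v_b) - np.log(w_b)
                  for w_b, mu_b, v_b in zip(wg, mug, vg)]
        b = int(np.argmin(scores))
        total += pi_a * (gauss_kl_1d(mu_a, v_a, mug[b], vg[b])
                         + np.log(pi_a / wg[b]))
    return total

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
print(goldberger_kl(wf, muf, vf, wg, mug, vg))
```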

Variational Approximation

Let $\phi_{b|a} \ge 0$ with $\sum_b \phi_{b|a} = 1$ be free variational parameters. Then
$$\int f\log g = \sum_a \pi_a \int f_a \log\sum_b \phi_{b|a}\,\frac{\omega_b g_b}{\phi_{b|a}}
\ge \sum_a \pi_a \int f_a \sum_b \phi_{b|a}\log\frac{\omega_b g_b}{\phi_{b|a}}
= \sum_a \pi_a \sum_b \phi_{b|a}\left(\log\frac{\omega_b}{\phi_{b|a}} + \int f_a \log g_b\right),$$
where Jensen's inequality is used to interchange the log and the sum over b. Maximizing over $\phi_{b|a}$, and treating $\int f\log f$ in the same way, gives
$$D(f\|g) \approx D_{\text{var}}(f\|g) = \sum_a \pi_a \log\frac{\sum_{a'}\pi_{a'}\,e^{-D(f_a\|f_{a'})}}{\sum_b \omega_b\,e^{-D(f_a\|g_b)}}.$$
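
The closed-form variational approximation only needs the pairwise component KLs. A 1-D sketch with made-up parameters (illustrative, not the authors' code):

```python
import numpy as np

def gauss_kl_1d(mu0, v0, mu1, v1):
    return 0.5 * (v0 / v1 + (mu1 - mu0) ** 2 / v1 - 1.0 + np.log(v1 / v0))

def variational_kl(wf, muf, vf, wg, mug, vg):
    """D_var(f||g) = sum_a pi_a log [ sum_a' pi_a' e^{-D(f_a||f_a')} /
                                      sum_b  w_b   e^{-D(f_a||g_b)}  ]."""
    total = 0.0
    for pi_a, mu_a, v_a in zip(wf, muf, vf):
        num = sum(pi_ap * np.exp(-gauss_kl_1d(mu_a, v_a, mu_ap, v_ap))
                  for pi_ap, mu_ap, v_ap in zip(wf, muf, vf))
        den = sum(w_b * np.exp(-gauss_kl_1d(mu_a, v_a, mu_b, v_b))
                  for w_b, mu_b, v_b in zip(wg, mug, vg))
        total += pi_a * np.log(num / den)
    return total

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
print(variational_kl(wf, muf, vf, wg, mug, vg))
```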

Variational Upper Bound

Introduce variational parameters $\phi_{b|a} \ge 0$ and $\psi_{a|b} \ge 0$ with $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$, which replicate the Gaussians:
$$f = \sum_a \pi_a f_a = \sum_{ab}\phi_{b|a} f_a, \qquad g = \sum_b \omega_b g_b = \sum_{ab}\psi_{a|b} g_b.$$
Then
$$D(f\|g) = \int f\log\frac{f}{g} = -\int f\log\frac{\sum_{ab}\psi_{a|b} g_b}{f}
= -\int f\log\sum_{ab}\frac{\phi_{b|a} f_a}{f}\cdot\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}
\le -\sum_{ab}\int \phi_{b|a} f_a\log\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}\,dx
= D(\phi\|\psi) + \sum_{ab}\phi_{b|a}\,D(f_a\|g_b),$$
using Jensen's inequality to interchange the log and the weights $\phi_{b|a} f_a/f$, and where $D(\phi\|\psi) = \sum_{ab}\phi_{b|a}\log(\phi_{b|a}/\psi_{a|b})$. This is the chain rule for relative entropy, extended to mixtures with unequal numbers of components!

Optimize the variational bound $D(f\|g) \le D(\phi\|\psi) + \sum_{ab}\phi_{b|a} D(f_a\|g_b)$ subject to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$ by alternating two closed-form updates: fix $\phi$ and set
$$\psi_{a|b} = \frac{\omega_b\,\phi_{b|a}}{\sum_{a'}\phi_{b|a'}},$$
then fix $\psi$ and set
$$\phi_{b|a} = \frac{\pi_a\,\psi_{a|b}\,e^{-D(f_a\|g_b)}}{\sum_{b'}\psi_{a|b'}\,e^{-D(f_a\|g_{b'})}}.$$
Iterate a few times to find the optimal solution.
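
A sketch of the alternating updates for 1-D GMMs (made-up parameters). Any feasible phi, psi give a valid upper bound; the iteration tightens it.

```python
import numpy as np

def gauss_kl_1d(mu0, v0, mu1, v1):
    return 0.5 * (v0 / v1 + (mu1 - mu0) ** 2 / v1 - 1.0 + np.log(v1 / v0))

def variational_upper_bound(wf, muf, vf, wg, mug, vg, iters=10):
    """Upper bound D(f||g) <= D(phi||psi) + sum_ab phi[a,b] D(f_a||g_b)."""
    A, B = len(wf), len(wg)
    D = np.array([[gauss_kl_1d(muf[a], vf[a], mug[b], vg[b])
                   for b in range(B)] for a in range(A)])
    phi = np.outer(wf, wg)                              # sum_b phi[a,b] = pi_a
    for _ in range(iters):
        # psi[a,b] = omega_b * phi[a,b] / sum_a' phi[a',b]
        psi = wg * phi / phi.sum(axis=0, keepdims=True)
        # phi[a,b] = pi_a * psi[a,b] e^{-D_ab} / sum_b' psi[a,b'] e^{-D_ab'}
        t = psi * np.exp(-D)
        phi = wf[:, None] * t / t.sum(axis=1, keepdims=True)
    psi = wg * phi / phi.sum(axis=0, keepdims=True)     # final psi for current phi
    return np.sum(phi * (np.log(phi / psi) + D))

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
print(variational_upper_bound(wf, muf, vf, wg, mug, vg))
```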

Comparison of KL Divergence Methods

Plots showing histograms of the difference between Monte Carlo sampling with 1 million samples and the various methods.

[Figure: two histograms of the deviation from $D_{\text{MC}(1M)}$. Left panel: zero, product, min, gaussian, variational. Right panel: MC(2dn), unscented, goldberger, variational upper bound.]

Summary of KL Divergence Methods

Monte Carlo sampling: arbitrary accuracy, arbitrary cost
Gaussian approximation: not cheap, not good, closed-form
Min approximation: cheap, but not good
Unscented approximation: almost as good as MC at the same cost
Matched bound approximation: cheap and good, not differentiable
Variational approximation: cheap and good, closed-form
Variational upper bound: cheap and better, strict bound, iterative

Comparison of KL, Bhattacharyya, Bayes

[Figures comparing the KL divergence, Bhattacharyya and Bayes error measures; not transcribed.]

Bhattacharyya Distance

The Bhattacharyya distance, $B(f, g) = \int\sqrt{fg}$, can be estimated using Monte Carlo sampling from an arbitrary distribution h:
$$B(f, g) \approx \hat{B}_h = \frac{1}{n}\sum_{i=1}^{n}\frac{\sqrt{f(x_i)\,g(x_i)}}{h(x_i)},$$
where $\{x_i\}_{i=1}^n$ are sampled from h. The estimators are unbiased, $E[\hat{B}_h] = B(f, g)$, with variance
$$\text{var}(\hat{B}_h) = \frac{1}{n}\left(\int\frac{fg}{h} - B(f, g)^2\right).$$
Taking $h = f$ gives $\text{var}(\hat{B}_f) = \frac{1 - B(f, g)^2}{n}$. Taking $h = \frac{f+g}{2}$ gives $\text{var}(\hat{B}_{\frac{f+g}{2}}) = \frac{1}{n}\left(\int\frac{2fg}{f+g} - B(f, g)^2\right)$, and $\text{var}(\hat{B}_{\frac{f+g}{2}}) \le \text{var}(\hat{B}_f)$ (harmonic-arithmetic mean inequality).
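
A sketch comparing the two simple proposal distributions for 1-D GMMs (made-up parameters, illustrative only). Note that (f+g)/2 is itself a GMM, so it is easy to sample from and evaluate.

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, var):
    return np.sum(w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                  / np.sqrt(2 * np.pi * var), axis=1)

def gmm_sample(n, w, mu, var):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], np.sqrt(var[idx]))

wf, muf, varf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, varg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])

n = 100_000
# Proposal h = f
x = gmm_sample(n, wf, muf, varf)
est_f = np.mean(np.sqrt(gmm_pdf(x, wf, muf, varf) * gmm_pdf(x, wg, mug, varg))
                / gmm_pdf(x, wf, muf, varf))
# Proposal h = (f+g)/2: the pooled components with halved weights
wh = np.concatenate([wf, wg]) / 2
muh = np.concatenate([muf, mug])
varh = np.concatenate([varf, varg])
x = gmm_sample(n, wh, muh, varh)
est_mix = np.mean(np.sqrt(gmm_pdf(x, wf, muf, varf) * gmm_pdf(x, wg, mug, varg))
                  / gmm_pdf(x, wh, muh, varh))
print(f"B_hat (h=f)       = {est_f:.4f}")
print(f"B_hat (h=(f+g)/2) = {est_mix:.4f}")
```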

Best Sampling Distribution

We can find the best sampling distribution h by minimizing the variance of $\hat{B}_h$ subject to the constraints $h \ge 0$ and $\int h = 1$. The solution is
$$h^* = \frac{\sqrt{fg}}{\int\sqrt{fg}}, \qquad \text{var}(\hat{B}_{h^*}) = 0.$$
Unfortunately, using this h requires computing the quantity $B(f, g) = \int\sqrt{fg}$ that we are trying to compute in the first place, and sampling from $\sqrt{fg}$.

We will use variational techniques to approximate $\sqrt{fg}$ with an unnormalized $\hat{h}$ that can be analytically integrated to give a genuine pdf h.

Bhattacharyya Variational Upper Bound

As before, introduce variational parameters $\phi_{b|a} \ge 0$ and $\psi_{a|b} \ge 0$ with $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$, so that
$$f = \sum_a \pi_a f_a = \sum_{ab}\phi_{b|a} f_a, \qquad g = \sum_b \omega_b g_b = \sum_{ab}\psi_{a|b} g_b.$$
Then
$$B(f, g) = \int f\sqrt{\frac{g}{f}} = \int f\sqrt{\sum_{ab}\frac{\psi_{a|b} g_b}{f}}
= \int f\sqrt{\sum_{ab}\frac{\phi_{b|a} f_a}{f}\cdot\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}}
\ge \sum_{ab}\int \phi_{b|a} f_a\sqrt{\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}}\,dx
= \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}}\;B(f_a, g_b),$$
using Jensen's inequality (concavity of the square root) to interchange the square root and the weights $\phi_{b|a} f_a/f$. An inequality linking the mixture Bhattacharyya distance to the component distances! (It is a lower bound on $B(f, g)$, hence an upper bound on the Bhattacharyya divergence $-\log B(f, g)$.)

Optimize the variational bound $B(f, g) \ge \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}}\,B(f_a, g_b)$ subject to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$ by alternating: fix $\phi$ and set
$$\psi_{a|b} = \frac{\omega_b\,\phi_{b|a}\,B(f_a, g_b)^2}{\sum_{a'}\phi_{b|a'}\,B(f_{a'}, g_b)^2},$$
then fix $\psi$ and set
$$\phi_{b|a} = \frac{\pi_a\,\psi_{a|b}\,B(f_a, g_b)^2}{\sum_{b'}\psi_{a|b'}\,B(f_a, g_{b'})^2}.$$
Iterate to find the optimal solution!
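
A sketch of the alternating updates for 1-D GMMs, using the closed-form Gaussian Bhattacharyya coefficient for the component pairs (made-up parameters, illustrative only).

```python
import numpy as np

def gauss_bhatt_1d(mu1, v1, mu2, v2):
    """Bhattacharyya coefficient B(N(mu1,v1), N(mu2,v2)) = integral sqrt(N1 N2)."""
    return (np.sqrt(2.0 * np.sqrt(v1 * v2) / (v1 + v2))
            * np.exp(-(mu1 - mu2) ** 2 / (4.0 * (v1 + v2))))

def variational_bhattacharyya(wf, muf, vf, wg, mug, vg, iters=10):
    """Lower bound B(f,g) >= sum_ab sqrt(phi[a,b] psi[a,b]) B(f_a, g_b),
    maximized by alternating the closed-form updates from the slides."""
    A, B = len(wf), len(wg)
    Bab = np.array([[gauss_bhatt_1d(muf[a], vf[a], mug[b], vg[b])
                     for b in range(B)] for a in range(A)])
    phi = np.outer(wf, wg)                              # sum_b phi[a,b] = pi_a
    for _ in range(iters):
        t = phi * Bab ** 2
        psi = wg * t / t.sum(axis=0, keepdims=True)     # sum_a psi[a,b] = omega_b
        t = psi * Bab ** 2
        phi = wf[:, None] * t / t.sum(axis=1, keepdims=True)
    t = phi * Bab ** 2
    psi = wg * t / t.sum(axis=0, keepdims=True)         # final psi for current phi
    V = np.sum(np.sqrt(phi * psi) * Bab)
    return V, phi, psi

wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])
V, phi, psi = variational_bhattacharyya(wf, muf, vf, wg, mug, vg)
print(f"variational estimate V(f,g) = {V:.4f}")
# phi and psi also define the sampling distribution used on the next slide.
```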

Variational Monte Carlo Sampling

Write the variational estimate as
$$V(f, g) = \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}}\,B(f_a, g_b) = \sum_{ab}\int\sqrt{\phi_{b|a}\,\psi_{a|b}\,f_a\,g_b} = \int\hat{h}.$$
Here $\hat{h} = \sum_{ab}\sqrt{\phi_{b|a}\,\psi_{a|b}\,f_a\,g_b}$ is an unnormalized approximation of the optimal sampling distribution $\sqrt{fg}/\int\sqrt{fg}$.

$h = \hat{h}/\int\hat{h}$ is a GMM, since $h_{ab} = \sqrt{f_a g_b}/\int\sqrt{f_a g_b}$ is a Gaussian and
$$h = \sum_{ab}\pi_{ab}\,h_{ab}, \qquad \pi_{ab} = \frac{\sqrt{\phi_{b|a}\,\psi_{a|b}}\,\int\sqrt{f_a g_b}}{V(f, g)}.$$
Thus, drawing samples $\{x_i\}_{i=1}^n$ from h, the estimate
$$\hat{V}_n = \frac{1}{n}\sum_{i=1}^{n}\frac{\sqrt{f(x_i)\,g(x_i)}}{h(x_i)}$$
is unbiased, and in experiments it is seen to be far superior to sampling from $(f+g)/2$.
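
Putting the pieces together for 1-D GMMs: optimize phi and psi, build the GMM h whose components are the normalized sqrt(f_a g_b), sample from it, and form the unbiased estimate. This is an illustrative sketch with made-up parameters, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_bhatt_1d(mu1, v1, mu2, v2):
    return (np.sqrt(2.0 * np.sqrt(v1 * v2) / (v1 + v2))
            * np.exp(-(mu1 - mu2) ** 2 / (4.0 * (v1 + v2))))

def gmm_pdf(x, w, mu, var):
    return np.sum(w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                  / np.sqrt(2 * np.pi * var), axis=1)

# Example 1-D GMMs (made-up parameters)
wf, muf, vf = np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([0.5, 1.0])
wg, mug, vg = np.array([0.3, 0.7]), np.array([0.0, 2.0]), np.array([1.0, 0.5])

# Pairwise component Bhattacharyya coefficients and variational weights
Bab = gauss_bhatt_1d(muf[:, None], vf[:, None], mug[None, :], vg[None, :])
phi = np.outer(wf, wg)
for _ in range(10):
    t = phi * Bab ** 2
    psi = wg * t / t.sum(axis=0, keepdims=True)
    t = psi * Bab ** 2
    phi = wf[:, None] * t / t.sum(axis=1, keepdims=True)
V = np.sum(np.sqrt(phi * psi) * Bab)          # variational estimate of B(f, g)

# h is a GMM: component ab is the normalized sqrt(f_a g_b), a Gaussian with
# mean (mu_a v_b + mu_b v_a)/(v_a + v_b) and variance 2 v_a v_b/(v_a + v_b)
mu_h = (muf[:, None] * vg[None, :] + mug[None, :] * vf[:, None]) / (vf[:, None] + vg[None, :])
v_h = 2.0 * vf[:, None] * vg[None, :] / (vf[:, None] + vg[None, :])
w_h = np.sqrt(phi * psi) * Bab / V            # mixture weights pi_ab (sum to 1)

# Draw from h and form the unbiased importance-sampling estimate
n = 10_000
idx = rng.choice(w_h.size, size=n, p=w_h.ravel())
x = rng.normal(mu_h.ravel()[idx], np.sqrt(v_h.ravel()[idx]))
h_x = np.sum(w_h.ravel() * np.exp(-0.5 * (x[:, None] - mu_h.ravel()) ** 2 / v_h.ravel())
             / np.sqrt(2 * np.pi * v_h.ravel()), axis=1)
B_hat = np.mean(np.sqrt(gmm_pdf(x, wf, muf, vf) * gmm_pdf(x, wg, mug, vg)) / h_x)
print(f"variational bound V = {V:.4f}, variational IS estimate B_hat = {B_hat:.4f}")
```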

Bhattacharyya Distance: Monte Carlo Estimation

[Figure: histograms of the deviation from a reference Bhattacharyya estimate computed with 1M samples, for importance sampling from f(x), from (f(x)+g(x))/2, and for variational importance sampling, at sample sizes up to 100K.]

Importance sampling from f(x): slow convergence.
Importance sampling from (f(x)+g(x))/2: better convergence.
Variational importance sampling: fast convergence.

Comparison of KL, Bhattacharyya, Bayes

[Figures comparing the measures; not transcribed.]

Variational Monte Carlo Sampling: KL-Divergence

[Figure not transcribed.]

Future Directions

HMM variational KL divergence
HMM variational Bhattacharyya
Variational Chernoff distance
Variational sampling of Bayes error using the Chernoff approximation
Discriminative training using the Bhattacharyya divergence
Acoustic confusability using the Bhattacharyya divergence
Clustering of HMMs


More information

Stat410 Probability and Statistics II (F16)

Stat410 Probability and Statistics II (F16) Stat4 Probability and Statistics II (F6 Exponential, Poisson and Gamma Suppose on average every /λ hours, a Stochastic train arrives at the Random station. Further we assume the waiting time between two

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Particle Filtering Approaches for Dynamic Stochastic Optimization

Particle Filtering Approaches for Dynamic Stochastic Optimization Particle Filtering Approaches for Dynamic Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge I-Sim Workshop,

More information

Kernel Density Estimation

Kernel Density Estimation EECS 598: Statistical Learning Theory, Winter 2014 Topic 19 Kernel Density Estimation Lecturer: Clayton Scott Scribe: Yun Wei, Yanzhen Deng Disclaimer: These notes have not been subjected to the usual

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Unit 3: HW3.5 Sum and Product

Unit 3: HW3.5 Sum and Product Unit 3: HW3.5 Sum and Product Without solving, find the sum and product of the roots of each equation. 1. x 2 8x + 7 = 0 2. 2x + 5 = x 2 3. -7x + 4 = -3x 2 4. -10x 2 = 5x - 2 5. 5x 2 2x 3 4 6. 1 3 x2 3x

More information

Online Estimation of Discrete Densities using Classifier Chains

Online Estimation of Discrete Densities using Classifier Chains Online Estimation of Discrete Densities using Classifier Chains Michael Geilke 1 and Eibe Frank 2 and Stefan Kramer 1 1 Johannes Gutenberg-Universtität Mainz, Germany {geilke,kramer}@informatik.uni-mainz.de

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Expectation Propagation for Approximate Bayesian Inference

Expectation Propagation for Approximate Bayesian Inference Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Review: mostly probability and some statistics

Review: mostly probability and some statistics Review: mostly probability and some statistics C2 1 Content robability (should know already) Axioms and properties Conditional probability and independence Law of Total probability and Bayes theorem Random

More information

Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint

Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint Xia Hong 1, Sheng Chen 2, Chris J. Harris 2 1 School of Systems Engineering University of Reading, Reading RG6 6AY, UK E-mail: x.hong@reading.ac.uk

More information

Quick Tour of Basic Probability Theory and Linear Algebra

Quick Tour of Basic Probability Theory and Linear Algebra Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Some New Results on Information Properties of Mixture Distributions

Some New Results on Information Properties of Mixture Distributions Filomat 31:13 (2017), 4225 4230 https://doi.org/10.2298/fil1713225t Published by Faculty of Sciences and Mathematics, University of Niš, Serbia Available at: http://www.pmf.ni.ac.rs/filomat Some New Results

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

Probabilistic Models for Sequence Labeling

Probabilistic Models for Sequence Labeling Probabilistic Models for Sequence Labeling Besnik Fetahu June 9, 2011 Besnik Fetahu () Probabilistic Models for Sequence Labeling June 9, 2011 1 / 26 Background & Motivation Problem introduction Generative

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

A minimalist s exposition of EM

A minimalist s exposition of EM A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability

More information

(Multivariate) Gaussian (Normal) Probability Densities

(Multivariate) Gaussian (Normal) Probability Densities (Multivariate) Gaussian (Normal) Probability Densities Carl Edward Rasmussen, José Miguel Hernández-Lobato & Richard Turner April 20th, 2018 Rasmussen, Hernàndez-Lobato & Turner Gaussian Densities April

More information