Variational sampling approaches to word confusability
1 Variational sampling approaches to word confusability
John R. Hershey, Peder A. Olsen and Ramesh A. Gopinath
IBM, T. J. Watson Research Center
Information Theory and Applications
2 Abstract
In speech recognition it is often useful to determine how confusable two words are. For speech models this comes down to computing the Bayes error between two HMMs, a problem that is analytically and numerically intractable. A common alternative that is numerically approachable uses the KL divergence in place of the Bayes error. We present new approaches to approximating the KL divergence that combine variational methods with importance sampling. The Bhattacharyya distance, a closer cousin of the Bayes error, turns out to be even more amenable to our approach. Our experiments demonstrate an improvement of orders of magnitude in accuracy over conventional methods.
3-12 Outline
- Acoustic Confusability
- Divergence Measures for Distributions
- KL Divergence: Prior Art
- KL Divergence: Variational Approximations
- KL Divergence: Empirical Evaluations
- Bhattacharyya: Monte Carlo Approximation
- Bhattacharyya: Variational Approximation
- Bhattacharyya: Variational Monte Carlo Approximation
- Bhattacharyya: Empirical Evaluations
- Future Directions
13-17 A Toy Version of the Confusability Problem
[Figure: probability density plot of the two class pdfs and their geometric mean (fg)^{1/2}.]
X is an acoustic feature vector representing a speech class, say "E", with pdf f(x) = N(x; -2, 1). Another speech class, "O", is described by pdf g(x) = N(x; 2, 1). The asymmetric error is the probability that one class, "O", will be mistaken for the other, "E", when classifying according to the likelihood:
A_e(f, g) = \int f(x) \mathbf{1}_{g(x) \ge f(x)}(x) \, dx.
The Bayes error is the total classification error:
B_e(f, g) = \int \min(f(x), g(x)) \, dx.
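The toy quantities above are easy to check numerically. A minimal 1-D sketch, assuming the transcription lost a minus sign and the two class means are -2 and +2 (the helper names, the integration interval, and the trapezoidal rule are illustrative choices, not from the slides):

```python
import math

def norm_pdf(x, mu, sigma=1.0):
    # Density of N(x; mu, sigma^2).
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(fn, lo=-12.0, hi=12.0, n=24000):
    # Trapezoidal rule; adequate in one dimension.
    dx = (hi - lo) / n
    total = 0.5 * (fn(lo) + fn(hi))
    for i in range(1, n):
        total += fn(lo + i * dx)
    return total * dx

f = lambda x: norm_pdf(x, -2.0)  # class "E" (mean assumed to be -2)
g = lambda x: norm_pdf(x, 2.0)   # class "O"

# Asymmetric error: mass of f on the region where g is at least as likely.
asym_error = integrate(lambda x: f(x) if g(x) >= f(x) else 0.0)
# Bayes error: total overlap of the two densities.
bayes_error = integrate(lambda x: min(f(x), g(x)))
```

For these symmetric unit-variance Gaussians the Bayes error is 2*Phi(-2), roughly 0.046, and each asymmetric error is half of that.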
18-20 Word Models
A word is modeled using its pronunciation(s) and an HMM. As an example, the word DIAL has the pronunciation D AY AX L and HMM F; CALL has the pronunciation K AO L and HMM G. Each node in the HMM has a GMM associated with it. The word confusability is the Bayes error B_e(F, G). This quantity is too hard to compute!
21-22 The Edit Distance

DIAL  CALL  edit op.      cost
D     K     substitution  1
AY    -     ins/del       1
AX    AO    substitution  1
L     L     none          0
Total cost: 3

The edit distance is the cost of the shortest path in the product graph from the initial node I to the final node F, with one axis labeled D AY AX L and the other K AO L.
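The table above is an instance of the standard Levenshtein dynamic program over phone sequences. A sketch with unit costs (the function name is illustrative):

```python
def edit_distance(a, b):
    # Levenshtein distance between two phone sequences, unit costs.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete everything
    for j in range(n + 1):
        d[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

cost = edit_distance(["D", "AY", "AX", "L"], ["K", "AO", "L"])  # DIAL vs CALL
```

This reproduces the total cost of 3 from the table.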
23 Better ways
Other techniques use weights on the edges. Acoustic Perplexity and Average Divergence Distance are variants of this paradigm that use approximations to the KL divergence as weights.
24 Bayes Error
We use Bayes error approximations for each pair of GMMs in the Cartesian HMM products:
[Figure: product graph of the DIAL and CALL HMMs, from initial node I to final node F, with nodes labeled by state pairs such as D:K, AY:AO, and L:L.]
25-27 Gaussian Mixture Models
Each node in the Cartesian HMM product corresponds to a pair of Gaussian mixture models f and g. We write
f(x) = \sum_a \pi_a f_a(x), where f_a(x) = N(x; \mu_a, \Sigma_a),
and
g(x) = \sum_b \omega_b g_b(x), where g_b(x) = N(x; \mu_b, \Sigma_b).
The high dimensionality of x \in R^d, d = 39, makes numerical integration difficult.
28-32 Bayes, Bhattacharyya, Chernoff and Kullback-Leibler
Bayes error: B_e(f, g) = \int \min(f(x), g(x)) \, dx
Kullback-Leibler divergence: D(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx
Bhattacharyya distance: B(f, g) = \int \sqrt{f(x) g(x)} \, dx
Chernoff distance: C(f, g) = \min_{0 \le s \le 1} C_s(f, g), where C_s(f, g) = \int f(x)^s g(x)^{1-s} \, dx.
Why these? For a pair of single Gaussians f and g we can compute D(f \| g), B(f, g) and C_s(f, g) analytically.
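The closed forms behind that last remark are easy to state in one dimension. A sketch for scalar Gaussians (the multivariate versions replace the variances with covariance matrices; function names are illustrative):

```python
import math

def gauss_kl(m1, s1, m2, s2):
    # D(N(m1, s1^2) || N(m2, s2^2)), closed form.
    return math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2.0 * s2 * s2) - 0.5

def gauss_chernoff(m1, s1, m2, s2, s):
    # C_s(f, g) = int f^s g^{1-s} dx for two scalar Gaussians.
    v1, v2 = s1 * s1, s2 * s2
    lam = s / v1 + (1.0 - s) / v2  # precision of the s-weighted product
    pref = ((2 * math.pi * v1) ** (-s / 2.0)
            * (2 * math.pi * v2) ** (-(1.0 - s) / 2.0)
            * math.sqrt(2 * math.pi / lam))
    expo = -0.5 * s * (1.0 - s) * (m1 - m2) ** 2 / (s * v2 + (1.0 - s) * v1)
    return pref * math.exp(expo)

def gauss_bhatt(m1, s1, m2, s2):
    # B(f, g) = C_{1/2}(f, g).
    return gauss_chernoff(m1, s1, m2, s2, 0.5)
```

For equal unit variances and means 0 and 2 this gives B = exp(-1/2), the familiar Bhattacharyya coefficient.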
33-35 Connections
Perimeter divergence (power mean): P_\alpha(f, g) = \int \left( \frac{f(x)^\alpha + g(x)^\alpha}{2} \right)^{1/\alpha} dx.
We have B_e(f, g) = P_{-\infty}(f, g) and B(f, g) = P_0(f, g).
Rényi generalised divergence of order s: D_s(f \| g) = \frac{1}{s-1} \log \int f(x)^s g(x)^{1-s} \, dx.
We have D_1(f \| g) = D(f \| g) and D_s(f \| g) = \frac{1}{s-1} \log C_s(f, g).
Generalisation: G_{\alpha,s}(f \| g) = \frac{1}{s-1} \log \int \left( s f(x)^\alpha + (1-s) g(x)^\alpha \right)^{1/\alpha} dx.
G_{\alpha,s}(f \| g) connects \log B_e(f, g), D(f \| g), \log B(f, g) and C_s(f, g).
36-40 Relations and Inequalities
G_{\alpha,1/2}(f \| g) = -2 \log P_\alpha(f, g), G_{0,s}(f \| g) = D_s(f \| g), and G_{-\infty,s}(f \| g) = \frac{1}{s-1} \log B_e(f, g).
P_\alpha(f, g) \le P_\beta(f, g) for \alpha \le \beta, and B(f, g) \le \sqrt{P_\alpha(f, g) P_{-\alpha}(f, g)}.
P_\alpha(f, f) = 1, G_{\alpha,s}(f \| f) = 0, P_\infty(f, g) + P_{-\infty}(f, g) = 2, B_e(f, g) = 2 - P_\infty(f, g).
B(f, g) = B(g, f), B_e(f, g) = B_e(g, f), D(f \| g) \ne D(g \| f).
D(f \| g) \ge -2 \log B(f, g) \ge -2 \log B_e(f, g); B_e(f, g) \le B(f, g) \le \sqrt{B_e(f, g)(2 - B_e(f, g))}; B_e(f, g) \le C(f, g) \le B(f, g); C_0(f, g) = C_1(f, g) = 1 and B(f, g) = C_{1/2}(f, g); and so on.
41 The KL Divergence of a GMM
Monte Carlo sampling: draw n samples {x_i} from f. Then
D(f \| g) \approx \frac{1}{n} \sum_{i=1}^n \log \frac{f(x_i)}{g(x_i)},
with error O(1/\sqrt{n}).
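A 1-D sketch of this estimator for GMMs (scalar components stand in for the 39-dimensional ones; names are illustrative):

```python
import math
import random

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, w, mus, sigmas):
    return sum(wi * norm_pdf(x, m, s) for wi, m, s in zip(w, mus, sigmas))

def kl_monte_carlo(fw, fm, fs, gw, gm, gs, n=5000, seed=0):
    # D(f||g) ~ (1/n) sum_i log f(x_i)/g(x_i) with x_i drawn from f;
    # the error decays like O(1/sqrt(n)).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.choices(range(len(fw)), weights=fw)[0]  # pick a component of f
        x = rng.gauss(fm[a], fs[a])
        total += math.log(gmm_pdf(x, fw, fm, fs) / gmm_pdf(x, gw, gm, gs))
    return total / n
```

With f = g every summand is exactly zero; for two unit-variance Gaussians two means apart the true divergence is 2.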
42 The KL Divergence of a GMM
Gaussian approximation: approximate f with a Gaussian \hat{f} whose mean and covariance match the total mean and covariance of f; do the same for g and \hat{g}, then use D(f \| g) \approx D(\hat{f} \| \hat{g}), where
\mu_{\hat{f}} = \sum_a \pi_a \mu_a, \quad \Sigma_{\hat{f}} = \sum_a \pi_a \left( \Sigma_a + (\mu_a - \mu_{\hat{f}})(\mu_a - \mu_{\hat{f}})^T \right).
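A 1-D sketch of the moment matching, with scalar variances standing in for the covariance matrices (names illustrative):

```python
import math

def moment_match(weights, mus, sigmas):
    # Collapse a 1-D GMM into one Gaussian with the same total mean and
    # variance: mu = sum_a pi_a mu_a,
    # var = sum_a pi_a (sigma_a^2 + (mu_a - mu)^2).
    mu = sum(w * m for w, m in zip(weights, mus))
    var = sum(w * (s * s + (m - mu) ** 2) for w, m, s in zip(weights, mus, sigmas))
    return mu, math.sqrt(var)

# Symmetric two-component mixture: mean 0, variance 1 + 1 = 2.
mu_hat, sigma_hat = moment_match([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```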
43 Unscented Approximation
It is possible to pick 2d sigma points {x_{a,k}}_{k=1}^{2d} such that
\int f_a(x) h(x) \, dx = \frac{1}{2d} \sum_{k=1}^{2d} h(x_{a,k})
is exact for all quadratic functions h. One choice of sigma points is
x_{a,k} = \mu_a + \sqrt{d \lambda_{a,k}} \, e_{a,k}, \quad x_{a,d+k} = \mu_a - \sqrt{d \lambda_{a,k}} \, e_{a,k},
where \lambda_{a,k} and e_{a,k} are the eigenvalues and eigenvectors of \Sigma_a. This is akin to Gaussian quadrature.
44-45 Matched Bound Approximation
Match the closest pairs of Gaussians: m(a) = \arg\min_b \left( D(f_a \| g_b) - \log \omega_b \right). Goldberger's approximate formula is:
D(f \| g) \approx D_{Goldberger}(f \| g) = \sum_a \pi_a \left( D(f_a \| g_{m(a)}) + \log \frac{\pi_a}{\omega_{m(a)}} \right).
Analogous to the chain rule for relative entropy.
Min approximation: D(f \| g) \approx \min_{a,b} D(f_a \| g_b) is an approximation in the same spirit.
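A sketch of the matched-bound formula for 1-D GMMs, using the closed-form Gaussian KL (names illustrative):

```python
import math

def gauss_kl(m1, s1, m2, s2):
    # D(N(m1, s1^2) || N(m2, s2^2)), closed form.
    return math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2.0 * s2 * s2) - 0.5

def kl_goldberger(fw, fm, fs, gw, gm, gs):
    # Match each component f_a to m(a) = argmin_b (D(f_a||g_b) - log w_b),
    # then sum pi_a (D(f_a||g_{m(a)}) + log(pi_a / w_{m(a)})).
    total = 0.0
    for pa, ma, sa in zip(fw, fm, fs):
        key, wb, mb, sb = min(
            (gauss_kl(ma, sa, mb, sb) - math.log(wb), wb, mb, sb)
            for wb, mb, sb in zip(gw, gm, gs))
        total += pa * (gauss_kl(ma, sa, mb, sb) + math.log(pa / wb))
    return total
```

For identical mixtures every component matches itself and the formula returns exactly zero.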
46-50 Variational Approximation
Let \phi_{b|a} \ge 0 with \sum_b \phi_{b|a} = 1 be free variational parameters. Then
\int f \log g = \sum_a \pi_a \int f_a \log \sum_b \phi_{b|a} \frac{\omega_b g_b}{\phi_{b|a}}
\ge \sum_a \pi_a \sum_b \phi_{b|a} \left( \log \frac{\omega_b}{\phi_{b|a}} + \int f_a \log g_b \right),
where Jensen's inequality interchanges the log and the sum over b. Maximize over \phi_{b|a}, and do the same for \int f \log f, to get
D(f \| g) \approx D_{var}(f \| g) = \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'} e^{-D(f_a \| f_{a'})}}{\sum_b \omega_b e^{-D(f_a \| g_b)}}.
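A 1-D sketch of D_var, built on the closed-form Gaussian KL (names illustrative):

```python
import math

def gauss_kl(m1, s1, m2, s2):
    return math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2.0 * s2 * s2) - 0.5

def kl_variational(fw, fm, fs, gw, gm, gs):
    # D_var(f||g) = sum_a pi_a log( sum_a' pi_a' e^{-D(f_a||f_a')}
    #                             / sum_b  w_b   e^{-D(f_a||g_b)} ).
    total = 0.0
    for pa, ma, sa in zip(fw, fm, fs):
        num = sum(p2 * math.exp(-gauss_kl(ma, sa, m2, s2))
                  for p2, m2, s2 in zip(fw, fm, fs))
        den = sum(wb * math.exp(-gauss_kl(ma, sa, mb, sb))
                  for wb, mb, sb in zip(gw, gm, gs))
        total += pa * math.log(num / den)
    return total
```

It is exactly zero for identical mixtures and reduces to the single-Gaussian KL when each mixture has one component.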
51-55 Variational Upper Bound
Variational parameters: \phi_{ab} \ge 0 with \sum_b \phi_{ab} = \pi_a, and \psi_{ab} \ge 0 with \sum_a \psi_{ab} = \omega_b. Gaussian replication:
f = \sum_a \pi_a f_a = \sum_{ab} \phi_{ab} f_a, \quad g = \sum_b \omega_b g_b = \sum_{ab} \psi_{ab} g_b.
Then
D(f \| g) = -\int f \log \frac{\sum_{ab} \psi_{ab} g_b}{f}
= -\int f \log \sum_{ab} \frac{\phi_{ab} f_a}{f} \cdot \frac{\psi_{ab} g_b}{\phi_{ab} f_a}
\le -\sum_{ab} \phi_{ab} \int f_a \log \frac{\psi_{ab} g_b}{\phi_{ab} f_a}
= D(\phi \| \psi) + \sum_{ab} \phi_{ab} D(f_a \| g_b),
where Jensen's inequality interchanges the log and the sum over (a, b), and D(\phi \| \psi) = \sum_{ab} \phi_{ab} \log(\phi_{ab}/\psi_{ab}). The chain rule for relative entropy for mixtures with unequal numbers of components!
56-57 Variational Upper Bound
Optimize the variational bound D(f \| g) \le D(\phi \| \psi) + \sum_{ab} \phi_{ab} D(f_a \| g_b) subject to the constraints \sum_b \phi_{ab} = \pi_a and \sum_a \psi_{ab} = \omega_b.
Fix \phi and find the optimal \psi; fix \psi and find the optimal \phi:
\psi_{ab} = \frac{\omega_b \phi_{ab}}{\sum_{a'} \phi_{a'b}}, \quad \phi_{ab} = \frac{\pi_a \psi_{ab} e^{-D(f_a \| g_b)}}{\sum_{b'} \psi_{ab'} e^{-D(f_a \| g_{b'})}}.
Iterate a few times to find the optimal solution!
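The two updates can be iterated directly. A 1-D sketch, starting from the feasible initialization phi_ab = pi_a * w_b (names illustrative):

```python
import math

def gauss_kl(m1, s1, m2, s2):
    return math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2.0 * s2 * s2) - 0.5

def kl_upper_bound(fw, fm, fs, gw, gm, gs, iters=20):
    A, B = len(fw), len(gw)
    D = [[gauss_kl(fm[a], fs[a], gm[b], gs[b]) for b in range(B)] for a in range(A)]
    phi = [[fw[a] * gw[b] for b in range(B)] for a in range(A)]  # rows sum to pi_a

    def psi_from_phi(phi):
        # psi_ab = w_b phi_ab / sum_a' phi_a'b  (columns sum to w_b).
        col = [sum(phi[a][b] for a in range(A)) for b in range(B)]
        return [[gw[b] * phi[a][b] / col[b] for b in range(B)] for a in range(A)]

    for _ in range(iters):
        psi = psi_from_phi(phi)
        for a in range(A):
            # phi_ab = pi_a psi_ab e^{-D_ab} / sum_b' psi_ab' e^{-D_ab'}.
            row = [psi[a][b] * math.exp(-D[a][b]) for b in range(B)]
            z = sum(row)
            phi[a] = [fw[a] * r / z for r in row]
    psi = psi_from_phi(phi)  # keep (phi, psi) a feasible pair for the bound
    return sum(phi[a][b] * (math.log(phi[a][b] / psi[a][b]) + D[a][b])
               for a in range(A) for b in range(B))
```

The bound holds for any feasible (phi, psi); the iteration merely tightens it. For single-component mixtures it is exact, and for identical mixtures it shrinks toward zero.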
58 Comparison of KL divergence methods
[Figure: histograms of the deviation of each method from Monte Carlo sampling with 1 million samples, D_MC(1M); left panel: zero, product, min, gaussian, variational; right panel: MC(2dn), unscented, goldberger, variational upper bound.]
59-65 Summary of KL divergence methods
- Monte Carlo Sampling: arbitrary accuracy, arbitrary cost
- Gaussian Approximation: not cheap, not good, closed-form
- Min Approximation: cheap, but not good
- Unscented Approximation: almost as good as MC at same cost
- Matched Bound Approximation: cheap and good, not differentiable
- Variational Approximation: cheap and good, closed-form
- Variational Upper Bound: cheap and better, strict bound, iterative
66-74 Comparison of KL, Bhattacharyya, Bayes
[Figures: scatter plots comparing the divergence measures; not preserved in the transcription.]
75-77 Bhattacharyya distance
The Bhattacharyya distance, B(f, g) = \int \sqrt{fg}, can be estimated using Monte Carlo sampling from an arbitrary distribution h:
B(f, g) \approx \hat{B}_h = \frac{1}{n} \sum_{i=1}^n \frac{\sqrt{f(x_i) g(x_i)}}{h(x_i)},
where {x_i}_{i=1}^n are sampled from h. The estimators are unbiased, E[\hat{B}_h] = B(f, g), with variance
var(\hat{B}_h) = \frac{1}{n} \left( \int \frac{fg}{h} - B(f, g)^2 \right).
h = f gives var(\hat{B}_f) = \frac{1 - B(f, g)^2}{n}.
h = \frac{f+g}{2} gives var(\hat{B}_{(f+g)/2}) = \frac{1}{n} \left( \int \frac{2fg}{f+g} - B(f, g)^2 \right), and var(\hat{B}_{(f+g)/2}) \le var(\hat{B}_f) (harmonic-arithmetic mean inequality).
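A 1-D sketch of the h = (f+g)/2 estimator for GMMs (names illustrative):

```python
import math
import random

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, w, mus, sigmas):
    return sum(wi * norm_pdf(x, m, s) for wi, m, s in zip(w, mus, sigmas))

def bhatt_mc_mixture(fw, fm, fs, gw, gm, gs, n=20000, seed=0):
    # B(f,g) ~ (1/n) sum_i sqrt(f(x_i) g(x_i)) / h(x_i) with x_i ~ h = (f+g)/2.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        w, m, s = (fw, fm, fs) if rng.random() < 0.5 else (gw, gm, gs)
        a = rng.choices(range(len(w)), weights=w)[0]
        x = rng.gauss(m[a], s[a])
        fx, gx = gmm_pdf(x, fw, fm, fs), gmm_pdf(x, gw, gm, gs)
        total += math.sqrt(fx * gx) / (0.5 * (fx + gx))
    return total / n
```

Since sqrt(fg) <= (f+g)/2 pointwise, every summand lies in (0, 1], and with f = g the estimate is exactly 1.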
78-80 Best sampling distribution
We can find the best sampling distribution h by minimizing the variance of \hat{B}_h subject to the constraints h \ge 0 and \int h = 1. The solution is
h^* = \frac{\sqrt{fg}}{\int \sqrt{fg}}, \quad var(\hat{B}_{h^*}) = 0.
Unfortunately, using this h requires:
- computing the quantity B(f, g) = \int \sqrt{fg} that we are trying to compute in the first place;
- sampling from \sqrt{fg}.
We will use variational techniques to approximate \sqrt{fg} with some unnormalized \hat{h} that can be analytically integrated to give a genuine pdf h.
81-85 Bhattacharyya Variational Upper Bound
As before, take \sum_b \phi_{ab} = \pi_a and \sum_a \psi_{ab} = \omega_b, so that f = \sum_a \pi_a f_a = \sum_{ab} \phi_{ab} f_a and g = \sum_b \omega_b g_b = \sum_{ab} \psi_{ab} g_b. Then
B(f, g) = \int f \sqrt{\frac{g}{f}}
= \int f \sqrt{\sum_{ab} \frac{\phi_{ab} f_a}{f} \cdot \frac{\psi_{ab} g_b}{\phi_{ab} f_a}}
\ge \sum_{ab} \phi_{ab} \int f_a \sqrt{\frac{\psi_{ab} g_b}{\phi_{ab} f_a}}
= \sum_{ab} \sqrt{\phi_{ab} \psi_{ab}} \, B(f_a, g_b),
where Jensen's inequality (concavity of the square root) interchanges the square root and the sum over (a, b). A lower bound on B(f, g), and hence an upper bound on the Bhattacharyya distance -\log B(f, g): an inequality linking the mixture Bhattacharyya distance to the component distances!
86-87 Bhattacharyya Variational Upper Bound
Optimize the variational bound B(f, g) \ge \sum_{ab} \sqrt{\phi_{ab} \psi_{ab}} B(f_a, g_b) subject to the constraints \sum_b \phi_{ab} = \pi_a and \sum_a \psi_{ab} = \omega_b.
Fix \phi and find the optimal \psi; fix \psi and find the optimal \phi:
\psi_{ab} = \frac{\omega_b \phi_{ab} B(f_a, g_b)^2}{\sum_{a'} \phi_{a'b} B(f_{a'}, g_b)^2}, \quad \phi_{ab} = \frac{\pi_a \psi_{ab} B(f_a, g_b)^2}{\sum_{b'} \psi_{ab'} B(f_a, g_{b'})^2}.
Iterate to find the optimal solution!
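A 1-D sketch of the coordinate updates, tightening the bound with the closed-form component Bhattacharyya values (names illustrative):

```python
import math

def gauss_bhatt(m1, s1, m2, s2):
    # B(f_a, g_b) = int sqrt(f_a g_b) dx, closed form for scalar Gaussians.
    v1, v2 = s1 * s1, s2 * s2
    return (math.sqrt(2.0 * s1 * s2 / (v1 + v2))
            * math.exp(-(m1 - m2) ** 2 / (4.0 * (v1 + v2))))

def bhatt_variational(fw, fm, fs, gw, gm, gs, iters=20):
    A, Bn = len(fw), len(gw)
    Bab = [[gauss_bhatt(fm[a], fs[a], gm[b], gs[b]) for b in range(Bn)]
           for a in range(A)]
    phi = [[fw[a] * gw[b] for b in range(Bn)] for a in range(A)]
    psi = [[fw[a] * gw[b] for b in range(Bn)] for a in range(A)]
    for _ in range(iters):
        for b in range(Bn):
            # psi_ab = w_b phi_ab B_ab^2 / sum_a' phi_a'b B_a'b^2.
            z = sum(phi[a][b] * Bab[a][b] ** 2 for a in range(A))
            for a in range(A):
                psi[a][b] = gw[b] * phi[a][b] * Bab[a][b] ** 2 / z
        for a in range(A):
            # phi_ab = pi_a psi_ab B_ab^2 / sum_b' psi_ab' B_ab'^2.
            z = sum(psi[a][b] * Bab[a][b] ** 2 for b in range(Bn))
            for b in range(Bn):
                phi[a][b] = fw[a] * psi[a][b] * Bab[a][b] ** 2 / z
    return sum(math.sqrt(phi[a][b] * psi[a][b]) * Bab[a][b]
               for a in range(A) for b in range(Bn))
```

By Cauchy-Schwarz the bound never exceeds 1; for single-component mixtures it is exact, and for identical mixtures the iteration drives it toward 1.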
88-90 Variational Monte Carlo Sampling
Write the variational estimate
V(f, g) = \sum_{ab} \sqrt{\phi_{ab} \psi_{ab}} \, B(f_a, g_b) = \sum_{ab} \sqrt{\phi_{ab} \psi_{ab}} \int \sqrt{f_a g_b} = \int \hat{h}.
Here \hat{h} = \sum_{ab} \sqrt{\phi_{ab} \psi_{ab}} \sqrt{f_a g_b} is an unnormalized approximation of the optimal sampling distribution \sqrt{fg} / \int \sqrt{fg}.
h = \hat{h} / \int \hat{h} is a GMM, since h_{ab} = \sqrt{f_a g_b} / \int \sqrt{f_a g_b} is a Gaussian and h = \sum_{ab} \pi_{ab} h_{ab}, where
\pi_{ab} = \frac{\sqrt{\phi_{ab} \psi_{ab}} \int \sqrt{f_a g_b}}{V(f, g)}.
Thus, drawing samples {x_i}_{i=1}^n from h, the estimate
\hat{V}_n = \frac{1}{n} \sum_{i=1}^n \frac{\sqrt{f(x_i) g(x_i)}}{h(x_i)}
is unbiased, and in experiments is seen to be far superior to sampling from (f + g)/2.
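A 1-D sketch of the variational sampler. For simplicity it uses the feasible initialization phi_ab = psi_ab = pi_a * w_b rather than the iterated optimum, so the weights pi_ab are only approximate; sqrt(f_a g_b) is proportional to a Gaussian, which is what makes h a GMM (names illustrative):

```python
import math
import random

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, w, mus, sigmas):
    return sum(wi * norm_pdf(x, m, s) for wi, m, s in zip(w, mus, sigmas))

def gauss_bhatt(m1, s1, m2, s2):
    # int sqrt(f_a g_b) dx, closed form for scalar Gaussians.
    v1, v2 = s1 * s1, s2 * s2
    return (math.sqrt(2.0 * s1 * s2 / (v1 + v2))
            * math.exp(-(m1 - m2) ** 2 / (4.0 * (v1 + v2))))

def sqrt_product_gaussian(m1, s1, m2, s2):
    # sqrt(f_a g_b) is proportional to a Gaussian whose precision is the
    # average of the two component precisions.
    prec = 0.5 * (1.0 / (s1 * s1) + 1.0 / (s2 * s2))
    var = 1.0 / prec
    mu = 0.5 * var * (m1 / (s1 * s1) + m2 / (s2 * s2))
    return mu, math.sqrt(var)

def bhatt_variational_mc(fw, fm, fs, gw, gm, gs, n=5000, seed=0):
    rng = random.Random(seed)
    comps, weights = [], []
    for a in range(len(fw)):
        for b in range(len(gw)):
            comps.append(sqrt_product_gaussian(fm[a], fs[a], gm[b], gs[b]))
            # sqrt(phi_ab psi_ab) B(f_a, g_b) with phi = psi = pi_a w_b.
            weights.append(fw[a] * gw[b] * gauss_bhatt(fm[a], fs[a], gm[b], gs[b]))
    z = sum(weights)
    probs = [w / z for w in weights]  # mixture weights pi_ab of h
    total = 0.0
    for _ in range(n):
        k = rng.choices(range(len(comps)), weights=probs)[0]
        mu, sig = comps[k]
        x = rng.gauss(mu, sig)
        hx = sum(p * norm_pdf(x, c[0], c[1]) for p, c in zip(probs, comps))
        total += math.sqrt(gmm_pdf(x, fw, fm, fs) * gmm_pdf(x, gw, gm, gs)) / hx
    return total / n
```

For single Gaussians h coincides with the optimal sqrt(fg)/B, so every summand equals B and the estimator has zero variance.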
91-106 Bhattacharyya Distance: Monte Carlo estimation
[Figure: probability densities of the deviation from the Bhattacharyya estimate with 1M samples, for each sampling scheme at sample sizes up to 100K.]
- Importance sampling from f(x): slow convergence.
- Importance sampling from (f(x)+g(x))/2: better convergence.
- Variational importance sampling: fast convergence.
107-110 Comparison of KL, Bhattacharyya, Bayes
[Figures: scatter plots comparing the divergence measures; not preserved in the transcription.]
111 Variational Monte Carlo Sampling: KL-Divergence Information Theory and Applications, 2/ p.29/30
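The figure for this slide did not survive transcription. As a point of reference, a minimal plain Monte Carlo estimate of the KL divergence between two Gaussian mixtures looks like this (mixture parameters are illustrative choices of my own; the talk's variational importance-sampling proposal is not shown):

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two small 1-D Gaussian mixtures (parameters are illustrative, not from the talk).
f_w, f_mu, f_sd = np.array([0.3, 0.7]), np.array([-1.0, 1.0]), np.array([0.5, 0.8])
g_w, g_mu, g_sd = np.array([0.6, 0.4]), np.array([-0.5, 1.5]), np.array([0.7, 0.6])

def gmm_pdf(x, w, mu, sd):
    # Weighted sum of component densities.
    return sum(wi * gauss_pdf(x, mi, si) for wi, mi, si in zip(w, mu, sd))

def gmm_sample(n, w, mu, sd):
    # Draw a component per sample, then draw from that Gaussian.
    comp = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[comp], sd[comp])

# Plain Monte Carlo baseline: KL(f||g) = E_f[log f(X) - log g(X)].
x = gmm_sample(200_000, f_w, f_mu, f_sd)
kl_mc = np.mean(np.log(gmm_pdf(x, f_w, f_mu, f_sd))
                - np.log(gmm_pdf(x, g_w, g_mu, g_sd)))
print(kl_mc)  # nonnegative up to Monte Carlo noise
```

The variational approach in the talk improves on this baseline by using a variationally optimized proposal rather than sampling directly from f.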
118 Future Directions HMM variational KL divergence HMM variational Bhattacharyya Variational Chernoff distance Variational sampling of Bayes error using Chernoff approximation Discriminative training using Bhattacharyya divergence Acoustic confusability using Bhattacharyya divergence Clustering of HMMs Information Theory and Applications, 2/ p.30/30