The Noisy Expectation-Maximization Algorithm for Multiplicative Noise Injection


Osonde Osoba, Bart Kosko

O. Osoba is with the RAND Corporation, Santa Monica, CA, USA (oosoba@rand.org). O. Osoba and B. Kosko are with the Signal and Image Processing Institute, Electrical Engineering Department, University of Southern California, Los Angeles, CA, USA (kosko@usc.edu).

We generalize the noisy expectation-maximization (NEM) algorithm to allow arbitrary noise-injection modes besides simply adding carefully chosen noise to the data. This includes the important special case of multiplicative noise injection. A generalized NEM theorem shows that all such injected noise speeds convergence of the EM algorithm on average. The multiplicative noise-benefit condition has a simple quadratic form for Gaussian and Cauchy mixture models. Simulations show a multiplicative-noise EM speed-up of more than 27% in a Gaussian mixture model. Injecting blind noise only slowed convergence. A related theorem gives a sufficient condition for an average EM noise benefit for arbitrary modes of noise injection if the data model comes from the general exponential family of probability density functions. A final theorem shows that injected noise slows EM convergence if the NEM inequalities reverse.

1. Noise Boosting the EM Algorithm

We show how carefully chosen and injected multiplicative noise can speed convergence of the popular expectation-maximization (EM) algorithm. The proof also allows arbitrary modes of combining signal and noise. The result still speeds EM convergence on average at each iteration.

The EM algorithm generalizes maximum-likelihood estimation to the case of missing or corrupted data [1, 2]. The algorithm iteratively climbs a hill of probability until it reaches the peak. The EM algorithm also generalizes many other popular algorithms. These include the k-means clustering algorithm [3] used in pattern recognition and big-data analysis, the backpropagation algorithm used to train deep feedforward and convolutional neural networks [4-6], and the Baum-Welch algorithm used to train hidden Markov models [7, 8]. But the EM algorithm can converge slowly if the amount of missing data is high or if the number of estimated parameters is large [2, 9]. It can also get stuck at local probability maxima. Users can run the EM algorithm from several starting points to mitigate the problem of convergence to local maxima.

The Noisy EM (NEM) algorithm [3] is a noise-enhanced version of the EM algorithm that carefully selects noise and adds it to the data. NEM converges faster on average than EM does because on average it takes larger steps up the same hill of probability.
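The EM recursion that the rest of the paper noise-boosts is short enough to state in code. The sketch below is a minimal noiseless EM loop for a two-component 1-D Gaussian mixture model. It is an illustration rather than the authors' implementation: the synthetic data, the initialization rule, and the stopping test are assumptions. The NEM variants discussed below change only the data that enters the E-step.

import numpy as np

rng = np.random.default_rng(0)

def em_gmm_1d(y, n_components=2, n_iters=100, tol=1e-6, seed=0):
    """Noiseless EM for a 1-D Gaussian mixture model.
    Returns the mixing weights, means, and standard deviations."""
    local_rng = np.random.default_rng(seed)
    alphas = np.full(n_components, 1.0 / n_components)
    mus = local_rng.choice(y, size=n_components, replace=False)   # crude initialization
    sigmas = np.full(n_components, np.std(y))
    for _ in range(n_iters):
        # E-step: responsibilities p_Z(j | y_i, Theta_k).
        comp = (np.exp(-(y[:, None] - mus)**2 / (2 * sigmas**2))
                / np.sqrt(2 * np.pi * sigmas**2))
        resp = alphas * comp
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        new_mus = (resp * y[:, None]).sum(axis=0) / nk
        new_sigmas = np.sqrt((resp * (y[:, None] - new_mus)**2).sum(axis=0) / nk)
        new_alphas = nk / len(y)
        converged = np.max(np.abs(new_mus - mus)) < tol
        alphas, mus, sigmas = new_alphas, new_mus, new_sigmas
        if converged:
            break
    return alphas, mus, sigmas

# Synthetic data from an assumed two-component mixture.
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 700)])
print(em_gmm_1d(y))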

The largest noise gains tend to occur in the first few steps. This is a type of nonlinear noise benefit or stochastic resonance that does not depend on a threshold [25]. NEM adds noise to the data if the noise satisfies the NEM positivity condition:

E_{Y,Z,N|θ_*}[ ln( f(y + N, Z|θ_k) / f(y, Z|θ_k) ) ] ≥ 0.   (1)

The NEM positivity condition (1) holds when the noise-perturbed likelihood f(y + N, z|θ_k) is larger on average than the noiseless likelihood f(y, z|θ_k) at the k-th step of the algorithm [10, 12].

A simple argument gives the intuition behind the NEM positivity condition for additive noise. This argument holds in much greater generality and underlies much of the theory of noise-boosting the EM algorithm. Consider a noise sample or realization n that makes a signal y more probable: f(y + n|θ) ≥ f(y|θ) in terms of probability density functions (pdfs) for some parameter θ. The value y is a realization of the signal random variable Y. The value n is a realization of the noise random variable N. Then this pdf inequality holds if and only if ln( f(y + n|θ) / f(y|θ) ) ≥ 0. Averaging over the signal and noise random variables gives the basic expectation form of the NEM positivity condition.

Particular choices of the signal probability f(y|θ) can greatly simplify the sufficient NEM condition. This signal conditional probability is the so-called data model in the EM context of maximum likelihood estimation. We show below that Gaussian and Cauchy choices lead to simple quadratic NEM conditions. An exponential data model leads to an even simpler linear condition.

This argument also shows that a noise-based pdf inequality is sufficient to satisfy the positivity condition of the NEM Theorem. It shows even more than this because the NEM positivity condition uses only an expectation. So the pdf inequality need hold only almost everywhere. It need not hold on sets of zero probability. This allows the user to ignore particular values when using continuous probability models.

The same argument for multiplicative noise suggests that a similar positivity condition should hold for a noise benefit. This will hold given the corresponding pdf inequality f(yn|θ) ≥ f(y|θ). There is nothing unique about the operations of addition or multiplication in this signal-noise context. So a noise benefit should hold for any method of combining signal and noise that obeys these pdf inequalities. A sequence of new theorems shows that this is the case. Theorem 1 below generalizes the NEM theorem from additive noise injection y + N to arbitrary noise injection φ(y, N). Theorem 2 gives the new NEM condition for the special case where the noise-injection mode is multiplicative: φ(y, N) = yN. We call this new condition the m-NEM condition or the multiplicative noisy-expectation-maximization condition.

Figure 1 shows an EM speed-up of 27.6% due to the multiplicative NEM condition. Figure 2 shows that ordinary or blind multiplicative noise (not subject to the m-NEM condition) only slowed EM convergence.
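The pointwise pdf inequality behind condition (1) is easy to check numerically for a given data model. The short sketch below is an illustration under assumed values, not code from the paper: it evaluates the log ratio ln f(y + n|θ) - ln f(y|θ) for a single Gaussian data model and reports which additive noise samples satisfy the pointwise NEM inequality.

import numpy as np

rng = np.random.default_rng(3)

def gaussian_log_pdf(y, mu, sigma):
    """Log of the N(mu, sigma^2) density at y."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

# Illustrative values (assumptions, not from the paper).
mu, sigma = 0.0, 1.0                      # data model N(mu, sigma^2)
y = 1.5                                   # observed sample
noise = rng.normal(0.0, 0.5, size=10)     # candidate additive noise samples

# Pointwise NEM check: n helps if ln f(y+n|theta) - ln f(y|theta) >= 0.
log_ratio = gaussian_log_pdf(y + noise, mu, sigma) - gaussian_log_pdf(y, mu, sigma)
for n, r in zip(noise, log_ratio):
    print(f"n = {n:+.3f}  log-ratio = {r:+.3f}  satisfies pointwise NEM: {r >= 0}")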

Figure 1: Noise benefit in a Gaussian-mixture-model NEM algorithm that used multiplicative noise injection. Low-intensity noise decreased the EM convergence time while higher-intensity starting noise increased it. The multiplicative noise had unit mean with different standard deviations. The optimal initial noise standard deviation was σ = 0.44 and gave a 27.6% speed-up over the noiseless EM algorithm. This NEM procedure injected noise that decayed at an inverse-square rate.

Blind noise was just noise drawn at random or uniformly from the set of all possible noise. It was not subject to the m-NEM condition or to any other condition.

The optimal speed-up using additive noise on the same data model was 30.5% at an optimal noise level of σ = 1.9. This speed-up was slightly better than the m-NEM speed-up. But a statistical test for the difference in mean optimal convergence times found that this difference was not statistically significant at the 0.05 significance level. The hypothesis test for the difference of means gave a large p-value. A sample 95% bootstrap confidence interval for the average difference in optimal convergence times included zero. So we cannot reject the null hypothesis that the two noise-injection modes have the same optimal average convergence time at the 0.05 level. An open and important research question is whether there are general conditions under which one noise-injection mode outperforms the other.

2. General Noise Injection Condition for a NEM Benefit

We next generalize the original proof for additive NEM [10, 12] to NEM that uses more general noise injection. The metrical idea behind the proof remains the same: a noise benefit occurs on average at an iteration if the noisy pdf is closer to the optimal pdf than the noiseless pdf is. Relative entropy measures the pdf quality or pseudo-distance compared with the optimal pdf:

D( f(y, z|θ_*) ‖ f_N(y, z|θ_k) ) ≤ D( f(y, z|θ_*) ‖ f(y, z|θ_k) )   (2)

where

f_N(y, z|θ_k) = f(φ(y, N), z|θ_k)   (3)

is the noise-injected pdf. The relative entropy has the form of an average logarithm [26]

D( h(u, v) ‖ g(u, v) ) = ∫∫_{U,V} ln( h(u, v) / g(u, v) ) h(u, v) du dv   (4)

for positive pdfs h and g over the same support. Convergent sums can replace the integrals in the discrete case. The key point is that the noise-injection mode φ(y, N) need not be addition φ(y, N) = y + N or multiplication φ(y, N) = yN. It can be any function φ of the data y and the noise N.

The above relative-entropy inequality is logically equivalent to the EM noise-benefit condition at iteration k cast in terms of expectations over the data and the noise [10]:

E[ Q(θ_*|θ_*) - Q(θ_k|θ_*) ] ≥ E[ Q(θ_*|θ_*) - Q_N(θ_k|θ_*) ]   (5)

where Q_N is the noise-perturbed surrogate likelihood function

Q_N(θ|θ_k) = E_{Z|y,θ_k}[ ln f_N(y, Z|θ) ].   (6)

Any noise N that satisfies this EM noise-benefit condition will on average give better parameter estimates at each iteration than will noiseless estimates or those that use blind noise. The relative-entropy version of the noise-benefit condition allows the same derivation of the generalized NEM condition as in the case of additive noise. The result is Theorem 1.

Theorem 1: The Generalized NEM Theorem
Let φ(Y, N) be an arbitrary mode of combining the signal Y and the noise N. Then the average EM noise benefit

Q(θ_*|θ_*) - Q(θ_k|θ_*) ≥ Q(θ_*|θ_*) - Q_N(θ_k|θ_*)   (7)

holds if

E_{Y,Z,N|θ_*}[ ln( f(φ(Y, N), Z|θ_k) / f(Y, Z|θ_k) ) ] ≥ 0.   (8)

Proof: We want to show that the noisy likelihood function f(φ(y, N), z|θ_k) is closer to the optimal likelihood f(y, z|θ_*) than is the noiseless likelihood function f(y, z|θ_k). We use relative entropy for this comparison. The relative entropy between the optimal likelihood and the noisy likelihood is

c_k(N) = D( f(y, z|θ_*) ‖ f(φ(y, N), z|θ_k) ).   (9)

The relative entropy between the optimal likelihood and the noiseless likelihood is

c_k = D( f(y, z|θ_*) ‖ f(y, z|θ_k) ).   (10)

We show first that these relative entropies equal the expectations of the Q-function differences in (5). Rewrite the Q-function as an integral over Z:

Q(θ|θ_k) = ∫_Z ln f(y, z|θ) f(z|y, θ_k) dz.   (11)

Then the term c_k = D( f(y, z|θ_*) ‖ f(y, z|θ_k) ) is an expectation over Y at the optimal parameter value θ_* because factoring the joint pdf f(y, z|θ_*) gives

c_k = ∫∫ [ ln f(y, z|θ_*) - ln f(y, z|θ_k) ] f(y, z|θ_*) dz dy   (12)
    = ∫∫ [ ln f(y, z|θ_*) - ln f(y, z|θ_k) ] f(z|y, θ_*) f(y|θ_*) dz dy   (13)
    = E_{Y|θ_*}[ Q(θ_*|θ_*) - Q(θ_k|θ_*) ].   (14)

The noise term c_k(N) is also an expectation over Y at the parameter value θ_*:

c_k(N) = ∫∫ [ ln f(y, z|θ_*) - ln f(φ(y, N), z|θ_k) ] f(y, z|θ_*) dz dy   (15)
       = ∫∫ [ ln f(y, z|θ_*) - ln f(φ(y, N), z|θ_k) ] f(z|y, θ_*) f(y|θ_*) dz dy   (16)
       = E_{Y|θ_*}[ Q(θ_*|θ_*) - Q_N(θ_k|θ_*) ].   (17)

Take the noise expectation of both terms c_k and c_k(N). The term c_k does not depend on N:

E_N[ c_k ] = c_k   (18)
E_N[ c_k(N) ] = E_{N,Y|θ_*}[ Q(θ_*|θ_*) - Q_N(θ_k|θ_*) ].   (19)

Then the pseudo-metrical inequality

c_k ≥ E_N[ c_k(N) ]   (20)

ensures an average noise benefit:

E_{N,Y|θ_*}[ Q(θ_*|θ_*) - Q(θ_k|θ_*) ] ≥ E_{N,Y|θ_*}[ Q(θ_*|θ_*) - Q_N(θ_k|θ_*) ].   (21)

We use the inequality condition (20) to derive a more useful sufficient condition for a noise benefit. Expand the difference of relative-entropy terms c_k - c_k(N):

c_k - c_k(N) = ∫∫_{Y,Z} [ ln( f(y, z|θ_*) / f(y, z|θ_k) ) - ln( f(y, z|θ_*) / f(φ(y, N), z|θ_k) ) ] f(y, z|θ_*) dy dz   (22)
             = ∫∫_{Y,Z} [ ln( f(y, z|θ_*) / f(y, z|θ_k) ) + ln( f(φ(y, N), z|θ_k) / f(y, z|θ_*) ) ] f(y, z|θ_*) dy dz   (23)
             = ∫∫_{Y,Z} ln( f(y, z|θ_*) f(φ(y, N), z|θ_k) / [ f(y, z|θ_k) f(y, z|θ_*) ] ) f(y, z|θ_*) dy dz   (24)
             = ∫∫_{Y,Z} ln( f(φ(y, N), z|θ_k) / f(y, z|θ_k) ) f(y, z|θ_*) dy dz.   (25)

Take the expectation with respect to the noise term N:

E_N[ c_k - c_k(N) ] = c_k - E_N[ c_k(N) ]   (26)
 = ∫_N ∫∫_{Y,Z} ln( f(φ(y, n), z|θ_k) / f(y, z|θ_k) ) f(y, z|θ_*) f(n|y) dy dz dn   (27)
 = ∫∫_{Y,Z} ∫_N ln( f(φ(y, n), z|θ_k) / f(y, z|θ_k) ) f(n|y) f(y, z|θ_*) dn dy dz   (28)
 = E_{Y,Z,N|θ_*}[ ln( f(φ(Y, N), Z|θ_k) / f(Y, Z|θ_k) ) ].   (29)

The assumption of finite differential entropy for Y and Z ensures that ln[f(y, z|θ)] f(y, z|θ_*) is integrable. Thus the integrand is integrable. So Fubini's theorem [27] permits the change in the order of integration in (28):

c_k ≥ E_N[ c_k(N) ]   iff   E_{Y,Z,N|θ_*}[ ln( f(φ(Y, N), Z|θ_k) / f(Y, Z|θ_k) ) ] ≥ 0.   (30)

Then an EM noise benefit occurs on average if

E_{Y,Z,N|θ_*}[ ln( f(φ(Y, N), Z|θ_k) / f(Y, Z|θ_k) ) ] ≥ 0.   (31)

This is the same derivation as in [10] with the generalized noisy likelihood f(φ(y, N), z|θ_k) in place of the additive noisy likelihood f(y + N, z|θ_k) in the original NEM proof. So the average EM noise benefit occurs if

c_k ≥ E_N[ c_k(N) ]   (32)

and thus if

E_{Y,Z,N|θ_*}[ ln( f(φ(Y, N), Z|θ_k) / f(Y, Z|θ_k) ) ] ≥ 0.   (33)
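The expectation in the generalized condition (8) can be estimated by plain Monte Carlo sampling when the complete-data likelihood has a known form. The sketch below is an illustration, not the authors' code: it estimates E_{Y,Z,N|θ_*}[ln f(φ(Y,N),Z|θ_k) - ln f(Y,Z|θ_k)] for a two-component 1-D Gaussian mixture under two injection modes φ. The mixture parameters, the blind noise pdf, and the injection modes are assumed values, so the estimates may well come out negative; the point is only how the expectation is computed.

import numpy as np

rng = np.random.default_rng(0)

def complete_log_pdf(y, z, alphas, mus, sigmas):
    """ln f(y, z | theta) for a 1-D Gaussian mixture: ln alpha_z + ln N(y; mu_z, sigma_z^2)."""
    mu, sigma = mus[z], sigmas[z]
    return (np.log(alphas[z])
            - 0.5 * np.log(2 * np.pi * sigma**2)
            - (y - mu)**2 / (2 * sigma**2))

# Assumed "true" parameters theta_* and current EM estimate theta_k.
alphas_star, mus_star, sigmas_star = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([1.0, 1.0])
alphas_k, mus_k, sigmas_k = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.5, 1.5])

def nem_expectation(phi, n_samples=100_000, noise_scale=0.3):
    """Monte Carlo estimate of E_{Y,Z,N|theta_*}[ln f(phi(Y,N),Z|theta_k) - ln f(Y,Z|theta_k)]."""
    z = rng.choice(2, size=n_samples, p=alphas_star)              # latent labels from theta_*
    y = rng.normal(mus_star[z], sigmas_star[z])                   # data from theta_*
    n = rng.normal(0.0, noise_scale, size=n_samples)              # assumed blind noise pdf
    log_ratio = (complete_log_pdf(phi(y, n), z, alphas_k, mus_k, sigmas_k)
                 - complete_log_pdf(y, z, alphas_k, mus_k, sigmas_k))
    return log_ratio.mean()

print("additive phi(y,n) = y + n      :", nem_expectation(lambda y, n: y + n))
print("multiplicative phi(y,n) = y(1+n):", nem_expectation(lambda y, n: y * (1.0 + n)))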

3. The Special Case of Multiplicative NEM

Theorem 1 allows a direct proof that properly chosen multiplicative noise can speed average EM convergence. Multiplicative noise [28] occurs in many applications in signal processing and communications. These include synthetic aperture radar [29-31], sonar imaging [32, 33], and photonics [34]. The proof requires only that we use the following conditional noise pdf for multiplicative NEM (m-NEM):

f_N(y, z|θ_k) = f(yN, z|θ_k).   (34)

Then Theorem 2 gives the following special case for multiplicative noise.

Theorem 2: Multiplicative NEM (m-NEM) Theorem
The average EM noise benefit

Q(θ_*|θ_*) - Q(θ_k|θ_*) ≥ Q(θ_*|θ_*) - Q_N(θ_k|θ_*)   (35)

holds if

E_{Y,Z,N|θ_*}[ ln( f(YN, Z|θ_k) / f(Y, Z|θ_k) ) ] ≥ 0.   (36)

Proof: The argument is the same as in Theorem 1 with f_N(y, z|θ_k) = f(yN, z|θ_k).

4. Gaussian and Cauchy Mixture Models

Many of the additive-NEM corollaries apply to the generalized NEM condition without change. Corollary 2 from [10] leads to an m-NEM condition for a Gaussian mixture model (GMM) because the noise condition applies to each mixed normal pdf in the mixture. An identical multiplicative noise-benefit condition holds for a Cauchy mixture model. We state and prove the m-NEM GMM result as a separate corollary. The resulting quadratic m-NEM condition depends only on the Gaussian means and not on their variances.

We first review mixture models and then state the closed form of a GMM. The EM algorithm offers a standard way to estimate the parameters of a mixture model. The parameters consist of the convex or probability mixing weights as well as the individual parameters of the mixed pdfs. A finite mixture model is a convex combination of a finite number of similar pdfs. So we can view a mixture as a convex combination of a finite set of similar sub-populations. The sub-population pdfs are similar in the sense that they all come from the same parametric family. Mixture models apply to a wide range of statistical problems in pattern recognition and machine intelligence.
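The two-stage sampling structure of a mixture model is easy to see in code: first draw the hidden sub-population index Z from the mixing weights and then draw Y from the selected sub-population pdf. The sketch below is an illustration for a 1-D Gaussian mixture with assumed weights, means, and variances; it is not code from the paper.

import numpy as np

rng = np.random.default_rng(1)

def sample_gmm(n, alphas, mus, sigmas):
    """Draw n pairs (y, z) from a 1-D Gaussian mixture: Z ~ alphas, Y|Z=j ~ N(mu_j, sigma_j^2)."""
    z = rng.choice(len(alphas), size=n, p=alphas)   # hidden sub-population labels
    y = rng.normal(mus[z], sigmas[z])               # observed samples
    return y, z

# Illustrative two-component mixture (assumed values).
alphas = np.array([0.3, 0.7])
mus = np.array([-2.0, 3.0])
sigmas = np.array([1.0, 0.5])

y, z = sample_gmm(5, alphas, mus, sigmas)
print(y)
print(z)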

A Gaussian mixture consists of convex-weighted normal pdfs. The EM algorithm estimates the mixture weights as well as the means and variances of each normal pdf. A Cauchy mixture consists likewise of convex-weighted Cauchy pdfs. The GMM is by far the most common mixture model in practice [38].

Let Y be the observed mixed random variable. Let K be the number of sub-populations. Let Z ∈ {1, ..., K} be the hidden sub-population index random variable. The convex population mixing proportions α_1, ..., α_K define a discrete pdf for Z: P(Z = j) = α_j. The pdf f(y|Z = j, θ_j) is the pdf of the j-th sub-population where θ_1, ..., θ_K are the pdf parameters for each sub-population. The sub-population parameter θ_j can represent the mean or variance of a normal pdf or both. It can represent any number of quantities that parametrize the pdf. Let Θ denote the vector of all model parameters: Θ = {α_1, ..., α_K, θ_1, ..., θ_K}. The joint pdf f(y, z|Θ) is

f(y, z|Θ) = Σ_{j=1}^{K} α_j f(y|j, θ_j) δ[z - j].   (37)

The marginal pdf for Y and the conditional pdf for Z given y are

f(y|Θ) = Σ_j α_j f(y|j, θ_j)   (38)

and

p_Z(j|y, Θ) = α_j f(y|Z = j, θ_j) / f(y|Θ)   (39)

by Bayes theorem. Rewrite the joint pdf as an exponential:

f(y, z|Θ) = exp[ Σ_j ( ln(α_j) + ln f(y|j, θ_j) ) δ[z - j] ].   (40)

Thus

ln f(y, z|Θ) = Σ_j δ[z - j] ln[ α_j f(y|j, θ_j) ].   (41)

EM algorithms for finite mixture models estimate Θ using the sub-population index Z as the latent variable. An EM algorithm uses (39) to derive the Q-function

Q(Θ|Θ_k) = E_{Z|y,Θ_k}[ ln f(y, Z|Θ) ]   (42)
          = Σ_z Σ_j δ[z - j] ln[ α_j f(y|j, θ_j) ] p_Z(z|y, Θ_k)   (43)
          = Σ_j ln[ α_j f(y|j, θ_j) ] p_Z(j|y, Θ_k).   (44)
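Equation (44) maps directly to code: compute the responsibilities p_Z(j|y, Θ_k) by Bayes theorem (39) and then form the responsibility-weighted complete-data log-likelihood, summed over the observed samples. The sketch below is an illustration with assumed data and parameter values, not the authors' code.

import numpy as np

def gmm_responsibilities(y, alphas, mus, sigmas):
    """p_Z(j | y, Theta_k) for each sample y and component j, per Bayes theorem (39)."""
    y = np.asarray(y)[:, None]                                    # shape (M, 1)
    comp_pdf = (np.exp(-(y - mus)**2 / (2 * sigmas**2))
                / np.sqrt(2 * np.pi * sigmas**2))                 # N(y; mu_j, sigma_j^2)
    weighted = alphas * comp_pdf                                   # alpha_j f(y|j, theta_j)
    return weighted / weighted.sum(axis=1, keepdims=True)          # shape (M, K)

def gmm_q_function(y, theta, theta_k):
    """Q(Theta | Theta_k) for a 1-D GMM, summed over the data, per (44)."""
    alphas, mus, sigmas = theta
    resp = gmm_responsibilities(y, *theta_k)                       # responsibilities at Theta_k
    y = np.asarray(y)[:, None]
    log_joint = (np.log(alphas)
                 - 0.5 * np.log(2 * np.pi * sigmas**2)
                 - (y - mus)**2 / (2 * sigmas**2))                 # ln[alpha_j f(y|j, theta_j)]
    return float(np.sum(resp * log_joint))

# Assumed data and parameters for illustration.
y = np.array([-2.1, -1.8, 2.9, 3.2, 0.1])
theta_k = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
theta   = (np.array([0.4, 0.6]), np.array([-2.0, 3.0]), np.array([1.0, 1.0]))
print("Q(theta | theta_k) =", gmm_q_function(y, theta, theta_k))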

Corollary: m-NEM Condition for Gaussian Mixture Models
Suppose that Y|Z=j ~ N(µ_j, σ_j²) so that f(y|j, θ) is a normal or Gaussian pdf. Then the pointwise pdf noise benefit for multiplicative noise

f(yn|θ) ≥ f(y|θ)   (45)

holds if and only if

y(n - 1) [ y(n + 1) - 2µ_j ] ≤ 0.   (46)

Proof: Compare the noisy normal pdf with the noiseless normal pdf. The normal pdf is

f(y|θ) = ( 1 / (σ_j √(2π)) ) exp( -(y - µ_j)² / (2σ_j²) ).   (47)

So f(yn|θ) ≥ f(y|θ)

iff exp( -(yn - µ_j)² / (2σ_j²) ) ≥ exp( -(y - µ_j)² / (2σ_j²) )   (48)
iff ( (yn - µ_j) / σ_j )² ≤ ( (y - µ_j) / σ_j )²   (49)
iff (yn - µ_j)² ≤ (y - µ_j)²   (50)
iff y²n² + µ_j² - 2µ_j yn ≤ y² + µ_j² - 2yµ_j   (51)
iff y²n² - 2µ_j yn ≤ y² - 2yµ_j   (52)
iff y²(n² - 1) - 2yµ_j(n - 1) ≤ 0   (53)
iff y(n - 1) [ y(n + 1) - 2µ_j ] ≤ 0.   (54)

This proves (46).

The same m-NEM noise-benefit condition (46) holds for a Cauchy mixture model. Suppose that Y|Z=j ~ C(m_j, d_j) and thus that f(y|j, θ) is a Cauchy pdf with median m_j and dispersion d_j. The median controls the location of the Cauchy bell curve. The dispersion controls its width. A Cauchy random variable has no mean. It has finite lower-order fractional moments. But its variance and all its higher-order moments are infinite. The Cauchy pdf f(y|j, θ) has the form

f(y|θ) = 1 / ( πd_j [ 1 + ( (y - m_j) / d_j )² ] ).   (55)

So the pdf inequality f(yn|θ) ≥ f(y|θ) is equivalent to the same quadratic inequality as in the above derivation of the Gaussian m-NEM condition with the median m_j in place of the mean µ_j. This gives (46) as the Cauchy m-NEM noise-benefit condition.
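Condition (46) is a pointwise test that a candidate multiplicative noise sample must pass for every component mean before injection. A small helper along these lines (an illustration with assumed means, not the authors' code) might look like this:

import numpy as np

def m_nem_ok(y, n, mus):
    """True if the multiplicative noise factor n satisfies the m-NEM condition (46)
    for the data sample y and every component mean mu_j:
    y*(n - 1) * (y*(n + 1) - 2*mu_j) <= 0 for all j."""
    mus = np.asarray(mus, dtype=float)
    return bool(np.all(y * (n - 1.0) * (y * (n + 1.0) - 2.0 * mus) <= 0.0))

# Illustrative check with assumed means.
mus = [-2.0, 3.0]
for y, n in [(1.0, 1.2), (1.0, 0.8), (-1.5, 1.1), (4.0, 0.9)]:
    print(f"y={y:+.1f}, n={n:.2f} -> satisfies m-NEM condition: {m_nem_ok(y, n, mus)}")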

5. The Multiplicative Noisy Expectation-Maximization Algorithm

The m-NEM Theorem and its corollaries give a general method for noise-boosting the EM algorithm. Theorem 1 implies that on average these NEM variants outperform the noiseless EM algorithm. Algorithm 1 gives the multiplicative NEM algorithm schema. The operation mNEMNoiseSample(y, k^(-τ) σ_N) generates noise samples that satisfy the NEM condition for the current data model. The noise sampling pdf depends on the vector of random samples y in the Gaussian and Cauchy mixture models. The noise can have any value in the NEM algorithm for censored gamma models.

Algorithm 1: θ̂_mNEM = m-NEM-Estimate(y)
Require: y = (y_1, ..., y_M): vector of observed incomplete data
Ensure: θ̂_mNEM: m-NEM estimate of the parameter θ
1: while ||θ_k - θ_{k-1}|| ≥ 10^(-tol) do
2:   N_S-Step: n ← mNEMNoiseSample(y, k^(-τ) σ_N)
3:   N_M-Step: y† ← y n
4:   E-Step: Q(θ|θ_k) ← E_{Z|y,θ_k}[ ln f(y†, Z|θ) ]
5:   M-Step: θ_{k+1} ← argmax_θ { Q(θ|θ_k) }
6:   k ← k + 1
7: end while
8: θ̂_mNEM ← θ_k

The E-Step takes a conditional expectation of a function of the noisy data samples y† given the noiseless data samples y. A deterministic decay factor k^(-τ) scales the noise on the k-th iteration. τ is the noise decay rate [10]. The decay factor k^(-τ) reduces the noise at each iteration. So the decay factor drives the noise N_k to zero as the iteration step k increases. The simulations used τ = 2. The decay factor reduces the NEM estimator's jitter around its final value. This is important because the EM algorithm converges to fixed points. So excessive estimator jitter prolongs convergence time even when the jitter occurs near the final solution. The simulations in this paper used polynomial decay factors instead of the logarithmic cooling schedules found in annealing applications [39-42].

The NEM noise generating procedure mNEMNoiseSample(y, k^(-τ) σ_N) returns a NEM-compliant noise sample n at a given noise level σ_N for each data sample y. This procedure changes with the EM data model. The following noise generating procedure applies to GMMs because of the m-NEM GMM corollary in Section 4. We used the following 1-D noise generating procedure for the GMM simulations:

mNEMNoiseSample for GMM m-NEM
Require: y and σ_N: current data sample and noise level
Ensure: n: noise sample satisfying the m-NEM condition
  N(y) ← { n ∈ R : y(n - 1) [ y(n + 1) - 2µ_j ] ≤ 0 for all j }
  n ← a sample from the normal distribution N(1, σ_N²) truncated to the set N(y)

Figure 1 displays a noise benefit for an m-NEM algorithm on a GMM. The injected noise is subject to the Gaussian m-NEM condition in (46).
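A direct way to implement mNEMNoiseSample for the GMM is rejection sampling: draw candidate factors from N(1, σ_N k^(-τ)) and keep the first one that satisfies (46) for every component mean. The sketch below is one possible implementation under those assumptions; it is not the authors' code, and the fallback rule (return n = 1, i.e. no noise, if no candidate passes) and the candidate budget are illustrative choices.

import numpy as np

rng = np.random.default_rng(2)

def m_nem_noise_sample(y, sigma, mus, max_tries=100):
    """Return a multiplicative noise factor n ~ N(1, sigma^2) restricted to the
    m-NEM set N(y) = {n : y(n-1)(y(n+1) - 2*mu_j) <= 0 for all j}, via rejection
    sampling. Falls back to n = 1 (no noise) if no candidate passes."""
    for _ in range(max_tries):
        n = rng.normal(1.0, sigma)
        if np.all(y * (n - 1.0) * (y * (n + 1.0) - 2.0 * mus) <= 0.0):
            return n
    return 1.0

def noisy_data(y, k, sigma0, tau, mus):
    """N_S- and N_M-steps of Algorithm 1: sample NEM noise at the decayed level
    sigma0 * k**(-tau) and multiply it into the data."""
    sigma_k = sigma0 * k ** (-tau)
    n = np.array([m_nem_noise_sample(yi, sigma_k, mus) for yi in y])
    return y * n

# Illustrative use with assumed values.
mus = np.array([-2.0, 3.0])
y = rng.normal(0.0, 2.0, size=8)
print("noisy data at iteration k=1:", noisy_data(y, k=1, sigma0=0.44, tau=2.0, mus=mus))

In a full m-NEM run this noisy vector y† would feed the E-step of the ordinary GMM EM update at each iteration while the M-step and the convergence test stay unchanged.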

Figure 2: Blind multiplicative noise did not improve the convergence times of GMM EM algorithms. Such blind noise only increased the time to convergence. This increase in convergence time occurred even when (as in this case) the multiplicative noise used a cooling schedule.

The next section develops the NEM theory for the important case of exponential-family pdfs. The corresponding theorem states the generalized NEM condition and its special forms for additive and multiplicative noise injection. The theorem also applies to mixtures of exponential-family pdfs because of Corollary 2 from [10].

6. NEM Noise Benefits for Exponential Family Probabilities

Exponential-family pdfs include such popular densities as the normal, exponential, gamma, and Poisson [44]. A member of this family has a pdf f(x|θ) of the exponential form

f(x|θ) = exp[ a(θ)K(x) + b(x) + c(θ) ]   (56)

so long as the density's support does not depend on the parameter θ. This latter condition bars the uniform pdf from the exponential family. The family also excludes Cauchy and Student-t pdfs. The normal pdf f(y|θ) = (1/(σ_j √(2π))) exp( -(y - µ_j)²/(2σ_j²) ) belongs to the exponential family because we can write it in exponential form with a(θ) = -1/(2σ_j²), K(y) = y², and c(θ) = -µ_j²/(2σ_j²) - ln(σ_j √(2π)).

Theorem: Generalized NEM Condition for Exponential Family Probability Density Functions
Suppose X has an exponential-family pdf:

f(x|θ) = exp[ a(θ)K(x) + b(x) + c(θ) ].   (57)

Then an EM noise benefit occurs for an arbitrary noise-combination mode φ if

a(θ)[ K(φ(x, n)) - K(x) ] + b(φ(x, n)) - b(x) ≥ 0.   (58)

Proof: Compare the noisy pdf f(φ(x, n)|θ) with the noiseless pdf f(x|θ). The noise benefit occurs if

ln f(φ(x, n)|θ) ≥ ln f(x|θ)   (59)

since the logarithm is a monotone increasing function. This inequality holds

iff a(θ)K(φ(x, n)) + b(φ(x, n)) + c(θ) ≥ a(θ)K(x) + b(x) + c(θ)   (60)
iff a(θ)K(φ(x, n)) + b(φ(x, n)) ≥ a(θ)K(x) + b(x)   (61)
iff a(θ)[ K(φ(x, n)) - K(x) ] + b(φ(x, n)) - b(x) ≥ 0.   (62)

The final inequality proves (58).

This noise-benefit condition reduces to

a(θ)[ K(x + n) - K(x) ] + b(x + n) - b(x) ≥ 0   (63)

in the additive-noise case when φ(x, n) = x + n. It reduces to

a(θ)[ K(xn) - K(x) ] + b(xn) - b(x) ≥ 0   (64)

in the multiplicative-noise case when φ(x, n) = xn. Note that the c(θ) term does not appear in the NEM conditions.

Consider the exponential signal pdf f(x|θ) = (1/θ) e^(-x/θ). It has the form of an exponential-family pdf with a(θ) = -1/θ, K(x) = x, b(x) = 0, and c(θ) = -ln θ. So the condition for an additive NEM noise benefit becomes

-(1/θ)[ (x + n) - x ] ≥ 0.   (65)

This gives a simple negative-noise condition for an additive NEM benefit:

n ≤ 0.   (66)

The condition for a multiplicative NEM benefit likewise becomes

-(1/θ)( xn - x ) ≥ 0.   (67)

It gives a similar NEM condition since x ≥ 0 for the exponential pdf:

n ≤ 1.   (68)
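The exponential-family condition (58) is also easy to check numerically for a given parametrization. The sketch below is an illustration with assumed parameter values, not the authors' code: it encodes the exponential data model with a(θ) = -1/θ, K(x) = x, b(x) = 0 and confirms the additive condition n ≤ 0 and the multiplicative condition n ≤ 1 on a few samples.

def exp_family_nem_gain(x, x_noisy, theta):
    """Left-hand side of condition (58) for the exponential pdf f(x|theta) = (1/theta) exp(-x/theta):
    a(theta) * (K(x_noisy) - K(x)) + b(x_noisy) - b(x) with a = -1/theta, K(x) = x, b(x) = 0."""
    a = -1.0 / theta
    return a * (x_noisy - x)

theta = 2.0                       # assumed exponential parameter
x = 1.5                           # assumed data sample (x >= 0)

for n in [-0.4, 0.3]:             # additive noise: benefit requires n <= 0
    gain = exp_family_nem_gain(x, x + n, theta)
    print(f"additive n={n:+.1f}: condition value {gain:+.3f} (benefit: {gain >= 0})")

for n in [0.7, 1.4]:              # multiplicative noise: benefit requires n <= 1
    gain = exp_family_nem_gain(x, x * n, theta)
    print(f"multiplicative n={n:.1f}: condition value {gain:+.3f} (benefit: {gain >= 0})")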

We conclude with a dual theorem that guarantees that noise will harm the EM algorithm by slowing its convergence on average. Both the theorem statement and its proof simply reverse all pertinent inequalities in Theorem 1.

Theorem: The Generalized EM Noise Harm Theorem
Let φ(Y, N) be an arbitrary mode of combining the signal Y and the noise N. Then the EM estimation-iteration noise harm

Q(θ_*|θ_*) - Q(θ_k|θ_*) ≤ Q(θ_*|θ_*) - Q_N(θ_k|θ_*)   (69)

occurs on average if

E_{Y,Z,N|θ_*}[ ln( f(φ(Y, N), Z|θ_k) / f(Y, Z|θ_k) ) ] ≤ 0.   (70)

Proof: The proof follows the same argument as in Theorem 1 with all the inequalities reversed.

This general noise-harm result leads to corollary noise-harm conditions for the additive and multiplicative GMM-NEM models by reversing all pertinent inequalities. A similar reversal gives a noise-harm condition for pdfs from the exponential family. Such harmful GMM noise increased EM convergence time by 35% in the multiplicative-noise case and by 40% in the additive-noise case. No noise benefit or harm occurs on average if equality replaces all pertinent inequalities.

Corollary: Noise-Harm Conditions for GMM-NEM
The noise-harm condition in (70) holds for the GMM-NEM algorithm if

n² ≥ 2n( µ_j - y )   (71)

for additive noise and if

y(n - 1) [ y(n + 1) - 2µ_j ] ≥ 0   (72)

for multiplicative noise.

Proof: The proof follows the derivations for the additive and multiplicative GMM-NEM conditions with all inequalities reversed.

7. Conclusion

The NEM additive noise model extends to a multiplicative noise model and indeed to arbitrary models that combine noise and signal. The multiplicative NEM theorem gives a sufficient positivity condition under which multiplicative noise improves the average speed of the EM algorithm. The multiplicative-noise NEM conditions for the GMM and exponential-family models are only slightly more complex than their respective additive-noise NEM conditions. An open research question is whether there are general conditions under which either multiplicative or additive noise outperforms the other. Another open question is whether data sparsity affects the m-NEM speed-up (as it does in the additive case [10]) and how it affects more general modes of NEM-based noise injection.

Bibliography

[1] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion), Journal of the Royal Statistical Society, Series B 39 (1977) 1-38.

[2] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007).
[3] O. Osoba and B. Kosko, Noise-Enhanced Clustering and Competitive Learning Algorithms, Neural Networks 37 (2013).
[4] K. Audhkhasi, O. Osoba and B. Kosko, Noise benefits in backpropagation and deep bidirectional pre-training, in The 2013 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2013).
[5] K. Audhkhasi, O. Osoba and B. Kosko, Noise benefits in convolutional neural networks, in Proceedings of the 2014 International Conference on Advances in Big Data Analytics (2014).
[6] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521 (2015).
[7] K. Audhkhasi, O. Osoba and B. Kosko, Noisy hidden Markov models for speech recognition, in International Joint Conference on Neural Networks (IJCNN) (IEEE, 2013).
[8] L. R. Welch, Hidden Markov models and the Baum-Welch algorithm, IEEE Information Theory Society Newsletter 53 (2003).
[9] M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer Series in Statistics (Springer, 1996).
[10] O. Osoba, S. Mitaim and B. Kosko, The Noisy Expectation-Maximization Algorithm, Fluctuation and Noise Letters 12 (2013).
[11] O. Osoba, S. Mitaim and B. Kosko, Noise Benefits in the Expectation-Maximization Algorithm: NEM Theorems and Models, in The International Joint Conference on Neural Networks (IJCNN) (IEEE, 2011).
[12] O. A. Osoba, Noise benefits in expectation-maximization algorithms, Ph.D. thesis, University of Southern California.
[13] K. Wiesenfeld, F. Moss et al., Stochastic resonance and the benefits of noise: from ice ages to crayfish and SQUIDs, Nature 373 (1995).
[14] A. R. Bulsara and L. Gammaitoni, Tuning in to noise, Physics Today 49 (1996).
[15] L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, Stochastic resonance, Reviews of Modern Physics 70 (1998).
[16] S. Mitaim and B. Kosko, Adaptive Stochastic Resonance, Proceedings of the IEEE: Special Issue on Intelligent Signal Processing 86 (1998).
[17] F. Chapeau-Blondeau and D. Rousseau, Noise-Enhanced Performance for an Optimal Bayesian Estimator, IEEE Transactions on Signal Processing 52 (2004).
[18] I. Lee, X. Liu, C. Zhou and B. Kosko, Noise-enhanced detection of subthreshold signals with carbon nanotubes, IEEE Transactions on Nanotechnology 5 (2006).
[19] B. Kosko, Noise (Penguin, 2006).
[20] M. McDonnell, N. Stocks, C. Pearce and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization (Cambridge University Press, 2008).
[21] A. Patel and B. Kosko, Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Estimation, IEEE Signal Processing Letters 17 (2010).
[22] A. Patel and B. Kosko, Noise Benefits in Quantizer-Array Correlation Detection and Watermark Decoding, IEEE Transactions on Signal Processing 59 (2011).
[23] H. Chen, L. R. Varshney and P. K. Varshney, Noise-enhanced information systems, Proceedings of the IEEE 102 (2014).
[24] S. Mitaim and B. Kosko, Noise-benefit forbidden-interval theorems for threshold signal detectors based on cross correlations, Physical Review E 90 (2014).

[25] B. Franzke and B. Kosko, Noise can speed convergence in Markov chains, Physical Review E 84 (2011).
[26] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley & Sons, New York, 1991).
[27] G. B. Folland, Real Analysis: Modern Techniques and Their Applications (Wiley-Interscience, 1999), 2nd edition.
[28] L. Rudin, P.-L. Lions and S. Osher, Multiplicative denoising and deblurring: Theory and algorithms, in Geometric Level Set Methods in Imaging, Vision, and Graphics (Springer, 2003).
[29] J. Ash, E. Ertin, L. Potter and E. Zelnio, Wide-angle synthetic aperture radar imaging: Models and algorithms for anisotropic scattering, IEEE Signal Processing Magazine 31 (2014).
[30] S. Chen, Y. Li, X. Wang, S. Xiao and M. Sato, Modeling and interpretation of scattering mechanisms in polarimetric synthetic aperture radar: Advances and perspectives, IEEE Signal Processing Magazine 31 (2014).
[31] M. Tur, K. C. Chin and J. W. Goodman, When is speckle noise multiplicative?, Applied Optics 21 (1982).
[32] J. M. Bioucas-Dias and M. A. Figueiredo, Multiplicative noise removal using variable splitting and constrained optimization, IEEE Transactions on Image Processing 19 (2010).
[33] J. Ringelstein, A. B. Gershman and J. F. Böhme, Direction finding in random inhomogeneous media in the presence of multiplicative noise, IEEE Signal Processing Letters 7 (2000).
[34] T. Yilmaz, C. M. Depriest, A. Braun, J. H. Abeles and P. J. Delfyett, Noise in fundamental and harmonic modelocked semiconductor lasers: experiments and simulations, IEEE Journal of Quantum Electronics 39 (2003).
[35] R. A. Redner and H. F. Walker, Mixture Densities, Maximum Likelihood and the EM Algorithm, SIAM Review 26 (1984).
[36] G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics: Applied Probability and Statistics (Wiley, 2004).
[37] R. V. Hogg and E. A. Tanis, Probability and Statistical Inference (Prentice Hall, 2006), 7th edition.
[38] N. A. Gershenfeld, The Nature of Mathematical Modeling (Cambridge University Press, 1999).
[39] S. Kirkpatrick, C. Gelatt Jr and M. Vecchi, Optimization by simulated annealing, Science 220 (1983).
[40] V. Černý, Thermodynamical approach to the Traveling Salesman Problem: An efficient simulation algorithm, Journal of Optimization Theory and Applications 45 (1985).
[41] S. Geman and C. Hwang, Diffusions for global optimization, SIAM Journal on Control and Optimization 24 (1986).
[42] B. Hajek, Cooling schedules for optimal annealing, Mathematics of Operations Research 13 (1988).
[43] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence (Prentice Hall, 1991).
[44] R. V. Hogg, J. McKean and A. T. Craig, Introduction to Mathematical Statistics (Pearson, 2013).

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

Shankar Shivappa University of California, San Diego April 26, CSE 254 Seminar in learning algorithms

Shankar Shivappa University of California, San Diego April 26, CSE 254 Seminar in learning algorithms Recognition of Visual Speech Elements Using Adaptively Boosted Hidden Markov Models. Say Wei Foo, Yong Lian, Liang Dong. IEEE Transactions on Circuits and Systems for Video Technology, May 2004. Shankar

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE FILTER

EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE FILTER EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE FILTER Zhen Zhen 1, Jun Young Lee 2, and Abdus Saboor 3 1 Mingde College, Guizhou University, China zhenz2000@21cn.com 2 Department

More information

HMM part 1. Dr Philip Jackson

HMM part 1. Dr Philip Jackson Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. HMM part 1 Dr Philip Jackson Probability fundamentals Markov models State topology diagrams Hidden Markov models -

More information

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2 STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

More information

Appendices: Stochastic Backpropagation and Approximate Inference in Deep Generative Models

Appendices: Stochastic Backpropagation and Approximate Inference in Deep Generative Models Appendices: Stochastic Backpropagation and Approximate Inference in Deep Generative Models Danilo Jimenez Rezende Shakir Mohamed Daan Wierstra Google DeepMind, London, United Kingdom DANILOR@GOOGLE.COM

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

A Note on the Expectation-Maximization (EM) Algorithm

A Note on the Expectation-Maximization (EM) Algorithm A Note on the Expectation-Maximization (EM) Algorithm ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign March 11, 2007 1 Introduction The Expectation-Maximization

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

EM Algorithm. Expectation-maximization (EM) algorithm.

EM Algorithm. Expectation-maximization (EM) algorithm. EM Algorithm Outline: Expectation-maximization (EM) algorithm. Examples. Reading: A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc.,

More information