The Noisy Expectation-Maximization Algorithm for Multiplicative Noise Injection

Osonde Osoba, Bart Kosko

O. Osoba is with the RAND Corporation, Santa Monica, CA 90401-3208 USA (email: oosoba@rand.org). O. Osoba and B. Kosko are with the Signal and Image Processing Institute, Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089-2564 USA (email: kosko@usc.edu).

Abstract: We generalize the noisy expectation-maximization (NEM) algorithm to allow arbitrary noise-injection modes besides simply adding carefully chosen noise to the data. This includes the important special case of multiplicative noise injection. A generalized NEM theorem shows that all such injected noise speeds convergence of the EM algorithm on average if the noise satisfies a generalized NEM positivity condition. The multiplicative noise-benefit condition has a simple quadratic form for Gaussian and Cauchy mixture models. Simulations show a multiplicative-noise EM speed-up of more than 27% in a Gaussian mixture model. Injecting blind noise only slowed convergence. A related theorem gives a sufficient condition for an average EM noise benefit for arbitrary modes of noise injection if the data model comes from the general exponential family of probability density functions. A final theorem shows that injected noise slows EM convergence if the NEM inequalities reverse.

1. Noise Boosting the EM Algorithm

We show how carefully chosen and injected multiplicative noise can speed convergence of the popular expectation-maximization (EM) algorithm. The proof also allows arbitrary modes of combining signal and noise. The result still speeds EM convergence on average at each iteration.

The EM algorithm generalizes maximum-likelihood estimation to the case of missing or corrupted data [1, 2]. The algorithm iteratively climbs a hill of probability until it reaches the peak. The EM algorithm also generalizes many other popular algorithms. These include the k-means clustering algorithm [3] used in pattern recognition and big-data analysis, the backpropagation algorithm used to train deep feedforward and convolutional neural networks [4-6], and the Baum-Welch algorithm used to train hidden Markov models [7, 8]. But the EM algorithm can converge slowly if the amount of missing data is high or if the number of estimated parameters is large [2, 9]. It can also get stuck at local probability maxima. Users can run the EM algorithm from several starting points to mitigate the problem of convergence to local maxima.

The Noisy EM (NEM) algorithm [3, 10-12] is a noise-enhanced version of the EM algorithm that carefully selects noise and adds it to the data. NEM converges faster on average than EM does because on average it takes larger steps up the same hill of probability.

The largest noise gains tend to occur in the first few steps. This is a type of nonlinear noise benefit or stochastic resonance [13-24] that does not depend on a threshold [25].

NEM adds noise to the data if the noise satisfies the NEM positivity condition:

E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (1)

Here θ_k is the parameter estimate at the k-th iteration and θ* denotes the maximum-likelihood parameter that the EM iterations converge to. The NEM positivity condition (1) holds when the noise-perturbed likelihood f(y + N, z | θ_k) is larger on average than the noiseless likelihood f(y, z | θ_k) at the k-th step of the algorithm [10, 12].

A simple argument gives the intuition behind the NEM positivity condition for additive noise. This argument holds in much greater generality and underlies much of the theory of noise-boosting the EM algorithm. Consider a noise sample or realization n that makes a signal y more probable: f(y + n | θ) ≥ f(y | θ) in terms of probability density functions (pdfs) for some parameter θ. The value y is a realization of the signal random variable Y. The value n is a realization of the noise random variable N. Then this pdf inequality holds if and only if ln( f(y + n | θ) / f(y | θ) ) ≥ 0. Averaging over the signal and noise random variables gives the basic expectation form of the NEM positivity condition.

Particular choices of the signal probability f(y | θ) can greatly simplify the sufficient NEM condition. This signal conditional probability is the so-called data model in the EM context of maximum-likelihood estimation. We show below that Gaussian and Cauchy choices lead to simple quadratic NEM conditions. An exponential data model leads to an even simpler linear condition.

This argument also shows that a noise-based pdf inequality is sufficient to satisfy the positivity condition of the NEM Theorem. It shows even more than this because the NEM positivity condition uses only an expectation. So the pdf inequality need hold only almost everywhere. It need not hold on sets of zero probability. This allows the user to ignore particular values when using continuous probability models.

The same argument for multiplicative noise suggests that a similar positivity condition should hold for a noise benefit. This will hold given the corresponding pdf inequality f(yn | θ) ≥ f(y | θ). There is nothing unique about the operations of addition or multiplication in this signal-noise context. So a noise benefit should hold for any method of combining signal and noise that obeys these pdf inequalities. A sequence of new theorems shows that this is the case. Theorem 1 below generalizes the NEM theorem from additive noise injection y + N to arbitrary noise injection φ(y, N). Theorem 2 gives the new NEM condition for the special case where the noise-injection mode is multiplicative: φ(y, N) = yN. We call this new condition the m-NEM condition or the multiplicative-noisy-expectation-maximization condition.

Figure 1 shows an EM speed-up of 27.6% due to the multiplicative NEM condition. Figure 2 shows that ordinary or blind multiplicative noise (not subject to the m-NEM condition) only slowed EM convergence. Blind noise was just noise drawn at random or uniformly from the set of all possible noise. It was not subject to the m-NEM condition or to any other condition.
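The screening idea behind condition (1) is easy to see numerically. The following minimal sketch (our own illustration with made-up mixture parameters, not the paper's simulation code) estimates the average complete-data log-likelihood ratio for multiplicative noise on a two-component Gaussian mixture: noise screened per sample so that it raises every component pdf gives a nonnegative average, while blind unit-mean noise does not.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two-component Gaussian mixture that stands in for the model at step k (illustrative values).
alphas = np.array([0.5, 0.5])
mus    = np.array([-2.0, 2.0])
sigmas = np.array([1.0, 1.0])

# Sample complete data (Y, Z) from the same mixture.
M = 20_000
z = rng.choice(2, size=M, p=alphas)
y = rng.normal(mus[z], sigmas[z])

def log_ratio(y, z, n):
    # ln f(y n, z | θ_k) - ln f(y, z | θ_k); the mixing weight α_z cancels in the ratio.
    return norm.logpdf(y * n, mus[z], sigmas[z]) - norm.logpdf(y, mus[z], sigmas[z])

def screen(y, n):
    # Keep a noise sample only if it raises every component pdf; otherwise use n = 1 (no noise).
    ok = np.ones_like(y, dtype=bool)
    for mu, sig in zip(mus, sigmas):
        ok &= norm.logpdf(y * n, mu, sig) >= norm.logpdf(y, mu, sig)
    return np.where(ok, n, 1.0)

n_blind = rng.normal(1.0, 0.4, size=M)               # blind unit-mean multiplicative noise
n_nem   = screen(y, rng.normal(1.0, 0.4, size=M))    # screened (NEM-compliant) noise

print("blind   :", log_ratio(y, z, n_blind).mean())  # typically negative
print("screened:", log_ratio(y, z, n_nem).mean())    # nonnegative by construction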

Figure 1: Noise benefit in a Gaussian-mixture-model NEM algorithm that used multiplicative noise injection. Low-intensity noise decreased the EM convergence time while higher-intensity starting noise increased it. The multiplicative noise had unit mean with different standard deviations. The optimal initial noise standard deviation was σ = 0.44 and gave a 27.6% speed-up over the noiseless EM algorithm. This NEM procedure injected noise that decayed at an inverse-square rate.

The optimal speed-up using additive noise on the same data model was 30.5% at an optimal noise power of σ = 1.9. This speed-up was slightly better than the m-NEM speed-up. But a statistical test for the difference in mean optimal convergence times found that this difference was not statistically significant at the 0.05 significance level. The hypothesis test for the difference of means gave a large p-value of 0.136. A sample 95% bootstrap confidence interval for the average difference in optimal convergence time was [-0.45, 0.067]. So we cannot reject the null hypothesis that the difference in optimal average convergence times for the two noise-injection modes was statistically insignificant at the 0.05 level. An open and important research question is whether there are general conditions under which one noise-injection mode outperforms the other.
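The bootstrap comparison above is simple to reproduce in outline. The sketch below uses made-up convergence-time samples as stand-ins for the paper's simulation output (which we do not have) and computes a 95% percentile-bootstrap confidence interval for the difference in mean convergence times.

import numpy as np

rng = np.random.default_rng(5)

# Hypothetical convergence-time samples (in iterations) for the two injection modes.
t_mult = rng.normal(36.0, 4.0, size=100)
t_add  = rng.normal(35.0, 4.0, size=100)

# Percentile bootstrap for the difference in mean optimal convergence times.
B = 10_000
diffs = np.empty(B)
for b in range(B):
    diffs[b] = (rng.choice(t_mult, t_mult.size).mean()
                - rng.choice(t_add, t_add.size).mean())
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(lo, hi)   # the interval straddles 0 when the difference is not significant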

2. General Noise Injection Condition for a NEM Benefit

We next generalize the original proof for additive NEM [10, 12] to NEM that uses more general noise injection. The metrical idea behind the proof remains the same: a noise benefit occurs on average at an iteration if the noisy pdf is closer to the optimal pdf than the noiseless pdf is. Relative entropy measures the pdf quality or pseudo-distance compared with the optimal pdf:

D( f(y, z | θ*) ‖ f_N(y, z | θ_k) ) ≤ D( f(y, z | θ*) ‖ f(y, z | θ_k) )    (2)

where

f_N(y, z | θ_k) = f(φ(y, N), z | θ_k)    (3)

is the noise-injected pdf. The relative entropy has the form of an average logarithm [26]:

D( h(u, v) ‖ g(u, v) ) = ∫∫_{U,V} h(u, v) ln[ h(u, v) / g(u, v) ] du dv    (4)

for positive pdfs h and g over the same support. Convergent sums can replace the integrals in the discrete case.

The key point is that the noise-injection mode φ(y, N) need not be addition φ(y, N) = y + N or multiplication φ(y, N) = yN. It can be any function φ of the data y and the noise N.
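A quick numeric sketch may help fix the relative-entropy pseudo-distance in (4). The pair of unit-variance Gaussians below is our own illustrative choice; the closed form D(N(0,1) ‖ N(1,1)) = 1/2 is the standard Gaussian relative-entropy formula.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# D(h || g) = ∫ h(u) ln[h(u)/g(u)] du for two 1-D Gaussians (the one-variable case of (4)).
h = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=1.0)

numeric, _ = quad(lambda u: h.pdf(u) * (h.logpdf(u) - g.logpdf(u)), -12.0, 12.0)
closed_form = 0.5 * (0.0 - 1.0) ** 2     # (mu_h - mu_g)^2 / (2 sigma^2) for equal unit variances
print(numeric, closed_form)              # both about 0.5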

The above relative-entropy inequality is logically equivalent to the EM noise-benefit condition at iteration k cast in terms of expectations [10]:

E[ Q(θ* | θ*) − Q(θ_k | θ*) ] ≥ E[ Q(θ* | θ*) − Q_N(θ_k | θ*) ]    (5)

where Q_N is the noise-perturbed surrogate likelihood function

Q_N(θ | θ_k) = E_{Z|Y,θ_k}[ ln f_N(y, Z | θ) ].    (6)

Any noise N that satisfies this EM noise-benefit condition will on average give better parameter estimates at each iteration than will noiseless estimates or those that use blind noise.

The relative-entropy version of the noise-benefit condition allows the same derivation of the generalized NEM condition as in the case of additive noise. The result is Theorem 1.

Theorem 1: The Generalized NEM Theorem
Let φ(y, N) be an arbitrary mode of combining the signal Y and the noise N. Then the average EM noise benefit

( Q(θ* | θ*) − Q(θ_k | θ*) ) ≥ ( Q(θ* | θ*) − Q_N(θ_k | θ*) )    (7)

holds if

E_{Y,Z,N|θ*}[ ln( f(φ(Y, N), Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (8)

Proof: We want to show that the noisy likelihood function f(φ(y, N), z | θ_k) is closer to the optimal likelihood f(y, z | θ*) than is the noiseless likelihood function f(y, z | θ_k). We use relative entropy for this comparison. The relative entropy between the optimal likelihood and the noisy likelihood is

c_k(N) = D( f(y, z | θ*) ‖ f(φ(y, N), z | θ_k) ).    (9)

The relative entropy between the optimal likelihood and the noiseless likelihood is

c_k = D( f(y, z | θ*) ‖ f(y, z | θ_k) ).    (10)

We show first that the expectation of the Q-function differences in (5) is a distance pseudo-metric. Rewrite the Q-function as an integral over Z:

Q(θ | θ_k) = ∫_Z ln f(y, z | θ) f(z | y, θ_k) dz.    (11)

Then the term c_k = D( f(y, z | θ*) ‖ f(y, z | θ_k) ) is the expectation over Y given the current parameter value θ_k because factoring the conditional pdf f(y, z | θ*) gives

c_k = ∫∫ [ ln f(y, z | θ*) − ln f(y, z | θ_k) ] f(y, z | θ*) dz dy    (12)
    = ∫∫ [ ln f(y, z | θ*) − ln f(y, z | θ_k) ] f(z | y, θ*) f(y | θ*) dz dy    (13)
    = E_{Y|θ_k}[ Q(θ* | θ*) − Q(θ_k | θ*) ].    (14)

The noise term c_k(N) is also the expectation over Y given the parameter value θ_k:

c_k(N) = ∫∫ [ ln f(y, z | θ*) − ln f(y + N, z | θ_k) ] f(y, z | θ*) dz dy    (15)
       = ∫∫ [ ln f(y, z | θ*) − ln f(y + N, z | θ_k) ] f(z | y, θ*) f(y | θ*) dz dy    (16)
       = E_{Y|θ_k}[ Q(θ* | θ*) − Q_N(θ_k | θ*) ].    (17)

Take the noise expectation of both terms c_k and c_k(N). The noiseless term does not depend on N while the noisy term keeps its noise average:

E_N[ c_k ] = c_k.    (18)-(19)

Then the pseudo-metrical inequality

c_k ≥ E_N[ c_k(N) ]    (20)

ensures an average noise benefit:

E_{N,Y|θ_k}[ Q(θ* | θ*) − Q(θ_k | θ*) ] ≥ E_{N,Y|θ_k}[ Q(θ* | θ*) − Q_N(θ_k | θ*) ].    (21)

We use the inequality condition (32) to derive a more useful sufficient condition for a noise benefit. Expand the difference of relative-entropy terms c_k − c_k(N):

c_k − c_k(N) = ∫∫_{Y,Z} [ ln( f(y, z | θ*) / f(y, z | θ_k) ) − ln( f(y, z | θ*) / f(y + N, z | θ_k) ) ] f(y, z | θ*) dy dz    (22)
             = ∫∫_{Y,Z} [ ln( f(y, z | θ*) / f(y, z | θ_k) ) + ln( f(y + N, z | θ_k) / f(y, z | θ*) ) ] f(y, z | θ*) dy dz    (23)
             = ∫∫_{Y,Z} ln( f(y, z | θ*) f(y + N, z | θ_k) / [ f(y, z | θ_k) f(y, z | θ*) ] ) f(y, z | θ*) dy dz    (24)
             = ∫∫_{Y,Z} ln( f(y + N, z | θ_k) / f(y, z | θ_k) ) f(y, z | θ*) dy dz.    (25)

Take the expectation with respect to the noise term N:

E_N[ c_k − c_k(N) ] = c_k − E_N[ c_k(N) ]    (26)
                    = ∫_N ∫∫_{Y,Z} ln( f(y + n, z | θ_k) / f(y, z | θ_k) ) f(y, z | θ*) f(n | y) dy dz dn    (27)
                    = ∫∫_{Y,Z} ∫_N ln( f(y + n, z | θ_k) / f(y, z | θ_k) ) f(n | y) f(y, z | θ*) dn dy dz    (28)
                    = E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ].    (29)

The assumption of finite differential entropy for Y and Z ensures that ln f(y, z | θ) f(y, z | θ*) is integrable. Thus the integrand is integrable. So Fubini's theorem [27] permits the change in the order of integration in (29):

c_k ≥ E_N[ c_k(N) ]  iff  E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (30)

Then an EM noise benefit occurs on average if

E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (31)

The same derivation as in [10] shows that the average EM noise benefit

c_k ≥ E_N[ c_k(N) ]    (32)

occurs if

E_{Y,Z,N|θ*}[ ln( f(φ(Y, N), Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0    (33)

when the generalized noisy likelihood f(φ(y, N), z | θ_k) replaces the additive noisy likelihood f(y + N, z | θ_k) in the original NEM proof.
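The identity chain (26)-(29) is easy to check numerically. The sketch below is our own toy example rather than anything from the paper: a scalar Gaussian data model with no latent variable (so the Z-integral is trivial) and additive noise with an arbitrary pdf. Both sides of the identity come out near 0.13 for these parameter choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

theta_star, theta_k, sigma = 0.0, 1.0, 1.0    # toy scalar-Gaussian data model (our choice)
M = 200_000

# Additive noise with an arbitrary pdf (independent of Y here, just to check the identity).
noise = rng.normal(0.3, 0.5, size=M)
y = rng.normal(theta_star, sigma, size=M)

# KL divergence between two unit-variance Gaussians: D(N(a,1) || N(b,1)) = (a-b)^2 / 2.
kl = lambda a, b: 0.5 * (a - b) ** 2

# c_k = D(f(y|theta*) || f(y|theta_k)).
c_k = kl(theta_star, theta_k)

# c_k(n) = D(f(y|theta*) || f(y+n|theta_k)); note f(y+n|theta_k) = N(y; theta_k - n, 1).
c_k_N = kl(theta_star, theta_k - noise).mean()          # E_N[c_k(N)]

# Right-hand side of (29): E_{Y,N}[ ln f(Y+N|theta_k) - ln f(Y|theta_k) ].
rhs = (norm.logpdf(y + noise, theta_k, sigma) - norm.logpdf(y, theta_k, sigma)).mean()

print(c_k - c_k_N, rhs)    # both approximately 0.13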

3. The Special Case of Multiplicative NEM

Theorem 1 allows a direct proof that properly chosen multiplicative noise can speed average EM convergence. Multiplicative noise [28] occurs in many applications in signal processing and communications. These include synthetic aperture radar [29-31], sonar imaging [32, 33], and photonics [34]. The proof requires only that we use the following conditional noise pdf for multiplicative NEM (m-NEM):

f_N(y, z | θ_k) = f(yN, z | θ_k).    (34)

Then Theorem 2 gives the following special case for multiplicative noise.

Theorem 2: Multiplicative NEM (m-NEM) Theorem
The average EM noise benefit

( Q(θ* | θ*) − Q(θ_k | θ*) ) ≥ ( Q(θ* | θ*) − Q_N(θ_k | θ*) )    (35)

holds if

E_{Y,Z,N|θ*}[ ln( f(YN, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (36)

Proof: The argument is the same as in Theorem 1 with f_N(y, z | θ_k) = f(yN, z | θ_k).

4. Gaussian and Cauchy Mixture Models

Many of the additive-NEM corollaries apply to the generalized NEM condition without change. Corollary 2 from [10] leads to an m-NEM condition for a Gaussian mixture model (GMM) because the noise condition applies to each mixed normal pdf in the mixture. An identical multiplicative noise-benefit condition holds for a Cauchy mixture model. We state and prove the m-NEM GMM result as a separate corollary. The resulting quadratic m-NEM condition depends only on the Gaussian means and not on their variances.

We first review mixture models and then state the closed form of a GMM. The EM algorithm offers a standard way to estimate the parameters of a mixture model. The parameters consist of the convex or probability mixing weights as well as the individual parameters of the mixed pdfs. A finite mixture model [35-37] is a convex combination of a finite number of similar pdfs. So we can view a mixture as a convex combination of a finite set of similar sub-populations. The sub-population pdfs are similar in the sense that they all come from the same parametric family. Mixture models apply to a wide range of statistical problems in pattern recognition and machine intelligence. A Gaussian mixture consists of convex-weighted normal pdfs. The EM algorithm estimates the mixture weights as well as the means and variances of each normal pdf. A Cauchy mixture consists likewise of convex-weighted Cauchy pdfs. The GMM is by far the most common mixture model in practice [38].

Let Y be the observed mixed random variable. Let K be the number of sub-populations. Let Z ∈ {1, ..., K} be the hidden sub-population index random variable. The convex population mixing proportions α_1, ..., α_K define a discrete pdf for Z: P(Z = j) = α_j. The pdf f(y | Z = j, θ_j) is the pdf of the j-th sub-population where θ_1, ..., θ_K are the pdf parameters for each sub-population. The sub-population parameter θ_j can represent the mean or variance of a normal pdf or both. It can represent any number of quantities that parametrize the pdf. Let Θ denote the vector of all model parameters: Θ = {α_1, ..., α_K, θ_1, ..., θ_K}. The joint pdf f(y, z | Θ) is

f(y, z | Θ) = Σ_{j=1}^{K} α_j f(y | j, θ_j) δ[z − j].    (37)

The marginal pdf for Y and the conditional pdf for Z given y are

f(y | Θ) = Σ_j α_j f(y | j, θ_j)    (38)

and

p_Z(j | y, Θ) = α_j f(y | Z = j, θ_j) / f(y | Θ)    (39)

by Bayes theorem. Rewrite the joint pdf as an exponential:

f(y, z | Θ) = exp[ Σ_j ( ln(α_j) + ln f(y | j, θ_j) ) δ[z − j] ].    (40)

Thus

ln f(y, z | Θ) = Σ_j δ[z − j] ln[ α_j f(y | j, θ_j) ].    (41)

EM algorithms for finite mixture models estimate Θ using the sub-population index Z as the latent variable. An EM algorithm uses (39) to derive the Q-function:

Q(Θ | Θ_k) = E_{Z|y,Θ_k}[ ln f(y, Z | Θ) ]    (42)
           = Σ_z Σ_j δ[z − j] ln[ α_j f(y | j, θ_j) ] p_Z(z | y, Θ_k)    (43)
           = Σ_j ln[ α_j f(y | j, θ_j) ] p_Z(j | y, Θ_k).    (44)
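The E-step quantities in (39) and (44) are direct to compute. The short sketch below uses illustrative parameter values and data of our own choosing to evaluate the responsibilities p_Z(j | y, Θ_k) and the Q-function for a two-component 1-D GMM, summing (44) over the observed samples.

import numpy as np
from scipy.stats import norm

# Current parameter estimate Theta_k for a two-component 1-D GMM (illustrative values).
alphas = np.array([0.6, 0.4])
mus    = np.array([0.0, 3.0])
sigmas = np.array([1.0, 1.5])

y = np.array([-0.5, 0.2, 2.8, 4.1])     # a few observed samples

# p_Z(j | y, Theta_k) from (39): responsibilities via Bayes theorem.
joint = alphas * norm.pdf(y[:, None], mus, sigmas)     # alpha_j f(y | j, theta_j)
resp  = joint / joint.sum(axis=1, keepdims=True)

# Q(Theta | Theta_k) from (44), evaluated at Theta = Theta_k and summed over the data.
Q = (resp * np.log(joint)).sum()
print(resp)
print(Q)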

Corollary 1: m-NEM Condition for Gaussian Mixture Models
Suppose that Y | Z = j ~ N(µ_j, σ_j²). So f(y | j, θ) is a normal or Gaussian pdf. Then the pointwise pdf noise benefit for multiplicative noise

f(yn | θ) ≥ f(y | θ)    (45)

holds if and only if

y(n − 1)[ y(n + 1) − 2µ_j ] ≤ 0.    (46)

Proof: Compare the noisy normal pdf with the noiseless normal pdf. The normal pdf is

f(y | θ) = (1 / (σ_j √(2π))) exp( −(y − µ_j)² / (2σ_j²) ).    (47)

So f(yn | θ) ≥ f(y | θ)

iff exp( −(yn − µ_j)² / (2σ_j²) ) ≥ exp( −(y − µ_j)² / (2σ_j²) )    (48)
iff ( (yn − µ_j) / σ_j )² ≤ ( (y − µ_j) / σ_j )²    (49)
iff (yn − µ_j)² ≤ (y − µ_j)²    (50)
iff y²n² + µ_j² − 2µ_j yn ≤ y² + µ_j² − 2yµ_j    (51)
iff y²n² − 2µ_j yn ≤ y² − 2yµ_j    (52)
iff y²(n² − 1) − 2yµ_j(n − 1) ≤ 0    (53)
iff y(n − 1)[ y(n + 1) − 2µ_j ] ≤ 0.    (54)

This proves (46).

The same m-NEM noise-benefit condition (46) holds for a Cauchy mixture model. Suppose that Y | Z = j ~ C(m_j, d_j) and thus that f(y | j, θ) is a Cauchy pdf with median m_j and dispersion d_j. The median controls the location of the Cauchy bell curve. The dispersion controls its width. A Cauchy random variable has no mean. It has finite lower-order fractional moments. But its variance and all its higher-order moments are infinite. The Cauchy pdf f(y | j, θ) has the form

f(y | θ) = 1 / ( πd_j [ 1 + ( (y − m_j) / d_j )² ] ).    (55)

So the pdf inequality f(yn | θ) ≥ f(y | θ) is equivalent to the same quadratic inequality as in the above derivation of the Gaussian m-NEM condition. This gives (46) as the Cauchy m-NEM noise-benefit condition.
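The equivalence of (45) and (46) is easy to spot-check by brute force. The sketch below (our own check with arbitrary test values) compares the two sides of the equivalence on random draws of the data value y, the noise realization n, and the component parameters.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
size = 100_000

# Arbitrary test values for the data sample y, noise realization n, and component parameters.
y   = rng.normal(0.0, 3.0, size)
n   = rng.normal(1.0, 1.0, size)
mu  = rng.normal(0.0, 2.0, size)
sig = rng.uniform(0.5, 2.0, size)

pdf_benefit = norm.logpdf(y * n, mu, sig) >= norm.logpdf(y, mu, sig)   # (45) on the log scale
quadratic   = y * (n - 1.0) * (y * (n + 1.0) - 2.0 * mu) <= 0.0        # (46)

print(np.all(pdf_benefit == quadratic))    # True (boundary ties aside)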

5. The Multiplicative Noisy Expectation-Maximization Algorithm

The m-NEM Theorem and its corollaries give a general method for noise-boosting the EM algorithm. Theorem 1 implies that on average these NEM variants outperform the noiseless EM algorithm. Algorithm 1 gives the multiplicative NEM algorithm schema. The operation mNEMNoiseSample(y, k^(-τ) σ_N) generates noise samples that satisfy the NEM condition for the current data model. The noise sampling pdf depends on the vector of random samples y in the Gaussian and Cauchy mixture models. The noise can have any value in the NEM algorithm for censored gamma models.

Algorithm 1: θ̂_mNEM = m-NEM-Estimate(y)
Require: y = (y_1, ..., y_M): vector of observed incomplete data
Ensure: θ̂_mNEM: m-NEM estimate of parameter θ
1: while ( ‖θ_k − θ_{k−1}‖ ≥ 10^(−tol) ) do
2:   N_S-Step: n ← mNEMNoiseSample(y, k^(-τ) σ_N)
3:   N_M-Step: y† ← y n
4:   E-Step: Q(θ | θ_k) ← E_{Z|y,θ_k}[ ln f(y†, Z | θ) ]
5:   M-Step: θ_{k+1} ← argmax_θ { Q(θ | θ_k) }
6:   k ← k + 1
7: end while
8: θ̂_mNEM ← θ_k

The E-Step takes a conditional expectation of a function of the noisy data samples y† given the noiseless data samples y.

A deterministic decay factor k^(-τ) scales the noise on the k-th iteration. τ is the noise decay rate [10]. The decay factor k^(-τ) reduces the noise at each iteration. So the decay factor drives the noise N_k to zero as the iteration step k increases. The simulations used τ = 2. The decay factor reduces the NEM estimator's jitter around its final value. This is important because the EM algorithm converges to fixed points. So excessive estimator jitter prolongs convergence time even when the jitter occurs near the final solution. The simulations in this paper used polynomial decay factors instead of the logarithmic cooling schedules found in annealing applications [39-43].

The NEM noise generating procedure mNEMNoiseSample(y, k^(-τ) σ_N) returns a NEM-compliant noise sample n at a given noise level σ_N for each data sample y. This procedure changes with the EM data model. The following noise generating procedure applies to GMMs because of Corollary 1. We used the following 1-D noise generating procedure for the GMM simulations:

mNEMNoiseSample for GMM m-NEM
Require: y and σ_N: current data sample and noise level
Ensure: n: noise sample satisfying the NEM condition
  N(y) ← { n ∈ ℝ : y(n − 1)[ y(n + 1) − 2µ_j ] ≤ 0 for all j }
  n is a sample from the truncated normal distribution TN(1, σ_N | N(y)): a normal pdf with mean 1 and standard deviation σ_N truncated to the NEM set N(y)

Figure 1 displays a noise benefit for an m-NEM algorithm on a GMM. The injected noise is subject to the Gaussian NEM condition in (46).
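To make the schema concrete, the following minimal sketch runs Algorithm 1 on a two-component 1-D GMM. The synthetic data, initial values, noise level, and the rejection-with-fallback noise sampler (a practical stand-in for the truncated-normal sampler above) are all our own illustrative choices rather than the paper's simulation code. The E-step computes responsibilities from the noiseless data and the M-step maximizes the noisy surrogate Q_N.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Synthetic data from a two-component 1-D GMM (our stand-in data set).
M = 2000
z_true = rng.choice(2, size=M, p=[0.5, 0.5])
y = rng.normal(np.where(z_true == 0, -2.0, 2.0), 1.0)

# Initial estimate Theta_0.
alphas = np.array([0.5, 0.5])
mus    = np.array([-0.5, 0.5])
sigmas = np.array([2.0, 2.0])

sigma_N, tau, tol = 0.4, 2.0, 1e-6      # initial noise level, decay rate (tau = 2 as in the text)

def nem_noise(y, mus, scale, rng, tries=20):
    # Draw unit-mean Gaussian noise and keep a sample only if it satisfies
    # y (n - 1) [ y (n + 1) - 2 mu_j ] <= 0 for every component j (condition (46));
    # otherwise fall back to n = 1 (no injection).
    n = np.ones_like(y)
    todo = np.ones_like(y, dtype=bool)
    for _ in range(tries):
        if not todo.any():
            break
        cand = rng.normal(1.0, scale, size=y.shape)
        ok = np.ones_like(y, dtype=bool)
        for mu in mus:
            ok &= y * (cand - 1.0) * (y * (cand + 1.0) - 2.0 * mu) <= 0.0
        take = todo & ok
        n[take] = cand[take]
        todo &= ~ok
    return n

for k in range(1, 200):
    old = np.concatenate([alphas, mus, sigmas])

    # N_S- and N_M-steps: sample NEM noise at the decayed level and inject it multiplicatively.
    n = nem_noise(y, mus, sigma_N * k ** (-tau), rng)
    y_noisy = y * n

    # E-step: responsibilities p_Z(j | y, Theta_k) from the noiseless data.
    joint = alphas * norm.pdf(y[:, None], mus, sigmas)
    resp  = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the noisy surrogate Q_N by the usual GMM updates on the noisy data.
    w = resp.sum(axis=0)
    alphas = w / M
    mus    = (resp * y_noisy[:, None]).sum(axis=0) / w
    sigmas = np.sqrt((resp * (y_noisy[:, None] - mus) ** 2).sum(axis=0) / w)

    if np.abs(np.concatenate([alphas, mus, sigmas]) - old).max() < tol:
        break

print(k, alphas, mus, sigmas)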

Figure 2: Blind multiplicative noise did not improve the convergence times of GMM EM algorithms. Such blind noise only increased the time to convergence. This increase in convergence time occurred even when (as in this case) the multiplicative noise used a cooling schedule.

The next section develops the NEM theory for the important case of exponential family pdfs. The corresponding theorem states the generalized NEM condition for additive (63) and multiplicative (64) noise injection. The theorem also applies to mixtures of exponential family pdfs because of Corollary 2 from [10].

6. NEM Noise Benefits for Exponential Family Probabilities

Exponential family pdfs include such popular densities as the normal, exponential, gamma, and Poisson [44]. A member of this family has a pdf f(x | θ) of the exponential form

f(x | θ) = exp[ a(θ)K(x) + b(x) + c(θ) ]    (56)

so long as the density's domain does not include the parameter θ. This latter condition bars the uniform pdf from the exponential family. The family also excludes Cauchy and Student-t pdfs. The normal pdf f(y | θ) = (1 / (σ√(2π))) exp( −(y − µ)² / (2σ²) ) belongs to the exponential family because we can write it in exponential form if a(θ) = −1/(2σ²), K(x) = x², and c(θ) = −µ²/(2σ²) − ln(σ√(2π)).
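The same exponential-form bookkeeping works for the other family members. As a sanity check, the sketch below (our own added example with a hypothetical shape and scale, not one worked in the paper) writes the gamma pdf with known shape κ in the form exp[a(θ)K(x) + b(x) + c(θ)] and compares it against a reference implementation.

import numpy as np
from scipy.stats import gamma
from scipy.special import gammaln

# Gamma pdf with known shape kappa and scale theta in the exponential form
# f(x|theta) = exp{ a(theta) K(x) + b(x) + c(theta) } with
# a(theta) = -1/theta, K(x) = x, b(x) = (kappa - 1) ln x, c(theta) = -ln Gamma(kappa) - kappa ln theta.
kappa, theta = 2.5, 1.7
x = np.linspace(0.1, 10.0, 50)

a = -1.0 / theta
K = x
b = (kappa - 1.0) * np.log(x)
c = -gammaln(kappa) - kappa * np.log(theta)

reconstructed = np.exp(a * K + b + c)
reference     = gamma.pdf(x, kappa, scale=theta)
print(np.allclose(reconstructed, reference))   # True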

Theorem 3: Generalized NEM Condition for Exponential Family Probability Density Functions
Suppose X has an exponential family pdf:

f(x | θ) = exp[ a(θ)K(x) + b(x) + c(θ) ].    (57)

Then an EM noise benefit occurs for an arbitrary noise-combination mode φ if

a(θ)[ K(φ(x, n)) − K(x) ] + [ b(φ(x, n)) − b(x) ] ≥ 0.    (58)

Proof: Compare the noisy pdf f(φ(x, n) | θ) with the noiseless pdf f(x | θ). The noise benefit occurs if

ln f(φ(x, n) | θ) ≥ ln f(x | θ)    (59)

since the logarithm is a monotone increasing function. This inequality holds

iff a(θ)K(φ(x, n)) + b(φ(x, n)) + c(θ) ≥ a(θ)K(x) + b(x) + c(θ)    (60)
iff a(θ)K(φ(x, n)) + b(φ(x, n)) ≥ a(θ)K(x) + b(x)    (61)
iff a(θ)[ K(φ(x, n)) − K(x) ] + [ b(φ(x, n)) − b(x) ] ≥ 0.    (62)

The final inequality proves (58).

This noise-benefit condition reduces to

a(θ)[ K(x + n) − K(x) ] + [ b(x + n) − b(x) ] ≥ 0    (63)

in the additive noise case when φ(x, n) = x + n. It reduces to

a(θ)[ K(xn) − K(x) ] + [ b(xn) − b(x) ] ≥ 0    (64)

in the multiplicative noise case when φ(x, n) = xn. Note that the c(θ) term does not appear in the NEM conditions.

Consider the exponential signal pdf f(x | θ) = (1/θ) e^(−x/θ). It has the form of an exponential-family pdf if a(θ) = −1/θ, K(x) = x, b(x) = 0, and c(θ) = −ln θ. So the condition for an additive NEM noise benefit becomes

−(1/θ)[ (x + n) − x ] ≥ 0.    (65)

This gives a simple negative-noise condition for an additive NEM benefit:

n ≤ 0.    (66)

The condition for a multiplicative NEM benefit likewise becomes

−(1/θ)[ xn − x ] ≥ 0.    (67)

It gives a similar NEM condition since the exponential signal x is nonnegative:

n ≤ 1.    (68)
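A quick numeric check of (66) and (68): the sketch below (our own, with arbitrary noise ranges; the additive noise is clipped so that x + n stays inside the support) confirms that the pdf inequality holds exactly when the corresponding linear condition holds.

import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(4)
theta = 2.0
x = rng.exponential(theta, size=50_000)                  # exponential data, support x >= 0

# Multiplicative noise: positive factors on both sides of 1.
n_mult = rng.uniform(0.0, 2.0, size=x.size)
benefit = expon.logpdf(x * n_mult, scale=theta) >= expon.logpdf(x, scale=theta)
print(np.all(benefit == (n_mult <= 1.0)))                # condition (68): n <= 1

# Additive noise, restricted to keep x + n inside the support.
n_add = rng.uniform(-1.0, 1.0, size=x.size)
n_add = np.maximum(n_add, -x)                            # keep x + n >= 0
benefit = expon.logpdf(x + n_add, scale=theta) >= expon.logpdf(x, scale=theta)
print(np.all(benefit == (n_add <= 0.0)))                 # condition (66): n <= 0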

We conclude with a dual theorem that guarantees that noise will harm the EM algorithm by slowing its convergence on average. Both the theorem statement and its proof simply reverse all pertinent inequalities in Theorem 1.

Theorem 4: The Generalized EM Noise Harm Theorem
Let φ(y, N) be an arbitrary mode of combining the signal Y and the noise N. Then the EM estimation iteration noise harm

( Q(θ* | θ*) − Q(θ_k | θ*) ) ≤ ( Q(θ* | θ*) − Q_N(θ_k | θ*) )    (69)

occurs on average if

E_{Y,Z,N|θ*}[ ln( f(φ(Y, N), Z | θ_k) / f(Y, Z | θ_k) ) ] ≤ 0.    (70)

Proof: The proof follows the same argument as in Theorem 1 with all the inequalities reversed.

This general noise-harm result leads to corollary noise-harm conditions for the additive and multiplicative GMM-NEM models by reversing all pertinent inequalities. A similar reversal gives a noise-harm condition for pdfs from the exponential family. Such harmful GMM noise increased EM convergence time by 35% in the multiplicative-noise case and by 40% in the additive-noise case. No noise benefit or harm occurs on average if equality replaces all pertinent inequalities.

Corollary 2: Noise harm conditions for GMM-NEM
The noise harm condition in (70) holds for the GMM-NEM algorithm if

n² ≥ 2n(µ_j − y)    (71)

for additive noise and if

y(n − 1)[ y(n + 1) − 2µ_j ] ≥ 0    (72)

for multiplicative noise.

Proof: The proof follows the derivations for the additive and multiplicative GMM-NEM conditions if all inequalities reverse.

7. Conclusion

The NEM additive noise model extends to a multiplicative noise model and indeed to arbitrary modes that combine noise and signal. The multiplicative NEM theorem specifically gives a sufficient positivity condition such that multiplicative noise improves the average speed of the EM algorithm. The multiplicative-noise NEM conditions for the GMM and exponential family models are only slightly more complex than their respective additive-noise NEM conditions. An open research question is whether there are general conditions when either multiplicative or additive noise outperforms the other. Another open question is whether data sparsity affects the m-NEM speed-up (as it does the additive case [10]) and how it affects more general modes of NEM-based noise injection.

Bibliography

[1] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion), Journal of the Royal Statistical Society, Series B 39 (1977) 1-38.

[2] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007).
[3] O. Osoba and B. Kosko, Noise-Enhanced Clustering and Competitive Learning Algorithms, Neural Networks 37 (2013) 132-140.
[4] K. Audhkhasi, O. Osoba and B. Kosko, Noise benefits in backpropagation and deep bidirectional pre-training, in Neural Networks (IJCNN), The 2013 International Joint Conference on (IEEE, 2013), pp. 1-8.
[5] K. Audhkhasi, O. Osoba and B. Kosko, Noise benefits in convolutional neural networks, in Proceedings of the 2014 International Conference on Advances in Big Data Analytics (2014), pp. 73-80.
[6] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521 (2015) 436-444.
[7] K. Audhkhasi, O. Osoba and B. Kosko, Noisy hidden Markov models for speech recognition, in International Joint Conference on Neural Networks (IJCNN) (IEEE, 2013), pp. 2738-2743.
[8] L. R. Welch, Hidden Markov models and the Baum-Welch algorithm, IEEE Information Theory Society Newsletter 53 (2003) 1-14.
[9] M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer Series in Statistics (Springer, 1996).
[10] O. Osoba, S. Mitaim and B. Kosko, The Noisy Expectation-Maximization Algorithm, Fluctuation and Noise Letters 12 (2013) 1350012.
[11] O. Osoba, S. Mitaim and B. Kosko, Noise Benefits in the Expectation-Maximization Algorithm: NEM Theorems and Models, in The International Joint Conference on Neural Networks (IJCNN) (IEEE, 2011), pp. 3178-3183.
[12] O. A. Osoba, Noise benefits in expectation-maximization algorithms, Ph.D. thesis, University of Southern California, 2013.
[13] K. Wiesenfeld, F. Moss et al., Stochastic resonance and the benefits of noise: from ice ages to crayfish and squids, Nature 373 (1995) 33-36.
[14] A. R. Bulsara and L. Gammaitoni, Tuning in to noise, Physics Today 49 (1996) 39-47.
[15] L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, Stochastic resonance, Reviews of Modern Physics 70 (1998) 223.
[16] S. Mitaim and B. Kosko, Adaptive Stochastic Resonance, Proceedings of the IEEE: Special Issue on Intelligent Signal Processing 86 (1998) 2152-2183.
[17] F. Chapeau-Blondeau and D. Rousseau, Noise-Enhanced Performance for an Optimal Bayesian Estimator, IEEE Transactions on Signal Processing 52 (2004) 1327-1334.
[18] I. Lee, X. Liu, C. Zhou and B. Kosko, Noise-enhanced detection of subthreshold signals with carbon nanotubes, IEEE Transactions on Nanotechnology 5 (2006) 613-627.
[19] B. Kosko, Noise (Penguin, 2006).
[20] M. McDonnell, N. Stocks, C. Pearce and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization (Cambridge University Press, 2008).
[21] A. Patel and B. Kosko, Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Estimation, IEEE Signal Processing Letters 17 (2010) 1005-1009.
[22] A. Patel and B. Kosko, Noise Benefits in Quantizer-Array Correlation Detection and Watermark Decoding, IEEE Transactions on Signal Processing 59 (2011) 488-505.
[23] H. Chen, L. R. Varshney and P. K. Varshney, Noise-enhanced information systems, Proceedings of the IEEE 102 (2014) 1607-1621.
[24] S. Mitaim and B. Kosko, Noise-benefit forbidden-interval theorems for threshold signal detectors based on cross correlations, Physical Review E 90 (2014) 052124.

[25] B. Franzke and B. Kosko, Noise can speed convergence in Markov chains, Physical Review E 84 (2011) 041112.
[26] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley & Sons, New York, 1991).
[27] G. B. Folland, Real Analysis: Modern Techniques and Their Applications (Wiley-Interscience, 1999), 2nd edition.
[28] L. Rudin, P.-L. Lions and S. Osher, Multiplicative denoising and deblurring: Theory and algorithms, in Geometric Level Set Methods in Imaging, Vision, and Graphics (Springer, 2003), pp. 103-119.
[29] J. Ash, E. Ertin, L. Potter and E. Zelnio, Wide-angle synthetic aperture radar imaging: Models and algorithms for anisotropic scattering, Signal Processing Magazine, IEEE 31 (2014) 16-26.
[30] S. Chen, Y. Li, X. Wang, S. Xiao and M. Sato, Modeling and interpretation of scattering mechanisms in polarimetric synthetic aperture radar: Advances and perspectives, Signal Processing Magazine, IEEE 31 (2014) 79-89.
[31] M. Tur, K. C. Chin and J. W. Goodman, When is speckle noise multiplicative?, Appl. Opt. 21 (1982) 1157-1159.
[32] J. M. Bioucas-Dias and M. A. Figueiredo, Multiplicative noise removal using variable splitting and constrained optimization, Image Processing, IEEE Transactions on 19 (2010) 1720-1730.
[33] J. Ringelstein, A. B. Gershman and J. F. Böhme, Direction finding in random inhomogeneous media in the presence of multiplicative noise, Signal Processing Letters, IEEE 7 (2000) 269-272.
[34] T. Yilmaz, C. M. Depriest, A. Braun, J. H. Abeles and P. J. Delfyett, Noise in fundamental and harmonic modelocked semiconductor lasers: experiments and simulations, Quantum Electronics, IEEE Journal of 39 (2003) 838-849.
[35] R. A. Redner and H. F. Walker, Mixture Densities, Maximum Likelihood and the EM Algorithm, SIAM Review 26 (1984) 195-239.
[36] G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics: Applied Probability and Statistics (Wiley, 2004).
[37] R. V. Hogg and E. A. Tanis, Probability and Statistical Inference (Prentice Hall, 2006), 7th edition.
[38] N. A. Gershenfeld, The Nature of Mathematical Modeling (Cambridge University Press, 1999).
[39] S. Kirkpatrick, C. Gelatt Jr and M. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671-680.
[40] V. Černý, Thermodynamical approach to the Traveling Salesman Problem: An efficient simulation algorithm, Journal of Optimization Theory and Applications 45 (1985) 41-51.
[41] S. Geman and C. Hwang, Diffusions for global optimization, SIAM Journal on Control and Optimization 24 (1986) 1031-1043.
[42] B. Hajek, Cooling schedules for optimal annealing, Mathematics of Operations Research 13 (1988) 311-329.
[43] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence (Prentice Hall, 1991).
[44] R. V. Hogg, J. McKean and A. T. Craig, Introduction to Mathematical Statistics (Pearson, 2013).