The Noisy Expectation-Maximization Algorithm for Multiplicative Noise Injection

Osonde Osoba, Bart Kosko

O. Osoba is with the RAND Corporation, Santa Monica, CA 90401-3208 USA (email: oosoba@rand.org). O. Osoba and B. Kosko are with the Signal and Image Processing Institute, Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089-2564 USA (email: kosko@usc.edu).

Abstract: We generalize the noisy expectation-maximization (NEM) algorithm to allow arbitrary noise-injection modes besides simply adding carefully chosen noise to the data. This includes the important special case of multiplicative noise injection. A generalized NEM theorem shows that all such injected noise speeds convergence of the EM algorithm on average if the noise satisfies a generalized NEM positivity condition. The multiplicative noise-benefit condition has a simple quadratic form for Gaussian and Cauchy mixture models. Simulations show a multiplicative-noise EM speed-up of more than 27% in a Gaussian mixture model. Injecting blind noise only slowed convergence. A related theorem gives a sufficient condition for an average EM noise benefit for arbitrary modes of noise injection if the data model comes from the general exponential family of probability density functions. A final theorem shows that injected noise slows EM convergence if the NEM inequalities reverse.

1. Noise Boosting the EM Algorithm

We show how carefully chosen and injected multiplicative noise can speed convergence of the popular expectation-maximization (EM) algorithm. The proof also allows arbitrary modes of combining signal and noise. The result still speeds EM convergence on average at each iteration.

The EM algorithm generalizes maximum-likelihood estimation to the case of missing or corrupted data [1, 2]. The algorithm iteratively climbs a hill of probability until it reaches the peak. The EM algorithm also generalizes many other popular algorithms. These include the k-means clustering algorithm [3] used in pattern recognition and big-data analysis, the backpropagation algorithm used to train deep feedforward and convolutional neural networks [4-6], and the Baum-Welch algorithm used to train hidden Markov models [7, 8]. But the EM algorithm can converge slowly if the amount of missing data is high or if the number of estimated parameters is large [2, 9]. It can also get stuck at local probability maxima. Users can run the EM algorithm from several starting points to mitigate the problem of convergence to local maxima.

The Noisy EM (NEM) algorithm [3, 10-12] is a noise-enhanced version of the EM algorithm that carefully selects noise and adds it to the data. NEM converges faster on average than EM does because on average it takes larger steps up the same hill of probability.

The largest noise gains tend to occur in the first few steps. This is a type of nonlinear noise benefit or stochastic resonance [13-24] that does not depend on a threshold [25].

NEM adds noise to the data if the noise satisfies the NEM positivity condition:

E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (1)

Here θ_k is the parameter estimate at the k-th iteration and θ* denotes the maximum-likelihood parameter that the EM iterations converge to. The NEM positivity condition (1) holds when the noise-perturbed likelihood f(y + N, z | θ_k) is larger on average than the noiseless likelihood f(y, z | θ_k) at the k-th step of the algorithm [10, 12].

A simple argument gives the intuition behind the NEM positivity condition for additive noise. This argument holds in much greater generality and underlies much of the theory of noise-boosting the EM algorithm. Consider a noise sample or realization n that makes a signal y more probable: f(y + n | θ) ≥ f(y | θ) in terms of probability density functions (pdfs) for some parameter θ. The value y is a realization of the signal random variable Y. The value n is a realization of the noise random variable N. Then this pdf inequality holds if and only if ln( f(y + n | θ) / f(y | θ) ) ≥ 0. Averaging over the signal and noise random variables gives the basic expectation form of the NEM positivity condition.

Particular choices of the signal probability f(y | θ) can greatly simplify the sufficient NEM condition. This signal conditional probability is the so-called data model in the EM context of maximum-likelihood estimation. We show below that Gaussian and Cauchy choices lead to simple quadratic NEM conditions. An exponential data model leads to an even simpler linear condition.

This argument also shows that a noise-based pdf inequality is sufficient to satisfy the positivity condition of the NEM Theorem. It shows even more than this because the NEM positivity condition uses only an expectation. So the pdf inequality need hold only almost everywhere. It need not hold on sets of zero probability. This allows the user to ignore particular values when using continuous probability models.

The same argument for multiplicative noise suggests that a similar positivity condition should hold for a noise benefit. This will hold given the corresponding pdf inequality f(yn | θ) ≥ f(y | θ). There is nothing unique about the operations of addition or multiplication in this signal-noise context. So a noise benefit should hold for any method of combining signal and noise that obeys these pdf inequalities. A sequence of new theorems shows that this is the case. Theorem 1 below generalizes the NEM theorem from additive noise injection y + N to arbitrary noise injection φ(y, N). Theorem 2 gives the new NEM condition for the special case where the noise-injection mode is multiplicative: φ(y, N) = yN. We call this new condition the m-NEM condition or the multiplicative-noisy-expectation-maximization condition.

Figure 1 shows an EM speed-up of 27.6% due to the multiplicative NEM condition. Figure 2 shows that ordinary or blind multiplicative noise (not subject to the m-NEM condition) only slowed EM convergence. Blind noise was just noise drawn at random or uniformly from the set of all possible noise. It was not subject to the m-NEM condition or to any other condition.
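The screening idea behind condition (1) is easy to see numerically. The following minimal sketch (our own illustration with made-up mixture parameters, not the paper's simulation code) estimates the average complete-data log-likelihood ratio for multiplicative noise on a two-component Gaussian mixture: noise screened per sample so that it raises every component pdf gives a nonnegative average, while blind unit-mean noise does not.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two-component Gaussian mixture that stands in for the model at step k (illustrative values).
alphas = np.array([0.5, 0.5])
mus    = np.array([-2.0, 2.0])
sigmas = np.array([1.0, 1.0])

# Sample complete data (Y, Z) from the same mixture.
M = 20_000
z = rng.choice(2, size=M, p=alphas)
y = rng.normal(mus[z], sigmas[z])

def log_ratio(y, z, n):
    # ln f(y n, z | θ_k) - ln f(y, z | θ_k); the mixing weight α_z cancels in the ratio.
    return norm.logpdf(y * n, mus[z], sigmas[z]) - norm.logpdf(y, mus[z], sigmas[z])

def screen(y, n):
    # Keep a noise sample only if it raises every component pdf; otherwise use n = 1 (no noise).
    ok = np.ones_like(y, dtype=bool)
    for mu, sig in zip(mus, sigmas):
        ok &= norm.logpdf(y * n, mu, sig) >= norm.logpdf(y, mu, sig)
    return np.where(ok, n, 1.0)

n_blind = rng.normal(1.0, 0.4, size=M)               # blind unit-mean multiplicative noise
n_nem   = screen(y, rng.normal(1.0, 0.4, size=M))    # screened (NEM-compliant) noise

print("blind   :", log_ratio(y, z, n_blind).mean())  # typically negative
print("screened:", log_ratio(y, z, n_nem).mean())    # nonnegative by construction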

Figure 1: Noise benefit in a Gaussian-mixture-model NEM algorithm that used multiplicative noise injection. Low-intensity noise decreased the EM convergence time while higher-intensity starting noise increased it. The multiplicative noise had unit mean with different standard deviations. The optimal initial noise standard deviation was σ = 0.44 and gave a 27.6% speed-up over the noiseless EM algorithm. This NEM procedure injected noise that decayed at an inverse-square rate.

The optimal speed-up using additive noise on the same data model was 30.5% at an optimal noise power of σ = 1.9. This speed-up was slightly better than the m-NEM speed-up. But a statistical test for the difference in mean optimal convergence times found that this difference was not statistically significant at the 0.05 significance level. The hypothesis test for the difference of means gave a large p-value of 0.136. A sample 95% bootstrap confidence interval for the average difference in optimal convergence time was [-0.45, 0.067]. So we cannot reject the null hypothesis that the difference in optimal average convergence times for the two noise-injection modes was statistically insignificant at the 0.05 level. An open and important research question is whether there are general conditions under which one noise-injection mode outperforms the other.
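The bootstrap comparison above is simple to reproduce in outline. The sketch below uses made-up convergence-time samples as stand-ins for the paper's simulation output (which we do not have) and computes a 95% percentile-bootstrap confidence interval for the difference in mean convergence times.

import numpy as np

rng = np.random.default_rng(5)

# Hypothetical convergence-time samples (in iterations) for the two injection modes.
t_mult = rng.normal(36.0, 4.0, size=100)
t_add  = rng.normal(35.0, 4.0, size=100)

# Percentile bootstrap for the difference in mean optimal convergence times.
B = 10_000
diffs = np.empty(B)
for b in range(B):
    diffs[b] = (rng.choice(t_mult, t_mult.size).mean()
                - rng.choice(t_add, t_add.size).mean())
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(lo, hi)   # the interval straddles 0 when the difference is not significant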

2. General Noise Injection Condition for a NEM Benefit

We next generalize the original proof for additive NEM [10, 12] to NEM that uses more general noise injection. The metrical idea behind the proof remains the same: a noise benefit occurs on average at an iteration if the noisy pdf is closer to the optimal pdf than the noiseless pdf is. Relative entropy measures the pdf quality or pseudo-distance compared with the optimal pdf:

D( f(y, z | θ*) ‖ f_N(y, z | θ_k) ) ≤ D( f(y, z | θ*) ‖ f(y, z | θ_k) )    (2)

where

f_N(y, z | θ_k) = f(φ(y, N), z | θ_k)    (3)

is the noise-injected pdf. The relative entropy has the form of an average logarithm [26]:

D( h(u, v) ‖ g(u, v) ) = ∫∫_{U,V} h(u, v) ln[ h(u, v) / g(u, v) ] du dv    (4)

for positive pdfs h and g over the same support. Convergent sums can replace the integrals in the discrete case.

The key point is that the noise-injection mode φ(y, N) need not be addition φ(y, N) = y + N or multiplication φ(y, N) = yN. It can be any function φ of the data y and the noise N.
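A quick numeric sketch may help fix the relative-entropy pseudo-distance in (4). The pair of unit-variance Gaussians below is our own illustrative choice; the closed form D(N(0,1) ‖ N(1,1)) = 1/2 is the standard Gaussian relative-entropy formula.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# D(h || g) = ∫ h(u) ln[h(u)/g(u)] du for two 1-D Gaussians (the one-variable case of (4)).
h = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=1.0)

numeric, _ = quad(lambda u: h.pdf(u) * (h.logpdf(u) - g.logpdf(u)), -12.0, 12.0)
closed_form = 0.5 * (0.0 - 1.0) ** 2     # (mu_h - mu_g)^2 / (2 sigma^2) for equal unit variances
print(numeric, closed_form)              # both about 0.5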

The above relative-entropy inequality is logically equivalent to the EM noise-benefit condition at iteration k cast in terms of expectations [10]:

E[ Q(θ* | θ*) − Q(θ_k | θ*) ] ≥ E[ Q(θ* | θ*) − Q_N(θ_k | θ*) ]    (5)

where Q_N is the noise-perturbed surrogate likelihood function

Q_N(θ | θ_k) = E_{Z|Y,θ_k}[ ln f_N(y, Z | θ) ].    (6)

Any noise N that satisfies this EM noise-benefit condition will on average give better parameter estimates at each iteration than will noiseless estimates or those that use blind noise.

The relative-entropy version of the noise-benefit condition allows the same derivation of the generalized NEM condition as in the case of additive noise. The result is Theorem 1.

Theorem 1: The Generalized NEM Theorem
Let φ(y, N) be an arbitrary mode of combining the signal Y and the noise N. Then the average EM noise benefit

( Q(θ* | θ*) − Q(θ_k | θ*) ) ≥ ( Q(θ* | θ*) − Q_N(θ_k | θ*) )    (7)

holds if

E_{Y,Z,N|θ*}[ ln( f(φ(Y, N), Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (8)

Proof: We want to show that the noisy likelihood function f(φ(y, N), z | θ_k) is closer to the optimal likelihood f(y, z | θ*) than is the noiseless likelihood function f(y, z | θ_k). We use relative entropy for this comparison. The relative entropy between the optimal likelihood and the noisy likelihood is

c_k(N) = D( f(y, z | θ*) ‖ f(φ(y, N), z | θ_k) ).    (9)

The relative entropy between the optimal likelihood and the noiseless likelihood is

c_k = D( f(y, z | θ*) ‖ f(y, z | θ_k) ).    (10)

We show first that the expectation of the Q-function differences in (5) is a distance pseudo-metric. Rewrite the Q-function as an integral over Z:

Q(θ | θ_k) = ∫_Z ln f(y, z | θ) f(z | y, θ_k) dz.    (11)

Then the term c_k = D( f(y, z | θ*) ‖ f(y, z | θ_k) ) is the expectation over Y given the current parameter value θ_k because factoring the conditional pdf f(y, z | θ*) gives

c_k = ∫∫ [ ln f(y, z | θ*) − ln f(y, z | θ_k) ] f(y, z | θ*) dz dy    (12)
    = ∫∫ [ ln f(y, z | θ*) − ln f(y, z | θ_k) ] f(z | y, θ*) f(y | θ*) dz dy    (13)
    = E_{Y|θ_k}[ Q(θ* | θ*) − Q(θ_k | θ*) ].    (14)

The noise term c_k(N) is also the expectation over Y given the parameter value θ_k:

c_k(N) = ∫∫ [ ln f(y, z | θ*) − ln f(y + N, z | θ_k) ] f(y, z | θ*) dz dy    (15)
       = ∫∫ [ ln f(y, z | θ*) − ln f(y + N, z | θ_k) ] f(z | y, θ*) f(y | θ*) dz dy    (16)
       = E_{Y|θ_k}[ Q(θ* | θ*) − Q_N(θ_k | θ*) ].    (17)

Take the noise expectation of both terms c_k and c_k(N). The noiseless term does not depend on N while the noisy term keeps its noise average:

E_N[ c_k ] = c_k.    (18)-(19)

Then the pseudo-metrical inequality

c_k ≥ E_N[ c_k(N) ]    (20)

ensures an average noise benefit:

E_{N,Y|θ_k}[ Q(θ* | θ*) − Q(θ_k | θ*) ] ≥ E_{N,Y|θ_k}[ Q(θ* | θ*) − Q_N(θ_k | θ*) ].    (21)

We use the inequality condition (32) to derive a more useful sufficient condition for a noise benefit. Expand the difference of relative-entropy terms c_k − c_k(N):

c_k − c_k(N) = ∫∫_{Y,Z} [ ln( f(y, z | θ*) / f(y, z | θ_k) ) − ln( f(y, z | θ*) / f(y + N, z | θ_k) ) ] f(y, z | θ*) dy dz    (22)
             = ∫∫_{Y,Z} [ ln( f(y, z | θ*) / f(y, z | θ_k) ) + ln( f(y + N, z | θ_k) / f(y, z | θ*) ) ] f(y, z | θ*) dy dz    (23)
             = ∫∫_{Y,Z} ln( f(y, z | θ*) f(y + N, z | θ_k) / [ f(y, z | θ_k) f(y, z | θ*) ] ) f(y, z | θ*) dy dz    (24)
             = ∫∫_{Y,Z} ln( f(y + N, z | θ_k) / f(y, z | θ_k) ) f(y, z | θ*) dy dz.    (25)

Take the expectation with respect to the noise term N:

E_N[ c_k − c_k(N) ] = c_k − E_N[ c_k(N) ]    (26)
                    = ∫_N ∫∫_{Y,Z} ln( f(y + n, z | θ_k) / f(y, z | θ_k) ) f(y, z | θ*) f(n | y) dy dz dn    (27)
                    = ∫∫_{Y,Z} ∫_N ln( f(y + n, z | θ_k) / f(y, z | θ_k) ) f(n | y) f(y, z | θ*) dn dy dz    (28)
                    = E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ].    (29)

The assumption of finite differential entropy for Y and Z ensures that ln f(y, z | θ) f(y, z | θ*) is integrable. Thus the integrand is integrable. So Fubini's theorem [27] permits the change in the order of integration in (29):

c_k ≥ E_N[ c_k(N) ]  iff  E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (30)

Then an EM noise benefit occurs on average if

E_{Y,Z,N|θ*}[ ln( f(Y + N, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (31)

The same derivation as in [10] shows that the average EM noise benefit

c_k ≥ E_N[ c_k(N) ]    (32)

occurs if

E_{Y,Z,N|θ*}[ ln( f(φ(Y, N), Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0    (33)

when the generalized noisy likelihood f(φ(y, N), z | θ_k) replaces the additive noisy likelihood f(y + N, z | θ_k) in the original NEM proof.
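The identity chain (26)-(29) is easy to check numerically. The sketch below is our own toy example rather than anything from the paper: a scalar Gaussian data model with no latent variable (so the Z-integral is trivial) and additive noise with an arbitrary pdf. Both sides of the identity come out near 0.13 for these parameter choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

theta_star, theta_k, sigma = 0.0, 1.0, 1.0    # toy scalar-Gaussian data model (our choice)
M = 200_000

# Additive noise with an arbitrary pdf (independent of Y here, just to check the identity).
noise = rng.normal(0.3, 0.5, size=M)
y = rng.normal(theta_star, sigma, size=M)

# KL divergence between two unit-variance Gaussians: D(N(a,1) || N(b,1)) = (a-b)^2 / 2.
kl = lambda a, b: 0.5 * (a - b) ** 2

# c_k = D(f(y|theta*) || f(y|theta_k)).
c_k = kl(theta_star, theta_k)

# c_k(n) = D(f(y|theta*) || f(y+n|theta_k)); note f(y+n|theta_k) = N(y; theta_k - n, 1).
c_k_N = kl(theta_star, theta_k - noise).mean()          # E_N[c_k(N)]

# Right-hand side of (29): E_{Y,N}[ ln f(Y+N|theta_k) - ln f(Y|theta_k) ].
rhs = (norm.logpdf(y + noise, theta_k, sigma) - norm.logpdf(y, theta_k, sigma)).mean()

print(c_k - c_k_N, rhs)    # both approximately 0.13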

3. The Special Case of Multiplicative NEM

Theorem 1 allows a direct proof that properly chosen multiplicative noise can speed average EM convergence. Multiplicative noise [28] occurs in many applications in signal processing and communications. These include synthetic aperture radar [29-31], sonar imaging [32, 33], and photonics [34]. The proof requires only that we use the following conditional noise pdf for multiplicative NEM (m-NEM):

f_N(y, z | θ_k) = f(yN, z | θ_k).    (34)

Then Theorem 2 gives the following special case for multiplicative noise.

Theorem 2: Multiplicative NEM (m-NEM) Theorem
The average EM noise benefit

( Q(θ* | θ*) − Q(θ_k | θ*) ) ≥ ( Q(θ* | θ*) − Q_N(θ_k | θ*) )    (35)

holds if

E_{Y,Z,N|θ*}[ ln( f(YN, Z | θ_k) / f(Y, Z | θ_k) ) ] ≥ 0.    (36)

Proof: The argument is the same as in Theorem 1 with f_N(y, z | θ_k) = f(yN, z | θ_k).

4. Gaussian and Cauchy Mixture Models

Many of the additive-NEM corollaries apply to the generalized NEM condition without change. Corollary 2 from [10] leads to an m-NEM condition for a Gaussian mixture model (GMM) because the noise condition applies to each mixed normal pdf in the mixture. An identical multiplicative noise-benefit condition holds for a Cauchy mixture model. We state and prove the m-NEM GMM result as a separate corollary. The resulting quadratic m-NEM condition depends only on the Gaussian means and not on their variances.

We first review mixture models and then state the closed form of a GMM. The EM algorithm offers a standard way to estimate the parameters of a mixture model. The parameters consist of the convex or probability mixing weights as well as the individual parameters of the mixed pdfs. A finite mixture model [35-37] is a convex combination of a finite number of similar pdfs. So we can view a mixture as a convex combination of a finite set of similar sub-populations. The sub-population pdfs are similar in the sense that they all come from the same parametric family. Mixture models apply to a wide range of statistical problems in pattern recognition and machine intelligence. A Gaussian mixture consists of convex-weighted normal pdfs. The EM algorithm estimates the mixture weights as well as the means and variances of each normal pdf. A Cauchy mixture consists likewise of convex-weighted Cauchy pdfs. The GMM is by far the most common mixture model in practice [38].

Let Y be the observed mixed random variable. Let K be the number of sub-populations. Let Z ∈ {1, ..., K} be the hidden sub-population index random variable. The convex population mixing proportions α_1, ..., α_K define a discrete pdf for Z: P(Z = j) = α_j. The pdf f(y | Z = j, θ_j) is the pdf of the j-th sub-population where θ_1, ..., θ_K are the pdf parameters for each sub-population. The sub-population parameter θ_j can represent the mean or variance of a normal pdf or both. It can represent any number of quantities that parametrize the pdf. Let Θ denote the vector of all model parameters: Θ = {α_1, ..., α_K, θ_1, ..., θ_K}. The joint pdf f(y, z | Θ) is

f(y, z | Θ) = Σ_{j=1}^{K} α_j f(y | j, θ_j) δ[z − j].    (37)

The marginal pdf for Y and the conditional pdf for Z given y are

f(y | Θ) = Σ_j α_j f(y | j, θ_j)    (38)

and

p_Z(j | y, Θ) = α_j f(y | Z = j, θ_j) / f(y | Θ)    (39)

by Bayes theorem. Rewrite the joint pdf as an exponential:

f(y, z | Θ) = exp[ Σ_j ( ln(α_j) + ln f(y | j, θ_j) ) δ[z − j] ].    (40)

Thus

ln f(y, z | Θ) = Σ_j δ[z − j] ln[ α_j f(y | j, θ_j) ].    (41)

EM algorithms for finite mixture models estimate Θ using the sub-population index Z as the latent variable. An EM algorithm uses (39) to derive the Q-function:

Q(Θ | Θ_k) = E_{Z|y,Θ_k}[ ln f(y, Z | Θ) ]    (42)
           = Σ_z Σ_j δ[z − j] ln[ α_j f(y | j, θ_j) ] p_Z(z | y, Θ_k)    (43)
           = Σ_j ln[ α_j f(y | j, θ_j) ] p_Z(j | y, Θ_k).    (44)
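The E-step quantities in (39) and (44) are direct to compute. The short sketch below uses illustrative parameter values and data of our own choosing to evaluate the responsibilities p_Z(j | y, Θ_k) and the Q-function for a two-component 1-D GMM, summing (44) over the observed samples.

import numpy as np
from scipy.stats import norm

# Current parameter estimate Theta_k for a two-component 1-D GMM (illustrative values).
alphas = np.array([0.6, 0.4])
mus    = np.array([0.0, 3.0])
sigmas = np.array([1.0, 1.5])

y = np.array([-0.5, 0.2, 2.8, 4.1])     # a few observed samples

# p_Z(j | y, Theta_k) from (39): responsibilities via Bayes theorem.
joint = alphas * norm.pdf(y[:, None], mus, sigmas)     # alpha_j f(y | j, theta_j)
resp  = joint / joint.sum(axis=1, keepdims=True)

# Q(Theta | Theta_k) from (44), evaluated at Theta = Theta_k and summed over the data.
Q = (resp * np.log(joint)).sum()
print(resp)
print(Q)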

Corollary 1: m-NEM Condition for Gaussian Mixture Models
Suppose that Y | Z = j ~ N(µ_j, σ_j²). So f(y | j, θ) is a normal or Gaussian pdf. Then the pointwise pdf noise benefit for multiplicative noise

f(yn | θ) ≥ f(y | θ)    (45)

holds if and only if

y(n − 1)[ y(n + 1) − 2µ_j ] ≤ 0.    (46)

Proof: Compare the noisy normal pdf with the noiseless normal pdf. The normal pdf is

f(y | θ) = (1 / (σ_j √(2π))) exp( −(y − µ_j)² / (2σ_j²) ).    (47)

So f(yn | θ) ≥ f(y | θ)

iff exp( −(yn − µ_j)² / (2σ_j²) ) ≥ exp( −(y − µ_j)² / (2σ_j²) )    (48)
iff ( (yn − µ_j) / σ_j )² ≤ ( (y − µ_j) / σ_j )²    (49)
iff (yn − µ_j)² ≤ (y − µ_j)²    (50)
iff y²n² + µ_j² − 2µ_j yn ≤ y² + µ_j² − 2yµ_j    (51)
iff y²n² − 2µ_j yn ≤ y² − 2yµ_j    (52)
iff y²(n² − 1) − 2yµ_j(n − 1) ≤ 0    (53)
iff y(n − 1)[ y(n + 1) − 2µ_j ] ≤ 0.    (54)

This proves (46).

The same m-NEM noise-benefit condition (46) holds for a Cauchy mixture model. Suppose that Y | Z = j ~ C(m_j, d_j) and thus that f(y | j, θ) is a Cauchy pdf with median m_j and dispersion d_j. The median controls the location of the Cauchy bell curve. The dispersion controls its width. A Cauchy random variable has no mean. It has finite lower-order fractional moments. But its variance and all its higher-order moments are infinite. The Cauchy pdf f(y | j, θ) has the form

f(y | θ) = 1 / ( πd_j [ 1 + ( (y − m_j) / d_j )² ] ).    (55)

So the pdf inequality f(yn | θ) ≥ f(y | θ) is equivalent to the same quadratic inequality as in the above derivation of the Gaussian m-NEM condition. This gives (46) as the Cauchy m-NEM noise-benefit condition.
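The equivalence of (45) and (46) is easy to spot-check by brute force. The sketch below (our own check with arbitrary test values) compares the two sides of the equivalence on random draws of the data value y, the noise realization n, and the component parameters.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
size = 100_000

# Arbitrary test values for the data sample y, noise realization n, and component parameters.
y   = rng.normal(0.0, 3.0, size)
n   = rng.normal(1.0, 1.0, size)
mu  = rng.normal(0.0, 2.0, size)
sig = rng.uniform(0.5, 2.0, size)

pdf_benefit = norm.logpdf(y * n, mu, sig) >= norm.logpdf(y, mu, sig)   # (45) on the log scale
quadratic   = y * (n - 1.0) * (y * (n + 1.0) - 2.0 * mu) <= 0.0        # (46)

print(np.all(pdf_benefit == quadratic))    # True (boundary ties aside)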

5. The Multiplicative Noisy Expectation-Maximization Algorithm

The m-NEM Theorem and its corollaries give a general method for noise-boosting the EM algorithm. Theorem 1 implies that on average these NEM variants outperform the noiseless EM algorithm. Algorithm 1 gives the multiplicative NEM algorithm schema. The operation mNEMNoiseSample(y, k^(-τ) σ_N) generates noise samples that satisfy the NEM condition for the current data model. The noise sampling pdf depends on the vector of random samples y in the Gaussian and Cauchy mixture models. The noise can have any value in the NEM algorithm for censored gamma models.

Algorithm 1: θ̂_mNEM = m-NEM-Estimate(y)
Require: y = (y_1, ..., y_M): vector of observed incomplete data
Ensure: θ̂_mNEM: m-NEM estimate of parameter θ
1: while ( ‖θ_k − θ_{k−1}‖ ≥ 10^(−tol) ) do
2:   N_S-Step: n ← mNEMNoiseSample(y, k^(-τ) σ_N)
3:   N_M-Step: y† ← y n
4:   E-Step: Q(θ | θ_k) ← E_{Z|y,θ_k}[ ln f(y†, Z | θ) ]
5:   M-Step: θ_{k+1} ← argmax_θ { Q(θ | θ_k) }
6:   k ← k + 1
7: end while
8: θ̂_mNEM ← θ_k

The E-Step takes a conditional expectation of a function of the noisy data samples y† given the noiseless data samples y.

A deterministic decay factor k^(-τ) scales the noise on the k-th iteration. τ is the noise decay rate [10]. The decay factor k^(-τ) reduces the noise at each iteration. So the decay factor drives the noise N_k to zero as the iteration step k increases. The simulations used τ = 2. The decay factor reduces the NEM estimator's jitter around its final value. This is important because the EM algorithm converges to fixed points. So excessive estimator jitter prolongs convergence time even when the jitter occurs near the final solution. The simulations in this paper used polynomial decay factors instead of the logarithmic cooling schedules found in annealing applications [39-43].

The NEM noise generating procedure mNEMNoiseSample(y, k^(-τ) σ_N) returns a NEM-compliant noise sample n at a given noise level σ_N for each data sample y. This procedure changes with the EM data model. The following noise generating procedure applies to GMMs because of Corollary 1. We used the following 1-D noise generating procedure for the GMM simulations:

mNEMNoiseSample for GMM m-NEM
Require: y and σ_N: current data sample and noise level
Ensure: n: noise sample satisfying the NEM condition
  N(y) ← { n ∈ ℝ : y(n − 1)[ y(n + 1) − 2µ_j ] ≤ 0 for all j }
  n is a sample from the truncated normal distribution TN(1, σ_N | N(y)): a normal pdf with mean 1 and standard deviation σ_N truncated to the NEM set N(y)

Figure 1 displays a noise benefit for an m-NEM algorithm on a GMM. The injected noise is subject to the Gaussian NEM condition in (46).
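To make the schema concrete, the following minimal sketch runs Algorithm 1 on a two-component 1-D GMM. The synthetic data, initial values, noise level, and the rejection-with-fallback noise sampler (a practical stand-in for the truncated-normal sampler above) are all our own illustrative choices rather than the paper's simulation code. The E-step computes responsibilities from the noiseless data and the M-step maximizes the noisy surrogate Q_N.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Synthetic data from a two-component 1-D GMM (our stand-in data set).
M = 2000
z_true = rng.choice(2, size=M, p=[0.5, 0.5])
y = rng.normal(np.where(z_true == 0, -2.0, 2.0), 1.0)

# Initial estimate Theta_0.
alphas = np.array([0.5, 0.5])
mus    = np.array([-0.5, 0.5])
sigmas = np.array([2.0, 2.0])

sigma_N, tau, tol = 0.4, 2.0, 1e-6      # initial noise level, decay rate (tau = 2 as in the text)

def nem_noise(y, mus, scale, rng, tries=20):
    # Draw unit-mean Gaussian noise and keep a sample only if it satisfies
    # y (n - 1) [ y (n + 1) - 2 mu_j ] <= 0 for every component j (condition (46));
    # otherwise fall back to n = 1 (no injection).
    n = np.ones_like(y)
    todo = np.ones_like(y, dtype=bool)
    for _ in range(tries):
        if not todo.any():
            break
        cand = rng.normal(1.0, scale, size=y.shape)
        ok = np.ones_like(y, dtype=bool)
        for mu in mus:
            ok &= y * (cand - 1.0) * (y * (cand + 1.0) - 2.0 * mu) <= 0.0
        take = todo & ok
        n[take] = cand[take]
        todo &= ~ok
    return n

for k in range(1, 200):
    old = np.concatenate([alphas, mus, sigmas])

    # N_S- and N_M-steps: sample NEM noise at the decayed level and inject it multiplicatively.
    n = nem_noise(y, mus, sigma_N * k ** (-tau), rng)
    y_noisy = y * n

    # E-step: responsibilities p_Z(j | y, Theta_k) from the noiseless data.
    joint = alphas * norm.pdf(y[:, None], mus, sigmas)
    resp  = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the noisy surrogate Q_N by the usual GMM updates on the noisy data.
    w = resp.sum(axis=0)
    alphas = w / M
    mus    = (resp * y_noisy[:, None]).sum(axis=0) / w
    sigmas = np.sqrt((resp * (y_noisy[:, None] - mus) ** 2).sum(axis=0) / w)

    if np.abs(np.concatenate([alphas, mus, sigmas]) - old).max() < tol:
        break

print(k, alphas, mus, sigmas)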

Figure 2: Blind multiplicative noise did not improve the convergence times of GMM EM algorithms. Such blind noise only increased the time to convergence. This increase in convergence time occurred even when (as in this case) the multiplicative noise used a cooling schedule.

The next section develops the NEM theory for the important case of exponential family pdfs. The corresponding theorem states the generalized NEM condition for additive (63) and multiplicative (64) noise injection. The theorem also applies to mixtures of exponential family pdfs because of Corollary 2 from [10].

6. NEM Noise Benefits for Exponential Family Probabilities

Exponential family pdfs include such popular densities as the normal, exponential, gamma, and Poisson [44]. A member of this family has a pdf f(x | θ) of the exponential form

f(x | θ) = exp[ a(θ)K(x) + b(x) + c(θ) ]    (56)

so long as the density's domain does not include the parameter θ. This latter condition bars the uniform pdf from the exponential family. The family also excludes Cauchy and Student-t pdfs. The normal pdf f(y | θ) = (1 / (σ√(2π))) exp( −(y − µ)² / (2σ²) ) belongs to the exponential family because we can write it in exponential form if a(θ) = −1/(2σ²), K(x) = x², and c(θ) = −µ²/(2σ²) − ln(σ√(2π)).
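The same exponential-form bookkeeping works for the other family members. As a sanity check, the sketch below (our own added example with a hypothetical shape and scale, not one worked in the paper) writes the gamma pdf with known shape κ in the form exp[a(θ)K(x) + b(x) + c(θ)] and compares it against a reference implementation.

import numpy as np
from scipy.stats import gamma
from scipy.special import gammaln

# Gamma pdf with known shape kappa and scale theta in the exponential form
# f(x|theta) = exp{ a(theta) K(x) + b(x) + c(theta) } with
# a(theta) = -1/theta, K(x) = x, b(x) = (kappa - 1) ln x, c(theta) = -ln Gamma(kappa) - kappa ln theta.
kappa, theta = 2.5, 1.7
x = np.linspace(0.1, 10.0, 50)

a = -1.0 / theta
K = x
b = (kappa - 1.0) * np.log(x)
c = -gammaln(kappa) - kappa * np.log(theta)

reconstructed = np.exp(a * K + b + c)
reference     = gamma.pdf(x, kappa, scale=theta)
print(np.allclose(reconstructed, reference))   # True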

Theorem 3: Generalized NEM Condition for Exponential Family Probability Density Functions
Suppose X has an exponential family pdf:

f(x | θ) = exp[ a(θ)K(x) + b(x) + c(θ) ].    (57)

Then an EM noise benefit occurs for an arbitrary noise-combination mode φ if

a(θ)[ K(φ(x, n)) − K(x) ] + [ b(φ(x, n)) − b(x) ] ≥ 0.    (58)

Proof: Compare the noisy pdf f(φ(x, n) | θ) with the noiseless pdf f(x | θ). The noise benefit occurs if

ln f(φ(x, n) | θ) ≥ ln f(x | θ)    (59)

since the logarithm is a monotone increasing function. This inequality holds

iff a(θ)K(φ(x, n)) + b(φ(x, n)) + c(θ) ≥ a(θ)K(x) + b(x) + c(θ)    (60)
iff a(θ)K(φ(x, n)) + b(φ(x, n)) ≥ a(θ)K(x) + b(x)    (61)
iff a(θ)[ K(φ(x, n)) − K(x) ] + [ b(φ(x, n)) − b(x) ] ≥ 0.    (62)

The final inequality proves (58).

This noise-benefit condition reduces to

a(θ)[ K(x + n) − K(x) ] + [ b(x + n) − b(x) ] ≥ 0    (63)

in the additive noise case when φ(x, n) = x + n. It reduces to

a(θ)[ K(xn) − K(x) ] + [ b(xn) − b(x) ] ≥ 0    (64)

in the multiplicative noise case when φ(x, n) = xn. Note that the c(θ) term does not appear in the NEM conditions.

Consider the exponential signal pdf f(x | θ) = (1/θ) e^(−x/θ). It has the form of an exponential-family pdf if a(θ) = −1/θ, K(x) = x, b(x) = 0, and c(θ) = −ln θ. So the condition for an additive NEM noise benefit becomes

−(1/θ)[ (x + n) − x ] ≥ 0.    (65)

This gives a simple negative-noise condition for an additive NEM benefit:

n ≤ 0.    (66)

The condition for a multiplicative NEM benefit likewise becomes

−(1/θ)[ xn − x ] ≥ 0.    (67)

It gives a similar NEM condition since the exponential signal x is nonnegative:

n ≤ 1.    (68)
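A quick numeric check of (66) and (68): the sketch below (our own, with arbitrary noise ranges; the additive noise is clipped so that x + n stays inside the support) confirms that the pdf inequality holds exactly when the corresponding linear condition holds.

import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(4)
theta = 2.0
x = rng.exponential(theta, size=50_000)                  # exponential data, support x >= 0

# Multiplicative noise: positive factors on both sides of 1.
n_mult = rng.uniform(0.0, 2.0, size=x.size)
benefit = expon.logpdf(x * n_mult, scale=theta) >= expon.logpdf(x, scale=theta)
print(np.all(benefit == (n_mult <= 1.0)))                # condition (68): n <= 1

# Additive noise, restricted to keep x + n inside the support.
n_add = rng.uniform(-1.0, 1.0, size=x.size)
n_add = np.maximum(n_add, -x)                            # keep x + n >= 0
benefit = expon.logpdf(x + n_add, scale=theta) >= expon.logpdf(x, scale=theta)
print(np.all(benefit == (n_add <= 0.0)))                 # condition (66): n <= 0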

We conclude with a dual theorem that guarantees that noise will harm the EM algorithm by slowing its convergence on average. Both the theorem statement and its proof simply reverse all pertinent inequalities in Theorem 1.

Theorem 4: The Generalized EM Noise Harm Theorem
Let φ(y, N) be an arbitrary mode of combining the signal Y and the noise N. Then the EM estimation iteration noise harm

( Q(θ* | θ*) − Q(θ_k | θ*) ) ≤ ( Q(θ* | θ*) − Q_N(θ_k | θ*) )    (69)

occurs on average if

E_{Y,Z,N|θ*}[ ln( f(φ(Y, N), Z | θ_k) / f(Y, Z | θ_k) ) ] ≤ 0.    (70)

Proof: The proof follows the same argument as in Theorem 1 with all the inequalities reversed.

This general noise-harm result leads to corollary noise-harm conditions for the additive and multiplicative GMM-NEM models by reversing all pertinent inequalities. A similar reversal gives a noise-harm condition for pdfs from the exponential family. Such harmful GMM noise increased EM convergence time by 35% in the multiplicative-noise case and by 40% in the additive-noise case. No noise benefit or harm occurs on average if equality replaces all pertinent inequalities.

Corollary 2: Noise harm conditions for GMM-NEM
The noise harm condition in (70) holds for the GMM-NEM algorithm if

n² ≥ 2n(µ_j − y)    (71)

for additive noise and if

y(n − 1)[ y(n + 1) − 2µ_j ] ≥ 0    (72)

for multiplicative noise.

Proof: The proof follows the derivations for the additive and multiplicative GMM-NEM conditions if all inequalities reverse.

7. Conclusion

The NEM additive noise model extends to a multiplicative noise model and indeed to arbitrary modes that combine noise and signal. The multiplicative NEM theorem specifically gives a sufficient positivity condition such that multiplicative noise improves the average speed of the EM algorithm. The multiplicative-noise NEM conditions for the GMM and exponential family models are only slightly more complex than their respective additive-noise NEM conditions. An open research question is whether there are general conditions when either multiplicative or additive noise outperforms the other. Another open question is whether data sparsity affects the m-NEM speed-up (as it does the additive case [10]) and how it affects more general modes of NEM-based noise injection.

Bibliography

[1] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion), Journal of the Royal Statistical Society, Series B 39 (1977) 1-38.

[2] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions (John Wiley and Sons, 2007).
[3] O. Osoba and B. Kosko, Noise-Enhanced Clustering and Competitive Learning Algorithms, Neural Networks 37 (2013) 132-140.
[4] K. Audhkhasi, O. Osoba and B. Kosko, Noise benefits in backpropagation and deep bidirectional pre-training, in Neural Networks (IJCNN), The 2013 International Joint Conference on (IEEE, 2013), pp. 1-8.
[5] K. Audhkhasi, O. Osoba and B. Kosko, Noise benefits in convolutional neural networks, in Proceedings of the 2014 International Conference on Advances in Big Data Analytics (2014), pp. 73-80.
[6] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521 (2015) 436-444.
[7] K. Audhkhasi, O. Osoba and B. Kosko, Noisy hidden Markov models for speech recognition, in International Joint Conference on Neural Networks (IJCNN) (IEEE, 2013), pp. 2738-2743.
[8] L. R. Welch, Hidden Markov models and the Baum-Welch algorithm, IEEE Information Theory Society Newsletter 53 (2003) 1-14.
[9] M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer Series in Statistics (Springer, 1996).
[10] O. Osoba, S. Mitaim and B. Kosko, The Noisy Expectation-Maximization Algorithm, Fluctuation and Noise Letters 12 (2013) 1350012.
[11] O. Osoba, S. Mitaim and B. Kosko, Noise Benefits in the Expectation-Maximization Algorithm: NEM Theorems and Models, in The International Joint Conference on Neural Networks (IJCNN) (IEEE, 2011), pp. 3178-3183.
[12] O. A. Osoba, Noise benefits in expectation-maximization algorithms, Ph.D. thesis, University of Southern California, 2013.
[13] K. Wiesenfeld, F. Moss et al., Stochastic resonance and the benefits of noise: from ice ages to crayfish and squids, Nature 373 (1995) 33-36.
[14] A. R. Bulsara and L. Gammaitoni, Tuning in to noise, Physics Today 49 (1996) 39-47.
[15] L. Gammaitoni, P. Hänggi, P. Jung and F. Marchesoni, Stochastic resonance, Reviews of Modern Physics 70 (1998) 223.
[16] S. Mitaim and B. Kosko, Adaptive Stochastic Resonance, Proceedings of the IEEE: Special Issue on Intelligent Signal Processing 86 (1998) 2152-2183.
[17] F. Chapeau-Blondeau and D. Rousseau, Noise-Enhanced Performance for an Optimal Bayesian Estimator, IEEE Transactions on Signal Processing 52 (2004) 1327-1334.
[18] I. Lee, X. Liu, C. Zhou and B. Kosko, Noise-enhanced detection of subthreshold signals with carbon nanotubes, IEEE Transactions on Nanotechnology 5 (2006) 613-627.
[19] B. Kosko, Noise (Penguin, 2006).
[20] M. McDonnell, N. Stocks, C. Pearce and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization (Cambridge University Press, 2008).
[21] A. Patel and B. Kosko, Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Estimation, IEEE Signal Processing Letters 17 (2010) 1005-1009.
[22] A. Patel and B. Kosko, Noise Benefits in Quantizer-Array Correlation Detection and Watermark Decoding, IEEE Transactions on Signal Processing 59 (2011) 488-505.
[23] H. Chen, L. R. Varshney and P. K. Varshney, Noise-enhanced information systems, Proceedings of the IEEE 102 (2014) 1607-1621.
[24] S. Mitaim and B. Kosko, Noise-benefit forbidden-interval theorems for threshold signal detectors based on cross correlations, Physical Review E 90 (2014) 052124.

[25] B. Franzke and B. Kosko, Noise can speed convergence in Markov chains, Physical Review E 84 (2011) 041112.
[26] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley & Sons, New York, 1991).
[27] G. B. Folland, Real Analysis: Modern Techniques and Their Applications (Wiley-Interscience, 1999), 2nd edition.
[28] L. Rudin, P.-L. Lions and S. Osher, Multiplicative denoising and deblurring: Theory and algorithms, in Geometric Level Set Methods in Imaging, Vision, and Graphics (Springer, 2003), pp. 103-119.
[29] J. Ash, E. Ertin, L. Potter and E. Zelnio, Wide-angle synthetic aperture radar imaging: Models and algorithms for anisotropic scattering, Signal Processing Magazine, IEEE 31 (2014) 16-26.
[30] S. Chen, Y. Li, X. Wang, S. Xiao and M. Sato, Modeling and interpretation of scattering mechanisms in polarimetric synthetic aperture radar: Advances and perspectives, Signal Processing Magazine, IEEE 31 (2014) 79-89.
[31] M. Tur, K. C. Chin and J. W. Goodman, When is speckle noise multiplicative?, Appl. Opt. 21 (1982) 1157-1159.
[32] J. M. Bioucas-Dias and M. A. Figueiredo, Multiplicative noise removal using variable splitting and constrained optimization, Image Processing, IEEE Transactions on 19 (2010) 1720-1730.
[33] J. Ringelstein, A. B. Gershman and J. F. Böhme, Direction finding in random inhomogeneous media in the presence of multiplicative noise, Signal Processing Letters, IEEE 7 (2000) 269-272.
[34] T. Yilmaz, C. M. Depriest, A. Braun, J. H. Abeles and P. J. Delfyett, Noise in fundamental and harmonic modelocked semiconductor lasers: experiments and simulations, Quantum Electronics, IEEE Journal of 39 (2003) 838-849.
[35] R. A. Redner and H. F. Walker, Mixture Densities, Maximum Likelihood and the EM Algorithm, SIAM Review 26 (1984) 195-239.
[36] G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics: Applied Probability and Statistics (Wiley, 2004).
[37] R. V. Hogg and E. A. Tanis, Probability and Statistical Inference (Prentice Hall, 2006), 7th edition.
[38] N. A. Gershenfeld, The Nature of Mathematical Modeling (Cambridge University Press, 1999).
[39] S. Kirkpatrick, C. Gelatt Jr and M. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671-680.
[40] V. Černý, Thermodynamical approach to the Traveling Salesman Problem: An efficient simulation algorithm, Journal of Optimization Theory and Applications 45 (1985) 41-51.
[41] S. Geman and C. Hwang, Diffusions for global optimization, SIAM Journal on Control and Optimization 24 (1986) 1031-1043.
[42] B. Hajek, Cooling schedules for optimal annealing, Mathematics of Operations Research 13 (1988) 311-329.
[43] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence (Prentice Hall, 1991).
[44] R. V. Hogg, J. McKean and A. T. Craig, Introduction to Mathematical Statistics (Pearson, 2013).