
EM Algorithms

Charles Byrne (Charles_Byrne@uml.edu), Department of Mathematical Sciences, University of Massachusetts Lowell, Lowell, MA 01854, USA
Paul P. B. Eggermont (eggermon@udel.edu), Food and Resource Economics, University of Delaware, Newark, DE 19716, USA

September 6, 2012

Abstract

A well studied procedure for estimating a parameter from observed data is to maximize the likelihood function. When a maximizer cannot be obtained in closed form, iterative maximization algorithms, such as the expectation maximization (EM) maximum likelihood algorithms, are needed. The standard formulation of the EM algorithms postulates that finding a maximizer of the likelihood is complicated because the observed data is somehow incomplete or deficient, and the maximization would have been simpler had we observed the complete data. The EM algorithm involves repeated calculations with complete data that has been estimated using the current parameter value and conditional expectation. The standard formulation is adequate for the discrete case, in which the random variables involved are governed by finite or infinite probability functions, but unsatisfactory in the continuous case, in which probability density functions and integrals are needed. We adopt the view that the observed data is not necessarily incomplete, but just difficult to work with, while different data, which we call the preferred data, leads to simpler calculations. To relate the preferred data to the observed data, we assume that the preferred data is acceptable, which means that the conditional distribution of the preferred data, given the observed data, is independent of the parameter. This extension of the EM algorithms contains the usual formulation for the discrete case, while removing the difficulties associated with the continuous case. Examples are given to illustrate this new approach.

1 Introduction

We suppose that the observable random vector $Y$ taking values in $\mathbb{R}^N$ is governed by a probability density function (pdf) or probability function (pf) of the form $f_Y(y|\theta)$, for some value of the parameter vector $\theta \in \Theta$, where $\Theta$ is the set of all legitimate values of $\theta$.

Our observed data consists of one realization $y$ of $Y$; we do not exclude the possibility that the entries of $y$ are independently obtained samples of a common real-valued random variable. The true vector of parameters is to be estimated by maximizing the likelihood function $L_y(\theta) = f_Y(y|\theta)$ over all $\theta \in \Theta$ to obtain a maximum likelihood estimate, $\theta_{ML}$.

We assume that there is another related random vector $X$, which we shall call the preferred data, such that, had we been able to obtain one realization $x$ of $X$, maximizing the likelihood function $L_x(\theta) = f_X(x|\theta)$ would have been simpler than maximizing the likelihood function $L_y(\theta) = f_Y(y|\theta)$. The EM algorithm based on $X$ generates a sequence $\{\theta^k\}$ of parameter vectors. For the continuous case of probability density functions (pdf), the vector $\theta^{k+1}$ is obtained from $\theta^k$ by maximizing the conditional expected value
$$E(\log f_X(X|\theta)\,|\,y, \theta^k) = \int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx; \tag{1.1}$$
for the discrete case of probability functions (pf), the integral is replaced by a summation. The EM algorithm is not really a single algorithm, but a framework for the design of iterative likelihood maximization methods for parameter estimation; nevertheless, we shall continue to refer to the EM algorithm.

In decreasing order of importance and difficulty, our goals are these:

1. to have the sequence of parameters $\{\theta^k\}$ converging to $\theta_{ML}$;
2. to have the sequence of functions $\{f_X(x|\theta^k)\}$ converging to $f_X(x|\theta_{ML})$;
3. to have the sequence of numbers $\{L_y(\theta^k)\}$ converging to $L_y(\theta_{ML})$;
4. to have the sequence of numbers $\{L_y(\theta^k)\}$ non-decreasing.

Our focus here will be almost entirely on the fourth goal on the list, with some discussion of the third goal. We shall discuss the first two goals only for certain examples. Our primary objective in this paper is to find a condition on the preferred data $X$ sufficient to reach the fourth goal. As we shall show, in order to have $L_y(\theta^{k+1}) \geq L_y(\theta^k)$, it is sufficient that $X$ satisfy the acceptability condition
$$f_{Y|X}(y|x, \theta) = f_{Y|X}(y|x).$$

Our treatment of the EM algorithm differs somewhat from most conventional presentations, such as those in the fundamental paper [9] and the survey text [19]. As we discuss in more detail in the next section, these conventional treatments are both restrictive and, in some aspects, incorrect, particularly as they apply to the continuous case.

We view our notion of acceptability as a method for correcting these conventional formulations of the EM algorithm.

We begin by reviewing the conventional formulation of the EM algorithm, pointing out aspects of that formulation that require correction. We then develop the theory of the EM for acceptable $X$, including the involvement of the Kullback-Leibler distance and the alternating-minimization approach of Csiszár and Tusnády [8]. We then apply the theory to solve the problem of estimating the mixing proportions in finite mixtures. The theory in this paper permits us to conclude only that likelihood is non-decreasing, but by invoking a convergence theorem presented in [5], we are able to go further and prove convergence of the sequence $\{\theta^k\}$ to a likelihood maximizer.

2 The Conventional Formulation of EM

By the conventional formulation we mean that given in [9], repeated in [19], and used in most papers on the subject subsequent to 1977, such as [29] and [20].

2.1 Incomplete and Complete Data

In [9] the fundamental distinction is between incomplete data, our observable $Y$, and complete data, our preferred $X$. The assumption made in [9] is that there is a many-to-one function $h$ such that $Y = h(X)$; since we can get $y$ from $x$, but not the other way around, the $y$ is incomplete, while the $x$ is complete.

2.2 Missing Data

It is noted in [9] that their complete-incomplete formulation includes the case of missing data, and several of the examples they give are of this type. The term missing data is vague, however; any data that we have not observed can be said to be missing. A close reading of [9] shows that the real problem that can arise when data is missing, either by failure to report, censoring, or truncation, is that it complicates the calculations, as with certain types of designed experiments. The desire to restore the missing data is therefore a desire to simplify the calculations. The popular example of multinomial data given below illustrates well the point that one can often choose to view the observed data as incomplete simply in order to introduce complete data that makes the calculations simpler, even when there is no suggestion, in the original problem, that the observed data is in any way inadequate or incomplete. It is in order to emphasize this desire for simplification that we refer to $X$ as the preferred data, not the complete data.

We shall address later the special case we call the missing-data model, in which the preferred data $X$ takes the form $X = Z = (Y, W)$, where $W$ is called the missing data and $Z$ the complete data. Clearly, we have $Y = h(X)$ now, where $h$ is the projection onto the first coordinate. In this case only do we refer to $X$ as complete.

When the EM algorithm is applied to the missing-data model, the likelihood is non-decreasing, which suggests that, for an arbitrary preferred data $X$, we could imagine $X$ as $W$, the missing data, and imagine applying the EM algorithm to $Z = (Y, X)$. This approach would produce an EM sequence of parameter vectors for which likelihood is non-decreasing, but it need not be the same sequence as obtained by applying the EM algorithm to $X$ directly. It is the same sequence, provided that $X$ is acceptable. We are not suggesting that applying the EM algorithm to $Z = (Y, X)$ would simplify calculations.

2.3 The Conventional Formulation

The conventional formulation of the problem is that there is a random vector $X$, called the complete data, taking values in $\mathbb{R}^M$, where typically $M \geq N$, and often $M > N$, with pdf or pf of the form $f_X(x|\theta)$, and a function $h : \mathbb{R}^M \to \mathbb{R}^N$ such that $Y = h(X)$. For example, let $X_1$ and $X_2$ be independent and uniformly distributed on $[0, \theta_0]$, $X = (X_1, X_2)$ and $Y = X_1 + X_2 = h(X)$. We can use the missing-data model here with $Z = (X_1 + X_2, X_1 - X_2)$. What is missing is not unique, however; instead of $W = X_1 - X_2$ as the missing data, we can use $W = X_2$, or any number of other combinations of $X_1$ and $X_2$ that would allow us to recapture $X$.

As we shall attempt to convince the reader, this formulation is somewhat restrictive; the main point is simply that $f_X(x|\theta)$ would have been easier to maximize than $f_Y(y|\theta)$ is, regardless of the relationship between $X$ and $Y$. For this reason we shall call $Y$ the observed data and $X$ the preferred data, and not assume that $Y = h(X)$ for some function $h$. We reserve the term complete data for $Z$ of the form $Z = (Y, W)$; note that in this case we do have $Y = h(Z)$. In some applications, the preferred data $X$ arises naturally from the problem, while in other cases the user must imagine preferred data. This choice of the preferred data can be helpful in speeding up the algorithm (see [13]).
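To see the contrast concretely, the following sketch (our own illustration, not code from the paper) simulates the uniform-sum example just described: the preferred-data likelihood $L_x(\theta) = \theta^{-2}$ for $\theta \geq \max(x_1, x_2)$ is maximized in closed form, while the observed-data likelihood is built from the triangular density of $Y = X_1 + X_2$ and is maximized numerically here.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_0 = 2.0                                 # true parameter, chosen only for illustration
x1, x2 = rng.uniform(0.0, theta_0, size=2)    # the preferred data x = (x1, x2)
y = x1 + x2                                   # the observed data y = h(x)

# Preferred data: L_x(theta) = 1/theta^2 for theta >= max(x1, x2), so the maximizer is immediate.
theta_ml_x = max(x1, x2)

# Observed data: Y = X1 + X2 has the standard triangular density
#   f_Y(y|theta) = y / theta^2             for 0 <= y <= theta,
#   f_Y(y|theta) = (2*theta - y)/theta^2   for theta <= y <= 2*theta.
grid = np.linspace(y / 2 + 1e-9, 4.0 * theta_0, 100_000)          # theta must be at least y/2
like = np.where(grid >= y, y / grid**2, (2.0 * grid - y) / grid**2)
theta_ml_y = grid[np.argmax(like)]            # numerically, the maximizer turns out to be theta = y
```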

2.4 A Multinomial Example

In many applications, the entries of the vector $y$ are independent realizations of a single real-valued or vector-valued random variable $V$, as they are, at least initially, for the finite mixture problems to be considered later. This is not always the case, however, as the following example shows.

A well known example that was used in [9] and again in [19] to illustrate the EM algorithm concerns a multinomial model taken from genetics. Here there are four cells, with cell probabilities $\frac{1}{2} + \frac{1}{4}\theta_0$, $\frac{1}{4}(1 - \theta_0)$, $\frac{1}{4}(1 - \theta_0)$, and $\frac{1}{4}\theta_0$, for some $\theta_0 \in \Theta = [0, 1]$ to be estimated. The entries of $y$ are the frequencies from a sample of size 197. We then have
$$f_Y(y|\theta) = \frac{197!}{y_1!\,y_2!\,y_3!\,y_4!}\left(\frac{1}{2} + \frac{1}{4}\theta\right)^{y_1}\left(\frac{1}{4}(1-\theta)\right)^{y_2}\left(\frac{1}{4}(1-\theta)\right)^{y_3}\left(\frac{1}{4}\theta\right)^{y_4}. \tag{2.1}$$

It is then supposed that the first of the original four cells can be split into two subcells, with probabilities $\frac{1}{2}$ and $\frac{1}{4}\theta_0$. We then write $y_1 = y_{11} + y_{12}$, and let
$$X = (Y_{11}, Y_{12}, Y_2, Y_3, Y_4), \tag{2.2}$$
where $X$ has a multinomial distribution with five cells. Note that we do now have $Y = h(X)$.

This example is a popular one in the literature on the EM algorithm (see [9] for citations). It is never suggested that the splitting of the first group into two subgroups is motivated by the demands of the genetics theory itself. As stated in [19], the motivation for the splitting is to allow us to view the two random variables $Y_{12} + Y_4$ and $Y_2 + Y_3$ as governed by a binomial distribution; that is, we can view the value of $y_{12} + y_4$ as the number of heads, and the value $y_2 + y_3$ as the number of tails, that occur in the flipping of a biased coin $y_{12} + y_4 + y_2 + y_3$ times. This simplifies the calculation of the likelihood maximizer.
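To make the resulting iteration explicit, here is a minimal sketch (our own code, not taken from the paper); the counts $(125, 18, 20, 34)$, which sum to 197, are the ones usually quoted with this example. The E-step splits the observed first-cell count $y_1$ between its two subcells in proportion to $\frac{1}{2}$ and $\frac{1}{4}\theta^k$, and the M-step is the binomial estimate just described.

```python
y1, y2, y3, y4 = 125, 18, 20, 34      # the classic counts quoted with this example (sum 197)
theta = 0.5                           # any starting value in (0, 1)
for _ in range(50):
    # E-step: expected count in the second subcell of the first cell,
    # whose probability is (theta/4) out of the cell total (1/2 + theta/4).
    y12 = y1 * (theta / 4) / (0.5 + theta / 4)
    # M-step: binomial ML estimate, with "heads" = y12 + y4 and "tails" = y2 + y3.
    theta = (y12 + y4) / (y12 + y4 + y2 + y3)
print(theta)                          # approaches the maximizer of (2.1), roughly 0.6268
```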

2.5 Difficulties with the Conventional Formulation

In the literature on the EM algorithm, it is common to assume that there is a function $h : \mathbb{R}^M \to \mathbb{R}^N$ such that $Y = h(X)$. In the discrete case, in which summation and finite or infinite probability functions are involved, we then have
$$f_Y(y|\theta) = \sum_{x \in \mathcal{X}(y)} f_X(x|\theta), \tag{2.3}$$
where
$$\mathcal{X}(y) = \{x \,|\, h(x) = y\} = h^{-1}(\{y\}). \tag{2.4}$$
The difficulty arises in the continuous case, where integration and probability density functions (pdf) are needed; the set $\mathcal{X}(y)$ can have measure zero in $\mathbb{R}^M$, so it is incorrect to mimic Equation (2.3) and write
$$f_Y(y|\theta) = \int_{x \in \mathcal{X}(y)} f_X(x|\theta)\,dx. \tag{2.5}$$

The case of $X_1$ and $X_2$ independent and uniformly distributed on $[0, \theta_0]$, $X = (X_1, X_2)$ and $Y = X_1 + X_2 = h(X)$ provides a good illustration of the problem. Here $\mathcal{X}(y)$ is the set of all pairs $(x_1, x_2)$ with $y = x_1 + x_2$; this subset of $\mathbb{R}^2$ has Lebesgue measure zero in $\mathbb{R}^2$. Now $f_{Y|X}(y|x, \theta)$ is a delta function, which we may view as $\delta(x_2 - (y - x_1))$, with the property that
$$\int g(x_2)\,\delta(x_2 - (y - x_1))\,dx_2 = g(y - x_1). \tag{2.6}$$
Then we can write
$$f_Y(y|\theta) = \int\!\!\int \delta(x_2 - (y - x_1))\,f_X(x_1, x_2|\theta)\,dx_2\,dx_1 = \int f_X(x_1, y - x_1|\theta)\,dx_1. \tag{2.7}$$

2.6 A Different Formulation

For any preferred data $X$ the EM algorithm involves two steps: given $y$ and the current estimate $\theta^k$, the E-step of the EM algorithm is to calculate
$$E(\log f_X(X|\theta)\,|\,y, \theta^k) = \int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx, \tag{2.8}$$
the conditional expected value of the random variable $\log f_X(X|\theta)$, given $y$ and $\theta^k$. Then the M-step is to maximize
$$\int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx \tag{2.9}$$
with respect to $\theta \in \Theta$ to obtain $\theta^{k+1}$. For the missing-data model $Z = (Y, W)$, the M-step is to maximize
$$\int f_{W|Y}(w|y, \theta^k)\,\log f_{Y,W}(y, w|\theta)\,dw. \tag{2.10}$$
We shall also show that for the missing-data model we always have $L_y(\theta^{k+1}) \geq L_y(\theta^k)$. For those cases involving continuous distributions in which $Y = h(X)$ the $X$ is acceptable, but we must define $f_{Y|X}(y|x)$ in terms of a delta function; we shall not consider such cases here.
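The two-step scheme in (2.8)-(2.9) can be expressed as a small generic driver; the following sketch is ours, not code from the paper. The caller supplies a routine that performs the E-step and M-step together, returning the maximizer of (2.9) for the current $\theta^k$, along with a routine evaluating $\log f_Y(y|\theta)$ so that the behaviour of the likelihood can be monitored.

```python
def em(update, loglik, theta0, max_iter=200, tol=1e-10):
    """Generic EM driver: 'update(theta_k)' carries out the E-step and M-step,
    returning the maximizer of (2.9); 'loglik(theta)' evaluates log f_Y(y|theta)."""
    theta = theta0
    history = [loglik(theta)]
    for _ in range(max_iter):
        theta = update(theta)
        history.append(loglik(theta))
        if abs(history[-1] - history[-2]) < tol:   # stop when the likelihood stalls
            break
    return theta, history
```

For the multinomial example above, `update` would be the one-line E-step/M-step combination already shown, and `loglik` the logarithm of (2.1).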

2.7 The Example of Finite Mixtures

We say that a random vector $V$ taking values in $\mathbb{R}^D$ is a finite mixture if, for $j = 1, \ldots, J$, $f_j$ is a probability density function or probability function, $\theta_j \geq 0$ is a weight, the $\theta_j$ sum to one, and the probability density function or probability function for $V$ is
$$f_V(v|\theta) = \sum_{j=1}^J \theta_j f_j(v). \tag{2.11}$$
The value of $D$ is unimportant and, for simplicity, we shall assume that $D = 1$.

We draw $N$ independent samples of $V$, denoted $v_n$, and let $y_n$, the $n$th entry of the vector $y$, be the number $v_n$. To create the preferred data we assume that, for each $n$, the number $v_n$ is a sample of the random variable $V_n$ whose pdf or pf is $f_{j_n}$, where the probability that $j_n = j$ is $\theta_j$. We then let the $N$ entries of the preferred data $X$ be the indices $j_n$. The conditional distribution of $Y$, given $X$, is clearly independent of the parameter vector $\theta$, and is given by
$$f_{Y|X}(y|x, \theta) = \prod_{n=1}^N f_{j_n}(y_n);$$
therefore, $X$ is acceptable. Note that we cannot recapture the entries of $y$ from those of $x$, so the model $Y = h(X)$ does not hold here. Note also that, although the vector $y$ is taken originally to be a vector whose entries are independently drawn samples from $V$, when we create the preferred data $X$ we change our view of $y$. Now each entry of $y$ is governed by a different distribution, so $y$ is no longer viewed as a vector of independent sample values of a single random variable.

3 The Missing-Data Model

For the missing-data model, the E-step of the EM algorithm is to calculate
$$E(\log f_Z(Z|\theta)\,|\,y, \theta^k) = \int f_{W|Y}(w|y, \theta^k)\,\log f_{Y,W}(y, w|\theta)\,dw. \tag{3.1}$$
Then the M-step is to maximize
$$\int f_{W|Y}(w|y, \theta^k)\,\log f_{Y,W}(y, w|\theta)\,dw \tag{3.2}$$
to get $\theta^{k+1}$. For the missing-data model we have the following result (see [14, 6]).

Proposition 3.1 The sequence $\{L_y(\theta^k)\}$ is non-decreasing, as $k \to +\infty$.

Proof: Begin with
$$\log f_Y(y|\theta) = \int f_{W|Y}(w|y, \theta^k)\,\log f_Y(y|\theta)\,dw.$$
Then
$$\log f_Y(y|\theta) = \int f_{W|Y}(w|y, \theta^k)\Big(\log f_{Y,W}(y, w|\theta) - \log f_{W|Y}(w|y, \theta)\Big)\,dw.$$
Therefore,
$$\log f_Y(y|\theta^{k+1}) - \log f_Y(y|\theta^k) = \Big(\int f_{W|Y}(w|y, \theta^k)\log f_{Y,W}(y, w|\theta^{k+1})\,dw - \int f_{W|Y}(w|y, \theta^k)\log f_{Y,W}(y, w|\theta^k)\,dw\Big) + \Big(\int f_{W|Y}(w|y, \theta^k)\log f_{W|Y}(w|y, \theta^k)\,dw - \int f_{W|Y}(w|y, \theta^k)\log f_{W|Y}(w|y, \theta^{k+1})\,dw\Big). \tag{3.3}$$
The first difference on the right side of Equation (3.3) is non-negative because of the M-step, while the second difference is non-negative because of Jensen's Inequality.

The fact that likelihood is not decreasing in the missing-data model suggests that we might try to use this model for arbitrary preferred data $X$, by defining the complete data to be $Z = (Y, X)$ and viewing $X$ as the missing data, that is, $W = X$. This approach works, provided that $X$ satisfies the acceptability condition.

4 The EM Algorithm for Acceptable X

For any preferred data $X$ the E-step of the EM algorithm is to calculate
$$E(\log f_X(X|\theta)\,|\,y, \theta^k) = \int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx. \tag{4.1}$$
Once we have $y$, the M-step is then to maximize
$$E(\log f_X(X|\theta)\,|\,y, \theta^k) = \int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx \tag{4.2}$$
to get $\theta = \theta^{k+1}$.

4.1 The Likelihood is Non-Decreasing

For the moment, let $X$ be any preferred data. We examine the behavior of the likelihood when the EM algorithm is applied to $X$.

As in the proof for the missing-data model, we begin with
$$\log f_Y(y|\theta) = \int f_{X|Y}(x|y, \theta^k)\,\log f_Y(y|\theta)\,dx.$$
Then
$$\log f_Y(y|\theta) = \int f_{X|Y}(x|y, \theta^k)\Big(\log f_{Y,X}(y, x|\theta) - \log f_{X|Y}(x|y, \theta)\Big)\,dx.$$
Therefore,
$$\log f_Y(y|\theta^{k+1}) - \log f_Y(y|\theta^k) = \Big(\int f_{X|Y}(x|y, \theta^k)\log f_{Y,X}(y, x|\theta^{k+1})\,dx - \int f_{X|Y}(x|y, \theta^k)\log f_{Y,X}(y, x|\theta^k)\,dx\Big) + \Big(\int f_{X|Y}(x|y, \theta^k)\log f_{X|Y}(x|y, \theta^k)\,dx - \int f_{X|Y}(x|y, \theta^k)\log f_{X|Y}(x|y, \theta^{k+1})\,dx\Big). \tag{4.3}$$
The second difference on the right side of Equation (4.3) is non-negative because of Jensen's Inequality. But we cannot assert that the first difference on the right is non-negative because, in the M-step, we maximize
$$\int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx,$$
not
$$\int f_{X|Y}(x|y, \theta^k)\,\log f_{Y,X}(y, x|\theta)\,dx.$$
If $X$ is acceptable, then
$$\log f_{Y,X}(y, x|\theta) - \log f_X(x|\theta) = \log f_{Y|X}(y|x) \tag{4.4}$$
is independent of $\theta$ and the difficulty disappears. We may then conclude that likelihood is non-decreasing for acceptable $X$.

4.2 Generalized EM Algorithms

If, instead of maximizing
$$\int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx$$
at each M-step, we simply select $\theta^{k+1}$ so that
$$\int f_{X|Y}(x|y, \theta^k)\,\log f_{Y,X}(y, x|\theta^{k+1})\,dx - \int f_{X|Y}(x|y, \theta^k)\,\log f_{Y,X}(y, x|\theta^k)\,dx > 0,$$
we say that we are using a generalized EM (GEM) algorithm. It is clear from the discussion in the previous subsection that, whenever $X$ is acceptable, a GEM also guarantees that likelihood is non-decreasing.
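As an illustration (our own sketch, reusing the genetics example of Section 2.4 rather than anything prescribed in the paper), a GEM step need only improve the function being maximized. Below, a damped ascent step on $Q(\theta|\theta^k) = E(\log f_X(X|\theta)\,|\,y, \theta^k)$, with backtracking until $Q$ does not decrease, replaces the exact M-step.

```python
import numpy as np

Y1, Y2, Y3, Y4 = 125, 18, 20, 34            # counts assumed only for illustration

def q(theta, theta_k):
    """Q(theta | theta_k) for the genetics example, up to terms constant in theta."""
    y12 = Y1 * (theta_k / 4) / (0.5 + theta_k / 4)      # E-step quantity at theta_k
    return (y12 + Y4) * np.log(theta) + (Y2 + Y3) * np.log(1 - theta)

def gem_step(theta_k, step=0.01):
    """One GEM step: move uphill on Q(. | theta_k), halving the step until Q does not decrease."""
    y12 = Y1 * (theta_k / 4) / (0.5 + theta_k / 4)
    grad = (y12 + Y4) / theta_k - (Y2 + Y3) / (1 - theta_k)
    while True:
        theta = float(np.clip(theta_k + step * grad, 1e-6, 1 - 1e-6))
        if q(theta, theta_k) >= q(theta_k, theta_k):
            return theta
        step /= 2
```

Since each step does not decrease $Q(\cdot\,|\,\theta^k)$, the likelihood is still non-decreasing for acceptable $X$, even though no exact maximization is performed.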

4.3 Preferred Data as Missing Data

We know that, when the missing-data model is used and the M-step is defined as maximizing the function in (3.2), the likelihood is not decreasing. It would seem then that, for any choice of preferred data $X$, we could view this data as missing and take as our complete data the pair $Z = (Y, X)$, with $X$ now playing the role of $W$. The function in (3.2) is then
$$\int f_{X|Y}(x|y, \theta^k)\,\log f_{Y,X}(y, x|\theta)\,dx; \tag{4.5}$$
we maximize this function to get $\theta^{k+1}$. It then follows that $L_y(\theta^{k+1}) \geq L_y(\theta^k)$.

The obvious question is whether or not these two functions given in (4.1) and (4.5) have the same maximizers. For acceptable $X$ we have
$$\log f_{Y,X}(y, x|\theta) = \log f_X(x|\theta) + \log f_{Y|X}(y|x), \tag{4.6}$$
so the two functions given in (4.1) and (4.5) do have the same maximizers. It follows once again that, whenever the preferred data is acceptable, we have $L_y(\theta^{k+1}) \geq L_y(\theta^k)$. Without additional assumptions, however, we cannot conclude that $\{\theta^k\}$ converges to $\theta_{ML}$, nor that $\{f_Y(y|\theta^k)\}$ converges to $f_Y(y|\theta_{ML})$.

In the discrete case in which $Y = h(X)$ the conditional probability $f_{Y|X}(y|x, \theta)$ is $\delta(y - h(x))$, as a function of $y$, for given $x$, and is the characteristic function of the set $\mathcal{X}(y)$, as a function of $x$, for given $y$. Therefore, we can write
$$f_{X|Y}(x|y, \theta) = \begin{cases} f_X(x|\theta)/f_Y(y|\theta), & \text{if } x \in \mathcal{X}(y); \\ 0, & \text{if } x \notin \mathcal{X}(y). \end{cases} \tag{4.7}$$
For the continuous case in which $Y = h(X)$, the pdf $f_{Y|X}(y|x, \theta)$ is again a delta function of $y$, for given $x$; the difficulty arises when we need to view this as a function of $x$, for given $y$. The acceptability property helps us avoid this difficulty.

When $X$ is acceptable, we have
$$f_{X|Y}(x|y, \theta) = f_{Y|X}(y|x)\,f_X(x|\theta)/f_Y(y|\theta), \tag{4.8}$$
whenever $f_Y(y|\theta) \neq 0$, and is zero otherwise. Consequently, when $X$ is acceptable, we have a kernel model for $f_Y(y|\theta)$ in terms of the $f_X(x|\theta)$:
$$f_Y(y|\theta) = \int f_{Y|X}(y|x)\,f_X(x|\theta)\,dx; \tag{4.9}$$
for the continuous case we view this as a corrected version of Equation (2.5).

In the discrete case the integral is replaced by a summation, of course, but when we are speaking generally about either case, we shall use the integral sign.

The acceptability of the missing data $W$ is used in [6], but more for computational convenience and to involve the Kullback-Leibler distance in the formulation of the EM algorithm. It is not necessary that $W$ be acceptable in order for likelihood to be non-decreasing, as we have seen.

5 The EM and the Kullback-Leibler Distance

We illustrate the usefulness of acceptability and reformulate the M-step in terms of cross-entropy or Kullback-Leibler distance minimization.

5.1 Cross-Entropy or the Kullback-Leibler Distance

The cross-entropy or Kullback-Leibler distance [16] is a useful tool for analyzing the EM algorithm. For positive numbers $u$ and $v$, the Kullback-Leibler distance from $u$ to $v$ is
$$KL(u, v) = u \log\frac{u}{v} + v - u. \tag{5.1}$$
We also define $KL(0, 0) = 0$, $KL(0, v) = v$ and $KL(u, 0) = +\infty$. The KL distance is extended to nonnegative vectors component-wise, so that for nonnegative vectors $a$ and $b$ we have
$$KL(a, b) = \sum_{j=1}^J KL(a_j, b_j). \tag{5.2}$$
One of the most useful facts about the KL distance is contained in the following lemma.

Lemma 5.1 For non-negative vectors $a$ and $b$, with $b_+ = \sum_{j=1}^J b_j > 0$, we have
$$KL(a, b) = KL(a_+, b_+) + KL\Big(a, \frac{a_+}{b_+}b\Big). \tag{5.3}$$
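For concreteness, here is a small helper (our sketch, not code from the paper) implementing the component-wise KL distance of (5.1)-(5.2), together with a numerical check of Lemma 5.1; the particular vectors used are arbitrary.

```python
import numpy as np

def kl(a, b):
    """Component-wise Kullback-Leibler distance (5.1)-(5.2): sum of a*log(a/b) + b - a,
    with the conventions KL(0, v) = v and KL(u, 0) = +infinity."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(a > 0, a * np.log(a / b) + b - a, b)
    return float(terms.sum())

# Numerical check of Lemma 5.1: KL(a, b) = KL(a_+, b_+) + KL(a, (a_+/b_+) b).
a, b = np.array([0.2, 1.3, 0.5]), np.array([0.4, 0.9, 1.1])
lhs = kl(a, b)
rhs = kl(a.sum(), b.sum()) + kl(a, (a.sum() / b.sum()) * b)
assert abs(lhs - rhs) < 1e-12
```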

5.2 Using Acceptable Data

The assumption that the data $X$ is acceptable helps simplify the theoretical discussion of the EM algorithm.

For any preferred $X$ the M-step of the EM algorithm, in the continuous case, is to maximize the function
$$\int f_{X|Y}(x|y, \theta^k)\,\log f_X(x|\theta)\,dx \tag{5.4}$$
over $\theta \in \Theta$; the integral is replaced by a sum in the discrete case. For notational convenience we let
$$b(\theta^k) = f_{X|Y}(x|y, \theta^k), \tag{5.5}$$
and
$$f(\theta) = f_X(x|\theta); \tag{5.6}$$
both functions are functions of the vector variable $x$. Then the M-step is equivalent to minimizing the Kullback-Leibler or cross-entropy distance
$$KL(b(\theta^k), f(\theta)) = \int f_{X|Y}(x|y, \theta^k)\,\log\Big(\frac{f_{X|Y}(x|y, \theta^k)}{f_X(x|\theta)}\Big)\,dx = \int \Big[f_{X|Y}(x|y, \theta^k)\,\log\Big(\frac{f_{X|Y}(x|y, \theta^k)}{f_X(x|\theta)}\Big) + f_X(x|\theta) - f_{X|Y}(x|y, \theta^k)\Big]\,dx. \tag{5.7}$$
This holds since both $f_X(x|\theta)$ and $f_{X|Y}(x|y, \theta^k)$ are probability density functions or probabilities.

For acceptable $X$ we have
$$\log f_{Y,X}(y, x|\theta) = \log f_{X|Y}(x|y, \theta) + \log f_Y(y|\theta) = \log f_{Y|X}(y|x) + \log f_X(x|\theta). \tag{5.8}$$
Therefore,
$$\log f_Y(y|\theta^{k+1}) - \log f_Y(y|\theta) = KL(b(\theta^k), f(\theta)) - KL(b(\theta^k), f(\theta^{k+1})) + KL(b(\theta^k), b(\theta^{k+1})) - KL(b(\theta^k), b(\theta)). \tag{5.9}$$
Since $\theta = \theta^{k+1}$ minimizes $KL(b(\theta^k), f(\theta))$, we have that
$$\log f_Y(y|\theta^{k+1}) - \log f_Y(y|\theta^k) = KL(b(\theta^k), f(\theta^k)) - KL(b(\theta^k), f(\theta^{k+1})) + KL(b(\theta^k), b(\theta^{k+1})) \geq 0. \tag{5.10}$$

This tells us, once again, that the sequence of likelihood values $\{\log f_Y(y|\theta^k)\}$ is increasing, and that the sequence of its negatives, $\{-\log f_Y(y|\theta^k)\}$, is decreasing. Since we assume that there is a maximizer $\theta_{ML}$ of the likelihood, the sequence $\{-\log f_Y(y|\theta^k)\}$ is also bounded below, and the sequences $\{KL(b(\theta^k), b(\theta^{k+1}))\}$ and $\{KL(b(\theta^k), f(\theta^k)) - KL(b(\theta^k), f(\theta^{k+1}))\}$ converge to zero. Without some notion of convergence in the parameter space $\Theta$, we cannot conclude that $\{\theta^k\}$ converges to a maximum likelihood estimate $\theta_{ML}$. Without some additional assumptions, we cannot even conclude that the functions $f(\theta^k)$ converge to $f(\theta_{ML})$.

6 The Approach of Csiszár and Tusnády

For acceptable $X$ the M-step of the EM algorithm is to minimize the function $KL(b(\theta^k), f(\theta))$ over $\theta \in \Theta$ to get $\theta^{k+1}$. To put the EM algorithm into the framework of the alternating minimization approach of Csiszár and Tusnády [8], we need to view the M-step in a slightly different way; the problem is that, for the continuous case, having found $\theta^{k+1}$, we do not then minimize $KL(b(\theta), f(\theta^{k+1}))$ at the next step.

6.1 The Framework of Csiszár and Tusnády

Following [8], we take $\Psi(p, q)$ to be a real-valued function of the variables $p \in P$ and $q \in Q$, where $P$ and $Q$ are arbitrary sets. Minimizing $\Psi(p, q^n)$ gives $p^{n+1}$ and minimizing $\Psi(p^{n+1}, q)$ gives $q^{n+1}$, so that
$$\Psi(p^n, q^n) \geq \Psi(p^{n+1}, q^n) \geq \Psi(p^{n+1}, q^{n+1}). \tag{6.1}$$
The objective is to find $(\hat p, \hat q)$ such that
$$\Psi(p, q) \geq \Psi(\hat p, \hat q),$$
for all $p$ and $q$. In order to show that $\{\Psi(p^n, q^n)\}$ converges to
$$d = \inf_{p \in P,\, q \in Q} \Psi(p, q),$$
the authors of [8] assume the three- and four-point properties. If there is a non-negative function $\Delta : P \times P \to \mathbb{R}$ such that
$$\Psi(p, q^{n+1}) - \Psi(p^{n+1}, q^{n+1}) \geq \Delta(p, p^{n+1}), \tag{6.2}$$
then the three-point property holds. If
$$\Delta(p, p^n) + \Psi(p, q) \geq \Psi(p, q^{n+1}), \tag{6.3}$$
for all $p$ and $q$, then the four-point property holds. Combining these two inequalities, we have
$$\Delta(p, p^n) - \Delta(p, p^{n+1}) \geq \Psi(p^{n+1}, q^{n+1}) - \Psi(p, q). \tag{6.4}$$

From the inequality in (6.4) it follows easily that the sequence $\{\Psi(p^n, q^n)\}$ converges to $d$. Suppose this is not the case. Then there are $p'$, $q'$, and $D > d$ with
$$\Psi(p^n, q^n) \geq D > \Psi(p', q') \geq d.$$

From Equation (6.4) we have
$$\Delta(p', p^n) - \Delta(p', p^{n+1}) \geq \Psi(p^{n+1}, q^{n+1}) - \Psi(p', q') \geq D - \Psi(p', q') > 0.$$
But since $\{\Delta(p', p^n)\}$ is a decreasing sequence of positive quantities, successive differences must converge to zero; that is, $\{\Psi(p^{n+1}, q^{n+1})\}$ must converge to $\Psi(p', q')$, which is a contradiction.

The five-point property of [8] is obtained by combining (6.2) and (6.3):
$$\Psi(p, q) + \Psi(p, q^{n-1}) \geq \Psi(p, q^n) + \Psi(p^n, q^{n-1}). \tag{6.5}$$
Note that the five-point property does not involve the second function $\Delta(p', p)$. However, assuming that the five-point property holds, it is possible to define $\Delta(p', p)$ so that both the three- and four-point properties hold. Assuming the five-point property, we have
$$\Psi(p, q^{n-1}) - \Psi(p, q^n) \geq \Psi(p^n, q^n) - \Psi(p, q), \tag{6.6}$$
from which we can show easily that $\{\Psi(p^n, q^n)\}$ converges to $d$.

6.2 Alternating Minimization for the EM Algorithm

Assume that $X$ is acceptable. We define the function $G(\theta)$ to be
$$G(\theta) = \int f_{X|Y}(x|y, \theta)\,\log f_{Y|X}(y|x)\,dx, \tag{6.7}$$
for the continuous case, with a sum replacing the integral for the discrete case. Using the identities
$$f_{Y,X}(y, x|\theta) = f_{X|Y}(x|y, \theta)\,f_Y(y|\theta) = f_{Y|X}(y|x, \theta)\,f_X(x|\theta) = f_{Y|X}(y|x)\,f_X(x|\theta),$$
we then have
$$\log f_Y(y|\theta) = G(\theta') + KL(b(\theta'), b(\theta)) - KL(b(\theta'), f(\theta)), \tag{6.8}$$
for any parameter values $\theta$ and $\theta'$. With the choice of $\theta' = \theta$ we have
$$\log f_Y(y|\theta) = G(\theta) - KL(b(\theta), f(\theta)). \tag{6.9}$$
Therefore, subtracting Equation (6.9) from Equation (6.8), we get
$$\Big(KL(b(\theta'), f(\theta)) - G(\theta')\Big) - \Big(KL(b(\theta), f(\theta)) - G(\theta)\Big) = KL(b(\theta'), b(\theta)). \tag{6.10}$$

Now we can put the EM algorithm into the alternating-minimization framework. Define
$$\Psi(b(\theta'), f(\theta)) = KL(b(\theta'), f(\theta)) - G(\theta'). \tag{6.11}$$
We know from Equation (6.10) that
$$\Psi(b(\theta'), f(\theta)) - \Psi(b(\theta), f(\theta)) = KL(b(\theta'), b(\theta)). \tag{6.12}$$
Therefore, we can say that the M-step of the EM algorithm is to minimize $\Psi(b(\theta^k), f(\theta))$ over $\theta \in \Theta$ to get $\theta^{k+1}$ and that minimizing $\Psi(b(\theta), f(\theta^{k+1}))$ gives us $\theta = \theta^{k+1}$ again. With the choice of $\Delta(b(\theta'), b(\theta)) = KL(b(\theta'), b(\theta))$, Equation (6.12) becomes
$$\Psi(b(\theta'), f(\theta)) - \Psi(b(\theta), f(\theta)) = \Delta(b(\theta'), b(\theta)), \tag{6.13}$$
which is the three-point property.

With $P = \mathcal{B}(\Theta)$ and $Q = \mathcal{F}(\Theta)$ the collections of all functions $b(\theta)$ and $f(\theta)$, respectively, we can view the EM algorithm as alternating minimization of the function $\Psi(p, q)$, over $p \in P$ and $q \in Q$. As we have seen, the three-point property holds. What about the four-point property?

The Kullback-Leibler distance is an example of a jointly convex Bregman distance. According to a lemma of Eggermont and LaRiccia [10, 11], the four-point property holds for alternating minimization of such distances, using $\Delta(p', p) = KL(p', p)$, provided that the objects that can occur in the second-variable position form a convex subset of $\mathbb{R}^N$. In the continuous case of the EM algorithm, we are not performing alternating minimization on the function $KL(b(\theta), f(\theta'))$, but on $KL(b(\theta), f(\theta')) - G(\theta)$. In the discrete case, whenever $Y = h(X)$, the function $G(\theta)$ is always zero, so we are performing alternating minimization on the KL distance $KL(b(\theta), f(\theta'))$.

In [2] the authors consider the problem of minimizing a function of the form
$$\Lambda(p, q) = \phi(p) + \psi(q) + D_g(p, q), \tag{6.14}$$
where $\phi$ and $\psi$ are convex and differentiable on $\mathbb{R}^J$, $D_g$ is a Bregman distance, and $P = Q$ is the interior of the domain of $g$. In [7] it was shown that, when $D_g$ is jointly convex, the function $\Lambda(p, q)$ has the five-point property of [8], which is equivalent to the three- and four-point properties taken together. In some particular instances, the collection of the functions $f(\theta)$ is a convex subset of $\mathbb{R}^J$, as well, so the three- and four-point properties hold.
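A toy instance of the scheme (6.1) may help fix ideas; this is our own illustration, not the EM case itself. Take $\Psi(p, q) = \|p - q\|^2$, with $P$ a line segment and $Q$ a disk, so that each partial minimization is a Euclidean projection; the values $\Psi(p^n, q^n)$ then never increase.

```python
import numpy as np

def proj_segment(z, a=np.array([-2.0, 1.0]), b=np.array([2.0, 1.0])):
    """Minimize ||p - z||^2 over the segment P from a to b (closed-form projection)."""
    t = np.clip(np.dot(z - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
    return a + t * (b - a)

def proj_disk(z, center=np.array([0.0, -1.0]), radius=1.0):
    """Minimize ||z - q||^2 over the disk Q (closed-form projection)."""
    d = z - center
    n = np.linalg.norm(d)
    return z.copy() if n <= radius else center + radius * d / n

p, q = np.array([2.0, 1.0]), np.array([0.0, 0.0])
values = []
for _ in range(25):
    p = proj_segment(q)        # p^{n+1} minimizes Psi(p, q^n)
    q = proj_disk(p)           # q^{n+1} minimizes Psi(p^{n+1}, q)
    values.append(float(np.sum((p - q) ** 2)))
# 'values' is non-increasing, as guaranteed by (6.1).
```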

As we saw previously, to have $\Psi(p^n, q^n)$ converging to $d$, it is sufficient that the five-point property hold. It is conceivable, then, that the five-point property may hold for Bregman distances under somewhat more general conditions than those employed in the Eggermont-LaRiccia Lemma. The five-point property for the EM case is the following:
$$KL(b(\theta), f(\theta^k)) - KL(b(\theta), f(\theta^{k+1})) \geq \Big(KL(b(\theta^k), f(\theta^k)) - G(\theta^k)\Big) - \Big(KL(b(\theta), f(\theta)) - G(\theta)\Big). \tag{6.15}$$

7 Sums of Independent Poisson Random Variables

The EM is often used with aggregated data. The case of sums of independent Poisson random variables is particularly important.

7.1 Poisson Sums

Let $X_1, \ldots, X_N$ be independent Poisson random variables with expected value $E(X_n) = \lambda_n$. Let $X$ be the random vector with the $X_n$ as its entries, $\lambda$ the vector whose entries are the $\lambda_n$, and $\lambda_+ = \sum_{n=1}^N \lambda_n$. Then the probability function for $X$ is
$$f_X(x|\lambda) = \prod_{n=1}^N \lambda_n^{x_n} \exp(-\lambda_n)/x_n! = \exp(-\lambda_+)\prod_{n=1}^N \lambda_n^{x_n}/x_n!. \tag{7.1}$$
Now let $Y = \sum_{n=1}^N X_n$. Then the probability function for $Y$ is
$$\mathrm{Prob}(Y = y) = \mathrm{Prob}(X_1 + \cdots + X_N = y) = \sum_{x_1 + \cdots + x_N = y} \exp(-\lambda_+)\prod_{n=1}^N \lambda_n^{x_n}/x_n!. \tag{7.2}$$
As we shall see shortly, we have
$$\sum_{x_1 + \cdots + x_N = y} \exp(-\lambda_+)\prod_{n=1}^N \lambda_n^{x_n}/x_n! = \exp(-\lambda_+)\lambda_+^y/y!. \tag{7.3}$$
Therefore, $Y$ is a Poisson random variable with $E(Y) = \lambda_+$.

When we observe an instance of $Y$, we can consider the conditional distribution $f_{X|Y}(x|y, \lambda)$ of $\{X_1, \ldots, X_N\}$, subject to $y = X_1 + \cdots + X_N$. We have
$$f_{X|Y}(x|y, \lambda) = \frac{y!}{x_1! \cdots x_N!}\Big(\frac{\lambda_1}{\lambda_+}\Big)^{x_1}\cdots\Big(\frac{\lambda_N}{\lambda_+}\Big)^{x_N}. \tag{7.4}$$

This is a multinomial distribution. Given $y$ and $\lambda$, the conditional expected value of $X_n$ is then $E(X_n|y, \lambda) = y\lambda_n/\lambda_+$. To see why this is true, consider the marginal conditional distribution $f_{X_1|Y}(x_1|y, \lambda)$ of $X_1$, conditioned on $y$ and $\lambda$, which we obtain by holding $x_1$ fixed and summing over the remaining variables. We have
$$f_{X_1|Y}(x_1|y, \lambda) = \frac{y!}{x_1!(y - x_1)!}\Big(\frac{\lambda_1}{\lambda_+}\Big)^{x_1}\Big(\frac{\lambda_+'}{\lambda_+}\Big)^{y - x_1}\sum_{x_2 + \cdots + x_N = y - x_1}\frac{(y - x_1)!}{x_2! \cdots x_N!}\prod_{n=2}^N\Big(\frac{\lambda_n}{\lambda_+'}\Big)^{x_n},$$
where $\lambda_+' = \lambda_+ - \lambda_1$. As we shall show shortly,
$$\sum_{x_2 + \cdots + x_N = y - x_1}\frac{(y - x_1)!}{x_2! \cdots x_N!}\prod_{n=2}^N\Big(\frac{\lambda_n}{\lambda_+'}\Big)^{x_n} = 1,$$
so that
$$f_{X_1|Y}(x_1|y, \lambda) = \frac{y!}{x_1!(y - x_1)!}\Big(\frac{\lambda_1}{\lambda_+}\Big)^{x_1}\Big(\frac{\lambda_+'}{\lambda_+}\Big)^{y - x_1}.$$
The random variable $X_1$ is equivalent to the random number of heads showing in $y$ flips of a coin, with the probability of heads given by $\lambda_1/\lambda_+$. Consequently, the conditional expected value of $X_1$ is $y\lambda_1/\lambda_+$, as claimed. In the next subsection we look more closely at the multinomial distribution.
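A quick Monte Carlo check of this claim (our own sketch; the values of $\lambda$ and $y$ are arbitrary) conditions simulated Poisson vectors on their sum and compares the empirical mean of $X_1$ with $y\lambda_1/\lambda_+$.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([1.0, 2.0, 3.0])              # arbitrary Poisson means
lam_plus = lam.sum()
y = 6                                        # condition on X_1 + X_2 + X_3 = y

draws = rng.poisson(lam, size=(200_000, lam.size))
kept = draws[draws.sum(axis=1) == y]         # keep only realizations consistent with the observed sum
empirical = kept[:, 0].mean()                # close to y * lam[0] / lam_plus = 1.0
```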

7.2 The Multinomial Distribution

When we expand the quantity $(a_1 + \cdots + a_N)^y$, we obtain a sum of terms, each having the form $a_1^{x_1}\cdots a_N^{x_N}$, with $x_1 + \cdots + x_N = y$. How many terms of the same form are there?

There are $N$ variables $a_n$. We are to use $x_n$ of the $a_n$, for each $n = 1, \ldots, N$, to get $y = x_1 + \cdots + x_N$ factors. Imagine $y$ blank spaces, each to be filled in by a variable as we do the selection. We select $x_1$ of these blanks and mark them $a_1$. We can do that in $\binom{y}{x_1}$ ways. We then select $x_2$ of the remaining blank spaces and enter $a_2$ in them; we can do this in $\binom{y - x_1}{x_2}$ ways. Continuing in this way, we find that we can select the $N$ factor types in
$$\binom{y}{x_1}\binom{y - x_1}{x_2}\cdots\binom{y - (x_1 + \cdots + x_{N-2})}{x_{N-1}} \tag{7.5}$$
ways, or in
$$\frac{y!}{x_1!(y - x_1)!}\cdot\frac{(y - x_1)!}{x_2!(y - x_1 - x_2)!}\cdots\frac{(y - (x_1 + \cdots + x_{N-2}))!}{x_{N-1}!(y - (x_1 + \cdots + x_{N-1}))!} = \frac{y!}{x_1! \cdots x_N!} \tag{7.6}$$
ways. This tells us in how many different sequences the factor variables can be selected. Applying this, we get the multinomial theorem:
$$(a_1 + \cdots + a_N)^y = \sum_{x_1 + \cdots + x_N = y}\frac{y!}{x_1! \cdots x_N!}a_1^{x_1}\cdots a_N^{x_N}. \tag{7.7}$$
Select $a_n = \lambda_n/\lambda_+$. Then,
$$1 = 1^y = \Big(\frac{\lambda_1}{\lambda_+} + \cdots + \frac{\lambda_N}{\lambda_+}\Big)^y = \sum_{x_1 + \cdots + x_N = y}\frac{y!}{x_1! \cdots x_N!}\Big(\frac{\lambda_1}{\lambda_+}\Big)^{x_1}\cdots\Big(\frac{\lambda_N}{\lambda_+}\Big)^{x_N}. \tag{7.8}$$
From this we get
$$\sum_{x_1 + \cdots + x_N = y}\exp(-\lambda_+)\prod_{n=1}^N\lambda_n^{x_n}/x_n! = \exp(-\lambda_+)\lambda_+^y/y!. \tag{7.9}$$

8 Poisson Sums in Emission Tomography

Sums of Poisson random variables and the problem of complete versus incomplete data arise in single-photon emission computed tomography (SPECT) (Wernick and Aarsvold (2004) [28]).

8.1 The SPECT Reconstruction Problem

In their 1976 paper Rockmore and Macovski [25] suggested that the problem of reconstructing a tomographic image be viewed as statistical parameter estimation. Shepp and Vardi (1982) [26] expanded on this idea and suggested that the EM algorithm discussed by Dempster, Laird and Rubin (1977) [9] be used for the reconstruction.

The region of interest within the body of the patient is discretized into $J$ pixels (or voxels), with $\lambda_j \geq 0$ the unknown amount of radionuclide within the $j$th pixel; we assume that $\lambda_j$ is also the expected number of photons emitted from the $j$th pixel during the scanning time. Emitted photons are detected at any one of $I$ detectors outside the body, with $y_i > 0$ the photon count at the $i$th detector. The probability that a photon emitted at the $j$th pixel will be detected at the $i$th detector is $P_{ij}$, which we assume is known; the overall probability of detecting a photon emitted from the $j$th pixel is $s_j = \sum_{i=1}^I P_{ij} > 0$.

8.1.1 The Preferred Data

For each $i$ and $j$ the random variable $X_{ij}$ is the number of photons emitted from the $j$th pixel and detected at the $i$th detector; the $X_{ij}$ are assumed to be independent and $P_{ij}\lambda_j$-Poisson. With $x_{ij}$ a realization of $X_{ij}$, the vector $x$ with components $x_{ij}$ is our preferred data. The probability function for this preferred data is
$$f_X(x|\lambda) = \prod_{i=1}^I\prod_{j=1}^J \exp(-P_{ij}\lambda_j)(P_{ij}\lambda_j)^{x_{ij}}/x_{ij}!. \tag{8.1}$$
Given an estimate $\lambda^k$ of the vector $\lambda$ and the restriction that $Y_i = \sum_{j=1}^J X_{ij}$, the random variables $X_{i1}, \ldots, X_{iJ}$ have the multinomial distribution
$$\mathrm{Prob}(x_{i1}, \ldots, x_{iJ}) = \frac{y_i!}{x_{i1}! \cdots x_{iJ}!}\prod_{j=1}^J\Big(\frac{P_{ij}\lambda_j}{(P\lambda)_i}\Big)^{x_{ij}}.$$
Therefore, the conditional expected value of $X_{ij}$, given $y$ and $\lambda^k$, is
$$E(X_{ij}|y, \lambda^k) = \lambda_j^k P_{ij}\Big(\frac{y_i}{(P\lambda^k)_i}\Big),$$
and the conditional expected value of the random variable
$$\log f_X(X|\lambda) = \sum_{i=1}^I\sum_{j=1}^J\Big(-P_{ij}\lambda_j + X_{ij}\log(P_{ij}\lambda_j)\Big) + \text{constants}$$
becomes
$$E(\log f_X(X|\lambda)\,|\,y, \lambda^k) = \sum_{i=1}^I\sum_{j=1}^J\Big(-P_{ij}\lambda_j + \lambda_j^k P_{ij}\Big(\frac{y_i}{(P\lambda^k)_i}\Big)\log(P_{ij}\lambda_j)\Big),$$
omitting terms that do not involve the parameter vector $\lambda$. In the EM algorithm, we obtain the next estimate $\lambda^{k+1}$ by maximizing $E(\log f_X(X|\lambda)\,|\,y, \lambda^k)$.

The log likelihood function for the preferred data $X$ (omitting constants) is
$$LL_x(\lambda) = \sum_{i=1}^I\sum_{j=1}^J\Big(-P_{ij}\lambda_j + X_{ij}\log(P_{ij}\lambda_j)\Big). \tag{8.2}$$
Of course, we do not have the complete data.

8.1.2 The Incomplete Data

What we do have are the $y_i$, values of the random variables
$$Y_i = \sum_{j=1}^J X_{ij}; \tag{8.3}$$
this is the given data.

These random variables are also independent and $(P\lambda)_i$-Poisson, where
$$(P\lambda)_i = \sum_{j=1}^J P_{ij}\lambda_j.$$
The log likelihood function for the given data is
$$LL_y(\lambda) = \sum_{i=1}^I\Big(-(P\lambda)_i + y_i\log((P\lambda)_i)\Big). \tag{8.4}$$
Maximizing $LL_x(\lambda)$ in Equation (8.2) is easy, while maximizing $LL_y(\lambda)$ in Equation (8.4) is harder and requires an iterative method.

The EM algorithm involves two steps: in the E-step we compute the conditional expected value of $LL_x(\lambda)$, conditioned on the data vector $y$ and the current estimate $\lambda^k$ of $\lambda$; in the M-step we maximize this conditional expected value to get the next $\lambda^{k+1}$. Putting these two steps together, we have the following EMML iteration:
$$\lambda_j^{k+1} = \lambda_j^k s_j^{-1}\sum_{i=1}^I P_{ij}\frac{y_i}{(P\lambda^k)_i}. \tag{8.5}$$
For any positive starting vector $\lambda^0$, the sequence $\{\lambda^k\}$ converges to a maximizer of $LL_y(\lambda)$, over all non-negative vectors $\lambda$.

Note that, because we are dealing with finite probability vectors in this example, it is a simple matter to conclude that
$$f_Y(y|\lambda) = \sum_{x \in \mathcal{X}(y)} f_X(x|\lambda). \tag{8.6}$$
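The multiplicative update (8.5) is simple to implement with matrix-vector products; the following sketch is ours, and the small system used to exercise it is made up purely for illustration.

```python
import numpy as np

def emml(P, y, lam0, n_iter=1000):
    """EMML iteration (8.5): lam_j <- lam_j * s_j^{-1} * sum_i P_ij * y_i / (P lam)_i,
    where s_j = sum_i P_ij."""
    s = P.sum(axis=0)                         # s_j, assumed positive
    lam = np.array(lam0, dtype=float)
    for _ in range(n_iter):
        lam = lam / s * (P.T @ (y / (P @ lam)))
    return lam

# A made-up small system, only to exercise the iteration (noise-free data keeps y_i > 0).
rng = np.random.default_rng(3)
P = rng.random((8, 5))
lam_true = rng.random(5) + 0.1
y = P @ lam_true
lam_hat = emml(P, y, np.ones(5))              # approaches a non-negative solution of y = P lam
```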

8.2 Using the KL Distance

In this subsection we assume, for notational convenience, that the system $y = P\lambda$ has been normalized so that $s_j = 1$ for each $j$. Maximizing $E(\log f_X(X|\lambda)\,|\,y, \lambda^k)$ is equivalent to minimizing $KL(r(\lambda^k), q(\lambda))$, where $r(\lambda)$ and $q(\lambda)$ are $I$ by $J$ arrays with entries
$$r(\lambda)_{ij} = \lambda_j P_{ij}\Big(\frac{y_i}{(P\lambda)_i}\Big),$$
and
$$q(\lambda)_{ij} = \lambda_j P_{ij}.$$
In terms of our previous notation we identify $r(\lambda)$ with $b(\theta)$, and $q(\lambda)$ with $f(\theta)$. The set $\mathcal{F}(\Theta)$ of all $f(\theta)$ is now a convex set and the four-point property of [8] holds. The iterative step of the EMML algorithm is then
$$\lambda_j^{k+1} = \lambda_j^k\sum_{i=1}^I P_{ij}\frac{y_i}{(P\lambda^k)_i}. \tag{8.7}$$
The sequence $\{\lambda^k\}$ converges to a maximizer $\lambda_{ML}$ of the likelihood for any positive starting vector.

As we noted previously, before we can discuss the possible convergence of the sequence $\{\lambda^k\}$ of parameter vectors to a maximizer of the likelihood, it is necessary to have a notion of convergence in the parameter space. For the problem in this section, the parameter vectors $\lambda$ are non-negative. Proof of convergence of the sequence $\{\lambda^k\}$ depends heavily on the following [4]:
$$KL(y, P\lambda^k) - KL(y, P\lambda^{k+1}) = KL(r(\lambda^k), r(\lambda^{k+1})) + KL(\lambda^{k+1}, \lambda^k); \tag{8.8}$$
and
$$KL(\lambda_{ML}, \lambda^k) - KL(\lambda_{ML}, \lambda^{k+1}) \geq KL(y, P\lambda^k) - KL(y, P\lambda_{ML}). \tag{8.9}$$
Any likelihood maximizer $\lambda_{ML}$ is also a non-negative minimizer of the KL distance $KL(y, P\lambda)$, so the EMML algorithm can be thought of as a method for finding a non-negative solution (or approximate solution) for a system $y = P\lambda$ of linear equations in which $y_i > 0$ and $P_{ij} \geq 0$ for all indices. This will be helpful when we consider mixture problems.

9 Finite Mixture Problems

Estimating the combining proportions in probabilistic mixture problems shows that there are meaningful examples of our acceptable-data model, and provides important applications of likelihood maximization.

9.1 Mixtures

We say that a random vector $V$ taking values in $\mathbb{R}^D$ is a finite mixture (see [12, 24]) if there are probability density functions or probabilities $f_j$ and numbers $\theta_j \geq 0$, for $j = 1, \ldots, J$, such that the probability density function or probability function for $V$ has the form
$$f_V(v|\theta) = \sum_{j=1}^J \theta_j f_j(v), \tag{9.1}$$
for some choice of the $\theta_j \geq 0$ with $\sum_{j=1}^J \theta_j = 1$. As previously, we shall assume, without loss of generality, that $D = 1$.

9.2 The Likelihood Function

The data are $N$ realizations of the random variable $V$, denoted $v_n$, for $n = 1, \ldots, N$, and the given data is the vector $y = (v_1, \ldots, v_N)$. The column vector $\theta = (\theta_1, \ldots, \theta_J)^T$ is the generic parameter vector of mixture combining proportions. The likelihood function is
$$L_y(\theta) = \prod_{n=1}^N\Big(\theta_1 f_1(v_n) + \cdots + \theta_J f_J(v_n)\Big). \tag{9.2}$$
Then the log likelihood function is
$$LL_y(\theta) = \sum_{n=1}^N\log\Big(\theta_1 f_1(v_n) + \cdots + \theta_J f_J(v_n)\Big).$$
With $u$ the column vector with entries $u_n = 1/N$, and $P$ the matrix with entries $P_{nj} = f_j(v_n)$, we define
$$s_j = \sum_{n=1}^N P_{nj} = \sum_{n=1}^N f_j(v_n).$$
Maximizing $LL_y(\theta)$ is equivalent to minimizing
$$F(\theta) = KL(u, P\theta) + \sum_{j=1}^J(1 - s_j)\theta_j. \tag{9.3}$$

9.3 A Motivating Illustration

To motivate such mixture problems, we imagine that each data value is generated by first selecting one value of $j$, with probability $\theta_j$, and then selecting a realization of a random variable governed by $f_j(v)$. For example, there could be $J$ bowls of colored marbles, and we randomly select a bowl, and then randomly select a marble within the selected bowl. For each $n$ the number $v_n$ is the numerical code for the color of the $n$th marble drawn. In this illustration we are using a mixture of probability functions, but we could have used probability density functions.
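The following sketch (ours; the Gaussian components and the parameter values are assumptions made only for illustration, not part of the paper) generates data in exactly this way, keeping the bowl labels that will serve as the acceptable data in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(0)
J, N = 3, 2000
theta_true = np.array([0.5, 0.3, 0.2])        # mixing proportions
means = np.array([-2.0, 0.0, 3.0])            # component parameters (Gaussian f_j, assumed)

labels = rng.choice(J, size=N, p=theta_true)  # which "bowl" each draw comes from: x_n = j_n
v = rng.normal(means[labels], 1.0)            # the observed data y = (v_1, ..., v_N)

# If the labels were observed, the ML estimate of theta_j would simply be the proportions:
theta_from_labels = np.bincount(labels, minlength=J) / N
```

The next subsection makes this precise: once the labels are in hand, the values $v_n$ play no role in estimating the $\theta_j$.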

9.4 The Acceptable Data

We approach the mixture problem by creating acceptable data. We imagine that we could have obtained $x_n = j_n$, for $n = 1, \ldots, N$, where the selection of $v_n$ is governed by the function $f_{j_n}(v)$. In the bowls example, $j_n$ is the number of the bowl from which the $n$th marble is drawn. The acceptable-data random vector is $X = (X_1, \ldots, X_N)$, where the $X_n$ are independent random variables taking values in the set $\{j = 1, \ldots, J\}$. The value $j_n$ is one realization of $X_n$. Since our objective is to estimate the true $\theta_j$, the values $v_n$ are now irrelevant. Our ML estimate of the true $\theta_j$ is simply the proportion of times $j = j_n$. Given a realization $x$ of $X$, the conditional pdf or pf of $Y$ does not involve the mixing proportions, so $X$ is acceptable. Notice also that it is not possible to calculate the entries of $y$ from those of $x$; the model $Y = h(X)$ does not hold.

9.5 The Mix-EM Algorithm

Using this acceptable data, we derive the EM algorithm, which we call the Mix-EM algorithm. With $N_j$ denoting the number of times the value $j$ occurs as an entry of $x$, the likelihood function for $X$ is
$$L_x(\theta) = f_X(x|\theta) = \prod_{j=1}^J\theta_j^{N_j}, \tag{9.4}$$
and the log likelihood is
$$LL_x(\theta) = \log L_x(\theta) = \sum_{j=1}^J N_j\log\theta_j. \tag{9.5}$$
Then
$$E(\log L_x(\theta)\,|\,y, \theta^k) = \sum_{j=1}^J E(N_j|y, \theta^k)\log\theta_j. \tag{9.6}$$
To simplify the calculations in the E-step we rewrite $LL_x(\theta)$ as
$$LL_x(\theta) = \sum_{n=1}^N\sum_{j=1}^J X_{nj}\log\theta_j, \tag{9.7}$$
where $X_{nj} = 1$ if $j = j_n$ and zero otherwise. Then we have
$$E(X_{nj}|y, \theta^k) = \mathrm{Prob}(X_{nj} = 1|y, \theta^k) = \frac{\theta_j^k f_j(v_n)}{f(v_n|\theta^k)}. \tag{9.8}$$
The function $E(LL_x(\theta)\,|\,y, \theta^k)$ becomes
$$E(LL_x(\theta)\,|\,y, \theta^k) = \sum_{n=1}^N\sum_{j=1}^J\frac{\theta_j^k f_j(v_n)}{f(v_n|\theta^k)}\log\theta_j. \tag{9.9}$$

Maximizing with respect to $\theta$, we get the iterative step of the Mix-EM algorithm:
$$\theta_j^{k+1} = \frac{1}{N}\theta_j^k\sum_{n=1}^N\frac{f_j(v_n)}{f(v_n|\theta^k)}. \tag{9.10}$$
We know from our previous discussions that, since the preferred data $X$ is acceptable, likelihood is non-decreasing for this algorithm. We shall go further now, and show that the sequence of probability vectors $\{\theta^k\}$ converges to a maximizer of the likelihood.

9.6 Convergence of the Mix-EM Algorithm

As we noted earlier, maximizing the likelihood in the mixture case is equivalent to minimizing
$$F(\theta) = KL(u, P\theta) + \sum_{j=1}^J(1 - s_j)\theta_j,$$
over probability vectors $\theta$. It is easily shown that, if $\hat\theta$ minimizes $F(\theta)$ over all non-negative vectors $\theta$, then $\hat\theta$ is a probability vector. Therefore, we can obtain the maximum likelihood estimate of $\theta$ by minimizing $F(\theta)$ over non-negative vectors $\theta$. The following theorem is found in [5].

Theorem 9.1 Let $u$ be any positive vector, $P$ any non-negative matrix with $s_j > 0$ for each $j$, and
$$F(\theta) = KL(u, P\theta) + \sum_{j=1}^J\beta_j KL(\gamma_j, \theta_j).$$
If $s_j + \beta_j > 0$, $\alpha_j = s_j/(s_j + \beta_j)$, and $\beta_j\gamma_j \geq 0$, for all $j$, then the iterative sequence given by
$$\theta_j^{k+1} = \alpha_j s_j^{-1}\theta_j^k\sum_{n=1}^N\Big(\frac{u_n}{(P\theta^k)_n}\Big)P_{n,j} + (1 - \alpha_j)\gamma_j \tag{9.11}$$
converges to a non-negative minimizer of $F(\theta)$.

With the choices $u_n = 1/N$, $\gamma_j = 0$, and $\beta_j = 1 - s_j$, the iteration in Equation (9.11) becomes that of the Mix-EM algorithm. Therefore, the sequence $\{\theta^k\}$ converges to the maximum likelihood estimate of the mixing proportions.
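A compact implementation of (9.10) is given below (our own sketch; the three-component Gaussian mixture used to exercise it is an assumption for illustration only, not data from the paper).

```python
import numpy as np

def mix_em(F, theta0, n_iter=1000):
    """Mix-EM iteration (9.10): theta_j <- (1/N) * theta_j * sum_n F[n, j] / (F theta)_n,
    where F[n, j] = f_j(v_n)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        mix = F @ theta                                   # f(v_n | theta^k) = sum_j theta_j f_j(v_n)
        theta = theta * (F / mix[:, None]).mean(axis=0)   # remains a probability vector
    return theta

# Illustration with an assumed three-component Gaussian mixture.
rng = np.random.default_rng(0)
theta_true, means = np.array([0.5, 0.3, 0.2]), np.array([-2.0, 0.0, 3.0])
labels = rng.choice(3, size=2000, p=theta_true)
v = rng.normal(means[labels], 1.0)
F = np.exp(-0.5 * (v[:, None] - means[None, :]) ** 2) / np.sqrt(2 * np.pi)   # F[n, j] = f_j(v_n)
theta_hat = mix_em(F, np.full(3, 1.0 / 3.0))   # approaches the ML estimate of the mixing proportions
```

Since each update multiplies $\theta_j^k$ by an average of the ratios $f_j(v_n)/f(v_n|\theta^k)$, the iterates stay non-negative and sum to one automatically.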

10 More on Convergence

There is a mistake in the proof of convergence given in Dempster, Laird, and Rubin (1977) [9]. Wu (1983) [29] and Boyles (1983) [3] attempted to repair the error, but also gave examples in which the EM algorithm failed to converge to a global maximizer of likelihood. In Chapter 3 of McLachlan and Krishnan (1997) [19] we find the basic theory of the EM algorithm, including available results on convergence and the rate of convergence. Because many authors rely on Equation (2.5), it is not clear that these results are valid in the generality in which they are presented.

There appears to be no single convergence theorem that is relied on universally; each application seems to require its own proof of convergence. When the use of the EM algorithm was suggested for SPECT and PET, it was necessary to prove convergence of the resulting iterative algorithm in Equation (8.5), as was eventually achieved in a sequence of papers (Shepp and Vardi (1982) [26], Lange and Carson (1984) [17], Vardi, Shepp and Kaufman (1985) [27], Lange, Bahn and Little (1987) [18], and Byrne (1993) [4]). When the EM algorithm was applied to list-mode data in SPECT and PET (Barrett, White, and Parra (1997) [1, 23], and Huesman et al. (2000) [15]), the resulting algorithm differed slightly from that in Equation (8.5), and a proof of convergence was provided in Byrne (2001) [5]. The convergence theorem in Byrne (2001) also establishes the convergence of the iteration in Equation (9.10) to the maximum-likelihood estimate of the mixing proportions.

11 Open Questions

As we have seen, the conventional formulation of the EM algorithm presents difficulties when probability density functions are involved. We have shown here that the use of acceptable preferred data can be helpful in resolving this issue, but other ways may also be useful.

Proving convergence of the sequence $\{\theta^k\}$ appears to involve the selection of an appropriate topology for the parameter space $\Theta$. While it is common to assume that $\Theta$ is a subset of Euclidean space and that the usual norm should be used to define distance, it may be helpful to tailor the metric to the nature of the parameters. In the case of Poisson sums, for example, the parameters are non-negative vectors and we found that the cross-entropy distance is more appropriate. Even so, additional assumptions appear necessary before convergence of the $\{\theta^k\}$ can be established. To simplify the analysis, it is often assumed that cluster points of the sequence lie in the interior of the set $\Theta$, which is not a realistic assumption in some applications.

It may be wise to consider, instead, convergence of the functions $f_X(x|\theta^k)$, or maybe even to identify the parameters $\theta$ with the functions $f_X(x|\theta)$. Proving convergence to $L_y(\theta_{ML})$ of the likelihood values $L_y(\theta^k)$ is also an option.

12 Conclusion

Difficulties with the conventional formulation of the EM algorithm in the continuous case of probability density functions (pdf) have prompted us to adopt a new definition, that of acceptable data. As we have shown, this model can be helpful in generating EM algorithms in a variety of situations. For the discrete case of probability functions (pf), the conventional approach remains satisfactory. In both cases, the two steps of the EM algorithm can be viewed as alternating minimization of the Kullback-Leibler distance between two sets of parameterized pf or pdf, along the lines investigated by Csiszár and Tusnády [8]. In order to use the full power of their theory, however, we need one of the sets to be convex. This does occur in the important special case of sums of independent Poisson random variables, but is not generally the case.

References

1. Barrett, H., White, T., and Parra, L. (1997) List-mode likelihood. J. Opt. Soc. Am. A, 14.
2. Bauschke, H., Combettes, P., and Noll, D. (2006) Joint minimization with alternating Bregman proximity operators. Pacific Journal of Optimization, 2.
3. Boyles, R. (1983) On the convergence of the EM algorithm. Journal of the Royal Statistical Society B, 45.
4. Byrne, C. (1993) Iterative image reconstruction algorithms based on cross-entropy minimization. IEEE Transactions on Image Processing, IP-2.
5. Byrne, C. (2001) Likelihood maximization for list-mode emission tomographic image reconstruction. IEEE Transactions on Medical Imaging, 20(10).
6. Byrne, C., and Eggermont, P. (2011) EM Algorithms. In Handbook of Mathematical Methods in Imaging, Otmar Scherzer, ed., Springer-Science.

7. Byrne, C. (2012) Alternating and sequential unconstrained minimization algorithms. Journal of Optimization Theory and Applications, electronic 154(3), 2012; hardcopy 156(2), February, 2013.
8. Csiszár, I. and Tusnády, G. (1984) Information geometry and alternating minimization procedures. Statistics and Decisions, Supp. 1.
9. Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39.
10. Eggermont, P.P.B., LaRiccia, V.N. (1995) Smoothed maximum likelihood density estimation for inverse problems. Annals of Statistics, 23.
11. Eggermont, P., and LaRiccia, V. (2001) Maximum Penalized Likelihood Estimation, Volume I: Density Estimation. New York: Springer.
12. Everitt, B. and Hand, D. (1981) Finite Mixture Distributions. London: Chapman and Hall.
13. Fessler, J., Ficaro, E., Clinthorne, N., and Lange, K. (1997) Grouped-coordinate ascent algorithms for penalized-likelihood transmission image reconstruction. IEEE Transactions on Medical Imaging, 16(2).
14. Hogg, R., McKean, J., and Craig, A. (2004) Introduction to Mathematical Statistics, 6th edition. Prentice Hall.
15. Huesman, R., Klein, G., Moses, W., Qi, J., Ruetter, B., and Virador, P. (2000) List-mode maximum likelihood reconstruction applied to positron emission mammography (PEM) with irregular sampling. IEEE Transactions on Medical Imaging, 19(5).
16. Kullback, S. and Leibler, R. (1951) On information and sufficiency. Annals of Mathematical Statistics, 22.
17. Lange, K. and Carson, R. (1984) EM reconstruction algorithms for emission and transmission tomography. Journal of Computer Assisted Tomography, 8.
18. Lange, K., Bahn, M. and Little, R. (1987) A theoretical study of some maximum likelihood algorithms for emission and transmission tomography. IEEE Trans. Med. Imag., MI-6(2).

19. McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions. New York: John Wiley and Sons, Inc.
20. Meng, X., and Pedlow, S. (1992) EM: a bibliographic review with missing articles. Proceedings of the Statistical Computing Section, American Statistical Association, Alexandria, VA.
21. Meng, X., and van Dyk, D. (1997) The EM algorithm - an old folk-song sung to a fast new tune. J. R. Statist. Soc. B, 59(3).
22. Narayanan, M., Byrne, C. and King, M. (2001) An interior point iterative maximum-likelihood reconstruction algorithm incorporating upper and lower bounds with application to SPECT transmission imaging. IEEE Transactions on Medical Imaging, TMI-20(4).
23. Parra, L. and Barrett, H. (1998) List-mode likelihood: EM algorithm and image quality estimation demonstrated on 2-D PET. IEEE Transactions on Medical Imaging, 17.
24. Redner, R., and Walker, H. (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2).
25. Rockmore, A., and Macovski, A. (1976) A maximum likelihood approach to emission image reconstruction from projections. IEEE Transactions on Nuclear Science, NS-23.
26. Shepp, L., and Vardi, Y. (1982) Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging, MI-1.
27. Vardi, Y., Shepp, L.A. and Kaufman, L. (1985) A statistical model for positron emission tomography. Journal of the American Statistical Association, 80.
28. Wernick, M. and Aarsvold, J., editors (2004) Emission Tomography: The Fundamentals of PET and SPECT. San Diego: Elsevier Academic Press.
29. Wu, C.F.J. (1983) On the convergence properties of the EM algorithm. Annals of Statistics, 11.


More information

Lecture 1: Brief Review on Stochastic Processes

Lecture 1: Brief Review on Stochastic Processes Lecture 1: Brief Review on Stochastic Processes A stochastic process is a collection of random variables {X t (s) : t T, s S}, where T is some index set and S is the common sample space of the random variables.

More information

Superiorized Inversion of the Radon Transform

Superiorized Inversion of the Radon Transform Superiorized Inversion of the Radon Transform Gabor T. Herman Graduate Center, City University of New York March 28, 2017 The Radon Transform in 2D For a function f of two real variables, a real number

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 )

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 ) Expectation Maximization (EM Algorithm Motivating Example: Have two coins: Coin 1 and Coin 2 Each has it s own probability of seeing H on any one flip. Let p 1 = P ( H on Coin 1 p 2 = P ( H on Coin 2 Select

More information

Convergence Rate of Expectation-Maximization

Convergence Rate of Expectation-Maximization Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation

More information

Review of Maximum Likelihood Estimators

Review of Maximum Likelihood Estimators Libby MacKinnon CSE 527 notes Lecture 7, October 7, 2007 MLE and EM Review of Maximum Likelihood Estimators MLE is one of many approaches to parameter estimation. The likelihood of independent observations

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /

More information

An Alternative Proof of Primitivity of Indecomposable Nonnegative Matrices with a Positive Trace

An Alternative Proof of Primitivity of Indecomposable Nonnegative Matrices with a Positive Trace An Alternative Proof of Primitivity of Indecomposable Nonnegative Matrices with a Positive Trace Takao Fujimoto Abstract. This research memorandum is aimed at presenting an alternative proof to a well

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

COMPETING RISKS WEIBULL MODEL: PARAMETER ESTIMATES AND THEIR ACCURACY

COMPETING RISKS WEIBULL MODEL: PARAMETER ESTIMATES AND THEIR ACCURACY Annales Univ Sci Budapest, Sect Comp 45 2016) 45 55 COMPETING RISKS WEIBULL MODEL: PARAMETER ESTIMATES AND THEIR ACCURACY Ágnes M Kovács Budapest, Hungary) Howard M Taylor Newark, DE, USA) Communicated

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS

PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS Statistica Sinica 15(2005), 831-840 PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS Florin Vaida University of California at San Diego Abstract: It is well known that the likelihood sequence of the EM algorithm

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

An introduction to Variational calculus in Machine Learning

An introduction to Variational calculus in Machine Learning n introduction to Variational calculus in Machine Learning nders Meng February 2004 1 Introduction The intention of this note is not to give a full understanding of calculus of variations since this area

More information

Lecture 4: Probability and Discrete Random Variables

Lecture 4: Probability and Discrete Random Variables Error Correcting Codes: Combinatorics, Algorithms and Applications (Fall 2007) Lecture 4: Probability and Discrete Random Variables Wednesday, January 21, 2009 Lecturer: Atri Rudra Scribe: Anonymous 1

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Sequential unconstrained minimization algorithms for constrained optimization

Sequential unconstrained minimization algorithms for constrained optimization IOP PUBLISHING Inverse Problems 24 (2008) 015013 (27pp) INVERSE PROBLEMS doi:10.1088/0266-5611/24/1/015013 Sequential unconstrained minimization algorithms for constrained optimization Charles Byrne Department

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem

Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Charles Byrne (Charles Byrne@uml.edu) http://faculty.uml.edu/cbyrne/cbyrne.html Department of Mathematical Sciences

More information

Machine Learning for Signal Processing Expectation Maximization Mixture Models. Bhiksha Raj 27 Oct /

Machine Learning for Signal Processing Expectation Maximization Mixture Models. Bhiksha Raj 27 Oct / Machine Learning for Signal rocessing Expectation Maximization Mixture Models Bhiksha Raj 27 Oct 2016 11755/18797 1 Learning Distributions for Data roblem: Given a collection of examples from some data,

More information

The memory centre IMUJ PREPRINT 2012/03. P. Spurek

The memory centre IMUJ PREPRINT 2012/03. P. Spurek The memory centre IMUJ PREPRINT 202/03 P. Spurek Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348 Kraków, Poland J. Tabor Faculty of Mathematics and Computer

More information

Introduction to Machine Learning

Introduction to Machine Learning What does this mean? Outline Contents Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola December 26, 2017 1 Introduction to Probability 1 2 Random Variables 3 3 Bayes

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs 11-755 Machine Learning for Signal rocessing Expectation Maximization Mixture Models HMMs Class 9. 21 Sep 2010 1 Learning Distributions for Data roblem: Given a collection of examples from some data, estimate

More information

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

[POLS 8500] Review of Linear Algebra, Probability and Information Theory [POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017 For today... Basic linear algebra. Basic probability. Programming

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models.

Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.com This Lecture Expectation-Maximization (EM)

More information

Maximum-Likelihood Deconvolution in the Spatial and Spatial-Energy Domain for Events With Any Number of Interactions

Maximum-Likelihood Deconvolution in the Spatial and Spatial-Energy Domain for Events With Any Number of Interactions IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 59, NO. 2, APRIL 2012 469 Maximum-Likelihood Deconvolution in the Spatial and Spatial-Energy Domain for Events With Any Number of Interactions Weiyi Wang, Member,

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Mixture Models and EM

Mixture Models and EM Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering

More information

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416)

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) D. ARAPURA This is a summary of the essential material covered so far. The final will be cumulative. I ve also included some review problems

More information

MIXTURE OF EXPERTS ARCHITECTURES FOR NEURAL NETWORKS AS A SPECIAL CASE OF CONDITIONAL EXPECTATION FORMULA

MIXTURE OF EXPERTS ARCHITECTURES FOR NEURAL NETWORKS AS A SPECIAL CASE OF CONDITIONAL EXPECTATION FORMULA MIXTURE OF EXPERTS ARCHITECTURES FOR NEURAL NETWORKS AS A SPECIAL CASE OF CONDITIONAL EXPECTATION FORMULA Jiří Grim Department of Pattern Recognition Institute of Information Theory and Automation Academy

More information

Chapter 3 : Likelihood function and inference

Chapter 3 : Likelihood function and inference Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Lecture Notes on Metric Spaces

Lecture Notes on Metric Spaces Lecture Notes on Metric Spaces Math 117: Summer 2007 John Douglas Moore Our goal of these notes is to explain a few facts regarding metric spaces not included in the first few chapters of the text [1],

More information

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton EM Algorithm Last lecture 1/35 General optimization problems Newton Raphson Fisher scoring Quasi Newton Nonlinear regression models Gauss-Newton Generalized linear models Iteratively reweighted least squares

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

STA 711: Probability & Measure Theory Robert L. Wolpert

STA 711: Probability & Measure Theory Robert L. Wolpert STA 711: Probability & Measure Theory Robert L. Wolpert 6 Independence 6.1 Independent Events A collection of events {A i } F in a probability space (Ω,F,P) is called independent if P[ i I A i ] = P[A

More information

Quantitative Biology II Lecture 4: Variational Methods

Quantitative Biology II Lecture 4: Variational Methods 10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate

More information

STATISTICAL INFERENCE WITH DATA AUGMENTATION AND PARAMETER EXPANSION

STATISTICAL INFERENCE WITH DATA AUGMENTATION AND PARAMETER EXPANSION STATISTICAL INFERENCE WITH arxiv:1512.00847v1 [math.st] 2 Dec 2015 DATA AUGMENTATION AND PARAMETER EXPANSION Yannis G. Yatracos Faculty of Communication and Media Studies Cyprus University of Technology

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

Integrating Correlated Bayesian Networks Using Maximum Entropy

Integrating Correlated Bayesian Networks Using Maximum Entropy Applied Mathematical Sciences, Vol. 5, 2011, no. 48, 2361-2371 Integrating Correlated Bayesian Networks Using Maximum Entropy Kenneth D. Jarman Pacific Northwest National Laboratory PO Box 999, MSIN K7-90

More information

Convex Feasibility Problems

Convex Feasibility Problems Laureate Prof. Jonathan Borwein with Matthew Tam http://carma.newcastle.edu.au/drmethods/paseky.html Spring School on Variational Analysis VI Paseky nad Jizerou, April 19 25, 2015 Last Revised: May 6,

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Bounds on the Largest Singular Value of a Matrix and the Convergence of Simultaneous and Block-Iterative Algorithms for Sparse Linear Systems

Bounds on the Largest Singular Value of a Matrix and the Convergence of Simultaneous and Block-Iterative Algorithms for Sparse Linear Systems Bounds on the Largest Singular Value of a Matrix and the Convergence of Simultaneous and Block-Iterative Algorithms for Sparse Linear Systems Charles Byrne (Charles Byrne@uml.edu) Department of Mathematical

More information

The Information Bottleneck Revisited or How to Choose a Good Distortion Measure

The Information Bottleneck Revisited or How to Choose a Good Distortion Measure The Information Bottleneck Revisited or How to Choose a Good Distortion Measure Peter Harremoës Centrum voor Wiskunde en Informatica PO 94079, 1090 GB Amsterdam The Nederlands PHarremoes@cwinl Naftali

More information

A Note on the Expectation-Maximization (EM) Algorithm

A Note on the Expectation-Maximization (EM) Algorithm A Note on the Expectation-Maximization (EM) Algorithm ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign March 11, 2007 1 Introduction The Expectation-Maximization

More information

Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

More information

Extensions of the CQ Algorithm for the Split Feasibility and Split Equality Problems

Extensions of the CQ Algorithm for the Split Feasibility and Split Equality Problems Extensions of the CQ Algorithm for the Split Feasibility Split Equality Problems Charles L. Byrne Abdellatif Moudafi September 2, 2013 Abstract The convex feasibility problem (CFP) is to find a member

More information

Notes on Mathematics Groups

Notes on Mathematics Groups EPGY Singapore Quantum Mechanics: 2007 Notes on Mathematics Groups A group, G, is defined is a set of elements G and a binary operation on G; one of the elements of G has particularly special properties

More information

i=1 h n (ˆθ n ) = 0. (2)

i=1 h n (ˆθ n ) = 0. (2) Stat 8112 Lecture Notes Unbiased Estimating Equations Charles J. Geyer April 29, 2012 1 Introduction In this handout we generalize the notion of maximum likelihood estimation to solution of unbiased estimating

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

arxiv: v1 [math.oc] 4 Oct 2018

arxiv: v1 [math.oc] 4 Oct 2018 Convergence of the Expectation-Maximization Algorithm Through Discrete-Time Lyapunov Stability Theory Orlando Romero Sarthak Chatterjee Sérgio Pequito arxiv:1810.02022v1 [math.oc] 4 Oct 2018 Abstract In

More information

X X (2) X Pr(X = x θ) (3)

X X (2) X Pr(X = x θ) (3) Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree

More information

STATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero

STATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero STATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero 1999 32 Statistic used Meaning in plain english Reduction ratio T (X) [X 1,..., X n ] T, entire data sample RR 1 T (X) [X (1),..., X (n) ] T, rank

More information

Mixture Models and Expectation-Maximization

Mixture Models and Expectation-Maximization Mixture Models and Expectation-Maximiation David M. Blei March 9, 2012 EM for mixtures of multinomials The graphical model for a mixture of multinomials π d x dn N D θ k K How should we fit the parameters?

More information

Background on Coherent Systems

Background on Coherent Systems 2 Background on Coherent Systems 2.1 Basic Ideas We will use the term system quite freely and regularly, even though it will remain an undefined term throughout this monograph. As we all have some experience

More information

Statistical Estimation

Statistical Estimation Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

Week 3: The EM algorithm

Week 3: The EM algorithm Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent

More information