
The Generalized CEM Algorithm

Tony Jebara
MIT Media Lab, Ames St., Cambridge, MA 02139

Alex Pentland
MIT Media Lab, Ames St., Cambridge, MA 02139

Abstract

We propose a general approach for estimating the parameters of latent variable probability models to maximize conditional likelihood and discriminant criteria. Unlike joint likelihood, these objectives are better suited for classification and regression. The approach utilizes and extends the previously introduced CEM framework (Conditional Expectation Maximization), which reformulates EM to handle the conditional likelihood case. We generalize the CEM algorithm to estimate any mixture of exponential family densities. This includes structured graphical models over exponential families, such as HMMs. The algorithm efficiently takes advantage of the factorization of the underlying graph. In addition, the new CEM bound is tighter and more rigorous than the original one. The final result is a CEM algorithm that mirrors the EM algorithm: both estimate a variational lower bound on their respective incomplete objective functions, and both generate the same standard M-steps over complete likelihood for direct maximization. The equivalence of M-steps facilitates migration of current ML approaches to conditional criteria for improved classification and regression results.

1 Introduction

Recently, the machine learning community and its application domains have seen the proliferation of conditional or discriminative criteria for classification and regression. For instance, support vector machines [7] have generated competitive classifier systems and are being combined with probabilistic models [3]. In the speech community, discriminatively trained HMMs minimize classification error for superior phoneme labeling [6]. Mixtures of experts are used as probabilistic regressors after maximizing their conditional likelihood [5]. Even traditional neural networks employ a least-squares objective function on the output, emphasizing prediction performance [1]. All these criteria allocate modeling resources with the given task in mind, yielding improved performance. In contrast, under ML and MAP (the canonical criteria of probabilistic models), each density is trained separately to describe observations rather than to solve the classification or regression task optimally. Therefore performance is compromised.

The ML/MAP criteria suffer most when the model is inaccurate (this model mismatch occurs often in real-world situations). For visualization, observe Figure 1, where data must be classified using 4 Gaussians (2 per class). The optimal ML solution unfortunately does no better than chance on this data. In comparison, the maximum conditional likelihood solution correctly labels most examples.

[Figure 1: Likelihood of class and data vs. likelihood of class given data. Panels: (a) Max. Likelihood Classifier; (b) Max. Conditional Likelihood Classifier.]

Nevertheless, ML and MAP remain attractive criteria since they can be easily and convergently optimized for a large class of latent variable probability densities and graphical models. These otherwise intractable models are lower bounded and decoupled via the EM [2] algorithm, which generates complete data and simple M-steps. Since the EM algorithm is limited to ML and MAP, proponents of other criteria must resort to gradient algorithms [6] or second order methods [5] to estimate latent variable probability models. These generic optimization algorithms need extra bookkeeping and step-size selection, exhibit non-monotonic convergence, and so on. It is thus desirable to find a lower bound like EM which facilitates optimization of conditional or discriminant criteria and generates simple M-steps.

Such a lower bound was proposed in [4] as the CEM (Conditional Expectation Maximization) algorithm and used to perform regression with Gaussian mixtures (footnote 1). However, CEM's full generality extends beyond that case to a large class of probability models. It can also be further refined to make it tighter. In the following, we describe the peculiar structure of conditional and discriminative problems and show how EM cannot fully lower bound them. We then show how lower bounding arises naturally from the log-concavity of certain probability models. The CEM lower bound in question effectively generates weighted and displaced complete data from incomplete data to produce M-steps that are structurally identical to the EM case. We then present the estimation of the CEM bound's parameters. This is done by approximating the log-likelihood function with a sparse envelope on its epigraph. Lower bounding is depicted for mixtures of Gaussians and multinomials. For latent variable graphical models, we discuss HMMs and explain the efficient computation of their CEM lower bound. This document is a brief theoretical discussion, but practical and implementation details can be verified online. The reader is encouraged to visit the accompanying web page for experiments, results and visualization material.

Footnote 1: Here we emphasize classification instead.

2 CEM: Discriminative Lower Bounds for Simple M-Steps

For latent variable models, a distribution is described as the sum of simple models: $p(c, x|\Theta) = \sum_m p(m, c, x|\Theta)$. Here $m$ is the missing data, $x$ is the input and $c$ is the class label. The standard ML objective function is $l = \sum_i \log p(c_i, x_i|\Theta)$. EM simplifies each $i$'th component of the log-likelihood with a variational bound using Jensen's inequality (with equality achieved at $\Theta = \tilde\Theta$):

$$\log \sum_m p(m, x|\Theta) \;\ge\; \sum_m \frac{p(m, x|\tilde\Theta)}{\sum_n p(n, x|\tilde\Theta)} \log\left(\frac{p(m, x|\Theta)}{p(m, x|\tilde\Theta)}\right) + \log \sum_m p(m, x|\tilde\Theta) \qquad (1)$$

An alternative criterion is conditional maximum likelihood (CML), where the log-likelihood of the class given the observations is optimized: $l^c = \sum_i \log p(c_i|x_i, \Theta)$. Equation 2 shows a component of the conditional log-likelihood in further detail. Due to the presence of the negative in $-\log(\sum_m \sum_c \cdot)$, the so-called "negative log-sum", EM acts as an upper bound on that term. Therefore, EM can only lower bound half the terms in the conditional likelihood, leaving the rest intractable and preventing a direct M-step.

$$\log p(c|x, \Theta) = \log \frac{p(c, x|\Theta)}{p(x|\Theta)} = \log \sum_m p(m, c, x|\Theta) - \log \sum_m \sum_c p(m, c, x|\Theta) \qquad (2)$$

In the case of discriminative learning [3] [7], likelihood ratios between two classes are compared via a discriminant function $L(x|\Theta)$ (Equation 3). The sign of the function denotes the class, $\hat c = \mathrm{sign}(L(x))$. For each class, one could have a latent variable probability model ($\Theta_1$ and $\Theta_2$). However, if a lower bound is needed on such an expression, EM will succeed on the $\log(\cdot)$ term, but applying EM to the remaining $-\log(\cdot)$ term will produce an undesirable upper bound. Therefore, to lower bound such negative log-sums we propose the generalized CEM algorithm.

$$L(x|\Theta) = \log \frac{p(x|\Theta_1)}{p(x|\Theta_2)} = \log \sum_m p(m, x|\Theta_1) - \log \sum_m p(m, x|\Theta_2) \qquad (3)$$

Definition 1. The generalized CEM algorithm is the complementary lower bound on the negative log-sum that EM would otherwise upper bound. Its form is:

$$-\log \sum_m p(m, x|\Theta) \;\ge\; \sum_m w_m \log p(m, y_m|\Theta) + k$$

where $w_m$ is a positive scalar weight, $y_m$ is a displaced data point in the space of $x$ and $k$ is a scalar constant. Equality is achieved at some $\tilde\Theta$. Note its structural similarity to the EM lower bound, hence their similar simple M-steps.

3 Exploiting Log-Concavity in the Exponential Family

It seems unusual that the log-sum can be both upper bounded and lower bounded by a positive weighted sum of logs. In fact, the lower bound (EM) always exists while the upper bound (CEM) is only feasible for certain models, basically log-concave ones such as the exponential family. For the EM case, the probabilities in the log-summation, $\log(\sum_m p(m, \Theta))$, could be any arbitrary positive function. Jensen's inequality merely exploits the presence of the concave logarithm of an expectation. Furthermore, the lower bound's parameters only compute "responsibilities" or expectations of $p(m, \Theta)$ at $\tilde\Theta$. CEM, however, cannot use Jensen, and hence requires log-concave probabilities for $p(m, \Theta)$. Furthermore, the choice of its parameters depends on the form of $p(m, \Theta)$, not just responsibility ratios. We thus assume exponential family members for $p(m, \Theta)$ since they are log-concave. The family is defined as $p(x|\theta) = a(x) \exp(\theta^T x - K(\theta))$, where $a(x)$ is positive and $K(\theta)$ is convex. The exponential family subsumes a variety of models including Gaussians, multinomials, Poisson, and their conjugates.

Assume that we are lower bounding (with CEM) the negative log of a latent variable model by a sum of log-exponential family models (Equation 4). One typical choice for the latent variable model is an exponential family mixture, i.e. $p(x|\Theta) = \sum_m p(m)\, a(x) \exp(\theta_m^T x - K(\theta_m))$.

$$-\log p(x|\Theta) \;\ge\; \sum_m \left[ w_m \theta_m^T y_m - w_m K(\theta_m) \right] + k \qquad (4)$$

We use the following intuitive requirements to solve for the bound's parameters $(k, y_m, w_m)$.

Contact: a variational bound must equal the negative log-sum at the current operating point ($\tilde\Theta$).
Tangentiality: the gradients of the bound equal those of the negative log-sum at the current point.
Tightness: the bound is as tight as possible without stepping over the negative log-sum.

The first two requirements need the value and the gradient of the negative log-sum at the current model settings ($\tilde\Theta = \{\tilde\theta_m\}$). Since EM generates a variational bound, its value and gradient at $\tilde\Theta$ are identical to those computed from the negative log-sum. Usually, using EM is computationally more efficient. Thus we compute the following quantities with EM: the value $V = \log p(x|\tilde\Theta)$, the responsibilities $h_m = p(m)\, p(x|\tilde\theta_m)$ (or $\hat h_m = h_m / \sum_n h_n$) and the gradients $G_m$ of the negative log-sum with respect to $\theta_m$, evaluated at $\tilde\theta_m$.

$$k = -\log p(x|\tilde\Theta) - \sum_m w_m \left( \tilde\theta_m^T y_m - K(\tilde\theta_m) \right) \qquad (5)$$

$$y_m = \frac{G_m}{w_m} + \left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\tilde\theta_m} = \frac{\hat h_m}{w_m} \left( \left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\tilde\theta_m} - x \right) + \left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\tilde\theta_m} \qquad (6)$$

The above equations recover the parameters for the CEM bound and show the case of $y_m$ for a mixture of the exponential family. Note that if we set $w_m = -\hat h_m$, then $y_m \to x$ and CEM reduces to the exact same upper bound EM generates. CEM's added flexibility arises from allowing the data point to move to another location ($y_m$) as well as undergo weighting.
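The contact and tangentiality requirements can be checked numerically. The following is a minimal sketch, assuming an identity-covariance Gaussian mixture (so the natural parameter is the mean and $K(\theta) = \frac{1}{2}\theta^T\theta$). The function names and the arbitrary positive weights `w` are illustrative assumptions, not part of the original algorithm, and the tightness requirement is not addressed here.

```python
import numpy as np

def K(thetas):
    """Log-partition K(theta) = 0.5 * theta^T theta, one value per row of thetas."""
    return 0.5 * np.sum(thetas ** 2, axis=-1)

def neg_log_sum(x, priors, thetas):
    """-log sum_m p(m) a(x) exp(theta_m^T x - K(theta_m)) for one data point x."""
    log_a = -0.5 * x.size * np.log(2.0 * np.pi) - 0.5 * (x @ x)
    logs = np.log(priors) + log_a + thetas @ x - K(thetas)
    top = logs.max()
    return -(top + np.log(np.exp(logs - top).sum()))

def responsibilities(x, priors, thetas):
    """Normalized responsibilities h_hat_m at the given parameters."""
    logs = np.log(priors) + thetas @ x - K(thetas)
    p = np.exp(logs - logs.max())
    return p / p.sum()

def cem_parameters(x, priors, thetas_t, w):
    """Return (k, y, G) from the contact and tangentiality conditions."""
    h_hat = responsibilities(x, priors, thetas_t)
    # Gradient of the negative log-sum; for this family dK/dtheta = theta.
    G = h_hat[:, None] * (thetas_t - x)
    y = G / w[:, None] + thetas_t                              # Equation 6
    k = neg_log_sum(x, priors, thetas_t) - np.sum(
        w * (np.sum(thetas_t * y, axis=1) - K(thetas_t)))      # Equation 5
    return k, y, G

def cem_bound(thetas, w, y, k):
    """Right-hand side of Equation 4 evaluated at thetas."""
    return np.sum(w * (np.sum(thetas * y, axis=1) - K(thetas))) + k

# Hypothetical toy values chosen only for illustration.
x = np.array([0.5, -1.0])
priors = np.array([0.3, 0.7])
thetas_t = np.array([[0.0, 0.0], [2.0, 1.0]])
w = np.array([1.5, 2.0])   # arbitrary positive weights, not the tightest ones
k, y, G = cem_parameters(x, priors, thetas_t, w)
# Difference between bound and function at the operating point (contact).
print(cem_bound(thetas_t, w, y, k) - neg_log_sum(x, priors, thetas_t))
```

Passing `w = -responsibilities(x, priors, thetas_t)` to `cem_parameters` returns `y` equal to `x` for every component, which is the EM special case noted above.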
EM generates weighted complete data while CEM generates displaced and weighted complete data.

Now, we wish to find the smallest $w_m$ permissible for a true lower bound, i.e. the tightest approximation while staying below the negative log-sum. Of course, one need not pick the smallest possible $w_m$; any $w_m$ greater than the minimal one will also generate a true bound. In addition, we post-process the $w_m$ so that they are positive and yield reasonable M-step calculations. Substituting $k$ and $y_m$ into the bound gives an expression for $w_m$:

$$\sum_m w_m \left[ K(\theta_m) - K(\tilde\theta_m) - (\theta_m - \tilde\theta_m)^T \left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\tilde\theta_m} \right] \;\ge\; \log \frac{p(x|\Theta)}{p(x|\tilde\Theta)} + \sum_m (\theta_m - \tilde\theta_m)^T G_m \qquad (7)$$

By convexity of $K(\theta_m)$, we know that the terms multiplying each $w_m$ remain non-negative. Therefore, the $w_m$ are always constrained from below, regardless of the choice of $\Theta$ (i.e. the inequality never flips over; see footnote 2). To find the minimum $w_m$ (tightest bound), we could exhaustively vary the value of $\Theta$ over the whole space and verify all the conditions on $w_m$. This is intractable, so we propose an efficient approximation.

Footnote 2: Note the mixing components (class frequencies) are fixed while updating the exponential family parameters. The mixing components themselves also form a log-concave structure suitable for CEM if we alternatively fix the exponential family models. However, [4] already implemented a (slightly different) bound and update rule for mixing proportions that was valid for general mixtures (i.e. not just Gaussians).

[Figure 2: Computing envelopes and CEM bounds. Examples shown: Gaussian, Binomial, HMM. Panels: (a) Negative Log-Sum; (b) Sparse Envelope; (c) CEM Lower Bound.]

4 The EM Epigraph and Envelope Approximation

Recall that EM generated an undesirable upper bound on the negative log-sum (i.e. consider negating Equation 1). Yet these upper bounds play a critical role in computing the CEM lower bound. Recall from concave duality that a function can be defined as the minimum of all its convex upper bounds. EM generates upper bounds that lie in the negative log-sum's epigraph and their minimum forms its envelope. EM achieves equality (Equation 8) if we minimize over all bounds in the continuous space of $\{h_j\}$ variations.

$$-\log \sum_j p(j)\, p(x|\theta_j) = \min_{\{h_j\}} \; -\sum_j \frac{h_j}{\sum_n h_n} \left[ \log p(j)\, p(x|\theta_j) - \log \frac{h_j}{\sum_n h_n} \right] \qquad (8)$$

Of course, we need not exhaustively consider every upper bound; rather, one may choose a select few sample bounds whose envelope captures most of the interesting behavior of the function. For the log-sum, we get a very good approximation when we use the winner-takes-all case, where one model dominates all others and one responsibility ratio $h_j$ is set to 1.0 while all others are 0.0. Considering $j = 1..M$ such models (envelope components) and minimizing over them approximates the log-mixture. Here, the envelope's components are really $M$ negative log-exponential-family models. Thus we approximate $-\log \sum_j p(j)\, p(x|\theta_j) \approx \min_j \{ -\log p(j)\, p(x|\theta_j) \}$. Figure 2 shows a very accurate sparse envelope approximation for a 2-component Gaussian mixture (varying 1D means) and a mixture of binomials.
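The winner-takes-all envelope is easy to evaluate directly. The sketch below assumes a 2-component, unit-variance 1D Gaussian mixture whose two means are varied over a grid; the data point, priors, grid range and function names are illustrative choices. It measures the gap between the sparse envelope and the exact negative log-sum, which by construction lies between 0 and $\log M$.

```python
import numpy as np

def log_joint(x, priors, means):
    """log p(j) p(x|theta_j) for unit-variance 1D Gaussians; last axis indexes j."""
    return np.log(priors) - 0.5 * np.log(2.0 * np.pi) - 0.5 * (x - means) ** 2

def neg_log_sum(x, priors, means):
    """Exact negative log-sum, computed stably over the last axis."""
    lj = log_joint(x, priors, means)
    top = lj.max(axis=-1, keepdims=True)
    return -(top[..., 0] + np.log(np.exp(lj - top).sum(axis=-1)))

def sparse_envelope(x, priors, means):
    """Minimum over the M winner-takes-all upper bounds -log p(j) p(x|theta_j)."""
    return (-log_joint(x, priors, means)).min(axis=-1)

# Hypothetical setup: one data point, two means varied over a 25 x 25 grid.
x = 0.7
priors = np.array([0.4, 0.6])
axis = np.linspace(-3.0, 3.0, 25)
means = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1)  # shape (25, 25, 2)

gap = sparse_envelope(x, priors, means) - neg_log_sum(x, priors, means)
print(gap.min(), gap.max())  # the envelope never drops below the negative log-sum
```

The gap is largest where both components explain the data point equally well, which is where the winner-takes-all assumption is least accurate.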
We now have a parsimonious approximation of the negative log-sum which we can substitute back into the CEM bound. If CEM's concave bound is less than each envelope, it is underneath the negative log-sum. Furthermore, the expression for $w_m$ which seemed intractable simplifies, since $\log p(x|\Theta)$ decouples into separate log-exponential-family terms. Each of these envelope components ($j = 1..M$) varies with a single $\theta_j$ at a time. Plugging an envelope member in place of the log-sum greatly simplifies Equation 7 for $w_m$:

$$\sum_m w_m \left[ K(\theta_m) - K(\tilde\theta_m) - (\theta_m - \tilde\theta_m)^T \left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\tilde\theta_m} \right] \;\ge\; \log \frac{p(j)\, p(\tilde x|\theta_j)}{p(x|\tilde\Theta)} + \sum_m (\theta_m - \tilde\theta_m)^T G_m \qquad (9)$$

Here, the latent variable model is now replaced by $p(j)\, p(\tilde x|\theta_j)$, a mere exponential family member. Now, due to the decoupling of the $m$ models, it is possible to naturally split the above joint constraint over all $w_m$ into $M$ stricter individual inequalities as in Equation 10. We also heuristically split the constant value $\log p(x|\tilde\Theta)$ into $M$ different $c_m$ variables such that $\sum_m c_m = V$. The result is a decoupled set of $M$ constraints $j = 1..M$ (from $M$ sparse envelope components) for each of the $M$ CEM parameters $w_m$. We use $\delta(j = m)$ to indicate that the envelope / epigraph component only varies with one $\theta_j$ (unlike the latent variable probability).

$$w_m \;\ge\; \frac{\delta(j = m) \log p(j)\, p(\tilde x|\theta_j) - c_m + (\theta_m - \tilde\theta_m)^T G_m}{K(\theta_m) - K(\tilde\theta_m) - (\theta_m - \tilde\theta_m)^T \left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\tilde\theta_m}} \qquad (10)$$

We solve for the smallest $w_m$ (call it $w_m^*$) possible under Equation 10 for each of the $j = 1..M$ envelope components. These are then consolidated by picking the largest $w_m^*$ (call this the final $w_m$) that was achieved. Assume that we knew a priori the value of $w_m^*$. This $w_m^*$ defines a component of the CEM lower bound which supports a component of the envelope. The closest point in $\theta_m$ space between the bound and the envelope component can then be computed by taking gradients, and we note that the following constraint holds at that point $\theta_m^*$:

$$\left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\theta_m^*} = \frac{1}{w_m^* + \delta(j = m)} \left( \delta(j = m)\, \tilde x + w_m^* \left.\frac{\partial K(\theta_m)}{\partial \theta_m}\right|_{\tilde\theta_m} + G_m \right) \qquad (11)$$

The above constraint maps the optimization of each $w_m$ from a search over the whole $\theta_m$ space to a single degree of freedom. Using the convexity of $K(\theta_m)$, we have a 1-to-1 map between the gradient of $K(\theta_m)$ and $\theta_m$.
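The 1-to-1 map between $\theta_m$ and the gradient of $K$ is what lets Equation 11 be inverted for $\theta_m^*$ once a candidate weight is fixed. The sketch below illustrates this for two scalar families: the identity-covariance Gaussian ($K(\theta) = \frac{1}{2}\theta^2$) and, as an added illustration not taken from the text, the Bernoulli ($K(\theta) = \log(1 + e^\theta)$). All function names and numeric values are hypothetical.

```python
import numpy as np

def grad_K_gaussian(theta):
    """dK/dtheta for K(theta) = 0.5 * theta^2."""
    return theta

def inv_grad_K_gaussian(mu):
    """Inverse of the Gaussian gradient map (the identity)."""
    return mu

def grad_K_bernoulli(theta):
    """dK/dtheta for K(theta) = log(1 + exp(theta)): the logistic function."""
    return 1.0 / (1.0 + np.exp(-theta))

def inv_grad_K_bernoulli(mu):
    """Inverse of the Bernoulli gradient map (the logit), valid for 0 < mu < 1."""
    return np.log(mu) - np.log1p(-mu)

def stationary_theta(w, delta, x_tilde, theta_t, G, grad_K, inv_grad_K):
    """Solve Equation 11 for theta_m^* given a candidate weight w.

    delta is 1.0 when the envelope component depends on this theta_m, else 0.0.
    """
    target = (delta * x_tilde + w * grad_K(theta_t) + G) / (w + delta)
    return inv_grad_K(target)

# Hypothetical scalar example in the Gaussian case.
w, delta, x_tilde, theta_t, G = 2.0, 1.0, 0.5, -1.0, 0.3
theta_star = stationary_theta(w, delta, x_tilde, theta_t, G,
                              grad_K_gaussian, inv_grad_K_gaussian)
print(theta_star)
```

For families whose gradient map has a restricted range, such as the Bernoulli, the right-hand side of Equation 11 has to fall inside that range before it can be inverted.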
Equation 11 effectively changes the optimization over the space of $\theta_m$ to have a single degree of freedom which, when varied, determines $w_m$. For the Gaussian case with identity covariance, computing the maximum of the right-hand side of Equation 10 (that is, the smallest admissible $w_m$) can then be done analytically (it is quadratic due to the simplicity of $K(\theta) = \frac{1}{2}\theta^T\theta$). For more difficult models, we perform a simple bisection or secant search in 1 dimension for each $w_m$, which typically converges to the true solution with an average of 5 iterations (using some efficiency heuristics). Another possibility is storing lookup tables for direct computation of the $w_m$ as in [4]. We solve Equation 11 in this manner a total of $M \times M$ times to fully describe the CEM bound on the negative log-sum.

Once the CEM bound is computed, it is straightforward to combine it with EM bounds of the same form, sum them over multiple components of the log-likelihood and maximize. By iterating bound and maximization steps, monotonic convergence of the mixture models is verified.

Deterministic annealing of CEM can also be used to avoid local minima. We first use annealed EM to compute the value $V$ and gradients $G_m$ of the function and then form an annealed version of the EM sparse envelope. The resulting parameters in CEM $(k, y_m, w_m)$ generate a less local bound.
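As a sanity check for any faster search, the constraint in Equation 7 can be verified by brute force on a small problem. The sketch below is not the envelope-based procedure described above: it assumes a 2-component, unit-variance 1D Gaussian mixture, forces a single shared weight for both components (a simplification made here for illustration), and scans a finite grid of parameter values, so the result is only guaranteed on that grid. All names and values are hypothetical.

```python
import numpy as np

def log_mix(x, priors, thetas):
    """log p(x|Theta) for a unit-variance 1D Gaussian mixture; last axis indexes m."""
    lj = np.log(priors) - 0.5 * np.log(2.0 * np.pi) - 0.5 * (x - thetas) ** 2
    top = lj.max(axis=-1, keepdims=True)
    return top[..., 0] + np.log(np.exp(lj - top).sum(axis=-1))

def responsibilities(x, priors, thetas):
    """Normalized responsibilities h_hat_m at a single parameter setting."""
    lj = np.log(priors) - 0.5 * (x - thetas) ** 2
    p = np.exp(lj - lj.max())
    return p / p.sum()

def shared_weight_on_grid(x, priors, theta_t, grid):
    """Smallest shared w satisfying Equation 7 at every grid point."""
    G = responsibilities(x, priors, theta_t) * (theta_t - x)  # grad of -log p
    step = grid - theta_t
    num = (log_mix(x, priors, grid) - log_mix(x, priors, theta_t)
           + (step * G).sum(axis=-1))          # right-hand side of Equation 7
    den = 0.5 * (step ** 2).sum(axis=-1)       # bracket of Equation 7 for K = theta^2/2
    ok = den > 1e-8                            # skip the operating point itself (0/0)
    return max(np.max(num[ok] / den[ok]), 1e-6), G

def cem_bound(thetas, theta_t, w, G, value_at_tilde):
    """Equation 4 with y from Equation 6 and k from Equation 5 (shared weight w)."""
    y = G / w + theta_t
    k = value_at_tilde - np.sum(w * (theta_t * y - 0.5 * theta_t ** 2))
    return np.sum(w * (thetas * y - 0.5 * thetas ** 2), axis=-1) + k

x = 0.3
priors = np.array([0.5, 0.5])
theta_t = np.array([-1.0, 1.5])
axis = np.linspace(-4.0, 4.0, 81)
grid = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1)  # shape (81, 81, 2)

w, G = shared_weight_on_grid(x, priors, theta_t, grid)
v = -log_mix(x, priors, theta_t)               # negative log-sum at the operating point
nls = -log_mix(x, priors, grid)
bound = cem_bound(grid, theta_t, w, G, v)
print(w, np.max(bound - nls))                  # the bound stays below the function on the grid
```

The cost of this check grows exponentially with the number of parameters, which is why a one-dimensional search per weight is preferable in practice.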

5 HMMs - Bounding Structured Models

We now consider CEM for HMMs, a structured model with latent variables and a probability factorization given by an underlying Markov chain. The log-sum structure in an HMM for an observation sequence $X$ is $\log p(X|\Theta) = \log \sum_{(s_1, \ldots, s_T)} p(s_1, \ldots, s_T, X|\Theta)$, for $N$ states and chains of length $T$. The sum in the log contains $N^T$ mixture components. However, the probability density factorizes, allowing efficient computation of EM bounds. Thus, after Baum-Welch computations, the HMM is bounded via EM and can be updated with $N$ M-step equations for emission distributions and $N$ M-step equations for multinomials (i.e. the transition matrix). Therefore, EM generates exponential family lower bounds of the form $\log p(X|\Theta) \ge \sum_{n,t} \left[ w_{n,t} \log p_{\mathrm{emit}}(t|n) + v_{n,t} \log p_{\mathrm{trans}}(t|n) \right]$, with one weighted log term for the emission density and one for the transition multinomial of state $n$ at time step $t$. CEM also generates this form, except it bounds $-\log p(X|\Theta)$.

As usual, to compute $k$ and $y_m$ we use the value $V$ at $\tilde\Theta$ and its gradients $G_m$ from the EM bound. Furthermore, we approximate the HMM's envelope using EM to obtain an envelope of upper bounds. CEM's $w_m$ parameters are then checked against these sparse components. Of course, we certainly do not want to check all $N^T$ terms in the mixture. Only $NT$ terms are needed due to factorization. Each of the $2N$ models ($N$ multinomials and $N$ emission densities) is checked against these $NT$ terms via Equation 10. The terms are again in a winner-takes-all form over the $T$ elements of the chain, which are imputed as values of $\tilde x$. Figure 2 shows the envelope approximation of the HMM's likelihood with 1D means varying in a 2-state HMM with fixed transition probabilities. Note the faithful representation of the function via the EM envelope. Thus, the HMM can be accurately represented as a minimum of log-exponential family models and CEM lower bounds its negative logarithm.

6 Discussion

We have derived the CEM algorithm to lower bound the negative log summation.
Combining it with EM produces tight lower bounds on conditional and discriminant criteria with latent variables. By iteratively maximizing the bound, monotonic convergence and deterministic annealing are feasible. This generalized algorithm holds for log-concave mixtures such as the exponential family and certain graphical models. Effectively, the CEM algorithm efficiently finds a lower bound on discriminative criteria and maps them to the usual tractable M-step structure found in EM techniques. The re-use of maximization machinery permits easy migration of current probabilistic models from ML to conditional and discriminative criteria.

7 Appendix: Conditional Bayesian Estimation

This appendix motivates maximum conditional likelihood as an approximation of conditional Bayesian integration. It also shows that the conditional integral differs from the conditioned joint integral. We first compute the joint Bayesian integral from $(\mathcal{X}, \mathcal{Y})$ data and then condition it to obtain $p(y|x)^j$:

$$p(y|x)^j = \frac{\int p(x, y|\Theta)\, p(\Theta|\mathcal{X}, \mathcal{Y})\, d\Theta}{\int p(x|\Theta)\, p(\Theta|\mathcal{X}, \mathcal{Y})\, d\Theta} \qquad (12)$$

The corresponding dependency graphs (Figure 3(b) and (c)) show how joint estimation contrasts with conditional estimation, which assumes that $x$ is always given as a parent of $y$. The conditional Bayesian integral takes advantage of the graph's factorization and estimates a different $p(y|x)^c$.

[Figure 3: Inconsistency of Conditioned Joint and Conditional Bayesian Estimates. Panels: (a) Data; (b) Conditioned Joint; (c) Direct Conditional; (d) Inconsistency.]

$$p(y|x)^c = \int p(y|x, \Theta^c)\, \left[ p(\Theta^c|\mathcal{X}, \mathcal{Y}) \right] d\Theta^c = \int p(y|x, \Theta^c)\, \frac{p(\mathcal{Y}|\Theta^c, \mathcal{X})\, p(\Theta^c)}{p(\mathcal{Y}|\mathcal{X})}\, d\Theta^c \qquad (13)$$

Conditional maximum likelihood approximates this integral by picking the single $p(y|x, \Theta^c)$ that maximizes $p(\mathcal{Y}|\Theta^c, \mathcal{X})$. Fundamentally, though, the two Bayesian integrals $p(y|x)^j$ and $p(y|x)^c$ are different. We exhaustively perform both Bayesian integrals for a Gaussian mixture model on a set of data points. Figure 3 shows the data and the resulting conditional densities. There is a clear inconsistency between joint and conditional estimation techniques that is not only a problem with ML but also at the Bayesian integration level (Figure 3(d)).

References

[1] Bishop, C. (1996). Neural networks for pattern recognition. Oxford Press.
[2] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39.
[3] Jaakkola, T. and Haussler, D. (1998). Exploiting generative models in discriminative classifiers. NIPS 11.
[4] Jebara, T. and Pentland, A. (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. NIPS 11.
[5] Jordan, M.I. and Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6.
[6] Rathinavelu, C. and Deng, L. (1996). The trended HMM with discriminative training for phonetic classification. ICSLP 96.
[7] Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag.
