
MAXIMUM LIKELIHOOD ESTIMATION AND EM FIXED POINT IDEALS FOR BINARY TENSORS

Daniel Lemke

Version of May 27, 2016

Contents

1 Introduction
  1.1 Maximum Likelihood Estimation
  1.2 Results
2 Background Math
  2.1 Nonnegative Rank
  2.2 Tensors of Bounded Nonnegative Rank
  2.3 Maximum Likelihood Estimation: A Closer Look
  2.4 EM Algorithm for Matrices
  2.5 Ideals, Varieties, and Algorithms
3 EM Fixed Point Ideal
  3.1 Extension of EM Algorithm to Tensors
  3.2 EM Fixed Point Ideal for Tensors
4 MLE Using Boundary Strata
  4.1 Boundary Stratification of Binary Tensors
  4.2 Experiments
5 Implementation
  5.1 Cellular Decomposition, Primality, and Primary Decomposition
  5.2 EM, MLE, and Boundary Strata Experiments
6 Conclusion
Bibliography

I would like to express my deepest gratitude to Serkan Hoşten and Kaie Kubjas for providing the ideas and framework found in this thesis. Thanks go to Nathanael Aff for programming guidance and to Michelle Lemke Riggs and Matthias Beck for their special awareness of English grammar.

1 Introduction

The term Algebraic Statistics first appeared in the literature as the title of a 2001 book by Giovanni Pistone, Eva Riccomagno, and Henry Wynn [PRW]. Beginning with an introduction to Gröbner bases, it presents the application of polynomial algebra to statistics, discrete probability, and experimental design. In 2005 Lior Pachter and Bernd Sturmfels published a single-volume collection of works titled Algebraic Statistics for Computational Biology [PS]. It was written by an array of professionals and graduate students from the fields of algebra and computational biology. The book provides a thorough treatment of the basic principles of algebraic statistics and their relationship to computational biology, and presents an emerging dictionary between algebraic geometry and statistics. Our research is a continuation of the work of Sturmfels et al. The story picks up with a well-known problem in statistics called Maximum Likelihood Estimation.

1.1 Maximum Likelihood Estimation

The likelihood of a set of data is the probability of observing that particular set of data, given some statistical model, which is just a family of probability distributions. The values of the parameters that maximize the sample likelihood function are known as the Maximum Likelihood Estimates, or MLEs. MLEs have been studied since the dawn of the 20th century and were made popular by the statistician and biologist Sir Ronald Fisher [Wik].

Consider the following example from [PS], Section 1.1. Suppose we generate a DNA sequence by rolling three tetrahedral dice, each labelled A, C, G, and T, for the nucleobases adenine, cytosine, guanine, and thymine. Two of the dice are unfair and one is fair; suppose they have the associated probabilities of Table 1.1.

Table 1.1
                  A      C      G      T
    first die     0.15   0.33   0.36   0.16
    second die    0.27   0.24   0.23   0.26
    third die     0.25   0.25   0.25   0.25

We generate the DNA sequence

    CTCACGTGATGAGAGCATTCTCAGACCGTGACGCGTGTAGCAGCGGCTC

by selecting the first die with probability θ_1, the second with probability θ_2, and the third with probability 1 − θ_1 − θ_2. We would like to determine the parameters θ_1 and θ_2 that were used to select the dice. This amounts to a problem in optimization. Let p_A, p_C, p_G, and p_T denote the probabilities of generating each of the four letters. The statistical model derived from Table 1.1 is written algebraically as follows:

    p_A = −0.10 θ_1 + 0.02 θ_2 + 0.25,
    p_C =  0.08 θ_1 − 0.01 θ_2 + 0.25,
    p_G =  0.11 θ_1 − 0.02 θ_2 + 0.25,
    p_T = −0.09 θ_1 + 0.01 θ_2 + 0.25.

We emphasize that these are polynomials in the unknowns θ_1 and θ_2. In statistical terminology these unknowns are called the model parameters. Each of the 49 characters was generated independently, so the likelihood of observing the above DNA sequence is the product of the probabilities of observing the individual letters:

    L(θ_1, θ_2) = p_C(θ_1, θ_2) · p_T(θ_1, θ_2) · p_C(θ_1, θ_2) · p_A(θ_1, θ_2) · p_C(θ_1, θ_2) ⋯
                = p_A(θ_1, θ_2)^10 · p_C(θ_1, θ_2)^14 · p_G(θ_1, θ_2)^15 · p_T(θ_1, θ_2)^10.

In the maximum likelihood framework we estimate the unknown parameters by those values in the parameter space which make the likelihood of observing the data as large as possible. The parameter space over which we maximize L(θ_1, θ_2) is the triangle

    Θ = { (θ_1, θ_2) ∈ R^2 : θ_1 > 0, θ_2 > 0, and θ_1 + θ_2 < 1 }.

It is simpler and equivalent to maximize the log of the likelihood function, denoted ℓ(θ):

    ℓ(θ) = log(L(θ_1, θ_2)) = 10 log(p_A(θ_1, θ_2)) + 14 log(p_C(θ_1, θ_2)) + 15 log(p_G(θ_1, θ_2)) + 10 log(p_T(θ_1, θ_2)),

and we can obtain the solution to this optimization problem using techniques from calculus. Optimization yields the maximum likelihood estimate (θ̂_1, θ̂_2).

One of the drawbacks of maximum likelihood estimation, in terms of popular use, is that it is in general a nonconvex optimization problem requiring solutions to complicated nonlinear systems of equations. It is common in practice to circumvent these issues by using the hill-climbing Expectation Maximization (EM) algorithm, one of the main topics of this thesis. However, any algorithm of this type is doomed to imperfection. It will inevitably run into the problem of being trapped in local maxima and will have no way of providing a certificate for having found the global optimum, which may or may not exist [KRS +, pg. 2].
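To make the dice example concrete, the following is a minimal Julia sketch (our own illustration, not a computation from the thesis) that approximates the maximizer of ℓ(θ) by a grid search over the triangle Θ, using the model polynomials above; the function names are ours.

    # Log-likelihood of the observed sequence (10 A's, 14 C's, 15 G's, 10 T's).
    pA(t1, t2) = -0.10t1 + 0.02t2 + 0.25
    pC(t1, t2) =  0.08t1 - 0.01t2 + 0.25
    pG(t1, t2) =  0.11t1 - 0.02t2 + 0.25
    pT(t1, t2) = -0.09t1 + 0.01t2 + 0.25
    loglik(t1, t2) = 10log(pA(t1, t2)) + 14log(pC(t1, t2)) +
                     15log(pG(t1, t2)) + 10log(pT(t1, t2))

    # Approximate the MLE by brute force over a fine grid of the open triangle.
    function grid_mle(step = 0.001)
        best, argbest = -Inf, (0.0, 0.0)
        for t1 in step:step:1-step, t2 in step:step:1-step
            t1 + t2 < 1 || continue
            v = loglik(t1, t2)
            if v > best
                best, argbest = v, (t1, t2)
            end
        end
        return argbest, best
    end

    @show grid_mle()

A gradient-based method converges much faster, of course; the grid search is only meant to make the optimization problem tangible.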

1.2 Results

We analyze the behavior of the EM algorithm in the case where the model M, the space over which we are optimizing, consists of 2 × 2 × 2 data arrays of nonnegative rank 2 (cf. Section 2.1). M is a nonconvex, compact, semi-algebraic subset of a 7-dimensional tetrahedron.

Figure 1.1: Representative picture of the 7-dimensional model M as a 3-dimensional nonconvex, nonlinear, compact subset of the 3-dimensional tetrahedron.

Since maximum likelihood estimation is an optimization problem, in order to locate the global optimum one restricts the objective function to the interior and to each boundary stratum, finds the maximum on each of these strata, and picks the best value among them. Allman, Hoşten, Rhodes, and Zwiernik [AHRZ] give exact formulas for the maxima on each boundary stratum of M. [AHRZ] follows [ARSZ], in which M is realized as those probability distributions satisfying a special set of polynomial equalities and inequalities. We analyze the [AHRZ] formulas by determining how often they produce MLEs within M. We determine the strata of M that the EM algorithm is most attracted to, find the frequency with which the EM algorithm locates the global optimum, and count the number of times the EM algorithm must be run in order to find the MLE. We also compare the computation times of running the algorithm against using [AHRZ], and produce a picture of the behavior of the algorithm on a 3-dimensional slice of the 7-dimensional model M.

We also analyze an algebraic approach to the EM algorithm and MLE. The EM fixed points are all the points that the EM algorithm can potentially converge to. These points represent the entire collection of maxima in the relative interior of M, as well as maximizers on the boundaries of M, and can be realized as the vanishing set of a collection of polynomials. We find these polynomials, following in the footsteps of [KRS +], and describe the set of all EM fixed points for maximum likelihood problems on two separate classes of data arrays. In total we discuss and compare three approaches to the maximum likelihood problem on M: one is algorithmic, one is formulaic, and one is algebraic.

In Chapter 2 we cover the background math necessary to understand Chapters 3 and 4. These concepts include MLE, nonnegative rank, tensors of bounded nonnegative rank, and the EM algorithm for matrices. We also discuss ideals, varieties, and primary decomposition. In Chapter 3 we describe the EM fixed point ideal for binary tensors of nonnegative rank at most 2 and at most 3, we describe cellular decomposition, which was used to produce these ideals, and we provide tables completely characterizing these ideals. In Chapter 4 we provide results on MLE using the boundary strata given in [ARSZ] and [AHRZ].

2 Background Math

2.1 Nonnegative Rank

The nonnegative rank of a nonnegative matrix A ∈ R^{m×n}_{≥0}, denoted rank_+(A), is the smallest r ∈ Z_{≥0} such that A = B · C for nonnegative B ∈ R^{m×r} and nonnegative C ∈ R^{r×n}. Equivalently, it is the smallest r such that A can be written as the sum of r nonnegative rank 1 matrices, A = Σ_{i=1}^r x_i · y_i with x_i ∈ R^{m×1}_{≥0} and y_i ∈ R^{1×n}_{≥0}. Rank is always less than or equal to nonnegative rank. The smallest case for which rank and nonnegative rank disagree is m = n = 4, and [CR] provides the standard example: a 4 × 4 nonnegative matrix with rank_+ = 4 whose rows satisfy a linear dependence, so that the matrix has rank 3 in the usual sense. Stephen Vavasis shows that nonnegative matrix factorization is NP-hard in [Vav].
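The gap between the two notions of rank can be checked concretely in Julia. The matrix below is the example most commonly cited for this phenomenon; we cannot confirm that it is exactly the matrix displayed in [CR], so treat it as an illustration.

    using LinearAlgebra

    # A 4x4 nonnegative matrix with ordinary rank 3 but nonnegative rank 4.
    A = [1.0 1.0 0.0 0.0;
         1.0 0.0 1.0 0.0;
         0.0 1.0 0.0 1.0;
         0.0 0.0 1.0 1.0]

    @show rank(A)   # 3, since row1 - row2 - row3 + row4 = 0

    # Nonnegative rank 4: a nonnegative rank 1 summand has rectangular support,
    # and the support of A is an 8-cycle, so each summand covers at most two of
    # the eight positive entries; hence at least four summands are needed.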

2.2 Tensors of Bounded Nonnegative Rank

A real nonnegative tensor is a multidimensional array in R^{d_1×d_2×⋯×d_n}_{≥0}. A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and a 3-or-higher dimensional tensor is just a tensor.

Figure 2.1: A 3 × 3 × 3 and a 2 × 2 × 2 × 2 tensor. Image sources: [Kar] & [Wal]. The cells of a Rubik's Cube represent a 3 × 3 × 3 tensor, and a labelling of the vertices of a 4-dimensional cube represents a 2 × 2 × 2 × 2 tensor.

Example 2.1. Let a = (a_1, a_2), b = (b_1, b_2), c = (c_1, c_2) ∈ R^2_{≥0}. Then a ⊗ b ⊗ c is a nonnegative rank 1, 2 × 2 × 2 tensor and can be written in slices as

    ( a_1 b_1 c_1   a_1 b_1 c_2 )      ( a_2 b_1 c_1   a_2 b_1 c_2 )
    ( a_1 b_2 c_1   a_1 b_2 c_2 ),     ( a_2 b_2 c_1   a_2 b_2 c_2 ).

This is just one view of the tensor (front-to-back), but note that it is rank 1 in the usual sense: each slice is a scalar multiple of the other, independent of the viewpoint.

Figure 2.2: This tensor can be written in different slices as viewed from the top-down, bottom-up, left-right, and right-left.

A tensor P of format d_1 × d_2 × ⋯ × d_n has nonnegative rank r if r is the smallest natural number such that P can be written as the sum of r nonnegative rank 1 tensors. Thus

we can build tensors of arbitrary nonnegative rank by adding nonnegative rank 1 tensors. A rank_+ r tensor of this form can be written

    P = a_{11} ⊗ a_{12} ⊗ ⋯ ⊗ a_{1n} + a_{21} ⊗ a_{22} ⊗ ⋯ ⊗ a_{2n} + ⋯ + a_{r1} ⊗ a_{r2} ⊗ ⋯ ⊗ a_{rn}    (2.1)

with nonnegative vectors a_{ij} ∈ R^{d_j}_{≥0}.

Example 2.2. Let P = [p_ijk] be a real 2 × 2 × 2 tensor. Then P has nonnegative rank at most 2 if there exist nonnegative 2 × 2 matrices

    A = ( a_11  a_12 ),   B = ( b_11  b_12 ),   C = ( c_11  c_12 )
        ( a_21  a_22 )        ( b_21  b_22 )        ( c_21  c_22 )

such that p_ijk = a_1i b_1j c_1k + a_2i b_2j c_2k.

Figure 2.3: Rank_+ 2 tensor decomposition, adapted from [KB], depicting a general rank_+ 2 tensor being constructed by adding rank_+ 1 tensors, which are themselves built from the rows of the nonnegative matrices A, B, and C.

It is shown in [Lan, 5.5] that the set of real tensors P = [p_{i_1 i_2 ⋯ i_n}] of format d_1 × d_2 × ⋯ × d_n of nonnegative rank at most 2 is a closed semialgebraic subset of dimension 2(d_1 + d_2 + ⋯ + d_n) − 2(n − 1). Throughout, we informally refer to the set of tensors of some dimensions and rank as a space of tensors of those dimensions and rank.

Definition 2.1 ([ARSZ]). Suppose P is of the form (2.1) with n ≥ 3, d_i ≥ 2, and r = 2. Pick any subset A of [n] = {1, 2, ..., n} with 1 ≤ |A| ≤ n − 1 and write the tensor P as an ordinary matrix with Π_{i∈A} d_i rows and Π_{j∉A} d_j columns. The flattening rank of P is the maximal rank of any of these matrices.

Definition 2.2 ([ARSZ]). Fix a tuple π = (π_1, π_2, ..., π_n) where π_i is a permutation of {1, ..., d_i}. Then P is π-supermodular if

    p_{i_1 i_2 ⋯ i_n} · p_{j_1 j_2 ⋯ j_n} ≤ p_{k_1 k_2 ⋯ k_n} · p_{l_1 l_2 ⋯ l_n}    (2.2)

whenever {i_r, j_r} = {k_r, l_r} and π_r(k_r) ≤ π_r(l_r) holds for r = 1, 2, ..., n. A tensor P is called supermodular if it is π-supermodular for some π.

Theorem 2.1 ([ARSZ]). A nonnegative tensor P has nonnegative rank at most 2 if and only if P is supermodular and has flattening rank at most 2.

Example 2.3. Let P = [p_ijkl] be a real 2 × 2 × 2 × 2 tensor. The tensors P with flattening rank at most 2 are exactly the solutions to the system of equations defined by the 3 × 3 minors of the matrices

    ( p_1111  p_1112  p_1121  p_1122 )      ( p_1111  p_1112  p_1211  p_1212 )      ( p_1111  p_1121  p_1211  p_1221 )
    ( p_1211  p_1212  p_1221  p_1222 )      ( p_1121  p_1122  p_1221  p_1222 )      ( p_1112  p_1122  p_1212  p_1222 )
    ( p_2111  p_2112  p_2121  p_2122 ),     ( p_2111  p_2112  p_2211  p_2212 ),     ( p_2111  p_2121  p_2211  p_2221 )    (2.3)
    ( p_2211  p_2212  p_2221  p_2222 )      ( p_2121  p_2122  p_2221  p_2222 )      ( p_2112  p_2122  p_2212  p_2222 )

obtained by setting n = 4 and A = {1, 2}, A = {1, 3}, and A = {1, 4}, respectively, in Definition 2.1. Since A and A^c yield transpose matrices, and since A = {1} results in a 2 × 8 matrix, there are no other 3 × 3 minors to consider.

Example 2.4 ([ARSZ]). Let P = [p_ijk] be a real 2 × 2 × 2 tensor. As in Example 2.2, p_ijk = a_1i b_1j c_1k + a_2i b_2j c_2k. In this case there are no flattening rank conditions, since no flattening has any 3 × 3 minors. For π = (id, id, id), the binomial inequalities for supermodularity are

    p_111 p_222 ≥ p_112 p_221    p_111 p_222 ≥ p_121 p_212    p_111 p_222 ≥ p_211 p_122
    p_112 p_222 ≥ p_122 p_212    p_121 p_222 ≥ p_122 p_221    p_211 p_222 ≥ p_212 p_221    (2.4)
    p_111 p_122 ≥ p_112 p_121    p_111 p_212 ≥ p_112 p_211    p_111 p_221 ≥ p_121 p_211

Nonnegative tensors P that satisfy these nine inequalities lie in the set M_{id,id,id} = M_{(12),(12),(12)}. By the label swapping 1 ↔ 2, we obtain three other sets M_{id,id,(12)} = M_{(12),(12),id}, M_{id,(12),id} = M_{(12),id,(12)}, and M_{(12),id,id} = M_{id,(12),(12)}. Thus, by definition, the semialgebraic set of all supermodular tensors is the union

    M = M_{id,id,id} ∪ M_{id,id,(12)} ∪ M_{id,(12),id} ∪ M_{(12),id,id}.    (2.5)

Theorem 2.1 states that P ∈ R^{2×2×2}_{≥0} has nonnegative rank at most 2 if and only if P lies in M.
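A hedged Julia sketch of this criterion (our own illustration, with function names of our choosing): we build a random tensor of nonnegative rank at most 2 as in Example 2.2 and verify that it satisfies the inequalities (2.4) for one of the four label swaps in (2.5), as Theorem 2.1 predicts.

    # Build a 2x2x2 tensor of nonnegative rank <= 2 from nonnegative 2x2 matrices.
    make_rank2(A, B, C) = [A[1,i]*B[1,j]*C[1,k] + A[2,i]*B[2,j]*C[2,k]
                           for i in 1:2, j in 1:2, k in 1:2]

    # The nine inequalities (2.4) for pi = (id, id, id), up to a numerical tolerance.
    function is_id_supermodular(p; tol = 1e-12)
        pairs = [((1,1,1),(2,2,2),(1,1,2),(2,2,1)), ((1,1,1),(2,2,2),(1,2,1),(2,1,2)),
                 ((1,1,1),(2,2,2),(2,1,1),(1,2,2)), ((1,1,2),(2,2,2),(1,2,2),(2,1,2)),
                 ((1,2,1),(2,2,2),(1,2,2),(2,2,1)), ((2,1,1),(2,2,2),(2,1,2),(2,2,1)),
                 ((1,1,1),(1,2,2),(1,1,2),(1,2,1)), ((1,1,1),(2,1,2),(1,1,2),(2,1,1)),
                 ((1,1,1),(2,2,1),(1,2,1),(2,1,1))]
        all(p[a...]*p[b...] >= p[c...]*p[d...] - tol for (a,b,c,d) in pairs)
    end

    # Swapping the label 1 <-> 2 in a coordinate amounts to reversing that index.
    relabel(p, s) = [p[s[1] ? 3-i : i, s[2] ? 3-j : j, s[3] ? 3-k : k]
                     for i in 1:2, j in 1:2, k in 1:2]
    is_supermodular(p) = any(is_id_supermodular(relabel(p, s)) for s in
                             [(false,false,false), (false,false,true),
                              (false,true,false), (true,false,false)])

    P = make_rank2(rand(2,2), rand(2,2), rand(2,2))
    @show is_supermodular(P)   # true, as guaranteed by Theorem 2.1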

2.3 Maximum Likelihood Estimation: A Closer Look

When dealing with statistical models involving discrete data we may identify the sample space with the set of the first m positive integers, [m] := {1, 2, ..., m}. A probability distribution on the set [m] is a point in the probability simplex

    Δ_{m−1} := { (p_1, ..., p_m) ∈ R^m : Σ_{i=1}^m p_i = 1 and p_j ≥ 0 for all j }.

The algebraic statistical model is a natural generalization of the ordinary statistical model. It comes as the image of a polynomial map

    f : R^d → R^m,   θ = (θ_1, θ_2, ..., θ_d) ↦ (f_1(θ), f_2(θ), ..., f_m(θ)).    (2.6)

Each f_i is a polynomial in R[θ_1, ..., θ_d], and θ_1, ..., θ_d are the model parameters. Furthermore, (θ_1, θ_2, ..., θ_d) is a point in Θ, a non-empty open subset of R^d called the parameter space of the model f. We assume that Θ satisfies f_i(θ) > 0 for all i ∈ [m] and θ ∈ Θ.

Since the data is discrete, it can be given in the form of a sequence of observations

    i_1, i_2, ..., i_N    (2.7)

where each i_j is an element of the sample space [m]. The integer N is the sample size. This data can be summarized in the data vector u = (u_1, u_2, ..., u_m), where u_k is the number of indices j ∈ [N] such that i_j = k. Hence u ∈ N^m, where N = {0, 1, 2, ...}, and u_1 + u_2 + ⋯ + u_m = N. The empirical distribution corresponding to the data (2.7) is the scaled vector (1/N)u, which is a point in the probability simplex. We consider the model f to be a good fit for the data u if there exists a parameter vector θ ∈ Θ such that the probability distribution f(θ) is close, in a statistically meaningful way, to the empirical distribution (1/N)u. Were we to draw N times at random from the set [m] with respect to the probability distribution f(θ), then the probability of observing the sequence (2.7) gives the likelihood function

    L(θ) = f_{i_1}(θ) f_{i_2}(θ) ⋯ f_{i_N}(θ) = f_1(θ)^{u_1} f_2(θ)^{u_2} ⋯ f_m(θ)^{u_m}.    (2.8)

Since u represents the observed data it is fixed, and L depends only on θ; therefore, L is a function from Θ to R_{>0}. It is equivalent but simpler to deal with the log of the likelihood function, ℓ(θ). The problem of maximum likelihood estimation is to maximize ℓ(θ) where θ ranges over the parameter space Θ. Put plainly, we aim to solve the optimization problem:

    maximize ℓ(θ) subject to θ ∈ Θ.    (2.9)

A solution to (2.9) is called a maximum likelihood estimate of θ with respect to the model f and the data u, and is denoted θ̂. For many statistical models a maximum likelihood estimate may not exist, and if it does, there could be more than one global maximum; actually, there can be infinitely many of them [PS]. Also, it may be difficult to find any one of these global maxima. This is where the Expectation Maximization (EM) algorithm enters the picture. It is a numerical method for finding solutions to (2.9), but it also gives insight, like shading paper over a leaf, into the topology of the model M. For a detailed treatment of maximum likelihood estimation in the context of computational biology, see [PS], Sections 1.1, 1.3, and 3.3, from which the above exposition is derived.

Let us consider maximum likelihood estimation in a less general setting. The rth mixture model M of two discrete random variables X and Y expresses the conditional independence statement X ⊥⊥ Y | Z, where Z is a hidden variable with r states. (Imagine having data on hair length and height. The hidden variable is gender and has r states, depending on how one chooses to classify gender.) Now,

assuming X and Y have m and n states respectively, their joint distribution is written as an m × n matrix of nonnegative rank r whose entries sum to 1. Let the nonnegative matrix

    U = ( u_11  ⋯  u_1n )
        (  ⋮    ⋱   ⋮   )
        ( u_m1  ⋯  u_mn )

be a collection of independent and identically distributed samples from a joint distribution. Here, u_ij is the number of observations in the sample with X = i and Y = j. The sample size is u_{++} = Σ_{i,j} u_ij. The EM algorithm attempts to maximize the log-likelihood function (2.12) of the model M. It approximates the data matrix U with a product of nonnegative matrices A and B, where A ∈ R^{m×r}_{≥0} and B ∈ R^{r×n}_{≥0}. As mentioned in the introduction, this is a nonconvex optimization problem, and any algorithm that attempts to solve it will run into a host of problems, of which the following dichotomy is most fundamental: either the MLE P̂ lies in the relative interior of the model M, or it lies in the boundary ∂M of the model. If P̂ lies in ∂M, then it is generally not a critical point for the likelihood function in the space of rank r matrices. It is shown in [KRS +] that for 8 × 8 matrices of nonnegative rank 5, 96% of data matrices have MLEs lying in the boundary ∂M.

Let Δ_{mn−1} denote the probability simplex of nonnegative m × n matrices P = [p_ij]. The model M is the subset of Δ_{mn−1} consisting of all matrices of the form

    P = A Λ B,    (2.10)

where A is a nonnegative m × r matrix whose columns sum to 1, Λ is a nonnegative r × r diagonal matrix whose entries sum to 1, and B is a nonnegative r × n matrix whose rows sum to 1. The kth column of A represents the conditional probability distribution of X given Z = k; the kth row of B represents the conditional probability distribution of Y given Z = k; and the diagonal of Λ is the probability distribution of Z. The parameter space in which (A, Λ, B) lies is the convex polytope Θ = (Δ_{m−1})^r × Δ_{r−1} × (Δ_{n−1})^r. The model M is the image of the trilinear map Θ → Δ_{mn−1}, (A, Λ, B) ↦ P. We aim to learn the model parameters (A, Λ, B) by maximizing the likelihood function

    ( u_{++} choose u ) · Π_{i=1}^m Π_{j=1}^n p_ij^{u_ij}    (2.11)

or, equivalently, by maximizing the log-likelihood function

    ℓ_U = Σ_{i=1}^m Σ_{j=1}^n u_ij log( Σ_{k=1}^r a_ik λ_k b_kj )    (2.12)

over M.

2.4 EM Algorithm for Matrices

The EM algorithm for m × n matrices is an iterative method for finding local maxima of the log-likelihood function (2.12). Algorithm 1 presents the version in [PS], Section 1.3.

Algorithm 1 Function EM(U, r)
    Select random a_1, a_2, ..., a_r ∈ Δ_{m−1}, random λ ∈ Δ_{r−1}, and random b_1, b_2, ..., b_r ∈ Δ_{n−1}.
    Run the following steps until the entries of the m × n matrix P converge.
    E-Step: Estimate the m × r × n table that represents the expected hidden data:
        Set v_ikj := (a_ik λ_k b_kj / Σ_{l=1}^r a_il λ_l b_lj) · u_ij for i = 1, ..., m, k = 1, ..., r, and j = 1, ..., n.
    M-Step: Maximize the likelihood function of the model for the hidden data:
        Set λ_k := Σ_{i=1}^m Σ_{j=1}^n v_ikj / u_{++} for k = 1, ..., r.
        Set a_ik := Σ_{j=1}^n v_ikj / (u_{++} λ_k) for k = 1, ..., r and i = 1, ..., m.
        Set b_kj := Σ_{i=1}^m v_ikj / (u_{++} λ_k) for k = 1, ..., r and j = 1, ..., n.
    Update the estimate of the joint distribution for our mixture model:
        Set p_ij := Σ_{k=1}^r a_ik λ_k b_kj for i = 1, ..., m, j = 1, ..., n.
    Return P.

The alternating sequence of estimation steps and maximization steps (E- and M-steps) defines trajectories in the parameter polytope Θ. The log-likelihood function (2.12) is nondecreasing along each trajectory (cf. [PS], Theorem 1.15). The value can remain unchanged only at a fixed point of the EM algorithm.

Definition 2.3. An EM fixed point for a given table U is any point (A, Λ, B) in the polytope Θ = (Δ_{m−1})^r × Δ_{r−1} × (Δ_{n−1})^r to which the EM algorithm can converge if it is applied to (U, r).

Lemma 2.2 ([KRS +]). The following are equivalent for a point (A, Λ, B) in the parameter polytope Θ:
1. The point (A, Λ, B) is an EM fixed point.
2. If we start EM with (A, Λ, B) instead of a random point, then EM converges to (A, Λ, B).
3. The point (A, Λ, B) remains fixed after one E-step and one M-step.

Every global maximum P̂ of ℓ_U is among the EM fixed points. [KRS +] identify the polynomials whose roots represent all fixed points in the 4 × 4 matrix case. Since a point is EM fixed if and only if it stays fixed after an E-step and an M-step, we can write rational function equations for the EM fixed points in Θ. We examine this process in depth in Chapter 3.
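The following is a short Julia transcription of Algorithm 1 (our own sketch, with a fixed iteration count in place of a convergence test; it is not the thesis implementation, and the function name is ours).

    # EM for the r-th mixture model of an m x n count matrix U (Algorithm 1).
    function em_matrix(U, r; iters = 500)
        m, n = size(U)
        upp = sum(U)
        A = rand(m, r); A ./= sum(A, dims = 1)     # columns of A sum to 1
        B = rand(r, n); B ./= sum(B, dims = 2)     # rows of B sum to 1
        lam = rand(r);  lam ./= sum(lam)
        P = zeros(m, n)
        for _ in 1:iters
            # E-step: expected hidden data v[i,k,j]
            v = [A[i,k]*lam[k]*B[k,j] / sum(A[i,l]*lam[l]*B[l,j] for l in 1:r) * U[i,j]
                 for i in 1:m, k in 1:r, j in 1:n]
            # M-step: update lambda, A, B, then the joint distribution P
            lam = [sum(v[:,k,:]) / upp for k in 1:r]
            A   = [sum(v[i,k,:]) / (upp*lam[k]) for i in 1:m, k in 1:r]
            B   = [sum(v[:,k,j]) / (upp*lam[k]) for k in 1:r, j in 1:n]
            P   = [sum(A[i,k]*lam[k]*B[k,j] for k in 1:r) for i in 1:m, j in 1:n]
        end
        return P, A, lam, B
    end

    U = [4 2 2; 2 4 2; 2 2 4]          # a small made-up count matrix
    P, A, lam, B = em_matrix(U, 2)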

2.5 Ideals, Varieties, and Algorithms

Let R = K[x_1, ..., x_n] be the ring of polynomials in n variables with coefficients in a subfield K of the real numbers R, usually the rational numbers K = Q.

Definition 2.4. A subset I ⊆ R is an ideal in R if I is a subgroup of R under addition and, for every f ∈ I and every g ∈ R, we have fg ∈ I. Equivalently, an ideal I is closed under taking linear combinations with coefficients in the ring R.

Definition 2.5. Let K be a field and let f_1, ..., f_s be polynomials in K[x_1, ..., x_n]. Then we set

    V(f_1, ..., f_s) = { (a_1, ..., a_n) ∈ K^n : f_i(a_1, ..., a_n) = 0 for all 1 ≤ i ≤ s }.

We call V(f_1, ..., f_s) the variety defined by f_1, ..., f_s.

Let T = {f_1, ..., f_s}. The ideal generated by T, denoted ⟨T⟩, is the smallest ideal in R containing T. We use V(T) in place of V(⟨T⟩). In computational algebra we often replace T by a Gröbner basis of ⟨T⟩. This allows us to test ideal membership and to determine geometric properties of the variety V(T) [CLO].

Definition 2.6. A subset X ⊆ C^n is a variety if X = V(T) for some T ⊆ R. A variety X ⊆ C^n is irreducible if we cannot write X = X_1 ∪ X_2, where X_1, X_2 ⊊ X are strictly smaller varieties. An ideal I ⊆ R is prime if fg ∈ I implies f ∈ I or g ∈ I.

Proposition 2.3. The variety X is irreducible if and only if I(X) is prime.

An ideal is radical if it is an intersection of prime ideals.

Proposition 2.4. Every variety X can be written uniquely as X = X_1 ∪ X_2 ∪ ⋯ ∪ X_m, where X_1, X_2, ..., X_m are irreducible and none of these m components contains any other. Moreover, I(X) = I(X_1) ∩ I(X_2) ∩ ⋯ ∩ I(X_m) is the unique decomposition of the radical ideal I(X) as an intersection of prime ideals.

A minimal prime of an ideal I is a prime ideal J such that V(J) is an irreducible component of V(I).

Definition 2.7. An ideal I in K[x_1, ..., x_n] is primary if fg ∈ I implies either f ∈ I or g^m ∈ I for some m > 0.

Lemma 2.5. If an ideal I is primary, then √I is prime, and it is the smallest prime ideal containing I.

All ideals I in R can be written as intersections of primary ideals; that is, there is a decomposition I = Q_1 ∩ Q_2 ∩ ⋯ ∩ Q_s where each Q_i is primary. The radical P = √Q of a primary ideal Q is a prime ideal, and Q is called P-primary. Primary ideals are more general than prime ideals, but they still define irreducible varieties, and geometrically primary ideals contain the same information as do their prime counterparts.

Definition 2.8. Let I ⊆ K[x_1, ..., x_n] be an ideal and f ∈ K[x_1, ..., x_n]. Then the saturation of I with respect to f is the ideal

    (I : f^∞) = { g ∈ K[x_1, ..., x_n] : g f^m ∈ I for some m > 0 }.

Saturating an ideal I by a polynomial f geometrically means that we obtain a new ideal J = (I : f^∞) whose variety V(J) contains all components of V(I) except for the ones on which f vanishes. For more on these concepts see [CLO], from which this section is derived.
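A small worked example, added here for illustration: take I = ⟨xy, xz⟩ in K[x, y, z]. Its variety V(I) is the union of the plane {x = 0} and the line {y = z = 0}. Saturating by x removes the component on which x vanishes: (I : x^∞) = ⟨y, z⟩, whose variety is exactly the line {y = z = 0}. Saturation plays exactly this role in Chapters 3 and 5, where components of the EM fixed point variety lying on coordinate hyperplanes are discarded.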

3 EM Fixed Point Ideal

3.1 Extension of EM Algorithm to Tensors

Maximum likelihood estimation and the EM algorithm for matrices extend naturally to data given in the form of a tensor, which is just a table of dimension higher than 2. Here we restate the MLE problem and the EM algorithm for 2 × 2 × 2 tensors of nonnegative rank 2 and describe the ideal of EM fixed points. We begin by updating the parameter polytope Θ to (Δ_1 × Δ_1 × Δ_1)^2 × Δ_1. A point in Θ is of the form (A, B, C, Λ) where A, B, C ∈ R^{2×2}_{≥0} are nonnegative and row stochastic, and Λ ∈ R^{2×2}_{≥0} is a nonnegative diagonal 2 × 2 matrix. The model M is the image of the quadrilinear map

    Θ → Δ_7,   (A, B, C, Λ) ↦ P.    (3.1)

We update the function ℓ_U to reflect the tensor U. Now we seek to maximize

    ( u_{+++} choose u ) · Π_{i,j,k=1}^2 p_ijk^{u_ijk},

where the u_ijk are the data and the unknowns P = [p_ijk] form a nonnegative tensor of nonnegative rank 2 with p_{+++} = 1. Since we do not allow p_ijk = 0, the set of such tensors P is a strict subset of the probability simplex Δ_7. Again, this is equivalent to maximizing the log-likelihood function

    ℓ_U = Σ_{i,j,k} u_ijk log(p_ijk) = Σ_{i,j,k} u_ijk log( Σ_{l=1}^r λ_l a_li b_lj c_lk ).    (3.2)

In Algorithm 2 we update the EM algorithm for matrices to reflect the new format of the data.

Algorithm 2 Function EM(U, r), with i, j, k ∈ {1, 2}, r = 2, and U = [u_ijk] ∈ R^{2×2×2}_{≥0}
    Select random nonnegative row-stochastic matrices A, B, C in R^{2×2}_{≥0} and a nonnegative diagonal 2 × 2 matrix Λ. Define the two nonnegative rank 1, 2 × 2 × 2 tensors [λ_1 a_1i b_1j c_1k] and [λ_2 a_2i b_2j c_2k].
    Run the following steps until the entries of the tensor P converge.
    E-Step: Estimate the table that represents the expected hidden data:
        Set v^l_ijk := (λ_l a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) · u_ijk for i, j, k, l = 1, 2.
    M-Step: Maximize the likelihood function of the model for the hidden data:
        Set λ_l := Σ_{i,j,k=1}^2 v^l_ijk / u_{+++} for l = 1, 2.
        Set a_li := Σ_{j,k=1}^2 v^l_ijk / (u_{+++} λ_l) for l, i = 1, 2.
        Set b_lj := Σ_{i,k=1}^2 v^l_ijk / (u_{+++} λ_l) for l, j = 1, 2.
        Set c_lk := Σ_{i,j=1}^2 v^l_ijk / (u_{+++} λ_l) for l, k = 1, 2.
    Update the estimate of the joint distribution for our mixture model:
        Set p_ijk := Σ_{l=1}^2 λ_l a_li b_lj c_lk for i, j, k = 1, 2.
    Return P.
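A Julia sketch of Algorithm 2 follows (our own transcription with a fixed iteration count; the thesis implementation is also in Julia but is not reproduced here, and the function name is ours).

    # EM for a 2x2x2 count tensor U and nonnegative rank r = 2 (Algorithm 2).
    function em_tensor(U; iters = 1000)
        uppp = sum(U)
        A = rand(2, 2); A ./= sum(A, dims = 2)     # row stochastic
        B = rand(2, 2); B ./= sum(B, dims = 2)
        C = rand(2, 2); C ./= sum(C, dims = 2)
        lam = rand(2);  lam ./= sum(lam)
        P = zeros(2, 2, 2)
        for _ in 1:iters
            # E-step: expected hidden data v[l,i,j,k]
            v = [lam[l]*A[l,i]*B[l,j]*C[l,k] /
                 sum(lam[s]*A[s,i]*B[s,j]*C[s,k] for s in 1:2) * U[i,j,k]
                 for l in 1:2, i in 1:2, j in 1:2, k in 1:2]
            # M-step
            lam = [sum(v[l,:,:,:]) / uppp for l in 1:2]
            A = [sum(v[l,i,:,:]) / (uppp*lam[l]) for l in 1:2, i in 1:2]
            B = [sum(v[l,:,j,:]) / (uppp*lam[l]) for l in 1:2, j in 1:2]
            C = [sum(v[l,:,:,k]) / (uppp*lam[l]) for l in 1:2, k in 1:2]
            P = [sum(lam[l]*A[l,i]*B[l,j]*C[l,k] for l in 1:2)
                 for i in 1:2, j in 1:2, k in 1:2]
        end
        return P, A, B, C, lam
    end

    U = rand(1:20, 2, 2, 2)     # a random count tensor
    P, A, B, C, lam = em_tensor(U)
    @show sum(P)                # the entries of P sum to 1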

3.2 EM Fixed Point Ideal for Tensors

As in Section 2.4, if we could compute all EM fixed points, then this would reveal the global maximizer of ℓ_U. Since a point is EM fixed if and only if it stays fixed after an E-step and an M-step, we can write rational function equations for the EM fixed points in Θ:

    λ_l = (1 / u_{+++}) · Σ_{i,j,k=1}^2 (λ_l a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk    for l = 1, 2,
    a_li = (1 / (u_{+++} λ_l)) · Σ_{j,k=1}^2 (λ_l a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk    for i, l = 1, 2,
    b_lj = (1 / (u_{+++} λ_l)) · Σ_{i,k=1}^2 (λ_l a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk    for j, l = 1, 2,
    c_lk = (1 / (u_{+++} λ_l)) · Σ_{i,j=1}^2 (λ_l a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk    for k, l = 1, 2.

Our goal is to understand the solutions to these equations for a fixed tensor U. We seek to find the variety they define in the polytope Θ and the image of that variety in M. In the EM algorithm we usually start with a_li, b_lj, c_lk, λ_l that are strictly positive. The a_li, b_lj, c_lk may become zero in the limit, but the parameters λ_l always remain positive when the u_ijk are positive, since the rows of A, B, C sum to 1. This justifies cancelling the factors λ_l in our equations. After this, the first equation is implied by the other three. Therefore, the set of all EM fixed points is a variety, and it is characterized by

    a_li = (1 / u_{+++}) · Σ_{j,k=1}^2 (a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk    for i, l = 1, 2,
    b_lj = (1 / u_{+++}) · Σ_{i,k=1}^2 (a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk    for j, l = 1, 2,
    c_lk = (1 / u_{+++}) · Σ_{i,j=1}^2 (a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk    for k, l = 1, 2.

These equations can be simplified further. For example, multiplying the first of them by u_{+++} gives

    a_li u_{+++} = Σ_{j,k=1}^2 (a_li b_lj c_lk / Σ_{s=1}^2 λ_s a_si b_sj c_sk) u_ijk.

Note that Σ_{j,k=1}^2 b_lj c_lk = 1, so the left-hand side equals a_li u_{+++} Σ_{j,k=1}^2 b_lj c_lk, and moving everything to one side yields

    a_li ( Σ_{j,k=1}^2 ( u_{+++} − u_ijk / Σ_{s=1}^2 λ_s a_si b_sj c_sk ) b_lj c_lk ) = 0,

which, using the identity p_ijk = Σ_{l=1}^2 λ_l a_li b_lj c_lk, becomes

    a_li ( Σ_{j,k=1}^2 ( u_{+++} − u_ijk / p_ijk ) b_lj c_lk ) = 0    for l, i = 1, 2.

We can do this simplification symmetrically for b_lj and c_lk, yielding

    a_li ( Σ_{j,k=1}^2 ( u_{+++} − u_ijk / p_ijk ) b_lj c_lk ) = 0    for i, l = 1, 2,
    b_lj ( Σ_{i,k=1}^2 ( u_{+++} − u_ijk / p_ijk ) a_li c_lk ) = 0    for j, l = 1, 2,
    c_lk ( Σ_{i,j=1}^2 ( u_{+++} − u_ijk / p_ijk ) a_li b_lj ) = 0    for k, l = 1, 2.

Therefore, the set of EM fixed points is a variety characterized by the above equations. We can simplify further, denoting by R the tensor with entries

    r_ijk = u_{+++} − u_ijk / p_ijk

and the fixed point equations become

    a_li ( Σ_{j,k=1}^2 r_ijk b_lj c_lk ) = 0    for all l, i = 1, 2,
    b_lj ( Σ_{i,k=1}^2 r_ijk a_li c_lk ) = 0    for all l, j = 1, 2,
    c_lk ( Σ_{i,j=1}^2 r_ijk a_li b_lj ) = 0    for all l, k = 1, 2.

This derivation yields the following theorem.

Theorem 3.1. The variety of EM fixed points for 2 × 2 × 2 tensors of rank_+ 2 in the polytope Θ is defined by the equations

    a_li ( Σ_{j,k=1}^2 r_ijk b_lj c_lk ) = 0    for all l, i = 1, 2,
    b_lj ( Σ_{i,k=1}^2 r_ijk a_li c_lk ) = 0    for all l, j = 1, 2,
    c_lk ( Σ_{i,j=1}^2 r_ijk a_li b_lj ) = 0    for all l, k = 1, 2,

where [r_ijk] = [ u_{+++} − u_ijk / p_ijk ].

The variety defined in Theorem 3.1 is reducible.
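The equations of Theorem 3.1 can be checked numerically at the output of the em_tensor sketch given after Algorithm 2: at a point the EM algorithm has converged to, the residuals below should be close to zero. This is our own verification sketch, and the function names are ours.

    # Residuals of the fixed point equations of Theorem 3.1 at (A, B, C, lam).
    function fixed_point_residuals(U, A, B, C, lam)
        uppp = sum(U)
        P = [sum(lam[l]*A[l,i]*B[l,j]*C[l,k] for l in 1:2)
             for i in 1:2, j in 1:2, k in 1:2]
        R = [uppp - U[i,j,k]/P[i,j,k] for i in 1:2, j in 1:2, k in 1:2]
        resA = [A[l,i]*sum(R[i,j,k]*B[l,j]*C[l,k] for j in 1:2, k in 1:2) for l in 1:2, i in 1:2]
        resB = [B[l,j]*sum(R[i,j,k]*A[l,i]*C[l,k] for i in 1:2, k in 1:2) for l in 1:2, j in 1:2]
        resC = [C[l,k]*sum(R[i,j,k]*A[l,i]*B[l,j] for i in 1:2, j in 1:2) for l in 1:2, k in 1:2]
        return maximum(abs, [resA; resB; resC])
    end

    U = rand(1:20, 2, 2, 2)
    P, A, B, C, lam = em_tensor(U)
    @show fixed_point_residuals(U, A, B, C, lam)   # small after convergence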

Definition 3.1. Let F be the ideal of EM fixed points, as in Theorem 3.1. A minimal prime of F is called relevant if it contains none of the 8 polynomials p_ijk = Σ_{l=1}^2 λ_l a_li b_lj c_lk.

Theorem 3.2. The ideal F of EM fixed points for tensors of nonnegative rank 2 has precisely 52 minimal primes consisting of 9 orbital classes. Moreover, the ideal is radical; hence it equals the intersection of its minimal primes.

Proof. While the ideal F is not a binomial ideal, we follow [KRS +] in using an approach based on the primary decomposition of binomial ideals given in [ES, Section 6]. Let F be the EM fixed point ideal

    F = ⟨ a_li ( Σ_{j,k=1}^2 r_ijk b_lj c_lk ), b_lj ( Σ_{i,k=1}^2 r_ijk a_li c_lk ), c_lk ( Σ_{i,j=1}^2 r_ijk a_li b_lj ) : i, j, k, l = 1, 2 ⟩.

Any prime ideal containing F contains either a_li or Σ_{j,k=1}^2 r_ijk b_lj c_lk for l, i ∈ {1, 2}, and either b_lj or Σ_{i,k=1}^2 r_ijk a_li c_lk for l, j ∈ {1, 2}, and either c_lk or Σ_{i,j=1}^2 r_ijk a_li b_lj for l, k ∈ {1, 2}. We categorize all primes containing F according to the set S of unknowns a_li, b_lj, and c_lk that they contain. There are 2^12 such subsets, and the symmetry group acts on this power set by permuting the rows of A, B, and C simultaneously, the columns of A, B, and C separately, and the matrices A, B, and C themselves. We pick one representative S from each orbit that is relevant; that is, we exclude those orbits for which p_ijk = Σ_{l=1}^2 λ_l a_li b_lj c_lk = 0. These are exactly the orbits containing an element p_ijk lying in the ideal ⟨S⟩. For each relevant representative S, we compute the cellular component

    F_S = ( (F + ⟨S⟩) : (Π S^c)^∞ ),

where S^c = {a_11, ..., a_22, b_11, ..., b_22, c_11, ..., c_22} \ S. Next we minimalize our cellular decomposition by removing all representatives S such that F_T ⊆ F_S for some representative T in another orbit. This leads to a list of 6 orbits comprising 11 ideals. Up to symmetry, each prime is uniquely determined by its attributes in Table 3.1. These are its set S, its degree and codimension, the ranks rA = rank(A), rB = rank(B), and rC = rank(C) at a generic point, the number of ideals in the orbit of S, and the number of elements in the primary decomposition. In each case, primality of the ideal was verified using either the Macaulay2 isPrime function or the linear elimination sequence in [GSS, Proposition 23(b)], which we discuss in detail in Section 5.1.

Table 3.1: Minimal primes of the EM fixed point ideal F for tensors of rank_+ 2. For each class the table records the set S, the cardinalities of S ∩ {a_li}, S ∩ {b_lj}, and S ∩ {c_lk}, the degree and codimension, the generic ranks rA, rB, rC, the orbit size, and the number of primes. The classes are represented by the sets

    { },  {a_11},  {a_11, b_11},  {a_11, a_12},  {a_11, a_22},  {a_11, b_11, c_11}.

In Table 3.1, while the ideals given by { } determine the fixed points in the interior of M, the ideals given by {a_11}, {a_11, b_11}, {a_11, a_22}, and {a_11, b_11, c_11} determine the fixed points on the non-interior boundary strata of M, as seen in [AHRZ]. That is, the map (3.1) sends any parameter point (A, B, C, Λ) in Θ at which the defining equations of these ideals vanish to a boundary stratum of M. The ideal given by {a_11, a_12} is degenerate because it yields a probability distribution outside of M. If a_11 = a_12 = 0

then A is not a stochastic matrix, since the first row of A does not sum to 1. While the number of primes corresponding to { } is 5, we only show the attributes of three. The two minimal primes not appearing in the list can be obtained as group actions of the ones in the list. We refer to the minimal primes appearing in the list as representatives of orbital classes. The total number of minimal primes can be read off the table by summing the products of the columns orbit and #primes. The orbit is the number of parameter sets in the orbit of the element from column 1. For example, the orbit of {a_11, a_22} consists of {a_11, a_22}, {b_11, b_22}, and {c_11, c_22}, so orbit = 3.

The most concise ideal is the minimal prime defined by setting a_11 = 0 and a_22 = 0, corresponding to one of two 5-dimensional substrata of M. It has defining equations

    a_{1,1},  a_{2,2},
    b_{1,1} r_{2,1,2} + b_{1,2} r_{2,2,2},    b_{1,1} r_{2,1,1} + b_{1,2} r_{2,2,1},
    b_{2,1} r_{1,1,2} + b_{2,2} r_{1,2,2},    b_{2,1} r_{1,1,1} + b_{2,2} r_{1,2,1},
    c_{1,1} r_{2,1,1} + c_{1,2} r_{2,1,2},    c_{1,1} r_{2,2,1} + c_{1,2} r_{2,2,2},
    c_{2,1} r_{1,2,1} + c_{2,2} r_{1,2,2},    c_{2,1} r_{1,1,1} + c_{2,2} r_{1,1,2},
    r_{1,1,2} r_{1,2,1} − r_{1,1,1} r_{1,2,2},    r_{2,1,2} r_{2,2,1} − r_{2,1,1} r_{2,2,2}.    (3.3)

Recall that

    r_ijk = u_{+++} − u_ijk / p_ijk = u_{+++} − u_ijk / ( Σ_{l=1}^2 λ_l a_li b_lj c_lk ),    (3.4)

thus the set of EM fixed points corresponding to a data tensor U = [u_ijk] defined by this ideal is obtained by substituting (3.4), clearing denominators, and saturating. In this case, the tensor R consists of two rank 1 slices. We also see that rA, rB, and rC are 2, since the determinants of A, B, and C do not appear in the decomposition.

We extend the computations of Theorem 3.2 to the case of tensors of nonnegative rank 3. The boundary stratification of the space of tensors of nonnegative rank 3 is not known, but the parameters that yield its stratification reside within the decomposition given in Table 3.2. We update the parameter polytope Θ = (A, B, C, Λ) with

    A = ( a_11  a_12 )    B = ( b_11  b_12 )    C = ( c_11  c_12 )    Λ = ( λ_1   0    0  )
        ( a_21  a_22 )        ( b_21  b_22 )        ( c_21  c_22 )        (  0   λ_2   0  )
        ( a_31  a_32 ),       ( b_31  b_32 ),       ( c_31  c_32 ),       (  0    0   λ_3 )

where A, B, C are stochastic matrices and Σ_{l=1}^3 λ_l = 1 with λ_i ≥ 0. We extend the EM algorithm in the natural way, with

    p_ijk = Σ_{l=1}^3 λ_l a_li b_lj c_lk.

Theorem 3.3. The ideal F of EM fixed points for tensors of nonnegative rank 3 has precisely 277 minimal primes consisting of 41 orbital classes. Up to symmetry, each prime is uniquely determined by its attributes in Table 3.2. Moreover, the ideal is not radical: the ideal corresponding to { } contains embedded components.

Table 3.2: Minimal primes of the EM fixed point ideal F for tensors of rank_+ 3. As in Table 3.1, each class is recorded with its set S, the cardinalities of S ∩ {a_li}, S ∩ {b_lj}, and S ∩ {c_lk}, its degree and codimension, the generic ranks rA, rB, rC, the orbit size, and the number of primes. The representative sets S listed in the table are

    { },  {a_11},  {a_11, b_11},  {a_11, b_11, c_11},  {a_11, a_12},  {a_11, a_21},  {a_11, a_22},
    {a_11, a_12, a_21},  {a_11, a_12, a_21, a_22},  {a_11, a_12, b_21},  {a_11, a_12, b_21, c_21},
    {a_11, a_12, b_21, b_22},  {a_11, a_21, b_11, b_21},  {a_11, a_21, b_11, b_21, c_11, c_21},
    {a_11, a_21, b_11, b_22, c_11, c_22},  {a_11, a_22, b_11, b_22},  {a_11, a_22, b_11, b_22, c_11, c_22},
    {a_11, a_12, a_21, b_21},  {a_11, a_12, a_21, b_21, c_21}.

4 MLE Using Boundary Strata

4.1 Boundary Stratification of Binary Tensors

Following [ARSZ], Allman, Hoşten, Rhodes, and Zwiernik [AHRZ] completely characterize the boundary stratification of binary tensors of nonnegative rank two, and for such tensors they give specific formulas for the ML estimate on each stratum. For example, in the 2 × 2 × 2 case there are 15 ridges of dimension 5. One of these strata is obtained as the image of those parameters (A, B, C, Λ) where a_11 = 0 and a_22 = 0. The resulting tensor P = [p_ijk] is of the form

    λ_1 · ( [ 0  0 ; 0  0 ],  [ a_12 b_11 c_11  a_12 b_11 c_12 ; a_12 b_12 c_11  a_12 b_12 c_12 ] )
  + λ_2 · ( [ a_21 b_21 c_21  a_21 b_21 c_22 ; a_21 b_22 c_21  a_21 b_22 c_22 ],  [ 0  0 ; 0  0 ] ),

where each summand is written as its two slices i = 1 and i = 2. [AHRZ] give the following ML formula corresponding to this 5-dimensional ridge, completely in terms of the data U = [u_ijk]:

    p̂_ijk = ( u_{ij+} · u_{i+k} ) / ( u_{i++} · u_{+++} ),    i, j, k = 1, 2.

For all of the strata there is a formula in terms of the data U. If, among all the formulas for all of the strata, this P̂ produces the maximum value of the log-likelihood function ℓ_U, and if P̂ is supermodular, then this is the MLE for U, and the MLE for this data lies on a 5-dimensional ridge of the model.
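The formula above is easy to evaluate directly. The sketch below (our own illustration; ridge_estimate is our name, and is_supermodular refers to the sketch in Section 2.2) computes this candidate estimate from a count tensor and checks supermodularity before accepting it.

    # Closed-form candidate MLE on the ridge a_11 = a_22 = 0, from a 2x2x2 count tensor U.
    function ridge_estimate(U)
        uppp = sum(U)
        [sum(U[i,j,:]) * sum(U[i,:,k]) / (sum(U[i,:,:]) * uppp)
         for i in 1:2, j in 1:2, k in 1:2]
    end

    U = rand(1:20, 2, 2, 2)
    Phat = ridge_estimate(U)
    @show sum(Phat)               # a probability tensor: the entries sum to 1
    @show is_supermodular(Phat)   # accept as a candidate only if supermodular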

Table 4.1 completely characterizes the boundary stratification of M. For the 2 × 2 × 2 case, we show that the [AHRZ] formula computations for the MLE are faster than the EM algorithm. We use the results in [ARSZ] and [AHRZ] to perform this experiment, along with several others, giving insight into the EM algorithm for tensors, maximum likelihood estimation, and the model M.

Table 4.1: Boundary stratification of tensors of nonnegative rank 2, listing the number of strata, their dimension, and the zeros of (A, B, C, Λ) that define them.

    1 stratum of dimension 7 (the interior):  { }
    6 strata of dimension 6:  a_11 = 0
    dimension 5 (type 5a):  a_1i = 0 and b_1j = 0
    dimension 5 (type 5b):  a_11 = a_22 = 0
    dimension 4:  a_1i = 0 and b_1j = 0 and c_1k = 0
    dimension 3 (rank_+ 1 tensors):  λ_i = 0

4.2 Experiments

In the following experiments we implement all EM and MLE computations using Julia, a high-level dynamic programming language for technical computing. Julia is relatively new; it was developed by researchers at MIT and first appeared in 2012. All graphical modeling is done in R. We present the following experiments as a series of questions and answers.

Experiment 4.1. In which boundary stratum of the model M is the MLE most likely to occur for a random nonnegative data tensor?

We uniformly and randomly generate 10,000 nonnegative tensors. For each tensor we use the [AHRZ] formulas to count the number of times the MLE lands on one of the five strata of the 7-dimensional space of tensors of rank_+ 2. There are 31 formulas to check, and for each formula we must verify that p̂_ijk is supermodular. Checking supermodularity requires verification of between 9 and 36 inequalities. In total, the computations for this experiment required 0.4 seconds. Table 4.2 shows the percentage distribution among the strata.

Table 4.2: Experiment 4.1: stratification attraction of the model M, giving the percentage of samples whose MLE lies on each type of stratum (7-dim, 6-dim, 5a-dim, 5b-dim, 4-dim, and 3-dim).

In R, we produce a 3-dimensional picture modelling this behavior on the 7-dimensional model M. We generate Figures 4.1 and 4.2 by running the EM algorithm on 200,000

randomly generated tensors [u_ijk] from the Jukes-Cantor slice given by

    ( u_111  u_112 )   ( x  y )           ( u_211  u_212 )   ( w  z )
    ( u_121  u_122 ) = ( z  w )    and    ( u_221  u_222 ) = ( y  x ).

A normalized [u_ijk] in the Jukes-Cantor slice is a point in the 3-dimensional simplex. 99.9% of the MLEs for these tensors occur evenly distributed among the interior and the three substrata of the 5b stratum of M, labelled 5b_1, 5b_2, and 5b_3. To each of these strata we associate a color, and to each of the points in the 3-dimensional simplex we assign the color of the boundary stratum containing the MLE at that point. This yields a partitioning of the 3-dimensional simplex. The 5b_1, 5b_2, and 5b_3 subsets form the same shape, rotated by 60°. Observe the linear boundary between the 5b_i subsets and the polynomial boundary of the interior.

Figure 4.1: The simplex is partitioned based on the MLE for the tensor in the Jukes-Cantor slice. The MLE of the pink points is in the interior of the model. The MLE of the light and dark turquoise points lies on the 5b_1 stratum. The second row depicts the 5b_1 subset from two different angles. The blank space between cells is where the shapes interlock. The second picture in the first row shows the interior and the 5b_1 subsets interlocked.

Figure 4.2: The figure on the left shows the 5b_1, 5b_2, and 5b_3 subsets locked together. The figure on the right shows all four subsets: the interior, 5b_1, 5b_2, and 5b_3. The orange region is the interior of the model.

Experiment 4.2. How much does the EM algorithm vary for one data tensor U = [u_ijk]?

We uniformly and randomly generate 100 nonnegative data tensors U, and for each U we run the EM algorithm from 1,000 different starting parameters. Recall that starting parameters are elements (A, B, C, Λ) of the polytope Θ. We determine the frequency with which the EM algorithm spreads, or finds MLEs, across different strata.

Table 4.3: Experiment 4.2: the percentage of data tensors for which the 1,000 EM runs spread across 1, 2, 3, 4, 5, or 6 strata.

Table 4.3 says that, given 1,000 different starting parameters, the EM algorithm will find local maxima on one particular stratum 23% of the time. 32% of the time the algorithm will spread across 2 strata, and so on. This table does not address the density of these spreads. In order to address this, for each run of 1,000 we count the number of times EM lands in one stratum more than 90% of the time. We found that this will occur in 80% of the samples. Informally, this means that the EM algorithm is reliable in the sense that it tends to be attracted to one stratum most of the time.

Experiment 4.3. How often does the EM algorithm produce the actual MLE given one starting parameter?

We uniformly and randomly generate 10,000 nonnegative data tensors U and run the EM algorithm on each U from one starting parameter with a maximum of 10,000 steps. We compute the actual MLE using the [AHRZ] formulas and compare. The EM algorithm converges in 80% of the samples and produces the actual MLE 76% of the time.

Experiment 4.4. How many times must the EM algorithm be run in order to find the actual MLE?

We input 1,000 uniformly and randomly generated nonnegative tensors U and compute the MLE for each one using the [AHRZ] formulas. We then run the EM algorithm on each U until it returns the MLE, and tally the number of different starting parameters required to hit the MLE. The EM algorithm finds the MLE given 1 starting parameter 75.5% of the time. It requires fewer than 10 different starting parameters to find the MLE 96.5% of the time.

Experiment 4.5. Is computing MLEs using the formulas faster than using the EM algorithm?

This is a valid question because computing MLEs with the formulas is not trivial. There are 31 formulas to check. Each formula has a canonical representative, as in a_11 = 0, a_22 = 0. To obtain the ML estimates for the parameters in the orbit of the canonical representative (b_11 = b_22 = 0 and c_11 = c_22 = 0), we permute the data tensor, perform the computation as if it were the canonical representative, and then permute the ML estimate in reverse. For each formula, supermodularity must be verified.

We generate 1,000 nonnegative, random, and uniform tensors U. For each U we compute the MLE using the [AHRZ] formulas and using the EM algorithm with a maximum of 10,000 steps, to ensure convergence 80% of the time. If the EM algorithm does not find the MLE given one set of starting parameters, we discard the trial. It must be noted that the EM algorithm for tensors that we are implementing is most likely not the optimally coded EM algorithm, thus these speeds are only estimates of the efficiency of the algorithm. We compute the mean, median, maximum, and minimum run times for each method. Table 4.4 shows that in these trials the EM algorithm was never faster than the [AHRZ] formulas. In fact, the slowest formula time still beats the fastest EM time.

Table 4.4: Experiment 4.5: mean, median, maximum, and minimum run times for the formula method and for the EM algorithm.

5 Implementation

Here we give an overview of the computational methods used throughout this project. The computational body of work can be split into two categories: the EM fixed point ideal decompositions, and the EM algorithm along with the [AHRZ] formulas. The former consists of cellular decomposition, determining the primality of the cellular components, and decomposing the non-primary components; all of this is implemented in Macaulay2. The latter consists of the EM algorithm itself, the coding of the [AHRZ] formulas, and all of the support functions required for gathering data from these objects. Initially, we attempted to implement this using R, but Julia proved to be greater than 10 times faster. All of the coding for this part is done in Julia, besides the modeling of Figures 4.1 and 4.2, which is done in R.

5.1 Cellular Decomposition, Primality, and Primary Decomposition

Macaulay2 is our primary tool for computing cellular decompositions, determining primality, and decomposing ideals into primary components. The most important theorem for determining primality is [GSS, Proposition 23]. It is stated therein without proof; for a concise proof, see [LS, pg. 3].

Lemma 5.1 ([GSS, Proposition 23]). Let J ⊆ R[x_1, ..., x_n] be an ideal containing a polynomial f = g x_1 + h with g, h not involving x_1 and g a non-zero divisor modulo J. Let J_1 = J ∩ R[x_2, ..., x_n] be the elimination ideal. Then J is prime if and only if J_1 is prime.

Algorithm 3 Pseudocode implementation of Lemma 5.1
    Input an ideal I ⊆ R[x_1, ..., x_n].
    Create LIST: a list of all variables in the ring.
    Compute K: a list of generators of a Gröbner basis of I.
    for i in 1:length(LIST)
        Set f = LIST[i]
        for j in 1:length(K)
            Set g = ∂K[j]/∂f    # the variables appear linearly, so g is the coefficient of f in K[j]
            if (I : g) == I (implying g is a nonzero divisor) then
                I = eliminate(LIST[i], I)
    Return I.

Following elimination, Algorithm 3 yields a simpler ideal, as measured by degree and codimension. After multiple eliminations, as in all of our cases, verification of primality using the Macaulay2 isPrime command takes only seconds. It is wise to update and maintain a sequence of strings representing the elimination sequence for fast verification. Consider the ideal (3.3) obtained by setting a_11 = a_22 = 0. In the syntax of Macaulay2, our algorithm will output the sequence of strings in Figure 5.1.

Figure 5.1: Elimination sequence for verifying primality of the EM fixed point ideal of tensors of rank_+ 2 corresponding to a_11 = a_22 = 0.

    K = first entries gens gb I;
    g = diff(a_(1,1), K#1);
    I : ideal(g) == I
    I = eliminate(a_(1,1), I);
    K = first entries gens gb I;
    g = diff(a_(2,2), K#0);
    I : ideal(g) == I
    I = eliminate(a_(2,2), I);
    K = first entries gens gb I;
    g = diff(b_(1,1), K#2);
    I : ideal(g) == I
    I = eliminate(b_(1,1), I);
    K = first entries gens gb I;
    g = diff(b_(2,1), K#5);
    I : ideal(g) == I
    I = eliminate(b_(2,1), I);
    isPrime(I)

The degree of (3.3) is reduced from 25 to 9, the codimension drops from 8 to 4, and primality is verified by isPrime in less than 1 second.

Our most important method for finding minimal primes is found in the discussion following Proposition 23 in [GSS]. This method is based on splitting the ideal I that we wish to decompose into two parts. Given an ideal I, if there is an element f of its Gröbner basis that factors as f = f_1 f_2, then

    √I = √⟨I, f_1⟩ ∩ √( ⟨I, f_2⟩ : f_1^∞ ).    (5.1)

In our case, the ideals are radical, so we drop the radical signs in (5.1). We keep a list of ideals whose intersection is the same as I. For each ideal we keep a list of the elements we have inverted by so far (for example, f_1 in ⟨I, f_2⟩ : f_1^∞) and saturate at each step with these elements. Eventually, the ideals either split into one or two prime parts, which we verify as in Lemma 5.1, or the splits result in ideals that are decomposable in under 5 minutes using Macaulay2's built-in functionality. This method worked invariably for our decompositions.

Algorithm 4 presents an overview of our method for computing cellular decompositions of the EM fixed point ideal for tensors of nonnegative rank 2. The code associated with this algorithm comprises the main body of our work with the EM fixed point ideals. For its implementation for 4 × 4 matrices, from which our code follows, see [KRS +].


More information

WA State Common Core Standards - Mathematics

WA State Common Core Standards - Mathematics Number & Quantity The Real Number System Extend the properties of exponents to rational exponents. 1. Explain how the definition of the meaning of rational exponents follows from extending the properties

More information

Subject Algebra 1 Unit 1 Relationships between Quantities and Reasoning with Equations

Subject Algebra 1 Unit 1 Relationships between Quantities and Reasoning with Equations Subject Algebra 1 Unit 1 Relationships between Quantities and Reasoning with Equations Time Frame: Description: Work with expressions and equations through understanding quantities and the relationships

More information

Mathematics Standards for High School Algebra I

Mathematics Standards for High School Algebra I Mathematics Standards for High School Algebra I Algebra I is a course required for graduation and course is aligned with the College and Career Ready Standards for Mathematics in High School. Throughout

More information

E. GORLA, J. C. MIGLIORE, AND U. NAGEL

E. GORLA, J. C. MIGLIORE, AND U. NAGEL GRÖBNER BASES VIA LINKAGE E. GORLA, J. C. MIGLIORE, AND U. NAGEL Abstract. In this paper, we give a sufficient condition for a set G of polynomials to be a Gröbner basis with respect to a given term-order

More information

Standard Description Agile Mind Lesson / Activity Page / Link to Resource

Standard Description Agile Mind Lesson / Activity Page / Link to Resource Publisher: Agile Mind, Inc Date: 19-May-14 Course and/or Algebra I Grade Level: TN Core Standard Standard Description Agile Mind Lesson / Activity Page / Link to Resource Create equations that describe

More information

Lecture 9 Classification of States

Lecture 9 Classification of States Lecture 9: Classification of States of 27 Course: M32K Intro to Stochastic Processes Term: Fall 204 Instructor: Gordan Zitkovic Lecture 9 Classification of States There will be a lot of definitions and

More information

The Generalized Neighbor Joining method

The Generalized Neighbor Joining method The Generalized Neighbor Joining method Ruriko Yoshida Dept. of Mathematics Duke University Joint work with Dan Levy and Lior Pachter www.math.duke.edu/ ruriko data mining 1 Challenge We would like to

More information

16.2. Definition. Let N be the set of all nilpotent elements in g. Define N

16.2. Definition. Let N be the set of all nilpotent elements in g. Define N 74 16. Lecture 16: Springer Representations 16.1. The flag manifold. Let G = SL n (C). It acts transitively on the set F of complete flags 0 F 1 F n 1 C n and the stabilizer of the standard flag is the

More information

Tropical decomposition of symmetric tensors

Tropical decomposition of symmetric tensors Tropical decomposition of symmetric tensors Melody Chan University of California, Berkeley mtchan@math.berkeley.edu December 11, 008 1 Introduction In [], Comon et al. give an algorithm for decomposing

More information

Course 311: Michaelmas Term 2005 Part III: Topics in Commutative Algebra

Course 311: Michaelmas Term 2005 Part III: Topics in Commutative Algebra Course 311: Michaelmas Term 2005 Part III: Topics in Commutative Algebra D. R. Wilkins Contents 3 Topics in Commutative Algebra 2 3.1 Rings and Fields......................... 2 3.2 Ideals...............................

More information

FLORIDA STANDARDS TO BOOK CORRELATION

FLORIDA STANDARDS TO BOOK CORRELATION FLORIDA STANDARDS TO BOOK CORRELATION Florida Standards (MAFS.912) Conceptual Category: Number and Quantity Domain: The Real Number System After a standard is introduced, it is revisited many times in

More information

Matrices and systems of linear equations

Matrices and systems of linear equations Matrices and systems of linear equations Samy Tindel Purdue University Differential equations and linear algebra - MA 262 Taken from Differential equations and linear algebra by Goode and Annin Samy T.

More information

Algebra , Martin-Gay

Algebra , Martin-Gay A Correlation of Algebra 1 2016, to the Common Core State Standards for Mathematics - Algebra I Introduction This document demonstrates how Pearson s High School Series by Elayn, 2016, meets the standards

More information

An Introduction to Transversal Matroids

An Introduction to Transversal Matroids An Introduction to Transversal Matroids Joseph E Bonin The George Washington University These slides and an accompanying expository paper (in essence, notes for this talk, and more) are available at http://homegwuedu/

More information

Algebra I, Common Core Correlation Document

Algebra I, Common Core Correlation Document Resource Title: Publisher: 1 st Year Algebra (MTHH031060 and MTHH032060) University of Nebraska High School Algebra I, Common Core Correlation Document Indicates a modeling standard linking mathematics

More information

11. Dimension. 96 Andreas Gathmann

11. Dimension. 96 Andreas Gathmann 96 Andreas Gathmann 11. Dimension We have already met several situations in this course in which it seemed to be desirable to have a notion of dimension (of a variety, or more generally of a ring): for

More information

Chapter 1 Vector Spaces

Chapter 1 Vector Spaces Chapter 1 Vector Spaces Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 110 Linear Algebra Vector Spaces Definition A vector space V over a field

More information

MIT Algebraic techniques and semidefinite optimization May 9, Lecture 21. Lecturer: Pablo A. Parrilo Scribe:???

MIT Algebraic techniques and semidefinite optimization May 9, Lecture 21. Lecturer: Pablo A. Parrilo Scribe:??? MIT 6.972 Algebraic techniques and semidefinite optimization May 9, 2006 Lecture 2 Lecturer: Pablo A. Parrilo Scribe:??? In this lecture we study techniques to exploit the symmetry that can be present

More information

Algebra 1 Mathematics: to Hoover City Schools

Algebra 1 Mathematics: to Hoover City Schools Jump to Scope and Sequence Map Units of Study Correlation of Standards Special Notes Scope and Sequence Map Conceptual Categories, Domains, Content Clusters, & Standard Numbers NUMBER AND QUANTITY (N)

More information

Ph.D. Qualifying Exam: Algebra I

Ph.D. Qualifying Exam: Algebra I Ph.D. Qualifying Exam: Algebra I 1. Let F q be the finite field of order q. Let G = GL n (F q ), which is the group of n n invertible matrices with the entries in F q. Compute the order of the group G

More information

Parameterizing orbits in flag varieties

Parameterizing orbits in flag varieties Parameterizing orbits in flag varieties W. Ethan Duckworth April 2008 Abstract In this document we parameterize the orbits of certain groups acting on partial flag varieties with finitely many orbits.

More information

Lecture 20 : Markov Chains

Lecture 20 : Markov Chains CSCI 3560 Probability and Computing Instructor: Bogdan Chlebus Lecture 0 : Markov Chains We consider stochastic processes. A process represents a system that evolves through incremental changes called

More information

ALGEBRA I CCR MATH STANDARDS

ALGEBRA I CCR MATH STANDARDS RELATIONSHIPS BETWEEN QUANTITIES AND REASONING WITH EQUATIONS M.A1HS.1 M.A1HS.2 M.A1HS.3 M.A1HS.4 M.A1HS.5 M.A1HS.6 M.A1HS.7 M.A1HS.8 M.A1HS.9 M.A1HS.10 Reason quantitatively and use units to solve problems.

More information

Permutation groups/1. 1 Automorphism groups, permutation groups, abstract

Permutation groups/1. 1 Automorphism groups, permutation groups, abstract Permutation groups Whatever you have to do with a structure-endowed entity Σ try to determine its group of automorphisms... You can expect to gain a deep insight into the constitution of Σ in this way.

More information

Chapter 1. Preliminaries

Chapter 1. Preliminaries Introduction This dissertation is a reading of chapter 4 in part I of the book : Integer and Combinatorial Optimization by George L. Nemhauser & Laurence A. Wolsey. The chapter elaborates links between

More information

Math 121 Homework 5: Notes on Selected Problems

Math 121 Homework 5: Notes on Selected Problems Math 121 Homework 5: Notes on Selected Problems 12.1.2. Let M be a module over the integral domain R. (a) Assume that M has rank n and that x 1,..., x n is any maximal set of linearly independent elements

More information

Algebra I Number and Quantity The Real Number System (N-RN)

Algebra I Number and Quantity The Real Number System (N-RN) Number and Quantity The Real Number System (N-RN) Use properties of rational and irrational numbers N-RN.3 Explain why the sum or product of two rational numbers is rational; that the sum of a rational

More information

0 Sets and Induction. Sets

0 Sets and Induction. Sets 0 Sets and Induction Sets A set is an unordered collection of objects, called elements or members of the set. A set is said to contain its elements. We write a A to denote that a is an element of the set

More information

2. Intersection Multiplicities

2. Intersection Multiplicities 2. Intersection Multiplicities 11 2. Intersection Multiplicities Let us start our study of curves by introducing the concept of intersection multiplicity, which will be central throughout these notes.

More information

Common Core State Standards with California Additions 1 Standards Map. Algebra I

Common Core State Standards with California Additions 1 Standards Map. Algebra I Common Core State s with California Additions 1 s Map Algebra I *Indicates a modeling standard linking mathematics to everyday life, work, and decision-making N-RN 1. N-RN 2. Publisher Language 2 Primary

More information

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals Robin Evans Abstract In their 2007 paper, E.S. Allman and J.A. Rhodes characterise the phylogenetic ideal of general

More information

Symmetries and Polynomials

Symmetries and Polynomials Symmetries and Polynomials Aaron Landesman and Apurva Nakade June 30, 2018 Introduction In this class we ll learn how to solve a cubic. We ll also sketch how to solve a quartic. We ll explore the connections

More information

1 Fields and vector spaces

1 Fields and vector spaces 1 Fields and vector spaces In this section we revise some algebraic preliminaries and establish notation. 1.1 Division rings and fields A division ring, or skew field, is a structure F with two binary

More information

College Algebra with Corequisite Support: Targeted Review

College Algebra with Corequisite Support: Targeted Review College Algebra with Corequisite Support: Targeted Review 978-1-63545-056-9 To learn more about all our offerings Visit Knewtonalta.com Source Author(s) (Text or Video) Title(s) Link (where applicable)

More information

Quivers of Period 2. Mariya Sardarli Max Wimberley Heyi Zhu. November 26, 2014

Quivers of Period 2. Mariya Sardarli Max Wimberley Heyi Zhu. November 26, 2014 Quivers of Period 2 Mariya Sardarli Max Wimberley Heyi Zhu ovember 26, 2014 Abstract A quiver with vertices labeled from 1,..., n is said to have period 2 if the quiver obtained by mutating at 1 and then

More information

MATH 223A NOTES 2011 LIE ALGEBRAS 35

MATH 223A NOTES 2011 LIE ALGEBRAS 35 MATH 3A NOTES 011 LIE ALGEBRAS 35 9. Abstract root systems We now attempt to reconstruct the Lie algebra based only on the information given by the set of roots Φ which is embedded in Euclidean space E.

More information

YOUNG TABLEAUX AND THE REPRESENTATIONS OF THE SYMMETRIC GROUP

YOUNG TABLEAUX AND THE REPRESENTATIONS OF THE SYMMETRIC GROUP YOUNG TABLEAUX AND THE REPRESENTATIONS OF THE SYMMETRIC GROUP YUFEI ZHAO ABSTRACT We explore an intimate connection between Young tableaux and representations of the symmetric group We describe the construction

More information

Codewords of small weight in the (dual) code of points and k-spaces of P G(n, q)

Codewords of small weight in the (dual) code of points and k-spaces of P G(n, q) Codewords of small weight in the (dual) code of points and k-spaces of P G(n, q) M. Lavrauw L. Storme G. Van de Voorde October 4, 2007 Abstract In this paper, we study the p-ary linear code C k (n, q),

More information

Linear Classification: Linear Programming

Linear Classification: Linear Programming Linear Classification: Linear Programming Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong 1 / 21 Y Tao Linear Classification: Linear Programming Recall the definition

More information

COMMON CORE STATE STANDARDS TO BOOK CORRELATION

COMMON CORE STATE STANDARDS TO BOOK CORRELATION COMMON CORE STATE STANDARDS TO BOOK CORRELATION Conceptual Category: Number and Quantity Domain: The Real Number System After a standard is introduced, it is revisited many times in subsequent activities,

More information

CHAPTER 1. AFFINE ALGEBRAIC VARIETIES

CHAPTER 1. AFFINE ALGEBRAIC VARIETIES CHAPTER 1. AFFINE ALGEBRAIC VARIETIES During this first part of the course, we will establish a correspondence between various geometric notions and algebraic ones. Some references for this part of the

More information

Central Groupoids, Central Digraphs, and Zero-One Matrices A Satisfying A 2 = J

Central Groupoids, Central Digraphs, and Zero-One Matrices A Satisfying A 2 = J Central Groupoids, Central Digraphs, and Zero-One Matrices A Satisfying A 2 = J Frank Curtis, John Drew, Chi-Kwong Li, and Daniel Pragel September 25, 2003 Abstract We study central groupoids, central

More information

LMI MODELLING 4. CONVEX LMI MODELLING. Didier HENRION. LAAS-CNRS Toulouse, FR Czech Tech Univ Prague, CZ. Universidad de Valladolid, SP March 2009

LMI MODELLING 4. CONVEX LMI MODELLING. Didier HENRION. LAAS-CNRS Toulouse, FR Czech Tech Univ Prague, CZ. Universidad de Valladolid, SP March 2009 LMI MODELLING 4. CONVEX LMI MODELLING Didier HENRION LAAS-CNRS Toulouse, FR Czech Tech Univ Prague, CZ Universidad de Valladolid, SP March 2009 Minors A minor of a matrix F is the determinant of a submatrix

More information

Mathematics. Number and Quantity The Real Number System

Mathematics. Number and Quantity The Real Number System Number and Quantity The Real Number System Extend the properties of exponents to rational exponents. 1. Explain how the definition of the meaning of rational exponents follows from extending the properties

More information

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations.

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations. POLI 7 - Mathematical and Statistical Foundations Prof S Saiegh Fall Lecture Notes - Class 4 October 4, Linear Algebra The analysis of many models in the social sciences reduces to the study of systems

More information

Algebra 1 3 rd Trimester Expectations Chapter (McGraw-Hill Algebra 1) Chapter 9: Quadratic Functions and Equations. Key Vocabulary Suggested Pacing

Algebra 1 3 rd Trimester Expectations Chapter (McGraw-Hill Algebra 1) Chapter 9: Quadratic Functions and Equations. Key Vocabulary Suggested Pacing Algebra 1 3 rd Trimester Expectations Chapter (McGraw-Hill Algebra 1) Chapter 9: Quadratic Functions and Equations Lesson 9-1: Graphing Quadratic Functions Lesson 9-2: Solving Quadratic Equations by Graphing

More information

Math 676. A compactness theorem for the idele group. and by the product formula it lies in the kernel (A K )1 of the continuous idelic norm

Math 676. A compactness theorem for the idele group. and by the product formula it lies in the kernel (A K )1 of the continuous idelic norm Math 676. A compactness theorem for the idele group 1. Introduction Let K be a global field, so K is naturally a discrete subgroup of the idele group A K and by the product formula it lies in the kernel

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

MULTIPLICITIES OF MONOMIAL IDEALS

MULTIPLICITIES OF MONOMIAL IDEALS MULTIPLICITIES OF MONOMIAL IDEALS JÜRGEN HERZOG AND HEMA SRINIVASAN Introduction Let S = K[x 1 x n ] be a polynomial ring over a field K with standard grading, I S a graded ideal. The multiplicity of S/I

More information

HOMOLOGY THEORIES INGRID STARKEY

HOMOLOGY THEORIES INGRID STARKEY HOMOLOGY THEORIES INGRID STARKEY Abstract. This paper will introduce the notion of homology for topological spaces and discuss its intuitive meaning. It will also describe a general method that is used

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

1 The linear algebra of linear programs (March 15 and 22, 2015)

1 The linear algebra of linear programs (March 15 and 22, 2015) 1 The linear algebra of linear programs (March 15 and 22, 2015) Many optimization problems can be formulated as linear programs. The main features of a linear program are the following: Variables are real

More information

Algebraic matroids are almost entropic

Algebraic matroids are almost entropic accepted to Proceedings of the AMS June 28, 2017 Algebraic matroids are almost entropic František Matúš Abstract. Algebraic matroids capture properties of the algebraic dependence among elements of extension

More information

Compatibly split subvarieties of Hilb n (A 2 k)

Compatibly split subvarieties of Hilb n (A 2 k) Compatibly split subvarieties of Hilb n (A 2 k) Jenna Rajchgot MSRI Combinatorial Commutative Algebra December 3-7, 2012 Throughout this talk, let k be an algebraically closed field of characteristic p

More information

The Advantage Testing Foundation Solutions

The Advantage Testing Foundation Solutions The Advantage Testing Foundation 2016 Problem 1 Let T be a triangle with side lengths 3, 4, and 5. If P is a point in or on T, what is the greatest possible sum of the distances from P to each of the three

More information

How well do I know the content? (scale 1 5)

How well do I know the content? (scale 1 5) Page 1 I. Number and Quantity, Algebra, Functions, and Calculus (68%) A. Number and Quantity 1. Understand the properties of exponents of s I will a. perform operations involving exponents, including negative

More information

GRE Subject test preparation Spring 2016 Topic: Abstract Algebra, Linear Algebra, Number Theory.

GRE Subject test preparation Spring 2016 Topic: Abstract Algebra, Linear Algebra, Number Theory. GRE Subject test preparation Spring 2016 Topic: Abstract Algebra, Linear Algebra, Number Theory. Linear Algebra Standard matrix manipulation to compute the kernel, intersection of subspaces, column spaces,

More information

INITIAL COMPLEX ASSOCIATED TO A JET SCHEME OF A DETERMINANTAL VARIETY. the affine space of dimension k over F. By a variety in A k F

INITIAL COMPLEX ASSOCIATED TO A JET SCHEME OF A DETERMINANTAL VARIETY. the affine space of dimension k over F. By a variety in A k F INITIAL COMPLEX ASSOCIATED TO A JET SCHEME OF A DETERMINANTAL VARIETY BOYAN JONOV Abstract. We show in this paper that the principal component of the first order jet scheme over the classical determinantal

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

Huntington Beach City School District Grade 8 Mathematics Accelerated Standards Schedule

Huntington Beach City School District Grade 8 Mathematics Accelerated Standards Schedule Huntington Beach City School District Grade 8 Mathematics Accelerated Standards Schedule 2016-2017 Interim Assessment Schedule Orange Interim Assessment: November 1-18, 2016 Green Interim Assessment: January

More information

2. TRIGONOMETRY 3. COORDINATEGEOMETRY: TWO DIMENSIONS

2. TRIGONOMETRY 3. COORDINATEGEOMETRY: TWO DIMENSIONS 1 TEACHERS RECRUITMENT BOARD, TRIPURA (TRBT) EDUCATION (SCHOOL) DEPARTMENT, GOVT. OF TRIPURA SYLLABUS: MATHEMATICS (MCQs OF 150 MARKS) SELECTION TEST FOR POST GRADUATE TEACHER(STPGT): 2016 1. ALGEBRA Sets:

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

A Potpourri of Nonlinear Algebra

A Potpourri of Nonlinear Algebra a a 2 a 3 + c c 2 c 3 =2, a a 3 b 2 + c c 3 d 2 =0, a 2 a 3 b + c 2 c 3 d =0, a 3 b b 2 + c 3 d d 2 = 4, a a 2 b 3 + c c 2 d 3 =0, a b 2 b 3 + c d 2 d 3 = 4, a 2 b b 3 + c 2 d 3 d =4, b b 2 b 3 + d d 2

More information