HOW MANY MODES CAN A TWO COMPONENT MIXTURE HAVE? Surajit Ray and Dan Ren. Boston University


Abstract: The main result of this article states that one can get as many as D + 1 modes from a two-component normal mixture in D dimensions. Multivariate mixture models are widely used for modeling heterogeneous populations and for cluster analysis. Either the components directly, or the modes arising from these components, are often used to extract individual clusters. Though in lower dimensions these strategies work well, our results show that high-dimensional mixtures are often very complex, and researchers should take extra precautions when using mixtures for cluster analysis. Even in the simplest case of mixing only two normal components in D dimensions, we show that the mixture can have a maximum of D + 1 modes. When we mix more components, or if the components are non-normal, the number of modes might be even higher, which might lead us to wrong inference on the number of clusters. Further analyses show that the number of modes depends on the component means and on the eigenvalues of the ratio of the two component covariance matrices (in the matrix sense), which in turn provides a clear guideline as to when one can use mixture analysis for clustering high-dimensional data.

Key words and phrases: Mixture, modal cluster, multivariate mode, clustering, dimension reduction, topography, manifold.

1 Introduction

1.1 Number of modes of a normal mixture

Multivariate normal mixtures provide a flexible method of fitting high-dimensional data. This fit often provides a primary data reduction through the number, location and shape of its components. However, a more interesting question relates to the exploration of how components interact to describe an overall pattern of density. Of particular interest is finding the number of modes the density displays. The relation between the number of modes and the number of components is not one to one.
Often modes are used to determine the number of homogeneous groups in a population (Li et al., 2007; McLachlan and Peel, 2000; Titterington et al., 1985). Modes of densities are also widely used to summarize posterior distributions in Bayesian analysis (Berger, 1985; Lehmann and Casella, 1998) and to build Bayesian inferential frameworks.

The main result of this paper is summarized in the following theorem:

Theorem 1. A D-dimensional normal mixture of two components has at most D + 1 modes, and a mixture with D + 1 modes always exists in D dimensions.

In one dimension a two-component normal mixture can display one or two modes. But the density shapes become complex in higher dimensions. For example, a two-component normal mixture in two dimensions can give rise to one, two or three modes (see Ray and Lindsay, 2005, for a three-mode example). Ray and Lindsay (2005) provide more examples in two and three dimensions where the number of modes exceeds the number of mixing components. But besides these pathological examples there is no result on the upper bound of the number of modes that a mixture of normals can display. This paper provides the first set of results on the upper bound for the number of modes of a two-component normal mixture. We also show that this bound is tight, i.e., we can provide numerical values for a mixture which attains this upper bound. It is well known that the topography of a mixture of distributions, in the sense of its key features as a density, is often extremely complex. Among the different features of the topography we are especially interested in the number of modes the density displays, referred to as the modality of the density from here on. Ray and Lindsay (2005) provide a detailed understanding of the topography of mixtures of normal distributions in terms of the means and variances of the component distributions. But how these density shapes respond to rotation or scaling based on the component covariances is not well studied. For example, it is not clear whether rotation and scaling retain all the modes after transformation. In this paper we present a set of results showing the invariance of the modality of normal mixtures under the operations of translation, scaling and rotation.
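The one-dimensional case described above can be checked directly by brute force. The following sketch (plain Python; the grid resolution and the example parameters are illustrative choices, not values from the paper) evaluates a two-component univariate normal mixture on a fine grid and counts its local maxima:

```python
import math

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) at x
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def count_modes_1d(pi, mu1, s1, mu2, s2, lo=-10.0, hi=10.0, n=4001):
    # count strict local maxima of the mixture density on a fine grid
    xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    g = [pi * normal_pdf(x, mu1, s1) + (1.0 - pi) * normal_pdf(x, mu2, s2)
         for x in xs]
    return sum(1 for i in range(1, n - 1) if g[i - 1] < g[i] > g[i + 1])

print(count_modes_1d(0.5, 0.0, 1.0, 0.5, 1.0))  # close means: 1 mode
print(count_modes_1d(0.5, 0.0, 1.0, 4.0, 1.0))  # well-separated means: 2 modes
```

For equal mixing proportions and equal variances this reproduces the classical condition that the mixture is bimodal exactly when the means are more than two standard deviations apart (Helguero, 1904).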
These results allow us to show that the modality of a two-component mixture of normals with arbitrary variance-covariance matrices is mathematically equivalent to the topography of a mixture of normals in which one component has a spherical covariance and the other has an appropriate diagonal covariance matrix of the same dimension. A follow-up analysis shows that the number of modes is closely related to the number of unique eigenvalues of the ratio of the covariance matrices, in a matrix sense (the inverse of one matrix multiplied by the other matrix). Finally we use these results to arrive at the main result on the tight upper bound on the number of modes.

1.2 Relevant Literature

Studies of the number of modes of normal mixtures date back to the beginning of the twentieth century, but until recently the results focused primarily on univariate mixtures. In fact, there is a simple description of modality when one is mixing two univariate normal components. Helguero (1904) determined necessary and sufficient conditions for bimodality in

the mixture of two univariate normals with equal variances and mixing proportions. More research on univariate mixture cases followed. For example, Eisenberger (1964) investigated the conditions for bimodality in the mixture of two univariate normals with arbitrary variances and mixing proportions, and Behboodian (1970) derived a sufficient condition for unimodal mixture densities. Kakiuchi (1981) and Kemperman (1991) then extended the problem to mixtures of non-normal distributions, and derived corresponding necessary and sufficient conditions. In the context of multivariate normal mixtures, a recent result by Carreira-Perpiñán and Williams (2003) shows that for any D-dimensional normal mixture, the number of modes cannot exceed the number of components if each component has the same covariance matrix up to a scalar scaling factor. The most recent and comprehensive results in this area of research are provided by Ray and Lindsay (2005), who present the most generalized modality results for arbitrary dimensions, numbers of components and component variance structures. The key result in Ray and Lindsay (2005) shows that the topography of multivariate mixtures, in the sense of their key features as a density, can be analyzed rigorously in lower dimensions by use of a ridgeline manifold that contains all critical points as well as the ridges of the density. This important topographical result allows them to solve for the number of modes both analytically and numerically. Besides solving for the number of modes, Ray and Lindsay (2005) provide pathological examples of more modes than components in more than one dimension. A comprehensive summary of the above results is available in Frühwirth-Schnatter (2006) and in a recent review paper by Melnykov and Maitra (2010).
Much of the modality theory discussed in Ray and Lindsay (2005) has been widely used for developing clustering techniques by Ray and Lindsay (2008), Coretto and Hennig (2010), Hennig (2010b) and Hennig (2010a), and for the advancement of likelihood-based inference for normal mixtures by Chen and Tan (2009), Holzmann and Vollmer (2008), Dannemann and Holzmann (2008) and Lindsay et al. (2008). Applications of these results are found in new areas of research such as signal processing (Li, 2007; Scott et al., 2009) and image retrieval (Sfikas et al., 2005). Using the modality theorem in the special case of a two-component normal mixture, Ray and Lindsay (2005) provide examples of three modes in two dimensions, and four modes in three dimensions. These mixtures have unequal covariance matrices, but they are limited to being diagonal in structure. Providing an upper bound on the number of modes for mixtures in arbitrary dimensions with arbitrary component variance-covariance matrices remained an unresolved problem.

1.3 Our Results

The main contribution of this paper is to provide a tight upper bound for the number of modes of a two-component normal mixture for arbitrary dimension and arbitrary component

variance-covariance matrices. Let us denote the dimension of the multivariate normal density by D and the number of components of the mixture by K. In this paper we only consider two-component normal mixtures, i.e., K = 2; the corresponding parameters for each normal density are its mean µᵢ and variance-covariance matrix Σᵢ, i = 1, 2. Let π and π̄ = 1 − π be the respective proportions of the two densities. It can be shown that for specified means and variances the number of modes depends on the mixing proportions. In fact, Ray and Lindsay (2005) provide examples of mixtures where different ranges of π display one, two and three modes for the same means and variance-covariance matrices. But one should notice that the specification of π is irrelevant in the context of determining the maximal number of modes displayed by a mixture of two components. In other words, we are asking the following question: given a pair of component means and covariance matrices, what is the maximum number of modes the mixture can display if one has complete freedom in choosing the mixing proportion π? Hence we will ignore the parameter π in our analysis, and for notational ease we will denote a D-dimensional mixture of two components with means µ₁ and µ₂ and variances Σ₁ and Σ₂ by NM(µ₁, Σ₁, µ₂, Σ₂)_D. Our main result shows that the number of modes for the above mixture is bounded above by D + 1, and that this bound is achievable for any D. In fact, we provide a recursive algorithm to construct the parameters of component densities which attain this bound. Modes are defined as the local maxima of the density height, and understanding the modes requires understanding the topography of the density along with its higher order features. Many of the results we will use in this paper are based on these higher order features of normal mixtures, defined in terms of the Π-function (different from the omitted parameter π) and the curvature function defined in Ray and Lindsay (2005).
So, in Section 2 we will first define the terminology and state some of the important results from Ray and Lindsay (2005) which will be used in this paper. In particular we will present the concepts of the Π-function and the curvature function of a mixture, which have the advantage of being expressed explicitly in terms of the means and variances of the components while retaining full information about the topography, and hence the number of modes, of a mixture. Moreover, the Π-function and curvature function attain a very simple form for a two-component normal mixture. This simplification of the curvature function allows us to show that the number of modes of a two-component mixture is explicitly determined by the number of roots of the curvature function within the range [0, 1]. But the roots of the curvature function defined in Section 2 are very difficult to study for arbitrary mixtures. Ray and Lindsay (2005) explore the roots of curvature functions only in the case of diagonal covariance matrices up to three dimensions. In this paper we seek to generalize the modality results to arbitrary dimensions and component variance-covariance matrices.

To arrive at these results, in Section 3 we first show that the modality of an arbitrary D-dimensional normal mixture NM(µ₁, Σ₁, µ₂, Σ₂)_D remains unchanged under any translation and under a specified scaling and rotation of the random variable. These results will be enormously helpful, as they allow us to study the topography of an arbitrary D-dimensional normal mixture by exploring the topography of a simplified class of normal mixtures whose first component is a standard normal and whose second component has a diagonal covariance matrix. We denote this class by NM(0, I, µ, Λ)_D, where 0 and µ are both D-dimensional means, I is the identity matrix and Λ is a diagonal matrix of dimension D. These results are derived analytically, and examples are provided to illustrate them. In Section 4 we explore the modality of normal mixtures of the form NM(0, I, µ, Λ)_D. We show that the maximum number of modes is constrained by d, the number of distinct diagonal entries in Λ. In fact, the modality of such a mixture with d distinct diagonal entries is less than or equal to (d + 1). It is easy to check that d can be equal to the dimension D, and thus we arrive at the first part of our result showing that any arbitrary D-dimensional normal mixture of two components can have at most (D + 1) modes. The tightness of the stated bound is shown by providing a recursive method for constructing two-component normal mixtures which achieve this bound. In Section 4 we also show that many previous modality results can be stated as special cases of our generalized result. For D = 1, this can be used to prove the univariate results in Helguero (1904) and Robertson and Fryer (1969). For D = 2 and D = 3 our results show that the examples in Ray and Lindsay (2005) achieve the upper limit of the number of modes in their respective dimensions.
Section 5 provides some discussion and further research directions regarding the number of modes of multivariate normal mixtures of more than two components. Generalization of the modality results from mixtures of multivariate normals to multivariate-t densities, and ultimately to multivariate elliptical distributions, will also be discussed in that section.

2 Topography of multivariate normals

In this section we state some important results from Ray and Lindsay (2005) that will be extensively used in this paper. The rest of the paper will use the notation defined in this section. Readers familiar with the results in Ray and Lindsay (2005) may skip this section. Ray and Lindsay (2005) present a unified theory for understanding the topography of high-dimensional normal mixtures. Their main result shows that the topography of mixtures, in the sense of their key features as a density, can be analyzed rigorously in lower dimensions by use of a ridgeline manifold that contains all critical points as well as the ridges of the density. A K-component mixture of D-dimensional normals can be represented by the probability

density function

g(x) = π₁ φ(x; µ₁, Σ₁) + π₂ φ(x; µ₂, Σ₂) + ··· + π_K φ(x; µ_K, Σ_K),  x ∈ Rᴰ,

where πⱼ is the mixing proportion of component j, πⱼ ∈ [0, 1], ∑ⱼ₌₁ᴷ πⱼ = 1, and φ(x; µ, Σ) is the density of a multivariate normal distribution with mean µ and variance Σ. We will sometimes use φⱼ(x) as shorthand notation for φ(x; µⱼ, Σⱼ), and call φⱼ the j-th component density.

2.1 The K−1 dimensional ridgeline manifold

Definition 1. The (K−1)-dimensional set of points

S_K = { α ∈ Rᴷ : αᵢ ∈ [0, 1], ∑ᵢ₌₁ᴷ αᵢ = 1 }

will be called the unit simplex. The function x*(α) from S_K into Rᴰ defined by

x*(α) = [α₁Σ₁⁻¹ + α₂Σ₂⁻¹ + ··· + α_KΣ_K⁻¹]⁻¹ [α₁Σ₁⁻¹µ₁ + α₂Σ₂⁻¹µ₂ + ··· + α_KΣ_K⁻¹µ_K]

will be called the ridgeline function. It will sometimes be written as x*_α. The image of this map will be denoted by M and called the ridgeline surface or manifold. If K = 2, it will be called the ridgeline, as it is a one-dimensional curve.

Theorem 2. (Ray and Lindsay, 2005) Let g(x) be the density of a K-component mixture of multivariate normals as given above. Then all of g(x)'s critical points, and hence its modes, antimodes and saddle points, are points in M.

The previous result states that instead of exploring the whole Rᴰ space to find modes, we now only need to concentrate on the ridgeline manifold, parametrized by the (K−1)-dimensional unit simplex. In this paper we only deal with two components, and for K = 2 the ridgeline can be represented as

x*(α) = Sα⁻¹ [αΣ₁⁻¹µ₁ + ᾱΣ₂⁻¹µ₂],    (1)

where Sα = αΣ₁⁻¹ + ᾱΣ₂⁻¹, α ∈ [0, 1] and ᾱ = 1 − α. As α varies from 0 to 1, the image of the function x*(α) defines a curve connecting µ₁ and µ₂, and the critical points of the D-dimensional mixture can be explored by evaluating the height of the density along the curve x*(α). Thus we next consider the diagnostic properties of the elevation plot along the curve x*(α), defined by h(α) = g(x*(α)). We will call h(α) the ridgeline elevation function.
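When both component covariances are diagonal, Sα is diagonal and the ridgeline x*(α) can be evaluated coordinatewise. The following sketch (plain Python, restricted to the diagonal case for simplicity; the example parameters are illustrative) traces h(α) = g(x*(α)) on a grid and counts its peaks:

```python
import math

def phi_diag(x, mu, var):
    # multivariate normal density with diagonal covariance (var = list of variances)
    val = 1.0
    for xi, mi, vi in zip(x, mu, var):
        val *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return val

def ridgeline_point(alpha, mu1, v1, mu2, v2):
    # x*(alpha) = S_alpha^{-1} [alpha Sigma1^{-1} mu1 + (1-alpha) Sigma2^{-1} mu2],
    # computed coordinatewise because both covariances are diagonal
    ab = 1.0 - alpha
    return [(alpha * m1 / a + ab * m2 / b) / (alpha / a + ab / b)
            for m1, a, m2, b in zip(mu1, v1, mu2, v2)]

def count_ridgeline_peaks(pi, mu1, v1, mu2, v2, n=2001):
    # peaks of the ridgeline elevation h(alpha) = g(x*(alpha)) on a grid over [0, 1]
    h = []
    for i in range(n):
        a = i / (n - 1)
        x = ridgeline_point(a, mu1, v1, mu2, v2)
        h.append(pi * phi_diag(x, mu1, v1) + (1.0 - pi) * phi_diag(x, mu2, v2))
    peaks = sum(1 for i in range(1, n - 1) if h[i - 1] < h[i] > h[i + 1])
    # the endpoints alpha = 0, 1 (i.e. x = mu2, mu1) can also carry modes
    if h[0] > h[1]:
        peaks += 1
    if h[-1] > h[-2]:
        peaks += 1
    return peaks

# two well-separated spherical components -> two modes
print(count_ridgeline_peaks(0.5, [0.0, 0.0], [1.0, 1.0], [5.0, 0.0], [1.0, 1.0]))
```

By Theorem 2 no mode can be missed by this one-dimensional search, whatever the dimension D.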
Analytically, the number of peaks of h(α) is exactly the maximum number of modes the mixture can display. In some cases a visual

inspection of h(α) or numerical root-finding methods might allow us to enumerate the peaks of h(α), and hence the number of modes. But depending on the resolution, numerical methods can always miss some zero crossings. Moreover, numerical solutions will not serve the purpose of this paper, which focuses on determining the upper bound on the number of modes. Hence we focus our attention on finding analytical solutions for the critical points of h(α) in order to find the number of modes of the mixture.

2.2 The curvature function

To find the number of modes, first note that x*(α) is a critical point if α satisfies

h′(α) = π φ′₁(α) + π̄ φ′₂(α) = 0,

where the prime denotes differentiation with respect to α and φⱼ(α) is shorthand for φⱼ(x*(α)). Solving the last displayed equation for π, and turning it into a function of α, we get:

Π(α) = φ′₂(α) / (φ′₂(α) − φ′₁(α)).

As we are just interested in the number of modes, we can examine the number of up and down oscillations of the function Π. Section 4 of Ray and Lindsay (2005) shows that the number of up-down oscillations of Π is determined by n, the number of zeroes of

Π′(α) = [φ″₂(α) φ′₁(α) − φ″₁(α) φ′₂(α)] / (φ′₂(α) − φ′₁(α))².

In general, to determine the sign changes of Π′ we can use any function of α with the same numerator φ″₂(α) φ′₁(α) − φ″₁(α) φ′₂(α), provided the denominator does not change sign in α. Using the denominator φ′₁(α) φ′₂(α) instead of (φ′₂(α) − φ′₁(α))², the curvature function κ(α) is defined as:

κ(α) = [φ″₂(α) φ′₁(α) − φ″₁(α) φ′₂(α)] / [φ′₁(α) φ′₂(α)] = φ″₂(α)/φ′₂(α) − φ″₁(α)/φ′₁(α).    (2)

We use κ(α) as it results in a simple expression for any distribution belonging to the exponential family. It is closely related to the mixture curvature measures given by Lindsay (1983).

2.3 Properties of the curvature function κ(α)

We now study the curvature function κ(α) more closely, as it will be extensively used to prove the results in Section 3 and Section 4. The following result provides a simple expression for the curvature for a mixture of normals.

Theorem 3. (Ray and Lindsay, 2005) Let g(x) be the mixture of two multivariate normal densities. Then the curvature function in (2) is given by

κ(α) = [p(α)]² [1 − αᾱ p(α)], where p(α) = (µ₂ − µ₁)ᵀ Σ₁⁻¹ Sα⁻¹ Σ₂⁻¹ Sα⁻¹ Σ₂⁻¹ Sα⁻¹ Σ₁⁻¹ (µ₂ − µ₁).    (3)

By the expression above, p(α) is always positive. Thus the zeroes of κ(α) are the same as the zeroes of (1 − αᾱ p(α)). For notational ease, let us denote

q(α) = 1 − αᾱ p(α).    (4)

By calculation, q(0) = q(1) = 1, and hence κ takes positive values at the two extremes α = 0 and 1. Thus there is an even number of sign changes of the function κ(α) in the range [0, 1], as also indicated by the nature of Π. In particular, at the first zero α₁ of κ the function Π has a maximum, at the next zero a minimum, and so forth. Thus we arrive at the following result relating the number of solutions of q(α) = 0 to the modality of the mixture.

Result 1. Let n be the number of solutions of q(α) = 0 in the range [0, 1]. Then the corresponding mixture can display at most n/2 + 1 modes.

We note that each of p(α) and q(α) uniquely determines the number of modes. We will use p(α) to show the invariance in the proof of Theorem 5, and later use q(α) to find the number of modes in the proofs of the other theorems.

3 Invariance of modality under scaling and rotation

Studying the modality of arbitrary normal mixtures directly, based on the curvature function κ(α), is a very complex undertaking. Instead, in this section we will show that the curvature function which defines the modal features of a two-component normal mixture remains unchanged under certain transformations. We will use these transformations to show that the topography of an arbitrary D-dimensional normal mixture can be examined by exploring the topography of a simplified class of normal mixtures, given by the mixture of a spherical normal and a normal with a diagonal covariance matrix. We arrive at this result in two steps, described in the following two subsections.
3.1 Invariance of modality under scaling

First we state the theorem that provides the simplification that in D dimensions the modal properties of an arbitrary two-component normal mixture can be fully examined by studying the modality of a mixture of two components, one of which is the standard normal in D dimensions.

Theorem 4. For an arbitrary mixture of two multivariate normals, the modality of NM(µ₁, Σ₁, µ₂, Σ₂)_D is the same as that of NM(0, I, µ₂*, Σ₂*)_D, where Σ₂* = Σ₂^(1/2) Σ₁⁻¹ Σ₂^(1/2) and µ₂* = (Σ₂*)^(1/2) Σ₂^(−1/2) (µ₂ − µ₁).

Proof. See Appendix.

Remark 1. First note that the above transformation is not equivalent to the regular standardization with respect to the first component alone. Using a regular standardization a single component can be transformed to a standard normal, but the resulting parameters of the second component lose the symmetry which is crucial for equating the curvature functions of the two mixtures, as detailed in the proof of Theorem 4. Also note that µ₂* and Σ₂* in Theorem 4 are well-defined, because the variance matrices Σ₁ and Σ₂ are both positive definite. Note that the two components are interchangeable, and the strategy is to scale the whole mixture by the covariance of the component whose mean is translated to the origin.

Before moving on to the next result, we provide an application of Theorem 4. For easy visualization we will use contour plots of a two-dimensional mixture. This example will also serve the purpose of providing a geometric intuition for the proof of Theorem 4. First, it is easy to check that geometrically shifting the means of both components by the same vector is equivalent to changing the origin of the reference frame of the contour plot. This implies that the modal features, and hence the number of modes, remain unchanged after a simple translation. So we concentrate on the changes of the contour plot strictly under the operation of scaling defined in Theorem 4, by taking µ₁ = 0.

Example 1.
Consider the mixture density with the following parameters:

µ₁ = , Σ₁ = , µ₂ = , Σ₂ =

Applying the transformation defined in Theorem 4, the parameters of the two components after scaling are given by:

µ₁* = 0, Σ₁* = I, µ₂* = , Σ₂* =

Figure 1 gives the density contour plots before (left panel) and after (right panel) the transformation. Clearly, though the contour shapes and the locations of the modes have changed, the number of modes and the number of saddle points remain unchanged. Note that under the transformation both components are scaled; in this example the component centered at zero is scaled to have the identity covariance, and the covariance of the other component is scaled appropriately. This is easily visible from the contour plots in Figure 1, where the elongated elliptical component in the left panel with the origin as its center is transformed into a spherical component with the same center. Of course the change

Figure 1: Contour plots for the bivariate normal mixture of Example 1 with (a) the original parameters and (b) the transformed parameters.

in the means and covariances of the components has changed the locations of the three modes, but as the theorem suggests the number of modes is strictly preserved between the mixtures. Contour plots such as those in Figure 1 are not available unless D = 2, so we provide an alternative graphical display showing the invariance of modes. We compare the ridgeline elevations of the two mixtures in Example 1. Recall that the ridgeline elevation for a two-component mixture is simply the height of the mixture density along the ridgeline defined in (1), but it carries the full modality information for mixtures in any dimension. Figure 2 displays the ridgeline elevation plot before and after the transformation. Again note that though the shapes of the elevation plots differ, the number of up-down oscillations of the curves in the left and right panels of Figure 2 is exactly the same. In both cases the ridgeline elevation plot confirms the presence of three modes.

3.2 Invariance of modality under rotation

By Theorem 4 the topography of any D-dimensional mixture can be studied using mixtures of the form NM(µ₁ = 0, Σ₁ = I, µ₂, Σ₂). But uncovering the topography, even when only one component has an arbitrary covariance matrix, is difficult. In this section we seek to provide a further simplification, which will allow us to find the number of modes of an arbitrary mixture by studying the modes of another mixture, one component of which is a standard normal and the other component of which is a normal with a diagonal covariance matrix. Before we state the result, recall that the maximum number of modes of a two-component

Figure 2: Ridgeline elevation with respect to the arc distance for the bivariate normal mixture of Example 1 with (a) the original parameters and (b) the transformed parameters.

normal mixture is uniquely defined by the number of roots between 0 and 1 of q(α) given in (4), and for any mixture q(α) is uniquely defined by p(α). So we will first provide a simplification of the expression for p(α) for mixtures of the form NM(0, I, µ₂, Σ₂)_D, and then state the rotation invariance theorem.

Result 2. For a mixture of the form NM(0, I, µ₂, Σ₂)_D, the term p(α) in (3) can be expressed in terms of the eigenvalues and eigenvectors of Σ₂ in the following way:

p(α) = ∑ᵢ₌₁ᴰ cᵢ / [α(λᵢ − 1) + 1]³,    (5)

where cᵢ = λᵢ (µ₂ᵀ ξᵢ)², and the λᵢ's and ξᵢ's are the eigenvalues and corresponding eigenvectors of the matrix Σ₂.

Proof. See Appendix.

We will now state the following property of invariance of mixture modality under rotation.

Theorem 5. The modality of the mixture NM(0, I, µ₂, Σ₂)_D is the same as that of the mixture NM(0, I, µ₀, Λ)_D, with µ₀ᵀ = (µ₂ᵀξ₁, µ₂ᵀξ₂, …, µ₂ᵀξ_D) and Λ = diag(λ₁, λ₂, …, λ_D), where (λᵢ, ξᵢ), i = 1, …, D, are the eigenvalue-eigenvector pairs of Σ₂.

Proof. Using µ₀ and Λ in Result 2, it is easy to check that the p(α) of the mixtures NM(0, I, µ₂, Σ₂)_D and NM(0, I, µ₀, Λ)_D have the same expression, and hence the same number of roots, which implies that the two mixtures will have the same modality.
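Result 2 and Theorem 5 can be exercised numerically in two dimensions, where the eigen-decomposition of a symmetric matrix has a closed form. The sketch below (plain Python; the matrix and mean values are illustrative choices of ours) computes p(α) from (5), once from Σ₂ directly and once from the rotated pair (µ₀, Λ), and the two agree:

```python
import math

def eig_sym2(a, b, d):
    # eigenvalues/eigenvectors of the symmetric 2x2 matrix [[a, b], [b, d]]
    h = 0.5 * (a + d)
    disc = math.sqrt(max(0.25 * (a - d) ** 2 + b * b, 0.0))
    l1, l2 = h + disc, h - disc
    if abs(b) < 1e-12:
        u = (1.0, 0.0) if a >= d else (0.0, 1.0)
    else:
        n = math.hypot(l1 - d, b)
        u = ((l1 - d) / n, b / n)
    return (l1, l2), (u, (-u[1], u[0]))  # second eigenvector orthogonal to the first

def p_alpha(alpha, mu, sigma):
    # p(alpha) = sum_i c_i / [alpha (lam_i - 1) + 1]^3 with c_i = lam_i (mu' xi_i)^2
    (l1, l2), (x1, x2) = eig_sym2(sigma[0][0], sigma[0][1], sigma[1][1])
    total = 0.0
    for lam, xi in ((l1, x1), (l2, x2)):
        c = lam * (mu[0] * xi[0] + mu[1] * xi[1]) ** 2
        total += c / (alpha * (lam - 1.0) + 1.0) ** 3
    return total

mu2 = (1.0, 2.0)
sigma2 = [[2.0, 0.5], [0.5, 1.0]]
(l1, l2), (x1, x2) = eig_sym2(2.0, 0.5, 1.0)
mu0 = (mu2[0] * x1[0] + mu2[1] * x1[1], mu2[0] * x2[0] + mu2[1] * x2[1])
lam_diag = [[l1, 0.0], [0.0, l2]]
print(abs(p_alpha(0.3, mu2, sigma2) - p_alpha(0.3, mu0, lam_diag)))  # ~0
```

Since q(α) = 1 − αᾱ p(α), identical p-functions immediately give identical modality, which is exactly the content of the proof of Theorem 5.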

For illustration, we will now apply the rotation described in Theorem 5 to the scaled version of Example 1, whose first component is a standard normal. Example 1 gives the numerical values of the parameters after scaling, and Figure 3 shows the contour plots of the mixtures before and after rotation.

Example 2. (Continuation of Example 1) Applying the rotation transformation described in Theorem 5 to the mixture with parameters

µ₁ = 0, Σ₁ = I, µ₂ = , Σ₂ = ,

we get the mixture with parameters

µ₁ = 0, Σ₁ = I, µ₀ = , Λ =     (6)

The contour plot in Figure 3(a) depicts the unrotated mixture NM(0, I, µ₂, Σ₂), whereas Figure 3(b) shows the contours of the rotated mixture NM(0, I, µ₀, Λ). Algebraically, the rotation that achieves the diagonal covariance of the second component is equivalent to using the orthonormal matrix P, whose columns are the eigenvectors of the covariance matrix Σ₂, to rotate the random variable. In fact, in two dimensions it has a very simple interpretation. We simply rotate the mixture contour around the origin (0, 0) such that the major axis of the ellipse from the contour of the second component is parallel to the x-axis. This automatically sets the minor axis parallel to the y-axis, resulting in a diagonal covariance matrix for the second component (see Figure 3). Note that this rotation does not affect the covariance matrix of the first component, as it remains an identity matrix. Finally we combine Theorem 4 and Theorem 5 to state the following corollary.

Corollary 1. The modality of any arbitrary two-component normal mixture is equal to that of a mixture of the form NM(0, I, µ₀, Λ), where Λ is diagonal.

Proof. First apply Theorem 4 to scale any mixture to the form NM(0, I, µ, Σ), and then apply Theorem 5 to rotate it to the form NM(0, I, µ₀, Λ).

4 Number of modes of a two-component multivariate normal mixture

In this section we will first focus our attention on exploring the modality of normal mixtures of the simplified form NM(0, I, µ, Λ)_D.
We will restrict ourselves to this small class of mixtures, as we have already shown in Section 3 that the modality of any two-component normal mixture is equivalent to the modality of a corresponding mixture of the form NM(0, I, µ, Λ)_D.
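In two dimensions the reduction of Corollary 1 can be carried out explicitly with closed-form symmetric 2×2 eigen-decompositions. The sketch below (plain Python; the input parameters are illustrative and the helper names are ours, not the paper's) computes Σ₂* = Σ₂^(1/2) Σ₁⁻¹ Σ₂^(1/2) and then a matching µ₂*; any map A with AΣ₁Aᵀ = I and AΣ₂Aᵀ = Σ₂* works, and we use A = (Σ₂*)^(1/2) Σ₂^(−1/2):

```python
import math

def eig_sym2(m):
    # eigen-decomposition of a symmetric 2x2 matrix [[a, b], [b, d]]
    a, b, d = m[0][0], m[0][1], m[1][1]
    h = 0.5 * (a + d)
    disc = math.sqrt(max(0.25 * (a - d) ** 2 + b * b, 0.0))
    l1, l2 = h + disc, h - disc
    if abs(b) < 1e-12:
        u = (1.0, 0.0) if a >= d else (0.0, 1.0)
    else:
        n = math.hypot(l1 - d, b)
        u = ((l1 - d) / n, b / n)
    return (l1, l2), (u, (-u[1], u[0]))

def fun_sym2(m, f):
    # apply f to a symmetric positive definite 2x2 matrix through its spectrum
    (l1, l2), (u, v) = eig_sym2(m)
    f1, f2 = f(l1), f(l2)
    return [[f1 * u[0] * u[0] + f2 * v[0] * v[0], f1 * u[0] * u[1] + f2 * v[0] * v[1]],
            [f1 * u[1] * u[0] + f2 * v[1] * v[0], f1 * u[1] * u[1] + f2 * v[1] * v[1]]]

def matmul2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def standardize(mu1, sig1, mu2, sig2):
    # Theorem 4 reduction: returns (mu2*, Sigma2*)
    s2h = fun_sym2(sig2, math.sqrt)             # Sigma2^{1/2}
    s2hi = fun_sym2(sig2, lambda l: l ** -0.5)  # Sigma2^{-1/2}
    s1i = fun_sym2(sig1, lambda l: 1.0 / l)     # Sigma1^{-1}
    sig2s = matmul2(matmul2(s2h, s1i), s2h)     # Sigma2^{1/2} Sigma1^{-1} Sigma2^{1/2}
    A = matmul2(fun_sym2(sig2s, math.sqrt), s2hi)
    dx, dy = mu2[0] - mu1[0], mu2[1] - mu1[1]
    mu2s = (A[0][0] * dx + A[0][1] * dy, A[1][0] * dx + A[1][1] * dy)
    return mu2s, sig2s

mu2s, sig2s = standardize((0.0, 0.0), [[2.0, 0.3], [0.3, 1.0]],
                          (1.0, 2.0), [[1.0, 0.2], [0.2, 3.0]])
(lam1, lam2), _ = eig_sym2(sig2s)  # the diagonal entries of Lambda in Theorem 5
```

The eigenvalues of Σ₂* equal those of Σ₁⁻¹Σ₂ (the two matrices are similar), so the Λ produced here is the diagonal matrix whose distinct entries drive the mode bound of the next subsection.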

Figure 3: Contour plots for the bivariate normal mixture of Example 2, (a) before and (b) after rotation.

First we will show that the maximum number of modes is a function of d, the number of distinct diagonal entries in Λ, by first showing that the maximum number of modes is less than or equal to (d + 1), and then showing that the upper bound (d + 1) is achievable. It is easy to check that d can be equal to the dimension D, and thus we arrive at the final result on the upper bound of the number of modes of an arbitrary D-dimensional mixture.

4.1 Upper bound on the number of modes of a two-component normal mixture

Recall that the number of modes can be directly enumerated using the number of solutions of q(α) = 1 − α(1 − α) p(α) = 0 within the range [0, 1]. Using the simplified form of p(α) given in (5) for mixtures of the form NM(0, I, µ, Λ)_D, we can simplify q(α) as

q(α) = 1 − α(1 − α) ∑ᵢ₌₁ᴰ cᵢ / [α(λᵢ − 1) + 1]³ = 0,

where the λᵢ's are the diagonal elements of Λ and cᵢ = λᵢ µᵢ².

To find the roots of q(α), we first state the following lemma.

Lemma 1. The number of solutions of

q(α) = 1 − α(1 − α) ∑ᵢ₌₁ᴰ cᵢ / [α(λᵢ − 1) + 1]³ = 0,

where α ∈ [0, 1], is exactly equal to the number of non-negative solutions of the equation

q*(t) = 1 − t(t + 1) ∑ᵢ₌₁ᴰ cᵢ / (t + λᵢ)³ = 0.

Proof. Define α = 1/(t + 1); then t ∈ [0, ∞) corresponds to α ∈ (0, 1], and it is easy to check that q(α) = q*(t). Since q(0) = 1, no root of q(α) is lost at α = 0.

This simple change of variable from α to t allows us to relate the number of modes to the non-negative solutions of q*(t) = 0, instead of the more difficult problem of finding solutions in the restricted interval [0, 1] for q(α) = 0. This simplification will enable us to find the upper bound on the number of modes, and will also allow us to recursively construct extra modes in extra dimensions. We will now use the mixture density given in (6) to illustrate the result in Lemma 1.

Example 3. (Continuation of Examples 1 and 2) After scaling and rotation, the modality of Example 1 is equivalent to that of the mixture with parameters

µ₁ = , Σ₁ = , µ₂ = , Σ₂ =

For the above mixture

q(α) = 1 − α(1 − α) [ c₁/(19α + 1)³ + c₂/(1 − 0.95α)³ ].

Using the change of variable α = 1/(t + 1), we have

q*(t) = 1 − t(t + 1) [ c₁/(t + 20)³ + c₂/(t + 0.05)³ ].

Solving the equation q(α) = 0 gives 4 solutions α₁, α₂, α₃, α₄ in the range [0, 1], while the equation q*(t) = 0 also has 4 non-negative solutions t₁, t₂, t₃, t₄. As a visual aid we have also presented the curves q(α) and q*(t), along with their zero crossings, in Figure 4. As we are only interested in the positive solutions of q*(t), we have changed the axis of t to log(t) to accommodate the wide range of t. In fact the solutions for Example 3 in log scale are symmetric: log(t₁) = −5.821, log(t₂) = −1.822, log(t₃) = 1.822, log(t₄) = 5.821.
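The change of variable in Lemma 1 is easy to check numerically. The sketch below (plain Python; the eigenvalues mirror Example 3's λ = 20 and 0.05, but the coefficients c₁, c₂ are our own illustrative choices, not the paper's values) counts the sign changes of q(α) on [0, 1] and of q*(t) on a large positive interval:

```python
def q_alpha(alpha, lam, c):
    # q(alpha) = 1 - alpha(1 - alpha) sum_i c_i / [alpha(lam_i - 1) + 1]^3
    s = sum(ci / (alpha * (li - 1.0) + 1.0) ** 3 for li, ci in zip(lam, c))
    return 1.0 - alpha * (1.0 - alpha) * s

def q_star(t, lam, c):
    # q*(t) = 1 - t(t + 1) sum_i c_i / (t + lam_i)^3
    return 1.0 - t * (t + 1.0) * sum(ci / (t + li) ** 3 for li, ci in zip(lam, c))

def sign_changes(f, lo, hi, n=200000):
    # count sign changes of f over a fine grid of [lo, hi]
    prev, k = f(lo), 0
    for i in range(1, n + 1):
        cur = f(lo + (hi - lo) * i / n)
        if prev * cur < 0:
            k += 1
        prev = cur
    return k

lam = [20.0, 0.05]  # eigenvalues as in Example 3
c = [2.0, 2.0]      # illustrative coefficients (not the paper's values)
n_alpha = sign_changes(lambda a: q_alpha(a, lam, c), 0.0, 1.0)
n_t = sign_changes(lambda t: q_star(t, lam, c), 0.0, 1000.0)
print(n_alpha, n_t)  # equal counts, as Lemma 1 predicts
```

Note that a grid search like this can in principle miss closely spaced roots, which is precisely why the paper pursues the analytical bound instead.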

Figure 4: Plots of (a) q(α) against α and (b) q*(t) against log(t) for the mixture given in Example 3.

Now we state the important result relating the number of non-negative solutions of q*(t) = 0, and hence the number of modes, to the number of unique diagonal entries of Λ, which equals the number of distinct eigenvalues of Σ₂.

Lemma 2. Consider mixtures of the type NM(0, I, µ, Σ₂)_D. Suppose Σ₂ has d (d ≤ D) distinct eigenvalues; then, irrespective of the value of µ, there are at most 2d non-negative solutions of the corresponding equation q*(t) = 0.

Proof. Let the d distinct eigenvalues of Σ₂ be λ₁, …, λ_d. Let us denote the upper bound on the number of real roots of q*(t) by O, and the lower bound on the number of its negative roots by N. We are interested in finding an upper bound for the number of non-negative roots, i.e., O − N. We will calculate the two bounds in two separate steps. Within each step we will consider two separate cases: one where all the eigenvalues are distinct from 1, and the other where one of the d distinct eigenvalues is equal to 1.

Step 1. To enumerate the upper bound on the number of real roots of the rational function q*(t), we transform it into a polynomial, whose roots are easier to enumerate.

Case 1: If λᵢ ≠ 1 for all i = 1, …, d, the multiplier for converting q*(t) = 0 into a polynomial equation is ∏ᵢ₌₁ᵈ (t + λᵢ)³, and as the highest order of the polynomial q*(t) ∏ᵢ₌₁ᵈ (t + λᵢ)³ is 3d, we have O = 3d.

Case 2: On the other hand, if λi = 1 for some i ∈ {1, …, d}, the multiplier converting q*(t) = 0 into a polynomial equation is ∏_{i=1}^{d} (t + λi)³/(t + 1), and the highest order of the polynomial q*(t) ∏_{i=1}^{d} (t + λi)³/(t + 1) is 3d − 1, giving O = 3d − 1.

Hence, the equation q*(t) = 0 has at most O real solutions, where

    O = 3d, if λi ≠ 1 for all i ∈ {1, …, d};
        3d − 1, if λi = 1 for some i ∈ {1, …, d}.    (7)

Step 2. To find the lower bound on the number of negative roots, we first note that

    q*(t) = 0 ⟺ 1/(t(t + 1)) = ∑_{i=1}^{D} c_i/(t + λi)³ ⟺ 1/t = 1/(t + 1) + ∑_{i=1}^{D} c_i/(t + λi)³.

Thus the solutions of q*(t) = 0 are the crossings of the two curves 1/t and

    r(t) = 1/(t + 1) + ∑_{i=1}^{D} c_i/(t + λi)³

(see Figure 5 for an illustration). Let us denote the right limit of a function f at the point t, lim_{x→t+} f(x), by f(t+), and similarly the left limit lim_{x→t−} f(x) by f(t−). Notice that r(t) is a rational function with c_i > 0 and λi > 0. Thus for each i = 1, 2, …, d we have a vertical asymptote, i.e., r((−λi)+) = +∞ and r((−λi)−) = −∞. Additionally we have r((−1)+) = +∞ and r((−1)−) = −∞. [See the dashed lines representing the asymptotes in Figure 5.] This implies that r(t) has several disjoint branches, and each branch traveling from −∞ at one asymptote to +∞ at the neighboring asymptote has to cross the line y = 0, and hence the curve 1/t, at least once. Now we discuss the two distinct cases.

Case 1: If λi ≠ 1 for all i = 1, …, d, the graph of r(t) has d + 1 asymptotes, one each at −λ1, …, −λd and −1. This gives rise to d + 2 disjoint branches, among which the d intermediate branches each have at least one crossing with the curve 1/t, which gives rise to at least d negative roots of q*(t) = 0; hence N = d.

Case 2: On the other hand, if λi = 1 for some i ∈ {1, …, d}, then there are only d − 1 distinct eigenvalues different from 1, and the graph of r(t) now has d + 1 branches, among which the d − 1 intermediate branches give rise to at least d − 1 negative solutions; hence N = d − 1.
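The branch argument of Step 2 can also be checked numerically. A sketch using the eigenvalue layout plotted in Figure 5 (λ = 2, 4, 8, 9); the unit weights c_i = 1 are an assumption, since the figure's coefficients are not printed:

```python
import numpy as np

# r(t) = 1/(t+1) + sum c_i/(t+lam_i)^3; zeros of g(t) = 1/t - r(t) are the
# solutions of q*(t) = 0
lam = [2.0, 4.0, 8.0, 9.0]
c = [1.0, 1.0, 1.0, 1.0]

def g(t):
    return 1.0 / t - 1.0 / (t + 1.0) - sum(ci / (t + li) ** 3 for ci, li in zip(c, lam))

# scan each open interval between neighbouring vertical asymptotes
asymptotes = sorted([-l for l in lam] + [-1.0])   # [-9, -8, -4, -2, -1]
found = 0
for left, right in zip(asymptotes[:-1], asymptotes[1:]):
    ts = np.linspace(left + 1e-6, right - 1e-6, 10001)
    vs = np.array([g(t) for t in ts])
    found += int(np.sum(np.sign(vs[:-1]) != np.sign(vs[1:])))
print(found)   # the branch argument guarantees at least d = 4 negative solutions
```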

Hence, the equation q*(t) = 0 has at least N negative solutions, where

    N = d, if λi ≠ 1 for all i ∈ {1, …, d};
        d − 1, if λi = 1 for some i ∈ {1, …, d}.    (8)

Combining (7) and (8), we see that in both cases there can be at most O − N = 2d non-negative solutions of the equation q*(t) = 0.

Figure 5: Plot showing the vertical asymptotes of r(t), for eigenvalues λi = 2, 4, 8, 9, and its crossings with the curve 1/t.

Finally we state the main theorem of this paper, giving the upper bound on the number of modes of a mixture of two normal components.

Theorem 6. The number of modes of the normal mixture NM(µ1, Σ1, µ2, Σ2)D is at most d + 1, where d is the number of distinct eigenvalues of the matrix Σ*2 = Σ2^{1/2} Σ1^{−1} Σ2^{1/2}, and hence the number of distinct eigenvalues of the ratio of the covariance matrices Σ2 and Σ1, denoted by Σ1^{−1} Σ2.

Proof. By Theorem 4 the modality of the mixture NM(µ1, Σ1, µ2, Σ2)D is the same as that of the mixture NM(0, I, µ*2, Σ2^{1/2} Σ1^{−1} Σ2^{1/2})D, where µ*2 is a vector of dimension D. Now using Lemma 2

we know that the corresponding q*(t), and hence q(α), will have at most 2d roots. Finally, using Result 1 we can show that NM(µ1, Σ1, µ2, Σ2)D has at most 2d/2 + 1 = d + 1 modes.

To show the second part, note that if λ is an eigenvalue of the matrix Σ*2 = Σ2^{1/2} Σ1^{−1} Σ2^{1/2}, then λ satisfies the equation |Σ*2 − λI| = 0. On the other hand,

    |Σ2 Σ1^{−1} − λI| = |Σ2^{1/2} (Σ*2 − λI) Σ2^{−1/2}| = |Σ2^{1/2}| |Σ*2 − λI| |Σ2^{−1/2}| = |Σ*2 − λI|.

Hence λ is an eigenvalue of the matrix Σ*2 if and only if λ is an eigenvalue of the matrix Σ2 Σ1^{−1}, which implies the second part of the theorem.

Theorem 7. Any D-dimensional normal mixture NM(µ1, Σ1, µ2, Σ2)D has at most D + 1 modes.

Proof. Σ*2 = Σ2^{1/2} Σ1^{−1} Σ2^{1/2} has D eigenvalues, hence d ≤ D. Using this inequality in Theorem 6 completes the proof.

4.2 Existence of D + 1 modes in D dimensions

In this subsection we show that it is always possible to find a mixture in any dimension which attains D + 1 modes. First we provide two examples, for D = 2 and D = 3, where the upper bound is achieved.

Remark 2. Example 1, with D = 2 and eigenvalues 20 and 0.05, achieves the upper bound on the number of modes for two-dimensional mixtures.

Example 4. Consider the three-dimensional example with 4 modes given in Ray and Lindsay (2005), with parameters

    µ1 = (0, 0, 0)′, µ2 = (1/√2, 2, 1/√2)′,    (9)

and Σ1, Σ2 as specified there. A straightforward calculation based on Theorem 4 shows that Σ*2 has eigenvalues 0.05, 1 and 20, i.e., D = d = 3. This mixture density has 4 modes, which again achieves the upper bound D + 1.

Though we have come up with examples achieving the upper bound in two and three dimensions, it is not easy to construct such pathological examples in higher dimensions. Hence we design a construction method which allows one to obtain one extra mode from each additional dimension. Starting from the fact that one can construct a mixture with two modes in one dimension (or using the examples in D = 2 and D = 3), one can use the

recursive relation to construct the parameters of a mixture in D dimensions which will have D + 1 modes.

Recall that Theorem 6 shows that in D dimensions the equation q*(t) = 0 can have at most 2D non-negative solutions, which in turn implies that the corresponding mixture can achieve at most D + 1 modes. Therefore, to achieve one extra mode in D + 1 dimensions we just need to choose the parameters of the mixture such that the corresponding equation q*(t) = 0 acquires two extra non-negative solutions. The following lemma provides the construction method for finding the two extra solutions of q*(t) = 0, starting from any dimension D.

Lemma 3. Let {(c_i, λ_i), i = 1, 2, …, D} be such that the equation

    y(t, D) = 1 − t(t + 1) ∑_{i=1}^{D} c_i/(t + λ_i)³ = 0

has 2D non-negative solutions. Then one can always find a pair of scalars (c_{D+1}, λ_{D+1}) such that

    y(t, D + 1) = 1 − t(t + 1) ∑_{i=1}^{D+1} c_i/(t + λ_i)³ = 0

has 2D + 2 non-negative solutions.

Proof. Note that y(t, D) = 0 is the same as q*(t) = 0 for D dimensions. Since y(t, D) = 0 has 2D non-negative solutions, and y(0, D) and lim_{t→∞} y(t, D) are both positive, y(t, D) changes sign 2D times on the positive t axis. Let y(t, D) be positive at the points t_0, t_2, …, t_{2D} = a and negative at the points t_1, t_3, …, t_{2D−1}, where 0 ≤ t_0 < t_1 < t_2 < ⋯ < t_{2D−1} < t_{2D} = a.

First we choose y_0 > 0 such that y_0 t_j(t_j + 1)(a + λ)³ < y(t_j, D)(t_j + λ)³ for all even j and all eigenvalues λ > 0. It can be verified that such a y_0 always exists. Then we choose t_{2D+1} > a such that 1/(t_{2D+1}(t_{2D+1} + 1)) < y_0/8, and finally we choose λ_{D+1} > max{λ_1, …, λ_D} such that (t_{2D+1} + λ_{D+1})/(a + λ_{D+1}) < 2, which will ensure that

    (t_{2D+1} + λ_{D+1})³ / (t_{2D+1}(t_{2D+1} + 1)(a + λ_{D+1})³) < y_0.    (10)

Now define c_{D+1} = y_0 (a + λ_{D+1})³. With the chosen pair (c_{D+1}, λ_{D+1}) we have

    Y(t_j) = y(t_j, D) − t_j(t_j + 1) c_{D+1}/(t_j + λ_{D+1})³  { > 0 for j even; < 0 for j odd },

i.e., Y(t) = y(t, D) − t(t + 1) c_{D+1}/(t + λ_{D+1})³ has the same sign as y(t, D) at the points t_0, t_1, …, t_{2D}, which means that Y(t) = 0 has 2D non-negative solutions, all of which are less than a = t_{2D}. On the other hand, we have

    Y(t_{2D+1}) = y(t_{2D+1}, D) − t_{2D+1}(t_{2D+1} + 1) c_{D+1}/(t_{2D+1} + λ_{D+1})³
                < 1 − t_{2D+1}(t_{2D+1} + 1) y_0 (a + λ_{D+1})³/(t_{2D+1} + λ_{D+1})³ < 0,

where the last inequality holds because of inequality (10). Hence Y(t) will be negative at the point t_{2D+1} > a, but lim_{t→∞} Y(t) > 0, so Y(t) = y(t, D + 1) = 0 has two more solutions than y(t, D) = 0, both of which are greater than a.

Remark 3. Note that the proof of the above lemma provides only one method of constructing the two extra non-negative solutions. These solutions are not unique.

The following corollary provides the recursive method for constructing an extra mode when the dimension of the mixture is increased by one.

Corollary 2. If a mixture of two normals in D dimensions has D + 1 modes, one can choose the parameters of the extra dimension such that the resulting (D + 1)-dimensional normal mixture will have D + 2 modes.

Proof. Use Theorems 4 and 5 to re-parametrize any mixture into the form NM(0, I, µ, Λ)D, where µ = (µ_1, …, µ_D)′ and Λ = diag(λ_1, …, λ_D), and then use Lemma 3 with c_i = λ_i µ_i² to compute (c_{D+1}, λ_{D+1}). The new mixture NM(0, I, µ = (µ_1, …, µ_D, µ_{D+1})′, Λ = diag(λ_1, …, λ_D, λ_{D+1}))_{D+1}, with µ_{D+1} = √(c_{D+1}/λ_{D+1}), will have D + 2 modes.

We now apply the method described in Corollary 2 to construct a 4-dimensional example with 5 modes, starting from the 3-dimensional case in Example 4.

Example 5. We first apply Theorem 4 to transform the 3-dimensional normal mixture given in (9) into the form NM(0, I, µ*2, Λ)_{D=3}, where

    µ*2 = (1/√2, 2, √10)′, Λ = diag(0.05, 1, 20).    (11)

Λ has d = 3 distinct eigenvalues, λ1 = 0.05, λ2 = 1, λ3 = 20, with corresponding c_i = λ_i µ_i² given by c1 = 0.025, c2 = 4, c3 = 200. Note that the equation

    q*(t) = y(t, 3) = 1 − t(t + 1) ∑_{i=1}^{3} c_i/(t + λ_i)³ = 0

has 6 positive solutions. Now we take 0 < t_0 < t_1 = 0.1 < t_2 = 0.3 < t_3 = 1 < t_4 = 3 < t_5 = 30 < t_6 = 200 = a such that y(t) is positive at the points t_0, t_2, t_4, t_6 and negative at the points t_1, t_3, t_5. Next choose y_0 such that y_0 t_j(t_j + 1)(a + λ)³ < y(t_j)(t_j + λ)³ for all even j and all λ > 0, take t_7 > a = 200 such that 1/(t_7(t_7 + 1)) < y_0/8, and let λ_4 > 20 be such that (t_7 + λ_4)/(a + λ_4) < 2. Let c_4 = y_0 (a + λ_4)³; the last component of the new 4-dimensional mean is then µ_4 = √(c_4/λ_4). This gives a 4-dimensional normal mixture NM(0, I, µ*2,new, Λ_new)_{D=4}, with

    µ*2,new = (1/√2, 2, √10, µ_4)′, Λ_new = diag(0.05, 1, 20, λ_4).

The corresponding equation

    q*(t) = 1 − t(t + 1) ∑_{i=1}^{4} c_i/(t + λ_i)³ = 0

has eight positive solutions, which implies the existence of five modes. Figure 6 shows q*(t) for the four-dimensional example along with its eight non-negative zero crossings. Among the eight crossings, the two on the right are obtained using the construction method of Corollary 2.

Remark 4. The construction process in Lemma 3 is designed to add two more positive solutions to the equation q*(t) = 0 when the dimension is increased, by adding another term to the summation without perturbing the original non-negative solutions too much. In Example 5 we started with six roots in three dimensions and constructed two extra roots in four dimensions. Among the six original roots, the first five remained exactly the same (to our precision), and the sixth shifted only by a small magnitude (0.001).
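These counts can be reproduced numerically, and the recipe itself can be exercised in the smallest case. A sketch; c3 = 200 is the inferred value used above, and the one-dimensional seed together with the constants y0, t3, lam2 in the second half are illustrative choices of our own, not from the paper:

```python
import numpy as np

def y(t, c, lam):
    # y(t, D) = 1 - t(t+1) sum c_i/(t + lam_i)^3
    t = np.asarray(t, dtype=float)
    s = sum(ci / (t + li) ** 3 for ci, li in zip(c, lam))
    return 1.0 - t * (t + 1.0) * s

def count_positive_roots(c, lam, lo=-6, hi=8, n=40001):
    grid = np.logspace(lo, hi, n)              # sign changes on a log grid
    v = y(grid, c, lam)
    return int(np.sum(np.sign(v[:-1]) != np.sign(v[1:])))

# the three-dimensional example: 6 positive roots -> 6/2 + 1 = 4 modes
c3d, lam3d = [0.025, 4.0, 200.0], [0.05, 1.0, 20.0]
print(count_positive_roots(c3d, lam3d))        # 6

# the Lemma 3 recipe in miniature, D = 1 -> D = 2:
c1d, lam1d = [9.0], [1.0]                      # two unit-variance normals 3 sds apart
a, y0 = 10.0, 1e-10                            # a bounds the old roots; y0 is small
t3, lam2 = 3.0e5, 3.1e5                        # t_{2D+1} and lambda_{D+1}
assert 1.0 / (t3 * (t3 + 1.0)) < y0 / 8.0      # condition on t_{2D+1}
assert (t3 + lam2) / (a + lam2) < 2.0          # condition on lambda_{D+1}
c2 = y0 * (a + lam2) ** 3                      # c_{D+1} = y0 (a + lambda_{D+1})^3

print(count_positive_roots(c1d, lam1d))                    # 2
print(count_positive_roots(c1d + [c2], lam1d + [lam2]))    # 4: one extra mode
```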

Figure 6: Plot of q*(t), which has eight positive roots, along with its zero crossings. Here q*(t) is plotted against log(t) because of the wide range of t.

Finally we arrive at the main theorem of the paper, Theorem 1, which establishes the tightness of the bound given in Theorem 7, using the following argument.

Proof of Theorem 1. The upper bound has already been shown in Theorem 7. To show that this bound can be achieved, we construct mixtures with D + 1 modes in any dimension. In one dimension, two normals with equal variance have two modes if the distance between their means is more than two times the common standard deviation. One can then use Corollary 2 repeatedly to construct one extra mode per dimension, resulting in exactly D + 1 modes in D dimensions.

4.3 Special Cases

The result given in Theorem 1 is the most general modality theorem available for a two-component normal mixture. Many previous modality results can be stated as special cases of this generalized result. In the corollaries which follow we show that our modality result can be used to duplicate some of the univariate and multivariate results found in the literature.

The study of the case D = 1, i.e., the mixture of two univariate normals, can be traced back to the early 20th century. For example, Helguero (1904) discussed the equal-variance case and Robertson and Fryer (1969) the unequal-variance case, and both showed that there exist at most 2 modes for a univariate normal mixture. Note that in both cases the two variances are either equal or proportional to one another in one dimension, and our result likewise shows that at most two modes are achievable. Some results

on the mixture of two higher-dimensional normals with equal or proportional covariance matrices have also been developed later. A recent result from Ray and Lindsay (2005) shows that for any dimension, a two-component normal mixture with proportional covariance matrices can have at most two modes. Our result confirms the result of Ray and Lindsay (2005), though with a different methodology.

Corollary 3. In any dimension, the mixture of two normal components with equal or proportional covariance matrices (Σ2 = cΣ1 for a scalar c > 0) can have at most two modes.

Proof. By Theorem 6 the maximum number of modes is one more than the number of distinct eigenvalues d of Σ*2 = Σ2^{1/2} Σ1^{−1} Σ2^{1/2}. For the equal or proportional case,

    Σ*2 = I if Σ2 = Σ1, and Σ*2 = cI if Σ2 = cΣ1.

In both cases all the eigenvalues are the same, so d = 1 and the mixture can have at most two modes.

Now we discuss some of the examples stated in Ray and Lindsay (2005). Both the two-dimensional example with three modes, whose parameters are given in Example 3, and the three-dimensional example with four modes in Example 4 were stated there merely as examples of the existence of more than two modes. Our results show that they actually achieve the upper bound possible within their respective dimensions. Moreover, the construction method of the examples in Ray and Lindsay (2005) was not easily generalizable to higher dimensions, whereas our construction algorithm described in Lemma 3 provides an easy strategy for constructing such examples.

5 Conclusion and discussion

In this paper we have developed a powerful theory for understanding the topography of a multivariate normal mixture model. The results on the upper bound are mainly focused on the two-component case, where we provide the clear upper bound of D + 1 for any D-dimensional normal mixture. Moreover, for any dimension one can produce a mixture which attains the upper bound.
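The bound is easy to evaluate for a given parameter structure. A small utility of our own (not from the paper) illustrating Theorem 6 and Corollary 3:

```python
import numpy as np

# d + 1 modes at most, where d is the number of distinct eigenvalues of
# Sigma_1^{-1} Sigma_2 (same spectrum as Sigma_2^{1/2} Sigma_1^{-1} Sigma_2^{1/2})
def mode_upper_bound(sigma1, sigma2, tol=1e-8):
    eig = np.sort(np.linalg.eigvals(np.linalg.solve(sigma1, sigma2)).real)
    d = 1 + int(np.sum(np.diff(eig) > tol))    # number of distinct eigenvalues
    return d + 1

# canonical form of the three-dimensional example: eigenvalues 0.05, 1, 20
print(mode_upper_bound(np.eye(3), np.diag([0.05, 1.0, 20.0])))   # 4

# Corollary 3: proportional covariances collapse the spectrum to one value
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
sigma1 = B @ B.T + 5.0 * np.eye(5)             # a random SPD matrix
print(mode_upper_bound(sigma1, 3.0 * sigma1))  # 2, in any dimension
```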
In this paper we have also shown that the number of modes of a two-component D-dimensional normal mixture NM(µ1, Σ1, µ2, Σ2)D is bounded above by one more than the number of distinct eigenvalues of the ratio matrix Σ2 Σ1^{−1}, irrespective of the means. In the course of this analysis we have not discussed how these new bounds and construction methods might be used for statistical purposes. We believe there is a wide area of application for these results. Given a parameter structure, one can easily compute the upper bound on the number of modes, which can be of enormous help to many clustering methods. The construction method may also come in handy for Bayesian prior elicitation.

Finally, the results give us a clear understanding of the interplay of component means and variances in shaping the topography of mixtures, which may be readily generalizable to mixtures of other distributions. We also note that a number of open mathematical questions remain. For example, mixtures of t distributions are often used as a robust alternative to mixtures of normals, but there are no available results on the number of modes of a mixture of t's. One should note that the contours of the t and the normal, which determine the number of modes, display very similar topographical structure, so one might be able to borrow the results on the topography of normals for exploring the topography of t mixtures. Using this intuition one might then generalize the results to any elliptical distribution. Finally, our results on the upper bound are mainly derived for K = 2. It would therefore be useful to establish relationships between the modality structure of the pairs of densities in a mixture and the overall modality of the entire mixture of K > 2 components. This generalization becomes challenging even for K = 3, where the resulting ridgeline manifold is two-dimensional and one may need to find the roots of an equation in two variables.

Acknowledgments: We thank Dr. David Fried of the Department of Mathematics and Statistics at Boston University for his assistance in solving the algebraic problems in this paper.

A Proof of Theorems and Results

A.1 Proof of Theorem 4

We only need to check that the function p(α) is the same for the two mixtures NM(µ1, Σ1, µ2, Σ2)D and NM(0, I, µ*2, Σ*2)D.

First note that

    S_α = αΣ1^{−1} + ᾱΣ2^{−1} = Σ2^{−1/2} (αΣ2^{1/2} Σ1^{−1} Σ2^{1/2} + ᾱI) Σ2^{−1/2},

which implies

    S_α^{−1} = Σ2^{1/2} (αΣ2^{1/2} Σ1^{−1} Σ2^{1/2} + ᾱI)^{−1} Σ2^{1/2}.

Thus

    Σ2^{−1/2} S_α^{−1} Σ2^{−1/2} = (αΣ2^{1/2} Σ1^{−1} Σ2^{1/2} + ᾱI)^{−1}.

Now for the mixture NM(µ1, Σ1, µ2, Σ2)D,
    p(α) = (µ2 − µ1)′ Σ1^{−1} S_α^{−1} Σ2^{−1} S_α^{−1} Σ2^{−1} S_α^{−1} Σ1^{−1} (µ2 − µ1)
         = (µ2 − µ1)′ Σ1^{−1} Σ2^{1/2} (Σ2^{−1/2} S_α^{−1} Σ2^{−1/2})³ Σ2^{1/2} Σ1^{−1} (µ2 − µ1)
         = (µ2 − µ1)′ Σ1^{−1} Σ2^{1/2} (αΣ2^{1/2} Σ1^{−1} Σ2^{1/2} + ᾱI)^{−3} Σ2^{1/2} Σ1^{−1} (µ2 − µ1).
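The matrix identity underlying this chain of equalities can be spot-checked numerically. A sketch with randomly generated SPD matrices (all names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
A1 = rng.standard_normal((D, D)); sigma1 = A1 @ A1.T + D * np.eye(D)
A2 = rng.standard_normal((D, D)); sigma2 = A2 @ A2.T + D * np.eye(D)
mu = rng.standard_normal(D)                 # stands for mu_2 - mu_1
alpha, abar = 0.3, 0.7

inv1, inv2 = np.linalg.inv(sigma1), np.linalg.inv(sigma2)
S_inv = np.linalg.inv(alpha * inv1 + abar * inv2)        # (S_alpha)^{-1}

# first line: mu' Sigma1^{-1} S^{-1} Sigma2^{-1} S^{-1} Sigma2^{-1} S^{-1} Sigma1^{-1} mu
lhs = mu @ inv1 @ S_inv @ inv2 @ S_inv @ inv2 @ S_inv @ inv1 @ mu

# last line, via the symmetric square root of Sigma2
w, V = np.linalg.eigh(sigma2)
root2 = V @ np.diag(np.sqrt(w)) @ V.T                    # Sigma2^{1/2}
M = alpha * root2 @ inv1 @ root2 + abar * np.eye(D)
M_inv3 = np.linalg.matrix_power(np.linalg.inv(M), 3)
rhs = mu @ inv1 @ root2 @ M_inv3 @ root2 @ inv1 @ mu

print(np.allclose(lhs, rhs))   # True
```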


Algebra 2 (2006) Correlation of the ALEKS Course Algebra 2 to the California Content Standards for Algebra 2 Algebra 2 (2006) Correlation of the ALEKS Course Algebra 2 to the California Content Standards for Algebra 2 Algebra II - This discipline complements and expands the mathematical content and concepts of

More information

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In

More information

CS168: The Modern Algorithmic Toolbox Lectures #11 and #12: Spectral Graph Theory

CS168: The Modern Algorithmic Toolbox Lectures #11 and #12: Spectral Graph Theory CS168: The Modern Algorithmic Toolbox Lectures #11 and #12: Spectral Graph Theory Tim Roughgarden & Gregory Valiant May 2, 2016 Spectral graph theory is the powerful and beautiful theory that arises from

More information

THE N-VALUE GAME OVER Z AND R

THE N-VALUE GAME OVER Z AND R THE N-VALUE GAME OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is an easily described mathematical diversion with deep underpinnings in dynamical systems analysis. We examine

More information

4. Linear transformations as a vector space 17

4. Linear transformations as a vector space 17 4 Linear transformations as a vector space 17 d) 1 2 0 0 1 2 0 0 1 0 0 0 1 2 3 4 32 Let a linear transformation in R 2 be the reflection in the line = x 2 Find its matrix 33 For each linear transformation

More information

av 1 x 2 + 4y 2 + xy + 4z 2 = 16.

av 1 x 2 + 4y 2 + xy + 4z 2 = 16. 74 85 Eigenanalysis The subject of eigenanalysis seeks to find a coordinate system, in which the solution to an applied problem has a simple expression Therefore, eigenanalysis might be called the method

More information

Honors Integrated Algebra/Geometry 3 Critical Content Mastery Objectives Students will:

Honors Integrated Algebra/Geometry 3 Critical Content Mastery Objectives Students will: Content Standard 1: Numbers, Number Sense, and Computation Place Value Fractions Comparing and Ordering Counting Facts Estimating and Estimation Strategies Determine an approximate value of radical and

More information

Review of Linear Algebra Definitions, Change of Basis, Trace, Spectral Theorem

Review of Linear Algebra Definitions, Change of Basis, Trace, Spectral Theorem Review of Linear Algebra Definitions, Change of Basis, Trace, Spectral Theorem Steven J. Miller June 19, 2004 Abstract Matrices can be thought of as rectangular (often square) arrays of numbers, or as

More information

Algebra 2 with Trigonometry Correlation of the ALEKS course Algebra 2 with Trigonometry to the Tennessee Algebra II Standards

Algebra 2 with Trigonometry Correlation of the ALEKS course Algebra 2 with Trigonometry to the Tennessee Algebra II Standards Algebra 2 with Trigonometry Correlation of the ALEKS course Algebra 2 with Trigonometry to the Tennessee Algebra II Standards Standard 2 : Number & Operations CLE 3103.2.1: CLE 3103.2.2: CLE 3103.2.3:

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004

642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004 642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004 Introduction Square matrices whose entries are all nonnegative have special properties. This was mentioned briefly in Section

More information

Some Notes on Linear Algebra

Some Notes on Linear Algebra Some Notes on Linear Algebra prepared for a first course in differential equations Thomas L Scofield Department of Mathematics and Statistics Calvin College 1998 1 The purpose of these notes is to present

More information

Introduction. J.M. Burgers Center Graduate Course CFD I January Least-Squares Spectral Element Methods

Introduction. J.M. Burgers Center Graduate Course CFD I January Least-Squares Spectral Element Methods Introduction In this workshop we will introduce you to the least-squares spectral element method. As you can see from the lecture notes, this method is a combination of the weak formulation derived from

More information

Metric-based classifiers. Nuno Vasconcelos UCSD

Metric-based classifiers. Nuno Vasconcelos UCSD Metric-based classifiers Nuno Vasconcelos UCSD Statistical learning goal: given a function f. y f and a collection of eample data-points, learn what the function f. is. this is called training. two major

More information

1 ** The performance objectives highlighted in italics have been identified as core to an Algebra II course.

1 ** The performance objectives highlighted in italics have been identified as core to an Algebra II course. Strand One: Number Sense and Operations Every student should understand and use all concepts and skills from the pervious grade levels. The standards are designed so that new learning builds on preceding

More information

Definition 2.3. We define addition and multiplication of matrices as follows.

Definition 2.3. We define addition and multiplication of matrices as follows. 14 Chapter 2 Matrices In this chapter, we review matrix algebra from Linear Algebra I, consider row and column operations on matrices, and define the rank of a matrix. Along the way prove that the row

More information

Course Goals and Course Objectives, as of Fall Math 102: Intermediate Algebra

Course Goals and Course Objectives, as of Fall Math 102: Intermediate Algebra Course Goals and Course Objectives, as of Fall 2015 Math 102: Intermediate Algebra Interpret mathematical models such as formulas, graphs, tables, and schematics, and draw inferences from them. Represent

More information

Unit 2, Section 3: Linear Combinations, Spanning, and Linear Independence Linear Combinations, Spanning, and Linear Independence

Unit 2, Section 3: Linear Combinations, Spanning, and Linear Independence Linear Combinations, Spanning, and Linear Independence Linear Combinations Spanning and Linear Independence We have seen that there are two operations defined on a given vector space V :. vector addition of two vectors and. scalar multiplication of a vector

More information

MATHEMATICAL METHODS AND APPLIED COMPUTING

MATHEMATICAL METHODS AND APPLIED COMPUTING Numerical Approximation to Multivariate Functions Using Fluctuationlessness Theorem with a Trigonometric Basis Function to Deal with Highly Oscillatory Functions N.A. BAYKARA Marmara University Department

More information

n=0 xn /n!. That is almost what we have here; the difference is that the denominator is (n + 1)! in stead of n!. So we have x n+1 n=0

n=0 xn /n!. That is almost what we have here; the difference is that the denominator is (n + 1)! in stead of n!. So we have x n+1 n=0 DISCRETE MATHEMATICS HOMEWORK 8 SOL Undergraduate Course Chukechen Honors College Zhejiang University Fall-Winter 204 HOMEWORK 8 P496 6. Find a closed form for the generating function for the sequence

More information

Spectral Theorem for Self-adjoint Linear Operators

Spectral Theorem for Self-adjoint Linear Operators Notes for the undergraduate lecture by David Adams. (These are the notes I would write if I was teaching a course on this topic. I have included more material than I will cover in the 45 minute lecture;

More information

PURE MATHEMATICS AM 27

PURE MATHEMATICS AM 27 AM SYLLABUS (2020) PURE MATHEMATICS AM 27 SYLLABUS 1 Pure Mathematics AM 27 (Available in September ) Syllabus Paper I(3hrs)+Paper II(3hrs) 1. AIMS To prepare students for further studies in Mathematics

More information

Data Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs

Data Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs Data Analysis and Manifold Learning Lecture 9: Diffusion on Manifolds and on Graphs Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture

More information

MATH 353 LECTURE NOTES: WEEK 1 FIRST ORDER ODES

MATH 353 LECTURE NOTES: WEEK 1 FIRST ORDER ODES MATH 353 LECTURE NOTES: WEEK 1 FIRST ORDER ODES J. WONG (FALL 2017) What did we cover this week? Basic definitions: DEs, linear operators, homogeneous (linear) ODEs. Solution techniques for some classes

More information

Nonlinear Autonomous Systems of Differential

Nonlinear Autonomous Systems of Differential Chapter 4 Nonlinear Autonomous Systems of Differential Equations 4.0 The Phase Plane: Linear Systems 4.0.1 Introduction Consider a system of the form x = A(x), (4.0.1) where A is independent of t. Such

More information

8. Limit Laws. lim(f g)(x) = lim f(x) lim g(x), (x) = lim x a f(x) g lim x a g(x)

8. Limit Laws. lim(f g)(x) = lim f(x) lim g(x), (x) = lim x a f(x) g lim x a g(x) 8. Limit Laws 8.1. Basic Limit Laws. If f and g are two functions and we know the it of each of them at a given point a, then we can easily compute the it at a of their sum, difference, product, constant

More information

Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors Sec. 6.1 Eigenvalues and Eigenvectors Linear transformations L : V V that go from a vector space to itself are often called linear operators. Many linear operators can be understood geometrically by identifying

More information

Algebra Performance Level Descriptors

Algebra Performance Level Descriptors Limited A student performing at the Limited Level demonstrates a minimal command of Ohio s Learning Standards for Algebra. A student at this level has an emerging ability to A student whose performance

More information

ALGEBRA 2. Background Knowledge/Prior Skills Knows what operation properties hold for operations with matrices

ALGEBRA 2. Background Knowledge/Prior Skills Knows what operation properties hold for operations with matrices ALGEBRA 2 Numbers and Operations Standard: 1 Understands and applies concepts of numbers and operations Power 1: Understands numbers, ways of representing numbers, relationships among numbers, and number

More information

Rigid Geometric Transformations

Rigid Geometric Transformations Rigid Geometric Transformations Carlo Tomasi This note is a quick refresher of the geometry of rigid transformations in three-dimensional space, expressed in Cartesian coordinates. 1 Cartesian Coordinates

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Generalized eigenvector - Wikipedia, the free encyclopedia

Generalized eigenvector - Wikipedia, the free encyclopedia 1 of 30 18/03/2013 20:00 Generalized eigenvector From Wikipedia, the free encyclopedia In linear algebra, for a matrix A, there may not always exist a full set of linearly independent eigenvectors that

More information

PURE MATHEMATICS AM 27

PURE MATHEMATICS AM 27 AM Syllabus (014): Pure Mathematics AM SYLLABUS (014) PURE MATHEMATICS AM 7 SYLLABUS 1 AM Syllabus (014): Pure Mathematics Pure Mathematics AM 7 Syllabus (Available in September) Paper I(3hrs)+Paper II(3hrs)

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

MATH 114 Calculus Notes on Chapter 2 (Limits) (pages 60-? in Stewart)

MATH 114 Calculus Notes on Chapter 2 (Limits) (pages 60-? in Stewart) Still under construction. MATH 114 Calculus Notes on Chapter 2 (Limits) (pages 60-? in Stewart) As seen in A Preview of Calculus, the concept of it underlies the various branches of calculus. Hence we

More information

On the Geometry of EM algorithms

On the Geometry of EM algorithms On the Geometry of EM algorithms David R. Hunter Technical report no. 0303 Department of Statistics Penn State University University Park, PA 16802-2111 email: dhunter@stat.psu.edu phone: (814) 863-0979

More information

Lecture for Week 2 (Secs. 1.3 and ) Functions and Limits

Lecture for Week 2 (Secs. 1.3 and ) Functions and Limits Lecture for Week 2 (Secs. 1.3 and 2.2 2.3) Functions and Limits 1 First let s review what a function is. (See Sec. 1 of Review and Preview.) The best way to think of a function is as an imaginary machine,

More information

Linear algebra for MATH2601: Theory

Linear algebra for MATH2601: Theory Linear algebra for MATH2601: Theory László Erdős August 12, 2000 Contents 1 Introduction 4 1.1 List of crucial problems............................... 5 1.2 Importance of linear algebra............................

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

TEACHER NOTES FOR ADVANCED MATHEMATICS 1 FOR AS AND A LEVEL

TEACHER NOTES FOR ADVANCED MATHEMATICS 1 FOR AS AND A LEVEL April 2017 TEACHER NOTES FOR ADVANCED MATHEMATICS 1 FOR AS AND A LEVEL This book is designed both as a complete AS Mathematics course, and as the first year of the full A level Mathematics. The content

More information

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions Chapter 3 Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions 3.1 Scattered Data Interpolation with Polynomial Precision Sometimes the assumption on the

More information

The chromatic number of ordered graphs with constrained conflict graphs

The chromatic number of ordered graphs with constrained conflict graphs AUSTRALASIAN JOURNAL OF COMBINATORICS Volume 69(1 (017, Pages 74 104 The chromatic number of ordered graphs with constrained conflict graphs Maria Axenovich Jonathan Rollin Torsten Ueckerdt Department

More information

PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS

PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS Statistica Sinica 15(2005), 831-840 PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS Florin Vaida University of California at San Diego Abstract: It is well known that the likelihood sequence of the EM algorithm

More information

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Copyright 2018 by James A. Bernhard Contents 1 Vector spaces 3 1.1 Definitions and basic properties.................

More information

Systems of Algebraic Equations and Systems of Differential Equations

Systems of Algebraic Equations and Systems of Differential Equations Systems of Algebraic Equations and Systems of Differential Equations Topics: 2 by 2 systems of linear equations Matrix expression; Ax = b Solving 2 by 2 homogeneous systems Functions defined on matrices

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

4. Determinants.

4. Determinants. 4. Determinants 4.1. Determinants; Cofactor Expansion Determinants of 2 2 and 3 3 Matrices 2 2 determinant 4.1. Determinants; Cofactor Expansion Determinants of 2 2 and 3 3 Matrices 3 3 determinant 4.1.

More information

ESCONDIDO UNION HIGH SCHOOL DISTRICT COURSE OF STUDY OUTLINE AND INSTRUCTIONAL OBJECTIVES

ESCONDIDO UNION HIGH SCHOOL DISTRICT COURSE OF STUDY OUTLINE AND INSTRUCTIONAL OBJECTIVES ESCONDIDO UNION HIGH SCHOOL DISTRICT COURSE OF STUDY OUTLINE AND INSTRUCTIONAL OBJECTIVES COURSE TITLE: Algebra II A/B COURSE NUMBERS: (P) 7241 / 2381 (H) 3902 / 3903 (Basic) 0336 / 0337 (SE) 5685/5686

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

The Cartan Dieudonné Theorem

The Cartan Dieudonné Theorem Chapter 7 The Cartan Dieudonné Theorem 7.1 Orthogonal Reflections Orthogonal symmetries are a very important example of isometries. First let us review the definition of a (linear) projection. Given a

More information