INVARIANT COORDINATE SELECTION

Size: px
Start display at page:

Download "INVARIANT COORDINATE SELECTION"

Transcription

1 INVARIANT COORDINATE SELECTION By David E. Tyler 1, Frank Critchley, Lutz Dümbgen 2, and Hannu Oja Rutgers University, Open University, University of Berne and University of Tampere SUMMARY A general method for exploring multivariate data by comparing different estimates of multivariate scatter is presented. The method is based upon the eigenvalue-eigenvector decomposition of one scatter matrix relative to another. In particular, it is shown that the eigenvectors can be used to generate an affine invariant coordinate system for the multivariate data. Consequently, we view this method as a method for invariant coordinate selection (ICS). By plotting the data with respect to this new invariant coordinate system, various data structures can be revealed. For example, under certain independent components models, it is shown that the invariant coordinates correspond to the independent components. Another example pertains to mixtures of elliptical distributions. In this case, it is shown that a subset of the invariant coordinates corresponds to Fisher s linear discriminant subspace, even though the class identifications of the data points are unknown. Some illustrative examples are given. 1. Introduction. When sampling from a multivariate normal distribution, the sample mean vector and sample variance-covariance matrix are a sufficient summary of the data set. To protect against non-normality, and in particular against longer tailed distributions and outliers, one can replace the sample mean and covariance matrix with robust estimates of multivariate location and scatter (or pseudo-covariance). A variety of robust estimates of the multivariate location vector and scatter matrix have been proposed. Among them are multivariate M -estimates [19, 29], the minimum volume ellipsoid estimate (MVE) and the minimum covariance determinant estimate (MCD) [38], S-estimates [12, 25], projection based estimates [30, 44], τ-estimates [26], CM -estimates [24] and MM -estimates [43, 45], as well as one-step versions of these estimates [27]. After computing robust estimates of multivariate location and scatter, outliers can often be detected by examining the corresponding robust Mahalanobis distances, see e.g. [39]. Summarizing a multivariate data set via a location and a scatter statistic, and then inspecting the corresponding Mahalanobis distance plot for possible outliers, is appropriate if the bulk of the data arises from a multivariate normal distribution or, more generally, from an elliptically symmetric distribution. However, if the data arises from a distribution which is not symmetric, then different AMS 2000 subject classifications. Primary 62H05, 62G35. Secondary 62-09, 62H25, 62H30. Key words and phrases. affine invariance, cluster analysis, independent components analysis, mixture models, multivariate diagnostics, multivariate scatter, principal components, projection pursuit, robust statistics. 1 Research supported by NSF Grant DMS Research supported by the Swiss National Science Foundation 1

2 2 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA location statistics are estimating different notions of central tendency. Moreover, if the data arises from a distribution other than an elliptically symmetric distribution, even one which is symmetric, then different scatter statistics are not necessarily estimating the same population quantity, but rather are reflecting different aspects of the underlying distribution. This suggests that comparing different estimates of multivariate scatter may help reveal interesting departures from an elliptically symmetric distribution. Such data structures may not be apparent in a Mahalanobis distance plot. In this paper, we present a general multivariate method based upon the comparison of different estimates of multivariate scatter. This method is based on the eigenvalue-eigenvector decomposition of one scatter matrix relative to another. An important property of this decomposition is that the corresponding eigenvectors generate an affine invariant coordinate system for the multivariate observations, and so we view this method as a method for invariant coordinate selection (ICS). By plotting the data with respect to this new invariant coordinate system, various data structures can be revealed. For example, when the data arises from a mixture of elliptical distributions, the space spanned by a subset of the invariant coordinates gives an estimate of Fisher s linear discriminant subspace, even though the class identifications of the data points are unknown. Another example pertains to certain independent components models. Here the variables obtained using the invariant coordinates correspond to estimates of the independent components. The paper is organized as follows. Section 2 sets up some notation and concepts to be used in the paper. In particular, the general concept of affine equivariant scatter matrices is reviewed in 2.1 and some classes of scatter matrices are briefly reviewed in section 2.2. The idea of comparing two different scatter matrices using the eigenvalue-eigenvector decomposition of one scatter matrix relative to another is discussed in section 3, with the invariance properties of the ICS transformation being given in section 4. Section 5 gives a theoretical study of the ICS transformation under the aforementioned elliptical mixture models (section 5.1), and under independent components models (section 5.2). The results in section 5.1 represent a broad generalization of results given under the heading of generalized principal components analysis (GPCA) by Ruiz-Gazen [41] and Caussinus and Ruiz-Gazen [5, 6]. Readers primarily interested in how ICS works in practice may wish to skip section 5 at a first reading. In section 6, a general discussion on the choice of scatter matrices one may consider when implementing ICS, along with some examples illustrating the utility of the ICS transformation for diagnostic plots, are given. Further discussion, open research questions, and the relationship of ICS to other approaches are given in section 7. All formal proofs are reserved for section 8, an appendix. An R package entitled ICS [34] is freely available for implementing the ICS methods. 2. Scatter Matrices Affine Equivariance. Let F Y denote the distribution function of the multivariate random variable Y R p, and let P p represent the set of all symmetric positive definite matrices of order p. Affine equivariant multivariate location and scatter functionals, say µ(f Y ) R p and V (F Y ) P p

3 INVARIANT COORDINATE SELECTION 3 respectively, are functions of the distribution satisfying the property that for Y = AY + b, with A nonsingular and b R p, (1) µ(f Y ) = Aµ(F Y ) + b and V (F Y ) = AV (F Y )A. Classical examples of affine equivariant location and scatter functionals are the mean vector µ Y = E[Y ] and the variance-covariance matrix Σ Y = E[(Y µ Y )(Y µ Y ) ] respectively, provided they exist. For our purposes, affine equivariance of the scatter matrix can be relaxed slightly to require only affine equivariance of its shape components. A shape component of a scatter matrix V P p refers to any function of V, say S(V ), such that (2) S(V ) = S(λV ) for any λ > 0. Thus, we say that the shape of V (F Y ) is affine equivariant if (3) V (F Y ) AV (F Y )A. For a p-dimensional sample of size n, Y = {y 1,..., y n }, affine equivariant multivariate location and scatter statistics, say µ and V respectively, are defined by applying the above definition to the empirical distribution function. That is, they are statistics satisfying the property that for any nonsingular A and any b R p, (4) y i y i = Ay i + b for i = 1,..., n ( µ, V ) ( µ, V ) = (A µ + b, A V A ). Likewise, the shape of V is said to be affine equivariant if (5) V A V A. The sample mean vector ȳ and sample variance-covariance matrix S n are examples of affine equivariant location and scatter statistics respectively, as are all the estimates cited in the introduction. Typically, in practice, V is normalized so that it is consistent at the multivariate normal model for the variance-covariance matrix. The normalized version is thus given as Ṽ = V /β, where β > 0 is such that V (F Z ) = βi when Z has a standard multivariate normal distribution. For our purposes, it is sufficient to consider only the unnormalized scatter matrix V since our proposed methods depend only on the scatter matrix up to proportionality, i.e. only on the shape of the scatter matrix. Under elliptical symmetry, affine equivariant location and scatter functionals have relatively simple forms. Recall that an elliptically symmetric distribution is defined to be one arising from an affine transformation of a spherically symmetric distribution, i.e. if Z QZ for any p p orthogonal matrix Q, then the distribution of Y = AZ + µ is said to have an elliptically symmetric distribution with center µ R p and shape matrix Γ = AA, see e.g. [2]. If the distribution of Y is also absolutely continuous, then it has a density of the form (6) f(y; µ, Γ, g) = det(γ) 1/2 g{(y µ) Γ 1 (y µ)} for y R p,

4 4 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA for some non-negative function g and with Γ P p. As defined, the shape parameter Γ of an elliptically symmetric distribution is only well defined up to a scalar multiple, i.e. if Γ satisfies the definition of a shape matrix for a given elliptically symmetric distribution, then λγ also does for any λ > 0. In the absolutely continuous case, if no restrictions are placed on the function g, then the parameter Γ is confounded with g. One could normalize the shape parameter by setting, for example, det(γ) = 1 or trace(γ) = p. Again, this is not necessary for our purposes since only the shape components of Γ, as defined in (2), are of interest in this paper, and these shape components for an elliptically symmetric distribution are well defined. Under elliptical symmetry, any affine equivariant location functional corresponds to the center of symmetry and any affine equivariant scatter functional is proportional to the shape matrix, i.e. µ(f Y ) = µ and V (F Y ) Γ. In particular, µ Y = µ and Σ Y Γ when the first and second moments exist respectively. More generally, if V (F Y ) is any functional satisfying (3), then V (F Y ) Γ. As noted in the introduction, for general distributions, affine equivariant location functionals are not necessarily equal and affine equivariant scatter functionals are not necessarily proportional to each other. The corresponding sample versions of these functionals are therefore estimating different population features. The difference in these functionals reflect in some way how the distribution differs from an elliptically symmetric distribution. Remark 2.1. The class of distributions for which all affine equivariant location functionals are equal and all equivariant scatter functionals are proportional to each other is broader than the class of elliptical distributions. For example, this can be shown to be true for F Y when Y = AZ+µ with the distribution of Z being exchangeable and symmetric in each component. That is, Z DJZ for any permutation matrix J and any diagonal matrix D having diagonal elements ±1. We conjecture that this is the broadest class for which this property holds. This class contains the elliptical symmetric distributions, since these correspond to Z having a spherically symmetric distribution Classes of scatter statistics. Conceptually, the simplest alternatives to the sample mean ȳ and sample covariance matrix S n are the weighted sample means and sample covariance matrices respectively, with the weights dependent on the classical Mahalanobis distances. These are defined by (7) n i=1 µ = u n 1(s o,i )y i n i=1 u 1(s o,i ), and V = i=1 u 2(s o,i )(y i ȳ)(y i ȳ) n i=1 u, 2(s o,i ) where s o,i = (y i ȳ) Sn 1 (y i ȳ), and u 1 (s) and u 2 (s) are some appropriately chosen weight functions. Other simple alternatives to the sample covariance matrix can be obtained by applying only the scatter equation above to the sample of pairwise differences, i.e. to the symmetrized data set (8) Y s = {y i y j i, j = 1,..., n, i j},

5 INVARIANT COORDINATE SELECTION 5 for which the sample mean is zero. Even though the weighted mean and covariance matrix, as well as the symmetrized version of the weighted covariance matrix, may downweight outliers, they have unbounded influence functions and zero breakdown points. A more robust class of multivariate location and scatter statistics is given by the multivariate M -estimates, which can be viewed as adaptively weighted sample means and sample covariance matrices respectively. More specifically, they are defined as solutions to the M -estimating equations n i=1 µ = u n 1(s i )y i n i=1 u 1(s i ), and V = i=1 u 2(s i )(y i µ)(y i µ) (9) n i=1 u, 3(s i ) where s i = (y i µ) V 1 (y i µ), and u 1 (s), u 2 (s) and u 3 (s) are again some appropriately chosen weight functions. We refer the reader to [19] and [29] for the general theory regarding the multivariate M -estimates. The equations given in (9) are implicit equations in ( µ, V ) since the weights depend upon the Mahalanobis distances relative to ( µ, V ), i.e. on d i ( µ, V ) = s i. Nevertheless, relatively simple algorithms exist for computing the multivariate M -estimates. The maximum likelihood estimates of the parameters µ and Γ of an elliptical distribution for a given spread function g in (6) are special cases of M -estimates. From a robustness perspective, an often cited drawback to the multivariate M -estimates is their relatively low breakdown in higher dimension. Specifically, their breakdown point is bounded above by 1/(p + 1). Subsequently, numerous high breakdown point estimates have been proposed, such as the MVE, the MCD, the S-estimates, the projection based estimates, the τ-estimates, the CM -estimates and the MM -estimates, all of which are cited in the introduction. All the high breakdown point estimates are computationally intensive and, except for small data sets, are usually computed using approximate or probabilistic algorithms. The computational complexity of high breakdown point multivariate estimates is especially challenging for extremely large data sets in high dimensions, and this remains an open and active area of research. The definition of the weighted sample means and covariance matrices given by (7) can be readily generalized by using any initial affine equivariant location and scatter statistic, say µ o and V o respectively. That is, (10) n i=1 µ = u n 1(s o,i )y i n i=1 u 1(s o,i ), and V = i=1 u 2(s o,i )(y i µ o )(y i µ o ) n i=1 u, 2(s o,i ) where now s o,i = (y i µ o ) 1 V o (y i µ o ). In the univariate setting such weighted sample means and variances are sometimes referred to as one-step W -estimates [18, 31], and so we refer to their multivariate versions as multivariate one-step W -estimates. Given a location and a scatter statistic, a corresponding one-step W -estimate provides a computationally simple choice for an alternative location and scatter statistic. Any method one uses for obtaining location and scatter statistics for a data set Y can also be applied to its symmetrized version Y s to produce a scatter statistic. For symmetrized data, any affine equivariant location statistic is always zero.

6 6 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA The functional or population versions of the location and scatter statistics discussed in this section are readily obtained by replacing the empirical distribution of Y with the population distribution function F Y. For the M -estimates and the one-step W -estimates, this simply implies replacing the averages in (9) and (10) respectively with expected values. For symmetrized data, the functional versions are obtained by replacing the empirical distribution of Y s with its almost sure limit F s Y, the distribution function of Y s = Y 1 Y 2, where Y 1 and Y 2 are independent copies of Y. 3. Comparing Scatter Matrices. Comparing positive definite symmetric matrices arises naturally within a variety of multivariate statistical problems. Perhaps the most obvious case is when one wishes to compare the covariance structures of two or more different groups, see e.g. [16]. Other well known cases occur in multivariate analysis of variance, or MANOVA, wherein interest lies in comparing the within group and between group sum of squares and cross-products matrices, and in canonical correlation analysis, wherein interest lies in comparing the covariance matrix of one set of variables with the covariance matrix of its linear predictor based on another set of variables. These methods involve either multiple populations or two different sets of variables. Less attention has been given to the comparison of different estimates of scatter for a single set of variables from a single population. Some work in in this direction, though, can be found in [1, 4, 5, 6, 7, 41], which will be discussed in later sections. Typically, the difference between two positive definite symmetric matrices can be summarized by considering the eigenvalues and eigenvectors of one matrix with respect to the other. More specifically, suppose V 1 P p and V 2 P p. An eigenvalue, say ρ j, and a corresponding eigenvector, say h j, of V 2 relative to V 1 correspond to a nontrivial solution to the matrix equations (11) V 2 h j = ρ j V 1 h j. Equivalently, ρ j and h j are an eigenvalue and corresponding eigenvector respectively of V1 1 V 2. Since most readers are probably more familiar with the eigenvalue-eigenvector theory of symmetric matrices, we note that ρ j also represents an eigenvalue of the symmetric matrix M = V 1/2 1 V 2 V 1/2 1 P, where V 1/2 1 P p denotes the unique positive definite symmetric square root of V 1. Hence, we can choose p ordered eigenvalues, ρ 1 ρ 2... ρ p > 0, and an orthonormal set of eigenvectors q j, j = 1,..., p, such that Mq j = ρ j q j. The relationship between h j and the eigenvectors of M is given by q j V 1/2 1 h j, and so h i V 1h j = 0 for i j. This yields the following simultaneous diagonalization of V 1 and V 2, (12) H V 1 H = D 1 and H V 2 H = D 2 where H = [ h 1... h p ], D 1 and D 2 are diagonal matrices with positive entries and D 1 1 D 2 = = diagonal{ρ 1,..., ρ p }. Without loss of generality, one can take D 1 = I by normalizing h j so that h j V 1h j = 1. Alternatively, one can take D 2 = I. Such a normalization is not necessary for our

7 INVARIANT COORDINATE SELECTION 7 purposes and we simply prefer the general form (12) since it reflects the exchangeability between the roles of V 1 and V 2. Note that the matrix V 1 1 V 2 has the spectral value decomposition (13) V 1 1 V 2 = H H 1. Various useful interpretations of the eigenvalues and eigenvectors in (11) can be given whenever V 1 and V 2 are two different scatter matrices for the same population or sample. We first note that the eigenvalues ρ 1,..., ρ p are the maximal invariants under affine transformation for comparing V 1 and V 2. That is, if we define a function G(V 1, V 2 ) such that G(V 1, V 2 ) = G(AV 1 A, AV 2 A ) for any nonsingular A, then G(V 1, V 2 ) = G(D 1, D 2 ) = G(I, ), with D 1, D 2 and being defined as above. Furthermore is invariant under such transformations. Since scatter matrices tend to only be well defined up to a scalar multiple, it is more natural to be interested in the difference between V 1 and V 2 up to proportionality. In this case, if we consider a function G(V 1, V 2 ) such that G(V 1, V 2 ) = G(λ 1 AV 1 A, λ 2 AV 2 A ) for any nonsingular A and any λ 1 > 0 and λ 2 > 0, then G(V 1, V 2 ) = G(I, / det( ) 1/p ). That is, maximal invariants in this case are (ρ 1,..., ρ p )/( p i=1 ρ i) 1/p or, in other words, we are interested in (ρ 1,..., ρ p ) up to a common scalar multiplier. A more useful interpretation of the eigenvalues arises from the following optimality property, which follows readily from standard eigenvalue-eigenvector theory. For h R p, let (14) κ(h) = h V 2 h/h V 1 h. For V 1 = V 1 (F Y ) and V 2 = V 2 (F Y ), κ(h) represents the square of the ratio of two different measures of scale for the variable h Y. Recall that the classical measure of kurtosis corresponds to the fourth power of the ratio of two scale measures, namely the fourth root of the fourth central moment and the standard deviation. Thus, the value of κ(h) 2 can be viewed as a generalized measure of relative kurtosis. The term relative is used here since the scatter matrices V 1 and V 2 are not necessarily normalized. If both V 1 and V 2 are normalized so that they are both consistent for the variance-covariance matrix under a multivariate normal model, then a deviation of κ(h) from 1 would indicate non-normality. In general, though, the ratio κ(h 1 ) 2 /κ(h 2 ) 2 does not depend upon any particular normalization. The maximal possible value of κ(h) over h R p is ρ 1 with the maximum being achieved in the direction of h 1. Likewise, the minimal possible value of κ(h) is ρ p with the minimum being achieved in the direction of h p. More generally, we have (15) sup{κ(h) h R p, h V 1 h j = 0, j = 1,... m 1} = ρ m, with the supremum being obtained at h m, and (16) inf{κ(h) h R p, h V 1 h j = 0, j = m + 1,... p} = ρ m, with the infimum being obtained at h m. These successive optimality results suggest that plotting the data or distribution using the coordinates Z = H Y may reveal interesting structures. We explore this idea in later sections.

8 8 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA Remark 3.1. An alternative motivation for the transformation Z = H Y is as follows. Suppose Y is first standardized using a scatter functional V 1 (F ) satisfying (3), i.e. X = V 1 (F Y ) 1/2 Y. If Y is elliptically symmetric about µ Y, then X is spherically symmetric about the center µ X = V 1 (F Y ) 1/2 µ Y. If a second scatter functional is then applied to X, say V 2 (F ) satisfying (3), then V 2 (F X ) I, and hence no projection of X is any more interesting than any other projection of X. However, if Y is not elliptically symmetric, then V 2 (F X ) is not necessarily proportional to I. This suggests a principal components analysis of X based on V 2 (F X ) may reveal some interesting projections. By taking the spectral value decomposition V 2 (F X ) = QDQ, where Q is an orthogonal matrix, and then constructing the principal component variables Q X, one obtains (17) Q X = H Y = Z, with D =, whenever H is normalized so that H V 1 (F Y )H = I. 4. Invariant Coordinate Systems. In this and the following section we study the properties of the transformation Z = H Y in more detail, and in section 6 we give some examples illustrating the utility of the transformation when used in diagnostic plots. For simplicity, unless otherwise stated, we hereafter state any theoretical properties using the functional or population version of scatter matrices. The sample version then follows as a special case based on the empirical distributions. Examples are, of course, given for the sample version. The following condition is assumed throughout and the following notation is used hereafter. Condition 4.1. For Y R p having distribution F Y, let V 1 (F ) and V 2 (F ) be two scatter functionals satisfying (3). Further, suppose both V 1 (F ) and V 2 (F ) are uniquely defined at F Y. Definition 4.1. Let H(F ) = [h 1 (F )... h p (F )] be a matrix of eigenvectors defined as in (11) and (12), with ρ 1 (F )... ρ p (F ) being the corresponding eigenvalues, whenever V 1 and V 2 are taken to be V 1 (F ) and V 2 (F ) respectively. It is well known that principal component variables are invariant under translations and orthogonal transformations of the original variables, but not invariant under other general affine transformations. An important property of the transformation proposed here, i.e. Z = H(F Y ) Y, is that the resulting variables are invariant under any affine transformation. Theorem 4.1. In addition to Condition 4.1, suppose the roots ρ 1 (F Y ),..., ρ p (F Y ) are all distinct. Then for the affine transformation Y = AY + b, with A being nonsingular, (18) ρ j (F Y ) = γρ j (F Y ) for j = 1,..., p for some γ > 0. Moreover, the components of Z = H(F Y ) Y and Z = H(F Y ) Y differ at most by coordinatewise location and scale. That is, for some constants α 1,..., α p and β 1,..., β p, with

9 INVARIANT COORDINATE SELECTION 9 α j 0 for j = 1,..., p, (19) Z j = α j Z j + β j for j = 1,..., p. Due to property (19) we refer to the transformed variables Z = H(F Y ) Y as an invariant coordinate system, and the method for obtaining them as invariant coordinate selection (ICS). Note that if a univariate standardization is applied to the transformed variables, then the standardized versions of Z j and Zj differ only by a factor of ±1. A generalization of the previous theorem, which allows for possible multiple roots, can be stated as follows. Theorem 4.2. Let Y, Y, Z and Z be defined as in Theorem 4.1. In addition to Condition 4.1, suppose the roots ρ 1 (F Y ),..., ρ p (F Y ) consist of m distinct values, say ρ (1) >... > ρ (m), with ρ (k) having multiplicity p k for k = 1,..., m, and hence p p m = p. Then, (18) still holds. Furthermore, suppose we partition Z = (Z(1),..., Z (m) ), where Z (k) R p k. Then, for some nonsingular matrix C k of order p k and some p k -dimensional vector β k, (20) Z (k) = C kz (k) + β k for k = 1,..., m. That is, the space spanned by the components of Z(k) components of Z (k). is the same as the space spanned by the As with any eigenvalue/eigenvector problem, eigenvectors are not well defined. For a distinct root, the eigenvector is well defined up to a scalar multiple. For a multiple root, say with multiplicity p o, the corresponding p o eigenvectors can be chosen to be any linearly independent vectors spanning the corresponding p o dimensional eigenspace. Consequently Z (k) in Theorem 4.2 is not well defined. One could construct some arbitrary rule for defining Z (k) uniquely. However, this is not necessary here since no matter which rule one may use to define Z (k) uniquely, the results of Theorem 4.2 hold. 5. ICS Under Non-elliptical Models. When Y has an elliptically symmetric distribution, all the roots ρ 1 (F Y ),..., ρ p (F Y ) are equal, and so the ICS transformation Z = H(F Y ) Y is arbitrary. The aim of ICS though is to detect departures of Y from an elliptically symmetric distribution. In this section, the behavior of the ICS transformation is demonstrated theoretically for two classes of non-elliptically symmetric models, namely for mixtures of elliptical distributions and for independent components models Mixture of elliptical distributions. In practice, data often appear to arise from mixture distributions, with the mixing being the result of some unmeasured grouping variable. Uncovering the different groups is typically viewed as a problem in cluster analysis. One clustering method, proposed by Art, Gnanadeskian and Kettenring [1], is based on first reducing the dimension of the

10 10 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA clustering problem by attempting to identify Fisher s linear discriminant subspace. To do this, they give an iterative algorithm for approximating the within group sum of squares and cross-products matrix, say W n, and then consider the eigenvectors of Wn 1 (T n W n ), where T n is the total sum of squares and cross-products matrix. The approach proposed by Art et al. [1] is motivated primarily by heuristic arguments and is supported by a Monte Carlo study. Subsequently, Ruiz-Gazen [41] and Caussinus and Ruiz-Gazen [5, 6] show for a location mixture of multivariate normal distributions with equal variance-covariance matrices that Fisher s linear discriminant subspace can be consistently estimated even when the group identification is not known, provided that the dimension, say q, of the subspace is known. Their results are based on the eigenvectors associated with the q largest eigenvalues of S1,n 1 S n, where S n is the sample variance-covariance matrix and S 1,n is either the one-step W -estimate (7) or its symmetrized version. They also require that the S 1,n differs from S n by only a small perturbation, since their proof involves expanding the functional version of S 1,n about the functional version of S n. In this subsection, it is shown that these results can be extended essentially to any pair of scatter matrices, and also that the results hold under mixtures of elliptical distributions with proportional scatter parameters. For simplicity, we first consider properties of the ICS transformation for a mixture of two multivariate normal distributions with proportional covariance matrices. Considering proportional covariance matrices allows for the inclusion of a point mass contamination as one of the mixture components, since a point mass contamination is obtained by letting the proportionality constant go to zero. Theorem 5.1. In addition to Condition 4.1, suppose Y d (1 α) Normal p (µ 1, Γ) + α Normal p (µ 2, λ Γ), where 0 < α < 1, µ 1 µ 2, λ > 0 and Γ P p. Then either i) ρ 1 (F Y ) > ρ 2 (F Y ) =... = ρ p (F Y ), ii) ρ 1 (F Y ) =... = ρ p 1 (F Y ) > ρ p (F Y ), or iii) ρ 1 (F Y ) =... = ρ p (F Y ). For p > 2, if case (i) holds, then h 1 (F Y ) Γ 1 (µ 1 µ 2 ), and if case (ii) holds, then h p (F Y ) Γ 1 (µ 1 µ 2 ). For p = 2, if ρ 1 (F Y ) > ρ 2 (F Y ), then either h 1 (F Y ) or h 2 (F Y ) is proportional to Γ 1 (µ 1 µ 2 ) Thus, depending on whether case (i) or case (ii) holds, h 1 or h p respectively corresponds to Fisher s linear discriminant function, see e.g. [28], even though the group identity is unknown. An intuitive explanation as to why one might expect this to hold is that any estimate of scatter contains information on the between group variability, i.e. the difference between µ 1 and µ 2, and the within

11 INVARIANT COORDINATE SELECTION 11 group variability or shape, i.e. Γ. Thus, one might anticipate that one could separate these two sources of variability by using two different estimates of scatter. This intuition though is not used in our proof of Theorem 5.1, nor is our proof based on generalizing the perturbation arguments used by Ruiz-Gazen [41] and Caussinus and Ruiz-Gazen [6] in deriving their aforementioned results. Rather, the proof of Theorem 5.1 given in the appendix relies solely on invariance arguments. Whether case (i) or case (ii) holds in Theorem 5.1 depends on the choice of V 1 (F ) and V 2 (F ) and on the nature of the mixture. Obviously, if case (i) holds and then the roles of V 1 (F ) and V 2 (F ) are reversed, then case (ii) would hold. Case (iii) holds only in very specific situations. In particular, case (iii) holds if µ 1 = µ 2, in which case Y has an elliptically symmetric distribution. When µ 1 µ 2, i.e. when the mixture is not elliptical itself, it is still possible for case (iii) to hold. This though is dependent not only on the specific choice of V 1 (F ) and V 2 (F ), but also on the particular value of the parameters α, µ 1, µ 2, Γ and λ. For example, suppose V 1 (F ) = Σ(F ), the population covariance matrix, and V 2 (F ) = K(F ) where (21) K(F ) = E[(Y µ Y ) Σ(F ) 1 (Y µ Y ) (Y µ Y )(Y µ Y ) ], Beside being analytically tractable, the scatter functional K(F ) is one which arises in a classical algorithm for independent components analysis and is discussed in more detail in later sections. For the special case λ = 1 and when µ 1 µ 2, if we let η = α(1 α), then it can be shown that case (i) holds for η > 1/6, case (ii) holds for η < 1/6, and case (iii) holds for η = 1/6. Also, for any of these three cases, we have ρ 1 (F Y ) ρ p (F Y ) = η 1 6η θ 2 /(1 + ηθ) 2, where θ = (µ 1 µ 2 ) Γ 1 (µ 1 µ 2 ). Other examples have been studied in the aforementioned papers by Caussinus and Ruiz-Gazen [5, 6]. In their work, V 2 (F ) = Σ(F ) and V 1 (F ) corresponds to the functional version of the symmetrized version of the one-step W -estimate (7). Paraphrasing, they show for the case λ = 1 and for the class of weight functions u 2 (s) = u(βs) that case (i) holds for small enough β provided η < 1/6. They do not note, though, that case (i) or (ii) can hold for other values of β and η. The reason the condition η < 1/6 arises in their work, as well as in the discussion in the previous paragraph, is because their proof involves expanding u(βs) about u(s), with the matrix K(F ) then appearing in the linear term of the corresponding expansion of the one-step W -estimate about Σ(F ). Theorem 5.1 readily generalizes to a mixture of two elliptical distributions with equal shape matrices, but with possibly different location vectors and different spread functions. That is, if Y has density f Y (y) = (1 α)f(y; µ 1, Γ, g 1 ) + αf(y; µ 2, Γ, g 2 ), where 0 < α < 1, µ 1 µ 2 and f(y; µ, Γ, g) is defined by (6), then the results of Theorem 5.1 hold. Note that this mixture distribution includes the case where both mixture components are from the same elliptical family but with proportional shape matrices. This special case corresponds to setting g 2 (s) = g 1 (s/λ), and hence f(y; µ 2, Γ, g 2 ) = f(y; µ 2, λγ, g 1 ).

12 12 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA An extension of these results to a mixture of k elliptically symmetric distributions with possibly different centers and different spread functions, but with equal shape matrices, is given in the following theorem. Stated more heuristically, this theorem implies that Fisher s linear discriminant subspace, see e.g. [28], corresponds to the span of some subset of the invariant coordinates, even though the group identifications are not known. Theorem 5.2. In addition to Condition 4.1, suppose Y has density f Y (y) = det(γ) 1/2 k α j g j {(y µ j ) Γ 1 (y µ j )}, j=1 where α j > 0 for j = 1,..., k, α α k = 1, Γ P p, and g 1,..., g k are nonnegative functions. Also, suppose the centers µ 1,..., µ k span some q dimensional hyperplane, with 0 < q < p. Then, using the notation of Theorem 4.2 for multiple roots, there exists at least one root ρ (j), j = 1,..., m, with multiplicity greater than or equal to p q. Furthermore, if no root has multiplicity greater than p q, then there is a root with multiplicity p q, say ρ (t), such that (22) Span{ Γ 1 (µ j µ k ) j = 1,..., k 1 } = Span{H q (F Y )}, where H q (F Y ) = [ h 1 (F Y ),..., h p1+...+p t 1 (F Y ), h p1+...+p t+1 (F Y ),..., h p (F Y ) ]. The condition in the above theorem that only one root has multiplicity p q and no other root has a greater multiplicity reduces to case (i)-(ii) in Theorem 5.1 when k = 2. Analogous to the discussion given after Theorem 5.1, this condition generally holds except for special cases. For a given choice of V 1 (F Y ) and V 2 (F Y ), these special cases depend on the particular values of the parameters Independent components analysis models. Independent components analysis or ICA is a highly popular method within many applied areas which routinely encounter multivariate data. For a good overview, see [21]. The most common ICA model presumes that Y arises as a convolution of p independent components or variables. That is, Y = BX, where B is nonsingular, and the components of X, say X 1,..., X p, are independent. The main objective of ICA is to recover the mixing matrix B so that one can unmix Y to obtain independent components X = B 1 Y. Under this ICA model, there is some indeterminacy in the mixing matrix B, since the model can also be expressed as Y = B o X o, where B o = BQΛ and X o = Λ 1 Q X, Q being a permutation matrix and Λ a diagonal matrix with non-zero entries. The components of X o are then also independent. Under the condition that at most one of the independent components X 1,..., X p has a normal distribution, it is well known that this is the only indeterminacy for B, and consequently the independent components X = B 1 Y are well defined up to permutations and componentwise scaling factors. The relationship between ICS and ICA for symmetric distributions is given in the next theorem.

13 INVARIANT COORDINATE SELECTION 13 Theorem 5.3. In addition to Condition 4.1, suppose Y = BX+µ, where B is nonsingular, and the components of X, say X 1,..., X p, are mutually independent. Further, suppose X is symmetric about 0, i.e. X d X, and the roots ρ 1 (F Y ),..., ρ p (F Y ) are all distinct. Then, the transformed variable Z = H(F Y ) Y consists of independent components, or more specifically, Z and X differ by at most a permutation and/or componentwise location and scale. From the proof of Theorem 5.3, it can be noted that the condition that X be symmetrically distributed about 0 can be relaxed to require that only p 1 of the components of X be symmetrically distributed about 0. It is also worth noting that the condition that all the roots be distinct is more restrictive than the condition that at most one of the components of X is normal. This follows since it is straightforward to show in general that if the distributions of two components of X differ from each other by only a location shift and/or scale change, then there is at least one root having multiplicity greater than one. If X is not symmetric about 0, then one can symmetrize Y before applying the above theorem. That is, suppose Y = BX + µ with X having independent components, and let Y 1 and Y 2 be independent copies of Y. Then Y s = Y 1 Y 2 = BX s, where X s = X 1 X 2 is symmetric about zero and has independent components. Thus, Theorem 5.3 can be applied to Y s. Moreover, since the convolution matrix B is the same for both Y and Y s, it follows that the transformed variable Z = H(F s Y ) Y and X differ by at most a permutation and/or componentwise location and scale, where F s Y refers to the symmetrized distribution of F Y, i.e. the distribution of Y s. An alternative to symmetrizing Y is to choose both V 1 (F ) and V 2 (F ) so that they satisfy the following independence property. Definition 5.1. An affine equivariant scatter functional V (F ) is said to have the independence property if V (F X ) is a diagonal matrix whenever the components of X are mutually independent, provided V (F X ) exists. Assuming this property, Oja et al. [35] proposed using principal components on standardized variables as defined in Remark 3.1 to obtain a solution to the ICA problem. Their solution can be restated as follows. Theorem 5.4. In addition to Condition 4.1, suppose Y = BX + µ, where B is nonsingular, and the components of X, say X 1,..., X p, are mutually independent. Further, suppose both scatter functionals V 1 (F ) and V 2 (F ) satisfy the independence property given in Definition 5.1, and the roots ρ 1 (F Y ),..., ρ p (F Y ) are all distinct. Then, the transformed variable Z = H(F Y ) Y consists of independent components, or more specifically, Z and X differ by at most a permutation and/or componentwise location and scale. The covariance matrix Σ(F ) is of course well known to satisfy Definition 5.1. It is also straightforward to show that the scatter functional K(F ) defined in (21) does as well. Theorem 5.4 represents

14 14 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA a generalization of the an early ICA algorithm proposed by Cardoso [3] based on the spectral value decomposition of a kurtosis matrix. Cardoso s algorithm, which he calls the fourth-order blind identification (FOBI ) algorithm, can be shown to be equivalent to choosing V 1 (F ) = Σ(F ) and V 2 (F ) = K(F ) in the above theorem. It is worth noting that the independence property given by Definition 5.1 is weaker than the property (23) X i and X j are independent V (F X ) i,j = 0. The covariance matrix satisfies (23), whereas K(F ) does not. An often overlooked observation is that (23) does not hold for robust scatter functionals in general, i.e. independence does not necessarily imply a zero pseudo-correlation. It is an open problem as to what scatter functionals other the covariance matrix, if any, satisfy (23). Furthermore, robust scatter functionals tend not to satisfy in general even the weaker Definition 5.1. At symmetric distributions, though, the independence property can be shown to hold for general scatter matrices in the following sense. Theorem 5.5. Let V (F ) be a scatter functional satisfying (3). Suppose the distribution of X is symmetric about some center µ R p, with the components of X being mutually independent. If V (F X ) exists, then it is a diagonal matrix. Consequently, given a scatter functional V (F ), one can construct a new scatter functional satisfying Definition 5.1 by defining V s (F ) = V (F s ), where F s represents the symmetrized distribution of F. Using symmetrization to obtain scatter functionals which satisfy the independence property has been studied recently by Taskinen et al. [42]. Finally, we note that the results of this section can be generalized in two directions. First, we consider the case of multiple roots, and next we consider the case where only blocks of the components of X are independent. Theorem 5.6. In addition to Condition 4.1, suppose Y = BX + µ, where B is nonsingular, and the components of X, say X 1,..., X p, are mutually independent. Further, suppose either (i) X is symmetric about 0, i.e. X d X, or (ii) both V 1 (F ) and V 2 (F ) satisfy Definition 5.1. Then, using the notation of Theorem 4.2 for multiple roots, for the transformed variable Z = H(F Y ) Y the random vectors Z (1),..., Z (m) are mutually independent. Theorem 5.7. In addition to Condition 4.1, suppose Y = BX + µ, where B is nonsingular, and X = (X(1),..., X (m) ) has mutually independent components X (1) R p1,..., X (m) R pm, with p p m = p. Further, suppose X is symmetric about 0, and the roots ρ 1 (F Y ),..., ρ p (F Y ) are

15 INVARIANT COORDINATE SELECTION 15 all distinct. Then, there exists a partition {J 1,... J m } of {1,..., p} with the cardinality of J k being p k for k = 1,..., m such that for the transformed variable Z = H(F Y ) Y the random vectors Z (1) = {Z j, j J 1 },..., Z (m) = {Z j, j J m } are mutually independent. More specifically, Z (j) and X (j) are affine transformations of each other. From the proof of Theorem 5.7, it can be noted that the theorem still holds if one of the X (j) s is not symmetric. If the distribution of X is not symmetric, Theorems 5.6 and 5.7 can be applied to Y s, the symmetrized version of Y. To generalize Theorem 5.4 to the case where blocks of the components of X are independent, a modification of the independence property is needed. Such generalizations of Definition 5.1, Theorem 5.4 and Theorem 5.5 are fairly straightforward, and so are not treated formally here. Remark 5.1. The general case of multiple roots for the setting given in Theorem 5.7 is more problematic. The problem stems from the possibility that a multiple root may not be associated with a particular X (j) but rather with two or more different X (j) s. For example, consider the case X = (X(1), X (2)), with X (1) R 2 and X (2) R. For this case, V 1 (F X ) 1 V 2 (F X ) is block diagonal with diagonal blocks of order 2 and 1 respectively. The three eigenvalues ρ 1 (F Y ), ρ 2 (F Y ) and ρ 3 (F Y ) correspond to the two eigenvalues of the diagonal block of order 2 and to the last diagonal element, but not necessarily respectively. So, if ρ 1 (F Y ) = ρ 2 (F Y ) > ρ 3 (F Y ), this does not imply that the last diagonal element corresponds to ρ 3 (F Y ), and hence Z (1) R 2 and Z (2) R, as defined in Theorem 4.2, are not necessarily independent. 6. Discussion and Examples. Although the theoretical results of this paper essentially apply to any pair of scatter matrices, in practice the choice of scatter matrices can affect the resulting ICS method. From our experience, for some data sets, the choice of the scatter matrices does not seem to have a big impact on the diagnostic plots of the ICS variables, particularly when the data is consistent with one of the mixture models or one of the independent component models considered in section 5. For some other data sets, however, the resulting diagnostic plots can be quite sensitive to the choice of the scatter matrices. In general, different pairs of scatter matrices may reveal different types of structure in the data, since departures from an elliptical distribution can come in many forms. Consequently, it is doubtful if any specific pair of scatter matrices is best for all situations. Rather than choosing two scatter matrices beforehand, especially when one is in a purely exploratory situation having no idea of what to expect, it would be reasonable to consider a number of different pairs of scatter matrices and to consider the resulting ICS transformations as complementary. A general sense of how the choice of the pair of scatter matrices may impact the resulting ICS method can be obtained by a basic understanding of the properties of the scatter matrices being used. For the purpose of this discussion, we divide the scatter matrices into three broad

16 16 D.E. TYLER, F. CRITCHLEY, L. DÜMBGEN and H. OJA classes. Class I scatter statistics will refer to those which are not robust in the sense that their breakdown point is essentially zero. This class includes the sample covariance matrix, as well as the one-step W -estimates defined by (7) and their symmetrized version. Other scatter statistics which lie within this class are the multivariate sign and rank scatter matrices, see e.g. [46]. Class II scatter statistics will refer to those which are moderately robust in the sense that they have bounded influence functions as well as positive breakdown points, but with breakdown points being no greater than 1/(p + 1). This class primarily includes the multivariate M -estimates, but it also includes among others the sample covariance matrices obtained after applying either convex hull peeling or ellipsoid hull peeling to the data, see [13]. Class III scatter statistics will refer to the high breakdown point scatter matrices which are discussed in section 2.2. The symmetrized version of a class II or III scatter matrix, as well as the one-step W -estimates of scatter (10) which uses an initial class II or III scatter matrix for downweighting, are viewed respectively as class II or III scatter matrices themselves. If one or both scatter matrices are from class I, then the resulting ICS transformation may be heavily influenced by a few outliers at the expense of finding other structures in the data. In addition, even if there are no spurious outliers and a mixture model or an independent components model of the form discussed in section 5 hold, but with long tailed distributions, then the resulting sample ICS transformation may be an inefficient estimate of the corresponding population ICS transformation. Simulation studies reported in [32] have shown that for independent components analysis an improved performance is obtained by choosing robust scatter matrices for the ICS transformation. Nevertheless, since they are simple to compute, the use of class I scatter matrices can be useful if the data set is known not to contain any spurious outliers or if the objective of the diagnostics is to find such outliers, as recommended in [4]. If one uses class II or III scatter matrices, then one can still find spurious outliers by plotting the corresponding robust Mahalanobis distances. The resulting ICS transformation, though, would not be heavily affected by the spurious outliers. Outliers affect class II scatter matrices more so than class III scatter matrices, although even a high proportion of spurious outliers may not necessarily affect the class II scatter matrices. For outliers to heavily affect a class II scatter matrix, they usually need to lie in a cluster, see e.g. [15]. The results of section 5.1 though suggest that such clustered outliers can be identified after making an ICS transformation, even if they can not be identified using a robust Mahalanobis distance based on a class II statistic. Using two class III scatter matrices for an ICS transformation may not necessarily give good results, unless one is only interested in the structure of the inner 50% of the data. For example, suppose the data arises from a mixture of two multivariate normal distributions with widely separated means but equal covariance matrices. A class III scatter matrix is then primarily determined by the properties of the 60% component. Consequently, when using two class III scatter matrices for ICS the corresponding ICS roots will tend to be equal or nearly equal. 
In the case where all the roots are equal, Theorem 5.1 does not apply. In the case where the roots are nearly

17 INVARIANT COORDINATE SELECTION 17 (a) (b) (c) Distances using Cauchy M estimate Distances using W estimate Distances using W estimate Index Index Distances using Cauchy M estimate Fig. 1. Example 1: Mahalanobis distances based on (a) V 1, (b) V 2 and (c) V 1 versus V 2. equal, due to sampling variation, the sample ICS transformation may not satisfactorily uncover Fisher s linear discriminant function. A reasonable general choice for the pair of scatter matrices to use for an ICS transformation would be to use one class II and one class III scatter matrix. If one wishes to avoid the computational complexity involved with a class III scatter matrix, then using two class II scatter matrices may be adequate. In particular, one could choose a class II scatter matrix whose breakdown point is close to 1/(p + 1), such as the M -estimate corresponding to the maximum likelihood estimate for an elliptical Cauchy distribution [15], together with a corresponding one-step W -estimate for which ψ(s) = su 2 (s) 0 goes s. Such a one-step W -estimate of scatter has a redescending influence function. From our experience, the use of a class III scatter matrix for ICS does not seem to reveal any data structures that can not be obtained otherwise. The remarks and recommendations made here are highly conjectural. What pairs of scatter matrices are best at detecting specific types of departure from an elliptical distribution remains a broad open problem. In particular, it would be of interest to discover for what types of data structures would it be advantageous to use at least one class III scatter matrix in the ICS method. Most likely, some advantages may arise when working with very high dimensional data sets, in which case the computational intensity needed to compute a class III scatter matrix is greatly amplified, see e.g. [40]. We demonstrate some of the concepts in the following examples. These examples illustrate for several data sets the use of the ICS transformation for constructing diagnostic plots. They also serve as illustrations of the theory presented in the previous sections Example 1. Rousseeuw and van Driessen [40] analyze a data set consisting of n = 677 metal plates on which p = 9 characteristics are measured. For this data set they compute the sample mean and covariance matrix as well as the MCD estimate of center and scatter. Their paper helps illustrate the advantage of using high breakdown point multivariate estimates, or class III statistics, for uncovering multiple outliers in a data set.

Invariant co-ordinate selection

Invariant co-ordinate selection J. R. Statist. Soc. B (2009) 71, Part 3, pp. 549 592 Invariant co-ordinate selection David E. Tyler, Rutgers University, Piscataway, USA Frank Critchley, The Open University, Milton Keynes, UK Lutz Dümbgen

More information

Invariant coordinate selection for multivariate data analysis - the package ICS

Invariant coordinate selection for multivariate data analysis - the package ICS Invariant coordinate selection for multivariate data analysis - the package ICS Klaus Nordhausen 1 Hannu Oja 1 David E. Tyler 2 1 Tampere School of Public Health University of Tampere 2 Department of Statistics

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Scatter Matrices and Independent Component Analysis

Scatter Matrices and Independent Component Analysis AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 175 189 Scatter Matrices and Independent Component Analysis Hannu Oja 1, Seija Sirkiä 2, and Jan Eriksson 3 1 University of Tampere, Finland

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics
