The Complexity of Computing the MCD-Estimator


Thorsten Bernholt
Lehrstuhl Informatik 2, Universität Dortmund, Germany
thorsten.bernholt@uni-dortmund.de

Paul Fischer
IMM, Danish Technical University, Kongens Lyngby, Denmark
paf@imm.dtu.dk

The financial support of the Deutsche Forschungsgemeinschaft (SFB 475, "Reduction of complexity in multivariate data structures") is gratefully acknowledged.

Abstract

In modern statistics the robust estimation of parameters is a central problem, i.e., an estimation that is not or only slightly affected by outliers in the data. The Minimum Covariance Determinant estimator (MCD) [8] is probably one of the most important robust estimators of location and scatter. The complexity of computing the MCD, however, was unknown and generally thought to be exponential even if the dimensionality of the data is fixed. Here we present a polynomial time algorithm for MCD for fixed dimension of the data. In contrast we show that computing the MCD-estimator is NP-hard if the dimension varies.

Keywords: Computational statistics, efficient algorithms, NP-completeness, combinatorial geometry

1 Introduction

In modern mathematical statistics and data analysis one fundamental problem is that of constructing statistical methods which are robust against model deviations. For example, it is well known that the standard estimates of location
and scatter, sample mean and sample variance, are not robust. A single data point which is moved far out will change these quantities arbitrarily. In general one assumes that the observed data is mainly generated by some process or distribution which one would like to analyse. We shall call the part of the data coming from the distribution of interest the data from the true population. The rest of the data, however, might come from other sources or be altered by noise; we call this the outliers. The goal is to nevertheless estimate statistical quantities of the true population. This is clearly impossible if the majority of the data consists of outliers, thus we shall assume that the majority of the data comes from the true population.

One possible approach to tackle the problem of robust estimation is to find a sufficiently large subset of the data mainly consisting of elements of the true population and to base the estimation on this subset. Several authors follow this approach, e.g. [3, 8]. One of the most popular methods in this context is to select a subset with a minimum value of the covariance determinant (MCD estimator, [2, 4, 8]). Heuristic search algorithms for the MCD can be found in [5, 6, 9, 10, 11]. A comparison of the MCD, MVE and S-estimator is presented in [1].

More precisely, given $N$ observations, a subset of size $h$, for some $h > N/2$, is selected for which the determinant of the empirical covariance matrix is minimal over all subsets of size $h$. We shall now formally define MCD and then discuss some of its properties.

Let $X' = \{x_1, \dots, x_h\}$ be a set (see footnote 1) of points in $\mathbb{R}^d$ for some constant $d$. Let $x_i = (x_{i1}, \dots, x_{id})^T$. The (empirical) covariance matrix $C = C(X') = (c_{ab})_{1 \le a,b \le d}$ of $X'$ is the $(d \times d)$-matrix defined by

\[ c_{ab} = \frac{1}{h} \sum_{i=1}^{h} (x_{ia} - t_a)(x_{ib} - t_b), \quad \text{where} \quad t_j = \frac{1}{h} \sum_{i=1}^{h} x_{ij}, \]

or in matrix notation

\[ C(X') = \frac{1}{h} \sum_{i=1}^{h} x_i x_i^T - t\, t^T . \]

The covariance matrix is positive semidefinite; in our application the data will even guarantee positive definiteness. For a $d \times d$-matrix $M$, let $\det(M)$ denote its determinant. For the determinant of a covariance matrix $C$ we write $\det(X') = \det(C(X'))$ and we shall call it the covariance determinant.

Footnote 1: Strictly speaking we are considering multisets here, i.e., we allow multiple occurrences of the same element. We shall nevertheless use the term set as is the practice in statistics. One may as well think of weighted points, where the weight indicates the multiplicity.

Let us now define the problem:

Definition 1.1 (MCD) Let $d < N/2$. Let $X = \{x_1, \dots, x_N\}$ be a set of $N$ points in $\mathbb{R}^d$. Let $h$ be a natural number, $N/2 < h < N$. The minimum covariance determinant problem for $X$ and $h$, MCD for short, is the problem
to find an $h$-element set $X' = \{x_{i_1}, \dots, x_{i_h}\} \subset X$ such that $\det(X')$ is minimal over all $h$-element sets. For the decision version of MCD, MCDd, a positive real number $B$ is given in addition. The problem is to decide whether there exists an $h$-element set $X' = \{x_{i_1}, \dots, x_{i_h}\} \subset X$ such that $\det(X') \le B$.

Another robust estimator of location and shape is the Minimum Volume Ellipsoid, and our results can be easily adapted to this estimator:

Definition 1.2 (MVE) The Minimum Volume Ellipsoid is the problem to find a subset of size $h$ for which the enclosing ellipsoid has the minimal volume.

The empirical covariance matrix $C(X')$ with minimal determinant yields a robust estimator $S$ of scatter, $S = S(X') = c_0 \cdot C(X')$, where $c_0$ is a suitably chosen constant to achieve consistency. As an estimator for the location one uses the mean (or center of gravity)

\[ t = t(X') = \frac{1}{h} \sum_{x \in X'} x \]

of the points in the set $X'$. The pair $(t, S)$ is called the MCD-estimator with respect to $X$.

There is a nice geometric interpretation of the MCD. The inverse $C^{-1}(X')$ of the minimum covariance matrix $C(X')$ and the mean $t(X')$ define an ellipsoid in $\mathbb{R}^d$; for details see Section 2. This ellipsoid nicely matches the points $X'$, see Figure 1 for an example in two dimensions. The determinant is a measure of volume. Hence a small determinant corresponds to an ellipsoid of small volume. If the extensions of the ellipsoid in all dimensions are small then the set $X'$ is quite compact. Another way to get a small volume is that the ellipsoid is somewhat flat, i.e., it might have a large extension in some directions but only small ones in others. This indicates that the set $X'$ is essentially lower dimensional.

In this paper we address the complexity of computing the MCD-estimator. Obviously, computing $\det(X')$ for all $\binom{N}{h}$ subsets $X'$ of $X$ of size $h$ solves the problem, though it might take exponential time in $h$. It was not clear whether the estimator itself has this complexity independent of the dimensionality $d$ of the data. Here we show that the complexity of MCD is polynomial if the dimension is fixed. This is achieved by avoiding the consideration of all subsets of size $h$. Exploiting geometric properties of the estimator, we have been able to design an algorithm which enumerates a sequence of subsets of size $h$ of the input data set $X$ in polynomial time. We show that one of the sets enumerated has minimum covariance determinant. The running time of our algorithm is $O(N^{d^2})$.

On the other hand it is possible to show that the decision version of the MCD problem is NP-complete if the dimension varies. This is achieved by reducing CLIQUE to MCDd. The reduction combines combinatorial and algebraic methods in a clever way and is of its own interest. The main problem
in constructing a reduction is that one cannot control the entries of the covariance matrix directly but only through the data points. Moreover, changing a data point might alter all entries of the covariance matrix. The continuous nature of the MCD-estimator introduces further difficulties. The next section states some properties of the covariance determinant and the related ellipsoid which will be helpful in proving our results.

Figure 1: The figure shows the ellipsoid which corresponds to the covariance matrix with minimum determinant. The ellipsoid is plotted for two different radii. The points on the right hand side are outside the ellipsoid even for a very large radius.

2 MCD and Ellipses

Fix $d$ and let $v = d(d+3)/2$. A quadric $Q$ in $\mathbb{R}^d$ is a $(d-1)$-dimensional manifold determined by a second order expression which depends on $v + 1$ real parameters $a_0, a_1, \dots, a_d$ and $a_{ij}$ for $1 \le i \le j \le d$. Every point $z = (z_1, \dots, z_d)^T \in Q$ satisfies the condition

\[ a_0 + a_1 z_1 + \dots + a_d z_d + a_{11} z_1^2 + a_{12} z_1 z_2 + \dots + a_{d,d-1} z_d z_{d-1} + a_{dd} z_d^2 = 0. \tag{1} \]

Note that there are only $v$ degrees of freedom because equation (1) can be multiplied by any non-zero constant without changing the quadric. Equation (1) can be rewritten in matrix form as follows. Let the symmetric matrix $A$ and
the vector $b$ be defined by

\[ A := \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1d} \\ a_{12} & a_{22} & \cdots & a_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1d} & a_{2d} & \cdots & a_{dd} \end{pmatrix}, \qquad b := \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{pmatrix}. \]

Then Equation (1) is equivalent to

\[ z^T A z + z^T b + a_0 = 0. \tag{2} \]

We say that the quadric $Q$ selects a subset $X' \subseteq X$ if

\[ x^T A x + x^T b + a_0 \; \begin{cases} \le 0 & \forall x \in X' \\ > 0 & \forall x \in X \setminus X' \end{cases} \]

If the quadric surface is the surface of an ellipsoid then Equation (2) can be rewritten as

\[ (z - t)^T M (z - t) = r^2, \tag{3} \]

where $M$ is a positive definite $(d \times d)$-matrix, $t \in \mathbb{R}^d$ is the center point and $r \in \mathbb{R}$ is the radius. Given $M$, $t$ and $r > 0$ we denote by $E(M, t, r)$ the solid ellipsoid defined by Formula (3) with equality replaced by less than or equal.

Selection by quadrics in $d$ dimensions is equivalent to linear separation in $v$ dimensions. To this end consider the mapping $\hat{\ } : \mathbb{R}^d \to \mathbb{R}^v$ defined by

\[ \hat{z} = (z_1, \dots, z_d,\; z_1 z_1, \dots, z_i z_j, \dots, z_d z_d)^T, \quad 1 \le i \le j \le d. \]

For a set $Z \subseteq \mathbb{R}^d$ let $\hat{Z} := \{\hat{z} \mid z \in Z\}$. Now the parameters $a_i$, $a_{ij}$ in (1) define a hyperplane in $\mathbb{R}^v$ which separates the points of $\hat{X}'$ from those in $\hat{X} \setminus \hat{X}'$.

As mentioned in Section 1, the covariance matrix $C(X')$ of a point set $X'$ is positive definite. Its inverse $C^{-1}$ is also positive definite and the ellipsoid $E(C^{-1}, t(X'), r)$ is an ellipsoid which fits the point set $X'$ for a suitably chosen radius $r$. Throughout this paper we assume that the points of $X$ are in general quadric position, i.e., no hyperplane in $\mathbb{R}^v$ contains more than $v + 1$ points of $\hat{X}$. The following result of Rousseeuw [9] shows that the fit is even better for the set defining the minimum covariance determinant.

Lemma 2.1 Let $d < h < N$ and let $X \subset \mathbb{R}^d$ be a set of $N$ points. Let $X_{opt} \subset X$, $|X_{opt}| = h$, be such that $\det(X_{opt})$ is minimal for all subsets of $X$ of cardinality $h$. Let $C_{opt} = C(X_{opt})$ be the corresponding covariance matrix and $t_{opt} = t(X_{opt})$ be the center of gravity. Then there exists a radius $r > 0$ such that

\[ X_{opt} = X \cap E(C_{opt}^{-1}, t_{opt}, r), \]

that is, $E(C_{opt}^{-1}, t_{opt}, r)$ selects $X_{opt}$.
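
As a point of reference for Lemma 2.1, the exhaustive search mentioned in Section 1 can be sketched in a few lines. This is a minimal illustration, not the algorithm of this paper: the function names, the toy data set and the use of exact rational arithmetic are all assumptions made for the example. The Mahalanobis-type quantity $(x-t)^T C^{-1} (x-t)$ is evaluated through the identity $\det(C + uu^T) = \det(C)(1 + u^T C^{-1} u)$, so only a determinant routine is needed.

```python
from fractions import Fraction
from itertools import combinations

def mean_and_cov(points):
    """Center of gravity t and empirical covariance C, normalized by 1/h."""
    h, d = len(points), len(points[0])
    t = [Fraction(sum(p[a] for p in points), h) for a in range(d)]
    C = [[sum((p[a] - t[a]) * (p[b] - t[b]) for p in points) / h
          for b in range(d)] for a in range(d)]
    return t, C

def det(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    n, result = len(M), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # singular matrix
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            result = -result            # a row swap flips the sign
        result *= M[col][col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return result

def mahalanobis(x, C, t):
    """(x-t)^T C^{-1} (x-t), via det(C + u u^T) = det(C) (1 + u^T C^{-1} u)."""
    u = [xi - ti for xi, ti in zip(x, t)]
    d = len(u)
    shifted = [[C[a][b] + u[a] * u[b] for b in range(d)] for a in range(d)]
    return det(shifted) / det(C) - 1

def brute_force_mcd(points, h):
    """Try all h-subsets; exponential in general, only for tiny inputs."""
    return min(combinations(points, h), key=lambda S: det(mean_and_cov(S)[1]))

# Hypothetical toy data: five clustered points and two outliers, h = 5.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (10, 10), (12, 9)]
best = brute_force_mcd(points, h=5)
t, C = mean_and_cov(best)
inside = max(mahalanobis(x, C, t) for x in best)
outside = min(mahalanobis(x, C, t) for x in points if x not in best)
```

On this toy input the selected subset is the cluster, and every selected point is closer to the center (in the metric of $C^{-1}$) than every rejected point, which is the separation by an ellipsoid that the lemma states.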

Given a set $S \subseteq X$, $|S| = v$, then (by our assumption on the position of the points) there is a unique quadric $Q(S)$ through the points of $S$; we call it the quadric defined by $S$. It can be computed by writing an equation of the form (1) for every point $z \in S$ and solving the resulting system of linear equations for $a_0, a_1, \dots, a_{dd}$. As mentioned above, the value of one parameter, e.g., $a_0$, can be chosen arbitrarily. The next lemma shows that a set of points selectable by an ellipsoid is (almost) selectable by a quadric defined by a set of $v$ points.

Lemma 2.2 Given $X \subset \mathbb{R}^d$ in general quadric position and an ellipsoid $E$, let $X' = E \cap X$. Then there exists a set $S \subseteq X$, $|S| = v := d(d+3)/2$, such that for the quadric $Q(S)$ the following holds: Let $A$, $b$ and $a_0$ define $Q(S)$ as in (2), then

\[ \begin{aligned} x^T A x + x^T b + a_0 &\le 0 \quad \forall x \in X', \\ x^T A x + x^T b + a_0 &\ge 0 \quad \forall x \in X \setminus X', \\ x^T A x + x^T b + a_0 &= 0 \quad \text{for at most } v \text{ points } x \in X \setminus X'. \end{aligned} \]

Proof. Let an ellipsoid $E = E(M, t, r)$ be given which selects a set $X' \subseteq X$, i.e.,

\[ (x - t)^T M (x - t) \; \begin{cases} \le r^2 & \text{if } x \in X', \\ > r^2 & \text{if } x \in X \setminus X'. \end{cases} \]

Expanding the matrix equation into the form (1) with $a_0 = -r^2$ we arrive at

\[ a_1 x_1 + \dots + a_d x_d + a_{11} x_1^2 + a_{12} x_1 x_2 + \dots + a_{dd} x_d^2 \; \begin{cases} \le r^2 & \text{if } x \in X', \\ > r^2 & \text{if } x \in X \setminus X'. \end{cases} \tag{4} \]

The hyperplane $a_1 x_1 + \dots + a_d x_d + a_{11} x_1^2 + a_{12} x_1 x_2 + \dots + a_{dd} x_d^2 = r^2$ separates the points of $\hat{X}'$ and $\hat{X} \setminus \hat{X}'$ in $\mathbb{R}^v$. This hyperplane is now moved in such a way that it contains $v$ points of $\hat{X}$ but no point has passed through it. By our assumption made on the position of the points it follows that the hyperplane contains exactly $v$ points. This means that the inequality or strict inequality (4) becomes an equality for the points $\hat{x}$ on the hyperplane. Let $a'_i$ and $a'_{ij}$ denote the parameters of the resulting hyperplane. Clearly $a'_i$ and $a'_{ij}$ define a quadric in $\mathbb{R}^d$ but not necessarily an ellipsoid. Altogether there are at most $v$ new points $x \in X \setminus X'$ such that

\[ a'_1 x_1 + \dots + a'_d x_d + a'_{11} x_1^2 + a'_{12} x_1 x_2 + \dots + a'_{d,d-1} x_d x_{d-1} + a'_{dd} x_d^2 = -a'_0 . \]
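
The computation of $Q(S)$ described before Lemma 2.2 amounts to a $v \times v$ linear system in the lifted space. The following sketch illustrates this under explicit assumptions: the names `lift`, `solve`, `quadric_through` and `select` are hypothetical, $a_0$ is fixed to $-1$ (which presumes that the quadric does not pass through the origin), and the example points in the plane are chosen for illustration only.

```python
from fractions import Fraction

def lift(z):
    """Map z in R^d to (z_1..z_d, z_i*z_j for i <= j) in R^v, v = d(d+3)/2."""
    d = len(z)
    z = [Fraction(c) for c in z]
    return z + [z[i] * z[j] for i in range(d) for j in range(i, d)]

def solve(A, rhs):
    """Solve A y = rhs exactly by Gauss-Jordan elimination.
    Assumes A is nonsingular (raises StopIteration otherwise)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n] for row in M]

def quadric_through(S, a0=-1):
    """Coefficients (a_1..a_d, a_11, a_12, ..., a_dd) of the quadric through
    the v points of S, with the free parameter a_0 fixed (assumption)."""
    return solve([lift(z) for z in S], [-a0] * len(S))

def quadric_value(coeffs, x, a0=-1):
    """Left-hand side of equation (1) evaluated at x."""
    return a0 + sum(c * y for c, y in zip(coeffs, lift(x)))

def select(coeffs, points, a0=-1):
    """Points on the 'less than or equal' side of the quadric."""
    return [x for x in points if quadric_value(coeffs, x, a0) <= 0]

# d = 2, v = 5: five points on the unit circle determine x^2 + y^2 - 1 = 0.
S = [(1, 0), (0, 1), (-1, 0), (0, -1), (Fraction(3, 5), Fraction(4, 5))]
coeffs = quadric_through(S)
chosen = select(coeffs, [(0, 0), (2, 2), (1, 0)])
```

Here `coeffs` comes out as the coefficient vector of the unit circle, and `select` keeps the origin and the boundary point while rejecting $(2, 2)$. Fixing $a_0$ is a convenience of the sketch; a more careful version would solve the homogeneous system in all $v + 1$ parameters.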

3 An Efficient Algorithm for Fixed Dimension

Using the results from the previous section, we show how to list all subsets selectable by ellipsoids in polynomial time. Actually we shall list a polynomial collection of sets which contains all those selectable by ellipsoids.

There are in general infinitely many ellipsoids which select the same subset. Given $X' \subseteq X$ let $\mathcal{E}(X')$ denote the set of all ellipsoids selecting $X'$. Next we show how to select a representative from every $\mathcal{E}(X')$, $X' \subseteq X$. Let $E \in \mathcal{E}(X')$. According to Lemma 2.2, there is a hyperplane $H(E)$ in $\mathbb{R}^v$ such that for all points of $\hat{X}'$ the inequality for $H(E)$ is satisfied with less or equal, and there are at most $v$ points of $\hat{X} \setminus \hat{X}'$ satisfying it with equality.

The algorithm loops through all subsets $S$ of $X$ of cardinality $v$. For every $S$ it computes the hyperplane $H(S)$ defined by $\hat{S}$, which is possible in time $O(v^3)$. We then compute the set

\[ T = \{x \in X \mid \hat{x} \text{ satisfies (4) with strictly less}\}. \]

The time to compute $T$ is $O(N)$. Finally, for every $S' \subseteq S$ let $T' = S' \cup T$. The sets $T'$ enumerated as above contain all subsets of $X$ selectable by ellipsoids and possibly other sets. We are only interested in sets of size $h$. Let $T_1, \dots, T_k$ be the sequence of sets constructed as above with $|T_i| = h$. For each $T_i$ the covariance determinant is computed and the overall minimum is selected. By Lemma 2.1, any set defining an optimal covariance determinant is selectable by an ellipsoid; hence, an optimal set appears in the sequence.

The number of enumerated sets $T'$ is at most $\binom{N}{v} 2^v = O(N^v)$, where $v = d(d+3)/2$. For every set, time $O(N)$ is spent. The following theorem summarizes this result.

Theorem 3.1 For $N$ data points in fixed dimension $d$ the MCD-problem can be solved in polynomial time $O(N^{v+1})$, where $v = d(d+3)/2$.

4 The Hardness Result

In this section we show that the decision version MCDd of the Minimum Covariance Determinant problem is NP-complete if the dimension varies. To indicate this we shall use $n$ to denote the dimension in this section.

Definition 4.1 (Mahalanobis distance) Let $C$ be a positive definite $n \times n$-matrix and let $t$ be an $n$-vector. Then the Mahalanobis distance $md(x; C, t)$ of vector $x$ w.r.t. $C$ and $t$ is defined by

\[ md(x; C, t) := (x - t)^T C^{-1} (x - t). \]

The following lemma is a special case of Theorem 1 from [9]:

Lemma 4.2 (Exchange Lemma) Let $x$ and $y$ be points from $\mathbb{R}^n$ and let $X'$ be a set of points from $\mathbb{R}^n$ not containing $x$ or $y$. Let $C = C(X')$ be the covariance matrix of $X'$ and $t = t(X')$ be the center of gravity of the points in $X'$. If $md(x; C, t) > md(y; C, t)$ then $\det(X' \cup \{x\}) > \det(X' \cup \{y\})$.

Intuitively, exchanging a distant point with a closer one decreases the determinant. We show that the decision version of MCD is NP-complete by reducing the maximum clique problem $n/2$-clique to it. For the sake of completeness let us repeat the definition of the latter problem.

Definition 4.3 ($n/2$-clique) Let a graph $G$ with $n$ vertices and $m$ edges be given. The problem is to decide whether $G$ contains a complete subgraph on $n/2$ vertices, i.e., a subset of $n/2$ vertices in which all edges are present.

The parameters used in the following theorem and in the rest of the paper are summarized in Table 1 of the appendix. Now we are ready to state the main theorem:

Theorem 4.4 MCDd is NP-complete.

Proof. Let $G = (V, E)$ be a graph with vertex set $V$, $|V| = n$, and edge set $E$. For a reduction we construct an input for the MCD-estimator by mapping the vertices and edges of $G$ into points in $\mathbb{R}^n$ and by choosing the appropriate constant

\[ B = \frac{1}{h^n} (k + 2wz^2)^k (2wz^2)^{(n-k)}, \]

where $w = k^4/2$ and $z = k^{-2k}$. The dimension $n$ of the resulting input is the number of vertices of the graph. Let $v_i$, $i = 1, \dots, n$, be the vertices and let $e_{ij}$ denote the edge between $v_i$ and $v_j$. Let $\mathbf{e}_i$ denote the $i$-th unit vector in $\mathbb{R}^n$. Vectors and points in $\mathbb{R}^n$ are identified as usual. We use $k$ to denote $n/2$ in the following.

Three types of points in $\mathbb{R}^n$ are used in the reduction: vertex points (v-points), edge points (e-points), and auxiliary points (a-points). Let $X$ consist of the following points:

- For every vertex $v_i$ add the point $\mathbf{e}_i$ on the $i$-th coordinate axis.
- For every edge $e_{ij}$ add the point $\mathbf{e}_i + \mathbf{e}_j$ on the diagonal of the 2-dimensional subspace in dimensions $i$ and $j$.
- For every $i \in \{1, \dots, n\}$ add $k^4/2$ times the point $k^{-2k} \cdot \mathbf{e}_i$ and $k^4/2$ times the point $(-k^{-2k}) \cdot \mathbf{e}_i$. These points are on the $i$-th coordinate axis very close to the origin.

Altogether $X$ contains $N := n + m + nk^4$ points. MCD selects a subset $X' \subset X$ of cardinality $h < N$ such that $\det(X')$ is minimal. We set

\[ h := k + \binom{k}{2} + nk^4, \]
which is the number of a-points plus the number of edges and vertices of a $k$-clique.

The a-points serve two purposes. First, by choice of $h$, at least $k^4 - \binom{n}{2} - n \ge \frac{n^4}{16} - \binom{n}{2} - n$ copies of every a-point have to be selected. This ensures that the covariance determinant for $n \ge 4$ is not zero, because the a-points span $\mathbb{R}^n$. Second, their large number ensures that the center of any covariance ellipsoid defined by $h$ points is very close to the origin. The a-points are close to the origin and do not contribute much to the covariance determinant, respectively to the volume of the associated ellipsoid. We shall show that one has to select all of them for a minimum ellipsoid.

Still $\binom{k}{2} + k$ points are missing and we have to select them from the v- and e-points. These points are far away from the origin (and from the center of any ellipsoid defined by $h$ points), hence they contribute much to the covariance determinant. In order to keep the determinant small the vectors they represent should span a low-dimensional space.

If $G$ contains a $k$-clique then the set $X'$ can be completed by adding the points corresponding to the edges and vertices of a clique. We shall call this a clique configuration. The vectors of the v- and e-points of $X'$ span a space of only $k$ dimensions, which bounds their influence on the covariance determinant. Altogether the covariance determinant $\det(X')$ is small. In Section 4.1 we will show that $\det(X')$ is not larger than $B$.

If $G$ does not contain a $k$-clique, we will be forced to add v- and e-points to $X'$ such that the corresponding vectors span at least $k + 1$ dimensions. This results in a much larger value of $\det(X')$. All such configurations are called non-clique configurations. In order to lower bound the determinant we will construct in Section 4.3 an arrangement of points which cannot be realized by the reduction. It consists of all a-points, exactly $k + 1$ v-points and $h - (k + 1) - nk^4$ copies of the origin. We call this a minimal $(k + 1)$-configuration. We will show in Section 4.4 that this minimal $(k + 1)$-configuration has a smaller determinant than a non-clique configuration, but is still greater than $B$.

To show that the reduction works in polynomial time, we look at the number of points and the bit-length of the numbers. The number of points is bounded by $O(k^5)$. The numbers themselves are rational. $B$ is the largest number, and its numerator and denominator are bounded by $2k^{(4k+1)n}$. Therefore the bit-length is less than $O(k^2 \log k)$.

The proofs of the facts in the following sections have to cope with many technical problems. MCD is a continuous problem, not a combinatorial one. The main difficulty, however, is that we cannot control the entries of the covariance matrix directly. In general a point in $X'$ influences all entries in the covariance matrix $C(X')$ as well as the center of gravity.
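
The construction of the instance can be written down directly. The sketch below is only an illustration under stated assumptions: vertices are numbered from 0, $n$ is taken to be even with $k = n/2$ also even (so that $k^4/2$ is an integer number of copies), and the function name `build_instance` and the example graph are hypothetical.

```python
from fractions import Fraction
from math import comb

def build_instance(n, edges):
    """Point multiset X, subset size h and bound B for a graph on
    vertices 0..n-1 (sketch; assumes n even and k = n/2 even)."""
    k = n // 2
    w = k**4 // 2                     # copies of each a-point
    z = Fraction(1, k**(2 * k))       # z = k^(-2k)

    def axis(i, value):
        # Point with `value` in coordinate i and 0 elsewhere.
        return tuple(value if j == i else Fraction(0) for j in range(n))

    X = [axis(i, Fraction(1)) for i in range(n)]              # v-points
    X += [tuple(Fraction(1) if j in (u, v) else Fraction(0)   # e-points
                for j in range(n)) for u, v in edges]
    for i in range(n):                                         # a-points
        X += [axis(i, z)] * w + [axis(i, -z)] * w

    h = k + comb(k, 2) + n * k**4
    c = 2 * w * z**2                  # the quantity 2wz^2
    B = (k + c)**k * c**(n - k) / Fraction(h)**n
    return X, h, B

# Hypothetical example: 4 vertices, 4 edges, so k = 2, w = 8, z = 1/16.
X, h, B = build_instance(4, [(0, 1), (1, 2), (2, 3), (0, 2)])
```

For this small graph the instance has $N = 4 + 4 + 4 \cdot 16 = 72$ points and $h = 2 + 1 + 64 = 67$, which makes the dominance of the a-points in every selected subset visible.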

4.1 An Upper Bound on the Determinant of the Clique Configuration

Let $X'$ constitute a clique configuration. The center of gravity $t$ of $X'$ is

\[ t = \frac{1}{h} (k, \dots, k, 0, \dots, 0)^T, \]

where the transition of the entries from $k$ to $0$ occurs after position $k$. The covariance matrix $C$ of the clique configuration has the following form:

\[ C = \frac{1}{h} \begin{pmatrix} a & b & \cdots & b & 0 & \cdots & 0 \\ b & a & \ddots & \vdots & \vdots & & \vdots \\ \vdots & \ddots & \ddots & b & \vdots & & \vdots \\ b & \cdots & b & a & 0 & \cdots & 0 \\ 0 & \cdots & \cdots & 0 & c & & \\ \vdots & & & \vdots & & \ddots & \\ 0 & \cdots & \cdots & 0 & & & c \end{pmatrix}, \]

where the upper left submatrix is $k \times k$ and

\[ a = k + 2wz^2 - k^2/h, \qquad b = 1 - k^2/h, \qquad c = 2wz^2. \]

By Hadamard's determinant inequality, see e.g. Horn and Johnson [7], the determinant of a positive definite matrix is bounded by the product of its diagonal elements. Hence we arrive at

\[ h^n \det(C) \le a^k c^{n-k} = \left(k + 2wz^2 - \frac{k^2}{h}\right)^k (2wz^2)^{(n-k)} \le (k + 2wz^2)^k (2wz^2)^{(n-k)} = h^n B. \]

4.2 All a-points Have to be Selected

Let $X' \subset X$ be a set of $h$ points. We want to show that for any choice of $X'$ every a-point is closer to the center of gravity of the set $X'$ than any v- or e-point. Closeness here is measured with respect to the Mahalanobis distance. More precisely:

Lemma 4.5 Let $a$ be an arbitrary a-point and let $x$ be an arbitrary v- or e-point, $a, x \in X$. Then the following relation holds for any set $X' \subset X$ with $|X'| = h$:

\[ md(a; C(X'), t(X')) < md(x; C(X'), t(X')). \]

Proof. We try to construct $X'$ in such a way that the difference $md(x; C(X'), t(X')) - md(a; C(X'), t(X'))$ is minimized and show that it is always larger than 0. It then follows from Lemma 4.2 that a set $X'$ defining a minimal covariance determinant has to contain all a-points. In order to maximize the Mahalanobis distance of an a-point and simultaneously minimize that of a v-point w.r.t. $C$ and $t$, we even allow configurations of points which are not realizable by our reduction. We allow that a v- or e-point is chosen multiple times, in order to pull the ellipsoid towards it.

The analysis distinguishes several cases. We describe the analysis of one case in detail, namely that of maximizing the Mahalanobis distance of an a-point and simultaneously minimizing that of a v-point in a different dimension. The arguments for the other cases follow the same line (see footnote 2).

The influence of a point on the covariance matrix, and hence on the Mahalanobis distance, is maximal in the direction from the point to the center of gravity and is minimal in orthogonal directions. We thus consider a two-dimensional sub-scenario of a $d$-dimensional one. Let us, w.l.o.g., devote dimension 1 to minimizing the distance to a v-point and dimension 2 to maximizing that of an a-point.

As we have moved all degrees of freedom into dimensions 1 and 2, the a-points in dimensions $3, 4, \dots, n$ are symmetrically placed. These a-points do not influence the upper left $(2 \times 2)$-submatrix of the covariance matrix; they do however affect the center of gravity. As they are symmetrical with respect to the origin, they sum up to the origin. There are at most $n + n(n-1)/2$ v- and e-points.

In the case we are looking at now, we allow $u$ copies of the v-point $v_1 := (1, 0, 0, \dots, 0)^T$ and $w - u$ copies of the a-point $(0, z, 0, \dots, 0)^T$. We then compute the (2-dimensional) covariance matrix $C$ and the center of gravity $t$ of this arrangement of points:

\[ t = (t_1, t_2)^T = \frac{1}{h} \big(u,\; (w - u)z + w(-z)\big)^T = \frac{1}{h} (u, -zu)^T, \]

\[ C = \begin{pmatrix} a & b \\ b & c \end{pmatrix} = \frac{1}{h} \begin{pmatrix} u + 2wz^2 - \dfrac{u^2}{h} & \dfrac{u^2 z}{h} \\[2mm] \dfrac{u^2 z}{h} & (2w - u) z^2 - \dfrac{u^2 z^2}{h} \end{pmatrix}. \]

Let $a_2 := (0, z, 0, \dots, 0)^T$ be the a-point in dimension 2. Let $d(v_1) = md(v_1; C, t)$ be the Mahalanobis distance of $v_1$ w.r.t. $C$ and $t$, and let $d(a_2) = md(a_2; C, t)$ be the corresponding value for $a_2$. We consider the inequality $d(v_1) - d(a_2) > 0$. Multiplying with $\det(C)$ we find

\[ (2 t_2 z - z^2)\, a + (2 t_2 - 2 t_1 z)\, b + (1 - 2 t_1)\, c > 0. \]

Footnote 2: Maple worksheets for the cases not treated here can be found on the following website: http://ls2-www.cs.uni-dortmund.de/~bernholt/mcd/index.html

Substituting $a$, $b$, $c$, $t_1$, $t_2$ and multiplying with $h/z^2$ yields

\[ -\left(2 + \frac{1 + z^2}{k}\right) u + (1 - z^2)\, k^4 > 0. \]

In order to minimize the left-hand side, one has to choose $u$ as large as possible, i.e., $u = 2k + 2k(2k - 1)/2$:

\[ k^4 - 4k^2 + (-k^4 - 2k - 1)\, z^2 - 4k - 1 > 0. \]

The term $k^4$ is the dominant one, and the left-hand side increases with $k$; thus the inequality is true for $k \ge 3$.

The other cases that one has to consider include distributing the $u$ missing a-points in other possible ways and the consideration of two v-points and mixtures of e- and v-points. The arguments are along the same line and always establish that the corresponding difference of the Mahalanobis distances is larger than 0 for $k \ge 3$. Altogether it follows that the Mahalanobis distance of a v- or e-point is always larger than that of any a-point.

The following lemma is an immediate consequence of Lemma 4.5 and the fact that the origin is contained in the convex hull of the a-points.

Lemma 4.6 Let $0$ be the origin and let $x$ be an arbitrary v- or e-point, $x \in X$. Then the following inequality holds for any set $X' \subset X$ with $|X'| = h$:

\[ md(0; C(X'), t(X')) < md(x; C(X'), t(X')). \]

4.3 Constructing a Minimal Configuration

Assume that the graph $G$ of our clique problem does not contain a $k$-clique. Let $X$ be the set of points of the corresponding MCDd problem and let $X' \subset X$ be any set of $h$ points. We now show that $\det(X')$ is at least as large as the determinant of the minimal $(k + 1)$-configuration. To this end we show how $X'$ can be transformed into the minimal $(k + 1)$-configuration without increasing the determinant.

Lemma 4.7 Let $G$ be a graph on $n$ vertices without a $k$-clique, $k = n/2$. Let $X$ be the set of points of the corresponding MCDd problem. Let $X' \subset X$ be any set of $h$ points. Let $D$ be the determinant of a minimal $(k + 1)$-configuration. Then $\det(X') \ge D$.

Proof. For the proof let us introduce some notation. Given a set $Y$ consisting of e- and v-points we say that it spans $k$ dimensions if the subspace spanned by the corresponding vectors is $k$-dimensional. We say that $Y$ touches $k$ dimensions if there are at least $k$ positions in which some member of $Y$ has a 1-entry. For example the set $\{(1, 1, 0, 0, 0), (0, 0, 1, 1, 0)\}$ spans 2 dimensions
and touches 4 dimensions. Adding the vector $(0, 1, 1, 0, 0)$ does not increase the number of dimensions touched but increases the dimension of the span to 3.

From Section 4.2 we know that any $h$-element set with minimal covariance determinant has to contain all a-points. Consequently it has to contain exactly $t := \frac{(k-1)k}{2} + k$ e- or v-points $y_1, \dots, y_t$. Let $Y = \{y_1, \dots, y_t\}$ be the set of these points. As $G$ does not contain a $k$-clique, the vectors in $Y$ span a space of at least $k + 1$ dimensions.

If there are $k + 1$ v-points in $Y$ then we can achieve the minimal $(k + 1)$-configuration directly by moving all but $k + 1$ v-points to the origin. By Lemma 4.6 and the Exchange Lemma 4.2, the covariance determinant of the resulting configuration is less than or equal to that of the original configuration.

Otherwise we have to replace some e-points by v-points in addition, without increasing the determinant. In order to control the change of the determinant during the replacement, one has to carefully select which e- and v-points to keep and which to move into the origin. Therefore, the location of the points in $Y$ relative to each other is important. We represent this structure as an undirected graph $H = H(Y)$. For every v-point $v_i \in Y$ there is a vertex $i$ in $H$. For every e-point $e_{ij} \in Y$ the vertices $i$ and $j$ are in $H$, as is the edge $\{i, j\}$. In order to distinguish between vertices which are solely introduced by e-points and those for which the corresponding v-point is in $Y$, we call a vertex $i$ of $H$ marked if $v_i \in Y$.

The resulting graph $H$ is isomorphic to a subgraph of the original graph $G$, but has two types of vertices, marked and unmarked ones. The marked vertices correspond to v-points really present in $Y$ while the unmarked vertices of $H$ do not have a corresponding v-point in $Y$. They are merely induced by an e-point in $Y$.

We now show how a set $Y' \subseteq Y$ can be constructed such that the resulting graph $H' = H(Y')$ is cycle-free and that $Y'$ spans $k + 1$ dimensions.

Definition 4.8 Let $B$ be a tree with marked vertices as described above. If $B$ has $m$ edges, the value $w(B)$ of $B$ is defined by

\[ w(B) = \begin{cases} m + 1 & \text{if at least one vertex is marked,} \\ m & \text{otherwise.} \end{cases} \]

Note that if the tree $B$ is defined by a set $Y$ of v- and e-points, i.e. $B = H(Y)$, and $w(B) = s$, then $Y$ spans $s$ dimensions, but might touch more. In contrast, a cyclic graph defined by an even number $s$ of e-points, e.g. $e_{12}, e_{23}, \dots, e_{s1}$, only spans $s - 1$ dimensions.

Claim 4.9 If a graph $H$ has at least $\binom{k}{2}$ edges and does not contain a $k$-clique then there is an $r > 0$ and there are vertex-disjoint trees $B_1, \dots, B_r$ in $G$ with $\sum_{i=1}^{r} w(B_i) \ge k$.

Proof. Assume that for all choices of $r$ and vertex-disjoint trees $B_1, \dots, B_r$ the inequality $\sum_{i=1}^{r} w(B_i) \le k - 1$ holds true. Then there are at most $k - 1 + r$ vertices in the trees $B_1, \dots, B_r$.

Let $B_1, \dots, B_r$ be any $r$ trees such that all edges of $H$ lie within these trees and such that all vertices are covered. It is allowed that some $B_i$ consist of isolated vertices only. Let $k_i$ be the number of vertices in tree $B_i$. We want to establish an upper bound for the number of edges in the connected components induced by the trees. To this end let $B'_1, \dots, B'_r$ be the vertex-disjoint graphs induced by the vertices of the trees $B_1, \dots, B_r$. A graph $B'_i$ has $k_i$ vertices and at most $\binom{k_i}{2}$ edges. The number of edges in the graphs $B'_1, \dots, B'_r$ is at most

\[ Z = \binom{k_1}{2} + \dots + \binom{k_r}{2} \]

and the $B'_i$ then contain $\sum_{i=1}^{r} k_i = k - 1 + r$ vertices.

If $r = 1$, the graph $B'_1$ is identical to the graph $H$. Moreover, $w(B_1) \le k - 1$, hence $B_1$ (and $H$) has at most $k$ vertices. As $H$ has at least $\binom{k}{2}$ edges, it contains a $k$-clique, contrary to our assumption. Otherwise, if $r \ge 2$, the edges are distributed over two or more graphs and due to the convexity of the function $Z$ there are fewer edges in the graph $G$ than $\binom{k}{2}$. So this leads to a contradiction. Thus there must be vertex-disjoint trees $B_1, \dots, B_r$ with $\sum_{i=1}^{r} w(B_i) \ge k$.

Claim 4.10 Let $H$ be a graph with $M$ marked vertices and $\binom{k}{2} + k - M$ edges, and let $H$ contain no $k$-clique. Then there are vertex-disjoint trees $B_1, \dots, B_r$ in $H$ with $\sum_{i=1}^{r} w(B_i) \ge k + 1$.

Proof. We prove this claim by constructing trees with the desired property:

Case 1: $M \ge k + 1$. There are $k + 1$ trees each consisting of a single marked vertex.

Case 2: $1 \le M \le k$. The graph has at least $\binom{k}{2}$ edges. Claim 4.9 shows that there exist some $r$ and trees $B_1, \dots, B_r$ such that $\sum_{i=1}^{r} w(B_i) \ge k$. Take one marked vertex additionally.

Case 3: $M = 0$. The graph has at least $\binom{k}{2} + k = \binom{k+1}{2}$ edges. By Claim 4.9 it follows that there exist some $r$ and trees $B_1, \dots, B_r$ such that $\sum_{i=1}^{r} w(B_i) \ge k + 1$.

For the construction of the set $Y'$ apply Claim 4.10 to the graph $H(Y)$. Let $B_1, \dots, B_r$ be the resulting trees. Now move all e-points in $Y$ corresponding to edges that are not present in some $B_i$ into the origin. By Lemma 4.6 the covariance determinant is only decreased by this operation.

In the following, the e-points in $Y'$ will be replaced by suitably chosen v-points. We then end up with $Y'$ consisting of at least $k + 1$ v-points and no e-point.

Let us consider a single tree $B_i$ and the corresponding v- and e-points. The formula below shows a $(3 \times 3)$-submatrix of corresponding rows and columns of the covariance matrix. For technical reasons the covariance matrix is split
into the sum of the pure covariance part and the offset resulting from the fact that the center of gravity is not the origin. The first row is a prototype of a dimension which is touched by exactly $q$ v- or e-points. The second row is a dimension which is solely touched by a single e-point that also touches the dimension of the third row. The third row represents a dimension which is either unmarked and is touched by $p + 1$ e-points, $p \ge 2$, or which is marked and touched by $p$ e-points, $p \ge 2$.

\[ \frac{1}{h} \begin{pmatrix} q & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 + p \end{pmatrix} - \frac{1}{h^2} \begin{pmatrix} q^2 & q & (1+p)q \\ q & 1 & 1 + p \\ (1+p)q & 1 + p & (1+p)^2 \end{pmatrix} \qquad \begin{array}{l} \text{any other row} \\ \text{a leaf of the tree} \\ \text{a node with } p + 1 \text{ v/e-points} \end{array} \]

We track the effect on the matrix of replacing an e-point touching the dimensions of rows 2 and 3 by a v-point. The replacement is reflected by the following operations on the rows and columns:

column 3 := column 3 - column 2
row 3 := row 3 - row 2

The resulting matrix is

\[ \frac{1}{h} \begin{pmatrix} q & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & p \end{pmatrix} - \frac{1}{h^2} \begin{pmatrix} q^2 & q & pq \\ q & 1 & p \\ pq & p & p^2 \end{pmatrix}. \]

The determinant of the matrix has not been changed in this process. Thus after successively applying this process to all edges of the tree $B_i$, all points corresponding to the tree are replaced by isolated vertices. Claim 4.10 ensures that there are at least $k + 1$ vertices. If we move the superfluous vertices into the origin, we obtain a minimal $(k + 1)$-configuration and Lemma 4.7 has been proved.

4.4 A Lower Bound on the Determinant of a Minimal (k + 1)-Configuration

In this section we compute the covariance determinant of the minimal $(k + 1)$-configuration as constructed in the previous section. The center of gravity $t$ of the minimal $(k + 1)$-configuration is

\[ t = \frac{1}{h} (1, \dots, 1, 0, \dots, 0)^T, \]

where the transition of the entries from 1 to 0 occurs after position $k + 1$. The covariance matrix $C_m$ of the minimal $(k + 1)$-configuration has the following
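form:

\[ C_m = \frac{1}{h} \begin{pmatrix} a & b & \cdots & b & 0 & \cdots & 0 \\ b & a & \ddots & \vdots & \vdots & & \vdots \\ \vdots & \ddots & \ddots & b & \vdots & & \vdots \\ b & \cdots & b & a & 0 & \cdots & 0 \\ 0 & \cdots & \cdots & 0 & c & & \\ \vdots & & & \vdots & & \ddots & \\ 0 & \cdots & \cdots & 0 & & & c \end{pmatrix}, \]

where the upper left submatrix is $(k + 1) \times (k + 1)$ and

\[ a = 1 + 2wz^2 - 1/h, \qquad b = -1/h, \qquad c = 2wz^2. \]

According to Geršgorin's Disc Theorem, see e.g. [7], all eigenvalues of a matrix $M = [m_{ij}]$ are located in the union of the discs

\[ |m_{ii} - \gamma| \le \sum_{j=1,\, j \ne i}^{n} |m_{ij}| \quad \text{for } \gamma \in \mathbb{C}, \]

such that the determinant is lower bounded for $k \ge 2$ as follows:

\[ \begin{aligned} h^n \det(C_m) &\ge (a - k|b|)^{k+1} c^{\,n-k-1} \\ &= \left(1 + 2wz^2 - \frac{k+1}{h}\right)^{k+1} (2wz^2)^{n-k-1} \\ &> \left(\frac{9}{10}\right)^{k+1} (2wz^2)^{n-k-1} \\ &> (k + 2wz^2)^k (2wz^2) (2wz^2)^{n-k-1} = h^n B. \end{aligned} \]

We used the following relations:

\[ 1 + 2wz^2 - (k + 1)/h > 9/10 \quad \text{for } k \ge 2, \]
\[ (9/10)^{k+1} > (k + 2wz^2)^k (2wz^2) \quad \text{for } k \ge 2. \]

And that completes the proof of the NP-completeness of MCD.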
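
The two determinant bounds of Sections 4.1 and 4.4 can be checked on a tiny instance with exact arithmetic. The sketch below is an illustration only, under assumptions that are not part of the proof: it uses $n = 4$ and $k = 2$ (so that $w = k^4/2 = 8$ is an integer), builds both configurations as explicit point lists, and uses hypothetical helper names. It covers only the comparison of the determinants with $B$, not Lemma 4.5, whose argument requires $k \ge 3$.

```python
from fractions import Fraction
from math import comb

def det(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    n, result = len(M), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            result = -result
        result *= M[col][col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return result

def cov(points):
    """Empirical covariance matrix with normalization 1/h."""
    h, d = len(points), len(points[0])
    t = [Fraction(sum(p[a] for p in points), h) for a in range(d)]
    return [[sum((p[a] - t[a]) * (p[b] - t[b]) for p in points) / h
             for b in range(d)] for a in range(d)]

def axis(n, i, value):
    """Point with `value` in coordinate i and 0 elsewhere."""
    return tuple(value if j == i else 0 for j in range(n))

n, k = 4, 2                                   # assumed toy parameters
w, z = k**4 // 2, Fraction(1, k**(2 * k))
h = k + comb(k, 2) + n * k**4
a_points = [axis(n, i, s * z) for i in range(n) for s in (1, -1)
            for _ in range(w)]

# Clique configuration: all a-points plus the v- and e-points of a k-clique.
clique = (a_points + [axis(n, i, 1) for i in range(k)]
          + [tuple(1 if j in (p, q) else 0 for j in range(n))
             for p in range(k) for q in range(p + 1, k)])

# Minimal (k+1)-configuration: a-points, k+1 v-points, copies of the origin.
minimal = (a_points + [axis(n, i, 1) for i in range(k + 1)]
           + [(0,) * n] * (h - (k + 1) - n * k**4))

c = 2 * w * z**2
B = (k + c)**k * c**(n - k) / Fraction(h)**n
det_clique, det_minimal = det(cov(clique)), det(cov(minimal))
```

With these parameters both configurations contain exactly $h = 67$ points, and the computed values satisfy `det_clique <= B < det_minimal`, in line with the chain of inequalities above.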

5 Summary

We have presented a polynomial-time algorithm for the minimum covariance determinant problem for fixed dimensions of the data. On the other hand we have shown that the problem is NP-hard for varying dimension. The running time of our algorithm on $N$ $d$-dimensional data points is $O(N^{d(d+3)/2})$. The hardness result suggests that any uniform algorithm for the MCD problem has a running time where $d$ appears more than poly-logarithmic in the exponent. It is, however, possible that algorithms exist which have a running time of $N^{O(d)}$.

Let us also remark that the algorithm can be easily adapted for the Minimum Volume Ellipsoid problem and that our result implies that this problem is NP-complete for varying dimension as well.

Acknowledgement

We would like to thank Claudia Becker, Thomas Fender and Ursula Gather for introducing us to the questions of robust statistics in general and in particular to the MCD-problem. We are also grateful for suggestions concerning the presentation of the statistical results. We thank Thomas Hofmeister for pointing out a simpler proof of Claim 4.9.

A Appendix

Symbol   Meaning
n        number of vertices of graph G
m        number of edges of graph G
N        number of points of the MCD-problem (here $N = n + m + nk^4$)
k        clique size (here $k = n/2$)
h        selection size (here $h = k + \binom{k}{2} + nk^4$)
z        distance of an a-point from the origin (here $z = k^{-2k}$)
w        weight of an a-point (here $w = k^4/2$)
B        bound for MCDd (here $B = \frac{1}{h^n} (k + 2wz^2)^k (2wz^2)^{(n-k)}$)
X        the set of points in $\mathbb{R}^n$ constructed in the reduction
X'       the subset of $h$ points for which the covariance determinant is computed

Table 1: The table summarizes the parameters used in the paper.

References

[1] C. Becker and U. Gather. The largest nonidentifiable outlier: A comparison of multivariate simultaneous outlier identification rules. Computational Statistics and Data Analysis, 36, 2000.
[2] R. W. Butler, P. L. Davies, and M. Jhun. Asymptotics for the minimum covariance determinant estimator. Annals of Statistics, 21, 1993.
[3] P. L. Davies. Asymptotic behaviour of S-estimates of multivariate location parameters and dispersion matrices. Annals of Statistics, 15, 1987.
[4] R. Grübel. A minimal characterization of the covariance matrix. Metrika, 35:49–52, 1988.
[5] D. M. Hawkins. A feasible solution algorithm for the minimum covariance determinant estimator in multivariate data. Computational Statistics and Data Analysis, 17, 1994.
[6] D. M. Hawkins and D. J. Olive. Improved feasible solution algorithms for high breakdown estimation. Computational Statistics and Data Analysis, 30:1–11, 1999.
[7] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[8] P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79, 1984.
[9] P. J. Rousseeuw and K. Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 1999.
[10] D. Woodruff and D. Rocke. Computable robust estimation of multivariate location and shape in high dimension using compound estimators. Journal of the American Statistical Association, 89, 1994.
[11] D. Woodruff and D. Rocke. Identification of outliers in multivariate data. Journal of the American Statistical Association, 91.
