arxiv: v1 [math.st] 23 May 2014

Size: px

Start display at page:

Download "arxiv: v1 [math.st] 23 May 2014"

Francine Porter
6 years ago
Views:

1 Abstract On oracle efficiency of the ROAD classification rule Britta Anker Bak and Jens Ledet Jensen Department of Mathematics, Aarhus University, Denmark arxiv: v [math.st] 3 May 04 For high-dimensional classification Fishers rule performs poorly due to noise from estimation of the covariance matrix. Fan, Feng and Tong (0 introduced the ROAD classifier that puts an L -constraint on the classification vector. In their Theorem Fan, Feng and Tong (0 show that the ROAD classifier asymptotically has the same misclassification rate as the corresponding oracle based classifier. Unfortunately, the proof contains an error. Here we restate the theorem and provide a new proof. Introduction We consider classification among two groups based on a p-dimensional normally distributed variable. Let the means in the two groups be µ and µ, and let the common variance be Σ. Also, let the probability of belonging to either of the two groups be. Defining µ a = (µ +µ / and µ d = (µ µ /, the Bayes discriminant rule becomes δ w (x = +(w T (x µ a < 0, with w = w F = Σ µ d, where x is classified to group or according to the value of δ w (x. The misclassification rate of the rule δ w is W(δ w = ( wt µ d /(w T Σw /, where (z = Φ(z is the upper tail probability of a standard normal distribution. The interpretation of the Bayes rule is that w F is the vector that minimizes the misclassification rate. Fan, Feng and Tong (0 suggest to use a L regularized version of w F, that is, w c = argmin w c, w T µ d = w T Σw. Its sample version yields the ROAD classifier ŵ c = argmin w T ˆΣw w c, w T ˆµ d = ˆδ = +(ŵ T c (x ˆµ a < 0. Theorem of Fan, Feng and Tong (0 states that the misclassification rate W(ˆδ of the ROAD classifier approaches the misclassification rate of the oracle classifier W(δ wc. Unfortunately, an essential step in the proof use an inequality which is not valid, see Appendix A for details. We reformulate the theorem and give a new proof.

2 Theorem. Let ǫ be a positive constant such that max j { µ dj } > ǫ, and c > ǫ + /max j { µ dj }. Let a n be a sequence tending to zero such that ˆΣ Σ = O p (a n, and ˆµ i µ i = O p (a n, i =,. Then, as n : with d n = c a n (+c Σ. W(ˆδ W(δ wc = O p (d n Prior to proving the theorem we comment on the differences compared to Theorem of Fan, Feng and Tong (0. Contrary to us, Fan, Feng and Tong (0 requires that the smallest eigenvalue ofσis bounded from below. The upper bound onw(ˆδ W(δ wc in Fan, Feng and Tong (0 depends on the sparsity of w c and of w c (, where w c ( is given by w c ( = arg min w T Σw, w c, w Tˆµ d = whereas our bound depends on the regularizing parameter c only. In the formulation of the theorem c is allowed to depend on n. We require a lower bound on max j { µ dj }, which is not part of the theorem in Fan, Feng and Tong (0. However, it enters indirectly in that we must have c > /max j { µ dj } in order for w c to exist. Thus, if max j { µ dj } 0, we have c, and c enters the upper bound of Fan, Feng and Tong (0. The reason for our more restrictive condition c > ǫ + /max j { µ dj } is that the theorem only makes sense if ŵ c exists with probability tending to one. Similarly, whereas Fan, Feng and Tong (0 have the condition ˆµ d µ d = O p (a n, we have ˆµ i µ i = O p (a n, i =,, in order to handle a term in the misclassification rate that has been neglected in Fan, Feng and Tong (0. Finally, Σ appears in our bound. However, requiring that the variances Σ ii, i =,...,p, are bounded is often encountered in high dimensional settings. Proof of Theorem In the proof we use the following inequalities: (a(+ǫ (a ǫ for a > 0 and ǫ <, ( ((a+ǫ / (a / ǫ for a > 0 and a+ǫ > 0. ( The misclassification rate consists of two terms corresponding to an observation from each of the two groups. The proofs for the two terms are identical, so to simplify we consider the misclassification rate of an observation from group only. Using ( the misclassification rate of ˆδ becomes W(ˆδ = ( ŵc T ˆµ d +ŵc T (ˆµ µ ŵt = ( ŵt +O( ŵc T c Σŵ c c Σŵ (ˆµ µ c ( ŵt +O(c ˆµ µ. (3 c Σŵ c

3 Next, and from ( we get ( ŵt = ( c Σŵ c ŵ T c Σŵ c ŵ T c ˆΣŵ c c ˆΣ Σ, ŵc T ˆΣŵ c From the proof in Fan, Feng and Tong (0 we see that and thus ( ŵc TˆΣŵ c ŵ T c ˆΣŵ c w (T c Σw ( c c ˆΣ Σ, = ( Combining (3 5 we have W(ˆδ = ( w c (T Σŵ c ( w c (T Σŵ c ( +O(c ˆΣ Σ. (4 +O(c ˆΣ Σ. (5 +O(c ˆΣ Σ +c ˆµ µ. (6 Since the oracle misclassification rate is W(δ wc = ( /( wc TΣw c we need to compare wc TΣw c with w c (T Σŵ c (. To this end let A = {w : w T µ d =, w c}, A = {w : w Tˆµ d =, w c}. We want to show that for any w A there exists w A such that w T Σw is close to w T Σ w and vice versa. This means that the minimum of w T Σw over the set A is close to the minimum over the set A. Let w A, and define w = w/(w Tˆµ d. If w c, we have w A, and w T Σw = (w Tˆµ d w T Σ w = (+O(c ˆµ d µ d w T Σ w. If instead w > c, we first define w A and then w = w/( w Tˆµ d A. To define w assume without loss of generality that = max j { µ dj. Write w = (w,w ( where w ( is (p -dimensional, and define w = ( w,rw ( with 0 < r <, and w chosen such that w T µ d =. The latter requirement implies w = rw T ( µ d( = r( w. We will show that with r = c ˆµ d µ d /(c / = O(c ˆµ d µ d we have w c. From the definition of w we have w = w +r w ( = r( w +r( w w. 3

4 If r( w > 0 we get w = +r ( w +w w +r ( c. This shows that w A and w A since w = w w Tˆµ d +r ( c c ˆµ d µ d c, when r c ˆµ d µ d /(c /. If instead r( w < 0 we find w = r +r w rc r rc, and w rc/( c ˆµ d µ d c for r c ˆµ d µ d. The latter condition is satisfied with r c ˆµ d µ d /(c /. Comparing w and w we get and also w T Σw w T Σ w c w w Σ c[( r w +( r ] Σ = O(c 4 ˆµ d µ d Σ, w T Σ w w T Σw (w T Σw O(c ˆµ d µ d. We have now shown that any value of w T Σw for w A is close to the corresponding value for some w A. The other way around, starting with w A, is treated in the same way. The only difference is that instead of using c / > ǫ, we use that when ˆ < min{ǫ,ǫ 3 /( + ǫ }, which happens with probability tending to (exponentially fast, we have ˆ > 0 and c /ˆ > ǫ/. Therefore, the minimum wc TΣw c of w T Σw over the set A is close to the minimum w c (T Σw c ( over the set A : w T c Σw c = w (T c Σw ( c +O(c 4 ˆµ d µ d Σ +O(c ˆµ d µ d w T c Σw c = w (T c Σw ( c +O(c 4 ˆµ d µ d Σ. Combining the latter with (6 we conclude Acknowledgement W(ˆδ W(δ wc = O ( c a n (+c Σ. We thank Xin Tong for reading this note and refer to the arxiv version of Fan, Feng and Tong (0 for updated versions. 4

5 Appendix A An essential step in the proof in Fan, Feng and Tong (0 is the inequality (used in equation ( of that paper w T c ˆµ d w T c Σw c. w c (T Σw c ( Unfortunately, this inequality is not correct. We illustrate this by a concrete example. We consider the two-dimensional case with ( µ d = (,0 T, Σ =, c = +ǫ with ǫ < /σ. σ In this case we have w c = (, ǫ T and w T c Σw c = ǫ+σǫ. Consider next ˆµ d = (+a,b with a and b small. For a and b sufficiently small we obtain and c = ( +b[a+ǫ(+a]/(+a+b, a+ǫ(+a T, (7 +a +a+b w ( w (T c Σw ( c = (w ( c +w ( c w( c +σ(w( c. For a and b small and including O(a and O(b terms only we get = w c (T Σw c ( which must be compared to { +ǫ ǫσ(+ǫ } +a bǫ+(a bǫ, (8 ǫ+σǫ ǫ+σǫ w T c ˆµ d w T c Σw c = +a bǫ ǫ+σǫ. (9 We thus see that (8 is less that (9 when a bǫ has the opposite sign of +ǫ ǫσ(+ǫ. Since (a bǫ N(0,( ǫ+σǫc 0 for some constant c 0, the probability of a particular sign of a bǫ is one half. References Fan, J., Feng, Y. and Tong, X. (0. A road to classification in high dimensional space: the regularized optimal affine discriminant J. R. Statist. Soc. B, 74, The paper with revised versions can also be found at arxiv.org: arxiv:0.6095v [stat.ml]. 5

arxiv: v2 [stat.ml] 9 Nov 2011

arxiv: v2 [stat.ml] 9 Nov 2011 A ROAD to Classification in High Dimensional Space Jianqing Fan [1], Yang Feng [2] and Xin Tong [1] [1] Department of Operations Research & Financial Engineering, Princeton University, Princeton, New Jersey