Finding a Needle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter


Finding a Needle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter

Bruno Jedynak and Damianos Karakos

October 23, 2006

Abstract

We study conditions for the detection of an N-length iid sequence with unknown pmf p_1 among M N-length iid sequences with unknown pmf p_0. We show how the quantity M 2^{-N D(p_1, p_0)} determines the asymptotic probability of error.

Keywords: reliable detection, probability of error, Sanov's theorem, Kullback-Leibler distance, phase transition.

1 Introduction

Our motivation for this paper has its origins in Geman and Jedynak (1996), where an algorithm for tracking roads in satellite images was experimentally studied. Below a certain clutter level, the algorithm could track a road accurately; with increased clutter level, tracking suddenly became impossible. This phenomenon was studied theoretically in Yuille et al. (2000 and 2001). Using a simplified statistical model, the authors show that, in an

[Footnotes: This research was partially supported by ARO DAAD19-02-1-0337 and general funds from the Center for Imaging Science at The Johns Hopkins University. USTL and Department of Applied Mathematics, Johns Hopkins University. Mail should be addressed to: Bruno Jedynak, Clark 302b, Johns Hopkins University, Baltimore, MD 21286-2686. Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21286-2686.]

appropriate asymptotic setting, the number of false detections is subject to a phase transition. Our objective in this paper is to generalize these results. First, we demonstrate, in the same setting, that the phase transition phenomenon occurs for the error rate of the maximum likelihood estimator. Second, we consider the situation where the underlying statistical model is unknown; i.e., there is a special object among many others, but it is not known how it is special (it is an outlier, in some sense). We show that the same phase transition phenomenon occurs in this case as well. Moreover, we propose a target detector that has the same asymptotic performance as the maximum likelihood estimator, had the model been known. Simulations illustrate these results.

Let

X = \begin{pmatrix} X_1^1 & X_1^2 & \cdots & X_1^N \\ X_2^1 & X_2^2 & \cdots & X_2^N \\ \vdots & & & \vdots \\ X_{M+1}^1 & X_{M+1}^2 & \cdots & X_{M+1}^N \end{pmatrix}    (1)

be an (M+1) x N matrix made of independent random variables (rvs) taking values in a finite set. We denote by X_m = (X_m^1, ..., X_m^N) the rvs in line m and by X_{-m} the ones that are not in line m. There is a special line, the target, with index t. All the other lines will be called distractors. The rvs in X_t are identically distributed with point mass function (pmf) p_1. The other ones, in X_{-t}, are identically distributed with pmf p_0 \neq p_1. The goal is to estimate t, the target, from a single realization of X. If p_0 and p_1 are close, the target does not differ much from the distractors, a situation akin to finding a needle in a haystack.
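The data model above is easy to simulate. Below is a minimal sketch (the function name and parameter layout are mine, not the paper's) that draws the (M+1) x N matrix with one target line distributed according to p_1 and M distractor lines according to p_0:

```python
import numpy as np

def generate_data(M, N, p0, p1, seed=0):
    """Draw the (M+1) x N matrix of the model: one target line iid p1,
    M distractor lines iid p0, over the alphabet {0, ..., K-1}."""
    rng = np.random.default_rng(seed)
    K = len(p0)
    X = rng.choice(K, size=(M + 1, N), p=p0)  # all lines ~ p0 ...
    t = int(rng.integers(M + 1))              # ... then pick the target index
    X[t] = rng.choice(K, size=N, p=p1)        # and redraw line t ~ p1
    return X, t

X, t = generate_data(M=1000, N=200, p0=[0.9, 0.1], p1=[0.8, 0.2])
```

The binary alphabet and the values of p_0, p_1 above match the simulation setting of Section 4.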

2 Known distributions

Let x be a realization of X. Then, the log-likelihood of x is

l(x) = \sum_{n=1}^{N} \log p_1(x_t^n) + \sum_{m=1, m \neq t}^{M+1} \sum_{n=1}^{N} \log p_0(x_m^n)    (2)

     = \sum_{n=1}^{N} \log \frac{p_1(x_t^n)}{p_0(x_t^n)} + \sum_{m=1}^{M+1} \sum_{n=1}^{N} \log p_0(x_m^n)    (3)

The maximum likelihood estimator (mle) for t is then

\hat{t}(x) = \arg\max_{1 \leq m \leq M+1} \sum_{n=1}^{N} \log \frac{p_1(x_m^n)}{p_0(x_m^n)}    (4)

We call the reward of line m the quantity

\sum_{n=1}^{N} \log \frac{p_1(x_m^n)}{p_0(x_m^n)}    (5)

The mle entails choosing the line with the largest reward. The quantity of interest is the probability that the mle differs from the target:

e(M, N) = IP( \hat{t}(X) \neq t )    (6)

which is the probability that a distractor gets a reward greater than the reward of the target. If M is fixed, letting N \to \infty, and using the law of large numbers, we obtain

\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(x_t^n)}{p_0(x_t^n)} \to D(p_1, p_0)    and    (7)

\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(x_m^n)}{p_0(x_m^n)} \to -D(p_0, p_1)    for every m \neq t    (8)

(Logarithms are base 2 throughout the paper.)
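The rewards (5) and the mle (4) take only a few lines to compute. A sketch (function name mine), assuming the symbols are coded as integers indexing into p_0 and p_1:

```python
import numpy as np

def mle_target(X, p0, p1):
    """Known-distributions mle: the reward of line m is
    sum_n log2(p1(x_m^n) / p0(x_m^n)); pick the line with the largest reward."""
    llr = np.log2(np.asarray(p1, float) / np.asarray(p0, float))
    rewards = llr[X].sum(axis=1)      # one reward per line
    return int(np.argmax(rewards)), rewards

# A line rich in symbol 1 stands out when p1 puts more mass on 1 than p0 does:
X = np.array([[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]])
t_hat, rewards = mle_target(X, p0=[0.9, 0.1], p1=[0.8, 0.2])
```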

almost surely, where

D(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}    (9)

is the Kullback-Leibler distance between p and q. Hence, as long as p \neq q, D(p, q) > 0, and the reward of the target converges to a positive value while the reward of each distractor converges to a negative value, which allows us to show that the error of the mle goes to zero. One can even bound e(M, N) from above for any fixed M and N, as follows.

Theorem 1

e(M, N) \leq M \Big( \sum_x \sqrt{p_0(x) p_1(x)} \Big)^{2N}.    (10)

Note that

\sum_x \sqrt{p_0(x) p_1(x)} = 1 - \mathrm{Hellinger}(p_0, p_1),    (11)

where Hellinger(p_0, p_1) is the Hellinger distance between p_0 and p_1. The proof, using classical large deviations techniques, is in Section 6.

Note that if the right-hand side of (10) goes to 0 as M \to \infty and N \to \infty, the probability that the mle differs from the target goes to 0. This condition, however, is not necessary. As we show below, there is a maximum rate at which M can go to infinity in order for the probability of error to go to zero; if M increases faster, then the probability of error goes to one. A similar result, i.e., that the number of distractors for which the reward is larger than the reward of the target follows a phase transition, was also shown by Yuille et al. (2000). We present below the same analysis for the convergence of the mle. The phase transition, or in other words, the dependence of the probability of error on the rate at which M goes to infinity, is expressed in the following theorem:

Theorem 2

If \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) - ε)} = 0, then

\lim_{M,N \to \infty} e(M, N) = 0,    (12)
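Three quantities drive Theorems 1 and 2 and are one-liners to compute: the KL distance (9), the upper bound of Theorem 1, and the boundary value of N solving M = 2^{N D(p_1, p_0)}. A short sketch (function names mine):

```python
import math

def kl(p1, p0):
    """D(p1, p0) = sum_x p1(x) log2(p1(x)/p0(x))  -- base-2 logs, as in the paper."""
    return sum(a * math.log2(a / b) for a, b in zip(p1, p0) if a > 0)

def theorem1_bound(M, N, p0, p1):
    """Theorem 1 upper bound M * (sum_x sqrt(p0(x) p1(x)))**(2N) on e(M, N)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p0, p1))
    return M * bc ** (2 * N)

def critical_N(M, p0, p1):
    """The N at which M = 2**(N * D(p1, p0)): the phase-transition boundary."""
    return math.log2(M) / kl(p1, p0)
```

With the first simulation setting of Section 4 (p_0 = (0.9, 0.1), p_1 = (0.8, 0.2), M = 1000), this gives D(p_1, p_0) ≈ 0.064 and a boundary near N ≈ 155.5, matching the values reported there.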

and if \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) + ε)} = +\infty, then

\lim_{M,N \to \infty} e(M, N) = 1.    (13)

The intuition is as follows. First, as N goes to infinity while M remains fixed, the probability of error goes to zero exponentially fast, following a large deviation phenomenon, since the reward of the target line converges to a positive value while the reward of the distractors converges to a negative value, as was mentioned earlier. On the other hand, as the number M of distractors increases while N remains fixed, the probability that there exists a distractor with a reward larger than the reward of the target increases as well. These are two competing phenomena, whose interaction gives rise to the critical rate D(p_1, p_0). The detailed proof appears in Section 6.

Note: In order for the limits of functions of (M, N) to be well-defined as M, N \to \infty, we assume that M is, in general, a function of N. Hence, all limits \lim_{M,N \to \infty} should be interpreted as \lim_{N \to \infty}, with the proviso that M increases according to some function of N. We keep the notation \lim_{M,N \to \infty} for simplicity.

3 Unknown Distributions

We now look at the case where p_0 and p_1 are unknown. It is clear that the error rate of any estimator in this context cannot be lower than the error rate of the mle with known p_0 and p_1. Hence, (13) holds even when e(M, N) is the error rate of any estimator. Can one build an estimator of the target for which the error rate will satisfy (12)? The answer is yes, as we shall see now. A simple way of building an estimator of the target when p_0 and p_1 are unknown is to plug estimators of p_0 and p_1 into the previous mle estimator (4). Hence, let us define

\tilde{t}(x) = \arg\max_{1 \leq m \leq M+1} \sum_{n=1}^{N} \log \frac{\hat{p}_m(x_m^n)}{\hat{p}_{-m}(x_m^n)}    (14)

where \hat{p}_m and \hat{p}_{-m} are the empirical distributions of the rvs in line m and in all the other lines, respectively. I.e.,

\hat{p}_m(x) = \frac{1}{N} \sum_{n=1}^{N} 1\{X_m^n = x\},    (15)

and

\hat{p}_{-m}(x) = \frac{1}{MN} \sum_{n=1}^{N} \sum_{j=1, j \neq m}^{M+1} 1\{X_j^n = x\}.    (16)

Note that

\tilde{t}(x) = \arg\max_{1 \leq m \leq M+1} D(\hat{p}_m, \hat{p}_{-m}).    (17)

Hence, \tilde{t} is the line that differs the most, in the Kullback-Leibler sense, from the average distribution of the other lines. The reader may be more familiar with the variant

\dot{t}(x) = \arg\max_{1 \leq m \leq M+1} D(\hat{p}_m, \hat{p}),    (18)

where \hat{p} is the empirical distribution over all rvs, including line m; both \dot{t} and \tilde{t} are similar in the sense that they pick the sequence which differs the most from the rest. It turns out that the error rate of \tilde{t}, that is,

\tilde{e}(M, N) = IP( \tilde{t}(X) \neq t ),    (19)

where, as before, t denotes the target, has the same asymptotic behavior as the mle (4) in the case of known distributions.

Theorem 3

If \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) - ε)} = 0, then

\lim_{M,N \to \infty} \tilde{e}(M, N) = 0,    (20)
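The plug-in detector (14)-(17) is equally short to implement. A sketch (function name mine) over an integer-coded alphabet; note that D(p̂_m, p̂_{-m}) may be +inf when p̂_{-m} has a zero where p̂_m does not, which the argmax handles naturally:

```python
import numpy as np

def plug_in_target(X, K=None):
    """Unknown-distributions detector: return the line m maximising
    D(p_hat_m, p_hat_{-m}), the KL distance between the empirical pmf of
    line m and the empirical pmf of all the other lines."""
    n_lines, N = X.shape
    if K is None:
        K = int(X.max()) + 1
    counts = np.stack([np.bincount(row, minlength=K) for row in X])
    p_m = counts / N                                              # eq. (15)
    p_rest = (counts.sum(axis=0) - counts) / ((n_lines - 1) * N)  # eq. (16)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_m > 0, p_m * np.log2(p_m / p_rest), 0.0)
    return int(np.argmax(terms.sum(axis=1)))                      # eq. (17)
```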

Figure 1: Estimates of the probability of error for various N, for the case p_0 = (0.9, 0.1), p_1 = (0.8, 0.2), M = 1000. The two plots correspond to the cases of known and unknown distributions, respectively. The red line represents the upper bound as established by Theorem 1.

and if \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) + ε)} = +\infty, then

\lim_{M,N \to \infty} \tilde{e}(M, N) = 1.    (21)

The proof uses the same large deviations techniques as the proof of Theorem 2, but is slightly more complex due to the fact that the rewards are not independent anymore. The proof appears in Section 6.

4 Simulations

We now provide simulations that show Theorems 1, 2 and 3 in action. We generated M = 1000 binary sequences with probabilities p_0 = (0.9, 0.1) and p_1 = (0.8, 0.2) for the background and the target, respectively. We varied the number N from 10 to 500, and we observed the probability of error decreasing to zero. We performed the random experiment 100 times for each value of N. The procedure was replicated 20 times in

Figure 2: Estimates of the probability of error for various N, for the case p_0 = (0.9, 0.1), p_1 = (0.7, 0.3), M = 1000. The two plots correspond to the cases of known and unknown distributions, respectively. The red line represents the upper bound as established by Theorem 1.

order to compute error bars. The plots in Figure 1 show the estimated probability of error versus N, for the two maximum likelihood detectors (known and unknown distributions, respectively), along with standard-deviation error bars. As expected, the error for the case of unknown distributions is somewhat higher, as there is an additional error due to the inaccuracy in estimating the two distributions. The KL divergence is D(p_1, p_0) ≈ 0.064. The dashed line shows the phase transition boundary, i.e., the value of N such that M = 2^{N D(p_1, p_0)}. For M = 1000, this value is equal to 155.5. For the known-distributions plot, the red line corresponds to the upper bound established by Theorem 1, and it is equal to 1000 · 0.98^N. Similar plots for the case p_0 = (0.9, 0.1) and p_1 = (0.7, 0.3) are shown in Figure 2. As expected, the error curves of Figure 1 are higher than the ones in Figure 2, since the former detection case is harder than the latter. The phase transition boundary is depicted in Figure 2 with the dashed line at the value N = 44.9. The upper bound of Theorem 1 is given by 1000 · 0.9349^N.
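The simulation protocol of this section is straightforward to reproduce at a smaller scale. A sketch for the known-distributions mle (function name mine; the paper's M = 1000 with 100 repetitions per N is heavier than needed for a quick check):

```python
import numpy as np

def error_rate(M, N, p0, p1, trials=100, seed=0):
    """Monte Carlo estimate of e(M, N) for the known-distributions mle."""
    rng = np.random.default_rng(seed)
    llr = np.log2(np.asarray(p1, float) / np.asarray(p0, float))
    errors = 0
    for _ in range(trials):
        X = rng.choice(len(p0), size=(M + 1, N), p=p0)  # distractors ~ p0
        t = int(rng.integers(M + 1))
        X[t] = rng.choice(len(p1), size=N, p=p1)        # target line ~ p1
        errors += int(np.argmax(llr[X].sum(axis=1)) != t)
    return errors / trials
```

Sweeping N across the boundary log2(M) / D(p_1, p_0) should reproduce the transition visible in Figures 1 and 2.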

5 Conclusions

We have considered a statistical model with M + 1 sequences of independent random variables, each of length N. All random variables have the same point mass function p_0, except for one sequence, the target, for which the common point mass function is p_1. The error of the maximum likelihood estimator for the target converges to 0 if there exists an ε > 0 such that M 2^{-N(D(p_1,p_0) - ε)} \to 0, and it converges to 1 if there exists an ε > 0 such that M 2^{-N(D(p_1,p_0) + ε)} \to +\infty. Moreover, when p_0 and p_1 are unknown, we are able to build an estimator of the target with the same performance; this allows us to study the important practical problem of outlier detection. We conjecture that these results can be generalized to the case of ergodic Markov chains, and we plan to report the more general results in a subsequent publication.

6 Proofs

Without loss of generality, we assume that the target line is line number 1.

Proof of Theorem 1:

e(M, N) = IP\Big( \max_{2 \leq m \leq M+1} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)} > \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)} \Big)    (22)

\leq M \, IP\Big( \sum_{n=1}^{N} \log \frac{p_1(X_2^n)}{p_0(X_2^n)} > \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)} \Big)    (23)

= M \, IP\Big( \prod_{n=1}^{N} \Big( \frac{p_1(X_2^n) p_0(X_1^n)}{p_0(X_2^n) p_1(X_1^n)} \Big)^s > 1 \Big), for all s > 0    (24)

\leq M \, E\Big[ \prod_{n=1}^{N} \Big( \frac{p_1(X_2^n) p_0(X_1^n)}{p_0(X_2^n) p_1(X_1^n)} \Big)^s \Big]    (25)

= M \Big( E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^s \Big] \Big)^N    (26)

where (25) is due to the Markov inequality.

Let us define

f(s) \triangleq E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^s \Big]    and    g(s) \triangleq \ln f(s).    (27)

One can check that f'(1/2) = g'(1/2) = 0. Moreover, using Hölder's inequality (Grimmett and Stirzaker, 1992), it is easy to show that, for any s, t > 0 and 0 \leq \alpha \leq 1,

E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^{\alpha s + (1-\alpha) t} \Big] \leq E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^s \Big]^{\alpha} \, E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^t \Big]^{1-\alpha}.    (28)

By taking the log on both sides, we deduce that g is a convex function of s. Hence, it achieves its minimum value at s = 1/2; therefore, f achieves its minimum value at s = 1/2. This leads to the tightest upper bound in (26), i.e.,

e(M, N) \leq M f(1/2)^N = M \Big( \sum_x \sqrt{p_0(x) p_1(x)} \Big)^{2N}.    (29)

In order to prove Theorems 2 and 3, we start with two technical lemmas that will be useful later on.

Lemma 1  Let U and R be two rvs, and y \in \mathbb{R}. Then, for any ε > 0,

IP(U > y + ε) - IP(R < -ε) \leq IP(U + R > y) \leq IP(U > y - ε) + IP(R > ε).    (30)

Proof of Lemma 1:

IP(U + R > y) = IP(U + R > y, R \leq ε) + IP(U + R > y, R > ε)    (31)

\leq IP(U > y - ε) + IP(R > ε)    (32)
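The claim that s = 1/2 minimises f can be checked numerically: since X_1^1 and X_2^1 are independent, f factors into two Chernoff-type sums, and the product is symmetric under s \to 1 - s. A quick sketch (function name mine):

```python
import math

def f(s, p0, p1):
    """f(s) = E[(p1(X2) p0(X1) / (p0(X2) p1(X1)))**s] with X2 ~ p0, X1 ~ p1.
    By independence, f(s) = (sum_x p0^(1-s) p1^s) * (sum_x p1^(1-s) p0^s)."""
    a = sum(x ** (1 - s) * y ** s for x, y in zip(p0, p1))  # expectation over X2 ~ p0
    b = sum(y ** (1 - s) * x ** s for x, y in zip(p0, p1))  # expectation over X1 ~ p1
    return a * b

p0, p1 = [0.9, 0.1], [0.7, 0.3]
grid = [i / 100 for i in range(1, 100)]
s_star = min(grid, key=lambda s: f(s, p0, p1))  # grid minimiser of f
```

At s = 1/2 the product collapses to (sum_x sqrt(p0(x) p1(x)))**2, the quantity appearing in (29).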

and

IP(U + R \leq y) = IP(U + R \leq y, R < -ε) + IP(U + R \leq y, R \geq -ε)    (33)

\leq IP(U \leq y + ε) + IP(R < -ε),    (34)

which allows us to obtain the lower bound by computing the complementary event.

Lemma 2  Let V_1^N, ..., V_M^N be a sequence of M independent, identically distributed, discrete random variables. Moreover, assume that the following large deviation property holds for some z \in \mathbb{R}:

IP(V_1^N > z) \doteq 2^{-N I(z)},    where I(z) > 0 and a_N \doteq b_N means \lim_{N \to +\infty} \frac{1}{N} \log \frac{a_N}{b_N} = 0.    (35)

Then, if \exists ε > 0 s.t. \lim_{M,N \to +\infty} M 2^{-N(I(z) - ε)} = 0, then

\lim_{M,N \to +\infty} IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) = 0.    (36)

Also, if \exists ε > 0 s.t. \lim_{M,N \to +\infty} M 2^{-N(I(z) + ε)} = +\infty, then

\lim_{M,N \to +\infty} IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) = 1.    (37)

Proof of Lemma 2: Let ε > 0 be arbitrarily small. Then, there exists N_0(ε) > 0 such that for all N > N_0,

\Big| \frac{1}{N} \log IP(V_1^N > z) + I(z) \Big| < ε.    (38)

To prove the first part, suppose that the conclusion fails, i.e., there exists δ > 0 such that for every N_0 > 0 there is an N > N_0 with IP( \max_{1 \leq m \leq M} V_m^N > z ) > δ.

Then, using the union bound, we obtain, for every such N,

δ < IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) \leq \sum_{m=1}^{M} IP(V_m^N > z) = M \, IP(V_1^N > z).    (39)

By picking N_0 > N_0(ε), (39) becomes M 2^{-N(I(z) - ε)} > δ along a subsequence. Hence, M 2^{-N(I(z) - ε)} does not converge to zero for any ε > 0, contradicting the hypothesis, as required.

To prove the second part, we first assume that N > N_0(ε), as above. Then,

IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) = 1 - IP\big( \max_{1 \leq m \leq M} V_m^N \leq z \big)    (40)

= 1 - \big( 1 - IP(V_1^N > z) \big)^M    (41)

= 1 - 2^{M \log ( 1 - IP(V_1^N > z) )}    (42)

\geq 1 - 2^{-M \, IP(V_1^N > z)}    (43)

\geq 1 - 2^{-M 2^{-N(I(z) + ε)}},    (44)

where the first inequality is a consequence of the inequality \log(1 - x) \leq -x, and the second inequality arises from (38). Note that (44) is true for any arbitrary ε > 0. Hence, if there exists ε > 0 such that M 2^{-N(I(z) + ε)} \to +\infty, then necessarily IP( \max_{1 \leq m \leq M} V_m^N > z ) \to 1. This concludes the proof of the second part, and the proof of the lemma.

We are now ready for the proof of Theorem 2.

Proof of Theorem 2:

e(M, N) = IP\Big( \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)} > \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)} \Big)    (45)

= IP\big( U_M^N + R^N > D(p_1, p_0) \big), where    (46)

U_M^N = \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)}    and    (47)

R^N = D(p_1, p_0) - \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)}.    (48)

From the law of large numbers, R^N \to 0 in probability. Hence, for all η > 0 and α > 0, and for sufficiently large N, using Lemma 1,

IP(U_M^N > D(p_1, p_0) + η) - α \leq e(M, N) \leq IP(U_M^N > D(p_1, p_0) - η) + α.    (49)

Let us define

V_m^N \triangleq \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)}.    (50)

Now, using Sanov's theorem (Dembo and Zeitouni, 1998),

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p_1, p_0)}.    (51)

Indeed,

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p^*, p_0)},    (52)

where

D(p^*, p_0) = \inf_{p \in C} D(p, p_0),    with    C = \{ p : E_p \log \frac{p_1}{p_0} \geq D(p_1, p_0) \},    (53)

and for p \in C,

D(p, p_0) = E_p \log \frac{p}{p_0} = D(p, p_1) + E_p \log \frac{p_1}{p_0}    (54)

\geq D(p, p_1) + D(p_1, p_0) \geq D(p_1, p_0).    (55)

Now, by continuity of the rate function, there exists ε_1 > 0 such that

IP(V_2^N > D(p_1, p_0) - η) \doteq 2^{-N(D(p_1, p_0) - ε_1)}    (56)

and there exists ε_2 > 0 such that

IP(V_2^N > D(p_1, p_0) + η) \doteq 2^{-N(D(p_1, p_0) + ε_2)}.    (57)

Finally, since

U_M^N = \max_{2 \leq m \leq M+1} V_m^N    (58)

and the rvs V_2^N, ..., V_{M+1}^N are iid, we obtain the required result from Lemma 2.

Proof of Theorem 3: We proceed along the same lines as for the proof of Theorem 2.

\tilde{e}(M, N) = IP\Big( \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_m(X_m^n)}{\hat{p}_{-m}(X_m^n)} > \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_1(X_1^n)}{\hat{p}_{-1}(X_1^n)} \Big)    (59)

\leq IP\big( U_M^N + R_M^N > D(p_1, p_0) \big), with    (60)

U_M^N = \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_m(X_m^n)}{p_0(X_m^n)},    (61)

R_M^N = A_M^N + B^N + C^N,    (62)

A_M^N = \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_0(X_m^n)}{\hat{p}_{-m}(X_m^n)},    (63)

B^N = \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_{-1}(X_1^n)}{p_0(X_1^n)},    (64)

C^N = D(p_1, p_0) - \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_1(X_1^n)}{p_0(X_1^n)}.    (65)

For all η > 0, from Lemma 1,

\tilde{e}(M, N) \leq IP(U_M^N > D(p_1, p_0) - η) + IP(R_M^N > η).    (66)

Let

V_m^N = \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_m(X_m^n)}{p_0(X_m^n)}, \quad 2 \leq m \leq M+1.    (67)

Using Sanov's theorem,

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p_1, p_0)}.    (68)

Indeed,

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p^*, p_0)},    (69)

where

D(p^*, p_0) = \inf_{p \in C} D(p, p_0),    with    C = \{ p : E_p \log \frac{p}{p_0} \geq D(p_1, p_0) \}.    (70)

And for p \in C,

D(p, p_0) = E_p \log \frac{p}{p_0} \geq D(p_1, p_0).    (71)

Now, by continuity of the rate function, there exists ε_1 > 0 such that

IP(V_2^N > D(p_1, p_0) - η) \doteq 2^{-N(D(p_1, p_0) - ε_1)}.    (72)

To show that (66) approaches zero as M, N \to \infty with M 2^{-N(D(p_1,p_0) - ε)} \to 0, it suffices to prove that R_M^N \to 0, since the term IP(U_M^N > D(p_1, p_0) - η) of (66) goes to zero by virtue of (72), Lemma 2, and the fact that

U_M^N = \max_{2 \leq m \leq M+1} V_m^N,    (73)

and the rvs V_2^N, ..., V_{M+1}^N are iid. Using the law of large numbers, C^N \to 0 in probability. Also,

IP(A_M^N > η) \leq M \, IP\Big( \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_0(X_2^n)}{\hat{p}_{-2}(X_2^n)} > η \Big)    (74)

\leq M \, IP\Big( \max_{1 \leq n \leq N} \log \frac{p_0(X_2^n)}{\hat{p}_{-2}(X_2^n)} > η \Big)    (75)

\leq M \, IP\Big( \max_{x} \log \frac{p_0(x)}{\hat{p}_{-2}(x)} > η \Big)    (76)

\leq M \, |\mathcal{X}| \, \max_{x} IP\Big( \log \frac{p_0(x)}{\hat{p}_{-2}(x)} > η \Big)    (77)

= M \, |\mathcal{X}| \, \max_{x} IP\big( \hat{p}_{-2}(x) < 2^{-η} p_0(x) \big)    (78)

\leq M \, |\mathcal{X}| \, \max_{x} 2^{-MN I(x, η)} = M \, |\mathcal{X}| \, 2^{-MN J(η)},    (79)

where I(x, η) > 0 is a rate function, J(η) = \min_x I(x, η), and (77) follows from the union bound over the finite alphabet \mathcal{X}. The last inequality comes from the fact that \hat{p}_{-2}(x) \to p_0(x) in probability. A similar argument shows that B^N \to 0 in probability as well. Note that the result would still hold if we replaced \hat{p}_{-m} with \hat{p}, i.e., with the empirical distribution over the full data.

References

[1] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Springer-Verlag, 1998.

[2] D. Geman and B. Jedynak. An active testing model for tracking roads from satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1):1-14, January 1996.

[3] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford Science Publications, 1992.

[4] Alan L. Yuille and James M. Coughlan. Fundamental limits of Bayesian inference: Order parameters and phase transitions for road tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):160-173, February 2000.

[5] Alan L. Yuille, James M. Coughlan, Yingnian Wu, and Song Chun Zhu. Order parameters for detecting target curves in images: When does high level knowledge help? International Journal of Computer Vision, 41:9-33, 2001.