Finding a Needle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter


Finding a Needle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter

Bruno Jedynak and Damianos Karakos

October 23, 2006

Abstract

We study conditions for the detection of an N-length iid sequence with unknown pmf p_1 among M N-length iid sequences with unknown pmf p_0. We show how the quantity M 2^{-N D(p_1, p_0)} determines the asymptotic probability of error.

Keywords: reliable detection, probability of error, Sanov's theorem, Kullback-Leibler distance, phase transition.

1 Introduction

Our motivation for this paper has its origins in Geman and Jedynak (1996), where an algorithm for tracking roads in satellite images was experimentally studied. Below a certain clutter level, the algorithm could track a road accurately; with increased clutter level, tracking suddenly became impossible. This phenomenon was studied theoretically in Yuille et al. (2000 and 2001). Using a simplified statistical model, the authors show that, in an

[Footnotes: This research was partially supported by ARO DAAD19-02-1-0337 and general funds from the Center for Imaging Science at The Johns Hopkins University. USTL and Department of Applied Mathematics, Johns Hopkins University. Mail should be addressed to: Bruno Jedynak, Clark 302b, Johns Hopkins University, Baltimore, MD 21286-2686. Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21286-2686.]

appropriate asymptotic setting, the number of false detections is subject to a phase transition. Our objective in this paper is to generalize these results. First, we demonstrate, in the same setting, that the phase transition phenomenon occurs for the error rate of the maximum likelihood estimator. Second, we consider the situation where the underlying statistical model is unknown; i.e., there is a special object among many others, but it is not known how it is special (it is an outlier, in some sense). We show that the same phase transition phenomenon occurs in this case as well. Moreover, we propose a target detector that has the same asymptotic performance as the maximum likelihood estimator, had the model been known. Simulations illustrate these results.

Let

X = \begin{pmatrix} X_1^1 & X_1^2 & \cdots & X_1^N \\ X_2^1 & X_2^2 & \cdots & X_2^N \\ \vdots & & & \vdots \\ X_{M+1}^1 & X_{M+1}^2 & \cdots & X_{M+1}^N \end{pmatrix}    (1)

be an (M+1) x N matrix made of independent random variables (rvs) taking values in a finite set. We denote by X_m = (X_m^1, ..., X_m^N) the rvs in line m and by X_{-m} the ones that are not in line m. There is a special line, the target, with index t. All the other lines will be called distractors. The rvs in X_t are identically distributed with point mass function (pmf) p_1. The other ones, in X_{-t}, are identically distributed with pmf p_0 \neq p_1. The goal is to estimate t, the target, from a single realization of X. If p_0 and p_1 are close, the target does not differ much from the distractors, a situation akin to finding a needle in a haystack.
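The data model above is easy to simulate. Below is a minimal sketch (the function name and parameter layout are mine, not the paper's) that draws the (M+1) x N matrix with one target line distributed according to p_1 and M distractor lines according to p_0:

```python
import numpy as np

def generate_data(M, N, p0, p1, seed=0):
    """Draw the (M+1) x N matrix of the model: one target line iid p1,
    M distractor lines iid p0, over the alphabet {0, ..., K-1}."""
    rng = np.random.default_rng(seed)
    K = len(p0)
    X = rng.choice(K, size=(M + 1, N), p=p0)  # all lines ~ p0 ...
    t = int(rng.integers(M + 1))              # ... then pick the target index
    X[t] = rng.choice(K, size=N, p=p1)        # and redraw line t ~ p1
    return X, t

X, t = generate_data(M=1000, N=200, p0=[0.9, 0.1], p1=[0.8, 0.2])
```

The binary alphabet and the values of p_0, p_1 above match the simulation setting of Section 4.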

2 Known distributions

Let x be a realization of X. Then, the log-likelihood of x is

l(x) = \sum_{n=1}^{N} \log p_1(x_t^n) + \sum_{m=1, m \neq t}^{M+1} \sum_{n=1}^{N} \log p_0(x_m^n)    (2)

     = \sum_{n=1}^{N} \log \frac{p_1(x_t^n)}{p_0(x_t^n)} + \sum_{m=1}^{M+1} \sum_{n=1}^{N} \log p_0(x_m^n)    (3)

The maximum likelihood estimator (mle) for t is then

\hat{t}(x) = \arg\max_{1 \leq m \leq M+1} \sum_{n=1}^{N} \log \frac{p_1(x_m^n)}{p_0(x_m^n)}    (4)

We call the reward of line m the quantity

\sum_{n=1}^{N} \log \frac{p_1(x_m^n)}{p_0(x_m^n)}    (5)

The mle entails choosing the line with the largest reward. The quantity of interest is the probability that the mle differs from the target:

e(M, N) = IP( \hat{t}(X) \neq t )    (6)

which is the probability that a distractor gets a reward greater than the reward of the target. If M is fixed, letting N \to \infty, and using the law of large numbers, we obtain

\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(x_t^n)}{p_0(x_t^n)} \to D(p_1, p_0)    and    (7)

\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(x_m^n)}{p_0(x_m^n)} \to -D(p_0, p_1)    for every m \neq t    (8)

(Logarithms are base 2 throughout the paper.)
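The rewards (5) and the mle (4) take only a few lines to compute. A sketch (function name mine), assuming the symbols are coded as integers indexing into p_0 and p_1:

```python
import numpy as np

def mle_target(X, p0, p1):
    """Known-distributions mle: the reward of line m is
    sum_n log2(p1(x_m^n) / p0(x_m^n)); pick the line with the largest reward."""
    llr = np.log2(np.asarray(p1, float) / np.asarray(p0, float))
    rewards = llr[X].sum(axis=1)      # one reward per line
    return int(np.argmax(rewards)), rewards

# A line rich in symbol 1 stands out when p1 puts more mass on 1 than p0 does:
X = np.array([[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]])
t_hat, rewards = mle_target(X, p0=[0.9, 0.1], p1=[0.8, 0.2])
```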

almost surely, where

D(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}    (9)

is the Kullback-Leibler distance between p and q. Hence, as long as p \neq q, D(p, q) > 0, and the reward of the target converges to a positive value while the reward of each distractor converges to a negative value, which allows us to show that the error of the mle goes to zero. One can even bound e(M, N) from above for any fixed M and N, as follows.

Theorem 1

e(M, N) \leq M \Big( \sum_x \sqrt{p_0(x) p_1(x)} \Big)^{2N}.    (10)

Note that

\sum_x \sqrt{p_0(x) p_1(x)} = 1 - \mathrm{Hellinger}(p_0, p_1),    (11)

where Hellinger(p_0, p_1) is the Hellinger distance between p_0 and p_1. The proof, using classical large deviations techniques, is in Section 6.

Note that if the right-hand side of (10) goes to 0 as M \to \infty and N \to \infty, the probability that the mle differs from the target goes to 0. This condition, however, is not necessary. As we show below, there is a maximum rate at which M can go to infinity in order for the probability of error to go to zero; if M increases faster, then the probability of error goes to one. A similar result, i.e., that the number of distractors for which the reward is larger than the reward of the target follows a phase transition, was also shown by Yuille et al. (2000). We present below the same analysis for the convergence of the mle. The phase transition, or in other words, the dependence of the probability of error on the rate at which M goes to infinity, is expressed in the following theorem:

Theorem 2

If \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) - ε)} = 0, then

\lim_{M,N \to \infty} e(M, N) = 0,    (12)
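Three quantities drive Theorems 1 and 2 and are one-liners to compute: the KL distance (9), the upper bound of Theorem 1, and the boundary value of N solving M = 2^{N D(p_1, p_0)}. A short sketch (function names mine):

```python
import math

def kl(p1, p0):
    """D(p1, p0) = sum_x p1(x) log2(p1(x)/p0(x))  -- base-2 logs, as in the paper."""
    return sum(a * math.log2(a / b) for a, b in zip(p1, p0) if a > 0)

def theorem1_bound(M, N, p0, p1):
    """Theorem 1 upper bound M * (sum_x sqrt(p0(x) p1(x)))**(2N) on e(M, N)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p0, p1))
    return M * bc ** (2 * N)

def critical_N(M, p0, p1):
    """The N at which M = 2**(N * D(p1, p0)): the phase-transition boundary."""
    return math.log2(M) / kl(p1, p0)
```

With the first simulation setting of Section 4 (p_0 = (0.9, 0.1), p_1 = (0.8, 0.2), M = 1000), this gives D(p_1, p_0) ≈ 0.064 and a boundary near N ≈ 155.5, matching the values reported there.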

and if \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) + ε)} = +\infty, then

\lim_{M,N \to \infty} e(M, N) = 1.    (13)

The intuition is as follows. First, as N goes to infinity while M remains fixed, the probability of error goes to zero exponentially fast, following a large deviation phenomenon, since the reward of the target line converges to a positive value while the reward of the distractors converges to a negative value, as was mentioned earlier. On the other hand, as the number M of distractors increases while N remains fixed, the probability that there exists a distractor with a reward larger than the reward of the target increases as well. These are two competing phenomena, whose interaction gives rise to the critical rate D(p_1, p_0). The detailed proof appears in Section 6.

Note: In order for the limits of functions of (M, N) to be well-defined as M, N \to \infty, we assume that M is, in general, a function of N. Hence, all limits \lim_{M,N \to \infty} should be interpreted as \lim_{N \to \infty}, with the proviso that M increases according to some function of N. We keep the notation \lim_{M,N \to \infty} for simplicity.

3 Unknown Distributions

We now look at the case where p_0 and p_1 are unknown. It is clear that the error rate of any estimator in this context cannot be lower than the error rate of the mle with known p_0 and p_1. Hence, (13) holds even when e(M, N) is the error rate of any estimator. Can one build an estimator of the target for which the error rate will satisfy (12)? The answer is yes, as we shall see now. A simple way of building an estimator of the target when p_0 and p_1 are unknown is to plug estimators of p_0 and p_1 into the previous mle estimator (4). Hence, let us define

\tilde{t}(x) = \arg\max_{1 \leq m \leq M+1} \sum_{n=1}^{N} \log \frac{\hat{p}_m(x_m^n)}{\hat{p}_{-m}(x_m^n)}    (14)

where \hat{p}_m and \hat{p}_{-m} are the empirical distributions of the rvs in line m and in all the other lines, respectively. I.e.,

\hat{p}_m(x) = \frac{1}{N} \sum_{n=1}^{N} 1\{X_m^n = x\},    (15)

and

\hat{p}_{-m}(x) = \frac{1}{MN} \sum_{n=1}^{N} \sum_{j=1, j \neq m}^{M+1} 1\{X_j^n = x\}.    (16)

Note that

\tilde{t}(x) = \arg\max_{1 \leq m \leq M+1} D(\hat{p}_m, \hat{p}_{-m}).    (17)

Hence, \tilde{t} is the line that differs the most, in the Kullback-Leibler sense, from the average distribution of the other lines. The reader may be more familiar with the variant

\dot{t}(x) = \arg\max_{1 \leq m \leq M+1} D(\hat{p}_m, \hat{p}),    (18)

where \hat{p} is the empirical distribution over all rvs, including line m; both \dot{t} and \tilde{t} are similar in the sense that they pick the sequence which differs the most from the rest. It turns out that the error rate of \tilde{t}, that is,

\tilde{e}(M, N) = IP( \tilde{t}(X) \neq t ),    (19)

where, as before, t denotes the target, has the same asymptotic behavior as the mle (4) in the case of known distributions.

Theorem 3

If \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) - ε)} = 0, then

\lim_{M,N \to \infty} \tilde{e}(M, N) = 0,    (20)
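The plug-in detector (14)-(17) is equally short to implement. A sketch (function name mine) over an integer-coded alphabet; note that D(p̂_m, p̂_{-m}) may be +inf when p̂_{-m} has a zero where p̂_m does not, which the argmax handles naturally:

```python
import numpy as np

def plug_in_target(X, K=None):
    """Unknown-distributions detector: return the line m maximising
    D(p_hat_m, p_hat_{-m}), the KL distance between the empirical pmf of
    line m and the empirical pmf of all the other lines."""
    n_lines, N = X.shape
    if K is None:
        K = int(X.max()) + 1
    counts = np.stack([np.bincount(row, minlength=K) for row in X])
    p_m = counts / N                                              # eq. (15)
    p_rest = (counts.sum(axis=0) - counts) / ((n_lines - 1) * N)  # eq. (16)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_m > 0, p_m * np.log2(p_m / p_rest), 0.0)
    return int(np.argmax(terms.sum(axis=1)))                      # eq. (17)
```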

Figure 1: Estimates of the probability of error for various N, for the case p_0 = (0.9, 0.1), p_1 = (0.8, 0.2), M = 1000. The two plots correspond to the cases of known and unknown distributions, respectively. The red line represents the upper bound as established by Theorem 1.

and if \exists ε > 0 such that \lim_{M,N \to \infty} M 2^{-N(D(p_1,p_0) + ε)} = +\infty, then

\lim_{M,N \to \infty} \tilde{e}(M, N) = 1.    (21)

The proof uses the same large deviations techniques as the proof of Theorem 2, but is slightly more complex due to the fact that the rewards are not independent anymore. The proof appears in Section 6.

4 Simulations

We now provide simulations that show Theorems 1, 2 and 3 in action. We generated M = 1000 binary sequences with probabilities p_0 = (0.9, 0.1) and p_1 = (0.8, 0.2) for the background and the target, respectively. We varied the number N from 10 to 500, and we observed the probability of error decreasing to zero. We performed the random experiment 100 times for each value of N. The procedure was replicated 20 times in

Figure 2: Estimates of the probability of error for various N, for the case p_0 = (0.9, 0.1), p_1 = (0.7, 0.3), M = 1000. The two plots correspond to the cases of known and unknown distributions, respectively. The red line represents the upper bound as established by Theorem 1.

order to compute error bars. The plots in Figure 1 show the estimated probability of error versus N, for the two maximum likelihood detectors (known and unknown distributions, respectively), along with standard-deviation error bars. As expected, the error for the case of unknown distributions is somewhat higher, as there is an additional error due to the inaccuracy in estimating the two distributions. The KL divergence is D(p_1, p_0) ≈ 0.064. The dashed line shows the phase transition boundary, i.e., the value of N such that M = 2^{N D(p_1, p_0)}. For M = 1000, this value is equal to 155.5. For the known-distributions plot, the red line corresponds to the upper bound established by Theorem 1, and it is equal to 1000 · 0.98^N. Similar plots for the case p_0 = (0.9, 0.1) and p_1 = (0.7, 0.3) are shown in Figure 2. As expected, the error curves of Figure 1 are higher than the ones in Figure 2, since the former detection case is harder than the latter. The phase transition boundary is depicted in Figure 2 with the dashed line at the value N = 44.9. The upper bound of Theorem 1 is given by 1000 · 0.9349^N.
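The simulation protocol of this section is straightforward to reproduce at a smaller scale. A sketch for the known-distributions mle (function name mine; the paper's M = 1000 with 100 repetitions per N is heavier than needed for a quick check):

```python
import numpy as np

def error_rate(M, N, p0, p1, trials=100, seed=0):
    """Monte Carlo estimate of e(M, N) for the known-distributions mle."""
    rng = np.random.default_rng(seed)
    llr = np.log2(np.asarray(p1, float) / np.asarray(p0, float))
    errors = 0
    for _ in range(trials):
        X = rng.choice(len(p0), size=(M + 1, N), p=p0)  # distractors ~ p0
        t = int(rng.integers(M + 1))
        X[t] = rng.choice(len(p1), size=N, p=p1)        # target line ~ p1
        errors += int(np.argmax(llr[X].sum(axis=1)) != t)
    return errors / trials
```

Sweeping N across the boundary log2(M) / D(p_1, p_0) should reproduce the transition visible in Figures 1 and 2.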

5 Conclusions

We have considered a statistical model with M + 1 sequences of independent random variables, each of length N. All random variables have the same point mass function p_0, except for one sequence, the target, for which the common point mass function is p_1. The error of the maximum likelihood estimator for the target converges to 0 if there exists an ε > 0 such that M 2^{-N(D(p_1,p_0) - ε)} \to 0, and it converges to 1 if there exists an ε > 0 such that M 2^{-N(D(p_1,p_0) + ε)} \to +\infty. Moreover, when p_0 and p_1 are unknown, we are able to build an estimator of the target with the same performance; this allows us to study the important practical problem of outlier detection. We conjecture that these results can be generalized to the case of ergodic Markov chains, and we plan to report the more general results in a subsequent publication.

6 Proofs

Without loss of generality, we assume that the target line is line number 1.

Proof of Theorem 1:

e(M, N) = IP\Big( \max_{2 \leq m \leq M+1} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)} > \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)} \Big)    (22)

\leq M \, IP\Big( \sum_{n=1}^{N} \log \frac{p_1(X_2^n)}{p_0(X_2^n)} > \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)} \Big)    (23)

= M \, IP\Big( \prod_{n=1}^{N} \Big( \frac{p_1(X_2^n) p_0(X_1^n)}{p_0(X_2^n) p_1(X_1^n)} \Big)^s > 1 \Big), for all s > 0    (24)

\leq M \, E\Big[ \prod_{n=1}^{N} \Big( \frac{p_1(X_2^n) p_0(X_1^n)}{p_0(X_2^n) p_1(X_1^n)} \Big)^s \Big]    (25)

= M \Big( E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^s \Big] \Big)^N    (26)

where (25) is due to the Markov inequality.

Let us define

f(s) \triangleq E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^s \Big]    and    g(s) \triangleq \ln f(s).    (27)

One can check that f'(1/2) = g'(1/2) = 0. Moreover, using Hölder's inequality (Grimmett and Stirzaker, 1992), it is easy to show that, for any s, t > 0 and 0 \leq \alpha \leq 1,

E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^{\alpha s + (1-\alpha) t} \Big] \leq E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^s \Big]^{\alpha} \, E\Big[ \Big( \frac{p_1(X_2^1) p_0(X_1^1)}{p_0(X_2^1) p_1(X_1^1)} \Big)^t \Big]^{1-\alpha}.    (28)

By taking the log on both sides, we deduce that g is a convex function of s. Hence, it achieves its minimum value at s = 1/2; therefore, f achieves its minimum value at s = 1/2. This leads to the tightest upper bound in (26), i.e.,

e(M, N) \leq M f(1/2)^N = M \Big( \sum_x \sqrt{p_0(x) p_1(x)} \Big)^{2N}.    (29)

In order to prove Theorems 2 and 3, we start with two technical lemmas that will be useful later on.

Lemma 1  Let U and R be two rvs, and y \in \mathbb{R}. Then, for any ε > 0,

IP(U > y + ε) - IP(R < -ε) \leq IP(U + R > y) \leq IP(U > y - ε) + IP(R > ε).    (30)

Proof of Lemma 1:

IP(U + R > y) = IP(U + R > y, R \leq ε) + IP(U + R > y, R > ε)    (31)

\leq IP(U > y - ε) + IP(R > ε)    (32)
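The claim that s = 1/2 minimises f can be checked numerically: since X_1^1 and X_2^1 are independent, f factors into two Chernoff-type sums, and the product is symmetric under s \to 1 - s. A quick sketch (function name mine):

```python
import math

def f(s, p0, p1):
    """f(s) = E[(p1(X2) p0(X1) / (p0(X2) p1(X1)))**s] with X2 ~ p0, X1 ~ p1.
    By independence, f(s) = (sum_x p0^(1-s) p1^s) * (sum_x p1^(1-s) p0^s)."""
    a = sum(x ** (1 - s) * y ** s for x, y in zip(p0, p1))  # expectation over X2 ~ p0
    b = sum(y ** (1 - s) * x ** s for x, y in zip(p0, p1))  # expectation over X1 ~ p1
    return a * b

p0, p1 = [0.9, 0.1], [0.7, 0.3]
grid = [i / 100 for i in range(1, 100)]
s_star = min(grid, key=lambda s: f(s, p0, p1))  # grid minimiser of f
```

At s = 1/2 the product collapses to (sum_x sqrt(p0(x) p1(x)))**2, the quantity appearing in (29).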

and

IP(U + R \leq y) = IP(U + R \leq y, R < -ε) + IP(U + R \leq y, R \geq -ε)    (33)

\leq IP(U \leq y + ε) + IP(R < -ε),    (34)

which allows us to obtain the lower bound by computing the complementary event.

Lemma 2  Let V_1^N, ..., V_M^N be a sequence of M independent, identically distributed, discrete random variables. Moreover, assume that the following large deviation property holds for some z \in \mathbb{R}:

IP(V_1^N > z) \doteq 2^{-N I(z)},    where I(z) > 0 and a_N \doteq b_N means \lim_{N \to +\infty} \frac{1}{N} \log \frac{a_N}{b_N} = 0.    (35)

Then, if \exists ε > 0 s.t. \lim_{M,N \to +\infty} M 2^{-N(I(z) - ε)} = 0, then

\lim_{M,N \to +\infty} IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) = 0.    (36)

Also, if \exists ε > 0 s.t. \lim_{M,N \to +\infty} M 2^{-N(I(z) + ε)} = +\infty, then

\lim_{M,N \to +\infty} IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) = 1.    (37)

Proof of Lemma 2: Let ε > 0 be arbitrarily small. Then, there exists N_0(ε) > 0 such that for all N > N_0,

\Big| \frac{1}{N} \log IP(V_1^N > z) + I(z) \Big| < ε.    (38)

To prove the first part, suppose that the conclusion fails, i.e., there exists δ > 0 such that for every N_0 > 0 there is an N > N_0 with IP( \max_{1 \leq m \leq M} V_m^N > z ) > δ.

Then, using the union bound, we obtain, for every such N,

δ < IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) \leq \sum_{m=1}^{M} IP(V_m^N > z) = M \, IP(V_1^N > z).    (39)

By picking N_0 > N_0(ε), (39) becomes M 2^{-N(I(z) - ε)} > δ along a subsequence. Hence, M 2^{-N(I(z) - ε)} does not converge to zero for any ε > 0, contradicting the hypothesis, as required.

To prove the second part, we first assume that N > N_0(ε), as above. Then,

IP\big( \max_{1 \leq m \leq M} V_m^N > z \big) = 1 - IP\big( \max_{1 \leq m \leq M} V_m^N \leq z \big)    (40)

= 1 - \big( 1 - IP(V_1^N > z) \big)^M    (41)

= 1 - 2^{M \log ( 1 - IP(V_1^N > z) )}    (42)

\geq 1 - 2^{-M \, IP(V_1^N > z)}    (43)

\geq 1 - 2^{-M 2^{-N(I(z) + ε)}},    (44)

where the first inequality is a consequence of the inequality \log(1 - x) \leq -x, and the second inequality arises from (38). Note that (44) is true for any arbitrary ε > 0. Hence, if there exists ε > 0 such that M 2^{-N(I(z) + ε)} \to +\infty, then necessarily IP( \max_{1 \leq m \leq M} V_m^N > z ) \to 1. This concludes the proof of the second part, and the proof of the lemma.

We are now ready for the proof of Theorem 2.

Proof of Theorem 2:

e(M, N) = IP\Big( \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)} > \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)} \Big)    (45)

= IP\big( U_M^N + R^N > D(p_1, p_0) \big), where    (46)

U_M^N = \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)}    and    (47)

R^N = D(p_1, p_0) - \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_1^n)}{p_0(X_1^n)}.    (48)

From the law of large numbers, R^N \to 0 in probability. Hence, for all η > 0 and α > 0, and for sufficiently large N, using Lemma 1,

IP(U_M^N > D(p_1, p_0) + η) - α \leq e(M, N) \leq IP(U_M^N > D(p_1, p_0) - η) + α.    (49)

Let us define

V_m^N \triangleq \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_1(X_m^n)}{p_0(X_m^n)}.    (50)

Now, using Sanov's theorem (Dembo and Zeitouni, 1998),

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p_1, p_0)}.    (51)

Indeed,

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p^*, p_0)},    (52)

where

D(p^*, p_0) = \inf_{p \in C} D(p, p_0),    with    C = \{ p : E_p \log \frac{p_1}{p_0} \geq D(p_1, p_0) \},    (53)

and for p \in C,

D(p, p_0) = E_p \log \frac{p}{p_0} = D(p, p_1) + E_p \log \frac{p_1}{p_0}    (54)

\geq D(p, p_1) + D(p_1, p_0) \geq D(p_1, p_0).    (55)

Now, by continuity of the rate function, there exists ε_1 > 0 such that

IP(V_2^N > D(p_1, p_0) - η) \doteq 2^{-N(D(p_1, p_0) - ε_1)}    (56)

and there exists ε_2 > 0 such that

IP(V_2^N > D(p_1, p_0) + η) \doteq 2^{-N(D(p_1, p_0) + ε_2)}.    (57)

Finally, since

U_M^N = \max_{2 \leq m \leq M+1} V_m^N    (58)

and the rvs V_2^N, ..., V_{M+1}^N are iid, we obtain the required result from Lemma 2.

Proof of Theorem 3: We proceed along the same lines as for the proof of Theorem 2.

\tilde{e}(M, N) = IP\Big( \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_m(X_m^n)}{\hat{p}_{-m}(X_m^n)} > \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_1(X_1^n)}{\hat{p}_{-1}(X_1^n)} \Big)    (59)

\leq IP\big( U_M^N + R_M^N > D(p_1, p_0) \big), with    (60)

U_M^N = \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_m(X_m^n)}{p_0(X_m^n)},    (61)

R_M^N = A_M^N + B^N + C^N,    (62)

A_M^N = \max_{2 \leq m \leq M+1} \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_0(X_m^n)}{\hat{p}_{-m}(X_m^n)},    (63)

B^N = \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_{-1}(X_1^n)}{p_0(X_1^n)},    (64)

C^N = D(p_1, p_0) - \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_1(X_1^n)}{p_0(X_1^n)}.    (65)

For all η > 0, from Lemma 1,

\tilde{e}(M, N) \leq IP(U_M^N > D(p_1, p_0) - η) + IP(R_M^N > η).    (66)

Let

V_m^N = \frac{1}{N} \sum_{n=1}^{N} \log \frac{\hat{p}_m(X_m^n)}{p_0(X_m^n)}, \quad 2 \leq m \leq M+1.    (67)

Using Sanov's theorem,

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p_1, p_0)}.    (68)

Indeed,

IP(V_2^N \geq D(p_1, p_0)) \doteq 2^{-N D(p^*, p_0)},    (69)

where

D(p^*, p_0) = \inf_{p \in C} D(p, p_0),    with    C = \{ p : E_p \log \frac{p}{p_0} \geq D(p_1, p_0) \}.    (70)

And for p \in C,

D(p, p_0) = E_p \log \frac{p}{p_0} \geq D(p_1, p_0).    (71)

Now, by continuity of the rate function, there exists ε_1 > 0 such that

IP(V_2^N > D(p_1, p_0) - η) \doteq 2^{-N(D(p_1, p_0) - ε_1)}.    (72)

To show that (66) approaches zero as M, N \to \infty with M 2^{-N(D(p_1,p_0) - ε)} \to 0, it suffices to prove that R_M^N \to 0, since the term IP(U_M^N > D(p_1, p_0) - η) of (66) goes to zero by virtue of (72), Lemma 2, and the fact that

U_M^N = \max_{2 \leq m \leq M+1} V_m^N,    (73)

and the rvs V_2^N, ..., V_{M+1}^N are iid. Using the law of large numbers, C^N \to 0 in probability. Also,

IP(A_M^N > η) \leq M \, IP\Big( \frac{1}{N} \sum_{n=1}^{N} \log \frac{p_0(X_2^n)}{\hat{p}_{-2}(X_2^n)} > η \Big)    (74)

\leq M \, IP\Big( \max_{1 \leq n \leq N} \log \frac{p_0(X_2^n)}{\hat{p}_{-2}(X_2^n)} > η \Big)    (75)

\leq M \, IP\Big( \max_{x} \log \frac{p_0(x)}{\hat{p}_{-2}(x)} > η \Big)    (76)

\leq M \, |\mathcal{X}| \, \max_{x} IP\Big( \log \frac{p_0(x)}{\hat{p}_{-2}(x)} > η \Big)    (77)

= M \, |\mathcal{X}| \, \max_{x} IP\big( \hat{p}_{-2}(x) < 2^{-η} p_0(x) \big)    (78)

\leq M \, |\mathcal{X}| \, \max_{x} 2^{-MN I(x, η)} = M \, |\mathcal{X}| \, 2^{-MN J(η)},    (79)

where I(x, η) > 0 is a rate function, J(η) = \min_x I(x, η), and (77) follows from the union bound over the finite alphabet \mathcal{X}. The last inequality comes from the fact that \hat{p}_{-2}(x) \to p_0(x) in probability. A similar argument shows that B^N \to 0 in probability as well. Note that the result would still hold if we replaced \hat{p}_{-m} with \hat{p}, i.e., with the empirical distribution over the full data.

References

[1] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Springer-Verlag, 1998.

[2] D. Geman and B. Jedynak. An active testing model for tracking roads from satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1):1-14, January 1996.

[3] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford Science Publications, 1992.

[4] Alan L. Yuille and James M. Coughlan. Fundamental limits of Bayesian inference: Order parameters and phase transitions for road tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):160-173, February 2000.

[5] Alan L. Yuille, James M. Coughlan, Yingnian Wu, and Song Chun Zhu. Order parameters for detecting target curves in images: When does high level knowledge help? International Journal of Computer Vision, 41:9-33, 2001.