Anonymous Heterogeneous Distributed Detection: Optimal Decision Rules, Error Exponents, and the Price of Anonymity


Wei-Ning Chen and I-Hsiang Wang

arXiv:1805.03554v2 [cs.IT] 29 Jul 2018

The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Vail, Colorado, USA, June 2018. W.-N. Chen is with the Graduate Institute of Communication Engineering, National Taiwan University, Taipei 10617, Taiwan (email: r05942078@ntu.edu.tw). I.-H. Wang is with the Department of Electrical Engineering and the Graduate Institute of Communication Engineering, National Taiwan University, Taipei 10617, Taiwan (email: ihwang@ntu.edu.tw).

Abstract: We explore the fundamental limits of heterogeneous distributed detection in an anonymous sensor network with n sensors and a single fusion center. The fusion center collects a single observation from each of the n sensors in order to detect a binary parameter. The sensors are clustered into multiple groups, and different groups follow different distributions under a given hypothesis. The key challenge for the fusion center is the anonymity of the sensors: although it knows the exact number of sensors and the distribution of observations in each group, it does not know which group each sensor belongs to. It is hence natural to formulate the problem as a composite hypothesis testing problem. First, we propose an optimal test, called the mixture likelihood ratio test, which is a randomized threshold test based on the ratio of the uniform mixture of all possible distributions under one hypothesis to that under the other hypothesis. Optimality is shown by first arguing that there exists an optimal test that is symmetric, that is, one that does not depend on the order of observations across the sensors, and then proving that the mixture likelihood ratio test is optimal among all symmetric tests. Second, we focus on the Neyman-Pearson setting and characterize the error exponent of the worst-case type-II error probability as n tends to infinity, assuming the number of sensors in each group is proportional to n. Finally, we generalize our result to find the collection of all achievable pairs of type-I and type-II error exponents, showing that the boundary of the region can be obtained by solving a convex optimization problem. Our results elucidate the price of anonymity in heterogeneous distributed detection, and extend to M-ary hypothesis testing with heterogeneous observations generated according to hidden latent variables. The results are also applied to distributed detection under Byzantine attacks, and hint that the conventional approach based on simple hypothesis testing might be too pessimistic.

I. INTRODUCTION

In wireless sensor networks, the cost of identifying individual sensors increases drastically as the number of sensors grows. For distributed detection [1], when the observations are independent and identically distributed (i.i.d.) across all sensors, identifying individual sensors is not very important. When the fusion center can fully access

the observations, the empirical distribution (type) of the collected observations is a sufficient statistic. When the communication between each sensor and the fusion center is limited, for binary hypothesis testing it is asymptotically optimal to use the same local decision function at all sensors [2]. Hence, anonymity is not a critical issue for the classical (homogeneous) distributed detection problem. However, when the joint distribution of the observations is heterogeneous, that is, when the marginal distributions of the observations vary across sensors, sensor anonymity may deteriorate the performance of distributed detection, even for binary hypothesis testing. One such example is distributed detection under Byzantine attack [3], where a fixed number of sensors are compromised by malicious attackers and report fake observations following certain distributions. Even if the fusion center is aware of the number of compromised sensors and of the attacking strategy that renders the worst-case detection performance (the least favorable distribution, as considered in [4]–[6]), it is more difficult to detect the hidden parameter when the fusion center does not know which sensors are compromised. In this paper, we aim to quantify the performance loss due to sensor anonymity in heterogeneous distributed detection, with n sensors and a single fusion center. Each sensor (say sensor i, i ∈ {1,..., n}) has a single random observation X_i. The goal of the fusion center is to estimate the hidden parameter θ ∈ {0, 1} (that is, binary hypothesis testing) from the collected observations. The distributions of the observations, however, are heterogeneous: observations at different sensors may follow different distributions. In particular, we assume that the n sensors are clustered into K groups {I_1,..., I_K}, and that group I_k ⊆ {1,..., n} comprises n_k sensors, for k = 1,..., K. Under hypothesis H_θ, θ ∈ {0, 1}, X_i ~ P_{θ;k} for i ∈ I_k. Moreover, the sensors are anonymous, that is, the collected observations at the fusion center are unordered. In other words, although the fusion center is fully aware of the heterogeneity of its observations, including the set of distributions {P_{θ;k} : θ ∈ {0, 1}, k = 1,..., K} and the group sizes {n_k : k = 1,..., K}, it does not know which distribution each individual sensor follows. To address this lack of knowledge about the exact distributions of the observations, we formulate the detection problem as a composite hypothesis testing problem, where the vector observation of length n follows a product distribution within a finite class of n-letter product distributions under a given parameter θ. The class consists of n!/(n_1!···n_K!) possible product distributions, each corresponding to one of the n!/(n_1!···n_K!) possible partitions of the sensors. The fusion center takes all possible partitions into consideration when detecting the hidden parameter. We mainly focus on a Neyman-Pearson setting, where the goal is to minimize the worst-case type-II error probability subject to the worst-case type-I error probability not exceeding a constant. Towards the end of the paper, we also extend our results to a Bayesian setting, where a binary prior distribution is placed on H_0 and H_1. Our main contribution comprises three parts. First, we develop an optimal test, termed the mixture likelihood ratio test (MLRT), for the anonymous heterogeneous distributed detection problem. The MLRT is a randomized threshold test based on the ratio of the uniform mixture of all possible distributions under hypothesis H_1 to the uniform mixture of those under H_0.
To prove the optimality, we first argue that there exists an optimal test that is symmetric, that is, it does not depend on the order of observations across the sensors, and thus we only need to consider tests

which depend on the histogram of observations. In other words, the histogram of observations contains sufficient information for optimal detection. Moreover, all possible distributions over the observation space X^n under H_0 (or H_1) turn out to induce the same distribution over the space of histograms, so if we test the hypothesis according to the histogram, the original composite hypothesis testing problem boils down to a simple hypothesis testing problem. The one-to-one correspondence between symmetric tests and tests defined on the histogram is the key to deriving the optimal test. This result extends to M-ary hypothesis testing with heterogeneous observations generated according to hidden latent variables, each of which is associated with one observation, where the decision maker only knows the histogram of the latent variables. Second, for the case that the alphabet X is a finite set, we characterize the error exponent of the minimum worst-case type-II error probability as n → ∞ with the ratios n_k/n → α_k, k = 1,..., K. The optimal error exponent turns out to be the minimum of a linear combination of Kullback-Leibler (KL) divergences, with the k-th term being D(U_k ‖ P_{1;k}) and α_k being its coefficient, for k = 1,..., K. The minimization is over all possible distributions U_1,..., U_K such that ∑_{k=1}^K α_k U_k = ∑_{k=1}^K α_k P_{0;k}. In a simple hypothesis testing problem with i.i.d. observations, a standard approach to deriving the type-II error exponent is to invoke a strong converse lemma (see, for example, Chapter 2 in [7]) to relate the type-I and type-II error probabilities of an optimal test, and then to apply the large deviation toolkit to the optimal test to single-letterize and find the exponent. In contrast, in our problem, neither can the mixture distributions in the optimal test be decomposed into a product form, nor can the acceptance region be bounded by a large-deviation event, so this approach fails to characterize the error exponent. To circumvent these difficulties, we turn to the method of types and use bounds on types (empirical distributions) for single-letterization. For achievability, instead of the optimal MLRT, which is difficult to single-letterize, we employ a simpler test that resembles Hoeffding's test [8]. For the converse, we use an argument based on the method of types. We propose a generalized divergence D_{α_1,...,α_K}(P_1,..., P_K; Q_1,..., Q_K) from a group of distributions {Q_1,..., Q_K} to another group of distributions {P_1,..., P_K}, which plays a role similar to that of the KL divergence in simple hypothesis testing problems. The key to the characterization of the optimal error exponent is to prove a generalized Sanov theorem for the composite setting we consider. Based on the characterized error exponent, given the number of bits that a sensor can send to the fusion center, one can also formulate an optimization problem to find the best local decision functions, as in the homogeneous case [2]. Finally, we extend our results from the Neyman-Pearson setting to a Bayesian setting, minimizing the average probability of error (that is, a combination of the type-I and type-II errors). The optimal test turns out to be computationally infeasible, since it involves a summation over all possible permutations. To overcome this complexity issue, we propose an asymptotically optimal test based on information geometry, which achieves the same error exponent of the average probability of error as the optimal test. We also study the exponent region R, the collection of all achievable pairs of type-I and type-II error exponents.
In particular, we propose a way to parametrize the contour of R based on information projection. However, the closed-form expression of R involves an explicit solution of a convex optimization problem, which remains unsettled. As a by-product, we apply our results for K = 2 to the distributed detection problem under Byzantine attack

and further obtain bounds on the worst-case type-II error exponent. Compared with the worst-case exponent in an alternative Bayesian formulation [3], where the observations of the sensors are assumed to be i.i.d. according to a mixture distribution, the worst-case exponent in the composite testing formulation is shown to be strictly larger. This hints that the conventional approach taken in [3] might be too pessimistic.

Related Works

Decentralized detection is a classical topic and has attracted extensive attention in recent years due to its applications in wireless sensor networks; see, for example, [1], [2], [6], [9]. Most works in decentralized detection focus on finding optimal local decision functions in both the Neyman-Pearson and the Bayesian regimes. Under some assumptions on the distributions of a given hypothesis, optimal design criteria for the local decision functions and for the decision rule at the fusion center are given. Unlike the anonymous setting considered in our work, these classical works assume that the fusion center, as well as the local sensors, has perfect knowledge about the joint distribution, and hence the decision rules are designed according to it. This is termed the "informed" setting in our paper and is used as a baseline against which the price of anonymity is measured. In our setting, on the other hand, the fusion center collects observations without knowing the exact index of each one, and thus the problem is formulated as a composite hypothesis testing problem. Composite hypothesis testing is a long-standing problem in statistics, for which it is notoriously difficult to find an optimal test. In general, the uniformly most powerful (UMP) test does not exist; see, for example, Section 8.3 in [10]. Even if we relax the performance criterion to the minimax regime, the general form of the optimal test is still unknown, except for some special cases. For example, [5] considered the case in which the composite hypothesis class H_θ is formed by all ɛ-contaminated distributions of P_θ, that is, {(1 − ɛ)P_θ + ɛQ : Q a possible distribution}. Under this structure, Huber showed that a censored version of the likelihood ratio test is optimal in the minimax regime. Other works such as [8], [11] followed the idea of Hoeffding's test [8] and proposed universal asymptotically optimal tests when the null hypothesis is simple. In our setting, however, neither is the parameter space of the considered distributions continuous, nor is the null hypothesis simple, making their approaches hard to extend. Another common test for composite hypothesis testing is the generalized likelihood ratio test (GLRT). The optimality of the GLRT is guaranteed under some circumstances; see, for example, [12]. However, the results in [12] hold only for a simple null and a composite alternative. In contrast, our result indicates that the GLRT is not optimal in our setting. The concept of Byzantine attack can be traced back to [13] (known as the "Byzantine Generals Problem"), in which the reliability of a computer system with malfunctioning components is studied. Since then, the Byzantine model has been developed and generalized in several research areas, especially in communication security. For example, distributed detection with Byzantine attacks is studied under the Neyman-Pearson formulation in [3] and under the Bayesian setting in [14].
In their settings, each sensor is assumed to be compromised with probability α, so the observations turn out to be drawn independently and identically from a mixture distribution, making the hypothesis testing problem simple, so that the Neyman-Pearson lemma can be applied. In contrast, in our work we assume the number of Byzantine sensors is fixed and equal to αn, where n is the total number of sensors, and thus the problem falls into the composite hypothesis testing framework instead of the mixture setting.

This work was presented in part at ISIT 2018. In the conference version [15], upper and lower bounds on the type-II error exponent were given, where the lower bound (achievability) is based on a modified version of Hoeffding's test, and the upper bound (converse) is derived by relaxing the original problem into a simple hypothesis testing problem. In this journal version, we show that the achievability bound in the conference version is indeed tight, closing the gap between the upper and lower bounds.

The rest of this paper is organized as follows. In Section II, we formulate the composite hypothesis testing problem for anonymous heterogeneous distributed detection and provide some background. In Section III, the main results are provided, with the proofs delegated to Sections IV and V. In Section VI, we generalize the results to the Bayesian setting, and in Section VII, we briefly discuss the case in which X is not finite and the case in which partial information about the group assignment is available at the fusion center. Finally, we conclude the paper with some further directions and open questions in Section VIII.

II. PROBLEM FORMULATION AND PRELIMINARIES

A. Problem Setup

Following the description of the setting in Section I, let us formulate the composite hypothesis testing problem. Let σ(i) denote the label of the group that sensor i belongs to. This labeling σ(·), however, is not revealed to the fusion center. Hence, the fusion center needs to consider all n!/(n_1!···n_K!) possible labelings σ : {1,..., n} → {1,..., K} satisfying

|{i : σ(i) = k}| = n_k, for k = 1,..., K,   (1)

and decide whether the hidden θ is 0 or 1. For notational convenience, let ν denote the vector [n_1 ... n_K], and let S_{n,ν} denote the collection of all labelings satisfying (1). The fusion center is hence faced with the following composite hypothesis testing problem, where the goal is to infer the parameter θ:

H_θ : X^n ~ P_{θ;σ} ≜ ∏_{i=1}^n P_{θ;σ(i)}, for some σ ∈ S_{n,ν}.

As mentioned in Section I, throughout the paper we consider binary hypothesis testing, that is, θ ∈ {0, 1}. Let each single observation take values in some measurable space (X, F), where F is a σ-algebra on X. Hence P_{θ;k} ∈ P_X for all θ ∈ {0, 1} and k ∈ {1,..., K}, where P_X denotes the collection of all probability distributions on (X, F). The vector observation x^n is defined on the space (X^n, F^n), where F^n is the tensor-product σ-algebra of F, that is, the smallest σ-algebra containing the collection of events {E_1 × E_2 × ··· × E_n : E_i ∈ F}. A (randomized) test is a measurable function φ : (X^n, F^n) → ([0, 1], B), where B denotes the Borel σ-field on R. The worst-case type-I and type-II error probabilities of a decision rule φ are defined as

P_F^{(n)}(φ) ≜ max_{σ ∈ S_{n,ν}} E_{P_{0;σ}}[φ(X^n)]   (type I),
P_M^{(n)}(φ) ≜ max_{σ ∈ S_{n,ν}} E_{P_{1;σ}}[1 − φ(X^n)]   (type II).
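As a concrete illustration of these definitions (not part of the original paper), the following Python sketch enumerates S_{n,ν} and evaluates the worst-case type-I and type-II error probabilities of a given deterministic test by exhaustive enumeration; the group sizes, Bernoulli parameters, and the majority-vote test are arbitrary placeholder choices.

```python
import itertools
import numpy as np

# Toy instance (hypothetical numbers): K = 2 groups, binary alphabet X = {0, 1}.
nu = [2, 1]                          # nu = [n_1, ..., n_K], so n = 3 sensors
P = {0: [0.6, 0.3],                  # P[theta][k]: P_{theta;k} = Ber(p), given by P(X = 1)
     1: [0.8, 0.8]}

def labelings(nu):
    """Enumerate S_{n,nu}: all sigma with |{i : sigma(i) = k}| = n_k."""
    base = [k for k in range(len(nu)) for _ in range(nu[k])]
    return set(itertools.permutations(base))    # distinct orderings of the multiset

def worst_case_errors(phi, nu):
    """Worst-case type-I / type-II errors of a deterministic test phi, by enumeration."""
    n = sum(nu)
    pf, pm = 0.0, 0.0
    for sigma in labelings(nu):
        pf_s = pm_s = 0.0
        for x in itertools.product([0, 1], repeat=n):
            p0 = np.prod([P[0][k] if xi else 1 - P[0][k] for xi, k in zip(x, sigma)])
            p1 = np.prod([P[1][k] if xi else 1 - P[1][k] for xi, k in zip(x, sigma)])
            pf_s += p0 * phi(x)                  # declare H_1 under H_0
            pm_s += p1 * (1 - phi(x))            # declare H_0 under H_1
        pf, pm = max(pf, pf_s), max(pm, pm_s)
    return pf, pm

# Example: a naive majority-vote test (purely illustrative, not the optimal MLRT).
print(worst_case_errors(lambda x: float(sum(x) >= 2), nu))
```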

Our focus is on the Neyman-Pearson setting: find a decision rule φ satisfying P_F^{(n)}(φ) ≤ ɛ such that P_M^{(n)}(φ) is minimized. Let β^{(n)}(ɛ, ν) denote the minimum type-II error probability. For the asymptotic regime, we assume that the ratio n_k/n → α_k as n → ∞ for all k = 1,..., K, with ∑_{k=1}^K α_k = 1. We aim to explore whether β^{(n)}(ɛ, ν) decays exponentially fast as n → ∞, and to characterize the corresponding error exponent. For notational convenience, we define upper and lower bounds on the exponent:

Ē(ɛ, α) ≜ limsup_{n→∞} {−(1/n) log_2 β^{(n)}(ɛ, ν)},   E̲(ɛ, α) ≜ liminf_{n→∞} {−(1/n) log_2 β^{(n)}(ɛ, ν)},

where in taking the limits we assume that lim_{n→∞} n_k/n = α_k for all k = 1,..., K. If the upper and lower bounds match, we simply denote the common value by E*(ɛ, α).

Remark 2.1. The original distributed detection problem [1], [2], [6] involves local decision functions at the sensors to address the limited communication between each sensor and the fusion center. In order to focus on the impact of anonymity, we first absorb them into the distributions {P_{θ;k} : k = 1,..., K}, because they are symbol-by-symbol maps. Later, we will discuss how to find the best local decision functions according to the characterized error exponent.

B. Notations

Let us introduce the notations that will be used throughout this paper.
- n denotes the total number of observations, and K denotes the number of groups of sensors.
- ν ≜ [n_1 ... n_K] denotes the numbers of sensors in the K groups. That is, n_k ≥ 0, n_k ∈ Z, and ∑_{k=1}^K n_k = n.
- α ≜ [α_1 ... α_K] denotes the asymptotic fraction of each group among all sensors. That is, α_k ≥ 0 and ∑_{k=1}^K α_k = 1.
- σ : {1,..., n} → {1,..., K} is the labeling function which assigns each sensor index to a group. We also denote the collection of indices of sensors in group k by I_k = σ^{-1}(k) ≜ {i : σ(i) = k}.   (2)
- S_{n,ν} is the collection of all σ satisfying (2). We also use S_n to denote the collection of length-n permutations: S_n ≜ {τ : {1, 2,..., n} → {1, 2,..., n}, τ bijective}. Note that the cardinalities of the two sets are |S_{n,ν}| = n!/(n_1! n_2! ··· n_K!) and |S_n| = n!.
- We usually write P_θ for the vector of the {P_{θ;k}}: P_θ ≜ [P_{θ;1}, P_{θ;2},..., P_{θ;K}]^T.

C. Method of Types

For a sequence x^n ∈ X^n, where X = {a_1, a_2,..., a_d}, its type (empirical distribution) is defined as

Π_{x^n} = [π(a_1 | x^n), π(a_2 | x^n),..., π(a_d | x^n)],

where π(a_i | x^n) is the frequency of a_i in the sequence x^n, that is, π(a_i | x^n) = (1/n) ∑_{j=1}^n 1{x_j = a_i}. For a given length n, we use P_n to denote the collection of possible types of length-n sequences. In other words,

P_n ≜ {[i_1/n, i_2/n,..., i_d/n] : i_1,..., i_d ∈ N ∪ {0}, i_1 + i_2 + ··· + i_d = n}.

Let U ∈ P_n be an n-type. The type class T_n(U) is the set of all length-n sequences with type U,

T_n(U) ≜ {x^n ∈ X^n : Π_{x^n} = U}.

Let us introduce some useful lemmas about types.

Lemma 2.1 (Cardinality Bound of P_n). |P_n| ≤ (n + 1)^{|X|}. In words, |P_n| grows polynomially in n.

Lemma 2.2 (Probability of a Type Class). Let P ∈ P_n and Q ∈ P_X. Then

(n + 1)^{−|X|} 2^{−n D(P‖Q)} ≤ Q^n(T_n(P)) ≤ 2^{−n D(P‖Q)}.

For finite X, P_X can be viewed as a subset of R^d endowed with the Euclidean metric and the standard topology. The following result, due to Sanov, characterizes the probability of a large-deviation event.

Lemma 2.3 (Sanov's Theorem). Let Γ ⊆ P_X. Then

−inf_{T ∈ int Γ} D(T‖Q) ≤ liminf_{n→∞} (1/n) log Q^n{x^n : Π_{x^n} ∈ Γ} ≤ limsup_{n→∞} (1/n) log Q^n{x^n : Π_{x^n} ∈ Γ} ≤ −inf_{T ∈ cl Γ} D(T‖Q),   (3)

where int Γ and cl Γ respectively denote the interior and the closure of Γ with respect to the standard topology on R^d. In particular, if the infimum on the right-hand side of (3) equals the infimum on the left-hand side, then

lim_{n→∞} (1/n) log Q^n{x^n : Π_{x^n} ∈ Γ} = −inf_{T ∈ Γ} D(T‖Q).

Proofs of the above lemmas can be found in standard information theory textbooks, for example Chapter 11 in [16]. Alternatively, a more rigorous proof of Sanov's theorem (Lemma 2.3) can be found in [17].
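The following small Python example (added in this version for illustration; the alphabet, the distribution Q, and the block length are made-up values) computes the type of a sequence and numerically checks the two-sided bound of Lemma 2.2 on the probability of a type class.

```python
import math
from collections import Counter
from itertools import product

X = ['a', 'b']                        # finite alphabet, |X| = 2
Q = {'a': 0.7, 'b': 0.3}              # generating distribution Q
n = 8

def type_of(x):
    c = Counter(x)
    return tuple(c[s] / len(x) for s in X)

def kl(p, q):
    """KL divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = (0.5, 0.5)                        # an n-type: 4 a's and 4 b's out of n = 8
# Exact probability Q^n(T_n(P)) by brute-force enumeration of X^n.
prob = sum(math.prod(Q[s] for s in x)
           for x in product(X, repeat=n) if type_of(x) == P)

D = kl(P, tuple(Q[s] for s in X))
lower = (n + 1) ** (-len(X)) * 2 ** (-n * D)   # lower bound of Lemma 2.2
upper = 2 ** (-n * D)                          # upper bound of Lemma 2.2
print(lower <= prob <= upper, (lower, prob, upper))
```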

III. MAIN RESULTS

As mentioned in Section II, the observations come from the measurable space (X^n, F^n). Throughout the rest of the paper, we assume that X is a totally ordered set and that F^n satisfies the following two assumptions:

1) F^n contains the set

X̃^n ≜ {(x_1, x_2,..., x_n) : x_1 ≤ x_2 ≤ ... ≤ x_n}.   (4)

2) F^n is closed under permutation. That is, if A ∈ F^n, then for any length-n permutation τ : {1,..., n} → {1,..., n},

A_τ ≜ {(x_{τ(1)},..., x_{τ(n)}) : (x_1,..., x_n) ∈ A} ∈ F^n.   (5)

Remark 3.1. We assume that X is a totally ordered set so that the set X̃^n of sorted sequences can be required to be measurable. The purpose of requiring X̃^n to be measurable is to preserve the measurability of the ordering map Π(·), defined later in Definition 4.1. In general, if X is not totally ordered, we can still require the collection of representatives of the equivalence classes induced by Π to be measurable; however, the regularity assumptions on F need to be examined carefully in that case.

Remark 3.2. The second assumption always holds for tensor-product σ-fields. The first assumption typically holds too. For example, if X is finite, we can simply choose F to be the power set 2^X, and if X ⊆ R, we can choose F to be the Borel σ-field. In particular, when X is a finite set, it is straightforward to define a total order on it, and hence it is a totally ordered set; moreover, the above two assumptions are automatically satisfied.

A. Main Contributions

Our first contribution is the characterization of the optimal test.

Theorem 3.1 (Optimal Test). Define the mixture likelihood ratio l(x^n):

l(x^n) ≜ (∑_{σ ∈ S_{n,ν}} P_{1;σ}(x^n)) / (∑_{σ ∈ S_{n,ν}} P_{0;σ}(x^n)).   (6)

Suppose F^n satisfies the two assumptions (4), (5). Then an optimal test φ*(x^n) takes the following form:

φ*(x^n) = 1 if l(x^n) > τ;  γ if l(x^n) = τ;  0 if l(x^n) < τ.   (7)

That is, for any test φ with P_F(φ) ≤ P_F(φ*), we have P_M(φ) ≥ P_M(φ*).

Remark 3.3. We see that the optimal test, the MLRT, is the likelihood ratio test between the two uniform mixture distributions (1/|S_{n,ν}|) ∑_{σ ∈ S_{n,ν}} P_{θ;σ}, θ ∈ {0, 1}.
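For small n, the mixture likelihood ratio (6) and the threshold rule (7) can be evaluated by brute force over S_{n,ν}. The Python sketch below (an illustration added here, with arbitrary Bernoulli parameters and a deterministic threshold, i.e., the randomization γ is omitted) does exactly that.

```python
import itertools
import numpy as np

nu = [2, 1]                              # group sizes [n_1, n_2], so n = 3
P = {0: [0.6, 0.3], 1: [0.8, 0.8]}       # P[theta][k]: P_{theta;k} = Ber(p), given by P(X = 1)

def labelings(nu):
    """All labelings sigma in S_{n,nu}, as tuples of group indices."""
    base = [k for k in range(len(nu)) for _ in range(nu[k])]
    return set(itertools.permutations(base))

def prod_likelihood(x, sigma, theta):
    """P_{theta;sigma}(x^n) = prod_i P_{theta;sigma(i)}(x_i)."""
    return np.prod([P[theta][k] if xi else 1 - P[theta][k] for xi, k in zip(x, sigma)])

def mixture_lr(x, nu):
    """l(x^n) in (6): the |S_{n,nu}| normalizations cancel in the ratio."""
    sigmas = labelings(nu)
    num = sum(prod_likelihood(x, s, 1) for s in sigmas)
    den = sum(prod_likelihood(x, s, 0) for s in sigmas)
    return num / den

def mlrt(x, nu, tau):
    """Deterministic version of the threshold test (7)."""
    return 1 if mixture_lr(x, nu) > tau else 0

print(mixture_lr((1, 1, 0), nu), mlrt((1, 1, 0), nu, tau=1.0))
```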

Interestingly, the optimality of the MLRT also indicates that the widely used generalized likelihood ratio test (GLRT), defined as the randomized threshold test based on the likelihood ratio

l_GLRT(x^n) ≜ (sup_{σ ∈ S_{n,ν}} P_{1;σ}(x^n)) / (sup_{σ ∈ S_{n,ν}} P_{0;σ}(x^n)),

is strictly sub-optimal in the anonymous hypothesis testing problem.

Sketch of proof: The proof consists of two steps. In the first step, we introduce symmetric tests (defined later in Definition 4.2), which do not depend on the order of the observations, and we show that among all symmetric tests, (7) is optimal. The key is to reduce the original composite hypothesis testing problem to a simple one through the ordering map Π(x^n) in Definition 4.1, and then apply the Neyman-Pearson lemma. In the second step, we prove that any test ψ can be symmetrized into a symmetric test φ that performs at least as well as ψ, so (7) is optimal among all tests. The symmetrized test φ is constructed by assigning values on the equivalence classes induced by the ordering map Π(·), so its measurability needs to be examined carefully. For the detailed proof, please refer to Section IV.

Our second result specifies the exponent of the type-II error probability in the Neyman-Pearson formulation, which does not depend on the type-I error probability ɛ.

Theorem 3.2 (Asymptotic Behavior). Consider the case |X| < ∞. The exponent of the type-II error probability is characterized as follows:

E*(ɛ, α) = min_{U ∈ (P_X)^K} ∑_{k=1}^K α_k D(U_k ‖ P_{1;k})   subject to ∑_{k=1}^K α_k U_k = ∑_{k=1}^K α_k P_{0;k}.   (8)

Remark 3.4. A standard way to derive the exponent of the type-II error probability is to identify the acceptance region (of H_0) of the optimal test (7) as a large-deviation event under H_1, and then apply a strong converse lemma to obtain a bound. However, the mixture measure ∑_σ P_{θ;σ}, θ ∈ {0, 1}, cannot be factorized into a product form, which makes it hard to single-letterize. Instead, under the additional assumption that X is finite, we can utilize the method of types, in particular Sanov's theorem, to circumvent the difficulty.

Sketch of proof: For the achievability part, we propose a sub-optimal test based on Hoeffding's result [8], in which we accept x^n satisfying D(Π_{x^n} ‖ M_0(α)) ≤ δ for some threshold δ > 0, where M_0(α) ≜ ∑_k α_k P_{0;k} denotes the mixture under H_0. We apply tools from the method of types to bound the type-I and type-II error probabilities, showing that (8) is achievable. For the converse part, given an arbitrary test, we define its acceptance region as A (if the given test is randomized, we round the test at 1/2 to make it deterministic, that is, we accept H_1 if φ(x^n) > 1/2) and consider another high-probability set B. We analyze the probability P_{1;σ}{A ∩ B} and show that the exponent cannot be greater than (8), which concludes the converse part. For the detailed proof, please refer to Section V.

Finally, we give a structural result for the error exponent.

Proposition 3.1. For the case |X| < ∞, the type-II error exponent E*(ɛ, α) characterized in Theorem 3.2 depends only on α. Moreover, it is a convex function of α.

Proof: See Appendix A.
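Since (8) is a finite-dimensional convex program, it can be evaluated numerically. The sketch below is ours, not the paper's: it uses SciPy's SLSQP solver, the Bernoulli parameters follow the first setting of Section III-B, while the value of α, the solver details, and the helper name type2_exponent are our own choices for this illustration.

```python
import numpy as np
from scipy.optimize import minimize

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) in bits, with clipping to avoid log(0)."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log2(p / q)))

def type2_exponent(alpha, P0, P1):
    """Numerically evaluate (8): min sum_k alpha_k D(U_k || P_{1;k})
    s.t. sum_k alpha_k U_k = sum_k alpha_k P_{0;k}, each U_k a distribution."""
    alpha, P0, P1 = map(np.asarray, (alpha, P0, P1))
    K, d = P1.shape
    M0 = alpha @ P0                                   # mixture under H_0

    def obj(u):
        U = u.reshape(K, d)
        return sum(alpha[k] * kl(U[k], P1[k]) for k in range(K))

    cons = [{'type': 'eq', 'fun': lambda u: u.reshape(K, d).sum(axis=1) - 1.0},
            {'type': 'eq', 'fun': lambda u: alpha @ u.reshape(K, d) - M0}]
    res = minimize(obj, x0=P0.flatten(), bounds=[(0.0, 1.0)] * (K * d),
                   constraints=cons, method='SLSQP')
    return res.fun

# Binary alphabet, K = 2, distributions as rows [P(X=0), P(X=1)]:
# P_{0;1} = Ber(0.6), P_{0;2} = Ber(0.3), P_{1;1} = P_{1;2} = Ber(0.8).
alpha = [0.5, 0.5]                   # our choice of group fractions
P0 = [[0.4, 0.6], [0.7, 0.3]]
P1 = [[0.2, 0.8], [0.2, 0.8]]
informed = sum(a * kl(p, q) for a, p, q in zip(alpha, P0, P1))   # E_Informed
print(type2_exponent(alpha, P0, P1), informed)                   # anonymous <= informed
```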

B. Numerical Evaluations

To quantify the price of anonymity, note that when the sensors are not anonymous (termed the informed setting), the problem becomes a simple hypothesis testing problem, and the error exponent of the type-II error probability in the Neyman-Pearson setting is straightforward to derive:

E_Informed(ɛ, α) = ∑_{k=1}^K α_k D(P_{0;k} ‖ P_{1;k}).

For ease of illustration, in the following we restrict attention to the special case of a binary alphabet, that is, |X| = 2, and K = 2 groups. Let P_{θ;1} = Ber(p_θ) and P_{θ;2} = Ber(q_θ), for θ = 0, 1, where Ber(p) is the Bernoulli distribution with parameter p. Since there are only two groups, the group fractions are parametrized by a single mixing parameter α. Numerical examples are given in Figure 1 to illustrate the price of anonymity versus the mixing parameter α. In general, anonymity may cause significant performance loss; in certain regimes, the type-II error exponent can even be pushed to zero.

Fig. 1: Price of anonymity. The informed and anonymous type-II error exponents E(ɛ, α) are plotted versus α for two settings: (p_0, p_1) = (0.6, 0.8), (q_0, q_1) = (0.3, 0.8) (left), and (p_0, p_1) = (0.3, 0.8), (q_0, q_1) = (0.8, 0.2) (right).

C. Distributed Detection with Byzantine Attacks

Let us apply the results to distributed detection under Byzantine attacks, where the sensors are partitioned into two groups. One group consists of n(1 − α) honest sensors reporting true i.i.d. observations, while the other consists of nα Byzantine sensors reporting fake i.i.d. observations. Here we again neglect the local decision functions and assume that each sensor reports its observation to the fusion center. The true observations follow P_θ i.i.d. across the honest sensors, while the compromised ones follow Q_θ i.i.d. across the Byzantine sensors, for θ = 0, 1. In general, Q_θ is unknown to the fusion center, but in terms of the error exponent, one can find the least favorable pair (Q_0, Q_1) that minimizes the error exponent. Hence, our results can be applied here, and we arrive at the following worst-case type-II error exponent:

min_{Q_0, Q_1, U, V ∈ P_X} (1 − α) D(U ‖ P_1) + α D(V ‖ Q_1)   subject to (1 − α)U + αV = (1 − α)P_0 + αQ_0.   (9)

In [3], it is assumed that each sensor can be compromised with probability α, and hence the problem becomes a homogeneous distributed detection problem, where the observation of each sensor follows the mixture distribution (1 − α)P_θ + αQ_θ under hypothesis θ, i.i.d. across all sensors. The worst-case exponent of the type-II error probability, as derived in [3], is hence

min_{Q_0, Q_1 ∈ P_X} D((1 − α)P_0 + αQ_0 ‖ (1 − α)P_1 + αQ_1).   (10)

We see that the achievable type-II error exponent (9) in our setting is always at least as large as that in the i.i.d. scenario (10), and strictly larger for some α, due to the convexity of the KL divergence. This implies that the i.i.d. mixture model of [3] might be too pessimistic. Figure 2 shows a numerical evaluation.

Fig. 2: Comparison between the i.i.d. model and our setting: worst-case error exponent versus adversary power α, for (p_0, p_1) = (0.1, 0.9).
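The dominance of (9) over (10) can be checked numerically. The Python sketch below (our illustration) fixes an attack pair (Q_0, Q_1) instead of taking the least favorable one, so it compares only the inner problems of (9) and (10); the honest Bernoulli parameters match Figure 2, while the attack distributions and α are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_bern(a, b, eps=1e-12):
    """KL divergence D(Ber(a) || Ber(b)) in bits."""
    a, b = np.clip(a, eps, 1 - eps), np.clip(b, eps, 1 - eps)
    return a * np.log2(a / b) + (1 - a) * np.log2((1 - a) / (1 - b))

p0, p1 = 0.1, 0.9        # honest sensors: P_0 = Ber(0.1), P_1 = Ber(0.9) (as in Fig. 2)
q0, q1 = 0.9, 0.1        # fixed (not least favorable) attack pair Q_0, Q_1
alpha = 0.3              # fraction of Byzantine sensors (our choice)

m0 = (1 - alpha) * p0 + alpha * q0      # mixture under H_0
m1 = (1 - alpha) * p1 + alpha * q1      # mixture under H_1

def composite_obj(u):
    # Mixture constraint (1-a)U + aV = (1-a)P_0 + aQ_0 determines V from U.
    v = (m0 - (1 - alpha) * u) / alpha
    return (1 - alpha) * kl_bern(u, p1) + alpha * kl_bern(v, q1)

# Feasible range of u so that v stays in [0, 1].
lo = max(0.0, (m0 - alpha) / (1 - alpha))
hi = min(1.0, m0 / (1 - alpha))
composite = minimize_scalar(composite_obj, bounds=(lo, hi), method='bounded').fun  # inner problem of (9)
iid_mixture = kl_bern(m0, m1)                                                      # objective of (10)
print(composite, iid_mixture, bool(composite >= iid_mixture))   # dominance by convexity of KL
```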

2 for some measurable function φ : X n [0, ]. This implies the test φ maps a sequence of observations x n and all its permutations to the same value. Lemma 4.. Among all symmetric test, φ (x n ), as defined in (7), is optimal. proof of Lemma 4.: To show the optimality of φ, we first transform the original composite hypothesis testing problem to another one in the auxiliary space X n through the ordering mapping Π( ), which turns out to be a simple hypothesis testing problem. Hence, applying Neyman-Pearson lemma, we obtain the optimal test. See Figure 3 for illustration of the relation between the original space and the auxiliary space. Π( ) ( X n, F n) ( X n, F ) φ( ) φ( ) (R, B) Fig. 3: Illustration of the auxiliary space Part. First, we claim that for all σ S n,ν, the probability measure P 0;σ Π, defined on ( X n, F), does not depend on σ anymore. Thus we can define the probability measure P 0 P 0;σ Π, such that for all σ, ( P0;σ, F n, X n) Π( ) ( P0, F, X n). This claim is quite intuitive, since the labeling σ corresponds to the order of observations, and the ordering map removes the order. To show this claim, we first observe that for all E F, its pre-image Π (E) = τ S n E τ, () where E τ { (x τ(),..., x τ(n) ) (x,..., x n ) E. Therefore, for any two σ, σ S n,ν, we can write σ = π σ for some π S n, and thus have P 0;σ Π {E =P 0;σ { τ S n E τ =P 0;π σ { where the equality (a) holds due to the following fact: τ S n E τ { (a) = P 0;σ τ S n E τ π = P 0;σ Π {E, π S n, S n π {τ π τ S n = S n. Following the same argument, P P ;σ Π does not depend on σ either.

3 Part 2. Second, let us we consider an auxiliary hypothesis testing problem on X n : H 0 : Z P 0 H : Z P, (2) and let φ : X n [0, ] be a test with type-i and type-ii error probabilities as follows: P F ( φ) [ ] E P0 φ(z) P M ( φ) [ φ(z) ]. E P We claim that for any symmetric test φ(x n ) = φ (Π(x n )) as defined in Definition 4.2, the following holds: P F ( φ) = P F (φ) To show this, note that a direct calculation gives P M ( φ) = P M (φ). P F (φ) = max P σ 0;σ [φ(x n )] [ ] = max P 0;σ φ (Π(X n )) σ = max σ φ (Π(x n )) P 0;σ (dx n ) = max σ φ (z) P 0;σ (Π (dz)) = E P0 [ φ(z) ] = P F ( φ). For the same reason, P M (φ) = P M ( φ). Therefore, for any symmetric test on X n, the corresponding φ has exactly the same type-i and type-ii error probability. Notice that the auxiliary hypothesis testing problem (2) is simple, so by Neyman-Pearson lemma, we have readily seen that the optimal symmetric test on the original problem should be where l (x n ) is defined as, if l (x n ) > τ φ (x n ) = γ, if l (x n ) = τ 0, if l (x n ) < τ, l (x n ) = P (Π(x n )) P 0 (Π(x n )) = P { ;σ Π (Π(x n )) P 0;σ {Π (Π(x n ))]. Part 3. Finally, we show that l (x n ) is indeed the mixture likelihood ratio l(x n ), as defined in (6). With a slight abuse of notation, let Π x n Π (Π(x n )) = { x τ(),..., x τ(n) τ S n. In words, Πx n is the collection of x n and all its permutations. We observe that P ;σ { Π (Π(x n )) = (a) = y n Π x n ( P ;σ (y n ) τ S n P ;σ (τ(x n )) ) c (x n )

4 (b) = P ;σ (x n ) c (x n )c 2 (σ). σ S n,ν The constant c (x n ) in (a) is due to the fact that x n = (x,..., x n ) might not be all distinct, so summing over the set {τ(x n ) τ S n may count an element y n Π x n multiple times. Note that if x n are all distinct, then c (x n ) =. (b) holds because P ;σ (τ(x n )) = P ;σ τ (x n ) and S n,ν = S n,ν τ {σ τ σ S n,ν. Again, the summation counts σ repeatedly, so we normalize by the constant c 2 (σ). Following the same reason, { P 0;σ Π (Π(x n )) = P 0;σ (x n ) c (x n )c 2 (σ). σ S n,ν Hence, which establishes the claim. l (x n ) = P { ;σ Π (Π(x n )) P 0;σ {Π (Π(x n )) ( σ P = Sn,ν ;σ )) (xn c (x n )c 2 (σ) ( σ P Sn,ν 0;σ )) (xn = σ P ;σ(x n ) σ P 0;σ(x n ) = l(xn ), c (x n )c 2 (σ) Lemma 4.2. For any general (measurable) test ψ(x n ) : X n [0, ], there exists a symmetric test φ(x n ) whose performance is not worse than ψ. That is, P F (φ) P F (ψ) (3) P M (φ) P M (ψ). proof of Lemma 4.2: With a slight abuse of notation, let τ(x n ) denote the coordinate-permutation function with respect to τ S n, i.e. τ(x n ) = (x τ(),..., x τ(n) ). Then we construct φ(x n ) as follows: φ(x n ) ψ τ(x n ). n! τ S n We claim the following two facts: ) φ(x n ) is symmetric, and thus can be written as φ Π(x n ) for some 2) (3) holds for the constructed φ. F-measurable φ. Part. To see that φ(x n ) = φ Π(x n ), we observe that for any y n, z n Π ( x n ), there exists a permutation π S n such that y n = π(z n ). Hence it suffices to verify that for all π S n, φ(x n ) = φ(π(x n )). φ(π(x n )) = ψ τ (π(x n )) n! τ S n = ψ τ π(x n ) n! τ S n (a) = ψ τ (x n ) = φ(x n ). n! τ S n

5 The equality (a) holds due to the fact that Therefore, φ(x n ) can be decomposed into φ Π(x n ). S n π {τ π τ S n = S n. Next, we check the measurability of φ. Notice that φ is F -measurable, since both ψ and τ are measurable. The measurability of τ follows from the τ-permuted closedness assumption of F n : A F n, A τ {τ(x n ) x n A F n. Observe that for all Borel-measurable set B, we have { φ {B = Π φ {B F n E τ F n, τ S n where we use E to denote event φ {B, and E τ to denote the τ-permuted event of E, as defined in (5). Notice here we use the fact given by (). Therefore it suffices to check We claim that indeed, for every E X n. This is because E X n, τ S n E τ F n E F n X n = F. { τ S n E τ X n = E, ) Since E X n, we have E = E X n { τ S n E τ X n. 2) For any τ and for any x n E τ X n, x n E. Hence, τ S n, E τ X n E, that is, { τ S n E τ X n E. Hence, showing that φ is F measurable. τ S n E τ F n = E = { τ S n E τ X n F n X n = F, Part 2. We show that φ(x n ) cannot be worse than ψ(x n ). Observe that for all τ S n, we have P F (ψ τ) = max E P0;σ [ψ (τ(x n ))] = max E P0;σ τ [ψ(x n )] = max E P0;σ σ S n,ν σ S n,ν σ [ψ(x n )] = P F (ψ). S n,ν Again, the third equality holds due to the fact S n,ν τ { σ τ σ S n,ν = Sn,ν. Therefore, we have P F (φ) = max σ E P 0;σ n! = n! [ n! τ S n ψ τ(x n ) max E P 0;σ [ψ τ(x n )] σ τ S n P F (ψ τ) = P F (ψ). τ S n ]

6 Following the same argument, we obtain P M (φ) P M (ψ), and the proof completes. Finally, the proof of Theorem 3. directly follows from Lemma 4. and Lemma 4.2. Proof of Theorem 3.: From Lemma 4.2, we only need to consider symmetric tests. From Lemma 4., we see that the optimal test among all symmetric tests is the mixture likelihood test, as defined in (7). This establishes Theorem 3.. Remark 4.3. Notice that in the above proof, we do not make use of assumptions on the distribution of X n, such as independence. Indeed, the proof indicates that for the anonymous composite hypothesis testing problem, under the minimax criterion (i.e. to minimize the worst case error), we should always design tests based on the empirical distribution of X n (i.e. as a function of Π(x n )). This principle also holds for other statistical inference problems, such as M-ary hypothesis testing. V. PROOF OF THEOREM 3.2 For the case X <, the auxiliary space X is equivalent to the space of all probability measures on X, that is, P X, and the mapping Π(x n ) maps a sequence of samples to its type Π x n. According to Lemma 4.2, the optimal test is symmetric, which implies that we only need to consider tests depending on the type. For tests depending only on the empirical distribution, it is natural to view their acceptance region as a collection of empirical distribution, that is, a (measurable) subset of P X. This motivates us to apply Sanov s theorem. We begin with the following generalization of Sanov s result: Lemma 5. (Generalized Sanov Theorem). Let X <, and Γ P X be a collection of distributions on X. Then for all σ S n,ν and θ {0,, we have lim inf n inf [U... U K ] (P X ) K α U int Γ lim sup n K α k D (U k P θ;k ) (4) k= n log P θ;σ {Π x n Γ (5) n log P θ;σ {Π x n Γ (6) K inf α k D (U k P θ;k ), (7) [U... U K ] (P X ) K α U cl Γ where in taking the limits, we assume that lim n n k n the right-hand side is equal to the infimum in the left-hand side, then we have k= = α k, for all k =,..., K. In particular, if the infimum in lim n n log P θ;σ {Π x n Γ = inf α k D (U k P θ;k ). [U... U K ] (P X ) K α k U cl Γ The proof is a direct extension of Lemma 2.3, except that we replace the i.i.d. measure with the product of independent non-identical ones, P θ;σ. For the detailed proof, please refer to Appendix B.

7 Motivated by the generalized Sanov Theorem, we further define the following generalized divergence to measure how far from one set of distributions Q [Q... Q K ] to another set of distributions P [P... P K ] : Definition 5.. Let P = [P... P K ] and Q = [Q... Q K ] are both in (P X ) K. Let α = [α... α K ] be a K-tuple probability vector. Define K D α (P ; Q) inf α kd (U k Q k ) U (P X ) K k=. (8) subject to α U = α P Thus (4) in Lemma 5. can be rewritten as inf D α(u; P θ ) lim inf α U int Γ n n log P θ;σ {Π x n Γ lim sup n n log P θ;σ {Π x n Γ inf D α(u; P θ ). α U cl Γ Also, the result of Theorem 3.2, (8), is equivalent to the following statement: E (ɛ, α) = D α (P 0 ; P ). Remark 5.. Intuitively, D α (P ; Q) measures how far between P and Q. However, D α ( ; ) is not a divergence, since D α (P ; Q) = 0 does not always imply P = Q. Notice that for any fixed Q (P X ) K, D α (P ; Q) can be regarded as a function of P. Moreover, this function depends only on the mixture of P, say, α P. Therefore, for notional convenience, let us use f Q ( ) : P X R {+ to denote this function: In other words, K f Q (T ) inf α kd (U k Q k ) U (P X ) K k=. subject to α U = T f Q (α P ) = D α (P ; Q). Before entering the main proof of Theorem 3.2, let us introduce some properties of f Q ( ). Lemma 5.2. Let Q (P X ) K and f Q ( ) : P X R {+ be defined as Definition 5. and above. Then, ) f Q (α Q) = 0 2) The collection of all T P X such that f Q (T ) <, denoted as C Q {T P X : f Q (T ) <, is a compact, convex subset of P X. 3) f Q (T ) is a convex, continuous function of T on C Q (and by the compactness of C Q, f Q (T ) is also uniformly continuous). Proof of Lemma 5.2 can be found in Appendix C. proof of Theorem 3.2:

8 Part (Achievability). Let δ > 0 and consider the test : φ(x n ) {x n :D(Π x n M 0(α))>δ. Denote the acceptance region of φ as Γ {T P X : D (T M 0 (α)) > δ. Then the exponent of type-i error probability P F (φ) can be bounded by lim inf n = lim inf n (a) n log E P 0;σ [φ(x n )] n log P 0;σ {Π x n Γ inf f P 0 (T ) T cl Γ (b) δ, where (a) holds by Lemma 5., and be holds due to the the convexity of KL divergence: K D (T M 0 (α)) min α kd (U k P 0;k ) = f P0 (T ) U (P X ) K k=. subject to Notice that for any δ > 0, as n large enough, we must have α U = T P F (φ) < ɛ. On the other hand, the exponent of type-ii error probability E (ɛ, α) can be bounded by lim inf n = lim inf n n log E P ;σ [φ(x n )] n log P ;σ {X n : D (Π X n M 0 (α)) δ inf T cl Γ c f P (T ), (9) By Pinsker s inequality (Theorem 6.5 in [7]), we have { cl (Γ c ) = {T P X : D (T M 0 (α)) δ T P X : T M 0 (α) 2δ B 2δ (M 0(α)), so (9) can be further lower bounded by Also, by the continuity (Lemma 5.2) of f P ( ), inf f T cl Γ c P (T ) inf f T B 2δ (M0(α)) P (T ). with Finally, since δ can be chosen arbitrarily small, we have inf f T B 2δ (M0(α)) P (T ) = f P (M 0 (α)) + (δ), lim (δ) = 0. δ 0 E (ɛ, α) f P (M 0 (α)) = D α (P 0 ; P ). (20)

9 Part 2 (Converse). We have shown that symmetric test is optimal in Lemma 4.2. Hence, in the following, it suffices to consider symmetric tests. For an arbitrary symmetric test ψ : P n [0, ] such that its type-i error probability P F (ψ) < ɛ, we shall lower bound its type-ii error probability as follows. Let A (n) {T P n : ψ(t ) /2, and recall that P 0 P 0;σ Π is a probability measure independent of σ. Then, we have ɛ > E P0 [ψ(t )] = (a) > 2 T (A (n) ) c P0 (T ) = T P n P0 (T )ψ(t ) 2 ( P 0 { A (n)), (a) holds since for all T / A (n), ψ(t ) > /2. In other words, we have P 0 {A (n) > 2ɛ. T (A (n) ) c P0 (T )ψ(t ) On the other hand, let B (n) {T P n D (T M 0 (α)) δ. Then, according to the analysis in type-i error probability in the achievability part, we have P 0 {B (n) > ɛ. Applying union bound, we see that P 0 {A (n) B (n) > 3ɛ, and hence for ɛ < 3, A(n) B (n) is non-empty. Let V n A (n) B (n) and define P P ;σ Π (which is also independent of σ). Again we have We further estimate P {V n by P F (ψ) = [ ψ(t )] E P ( ψ(t )) P {T T A (n) 2 P {V n. P {V n =P ;σ {T n (V n ) = = U k P nk : k α ku k =V n U k P nk : k α ku k =V n max U k P nk : k α ku k =V n =2 n D n, K k= P n k ;k {T nk (U k ) 2 k n kd(u k P ;k ) 2 k n kd(u k P ;k )

where

D_n ≜ min_{U_k ∈ P_{n_k} : ∑_k α_k U_k = V_n} ( ∑_k (n_k/n) D(U_k ‖ P_{1;k}) ).

Notice that since V_n ∈ B^{(n)}, we have D(V_n ‖ M_0(α)) ≤ δ. Since δ can be chosen arbitrarily small, letting δ → 0 and n → ∞ (with n_k/n → α_k), we obtain

Ē(ɛ, α) ≤ lim_{n→∞} D_n = min_{U_k ∈ P_X : ∑_k α_k U_k = M_0(α)} ( ∑_k α_k D(U_k ‖ P_{1;k}) ) = f_{P_1}(M_0(α)) = D_α(P_0; P_1),

which completes the proof.

VI. A GEOMETRICAL PERSPECTIVE IN CHERNOFF'S REGIME

So far, in the asymptotic regime we have focused on the Neyman-Pearson formulation, in which we minimize the worst-case type-II error probability subject to the worst-case type-I error probability not being larger than a constant ɛ. It is natural to extend the results of Section III to Chernoff's regime, where we aim to minimize the average probability of error

P_e^{(n)}(φ) ≜ π_0 P_F^{(n)}(φ) + π_1 P_M^{(n)}(φ).

Note that π_0 and π_1 are the prior probabilities of H_0 and H_1 and do not scale with n. As suggested by Theorem 3.1, the optimal test is the mixture likelihood ratio test, so we only need to specify the corresponding threshold τ. However, the mixture likelihood ratio involves a summation over S_{n,ν}, making the computational complexity extremely high. Even for the case |X| < ∞, the computation still takes Θ(n^{|X|}) operations and is thus difficult to implement. To break the computational barrier, we propose an asymptotically optimal test, based on information projection, which achieves the optimal exponent of the average probability of error. Moreover, the result can be generalized to determine the achievable exponent region R, the collection of all achievable pairs of exponents:

R ≜ {(E_0, E_1) : there exists a test φ such that P_F^{(n)}(φ) ⪯ 2^{−nE_0}, P_M^{(n)}(φ) ⪯ 2^{−nE_1}},

where a_n ⪯ 2^{−nE_0} means that a_n decays to zero at an exponential rate of at least E_0, that is, liminf_{n→∞} −(1/n) log a_n ≥ E_0.

A. Asymptotically Optimal Test in Chernoff's Regime

Theorem 6.1 (Efficient Test). Recall the function f_P(T) : P_X → R ∪ {+∞} defined in Definition 5.1. Consider the following test based on the functions f_{P_0}(·) and f_{P_1}(·):

φ_eff(x^n) ≜ 0 if f_{P_1}(Π_{x^n}) > f_{P_0}(Π_{x^n}), and 1 if f_{P_1}(Π_{x^n}) ≤ f_{P_0}(Π_{x^n}).

Then φ_eff is asymptotically optimal in Chernoff's regime. That is, for all priors π_0, π_1, for all tests φ, and for all n large enough,

−(1/n) log P_e(φ) ≤ −(1/n) log P_e(φ_eff).

Remark 6.1. By the convexity of the KL divergence and of the space P_X, the function f_{P_1}(·) is the value of a convex minimization problem. Hence the proposed test in Theorem 6.1 can be computed efficiently.

Proof: Let us set up some notation. For each P ∈ (P_X)^K, we use B_r(P) ⊆ P_X to denote the open r-sublevel set ("ball") of f_P(·):

B_r(P) ≜ {T ∈ P_X : f_P(T) < r}.

By the continuity of f_P(·) (Lemma 5.2), B_r(P) is an open set. Then, define the largest packing radius between P_0 and P_1 as

r* ≜ sup { r : B_r(P_0) ∩ B_r(P_1) = ∅ }.   (21)

See Figure 4 for an illustration.

Fig. 4: Illustration of B_r(·) and r*.

The rest of the proof is organized as follows: we first show that φ_eff has error exponent at least r* (the achievability part):

lim_{n→∞} −(1/n) log P_e(φ_eff) ≥ r*.

Then, we prove that for all tests, the error exponent is at most r* (the converse part).

22 Part (Achievability). Define and notice that A {T P X f P (T ) f P0 (T ), P (n) F (φ eff ) = P 0;σ {Π x n A P (n) M (φ eff ) = P ;σ {Π x n A c, for any arbitrary σ (recall that φ eff depends only on the empirical distribution and therefore is symmetrical, so the error is independent of the choice of a specific σ). By the generalized Sanov s theorem (Lemma 5.), we see that the exponent of P F (n) (φ eff ) is lower bounded by inf T cl A f P0 (T ). Similarly, the exponent of P M (n) (φ eff ) is lower bounded by inf T cl A c f P (T ). It is not hard to see that indeed, and inf f P 0 (T ) = inf f P 0 (T ), (22) T cl A T A inf f T cl A c P (T ) = inf f T A c P (T ). (23) Equation (22) holds since A is a closed set (it is a pre-image of a continuous function from a closed set), so cl A = A. For the equation (23), we notice that A c is open, and hence the infimum of a continuous function on A c is actually equal to the infimum on cl A c. Hence, it suffices to show that inf f P 0 (T ) r, inf f T A T A c P (T ) r. P X A c P 0 P A Fig. 5: Relation between A, A c and B r (P 0 ), B r (P ) It is straightforward to see that A c contains B r (P 0 ) and A contains B r (P ), since we must have ) T B r (P 0 ), f P0 (T ) < f P (T ), 2) T B r (P ), f P0 (T ) > f P (T ).

23 Otherwise B r (P 0 ) intersects B r (P ), violating our assumption on r. Also notice that A, A c are disjoint, so A c B r (P 0 ) = A B r (P ) =, implying that Therefore, we have A c B r (P ) c, A B r (P 0 ) c. inf T A f P0 (T ) inf T Br (P 0) c f P 0 (T ) r inf T A c f P (T ) inf T Br (P ) c f P (T ) r, proving the achievability part. Part 2 (Converse). We show that for any test φ (n), the exponent of the average probability of error greater than r leads to contradiction. Suppose the type-i and type-ii error exponents of φ (n) are r, r 2 respectively, and r > r, r 2 > r. By Lemma 4.2, we only need to consider symmetric tests, that is, tests depend only on the type. Therefore, we can write the acceptance region of H 0, H as B (n) = { Π x n : φ (n) (x n ) = B (n) 0 = { Π x n : φ (n) (x n ) = 0. The exponents of type-i and type-ii errors thus are greater then r, r 2 respectively, we have { lim inf min n { lim inf n T B (n) min T B (n) 0 f P0 (T ) = r > r f P (T ) = r 2 > r. Define min {r, r 2 = r, and δ ( r r ) /2 > 0. By (24), there exists M large enough, such that for all n > M, We further define We see that ) B 0 B are dense in P X, since and n>m min T B (n) min T B (n) 0 P n is dense in P X. So we have f P0 (T ) > r δ > r f P (T ) > r δ > r. B = B 0 = n>m n>m B (n) B (n) 0. B (n) 0 B (n) = P n, (24) (cl B 0 cl B ) c = (cl B 0 ) c (cl B ) c =. (25)

24 2) By construction, inf f P0 (T ) = min f P0 (T ) > r δ > r T B T cl B From (26), we have inf f P (T ) = min f P (T ) > r δ > r. T B 0 T cl B 0 B ( r δ) (P 0 ) (cl B ) c B ( r δ) (P ) (cl B 0 ) c, (26) and by (25) B ( r δ) (P 0 ) B ( r δ) (P ) =. However, this violates our assumption that r is the supreme of radius such that the two sets do not overlap. This proves the converse part. Remark 6.2. In Theorem 6., we provide an asymptotically optimal test based on an information-geometric perspective. However, we do not specify the exact error exponent. As stated in the proof, the optimal exponent of average probability of error can be obtain by solving the information projection problem: min f P 0 (T ), T A where A is the acceptance region of φ eff. The optimization problem, though convex, is hard to obtain a closed-form expression, but we can still evaluate it numerically. B. Characterization of Achievable Exponent Region R One can generalize the result from Theorem 6.. Define the following test: φ λ (x n ) {fp0 (Π x n ) f P (Π x n ) λ, where λ [ f P (M 0 (α)), f P0 (M (α))]. Following a similar idea in the proof of Theorem 6., one can show that φ λ is optimal in a sense that for any test φ and λ, E 0 (φ) E 0 (φ λ ) E (φ) E (φ λ ), and E (φ) E (φ λ ) E 0 (φ) E 0 (φ λ ), where (E 0 (φ), E (φ)) are the error exponents with respect to test φ : { E 0 (φ) lim inf n n log P F (n) (φ) { E (φ) lim inf n n log P M (n) (φ). To obtain a parametrization of the boundary of R, it suffices to solve the following information projection problem: E 0 (λ) inf T Aλ f P0 (T ) E (λ) inf T (Aλ ) c f P (T ),

where A_λ ≜ {T ∈ P_X : f_{P_0}(T) − f_{P_1}(T) ≥ λ} is the region in which φ_λ decides H_1. Therefore, (E_0(λ), E_1(λ)) parametrizes the boundary of R for λ ∈ [−f_{P_1}(M_0(α)), f_{P_0}(M_1(α))]. In particular, we see that at the corners λ = f_{P_0}(M_1(α)) and λ = −f_{P_1}(M_0(α)), we recover the same results as in the Neyman-Pearson regime (Theorem 3.2). See Figure 6 for an illustration. Note that although the information-projection problem is a convex optimization problem, a closed-form expression remains unknown.

Fig. 6: Illustration of the point (E_0(λ), E_1(λ)) tracing the boundary of the exponent region R; the extreme values of the boundary are f_{P_0}(M_1(α)) and f_{P_1}(M_0(α)).

VII. DISCUSSION

A. Extension to Polish X

Theorem 3.1 characterizes the optimal test in the anonymous detection problem, where only a few conditions on the σ-field F are required. In Theorem 3.2, we further assume that the alphabet X is finite, in order to apply large-deviation tools based on the method of types (see Remark 3.4 for a discussion). However, the optimal exponent of the type-II error probability, given by Theorem 3.2, depends only on the possible distributions under H_θ, and hence it is interesting to ask whether one can remove the assumption that X is finite. Recall that the main tool employed in the proof is the generalized version of Sanov's theorem (Lemma 5.1), and thus the question becomes whether it is possible to prove Lemma 5.1 without using the method of types. Perhaps surprisingly, the answer is yes if X is a Polish space (a complete, separable, metrizable topological space). If X is Polish, the space P_X of all probability measures on X is also Polish when equipped with the weak topology induced by weak convergence; one can choose, for example, the Lévy-Prokhorov metric on P_X. The proof of the standard Sanov theorem on a Polish X, however, is far more involved than in the case of finite X; see [18], [19] for detailed proofs. Lemma 5.1 for Polish X can be proved with similar techniques. Nevertheless, in order not to digress further from the subject, we only present a proof for finite X in this paper.

B. The Benefit of Partial Information about the Group Assignment

From Figure 1, we see that in some cases the type-II error exponent can be pushed to zero, making reliable detection no longer possible. If each sensor is allowed to transmit a few bits of information to partially reveal their