State Variation Mining: On Information Divergence with Message Importance in Big Data
Rui She, Shanyun Liu, and Pingyi Fan
State Key Laboratory on Microwave and Digital Communications, Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, P.R. China
arXiv v2 [cs.IT] 15 Jan 2018

Abstract—Information transfer, which reveals the state variation of variables, can play a vital role in big data analytics and processing. In fact, a measure of information transfer can reflect system change from the statistics of the variable distributions, in the same spirit as KL divergence and Renyi divergence. Furthermore, for information transfer in big data, small-probability events dominate the importance of the total message to some degree. It is therefore significant to design an information transfer measure based on message importance that emphasizes small-probability events. In this paper, we propose the message importance divergence (MID) and investigate its characteristics and applications in three respects. First, the message importance transfer capacity based on MID is presented, which offers an upper bound for information transfer with disturbance. Then, we use the MID to guide queue length selection, a fundamental problem in the caching operation of mobile edge computing. Finally, we extend the MID to the continuous case and discuss its robustness when measuring information distance.

I. INTRODUCTION

Recently, the amount of data is exploding rapidly, and the computing complexity of data processing is increasing accordingly. To some degree, this phenomenon results from the growing number of mobile devices as well as expanding cloud services. In the literature, much attention is paid to processing the collected data to dig out the hidden important information.
On the one hand, it is necessary to improve computation technologies for big data processing, such as cloud computing, fog computing, and mobile edge computing (MEC). On the other hand, a series of technologies for big data analysis and mining are required, such as neural networks, machine learning, and distributed parallel computing. In many big data scenarios, small-probability events attract more attention than large-probability ones [1]. That is, the rarity of small-probability events gives them higher social or academic value. Message importance is used to reflect this kind of value in applications. How to characterize the variation of message importance during big data processing is a critical and interesting problem. Here, we investigate and measure the information transfer process by means of message importance, focusing on small-probability events.

Before we proceed, let us briefly review the definition of the message importance measure (MIM) [2]. In a finite alphabet, for a given probability distribution P = {p(x_1), p(x_2), ..., p(x_n)}, the MIM with importance coefficient ϖ ≥ 0 is defined as

L(P, ϖ) = log Σ_{x_i} p(x_i) e^{ϖ(1−p(x_i))},   (1)

which measures the information importance of the distribution. By setting the parameter ϖ = 1 and simplifying the form of the MIM, its fundamental definition is obtained as follows.

Definition 1. For a discrete probability distribution P = {p(x_1), p(x_2), ..., p(x_n)}, the MIM is given by

L(P) = Σ_{x_i} p(x_i) e^{−p(x_i)}.   (2)

Based on Definition 1, the MIM can be regarded as an element with which to measure the variation of message importance in a dynamic system. New information transfer measures based on the MIM are then defined as follows.

Definition 2. For two discrete probability distributions Q = {q(x_1), q(x_2), ..., q(x_n)} and P = {p(x_1), p(x_2), ..., p(x_n)}, the message importance divergence (MID) is defined as

Δ(Q‖P) = Σ_{x_i} {q(x_i) e^{−q(x_i)} − p(x_i) e^{−p(x_i)}}.   (3)
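For concreteness, Definitions 1 and 2 can be computed directly. The following minimal Python sketch is our own illustration (the function names are not from the paper):

```python
import math

def mim(dist):
    """Simplified MIM of Definition 1: L(P) = sum_i p(x_i) * exp(-p(x_i))."""
    return sum(p * math.exp(-p) for p in dist)

def mid(q, p):
    """MID of Definition 2: the difference of the two MIM values, L(Q) - L(P)."""
    return mim(q) - mim(p)

# Example: a uniform distribution versus one containing small-probability events.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.97, 0.01, 0.01, 0.01]
print(mim(uniform))          # for the uniform n-point law this equals e^(-1/n)
print(mid(skewed, uniform))
```

Note that for a uniform distribution every term equals (1/n)e^{−1/n}, so L(P) = e^{−1/n}, which is used repeatedly in the capacity results of Section II.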
Note that Definition 2 characterizes information transfer from the statistics. That is, we can use the MID to measure the end-to-end change of an information transfer process. In fact, a variety of information measures already address the information transfer process. Shannon entropy and Renyi entropy are applicable to intrinsic dimension estimation [3]. Similarly, the NMIM and the M-I divergence can be used in anomaly detection [4], [5]. Moreover, directed information [6] and Schreiber's transfer entropy [7] are commonly applied to inferring causality structure and characterizing the information transfer process. In addition, borrowing ideas from dynamical system theory, new information transfer measures have been proposed to explore and exploit the causality between states in system control [8]. However, in spite of these numerous information measures, few works focus on how to characterize the information
transfer from the perspective of message importance in big data. To this end, the MID, which differs from the above information measures, is introduced.

We organize the rest of this paper as follows. In Section II, we introduce the message importance transfer capacity, measured by the MID, to describe information transfer with disturbance. In Section III, the MID is used to discuss queue length selection for data caching in MEC from the viewpoint of queueing theory; simulations are also presented to validate the theoretical results. In Section IV, we extend the MID to the continuous case to investigate the correlation of message importance in the information transfer process. Finally, we conclude the paper in Section V.

II. MESSAGE IMPORTANCE TRANSFER CAPACITY BASED ON MESSAGE IMPORTANCE DIVERGENCE

In this section, we introduce the MID to describe information transfer with additive disturbance. To do so, we define the message importance transfer capacity as follows.

Definition 3. Assume that there exists an information transfer process

{X, p(y|x), Y},   (4)

where p(y|x) denotes the transition probability matrix describing the information transfer from the variable X, following distribution p(x), to the variable Y, following p(y). The message importance transfer capacity is defined as

C = max_{p(x)} {L(Y) − L(Y|X)},   (5)

where p(y_j) = Σ_{x_i} p(x_i) p(y_j|x_i), L(Y) = Σ_{y_j} p(y_j) e^{−p(y_j)}, and L(Y|X) = Σ_{y_j} Σ_{x_i} p(x_i, y_j) e^{−p(y_j|x_i)}.

To gain insight into the applications of the message importance transfer capacity, some specific information transfer scenarios are discussed as follows.

A. Binary symmetric information transfer

Proposition 1. Assume that there exists an information transfer process {X, p(y|x), Y}, where the information transfer matrix is

p(y|x) = [ 1−ε    ε  ]
         [  ε    1−ε ],   (6)

which indicates that X and Y both obey binary distributions. In this case, we have

C(ε) = e^{−1/2} − L(ε),   (7)

where L(ε) = ε e^{−ε} + (1−ε) e^{−(1−ε)} and 0 < ε < 1.

Proof.
Considering a variable X following the binary distribution (p, 1−p), it is not difficult to see that

L(Y|X) = ε e^{−ε} + (1−ε) e^{−(1−ε)}.   (8)

Moreover, according to Eq. (5), we have

C(p, ε) = max_p { [p + ε(1−2p)] e^{−[p+ε(1−2p)]} + [(1−p) + ε(2p−1)] e^{−[(1−p)+ε(2p−1)]} } − L(ε).   (9)

Then, it is readily seen that

∂C(p,ε)/∂p = (1−2ε) { [1 − p − ε(1−2p)] e^{−[p+ε(1−2p)]} − [1 − (1−p) − ε(2p−1)] e^{−[(1−p)+ε(2p−1)]} }.   (10)

Since ∂C(p,ε)/∂p is monotonically decreasing for p ∈ [0,1], p = 1/2 is the only solution of ∂C(p,ε)/∂p = 0. Therefore, the proposition is proved.

B. Strongly symmetric information transfer

Proposition 2. Assume that there exists a strongly symmetric information transfer matrix

p(y|x) = [ 1−ε       ε/(K−1)  ...  ε/(K−1) ]
         [ ε/(K−1)   1−ε      ...  ε/(K−1) ]
         [ ...       ...      ...  ...     ]
         [ ε/(K−1)   ε/(K−1)  ...  1−ε     ],   (11)

which indicates that X and Y both follow K-ary distributions. The message importance transfer capacity is

C(ε) = e^{−1/K} − { (1−ε) e^{−(1−ε)} + ε e^{−ε/(K−1)} },   (12)

where the parameter ε ∈ (0, 1).

Proof. From the information transfer matrix and Eq. (2), we have

L(Y|X) = (1−ε) e^{−(1−ε)} + ε e^{−ε/(K−1)}.   (13)

Then, similar to the proof of Proposition 1, the Lagrange multiplier method can be used to obtain the message importance transfer capacity; in this case, the maximizing distribution of Y satisfies p(y_1) = p(y_2) = ... = p(y_K) = 1/K. In addition, consider the probability distribution {p(x_1), p(x_2), ..., p(x_K)} of X. For the strongly symmetric transfer matrix, if X follows the uniform distribution, namely p(x_1) = p(x_2) = ... = p(x_K) = 1/K, we will have

p(y_j) = Σ_{i=1}^{K} p(x_i, y_j) = Σ_{i=1}^{K} p(x_i) p(y_j|x_i) = (1/K) Σ_{i=1}^{K} p(y_j|x_i) = 1/K,   (14)

which indicates that Y also follows the uniform distribution. Therefore, when X follows the uniform distribution, which leads to the uniform distribution for Y, the message importance transfer capacity C(ε) is achieved.
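As a sanity check on Proposition 1, the following sketch (our own code, not from the paper) evaluates L(Y) − L(Y|X) for the binary symmetric matrix over a grid of input distributions p, and confirms that the maximum is attained at p = 1/2, where L(Y) = e^{−1/2}:

```python
import math

def mim(dist):
    # Simplified MIM (Definition 1): sum_i p_i * exp(-p_i)
    return sum(p * math.exp(-p) for p in dist)

def transfer_gap(p, eps):
    # L(Y) - L(Y|X) for the binary symmetric matrix of Eq. (6),
    # with X ~ (p, 1-p) and crossover probability eps.
    y0 = p * (1 - eps) + (1 - p) * eps   # P(Y = 0)
    l_y = mim([y0, 1 - y0])
    l_y_given_x = eps * math.exp(-eps) + (1 - eps) * math.exp(-(1 - eps))
    return l_y - l_y_given_x

eps = 0.1
best_p = max((i / 1000 for i in range(1001)), key=lambda p: transfer_gap(p, eps))
capacity = math.exp(-0.5) - (eps * math.exp(-eps)
                             + (1 - eps) * math.exp(-(1 - eps)))  # Eq. (7)
print(best_p)                                     # maximizer: p = 1/2
print(abs(transfer_gap(best_p, eps) - capacity))  # matches the closed form
```

The same grid-search check works for the K-ary matrix of Proposition 2, with the maximizer being the uniform K-point distribution.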
III. APPLICATION IN MOBILE EDGE COMPUTING WITH THE M/M/s/k QUEUE

Consider an MEC system that consists of numerous mobile devices, an edge server cluster, and a central cloud far from the local devices. The queue models for the mobile devices, the edge server cluster, and the central cloud are regarded as an M/M/1 queue, an M/M/s queue, and an M/M/∞ queue, respectively. In particular, each mobile device can offload part or all of its service requests to the corresponding edge server cluster when its communication is disturbed. Likewise, overloaded requests from the edge servers are offloaded to the central cloud for processing [10].

The MID can be applied to distinguish the state probability distributions of a queue model such as the M/M/s queue. In particular, the state probabilities are derived from the stationary process, namely the dynamic equilibrium of the underlying birth-death process. We can thus obtain the steady queue state probabilities of the M/M/s/k model. By use of a Taylor series expansion, the approximate MIM is given by

Σ_{j=0}^{s+k} p_j e^{−p_j} = Σ_{j=0}^{s+k} p_j [1 − p_j + O(p_j²)]
≈ 1 − p_0² { Σ_{j=0}^{s−1} (a^j/j!)² + (a^s/s!)² [1 − ρ^{2(k+1)}] / (1 − ρ²) },   (15)

where λ and µ are the arrival rate and service rate, s is the number of servers, k is the buffer (caching) size, the traffic intensity is ρ = a/s, and a = λ/µ. Note that, in the M/M/s/k queue, the queue state probabilities p_j (j = 0, ..., s+k) are given by

p_0 = [ Σ_{j=0}^{s−1} a^j/j! + (a^s/s!) (1 − ρ^{k+1}) / (1 − ρ) ]^{−1},   (16a)
p_j = (a^j/j!) p_0,  (0 < j < s),   (16b)
p_j = (a^s/s!) ρ^{j−s} p_0,  (s ≤ j ≤ s+k).   (16c)

Then, referring to Eq. (15), we can obtain the MID that characterizes the message importance gap for the M/M/s queue model as follows.

Proposition 3. For the M/M/s model, the information difference between two queue state probability distributions P_{k+1} and P_k, with buffer sizes k+1 and k respectively, can be measured by the MID as

Δ(P_{k+1}‖P_k) = Σ_{j=0}^{s+k+1} p'_j e^{−p'_j} − Σ_{j=0}^{s+k} p_j e^{−p_j}
≈ ( 1/(ϕ_1 + ϕ_2 (1−ρ^{k+1})/(1−ρ))² − 1/(ϕ_1 + ϕ_2 (1−ρ^{k+2})/(1−ρ))² ) Σ_{j=0}^{s−1} (a^j/j!)²
+ ϕ_2²/(1−ρ²) [ (1−ρ^{2(k+1)})/(ϕ_1 + ϕ_2 (1−ρ^{k+1})/(1−ρ))² − (1−ρ^{2(k+2)})/(ϕ_1 + ϕ_2 (1−ρ^{k+2})/(1−ρ))² ],   (17)

where p_j and p'_j are the queue state probabilities of the M/M/s/k and M/M/s/(k+1) models respectively, and the parameters ϕ_1 and ϕ_2 are given by ϕ_1 = Σ_{j=0}^{s−1} a^j/j! and ϕ_2 = a^s/s!.

Similarly, it is not difficult to derive the MID between the queue state probability distributions with buffer size ∞ and k, which is given by

Δ(P_∞‖P_k) ≈ ( 1/(ϕ_1 + ϕ_2 (1−ρ^{k+1})/(1−ρ))² − 1/(ϕ_1 + ϕ_2/(1−ρ))² ) Σ_{j=0}^{s−1} (a^j/j!)²
+ ϕ_2²/(1−ρ²) [ (1−ρ^{2(k+1)})/(ϕ_1 + ϕ_2 (1−ρ^{k+1})/(1−ρ))² − 1/(ϕ_1 + ϕ_2/(1−ρ))² ].   (18)

In particular, when the number of servers is one, that is, for the M/M/1/k model, it is readily seen that

Δ_{(s=1)}(P_∞‖P_k) = 2a/(1+a) − 2a(1−a^{k+1}) / ((1+a)(1−a^{k+2})),   (19)

where Δ_{(s=1)}(P_∞‖P_k) denotes the MID with s = 1. Moreover, for queue length selection it is required that the distinction between the two distributions P_∞ and P_k be small enough, namely Δ_{(s=1)}(P_∞‖P_k) ≤ ǫ. Then we have

k ≥ ln( ǫ(1+a) / (a(2+ǫ−(2−ǫ)a)) ) / ln a − 1.   (20)

It is easy to see that ǫ plays a key role in the caching size selection when a finite caching size is used to imitate the infinite-caching working mode.

Similar to the MID, the KL divergence between the queue state probability distributions with buffer sizes k+1 and k is given by

D(P_k‖P_{k+1}) = Σ_j p_j log(p_j / p'_j)
= log { 1 + ϕ_2 (1−ρ) ρ^{k+1} / (ϕ_1 (1−ρ) + ϕ_2 (1−ρ^{k+1})) },   (21)

where the parameters p_j, p'_j, ϕ_1, and ϕ_2 are the same as those in Proposition 3. Likewise, we can derive the KL divergence between the queue state distributions with buffer size ∞ and k as

D(P_k‖P_∞) = log { (ϕ_1 + ϕ_2/(1−ρ)) / (ϕ_1 + ϕ_2 (1−ρ^{k+1})/(1−ρ)) }.   (22)

For the queue length selection with the KL divergence (setting s = 1 and requiring D(P_k‖P_∞) ≤ ǫ²/2), it is readily seen that

k ≥ ln(1 − e^{−ǫ²/2}) / ln a − 2.   (23)

To validate the derived theoretical results, some simulations are presented. The distributions of event arrivals are listed in Table I; they have the same average interarrival time 1/λ_0. Besides, the traffic intensity is selected as ρ = 0.9 in all cases.
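The closed forms above are easy to check numerically. The sketch below (our own code; the function names are not from the paper) builds the M/M/s/k state probabilities of Eq. (16), evaluates the s = 1 closed form of Eq. (19), and compares the bound of Eq. (20) against a direct search for the smallest adequate buffer size:

```python
import math

def mmsk_probs(lam, mu, s, k):
    """Steady-state probabilities p_0, ..., p_{s+k} of the M/M/s/k queue, Eq. (16)."""
    a = lam / mu
    rho = a / s
    p0 = 1.0 / (sum(a**j / math.factorial(j) for j in range(s))
                + (a**s / math.factorial(s)) * (1 - rho**(k + 1)) / (1 - rho))
    probs = [a**j / math.factorial(j) * p0 for j in range(s)]
    probs += [(a**s / math.factorial(s)) * rho**(j - s) * p0
              for j in range(s, s + k + 1)]
    return probs

lam, mu, s = 0.9, 1.0, 1   # one server, traffic intensity rho = 0.9
a = lam / mu
eps = 1e-3

def mid_inf(k):
    """Closed-form MID between infinite and finite buffer, Eq. (19), s = 1."""
    return 2*a / (1 + a) - 2*a * (1 - a**(k + 1)) / ((1 + a) * (1 - a**(k + 2)))

# Smallest buffer whose state distribution is within eps of the infinite-buffer
# one, by direct search ...
k_search = next(k for k in range(1, 1000) if mid_inf(k) <= eps)
# ... and via the bound of Eq. (20).
k_bound = math.log(eps * (1 + a) / (a * (2 + eps - (2 - eps) * a))) / math.log(a) - 1
print(k_search, k_bound)              # the searched k is the ceiling of the bound

probs = mmsk_probs(lam, mu, s, k_search)
print(abs(sum(probs) - 1.0))          # ~0: Eq. (16) defines a distribution
```

This mirrors how Eq. (20) would be used in practice: given a tolerance ǫ, the caching size is read off the bound instead of being searched for.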
TABLE I
THE DISTRIBUTIONS OF EVENT INTERARRIVAL TIME

Type    Exponential      Uniform             Normal
P(X)    X ~ E(λ_0)       X ~ U(0, 2/λ_0)     X ~ N(1/λ_0, 1/λ_0²)

Fig. 1. The performance of different information measures between the queue lengths k and k+1 for the queue model, for different arrival event distributions. (The vertical axis shows the information measure value and the horizontal axis the queue length; the curves are Δ-Sim and D-Sim, and Δ-Ana and D-Ana, for Poisson, uniform, and normal arrivals.)

In Fig. 1, the legends Δ-Sim, Δ-Ana and D-Sim, D-Ana denote the simulation results and the analytical results for the MID and the KL divergence, respectively. The figure illustrates that the MID converges faster than the KL divergence, which indicates that the MID may provide a reasonable lower bound for selecting the caching size in MEC. In addition, the Poisson distribution corresponds to the worst case for the arrival process among the three discussed cases with respect to the convergence of both the MID and the KL divergence.

IV. MESSAGE IMPORTANCE DIVERGENCE IN CONTINUOUS CASES

Similar to Definitions 1 and 2, we can extend the two definitions to the case of continuous distributions as

L(f(x)) = ∫ f(x) e^{−f(x)} dx,  x ∈ X,   (24)

Δ(g(x)‖f(x)) = L(g(x)) − L(f(x)) = ∫ { g(x) e^{−g(x)} − f(x) e^{−f(x)} } dx,  x ∈ X,   (25)

where g(x) and f(x) are two probability density functions of the variable X on a given interval. L(f(x)) and Δ(g(x)‖f(x)) are called the differential message importance measure (DMIM) and the differential message importance divergence (DMID), respectively. We then investigate the correlation of message importance by using the DMID, which also reflects the robustness of the DMID. Consider the observation model P_{f_0 → g_0}: f_0(x) → g_0(x), which denotes an information transfer map for the variable X from the probability distribution f_0(x) to g_0(x).
By using an approach similar to [9], the relationship between f_0(x) and g_0(x) can be described as

g_0(x) = f_0(x) + ǫ f_0^α(x) u(x),   (26)

with the constraint condition

∫ ǫ f_0^α(x) u(x) dx = 0,   (27)

where ǫ and α are adjustable coefficients and u(x) is a perturbation function of the variable X on the interval. Then, by using the above model, the end-to-end information distance measured by the DMID is given as follows.

Proposition 4. For two probability distributions g_0(x) and f_0(x) whose relationship satisfies the conditions of Eq. (26) and Eq. (27), the information distance measured by the DMID is given by

Δ(g_0(x)‖f_0(x)) = ∫ { g_0(x) e^{−g_0(x)} − f_0(x) e^{−f_0(x)} } dx
= ǫ Σ_{i=1}^{∞} (−1)^i (1/i! + 1/(i−1)!) ∫ f_0^{i+α}(x) u(x) dx
+ (ǫ²/2) Σ_{i=1}^{∞} (−1)^i (1/(i−2)! + 2/(i−1)!) ∫ f_0^{i−1+2α}(x) u²(x) dx + o(ǫ²),   (28)

where ǫ and α denote the parameters above, u(x) is a function of the variable X on the interval, and 1/j! is taken as 0 for j < 0.

Proposition 4 describes the perturbation between f_0(x) and g_0(x). Furthermore, it is not difficult to obtain the DMID between two different distributions g_1^{(u)} and g_2^{(u)} based on the same reference distribution f_0(x), which is given by

Δ(g_1^{(u)}(x)‖g_2^{(u)}(x)) = [L(g_1^{(u)}(x)) − L(f_0(x))] − [L(g_2^{(u)}(x)) − L(f_0(x))]
= ǫ Σ_{i=1}^{∞} (−1)^i (1/i! + 1/(i−1)!) ∫ f_0^{i+α}(x) [u_1(x) − u_2(x)] dx
+ (ǫ²/2) Σ_{i=1}^{∞} (−1)^i (1/(i−2)! + 2/(i−1)!) ∫ f_0^{i−1+2α}(x) [u_1²(x) − u_2²(x)] dx + o(ǫ²),   (29)

where

g_1^{(u)}(x) = f_0(x) + ǫ f_0^α(x) u_1(x),  x ∈ X,   (30a)
g_2^{(u)}(x) = f_0(x) + ǫ f_0^α(x) u_2(x),  x ∈ X,   (30b)

in which ǫ and α are parameters, and u_1(x) and u_2(x) are functions of the variable X on the interval.
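The perturbation expansion of Proposition 4 can be checked numerically. In the sketch below (our own illustration; the density f_0(x) = 2x on [0,1], the exponent α = 1, and the perturbation u(x) = x − 2/3 are our own choices, picked so that the constraint of Eq. (27) holds), the DMID computed by quadrature tracks its first-order term with an error that shrinks like ǫ²:

```python
import math

N = 20000
xs = [(i + 0.5) / N for i in range(N)]   # midpoint rule on [0, 1]

def f0(x): return 2 * x                  # reference density on [0, 1]
def u(x):  return x - 2.0 / 3.0          # integral of f0^1 * u over [0,1] is 0

def dmid(eps, alpha=1.0):
    """DMID of Eq. (25) between g0 = f0 + eps*f0^alpha*u and f0, by quadrature."""
    total = 0.0
    for x in xs:
        f = f0(x)
        g = f + eps * (f ** alpha) * u(x)
        total += (g * math.exp(-g) - f * math.exp(-f)) / N
    return total

# First-order coefficient of the expansion, in the closed form
# d/dt (t e^{-t}) = (1 - t) e^{-t}, evaluated against f0^alpha * u:
c1 = sum((1 - f0(x)) * math.exp(-f0(x)) * f0(x) * u(x) for x in xs) / N

for eps in (0.1, 0.05, 0.025):
    print(eps, dmid(eps), c1 * eps)   # DMID tracks c1*eps; the gap is O(eps^2)
```

This is exactly the behavior stated in Remark 1 below: for small ǫ the DMID is governed by the O(ǫ) term, so it can serve as a stable measure of small system changes.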
Remark 1. It is apparent that when the parameter ǫ is small enough, the DMID of the two distributions converges with order O(ǫ). This provides a way to apply the DMID to measure the correlation of message importance, provided the system does not undergo a relatively large change.

In addition, by extending the information transfer process of Eq. (4) to the case of continuous distributions, it is also significant to clarify the effect of an additive disturbance on the message importance transfer capacity, as follows.

Theorem 1. Assume that there exists an information transfer process between variables X and Y, both following continuous distributions, denoted by {X, p(y|x), Y}, where E[X] = 0, E[X²] = P_s, Y = X + Z, and Z is an independent memoryless additive variable, namely a continuous additive disturbance, whose mean and variance satisfy E[Z] = µ and E[(Z−µ)²] = σ², respectively. Then the message importance transfer capacity is determined by the variances of X and Z as

C(P_s) = max_{p(x)} Δ(Y‖Z) = max_{p(x)} {L(Y)} − L(Z)
= max_{p(x)} { ∫ p(y) e^{−p(y)} dy } − Σ_{j=0}^{∞} ((−1)^j / j!) ∫ p^{j+1}(z) dz,   (31a)
s.t. E[Y²] = P_s + P_N,   (31b)

where P_N = µ² + σ², L(·) is the DMIM operator, and p(y) = ∫ p(x) p(y|x) dx.

Proof. According to Eq. (4), we take the change of variables

x(x, y) = x,  z(x, y) = y − x.   (32)

Then, it is not difficult to see that

p(x, y) = p(x, z) |J((x,z) → (x,y))| = p(x, z) |∂x/∂x · ∂z/∂y − ∂x/∂y · ∂z/∂x| = p(x, z).   (33)

Moreover, by virtue of the independence of X and Z, we have p(x) p(y|x) = p(x) p(z), which indicates that p(y|x) = p(z). Then we have

L(Y|X) = ∫_x ∫_y p(x) p(y|x) e^{−p(y|x)} dx dy
= ∫_x ∫_z p(x) p(z) e^{−p(z)} dx dz
= ∫_x p(x) { ∫_z p(z) e^{−p(z)} dz } dx = L(Z).   (34)

Consequently, in terms of Definition 3, L(Y) − L(Y|X) can be written as L(Y) − L(Z), which proves Eq. (31a). Furthermore, from the fact that E[Y²] = E[(X+Z)²] = E[X²] + E[Z²] = P_s + P_N, we have the constraint condition Eq. (31b). Finally, by substituting the definition of the DMID into Eq.
(31a), Theorem 1 is proved.

In practice, the variance can be regarded as the power of a signal. Therefore, the message importance transfer capacity in Theorem 1 may be applied to guide signal transfer processing with an additive disturbance.

V. CONCLUSION

In this paper, we investigated the information transfer problem in big data and proposed an information measure, the MID. This measure has the distinctive characteristic of paying more attention to the message importance hidden in big data, which makes it a promising tool for measuring information transfer in big data. We presented the message importance transfer capacity measured by the MID, which gives an upper bound for information transfer with disturbance. Furthermore, we employed the MID to discuss the caching size selection in MEC. In addition, the MID was extended to the continuous case to investigate the correlation of message importance in the information transfer process.

ACKNOWLEDGMENT

We appreciate the support of the China Major State Basic Research Development Program (973 Program) No. 2012CB316100(2), and the National Natural Science Foundation of China (NSFC) No.

REFERENCES

[1] Y. Bu, S. Zou, and V. V. Veeravalli, "Linear-complexity exponentially-consistent tests for universal outlying sequence detection," in Proc. IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017.
[2] P. Fan, Y. Dong, J. Lu, and S. Liu, "Message importance measure and its application to minority subset detection in big data," in Proc. IEEE Globecom Workshops (GC Wkshps), Washington, D.C., USA, Dec. 2016, pp. 1-5.
[3] K. M. Carter, R. Raich, and A. O. Hero, "On local intrinsic dimension estimation and its applications," IEEE Trans. Signal Process., vol. 58, no. 2, Feb. 2010.
[4] S. Liu, R. She, P. Fan, and K. B. Letaief, "Non-parametric message important measure: storage code design and transmission planning for big data," 2017.
[Online]. Available:
[5] R. She, S. Liu, and P. Fan, "Amplifying inter-message distance: on information divergence measures in big data," IEEE Access, vol. 5, Nov. 2017.
[6] L. Zhao, Y. H. Kim, H. H. Permuter, and T. Weissman, "Universal estimation of directed information," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Austin, USA, June 2010.
[7] T. Schreiber, "Measuring information transfer," Physical Review Letters, vol. 85, no. 2, July 2000.
[8] S. Sinha and U. Vaidya, "Causality preserving information transfer measure for control dynamical system," in Proc. IEEE 55th Conference on Decision and Control (CDC), Las Vegas, USA, Dec. 2016.
[9] S. Huang, A. Makur, L. Zheng, and G. W. Wornell, "An information-theoretic approach to universal feature selection in high-dimensional inference," in Proc. IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017.
[10] L. Liu, Z. Chang, X. Guo, and T. Ristaniemi, "Multi-objective optimization for computation offloading in mobile-edge computing," in Proc. IEEE Symposium on Computers and Communications (ISCC), Heraklion, Greece, July 2017.
More informationFigure 10.1: Recording when the event E occurs
10 Poisson Processes Let T R be an interval. A family of random variables {X(t) ; t T} is called a continuous time stochastic process. We often consider T = [0, 1] and T = [0, ). As X(t) is a random variable
More informationIntroduction to queuing theory
Introduction to queuing theory Queu(e)ing theory Queu(e)ing theory is the branch of mathematics devoted to how objects (packets in a network, people in a bank, processes in a CPU etc etc) join and leave
More informationHomework 1 Due: Thursday 2/5/2015. Instructions: Turn in your homework in class on Thursday 2/5/2015
10-704 Homework 1 Due: Thursday 2/5/2015 Instructions: Turn in your homework in class on Thursday 2/5/2015 1. Information Theory Basics and Inequalities C&T 2.47, 2.29 (a) A deck of n cards in order 1,
More informationLecture 22: Variance and Covariance
EE5110 : Probability Foundations for Electrical Engineers July-November 2015 Lecture 22: Variance and Covariance Lecturer: Dr. Krishna Jagannathan Scribes: R.Ravi Kiran In this lecture we will introduce
More informationEnergy-Efficient Resource Allocation for Multi-User Mobile Edge Computing
Energy-Efficient Resource Allocation for Multi-User Mobile Edge Computing Junfeng Guo, Zhaozhe Song, Ying Cui Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, P. R. China
More informationChapter 2 Review of Classical Information Theory
Chapter 2 Review of Classical Information Theory Abstract This chapter presents a review of the classical information theory which plays a crucial role in this thesis. We introduce the various types of
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationTight Bounds on Minimum Maximum Pointwise Redundancy
Tight Bounds on Minimum Maximum Pointwise Redundancy Michael B. Baer vlnks Mountain View, CA 94041-2803, USA Email:.calbear@ 1eee.org Abstract This paper presents new lower and upper bounds for the optimal
More informationInformation in Biology
Lecture 3: Information in Biology Tsvi Tlusty, tsvi@unist.ac.kr Living information is carried by molecular channels Living systems I. Self-replicating information processors Environment II. III. Evolve
More informationLower Bounds on the Graphical Complexity of Finite-Length LDPC Codes
Lower Bounds on the Graphical Complexity of Finite-Length LDPC Codes Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology Haifa 32000, Israel 2009 IEEE International
More informationQuantifying Stochastic Model Errors via Robust Optimization
Quantifying Stochastic Model Errors via Robust Optimization IPAM Workshop on Uncertainty Quantification for Multiscale Stochastic Systems and Applications Jan 19, 2016 Henry Lam Industrial & Operations
More informationSuperposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels
Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Nan Liu and Andrea Goldsmith Department of Electrical Engineering Stanford University, Stanford CA 94305 Email:
More informationAP Calculus Chapter 9: Infinite Series
AP Calculus Chapter 9: Infinite Series 9. Sequences a, a 2, a 3, a 4, a 5,... Sequence: A function whose domain is the set of positive integers n = 2 3 4 a n = a a 2 a 3 a 4 terms of the sequence Begin
More informationConvexity/Concavity of Renyi Entropy and α-mutual Information
Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au
More informationECE 302 Division 2 Exam 2 Solutions, 11/4/2009.
NAME: ECE 32 Division 2 Exam 2 Solutions, /4/29. You will be required to show your student ID during the exam. This is a closed-book exam. A formula sheet is provided. No calculators are allowed. Total
More informationSimple Channel Coding Bounds
ISIT 2009, Seoul, Korea, June 28 - July 3,2009 Simple Channel Coding Bounds Ligong Wang Signal and Information Processing Laboratory wang@isi.ee.ethz.ch Roger Colbeck Institute for Theoretical Physics,
More informationInformation in Biology
Information in Biology CRI - Centre de Recherches Interdisciplinaires, Paris May 2012 Information processing is an essential part of Life. Thinking about it in quantitative terms may is useful. 1 Living
More informationThe Effect of Memory Order on the Capacity of Finite-State Markov and Flat-Fading Channels
The Effect of Memory Order on the Capacity of Finite-State Markov and Flat-Fading Channels Parastoo Sadeghi National ICT Australia (NICTA) Sydney NSW 252 Australia Email: parastoo@student.unsw.edu.au Predrag
More informationBioinformatics: Biology X
Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA Model Building/Checking, Reverse Engineering, Causality Outline 1 Bayesian Interpretation of Probabilities 2 Where (or of what)
More informationStochastic Models of Manufacturing Systems
Stochastic Models of Manufacturing Systems Ivo Adan Organization 2/47 7 lectures (lecture of May 12 is canceled) Studyguide available (with notes, slides, assignments, references), see http://www.win.tue.nl/
More informationChapter 5 Joint Probability Distributions
Applied Statistics and Probability for Engineers Sixth Edition Douglas C. Montgomery George C. Runger Chapter 5 Joint Probability Distributions 5 Joint Probability Distributions CHAPTER OUTLINE 5-1 Two
More informationIntroduction to Markov Chains, Queuing Theory, and Network Performance
Introduction to Markov Chains, Queuing Theory, and Network Performance Marceau Coupechoux Telecom ParisTech, departement Informatique et Réseaux marceau.coupechoux@telecom-paristech.fr IT.2403 Modélisation
More informationSlepian-Wolf Code Design via Source-Channel Correspondence
Slepian-Wolf Code Design via Source-Channel Correspondence Jun Chen University of Illinois at Urbana-Champaign Urbana, IL 61801, USA Email: junchen@ifpuiucedu Dake He IBM T J Watson Research Center Yorktown
More informationMultiaccess Problem. How to let distributed users (efficiently) share a single broadcast channel? How to form a queue for distributed users?
Multiaccess Problem How to let distributed users (efficiently) share a single broadcast channel? How to form a queue for distributed users? z The protocols we used to solve this multiaccess problem are
More informationExercises Solutions. Automation IEA, LTH. Chapter 2 Manufacturing and process systems. Chapter 5 Discrete manufacturing problems
Exercises Solutions Note, that we have not formulated the answers for all the review questions. You will find the answers for many questions by reading and reflecting about the text in the book. Chapter
More informationE[X n ]= dn dt n M X(t). ). What is the mgf? Solution. Found this the other day in the Kernel matching exercise: 1 M X (t) =
Chapter 7 Generating functions Definition 7.. Let X be a random variable. The moment generating function is given by M X (t) =E[e tx ], provided that the expectation exists for t in some neighborhood of
More informationLink Models for Circuit Switching
Link Models for Circuit Switching The basis of traffic engineering for telecommunication networks is the Erlang loss function. It basically allows us to determine the amount of telephone traffic that can
More informationOne-Bit LDPC Message Passing Decoding Based on Maximization of Mutual Information
One-Bit LDPC Message Passing Decoding Based on Maximization of Mutual Information ZOU Sheng and Brian M. Kurkoski kurkoski@ice.uec.ac.jp University of Electro-Communications Tokyo, Japan University of
More informationTHE QUEEN S UNIVERSITY OF BELFAST
THE QUEEN S UNIVERSITY OF BELFAST 0SOR20 Level 2 Examination Statistics and Operational Research 20 Probability and Distribution Theory Wednesday 4 August 2002 2.30 pm 5.30 pm Examiners { Professor R M
More informationComputation of Information Rates from Finite-State Source/Channel Models
Allerton 2002 Computation of Information Rates from Finite-State Source/Channel Models Dieter Arnold arnold@isi.ee.ethz.ch Hans-Andrea Loeliger loeliger@isi.ee.ethz.ch Pascal O. Vontobel vontobel@isi.ee.ethz.ch
More informationELEC546 Review of Information Theory
ELEC546 Review of Information Theory Vincent Lau 1/1/004 1 Review of Information Theory Entropy: Measure of uncertainty of a random variable X. The entropy of X, H(X), is given by: If X is a discrete random
More informationSTABILIZATION OF AN OVERLOADED QUEUEING NETWORK USING MEASUREMENT-BASED ADMISSION CONTROL
First published in Journal of Applied Probability 43(1) c 2006 Applied Probability Trust STABILIZATION OF AN OVERLOADED QUEUEING NETWORK USING MEASUREMENT-BASED ADMISSION CONTROL LASSE LESKELÄ, Helsinki
More informationChapter 3, 4 Random Variables ENCS Probability and Stochastic Processes. Concordia University
Chapter 3, 4 Random Variables ENCS6161 - Probability and Stochastic Processes Concordia University ENCS6161 p.1/47 The Notion of a Random Variable A random variable X is a function that assigns a real
More informationSynchronized Queues with Deterministic Arrivals
Synchronized Queues with Deterministic Arrivals Dimitra Pinotsi and Michael A. Zazanis Department of Statistics Athens University of Economics and Business 76 Patission str., Athens 14 34, Greece Abstract
More informationStat 100a, Introduction to Probability.
Stat 100a, Introduction to Probability. Outline for the day: 1. Geometric random variables. 2. Negative binomial random variables. 3. Moment generating functions. 4. Poisson random variables. 5. Continuous
More informationLecture 20: Reversible Processes and Queues
Lecture 20: Reversible Processes and Queues 1 Examples of reversible processes 11 Birth-death processes We define two non-negative sequences birth and death rates denoted by {λ n : n N 0 } and {µ n : n
More informationReducing Age-of-Information for Computation-Intensive Messages via Packet Replacement
Reducing Age-of-Information for Computation-Intensive Messages via Packet Replacement arxiv:191.4654v1 [cs.it] 15 Jan 19 Jie Gong, Qiaobin Kuang, Xiang Chen and Xiao Ma School of Data and Computer Science,
More informationSTAT Chapter 5 Continuous Distributions
STAT 270 - Chapter 5 Continuous Distributions June 27, 2012 Shirin Golchi () STAT270 June 27, 2012 1 / 59 Continuous rv s Definition: X is a continuous rv if it takes values in an interval, i.e., range
More informationLecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable
Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed
More informationL 2 -induced Gains of Switched Systems and Classes of Switching Signals
L 2 -induced Gains of Switched Systems and Classes of Switching Signals Kenji Hirata and João P. Hespanha Abstract This paper addresses the L 2-induced gain analysis for switched linear systems. We exploit
More informationLecture 5 Channel Coding over Continuous Channels
Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More informationUCSD ECE250 Handout #27 Prof. Young-Han Kim Friday, June 8, Practice Final Examination (Winter 2017)
UCSD ECE250 Handout #27 Prof. Young-Han Kim Friday, June 8, 208 Practice Final Examination (Winter 207) There are 6 problems, each problem with multiple parts. Your answer should be as clear and readable
More informationThe Behavior of a Multichannel Queueing System under Three Queue Disciplines
The Behavior of a Multichannel Queueing System under Three Queue Disciplines John K Karlof John Jenkins November 11, 2002 Abstract In this paper we investigate a multichannel channel queueing system where
More informationDivergence measure of intuitionistic fuzzy sets
Divergence measure of intuitionistic fuzzy sets Fuyuan Xiao a, a School of Computer and Information Science, Southwest University, Chongqing, 400715, China Abstract As a generation of fuzzy sets, the intuitionistic
More informationCitation PHYSICAL REVIEW E (2002), 65(6) RightCopyright 2002 American Physical So
TitleStatistical mechanics of two hard d Author(s) Munakata, T; Hu, G Citation PHYSICAL REVIEW E (2002), 65(6) Issue Date 2002-06 URL http://hdl.handle.net/2433/50310 RightCopyright 2002 American Physical
More informationSTEADY-STATE PROBABILITIES FOR SOME QUEUEING SYSTEMS: Part II
STEADY-STATE PROBABILITIES FOR SOME QUEUEING SYSTEMS: Part II Mahmoud Abdel Moety Hussain Faculty of Computers and Informatics Zagazig University Abstract This paper applies the theoretical results presented
More information6 Solving Queueing Models
6 Solving Queueing Models 6.1 Introduction In this note we look at the solution of systems of queues, starting with simple isolated queues. The benefits of using predefined, easily classified queues will
More informationA New Technique for Link Utilization Estimation
A New Technique for Link Utilization Estimation in Packet Data Networks using SNMP Variables S. Amarnath and Anurag Kumar* Dept. of Electrical Communication Engineering Indian Institute of Science, Bangalore
More informationFundamental Limits of Invisible Flow Fingerprinting
Fundamental Limits of Invisible Flow Fingerprinting Ramin Soltani, Dennis Goeckel, Don Towsley, and Amir Houmansadr Electrical and Computer Engineering Department, University of Massachusetts, Amherst,
More informationChapter 2 Queueing Theory and Simulation
Chapter 2 Queueing Theory and Simulation Based on the slides of Dr. Dharma P. Agrawal, University of Cincinnati and Dr. Hiroyuki Ohsaki Graduate School of Information Science & Technology, Osaka University,
More informationAsymptotic Filtering and Entropy Rate of a Hidden Markov Process in the Rare Transitions Regime
Asymptotic Filtering and Entropy Rate of a Hidden Marov Process in the Rare Transitions Regime Chandra Nair Dept. of Elect. Engg. Stanford University Stanford CA 94305, USA mchandra@stanford.edu Eri Ordentlich
More informationMultiaccess Channels with State Known to One Encoder: A Case of Degraded Message Sets
Multiaccess Channels with State Known to One Encoder: A Case of Degraded Message Sets Shivaprasad Kotagiri and J. Nicholas Laneman Department of Electrical Engineering University of Notre Dame Notre Dame,
More informationFirst Order Differential Equations
Chapter 2 First Order Differential Equations Introduction Any first order differential equation can be written as F (x, y, y )=0 by moving all nonzero terms to the left hand side of the equation. Of course,
More informationPerhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.
Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage
More information