Mutual Information and Optimal Data Coding
May 9th, 2012
Jules de Tibeiro, Université de Moncton à Shippagan
Bernard Colin, François Dubeau, Hussein Khreibani, Université de Sherbrooke
Outline
- Abstract
- Introduction and Motivation
- Example
- Theoretical Framework
- φ-Divergence
- Mutual Information
- Optimal Partition
- Mutual Information explained by a partition
- Existence of an Optimal Partition
- Computational Aspects and Examples
- Conclusions and Perspectives
- References
Abstract
Based on the notion of mutual information between the components of a random vector, we propose an optimal quantization of the support of its probability measure: a simultaneous discretization of the whole set of components of the random vector, designed to preserve the stochastic dependence between them.
Key words: divergence, mutual information, copula, optimal quantization
Introduction and Motivation
We seek an optimal discretization of the support of a continuous multivariate distribution which retains the stochastic dependence between the variables. Let $X = (X_1, X_2, \ldots, X_k)$ be a random vector with values in $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k), P_X)$, where $P_X$ is the probability measure of $X$ and $S_{P_X} \subseteq \mathbb{R}^k$ is the support of $P_X$. Let $n = n_1 n_2 \cdots n_k$ be a product of given integers, and consider a partition $\mathcal{P}$ of $S_{P_X}$ into $n$ elements or classes. $\mathcal{P}$ is a product partition deduced from partitions $\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_k$ of the supports of the marginal probability measures into $n_1, n_2, \ldots, n_k$ intervals respectively. Using a mutual information criterion, we choose the set of all intervals such that the quantization of the support $S_{P_X}$ retains as much as possible of the stochastic dependence between the components of the random vector $X$.
Introduction and Motivation
Here is an example for which such an optimal discretization might be desirable. Suppose we have a sample of individuals on which we observe the following variables: X = age, Y = salary, Z = socioprofessional group. If we want to take the variables into account simultaneously, as for example in multiple correspondence analysis, we have to put them in the same form by discretizing the first two. Instead of the usual independent categorization of the variables X and Y into a given number of classes (p for X and q for Y), it would be more relevant to use their stochastic dependence to categorize X and Y simultaneously into pq classes (sometimes referred to as a (p, q)-partition), in order to preserve as much as possible of the dependence between them. Moreover, depending on the values taken by the categorical variable Z, the (conditional) discretization of the random vector (X, Y) must differ from one class to the others, to take into account the stochastic dependence between the continuous random variables and the categorical one. Usually, no care is taken of this dependence when creating classes for continuous random variables; however, the dependence between X = age and Y = salary is certainly quite different between the socioprofessional groups.
φ-Divergence
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $\mu_1$ and $\mu_2$ be two probability measures defined on $\mathcal{F}$ such that $\mu_i \ll \mu$ for $i = 1, 2$. The φ-divergence, or generalized divergence (Csiszár [2]), between $\mu_1$ and $\mu_2$ is
$$I_\varphi(\mu_1, \mu_2) = \int \varphi\!\left(\frac{d\mu_1}{d\mu_2}\right) d\mu_2 = \int \varphi\!\left(\frac{f_1}{f_2}\right) f_2 \, d\mu,$$
where $\varphi(t)$ is a convex function from $\mathbb{R}^+ \setminus \{0\}$ to $\mathbb{R}$ and $f_i = d\mu_i/d\mu$ for $i = 1, 2$. $I_\varphi(\mu_1, \mu_2)$ does not depend on the choice of $\mu$. Homogeneous form:
$$I_\varphi(\mu_1, \mu_2) = \int \frac{d\mu_2}{d\mu_1} \, \varphi\!\left(\frac{d\mu_1}{d\mu_2}\right) d\mu_1 = \int \frac{f_2}{f_1} \, \varphi\!\left(\frac{f_1}{f_2}\right) f_1 \, d\mu.$$
φ-Divergence
Usual measures of φ-divergence:

$\varphi(x)$                          Name
$x \ln x$; $x - 1 - \ln x$            Kullback and Leibler
$|x - 1|$                             Distance in variation
$(\sqrt{x} - 1)^2$                    Hellinger
$1 - x^{\alpha}$, $0 < \alpha < 1$    Chernoff
$(x - 1)^2$                           $\chi^2$
$[1 - x^{1/m}]^{m}$, $m > 0$          Jeffreys
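To make the definition concrete, here is a minimal numerical sketch (Python with NumPy; the names `phi_kl`, `phi_tv`, `phi_hellinger` and `phi_divergence` are ours, and the two distributions are arbitrary) of $I_\varphi(\mu_1, \mu_2)$ for discrete measures, using three of the φ functions from the table:

```python
import numpy as np

# A few phi functions from the table above (convex, phi(1) = 0 up to affine terms).
phi_kl = lambda x: x * np.log(x)                    # Kullback-Leibler
phi_tv = lambda x: np.abs(x - 1.0)                  # distance in variation
phi_hellinger = lambda x: (np.sqrt(x) - 1.0) ** 2   # Hellinger

def phi_divergence(phi, p1, p2):
    """I_phi(mu1, mu2) = sum phi(p1/p2) * p2 for discrete distributions."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mask = p2 > 0                                   # convention: terms with p2 = 0 vanish
    return np.sum(phi(p1[mask] / p2[mask]) * p2[mask])

p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.25, 0.25, 0.5])
for name, phi in [("KL", phi_kl), ("TV", phi_tv), ("Hellinger", phi_hellinger)]:
    print(name, phi_divergence(phi, p1, p2))
```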
Mutual Information
Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X_1, X_2, \ldots, X_k$ be random variables defined on $(\Omega, \mathcal{F}, P)$ with values in measure spaces $(\mathcal{X}_i, \mathcal{F}_i, \lambda_i)$, $i = 1, 2, \ldots, k$. Denote respectively by $P_X = P_{X_1, X_2, \ldots, X_k}$ and by $\otimes_{i=1}^k P_{X_i}$ the probability measures defined on the product space $(\prod_{i=1}^k \mathcal{X}_i, \otimes_{i=1}^k \mathcal{F}_i, \otimes_{i=1}^k \lambda_i)$, equal to the joint probability measure and to the product of the marginal ones, both supposed to be absolutely continuous with respect to the product measure $\lambda = \otimes_{i=1}^k \lambda_i$.
Mutual Information
Definition 1. The φ-mutual information, or mutual information, between the random variables $X_1, X_2, \ldots, X_k$ is given by
$$I_\varphi(X_1, X_2, \ldots, X_k) = I_\varphi\!\left(P_X, \otimes_{i=1}^k P_{X_i}\right) = \int \varphi\!\left(\frac{dP_X}{d \otimes_{i=1}^k P_{X_i}}\right) d \otimes_{i=1}^k P_{X_i} = \int \varphi\!\left(\frac{f_1}{f_2}\right) f_2 \, d\lambda,$$
where $f_1$ and $f_2$ are the probability density functions of the measures $P_X$ and $\otimes_{i=1}^k P_{X_i}$ with respect to $\lambda = \otimes_{i=1}^k \lambda_i$.
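For discrete laws, Definition 1 reduces to the φ-divergence between the joint probability table and the product of its marginals. A minimal sketch for k = 2, under the same assumptions as the previous snippet (function name and the joint table are ours):

```python
import numpy as np

def mutual_information(phi, joint):
    """I_phi(X1, X2) for a 2-D array of joint probabilities."""
    joint = np.asarray(joint, float)
    p1 = joint.sum(axis=1)            # marginal law of X1
    p2 = joint.sum(axis=0)            # marginal law of X2
    prod = np.outer(p1, p2)           # product of the marginals
    mask = prod > 0
    return np.sum(phi(joint[mask] / prod[mask]) * prod[mask])

joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])
print(mutual_information(lambda x: x * np.log(x), joint))   # KL version
```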
Mutual Information explained by a partition
The random vector $X$ defined on $(\Omega, \mathcal{F}, P)$ has values in $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, with probability measure $P_X$ ($P_X \ll \lambda$, where $\lambda$ is the Lebesgue measure on $\mathbb{R}^k$). The support $S_{P_X}$ may be assumed of the form $\prod_{i=1}^k [a_i, b_i]$, where $-\infty < a_i < b_i < +\infty$ for every $i = 1, 2, \ldots, k$. Given integers $n_1, n_2, \ldots, n_k$, let $\mathcal{P}_i$, $i = 1, 2, \ldots, k$, be a partition of $[a_i, b_i]$ into $n_i$ intervals $\{\gamma_{i j_i}\}$ such that
$$a_i = x_{i0} < x_{i1} < \cdots < x_{i, n_i - 1} < x_{i n_i} = b_i,$$
with $\gamma_{i j_i} = [x_{i, j_i - 1}, x_{i j_i})$ for $j_i = 1, 2, \ldots, n_i - 1$ and $\gamma_{i n_i} = [x_{i, n_i - 1}, b_i]$. The product partition $\mathcal{P} = \otimes_{i=1}^k \mathcal{P}_i$ of $S_{P_X}$ into $n = n_1 n_2 \cdots n_k$ rectangles of $\mathbb{R}^k$ is
$$\mathcal{P} = \left\{\gamma_{1 j_1} \times \gamma_{2 j_2} \times \cdots \times \gamma_{k j_k}\right\} = \left\{\prod_{i=1}^k \gamma_{i j_i}\right\}, \quad j_i = 1, 2, \ldots, n_i \text{ for every } i.$$
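As an illustration of the product partition, the following sketch assigns sample points to the rectangles $\prod_{i=1}^k \gamma_{i j_i}$ with one `np.digitize` call per coordinate and accumulates the empirical cell probabilities; the sample and the cut points are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))       # illustrative sample in [0,1]^2

# Inner cut points x_{i1} < ... < x_{i,n_i - 1} for each coordinate.
cuts = [np.array([0.3, 0.7]),         # partition of [0,1] into n_1 = 3 intervals
        np.array([0.5])]              # partition of [0,1] into n_2 = 2 intervals

# Cell index (j_1, ..., j_k) of each point, one digitize per coordinate.
cells = np.stack([np.digitize(X[:, i], cuts[i]) for i in range(X.shape[1])], axis=1)

# Empirical probability P_X of each rectangle of the product partition.
counts = np.zeros((3, 2))
np.add.at(counts, (cells[:, 0], cells[:, 1]), 1)
print(counts / len(X))
```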
Mutual Information explained by a partition
If $\sigma(\mathcal{P})$ denotes the σ-algebra generated by $\mathcal{P}$, the restriction of $P_X$ to $\sigma(\mathcal{P})$ is given by $P_X(\prod_{i=1}^k \gamma_{i j_i})$ for every $(j_1, j_2, \ldots, j_k)$, whose marginals are, for every $i = 1, 2, \ldots, k$:
$$P_{X_i}(\gamma_{i j_i}) = P_X\!\left(\prod_{r=1}^{i-1} [a_r, b_r] \times \gamma_{i j_i} \times \prod_{r=i+1}^{k} [a_r, b_r]\right).$$
The mutual information explained by the partition $\mathcal{P}$ of the support $S_{P_X}$, denoted $I_\varphi(\mathcal{P})$, is
$$I_\varphi(\mathcal{P}) = \sum_{j_1, j_2, \ldots, j_k} \varphi\!\left(\frac{P_X(\prod_{i=1}^k \gamma_{i j_i})}{\prod_{i=1}^k P_{X_i}(\gamma_{i j_i})}\right) \prod_{i=1}^k P_{X_i}(\gamma_{i j_i}).$$
Existence of an Optimal Partition
For given integers $n_1, n_2, \ldots, n_k$ and for every $i = 1, 2, \ldots, k$, let $\mathcal{P}_{i, n_i}$ denote the class of partitions of $[a_i, b_i]$ into $n_i$ disjoint intervals, and let $\mathcal{P}_{\mathbf{n}}$ denote the class of partitions of $S_{P_X}$ given by $\mathcal{P}_{\mathbf{n}} = \prod_{i=1}^k \mathcal{P}_{i, n_i}$, where $\mathbf{n}$ is the multi-index $(n_1, n_2, \ldots, n_k)$. Each element $\mathcal{P}$ of $\mathcal{P}_{\mathbf{n}}$ may be considered as a vector of $\mathbb{R}^{\sum_{i=1}^k (n_i + 1)}$ with components
$$(a_1, x_{11}, \ldots, x_{1, n_1 - 1}, b_1, a_2, x_{21}, \ldots, x_{2, n_2 - 1}, b_2, \ldots, a_k, x_{k1}, \ldots, x_{k, n_k - 1}, b_k),$$
under the constraints $a_i < x_{i1} < \cdots < x_{i, n_i - 1} < b_i$ for every $i = 1, 2, \ldots, k$. A partition $\mathcal{P}$ of $S_{P_X}$ for which the mutual information loss is minimal solves the optimization problem
$$\min_{\mathcal{P} \in \mathcal{P}_{\mathbf{n}}} \left( I_\varphi(X_1, X_2, \ldots, X_k) - I_\varphi(\mathcal{P}) \right),$$
which, since $I_\varphi(X_1, X_2, \ldots, X_k)$ does not depend on $\mathcal{P}$, is equivalent to
$$\max_{\mathcal{P} \in \mathcal{P}_{\mathbf{n}}} I_\varphi(\mathcal{P}) = \max_{\mathcal{P} \in \mathcal{P}_{\mathbf{n}}} \sum_{j_1, j_2, \ldots, j_k} \varphi\!\left(\frac{P_X(\prod_{i=1}^k \gamma_{i j_i})}{\prod_{i=1}^k P_{X_i}(\gamma_{i j_i})}\right) \prod_{i=1}^k P_{X_i}(\gamma_{i j_i}).$$
Computational Aspects and Examples
Consider the case of a bivariate random vector $X = (X_1, X_2)$ with probability density function $f(x_1, x_2)$ whose support is $[0, 1]^2$. For each component, let respectively
$$0 = x_{10} < x_{11} < x_{12} < \cdots < x_{1i} < \cdots < x_{1, p-1} < x_{1p} = 1$$
and
$$0 = x_{20} < x_{21} < x_{22} < \cdots < x_{2j} < \cdots < x_{2, q-1} < x_{2q} = 1$$
be the endpoints of the intervals of two partitions of $[0, 1]$ into respectively $p$ and $q$ elements. For $i = 1, 2, \ldots, p$ and $j = 1, 2, \ldots, q$, the probability measure of a rectangle $[x_{1, i-1}, x_{1i}] \times [x_{2, j-1}, x_{2j}]$ is given by
$$p_{ij} = \int_{x_{1, i-1}}^{x_{1i}} \int_{x_{2, j-1}}^{x_{2j}} f(x_1, x_2) \, dx_2 \, dx_1,$$
while its product probability measure is expressed as
$$\int_{x_{1, i-1}}^{x_{1i}} f_1(x_1) \, dx_1 \int_{x_{2, j-1}}^{x_{2j}} f_2(x_2) \, dx_2 = p_{i+} \, p_{+j},$$
with $p_{i+} = \sum_{j=1}^q p_{ij}$ and $p_{+j} = \sum_{i=1}^p p_{ij}$.
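A sketch of this computation for a density on $[0,1]^2$, using `scipy.integrate.dblquad` for the $p_{ij}$ and marginal sums for $p_{i+}$ and $p_{+j}$. As a concrete $f$ we borrow the Farlie-Gumbel-Morgenstern copula density from the example below, with $\theta = 0.5$; the cut points are arbitrary:

```python
import numpy as np
from scipy.integrate import dblquad

def f(x1, x2):
    """FGM copula density on [0,1]^2 with theta = 0.5 (see the example below)."""
    return 1.0 + 0.5 * (1 - 2 * x1) * (1 - 2 * x2)

x1_cuts = [0.0, 0.4, 1.0]             # p = 2 intervals for X1
x2_cuts = [0.0, 0.5, 1.0]             # q = 2 intervals for X2

p = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # dblquad integrates func(y, x): x1 is the outer variable, x2 the inner one.
        p[i, j], _ = dblquad(lambda x2, x1: f(x1, x2),
                             x1_cuts[i], x1_cuts[i + 1],
                             x2_cuts[j], x2_cuts[j + 1])

p_row = p.sum(axis=1)                 # p_{i+}
p_col = p.sum(axis=0)                 # p_{+j}
print(p, p_row, p_col)
```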
Computational Aspects and Examples
The approximation of the mutual information between the random variables $X_1$ and $X_2$ conveyed by the discrete probability measure $\{p_{ij}\}$ is given by
$$\sum_{i=1}^{p} \sum_{j=1}^{q} \varphi\!\left(\frac{p_{ij}}{p_{i+} \, p_{+j}}\right) p_{i+} \, p_{+j}.$$
For given $p$, $q$ and $f(x_1, x_2)$, one has to maximize the following expression:
$$\max_{\{x_{1i}\}, \{x_{2j}\}} \sum_{i=1}^{p} \sum_{j=1}^{q} \varphi\!\left(\frac{\int_{x_{1, i-1}}^{x_{1i}} \int_{x_{2, j-1}}^{x_{2j}} f(x_1, x_2) \, dx_2 \, dx_1}{\int_{x_{1, i-1}}^{x_{1i}} f_1(x_1) \, dx_1 \int_{x_{2, j-1}}^{x_{2j}} f_2(x_2) \, dx_2}\right) \int_{x_{1, i-1}}^{x_{1i}} f_1(x_1) \, dx_1 \int_{x_{2, j-1}}^{x_{2j}} f_2(x_2) \, dx_2.$$
This maximization can be carried out with the well-known method of feasible directions of Zoutendijk [3] (see also Bertsekas [1]).
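The talk relies on Zoutendijk's feasible-direction method; as a rough, general-purpose substitute, the sketch below feeds the Kullback-Leibler version of the criterion to SciPy's SLSQP solver over the interior cut points, for the simplest case p = q = 2 and the same FGM density as above. All names are ours, and box bounds stand in for the ordering constraints, which are trivially satisfied when there is a single interior cut per axis:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import dblquad

def f(x1, x2):
    return 1.0 + 0.5 * (1 - 2 * x1) * (1 - 2 * x2)   # FGM density, theta = 0.5

def cell_probs(x1_cuts, x2_cuts):
    """Matrix of p_ij over the rectangles of the product partition."""
    p = np.zeros((len(x1_cuts) - 1, len(x2_cuts) - 1))
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            p[i, j], _ = dblquad(lambda x2, x1: f(x1, x2),
                                 x1_cuts[i], x1_cuts[i + 1],
                                 x2_cuts[j], x2_cuts[j + 1])
    return p

def neg_info(cuts):                    # cuts = (x_11, x_21) for p = q = 2
    p = cell_probs([0.0, cuts[0], 1.0], [0.0, cuts[1], 1.0])
    prod = np.outer(p.sum(axis=1), p.sum(axis=0))
    return -np.sum(p * np.log(p / prod))   # minus the KL mutual information

res = minimize(neg_info, x0=[0.4, 0.6], method="SLSQP",
               bounds=[(0.01, 0.99), (0.01, 0.99)])
print(res.x, -res.fun)                 # optimal cuts and explained information
```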
Computational Aspects and Examples: Example
Let $X = (X_1, X_2) \sim \varepsilon_2(\theta)$, $|\theta| \le 1$, be a bivariate exponential random vector whose probability density function is given by
$$f(x_1, x_2) = e^{-x_1 - x_2} \left[ 1 + \theta - 2\theta \left( e^{-x_1} + e^{-x_2} - 2 e^{-x_1 - x_2} \right) \right] I_{\mathbb{R}_+^2}(x_1, x_2).$$
Let $C(u_1, u_2)$ be its copula, whose probability density function $c(u_1, u_2)$ is
$$c(u_1, u_2) = \left[ 1 + \theta (1 - 2 u_1)(1 - 2 u_2) \right] I_{[0,1]^2}(u_1, u_2).$$
This family of distributions is also known as the Farlie-Gumbel-Morgenstern class.
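A sketch of the FGM copula density and of conditional-inversion sampling from it, obtained by inverting $C(u_2 \mid u_1)$, which is quadratic in $u_2$ (a standard construction for the FGM family; the function names are ours):

```python
import numpy as np

def fgm_density(u1, u2, theta):
    """c(u1, u2) = 1 + theta (1 - 2 u1)(1 - 2 u2) on [0,1]^2, |theta| <= 1."""
    return 1.0 + theta * (1 - 2 * u1) * (1 - 2 * u2)

def fgm_sample(n, theta, rng):
    """Conditional inversion: solve C(u2 | u1) = t, a quadratic in u2."""
    u1, t = rng.uniform(size=n), rng.uniform(size=n)
    a = theta * (1 - 2 * u1)
    b = 1 + a
    u2 = np.where(np.abs(a) < 1e-12, t,
                  (b - np.sqrt(b * b - 4 * a * t)) / (2 * a))
    return np.column_stack([u1, u2])

rng = np.random.default_rng(0)
sample = fgm_sample(10_000, theta=0.5, rng=rng)
print(np.corrcoef(sample.T)[0, 1])   # approx. theta / 3 for the FGM family
```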
Conclusions and Perspectives
In data mining, the choice of a parametric statistical model is not quite realistic, due to the huge number of variables and data; in this case, a nonparametric framework is often more appropriate. To estimate the probability density function of a random vector, we will use a kernel density estimator in order to evaluate the mutual information between its components, and study the effects of the choice of the kernel on the robustness of the optimal partition. In Multiple Correspondence Analysis (MCA) and in classification, we often have to deal simultaneously with continuous and categorical variables, and it may be of interest to use an optimal partition in order to retain, as much as possible, the stochastic dependence between the random variables; we will explore the consequences of the choices of φ and of an optimal partition P on these models. Finally, we will develop user-friendly software to perform optimal coding in the nonparametric and semiparametric cases.
References
[1] D.P. Bertsekas, Nonlinear Programming, 2nd Ed., Athena Scientific, Belmont, Mass., 1999.
[2] I. Csiszár, Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica, 2 (1967), 299-318.
[3] G. Zoutendijk, Methods of Feasible Directions, Elsevier, Amsterdam, and D. Van Nostrand, Princeton, N.J., 1960.