Mutual Information and Optimal Data Coding


Mutual Information and Optimal Data Coding. May 9th 2012. Jules de Tibeiro, Université de Moncton à Shippagan; Bernard Colin, François Dubeau, Hussein Khreibani, Université de Sherbrooke.

Abstract
Introduction and Motivation
Example
Theoretical Framework
φ Divergence
Mutual Information
Optimal Partition
Mutual Information explained by a partition
Existence of an Optimal Partition
Computational Aspects and Examples
Conclusions and Perspectives
References

Abstract
Based on the notion of mutual information between the components of a random vector, we define an optimal quantization of the support of its probability measure: a simultaneous discretization of the whole set of components of the random vector that retains the stochastic dependence between them.
Key words: divergence, mutual information, copula, optimal quantization

Introduction and Motivation
We seek an optimal discretization of the support of a continuous multivariate distribution that retains the stochastic dependence between the variables.
Let $X = (X_1, X_2, \dots, X_k)$ be a random vector with values in $(\mathbb{R}^k, \mathcal{B}_{\mathbb{R}^k}, P_X)$, where $P_X$ is the probability measure of $X$ and $S_{P_X} \subseteq \mathbb{R}^k$ is the support of $P_X$.
Let $n = n_1 n_2 \cdots n_k$ be a product of given integers and let $\mathcal{P}$ be a partition of $S_{P_X}$ into $n$ elements or classes.
A partition $\mathcal{P}$ is a product partition if it is deduced from partitions $\mathcal{P}_1, \mathcal{P}_2, \dots, \mathcal{P}_k$ of the supports of the marginal probability measures into $n_1, n_2, \dots, n_k$ intervals.
Using a mutual information criterion, we choose the set of all intervals so that the quantization of the support $S_{P_X}$ retains the stochastic dependence between the components of the random vector $X$.

Introduction and Motivation
Here is an example for which such an optimal discretization might be desirable. Suppose that we have a sample of individuals on which we observe the following variables: X = age, Y = salary, Z = socioprofessional group. If we want to take the variables into account simultaneously, as, for example, in multiple correspondence analysis, we have to put them in the same form by means of a discretization of the first two. Instead of the usual independent categorization of the variables X and Y into a given number of classes (p for X and q for Y), it would be more relevant, using their stochastic dependence, to categorize X and Y simultaneously into pq classes (sometimes referred to as a (p, q) partition), in order to preserve as much as possible the dependence between them. Moreover, depending on the values taken by the categorical variable Z, the (conditional) discretization of the random vector (X, Y) must differ from one class to the others, to take into account the stochastic dependence between the continuous random variables and the categorical one. Usually, we do not take care of this dependence when creating classes for continuous random variables. However, the dependence between X = age and Y = salary is certainly quite different between the socioprofessional groups.

φ Divergence
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $\mu_1$ and $\mu_2$ be two probability measures defined on $\mathcal{F}$ such that $\mu_i \ll \mu$ for $i = 1, 2$.
The φ divergence, or generalized divergence (Csiszár [2]), between $\mu_1$ and $\mu_2$ is
$$I_\varphi(\mu_1, \mu_2) = \int \varphi\!\left(\frac{d\mu_1}{d\mu_2}\right) d\mu_2 = \int \varphi\!\left(\frac{f_1}{f_2}\right) f_2 \, d\mu,$$
where $\varphi(t)$ is a convex function from $\mathbb{R}^+ \setminus \{0\}$ to $\mathbb{R}$ and $f_i = \frac{d\mu_i}{d\mu}$ for $i = 1, 2$.
$I_\varphi(\mu_1, \mu_2)$ does not depend on the choice of $\mu$.
Homogeneous models:
$$I_\varphi(\mu_1, \mu_2) = \int \frac{d\mu_2}{d\mu_1}\, \varphi\!\left(\frac{d\mu_1}{d\mu_2}\right) d\mu_1 = \int \frac{f_2}{f_1}\, \varphi\!\left(\frac{f_1}{f_2}\right) f_1 \, d\mu.$$

φ Divergence
Usual measures of φ divergence:

$\varphi(x)$                              Name
$x \ln x$ ; $x - 1 - \ln x$               Kullback and Leibler
$|x - 1|$                                 Distance in variation
$(\sqrt{x} - 1)^2$                        Hellinger
$1 - x^{\alpha}$, $0 < \alpha < 1$        Chernoff
$(x - 1)^2$                               $\chi^2$
$[1 - x^{1/m}]^m$, $m > 0$                Jeffreys
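
As a minimal illustration of this definition and of the table above, the sketch below (Python; the names PHI and phi_divergence are mine, not from the talk) evaluates $I_\varphi(\mu_1, \mu_2) = \sum_i \varphi(f_{1,i}/f_{2,i})\, f_{2,i}$ for two discrete distributions with several of the listed choices of φ.

```python
# Minimal sketch: phi-divergence between two discrete probability vectors
# f1, f2 on the same finite support, for several phi from the table above.
# Function and dictionary names are illustrative, not from the talk.
import numpy as np

PHI = {
    "Kullback-Leibler":      lambda x: x * np.log(x),
    "Distance in variation": lambda x: np.abs(x - 1.0),
    "Hellinger":             lambda x: (np.sqrt(x) - 1.0) ** 2,
    "Chernoff (alpha=1/2)":  lambda x: 1.0 - np.sqrt(x),
    "Chi-squared":           lambda x: (x - 1.0) ** 2,
}

def phi_divergence(f1, f2, phi):
    """I_phi(mu1, mu2) = sum_i phi(f1_i / f2_i) * f2_i."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    mask = f2 > 0                      # cells with zero reference mass contribute 0 here
    return float(np.sum(phi(f1[mask] / f2[mask]) * f2[mask]))

if __name__ == "__main__":
    f1 = [0.5, 0.3, 0.2]
    f2 = [1 / 3, 1 / 3, 1 / 3]
    for name, phi in PHI.items():
        print(f"{name:22s} {phi_divergence(f1, f2, phi):.4f}")
```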

Mutual Information
Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X_1, X_2, \dots, X_k$ be random variables defined on $(\Omega, \mathcal{F}, P)$ with values in measure spaces $(\mathcal{X}_i, \mathcal{F}_i, \lambda_i)$, $i = 1, 2, \dots, k$.
Denote respectively by $P_X = P_{X_1, X_2, \dots, X_k}$ and by $\otimes_{i=1}^{k} P_{X_i}$ the probability measures defined on the product space $\left(\times_{i=1}^{k} \mathcal{X}_i, \otimes_{i=1}^{k} \mathcal{F}_i, \otimes_{i=1}^{k} \lambda_i\right)$, equal to the joint probability measure and to the product of the marginal ones, both supposed to be absolutely continuous with respect to the product measure $\lambda = \otimes_{i=1}^{k} \lambda_i$.

Definition 1 (Mutual Information)
The φ mutual information, or mutual information, between the random variables $X_1, X_2, \dots, X_k$ is given by
$$I_\varphi(X_1, X_2, \dots, X_k) = I_\varphi\!\left(P_X, \otimes_{i=1}^{k} P_{X_i}\right) = \int \varphi\!\left(\frac{dP_X}{d \otimes_{i=1}^{k} P_{X_i}}\right) d \otimes_{i=1}^{k} P_{X_i} = \int \varphi\!\left(\frac{f_1}{f_2}\right) f_2 \, d\lambda,$$
where $f_1$ and $f_2$ are the probability density functions of the measures $P_X$ and $\otimes_{i=1}^{k} P_{X_i}$ with respect to $\lambda = \otimes_{i=1}^{k} \lambda_i$.
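
For a concrete check of this definition, the following sketch (my own, not from the talk) takes $\varphi(x) = x \ln x$, so that $I_\varphi$ is the classical mutual information, and evaluates the integral numerically for a bivariate normal pair with correlation ρ, where the closed form $-\tfrac{1}{2}\ln(1 - \rho^2)$ is available for comparison.

```python
# Minimal sketch: numerical evaluation of Definition 1 with phi(x) = x ln x
# (classical mutual information) for a bivariate normal pair, compared with
# the known closed form -0.5 * ln(1 - rho^2).  Not part of the talk.
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal, norm

rho = 0.6
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def integrand(x2, x1):
    f1 = joint.pdf([x1, x2])                 # joint density (f_1 in the definition)
    f2 = norm.pdf(x1) * norm.pdf(x2)         # product of marginal densities (f_2)
    return f1 * np.log(f1 / f2)              # phi(f1/f2) * f2 with phi(x) = x ln x

approx, _ = integrate.dblquad(integrand, -8.0, 8.0, -8.0, 8.0)
print(approx, -0.5 * np.log(1.0 - rho**2))   # both approximately 0.2231
```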

Mutual Information explained by a partition
The random vector $X$ defined on $(\Omega, \mathcal{F}, P)$ has values in $(\mathbb{R}^k, \mathcal{B}_{\mathbb{R}^k})$, with probability measure $P_X$ ($P_X \ll \lambda$, where $\lambda$ is the Lebesgue measure on $\mathbb{R}^k$).
The support $S_{P_X}$ may be assumed of the form $\prod_{i=1}^{k} [a_i, b_i]$, where $-\infty < a_i < b_i < \infty$ for every $i = 1, 2, \dots, k$.
Given integers $n_1, n_2, \dots, n_k$, let $\mathcal{P}_i$, for $i = 1, 2, \dots, k$, be a partition of $[a_i, b_i]$ into $n_i$ intervals $\{\gamma_{i,j_i}\}$ such that $a_i = x_{i,0} < x_{i,1} < \cdots < x_{i,n_i - 1} < x_{i,n_i} = b_i$, with $\gamma_{i,j_i} = [x_{i,j_i - 1}, x_{i,j_i})$ for $j_i = 1, 2, \dots, n_i - 1$ and $\gamma_{i,n_i} = [x_{i,n_i - 1}, b_i]$.
The product partition $\mathcal{P} = \otimes_{i=1}^{k} \mathcal{P}_i$ of $S_{P_X}$ into $n = n_1 n_2 \cdots n_k$ rectangles of $\mathbb{R}^k$ is
$$\mathcal{P} = \{\gamma_{1,j_1} \times \gamma_{2,j_2} \times \cdots \times \gamma_{k,j_k}\} = \left\{\prod_{i=1}^{k} \gamma_{i,j_i}\right\}, \quad j_i = 1, 2, \dots, n_i \text{ for every } i.$$

Mutual Information explained by a partition
For the product partition $\mathcal{P} = \otimes_{i=1}^{k} \mathcal{P}_i$ of $S_{P_X}$ into $n = n_1 n_2 \cdots n_k$ rectangles of $\mathbb{R}^k$: if $\sigma(\mathcal{P})$ denotes the σ-algebra generated by $\mathcal{P}$, the restriction of $P_X$ to $\sigma(\mathcal{P})$ is given by $P_X\!\left(\prod_{i=1}^{k} \gamma_{i,j_i}\right)$ for every $(j_1, j_2, \dots, j_k)$, whose marginals are, for every $i = 1, 2, \dots, k$:
$$P_X\!\left(\prod_{r=1}^{i-1} [a_r, b_r] \times \gamma_{i,j_i} \times \prod_{r=i+1}^{k} [a_r, b_r]\right) = P_{X_i}(\gamma_{i,j_i}).$$
The mutual information explained by the partition $\mathcal{P}$ of the support $S_{P_X}$, denoted $I_\varphi(\mathcal{P})$, is
$$I_\varphi(\mathcal{P}) = \sum_{j_1, j_2, \dots, j_k} \varphi\!\left(\frac{P_X\!\left(\prod_{i=1}^{k} \gamma_{i,j_i}\right)}{\prod_{i=1}^{k} P_{X_i}(\gamma_{i,j_i})}\right) \prod_{i=1}^{k} P_{X_i}(\gamma_{i,j_i}).$$
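
When the cell probabilities $P_X\!\left(\prod_i \gamma_{i,j_i}\right)$ are available, $I_\varphi(\mathcal{P})$ is a finite sum. The short sketch below (illustrative, not from the talk; explained_information is my own name) computes it from a k-dimensional array of cell probabilities, the marginals being obtained by summing out the other axes.

```python
# Minimal sketch: explained mutual information I_phi(P) from the cell probabilities
# of a product partition, stored as a k-dimensional array.  Not part of the talk.
import numpy as np
from functools import reduce

def explained_information(cells, phi):
    """I_phi(P) = sum over cells of phi(p_cell / prod_i p_i) * prod_i p_i."""
    cells = np.asarray(cells, float)
    k = cells.ndim
    # P_{X_i}(gamma_{i,j_i}): sum the joint cell probabilities over all other axes
    marginals = [cells.sum(axis=tuple(a for a in range(k) if a != i)) for i in range(k)]
    prod = reduce(np.multiply.outer, marginals)      # product measure, same shape as cells
    mask = prod > 0
    return float(np.sum(phi(cells[mask] / prod[mask]) * prod[mask]))

if __name__ == "__main__":
    kl = lambda x: x * np.log(x)                     # Kullback-Leibler phi
    cells = np.array([[0.30, 0.10],
                      [0.10, 0.50]])                 # a dependent 2x2 table (k = 2)
    print(explained_information(cells, kl))          # > 0, since the table is not a product
```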

Existence of an Optimal Partition
For given integers $n_1, n_2, \dots, n_k$ and for every $i = 1, 2, \dots, k$, let $\mathcal{P}_{i,n_i}$ be the class of partitions of $[a_i, b_i]$ into $n_i$ disjoint intervals, and let $\mathcal{P}_{\mathbf{n}}$ be the class of partitions of $S_{P_X}$ given by $\mathcal{P}_{\mathbf{n}} = \prod_{i=1}^{k} \mathcal{P}_{i,n_i}$, where $\mathbf{n}$ is the multi-index $(n_1, n_2, \dots, n_k)$.
Each element $\mathcal{P}$ of $\mathcal{P}_{\mathbf{n}}$ may be considered as a vector of $\mathbb{R}^{\sum_{i=1}^{k}(n_i + 1)}$ with components
$$(a_1, x_{1,1}, \dots, x_{1,n_1 - 1}, b_1, a_2, x_{2,1}, \dots, x_{2,n_2 - 1}, b_2, \dots, a_k, x_{k,1}, \dots, x_{k,n_k - 1}, b_k),$$
under the constraints $a_i < x_{i,1} < \cdots < x_{i,n_i - 1} < b_i$ for every $i = 1, 2, \dots, k$.
A partition $\mathcal{P}$ of $S_{P_X}$ for which the mutual information loss is minimum solves the optimization problem
$$\min_{\mathcal{P} \in \mathcal{P}_{\mathbf{n}}} \left( I_\varphi(X_1, X_2, \dots, X_k) - I_\varphi(\mathcal{P}) \right),$$
which is equivalent to
$$\max_{\mathcal{P} \in \mathcal{P}_{\mathbf{n}}} I_\varphi(\mathcal{P}) = \max_{\mathcal{P} \in \mathcal{P}_{\mathbf{n}}} \sum_{j_1, j_2, \dots, j_k} \varphi\!\left(\frac{P_X\!\left(\prod_{i=1}^{k} \gamma_{i,j_i}\right)}{\prod_{i=1}^{k} P_{X_i}(\gamma_{i,j_i})}\right) \prod_{i=1}^{k} P_{X_i}(\gamma_{i,j_i}).$$

Computational Aspects and Examples
Consider the case of a bivariate random vector $X = (X_1, X_2)$ with probability density function $f(x_1, x_2)$ whose support is $[0, 1]^2$.
For each component, let respectively
$$0 = x_{1,0} < x_{1,1} < x_{1,2} < \cdots < x_{1,i} < \cdots < x_{1,p-1} < x_{1,p} = 1$$
and
$$0 = x_{2,0} < x_{2,1} < x_{2,2} < \cdots < x_{2,j} < \cdots < x_{2,q-1} < x_{2,q} = 1$$
be the endpoints of the intervals of two partitions of $[0, 1]$ into $p$ and $q$ elements respectively.
For $i = 1, 2, \dots, p$ and $j = 1, 2, \dots, q$, the probability measure of a rectangle $[x_{1,i-1}, x_{1,i}] \times [x_{2,j-1}, x_{2,j}]$ is given by
$$p_{ij} = \int_{x_{1,i-1}}^{x_{1,i}} \int_{x_{2,j-1}}^{x_{2,j}} f(x_1, x_2) \, dx_2 \, dx_1,$$
while its product probability measure is expressed as
$$\int_{x_{1,i-1}}^{x_{1,i}} f_1(x_1) \, dx_1 \int_{x_{2,j-1}}^{x_{2,j}} f_2(x_2) \, dx_2 = p_{i+} \, p_{+j}, \quad \text{with } p_{i+} = \sum_{j=1}^{q} p_{ij} \text{ and } p_{+j} = \sum_{i=1}^{p} p_{ij}.$$

Computational Aspects and Examples
The approximation of the mutual information between the random variables $X_1$ and $X_2$ conveyed by the discrete probability measure $\{p_{ij}\}$ is given by
$$\sum_{i=1}^{p} \sum_{j=1}^{q} \varphi\!\left(\frac{p_{ij}}{p_{i+} \, p_{+j}}\right) p_{i+} \, p_{+j}.$$
For given $p$, $q$ and $f(x_1, x_2)$, one has to maximize the following expression:
$$\max_{\{x_{1,i}\},\, \{x_{2,j}\}} \; \sum_{i=1}^{p} \sum_{j=1}^{q} \varphi\!\left(\frac{\int_{x_{1,i-1}}^{x_{1,i}} \int_{x_{2,j-1}}^{x_{2,j}} f(x_1, x_2) \, dx_2 \, dx_1}{\int_{x_{1,i-1}}^{x_{1,i}} f_1(x_1) \, dx_1 \int_{x_{2,j-1}}^{x_{2,j}} f_2(x_2) \, dx_2}\right) \int_{x_{1,i-1}}^{x_{1,i}} f_1(x_1) \, dx_1 \int_{x_{2,j-1}}^{x_{2,j}} f_2(x_2) \, dx_2.$$
This can be done with the well-known method of feasible directions of Zoutendijk [3]; see also Bertsekas [1].
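
One possible implementation of this maximization is sketched below for a (p, q) = (2, 2) partition with $\varphi(x) = x \ln x$: the cell and marginal probabilities are obtained by numerical integration and the two interior cut points are optimized. The test density f(x1, x2) = x1 + x2 and the use of Nelder-Mead (rather than a feasible-directions method) are my own illustrative choices, not the authors'.

```python
# Minimal sketch of the maximization above for a (2, 2) partition of [0, 1]^2:
# the interior cut points (one per axis) are chosen to maximize the explained
# information.  The density f(x1, x2) = x1 + x2 and the Nelder-Mead optimizer
# (instead of feasible directions) are illustrative choices, not the authors'.
import numpy as np
from scipy import integrate, optimize

f = lambda x1, x2: x1 + x2                                   # a density on [0, 1]^2
f1 = lambda x1: x1 + 0.5                                     # marginal of X1 (closed form)
f2 = lambda x2: x2 + 0.5                                     # marginal of X2 (closed form)
phi = lambda x: x * np.log(x)                                # Kullback-Leibler phi

def explained_info(cuts):
    a, b = cuts
    xs, ys = [0.0, a, 1.0], [0.0, b, 1.0]
    total = 0.0
    for i in range(2):
        for j in range(2):
            # p_ij: joint probability of the rectangle (outer variable x1, inner x2)
            p_ij = integrate.dblquad(lambda x2, x1: f(x1, x2),
                                     xs[i], xs[i + 1], ys[j], ys[j + 1])[0]
            p_i = integrate.quad(f1, xs[i], xs[i + 1])[0]    # p_{i+}
            p_j = integrate.quad(f2, ys[j], ys[j + 1])[0]    # p_{+j}
            total += phi(p_ij / (p_i * p_j)) * p_i * p_j
    return total

res = optimize.minimize(lambda c: -explained_info(c), x0=[0.4, 0.6],
                        method="Nelder-Mead", bounds=[(0.01, 0.99)] * 2)
print(res.x, -res.fun)                                       # optimal cut points and I_phi(P)
```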

Computational Aspects and Examples: Example
Let $X = (X_1, X_2)$ be a bivariate exponential random vector, with unit exponential marginals and dependence parameter $\theta$, whose probability density function is given by
$$f(x_1, x_2) = e^{-x_1 - x_2}\left[1 + \theta - 2\theta\left(e^{-x_1} + e^{-x_2} - 2 e^{-x_1 - x_2}\right)\right] I_{\mathbb{R}_+^2}(x_1, x_2).$$
Let $C(u_1, u_2)$ be its copula, whose probability density function $c(u_1, u_2)$ is
$$c(u_1, u_2) = \left[1 + \theta (1 - 2u_1)(1 - 2u_2)\right] I_{[0,1]^2}(u_1, u_2).$$
This family of distributions is also known as the Farlie-Gumbel-Morgenstern class.
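
For this example, the copula density above integrates to the closed form $C(u_1, u_2) = u_1 u_2 \left[1 + \theta (1 - u_1)(1 - u_2)\right]$ with uniform marginals, so the cell probabilities of a (2, 2) partition of $[0, 1]^2$ are available exactly. The sketch below (my own, not from the talk; the value θ = 0.8 and the optimizer are illustrative choices) maximizes the explained information over the interior cut point (a, b).

```python
# Minimal sketch for the FGM example: C(u1, u2) = u1*u2*(1 + theta*(1-u1)*(1-u2))
# gives the 2x2 cell probabilities of a (2, 2) partition in closed form (uniform
# marginals), and the explained information is maximized over the cut point (a, b).
# theta = 0.8 and the Nelder-Mead optimizer are illustrative, not from the talk.
import numpy as np
from scipy import optimize

theta = 0.8
C = lambda u1, u2: u1 * u2 * (1.0 + theta * (1.0 - u1) * (1.0 - u2))   # FGM copula
phi = lambda x: x * np.log(x)                                          # Kullback-Leibler phi

def explained_info(cut):
    a, b = cut
    p11 = C(a, b)                                  # P(U1 <= a, U2 <= b)
    p12 = a - p11                                  # C(a, 1) = a (uniform marginal)
    p21 = b - p11                                  # C(1, b) = b
    p22 = 1.0 - a - b + p11
    cells = np.array([[p11, p12], [p21, p22]])
    prod = np.outer([a, 1.0 - a], [b, 1.0 - b])    # product of the uniform marginals
    return float(np.sum(phi(cells / prod) * prod))

res = optimize.minimize(lambda c: -explained_info(c), x0=[0.3, 0.7],
                        method="Nelder-Mead", bounds=[(0.01, 0.99)] * 2)
print(res.x, -res.fun)        # by symmetry of the FGM copula, the optimum is near (0.5, 0.5)
```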

Conclusions and Perspectives
In data mining, the choice of a parametric statistical model is not quite realistic due to the huge number of variables and data; in this case, a nonparametric framework is often more appropriate.
To estimate the probability density function of a random vector, we will use a kernel density estimator in order to evaluate the mutual information between its components, and we will study the effects of the choice of the kernel on the robustness of the optimal partition.
In Multiple Correspondence Analysis (MCA) and in Classification, we often have to deal simultaneously with continuous and categorical variables, and it may be of interest to use an optimal partition in order to retain, as much as possible, the stochastic dependence between the random variables; we will explore the consequences of the choices of φ and of an optimal partition P on these models.
Finally, we will develop user-friendly software to perform optimal coding in the nonparametric and semiparametric cases.

References
[1] D.P. Bertsekas, Nonlinear Programming, 2nd Ed., Athena Scientific, Belmont, Mass., 1999.
[2] I. Csiszár, Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica, 2 (1967), 299-318.
[3] G. Zoutendijk, Methods of Feasible Directions, Elsevier, Amsterdam, and D. Van Nostrand, Princeton, N.J., 1960.