An Information-Theoretic Measure of Dependency Among Variables in Large Datasets

Ali Mousavi, Richard G. Baraniuk
Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005
Email: {ali.mousavi, richb}@rice.edu
arXiv:1508.04073v1 [cs.IT] 17 Aug 2015

(This work was supported by NSF CCF-0926127 and CCF-1117939; DARPA/ONR N66001-11-C-4092 and N66001-11-1-4090; ONR N00014-10-1-0989 and N00014-11-1-0714; and ARO MURI W911NF-09-1-0383.)

Abstract: The maximal information coefficient (MIC), which measures the amount of dependence between two variables, is able to detect both linear and non-linear associations. However, its computational cost grows rapidly as a function of the dataset size. In this paper, we develop a computationally efficient approximation to the MIC that replaces its dynamic programming step with a much simpler technique based on uniform partitioning of the data grid. A variety of experiments demonstrate the quality of our approximation.

I. INTRODUCTION

One of the challenging issues for statisticians and computer scientists is dealing with large datasets containing hundreds of variables, some of which may have interesting but unexplored relationships with each other. Massive datasets of this kind arise in many areas, such as social networks, astronomy, genomics, medical records, and political science. Hence, it is an interesting problem to develop methods that help us discover these relationships.

Measuring the amount of dependence between two variables has been studied extensively in the literature, and several methods have been proposed for it. We review some of them in the following.

In [1], the author suggested seven properties that should be satisfied by any measure φ(x,y) used for determining the amount of dependence between x and y. These properties, known as Rényi's axioms, are:

- In defining φ(x,y) between any two random variables, neither x nor y should be constant with probability 1.
- 0 ≤ φ(x,y) ≤ 1.
- φ(x,y) = 0 if and only if x and y are independent of each other.
- φ(x,y) = 1 if there is an arbitrary functional dependency between x and y; in other words, if y = f(x) or x = g(y), where f(·) and g(·) are Borel-measurable functions.
- φ(x,y) = φ(y,x).
- If f(·) and g(·) are strictly monotonic functions, then φ(x,y) = φ(f(x), g(y)).
- If x and y are jointly Gaussian random variables, then φ(x,y) = |PCC(x,y)|, where PCC is the Pearson correlation coefficient.

The Pearson correlation coefficient (PCC) is the most well-known dependency measure. However, it is unable to detect non-linear associations; in other words, the PCC can only capture linear associations between two variables. As another measure of dependency, the correlation ratio of a random variable y (provided σ²(y) exists and σ(y) > 0) on a random variable x, introduced in [2] and [3], is defined as

    Θ(y|x) = σ(E(y|x)) / σ(y).    (1)

It is easy to show that 0 ≤ Θ(y|x) ≤ 1, where Θ(y|x) = 1 if and only if y = f(x) for some Borel-measurable function f, and Θ(y|x) = 0 if x and y are independent. An alternative formula for the correlation ratio, mentioned in [1], is

    Θ(y|x) = sup_g PCC(y, g(x)).    (2)

This alternative formula leads to another measure of dependency, called the maximal correlation [4]:

    S(x,y) = sup_{f,g} PCC(f(x), g(y)),    (3)

where f(·) and g(·) are Borel-measurable functions. The author in [1] showed that S(x,y) = 0 if and only if x and y are independent. Furthermore, if there is an arbitrary functional relationship between x and y, then S(x,y) = 1. The authors in [5] introduced the alternating conditional expectation (ACE) algorithm to find the optimal transformations.
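As a concrete illustration of (1), the correlation ratio can be estimated from samples by binning x and comparing the spread of the per-bin means of y with the overall spread of y. The following minimal Python sketch is ours; the function name correlation_ratio and the choice of 20 equal-width bins are not from the paper.

```python
import numpy as np

def correlation_ratio(x, y, num_bins=20):
    """Estimate Theta(y|x) = sigma(E[y|x]) / sigma(y) from Eq. (1) by binning x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), num_bins + 1)
    bins = np.digitize(x, edges[1:-1])                 # bin index of each sample on the X-axis
    occupied = [b for b in range(num_bins) if np.any(bins == b)]
    cond_mean = np.array([y[bins == b].mean() for b in occupied])   # E[y | x in bin b]
    weights = np.array([np.mean(bins == b) for b in occupied])      # P(x in bin b)
    sigma_cond = np.sqrt(np.sum(weights * (cond_mean - y.mean()) ** 2))  # sigma(E[y|x])
    return sigma_cond / y.std()

# A noiseless quadratic dependence: the correlation ratio is close to 1,
# while the Pearson correlation of the same data is close to 0.
x = np.random.uniform(-1, 1, 5000)
print(correlation_ratio(x, x ** 2), np.corrcoef(x, x ** 2)[0, 1])
```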
The Spearman correlation coefficient [6] is defined similarly to the PCC; however, it is computed between the two ranked variables. By ranked variables we mean replacing each data point by its rank (or the average rank for equal sample points) in ascending order. Therefore, if x̃_i and ỹ_i denote the ranked versions of x_i and y_i, the Spearman correlation coefficient is

    ρ = Σ_i (x̃_i − x̄)(ỹ_i − ȳ) / √( Σ_i (x̃_i − x̄)² · Σ_i (ỹ_i − ȳ)² ),    (4)

where x̄ and ȳ denote the means of the ranked variables. The authors in [7] expressed covariance and linear correlation in terms of principal components and generalized them to variables distributed along a curve; they estimated their measures using principal curves.
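Concretely, (4) is just the Pearson correlation applied to the ranks. A minimal sketch (using SciPy's rankdata, which assigns average ranks to ties as described above):

```python
import numpy as np
from scipy.stats import rankdata

def spearman(x, y):
    """Spearman correlation, Eq. (4): Pearson correlation of the (average-tie) ranks."""
    rx, ry = rankdata(x), rankdata(y)   # ranks; ties receive the average rank
    return np.corrcoef(rx, ry)[0, 1]

# A monotone but non-linear association: Spearman is 1, while Pearson is below 1.
x = np.random.uniform(0, 1, 1000)
y = np.exp(5 * x)
print(spearman(x, y), np.corrcoef(x, y)[0, 1])
```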

Mutual information [8] is another measure that can be used to quantify the dependency between two variables, since it satisfies some common properties of other dependency measures; for example, I(x,y) = 0 if and only if x and y are independent. The authors in [9] used kernel density estimates of the probability density functions in order to estimate the mutual information between two variables. In [10], a method of mutual information estimation based on binning and on estimating entropy from k-nearest neighbors is proposed.

The MIC [11] was recently proposed for quantifying the dependency between two random variables. It is based on binning the dataset using a dynamic programming technique to compute the mutual information between different variables. It has two main properties which make it superior to the aforementioned measures. First, it has generality, meaning that if the sample size is large enough, it is able to detect different kinds of associations rather than only specific types. Second, it is an equitable measure, meaning that it gives similar scores to equally noisy associations no matter what type the association is. One of the problems with the MIC is the fact that its computational cost grows rapidly as a function of the dataset size. Since this computational cost may become infeasible, the authors in [11] applied a heuristic so as not to compute the mutual information for all possible grids. This heuristic may result in finding only a local maximum.

In this paper, we develop a computationally efficient approximation to the MIC. This approximation is based on replacing the dynamic programming step used in the computation of the MIC with a very efficient technique, namely uniformly binning the data. We show that our proposed method is able to detect both functional and non-functional associations between different variables, similarly to the MIC but more efficiently. In addition, it has better performance in recognizing independence between variables.

The rest of this paper is organized as follows. In Section II, we review the MIC and the algorithm used to compute it from [11]. In Section III, we introduce our new measure of dependency, which is a modification of the MIC. We present simulation results in Section IV. Finally, Section V concludes the paper.

II. THE MAXIMAL INFORMATION COEFFICIENT (MIC)

A. MIC Definition and Properties

For any finite dataset D that contains ordered pairs of two random variables, one can partition the first element (the x-value) of these pairs into |x| bins and similarly partition the second element (the y-value) into |y| bins. As a result of this partitioning, we obtain an |x|-by-|y| grid G. Every cell of this grid may or may not contain sample points from the set D. This grid induces a probability distribution on the cells of G, where the probability of each cell is equal to the fraction of sample points located in that cell. That is to say,

    p_ij = |D_ij| / |D|,    (5)

where p_ij denotes the probability corresponding to the cell located at the i-th row and the j-th column, and |D_ij| denotes the number of sample points falling into the i-th row and the j-th column (see Figure 1 for a graphical view of the grid G).

Fig. 1. Partitioning of dataset D into |x| columns and |y| rows. D_ij denotes the set of sample points located in the i-th row and the j-th column.

It is obvious that for each (|x|, |y|) we obtain a grid that induces a new probability distribution and hence results in a different mutual information between the two variables. Let I*_D(P;Q) = max_G I_{D|G}(P;Q) be the largest possible mutual information achievable by an |x|-by-|y| grid G on a set D of sample points.
Here P and Q are the partitions of the X-axis and Y-axis of grid G, respectively. In order to have a fair comparison among different grids, the computed values of mutual information should be normalized. Since I(P;Q) = H(Q) − H(Q|P) = H(P) − H(P|Q), we divide I*_D(P;Q) by log(min(|x|, |y|)). Therefore, we have

    0 ≤ I*_D(P;Q) / log(min(|x|, |y|)) ≤ 1.    (6)

This inequality motivates the definition of the MIC as a measure of dependency between two variables. For a dataset D containing n samples of two variables, we have

    MIC(D) = max_{|x||y| < B(n)}  I*_D(P;Q) / log(min(|x|, |y|)),    (7)

where B(n) = n^0.6 [11] or, more generally, ω(1) ≤ B(n) ≤ O(n^(1−ǫ)). According to this definition, the MIC has the following properties:

- 0 ≤ MIC(D) ≤ 1.
- MIC(x,y) = MIC(y,x).
- It is invariant under order-preserving transformations applied to the dataset D.
- It is not invariant under rotation of the coordinate axes; e.g., if y = |x|, then MIC(D) = 1, but after a 45° clockwise rotation of the coordinate axes we have y = 0 instead of y = |x|, and hence MIC(D) = 0.
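For reference, the grid-induced distribution (5) and the normalized mutual information in (6) can be computed directly for any fixed pair of bin edges. The helper below (the name normalized_grid_mi is ours) is a minimal sketch of this score; the MIC then maximizes it over all grids with |x||y| < B(n), which [11] does with the dynamic program reviewed next.

```python
import numpy as np

def normalized_grid_mi(x, y, x_edges, y_edges):
    """I(P;Q) / log(min(|x|,|y|)) for the grid induced by the given bin edges (Eqs. (5)-(6))."""
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p = counts / counts.sum()                    # p_ij = |D_ij| / |D|
    px, py = p.sum(axis=1), p.sum(axis=0)        # marginals of the column and row partitions
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / (px[:, None] * py[None, :])[nz]))
    return mi / np.log(min(len(x_edges), len(y_edges)) - 1)

# A 4-by-4 uniform grid on noiseless y = x already scores essentially 1.
x = np.random.uniform(-1, 1, 2000)
y = x.copy()
print(normalized_grid_mi(x, y, np.linspace(-1, 1, 5), np.linspace(-1, 1, 5)))
```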

B. MIC Algorithm

Although the algorithm for computing the MIC is fully described in [11], here we only review the OptimizeXAxis algorithm, which is used in the computation of the highest mutual information achievable by an |x|-by-|y| grid. Any |x|-by-|y| grid imposes two sets of partitions, on the x-values (columns of the grid) and on the y-values (rows of the grid). We indicate the columns of the grid by c_1, c_2, ..., c_{|x|}, where c_i denotes the endpoint (largest x-value) of the i-th column. Since I(P;Q) is upper-bounded by H(P) and H(Q), in order to maximize it one can equipartition either the Y or the X axis, i.e., impose a discrete uniform distribution on either Q or P. Without loss of generality, we consider the version of the algorithm that equipartitions the Y-axis. However, it is obvious that we should check both cases (equipartitioning either the X or the Y axis) separately for each |x|-by-|y| grid and choose the maximum resulting mutual information.

Let H(P) denote the entropy of the distribution imposed by m sample points (m ≤ |D| = n) on the partition of the X-axis. Similarly, let H(Q) denote the entropy of the distribution imposed by m sample points on the partition of the Y-axis. Since we have assumed that the Y-axis is equipartitioned, H(Q) is constant and equal to log(|Q|). Finally, let H(P,Q) denote the entropy of the distribution imposed by m sample points on the cells of the grid G with X-axis partition P and Y-axis partition Q. Since I(P;Q) = H(Q) − H(Q|P) and we have already maximized H(Q) by equipartitioning the Y-axis, to achieve the highest mutual information we have to minimize H(Q|P). This is done by the OptimizeXAxis algorithm [11]. An alternative formula for the mutual information is I(P;Q) = H(Q) + H(P) − H(P,Q). Since H(Q) is constant, OptimizeXAxis only needs to maximize H(P) − H(P,Q). The following theorem [11] is the key to solving this problem.

Theorem II.1. For a dataset D of size n and a fixed row partition Q, define F(m, l) = max_P {H(P) − H(P,Q)}, where the maximum is taken over all partitions P of size l of the first m points D(1:m). Then for l > 1 and 1 < m ≤ n we have the recursion

    F(m, l) = max_{1 ≤ i < m} { (i/m) F(i, l−1) − ((m−i)/m) H(⟨i, m⟩, Q) }.    (8)

Proof of Theorem II.1: See Proposition 3.2 in [11].

OptimizeXAxis uses a dynamic programming technique motivated by Theorem II.1. It ensures that F(n, |x|) corresponds to the desired partition of the dataset D (which has n sample points) into |x| columns, imposing the partition P on the X-axis. In order to minimize H(Q|P), OptimizeXAxis considers only consecutive points falling into the same row and draws partition boundaries between them. The set of consecutive points falling into the same row is called a clump (see Figure 2 for a graphical view of a clump).

Fig. 2. OptimizeXAxis [11] considers only consecutive points falling into the same row and draws partitions between them. The set of consecutive points falling into the same row is called a clump.
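For intuition, the clumps induced by a fixed row partition Q can be found in a single pass over the x-sorted points. The short sketch below is ours (the name get_clumps and the toy data are illustrative only; it is not the GetClumpsPartition subroutine of [11]):

```python
import numpy as np

def get_clumps(x, y, y_edges):
    """Indices (in x-sorted order) where a new clump starts, for the row partition y_edges."""
    order = np.argsort(x)
    rows = np.digitize(y[order], y_edges[1:-1])   # row of Q that each x-sorted point falls in
    # A clump ends exactly where the row index changes between consecutive points.
    return [i for i in range(1, len(rows)) if rows[i] != rows[i - 1]]

x = np.random.uniform(0, 1, 30)
y = np.sin(6 * x)
print(get_clumps(x, y, np.linspace(-1, 1, 4)))    # a smooth curve yields only a few clumps
```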
Algorithm 1 OptimizeXAxis(D, Q, |x|) [11]
Require: D is a set of ordered pairs sorted in increasing order by x-values
Require: Q is a Y-axis partition of D
Require: |x| is an integer greater than 1
Ensure: Returns a list of scores (I_2, ..., I_{|x|}) such that each I_l is the maximum value of I(P;Q) over all partitions P of size l
 1: ⟨c_0, ..., c_k⟩ ← GetClumpsPartition(D, Q)
 2:
 3: Find the optimal partition of size 2
 4: for t = 2 to k do
 5:   Find s ∈ {1, ..., t} maximizing H(⟨c_s, c_t⟩) − H(⟨c_s, c_t⟩, Q).
 6:   P_{t,2} ← ⟨c_s, c_t⟩
 7:   I_{t,2} ← H(Q) + H(P_{t,2}) − H(P_{t,2}, Q)
 8: end for
 9:
10: Inductively build the rest of the table of optimal partitions
11: for l = 3 to |x| do
12:   for t = 2 to k do
13:     Find s ∈ {1, ..., t} maximizing F(s, t, l) := (c_s / c_t)(I_{s,l−1} − H(Q)) + Σ_{i=1}^{|Q|} (#_{i,l} / c_t) log(#_{i,l} / #_{*,l}), where #_{*,j} is the number of points in the j-th column of P_{s,l−1} ∪ c_t and #_{i,j} is the number of points in the j-th column of P_{s,l−1} ∪ c_t that fall in the i-th row of Q
14:     P_{t,l} ← P_{s,l−1} ∪ c_t
15:     I_{t,l} ← H(Q) + H(P_{t,l}) − H(P_{t,l}, Q)
16:   end for
17: end for
18: return (I_{k,2}, ..., I_{k,|x|})

In Algorithm 1, the GetClumpsPartition subroutine is responsible for finding and partitioning the clumps. Moreover, P_{t,l} is an optimal partition of size l for the first t clumps.

III. THE UNIFORM-MIC (U-MIC)

A. Noiseless Setting

The major drawback of Algorithm 1 is its computational complexity. If there exist k clumps in the given partition of an |x|-by-|y| grid, the runtime of this algorithm is O(k² |x| |y|). If there is a functional association between the two variables, the number of clumps in the corresponding grid is quite small. However, for noisy or random datasets the number of clumps can be very large, and hence the computational complexity of Algorithm 1 becomes large.

Furthermore, due to this problem, the algorithm cannot easily be generalized to detect associations among more than two variables. As an example, if we want to detect whether or not three variables are related to each other, we may write the formula for the generalized mutual information as

    I(P;Q;R) = H(P) + H(Q) + H(R) − H(P,Q) − H(P,R) − H(Q,R) + H(P,Q,R).    (9)

Hence, intuitively and as in the case of two random variables, in order to maximize the generalized mutual information we have to equipartition one axis to maximize its entropy. Nevertheless, we should partition the two other axes with respect to the places of the clumps in them. If we equipartition the first axis and there exist k_1 clumps in the second axis and k_2 clumps in the third axis, then the runtime of this algorithm is O(k_1² k_2² |x| |y| |z|²), where |x|, |y|, |z| are the sizes of the partitions. This runtime is not acceptable for large datasets. Therefore, we have to modify the algorithm in order to decrease its runtime and, as a result, make it generalizable to higher dimensions.

The algorithm we propose here for replacing Algorithm 1 is uniform partitioning (Algorithm 2). Let y_min = min_i y_i, y_max = max_i y_i, and similarly x_min = min_i x_i and x_max = max_i x_i. We then partition both the X and Y axes such that all the columns have length (x_max − x_min)/|x| and all the rows have length (y_max − y_min)/|y|. We call the new measure derived by replacing Algorithm 1 with Algorithm 2 the U-MIC (Uniform Maximal Information Coefficient). In the following we prove that the U-MIC approaches 1 as the sample size grows when there exists a functional association (with finite derivative) between the two variables. Without loss of generality, we carry out all the proofs for the case (x,y) ∈ [0,1] × [0,1]; the proofs extend easily to other cases.

Algorithm 2 UniformPartition(|x|, |y|)
Require: Dataset D
Require: |x| and |y| are integers greater than 1
Ensure: Returns a score I, which is the value of I(P;Q), where P and Q are the distributions obtained from the uniform partitioning of both axes
1: P ← uniform partition of the X-axis into |x| columns, each of length (x_max − x_min)/|x|
2: Q ← uniform partition of the Y-axis into |y| rows, each of length (y_max − y_min)/|y|
3: I = (H(P) + H(Q) − H(P,Q)) / log(min(|x|, |y|))
4: return I
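A minimal Python sketch of Algorithm 2, together with the outer maximization over grid sizes from (7) with B(n) = n^0.6, is given below. The exhaustive loop over (|x|, |y|) pairs is simply the most direct way to realize that maximization and is not claimed to match the authors' implementation; the function names are ours.

```python
import numpy as np
from itertools import product

def uniform_partition_score(x, y, nx, ny):
    """Algorithm 2: normalized I(P;Q) for uniform partitions with nx columns and ny rows."""
    counts, _, _ = np.histogram2d(x, y, bins=[nx, ny])   # equal-width bins on both axes
    p = counts / counts.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / (px[:, None] * py[None, :])[nz]))
    return mi / np.log(min(nx, ny))

def umic(x, y):
    """U-MIC: maximize the uniform-partition score over grids with |x|*|y| < B(n) = n**0.6."""
    bound = len(x) ** 0.6
    return max(uniform_partition_score(x, y, nx, ny)
               for nx, ny in product(range(2, int(bound) + 1), repeat=2)
               if nx * ny < bound)

x = np.random.uniform(0, 1, 1000)
print(umic(x, x ** 2), umic(x, np.random.uniform(0, 1, 1000)))  # functional vs. independent data
```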
Proposition III.1. If D = {(x_i, y_i)}_{i=1}^n, where y_i = h(x_i) and |h′(x)| < ∞, then lim_{n→∞} U-MIC(D) = 1.

Proof of Proposition III.1: We denote by g_h(α) the sublevel function of h(·), i.e.,

    g_h(α) = λ({x : h(x) ≤ α}),    (10)

where λ(T) denotes the fraction of sample points in the set T. Consequently,

    g_h(α) = F_y(α) = P(y ≤ α) = P(h(x) ≤ α),    (11)

where F_y(·) denotes the cumulative distribution function (CDF) and P(·) denotes the probability function. Using this notation, and assuming that we uniformly partition the Y-axis into |y| rows, we can write the entropy of Q (the uniform partition of the Y-axis) as

    H(Q) = − Σ_i P(Q = i) log P(Q = i)
         = − Σ_i P(i/|y| ≤ Y < (i+1)/|y|) log P(i/|y| ≤ Y < (i+1)/|y|)
         = − Σ_i (1/|y|) g′_h(α_i) log((1/|y|) g′_h(α_i))
         = − Σ_i (1/|y|) g′_h(α_i) log g′_h(α_i) − Σ_i (1/|y|) g′_h(α_i) log(1/|y|),    (12)

where i/|y| ≤ α_i < (i+1)/|y| for each i (0 ≤ i ≤ |y|−1) is obtained from the mean value theorem. If, without loss of generality, we assume that min(|x|, |y|) = |y|, then we can write

    H(Q)/log(|y|) = − (1/log|y|) Σ_i (1/|y|) g′_h(α_i) log g′_h(α_i) + Σ_i (1/|y|) g′_h(α_i).    (13)

As a result, in the asymptotic setting we can write

    lim_{|y|→∞} H(Q)/log(|y|) = lim_{|y|→∞} Σ_i (1/|y|) g′_h(α_i) = 1,    (14)

where the last equality holds since lim_{|y|→∞} Σ_i (1/|y|) g′_h(α_i) is the Riemann integral of the function g′_h(α), which equals g_h(1) − g_h(0) = 1. If we assume that |h′(·)| < c, then according to the mean value theorem we have

    | h((i+1)/|y|) − h(i/|y|) | ≤ c/|y|.    (15)

Equation (15) states that, for a particular column of the X-axis partition, the curve of the function passes through at most c+1 cells of that column. We use this fact to upper-bound H(Q|P).

og(p(q = i P = k y ( i = P Y < i+ P = k y y ( ( i og P Y < i+ P = k y y y = f y x (α i P = k y og (f y x (α i P = k y y = f y x (α i P = k y og(f y x (α i P = k y ( f y x (α i P = kog y y, where f y x denotes the conditiona probabiity density function. Because of equation (5, we can simpify (6 as j c+ H(Q P = k = f y x (α i P = k (7 i=j y og(f y x (α i P = k j c+ ( f y x (α i P = kog i=j y y. If we define k = argmax k H(Q P = k, then since H(Q P = kp(p = kh(q P = k, we can write j c+ H(Q P f y x (α i P = k (8 i=j y and hence og(f y x (α i P = k j c+ i=j y f y x (α i P = k og ( y, j c+ H(Q P im y og( y f y x (α i P = k = 0. (9 i=j y The ast equaity hods since y 0 but c <. As a resut im U-MIC(D = I(P;Q y og(min{ x, y } = H(Q H(Q P og(min{ x, y } =. (20 If x and y are independent, then according to the foowing Proposition we have U-MIC(D = 0. Proposition III.2. If D = {(x i,y i } n i= where x i y i for i n, then U-MIC(D = 0. Proof of Proposition III.2: The ine of reasoning is straight forward and simiar to the proof of Proposition III.. Since x and y are independent from each other, we can write H(Q (2 y = P(Q = iog(p(q = i y ( i = P Y < i+ ( ( i og P Y < i+ y y y y y ( i = P og Y < i+ y ( P P = k y ( i y Y < i+ y P = k y = P(Q = i P = kog(p(q = i P = k = H(Q P = k. Therefore, H(Q = H(Q P = k for every k where 0 k x. Now since H(Q P = kp(p = kh(q P = k, we have H(Q = H(Q P and as a resut U-MIC(D=0. B. Noisy Setting In this section we study performance of the U-MIC in noisy setting. We first give a ower-bound on it when the two variabes x and y have a noisy functiona association in which the noise is bounded. After that, we study the case of unbounded noise. For the bounded noise case, without oss of generaity we assume that x U[0,] and the noise has a uniform distribution. Specificay, we assume that sampe points(x i,y i have the form (x i,h(x i +z ǫ where z ǫ U[ ǫ,ǫ]. We define y mid = ymax+ymin 2. In Agorithm 2, we divide the Y-axis into two rows by drawing a horizonta ine at y mid. In addition, we divide the X-axis into x coumns each having the ength x (since x U[0,]. Let D = {(x i,y i y i < y mid } and D 2 = {(x i,y i y i > y mid }. We use P and Q to denote the partition of X-axis and Y-axis of the grid in this setting. Having this setting and notations in mind, the foowing Coroary gives a simpe ower-bound for U-MIC(D in this case. Coroary III.3. Let m be the number of coumns in P in which there exists a sampe point (ˆx,ŷ such that ŷ y mid ǫ. Then, U-MIC(D is ower-bounded by D og( D D og( D D 2 og( D 2 D m x. Proof of Coroary III.3: Since I(P, Q = H(Q H(Q P, we need to have an upper-bound on H(Q P in order to determine a ower-bound on I(P, Q. According to

According to the definition of entropy, we can write

    H(Q) = −(|D_1|/|D|) log(|D_1|/|D|) − (|D_2|/|D|) log(|D_2|/|D|).    (22)

Let M = {p_1, ..., p_m} denote the columns in which there exists a data point (x̂, ŷ) such that |ŷ − y_mid| ≤ ǫ. Since Q has only two rows, we can upper-bound H(Q|P) as follows:

    H(Q|P) = Σ_{k=0}^{|x|−1} P(P = k) H(Q|P = k)
           (a)= (1/|x|) [ Σ_{k∈M} H(Q|P = k) + Σ_{k∉M} H(Q|P = k) ]
           (b)= (1/|x|) Σ_{k∈M} H(Q|P = k)
           ≤ |M|/|x| = m/|x|,    (23)

where (a) holds since x ∼ U[0,1] and (b) holds because z_ǫ ∼ U[−ǫ, ǫ]. The lower bound is then obtained by combining (22) and (23).

The main issue in generalizing this lower-bounding idea to other noise distributions is that the noise values can be unbounded. Hence, we use the idea of k-nearest neighbors to bound the noise, so as to obtain a consistent version of the association detector. We study this idea for the case in which the noise is drawn from a Gaussian distribution with mean 0 and variance σ². For each sample point, we consider its δ_n-neighborhood (we use the subscript n to show the dependency on the dataset size n) and replace the data point with the average of the sample points located in its δ_n-neighborhood (see Figure 3). The following lemma characterizes the number of sample points in this neighborhood.

Fig. 3. Using the k-nearest neighbors method to bound the noise in noisy relationships y = h(x) + z: each point is replaced with the average of its neighbors in its δ_n-neighborhood.

Lemma III.4. Let x be uniformly distributed, i.e., x ∼ U[0,1], and let (x_i, y_i) denote the i-th data point in D, where y_i = h(x_i) + z_i. If N = {(x_j, y_j) : (x_i − x_j)² ≤ δ_n²}, then lim_{n→∞} |N| = 2nδ_n.

Proof of Lemma III.4: Let I(·) denote the indicator function. Then we can write

    2ǫ_n = |N| = Σ_{j=1}^n I(x_i − δ_n ≤ x_j ≤ x_i + δ_n).    (24)

As a result, E[2ǫ_n] = 2nδ_n. Using the Hoeffding inequality, we have

    P( |2ǫ_n − E[2ǫ_n]| ≥ t ) ≤ 2e^{−2ct²n²},    (25)

for some constant c. If we let t = log n, then lim_{n→∞} ǫ_n = nδ_n or, equivalently, lim_{n→∞} |N| = 2nδ_n.

Assume that h(·) is a Lipschitz continuous function of order β, i.e., |h(v) − h(w)| ≤ k|v − w|^β, where k is a constant that depends on the function h(·). If we estimate (i.e., replace) the y-value of each noisy sample point by the average of the sample points in its δ_n-neighborhood, then in the case of Gaussian noise (mean 0 and variance σ²) we can write the estimation mean squared error as

    (1/n) Σ_{i=1}^n E( |ĥ(x_i) − h(x_i)|² )
        = (1/n) Σ_{i=1}^n E| (1/(2ǫ_n + 1)) Σ_{j=−ǫ_n}^{ǫ_n} ( h(x_{i−j}) + z_{i−j} ) − h(x_i) |²
        ≤ k² ǫ_n^{2β} / n^{2β} + σ² / (2ǫ_n + 1).    (26)

In order to minimize the estimation error, we can take the derivative with respect to ǫ_n and set it to 0. Therefore, the ǫ_n* that minimizes the mean squared error is

    ǫ_n* = ( σ² / (4k²β) )^{1/(2β+1)} n^{2β/(2β+1)}.    (27)

We use this ǫ_n* later to bound the noise. The following lemma gives a probabilistic bound on the noise values.

Lemma III.5. If z_1, z_2, ..., z_n are i.i.d. draws from N(0, σ²), then P{ max_{1≤i≤n} |z_i| > t } ≤ 2n e^{−t²/(2σ²)}.

Proof of Lemma III.5: First, for a zero-mean Gaussian random variable z_i we prove that P{ |z_i| > t } ≤ 2e^{−t²/(2σ²)}. Let u = z_i − t, so that u ∼ N(−t, σ²). We have

    ∫_0^∞ (1/√(2πσ²)) e^{−(u² + 2ut)/(2σ²)} du ≤ ∫_0^∞ (1/√(2πσ²)) e^{−u²/(2σ²)} du.    (28)

As a result, we can write

    ∫_t^∞ (1/√(2πσ²)) e^{−z_i²/(2σ²)} dz_i = e^{−t²/(2σ²)} ∫_0^∞ (1/√(2πσ²)) e^{−(u² + 2ut)/(2σ²)} du ≤ e^{−t²/(2σ²)},    (29)

where the inequality uses (28) and the fact that ∫_0^∞ (1/√(2πσ²)) e^{−u²/(2σ²)} du ≤ 1. Similarly, (29) holds for (−∞, −t], and hence P{ |z_i| > t } ≤ 2e^{−t²/(2σ²)}. The result of the lemma then follows from a union bound over the z_i.

By using the k-nearest neighbors method, each z_i is replaced by z̄_i, the average of 2ǫ_n + 1 i.i.d. noise values, and hence its variance is decreased by a factor of 2ǫ_n + 1. This idea motivates the following corollary, which lets us bound the noise.

Corollary III.6. By using the k-nearest neighbors method, z̄_i = (1/(2ǫ_n + 1)) Σ_{j=−ǫ_n}^{ǫ_n} z_{i−j}, and as a result lim_{n→∞} max_{1≤i≤n} |z̄_i| = 0.

Proof of Corollary III.6: According to Lemma III.5, we can write P{ max_{1≤i≤n} |z̄_i| > t } ≤ 2n e^{−t²(2ǫ_n + 1)/(2σ²)}. The result then follows by letting t = 1/log n and ǫ_n = ǫ_n*, which was derived in (27).
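The smoothing step behind Lemma III.4 and Corollary III.6 amounts to replacing each y_i by the average of the y-values whose x-coordinates lie within δ_n of x_i, before the uniform partitioning is applied. Below is a minimal sketch of that preprocessing; here the window width delta is left as a free parameter rather than being tied to the optimized ǫ_n* of (27).

```python
import numpy as np

def neighborhood_average(x, y, delta):
    """Replace each y_i by the mean of the y_j whose x_j lie in [x_i - delta, x_i + delta]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    csum = np.concatenate(([0.0], np.cumsum(ys)))
    lo = np.searchsorted(xs, xs - delta, side="left")     # first neighbor index for each point
    hi = np.searchsorted(xs, xs + delta, side="right")    # one past the last neighbor index
    smoothed = (csum[hi] - csum[lo]) / (hi - lo)           # window means via prefix sums
    out = np.empty_like(smoothed)
    out[order] = smoothed                                  # undo the sort
    return out

# Averaging over a small neighborhood tames unbounded (e.g., Gaussian) noise
# before the uniform partitioning of Algorithm 2 is applied.
x = np.random.uniform(0, 1, 5000)
y = np.sin(4 * x) + np.random.normal(0, 0.3, size=5000)
y_hat = neighborhood_average(x, y, delta=0.02)
```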

In the next section we show how the U-MIC works in practice compared with the MIC.

IV. SIMULATION RESULTS

In this section, we study the performance of our proposed measure. We first show how it works for functional associations. Second, we study its performance for non-functional associations. Finally, we run some experiments for the case of noisy relationships. As mentioned previously, the authors in [11] apply a heuristic to compute the MIC, which may not result in the true MIC. We, on the other hand, do not apply any heuristic in the simulation results, in order to have a precise comparison with our proposed method.

Figure 4 shows the functional associations on which we have tested the performance of the MIC and U-MIC algorithms.

Fig. 4. Test functional relationships for Tables I, II, and III.

Table I summarizes the results for the case in which there are 200 sample points. One interesting point in Table I is the value of the U-MIC for the sinusoidal function with different frequencies: MIC(D) = 1 while U-MIC(D) = 0.75 for this function. One interpretation of this difference is that in the proof of Proposition III.1 we assumed that the absolute value of the derivative of the function h(·) is upper-bounded by a constant c. However, this is not the case for the sinusoidal function with different frequencies, since there is a discontinuity in this function. If we increase the sample size, as reported in Table II, this issue is alleviated.

TABLE I: MIC(D) and U-MIC(D) for the functional relationships in Figure 4. For this set of experiments, |D| = 200.
Columns: Linear, Parabolic, Periodic, Cubic, Sin (Diff. Freq.), Sin (Single Freq.)
MIC:
U-MIC: 0.93 0.95 0.75 0.9

TABLE II: MIC(D) and U-MIC(D) for the functional relationships in Figure 4. For this set of experiments, |D| = 5000.
Columns: Linear, Parabolic, Periodic, Cubic, Sin (Diff. Freq.), Sin (Single Freq.)
MIC:
U-MIC: 0.99 0.99 0.93 0.95

Although the same issue holds for the periodic function in Figure 4, we do not see as large an effect. Qualitatively, the derivative of the continuous pieces of the periodic function in Figure 4 (y = |x|) is smaller than the maximum of the derivative of the sinusoidal function with different frequencies (y = sin(10x), y = sin(20x)). Hence, if we uniformly partition the X-axis in the case of the periodic function, there are fewer sample points in the rows of a given column and, most likely, a higher entropy (resulting in a higher U-MIC), as is the case in Table I.

Table III summarizes the runtimes for the calculation of the MIC and the U-MIC for the different functional associations in Figure 4. As we can see, the U-MIC is at least 10 times faster in these cases. This is expected, since the MIC uses dynamic programming to find a close-to-optimal grid for the data, while the U-MIC simply partitions the axes uniformly.

TABLE III: Run time (in seconds) for the calculation of MIC(D) and U-MIC(D) for the functional relationships in Figure 4. For this set of experiments, |D| = 200.
Columns: Linear, Parabolic, Periodic, Cubic, Sin (Diff. Freq.), Sin (Single Freq.)
MIC: 0.1, 0.5, 0.1, 0.2, 2, 0.4
U-MIC: 0.01, 0.01, 0.01, 0.01, 0.01, 0.01
Table IV summarizes the results for the non-functional associations presented in Figure 5.

Fig. 5. Test non-functional relationships for Tables IV and V.

TABLE IV: MIC(D) and U-MIC(D) for the non-functional relationships in Figure 5. For this set of experiments, |D| = 200.
Columns: Circle, Sinusoidal Mixture, Two Lines, Random
MIC: 0.68, 0.72, 0.7, 0.6
U-MIC: 0.64, 0.69, 0.68, 0.06

One important point about Table IV is that the U-MIC has better performance in the case of random sample points (i.e., x and y independent). In this case the ideal MIC and U-MIC is 0; however, as we can see, MIC(D) = 0.6 while U-MIC(D) = 0.06. This issue is related to one of the criticisms of the MIC in the literature [12]: as a statistical test, the MIC has lower power than other measures of dependency such as the distance correlation [12]. In other words, it gives more false positives when detecting associations. However, according to our simulation results and Proposition III.2, this issue is alleviated in the U-MIC.

Table V shows the runtimes for the calculation of the MIC and the U-MIC. In the case of non-functional relationships there are more clumps in the initial grid of sample points used for the calculation of the MIC. Hence, Algorithm 1, which essentially runs dynamic programming over the initial grid to find the optimal grid, has a larger runtime, as we can see in Table V. On the other hand, since the U-MIC uses a uniform partitioning of the grid of sample points, it does not matter what type of relationship the two random variables have: the runtime is almost constant and similar to the cases in which there is a functional association between the two variables.

TABLE V: Run time (in seconds) for the calculation of MIC(D) and U-MIC(D) for the non-functional relationships in Figure 5. For this set of experiments, |D| = 200.
Columns: Circle, Sinusoidal Mixture, Two Lines, Random
MIC: 26.38, 3.4, 6.60, 6.00
U-MIC: 0.01, 0.02, 0.01, 0.02

Tables VI and VII summarize the results for the noisy non-functional associations presented in Figure 6. Figure 6 is similar to Figure 5, except that we have added noise drawn from a uniform distribution, i.e., U[−0.5, 0.5], to the sample points. Comparing Table VI with Table IV, we can see that the range of decrease for the different associations is almost the same for both the MIC and the U-MIC. We expected this for the MIC, since it has an important property called equitability [11]. On the other hand, we can observe that, at least according to the simulation results reported here, the U-MIC has approximately the same equitability property.

Fig. 6. Test noisy non-functional relationships for Tables VI and VII.

TABLE VI: MIC(D) and U-MIC(D) for the non-functional relationships in Figure 5. For this set of experiments, |D| = 200 and the noise is uniformly distributed in [−0.05, 0.05].
Columns: Circle, Sinusoidal Mixture, Two Lines
MIC: 0.54, 0.60, 0.57
U-MIC: 0.52, 0.48, 0.54

TABLE VII: Run time (in seconds) for the calculation of MIC(D) and U-MIC(D) for the noisy non-functional relationships in Figure 5. For this set of experiments, |D| = 200 and the noise is uniformly distributed in [−0.05, 0.05].
Columns: Circle, Sinusoidal Mixture, Two Lines
MIC: 35, 6, 27
U-MIC: 0.01, 0.02, 0.01

V. CONCLUSION

In this paper we introduced a novel measure of dependency between two variables. This measure is called the uniform maximal information coefficient (U-MIC) because it is a modification of the original MIC [11]. It is derived from a uniform partitioning of both the X and Y axes; therefore it does not rely on dynamic programming, as the MIC does, and hence is much faster. We proved that, asymptotically, the U-MIC equals 1 if there is a functional relationship between the two variables, and that if the two variables are truly independent of each other, the U-MIC equals 0. Moreover, according to the simulation results, the U-MIC does a better job of recognizing independence between variables than the MIC.

REFERENCES

[1] A. Rényi, "New version of the probabilistic generalization of the large sieve," Acta Mathematica Hungarica, vol. 10, pp. 217–226, 1959.
[2] H. Cramér, Mathematical Methods of Statistics. Princeton Univ. Press, 1999, vol. 9.
[3] A. Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung, 1933.
[4] H. Gebelein, "Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung," ZAMM - Zeitschrift für Angewandte Mathematik und Mechanik, vol. 21, no. 6, pp. 364–379, 1941.
[5] L. Breiman and J. Friedman, "Estimating optimal transformations for multiple regression and correlation," Journal of the American Statistical Association, pp. 580–598, 1985.
[6] W. Pirie, "Spearman rank correlation coefficient," Encyclopedia of Statistical Sciences, 1988.
[7] P. Delicado and M. Smrekar, "Measuring non-linear dependence for two random variables distributed along a curve," Statistics and Computing, vol. 19, no. 3, pp. 255–269, 2009.
[8] T. Cover and J. Thomas, Elements of Information Theory. Wiley Online Library, 1991, vol. 6.
[9] Y. Moon, B. Rajagopalan, and U. Lall, "Estimation of mutual information using kernel density estimators," Physical Review E, vol. 52, no. 3, pp. 2318–2321, 1995.
[10] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[11] D. Reshef, Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti, "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[12] N. Simon and R. Tibshirani, "Comment on 'Detecting novel associations in large data sets' by Reshef et al., Science Dec. 16, 2011," arXiv preprint arXiv:1401.7645, 2014.
0, no., pp. 27 226, 959. [2] H. Cramer, Mathematica Methods of Statistics. Princeton Univ Pr, 999, vo. 9. [3] A. Komogorov, Grundbegriffe der wahrscheinichkeitsrechnung, 933. [4] H. Gebeein, Das statistische probem der korreation as variations-und eigenwertprobem und sein zusammenhang mit der ausgeichsrechnung, ZAMM-Journa of Appied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, vo. 2, no. 6, pp. 364 379, 94. [5] L. Breiman and J. Friedman, Estimating optima transformations for mutipe regression and correation, Journa of the American Statistica Association, pp. 580 598, 985. [6] W. Pirie, Spearman rank correation coefficient, Encycopedia of Statistica Sciences, 988. [7] P. Deicado and M. Smrekar, Measuring non-inear dependence for two random variabes distributed aong a curve, Statistics and Computing, vo. 9, no. 3, pp. 255 269, 2009. [8] T. Cover and J. Thomas, Eements of information theory. Wiey Onine Library, 99, vo. 6. [9] Y. Moon, B. Rajagopaan, and U. La, Estimation of mutua information using kerne density estimators, Physica Review E, vo. 52, no. 3, pp. 238 232, 995. [0] A. Kraskov, H. Stögbauer, and P. Grassberger, Estimating mutua information, Physica Review E, vo. 69, no. 6, p. 06638, 2004. [] D. Reshef, Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti, Detecting nove associations in arge data sets, Science, vo. 334, no. 6062, pp. 58 524, 20. [2] N. Simon and R. Tibshirani, Comment on detecting nove associations in arge data sets by reshef et a, science dec 6, 20, arxiv preprint arxiv:40.7645, 204.