Scale Invariant Conditional Dependence Measures


Sashank J. Reddi (sjakkar@cs.cmu.edu)
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Barnabás Póczos (bapoczos@cs.cmu.edu)
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

In this paper we develop new dependence and conditional dependence measures and provide their estimators. An attractive property of these measures and estimators is that they are invariant to any monotone increasing transformations of the random variables, which is important in many applications including feature selection. Under certain conditions we show the consistency of these estimators, derive upper bounds on their convergence rates, and show that the estimators do not suffer from the curse of dimensionality. However, when the conditions are less restrictive, we derive a lower bound which proves that in the worst case the convergence can be arbitrarily slow, similarly to some other estimators. Numerical illustrations demonstrate the applicability of our method.

1. Introduction

Measuring dependencies and conditional dependencies is of great importance in many scientific fields including machine learning and statistics. There are numerous problems where we want to know how large the dependence is between random variables, and how this dependence changes if we observe other random variables. Correlated random variables might become independent when we observe a third random variable, and the opposite situation is also possible, where independent variables become dependent after observing other random variables. Thanks to the wide potential application range (e.g., in bioinformatics, pharmacoinformatics, epidemiology, psychology, econometrics), finding efficient dependence and conditional dependence measures has been an active research area for decades. These measures have been used, for example, in causality detection, feature selection, active learning, structure learning, boosting, image registration, and independent component and subspace analysis.

Although the theory of these estimators is actively researched, there are still several fundamental open questions. The estimation of certain dependence and conditional dependence measures is easy in a few cases: for example, (i) when the random variables have discrete distributions with finitely many possible values, (ii) when there is a known simple relationship between them (e.g., a linear model describes their behavior), or (iii) if they have joint distributions that belong to a parametric family that is easy to estimate (e.g., normal distributions). In this paper we consider the more challenging nonparametric estimation problem, when the random variables have continuous distributions and we do not have any other information about them.

Numerical experiments indicate that a recently proposed kernel measure based on normalized cross-covariance operators ($D_{HS}$) appears to be very powerful in measuring both dependencies and conditional dependencies (Fukumizu et al., 2008). Nonetheless, even this method has a few drawbacks and several open fundamental questions. In particular, lower and upper bounds on the convergence rates of the estimators are not known. It is also not clear whether the $D_{HS}$ measure or its existing estimator $\hat{D}_{HS}$ is invariant to invertible transformations of the random variables. This kind of invariance property is so important that Schweizer & Wolff (1981) even put it into the axioms of dependence.
One reason for this is that in many scenarios we need to compare the estimated dependencies. If certain variables are measured on different scales, the dependence can be much different in the absence of this invariance property. As a result, it might happen that in a dependence-based feature selection algorithm different features would be selected if we measured a quantity, say, in grams, in kilograms, or on a log-scale. This is an odd situation that can be avoided with dependence measures that are invariant to invertible transformations of the variables.

Main contributions: The goal of this paper is to provide new theoretical insights in this field. Our contributions can be summarized in the following points: (i) We prove that the dependence and conditional dependence measures ($D_{HS}$) are invariant to any invertible transformations, but the estimator $\hat{D}_{HS}$ does not have this property. (ii) Under some conditions we derive an upper bound on the rate of convergence of $\hat{D}_{HS}$. (iii) We show that if we apply $\hat{D}_{HS}$ to the empirical copula transformed points, then the resulting estimator $\hat{D}_C$ is invariant to any monotone increasing transformations. (iv) We show that under some conditions this estimator is consistent and derive an upper bound on its rate. (v) We prove that if the conditions are less restrictive, then the convergence of both $\hat{D}_{HS}$ and $\hat{D}_C$ can be arbitrarily slow. (vi) We also generalize these dependence measures, as well as their estimators, to sets of random variables and provide an upper bound on the convergence rate of the estimators.

Related work: Since the literature on dependence measures is huge, we only mention a few prominent examples here. The most well-known dependence measure is probably the Shannon mutual information, which has been generalized to the Rényi-α (Rényi, 1961) and Tsallis-α mutual information (Tsallis, 1988). Other interesting dependence measures are the maximal correlation coefficient (Rényi, 1959), kernel mutual information (Gretton et al., 2003), the generalized variance and kernel canonical correlation analysis (Bach, 2002), the Hilbert-Schmidt independence criterion (Gretton et al., 2005), the Schweizer-Wolff measure (Schweizer & Wolff, 1981), maximum mean discrepancy (MMD) (Borgwardt et al., 2006; Fortet & Mourier, 1953), Copula-MMD (Póczos et al., 2012), and the distance based correlation (Székely et al., 2007). Some of these measures, e.g., the Shannon mutual information, are invariant to any invertible transformations of the random variables. The Copula-MMD and Schweizer-Wolff measures are invariant to monotone increasing transformations, while the MMD and many other dependence measures are not invariant to any of these transformations. Conditional dependence estimation is an even more challenging problem, and only very few dependence measures have been generalized to the conditional case (Fukumizu et al., 2008; Poczos & Schneider, 2012).

Notation: We use $X \sim P$ to denote that the random variable $X$ has probability distribution $P$. The symbol $X \perp\!\!\!\perp Y \mid Z$ indicates the conditional independence of $X$ and $Y$ given $Z$. Let $X_{i:j}$ denote the tuple $(X_i, \dots, X_j)$. $E[X]$ stands for the expectation of the random variable $X$. The symbols $\mu_X$ and $\mathcal{B}_X$ denote a measure and the Borel σ-field on $\mathcal{X}$, respectively. We use $D(X_i, \dots, X_j)$ to denote the dependence measure of the set of random variables $\{X_i, \dots, X_j\}$. With a slight abuse of notation, we will use $D(X, Y)$ to denote the dependence measure between sets of random variables $X$ and $Y$. $\{A_j\}$ denotes the set $\{A_1, \dots, A_k\}$, where $k$ will be clear from the context. The null space and range of an operator $L$ are denoted by $\mathcal{N}(L)$ and $\mathcal{R}(L)$ respectively, and $\bar{A}$ stands for the closure of a set $A$.
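Before the formal development, a small numerical aside (ours, not part of the paper) reproduces the scale problem described above. Pearson correlation plays the role of a dependence measure that is not invariant to monotone transformations, while Spearman's ρ, computed from ranks alone, is exactly invariant, which is precisely the property the copula-based estimators developed below provide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weight_g = rng.uniform(1.0, 100.0, size=1000)            # a quantity measured in grams
response = weight_g + rng.normal(scale=5.0, size=1000)   # noisy dependent quantity

# A monotone rescaling (log-scale) changes Pearson correlation,
# but leaves the ranks, and hence Spearman's rho, untouched.
for label, x in [("grams", weight_g), ("log-scale", np.log(weight_g))]:
    r, _ = stats.pearsonr(x, response)
    rho, _ = stats.spearmanr(x, response)
    print(f"{label:10s}  Pearson = {r:.3f}   Spearman = {rho:.3f}")
```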
2. Dependence Measure using Hilbert-Schmidt Norm

In this section, we review the theory behind the Hilbert-Schmidt (HS) norm and its use in defining dependence measures.

Suppose $(X, Y)$ is a random variable on $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{H}_X = \{f : \mathcal{X} \to \mathbb{R}\}$ be a reproducing kernel Hilbert space (RKHS) associated with $\mathcal{X}$, with feature map $\phi(x) \in \mathcal{H}_X$ ($x \in \mathcal{X}$) and kernel $k_X(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}_X}$ ($x, y \in \mathcal{X}$). The kernel satisfies the property $f(x) = \langle f, k_X(x, \cdot) \rangle_{\mathcal{H}_X}$ for $f \in \mathcal{H}_X$, which is called the reproducing property of the kernel. We can similarly define an RKHS $\mathcal{H}_Y$ and kernel $k_Y$ associated with $\mathcal{Y}$. Let us define the class of kernels known as universal kernels, which is critical to this paper.

Definition 1. A kernel $k_X : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called universal whenever the associated RKHS $\mathcal{H}_X$ is dense in $C(\mathcal{X})$, the space of bounded continuous functions over $\mathcal{X}$, with respect to the $L_\infty$ norm.

The Gaussian and Laplace kernels are two popular kernels which belong to the class of universal kernels.

The cross-covariance operators on these RKHSs capture the dependence of the random variables $X$ and $Y$ (Baker, 1973). In order to ensure the existence of these operators, we assume that $E[k_X(X, X)] < \infty$ and $E[k_Y(Y, Y)] < \infty$. We also assume that all the kernels defined in this paper satisfy the above assumption. The cross-covariance operator (COCO) $\Sigma_{YX} : \mathcal{H}_X \to \mathcal{H}_Y$ is the operator such that

$$\langle g, \Sigma_{YX} f \rangle_{\mathcal{H}_Y} = E_{XY}[f(X) g(Y)] - E_X[f(X)] \, E_Y[g(Y)]$$

holds for all $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$. The existence and uniqueness of such an operator can be shown using the Riesz representation theorem (Reed & Simon, 1980).
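As a quick sanity check of the defining identity (our illustration, with arbitrary hand-picked Gaussian bandwidths and evaluation points; none of this appears in the paper), one can pick concrete functions $f = k_X(x_0, \cdot)$ and $g = k_Y(y_0, \cdot)$ and note that $\langle g, \Sigma_{YX} f \rangle$ is just the covariance of $f(X)$ and $g(Y)$, which has a direct empirical analogue:

```python
import numpy as np

def gauss(A, B, sigma=0.2):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
m = 2000
X = rng.uniform(size=(m, 1))
Y = np.sin(4 * X) + 0.1 * rng.normal(size=(m, 1))

# f = k_X(x0, .), g = k_Y(y0, .); by the reproducing property, f(X_i) and g(Y_i)
# are plain kernel evaluations, and <g, Sigma_YX f> = Cov(f(X), g(Y)).
f_vals = gauss(X, np.array([[0.5]])).ravel()
g_vals = gauss(Y, np.array([[0.3]])).ravel()
print(np.mean(f_vals * g_vals) - f_vals.mean() * g_vals.mean())
```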

If $X$ and $Y$ are identical, then the operator $\Sigma_{XX}$ is called the covariance operator. Both operators above are natural generalizations of the covariance matrix in Euclidean space to Hilbert space. Analogous to the correlation matrix in Euclidean space, we can define the operator

$$V_{YX} = \Sigma_{YY}^{-1/2} \, \Sigma_{YX} \, \Sigma_{XX}^{-1/2},$$

where $\mathcal{R}(V_{YX}) \subset \overline{\mathcal{R}(\Sigma_{YY})}$ and $\mathcal{N}(V_{YX}) \supset \overline{\mathcal{R}(\Sigma_{XX})}^{\perp}$. This operator is called the normalized cross-covariance operator (NOCCO). Similarly to $\Sigma_{YX}$, the existence of NOCCO can be proved using the Riesz representation theorem. Intuitively, the normalized cross-covariance operator captures the dependence of the random variables $X$ and $Y$ while discounting the influence of the marginals. We would like to point out that the notation we adopted leads to a few technical difficulties, which can easily be addressed (see Grünewälder et al., 2012).

We also define the conditional covariance operators, which will be useful to capture conditional independence. Suppose we have another random variable $Z$ on $\mathcal{Z}$ with RKHS $\mathcal{H}_Z$ and kernel $k_Z$. The normalized conditional cross-covariance operator is defined as

$$V_{YX|Z} = V_{YX} - V_{YZ} V_{ZX}.$$

Similarly to the cross-covariance operator, the conditional cross-covariance operator is a natural extension of the conditional covariance matrix to Hilbert space. An interesting aspect of the normalized conditional cross-covariance operator is that it can be expressed in terms of simple products of normalized cross-covariance operators (Fukumizu et al., 2008).

It is not surprising that the covariance operators described above can be used for measuring dependence between random variables, since they capture the dependence between them. While one can use either $\Sigma_{YX}$ or $V_{YX}$ in defining the dependence measure (Gretton et al., 2005; Fukumizu et al., 2008), we will use the latter in this paper. Let the variable $\ddot{X}$ denote $(X, Z)$; this will be useful for defining conditional dependence measures. We define the HS norm of a linear operator $L : \mathcal{H}_X \to \mathcal{H}_Y$ as follows.

Definition 2. (Hilbert-Schmidt norm) The Hilbert-Schmidt norm of $L$ is defined as $\|L\|_{HS}^2 = \sum_{i,j} \langle v_j, L u_i \rangle_{\mathcal{H}_Y}^2$, where $\{u_i\}$ and $\{v_j\}$ are orthonormal bases of $\mathcal{H}_X$ and $\mathcal{H}_Y$ respectively, provided the sum converges.

The HS norm is a generalization of the Frobenius norm on matrices and is independent of the choice of the orthonormal bases. An operator is called Hilbert-Schmidt if its HS norm is finite. The covariance operators defined in this paper are assumed to be Hilbert-Schmidt operators. Fukumizu et al. (2008) define the following dependence measures:

$$D_{HS}(X, Y) = \|V_{YX}\|_{HS}^2, \qquad D_{HS}(X, Y \mid Z) = \|V_{Y\ddot{X}|Z}\|_{HS}^2.$$

Note that the measures above are defined for a pair of random variables $X$ and $Y$. In Section 5 we will generalize this approach and provide dependence and conditional dependence measures, with estimators, that operate on sets of random variables. Theorem 10 (in the supplementary material) justifies the use of the above dependence measures. The result can be equivalently stated in terms of the HS norm of the covariance operators, since the HS norm of an operator is zero if and only if the operator itself is the null operator.

At first it might appear that these measures are strongly linked to the kernel used in constructing them, but Fukumizu et al. (2008) show the remarkable property that the measures are independent of the kernels, which is captured by the following result. Let

$$E_Z[P_{X|Z} \otimes P_{Y|Z}](A \times B) = \int E[\chi_B(Y) \mid Z = z] \, E[\chi_A(X) \mid Z = z] \, dP_Z(z)$$

for $A \in \mathcal{B}_X$ and $B \in \mathcal{B}_Y$, where $\chi_A$ denotes the indicator function of a set $A$.

Theorem 1. (Kernel-free property) Assume that the probabilities $P_{XY}$ and $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ are absolutely continuous with respect to $\mu_X \times \mu_Y$, with probability density functions $p_{XY}$ and $p_{X \perp Y|Z}$ respectively. Then we have

$$D_{HS}(X, Y \mid Z) = \int_{\mathcal{X} \times \mathcal{Y}} \left( \frac{p_{XY}(x, y) - p_{X \perp Y|Z}(x, y)}{p_X(x) \, p_Y(y)} \right)^{2} p_X(x) \, p_Y(y) \, d\mu_X \, d\mu_Y.$$

When $Z = \emptyset$, we have

$$D_{HS}(X, Y) = \int_{\mathcal{X} \times \mathcal{Y}} \left( \frac{p_{XY}(x, y)}{p_X(x) \, p_Y(y)} - 1 \right)^{2} p_X(x) \, p_Y(y) \, d\mu_X \, d\mu_Y.$$

The above result shows that these measures bear an uncanny resemblance to mutual information and might possibly inherit some of its desirable properties. We show that this intuition is, in fact, true by proving an important consequence of the above result. In particular, the following result holds when the assumptions stated in Theorem 1 are satisfied.

Theorem 2. (Invariance of the dependence measure) Assume that the probabilities $P_{XY}$ and $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ are absolutely continuous. Let $\mathcal{X} \subseteq \mathbb{R}^d$, and let $\Gamma_X : \mathcal{X} \to \mathbb{R}^d$ be a differentiable injective function. Let $\Gamma_Y$ and $\Gamma_Z$ be defined similarly. Under the above assumptions, we have

$$D_{HS}(X, Y \mid Z) = D_{HS}(\Gamma_X(X), \Gamma_Y(Y) \mid \Gamma_Z(Z)).$$

As a special case, for $Z = \emptyset$ we have $D_{HS}(X, Y) = D_{HS}(\Gamma_X(X), \Gamma_Y(Y))$.

The proof is in the supplementary material. Though we proved the result for Euclidean spaces, it can be generalized to certain measurable spaces under mild conditions (Hewitt & Stromberg, 1975).

We now look at the empirical estimators of the dependence measures defined above. Let $X_{1:m}$, $Y_{1:m}$ and $Z_{1:m}$ be i.i.d. samples from the joint distribution. Let $\hat{\mu}_X^{(m)} = \frac{1}{m} \sum_{i=1}^{m} k_X(\cdot, X_i)$ and $\hat{\mu}_Y^{(m)} = \frac{1}{m} \sum_{i=1}^{m} k_Y(\cdot, Y_i)$ denote the empirical mean maps. The empirical estimator of $\Sigma_{YX}$ is

$$\hat{\Sigma}_{YX}^{(m)} = \frac{1}{m} \sum_{i=1}^{m} \left( k_Y(\cdot, Y_i) - \hat{\mu}_Y^{(m)} \right) \left\langle k_X(\cdot, X_i) - \hat{\mu}_X^{(m)}, \; \cdot \; \right\rangle_{\mathcal{H}_X}.$$

The empirical covariance operators $\hat{\Sigma}_{XX}^{(m)}$ and $\hat{\Sigma}_{YY}^{(m)}$ can be defined in a similar fashion. The empirical normalized cross-covariance operator is

$$\hat{V}_{YX}^{(m)} = \left( \hat{\Sigma}_{YY}^{(m)} + \varepsilon_m I \right)^{-1/2} \hat{\Sigma}_{YX}^{(m)} \left( \hat{\Sigma}_{XX}^{(m)} + \varepsilon_m I \right)^{-1/2},$$

where $\varepsilon_m > 0$ is a regularization constant (Bach, 2002; Fukumizu et al., 2004). In the later sections, we look at a particular choice of $\varepsilon_m$ which provides good convergence rates. The empirical conditional cross-covariance operator is $\hat{V}_{YX|Z}^{(m)} = \hat{V}_{YX}^{(m)} - \hat{V}_{YZ}^{(m)} \hat{V}_{ZX}^{(m)}$.

Let $G_X$ be the centered Gram matrix with entries $G_{X,ij} = \langle k_X(\cdot, X_i) - \hat{\mu}_X^{(m)}, \, k_X(\cdot, X_j) - \hat{\mu}_X^{(m)} \rangle_{\mathcal{H}_X}$, and let $R_X = G_X (G_X + m \varepsilon_m I)^{-1}$. Similarly, we can define $G_Y, G_Z$ and $R_Y, R_Z$ for the random variables $Y$ and $Z$. The empirical dependence measures are then

$$\hat{D}_{HS}(X, Y) = \big\| \hat{V}_{YX}^{(m)} \big\|_{HS}^2 = \operatorname{Tr}[R_Y R_X],$$

$$\hat{D}_{HS}(X, Y \mid Z) = \big\| \hat{V}_{Y\ddot{X}|Z}^{(m)} \big\|_{HS}^2 = \operatorname{Tr}\big[ R_Y R_{\ddot{X}} - 2 R_Y R_{\ddot{X}} R_Z + R_Y R_Z R_{\ddot{X}} R_Z \big].$$

Fukumizu et al. (2008) show the consistency of the above estimators.
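These trace formulas translate directly into a few lines of linear algebra, with only $m \times m$ Gram matrices ever being formed. The sketch below is ours; it fixes the Gaussian bandwidth by hand rather than using the median heuristic of Section 6, and uses the default $\varepsilon_m = 10^{-6}$:

```python
import numpy as np

def centered_gram(A, sigma=1.0):
    # Centered Gaussian Gram matrix: G_ij = <k(., A_i) - mu, k(., A_j) - mu>
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq / (2 * sigma ** 2))
    m = len(A)
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ G @ H

def R_mat(G, eps):
    # R = G (G + m * eps * I)^{-1}
    m = len(G)
    return G @ np.linalg.inv(G + m * eps * np.eye(m))

def dhs(X, Y, Z=None, sigma=1.0, eps=1e-6):
    # Unconditional: Tr[R_Y R_X].  Conditional: X is augmented to (X, Z), and
    # Tr[R_Y R_X - 2 R_Y R_X R_Z + R_Y R_Z R_X R_Z] is returned.
    Xdd = X if Z is None else np.hstack([X, Z])
    RX = R_mat(centered_gram(Xdd, sigma), eps)
    RY = R_mat(centered_gram(Y, sigma), eps)
    if Z is None:
        return np.trace(RY @ RX)
    RZ = R_mat(centered_gram(Z, sigma), eps)
    return np.trace(RY @ RX - 2 * RY @ RX @ RZ + RY @ RZ @ RX @ RZ)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
print(dhs(X, np.sin(6 * X)))               # dependent pair
print(dhs(X, rng.uniform(size=(200, 1))))  # independent pair (typically much smaller)
```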

We now provide an upper bound on the convergence rates of these estimators under certain assumptions.

Theorem 3. (Consistency of operators) Assume that $\Sigma_{YY}^{-3/4} \Sigma_{YX} \Sigma_{XX}^{-3/4}$ is Hilbert-Schmidt. Suppose $\varepsilon_m$ satisfies $\varepsilon_m \to 0$ and $\varepsilon_m^3 m \to \infty$. Then we have convergence in probability in HS norm, i.e., $\| \hat{V}_{YX}^{(m)} - V_{YX} \|_{HS} \xrightarrow{P} 0$. An upper bound on the convergence rate of $\| \hat{V}_{YX}^{(m)} - V_{YX} \|_{HS}$ is $O_p(\varepsilon_m^{-3/2} m^{-1/2} + \varepsilon_m^{1/4})$.

Proof sketch (see the supplementary material for the complete proof). We can upper bound $\| \hat{V}_{YX}^{(m)} - V_{YX} \|_{HS}$ by

$$\big\| \hat{V}_{YX}^{(m)} - (\Sigma_{YY} + \varepsilon_m I)^{-1/2} \Sigma_{YX} (\Sigma_{XX} + \varepsilon_m I)^{-1/2} \big\|_{HS} + \big\| (\Sigma_{YY} + \varepsilon_m I)^{-1/2} \Sigma_{YX} (\Sigma_{XX} + \varepsilon_m I)^{-1/2} - V_{YX} \big\|_{HS}.$$

The first term can be shown to be $O_p(\varepsilon_m^{-3/2} m^{-1/2})$ using Lemma 3 (in the supplementary material). To prove the second part, consider complete orthonormal systems $\{\xi_i\}_{i=1}^{\infty}$ and $\{\psi_i\}_{i=1}^{\infty}$ for $\mathcal{H}_X$ and $\mathcal{H}_Y$, with eigenvalues $\lambda_i \ge 0$ and $\gamma_i \ge 0$ respectively. Using the definition of the Hilbert-Schmidt norm, we can bound the square of the second term by

$$\sum_{i,j=1}^{\infty} \varepsilon_m \, \frac{\lambda_i + \gamma_j + 2\sqrt{\lambda_i \gamma_j}}{(\lambda_i + \varepsilon_m)(\gamma_j + \varepsilon_m)} \, \big\langle \psi_j, \Sigma_{YX} \, \xi_i \big\rangle^2.$$

Using the AM-GM inequality, and assuming $\lambda_i \le 1$, $\gamma_i \le 1$ and that $\Sigma_{YY}^{-3/4} \Sigma_{YX} \Sigma_{XX}^{-3/4}$ is Hilbert-Schmidt, the theorem follows.

Theorem 4. (Consistency of estimators) Assume that $\Sigma_{YY}^{-3/4} \Sigma_{Y\ddot{X}} \Sigma_{\ddot{X}\ddot{X}}^{-3/4}$, $\Sigma_{ZZ}^{-3/4} \Sigma_{Z\ddot{X}} \Sigma_{\ddot{X}\ddot{X}}^{-3/4}$ and $\Sigma_{YY}^{-3/4} \Sigma_{YZ} \Sigma_{ZZ}^{-3/4}$ are Hilbert-Schmidt. Suppose $\varepsilon_m$ satisfies $\varepsilon_m \to 0$ and $\varepsilon_m^3 m \to \infty$. Then we have $\hat{D}_{HS}(X, Y) \xrightarrow{P} D_{HS}(X, Y)$ and $\hat{D}_{HS}(X, Y \mid Z) \xrightarrow{P} D_{HS}(X, Y \mid Z)$. An upper bound on the convergence rate of the estimator $\hat{D}_{HS}(X, Y \mid Z)$ is $O_p(\varepsilon_m^{-3/2} m^{-1/2} + \varepsilon_m^{1/4})$.

The proof is in the supplementary material. The assumptions used in Theorems 3 and 4 depend on the probability distribution and the kernel. A natural question is whether such assumptions are necessary for estimating these measures. The following result answers this question affirmatively. The crux of the result lies in the fact that these dependence measures are intricately connected to functionals of probability distributions that are typically hard to estimate. In particular, we have

$$D_{HS}(X, Y) = \int_{\mathcal{X} \times \mathcal{Y}} \frac{p_{XY}^{2}(x, y)}{p_X(x) \, p_Y(y)} \, d\mu_X \, d\mu_Y - 1.$$

Since the estimation of such functionals is tightly coupled with smoothness assumptions on the probability distributions of the random variables, it is reasonable to assume that our assumptions have a similar effect. We prove the result for the special case where $X, Y \in \mathbb{R}^d$ and $Z = \emptyset$. Let $\mathcal{P}$ denote the set of all distributions on $[0, 1]^d$.

Theorem 5. Let $\hat{D}_n$ denote an estimator of $D_{HS}$ based on a sample of size $n$. For any sequence of estimators $\{\hat{D}_n\}$ and any sequence $\{a_n\}$ converging to 0, there exists a compact subset $\mathcal{P}_0 \subset \mathcal{P}$ for which the uniform rate of convergence is slower than $\{a_n\}$; in other words,

$$\liminf_{n \to \infty} \; \sup_{P \in \mathcal{P}_0} \; \Pr\big( |\hat{D}_n - D| > a_n \big) > 0,$$

where $D = D_{HS}(X, Y)$.

The proof is in the supplementary material. An important point to note is that while the dependence measures themselves are invariant to invertible transformations, the estimators do not possess this property. It is often desirable to have this invariance property for the estimators as well, since we generally deal with finite sample estimators. This property can be achieved using the copula transformation, which is the focus of the next section. Along with the theoretical analysis, we also provide a compelling practical justification for using the copula transformation in Section 6.

3. Copula Transformation

In this section we review important properties of the copula of a multivariate distribution. The use of the transformation will become clear in the later sections. The copula plays an important role in the study of dependence among random variables. Copulas not only capture the dependence between random variables but also help us construct dependence measures which are invariant to any strictly increasing transformation of the marginal variables.

Sklar's theorem is central to the theory of copulas. It gives the relationship between a multivariate random variable and its univariate marginals. Suppose $X = (X^1, \dots, X^d) \in \mathbb{R}^d$ is a $d$-dimensional multivariate random variable. Let us denote the marginal cumulative distribution function (cdf) of $X^j$ by $F_X^j : \mathbb{R} \to [0, 1]$. In this paper, we assume that the marginals $F_X^j$ are invertible. The copula is defined by Sklar's theorem as follows.

Theorem 6. (Sklar's theorem) Let $H(x_1, \dots, x_d) = \Pr\big( X^1 \le x_1, \dots, X^d \le x_d \big)$ be the multivariate cumulative distribution function with continuous marginals $F_X^j$. Then there exists a unique copula $C$ such that

$$H(x_1, \dots, x_d) = C\big( F_X^1(x_1), \dots, F_X^d(x_d) \big). \quad (1)$$

Conversely, if $C$ is a copula and the $F_X^j$ are marginal cdfs, then $H$ given in Equation (1) is the joint distribution with marginals $F_X^j$.

Let $T_X = (T_X^1, \dots, T_X^d)$ denote the transformed variables, where $T_X = F_X(X) = (F_X^1(X^1), \dots, F_X^d(X^d)) \in [0, 1]^d$. Here $F_X$ is called the copula transformation. The above theorem gives a one-to-one correspondence between the joint distribution of $X$ and that of $T_X$. Furthermore, it provides a way to construct dependence measures over the transformed variables, since we have information about the copula distribution. An interesting consequence of the copula transformation is that we obtain invariance of the dependence measures to any strictly increasing transformations of the marginal variables.
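The copula transformation is estimated in Section 4 below purely from coordinatewise rank statistics. A minimal sketch of that empirical version (ours) also makes the content of Sklar's setup tangible: whatever the continuous marginals are, the transformed coordinates are approximately uniform on $[0, 1]$:

```python
import numpy as np

def empirical_copula(X):
    # Coordinatewise rank transform: T_i^j = (1/m) * #{l : X_l^j <= X_i^j}
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    return ranks / len(X)

rng = np.random.default_rng(0)
X = np.column_stack([rng.exponential(3.0, size=2000),     # two very different
                     rng.normal(50.0, 10.0, size=2000)])  # continuous marginals

T = empirical_copula(X)
print(T.min(axis=0), T.max(axis=0), T.mean(axis=0))  # ranges in (0, 1], means ~0.5
```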
4. Hilbert-Schmidt Dependence Measure using Copulas

Consider random variables $X = (X^1, \dots, X^d)$, $Y = (Y^1, \dots, Y^d)$ and $Z = (Z^1, \dots, Z^d)$. We focus on the problem of defining a dependence measure $D_C(X, Y)$ and a conditional dependence measure $D_C(X, Y \mid Z)$ using the copula transformation. The next section generalizes the dependence measure to any set of random variables. Note that we have assumed that the random variables $X$, $Y$ and $Z$ are all $d$-dimensional for simplicity, but our results hold for random variables with different dimensions. Let $T_X, T_Y$ and $T_Z$ be the copula transformed variables of $X$, $Y$ and $Z$ respectively. With a slight abuse of notation, we use $k_X, k_Y$ and $k_Z$ to denote the kernels over the transformed variables $T_X$, $T_Y$ and $T_Z$ respectively. In what follows, the kernels are functions of the form $[0, 1]^d \times [0, 1]^d \to \mathbb{R}$, since they are defined over the transformed random variables. We define the dependence between the random variables $X$ and $Y$ as

$$D_C(X, Y) = D_{HS}(T_X, T_Y) = \| V_{T_Y T_X} \|_{HS}^2.$$

The conditional dependence measure is defined as

$$D_C(X, Y \mid Z) = D_{HS}(T_X, T_Y \mid T_Z) = \big\| V_{T_Y \ddot{T}_X | T_Z} \big\|_{HS}^2,$$

where $\ddot{T}_X = (T_X, T_Z)$.

By Theorem 10 (in the supplementary material) and the fact that the marginal cdfs are invertible, it is easy to see that $D_C(X, Y) = 0 \Leftrightarrow X \perp\!\!\!\perp Y$ and $D_C(X, Y \mid Z) = 0 \Leftrightarrow X \perp\!\!\!\perp Y \mid Z$.

Our goal is to estimate the dependence measures $D_C(X, Y)$ and $D_C(X, Y \mid Z)$ using the i.i.d. samples $X_{1:m}$, $Y_{1:m}$ and $Z_{1:m}$. If we had the copula transformed variables $T_X, T_Y$ and $T_Z$, we could use the estimators $\hat{D}_C(X, Y) = \| \hat{V}_{T_Y T_X}^{(m)} \|_{HS}^2$ and $\hat{D}_C(X, Y \mid Z) = \| \hat{V}_{T_Y \ddot{T}_X | T_Z}^{(m)} \|_{HS}^2$ for the dependence and conditional dependence measures, respectively. However, we only have the i.i.d. samples $X_{1:m}, Y_{1:m}, Z_{1:m}$, and the marginal cdfs are unknown to us. We have to obtain the empirical copula transformed variables from these samples by estimating the marginal distribution functions $F_X^j$, $F_Y^j$ and $F_Z^j$. These distribution functions can be estimated efficiently using rank statistics. For $x \in \mathbb{R}$ and $x^j \in \mathbb{R}$ for $1 \le j \le d$, let

$$\hat{F}_X^j(x) = \frac{1}{m} \big| \{ i : 1 \le i \le m, \; x \ge X_i^j \} \big|, \qquad \hat{F}_X(x^1, \dots, x^d) = \big( \hat{F}_X^1(x^1), \dots, \hat{F}_X^d(x^d) \big).$$

$\hat{F}_X$ is called the empirical copula transformation of $X$. The samples $(\hat{T}_{X_1}, \dots, \hat{T}_{X_m}) = (\hat{F}_X(X_1), \dots, \hat{F}_X(X_m))$, called the empirical copula, are estimates of the true copula transformation. We can similarly define empirical copula transformations and empirical copulas for the random variables $Y$ and $Z$. It should be noted that the samples of the empirical copula $(\hat{T}_{X_1}, \dots, \hat{T}_{X_m})$ are not independent even though $X_{1:m}$ are i.i.d. samples. We can now use the dependence estimators of Fukumizu et al. (2008) on the empirical copula $(\hat{T}_{X_1}, \dots, \hat{T}_{X_m})$ instead of the samples $(X_1, \dots, X_m)$. Lemma 2 (in the supplementary material) shows that the empirical copula is a good approximation of the i.i.d. samples $(T_{X_1}, \dots, T_{X_m})$.

It is important to note the relationship between the measures $D_{HS}$ and $D_C$. The copula transformation can itself be viewed as an invertible transformation and hence, by Theorem 2, we have $D_{HS} = D_C$. Though the measures are identical, their corresponding estimators $\hat{D}_{HS}$ and $\hat{D}_C$ are different. At this point, we should also emphasize the difference between our work and Póczos et al. (2012). Although both works use the copula trick to obtain invariance, in contrast to Póczos et al. (2012), we essentially get the same measure even after the copula transformation. In other words, the copula transformation in our case does not change the dependence measure, and therefore provides an invariant finite sample estimator of $D_{HS}$. Thus, we provide a compelling case for using copulas with $D_{HS}$. Moreover, the invariance property extends naturally to the conditional dependence measure in our case.

We now focus on the consistency of the proposed estimators $\hat{D}_C(X, Y)$ and $\hat{D}_C(X, Y \mid Z)$. We assume that the kernel functions $k_X, k_Y$ and $k_Z$ are bounded and Lipschitz continuous on $[0, 1]^d$, i.e., there exists a $B > 0$ such that

$$| k_X(x_1, x) - k_X(x_2, x) | \le B \| x_1 - x_2 \| \quad \text{for all } x, x_1, x_2 \in [0, 1]^d.$$

The Gaussian kernel is a popular kernel which is not only universal but also bounded and Lipschitz continuous. In what follows, we assume that the conditions required for Theorems 3 and 4 hold for the transformed variables as well. We now show the consistency of the dependence estimators and provide upper bounds on their rates of convergence.

Theorem 7. (Consistency of the copula dependence estimators) Assume the kernels $k_X, k_Y$ and $k_Z$ are bounded and Lipschitz continuous. Suppose $\varepsilon_m$ satisfies $\varepsilon_m \to 0$ and $\varepsilon_m^3 m \to \infty$. Then (i) $\hat{D}_C(X, Y) \xrightarrow{P} D_C(X, Y)$ and (ii) $\hat{D}_C(X, Y \mid Z) \xrightarrow{P} D_C(X, Y \mid Z)$. An upper bound on the convergence rate of the estimators $\hat{D}_C(X, Y)$ and $\hat{D}_C(X, Y \mid Z)$ is $O_p(\varepsilon_m^{-3/2} m^{-1/2} + \varepsilon_m^{1/4})$.
Proof sketch (see the supplementary material for the complete proof). We first show that $\| \hat{V}_{\hat{T}_Y \hat{T}_X}^{(m)} - V_{T_Y T_X} \|_{HS} \xrightarrow{P} 0$. Consider the following upper bound of $\| \hat{V}_{\hat{T}_Y \hat{T}_X}^{(m)} - V_{T_Y T_X} \|_{HS}$:

$$\big\| \hat{V}_{\hat{T}_Y \hat{T}_X}^{(m)} - \hat{V}_{T_Y T_X}^{(m)} \big\|_{HS} + \big\| \hat{V}_{T_Y T_X}^{(m)} - V_{T_Y T_X} \big\|_{HS}. \quad (2)$$

From Theorem 3, it is easy to see that the second term converges with rate $O_p(\varepsilon_m^{-3/2} m^{-1/2} + \varepsilon_m^{1/4})$. The first term can be proved to be bounded by $C \varepsilon_m^{-3/2} \| \hat{\Sigma}_{\hat{T}_Y \hat{T}_X}^{(m)} - \hat{\Sigma}_{T_Y T_X}^{(m)} \|_{HS}$ along similar lines as Lemma 3. We then prove that $\| \hat{\Sigma}_{\hat{T}_Y \hat{T}_X}^{(m)} - \hat{\Sigma}_{T_Y T_X}^{(m)} \|_{HS}$ converges at rate $O_p(m^{-1/2})$ (see Lemma 1 in the supplementary material), thereby proving the overall rate of the first term to be $O_p(\varepsilon_m^{-3/2} m^{-1/2})$. The convergence of the dependence estimators follows from the convergence of the operators.
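Putting the pieces together, the copula-based estimator is just the Section 2 estimator applied to the empirical copula. The self-contained sketch below is ours (fixed bandwidth and $\varepsilon_m = 10^{-6}$ rather than the paper's experimental settings); it also demonstrates the finite-sample invariance: a monotone increasing transformation of $Y$ leaves $\hat{D}_C$ unchanged because the ranks are unchanged:

```python
import numpy as np

def empirical_copula(X):
    return (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / len(X)

def R_mat(A, sigma=0.5, eps=1e-6):
    # Centered Gaussian Gram matrix, then R = G (G + m * eps * I)^{-1}
    m = len(A)
    G = np.exp(-((A[:, None, :] - A[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    H = np.eye(m) - np.ones((m, m)) / m
    G = H @ G @ H
    return G @ np.linalg.inv(G + m * eps * np.eye(m))

def dc(X, Y):
    # D_C-hat(X, Y) = Tr[R_{T_Y} R_{T_X}], computed on the empirical copula
    return np.trace(R_mat(empirical_copula(Y)) @ R_mat(empirical_copula(X)))

rng = np.random.default_rng(1)
X = rng.uniform(0, 4 * np.pi, size=(300, 1))
Y = 50 * X
Z = 10 * np.tanh(Y / 100)          # monotone increasing transformation of Y
print(dc(X, Y), dc(X, Z))          # identical values: only ranks enter
```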

The above result shows that by using the copula transformation, we can ensure that the invariance property holds for finite sample estimators without loss of statistical efficiency. It should be noted that slightly better rates of convergence can be obtained under more restrictive assumptions. It is also noteworthy that under the conditions assumed in this paper, the estimators do not suffer from the curse of dimensionality, and hence can be used in high dimensions. Moreover, the estimators only use the rank statistics rather than the actual values of $X_{1:m}, Y_{1:m}$ and $Z_{1:m}$. This provides us with robust estimators, since an arbitrarily large outlier sample cannot affect the statistics badly. In addition, this also makes the estimators invariant to monotone increasing transformations.

Let us call the dependence measure defined above a pairwise dependence measure, since it measures the dependence between two random variables. Now the question arises whether this approach can be generalized to measure dependence among a set of random variables rather than just two. We answer this question affirmatively in the next section.

5. Generalized Dependence Measures

Suppose $S = \{S_1, \dots, S_n\}$ is a set of random variables. Similarly to our previous assumptions, we assume that the random variables $\{S_j\}$ are $d$-dimensional. We would like to measure the dependence among this set of random variables. Recall that $S_{i:j}$ represents the random variable $(S_i, \dots, S_j)$. Note that $S_{i:j}$ is a random variable of $(j - i)d$ dimensions. With a slight abuse of notation, the kernel $k_{i:j}$ corresponding to the variable $S_{i:j}$ is defined appropriately. We now express the generalized dependence measure as a sum of pairwise dependence measures. For simplicity, we denote $D_C(S_1, \dots, S_n)$ by $D_C(S)$. The dependence measures are defined as

$$D_C(S) = \sum_{j=1}^{n-1} D_C(S_j, S_{j+1:n}), \quad (3)$$

$$D_C(S \mid Z) = \sum_{j=1}^{n-1} D_C(S_j, S_{j+1:n} \mid Z). \quad (4)$$

Note that the set dependence measure is a sum of $n - 1$ pairwise dependence measures. The following result justifies the use of these dependence measures.

Theorem 8. (Generalized dependence measure) If the product kernels $k_j k_{j+1:n}$ for $j \in \{1, \dots, n-1\}$ are universal, we have (i) $D_C(S) = 0 \Leftrightarrow (S_1, \dots, S_n)$ are independent, and (ii) $D_C(S \mid Z) = 0 \Leftrightarrow (S_1, \dots, S_n)$ are independent given $Z$.

The proof is in the supplementary material. We can now use the pairwise dependence estimators for estimating the set dependence. The following estimators are used for measuring the dependence:

$$\hat{D}_C(S) = \sum_{j=1}^{n-1} \hat{D}_C(S_j, S_{j+1:n}), \quad (5)$$

$$\hat{D}_C(S \mid Z) = \sum_{j=1}^{n-1} \hat{D}_C(S_j, S_{j+1:n} \mid Z). \quad (6)$$

Let us assume that the conditions required for Theorem 7 are satisfied. The following theorem states the consistency of the dependence estimators proposed above and provides an upper bound on their rates of convergence.

Theorem 9. (Consistency of the generalized estimators) Suppose the kernels defined above are bounded and Lipschitz continuous. Then (i) $\hat{D}_C(S) \xrightarrow{P} D_C(S)$ and (ii) $\hat{D}_C(S \mid Z) \xrightarrow{P} D_C(S \mid Z)$. An upper bound on the convergence rate of $\hat{D}_C(S)$ and $\hat{D}_C(S \mid Z)$ is $O_p(n \varepsilon_m^{-3/2} m^{-1/2} + n \varepsilon_m^{1/4})$.

Proof. The theorem follows easily from Theorem 7 and Equations (3), (4), (5) and (6).
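In code, Equation (5) is a direct loop over the pairwise estimator; the conditional version, Equation (6), follows the same pattern. A sketch (ours; the small helpers are repeated from the previous sketch so the block stands alone):

```python
import numpy as np

def empirical_copula(X):
    return (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / len(X)

def R_mat(A, sigma=0.5, eps=1e-6):
    m = len(A)
    G = np.exp(-((A[:, None, :] - A[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    H = np.eye(m) - np.ones((m, m)) / m
    G = H @ G @ H
    return G @ np.linalg.inv(G + m * eps * np.eye(m))

def dc(X, Y):
    return np.trace(R_mat(empirical_copula(Y)) @ R_mat(empirical_copula(X)))

def dc_set(S):
    # Equation (5): sum_j dc(S_j, S_{j+1:n}); S is a list of (m, d) sample blocks.
    return sum(dc(S[j], np.hstack(S[j + 1:])) for j in range(len(S) - 1))

rng = np.random.default_rng(0)
A = rng.uniform(size=(300, 1))
print(dc_set([A, A + rng.normal(0, 0.1, (300, 1)), rng.uniform(size=(300, 1))]))
```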
6. Experimental Results

In this section, we empirically illustrate the theoretical contributions of the paper. We compare the performance of $\hat{D}_{HS}(X, Y)$ and $\hat{D}_{HS}(X, Y \mid Z)$ (referred to as NHS) with $\hat{D}_C(X, Y)$ and $\hat{D}_C(X, Y \mid Z)$ (referred to as CHS), that is, without and with the copula transformation, respectively. In the following experiments, we choose Gaussian kernels and set $\sigma$ by the median heuristic. We fix $\varepsilon_m = 10^{-6}$ for our experiments.

6.1. Synthetic Dataset

In the first simulation, we constructed the following random variables: $X_1 \sim U[0, 4\pi]$, $X_2 = (500 V_1, \, 1000 \tanh(V_2), \, 500 \sinh(V_3))$, where $V_1, V_2, V_3 \sim U[0, 1]$, and $Y = 1000 \tanh(X_1)$. 200 sample points were generated using the distributions specified above. The task in this experiment was to choose, between $X_1$ and $X_2$, the feature that contains the most information about $Y$. Note that $Y$ is a deterministic function of $X_1$ and independent of $X_2$. Therefore, we expect the dependence for $(X_1, Y)$ to be high and that of $(X_2, Y)$ to be low.
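For concreteness, the data for this simulation can be generated as follows (our sketch of the stated distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X1 = rng.uniform(0, 4 * np.pi, size=(m, 1))
V = rng.uniform(0, 1, size=(m, 3))
X2 = np.column_stack([500 * V[:, 0], 1000 * np.tanh(V[:, 1]), 500 * np.sinh(V[:, 2])])
Y = 1000 * np.tanh(X1)     # deterministic in X1, independent of X2
```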

Figure 1 shows the dependence measure (DM) comparison of NHS and CHS. It can be seen that while CHS chooses the correct feature $X_1$, NHS chooses the incorrect feature $X_2$.

Figure 1. Columns (A) and (B) represent the dependence of $(X_1, Y)$ and $(X_2, Y)$ as measured by NHS, while columns (C) and (D) represent the dependence of $(X_1, Y)$ and $(X_2, Y)$ as measured by CHS.

The next simulation is designed to demonstrate the significance of the invariance property on finite samples. As mentioned earlier, though the measure $D_{HS}$ is invariant to invertible transformations, its estimator NHS does not retain this property. Thanks to the copula transformation, CHS does not suffer from this issue. Let $X \sim U[0, 4\pi]$, $Y = 50 X$ and $Z = 10 \tanh(Y/100)$. Note that $Z$ is an invertible transformation of $Y$. The dependence of $(X, Z)$ under CHS is not reported here, as CHS is invariant to any monotone increasing transformations. We now demonstrate the asymptotic nature of the invariance property of NHS on the same data. Figure 2 clearly shows that while both methods have almost the same dependence measure for $(X, Y)$, the dependence measured by NHS is reduced significantly when $Y$ is transformed to $Z$. We can clearly see that NHS requires a large sample before it exhibits the invariance property. This problem is further amplified in higher dimensions, making NHS undesirable for high-dimensional tasks.

Figure 2. DM with varying number of samples (legend: NHS(X,Y), NHS(X,Z), CHS(X,Y); x-axis: number of samples, 0 to 3000).

We demonstrate the performance of the conditional dependence measures in the following experiment. Let $X, V \sim U[0, 4\pi]$, $Y = 50 \sin(V)$ and $Z = \log(X + Y)$. Observe that though $X$ and $Y$ are independent, they become dependent when conditioned on $Z$. The results in Figure 3 clearly indicate that NHS fails to detect the dependence between $X$ and $Y$ when conditioned on $Z$, while CHS successfully captures this dependence.

Figure 3. Columns (A) and (B) represent the dependence of $(X, Y)$ and $(X, Y) \mid Z$ as measured by NHS, while columns (C) and (D) represent the dependence of $(X, Y)$ and $(X, Y) \mid Z$ as measured by CHS.

6.2. Housing Dataset

We evaluate the performance of the dependence measures on the Housing dataset from the UCI repository. The importance of scale invariance on real-world data is demonstrated through this experiment. This dataset consists of 506 sample points, each of which has 12 real-valued attributes and 1 integer-valued attribute. We only consider the real-valued attributes for this experiment and discard the integer attribute. Our aim is to predict the median value of owner-occupied homes based on other attributes such as per capita crime rate, percentage of lower status of the population, etc. This dataset is particularly interesting since it contains features of different nature and scale. In this experiment, we would like to find the single most relevant feature for predicting the median value. Features 13 and 6 achieve the lowest prediction errors among all features using linear regressors (see the supplementary material for more details). While CHS picks these two features as the most relevant, NHS performs poorly by selecting features 1 and 6 as the most relevant features.
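Single-feature screening of the kind used in both experiments reduces to scoring each attribute against the target with the dependence estimator. A sketch under our conventions (the `features` and `target` arrays are illustrative stand-ins for however the 12 real-valued attributes and the median home value are loaded; helpers repeated so the block stands alone):

```python
import numpy as np

def empirical_copula(X):
    return (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / len(X)

def R_mat(A, sigma=0.5, eps=1e-6):
    m = len(A)
    G = np.exp(-((A[:, None, :] - A[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    H = np.eye(m) - np.ones((m, m)) / m
    G = H @ G @ H
    return G @ np.linalg.inv(G + m * eps * np.eye(m))

def dc(X, Y):
    return np.trace(R_mat(empirical_copula(Y)) @ R_mat(empirical_copula(X)))

def most_relevant_feature(features, target):
    # Score each column of `features` (m, p) against `target` (m, 1) with dc.
    scores = [dc(features[:, j:j + 1], target) for j in range(features.shape[1])]
    return int(np.argmax(scores)), scores
```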
7. Conclusion

In this paper we developed new dependence and conditional dependence measures and estimators which are invariant to monotone increasing transformations of the variables. We showed that under certain conditions the convergence rates of the estimators are polynomial, but that when the conditions are less restrictive, then, similarly to other existing estimators, the convergence can be arbitrarily slow. We also generalized these measures and estimators to sets of variables, and illustrated the applicability of our method with a few numerical experiments.

References

Bach, F. R. Kernel independent component analysis. JMLR, 3:1-48, 2002.

Baker, C. R. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273-289, 1973.

Borgwardt, K., Gretton, A., Rasch, M., Kriegel, H., Schölkopf, B., and Smola, A. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.

Fortet, R. and Mourier, E. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266-285, 1953.

Fukumizu, K., Bach, F. R., and Jordan, M. I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5:73-99, 2004.

Fukumizu, K., Bach, F. R., and Gretton, A. Statistical convergence of kernel CCA. In Advances in Neural Information Processing Systems 18, 2005.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In NIPS 20, pp. 489-496, Cambridge, MA, 2008. MIT Press.

Gretton, A., Herbrich, R., and Smola, A. The kernel mutual information. In Proc. ICASSP, 2003.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pp. 63-77, 2005.

Grünewälder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, volume 2, pp. 1823-1830, 2012.

Hewitt, E. and Stromberg, K. Real and Abstract Analysis: A Modern Treatment of the Theory of Functions of a Real Variable. Graduate Texts in Mathematics. Springer, 1975. ISBN 9780387901381.

Póczos, B., Ghahramani, Z., and Schneider, J. G. Copula-based kernel dependency measures. In ICML, 2012.

Poczos, B. and Schneider, J. Nonparametric estimation of conditional information and divergences. In International Conference on AI and Statistics (AISTATS), JMLR Workshop and Conference Proceedings, 2012.

Reed, M. and Simon, B. Functional Analysis. Academic Press, 1980. ISBN 9780080570488.

Rényi, A. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441-451, 1959.

Rényi, A. On measures of entropy and information. In 4th Berkeley Symposium on Math., Stat., and Prob., pp. 547-561, 1961.

Ritov, Y. and Bickel, P. J. Achieving information bounds in non and semiparametric models. The Annals of Statistics, 18(2):925-938, 1990.

Schweizer, B. and Wolff, E. On nonparametric measures of dependence for random variables. The Annals of Statistics, 9, 1981.

Székely, G. J., Rizzo, M. L., and Bakirov, N. K. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35:2769-2794, 2007.

Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Phys., 52(1-2):479-487, 1988.