Hilbert Schmidt Independence Criterion
Hilbert Schmidt Independence Criterion
Thanks to Arthur Gretton, Le Song, Bernhard Schölkopf, Olivier Bousquet
Alexander J. Smola
Statistical Machine Learning Program, Canberra, ACT 0200, Australia
ICONIP 2006, Hong Kong, October 3
Alexander J. Smola: Hilbert Schmidt Independence Criterion 1 / 30
Outline
1. Measuring Independence: Covariance Operator; Hilbert Space Methods; A Test Statistic and its Analysis
2. Independent Component Analysis: ICA Primer; Examples
3. Feature Selection: Problem Setting; Algorithm; Results
Measuring Independence

Problem: given a sample {(x_1, y_1), ..., (x_m, y_m)} drawn from Pr(x, y), determine whether Pr(x, y) = Pr(x) Pr(y), and measure the degree of dependence.

Applications: independent component analysis; dimensionality reduction and feature extraction; statistical modeling.

Indirect approach: perform a density estimate of Pr(x, y) and check whether the estimate approximately factorizes.

Direct approach: check properties of factorizing distributions, e.g. kurtosis, covariance operators, etc.
Covariance Operator

Linear case: for linear functions f(x) = w^T x and g(y) = v^T y, the covariance is given by
  Cov{f(x), g(y)} = w^T C v,
where C is the covariance matrix. This is a bilinear form on the space of linear functions.

Nonlinear case: define C to be the operator mapping (f, g) to Cov{f, g}.

Theorem: C is a bilinear operator in f and g.

Proof: we only show linearity in f: Cov{αf, g} = α Cov{f, g}, and for f + f' the covariance is additive.
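The linear case admits a quick numerical check (my addition, not from the slides): the empirical covariance of f(x) = w^T x and g(y) = v^T y equals the bilinear form w^T C v, with C the empirical cross-covariance matrix of x and y.

```python
import numpy as np

rng = np.random.RandomState(0)
m = 10_000
x = rng.randn(m, 3)
y = 0.5 * x[:, :2] + 0.1 * rng.randn(m, 2)   # y depends on x

w = np.array([1.0, -2.0, 0.5])
v = np.array([0.3, 1.0])

# Cov{f(x), g(y)} computed directly on the function values ...
f, g = x @ w, y @ v
cov_fg = np.mean(f * g) - np.mean(f) * np.mean(g)

# ... and via the bilinear form w^T C v with the cross-covariance matrix C
C = (x - x.mean(0)).T @ (y - y.mean(0)) / m
assert abs(cov_fg - w @ C @ v) < 1e-8   # exact identity, up to rounding
```

The agreement is exact (not just asymptotic), since both sides compute the same sum.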
Independent random variables (figure)
Dependent random variables (figure)
Or are we just unlucky? (figure)
Covariance Operators

Criterion (Rényi, 1957): test for independence by checking whether C = 0.

Reproducing kernel Hilbert space: kernels k, l on X, Y with associated RKHSs F, G; assume k, l are bounded on their domains.

Mean operators: ⟨μ_x, f⟩ = E_x[f(x)] and ⟨μ_y, g⟩ = E_y[g(y)].

Covariance operator: define the covariance operator C_xy via the bilinear form
  ⟨f, C_xy g⟩ = Cov{f, g} = E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)].
Hilbert Space Representation

Theorem: provided that k, l are universal kernels, C_xy = 0 if and only if x and y are independent.

Proof.
Step 1: if x, y are dependent, then there exist [0, 1]-bounded functions f*, g* with Cov{f*, g*} = ε > 0.
Step 2: since k, l are universal, there exist ε-approximations of f*, g* in F, G such that the covariance of the approximations does not vanish.
Step 3: hence the covariance operator C_xy is nonzero.
A Test Statistic

Covariance operator:
  C_xy = E_{x,y}[k(x, ·) ⊗ l(y, ·)] − E_x[k(x, ·)] ⊗ E_y[l(y, ·)]

Operator norm: use the norm of C_xy to test whether x and y are independent; it also gives us a measure of dependence:
  HSIC(Pr_{xy}, F, G) := ||C_xy||_HS^2,
where ||·||_HS denotes the Hilbert-Schmidt norm.

Frobenius norm: for matrices we can define ||M||^2 = Σ_{ij} M_{ij}^2 = tr M^T M. The Hilbert-Schmidt norm is the generalization of the Frobenius norm to operators.
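The Frobenius-norm identity is easy to confirm numerically (a small check of my own, using NumPy); it also equals the sum of squared singular values, which is the form that carries over to Hilbert-Schmidt operators.

```python
import numpy as np

rng = np.random.RandomState(1)
M = rng.randn(4, 6)

frob_sq = np.sum(M**2)                          # sum_ij M_ij^2
assert np.isclose(frob_sq, np.trace(M.T @ M))   # = tr M^T M
# equivalently, the sum of squared singular values of M
assert np.isclose(frob_sq, np.sum(np.linalg.svd(M, compute_uv=False)**2))
```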
Computing ||C_xy||^2

Rank-one operators: for rank-one terms we have
  ||f ⊗ g||^2 = ⟨f ⊗ g, f ⊗ g⟩_HS = ||f||^2 ||g||^2.

Joint expectation: by construction of C_xy we exploit linearity and obtain (with (x', y') an independent copy of (x, y))
  ||C_xy||^2 = ⟨C_xy, C_xy⟩_HS
    = {E_{x,y} E_{x',y'} − 2 E_{x,y} E_{x'} E_{y'} + E_x E_y E_{x'} E_{y'}} [⟨k(x, ·) ⊗ l(y, ·), k(x', ·) ⊗ l(y', ·)⟩_HS]
    = {E_{x,y} E_{x',y'} − 2 E_{x,y} E_{x'} E_{y'} + E_x E_y E_{x'} E_{y'}} [k(x, x') l(y, y')].

This is well defined if k, l are bounded.
Estimating ||C_xy||^2

Empirical criterion:
  HSIC(Z, F, G) := (m − 1)^{-2} tr KHLH,
where K_ij = k(x_i, x_j), L_ij = l(y_i, y_j), and H_ij = δ_ij − m^{-1} (the centering matrix).

Theorem: E_Z[HSIC(Z, F, G)] = HSIC(Pr_{xy}, F, G) + O(1/m).

Proof (sketch only): expand tr KHLH into terms over pairs, triples, and quadruples of non-repeated indices, which yield the proper expectations, and bound the remainder by O(m^{-1}).
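A minimal sketch of the empirical criterion (my illustration; the Gaussian kernels and the fixed bandwidth sigma = 1 are assumptions, not prescribed by the slides). Dependent data should produce a markedly larger statistic than independent data:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: tr(KHLH) / (m - 1)^2, with H = I - (1/m) 1 1^T
    m = x.shape[0]
    K, L = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.RandomState(0)
m = 200
x = rng.randn(m, 1)
h_dep = hsic(x, x + 0.1 * rng.randn(m, 1))   # y strongly dependent on x
h_ind = hsic(x, rng.randn(m, 1))             # y independent of x
# h_dep should be much larger than h_ind, which decays like O(1/m)
```

Note that tr KHLH = tr (HKH)(HLH) is a trace of a product of two positive semidefinite matrices, so the statistic is always nonnegative.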
Uniform Convergence Bounds for ||C_xy||^2

Theorem (recall Hoeffding's theorem for U-statistics): for averages over functions of r variables,
  u := (1/(m)_r) Σ_{i ∈ I_r^m} g(x_{i_1}, ..., x_{i_r}),
where (m)_r = m(m − 1)···(m − r + 1) and the sum runs over all r-tuples of distinct indices, and where a ≤ g ≤ b, we have
  Pr{u − E[u] ≥ t} ≤ exp(−2 t^2 ⌊m/r⌋ / (b − a)^2).

In our statistic we have terms of 2, 3, and 4 random variables.
Uniform Convergence Bounds for ||C_xy||^2 (continued)

Corollary: assume that k, l ≤ 1. Then with probability at least 1 − δ,
  |HSIC(Z, F, G) − HSIC(Pr_{xy}, F, G)| ≤ sqrt(log(6/δ) / (0.24 m)) + C/m.

Proof: bound each of the three terms separately via Hoeffding's theorem.
Blind Source Separation

Data: w = Ms, where all sources s_i are mutually independent (the cocktail party problem).

Task: recover the sources s and the mixing matrix M given the observations w.
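The preprocessing used for this problem (centering and whitening, described next) can be sketched as follows; the two sources and the 2×2 mixing matrix here are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
m = 5000
# two independent, non-Gaussian sources (uniform and Laplace)
s = np.column_stack([rng.uniform(-1, 1, m), rng.laplace(size=m)])
M = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix
w = s @ M.T                              # observed mixtures, w = M s

# center and whiten: afterwards cov(w_white) = I, so only an
# orthogonal rotation U remains to be found by minimizing dependence
w0 = w - w.mean(axis=0)
cov = w0.T @ w0 / m
evals, evecs = np.linalg.eigh(cov)
w_white = w0 @ evecs @ np.diag(evals ** -0.5) @ evecs.T
```

After this step the remaining search is over orthogonal matrices only, which is exactly the Stiefel-manifold optimization mentioned on the next slide.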
Independent Component Analysis

Whitening: rotate, center, and whiten the data before separation; this is always possible.

Optimization: we cannot recover the scale of the data anyway, so it suffices to find an orthogonal matrix U such that Uw = s yields independent random variables. This is optimization on the Stiefel manifold, which can be done by a Newton method.

Important trick: the kernel matrix can be huge, so use a reduced-rank representation. With K = AA^T and L = BB^T we get tr H(AA^T)H(BB^T) = ||A^T H B||^2 instead of tr HKHL.
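The reduced-rank trace identity can be verified numerically (my check; A and B are arbitrary low-rank factors standing in for, e.g., an incomplete Cholesky factorization of K and L):

```python
import numpy as np

rng = np.random.RandomState(0)
m, r = 500, 10
A = rng.randn(m, r)                    # K = A A^T, rank r << m
B = rng.randn(m, r)                    # L = B B^T
H = np.eye(m) - np.ones((m, m)) / m    # centering matrix

full    = np.trace(H @ (A @ A.T) @ H @ (B @ B.T))   # forms m x m products
lowrank = np.linalg.norm(A.T @ H @ B, 'fro') ** 2   # only an r x r result
assert np.isclose(full, lowrank)
```

In practice H is never formed explicitly either: A^T H B is obtained by centering the columns of A and B, so the cost drops from O(m^2) to O(m r^2).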
ICA Experiments (figure)
Outlier Robustness (figure)
Automatic Regularization (figure)
Mini Summary

Method: linear mixture of independent sources; remove the mean and whiten for preprocessing; use HSIC as the measure of dependence; find the best rotation to demix the data.

Performance: HSIC is very robust to outliers. The best-performing algorithm (RADICAL) is designed specifically for linear ICA, whereas HSIC is a general-purpose criterion. The low-rank decomposition makes the optimization feasible.
Feature Selection

The problem: a large number of features; select a small subset of them.

Basic idea: find features such that the distributions p(x | y = 1) and p(x | y = −1) are as different as possible, using a two-sample test.

Important tweak: we can find a similar criterion measuring the dependence between data and labels, by computing the Hilbert-Schmidt norm of the covariance operator.
Recursive Feature Elimination

Algorithm:
1. Start with the full set of features.
2. Adjust the kernel width to pick up the maximum discrepancy.
3. Find the feature whose removal decreases the discrepancy the least.
4. Remove this feature.
5. Repeat.

Applications: binary classification (standard MMD criterion); multiclass problems; regression.
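The elimination loop can be sketched as below. This is a simplified illustration, not the talk's exact procedure: it uses a Gaussian kernel with the median-distance heuristic in place of the adjusted kernel width from step 2, a linear kernel on binary labels, and a toy dataset in which only feature 0 is informative.

```python
import numpy as np

def gram(x):
    # Gaussian kernel with the median-distance heuristic as bandwidth
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic(K, L):
    # biased empirical HSIC between two Gram matrices
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.RandomState(0)
m = 150
y = rng.choice([-1.0, 1.0], size=(m, 1))
X = rng.randn(m, 5)
X[:, 0] += y[:, 0]            # only feature 0 depends on the label

L = y @ y.T                   # linear kernel on the labels
features = list(range(X.shape[1]))
while len(features) > 1:
    # score each candidate removal by the HSIC of the remaining features
    scores = {j: hsic(gram(X[:, [f for f in features if f != j]]), L)
              for j in features}
    # drop the feature whose removal decreases the dependence the least
    features.remove(max(scores, key=scores.get))
# the informative feature should be the last one standing
```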
Comparison to Other Feature Selectors
Synthetic data (figure); brain-computer interface data (figure)

Frequency Band Selection (figure)
Microarray Feature Selection

Goal: obtain a small subset of features for estimation; reproducible feature selection.

Results: (figure)
Summary
1. Measuring Independence: Covariance Operator; Hilbert Space Methods; A Test Statistic and its Analysis
2. Independent Component Analysis: ICA Primer; Examples
3. Feature Selection: Problem Setting; Algorithm; Results
Shameless Plugs

Looking for a job? Talk to me! Alex.Smola@nicta.com.au

Positions: PhD scholarships; postdoctoral positions and senior researchers; long-term visitors (sabbaticals etc.)

More details on kernels: Schölkopf and Smola, Learning with Kernels.
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationSemi-Supervised Learning through Principal Directions Estimation
Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de
More informationHilbert Schmidt component analysis
Lietuvos matematikos rinkinys ISSN 0132-2818 Proc. of the Lithuanian Mathematical Society, Ser. A Vol. 57, 2016 DOI: 10.15388/LMR.A.2016.02 pages 7 11 Hilbert Schmidt component analysis Povilas Daniušis,
More informationKernel Independent Component Analysis
Journal of Machine Learning Research 3 (2002) 1-48 Submitted 11/01; Published 07/02 Kernel Independent Component Analysis Francis R. Bach Computer Science Division University of California Berkeley, CA
More informationKernel Methods in Machine Learning
Kernel Methods in Machine Learning Autumn 2015 Lecture 1: Introduction Juho Rousu ICS-E4030 Kernel Methods in Machine Learning 9. September, 2015 uho Rousu (ICS-E4030 Kernel Methods in Machine Learning)
More informationA Linear-Time Kernel Goodness-of-Fit Test
A Linear-Time Kernel Goodness-of-Fit Test Wittawat Jitkrittum 1; Wenkai Xu 1 Zoltán Szabó 2 NIPS 2017 Best paper! Kenji Fukumizu 3 Arthur Gretton 1 wittawatj@gmail.com 1 Gatsby Unit, University College
More informationStrictly Positive Definite Functions on a Real Inner Product Space
Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite
More informationBundle Methods for Machine Learning
Bundle Methods for Machine Learning Joint work with Quoc Le, Choon-Hui Teo, Vishy Vishwanathan and Markus Weimer Alexander J. Smola sml.nicta.com.au Statistical Machine Learning Program Canberra, ACT 0200
More informationKernel Learning via Random Fourier Representations
Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random
More informationKernel Logistic Regression and the Import Vector Machine
Kernel Logistic Regression and the Import Vector Machine Ji Zhu and Trevor Hastie Journal of Computational and Graphical Statistics, 2005 Presented by Mingtao Ding Duke University December 8, 2011 Mingtao
More informationLearning SVM Classifiers with Indefinite Kernels
Learning SVM Classifiers with Indefinite Kernels Suicheng Gu and Yuhong Guo Dept. of Computer and Information Sciences Temple University Support Vector Machines (SVMs) (Kernel) SVMs are widely used in
More informationKernel Methods & Support Vector Machines
Kernel Methods & Support Vector Machines Mahdi pakdaman Naeini PhD Candidate, University of Tehran Senior Researcher, TOSAN Intelligent Data Miners Outline Motivation Introduction to pattern recognition
More informationGeneralized clustering via kernel embeddings
Generalized clustering via kernel embeddings Stefanie Jegelka 1, Arthur Gretton 2,1, Bernhard Schölkopf 1, Bharath K. Sriperumbudur 3, and Ulrike von Luxburg 1 1 Max Planck Institute for Biological Cybernetics,
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationPrincipal Component Analysis
Principal Component Analysis Introduction Consider a zero mean random vector R n with autocorrelation matri R = E( T ). R has eigenvectors q(1),,q(n) and associated eigenvalues λ(1) λ(n). Let Q = [ q(1)
More informationDimensionality Reduction. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Dimensionality Reduction CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Visualize high dimensional data (and understand its Geometry) } Project the data into lower dimensional spaces }
More informationStat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.
Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have
More informationPCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationSketching for Large-Scale Learning of Mixture Models
Sketching for Large-Scale Learning of Mixture Models Nicolas Keriven Université Rennes 1, Inria Rennes Bretagne-atlantique Adv. Rémi Gribonval Outline Introduction Practical Approach Results Theoretical
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationOutline. Motivation. Mapping the input space to the feature space Calculating the dot product in the feature space
to The The A s s in to Fabio A. González Ph.D. Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 2, 2009 to The The A s s in 1 Motivation Outline 2 The Mapping the
More informationAn Introduction to Independent Components Analysis (ICA)
An Introduction to Independent Components Analysis (ICA) Anish R. Shah, CFA Northfield Information Services Anish@northinfo.com Newport Jun 6, 2008 1 Overview of Talk Review principal components Introduce
More informationConnection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis
Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal
More informationA Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal In this class we continue our journey in the world of RKHS. We discuss the Mercer theorem which gives
More informationKernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kernel Bayes Rule: Nonparametric Bayesian inference with kernels Kenji Fukumizu The Institute of Statistical Mathematics NIPS 2012 Workshop Confluence between Kernel Methods and Graphical Models December
More informationThere are two things that are particularly nice about the first basis
Orthogonality and the Gram-Schmidt Process In Chapter 4, we spent a great deal of time studying the problem of finding a basis for a vector space We know that a basis for a vector space can potentially
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More informationSupport Vector Machine for Classification and Regression
Support Vector Machine for Classification and Regression Ahlame Douzal AMA-LIG, Université Joseph Fourier Master 2R - MOSIG (2013) November 25, 2013 Loss function, Separating Hyperplanes, Canonical Hyperplan
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationIntroduction to Machine Learning
Introduction to Machine Learning 12. Gaussian Processes Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 The Normal Distribution http://www.gaussianprocess.org/gpml/chapters/
More information