arxiv: v2 [stat.ml] 3 Sep 2015

Size: px
Start display at page:

Download "arxiv: v2 [stat.ml] 3 Sep 2015"

Transcription

1 Nonparametric Independence Testing for Small Sample Sizes arxiv:46.9v [stat.ml] 3 Sep 5 Aaditya Ramdas Dept. of Statistics and Machine Learning Dept. Carnegie Mellon University aramdas@cs.cmu.edu September 4, 5 Abstract Leila Wehbe Machine Learning Dept. Carnegie Mellon University lwehbe@cs.cmu.edu This paper deals with the problem of nonparametric independence testing, a fundamental decision-theoretic problem that asks if two arbitrary (possibly multivariate) random variables X, Y are independent or not, a question that comes up in many fields like causality and neuroscience. While quantities like correlation of X, Y only test for (univariate) linear independence, natural alternatives like mutual information of X, Y are hard to estimate due to a serious curse of dimensionality. A recent approach, avoiding both issues, estimates norms of an operator in Reproducing Kernel Hilbert Spaces (RKHSs). Our main contribution is strong empirical evidence that by employing shrunk operators when the sample size is small, one can attain an improvement in power at low false positive rates. We analyze the effects of Stein shrinkage on a popular test statistic called HSIC (Hilbert-Schmidt Independence Criterion). Our observations provide insights into two recently proposed shrinkage estimators, SCOSE and FCOSE - we prove that SCOSE is (essentially) the optimal linear shrinkage method for estimating the true operator; however, the non-linearly shrunk FCOSE usually achieves greater improvements in test power. This work is important for more powerful nonparametric detection of subtle nonlinear dependencies for small samples. Introduction The problem of nonparametric independence testing deals with ascertaining if two random variables are independent or not, making no parametric assumptions about their underlying distributions. Formally, given n samples (x i, y i ) for i {,..., n} where x i R p, y i R q, that are drawn from a joint distribution P XY supported on X Y R p+q, we want to decide between the null and alternate hypotheses H : P XY = P X P Y vs. H : P XY P X P Y where P X, P Y are the marginals of P XY w.r.t. X, Y. A test is a function from the data to {, }. Tests aim to have high power (probability of detecting dependence, when it exists) at a prespecified allowable type- error rate α (probability of detecting dependence when there isn t any). Independence testing is often a precursor to further analysis. Consider for instance conditional independence testing for inferring causality, say by the PC algorithm Spirtes et al. (), whose first step is (unconditional) independence testing. It is also useful for scientific discovery like in Both authors have equally contributed to this paper.

2 neuroscience, to see if a stimulus X (say an image) is independent of the brain activity Y (say fmri) in a relevant part of the brain. Since detecting nonlinear correlations is much easier than estimating a nonparametric regression function (of Y onto X), it can be done at smaller sample sizes, with further samples collected for estimation only if an effect is detected by the hypothesis test. For such situations, correlation only tests for univariate linear independence, while other statistics like mutual information that do characterize multivariate independence are hard to estimate from data, suffering from a serious curse of dimensionality. A recent popular approach for this problem (and a related two-sample testing problem) involve the use of quantities defined in reproducing kernel Hilbert spaces (RKHSs) - see Gretton et al. (6), Harchaoui et al. (7), Gretton, Herbrich, Smola, Bousquet & Schölkopf (5), Gretton, Bousquet, Smola & Schölkopf (5). This paper will concern itself with increasing the statistical power at small samples of a popular kernel statistic called HSIC, by using shrunk empirical estimators of the unknown population quantity (introduced below).. Hilbert Schmidt Independence Criterion Due to limited space, familiarity with RKHS terminology is assumed - see Scholkopf & Smola () for an introduction. Let k : X X R and l : Y Y R be two positive-definite reproducing kernels that correspond to RKHSs H k and H l respectively with inner-products, k and, l. Let k, l arise from (implicit) feature maps φ : X H k and ψ : Y H l. In other words, φ, ψ are not functions, but mappings to the Hilbert space. i.e. φ(x) H k, ψ(y) H l respectively. These functions, when evaluated at points in the original spaces, must satisfy φ(x)(x ) = φ(x), φ(x ) k = k(x, x ) and ψ(y)(y ) = ψ(y), ψ(y ) l = l(y, y ). The mean embedding of P X and P Y are defined as µ X := E x PX φ(x) H k and µ Y := E y PY ψ(y) H l whose empirical estimates are µ X := n n i= φ(x i) and µ Y := n n i= ψ(y i). Finally, the cross-covariance operator of X, Y is defined as Σ XY := E (x,y) PXY (φ(x) µ X ) (ψ(y) µ Y ) where is an outer-product. For unfamiliar readers, if we used the linear kernel k(x, x ) = x T x and l(y, y ) = y T y, then the cross-covariance operator is just the cross-covariance matrix. The plug-in empirical estimator of Σ XY is S XY := n n (φ(x i ) µ X ) (ψ(y i ) µ Y ) i= For conciseness, define φ(x i ) = φ(x i ) µ X, ψ(y i ) = ψ(y i ) µ Y, k(x, x ) = φ(x), φ(x ) k and l(y, y ) = ψ(y), ψ(y ) l. The test statistic Hilbert-Schmidt Independence Criterion (HSIC) defined in Gretton, Bousquet, Smola & Schölkopf (5) is the squared Hilbert-Schmidt norm of S XY, and can be calculated using centered kernel matrices K, L, where K ij = k(x i, x j ), L ij = l(y i, y j ), as HSIC := S XY HS = n tr( K L) () For unfamiliar readers, if we used the linear kernel, this just corresponds to the Frobenius norm of the cross-covariance matrix. The most important property is: when the kernels k, l are characteristic, then the corresponding population statistic Σ XY HS is zero iff X, Y are independent Gretton, Bousquet, Smola & Schölkopf (5). This gives rise to a natural test - calculate S XY HS and reject the null if it is large. ( ) x x γ and Laplace Examples of characteristic kernels include Gaussian k(x, x ) = exp ( k(x, x ) = exp x x γ ), for any bandwidth γ, while the aforementioned linear kernel is not

3 characteristic the corresponding HSIC tests only linear relationships, and a zero cross-covariance matrix characterizes independence only for multivariate Gaussian distributions. Working with the infinite dimensional operator with characteristic kernels, allows us to identify any general nonlinear dependence (in the limit) between any pair of distributions, not just Gaussians.. Independence Testing using HSIC A permutation-based test is described in Gretton, Bousquet, Smola & Schölkopf (5), and proceeds in the following manner. From the given data, calculate the test statistic T := S XY HS. Keeping the order of x,..., x n fixed, randomly permute y,..., y n a large number of times, and recompute the permuted HSIC each time. This destroyed any dependence between x, y simulating a draw from the product of marginals, making the empirical distribution of the permuted HSICs behave like the null distribution of the test statistic (distribution of HSIC when H is true). For a pre-specified type- error α, calculate threshold t α in the right tail of the null distribution. Reject H if T > t α. This test was proved to be consistent against any fixed alternative, meaning for any fixed type- error α, the power goes to as n. Empirically, the power can be calculated using simulations by repeating the above permutation test many times for a fixed P XY (for which dependence holds), and reporting the empirical probability of rejecting the null (detecting the dependence). Note that the power depends on P XY (unknown to the user of the test)..3 Shrunk Estimators of S XY Even though S XY is an unbiased estimator of Σ XY, it typically has high variance at low sample sizes. The idea of Stein shrinkage Stein (956) is to trade-off bias and variance, first introduced in the context of Gaussian mean estimation. This strategy of introducing some bias and decreasing the variance to get different estimators of Σ XY was followed by Muandet et al. (4) who define a linear shrinkage estimator of S XY called SCOSE (Simple Covariance Shrinkage Estimator) and a nonlinear shrinkage estimator called FCOSE (Flexible Covariance Shrinkage Estimator). When we refer to shrunk estimators, we implicitly mean SCOSE and FCOSE. We will describe these briefly in Section..4 Contributions Our first contribution is the following :. We provide evidence that employing shrunk estimators of Σ XY, instead of S XY, to calculate the aforementioned test statistic, can increase the power of the associated independence test at low false positive rates, when the sample size is small (there is higher variance in estimating infinitedimensional operators). Our second contribution is to analyze the effect of shrinkage on the test statistic, to provide some practical insight.. The effect of shrinkage on the test-statistic is very similar to soft-thresholding (see Section 4), shrinking very small statistics to zero, and shrinking other values nearly (but not) linearly, and nearly (but not) monotonically. Our last contribution is an insight on the two estimators considered in this paper, SCOSE and FCOSE. 3. We prove that SCOSE is (essentially, up to lower order terms) the optimal/oracle linear shrinkage estimator with respect to quadratic risk (see Section 5). However, we observe that FCOSE typically achieves higher power than SCOSE. This indicates that it may be useful to search for the optimal estimator in a larger class than linearly shrunk estimators, and also that quadratic loss may not be the right loss function for the purposes of test power. 3

4 The rest of this paper is organized as follows. Section introduces SCOSE, FCOSE and their corresponding shrunk test statistics. Section 3 presents illuminating experiments that bring out the statistically significant improvement in power over HSIC. Section 4 conducts a deeper investigation into the effect of shrinkage and proves the oracle optimality of SCOSE under quadratic risk. Shrunk Estimators and Test Statistics Let HS(H k, H l ) represent the set of Hilbert-Schmidt operators from H k to H l. We first note that S XY can be written as the solution to the following optimization problem. n S XY := min φ(x i ) Z HS(H k,h l ) n ψ(y i ) Z HS Using this idea Muandet et al. (4) suggest the following two shrunk/regularized estimators. From SCOSE to HSIC S This is derived in Muandet et al. (4) by solving n min φ(x i ) Z HS(H k,h l ) n ψ(y i ) Z + λ Z HS HS i= and the optimal solution (called SCOSE) is S S XY := i= ( λ ) S XY + λ where λ (and hence the shrinkage intensity) is estimated by leave-one-out cross-validation (LOOCV), in closed form as ( ) λ ρ S CV := + λ CV n K i= ii Lii n K ] n i,j= ij Lij [ n = (n ) n K n i,j= ij Lij + n K n i= ii Lii Observing the expression for λ CV in Muandet et al. (4), the denominator can be negative (for example, with the Gaussian kernel for small bandwidths, resulting in a kernel matrix close to the identity). This can cause λ CV to be negative, and ρ S to be (unintentionally) outside the range [, ]. Though not discussed in Muandet et al. (4), we shall follow the convention ( that when ) ρ S <, we shall use ρ S = and if ρ S >, we use ρ S =. Indeed, one can show that λ +λ S XY dominates λ +λ ( ) + S XY where (x) + = max{x, }. In Section 4, we prove that SXY S is (essentially) the optimal/oracle linear shrinkage estimator with respect to quadratic risk. We can now calculate the corresponding shrunk statistic HSIC S = SXY S HS = n n i= K ii Lii HSIC (n )HSIC + n K HSIC () n i= ii Lii While the above expression looks daunting, one thing to note is that the amount that HSIC is shrunk (i.e. the multiplicative factor) depends on the value of HSIC. As we shall see in section 4, small HSIC values get shrunk to zero, but as can be seen above, the shrinkage of HSIC is nonmonotonic. n + 4

5 From FCOSE to HSIC F The Flexible Covariance Shrinkage Estimator is derived by relying on the Representer theorem, see Scholkopf & Smola (), to instead minimize n n φ(x i ) ψ(y n β i i ) n φ(x i ) ψ(y i ) + λ β i= i= over all β R n, and the optimal solution (called FCOSE) is S F XY := n i= β λ i n φ(x i ) ψ(y i ) HS where β λ = ( K L + λi) K L where denotes elementwise (Hadamard) product, is the vector [,,..., ] T, and as before the best λ is determined by LOOCV. The procedure to evaluate the optimal λ efficiently is described by Muandet et al. (4) - a single eigenvalue decomposition of K L costing O(n 3 ) can be done, following which evaluating LOOCV is only O(n ) per λ, see Muandet et al. (4), section 3. for more details. As before, after picking the λ by LOOCV, we can derive the corresponding shrunk test statistic as HSIC F = S S XY HS = n tr(m(m + λi) M(M + λi) M) where M = K L. Note here that the shrinkage is not linear, and the effect on HSIC cannot be seen immediately. Similar to SCOSE, we shall see in section 4, small HSIC values get shrunk to zero (LOOCV chooses a large λ). 3 Linear Shrinkage and Quadratic Risk In this section, we prove that SCOSE is (essentially) optimal within a particular class of estimators. Such oracle arguments also exist elsewhere in the literature, like Ledoit & Wolf (4), so we provide only a brief proof outline. Proposition. The oracle (with respect to quadratic risk) linear shrinkage estimator and intensity is defined as S, ρ := and is given by S := ( ρ )S XY where arg min Z Σ XY HS Z HS,Z=( ρ)s XY, ρ ρ := E S XY Σ XY HS E S XY Proof. Define α = Σ XY HS, β = E S XY Σ XY HS, δ = E S XY. Since E[S XY ] = Σ XY, it is easy to verify that α + β = δ. Substituting and expanding the objective, we get: E Z Σ XY HS = E ρs XY + (S XY Σ XY ) HS = ρ δ + β ρ(δ α ) = ρ α + ( ρ) β Differentiating and equating to zero, gives ρ = β δ. 5

6 This ρ appears in terms of quantities that depend on the unknown underlying distribution (hence the term oracle estimator). We use plugin estimates b, d for β, δ. Let d = S XY HS = n K n i,j= ij Lij = HSIC. Since β is the variance of S XY, let b be the sample variance of S XY, i.e. b = n n n k= φ(x i ) ψ xi S XY = n into S and simplifying, we see that HSIC := S HS is ( HSIC n n i= = K ii Lii HSIC nhsic [ n n i= K ii Lii n n i,j= K ij Lij ]. Plugging these ) HSIC (3) Comparing Eq.(3) with Eq.() shows that SCOSE is essentially S, up to a factor in the denominator which is of the same order as the bias of the HSIC empirical estimator (see Theorem in Gretton, Bousquet, Smola & Schölkopf (5)). In other words, SCOSE just corresponds to using a slightly different estimator for δ than the simple plugin d, which varies on the same order as the bias δ Ed. Hence SCOSE, as estimated via regularization and LOOCV, is (essentially) the optimal linear shrinkage estimator under quadratic risk. To the best of our knowledge, this is the first such characterization of optimality of an estimator achieved through leave-one-out cross-validation. We are only able to prove this because one can explicitly calculate both the oracle linear shrinkage intensity ρ as well as the optimal λ CV (as mentioned in Section ). This raises a natural open question can we find other situations where the LOOCV estimator is optimal with respect to some risk measure? (perhaps when explicit calculations are not possible, like ridge regression). 4 Experiments In this section, we run three kinds of experiments: a) to verify that SCOSE has better quadratic risk than FCOSE and original sample estimator, b) detailed synthetic experiments to verify that shrinkage does improve power, across interesting regimes of α = {.,.5,.}, and c) real data obtained from MNIST, to show that we shrinkage detect dependence at much lower samples than the original data size. 4. Quadratic Risk Figure shows that SCOSE is indeed much better than both S XY and FCOSE with respect to quadratic risk. Here, we calculate E Z Σ XY HS for the distribution given in dataset (A) for Z {S XY, SXY S, SF XY }. The expectation is calculated by repeating the experiment times. Each time Z is calculated according to N {, 5, } samples and Σ XY is approximated by the empirical cross-covariance matrix on 5, samples. The four panels use four different kernels which are linear, polynomial, Laplace and Gaussian from top to bottom. The shrunk estimators are always better than the unshrunk, with a larger difference between SCOSE and FCOSE for finitedimensional feature spaces (top two). In infinite-dimensional feature spaces (bottom two), SCOSE and FCOSE are much better than the unshrunk estimator but very similar to each other. The differences between all estimators decreases with increasing n, since the sample cross-covariance operator itself becomes very accurate. 4. Synthetic Data We perform synthetic experiments in a wide variety of settings to demonstrate that the shrunk test statistics achieve higher power than HSIC in a variety of settings. We follow the schema provided in HSIC and HSIC HSIC/n C/n both converge to population HSIC at same rate determined by the dominant term (HSIC). 6

7 quadratic risk quadratic risk.5. 5 N.. 5 N quadratic risk quadratic risk.. 5 N.. 5 N Figure : All panels show quadratic risk E X Σ XY HS for X {S XY, SXY S, SF XY }. Dataset (A) was used in all four panels, but the kernels were varied - from top to bottom is the linear, quadratic, Gaussian and Laplace kernel. the introduction for independence testing and calculating power. We only consider difficult distributions with nonlinear dependence between X, Y, on which linear methods like correlation are shown to fail to detect dependence (some of them were used in previous papers on independence testing like Gretton et al. (7) and Chwialkowski & Gretton (4)). For all experiments, α {.,.5,.} is chosen as the type- error (for choosing the threshold level of the null distribution s right tail). For every setting of parameters of each experiment, power is calculated as the percentage of rejection over repetitions (independent trials), with permutations per repetition (permutation testing to find the null distribution threshold at level α). We use the Gaussian kernel where the bandwidth is chosen by the common median heuristic Scholkopf & Smola (). Table is a representative sample from what we saw on other examples - either large, small or no improvement in power was seen but almost never a worsening of power. The improvements in power may not always be huge, but they are statistically significant - it is difficult to detect such non-linear dependencies at low sample sizes, so any increase in power can be important in scientific applications. Remark. A more appropriate way than using error bars to assess significance is by the Wilcoxon rank sum test, omitted for lack of space, though it yields more favorable results. 4.3 Real Data We use two real datasets - the first is a good example where shrinkage helps a lot, but in the second it does not help (we show it on purpose). Like the synthetic datasets, for most real datasets it either helps or does not hurt (being very rarely worse; see remark in the discussion section). The first is the Eckerle dataset Eckerle (979) from the NIST Statistical Reference Datasets (NIST StRD) for Nonlinear Regression, data from a NIST study of circular interference transmittance (n=35, Y is transmittance, X is wavelength). A plot of the data in Figure reveals a nonlinear relationship between X, Y (though the correlation is.35 with p-value.84). We subsample the data to see how often we can detect a relationship at %, %, 3% of the original data size, when the false positive level is always controlled at.5. The second is the Aircraft dataset Bowman & Azzalini (4) (n=79, X is log(speed), Y is log(span)). Once again, correlation is low, with a p-value of over.8, and we subsample the data to 5%, %, % of the original data size. 7

8 α =. α =.5 α =. HSIC HSIC S HSIC F HSIC HSIC S HSIC F HSIC HSIC S HSIC F ±.. ±.. ±.. ±.. ±.. ± Table : The first column shows scatterplots of X vs Y (all having dependence between X, Y ). There are 3 sets of 5 columns each - for α =.,.5,. (controlled by running permutations). In eachs set, the first three columns show the power of HSIC, HSIC S, HSIC F (with standard deviation over repetitions below). The fourth column shows when HSIC S is significantly better than HSIC, and the fifth column when HSIC F has significantly higher power than HSIC. A blank means the powers are not significantly better or worse. In the first dataset (A) (top 4) we show how the power varies with increasing n (becomes easier). In the second dataset (B) (second 4) we show how the power varies with rotation (goes from near-independence to clear dependence). In the third dataset (C) (third 4), we demonstrate a case where shrinkage does not help much, which is a circle with a hole. In the last dataset (D) (last 4), we demonstrate a case where HSIC S does as well as HSIC F. We tried many more datasets, these are a few representative samples. 8

9 transmitance log(span) wavelength log(speed) HSIC. HSIC S HSIC F % % 3% HSIC. HSIC S HSIC F 5% % % Figure : Top Row: The left figure shows a plot of wavelength against transmittance. The right figure shows the power of HSIC, HSIC S, HSIC F when the data are subsampled to %, %, 3% (error bars over repetitions). Bottom Row: The left figure shows a plot of log(wingspan) vs log(airspeed). The right figure shows the power of HSIC, HSIC S, HSIC F when the data are subsampled to 5%, %, % (error bars over repetitions). 5 Discussion Why might shrinkage improve power? Let us examine the net effect of using shrunk estimators on the value of HSIC, i.e. let us compare HSIC S and HSIC F to HSIC by computing these over all the repetitions of the permutation testing procedure described in the introduction. In Fig. 3, both estimators are visually similar in transforming the actual test statistic. Perhaps the more interesting phenomenon is that Fig. 3 is reminiscent of the graph of a soft-thresholding operator ST t (x) = max{, x t}. Intuitively, if the unshrunk HSIC value is small, the shrinkage methods deem it to be noise and it is shrunk to zero. Looking at the X-axis scaling of the top and bottom row, the size of the region that gets shrunk to zero decreases with n - as expected, shrinkage has less effect when S XY has low variance). The shrinkage being non-monotone (more so for n = than n = 5 in Figure 3) is key to achieving an improvement in power. Using the intuition from the above figure, we can finally piece together why shrinkage may yield benefits. A rejection of H occurs when the test statistic stands out in the right tail of its null distribution. Typically, when the alternative is true (this is when rejecting the null improves power) the unshrunk test statistics calculated from the permuted samples is smaller than the unshrunk HSIC calculated on the original sample. However, the effect of shrinking the small statistics towards zero, and setting the smallest ones to zero, is that the unpermuted test statistic under the alternative 9

10 HSIC S.5 HSIC F.5.5 HSIC.5 HSIC HSIC S. HSIC F.. HSIC. HSIC Figure 3: The top row corresponds to n =, and the bottom row has n = 5. The left plots compare HSIC S to HSIC, and the right plots compare HSIC F to HSIC. Each cross mark corresponds to the shrunk and unshrunk HSIC calculated during a single permutation of a permutation test. distribution stands out more in the right tail of the null. In other words, relative to the unshrunk null distribution and the unshrunk test statistic, the tail of the null distribution is shrunk more towards zero than the unpermuted test statistic, causing the latter to have a higher quantile in the right tail of the former (relative to the quantile before shrinkage). Let us verify this experimentally. In Fig.4 we plot for each of the datasets in Table, the average ratio of unpermuted statistic T to the 95th percentile of the permuted statistics, for T {HSIC, HSIC S, HSIC F }. Recall that for dataset (C), we didn t see much of an improvement in power, but for (A),(B),(D) it is clear from Fig. 4 that the unpermuted statistic is shrunk less than its null distribution s 95th quantile. Remark. In our experiments, real and synthetic, shrinkage usually improves (and almost never worsens) power in false-positive regimes that we usually care about. Will shrinkage always improve power? Possibly not. Even though shrunk the shrunk S XY dominates S XY for estimation error, it may not be the case that shrunk HSIC always dominates unshrunk HSIC for test power (i.e. the latter may not be inadmissible). However, just as no single classifier always outperforms another, it is still beneficial to add techniques like shrinkage, that seem to consistently yield benefits in practice, to the practitioner s array of tools. 6 Conclusion We presented evidence for an important phenomenon - using biased but lower variance shrunk estimators of cross-covariance operators can often significantly improve test power of HSIC at small sample sizes. This observation (that shrinkage can improve power) has rarely been made in the statistics and machine learning testing literature. We think the reason is that most test statistics for independence testing cannot be immediately expressed as the norm of an empirical operator, making it less obvious how to apply shrinkage to improve their power at low sample sizes. We also showed the optimality (among linear shrinkage estimators) of SCOSE, but observe that the nonlinear shrinkage of FCOSE usually yields higher power. To the best of our knowledge, there seems to be no current literature showing that the choice made by leave-one-out cross-validation (SCOSE) explicitly leads to an estimator that is optimal in some sense (among linear shrinkage estimators). This may be because it is often not possible to explicitly calculate the form of the

11 unpermuted HSIC 95%HSIC permuted unpermuted HSIC 95%HSIC permuted estimator estimator unpermuted HSIC 95%HSIC permuted 3 unpermuted HSIC 95%HSIC permuted estimator estimator Figure 4: All panels show the ratio of the unpermuted HSIC to the 95th percentile of the null distribution based on HSICs calculated from the permuted data. (see Table ) The top row has datasets (C) with radius., (B) with angle 3 π/3, and the bottom row has (D) with N = 5, (A) with N = 4. These observations were qualitatively the same in all other synthetic data parameter settings, and also for other percentiles than 95th, and since the figures look identical in spirit, they were omitted due to lack of space. LOOCV estimator, nor the explicit form of the best linear shrinkage estimator, as can both be done in this simple setting. Since even the best possible linear shrinkage estimator (as represented by SCOSE) is usually worse than FCOSE, this result indicates that in order to improve upon FCOSE, it will be necessary to further study the class of non-linear shrinkage estimators for our infinite dimensional operators, as done for finite dimensional covariance matrices in Ledoit & Wolf () and other papers by the same authors. We ended with a brief investigation into the effect of shrinkage on HSIC and why shrinkage may intuitively improve power. We think that our work will be important for more powerful nonparametric detection of subtle nonlinear dependencies at low sample sizes, a common problem in scientific applications. Acknowledgments We would like to thank Arthur Gretton and Jessica Chemali for useful feedback on an earlier draft of the paper, Larry Wasserman for pointing out some useful references, and Krikamol Muandet for sharing his code. AR was supported in part by ONR MURI grant N495. References Bowman, A. W. & Azzalini, A. (4), R package sm: nonparametric smoothing methods (version.-5.4), University of Glasgow, UK and Università di Padova, Italia. Chwialkowski, K. & Gretton, A. (4), A kernel independence test for random processes, in Proceedings of The 3st International Conference on Machine Learning, pp

12 Eckerle, K. (979), Circular interference transmittance study, National Institute of Standards and Technology (NIST), US Department of Commerce, USA. Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B. & Smola, A. J. (6), A kernel method for the two-sample-problem, in Neural Information Processing Systems, pp Gretton, A., Bousquet, O., Smola, A. & Schölkopf, B. (5), Measuring statistical dependence with Hilbert-Schmidt norms, in Algorithmic Learning Theory, Springer, pp Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B. & Smola, A. J. (7), A kernel statistical test of independence, Neural Information Processing Systems. Gretton, A., Herbrich, R., Smola, A., Bousquet, O. & Schölkopf, B. (5), Kernel methods for measuring independence, The Journal of Machine Learning Research 6, Harchaoui, Z., Bach, F. & Moulines, E. (7), Testing for homogeneity with kernel fisher discriminant analysis, Arxiv preprint Ledoit, O. & Wolf, M. (4), A well-conditioned estimator for large-dimensional covariance matrices, Journal of Multivariate Analysis 88(), Ledoit, O. & Wolf, M. (), Nonlinear shrinkage estimation of large-dimensional covariance matrices, Institute for Empirical Research in Economics University of Zurich Working Paper (55). Muandet, K., Fukumizu, K., Sriperumbudur, B., Gretton, A. & Schoelkopf, B. (4), Kernel mean estimation and stein effect, in Proceedings of The 3st International Conference on Machine Learning, pp. 8. Scholkopf, B. & Smola, A. (), Learning with kernels, MIT press Cambridge. Spirtes, P., Glymour, C. N. & Scheines, R. (), Causation, prediction, and search, Vol. 8, MIT press. Stein, C. (956), Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proceedings of the Third Berkeley symposium on mathematical statistics and probability (399), 97 6.

Hilbert Space Representations of Probability Distributions

Hilbert Space Representations of Probability Distributions Hilbert Space Representations of Probability Distributions Arthur Gretton joint work with Karsten Borgwardt, Kenji Fukumizu, Malte Rasch, Bernhard Schölkopf, Alex Smola, Le Song, Choon Hui Teo Max Planck

More information

Hilbert Schmidt Independence Criterion

Hilbert Schmidt Independence Criterion Hilbert Schmidt Independence Criterion Thanks to Arthur Gretton, Le Song, Bernhard Schölkopf, Olivier Bousquet Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au

More information

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Makoto Yamada and Masashi Sugiyama Department of Computer Science, Tokyo Institute of Technology

More information

Kernel Measures of Conditional Dependence

Kernel Measures of Conditional Dependence Kernel Measures of Conditional Dependence Kenji Fukumizu Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku Tokyo 6-8569 Japan fukumizu@ism.ac.jp Arthur Gretton Max-Planck Institute for

More information

Maximum Mean Discrepancy

Maximum Mean Discrepancy Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia

More information

On a Nonparametric Notion of Residual and its Applications

On a Nonparametric Notion of Residual and its Applications On a Nonparametric Notion of Residual and its Applications Bodhisattva Sen and Gábor Székely arxiv:1409.3886v1 [stat.me] 12 Sep 2014 Columbia University and National Science Foundation September 16, 2014

More information

An Adaptive Test of Independence with Analytic Kernel Embeddings

An Adaptive Test of Independence with Analytic Kernel Embeddings An Adaptive Test of Independence with Analytic Kernel Embeddings Wittawat Jitkrittum Gatsby Unit, University College London wittawat@gatsby.ucl.ac.uk Probabilistic Graphical Model Workshop 2017 Institute

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

Minimax Estimation of Kernel Mean Embeddings

Minimax Estimation of Kernel Mean Embeddings Minimax Estimation of Kernel Mean Embeddings Bharath K. Sriperumbudur Department of Statistics Pennsylvania State University Gatsby Computational Neuroscience Unit May 4, 2016 Collaborators Dr. Ilya Tolstikhin

More information

A Linear-Time Kernel Goodness-of-Fit Test

A Linear-Time Kernel Goodness-of-Fit Test A Linear-Time Kernel Goodness-of-Fit Test Wittawat Jitkrittum 1; Wenkai Xu 1 Zoltán Szabó 2 NIPS 2017 Best paper! Kenji Fukumizu 3 Arthur Gretton 1 wittawatj@gmail.com 1 Gatsby Unit, University College

More information

Understanding the relationship between Functional and Structural Connectivity of Brain Networks

Understanding the relationship between Functional and Structural Connectivity of Brain Networks Understanding the relationship between Functional and Structural Connectivity of Brain Networks Sashank J. Reddi Machine Learning Department, Carnegie Mellon University SJAKKAMR@CS.CMU.EDU Abstract Background.

More information

An Adaptive Test of Independence with Analytic Kernel Embeddings

An Adaptive Test of Independence with Analytic Kernel Embeddings An Adaptive Test of Independence with Analytic Kernel Embeddings Wittawat Jitkrittum 1 Zoltán Szabó 2 Arthur Gretton 1 1 Gatsby Unit, University College London 2 CMAP, École Polytechnique ICML 2017, Sydney

More information

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Kun Zhang Dept of Computer Science and HIIT University of Helsinki 14 Helsinki, Finland kun.zhang@cs.helsinki.fi Aapo Hyvärinen

More information

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models JMLR Workshop and Conference Proceedings 6:17 164 NIPS 28 workshop on causality Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Kun Zhang Dept of Computer Science and HIIT University

More information

Probabilistic Regression Using Basis Function Models

Probabilistic Regression Using Basis Function Models Probabilistic Regression Using Basis Function Models Gregory Z. Grudic Department of Computer Science University of Colorado, Boulder grudic@cs.colorado.edu Abstract Our goal is to accurately estimate

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Unsupervised Kernel Dimension Reduction Supplemental Material

Unsupervised Kernel Dimension Reduction Supplemental Material Unsupervised Kernel Dimension Reduction Supplemental Material Meihong Wang Dept. of Computer Science U. of Southern California Los Angeles, CA meihongw@usc.edu Fei Sha Dept. of Computer Science U. of Southern

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Package dhsic. R topics documented: July 27, 2017

Package dhsic. R topics documented: July 27, 2017 Package dhsic July 27, 2017 Type Package Title Independence Testing via Hilbert Schmidt Independence Criterion Version 2.0 Date 2017-07-24 Author Niklas Pfister and Jonas Peters Maintainer Niklas Pfister

More information

22 : Hilbert Space Embeddings of Distributions

22 : Hilbert Space Embeddings of Distributions 10-708: Probabilistic Graphical Models 10-708, Spring 2014 22 : Hilbert Space Embeddings of Distributions Lecturer: Eric P. Xing Scribes: Sujay Kumar Jauhar and Zhiguang Huo 1 Introduction and Motivation

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Least-Squares Independence Test

Least-Squares Independence Test IEICE Transactions on Information and Sstems, vol.e94-d, no.6, pp.333-336,. Least-Squares Independence Test Masashi Sugiama Toko Institute of Technolog, Japan PRESTO, Japan Science and Technolog Agenc,

More information

Learning Interpretable Features to Compare Distributions

Learning Interpretable Features to Compare Distributions Learning Interpretable Features to Compare Distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London Theory of Big Data, 2017 1/41 Goal of this talk Given: Two collections

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

Kernel methods for Bayesian inference

Kernel methods for Bayesian inference Kernel methods for Bayesian inference Arthur Gretton Gatsby Computational Neuroscience Unit Lancaster, Nov. 2014 Motivating Example: Bayesian inference without a model 3600 downsampled frames of 20 20

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

Hilbert Space Embedding of Probability Measures

Hilbert Space Embedding of Probability Measures Lecture 2 Hilbert Space Embedding of Probability Measures Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 2017 Recap of Lecture

More information

Causal Inference on Discrete Data via Estimating Distance Correlations

Causal Inference on Discrete Data via Estimating Distance Correlations ARTICLE CommunicatedbyAapoHyvärinen Causal Inference on Discrete Data via Estimating Distance Correlations Furui Liu frliu@cse.cuhk.edu.hk Laiwan Chan lwchan@cse.euhk.edu.hk Department of Computer Science

More information

Big Hypothesis Testing with Kernel Embeddings

Big Hypothesis Testing with Kernel Embeddings Big Hypothesis Testing with Kernel Embeddings Dino Sejdinovic Department of Statistics University of Oxford 9 January 2015 UCL Workshop on the Theory of Big Data D. Sejdinovic (Statistics, Oxford) Big

More information

Unsupervised Nonparametric Anomaly Detection: A Kernel Method

Unsupervised Nonparametric Anomaly Detection: A Kernel Method Fifty-second Annual Allerton Conference Allerton House, UIUC, Illinois, USA October - 3, 24 Unsupervised Nonparametric Anomaly Detection: A Kernel Method Shaofeng Zou Yingbin Liang H. Vincent Poor 2 Xinghua

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

arxiv: v1 [cs.lg] 22 Jun 2009

arxiv: v1 [cs.lg] 22 Jun 2009 Bayesian two-sample tests arxiv:0906.4032v1 [cs.lg] 22 Jun 2009 Karsten M. Borgwardt 1 and Zoubin Ghahramani 2 1 Max-Planck-Institutes Tübingen, 2 University of Cambridge June 22, 2009 Abstract In this

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Fast Two Sample Tests using Smooth Random Features

Fast Two Sample Tests using Smooth Random Features Kacper Chwialkowski University College London, Computer Science Department Aaditya Ramdas Carnegie Mellon University, Machine Learning and Statistics School of Computer Science Dino Sejdinovic University

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Approximate Kernel PCA with Random Features

Approximate Kernel PCA with Random Features Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

Learning features to compare distributions

Learning features to compare distributions Learning features to compare distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS 2016 Workshop on Adversarial Learning, Barcelona Spain 1/28 Goal of this

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

10-701/ Recitation : Kernels

10-701/ Recitation : Kernels 10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance Marina Meilă University of Washington Department of Statistics Box 354322 Seattle, WA

More information

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

How Good is a Kernel When Used as a Similarity Measure?

How Good is a Kernel When Used as a Similarity Measure? How Good is a Kernel When Used as a Similarity Measure? Nathan Srebro Toyota Technological Institute-Chicago IL, USA IBM Haifa Research Lab, ISRAEL nati@uchicago.edu Abstract. Recently, Balcan and Blum

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Analysis of Spectral Kernel Design based Semi-supervised Learning

Analysis of Spectral Kernel Design based Semi-supervised Learning Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

CS 7140: Advanced Machine Learning

CS 7140: Advanced Machine Learning Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Introduction to Machine Learning Fall 2017 Note 5. 1 Overview. 2 Metric

Introduction to Machine Learning Fall 2017 Note 5. 1 Overview. 2 Metric CS 189 Introduction to Machine Learning Fall 2017 Note 5 1 Overview Recall from our previous note that for a fixed input x, our measurement Y is a noisy measurement of the true underlying response f x):

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Kernel methods for comparing distributions, measuring dependence

Kernel methods for comparing distributions, measuring dependence Kernel methods for comparing distributions, measuring dependence Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Principal component analysis Given a set of M centered observations

More information

SVMs: nonlinearity through kernels

SVMs: nonlinearity through kernels Non-separable data e-8. Support Vector Machines 8.. The Optimal Hyperplane Consider the following two datasets: SVMs: nonlinearity through kernels ER Chapter 3.4, e-8 (a) Few noisy data. (b) Nonlinearly

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Forefront of the Two Sample Problem

Forefront of the Two Sample Problem Forefront of the Two Sample Problem From classical to state-of-the-art methods Yuchi Matsuoka What is the Two Sample Problem? X ",, X % ~ P i. i. d Y ",, Y - ~ Q i. i. d Two Sample Problem P = Q? Example:

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Statistical Convergence of Kernel CCA

Statistical Convergence of Kernel CCA Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators. Michel Besserve MPI for Intelligent Systems, Tübingen michel.besserve@tuebingen.mpg.de Nikos K. Logothetis MPI

More information

Classification Methods II: Linear and Quadratic Discrimminant Analysis

Classification Methods II: Linear and Quadratic Discrimminant Analysis Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information