Hilbert Schmidt Independence Criterion
Hilbert Schmidt Independence Criterion
Thanks to Arthur Gretton, Le Song, Bernhard Schölkopf, Olivier Bousquet
Alexander J. Smola
Statistical Machine Learning Program, Canberra, ACT 0200, Australia
ICONIP 2006, Hong Kong, October 3
Alexander J. Smola: Hilbert Schmidt Independence Criterion 1 / 30
Outline
1. Measuring Independence: Covariance Operator; Hilbert Space Methods; A Test Statistic and its Analysis
2. Independent Component Analysis: ICA Primer; Examples
3. Feature Selection: Problem Setting; Algorithm; Results
Measuring Independence

Problem: given a sample {(x_1, y_1), ..., (x_m, y_m)} drawn from Pr(x, y), determine whether Pr(x, y) = Pr(x) Pr(y), and measure the degree of dependence.

Applications: independent component analysis; dimensionality reduction and feature extraction; statistical modeling.

Indirect approach: perform a density estimate of Pr(x, y) and check whether the estimate approximately factorizes.

Direct approach: check properties of factorizing distributions, e.g. kurtosis, covariance operators, etc.
Covariance Operator

Linear case: for linear functions f(x) = w^T x and g(y) = v^T y, the covariance is given by
  Cov{f(x), g(y)} = w^T C v,
where C is the covariance matrix. This is a bilinear form on the space of linear functions.

Nonlinear case: define C to be the operator mapping (f, g) to Cov{f, g}.

Theorem: C is a bilinear operator in f and g.

Proof: we only show linearity in f: Cov{αf, g} = α Cov{f, g}, and for f + f' the covariance is additive.
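The linear case admits a quick numerical check (my addition, not from the slides): the empirical covariance of f(x) = w^T x and g(y) = v^T y equals the bilinear form w^T C v, with C the empirical cross-covariance matrix of x and y.

```python
import numpy as np

rng = np.random.RandomState(0)
m = 10_000
x = rng.randn(m, 3)
y = 0.5 * x[:, :2] + 0.1 * rng.randn(m, 2)   # y depends on x

w = np.array([1.0, -2.0, 0.5])
v = np.array([0.3, 1.0])

# Cov{f(x), g(y)} computed directly on the function values ...
f, g = x @ w, y @ v
cov_fg = np.mean(f * g) - np.mean(f) * np.mean(g)

# ... and via the bilinear form w^T C v with the cross-covariance matrix C
C = (x - x.mean(0)).T @ (y - y.mean(0)) / m
assert abs(cov_fg - w @ C @ v) < 1e-8   # exact identity, up to rounding
```

The agreement is exact (not just asymptotic), since both sides compute the same sum.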
Independent random variables (figure)
Dependent random variables (figure)
Or are we just unlucky? (figure)
Covariance Operators

Criterion (Rényi, 1957): test for independence by checking whether C = 0.

Reproducing kernel Hilbert space: kernels k, l on X, Y with associated RKHSs F, G; assume k, l are bounded on their domains.

Mean operators: ⟨μ_x, f⟩ = E_x[f(x)] and ⟨μ_y, g⟩ = E_y[g(y)].

Covariance operator: define the covariance operator C_xy via the bilinear form
  ⟨f, C_xy g⟩ = Cov{f, g} = E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)].
Hilbert Space Representation

Theorem: provided that k, l are universal kernels, C_xy = 0 if and only if x and y are independent.

Proof.
Step 1: if x, y are dependent, then there exist [0, 1]-bounded functions f*, g* with Cov{f*, g*} = ε > 0.
Step 2: since k, l are universal, there exist ε-approximations of f*, g* in F, G such that the covariance of the approximations does not vanish.
Step 3: hence the covariance operator C_xy is nonzero.
A Test Statistic

Covariance operator:
  C_xy = E_{x,y}[k(x, ·) ⊗ l(y, ·)] − E_x[k(x, ·)] ⊗ E_y[l(y, ·)]

Operator norm: use the norm of C_xy to test whether x and y are independent; it also gives us a measure of dependence:
  HSIC(Pr_{xy}, F, G) := ||C_xy||_HS^2,
where ||·||_HS denotes the Hilbert-Schmidt norm.

Frobenius norm: for matrices we can define ||M||^2 = Σ_{ij} M_{ij}^2 = tr M^T M. The Hilbert-Schmidt norm is the generalization of the Frobenius norm to operators.
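The Frobenius-norm identity is easy to confirm numerically (a small check of my own, using NumPy); it also equals the sum of squared singular values, which is the form that carries over to Hilbert-Schmidt operators.

```python
import numpy as np

rng = np.random.RandomState(1)
M = rng.randn(4, 6)

frob_sq = np.sum(M**2)                          # sum_ij M_ij^2
assert np.isclose(frob_sq, np.trace(M.T @ M))   # = tr M^T M
# equivalently, the sum of squared singular values of M
assert np.isclose(frob_sq, np.sum(np.linalg.svd(M, compute_uv=False)**2))
```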
Computing ||C_xy||^2

Rank-one operators: for rank-one terms we have
  ||f ⊗ g||^2 = ⟨f ⊗ g, f ⊗ g⟩_HS = ||f||^2 ||g||^2.

Joint expectation: by construction of C_xy we exploit linearity and obtain (with (x', y') an independent copy of (x, y))
  ||C_xy||^2 = ⟨C_xy, C_xy⟩_HS
    = {E_{x,y} E_{x',y'} − 2 E_{x,y} E_{x'} E_{y'} + E_x E_y E_{x'} E_{y'}} [⟨k(x, ·) ⊗ l(y, ·), k(x', ·) ⊗ l(y', ·)⟩_HS]
    = {E_{x,y} E_{x',y'} − 2 E_{x,y} E_{x'} E_{y'} + E_x E_y E_{x'} E_{y'}} [k(x, x') l(y, y')].

This is well defined if k, l are bounded.
Estimating ||C_xy||^2

Empirical criterion:
  HSIC(Z, F, G) := (m − 1)^{-2} tr KHLH,
where K_ij = k(x_i, x_j), L_ij = l(y_i, y_j), and H_ij = δ_ij − m^{-1} (the centering matrix).

Theorem: E_Z[HSIC(Z, F, G)] = HSIC(Pr_{xy}, F, G) + O(1/m).

Proof (sketch only): expand tr KHLH into terms over pairs, triples, and quadruples of non-repeated indices, which yield the proper expectations, and bound the remainder by O(m^{-1}).
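A minimal sketch of the empirical criterion (my illustration; the Gaussian kernels and the fixed bandwidth sigma = 1 are assumptions, not prescribed by the slides). Dependent data should produce a markedly larger statistic than independent data:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: tr(KHLH) / (m - 1)^2, with H = I - (1/m) 1 1^T
    m = x.shape[0]
    K, L = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.RandomState(0)
m = 200
x = rng.randn(m, 1)
h_dep = hsic(x, x + 0.1 * rng.randn(m, 1))   # y strongly dependent on x
h_ind = hsic(x, rng.randn(m, 1))             # y independent of x
# h_dep should be much larger than h_ind, which decays like O(1/m)
```

Note that tr KHLH = tr (HKH)(HLH) is a trace of a product of two positive semidefinite matrices, so the statistic is always nonnegative.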
Uniform Convergence Bounds for ||C_xy||^2

Theorem (recall Hoeffding's theorem for U-statistics): for averages over functions of r variables,
  u := (1/(m)_r) Σ_{i ∈ I_r^m} g(x_{i_1}, ..., x_{i_r}),
where (m)_r = m(m − 1)···(m − r + 1) and the sum runs over all r-tuples of distinct indices, and where a ≤ g ≤ b, we have
  Pr{u − E[u] ≥ t} ≤ exp(−2 t^2 ⌊m/r⌋ / (b − a)^2).

In our statistic we have terms of 2, 3, and 4 random variables.
Uniform Convergence Bounds for ||C_xy||^2 (continued)

Corollary: assume that k, l ≤ 1. Then with probability at least 1 − δ,
  |HSIC(Z, F, G) − HSIC(Pr_{xy}, F, G)| ≤ sqrt(log(6/δ) / (0.24 m)) + C/m.

Proof: bound each of the three terms separately via Hoeffding's theorem.
Blind Source Separation

Data: w = Ms, where all sources s_i are mutually independent (the cocktail party problem).

Task: recover the sources s and the mixing matrix M given the observations w.
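The preprocessing used for this problem (centering and whitening, described next) can be sketched as follows; the two sources and the 2×2 mixing matrix here are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
m = 5000
# two independent, non-Gaussian sources (uniform and Laplace)
s = np.column_stack([rng.uniform(-1, 1, m), rng.laplace(size=m)])
M = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix
w = s @ M.T                              # observed mixtures, w = M s

# center and whiten: afterwards cov(w_white) = I, so only an
# orthogonal rotation U remains to be found by minimizing dependence
w0 = w - w.mean(axis=0)
cov = w0.T @ w0 / m
evals, evecs = np.linalg.eigh(cov)
w_white = w0 @ evecs @ np.diag(evals ** -0.5) @ evecs.T
```

After this step the remaining search is over orthogonal matrices only, which is exactly the Stiefel-manifold optimization mentioned on the next slide.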
Independent Component Analysis

Whitening: rotate, center, and whiten the data before separation; this is always possible.

Optimization: we cannot recover the scale of the data anyway, so it suffices to find an orthogonal matrix U such that Uw = s yields independent random variables. This is optimization on the Stiefel manifold, which can be done by a Newton method.

Important trick: the kernel matrix can be huge, so use a reduced-rank representation. With K = AA^T and L = BB^T we get tr H(AA^T)H(BB^T) = ||A^T H B||^2 instead of tr HKHL.
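The reduced-rank trace identity can be verified numerically (my check; A and B are arbitrary low-rank factors standing in for, e.g., an incomplete Cholesky factorization of K and L):

```python
import numpy as np

rng = np.random.RandomState(0)
m, r = 500, 10
A = rng.randn(m, r)                    # K = A A^T, rank r << m
B = rng.randn(m, r)                    # L = B B^T
H = np.eye(m) - np.ones((m, m)) / m    # centering matrix

full    = np.trace(H @ (A @ A.T) @ H @ (B @ B.T))   # forms m x m products
lowrank = np.linalg.norm(A.T @ H @ B, 'fro') ** 2   # only an r x r result
assert np.isclose(full, lowrank)
```

In practice H is never formed explicitly either: A^T H B is obtained by centering the columns of A and B, so the cost drops from O(m^2) to O(m r^2).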
ICA Experiments (figure)
Outlier Robustness (figure)
Automatic Regularization (figure)
Mini Summary

Method: linear mixture of independent sources; remove the mean and whiten for preprocessing; use HSIC as the measure of dependence; find the best rotation to demix the data.

Performance: HSIC is very robust to outliers. The best-performing algorithm (RADICAL) is designed specifically for linear ICA, whereas HSIC is a general-purpose criterion. The low-rank decomposition makes the optimization feasible.
Feature Selection

The problem: a large number of features; select a small subset of them.

Basic idea: find features such that the distributions p(x | y = 1) and p(x | y = −1) are as different as possible, using a two-sample test.

Important tweak: we can find a similar criterion measuring the dependence between data and labels, by computing the Hilbert-Schmidt norm of the covariance operator.
Recursive Feature Elimination

Algorithm:
1. Start with the full set of features.
2. Adjust the kernel width to pick up the maximum discrepancy.
3. Find the feature whose removal decreases the discrepancy the least.
4. Remove this feature.
5. Repeat.

Applications: binary classification (standard MMD criterion); multiclass problems; regression.
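The elimination loop can be sketched as below. This is a simplified illustration, not the talk's exact procedure: it uses a Gaussian kernel with the median-distance heuristic in place of the adjusted kernel width from step 2, a linear kernel on binary labels, and a toy dataset in which only feature 0 is informative.

```python
import numpy as np

def gram(x):
    # Gaussian kernel with the median-distance heuristic as bandwidth
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic(K, L):
    # biased empirical HSIC between two Gram matrices
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.RandomState(0)
m = 150
y = rng.choice([-1.0, 1.0], size=(m, 1))
X = rng.randn(m, 5)
X[:, 0] += y[:, 0]            # only feature 0 depends on the label

L = y @ y.T                   # linear kernel on the labels
features = list(range(X.shape[1]))
while len(features) > 1:
    # score each candidate removal by the HSIC of the remaining features
    scores = {j: hsic(gram(X[:, [f for f in features if f != j]]), L)
              for j in features}
    # drop the feature whose removal decreases the dependence the least
    features.remove(max(scores, key=scores.get))
# the informative feature should be the last one standing
```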
Comparison to Other Feature Selectors
Synthetic data (figure); brain-computer interface data (figure)

Frequency Band Selection (figure)
Microarray Feature Selection

Goal: obtain a small subset of features for estimation; reproducible feature selection.

Results: (figure)
Summary
1. Measuring Independence: Covariance Operator; Hilbert Space Methods; A Test Statistic and its Analysis
2. Independent Component Analysis: ICA Primer; Examples
3. Feature Selection: Problem Setting; Algorithm; Results
Shameless Plugs

Looking for a job? Talk to me! Alex.Smola@nicta.com.au

Positions: PhD scholarships; postdoctoral positions and senior researchers; long-term visitors (sabbaticals etc.)

More details on kernels: Schölkopf and Smola, Learning with Kernels.
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationSemi-Supervised Learning through Principal Directions Estimation
Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de
More informationHilbert Schmidt component analysis
Lietuvos matematikos rinkinys ISSN 0132-2818 Proc. of the Lithuanian Mathematical Society, Ser. A Vol. 57, 2016 DOI: 10.15388/LMR.A.2016.02 pages 7 11 Hilbert Schmidt component analysis Povilas Daniušis,
More informationKernel Independent Component Analysis
Journal of Machine Learning Research 3 (2002) 1-48 Submitted 11/01; Published 07/02 Kernel Independent Component Analysis Francis R. Bach Computer Science Division University of California Berkeley, CA
More informationKernel Methods in Machine Learning
Kernel Methods in Machine Learning Autumn 2015 Lecture 1: Introduction Juho Rousu ICS-E4030 Kernel Methods in Machine Learning 9. September, 2015 uho Rousu (ICS-E4030 Kernel Methods in Machine Learning)
More informationA Linear-Time Kernel Goodness-of-Fit Test
A Linear-Time Kernel Goodness-of-Fit Test Wittawat Jitkrittum 1; Wenkai Xu 1 Zoltán Szabó 2 NIPS 2017 Best paper! Kenji Fukumizu 3 Arthur Gretton 1 wittawatj@gmail.com 1 Gatsby Unit, University College
More informationStrictly Positive Definite Functions on a Real Inner Product Space
Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite
More informationBundle Methods for Machine Learning
Bundle Methods for Machine Learning Joint work with Quoc Le, Choon-Hui Teo, Vishy Vishwanathan and Markus Weimer Alexander J. Smola sml.nicta.com.au Statistical Machine Learning Program Canberra, ACT 0200
More informationKernel Learning via Random Fourier Representations
Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random
More informationKernel Logistic Regression and the Import Vector Machine
Kernel Logistic Regression and the Import Vector Machine Ji Zhu and Trevor Hastie Journal of Computational and Graphical Statistics, 2005 Presented by Mingtao Ding Duke University December 8, 2011 Mingtao
More informationLearning SVM Classifiers with Indefinite Kernels
Learning SVM Classifiers with Indefinite Kernels Suicheng Gu and Yuhong Guo Dept. of Computer and Information Sciences Temple University Support Vector Machines (SVMs) (Kernel) SVMs are widely used in
More informationKernel Methods & Support Vector Machines
Kernel Methods & Support Vector Machines Mahdi pakdaman Naeini PhD Candidate, University of Tehran Senior Researcher, TOSAN Intelligent Data Miners Outline Motivation Introduction to pattern recognition
More informationGeneralized clustering via kernel embeddings
Generalized clustering via kernel embeddings Stefanie Jegelka 1, Arthur Gretton 2,1, Bernhard Schölkopf 1, Bharath K. Sriperumbudur 3, and Ulrike von Luxburg 1 1 Max Planck Institute for Biological Cybernetics,
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationPrincipal Component Analysis
Principal Component Analysis Introduction Consider a zero mean random vector R n with autocorrelation matri R = E( T ). R has eigenvectors q(1),,q(n) and associated eigenvalues λ(1) λ(n). Let Q = [ q(1)
More informationDimensionality Reduction. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Dimensionality Reduction CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Visualize high dimensional data (and understand its Geometry) } Project the data into lower dimensional spaces }
More informationStat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.
Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have
More informationPCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationSketching for Large-Scale Learning of Mixture Models
Sketching for Large-Scale Learning of Mixture Models Nicolas Keriven Université Rennes 1, Inria Rennes Bretagne-atlantique Adv. Rémi Gribonval Outline Introduction Practical Approach Results Theoretical
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationOutline. Motivation. Mapping the input space to the feature space Calculating the dot product in the feature space
to The The A s s in to Fabio A. González Ph.D. Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 2, 2009 to The The A s s in 1 Motivation Outline 2 The Mapping the
More informationAn Introduction to Independent Components Analysis (ICA)
An Introduction to Independent Components Analysis (ICA) Anish R. Shah, CFA Northfield Information Services Anish@northinfo.com Newport Jun 6, 2008 1 Overview of Talk Review principal components Introduce
More informationConnection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis
Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal
More informationA Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal In this class we continue our journey in the world of RKHS. We discuss the Mercer theorem which gives
More informationKernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kernel Bayes Rule: Nonparametric Bayesian inference with kernels Kenji Fukumizu The Institute of Statistical Mathematics NIPS 2012 Workshop Confluence between Kernel Methods and Graphical Models December
More informationThere are two things that are particularly nice about the first basis
Orthogonality and the Gram-Schmidt Process In Chapter 4, we spent a great deal of time studying the problem of finding a basis for a vector space We know that a basis for a vector space can potentially
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More informationSupport Vector Machine for Classification and Regression
Support Vector Machine for Classification and Regression Ahlame Douzal AMA-LIG, Université Joseph Fourier Master 2R - MOSIG (2013) November 25, 2013 Loss function, Separating Hyperplanes, Canonical Hyperplan
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationIntroduction to Machine Learning
Introduction to Machine Learning 12. Gaussian Processes Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 The Normal Distribution http://www.gaussianprocess.org/gpml/chapters/
More information