Hilbert Schmidt Independence Criterion

Hilbert Schmidt Independence Criterion
Thanks to Arthur Gretton, Le Song, Bernhard Schölkopf, Olivier Bousquet
Alexander J. Smola, Statistical Machine Learning Program, Canberra, ACT 0200, Australia
ICONIP 2006, Hong Kong, October 3

Outline
1. Measuring Independence: Covariance Operator; Hilbert Space Methods; A Test Statistic and its Analysis
2. Independent Component Analysis: ICA Primer; Examples
3. Feature Selection: Problem Setting; Algorithm; Results

Measuring Independence
Problem: given a sample {(x_1, y_1), ..., (x_m, y_m)} drawn from Pr(x, y), determine whether Pr(x, y) = Pr(x) Pr(y), and measure the degree of dependence.
Applications: independent component analysis; dimensionality reduction and feature extraction; statistical modeling.
Indirect approach: estimate the density Pr(x, y) and check whether the estimate approximately factorizes.
Direct approach: check properties of factorizing distributions, e.g. kurtosis or covariance operators.
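
A toy illustration of why the direct approach needs more than plain linear covariance: the pair below is strongly dependent yet nearly uncorrelated, while the covariance of a nonlinear feature of x with y is clearly nonzero. This is a minimal numpy sketch with made-up data, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
x = rng.standard_normal(m)
y = x**2 + 0.1 * rng.standard_normal(m)    # y is a function of x plus noise

print(np.cov(x, y)[0, 1])      # ~0: linear covariance misses the dependence
print(np.cov(x**2, y)[0, 1])   # clearly > 0: a nonlinear feature exposes it
```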

Covariance Operator
Linear case: for linear functions f(x) = w^T x and g(y) = v^T y the covariance is given by Cov{f(x), g(y)} = w^T C v, where C is the cross-covariance matrix. This is a bilinear operator on the space of linear functions.
Nonlinear case: define C to be the operator mapping (f, g) to Cov{f, g}.
Theorem: C is a bilinear operator in f and g.
Proof: we only show linearity in f: Cov{α f, g} = α Cov{f, g}, and for f + f' the covariance is additive.
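
A quick numerical check of the linear-case identity Cov{w^T x, v^T y} = w^T C v, with C the empirical cross-covariance matrix. The data and the vectors w, v below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, dx, dy = 5000, 3, 2
x = rng.standard_normal((m, dx))
y = x @ rng.standard_normal((dx, dy)) + 0.5 * rng.standard_normal((m, dy))

C = np.cov(x, y, rowvar=False)[:dx, dx:]   # empirical cross-covariance matrix between x and y
w, v = rng.standard_normal(dx), rng.standard_normal(dy)

f, g = x @ w, y @ v                        # f(x) = w^T x, g(y) = v^T y
print(np.cov(f, g)[0, 1], w @ C @ v)       # the two numbers agree up to floating-point error
```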

Independent random variables

Dependent random variables

Or are we just unlucky?

Covariance Operators
Criterion (Rényi, 1957): test for independence by checking whether C = 0.
Reproducing kernel Hilbert space: kernels k, l on X, Y with associated RKHSs F, G; assume k, l are bounded on the domain.
Mean operators: ⟨µ_x, f⟩ = E_x[f(x)] and ⟨µ_y, g⟩ = E_y[g(y)].
Covariance operator: define the covariance operator C_xy via the bilinear form ⟨f, C_xy g⟩ = Cov{f, g} = E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)].
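
To make the bilinear form concrete: once f and g are written as finite kernel expansions, ⟨f, C_xy g⟩ = Cov{f(x), g(y)} can be estimated by plain sample averages. A minimal sketch, assuming Gaussian kernels and arbitrary expansion points and coefficients (all illustrative):

```python
import numpy as np

def gauss(a, b, sigma=1.0):
    """k(a, b) = exp(-(a - b)^2 / (2 sigma^2)), evaluated on all pairs of scalars."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * sigma**2))

rng = np.random.default_rng(2)
m = 1000
x = rng.standard_normal(m)
y = np.sin(x) + 0.1 * rng.standard_normal(m)       # a dependent pair

# f = sum_i alpha_i k(z_i, .) in F,  g = sum_j beta_j l(u_j, .) in G
z, alpha = rng.standard_normal(5), rng.standard_normal(5)
u, beta = rng.standard_normal(5), rng.standard_normal(5)

fx = gauss(x, z) @ alpha                           # f evaluated on the x-sample
gy = gauss(y, u) @ beta                            # g evaluated on the y-sample

# <f, C_xy g> = E_{x,y}[f(x) g(y)] - E_x[f(x)] E_y[g(y)], plugged in empirically
print(np.mean(fx * gy) - np.mean(fx) * np.mean(gy))
```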

Hilbert Space Representation
Theorem: provided that k, l are universal kernels, C_xy = 0 if and only if x and y are independent.
Proof sketch.
Step 1: if x, y are dependent, then there exist [0, 1]-bounded functions f*, g* with Cov{f*, g*} = ε > 0.
Step 2: since k, l are universal, there exist ε-accurate approximations of f*, g* in F, G whose covariance does not vanish.
Step 3: hence the covariance operator C_xy is nonzero.

A Test Statistic
Covariance operator: C_xy = E_{x,y}[k(x, ·) ⊗ l(y, ·)] − E_x[k(x, ·)] ⊗ E_y[l(y, ·)].
Operator norm: use the norm of C_xy to test whether x and y are independent; it also gives us a measure of dependence,
HSIC(Pr_xy, F, G) := ||C_xy||², where ||·|| denotes the Hilbert-Schmidt norm.
Frobenius norm: for matrices we can define ||M||² = Σ_ij M_ij² = tr(M^T M); the Hilbert-Schmidt norm is the generalization of the Frobenius norm to operators.
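
A one-line numerical check of the finite-dimensional identity that the Hilbert-Schmidt norm generalizes, ||M||² = Σ_ij M_ij² = tr(M^T M), on an arbitrary random matrix:

```python
import numpy as np

M = np.random.default_rng(3).standard_normal((4, 6))
print(np.sum(M**2), np.trace(M.T @ M), np.linalg.norm(M, 'fro')**2)   # all three coincide
```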

Computing ||C_xy||²
Rank-one operators: for rank-one terms we have ||f ⊗ g||² = ⟨f ⊗ g, f ⊗ g⟩_HS = ||f||² ||g||².
Joint expectation: by construction of C_xy we exploit linearity and obtain
||C_xy||² = ⟨C_xy, C_xy⟩_HS
= {E_{x,y} E_{x',y'} − 2 E_{x,y} E_{x'} E_{y'} + E_x E_y E_{x'} E_{y'}} [⟨k(x, ·) ⊗ l(y, ·), k(x', ·) ⊗ l(y', ·)⟩_HS]
= {E_{x,y} E_{x',y'} − 2 E_{x,y} E_{x'} E_{y'} + E_x E_y E_{x'} E_{y'}} [k(x, x') l(y, y')].
This is well defined if k, l are bounded.
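
The expansion above can be checked numerically: estimating each expectation under the empirical distribution (a V-statistic, so the normalization here is 1/m² rather than the 1/(m−1)² of the estimator on the next slide) reproduces the centred trace form exactly. A sketch with an illustrative Gaussian kernel and made-up data:

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    return np.exp(-np.subtract.outer(x, x) ** 2 / (2.0 * sigma**2))

rng = np.random.default_rng(4)
m = 300
x = rng.standard_normal(m)
y = np.tanh(x) + 0.2 * rng.standard_normal(m)
K, L = rbf_gram(x), rbf_gram(y)

# plug-in estimates of the three expectation terms
term1 = np.sum(K * L) / m**2                       # E_{x,y} E_{x',y'} [k l]
term2 = (K.sum(axis=1) @ L.sum(axis=1)) / m**3     # E_{x,y} E_{x'} E_{y'} [k l]
term3 = K.sum() * L.sum() / m**4                   # E_x E_{x'} E_y E_{y'} [k l]

H = np.eye(m) - np.ones((m, m)) / m
print(term1 - 2 * term2 + term3)                   # equals ...
print(np.trace(K @ H @ L @ H) / m**2)              # ... the centred trace form
```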

Estimating ||C_xy||²
Empirical criterion: HSIC(Z, F, G) := (m − 1)^{-2} tr(K H L H), where K_ij = k(x_i, x_j), L_ij = l(y_i, y_j), and H_ij = δ_ij − m^{-1}.
Theorem: E_Z[HSIC(Z, F, G)] = HSIC(Pr_xy, F, G) + O(1/m).
Proof (sketch only): expand tr(K H L H) into terms over pairs, triples, and quadruples of non-repeated indices, which lead to the proper expectations, and bound the remainder by O(m^{-1}).
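
The biased estimator tr(KHLH)/(m−1)² is a few lines of numpy. The Gaussian kernels and bandwidths below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gaussian Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    X = X.reshape(len(X), -1)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC(Z, F, G) = tr(K H L H) / (m - 1)^2."""
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m            # H_ij = delta_ij - 1/m
    return np.trace(rbf_gram(X, sigma_x) @ H @ rbf_gram(Y, sigma_y) @ H) / (m - 1) ** 2

rng = np.random.default_rng(5)
x = rng.standard_normal(300)
print(hsic(x, rng.standard_normal(300)))               # independent pair: close to 0
print(hsic(x, x**2 + 0.1 * rng.standard_normal(300)))  # dependent pair: much larger
```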

Uniform Convergence Bounds for ||C_xy||²
Theorem (recall Hoeffding's theorem for U-statistics): for averages over functions of r variables,
u := (1/(m)_r) Σ_{i_1 ≠ ... ≠ i_r} g(x_{i_1}, ..., x_{i_r}),
where (m)_r = m(m − 1)···(m − r + 1) and the kernel g is bounded, a ≤ g ≤ b, we have
Pr{u − E[u] ≥ t} ≤ exp(−2 t² ⌊m/r⌋ / (b − a)²).
In our statistic we have terms of 2, 3, and 4 random variables.

Uniform Convergence Bounds for ||C_xy||²
Corollary: assume that k, l ≤ 1. Then, with probability at least 1 − δ,
|HSIC(Z, F, G) − HSIC(Pr_xy, F, G)| ≤ √(log(6/δ) / (0.24 m)) + C/m.
Proof: bound each of the three terms separately via Hoeffding's theorem.
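
A rough empirical illustration of the corollary: for independent x, y the population HSIC is zero, so the empirical values below show the finite-sample deviation, which stays well within the leading term √(log(6/δ)/(0.24 m)) printed alongside for δ = 0.05. Kernel, bandwidth, and sample sizes are illustrative choices.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    return np.exp(-np.subtract.outer(x, x) ** 2 / (2.0 * sigma**2))

def hsic(x, y):
    m = len(x)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(x) @ H @ rbf_gram(y) @ H) / (m - 1) ** 2

rng = np.random.default_rng(6)
delta = 0.05
for m in (50, 200, 800):
    vals = [hsic(rng.standard_normal(m), rng.standard_normal(m)) for _ in range(20)]
    print(m, np.mean(vals), np.sqrt(np.log(6 / delta) / (0.24 * m)))
```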

Blind Source Separation
Data: w = M s, where all sources s_i are mutually independent (the cocktail party problem).
Task: recover the sources s and the mixing matrix M given the observations w.

Independent Component Analysis
Whitening: rotate, center, and whiten the data before separation; this is always possible.
Optimization: we cannot recover the scale of the data anyway, so we need to find an orthogonal matrix U such that Uw = s yields independent random variables. This is optimization on the Stiefel manifold and could be done by a Newton method.
Important trick: the kernel matrix could be huge, so use a reduced-rank representation; we then compute tr H(AA^T)H(BB^T) = ||A^T H B||² instead of tr(HKHL).
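
A toy end-to-end sketch of HSIC-based ICA for two sources: centre, whiten, then search over rotations for the one that minimizes the dependence between the recovered components. A brute-force grid over the rotation angle stands in for the Newton/Stiefel-manifold optimization mentioned above, and no reduced-rank trick is used since the toy kernel matrices are small; sources, mixing matrix, and bandwidth are all made up.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    return np.exp(-np.subtract.outer(x, x) ** 2 / (2.0 * sigma**2))

def hsic(x, y):
    m = len(x)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(x) @ H @ rbf_gram(y) @ H) / (m - 1) ** 2

rng = np.random.default_rng(7)
m = 500
s = np.column_stack([rng.uniform(-1, 1, m), rng.laplace(size=m)])  # independent, non-Gaussian sources
M = np.array([[1.0, 0.6], [0.4, 1.0]])                             # mixing matrix
w = s @ M.T                                                        # observations w = M s

# centre and whiten
w = w - w.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(w, rowvar=False))
wh = (w @ evecs) / np.sqrt(evals)

# grid search over rotations U(theta) for the least dependent components
def rotation(theta):
    c, si = np.cos(theta), np.sin(theta)
    return np.array([[c, -si], [si, c]])

thetas = np.linspace(0.0, np.pi / 2, 90)
scores = [hsic(*(wh @ rotation(t).T).T) for t in thetas]
s_hat = wh @ rotation(thetas[int(np.argmin(scores))]).T   # estimated sources, up to scale/sign/permutation

# cross-correlation with the true sources is roughly a permutation matrix
print(np.abs(np.corrcoef(s_hat.T, s.T))[:2, 2:])
```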

ICA Experiments

Outlier Robustness

Automatic Regularization

Mini Summary
Linear mixture of independent sources: remove the mean and whiten as preprocessing, use HSIC as the measure of dependence, and find the best rotation to demix the data.
Performance: HSIC is very robust to outliers and is a general-purpose criterion; the best-performing algorithm (RADICAL) is designed specifically for linear ICA. The low-rank decomposition makes the optimization feasible.
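
A quick check of the reduced-rank identity that makes the optimization feasible, tr H(AA^T)H(BB^T) = ||A^T H B||², with random low-rank factors standing in for (e.g.) incomplete Cholesky factors of K and L:

```python
import numpy as np

rng = np.random.default_rng(8)
m, r = 200, 10
A = rng.standard_normal((m, r))              # K is approximated by A A^T
B = rng.standard_normal((m, r))              # L is approximated by B B^T
H = np.eye(m) - np.ones((m, m)) / m

lhs = np.trace(H @ A @ A.T @ H @ B @ B.T)    # naive form: works with m x m matrices
Bc = B - B.mean(axis=0)                      # H B without ever forming H
rhs = np.linalg.norm(A.T @ Bc, 'fro') ** 2   # only m x r and r x r objects needed
print(lhs, rhs)                              # identical up to rounding
```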

Feature Selection
The problem: a large number of features; select a small subset of them.
Basic idea: find features such that the distributions p(x | y = 1) and p(x | y = −1) are as different as possible, and use a two-sample test for that.
Important tweak: we can use a similar criterion to measure the dependence between data and labels, by computing the Hilbert-Schmidt norm of the covariance operator.

Recursive Feature Elimination
Algorithm: start with the full set of features; adjust the kernel width to pick up the maximum discrepancy; find the feature whose removal decreases the dissimilarity the least; remove this feature; repeat.
Applications: binary classification (standard MMD criterion); multiclass; regression.
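
A minimal backward-elimination sketch in the spirit of the algorithm above: at each step, drop the feature whose removal reduces the HSIC between the remaining features and the labels the least. The kernel-width adjustment step is omitted, and the data, kernels, and bandwidth are illustrative.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(9)
m, d = 200, 10
X = rng.standard_normal((m, d))
y = np.sign(X[:, 0] + X[:, 1])                    # only the first two features carry label information
L = np.outer(y, y)                                # linear kernel on the labels
H = np.eye(m) - np.ones((m, m)) / m

def hsic_with_labels(cols):
    K = rbf_gram(X[:, cols])
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

features = list(range(d))
while len(features) > 2:
    # score each candidate by the dependence that remains after removing it
    scores = [hsic_with_labels([f for f in features if f != j]) for j in features]
    features.pop(int(np.argmax(scores)))          # remove the least useful feature
print(sorted(features))                           # typically recovers [0, 1]
```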

Comparison to Other Feature Selectors
(Results on synthetic data and on brain-computer interface data.)

Frequency Band Selection

Microarray Feature Selection
Goal: obtain a small subset of features for estimation; reproducible feature selection.
(Results shown as figures.)

Summary
1. Measuring Independence: Covariance Operator; Hilbert Space Methods; A Test Statistic and its Analysis
2. Independent Component Analysis: ICA Primer; Examples
3. Feature Selection: Problem Setting; Algorithm; Results

Shameless Plugs
Looking for a job? Talk to me! Alex.Smola@nicta.com.au
Positions: PhD scholarships; postdoctoral positions and senior researchers; long-term visitors (sabbaticals etc.).
More details on kernels: Schölkopf and Smola, Learning with Kernels.
