
Lietuvos matematikos rinkinys, ISSN 0132-2818
Proc. of the Lithuanian Mathematical Society, Ser. A, Vol. 57, 2016, pages 7-11
DOI: 10.15388/LMR.A.2016.02

Hilbert–Schmidt component analysis

Povilas Daniušis, Pranas Vaitkus, Linas Petkevičius

Faculty of Mathematics and Informatics, Vilnius University
Naugarduko str. 4, LT-03225 Vilnius, Lithuania
E-mail: povilas.daniusis@gmail.com, vaitkuspranas@gmail.com, linas.petkevicius@mif.vu.lt

Abstract. We propose a feature extraction algorithm based on the Hilbert–Schmidt independence criterion (HSIC) and the maximum dependence, minimum redundancy approach. Experiments with classification data sets demonstrate that the suggested Hilbert–Schmidt component analysis (HSCA) algorithm may in certain cases be more efficient than the other considered approaches.

Keywords: feature extraction, dimensionality reduction, HSCA, Hilbert–Schmidt independence criterion, kernel methods.

1 Introduction

In many cases the initial representation of data is inconvenient, or even prohibitive, for further analysis. For example, in image analysis, text analysis and computational genetics, high-dimensional, massive, structural, incomplete and noisy data sets are common. Therefore feature extraction, i.e. the revelation of informative features from raw data, is one of the fundamental machine learning problems. In this article we focus on supervised feature extraction algorithms that use dependence-based criteria of optimality.

The article is structured as follows. In Section 2 we briefly formulate the estimators of the Hilbert–Schmidt independence criterion (HSIC) proposed by [5]. In Section 3 we propose a new algorithm, Hilbert–Schmidt component analysis (HSCA). The main idea of HSCA is to find non-redundant features which maximize HSIC with a dependent variable. Finally, in Section 4, we experimentally compare our approach with several alternative feature extraction methods. Therein we statistically analyze the accuracy of a k-NN classifier based on LDA [4], PCA [6], HBFE [2, 10], and HSCA features.

2 Hilbert–Schmidt independence criterion

The Hilbert–Schmidt independence criterion (HSIC) is a kernel-based dependence measure proposed and investigated in [5, 8]. Let $T := (x_i, y_i)_{i=1}^{m}$ be a supervised training set, where $x_i \in \mathcal{X}$ are inputs, $y_i \in \mathcal{Y}$ are the corresponding desired outputs, and $\mathcal{X}$, $\mathcal{Y}$ are two sets. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be two positive definite kernels [7], with corresponding Gram matrices $K$ and $L$. Two empirical estimators of HSIC have been proposed (see [5, 8]):

$$\widehat{\mathrm{HSIC}}_0(X, Y) := (m-1)^{-2}\,\mathrm{Tr}(KHLH), \qquad (1)$$

and

$$\widehat{\mathrm{HSIC}}_1(X, Y) := \frac{1}{m(m-3)}\left[\mathrm{Tr}(KL) + \frac{\mathbf{1}^{T}K\mathbf{1}\,\mathbf{1}^{T}L\mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2}\,\mathbf{1}^{T}KL\mathbf{1}\right], \qquad (2)$$

where $H := I - m^{-1}\mathbf{1}\mathbf{1}^{T}$ is the centering matrix and, in (2), $K$ and $L$ are the corresponding Gram matrices with zero diagonals. The estimator (1) is biased, with an $O(m^{-1})$ bias, while (2) is an unbiased estimator of HSIC [5, 8].

3 Hilbert–Schmidt component analysis (HSCA)

In this section we suggest an algorithm for Hilbert–Schmidt component analysis (HSCA), which is based on the HSIC dependence measure. The choice of HSIC is motivated by its neat theoretical properties [5, 8] and by the promising experimental results achieved by various HSIC-based feature extraction algorithms [2, 3, 5, 10].

Suppose we have a supervised training set $T := (x_i, y_i)_{i=1}^{m}$, where $x_i \in \mathbb{R}^{D_x}$ are observations and $y_i \in \mathbb{R}^{D_y}$ are dependent variables. Let us denote the data matrices $X = [x_1, x_2, \ldots, x_m]$ and $Y = [y_1, y_2, \ldots, y_m]$, and assume that the kernel for the inputs is linear (i.e. $K = X^{T}X$). In HSCA we iteratively seek $d \leq D_x$ linear projections which maximize the dependence with the dependent variable $y$ and simultaneously minimize the dependence with the already computed projections. In other words, for the $t$-th feature we seek a projection vector $p$ which maximizes the ratio

$$\eta_t(p) = \frac{\widehat{\mathrm{HSIC}}(p^{T}X, Y)}{\widehat{\mathrm{HSIC}}(p^{T}X, P_t^{T}X)}, \qquad (3)$$

where $P_t = [p_1, \ldots, p_{t-1}]$ are the projection vectors extracted in the previous $t-1$ steps, and $\widehat{\mathrm{HSIC}}$ is an estimator of HSIC. Note that at the first step only $\widehat{\mathrm{HSIC}}(p^{T}X, Y)$ is maximized. For example, plugging the estimator (1) into (3), we have to maximize the following generalized Rayleigh quotient

$$\eta_t(p) = \frac{\mathrm{Tr}(X^{T}pp^{T}XHLH)}{\mathrm{Tr}(X^{T}pp^{T}XHL_{f}H)} = \frac{p^{T}XHLHX^{T}p}{p^{T}XHL_{f}HX^{T}p}, \qquad (4)$$

where the kernel matrix of the already extracted features is $L_f(i, j) = l(P_t^{T}x_i, P_t^{T}x_j)$. The maximizer is the principal eigenvector of the generalized eigenproblem

$$XHLHX^{T}p = \lambda\, XHL_{f}HX^{T}p. \qquad (5)$$

The case of the unbiased HSIC estimator (2) may be treated in a similar manner. The well-known kernel trick [7] allows HSCA to be extended to the case of an arbitrary input kernel; however, we omit the details due to space restrictions.
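To make the two estimators concrete, the following Python sketch computes (1) and (2) directly from Gram matrices. It is an illustration under our own conventions (the function names hsic_biased and hsic_unbiased and the linear-kernel toy data are ours), not code from the paper.

```python
import numpy as np

def hsic_biased(K, L):
    """Biased estimator (1): (m-1)^{-2} Tr(K H L H)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def hsic_unbiased(K, L):
    """Unbiased estimator (2); the diagonals of K and L are zeroed, as required."""
    m = K.shape[0]
    Kt = K - np.diag(np.diag(K))
    Lt = L - np.diag(np.diag(L))
    one = np.ones(m)
    term1 = np.trace(Kt @ Lt)
    term2 = (one @ Kt @ one) * (one @ Lt @ one) / ((m - 1) * (m - 2))
    term3 = 2.0 / (m - 2) * (one @ Kt @ Lt @ one)
    return (term1 + term2 - term3) / (m * (m - 3))

# Toy example with linear kernels: y depends on the first input coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # m = 100 samples, 5 features
y = X[:, 0] + 0.1 * rng.normal(size=100)
K, L = X @ X.T, np.outer(y, y)
print(hsic_biased(K, L), hsic_unbiased(K, L))
```

For the dependent toy data above both estimates come out clearly positive, while for independent samples the unbiased estimate fluctuates around zero.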
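Below is a minimal sketch of the HSCA iteration of Section 3, assuming the biased estimator (1) and linear kernels both on the inputs and on the already extracted features; the unit-norm constraint at the first step and the small ridge term added to the denominator matrix for numerical stability are our assumptions, not details given in the paper. The generalized eigenproblem (5) is solved with scipy.linalg.eigh.

```python
import numpy as np
from scipy.linalg import eigh

def hsca(X, L, n_components, eps=1e-8):
    """Extract HSCA projections.

    X : (D_x, m) data matrix (columns are observations),
    L : (m, m) Gram matrix of the dependent variable,
    returns P : (D_x, n_components) matrix of projection vectors.
    """
    D, m = X.shape
    H = np.eye(m) - np.ones((m, m)) / m              # centering matrix
    A = X @ H @ L @ H @ X.T                          # numerator matrix of (4)
    P = []
    for t in range(n_components):
        if not P:
            # First step: maximize HSIC(p^T X, Y) alone -> top eigenvector of A.
            _, vecs = eigh(A)
            p = vecs[:, -1]
        else:
            Pt = np.column_stack(P)
            Z = Pt.T @ X                             # already extracted features
            L_f = Z.T @ Z                            # linear kernel on them
            B = X @ H @ L_f @ H @ X.T + eps * np.eye(D)   # regularized denominator
            # Principal eigenvector of the generalized problem A p = lambda B p, eq. (5).
            _, vecs = eigh(A, B)
            p = vecs[:, -1]
        P.append(p / np.linalg.norm(p))
    return np.column_stack(P)

# Usage sketch: extract three features and feed them to any classifier.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 200))                       # D_x = 10, m = 200
y = (X[0] + X[1] > 0).astype(float)
L = np.outer(y, y)                                   # linear kernel on labels
P = hsca(X, L, n_components=3)
features = P.T @ X                                   # (3, m) extracted features
```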

4 Computer experiments

In this section we analyze twelve classification data sets: eleven of them are from the UCI machine learning repository [1], and the remaining Ames data set is from chemometrics.² We are interested in the performance of the k-NN classifier when the inputs are constructed by several feature extraction algorithms: unsupervised PCA [6], supervised LDA [4], HBFE [2, 10] and HSCA. The measure of efficiency analyzed therein is the accuracy of the k-NN classifier, calculated over the testing set.

The following procedure was adopted when conducting the experiments. Fifty random partitions of each data set into training and testing sets of equal size were generated, and feature extraction was performed using all the above-mentioned methods. The projection matrices of the feature extraction methods were estimated using only the training data. The features generated from the testing set were then classified using the k-NN classifier. The feature dimensionality was selected using the training data and 3-fold cross-validation. Wilcoxon's signed rank test [9] with the standard p-value threshold of 0.05 was applied to the samples of corresponding classification accuracies. The following comparisons were made, indicating the statistically significant cases in the table:

1. HBFE_1 with HBFE_0, and HSCA_1 with HSCA_0 (the better one is indicated in bold text);
2. The most efficient method with the remaining ones (statistically significant cases are reported in underlined text);
3. HSCA with HBFE (data sets where HSCA was more efficient are marked accordingly, as are those where it turned out to be less efficient);
4. The most efficient HSIC-based algorithm (i.e. HBFE_0, HBFE_1, HSCA_0 or HSCA_1) with the remaining ones (marking whether the HSIC-based algorithm outperformed the other ones, or whether PCA, LDA or the unmodified inputs were more efficient).

The results in Table 1 show that the HSCA approach may allow slightly better classification accuracy to be achieved for some data sets.

5 Conclusions

The suggested HSCA (Hilbert–Schmidt component analysis) algorithm (Section 3) optimizes the ratio of feature relevancy and feature redundancy estimates. Both estimates are formulated in terms of the HSIC dependence measure. The optimal features are solutions of the generalized eigenproblem (5). In Section 4 we statistically compared HSCA with several alternative feature extraction methods, analysing classification accuracy as the measure of feature relevance. The results of the conducted experiments demonstrate the practical usefulness of the HSCA algorithm.

² Details about the Ames data set are not available due to an agreement with the provider.
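The evaluation protocol of Section 4 can be sketched as follows. This is a simplified illustration with scikit-learn: a public data set, PCA as a stand-in extractor, a fixed feature dimensionality instead of the 3-fold cross-validation, and fifty 50/50 splits, so it mirrors the procedure rather than reproducing the authors' exact setup.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_X, train_y, test_X, test_y, extractor=None):
    """Fit an optional feature extractor on training data only, then score 1-NN."""
    if extractor is not None:
        extractor.fit(train_X)
        train_X, test_X = extractor.transform(train_X), extractor.transform(test_X)
    clf = KNeighborsClassifier(n_neighbors=1).fit(train_X, train_y)
    return clf.score(test_X, test_y)

X, y = load_breast_cancer(return_X_y=True)
acc_full, acc_pca = [], []
for seed in range(50):                               # fifty random 50/50 splits
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=seed)
    acc_full.append(knn_accuracy(Xtr, ytr, Xte, yte))
    acc_pca.append(knn_accuracy(Xtr, ytr, Xte, yte, extractor=PCA(n_components=5)))

# Paired Wilcoxon signed rank test on the two accuracy samples (threshold 0.05).
stat, p_value = wilcoxon(acc_full, acc_pca)
print(np.mean(acc_full), np.mean(acc_pca), p_value)
```

The same paired accuracy samples would be collected for every extractor under comparison and tested pairwise, which is how the significance markers in Table 1 are obtained.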

Table 1. Classification accuracy comparison.

1-NN classifier and linear kernel

Dataset        Full    HBFE_1  HBFE_0  HSCA_1  HSCA_0  PCA     LDA
Ames           0.7753  0.7589  0.7765  0.7826  0.8012  0.7786  0.7714
Australian     0.7933  0.7987  0.8045  0.8093  0.8095  0.7868  0.8114
Breastcancer   0.9558  0.9553  0.9543  0.9553  0.9595  0.9566  0.9562
Covertype      0.6868  0.6956  0.6732  0.7086  0.7014  0.6748  0.6756
Derm           0.9906  0.9973  0.9973  0.9973  0.9971  0.9949  0.9971
German         0.6698  0.6841  0.6730  0.6833  0.6910  0.6700  0.6851
Heart          0.7618  0.7600  0.7677  0.7612  0.7627  0.7520  0.7698
Ionosphere     0.8421  0.8555  0.8683  0.8686  0.8773  0.8581  0.8171
Sonar          0.8146  0.7819  0.7538  0.7427  0.8046  0.8146  0.6938
Spambase       0.8975  0.8993  0.8979  0.9116  0.9056  0.9015  0.8680
Specft         0.6770  0.6690  0.7030  0.6730  0.7020  0.6630  0.5370
Wdbc           0.9503  0.9355  0.9455  0.9476  0.9570  0.9506  0.9528

1-NN classifier and Gaussian kernel

Dataset        Full    HBFE_1  HBFE_0  HSCA_1  HSCA_0  PCA     LDA
Australian     0.7924  0.7900  0.7927  0.7878  0.8185  0.7807  0.8110
Breastcancer   0.9508  0.9505  0.9458  0.9472  0.9466  0.9522  0.9487
Covertype      0.6932  0.6480  0.6738  0.6855  0.6860  0.6777  0.7091
Derm           0.9872  0.9985  0.9989  0.9993  0.9989  0.9935  0.9984
German         0.6717  0.6668  0.6784  0.6956  0.6797  0.6737  0.6802
Heart          0.7638  0.7689  0.7679  0.7575  0.7669  0.7588  0.7366
Ionosphere     0.8521  0.8895  0.9128  0.9025  0.9158  0.9083  0.8927
Sonar          0.8358  0.6955  0.7942  0.7204  0.8344  0.8387  0.8258
Spambase       0.8570  0.7994  0.7903  0.8119  0.8129  0.8471  0.8779
Specft         0.6778  0.7283  0.7317  0.7650  0.7400  0.6370  0.7370
Wdbc           0.9510  0.9148  0.9425  0.9320  0.9424  0.9510  0.9599

References

[1] A. Asuncion and D.J. Newman. UCI Machine Learning Repository. University of California, School of Information and Computer Science, 2007. Available from: http://archive.ics.uci.edu/ml/.
[2] P. Daniušis and P. Vaitkus. Supervised feature extraction using Hilbert–Schmidt norms. Lecture Notes in Computer Science, 5788:25–33, 2009.
[3] P. Daniušis and P. Vaitkus. A feature extraction algorithm based on the Hilbert–Schmidt independence criterion. Šiauliai Math. Sem., 4(12):35–42, 2009.
[4] R.A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eug., 7:179–188, 1936.
[5] A. Gretton, O. Bousquet, A. Smola and B. Schölkopf. Measuring statistical dependence with Hilbert–Schmidt norms. In Proc. of the 16th Int. Conf. on Algorithmic Learning Theory, pp. 63–77, 2005.
[6] I.T. Jolliffe. Principal Component Analysis. Springer, Berlin, 1986.
[7] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, 2002.
[8] L. Song, A. Smola, A. Gretton, K. Borgwardt and J. Bedo. Supervised feature selection via dependence estimation. In Proc. of the Intl. Conf. on Machine Learning, pp. 823–830, 2007.
[9] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.
[10] Y. Zhang and Z.-H. Zhou. Multi-label dimensionality reduction via dependence maximization. In Proc. of the 23rd AAAI Conf. on Artificial Intelligence, pp. 1503–1505, 2008.

SUMMARY

Hilbert–Schmidt component analysis

P. Daniušis, P. Vaitkus, L. Petkevičius

The paper presents a new feature extraction method, HSCA, whose essence is to maximize an estimate of the HSIC dependence measure between the features and the dependent variable while at the same time seeking to eliminate the mutual dependence among the features. The method is experimentally compared with both classical and recent feature extraction methods.

Keywords: feature extraction, dimensionality reduction, HSCA, Hilbert–Schmidt independence criterion, kernel methods.