
Invariant coordinate selection for multivariate data analysis - the package ICS

Klaus Nordhausen (1), Hannu Oja (1), David E. Tyler (2)
(1) Tampere School of Public Health, University of Tampere
(2) Department of Statistics, Rutgers - The State University of New Jersey

Outline

- Definitions
- Invariant Coordinate Selection
- ICS and R
- Applications and Examples

Transformations I

Let $X_1, X_2, \ldots, X_n$ be independent $p$-variate observations and write $X = (X_1\ X_2\ \cdots\ X_n)$ for the corresponding $p \times n$ data matrix.

Affine transformation: $X \mapsto AX + b 1_n^\top$, where $A$ is a full-rank $p \times p$ matrix, $b$ a $p$-vector, and $1_n$ an $n$-vector full of ones.

Orthogonal transformation: $X \mapsto UX$ with $U^\top U = U U^\top = I$.

Transformations II

Sign-change transformation: $X \mapsto JX$, where $J$ is a $p \times p$ diagonal matrix with diagonal elements $\pm 1$.

Permutation: $X \mapsto PX$, where $P$ is a $p \times p$ permutation matrix.

Location and scatter statistics

A $p$-vector-valued statistic $T = T(X)$ is called a location statistic if it is affine equivariant, that is, $T(AX + b 1_n^\top) = A\,T(X) + b$ for all full-rank $p \times p$ matrices $A$ and all $p$-vectors $b$.

A $p \times p$ matrix $S = S(X)$ is a scatter statistic if it is affine equivariant in the sense that $S(AX + b 1_n^\top) = A\,S(X)\,A^\top$ for all full-rank $p \times p$ matrices $A$ and all $p$-vectors $b$.
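
As a quick numerical check of the scatter definition, the following R snippet (a minimal sketch with simulated data; the matrix A and vector b are arbitrary) verifies that the sample covariance matrix satisfies $S(AX + b 1_n^\top) = A\,S(X)\,A^\top$. Note that R stores observations as rows, so the transformation reads X %*% t(A).

set.seed(1)
X <- matrix(rnorm(200), nrow = 50)        # n = 50 observations in R^4, rows = observations
A <- matrix(rnorm(16), nrow = 4)          # arbitrary (almost surely full-rank) 4 x 4 matrix
b <- rnorm(4)                             # arbitrary shift vector
Y <- sweep(X %*% t(A), 2, b, "+")         # affine transformation of every observation
all.equal(cov(Y), A %*% cov(X) %*% t(A))  # TRUE: cov is affine equivariant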

Special scatter statistics

A scatter statistic with respect to the origin is affine equivariant in the sense that $S(AXJ) = A\,S(X)\,A^\top$ for all full-rank $p \times p$ matrices $A$ and all $n \times n$ sign-change matrices $J$.

The symmetrized version of a scatter statistic $S$ is defined as $S_{\mathrm{sym}}(X) := S(X_{\mathrm{sym}})$, where $X_{\mathrm{sym}}$ is the matrix of all pairwise differences of the original observation vectors.

A shape matrix is affine equivariant only in the sense that $S(AX + b 1_n^\top) \propto A\,S(X)\,A^\top$.

Independence property

A scatter functional $S$ has the independence property if $S(X)$ is a diagonal matrix for all random vectors $X$ with independent margins. Note that in general scatter statistics do not have the independence property; only the covariance matrix, the matrix of fourth moments, and symmetrized scatter matrices have it. If, however, $X$ has independent components of which at least $p - 1$ are symmetric, all scatter matrices will be diagonal.

Examples of scatter matrices

The most common location and scatter statistics are the vector of means and the regular covariance matrix COV. A so-called one-step M-estimator is, for example, the matrix of fourth moments

$$\mathrm{COV}_4(X) = \frac{1}{p+2}\,\mathrm{ave}\big[(X_i - \bar X)^\top \mathrm{COV}^{-1}(X_i - \bar X)\,(X_i - \bar X)(X_i - \bar X)^\top\big].$$

A regular M-estimator is, for instance, Tyler's shape matrix

$$S_{\mathrm{Tyl}}(X) = p\,\mathrm{ave}\left[\frac{(X_i - T(X))(X_i - T(X))^\top}{(X_i - T(X))^\top S_{\mathrm{Tyl}}^{-1}(X)\,(X_i - T(X))}\right].$$

The symmetrized version of Tyler's shape matrix is known as Dümbgen's shape matrix.
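
Both estimators are easy to sketch in plain R for an n x p data matrix. The following is a minimal illustration only, not the package code (the ICS and ICSNP packages provide cov4() and tyler.shape()); the mean vector is used as the location T, and Tyler's shape matrix is computed by the usual fixed-point iteration, normalized here to have trace p.

cov4hat <- function(X) {
  # matrix of fourth moments COV_4
  Xc <- sweep(X, 2, colMeans(X))              # centered observations
  d2 <- rowSums((Xc %*% solve(cov(X))) * Xc)  # squared Mahalanobis distances
  crossprod(Xc * sqrt(d2)) / ((ncol(X) + 2) * nrow(X))
}

tyler_shape <- function(X, loc = colMeans(X), eps = 1e-6) {
  # Tyler's shape matrix via fixed-point iteration
  Xc <- sweep(X, 2, loc)
  p <- ncol(X)
  V <- diag(p)
  repeat {
    d2 <- rowSums((Xc %*% solve(V)) * Xc)     # squared V-standardized norms
    Vnew <- p * crossprod(Xc / sqrt(d2)) / nrow(X)
    Vnew <- Vnew * p / sum(diag(Vnew))        # normalize the scale: trace(V) = p
    if (max(abs(Vnew - V)) < eps) return(Vnew)
    V <- Vnew
  }
}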

Scatter matrices in R

R offers many functions for estimating different scatter matrices. A most likely incomplete list:
- covRobust: cov.nnve
- ICS: covOrigin, cov4, covAxis, tM
- ICSNP: tyler.shape, duembgen.shape, HR.Mest, HP1.shape
- MASS: cov.rob, cov.trob
- robustbase: covMcd, covOGK
- rrcov: CovMcd, CovMest, CovOgk

Two scatter matrices for ICS

Tyler et al. (2008) showed that two different scatter matrices $S_1 = S_1(X)$ and $S_2 = S_2(X)$ can be used to find an invariant coordinate system as follows. Starting with $S_1$ and $S_2$, define a $p \times p$ transformation matrix $B = B(X)$ and a diagonal matrix $D = D(X)$ by

$$S_2^{-1} S_1 B = B D,$$

that is, $B$ gives the eigenvectors of $S_2^{-1} S_1$. The following result can then be shown to hold: the transformation $X \mapsto Z = B(X)\,X$ yields an invariant coordinate system in the sense that $B(AX)(AX) = J\,B(X)\,X$ for some $p \times p$ sign-change matrix $J$.
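
In plain R the decomposition can be sketched as follows, using cov() as S_1 and the cov4hat() sketch from the previous slide as S_2. This illustrates the eigendecomposition only and is not the ICS package implementation (which also offers standardizations of B and D); the eigenvectors are rescaled so that each coordinate has unit S_1-scatter.

ics_sketch <- function(X, S1 = cov, S2 = cov4hat) {
  # X: n x p data matrix with observations as rows
  M1 <- S1(X)
  M2 <- S2(X)
  E <- eigen(solve(M2) %*% M1)          # eigenvectors of S2^{-1} S1
  V <- Re(E$vectors)                    # real for positive definite scatters
  # rescale each eigenvector v so that v' S1 v = 1
  V <- sweep(V, 2, sqrt(colSums(V * (M1 %*% V))), "/")
  Z <- sweep(X, 2, colMeans(X)) %*% V   # invariant coordinates (up to signs)
  list(Z = Z, D = Re(E$values))
}

res <- ics_sketch(as.matrix(iris[, 1:4]))
res$D   # the diagonal of D: generalized kurtosis measures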

On the choice of S_1 and S_2

As shown previously, there are many possibilities for choosing $S_1$ and $S_2$. Since the scatter matrices all have different properties, different choices can yield different invariant coordinate systems. Unfortunately, there are so far no theoretical results on the optimal choice; this is still an open research question. Some comments are nevertheless already possible:
- For two given scatter matrices, the order has no effect.
- Depending on the application in mind, the scatter matrices should fulfill some further conditions.
- Practice has so far shown that for most data sets different choices yield only minor differences.

Implementation in R

The two-scatter transformation for ICS is implemented in R in the package ICS. The main function ics can take either the names of two functions for $S_1$ and $S_2$ or two precomputed scatter matrices, and it returns an S4 object. The package ICS furthermore offers several functions to work with an ics object, as well as several scatter matrices and two tests of multinormality. The function ics has options for different standardization methods for B and D:

ics(X, S1 = cov, S2 = cov4, S1args = list(), S2args = list(), stdB = "Z", stdKurt = TRUE, na.action = na.fail)

Multivariate nonparametric methods which are meaningful in the context of ICS are implemented in the R package ICSNP.
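
A short usage sketch (assuming the ICS package as documented around the time of these slides; mvnorm.kur.test() is one of the two multinormality tests mentioned above):

library(ICS)
X <- as.matrix(iris[, 1:4])
fit <- ics(X)              # defaults: S1 = cov, S2 = cov4
summary(fit)               # transformation matrix and generalized kurtosis values
Z <- ics.components(fit)   # the invariant coordinates as a data frame
plot(fit)                  # scatterplot matrix of the invariant coordinates
mvnorm.kur.test(X)         # multinormality test based on the kurtosis values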

What is an ICS good for?

An ICS can be of use in the following applications:
- Descriptive statistics
- Finding outliers, structure and clusters
- Dimension reduction
- Independent component analysis
- Multivariate nonparametrics
The following slides motivate some of these applications by means of examples.

Descriptive statistics

The way the transformation matrix $B$ is obtained can also be seen as taking the ratio of two different scale measures; the elements of $D$ can therefore be seen as measures of kurtosis. Using this interpretation, and given the choice of $S_1$, $S_2$, $T_1$ and $T_2$, one immediately obtains:
- Location: $T_1(X)$
- Scatter: $S_1(X)$
- Measure of skewness: $T_2(Z) - T_1(Z)$
- Measure of kurtosis: $S_2(Z)$
Note that when $S_1$ is the regular covariance matrix and $S_2$ the matrix of fourth moments, the elements of $D$ are based on classical moments. The last two measures can then be used to construct tests for multinormality or ellipticity (see, for example, Kankainen et al. 2007).

Finding outliers, structure and clusters

In general, it is a useful feature that the coordinates are ordered according to their kurtosis: one can assume that the interesting directions are those with extreme measures of kurtosis. Outliers, for example, usually show up in the first coordinate. If the original data come from an elliptical distribution, $S_1$ and $S_2$ measure the same population quantity, and the values of $D$ should then be approximately all equal. For non-elliptical distributions, however, they measure different quantities, and the coordinates can reveal "hidden" structures. When $X$ comes from a mixture of elliptical distributions, the first or the last coordinate corresponds to Fisher's linear discriminant function (without knowing the class labels!).

Finding the outliers

The modified wood gravity data is a classical data set for outlier detection methods. It consists of 20 observations on 6 variables and contains four outliers. Here $S_1$ is an M-estimator based on the $t_1$ distribution and $S_2$ one based on the $t_2$ distribution.

[Figures: scatterplot matrices of the original data (x1-x5, y) and of the invariant coordinates (IC.1-IC.6).]

Finding the structure I

The following data set is simulated and has 400 observations on 6 variables. It looks like an elliptical data set.

[Figure: scatterplot matrix of the original variables V1-V6.]

Finding the structure II

[Figure: scatterplot matrix of the principal components Comp.1-Comp.6 (PCA).]

Finding the structure III

[Figure: scatterplot matrix of the invariant coordinates IC.1-IC.6 (ICS).]

Clustering and dimension reduction

To illustrate clustering via an ICS we use the iris data set. The different species are shown in different colors, and we can see that the last component alone is enough for the clustering (see the sketch below).

[Figure: scatterplot matrix of the invariant coordinates IC.1-IC.4, species in different colors.]
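
Reproducing this with the ICS package might look as follows (a sketch continuing the usage example above; it assumes the default cov/cov4 scatter pair rather than the scatters used for the slide, so the separating coordinate may differ):

Z <- ics.components(ics(as.matrix(iris[, 1:4])))
boxplot(Z[, 4] ~ iris$Species, ylab = "IC.4")  # species separate along the last coordinate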

Independent component analysis

Independent component analysis (ICA) is a method often applied in signal processing and medical image analysis. The most basic ICA model has the form $X_i = A Z_i$, $i = 1, \ldots, n$, where $Z_i$ has independent components and $A$ is a full-rank mixing matrix. The goal is to find an unmixing matrix $B$ that recovers the independent components. Oja et al. (2006) showed that in this model the two-scatter transformation recovers the independent components if $S_1$ and $S_2$ have the independence property and the independent components have different kurtosis values. Using two robust scatters hence yields a robust ICA method.
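
A minimal simulation sketch of this result (assuming the ICS package; cov and cov4 both have the independence property, and the three sources below have clearly different kurtosis values):

library(ICS)
set.seed(123)
n <- 1000
S <- cbind(runif(n), rexp(n), rnorm(n))  # independent sources with different kurtosis
A <- matrix(rnorm(9), nrow = 3)          # full-rank mixing matrix
X <- S %*% t(A)                          # mixed data: X_i = A S_i
Zhat <- ics.components(ics(X))           # default scatters cov and cov4
round(cor(as.matrix(Zhat), S), 2)        # close to a signed permutation matrix:
                                         # sources recovered up to order, sign and scale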

Multivariate nonparametrics

Many multivariate nonparametric methods are not affine equivariant by nature. Applying those methods in an invariant coordinate system is therefore an important improvement. The approach presented here is easier to apply than the so-called transformation-retransformation technique of Chakraborty and Chaudhuri (1996) and has further properties which can be used in the analysis.

Marginal nonparametric methods

Puri and Sen (1971) describe in great detail how to use marginal signs and ranks in multivariate nonparametrics. The tests based on these are, however, not invariant under affine transformations; that is, the test decision depends on the coordinate system used. Such tests can be made affine invariant by, for example, performing them in an invariant coordinate system. For testing purposes the coordinate system should be constructed under $H_0$. For example, in the one-sample location test with the origin as null value, the scatter functionals used should be taken with respect to the origin and should furthermore be permutation invariant; that is, the scatters should satisfy $S_k(AXPJ) = A\,S_k(X)\,A^\top$ for all $A$, $P$, $J$ and $k = 1, 2$.

Why invariance is important

We simulate 60 observations from a $N_4((0, 0, 0, 0.48)^\top, I_4)$ distribution and then rotate the data with a random matrix. The test is whether the origin is the center of symmetry.

Test on original data:

  Marginal One Sample Normal Scores Test
  data: Y
  T = 9.653, df = 4, p-value = 0.04669
  alternative hypothesis: true location is not equal to c(0,0,0,0)

Test on transformed data:

  Marginal One Sample Normal Scores Test
  data: (Y %*% t(A))
  T = 9.387, df = 4, p-value = 0.05212
  alternative hypothesis: true location is not equal to c(0,0,0,0)

Why invariance is important II

Now the same using an invariant coordinate system.

Test on ICS based on original data:

  Marginal One Sample Normal Scores Test
  data: Z.Y
  T = 9.737, df = 4, p-value = 0.04511
  alternative hypothesis: true location is not equal to c(0,0,0,0)

Test on ICS based on transformed data:

  Marginal One Sample Normal Scores Test
  data: Z.Y.trans
  T = 9.737, df = 4, p-value = 0.04511
  alternative hypothesis: true location is not equal to c(0,0,0,0)
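
The experiment can be sketched as follows (assuming the mvtnorm package for the simulation and ICSNP's rank.ctest() with normal scores; for a genuine test of the origin the scatters in ics() should be taken with respect to the origin, as noted two slides above, which this sketch glosses over):

library(ICS); library(ICSNP); library(mvtnorm)
set.seed(1)
Y <- rmvnorm(60, mean = c(0, 0, 0, 0.48))  # N_4 sample, shifted in the last margin
A <- matrix(rnorm(16), nrow = 4)           # random (almost surely full-rank) transformation
rank.ctest(Y, scores = "normal")           # p-value changes...
rank.ctest(Y %*% t(A), scores = "normal")  # ...after transforming the data
Z1 <- as.matrix(ics.components(ics(Y)))
Z2 <- as.matrix(ics.components(ics(Y %*% t(A))))
rank.ctest(Z1, scores = "normal")          # identical results: the invariant
rank.ctest(Z2, scores = "normal")          # coordinates remove the dependence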

Key references I

Tyler, D. E., Critchley, F., Dümbgen, L. and Oja, H. (2008). Exploring multivariate data via multiple scatter matrices. Conditionally accepted.

Oja, H., Sirkiä, S. and Eriksson, J. (2006). Scatter matrices and independent component analysis. Austrian Journal of Statistics, 35, 175-189.

Nordhausen, K., Oja, H. and Tyler, D. E. (2008). Two scatter matrices for multivariate data analysis: The package ICS. Submitted.

Nordhausen, K., Oja, H. and Tyler, D. E. (2006). On the efficiency of invariant multivariate sign and rank tests. Festschrift for Tarmo Pukkila on his 60th birthday, 217-231.

Nordhausen, K., Oja, H. and Paindaveine, D. (2008). Rank-based location tests in the independent component model. Conditionally accepted.

Key references II

Chakraborty, B. and Chaudhuri, P. (1996). On a transformation and retransformation technique for constructing an affine equivariant multivariate median. Proceedings of the American Mathematical Society, 124, 1529-1537.

Kankainen, A., Taskinen, S. and Oja, H. (2007). Tests of multinormality based on location vectors and scatter matrices. Statistical Methods & Applications, 16, 357-379.

Puri, M. L. and Sen, P. K. (1971). Nonparametric Methods in Multivariate Analysis. New York: Wiley & Sons.

Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. The Annals of Statistics, 15, 234-251.