
HOMOGENEITY ANALYSIS USING EUCLIDEAN MINIMUM SPANNING TREES

JAN DE LEEUW

Date: December 14, 2003.
2000 Mathematics Subject Classification: 62H25.
Key words and phrases: Multivariate Analysis, Correspondence Analysis.

1. INTRODUCTION

In homogeneity analysis the data are a system of m subsets of a finite set of n objects. The purpose of the technique is to locate the objects in low-dimensional Euclidean space in such a way that the points in each of the m subsets are close together. Thus we want the subsets to be compact, or homogeneous, relative to the overall size of the configuration of the n points.

In the usual forms of homogeneity analysis (also known as multiple correspondence analysis, or MCA) the size of a point set is defined as the size of the star plot of the set. The star plot is the graph obtained by connecting all points in the set with their centroid, and its size is the total squared length of the edges. In order to prevent trivial solutions the configuration X is usually centered and normalized such that $X'X = I$. Minimizing the total size of the star plots under these normalization constraints amounts to solving a (sparse) singular value decomposition problem.

Recently, there have been a number of proposals to use different measures of point set size [De Leeuw, 2003]. One reason for looking at alternatives is the well-known sensitivity of squared distances to outliers. Ordinary homogeneity analysis tends to locate subsets with only a few points far on the outside of the plot, which means that those points dominate the solution by determining its scale. A second reason is that ordinary homogeneity analysis tends to create horseshoes for many types of data sets, i.e. two-dimensional representations in which the second dimension is a quadratic function of the first [Schriever, 1985; Van Rijckevorsel, 1987; Bekker and De Leeuw, 1988]. This is generally seen as wasteful, because we use two dimensions to represent an essentially one-dimensional structure.
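Before turning to trees, it may help to make the classical star-plot loss concrete. The following minimal R sketch is not part of the paper's code; the helper name starPlotSize and its arguments are ours, and it only illustrates the definition above (squared edge lengths from each point to its category centroid).

# Illustrative helper (not from the paper's appendix): total squared
# star-plot size for one categorical variable. Each object i is joined
# to the centroid of its category, and we sum squared edge lengths.
starPlotSize <- function(x, g) {
  g <- as.factor(g)
  # one centroid per category, computed column by column
  centroids <- apply(x, 2, function(col) tapply(col, g, mean))
  sum((x - centroids[as.integer(g), ])^2)
}

# Tiny example: 6 points in 2 dimensions, two categories
set.seed(1)
x <- matrix(rnorm(12), 6, 2)
g <- c(1, 1, 1, 2, 2, 2)
starPlotSize(x, g)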

The most straightforward modification of MCA is to measure the size of the star plots by using distances instead of squared distances [Heiser, 1987; De Leeuw and Michailides, 2003; Michailides and De Leeuw, 2003]. If we continue to apply the normalization $X'X = I$, then the solutions to this modified problem map all n objects into p + 1 points if we compute a representation in p dimensions. This means the solutions are only marginally dependent on the data and do not show sufficient detail to be interesting. From the data analysis point of view they are close to worthless.

In this paper we study another measure of size, which is quite different from the star-plot based ones. The size of a point set is defined as the length of its Euclidean minimum spanning tree, and our loss function is the sum of these lengths over all m trees. We continue to use the same normalization as in MCA, i.e. we only consider centered and orthonormal X.

2. ALGORITHM

The problem we study is to minimize

$$\sigma(X) = \sum_{j=1}^m \min_{W_j \in \mathcal{W}_j} \operatorname{tr} W_j D(X)$$

over $X'X = I$, where $\mathcal{W}_j$ is the set of adjacency matrices of the spanning trees of subset j, and $D(X)$ is the matrix of Euclidean distances between the rows of X. One obvious algorithm is to apply block relaxation [De Leeuw, 1994] to

$$\sigma(X, W_1, \ldots, W_m) = \sum_{j=1}^m \operatorname{tr} W_j D(X).$$

Thus we minimize our loss function by alternating minimization over X with minimization over the $W_j$. It is easy to deal with the subproblem of minimizing over $W_j \in \mathcal{W}_j$ for X fixed at its current value, because that is simply the computation of a minimum spanning tree, for which we can use Kruskal's or Prim's algorithm [Graham and Hell, 1985]. The problem of minimizing loss over normalized X for fixed $W_j$ is somewhat more complicated, but we can solve it using majorization [De Leeuw, 1994; Heiser, 1995; Lange et al., 2000].
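As a minimal sketch of the first block, the subproblem for one subset can be solved with the mst function from the ape package, as the appendix implementation does. The helper name mstForSubset and its interface are ours, and the sketch assumes the subset has at least two points.

# Illustrative sketch of the MST block: for fixed X, the optimal W_j is
# the adjacency matrix of the minimum spanning tree of subset j under
# the current interpoint distances.
library(ape)  # provides mst(): minimum spanning tree from a distance matrix

mstForSubset <- function(x, ind) {
  d <- as.matrix(dist(x[ind, ]))   # Euclidean distances within the subset
  w <- matrix(0, nrow(x), nrow(x)) # embed the tree in an n x n matrix
  w[ind, ind] <- mst(d)            # 0/1 adjacency matrix of the MST
  w
}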

We first regularize the problem to deal with the possibility of zero distances, because at such configurations our loss function is not differentiable. Suppose $\epsilon > 0$ is a fixed small number. For each pair of points i and k we define

$$d_{ik}(X, \epsilon) = \sqrt{(x_i - x_k)'(x_i - x_k) + \epsilon}.$$

Clearly this regularized distance is everywhere positive and continuously differentiable in X. As in Heiser [1987] we now apply the arithmetic-geometric mean inequality to obtain, for two different configurations X and Y,

$$d_{ik}(X, \epsilon) \leq \frac{1}{2}\,\frac{d_{ik}^2(X, \epsilon) + d_{ik}^2(Y, \epsilon)}{d_{ik}(Y, \epsilon)}.$$

We use the representation, which is standard in the multidimensional scaling literature,

$$d_{ik}^2(X, \epsilon) = d_{ik}^2(X) + \epsilon = \operatorname{tr} X'A_{ik}X + \epsilon,$$

where $A_{ik}$ is defined in terms of the unit vectors $e_i$ and $e_k$ as $A_{ik} = (e_i - e_k)(e_i - e_k)'$. Using this notation,

$$\sigma(X, \epsilon) = \sum_{j=1}^m \operatorname{tr} W_j D(X, \epsilon) \leq \frac{1}{2}\left(\operatorname{tr} X'B(Y, \epsilon)X + \operatorname{tr} Y'B(Y, \epsilon)Y\right) + \epsilon\,\phi(Y, \epsilon),$$

where we define the matrix

$$B(Y, \epsilon) = \sum_{j=1}^m \sum_{i=1}^n \sum_{k=1}^n \frac{w_{jik}}{d_{ik}(Y, \epsilon)}\, A_{ik}$$

and the scalar

$$\phi(Y, \epsilon) = \sum_{j=1}^m \sum_{i=1}^n \sum_{k=1}^n \frac{w_{jik}}{d_{ik}(Y, \epsilon)}.$$

Given an intermediate solution $X^{(\nu)}$, the majorization algorithm finds an update $X^{(\nu+1)}$ by minimizing $\operatorname{tr} X'B(X^{(\nu)}, \epsilon)X$ over $X'X = I$. Convergence follows from general results on majorization methods, and in this particular case we can use the fact that $\sigma(X, \epsilon)$ decreases to $\sigma(X)$ as $\epsilon$ decreases to zero to show, in addition, that letting $\epsilon \rightarrow 0$ gives convergence to the solution of the original non-regularized problem.
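A minimal sketch of this update step follows. It assumes (our convention, matching the appendix code) that w is the n x n sum of the current MST adjacency matrices over all m subsets, so that $B(Y,\epsilon)$ is a weighted Laplacian; the function name majorizeStep is illustrative.

# Illustrative majorization step: build B(Y, eps) from the current
# configuration y and summed tree adjacencies w, then minimize
# tr X'BX over X'X = I by taking eigenvectors of the p smallest
# non-trivial eigenvalues of B.
majorizeStep <- function(y, w, eps = 1e-10, p = 2) {
  n <- nrow(y)
  d <- as.matrix(dist(y))
  dreg <- sqrt(d^2 + eps)          # regularized distances d_ik(Y, eps)
  b <- -w / dreg                   # off-diagonal entries of B(Y, eps)
  diag(b) <- 0
  diag(b) <- -rowSums(b)           # Laplacian structure: rows sum to zero
  e <- eigen(b, symmetric = TRUE)
  # skip the constant eigenvector (eigenvalue 0); the result is centered
  e$vectors[, seq(n - 1, n - p)]
}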

In general we do not iterate the majorization algorithm, which finds an optimal X for given $W_j$, until convergence; we just take a single step before computing new minimum spanning trees. An implementation of the algorithm, in the R programming language, is given in the Appendix. For the MST calculation it depends on the ape package from the CRAN repository, written by Paradis, Strimmer, Claude, Jobb, Noel, and Bolker.

3. EXAMPLES

4. A MODIFICATION

The algorithm simplifies greatly if we define our minimum spanning trees by using the square of the Euclidean distance. In fact, we then need neither regularization nor majorization. We have

$$\sigma(X, W_1, \ldots, W_m) = \sum_{j=1}^m \operatorname{tr} W_j D^2(X) = \operatorname{tr} X'VX,$$

where

$$V = \sum_{j=1}^m \sum_{i=1}^n \sum_{k=1}^n w_{jik} A_{ik}.$$

Now minimizing over normalized X for fixed $W_j$ means finding the eigenvectors corresponding to the smallest non-zero eigenvalues of V, and computing the minimum over the $W_j$ for fixed X means computing the MST for the squared distances. Because there are only a finite number of spanning trees, and the algorithm stops if the loss does not decrease, it is clear that in this case we have convergence in a finite number of steps. This sounds much better than it is, because examples show that the finite convergence takes us to different stationary points from different random starts.

5. DISCUSSION

Many variations are possible. For instance, the size of a point set could be defined by the length of its shortest travelling salesman tour, or, as in Section 4, by the minimum spanning tree computed from the squared distances.
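To make the Section 4 simplification concrete, here is a minimal sketch of the X-update in the squared-distance variant. As before, the convention that wsum holds the n x n sum of the current MST adjacency matrices, and the name squaredStep, are ours.

# Illustrative sketch of the squared-distance variant: V is a Laplacian
# built from the tree adjacencies and does not involve the distances
# themselves, so no regularization or majorization is needed. We
# alternate this eigen-update with MST computations on D^2(X).
squaredStep <- function(wsum, p = 2) {
  n <- nrow(wsum)
  v <- diag(rowSums(wsum)) - wsum          # V = sum of w_jik * A_ik
  e <- eigen(v, symmetric = TRUE)
  # eigenvectors of the p smallest non-zero eigenvalues of V
  e$vectors[, seq(n - 1, n - p)]
}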

APPENDIX A. CODE

require(homals)
require(ape)

hommst <- function(g, pow = 1, eps = 1e-10) {
  n <- dim(g)[1]; m <- dim(g)[2]
  # random orthonormal start in two dimensions
  x <- svd(matrix(rnorm(2 * n), n, 2))$u
  sold <- Inf
  repeat {
    d <- eudist(x, pow, eps)
    c <- matrix(0, n, n); s <- 0.0
    # block one: minimum spanning tree for each variable
    for (j in 1:m) {
      tmp <- spanner(as.factor(g[, j]), d)
      s <- s + tmp$s; c <- c + tmp$c
    }
    # block two: one majorization (eigenvector) step for x
    b <- bmat(c, d, pow)
    x <- eigen(b)$vectors[, c(n - 1, n - 2)]
    print(s)
    # termination test (added here; the transcribed original had none)
    if (sold - s < eps) break
    sold <- s
  }
  for (j in 1:m)
    spanplot(as.factor(g[, j]), x, eudist(x, pow, eps))
  invisible(x)  # return the final configuration (also added)
}

spanner <- function(g, d) {
  lev <- levels(g); nn <- length(g)
  s <- 0.0; c <- matrix(0, nn, nn)
  for (k in lev) {
    ind <- which(k == g)
    n <- length(ind)
    if (n == 1) dd <- mm <- matrix(0, 1, 1)
    else {
      dd <- d[ind, ind]
      mm <- mst(dd)
    }
    c[ind, ind] <- mm
    s <- s + sum(dd * mm)
  }
  list(c = c, s = s)
}

spanplot <- function(g, x, d) {
  plot(x, col = "green", pch = 8)
  lev <- levels(g)
  rb <- rainbow(length(lev))
  for (k in lev) {
    ind <- which(k == g)
    n <- length(ind)
    if (n == 1) dd <- mm <- matrix(0, 1, 1)
    else {
      dd <- d[ind, ind]
      mm <- mst(dd)
    }
    # draw the edges of the tree for category k
    for (i in 1:n) {
      jnd <- which(1 == as.vector(mm[i, ]))
      sapply(jnd, function(r)
        lines(rbind(x[ind[i], ], x[ind[r], ]), col = rb[which(lev == k)]))
    }
  }
}

eudist <- function(x, pow, eps) {
  c <- crossprod(t(x)); s <- diag(c)
  dd <- outer(s, s, "+") - 2 * c
  if (pow == 2) return(dd) else return(sqrt(dd + eps))
}

bmat <- function(c, d, pow) {
  if (pow == 2) b <- c else b <- c / d
  r <- diag(rowSums(b))
  return(r - b)
}

REFERENCES

P. Bekker and J. De Leeuw. Relation between variants of nonlinear principal component analysis. In J.L.A. Van Rijckevorsel and J. De Leeuw, editors, Component and Correspondence Analysis. Wiley, Chichester, England, 1988.

J. De Leeuw. Block Relaxation Methods in Statistics. In H.H. Bock, W. Lenski, and M.M. Richter, editors, Information Systems and Data Analysis. Springer Verlag, Berlin, 1994.

J. De Leeuw. Homogeneity Analysis of Pavings. August 2003. URL http://jackman.stanford.edu/ideal/measurementconference/abstracts/hompeig.pdf.

J. De Leeuw and G. Michailides. Weber correspondence analysis: The one-dimensional case. Preprint 343, UCLA Department of Statistics, 2003. URL http://preprints.stat.ucla.edu/343/twopoints.pdf.

R.L. Graham and P. Hell. On the History of the Minimum Spanning Tree Problem. Annals of the History of Computing, 7:43-57, 1985.

W.J. Heiser. Correspondence analysis with least absolute residuals. Computational Statistics and Data Analysis, 5:337-356, 1987.

W.J. Heiser. Convergent Computing by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis. In W.J. Krzanowski, editor, Recent Advances in Descriptive Multivariate Analysis. Clarendon Press, Oxford, 1995.

K. Lange, D.R. Hunter, and I. Yang. Optimization Transfer Using Surrogate Objective Functions. Journal of Computational and Graphical Statistics, 9:1-20, 2000.

G. Michailides and J. De Leeuw. Homogeneity analysis using absolute deviations. Preprint 346, UCLA Department of Statistics, 2003. URL http://preprints.stat.ucla.edu/346/paper3.pdf.

J.L.A. Van Rijckevorsel. The application of fuzzy coding and horseshoes in multiple correspondence analysis. PhD thesis, University of Leiden, The Netherlands, 1987. Also published in 1987 by DSWO-Press, Leiden, The Netherlands.

B.F. Schriever. Order Dependence. PhD thesis, University of Amsterdam, The Netherlands, 1985. Also published in 1985 by CWI, Amsterdam, The Netherlands.

DEPARTMENT OF STATISTICS, UNIVERSITY OF CALIFORNIA, LOS ANGELES, CA 90095-1554
E-mail address, Jan de Leeuw: deleeuw@stat.ucla.edu
URL, Jan de Leeuw: http://gifi.stat.ucla.edu