An Information Geometrical View of Stationary Subspace Analysis


Motoaki Kawanabe, Wojciech Samek (née Wojcikiewicz), Paul von Bünau, and Frank C. Meinecke

Fraunhofer Institute FIRST, Kekuléstr. 7, 12489 Berlin, Germany
Berlin Institute of Technology, Franklinstr. 28/29, 10587 Berlin, Germany
motoaki.kawanabe@first.fraunhofer.de, wojwoj@mail.tu-berlin.de, buenau@cs.tu-berlin.de, frank.meinecke@tu-berlin.de

Abstract. Stationary Subspace Analysis (SSA) [3] is an unsupervised learning method that finds subspaces in which data distributions stay invariant over time. It has been shown to be very useful for studying non-stationarities in various applications [5, 10, 4, 9]. In this paper, we present the first SSA algorithm based on a full generative model of the data. This new derivation relates SSA to previous work on finding interesting subspaces from high-dimensional data in a similar way as the three easy routes to independent component analysis [6], and provides an information geometric view.

Keywords: stationary subspace analysis, generative model, maximum likelihood estimation, Kullback-Leibler divergence, information geometry

1 Introduction

Finding subspaces which contain interesting structures in high-dimensional data is an important preprocessing step for efficient information processing. A classical example is Principal Component Analysis (PCA), which extracts low-dimensional subspaces that capture as much variance of the observations as possible. In contrast, more recent approaches consider criteria other than maximum variance. For instance, non-Gaussian component analysis (NGCA) [2, 7], a general semi-parametric framework including projection pursuit [8], extracts non-Gaussian structures, e.g. subspaces which reveal clusters or heavy-tailed distributions. Another recent direction is colored subspace analysis [13], which searches for linear projections having temporal correlations. What all these methods have in common is that i.i.d. (white) Gaussian random variables are considered as noise, and each method seeks to maximize a specific deviation from this noise model in order to uncover an informative subspace. This fact reminded us of the thought-provoking paper [6] showing three easy routes to Independent Component Analysis (ICA): independent components can be extracted by using non-Gaussianity, non-whiteness of the spectrum, or non-stationarity of the variance [11].

The goal of this paper is to discuss projection methods corresponding to the third route, i.e. methods based on finding non-stationary structures. Recently, von Bünau et al. [3] proposed a novel technique called Stationary Subspace Analysis (SSA) which finds the low-dimensional projections having stationary distributions from high-dimensional observations. Such an analysis is instrumental whenever the observed data is conceivably generated as a mixture of underlying stationary and non-stationary components: a single non-stationary source mixed into the observed signals can make the whole data appear non-stationary, and non-stationary sources with low power can remain hidden among strong stationary components. Uncovering these two groups of sources is therefore a key step towards understanding the data. Stationary Subspace Analysis has been applied successfully to biomedical signal processing [5], computer vision [10], high-dimensional change point detection [4] and domain adaptation problems [9]. However, the SSA algorithm [3] is based on finding the stationary sources and does not model the non-stationary components; i.e., unlike [11], it is not derived from a generative model of the observed data. In this paper, we assume a linear mixing model with uncorrelated stationary and non-stationary sources and compute the maximum likelihood estimators of the projections onto both subspaces. That is, we also explicitly model the non-stationary part. The objective function turns out to be a combination of the original SSA objective and a term which penalizes cross-covariances between the stationary and non-stationary sources.

The remainder of this paper is organized as follows. After explaining our model assumptions in Section 2, we derive the objective function of the maximum likelihood approach in Section 3 and provide a geometrical illustration. Then, we present the results of the numerical simulations in Section 4.

2 Block Gaussian Model for SSA

In the SSA model [3], we assume that the system of interest consists of d stationary source signals $s^s(t) = [s_1(t), s_2(t), \dots, s_d(t)]^\top$ (called s-sources) and $D-d$ non-stationary source signals $s^n(t) = [s_{d+1}(t), s_{d+2}(t), \dots, s_D(t)]^\top$ (called n-sources), where the observed signals x(t) are a linear superposition of the sources,

$$x(t) = A\, s(t) = \begin{bmatrix} A^s & A^n \end{bmatrix} \begin{bmatrix} s^s(t) \\ s^n(t) \end{bmatrix}, \tag{1}$$

and A is an invertible matrix. Note that we do not assume that the sources s(t) are independent. We refer to the spaces spanned by $A^s$ and $A^n$ as the s- and n-space respectively. The goal is to factorize the observed signals x(t) according to Eq. (1), i.e. to find a linear transformation $\hat{A}^{-1}$ that separates the s-sources from the n-sources. Given this model, the s-sources and the n-space are uniquely identifiable, whereas the n-sources and the s-space are not (see [3] for details). In order to invert the mixing of stationary and non-stationary sources, we divide the time series x(t) of length T into L epochs defined by the index sets $\mathcal{T}_1, \dots, \mathcal{T}_L$.
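To make the generative model concrete, the following minimal sketch (Python/NumPy; the variable names and the specific sampling choices are ours, not the paper's) draws data according to Eqs. (1)-(2): the s-source moments stay fixed, the n-source moments are redrawn in every epoch, and both groups are mixed by a random invertible matrix A.

    import numpy as np

    rng = np.random.default_rng(0)
    d, D, L, n_per_epoch = 2, 5, 10, 500   # stationary dim, total dim, epochs, epoch length

    A = rng.standard_normal((D, D))        # random mixing matrix, invertible a.s.

    mu_s = rng.standard_normal(d)          # s-source moments: constant over time
    Cs = rng.standard_normal((d, d))
    Sigma_s = Cs @ Cs.T + 0.1 * np.eye(d)

    epochs = []
    for ell in range(L):
        mu_n = rng.standard_normal(D - d)  # n-source moments: redrawn per epoch
        Cn = rng.standard_normal((D - d, D - d))
        Sigma_n = Cn @ Cn.T + 0.1 * np.eye(D - d)
        s_s = rng.multivariate_normal(mu_s, Sigma_s, size=n_per_epoch)
        s_n = rng.multivariate_normal(mu_n, Sigma_n, size=n_per_epoch)
        s = np.hstack([s_s, s_n])          # group-wise uncorrelated sources
        epochs.append(s @ A.T)             # x(t) = A s(t), Eq. (1)

    x = np.vstack(epochs)                  # observed time series of length T = L * n_per_epoch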

Even though the SSA model includes both the stationary and the non-stationary sources, the SSA algorithm [3] optimizes only the stationarity of the estimated stationary sources and does not optimize the non-stationary source estimates: the approach is to find the projection such that the mean and covariance in each epoch are as close as possible to the average mean and covariance over all epochs. In this paper, we derive a maximum likelihood estimator for the SSA model under the following additional assumptions:

- The sources s(t) are independent in time.
- The s-sources $s^s(t)$ are drawn from a d-dimensional Gaussian $\mathcal{N}(\mu^s, \Sigma^s)$, which is constant in time.
- In each epoch $\ell$, the n-sources $s^n(t)$ are drawn from a $(D-d)$-dimensional Gaussian $\mathcal{N}(\mu^n_\ell, \Sigma^n_\ell)$, i.e. their distribution varies over epochs.
- The s- and n-sources are group-wise uncorrelated.

Thus the true distribution of the observed signals x(t) in epoch $\ell$ is $\mathcal{N}(m_\ell, R_\ell)$, where $m_\ell = A\mu_\ell$, $R_\ell = A\Sigma_\ell A^\top$ and

$$\mu_\ell = \begin{bmatrix} \mu^s \\ \mu^n_\ell \end{bmatrix}, \qquad \Sigma_\ell = \begin{bmatrix} \Sigma^s & 0 \\ 0 & \Sigma^n_\ell \end{bmatrix}. \tag{2}$$

The unknown parameters of the model are the mixing matrix A, the mean and covariance $(\mu^s, \Sigma^s)$ of the s-sources, and those of the n-sources $\{(\mu^n_\ell, \Sigma^n_\ell)\}_{\ell=1}^{L}$.

3 Maximum Likelihood and its Information Geometric Interpretation

In this section, we derive the objective function of the maximum likelihood SSA (MLSSA). Under the block Gaussian model, the negative log likelihood to be minimized becomes

$$\mathcal{L}_{\mathrm{ML}} = -\sum_{\ell=1}^{L}\sum_{t\in\mathcal{T}_\ell}\log p\big(x(t)\,\big|\,A,\mu^s,\mu^n_\ell,\Sigma^s,\Sigma^n_\ell\big) = T\log|\det A| + \frac{TD}{2}\log 2\pi + \frac{1}{2}\sum_{\ell=1}^{L}\sum_{t\in\mathcal{T}_\ell}\Big\{\log\det\Sigma_\ell + \mathrm{Tr}\big[\Sigma_\ell^{-1}A^{-1}\big(x(t)-m_\ell\big)\big(x(t)-m_\ell\big)^\top A^{-\top}\big]\Big\}. \tag{3}$$

It is known that maximum likelihood estimation can be regarded as a minimum Kullback-Leibler divergence (KLD) projection [1]. Let

$$\bar{x}_\ell = \frac{1}{|\mathcal{T}_\ell|}\sum_{t\in\mathcal{T}_\ell} x(t), \qquad \hat{R}_\ell = \frac{1}{|\mathcal{T}_\ell|}\sum_{t\in\mathcal{T}_\ell}\big(x(t)-\bar{x}_\ell\big)\big(x(t)-\bar{x}_\ell\big)^\top \tag{4}$$

be the sample mean and covariance of the $\ell$-th epoch, where $|\mathcal{T}_\ell|$ denotes its length.
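Both the epoch statistics of Eq. (4) and the Gaussian KL divergence underlying the projection view are simple to compute. A minimal sketch (Python/NumPy; the function names are ours) that the illustrations below reuse:

    import numpy as np

    def epoch_moments(x_epoch):
        """Sample mean and covariance of one epoch, Eq. (4)."""
        xbar = x_epoch.mean(axis=0)
        xc = x_epoch - xbar
        return xbar, xc.T @ xc / x_epoch.shape[0]

    def kl_gauss(m0, S0, m1, S1):
        """D_KL[ N(m0, S0) || N(m1, S1) ] between multivariate Gaussians."""
        dm = m1 - m0
        S1_inv = np.linalg.inv(S1)
        return 0.5 * (np.trace(S1_inv @ S0) + dm @ S1_inv @ dm - len(m0)
                      + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

In these terms, Eq. (5) below states that, up to an additive constant, the negative log likelihood is the $|\mathcal{T}_\ell|$-weighted sum of kl_gauss(x̄_ℓ, R̂_ℓ, m_ℓ, R_ℓ) over the epochs.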

Indeed, the likelihood $\mathcal{L}_{\mathrm{ML}}$ can be expressed as

$$\mathcal{L}_{\mathrm{ML}} = \sum_{\ell=1}^{L} |\mathcal{T}_\ell|\; D_{\mathrm{KL}}\big[\,\mathcal{N}(\bar{x}_\ell,\hat{R}_\ell)\,\big\|\,\mathcal{N}(m_\ell,R_\ell)\,\big] + \mathrm{const.} \tag{5}$$

In the following, we derive the estimators of the moment parameters explicitly as functions of the mixing matrix A, which leads to an objective function for estimating A. Since the KL-divergence is invariant under linear transformations, the divergence between the two distributions of the observation x equals that between the corresponding distributions of the sources $s = A^{-1}x$, i.e.

$$D_{\mathrm{KL}}\big[\,\mathcal{N}(\bar{x}_\ell,\hat{R}_\ell)\,\big\|\,\mathcal{N}(m_\ell,R_\ell)\,\big] = D_{\mathrm{KL}}\big[\,\mathcal{N}\big(A^{-1}\bar{x}_\ell,\,A^{-1}\hat{R}_\ell A^{-\top}\big)\,\big\|\,\mathcal{N}(\mu_\ell,\Sigma_\ell)\,\big]. \tag{6}$$

The central idea of this paper is to regard the KL-divergence as a distance measure in the space of Gaussian probability distributions, which leads to an information geometric viewpoint of the SSA objective. The divergence in Equation (6) is the distance between the empirical distribution of the demixed signals and the true underlying sources. According to our assumptions, the true model lies on a manifold $\mathcal{M}$ defined by Equation (2), i.e.

$$\mathcal{M} := \left\{\, \mathcal{N}(\mu,\Sigma) \;\middle|\; \Sigma = \begin{bmatrix} \Sigma^s & 0 \\ 0 & \Sigma^n \end{bmatrix} \right\}. \tag{7}$$

Therefore it is convenient to split the divergence (6) into the following two parts:

1. an orthogonal projection $D^\ell_1$ onto the manifold $\mathcal{M}$,
$$D^\ell_1 = D_{\mathrm{KL}}\big[\,\mathcal{N}\big(A^{-1}\bar{x}_\ell,\,A^{-1}\hat{R}_\ell A^{-\top}\big)\,\big\|\,\mathcal{N}(\tilde{\mu}_\ell,\tilde{\Sigma}_\ell)\,\big],$$
2. and a component $D^\ell_2$ within the manifold,
$$D^\ell_2 = D_{\mathrm{KL}}\big[\,\mathcal{N}(\tilde{\mu}_\ell,\tilde{\Sigma}_\ell)\,\big\|\,\mathcal{N}(\mu_\ell,\Sigma_\ell)\,\big].$$

This decomposition is illustrated in Figure 1. It is also known as the generalized Pythagorean theorem in information geometry [1]. The orthogonal projection onto the manifold $\mathcal{M}$ is given by

$$\tilde{\mu}_\ell = \begin{bmatrix} \tilde{\mu}^s_\ell \\ \tilde{\mu}^n_\ell \end{bmatrix} = \begin{bmatrix} (A^{-1})^s\,\bar{x}_\ell \\ (A^{-1})^n\,\bar{x}_\ell \end{bmatrix} = A^{-1}\bar{x}_\ell, \tag{8}$$

$$\tilde{\Sigma}_\ell = \begin{bmatrix} (A^{-1})^s\,\hat{R}_\ell\,\{(A^{-1})^s\}^\top & 0 \\ 0 & (A^{-1})^n\,\hat{R}_\ell\,\{(A^{-1})^n\}^\top \end{bmatrix}, \tag{9}$$

where $(A^{-1})^s$ and $(A^{-1})^n$ are the projection matrices to the stationary and non-stationary sources respectively. The projection onto the manifold $\mathcal{M}$ does not affect the mean; it merely sets the off-diagonal blocks of the covariance matrix to zero.
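The decomposition can be checked numerically: the projection of Eqs. (8)-(9) keeps the mean and zeroes the off-diagonal covariance blocks, and the generalized Pythagorean relation then holds exactly for any target distribution on M. A self-contained sketch (Python/NumPy; random moments, our own naming):

    import numpy as np

    rng = np.random.default_rng(1)
    d, D = 2, 5

    def kl(m0, S0, m1, S1):
        dm = m1 - m0
        S1i = np.linalg.inv(S1)
        return 0.5 * (np.trace(S1i @ S0) + dm @ S1i @ dm - len(m0)
                      + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

    # empirical moments of the demixed signals (left argument in Eq. (6))
    mu_p = rng.standard_normal(D)
    Cp = rng.standard_normal((D, D))
    S_p = Cp @ Cp.T + 0.1 * np.eye(D)

    # projection onto M, Eqs. (8)-(9): same mean, off-diagonal blocks set to zero
    S_proj = S_p.copy()
    S_proj[:d, d:] = 0.0
    S_proj[d:, :d] = 0.0

    # an arbitrary model on M (block diagonal covariance, arbitrary mean)
    mu_q = rng.standard_normal(D)
    C1 = rng.standard_normal((d, d))
    C2 = rng.standard_normal((D - d, D - d))
    S_q = np.zeros((D, D))
    S_q[:d, :d] = C1 @ C1.T + 0.1 * np.eye(d)
    S_q[d:, d:] = C2 @ C2.T + 0.1 * np.eye(D - d)

    total = kl(mu_p, S_p, mu_q, S_q)       # total divergence, Eq. (6)
    d1 = kl(mu_p, S_p, mu_p, S_proj)       # D_1: onto the manifold
    d2 = kl(mu_p, S_proj, mu_q, S_q)       # D_2: within the manifold
    print(np.allclose(total, d1 + d2))     # True: generalized Pythagorean theorem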

[Figure 1: the empirical distribution $\mathcal{N}(A^{-1}\bar{x}_\ell, A^{-1}\hat{R}_\ell A^{-\top})$ is projected ($D^\ell_1$) onto the manifold $\mathcal{M}$ at $\mathcal{N}(\tilde{\mu}_\ell, \tilde{\Sigma}_\ell)$; within $\mathcal{M}$, $D^\ell_2$ splits into $D^{\ell,s}_2$ and $D^{\ell,n}_2$ towards the true model $\mathcal{N}(\mu_\ell, \Sigma_\ell)$.]

Fig. 1. A geometrical view of the divergence decomposition. The divergence $D^\ell_1$ corresponds to the orthogonal projection onto the manifold of models, and $D^\ell_2$ is the divergence within the manifold, which can be further decomposed according to the block diagonal structure of the source covariance matrix.

Conversely, on the manifold $\mathcal{M}$, only the diagonal blocks vary; thus the above projection is orthogonal with respect to the Fisher information metric. For more details, see [1]. Since on $\mathcal{M}$ the stationary and non-stationary sources are independent by definition, we can further decompose the distance $D^\ell_2$ from $\mathcal{N}(\tilde{\mu}_\ell, \tilde{\Sigma}_\ell)$ to the true distribution $\mathcal{N}(\mu_\ell, \Sigma_\ell)$ into two independent parts,

$$D^\ell_2 = D^{\ell,s}_2 + D^{\ell,n}_2 = D_{\mathrm{KL}}\big[\,\mathcal{N}(\tilde{\mu}^s_\ell,\tilde{\Sigma}^s_\ell)\,\big\|\,\mathcal{N}(\mu^s,\Sigma^s)\,\big] + D_{\mathrm{KL}}\big[\,\mathcal{N}(\tilde{\mu}^n_\ell,\tilde{\Sigma}^n_\ell)\,\big\|\,\mathcal{N}(\mu^n_\ell,\Sigma^n_\ell)\,\big].$$

This decomposition leads to the following expression for the likelihood,

$$\mathcal{L}_{\mathrm{ML}} = \sum_{\ell=1}^{L}|\mathcal{T}_\ell|\Big\{ D_{\mathrm{KL}}\big[\,\mathcal{N}\big(A^{-1}\bar{x}_\ell, A^{-1}\hat{R}_\ell A^{-\top}\big)\,\big\|\,\mathcal{N}(\tilde{\mu}_\ell,\tilde{\Sigma}_\ell)\,\big] + D_{\mathrm{KL}}\big[\,\mathcal{N}(\tilde{\mu}^s_\ell,\tilde{\Sigma}^s_\ell)\,\big\|\,\mathcal{N}(\mu^s,\Sigma^s)\,\big] + D_{\mathrm{KL}}\big[\,\mathcal{N}(\tilde{\mu}^n_\ell,\tilde{\Sigma}^n_\ell)\,\big\|\,\mathcal{N}(\mu^n_\ell,\Sigma^n_\ell)\,\big]\Big\}. \tag{10}$$

The moment parameters $(\mu^s, \Sigma^s)$ of the s-sources appear only in the second term, while those of the n-sources $\{(\mu^n_\ell, \Sigma^n_\ell)\}_{\ell=1}^{L}$ are included only in the third term. Therefore, for fixed A, the moment estimators (denoted with hats) minimizing the likelihood $\mathcal{L}_{\mathrm{ML}}$ can be calculated from $\{(\tilde{\mu}_\ell, \tilde{\Sigma}_\ell)\}_{\ell=1}^{L}$:

- For the n-sources, $(\hat{\mu}^n_\ell, \hat{\Sigma}^n_\ell) = (\tilde{\mu}^n_\ell, \tilde{\Sigma}^n_\ell)$, which means that $D^{\ell,n}_2 = 0$.
- The mean $\hat{\mu}^s$ of the s-sources is the weighted average $\frac{1}{T}\sum_{\ell=1}^{L}|\mathcal{T}_\ell|\,\tilde{\mu}^s_\ell$, or equivalently $(A^{-1})^s\bar{x}$, where $\bar{x} = \frac{1}{T}\sum_{t=1}^{T}x(t)$ is the global mean.
- The covariance $\hat{\Sigma}^s$ of the s-sources is the projection of the global covariance $\hat{R} = \frac{1}{T}\sum_{t=1}^{T}(x(t)-\bar{x})(x(t)-\bar{x})^\top$ onto the s-space, i.e. $\hat{\Sigma}^s = (A^{-1})^s\,\hat{R}\,\{(A^{-1})^s\}^\top$.
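For a fixed demixing matrix these three closed-form estimators can be written down directly; a sketch (Python/NumPy; epoch_index_sets is a list of integer index arrays, A_inv[:d] and A_inv[d:] stand for $(A^{-1})^s$ and $(A^{-1})^n$, and the names are ours):

    import numpy as np

    def moment_estimators(x, epoch_index_sets, A_inv, d):
        """ML moment estimators for fixed A^{-1} (rows 0..d-1 = s-projection)."""
        xbar = x.mean(axis=0)                              # global mean
        xc = x - xbar
        R_hat = xc.T @ xc / x.shape[0]                     # global covariance

        mu_s_hat = A_inv[:d] @ xbar                        # projected global mean
        Sigma_s_hat = A_inv[:d] @ R_hat @ A_inv[:d].T      # projected global covariance

        n_moments = []                                     # per-epoch n-source moments
        for idx in epoch_index_sets:
            xb = x[idx].mean(axis=0)
            xcl = x[idx] - xb
            R_l = xcl.T @ xcl / len(idx)
            n_moments.append((A_inv[d:] @ xb, A_inv[d:] @ R_l @ A_inv[d:].T))
        return mu_s_hat, Sigma_s_hat, n_moments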

By substituting these moment estimators, we obtain an objective function that determines the mixing matrix A,

$$\mathcal{L}_{\mathrm{ML}}(A) = \sum_{\ell=1}^{L}|\mathcal{T}_\ell|\Big\{ D_{\mathrm{KL}}\big[\,\mathcal{N}\big(A^{-1}\bar{x}_\ell, A^{-1}\hat{R}_\ell A^{-\top}\big)\,\big\|\,\mathcal{N}(\tilde{\mu}_\ell,\tilde{\Sigma}_\ell)\,\big] + D_{\mathrm{KL}}\big[\,\mathcal{N}(\tilde{\mu}^s_\ell,\tilde{\Sigma}^s_\ell)\,\big\|\,\mathcal{N}(\hat{\mu}^s,\hat{\Sigma}^s)\,\big]\Big\}. \tag{11}$$

Moreover, since the solution is undetermined up to scaling, sign and linear transformations within the s- and n-space, we set the estimated s-sources to zero mean and unit variance by centering and whitening the data, i.e. we write the estimated demixing matrix as $\hat{A}^{-1} = \hat{B}\hat{W}$, where $\hat{W} = \hat{R}^{-1/2}$ is a whitening matrix and $\hat{B}$ is an orthogonal matrix. Let $\bar{y}_\ell = \hat{W}\bar{x}_\ell$ and $\hat{V}_\ell = \hat{W}\hat{R}_\ell\hat{W}^\top$ be the mean and covariance of the centered and whitened data $y(t) = \hat{W}x(t)$ in the $\ell$-th epoch. Then the objective function becomes

$$\mathcal{L}_{\mathrm{ML}}(B) = \sum_{\ell=1}^{L}|\mathcal{T}_\ell|\Big\{ D_{\mathrm{KL}}\big[\,\mathcal{N}\big(B\bar{y}_\ell,\, B\hat{V}_\ell B^\top\big)\,\big\|\,\mathcal{N}\big(B\bar{y}_\ell,\ \mathrm{b\text{-}diag}\big[B^s\hat{V}_\ell (B^s)^\top,\ B^n\hat{V}_\ell (B^n)^\top\big]\big)\,\big] + D_{\mathrm{KL}}\big[\,\mathcal{N}\big(B^s\bar{y}_\ell,\ B^s\hat{V}_\ell (B^s)^\top\big)\,\big\|\,\mathcal{N}(0,\, I_d)\,\big]\Big\}, \tag{12}$$

where $B = [(B^s)^\top, (B^n)^\top]^\top$ and b-diag denotes the block diagonal matrix with the given sub-matrices. The second term coincides with the original SSA objective. The first term comes from the orthogonality assumption and can be regarded as a joint block diagonalization criterion. The implication of this joint objective is illustrated in Section 4. The objective function (12) can be further simplified to

$$\mathcal{L}_{\mathrm{ML}}(B) = \frac{1}{2}\sum_{\ell=1}^{L}|\mathcal{T}_\ell|\Big\{ \log\det\big(\tilde{\Sigma}^n_\ell\big) - \log\det\big(\bar{\Sigma}_\ell\big) + \mathrm{Tr}\big(\tilde{\Sigma}^s_\ell\big) + \big(\tilde{\mu}^s_\ell\big)^\top\tilde{\mu}^s_\ell \Big\} + \mathrm{const.}, \tag{13}$$

where $\tilde{\mu}^s_\ell = B^s\bar{y}_\ell$, $\tilde{\Sigma}^s_\ell = B^s\hat{V}_\ell (B^s)^\top$, $\tilde{\Sigma}^n_\ell = B^n\hat{V}_\ell (B^n)^\top$ and $\bar{\Sigma}_\ell := B\hat{V}_\ell B^\top = \hat{A}^{-1}\hat{R}_\ell\hat{A}^{-\top}$. As in the SSA algorithm [3], we optimize the objective function (12) by gradient descent, taking into account the Lie group structure of the set of orthogonal matrices [12].
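For fixed epoch statistics of the whitened data, the simplified objective can be evaluated for any orthogonal matrix B. The following sketch (Python/NumPy) implements Eq. (13) as reconstructed above; the function signature and names are ours:

    import numpy as np

    def mlssa_objective(B, ybars, V_hats, weights, d):
        """Eq. (13), up to additive constants, for an orthogonal demixing matrix B.
        ybars/V_hats: per-epoch means and covariances of the whitened data;
        weights: the epoch lengths |T_l|."""
        Bs, Bn = B[:d], B[d:]
        val = 0.0
        for w, yb, V in zip(weights, ybars, V_hats):
            mu_s = Bs @ yb                                  # tilde-mu^s
            Sig_s = Bs @ V @ Bs.T                           # tilde-Sigma^s
            Sig_n = Bn @ V @ Bn.T                           # tilde-Sigma^n
            logdet_bar = np.linalg.slogdet(B @ V @ B.T)[1]  # constant for orthogonal B
            val += 0.5 * w * (np.linalg.slogdet(Sig_n)[1] - logdet_bar
                              + np.trace(Sig_s) + mu_s @ mu_s)
        return val

A gradient step that preserves orthogonality can then be taken multiplicatively, e.g. $B \leftarrow \exp(\varepsilon Z)\,B$ with an antisymmetric matrix Z, in the spirit of the Lie group methods of [12].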

4 Numerical Example

We compare the performance of our novel method with the original SSA algorithm [3]. In particular, we study the case where the epoch covariance matrices are exactly block diagonalizable, as this matches the assumptions of our method.

In Figure 2, we compare MLSSA and SSA over varying numbers of epochs using 5 stationary sources and a 20-dimensional input space, i.e. d = 5 and D = 20. For each epoch we randomly sample a block diagonalizable covariance matrix and a mean vector and mix them with a random, but well-conditioned matrix A. Note that no estimation error is introduced, as we sample both moments directly. As performance measure we use the median angle to the true n-subspace over 100 trials and represent the 25% and 75% quantiles by error bars. For each trial, in order to avoid local minima, we restart the optimization procedure five times and select the solution with the lowest objective function value. From the results we see that our method yields a much lower error than the original approach, especially when a small number of epochs is used. For instance, with MLSSA (red line) 10 epochs are sufficient to achieve negligible errors (around 0.01 degrees), whereas the original SSA algorithm (blue line) requires more than 30 epochs. Although it seems that, thanks to the first term in Eq. (12), i.e. the joint block diagonalization, MLSSA converges much faster than the original method, more experiments will be needed to obtain a full picture.

[Figure 2: angle to the true n-subspace in degrees (log scale, 0.01 to 90) versus the number of epochs (2 to 50); one curve for SSA, one for MLSSA.]

Fig. 2. Performance comparison between the proposed method (MLSSA) and the original SSA for a varying number of epochs. The red curve shows the performance of our method, measured as the median angle to the true n-subspace over 100 trials, whereas the blue curve shows the results obtained with SSA. The 25% and 75% quantiles are represented by the error bars. Note that we use randomly sampled block diagonalizable covariance matrices in our experiments and set the dimensionality of the stationary subspace to 5 (out of 20 dimensions). Our proposed method significantly outperforms SSA in this example, especially when fewer than 30 epochs are used. MLSSA clearly exploits (via the first term in Eq. (12)) the block diagonalizable structure of the covariance matrices and thus reconstructs the true stationary subspace much faster (i.e. with fewer epochs) than SSA.
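The building blocks of this simulation can be sketched as follows (Python/NumPy). The paper fixes d = 5, D = 20, the exact block diagonal structure and the well-conditioned mixing; the concrete sampling distributions below are our assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    d, D = 5, 20

    # well-conditioned mixing matrix: orthogonal factors, singular values in [0.5, 2]
    U, _ = np.linalg.qr(rng.standard_normal((D, D)))
    Vt, _ = np.linalg.qr(rng.standard_normal((D, D)))
    A = U @ np.diag(rng.uniform(0.5, 2.0, D)) @ Vt.T

    def random_epoch_moments():
        """Exact (noise-free) epoch moments with block diagonal source covariance."""
        Cs = rng.standard_normal((d, d))
        Cn = rng.standard_normal((D - d, D - d))
        Sigma = np.zeros((D, D))
        Sigma[:d, :d] = Cs @ Cs.T + 0.1 * np.eye(d)       # s-block
        Sigma[d:, d:] = Cn @ Cn.T + 0.1 * np.eye(D - d)   # n-block
        mu = np.concatenate([np.zeros(d),                 # s-mean held fixed (zero)
                             rng.standard_normal(D - d)]) # n-mean varies per epoch
        return A @ mu, A @ Sigma @ A.T                    # moments in observation space

    def angle_to_subspace(basis_est, basis_true):
        """Largest principal angle (degrees) between two subspaces given as bases."""
        Qe, _ = np.linalg.qr(basis_est)
        Qt, _ = np.linalg.qr(basis_true)
        s = np.linalg.svd(Qe.T @ Qt, compute_uv=False)
        return np.degrees(np.arccos(np.clip(s.min(), 0.0, 1.0)))

    # e.g. the true n-space is spanned by A[:, d:]; an estimated basis is scored via
    # angle_to_subspace(estimated_basis, A[:, d:]).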

5 Conclusions

In this paper, we developed the first generative version of SSA. The generative model is a block Gaussian model, and the objective function of the maximum likelihood approach turns out to be a combination of the original SSA objective and a joint block diagonalization term. This new derivation not only aids the theoretical understanding of the procedure within the information geometrical framework; the resulting algorithm also yields competitive results. Moreover, the likelihood formulation allows future extensions towards model selection methods and the incorporation of prior knowledge.

Acknowledgement. The authors acknowledge K.-R. Müller, F. J. Király and D. Blythe for valuable discussions.

References

1. Amari, S.: Differential-geometrical methods in statistics. Lecture Notes in Statistics, Springer-Verlag, Berlin (1985)
2. Blanchard, G., Sugiyama, M., Kawanabe, M., Spokoiny, V., Müller, K.R.: In search of non-Gaussian components of a high-dimensional distribution. Journal of Machine Learning Research 7, 247–282 (2006)
3. von Bünau, P., Meinecke, F.C., Király, F., Müller, K.R.: Finding stationary subspaces in multivariate time series. Physical Review Letters 103, 214101 (2009)
4. von Bünau, P., Meinecke, F.C., Müller, J.S., Lemm, S., Müller, K.R.: Boosting high-dimensional change point detection with stationary subspace analysis. In: Workshop on Temporal Segmentation at NIPS 2009 (2009)
5. von Bünau, P., Meinecke, F.C., Scholler, S., Müller, K.R.: Finding stationary brain sources in EEG data. In: Proceedings of the 32nd Annual Conference of the IEEE EMBS. pp. 2810–2813 (2010)
6. Cardoso, J.F.: The three easy routes to independent component analysis; contrasts and geometry. In: Proc. ICA 2001. pp. 1–6 (2001)
7. Diederichs, E., Juditsky, A., Spokoiny, V., Schütte, C.: Sparse non-Gaussian component analysis. IEEE Trans. Inform. Theory 56, 3033–3047 (2010)
8. Friedman, J.H., Tukey, J.W.: A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Computers 23, 881–890 (1974)
9. Hara, S., Kawahara, Y., Washio, T., von Bünau, P.: Stationary subspace analysis as a generalized eigenvalue problem. In: Proc. of the 17th International Conference on Neural Information Processing: Theory and Algorithms, Volume Part I. pp. 422–429. ICONIP'10, Springer-Verlag (2010)
10. Meinecke, F., von Bünau, P., Kawanabe, M., Müller, K.R.: Learning invariances with stationary subspace analysis. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops). pp. 87–92 (2009)
11. Pham, D.T., Cardoso, J.F.: Blind separation of instantaneous mixtures of non-stationary sources. In: Proc. ICA 2000. pp. 187–192. Helsinki, Finland (2000)
12. Plumbley, M.D.: Geometrical methods for non-negative ICA: manifolds, Lie groups and toral subalgebras. Neurocomputing 67, 161–197 (2005)
13. Theis, F.: Colored subspace analysis: Dimension reduction based on a signal's autocorrelation structure. IEEE Trans. Circuits & Systems I 57(7), 1463–1474 (2010)