
Dynamic Data Factorization

Stefano Soatto and Alessandro Chiuso
Department of Computer Science, UCLA, Los Angeles, CA 90095
Department of Electrical Engineering, Washington University, St. Louis, MO 63130
Dipartimento di Elettronica ed Informatica, Università di Padova, Italy

Keywords: canonical correlation, model reduction, balanced realization, subspace identification, dynamic scene analysis, minimum description length, image sequence compression, video coding, generative model, prediction error methods, ARMA regression, system identification, learning

Technical Report UCLA-CSD 010001, March 6, 2000

Abstract

We present a technique for reducing the dimensionality of a stationary time series with an arbitrary covariance sequence. This problem is solved in closed form in a way that is asymptotically efficient (i.e. optimal in the sense of maximum likelihood). Our technique can be seen as a special case of a more general theory of subspace identification borrowed from the field of dynamical systems. Although our scheme only captures second-order statistics, we suggest extensions to model higher-order moments.

1 Preliminaries

Let $\{y(t)\}_{t=1,2,\dots}$ be a discrete-time stochastic process with sample paths $y(t) \in \mathbb{R}^m$. We are interested in reducing the dimensionality of the data (i.e. of a sample path of dimension $m$) by extracting a model for the process. We restrict our attention to second-order statistics and assume that the process generates a covariance sequence with rational spectrum. Later we will extend our approach to a more general class of models.

1.1 Model realization

It is well known that a positive definite covariance sequence with rational spectrum corresponds to an equivalence class of second-order stationary processes [10]. It is then possible to choose as a representative of each class a Gauss-Markov model, that is, the output of a linear dynamical system driven by white, zero-mean Gaussian noise with the given covariance. In other words, we can assume that there exist a positive integer $n$, a process $x(t) \in \mathbb{R}^n$ (the "state") with initial condition $x(0) = x_0$, and a symmetric positive semi-definite matrix $\begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} \succeq 0$ such that $y(t)$ is the output of the following Gauss-Markov ARMA model^1:

$$\begin{cases} x(t+1) = A\, x(t) + v(t), & v(t) \sim \mathcal{N}(0, Q), \quad x(0) = x_0 \\ y(t) = C\, x(t) + w(t), & w(t) \sim \mathcal{N}(0, R), \quad E[v(t)\, w(s)^T] = S\,\delta(t-s) \end{cases} \qquad (1)$$

for some matrices $A \in \mathbb{R}^{n \times n}$ and $C \in \mathbb{R}^{m \times n}$. The goal of dimensionality reduction can be posed as the identification of the model above, that is, the estimation of the model parameters $A, C, Q, R, S$ from output measurements $y(1), \dots, y(\tau)$. It is commonly believed that, even for the case of linear models driven by Gaussian noise such as (1), this problem can only be solved iteratively, and it is often formulated in the framework of expectation-maximization [13]. Instead, we will follow the philosophy of subspace identification methods as championed by Van Overschee and De Moor [11], and show that a solution can be computed in closed form. This solution can be shown to be asymptotically efficient under suitable hypotheses.

^1 ARMA stands for auto-regressive moving average.
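To make the generative model concrete, the sketch below simulates a sample path of (1). It is purely illustrative: the function name, the system matrices and the noise levels are our own choices, and the state and measurement noises are taken to be uncorrelated ($S = 0$), which the model does not require.

```python
import numpy as np

def simulate_gauss_markov(A, C, Q, R, x0, tau, seed=None):
    """Draw a sample path y(1),...,y(tau) from the Gauss-Markov model (1), with S = 0."""
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], C.shape[0]
    x = np.array(x0, dtype=float)
    Y = np.empty((m, tau))
    for t in range(tau):
        Y[:, t] = C @ x + rng.multivariate_normal(np.zeros(m), R)  # output equation
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)        # state equation
    return Y

# Illustrative example: a stable 2-dimensional state observed in m = 50 dimensions.
rng = np.random.default_rng(0)
A = 0.95 * np.array([[np.cos(0.3), -np.sin(0.3)],
                     [np.sin(0.3),  np.cos(0.3)]])   # stable rotation
C = rng.standard_normal((50, 2))
Q, R = 0.1 * np.eye(2), 0.01 * np.eye(50)
Y = simulate_gauss_markov(A, C, Q, R, x0=np.zeros(2), tau=500, seed=1)
```

The data matrix $Y$ (one measurement per column) is the only input required by the identification procedures discussed below.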

1.2 Uniqueness and canonical model realizations

The first observation concerning the model (1) is that the choice of matrices $A, C, Q, R, S$ is not unique, in the sense that there are infinitely many models that give rise to exactly the same measurement covariance sequence starting from suitable initial conditions. The first source of non-uniqueness has to do with the choice of basis for the state space: one can substitute $A$ with $TAT^{-1}$, $C$ with $CT^{-1}$, $Q$ with $TQT^T$ and $S$ with $TS$, choose the initial condition $Tx_0$, where $T$ is any invertible $n \times n$ matrix, and obtain the same output covariance sequence; indeed, one also obtains the same output realization. The second source of non-uniqueness has to do with issues in spectral factorization that are beyond the scope of this paper [10]. It suffices for our purpose to say that one can transform the model (1) into a particular unique form, the so-called innovation representation, given by

$$\begin{cases} x(t+1) = A\, x(t) + K e(t) \\ y(t) = C\, x(t) + e(t), & e(t) \sim \mathcal{N}(0, \Lambda) \end{cases} \qquad (2)$$

where $K$ is called the Kalman gain and $e(t)$ the innovation process^2. Therefore, without loss of generality, we restrict ourselves to models of the form (2). We are then left with the first source of non-uniqueness: any given process has not a unique innovation model, but an equivalence class of models $\{TAT^{-1},\; CT^{-1},\; TK \mid T \in GL(n)\}$.
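For a given realization $(A, C, Q, R)$, the innovation representation (2) can be obtained from the steady-state Kalman filter. The sketch below does this with SciPy's discrete-time algebraic Riccati solver, under the simplifying assumption $S = 0$ (the general case carries an extra cross term); it is meant only to illustrate the relation between (1) and (2), not as part of the identification algorithm itself.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def innovation_form(A, C, Q, R):
    """Steady-state Kalman (predictor) gain K and innovation covariance Lam
    for model (1) with S = 0, yielding the innovation representation (2)."""
    # P solves the filtering Riccati equation
    # P = A P A' + Q - A P C' (C P C' + R)^{-1} C P A'.
    P = solve_discrete_are(A.T, C.T, Q, R)
    Lam = C @ P @ C.T + R                 # innovation covariance
    K = A @ P @ C.T @ np.linalg.inv(Lam)  # Kalman gain
    return K, Lam
```

This is one way to see the second source of non-uniqueness mentioned above: different triples $(Q, R, S)$ can yield the same $(K, \Lambda)$, and hence the same output statistics.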
In order to be able to identify a unique model of the type (2) from a sample path, it is therefore necessary to choose a representative of each equivalence class (i.e. a basis of the state space): such a representative is called a canonical model realization (or simply a canonical realization). It is canonical in the sense that it does not depend on the choice of the state space (because it has been fixed). While there are many possible choices of canonical realizations (see for instance [9]), we are interested in one that is tailored to the data, in the sense of having a diagonal state covariance. Such a model realization is called balanced [3]. Since we are interested in data dimensionality reduction, we will make the following assumptions about the model (1):

$$m \gg n; \qquad \mathrm{rank}(C) = n \qquad (3)$$

and choose a realization that makes the columns of $C$ orthogonal:

$$C^T C = \mathrm{diag}\{\sigma_1, \dots, \sigma_n\} \qquad (4)$$

where, without loss of generality, the $\sigma_i$ are non-negative and sorted in decreasing order. We will see shortly that such a choice results in a canonical model realization. Such a model is characterized by a state space whose covariance $\Sigma \doteq E[x(t) x(t)^T]$ is diagonal and whose diagonal elements are uniquely defined by the data. We shall also make the simplifying assumption that the state is constructible in one step, i.e. that the covariance between one step of the future and one step of the past has full rank $n$. The real numbers $\rho_1, \dots, \rho_n$, which one can estimate from the data, are called the canonical correlation coefficients [8], as they represent the canonical correlations between the past (say $\mathcal{Y}^-_t$, generated by $\{y(s),\, s \leq t-1\}$) and the future (say $\mathcal{Y}^+_t$, generated by $\{y(s),\, s \geq t\}$) of the process. In terms of the data, the spaces $\mathcal{Y}^-_t$ and $\mathcal{Y}^+_t$ can be thought of as the row-spans of the doubly-infinite block-Hankel matrices^3

$$\mathcal{Y}^-_t \doteq \begin{bmatrix} y(t-1) & y(t) & y(t+1) & \cdots \\ y(t-2) & y(t-1) & y(t) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}, \qquad \mathcal{Y}^+_t \doteq \begin{bmatrix} y(t) & y(t+1) & y(t+2) & \cdots \\ y(t+1) & y(t+2) & y(t+3) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

^2 The innovation process can be interpreted as the one-step prediction error $e(t) \doteq y(t) - E[y(t) \mid y(s), s < t]$: it is the error one commits in predicting (in the Bayesian sense) the value of $y(t)$ given the values of $y(s)$ for $s < t$.

^3 The Hilbert space of zero-mean, finite-variance random variables and the (row) space of semi-infinite sequences constructed from a realization (sample path) are isometrically isomorphic, with inner product $\langle a, b \rangle \doteq E[a\, b]$.
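The canonical correlation coefficients can be estimated directly from a finite sample path. The following is a minimal sketch (function and variable names are ours), assuming the sample is long enough that the one-step past and future sample covariances are nonsingular ($\tau > m$); for very high-dimensional data one would first reduce the outputs, e.g. by PCA.

```python
import numpy as np

def canonical_correlations(Y):
    """Sample canonical correlations between one step of past and one step
    of future, estimated from a single sample path Y of shape (m, tau)."""
    Yp, Yf = Y[:, :-1], Y[:, 1:]            # one block row of past and future
    Lp = np.linalg.cholesky(Yp @ Yp.T)      # Cholesky factor of the past
    Lf = np.linalg.cholesky(Yf @ Yf.T)      # Cholesky factor of the future
    # Singular values of the whitened cross-covariance are the rho_i.
    W = np.linalg.solve(Lf, Yf @ Yp.T) @ np.linalg.inv(Lp).T
    return np.linalg.svd(W, compute_uv=False)
```

The same whitened SVD is the computational core of the closed-form identification procedure of Section 2.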

In practice one only considers the finite past and future, i.e. one takes only a finite number of block rows in these matrices (under our hypotheses we can take just one). The canonical correlations can be related to the mutual information between past and future by the well-known formula

$$I(\mathcal{Y}^-_t; \mathcal{Y}^+_t) = -\frac{1}{2} \sum_{i} \log\big(1 - \rho_i^2\big) \qquad (5)$$

which is independent of $t$ by stationarity [1]. This interpretation can be used in the reduction context, as one wants to reduce the dimensionality of the data while keeping the maximum amount of information: if the state is chosen in this basis then, according to (5), the reduction simply consists in dropping its last components.

The problem we set out to solve can then be formulated as follows: given measurements of a sample path of the process, $y(1), \dots, y(\tau)$, estimate $\hat A, \hat C, \hat K, \hat\Lambda$, a canonical realization of the process. Ideally, we would want the maximum likelihood solution from the finite sample:

$$\hat A, \hat C, \hat K, \hat\Lambda = \arg\max_{A, C, K, \Lambda} \; p\big(y(1), \dots, y(\tau) \mid A, C, K, \Lambda\big) \qquad (6)$$

We will first derive a closed-form suboptimal solution and then argue that it asymptotically maximizes the likelihood.

2 Closed-form solution for the linear Gaussian case

The idea behind dynamic dimensionality reduction is to use the dynamic properties of the processes involved to determine a low-rank approximation. To understand this fact, let us for a moment assume that we only consider the output equation. Let $Y \doteq [y(1), \dots, y(\tau)] \in \mathbb{R}^{m \times \tau}$, and similarly for $X$ and $W$, and notice that $Y = CX + W$ with $C \in \mathbb{R}^{m \times n}$ and $C^T C$ diagonal by our assumptions (3) and (4). Consider the problem of finding the best estimate of $C$ in the sense of Frobenius:

$$\hat C, \hat X = \arg\min_{C, X} \| Y - CX \|_F$$

Let $Y = U \Sigma V^T$ be the singular value decomposition (SVD) [7], with $U \in \mathbb{R}^{m \times m}$, $U^T U = I$, $V \in \mathbb{R}^{\tau \times \tau}$, $V^T V = I$ and $\Sigma$ diagonal. Let $U_n$, $V_n$ be the matrices formed with the first $n$ columns of $U$ and $V$ respectively, and let us also denote $\Sigma_n \doteq \mathrm{diag}\{\sigma_1, \dots, \sigma_n\}$. It follows immediately from the fixed-rank approximation property of the SVD [7] that

$$\hat C = U_n; \qquad \hat X = \Sigma_n V_n^T.$$

However, this solution is static, in the sense that it does not take into account the fact that the rows of $X$ have a very particular structure (determined by equation (2)) [15]. Therefore, it is not the optimal solution we are looking for.^4

^4 Incidentally, this suboptimal solution is used in [15] to model dynamic textures; despite the lack of optimality, naive system identification is sufficient to capture non-trivial textures.
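A minimal sketch of the static, PCA-like factorization just described (variable names are ours):

```python
import numpy as np

def static_factorization(Y, n):
    """Best rank-n factorization Y ~ C X in the Frobenius sense:
    C = U_n (orthonormal columns), X = Sigma_n V_n^T."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :n], np.diag(s[:n]) @ Vt[:n, :]
```

The dynamic solution constructed next replaces the plain SVD of $Y$ with an SVD of the whitened covariance between one step of past and one step of future.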
We are instead looking for a solution that exploits the fact that the state (i.e. a column of $X$) represents a very special low-rank approximation of the output: it is the one that makes the past and the future conditionally independent. Therefore, in the context of dimensionality reduction of finite data, it is natural to choose the state to be the $n$-dimensional subspace of the past measurements which retains the maximum amount of information to predict future measurements at any given time, according to formula (5). In the rest of this section we shall construct such an approximation.

The way we construct this approximation can be summarized as follows. Let us denote $Y^+ \doteq [y(2), \dots, y(\tau)] \in \mathbb{R}^{m \times (\tau - 1)}$ and, similarly, $Y^- \doteq [y(1), \dots, y(\tau - 1)]$, and notice that

$$Y^+ = C X^+ + W^+ \qquad (7)$$

Let $L_-$ be the lower triangular Cholesky factor [7] of $Y^- (Y^-)^T$ and $L_+$ that of $Y^+ (Y^+)^T$. We want to find the best $n$-dimensional subspace of the (finite) past (say $Y^-$) to predict the (finite) future (say $Y^+$) in the sense of the weighted Frobenius norm $\| L_+^{-1} (Y^+ - \hat Y^+) \|_F$. The reason why the norm is weighted by the inverse Cholesky factor is that we want to compute a relative error, not an absolute one. The answer to this question [7] is found by computing the first $n$ principal directions in $Y^+$ with respect to $Y^-$ and choosing the state as the space spanned by these components. Principal directions are computed using the singular value decomposition, as follows.

Let us denote by $\hat Y^+$ the orthogonal projection of the rows of $Y^+$ onto the row-span of the matrix $Y^-$, and by $\hat Y^-$ the orthogonal projection of the rows of $Y^-$ onto the row-span of $Y^+$.^5 Let us compute the singular value decomposition (SVD) [7]

$$L_+^{-1} \hat Y^+ (Y^-)^T L_-^{-T} = U \Sigma V^T, \qquad U^T U = I, \quad V^T V = I, \quad \Sigma = \mathrm{diag}\{\rho_1, \dots, \rho_m\}$$

with $\rho_1 \geq \rho_2 \geq \dots \geq \rho_m$. Ideally (in the absence of noise, and when the data are generated according to a Gauss-Markov model) $\rho_{n+1} = \rho_{n+2} = \dots = 0$. In practice, however, this does not hold, and one can therefore choose $n$ by imposing a threshold on the values of $\rho_i$. Let us denote by $U_n$, $V_n$ the matrices formed with the first $n$ columns of $U$ and $V$ respectively, and let $\Sigma_n \doteq \mathrm{diag}\{\rho_1, \dots, \rho_n\}$.

Consider the problem of finding the best estimate of $C$ together with the best estimate of the state of the form $X = G Y^-$, for some linear operator $G$ acting on the past, in the sense of the weighted Frobenius norm:

$$\hat C, \hat X = \arg\min_{C,\; X = G Y^-} \big\| L_+^{-1} (Y^+ - C X) \big\|_F \qquad (8)$$

It is shown in [7] that this is solved by choosing $X$ as the space spanned by the first $n$ principal directions of $Y^+$ with respect to $Y^-$. It is also shown in [7] that these principal directions are given by $V_n^T L_-^{-1} Y^-$. Therefore, we have

$$\hat C = L_+ U_n \Sigma_n^{1/2}; \qquad \hat X = \Sigma_n^{1/2} V_n^T L_-^{-1} Y^- \qquad (9)$$

$A$ can be determined uniquely, again in the sense of Frobenius, by solving the linear problem $\hat A = \arg\min_A \| \hat X^+ - A \hat X \|_F$, where $\hat X^+$ denotes the estimated state sequence shifted forward by one step (the two sequences being restricted to a common time span). This is trivially done in closed form using (9) as

$$\hat A = \hat X^+ \hat X^T \big( \hat X \hat X^T \big)^{-1} \qquad (10)$$

Notice that $\hat C$ is uniquely determined up to a change of sign of the components of the state. Also note that

$$\hat X \hat X^T = \Sigma_n^{1/2} V_n^T V_n \Sigma_n^{1/2} = \Sigma_n \qquad (11)$$

which is diagonal. Thus the resulting model is stochastically balanced.^6

Let $\hat E \doteq Y^+ - \hat C \hat X$ be the estimated innovation, which is, strictly speaking, the one-step-ahead prediction error based on just one measurement, i.e. it is the transient innovation. Its covariance can be estimated as $\hat\Lambda = \frac{1}{\tau - 1} \hat E \hat E^T$, where $\hat E \doteq [\hat e(2), \dots, \hat e(\tau)]$. The input-to-state matrix $K$ can be estimated, following (2), as follows: compute $\hat X^+ - \hat A \hat X = \hat K \hat E$, which yields

$$\hat K = \big( \hat X^+ - \hat A \hat X \big) \hat E^T \big( \hat E \hat E^T \big)^{-1} \qquad (12)$$

Note that we have used the finite past and future (one step each) to obtain our estimates. As we have pointed out, the difference $\hat e(t) = y(t) - \hat C \hat x(t)$ is the transient innovation, i.e. it is the prediction error based on the finite past. It coincides with the stationary one if and only if the system has finite memory (in our case, memory one), which happens if and only if $A$ is nilpotent (in our case $A = 0$). It follows easily that its variance is larger than the variance of the true innovation. For the same reason, the matrix $\hat K$ is not the stationary $K$ (the Kalman gain). There is a procedure which gives consistent estimates even using the finite past and future, under some assumptions. It is based on solving a Riccati equation, but it would require the introduction of a certain backward model for the process, which is beyond our scope. This fact, however, seems to have been partially overlooked in the system identification community as well; we will not comment on it further beyond warning the reader that there are indeed procedures which give consistent and asymptotically efficient estimates [6].

^5 The orthogonal projection is given by $\hat Y^+ \doteq Y^+ (Y^-)^T \big( Y^- (Y^-)^T \big)^{-1} Y^-$.

^6 Strictly speaking, this model is finite-interval balanced since in this case the $\rho_i$ are the canonical correlation coefficients between the finite past and the finite future.
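Putting the steps of this section together, the following sketch is one possible rendering of the closed-form procedure (equations (7)-(12)). It assumes $\tau > m$ so that the sample covariances of past and future are nonsingular (for very high-dimensional data such as images one would first reduce the outputs, e.g. with the static SVD above); the variable names, the pseudo-inverses used for the regressions, and the handling of end effects are our choices, so it should be read as illustrative rather than as the authors' reference implementation.

```python
import numpy as np

def identify_closed_form(Y, n):
    """Closed-form identification of an innovation model (2) from a sample
    path Y (m x tau), using one block row of past and future (Section 2)."""
    Yp, Yf = Y[:, :-1], Y[:, 1:]                     # past Y^-, future Y^+
    Lp = np.linalg.cholesky(Yp @ Yp.T)               # Cholesky factor of the past
    Lf = np.linalg.cholesky(Yf @ Yf.T)               # Cholesky factor of the future
    # Whitened cross-covariance; its singular values are the rho_i.
    W = np.linalg.solve(Lf, Yf @ Yp.T) @ np.linalg.inv(Lp).T
    U, rho, Vt = np.linalg.svd(W)
    Un, Vtn = U[:, :n], Vt[:n, :]
    sqrtS = np.diag(np.sqrt(rho[:n]))
    C_hat = Lf @ Un @ sqrtS                          # eq. (9)
    X_hat = sqrtS @ Vtn @ np.linalg.solve(Lp, Yp)    # balanced state sequence
    # eq. (10): A from the one-step-shifted state regression.
    A_hat = X_hat[:, 1:] @ np.linalg.pinv(X_hat[:, :-1])
    # Transient innovation, its covariance, and the gain K, eqs. (11)-(12).
    E = Yf - C_hat @ X_hat
    Lam_hat = E @ E.T / E.shape[1]
    K_hat = (X_hat[:, 1:] - A_hat @ X_hat[:, :-1]) @ np.linalg.pinv(E[:, :-1])
    return A_hat, C_hat, K_hat, Lam_hat, rho

# For example, on the simulated data above:
# A_hat, C_hat, K_hat, Lam_hat, rho = identify_closed_form(Y, n=2)
```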

2.1 Choice of model order

In the algorithm above we have assumed that the order $n$ of the model was given. In practice, it needs to be inferred from the data. Following [3], we propose determining the model order empirically from the singular values $\rho_1, \rho_2, \dots$. The problem of order estimation is currently an active research topic in the field of subspace identification and cannot be considered solved. There are, however, numerous procedures, based either on information-theoretic criteria or on hypothesis testing (see [4, 12] and references therein), which can be used to choose a threshold.

2.2 Asymptotic properties

While the solution given above is clearly suboptimal, it is possible to show that a slight modification of the algorithm reported above is asymptotically efficient. The proof of this fact has not yet been published and is due to Bauer [5]. The proof is valid for a particular type of algorithm which, in our setup, would substantially increase the computational complexity.

3 Extensions to nonlinear, non-Gaussian models

While the algorithm proposed above depends critically on the linear structure of the problem and only captures the second-order statistics of the process, some simple modifications allow extending it to classes of non-linear models and/or non-Gaussian processes.

3.1 Choice of basis and independent components

The estimate of $C$ derived in Section 2 can be interpreted in a functional sense as determining a choice of (orthonormal) basis elements for the space of measurements. This basis captures the principal directions of the correlation structure of the data. However, one may want to capture higher-order statistical structure in the data, for instance by requiring that the components of the state be not only orthogonal (in the sense of correlation), but also statistically independent. This could be done in the context of independent component analysis (ICA), by seeking

$$\hat C = \arg\min_C \; K\Big( p(x_1, \dots, x_n) \,\Big\|\, \prod_{i=1}^n p(x_i) \Big) \qquad (13)$$

subject to

$$y(t) = C\, x(t) + w(t) \qquad (14)$$

where $K$ indicates the Kullback-Leibler divergence. Although this seems appealing from the point of view of dimensionality reduction of a static data set, the components of the state are not independent, for they have to satisfy (1). Nevertheless, $C$ could be chosen according to the independence criterion, so as to extract maximally independent components [2], and $A$ could then be inferred from the estimated state according to a least-squares or maximum likelihood criterion.

3.2 Features and kernels

The model (1) assumes the existence of a state process of which the data are a linear combination; the algorithm in Section 2 uses the geometry of the measurement space (specifically its inner product $\langle y_1, y_2 \rangle \doteq E[y_1^T y_2]$) in order to recover the matrix $C$. More in general, however, one could postulate the existence of a state of which the data are a nonlinear combination, $y(t) = h(C x(t) + w(t))$. If $h$ is an invertible map then, equivalently (setting $\phi \doteq h^{-1}$), one could postulate the existence of a function $\phi$ acting on the data in such a way that its image is a linear combination of the states:

$$\phi(y(t)) = C\, x(t) + w(t) \qquad (15)$$

Such images $\phi(y)$ are called features. Applying the algorithm in Section 2 to the transformed data is equivalent to changing the geometry of the measurement space by choosing an inner product in feature space:

$$\langle y_1, y_2 \rangle_\phi \doteq \langle \phi(y_1), \phi(y_2) \rangle \qquad (16)$$

in a way that is reminiscent of kernel methods [14]. The feature map $\phi$ could either be assigned ad hoc, or it could be part of the identification procedure. Ad-hoc choices may include wavelet transforms, Gabor filters, etc.
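As a small illustration of the feature idea in (15)-(16): one can apply a fixed, invertible, elementwise map to the data and run the linear algorithm of Section 2 unchanged in feature space. The particular map below (a pointwise logarithm, suited to positive-valued data such as image intensities) and the function names are our own ad-hoc choices, not ones prescribed here.

```python
import numpy as np

def identify_with_features(Y, n, identify, phi=np.log1p, phi_inv=np.expm1):
    """Identify a linear model on features phi(y(t)) rather than on y(t) itself,
    cf. (15); 'identify' is any linear identification routine, e.g. the
    closed-form sketch of Section 2. Synthesized feature trajectories are
    mapped back to measurements through phi_inv."""
    model = identify(phi(Y), n)
    return model, phi_inv
```

Choosing $\phi$ this way amounts to fixing the inner product (16) ad hoc, much like the wavelet or Gabor choices mentioned above.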

4 Applications

4.1 Dynamic image modeling, coding and synthesis

We have tested the power of the algorithm above in modeling visual scenes that exhibit certain stationarity properties, so-called dynamic textures [15]. A small sample (50-100 images) of scenes containing water, smoke, foliage etc. has been used to identify a model, which can then be used for coding with orders-of-magnitude compression factors, or for the synthesis of novel views. In Figure 1 we show a few images of the original sample (top) as well as simulated ones (bottom), and the values of the canonical correlation coefficients; the order of the model has been truncated. Complete video sequences can be downloaded from http://vision.ucla.edu. A different sequence, depicting water undergoing vortical motion, is shown in Figure 2, together with the synthetic sequence and the correlation coefficients.

[Figure 1: sample frames of the original and simulated sequences, and plot of the correlation coefficients (eigenvalue intensity vs. frame index).]

Acknowledgements

We wish to thank Gianfranco Doretto for his assistance with the experiments.

[Figure 2: plot of the correlation coefficients (eigenvalue intensity vs. frame index) for the vortical water sequence.]

References

[1] H. Akaike. Canonical correlation analysis of time series and the use of an information criterion. In System Identification: Advances and Case Studies, R. Mehra and D. Lainiotis (Eds.), pages 27-96, 1976.

[2] S. Amari and J.-F. Cardoso. Blind source separation: semiparametric statistical approach. IEEE Trans. Signal Processing, 45(11):2692-2700, 1997.

[3] K. Arun and S. Y. Kung. Balanced approximation of stochastic systems. SIAM Journal on Matrix Analysis and Applications, 11(1):42-68, 1990.

[4] D. Bauer. Some asymptotic theory for the estimation of linear systems using maximum likelihood methods or subspace algorithms. PhD thesis, Technical University, Wien, 1998.

[5] D. Bauer. Asymptotic efficiency of the CCA subspace method in the case of no exogenous inputs. Submitted, 2000.

[6] A. Chiuso. Geometric methods for subspace identification. PhD thesis, Università di Padova, Feb. 2000.

[7] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 2nd edition, 1989.

[8] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321-377, 1936.

[9] T. Kailath. Linear Systems. Prentice Hall, 1980.

[10] L. Ljung. System Identification: Theory for the User. Prentice Hall, 1987.

[11] P. Van Overschee and B. De Moor. Subspace algorithms for the stochastic identification problem. Automatica, 29:649-660, 1993.

[12] K. Peternell. Identification of linear time-invariant systems via subspace and realization-based methods. PhD thesis, Technical University, Wien, 1995.

[13] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2), 1999.

[14] B. Schölkopf and A. Smola. Kernel PCA: pattern reconstruction via approximate pre-images. In 8th International Conference on Artificial Neural Networks, pages 147-152, 1998.

[15] S. Soatto, G. Doretto, and Y. Wu. Dynamic textures. In Intl. Conf. on Computer Vision, in press, 2001.