Regularized Generalized Canonical Correlation Analysis Extended to Symbolic Data


Abstract Regularized Generalized Canonical Correlation Analysis (RGCCA) is a component-based approach which aims at studying the relationships between several blocks of numerical variables. In this paper we propose a method called Symbolic Generalized Canonical Correlation Analysis (Symbolic GCCA) that extends RGCCA to symbolic data. It is a versatile tool for multi-block data analysis that can deal with any type of dataset (e.g. observations described by intervals, histograms, ...) provided that a relevant kernel function is defined for each block. A monotonically convergent algorithm for Symbolic GCCA is presented and applied to a 4-block dataset of power plant cooling towers described by histograms.

Keywords Symbolic Data, Regularized Generalized Canonical Correlation Analysis, kernel functions

1 Introduction

The modern problem of managing and analyzing massive amounts of data does not only concern dataset size; it also concerns dealing with data that can be more or less complex. On the web site of ECML/PKDD 2007 on Mining Complex Data, complex data were defined as follows: in contrast to typical tabular data, complex data can consist of heterogeneous data types, can come from different sources, or live in high-dimensional spaces. All these specificities call for new data mining strategies. The complexity can come from the object description itself, as in the case of images, videos and audio/text documents, or from the dataset structure, as in the case of distributed, heterogeneous or spatio-temporal data. A typical example is a medical patient who can be described by heterogeneous data such as images, text documents and socio-demographic information.

In practice, complex data are more or less based on several kinds of observations described by standard numerical and/or categorical data contained in several related data tables. Usually, data mining deals with observations that are described by several standard variables, numerical or categorical. Symbolic data analysis (SDA) [1], [2] deals with concepts that are in general defined from the original observations. These concepts are described by symbolic data, which can be standard categorical or numerical data but also more complex descriptions such as sets, sequences of weighted values, intervals, histograms and the like. The word symbolic is used because these more complex descriptions cannot be manipulated just like real numbers.

We will proceed as follows. Starting from complex data, such as the data describing the power plant cooling towers that illustrate the present study, a fusion process is applied to obtain a symbolic data table, from which new kinds of knowledge are discovered using specific methodological tools extended to concepts, considered as a new type of observation. The present paper investigates an extension to symbolic data of Regularized Generalized Canonical Correlation Analysis (RGCCA), introduced in [22]. RGCCA is itself a generalization of regularized CCA [27], [15] to the more-than-two-block case and offers a unifying view of various multi-block data analysis methods.

The paper is organized as follows: RGCCA is briefly introduced in section 2 and a monotonically convergent algorithm for RGCCA is presented. Then, using appropriate positive definite kernel functions, we apply this version of RGCCA to symbolic data. Examples of kernel functions for symbolic data are discussed in section 3. Finally, section 4 illustrates the usefulness of symbolic GCCA on a 4-block dataset for studying power plant cooling towers described by histograms.

2 Regularized Generalized Canonical Correlation Analysis

Regularized Generalized Canonical Correlation Analysis (RGCCA), proposed in [22], is a method for studying associations between more than two blocks of variables. RGCCA aims at extracting the information shared by J blocks of centered variables X_1, ..., X_J, taking into account an a priori graph of connections between blocks specified by a binary design matrix C = {c_jk} such that c_jk = 1 if blocks X_j and X_k are connected and c_jk = 0 otherwise. RGCCA is defined as the following optimization problem:

$$\max_{a_1, a_2, \dots, a_J}\; \sum_{\substack{j,k=1 \\ j \neq k}}^{J} c_{jk}\, g\big(\operatorname{cov}(X_j a_j, X_k a_k)\big) \quad \text{subject to}\quad (1-\tau_j)\operatorname{var}(X_j a_j) + \tau_j \|a_j\|^2 = 1,\; j = 1, \dots, J \qquad (1)$$

In this optimization problem, g can be defined as g(x) = x (Horst scheme proposed in [14]), g(x) = |x| (Centroid scheme proposed in [29]) or g(x) = x^2 (Factorial scheme proposed in [16]).
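To make criterion (1) concrete, here is a minimal sketch (in Python/NumPy; the function name and block layout are our own illustrative assumptions, not from the paper) that evaluates the criterion for given outer weight vectors under the three schemes:

```python
import numpy as np

def rgcca_criterion(X, a, C, scheme="horst"):
    """Evaluate criterion (1): sum over connected pairs of g(cov(X_j a_j, X_k a_k)).

    X: list of centered (n x p_j) blocks; a: list of outer weight vectors;
    C: binary design matrix; scheme selects the function g.
    """
    g = {"horst": lambda x: x,            # g(x) = x
         "centroid": abs,                 # g(x) = |x|
         "factorial": lambda x: x ** 2}[scheme]
    n, J = X[0].shape[0], len(X)
    y = [X[j] @ a[j] for j in range(J)]   # outer components y_j = X_j a_j
    return sum(C[j, k] * g(y[j] @ y[k] / n)   # covariance of centered components
               for j in range(J) for k in range(J) if j != k)
```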

The vector a_j (resp. y_j = X_j a_j) is referred to as an outer weight vector (resp. an outer component). The Horst scheme penalizes structural negative correlations between components, while both the Centroid and the Factorial schemes can be viewed as attractive alternatives that enable two components to be negatively correlated. From an optimization point of view, the shrinkage parameters τ_j ∈ [0, 1], j = 1, ..., J, in (1) smoothly interpolate between the maximization of the covariance (τ_j = 1 for all j) and the maximization of the correlation (τ_j = 0 for all j).

Optimization problem (1) is solved using Algorithm 1 described below. Let us denote by K_j = X_j X_j^t the n × n matrix of inner products between observations.

Algorithm 1 Dual algorithm for Regularized Generalized Canonical Correlation Analysis

Step A. Initialization. Choose J arbitrary vectors α_1^0, ..., α_J^0 and rescale them so that the constraints in (1) hold:

$$\alpha_j^0 \leftarrow \left[ (\alpha_j^0)^t \left( \tau_j I + (1-\tau_j)\tfrac{1}{n} K_j \right) K_j\, \alpha_j^0 \right]^{-1/2} \alpha_j^0$$

repeat (s = 0, 1, 2, ...)
  for j = 1, 2, ..., J do

  Step B. Inner component for block j:

$$z_j^s = \sum_{k=1}^{j-1} c_{jk}\, w\!\left(\operatorname{cov}(K_j \alpha_j^s, K_k \alpha_k^{s+1})\right) K_k \alpha_k^{s+1} \;+\; \sum_{k=j+1}^{J} c_{jk}\, w\!\left(\operatorname{cov}(K_j \alpha_j^s, K_k \alpha_k^s)\right) K_k \alpha_k^s$$

  where w(x) = 1 for the Horst scheme, w(x) = x for the factorial scheme and w(x) = sign(x) for the centroid scheme.

  Step C. Dual outer weight vector for block j:

$$\alpha_j^{s+1} = \left[ (z_j^s)^t K_j \left( \tau_j I + (1-\tau_j)\tfrac{1}{n} K_j \right)^{-1} z_j^s \right]^{-1/2} \left( \tau_j I + (1-\tau_j)\tfrac{1}{n} K_j \right)^{-1} z_j^s$$

  end for
until convergence

The procedure begins with an arbitrary choice of initial values α_1^0, ..., α_J^0 (Step A of Algorithm 1). Assuming that the dual outer weight vectors α_1^{s+1}, α_2^{s+1}, ..., α_{j-1}^{s+1} have been constructed for blocks X_1, X_2, ..., X_{j-1}, the dual outer weight vector α_j^{s+1} is computed by considering the inner component z_j^s for block X_j given in Step B of Algorithm 1 and the formula given in Step C. The procedure is iterated until convergence. Convergence is proved in [21]: the bounded criterion to be maximized increases at each step of the iterative procedure until reaching a plateau.
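As a companion to Algorithm 1, the following NumPy sketch spells out one possible implementation of the dual iteration. It is our own reading, not the authors' reference code; it assumes centered kernel matrices K[j], that every block is connected to at least one other, and a simple stopping rule based on the change of the components.

```python
import numpy as np

def dual_rgcca(K, C, tau, scheme="factorial", tol=1e-8, max_iter=500, seed=0):
    """Sketch of Algorithm 1 (dual RGCCA) for centered n x n kernels K[j]."""
    n, J = K[0].shape[0], len(K)
    # w is the derivative of the scheme function g
    w = {"horst": lambda x: 1.0,
         "factorial": lambda x: x,
         "centroid": np.sign}[scheme]
    # M_j = tau_j I + (1 - tau_j) K_j / n appears in Steps A and C
    M = [tau[j] * np.eye(n) + (1.0 - tau[j]) * K[j] / n for j in range(J)]

    # Step A: arbitrary start, rescaled so that alpha' M_j K_j alpha = 1,
    # i.e. (1 - tau_j) var(y_j) + tau_j ||a_j||^2 = 1 with a_j = X_j' alpha_j
    rng = np.random.default_rng(seed)
    alpha = [rng.standard_normal(n) for _ in range(J)]
    alpha = [a / np.sqrt(a @ M[j] @ K[j] @ a) for j, a in enumerate(alpha)]

    for _ in range(max_iter):
        previous = [K[j] @ alpha[j] for j in range(J)]
        for j in range(J):
            y_j = K[j] @ alpha[j]
            # Step B: inner component mixing the connected blocks
            # (alpha[k] for k < j has already been updated in this sweep)
            z = np.zeros(n)
            for k in range(J):
                if k != j and C[j, k]:
                    y_k = K[k] @ alpha[k]
                    z += C[j, k] * w(y_j @ y_k / n) * y_k
            # Step C: alpha_j <- M_j^{-1} z_j, renormalized
            v = np.linalg.solve(M[j], z)
            alpha[j] = v / np.sqrt(z @ K[j] @ v)
        if max(np.linalg.norm(K[j] @ alpha[j] - previous[j])
               for j in range(J)) < tol:
            break
    return alpha  # dual outer weights; the components are K[j] @ alpha[j]
```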

We stress that Algorithm 1 is not the original (primal) formulation of the RGCCA algorithm presented in [22], but rather a dual formulation originally proposed in [21]. The key difference between the primal and the dual formulations lies in the fact that it is always possible to express a_j as a linear combination of the observations of X_j, that is a_j = X_j^t α_j. Moreover, α_j is optimized in the dual formulation, while a_j is optimized in the primal one. For more details on the dual formulation we refer interested readers to [21]. As will be seen, the dual formulation better fits the handling of symbolic data.

In general, Algorithm 1 does not necessarily converge to the global optimum and restarts are needed for reaching a globally optimal solution. Empirically, we note that Algorithm 1 is not very sensitive to the starting point and convergence (tolerance = ) is usually reached within a few iterations.

It was assumed without loss of generality that the variables were centered; in fact, the data K_j can be centered by simply applying the following transform:

$$K_j \leftarrow \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^t\right) K_j \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^t\right) \qquad (2)$$

where I denotes the n-dimensional identity matrix and 1 the n-vector of ones. Moreover, we stress that only first-dimension components are built in Algorithm 1. Components related to higher dimensions can easily be obtained by applying the same procedure to blocks deflated with respect to the preceding components. The deflation operation is obtained from K_j after extraction of y_j using the following formula:

$$K_j \leftarrow \left(I - y_j y_j^t\right) K_j \left(I - y_j y_j^t\right) \qquad (3)$$

To conclude this section, observe that Algorithm 1 solves optimization problem (1) by manipulating the observations only through pairwise inner products between observations. As will be seen in section 3, symbolic GCCA is derived from Algorithm 1 by designing appropriate pairwise inner products between symbolic data.
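Transforms (2) and (3) are one-liners in the same NumPy setting; a short sketch (helper names are ours) follows.

```python
import numpy as np

def center_kernel(K):
    """Transform (2): double-center K with H = I - (1/n) 1 1'."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def deflate_kernel(K, y):
    """Transform (3): project the extracted component y out of K.
    Dividing by y'y makes I - y y' / (y'y) an orthogonal projector,
    which matches (3) when y is scaled to unit norm."""
    P = np.eye(len(y)) - np.outer(y, y) / (y @ y)
    return P @ K @ P
```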

2.1 Special cases of RGCCA

Depending on the choice of the design matrix, the shrinkage parameters and the function g, RGCCA recovers a number of existing methods; Table 1 gives an overview of the methods that constitute the RGCCA framework.

Table 1 Special cases of RGCCA: PLS regression [30], canonical correlation analysis [12], redundancy analysis [25], regularized CCA [27], [15], [18], regularized redundancy analysis [20], SUMCOR [11], SSQCOR [13], SABSCOR [10], SUMCOV [26], SSQCOV [10], SABSCOV [14], Carroll's CCA [6], Multiple Co-Inertia Analysis [7]. X_{J+1} is called a super-block and is equal to the concatenation of blocks X_1, ..., X_J (X_{J+1} = [X_1 X_2 ... X_J]).

TWO-BLOCK CASE
- PLS regression: argmax_{a_1,a_2} cov(X_1 a_1, X_2 a_2) s.t. ||a_1|| = ||a_2|| = 1
- Canonical Correlation Analysis: argmax_{a_1,a_2} cov(X_1 a_1, X_2 a_2) s.t. var(X_1 a_1) = var(X_2 a_2) = 1
- Redundancy analysis: argmax_{a_1,a_2} cov(X_1 a_1, X_2 a_2) s.t. ||a_1|| = var(X_2 a_2) = 1
- Regularized CCA: argmax_{a_1,a_2} cov(X_1 a_1, X_2 a_2) s.t. (1 − τ_j) var(X_j a_j) + τ_j ||a_j||^2 = 1, j = 1, 2
- Regularized redundancy analysis: argmax_{a_1,a_2} cov(X_1 a_1, X_2 a_2) s.t. ||a_1|| = 1 and (1 − τ_2) var(X_2 a_2) + τ_2 ||a_2||^2 = 1

MULTI-BLOCK CASE
- SUMCOR: argmax_{a_1,...,a_J} Σ_{j,k=1; j≠k}^J cov(X_j a_j, X_k a_k) s.t. var(X_j a_j) = 1, j = 1, ..., J
- SSQCOR: argmax_{a_1,...,a_J} Σ_{j,k=1; j≠k}^J cov^2(X_j a_j, X_k a_k) s.t. var(X_j a_j) = 1, j = 1, ..., J
- SABSCOR: argmax_{a_1,...,a_J} Σ_{j,k=1; j≠k}^J |cov(X_j a_j, X_k a_k)| s.t. var(X_j a_j) = 1, j = 1, ..., J
- SUMCOV: argmax_{a_1,...,a_J} Σ_{j,k=1; j≠k}^J cov(X_j a_j, X_k a_k) s.t. ||a_j|| = 1, j = 1, ..., J
- SSQCOV: argmax_{a_1,...,a_J} Σ_{j,k=1; j≠k}^J cov^2(X_j a_j, X_k a_k) s.t. ||a_j|| = 1, j = 1, ..., J
- SABSCOV: argmax_{a_1,...,a_J} Σ_{j,k=1; j≠k}^J |cov(X_j a_j, X_k a_k)| s.t. ||a_j|| = 1, j = 1, ..., J
- Carroll's CCA: argmax_{a_1,...,a_{J+1}} Σ_{j=1}^J cov^2(X_j a_j, X_{J+1} a_{J+1}) s.t. var(X_j a_j) = 1 for j = 1, ..., J_1 and j = J + 1, and ||a_j|| = 1 for j = J_1 + 1, ..., J
- MCOA: argmax_{a_1,...,a_{J+1}} Σ_{j=1}^J cov^2(X_j a_j, X_{J+1} a_{J+1}) s.t. ||a_j|| = 1, j = 1, ..., J, and var(X_{J+1} a_{J+1}) = 1

Multi-block data analysis. In multi-block data analysis, all blocks X_j, j = 1, ..., J, are assumed to be connected, and many criteria have been proposed in the literature with the objective of finding block components satisfying some kind of optimality; some are based on correlation, others on covariance. Table 1 reports the main ones: (i) SUMCOR (SUM of CORrelations) [11], (ii) SSQCOR (Sum of SQuared CORrelations) [13], (iii) SABSCOR (Sum of ABSolute CORrelations) [10], (iv) SUMCOV (SUM of COVariances) [26], (v) SSQCOV (Sum of SQuared COVariances) [10], (vi) SABSCOV (Sum of ABSolute COVariances) [14].

Regularized Canonical Correlation Analysis. Regularized Canonical Correlation Analysis [27], [15], [18] is defined as the two-block optimization problem reported in Table 1: maximize cov(X_1 a_1, X_2 a_2) subject to (1 − τ_j) var(X_j a_j) + τ_j ||a_j||^2 = 1, j = 1, 2. For the extreme cases τ_1 = 0 or 1 and τ_2 = 0 or 1 (which correspond exactly to the framework described in [3] and [5]), regularized CCA covers situations that go from Tucker's inter-battery factor analysis [24] to Canonical Correlation Analysis [12] while passing through redundancy analysis [25]. The special case 0 ≤ τ_1 ≤ 1 and τ_2 = 0, which corresponds to a regularized version of redundancy analysis, has been studied in [20] and [4].

Hierarchical model. To conclude, we stress that the introduction of the design matrix allows analyzing complex structural relationships between blocks, such as the hierarchical models introduced in [28]. Hierarchical models will be illustrated in section 4.

It is quite remarkable that the single RGCCA algorithm described in Algorithm 1 offers a unifying view of all the methods reported in Table 1. This is of practical interest for unified statistical analysis and unified implementation strategies. In the next section we show how to extend all the methods reported in Table 1 by applying Algorithm 1 to symbolic data.

3 Kernels for symbolic data

Symbolic GCCA is a versatile tool for multi-block data analysis that allows handling any type of dataset (e.g. observations described by histograms, intervals, strings, images, ...) as long as a relevant kernel function can be defined for each block [9]. Using kernels on such structured symbolic datasets usually involves first choosing a similarity measure between pairs of symbolic objects for each block, and then transforming these n × n similarity matrices into symmetric positive definite n × n matrices called kernel matrices. Thus, symbolic GCCA reduces to a two-step procedure:

Step 1. Design for each block a kernel function that encodes the proximity of symbolic data. For instance, the L1 distance between histograms can be used, together with a Laplacian transformation, to form the kernel matrix:

$$k_j(x_{jk}, x_{jl}) = \exp\Big(-\gamma \sum_{h=1}^{p_j} |x_{jkh} - x_{jlh}|\Big) \qquad (4)$$

where x_{jk} is the set of symbolic measurements observed on the kth individual for the jth block and x_{jkh} is the hth symbolic measurement observed on the kth individual for the jth block. The positive definiteness of the kernel matrix guarantees that each of its elements corresponds to a pairwise inner product evaluation in some space induced by the kernel function (see for instance [19] or [9] for more details on kernel theory).

Step 2. Compute the kernel matrix associated with each block j, [K_j]_{kl} = k_j(x_{jk}, x_{jl}), and replace K_j = X_j X_j^t in Algorithm 1 with this kernel matrix. The index j in k_j emphasizes that kernel functions may differ from one block to another according to the nature of the block.
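As an illustration of Steps 1 and 2, here is a minimal sketch of the kernel matrix of Eq. (4) for one histogram block, assuming the n observations' histogram bins are stacked row-wise in an n × p_j array (the layout and function name are illustrative assumptions):

```python
import numpy as np

def laplacian_kernel_matrix(X, gamma=0.1):
    """[K_j]_{kl} = exp(-gamma * sum_h |x_{jkh} - x_{jlh}|), as in Eq. (4).

    X: (n x p_j) array, one row per observation (concatenated histogram bins).
    """
    # pairwise L1 distances between rows, then the Laplacian transform
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    return np.exp(-gamma * D)
```

The resulting matrix can then be centered with transform (2) and used in place of K_j = X_j X_j^t in Algorithm 1.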

4 Case study: nuclear power plant cooling dataset

In order to analyze the degradation of cooling towers, the French energy company EDF has collected surveying data since their construction [8]. Twenty-one cooling towers are described by 10 different histograms related to subsidence, geometric deformation, cracks and corrosion:

1. Geometric deformation is described by 3 histograms of the external hyperbolic shapes of the towers (Ecartabs_t_2, Ecartabs_1_2, Ecartabs_t_1).
2. Cracks are described by 2 histograms of the length and the orientation of the cracks of the towers (longfi_H_2, Orientation_2).
3. The corrosion level is described by the histogram of the length of the corrosion of the towers (longco_H_2).
4. Subsidence of the towers is described by 4 histograms (TAS_10ansH, TAS_TotalH, TAS_Diff, TAS_20ansH).

Each individual histogram is called a first-order block. Four second-order blocks (Geometric, Cracks, Corrosion, Subsidence) are built by concatenating individual histograms as follows:
- Geometric contains Ecartabs_t_2, Ecartabs_1_2 and Ecartabs_t_1;
- Cracks contains longfi_H_2 and Orientation_2;
- Corrosion contains longco_H_2;
- Subsidence contains TAS_10ansH, TAS_TotalH, TAS_Diff and TAS_20ansH.

Finally, a superblock concatenating all second-order blocks is also considered. Detecting abnormal degradation levels for some towers and determining the relational connections between measures at the tower level are necessary to understand the physical phenomena.

Therefore, to perform an accurate analysis of the cooling tower degradations, the four categories of information (geometric deformation, cracks, corrosion, subsidence) have to be considered simultaneously, which Symbolic GCCA allows.

4.1 Results

The R software was used to perform our analysis [17]. Symbolic GCCA was combined with the hierarchical model described in Figure 1. The hyperparameters (τ_j and γ) were set to 1 and 0.1, respectively, for all blocks.

Fig. 1 Hierarchical model: each first-order histogram block is connected to its second-order block (Geometric, Cracks, Corrosion, Subsidence).

Two components have been constructed for each block. The graphical display of the towers obtained by crossing the first two components of the superblock (y_1 and y_2) is shown in Figure 2. It may be observed that the left part of the graphical display concentrates the most damaged towers while the right part concentrates the least damaged ones.

Figure 3 is built by computing correlations between the first component of each block and the first two components of the super-block, y_1 and y_2.

Fig. 3 Correlation circle.

Analyzing these data is of interest for bringing out correlations between the different blocks, which can be used to anticipate degradation problems that may occur. For example, civil engineers need to know to which extent subsidence problems can cause other degradation problems (geometric deformation, cracks, corrosion areas). In this context, it is interesting to observe (see Figure 3) that the analysis indicates a clear separation of the geometric and subsidence blocks from the two other blocks (cracks and corrosion areas), and this brings out two clusters of towers regarding degradation problems (see Figure 2). The first cluster includes towers with both subsidence and geometric deformation problems.

Fig. 2 Factorial plan (y_1, y_2), where y_1 and y_2 are the block components related to the superblock (the 21 towers plotted against the first and second global components).

The second cluster includes towers with both cracks and corrosion problems. Figure 4 depicts the relational connections between blocks: the correlations between the first components of each block are reported.

Fig. 4 Diagram of relationships between blocks.

From Figure 4, it seems that the relationship between subsidence and geometry is stronger than the one between corrosion and cracks.

5 Conclusion and perspectives

The Symbolic GCCA method introduced in this paper extends a large number of well-known data analysis methods to the symbolic context by simply choosing appropriate kernel functions. We observe that the symbolic GCCA algorithm is computationally efficient with regard to the number of iterations needed to reach convergence. However, there is no guarantee that the algorithm converges towards a global optimum of the criterion, and a restart strategy can be used for reaching a globally optimal solution. A bootstrap method providing confidence intervals can be performed in order to indicate how reliable the correlations estimated with symbolic GCCA are.

Last, the choice of the hyperparameters (γ and τ_j) can be determined by a standard cross-validation procedure.

6 Acknowledgments

This work was supported by a grant from the French National Research Agency (ANR Investissement d'Avenir BRAINOMICS; grant ANR-10-BINF-04).

References

1. Billard, L. and Diday, E. Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, 2006.
2. Diday, E. and Noirhomme, M. Symbolic Data Analysis and the SODAS Software. Wiley, 2008.
3. Borga, M., Landelius, T. and Knutsson, H. A unified approach to PCA, PLS, MLR and CCA. Technical report, 1997.

4. Bougeard, S., Hanafi, M. and Qannari, E.M. Continuum redundancy-PLS regression: a simple continuum approach. Computational Statistics and Data Analysis, 52, 2008.
5. Burnham, A.J., Viveros, R. and MacGregor, J.F. Frameworks for latent variable multivariate regression. Journal of Chemometrics, 10:31–45, 1996.
6. Carroll, J.D. A generalization of canonical correlation analysis to three or more sets of variables. In Proc. 76th Conv. Am. Psych. Assoc., 1968.
7. Chessel, D. and Hanafi, M. Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44:35–60, 1996.
8. Courtois, A., Genest, Y., Afonso, F., Diday, E. and Orcesi, A. In-service inspection of reinforced concrete cooling towers: EDF's feedback. IALCCE 2012.
9. Cuturi, M. Positive definite kernels in machine learning. arXiv preprint, 2009.
10. Hanafi, M. and Kiers, H.A.L. Analysis of K sets of data, with differential emphasis on agreement between and within sets. Computational Statistics and Data Analysis, 51, 2006.
11. Horst, P. Relations among m sets of measures. Psychometrika, 26, 1961.
12. Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 1933.
13. Kettenring, J.R. Canonical analysis of several sets of variables. Biometrika, 58:433–451, 1971.
14. Kramer, N. Analysis of high-dimensional data with partial least squares and boosting. Doctoral dissertation, Technische Universität Berlin.
15. Leurgans, S.E., Moyeed, R.A. and Silverman, B.W. Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society, Series B, 55:725–740, 1993.
16. Lohmöller, J.B. Latent Variables Path Modeling with Partial Least Squares. Physica-Verlag, Heidelberg, 1989.
17. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

18. Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, 2004.
19. Schölkopf, B. and Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
20. Takane, Y. and Hwang, H. Regularized linear and kernel redundancy analysis. Computational Statistics and Data Analysis, 52:394–405, 2007.
21. Tenenhaus, A., Philippe, C. and Frouin, V. Kernel Generalized Canonical Correlation Analysis. Technical report.
22. Tenenhaus, A. and Tenenhaus, M. Regularized Generalized Canonical Correlation Analysis. Psychometrika, 76:257–284, 2011.
23. Tenenhaus, M., Esposito Vinzi, V., Chatelin, Y.-M. and Lauro, C. PLS path modeling. Computational Statistics and Data Analysis, 48:159–205, 2005.
24. Tucker, L.R. An inter-battery method of factor analysis. Psychometrika, 23:111–136, 1958.
25. Van den Wollenberg, A.L. Redundancy analysis: an alternative for canonical correlation analysis. Psychometrika, 42:207–219, 1977.
26. Van de Geer, J.P. Linear relations among k sets of variables. Psychometrika, 49, 1984.
27. Vinod, H.D. Canonical ridge and econometrics of joint production. Journal of Econometrics, 4:147–166, 1976.
28. Wold, H. Soft modeling: the basic design and some extensions. In Systems Under Indirect Observation, Part 2, K.G. Jöreskog and H. Wold (Eds), North-Holland, Amsterdam, pages 1–54, 1982.
29. Wold, H. Partial Least Squares. In Encyclopedia of Statistical Sciences, vol. 6, Kotz, S. and Johnson, N.L. (Eds), John Wiley and Sons, New York, 1985.
30. Wold, S., Martens, H. and Wold, H. The multivariate calibration problem in chemistry solved by the PLS method. In Proc. Conf. Matrix Pencils, Ruhe, A. and Kågström, B. (Eds), March 1982, Lecture Notes in Mathematics, Springer Verlag, Heidelberg, 1983.
