Tensor Canonical Correlation Analysis and Its Applications

1 Tensor Canonical Correlation Analysis and Its Applications Presenter: Yong LUO. This work was done while Yong LUO was a Research Fellow at Nanyang Technological University, Singapore.

2 Outline Y. Luo, D. C. Tao, R. Kotagiri, C. Xu, and Y. G. Wen, Tensor Canonical Correlation Analysis for Multi-view Dimension Reduction, IEEE Transactions on Knowledge and Data Engineering (T-KDE), vol. 27, no. 11, 2015. Y. Luo, Y. G. Wen, and D. C. Tao, On Combining Side Information and Unlabeled Data for Heterogeneous Multi-task Metric Learning, International Joint Conference on Artificial Intelligence (IJCAI), 2016.

3 Multi-view dimension reduction (MVDR) Dimension reduction (DR): find a low-dimensional representation for high-dimensional data. Benefits: reduces the chance of over-fitting, reduces computational cost, etc. Approaches: feature selection (IG, MI, sparse learning, etc.) and feature transformation (PCA, LDA, LE, etc.)

4 MVDR Real-world objects usually contain information from multiple sources, and different kinds of features can be extracted from them. Traditional DR methods cannot effectively handle multiple types of features; the common workaround is feature concatenation.

5 MVDR Multi-view learning: learn to fuse multiple distinct feature representations. Families: weighted view combination, multi-view dimension reduction, view agreement exploration. Multi-view dimension reduction includes multi-view feature selection and multi-view subspace learning, which seeks a low-dimensional common subspace to compactly represent the heterogeneous data. One of the most representative models: CCA.

6 Canonical correlation analysis (CCA) Objective of CCA: correlation maximization in the common subspace. With canonical variables $z_{1n} = x_{1n}^T h_1$ and $z_{2n} = x_{2n}^T h_2$, CCA solves
$$\arg\max_{h_1, h_2} \rho = \mathrm{corr}(z_1, z_2) = \frac{h_1^T C_{12} h_2}{\sqrt{h_1^T C_{11} h_1}\,\sqrt{h_2^T C_{22} h_2}}.$$
H. Hotelling, Relations between two sets of variates, Biometrika, 1936. D. P. Foster, et al., Multi-view dimensionality reduction via canonical correlation analysis, Tech. Rep., 2008.
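As a concrete illustration, a minimal NumPy sketch of the standard two-view CCA solution: whiten each view's covariance and take the top singular vectors of the whitened cross-covariance. The data shapes and the eps regularizer are arbitrary assumptions, not values from the slides.

```python
import numpy as np

def cca(X1, X2, eps=1e-3):
    """Two-view CCA. X1: d1 x N, X2: d2 x N, columns are samples.
    Returns the first pair of canonical directions (h1, h2) and rho."""
    X1 = X1 - X1.mean(axis=1, keepdims=True)
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    N = X1.shape[1]
    C11 = X1 @ X1.T / N + eps * np.eye(X1.shape[0])   # regularized covariances
    C22 = X2 @ X2.T / N + eps * np.eye(X2.shape[0])
    C12 = X1 @ X2.T / N

    def inv_sqrt(C):  # symmetric inverse square root of an SPD matrix
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    K = inv_sqrt(C11) @ C12 @ inv_sqrt(C22)           # whitened cross-covariance
    U, s, Vt = np.linalg.svd(K)
    h1 = inv_sqrt(C11) @ U[:, 0]                      # un-whiten the directions
    h2 = inv_sqrt(C22) @ Vt[0, :]
    return h1, h2, s[0]                               # s[0] is the canonical correlation

X1, X2 = np.random.randn(10, 500), np.random.randn(8, 500)
h1, h2, rho = cca(X1, X2)
print(rho)
```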

7 Generalizations of CCA to several views CCA-MAXVAR generalizes CCA to $M \geq 2$ views:
$$\arg\min_{z, \{\alpha_m\}, \{h_m\}} \frac{1}{M}\sum_{m=1}^M \|z - \alpha_m z_m\|_2^2, \quad \text{s.t. } \|z\|_2 = 1,$$
where $z_m = X_m^T h_m$ is the vector of canonical variables for the $m$-th view, and $z$ is a centroid representation. Solutions can be obtained using the SVD of $X_m$. J. R. Kettenring, Canonical analysis of several sets of variables, Biometrika, 1971.

8 Generalizations of CCA to several views CCA-LS:
$$\arg\min_{\{h_m\}_{m=1}^M} \frac{1}{2M(M-1)}\sum_{p,q=1}^{M} \|X_p^T h_p - X_q^T h_q\|_2^2, \quad \text{s.t. } \frac{1}{M}\sum_{m=1}^M h_m^T C_{mm} h_m = 1.$$
Equivalent to CCA-MAXVAR, but can be solved efficiently and adaptively based on LS regression. J. Via et al., A learning algorithm for adaptive canonical correlation analysis of several data sets, Neural Networks, 2007.

9 The proposed TCCA framework Main drawback of CCA-MAXVAR and CCA-LS: only the statistics (correlation information) between pairs of views are explored, while high-order statistics are ignored. Tensor CCA directly maximizes the high-order correlation between all views. [Figure: pairwise correlation between views versus high-order tensor correlation.]

10 The proposed TCCA framework for MVDR [Figure: TCCA pipeline. Three views of $N$ samples, e.g., LAB color $X_1$, wavelet texture $X_2$, and SIFT $X_3$, are used to build the covariance tensor $\mathcal{C}_{123}$; it is approximated by a sum of rank-1 terms $\sum_{k=1}^r \lambda_k\, u_1^k \circ u_2^k \circ u_3^k$; the resulting mappings $U_1, U_2, U_3$ project the views into $Z_1, Z_2, Z_3$, which are concatenated into the $3r$-dimensional common representation $Z$.]

11 Tensor basics A tensor is a generalization of vectors and matrices to an arbitrary number of modes (an N-way array). Scalar: order-0 tensor. Vector: order-1 tensor. Matrix: order-2 tensor. [Figure: an order-3 tensor.]

12 Tensor basics Tensor-matrix multiplication: the $m$-mode product of an $I_1 \times I_2 \times \cdots \times I_M$ tensor $\mathcal{A}$ and a $J_m \times I_m$ matrix $U$ is a tensor $\mathcal{B} = \mathcal{A} \times_m U$ of size $I_1 \times \cdots \times I_{m-1} \times J_m \times I_{m+1} \times \cdots \times I_M$ with elements
$$\mathcal{B}(i_1, \ldots, i_{m-1}, j_m, i_{m+1}, \ldots, i_M) = \sum_{i_m=1}^{I_m} \mathcal{A}(i_1, i_2, \ldots, i_M)\, U(j_m, i_m).$$
The product of $\mathcal{A}$ and a sequence of matrices $\{U_m\}$ is $\mathcal{B} = \mathcal{A} \times_1 U_1 \times_2 U_2 \cdots \times_M U_M$.
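As an illustration, a minimal NumPy sketch of the $m$-mode product implemented directly from the elementwise definition above; the shapes are arbitrary examples and the mode index is 0-based in code.

```python
import numpy as np

def mode_m_product(A, U, m):
    """m-mode product B = A x_m U (m is 0-based): contract mode m of A
    with the columns of U, then move the new J_m axis back to position m."""
    B = np.tensordot(A, U, axes=([m], [1]))   # tensordot appends U's row axis last
    return np.moveaxis(B, -1, m)

# Example: a 3 x 4 x 5 tensor multiplied along mode 1 by a 2 x 4 matrix
A = np.random.randn(3, 4, 5)
U = np.random.randn(2, 4)
print(mode_m_product(A, U, 1).shape)  # (3, 2, 5)
```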

13 Tensor basics Tensor-vector multiplication: the contracted $m$-mode product of $\mathcal{A}$ and an $I_m$-vector $u$ is an $I_1 \times \cdots \times I_{m-1} \times I_{m+1} \times \cdots \times I_M$ tensor $\mathcal{B} = \mathcal{A}\,\bar{\times}_m\, u$ of order $M-1$ with entries
$$\mathcal{B}(i_1, \ldots, i_{m-1}, i_{m+1}, \ldots, i_M) = \sum_{i_m=1}^{I_m} \mathcal{A}(i_1, i_2, \ldots, i_M)\, u(i_m).$$
Tensor-tensor multiplication: outer product, contracted product, inner product. Frobenius norm of a tensor:
$$\|\mathcal{A}\|_F^2 = \langle \mathcal{A}, \mathcal{A} \rangle = \sum_{i_1=1}^{I_1}\sum_{i_2=1}^{I_2}\cdots\sum_{i_M=1}^{I_M} \mathcal{A}(i_1, i_2, \ldots, i_M)^2.$$
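A matching sketch (again with purely illustrative shapes) of the contracted $m$-mode product with a vector and of the tensor Frobenius norm:

```python
import numpy as np

def mode_m_vec_product(A, u, m):
    """Contracted m-mode product A xbar_m u: contracts mode m (0-based)
    of tensor A with vector u, reducing the order by one."""
    return np.tensordot(A, u, axes=([m], [0]))

A = np.random.randn(3, 4, 5)
u = np.random.randn(4)
print(mode_m_vec_product(A, u, 1).shape)        # (3, 5)
print(np.sum(A ** 2), np.linalg.norm(A) ** 2)   # squared Frobenius norm, two ways
```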

14 Tensor basics Matricization: the mode-$m$ matricization of $\mathcal{A}$ is the $I_m \times (I_1 \cdots I_{m-1} I_{m+1} \cdots I_M)$ matrix $A_{(m)}$ whose columns are the mode-$m$ fibers of $\mathcal{A}$. [Figure: the mode-1 (frontal), mode-2 (horizontal), and mode-3 matricizations $A_{(1)}$, $A_{(2)}$, $A_{(3)}$ of an order-3 tensor, obtained by row-wise and column-wise vectorization.]

15 Tensor basics Matricization property: the $m$-mode multiplication $\mathcal{B} = \mathcal{A} \times_m U$ can be carried out as a matrix multiplication by storing the tensors in matricized form, i.e., $B_{(m)} = U A_{(m)}$. A series of $m$-mode products can be expressed with Kronecker products:
$$\mathcal{B} = \mathcal{A} \times_1 U_1 \times_2 U_2 \cdots \times_M U_M \;\Longleftrightarrow\; B_{(m)} = U_m A_{(m)} \bigl(U_{c_1} \otimes U_{c_2} \otimes \cdots \otimes U_{c_{M-1}}\bigr)^T,$$
where $(c_1, c_2, \ldots, c_{M-1}) = (m+1, m+2, \ldots, M, 1, 2, \ldots, m-1)$ is a forward cyclic ordering of the tensor modes.
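The following sketch checks the single-mode identity $B_{(m)} = U A_{(m)}$ numerically. Note that the unfolding below flattens the remaining modes in NumPy's default order rather than the forward cyclic ordering of the slide; for this identity any fixed ordering works as long as it is used consistently.

```python
import numpy as np

def unfold(A, m):
    """Mode-m matricization A_(m): mode m indexes the rows, the remaining
    modes are flattened into the columns."""
    return np.moveaxis(A, m, 0).reshape(A.shape[m], -1)

def fold(A_m, shape, m):
    """Inverse of unfold: rebuild a tensor of the given shape."""
    full = (A_m.shape[0],) + tuple(s for i, s in enumerate(shape) if i != m)
    return np.moveaxis(A_m.reshape(full), 0, m)

# m-mode product through the matricized form B_(m) = U A_(m)
A = np.random.randn(3, 4, 5)
U = np.random.randn(2, 4)
B = fold(U @ unfold(A, 1), A.shape, 1)

# Sanity check against the elementwise definition of the mode product
B_ref = np.einsum('ilk,jl->ijk', A, U)
print(np.allclose(B, B_ref))  # True
```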

16 TCCA formulation Optimization problem: maximize the correlation between the canonical variables $z_m = X_m^T h_m$, $m = 1, \ldots, M$:
$$\arg\max_{\{h_m\}} \rho = \mathrm{corr}(z_1, z_2, \ldots, z_M) = (z_1 \odot z_2 \odot \cdots \odot z_M)^T e, \quad \text{s.t. } z_m^T z_m = 1,\; m = 1, \ldots, M,$$
where $\odot$ is the element-wise product and $e$ is the all-ones vector. Equivalent formulation:
$$\arg\max_{\{h_m\}} \rho = \mathcal{C}_{12\cdots M}\,\bar{\times}_1 h_1^T\,\bar{\times}_2 h_2^T \cdots \bar{\times}_M h_M^T, \quad \text{s.t. } h_m^T (C_{mm} + \varepsilon I) h_m = 1,\; m = 1, \ldots, M,$$
where the covariance tensor is $\mathcal{C}_{12\cdots M} = \frac{1}{N}\sum_{n=1}^N x_{1n} \circ x_{2n} \circ \cdots \circ x_{Mn}$.
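To make the covariance tensor concrete, a small synthetic sketch for $M = 3$ views; all shapes are made-up examples, and in practice the views should be centered first.

```python
import numpy as np

# Covariance tensor C_{123} = (1/N) sum_n x_1n o x_2n o x_3n for three views.
N, d1, d2, d3 = 500, 20, 15, 10
X1 = np.random.randn(d1, N)   # columns are (centered) samples of view 1
X2 = np.random.randn(d2, N)   # view 2
X3 = np.random.randn(d3, N)   # view 3

C = np.einsum('in,jn,kn->ijk', X1, X2, X3) / N   # d1 x d2 x d3 covariance tensor
print(C.shape)  # (20, 15, 10)
```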

17 TCCA formulation Reformulation: let $\mathcal{M} = \mathcal{C}_{12\cdots M} \times_1 \tilde{C}_{11}^{-1/2} \times_2 \tilde{C}_{22}^{-1/2} \cdots \times_M \tilde{C}_{MM}^{-1/2}$ and $u_m = \tilde{C}_{mm}^{1/2} h_m$, where $\tilde{C}_{mm} = C_{mm} + \varepsilon I$. Main solution:
$$\arg\max_{\{u_m\}} \rho = \mathcal{M}\,\bar{\times}_1 u_1^T\,\bar{\times}_2 u_2^T \cdots \bar{\times}_M u_M^T, \quad \text{s.t. } u_m^T u_m = 1,\; m = 1, \ldots, M.$$
If we define $\widehat{\mathcal{M}} = \rho\, u_1 \circ u_2 \circ \cdots \circ u_M$, the problem becomes the best rank-1 approximation [Lathauwer et al., 2000a]
$$\arg\min_{\{u_m\}} \|\mathcal{M} - \widehat{\mathcal{M}}\|_F^2,$$
solved by alternating least squares (ALS), the higher-order power method (HOPM), etc. L. De Lathauwer et al., On the best rank-1 and rank-$(R_1, R_2, \ldots, R_N)$ approximation of higher-order tensors, SIAM J. Matrix Anal. Appl., 2000.
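A minimal sketch (for $M = 3$, with a random tensor standing in for $\mathcal{M}$) of the higher-order power method for the best rank-1 approximation; it omits convergence checks, deflation for further components, and the whitening step.

```python
import numpy as np

def rank1_hopm(M, n_iter=100):
    """Higher-order power method (an ALS scheme) for the best rank-1
    approximation M ~ rho * u1 o u2 o u3 of an order-3 tensor."""
    u1 = np.linalg.svd(M.reshape(M.shape[0], -1))[0][:, 0]   # init from mode-1 unfolding
    u2 = np.random.randn(M.shape[1]); u2 /= np.linalg.norm(u2)
    u3 = np.random.randn(M.shape[2]); u3 /= np.linalg.norm(u3)
    for _ in range(n_iter):
        u1 = np.einsum('ijk,j,k->i', M, u2, u3); u1 /= np.linalg.norm(u1)
        u2 = np.einsum('ijk,i,k->j', M, u1, u3); u2 /= np.linalg.norm(u2)
        u3 = np.einsum('ijk,i,j->k', M, u1, u2); u3 /= np.linalg.norm(u3)
    rho = np.einsum('ijk,i,j,k->', M, u1, u2, u3)            # attained correlation
    return rho, u1, u2, u3

rho, u1, u2, u3 = rank1_hopm(np.random.randn(20, 15, 10))
print(rho)
```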

18 TCCA solution Remaining solutions: obtained by recursively maximizing the same correlation as in the main TCCA problem. All solutions together form the best sum of rank-1 approximations, i.e., a rank-$r$ CP decomposition of $\mathcal{M}$:
$$\mathcal{M} \approx \sum_{k=1}^r \rho_k\, u_1^k \circ u_2^k \circ \cdots \circ u_M^k.$$
Projected data: $Z_m = X_m^T \tilde{C}_{mm}^{-1/2} U_m$, where $U_m = [u_m^1, \ldots, u_m^r]$.
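A sketch of the projection step for one view, under the assumption that a factor matrix $U_1$ has already been obtained (here it is random, purely for illustration):

```python
import numpy as np

def inv_sqrt(C):
    """Symmetric inverse square root of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

d1, N, r, eps = 20, 500, 5, 1e-3
X1 = np.random.randn(d1, N)               # view 1, columns are samples
C11 = X1 @ X1.T / N + eps * np.eye(d1)    # C_tilde_11 = C_11 + eps * I
U1 = np.random.randn(d1, r)               # stand-in for [u_1^1, ..., u_1^r]
Z1 = X1.T @ inv_sqrt(C11) @ U1            # Z_1 = X_1^T C_tilde_11^{-1/2} U_1
print(Z1.shape)  # (500, 5)
```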

19 KTCCA formulation Non-linear extension: apply a non-linear feature mapping $\phi$ to each view, $\Phi(X_m) = [\phi(x_{m1}), \phi(x_{m2}), \ldots, \phi(x_{mN})]$. Canonical variables: $z_m = \Phi(X_m)^T h_m$. Representer theorem: $h_m = \Phi(X_m) a_m$. Optimization problem:
$$\arg\max_{\{a_m\}} \rho = \mathcal{K}_{12\cdots M}\,\bar{\times}_1 a_1^T\,\bar{\times}_2 a_2^T \cdots \bar{\times}_M a_M^T, \quad \text{s.t. } a_m^T (K_{mm}^2 + \varepsilon K_{mm}) a_m = 1,\; m = 1, \ldots, M,$$
where the constraint matrix can be factorized as $K_{mm}^2 + \varepsilon K_{mm} = L_m^T L_m$.

20 KTCCA solution Reformulation: let $\mathcal{S} = \mathcal{K}_{12\cdots M} \times_1 L_1^{-T} \times_2 L_2^{-T} \cdots \times_M L_M^{-T}$ and $b_m = L_m a_m$, so that
$$\arg\max_{\{b_m\}} \rho = \mathcal{S}\,\bar{\times}_1 b_1^T\,\bar{\times}_2 b_2^T \cdots \bar{\times}_M b_M^T, \quad \text{s.t. } b_m^T b_m = 1,\; m = 1, \ldots, M,$$
which is solved by ALS. Projected data: $Z_m = K_{mm} L_m^{-1} B_m$, $m = 1, \ldots, M$.

21 Experimental setup Datasets SecStr: protein secondary structure prediction. 84K instances, 100 as labeled, an additional 1200K unlabeled. 3 views: attributes based on the left, middle, and right context generated from the sequence window of amino acids; each view is 105-D. Advertisement classification: 3279 instances, 100 as labeled. 3 views: features based on the terms in the images (588-D), terms in the current URL (495-D), and terms in the anchor URL (472-D). Web image annotation: images, {4, 6, 8} labeled instances for each of 10 concepts. 3 views: 500-D SIFT visual words, 144-D color, 128-D wavelet. Classifiers: RLS and KNN. Evaluation criterion: prediction/classification/annotation accuracy.

22 Experimental setup Compared methods BSF: best single-view feature. CAT: concatenation of the normalized features. FRAC: a recent multi-view feature selection algorithm. CCA: applied to each of the $M(M-1)/2$ subsets of two views; CCA (BST): the best subset; CCA (AVG): the average performance over all subsets. CCA-LS: a traditional generalization of CCA to several views. DSE: a popular unsupervised multi-view DR method. SSMVD: a recent unsupervised multi-view DR method. TCCA: the proposed method.

23 Experimental results and analysis Protein secondary structure prediction [Figures: accuracy vs. dimension with 84K and with 1.3M unlabeled samples] Learning a common subspace > CAT > BSF. SSMVD and CCA-LS are comparable, as are DSE and CCA (BST). TCCA is the best at most dimensions, and its performance does not decrease significantly when the dimension is high.

24 Experimental results and analysis Web image annotation [Figures: linear and non-linear results] DSE is comparable to CCA (BST) and CCA (AVG). TCCA > SSMVD, and is better than the other CCA-based methods. Non-linear > linear.

25 Conclusions and discussion Conclusions: Finding a common subspace for all views using the CCA-based strategy is often better than simply concatenating all the features, especially when the feature dimension is high. Examining more statistics, which may require more unlabeled data to be utilized, often leads to better performance; by exploring the high-order statistics, the proposed TCCA outperforms the other methods. Discussion: Can the common subspace be used for knowledge transfer between different views?

26 Distance metric learning (DML) Goal: learn an appropriate distance function over the input space to reflect relationships between data. Useful in many ML algorithms, e.g., clustering, classification, and information retrieval. Most common DML scheme: Mahalanobis metric learning, which amounts to learning a linear transformation:
$$d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j) = \|U^T x_i - U^T x_j\|_2^2, \quad A = U U^T.$$
Non-linear and local DML are able to capture complex structure in the data.
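A tiny numerical check (with purely illustrative shapes) that the Mahalanobis distance with $A = U U^T$ equals the squared Euclidean distance after the linear map $U^T$:

```python
import numpy as np

d, r = 10, 3
U = np.random.randn(d, r)
A = U @ U.T                                      # a PSD Mahalanobis metric

x_i, x_j = np.random.randn(d), np.random.randn(d)
diff = x_i - x_j
d_metric = diff @ A @ diff                       # (x_i - x_j)^T A (x_i - x_j)
d_mapped = np.sum((U.T @ x_i - U.T @ x_j) ** 2)  # ||U^T x_i - U^T x_j||_2^2
print(np.isclose(d_metric, d_mapped))  # True
```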

27 Transfer DML (TDML) Motivation: DML needs a large amount of side information to learn a robust distance metric, but the training samples are insufficient in the task/domain of interest (target task/domain), while we have abundant labeled data in certain related, but different, tasks/domains (source tasks/domains). Goal: utilize the metrics obtained from the source tasks/domains to help metric learning in the target tasks/domains.

28 Homogeneous TDML (HoTDML) Data of the source domain and target domain are drawn from different distributions (but lie in the same feature space). Examples [Pan and Yang, 2010]: Web document classification: university website -> new website. Indoor WiFi localization: WiFi signal-strength values change in different time periods or on different devices. Sentiment classification: the distribution of reviews among different types of products can be very different. Challenge: how to utilize the source information appropriately given the different distributions, or find a subspace in which the distribution difference is reduced. S. J. Pan and Q. Yang, A survey on transfer learning, IEEE TKDE, 2010.

29 Heterogeneous TDML (HeTDML) Data of the source domain and target domain lie in different feature spaces, and may have different semantics. Examples: multi-lingual document classification (labeled reviews in Spanish are scarce, labeled reviews in English are abundant, and the goal is to classify reviews in Spanish), multi-view classification or retrieval, etc. Challenge: how to find correspondences or common representations for the different domains.

30 HeTDML existing solutions Heterogeneous transfer learning (HTL) approaches usually transform heterogeneous features into a common subspace, and the transformation can be used to derive a metric. Groups: Heterogeneous domain adaptation (HDA), which improves the performance in the target domain, and most HDA methods only handle two domains; Heterogeneous multi-task learning (HMTL), which improves the performance of all domains simultaneously.

31 Heterogeneous multi-task metric learning (HMTML) Limitations of existing HMTL approaches: they do not optimize w.r.t. the metric, mainly focus on utilizing the side information, and can only explore the pairwise relationships between different domains; the high-order statistics that can only be obtained by simultaneously examining all domains are ignored. Our method handles an arbitrary number of domains and directly optimizes w.r.t. the metrics, makes use of large amounts of unlabeled data to build domain connections, and explores high-order statistics between all domains.

32 HMTML framework [Figure: labeled data $D_1^L, \ldots, D_M^L$ from heterogeneous domains (e.g., English and German documents) and unlabeled data $D^U$. Each domain's metric is decomposed as $A_m = U_m U_m^T$; the unlabeled samples $x_{mn}^U$ of each domain $X_m$ are projected by $U_m$ into representations $z_{mn}^U$ in $Z_m$, and the mappings $U_1, \ldots, U_M$ are coupled through tensor-based correlation maximization.]

33 HMTML formulation Optimization problem, general formulation:
$$\arg\min_{\{A_m\}_{m=1}^M} F(\{A_m\}) = \sum_{m=1}^M \Psi(A_m) + \gamma R(A_1, A_2, \ldots, A_M), \quad \text{s.t. } A_m \succeq 0,\; m = 1, 2, \ldots, M,$$
where $\Psi(A_m) = \frac{1}{N_m(N_m-1)}\sum_{i<j} L(A_m; x_{mi}, x_{mj}, y_{mij})$ is the empirical loss w.r.t. $A_m$, and $R(A_1, A_2, \ldots, A_M)$ enforces information transfer across different domains.

34 Knowledge transfer by high-order correlation maximization Main idea: decompose $A_m$ as $A_m = U_m U_m^T$, and use $U_m$ to project the unlabeled data points of the different domains into a common subspace, where the correlation of all domains is maximized. Formulation:
$$\arg\max_{\{U_m\}_{m=1}^M} \frac{1}{N_U}\sum_{n=1}^{N_U} \mathrm{corr}\bigl(z_{1n}^U, z_{2n}^U, \ldots, z_{Mn}^U\bigr),$$
where $\mathrm{corr}(z_{1n}^U, z_{2n}^U, \ldots, z_{Mn}^U) = (z_{1n}^U \odot z_{2n}^U \odot \cdots \odot z_{Mn}^U)^T e$ is the correlation of the projected representations $z_{mn}^U = U_m^T x_{mn}^U$.

35 Knowledge transfer by high-order correlation maximization Reformulation: let $\mathcal{G} = \mathcal{E}_r \times_1 U_1 \times_2 U_2 \cdots \times_M U_M$ be the covariance tensor of the mappings (with $\mathcal{E}_r$ the order-$M$ diagonal tensor of ones), and let $\mathcal{C}_n^U = x_{1n}^U \circ x_{2n}^U \circ \cdots \circ x_{Mn}^U$ be the covariance tensor of the representations for the $n$-th unlabeled sample. Then
$$\arg\max_{\{U_m\}_{m=1}^M} \frac{1}{N_U}\sum_{n=1}^{N_U} \mathrm{corr}\bigl(z_{1n}^U, z_{2n}^U, \ldots, z_{Mn}^U\bigr) = \arg\max_{\{U_m\}_{m=1}^M} \frac{1}{N_U}\sum_{n=1}^{N_U} \mathcal{G}\,\bar{\times}_1 (x_{1n}^U)^T \cdots \bar{\times}_M (x_{Mn}^U)^T \;\text{[Luo et al., 2015]}$$
$$= \arg\min_{\{U_m\}_{m=1}^M} \Bigl\|\frac{1}{N_U}\sum_{n=1}^{N_U} \mathcal{C}_n^U - \mathcal{G}\Bigr\|_F^2 \;\text{[Lathauwer et al., 2000b]}.$$
Y. Luo et al., Tensor Canonical Correlation Analysis for Multi-view Dimension Reduction, IEEE TKDE, 2015. L. De Lathauwer et al., A multilinear singular value decomposition, SIAM J. Matrix Anal. Appl., 2000.
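To illustrate the coupling term, a small synthetic sketch for $M = 3$ domains that evaluates $\|\frac{1}{N_U}\sum_n \mathcal{C}_n^U - \mathcal{G}\|_F^2$; the dimensions, data, and mappings below are made-up placeholders.

```python
import numpy as np

N_U, d1, d2, d3, r = 200, 12, 10, 8, 4
X = [np.random.randn(d, N_U) for d in (d1, d2, d3)]   # unlabeled data per domain
U = [np.random.randn(d, r) for d in (d1, d2, d3)]     # mappings, A_m = U_m U_m^T

# Average outer-product tensor (1/N_U) * sum_n x_1n o x_2n o x_3n
C_avg = np.einsum('in,jn,kn->ijk', X[0], X[1], X[2]) / N_U

# G = E_r x_1 U_1 x_2 U_2 x_3 U_3, i.e., G(i,j,k) = sum_c U_1(i,c) U_2(j,c) U_3(k,c)
G = np.einsum('ic,jc,kc->ijk', U[0], U[1], U[2])

coupling = np.sum((C_avg - G) ** 2)   # squared Frobenius norm of the difference
print(coupling)
```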

36 HMTML formulation Specific optimization problem:
$$\arg\min_{\{U_m\}_{m=1}^M} F(\{U_m\}) = \sum_{m=1}^M \frac{1}{N_m}\sum_{k=1}^{N_m} g\!\left(y_{mk}\bigl(1 - \delta_{mk}^T U_m U_m^T \delta_{mk}\bigr)\right) + \gamma \Bigl\|\frac{1}{N_U}\sum_{n=1}^{N_U} \mathcal{C}_n^U - \mathcal{G}\Bigr\|_F^2 + \sum_{m=1}^M \gamma_m \|U_m\|_1.$$
This corresponds to finding a subspace where the representations of all domains are close to each other. Knowledge is transferred in this subspace, so different domains can help each other in learning the mapping $U_m$, or equivalently the metric $A_m$.

37 HMTML solution Rewrite $\|\mathcal{C}_n^U - \mathcal{G}\|_F^2$ as an expression w.r.t. $U_m$. Using the matricizing property [Lathauwer et al., 2000a], $\mathcal{G} = \mathcal{E}_r \times_1 U_1 \times_2 U_2 \cdots \times_M U_M$ gives $G_{(m)} = U_m B_{(m)}$, where $\mathcal{B} = \mathcal{E}_r \times_1 U_1 \cdots \times_{m-1} U_{m-1} \times_{m+1} U_{m+1} \cdots \times_M U_M$. Hence
$$\|\mathcal{C}_n^U - \mathcal{G}\|_F^2 = \|(C_n^U)_{(m)} - G_{(m)}\|_F^2 = \|(C_n^U)_{(m)} - U_m B_{(m)}\|_F^2.$$
Alternate over the $U_m$ and solve each subproblem w.r.t. $U_m$ by projected gradient descent. L. De Lathauwer et al., On the best rank-1 and rank-$(R_1, R_2, \ldots, R_N)$ approximation of higher-order tensors, SIAM J. Matrix Anal. Appl., 2000.

38 Experiments Datasets and features Reuters multilingual collection (RMLC): 6 categories, 3 domains: English (EN), Italian (IT), Spanish (SP); number of documents: EN=18758, IT=24039, SP=12342; TF-IDF features, with PCA preprocessing to find comparable and high-level patterns for transfer. NUS-WIDE: subset of 12 animal concepts, images + tags; {SIFT, wavelet, tag} representations with PCA preprocessing, where each representation is treated as a domain. Evaluation criteria: accuracy, MacroF1.

39 Experiments Compared methods EU: Euclidean distance between samples based on their original feature representations. RDML: an efficient and competitive DML algorithm that does not make use of any additional information from other domains. DAMA: constructs mappings $U_m$ to link multiple heterogeneous domains using manifold alignment. MTDA: the multi-task extension of linear discriminant analysis. HMTML: the proposed method.

40 Experiments Average performance of all domains w.r.t. the number of common factors: although the labeled samples in each domain are scarce, learning the distance metric separately using RDML can still improve the performance significantly.

41 Experiments Average performance of all domains w.r.t. the number of common factors: all three heterogeneous transfer learning approaches achieve much better performance than RDML, which indicates that it is useful to leverage information from other domains in DML.

42 Experiments Average performance of all domains w.r.t. the number of common factors: HMTML outperforms both DAMA and MTDA at most numbers of common factors, which indicates that the factors learned by our method are more expressive than those of the other approaches.

43 Experiments Performance for individual domains RDML improves the performance in each domain, and the improvements are similar for different domains, since there is no communication between them

44 Experiments Performance for individual domains: the transfer learning methods yield much larger improvements than RDML in the domains where the discriminative ability of the original representations is not very good. This demonstrates that knowledge is successfully transferred between different domains.

45 Experiments Performance for individual domains The discriminative domain obtains little benefit from the other relatively non-discriminative domains in DAMA and MTDA, while in the proposed HMTML, we still achieve significant improvements

46 Conclusions The labeled data deficiency problem can be alleviated by learning metrics for multiple heterogeneous domains simultaneously. The shared knowledge of different domains exploited by the transfer learning methods can benefit each domain if appropriate common factors are discovered, and the high-order statistics (correlation information) are critical in discovering such factors.

47 Thank You! Q & A
