A Least Squares Formulation for Canonical Correlation Analysis


A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University

Motivation Canonical Correlation Analysis (CCA) is commonly used for finding the correlations between two sets of multi-dimensional variables (Hotelling, 1936; Hardoon et al., 2004; Vert & Kanehisa, 2003). Two representations $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$. One popular use of CCA is for supervised learning, in which one view is derived from the data and the other view is derived from the class labels. CCA involves a generalized eigenvalue problem, which is computationally expensive to solve. It is challenging to derive sparse CCA models.

Motivation CCA is closely related to linear regression ($Y \in \mathbb{R}^{1 \times n}$). CCA is equivalent to Fisher Linear Discriminant Analysis (LDA) for binary-class problems. Fisher LDA can be formulated as a least squares problem for binary-class problems. It can be solved efficiently using conjugate gradient. Sparse models can be derived readily using 1-norm regularization (Lasso). Multivariate linear regression (MLR) is a well-studied technique for regression problems. To apply MLR to multi-label classification, one key issue is how to define an appropriate class indicator matrix. Can we extend this equivalence relationship to the general (multi-label) case?

Main Contributions We establish the equivalence relationship between CCA and multivariate linear regression for multi-label problems under a mild condition. Based on the equivalence relationship, several CCA extensions including sparse CCA are derived. The entire solution path for sparse CCA can be readily computed by the Least Angle Regression (LARS) algorithm. Our experiments confirm the equivalence relationship, and the results also show that the performance of the two models is very close even when the assumption is violated.

Outline 1 Background 2 CCA Versus Multivariate Linear Regression 3 CCA Extensions 4 Experiment

Background: CCA In CCA two different representations of the same set of objects are given, and a projection is computed for each representation such that they are maximally correlated in the dimensionality-reduced space. Two representations $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$. CCA attempts to maximize the following correlation coefficient w.r.t. $w_x$ and $w_y$:
$$\rho = \frac{w_x^T X Y^T w_y}{\sqrt{(w_x^T X X^T w_x)\,(w_y^T Y Y^T w_y)}}. \qquad (1)$$

Background: CCA The optimization problem in CCA can be formulated as
$$\max_{w_x,\, w_y} \; w_x^T X Y^T w_y \qquad (2)$$
subject to $w_x^T X X^T w_x = 1$ and $w_y^T Y Y^T w_y = 1$. Assuming $Y Y^T$ is nonsingular, $w_x$ can be obtained by computing the eigenvector corresponding to the top eigenvalue of the following generalized eigenvalue problem:
$$X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta\, X X^T w_x. \qquad (3)$$
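To make Eq. (3) concrete, below is a minimal NumPy/SciPy sketch of computing the CCA projections by solving this generalized eigenvalue problem; the function name, the small stabilizing term eps, and the variable names are illustrative assumptions rather than the authors' code.

```python
# Sketch: CCA projections for centered X (d x n) and Y (k x n) via Eq. (3),
# i.e. X Y^T (Y Y^T)^{-1} Y X^T w_x = eta * X X^T w_x.
import numpy as np
from scipy.linalg import eigh

def cca_projections(X, Y, n_components=1, eps=1e-8):
    Cxx = X @ X.T                                   # d x d
    Cyy = Y @ Y.T                                   # k x k
    Cxy = X @ Y.T                                   # d x k
    # Left-hand-side matrix of Eq. (3).
    M = Cxy @ np.linalg.solve(Cyy + eps * np.eye(Cyy.shape[0]), Cxy.T)
    # Generalized symmetric eigenproblem; the small ridge keeps Cxx positive definite.
    eigvals, W = eigh(M, Cxx + eps * np.eye(Cxx.shape[0]))
    # eigh returns eigenvalues in ascending order; keep the top n_components vectors.
    return W[:, ::-1][:, :n_components]
```

Replacing the small stabilizer eps on Cxx with an actual regularization parameter λ gives the regularized variant in Eq. (4) below.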

Background: CCA Multiple projection vectors under certain orthonormality constraints can be obtained by computing the top $l$ eigenvectors of the generalized eigenvalue problem in Eq. (3). In regularized CCA (rCCA), a regularization term $\lambda I$ with $\lambda > 0$ is added to $X X^T$ to prevent overfitting and avoid the singularity of $X X^T$. Specifically, rCCA solves the following generalized eigenvalue problem:
$$X Y^T (Y Y^T)^{-1} Y X^T w_x = \eta\,(X X^T + \lambda I)\, w_x. \qquad (4)$$

Background: Multivariate Linear Regression We are given a training set $\{(x_i, t_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the observation and $t_i \in \mathbb{R}^k$ is the corresponding target. We assume both $\{x_i\}_{i=1}^{n}$ and $\{t_i\}_{i=1}^{n}$ are centered. In MLR, we compute a weight matrix $W$ by minimizing the following sum-of-squares cost function:
$$\min_W \sum_{i=1}^{n} \| W^T x_i - t_i \|_2^2 = \| W^T X - T \|_F^2. \qquad (5)$$
The optimal weight matrix is given by
$$W_{LS} = (X X^T)^{+} X T^T. \qquad (6)$$
To improve its generalization ability, a penalty term based on 2-norm or 1-norm regularization is commonly applied.
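As a quick illustration of Eq. (6), the following sketch computes $W_{LS}$ with a pseudo-inverse; the function name is an assumption made for illustration.

```python
# Sketch: W_LS = (X X^T)^+ X T^T for centered X (d x n) and targets T (k x n).
import numpy as np

def mlr_weights(X, T):
    # The pseudo-inverse handles the rank-deficient case that arises when d > n.
    return np.linalg.pinv(X @ X.T) @ X @ T.T        # d x k
```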

Background: MLR for Multi-label Classification In general multi-label classification, we are given a data set consisting of n samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i$ denotes the set of class labels of the i-th sample. Assume there are k labels. The 1-of-k binary coding scheme $T \in \mathbb{R}^{k \times n}$ is commonly employed to apply a vector-valued class code to each data point. Sample class indicator matrix: $T_{ij} = 1$ if $x_j$ contains the i-th class label, and $T_{ij} = 0$ otherwise. The solution to the least squares problem depends on the choice of class indicator matrix.
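A small, hypothetical helper for the indicator matrix just described (the function name and input format are assumptions):

```python
# Sketch: build T (k x n) with T[i, j] = 1 if sample j carries label i, else 0.
import numpy as np

def class_indicator(label_sets, k):
    """label_sets: one set of label indices per sample, e.g. [{0, 2}, {1}]."""
    T = np.zeros((k, len(label_sets)))
    for j, labels in enumerate(label_sets):
        for i in labels:
            T[i, j] = 1.0
    return T

# Example: three samples, four labels; the first sample carries labels 0 and 2.
T = class_indicator([{0, 2}, {1}, {2, 3}], k=4)
```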

CCA Versus Multivariate Linear Regression The optimal projection matrix $W_{CCA}$ in CCA consists of the top eigenvectors of $(X X^T)^{+} \left( X Y^T (Y Y^T)^{-1} Y X^T \right)$. The optimal weight matrix in MLR is given by $W_{LS} = (X X^T)^{+} X T^T$. Goal: establish the equivalence relationship between these two.

Notations and Definitions To simplify the discussion, we define the following matrices:
$$H = Y^T (Y Y^T)^{-\frac{1}{2}} \in \mathbb{R}^{n \times k}, \qquad (7)$$
$$C_{XX} = X X^T \in \mathbb{R}^{d \times d}, \qquad (8)$$
$$C_{HH} = X H H^T X^T \in \mathbb{R}^{d \times d}, \qquad (9)$$
$$C_{DD} = C_{XX} - C_{HH} \in \mathbb{R}^{d \times d}. \qquad (10)$$
Denote the SVD of X by $X = U \Sigma V^T = [U_1, U_2]\, \mathrm{diag}(\Sigma_r, 0)\, [V_1, V_2]^T = U_1 \Sigma_r V_1^T$, where $r = \mathrm{rank}(X)$, $U_1 \in \mathbb{R}^{d \times r}$, $V_1 \in \mathbb{R}^{n \times r}$. The optimal projection matrix $W_{CCA}$ of CCA consists of the top eigenvectors of $C_{XX}^{+} C_{HH}$.

Computing CCA via Eigendecomposition Define $A \in \mathbb{R}^{r \times k}$ as $A = \Sigma_r^{-1} U_1^T X H$, and let the SVD of A be $A = P \Sigma_A Q^T$. The eigendecomposition of $C_{XX}^{+} C_{HH}$ can be derived as follows:
$$\begin{aligned}
C_{XX}^{+} C_{HH} &= U_1 \Sigma_r^{-2} U_1^T X H H^T X^T \\
&= U_1 \Sigma_r^{-1} A H^T X^T U U^T \\
&= U \begin{bmatrix} I_r \\ 0 \end{bmatrix} \Sigma_r^{-1} A\, [H^T X^T U_1,\; H^T X^T U_2]\, U^T \\
&= U \begin{bmatrix} \Sigma_r^{-1} A A^T \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} U^T \\
&= U \begin{bmatrix} \Sigma_r^{-1} P & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} \Sigma_A \Sigma_A^T & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} P^T \Sigma_r & 0 \\ 0 & I \end{bmatrix} U^T.
\end{aligned}$$

Equivalence Relationship between CCA and MLR The optimal projection matrix in CCA is given by
$$W_{CCA} = U_1 \Sigma_r^{-1} P_l, \qquad (11)$$
where $P_l$ contains the first $l$ columns of $P$. In MLR, we define the class indicator matrix $T$ as $T = (Y Y^T)^{-\frac{1}{2}} Y = H^T$, and the optimal solution is given by
$$W_{LS} = (X X^T)^{+} X H = U_1 \Sigma_r^{-2} U_1^T X H = U_1 \Sigma_r^{-1} P \Sigma_A Q^T. \qquad (12)$$
Our main result shows that all diagonal elements of $\Sigma_A$ are ones provided that $\mathrm{rank}(X) = n - 1$.
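This claim is easy to check numerically. The sketch below draws a centered high-dimensional data matrix (so that rank(X) = n - 1) and a centered surrogate label matrix, forms H and A as defined above, and verifies that all singular values of A equal one; the sizes, random data, and tolerances are illustrative assumptions.

```python
# Sketch: check that the singular values of A = Sigma_r^{-1} U_1^T X H are all one
# when rank(X) = n - 1 (both X and Y centered, d > n).
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 100, 20, 5
X = rng.standard_normal((d, n))
Y = rng.standard_normal((k, n))
X -= X.mean(axis=1, keepdims=True)          # centering gives rank(X) = n - 1
Y -= Y.mean(axis=1, keepdims=True)

# H = Y^T (Y Y^T)^{-1/2}, via the eigendecomposition of Y Y^T.
evals, evecs = np.linalg.eigh(Y @ Y.T)
H = Y.T @ (evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))
A = np.diag(1.0 / s[:r]) @ U[:, :r].T @ X @ H
print(np.allclose(np.linalg.svd(A, compute_uv=False), 1.0))    # expected: True
```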

CCA Extensions: Regularized CCA Regularization is commonly used to control the complexity of the model and improve the generalization performance. Based on the least squares formulation of CCA, we obtain the 2-norm regularized least squares CCA formulation (called LS-CCA$_2$) that minimizes the following objective function:
$$L_2(W, \lambda) = \sum_{j=1}^{k} \left( \sum_{i=1}^{n} (x_i^T w_j - T_{ij})^2 + \lambda \|w_j\|_2^2 \right),$$
where $W = [w_1, \dots, w_k]$ and $\lambda > 0$ is the regularization parameter.
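Because $L_2$ decouples over the columns of W, each column is an ordinary ridge regression problem with the closed-form solution $(X X^T + \lambda I)^{-1} X T^T$; a minimal sketch follows (the function name is assumed).

```python
# Sketch: LS-CCA_2 in closed form, W = (X X^T + lambda I)^{-1} X T^T,
# computed for all k columns of W in a single solve.
import numpy as np

def ls_cca2(X, T, lam):
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ T.T)   # d x k
```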

CCA Extensions: Sparse CCA Sparseness can often be achieved by penalizing the $L_1$-norm of the variables. The sparse 1-norm least squares CCA formulation (called LS-CCA$_1$) can be derived by minimizing the following objective function:
$$L_1(W, \lambda) = \sum_{j=1}^{k} \left( \sum_{i=1}^{n} (x_i^T w_j - T_{ij})^2 + \lambda \|w_j\|_1 \right).$$
The optimal $w_j^*$, for $1 \le j \le k$, is given by
$$w_j^* = \arg\min_{w_j} \left( \sum_{i=1}^{n} (x_i^T w_j - T_{ij})^2 + \lambda \|w_j\|_1 \right). \qquad (13)$$
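Each column problem in Eq. (13) is a standard lasso, so off-the-shelf solvers apply. Below is a sketch using scikit-learn; note that sklearn's Lasso scales the squared loss by 1/(2n), so its alpha corresponds to λ only up to that factor, and the function name is an assumption.

```python
# Sketch: LS-CCA_1 by solving one lasso problem per label with scikit-learn.
# scikit-learn expects samples as rows, so X.T (n x d) is passed together with
# the j-th row of the indicator matrix T.
import numpy as np
from sklearn.linear_model import Lasso

def ls_cca1(X, T, alpha):
    d, _ = X.shape
    k = T.shape[0]
    W = np.zeros((d, k))
    for j in range(k):
        # fit_intercept=False because X and T are assumed centered.
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        W[:, j] = lasso.fit(X.T, T[j]).coef_
    return W
```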

CCA Extensions: Entire CCA Solution Path The optimal $w_j^*$ can be computed equivalently as
$$w_j^* = \arg\min_{\|w_j\|_1 \le \tau} \sum_{i=1}^{n} (x_i^T w_j - T_{ij})^2. \qquad (14)$$
The solution can be readily computed by the Least Angle Regression (LARS) algorithm, which computes the entire solution path for all values of τ with essentially the same computational cost as fitting the model for a single τ value.
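A sketch of computing such a path with scikit-learn's LARS implementation: lars_path with method='lasso' returns the breakpoints of the piecewise-linear coefficient path. The random stand-in data below is only for illustration, not the authors' setup.

```python
# Sketch: entire lasso solution path for one weight vector w_j via LARS.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 200))     # d x n, stand-in for the centered data matrix
t_j = rng.standard_normal(200)         # stand-in for the j-th row of T
# lars_path expects samples as rows, hence the transpose of X.
alphas, active, coefs = lars_path(X.T, t_j, method='lasso')
# coefs has shape (d, n_kinks): column m holds w_j at the m-th breakpoint of the
# piecewise-linear path, i.e. the whole path over the sparseness parameter.
```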

Experiment Experimental Setup Two types of data are used in the experiment: the gene expression pattern image data of Drosophila and the scene data set. Five methods are investigated: CCA, regularized CCA (rCCA), LS-CCA, LS-CCA$_2$, and LS-CCA$_1$. A linear SVM is applied for classification for each label after the CCA projection. The Receiver Operating Characteristic (ROC) score is computed for each label, and the averaged performance over all labels is reported.
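For reference, the per-label ROC averaging described above can be computed as in the following sketch (using scikit-learn's roc_auc_score; the function name and inputs are placeholders):

```python
# Sketch: mean ROC score over labels, given true label indicators and SVM scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc(T_true, scores):
    """T_true, scores: k x n arrays (binary labels and decision values per label)."""
    k = T_true.shape[0]
    return np.mean([roc_auc_score(T_true[j], scores[j]) for j in range(k)])
```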

Equivalence Relationship The assumption $\mathrm{rank}(X) = n - 1$ holds in all cases when the data dimensionality d is larger than the sample size n. Our results show that when d > n, all diagonal elements of $\Sigma_A$ are ones and CCA and LS-CCA achieve the same classification performance, which confirms our theoretical analysis.

Performance Comparison Table: Comparison of different CCA methods in terms of mean ROC scores. $n_{tot}$ denotes the total number of images in the data set, and k denotes the number of terms (labels). Ten different splittings of the data into training (of size n) and test (of size $n_{tot} - n$) sets are applied for each data set. For the regularized algorithms, the value of the parameter is chosen via cross-validation. The proposed sparse CCA model (LS-CCA$_1$) performs the best for this data set.

n_tot    k     CCA     LS-CCA   rCCA    LS-CCA_2   LS-CCA_1
 863    10    0.542    0.542    0.617    0.619      0.722
1041    15    0.534    0.534    0.602    0.603      0.707
1138    20    0.538    0.538    0.609    0.610      0.714
1222    25    0.540    0.540    0.603    0.605      0.704
1349    30    0.548    0.548    0.606    0.608      0.709

Sensitivity Study We vary the training sample size to investigate LS-CCA and its variants in comparison with CCA. d = 384 for the gene data set, and d = 294 for the scene data set. [Figure: Mean ROC score versus training size on the gene data set for CCA, LS-CCA, LS-CCA$_1$, LS-CCA$_2$, and rCCA.] [Figure: Mean ROC score versus training size on the scene data set for the same five methods.]

The Entire CCA Solution Path [Figure: The entire collection of solution paths for a subset of the coefficients from the first weight vector $w_1$ on the scene data set. The x-axis denotes the sparseness coefficient, and the y-axis denotes the value of the coefficients of $w_1$.]

Conclusion and Future Work Conclusion We establish the equivalence relationship between CCA and multivariate linear regression under a mild condition. Several CCA extensions including sparse CCA are proposed based on the equivalence relationship. Experiments confirm the equivalence relationship. Results also show that the performance of CCA and MLR is very close even when the assumption is violated. Future Work Investigate semi-supervised CCA by incorporating unlabeled data into the CCA framework. Extensions to the nonlinear case using the kernel trick. More extensive investigation of the sparse CCA model in biological applications.