Stochastic Optimization for Deep CCA via Nonlinear Orthogonal Iterations


Weiran Wang, Toyota Technological Institute at Chicago
Joint work with Raman Arora (JHU), Karen Livescu and Nati Srebro (TTIC)
53rd Allerton Conference on Communication, Control, and Computing, September 30, 2015

Multi-view feature learning

Training data consists of samples of a D-dimensional random vector that has some natural split into two sub-vectors, [x; y], with x ∈ R^{D_x}, y ∈ R^{D_y}, D_x + D_y = D.

- Natural views: audio+video, audio+articulation, text in different languages...
- Abstract/synthetic: word+context words, different parts of a parse tree...

Task: extracting useful features/subspaces in the presence of multiple views which contain complementary information.
Motivations: noise suppression, soft supervision, cross-view retrieval/generation...

We assume feature dimensions have zero mean for notational simplicity.

Canonical correlation analysis (CCA) [Hotelling 1936]

Given: a data set of N paired vectors {(x_1, y_1), ..., (x_N, y_N)}, which are samples of random vectors x ∈ R^{D_x}, y ∈ R^{D_y}.

Find: direction vectors (u, v) that maximize the correlation

    (u, v) = argmax_{u,v} corr(u^T x, v^T y)
           = argmax_{u,v} u^T Σ_xy v / sqrt((u^T Σ_xx u)(v^T Σ_yy v)),

where Σ_xy = sum_{i=1}^N x_i y_i^T, Σ_xx = sum_{i=1}^N x_i x_i^T, Σ_yy = sum_{i=1}^N y_i y_i^T.

Subsequent direction vectors maximize the same correlation, subject to being uncorrelated with previous directions.

Canonical correlation analysis (CCA)

Extracting L-dimensional projections U ∈ R^{D_x × L}, V ∈ R^{D_y × L}:

    max_{U,V} tr(U^T Σ_xy V)   s.t.   U^T Σ_xx U = V^T Σ_yy V = I.

Closed-form solution obtained from the SVD of Σ̃_xy = Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2}.

Alternative formulation: let X = [x_1, ..., x_N], Y = [y_1, ..., y_N]. Then CCA equivalently solves

    min_{U,V} (1/2) ||U^T X - V^T Y||_F^2 = (1/2) sum_{i=1}^N ||U^T x_i - V^T y_i||^2
    s.t.  (U^T X)(X^T U) = (V^T Y)(Y^T V) = I.

But CCA can only find linear subspaces...
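For concreteness, the closed-form solution above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming centered data matrices; the small ridge term eps and the function names are illustrative assumptions, not from the talk.

    import numpy as np

    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    def linear_cca(X, Y, L, eps=1e-8):
        """Closed-form linear CCA. X: D_x x N, Y: D_y x N, columns centered."""
        N = X.shape[1]
        Sxx = X @ X.T / N + eps * np.eye(X.shape[0])
        Syy = Y @ Y.T / N + eps * np.eye(Y.shape[0])
        Sxy = X @ Y.T / N
        Sxx_isqrt, Syy_isqrt = inv_sqrt(Sxx), inv_sqrt(Syy)
        T = Sxx_isqrt @ Sxy @ Syy_isqrt          # whitened cross-covariance
        Ut, s, Vt = np.linalg.svd(T)
        U = Sxx_isqrt @ Ut[:, :L]                # canonical directions for view 1
        V = Syy_isqrt @ Vt[:L, :].T              # canonical directions for view 2
        return U, V, s[:L]                       # s[:L]: canonical correlations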

Deep CCA [Andrew, Arora, Bilmes and Livescu 2013]

Transform the input of each view nonlinearly with deep neural networks (DNNs), such that the canonical correlation (measured by CCA) between the outputs is maximized.

[Figure: a DNN f maps the view-1 input x and a DNN g maps the view-2 input y; CCA projections U and V are applied to their outputs.]

Final projections: f̃(x) = U^T f(x), g̃(y) = V^T g(y).

A parametric nonlinear extension of CCA; better scaling to large data than the kernel extension of CCA [Lai & Fyfe 2000, Akaho 2001, Melzer et al. 2001, Bach & Jordan 2002].
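As one hypothetical instantiation of the two view networks, here is a PyTorch sketch; the layer sizes mirror those listed later in the experiments, while the ReLU activations and class name are assumptions of the sketch, not the talk's prescription.

    import torch.nn as nn

    class ViewNet(nn.Module):
        """One view's DNN (f or g), mapping raw features to an L-dimensional output."""
        def __init__(self, d_in, d_hidden, L):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_in, d_hidden), nn.ReLU(),
                nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                nn.Linear(d_hidden, L),
            )
        def forward(self, x):
            return self.net(x)

    # Two networks, one per view (dimensions are placeholders from the JW11 setup)
    f = ViewNet(d_in=273, d_hidden=1800, L=112)   # view 1
    g = ViewNet(d_in=112, d_hidden=1200, L=112)   # view 2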

Deep CCA: objective and gradient

Objective over DNN weights (W_f, W_g) and CCA projections (U, V):

    max_{W_f, W_g, U, V} tr(U^T F G^T V)   s.t.   U^T F F^T U = V^T G G^T V = I,

where F = f(X) = [f(x_1), ..., f(x_N)] and G = g(Y) = [g(y_1), ..., g(y_N)].

Gradient computation [Andrew, Arora, Bilmes and Livescu 2013]: let Σ_fg = F G^T, Σ_ff = F F^T, Σ_gg = G G^T, and let Σ̃_fg = Σ_ff^{-1/2} Σ_fg Σ_gg^{-1/2} = Ũ Λ Ṽ^T be its SVD. Then the derivative of the total correlation sum_l σ_l(Σ̃_fg) with respect to F is

    2 Δ_ff F + Δ_fg G,   where Δ_ff = -(1/2) Σ_ff^{-1/2} Ũ Λ Ũ^T Σ_ff^{-1/2},  Δ_fg = Σ_ff^{-1/2} Ũ Ṽ^T Σ_gg^{-1/2}.

Gradients with respect to (W_f, W_g) are then computed via back-propagation.
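A minimal NumPy sketch of this gradient computation, assuming F and G are the L x N output matrices of the two networks; the small ridge term eps is an added numerical-stability assumption, not part of the formula above.

    import numpy as np

    def dcca_grad_wrt_F(F, G, eps=1e-8):
        """Gradient of the total correlation sum_l sigma_l(T) with respect to F."""
        L = F.shape[0]
        Sff = F @ F.T + eps * np.eye(L)
        Sgg = G @ G.T + eps * np.eye(L)
        Sfg = F @ G.T
        def inv_sqrt(S):
            w, Q = np.linalg.eigh(S)
            return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
        Sff_i, Sgg_i = inv_sqrt(Sff), inv_sqrt(Sgg)
        T = Sff_i @ Sfg @ Sgg_i                        # whitened cross-covariance
        U, lam, Vt = np.linalg.svd(T)
        corr = lam.sum()                               # objective value
        D_ff = -0.5 * Sff_i @ U @ np.diag(lam) @ U.T @ Sff_i
        D_fg = Sff_i @ U @ Vt @ Sgg_i
        grad_F = 2.0 * D_ff @ F + D_fg @ G             # d(corr)/dF
        return corr, grad_F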

Stochastic optimization of deep CCA

    max_{W_f, W_g, U, V} tr(U^T F G^T V)   s.t.   U^T F F^T U = V^T G G^T V = I.

The objective is not an expectation of a per-sample loss, because of the constraints; this makes the problem more difficult than PCA/PLS [Arora, Cotter, Livescu and Srebro 2012]. Exact gradient computation requires feeding forward all data through the DNNs, and back-propagation over the full data set requires large memory for large DNNs and cannot be run on GPUs.

Question: can we do stochastic optimization for deep CCA?

Approach I: Large minibatches (STOL)

Use a minibatch of n samples to form the estimate

    Σ̃_fg^{(n)} = (Σ_ff^{(n)})^{-1/2} Σ_fg^{(n)} (Σ_gg^{(n)})^{-1/2}

(where the superscript (n) marks minibatch quantities) and the corresponding gradient. This works well for n large enough [Wang, Arora, Livescu and Bilmes 2015].

[Figure: total canonical correlation vs. training time (hours) for STOL with minibatch sizes 100, 200, 300, 400, 500, 750, 1000, and for full-batch L-BFGS.]

It does not work for small n, because E[Σ̃_fg^{(n)}] ≠ Σ̃_fg due to the nonlinearities (matrix inversion and multiplication) in computing Σ̃_fg. Therefore the gradient estimated on a minibatch is not an unbiased estimate of the true gradient.

Alternating least squares for CCA

Alternating least squares [Golub & Zha 1995]: run orthogonal iterations to obtain the singular vectors (Ũ, Ṽ) of Σ̃_fg (for linear CCA, F = X and G = Y).

Input: data matrices X, Y, and an initial Ũ_0 ∈ R^{D_x × L} with Ũ_0^T Ũ_0 = I.
A_0 ← Ũ_0^T Σ_xx^{-1/2} X
for t = 1, 2, ..., T do
    B_t ← A_{t-1} Y^T (Y Y^T)^{-1} Y    % least-squares regression of A_{t-1} onto Y
    B_t ← (B_t B_t^T)^{-1/2} B_t        % orthogonalize
    A_t ← B_t X^T (X X^T)^{-1} X        % least-squares regression of B_t onto X
    A_t ← (A_t A_t^T)^{-1/2} A_t        % orthogonalize
end for
Output: A_T → Ũ^T X and B_T → Ṽ^T Y, the CCA projections, as T → ∞.

This procedure converges linearly under mild conditions. [Lu & Foster 2014] used a similar procedure for linear CCA with high-dimensional sparse inputs.
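A minimal NumPy sketch of these orthogonal iterations for the linear case; the ridge term eps, the fixed iteration count, and the random orthonormal initialization are assumptions of the sketch.

    import numpy as np

    def inv_sqrt(S):
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    def als_cca(X, Y, L, T=100, eps=1e-8):
        """Alternating least squares for linear CCA; X: D_x x N, Y: D_y x N, centered."""
        Dx, N = X.shape
        rng = np.random.default_rng(0)
        U0, _ = np.linalg.qr(rng.standard_normal((Dx, L)))            # random orthonormal init
        A = U0.T @ inv_sqrt(X @ X.T / N + eps * np.eye(Dx)) @ X
        XXinv_X = np.linalg.solve(X @ X.T + eps * np.eye(Dx), X)      # (XX^T)^{-1} X
        YYinv_Y = np.linalg.solve(Y @ Y.T + eps * np.eye(Y.shape[0]), Y)
        for _ in range(T):
            B = (A @ Y.T) @ YYinv_Y        # least-squares regression of A onto Y
            B = inv_sqrt(B @ B.T) @ B      # orthogonalize rows of B
            A = (B @ X.T) @ XXinv_X        # least-squares regression of B onto X
            A = inv_sqrt(A @ A.T) @ A      # orthogonalize rows of A
        return A, B                        # approximate U^T X and V^T Y projections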

Approach II: Nonlinear orthogonal iterations (NOI)

Choose a minibatch b of n samples at each step, and:

Adaptively estimate the covariance matrix used for orthogonalization,

    Σ_gg ← ρ Σ_gg + (1 - ρ) (N/n) sum_{i ∈ b} g(y_i) g(y_i)^T,

with time constant ρ ∈ [0, 1). The update form is similar to that of momentum and is widely used in subspace tracking. Saving Σ_gg ∈ R^{L×L} costs little memory, as L is usually small.

Replace the exact linear regression with a nonlinear least squares problem and take a gradient descent step on the minibatch,

    min_{W_f} (1/n) sum_{i ∈ b} || f(x_i) - Σ_gg^{-1/2} g(y_i) ||^2,

an ordinary DNN regression problem with no involved gradient (the update for W_g is symmetric; see the sketch below).

Each step feeds forward and back-propagates only n samples!

[Ma, Lu and Foster 2015] proposed a similar algorithm, AppGrad, for CCA with ρ = 0.
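A hedged PyTorch sketch of one NOI step for view 1, under the updates above; the running covariance buffer, the eps ridge term, and the choice of optimizer are illustrative assumptions, not prescriptions from the talk.

    import torch

    def inv_sqrt(S, eps=1e-6):
        # symmetric inverse square root via eigendecomposition
        w, Q = torch.linalg.eigh(S + eps * torch.eye(S.shape[0]))
        return Q @ torch.diag(w.rsqrt()) @ Q.T

    def noi_step_view1(f, g, opt_f, x_b, y_b, Sgg, rho, N):
        """One NOI update of f's weights on minibatch (x_b, y_b) of size n."""
        n = x_b.shape[0]
        with torch.no_grad():
            G = g(y_b)                                          # n x L view-2 outputs
            Sgg.mul_(rho).add_((1 - rho) * (N / n) * G.T @ G)   # adaptive covariance estimate
            target = G @ inv_sqrt(Sgg)                          # whitened view-2 outputs
        loss = ((f(x_b) - target) ** 2).sum(dim=1).mean()       # nonlinear least squares
        opt_f.zero_grad()
        loss.backward()                                          # back-propagate only n samples
        opt_f.step()
        return loss.item()

Here Sgg would be an L × L tensor carried across steps (e.g., initialized to the identity), and opt_f could be torch.optim.SGD(f.parameters(), lr=...); the symmetric step for g whitens f's outputs with a running Σ_ff and regresses g onto them.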

Experiments: Datasets

Compare three optimizers on two real-world datasets:
- limited-memory BFGS (L-BFGS) run with the full-batch gradient
- stochastic optimization with large minibatches (STOL)
- nonlinear orthogonal iterations (NOI)

Hyper-parameters (time constant ρ, minibatch size n, learning rate η) are tuned by grid search. STOL/NOI run for a maximum of 50 epochs.

Statistics of the two real-world datasets:

    dataset   training/tuning/test   L     view-1 DNN          view-2 DNN
    JW11      30K/11K/9K             112   273-1800-1800-112   112-1200-1200-112
    MNIST     50K/10K/10K            50    392-800-800-50      392-800-800-50

Experiments: Effect of minibatch size n

[Figure: learning curves on the tuning sets (canonical correlation vs. epoch, up to 50 epochs) for JW11 and MNIST, comparing L-BFGS (n = N), STOL (n = 100, 500), and NOI (n = 10, 20, 50, 100).]

- NOI works well with various small minibatch sizes.
- NOI gives steep improvement in the first few passes over the data.

Experiments: Effect of time constant ρ

[Figure: total correlation achieved by NOI on the tuning sets of JW11 and MNIST for ρ ∈ {0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.99} and minibatch sizes n ∈ {10, 20, 50, 100}.]

- NOI works well for a wide range of ρ.
- For small n, it is beneficial to use a large ρ, which incorporates more of the previous covariance estimate.

Experiments: Pure stochastic optimization

[Figure: total correlation achieved on the MNIST training set for linear CCA at ρ ∈ {0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.99, 0.999}, comparing random initialization, the exact SVD solution, STOL with n = 500, and NOI with n = 1.]

- AppGrad [Ma, Lu and Foster 2015] used minibatches of size n on the order of O(L), with ρ = 0.
- Pure stochastic optimization (n = 1) works well with a large ρ.

Conclusions

We have developed NOI for training deep CCA with small minibatches, alleviating the memory cost. NOI performs competitively with previous batch and stochastic optimizers.

Future directions: the gradients of the nonlinear least squares problems in NOI are not unbiased estimates of the gradient of the deep CCA objective, so we need to better understand the convergence properties of NOI.

Thank you!

Bibliography I

Shotaro Akaho. A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society (IMPS 2001), Osaka, Japan, 2001. Springer-Verlag.

Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In Sanjoy Dasgupta and David McAllester, editors, Proc. of the 30th Int. Conf. Machine Learning (ICML 2013), pages 1247-1255, Atlanta, GA, June 16-21, 2013.

Raman Arora, Andy Cotter, Karen Livescu, and Nati Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing, pages 861-868, Monticello, IL, October 1-5, 2012.

Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, July 2002.

Gene H. Golub and Hongyuan Zha. The canonical correlations of matrix pairs and their numerical computation. In Linear Algebra for Signal Processing, volume 69 of The IMA Volumes in Mathematics and its Applications, pages 27-49. Springer-Verlag, 1995.

Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321-377, December 1936.

P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst., 10(5):365-377, October 2000.

Bibliography II

Zhuang Ma, Yichao Lu, and Dean Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In Francis Bach and David Blei, editors, Proc. of the 32nd Int. Conf. Machine Learning (ICML 2015), pages 169-178, Lille, France, July 7-9, 2015.

Thomas Melzer, Michael Reiter, and Horst Bischof. Nonlinear feature extraction using generalized canonical correlation analysis. In G. Dorffner, H. Bischof, and K. Hornik, editors, Proc. of the 11th Int. Conf. Artificial Neural Networks (ICANN'01), pages 353-360, Vienna, Austria, August 21-25, 2001.

Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. Unsupervised learning of acoustic features via deep canonical correlation analysis. In Proc. of the IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP'15), Brisbane, Australia, April 19-24, 2015.