Fourier PCA. Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech)


Introduction. 1. Describe a learning problem. 2. Develop an efficient tensor decomposition.

Independent component analysis. We see independent samples x = As: s ∈ R^m is a random vector with independent coordinates; the variables s_i are not Gaussian; A ∈ R^{n×m} is a fixed matrix of full row rank; each column A_i ∈ R^n has unit norm. The goal is to compute A.

ICA: start with independent random vector s

ICA: independent samples

ICA: but under the map A

ICA: goal is to recover A from samples only.

Applications. Matrix A gives n linear measurements of m random variables. General dimensionality reduction tool in statistics and machine learning [HTF01]. Gained traction in deep belief networks [LKNN12]. Blind source separation and deconvolution in signal processing [HKO01]. More practically: finance [KO96], biology [VSJHO00] and MRI [KOHFY10].

Status. Jutten and Herault formalised this problem in 1991. Studied in our community first by [FJK96]. Provably good algorithms: [AGMS12] and [AGHKT12]. Many algorithms proposed in the signal processing literature [HKO01].

Standard approaches: PCA. Define a contrast function whose optima are the A_j. The second moment E((u^T x)^2) is usual PCA. It only succeeds when all the eigenvalues of the covariance matrix are distinct. But any distribution can be put into isotropic position (making them all equal).

Standard approaches: fourth moments. Define a contrast function whose optima are the A_j. Fourth moment: E((u^T x)^4). Tensor decomposition:
T = Σ_{j=1}^m λ_j A_j ⊗ A_j ⊗ A_j ⊗ A_j.
In case the A_j are orthonormal, they are the local optima of
T(v, v, v, v) = Σ_{i,j,k,l} T_{ijkl} v_i v_j v_k v_l, where ||v|| = 1.
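
To make the fourth-moment contrast concrete, here is a small numerical sketch (my own illustration, not the speakers' code) that builds the empirical tensor T and evaluates T(v, v, v, v); the source distribution, sample size and variable names are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 100_000
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, N))  # independent, non-Gaussian, unit-variance sources
A = np.linalg.qr(rng.standard_normal((n, n)))[0]        # orthonormal mixing matrix, for illustration
X = A @ S                                               # observed samples x = As

# empirical fourth-moment tensor T_{ijkl} = E(x_i x_j x_k x_l)
T = np.einsum('in,jn,kn,ln->ijkl', X, X, X, X) / N

def contrast(v):
    # T(v, v, v, v) = sum_{i,j,k,l} T_{ijkl} v_i v_j v_k v_l for a unit vector v
    return np.einsum('ijkl,i,j,k,l->', T, v, v, v, v)

v = rng.standard_normal(n)
v /= np.linalg.norm(v)
print(contrast(A[:, 0]), contrast(v))  # columns of A are local optima of the contrast on the sphere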

Standard assumptions for x = As. All algorithms require:
1. A is full rank n × n: as many measurements as underlying variables.
2. Each s_i differs from a Gaussian in the fourth moment: E(s_i) = 0, E(s_i^2) = 1, E(s_i^4) ≠ 3.
Note: this precludes the underdetermined case when A ∈ R^{n×m} is fat.

Our results. We require neither of the standard assumptions.
Underdetermined: A ∈ R^{n×m} is a fat matrix where n << m.
Any moment: E(s_i^r) ≠ E(z^r) where z ~ N(0, 1).
Theorem (Informal). Let x = As be an underdetermined ICA model. Let d ∈ 2N be such that σ_m([vec(A_i^{⊗d/2})]_{i=1}^m) > 0. Suppose that for each s_i, one of its first k cumulants is bounded away from zero: |cum_{k_i}(s_i)| ≥ ∆. Then one can recover the columns of A up to ε accuracy in polynomial time.

Underdetermined ICA: start with distribution over R^m

Underdetermined ICA: independent samples s

Underdetermined ICA: A first rotates/scales

Underdetermined ICA: then A projects down to R^n

Fully determined ICA: a nice algorithm.
1. (Fourier weights) Pick a random vector u from N(0, σ^2 I_n). For every x, compute its Fourier weight
   w(x) = e^{iu^T x} / Σ_{x∈S} e^{iu^T x}.
2. (Reweighted covariance) Compute the covariance matrix of the points x reweighted by w(x):
   µ_u = (1/|S|) Σ_{x∈S} w(x) x   and   Σ_u = (1/|S|) Σ_{x∈S} w(x)(x − µ_u)(x − µ_u)^T.
3. Compute the eigenvectors V of Σ_u.
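
A minimal numpy sketch of these three steps (my own reconstruction; the function name and the exact weight normalisation are my choices):

import numpy as np

def fourier_pca(X, sigma=1.0, seed=None):
    # X: n x N array of observed samples x = As
    rng = np.random.default_rng(seed)
    n, N = X.shape
    u = rng.normal(0.0, sigma, size=n)        # step 1: random direction u ~ N(0, sigma^2 I_n)
    w = np.exp(1j * (u @ X))                  # e^{i u^T x} for every sample x in S
    w /= w.sum()                              # Fourier weights w(x)
    mu_u = X @ w                              # step 2: reweighted mean
    Xc = X - mu_u[:, None]
    Sigma_u = (w * Xc) @ Xc.T                 # reweighted covariance Sigma_u (complex symmetric)
    _, V = np.linalg.eig(Sigma_u)             # step 3: eigenvectors of Sigma_u
    V = V.real                                # a more careful implementation phase-aligns each eigenvector first
    return V / np.linalg.norm(V, axis=0)      # columns estimate the columns of A, up to sign and permutation

Roughly speaking, the random u is what makes the eigenvalues of Sigma_u distinct with high probability, which plain PCA (the earlier slide) lacks.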

Why does this work?
1. The Fourier differentiation/multiplication relationship: (f')^(u) = (2πiu) f̂(u).
2. Actually we consider the log of the Fourier transform:
D^2 log E(exp(iu^T x)) = A diag(g_j(A_j^T u)) A^T
where g_j(t) is the second derivative of log(E(exp(it s_j))).

Technical overview. Fundamental analytic tools:
The second characteristic function ψ(u) = log(E(exp(iu^T x))).
Estimate the order-d derivative tensor field D^d ψ from samples.
Evaluate D^d ψ at two randomly chosen u, v ∈ R^n to give two tensors T_u and T_v.
Perform a tensor decomposition on T_u and T_v to obtain A.

First derivative. Easy case A = I_n:
ψ(u) = log(E(exp(iu^T x))) = log(E(exp(iu^T s))).
Thus:
∂ψ/∂u_1 = (1 / E(exp(iu^T s))) · E(s_1 exp(iu^T s))
        = (1 / Π_{j=1}^n E(exp(iu_j s_j))) · E(s_1 exp(iu_1 s_1)) · Π_{j=2}^n E(exp(iu_j s_j))
        = E(s_1 exp(iu_1 s_1)) / E(exp(iu_1 s_1)).

Second derivative. Easy case A = I_n:
1. Differentiating via the quotient rule:
∂^2ψ/∂u_1^2 = E(s_1^2 exp(iu_1 s_1)) / E(exp(iu_1 s_1)) − E(s_1 exp(iu_1 s_1))^2 / E(exp(iu_1 s_1))^2.
2. Differentiating a constant:
∂^2ψ/∂u_1∂u_2 = 0.

General derivatives. Key point: taking one derivative isolates each variable u_i. The second derivative is a diagonal matrix. Subsequent derivatives are diagonal tensors: only the (i, ..., i) terms are nonzero. NB: higher derivatives are represented by n × ··· × n tensors. There is one such tensor per point in R^n.
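
As a quick numerical illustration of this key point (my own check, with an arbitrary non-Gaussian source distribution), one can estimate ψ(u) = log E(exp(iu^T x)) empirically and verify that its finite-difference Hessian is essentially diagonal when A = I_n:

import numpy as np

rng = np.random.default_rng(1)
n, N = 3, 500_000
S = rng.exponential(1.0, size=(n, N)) - 1.0   # independent, non-Gaussian, zero-mean coordinates (A = I_n)

def psi(u):
    # empirical second characteristic function psi(u) = log E(exp(i u^T x))
    return np.log(np.mean(np.exp(1j * (u @ S))))

def hessian(f, u, h=1e-2):
    # central finite differences for D^2 psi at u
    H = np.zeros((n, n), dtype=complex)
    I = np.eye(n)
    for a in range(n):
        for b in range(n):
            H[a, b] = (f(u + h*I[a] + h*I[b]) - f(u + h*I[a] - h*I[b])
                       - f(u - h*I[a] + h*I[b]) + f(u - h*I[a] - h*I[b])) / (4 * h * h)
    return H

u = 0.5 * rng.standard_normal(n)
print(np.round(hessian(psi, u), 2))           # off-diagonal entries are small compared with the diagonal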

Basis change: second derivative. When A ≠ I_n, we have to work much harder:
D^2 ψ_u = A diag(g_j(A_j^T u)) A^T
where g_j : R → C is given by:
g_j(v) = ∂^2/∂v^2 log(E(exp(iv s_j))).

Basis change: general derivative. When A ≠ I_n, we have to work much harder:
D^d ψ_u = Σ_{j=1}^m g_j(A_j^T u) (A_j ⊗ ··· ⊗ A_j)
where g_j : R → C is given by:
g_j(v) = ∂^d/∂v^d log(E(exp(iv s_j))).
Evaluating the derivative at different points u gives us tensors with shared decompositions!

Single tensor decomposition is NP-hard. Forget the derivatives now. Take λ_j ∈ C and A_j ∈ R^n:
T = Σ_{j=1}^m λ_j A_j ⊗ ··· ⊗ A_j.
When can we recover the vectors A_j? When is this computationally tractable?

Known results. When d = 2, the usual eigenvalue decomposition:
M = Σ_{j=1}^n λ_j A_j ⊗ A_j.
When d ≥ 3 and the A_j are linearly independent, tensor power iteration suffices [AGHKT12]:
T = Σ_{j=1}^m λ_j A_j ⊗ ··· ⊗ A_j.
This necessarily implies m ≤ n. For unique recovery, require all the eigenvalues to be distinct.

Generalising the problem. What about two equations instead of one?
T_µ = Σ_{j=1}^m µ_j A_j^{⊗d},   T_λ = Σ_{j=1}^m λ_j A_j^{⊗d}.
Our technique will flatten the tensors:
M_µ = [vec(A_j^{⊗d/2})] diag(µ_j) [vec(A_j^{⊗d/2})]^T.

Algorithm. Input: two tensors T_µ and T_λ, flattened to M_µ and M_λ.
1. Compute W, the right singular vectors of M_µ.
2. Form the matrix M = (W^T M_µ W)(W^T M_λ W)^{-1}.
3. Eigenvector decomposition M = PDP^{-1}.
4. For each column P_i, let v_i ∈ C^n be the best rank-1 approximation to P_i packed back into a tensor.
5. For each v_i, output Re(e^{iθ} v_i) / ||Re(e^{iθ} v_i)|| where θ = argmax_{θ ∈ [0,2π]} ||Re(e^{iθ} v_i)||.
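
Below is a small numpy sketch of this procedure for the toy case d = 4 with synthetic A, µ, λ (my own instantiation; since everything is real here, the phase alignment of step 5 reduces to a sign choice):

import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 5                                         # underdetermined: more components than dimensions
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)                      # unit-norm columns
mu, lam = rng.uniform(1, 2, m), rng.uniform(1, 2, m)

# flattened order-4 tensors: M = [vec(A_j A_j^T)] diag(.) [vec(A_j A_j^T)]^T, an n^2 x n^2 matrix
V = np.stack([np.outer(A[:, j], A[:, j]).ravel() for j in range(m)], axis=1)
M_mu, M_lam = V @ np.diag(mu) @ V.T, V @ np.diag(lam) @ V.T

W = np.linalg.svd(M_mu)[2][:m].T                    # step 1: top-m right singular vectors of M_mu
Mhat = (W.T @ M_mu @ W) @ np.linalg.inv(W.T @ M_lam @ W)   # step 2
_, P = np.linalg.eig(Mhat)                          # step 3
for i in range(m):                                  # steps 4-5: lift back, rank-1 approximation, sign
    B = (W @ P[:, i]).real.reshape(n, n)
    a_hat = np.linalg.svd(B)[0][:, 0]               # estimate of some column A_j, up to sign
    print(np.max(np.abs(A.T @ a_hat)))              # close to 1: a_hat matches one true column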

Theorem (Tensor decomposition). Let T_µ, T_λ ∈ R^{n×···×n} be order-d tensors such that d ∈ 2N and
T_µ = Σ_{j=1}^m µ_j A_j^{⊗d},   T_λ = Σ_{j=1}^m λ_j A_j^{⊗d},
where the vec(A_j^{⊗d/2}) are linearly independent, µ_i/λ_i ≠ 0, and |µ_i/λ_i − µ_j/λ_j| > 0 for all i ≠ j. Then the vectors A_j can be estimated to any desired accuracy in polynomial time.

Analysis. Let's pretend M_µ and M_λ are full rank:
M_µ M_λ^{-1} = [vec(A_j^{⊗d/2})] diag(µ_j) [vec(A_j^{⊗d/2})]^T ([vec(A_j^{⊗d/2})]^T)^{-1} diag(λ_j)^{-1} [vec(A_j^{⊗d/2})]^{-1}
             = [vec(A_j^{⊗d/2})] diag(µ_j/λ_j) [vec(A_j^{⊗d/2})]^{-1}.
The eigenvectors are flattened tensors of the form vec(A_j^{⊗d/2}).

Diagonalisability for non-normal matrices. When can we write A = PDP^{-1}?
Require all eigenvectors to be independent (P invertible).
The minimal polynomial of A has non-degenerate roots.
Sufficient condition: all roots are non-degenerate.

Perturbed spectra for non-normal matrices. More complicated than for normal matrices:
Normal: |λ_i(A + E) − λ_i(A)| ≤ ||E||.
Non-normal: either the Bauer-Fike theorem, |λ_i(A + E) − λ_j(A)| ≤ κ(P) ||E|| for some j, or we must assume A + E is already diagonalizable.
Neither of these suffices.

Generalised Weyl inequality. Lemma. Let A ∈ C^{n×n} be a diagonalizable matrix such that A = P diag(λ_i) P^{-1}. Let E ∈ C^{n×n} be a matrix such that |λ_i(A) − λ_j(A)| ≥ 3κ(P) ||E|| for all i ≠ j. Then there exists a permutation π : [n] → [n] such that |λ_i(A + E) − λ_{π(i)}(A)| ≤ κ(P) ||E||.
Proof. Via a homotopy argument (like the strong Gershgorin theorem).

Robust analysis. Proof sketch:
1. Apply the generalised Weyl inequality to bound the eigenvalues; hence the perturbed matrix is diagonalisable.
2. Apply the Ipsen-Eisenstat theorem (a generalised Davis-Kahan sin(θ) theorem).
3. This implies that the output eigenvectors are close to vec(A_j^{⊗d/2}).
4. Apply tensor power iteration to extract the approximate A_j.
5. Show that the best real projection of the approximate A_j is close to the true one.

Underdetermined ICA algorithm. x = As where A is a fat matrix.
1. Pick two independent random vectors u, v ~ N(0, σ^2 I_n).
2. Form the d-th derivative tensors at u and v: T_u and T_v.
3. Run tensor decomposition on the pair (T_u, T_v).

Estimating from samples.
[D^4 ψ_u]_{i1,i2,i3,i4} = (1/φ(u)^4) [ φ(u)^3 E((ix_{i1})(ix_{i2})(ix_{i3})(ix_{i4}) exp(iu^T x))
− φ(u)^2 E((ix_{i2})(ix_{i3})(ix_{i4}) exp(iu^T x)) E((ix_{i1}) exp(iu^T x))
− φ(u)^2 E((ix_{i2})(ix_{i3}) exp(iu^T x)) E((ix_{i1})(ix_{i4}) exp(iu^T x))
− φ(u)^2 E((ix_{i2})(ix_{i4}) exp(iu^T x)) E((ix_{i1})(ix_{i3}) exp(iu^T x)) + ··· ]
At most 2^{d−1}(d−1)! terms. Each one is easy to estimate empirically!
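
For illustration, here is how one such building block can be estimated from samples (a sketch of my own; the helper name and the choice d = 4 are arbitrary). Each factor above is an empirical average of the form E((ix_{i1})...(ix_{ir}) exp(iu^T x)), which a single einsum over the samples provides:

import numpy as np

def reweighted_moments(X, u):
    # X: n x N samples, u: frequency vector. Returns phi(u) and the order-4 term
    # E((ix_a)(ix_b)(ix_c)(ix_d) exp(i u^T x)) as an n x n x n x n array.
    n, N = X.shape
    w = np.exp(1j * (u @ X))                        # exp(i u^T x) per sample
    phi = w.mean()
    iX = 1j * X
    T4 = np.einsum('an,bn,cn,dn,n->abcd', iX, iX, iX, iX, w, optimize=True) / N
    return phi, T4

# Lower-order factors such as E((ix_a)(ix_b) exp(i u^T x)) are analogous two-index einsums;
# combining the at most 2^{d-1}(d-1)! products of such terms gives the estimate of D^4 psi_u.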

Theorem. Fix n, m ∈ N such that n ≤ m. Let x ∈ R^n be given by an underdetermined ICA model x = As. Let d ∈ 2N be such that σ_m([vec(A_i^{⊗d/2})]_{i=1}^m) > 0. Suppose that for each s_i, one of its cumulants k_i with d < k_i ≤ k satisfies |cum_{k_i}(s_i)| ≥ ∆ and E(|s_i|^k) ≤ M. Then one can recover the columns of A up to ε accuracy in time and sample complexity
poly(n^{d+k}, m^{k^2}, M^k, 1/∆^k, 1/σ_m([vec(A_i^{⊗d/2})])^k, 1/ε).

Analysis. Recall our matrices were:
M_u = [vec(A_i^{⊗d/2})] diag(g_j(A_j^T u)) [vec(A_i^{⊗d/2})]^T
where:
g_j(v) = ∂^d/∂v^d log(E(exp(iv s_j))).
Need to show that the ratios g_j(A_j^T u)/g_j(A_j^T v) are well-spaced.

Truncation. Taylor series of the second characteristic function:
g_i(u) = Σ_{l=d}^{k_i} cum_l(s_i) (iu)^{l−d}/(l−d)! + R_t (iu)^{k_i−d+1}/(k_i−d+1)!.
Finite-degree polynomials are anti-concentrated. The tail error is small because of the existence of higher moments (in fact one suffices).
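
For intuition, a quick check of this truncation (my own example, keeping the factors of i from differentiating ψ explicit) for s ~ Exp(1) − 1, whose cumulants are cum_l(s) = (l − 1)! for l ≥ 2, so that g(u) = ψ^(d)(u) = (d − 1)! i^d / (1 − iu)^d in closed form:

from math import factorial

d, k, u = 2, 8, 0.1
exact = factorial(d - 1) * 1j**d / (1 - 1j * u)**d
truncated = sum(factorial(l - 1) * 1j**d * (1j * u)**(l - d) / factorial(l - d)
                for l in range(d, k + 1))
print(abs(exact - truncated))   # the tail beyond the k-th cumulant is tiny for moderate u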

Tail error. g_j is the d-th derivative of log(E(exp(iu^T s))). For a characteristic function, |φ^(d)(u)| ≤ E(|x|^d). Count the number of terms after iterating the quotient rule d times.

Polynomial anti-concentration. Lemma. Let p(x) be a degree-d monic polynomial over R. Let x ~ N(0, σ^2); then for any t ∈ R we have
Pr(|p(x) − t| ≤ ε) ≤ 4dε^{1/d} / (σ√(2π)).

Polynomial anti-concentration. Proof.
1. For a fixed interval, a scaled Chebyshev polynomial has the smallest ℓ_∞ norm (of order 1/2^d when the interval is [−1, 1]).
2. Since p is of degree d, there are at most d − 1 changes of sign, hence only d − 1 intervals where p(x) is close to any given t.
3. Applying the first fact, each such interval has length at most of order ε^{1/d}, and the Gaussian density on it is at most 1/(σ√(2π)).

Polynomial anti-concentration
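
A quick Monte Carlo sanity check of the lemma (my own illustration; the particular cubic and parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
d, sigma, eps, t = 3, 1.0, 1e-3, 0.5
x = rng.normal(0.0, sigma, size=1_000_000)
p = x**3 - 2 * x                               # a monic polynomial of degree d = 3
empirical = np.mean(np.abs(p - t) <= eps)
bound = 4 * d * eps**(1.0 / d) / (sigma * np.sqrt(2 * np.pi))
print(empirical, bound)                        # the empirical probability sits well below the bound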

Eigenvalue spacings. Want to bound
Pr( |g_i(A_i^T u)/g_i(A_i^T v) − g_j(A_j^T u)/g_j(A_j^T v)| ≤ ε ).
Condition on a value of A_j^T u = s. Then:
|g_i(A_i^T u)/g_i(A_i^T v) − g_j(s)/g_j(A_j^T v)| = |(p_i(A_i^T u) + ε_i)/g_i(A_i^T v) − g_j(s)/g_j(A_j^T v)|
≥ |p_i(A_i^T u)/g_i(A_i^T v) − g_j(s)/g_j(A_j^T v)| − |ε_i/g_i(A_i^T v)|.
Once we've conditioned on A_j^T u, we can pretend A_i^T u is also a Gaussian (of highly reduced variance).

Eigenvalue spacings. A_i^T u = ⟨A_i, A_j⟩ A_j^T u + r^T u, where r is orthogonal to A_j.
The variance of the remaining randomness is ||r||^2, which is bounded below in terms of σ_m([vec(A_i^{⊗d/2})]).
We conclude by union bounding with the event that the denominators are not too large, and then over all pairs i, j.

Extensions. Can remove Gaussian noise when x = As + η and η ~ N(µ, Σ). Gaussian mixtures (when x ~ Σ_{i=1}^n w_i N(µ_i, Σ_i)), in the spherical covariance setting. (Gaussian noise applies here too.)

Open problems. What is the relationship between our method and kernel PCA? Independent subspaces. Gaussian mixtures: underdetermined and generalized covariance case.

Fin Questions?