Fourier PCA Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech)
Introduction 1. Describe a learning problem. 2. Develop an efficient tensor decomposition.
Independent component analysis We see independent samples x = As: s ∈ R^m is a random vector with independent coordinates, and the variables s_i are not Gaussian. A ∈ R^{n×m} is a fixed matrix of full row rank, and each column A_i ∈ R^n has unit norm. Goal: compute A.
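To make the model concrete, here is a minimal numpy sketch of sampling from x = As; the source distribution, dimensions, and names are illustrative choices, not from the talk.

```python
# A minimal sketch of sampling from the ICA model x = As.  The source
# distribution, dimensions, and names are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, m, num_samples = 3, 3, 10_000

# Independent, non-Gaussian sources with zero mean and unit variance.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(num_samples, m))

# Fixed mixing matrix with unit-norm columns.
A = rng.normal(size=(n, m))
A /= np.linalg.norm(A, axis=0)

X = S @ A.T   # each row is one observed sample x = As
# Learning problem: recover A (up to column signs and permutation) from X alone.
```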
ICA: start with independent random vector s
ICA: independent samples
ICA: but under the map A
ICA: goal is to recover A from samples only.
Applications Matrix A gives n linear measurements of m random variables. General dimensionality reduction tool in statistics and machine learning [HTF01]. Gained traction in deep belief networks [LKNN12]. Blind source separation and deconvolution in signal processing [HKO01]. More practically: finance [KO96], biology [VSJHO00] and MRI [KOHFY10].
Status Jutten and Herault 1991 formalised this problem. First studied in our community by [FJK96]. Provably good algorithms: [AGMS12] and [AGHKT12]. Many algorithms proposed in the signal processing literature [HKO01].
Standard approaches: PCA Define a contrast function whose optima are the columns A_j. The second moment E[(u^T x)^2] gives usual PCA. This succeeds only when all the eigenvalues of the covariance matrix are distinct; but any distribution can be put into isotropic position, after which the second moment carries no information about A.
Standard approaches: fourth moments Define a contrast function whose optima are the columns A_j. Fourth moment: E[(u^T x)^4]. Tensor decomposition: T = Σ_{j=1}^m λ_j A_j⊗A_j⊗A_j⊗A_j. In case the A_j are orthonormal, they are the local optima of T(v, v, v, v) = Σ_{i,j,k,l} T_{ijkl} v_i v_j v_k v_l over ||v|| = 1.
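As an illustration of this contrast, here is a minimal numpy sketch (illustrative names and toy uniform-source model, not from the talk) that estimates the fourth-moment tensor and evaluates T(v, v, v, v):

```python
# A minimal sketch of the fourth-moment contrast; names and the toy model are
# illustrative.  T is the empirical fourth-moment tensor of the samples X.
import numpy as np

def fourth_moment_tensor(X):
    # T_{ijkl} = E[x_i x_j x_k x_l], estimated by the sample mean.
    return np.einsum('si,sj,sk,sl->ijkl', X, X, X, X) / X.shape[0]

def contrast(T, v):
    # T(v, v, v, v) = sum_{ijkl} T_{ijkl} v_i v_j v_k v_l.
    return np.einsum('ijkl,i,j,k,l->', T, v, v, v, v)

rng = np.random.default_rng(0)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, 2))  # unit-variance sources
A = np.eye(2)                                                # orthonormal columns
X = S @ A.T
T = fourth_moment_tensor(X)
print(contrast(T, np.array([1.0, 0.0])))                # at a column of A: a local optimum
print(contrast(T, np.array([1.0, 1.0]) / np.sqrt(2)))   # away from the columns
```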
Standard assumptions for x = As All algorithms require: 1. A is a full-rank n×n matrix: as many measurements as underlying variables. 2. Each s_i differs from a Gaussian in the fourth moment: E[s_i] = 0, E[s_i^2] = 1, E[s_i^4] ≠ 3. Note: this precludes the underdetermined case when A ∈ R^{n×m} is fat (n < m).
Our results We require neither standard assumption. Underdetermined: A ∈ R^{n×m} is a fat matrix with n << m. Any moment: E[s_i^r] ≠ E[z^r] for some r, where z ~ N(0, 1). Theorem (Informal). Let x = As be an underdetermined ICA model. Let d ∈ 2N be such that σ_m([vec(A_i^{⊗d/2})]_{i=1}^m) > 0. Suppose that for each s_i, one of its first k cumulants satisfies |cum_{k_i}(s_i)| ≥ Δ. Then one can recover the columns of A up to ɛ accuracy in polynomial time.
Underdetermined ICA: start with distribution over Rm
Underdetermined ICA: independent samples s
Underdetermined ICA: A first rotates/scales
Underdetermined ICA: then A projects down to Rn
Fully determined ICA: nice algorithm
1. (Fourier weights) Pick a random vector u from N(0, σ^2 I_n). For every x ∈ S, compute its Fourier weight w(x) = exp(iu^T x) / Σ_{y∈S} exp(iu^T y).
2. (Reweighted covariance) Compute the covariance matrix of the points x reweighted by w(x): µ_u = Σ_{x∈S} w(x) x and Σ_u = Σ_{x∈S} w(x)(x − µ_u)(x − µ_u)^T.
3. Compute the eigenvectors V of Σ_u.
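A minimal numpy sketch of this algorithm, assuming the data are whitened so that A is orthogonal; the function name, parameters, and toy example are illustrative choices, not from the talk.

```python
import numpy as np

def fourier_pca(X, sigma=1.0, seed=None):
    """One run of the reweighting step for x = As with orthogonal A.

    X: whitened samples, shape (num_samples, n).
    Returns unit vectors estimating the columns of A (up to sign/permutation).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    u = rng.normal(0.0, sigma, size=n)                 # random Fourier direction
    w = np.exp(1j * X @ u)
    w = w / w.sum()                                    # normalized Fourier weights
    mu = (w[:, None] * X).sum(axis=0)                  # reweighted mean
    Xc = X - mu
    Sigma = (w[:, None, None] * np.einsum('si,sj->sij', Xc, Xc)).sum(axis=0)
    # Ideally Sigma = A diag(g_j(A_j^T u)) A^T, so its eigenvectors are A's columns.
    _, V = np.linalg.eig(Sigma)
    # Remove the arbitrary complex phase of each eigenvector, then take the real part.
    top = V[np.abs(V).argmax(axis=0), np.arange(n)]
    V = np.real(V * np.conj(top / np.abs(top)))
    return V / np.linalg.norm(V, axis=0)

# Toy run: two uniform sources mixed by a rotation (already isotropic).
rng = np.random.default_rng(1)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(200_000, 2))
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
X = S @ A.T
print(fourier_pca(X, seed=2))   # columns close to +/- columns of A, possibly permuted
```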
Why does this work? 1. Fourier differentiation/multiplication relationship: (f′)^(u) = (2πiu) f̂(u). 2. Actually we consider the log of the Fourier transform: D^2 log E[exp(iu^T x)] = A diag(g_j(A_j^T u)) A^T, where g_j(t) is the second derivative of log E[exp(it s_j)].
Technical overview Fundamental analytic tools: Second characteristic function ψ(u) = log E[exp(iu^T x)]. Estimate the order-d derivative tensor field D^d ψ from samples. Evaluate D^d ψ at two randomly chosen u, v ∈ R^n to give two tensors T_u and T_v. Perform a tensor decomposition on T_u and T_v to obtain A.
First derivative Easy case A = I_n: ψ(u) = log E[exp(iu^T x)] = log E[exp(iu^T s)]. Thus:
∂ψ/∂u_1 = (1 / E[exp(iu^T s)]) · E[i s_1 exp(iu^T s)]
        = (1 / Π_{j=1}^n E[exp(iu_j s_j)]) · E[i s_1 exp(iu_1 s_1)] Π_{j=2}^n E[exp(iu_j s_j)]
        = E[i s_1 exp(iu_1 s_1)] / E[exp(iu_1 s_1)].
Second derivative Easy case A = I_n: 1. Differentiating via the quotient rule:
∂^2ψ/∂u_1^2 = ( E[(i s_1)^2 exp(iu_1 s_1)] E[exp(iu_1 s_1)] − E[i s_1 exp(iu_1 s_1)]^2 ) / E[exp(iu_1 s_1)]^2.
2. Differentiating a constant: ∂^2ψ/∂u_1∂u_2 = 0.
General derivatives Key point: taking derivatives isolates each variable u_i. The second derivative is a diagonal matrix. Subsequent derivatives are diagonal tensors: only the (i, ..., i) entries are nonzero. NB: higher derivatives are represented by n × ⋯ × n tensors, and there is one such tensor per point u ∈ R^n.
Basis change: second derivative When A ≠ I_n, we have to work much harder: D^2 ψ_u = A diag(g_j(A_j^T u)) A^T, where g_j : R → C is given by g_j(v) = ∂^2/∂v^2 log E[exp(iv s_j)].
Basis change: general derivative When A ≠ I_n, we have to work much harder: D^d ψ_u = Σ_{j=1}^m g_j(A_j^T u) (A_j ⊗ ⋯ ⊗ A_j), where g_j : R → C is given by g_j(v) = ∂^d/∂v^d log E[exp(iv s_j)]. Evaluating the derivative at different points u gives us tensors with shared decompositions!
Single tensor decomposition is NP-hard Forget the derivatives now. Take λ_j ∈ C and A_j ∈ R^n: T = Σ_{j=1}^m λ_j A_j^{⊗d}. When can we recover the vectors A_j? When is this computationally tractable?
Known results When d = 2, usual eigenvalue decomposition: M = Σ_{j=1}^n λ_j A_j ⊗ A_j. When d ≥ 3 and the A_j are linearly independent, a tensor power iteration suffices [AGHKT12]: T = Σ_{j=1}^m λ_j A_j^{⊗d}. This necessarily implies m ≤ n. For unique recovery, require all the eigenvalues to be different.
Generalising the problem What about two equations instead of one? T_µ = Σ_{j=1}^m µ_j A_j^{⊗d} and T_λ = Σ_{j=1}^m λ_j A_j^{⊗d}. Our technique will flatten the tensors: M_µ = [vec(A_j^{⊗d/2})] diag(µ_j) [vec(A_j^{⊗d/2})]^T.
Algorithm Input: two tensors T_µ and T_λ flattened to M_µ and M_λ:
1. Compute W, the right singular vectors of M_µ.
2. Form the matrix M = (W^T M_µ W)(W^T M_λ W)^{-1}.
3. Eigenvector decomposition M = PDP^{-1}.
4. For each column P_i, let v_i ∈ C^n be the best rank-1 approximation to P_i packed back into a tensor.
5. For each v_i, output re(e^{iθ} v_i) / ||re(e^{iθ} v_i)||, where θ = argmax_{θ∈[0,2π]} ||re(e^{iθ} v_i)||.
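Here is a minimal numpy sketch of steps 1-3 (illustrative names, not from the talk), shown for d = 2 so that flattening and unflattening are trivial; in general the columns of B below would be vec(A_j^{⊗d/2}) and step 4's rank-1 approximation would recover A_j.

```python
# A minimal sketch of the shared (simultaneous) diagonalization step, assuming
# the tensors are already flattened to M_mu = B diag(mu) B^T and
# M_lam = B diag(lam) B^T, with the columns of B being vec(A_j^(d/2)).
import numpy as np

def shared_decomposition(M_mu, M_lam, m):
    # W: top-m singular vectors of M_mu, so W^T M_lam W is invertible.
    U, s, Vt = np.linalg.svd(M_mu)
    W = Vt[:m].T
    M = (W.T @ M_mu @ W) @ np.linalg.inv(W.T @ M_lam @ W)
    # Eigenvectors of M are (up to scale) W^T times the columns of B.
    eigvals, P = np.linalg.eig(M)
    B_est = W @ P
    return B_est / np.linalg.norm(B_est, axis=0)

# Toy example with d = 2 (so B = A and no unflattening is needed).
rng = np.random.default_rng(0)
n, m = 6, 4
A = rng.normal(size=(n, m)); A /= np.linalg.norm(A, axis=0)
mu, lam = rng.uniform(0.5, 2.0, size=m), rng.uniform(0.5, 2.0, size=m)
M_mu = A @ np.diag(mu) @ A.T
M_lam = A @ np.diag(lam) @ A.T
A_est = np.real(shared_decomposition(M_mu, M_lam, m))
print(np.abs(A_est.T @ A).round(2))   # close to a permutation matrix (up to sign)
```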
Theorem Theorem (Tensor decomposition). Let T_µ, T_λ ∈ R^{n×⋯×n} be order-d tensors with d ∈ 2N such that T_µ = Σ_{j=1}^m µ_j A_j^{⊗d} and T_λ = Σ_{j=1}^m λ_j A_j^{⊗d}, where the vectors vec(A_j^{⊗d/2}) are linearly independent, µ_i/λ_i ≠ 0 for all i, and |µ_i/λ_i − µ_j/λ_j| > 0 for all i ≠ j. Then the vectors A_j can be estimated to any desired accuracy in polynomial time.
Analysis Let's pretend M_µ and M_λ are full rank:
M_µ M_λ^{-1} = [vec(A_j^{⊗d/2})] diag(µ_j) [vec(A_j^{⊗d/2})]^T ([vec(A_j^{⊗d/2})]^T)^{-1} diag(λ_j)^{-1} [vec(A_j^{⊗d/2})]^{-1}
             = [vec(A_j^{⊗d/2})] diag(µ_j/λ_j) [vec(A_j^{⊗d/2})]^{-1}.
The eigenvectors are flattened tensors of the form vec(A_j^{⊗d/2}).
Diagonalisability for non-normal matrices When can we write A = PDP^{-1}? Require all eigenvectors to be independent (P invertible). Equivalently, the minimal polynomial of A has non-degenerate roots. Sufficient condition: all eigenvalues are distinct.
Perturbed spectra for non-normal matrices More complicated than for normal matrices. Normal: |λ_i(A + E) − λ_i(A)| ≤ ||E||. Non-normal: either the Bauer-Fike theorem, |λ_i(A + E) − λ_j(A)| ≤ κ(P) ||E|| for some j, or we must assume A + E is already diagonalizable. Neither of these suffices.
Generalised Weyl inequality Lemma. Let A ∈ C^{n×n} be a diagonalizable matrix with A = P diag(λ_i) P^{-1}. Let E ∈ C^{n×n} be a matrix such that |λ_i(A) − λ_j(A)| ≥ 3κ(P) ||E|| for all i ≠ j. Then there exists a permutation π : [n] → [n] such that |λ_i(A + E) − λ_{π(i)}(A)| ≤ κ(P) ||E||. Proof: via a homotopy argument (like the strong Gershgorin theorem).
Robust analysis Proof sketch:
1. Apply the generalised Weyl inequality to bound the eigenvalue perturbations: the eigenvalues stay distinct, hence the perturbed matrix is diagonalisable.
2. Apply the Ipsen-Eisenstat theorem (a generalised Davis-Kahan sin(θ) theorem).
3. This implies that the output eigenvectors are close to vec(A_j^{⊗d/2}).
4. Apply tensor power iteration to extract approximate A_j.
5. Show that the best real projection of each approximate A_j is close to the true column.
Underdetermined ICA Algorithm x = As where A is a fat matrix.
1. Pick two independent random vectors u, v ~ N(0, σ^2 I_n).
2. Form the d-th derivative tensors at u and v: T_u = D^d ψ_u and T_v = D^d ψ_v.
3. Run the tensor decomposition on the pair (T_u, T_v).
Estimating from samples
[D^4 ψ_u]_{i1,i2,i3,i4} = (1/φ(u)^4) [ φ(u)^3 E[(ix_{i1})(ix_{i2})(ix_{i3})(ix_{i4}) exp(iu^T x)]
 − φ(u)^2 E[(ix_{i2})(ix_{i3})(ix_{i4}) exp(iu^T x)] E[(ix_{i1}) exp(iu^T x)]
 − φ(u)^2 E[(ix_{i2})(ix_{i3}) exp(iu^T x)] E[(ix_{i1})(ix_{i4}) exp(iu^T x)]
 − φ(u)^2 E[(ix_{i2})(ix_{i4}) exp(iu^T x)] E[(ix_{i1})(ix_{i3}) exp(iu^T x)] + ⋯ ],
where φ(u) = E[exp(iu^T x)]. At most 2^{d−1}(d−1)! terms. Each one is easy to estimate empirically!
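A minimal numpy sketch of this empirical estimation, shown for the simpler case d = 2 (the Hessian D^2 ψ_u from the earlier slide); names are illustrative, and the d = 4 estimator follows the same pattern with the extra terms above.

```python
# A minimal sketch of estimating the derivative tensor from samples, shown for
# d = 2 (the Hessian of psi); names are illustrative.
import numpy as np

def hessian_of_psi(X, u):
    """Empirical D^2 psi_u for psi(u) = log E[exp(i u^T x)].

    X: samples of shape (num_samples, n).  Returns an n x n complex matrix,
    ideally close to A diag(g_j(A_j^T u)) A^T.
    """
    e = np.exp(1j * X @ u)                       # exp(i u^T x), one value per sample
    phi = e.mean()                               # phi(u) = E[exp(i u^T x)]
    d1 = (1j * X * e[:, None]).mean(axis=0)      # E[(i x_k) exp(i u^T x)]
    d2 = np.einsum('sk,sl,s->kl', 1j * X, 1j * X, e) / X.shape[0]
    # D^2 psi_u = phi''/phi - (phi'/phi)(phi'/phi)^T  (second derivative of log phi).
    return d2 / phi - np.outer(d1, d1) / phi**2
```

Evaluating this at two random points u and v and feeding the resulting matrices to the decomposition sketch above gives estimates of the columns of A in the fully determined case.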
Theorem Theorem. Fix n, m ∈ N with n ≤ m. Let x ∈ R^n be given by an underdetermined ICA model x = As. Let d ∈ 2N be such that σ_m([vec(A_i^{⊗d/2})]_{i=1}^m) > 0. Suppose that for each s_i, one of its cumulants with d < k_i ≤ k satisfies |cum_{k_i}(s_i)| ≥ Δ, and that E[|s_i|^k] ≤ M. Then one can recover the columns of A up to ɛ accuracy in time and sample complexity poly(n^{d+k}, m^{k^2}, M^k, 1/Δ^k, 1/σ_m([vec(A_i^{⊗d/2})])^k, 1/ɛ).
Analysis Recall our matrices were: M_u = [vec(A_j^{⊗d/2})] diag(g_j(A_j^T u)) [vec(A_j^{⊗d/2})]^T, where g_j(v) = ∂^d/∂v^d log E[exp(iv s_j)]. Need to show that the ratios g_j(A_j^T u)/g_j(A_j^T v) are well-spaced.
Truncation Taylor series of the second characteristic: g_i(u) = Σ_{l=d}^{k_i} cum_l(s_i) (iu)^{l−d}/(l−d)! + R_t (iu)^{k_i−d+1}/(k_i−d+1)!. Finite-degree polynomials are anti-concentrated. The tail error is small because higher moments exist (in fact one higher moment suffices).
Tail error g_j is the d-th derivative of log E[exp(iu s_j)]. For a characteristic function, |φ^{(d)}(u)| ≤ E[|x|^d]. Count the number of terms after iterating the quotient rule d times.
Polynomial anti-concentration Lemma. Let p(x) be a degree-d monic polynomial over R and let x ~ N(0, σ^2). Then for any t ∈ R we have Pr(|p(x) − t| ≤ ɛ) ≤ 4dɛ^{1/d} / (σ√(2π)).
Polynomial anti-concentration Proof.
1. Over a fixed interval, a scaled Chebyshev polynomial has the smallest ℓ_∞ norm among monic degree-d polynomials (of order 1/2^d when the interval is [−1, 1]).
2. Since p has degree d, its derivative changes sign at most d − 1 times, so there are at most d intervals where p(x) is close to any given t.
3. Applying the first fact, each such interval has length at most ɛ^{1/d}, and the Gaussian density is at most 1/(σ√(2π)).
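A quick Monte Carlo check of the lemma; the polynomial, σ, ɛ, and t below are illustrative values, not from the talk.

```python
# Numerically compare the empirical probability Pr(|p(x) - t| <= eps) with the
# bound 4*d*eps**(1/d)/(sigma*sqrt(2*pi)) for a monic degree-d polynomial.
import numpy as np

rng = np.random.default_rng(0)
d, sigma, eps, t = 3, 1.0, 1e-3, 0.5
x = rng.normal(0.0, sigma, size=1_000_000)
p = x**3 - 2 * x                                    # a monic degree-3 polynomial
empirical = np.mean(np.abs(p - t) <= eps)
bound = 4 * d * eps**(1 / d) / (sigma * np.sqrt(2 * np.pi))
print(empirical, bound)   # the empirical probability should sit below the bound
```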
Eigenvalue spacings Want to bound Pr(|g_i(A_i^T u)/g_i(A_i^T v) − g_j(A_j^T u)/g_j(A_j^T v)| ≤ ɛ). Condition on a value of A_j^T u = s. Then, writing g_i = p_i + ɛ_i with p_i the truncated Taylor polynomial and ɛ_i the tail:
|g_i(A_i^T u)/g_i(A_i^T v) − g_j(s)/g_j(A_j^T v)| = |p_i(A_i^T u)/g_i(A_i^T v) + ɛ_i/g_i(A_i^T v) − g_j(s)/g_j(A_j^T v)|
 ≥ |p_i(A_i^T u)/g_i(A_i^T v) − g_j(s)/g_j(A_j^T v)| − |ɛ_i/g_i(A_i^T v)|.
Once we've conditioned on A_j^T u, we can pretend A_i^T u is also a Gaussian (of highly reduced variance).
Eigenvalue spacings A_i^T u = ⟨A_i, A_j⟩ A_j^T u + r^T u, where r is orthogonal to A_j. The variance of the remaining randomness is ||r||^2 σ^2, which we can lower-bound via σ_m([vec(A_i^{⊗d/2})]). We conclude by union bounding with the event that the denominators are not too large, and then over all pairs i, j.
Extensions Can remove Gaussian noise when x = As + η and η ~ N(µ, Σ). Gaussian mixtures (when x ~ Σ_{i=1}^n w_i N(µ_i, Σ_i)), in the spherical covariance setting. (Gaussian noise removal applies here too.)
Open problems What is the relationship between our method and kernel PCA? Independent subspaces. Gaussian mixtures: underdetermined and generalized covariance case.
Fin Questions?