Analyzing Tensor Power Method Dynamics in Overcomplete Regime


Journal of Machine Learning Research 18 (2017) 1-40. Submitted 9/15; Revised 11/16; Published 4/17.

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Animashree Anandkumar, Department of Electrical Engineering and Computer Science, University of California, Irvine, Engineering Hall, Room 4408, Irvine, CA 92697, USA

Rong Ge, Department of Computer Science, Duke University, 308 Research Drive (LSRC Building), Room D226, Durham, NC 27708, USA

Majid Janzamin, Department of Electrical Engineering and Computer Science, University of California, Irvine, Engineering Hall, Room 4406, Irvine, CA 92697, USA

Editor: Tommi Jaakkola

Abstract

We present a novel analysis of the dynamics of tensor power iterations in the overcomplete regime, where the tensor CP rank is larger than the input dimension. Finding the CP decomposition of an overcomplete tensor is NP-hard in general. We consider the case where the tensor components are randomly drawn, and show that the simple power iteration recovers the components with bounded error under mild initialization conditions. We apply our analysis to unsupervised learning of latent variable models, such as multi-view mixture models and spherical Gaussian mixtures. Given the third order moment tensor, we learn the parameters using tensor power iterations. We prove that it can correctly learn the model parameters when the number of hidden components k is much larger than the data dimension d, up to k = o(d^1.5). We initialize the power iterations with data samples and prove their success under mild conditions on the signal-to-noise ratio of the samples. Our analysis significantly expands the class of latent variable models where spectral methods are applicable. Our analysis also deals with noise in the input tensor, leading to sample complexity results in the application to learning latent variable models.

Keywords: tensor decomposition, tensor power iteration, overcomplete representation, unsupervised learning, latent variable models

(c) 2017 Animashree Anandkumar, Rong Ge, and Majid Janzamin. License: CC-BY 4.0.

1. Introduction

CANDECOMP/PARAFAC (CP) decomposition of a symmetric tensor T ∈ R^{d×d×d} is the process of decomposing it into a succinct sum of rank-one tensors, given by

T = Σ_{j∈[k]} λ_j · a_j ⊗ a_j ⊗ a_j,  λ_j ∈ R, a_j ∈ R^d,  (1)

where ⊗ denotes the outer product. The minimum k for which the tensor can be decomposed in the above form is called the (symmetric) tensor rank. Tensor power iteration is a simple, popular and efficient method for recovering the tensor rank-one components a_j's. The tensor power iteration is given by

x ← T(I, x, x) / ‖T(I, x, x)‖,  (2)

where T(I, x, x) := Σ_{j,l∈[d]} x_j x_l T(:, j, l) ∈ R^d is a multilinear combination of the tensor fibers, and ‖·‖ is the ℓ_2 norm operator. See Section 1.3 for an overview of tensor notations and preliminaries.

The tensor power iteration is a generalization of matrix power iteration: for a matrix M ∈ R^{d×d}, the power iteration is given by x ← Mx/‖Mx‖. Dynamics and convergence properties of matrix power iterations are well understood (Horn and Johnson, 2012). On the other hand, a theoretical understanding of tensor power iterations is much more limited. Tensor power iteration can be viewed as a gradient descent step (with infinite step size), corresponding to the problem of finding the best rank-1 approximation of the input tensor T (Anandkumar et al., 2014c). This optimization problem is non-convex. Unlike the matrix case, where the number of isolated stationary points of the power iteration is at most the dimension (given by eigenvectors corresponding to unique eigenvalues), in the tensor case the number of stationary points is, in fact, exponential in the input dimension (Cartwright and Sturmfels, 2013). This makes the analysis of tensor power iteration far more challenging.

Despite the above challenges, many advances have been made in understanding tensor power iterations in specific regimes. When the components a_j's are orthogonal to one another, it is known that there are no spurious local optima for tensor power iterations, and the only stable fixed points correspond to the true a_j's (Zhang and Golub, 2001; Anandkumar et al., 2014c). Any tensor with linearly independent components a_j's can be orthogonalized via an invertible transformation (whitening), and thus its components can be recovered efficiently. A careful perturbation analysis in this setting was carried out in Anandkumar et al. (2014c). The framework in Anandkumar et al. (2014c) is however not applicable in the overcomplete setting, where the tensor rank k exceeds the dimension d. Such overcomplete tensors cannot be orthogonalized, and finding guaranteed decompositions is a challenging open problem. It is known that finding the CP tensor decomposition is NP-hard (Hillar and Lim, 2013). In this paper, we make significant headway in showing that the simple power iterations can recover the components in the overcomplete regime under a set of mild conditions on the components a_j's.
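For concreteness, the update in (2) is easy to prototype. The sketch below is a minimal numpy illustration (our own, not the authors' code; the tensor sizes and the random construction of T are chosen purely for demonstration). It builds a random overcomplete tensor of the form (1) and applies the power iteration.

```python
import numpy as np

def power_iteration(T, x0, num_iters=30):
    """Repeatedly apply x <- T(I, x, x) / ||T(I, x, x)|| for a d x d x d tensor T."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(num_iters):
        # T(I, x, x)_i = sum_{j,l} T[i, j, l] * x[j] * x[l]
        y = np.einsum('ijl,j,l->i', T, x, x)
        x = y / np.linalg.norm(y)
    return x

# Random overcomplete tensor T = sum_j a_j (x) a_j (x) a_j with k > d components.
d, k = 50, 100
A = np.random.default_rng(0).standard_normal((d, k)) / np.sqrt(d)   # columns roughly unit norm
T = np.einsum('ij,pj,qj->ipq', A, A, A)
x = power_iteration(T, np.random.default_rng(1).standard_normal(d))
print(np.max(np.abs(A.T @ x)))     # correlation with the best-aligned component
```

The einsum contraction implements exactly the multilinear form T(I, x, x) defined after (2).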

Overcomplete tensors also arise in many machine learning applications, such as moments of many latent variable models, e.g., multiview mixtures, independent component analysis (ICA), and sparse coding models, where the number of hidden variables exceeds the input dimension (Anandkumar et al., 2015). Overcomplete models often have impressive empirical performance (Coates et al., 2011), can provide greater flexibility in modeling, and are more robust to noise (Lewicki and Sejnowski, 2000). By studying algorithms for overcomplete tensor decomposition, we expand the class of models that can be learnt efficiently using simple spectral methods such as tensor power iterations. Note that there are other algorithms for decomposing overcomplete tensors (De Lathauwer et al., 2007; Goyal et al., 2013; Bhaskara et al., 2013), but they all require tensors of at least 4-th order and have large computational complexity. Ge and Ma (2015) works for 3rd order tensors but requires quasi-polynomial time. The main contribution of this paper is an analysis of the practical power method in the overcomplete regime.

1.1 Summary of Results

We analyze the dynamics of third order tensor power iterations in the overcomplete regime. We assume that the tensor components a_j's are randomly drawn from the unit sphere. Since general tensor decomposition is challenging in the overcomplete regime, we argue that this is a natural first step to consider for tractable recovery. We characterize the basin of attraction for the local optima near the rank-one components a_j's. We show that under a mild initialization condition, there is fast convergence to these local optima in O(log log d) iterations. This result is the core technical analysis of this paper, stated in the following theorem.

Theorem 1 (Dynamics of tensor power iteration) Consider a tensor T̂ = T + E such that the exact tensor T has the rank-k decomposition in (1) with rank-one components a_j ∈ R^d, j ∈ [k], uniformly i.i.d. drawn from the unit d-dimensional sphere, and the ratio of maximum and minimum (in absolute value) weights λ_j being constant. In addition, suppose the perturbation tensor E has bounded spectral norm as

‖E‖ ≤ ε · √k/d,  where ε < o(√k/d).  (3)

Let the tensor rank k = o(d^1.5), and let the unit-norm initial vector x^(1) satisfy the correlation bound

|⟨x^(1), a_j⟩| ≥ β · √k/d,  (4)

w.r.t. some true component a_j, j ∈ [k], for some β > (log d)^c for some universal constant c > 0. After N = Θ(log log d) iterations, the tensor power iteration in (2) outputs a vector having w.h.p. a constant correlation with the true component a_j, i.e., ⟨x^(N+1), a_j⟩ ≥ 1 − γ, for any fixed constant γ > 0.

As a corollary, this result can be used for learning latent variable models such as multiview mixtures. We show that the above initialization condition is satisfied using a sample with mild signal-to-noise ratio; see Section 2 for more details on this.
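To make the statement concrete, here is a toy simulation (illustrative parameter values chosen by us, not taken from the paper) that plants an initialization with a correlation with a_1 of a few times the √k/d scale in (4) and tracks the alignment over a handful of iterations. Under the random-component assumption the correlation roughly squares at every step after rescaling, which is where the Θ(log log d) iteration count comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 400                                   # overcomplete: k > d (toy sizes)
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)                    # columns on the unit sphere
T = np.einsum('ij,pj,qj->ipq', A, A, A)           # exact rank-k tensor, unit weights

# Initialization: mostly random, plus a planted correlation of a few times sqrt(k)/d with a_1.
beta0 = 4 * np.sqrt(k) / d                        # 0.4 here; enough to dominate at these toy sizes
x = beta0 * A[:, 0] + np.sqrt(1 - beta0 ** 2) * rng.standard_normal(d) / np.sqrt(d)
x /= np.linalg.norm(x)

for t in range(6):                                # a few iterations already suffice
    y = np.einsum('ijl,j,l->i', T, x, x)
    x = y / np.linalg.norm(y)
    print(t + 1, abs(A[:, 0] @ x))                # correlation with a_1 rises quickly toward 1
```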

The above result is a significant improvement over the recent analyses by Anandkumar et al. (2015, 2014a,b) for overcomplete tensor decomposition. In those works, the initialization vectors are required to have a constant amount of correlation with the true a_j's. However, obtaining such strong initializations is usually not realistic in practice. On the other hand, the initialization condition in (4) is mild, and decaying, even when the rank k is significantly larger than the dimension d, up to k = o(d^1.5). In learning the mixture model, such initialization vectors can be obtained as samples from the mixture model, even when there is a large amount of noise. Given this improvement, we combine our analysis in Theorem 1 with the guarantees in Anandkumar et al. (2015, 2014a), proving that the model parameters can be recovered consistently.

A detailed proof outline for Theorem 1 is provided in Section 3.1. Under the random assumption, it is not hard to show that the first iteration of the tensor power update makes progress. However, after the first iteration, the input vector and the tensor components are no longer independent of each other. Therefore, we cannot directly repeat the same argument for the second step. How do we analyze the second step even though the vector and the tensor components are correlated? The main intuition is to characterize the dependency between the vector and the tensor components, and show that there is still enough randomness left for us to repeat the argument. This idea was inspired by the analysis of Approximate Message Passing (AMP) algorithms (Bayati and Montanari, 2010). However, our analysis here is very different in several key aspects: 1) In approximate message passing, the analysis typically works in the large system limit, where the number of iterations is fixed and the dimension goes to infinity. Here we can handle a superconstant number of iterations O(log log d), even for finite d; 2) Usually k is assumed to be a constant factor times d in AMP-like analyses, while here we allow them to be polynomially related.

1.2 Related Work

Tensor decomposition for learning latent variable models: In the introduction, some related works are mentioned which study the theoretical and practical aspects of spectral techniques for learning latent variable models. Among them, Anandkumar et al. (2014c) provide the analysis of tensor power iteration for learning several latent variable models in the undercomplete regime. Anandkumar et al. (2014a) provide the analysis in the overcomplete regime, and Anandkumar et al. (2014b) provide tensor concentration bounds and apply the analysis in Anandkumar et al. (2014a) to learning LVMs, proposing tight sample complexity guarantees.

Learning mixtures of Gaussians: Here, we provide a subset of related works studying learning mixtures of Gaussians which are most comparable with our result. For a more detailed list of these works, see Anandkumar et al. (2014c); Hsu and Kakade (2013). The problem of learning mixtures of Gaussians dates back to the work by Pearson (1895), who proposed a moment-based technique that involves solving systems of multivariate polynomials, which is in general challenging in both the computational and the statistical sense. Recently, many studies on learning Gaussian mixture models have improved both aspects; they can be divided into two main classes: distance-based and spectral methods.

Distance-based methods impose a separation condition on the mean vectors, showing that under enough separation the parameters can be estimated. Among such approaches, we can mention Dasgupta (1999); Vempala and Wang (2002); Arora and Kannan (2005). As discussed in the summary of results, these results work even if k > d^1.5 as long as the separation condition between the means is satisfied, but our work can tolerate a higher level of noise in the regime of k = o(d^1.5) with polynomial computational complexity. The guarantees in Vempala and Wang (2002) also work in the high noise regime but need higher computational complexity. In the spectral approaches, the observed moments are constructed and a spectral decomposition of the observed moments is performed to recover the parameters (Kalai et al., 2010; Anandkumar et al., 2012, 2014b). Kalai et al. (2010) analyze the problem of learning a mixture of two general Gaussians and provide an algorithm with high order polynomial sample and computational complexity. Note that in general, the complexity of such methods grows exponentially with the number of components without further assumptions (Moitra and Valiant, 2010). Hsu and Kakade (2013) provide a spectral algorithm under non-degeneracy conditions on the mean vectors and propose guarantees with polynomial sample complexity depending on the condition number of the moment matrices. Anandkumar et al. (2014b) perform tensor power iteration on the third order moment tensor to recover the mean vectors in the overcomplete regime as long as k = o(d^1.5), but need a very good initialization vector having constant correlation with the true mean vector. Here, we improve the correlation level required for convergence.

1.3 Notation and Tensor Preliminaries

Let [k] := {1, 2, ..., k}, and let ‖v‖ denote the ℓ_2 norm of vector v. We use Õ and Ω̃ to hide polylog factors in the asymptotic notations O and Ω, respectively.

Tensor preliminaries: A real p-th order tensor T ∈ ⊗^p R^d is a member of the p-fold outer product of Euclidean spaces R^d. The different dimensions of the tensor are referred to as modes. For instance, for a matrix, the first mode refers to columns and the second mode refers to rows. In addition, fibers are higher order analogues of matrix rows and columns. A fiber is obtained by fixing all but one of the indices of the tensor (and is arranged as a column vector). For example, for a third order tensor T ∈ R^{d×d×d}, the mode-1 fiber is given by T(:, j, l). Similarly, slices are obtained by fixing all but two of the indices of the tensor. For example, for the third order tensor T, the slices along the 3rd mode are given by T(:, :, l).

We view a tensor T ∈ R^{d×d×d} as a multilinear form. In particular, for vectors u, v, w ∈ R^d, we have¹

T(I, v, w) := Σ_{j,l∈[d]} v_j w_l T(:, j, l) ∈ R^d,  (5)

which is a multilinear combination of the tensor mode-1 fibers. Similarly, T(u, v, w) ∈ R is a multilinear combination of the tensor entries, and T(I, I, w) ∈ R^{d×d} is a linear combination of the tensor slices.

A 3rd order tensor T ∈ R^{d×d×d} is said to be rank-1 if it can be written in the form

T = λ · a ⊗ b ⊗ c,  i.e., T(i, j, l) = λ · a(i) b(j) c(l),  (6)

where the notation ⊗ represents the outer product and a, b, c ∈ R^d are unit vectors.

1. Compare with the matrix case, where for M ∈ R^{d×d} we have M(I, u) = Mu := Σ_{j∈[d]} u_j M(:, j).

A tensor T ∈ R^{d×d×d} is said to have CP rank at most k if it can be written as a sum of k rank-1 tensors,

T = Σ_{i∈[k]} λ_i · a_i ⊗ b_i ⊗ c_i,  λ_i ∈ R, a_i, b_i, c_i ∈ R^d.  (7)

For a third order tensor T ∈ R^{d×d×d}, the spectral (operator) norm is defined as

‖T‖ := sup_{‖u‖=‖v‖=‖w‖=1} |T(u, v, w)|.

In the rest of the paper, Section 2 describes how to apply our tensor results to learning multiview mixture models. Section 3 illustrates the proof ideas, with more details in the Appendix. Finally, we conclude in Section 4.

2. Learning Multiview Mixture Models through Tensor Methods

We propose our main technical result in Section 1.1, providing convergence guarantees for the tensor power iterations given mild initialization conditions in the overcomplete regime; see Theorem 1. Along with this result, we provide the application to learning multiview mixture models in Theorem 2. In this section, we briefly introduce the tensor decomposition framework as the learning algorithm, and then state the learning guarantees with more details and remarks.

Figure 1: Multiview mixture model (hidden variable h with views z_1, z_2, ..., z_p).

2.1 Multiview Mixture Model

Consider an exchangeable multiview mixture model with k components and p ≥ 3 views; see Figure 1. Suppose that the hidden variable h is a discrete categorical random variable taking one of k states. It is convenient to represent it by basis vectors, such that h = e_j ∈ R^k if and only if it takes the j-th state. Note that e_j ∈ R^k denotes the j-th basis vector in the k-dimensional space. The prior probability for each hidden state is Pr[h = e_j] = λ_j, j ∈ [k]. For simplicity, in this paper we assume all the λ_i's are the same. However, a similar argument works even when the ratio of maximum and minimum prior probabilities λ_max/λ_min is bounded by some constant. The variables (views) z_l ∈ R^d are related to the hidden state through the factor matrix A ∈ R^{d×k} such that

z_l = Ah + η_l,  l ∈ [p],

where the zero-mean noise vectors η_l ∈ R^d are independent of each other and of the hidden state h. Given this, the variables (views) z_l ∈ R^d are conditionally independent given the latent variable h, and the conditional means are

E[z_l | h = e_j] = a_j,

where a_j ∈ R^d denotes the j-th column of the factor matrix A = [a_1 ⋯ a_k] ∈ R^{d×k}. In addition, the above properties imply that the order of the observations z_l does not matter and the model is exchangeable. The goal of the learning problem is to recover the parameters of the model (the factor matrix A) given observations. For this model, the third order² observed moment has the form (Anandkumar et al., 2014c)

E[z_1 ⊗ z_2 ⊗ z_3] = Σ_{j∈[k]} λ_j · a_j ⊗ a_j ⊗ a_j.  (8)

Hence, given the third order observed moment, the unsupervised learning problem (recovering the factor matrix A) reduces to computing a tensor decomposition as in (8).

2.2 Tensor Decomposition Algorithm

The algorithm for unsupervised learning of the multiview mixture model is based on the tensor decomposition techniques provided in Algorithm 1. The main step in (9) performs the tensor power iteration;³ see (5) for the multilinear form definition. After running the algorithm for all the different initialization vectors, the clustering process from Anandkumar et al. (2015) ensures that the best converged vectors are returned as the estimates of the true components a_j.

Algorithm 1 Learning multiview mixture model via tensor power iterations

Require: 1) Third order moment tensor T ∈ R^{d×d×d} in (8), 2) n samples of z_1 in the multiview mixture model, denoted z_1^(τ), τ ∈ [n], and 3) number of iterations N.
1: for τ = 1 to n do
2:   Initialize unit vectors x_τ^(1) ← z_1^(τ) / ‖z_1^(τ)‖.
3:   for t = 1 to N do
4:     Tensor power updates (see (5) for the definition of the multilinear form):
       x_τ^(t+1) = T(I, x_τ^(t), x_τ^(t)) / ‖T(I, x_τ^(t), x_τ^(t))‖,  (9)
5:   end for
6: end for
7: return the output of Procedure 2 with input {x_τ^(N+1) : τ ∈ [n]} as the estimates x̂_j.

2. It is enough to form the third order moment for our learning purpose.
3. This is the generalization of matrix power iteration to 3rd order tensors.
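A compact sketch of Algorithm 1 in numpy follows (our illustration, not the authors' implementation; the clustering step of Procedure 2 is omitted and the helper name is ours). Each observed sample of z_1 is used as an initialization and run through N power updates of the form (9).

```python
import numpy as np

def learn_components(T, init_samples, num_iters):
    """Algorithm 1 (sketch): run the power update (9) from each sample initialization."""
    outputs = []
    for z in init_samples:
        x = z / np.linalg.norm(z)               # step 2: initialize from a sample of z_1
        for _ in range(num_iters):              # steps 3-5: tensor power updates (9)
            y = np.einsum('ijl,j,l->i', T, x, x)
            x = y / np.linalg.norm(y)
        outputs.append(x)
    return outputs                              # candidates passed to the clustering procedure
```

In practice the input T would be an empirical estimate of the moment in (8) formed from samples, and the returned candidates are then pruned by Procedure 2.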

Procedure 2 Clustering process (Anandkumar et al., 2015)

Require: Tensor T ∈ R^{d×d×d}, set S := {x_τ^(N+1) : τ ∈ [n]}, parameter ν.
1: while S is not empty do
2:   Choose x ∈ S which maximizes T(x, x, x).
3:   Do N more iterations of (9) starting from x.
4:   Output the result of the iterations, denoted by x̂.
5:   Remove all the x ∈ S with ⟨x, x̂⟩ > ν/2.
6: end while

2.3 Learning Guarantees

We assume a Gaussian prior on the mean vectors, i.e., the vectors a_j ∼ N(0, I_d/d), j ∈ [k], are i.i.d. drawn from a standard multivariate Gaussian distribution with unit expected square norm. Note that in high dimension (growing d), this assumption is the same as drawing uniformly from the unit sphere, since the norm of the vector concentrates in high dimension and there is no need for normalization. Even though we impose a prior distribution, we do not use a MAP estimator, since the corresponding optimization is NP-hard. Instead, we learn the model parameters through decomposition of the third order moments via tensor power iterations. The assumption of a Gaussian prior is standard in machine learning applications. We impose it here for a tractable analysis of the power iteration dynamics. Such Gaussian assumptions have been used before for the analysis of other iterative methods, such as approximate message passing algorithms, and there is evidence that similar results hold for more general distributions; see Bayati and Montanari (2010) and references therein.

As explained in the previous sections, we use the tensor power method to learn the components a_j, and the method is initialized with observed samples z_i. Intuitively, this initialization is useful since z_i = Ah + η_i is a perturbed version of the desired parameter a_j (when h = e_j). Thus, we present the result in terms of the signal-to-noise ratio (SNR), which is the expected norm of the signal a_j (which is one here) divided by the expected norm of the noise η_i; i.e., the SNR in the i-th sample z_i = a_j + η_i (assuming h = e_j) is defined as SNR := E[‖a_j‖]/E[‖η_i‖]. This specifies how much noise the initialization vector z_i can tolerate in order to ensure the convergence of the tensor power iteration to a desired local optimum. We now state the conditions required for the recovery guarantees, together with a brief explanation of them.

Conditions for Theorems 2 and 3:
- Rank condition: k = o(d^1.5).
- The columns of A are uniformly i.i.d. drawn from the unit d-dimensional sphere.
- The noise vectors η_l, l ∈ [3], are independent of the matrix A and of each other. In addition, the signal-to-noise ratio (SNR) is w.h.p. bounded as

  SNR ≥ Ω( β · max{√k, √d} / d ),

  for some β ≥ (log d)^c for a universal constant c > 0.
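As a quick numerical illustration of why a raw sample is a good enough initialization (our toy check with made-up values, not an experiment from the paper): a sample z = a_j + η correlates with the true mean at roughly the SNR level, which for moderate d and k sits comfortably above the decaying correlation threshold in (4).

```python
import numpy as np

rng = np.random.default_rng(0)
d, snr = 1000, 0.2
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                              # true mean vector, unit norm
eta = rng.standard_normal(d) / (np.sqrt(d) * snr)   # noise with expected norm ~ 1/snr
z = a + eta                                         # observed sample used as initialization
x0 = z / np.linalg.norm(z)
print(x0 @ a)                                       # roughly snr / sqrt(1 + snr**2), about 0.196
```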

The rank condition bounds the level of overcompleteness for which the recovery guarantees are satisfied. The randomness assumption on the columns of A is crucial for analyzing the dynamics of the tensor power iteration. We use it to argue that there is enough randomness left in the components after conditioning on the previous iterations; see Section 3.1 for the details. The bound on the SNR is required to make sure the sample used for initialization is close enough to the corresponding mean vector. This ensures that the initial vector is inside the basin of attraction of the corresponding component, and hence convergence to the mean vector can be guaranteed. Under these assumptions we have the following theorem.

Theorem 2 (Learning multiview mixture model: closeness to single columns) Consider a multiview mixture model (or a spherical Gaussian mixture) in the above settings with k components in d dimensions. If the above conditions hold, then the tensor power iteration converges to a vector close to one of the true mean vectors a_j (having constant correlation).

In particular, for mildly overcomplete models, where k = αd for some constant α > 1, the signal-to-noise ratio (SNR) can be as low as Ω(d^{-1/2+ε}), for any ε > 0. Thus, we can learn mixture models with a high level of noise. In general, we establish how the required noise level scales with the number of hidden components k, as long as k = o(d^1.5).

The above theorem states convergence to desired local optima which are close to the true components a_j. In Theorem 3, we show that we can sharpen the above result, by jointly iterating over the recovered vectors, and consistently recover the components a_j. This result also uses the analysis from Anandkumar et al. (2015).

Theorem 3 (Learning multiview mixture model: recovering the factor matrix) Assume the above conditions hold. The initialization of the power iteration is performed with samples of z_1 in the multiview mixture model. Suppose the tensor power iterations are initialized at least once for each a_j, j ∈ [k], such that z_1 = a_j + η_1.⁴ Then, using the exact 3rd order moment tensor in (8) as input, the tensor decomposition algorithm outputs an estimate Â (up to permutation of its columns) satisfying w.h.p. (over the randomness of the components a_j)

‖Â − A‖_F ≤ ε,

where the number of iterations of the algorithm is N = Θ( log(1/ε) + log log d ).

See Section 3 for the proof. The above theorems assume the exact third order tensor is given to the algorithm. We provide the results given the empirical tensor in Section 2.4.

4. Note that this happens for component j with high probability when the number of initializations is proportional to the inverse prior probability corresponding to that mixture component.

Learning spherical Gaussian mixtures: Consider a mixture of k different Gaussian vectors with spherical covariance. Let a_j ∈ R^d, j ∈ [k], denote the mean vectors, and let the covariance matrices be σ²I. Assuming the parameter σ is known, the modified third order observed moment

M_3 := E[z ⊗ z ⊗ z] − σ² Σ_{i∈[d]} ( E[z] ⊗ e_i ⊗ e_i + e_i ⊗ E[z] ⊗ e_i + e_i ⊗ e_i ⊗ E[z] )

has the tensor decomposition form (Hsu and Kakade, 2012)

M_3 = Σ_{j∈[k]} λ_j · a_j ⊗ a_j ⊗ a_j,

where λ_j is the probability of drawing the j-th Gaussian component. The above guarantees can be applied to learning the mean vectors a_j in this model, with the additional property that the noise is spherical Gaussian.

Learning multiview mixture models with distinct factor matrices: Consider the multiview mixture model with different factor matrices, where the first three views are related to the hidden state as

z_1 = Ah + η_1,  z_2 = Bh + η_2,  z_3 = Ch + η_3.

Then, the guarantees in the above theorems can be extended to recovering the columns of all three factor matrices A, B, and C, with appropriate modifications in the power iteration algorithm as follows. First, the update formula (9) is changed to

x_{1,τ}^(t+1) = T(I, x_{2,τ}^(t), x_{3,τ}^(t)) / ‖T(I, x_{2,τ}^(t), x_{3,τ}^(t))‖,
x_{2,τ}^(t+1) = T(x_{1,τ}^(t), I, x_{3,τ}^(t)) / ‖T(x_{1,τ}^(t), I, x_{3,τ}^(t))‖,
x_{3,τ}^(t+1) = T(x_{1,τ}^(t), x_{2,τ}^(t), I) / ‖T(x_{1,τ}^(t), x_{2,τ}^(t), I)‖,

which is the alternating asymmetric version of the symmetric power iteration in (9). Here, we alternate among the different modes of the tensor. In addition, the initialization for each mode of the tensor is appropriately performed with the samples corresponding to that mode. Note that the analysis still works in the asymmetric version, since there are even more independence relationships through the iterations of the power update because of the introduction of the new random matrices B and C.

2.4 Sample Complexity Analysis

In the previous section, we assumed the exact third order tensor in (8) is given to the tensor decomposition Algorithm 1. We now estimate the tensor given n samples z_1^(i), z_2^(i), z_3^(i), i ∈ [n], as

T̂ = (1/n) Σ_{i∈[n]} z_1^(i) ⊗ z_2^(i) ⊗ z_3^(i).  (10)

For the multiview mixture model introduced in Section 2.1, let the noise vectors η_l be spherical, and let ζ² denote the variance of each entry of the noise vector. We now provide the following recovery guarantees.

Additional conditions for Theorem 4:
- Let E_1 := [η_1^(1), η_1^(2), ..., η_1^(n)] ∈ R^{d×n}, where η_1^(i) ∈ R^d is the i-th sample of the noise vector η_1. These noise matrices satisfy the following RIP property, which is adapted from Candès and Tao (2006): matrix E_1 ∈ R^{d×n} satisfies a weak RIP condition such that, for any subset of O(d/log²d) columns, the spectral norm of E_1 restricted to those columns is bounded by 2. The same condition is satisfied for the similarly defined noise matrices E_2 and E_3.

- The number of samples n satisfies a lower bound such that the tensor perturbation bound stated in the proof of Theorem 4 (a sum of terms involving powers of ζ and λ_max together with n, d, and k; see Section 3) is at most

  min{ ε·√k/d, Õ(λ_min) },  (11)

  where ε < o(√k/d).

Theorem 4 (Learning multiview mixture model) Consider the empirical tensor in (10) as the input to the tensor decomposition Algorithm 1. Suppose the above additional conditions are also satisfied. Then, the same guarantees as in Theorem 2 hold. In addition, the same guarantees as in Theorem 3 also hold, with the recovery bound (up to permutation of the columns of Â) changed to

‖Â − A‖_F ≤ Õ( √k · ‖E‖ / λ_min ),

where E denotes the perturbation tensor originating from the empirical estimation in (10), and its spectral norm ‖E‖ is bounded by the left-hand side of (11). See Section 3 for the proof.

3. Proof Outline

Our main technical result is the analysis of the third order tensor power iteration provided in Theorem 1, which also tolerates some amount of noise in the input tensor. We analyze the noiseless and noisy settings in different ways. We first prove the result for the noiseless setting, where the input tensor has an exact rank-k decomposition as in (1). When the noise is also considered, we show that the contribution of the noise in the analysis is dominated by the main signal, and thus the same result still holds. For the rest of this section we focus on the noiseless setting, while we discuss the proof ideas for the noisy case in Section 3.2.

We first discuss the proof of Theorem 3, which involves two phases. In the first phase, we show that under a certain small amount of correlation (see (13)) between the initial vector and the true component, the power iteration in (2) converges to some vector which has constant correlation with the true component. This result is the core technical analysis of this paper, and is provided in Lemma 5. In the second phase, we incorporate the result of Anandkumar et al. (2015, 2014a), which guarantees the convergence of the power iteration (followed by a coordinate descent iteration) given initial vectors having constant correlation with the true components. This is stated in Lemma 6.

To simplify the notation, we consider the tensor⁵

T = Σ_{j∈[k]} a_j ⊗ a_j ⊗ a_j,  a_j ∼ N(0, (1/d) I_d).  (12)

5. In the analysis, we assume that all the weights are equal to one, which can be generalized to the case where the ratio of the maximum and minimum weights (in absolute value) is constant.

Notice that this is exactly proportional to the 3rd order moment tensor of the multiview mixture model in (8). The following lemma is a restatement of Theorem 1 in the noiseless setting.

Lemma 5 (Dynamics of tensor power iteration, phase 1) Consider the rank-k tensor T of the form in (12). Let the tensor rank k = o(d^1.5), and let the unit-norm initial vector x^(1) satisfy the correlation bound

|⟨x^(1), a_j⟩| ≥ β · √k/d,  (13)

w.r.t. some true component a_j, j ∈ [k], for some β > (log d)^c for some universal constant c > 0. After N = Θ(log log d) iterations, the tensor power iteration in (2) outputs a vector having w.h.p. a constant correlation with the true component a_j, i.e.,

⟨x^(N+1), a_j⟩ ≥ 1 − γ,

for any fixed constant γ > 0.

The proof outline of the above lemma is provided in Section 3.1. Next, we state the following lemma from Anandkumar et al. (2015), which provides the dynamics of the tensor power iteration when the initialization satisfies the constant correlation bound stated below.

Lemma 6 (Dynamics of tensor power iteration, phase 2) Consider the rank-k tensor T of the form in (12) with rank condition k = o(d^1.5). Let the initial vectors x_j^(1) satisfy the constant correlation bound

⟨x_j^(1), a_j⟩ ≥ 1 − γ_j,

w.r.t. the true components a_j, j ∈ [k], for some constants γ_j > 0. Let the outputs of the tensor power updates⁶ in (2), applied to all these different initialization vectors after N = Θ(log(1/ε)) iterations, be stacked as the columns of matrix Â. Then, we have w.h.p.⁷

‖Â − A‖_F ≤ ε,

where the recovery error is up to permutation of the columns of Â.

See Anandkumar et al. (2015) for the proof of the above lemma. Given the above two lemmas, the learning result in Theorem 3 is directly proved.

Proof of Theorem 3: The result is proved by combining Lemma 5 and Lemma 6. Note that the initialization condition in (4) is w.h.p. satisfied given the SNR bound assumed.

Proof of Theorem 4: In Theorem 3, we provide the result given the exact tensor by combining Lemmas 5 and 6. The only difference here is that we are given an empirical estimate of the tensor, and we need to incorporate the effect of noise in the empirical input.

6. This result also needs an additional step of coordinate descent iterations, since the true components are not the fixed points of the power iteration; see Anandkumar et al. (2015, 2014a) for the details.
7. Anandkumar et al. (2015, 2014a) recover the vectors up to sign since they work in the asymmetric case. In the symmetric case it is easy to resolve the sign ambiguity issue.

We now use Theorem 1, which characterizes the effect of noise in the first step (adapting Lemma 5 to the noisy setting), and Anandkumar et al. (2015), which provides the result of Lemma 6 in the noisy setting. In addition, the tensor concentration bound for the multiview mixture model is analyzed in Theorem 1 of Anandkumar et al. (2014b) (Lemma 56 in Anandkumar et al. (2015)), which shows that the error between the empirical and exact tensors, ‖T̂ − T‖, is bounded by a sum of terms involving powers of ζ and λ_max together with n, d, and k (see Theorem 1 of Anandkumar et al. (2014b) for the exact expression). The sample complexity requirement in (11) is then derived by imposing the error requirements in our noisy analysis of the tensor power dynamics in Theorem 1 (see Equation (3)) and the noisy analysis of Lemma 6 (see Theorem 1 of Anandkumar et al. (2015), where the perturbation tensor E needs to be bounded as ‖E‖ ≤ Õ(λ_min)). The final recovery error on ‖Â − A‖_F also follows from Theorem 1 of Anandkumar et al. (2015).

3.1 Proof Outline of Lemma 5 (Noiseless Case of Theorem 1)

First step: We first intuitively show that the first step of the algorithm makes progress. Suppose the tensor is T = Σ_{j∈[k]} a_j ⊗ a_j ⊗ a_j, and the initial vector x has correlation ⟨x, a_1⟩ ≥ β with the first component. The result of the first iteration is the normalized version of the following vector:

x̃ = Σ_{j∈[k]} ⟨a_j, x⟩² a_j.

Intuitively, this vector should have roughly ⟨a_1, x⟩² = β² correlation with a_1 (as the other terms are random, they don't contribute much). On the other hand, the norm of this vector is roughly O(√k/d): this is because ⟨a_j, x⟩² for j ≠ 1 is roughly⁸ 1/d, and the sum of k random vectors with length 1/d has length roughly O(√k/d). These arguments can be made precise, showing that the normalized version x̃/‖x̃‖ has correlation roughly β²·d/√k with a_1, ensuring progress in the first step.

Going forward: As we explained, the basic idea behind proving Lemma 5 is to characterize the conditional distribution of the random Gaussian tensor components a_j given the previous iterations. In particular, we show that the residual independent randomness left in these conditional distributions is large enough, and we can exploit it to obtain tighter concentration bounds throughout the analysis of the iterations. The Gaussian assumption on the components, and the small enough number of iterations, are crucial in this argument.

Notations: For two vectors u, v ∈ R^d, the Hadamard product, denoted by ⊙, is defined as the entry-wise multiplication of the vectors, i.e., (u ⊙ v)_j := u_j v_j for j ∈ [d]. For a matrix A, let P_{A^⊥} denote the projection operator onto the subspace orthogonal to the column span of A. For a subspace R, let R^⊥ denote the space orthogonal to it; hence the projection operator onto the subspace orthogonal to R is denoted by P_{R^⊥}. For a random matrix D, let D | {u = Dv} denote the conditional distribution of D given the linear constraints u = Dv. We also use the notation ≡ to denote equivalence in distribution.

8. The correlation between two unit Gaussian vectors in d dimensions is roughly 1/√d.

Lemma 5 involves analyzing the dynamics of the power iteration in (2) for 3rd order rank-k tensors. For the rank-k tensor in (12), the power iterative form x ← T(I, x, x)/‖T(I, x, x)‖ can be written as

x^(t+1) = A (A^⊤ x^(t))^⊙2 / ‖A (A^⊤ x^(t))^⊙2‖,  (14)

where the multilinear form in (5) is used. Here, A = [a_1 ⋯ a_k] ∈ R^{d×k} denotes the factor matrix, and for a vector y ∈ R^k, y^⊙2 := y ⊙ y ∈ R^k represents the element-wise square of the entries of y. We consider the case where the a_i ∼ N(0, (1/d) I_d) are i.i.d. drawn, and we analyze the evolution of the dynamics of the power update. As explained earlier, for a given initialization x^(1), the update in the first step can be analyzed easily since A is independent of x^(1). However, in subsequent steps, the updates x^(t) are dependent on A, and it is no longer clear how to provide a tight bound on the evolution of x^(t). In this work, we provide a careful analysis by controlling the amount of correlation build-up, exploiting the structure of Gaussian matrices under linear constraints. This enables us to provide better guarantees for a matrix A with Gaussian entries compared to general matrices A.

Intermediate update steps and variables: Before we proceed, we need to break down the power update in (2) and introduce some intermediate update steps and variables as follows. Recall that x^(1) ∈ R^d denotes the initialization vector. Without loss of generality, let us analyze the convergence of the power update to the first component of the rank-k tensor T, denoted by a_1. Hence, let the first entry of x^(1), denoted by x_1^(1), be the maximum entry (in absolute value) of x^(1), i.e., |x_1^(1)| = ‖x^(1)‖_∞. Let B := [a_2 a_3 ⋯ a_k] ∈ R^{d×(k−1)}, and therefore A = [a_1 B]. We break the power update formula in (2) into a few steps by introducing the intermediate variables y^(t) ∈ R^k and x̃^(t+1) ∈ R^d as

y^(t) := A^⊤ x^(t),  x̃^(t+1) := A (y^(t))^⊙2.

Note that x̃^(t+1) is the unnormalized version of x^(t+1) := x̃^(t+1)/‖x̃^(t+1)‖, i.e., x̃^(t+1) = T(I, x^(t), x^(t)). Thus, we need to jointly analyze the dynamics of all the variables x^(t), y^(t) and (y^(t))^⊙2. Define

X^[t] := [x^(1) ⋯ x^(t)],  Y^[t] := [y^(1) ⋯ y^(t)].

Matrix B is randomly drawn with i.i.d. Gaussian entries B_ij ∼ N(0, 1/d). As the iterations proceed, we consider the following conditional distributions:

B^(t,1) := B | {X^[t], Y^[t]},  B^(t,2) := B | {X^[t+1], Y^[t]}.  (15)

Thus, B^(t,1) is the conditional distribution of B in the middle of the t-th iteration (before the update step x̃^(t+1) = A(y^(t))^⊙2), and B^(t,2) is the conditional distribution at the end of the t-th iteration (after the update step x̃^(t+1) = A(y^(t))^⊙2). By analyzing the above conditional distributions, we can characterize the remaining independent randomness in B.
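In matrix form, one iteration of (14) is only a couple of lines; the sketch below (our illustration, with a hypothetical function name) mirrors the intermediate variables y^(t) and x̃^(t+1) used in the analysis.

```python
import numpy as np

def power_update(A, x):
    """One step of (14): y = A^T x, then x_tilde = A (y ** 2), then normalize."""
    y = A.T @ x                # y^(t) = A^T x^(t): correlations of x with all k components
    x_tilde = A @ (y ** 2)     # unnormalized update: a weighted combination of the columns of A
    return x_tilde / np.linalg.norm(x_tilde)
```

This is algebraically the same update as the einsum version shown after (2) when the weights are one, but it never forms the d×d×d tensor, which is how one would run the method at larger scale.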

Conditional Distributions

In order to characterize the conditional distribution of B under the evolution of x^(t) and y^(t) in (15), we exploit the following basic fact (see Bayati and Montanari (2010) for a proof).

Lemma 7 (Conditional distribution of normal matrices under a linear condition) Consider a random matrix D with i.i.d. Gaussian entries D_ij ∼ N(0, σ²). Conditioned on u = Dv with known vectors u and v, the matrix D is distributed as

D | {u = Dv} ≡ (1/‖v‖²) · u v^⊤ + D̃ P_{v^⊥},

where the random matrix D̃ is an independent copy of D with i.i.d. Gaussian entries D̃_ij ∼ N(0, σ²), and P_{v^⊥} is the projection operator onto the subspace orthogonal to v.

We refer to D̃ P_{v^⊥} as the residual random matrix, since it represents the randomness remaining after conditioning. It is a random matrix whose rows are independent random vectors that are orthogonal to v, and the variance in each direction orthogonal to v is equal to σ².

The above lemma can be exploited to characterize the conditional distribution of B introduced in (15). However, a naive direct application using the constraint Y^[t] = A^⊤ X^[t] is not transparent for the analysis. The reason is that the evolution of x^(t) and y^(t) is itself governed by the conditional distribution of B given the previous iterations. Therefore, we need the following recursive version of Lemma 7, which can be immediately argued by induction.

Corollary 8 (Iterative conditioning) Consider a random matrix D with i.i.d. Gaussian entries D_ij ∼ N(0, σ²), and let F ≡ P_{C^⊥} D P_{R^⊥} be the random Gaussian matrix whose columns are orthogonal to the space C and whose rows are orthogonal to the space R. Conditioned on the linear constraint u = Dv, where⁹ u ∈ C^⊥, the matrix F is distributed as

F | {u = Dv} ≡ (1/‖P_{R^⊥} v‖²) · u (P_{R^⊥} v)^⊤ + P_{C^⊥} D̃ P_{span{R,v}^⊥},

where the random matrix D̃ is an independent copy of D with i.i.d. Gaussian entries D̃_ij ∼ N(0, σ²). Thus, the residual random matrix P_{C^⊥} D̃ P_{span{R,v}^⊥} is a random Gaussian matrix whose columns are orthogonal to C and whose rows are orthogonal to span{R, v}. The variance in any remaining dimension is equal to σ².

Form of Iterative Updates

Now we exploit the conditional distribution arguments proposed in the previous section to characterize the conditional distribution of B given the update variables x and y up to the current iteration; recall (15), where B^(t,1) is the conditional distribution of B in the middle of the t-th iteration and B^(t,2) at the end of the t-th iteration. Before that, we need to introduce some more intermediate variables.

9. We need that u ∈ C^⊥, otherwise the event u = Dv is impossible.

Intermediate variables: We separate the first entry of y and of (y)^⊙2 from the rest, i.e., we have

y_1^(t) = a_1^⊤ x^(t),  y_{−1}^(t) = B^⊤ x^(t) ≡ (B^(t−1,2))^⊤ x^(t),

where y_{−1}^(t) ∈ R^{k−1} denotes y^(t) ∈ R^k with the first entry removed. The update formula for x̃^(t+1) can also be decomposed as

x̃^(t+1) = (y_1^(t))² a_1 + B w^(t) ≡ (y_1^(t))² a_1 + B^(t,1) w^(t),

where

w^(t) := (y_{−1}^(t))^⊙2 ∈ R^{k−1}

is the new intermediate variable in the power iterations. Let B_res^(t,1) and B_res^(t,2) denote the residual random matrices corresponding to B^(t,1) and B^(t,2), respectively, and let

u^(t+1) := B_res^(t,1) w^(t),  v^(t) := (B_res^(t−1,2))^⊤ x^(t),

where u^(t) ∈ R^d and v^(t) ∈ R^{k−1} are respectively the parts of x̃^(t) and y_{−1}^(t) representing the residual randomness after conditioning on the previous iterations. We also summarize all the variables and notations in Table 1 in the Appendix, which can be used as a reference throughout the paper. Finally, we make the following observations.

Lemma 9 (Form of iterative updates) The conditional distribution of B in the middle of the t-th iteration, denoted by B^(t,1), satisfies

B^(t,1) ≡ Σ_{i∈[t−1]} u^(i+1) (P_{W^[i−1]⊥} w^(i))^⊤ / ‖P_{W^[i−1]⊥} w^(i)‖² + Σ_{i∈[t]} P_{X^[i−1]⊥} x^(i) (v^(i))^⊤ / ‖P_{X^[i−1]⊥} x^(i)‖² + B_res^(t,1),  (16)

B_res^(t,1) ≡ P_{X^[t]⊥} B̃ P_{W^[t−1]⊥},  (17)

where the random matrix B̃ is an independent copy of B with i.i.d. Gaussian entries B̃_ij ∼ N(0, 1/d). Similarly, the conditional distribution of B at the end of the t-th iteration, denoted by B^(t,2), satisfies

B^(t,2) ≡ Σ_{i∈[t]} ( u^(i+1) (P_{W^[i−1]⊥} w^(i))^⊤ / ‖P_{W^[i−1]⊥} w^(i)‖² + P_{X^[i−1]⊥} x^(i) (v^(i))^⊤ / ‖P_{X^[i−1]⊥} x^(i)‖² ) + B_res^(t,2),  (18)

B_res^(t,2) ≡ P_{X^[t]⊥} B̃' P_{W^[t]⊥},  (19)

where the random matrix B̃' is an independent copy of B with i.i.d. Gaussian entries B̃'_ij ∼ N(0, 1/d).

The lemma can be directly proved by applying the iterative conditioning argument in Corollary 8. See the detailed proof in the appendix.
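The conditioning machinery behind Lemmas 7-9 is easy to sanity-check numerically. The sketch below (our own check, with made-up sizes) samples from the conditional law given by Lemma 7 and verifies that the linear constraint u = Dv holds exactly; it only checks the constraint, not the full distributional claim.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 200, 1.0
v = rng.standard_normal(d)
u = rng.standard_normal(d)                        # the value we condition on: u = Dv

# Lemma 7: D | {u = Dv} has the same law as u v^T / ||v||^2 + D_tilde P_{v-perp}.
D_tilde = sigma * rng.standard_normal((d, d))     # independent copy of D
P_perp = np.eye(d) - np.outer(v, v) / (v @ v)     # projection onto the subspace orthogonal to v
D_cond = np.outer(u, v) / (v @ v) + D_tilde @ P_perp

print(np.allclose(D_cond @ v, u))                 # the conditioned matrix satisfies Dv = u: True
```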

Figure 2: Flow of the power update algorithm, showing the intermediate steps x^(t) → y^(t) → w^(t) → x^(t+1) → y^(t+1). The iteration t, for which the inductive step should be argued, is also indicated.

Analysis of Iterative Updates

Lemma 9 characterizes the conditional distribution of B given the update variables x and y up to the current iteration; see (15) for the definition of the conditional forms of B, denoted by B^(t,1) and B^(t,2). Intuitively, when the number of iterations t is small, the residual independent randomness left in B^(t,1) and B^(t,2) (respectively denoted by B_res^(t,1) and B_res^(t,2)), characterized in Lemma 9, is large enough, and we can exploit it to obtain tighter concentration bounds throughout the analysis of the iterations. Note that the goal is to show that, within a small number of iterations t, the iterates x^(t) converge to the true component with constant error, i.e., ⟨x^(t), a_1⟩ ≥ 1 − γ for some constant γ > 0. If this already holds before iteration t we are done, and if it does not hold, the next iteration is analyzed to finally achieve the goal. This analysis is done via an induction argument.

During the iterations, we maintain several invariants to analyze the dynamics of the power update. The goal is to ensure progress in each iteration as in (20).

Induction hypothesis: The following are assumed at the beginning of iteration t as the induction hypothesis; see Figure 2 for the scope of the inductive step.

1. Length of projection on x: δ_t ≤ ‖P_{X^[t−1]⊥} x^(t)‖ ≤ 1, where δ_t is of order 1/polylog(d), and the value of δ_t only depends on t and log d.

2. Length of projection on w: δ_{t−1} · √k/d ≤ ‖P_{W^[t−2]⊥} w^(t−1)‖ ≤ Δ_{t−1} · √k/d, and ‖P_{W^[t−2]⊥} w^(t−1)‖_∞ ≤ Δ_{t−1} · (1/d), where δ_t is of order 1/polylog(d) and Δ_t is of order polylog(d). Both δ_t and Δ_t only depend on t and log d.

3. Progress:¹⁰

⟨a_1, x^(t)⟩ ∈ [δ_t, Δ_t] · β^{2^{t−1}} · √k/d,  (20)

⟨a_1, P_{X^[t−1]⊥} x^(t)⟩ ≥ δ_t · β^{2^{t−1}} · √k/d.

4. Norm of u, v: the norms ‖v^(t−1)‖ and ‖u^(t)‖ are bounded below and above by δ_{t−1} and Δ_{t−1} times their typical scales, √(k/d) and √k/d respectively.

The analysis for the basis of induction and the inductive step is provided in Appendix B.

3.2 Effect of Noise in Theorem 1

Given the rank-k random tensor T in (12) and a starting point x^(1), our analysis in the noiseless setting shows that the tensor power iteration in (2) outputs a vector which will be close to a_j if x^(1) has a large enough correlation with a_j. Now suppose we are given the noisy tensor T̂ = T + E, where E has some small norm. In this case where noise is also present, we get a sequence x̂^(t) = x^(t) + ξ^(t), where x^(t) is the component not incorporating any noise (as in the previous section¹¹), while ξ^(t) represents the contribution of the noise tensor E in the power iteration; see (21) below. We prove that ξ^(t) is a very small noise term that does not change our calculations, as stated in the following lemma.

Lemma 10 (Bounding the norm of the error) Suppose the spectral norm of the error tensor E is bounded as ‖E‖ ≤ ε·√k/d, where ε < o(√k/d). Then the noise vector ξ^(t) at iteration t satisfies the ℓ_2 norm bound ‖ξ^(t)‖ ≤ Õ(β^{2^{t−1}} ε).

Note that when t is the first number such that β^{2^{t−1}} ≥ d/√k, we have ‖ξ^(t)‖ = o(1). Notice that when β^{2^{t−1}} ≥ d/√k, the main induction is already over and we know x^(t) is constant-close to the true component, and thus the noise is always small.

Proof idea: We now provide an overview of the ideas for proving the above lemma; see Appendix D for the complete proof, which is based on an induction argument.

10. Note that although the bounds on y_{−1}^(t) are argued at iteration t, the bound on the first entry of y^(t), denoted by y_1^(t) = ⟨a_1, x^(t)⟩, is assumed here in the induction hypothesis at the end of iteration t−1.
11. Note that there is a subtle difference between the notation x^(t) in the noiseless and noisy settings. In the noiseless setting, this vector is normalized, while in the noisy setting the whole vector x̂^(t) = x^(t) + ξ^(t) is normalized.

We first write the following recursion, expanding the contributions of the main signal and the noise terms in the tensor power iteration, as

x^(t+1) + ξ^(t+1) = Norm( T̂(x^(t) + ξ^(t), x^(t) + ξ^(t), I) )
                 = Norm( T(x^(t), x^(t), I) + 2T(x^(t), ξ^(t), I) + T(ξ^(t), ξ^(t), I) + E(x̂^(t), x̂^(t), I) ),  (21)

where for a vector v we have Norm(v) := v/‖v‖, i.e., it normalizes the vector. The first term is the desired main signal and should have the largest norm, and the rest of the terms are noise terms. The third term is of order ‖ξ^(t)‖², and hence it is fine whenever we choose E to be small enough. The last term is O(‖E‖) and is the same for all iterations, so that is also fine. The problematic term is the second term, whose norm, if we bound it naively, is 2‖T‖·‖ξ^(t)‖. However, the normalization factor also contributes a factor of roughly d/√k, and thus this term grows exponentially; it is still fine if we just do a constant number of iterations, but then the exponent depends on the number of iterations.

In order to solve this problem, and to make sure that the amount of noise we can tolerate is independent of the number of iterations, we need a better way to bound the noise term ξ^(t). The main problem here is that we bound the norm of T(x^(t), ξ^(t), I) by ‖T‖·‖ξ^(t)‖; by doing this we ignore the fact that x^(t) is uncorrelated with the components in T. In order to get a tighter bound, we introduce another norm; see Definition 21 for the exact form. Intuitively, the new norm captures the fact that x does not have a high correlation with the components (except for the first component, to which x converges), and gives a better bound. In particular, we obtain a bound on ‖T(x^(t), ξ^(t), I)‖ of order √k/d times ‖ξ^(t)‖, so the d/√k normalization factor is compensated by this additional √k/d term.

4. Conclusion

In this paper, we provide a novel analysis of the dynamics of the third order tensor power iteration, showing convergence guarantees to vectors having constant correlation with the tensor components. This enables us to prove unsupervised learning of latent variable models in the challenging overcomplete regime, where the hidden dimensionality is larger than the observed dimension. The main technical observation is that, under random Gaussian tensor components and a small number of iterations, the residual randomness in the components (which are involved in the iterative steps) is sufficiently large. This enables us to show progress in the next iteration of the update step. As future work, it is very interesting to generalize this analysis to higher order tensor power iterations, and more generally to other kinds of iterative updates.

Acknowledgments

A. Anandkumar is supported in part by a Microsoft Faculty Fellowship, NSF Career award CCF, NSF award CCF, ONR award N, ARO YIP award W911NF, and AFOSR YIP award FA. M. Janzamin is supported by NSF Award CCF.

Table 1: Table of parameters and variables. The superscript (t) denotes the variable at the t-th iteration.

- A ∈ R^{d×k}: mapping (factor) matrix in (14); no recursion.
- x^(t) ∈ R^d: update variable in (14); x^(t+1) := A(y^(t))^⊙2 / ‖A(y^(t))^⊙2‖.
- y^(t) ∈ R^k: intermediate variable in (14); y^(t) := A^⊤ x^(t).
- x̃^(t) ∈ R^d: unnormalized version of x^(t); x̃^(t+1) := A(y^(t))^⊙2.
- x̂^(t) ∈ R^d: noisy version of x^(t); x̂^(t) = x^(t) + ξ^(t); see (21).
- ξ^(t) ∈ R^d: contribution of the noise in the tensor power update given the noisy tensor T̂ = T + E; x̂^(t) = x^(t) + ξ^(t); see (21).
- B ∈ R^{d×(k−1)}: matrix A := [a_1 a_2 ⋯ a_k] with the first column removed, i.e., B := [a_2 a_3 ⋯ a_k]. Note that the first column a_1 is the desired one to recover.
- B^(t,1) ∈ R^{d×(k−1)}: conditional distribution of B given the previous iterations in the middle of the t-th iteration (before the update step x̃^(t+1) = A(y^(t))^⊙2); B^(t,1) ≡ B | {X^[t], Y^[t]}.
- B^(t,2) ∈ R^{d×(k−1)}: conditional distribution of B given the previous iterations at the end of the t-th iteration (after the update step x̃^(t+1) = A(y^(t))^⊙2); B^(t,2) ≡ B | {X^[t+1], Y^[t]}.
- B_res^(t,1) ∈ R^{d×(k−1)}: residual independent randomness left in B^(t,1); see Lemma 9 and equation (17).
- B_res^(t,2) ∈ R^{d×(k−1)}: residual independent randomness left in B^(t,2); see Lemma 9 and equation (19).
- w^(t) ∈ R^{k−1}: intermediate variable in the update formula (14); w^(t) := (y_{−1}^(t))^⊙2.
- u^(t) ∈ R^d: part of x̃^(t) representing the remaining independent randomness; u^(t+1) := B_res^(t,1) w^(t).
- v^(t) ∈ R^{k−1}: part of y_{−1}^(t) representing the remaining independent randomness; v^(t) := (B_res^(t−1,2))^⊤ x^(t).

Appendix A. Proof of Lemma 9

Proof of Lemma 9: Recall that we have updates of the form

x̃^(t+1) = A(y^(t))^⊙2,  w^(t) := (y_{−1}^(t))^⊙2,  y^(t) = A^⊤ x^(t).

Let X̃^{[t]\1} := [x̃^(2) ⋯ x̃^(t)], and let the rows of Y^[t] be partitioned into the first row and the rest as Y^[t] = [ Y_1^[t] ; Y_{−1}^[t] ]. We now make the following simple observations:

B^(t,1) ≡ B | {Y^[t] = A^⊤ X^[t], X̃^{[t]\1} = A(Y^[t−1])^⊙2}
       ≡ B | {Y_{−1}^[t] = B^⊤ X^[t], X̃^{[t]\1} = a_1 (Y_1^[t−1])^⊙2 + B W^[t−1]}
       ≡ B | {v^(1) = B^⊤ x^(1), ..., v^(t) = (B_res^(t−1,2))^⊤ x^(t), u^(2) = B_res^(1,1) w^(1), ..., u^(t) = B_res^(t−1,1) w^(t−1)},

where the second equivalence comes from the fact that B is the matrix A with the first column removed. Now applying Corollary 8, we have the result. The distribution of B^(t,2) follows similarly.

Appendix B. Analysis of Induction Argument

In this section, we analyze the basis of induction and the inductive step for the induction argument proposed in Section 3.1 for the proof of Lemma 5.

B.1 Basis of Induction

We first show that the hypothesis holds for the initialization vector x^(1) as the basis of induction.

Claim 1 (Basis of induction) The induction hypothesis is true for t = 1.

Proof: Notice that the induction hypothesis for t = 1 only involves the bounds on ‖x^(1)‖ and ⟨a_1, x^(1)⟩ as in Hypotheses 1 and 3, respectively. These bounds follow directly from the correlation assumption on the initial vector x^(1) stated in (13), where δ_1 = Δ_1 = 1.


More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

LECTURE NOTES ON DVORETZKY S THEOREM

LECTURE NOTES ON DVORETZKY S THEOREM LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Function Spaces. 1 Hilbert Spaces

Function Spaces. 1 Hilbert Spaces Function Spaces A function space is a set of functions F that has some structure. Often a nonparametric regression function or classifier is chosen to lie in some function space, where the assume structure

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

On conditional moments of high-dimensional random vectors given lower-dimensional projections

On conditional moments of high-dimensional random vectors given lower-dimensional projections Submitte to the Bernoulli arxiv:1405.2183v2 [math.st] 6 Sep 2016 On conitional moments of high-imensional ranom vectors given lower-imensional projections LUKAS STEINBERGER an HANNES LEEB Department of

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods Hyperbolic Moment Equations Using Quarature-Base Projection Methos J. Koellermeier an M. Torrilhon Department of Mathematics, RWTH Aachen University, Aachen, Germany Abstract. Kinetic equations like the

More information

Introduction to Markov Processes

Introduction to Markov Processes Introuction to Markov Processes Connexions moule m44014 Zzis law Gustav) Meglicki, Jr Office of the VP for Information Technology Iniana University RCS: Section-2.tex,v 1.24 2012/12/21 18:03:08 gustav

More information

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Chuang Wang, Yonina C. Elar, Fellow, IEEE an Yue M. Lu, Senior Member, IEEE Abstract We present a high-imensional analysis

More information

Lecture 6 : Dimensionality Reduction

Lecture 6 : Dimensionality Reduction CPS290: Algorithmic Founations of Data Science February 3, 207 Lecture 6 : Dimensionality Reuction Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will consier the roblem of maing

More information

Range and speed of rotor walks on trees

Range and speed of rotor walks on trees Range an spee of rotor wals on trees Wilfrie Huss an Ecaterina Sava-Huss May 15, 1 Abstract We prove a law of large numbers for the range of rotor wals with ranom initial configuration on regular trees

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES MATHEMATICS OF COMPUTATION Volume 69, Number 231, Pages 1117 1130 S 0025-5718(00)01120-0 Article electronically publishe on February 17, 2000 EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION

More information

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners Lower Bouns for Local Monotonicity Reconstruction from Transitive-Closure Spanners Arnab Bhattacharyya Elena Grigorescu Mahav Jha Kyomin Jung Sofya Raskhonikova Davi P. Wooruff Abstract Given a irecte

More information

DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS

DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS SHANKAR BHAMIDI 1, JESSE GOODMAN 2, REMCO VAN DER HOFSTAD 3, AND JÚLIA KOMJÁTHY3 Abstract. In this article, we explicitly

More information

A Review of Multiple Try MCMC algorithms for Signal Processing

A Review of Multiple Try MCMC algorithms for Signal Processing A Review of Multiple Try MCMC algorithms for Signal Processing Luca Martino Image Processing Lab., Universitat e València (Spain) Universia Carlos III e Mari, Leganes (Spain) Abstract Many applications

More information

Diagonalization of Matrices Dr. E. Jacobs

Diagonalization of Matrices Dr. E. Jacobs Diagonalization of Matrices Dr. E. Jacobs One of the very interesting lessons in this course is how certain algebraic techniques can be use to solve ifferential equations. The purpose of these notes is

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

Quantum mechanical approaches to the virial

Quantum mechanical approaches to the virial Quantum mechanical approaches to the virial S.LeBohec Department of Physics an Astronomy, University of Utah, Salt Lae City, UT 84112, USA Date: June 30 th 2015 In this note, we approach the virial from

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

A Unified Theorem on SDP Rank Reduction

A Unified Theorem on SDP Rank Reduction A Unifie heorem on SDP Ran Reuction Anthony Man Cho So, Yinyu Ye, Jiawei Zhang November 9, 006 Abstract We consier the problem of fining a low ran approximate solution to a system of linear equations in

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

Linear Algebra- Review And Beyond. Lecture 3

Linear Algebra- Review And Beyond. Lecture 3 Linear Algebra- Review An Beyon Lecture 3 This lecture gives a wie range of materials relate to matrix. Matrix is the core of linear algebra, an it s useful in many other fiels. 1 Matrix Matrix is the

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

The Subtree Size Profile of Plane-oriented Recursive Trees

The Subtree Size Profile of Plane-oriented Recursive Trees The Subtree Size Profile of Plane-oriente Recursive Trees Michael FUCHS Department of Applie Mathematics National Chiao Tung University Hsinchu, 3, Taiwan Email: mfuchs@math.nctu.eu.tw Abstract In this

More information

Problem Sheet 2: Eigenvalues and eigenvectors and their use in solving linear ODEs

Problem Sheet 2: Eigenvalues and eigenvectors and their use in solving linear ODEs Problem Sheet 2: Eigenvalues an eigenvectors an their use in solving linear ODEs If you fin any typos/errors in this problem sheet please email jk28@icacuk The material in this problem sheet is not examinable

More information

Permanent vs. Determinant

Permanent vs. Determinant Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an

More information

WUCHEN LI AND STANLEY OSHER

WUCHEN LI AND STANLEY OSHER CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability

More information

SYMMETRIC KRONECKER PRODUCTS AND SEMICLASSICAL WAVE PACKETS

SYMMETRIC KRONECKER PRODUCTS AND SEMICLASSICAL WAVE PACKETS SYMMETRIC KRONECKER PRODUCTS AND SEMICLASSICAL WAVE PACKETS GEORGE A HAGEDORN AND CAROLINE LASSER Abstract We investigate the iterate Kronecker prouct of a square matrix with itself an prove an invariance

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition

More information

Energy behaviour of the Boris method for charged-particle dynamics

Energy behaviour of the Boris method for charged-particle dynamics Version of 25 April 218 Energy behaviour of the Boris metho for charge-particle ynamics Ernst Hairer 1, Christian Lubich 2 Abstract The Boris algorithm is a wiely use numerical integrator for the motion

More information

inflow outflow Part I. Regular tasks for MAE598/494 Task 1

inflow outflow Part I. Regular tasks for MAE598/494 Task 1 MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the

More information

On a limit theorem for non-stationary branching processes.

On a limit theorem for non-stationary branching processes. On a limit theorem for non-stationary branching processes. TETSUYA HATTORI an HIROSHI WATANABE 0. Introuction. The purpose of this paper is to give a limit theorem for a certain class of iscrete-time multi-type

More information

Interconnected Systems of Fliess Operators

Interconnected Systems of Fliess Operators Interconnecte Systems of Fliess Operators W. Steven Gray Yaqin Li Department of Electrical an Computer Engineering Ol Dominion University Norfolk, Virginia 23529 USA Abstract Given two analytic nonlinear

More information

Admin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016

Admin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016 Amin Assignment 7 Assignment 8 Goals toay BACKPROPAGATION Davi Kauchak CS58 Fall 206 Neural network Neural network inputs inputs some inputs are provie/ entere Iniviual perceptrons/ neurons Neural network

More information

MEASURES WITH ZEROS IN THE INVERSE OF THEIR MOMENT MATRIX

MEASURES WITH ZEROS IN THE INVERSE OF THEIR MOMENT MATRIX MEASURES WITH ZEROS IN THE INVERSE OF THEIR MOMENT MATRIX J. WILLIAM HELTON, JEAN B. LASSERRE, AND MIHAI PUTINAR Abstract. We investigate an iscuss when the inverse of a multivariate truncate moment matrix

More information

A. Exclusive KL View of the MLE

A. Exclusive KL View of the MLE A. Exclusive KL View of the MLE Lets assume a change-of-variable moel p Z z on the ranom variable Z R m, such as the one use in Dinh et al. 2017: z 0 p 0 z 0 an z = ψz 0, where ψ is an invertible function

More information

Some Examples. Uniform motion. Poisson processes on the real line

Some Examples. Uniform motion. Poisson processes on the real line Some Examples Our immeiate goal is to see some examples of Lévy processes, an/or infinitely-ivisible laws on. Uniform motion Choose an fix a nonranom an efine X := for all (1) Then, {X } is a [nonranom]

More information

Designing Information Devices and Systems II Fall 2017 Note Theorem: Existence and Uniqueness of Solutions to Differential Equations

Designing Information Devices and Systems II Fall 2017 Note Theorem: Existence and Uniqueness of Solutions to Differential Equations EECS 6B Designing Information Devices an Systems II Fall 07 Note 3 Secon Orer Differential Equations Secon orer ifferential equations appear everywhere in the real worl. In this note, we will walk through

More information

Self-normalized Martingale Tail Inequality

Self-normalized Martingale Tail Inequality Online-to-Confience-Set Conversions an Application to Sparse Stochastic Banits A Self-normalize Martingale Tail Inequality The self-normalize martingale tail inequality that we present here is the scalar-value

More information

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential Avances in Applie Mathematics an Mechanics Av. Appl. Math. Mech. Vol. 1 No. 4 pp. 573-580 DOI: 10.4208/aamm.09-m0946 August 2009 A Note on Exact Solutions to Linear Differential Equations by the Matrix

More information

Homework 2 EM, Mixture Models, PCA, Dualitys

Homework 2 EM, Mixture Models, PCA, Dualitys Homework 2 EM, Mixture Moels, PCA, Dualitys CMU 10-715: Machine Learning (Fall 2015) http://www.cs.cmu.eu/~bapoczos/classes/ml10715_2015fall/ OUT: Oct 5, 2015 DUE: Oct 19, 2015, 10:20 AM Guielines The

More information

Chaos, Solitons and Fractals Nonlinear Science, and Nonequilibrium and Complex Phenomena

Chaos, Solitons and Fractals Nonlinear Science, and Nonequilibrium and Complex Phenomena Chaos, Solitons an Fractals (7 64 73 Contents lists available at ScienceDirect Chaos, Solitons an Fractals onlinear Science, an onequilibrium an Complex Phenomena journal homepage: www.elsevier.com/locate/chaos

More information

Sturm-Liouville Theory

Sturm-Liouville Theory LECTURE 5 Sturm-Liouville Theory In the three preceing lectures I emonstrate the utility of Fourier series in solving PDE/BVPs. As we ll now see, Fourier series are just the tip of the iceberg of the theory

More information

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device Close an Open Loop Optimal Control of Buffer an Energy of a Wireless Device V. S. Borkar School of Technology an Computer Science TIFR, umbai, Inia. borkar@tifr.res.in A. A. Kherani B. J. Prabhu INRIA

More information