arxiv: v1 [math.st] 6 Sep 2018

Size: px

Start display at page:

Download "arxiv: v1 [math.st] 6 Sep 2018"

Adam Cannon
5 years ago
Views:

1 Optimal Sparse Singular Value Decomposition for High-dimensional High-order Data arxiv: v [math.st] 6 Sep 08 Anru Zhang and Rungang Han University of Wisconsin-Madison September 7, 08 Abstract In this article, we consider the sparse tensor singular value decomposition, which aims for dimension reduction on high-dimensional high-order data with certain sparsity structure. A method named sparse tensor alternating thresholding for singular value decomposition STAT-SVD is proposed. The proposed procedure features a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with regular tensor SVD model, STAT-SVD permits more robust estimation under weaer assumptions. Both the upper and lower bounds for estimation accuracy are developed. The proposed procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset on European country mortality rates. Keywords: high-dimensional high-order data, projection and thresholding, singular value decomposition, sparsity, Tucer low-ran tensor. Anru Zhang is Assistant Professor, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, anruzhang@stat.wisc.edu; Rungang Han is PhD student, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, rhan3@wisc.edu. The research of Anru Zhang is supported in part by NS Grant DMS-8868.

2 Introduction High-dimensional high-order data, i.e., values arranged in large-scale tensors along three or more directions, commonly occur in a broad range of applications due to revolutionary developments in science and technology. These data possess distinct characteristics compared with the traditional low-dimensional or low-order data and pose unprecedented challenges to various communities, including statistics, machine learning, applied mathematics, and electrical engineering. To better summarize, visualize, and analyze high-dimensional high-order data, a sufficient dimension reduction often becomes the crucial first step. Therefore, how to effectively exploit the low-ran structure from high-dimensional high-order observations is often an important tas. To this end, the framewor of tensor SVD or tensor PCA has been introduced and extensively studied recently Allen, 0b; Richard and Montanari, 04; Anandumar et al., 06; Zhang and Xia, 08; Liu et al., 07; Wang and Song, 07. Suppose one is interested in an order-d low-ran tensor of dimension p p d, which is observed with entry-wise additive noise Z as Y = X+Z. Assume the fixed tensor X is low-ran in the sense that the fibers of X along different directions i.e., the counterpart of rows and columns for matrix all lie in low-dimensional subspaces, say U,..., U d. The goal of tensor SVD is to estimate the loadings U,..., U d and underlying low-ran tensor X. Under the regular tensor SVD setting, several practical methods have been introduced and studied, including High-order SVD HOSVD De Lathauwer et al., 000a, High-order Orthogonal Iteration HOOI De Lathauwer et al., 000b, sum-of-square scheme Hopins et al., 05, homotopy or continuation method Anandumar et al., 06. Using lower bound arguments, Richard and Montanari 04 and Zhang and Xia 08 showed that the signal-noise-ratio SNR C max{p, p, p 3 } / is required to ensure that consistent estimation is statistically possible; and SNR C max{p, p, p 3 } 3/4 may be further necessary for computationally efficient methods. However in many applications, such conditions, i.e., SNR is no less than a polynomial of dimension p, are often too restrictive to satisfy. Moreover, in many applications, the leading singular/eigenvectors of the high-dimensional high-order data may satisfy intrinsic structural assumptions along certain ways. We have seen the need for singular value decomposition in a number of modern tensor data applications, where sparsity plays an essential role. or example, in high-order longitudinal study, since observations

3 often come as multivariate functions of time, e.g. country-wise fertility and death rates by the calendar year and age Wilmoth and Sholniov, 006, the leading singular vectors along the mode of calendar year or age are expected to be smooth, and therefore becomes sparse after differential transformation; in Electroencephalogram EEG data analysis, the brain electrical activities are measured and stored as multi-way data with three or more modes representing channels, time, and patients, etc. It is often believed that some parts of brain region are more active and vary through time smoothly, then the leading singular vectors may be sparse along the channel mode and smooth along time mode Miwaeichi et al., 004; in imaging ensemble analysis, facial images are often stored as high-dimensional high-order tensors. To sufficiently reduce the dimension, one loos for low-dimensional subspaces that can best explain the possibly sparse facial features and suppress illumination effects such as shadows and highlights Vasilescu and Terzopoulos, 003. How to incorporate these structural assumptions wisely to improve the performance of subsequent statistical analyses is crucial for singular value decomposition in tensor data analysis. previous literature. Such a problem, however, has not been well studied or understood in In this article, we aim to fill this gap by developing methodology and theory for sparse tensor SVD. In addition to the regular tensor SVD model, we assume that the underlying low-ran structure X satisfies some sparsity constraints. As mentioned above, the data are not necessarily sparse along all modes in practice for example, it is not reasonable to assume sparsity for the patient mode in EEG data or subject mode in high-order longitudinal data. To allow more flexibility, we suppose there exists a subset J s {,..., d} such that part of the loadings {U : J s } contains certain row-wise sparsity structures. The detailed formulation of the sparse tensor SVD model is introduced in Section. To better illustrate the nature and difficulty of sparse tensor SVD problem, it is also helpful to discuss its order- counterpart, matrix sparse singular value decomposition, for comparison. The framewor of matrix sparse singular value decomposition, which focuses on extracting simultaneously sparse and low-ran matrix structure from high-dimensional matrix data, has been introduced and extensively studied during the past decade see Lee et al. 00; Yang et al. 04, 06 and the references therein. In addition, sparse principal component analysis, a closely connected topic, has also been considered in Zou et al. 006; Shen and Huang 008; 3

4 Johnstone and Lu 009; Cai et al. 03. In contrast, sparse tensor SVD is much more involved and difficult than sparse matrix SVD and regular tensor SVD in many aspects. irst, classical methods for matrix data are often not directly applicable to high-order data. Many previous wors approach the tensor problem by vectorizing or matricizing high-order data or intuitively speaing, stretching the data cubes into matrices or vectors so that high-order problems are transformed into vector or matrix ones. However, since high-order structures can get lost in the process of simple vectorizing or matricizing, one may only obtain sub-optimal results in the subsequent analyses. Second, some straightforward extensions from sparse matrix SVD methods, such as sparse HOSVD, sparse HOOI, or a single projection & thresholding scheme, does not perform optimally in general. Third, as pointed out by the seminal wor of Hillar and Lim 03, many basic concepts or methods for matrix data cannot be directly generalized to the high-order ones. Naive extensions of concepts such as operator norm, singular values, and eigenvalues are mathematically possible but computationally NP-hard. To overcome these difficulties, we propose a procedure named Sparse Tensor Alternating Truncation for Singular Value Decomposition STAT-SVD for sparse tensor SVD in this paper. The method consists of two steps: i a thresholded spectral initialization and ii an iterative alternating updating scheme. One crucial part of the procedure is a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Since each step of STAT-SVD only involves basic matrix and tensor operations, such as matricization, multiplication, matrix SVD, and thresholding, the proposed procedure can be implemented efficiently. We study both the theoretical and numerical properties of the proposed procedure. We prove by an upper bound argument that the STAT-SVD estimator can recover the low-ran structures accurately. A lower bound is further developed to show that the proposed estimator is rate optimal for a general class of simultaneously sparse and low-ran tensors. To the best of our nowledge, we are among the first ones to study the method and theory for sparse tensor SVD with matching upper and lower bound results. The numerical results show that the STAT- SVD outperforms other more naive methods, such as regular HOOI, HOSVD, sparse HOOI, and sparse HOSVD, by achieving significantly smaller estimation errors within much shorter running time. We also illustrate the merit of STAT-SVD in the analyses of high-order mortality 4

5 rate data. The rest of this article is organized as follows. After a brief introduction of the notations and preliminaries, we formally introduce the sparse tensor SVD model in Section. Then we propose the methodology for sparse tensor SVD in Section 3. The theoretical properties of the proposed procedure are developed in Section 4. The data-driven hyperparameter selection is discussed in Section 5. Numerical performance of the proposed methods is studied through both simulation studies and the real data analyses on high-order mortality rate dataset in Section 6. inally, further discussions, proofs of the technical results, and supporting theoretical tools are postponed to the supplementary materials. Problem ormulation. Notations and Preliminaries We start this section with notations and preliminaries that will be used throughout the paper. The lowercase letters, e.g. x, y, u, v, are used to denote scalars or vectors. or any a, b R, let a b and a b be the minimum and maximum of a and b, respectively. or convenience, we denote p = p,..., p d, r = r,..., r d, s = s,..., s d, and p = p p d, s = s s d, r = r r d. or any p, we also note p = l p l, s = l s l, r = l r l. We use C, c and the variations to denote generic constants, whose actual values may change from line to line. The uppercase letters are used to note matrices. or A R m n, we assume the singular value decomposition is A = i σ iu i v i, where σ σ 0 are the singular values in descending order. Denote σ i A = σ i as the i-th largest singular value of A. Particularly, the largest and smallest non-trivial singular values: σ max A = σ A and σ min A = σ m n A play important roles in our analysis. We also define SVD r A = [u,..., u r ] as the matrix comprised of the top r left singular vectors of A. Let the collections of regular and sparse orthogonal matrices be O p,r = {U R p r : U U = I r }, O p,r s = {U R p r : U U = I r, U 0 = p i= {U [i,:] 0} s}. These orthogonal matrices are extensively used in later narratives. In addition, the boldface capital letters, e.g. X, Y, Z, are used to represent tensors of order-3 5

6 or higher. or any S R r r d and U R p r, the mode- tensor-matrix product is defined as S U R r r p r + r d, S U i,...,i d = r j = S i,...,j,...,i d U i,j. Multiplication along different directions is commutative invariant, i.e., S U U = S U U for. or convenience, the tensor product along all d modes is applied S U d U d := S; U,..., U d. Matricization is a basic tensor operation that transforms tensors to matrices. Particularly the mode- matricization for any X R p p d is defined as M X R p p, where [M X] i,i +p i 3 + +p p d i d = X i,...,i p. The general mode- matricization can be defined similarly. The readers are referred to Kolda and Bader 009 for a more comprehensive tutorial for tensor algebra. We also use the R syntax to denote sub-vectors, -matrices, and -tensors. or example, A [I,I ] is used to note the sub-matrix with row indices I and column indices I of A. In addition, for any integers a b, we use a : b to represent consecutive sequence {a,, b}; and : alone represents the entire index set. Thus, A [:,:r] represents the first r columns of A while A [s +:p,:] represents the {s +,..., p }-th rows with indices {s +,..., p }. Similar notations are also applied to sub-vectors and sub-tensors.. Sparse Tensor SVD Model Now we formally introduce the sparse tensor SVD model. Suppose one observes a p p d - dimensional tensor Y with additive noise, say Y = X + Z. Here, X is a low-ran tensor in the sense that all fibers along each direction lie in some low-dimensional subspace, Z consists of i.i.d. Gaussian noises with mean zero and variance σ. Given the connection between Tucer low-ran and Tucer decomposition Tucer, 966; Kolda and Bader, 009, the model can be further written as Y = X + Z = S U d U d + Z = S; U,..., U d + Z, 6

igure : Illustration of sparse tensor SVD model. Here, X is sparse along Modes- and -3. with S R r r d and {U O p,r } d = being the unnown core tensor and Mode-,..., loadings, respectively.

7 igure : Illustration of sparse tensor SVD model. Here, X is sparse along Modes- and -3. with S R r r d and {U O p,r } d = being the unnown core tensor and Mode-,..., loadings, respectively. Especially, U is also the singular subspace of X along the -th direction. Recall O p,r and O p,r s are the classes of regular and sparse orthogonal columns. Given previous discussions, suppose a subset J s {,..., d} is nown a priori, such that the mode- loading U for any J s is s -sparse, O p,r s, J s ; U O p,r, / J s. To be more flexible, we set s = p if there is no sparsity constraint in the -th Mode, i.e., / J s. The central goal is to estimate singular subspaces U,..., U d, and the original low-ran tensor X based on observations Y. See igure for a pictorial illustration of sparse tensor SVD model. Remar It is worth mentioning that our framewor is related but distinct from the CP low-ran-based decomposition framewor in literature see De Silva and Lim 008; Kolda and Bader 009; Anandumar et al. 04a,b; Sun et al. 05; Sun and Li 07; Wang et al. 07; Hao et al. 08 and the references therein. Since the main focus of tensor SVD analysis is to sufficiently reduce the high-order data onto low-dimensional subspaces, the Tucer low-ran structure provides a more natural fit based on the interpretation of Tucer decomposition that we have mentioned above. Additionally, it is nown that any CP ran-r tensor must be of Tucer ran at most r, but the opposite of this statement is not generally true Kolda and Bader, 009. Therefore, we mainly pursue the more general Tucer low-ran model rather than the CP one, with distinct methodologies and theories developed for the rest of this paper. 7

8 3 The STAT-SVD Procedure Next, we propose the procedure for sparse tensor singular value decomposition. Note that U in the sparse tensor SVD model is not only the mode- singular subspace of X but also the left singular subspace of M X, a straightforward idea for initialization is Ũ 0 = SVD r M Y. This method, originally introduced by De Lathauwer et al. 000a, has been widely referred to as high-order SVD HOSVD and used in numerous scenarios. However, Ũ 0 may fail to provide any consistent estimations or even warm starts, unless one has strong SNR λ/σ Cp 3/ Zhang and Xia, 08, where λ = min =,,3 σ r M X. An alternative idea is to apply l regularized estimation to encourage sparsity Allen, 0a,b, Û, Û, Û3 = arg min U,U,U 3 Y S U U 3 U 3 + λ J s U. Both the computation and theoretical analysis for the regularized MLE may be difficult. Instead, we propose a procedure with a novel double thresholding & projection scheme as follows. Step Initialization: support Let Y = M Y be the matricization of Y for =,..., d. Recall p = p p d /p, p = p p d. or =,..., d, select the index sets for all sparse modes, { Î 0 = i : Y [i,:] σ p + p log p + log p Ideally speaing, Î0 or max Y [i,j] σ log p j }, J s. provides an initial estimate for the support of {U : J s }, which captures the significant signals of X but may miss the wea ones. Î 0 = {,..., p }. or / J s, we select Step Initialization: loadings Construct Ỹ R p Y p [i d,...,i, Ỹ [i,...,i d ] = d ], i,..., i d Î0 Î0 d ; 0, otherwise, and initialize Û 0 = SVD r M Ỹ, =,..., d. 8

9 Then Û 0 provides a rough estimate for U. The initialization step is illustrated in igure 3 a. Step 3 Power Iteration and Alternating Thresholding Provided a reasonable initialization by thresholded spectral method, we consider an iterative alternating scheme to refine the estimation. Suppose we aim to update Û t Û t+ updating scheme is discussed under two scenarios. for some specific =,..., d and t = 0,,.... The If the targeting mode, say Mode-, is non-sparse, we can update by orthogonal projection A t = M Y; Û t+, I p, Û t 3,..., Û t d R p r. 3 Ideally speaing, the dimension of Y is significantly reduced from p p d to p r, while the majority of signal X can be preserved in A t after such a projection. Then we update See igure 3 b for an illustration. Û t+ t = SVD r A. If the targeting mode, say Mode-, is sparse, we apply a double projection & thresholding scheme for refinement see igure 3 c. We still perform projection first, st Projection A t = M Y; I p, Û t,..., Û t d R p r. Then A t is denoised by row-wise hard thresholding, st Thresholding B t R p r, B t,[i,:] = At,[i,:] { A t,[i,:] η }, i p, where the thresholding level η = σ r + r log p + log p is determined by the quantile of A t,[i,:] if the i-th row of At does not contain any signal. Since A,[i,:] σ χ r σ r when A,[i,:] are pure noise, an η induced by A,[i,:] can be as large as σ r, which may falsely ill many rows with wea signals. In order to lower the large thresholding level, we introduce the second projection and thresholding: let the leading r right singular vectors of B t be t t ˆV, i.e., ˆV = SVD r B t, then we further reduce the dimension of p -by-r matrix A t to p -by-r matrix Āt by performing right projection, nd Projection Ā t = A t ˆV t R p r, 9

10 a Initilaization b Dense mode update c Sparse mode update igure : Illustration of STAT-SVD procedure. 0

11 and apply the second thresholding nd Thresholding Bt R p r, Bt,[i,:] = Ā,[i,:] { A t,[i,:] η }, with much smaller thresholding value η = σ r + r log p + log p, given the reduced dimension of Ā t. As we will illustrate in theoretical analysis, the double projection & thresholding scheme provides more accurate denoising performance. It is also noteworthy that a similar version of double thresholding appears in recent high-dimensional clustering literature Jin and Wang, 06, although their problem, method, and theory were all different from ours. inally, we update Û by QR decomposition Û t+ t = QR B. Step 4 The iteration is stopped until the maximum number of iteration is reached i.e. t > t max, or convergence, i.e. the following criterion holds, t B Bt 5 < ε tol. Here ε tol is the maximum tolerance, which can be chosen empirically. inally, we propose the denoising estimator for X as ˆX = Y Û d Û d. The pseudo-code for the proposed STAT-SVD method is provided in Algorithm. We particularly summarize the double projection & thresholding scheme for the sparse mode update in Algorithm. 4 Theoretical Analysis In this section, we analyze the theoretical properties of the proposed procedure in the previous section. The sin Θ distances are adopted to quantify the singular subspace estimation errors. Particularly for any U, V O p,r, the principal angles between U and V is defined as an r- by-r diagonal matrix: ΘU, V = diagcos σ U V,..., cos σ r U V. Then the sin Θ robenius norm sin ΘU, V = r U V can be used to characterize the distance

12 Algorithm Sparse tensor alternating thresholding - SVD STAT-SVD : Input: order-d tensor data Y R p, ran r, noise level σ, set of sparse modes J s. : Initialization Let Y = M Y. Select subset Î0 by { i p : Y [i,:] σ p + p log p + log p Î 0 = or max j Y [i,j] σ } log p, J s ; {,..., p }, / J s. 3: Set t = 0. Calculate the initializations Û 0 = SVD r M Ỹ, where Ỹ = Y i,...,i d, 4: while t < t max or convergence criterion not satisfied do 5: for =,..., d do 6: if J s then 0, otherwise. 7: Sparse mode update Update Û t+ via Algorithm ; 8: else 9: Dense mode update Calculate i,..., i d Î0 Î0 d ; A t+ = M Y; Û t+,..., Û t+, I p, Û t +,..., Û t d. 0: Update as Û t+ = SVD r : end if : end for 3: t = t +. 4: end while A t+.

13 Algorithm Update Scheme in STAT-SVD Sparse Mode : Input: Y, Û t +,..., Û t d, Û t+,..., Û t+, ran r. : irst Projection Calculate A t = M Y; Û t+,..., Û t+, I p, Û t +,..., Û t d 3: irst Thresholding Perform row-wise thresholding for A t+ : B t+ R p r, B t,[i,:] = At+,[i,:] { }, i p A t+,[i,:] η, where η = σ r + r log p + log p. 4: Second Projection Extract the leading r right singular vectors of B t then project as Āt+ ˆV t+ = SVD r = A t+ ˆV t+ R p r. 5: Second thresholding Perform thresholding on B t+ B t+, O p,r, B t+ R p r, Bt+,[i,:] = Āt+,[i,:] { }, i p A t+,[i,:] η, where η = σ r + r log p + log p. 6: Apply QR decomposition to t+ B, and assign the Q part to Û t+ O p,r.. 3

14 between U and V. The readers are referred to Cai and Zhang, 08, Lemma for more discussions on the properties of sin Θ distance. Theorem Upper bound Suppose r, σ are nown, log p log p d, and for d, λ = σ r M X C 0 σ s log p / + s r + max d r. Then after at most Ods log p iterations, the proposed Algorithm yields the following estimation error upper bound, ˆX X d Cσ sin Θ Û, U l= r l + s r + s log p. 4 J s { Cσ s r + s log p /λ } r, J s, { Cσ p r /λ } r, / J s, 5 with probability at least Cp + +p d logs log p p p d. Here C > 0 is some uniform constant, which does not depend on σ, p, r, s. Remar The estimation error upper bound 4 is comprised of three terms: σ d l= r l, σ s r, and σ s logp, which correspond to the estimation complexity for the core tensor S, the values of loading U, and the support of U only for sparse modes, respectively. On the other hand, the signal strength assumption σ r X Cσ s log p / + s r + max d r involves p only in the logarithmic term. Compared with the assumption σ r X σ max{p, p, p 3 } 3/4 that is required in regular tensor SVD, our proposed algorithm is able to handle high-dimensional settings under much weaer conditions. Remar 3 Proof setch for Theorem Since the proof of Theorem is lengthy and highly non-trivial, we briefly discuss the setch here. irst, a number of conditions are introduced as the baseline assumptions Step. Under these conditions, we try to establish the upper bound for the initial estimate: sin ΘÛ 0, U c and Û 0 M X C s log p for constants 0 < c < / and C > 0 Step, then develop the upper bound for estimates after each iteration: sin ΘÛ t+, U s r + s log p { Js} 4 λ /σ + C s log p λ /σ τ t,

15 and Û t X Cσ s r + Cσ s log p { Js} + Cσ s log p τ t. for some constant 0 < τ / Step 3. sufficiently decays. One obtain final estimators Û tmax Then after a number of iterations, the error rate and ˆX tmax that achieve the targeting upper bounds of estimation error Steps 4 and 5. inally, we use a coupling scheme to show that the introduced conditions hold with high probability Step 6. Remar 4 Performance of Single Projection & Truncation Scheme As discussed earlier, the proposed double projection & thresholding scheme is crucial to the performance of STAT-SVD. Such a scheme is essential in the sense that the simple single projection & thresholding, which may be a more straightforward extension from matrix sparse SVD method, may yield sub-optimal results. To be specific, suppose X is the estimator with single thresholding & projection see Algorithm 3 in Appendix for detailed explanations, Theorem 5 in the Appendix shows that there exists a low-ran tensor X that satisfies all assumptions in Theorem, but X yields a higher rate of convergence in the sense that E X X Cσ r l + s r + s log p + s r log p /. l J s J s Next, we study the statistical lower bound for sparse tensor SVD. Consider the following class of sparse and low-ran tensors, p,r,s J s = X = S; U,..., U d R p p d : ranx r,..., r d, suppu 0 s for J s. The following lower bound results hold for sparse tensor SVD. Theorem Lower Bound: Subspace Estimation Suppose s 3r, consider the following classes of sparse and low-ran tensors with least singular value constraint on matricization, p,r,sj s, λ = {X p,r,s J s : σ r M X λ }, =,..., d. Then, for any fixed, we have inf Û sup X p,r,sj s,λ sin Θ σ Û, U c s r λ + σ s logp /s λ I { Js} r. 5

16 Theorem 3 Lower bound: Tensor Recovery Suppose s 3r for any. Consider the tensor recovery over the class of sparse and low-ran tensors p,r,s J s, there exists uniform constant c > 0 such that inf sup ˆX X d ˆX cσ r + X p,r,sj s = d s r + s log p /s. J s Theorems,, and 3 together show the proposed STAT-SVD method achieves the optimal rate of convergence in the general class of sparse low-ran tensors when log p /s log p, for J s. = 5 Data-driven Hyperparameter Selection In practice, the proposed procedure requires the input of hyperparameters σ and r, r,..., r d. Since Y = X + Z and a significant portion of entries of X are zeros, the data-driven median estimator Yang et al., 06 can be used to estimate σ: ˆσ = Median Y / Here is the 75% quantile of standard normal distribution. We have the following theoretical guarantee for ˆσ. Proposition Concentration Inequality of ˆσ Let ˆσ = Median Y /z If s = o p log p, then there exists universal constants C and c, such that ˆσ σ log p P c C/p. 6 σ p Now we consider the data-driven selections of r,..., r d. Recall that we directly threshold on each mode and obtain r,..., r d based on the singular values of Ỹ, { ˆr = max Ỹ in the initialization step of Algorithm. r : σ r M Ỹ ˆσδ Î 0, Î0 i Here, δ ij =.0 + j + i log ep i + j log ep j + 4 log p We propose to select }. 7 and ˆσ is the estimated standard deviation. Under regularity conditions, one can show that ˆr,..., ˆr d match the true rans with high probability. 6

17 Proposition Under the conditions of Theorem and Proposition, we have ˆr = r for each =,..., d with probability at least Op + + p d /p δ for any δ > 0. If neither σ nor {r } d = are nown, we can first estimate σ by MAD estimator ˆσ, then estimate {r } d = by 7, and feed all the estimations to Algorithm. We have the following theoretical guarantee for this fully data-driven method. Theorem 4 Suppose we set σ = ˆσ and r = ˆr in Algorithm. If s = o p log p, the conclusion of Theorem holds with probability at least O p + + p d /p δ for any δ > 0. Remar 5 Similarly as some previous wors on distribution-based methods for principal component number selection Choi et al., 07, the performance of the proposed ˆσ and ˆr may rely on specific Gaussian noise assumptions. In scenarios with general noise, some more empirical schemes can be applied. or example, to estimate σ, one can trim a portion of entries from Y with largest absolute values, and evaluate the trimmed variance ˆσ as the sample variance for remaining entries Serfling, 984; for r, we can apply the cumulative percentage of total variation criterion Jolliffe, 00, Chapter 6.. a commonly used in principal component analysis literature: { r i= ˆr = arg min r : σ i M } Y p i= σ i M Y > ρ. Here, ρ 0, is an empirical thresholding level. 6 Numerical Analysis 6. Simulation Studies We evaluate the numerical performance of the proposed STAT-SVD method by simulation studies on various synthetic datasets. In each setting, we first generate an r -by-r -by-r 3 tensor S with i.i.d. standard normal entries, then rescale S as S = S λ/ min 3 σ r M S to ensure that σ r M S λ. or =,, 3, we generate singular subspaces Ū and indices subsets Ω t with cardinality s uniformly at random from O s,r and {,..., p}, respectively. The 7

18 combination of Ū and Ω Ū [j,:], i Ω, and i is the j-th element of Ω ; U [i,:] = 0, i / Ω. yields a uniformly random sparse singular subspace in O p,r s. Let X = S U U 3 U 3, Z iid N0, σ, and Y = X + Z be the underlying parameter, noise, and observation tensors. To examine the performance of STAT-SVD, we apply the proposed Algorithm along with four baseline methods, HOSVD, HOOI, sparse HOSVD S-HOSVD, and sparse HOOI S- HOOI, on the same synthetic data for comparison. Here, HOSVD and HOOI are classical methods De Lathauwer et al., 000a,b that have been widely used in literature. S-HOSVD and S-HOOI are the sparse modifications of HOSVD and HOOI intuitively, S-HOSVD and S-HOOI are performed by replacing all regular matrix SVD steps in HOSVD and HOOI by sparse matrix SVD Yang et al., 04. The detailed implementation of S-HOSVD and S-HOOI are summarized in Section B of the supplementary materials. The experiments are repeated for 00 times in each setting. In the first simulation study, we compare the estimation errors of STAT-SVD and baseline methods HOOI, S-HOOI, HOSVD, S-HOSVD in average robenius norm, l Û = 3 3 sin ΘÛ, U, l ˆX = ˆX X. = or hyperparameters, we use the median estimator ˆσ in STAT-SVD and all baseline methods; since sin Θ distance is one of the most important error quantification in tensor SVD analysis and one requires the correct r to evaluate the sin Θ distance between singular subspaces, we use the true rans r for all implementations. ix p = p = p 3 = 50, s = s = s 3 = 5, λ = 70, r = r = r = r 3, we specifically consider two scenarios: σ =, r varies from to ; r = 5, σ ranges from 0. to.5. Although the signal-to-noise ratio SNR, λ/σ here seems large, λ represents the singular value of the each matricization that measures the signal strength of the whole tensor; while σ is the standard deviation of each Z ij that quantifies the noise level of each single entry. By random matrix theory Vershynin, 00, the singular values of M Z are around σ p 5 to 5, which is comparable to the signal M X. As one can see from the numerical results in igures 3 and 4, although all methods yield smaller estimation error with smaller noise level, STAT-SVD significantly outperforms all other schemes in estimations 8

19 igure 3: Estimation error of U left panel and X right panel for different methods. Here, p = p = p 3 = 50, s = s = s 3 = 5, σ =, λ = 70, and r = r = r 3 = r vary. igure 4: Estimation error of U left panel and X right panel for different methods. Here, p = p = p 3 = 50, s = s = s 3 = 5, r = r = r 3 = 5, λ = 70, σ varies. of both subspaces and original tensors. We also consider the accuracy of hyperparameters estimation. Under the previous simulation setting, we examine the performance of MAD estimator ˆσ and ran estimator ˆr in Section 5. The results are provided in Table. One can see that the proposed ˆσ and ˆr provide reasonable estimations. The ran estimation is particularly accurate when noise level is moderate. In previous sections, the presentation and analysis were mostly focused on Gaussian noise case. Next, we consider the setting that the noises are uniformly distributed on [ σ 3, σ 3]. Let p = p = p 3 = 50, s = s = s 3 = 5, r = r = r 3 = 5, λ = 70, σ varies from 0. to.5. As we can see from the estimation error results in igure 5, STAT-SVD still achieves significantly 9

20 σ ˆσ σ ˆr r r ˆσ σ ˆr r Table : Ran and noise level estimation accuracy. Here, p = p = p 3 = 50, s = s = s 3 = 5, λ = 70. The first three rows explore the setting where r = r = r 3 = 5 with σ varies; the last three rows consider the case with fixed σ = and varying ran. The errors are averaged over 00 repetitions. igure 5: Estimation error of U left panel and X right panel with uniform distributed noise. better performance than other methods. As we have discussed before, in many applications, the tensor dataset possesses sparsity structure in only part of directions. Thus, we turn to a setting that X contains both sparse and dense modes. Specifically, we set p = p = p 3 = 50, r = r = r 3 = 0, s = s = 0, λ = 60, and s 3 = p 3, so that X is sparse along mode-, - and dense along mode-3. The singular subspace estimation error with varying noise level is evaluated and shown in igure 6. We can see STAT- SVD significantly outperforms the other methods on the estimation of sparse modes Û and Û. More interestingly, STAT-SVD also estimates the non-sparse singular subspace U 3 slightly more accurately, especially when σ is large. In fact, different modes and singular subspaces of any specific tensor dataset are a unity rather than separate objects, so more accurate estimation 0

21 igure 6: Estimation error of U left top, U right top, U 3 left bottom and X right bottom in partial sparse setting. Here, λ = 60, p = p = p 3 = 50, s 3 = p 3, s = s = 0, r = r = r 3 = 0. of dense mode U 3 is possible when one can fully utilize the sparsity in U and U. Especially for STAT-SVD, with the proposed double projection & thresholding scheme Algorithm, one gets significant better estimations on U and U than the baseline methods in each iteration so more precise projection 3 can be achieved when updating dense mode singular subspace Û t 3, and a slightly better final estimation of U 3 can be achieved. As the time cost is another critical issue in high-dimensional data analysis, we summarize the computational complexity for both initialization and each iteration of STAT-SVD and baseline methods into Table. We also compare the running time of STAT-SVD and other algorithms by simulations. As one can see from Tables and 3, HOOI, HOSVD, S-HOOI, and S-HOSVD are all slower than STAT-SVD this is because the embedded sparse matrix SVD or regular matrix SVD in baselines are computationally expensive, especially for the high-dimensional

22 STAT-SVD HOSVD HOOI S-HOSVD S-HOOI initialization per-iteration Op d + sd+ Op d+ Op d+ OT p d+ OT p d+ Op d r + s d+ 0 Op d+ 0 OT p d+ Table : Computational complexity of STAT-SVD and baseline methods for order-d sparse tensor SVD. Here, each mode share the same p i, s i, and r i. T denotes the number of iteration times. p i STAT-SVD HOSVD HOOI S-HOSVD S-HOOI Table 3: Running time unit: seconds of STAT-SVD and baseline methods. Here, λ = 70, σ =, s = s = s 3 = 0, r = r = r 3 = 5, p = p = p 3 varies. cases. In contrast, STAT-SVD is much faster, as the efficient truncation maes sure that only the significant parts of the data are extensively applied in computation, i.e., we only need to perform SVD on the s -by-s submatrix instead of the original p -by-p one. 6. Mortality Rate Data Analysis We illustrate the power of STAT-SVD through a demographic example. The mortality rate, i.e. the number of deaths divided by the total number of population, provides interesting insights to demographic information of the certain area, period, and age span. The Bereley Human Mortality Database Wilmoth and Sholniov, 006 contains a good source of morality rate data aligned by countries, ages, and years. We aim to analyze the mortality rate among 6 European countries for ages 0 to 95 from 959 to 00. The data tensor Y is of dimension with three modes representing age, year, and country, respectively. Since the mortality rate is relatively steady from teenagers to adults see, e.g., Miniño 03, it is reasonable to assume that the underlying loadings of Mode age are sparse after taing differential transformation.

23 Specifically, let D R be a secondary difference matrix: D ij = if i = j; D ij = if i j = ; D ij = 0 if i j. We pre-process the data by multiplying the secondary difference matrix along Modes age: Ỹ = Y D. We aim to apply sparse tensor SVD on Ỹ with J s = {}. To achieve more robust performance, the hyperparameters are selected via empirical methods instead of the Gaussian-noisebased procedure discussed in Section 5. r is selected via cumulative percentage of total variation criterion Jolliffe, 00, Chapter 6.., { r i= ˆr = arg min r : σ i M } X p i= σ i M X > 0.5. or the noise level, we trim 5% largest entries of Y in absolute value, then set ˆσ as the sample standard deviation of the remaining entries of Y. The estimated ran and noise level of Ỹ are ˆr, ˆr, ˆr 3 =, 5, 4 and ˆσ = In addition to STAT-SVD, we also apply S-HOOI and S-HOSVD for comparison. Suppose Ũ, ˆV, Ŵ are the resulting singular subspaces of Ỹ. Then we transform Ũ singular subspace of Mode age bac via Û = D Ũ. We first compare the first two singular vectors of Modes age and year, û, û, ˆv, and ˆv, in igures 7 and 8. We can see from û that infants aged at 0- and old people aged over 55 have a higher ris of dying, and people aged from to 50 have a relatively steady death rate. In addition, the shape of u further suggests additional factors of mortality rate in certain periods. or example, there may be more complicated patterns in mortality rate for the infants and elderly aged over 80 due to the high ris of death in these two periods. rom ˆv, we can see the mortality rate in these European countries declines significantly from the year 955 to 00, while ˆv does not give any significant pattern. Since the three methods produce similar plots on values of singular vectors, we further compare the estimation support indexes in igure 9. Since the mortality rate is steady from childhood to adults, one expects that the zero indexes would cover such an age span for û and û. The outcome of STAT-SVD matches this phenomenon. In comparison, the sparsity patterns given by S-HOSVD and S-HOOI are less interpretable. Next, we consider the mortality rate pattern for countries. In particular, we plot ŵ against ŵ for each country in igure 0, refer to the 04 European GDP per capita raning data, search for the countries whose GDP per capita is raned as top 50% among all European counties, Lin: 3

24 igure 7: Mortality rate data: û, û igure 8: Mortality rate data: ˆv, ˆv igure 9: Sparse indexes of Mode age after secondary differential transformation 4

25 igure 0: Mortality rate data: ŵ and ŵ of different countries. Red circles and blac triangles represent the countries with top 50% GDP per capita in Europe and the other countries, respectively. igure : Mortality rate means for countries in Europe. Red circles and blac triangles represent the countries with top 50% GDP per capita in Europe and the other countries, respectively. then highlight these countries with red circle marers and the other with blac triangles. We can clearly see two clusters in igure 0 that countries with more GDP per capita have smaller values of ŵ and ŵ, which implies a lower mortality rate. Also, wealthier countries are highly clustered in the graph, which indicates the common pattern of the death rate for these countries. In contrast, we also calculate the mean mortality rates of these countries and mar them with different colors in igure. By comparing igures 0 and, the mortality data clustering performance of STAT-SVD is significantly better than the one by mean estimations. 5

26 7 Proofs 7. Proof of Theorem Since the proof for this upper bound result is fairly complicated, we divide the proof into steps. Note that although we do not have refinement for non-sparse modes, to mae the proof consistent, we first extend the refinement step to non-sparse algorithm. Specifically, we also apply algorithm on Û t for J s, with η = η = 0 i.e. no truncation. This modification does not change the algorithm at all but maes the following analysis consistent for both sparse and non-sparse modes. Without loss of generality, we assume σ = throughout this proof. The idea of proving this theorem is that, we first impose a series of conditions, then prove the statement under these conditions, and finally prove these conditions hold with high probability. Step Introduction of Notations and Conditions We introduce or rephrase the following list of notations. N Matricizations, for =,..., d, X = M X, Y = M Y, Z = M Z. N Index sets for sparse mode J s : define True support I = { i : X [i,:] 0 }, Significant support { I 0 = i : } X [i,:] 6σs log p, { Selected support Î 0 = or t, we also define Ît i : Y [i,:] σ p + p log p + log p or max Y [i,j] σ log p j and Ĵ t with J s : selected support for A in t-th step selected support for Ā in t-th step Î t = Ĵ t = } { i : { i :, =,..., d. A t Āt [i,:] [i,:] } η, } η, 8 9 6

27 where η = σ r + r log p + log p, η = σ r + r log p + log p. or non-sparse mode J s, define I = I t that for any, I, I 0 = Ît = Ĵ t = [ : p ]. It is easy to see Î t, Ĵ t are all subsets of {,..., p }. We also define I = I + I d I I. Moreover, Ît, Ĵ t, It, J t are defined in the similar way. N 3 Index projections: for any I {,..., p }, we define D I = diag I R p p, if I {,..., p } is any index subset; D I to zero. N 4 Loadings: can be interpreted as the projection matrix, which set all rows with index not in I U = U + U d U U O p r, =,..., d; Û t = Û t + Û t d Û t Û t O p r, =,..., d. V = SVD r X U O r,r, i.e., leading r right singular vectors of X U ; ˆV t = SVD r DÎt Y Û t O r,r, Combining these definitions, we can rewrite A t A t Ā t = Y Û t, We can immediately see that A t Y = X + Z, A t, Bt, Āt Bt = Y Û t t ˆV, Bt, Bt i.e., leading r right singular vectors of DÎt Y Û t. = DÎt A t and Āt as = D Î t Y Û t, = DĴ t Y Û t t ˆV., Āt t, B are essentially projections of Y. Since t, B can be decomposed accordingly as A t B t Ā t B t = AX,t = B X,t = ĀX,t = B X,t + A Z,t + B Z,t + ĀZ,t + B Z,t A X,t B X,t Ā X,t B X,t = X Û t, = DÎt X Û t, AZ,t = Z Û t ; BZ,t = DÎt Z Û t = X Û t t ˆV, ĀZ,t = Z Û t t ˆV = DĴ t X Û t t ˆV, BZ,t = DĴ t Z Û t t ˆV. 0 7

28 N 5 Error Bounds: for t = 0,,..., =,..., d, E t = Û t X, t t = sin Θ Û, U. or =,..., d, t =,..., we further define K t t t = sin Θ Û ˆV, U V. We also introduce the following conditions for the proof of this theorem. A A max Z i,...,i d log p i,...,i d I I d Z [I,...,I d ] s + s log p + log p Z U d Ud r + r log p + log p. A 3 d, Z,[I,I ] s + s + log p. 3 A 4 d, N := Z [I,:]U V s + r + log p. 4 A 5 d, M := sup Z,[I,I ]W s + r + W l R s l r l, W l l s l r l + log p, where W = W + W d W W O s,r. A 6 Support consistency condition: 5 I 0 Î0, Î t I, t = 0,..., t max, Ĵ t I, t =,..., t max. 6 8

29 Step Theoretical guarantees for initialization: Û 0 In this second step, we show that under A A 7, the initialization estimator Û 0 =,,, d: sin Θ Û 0 Recall that Û 0 Ỹ0 = SVD r M DÎ0 X DÎ0 + DÎ0 Cai and Zhang 08, Z DÎ0 0 sin Θ Û, U We set C 0 80 d, recall that satisfies the following two inequalities for any 0 Û, U s log p. 7 λ X ds log p. 8 = SVD r DÎ0 Y DÎ0. Since DÎ0 Y DÎ0 =, by the unilateral perturbation bound result Proposition in σ r U D Î 0 σr U D Î 0 Y DÎ0 Y DÎ0 U D Î 0 σ r + D Î 0 Y DÎ0 Y DÎ0. 9 λ 80 ds log p, =,..., d, 0 we analyze each of the three ey terms in the right hand side of 9 as follows, a Since 6 holds, i.e. I 0 of DÎ0 Y DÎ0, then σ r U D Î 0 σ r U X σ r X 6 λ λ Î0 Y DÎ0 U X D I 0 X X D 0 I d = 8 λ 6ds log p /, we now the non-zero part of D I 0 σ r U D Î 0 X DÎ0 X DÎ0 X DÎ0 D Î 0 X D 0 I D I Z D I d D I 0 d U D Î 0 Y D 0 I is a submatrix Z DÎ0 Z DÎ0 s + / s log p + log p / D I \I 0 X s + / s log p + log p λ ds log p λ.λ =.9λ. s + log p 9

30 b Note that DÎ0 Y DÎ0 now σ r + DÎ0 Y DÎ0 = DÎ0 X DÎ0 Lemma 6 + DÎ0 Z DÎ0 3 s + s + log p 4 s log p 0.λ. c Since the left singular space of D I X D I, randî0 X DÎ0 r, we 6 σ DÎ0 Z DÎ0 σ DI Z D I is U, we have U D I X D I = O, thus, U D 6 Î 0 Y DÎ0 U D Î 0 Y D I U D I Y D I + U D Y I D I \Î0 U D I X D I + U D I Z D I + U Y,[I \Î0 U D I Z D I + Z,[I,I ] + 6 X,[I \I 0 Y,[I \Î0,I ] Z,[I,I ] + s 6s log p,i ] + s + s log p + log p + 4 s log p s + log p + 4 s log p 9 s log p. Summarizing a, b, c, and 9, we must have 7, i.e. Z,[I \I 0,I ] 0 sin Θ Û, U s log p, =,..., d, λ,i ] provided that A A 7 hold. Next we consider the bound for Û 0 X. Since Û 0 is the leading r left singular vectors of M Ỹ 0 = D 0 I and Then by Lemma 6, 0 Û D 0 I D 0 I Y D 0 I X D 0 I = D 0 I X D 0 I r D I 0 + D I 0 3 r s + s + log p 8 s log p. Y D I 0 Z D 0 I., rand I 0 Z D 0 I r Z[I,I ] X D 0 I r, 30

31 The last inequality comes from the fact that r s s, r s s and r s. Meanwhile, 0 Û D 0 I X D 0 I X X D 0 I X D 0 I = X X 0 [I,...,I0 d ] d X [I \I 0,:] = d = X,[i,:] d s 6s log p Therefore, Û 0 = i I \I 0 =4 ds log p. X which has proved 8. 0 Û D 0 I = X D 0 I + Û 0 8 s log p + 4 ds log p ds log p, X D I 0 X D I 0 Step 3 Next we consider the refinement of each iteration. To be specific, we try to study the performance of Û t Step, we have based on Û t. We still assume 6 all hold. Based on the result in E 0 ds log p and 0 s log p, λ provided that A A 7 hold. We particularly provides the following upper bound for E t Û t X and t = sin Θ Û, U E t s η + 3 r N + 3 r M K t t + Et + λ + K t + + Et d λd + Et λ + + Et λ =, 3 Et. 4 λ The detailed proof is collected in Section C.3 in the supplementary materials. Step 4 In this step, we combine the results in Step 4 and provide a upper bound for E tmax, tmax for t max = C logds log p. We first show, for all sparse and non-sparse modes, we have the following upper bounds for E t : E t 30 s r + 30 s log p + ds log p t, =,..., d, t = 0,,

32 irst, 5 holds for t = 0 due to 8. If 5 holds for t for some t, we aim to prove 5 for t. irst, based on the basic principle of Tucer ran, we have r r, r s s l r l s l r l s l s l = s. 6 Also, we shall recall that η = r + r log p + log p r + log p; η = r + r log p + log p r + log p; N l s l + r l + log p s l + log p; M l sl + d r l + sl r l + log p, l= 7 and E t 30 s r + 30 s log p + ds log p t 30 s + 30 s log p + ds log p t 60 s log p + ds log p t. 8 Thus, the following upper bound holds for K t, if we set C gap 00d. K t t E + + E t d + s η + 6r M λ / d 60 s log p + ds log p + t s η + 6r M λ / 60 d s log p + d ds log p + s r + s log p λ 6 s r + r + d = s r + log p + 00dλ C gap λ. λ 3

33 Thus, E t s η + 3 r N + r M K t + Et λ + + Et d λd K t s η + 6 r N + r M K t + Et λ 7 4 s r + 5 s log p + r M K t + Et λ + + Et d λ d + + Et d λ d. 9 With C gap > 0d + d, by Lemma 8, we have: r M K t r M d = Et + M s r η + λ λ 6r M λ, 4 s r + 4 s log p + ds log p t+. E t r M + + Et d 3 s r + 3 s log p + ds log p λ λ d t+ 3 Combining 9, 30, and 3, we have proved 30 E t 30 s r + 30 s log p + ds log p t, i.e. 5 for t and =. Similarly, one can also prove the upper bounds for E t =,..., d, which implies the claim 5 holds for t 0. for Then we further provide another upper bound for non-sparse modes. Note that we assume log p log p log p d, thus we can find some constant C, such that log p C p, =,,..., d. Now we want to show: E t C p r + ds log p t, J s 3 Again, we assume mode is non-sparse and only prove the bound for E t, the similar bounds for other non-sparse modes essentially follow. Return to 9, note that we particularly have η = 0 since it is a non-sparse mode, we have: E t 6 r N + r M K t + Et λ 7 p r + 6 r log p + r M K t + 6 C p r + r M + + Et d λ d + Et λ K t + Et λ + + Et d + + Et d λ d λ d

34 Also, we have the following bound for K t as η = 0, Besides, we could rebound M lie following: K t Et E t d + 6r M. 34 λ M + C p + r + l sl r l. 35 If we set C gap > 0d + d, then together with 5, 33, 34 and 35, we can proved the following result as similar as we proved in Lemma 8: r M K t + Et + + Et d C p r + ds log p r λ rd λ d t. Then 63 follows. Now for t max = C logds log p, for large constant C > 0, we have E tmax = Û tmax X tmax 6 t = sin Θ Û max, U C s r + s log p, J s, C p r, / J s, C s r + s log p /λ, J s, C p r /λ, / J s Step 5 In this step, we consider the estimation error for ˆX. We first have the following decomposition, Here, ˆX X = Y PÛ d PÛd X Z PÛ d PÛd + X PÛ d PÛd X. X PÛ d PÛd X X PÛ + X PÛ PÛ + + X PÛ PÛ d PÛd d PÛd d X l Ûl d = Û l X 36 l C s l r l + s l log p + pl r l ; l Js l Js l= l= 34

35 Z PÛ d PÛd Û Z U U Û d Ûd Z U Û d Ûd + Z U Û d Ûd + Z U U Û Û d Û d U U Û Z Û Û d + M sin Θ Û, U Z U U d Û d + M sin Θ Û, U + M sin Θ Û, U Z U d Ud + d l= Z U d U d d sin M l Θ Û l, U l. l= r + r log p + log p = r + log p; sin 537 M l Θ Ûl, U l C M l s l r l + s l log p + C M l p l r l λ l λ l l J s l J s C l J s s l r l + s l log p + C l J s p l r l. In summary, ˆX X C r + d l= rl s l + l J s sl log p. under the assumptions 6 that we introduced in Step. Note that log p = log p + + log p d and we have the assumption that log p log p d, thus the result is equitant to the one stated in Theorem. Step 6 inally, we prove that all imposed conditions in Step, i.e. -6, hold with probability at least O p + +p d log ds log p p. The detailed proof is collected in Section C.4 in the supplementary materials. Combing all results from Steps 6, we have finished the proof for this theorem. 35

36 Acnowledgment The authors would lie to than the Editor, Associate Editor, and anonymous referees for their constructive comments, which have greatly helped to improve the presentation of the paper. References Allen, G. 0a. Sparse higher-order principal components analysis. In AISTATS, volume 5. Allen, G. I. 0b. Regularized tensor factorizations and higher-order principal components analysis. arxiv preprint arxiv: Anandumar, A., Deng, Y., Ge, R., and Mobahi, H. 06. Homotopy analysis for tensor pca. arxiv preprint arxiv: Anandumar, A., Ge, R., Hsu, D., Kaade, S. M., and Telgarsy, M. 04a. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 5: Anandumar, A., Ge, R., and Janzamin, M. 04b. Guaranteed non-orthogonal tensor decomposition via alternating ran- updates. arxiv preprint arxiv: Birgé, L. 00. An alternative point of view on lepsi s method. Lecture Notes-Monograph Series, pages Cai, T. T., Ma, Z., and Wu, Y. 03. Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics, 46: Cai, T. T. and Zhang, A. 08. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46: Choi, Y., Taylor, J., Tibshirani, R., et al. 07. Selecting the number of principal components: estimation of the true ran of a noisy matrix. The Annals of Statistics, 456: De Lathauwer, L., De Moor, B., and Vandewalle, J. 000a. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 4:

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru