arxiv: v1 [math.st] 6 Sep 2018

Size: px
Start display at page:

Download "arxiv: v1 [math.st] 6 Sep 2018"

Transcription

1 Optimal Sparse Singular Value Decomposition for High-dimensional High-order Data arxiv: v [math.st] 6 Sep 08 Anru Zhang and Rungang Han University of Wisconsin-Madison September 7, 08 Abstract In this article, we consider the sparse tensor singular value decomposition, which aims for dimension reduction on high-dimensional high-order data with certain sparsity structure. A method named sparse tensor alternating thresholding for singular value decomposition STAT-SVD is proposed. The proposed procedure features a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with regular tensor SVD model, STAT-SVD permits more robust estimation under weaer assumptions. Both the upper and lower bounds for estimation accuracy are developed. The proposed procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset on European country mortality rates. Keywords: high-dimensional high-order data, projection and thresholding, singular value decomposition, sparsity, Tucer low-ran tensor. Anru Zhang is Assistant Professor, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, anruzhang@stat.wisc.edu; Rungang Han is PhD student, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, rhan3@wisc.edu. The research of Anru Zhang is supported in part by NS Grant DMS-8868.

2 Introduction High-dimensional high-order data, i.e., values arranged in large-scale tensors along three or more directions, commonly occur in a broad range of applications due to revolutionary developments in science and technology. These data possess distinct characteristics compared with the traditional low-dimensional or low-order data and pose unprecedented challenges to various communities, including statistics, machine learning, applied mathematics, and electrical engineering. To better summarize, visualize, and analyze high-dimensional high-order data, a sufficient dimension reduction often becomes the crucial first step. Therefore, how to effectively exploit the low-ran structure from high-dimensional high-order observations is often an important tas. To this end, the framewor of tensor SVD or tensor PCA has been introduced and extensively studied recently Allen, 0b; Richard and Montanari, 04; Anandumar et al., 06; Zhang and Xia, 08; Liu et al., 07; Wang and Song, 07. Suppose one is interested in an order-d low-ran tensor of dimension p p d, which is observed with entry-wise additive noise Z as Y = X+Z. Assume the fixed tensor X is low-ran in the sense that the fibers of X along different directions i.e., the counterpart of rows and columns for matrix all lie in low-dimensional subspaces, say U,..., U d. The goal of tensor SVD is to estimate the loadings U,..., U d and underlying low-ran tensor X. Under the regular tensor SVD setting, several practical methods have been introduced and studied, including High-order SVD HOSVD De Lathauwer et al., 000a, High-order Orthogonal Iteration HOOI De Lathauwer et al., 000b, sum-of-square scheme Hopins et al., 05, homotopy or continuation method Anandumar et al., 06. Using lower bound arguments, Richard and Montanari 04 and Zhang and Xia 08 showed that the signal-noise-ratio SNR C max{p, p, p 3 } / is required to ensure that consistent estimation is statistically possible; and SNR C max{p, p, p 3 } 3/4 may be further necessary for computationally efficient methods. However in many applications, such conditions, i.e., SNR is no less than a polynomial of dimension p, are often too restrictive to satisfy. Moreover, in many applications, the leading singular/eigenvectors of the high-dimensional high-order data may satisfy intrinsic structural assumptions along certain ways. We have seen the need for singular value decomposition in a number of modern tensor data applications, where sparsity plays an essential role. or example, in high-order longitudinal study, since observations

3 often come as multivariate functions of time, e.g. country-wise fertility and death rates by the calendar year and age Wilmoth and Sholniov, 006, the leading singular vectors along the mode of calendar year or age are expected to be smooth, and therefore becomes sparse after differential transformation; in Electroencephalogram EEG data analysis, the brain electrical activities are measured and stored as multi-way data with three or more modes representing channels, time, and patients, etc. It is often believed that some parts of brain region are more active and vary through time smoothly, then the leading singular vectors may be sparse along the channel mode and smooth along time mode Miwaeichi et al., 004; in imaging ensemble analysis, facial images are often stored as high-dimensional high-order tensors. To sufficiently reduce the dimension, one loos for low-dimensional subspaces that can best explain the possibly sparse facial features and suppress illumination effects such as shadows and highlights Vasilescu and Terzopoulos, 003. How to incorporate these structural assumptions wisely to improve the performance of subsequent statistical analyses is crucial for singular value decomposition in tensor data analysis. previous literature. Such a problem, however, has not been well studied or understood in In this article, we aim to fill this gap by developing methodology and theory for sparse tensor SVD. In addition to the regular tensor SVD model, we assume that the underlying low-ran structure X satisfies some sparsity constraints. As mentioned above, the data are not necessarily sparse along all modes in practice for example, it is not reasonable to assume sparsity for the patient mode in EEG data or subject mode in high-order longitudinal data. To allow more flexibility, we suppose there exists a subset J s {,..., d} such that part of the loadings {U : J s } contains certain row-wise sparsity structures. The detailed formulation of the sparse tensor SVD model is introduced in Section. To better illustrate the nature and difficulty of sparse tensor SVD problem, it is also helpful to discuss its order- counterpart, matrix sparse singular value decomposition, for comparison. The framewor of matrix sparse singular value decomposition, which focuses on extracting simultaneously sparse and low-ran matrix structure from high-dimensional matrix data, has been introduced and extensively studied during the past decade see Lee et al. 00; Yang et al. 04, 06 and the references therein. In addition, sparse principal component analysis, a closely connected topic, has also been considered in Zou et al. 006; Shen and Huang 008; 3

4 Johnstone and Lu 009; Cai et al. 03. In contrast, sparse tensor SVD is much more involved and difficult than sparse matrix SVD and regular tensor SVD in many aspects. irst, classical methods for matrix data are often not directly applicable to high-order data. Many previous wors approach the tensor problem by vectorizing or matricizing high-order data or intuitively speaing, stretching the data cubes into matrices or vectors so that high-order problems are transformed into vector or matrix ones. However, since high-order structures can get lost in the process of simple vectorizing or matricizing, one may only obtain sub-optimal results in the subsequent analyses. Second, some straightforward extensions from sparse matrix SVD methods, such as sparse HOSVD, sparse HOOI, or a single projection & thresholding scheme, does not perform optimally in general. Third, as pointed out by the seminal wor of Hillar and Lim 03, many basic concepts or methods for matrix data cannot be directly generalized to the high-order ones. Naive extensions of concepts such as operator norm, singular values, and eigenvalues are mathematically possible but computationally NP-hard. To overcome these difficulties, we propose a procedure named Sparse Tensor Alternating Truncation for Singular Value Decomposition STAT-SVD for sparse tensor SVD in this paper. The method consists of two steps: i a thresholded spectral initialization and ii an iterative alternating updating scheme. One crucial part of the procedure is a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Since each step of STAT-SVD only involves basic matrix and tensor operations, such as matricization, multiplication, matrix SVD, and thresholding, the proposed procedure can be implemented efficiently. We study both the theoretical and numerical properties of the proposed procedure. We prove by an upper bound argument that the STAT-SVD estimator can recover the low-ran structures accurately. A lower bound is further developed to show that the proposed estimator is rate optimal for a general class of simultaneously sparse and low-ran tensors. To the best of our nowledge, we are among the first ones to study the method and theory for sparse tensor SVD with matching upper and lower bound results. The numerical results show that the STAT- SVD outperforms other more naive methods, such as regular HOOI, HOSVD, sparse HOOI, and sparse HOSVD, by achieving significantly smaller estimation errors within much shorter running time. We also illustrate the merit of STAT-SVD in the analyses of high-order mortality 4

5 rate data. The rest of this article is organized as follows. After a brief introduction of the notations and preliminaries, we formally introduce the sparse tensor SVD model in Section. Then we propose the methodology for sparse tensor SVD in Section 3. The theoretical properties of the proposed procedure are developed in Section 4. The data-driven hyperparameter selection is discussed in Section 5. Numerical performance of the proposed methods is studied through both simulation studies and the real data analyses on high-order mortality rate dataset in Section 6. inally, further discussions, proofs of the technical results, and supporting theoretical tools are postponed to the supplementary materials. Problem ormulation. Notations and Preliminaries We start this section with notations and preliminaries that will be used throughout the paper. The lowercase letters, e.g. x, y, u, v, are used to denote scalars or vectors. or any a, b R, let a b and a b be the minimum and maximum of a and b, respectively. or convenience, we denote p = p,..., p d, r = r,..., r d, s = s,..., s d, and p = p p d, s = s s d, r = r r d. or any p, we also note p = l p l, s = l s l, r = l r l. We use C, c and the variations to denote generic constants, whose actual values may change from line to line. The uppercase letters are used to note matrices. or A R m n, we assume the singular value decomposition is A = i σ iu i v i, where σ σ 0 are the singular values in descending order. Denote σ i A = σ i as the i-th largest singular value of A. Particularly, the largest and smallest non-trivial singular values: σ max A = σ A and σ min A = σ m n A play important roles in our analysis. We also define SVD r A = [u,..., u r ] as the matrix comprised of the top r left singular vectors of A. Let the collections of regular and sparse orthogonal matrices be O p,r = {U R p r : U U = I r }, O p,r s = {U R p r : U U = I r, U 0 = p i= {U [i,:] 0} s}. These orthogonal matrices are extensively used in later narratives. In addition, the boldface capital letters, e.g. X, Y, Z, are used to represent tensors of order-3 5

6 or higher. or any S R r r d and U R p r, the mode- tensor-matrix product is defined as S U R r r p r + r d, S U i,...,i d = r j = S i,...,j,...,i d U i,j. Multiplication along different directions is commutative invariant, i.e., S U U = S U U for. or convenience, the tensor product along all d modes is applied S U d U d := S; U,..., U d. Matricization is a basic tensor operation that transforms tensors to matrices. Particularly the mode- matricization for any X R p p d is defined as M X R p p, where [M X] i,i +p i 3 + +p p d i d = X i,...,i p. The general mode- matricization can be defined similarly. The readers are referred to Kolda and Bader 009 for a more comprehensive tutorial for tensor algebra. We also use the R syntax to denote sub-vectors, -matrices, and -tensors. or example, A [I,I ] is used to note the sub-matrix with row indices I and column indices I of A. In addition, for any integers a b, we use a : b to represent consecutive sequence {a,, b}; and : alone represents the entire index set. Thus, A [:,:r] represents the first r columns of A while A [s +:p,:] represents the {s +,..., p }-th rows with indices {s +,..., p }. Similar notations are also applied to sub-vectors and sub-tensors.. Sparse Tensor SVD Model Now we formally introduce the sparse tensor SVD model. Suppose one observes a p p d - dimensional tensor Y with additive noise, say Y = X + Z. Here, X is a low-ran tensor in the sense that all fibers along each direction lie in some low-dimensional subspace, Z consists of i.i.d. Gaussian noises with mean zero and variance σ. Given the connection between Tucer low-ran and Tucer decomposition Tucer, 966; Kolda and Bader, 009, the model can be further written as Y = X + Z = S U d U d + Z = S; U,..., U d + Z, 6

7 igure : Illustration of sparse tensor SVD model. Here, X is sparse along Modes- and -3. with S R r r d and {U O p,r } d = being the unnown core tensor and Mode-,..., loadings, respectively. Especially, U is also the singular subspace of X along the -th direction. Recall O p,r and O p,r s are the classes of regular and sparse orthogonal columns. Given previous discussions, suppose a subset J s {,..., d} is nown a priori, such that the mode- loading U for any J s is s -sparse, O p,r s, J s ; U O p,r, / J s. To be more flexible, we set s = p if there is no sparsity constraint in the -th Mode, i.e., / J s. The central goal is to estimate singular subspaces U,..., U d, and the original low-ran tensor X based on observations Y. See igure for a pictorial illustration of sparse tensor SVD model. Remar It is worth mentioning that our framewor is related but distinct from the CP low-ran-based decomposition framewor in literature see De Silva and Lim 008; Kolda and Bader 009; Anandumar et al. 04a,b; Sun et al. 05; Sun and Li 07; Wang et al. 07; Hao et al. 08 and the references therein. Since the main focus of tensor SVD analysis is to sufficiently reduce the high-order data onto low-dimensional subspaces, the Tucer low-ran structure provides a more natural fit based on the interpretation of Tucer decomposition that we have mentioned above. Additionally, it is nown that any CP ran-r tensor must be of Tucer ran at most r, but the opposite of this statement is not generally true Kolda and Bader, 009. Therefore, we mainly pursue the more general Tucer low-ran model rather than the CP one, with distinct methodologies and theories developed for the rest of this paper. 7

8 3 The STAT-SVD Procedure Next, we propose the procedure for sparse tensor singular value decomposition. Note that U in the sparse tensor SVD model is not only the mode- singular subspace of X but also the left singular subspace of M X, a straightforward idea for initialization is Ũ 0 = SVD r M Y. This method, originally introduced by De Lathauwer et al. 000a, has been widely referred to as high-order SVD HOSVD and used in numerous scenarios. However, Ũ 0 may fail to provide any consistent estimations or even warm starts, unless one has strong SNR λ/σ Cp 3/ Zhang and Xia, 08, where λ = min =,,3 σ r M X. An alternative idea is to apply l regularized estimation to encourage sparsity Allen, 0a,b, Û, Û, Û3 = arg min U,U,U 3 Y S U U 3 U 3 + λ J s U. Both the computation and theoretical analysis for the regularized MLE may be difficult. Instead, we propose a procedure with a novel double thresholding & projection scheme as follows. Step Initialization: support Let Y = M Y be the matricization of Y for =,..., d. Recall p = p p d /p, p = p p d. or =,..., d, select the index sets for all sparse modes, { Î 0 = i : Y [i,:] σ p + p log p + log p Ideally speaing, Î0 or max Y [i,j] σ log p j }, J s. provides an initial estimate for the support of {U : J s }, which captures the significant signals of X but may miss the wea ones. Î 0 = {,..., p }. or / J s, we select Step Initialization: loadings Construct Ỹ R p Y p [i d,...,i, Ỹ [i,...,i d ] = d ], i,..., i d Î0 Î0 d ; 0, otherwise, and initialize Û 0 = SVD r M Ỹ, =,..., d. 8

9 Then Û 0 provides a rough estimate for U. The initialization step is illustrated in igure 3 a. Step 3 Power Iteration and Alternating Thresholding Provided a reasonable initialization by thresholded spectral method, we consider an iterative alternating scheme to refine the estimation. Suppose we aim to update Û t Û t+ updating scheme is discussed under two scenarios. for some specific =,..., d and t = 0,,.... The If the targeting mode, say Mode-, is non-sparse, we can update by orthogonal projection A t = M Y; Û t+, I p, Û t 3,..., Û t d R p r. 3 Ideally speaing, the dimension of Y is significantly reduced from p p d to p r, while the majority of signal X can be preserved in A t after such a projection. Then we update See igure 3 b for an illustration. Û t+ t = SVD r A. If the targeting mode, say Mode-, is sparse, we apply a double projection & thresholding scheme for refinement see igure 3 c. We still perform projection first, st Projection A t = M Y; I p, Û t,..., Û t d R p r. Then A t is denoised by row-wise hard thresholding, st Thresholding B t R p r, B t,[i,:] = At,[i,:] { A t,[i,:] η }, i p, where the thresholding level η = σ r + r log p + log p is determined by the quantile of A t,[i,:] if the i-th row of At does not contain any signal. Since A,[i,:] σ χ r σ r when A,[i,:] are pure noise, an η induced by A,[i,:] can be as large as σ r, which may falsely ill many rows with wea signals. In order to lower the large thresholding level, we introduce the second projection and thresholding: let the leading r right singular vectors of B t be t t ˆV, i.e., ˆV = SVD r B t, then we further reduce the dimension of p -by-r matrix A t to p -by-r matrix Āt by performing right projection, nd Projection Ā t = A t ˆV t R p r, 9

10 a Initilaization b Dense mode update c Sparse mode update igure : Illustration of STAT-SVD procedure. 0

11 and apply the second thresholding nd Thresholding Bt R p r, Bt,[i,:] = Ā,[i,:] { A t,[i,:] η }, with much smaller thresholding value η = σ r + r log p + log p, given the reduced dimension of Ā t. As we will illustrate in theoretical analysis, the double projection & thresholding scheme provides more accurate denoising performance. It is also noteworthy that a similar version of double thresholding appears in recent high-dimensional clustering literature Jin and Wang, 06, although their problem, method, and theory were all different from ours. inally, we update Û by QR decomposition Û t+ t = QR B. Step 4 The iteration is stopped until the maximum number of iteration is reached i.e. t > t max, or convergence, i.e. the following criterion holds, t B Bt 5 < ε tol. Here ε tol is the maximum tolerance, which can be chosen empirically. inally, we propose the denoising estimator for X as ˆX = Y Û d Û d. The pseudo-code for the proposed STAT-SVD method is provided in Algorithm. We particularly summarize the double projection & thresholding scheme for the sparse mode update in Algorithm. 4 Theoretical Analysis In this section, we analyze the theoretical properties of the proposed procedure in the previous section. The sin Θ distances are adopted to quantify the singular subspace estimation errors. Particularly for any U, V O p,r, the principal angles between U and V is defined as an r- by-r diagonal matrix: ΘU, V = diagcos σ U V,..., cos σ r U V. Then the sin Θ robenius norm sin ΘU, V = r U V can be used to characterize the distance

12 Algorithm Sparse tensor alternating thresholding - SVD STAT-SVD : Input: order-d tensor data Y R p, ran r, noise level σ, set of sparse modes J s. : Initialization Let Y = M Y. Select subset Î0 by { i p : Y [i,:] σ p + p log p + log p Î 0 = or max j Y [i,j] σ } log p, J s ; {,..., p }, / J s. 3: Set t = 0. Calculate the initializations Û 0 = SVD r M Ỹ, where Ỹ = Y i,...,i d, 4: while t < t max or convergence criterion not satisfied do 5: for =,..., d do 6: if J s then 0, otherwise. 7: Sparse mode update Update Û t+ via Algorithm ; 8: else 9: Dense mode update Calculate i,..., i d Î0 Î0 d ; A t+ = M Y; Û t+,..., Û t+, I p, Û t +,..., Û t d. 0: Update as Û t+ = SVD r : end if : end for 3: t = t +. 4: end while A t+.

13 Algorithm Update Scheme in STAT-SVD Sparse Mode : Input: Y, Û t +,..., Û t d, Û t+,..., Û t+, ran r. : irst Projection Calculate A t = M Y; Û t+,..., Û t+, I p, Û t +,..., Û t d 3: irst Thresholding Perform row-wise thresholding for A t+ : B t+ R p r, B t,[i,:] = At+,[i,:] { }, i p A t+,[i,:] η, where η = σ r + r log p + log p. 4: Second Projection Extract the leading r right singular vectors of B t then project as Āt+ ˆV t+ = SVD r = A t+ ˆV t+ R p r. 5: Second thresholding Perform thresholding on B t+ B t+, O p,r, B t+ R p r, Bt+,[i,:] = Āt+,[i,:] { }, i p A t+,[i,:] η, where η = σ r + r log p + log p. 6: Apply QR decomposition to t+ B, and assign the Q part to Û t+ O p,r.. 3

14 between U and V. The readers are referred to Cai and Zhang, 08, Lemma for more discussions on the properties of sin Θ distance. Theorem Upper bound Suppose r, σ are nown, log p log p d, and for d, λ = σ r M X C 0 σ s log p / + s r + max d r. Then after at most Ods log p iterations, the proposed Algorithm yields the following estimation error upper bound, ˆX X d Cσ sin Θ Û, U l= r l + s r + s log p. 4 J s { Cσ s r + s log p /λ } r, J s, { Cσ p r /λ } r, / J s, 5 with probability at least Cp + +p d logs log p p p d. Here C > 0 is some uniform constant, which does not depend on σ, p, r, s. Remar The estimation error upper bound 4 is comprised of three terms: σ d l= r l, σ s r, and σ s logp, which correspond to the estimation complexity for the core tensor S, the values of loading U, and the support of U only for sparse modes, respectively. On the other hand, the signal strength assumption σ r X Cσ s log p / + s r + max d r involves p only in the logarithmic term. Compared with the assumption σ r X σ max{p, p, p 3 } 3/4 that is required in regular tensor SVD, our proposed algorithm is able to handle high-dimensional settings under much weaer conditions. Remar 3 Proof setch for Theorem Since the proof of Theorem is lengthy and highly non-trivial, we briefly discuss the setch here. irst, a number of conditions are introduced as the baseline assumptions Step. Under these conditions, we try to establish the upper bound for the initial estimate: sin ΘÛ 0, U c and Û 0 M X C s log p for constants 0 < c < / and C > 0 Step, then develop the upper bound for estimates after each iteration: sin ΘÛ t+, U s r + s log p { Js} 4 λ /σ + C s log p λ /σ τ t,

15 and Û t X Cσ s r + Cσ s log p { Js} + Cσ s log p τ t. for some constant 0 < τ / Step 3. sufficiently decays. One obtain final estimators Û tmax Then after a number of iterations, the error rate and ˆX tmax that achieve the targeting upper bounds of estimation error Steps 4 and 5. inally, we use a coupling scheme to show that the introduced conditions hold with high probability Step 6. Remar 4 Performance of Single Projection & Truncation Scheme As discussed earlier, the proposed double projection & thresholding scheme is crucial to the performance of STAT-SVD. Such a scheme is essential in the sense that the simple single projection & thresholding, which may be a more straightforward extension from matrix sparse SVD method, may yield sub-optimal results. To be specific, suppose X is the estimator with single thresholding & projection see Algorithm 3 in Appendix for detailed explanations, Theorem 5 in the Appendix shows that there exists a low-ran tensor X that satisfies all assumptions in Theorem, but X yields a higher rate of convergence in the sense that E X X Cσ r l + s r + s log p + s r log p /. l J s J s Next, we study the statistical lower bound for sparse tensor SVD. Consider the following class of sparse and low-ran tensors, p,r,s J s = X = S; U,..., U d R p p d : ranx r,..., r d, suppu 0 s for J s. The following lower bound results hold for sparse tensor SVD. Theorem Lower Bound: Subspace Estimation Suppose s 3r, consider the following classes of sparse and low-ran tensors with least singular value constraint on matricization, p,r,sj s, λ = {X p,r,s J s : σ r M X λ }, =,..., d. Then, for any fixed, we have inf Û sup X p,r,sj s,λ sin Θ σ Û, U c s r λ + σ s logp /s λ I { Js} r. 5

16 Theorem 3 Lower bound: Tensor Recovery Suppose s 3r for any. Consider the tensor recovery over the class of sparse and low-ran tensors p,r,s J s, there exists uniform constant c > 0 such that inf sup ˆX X d ˆX cσ r + X p,r,sj s = d s r + s log p /s. J s Theorems,, and 3 together show the proposed STAT-SVD method achieves the optimal rate of convergence in the general class of sparse low-ran tensors when log p /s log p, for J s. = 5 Data-driven Hyperparameter Selection In practice, the proposed procedure requires the input of hyperparameters σ and r, r,..., r d. Since Y = X + Z and a significant portion of entries of X are zeros, the data-driven median estimator Yang et al., 06 can be used to estimate σ: ˆσ = Median Y / Here is the 75% quantile of standard normal distribution. We have the following theoretical guarantee for ˆσ. Proposition Concentration Inequality of ˆσ Let ˆσ = Median Y /z If s = o p log p, then there exists universal constants C and c, such that ˆσ σ log p P c C/p. 6 σ p Now we consider the data-driven selections of r,..., r d. Recall that we directly threshold on each mode and obtain r,..., r d based on the singular values of Ỹ, { ˆr = max Ỹ in the initialization step of Algorithm. r : σ r M Ỹ ˆσδ Î 0, Î0 i Here, δ ij =.0 + j + i log ep i + j log ep j + 4 log p We propose to select }. 7 and ˆσ is the estimated standard deviation. Under regularity conditions, one can show that ˆr,..., ˆr d match the true rans with high probability. 6

17 Proposition Under the conditions of Theorem and Proposition, we have ˆr = r for each =,..., d with probability at least Op + + p d /p δ for any δ > 0. If neither σ nor {r } d = are nown, we can first estimate σ by MAD estimator ˆσ, then estimate {r } d = by 7, and feed all the estimations to Algorithm. We have the following theoretical guarantee for this fully data-driven method. Theorem 4 Suppose we set σ = ˆσ and r = ˆr in Algorithm. If s = o p log p, the conclusion of Theorem holds with probability at least O p + + p d /p δ for any δ > 0. Remar 5 Similarly as some previous wors on distribution-based methods for principal component number selection Choi et al., 07, the performance of the proposed ˆσ and ˆr may rely on specific Gaussian noise assumptions. In scenarios with general noise, some more empirical schemes can be applied. or example, to estimate σ, one can trim a portion of entries from Y with largest absolute values, and evaluate the trimmed variance ˆσ as the sample variance for remaining entries Serfling, 984; for r, we can apply the cumulative percentage of total variation criterion Jolliffe, 00, Chapter 6.. a commonly used in principal component analysis literature: { r i= ˆr = arg min r : σ i M } Y p i= σ i M Y > ρ. Here, ρ 0, is an empirical thresholding level. 6 Numerical Analysis 6. Simulation Studies We evaluate the numerical performance of the proposed STAT-SVD method by simulation studies on various synthetic datasets. In each setting, we first generate an r -by-r -by-r 3 tensor S with i.i.d. standard normal entries, then rescale S as S = S λ/ min 3 σ r M S to ensure that σ r M S λ. or =,, 3, we generate singular subspaces Ū and indices subsets Ω t with cardinality s uniformly at random from O s,r and {,..., p}, respectively. The 7

18 combination of Ū and Ω Ū [j,:], i Ω, and i is the j-th element of Ω ; U [i,:] = 0, i / Ω. yields a uniformly random sparse singular subspace in O p,r s. Let X = S U U 3 U 3, Z iid N0, σ, and Y = X + Z be the underlying parameter, noise, and observation tensors. To examine the performance of STAT-SVD, we apply the proposed Algorithm along with four baseline methods, HOSVD, HOOI, sparse HOSVD S-HOSVD, and sparse HOOI S- HOOI, on the same synthetic data for comparison. Here, HOSVD and HOOI are classical methods De Lathauwer et al., 000a,b that have been widely used in literature. S-HOSVD and S-HOOI are the sparse modifications of HOSVD and HOOI intuitively, S-HOSVD and S-HOOI are performed by replacing all regular matrix SVD steps in HOSVD and HOOI by sparse matrix SVD Yang et al., 04. The detailed implementation of S-HOSVD and S-HOOI are summarized in Section B of the supplementary materials. The experiments are repeated for 00 times in each setting. In the first simulation study, we compare the estimation errors of STAT-SVD and baseline methods HOOI, S-HOOI, HOSVD, S-HOSVD in average robenius norm, l Û = 3 3 sin ΘÛ, U, l ˆX = ˆX X. = or hyperparameters, we use the median estimator ˆσ in STAT-SVD and all baseline methods; since sin Θ distance is one of the most important error quantification in tensor SVD analysis and one requires the correct r to evaluate the sin Θ distance between singular subspaces, we use the true rans r for all implementations. ix p = p = p 3 = 50, s = s = s 3 = 5, λ = 70, r = r = r = r 3, we specifically consider two scenarios: σ =, r varies from to ; r = 5, σ ranges from 0. to.5. Although the signal-to-noise ratio SNR, λ/σ here seems large, λ represents the singular value of the each matricization that measures the signal strength of the whole tensor; while σ is the standard deviation of each Z ij that quantifies the noise level of each single entry. By random matrix theory Vershynin, 00, the singular values of M Z are around σ p 5 to 5, which is comparable to the signal M X. As one can see from the numerical results in igures 3 and 4, although all methods yield smaller estimation error with smaller noise level, STAT-SVD significantly outperforms all other schemes in estimations 8

19 igure 3: Estimation error of U left panel and X right panel for different methods. Here, p = p = p 3 = 50, s = s = s 3 = 5, σ =, λ = 70, and r = r = r 3 = r vary. igure 4: Estimation error of U left panel and X right panel for different methods. Here, p = p = p 3 = 50, s = s = s 3 = 5, r = r = r 3 = 5, λ = 70, σ varies. of both subspaces and original tensors. We also consider the accuracy of hyperparameters estimation. Under the previous simulation setting, we examine the performance of MAD estimator ˆσ and ran estimator ˆr in Section 5. The results are provided in Table. One can see that the proposed ˆσ and ˆr provide reasonable estimations. The ran estimation is particularly accurate when noise level is moderate. In previous sections, the presentation and analysis were mostly focused on Gaussian noise case. Next, we consider the setting that the noises are uniformly distributed on [ σ 3, σ 3]. Let p = p = p 3 = 50, s = s = s 3 = 5, r = r = r 3 = 5, λ = 70, σ varies from 0. to.5. As we can see from the estimation error results in igure 5, STAT-SVD still achieves significantly 9

20 σ ˆσ σ ˆr r r ˆσ σ ˆr r Table : Ran and noise level estimation accuracy. Here, p = p = p 3 = 50, s = s = s 3 = 5, λ = 70. The first three rows explore the setting where r = r = r 3 = 5 with σ varies; the last three rows consider the case with fixed σ = and varying ran. The errors are averaged over 00 repetitions. igure 5: Estimation error of U left panel and X right panel with uniform distributed noise. better performance than other methods. As we have discussed before, in many applications, the tensor dataset possesses sparsity structure in only part of directions. Thus, we turn to a setting that X contains both sparse and dense modes. Specifically, we set p = p = p 3 = 50, r = r = r 3 = 0, s = s = 0, λ = 60, and s 3 = p 3, so that X is sparse along mode-, - and dense along mode-3. The singular subspace estimation error with varying noise level is evaluated and shown in igure 6. We can see STAT- SVD significantly outperforms the other methods on the estimation of sparse modes Û and Û. More interestingly, STAT-SVD also estimates the non-sparse singular subspace U 3 slightly more accurately, especially when σ is large. In fact, different modes and singular subspaces of any specific tensor dataset are a unity rather than separate objects, so more accurate estimation 0

21 igure 6: Estimation error of U left top, U right top, U 3 left bottom and X right bottom in partial sparse setting. Here, λ = 60, p = p = p 3 = 50, s 3 = p 3, s = s = 0, r = r = r 3 = 0. of dense mode U 3 is possible when one can fully utilize the sparsity in U and U. Especially for STAT-SVD, with the proposed double projection & thresholding scheme Algorithm, one gets significant better estimations on U and U than the baseline methods in each iteration so more precise projection 3 can be achieved when updating dense mode singular subspace Û t 3, and a slightly better final estimation of U 3 can be achieved. As the time cost is another critical issue in high-dimensional data analysis, we summarize the computational complexity for both initialization and each iteration of STAT-SVD and baseline methods into Table. We also compare the running time of STAT-SVD and other algorithms by simulations. As one can see from Tables and 3, HOOI, HOSVD, S-HOOI, and S-HOSVD are all slower than STAT-SVD this is because the embedded sparse matrix SVD or regular matrix SVD in baselines are computationally expensive, especially for the high-dimensional

22 STAT-SVD HOSVD HOOI S-HOSVD S-HOOI initialization per-iteration Op d + sd+ Op d+ Op d+ OT p d+ OT p d+ Op d r + s d+ 0 Op d+ 0 OT p d+ Table : Computational complexity of STAT-SVD and baseline methods for order-d sparse tensor SVD. Here, each mode share the same p i, s i, and r i. T denotes the number of iteration times. p i STAT-SVD HOSVD HOOI S-HOSVD S-HOOI Table 3: Running time unit: seconds of STAT-SVD and baseline methods. Here, λ = 70, σ =, s = s = s 3 = 0, r = r = r 3 = 5, p = p = p 3 varies. cases. In contrast, STAT-SVD is much faster, as the efficient truncation maes sure that only the significant parts of the data are extensively applied in computation, i.e., we only need to perform SVD on the s -by-s submatrix instead of the original p -by-p one. 6. Mortality Rate Data Analysis We illustrate the power of STAT-SVD through a demographic example. The mortality rate, i.e. the number of deaths divided by the total number of population, provides interesting insights to demographic information of the certain area, period, and age span. The Bereley Human Mortality Database Wilmoth and Sholniov, 006 contains a good source of morality rate data aligned by countries, ages, and years. We aim to analyze the mortality rate among 6 European countries for ages 0 to 95 from 959 to 00. The data tensor Y is of dimension with three modes representing age, year, and country, respectively. Since the mortality rate is relatively steady from teenagers to adults see, e.g., Miniño 03, it is reasonable to assume that the underlying loadings of Mode age are sparse after taing differential transformation.

23 Specifically, let D R be a secondary difference matrix: D ij = if i = j; D ij = if i j = ; D ij = 0 if i j. We pre-process the data by multiplying the secondary difference matrix along Modes age: Ỹ = Y D. We aim to apply sparse tensor SVD on Ỹ with J s = {}. To achieve more robust performance, the hyperparameters are selected via empirical methods instead of the Gaussian-noisebased procedure discussed in Section 5. r is selected via cumulative percentage of total variation criterion Jolliffe, 00, Chapter 6.., { r i= ˆr = arg min r : σ i M } X p i= σ i M X > 0.5. or the noise level, we trim 5% largest entries of Y in absolute value, then set ˆσ as the sample standard deviation of the remaining entries of Y. The estimated ran and noise level of Ỹ are ˆr, ˆr, ˆr 3 =, 5, 4 and ˆσ = In addition to STAT-SVD, we also apply S-HOOI and S-HOSVD for comparison. Suppose Ũ, ˆV, Ŵ are the resulting singular subspaces of Ỹ. Then we transform Ũ singular subspace of Mode age bac via Û = D Ũ. We first compare the first two singular vectors of Modes age and year, û, û, ˆv, and ˆv, in igures 7 and 8. We can see from û that infants aged at 0- and old people aged over 55 have a higher ris of dying, and people aged from to 50 have a relatively steady death rate. In addition, the shape of u further suggests additional factors of mortality rate in certain periods. or example, there may be more complicated patterns in mortality rate for the infants and elderly aged over 80 due to the high ris of death in these two periods. rom ˆv, we can see the mortality rate in these European countries declines significantly from the year 955 to 00, while ˆv does not give any significant pattern. Since the three methods produce similar plots on values of singular vectors, we further compare the estimation support indexes in igure 9. Since the mortality rate is steady from childhood to adults, one expects that the zero indexes would cover such an age span for û and û. The outcome of STAT-SVD matches this phenomenon. In comparison, the sparsity patterns given by S-HOSVD and S-HOOI are less interpretable. Next, we consider the mortality rate pattern for countries. In particular, we plot ŵ against ŵ for each country in igure 0, refer to the 04 European GDP per capita raning data, search for the countries whose GDP per capita is raned as top 50% among all European counties, Lin: 3

24 igure 7: Mortality rate data: û, û igure 8: Mortality rate data: ˆv, ˆv igure 9: Sparse indexes of Mode age after secondary differential transformation 4

25 igure 0: Mortality rate data: ŵ and ŵ of different countries. Red circles and blac triangles represent the countries with top 50% GDP per capita in Europe and the other countries, respectively. igure : Mortality rate means for countries in Europe. Red circles and blac triangles represent the countries with top 50% GDP per capita in Europe and the other countries, respectively. then highlight these countries with red circle marers and the other with blac triangles. We can clearly see two clusters in igure 0 that countries with more GDP per capita have smaller values of ŵ and ŵ, which implies a lower mortality rate. Also, wealthier countries are highly clustered in the graph, which indicates the common pattern of the death rate for these countries. In contrast, we also calculate the mean mortality rates of these countries and mar them with different colors in igure. By comparing igures 0 and, the mortality data clustering performance of STAT-SVD is significantly better than the one by mean estimations. 5

26 7 Proofs 7. Proof of Theorem Since the proof for this upper bound result is fairly complicated, we divide the proof into steps. Note that although we do not have refinement for non-sparse modes, to mae the proof consistent, we first extend the refinement step to non-sparse algorithm. Specifically, we also apply algorithm on Û t for J s, with η = η = 0 i.e. no truncation. This modification does not change the algorithm at all but maes the following analysis consistent for both sparse and non-sparse modes. Without loss of generality, we assume σ = throughout this proof. The idea of proving this theorem is that, we first impose a series of conditions, then prove the statement under these conditions, and finally prove these conditions hold with high probability. Step Introduction of Notations and Conditions We introduce or rephrase the following list of notations. N Matricizations, for =,..., d, X = M X, Y = M Y, Z = M Z. N Index sets for sparse mode J s : define True support I = { i : X [i,:] 0 }, Significant support { I 0 = i : } X [i,:] 6σs log p, { Selected support Î 0 = or t, we also define Ît i : Y [i,:] σ p + p log p + log p or max Y [i,j] σ log p j and Ĵ t with J s : selected support for A in t-th step selected support for Ā in t-th step Î t = Ĵ t = } { i : { i :, =,..., d. A t Āt [i,:] [i,:] } η, } η, 8 9 6

27 where η = σ r + r log p + log p, η = σ r + r log p + log p. or non-sparse mode J s, define I = I t that for any, I, I 0 = Ît = Ĵ t = [ : p ]. It is easy to see Î t, Ĵ t are all subsets of {,..., p }. We also define I = I + I d I I. Moreover, Ît, Ĵ t, It, J t are defined in the similar way. N 3 Index projections: for any I {,..., p }, we define D I = diag I R p p, if I {,..., p } is any index subset; D I to zero. N 4 Loadings: can be interpreted as the projection matrix, which set all rows with index not in I U = U + U d U U O p r, =,..., d; Û t = Û t + Û t d Û t Û t O p r, =,..., d. V = SVD r X U O r,r, i.e., leading r right singular vectors of X U ; ˆV t = SVD r DÎt Y Û t O r,r, Combining these definitions, we can rewrite A t A t Ā t = Y Û t, We can immediately see that A t Y = X + Z, A t, Bt, Āt Bt = Y Û t t ˆV, Bt, Bt i.e., leading r right singular vectors of DÎt Y Û t. = DÎt A t and Āt as = D Î t Y Û t, = DĴ t Y Û t t ˆV., Āt t, B are essentially projections of Y. Since t, B can be decomposed accordingly as A t B t Ā t B t = AX,t = B X,t = ĀX,t = B X,t + A Z,t + B Z,t + ĀZ,t + B Z,t A X,t B X,t Ā X,t B X,t = X Û t, = DÎt X Û t, AZ,t = Z Û t ; BZ,t = DÎt Z Û t = X Û t t ˆV, ĀZ,t = Z Û t t ˆV = DĴ t X Û t t ˆV, BZ,t = DĴ t Z Û t t ˆV. 0 7

28 N 5 Error Bounds: for t = 0,,..., =,..., d, E t = Û t X, t t = sin Θ Û, U. or =,..., d, t =,..., we further define K t t t = sin Θ Û ˆV, U V. We also introduce the following conditions for the proof of this theorem. A A max Z i,...,i d log p i,...,i d I I d Z [I,...,I d ] s + s log p + log p Z U d Ud r + r log p + log p. A 3 d, Z,[I,I ] s + s + log p. 3 A 4 d, N := Z [I,:]U V s + r + log p. 4 A 5 d, M := sup Z,[I,I ]W s + r + W l R s l r l, W l l s l r l + log p, where W = W + W d W W O s,r. A 6 Support consistency condition: 5 I 0 Î0, Î t I, t = 0,..., t max, Ĵ t I, t =,..., t max. 6 8

29 Step Theoretical guarantees for initialization: Û 0 In this second step, we show that under A A 7, the initialization estimator Û 0 =,,, d: sin Θ Û 0 Recall that Û 0 Ỹ0 = SVD r M DÎ0 X DÎ0 + DÎ0 Cai and Zhang 08, Z DÎ0 0 sin Θ Û, U We set C 0 80 d, recall that satisfies the following two inequalities for any 0 Û, U s log p. 7 λ X ds log p. 8 = SVD r DÎ0 Y DÎ0. Since DÎ0 Y DÎ0 =, by the unilateral perturbation bound result Proposition in σ r U D Î 0 σr U D Î 0 Y DÎ0 Y DÎ0 U D Î 0 σ r + D Î 0 Y DÎ0 Y DÎ0. 9 λ 80 ds log p, =,..., d, 0 we analyze each of the three ey terms in the right hand side of 9 as follows, a Since 6 holds, i.e. I 0 of DÎ0 Y DÎ0, then σ r U D Î 0 σ r U X σ r X 6 λ λ Î0 Y DÎ0 U X D I 0 X X D 0 I d = 8 λ 6ds log p /, we now the non-zero part of D I 0 σ r U D Î 0 X DÎ0 X DÎ0 X DÎ0 D Î 0 X D 0 I D I Z D I d D I 0 d U D Î 0 Y D 0 I is a submatrix Z DÎ0 Z DÎ0 s + / s log p + log p / D I \I 0 X s + / s log p + log p λ ds log p λ.λ =.9λ. s + log p 9

30 b Note that DÎ0 Y DÎ0 now σ r + DÎ0 Y DÎ0 = DÎ0 X DÎ0 Lemma 6 + DÎ0 Z DÎ0 3 s + s + log p 4 s log p 0.λ. c Since the left singular space of D I X D I, randî0 X DÎ0 r, we 6 σ DÎ0 Z DÎ0 σ DI Z D I is U, we have U D I X D I = O, thus, U D 6 Î 0 Y DÎ0 U D Î 0 Y D I U D I Y D I + U D Y I D I \Î0 U D I X D I + U D I Z D I + U Y,[I \Î0 U D I Z D I + Z,[I,I ] + 6 X,[I \I 0 Y,[I \Î0,I ] Z,[I,I ] + s 6s log p,i ] + s + s log p + log p + 4 s log p s + log p + 4 s log p 9 s log p. Summarizing a, b, c, and 9, we must have 7, i.e. Z,[I \I 0,I ] 0 sin Θ Û, U s log p, =,..., d, λ,i ] provided that A A 7 hold. Next we consider the bound for Û 0 X. Since Û 0 is the leading r left singular vectors of M Ỹ 0 = D 0 I and Then by Lemma 6, 0 Û D 0 I D 0 I Y D 0 I X D 0 I = D 0 I X D 0 I r D I 0 + D I 0 3 r s + s + log p 8 s log p. Y D I 0 Z D 0 I., rand I 0 Z D 0 I r Z[I,I ] X D 0 I r, 30

31 The last inequality comes from the fact that r s s, r s s and r s. Meanwhile, 0 Û D 0 I X D 0 I X X D 0 I X D 0 I = X X 0 [I,...,I0 d ] d X [I \I 0,:] = d = X,[i,:] d s 6s log p Therefore, Û 0 = i I \I 0 =4 ds log p. X which has proved 8. 0 Û D 0 I = X D 0 I + Û 0 8 s log p + 4 ds log p ds log p, X D I 0 X D I 0 Step 3 Next we consider the refinement of each iteration. To be specific, we try to study the performance of Û t Step, we have based on Û t. We still assume 6 all hold. Based on the result in E 0 ds log p and 0 s log p, λ provided that A A 7 hold. We particularly provides the following upper bound for E t Û t X and t = sin Θ Û, U E t s η + 3 r N + 3 r M K t t + Et + λ + K t + + Et d λd + Et λ + + Et λ =, 3 Et. 4 λ The detailed proof is collected in Section C.3 in the supplementary materials. Step 4 In this step, we combine the results in Step 4 and provide a upper bound for E tmax, tmax for t max = C logds log p. We first show, for all sparse and non-sparse modes, we have the following upper bounds for E t : E t 30 s r + 30 s log p + ds log p t, =,..., d, t = 0,,

32 irst, 5 holds for t = 0 due to 8. If 5 holds for t for some t, we aim to prove 5 for t. irst, based on the basic principle of Tucer ran, we have r r, r s s l r l s l r l s l s l = s. 6 Also, we shall recall that η = r + r log p + log p r + log p; η = r + r log p + log p r + log p; N l s l + r l + log p s l + log p; M l sl + d r l + sl r l + log p, l= 7 and E t 30 s r + 30 s log p + ds log p t 30 s + 30 s log p + ds log p t 60 s log p + ds log p t. 8 Thus, the following upper bound holds for K t, if we set C gap 00d. K t t E + + E t d + s η + 6r M λ / d 60 s log p + ds log p + t s η + 6r M λ / 60 d s log p + d ds log p + s r + s log p λ 6 s r + r + d = s r + log p + 00dλ C gap λ. λ 3

33 Thus, E t s η + 3 r N + r M K t + Et λ + + Et d λd K t s η + 6 r N + r M K t + Et λ 7 4 s r + 5 s log p + r M K t + Et λ + + Et d λ d + + Et d λ d. 9 With C gap > 0d + d, by Lemma 8, we have: r M K t r M d = Et + M s r η + λ λ 6r M λ, 4 s r + 4 s log p + ds log p t+. E t r M + + Et d 3 s r + 3 s log p + ds log p λ λ d t+ 3 Combining 9, 30, and 3, we have proved 30 E t 30 s r + 30 s log p + ds log p t, i.e. 5 for t and =. Similarly, one can also prove the upper bounds for E t =,..., d, which implies the claim 5 holds for t 0. for Then we further provide another upper bound for non-sparse modes. Note that we assume log p log p log p d, thus we can find some constant C, such that log p C p, =,,..., d. Now we want to show: E t C p r + ds log p t, J s 3 Again, we assume mode is non-sparse and only prove the bound for E t, the similar bounds for other non-sparse modes essentially follow. Return to 9, note that we particularly have η = 0 since it is a non-sparse mode, we have: E t 6 r N + r M K t + Et λ 7 p r + 6 r log p + r M K t + 6 C p r + r M + + Et d λ d + Et λ K t + Et λ + + Et d + + Et d λ d λ d

34 Also, we have the following bound for K t as η = 0, Besides, we could rebound M lie following: K t Et E t d + 6r M. 34 λ M + C p + r + l sl r l. 35 If we set C gap > 0d + d, then together with 5, 33, 34 and 35, we can proved the following result as similar as we proved in Lemma 8: r M K t + Et + + Et d C p r + ds log p r λ rd λ d t. Then 63 follows. Now for t max = C logds log p, for large constant C > 0, we have E tmax = Û tmax X tmax 6 t = sin Θ Û max, U C s r + s log p, J s, C p r, / J s, C s r + s log p /λ, J s, C p r /λ, / J s Step 5 In this step, we consider the estimation error for ˆX. We first have the following decomposition, Here, ˆX X = Y PÛ d PÛd X Z PÛ d PÛd + X PÛ d PÛd X. X PÛ d PÛd X X PÛ + X PÛ PÛ + + X PÛ PÛ d PÛd d PÛd d X l Ûl d = Û l X 36 l C s l r l + s l log p + pl r l ; l Js l Js l= l= 34

35 Z PÛ d PÛd Û Z U U Û d Ûd Z U Û d Ûd + Z U Û d Ûd + Z U U Û Û d Û d U U Û Z Û Û d + M sin Θ Û, U Z U U d Û d + M sin Θ Û, U + M sin Θ Û, U Z U d Ud + d l= Z U d U d d sin M l Θ Û l, U l. l= r + r log p + log p = r + log p; sin 537 M l Θ Ûl, U l C M l s l r l + s l log p + C M l p l r l λ l λ l l J s l J s C l J s s l r l + s l log p + C l J s p l r l. In summary, ˆX X C r + d l= rl s l + l J s sl log p. under the assumptions 6 that we introduced in Step. Note that log p = log p + + log p d and we have the assumption that log p log p d, thus the result is equitant to the one stated in Theorem. Step 6 inally, we prove that all imposed conditions in Step, i.e. -6, hold with probability at least O p + +p d log ds log p p. The detailed proof is collected in Section C.4 in the supplementary materials. Combing all results from Steps 6, we have finished the proof for this theorem. 35

36 Acnowledgment The authors would lie to than the Editor, Associate Editor, and anonymous referees for their constructive comments, which have greatly helped to improve the presentation of the paper. References Allen, G. 0a. Sparse higher-order principal components analysis. In AISTATS, volume 5. Allen, G. I. 0b. Regularized tensor factorizations and higher-order principal components analysis. arxiv preprint arxiv: Anandumar, A., Deng, Y., Ge, R., and Mobahi, H. 06. Homotopy analysis for tensor pca. arxiv preprint arxiv: Anandumar, A., Ge, R., Hsu, D., Kaade, S. M., and Telgarsy, M. 04a. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 5: Anandumar, A., Ge, R., and Janzamin, M. 04b. Guaranteed non-orthogonal tensor decomposition via alternating ran- updates. arxiv preprint arxiv: Birgé, L. 00. An alternative point of view on lepsi s method. Lecture Notes-Monograph Series, pages Cai, T. T., Ma, Z., and Wu, Y. 03. Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics, 46: Cai, T. T. and Zhang, A. 08. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46: Choi, Y., Taylor, J., Tibshirani, R., et al. 07. Selecting the number of principal components: estimation of the true ran of a noisy matrix. The Annals of Statistics, 456: De Lathauwer, L., De Moor, B., and Vandewalle, J. 000a. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 4:

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very

More information

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data

Supplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data Supplement to A Generalized Least Squares Matrix Decomposition Genevera I. Allen 1, Logan Grosenic 2, & Jonathan Taylor 3 1 Department of Statistics and Electrical and Computer Engineering, Rice University

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

Kronecker Product Approximation with Multiple Factor Matrices via the Tensor Product Algorithm

Kronecker Product Approximation with Multiple Factor Matrices via the Tensor Product Algorithm Kronecker Product Approximation with Multiple actor Matrices via the Tensor Product Algorithm King Keung Wu, Yeung Yam, Helen Meng and Mehran Mesbahi Department of Mechanical and Automation Engineering,

More information

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto

More information

Orthogonal tensor decomposition

Orthogonal tensor decomposition Orthogonal tensor decomposition Daniel Hsu Columbia University Largely based on 2012 arxiv report Tensor decompositions for learning latent variable models, with Anandkumar, Ge, Kakade, and Telgarsky.

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

19.1 Problem setup: Sparse linear regression

19.1 Problem setup: Sparse linear regression ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

ON THE QR ITERATIONS OF REAL MATRICES

ON THE QR ITERATIONS OF REAL MATRICES Unspecified Journal Volume, Number, Pages S????-????(XX- ON THE QR ITERATIONS OF REAL MATRICES HUAJUN HUANG AND TIN-YAU TAM Abstract. We answer a question of D. Serre on the QR iterations of a real matrix

More information

A Multi-Affine Model for Tensor Decomposition

A Multi-Affine Model for Tensor Decomposition Yiqing Yang UW Madison breakds@cs.wisc.edu A Multi-Affine Model for Tensor Decomposition Hongrui Jiang UW Madison hongrui@engr.wisc.edu Li Zhang UW Madison lizhang@cs.wisc.edu Chris J. Murphy UC Davis

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Tikhonov Regularization of Large Symmetric Problems

Tikhonov Regularization of Large Symmetric Problems NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Optimal spectral shrinkage and PCA with heteroscedastic noise

Optimal spectral shrinkage and PCA with heteroscedastic noise Optimal spectral shrinage and PCA with heteroscedastic noise William Leeb and Elad Romanov Abstract This paper studies the related problems of denoising, covariance estimation, and principal component

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

arxiv: v1 [math.na] 5 May 2011

arxiv: v1 [math.na] 5 May 2011 ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Least Squares Approximation

Least Squares Approximation Chapter 6 Least Squares Approximation As we saw in Chapter 5 we can interpret radial basis function interpolation as a constrained optimization problem. We now take this point of view again, but start

More information

/16/$ IEEE 1728

/16/$ IEEE 1728 Extension of the Semi-Algebraic Framework for Approximate CP Decompositions via Simultaneous Matrix Diagonalization to the Efficient Calculation of Coupled CP Decompositions Kristina Naskovska and Martin

More information

arxiv: v1 [cs.it] 21 Feb 2013

arxiv: v1 [cs.it] 21 Feb 2013 q-ary Compressive Sensing arxiv:30.568v [cs.it] Feb 03 Youssef Mroueh,, Lorenzo Rosasco, CBCL, CSAIL, Massachusetts Institute of Technology LCSL, Istituto Italiano di Tecnologia and IIT@MIT lab, Istituto

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview

More information

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia Network Feature s Decompositions for Machine Learning 1 1 Department of Computer Science University of British Columbia UBC Machine Learning Group, June 15 2016 1/30 Contact information Network Feature

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

Zhaoxing Gao and Ruey S Tsay Booth School of Business, University of Chicago. August 23, 2018

Zhaoxing Gao and Ruey S Tsay Booth School of Business, University of Chicago. August 23, 2018 Supplementary Material for Structural-Factor Modeling of High-Dimensional Time Series: Another Look at Approximate Factor Models with Diverging Eigenvalues Zhaoxing Gao and Ruey S Tsay Booth School of

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

High Dimensional Covariance and Precision Matrix Estimation

High Dimensional Covariance and Precision Matrix Estimation High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance

More information

L 2,1 Norm and its Applications

L 2,1 Norm and its Applications L 2, Norm and its Applications Yale Chang Introduction According to the structure of the constraints, the sparsity can be obtained from three types of regularizers for different purposes.. Flat Sparsity.

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

On the convergence of higher-order orthogonality iteration and its extension

On the convergence of higher-order orthogonality iteration and its extension On the convergence of higher-order orthogonality iteration and its extension Yangyang Xu IMA, University of Minnesota SIAM Conference LA15, Atlanta October 27, 2015 Best low-multilinear-rank approximation

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

An algebraic perspective on integer sparse recovery

An algebraic perspective on integer sparse recovery An algebraic perspective on integer sparse recovery Lenny Fukshansky Claremont McKenna College (joint work with Deanna Needell and Benny Sudakov) Combinatorics Seminar USC October 31, 2018 From Wikipedia:

More information

Non-convex Robust PCA: Provable Bounds

Non-convex Robust PCA: Provable Bounds Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

Introduction to Compressed Sensing

Introduction to Compressed Sensing Introduction to Compressed Sensing Alejandro Parada, Gonzalo Arce University of Delaware August 25, 2016 Motivation: Classical Sampling 1 Motivation: Classical Sampling Issues Some applications Radar Spectral

More information

Multilinear Singular Value Decomposition for Two Qubits

Multilinear Singular Value Decomposition for Two Qubits Malaysian Journal of Mathematical Sciences 10(S) August: 69 83 (2016) Special Issue: The 7 th International Conference on Research and Education in Mathematics (ICREM7) MALAYSIAN JOURNAL OF MATHEMATICAL

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Journal of Multivariate Analysis. Consistency of sparse PCA in High Dimension, Low Sample Size contexts

Journal of Multivariate Analysis. Consistency of sparse PCA in High Dimension, Low Sample Size contexts Journal of Multivariate Analysis 5 (03) 37 333 Contents lists available at SciVerse ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva Consistency of sparse PCA

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Institute for Computational Mathematics Hong Kong Baptist University

Institute for Computational Mathematics Hong Kong Baptist University Institute for Computational Mathematics Hong Kong Baptist University ICM Research Report 08-0 How to find a good submatrix S. A. Goreinov, I. V. Oseledets, D. V. Savostyanov, E. E. Tyrtyshnikov, N. L.

More information

Fundamentals of Multilinear Subspace Learning

Fundamentals of Multilinear Subspace Learning Chapter 3 Fundamentals of Multilinear Subspace Learning The previous chapter covered background materials on linear subspace learning. From this chapter on, we shall proceed to multiple dimensions with

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

2.3. Clustering or vector quantization 57

2.3. Clustering or vector quantization 57 Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :

More information

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be

Introduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be Wavelet Estimation For Samples With Random Uniform Design T. Tony Cai Department of Statistics, Purdue University Lawrence D. Brown Department of Statistics, University of Pennsylvania Abstract We show

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Truncation Strategy of Tensor Compressive Sensing for Noisy Video Sequences

Truncation Strategy of Tensor Compressive Sensing for Noisy Video Sequences Journal of Information Hiding and Multimedia Signal Processing c 2016 ISSN 207-4212 Ubiquitous International Volume 7, Number 5, September 2016 Truncation Strategy of Tensor Compressive Sensing for Noisy

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

CVPR A New Tensor Algebra - Tutorial. July 26, 2017

CVPR A New Tensor Algebra - Tutorial. July 26, 2017 CVPR 2017 A New Tensor Algebra - Tutorial Lior Horesh lhoresh@us.ibm.com Misha Kilmer misha.kilmer@tufts.edu July 26, 2017 Outline Motivation Background and notation New t-product and associated algebraic

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

Functional SVD for Big Data

Functional SVD for Big Data Functional SVD for Big Data Pan Chao April 23, 2014 Pan Chao Functional SVD for Big Data April 23, 2014 1 / 24 Outline 1 One-Way Functional SVD a) Interpretation b) Robustness c) CV/GCV 2 Two-Way Problem

More information

SVD, PCA & Preprocessing

SVD, PCA & Preprocessing Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR

THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR WEN LI AND MICHAEL K. NG Abstract. In this paper, we study the perturbation bound for the spectral radius of an m th - order n-dimensional

More information

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation Vu Malbasa and Slobodan Vucetic Abstract Resource-constrained data mining introduces many constraints when learning from

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

Lecture Notes 10: Matrix Factorization

Lecture Notes 10: Matrix Factorization Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

An Effective Tensor Completion Method Based on Multi-linear Tensor Ring Decomposition

An Effective Tensor Completion Method Based on Multi-linear Tensor Ring Decomposition An Effective Tensor Completion Method Based on Multi-linear Tensor Ring Decomposition Jinshi Yu, Guoxu Zhou, Qibin Zhao and Kan Xie School of Automation, Guangdong University of Technology, Guangzhou,

More information

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Ioannis Gkioulekas arvard SEAS Cambridge, MA 038 igkiou@seas.harvard.edu Todd Zickler arvard SEAS Cambridge, MA 038 zickler@seas.harvard.edu

More information

Machine Learning for Disease Progression

Machine Learning for Disease Progression Machine Learning for Disease Progression Yong Deng Department of Materials Science & Engineering yongdeng@stanford.edu Xuxin Huang Department of Applied Physics xxhuang@stanford.edu Guanyang Wang Department

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Iterative Matching Pursuit and its Applications in Adaptive Time-Frequency Analysis

Iterative Matching Pursuit and its Applications in Adaptive Time-Frequency Analysis Iterative Matching Pursuit and its Applications in Adaptive Time-Frequency Analysis Zuoqiang Shi Mathematical Sciences Center, Tsinghua University Joint wor with Prof. Thomas Y. Hou and Sparsity, Jan 9,

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

Sparse orthogonal factor analysis

Sparse orthogonal factor analysis Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

SYMMETRIC MATRIX PERTURBATION FOR DIFFERENTIALLY-PRIVATE PRINCIPAL COMPONENT ANALYSIS. Hafiz Imtiaz and Anand D. Sarwate

SYMMETRIC MATRIX PERTURBATION FOR DIFFERENTIALLY-PRIVATE PRINCIPAL COMPONENT ANALYSIS. Hafiz Imtiaz and Anand D. Sarwate SYMMETRIC MATRIX PERTURBATION FOR DIFFERENTIALLY-PRIVATE PRINCIPAL COMPONENT ANALYSIS Hafiz Imtiaz and Anand D. Sarwate Rutgers, The State University of New Jersey ABSTRACT Differential privacy is a strong,

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Solving Corrupted Quadratic Equations, Provably

Solving Corrupted Quadratic Equations, Provably Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin

More information

Machine Learning (Spring 2012) Principal Component Analysis

Machine Learning (Spring 2012) Principal Component Analysis 1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in

More information

Deep Learning Book Notes Chapter 2: Linear Algebra

Deep Learning Book Notes Chapter 2: Linear Algebra Deep Learning Book Notes Chapter 2: Linear Algebra Compiled By: Abhinaba Bala, Dakshit Agrawal, Mohit Jain Section 2.1: Scalars, Vectors, Matrices and Tensors Scalar Single Number Lowercase names in italic

More information

Stat 159/259: Linear Algebra Notes

Stat 159/259: Linear Algebra Notes Stat 159/259: Linear Algebra Notes Jarrod Millman November 16, 2015 Abstract These notes assume you ve taken a semester of undergraduate linear algebra. In particular, I assume you are familiar with the

More information

arxiv: v3 [math.oc] 8 Jan 2019

arxiv: v3 [math.oc] 8 Jan 2019 Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

arxiv: v1 [math.na] 22 Sep 2016

arxiv: v1 [math.na] 22 Sep 2016 A Randomized Tensor Singular Value Decomposition based on the t-product Jiani Zhang Arvind K. Saibaba Misha E. Kilmer Shuchin Aeron arxiv:1609.07086v1 [math.na] Sep 016 September 3, 016 Abstract The tensor

More information

On The Belonging Of A Perturbed Vector To A Subspace From A Numerical View Point

On The Belonging Of A Perturbed Vector To A Subspace From A Numerical View Point Applied Mathematics E-Notes, 7(007), 65-70 c ISSN 1607-510 Available free at mirror sites of http://www.math.nthu.edu.tw/ amen/ On The Belonging Of A Perturbed Vector To A Subspace From A Numerical View

More information

EE 381V: Large Scale Learning Spring Lecture 16 March 7

EE 381V: Large Scale Learning Spring Lecture 16 March 7 EE 381V: Large Scale Learning Spring 2013 Lecture 16 March 7 Lecturer: Caramanis & Sanghavi Scribe: Tianyang Bai 16.1 Topics Covered In this lecture, we introduced one method of matrix completion via SVD-based

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via nonconvex optimization Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Wafer Pattern Recognition Using Tucker Decomposition

Wafer Pattern Recognition Using Tucker Decomposition Wafer Pattern Recognition Using Tucker Decomposition Ahmed Wahba, Li-C. Wang, Zheng Zhang UC Santa Barbara Nik Sumikawa NXP Semiconductors Abstract In production test data analytics, it is often that an

More information

Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016

Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016 Optimization for Tensor Models Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016 1 Tensors Matrix Tensor: higher-order matrix

More information

Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix

Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2017 Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix Tony Cai University

More information

AM205: Assignment 2. i=1

AM205: Assignment 2. i=1 AM05: Assignment Question 1 [10 points] (a) [4 points] For p 1, the p-norm for a vector x R n is defined as: ( n ) 1/p x p x i p ( ) i=1 This definition is in fact meaningful for p < 1 as well, although

More information

Tensor-Tensor Product Toolbox

Tensor-Tensor Product Toolbox Tensor-Tensor Product Toolbox 1 version 10 Canyi Lu canyilu@gmailcom Carnegie Mellon University https://githubcom/canyilu/tproduct June, 018 1 INTRODUCTION Tensors are higher-order extensions of matrices

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information