arxiv: v1 [math.st] 6 Sep 2018
|
|
- Adam Cannon
- 5 years ago
- Views:
Transcription
1 Optimal Sparse Singular Value Decomposition for High-dimensional High-order Data arxiv: v [math.st] 6 Sep 08 Anru Zhang and Rungang Han University of Wisconsin-Madison September 7, 08 Abstract In this article, we consider the sparse tensor singular value decomposition, which aims for dimension reduction on high-dimensional high-order data with certain sparsity structure. A method named sparse tensor alternating thresholding for singular value decomposition STAT-SVD is proposed. The proposed procedure features a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with regular tensor SVD model, STAT-SVD permits more robust estimation under weaer assumptions. Both the upper and lower bounds for estimation accuracy are developed. The proposed procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset on European country mortality rates. Keywords: high-dimensional high-order data, projection and thresholding, singular value decomposition, sparsity, Tucer low-ran tensor. Anru Zhang is Assistant Professor, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, anruzhang@stat.wisc.edu; Rungang Han is PhD student, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, rhan3@wisc.edu. The research of Anru Zhang is supported in part by NS Grant DMS-8868.
2 Introduction High-dimensional high-order data, i.e., values arranged in large-scale tensors along three or more directions, commonly occur in a broad range of applications due to revolutionary developments in science and technology. These data possess distinct characteristics compared with the traditional low-dimensional or low-order data and pose unprecedented challenges to various communities, including statistics, machine learning, applied mathematics, and electrical engineering. To better summarize, visualize, and analyze high-dimensional high-order data, a sufficient dimension reduction often becomes the crucial first step. Therefore, how to effectively exploit the low-ran structure from high-dimensional high-order observations is often an important tas. To this end, the framewor of tensor SVD or tensor PCA has been introduced and extensively studied recently Allen, 0b; Richard and Montanari, 04; Anandumar et al., 06; Zhang and Xia, 08; Liu et al., 07; Wang and Song, 07. Suppose one is interested in an order-d low-ran tensor of dimension p p d, which is observed with entry-wise additive noise Z as Y = X+Z. Assume the fixed tensor X is low-ran in the sense that the fibers of X along different directions i.e., the counterpart of rows and columns for matrix all lie in low-dimensional subspaces, say U,..., U d. The goal of tensor SVD is to estimate the loadings U,..., U d and underlying low-ran tensor X. Under the regular tensor SVD setting, several practical methods have been introduced and studied, including High-order SVD HOSVD De Lathauwer et al., 000a, High-order Orthogonal Iteration HOOI De Lathauwer et al., 000b, sum-of-square scheme Hopins et al., 05, homotopy or continuation method Anandumar et al., 06. Using lower bound arguments, Richard and Montanari 04 and Zhang and Xia 08 showed that the signal-noise-ratio SNR C max{p, p, p 3 } / is required to ensure that consistent estimation is statistically possible; and SNR C max{p, p, p 3 } 3/4 may be further necessary for computationally efficient methods. However in many applications, such conditions, i.e., SNR is no less than a polynomial of dimension p, are often too restrictive to satisfy. Moreover, in many applications, the leading singular/eigenvectors of the high-dimensional high-order data may satisfy intrinsic structural assumptions along certain ways. We have seen the need for singular value decomposition in a number of modern tensor data applications, where sparsity plays an essential role. or example, in high-order longitudinal study, since observations
3 often come as multivariate functions of time, e.g. country-wise fertility and death rates by the calendar year and age Wilmoth and Sholniov, 006, the leading singular vectors along the mode of calendar year or age are expected to be smooth, and therefore becomes sparse after differential transformation; in Electroencephalogram EEG data analysis, the brain electrical activities are measured and stored as multi-way data with three or more modes representing channels, time, and patients, etc. It is often believed that some parts of brain region are more active and vary through time smoothly, then the leading singular vectors may be sparse along the channel mode and smooth along time mode Miwaeichi et al., 004; in imaging ensemble analysis, facial images are often stored as high-dimensional high-order tensors. To sufficiently reduce the dimension, one loos for low-dimensional subspaces that can best explain the possibly sparse facial features and suppress illumination effects such as shadows and highlights Vasilescu and Terzopoulos, 003. How to incorporate these structural assumptions wisely to improve the performance of subsequent statistical analyses is crucial for singular value decomposition in tensor data analysis. previous literature. Such a problem, however, has not been well studied or understood in In this article, we aim to fill this gap by developing methodology and theory for sparse tensor SVD. In addition to the regular tensor SVD model, we assume that the underlying low-ran structure X satisfies some sparsity constraints. As mentioned above, the data are not necessarily sparse along all modes in practice for example, it is not reasonable to assume sparsity for the patient mode in EEG data or subject mode in high-order longitudinal data. To allow more flexibility, we suppose there exists a subset J s {,..., d} such that part of the loadings {U : J s } contains certain row-wise sparsity structures. The detailed formulation of the sparse tensor SVD model is introduced in Section. To better illustrate the nature and difficulty of sparse tensor SVD problem, it is also helpful to discuss its order- counterpart, matrix sparse singular value decomposition, for comparison. The framewor of matrix sparse singular value decomposition, which focuses on extracting simultaneously sparse and low-ran matrix structure from high-dimensional matrix data, has been introduced and extensively studied during the past decade see Lee et al. 00; Yang et al. 04, 06 and the references therein. In addition, sparse principal component analysis, a closely connected topic, has also been considered in Zou et al. 006; Shen and Huang 008; 3
4 Johnstone and Lu 009; Cai et al. 03. In contrast, sparse tensor SVD is much more involved and difficult than sparse matrix SVD and regular tensor SVD in many aspects. irst, classical methods for matrix data are often not directly applicable to high-order data. Many previous wors approach the tensor problem by vectorizing or matricizing high-order data or intuitively speaing, stretching the data cubes into matrices or vectors so that high-order problems are transformed into vector or matrix ones. However, since high-order structures can get lost in the process of simple vectorizing or matricizing, one may only obtain sub-optimal results in the subsequent analyses. Second, some straightforward extensions from sparse matrix SVD methods, such as sparse HOSVD, sparse HOOI, or a single projection & thresholding scheme, does not perform optimally in general. Third, as pointed out by the seminal wor of Hillar and Lim 03, many basic concepts or methods for matrix data cannot be directly generalized to the high-order ones. Naive extensions of concepts such as operator norm, singular values, and eigenvalues are mathematically possible but computationally NP-hard. To overcome these difficulties, we propose a procedure named Sparse Tensor Alternating Truncation for Singular Value Decomposition STAT-SVD for sparse tensor SVD in this paper. The method consists of two steps: i a thresholded spectral initialization and ii an iterative alternating updating scheme. One crucial part of the procedure is a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Since each step of STAT-SVD only involves basic matrix and tensor operations, such as matricization, multiplication, matrix SVD, and thresholding, the proposed procedure can be implemented efficiently. We study both the theoretical and numerical properties of the proposed procedure. We prove by an upper bound argument that the STAT-SVD estimator can recover the low-ran structures accurately. A lower bound is further developed to show that the proposed estimator is rate optimal for a general class of simultaneously sparse and low-ran tensors. To the best of our nowledge, we are among the first ones to study the method and theory for sparse tensor SVD with matching upper and lower bound results. The numerical results show that the STAT- SVD outperforms other more naive methods, such as regular HOOI, HOSVD, sparse HOOI, and sparse HOSVD, by achieving significantly smaller estimation errors within much shorter running time. We also illustrate the merit of STAT-SVD in the analyses of high-order mortality 4
5 rate data. The rest of this article is organized as follows. After a brief introduction of the notations and preliminaries, we formally introduce the sparse tensor SVD model in Section. Then we propose the methodology for sparse tensor SVD in Section 3. The theoretical properties of the proposed procedure are developed in Section 4. The data-driven hyperparameter selection is discussed in Section 5. Numerical performance of the proposed methods is studied through both simulation studies and the real data analyses on high-order mortality rate dataset in Section 6. inally, further discussions, proofs of the technical results, and supporting theoretical tools are postponed to the supplementary materials. Problem ormulation. Notations and Preliminaries We start this section with notations and preliminaries that will be used throughout the paper. The lowercase letters, e.g. x, y, u, v, are used to denote scalars or vectors. or any a, b R, let a b and a b be the minimum and maximum of a and b, respectively. or convenience, we denote p = p,..., p d, r = r,..., r d, s = s,..., s d, and p = p p d, s = s s d, r = r r d. or any p, we also note p = l p l, s = l s l, r = l r l. We use C, c and the variations to denote generic constants, whose actual values may change from line to line. The uppercase letters are used to note matrices. or A R m n, we assume the singular value decomposition is A = i σ iu i v i, where σ σ 0 are the singular values in descending order. Denote σ i A = σ i as the i-th largest singular value of A. Particularly, the largest and smallest non-trivial singular values: σ max A = σ A and σ min A = σ m n A play important roles in our analysis. We also define SVD r A = [u,..., u r ] as the matrix comprised of the top r left singular vectors of A. Let the collections of regular and sparse orthogonal matrices be O p,r = {U R p r : U U = I r }, O p,r s = {U R p r : U U = I r, U 0 = p i= {U [i,:] 0} s}. These orthogonal matrices are extensively used in later narratives. In addition, the boldface capital letters, e.g. X, Y, Z, are used to represent tensors of order-3 5
6 or higher. or any S R r r d and U R p r, the mode- tensor-matrix product is defined as S U R r r p r + r d, S U i,...,i d = r j = S i,...,j,...,i d U i,j. Multiplication along different directions is commutative invariant, i.e., S U U = S U U for. or convenience, the tensor product along all d modes is applied S U d U d := S; U,..., U d. Matricization is a basic tensor operation that transforms tensors to matrices. Particularly the mode- matricization for any X R p p d is defined as M X R p p, where [M X] i,i +p i 3 + +p p d i d = X i,...,i p. The general mode- matricization can be defined similarly. The readers are referred to Kolda and Bader 009 for a more comprehensive tutorial for tensor algebra. We also use the R syntax to denote sub-vectors, -matrices, and -tensors. or example, A [I,I ] is used to note the sub-matrix with row indices I and column indices I of A. In addition, for any integers a b, we use a : b to represent consecutive sequence {a,, b}; and : alone represents the entire index set. Thus, A [:,:r] represents the first r columns of A while A [s +:p,:] represents the {s +,..., p }-th rows with indices {s +,..., p }. Similar notations are also applied to sub-vectors and sub-tensors.. Sparse Tensor SVD Model Now we formally introduce the sparse tensor SVD model. Suppose one observes a p p d - dimensional tensor Y with additive noise, say Y = X + Z. Here, X is a low-ran tensor in the sense that all fibers along each direction lie in some low-dimensional subspace, Z consists of i.i.d. Gaussian noises with mean zero and variance σ. Given the connection between Tucer low-ran and Tucer decomposition Tucer, 966; Kolda and Bader, 009, the model can be further written as Y = X + Z = S U d U d + Z = S; U,..., U d + Z, 6
7 igure : Illustration of sparse tensor SVD model. Here, X is sparse along Modes- and -3. with S R r r d and {U O p,r } d = being the unnown core tensor and Mode-,..., loadings, respectively. Especially, U is also the singular subspace of X along the -th direction. Recall O p,r and O p,r s are the classes of regular and sparse orthogonal columns. Given previous discussions, suppose a subset J s {,..., d} is nown a priori, such that the mode- loading U for any J s is s -sparse, O p,r s, J s ; U O p,r, / J s. To be more flexible, we set s = p if there is no sparsity constraint in the -th Mode, i.e., / J s. The central goal is to estimate singular subspaces U,..., U d, and the original low-ran tensor X based on observations Y. See igure for a pictorial illustration of sparse tensor SVD model. Remar It is worth mentioning that our framewor is related but distinct from the CP low-ran-based decomposition framewor in literature see De Silva and Lim 008; Kolda and Bader 009; Anandumar et al. 04a,b; Sun et al. 05; Sun and Li 07; Wang et al. 07; Hao et al. 08 and the references therein. Since the main focus of tensor SVD analysis is to sufficiently reduce the high-order data onto low-dimensional subspaces, the Tucer low-ran structure provides a more natural fit based on the interpretation of Tucer decomposition that we have mentioned above. Additionally, it is nown that any CP ran-r tensor must be of Tucer ran at most r, but the opposite of this statement is not generally true Kolda and Bader, 009. Therefore, we mainly pursue the more general Tucer low-ran model rather than the CP one, with distinct methodologies and theories developed for the rest of this paper. 7
8 3 The STAT-SVD Procedure Next, we propose the procedure for sparse tensor singular value decomposition. Note that U in the sparse tensor SVD model is not only the mode- singular subspace of X but also the left singular subspace of M X, a straightforward idea for initialization is Ũ 0 = SVD r M Y. This method, originally introduced by De Lathauwer et al. 000a, has been widely referred to as high-order SVD HOSVD and used in numerous scenarios. However, Ũ 0 may fail to provide any consistent estimations or even warm starts, unless one has strong SNR λ/σ Cp 3/ Zhang and Xia, 08, where λ = min =,,3 σ r M X. An alternative idea is to apply l regularized estimation to encourage sparsity Allen, 0a,b, Û, Û, Û3 = arg min U,U,U 3 Y S U U 3 U 3 + λ J s U. Both the computation and theoretical analysis for the regularized MLE may be difficult. Instead, we propose a procedure with a novel double thresholding & projection scheme as follows. Step Initialization: support Let Y = M Y be the matricization of Y for =,..., d. Recall p = p p d /p, p = p p d. or =,..., d, select the index sets for all sparse modes, { Î 0 = i : Y [i,:] σ p + p log p + log p Ideally speaing, Î0 or max Y [i,j] σ log p j }, J s. provides an initial estimate for the support of {U : J s }, which captures the significant signals of X but may miss the wea ones. Î 0 = {,..., p }. or / J s, we select Step Initialization: loadings Construct Ỹ R p Y p [i d,...,i, Ỹ [i,...,i d ] = d ], i,..., i d Î0 Î0 d ; 0, otherwise, and initialize Û 0 = SVD r M Ỹ, =,..., d. 8
9 Then Û 0 provides a rough estimate for U. The initialization step is illustrated in igure 3 a. Step 3 Power Iteration and Alternating Thresholding Provided a reasonable initialization by thresholded spectral method, we consider an iterative alternating scheme to refine the estimation. Suppose we aim to update Û t Û t+ updating scheme is discussed under two scenarios. for some specific =,..., d and t = 0,,.... The If the targeting mode, say Mode-, is non-sparse, we can update by orthogonal projection A t = M Y; Û t+, I p, Û t 3,..., Û t d R p r. 3 Ideally speaing, the dimension of Y is significantly reduced from p p d to p r, while the majority of signal X can be preserved in A t after such a projection. Then we update See igure 3 b for an illustration. Û t+ t = SVD r A. If the targeting mode, say Mode-, is sparse, we apply a double projection & thresholding scheme for refinement see igure 3 c. We still perform projection first, st Projection A t = M Y; I p, Û t,..., Û t d R p r. Then A t is denoised by row-wise hard thresholding, st Thresholding B t R p r, B t,[i,:] = At,[i,:] { A t,[i,:] η }, i p, where the thresholding level η = σ r + r log p + log p is determined by the quantile of A t,[i,:] if the i-th row of At does not contain any signal. Since A,[i,:] σ χ r σ r when A,[i,:] are pure noise, an η induced by A,[i,:] can be as large as σ r, which may falsely ill many rows with wea signals. In order to lower the large thresholding level, we introduce the second projection and thresholding: let the leading r right singular vectors of B t be t t ˆV, i.e., ˆV = SVD r B t, then we further reduce the dimension of p -by-r matrix A t to p -by-r matrix Āt by performing right projection, nd Projection Ā t = A t ˆV t R p r, 9
10 a Initilaization b Dense mode update c Sparse mode update igure : Illustration of STAT-SVD procedure. 0
11 and apply the second thresholding nd Thresholding Bt R p r, Bt,[i,:] = Ā,[i,:] { A t,[i,:] η }, with much smaller thresholding value η = σ r + r log p + log p, given the reduced dimension of Ā t. As we will illustrate in theoretical analysis, the double projection & thresholding scheme provides more accurate denoising performance. It is also noteworthy that a similar version of double thresholding appears in recent high-dimensional clustering literature Jin and Wang, 06, although their problem, method, and theory were all different from ours. inally, we update Û by QR decomposition Û t+ t = QR B. Step 4 The iteration is stopped until the maximum number of iteration is reached i.e. t > t max, or convergence, i.e. the following criterion holds, t B Bt 5 < ε tol. Here ε tol is the maximum tolerance, which can be chosen empirically. inally, we propose the denoising estimator for X as ˆX = Y Û d Û d. The pseudo-code for the proposed STAT-SVD method is provided in Algorithm. We particularly summarize the double projection & thresholding scheme for the sparse mode update in Algorithm. 4 Theoretical Analysis In this section, we analyze the theoretical properties of the proposed procedure in the previous section. The sin Θ distances are adopted to quantify the singular subspace estimation errors. Particularly for any U, V O p,r, the principal angles between U and V is defined as an r- by-r diagonal matrix: ΘU, V = diagcos σ U V,..., cos σ r U V. Then the sin Θ robenius norm sin ΘU, V = r U V can be used to characterize the distance
12 Algorithm Sparse tensor alternating thresholding - SVD STAT-SVD : Input: order-d tensor data Y R p, ran r, noise level σ, set of sparse modes J s. : Initialization Let Y = M Y. Select subset Î0 by { i p : Y [i,:] σ p + p log p + log p Î 0 = or max j Y [i,j] σ } log p, J s ; {,..., p }, / J s. 3: Set t = 0. Calculate the initializations Û 0 = SVD r M Ỹ, where Ỹ = Y i,...,i d, 4: while t < t max or convergence criterion not satisfied do 5: for =,..., d do 6: if J s then 0, otherwise. 7: Sparse mode update Update Û t+ via Algorithm ; 8: else 9: Dense mode update Calculate i,..., i d Î0 Î0 d ; A t+ = M Y; Û t+,..., Û t+, I p, Û t +,..., Û t d. 0: Update as Û t+ = SVD r : end if : end for 3: t = t +. 4: end while A t+.
13 Algorithm Update Scheme in STAT-SVD Sparse Mode : Input: Y, Û t +,..., Û t d, Û t+,..., Û t+, ran r. : irst Projection Calculate A t = M Y; Û t+,..., Û t+, I p, Û t +,..., Û t d 3: irst Thresholding Perform row-wise thresholding for A t+ : B t+ R p r, B t,[i,:] = At+,[i,:] { }, i p A t+,[i,:] η, where η = σ r + r log p + log p. 4: Second Projection Extract the leading r right singular vectors of B t then project as Āt+ ˆV t+ = SVD r = A t+ ˆV t+ R p r. 5: Second thresholding Perform thresholding on B t+ B t+, O p,r, B t+ R p r, Bt+,[i,:] = Āt+,[i,:] { }, i p A t+,[i,:] η, where η = σ r + r log p + log p. 6: Apply QR decomposition to t+ B, and assign the Q part to Û t+ O p,r.. 3
14 between U and V. The readers are referred to Cai and Zhang, 08, Lemma for more discussions on the properties of sin Θ distance. Theorem Upper bound Suppose r, σ are nown, log p log p d, and for d, λ = σ r M X C 0 σ s log p / + s r + max d r. Then after at most Ods log p iterations, the proposed Algorithm yields the following estimation error upper bound, ˆX X d Cσ sin Θ Û, U l= r l + s r + s log p. 4 J s { Cσ s r + s log p /λ } r, J s, { Cσ p r /λ } r, / J s, 5 with probability at least Cp + +p d logs log p p p d. Here C > 0 is some uniform constant, which does not depend on σ, p, r, s. Remar The estimation error upper bound 4 is comprised of three terms: σ d l= r l, σ s r, and σ s logp, which correspond to the estimation complexity for the core tensor S, the values of loading U, and the support of U only for sparse modes, respectively. On the other hand, the signal strength assumption σ r X Cσ s log p / + s r + max d r involves p only in the logarithmic term. Compared with the assumption σ r X σ max{p, p, p 3 } 3/4 that is required in regular tensor SVD, our proposed algorithm is able to handle high-dimensional settings under much weaer conditions. Remar 3 Proof setch for Theorem Since the proof of Theorem is lengthy and highly non-trivial, we briefly discuss the setch here. irst, a number of conditions are introduced as the baseline assumptions Step. Under these conditions, we try to establish the upper bound for the initial estimate: sin ΘÛ 0, U c and Û 0 M X C s log p for constants 0 < c < / and C > 0 Step, then develop the upper bound for estimates after each iteration: sin ΘÛ t+, U s r + s log p { Js} 4 λ /σ + C s log p λ /σ τ t,
15 and Û t X Cσ s r + Cσ s log p { Js} + Cσ s log p τ t. for some constant 0 < τ / Step 3. sufficiently decays. One obtain final estimators Û tmax Then after a number of iterations, the error rate and ˆX tmax that achieve the targeting upper bounds of estimation error Steps 4 and 5. inally, we use a coupling scheme to show that the introduced conditions hold with high probability Step 6. Remar 4 Performance of Single Projection & Truncation Scheme As discussed earlier, the proposed double projection & thresholding scheme is crucial to the performance of STAT-SVD. Such a scheme is essential in the sense that the simple single projection & thresholding, which may be a more straightforward extension from matrix sparse SVD method, may yield sub-optimal results. To be specific, suppose X is the estimator with single thresholding & projection see Algorithm 3 in Appendix for detailed explanations, Theorem 5 in the Appendix shows that there exists a low-ran tensor X that satisfies all assumptions in Theorem, but X yields a higher rate of convergence in the sense that E X X Cσ r l + s r + s log p + s r log p /. l J s J s Next, we study the statistical lower bound for sparse tensor SVD. Consider the following class of sparse and low-ran tensors, p,r,s J s = X = S; U,..., U d R p p d : ranx r,..., r d, suppu 0 s for J s. The following lower bound results hold for sparse tensor SVD. Theorem Lower Bound: Subspace Estimation Suppose s 3r, consider the following classes of sparse and low-ran tensors with least singular value constraint on matricization, p,r,sj s, λ = {X p,r,s J s : σ r M X λ }, =,..., d. Then, for any fixed, we have inf Û sup X p,r,sj s,λ sin Θ σ Û, U c s r λ + σ s logp /s λ I { Js} r. 5
16 Theorem 3 Lower bound: Tensor Recovery Suppose s 3r for any. Consider the tensor recovery over the class of sparse and low-ran tensors p,r,s J s, there exists uniform constant c > 0 such that inf sup ˆX X d ˆX cσ r + X p,r,sj s = d s r + s log p /s. J s Theorems,, and 3 together show the proposed STAT-SVD method achieves the optimal rate of convergence in the general class of sparse low-ran tensors when log p /s log p, for J s. = 5 Data-driven Hyperparameter Selection In practice, the proposed procedure requires the input of hyperparameters σ and r, r,..., r d. Since Y = X + Z and a significant portion of entries of X are zeros, the data-driven median estimator Yang et al., 06 can be used to estimate σ: ˆσ = Median Y / Here is the 75% quantile of standard normal distribution. We have the following theoretical guarantee for ˆσ. Proposition Concentration Inequality of ˆσ Let ˆσ = Median Y /z If s = o p log p, then there exists universal constants C and c, such that ˆσ σ log p P c C/p. 6 σ p Now we consider the data-driven selections of r,..., r d. Recall that we directly threshold on each mode and obtain r,..., r d based on the singular values of Ỹ, { ˆr = max Ỹ in the initialization step of Algorithm. r : σ r M Ỹ ˆσδ Î 0, Î0 i Here, δ ij =.0 + j + i log ep i + j log ep j + 4 log p We propose to select }. 7 and ˆσ is the estimated standard deviation. Under regularity conditions, one can show that ˆr,..., ˆr d match the true rans with high probability. 6
17 Proposition Under the conditions of Theorem and Proposition, we have ˆr = r for each =,..., d with probability at least Op + + p d /p δ for any δ > 0. If neither σ nor {r } d = are nown, we can first estimate σ by MAD estimator ˆσ, then estimate {r } d = by 7, and feed all the estimations to Algorithm. We have the following theoretical guarantee for this fully data-driven method. Theorem 4 Suppose we set σ = ˆσ and r = ˆr in Algorithm. If s = o p log p, the conclusion of Theorem holds with probability at least O p + + p d /p δ for any δ > 0. Remar 5 Similarly as some previous wors on distribution-based methods for principal component number selection Choi et al., 07, the performance of the proposed ˆσ and ˆr may rely on specific Gaussian noise assumptions. In scenarios with general noise, some more empirical schemes can be applied. or example, to estimate σ, one can trim a portion of entries from Y with largest absolute values, and evaluate the trimmed variance ˆσ as the sample variance for remaining entries Serfling, 984; for r, we can apply the cumulative percentage of total variation criterion Jolliffe, 00, Chapter 6.. a commonly used in principal component analysis literature: { r i= ˆr = arg min r : σ i M } Y p i= σ i M Y > ρ. Here, ρ 0, is an empirical thresholding level. 6 Numerical Analysis 6. Simulation Studies We evaluate the numerical performance of the proposed STAT-SVD method by simulation studies on various synthetic datasets. In each setting, we first generate an r -by-r -by-r 3 tensor S with i.i.d. standard normal entries, then rescale S as S = S λ/ min 3 σ r M S to ensure that σ r M S λ. or =,, 3, we generate singular subspaces Ū and indices subsets Ω t with cardinality s uniformly at random from O s,r and {,..., p}, respectively. The 7
18 combination of Ū and Ω Ū [j,:], i Ω, and i is the j-th element of Ω ; U [i,:] = 0, i / Ω. yields a uniformly random sparse singular subspace in O p,r s. Let X = S U U 3 U 3, Z iid N0, σ, and Y = X + Z be the underlying parameter, noise, and observation tensors. To examine the performance of STAT-SVD, we apply the proposed Algorithm along with four baseline methods, HOSVD, HOOI, sparse HOSVD S-HOSVD, and sparse HOOI S- HOOI, on the same synthetic data for comparison. Here, HOSVD and HOOI are classical methods De Lathauwer et al., 000a,b that have been widely used in literature. S-HOSVD and S-HOOI are the sparse modifications of HOSVD and HOOI intuitively, S-HOSVD and S-HOOI are performed by replacing all regular matrix SVD steps in HOSVD and HOOI by sparse matrix SVD Yang et al., 04. The detailed implementation of S-HOSVD and S-HOOI are summarized in Section B of the supplementary materials. The experiments are repeated for 00 times in each setting. In the first simulation study, we compare the estimation errors of STAT-SVD and baseline methods HOOI, S-HOOI, HOSVD, S-HOSVD in average robenius norm, l Û = 3 3 sin ΘÛ, U, l ˆX = ˆX X. = or hyperparameters, we use the median estimator ˆσ in STAT-SVD and all baseline methods; since sin Θ distance is one of the most important error quantification in tensor SVD analysis and one requires the correct r to evaluate the sin Θ distance between singular subspaces, we use the true rans r for all implementations. ix p = p = p 3 = 50, s = s = s 3 = 5, λ = 70, r = r = r = r 3, we specifically consider two scenarios: σ =, r varies from to ; r = 5, σ ranges from 0. to.5. Although the signal-to-noise ratio SNR, λ/σ here seems large, λ represents the singular value of the each matricization that measures the signal strength of the whole tensor; while σ is the standard deviation of each Z ij that quantifies the noise level of each single entry. By random matrix theory Vershynin, 00, the singular values of M Z are around σ p 5 to 5, which is comparable to the signal M X. As one can see from the numerical results in igures 3 and 4, although all methods yield smaller estimation error with smaller noise level, STAT-SVD significantly outperforms all other schemes in estimations 8
19 igure 3: Estimation error of U left panel and X right panel for different methods. Here, p = p = p 3 = 50, s = s = s 3 = 5, σ =, λ = 70, and r = r = r 3 = r vary. igure 4: Estimation error of U left panel and X right panel for different methods. Here, p = p = p 3 = 50, s = s = s 3 = 5, r = r = r 3 = 5, λ = 70, σ varies. of both subspaces and original tensors. We also consider the accuracy of hyperparameters estimation. Under the previous simulation setting, we examine the performance of MAD estimator ˆσ and ran estimator ˆr in Section 5. The results are provided in Table. One can see that the proposed ˆσ and ˆr provide reasonable estimations. The ran estimation is particularly accurate when noise level is moderate. In previous sections, the presentation and analysis were mostly focused on Gaussian noise case. Next, we consider the setting that the noises are uniformly distributed on [ σ 3, σ 3]. Let p = p = p 3 = 50, s = s = s 3 = 5, r = r = r 3 = 5, λ = 70, σ varies from 0. to.5. As we can see from the estimation error results in igure 5, STAT-SVD still achieves significantly 9
20 σ ˆσ σ ˆr r r ˆσ σ ˆr r Table : Ran and noise level estimation accuracy. Here, p = p = p 3 = 50, s = s = s 3 = 5, λ = 70. The first three rows explore the setting where r = r = r 3 = 5 with σ varies; the last three rows consider the case with fixed σ = and varying ran. The errors are averaged over 00 repetitions. igure 5: Estimation error of U left panel and X right panel with uniform distributed noise. better performance than other methods. As we have discussed before, in many applications, the tensor dataset possesses sparsity structure in only part of directions. Thus, we turn to a setting that X contains both sparse and dense modes. Specifically, we set p = p = p 3 = 50, r = r = r 3 = 0, s = s = 0, λ = 60, and s 3 = p 3, so that X is sparse along mode-, - and dense along mode-3. The singular subspace estimation error with varying noise level is evaluated and shown in igure 6. We can see STAT- SVD significantly outperforms the other methods on the estimation of sparse modes Û and Û. More interestingly, STAT-SVD also estimates the non-sparse singular subspace U 3 slightly more accurately, especially when σ is large. In fact, different modes and singular subspaces of any specific tensor dataset are a unity rather than separate objects, so more accurate estimation 0
21 igure 6: Estimation error of U left top, U right top, U 3 left bottom and X right bottom in partial sparse setting. Here, λ = 60, p = p = p 3 = 50, s 3 = p 3, s = s = 0, r = r = r 3 = 0. of dense mode U 3 is possible when one can fully utilize the sparsity in U and U. Especially for STAT-SVD, with the proposed double projection & thresholding scheme Algorithm, one gets significant better estimations on U and U than the baseline methods in each iteration so more precise projection 3 can be achieved when updating dense mode singular subspace Û t 3, and a slightly better final estimation of U 3 can be achieved. As the time cost is another critical issue in high-dimensional data analysis, we summarize the computational complexity for both initialization and each iteration of STAT-SVD and baseline methods into Table. We also compare the running time of STAT-SVD and other algorithms by simulations. As one can see from Tables and 3, HOOI, HOSVD, S-HOOI, and S-HOSVD are all slower than STAT-SVD this is because the embedded sparse matrix SVD or regular matrix SVD in baselines are computationally expensive, especially for the high-dimensional
22 STAT-SVD HOSVD HOOI S-HOSVD S-HOOI initialization per-iteration Op d + sd+ Op d+ Op d+ OT p d+ OT p d+ Op d r + s d+ 0 Op d+ 0 OT p d+ Table : Computational complexity of STAT-SVD and baseline methods for order-d sparse tensor SVD. Here, each mode share the same p i, s i, and r i. T denotes the number of iteration times. p i STAT-SVD HOSVD HOOI S-HOSVD S-HOOI Table 3: Running time unit: seconds of STAT-SVD and baseline methods. Here, λ = 70, σ =, s = s = s 3 = 0, r = r = r 3 = 5, p = p = p 3 varies. cases. In contrast, STAT-SVD is much faster, as the efficient truncation maes sure that only the significant parts of the data are extensively applied in computation, i.e., we only need to perform SVD on the s -by-s submatrix instead of the original p -by-p one. 6. Mortality Rate Data Analysis We illustrate the power of STAT-SVD through a demographic example. The mortality rate, i.e. the number of deaths divided by the total number of population, provides interesting insights to demographic information of the certain area, period, and age span. The Bereley Human Mortality Database Wilmoth and Sholniov, 006 contains a good source of morality rate data aligned by countries, ages, and years. We aim to analyze the mortality rate among 6 European countries for ages 0 to 95 from 959 to 00. The data tensor Y is of dimension with three modes representing age, year, and country, respectively. Since the mortality rate is relatively steady from teenagers to adults see, e.g., Miniño 03, it is reasonable to assume that the underlying loadings of Mode age are sparse after taing differential transformation.
23 Specifically, let D R be a secondary difference matrix: D ij = if i = j; D ij = if i j = ; D ij = 0 if i j. We pre-process the data by multiplying the secondary difference matrix along Modes age: Ỹ = Y D. We aim to apply sparse tensor SVD on Ỹ with J s = {}. To achieve more robust performance, the hyperparameters are selected via empirical methods instead of the Gaussian-noisebased procedure discussed in Section 5. r is selected via cumulative percentage of total variation criterion Jolliffe, 00, Chapter 6.., { r i= ˆr = arg min r : σ i M } X p i= σ i M X > 0.5. or the noise level, we trim 5% largest entries of Y in absolute value, then set ˆσ as the sample standard deviation of the remaining entries of Y. The estimated ran and noise level of Ỹ are ˆr, ˆr, ˆr 3 =, 5, 4 and ˆσ = In addition to STAT-SVD, we also apply S-HOOI and S-HOSVD for comparison. Suppose Ũ, ˆV, Ŵ are the resulting singular subspaces of Ỹ. Then we transform Ũ singular subspace of Mode age bac via Û = D Ũ. We first compare the first two singular vectors of Modes age and year, û, û, ˆv, and ˆv, in igures 7 and 8. We can see from û that infants aged at 0- and old people aged over 55 have a higher ris of dying, and people aged from to 50 have a relatively steady death rate. In addition, the shape of u further suggests additional factors of mortality rate in certain periods. or example, there may be more complicated patterns in mortality rate for the infants and elderly aged over 80 due to the high ris of death in these two periods. rom ˆv, we can see the mortality rate in these European countries declines significantly from the year 955 to 00, while ˆv does not give any significant pattern. Since the three methods produce similar plots on values of singular vectors, we further compare the estimation support indexes in igure 9. Since the mortality rate is steady from childhood to adults, one expects that the zero indexes would cover such an age span for û and û. The outcome of STAT-SVD matches this phenomenon. In comparison, the sparsity patterns given by S-HOSVD and S-HOOI are less interpretable. Next, we consider the mortality rate pattern for countries. In particular, we plot ŵ against ŵ for each country in igure 0, refer to the 04 European GDP per capita raning data, search for the countries whose GDP per capita is raned as top 50% among all European counties, Lin: 3
24 igure 7: Mortality rate data: û, û igure 8: Mortality rate data: ˆv, ˆv igure 9: Sparse indexes of Mode age after secondary differential transformation 4
25 igure 0: Mortality rate data: ŵ and ŵ of different countries. Red circles and blac triangles represent the countries with top 50% GDP per capita in Europe and the other countries, respectively. igure : Mortality rate means for countries in Europe. Red circles and blac triangles represent the countries with top 50% GDP per capita in Europe and the other countries, respectively. then highlight these countries with red circle marers and the other with blac triangles. We can clearly see two clusters in igure 0 that countries with more GDP per capita have smaller values of ŵ and ŵ, which implies a lower mortality rate. Also, wealthier countries are highly clustered in the graph, which indicates the common pattern of the death rate for these countries. In contrast, we also calculate the mean mortality rates of these countries and mar them with different colors in igure. By comparing igures 0 and, the mortality data clustering performance of STAT-SVD is significantly better than the one by mean estimations. 5
26 7 Proofs 7. Proof of Theorem Since the proof for this upper bound result is fairly complicated, we divide the proof into steps. Note that although we do not have refinement for non-sparse modes, to mae the proof consistent, we first extend the refinement step to non-sparse algorithm. Specifically, we also apply algorithm on Û t for J s, with η = η = 0 i.e. no truncation. This modification does not change the algorithm at all but maes the following analysis consistent for both sparse and non-sparse modes. Without loss of generality, we assume σ = throughout this proof. The idea of proving this theorem is that, we first impose a series of conditions, then prove the statement under these conditions, and finally prove these conditions hold with high probability. Step Introduction of Notations and Conditions We introduce or rephrase the following list of notations. N Matricizations, for =,..., d, X = M X, Y = M Y, Z = M Z. N Index sets for sparse mode J s : define True support I = { i : X [i,:] 0 }, Significant support { I 0 = i : } X [i,:] 6σs log p, { Selected support Î 0 = or t, we also define Ît i : Y [i,:] σ p + p log p + log p or max Y [i,j] σ log p j and Ĵ t with J s : selected support for A in t-th step selected support for Ā in t-th step Î t = Ĵ t = } { i : { i :, =,..., d. A t Āt [i,:] [i,:] } η, } η, 8 9 6
27 where η = σ r + r log p + log p, η = σ r + r log p + log p. or non-sparse mode J s, define I = I t that for any, I, I 0 = Ît = Ĵ t = [ : p ]. It is easy to see Î t, Ĵ t are all subsets of {,..., p }. We also define I = I + I d I I. Moreover, Ît, Ĵ t, It, J t are defined in the similar way. N 3 Index projections: for any I {,..., p }, we define D I = diag I R p p, if I {,..., p } is any index subset; D I to zero. N 4 Loadings: can be interpreted as the projection matrix, which set all rows with index not in I U = U + U d U U O p r, =,..., d; Û t = Û t + Û t d Û t Û t O p r, =,..., d. V = SVD r X U O r,r, i.e., leading r right singular vectors of X U ; ˆV t = SVD r DÎt Y Û t O r,r, Combining these definitions, we can rewrite A t A t Ā t = Y Û t, We can immediately see that A t Y = X + Z, A t, Bt, Āt Bt = Y Û t t ˆV, Bt, Bt i.e., leading r right singular vectors of DÎt Y Û t. = DÎt A t and Āt as = D Î t Y Û t, = DĴ t Y Û t t ˆV., Āt t, B are essentially projections of Y. Since t, B can be decomposed accordingly as A t B t Ā t B t = AX,t = B X,t = ĀX,t = B X,t + A Z,t + B Z,t + ĀZ,t + B Z,t A X,t B X,t Ā X,t B X,t = X Û t, = DÎt X Û t, AZ,t = Z Û t ; BZ,t = DÎt Z Û t = X Û t t ˆV, ĀZ,t = Z Û t t ˆV = DĴ t X Û t t ˆV, BZ,t = DĴ t Z Û t t ˆV. 0 7
28 N 5 Error Bounds: for t = 0,,..., =,..., d, E t = Û t X, t t = sin Θ Û, U. or =,..., d, t =,..., we further define K t t t = sin Θ Û ˆV, U V. We also introduce the following conditions for the proof of this theorem. A A max Z i,...,i d log p i,...,i d I I d Z [I,...,I d ] s + s log p + log p Z U d Ud r + r log p + log p. A 3 d, Z,[I,I ] s + s + log p. 3 A 4 d, N := Z [I,:]U V s + r + log p. 4 A 5 d, M := sup Z,[I,I ]W s + r + W l R s l r l, W l l s l r l + log p, where W = W + W d W W O s,r. A 6 Support consistency condition: 5 I 0 Î0, Î t I, t = 0,..., t max, Ĵ t I, t =,..., t max. 6 8
29 Step Theoretical guarantees for initialization: Û 0 In this second step, we show that under A A 7, the initialization estimator Û 0 =,,, d: sin Θ Û 0 Recall that Û 0 Ỹ0 = SVD r M DÎ0 X DÎ0 + DÎ0 Cai and Zhang 08, Z DÎ0 0 sin Θ Û, U We set C 0 80 d, recall that satisfies the following two inequalities for any 0 Û, U s log p. 7 λ X ds log p. 8 = SVD r DÎ0 Y DÎ0. Since DÎ0 Y DÎ0 =, by the unilateral perturbation bound result Proposition in σ r U D Î 0 σr U D Î 0 Y DÎ0 Y DÎ0 U D Î 0 σ r + D Î 0 Y DÎ0 Y DÎ0. 9 λ 80 ds log p, =,..., d, 0 we analyze each of the three ey terms in the right hand side of 9 as follows, a Since 6 holds, i.e. I 0 of DÎ0 Y DÎ0, then σ r U D Î 0 σ r U X σ r X 6 λ λ Î0 Y DÎ0 U X D I 0 X X D 0 I d = 8 λ 6ds log p /, we now the non-zero part of D I 0 σ r U D Î 0 X DÎ0 X DÎ0 X DÎ0 D Î 0 X D 0 I D I Z D I d D I 0 d U D Î 0 Y D 0 I is a submatrix Z DÎ0 Z DÎ0 s + / s log p + log p / D I \I 0 X s + / s log p + log p λ ds log p λ.λ =.9λ. s + log p 9
30 b Note that DÎ0 Y DÎ0 now σ r + DÎ0 Y DÎ0 = DÎ0 X DÎ0 Lemma 6 + DÎ0 Z DÎ0 3 s + s + log p 4 s log p 0.λ. c Since the left singular space of D I X D I, randî0 X DÎ0 r, we 6 σ DÎ0 Z DÎ0 σ DI Z D I is U, we have U D I X D I = O, thus, U D 6 Î 0 Y DÎ0 U D Î 0 Y D I U D I Y D I + U D Y I D I \Î0 U D I X D I + U D I Z D I + U Y,[I \Î0 U D I Z D I + Z,[I,I ] + 6 X,[I \I 0 Y,[I \Î0,I ] Z,[I,I ] + s 6s log p,i ] + s + s log p + log p + 4 s log p s + log p + 4 s log p 9 s log p. Summarizing a, b, c, and 9, we must have 7, i.e. Z,[I \I 0,I ] 0 sin Θ Û, U s log p, =,..., d, λ,i ] provided that A A 7 hold. Next we consider the bound for Û 0 X. Since Û 0 is the leading r left singular vectors of M Ỹ 0 = D 0 I and Then by Lemma 6, 0 Û D 0 I D 0 I Y D 0 I X D 0 I = D 0 I X D 0 I r D I 0 + D I 0 3 r s + s + log p 8 s log p. Y D I 0 Z D 0 I., rand I 0 Z D 0 I r Z[I,I ] X D 0 I r, 30
31 The last inequality comes from the fact that r s s, r s s and r s. Meanwhile, 0 Û D 0 I X D 0 I X X D 0 I X D 0 I = X X 0 [I,...,I0 d ] d X [I \I 0,:] = d = X,[i,:] d s 6s log p Therefore, Û 0 = i I \I 0 =4 ds log p. X which has proved 8. 0 Û D 0 I = X D 0 I + Û 0 8 s log p + 4 ds log p ds log p, X D I 0 X D I 0 Step 3 Next we consider the refinement of each iteration. To be specific, we try to study the performance of Û t Step, we have based on Û t. We still assume 6 all hold. Based on the result in E 0 ds log p and 0 s log p, λ provided that A A 7 hold. We particularly provides the following upper bound for E t Û t X and t = sin Θ Û, U E t s η + 3 r N + 3 r M K t t + Et + λ + K t + + Et d λd + Et λ + + Et λ =, 3 Et. 4 λ The detailed proof is collected in Section C.3 in the supplementary materials. Step 4 In this step, we combine the results in Step 4 and provide a upper bound for E tmax, tmax for t max = C logds log p. We first show, for all sparse and non-sparse modes, we have the following upper bounds for E t : E t 30 s r + 30 s log p + ds log p t, =,..., d, t = 0,,
32 irst, 5 holds for t = 0 due to 8. If 5 holds for t for some t, we aim to prove 5 for t. irst, based on the basic principle of Tucer ran, we have r r, r s s l r l s l r l s l s l = s. 6 Also, we shall recall that η = r + r log p + log p r + log p; η = r + r log p + log p r + log p; N l s l + r l + log p s l + log p; M l sl + d r l + sl r l + log p, l= 7 and E t 30 s r + 30 s log p + ds log p t 30 s + 30 s log p + ds log p t 60 s log p + ds log p t. 8 Thus, the following upper bound holds for K t, if we set C gap 00d. K t t E + + E t d + s η + 6r M λ / d 60 s log p + ds log p + t s η + 6r M λ / 60 d s log p + d ds log p + s r + s log p λ 6 s r + r + d = s r + log p + 00dλ C gap λ. λ 3
33 Thus, E t s η + 3 r N + r M K t + Et λ + + Et d λd K t s η + 6 r N + r M K t + Et λ 7 4 s r + 5 s log p + r M K t + Et λ + + Et d λ d + + Et d λ d. 9 With C gap > 0d + d, by Lemma 8, we have: r M K t r M d = Et + M s r η + λ λ 6r M λ, 4 s r + 4 s log p + ds log p t+. E t r M + + Et d 3 s r + 3 s log p + ds log p λ λ d t+ 3 Combining 9, 30, and 3, we have proved 30 E t 30 s r + 30 s log p + ds log p t, i.e. 5 for t and =. Similarly, one can also prove the upper bounds for E t =,..., d, which implies the claim 5 holds for t 0. for Then we further provide another upper bound for non-sparse modes. Note that we assume log p log p log p d, thus we can find some constant C, such that log p C p, =,,..., d. Now we want to show: E t C p r + ds log p t, J s 3 Again, we assume mode is non-sparse and only prove the bound for E t, the similar bounds for other non-sparse modes essentially follow. Return to 9, note that we particularly have η = 0 since it is a non-sparse mode, we have: E t 6 r N + r M K t + Et λ 7 p r + 6 r log p + r M K t + 6 C p r + r M + + Et d λ d + Et λ K t + Et λ + + Et d + + Et d λ d λ d
34 Also, we have the following bound for K t as η = 0, Besides, we could rebound M lie following: K t Et E t d + 6r M. 34 λ M + C p + r + l sl r l. 35 If we set C gap > 0d + d, then together with 5, 33, 34 and 35, we can proved the following result as similar as we proved in Lemma 8: r M K t + Et + + Et d C p r + ds log p r λ rd λ d t. Then 63 follows. Now for t max = C logds log p, for large constant C > 0, we have E tmax = Û tmax X tmax 6 t = sin Θ Û max, U C s r + s log p, J s, C p r, / J s, C s r + s log p /λ, J s, C p r /λ, / J s Step 5 In this step, we consider the estimation error for ˆX. We first have the following decomposition, Here, ˆX X = Y PÛ d PÛd X Z PÛ d PÛd + X PÛ d PÛd X. X PÛ d PÛd X X PÛ + X PÛ PÛ + + X PÛ PÛ d PÛd d PÛd d X l Ûl d = Û l X 36 l C s l r l + s l log p + pl r l ; l Js l Js l= l= 34
35 Z PÛ d PÛd Û Z U U Û d Ûd Z U Û d Ûd + Z U Û d Ûd + Z U U Û Û d Û d U U Û Z Û Û d + M sin Θ Û, U Z U U d Û d + M sin Θ Û, U + M sin Θ Û, U Z U d Ud + d l= Z U d U d d sin M l Θ Û l, U l. l= r + r log p + log p = r + log p; sin 537 M l Θ Ûl, U l C M l s l r l + s l log p + C M l p l r l λ l λ l l J s l J s C l J s s l r l + s l log p + C l J s p l r l. In summary, ˆX X C r + d l= rl s l + l J s sl log p. under the assumptions 6 that we introduced in Step. Note that log p = log p + + log p d and we have the assumption that log p log p d, thus the result is equitant to the one stated in Theorem. Step 6 inally, we prove that all imposed conditions in Step, i.e. -6, hold with probability at least O p + +p d log ds log p p. The detailed proof is collected in Section C.4 in the supplementary materials. Combing all results from Steps 6, we have finished the proof for this theorem. 35
36 Acnowledgment The authors would lie to than the Editor, Associate Editor, and anonymous referees for their constructive comments, which have greatly helped to improve the presentation of the paper. References Allen, G. 0a. Sparse higher-order principal components analysis. In AISTATS, volume 5. Allen, G. I. 0b. Regularized tensor factorizations and higher-order principal components analysis. arxiv preprint arxiv: Anandumar, A., Deng, Y., Ge, R., and Mobahi, H. 06. Homotopy analysis for tensor pca. arxiv preprint arxiv: Anandumar, A., Ge, R., Hsu, D., Kaade, S. M., and Telgarsy, M. 04a. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 5: Anandumar, A., Ge, R., and Janzamin, M. 04b. Guaranteed non-orthogonal tensor decomposition via alternating ran- updates. arxiv preprint arxiv: Birgé, L. 00. An alternative point of view on lepsi s method. Lecture Notes-Monograph Series, pages Cai, T. T., Ma, Z., and Wu, Y. 03. Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics, 46: Cai, T. T. and Zhang, A. 08. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46: Choi, Y., Taylor, J., Tibshirani, R., et al. 07. Selecting the number of principal components: estimation of the true ran of a noisy matrix. The Annals of Statistics, 456: De Lathauwer, L., De Moor, B., and Vandewalle, J. 000a. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 4:
sparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationDISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania
Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More information15 Singular Value Decomposition
15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationMAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing
MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very
More informationSupplement to A Generalized Least Squares Matrix Decomposition. 1 GPMF & Smoothness: Ω-norm Penalty & Functional Data
Supplement to A Generalized Least Squares Matrix Decomposition Genevera I. Allen 1, Logan Grosenic 2, & Jonathan Taylor 3 1 Department of Statistics and Electrical and Computer Engineering, Rice University
More informationLecture Notes 5: Multiresolution Analysis
Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and
More informationKronecker Product Approximation with Multiple Factor Matrices via the Tensor Product Algorithm
Kronecker Product Approximation with Multiple actor Matrices via the Tensor Product Algorithm King Keung Wu, Yeung Yam, Helen Meng and Mehran Mesbahi Department of Mechanical and Automation Engineering,
More informationNoisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get
Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto
More informationOrthogonal tensor decomposition
Orthogonal tensor decomposition Daniel Hsu Columbia University Largely based on 2012 arxiv report Tensor decompositions for learning latent variable models, with Anandkumar, Ge, Kakade, and Telgarsky.
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More information19.1 Problem setup: Sparse linear regression
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In
More informationNew Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit
New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence
More informationON THE QR ITERATIONS OF REAL MATRICES
Unspecified Journal Volume, Number, Pages S????-????(XX- ON THE QR ITERATIONS OF REAL MATRICES HUAJUN HUANG AND TIN-YAU TAM Abstract. We answer a question of D. Serre on the QR iterations of a real matrix
More informationA Multi-Affine Model for Tensor Decomposition
Yiqing Yang UW Madison breakds@cs.wisc.edu A Multi-Affine Model for Tensor Decomposition Hongrui Jiang UW Madison hongrui@engr.wisc.edu Li Zhang UW Madison lizhang@cs.wisc.edu Chris J. Murphy UC Davis
More informationarxiv: v5 [math.na] 16 Nov 2017
RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem
More information14 Singular Value Decomposition
14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationTikhonov Regularization of Large Symmetric Problems
NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS Numer. Linear Algebra Appl. 2000; 00:1 11 [Version: 2000/03/22 v1.0] Tihonov Regularization of Large Symmetric Problems D. Calvetti 1, L. Reichel 2 and A. Shuibi
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional
More informationOptimal spectral shrinkage and PCA with heteroscedastic noise
Optimal spectral shrinage and PCA with heteroscedastic noise William Leeb and Elad Romanov Abstract This paper studies the related problems of denoising, covariance estimation, and principal component
More informationLecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26
Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More informationarxiv: v1 [math.na] 5 May 2011
ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and
More informationSPARSE signal representations have gained popularity in recent
6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying
More informationLeast Squares Approximation
Chapter 6 Least Squares Approximation As we saw in Chapter 5 we can interpret radial basis function interpolation as a constrained optimization problem. We now take this point of view again, but start
More information/16/$ IEEE 1728
Extension of the Semi-Algebraic Framework for Approximate CP Decompositions via Simultaneous Matrix Diagonalization to the Efficient Calculation of Coupled CP Decompositions Kristina Naskovska and Martin
More informationarxiv: v1 [cs.it] 21 Feb 2013
q-ary Compressive Sensing arxiv:30.568v [cs.it] Feb 03 Youssef Mroueh,, Lorenzo Rosasco, CBCL, CSAIL, Massachusetts Institute of Technology LCSL, Istituto Italiano di Tecnologia and IIT@MIT lab, Istituto
More informationEECS 275 Matrix Computation
EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview
More informationTensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia
Network Feature s Decompositions for Machine Learning 1 1 Department of Computer Science University of British Columbia UBC Machine Learning Group, June 15 2016 1/30 Contact information Network Feature
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional
More informationZhaoxing Gao and Ruey S Tsay Booth School of Business, University of Chicago. August 23, 2018
Supplementary Material for Structural-Factor Modeling of High-Dimensional Time Series: Another Look at Approximate Factor Models with Diverging Eigenvalues Zhaoxing Gao and Ruey S Tsay Booth School of
More informationReconstruction from Anisotropic Random Measurements
Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013
More informationHigh Dimensional Covariance and Precision Matrix Estimation
High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance
More informationL 2,1 Norm and its Applications
L 2, Norm and its Applications Yale Chang Introduction According to the structure of the constraints, the sparsity can be obtained from three types of regularizers for different purposes.. Flat Sparsity.
More informationMath 471 (Numerical methods) Chapter 3 (second half). System of equations
Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular
More informationEstimation of large dimensional sparse covariance matrices
Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)
More informationOn the convergence of higher-order orthogonality iteration and its extension
On the convergence of higher-order orthogonality iteration and its extension Yangyang Xu IMA, University of Minnesota SIAM Conference LA15, Atlanta October 27, 2015 Best low-multilinear-rank approximation
More informationDimension Reduction Techniques. Presented by Jie (Jerry) Yu
Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage
More informationRobustness of Principal Components
PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.
More informationAn algebraic perspective on integer sparse recovery
An algebraic perspective on integer sparse recovery Lenny Fukshansky Claremont McKenna College (joint work with Deanna Needell and Benny Sudakov) Combinatorics Seminar USC October 31, 2018 From Wikipedia:
More informationNon-convex Robust PCA: Provable Bounds
Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing
More informationDeep Linear Networks with Arbitrary Loss: All Local Minima Are Global
homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works
CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The
More informationIntroduction to Compressed Sensing
Introduction to Compressed Sensing Alejandro Parada, Gonzalo Arce University of Delaware August 25, 2016 Motivation: Classical Sampling 1 Motivation: Classical Sampling Issues Some applications Radar Spectral
More informationMultilinear Singular Value Decomposition for Two Qubits
Malaysian Journal of Mathematical Sciences 10(S) August: 69 83 (2016) Special Issue: The 7 th International Conference on Research and Education in Mathematics (ICREM7) MALAYSIAN JOURNAL OF MATHEMATICAL
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationJournal of Multivariate Analysis. Consistency of sparse PCA in High Dimension, Low Sample Size contexts
Journal of Multivariate Analysis 5 (03) 37 333 Contents lists available at SciVerse ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva Consistency of sparse PCA
More informationIterative Methods for Solving A x = b
Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http
More informationInstitute for Computational Mathematics Hong Kong Baptist University
Institute for Computational Mathematics Hong Kong Baptist University ICM Research Report 08-0 How to find a good submatrix S. A. Goreinov, I. V. Oseledets, D. V. Savostyanov, E. E. Tyrtyshnikov, N. L.
More informationFundamentals of Multilinear Subspace Learning
Chapter 3 Fundamentals of Multilinear Subspace Learning The previous chapter covered background materials on linear subspace learning. From this chapter on, we shall proceed to multiple dimensions with
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationTHE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR
THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior
More information2.3. Clustering or vector quantization 57
Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :
More informationIntroduction Wavelet shrinage methods have been very successful in nonparametric regression. But so far most of the wavelet regression methods have be
Wavelet Estimation For Samples With Random Uniform Design T. Tony Cai Department of Statistics, Purdue University Lawrence D. Brown Department of Statistics, University of Pennsylvania Abstract We show
More informationDS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.
DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1
More informationA Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang
More informationTruncation Strategy of Tensor Compressive Sensing for Noisy Video Sequences
Journal of Information Hiding and Multimedia Signal Processing c 2016 ISSN 207-4212 Ubiquitous International Volume 7, Number 5, September 2016 Truncation Strategy of Tensor Compressive Sensing for Noisy
More information8.1 Concentration inequality for Gaussian random matrix (cont d)
MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration
More informationCVPR A New Tensor Algebra - Tutorial. July 26, 2017
CVPR 2017 A New Tensor Algebra - Tutorial Lior Horesh lhoresh@us.ibm.com Misha Kilmer misha.kilmer@tufts.edu July 26, 2017 Outline Motivation Background and notation New t-product and associated algebraic
More informationEE 381V: Large Scale Optimization Fall Lecture 24 April 11
EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that
More informationFunctional SVD for Big Data
Functional SVD for Big Data Pan Chao April 23, 2014 Pan Chao Functional SVD for Big Data April 23, 2014 1 / 24 Outline 1 One-Way Functional SVD a) Interpretation b) Robustness c) CV/GCV 2 Two-Way Problem
More informationSVD, PCA & Preprocessing
Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees
More information7 Principal Component Analysis
7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is
More informationTHE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR
THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR WEN LI AND MICHAEL K. NG Abstract. In this paper, we study the perturbation bound for the spectral radius of an m th - order n-dimensional
More informationA Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation
A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation Vu Malbasa and Slobodan Vucetic Abstract Resource-constrained data mining introduces many constraints when learning from
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More informationConditions for Robust Principal Component Analysis
Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and
More informationSUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA
SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,
More informationRank Determination for Low-Rank Data Completion
Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,
More informationLecture Notes 10: Matrix Factorization
Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationAn Effective Tensor Completion Method Based on Multi-linear Tensor Ring Decomposition
An Effective Tensor Completion Method Based on Multi-linear Tensor Ring Decomposition Jinshi Yu, Guoxu Zhou, Qibin Zhao and Kan Xie School of Automation, Guangdong University of Technology, Guangzhou,
More informationDimensionality Reduction Using the Sparse Linear Model: Supplementary Material
Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Ioannis Gkioulekas arvard SEAS Cambridge, MA 038 igkiou@seas.harvard.edu Todd Zickler arvard SEAS Cambridge, MA 038 zickler@seas.harvard.edu
More informationMachine Learning for Disease Progression
Machine Learning for Disease Progression Yong Deng Department of Materials Science & Engineering yongdeng@stanford.edu Xuxin Huang Department of Applied Physics xxhuang@stanford.edu Guanyang Wang Department
More informationLinear Algebra and Eigenproblems
Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details
More informationIterative Matching Pursuit and its Applications in Adaptive Time-Frequency Analysis
Iterative Matching Pursuit and its Applications in Adaptive Time-Frequency Analysis Zuoqiang Shi Mathematical Sciences Center, Tsinghua University Joint wor with Prof. Thomas Y. Hou and Sparsity, Jan 9,
More informationNumerical Methods I Non-Square and Sparse Linear Systems
Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant
More informationSparse orthogonal factor analysis
Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In
More informationSparse representation classification and positive L1 minimization
Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng
More informationSYMMETRIC MATRIX PERTURBATION FOR DIFFERENTIALLY-PRIVATE PRINCIPAL COMPONENT ANALYSIS. Hafiz Imtiaz and Anand D. Sarwate
SYMMETRIC MATRIX PERTURBATION FOR DIFFERENTIALLY-PRIVATE PRINCIPAL COMPONENT ANALYSIS Hafiz Imtiaz and Anand D. Sarwate Rutgers, The State University of New Jersey ABSTRACT Differential privacy is a strong,
More informationPrediction of double gene knockout measurements
Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair
More informationStatistical Data Analysis
DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the
More informationSolving Corrupted Quadratic Equations, Provably
Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin
More informationMachine Learning (Spring 2012) Principal Component Analysis
1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in
More informationDeep Learning Book Notes Chapter 2: Linear Algebra
Deep Learning Book Notes Chapter 2: Linear Algebra Compiled By: Abhinaba Bala, Dakshit Agrawal, Mohit Jain Section 2.1: Scalars, Vectors, Matrices and Tensors Scalar Single Number Lowercase names in italic
More informationStat 159/259: Linear Algebra Notes
Stat 159/259: Linear Algebra Notes Jarrod Millman November 16, 2015 Abstract These notes assume you ve taken a semester of undergraduate linear algebra. In particular, I assume you are familiar with the
More informationarxiv: v3 [math.oc] 8 Jan 2019
Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate
More informationRobust Principal Component Analysis
ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M
More informationarxiv: v1 [math.na] 22 Sep 2016
A Randomized Tensor Singular Value Decomposition based on the t-product Jiani Zhang Arvind K. Saibaba Misha E. Kilmer Shuchin Aeron arxiv:1609.07086v1 [math.na] Sep 016 September 3, 016 Abstract The tensor
More informationOn The Belonging Of A Perturbed Vector To A Subspace From A Numerical View Point
Applied Mathematics E-Notes, 7(007), 65-70 c ISSN 1607-510 Available free at mirror sites of http://www.math.nthu.edu.tw/ amen/ On The Belonging Of A Perturbed Vector To A Subspace From A Numerical View
More informationEE 381V: Large Scale Learning Spring Lecture 16 March 7
EE 381V: Large Scale Learning Spring 2013 Lecture 16 March 7 Lecturer: Caramanis & Sanghavi Scribe: Tianyang Bai 16.1 Topics Covered In this lecture, we introduced one method of matrix completion via SVD-based
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via nonconvex optimization Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationWafer Pattern Recognition Using Tucker Decomposition
Wafer Pattern Recognition Using Tucker Decomposition Ahmed Wahba, Li-C. Wang, Zheng Zhang UC Santa Barbara Nik Sumikawa NXP Semiconductors Abstract In production test data analytics, it is often that an
More informationDonald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016
Optimization for Tensor Models Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016 1 Tensors Matrix Tensor: higher-order matrix
More informationComputational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2017 Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix Tony Cai University
More informationAM205: Assignment 2. i=1
AM05: Assignment Question 1 [10 points] (a) [4 points] For p 1, the p-norm for a vector x R n is defined as: ( n ) 1/p x p x i p ( ) i=1 This definition is in fact meaningful for p < 1 as well, although
More informationTensor-Tensor Product Toolbox
Tensor-Tensor Product Toolbox 1 version 10 Canyi Lu canyilu@gmailcom Carnegie Mellon University https://githubcom/canyilu/tproduct June, 018 1 INTRODUCTION Tensors are higher-order extensions of matrices
More informationOptimization for Compressed Sensing
Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve
More informationA Randomized Algorithm for the Approximation of Matrices
A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive
More information