Robust subspace recovery by geodesically convex optimization
Teng Zhang

arxiv: v2 [stat.ml] 0 Jun 202

Abstract

We introduce Tyler's M-estimator to robustly recover the underlying linear model from a data set contaminated by outliers. We prove that the objective function of this estimator is geodesically convex on the manifold of all positive definite matrices and has a unique minimizer. Besides, we prove that when the inliers (i.e., the points that are not outliers) are sampled from a subspace and the percentage of outliers is bounded by some number, then under some very weak assumptions a commonly used algorithm for this estimator can recover the underlying subspace exactly. We also show that empirically this algorithm compares favorably with other convex algorithms of subspace recovery.

I. INTRODUCTION

This paper is about the following problem: suppose we are given a data set X with inliers sampled from a low-dimensional linear model and some arbitrary outliers; can we recover the underlying linear model? The primary tool for this problem is Principal Component Analysis (PCA). However, PCA is very sensitive to outliers. Considering the popularity of linear modeling, a robust algorithm that finds the underlying linear model will have many applications. This work introduces Tyler's M-estimator of covariance in [28] and proves that its objective function is geodesically convex on the manifold of all positive definite matrices. Moreover, this work proves that when the inliers are sampled from a subspace L, a commonly used algorithm for this estimator finds the underlying subspace L under very weak conditions that depend almost only on the percentage of outliers.

A. Notation conventions

We assume that we are given a data set X ⊂ R^D with N points. We define the projector Π_L as the D × D symmetric matrix such that Π_L^2 = Π_L and the range of Π_L is L. We define P_L as the D × dim(L) projection matrix from R^D to L, or equivalently, any P_L such that Π_L = P_L P_L^T.
We use L^⊥ to denote the orthogonal complement of L. We use X ∩ L to express the set of points that lie both in X and in the subspace L, and use X \ L to express the set of points that lie in X but not in the subspace L. We use |X| to denote the cardinality of the set X, S_+(D) to denote the set of all D × D positive semidefinite matrices, and S_++(D) to denote the set of all D × D positive definite matrices.

T. Zhang is with the Institute of Mathematics and its Applications, University of Minnesota, Minneapolis, MN, USA. zhang620@umn.edu.

B. Main results

In this paper we introduce the following estimator due to Tyler [28]:

  Σ̂ = argmin_{tr(Σ)=1, Σ=Σ^T, Σ∈S_++(D)} F(Σ), where F(Σ) = ∑_{x∈X} log(x^T Σ^{-1} x) + (N/D) log det(Σ),   (I.1)

and we obtain Σ̂ as the limit of the sequence Σ^{(k)} generated by the following iterative procedure in [28]:

  Σ^{(k+1)} = ∑_{x∈X} x x^T / (x^T (Σ^{(k)})^{-1} x) / tr(∑_{x∈X} x x^T / (x^T (Σ^{(k)})^{-1} x)).   (I.2)

We will explain the motivation for this estimator as an M-estimator of covariance in Section I-D, and show in Section III that the objective function F(Σ) is geodesically convex on S_++(D) and that, under the condition (III.1), the sequence Σ^{(k)} generated by (I.2) converges to the unique solution of (I.1). When the inliers lie exactly on the subspace L, then under some weak assumptions (depending almost only on the percentage of outliers) we can recover L exactly from lim_k Σ^{(k)}, which is a singular matrix with L as its range.

Theorem I.1. Suppose there exists a d-dimensional subspace L such that

  |X ∩ L| / |X| > d / D,   (I.3)

and the points in the sets Y = {P_L^T x : x ∈ X ∩ L} ⊂ R^d and Y_0 = {P_{L^⊥}^T x : x ∈ X \ L} ⊂ R^{D−d} lie in general position respectively (i.e., any k points in Y span a k-dimensional subspace for all k ≤ d, and any k points in Y_0 span a k-dimensional subspace for all k ≤ D − d). Then the sequence Σ^{(k)} generated by (I.2) converges to some Σ̂ such that im(Σ̂) = L.
The condition of general position is very weak: for example, when we choose inliers arbitrarily from a uniform distribution in a ball in L or from a Gaussian measure on L, and choose outliers arbitrarily from a uniform measure in a ball in R^D or from a Gaussian measure on R^D, this condition holds with probability 1. We remark that when the ambient dimension D → ∞ and the dimension of the subspace L is kept as the constant d, then d/D approaches 0 and the required percentage of inliers in Theorem I.1 approaches 0. This property makes Theorem I.1 particularly strong for high-dimensional data sets with a low-dimensional structure.
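For concreteness, the iteration (I.2) is a few lines of code. The sketch below is ours, not the paper's implementation; the function name is an assumption, and it simply repeats the trace-normalized update described above:

```python
import numpy as np

def tyler_fixed_point(X, num_iters=100):
    """Iterate (I.2): Sigma <- sum_x x x^T / (x^T Sigma^{-1} x), trace-normalized.

    X is an (N, D) array whose rows are the data points.
    """
    N, D = X.shape
    Sigma = np.eye(D) / D                      # Sigma^(0) with tr(Sigma) = 1
    for _ in range(num_iters):
        inv_Sigma = np.linalg.inv(Sigma)
        # w[i] = x_i^T Sigma^{-1} x_i for every data point
        w = np.einsum('ij,jk,ik->i', X, inv_Sigma, X)
        S = (X / w[:, None]).T @ X             # sum_x x x^T / (x^T Sigma^{-1} x)
        Sigma = S / np.trace(S)                # renormalize so that tr(Sigma) = 1
    return Sigma
```

On data satisfying (III.1) the returned matrix is (approximately) a fixed point of the update, i.e. T(Σ) = Σ in the notation of Section III.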
C. Previous work

Robust estimation of covariance has been well studied in the statistical literature, and it is a related topic to robust linear modeling since we can recover the linear model from the principal components of the estimated covariance. The M-estimators, L-estimators, MCD/MVE estimators and the Stahel-Donoho estimator have been proposed; for a complete review we refer the reader to [20, Section 6]. However, most of these methods are not convex (or their convexity is not analyzed), so the algorithms are intractable (or have unknown tractability). It is possible that the only exception is the M-estimators: the convergence of the associated algorithm has been analyzed in [3], and we will show later that, as a special M-estimator, F(Σ) is geodesically convex in the space of positive definite matrices (this is also shown in [32]). There are other methods that recover the linear model without estimating a robust covariance, such as Projection Pursuit [8], [], [22, Section 2], which finds the principal components directly by optimization on a sphere. Another common strategy is to fit the linear model by PCA after removing possible outliers [27], [33]. However, these methods are still nonconvex. Some recent works on robust linear modeling focus on convex optimization and tractable algorithms [34], [22], [35], [7]. Similar to Theorem I.1, these works provide conditions for exact subspace recovery. We remark that these conditions are more complicated than the condition in Theorem I.1, since they assume an incoherence condition that requires the inliers to be spread out on the subspace L, which is not required in Theorem I.1. Conditions of this kind are required in [34, Theorem 1], [35, (6), (7)]. In [7, Theorem 1.1] it is shown that exact recovery holds with high probability when |X ∩ L| > C_0 + C_1 (d/D) |X \ L|, which is a simple condition very similar to the condition in (I.3). However, this condition is obtained under the assumption that inliers and outliers are both sampled from Gaussian distributions.
In another recent work, Soltanolkotabi and Candès proved that the sparse subspace clustering (SSC) algorithm [7] can recover multiple subspaces with high probability, but this theory also has probabilistic assumptions: it assumes that inliers and outliers are both i.i.d. sampled from uniform measures on unit spheres [26, Theorem 1.3]. We remark that our condition (I.3) can sometimes be more restrictive than the corresponding conditions of other convex methods. For example, when the outliers have small magnitude and concentrate around the origin, the conditions in [35, Theorem 2] can tolerate more outliers. Similarly, the conditions in [34, Theorem 1] and [26, Theorem 1.3] can also tolerate more outliers than (I.3) in some settings. The advantage of our condition is that it is deterministic and simple, and empirically it is also usually less restrictive than the conditions in [34], [22], [35], [7].

D. M-estimators

In this section we show that the estimator (I.1) can be considered as a special M-estimator of covariance, and give background on the current research on this estimator. We start with the motivation: it is well known that the empirical covariance is the MLE for the covariance when we assume that all x ∈ X are i.i.d. drawn from a Gaussian distribution. As a natural generalization, M-estimators of covariance [9], [0], [2] consider the generalized distribution

  C(ρ) e^{−ρ(x^T Σ^{-1} x)} / √det(Σ),   (I.4)

where C(ρ) is a normalization constant, chosen so that the integral of the distribution equals one. It is a generalization since when ρ(x) = x, (I.4) gives a Gaussian distribution. When the data points are i.i.d. sampled from the distribution (I.4), the corresponding MLE of covariance is called an M-estimator, and it minimizes

  (1/N) ∑_{x∈X} ρ(x^T Σ^{-1} x) + (1/2) log det(Σ).   (I.5)

The objective function F(Σ) can be considered as the M-estimator with ρ(x) = (D/2) log(x).
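To make this correspondence explicit (assuming the normalizations (1/N) ∑ ρ + (1/2) log det for (I.5) and ∑ log + (N/D) log det for F, as used above), substituting ρ(x) = (D/2) log(x) into (I.5) gives a positive multiple of F(Σ), so minimizing (I.5) with this ρ is equivalent to minimizing F:

```latex
\frac{1}{N}\sum_{x\in\mathcal{X}} \frac{D}{2}\,\log\!\left(x^T\Sigma^{-1}x\right)
  + \frac{1}{2}\log\det(\Sigma)
= \frac{D}{2N}\left[\sum_{x\in\mathcal{X}} \log\!\left(x^T\Sigma^{-1}x\right)
  + \frac{N}{D}\log\det(\Sigma)\right]
= \frac{D}{2N}\,F(\Sigma).
```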
While for this choice of ρ the function in (I.4) is not a distribution, it can be considered as the limit of the following multivariate Student distribution as ν → 0 [20, page 87]:

  Γ[(ν + D)/2] / (Γ(ν/2) ν^{D/2} π^{D/2} √det(Σ)) · [1 + (1/ν) x^T Σ^{-1} x]^{−(ν+D)/2}.

Since the Student distribution has a heavy tail, it is expected that this estimator should be more robust to outliers. The reason that we enforce the condition tr(Σ) = 1 in (I.1) is the scale invariance property of F(Σ): for any constant c > 0 and Σ ∈ S_++(D), we have

  F(Σ) = F(cΣ).   (I.6)

This simple fact can be easily verified, but it will be used repeatedly in the analysis later. Tyler and Kent investigated the estimator (I.1) implicitly, by solving the equation F′(Σ) = 0 [28], [2]. They obtained the uniqueness of the solution (up to scaling) to F′(Σ) = 0 and showed that the algorithm (I.2) converges to this solution in [2, Theorem 2], under the assumption (III.1). This result is almost equivalent to Theorem III.1 and Theorem III.4 in this work, except that we consider the minimization of F(Σ) directly and also show the existence of the minimizer. An interesting claim in [28] is that this estimator is the most robust estimator of the scatter matrix of an elliptical distribution, in the sense of minimizing the maximum asymptotic variance. The geodesic convexity of the objective function F(Σ) was discovered later. In [2], Auderset et al. showed that the function is geodesically convex on the space {Σ ∈ S_++(D) : det(Σ) = 1}. After finishing this work, we learned that the geodesic convexity of F(Σ) on the space S_++(D) was recently independently investigated by Wiesel in [32, Proposition 1]. Wiesel also extended the convex analysis to the regularized Tyler's M-estimator in [32], and generalized it to LSE (logarithm of a sum of exponents) functions and the estimation of Kronecker-structured covariance in [30], [3].
E. Contributions and the structure of this paper

The main contribution of this work is that we introduce Tyler's M-estimator for subspace recovery, and justify this estimator by showing that the algorithm (I.2) can recover the underlying subspace exactly under rather weak assumptions on the distribution of the data points. Besides, we also apply geodesic convexity and a majorization-minimization argument to show the existence and uniqueness of the minimizer and the convergence of the algorithm. While these two facts are also observed in [32], the analysis in this paper is more careful and therefore proves the uniqueness of the minimizer and the pointwise convergence of the algorithm. The paper is organized as follows. In Section II, we introduce the background on the geometry of S_++(D) and geodesic convexity. With this background we prove the uniqueness of the solution to (I.1) and the convergence of the algorithm (I.2) in Section III. Then we perform some experiments that describe the performance of the algorithm (I.2) and verify Theorem I.1 in Section IV. Technical proofs are given in the Appendix.

II. PRELIMINARIES

Our analysis relies on basic concepts from the geometry of S_++(D) and geodesic convexity. For this purpose, in Section II-A we present a brief summary of the geometry of S_++(D), and in Section II-B we introduce the definition and some properties of geodesic convexity. For more details we refer the reader to [4], [29] on the geometry of S_++(D) and geodesic convexity.

A. Metric and geodesics on S_++(D)

The metric of S_++(D) has been well studied in the literature. Indeed, the trace metric in differential geometry [5, pg 326], the natural metric on the symmetric cone [8], [5], the affine-invariant metric [24], and the metric given by the Fisher information matrix for Gaussian covariance estimation [25] all give the same metric on S_++(D). For Σ_1, Σ_2 ∈ S_++(D), this metric is defined by:

  dist(Σ_1, Σ_2) = ‖log(Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})‖_F.   (II.1)
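The distance (II.1) can be computed directly from its definition. The following sketch is ours (the function name is an assumption); as a sanity check, the usage below verifies the affine invariance dist(CΣ_1C^T, CΣ_2C^T) = dist(Σ_1, Σ_2) for an invertible C:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def trace_metric(S1, S2):
    """Distance (II.1): || log(S1^{-1/2} S2 S1^{-1/2}) ||_F for S1, S2 in S_++(D)."""
    S1_inv_half = fractional_matrix_power(S1, -0.5)
    M = S1_inv_half @ S2 @ S1_inv_half   # symmetric, similar to S1^{-1} S2
    return np.linalg.norm(logm(M), 'fro')
```

Usage, confirming dist(Σ, Σ) = 0 and affine invariance on random positive definite matrices:

```python
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); S1 = A @ A.T + np.eye(4)
B = rng.standard_normal((4, 4)); S2 = B @ B.T + np.eye(4)
C = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # invertible transformation
d1 = trace_metric(S1, S2)
d2 = trace_metric(C @ S1 @ C.T, C @ S2 @ C.T)     # equal to d1 up to round-off
```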
Based on this metric, the unique geodesic γ_{Σ_1Σ_2}(t) (0 ≤ t ≤ 1) connecting Σ_1 and Σ_2 is given by [4, (6.1)]:

  γ_{Σ_1Σ_2}(t) = Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^t Σ_1^{1/2}.   (II.2)

It follows that the midpoint of Σ_1 and Σ_2 is γ_{Σ_1Σ_2}(1/2) = Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^{1/2} Σ_1^{1/2}. We remark that this midpoint is also called the geometric mean of Σ_1 and Σ_2 [4, Section 4.1].

B. Geodesic convexity

Geodesic convexity is a natural generalization of convexity to Riemannian manifolds [29, Chapter 3.2]. Given a Riemannian manifold M and a set A ⊂ M, we say a function f : A → R is geodesically convex if every geodesic γ_xy of M with endpoints x, y ∈ A (i.e., γ_xy is a function from [0,1] to M with γ_xy(0) = x and γ_xy(1) = y) lies in A, and

  f(γ_xy(t)) ≤ (1−t) f(x) + t f(y) for any x, y ∈ A and 0 < t < 1.   (II.3)

Following the proof of [23, Theorem 1.1.4], for a continuous function, geodesic midpoint convexity is equivalent to geodesic convexity:

Lemma II.1. Let f : A → R be a continuous function. If

  f(γ_xy(1/2)) ≤ (f(x) + f(y))/2 for any x, y ∈ A,   (II.4)

then f is a geodesically convex function.

III. PROPERTIES OF THE OBJECTIVE FUNCTION AND THE ALGORITHM

In this section we study the properties of the objective function F(Σ) and the algorithm in (I.2). We show that under a very mild assumption, the solution to (I.1) is unique and the sequence Σ^{(k)} converges to it. We will also discuss Theorem I.1, the empirical algorithm and some implementation issues in this section.

A. Uniqueness of the solution

We first show the existence and uniqueness of the solution to (I.1) under a rather weak assumption.

Theorem III.1. If for any proper linear subspace L' we have

  |X ∩ L'| / N < dim(L') / D,   (III.1)

then the solution of (I.1) exists and is unique.

Indeed, for real data sets that contain noise, (III.1) is usually satisfied when the dimension is smaller than the number of points: in a noisy data set, generally any d-dimensional linear subspace contains at most d points. An important remark is that the condition (III.1) is incompatible with the condition on the percentage of inliers in Theorem I.1. Indeed, under the condition of Theorem I.1 the solution to (I.1)
does not exist: one may verify that F((Π_L + εI)/tr(Π_L + εI)) converges to −∞ as ε → 0, while (Π_L + εI)/tr(Π_L + εI) converges to a singular matrix, where F(Σ) is undefined. The proof of Theorem III.1 depends on the following two lemmas, whose proofs are presented in the appendix. In short, Lemma III.2 guarantees the uniqueness of the solution and Lemma III.3 guarantees the existence of the solution. While (III.2) is also proved in [32, Proposition 1], we additionally establish the condition for equality in Lemma III.2, which implies the uniqueness of the minimizer of (I.1).

Lemma III.2. F(Σ) is geodesically convex on the manifold S_++(D). That is, for any Σ_1, Σ_2 ∈ S_++(D), we have

  F(Σ_1) + F(Σ_2) ≥ 2 F(Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^{1/2} Σ_1^{1/2}).   (III.2)

When span{X} = R^D, equality in (III.2) holds if and only if Σ_1 = cΣ_2.
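Lemma III.2 can be checked numerically. The sketch below is ours and assumes the objective F(Σ) = ∑_{x∈X} log(x^T Σ^{-1} x) + (N/D) log det(Σ) as in (I.1); it evaluates F at two random positive definite matrices and at their geometric mean:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def F(X, Sigma):
    """Objective (I.1) (without the trace constraint)."""
    N, D = X.shape
    w = np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X)
    sign, logdet = np.linalg.slogdet(Sigma)
    return np.sum(np.log(w)) + (N / D) * logdet

def geometric_mean(S1, S2):
    """Geodesic midpoint: S1^{1/2} (S1^{-1/2} S2 S1^{-1/2})^{1/2} S1^{1/2}."""
    R = fractional_matrix_power(S1, 0.5)
    Rinv = fractional_matrix_power(S1, -0.5)
    # np.real discards negligible imaginary round-off from the matrix powers
    return np.real(R @ fractional_matrix_power(Rinv @ S2 @ Rinv, 0.5) @ R)

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 4))
A = rng.standard_normal((4, 4)); S1 = A @ A.T + np.eye(4)
B = rng.standard_normal((4, 4)); S2 = B @ B.T + np.eye(4)
# Geodesic midpoint convexity (III.2): 2 F(midpoint) <= F(S1) + F(S2).
assert 2 * F(X, geometric_mean(S1, S2)) <= F(X, S1) + F(X, S2) + 1e-9
```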
Lemma III.3. Under the condition (III.1), we have

  F(Σ) → ∞ as λ_min(Σ) → 0 with tr(Σ) = 1,   (III.3)

where λ_min(Σ) is the smallest eigenvalue of Σ.

Now we are ready to prove Theorem III.1.

Proof: We first prove the uniqueness of the solution to (I.1). If Σ_1 and Σ_2 are both solutions to (I.1), then applying (III.2) and the scale invariance (I.6), we have F(Σ_3) ≤ F(Σ_1) = F(Σ_2) for

  Σ_3 = Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^{1/2} Σ_1^{1/2} / tr(Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^{1/2} Σ_1^{1/2}).

Since Σ_1 and Σ_2 are both minimizers of F(Σ), we have F(Σ_3) = F(Σ_1) = F(Σ_2), and by the condition for equality in Lemma III.2 (the assumption span{X} = R^D in Lemma III.2 holds; otherwise (III.1) would fail for L' = span{X}), we have Σ_1 = cΣ_2. However, tr(Σ_1) = tr(Σ_2) = 1, therefore Σ_1 = Σ_2, which contradicts our assumption, and the uniqueness of the solution to (I.1) is proved. Now we prove the existence of the solution. First, there exists a sequence {Σ_i} ⊂ {Σ ∈ S_++(D) : tr(Σ) = 1} such that F(Σ_i) converges to inf_{tr(Σ)=1, Σ∈S_++(D)} F(Σ). By compactness there is a converging subsequence of {Σ_i}; by Lemma III.3 this subsequence does not converge to a singular matrix, and therefore it converges to some matrix Σ_0 ∈ S_++(D). By the continuity of F(Σ) we obtain F(Σ_0) = inf_{tr(Σ)=1, Σ∈S_++(D)} F(Σ), and therefore Σ_0 is a solution to (I.1).

B. Convergence of the algorithm

In this section we prove the convergence of the sequence Σ^{(k)} generated by (I.2) under the assumption (III.1), and we also discuss its connection to Theorem I.1, which concerns the convergence of the sequence Σ^{(k)} under another assumption. We begin with the motivation of the procedure (I.2). If we set the derivative of F(Σ) with respect to Σ to zero, we have

  (d/dΣ) F(Σ) = −Σ^{-1} (∑_{x∈X} x x^T / (x^T Σ^{-1} x)) Σ^{-1} + (N/D) Σ^{-1} = 0,

that is, Σ = (D/N) ∑_{x∈X} x x^T / (x^T Σ^{-1} x). Since we minimize F(Σ) under the constraint tr(Σ) = 1, we have

  Σ = ∑_{x∈X} x x^T / (x^T Σ^{-1} x) / tr(∑_{x∈X} x x^T / (x^T Σ^{-1} x)),

whose RHS is the update formula (I.2); this motivates the iteration as a fixed-point procedure for the minimizer. Theorem III.4 shows that Σ^{(k)} converges to the solution of (I.1) under the assumption (III.1). Similar to [32], its proof uses the majorization-minimization argument.
However, the analysis here is more complete in the sense that it proves the convergence of the sequence Σ^{(k)}, while the argument in [32] only leads to the convergence of the objective values F(Σ^{(k)}).

Theorem III.4. When the condition (III.1) holds, the sequence Σ^{(k)} generated by (I.2) converges to the unique solution of (I.1).

This theorem also implies that the condition |X ∩ L|/|X| > d/D in Theorem I.1 is almost necessary. Indeed, if |X ∩ L|/|X| < d/D, then the condition (III.1) is usually satisfied, and by Theorem III.1 the solution to (I.1) exists (and by definition is nonsingular), and by Theorem III.4 the sequence Σ^{(k)} converges to this nonsingular matrix. Therefore we cannot recover L from its range. This also shows a phase transition phenomenon at |X ∩ L|/|X| = d/D. For simplicity, in the proof we define the operator T : S_+(D) → S_+(D) as

  T(Σ) = ∑_{x∈X} x x^T / (x^T Σ^{-1} x) / tr(∑_{x∈X} x x^T / (x^T Σ^{-1} x)).   (III.4)

The main ingredient of the proof is the observation that Σ^{(k+1)} = T(Σ^{(k)}) can be considered as the minimizer of a majorization function G(Σ, Σ^{(k)}) of F(Σ) such that G(Σ, Σ^{(k)}) ≥ F(Σ) and G(Σ^{(k)}, Σ^{(k)}) = F(Σ^{(k)}). In this sense it can be considered as an algorithm following the majorization-minimization (MM) principle []. We remark that similar observations are also used in the convergence proofs of other iteratively reweighted least squares (IRLS) algorithms such as [4], [6], [35], [7]. When the condition in Theorem I.1 holds, the assumption (III.1) is violated, and by our analysis in Section III-A the solution to (I.1) does not exist. However, Theorem I.1 shows that the sequence Σ^{(k)} still converges, and it converges to a singular matrix. Due to its complexity, we defer its proof to the appendix.

C. Empirical algorithm and implementation issues

Since the solution to (I.1) can be considered a robust estimator of covariance, empirically we can simply recover the underlying d-dimensional subspace by the span of its top d eigenvectors. Our empirical algorithm is summarized in Algorithm 1.
In each iteration the major computational cost is due to the calculation of the inverse of Σ^{(k)} and of x^T (Σ^{(k)})^{-1} x for all x ∈ X, so the cost is of order O(N D^2) when N ≥ D. We will show later in Section IV-C that the algorithm exhibits linear convergence. In our implementation we stop the algorithm after the k-th iteration when ‖Σ^{(k)} − Σ^{(k−1)}‖_F / ‖Σ^{(k−1)}‖_F < 10^{−8}. In this paragraph we describe an empirical problem where the algorithm breaks down at some iteration step, and describe a way to overcome it. If the condition in Theorem I.1 holds, λ_min(Σ^{(k)}) converges to 0 as k → ∞, and it is nonzero for each k. However, in implementation, due to rounding error,
when k is very large, λ_min(Σ^{(k)}) is very close to zero, the computed Σ^{(k)} could be a non-positive-definite matrix or a matrix with an imaginary part, and the convergence of Algorithm 1 fails. Therefore in implementation we check the value of min_{x∈X} x^T (Σ^{(k)})^{-1} x in each iteration, and stop the algorithm when it is negative or has an imaginary part. We remark that this breakdown will not happen for real data sets or synthetic data sets with noise, since in these cases Σ^{(k)} converges to a nonsingular positive definite matrix and the rounding error will not make Σ^{(k)} a non-positive-definite matrix or a matrix with an imaginary part.

Algorithm 1 Empirical algorithm for recovering a d-dimensional subspace
Input: X ⊂ R^D: data set, d: dimension of the subspace.
Output: L*: a d-dimensional linear subspace.
Steps:
Initialization: Σ^{(0)} = I, k = 0.
Repeat (1)-(2) until convergence:
(1) k = k + 1,
(2) Σ^{(k)} = ∑_{x∈X} x x^T / (x^T (Σ^{(k−1)})^{-1} x) / tr(∑_{x∈X} x x^T / (x^T (Σ^{(k−1)})^{-1} x)).
Let Σ* be the limit of the sequence Σ^{(k)}, and let L* be the span of the top d eigenvectors of Σ*.

D. Discussion on spherical projection

A simple and powerful method to enhance the robustness of an algorithm to outliers is to preprocess the data set by projecting the data points onto a unit sphere. Empirically it enhances the robustness of the PCA and Reaper algorithms significantly [7, Section 5]. Therefore a natural question is whether it can also be applied in our algorithm. Interestingly, spherical projection has been implicitly applied in the objective function F(Σ) and in our algorithm: one may verify that the magnitude of any point in X impacts neither the solution to (I.1) nor the update formula (I.2).

IV. NUMERICAL EXPERIMENTS

In this section, we present some numerical experiments on Algorithm 1 to assess its empirical performance. We also show that our algorithm outperforms other convex algorithms for robust PCA on a real data set.

A. Model for simulation

In Sections IV-B-IV-D, we apply our algorithm to data generated from the following model.
We choose a d-dimensional subspace L, sample N_1 points i.i.d. from the Gaussian distribution N(0, Π_L) on L, and sample N_0 outliers i.i.d. from the uniform distribution on the cube [0,1]^D. In some experiments we also add Gaussian noise N(0, ε^2 I) to each point. We use the uniform distribution on [0,1]^D for the outliers to show that our algorithm allows anisotropic outliers.

Fig. 1. The dependence of the recovery error on the number of inliers: the x-axis is the number of inliers and the y-axis is the corresponding recovery error.

B. Exact recovery of the subspace

In this section we use the model in Section IV-A, and choose D = 10 or 50, d = 5, N_0 = 100 and different values of N_1. The mean recovery error ‖Π_L̂ − Π_L‖_F over 20 experiments is recorded in Figure 1, where L̂ is obtained by Algorithm 1 and L is the true underlying subspace. Theorem I.1 guarantees that ‖Π_L̂ − Π_L‖_F = 0 for N_1 > 100 when D = 10 and for N_1 > 100/9 when D = 50, and this is verified in this experiment. When D = 50 and N_1 is near the threshold there is a small nonzero recovery error, which seems to contradict Theorem I.1. But we remark that in this case the convergence is slow, and we stop the algorithm at the 1000-th iteration without it really converging to the solution of (I.1). We expect that exact recovery of the subspace L could be obtained after a larger number of iterations of Algorithm 1.
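The simulation model of Section IV-A and Algorithm 1 can be put together in a short script. The sketch below is ours (function name and iteration budget are assumptions, and it includes the breakdown guard of Section III-C); with an inlier fraction above d/D, as required by (I.3), the recovered subspace should be close to L:

```python
import numpy as np

def recover_subspace(X, d, num_iters=200):
    """Algorithm 1 sketch: iterate (I.2), return a basis of the top-d eigenspace."""
    N, D = X.shape
    Sigma = np.eye(D) / D
    for _ in range(num_iters):
        w = np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X)
        if not np.all(np.isfinite(w)) or np.min(w) <= 0:
            break                          # numerical breakdown guard (Sec. III-C)
        S = (X / w[:, None]).T @ X
        Sigma = S / np.trace(S)
    _, vecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    return vecs[:, -d:]

# Section IV-A model: inliers Gaussian on L = span(e_1, ..., e_d),
# outliers uniform in the cube [0, 1]^D; here N_1/N = 0.75 > d/D = 0.5.
rng = np.random.default_rng(0)
D, d, N1, N0 = 10, 5, 150, 50
inliers = np.hstack([rng.standard_normal((N1, d)), np.zeros((N1, D - d))])
outliers = rng.uniform(size=(N0, D))
X = np.vstack([inliers, outliers])
U = recover_subspace(X, d)
err = np.linalg.norm(U @ U.T - np.diag([1.0] * d + [0.0] * (D - d)))
```

By Theorem I.1 the error is exactly zero in the limit; after finitely many iterations `err` is only expected to be small.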
Fig. 2. Convergence rate for simulated data sets. See the text in Section IV-C for more details of the experiment.

Fig. 3. Robustness to noise: the x-axis represents the size of the Gaussian noise ε and the y-axis represents the recovery error.

C. Convergence rate

In this section we show that empirically the algorithm converges linearly. In the left panel of Figure 2, we show the convergence rate for simulated data sets with D = 10, d = 5, N_0 = 100 and N_1 = 80, 100, 120. Additionally we add Gaussian noise with ε = 0.01. The x-axis represents the number of iterations k and the y-axis is ‖Σ^{(k)} − Σ^{(K)}‖_F, where K is the total number of iterations in Algorithm 1. In the right panel we show a different convergence rate: for two simulations with no noise we plot ‖Π_{L_k} − Π_L‖_F with respect to the number of iterations k, where L_k is the span of the first d eigenvectors of Σ^{(k)}. We use the settings (N_1, N_0, D, d) = (120, 100, 10, 5) and (120, 100, 50, 5), since by Theorem I.1 lim_k ‖Π_{L_k} − Π_L‖_F = 0. From the right panel of Figure 2 we see that the recovery error also converges linearly.

D. Robustness to noise

In this section we investigate empirically the robustness of our algorithm to noise, using simulated data sets sampled according to Section IV-A with (N_1, N_0, D, d) = (120, 100, 10, 5) and different sizes of noise ε. We use this setting since when ε = 0 we recover the subspace exactly. We record the recovery error with respect to the size of the noise in Figure 3. In this experiment the recovery error depends linearly on the size of the noise ε. Indeed, we consider a theory that explains the performance of Algorithm 1 under noise of small size to be an interesting future question.

E. Faces in a Crowd

In this section we test our algorithm on the Faces in a Crowd experiment in [7, Section 5.4].
The goal of this experiment is to show that our algorithm can be used to robustly learn the structure of face images. Linear modeling is applicable here since the images of faces of the same person lie on a 9-dimensional subspace [3]. In this experiment we learn the subspace from a data set that contains 32 face images of one person from the Extended Yale Face Database [6] and 400 random images from the BACKGROUND/Google folder of the Caltech101 database [9]. The images are converted to grayscale and downsampled. We preprocess the images by subtracting their Euclidean median, and apply Algorithm 1 to this data set to obtain a 9-dimensional subspace; then we use 32 other images of the same person to test how well the learned subspace fits these images. This experiment is also used in [7, Section 5.4]; therefore we only compare our algorithm with S-Reaper, which has been shown to perform better than the PCA, spherical PCA, LLD and Reaper algorithms. The PCA algorithm is still included for comparison since it is the basic technique of linear modeling. Figure 4 shows five images and their projections onto the 9-dimensional subspaces fitted by PCA, S-Reaper and our algorithm (which is labeled as M-estimator due to the argument in Section I-D), respectively. Figure 4 shows that our algorithm visually performs better than S-Reaper, especially
for the test images. This observation can also be quantitatively verified by checking the distances of the 32 test images to the subspaces fitted by PCA, S-Reaper and our algorithm, which are shown in Figure 5. The subspace generated by our algorithm has smaller distances to the test images, which explains its better performance in Figure 4. Besides, in this experiment our algorithm runs much faster than S-Reaper: our algorithm costs 4.4 seconds on a machine with an Intel Core 2 Duo CPU at 3.00GHz and 6GB memory, while S-Reaper costs 40 seconds. This is expected, since there is an additional eigenvalue decomposition in each iteration of the S-Reaper algorithm.

Fig. 4. The projection of images to the fitted subspaces (original images and the projections by PCA, S-Reaper and the M-estimator).

Fig. 5. Ordered distances of the 32 test images to the fitted 9-dimensional subspaces by Algorithm 1, S-Reaper and PCA.

V. DISCUSSION

In this paper we have investigated an M-estimator for covariance estimation, and proved that this estimator can find the underlying subspace exactly under a rather weak assumption. We also demonstrated the virtue of this method by experiments on simulated and real data sets. An open question is whether we can have a theoretical guarantee on the robustness of our algorithm to noise and thereby verify the empirical performance in Section IV-D. We find it difficult to apply the commonly used perturbation analysis of [35, Section 2.7] or [34, Theorem 2], which is based on the size of the perturbation of the objective function, since the objective function F(Σ) is undefined at a singular matrix. An interesting direction is to extend the idea of geodesic convexity to other problems. The Euclidean metric between matrices is usually used, and under this metric the set of all positive definite matrices is considered as a cone. However, in this work we consider the set of all positive definite matrices as a manifold and use the Riemannian metric between matrices. It turns out that while F(Σ) is nonconvex in the Euclidean metric, it is convex in the Riemannian metric, and this formulation is more powerful than similar formulations that are convex in the Euclidean metric [35], [7]. It would be interesting to find other optimization problems with the property of geodesic convexity.

VI. ACKNOWLEDGEMENT

The author would like to thank Michael McCoy for reading an earlier version of this manuscript and for helpful comments. The author is grateful to Lek Heng Lim for introducing the book [4] and for helpful discussions.

VII. APPENDIX

A. Proof of Lemma III.2

Proof: The geodesic convexity of F(Σ) follows from (III.2) and Lemma II.1; therefore we only need to prove (III.2). We will prove (III.2) by showing that, if Σ_3 ∈ S_++(D) is the geometric mean of Σ_1, Σ_2 ∈ S_++(D), then we have

  ln(det(Σ_1)) + ln(det(Σ_2)) = 2 ln(det(Σ_3)),   (VII.1)

and

  ln(x^T Σ_1^{-1} x) + ln(x^T Σ_2^{-1} x) ≥ 2 ln(x^T Σ_3^{-1} x).   (VII.2)

We start with the proof of (VII.1). Using (II.2) with t = 1/2, we have

  Σ_3 Σ_1^{-1} Σ_3 = Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^{1/2} Σ_1^{1/2} Σ_1^{-1} Σ_1^{1/2} (Σ_1^{-1/2} Σ_2 Σ_1^{-1/2})^{1/2} Σ_1^{1/2} = Σ_2.   (VII.3)

Using (VII.3), (VII.1) can be proved as follows:

  det(Σ_2) = det(Σ_3 Σ_1^{-1} Σ_3) = det(Σ_3) det(Σ_1^{-1}) det(Σ_3) = det(Σ_3)^2 / det(Σ_1).

To prove (VII.2), we let the SVD decomposition of Σ_1^{-1/2} Σ_2 Σ_1^{-1/2} be U_0 Σ_0 U_0^T and define x̂ = U_0^T Σ_1^{-1/2} x; then we have x^T Σ_1^{-1} x = x̂^T x̂, x^T Σ_2^{-1} x = x̂^T Σ_0^{-1} x̂, and x^T Σ_3^{-1} x =
x̂^T Σ_0^{-1/2} x̂. Assume that Σ_0 is a diagonal matrix with diagonal entries σ_1, σ_2, ..., σ_p and x̂ = (x̂_1, x̂_2, ..., x̂_p)^T; then (VII.2) is equivalent to

  (∑_{i=1}^p σ_i^{-1} x̂_i^2) (∑_{i=1}^p x̂_i^2) ≥ (∑_{i=1}^p σ_i^{-1/2} x̂_i^2)^2,

which can be verified by the Cauchy-Schwarz inequality. Therefore (VII.2) is proved. Finally we find the condition such that equality in (III.2) holds. By the proof of geodesic convexity above, it holds only when equality in (VII.2) holds for every x ∈ X. By the condition for equality in the Cauchy-Schwarz inequality, equality in (III.2) holds only when there exists c ∈ R such that σ_i = c for every coordinate index 1 ≤ i ≤ D with x̂_i ≠ 0. When Σ_1 ≠ cΣ_2, the σ_i are not all the same number for 1 ≤ i ≤ D. Therefore there exists 1 ≤ i ≤ D such that x̂_i = 0; that is, there exists a hyperplane in R^D such that x̂ lies on it. Since x̂ is a linear transformation of x, if equality in (VII.2) holds for every x ∈ X, then there exists a hyperplane that contains all x ∈ X, which contradicts our assumption that span{X} = R^D.

B. Proof of Theorem III.4

First we prove that the operator T is monotone with respect to the objective function F: F(T(Σ)) ≤ F(Σ), and equality holds for Σ ∈ S_++(D) only when T(Σ) = Σ. We prove it by constructing the following majorization function of F(Σ):

  G(Σ, Σ_1) = ⟨∑_{x∈X} x x^T / (x^T Σ_1^{-1} x), Σ^{-1}⟩ + (N/D) log det(Σ) + C,   (VII.4)

where C is chosen such that G(Σ_1, Σ_1) = F(Σ_1). The fact that G(Σ, Σ_1) ≥ F(Σ) can be proved by checking the first and second derivatives of G(Σ, Σ_1) − F(Σ) with respect to Σ. It is easy to verify that the unique minimizer of G(·, Σ_1) is Σ̃ = (D/N) ∑_{x∈X} x x^T / (x^T Σ_1^{-1} x), which is a scaled version of T(Σ_1). Therefore we prove the monotonicity of T as follows:

  F(T(Σ)) = F(Σ̃) ≤ G(Σ̃, Σ) ≤ G(Σ, Σ) = F(Σ).   (VII.5)

Because of the uniqueness of the minimizer of G(·, Σ), equality in the second inequality of (VII.5) holds only when Σ̃ = Σ. Since Σ̃ = cT(Σ) and tr(Σ) = tr(T(Σ)) = 1, equality in (VII.5) holds only when T(Σ) = Σ.
Therefore the sequence F(Σ^{(k)}) is monotone, and any accumulation point Σ̂ of the sequence {Σ^{(k)}} satisfies F(T(Σ̂)) = F(Σ̂) and therefore T(Σ̂) = Σ̂. Applying T(Σ̂) = Σ̂, we have

  Σ̂^{-1} ∑_{x∈X} x x^T / (x^T Σ̂^{-1} x) = cI, for some c ∈ R.   (VII.6)

Let A = log(Σ); applying log det(Σ) = tr(A) and (d/dA) exp(A) = exp(A), the derivative of F(Σ) with respect to A is

  (d/dA) F(Σ) = −∑_{x∈X} Σ^{-1} x x^T / (x^T Σ^{-1} x) + (N/D) I.

Since the set {A : A = log(Σ), det(Σ) = 1} = {A : tr(A) = 0}, applying (VII.6), the derivative of F(Σ) with respect to A in the set {Σ : det(Σ) = 1} is 0 at c_0 Σ̂, where c_0 is a number chosen such that det(c_0 Σ̂) = 1. Since both the set {Σ : det(Σ) = 1} and F(Σ) are geodesically convex (see (VII.1) for the convexity of the set), c_0 Σ̂ is the unique minimizer of F(Σ) in the set {Σ : det(Σ) = 1}. Applying the scale invariance of F(Σ) in (I.6), Σ̂ is the unique minimizer in the set {Σ : tr(Σ) = 1}, which means that Σ̂ is also the unique solution to (I.1).

C. Proof of Lemma III.3

Proof: If Lemma III.3 does not hold, then there exists a sequence Σ_m such that it converges to some Σ̄ ∈ S_+(D) \ S_++(D) and the sequence F(Σ_m) is bounded. WLOG we assume that λ_j(Σ_m) and v_j(Σ_m) also converge for every 1 ≤ j ≤ D, where λ_j(Σ) and v_j(Σ) are the j-th eigenvalue and eigenvector of Σ. This can be assumed since any sequence has a subsequence satisfying this property (the eigenvectors and eigenvalues of Σ_m lie in a compact space). We prove (III.3) by induction on the ambient dimension D. When D = 2, we have dim(ker(Σ̄)) = 1, and

  F(Σ_m) ≥ ∑_{x∈X\ker(Σ̄)} (−log(λ_2(Σ_m)) + 2 log|x^T v_2(Σ_m)|) + (N/2) log(λ_2(Σ_m)) + (N/2) log(λ_1(Σ_m)).   (VII.7)

When x ∉ ker(Σ̄), we have inf_m |x^T v_2(Σ_m)| > 0, so the term 2 log|x^T v_2(Σ_m)| is bounded from below. Applying the assumption that λ_1(Σ_m) is bounded from below, (N/2) log(λ_1(Σ_m)) is also bounded from below. Applying the assumption |X \ ker(Σ̄)| > N/2 and lim_m λ_2(Σ_m) = 0, the RHS of (VII.7) converges to +∞, which contradicts the assumption that F(Σ_m) is bounded; therefore (III.3) is proved. Now, assuming that (III.3) holds for all ambient dimensions smaller than D_0, we prove (III.3) for ambient dimension D_0.
By the assumption on the convergence of the eigenvectors and eigenvalues of Σ_m, to prove (III.3) it suffices to prove that

F_1(Σ̃_m) → ∞ as m → ∞,   (VII.8)

where Σ̃_m = P_{L̃}^T Σ_m P_{L̃}, L̃ = ker(Σ̄), d_0 = dim(L̃), and F_1 : S_++(d_0) → R is defined by

F_1(Σ) = sum_{x∈X\L̃} log( (P_{L̃}^T x)^T Σ^{-1} (P_{L̃}^T x) ) + (N/D_0) log det(Σ).
An important observation is that lim_m tr(Σ̃_m) = 0. Combining this with |X \ L̃| > N d_0 / D_0, we have

lim_m ( F_1(Σ̃_m) − F_1(Σ̃_m / tr(Σ̃_m)) ) = lim_m (N d_0/D_0 − |X \ L̃|) log(tr(Σ̃_m)) = ∞.   (VII.9)

When Σ̃_m / tr(Σ̃_m) converges to a nonsingular matrix Σ̃,

lim_m F_1(Σ̃_m / tr(Σ̃_m)) = F_1(Σ̃) = C   (VII.10)

for some constant C, and when Σ̃_m / tr(Σ̃_m) converges to a singular matrix, by induction

lim_m F_1(Σ̃_m / tr(Σ̃_m)) = ∞.   (VII.11)

Combining (VII.9), (VII.10) and (VII.11), (VII.8) is proved, and therefore Lemma III.3 is proved by induction.

D. Proof of Theorem I.1

The roadmap of the proof is as follows. We denote the set of outliers by X_0 = X \ L and the set of inliers by X_1 = X ∩ L, and let N_1 = |X_1|, N_0 = |X_0|. Assume that the solutions of (I.1) for the sets Y_1 and Y_0 are I_d/d and I_{D−d}/(D−d) respectively; then we will prove that

lim_k Σ^{(k)} = Π_L / d,   (VII.12)

which implies Theorem I.1. WLOG we can make these assumptions on the solutions of (I.1) for the sets Y_1 and Y_0: since the points in Y_1 and Y_0 lie in general position, applying Theorem III.1 the solutions to (I.1) for the sets Y_1 and Y_0 are nonsingular. Assuming the solutions of (I.1) for the sets Y_1 and Y_0 are ˆΣ_1 and ˆΣ_2 respectively, the following set X̃, which is a linear transformation of X, satisfies that the solutions to (I.1) for the sets Ỹ_1 and Ỹ_2 (generated by X̃) are I_d/d and I_{D−d}/(D−d):

X̃ = { ˆΣ_1^{-1/2} Π_L x + ˆΣ_2^{-1/2} Π_{L^⊥} x : x ∈ X }.

If the algorithm in (I.2) for X̃ converges to Π_L/d, then by the linear transformation the algorithm for X converges to a scaled version of P_L ˆΣ_1 P_L^T, whose range also lies in L. Therefore, to prove Theorem I.1 we only need to prove (VII.12).

Now we start to prove (VII.12). Using the update formula in (I.2) and the assumption that the solutions of (I.1) for the sets Y_1 and Y_0 are I_d/d and I_{D−d}/(D−d), we have

( sum_{x∈X_1} (P_L^T x x^T P_L)/‖P_L^T x‖² ) / tr( sum_{x∈X_1} (P_L^T x x^T P_L)/‖P_L^T x‖² ) = I_d / d,   (VII.13)

( sum_{x∈X_0} (P_{L^⊥}^T x x^T P_{L^⊥})/‖P_{L^⊥}^T x‖² ) / tr( sum_{x∈X_0} (P_{L^⊥}^T x x^T P_{L^⊥})/‖P_{L^⊥}^T x‖² ) = I_{D−d} / (D−d).   (VII.14)

By checking the trace of the numerator of the LHS in (VII.13) and (VII.14) we have

sum_{x∈X_1} (P_L^T x x^T P_L)/‖P_L^T x‖² = (N_1/d) I_d,   (VII.15)

sum_{x∈X_0} (P_{L^⊥}^T x x^T P_{L^⊥})/‖P_{L^⊥}^T x‖² = (N_0/(D−d)) I_{D−d}.   (VII.16)
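The exact-recovery claim (VII.12) can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's code: it takes the iteration (I.2) to be the trace-normalized fixed point Σ ← M(Σ)/tr(M(Σ)) with M(Σ) = sum_{x∈X} xx^T/(x^T Σ^{-1} x), places the inliers exactly on a d-dimensional coordinate subspace L, and checks that the L^⊥ block of Σ^{(k)} vanishes, i.e., that the range of the limit is L; the helper names (`step`, `objective`) are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n_in, n_out = 4, 2, 60, 20   # inlier fraction 0.75 > d/D = 0.5

# Inliers lie exactly on L = span(e_1, e_2); outliers are generic points in R^D.
X_in = np.zeros((n_in, D))
X_in[:, :d] = rng.standard_normal((n_in, d))
X_out = rng.standard_normal((n_out, D))
X = np.vstack([X_in, X_out])
N = len(X)

def step(S):
    """One trace-normalized fixed-point update of Tyler's M-estimator."""
    w = 1.0 / np.einsum('ni,ij,nj->n', X, np.linalg.inv(S), X)
    M = (X.T * w) @ X          # sum_x x x^T / (x^T S^{-1} x)
    return M / np.trace(M)

def objective(S):
    q = np.einsum('ni,ij,nj->n', X, np.linalg.inv(S), X)
    return np.log(q).sum() + (N / D) * np.linalg.slogdet(S)[1]

Sigma = np.eye(D) / D
values = []
for _ in range(50):
    values.append(objective(Sigma))
    Sigma = step(Sigma)

off_L = np.abs(Sigma[d:, d:]).max()   # L-perp block should decay geometrically
monotone = all(b <= a + 1e-8 for a, b in zip(values, values[1:]))
```

In this run the objective decreases monotonically (Theorem III.4) and the L^⊥ block of Σ^{(k)} shrinks toward zero, so the top-d eigenvectors of the limit span the inlier subspace L.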
Applying (VII.15) and (VII.16) we have

λ_min( P_L^T [ sum_{x∈X_1} (x x^T)/(x^T Σ^{-1} x) ] P_L ) ≥ λ_min( sum_{x∈X_1} (P_L^T x x^T P_L) λ_min(P_L^T Σ P_L)/‖P_L^T x‖² ) = (N_1/d) λ_min(P_L^T Σ P_L),

λ_max( P_{L^⊥}^T [ sum_{x∈X_0} (x x^T)/(x^T Σ^{-1} x) ] P_{L^⊥} ) ≤ λ_max( sum_{x∈X_0} (P_{L^⊥}^T x x^T P_{L^⊥}) λ_max(P_{L^⊥}^T Σ P_{L^⊥})/‖P_{L^⊥}^T x‖² ) = (N_0/(D−d)) λ_max(P_{L^⊥}^T Σ P_{L^⊥}).

Combining these with the definition of the operator T in (III.4), we have

λ_min(P_L^T T(Σ) P_L) / λ_max(P_{L^⊥}^T T(Σ) P_{L^⊥}) ≥ α λ_min(P_L^T Σ P_L) / λ_max(P_{L^⊥}^T Σ P_{L^⊥}),

where α = (D−d) N_1 / (d N_0) > 1 (this follows from the assumption |X ∩ L|/|X| > d/D). Therefore

lim_k λ_min(P_L^T Σ^{(k)} P_L) / λ_max(P_{L^⊥}^T Σ^{(k)} P_{L^⊥}) ≥ lim_k α^k λ_min(P_L^T Σ^{(1)} P_L) / λ_max(P_{L^⊥}^T Σ^{(1)} P_{L^⊥}) = ∞.   (VII.17)

Since tr(Σ^{(k)}) = 1 for all k > 0, we have

lim_k λ_max(P_{L^⊥}^T Σ^{(k)} P_{L^⊥}) = 0, i.e., lim_k P_{L^⊥}^T Σ^{(k)} P_{L^⊥} = 0.   (VII.18)

Combining (VII.18) with the fact that Σ^{(k)} is positive semidefinite,

lim_k P_{L^⊥}^T Σ^{(k)} P_L = 0.   (VII.19)

Since we have already obtained (VII.18) and (VII.19), in order to prove (VII.12) we only need to prove that P_L^T Σ^{(k)} P_L converges to I_d/d. Applying (VII.15) we have

λ_max( P_L^T [ sum_{x∈X} (x x^T)/(x^T Σ^{-1} x) ] P_L ) ≤ (N_1/d) λ_max(P_L^T Σ P_L) + λ_max( sum_{x∈X_0} (P_L^T x x^T P_L)/(x^T Σ^{-1} x) ) ≤ (N_1/d) λ_max(P_L^T Σ P_L) + λ_max(P_{L^⊥}^T Σ P_{L^⊥}) sum_{x∈X_0} ‖P_L^T x‖²/‖P_{L^⊥}^T x‖²,

and λ_min( P_L^T [ sum_{x∈X} (x x^T)/(x^T Σ^{-1} x) ] P_L ) ≥ (N_1/d) λ_min(P_L^T Σ P_L). Therefore

λ_max(P_L^T T(Σ) P_L)/λ_min(P_L^T T(Σ) P_L) ≤ λ_max(P_L^T Σ P_L)/λ_min(P_L^T Σ P_L) + ( d λ_max(P_{L^⊥}^T Σ P_{L^⊥}) sum_{x∈X_0} ‖P_L^T x‖²/‖P_{L^⊥}^T x‖² ) / ( N_1 λ_min(P_L^T Σ P_L) ).   (VII.20)

Now we will prove that

λ_min(P_L^T Σ^{(k)} P_L) > c for all k, for some c > 0.   (VII.21)

If (VII.21) does not hold, then there exists a subsequence Σ^{(k_j)} such that lim_j λ_min(P_L^T Σ^{(k_j)} P_L) = 0. Applying (VII.18), (VII.19) and the induction argument in the proof of Lemma III.3, we have lim_j F(Σ^{(k_j)}) = ∞, which contradicts the monotonicity of the algorithm in (VII.5). Therefore (VII.21) is proved. Applying (VII.17), there exists a constant C_1 > 0 such that

λ_max(P_{L^⊥}^T Σ^{(k)} P_{L^⊥}) ≤ C_1 α^{−k}.   (VII.22)

Now we prove the existence of lim_k λ_max(P_L^T Σ^{(k)} P_L)/λ_min(P_L^T Σ^{(k)} P_L). If the limit does not exist, then there exists ε_1 > 0 such that for any sufficiently large K_0, there exist k_1 > k_2 > K_0 such that

λ_max(P_L^T Σ^{(k_1)} P_L)/λ_min(P_L^T Σ^{(k_1)} P_L) − λ_max(P_L^T Σ^{(k_2)} P_L)/λ_min(P_L^T Σ^{(k_2)} P_L) > ε_1.

Summing (VII.20) for Σ = Σ^{(k_2)}, Σ^{(k_2+1)}, …, Σ^{(k_1)} and applying (VII.21) and (VII.22), we obtain a contradiction for sufficiently large K_0. Next we will prove that

lim_k λ_max(P_L^T Σ^{(k)} P_L)/λ_min(P_L^T Σ^{(k)} P_L) = 1   (VII.23)

by contradiction, i.e., by assuming lim_k λ_max(P_L^T Σ^{(k)} P_L)/λ_min(P_L^T Σ^{(k)} P_L) = c_0 > 1. Since the sequence Σ^{(k)} lies in a compact space, there is a subsequence {Σ^{(k_j)}}_j converging to ˆΣ with

λ_max(P_L^T ˆΣ P_L)/λ_min(P_L^T ˆΣ P_L) = c_0 > 1.   (VII.24)

Applying (VII.18) and (VII.19) we have Π_L ˆΣ Π_L = ˆΣ. By a simple calculation this property also holds for T^n(ˆΣ) for any n. Therefore the update T^n(ˆΣ) can be considered as an update that depends only on the set Y_1. Then, applying Theorem III.4 to the set Y_1, we have lim_n T^n(ˆΣ) = Π_L/d, and therefore for any ε_1 > 0 there exists some n_0 > 0 such that

λ_max(P_L^T T^{n_0}(ˆΣ) P_L)/λ_min(P_L^T T^{n_0}(ˆΣ) P_L) < 1 + ε_1.   (VII.25)

Using the continuity of the mapping T^{n_0}, for any η > 0 there exists ε_2 > 0 such that

| λ_max(P_L^T T^{n_0}(ˆΣ) P_L)/λ_min(P_L^T T^{n_0}(ˆΣ) P_L) − λ_max(P_L^T T^{n_0}(Σ) P_L)/λ_min(P_L^T T^{n_0}(Σ) P_L) | < η   (VII.26)

when ‖Σ − ˆΣ‖ < ε_2. Choose j_0 large enough that ‖Σ^{(k_{j_0})} − ˆΣ‖ < ε_2; then applying (VII.25) and (VII.26) with Σ = Σ^{(k_{j_0})} we obtain

λ_max(P_L^T Σ^{(k_{j_0}+n_0)} P_L)/λ_min(P_L^T Σ^{(k_{j_0}+n_0)} P_L) < 1 + ε_1 + η.   (VII.27)

Summing (VII.20) with Σ = Σ^{(k)} for all k ≥ k_{j_0} + n_0, and applying (VII.21) and (VII.22), we obtain that for some C_2 > 0,

c_0 = lim_k λ_max(P_L^T Σ^{(k)} P_L)/λ_min(P_L^T Σ^{(k)} P_L) ≤ λ_max(P_L^T Σ^{(k_{j_0}+n_0)} P_L)/λ_min(P_L^T Σ^{(k_{j_0}+n_0)} P_L) + C_2 α^{−k_{j_0}−n_0} < 1 + C_2 α^{−k_{j_0}−n_0} + ε_1 + η.   (VII.28)

Since we can choose ε_1 and η arbitrarily small and k_{j_0}, n_0 arbitrarily large, (VII.28) contradicts (VII.24). Therefore (VII.23) is proved. Combining (VII.23) with (VII.18) and (VII.19), and noticing that tr(Σ^{(k)}) = 1 for all k > 0, we have proved (VII.12).

REFERENCES

[1] L. P. Ammann. Robust singular value decompositions: A new approach to projection pursuit. Journal of the American Statistical Association, 88(422), 1993.
[2] C. Auderset, C. Mazza, and E. A. Ruh. Angular Gaussian and Cauchy estimation. Journal of Multivariate Analysis, 93(1):180–197, 2005.
[3] R. Basri and D. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233, February 2003.
[4] R. Bhatia. Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, 2007.
[5] S. Bonnabel and R. Sepulchre. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications, 31(3), 2009.
[6] T. F. Chan and P. Mulet. On the convergence of the lagged diffusivity fixed point method in total variation image restoration. SIAM J. Numer. Anal., 36, February 1999.
[7] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 09), 2009.
[8] J. Faraut and A. Korányi. Analysis on Symmetric Cones. Oxford Mathematical Monographs. Clarendon Press, 1994.
[9] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst., 106(1):59–70, April 2007.
[10] P. J. Huber. Robust Statistics. John Wiley & Sons Inc., New York, 1981. Wiley Series in Probability and Mathematical Statistics.
[11] D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1), 2004.
[12] J. T. Kent and D. E. Tyler. Maximum likelihood estimation for the wrapped Cauchy distribution. Journal of Applied Statistics, 15(2), 1988.
[13] J. T. Kent and D. E. Tyler. Redescending M-estimates of multivariate location and scatter. The Annals of Statistics, 19(4), 1991.
[14] H. W. Kuhn. A note on Fermat's problem. Mathematical Programming, 4:98–107, 1973.
[15] S. Lang. Fundamentals of Differential Geometry. Graduate Texts in Mathematics. Springer, 1999.
[16] K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5), 2005.
[17] G. Lerman, M. McCoy, J. A. Tropp, and T. Zhang. Robust computation of linear models, or how to find a needle in a haystack. Submitted February 2012.
[18] G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo. Journal of the American Statistical Association, 80(391), 1985.
[19] R. A. Maronna. Robust M-estimators of multivariate location and scatter. The Annals of Statistics, 4(1):51–67, 1976.
[20] R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd., Chichester, 2006.
[21] R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd., Chichester, 2006.
[22] M. McCoy and J. A. Tropp. Two proposals for robust PCA using semidefinite programming. Elec. J. Stat., 5:1123–1160, 2011.
[23] C. Niculescu and L. Persson. Convex Functions and Their Applications: A Contemporary Approach. CMS Books in Mathematics. Springer, 2006.
[24] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66:41–66, 2006.
[25] S. Smith. Covariance, subspace, and intrinsic Cramér–Rao bounds. IEEE Transactions on Signal Processing, 53(5):1610–1630, May 2005.
[26] M. Soltanolkotabi and E. J. Candès. A geometric analysis of subspace clustering with outliers. CoRR, abs/1112.4258, 2011.
[27] F. De la Torre and M. J. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54:117–142, 2003.
[28] D. E. Tyler. A distribution-free M-estimator of multivariate scatter. The Annals of Statistics, 15(1):234–251, 1987.
[29] C. Udrişte. Convex Functions and Optimization Methods on Riemannian Manifolds. Mathematics and Its Applications. Kluwer Academic Publishers, 1994.
[30] A. Wiesel. Geodesic convexity and covariance estimation. Submitted to IEEE Transactions on Signal Processing.
[31] A. Wiesel. On the convexity in Kronecker structured covariance estimation. To be presented at SSP 2012.
[32] A. Wiesel. Unified framework to regularized covariance estimation in scaled Gaussian models. IEEE Transactions on Signal Processing, 60(1):29–38, January 2012.
[33] H. Xu, C. Caramanis, and S. Mannor. Principal component analysis with contaminated data: The high dimensional case. In Conference on Learning Theory (COLT), 2010.
[34] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, 2010.
[35] T. Zhang and G. Lerman. A novel M-estimator for robust PCA. arXiv preprint, 2011.
More informationarxiv: v1 [math.na] 26 Nov 2009
Non-convexly constrained linear inverse problems arxiv:0911.5098v1 [math.na] 26 Nov 2009 Thomas Blumensath Applied Mathematics, School of Mathematics, University of Southampton, University Road, Southampton,
More informationRecursive Sparse Recovery in Large but Structured Noise - Part 2
Recursive Sparse Recovery in Large but Structured Noise - Part 2 Chenlu Qiu and Namrata Vaswani ECE dept, Iowa State University, Ames IA, Email: {chenlu,namrata}@iastate.edu Abstract We study the problem
More informationConvergence of the Ensemble Kalman Filter in Hilbert Space
Convergence of the Ensemble Kalman Filter in Hilbert Space Jan Mandel Center for Computational Mathematics Department of Mathematical and Statistical Sciences University of Colorado Denver Parts based
More informationThe Metric Geometry of the Multivariable Matrix Geometric Mean
Trieste, 2013 p. 1/26 The Metric Geometry of the Multivariable Matrix Geometric Mean Jimmie Lawson Joint Work with Yongdo Lim Department of Mathematics Louisiana State University Baton Rouge, LA 70803,
More informationScaling Limits of Waves in Convex Scalar Conservation Laws Under Random Initial Perturbations
Journal of Statistical Physics, Vol. 122, No. 2, January 2006 ( C 2006 ) DOI: 10.1007/s10955-005-8006-x Scaling Limits of Waves in Convex Scalar Conservation Laws Under Random Initial Perturbations Jan
More informationMachine learning for pervasive systems Classification in high-dimensional spaces
Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version
More informationA Characterization of Sampling Patterns for Union of Low-Rank Subspaces Retrieval Problem
A Characterization of Sampling Patterns for Union of Low-Rank Subspaces Retrieval Problem Morteza Ashraphijuo Columbia University ashraphijuo@ee.columbia.edu Xiaodong Wang Columbia University wangx@ee.columbia.edu
More informationIT is well-known that the cone of real symmetric positive
SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 1 Geometric distance between positive definite matrices of different dimensions Lek-Heng Lim, Rodolphe Sepulchre, Fellow, IEEE, and Ke Ye Abstract We
More informationDimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas
Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx
More informationTHE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR
THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},
More informationA Riemannian Framework for Denoising Diffusion Tensor Images
A Riemannian Framework for Denoising Diffusion Tensor Images Manasi Datar No Institute Given Abstract. Diffusion Tensor Imaging (DTI) is a relatively new imaging modality that has been extensively used
More informationRobust Stochastic Principal Component Analysis
John Goes Teng Zhang Raman Arora Gilad Lerman University of Minnesota Princeton University Johns Hopkins University University of Minnesota Abstract We consider the problem of finding lower dimensional
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationBackground Mathematics (2/2) 1. David Barber
Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and
More informationSYMMETRIC MATRIX PERTURBATION FOR DIFFERENTIALLY-PRIVATE PRINCIPAL COMPONENT ANALYSIS. Hafiz Imtiaz and Anand D. Sarwate
SYMMETRIC MATRIX PERTURBATION FOR DIFFERENTIALLY-PRIVATE PRINCIPAL COMPONENT ANALYSIS Hafiz Imtiaz and Anand D. Sarwate Rutgers, The State University of New Jersey ABSTRACT Differential privacy is a strong,
More informationOn the Behavior of Information Theoretic Criteria for Model Order Selection
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 49, NO. 8, AUGUST 2001 1689 On the Behavior of Information Theoretic Criteria for Model Order Selection Athanasios P. Liavas, Member, IEEE, and Phillip A. Regalia,
More information