Isotropic PCA and Affine-Invariant Clustering

S. Charles Brubaker    Santosh S. Vempala
College of Computing, Georgia Tech. {brubaker,vempala}@cc.gatech.edu

Abstract

We present an extension of Principal Component Analysis (PCA) and a new algorithm for clustering points in R^n based on it. The key property of the algorithm is that it is affine-invariant. When the input is a sample from a mixture of two arbitrary Gaussians, the algorithm correctly classifies the sample assuming only that the two components are separable by a hyperplane, i.e., there exists a halfspace that contains most of one Gaussian and almost none of the other in probability mass. This is nearly the best possible, improving known results substantially [4, 9, ]. For k > 2 components, the algorithm requires only that there be some (k−1)-dimensional subspace in which the overlap in every direction is small. Here we define overlap to be the ratio of the following two quantities: 1) the average squared distance between a point and the mean of its component, and 2) the average squared distance between a point and the mean of the mixture. The main result may also be stated in the language of linear discriminant analysis: if the standard Fisher discriminant [8] is small enough, labels are not needed to estimate the optimal subspace for projection. Our main tools are isotropic transformation, spectral projection and a simple reweighting technique. We call this combination isotropic PCA.

1 Introduction

We present an extension to Principal Component Analysis (PCA), which is able to go beyond standard PCA in identifying important directions. When the covariance matrix of the input (distribution or point set in R^n) is a multiple of the identity, then PCA reveals no information; the second moment along any direction is the same. Such inputs are called isotropic. Our extension, which we call isotropic PCA, can reveal interesting information in such settings. We use this technique to give an affine-invariant clustering algorithm for points in R^n. When applied to the problem of unraveling mixtures of arbitrary Gaussians from unlabeled samples, the algorithm yields substantial improvements over known results.

To illustrate the technique, consider the uniform distribution on the set X = {(x, y) ∈ R² : x ∈ {−1, 1}, y ∈ [−√3, √3]}, which is isotropic. Suppose this distribution is rotated in an unknown way and that we would like to recover the original x and y axes. For each point in a sample, we may project it to the unit circle and compute the covariance matrix of the resulting point set. The x direction will correspond to the eigenvector with the larger eigenvalue, the y direction to the other. See Figure 1 for an illustration and the code sketch below. Instead of projection onto the unit circle, this process may also be thought of as importance weighting, a technique which allows one to simulate one distribution with another. In this case, we are simulating a distribution over the set X whose density function is proportional to 1/(1 + y²), so that points near (1, 0) or (−1, 0) are more probable.

Figure 1: Mapping points to the unit circle and then finding the direction of maximum variance reveals the orientation of this isotropic distribution.

In this paper, we describe how to apply this method to mixtures of arbitrary Gaussians in R^n in order to find a set of directions along which the Gaussians are well-separated. These directions span the Fisher subspace of the mixture, a classical concept in Pattern Recognition. Once these directions are identified, points can be classified according to which component of the distribution generated them, and hence all parameters of the mixture can be learned.

What separates this paper from previous work on learning mixtures is that our algorithm is affine-invariant. Indeed, for every mixture distribution that can be learned using a previously known algorithm, there is a linear transformation of bounded condition number that causes the algorithm to fail. For k = 2 components our algorithm has nearly the best possible guarantees (and subsumes all previous results) for clustering Gaussian mixtures. For k > 2, it requires that there be a (k−1)-dimensional subspace where the overlap of the components is small in every direction (see Section 1.2). This condition can be stated in terms of the Fisher discriminant, a quantity commonly used in the field of Pattern Recognition with labeled data. Because our algorithm is affine-invariant, it can unravel a much larger set of Gaussian mixtures than had been possible previously.
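The following small sketch (not from the paper; the rotation angle and sample size are made up) illustrates the unit-circle example above numerically: standard PCA sees an identity covariance, but the covariance of the points projected to the unit circle recovers the hidden axes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.choice([-1.0, 1.0], size=n)                  # x uniform on {-1, 1}
y = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)     # y uniform on [-sqrt(3), sqrt(3)]
pts = np.column_stack([x, y])                        # isotropic: covariance ~ identity

theta = 0.7                                          # "unknown" rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = pts @ R.T

# Standard PCA is uninformative here, so project each point to the unit circle
# and diagonalize the covariance of the projected points instead.
proj = rotated / np.linalg.norm(rotated, axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(np.cov(proj.T))

# The eigenvector with the larger eigenvalue aligns (up to sign) with the rotated x-axis.
top = evecs[:, np.argmax(evals)]
print("recovered direction:", top, " true x-axis:", R[:, 0])
```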

The first step of our algorithm is to place the mixture in isotropic position (see Section 1.2) via an affine transformation. This has the effect of making the (k−1)-dimensional Fisher subspace, i.e., the one that minimizes the Fisher discriminant, the same as the subspace spanned by the means of the components (they only coincide in general in isotropic position), for any mixture. The rest of the algorithm identifies directions close to this subspace and uses them to cluster, without access to labels. Intuitively this is hard since after isotropy, standard PCA reveals no additional information. Before presenting the ideas and guarantees in more detail, we describe relevant related work.

1.1 Previous Work

A mixture model is a convex combination of distributions of known type. In the most commonly studied version, a distribution F in R^n is composed of k unknown Gaussians. That is, F = w_1 N(µ_1, Σ_1) + ... + w_k N(µ_k, Σ_k), where the mixing weights w_i, means µ_i, and covariance matrices Σ_i are all unknown. Typically, k ≪ n, so that a concise model explains a high-dimensional phenomenon. A random sample is generated from F by first choosing a component with probability equal to its mixing weight and then picking a random point from that component distribution. In this paper, we study the classical problem of unraveling a sample from a mixture, i.e., labeling each point in the sample according to its component of origin.

Heuristics for classifying samples include expectation maximization [5] and k-means clustering []. These methods can take a long time and can get stuck with suboptimal classifications. Over the past decade, there has been much progress on finding polynomial-time algorithms with rigorous guarantees for classifying mixtures, especially mixtures of Gaussians [4, 5, 4, 7, 9, ]. Starting with Dasgupta's paper [4], one line of work uses the concentration of pairwise distances and assumes that the components' means are so far apart that distances between points from the same component are likely to be smaller than distances from points in different components. Arora and Kannan [4] establish nearly optimal results for such distance-based algorithms. Unfortunately, their results inherently require separation that grows with the dimension of the ambient space and the largest variance of each component Gaussian. To see why this is unnatural, consider k well-separated Gaussians in R^k with means e_1, ..., e_k, i.e., each mean is unit distance from the origin along a unique coordinate axis. Adding extra dimensions with arbitrary variance does not affect the separability of these Gaussians, but these algorithms are no longer guaranteed to work. For example, suppose that each Gaussian has a maximum variance of ε. Then, adding O*(kε^{−2}) extra dimensions with variance ε will violate the necessary separation conditions.

To improve on this, a subsequent line of work uses spectral projection (PCA). Vempala and Wang [7] showed that for a mixture of spherical Gaussians, the subspace spanned by the top k principal components of the mixture contains the means of the components. Thus, projecting to this subspace has the effect of shrinking the components while maintaining the separation between their means. This leads to a nearly optimal separation requirement of

‖µ_i − µ_j‖ ≥ Ω(k^{1/4}) max{σ_i, σ_j},

where µ_i is the mean of component i and σ_i² is the variance of component i along any direction. Note that there is no dependence on the dimension of the distribution.
Kannan et al. [9] applied the spectral approach to arbitrary mixtures of Gaussians (and more generally, log-concave distributions) and obtained a separation that grows with a polynomial in k and the largest variance of each component:

‖µ_i − µ_j‖ ≥ poly(k) max{σ_{i,max}, σ_{j,max}},

where σ_{i,max}² is the maximum variance of the ith component in any direction.

Figure 2: Previous work requires distance-concentration separability, which depends on the maximum directional variance (a). Our results require only hyperplane separability, which depends only on the variance in the separating direction (b). For non-isotropic mixtures the best separating direction may not be between the means of the components (c). Panels: (a) Distance Concentration Separability; (b) Hyperplane Separability; (c) Intermean Hyperplane and Fisher Hyperplane.

The polynomial in k was improved in [] along with matching lower bounds for this approach, suggesting this to be the limit of spectral methods. Going beyond this spectral threshold for arbitrary Gaussians has been a major open problem. The representative hard case is the special case of two parallel pancakes, i.e., two Gaussians that are spherical in n−1 directions and narrow in the last direction, so that a hyperplane orthogonal to the last direction separates the two. The spectral approach requires a separation that grows with their largest standard deviation, which is unrelated to the distance between the pancakes (their means). Other examples can be generated by starting with Gaussians in k dimensions that are separable and then adding other dimensions, one of which has large variance. Because there is a subspace where the Gaussians are separable, the separation requirement should depend only on the dimension of this subspace and the components' variances in it.

A related line of work considers learning symmetric product distributions, where the coordinates are independent. Feldman et al. [6] have shown that mixtures of axis-aligned Gaussians can be approximated without any separation assumption at all in time exponential in k. A. Dasgupta et al. [3] consider heavy-tailed distributions as opposed to Gaussians or log-concave ones and give conditions under which they can be clustered using an algorithm that is exponential in the number of samples. Chaudhuri and Rao [2] have recently given a polynomial-time algorithm for clustering such heavy-tailed product distributions.

1.2 Results

We assume we are given a lower bound w on the minimum mixing weight and k, the number of components. With high probability, our algorithm Unravel returns a partition of space by hyperplanes so that each part (a polyhedron) encloses almost all of the probability mass of a single component and almost none of the other components. The error of such a set of polyhedra is the total probability mass that falls outside the correct polyhedron.

We first state our result for two Gaussians in a way that makes clear the relationship to previous work that relies on separation.

Theorem 1. Let w_1, µ_1, Σ_1 and w_2, µ_2, Σ_2 define a mixture of two Gaussians. There is an absolute constant C such that, if there exists a direction v such that

‖proj_v(µ_1 − µ_2)‖ ≥ C (√(vᵀΣ_1v) + √(vᵀΣ_2v)) w^{−2} log^{1/2}(1/(wδ) + 1/η),

then with probability 1 − δ algorithm Unravel returns two complementary halfspaces that have error at most η using time and a number of samples that is polynomial in n, 1/w, log(1/δ).

The requirement is that in some direction the separation between the means must be comparable to the standard deviation. This separation condition of Theorem 1 is affine-invariant and much weaker than conditions of the form ‖µ_1 − µ_2‖ ≥ max{σ_{1,max}, σ_{2,max}} used in previous work. See Figure 2(a). The dotted line shows how previous work effectively treats every component as spherical. Hyperplane separability (Figure 2(b)) is a weaker condition. We also note that the separating direction does not need to be the intermean direction, as illustrated in Figure 2(c). The dotted line illustrates the hyperplane induced by the intermean direction, which may be far from the optimal separating hyperplane shown by the solid line.

It will be insightful to state this result in terms of the Fisher discriminant, a standard notion from Pattern Recognition [8, 7] that is used with labeled data. In words, the Fisher discriminant along direction p is

J(p) = (the intra-component variance in direction p) / (the total variance in direction p).

Mathematically, this is expressed as

J(p) = E[‖proj_p(x − µ_{l(x)})‖²] / E[‖proj_p(x)‖²] = pᵀ(w_1Σ_1 + w_2Σ_2)p / pᵀ(w_1(Σ_1 + µ_1µ_1ᵀ) + w_2(Σ_2 + µ_2µ_2ᵀ))p

for x distributed according to a mixture distribution with means µ_i and covariance matrices Σ_i. We use l(x) to indicate the component from which x was drawn.

Theorem 2. There is an absolute constant C for which the following holds. Suppose that F is a mixture of two Gaussians such that there exists a direction p for which

J(p) ≤ C w³ (log(1/(δw)) + 1/η)^{−1}.

With probability 1 − δ, algorithm Unravel returns a halfspace with error at most η using time and sample complexity polynomial in n, 1/w, log(1/δ).

There are several ways of generalizing the Fisher discriminant for k = 2 components to greater k [7]. These generalizations are most easily understood when the distribution is isotropic. An isotropic distribution has the identity matrix as its covariance and the origin as its mean. An isotropic mixture therefore has Σ_i w_iµ_i = 0 and Σ_i w_i(Σ_i + µ_iµ_iᵀ) = I.
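As a concrete (and entirely illustrative) sketch of the quantity J(p) above, the following code evaluates the Fisher discriminant of a two-component mixture with known parameters; the weights, means, and covariances are made up, and the mixture mean is assumed to be at the origin as in the formula.

```python
import numpy as np

def fisher_discriminant(p, ws, mus, sigmas):
    """J(p) for a labeled mixture whose overall mean is at the origin."""
    p = p / np.linalg.norm(p)
    intra = sum(w * p @ S @ p for w, S in zip(ws, sigmas))
    total = sum(w * p @ (S + np.outer(m, m)) @ p for w, m, S in zip(ws, mus, sigmas))
    return intra / total

ws = [0.5, 0.5]
mus = [np.array([2.0, 0.0]), np.array([-2.0, 0.0])]    # mixture mean is zero
sigmas = [np.diag([0.05, 4.0]), np.diag([0.05, 4.0])]  # "parallel pancakes"

print(fisher_discriminant(np.array([1.0, 0.0]), ws, mus, sigmas))  # small: good separating direction
print(fisher_discriminant(np.array([0.0, 1.0]), ws, mus, sigmas))  # close to 1: uninformative direction
```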

It is well known that any distribution with bounded covariance matrix (and therefore any mixture) can be made isotropic by an affine transformation. As we will see shortly, for k = 2, for an isotropic mixture, the line joining the means is the direction that minimizes the Fisher discriminant. Under isotropy, the denominator of the Fisher discriminant is always 1. Thus, the discriminant is just the expected squared distance between the projection of a point and the projection of its mean, where projection is onto some direction p. The generalization to k > 2 is natural, as we may simply replace projection onto direction p with projection onto a (k−1)-dimensional subspace S.

For convenience, let Σ = Σ_{i=1}^k w_iΣ_i. Let the vectors p_1, ..., p_{k−1} be an orthonormal basis of S and let l(x) be the component from which x was drawn. We then have under isotropy

J(S) = E[‖proj_S(x − µ_{l(x)})‖²] = Σ_{j=1}^{k−1} p_jᵀ Σ p_j

for x distributed according to a mixture distribution with means µ_i and covariance matrices Σ_i. As Σ is symmetric positive definite, it follows that the smallest k−1 eigenvectors of the matrix are optimal choices of p_j and S is the span of these eigenvectors. This motivates our definition of the Fisher subspace for any mixture with bounded second moments (not necessarily Gaussians).

Definition 1. Let {w_i, µ_i, Σ_i} be the weights, means, and covariance matrices for an isotropic mixture distribution with mean at the origin and where dim(span{µ_1, ..., µ_k}) = k − 1. Let l(x) be the component from which x was drawn. The Fisher subspace F is defined as the (k−1)-dimensional subspace that minimizes

J(S) = E[‖proj_S(x − µ_{l(x)})‖²]

over subspaces S of dimension k − 1.

Note that dim(span{µ_1, ..., µ_k}) is only k − 1 because isotropy implies Σ_{i=1}^k w_iµ_i = 0.

The next lemma provides a simple alternative characterization of the Fisher subspace as the span of the means of the components (after transforming to isotropic position). The proof is given in Section 3.2.

Lemma 1. Suppose {w_i, µ_i, Σ_i}_{i=1}^k defines an isotropic mixture in R^n. Let λ_1 ≥ ... ≥ λ_n be the eigenvalues of the matrix Σ = Σ_{i=1}^k w_iΣ_i and let v_1, ..., v_n be the corresponding eigenvectors. If the dimension of the span of the means of the components is k − 1, then the Fisher subspace

F = span{v_{n−k+2}, ..., v_n} = span{µ_1, ..., µ_k}.

Our algorithm attempts to find the Fisher subspace (or one close to it) and succeeds in doing so, provided the discriminant is small enough. The next definition will be useful in stating our main theorem precisely. (For non-isotropic mixtures, the Fisher discriminant generalizes to Σ_{j=1}^{k−1} p_jᵀ (Σ_{i=1}^k w_i(Σ_i + µ_iµ_iᵀ))^{−1} Σ p_j and the overlap to pᵀ (Σ_{i=1}^k w_i(Σ_i + µ_iµ_iᵀ))^{−1} Σ p.)
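Before stating the overlap definition, here is a small numerical sketch of Lemma 1 (not from the paper; the two-component parameters below are made up). It puts a mixture in isotropic position analytically and checks that the eigenvector of Σ = Σ_i w_iΣ_i with the smallest eigenvalue is parallel to the span of the means.

```python
import numpy as np

# Made-up two-component mixture in R^3 (k = 2, so the Fisher subspace is 1-D).
ws = np.array([0.3, 0.7])
mus = [np.array([3.0, 1.0, 0.0]), np.array([-1.0, 0.5, 0.0])]
sigmas = [np.diag([0.2, 5.0, 1.0]), np.diag([0.3, 2.0, 4.0])]

# Put the mixture in isotropic position: mean zero, identity covariance.
mean = sum(w * m for w, m in zip(ws, mus))
T = sum(w * (S + np.outer(m - mean, m - mean)) for w, m, S in zip(ws, mus, sigmas))
evals, evecs = np.linalg.eigh(T)
T_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T

mus_iso = [T_inv_half @ (m - mean) for m in mus]
sigmas_iso = [T_inv_half @ S @ T_inv_half for S in sigmas]

# The smallest eigenvector of Sigma = sum_i w_i Sigma_i spans the means (Lemma 1).
sigma_bar = sum(w * S for w, S in zip(ws, sigmas_iso))
vals, vecs = np.linalg.eigh(sigma_bar)
v_min = vecs[:, 0]                               # eigenvector with the smallest eigenvalue
mu_dir = mus_iso[0] / np.linalg.norm(mus_iso[0])
print("|cosine| between v_min and the span of the means:", abs(v_min @ mu_dir))
```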

Definition 2. The overlap of a mixture given as in Definition 1 is

φ = min_{S : dim(S)=k−1} max_{p ∈ S, ‖p‖=1} pᵀ Σ p.    (1)

It is a direct consequence of the Courant–Fischer min-max theorem that φ is the (k−1)th smallest eigenvalue of the matrix Σ and that the subspace achieving φ is the Fisher subspace, i.e.,

φ = ‖E[proj_F(x − µ_{l(x)}) proj_F(x − µ_{l(x)})ᵀ]‖_2.

We can now state our main theorem for k > 2.

Theorem 3. There is an absolute constant C for which the following holds. Suppose that F is a mixture of k Gaussian components where the overlap satisfies

φ ≤ C w³ k^{−3} (log(nk/(δw)) + 1/η)^{−1}.

With probability 1 − δ, algorithm Unravel returns a set of k polyhedra that have error at most η using time and a number of samples that is polynomial in n, 1/w, log(1/δ).

In words, the algorithm successfully unravels arbitrary Gaussians provided there exists a (k−1)-dimensional subspace in which, along every direction, the expected squared distance of a point to its component mean is smaller than the expected squared distance to the overall mean by roughly a poly(k, 1/w) factor. There is no dependence on the largest variances of the individual components, and the dependence on the ambient dimension is logarithmic. This means that the addition of extra dimensions (even where the distribution has large variance) as discussed in Section 1.1 has little impact on the success of our algorithm.

2 Algorithm

The algorithm has three major components: an initial affine transformation, a reweighting step, and identification of a direction close to the Fisher subspace and a hyperplane orthogonal to this direction which leaves each component's probability mass almost entirely in one of the halfspaces induced by the hyperplane. The key insight is that the reweighting technique will either cause the mean of the mixture to shift in the intermean subspace, or cause the top k−1 principal components of the second moment matrix to approximate the intermean subspace. In either case, we obtain a direction along which we can partition the components.

We first find an affine transformation W which when applied to F results in an isotropic distribution. That is, we move the mean to the origin and apply a linear transformation to make the covariance matrix the identity. We apply this transformation to a new set of m points {x_i} from F and then reweight according to a spherically symmetric Gaussian exp(−‖x‖²/(2α)) for α = Θ(n/w). We then compute the mean û and second moment matrix M̂ of the resulting set. (This practice of transforming the points and then looking at the second moment matrix can be viewed as a form of kernel PCA; however, the connection between our algorithm and kernel PCA is superficial. Our transformation does not result in any standard kernel. Moreover, it is dimension-preserving — it is just a reweighting — and hence the kernel trick has no computational advantage.)

After the reweighting, the algorithm chooses either the new mean or the direction of maximum second moment and projects the data onto this direction h. By bisecting the largest gap between points, we obtain a threshold t, which along with h defines a hyperplane that separates the components. Using the notation H_{h,t} = {x ∈ R^n : hᵀx ≥ t} to indicate a halfspace, we then recurse on each half of the mixture.

Thus, every node in the recursion tree represents an intersection of halfspaces. To make our analysis easier, we assume that we use different samples for each step of the algorithm. The reader might find it useful to read Section 2.1, which gives an intuitive explanation for how the algorithm works on parallel pancakes, before reviewing the details of the algorithm.

Algorithm Unravel
Input: integer k, scalar w.
Initialization: P = R^n.

1. (Isotropy) Use samples lying in P to compute an affine transformation W that makes the distribution nearly isotropic (mean zero, identity covariance matrix).
2. (Reweighting) Use m samples in P and for each compute a weight e^{−‖x‖²/(2α)} (where α > n/w).
3. (Separating Direction) Find the mean of the reweighted data µ̂. If ‖µ̂‖ > √w/(32α), let h = µ̂. Otherwise, find the covariance matrix M̂ of the reweighted points and let h be its top principal component.
4. (Recursion) Project m_2 sample points to h and find the largest gap between points in the interval [−1/2, 1/2]. If this gap is less than 1/(4(k−1)), then return P. Otherwise, set t to be the midpoint of the largest gap, recurse on P ∩ H_{h,t} and P ∩ H_{−h,−t}, and return the union of the polyhedra produced by these recursive calls.

2.1 Parallel Pancakes

The following special case, which represents the open problem in previous work, will illuminate the intuition behind the new algorithm. Suppose F is a mixture of two spherical Gaussians that are well-separated, i.e., the intermean distance is large compared to the standard deviation along any direction. We consider two cases, one where the mixing weights are equal and another where they are imbalanced.

After isotropy is enforced, each component will become thin in the intermean direction, giving the density the appearance of two parallel pancakes. When the mixing weights are equal, the means of the components will be equally spaced at a distance of √(1−φ) on opposite sides of the origin. For imbalanced weights, the origin will still lie on the intermean direction but will be much closer to the heavier component, while the lighter component will be much further away. In both cases, this transformation makes the variance of the mixture 1 in every direction, so the principal components give us no insight into the intermean direction.

Consider next the effect of the reweighting on the mean of the mixture. For the case of equal mixing weights, symmetry assures that the mean does not shift at all. For imbalanced weights, however, the heavier component, which lies closer to the origin, will become heavier still. Thus, the reweighted mean shifts toward the mean of the heavier component, allowing us to detect the intermean direction.

Finally, consider the effect of reweighting on the second moments of the mixture with equal mixing weights. Because points closer to the origin are weighted more, the second moment in every direction is reduced. However, in the intermean direction, where part of the moment is due to the displacement of the component means from the origin, it shrinks less. Thus, the direction of maximum second moment is the intermean direction.
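The following compact sketch implements the steps of Unravel as described above (isotropy, Gaussian reweighting, mean-shift vs. top principal component, largest-gap split, recursion). It is illustrative only and not the authors' code: the threshold constant, the sample reuse, and the recursion bookkeeping are simplified, and the demo parameters at the end are made up.

```python
import numpy as np

def isotropy_transform(X):
    """Affine map putting the sample approximately in isotropic position."""
    mu = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X.T))
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return (X - mu) @ W.T

def separating_direction(Y, w):
    """Step 3: reweighted mean shift if it is large, else top principal
    component of the reweighted second-moment matrix."""
    n = Y.shape[1]
    alpha = n / w
    wts = np.exp(-np.sum(Y ** 2, axis=1) / (2 * alpha))
    u_hat = np.average(Y, axis=0, weights=wts)
    if np.linalg.norm(u_hat) > np.sqrt(w) / (32 * alpha):   # threshold constant is illustrative
        return u_hat / np.linalg.norm(u_hat)
    M_hat = (Y * wts[:, None]).T @ Y / len(Y)
    evals, evecs = np.linalg.eigh(M_hat)
    return evecs[:, -1]

def largest_gap_threshold(proj, k):
    """Midpoint of the largest gap between consecutive projected points,
    measured inside [-1/2, 1/2]; None if that gap is too small."""
    s = np.sort(proj)
    lo, hi = np.clip(s[:-1], -0.5, 0.5), np.clip(s[1:], -0.5, 0.5)
    if len(s) < 2 or (hi - lo).max() < 1.0 / (4 * (k - 1)):
        return None
    j = np.argmax(hi - lo)
    return (lo[j] + hi[j]) / 2

def unravel(X, k, w):
    """Return a list of index arrays, one per recovered part of the sample."""
    idx = np.arange(len(X))
    if k <= 1 or len(X) <= 2 * X.shape[1]:
        return [idx]
    Y = isotropy_transform(X)
    h = separating_direction(Y, w)
    proj = Y @ h
    t = largest_gap_threshold(proj, k)
    if t is None:
        return [idx]
    parts = []
    for mask in (proj <= t, proj > t):
        # The paper recurses with the same k on each half; passing k - 1 here
        # is a simplification that guarantees this sketch terminates.
        parts.extend(idx[mask][sub] for sub in unravel(X[mask], k - 1, w))
    return parts

# Demo: two "parallel pancake" Gaussians with unequal weights, which standard
# PCA alone cannot separate after the isotropic transformation.
rng = np.random.default_rng(1)
A = rng.normal(size=(5000, 2)) * [0.05, 3.0] + [2.0, 0.0]
B = rng.normal(size=(15000, 2)) * [0.05, 3.0] + [-0.5, 0.0]   # heavier component
parts = unravel(np.vstack([A, B]), k=2, w=0.25)
print([len(p) for p in parts])
```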

2.2 Overview of Analysis

To analyze the algorithm in the general case, we will proceed as follows. Section 3 shows that under isotropy the Fisher subspace coincides with the intermean subspace (Lemma 1), gives the necessary sampling convergence and perturbation lemmas, and relates overlap to a more conventional notion of separation (Prop. 5). Section 3.3 gives approximations to the first and second moments. Section 4 then combines these approximations with the perturbation lemmas to show that the vector h (either the mean shift or the largest principal component) lies close to the intermean subspace. Finally, Section 5 shows the correctness of the recursive aspects of the algorithm.

3 Preliminaries

3.1 Matrix Properties

For a matrix Z, we will denote the ith largest eigenvalue of Z by λ_i(Z), or just λ_i if the matrix is clear from context. Unless specified otherwise, all norms are the 2-norm. For symmetric matrices, this is ‖Z‖_2 = λ_1(Z) = max_{x∈R^n} ‖Zx‖_2/‖x‖_2. The following two facts from linear algebra will be useful in our analysis.

Fact 2. Let λ_1 ≥ ... ≥ λ_n be the eigenvalues of an n-by-n symmetric positive definite matrix Z and let v_1, ..., v_n be the corresponding eigenvectors. Then

λ_n + ... + λ_{n−k+2} = min_{S : dim(S)=k−1} Σ_{j=1}^{k−1} p_jᵀ Z p_j,

where {p_j} is any orthonormal basis for S. If λ_{n−k+1} > λ_{n−k+2}, then span{v_n, ..., v_{n−k+2}} is the unique minimizing subspace.

Recall that a matrix Z is positive semi-definite if xᵀZx ≥ 0 for all non-zero x.

Fact 3. Suppose that the matrix

Z = [ A  Bᵀ ; B  D ]

is symmetric positive semi-definite and that A and D are square submatrices. Then ‖B‖ ≤ √(‖A‖ ‖D‖).

Proof. Let y and x be the top left and right singular vectors of B, so that yᵀBx = ‖B‖. Because Z is positive semi-definite, we have that for any real γ,

0 ≤ [γxᵀ, yᵀ] Z [γxᵀ, yᵀ]ᵀ = γ²xᵀAx + 2γyᵀBx + yᵀDy.

This is a quadratic polynomial in γ that can have at most one real root. Therefore the discriminant must be non-positive:

4(yᵀBx)² − 4(xᵀAx)(yᵀDy) ≤ 0.

We conclude that ‖B‖ = yᵀBx ≤ √((xᵀAx)(yᵀDy)) ≤ √(‖A‖ ‖D‖).
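A quick numerical sanity check of Fact 3 (not part of the paper; the block sizes and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
r, s = 3, 4
G = rng.normal(size=(r + s, r + s))
Z = G @ G.T                          # positive semi-definite
A, D = Z[:r, :r], Z[r:, r:]
B = Z[r:, :r]                        # off-diagonal block

lhs = np.linalg.norm(B, 2)
rhs = np.sqrt(np.linalg.norm(A, 2) * np.linalg.norm(D, 2))
print(lhs, "<=", rhs, lhs <= rhs + 1e-9)
```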

3.2 The Fisher Criterion and Isotropy

We begin with the proof of the lemma that for an isotropic mixture the Fisher subspace is the same as the intermean subspace.

Proof of Lemma 1. By Definition 1, for an isotropic distribution the Fisher subspace minimizes

J(S) = E[‖proj_S(x − µ_{l(x)})‖²] = Σ_{j=1}^{k−1} p_jᵀ Σ p_j,

where {p_j} is an orthonormal basis for S. By Fact 2, one minimizing subspace is the span of the smallest k−1 eigenvectors of the matrix Σ, i.e., v_{n−k+2}, ..., v_n. Because the distribution is isotropic,

Σ = I − Σ_{i=1}^k w_iµ_iµ_iᵀ,

and these vectors become the largest eigenvectors of Σ_{i=1}^k w_iµ_iµ_iᵀ. Clearly, span{v_{n−k+2}, ..., v_n} ⊆ span{µ_1, ..., µ_k}, but both spans have dimension k−1, making them equal. This also implies that

1 − λ_{n−k+2}(Σ) = v_{n−k+2}ᵀ (Σ_i w_iµ_iµ_iᵀ) v_{n−k+2} > 0.

Thus, λ_{n−k+2}(Σ) < 1. On the other hand, v_{n−k+1} must be orthogonal to every µ_i, so λ_{n−k+1}(Σ) = 1. Therefore, λ_{n−k+1}(Σ) > λ_{n−k+2}(Σ) and by Fact 2, span{v_{n−k+2}, ..., v_n} = span{µ_1, ..., µ_k} is the unique minimizing subspace.

It follows directly that under the conditions of Lemma 1, the overlap may be characterized as

φ = λ_{n−k+2}(Σ) = 1 − λ_{k−1}(Σ_{i=1}^k w_iµ_iµ_iᵀ).

For clarity of the analysis, we will assume that Step 1 of the algorithm produces a perfectly isotropic mixture. Theorem 4 gives a bound on the required number of samples to make the distribution nearly isotropic, and as our analysis shows, our algorithm is robust to small estimation errors. We will also assume for convenience of notation that the unit vectors along the first k−1 coordinate axes e_1, ..., e_{k−1} span the intermean (i.e., Fisher) subspace. That is, F = span{e_1, ..., e_{k−1}}. When considering this subspace it will be convenient to be able to refer to the projection of the mean vectors to this subspace. Thus, we define µ̃_i ∈ R^{k−1} to be the first k−1 coordinates of µ_i; the remaining coordinates of µ_i are all zero. In other terms, µ̃_i = [I_{k−1}  0] µ_i.

In this coordinate system the covariance matrix of each component has a particular structure, which will be useful for our analysis. For the rest of this paper we fix the following notation: an isotropic mixture is defined by {w_i, µ_i, Σ_i}. We assume that span{e_1, ..., e_{k−1}} is the intermean subspace and A_i, B_i, and D_i are defined such that

w_iΣ_i = [ A_i  B_iᵀ ; B_i  D_i ]    (2)

where A_i is a (k−1)×(k−1) submatrix and D_i is an (n−k+1)×(n−k+1) submatrix.

Lemma 4 (Covariance Structure). Using the above notation, for all components i,

‖A_i‖ ≤ φ,   ‖D_i‖ ≤ 1,   ‖B_i‖ ≤ √φ.

Proof of Lemma 4. Because span{e_1, ..., e_{k−1}} is the Fisher subspace,

φ = max_{v∈R^{k−1}} ‖v‖^{−2} vᵀ (Σ_{i=1}^k A_i) v = ‖Σ_{i=1}^k A_i‖.

Also Σ_{i=1}^k D_i = I, so ‖Σ_{i=1}^k D_i‖ = 1. Each matrix w_iΣ_i is positive definite, so the principal minors A_i, D_i must be positive definite as well. Therefore, ‖A_i‖ ≤ φ, ‖D_i‖ ≤ 1, and ‖B_i‖ ≤ √(‖A_i‖ ‖D_i‖) ≤ √φ, using Fact 3.

For small φ, the covariance between intermean and non-intermean directions, i.e., ‖B_i‖, is small. For k = 2, this means that all densities will have a nearly parallel pancake shape. In general, it means that k−1 of the principal axes of the Gaussians will lie close to the intermean subspace.

We conclude this section with a proposition connecting, for k = 2, the overlap to a standard notion of separation between two distributions, so that Theorem 1 becomes an immediate corollary of Theorem 2.

Proposition 5. If there exists a unit vector p such that

|pᵀ(µ_1 − µ_2)| > t (√(pᵀw_1Σ_1p) + √(pᵀw_2Σ_2p)),

then the overlap φ ≤ J(p) ≤ (1 + w_1w_2t²)^{−1}.

Proof of Proposition 5. Since the mean of the distribution is at the origin, we have w_1pᵀµ_1 = −w_2pᵀµ_2. Thus,

|pᵀµ_1 − pᵀµ_2|² = (pᵀµ_1)² + (pᵀµ_2)² − 2 pᵀµ_1 pᵀµ_2 = (w_1pᵀµ_1)² (1/w_1² + 2/(w_1w_2) + 1/w_2²),

using w_1 + w_2 = 1. We rewrite the last factor as

1/w_1² + 2/(w_1w_2) + 1/w_2² = (w_2² + 2w_1w_2 + w_1²)/(w_1²w_2²) = 1/(w_1²w_2²).

Again using the fact that w_1pᵀµ_1 = −w_2pᵀµ_2, we have that

|pᵀµ_1 − pᵀµ_2|² = (w_1pᵀµ_1)² (1/w_1 + 1/w_2) · 1/(w_1w_2) = (w_1(pᵀµ_1)² + w_2(pᵀµ_2)²)/(w_1w_2).

Thus, by the separation condition,

w_1(pᵀµ_1)² + w_2(pᵀµ_2)² = w_1w_2 |pᵀµ_1 − pᵀµ_2|² ≥ w_1w_2 t² (pᵀw_1Σ_1p + pᵀw_2Σ_2p).

To bound J(p), we then argue

J(p) = (pᵀw_1Σ_1p + pᵀw_2Σ_2p) / (w_1(pᵀΣ_1p + (pᵀµ_1)²) + w_2(pᵀΣ_2p + (pᵀµ_2)²))

and

1 − J(p) = (w_1(pᵀµ_1)² + w_2(pᵀµ_2)²) / (w_1(pᵀΣ_1p + (pᵀµ_1)²) + w_2(pᵀΣ_2p + (pᵀµ_2)²)) ≥ w_1w_2t² (pᵀw_1Σ_1p + pᵀw_2Σ_2p) / (w_1(pᵀΣ_1p + (pᵀµ_1)²) + w_2(pᵀΣ_2p + (pᵀµ_2)²)) = w_1w_2t² J(p),

and therefore J(p) ≤ 1/(1 + w_1w_2t²). Since φ ≤ J(p) by the definition of overlap, the claim follows.

3.3 Approximation of the Reweighted Moments

Our algorithm works by computing the first and second reweighted moments of a point set from F. In this section, we examine how the reweighting affects the moments of a single component and then give some approximations for the first and second moments of the entire mixture.

3.3.1 Single Component

The first step is to characterize how the reweighting affects the moments of a single component. Specifically, we will show for any function f (and therefore x and xxᵀ in particular) that for α > 0,

E[f(x) exp(−‖x‖²/(2α))] = Σ_i w_iρ_i E_i[f(y_i)].

Here, E_i[·] denotes expectation taken with respect to component i, the quantity ρ_i = E_i[exp(−‖x‖²/(2α))], and y_i is a Gaussian variable with parameters slightly perturbed from the original ith component.

Claim 6. If α = n/w, the quantity ρ_i = E_i[exp(−‖x‖²/(2α))] is at least 1/2.

Proof. Because the distribution is isotropic, for any component i, w_i E_i[‖x‖²] ≤ n. Therefore,

ρ_i = E_i[exp(−‖x‖²/(2α))] ≥ 1 − E_i[‖x‖²]/(2α) ≥ 1 − n/(2αw_i) ≥ 1/2.

Lemma 7 (Reweighted Moments of a Single Component). For any α > 0, with respect to a single component i of the mixture,

E_i[x exp(−‖x‖²/(2α))] = ρ_i (µ_i − (1/α)Σ_iµ_i + f)

and

E_i[xxᵀ exp(−‖x‖²/(2α))] = ρ_i (Σ_i + µ_iµ_iᵀ − (1/α)(Σ_iΣ_i + µ_iµ_iᵀΣ_i + Σ_iµ_iµ_iᵀ) + F),

where ‖f‖, ‖F‖ = O(α^{−2}).

We first establish the following claim.

Claim 8. Let x be a random variable distributed according to the normal distribution N(µ, Σ) and let Σ = QΛQᵀ be the singular value decomposition of Σ, with λ_1, ..., λ_n being the diagonal elements of Λ. Let W = diag(α/(α+λ_1), ..., α/(α+λ_n)). Finally, let y be a random variable distributed according to N(QWQᵀµ, QWΛQᵀ). Then for any function f(x),

E[f(x) exp(−‖x‖²/(2α))] = det(W)^{1/2} exp(−µᵀQWQᵀµ/(2α)) E[f(y)].

Proof of Claim 8. We assume that Q = I for the initial part of the proof. From the definition of a Gaussian distribution, we have

E[f(x) exp(−‖x‖²/(2α))] = det(Λ)^{−1/2}(2π)^{−n/2} ∫_{R^n} f(x) exp(−xᵀx/(2α) − (x−µ)ᵀΛ^{−1}(x−µ)/2) dx.

Because Λ is diagonal, we may write twice the (negated) exponent on the right-hand side as

Σ_{i=1}^n [ x_i²/α + (x_i − µ_i)²/λ_i ].

Completing the square gives the expression

Σ_{i=1}^n [ ((α+λ_i)/(αλ_i)) (x_i − µ_i α/(α+λ_i))² + µ_i²/λ_i − (µ_i²/λ_i)·α/(α+λ_i) ].

The last two terms can be simplified to µ_i²/(α+λ_i). In matrix form the exponent becomes

(x − Wµ)ᵀ(WΛ)^{−1}(x − Wµ) + µᵀWµ/α.

For general Q, this becomes

(x − QWQᵀµ)ᵀ Q(WΛ)^{−1}Qᵀ (x − QWQᵀµ) + µᵀQWQᵀµ/α.

Now recalling the definition of the random variable y, we see

E[f(x) exp(−‖x‖²/(2α))] = det(Λ)^{−1/2}(2π)^{−n/2} exp(−µᵀQWQᵀµ/(2α)) ∫_{R^n} f(x) exp(−(x − QWQᵀµ)ᵀ Q(WΛ)^{−1}Qᵀ (x − QWQᵀµ)/2) dx = det(W)^{1/2} exp(−µᵀQWQᵀµ/(2α)) E[f(y)].

The proof of Lemma 7 is now straightforward.

Proof of Lemma 7. For simplicity of notation, we drop the subscript i from ρ_i, µ_i, Σ_i, with the understanding that all statements of expectation apply to a single component. Using the notation of Claim 8, we have

ρ = E[exp(−‖x‖²/(2α))] = det(W)^{1/2} exp(−µᵀQWQᵀµ/(2α)).

A diagonal entry of the matrix W can be expanded as

α/(α+λ_i) = 1 − λ_i/(α+λ_i) = 1 − λ_i/α + λ_i²/(α(α+λ_i)),

so that

W = I − Λ/α + WΛ²/α².

Thus,

E[x exp(−‖x‖²/(2α))] = ρ (QWQᵀµ) = ρ (QIQᵀµ − QΛQᵀµ/α + QWΛ²Qᵀµ/α²) = ρ (µ − Σµ/α + f),

where ‖f‖ = O(α^{−2}). We analyze the perturbed covariance in a similar fashion:

E[xxᵀ exp(−‖x‖²/(2α))] = ρ (Q(WΛ)Qᵀ + QWQᵀµµᵀQWQᵀ)
= ρ (QΛQᵀ − QΛ²Qᵀ/α + QWΛ³Qᵀ/α² + (µ − Σµ/α + f)(µ − Σµ/α + f)ᵀ)
= ρ (Σ + µµᵀ − (ΣΣ + µµᵀΣ + Σµµᵀ)/α + F),

where ‖F‖ = O(α^{−2}).

3.3.2 Mixture Moments

The second step is to approximate the first and second moments of the entire mixture distribution. Let ρ be the vector whose entries are ρ_i = E_i[exp(−‖x‖²/(2α))] and let ρ̄ be the average of the ρ_i. We also define

u ≡ E[x exp(−‖x‖²/(2α))] = Σ_{i=1}^k w_iρ_iµ_i − (1/α) Σ_{i=1}^k w_iρ_iΣ_iµ_i + f    (3)

M ≡ E[xxᵀ exp(−‖x‖²/(2α))] = Σ_{i=1}^k w_iρ_i (Σ_i + µ_iµ_iᵀ − (1/α)(Σ_iΣ_i + µ_iµ_iᵀΣ_i + Σ_iµ_iµ_iᵀ)) + F    (4)

with ‖f‖ = O(α^{−2}) and ‖F‖ = O(α^{−2}). We denote the estimates of these quantities computed from samples by û and M̂, respectively.

Lemma 9. Let v = Σ_{i=1}^k ρ_iw_iµ_i. Then

‖u − v‖² ≤ 4k²φ/(α²w).

Proof of Lemma 9. We argue from Eqn. (2) and Eqn. (3) that

‖u − v‖ = ‖(1/α) Σ_{i=1}^k w_iρ_iΣ_iµ_i‖ + O(α^{−2}) ≤ (1/α) Σ_{i=1}^k (ρ_i/√w_i) ‖(w_iΣ_i)(√w_iµ_i)‖ + O(α^{−2}) ≤ (1/(α√w)) Σ_{i=1}^k ρ_i ‖[A_i, B_iᵀ]ᵀ‖ ‖√w_iµ_i‖ + O(α^{−2}).

From isotropy, it follows that √w_i‖µ_i‖ ≤ 1. To bound the other factor, we argue

‖[A_i, B_iᵀ]ᵀ‖ ≤ √2 max{‖A_i‖, ‖B_i‖} ≤ √2 √φ.

Therefore,

‖u − v‖² ≤ 2k²φ/(α²w) + O(α^{−3}) ≤ 4k²φ/(α²w),

for sufficiently large n, as α ≥ n/w.

Lemma 10. Let

Γ = [ Γ_{11}  0 ; 0  Γ_{22} ],   where  Γ_{11} = Σ_{i=1}^k ρ_i(w_iµ_iµ_iᵀ + A_i),   Γ_{22} = Σ_{i=1}^k ρ_iD_i − (1/α) Σ_{i=1}^k (ρ_i/w_i)D_i².

If max_i |ρ_i − ρ̄| < 1/(2α), then

‖M − Γ‖ ≤ 16 k √φ/(wα).

Before giving the proof, we summarize some of the necessary calculation in the following claim.

Claim 11. The matrix of second moments

M = E[xxᵀ exp(−‖x‖²/(2α))] = [ Γ_{11}  0 ; 0  Γ_{22} ] + [ Δ_{11}  Δ_{21}ᵀ ; Δ_{21}  Δ_{22} ] + F,

where

Δ_{11} = −Σ_{i=1}^k (ρ_i/(w_iα)) (B_iᵀB_i + w_iµ_iµ_iᵀA_i + w_iA_iµ_iµ_iᵀ + A_i²),
Δ_{21} = Σ_{i=1}^k ρ_iB_i − Σ_{i=1}^k (ρ_i/(w_iα)) (B_i(w_iµ_iµ_iᵀ) + B_iA_i + D_iB_i),
Δ_{22} = −Σ_{i=1}^k (ρ_i/(w_iα)) B_iB_iᵀ,

and ‖F‖ = O(α^{−2}).

Proof. The calculation is straightforward.

Proof of Lemma 10. We begin by bounding the 2-norm of each of the blocks of Δ. Since ‖w_iµ_iµ_iᵀ‖ ≤ 1, ‖A_i‖ ≤ φ, and ‖B_i‖ ≤ √φ, we can bound

‖Δ_{11}‖ ≤ max_{‖y‖=1} Σ_i (ρ_i/(w_iα)) yᵀB_iᵀB_iy + Σ_i (ρ_i/(w_iα)) yᵀ(w_iµ_iµ_iᵀA_i + w_iA_iµ_iµ_iᵀ + A_i²)y + O(α^{−2})
≤ Σ_i (ρ_i/(w_iα)) ‖B_i‖² + Σ_i (ρ_i/(w_iα)) (2‖A_i‖ + ‖A_i‖²) + O(α^{−2})
≤ 4kφ/(wα) + O(α^{−2}).

By a similar argument, ‖Δ_{22}‖ ≤ kφ/(wα) + O(α^{−2}). For Δ_{21}, we observe that Σ_{i=1}^k B_i = 0. Therefore,

‖Δ_{21}‖ ≤ ‖Σ_i (ρ_i − ρ̄)B_i‖ + ‖Σ_i (ρ_i/(w_iα)) (B_i(w_iµ_iµ_iᵀ) + B_iA_i + D_iB_i)‖ + O(α^{−2})
≤ Σ_i |ρ_i − ρ̄| ‖B_i‖ + Σ_i (ρ_i/(w_iα)) (√φ + √φ·φ + √φ) + O(α^{−2})
≤ k√φ/(2α) + 3k√φ/(wα) + O(α^{−2})
≤ 7k√φ/(2wα) + O(α^{−2}).

Thus, we have max{‖Δ_{11}‖, ‖Δ_{22}‖, ‖Δ_{21}‖} ≤ 4k√φ/(wα) + O(α^{−2}), so that

‖M − Γ‖ ≤ ‖Δ‖ + O(α^{−2}) ≤ 2 max{‖Δ_{11}‖, ‖Δ_{22}‖, ‖Δ_{21}‖} + O(α^{−2}) ≤ 8k√φ/(wα) + O(α^{−2}) ≤ 16k√φ/(wα),

for sufficiently large n, as α ≥ n/w.

3.4 Sample Convergence

We now give some bounds on the convergence of the transformation to isotropy (µ̂ ≈ 0 and Σ̂ ≈ I) and on the convergence of the reweighted sample mean û and sample matrix of second moments M̂ to their expectations u and M. For the convergence of second moment matrices, we use the following lemma due to Rudelson [2], which was presented in this form in [3].

Lemma 12. Let y be a random vector from a distribution D in R^n, with sup_D ‖y‖ = M and ‖E(yyᵀ)‖ ≤ 1. Let y_1, ..., y_m be independent samples from D. Let

η = C M √(log m / m),

where C is an absolute constant. Then,

(i) If η < 1, then

E( ‖(1/m) Σ_{i=1}^m y_iy_iᵀ − E(yyᵀ)‖ ) ≤ η.

(ii) For every t ∈ (0, 1),

P( ‖(1/m) Σ_{i=1}^m y_iy_iᵀ − E(yyᵀ)‖ > t ) ≤ 2 e^{−ct²/η²}.

This lemma is used to show that a distribution can be made nearly isotropic using only O*(kn) samples [2, 0]. The isotropic transformation is computed simply by estimating the mean and covariance matrix of a sample, and computing the affine transformation that puts the sample in isotropic position.

Theorem 4. There is an absolute constant C such that for an isotropic mixture of k logconcave distributions, with probability at least 1 − δ, a sample of size

m > C k n log²(n/δ)/ε²

gives a sample mean µ̂ and sample covariance Σ̂ so that

‖µ̂‖ ≤ ε and ‖Σ̂ − I‖ ≤ ε.

We now consider the reweighted moments.

Lemma 13. Let ε, δ > 0 and let û be the reweighted sample mean of a set of m points drawn from an isotropic mixture of k Gaussians in n dimensions, where

m ≥ (2nα/ε²) log(2n/δ).

Then

P[ ‖û − u‖ > ε ] ≤ δ.

Proof. We first consider only a single coordinate of the vector û. Let y = x_1 exp(−‖x‖²/(2α)) − u_1. We observe that

|x_1 exp(−‖x‖²/(2α))| ≤ |x_1| exp(−x_1²/(2α)) ≤ √(α/e) < √α.

Thus, each term in the sum m(û_1 − u_1) = Σ_{j=1}^m y_j falls in the range [−√α − u_1, √α − u_1]. We may therefore apply Hoeffding's inequality to show that

P[ |û_1 − u_1| ≥ ε/√n ] ≤ 2 exp( −2m²(ε/√n)² / (m(2√α)²) ) = 2 exp( −mε²/(2nα) ) ≤ δ/n.

Taking the union bound over the n coordinates, we have that with probability 1 − δ the error in each coordinate is at most ε/√n, which implies that ‖û − u‖ ≤ ε.
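As a quick Monte Carlo sanity check of the single-component first-moment formula from Lemma 7 (not part of the paper; the dimension, sample size, α, and component parameters below are made up), the reweighted sample mean of one Gaussian should be close to ρ(µ − Σµ/α) up to O(α^{−2}) and sampling error:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 3, 400000, 50.0
mu = np.array([0.8, -0.3, 0.5])
Sigma = np.diag([0.5, 2.0, 1.0])

X = rng.multivariate_normal(mu, Sigma, size=m)
wts = np.exp(-np.sum(X ** 2, axis=1) / (2 * alpha))

rho_hat = wts.mean()                                  # estimate of rho
emp = (X * wts[:, None]).mean(axis=0)                 # empirical E[x exp(-|x|^2/(2 alpha))]
pred = rho_hat * (mu - Sigma @ mu / alpha)            # Lemma 7, up to O(alpha^-2)
print("empirical:", emp)
print("predicted:", pred)
print("difference norm:", np.linalg.norm(emp - pred))
```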

Lemma 14. Let ε, δ > 0 and let M̂ be the reweighted sample matrix of second moments for a set of m points drawn from an isotropic mixture of k Gaussians in n dimensions, where

m ≥ C (nα/ε²) log(nα/δ)

and C is an absolute constant. Then

P[ ‖M̂ − M‖ > ε ] < δ.

Proof. We will apply Lemma 12. Define y = x exp(−‖x‖²/(2α)). Then

y_i² = x_i² exp(−‖x‖²/α) ≤ x_i² exp(−x_i²/α) ≤ α/e < α.

Therefore ‖y‖ ≤ √(αn). Next, since the mixture is in isotropic position (we can assume this w.l.o.g.), we have for any unit vector v, E((vᵀy)²) ≤ E((vᵀx)²) ≤ 1, and so ‖E(yyᵀ)‖ ≤ 1. Now we apply the second part of Lemma 12 with t = ε. For the resulting probability bound to be at most δ, we need ct²/η² ≥ ln(2/δ), i.e.,

η = C√(αn) √(log m / m) ≤ ε √(c/ln(2/δ)),

which is satisfied for our choice of m.

Lemma 15. Let X be a collection of m points drawn from a Gaussian with mean µ and variance σ². With probability 1 − δ,

|x − µ| ≤ σ √(2 log(m/δ))

for every x ∈ X.

3.5 Perturbation Lemma

We will use the following key lemma due to Stewart [6] to show that when we apply the spectral step, the top (k−1)-dimensional invariant subspace will be close to the Fisher subspace.

Lemma 16 (Stewart's Theorem). Suppose A and A + E are n-by-n symmetric matrices and that

A = [ D_1  0 ; 0  D_2 ],   E = [ E_{11}  E_{21}ᵀ ; E_{21}  E_{22} ],

where D_1 and E_{11} are r×r blocks and D_2 and E_{22} are (n−r)×(n−r) blocks. Let the columns of V be the top r eigenvectors of the matrix A + E and let P_2 be the matrix with columns e_{r+1}, ..., e_n. If d = λ_r(D_1) − λ_1(D_2) > 0 and ‖E‖ ≤ d/5, then

‖VᵀP_2‖ ≤ 4‖E_{21}‖/d.
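A small numerical illustration of Stewart's bound (not part of the paper; the block sizes, eigenvalues, and perturbation scale are made up but chosen so the hypothesis ‖E‖ ≤ d/5 holds):

```python
import numpy as np

rng = np.random.default_rng(4)
r, s = 2, 5
D1 = np.diag([3.0, 2.5])                  # top block, eigenvalues >= 2.5
D2 = np.diag([1.0, 0.8, 0.5, 0.3, 0.1])   # bottom block, eigenvalues <= 1.0
A = np.block([[D1, np.zeros((r, s))], [np.zeros((s, r)), D2]])

E = rng.normal(size=(r + s, r + s)) * 0.02
E = (E + E.T) / 2                         # symmetric perturbation, small enough that ||E|| <= d/5
d = D1.min() - D2.max()                   # spectral gap, here 1.5

evals, evecs = np.linalg.eigh(A + E)
V = evecs[:, -r:]                         # top r eigenvectors of A + E
P2 = np.eye(r + s)[:, r:]                 # coordinates of the bottom block
lhs = np.linalg.norm(V.T @ P2, 2)
rhs = 4 * np.linalg.norm(E[r:, :r], 2) / d
print(lhs, "<=", rhs)
```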

4 Finding a Vector near the Fisher Subspace

In this section, we combine the approximations of Section 3.3 and the perturbation lemma of Section 3.5 to show that the direction h chosen by step 3 of the algorithm is close to the intermean subspace. Section 5 argues that this direction can be used to partition the components. Finding the separating direction is the most challenging part of the classification task and represents the main contribution of this work.

We first assume zero overlap and that the sample reweighted moments behave exactly according to expectation. In this case, the mean shift û becomes

v ≡ Σ_{i=1}^k w_iρ_iµ_i.

We can intuitively think of the components that have greater ρ_i as gaining mixing weight and those with smaller ρ_i as losing mixing weight. As long as the ρ_i are not all equal, we will observe some shift of the mean in the intermean subspace, i.e., the Fisher subspace. Therefore, we may use this direction to partition the components. On the other hand, if all of the ρ_i are equal, then M̂ becomes

Γ = [ Σ_i ρ_i(w_iµ_iµ_iᵀ + A_i)  0 ; 0  Σ_i ρ_iD_i − (1/α)Σ_i (ρ_i/w_i)D_i² ] = ρ̄ [ I  0 ; 0  I − (1/α) Σ_{i=1}^k (1/w_i) D_i² ].

Notice that the second moments in the subspace span{e_1, ..., e_{k−1}} are maintained while those in the complementary subspace are reduced by poly(1/α). Therefore, the top eigenvectors will be in the intermean subspace, which is the Fisher subspace.

We now argue that this same strategy can be adapted to work in general, i.e., with nonzero overlap and sampling errors, with high probability. A critical aspect of this argument is that the norm of the error term ‖M̂ − Γ‖ depends only on φ and k and not on the dimension of the data. See Lemma 10 and the supporting Lemma 4 and Fact 3. Since we cannot know directly how imbalanced the ρ_i are, we choose the method of finding a separating direction according to the norm of the vector û. Recall that when ‖û‖ > √w/(32α) the algorithm uses û to determine the separating direction h. Lemma 17 guarantees that this vector is close to the Fisher subspace. When ‖û‖ ≤ √w/(32α), the algorithm uses the top eigenvector of the covariance matrix M̂. Lemma 18 guarantees that this vector is close to the Fisher subspace.

Lemma 17 (Mean Shift Method). Let ε > 0. There exists a constant C such that if m ≥ Cn⁴ poly(k, 1/w, log(n/δ)), then the following holds with probability 1 − δ. If ‖û‖ > √w/(32α) and

φ ≤ w²ε²/(2^{14}k²),

then

ûᵀv/(‖û‖‖v‖) ≥ 1 − ε.

Lemma 18 (Spectral Method). Let ε > 0. There exists a constant C such that if m ≥ Cn⁴ poly(k, 1/w, log(n/δ)), then the following holds with probability 1 − δ. Let v_1, ..., v_{k−1} be the top k−1 eigenvectors of M̂. If ‖û‖ ≤ √w/(32α) and

φ ≤ w²ε²/(2^{19}k²),

then

min_{v ∈ span{v_1,...,v_{k−1}}, ‖v‖=1} ‖proj_F(v)‖ ≥ 1 − ε.

4.1 Mean Shift

Proof of Lemma 17. We will make use of the following claim.

Claim 19. For any vectors a, b ≠ 0,

aᵀb/(‖a‖‖b‖) ≥ (1 − ‖a − b‖²/max{‖a‖², ‖b‖²})^{1/2}.

By the triangle inequality, ‖û − v‖ ≤ ‖û − u‖ + ‖u − v‖. By Lemma 9,

‖u − v‖ ≤ √(4k²φ/(α²w)) = (2k/(α√w))√φ ≤ (2k/(α√w)) · wε/(2^7k) = √w ε/(2^6α).

By Lemma 13, for large m we obtain the same bound on ‖û − u‖ with probability 1 − δ. Thus,

‖û − v‖² ≤ wε²/(2^{10}α²).

Applying the claim gives

ûᵀv/(‖û‖‖v‖) ≥ 1 − ‖û − v‖²/‖û‖² ≥ 1 − (wε²/(2^{10}α²)) · (2^{10}α²/w) = 1 − ε² ≥ 1 − ε.

Proof of Claim 19. Without loss of generality, assume ‖a‖ ≥ ‖b‖ and fix the distance ‖a − b‖. In order to maximize the angle between a and b, the vector b should be chosen so that it is tangent to the sphere centered at a with radius ‖a − b‖. Hence, the vectors a, b, (a − b) form a right triangle where ‖a‖² = ‖b‖² + ‖a − b‖². For this choice of b, let θ be the angle between a and b, so that

aᵀb/(‖a‖‖b‖) = cos θ = (1 − sin²θ)^{1/2} = (1 − ‖a − b‖²/‖a‖²)^{1/2}.

4.2 Spectral Method

We first show that the smallness of the mean shift û implies that the coefficients ρ_i are sufficiently uniform to allow us to apply the spectral method.

Claim 20 (Small Mean Shift Implies Balanced Second Moments). If ‖û‖ ≤ √w/(32α) and

φ ≤ w²/(2^{14}k²),

then

‖ρ − ρ̄ 1‖ ≤ 1/(8α).

Proof. Let q_1, ..., q_k be the right singular vectors of the matrix U = [w_1µ_1, ..., w_kµ_k] and let σ_i(U) be the ith largest singular value. Because Σ_{i=1}^k w_iµ_i = 0, we have that σ_k(U) = 0 and q_k = (1/√k)(1, ..., 1)ᵀ. Recall that ρ is the vector of the k scalars ρ_1, ..., ρ_k and that v = Uρ. Then

‖v‖² = ‖Uρ‖² = Σ_{i=1}^k σ_i(U)²(q_iᵀρ)² ≥ σ_{k−1}(U)² ‖ρ − q_k(q_kᵀρ)‖² = σ_{k−1}(U)² ‖ρ − ρ̄ 1‖².

The squared singular values of U are the eigenvalues of UUᵀ = Σ_{i=1}^k w_i²µ_iµ_iᵀ, so by the characterization of φ following Lemma 1,

σ_{k−1}(U)² = λ_{k−1}(Σ_{i=1}^k w_i²µ_iµ_iᵀ) ≥ w λ_{k−1}(Σ_{i=1}^k w_iµ_iµ_iᵀ) ≥ w(1 − φ).

Thus, we have the bound

‖ρ − ρ̄ 1‖ ≤ ‖v‖/√((1 − φ)w) ≤ 2‖v‖/√w.

By the triangle inequality, ‖v‖ ≤ ‖û‖ + ‖û − v‖. As argued in Lemma 9 (and Lemma 13 for the sampling error),

‖û − v‖ ≤ 2√(4k²φ/(α²w)) = (4k/(α√w))√φ ≤ (4k/(α√w)) · w/(2^7k) = √w/(2^5α).

Therefore,

‖ρ − ρ̄ 1‖ ≤ (2/√w)(‖û‖ + ‖û − v‖) ≤ (2/√w)(√w/(32α) + √w/(32α)) = 1/(8α).

We next show that the top k−1 principal components of Γ span the intermean subspace and put a lower bound on the spectral gap between the intermean and non-intermean components.

Lemma 21 (Ideal Case). If max_i |ρ_i − ρ̄| ≤ 1/(8α), then

λ_{k−1}(Γ) − λ_k(Γ) ≥ 1/(4α),

and the top k−1 eigenvectors of Γ span the means of the components.

Proof of Lemma 21. We first bound λ_{k−1}(Γ_{11}) from below. Recall that

Γ_{11} = Σ_{i=1}^k ρ_i(w_iµ_iµ_iᵀ + A_i).

Thus,

λ_{k−1}(Γ_{11}) = min_{‖y‖=1} Σ_{i=1}^k ρ_i yᵀ(w_iµ_iµ_iᵀ + A_i)y ≥ ρ̄ − max_{‖y‖=1} Σ_{i=1}^k (ρ̄ − ρ_i) yᵀ(w_iµ_iµ_iᵀ + A_i)y.

We observe that Σ_{i=1}^k yᵀ(w_iµ_iµ_iᵀ + A_i)y = 1 and each term is non-negative. Hence the sum is bounded by

Σ_{i=1}^k (ρ̄ − ρ_i) yᵀ(w_iµ_iµ_iᵀ + A_i)y ≤ max_i |ρ_i − ρ̄|,

so

λ_{k−1}(Γ_{11}) ≥ ρ̄ − max_i |ρ_i − ρ̄|.

Next, we bound λ_1(Γ_{22}). Recall that

Γ_{22} = Σ_{i=1}^k ρ_iD_i − (1/α) Σ_{i=1}^k (ρ_i/w_i)D_i²,

and that for any (n−k+1)-vector y such that ‖y‖ = 1, we have Σ_{i=1}^k yᵀD_iy = 1. Using the same arguments as above,

λ_1(Γ_{22}) = max_{‖y‖=1} [ ρ̄ + Σ_i (ρ_i − ρ̄) yᵀD_iy − (1/α) Σ_i (ρ_i/w_i) yᵀD_i²y ] ≤ ρ̄ + max_i |ρ_i − ρ̄| − (1/α) min_{‖y‖=1} Σ_i (ρ_i/w_i) yᵀD_i²y.

To bound the last sum, we observe that ρ_i − ρ̄ = O(α^{−1}). Therefore,

Σ_i (ρ_i/(w_iα)) yᵀD_i²y ≥ (ρ̄/α) Σ_i (1/w_i) yᵀD_i²y + O(α^{−2}).

Without loss of generality, we may assume that y = e_1 by an appropriate rotation of the D_i. Let D_i(l, j) be the element in the lth row and jth column of the matrix D_i. Then the sum becomes

Σ_i (1/w_i) yᵀD_i²y = Σ_i (1/w_i) Σ_{j=1}^{n−k+1} D_i(1, j)² ≥ Σ_i (1/w_i) D_i(1, 1)².

Because Σ_{i=1}^k D_i = I, we have Σ_{i=1}^k D_i(1,1) = 1. From the Cauchy–Schwarz inequality, it follows that

1 = Σ_{i=1}^k D_i(1,1) = Σ_{i=1}^k √w_i · (D_i(1,1)/√w_i) ≤ (Σ_{i=1}^k w_i)^{1/2} (Σ_{i=1}^k D_i(1,1)²/w_i)^{1/2}.

Since Σ_{i=1}^k w_i = 1, we conclude that Σ_{i=1}^k D_i(1,1)²/w_i ≥ 1. Thus, using the fact that ρ̄ ≥ 1/2, we have

Σ_i (ρ_i/(w_iα)) yᵀD_i²y ≥ 1/(2α) + O(α^{−2}).

Putting the bounds together,

λ_{k−1}(Γ_{11}) − λ_1(Γ_{22}) ≥ 1/(2α) − 2 max_i |ρ_i − ρ̄| + O(α^{−2}) ≥ 1/(4α).

Proof of Lemma 18. To bound the effect of overlap and sample errors on the eigenvectors, we apply Stewart's Lemma (Lemma 16). Define d = λ_{k−1}(Γ) − λ_k(Γ) and E = M̂ − Γ. We assume that the mean shift satisfies ‖û‖ ≤ √w/(32α) and that φ is small. By Lemma 21, this implies that

d = λ_{k−1}(Γ) − λ_k(Γ) ≥ 1/(4α).    (5)

To bound ‖E‖, we use the triangle inequality ‖E‖ ≤ ‖Γ − M‖ + ‖M − M̂‖. Lemma 10 bounds the first term:

‖M − Γ‖ ≤ 16k√φ/(wα) ≤ ε/(40α),

using the assumption on φ. By Lemma 14, we obtain the same bound on ‖M − M̂‖ with probability 1 − δ for large enough m. Thus,

‖E‖ ≤ ε/(20α).

Combining this with the bound of Eqn. (5), we have ‖E‖ ≤ ε/(20α) ≤ 1/(20α) ≤ d/5 and

4‖E_{21}‖/d ≤ 4‖E‖/d ≤ 4 · (ε/(20α)) · 4α = 4ε/5 ≤ √(1 − (1 − ε)²),

enabling us to apply Stewart's Lemma to the matrix pair Γ and M̂.

By Lemma 21, the top k−1 eigenvectors of Γ, i.e., e_1, ..., e_{k−1}, span the means of the components. Let the columns of P_1 be these eigenvectors. Let the columns of P_2 be defined such that [P_1, P_2] is an orthonormal matrix, and let v_1, ..., v_{k−1} be the top k−1 eigenvectors of M̂. By Stewart's Lemma, letting the columns of V be v_1, ..., v_{k−1}, we have

‖VᵀP_2‖² ≤ 1 − (1 − ε)²,

or equivalently,

min_{v ∈ span{v_1,...,v_{k−1}}, ‖v‖=1} ‖proj_F(v)‖ = σ_{k−1}(VᵀP_1) ≥ 1 − ε.

5 Recursion

In this section, we show that for every direction h that is close to the intermean subspace, the largest-gap clustering step produces a pair of complementary halfspaces that partitions R^n while leaving only a small part of the probability mass on the wrong side of the partition, small enough that with high probability it does not affect the samples used by the algorithm.

Lemma 22. Let δ, δ′ > 0, where δ′ ≤ δ/(2m_2), and let m_2 ≥ (k/w) log(2k/δ). Suppose that h is a unit vector such that

‖proj_F(h)‖ ≥ 1 − w/(2^{10}(k−1)² log(1/δ′)).

Let F be a mixture of k > 1 Gaussians with overlap

φ ≤ w/(2^9(k−1)² log(1/δ′)).

Let X be a collection of m_2 points from F and let t be the midpoint of the largest gap in the set {hᵀx : x ∈ X}. With probability 1 − δ, the halfspace H_{h,t} has the following property: for a random sample y from F, either both y and µ_{l(y)} are in H_{h,t} or both are outside H_{h,t}, with probability 1 − δ′.

Proof of Lemma 22. The idea behind the proof is simple. We first show that two of the means are at least a constant distance apart. We then bound the width of a component along the direction h, i.e., the maximum distance between two points belonging to the same component. If the width of each component is small, then clearly the largest gap must fall between components. Setting t to be the midpoint of the gap, we avoid cutting any components.

We first show that at least one mean must be far from the origin in the direction h. Let the columns of P_1 be the vectors e_1, ..., e_{k−1}. The span of these vectors is also the span of the means, so we have

max_i (hᵀµ_i)² = max_i (hᵀP_1P_1ᵀµ_i)² ≥ ‖P_1ᵀh‖² Σ_{i=1}^k w_i ((P_1ᵀh/‖P_1ᵀh‖)ᵀ P_1ᵀµ_i)² ≥ ‖P_1ᵀh‖² (1 − φ) > 1/2.

Since the origin is the mean of the means, we conclude that the maximum distance between two means in the direction h is at least 1/2. Without loss of generality, we assume that the interval [0, 1/2] is contained between two means projected to h.

We now show that every point x drawn from component i falls in a narrow interval when projected to h. That is, x satisfies hᵀx ∈ b_i, where

b_i = [hᵀµ_i − (8(k−1))^{−1}, hᵀµ_i + (8(k−1))^{−1}].

We begin by examining the variance along h. Let e_k, ..., e_n be the columns of the n-by-(n−k+1) matrix P_2. Recall from Eqn. (2) that P_1ᵀ w_iΣ_i P_1 = A_i, that P_2ᵀ w_iΣ_i P_1 = B_i, and that P_2ᵀ w_iΣ_i P_2 = D_i.

The norms of these matrices are bounded according to Lemma 4. Also, the vector h = P_1P_1ᵀh + P_2P_2ᵀh. For convenience of notation we define ε such that ‖P_1ᵀh‖ = 1 − ε. Then ‖P_2ᵀh‖² = 1 − (1 − ε)² ≤ 2ε. We now argue

hᵀ w_iΣ_i h = hᵀP_1A_iP_1ᵀh + 2hᵀP_2B_iP_1ᵀh + hᵀP_2D_iP_2ᵀh ≤ 2(hᵀP_1A_iP_1ᵀh + hᵀP_2D_iP_2ᵀh) ≤ 2(‖P_1ᵀh‖²‖A_i‖ + ‖P_2ᵀh‖²‖D_i‖) ≤ 2(φ + 2ε).

Using the assumptions about φ and ε, we conclude that the maximum variance along h is at most

max_i hᵀΣ_ih ≤ (2/w) ( w/(2^9(k−1)² log(1/δ′)) + 2w/(2^{10}(k−1)² log(1/δ′)) ) = 1/(2^7(k−1)² log(1/δ′)).

We now translate these bounds on the variance to a bound on the difference between the minimum and maximum points along the direction h. By Lemma 15, with probability 1 − δ/2,

|hᵀ(x − µ_{l(x)})| ≤ √(2 hᵀΣ_{l(x)}h · log(2m_2/δ)) ≤ (8(k−1))^{−1} √(log(2m_2/δ)/log(1/δ′)) ≤ (8(k−1))^{−1}

for every x ∈ X. Thus, with probability 1 − δ/2, every point from X falls into the union of intervals b_1 ∪ ... ∪ b_k, where b_i = [hᵀµ_i − (8(k−1))^{−1}, hᵀµ_i + (8(k−1))^{−1}].

Because these intervals are centered about the means, at least the equivalent of one interval must fall outside the range [0, 1/2], which we assumed was contained between two projected means. Thus, the measure of the subset of [0, 1/2] that does not fall into one of the intervals is at least

1/2 − (k−1)/(4(k−1)) = 1/4.

This set can be cut into at most k−1 intervals, so the smallest possible gap between these intervals is (4(k−1))^{−1}, which is exactly the width of an interval. Because m_2 ≥ (k/w) log(2k/δ), the set X contains at least one sample from every component with probability 1 − δ/2. Overall, with probability 1 − δ, every component has at least one sample and all samples from component i fall in b_i. Thus, the largest gap between the sampled points will not contain one of the intervals b_1, ..., b_k. Moreover, the midpoint t of this gap must also fall outside of b_1 ∪ ... ∪ b_k, ensuring that no b_i is cut by t. By the same argument given above, any single point y from F is contained in b_1 ∪ ... ∪ b_k with probability 1 − δ′, proving the lemma.

In the proof of the main theorem for large k, we will need to have every point sampled from F in the recursion subtree classified correctly by the halfspace, so we will assume δ′ considerably smaller than δ/m_2. The second lemma shows that all submixtures have smaller overlap, ensuring that all the relevant lemmas apply in the recursive steps.

Lemma 23. The removal of any subset of components cannot induce a mixture with greater overlap than the original.

Proof of Lemma 23. Suppose that the components j+1, ..., k are removed from the mixture. Let ω = Σ_{i=1}^j w_i be a normalizing factor for the weights. Then if c = Σ_{i=1}^j w_iµ_i = −Σ_{i=j+1}^k w_iµ_i, the


Tangent spaces, normals and extrema

Tangent spaces, normals and extrema Chapter 3 Tangent spaces, normals and extrema If S is a surface in 3-space, with a point a S where S looks smooth, i.e., without any fold or cusp or self-crossing, we can intuitively define the tangent

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

Symmetric Matrices and Eigendecomposition

Symmetric Matrices and Eigendecomposition Symmetric Matrices and Eigendecomposition Robert M. Freund January, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 Symmetric Matrices and Convexity of Quadratic Functions

More information

Throughout these notes we assume V, W are finite dimensional inner product spaces over C.

Throughout these notes we assume V, W are finite dimensional inner product spaces over C. Math 342 - Linear Algebra II Notes Throughout these notes we assume V, W are finite dimensional inner product spaces over C 1 Upper Triangular Representation Proposition: Let T L(V ) There exists an orthonormal

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Convergence in shape of Steiner symmetrized line segments. Arthur Korneychuk

Convergence in shape of Steiner symmetrized line segments. Arthur Korneychuk Convergence in shape of Steiner symmetrized line segments by Arthur Korneychuk A thesis submitted in conformity with the requirements for the degree of Master of Science Graduate Department of Mathematics

More information

Dissertation Defense

Dissertation Defense Clustering Algorithms for Random and Pseudo-random Structures Dissertation Defense Pradipta Mitra 1 1 Department of Computer Science Yale University April 23, 2008 Mitra (Yale University) Dissertation

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

LINEAR ALGEBRA W W L CHEN

LINEAR ALGEBRA W W L CHEN LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008. This chapter is available free to all individuals, on the understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

Learning convex bodies is hard

Learning convex bodies is hard Learning convex bodies is hard Navin Goyal Microsoft Research India navingo@microsoft.com Luis Rademacher Georgia Tech lrademac@cc.gatech.edu Abstract We show that learning a convex body in R d, given

More information

Linear Algebra. Paul Yiu. 6D: 2-planes in R 4. Department of Mathematics Florida Atlantic University. Fall 2011

Linear Algebra. Paul Yiu. 6D: 2-planes in R 4. Department of Mathematics Florida Atlantic University. Fall 2011 Linear Algebra Paul Yiu Department of Mathematics Florida Atlantic University Fall 2011 6D: 2-planes in R 4 The angle between a vector and a plane The angle between a vector v R n and a subspace V is the

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Basic Elements of Linear Algebra

Basic Elements of Linear Algebra A Basic Review of Linear Algebra Nick West nickwest@stanfordedu September 16, 2010 Part I Basic Elements of Linear Algebra Although the subject of linear algebra is much broader than just vectors and matrices,

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

MATH Linear Algebra

MATH Linear Algebra MATH 304 - Linear Algebra In the previous note we learned an important algorithm to produce orthogonal sequences of vectors called the Gramm-Schmidt orthogonalization process. Gramm-Schmidt orthogonalization

More information

Math Linear Algebra II. 1. Inner Products and Norms

Math Linear Algebra II. 1. Inner Products and Norms Math 342 - Linear Algebra II Notes 1. Inner Products and Norms One knows from a basic introduction to vectors in R n Math 254 at OSU) that the length of a vector x = x 1 x 2... x n ) T R n, denoted x,

More information

Dot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem.

Dot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Dot Products K. Behrend April 3, 008 Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Contents The dot product 3. Length of a vector........................

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

Lecture 8 : Eigenvalues and Eigenvectors

Lecture 8 : Eigenvalues and Eigenvectors CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with

More information

Convex and Semidefinite Programming for Approximation

Convex and Semidefinite Programming for Approximation Convex and Semidefinite Programming for Approximation We have seen linear programming based methods to solve NP-hard problems. One perspective on this is that linear programming is a meta-method since

More information

2. Matrix Algebra and Random Vectors

2. Matrix Algebra and Random Vectors 2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns

More information

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

1. General Vector Spaces

1. General Vector Spaces 1.1. Vector space axioms. 1. General Vector Spaces Definition 1.1. Let V be a nonempty set of objects on which the operations of addition and scalar multiplication are defined. By addition we mean a rule

More information

Algorithmic Convex Geometry

Algorithmic Convex Geometry Algorithmic Convex Geometry August 2011 2 Contents 1 Overview 5 1.1 Learning by random sampling.................. 5 2 The Brunn-Minkowski Inequality 7 2.1 The inequality.......................... 8 2.1.1

More information

LINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM

LINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM LINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM Unless otherwise stated, all vector spaces in this worksheet are finite dimensional and the scalar field F is R or C. Definition 1. A linear operator

More information

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Semidefinite Programming

Semidefinite Programming Semidefinite Programming Notes by Bernd Sturmfels for the lecture on June 26, 208, in the IMPRS Ringvorlesung Introduction to Nonlinear Algebra The transition from linear algebra to nonlinear algebra has

More information

In English, this means that if we travel on a straight line between any two points in C, then we never leave C.

In English, this means that if we travel on a straight line between any two points in C, then we never leave C. Convex sets In this section, we will be introduced to some of the mathematical fundamentals of convex sets. In order to motivate some of the definitions, we will look at the closest point problem from

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques, which are widely used to analyze and visualize data. Least squares (LS)

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

EXTENSIONS OF PRINCIPAL COMPONENTS ANALYSIS

EXTENSIONS OF PRINCIPAL COMPONENTS ANALYSIS EXTENSIONS OF PRINCIPAL COMPONENTS ANALYSIS A Thesis Presented to The Academic Faculty by S. Charles Brubaker In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and Athens Journal of Sciences December 2014 Discriminant Analysis with High Dimensional von Mises - Fisher Distributions By Mario Romanazzi This paper extends previous work in discriminant analysis with von

More information

Linear Algebra and Robot Modeling

Linear Algebra and Robot Modeling Linear Algebra and Robot Modeling Nathan Ratliff Abstract Linear algebra is fundamental to robot modeling, control, and optimization. This document reviews some of the basic kinematic equations and uses

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

Theorems of Erdős-Ko-Rado type in polar spaces

Theorems of Erdős-Ko-Rado type in polar spaces Theorems of Erdős-Ko-Rado type in polar spaces Valentina Pepe, Leo Storme, Frédéric Vanhove Department of Mathematics, Ghent University, Krijgslaan 28-S22, 9000 Ghent, Belgium Abstract We consider Erdős-Ko-Rado

More information

October 25, 2013 INNER PRODUCT SPACES

October 25, 2013 INNER PRODUCT SPACES October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal

More information

Chapter 6: Orthogonality

Chapter 6: Orthogonality Chapter 6: Orthogonality (Last Updated: November 7, 7) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). A few theorems have been moved around.. Inner products

More information

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming CSC2411 - Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming Notes taken by Mike Jamieson March 28, 2005 Summary: In this lecture, we introduce semidefinite programming

More information

Econ Slides from Lecture 8

Econ Slides from Lecture 8 Econ 205 Sobel Econ 205 - Slides from Lecture 8 Joel Sobel September 1, 2010 Computational Facts 1. det AB = det BA = det A det B 2. If D is a diagonal matrix, then det D is equal to the product of its

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

(v, w) = arccos( < v, w >

(v, w) = arccos( < v, w > MA322 Sathaye Notes on Inner Products Notes on Chapter 6 Inner product. Given a real vector space V, an inner product is defined to be a bilinear map F : V V R such that the following holds: For all v

More information

Ir O D = D = ( ) Section 2.6 Example 1. (Bottom of page 119) dim(v ) = dim(l(v, W )) = dim(v ) dim(f ) = dim(v )

Ir O D = D = ( ) Section 2.6 Example 1. (Bottom of page 119) dim(v ) = dim(l(v, W )) = dim(v ) dim(f ) = dim(v ) Section 3.2 Theorem 3.6. Let A be an m n matrix of rank r. Then r m, r n, and, by means of a finite number of elementary row and column operations, A can be transformed into the matrix ( ) Ir O D = 1 O

More information

Lecture 9: Low Rank Approximation

Lecture 9: Low Rank Approximation CSE 521: Design and Analysis of Algorithms I Fall 2018 Lecture 9: Low Rank Approximation Lecturer: Shayan Oveis Gharan February 8th Scribe: Jun Qi Disclaimer: These notes have not been subjected to the

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

4.2. ORTHOGONALITY 161

4.2. ORTHOGONALITY 161 4.2. ORTHOGONALITY 161 Definition 4.2.9 An affine space (E, E ) is a Euclidean affine space iff its underlying vector space E is a Euclidean vector space. Given any two points a, b E, we define the distance

More information

High-dimensional distributions with convexity properties

High-dimensional distributions with convexity properties High-dimensional distributions with convexity properties Bo az Klartag Tel-Aviv University A conference in honor of Charles Fefferman, Princeton, May 2009 High-Dimensional Distributions We are concerned

More information

MTH 2032 SemesterII

MTH 2032 SemesterII MTH 202 SemesterII 2010-11 Linear Algebra Worked Examples Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education December 28, 2011 ii Contents Table of Contents

More information

Algebra I Fall 2007

Algebra I Fall 2007 MIT OpenCourseWare http://ocw.mit.edu 18.701 Algebra I Fall 007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.701 007 Geometry of the Special Unitary

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

The Multivariate Gaussian Distribution

The Multivariate Gaussian Distribution The Multivariate Gaussian Distribution Chuong B. Do October, 8 A vector-valued random variable X = T X X n is said to have a multivariate normal or Gaussian) distribution with mean µ R n and covariance

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

LINEAR ALGEBRA BOOT CAMP WEEK 1: THE BASICS

LINEAR ALGEBRA BOOT CAMP WEEK 1: THE BASICS LINEAR ALGEBRA BOOT CAMP WEEK 1: THE BASICS Unless otherwise stated, all vector spaces in this worksheet are finite dimensional and the scalar field F has characteristic zero. The following are facts (in

More information

Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving. 22 Element-wise Sampling of Graphs and Linear Equation Solving

Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving. 22 Element-wise Sampling of Graphs and Linear Equation Solving Stat260/CS294: Randomized Algorithms for Matrices and Data Lecture 24-12/02/2013 Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving Lecturer: Michael Mahoney Scribe: Michael Mahoney

More information

Common-Knowledge / Cheat Sheet

Common-Knowledge / Cheat Sheet CSE 521: Design and Analysis of Algorithms I Fall 2018 Common-Knowledge / Cheat Sheet 1 Randomized Algorithm Expectation: For a random variable X with domain, the discrete set S, E [X] = s S P [X = s]

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Linear Algebra I. Ronald van Luijk, 2015

Linear Algebra I. Ronald van Luijk, 2015 Linear Algebra I Ronald van Luijk, 2015 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents Dependencies among sections 3 Chapter 1. Euclidean space: lines and hyperplanes 5 1.1. Definition

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

7. Symmetric Matrices and Quadratic Forms

7. Symmetric Matrices and Quadratic Forms Linear Algebra 7. Symmetric Matrices and Quadratic Forms CSIE NCU 1 7. Symmetric Matrices and Quadratic Forms 7.1 Diagonalization of symmetric matrices 2 7.2 Quadratic forms.. 9 7.4 The singular value

More information

CSC 411 Lecture 12: Principal Component Analysis

CSC 411 Lecture 12: Principal Component Analysis CSC 411 Lecture 12: Principal Component Analysis Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 12-PCA 1 / 23 Overview Today we ll cover the first unsupervised

More information

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x = Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.

More information

Min-Rank Conjecture for Log-Depth Circuits

Min-Rank Conjecture for Log-Depth Circuits Min-Rank Conjecture for Log-Depth Circuits Stasys Jukna a,,1, Georg Schnitger b,1 a Institute of Mathematics and Computer Science, Akademijos 4, LT-80663 Vilnius, Lithuania b University of Frankfurt, Institut

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Lecture Notes 2: Matrices

Lecture Notes 2: Matrices Optimization-based data analysis Fall 2017 Lecture Notes 2: Matrices Matrices are rectangular arrays of numbers, which are extremely useful for data analysis. They can be interpreted as vectors in a vector

More information

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5 Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5 Instructor: Farid Alizadeh Scribe: Anton Riabov 10/08/2001 1 Overview We continue studying the maximum eigenvalue SDP, and generalize

More information