Supplementary Material

A. Auxiliary Lemmas

Lemma A.1 (Shalev-Shwartz & Ben-David, 2014). Any update of the form
$$P_{t+1} = \Pi_C\big(P_t - \eta g_t\big),$$
for an arbitrary sequence of matrices $g_1, g_2, \dots, g_T$, projection $\Pi_C$ onto an arbitrary convex set $C$, and initialization $P_1 = 0$, satisfies, for any $P \in C$,
$$\sum_{t=1}^{T} \langle P_t - P,\, g_t\rangle \;\le\; \frac{\|P\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|^2 .$$

Lemma A.2 (Warmuth & Kuzmin, 2008). For any PSD $A$ and any symmetric $B, C$ for which $B \preceq C$, it holds that $\mathrm{tr}(AB) \le \mathrm{tr}(AC)$.

B. Proofs of Section 3

Proof of Theorem 2.1. Using Lemma A.1, noting $\|P\| \le \sqrt{k}$ for all $P \in C$, and $g_t = x_t x_t^\top$ so that $\|g_t\| = \|x_t\|^2 \le 1$,
$$\sum_{t=1}^{T} x_t^\top P x_t - x_t^\top P_t x_t \;\le\; \frac{k}{2\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{k/T}$ completes the proof.

Proof of Theorem 3.1. Let $\hat g_t = g_t + E_t$ be the noisy gradient. The analysis begins with bounding the distance between the $t$-th iterate and any candidate $P \in C$, $D_t = \|P_t - P\|$:
$$\begin{aligned}
D_{t+1}^2 &= \|P_{t+1} - P\|^2 = \|\Pi_C(P_t - \eta \hat g_t) - P\|^2 \le \|P_t - \eta \hat g_t - P\|^2\\
&= \|P_t - P\|^2 + \eta^2\|\hat g_t\|^2 - 2\eta\,\langle P_t - P,\, g_t + E_t\rangle\\
&\le D_t^2 + \eta^2 G^2 - 2\eta\,\langle P_t - P,\, g_t\rangle - 2\eta\,\langle P_t - P,\, E_t\rangle\\
&\le D_t^2 + \eta^2 G^2 - 2\eta\,\langle P_t - P,\, g_t\rangle + 2\eta\,\|P_t - P\|\,\|E_t\|\\
&\le D_t^2 + \eta^2 G^2 - 2\eta\,\langle P_t - P,\, g_t\rangle + 4\sqrt{k}\,\eta\,\|E_t\|,
\end{aligned}$$
where the first inequality is due to projection onto a convex set being non-expanding and the third one is by Hölder's inequality. The last inequality follows from the triangle inequality and the constraints: $\|P_t - P\| \le \|P_t\| + \|P\| \le 2\sqrt{k}$. Noting $g_t = x_t x_t^\top$, rearranging and dividing both sides by $2\eta$ we get
$$\langle P - P_t,\, x_t x_t^\top\rangle \;\le\; \frac{D_t^2 - D_{t+1}^2}{2\eta} + \frac{\eta}{2}G^2 + 2\sqrt{k}\,\|E_t\|. \qquad (5)$$
We then sum over iterates and use the telescopic property of the first term on the right-hand side of (5), so that $\sum_{t}\big(D_t^2 - D_{t+1}^2\big) = D_1^2 - D_{T+1}^2 \le D_1^2$. The initial distance $D_1$ is bounded as follows:
$$D_1^2 = \|P_1 - P\|^2 = \|P_1\|^2 + \|P\|^2 - 2\langle P_1, P\rangle \le k + k + 2\|P_1\|\,\|P\| \le 4k.$$
Plugging the above bound back into (5) and noting $\|E_t\| \le E$, we get
$$\sum_{t=1}^{T} x_t^\top P x_t - x_t^\top P_t x_t \;\le\; \frac{2k}{\eta} + \frac{\eta T G^2}{2} + 2\sqrt{k}\,T E.$$
By choosing the optimal learning rate $\eta = \frac{2}{G}\sqrt{k/T}$ we get the desired result.
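The update analysed in Lemma A.1 and used in the two proofs above is an ordinary projected (stochastic) gradient step on the matrix iterate. The following is a minimal Python sketch of this scheme for the $k$-PCA reward $x^\top P x$, under the assumption that the comparator set is $C = \{P : 0 \preceq P \preceq I,\ \mathrm{tr}\,P = k\}$ and that projection onto $C$ is computed by shifting and clipping eigenvalues; the choice of set, the bisection-based projection routine, and all function names are our own illustration choices, not the paper's implementation.

```python
import numpy as np

def project_capped_simplex(P, k):
    """Project a symmetric matrix onto {0 <= P <= I, tr(P) = k} in Frobenius norm.

    The projection only shifts eigenvalues: find s such that
    sum_i clip(lam_i + s, 0, 1) = k, then clip.  Assumes 1 <= k <= d.
    """
    lam, V = np.linalg.eigh((P + P.T) / 2)
    def capped_trace(s):
        return np.clip(lam + s, 0.0, 1.0).sum()
    lo, hi = -lam.max(), 1.0 - lam.min()     # capped_trace(lo) = 0, capped_trace(hi) = d
    for _ in range(100):                     # bisection on the monotone map s -> capped_trace(s)
        mid = (lo + hi) / 2
        if capped_trace(mid) < k:
            lo = mid
        else:
            hi = mid
    lam_proj = np.clip(lam + hi, 0.0, 1.0)
    return (V * lam_proj) @ V.T

def msg_pca(stream, d, k, eta):
    """Projected gradient steps on the reward x^T P x (descent on its negative,
    which is the form covered by Lemma A.1), starting from P_1 = 0."""
    P = np.zeros((d, d))
    for x in stream:
        P = project_capped_simplex(P + eta * np.outer(x, x), k)
    return P
```

Because the constraint set is rotation-invariant, the projection never mixes eigenvectors; only the spectrum is shifted and clipped, which is why a one-dimensional bisection suffices.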
Proof of Theorem 3.2. Let $\hat g_k = (x_k + y_k)(x_k + y_k)^\top$ denote the corrupted gradient and let $g_k = x_k x_k^\top$ denote the unbiased estimate of $C$ based on the $k$-th sample. Let $v$ denote the top eigenvector of $C := \mathbb{E}_D[xx^\top]$ with associated eigenvalue $\lambda$, and let $u \sim N(0, I)$ be the initialization for Oja's algorithm. We are going to assume that $2\|y_k\| + \|y_k\|^2 \le 1$. We note that our assumption implies that $\|\hat g_k - g_k\| \le O(1)$ and $\|\hat g_k\|^n \le O(2^{n/2})$. Let $\Phi^M_k$ and $\Psi_k$ be defined as in the proof of Theorem 5.3. We are first going to upper bound $\mathrm{tr}\,\Phi^{uu^\top}_T$:
$$\begin{aligned}
\mathrm{tr}\,\Phi^{uu^\top}_T &= \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big((\hat g_T - g_T)\Phi^{uu^\top}_{T-1}\big) + 2\eta\,\mathrm{tr}\big(x_T x_T^\top\,\Phi^{uu^\top}_{T-1}\big) + \eta^2\,\mathrm{tr}\big(\hat g_T\,\Phi^{uu^\top}_{T-1}\,\hat g_T\big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\Big(1 + 2\eta\, w_T^\top g_T w_T + 2\eta\,\|\hat g_T - g_T\| + \eta^2\|\hat g_T\|^2\Big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\,\exp\Big(2\eta\, w_T^\top g_T w_T + 2\eta\,\|\hat g_T - g_T\| + \eta^2\|\hat g_T\|^2\Big)\\
&\le \cdots \le \|u\|^2\,\exp\Big(\sum_{k=1}^{T} 2\eta\, w_k^\top g_k w_k + 2\eta\,\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\Big),
\end{aligned}$$
where we have used Lemma A.2 (Warmuth & Kuzmin, 2008). Next we are going to lower bound $\mathbb{E}[\mathrm{tr}\,\Psi_T]$:
$$\begin{aligned}
\mathbb{E}[\mathrm{tr}\,\Psi_T] &= \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta\hat g_T)\Phi^I_{T-1}\big)\big] + \eta^2\,\mathbb{E}\big[v^\top\hat g_T\Phi^I_{T-1}\hat g_T v\big]\\
&\ge \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta C)\Phi^I_{T-1}\big)\big] + 2\eta\,\mathbb{E}\big[\mathrm{tr}\big(vv^\top y_T y_T^\top\,\Phi^I_{T-1}\big)\big]\\
&\ge (1 + 2\eta\lambda)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big] \ge \exp\big(2\eta\lambda - 4\eta^2\lambda^2\big)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big] \ge \cdots \ge \exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big), \qquad (7)
\end{aligned}$$
where we used the fact that $\mathbb{E}[x_k] = 0$ and that $1 + x \ge \exp(x - x^2)$ for $x \in [0, 1]$. Finally, using the Lieb-Thirring inequality we have the following upper bound on $\mathbb{E}[\mathrm{tr}\,\Psi_T^2]$:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T^2] \le \mathbb{E}\big[\mathrm{tr}\big((I + \eta\hat g_T)^4\,\Psi_{T-1}^2\big)\big] \le \cdots \le \exp\Big(T\big(4\eta\lambda + \alpha_1\eta + \alpha_2\eta^2 + \alpha_3\eta^3 + \alpha_4\eta^4\big)\Big), \qquad (8)$$
where for the last inequality we have pushed the expectation inside the trace and used the fact that $\mathbb{E}[\|\hat g\|^n]$ is bounded above by a linear combination of the moments $\|y\|^m$, $m \in [n]$; we have also used the fact that $1 + x \le \exp(x)$ and induction, together with $\|y_k\| \le O(1)$ and $\|y_k\|^n \le O(2^{n/2})$.

Let $\eta = \beta/\sqrt{T}$ for some constant $\beta$ independent of $T$. Using Chebyshev's inequality with equations (7) and (8), we claim that for small enough $\beta$ it holds, with probability at least $1 - \delta$, that
$$\mathrm{tr}\,\Psi_T \ge \tfrac{1}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
For this choice of $\eta$, the claim reduces to $\exp(\beta\bar\alpha) \le 1 + \delta/9$, where $\bar\alpha = \alpha_1 + \alpha_2\beta + \lambda^2\beta + \alpha_3\beta^2 + \alpha_4\beta^3$. Taking $\beta \to 0$ we can see that $\exp(\bar\alpha\beta) < 1 + \delta/9$, and since $\exp(\bar\alpha\beta)$ is a continuous function of $\beta$, the desired value of $\beta > 0$ exists. From the derivation in (Allen-Zhu & Li, 2017, Appendix I), it follows that $\|u\|^2 \le O(d + \log(1/\delta))$ and $\mathrm{tr}\,\Phi^{uu^\top}_T \ge \Omega(\delta^2)\,\mathrm{tr}\,\Psi_T$ with probability $1 - \delta$. Thus with probability $1 - 2\delta$ it holds that
$$O(d + \log(1/\delta))\,\exp\Big(\sum_{k=1}^{T} 2\eta\, w_k^\top g_k w_k + 2\eta\,\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\Big) \ge \frac{\delta^2}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big),$$
which, after taking logarithms and rearranging, gives
$$\sum_{k=1}^{T}\big\langle g_k,\, P^* - P_k\big\rangle \le \frac{\log O(d + \log(1/\delta)) + \log(3/\delta^2)}{2\eta} + O\Big(\sum_{k=1}^{T}\|\hat g_k - g_k\|\Big) + \eta\, O\Big(\sum_{k=1}^{T}\|\hat g_k\|^2\Big) + 2\eta\lambda^2 T.$$
Substituting $\eta = \beta/\sqrt{T}$ finishes the proof.

C. Proofs of Section 4

Lemma C.1. The naive estimator $\tilde x\tilde x^\top$ is not unbiased, i.e.
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = q^2\, xx^\top + q(1 - q)\,\mathrm{diag}(xx^\top),$$
where the expectation is taken with respect to the Bernoulli random model.

Proof of Lemma C.1. For off-diagonal elements $i \ne j$,
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,j} = x_i x_j\, \mathbb{P}\big(i, j \in \{i_1, \dots, i_r\}\big) = x_i x_j\, \mathbb{P}(Z_i = 1)\,\mathbb{P}(Z_j = 1) = q^2\, x_i x_j.$$
Now consider the diagonal elements:
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\, \mathbb{P}(Z_i = 1) = q\, x_i^2.$$
Putting these together, $\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = q^2\, xx^\top + q(1 - q)\,\mathrm{diag}(xx^\top)$.

Proof of Lemma 4.1. First note that
$$\mathbb{E}_S[zz^\top \mid \tilde x] = \frac{r_t(1 - q)}{q^2}\,\mathbb{E}_S\big[\tilde x_{i_s}^2\, e_{i_s} e_{i_s}^\top \mid \tilde x\big] = \frac{1 - q}{q^2}\,\mathrm{diag}(\tilde x\tilde x^\top).$$
The proof simply follows from the following equalities:
$$\begin{aligned}
\mathbb{E}_{S,R}\Big[\tfrac{1}{q^2}\tilde x\tilde x^\top - zz^\top \,\Big|\, x\Big] &= \tfrac{1}{q^2}\,\mathbb{E}_R\big[\mathbb{E}_S[\tilde x\tilde x^\top \mid \tilde x]\mid x\big] - \mathbb{E}_R\big[\mathbb{E}_S[zz^\top \mid \tilde x]\mid x\big]\\
&= \tfrac{1}{q^2}\,\mathbb{E}_R[\tilde x\tilde x^\top \mid x] - \tfrac{1-q}{q^2}\,\mathbb{E}_R[\mathrm{diag}(\tilde x\tilde x^\top)\mid x]\\
&= xx^\top + \tfrac{1-q}{q}\,\mathrm{diag}(xx^\top) - \tfrac{1-q}{q}\,\mathrm{diag}(xx^\top) = xx^\top. \qquad (9)
\end{aligned}$$
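Lemma C.1 and Lemma 4.1 together say that the naive second-moment estimate built from Bernoulli-masked coordinates over-weights the diagonal, and that subtracting a correctly scaled diagonal term (or a sampled surrogate of it) restores unbiasedness. The short Monte Carlo check below illustrates both facts. It uses the closed-form correction $\frac{1}{q^2}\tilde x\tilde x^\top - \frac{1-q}{q^2}\mathrm{diag}(\tilde x\tilde x^\top)$ rather than the paper's sampled term $zz^\top$ (the conditional mean is the same), and all numerical constants are illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n_trials = 8, 0.3, 200_000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                     # ||x|| = 1, as assumed in the analysis

acc_naive = np.zeros((d, d))
acc_debiased = np.zeros((d, d))
for _ in range(n_trials):
    mask = rng.random(d) < q               # Bernoulli(q) observation pattern (model R)
    xt = x * mask                          # observed vector, zeros at missing entries
    naive = np.outer(xt, xt)
    acc_naive += naive
    # debiasing implied by Lemma C.1: E[naive/q^2 - (1-q)/q^2 diag(naive) | x] = x x^T
    acc_debiased += naive / q**2 - (1 - q) / q**2 * np.diag(np.diag(naive))

pred_naive = q**2 * np.outer(x, x) + q * (1 - q) * np.diag(np.diag(np.outer(x, x)))
print("naive bias matches Lemma C.1:", np.allclose(acc_naive / n_trials, pred_naive, atol=5e-3))
print("debiased estimator ~ x x^T:  ", np.allclose(acc_debiased / n_trials, np.outer(x, x), atol=5e-3))
```

Both checks should report True up to Monte Carlo error with the sample size above.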
Proof of Theorem 4.2. We first bound $\|\hat g_t\|^2$ as follows:
$$\begin{aligned}
\|\hat g_t\|^2 &= \Big\|\tfrac{1}{q^2}\tilde x_t\tilde x_t^\top - z_t z_t^\top\Big\|^2 = \tfrac{1}{q^4}\|\tilde x_t\tilde x_t^\top\|^2 + \|z_t z_t^\top\|^2 - \tfrac{2}{q^2}\big\langle \tilde x_t\tilde x_t^\top,\, z_t z_t^\top\big\rangle\\
&\le \tfrac{1}{q^4}\|\tilde x_t\|^4 + \|z_t\|^4 \le \tfrac{1}{q^4} + \tfrac{r_t^2(1-q)^2}{q^4},
\end{aligned}$$
where the first inequality follows from the fact that the inner product of PSD matrices is non-negative, the second is by the definition of $\tilde x$ and $z$, and the last inequality holds since $\|\tilde x\| \le \|x\| \le 1$. Taking the expectation of both sides, and noting that for the Binomial distribution $\mathbb{E}_R[r_t^2] = \mathrm{var}(r_t) + \mathbb{E}_R[r_t]^2 = dq(1-q) + d^2q^2$, we get
$$\mathbb{E}_R\big[\|\hat g_t\|^2\big] \le \frac{1 + dq(1-q)^3 + d^2q^2(1-q)^2}{q^4}.$$
Now, using Lemma A.1 with $\{\hat g_t\}$ and noting $\|P\| \le \sqrt{k}$ for all $P \in C$,
$$\mathbb{E}\Big[\sum_{t=1}^{T}\big\langle P - P_t,\, \hat g_t\big\rangle\Big] \le \frac{k}{2\eta} + \frac{\eta T}{2}\cdot\frac{1 + dq(1-q)^3 + d^2q^2(1-q)^2}{q^4}.$$
Setting $\eta = q^2\sqrt{k}\Big/\sqrt{T\big(1 + dq(1-q)^3 + d^2q^2(1-q)^2\big)}$ and taking expectation with respect to the internal randomization $S$ and the distribution of the missing data $R$ finishes the proof.

Proof of Theorem 4.3 (Oja with missing entries). We begin by bounding $\mathbb{E}_{S,R}[\|\hat g_k\|^2]$, $\mathbb{E}_{S,R}[\|\hat g_k\|^3]$ and $\mathbb{E}_{S,R}[\|\hat g_k\|^4]$. Let $a := \|\tilde x_k\tilde x_k^\top\|$ and $b := \|z_k z_k^\top\|$. Using Jensen's inequality we have $\|\hat g_k\|^2 \le 2a^2 + 2b^2$, $\|\hat g_k\|^3 \le 4a^3 + 4b^3$ and $\|\hat g_k\|^4 \le 8a^4 + 8b^4$. Since $\|x\| \le 1$ it holds that $a^n \le 1/q^{2n}$ and $b^n \le r_k^n(1-q)^n/q^{2n}$. Let
$$\begin{aligned}
\mu_2 &= dq(1-q) + d^2q^2,\\
\mu_3 &= dq\big(1 - 3q + 3dq + 2q^2 - 3dq^2 + d^2q^2\big),\\
\mu_4 &= dq\big(1 - 7q + 7dq + 12q^2 - 18dq^2 + 6d^2q^2 - 6q^3 + 11dq^3 - 6d^2q^3 + d^3q^3\big).
\end{aligned}$$
These are the second, third and fourth moments of the binomial random variable $r_k$, respectively. Also let $\alpha_k = \big(1 + r_k(1-q)\big)/q^2$. Thus we have
$$\mathbb{E}_{S,R}\big[\|\hat g_k\|^n\big] \le \alpha_n := \frac{2^{n-1}\big(1 + \mu_n(1-q)^n\big)}{q^{2n}}.$$
Define $\Phi^M_k$, $\Psi_k$, $w_k$, $\lambda$, $v$ and $u$ as in the proof of Theorem 5.3. We have:
$$\begin{aligned}
\mathrm{tr}\,\Phi^{uu^\top}_T &= \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\big) + \eta^2\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\hat g_T\big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\big(1 + 2\eta\,w_T^\top\hat g_T w_T + \eta^2\alpha_2\big) \le \mathrm{tr}\,\Phi^{uu^\top}_{T-1}\,\exp\big(\eta^2\alpha_2 + 2\eta\,w_T^\top\hat g_T w_T\big)\\
&\le \cdots \le \|u\|^2\,\exp\Big(\sum_{t=1}^{T}\eta^2\alpha_2 + 2\eta\,w_t^\top\hat g_t w_t\Big).
\end{aligned}$$
Next we need a lower bound on $\mathbb{E}[\mathrm{tr}\,\Psi_T]$. This is done in the same way as in Theorem 5.3:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T] \ge \exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
Finally, we need to bound $\mathbb{E}[\mathrm{tr}\,\Psi_T^2]$. It holds that:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T^2] = \mathbb{E}\big[\mathrm{tr}\big(((I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T))^2\big)\big] \le \mathbb{E}\big[\mathrm{tr}\big((I + \eta\hat g_T)^4\,\Psi_{T-1}^2\big)\big] \le \exp\big(\bar\alpha\eta^2 + 4\eta\lambda\big)\,\mathbb{E}[\mathrm{tr}\,\Psi_{T-1}^2] \le \cdots \le \exp\big(T(\bar\alpha\eta^2 + 4\eta\lambda)\big),$$
where $\bar\alpha$ collects the $\alpha_2$, $\alpha_3$ and $\alpha_4$ terms arising from expanding $(I + \eta\hat g_T)^4$, the first inequality is the Lieb-Thirring inequality, and we again used $1 + x \le \exp(x)$. Using Chebyshev's inequality with the last two displays, as long as $\eta \le \frac{\log(1 + \delta/9)}{\bar\alpha + \lambda}$ it holds with probability $1 - \delta$ that
$$\mathrm{tr}\,\Psi_T \ge \tfrac{1}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
From the derivation in (Allen-Zhu & Li, 2017, Appendix I), it follows that $\|u\|^2 \le O(d + \log(1/\delta))$ and $\mathrm{tr}\,\Phi^{uu^\top}_T \ge \Omega(\delta^2)\,\mathrm{tr}\,\Psi_T$ with probability $1 - \delta$. Thus with probability $1 - 2\delta$ it holds that
$$O(d + \log(1/\delta))\,\exp\Big(\sum_{t=1}^{T}\eta^2\alpha_2 + 2\eta\,w_t^\top\hat g_t w_t\Big) \ge \frac{\delta^2}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
Taking logarithms on both sides implies that
$$T\big(2\eta\lambda - 4\eta^2\lambda^2\big) - 2\eta\sum_{t=1}^{T}\mathrm{tr}\big(\mathbb{E}_{S,R}[w_t w_t^\top]\,\mathbb{E}_{S,R}[\hat g_t]\big) \le \log O(d + \log(1/\delta)) + \log\frac{3}{\delta^2} + \eta^2\sum_{t=1}^{T}\mathbb{E}_{S,R}[\alpha_2],$$
where we have used the independence between $\hat g_t$ and $w_t$. This implies
$$\sum_{t=1}^{T}\big\langle g_t,\, P^* - \mathbb{E}_{S,R}[P_t]\big\rangle \le \frac{\log O(d + \log(1/\delta)) + \log(3/\delta^2)}{2\eta} + 2\eta\lambda^2 T + \frac{\eta\alpha_2 T}{2}.$$
Setting $\eta = \frac{\log(1 + \delta/9)}{\bar\alpha + \lambda}$ finishes the proof.
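Theorem 4.3 concerns Oja's algorithm driven by debiased gradients built from Bernoulli-masked samples. The sketch below is a small Python illustration of that algorithmic idea. It substitutes the closed-form diagonal correction of Lemma C.1 for the sampled term $z_k z_k^\top$ analysed in the proof (the conditional mean is identical), and the step size, masking rate and toy data model are arbitrary illustration choices rather than the paper's settings.

```python
import numpy as np

def oja_missing_entries(X, q, eta, rng):
    """One pass of Oja's algorithm over Bernoulli-masked samples (rows of X).

    Uses the debiased gradient implied by Lemma C.1,
        g_hat = xt xt^T / q^2 - (1 - q) / q^2 * diag(xt xt^T),
    which satisfies E[g_hat | x] = x x^T.  Theorem 4.3 analyses a sampled
    surrogate z z^T in place of the diagonal term; its conditional mean is the
    same, so this sketch captures the algorithm, not the exact estimator in the proof.
    """
    n, d = X.shape
    w = rng.standard_normal(d)            # u ~ N(0, I) initialization, as in the analysis
    w /= np.linalg.norm(w)
    for x in X:
        mask = rng.random(d) < q          # Bernoulli(q) observation pattern
        xt = x * mask                     # partially observed sample
        # apply (I + eta * g_hat) to w without forming the d x d matrix
        gw = xt * (xt @ w) / q**2 - (1 - q) / q**2 * (xt**2) * w
        w = w + eta * gw
        w /= np.linalg.norm(w)            # Oja normalization step
    return w

# toy usage: planted covariance v v^T + 0.01 I, top eigenvector v = e_0
rng = np.random.default_rng(1)
d, n = 50, 20_000
v = np.eye(d)[0]
X = 0.1 * rng.standard_normal((n, d)) + rng.standard_normal((n, 1)) * v
w = oja_missing_entries(X, q=0.5, eta=0.02, rng=rng)
print("alignment |<w, v>| =", abs(w @ v))
```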
D. Proofs of Section 5

Lemma D.1. The naive estimator $\tilde x\tilde x^\top$ is not unbiased, i.e.
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = \frac{r(r-1)}{d(d-1)}\, xx^\top + \frac{r(d-r)}{d(d-1)}\,\mathrm{diag}(xx^\top),$$
where the expectation is taken with respect to the uniform sampling model.

Proof of Lemma D.1. For off-diagonal elements $i \ne j$, $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,j} = x_i x_j\,\mathbb{P}(i, j \in \{i_1, \dots, i_r\})$, so we need to compute the probability of the event that two fixed elements $i, j \in [d]$ are both in the subset $\{i_1, \dots, i_r\}$ sampled uniformly at random from all subsets of size $r$. The number of sets of $r$ elements which contain both $i$ and $j$ is exactly $\binom{d-2}{r-2}$ and the total number of $r$-subsets is $\binom{d}{r}$; thus
$$\mathbb{P}\big(i, j \in \{i_1, \dots, i_r\}\big) = \binom{d-2}{r-2}\Big/\binom{d}{r} = \frac{r(r-1)}{d(d-1)},$$
and so $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,j} = x_i x_j\,\frac{r(r-1)}{d(d-1)}$. Now consider the diagonal elements: $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\,\mathbb{P}(i \in \{i_1, \dots, i_r\})$, so we need the probability of the event $i \in \{i_1, \dots, i_r\}$. The number of sets of size $r$ containing $i$ is $\binom{d-1}{r-1}$, and thus $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\,\frac{r}{d}$. Putting these together,
$$\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = \frac{r(r-1)}{d(d-1)}\,xx^\top + \Big(\frac{r}{d} - \frac{r(r-1)}{d(d-1)}\Big)\mathrm{diag}(xx^\top) = \frac{r(r-1)}{d(d-1)}\,xx^\top + \frac{r(d-r)}{d(d-1)}\,\mathrm{diag}(xx^\top).$$

Proof of Lemma 5.1. First note that
$$\mathbb{E}_S[zz^\top \mid \tilde x] = \frac{d(d-r)}{r-1}\,\mathbb{E}_S\big[\tilde x_{i_s}^2\, e_{i_s}e_{i_s}^\top \mid \tilde x\big] = \frac{d(d-r)}{r(r-1)}\,\mathrm{diag}(\tilde x\tilde x^\top).$$
The proof simply follows from the following equalities:
$$\begin{aligned}
\mathbb{E}_{S,R}\Big[\tfrac{d(d-1)}{r(r-1)}\tilde x\tilde x^\top - zz^\top \,\Big|\, x\Big] &= \tfrac{d(d-1)}{r(r-1)}\,\mathbb{E}_R\big[\mathbb{E}_S[\tilde x\tilde x^\top \mid \tilde x] \mid x\big] - \mathbb{E}_R\big[\mathbb{E}_S[zz^\top \mid \tilde x] \mid x\big]\\
&= \tfrac{d(d-1)}{r(r-1)}\,\mathbb{E}_R[\tilde x\tilde x^\top \mid x] - \tfrac{d(d-r)}{r(r-1)}\,\mathbb{E}_R[\mathrm{diag}(\tilde x\tilde x^\top) \mid x]\\
&= xx^\top + \tfrac{d-r}{r-1}\,\mathrm{diag}(xx^\top) - \tfrac{d-r}{r-1}\,\mathrm{diag}(xx^\top) = xx^\top, \qquad (13)
\end{aligned}$$
where $\mathrm{diag}(A)$, $A \in \mathbb{R}^{d\times d}$, is the $d\times d$ matrix consisting of the diagonal of $A$.

Proof of Theorem 5.2. We first bound $\|\hat g_t\|^2$ as follows:
$$\begin{aligned}
\|\hat g_t\|^2 &= \Big\|\tfrac{d(d-1)}{r(r-1)}\tilde x_t\tilde x_t^\top - z_t z_t^\top\Big\|^2 = \tfrac{d^2(d-1)^2}{r^2(r-1)^2}\|\tilde x_t\tilde x_t^\top\|^2 + \|z_t z_t^\top\|^2 - \tfrac{2d(d-1)}{r(r-1)}\big\langle \tilde x_t\tilde x_t^\top,\, z_t z_t^\top\big\rangle\\
&\le \tfrac{d^2(d-1)^2}{r^2(r-1)^2}\|\tilde x_t\|^4 + \|z_t\|^4 \le \tfrac{d^2(d-1)^2}{r^2(r-1)^2} + \tfrac{d^2(d-r)^2}{(r-1)^2},
\end{aligned}$$
where the first inequality follows from $\langle \tilde x_t\tilde x_t^\top,\, z_t z_t^\top\rangle \ge 0$, since this is an inner product of PSD matrices, the second follows from the definition of $\tilde x$ and $z$, and the last holds since $\|\tilde x\| \le \|x\| \le 1$. Now, using Lemma A.1 with $\{\hat g_t\}$ and noting $\|P\| \le \sqrt{k}$ for all $P \in C$, we have that
$$\sum_{t=1}^{T}\big\langle P - P_t,\, \hat g_t\big\rangle \le \frac{k}{2\eta} + \frac{\eta T}{2}\Big(\tfrac{d^2(d-1)^2}{r^2(r-1)^2} + \tfrac{d^2(d-r)^2}{(r-1)^2}\Big).$$
Since $\mathbb{E}[\hat g_t] = g_t = x_t x_t^\top$, setting
$$\eta = \sqrt{k}\,\Big/\sqrt{T\Big(\tfrac{d^2(d-1)^2}{r^2(r-1)^2} + \tfrac{d^2(d-r)^2}{(r-1)^2}\Big)}$$
and taking expectation with respect to the internal randomization $S$ and the distribution of the missing data $R$ finishes the proof.
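The estimator in Lemma 5.1 combines two sources of randomness: the uniformly sampled observation set ($R$) and an internally sampled coordinate used to build the rank-one correction $zz^\top$ ($S$). The snippet below is a small Monte Carlo sanity check of the unbiasedness identity (13). The explicit scaling $\sqrt{d(d-r)/(r-1)}$ used here to construct $z$ is our reconstruction of the estimator and should be read as an assumption; the check itself validates numerically that this choice yields an unbiased estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n_trials = 10, 4, 300_000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)

acc = np.zeros((d, d))
for _ in range(n_trials):
    obs = rng.choice(d, size=r, replace=False)       # R: uniformly random r-subset observed
    xt = np.zeros(d)
    xt[obs] = x[obs]
    i_s = rng.choice(obs)                            # S: one coordinate among the observed ones
    z = np.zeros(d)
    z[i_s] = np.sqrt(d * (d - r) / (r - 1)) * xt[i_s]   # assumed scaling (see lead-in)
    g_hat = d * (d - 1) / (r * (r - 1)) * np.outer(xt, xt) - np.outer(z, z)
    acc += g_hat

print("E[g_hat] ~ x x^T:", np.allclose(acc / n_trials, np.outer(x, x), atol=2e-2))
```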
Proof of Theorem 5.3. First we bound $\|\hat g_k\|$. Since $\hat g_k = \frac{d(d-1)}{r(r-1)}\tilde x_k\tilde x_k^\top - z_k z_k^\top$ is the difference of two PSD matrices, it holds that
$$\|\hat g_k\| \le \max\Big(\tfrac{d(d-1)}{r(r-1)}\|\tilde x_k\tilde x_k^\top\|,\ \|z_k z_k^\top\|\Big) \le \alpha.$$
Let $v$ be the top eigenvector of $C$ with associated eigenvalue $\lambda$. Also let $g_k = x_k x_k^\top$ and let $w_k$ be the $k$-th iterate of Oja's algorithm, so that $P_k = w_k w_k^\top$. Define
$$\begin{aligned}
\Phi^M_k &= (I + \eta\hat g_k)\cdots(I + \eta\hat g_1)\, M\,(I + \eta\hat g_1)\cdots(I + \eta\hat g_k),\\
\Psi_k &= (I + \eta\hat g_k)\cdots(I + \eta\hat g_1)\, vv^\top\,(I + \eta\hat g_1)\cdots(I + \eta\hat g_k).
\end{aligned}$$
Following the calculations in (Allen-Zhu & Li, 2017, Appendix I), we have:
$$\begin{aligned}
\mathrm{tr}\,\Phi^{uu^\top}_T &= \mathrm{tr}\big((I + \eta\hat g_T)\Phi^{uu^\top}_{T-1}(I + \eta\hat g_T)\big) = \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\big) + \eta^2\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\hat g_T\big)\\
&\le \mathrm{tr}\,\Phi^{uu^\top}_{T-1} + 2\eta\,\mathrm{tr}\big(\hat g_T\Phi^{uu^\top}_{T-1}\big) + \eta^2\alpha^2\,\mathrm{tr}\,\Phi^{uu^\top}_{T-1} = \big(1 + \eta^2\alpha^2 + 2\eta\,w_T^\top\hat g_T w_T\big)\,\mathrm{tr}\,\Phi^{uu^\top}_{T-1}\\
&\le \exp\big(\eta^2\alpha^2 + 2\eta\,w_T^\top\hat g_T w_T\big)\,\mathrm{tr}\,\Phi^{uu^\top}_{T-1} \le \cdots \le \|u\|^2\,\exp\Big(\sum_{k=1}^{T}\eta^2\alpha^2 + 2\eta\,w_k^\top\hat g_k w_k\Big), \qquad (15)
\end{aligned}$$
where for the first inequality we have used the fact that $\Phi^{uu^\top}_{T-1}$ is PSD and $\|\hat g_k\| \le \alpha$, and for the second inequality we have used $1 + x \le \exp(x)$. We have also used the fact that $\Phi^{uu^\top}_{T-1} = \mathrm{tr}\big(\Phi^{uu^\top}_{T-1}\big)\,w_{T-1}w_{T-1}^\top$. Next we have
$$\begin{aligned}
\mathbb{E}[\mathrm{tr}\,\Psi_T] &= \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + \eta\hat g_T)\Phi^I_{T-1}(I + \eta\hat g_T)\big)\big] = \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta\hat g_T)\Phi^I_{T-1}\big)\big] + \eta^2\,\mathbb{E}\big[v^\top\hat g_T\Phi^I_{T-1}\hat g_T v\big]\\
&\ge \mathbb{E}\big[\mathrm{tr}\big(vv^\top(I + 2\eta C)\Phi^I_{T-1}\big)\big] = (1 + 2\eta\lambda)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big] \ge \exp\big(2\eta\lambda - 4\eta^2\lambda^2\big)\,\mathbb{E}\big[v^\top\Phi^I_{T-1}v\big]\\
&\ge \cdots \ge \exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big), \qquad (16)
\end{aligned}$$
where the expectation is with respect to both $R$ and $D$, and we have used $1 + x \ge \exp(x - x^2)$ for $x \in [0, 1]$. Finally, we have:
$$\mathbb{E}[\mathrm{tr}\,\Psi_T^2] = \mathbb{E}\big[\mathrm{tr}\big(((I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T))^2\big)\big] \le \mathbb{E}\big[\mathrm{tr}\big((I + \eta\hat g_T)^4\,\Psi_{T-1}^2\big)\big] \le \exp\big(\eta^2\bar\alpha + 4\eta\lambda\big)\,\mathbb{E}[\mathrm{tr}\,\Psi_{T-1}^2] \le \cdots \le \exp\big(T(\eta^2\bar\alpha + 4\eta\lambda)\big), \qquad (17)$$
where $\bar\alpha$ collects the $\alpha^2$, $\alpha^3$ and $\alpha^4$ terms arising from expanding $(I + \eta\hat g_T)^4$, the first inequality is the Lieb-Thirring inequality, and we again used $1 + x \le \exp(x)$. Combining equations (16) and (17) with Chebyshev's inequality, as long as $\eta \le \frac{\log(1 + \delta/9)}{\alpha^2 + \lambda}$ it holds with probability $1 - \delta$ that
$$\mathrm{tr}\,\Psi_T \ge \tfrac{1}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big).$$
Following the derivations in (Allen-Zhu & Li, 2017), we have that with probability $1 - \delta$ it holds that $\mathrm{tr}\,\Phi^{uu^\top}_T \ge \Omega(\delta^2)\,\mathrm{tr}\,\Psi_T$ and that $\|u\|^2 \le O(d + \log(1/\delta))$. A union bound, together with inequality (15), implies that with probability $1 - 2\delta$ we have:
$$O(d + \log(1/\delta))\,\exp\Big(T\eta^2\alpha^2 + 2\eta\sum_{k=1}^{T} w_k^\top\hat g_k w_k\Big) \ge \frac{\delta^2}{3}\exp\big(T(2\eta\lambda - 4\eta^2\lambda^2)\big),$$
which implies
$$\lambda - \frac{1}{T}\sum_{k=1}^{T} w_k^\top\hat g_k w_k \le \frac{\log O(d + \log(1/\delta)) + \log(3/\delta^2)}{2\eta T} + 2\eta\lambda^2 + \frac{\eta\alpha^2}{2}.$$
Taking expectations and using $\mathbb{E}_{S,R}[\hat g_k] = g_k$, this gives
$$\sum_{k=1}^{T}\big\langle g_k,\, P^* - \mathbb{E}_{S,R}[P_k]\big\rangle \le \frac{(\alpha^2 + \lambda)\big(\log O(d + \log(1/\delta)) + \log(3/\delta^2)\big)}{2\log(1 + \delta/9)} + O\big(T\log(1 + \delta/9)\big),$$
where in the last implication we have substituted $\eta = \frac{\log(1 + \delta/9)}{\alpha^2 + \lambda}$ and used the fact that $\alpha^2 < \alpha^2 + \lambda$.
E. Proofs of Section 6

Proof of Theorem 6.1. Denoting $(P_t - I)x_t$ by $y_t$, we can rewrite $g_t$ as $g_t = \frac{x_t y_t^\top + y_t x_t^\top}{2\|y_t\|}$. Thus we have
$$\|g_t\|^2 = \frac{1}{4\|y_t\|^2}\,\mathrm{tr}\big[(x_t y_t^\top + y_t x_t^\top)(x_t y_t^\top + y_t x_t^\top)\big] = \frac{2(x_t^\top y_t)^2 + 2\|x_t\|^2\|y_t\|^2}{4\|y_t\|^2} \le \frac{\|x_t\|^2\|y_t\|^2}{\|y_t\|^2} = \|x_t\|^2.$$
By convexity of norms,
$$\|(P_t - I)x_t\| - \|(P - I)x_t\| \le \langle P_t - P,\, g_t\rangle.$$
Using Lemma A.1, noting $\|P\| \le \sqrt{k}$ for all $P \in C$, and $\|g_t\| \le 1$,
$$\sum_{t=1}^{T}\|(P_t - I)x_t\| - \|(P - I)x_t\| \le \frac{k}{2\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{k/T}$ completes the proof.

F. Proofs of Section 8

Proof of Theorem 8.1. Our proof follows (Zinkevich, 2003). We start the analysis by bounding the distance of our iterate at time $t+1$ from any dynamic competitor $P^*_{t+1}$ at time $t+1$:
$$\begin{aligned}
\|P_{t+1} - P^*_{t+1}\|^2 &= \|\Pi_C(P_t - \eta g_t) - P^*_{t+1}\|^2 \le \|P_t - \eta g_t - P^*_{t+1}\|^2 = \|(P_t - P^*_t) + (P^*_t - P^*_{t+1}) - \eta g_t\|^2\\
&= \|P_t - P^*_t\|^2 + \|P^*_t - P^*_{t+1}\|^2 + \eta^2\|g_t\|^2 + 2\langle P_t - P^*_t,\, P^*_t - P^*_{t+1}\rangle - 2\eta\langle g_t,\, P_t - P^*_{t+1}\rangle.
\end{aligned}$$
Bounding the terms using $\|g_t\| = \|x_t\|^2 \le 1$, $\|P_t - P^*_t\| \le 2\sqrt{k}$ and $\|P^*_t - P^*_{t+1}\| \le 2\sqrt{k}$, rearranging and dividing both sides by $2\eta$, we get
$$x_t^\top P^*_t x_t - x_t^\top P_t x_t \le \frac{\|P_t - P^*_t\|^2 - \|P_{t+1} - P^*_{t+1}\|^2}{2\eta} + \frac{3\sqrt{k}}{\eta}\,\|P^*_t - P^*_{t+1}\| + \frac{\eta}{2}.$$
Summing over $t$, observe that the telescopic sum on the right-hand side is bounded by
$$\sum_{t=1}^{T}\big(\|P_t - P^*_t\|^2 - \|P_{t+1} - P^*_{t+1}\|^2\big) = \|P_1 - P^*_1\|^2 - \|P_{T+1} - P^*_{T+1}\|^2 \le 4k.$$
Plugging in the definition of the total shift $S = \sum_{t}\|P^*_{t+1} - P^*_t\|$,
$$\sum_{t=1}^{T} x_t^\top P^*_t x_t - x_t^\top P_t x_t \le \frac{3\sqrt{k}\,S}{\eta} + \frac{\eta T}{2} + \frac{2k}{\eta} = \frac{3\sqrt{k}\,S + 2k}{\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{2(3\sqrt{k}\,S + 2k)/T}$ completes the proof.

G. Experimental Results

We provide additional experiments in Figures 3, 4, 5 and 6 for PCA with partial observations and for PCA with missing entries. As stated in the main text, all our observations from Figure 1 and Figure 2 hold for the results presented here. Further, in Figure 7 we present experimental results for the proposed Absolute Subspace Deviation Model algorithm of Section 6, referred to as Robust GD in the plots. The algorithms against which we compare are Robust Online PCA, proposed in (Goes et al., 2014) and referred to as Robust MEG in the plots, which is an instance of online mirror descent with the potential function chosen to be the entropy, and simple batch PCA on the whole dataset. As ground truth we take the $k$-dimensional subspace returned by the method proposed in (Lerman et al., 2015). In the following discussion the ground truth is represented by the projection matrix $P^* \in \mathbb{R}^{d\times d}$ and the estimated projection matrix returned by an algorithm is denoted by $P_t$. Since our experiments are run on a synthetic dataset of fixed size, we denote this dataset by $X \in \mathbb{R}^{d\times n}$. The criteria against which we evaluate are: the average angle between subspaces, given by $\mathrm{tr}(P^* P_t)/k$; the reconstruction error on the whole dataset, given by $\|U_t^\top X - U^{*\top} X\|^2/n$, where a rank-$k$ projection matrix $P$ is decomposed as $P = UU^\top$ for orthogonal $U \in \mathbb{R}^{d\times k}$; and the total regret incurred.

The dataset is generated as in (Goes et al., 2014). First, inliers are sampled from a $k$-dimensional multivariate normal distribution with mean $0$ and covariance matrix $\frac{1}{k}I_k$. Next, the inliers are embedded in a $d$-dimensional space via the linear transformation $U \in \mathbb{R}^{d\times k}$ with entries $U_{ii} = 1$ and $U_{ij} = 0$ for $i \ne j$. Finally, outliers are sampled from a $d$-dimensional multivariate normal distribution with mean $0$ and a diagonal covariance proportional to $\frac{\lambda}{d}\,\Sigma_{\mathrm{out}}$, where $(\Sigma_{\mathrm{out}})_{ii} = 1/i$; here $\lambda$ is a user-specified parameter which governs the SNR. In our experiments $\lambda$, $k$ and $d$ are held fixed and the number of outliers is set to one of three increasing fractions of the total number of points. In our comparison we also include a capped version of our proposed method, where the rank of each intermediate iterate is hard-capped by keeping only the directions associated with its top 5 singular values.

We now briefly discuss the more interesting points which should be observed from Figure 7. First, our algorithm clearly outperforms Robust Online PCA and batch PCA on all the chosen criteria. Secondly, since Robust Online PCA is required to always keep full-rank iterates, as it operates in the complementary space to the one we wish to recover, it has to pay a computational cost of $O(d^3)$ per iteration, while our method only pays $O(d\tilde k^2)$, where $\tilde k$ is the rank of the current iterate. As can be seen from the plots, for the two lower levels of outlier contamination the intermediate rank of the iterates does not grow too much, so in practice our proposed method is efficient. Finally, we note that the capped version of the proposed method performs as well, or even better in some cases.
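For reference, the following sketch shows one way to generate a synthetic dataset of the kind described above: low-dimensional Gaussian inliers embedded by a coordinate-selection map, plus Gaussian outliers with decaying diagonal covariance. The exact scaling of the outlier covariance, the sizes and the outlier fraction below are illustrative assumptions; the paper's experiments fix their own values of $\lambda$, $k$, $d$ and the outlier percentages.

```python
import numpy as np

def make_robust_pca_data(n, d, k, outlier_frac, snr_lambda, rng):
    """Synthetic inlier/outlier mixture in the spirit of the setup above.

    Inliers: z ~ N(0, (1/k) I_k), embedded into R^d by U with U[i, i] = 1
    (first k coordinates), so inliers live in a k-dimensional axis-aligned subspace.
    Outliers: Gaussian with diagonal covariance (snr_lambda/d) * Sigma_out, where
    (Sigma_out)_{ii} = 1/i for i = 1..d (the scaling is an assumption here).
    Returns the data matrix X (d x n) and a boolean outlier indicator.
    """
    n_out = int(outlier_frac * n)
    n_in = n - n_out

    U = np.zeros((d, k))
    U[:k, :k] = np.eye(k)                                    # embedding U_ii = 1, U_ij = 0
    Z = rng.multivariate_normal(np.zeros(k), np.eye(k) / k, size=n_in)
    inliers = Z @ U.T                                        # n_in x d inlier samples

    sigma_out = 1.0 / np.arange(1, d + 1)                    # decaying diagonal spectrum
    outliers = rng.standard_normal((n_out, d)) * np.sqrt(snr_lambda / d * sigma_out)

    X = np.vstack([inliers, outliers])
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[n_in:] = True
    perm = rng.permutation(n)                                # shuffle inliers and outliers
    return X[perm].T, is_outlier[perm]

rng = np.random.default_rng(0)
X, is_outlier = make_robust_pca_data(n=2000, d=100, k=10,
                                     outlier_frac=0.2, snr_lambda=4.0, rng=rng)
print(X.shape, is_outlier.mean())
```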
Figure 3: Comparisons of Oja, MGD, MGD-PO, MGD-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the XRMB dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 4: Comparisons of Oja, MGD, MGD-PO, MGD-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the XRMB dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 5: Comparisons of Oja, MGD, MSG-PO, MSG-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the MNIST dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 6: Comparisons of Oja, MGD, MSG-PO, MSG-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the MNIST dataset, in terms of the variance captured on a test set as a function of the number of observed entries (top), the number of iterations (middle) and runtime (bottom); columns correspond to increasing fractions of missing entries (88% and above).
Figure 7: Comparison of batch PCA, Robust Online PCA (Robust MEG), and the Absolute Subspace Deviation Model (Robust GD, including its capped variant) on the synthetic dataset, with three increasing fractions of outliers in the top, middle and bottom rows. Experiments are in terms of the average subspace angle $\mathrm{tr}(P^* P_t)/k$ (leftmost), the variance captured on a test set (left), the reconstruction error $\|P^* X - P_t X\|^2/n$ (right) and the rank of the iterates (rightmost), as a function of the number of samples.
More information