Supplementary Material: "Streaming Principal Component Analysis in Noisy Settings" (Teodor V. Marinov, Poorya Mianjy, Raman Arora)


A. Auxiliary Lemmas

Lemma A.1 (Shalev-Shwartz & Ben-David, 2014). Any update of the form
$$P_{t+1} = \Pi_{\mathcal{C}}\left(P_t + \eta g_t\right),$$
for an arbitrary sequence of matrices $g_1, g_2, \ldots, g_T$, projection $\Pi_{\mathcal{C}}$ onto an arbitrary convex set $\mathcal{C}$, and initialization $P_1 = 0$, satisfies, for any $P \in \mathcal{C}$,
$$\sum_{t=1}^{T} \langle P - P_t, g_t \rangle \le \frac{\|P\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|g_t\|^2.$$

Lemma A.2 (Warmuth & Kuzmin). For any PSD $A$ and any symmetric $B, C$ for which $B \preceq C$, it holds that $\operatorname{tr}(AB) \le \operatorname{tr}(AC)$.

B. Proofs of Section 3

Proof of Theorem 3.1. Using Lemma A.1, noting $\|P\|^2 \le k$ for all $P \in \mathcal{C}$, and $g_t = x_t x_t^\top$ so that $\|g_t\| = \|x_t\|^2 \le 1$, we get
$$\sum_{t=1}^{T} \left( x_t^\top P x_t - x_t^\top P_t x_t \right) \le \frac{k}{2\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{k/T}$ completes the proof.

Proof of Theorem 3.2. Let $\hat g_t = g_t + E_t$ be the noisy gradient. The analysis begins by bounding the distance between the $t$-th iterate and any candidate $P \in \mathcal{C}$, $D_t = \|P_t - P\|$:
$$D_{t+1}^2 = \|P_{t+1} - P\|^2 = \left\|\Pi_{\mathcal{C}}\left(P_t + \eta \hat g_t\right) - P\right\|^2 \le \left\|P_t + \eta \hat g_t - P\right\|^2 = \|P_t - P\|^2 + \eta^2 \|\hat g_t\|^2 - 2\eta \langle P - P_t, g_t + E_t \rangle$$
$$\le D_t^2 + \eta^2 G^2 - 2\eta \langle P - P_t, g_t \rangle - 2\eta \langle P - P_t, E_t \rangle \le D_t^2 + \eta^2 G^2 - 2\eta \langle P - P_t, g_t \rangle + 2\eta \|P - P_t\|\,\|E_t\| \le D_t^2 + \eta^2 G^2 - 2\eta \langle P - P_t, g_t \rangle + 4\sqrt{k}\,\eta\,\|E_t\|,$$
where the first inequality is due to projection onto a convex set being non-expansive and the third one is by Hölder's inequality. The last inequality follows from the triangle inequality and the constraints: $\|P - P_t\| \le \|P\| + \|P_t\| \le 2\sqrt{k}$. Noting $g_t = x_t x_t^\top$, rearranging and dividing both sides by $2\eta$ we get
$$\langle P - P_t, x_t x_t^\top \rangle \le \frac{D_t^2 - D_{t+1}^2}{2\eta} + \frac{\eta}{2} G^2 + 2\sqrt{k}\,\|E_t\|.$$
We then sum over iterates and use the telescoping property of the first term on the right-hand side, so that $\sum_{t=1}^{T} \left(D_t^2 - D_{t+1}^2\right) = D_1^2 - D_{T+1}^2 \le D_1^2$. The initial distance $D_1$ is bounded as follows:
$$D_1^2 = \|P_1 - P\|^2 = \|P_1\|^2 + \|P\|^2 - 2\langle P_1, P \rangle \le k + k + 2\|P_1\|\,\|P\| \le 4k.$$
Plugging this bound back in and noting $\|E_t\| \le E$, we get
$$\sum_{t=1}^{T} \left( x_t^\top P x_t - x_t^\top P_t x_t \right) \le \frac{2k}{\eta} + \frac{\eta G^2 T}{2} + 2\sqrt{k}\,E\,T.$$
Choosing the learning rate $\eta = \frac{2}{G}\sqrt{\frac{k}{T}}$, which balances the first two terms, we get the desired result.
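For concreteness, here is a minimal Python sketch of the projected-gradient update analyzed in Lemma A.1 and used in the proofs of Theorems 3.1 and 3.2, under the assumption that $\mathcal{C}$ is the set of symmetric matrices with eigenvalues in $[0,1]$ and trace $k$ (so the projection reduces to an eigendecomposition followed by a capped-simplex projection of the eigenvalues). The bisection-based simplex projection, the function names, and the noise handling are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np

def project_capped_simplex(lams, k, tol=1e-10):
    """Project eigenvalues onto {s : 0 <= s_i <= 1, sum(s) = k} by bisection
    on the shift theta in clip(lams - theta, 0, 1).  (One standard way to
    realize the eigenvalue part of the projection onto C; assumed, not from
    the paper.)"""
    lo, hi = lams.min() - 1.0, lams.max()
    for _ in range(200):
        theta = 0.5 * (lo + hi)
        if np.clip(lams - theta, 0.0, 1.0).sum() > k:
            lo = theta          # sum too large: increase the shift
        else:
            hi = theta          # sum too small: decrease the shift
        if hi - lo < tol:
            break
    return np.clip(lams - 0.5 * (lo + hi), 0.0, 1.0)

def project_fantope(P, k):
    """Projection Pi_C of a symmetric matrix onto C = {0 <= P <= I, tr P = k}."""
    evals, evecs = np.linalg.eigh((P + P.T) / 2.0)
    return (evecs * project_capped_simplex(evals, k)) @ evecs.T

def mgd_streaming_pca(X, k, eta, noise=None):
    """Projected gradient ascent P_{t+1} = Pi_C(P_t + eta * ghat_t),
    with ghat_t = x_t x_t^T (plus an optional noise matrix E_t)."""
    d = X.shape[1]
    P = np.zeros((d, d))
    for t, x in enumerate(X):
        g = np.outer(x, x)
        if noise is not None:
            g = g + noise[t]    # noisy gradient ghat_t = g_t + E_t
        P = project_fantope(P + eta * g, k)
    return P
```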

Proof of Theorem 3.3. Let $\hat g_k = (x_k + y_k)(x_k + y_k)^\top$ denote the corrupted gradient and let $g_k = x_k x_k^\top$ denote the unbiased estimate of $C$ based on the $k$-th sample. Let $v$ denote the top eigenvector of $C := \mathbb{E}_{\mathcal{D}}[x x^\top]$, with associated eigenvalue $\lambda$, and let $u \sim \mathcal{N}(0, I)$ be the initialization for Oja's algorithm. We assume that $2\|y_k\| + \|y_k\|^2$ is bounded as in the statement of the theorem; this assumption implies that $\|\hat g_k - g_k\| \le O(1)$ and $\mathbb{E}[\|\hat g_k\|^n] \le O(n^{n/2})$. Let $\Phi^M_k$ and $\Psi_k$ be defined as in the proof of Theorem 5.3. We first upper bound $\operatorname{tr}(\Phi^{uu^\top}_T)$:
$$\operatorname{tr}(\Phi^{uu^\top}_k) = \operatorname{tr}(\Phi^{uu^\top}_{k-1}) + 2\eta\operatorname{tr}\left((\hat g_k - g_k)\Phi^{uu^\top}_{k-1}\right) + 2\eta\operatorname{tr}\left(x_k x_k^\top \Phi^{uu^\top}_{k-1}\right) + \eta^2\operatorname{tr}\left(\hat g_k \Phi^{uu^\top}_{k-1}\hat g_k\right)$$
$$\le \left(1 + 2\eta\, w_k^\top g_k w_k + 2\eta\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\right)\operatorname{tr}(\Phi^{uu^\top}_{k-1}) \le \exp\left(2\eta\, w_k^\top g_k w_k + 2\eta\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\right)\operatorname{tr}(\Phi^{uu^\top}_{k-1})$$
$$\le \cdots \le \|u\|^2 \exp\left(\sum_{k=1}^{T}\left(2\eta\, w_k^\top g_k w_k + 2\eta\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\right)\right),$$
where we have used Lemma A.2 (Warmuth & Kuzmin) and the fact that $\Phi^{uu^\top}_{k-1} = \operatorname{tr}(\Phi^{uu^\top}_{k-1})\, w_k w_k^\top$.

Next we lower bound $\mathbb{E}[\operatorname{tr}\Psi_T]$:
$$\mathbb{E}[\operatorname{tr}\Psi_T] \ge \mathbb{E}\left[\operatorname{tr}\left(vv^\top(I + 2\eta\hat g_T)\Phi^{I}_{T-1}\right)\right] = \mathbb{E}\left[\operatorname{tr}\left(vv^\top(I + 2\eta C)\Phi^{I}_{T-1}\right)\right] + 2\eta\,\mathbb{E}\left[\operatorname{tr}\left(vv^\top y_T y_T^\top\Phi^{I}_{T-1}\right)\right]$$
$$\ge (1 + 2\eta\lambda)\,\mathbb{E}\left[v^\top\Phi^{I}_{T-1}v\right] \ge \exp\left(2\eta\lambda - 4\eta^2\lambda^2\right)\mathbb{E}\left[v^\top\Phi^{I}_{T-1}v\right] \ge \cdots \ge \exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right),$$
where we used the fact that $\mathbb{E}[x_k] = 0$ and that $1 + x \ge \exp(x - x^2)$ for $x \in [0, 1]$. Finally, using the Lieb-Thirring inequality, we have the following upper bound on $\mathbb{E}[(\operatorname{tr}\Psi_T)^2]$:
$$\mathbb{E}\left[(\operatorname{tr}\Psi_T)^2\right] \le \mathbb{E}\left[\operatorname{tr}\left((I + \eta\hat g_T)^4\Psi_{T-1}^2\right)\right] = \mathbb{E}\left[\operatorname{tr}\left(\left(I + 4\eta x_T x_T^\top + 4\eta\left(x_T y_T^\top + y_T x_T^\top\right) + 4\eta y_T y_T^\top + 6\eta^2\hat g_T^2 + 4\eta^3\hat g_T^3 + \eta^4\hat g_T^4\right)\Psi_{T-1}^2\right)\right]$$
$$\le \cdots \le \exp\left(\left(4\eta\lambda + \alpha_1\eta + \alpha_2\eta^2 + \alpha_3\eta^3 + \alpha_4\eta^4\right)T\right),$$
where for the last inequality we have pushed the expectation inside the trace and used the fact that $\mathbb{E}[\|\hat g_T\|^n]$ is bounded above by a linear combination of the moments $\mathbb{E}[\|y\|^m]$, $m \in [n]$; we have also used the fact that $1 + x \le \exp(x)$ and induction, together with $\mathbb{E}[\|y_k\|] \le O(1)$ and $\mathbb{E}[\|y_k\|^n] \le O(n^{n/2})$. Let $\bar\alpha = \alpha_1 + \alpha_2\eta + 8\lambda^2\eta + \alpha_3\eta^2 + \alpha_4\eta^3$. Using Chebyshev's inequality with the last two displays we have
$$\mathbb{P}\left[\operatorname{tr}\Psi_T \le \tfrac13\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right)\right] \le \frac94\left(\frac{\mathbb{E}\left[(\operatorname{tr}\Psi_T)^2\right]}{\left(\mathbb{E}[\operatorname{tr}\Psi_T]\right)^2} - 1\right) \le \frac94\left(\exp\left(\eta\bar\alpha T\right) - 1\right).$$
Let $\eta = \beta/T$ for some constant $\beta$ independent of $T$. We claim that for small enough $\beta$ it holds that, with probability at least $1 - \delta$,
$$\operatorname{tr}\Psi_T \ge \tfrac13\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right).$$
For this choice of $\eta$, it suffices that $\exp(\beta\bar\alpha) \le 1 + \frac{\delta}{9}$, where $\bar\alpha$ now becomes $\bar\alpha = \alpha_1 + \alpha_2\beta + 8\lambda^2\beta + \alpha_3\beta^2 + \alpha_4\beta^3$. Taking $\beta \to 0$ we see that $\exp(\bar\alpha\beta) < 1 + \frac{\delta}{9}$, and since $\exp(\bar\alpha\beta)$ is a continuous function of $\beta$, the desired value of $\beta > 0$ exists. From the derivation in (Allen-Zhu & Li, 2017, Appendix I) it follows that $\|u\|^2 \le O(d + \log(1/\delta))$ and $\operatorname{tr}(\Phi^{uu^\top}_T) \ge \Omega(\delta)\operatorname{tr}\Psi_T$, each with probability at least $1 - \delta$. Thus, with probability at least $1 - 3\delta$, it holds that
$$O\left(d + \log(1/\delta)\right)\exp\left(\sum_{k=1}^{T}\left(2\eta\, w_k^\top g_k w_k + 2\eta\|\hat g_k - g_k\| + \eta^2\|\hat g_k\|^2\right)\right) \ge \frac{\delta}{3}\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right).$$
Taking logarithms, rearranging, dividing by $2\eta$, and recalling that $w_k^\top g_k w_k = \langle g_k, P_k\rangle$, we obtain
$$\lambda T - \sum_{k=1}^{T}\langle g_k, P_k\rangle \le \frac{\log O\left(d + \log(1/\delta)\right) + \log(3/\delta)}{2\eta} + \sum_{k=1}^{T}\|\hat g_k - g_k\| + \frac{\eta}{2}\sum_{k=1}^{T}\|\hat g_k\|^2 + 2\eta\lambda^2 T.$$
Substituting $\eta = \beta/T$ finishes the proof.

C. Proofs of Section 4

Lemma C.1. The naive estimator $\tilde x\tilde x^\top$ is not unbiased; rather,
$$\mathbb{E}_R\left[\tilde x\tilde x^\top \mid x\right] = q^2\, x x^\top + q(1-q)\operatorname{diag}\left(x x^\top\right),$$
where the expectation is taken with respect to the Bernoulli random model.

Proof of Lemma C.1. For off-diagonal elements $i \ne j$,
$$\mathbb{E}_R\left[\tilde x\tilde x^\top \mid x\right]_{i,j} = x_i x_j\,\mathbb{P}\left(i, j \in \{i_1, \ldots, i_r\}\right) = x_i x_j\,\mathbb{P}(Z_i = 1)\,\mathbb{P}(Z_j = 1) = q^2 x_i x_j.$$
Now consider the diagonal elements: $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\,\mathbb{P}(Z_i = 1) = q x_i^2$. Putting these together, $\mathbb{E}_R[\tilde x\tilde x^\top \mid x] = q^2 x x^\top + q(1-q)\operatorname{diag}(x x^\top)$.

Proof of Lemma 4.1. First note that
$$\mathbb{E}_S\left[z z^\top \mid \tilde x\right] = \frac{(1-q)\,r_t}{q^2}\,\mathbb{E}_S\left[\tilde x_{i_s}^2 e_{i_s}e_{i_s}^\top \mid \tilde x\right] = \frac{1-q}{q^2}\operatorname{diag}\left(\tilde x\tilde x^\top\right).$$
The proof simply follows from the following equalities:
$$\mathbb{E}_{S,R}\left[\frac{1}{q^2}\tilde x\tilde x^\top - z z^\top \,\Big|\, x\right] = \mathbb{E}_R\left[\mathbb{E}_S\left[\frac{1}{q^2}\tilde x\tilde x^\top \,\Big|\, \tilde x\right]\,\Big|\, x\right] - \mathbb{E}_R\left[\mathbb{E}_S\left[z z^\top \mid \tilde x\right]\,\Big|\, x\right] = \frac{1}{q^2}\mathbb{E}_R\left[\tilde x\tilde x^\top \mid x\right] - \frac{1-q}{q^2}\mathbb{E}_R\left[\operatorname{diag}\left(\tilde x\tilde x^\top\right) \mid x\right]$$
$$= x x^\top + \frac{1-q}{q}\operatorname{diag}\left(x x^\top\right) - \frac{1-q}{q}\operatorname{diag}\left(x x^\top\right) = x x^\top.$$
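As a numerical sanity check of Lemma C.1 and of the unbiasedness argument behind Lemma 4.1, the following Monte-Carlo sketch (ours, not the paper's code) estimates $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]$ under the Bernoulli model and compares a debiased estimate with $xx^\top$. Note that it removes the diagonal bias in closed form, whereas the estimator analyzed above subtracts a sampled rank-one term $z z^\top$ with the same conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, trials = 8, 0.3, 200_000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                      # ||x|| <= 1, as in the analysis

naive = np.zeros((d, d))
debiased = np.zeros((d, d))
for _ in range(trials):
    mask = rng.random(d) < q                # Bernoulli(q) observation pattern
    xt = np.where(mask, x, 0.0)             # observed vector, zeros elsewhere
    naive += np.outer(xt, xt)
    # Remove the extra diagonal mass, then rescale by 1/q^2.  This is one way
    # to obtain an unbiased estimate consistent with Lemma C.1; the paper's
    # construction instead subtracts a sampled term z z^T.
    debiased += (np.outer(xt, xt) - (1 - q) * np.diag(xt ** 2)) / q ** 2
naive /= trials
debiased /= trials

target_naive = q ** 2 * np.outer(x, x) + q * (1 - q) * np.diag(x ** 2)
print(np.abs(naive - target_naive).max())       # ~0: matches Lemma C.1
print(np.abs(debiased - np.outer(x, x)).max())  # ~0: unbiased for x x^T
```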

Proof of Theorem 4.2. We first bound $\|\hat g_t\|^2$ as follows:
$$\|\hat g_t\|^2 = \left\|\frac{1}{q^2}\tilde x_t\tilde x_t^\top - z_t z_t^\top\right\|^2 = \frac{1}{q^4}\left\|\tilde x_t\tilde x_t^\top\right\|^2 + \left\|z_t z_t^\top\right\|^2 - \frac{2}{q^2}\left\langle\tilde x_t\tilde x_t^\top, z_t z_t^\top\right\rangle \le \frac{\|\tilde x_t\|^4}{q^4} + \|z_t\|^4 \le \frac{\|x_t\|^4}{q^4} + \frac{r_t^2(1-q)^2}{q^4} \le \frac{1 + r_t^2(1-q)^2}{q^4},$$
where the first inequality follows from the fact that the inner product of PSD matrices is non-negative, the second inequality is by the definition of $\tilde x$ and $z$, and the last inequality holds since $\|x\| \le 1$. Taking the expectation of both sides, and noting that for the Binomial distribution $\mathbb{E}_R[r_t^2] = \operatorname{var}(r_t) + \left(\mathbb{E}_R[r_t]\right)^2 = dq(1-q) + d^2q^2$, we get
$$\mathbb{E}_R\left[\|\hat g_t\|^2\right] \le \frac{1 + dq(1-q)^3 + d^2q^2(1-q)^2}{q^4}.$$
Now, using Lemma A.1 with $\{\hat g_t\}$ and noting $\|P\|^2 \le k$ for all $P \in \mathcal{C}$,
$$\sum_{t=1}^{T}\mathbb{E}\left[\langle P - P_t, \hat g_t\rangle\right] \le \frac{k}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\mathbb{E}_R\left[\|\hat g_t\|^2\right] \le \frac{k}{2\eta} + \frac{\eta T}{2}\cdot\frac{1 + dq(1-q)^3 + d^2q^2(1-q)^2}{q^4}.$$
Setting $\eta = \frac{q^2\sqrt{k}}{\sqrt{T\left(1 + dq(1-q)^3 + d^2q^2(1-q)^2\right)}}$ and taking the expectation with respect to the internal randomization $S$ and the distribution of the missing data $R$ finishes the proof.

Proof of Theorem 4.3 (Oja with missing entries). We begin by bounding $\mathbb{E}_{S,R}[\|\hat g_k\|^2]$, $\mathbb{E}_{S,R}[\|\hat g_k\|^3]$ and $\mathbb{E}_{S,R}[\|\hat g_k\|^4]$. Let $a := \left\|\frac{1}{q^2}\tilde x_k\tilde x_k^\top\right\|$ and $b := \|z_k z_k^\top\|$. Using Jensen's inequality we have $\|\hat g_k\|^2 \le 2a^2 + 2b^2$, $\|\hat g_k\|^3 \le 4a^3 + 4b^3$ and $\|\hat g_k\|^4 \le 8a^4 + 8b^4$. Since $\|x\| \le 1$, it holds that $a^n \le q^{-2n}$ and $b^n \le r_k^n(1-q)^n q^{-2n}$. Let
$$\mu_2 = dq\left(1 - q + dq\right), \qquad \mu_3 = dq\left(1 - 3q + 3dq + 2q^2 - 3dq^2 + d^2q^2\right),$$
$$\mu_4 = dq\left(1 - 7q + 7dq + 12q^2 - 18dq^2 + 6d^2q^2 - 6q^3 + 11dq^3 - 6d^2q^3 + d^3q^3\right).$$
These are the second, third and fourth moments of the binomial random variable $r_k$, respectively. Also let $\alpha_k = \frac{1 + r_k(1-q)}{q^2}$. Thus we have
$$\mathbb{E}_{S,R}\left[\|\hat g_k\|^n\right] \le \mathbb{E}\left[\alpha_k^n\right] \le 2^{n-1}\,\frac{1 + \mu_n(1-q)^n}{q^{2n}} =: \bar\alpha_n, \qquad n = 2, 3, 4.$$
Define $\Phi^M_k$, $\Psi_k$, $w_k$, $\lambda$, $v$ and $u$ as in the proof of Theorem 5.3. We have
$$\operatorname{tr}(\Phi^{uu^\top}_k) = \operatorname{tr}(\Phi^{uu^\top}_{k-1}) + 2\eta\operatorname{tr}\left(\hat g_k\Phi^{uu^\top}_{k-1}\right) + \eta^2\operatorname{tr}\left(\hat g_k\Phi^{uu^\top}_{k-1}\hat g_k\right) \le \left(1 + \eta^2\alpha_k^2 + 2\eta\, w_k^\top\hat g_k w_k\right)\operatorname{tr}(\Phi^{uu^\top}_{k-1})$$
$$\le \exp\left(\eta^2\alpha_k^2 + 2\eta\, w_k^\top\hat g_k w_k\right)\operatorname{tr}(\Phi^{uu^\top}_{k-1}) \le \cdots \le \|u\|^2\exp\left(\sum_{t=1}^{T}\left(\eta^2\alpha_t^2 + 2\eta\, w_t^\top\hat g_t w_t\right)\right).$$
Next we need a lower bound on $\mathbb{E}[\operatorname{tr}\Psi_T]$. This is obtained in the same way as in Theorem 5.3:
$$\mathbb{E}[\operatorname{tr}\Psi_T] \ge \exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right).$$
Finally, we need to bound $\mathbb{E}[(\operatorname{tr}\Psi_T)^2]$. It holds that
$$\mathbb{E}\left[(\operatorname{tr}\Psi_T)^2\right] = \mathbb{E}\left[\operatorname{tr}\left((I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T)\,(I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T)\right)\right] \le \mathbb{E}\left[\operatorname{tr}\left((I + \eta\hat g_T)^4\Psi_{T-1}^2\right)\right]$$
$$= \mathbb{E}\left[\operatorname{tr}\left(\left(I + 4\eta\hat g_T + 6\eta^2\hat g_T^2 + 4\eta^3\hat g_T^3 + \eta^4\hat g_T^4\right)\Psi_{T-1}^2\right)\right] \le \left(1 + 4\eta\lambda + 6\eta^2\bar\alpha_2 + 4\eta^3\bar\alpha_3 + \eta^4\bar\alpha_4\right)\mathbb{E}\left[(\operatorname{tr}\Psi_{T-1})^2\right]$$
$$\le \exp\left(4\eta\lambda + 6\eta^2\bar\alpha\right)\mathbb{E}\left[(\operatorname{tr}\Psi_{T-1})^2\right] \le \cdots \le \exp\left(\left(4\eta\lambda + 6\eta^2\bar\alpha\right)T\right),$$
where $\bar\alpha = \bar\alpha_2 + \bar\alpha_3 + \bar\alpha_4$ and we have used the Lieb-Thirring inequality together with $\eta \le 1$. Using Chebyshev's inequality, as in the proof of Theorem 3.3, it holds that
$$\mathbb{P}\left[\operatorname{tr}\Psi_T \le \tfrac13\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right)\right] \le \frac94\left(\exp\left(\left(6\bar\alpha + 8\lambda^2\right)\eta^2 T\right) - 1\right).$$
As long as $\eta \le \frac{\log(1 + \delta/9)}{(\bar\alpha + \lambda)T}$, this implies that with probability at least $1 - \delta$,
$$\operatorname{tr}\Psi_T \ge \tfrac13\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right).$$
From the derivation in (Allen-Zhu & Li, 2017, Appendix I), it follows that $\|u\|^2 \le O(d + \log(1/\delta))$ and $\operatorname{tr}(\Phi^{uu^\top}_T) \ge \Omega(\delta)\operatorname{tr}\Psi_T$, each with probability at least $1 - \delta$. Thus, with probability at least $1 - 3\delta$, it holds that
$$O\left(d + \log(1/\delta)\right)\exp\left(\sum_{t=1}^{T}\left(\eta^2\alpha_t^2 + 2\eta\, w_t^\top\hat g_t w_t\right)\right) \ge \frac{\delta}{3}\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right).$$
Taking logarithms on both sides and using the independence between $\hat g_t$ and $w_t$, so that $\mathbb{E}_{S,R}[w_t^\top\hat g_t w_t] = \langle g_t, \mathbb{E}_{S,R}[P_t]\rangle$, we obtain
$$2\eta\lambda T - 4\eta^2\lambda^2 T \le \log O\left(d + \log(1/\delta)\right) + \log\frac{3}{\delta} + \eta^2\sum_{t=1}^{T}\mathbb{E}_{S,R}\left[\alpha_t^2\right] + 2\eta\sum_{t=1}^{T}\left\langle g_t, \mathbb{E}_{S,R}[P_t]\right\rangle.$$
This implies
$$\lambda T - \sum_{t=1}^{T}\left\langle g_t, \mathbb{E}_{S,R}[P_t]\right\rangle \le \frac{\log O\left(d + \log(1/\delta)\right) + \log(3/\delta)}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\mathbb{E}_{S,R}\left[\alpha_t^2\right] + 2\eta\lambda^2 T.$$
Setting $\eta = \frac{\log(1 + \delta/9)}{(\bar\alpha + \lambda)T}$ finishes the proof.
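The following is a minimal Python sketch of Oja's update with the debiased gradient for Bernoulli-observed data, as analyzed in Theorem 4.3. The scaling of the auxiliary vector $z$ follows our reading of Lemma 4.1 and may differ from the paper's implementation in constants.

```python
import numpy as np

def oja_missing_entries(X, q, eta, rng):
    """Oja's update w <- normalize((I + eta * ghat) w), where

        ghat_t = (1/q^2) * xt xt^T - z z^T,

    xt is the Bernoulli(q)-observed vector and z is a rescaled, randomly
    chosen observed coordinate whose conditional mean cancels the diagonal
    bias (a sketch of the Section 4 estimator; scalings are our assumption)."""
    d = X.shape[1]
    w = rng.standard_normal(d)              # u ~ N(0, I) initialization
    w /= np.linalg.norm(w)
    for x in X:
        mask = rng.random(d) < q
        xt = np.where(mask, x, 0.0)
        obs = np.flatnonzero(mask)
        # (1/q^2) xt xt^T w, without forming the d x d matrix explicitly
        ghat_w = (xt @ w) * xt / q ** 2
        if obs.size > 0:
            i = rng.choice(obs)             # internal randomization S
            z = np.zeros(d)
            z[i] = np.sqrt((1 - q) * obs.size / q ** 2) * xt[i]
            ghat_w -= (z @ w) * z           # subtract z z^T w
        w = w + eta * ghat_w                # (I + eta * ghat) w
        w /= np.linalg.norm(w)
    return w
```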

D. Proofs of Section 5

Lemma D.1. The naive estimator $\tilde x\tilde x^\top$ is not unbiased; rather,
$$\mathbb{E}_R\left[\tilde x\tilde x^\top \mid x\right] = \frac{r(r-1)}{d(d-1)}\,x x^\top + \frac{r(d-r)}{d(d-1)}\operatorname{diag}\left(x x^\top\right),$$
where the expectation is taken with respect to the uniform sampling model.

Proof of Lemma D.1. For off-diagonal elements $i \ne j$, $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,j} = x_i x_j\,\mathbb{P}(i, j \in \{i_1, \ldots, i_r\})$, so we need to compute the probability of the event that two fixed elements $i, j \in [d]$ are both in the subset $\{i_1, \ldots, i_r\}$ sampled uniformly at random from all subsets of size $r$. The number of sets of $r$ elements which contain $i$ and $j$ is exactly $\binom{d-2}{r-2}$ and the total number of $r$-sets is $\binom{d}{r}$; thus
$$\mathbb{P}\left(i, j \in \{i_1, \ldots, i_r\}\right) = \binom{d-2}{r-2}\Big/\binom{d}{r} = \frac{r(r-1)}{d(d-1)}, \qquad\text{and so}\qquad \mathbb{E}_R\left[\tilde x\tilde x^\top \mid x\right]_{i,j} = \frac{r(r-1)}{d(d-1)}\,x_i x_j.$$
Now consider the diagonal elements. $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = x_i^2\,\mathbb{P}(i \in \{i_1, \ldots, i_r\})$, so we need to compute the probability of the event $i \in \{i_1, \ldots, i_r\}$. The number of sets of size $r$ containing $i$ is $\binom{d-1}{r-1}$, and thus $\mathbb{E}_R[\tilde x\tilde x^\top \mid x]_{i,i} = \frac{r}{d}\,x_i^2$. Putting these together,
$$\mathbb{E}_R\left[\tilde x\tilde x^\top \mid x\right] = \frac{r(r-1)}{d(d-1)}\,x x^\top + \left(\frac{r}{d} - \frac{r(r-1)}{d(d-1)}\right)\operatorname{diag}\left(x x^\top\right) = \frac{r(r-1)}{d(d-1)}\,x x^\top + \frac{r(d-r)}{d(d-1)}\operatorname{diag}\left(x x^\top\right).$$

Proof of Lemma 5.1. First note that
$$\mathbb{E}_S\left[z z^\top \mid \tilde x\right] = \frac{d(d-r)}{r-1}\,\mathbb{E}_S\left[\tilde x_{i_s}^2 e_{i_s}e_{i_s}^\top \mid \tilde x\right] = \frac{d(d-r)}{r(r-1)}\operatorname{diag}\left(\tilde x\tilde x^\top\right).$$
The proof simply follows from the following equalities:
$$\mathbb{E}_{S,R}\left[\frac{d(d-1)}{r(r-1)}\tilde x\tilde x^\top - z z^\top \,\Big|\, x\right] = \mathbb{E}_R\left[\mathbb{E}_S\left[\frac{d(d-1)}{r(r-1)}\tilde x\tilde x^\top \,\Big|\, \tilde x\right]\,\Big|\, x\right] - \mathbb{E}_R\left[\mathbb{E}_S\left[z z^\top \mid \tilde x\right]\,\Big|\, x\right]$$
$$= \frac{d(d-1)}{r(r-1)}\mathbb{E}_R\left[\tilde x\tilde x^\top \mid x\right] - \frac{d(d-r)}{r(r-1)}\mathbb{E}_R\left[\operatorname{diag}\left(\tilde x\tilde x^\top\right) \mid x\right] = x x^\top + \frac{d-r}{r-1}\operatorname{diag}\left(x x^\top\right) - \frac{d-r}{r-1}\operatorname{diag}\left(x x^\top\right) = x x^\top,$$
where $\operatorname{diag}(A)$, for $A \in \mathbb{R}^{d\times d}$, is the $d\times d$ matrix consisting of the diagonal of $A$.

Proof of Theorem 5.2. We first bound $\|\hat g_t\|^2$ as follows:
$$\|\hat g_t\|^2 = \left\|\frac{d(d-1)}{r(r-1)}\tilde x_t\tilde x_t^\top - z_t z_t^\top\right\|^2 = \frac{d^2(d-1)^2}{r^2(r-1)^2}\left\|\tilde x_t\tilde x_t^\top\right\|^2 + \left\|z_t z_t^\top\right\|^2 - \frac{2d(d-1)}{r(r-1)}\left\langle\tilde x_t\tilde x_t^\top, z_t z_t^\top\right\rangle$$
$$\le \frac{d^2(d-1)^2}{r^2(r-1)^2}\|\tilde x_t\|^4 + \|z_t\|^4 \le \frac{d^2(d-1)^2}{r^2(r-1)^2}\|x_t\|^4 + \frac{d^2(d-r)^2}{(r-1)^2} \le \frac{d^2(d-1)^2}{r^2(r-1)^2} + \frac{d^2(d-r)^2}{(r-1)^2},$$
where the first inequality follows from $\langle\tilde x_t\tilde x_t^\top, z_t z_t^\top\rangle \ge 0$, since this is an inner product of PSD matrices, the second inequality follows from the definition of $\tilde x$ and $z$, and the last inequality holds since $\|x\| \le 1$. Now, using Lemma A.1 with $\{\hat g_t\}$ and noting $\|P\|^2 \le k$ for all $P \in \mathcal{C}$, we have that
$$\sum_{t=1}^{T}\left\langle P - P_t, \hat g_t\right\rangle \le \frac{k}{2\eta} + \frac{\eta T}{2}\left(\frac{d^2(d-1)^2}{r^2(r-1)^2} + \frac{d^2(d-r)^2}{(r-1)^2}\right).$$
Since $\mathbb{E}_{S,R}[\hat g_t] = g_t = x_t x_t^\top$, setting
$$\eta = \sqrt{\frac{k}{T\left(\frac{d^2(d-1)^2}{r^2(r-1)^2} + \frac{d^2(d-r)^2}{(r-1)^2}\right)}}$$
and taking the expectation with respect to the internal randomization $S$ and the distribution of the missing data $R$ finishes the proof.
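A small Monte-Carlo check of the bias formula in Lemma D.1 (an illustration of ours, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, trials = 10, 4, 300_000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)

acc = np.zeros((d, d))
for _ in range(trials):
    obs = rng.choice(d, size=r, replace=False)   # uniform subset of size r
    xt = np.zeros(d)
    xt[obs] = x[obs]
    acc += np.outer(xt, xt)
acc /= trials

c_off = r * (r - 1) / (d * (d - 1))              # off-diagonal coefficient
c_diag = r * (d - r) / (d * (d - 1))             # extra diagonal mass
target = c_off * np.outer(x, x) + c_diag * np.diag(x ** 2)
print(np.abs(acc - target).max())                # ~0: matches Lemma D.1
```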

Proof of Theorem 5.3. First we bound $\|\hat g_k\|$. Since $\hat g_k = \frac{d(d-1)}{r(r-1)}\tilde x_k\tilde x_k^\top - z_k z_k^\top$ is the difference of two PSD matrices, it holds that
$$\|\hat g_k\| \le \max\left\{\frac{d(d-1)}{r(r-1)}\|\tilde x_k\|^2,\; \|z_k\|^2\right\} \le \alpha.$$
Let $v$ be the top eigenvector of $C$ with associated eigenvalue $\lambda$. Also let $g_k = x_k x_k^\top$, and let $w_k$ be the $k$-th iterate of Oja's algorithm, so that $P_k = w_k w_k^\top$. Define
$$\Phi^M_k = (I + \eta\hat g_k)\cdots(I + \eta\hat g_1)\,M\,(I + \eta\hat g_1)\cdots(I + \eta\hat g_k), \qquad \Psi_k = (I + \eta\hat g_k)\cdots(I + \eta\hat g_1)\,vv^\top\,(I + \eta\hat g_1)\cdots(I + \eta\hat g_k).$$
Following the calculations in (Allen-Zhu & Li, 2017, Appendix I), we have
$$\operatorname{tr}(\Phi^{uu^\top}_k) = \operatorname{tr}\left((I + \eta\hat g_k)\Phi^{uu^\top}_{k-1}(I + \eta\hat g_k)\right) = \operatorname{tr}(\Phi^{uu^\top}_{k-1}) + 2\eta\operatorname{tr}\left(\hat g_k\Phi^{uu^\top}_{k-1}\right) + \eta^2\operatorname{tr}\left(\hat g_k\Phi^{uu^\top}_{k-1}\hat g_k\right)$$
$$\le \operatorname{tr}(\Phi^{uu^\top}_{k-1}) + 2\eta\operatorname{tr}\left(\hat g_k\Phi^{uu^\top}_{k-1}\right) + \eta^2\alpha^2\operatorname{tr}(\Phi^{uu^\top}_{k-1}) = \left(1 + \eta^2\alpha^2 + 2\eta\, w_k^\top\hat g_k w_k\right)\operatorname{tr}(\Phi^{uu^\top}_{k-1})$$
$$\le \exp\left(\eta^2\alpha^2 + 2\eta\, w_k^\top\hat g_k w_k\right)\operatorname{tr}(\Phi^{uu^\top}_{k-1}) \le \cdots \le \|u\|^2\exp\left(\sum_{k=1}^{T}\left(\eta^2\alpha^2 + 2\eta\, w_k^\top\hat g_k w_k\right)\right),$$
where for the first inequality we have used the fact that $\Phi^{uu^\top}_{k-1}$ is PSD and $\|\hat g_k\| \le \alpha$, and for the second inequality we have used $1 + x \le \exp(x)$. We have also used the fact that $\Phi^{uu^\top}_{k-1} = \operatorname{tr}(\Phi^{uu^\top}_{k-1})\, w_k w_k^\top$.

Next we have
$$\mathbb{E}[\operatorname{tr}\Psi_T] = \mathbb{E}\left[\operatorname{tr}\left(vv^\top(I + \eta\hat g_T)\Phi^{I}_{T-1}(I + \eta\hat g_T)\right)\right] = \mathbb{E}\left[\operatorname{tr}\left(vv^\top(I + 2\eta\hat g_T)\Phi^{I}_{T-1}\right)\right] + \eta^2\,\mathbb{E}\left[v^\top\hat g_T\Phi^{I}_{T-1}\hat g_T v\right]$$
$$\ge \mathbb{E}\left[\operatorname{tr}\left(vv^\top(I + 2\eta C)\Phi^{I}_{T-1}\right)\right] = (1 + 2\eta\lambda)\,\mathbb{E}\left[v^\top\Phi^{I}_{T-1}v\right] \ge \exp\left(2\eta\lambda - 4\eta^2\lambda^2\right)\mathbb{E}\left[v^\top\Phi^{I}_{T-1}v\right] \ge \cdots \ge \exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right),$$
where the expectation is with respect to both $R$ and $\mathcal{D}$ and we have used $1 + x \ge \exp(x - x^2)$ for $x \in [0, 1]$.

Finally, we have
$$\mathbb{E}\left[(\operatorname{tr}\Psi_T)^2\right] = \mathbb{E}\left[\operatorname{tr}\left((I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T)\,(I + \eta\hat g_T)\Psi_{T-1}(I + \eta\hat g_T)\right)\right] \le \mathbb{E}\left[\operatorname{tr}\left((I + \eta\hat g_T)^4\Psi_{T-1}^2\right)\right]$$
$$= \mathbb{E}\left[\operatorname{tr}\left(\left(I + 4\eta\hat g_T + 6\eta^2\hat g_T^2 + 4\eta^3\hat g_T^3 + \eta^4\hat g_T^4\right)\Psi_{T-1}^2\right)\right] \le \mathbb{E}\left[\operatorname{tr}\left(\left(I + 11\eta^2\alpha^2 I + 4\eta C\right)\Psi_{T-1}^2\right)\right]$$
$$\le \left(1 + 11\eta^2\alpha^2 + 4\eta\lambda\right)\mathbb{E}\left[(\operatorname{tr}\Psi_{T-1})^2\right] \le \exp\left(11\eta^2\alpha^2 + 4\eta\lambda\right)\mathbb{E}\left[(\operatorname{tr}\Psi_{T-1})^2\right] \le \cdots \le \exp\left(\left(11\eta^2\alpha^2 + 4\eta\lambda\right)T\right),$$
where we used the Lieb-Thirring inequality for the first inequality and $\eta\alpha \le 1$ for the second. Combining the last two displays with Chebyshev's inequality we get
$$\mathbb{P}\left[\operatorname{tr}\Psi_T \le \mathbb{E}[\operatorname{tr}\Psi_T] - \delta^{-1/2}\sqrt{\operatorname{var}(\operatorname{tr}\Psi_T)}\right] \le \delta.$$
Thus, with probability at least $1 - \delta$, it holds that
$$\operatorname{tr}\Psi_T \ge \exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right) - \delta^{-1/2}\sqrt{\exp\left(\left(11\eta^2\alpha^2 + 4\eta\lambda\right)T\right) - \exp\left(\left(4\eta\lambda - 8\eta^2\lambda^2\right)T\right)} \ge \tfrac13\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right),$$
as long as $\eta \le \frac{\log(1 + \delta/9)}{(\alpha^2 + \lambda)T}$. Following the derivations in (Allen-Zhu & Li, 2017), we have that with probability at least $1 - \delta$ it holds that $\operatorname{tr}(\Phi^{uu^\top}_T) \ge \Omega(\delta)\operatorname{tr}\Psi_T$ and that $\|u\|^2 \le O(d + \log(1/\delta))$. A union bound, together with the first display of this proof, implies that with probability at least $1 - 3\delta$ we have
$$O\left(d + \log(1/\delta)\right)\exp\left(\sum_{k=1}^{T}\left(\eta^2\alpha^2 + 2\eta\, w_k^\top\hat g_k w_k\right)\right) \ge \frac{\delta}{3}\exp\left(\left(2\eta\lambda - 4\eta^2\lambda^2\right)T\right).$$
Taking logarithms,
$$\log O\left(d + \log(1/\delta)\right) + \eta^2\alpha^2 T + 2\eta\sum_{k=1}^{T} w_k^\top\hat g_k w_k \ge \log\frac{\delta}{3} + 2\eta\lambda T - 4\eta^2\lambda^2 T,$$
so that, dividing by $2\eta$ and using $\mathbb{E}_{S,R}[w_k^\top\hat g_k w_k] = \langle g_k, \mathbb{E}_{S,R}[P_k]\rangle$,
$$\lambda T - \sum_{k=1}^{T}\left\langle g_k, \mathbb{E}_{S,R}[P_k]\right\rangle \le \frac{\log O\left(d + \log(1/\delta)\right) + \log(3/\delta)}{2\eta} + \frac{\eta\left(\alpha^2 + 4\lambda^2\right)T}{2} \le \frac{\left(\alpha^2 + \lambda\right)\left(\log O\left(d + \log(1/\delta)\right) + \log(3/\delta)\right)T}{2\log(1 + \delta/9)} + 3\log(1 + \delta/9),$$
where in the last inequality we have substituted $\eta = \frac{\log(1 + \delta/9)}{(\alpha^2 + \lambda)T}$ and used the facts that $\lambda \le 1$ and $\lambda \le \alpha^2$.
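To make the bookkeeping in the proof of Theorem 5.3 concrete, the following sketch tracks $\operatorname{tr}(\Phi^{uu^\top}_k) = \|B_k u\|^2$ and $\operatorname{tr}\Psi_k = \|B_k v\|^2$ along a run, where $B_k = (I + \eta\hat g_k)\cdots(I + \eta\hat g_1)$. This is only an illustration of the quantities appearing in the analysis (with the factor ordering as reconstructed above); it is not part of the algorithm itself.

```python
import numpy as np

def oja_potentials(ghats, u, v, eta):
    """Return (tr Phi_k^{uu^T}, tr Psi_k) for k = 1..T, where
    Phi_k^{uu^T} = B_k u u^T B_k^T, Psi_k = B_k v v^T B_k^T and
    B_k = (I + eta ghat_k) ... (I + eta ghat_1).  Only B_k u and B_k v are
    maintained, since the traces equal their squared norms."""
    Bu, Bv = np.array(u, dtype=float), np.array(v, dtype=float)
    traces = []
    for g in ghats:                     # ghats: list of d x d gradient estimates
        Bu = Bu + eta * (g @ Bu)        # apply (I + eta ghat_k)
        Bv = Bv + eta * (g @ Bv)
        traces.append((float(Bu @ Bu), float(Bv @ Bv)))
    return traces
```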

E. Proofs of Section 6

Proof of Theorem 6.1. Denoting $(P_t - I)x_t$ by $y_t$, we can rewrite $g_t$ as $g_t = \frac{x_t y_t^\top + y_t x_t^\top}{2\|y_t\|}$. Thus we have
$$\|g_t\|^2 = \frac{1}{4\|y_t\|^2}\operatorname{tr}\left(\left(x_t y_t^\top + y_t x_t^\top\right)\left(x_t y_t^\top + y_t x_t^\top\right)\right) = \frac{\|x_t\|^2\|y_t\|^2 + \left(x_t^\top y_t\right)^2}{2\|y_t\|^2} \le \|x_t\|^2 \le 1.$$
By convexity of norms,
$$\left\|x_t - P_t x_t\right\| - \left\|x_t - P x_t\right\| \le \left\langle P_t - P, g_t\right\rangle.$$
Using Lemma A.1, noting $\|P\|^2 \le k$ for all $P \in \mathcal{C}$ and $\|g_t\| \le 1$,
$$\sum_{t=1}^{T}\left(\left\|x_t - P_t x_t\right\| - \left\|x_t - P x_t\right\|\right) \le \frac{k}{2\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{k/T}$ completes the proof.

F. Proofs of Section 8

Proof of Theorem 8.1. Our proof follows (Zinkevich, 2003). We start the analysis by bounding the distance of our iterate at time $t+1$ from the dynamic competitor $P^*_t$ at time $t$:
$$\left\|P_{t+1} - P^*_t\right\|^2 = \left\|\Pi_{\mathcal{C}}\left(P_t + \eta g_t\right) - P^*_t\right\|^2 \le \left\|P_t + \eta g_t - P^*_t\right\|^2 = \left\|P_t - P^*_t\right\|^2 + 2\eta\left\langle g_t, P_t - P^*_t\right\rangle + \eta^2\|g_t\|^2.$$
Moreover,
$$\left\|P_{t+1} - P^*_{t+1}\right\|^2 = \left\|P_{t+1} - P^*_t\right\|^2 + \left\|P^*_t - P^*_{t+1}\right\|^2 + 2\left\langle P_{t+1} - P^*_t, P^*_t - P^*_{t+1}\right\rangle \le \left\|P_{t+1} - P^*_t\right\|^2 + 6\sqrt{k}\left\|P^*_t - P^*_{t+1}\right\|,$$
where we have used $\|P^*_t - P^*_{t+1}\| \le 2\sqrt{k}$ and $\|P_{t+1} - P^*_t\| \le 2\sqrt{k}$. Combining the two displays, noting $\langle g_t, P^*_t - P_t\rangle = x_t^\top P^*_t x_t - x_t^\top P_t x_t$ and $\|g_t\| \le 1$, rearranging and dividing both sides by $2\eta$, we get
$$x_t^\top P^*_t x_t - x_t^\top P_t x_t \le \frac{3\sqrt{k}}{\eta}\left\|P^*_t - P^*_{t+1}\right\| + \frac{\left\|P_t - P^*_t\right\|^2 - \left\|P_{t+1} - P^*_{t+1}\right\|^2}{2\eta} + \frac{\eta}{2}.$$
Summing over $t$, we observe that the telescoping sum on the right-hand side is bounded by
$$\sum_{t=1}^{T}\left(\left\|P_t - P^*_t\right\|^2 - \left\|P_{t+1} - P^*_{t+1}\right\|^2\right) = \left\|P_1 - P^*_1\right\|^2 - \left\|P_{T+1} - P^*_{T+1}\right\|^2 \le 4k.$$
Plugging in the definition of the total shift $S = \sum_{t=1}^{T}\left\|P^*_t - P^*_{t+1}\right\|$, we get
$$\sum_{t=1}^{T}\left(x_t^\top P^*_t x_t - x_t^\top P_t x_t\right) \le \frac{3\sqrt{k}\,S}{\eta} + \frac{\eta T}{2} + \frac{2k}{\eta} = \frac{3\sqrt{k}\,S + 2k}{\eta} + \frac{\eta T}{2}.$$
Choosing $\eta = \sqrt{\frac{2\left(3\sqrt{k}\,S + 2k\right)}{T}}$ completes the proof.
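For reference, here is a short sketch (ours, not the paper's code) of the subgradient used in the proof of Theorem 6.1, which is also the per-sample update direction of the Robust GD method evaluated in the experiments below. Returning the zero matrix when $y_t = 0$ is an implementation choice; zero is a valid subgradient at that point.

```python
import numpy as np

def abs_deviation_subgradient(P, x, eps=1e-12):
    """Subgradient of the convex map P -> ||(P - I) x|| at P:

        y = (P - I) x,   g = (x y^T + y x^T) / (2 ||y||),

    with ||g|| <= ||x||, as used in the proof of Theorem 6.1."""
    y = P @ x - x
    ny = np.linalg.norm(y)
    if ny < eps:
        return np.zeros((x.size, x.size))   # zero is a valid subgradient at y = 0
    return (np.outer(x, y) + np.outer(y, x)) / (2.0 * ny)
```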

G. Experimental Results

We provide additional experiments in Figures 3, 4, 5 and 6 for PCA with partial observations and for PCA with missing entries. As stated in the main text, all our observations from Figures 1 and 2 hold for the results presented here. Further, in Figure 7, we present experimental results for the proposed Absolute Subspace Deviation Model algorithm of Section 6, referred to as Robust GD in the plots. The algorithms against which we compare are Robust Online PCA proposed in (Goes et al.), referred to as Robust MEG in the plots, which is an instance of online mirror descent with the potential function chosen to be the entropy, and simple batch PCA on the whole dataset. As ground truth we take the $k$-dimensional subspace returned by the method proposed in (Lerman et al.). In the following discussion the ground truth is represented by the projection matrix $P^* \in \mathbb{R}^{d\times d}$, and the estimated projection matrix returned by an algorithm is denoted by $P_t$. Since our experiments are run on a synthetic dataset of fixed size, we denote this dataset by $X \in \mathbb{R}^{d\times n}$. The criteria against which we evaluate are the average angle between subspaces, given by $\operatorname{tr}(P^* P_t)/k$, and the reconstruction error on the whole dataset, given by $\|U_t^\top X - U^\top X\|/n$, where a rank-$k$ projection matrix $P$ is decomposed as $P = UU^\top$ for orthogonal $U \in \mathbb{R}^{d\times k}$. We also evaluate the total regret incurred. The dataset is generated as in (Goes et al.). First, inliers are sampled from a $k$-dimensional multivariate normal distribution with mean $0$ and covariance matrix $\frac{1}{k}I_k$. Next, the inliers are embedded in a $d$-dimensional space via the linear transformation $U \in \mathbb{R}^{d\times k}$ with entries $U_{ii} = 1$ and $U_{ij} = 0$ for $i \ne j$. Finally, outliers are sampled from a $d$-dimensional multivariate normal distribution with mean $0$ and diagonal covariance $\frac{\lambda}{d}\Sigma_{\mathrm{out}}$, where $(\Sigma_{\mathrm{out}})_{ii} = i$; here $\lambda$ is a user-specified parameter which governs the SNR. In our experiments $\lambda = $ , the number of outliers is %, %, and 8% of the total number of points, $k = $  and $d = $ . In our comparison we also include a capped version of our proposed method, where the rank of each intermediate iterate is hard-capped to 5 by keeping the 5 directions associated with the top 5 singular values of the current iterate.

We now briefly discuss the more interesting points to be observed from Figure 7. First, our algorithm clearly outperforms Robust Online PCA and batch PCA on all the chosen criteria. Secondly, since Robust Online PCA is required to always keep full-rank iterates, as it operates in the space complementary to the one we wish to recover, it has to pay a computational cost of $O(d^3)$ per iteration, while our method only pays $O(dk_t^2)$, where $k_t$ is the rank of the current iterate. As can be seen from the plots, for the cases when the number of outliers is % or %, the intermediate rank of the iterates does not grow too much, so in practice our proposed method is efficient. Finally, we note that the capped version of the proposed method performs as well as, or in some cases even better than, the uncapped one.

Figure 3: Comparisons of Oja, MGD, MGD-PO, MGD-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the XRMB dataset, in terms of the variance captured on a test set, as a function of the number of observed entries (top), the number of iterations (middle) and the runtime (bottom), for a fixed value of $k$.

Figure 4: Comparisons of Oja, MGD, MGD-PO, MGD-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the XRMB dataset, in terms of the variance captured on a test set, as a function of the number of observed entries (top), the number of iterations (middle) and the runtime (bottom), for a fixed value of $k$.

Figure 5: Comparisons of Oja, MGD, MSG-PO, MSG-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the MNIST dataset, in terms of the variance captured on a test set, as a function of the number of observed entries (top), the number of iterations (middle) and the runtime (bottom), for a fixed value of $k$.

Figure 6: Comparisons of Oja, MGD, MSG-PO, MSG-MD, Oja-PO, Oja-MD and GROUSE for PCA with missing data on the MNIST dataset, in terms of the variance captured on a test set, as a function of the number of observed entries (top), the number of iterations (middle) and the runtime (bottom), for a fixed value of $k$.

Figure 7: Comparison of batch PCA, Robust Online PCA (Robust MEG) and the Absolute Subspace Deviation Model (Robust GD), with % outliers (top row), % outliers (middle row) and 8% outliers (bottom row) on the synthetic dataset. Experiments are reported in terms of the average subspace angle (leftmost), the variance captured on a test set (left), the reconstruction error (right) and the rank of the iterates (rightmost), as a function of the number of samples. (Panel quantities: $\operatorname{tr}(P^* P_t)/k$, regret, $\|P^* X - P_t X\|/n$, and $\operatorname{rank}(P_t)$.)
