Approximate Principal Components Analysis of Large Data Sets

Approximate Principal Components Analysis of Large Data Sets
Daniel J. McDonald, Department of Statistics, Indiana University
mypage.iu.edu/~dajmcdon
April 27, 2016

Approximation-Regularization for Analysis of Large Data Sets
Daniel J. McDonald, Department of Statistics, Indiana University
mypage.iu.edu/~dajmcdon
April 27, 2016

LESSON OF THE TALK
Many statistical methods use (perhaps implicitly) a singular value decomposition (SVD). The SVD is computationally expensive. We want to understand the statistical properties of some approximations which speed up computation and save storage.
Spoiler: sometimes approximations actually improve the statistical properties.

CORE TECHNIQUES
Suppose we have a matrix X ∈ R^{n×p} and a vector Y ∈ R^n.
LEAST SQUARES: finding β̂ = argmin_β ‖Xβ − Y‖₂².
PRINCIPAL COMPONENTS ANALYSIS (PCA) (or graph Laplacian, or diffusion map, or ...): finding U, V orthogonal and Λ diagonal such that X̃ = UΛV^⊤, where X̃ = X − (1/n)11^⊤X is the column-centered data matrix.

CORE TECHNIQUES
If X fits into RAM, there exist excellent algorithms in LAPACK that are:
- double precision
- very stable
- cubic complexity, O(min{n, p}³), with small constants
- require extensive random access to the matrix
There is a lot of interest in finding and analyzing techniques that extend these approaches to large(r) problems.

OUT-OF-CORE TECHNIQUES
Many techniques focus on randomized compression (this is sometimes known as sketching).
LEAST SQUARES:
1. Rokhlin, Tygert, A fast randomized algorithm for overdetermined linear least-squares regression (2008).
2. Drineas, Mahoney, et al., Faster least squares approximation (2011).
3. Woodruff, Sketching as a Tool for Numerical Linear Algebra (2013).
4. Homrighausen, McDonald, Preconditioned least squares (2016).
SPECTRAL DECOMPOSITION:
1. Halko, et al., Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions (2011).
2. Gittens, Mahoney, Revisiting the Nyström Method for Improved Large-Scale Machine Learning (2013).
3. Pourkamali, Memory and Computation Efficient PCA via Very Sparse Random Projections (2014).
4. Homrighausen, McDonald, On the Nyström and Column-Sampling Methods for the Approximate PCA of Large Data Sets (2015).


COMPRESSION
BASIC IDEA: choose some matrix Q ∈ R^{q×n} or R^{p×q}, and use QX or XQ (and QY) instead.
Finding QX for arbitrary Q and X takes O(qnp) computations, which can be expensive.
To get this approach to work, we need some structure on Q.

THE Q MATRIX
- Gaussian: well-behaved distribution and eas(ier) theory, but a dense matrix.
- Fast Johnson-Lindenstrauss methods: randomized Hadamard (or Fourier) transformation; allows for O(np log(p)) computations.
- Q = πτ for π a permutation of I and τ = [I_q 0]^⊤ (so XQ means only reading q columns).
- Sparse Bernoulli, i.i.d.: Q_ij = 1 with probability 1/(2s), 0 with probability 1 − 1/s, and −1 with probability 1/(2s). This means QX takes O(qnp/s) computations on average; see the sketch below.

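To make two of these constructions concrete, here is a minimal numpy/scipy sketch written for this transcript (the function names are mine, not from the talk): a dense Gaussian Q and a sparse Bernoulli Q with sparsity parameter s, each scaled so that E[Q^⊤Q] = I, applied to a data matrix X.

```python
# Illustrative sketching matrices (my own code, not the speaker's).
import numpy as np
from scipy import sparse

def gaussian_sketch(q, n, rng):
    """Dense Gaussian Q in R^{q x n}, scaled so that E[Q'Q] = I."""
    return rng.standard_normal((q, n)) / np.sqrt(q)

def sparse_bernoulli_sketch(q, n, s, rng):
    """Q_ij = +1 or -1 with probability 1/(2s) each, 0 with probability 1 - 1/s."""
    mask = rng.random((q, n)) < 1.0 / s           # nonzero with probability 1/s
    signs = rng.choice([-1.0, 1.0], size=(q, n))  # fair sign on the nonzero entries
    # (A production version would sample the nonzero positions directly instead of
    #  materializing dense q x n arrays.)
    return sparse.csr_matrix(mask * signs) * np.sqrt(s / q)  # scale so E[Q'Q] = I

rng = np.random.default_rng(0)
n, p, q, s = 5000, 300, 500, 100
X = rng.standard_normal((n, p))
Q = sparse_bernoulli_sketch(q, n, s, rng)
QX = Q @ X   # about qnp/s scalar multiplications on average, since Q has ~qn/s nonzeros
```

The 1/√q and √(s/q) scalings are the usual normalizations; they do not change the compressed least-squares or PCA solutions, only the interpretation of Q as an approximate isometry.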

TYPICAL RESULTS
THE GENERAL PHILOSOPHY: find an approximation that is as close as possible to the solution of the original problem.
PCA: a typical result would be to find an approximate Ṽ such that
angle(V, Ṽ) ≲ (p / n) · (1 / spectral gap).
(This is the same order of convergence as PCA itself [Homrighausen, McDonald (2015)].)

TYPICAL RESULTS
THE GENERAL PHILOSOPHY: find an approximation that is as close as possible to the solution of the original problem.
LEAST SQUARES: a typical result would be to find a β̃ such that
‖Xβ̃ − Y‖₂² ≤ (1 + ε) min_β ‖Xβ − Y‖₂².
Here, β̃ should be easier to compute than the exact β̂.
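To see the (1 + ε) guarantee in action, here is a small self-contained numpy experiment (my own code, not from the talk): solve the sketched problem with a dense Gaussian Q and compare its residual sum of squares to the exact OLS fit.

```python
# Sketch-and-solve least squares versus exact OLS (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 20000, 50, 2000
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)            # exact OLS
Q = rng.standard_normal((q, n)) / np.sqrt(q)                 # Gaussian sketch
beta_tilde, *_ = np.linalg.lstsq(Q @ X, Q @ Y, rcond=None)   # compressed problem

rss_hat = np.sum((X @ beta_hat - Y) ** 2)
rss_tilde = np.sum((X @ beta_tilde - Y) ** 2)
print(rss_tilde / rss_hat)   # typically 1 + (small epsilon)
```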

COLLABORATORS & GRANT SUPPORT
Collaborator: Darren Homrighausen, Assistant Professor, Department of Statistics, Colorado State.
Grant support: NSF Grant DMS 14-07543; INET Grant INO 14-00020.

Typical result: Principal components analysis

NOTATION
Write (again) the SVD of X as X = UΛV^⊤.
For a general rank-r matrix A, write A = U(A)Λ(A)V(A)^⊤, where Λ(A) = diag(λ₁(A), ..., λ_r(A)).
For a matrix A, A_d denotes its first d columns.

APPROXIMATIONS
We can't work with X directly. What about approximating it?
Focus on two methods of approximate SVD:
1. the Nyström extension
2. column sampling

INTUITION (NOT THE ALGORITHM)
Essentially, let S = X^⊤X ∈ R^{p×p} and R = XX^⊤ ∈ R^{n×n}. Both are symmetric and positive semi-definite.
Randomly choose q indices in {1, ..., p} (for S) or {1, ..., n} (for R), and partition the matrix so that the selected portion is S_11 or R_11:
S = VΛ²V^⊤ = [S_11 S_12; S_21 S_22],   R = UΛ²U^⊤ = [R_11 R_12; R_21 R_22].

APPROXIMATING THINGS WE DON'T CARE ABOUT
If we want to approximate S (or R), we have, for example:
Nyström: S ≈ [S_11; S_21] S_11^{-1} [S_11 S_12]
Column sampling: S ≈ U([S_11; S_21]) Λ([S_11; S_21]) U([S_11; S_21])^⊤
Previous theoretical results have focused on the accuracy of these approximations (and variants) for a fixed computational budget.

APPROXIMATING THINGS WE DON'T CARE ABOUT
But we don't care about these approximations. We don't want these matrices.

WHAT WE ACTUALLY WANT...
We really want U, V, and Λ, or even better, the population analogues. Then we can get the principal components, the principal coordinates, and the amount of variance explained.
It turns out that there are quite a few ways to use these two methods to get the things we want. Let
L(S) = [S_11; S_21],   L(R) = [R_11; R_21].

LOTS OF APPROXIMATIONS
After some reasonable algebra...
Quantity of interest: V
  V_nys = L(S) V(S_11) Λ(S_11)^{-1}
  V_cs = U(L(S))
Quantity of interest: U
  U_nys = L(R) V(R_11) Λ(R_11)^{-1}
  U_cs = U(L(R))
  Û_nys = X V_nys Λ_nys^{-1/2}
  Û_cs = X V_cs Λ_cs^{-1/2}
  Û = U(X_1)
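Below is a rough numpy sketch (my own code, not the authors') of the two approximations to V in the list above, under my reading of the notation: S = X^⊤X, the selected indices are taken to be the first q columns, and the scaling dropped in the transcript is filled in as the usual Nyström one, Λ(S_11)^{-1}.

```python
# Nystrom and column-sampling approximations to V (illustrative reconstruction).
import numpy as np

def approx_V(X, q, d):
    S = X.T @ X                        # S = X'X (formed explicitly here only for clarity)
    L = S[:, :q]                       # L(S): the q selected columns of S
    S11 = S[:q, :q]                    # selected q x q block
    lam, V11 = np.linalg.eigh(S11)
    order = np.argsort(lam)[::-1]      # largest eigenvalues first
    lam, V11 = lam[order], V11[:, order]
    V_nys = L @ V11[:, :d] / lam[:d]   # Nystrom: L(S) V(S_11) Lambda(S_11)^{-1}
    V_cs = np.linalg.svd(L, full_matrices=False)[0][:, :d]  # CS: U(L(S))
    return V_nys, V_cs

rng = np.random.default_rng(2)
n, p, q, d = 2000, 300, 60, 5
X = rng.standard_normal((n, p)) @ np.diag(np.linspace(3.0, 0.5, p))  # decaying spectrum
V_nys, V_cs = approx_V(X, q, d)

V = np.linalg.svd(X, full_matrices=False)[2][:d].T   # exact top-d right singular vectors
for name, Vhat in [("Nystrom", V_nys), ("column sampling", V_cs)]:
    Qhat, _ = np.linalg.qr(Vhat)                      # orthonormalize the approximation
    cosines = np.linalg.svd(V.T @ Qhat, compute_uv=False)
    print(name, "largest principal angle (deg):",
          np.degrees(np.arccos(np.clip(cosines.min(), -1.0, 1.0))))
```

In a genuinely large problem one would never form S; both approximations only need the selected columns of X, which is exactly where the computational and storage savings in the next slide come from.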

NOTES OF INTEREST
Complexity:
Method | Computational | Storage
Standard | O(n²p + p²n) | O(np)
Nyström | O(nq² + q³) | O(q²) [O(nq)]
Column sampling | O(qnp + qp²) | O(pq)
Column sampling results in orthogonal U and V; Nyström doesn't.
Û = U(X_1), where X = [X_1 X_2], seems reasonable if we think that only the first l selected covariates matter (this gets used frequently by new statistical methods).

SO WHAT DO WE USE?
Well... it depends. We did some theory which gives a way to calculate how far off your approximation might be; you could use these bounds, together with your data, to make a choice.
We also did some simulations: draw X_i i.i.d. N_p(0, Σ) for n = 5000, p = 3000, under 4 different conditions. For each method we report the projection error of the estimator divided by the projection error of the baseline estimator.

APPROXIMATIONS TO V
[Figure: one panel per d ∈ {2, 5, 10, 15}. y-axis: relative performance of Nyström to column sampling (< 1 Nyström better, > 1 worse); x-axis: approximation parameter q ∈ [3d/2, 15d]. Conditions: Random 0.001 (solid, square), Random 0.01 (dashed, circle), Random 0.1 (dotted, triangle), Band (dot-dash, diamond).]
For small d the two methods are comparable; for large d, Nyström is roughly 10–15% worse.

APPROXIMATIONS TO U
[Figure: four panels (Random 0.001, Random 0.01, Random 0.1, Band), d = 10. y-axis: performance of U_nys, Û_nys, Û_cs, and Û relative to U_cs; x-axis: approximation parameter q.]

APPROXIMATE PCA CONCLUSIONS (SO FAR)
For computing V, CS beats Nyström in terms of accuracy, but is much slower for similar choices of the approximation parameter when d is large.
For computing U, the naïve methods are bad; it is better to approximate V and multiply, so see the point above.
Û really stinks. But it's the most obvious choice, and it also gets used frequently.
For more info, see the paper. It contains boring theory, more extensive simulations, and an application to the Enron email dataset.
EVEN BETTER APPROXIMATE PCA
We really want the population versions: the top d eigenvectors of Σ...

Atypical (better) result: Compressed regression

ESTIMATION RISK
Form a loss function ℓ : R^p × R^p → R_+ (for this talk, let's just specify ℓ(θ̂, θ) = ‖θ̂ − θ‖₂²).
The quality of an estimator is given by its risk,
R(θ̂) = E‖θ̂ − θ‖₂².
This can be decomposed as
R(θ̂) = ‖E[θ̂] − θ‖² + E‖θ̂ − E[θ̂]‖² = Bias² + Variance.
There is a natural conservation between these quantities...
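For completeness, the one line of algebra behind the decomposition (standard, not specific to this talk): expand around E[θ̂] and note that the cross term vanishes, because E[θ̂ − E[θ̂]] = 0 while E[θ̂] − θ is non-random.

```latex
\begin{aligned}
R(\hat\theta)
  &= \mathbb{E}\lVert \hat\theta - \theta \rVert_2^2
   = \mathbb{E}\bigl\lVert (\hat\theta - \mathbb{E}[\hat\theta]) + (\mathbb{E}[\hat\theta] - \theta) \bigr\rVert_2^2 \\
  &= \mathbb{E}\lVert \hat\theta - \mathbb{E}[\hat\theta] \rVert_2^2
   + \lVert \mathbb{E}[\hat\theta] - \theta \rVert_2^2
   + 2\,\mathbb{E}\bigl\langle \hat\theta - \mathbb{E}[\hat\theta],\; \mathbb{E}[\hat\theta] - \theta \bigr\rangle \\
  &= \underbrace{\lVert \mathbb{E}[\hat\theta] - \theta \rVert_2^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}\lVert \hat\theta - \mathbb{E}[\hat\theta] \rVert_2^2}_{\text{Variance}}.
\end{aligned}
```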

BIAS-VARIANCE TRADEOFF
[Figure: risk, squared bias, and variance as functions of model complexity.]
A typical result examines the bias; an atypical result also considers the variance.

FULLY COMPRESSED REGRESSION
Let Q ∈ R^{q×n}, and let's solve the fully compressed least squares problem
β̂_FC = argmin_β ‖QXβ − QY‖₂².
This is a common way to solve least squares problems that are:
- very large, or
- poorly conditioned.
The numerical/theoretical properties generally depend on Q and q (multiply quickly? reduce dimension? reduce storage?).

THREE APPROXIMATIONS
Note: ‖Q(Xβ − Y)‖₂² = β^⊤X^⊤Q^⊤QXβ − 2β^⊤X^⊤Q^⊤QY + ‖QY‖₂².
Full compression: β̂_FC = (X^⊤Q^⊤QX)^{-1} X^⊤Q^⊤QY
Partial compression: β̂_PC = (X^⊤Q^⊤QX)^{-1} X^⊤Y
Convex combination compression¹: β̂_C = Wα̂, where W = [β̂_FC, β̂_PC] and α̂ = argmin_α ‖XWα − Y‖₂².
¹ See the work of Stephen Becker, CU Boulder Applied Math.
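Here is a minimal numpy sketch (my code; the talk does not give an implementation) of the three estimators, using a dense Gaussian Q for brevity. The combination weights are fit by regressing Y on Xβ̂_FC and Xβ̂_PC, which is my reading of the α̂ step above.

```python
# Full compression, partial compression, and the combination estimator (illustrative).
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 10000, 50, 1000
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

Q = rng.standard_normal((q, n)) / np.sqrt(q)   # any of the Q constructions would do
QX, QY = Q @ X, Q @ Y

G = QX.T @ QX                                  # X'Q'QX
beta_fc = np.linalg.solve(G, QX.T @ QY)        # full compression
beta_pc = np.linalg.solve(G, X.T @ Y)          # partial compression

W = np.column_stack([beta_fc, beta_pc])        # W = [beta_FC, beta_PC]
alpha, *_ = np.linalg.lstsq(X @ W, Y, rcond=None)   # unconstrained least-squares weights
beta_c = W @ alpha                             # "convex combination" estimator
```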

WHY THESE?
It turns out that FC is unbiased, and therefore worse than OLS (it has higher variance).
On the other hand, PC is biased, and empirics demonstrate that it has low variance.
A combination should give better statistical properties.

COMPRESSED REGRESSION
With this Q, β̂_C works in practice:
- computational savings: O(qnp/s + qp²)
- approximately the same estimation risk as OLS
This is good, but we had a realization:
IF RIDGE REGRESSION IS BETTER THAN OLS, WHY NOT POINT THE APPROXIMATION AT RIDGE?

COMPRESSED RIDGE REGRESSION
This means introducing a tuning parameter λ and defining:
β̂_FC(λ) = (X^⊤Q^⊤QX + λI)^{-1} X^⊤Q^⊤QY
β̂_PC(λ) = (X^⊤Q^⊤QX + λI)^{-1} X^⊤Y
(Everything else about the procedure is the same.)
This has the same computational complexity, but much lower risk.
Let's look at an (a)typical result...
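A sketch of the penalized versions under the same assumptions as the previous example; the only change is the λI added inside the matrix being inverted.

```python
# Compressed ridge regression (illustrative; names are mine).
import numpy as np

def compressed_ridge(X, Y, Q, lam):
    """Return (beta_fc, beta_pc, beta_c) for the penalized compressed estimators."""
    QX, QY = Q @ X, Q @ Y
    G = QX.T @ QX + lam * np.eye(X.shape[1])   # X'Q'QX + lambda * I
    beta_fc = np.linalg.solve(G, QX.T @ QY)    # full compression, penalized
    beta_pc = np.linalg.solve(G, X.T @ Y)      # partial compression, penalized
    W = np.column_stack([beta_fc, beta_pc])
    alpha, *_ = np.linalg.lstsq(X @ W, Y, rcond=None)
    return beta_fc, beta_pc, W @ alpha

# e.g., with X, Y, Q as in the previous sketch:
# beta_fc, beta_pc, beta_c = compressed_ridge(X, Y, Q, lam=5.0)
```

In practice λ would be chosen over a grid (as in the figure that follows), which only requires re-solving the small p × p system for each λ.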

[Figure: log₁₀(risk) versus λ (ranging from 1 to 50) for q ∈ {500, 1000, 1500}. Curves: full compression (penalized), partial compression (penalized), convex combination, OLS, full compression (unpenalized), partial compression (unpenalized), ridge regression.]

CONCLUSIONS
COMPRESSED REGRESSION:
- In this case, approximation actually improves on existing approaches.
- We have ways of choosing the tuning parameter using the data.
- Much more elaborate simulations and theory are current work.
GENERAL PHILOSOPHY:
- Approximation is not necessarily a bad thing.
- Don't just minimize the approximation error; that only considers the bias.
- Examine statistical properties to calibrate.
- In some cases, we can get the best of both worlds: easier computations and better statistical properties.