COMPRESSED AND PENALIZED LINEAR REGRESSION Daniel J. McDonald Indiana University, Bloomington mypage.iu.edu/~dajmcdon 2 June 2017 1

OBLIGATORY DATA IS BIG SLIDE Modern statistical applications (genomics, neural image analysis, text analysis, weather prediction) have large numbers of covariates p. They also frequently have lots of observations n. We need algorithms which can handle these kinds of data sets, with good statistical properties 2

LESSON OF THE TALK Many statistical methods use (perhaps implicitly) a singular value decomposition (SVD) to solve an optimization problem. The SVD is computationally expensive. We want to understand the statistical properties of some approximations which speed up computation and save storage. Spoiler: sometimes approximations actually improve the statistical properties 3

CORE TECHNIQUES Suppose we have a matrix $X \in \mathbb{R}^{n \times p}$ and a vector $Y \in \mathbb{R}^n$. LEAST SQUARES: $\hat\beta = \operatorname{argmin}_\beta \|X\beta - Y\|_2^2$. $\ell_2$ REGULARIZATION: $\hat\beta(\lambda) = \operatorname{argmin}_\beta \|X\beta - Y\|_2^2 + \lambda\|\beta\|_2^2$. 4
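To make the two problems concrete, here is a minimal NumPy sketch (my own illustration, not the speaker's code) that computes both estimators from a single SVD of X; the toy X, Y, and λ grid are placeholders.

```python
import numpy as np

def ols_and_ridge(X, Y, lams):
    """Least squares and ridge solutions from one SVD of X (assumes full column rank for OLS)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(d) Vt
    UtY = U.T @ Y
    beta_ols = Vt.T @ (UtY / d)                           # argmin ||X b - Y||_2^2
    beta_ridge = {lam: Vt.T @ (d * UtY / (d ** 2 + lam))  # argmin ||X b - Y||_2^2 + lam ||b||_2^2
                  for lam in lams}
    return beta_ols, beta_ridge

# toy usage with placeholder data
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 5)), rng.normal(size=100)
beta_ols, betas = ols_and_ridge(X, Y, lams=[0.1, 1.0, 10.0])
```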

CORE TECHNIQUES If X fits into RAM, there exist excellent algorithms in LAPACK that are: double precision; very stable; cubic complexity, $O(np^2 + p^3)$, with small constants; require extensive random access to the matrix. There is a lot of interest in finding and analyzing techniques that extend these approaches to large(r) problems 5

OUT-OF-CORE TECHNIQUES If X is too large to manipulate in RAM, there are other approaches: (stochastic) gradient descent, conjugate gradient, rank-one QR updates, Krylov subspace methods. 6

OUT-OF-CORE TECHNIQUES Many techniques focus on randomized compression. This is sometimes known as sketching or preconditioning. Rokhlin and Tygert (2008), A fast randomized algorithm for overdetermined linear least-squares regression. Drineas, Mahoney, et al. (2011), Faster least squares approximation. Woodruff (2014), Sketching as a Tool for Numerical Linear Algebra. Wang, Lee, Mahdavi, Kolar, and Srebro (2016), Sketching meets random projection in the dual. Ma, Mahoney, and Yu (2015), A statistical perspective on algorithmic leveraging. Pilanci and Wainwright (2015-2016), multiple papers. Others. 7

COMPRESSION BASIC IDEA: Choose some matrix $Q \in \mathbb{R}^{q \times n}$. Use $QX$ (and $QY$) instead in the optimization. Finding $QX$ for arbitrary $Q$ and $X$ takes $O(qnp)$ computations. This can be expensive. To get this approach to work, we need some structure on $Q$ 8

THE Q MATRIX Gaussian: well-behaved distribution and eas(ier) theory; dense matrix. Fast Johnson-Lindenstrauss methods. Randomized Hadamard (or Fourier) transformation: allows for $O(np\log(p))$ computations. $Q = \tau\pi$ for $\pi$ a permutation of $I$ and $\tau = [I_q\ 0]$: $QX$ means read $q$ (random) rows. Sparse Bernoulli: i.i.d. $Q_{ij} = +1$ with probability $1/(2s)$, $0$ with probability $1 - 1/s$, $-1$ with probability $1/(2s)$. This means $QX$ takes $O(qnp/s)$ computations on average. 9
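As an illustration of the sparse Bernoulli option, the following sketch (my own, not the talk's exact construction) draws such a Q and forms the compressed data; Q is stored densely here only for clarity, and a sparse format would be needed to realize the stated computational savings.

```python
import numpy as np

def sparse_bernoulli_sketch(X, Y, q, s=3, rng=None):
    """Draw Q in R^{q x n} with i.i.d. entries +1 w.p. 1/(2s), -1 w.p. 1/(2s), 0 w.p. 1 - 1/s,
    and return (QX, QY). About a 1/s fraction of the entries of Q is nonzero, so forming QX
    costs roughly q*n*p/s multiplications when Q is stored sparsely."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    Q = rng.choice([1.0, -1.0, 0.0], size=(q, n), p=[1 / (2 * s), 1 / (2 * s), 1 - 1 / s])
    return Q @ X, Q @ Y

# toy usage: compress n = 10000 observations down to q = 500
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(10000, 20)), rng.normal(size=10000)
QX, QY = sparse_bernoulli_sketch(X, Y, q=500)
```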

TYPICAL RESULTS The general philosophy: find an approximation algorithm that is as close as possible to the solution of the original problem. For OLS, typical results would be to produce a $\tilde\beta$ such that $\|X\tilde\beta - Y\|_2^2 \le (1+\epsilon)\|X\hat\beta - Y\|_2^2$, or $\|X(\tilde\beta - \hat\beta)\|_2^2 \le \epsilon\|X\hat\beta\|_2^2$, or $\|\tilde\beta - \hat\beta\|_2^2 \le \epsilon\|\hat\beta\|_2^2$. Here, $\tilde\beta$ should be easier to compute than $\hat\beta = \operatorname{argmin}_\beta \|X\beta - Y\|_2^2$ 10

COLLABORATOR & GRANT SUPPORT Collaborator: Darren Homrighausen Department of Statistics Southern Methodist University Grant Support: NSF, Institute for New Economic Thinking 11

An alternate perspective

RISK Form a loss function $\ell : \Theta \times \Theta \to \mathbb{R}_+$. The quality of an estimator is given by its risk $R(\hat\theta) = \mathbb{E}\big[\ell(\hat\theta, \theta)\big]$. We could use $\ell_2$ estimation risk: $R(\hat\theta) = \mathbb{E}\|\hat\theta - \theta\|_2^2$, or excess $\ell_2$ prediction risk: $R(\hat\theta) = \mathbb{E}\|Y - X\hat\theta\|_2^2 = \mathbb{E}\|Y - X\theta + X\theta - X\hat\theta\|_2^2 = \text{constant} + \mathbb{E}\|X(\theta - \hat\theta)\|_2^2$ 12

RISK DECOMPOSITION For an approximation $\tilde\theta$ of $\hat\theta$, $\mathbb{E}\|\tilde\theta - \theta\|_2^2 = \mathbb{E}\|\tilde\theta - \hat\theta + \hat\theta - \mathbb{E}\hat\theta + \mathbb{E}\hat\theta - \theta\|_2^2 = \text{approximation error} + \text{variance} + \text{bias}^2$. Previous analyses focus only on the approximation error (with expectation over the algorithm, not the data). 13

BIAS-VARIANCE TRADEOFF [Figure: MSE, variance, and squared bias versus model complexity.] Typical result compares to the zero-bias estimator, which is assumed to have small variance. We examine $\mathbb{E}\|\tilde\theta - \theta\|_2^2$ where the expectation is over everything random. Closest similar analysis is Ma, Mahoney, and Yu (JMLR, 2015). 14

COMPRESSED REGRESSION Let $Q \in \mathbb{R}^{q \times n}$. Call $\hat\beta_{FC} = \operatorname{argmin}_\beta \|Q(X\beta - Y)\|_2^2$ the fully compressed least squares estimator. A common way to solve least squares problems that are very large or poorly conditioned. The numerical/theoretical properties generally depend on $Q$, $q$, $X$ 15

FAMILY OF 4 1. Full compression: $\hat\beta_{FC} = \operatorname{argmin}_\beta \|Q(X\beta - Y)\|_2^2 = \operatorname{argmin}_\beta \|QY\|_2^2 + \|QX\beta\|_2^2 - 2Y^\top Q^\top QX\beta = (X^\top Q^\top QX)^{-1}X^\top Q^\top QY$. 2. Partial compression (also called "Hessian sketching"): $\hat\beta_{PC} = \operatorname{argmin}_\beta \|Y\|_2^2 + \|QX\beta\|_2^2 - 2Y^\top X\beta = (X^\top Q^\top QX)^{-1}X^\top Y$. 16

WE ALSO COMBINE THESE Write $B = [\hat\beta_{FC}\ \hat\beta_{PC}]$ and $W = XB$. 3. Linear combination compression: $\hat\alpha_{\mathrm{lin}} = \operatorname{argmin}_\alpha \|W\alpha - Y\|_2^2$, $\hat\beta_{\mathrm{lin}} = B\hat\alpha_{\mathrm{lin}}$. 4. Convex combination compression: $\hat\alpha_{\mathrm{con}} = \operatorname{argmin}_{\alpha \ge 0,\ \mathbf{1}^\top\alpha = 1} \|W\alpha - Y\|_2^2$, $\hat\beta_{\mathrm{con}} = B\hat\alpha_{\mathrm{con}}$. These are simple to calculate given FC and PC. 17
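A minimal sketch of the family of four (my own illustration), assuming the compressed pair (QX, QY) is already available and including an ℓ2 penalty λ as the talk does later (set λ = 0 for the plain least-squares versions written above); the closed-form convex step relies on B having exactly two columns.

```python
import numpy as np

def family_of_four(X, Y, QX, QY, lam=0.0):
    """Full compression, partial compression, and their linear / convex combinations."""
    p = X.shape[1]
    A = QX.T @ QX + lam * np.eye(p)
    beta_fc = np.linalg.solve(A, QX.T @ QY)   # 1. full compression
    beta_pc = np.linalg.solve(A, X.T @ Y)     # 2. partial compression ("Hessian sketching")
    B = np.column_stack([beta_fc, beta_pc])
    W = X @ B
    alpha_lin, *_ = np.linalg.lstsq(W, Y, rcond=None)
    beta_lin = B @ alpha_lin                  # 3. linear combination
    # 4. convex combination: alpha = (a, 1 - a) with 0 <= a <= 1, solved in closed form
    d = W[:, 0] - W[:, 1]
    a = float(np.clip(d @ (Y - W[:, 1]) / (d @ d), 0.0, 1.0))
    beta_con = B @ np.array([a, 1.0 - a])
    return beta_fc, beta_pc, beta_lin, beta_con
```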

WHY THESE? Turns out that FC is (approximately) unbiased, and therefore worse than OLS (has high variance). On the other hand, PC is biased and empirics demonstrate low variance. Combination should give better statistical properties. We do everything with an $\ell_2$ penalty. 18

Evidence from simulations

SIMULATION SETUP Draw $X_i \sim \mathrm{MVN}(0, (1-\rho)I_p + \rho\mathbf{1}\mathbf{1}^\top)$. We use $\rho \in \{0.2, 0.8\}$. Draw $\beta \sim N(0, \tau^2 I_p)$. Draw $Y_i = X_i^\top\beta + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$. 19
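A minimal sketch of this data-generating process (my own code; the constants in the usage line are those chosen two slides below):

```python
import numpy as np

def simulate(n, p, rho, tau2, sigma, rng=None):
    """X_i ~ MVN(0, (1 - rho) I_p + rho 11'), beta ~ N(0, tau2 I_p), Y = X beta + eps."""
    rng = np.random.default_rng(rng)
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = rng.normal(scale=np.sqrt(tau2), size=p)
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    return X, beta, Y

X, beta, Y = simulate(n=5000, p=50, rho=0.8, tau2=np.pi / 2, sigma=50, rng=0)
```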

BAYES ESTIMATOR For this model, the optimal estimator (in MSE) is $\hat\beta_B = (X^\top X + \lambda^* I_p)^{-1}X^\top Y$, in particular with $\lambda^* = \frac{\sigma^2}{n\tau^2}$. This is the posterior mode of the Bayes estimator under a conjugate normal prior. It is also the ridge regression estimator for a particular $\lambda$ 20

GOLDILOCKS With $p < n$: (1) If $\lambda^*$ is too big, we will tend to shrink all coefficients to 0. This problem is too hard. (2) If $\lambda^*$ is too small, OLS will be very close to the optimal estimator. This problem is too easy. (3) Need $\tau^2$, $\sigma^2$ just right. Take $\tau^2 = \pi/2$. This implies $\mathbb{E}|\beta_i| = 1$ (convenient). Take $n = 5000$. Big but computable. Take $\sigma = 50 \Rightarrow \log(\lambda^*) \approx -1.14$ (reasonable). Take $p \in \{50, 100, 250, 500\}$ 21
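A quick check of the two quoted numbers, using the half-normal mean for the first and plugging the chosen constants into $\lambda^* = \sigma^2/(n\tau^2)$ for the second: $\mathbb{E}|\beta_i| = \tau\sqrt{2/\pi} = \sqrt{\pi/2}\cdot\sqrt{2/\pi} = 1$, and $\log(\lambda^*) = \log\!\big(50^2/(5000\cdot\pi/2)\big) = \log(1/\pi) \approx -1.14$.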

[Figure: $\log_{10}$(estimation risk) versus $\log(\lambda)$ for $\rho = 0.8$, $p = 50$, $q/n = 0.2$; methods: convex, linear, FC, PC, OLS, Bayes.] 22

[Figure: $\log_{10}$(estimation risk) versus $\log(\lambda)$ for $\rho = 0.2$, $p = 500$, $q/n = 0.2$; methods: convex, linear, FC, PC, OLS, Bayes.] 23

[Figure: fraction of replications on which each method is best, by $p \in \{50, 100, 250, 500\}$, for $q \in \{500, 1000, 1500\}$ and $\rho \in \{0.2, 0.8\}$; methods: convex, linear, FC, PC, OLS, ridge.] 24

SECOND VERSE... In that case, ridge was optimal. We did it again with $\beta_i = 1$. 25

[Figure: $\log_{10}$(estimation risk) versus $\log(\lambda)$ for $\rho = 0.8$, $p = 50$, $q/n = 0.2$; methods: convex, linear, FC, PC, OLS, ridge.] 26

[Figure: $\log_{10}$(estimation risk) versus $\log(\lambda)$ for $\rho = 0.2$, $p = 500$, $q/n = 0.2$; methods: convex, linear, FC, PC, OLS, ridge.] 27

[Figure: fraction of replications on which each method is best, by $p \in \{50, 100, 250, 500\}$, for $q \in \{500, 1000, 1500\}$ and $\rho \in \{0.2, 0.8\}$; methods: convex, linear, FC, PC, OLS, ridge.] 28

SELECTING TUNING PARAMETERS We use GCV with the degrees of freedom: $\mathrm{GCV}(\lambda) = \frac{\|X\hat\beta(\lambda) - Y\|_2^2}{(1 - \mathrm{df}/n)^2}$. df is easy for full or partial compression. For the other cases, we do a dumb approximation: weighted average using $\hat\alpha$. Easy to calculate for a range of $\lambda$ without extra computations. Also calculate the divergence to derive Stein's Unbiased Risk Estimate. After tedious algebra, it had odd behavior in practice (can be huge, or negative!?!) 29
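For plain ridge, the GCV criterion above can be evaluated over a grid of λ from one SVD; here is a minimal sketch (my own, using the standard ridge degrees of freedom $\mathrm{df}(\lambda) = \sum_i d_i^2/(d_i^2 + \lambda)$, which is an assumption about the df used rather than the talk's compressed-case approximation).

```python
import numpy as np

def ridge_gcv(X, Y, lams):
    """GCV(lam) = ||X beta(lam) - Y||_2^2 / (1 - df/n)^2 over a grid of lam, via one SVD."""
    n = X.shape[0]
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    UtY = U.T @ Y
    out = {}
    for lam in lams:
        shrink = d ** 2 / (d ** 2 + lam)                  # eigenvalues of the ridge hat matrix
        rss = np.sum(((shrink - 1.0) * UtY) ** 2) + (Y @ Y - UtY @ UtY)
        df = shrink.sum()                                 # degrees of freedom
        out[lam] = rss / (1.0 - df / n) ** 2
    return out
```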

[Figure: left, $\lambda_{\mathrm{gcv}}$ versus $\lambda_{\mathrm{test}}$; right, 100 log(test error at the GCV choice relative to the minimum test error), for $q \in \{500, 1000, 1500\}$; compression methods: convex, linear, FC, PC. $n = 5000$, $p = 50$, $\rho = 0.8$, $\beta \sim N(0, \pi/2)$.] 30

Theoretical results (sketch)

STANDARD RIDGE RESULTS Theorem (writing $X = UDV^\top$ for the SVD of $X$, with singular values $d_i$): $\mathrm{bias}^2\big(\hat\beta_{\mathrm{ridge}}(\lambda)\mid X\big) = \lambda^2\,\beta^\top V(D^2 + \lambda I_p)^{-2}V^\top\beta$ and $\mathrm{tr}\,\mathbb{V}[\hat\beta_{\mathrm{ridge}}(\lambda)\mid X] = \sigma^2\sum_{i=1}^p \frac{d_i^2}{(d_i^2 + \lambda)^2}$. 31

WHAT'S THE TRICK? Similar results are hard for compressed regression. All the estimators depend (at least) on $(X^\top Q^\top QX + \lambda I_p)^{-1}$. We derived properties of $Q^\top Q$: $\mathbb{E}\big[\tfrac{s}{q}Q^\top Q\big] = I_n$ and $\mathbb{V}\big[\mathrm{vec}\big(\tfrac{s}{q}Q^\top Q\big)\big] = \frac{s-3}{q}\,\mathrm{diag}(\mathrm{vec}(I_n)) + \frac{1}{q}I_{n^2} + \frac{1}{q}K_{nn}$. So the technique is to do a Taylor expansion around $\tfrac{s}{q}Q^\top Q = I_n$. 32
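A small Monte Carlo sanity check (illustration only, not from the talk) that the sparse Bernoulli sketch from earlier satisfies $\mathbb{E}[\tfrac{s}{q}Q^\top Q] = I_n$:

```python
import numpy as np

def mean_of_scaled_gram(n=6, q=200, s=3, reps=2000, seed=0):
    """Average (s/q) Q'Q over many draws of the sparse Bernoulli Q; should be close to I_n."""
    rng = np.random.default_rng(seed)
    acc = np.zeros((n, n))
    for _ in range(reps):
        Q = rng.choice([1.0, -1.0, 0.0], size=(q, n), p=[1 / (2 * s), 1 / (2 * s), 1 - 1 / s])
        acc += (s / q) * (Q.T @ Q)
    return acc / reps

print(np.round(mean_of_scaled_gram(), 2))   # approximately the 6 x 6 identity matrix
```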

MAIN RESULT Theorem: $\mathrm{bias}^2[\hat\beta_{FC} \mid X] = \lambda^2\,\beta^\top V(D^2 + \lambda I_p)^{-2}V^\top\beta + o_p(1)$ and $\mathrm{tr}(\mathbb{V}[\hat\beta_{FC} \mid X]) = \sigma^2\sum_{i=1}^p \frac{d_i^2}{(d_i^2 + \lambda)^2} + \frac{s-3}{q}\,\mathrm{tr}\!\Big(\mathrm{diag}(\mathrm{vec}(I_n))\big[M^\top M \otimes (I - H)X\beta\beta^\top X^\top(I - H)\big]\Big) + \frac{\beta^\top X^\top (I - H)^2 X\beta}{q}\,\mathrm{tr}(MM^\top) + \frac{1}{q}\,\mathrm{tr}\!\big((I - H)X\beta\beta^\top X^\top(I - H)M^\top M\big) + o_p(1)$. Note: $M = (X^\top X + \lambda I_p)^{-1}X^\top$ and $H = XM$ (the hat matrix for ridge regression) 33

Applications

RNA-SEQ Short-read RNA sequence data. Using a (Poisson) linear model to predict read counts based on the surrounding nucleotides. 8 different tissues. Data is publicly available, preprocessing already done 34

RESULTS Test error averaged over 10 replications. On each replication, tuning parameters were chosen with GCV. At the best tuning parameter, the test error was computed and then these were averaged. For each dataset, we randomly chose 75% of the data as training on each replication. 35

[Figure: 100 log(method error relative to OLS error) across the eight tissue data sets (b1-b3, g1, g2, w1-w3), $q = 10000$; methods: convex, linear, FC, PC, ridge.] 36

TAKE-AWAY MESSAGE q = 10000 results in data reductions between 74% and 93%. q = 20000 gives reductions between 48% and 84%. Ridge and ordinary least squares give equivalent test set performance (differing by less than .001%). This is an easy problem. Full compression is the worst. 37

CURRENT APPLICATION This image and the next are from the Sloan Digital Sky Survey 38

SDSS Repeatedly photograph regions of the sky. Collect information on piles of objects. Around 200 million galaxy spectra at the moment. Each spectrum contains around 5000 wavelengths. Want to predict the redshift from the spectra 39

A FEW GALAXIES [Figure: flux versus wavelength $\lambda$ (Å) for a few galaxy spectra.] 40

SAME GALAXIES AFTER SMOOTHING [Figure: smoothed flux versus wavelength $\lambda$ (Å) for the same galaxies.] 41

5000 REDSHIFTS [Figure: histogram of redshifts for 5000 galaxies, ranging from 0 to 1.] 42

[Figure: 100 log(method error relative to OLS error) for redshift prediction, $q \in \{1000, 2000\}$; methods: convex, linear, FC, PC, ridge.] 43

Conclusions

CONCLUSIONS Lots of people are looking at approximations to standard statistical methods (OLS, PCA, etc.). They characterize the approximation error. We have been looking at whether we can actually benefit from the approximation. Today we looked at compressed regression 44

ONGOING AND FUTURE WORK We want to generalize our results here to other penalties. Also generalized linear models. We're also looking at how compression interacts with PCA. Combine compression and random projection. Connections to sufficient dimension reduction? Iterative versions? 45