COMPRESSED AND PENALIZED LINEAR REGRESSION
Daniel J. McDonald
Indiana University, Bloomington
mypage.iu.edu/~dajmcdon
2 June 2017
OBLIGATORY "DATA IS BIG" SLIDE
- Modern statistical applications (genomics, neural image analysis, text analysis, weather prediction) have large numbers of covariates p.
- They also frequently have lots of observations n.
- We need algorithms which can handle these kinds of data sets, with good statistical properties.
LESSON OF THE TALK
- Many statistical methods use (perhaps implicitly) a singular value decomposition (SVD) to solve an optimization problem.
- The SVD is computationally expensive.
- We want to understand the statistical properties of some approximations which speed up computation and save storage.
- Spoiler: sometimes approximations actually improve the statistical properties.
CORE TECHNIQUES
Suppose we have a matrix X ∈ R^{n×p} and a vector Y ∈ R^n.

LEAST SQUARES:

\hat\beta = \operatorname{argmin}_\beta \|X\beta - Y\|_2^2

\ell_2 REGULARIZATION:

\hat\beta(\lambda) = \operatorname{argmin}_\beta \|X\beta - Y\|_2^2 + \lambda \|\beta\|_2^2
CORE TECHNIQUES
If X fits into RAM, there exist excellent algorithms in LAPACK that are:
- double precision
- very stable
- cubic complexity, O(np² + p³), with small constants
- but require extensive random access to the matrix.
There is a lot of interest in finding and analyzing techniques that extend these approaches to large(r) problems.
OUT-OF-CORE TECHNIQUES
If X is too large to manipulate in RAM, there are other approaches:
- (stochastic) gradient descent
- conjugate gradient
- rank-one QR updates
- Krylov subspace methods
OUT-OF-CORE TECHNIQUES
Many techniques focus on randomized compression. This is sometimes known as sketching or preconditioning.
- Rokhlin and Tygert (2008), A fast randomized algorithm for overdetermined linear least-squares regression.
- Drineas, Mahoney, et al. (2011), Faster least squares approximation.
- Woodruff (2014), Sketching as a Tool for Numerical Linear Algebra.
- Wang, Lee, Mahdavi, Kolar, and Srebro (2016), Sketching meets random projection in the dual.
- Ma, Mahoney, and Yu (2015), A statistical perspective on algorithmic leveraging.
- Pilanci and Wainwright (2015-2016), multiple papers.
- Others.
COMPRESSION
BASIC IDEA: Choose some matrix Q ∈ R^{q×n}. Use QX and QY in the optimization instead.
Finding QX for arbitrary Q and X takes O(qnp) computations. This can be expensive.
To get this approach to work, we need some structure on Q.
THE Q MATRIX
- Gaussian: well-behaved distribution and eas(ier) theory, but a dense matrix.
- Fast Johnson-Lindenstrauss methods. Randomized Hadamard (or Fourier) transformation: allows for O(np log(p)) computations.
- Q = πτ for π a permutation of I and τ = [I_q 0]: QX means "read q (random) rows".
- Sparse Bernoulli: i.i.d. entries

Q_{ij} = \begin{cases} +1 & \text{with probability } 1/(2s) \\ 0 & \text{with probability } 1 - 1/s \\ -1 & \text{with probability } 1/(2s) \end{cases}

  This means QX takes O(qnp/s) computations on average.
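To make the sparse Bernoulli construction above concrete, here is a minimal numpy sketch (not from the slides; the function name and parameter choices are our own). It draws Q entrywise and forms QX; since only about a 1/s fraction of Q's entries are nonzero, a sparse representation of Q would compute QX in roughly qnp/s operations on average.

```python
import numpy as np

def sparse_bernoulli_sketch(q, n, s, rng):
    """Draw Q with i.i.d. entries: +1 or -1 each w.p. 1/(2s), 0 w.p. 1 - 1/s."""
    u = rng.random((q, n))
    return np.where(u < 1 / (2 * s), 1.0,
                    np.where(u < 1 / s, -1.0, 0.0))

rng = np.random.default_rng(0)
n, p, q, s = 1000, 20, 200, 10
X = rng.standard_normal((n, p))
Q = sparse_bernoulli_sketch(q, n, s, rng)
QX = Q @ X                 # q x p compressed design matrix
# About a 1/s fraction of entries are nonzero:
nonzero_frac = (Q != 0).mean()
```

Here `s = 10` means roughly 90% of Q is exactly zero, which is where the computational savings come from.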
TYPICAL RESULTS
The general philosophy: find an approximation algorithm whose solution is as close as possible to the solution of the original problem.
For OLS, typical results would be to produce a \tilde\beta such that

\|X\tilde\beta - Y\|_2^2 \le (1 + \epsilon)\|X\hat\beta - Y\|_2^2,
\|X(\tilde\beta - \hat\beta)\|_2^2 \le \epsilon \|X\hat\beta\|_2^2,
\|\tilde\beta - \hat\beta\|_2^2 \le \epsilon \|\hat\beta\|_2^2.

Here, \tilde\beta should be easier to compute than \hat\beta = \operatorname{argmin}_\beta \|X\beta - Y\|_2^2.
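A quick numerical illustration of the first bound (our own sketch, not from the slides), using a Gaussian Q: the sketched least squares solution's residual sum of squares exceeds the OLS optimum by a small realized factor (1 + ε).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 5000, 20, 500
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Exact OLS solution and its sketched approximation
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
Q = rng.standard_normal((q, n)) / np.sqrt(q)     # Gaussian sketch
beta_tilde = np.linalg.lstsq(Q @ X, Q @ Y, rcond=None)[0]

rss_hat = np.sum((X @ beta_hat - Y) ** 2)
rss_tilde = np.sum((X @ beta_tilde - Y) ** 2)
eps = rss_tilde / rss_hat - 1     # realized epsilon in the (1 + eps) bound
```

Because the OLS solution minimizes the residual sum of squares, `eps` is nonnegative; theory suggests it is typically on the order of p/q for a Gaussian sketch.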
COLLABORATOR & GRANT SUPPORT
Collaborator: Darren Homrighausen, Department of Statistics, Southern Methodist University
Grant support: NSF; Institute for New Economic Thinking
An alternate perspective
RISK
Form a loss function \ell : \Theta \times \Theta \to \mathbb{R}^+. The quality of an estimator is given by its risk

R(\hat\theta) = E[\ell(\hat\theta, \theta)]

We could use \ell_2 estimation risk:

R(\hat\theta) = E\|\hat\theta - \theta\|_2^2

or excess \ell_2 prediction risk:

R(\hat\theta) = E\|Y - X\hat\theta\|_2^2 = E\|Y - X\theta + X\theta - X\hat\theta\|_2^2 = \text{constant} + E\|X(\theta - \hat\theta)\|_2^2
RISK DECOMPOSITION
For an approximation \tilde\theta of \hat\theta,

E\|\tilde\theta - \theta\|_2^2 = E\|(\tilde\theta - \hat\theta) + (\hat\theta - E\hat\theta) + (E\hat\theta - \theta)\|_2^2

which decomposes into (approximation error)² + variance + bias².
Previous analyses focus only on the approximation error (with expectation over the algorithm, not the data).
BIAS-VARIANCE TRADEOFF
[Figure: MSE, variance, and squared bias as functions of model complexity.]
The typical result compares to the zero-bias estimator, which is assumed to have small variance.
We examine E\|\tilde\theta - \theta\|_2^2, where the expectation is over everything random.
The closest similar analysis is Ma, Mahoney, and Yu (JMLR, 2015).
COMPRESSED REGRESSION
Let Q ∈ R^{q×n}. Call the fully compressed least squares estimator

\hat\beta_{FC} = \operatorname{argmin}_\beta \|Q(X\beta - Y)\|_2^2

This is a common way to solve least squares problems that are:
- very large, or
- poorly conditioned.
The numerical/theoretical properties generally depend on Q, q, and X.
FAMILY OF 4
1. Full compression:

\hat\beta_{FC} = \operatorname{argmin}_\beta \|Q(X\beta - Y)\|_2^2
              = \operatorname{argmin}_\beta \|QY\|_2^2 + \|QX\beta\|_2^2 - 2Y^\top Q^\top QX\beta
              = (X^\top Q^\top QX)^{-1} X^\top Q^\top QY

2. Partial compression:¹

\hat\beta_{PC} = \operatorname{argmin}_\beta \|Y\|_2^2 + \|QX\beta\|_2^2 - 2Y^\top X\beta
              = (X^\top Q^\top QX)^{-1} X^\top Y

¹ Also called "Hessian sketching".
WE ALSO COMBINE THESE
Write B = [\hat\beta_{FC}\ \hat\beta_{PC}] and W = XB.
3. Linear combination compression:

\hat\alpha_{lin} = \operatorname{argmin}_\alpha \|W\alpha - Y\|_2^2, \qquad \hat\beta_{lin} = B\hat\alpha_{lin}

4. Convex combination compression:

\hat\alpha_{con} = \operatorname{argmin}_{0 \le \alpha,\, \sum_i \alpha_i = 1} \|W\alpha - Y\|_2^2, \qquad \hat\beta_{con} = B\hat\alpha_{con}

These are simple to calculate given FC and PC.
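All four estimators can be sketched in a few lines of numpy. This is our own implementation of the displayed formulas (with an ℓ₂ penalty λ added to the compressed Gram matrix, as the talk does throughout); since W has only two columns, the convex weight has a closed form as a clipped one-dimensional projection.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, lam = 2000, 10, 400, 1.0
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)
Q = rng.standard_normal((q, n)) / np.sqrt(q)   # Gaussian sketch for simplicity

QX, QY = Q @ X, Q @ Y
G = QX.T @ QX + lam * np.eye(p)                # compressed (penalized) Gram matrix
b_fc = np.linalg.solve(G, QX.T @ QY)           # 1. full compression
b_pc = np.linalg.solve(G, X.T @ Y)             # 2. partial compression

B = np.column_stack([b_fc, b_pc])
W = X @ B
# 3. linear combination: unconstrained least squares in alpha
a_lin = np.linalg.lstsq(W, Y, rcond=None)[0]
b_lin = B @ a_lin
# 4. convex combination: weights on the simplex; with two columns the
#    minimizer is the unconstrained 1-d solution clipped to [0, 1]
w1, w2 = W[:, 0], W[:, 1]
d = w1 - w2
t = np.clip(d @ (Y - w2) / (d @ d), 0.0, 1.0)
b_con = B @ np.array([t, 1 - t])
```

By construction, the linear combination fits the training response at least as well as the convex one, which in turn does at least as well as either FC or PC alone.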
WHY THESE?
- It turns out that FC is (approximately) unbiased but has high variance, and is therefore worse than OLS.
- On the other hand, PC is biased, and empirically has low variance.
- A combination should give better statistical properties.
- We do everything with an ℓ₂ penalty.
Evidence from simulations
SIMULATION SETUP
- Draw X_i ~ MVN(0, (1 − ρ)I_p + ρ11^⊤). We use ρ ∈ {0.2, 0.8}.
- Draw β ~ N(0, τ²I_p).
- Draw Y_i = X_i^⊤β + ε_i with ε_i ~ N(0, σ²).
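One draw from this simulation design can be sketched as follows (our own code; the function name and the small n, p used here are illustrative, not the slide's settings):

```python
import numpy as np

def draw_data(n, p, rho, tau2, sigma, rng):
    """One draw from the simulation design: equicorrelated Gaussian X,
    normal beta, and additive Gaussian noise."""
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))  # (1 - rho) I + rho 11'
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = rng.normal(0.0, np.sqrt(tau2), size=p)
    Y = X @ beta + rng.normal(0.0, sigma, size=n)
    return X, beta, Y

rng = np.random.default_rng(0)
X, beta, Y = draw_data(n=500, p=50, rho=0.2, tau2=np.pi / 2, sigma=5.0, rng=rng)
```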
BAYES ESTIMATOR
For this model, the optimal estimator (in MSE) is

\hat\beta_B = (X^\top X + \lambda^* I_p)^{-1} X^\top Y

In particular, with \lambda^* = \frac{\sigma^2}{n\tau^2}.
- This is the posterior mode of the Bayes estimator under a conjugate normal prior.
- It is also the ridge regression estimator for a particular λ.
GOLDILOCKS
With p < n:
1. If λ* is too big, we will tend to shrink all coefficients to 0. This problem is too hard.
2. If λ* is too small, OLS will be very close to the optimal estimator. This problem is too easy.
3. We need τ², σ² just right.
- Take τ² = π/2. This implies E[|β_i|] = 1 (convenient).
- Take n = 5000. Big but computable.
- Take σ = 50, so log(λ*) ≈ −1.14 (reasonable).
- Take p ∈ {50, 100, 250, 500}.
[Figure: log₁₀(estimation risk) versus log(λ) for ρ = 0.8, p = 50, q/n = 0.2. Methods: convex, linear, FC, PC, OLS, Bayes.]
[Figure: log₁₀(estimation risk) versus log(λ) for ρ = 0.2, p = 500, q/n = 0.2. Methods: convex, linear, FC, PC, OLS, Bayes.]
[Figure: percent of replications on which each method is best, by p ∈ {50, 100, 250, 500}, for q ∈ {500, 1000, 1500} and ρ ∈ {0.2, 0.8}. Methods: convex, linear, FC, PC, OLS, ridge.]
SECOND VERSE...
In that case, ridge was optimal. We did it again with β_i ≡ 1.
[Figure: log₁₀(estimation risk) versus log(λ) for ρ = 0.8, p = 50, q/n = 0.2, with β_i ≡ 1. Methods: convex, linear, FC, PC, OLS, ridge.]
[Figure: log₁₀(estimation risk) versus log(λ) for ρ = 0.2, p = 500, q/n = 0.2, with β_i ≡ 1. Methods: convex, linear, FC, PC, OLS, ridge.]
[Figure: percent of replications on which each method is best, by p ∈ {50, 100, 250, 500}, for q ∈ {500, 1000, 1500} and ρ ∈ {0.2, 0.8}, with β_i ≡ 1. Methods: convex, linear, FC, PC, OLS, ridge.]
SELECTING TUNING PARAMETERS
We use GCV with the degrees of freedom:

\mathrm{GCV}(\lambda) = \frac{\|X\hat\beta(\lambda) - Y\|_2^2}{(1 - \mathrm{df}/n)^2}

- df is easy for full or partial compression. For the other cases, we use a crude approximation: a weighted average using \hat\alpha.
- This is easy to calculate for a range of λ without extra computations.
- We also calculated the divergence to derive Stein's Unbiased Risk Estimate. After tedious algebra, it had odd behavior in practice (it can be huge, or negative!?!).
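As a sketch of the GCV criterion for plain ridge regression (our own code, not the slides'; for ridge, df = Σ_i d_i²/(d_i² + λ) from the singular values of X):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 10
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

d = np.linalg.svd(X, compute_uv=False)   # singular values of X

def gcv(lam):
    """GCV for ridge: ||X beta_hat(lam) - Y||^2 / (1 - df/n)^2."""
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    df = np.sum(d**2 / (d**2 + lam))     # effective degrees of freedom
    rss = np.sum((X @ b - Y) ** 2)
    return rss / (1 - df / n) ** 2

lams = np.logspace(-3, 3, 25)
scores = [gcv(l) for l in lams]
best = lams[int(np.argmin(scores))]
```

The singular values (and hence df) are computed once, so scanning a grid of λ costs little beyond the per-λ solve.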
[Figure: comparison of the GCV-selected λ with the test-error-optimal λ (left), and 100·log(test error at the GCV-selected λ / minimum test error) (right), for q ∈ {500, 1000, 1500}. Methods: convex, linear, FC, PC. n = 5000, p = 50, ρ = 0.8, β ~ N(0, π/2).]
Theoretical results (sketch)
STANDARD RIDGE RESULTS
Theorem. Let X = UDV^⊤ be the SVD of X, with singular values d_i. Then

\mathrm{bias}^2\big(\hat\beta_{ridge}(\lambda) \mid X\big) = \lambda^2 \beta^\top V (D^2 + \lambda I_p)^{-2} V^\top \beta

\operatorname{tr} \mathbb{V}\big[\hat\beta_{ridge}(\lambda) \mid X\big] = \sigma^2 \sum_{i=1}^p \frac{d_i^2}{(d_i^2 + \lambda)^2}
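These two identities are exact given X, so they can be checked numerically against the direct matrix expressions for the conditional bias and variance of ridge (a verification sketch of our own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 200, 8, 3.0, 1.5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

# Direct conditional bias and variance of the ridge estimator
A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X)   # shrinkage matrix
bias2_direct = np.sum(((A - np.eye(p)) @ beta) ** 2)
M = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
var_direct = sigma**2 * np.trace(M @ M.T)

# SVD expressions from the theorem: X = U D V'
U, d, Vt = np.linalg.svd(X, full_matrices=False)
bias2_svd = lam**2 * np.sum((Vt @ beta) ** 2 / (d**2 + lam) ** 2)
var_svd = sigma**2 * np.sum(d**2 / (d**2 + lam) ** 2)
```

Both pairs agree to machine precision, since the SVD forms are just diagonalizations of the direct expressions.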
WHAT'S THE TRICK?
Similar results are hard for compressed regression. All of the estimators depend (at least) on

(X^\top Q^\top Q X + \lambda I_p)^{-1}

We derived properties of Q^⊤Q:

E\left[\frac{s}{q} Q^\top Q\right] = I_n

\mathbb{V}\left[\operatorname{vec}\left(\frac{s}{q} Q^\top Q\right)\right] = \frac{s-3}{q} \operatorname{diag}(\operatorname{vec}(I_n)) + \frac{1}{q} I_{n^2} + \frac{1}{q} K_{nn}

(K_{nn} is the commutation matrix.) So the technique is to do a Taylor expansion around \frac{s}{q} Q^\top Q = I_n.
MAIN RESULT
Theorem.

\mathrm{bias}^2[\hat\beta_{FC} \mid X] = \lambda^2 \beta^\top V (D^2 + \lambda I_p)^{-2} V^\top \beta + o_p(1)

\operatorname{tr}(\mathbb{V}[\hat\beta_{FC} \mid X]) = \sigma^2 \sum_{i=1}^p \frac{d_i^2}{(d_i^2 + \lambda)^2}
+ \frac{s-3}{q} \operatorname{tr}\!\left( \operatorname{diag}(\operatorname{vec}(I_n))\, \big[ M^\top M \otimes (I - H) M^\top \beta \beta^\top M (I - H) \big] \right)
+ \frac{\beta^\top M (I - H)^2 M^\top \beta}{q} \operatorname{tr}(M M^\top)
+ \frac{1}{q} \operatorname{tr}\!\left( (I - H) M^\top \beta \beta^\top M (I - H) M^\top M \right)
+ o_p(1)

Note: M = (X^\top X + \lambda I_p)^{-1} X^\top and H = XM (the hat matrix for ridge regression).
Applications
RNA-SEQ
- Short-read RNA sequence data.
- We use a (Poisson) linear model to predict read counts based on the surrounding nucleotides.
- 8 different tissues.
- The data is publicly available, with preprocessing already done.
RESULTS
- Test error averaged over 10 replications.
- On each replication, tuning parameters were chosen with GCV. At the best tuning parameter, the test error was computed, and then these were averaged.
- For each dataset, we randomly chose 75% of the data as training on each replication.
[Figure: 100·log(method error / OLS error) for the 8 tissues (b1, b2, b3, g1, g2, w1, w2, w3) at q = 10000. Methods: convex, linear, FC, PC, ridge.]
TAKE-AWAY MESSAGE
- q = 10000 results in data reductions between 74% and 93%.
- q = 20000 gives reductions between 48% and 84%.
- Ridge and ordinary least squares give equivalent test set performance (differing by less than 0.001%). This is an easy problem.
- Full compression is the worst.
CURRENT APPLICATION
This image and the next are from the Sloan Digital Sky Survey.
SDSS
- Repeatedly photograph regions of the sky.
- Collect information on piles of objects.
- Around 200 million galaxy spectra at the moment.
- Each spectrum contains around 5000 wavelengths.
- We want to predict the redshift from the spectrum.
A FEW GALAXIES
[Figure: flux versus wavelength λ (Å) for a few galaxy spectra.]
SAME GALAXIES AFTER SMOOTHING
[Figure: flux versus wavelength λ (Å) for the same galaxies after smoothing.]
5000 REDSHIFTS
[Figure: histogram of 5000 redshifts, ranging from 0 to 1.]
[Figure: 100·log(method error / OLS error) for q ∈ {1000, 2000}. Methods: convex, linear, FC, PC, ridge.]
Conclusions
CONCLUSIONS
- Lots of people are looking at approximations to standard statistical methods (OLS, PCA, etc.).
- They characterize the approximation error.
- We have been looking at whether we can actually benefit from the approximation.
- Today we looked at compressed regression.
ONGOING AND FUTURE WORK
- We want to generalize our results here to other penalties, and also to generalized linear models.
- We're also looking at how compression interacts with PCA.
- Combine compression and random projection.
- Connections to sufficient dimension reduction?
- Iterative versions?