Sketched Ridge Regression:

Size: px

Start display at page:

Download "Sketched Ridge Regression:"

Ross Gibbs
5 years ago
Views:

1 Sketched Ridge Regression: Optimization and Statistical Perspectives Shusen Wang UC Berkeley Alex Gittens RPI Michael Mahoney UC Berkeley

2 Overview

3 Ridge Regression min w f w = 1 n Xw y + γ w Over-determined: n d n d

4 Ridge Regression min w f w = 1 n Xw y + γ w Efficient and approximate solution? Use only part of the data? n d

5 Ridge Regression min w f w = 1 n Xw y + γ w Matrix Sketching: Random selection Random projection

6 Approximate Ridge Regression min w f w = 1 n Xw y + γ w Sketched solution: w O

7 Approximate Ridge Regression min w f w = 1 n Xw y + γ w Sketched solution: w O Sketch size OR S T sketch size

8 Approximate Ridge Regression min w f w = 1 n Xw y + γ w Sketched solution: w O Sketch size OR S T sketch size f w O 1 + ε min w f w Optimization Perspective

9 Approximate Ridge Regression min w f w = 1 n Xw y + γ w Statistical Perspective Bias Variance

10 Related Work Least squares regression: min Xw y w Reference Drineas, Mahoney, and Muthukrishnan: Sampling algorithms for l2 regression and applications. In SODA, Drineas, Mahoney, Muthukrishnan, and Sarlos: Faster least squares approximation. Numerische Mathematik, Clarkson and Woodruff: Low rank approximation and regression in input sparsity time. In STOC, Ma, Mahoney, and Yu: A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, Pilanci and Wainwright: Iterative Hessian sketch: fast and accurate solution approximation for constrained least squares. Journal of Machine Learning Research, Raskutti and Mahoney: A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, Etc

11 Sketched Ridge Regression

12 Matrix Sketching X S [ X Turn big matrix into smaller one. X R^ S S [ X R _ S. S R^ _ is called sketching matrix, e.g., Uniform sampling Leverage score sampling Gaussian projection Subsampled randomized Hadamard transform (SRHT) Count sketch (sparse embedding) Etc.

13 Matrix Sketching Some matrix sketching methods are efficient. Time cost is o(nds) lower than multiplication. Examples: Leverage score sampling: O(nd log n) time SRHT: O(nd log s) time X S [ X

14 Ridge Regression Objective function: Optimal solution: f w = 1 n Xw y + γ w w = argmin f w w = X [ X + nγi e S X [ y Time cost: O nd or O ndt

15 Sketched Ridge Regression Goal: efficiently and approximately solve argmin w f w = 1 n Xw y + γ w.

16 Sketched Ridge Regression Goal: efficiently and approximately solve argmin w f w = 1 n Xw y + γ w. Approach: reduce the size of X and y by matrix sketching.

17 Sketched Ridge Regression Sketched solution: w O = argmin w 1 n S[ Xw S [ y + γ w = X [ SS [ X + nγi S e X [ SS [ y

18 Sketched Ridge Regression Sketched solution: Time: O sd w O = argmin w 1 n S[ Xw S [ y + γ w = X [ SS [ X + nγi S e X [ SS [ y + T _ T _ is the cost of sketching S [ X E.g. T _ = O(nd log s) for SRHT. E.g. T _ = O nd log n for leverage score sampling.

19 Theory: Optimization Perspective

20 Optimization Perspective Recall the objective function f w = i^ Xw y + γ w. Bound f w O f w. i^ XwO Xw f w O f w.

21 Optimization Perspective For the sketching methods SRHT or leverage sampling with s = OR js uniform sampling with s = O k js lmn S T f w O f w εf w holds w.p T,, X R^ S : the design matrix γ: the regularization parameter β = X p p (0, 1] ^qr X p μ 1, ^ S p : the row coherence of X

22 Optimization Perspective For the sketching methods SRHT or leverage sampling with s = OR js uniform sampling with s = O k js lmn S T f w O f w εf w holds w.p T,, X R^ S : the design matrix γ: the regularization parameter β = X p p (0, 1] ^qr X p μ 1, ^ S p : the row coherence of X i ^ XwO Xw εf w.

23 Theory: Statistical Perspective

24 Statistical Model X R^ S : fixed design matrix w x R S : the true and unknown model y = Xw x + δ: observed response vector δ i,, δ^ are random noise E δ = 0 and E δδ [ = ξ I^

25 Bias-Variance Decomposition Risk: R w = i^ E Xw Xw x E is taken w.r.t. the random noise δ.

26 Bias-Variance Decomposition Risk: R w = i^ E Xw Xw x E is taken w.r.t. the random noise δ. Risk measures prediction error.

27 Bias-Variance Decomposition Risk: R w = i^ E Xw Xw x R w = bias w + var w

28 Bias-Variance Decomposition Risk: R w = i^ E Xw Xw x R w = bias w + var w Optimal Solution Sketched Solution bias w = γ n Σ + nγi S ƒi ΣV [ w x, var w = p ^ I S + nγσ ƒ ƒi, bias w O = γ n ΣU [ SS [ UΣ + nγi S e ΣV [ w x, var w O = p ^ U [ SS [ U + nγσ ƒ e U [ SS [, Here X = UΣV [ is the SVD.

29 Statistical Perspective For the sketching methods SRHT or leverage sampling with s = OR uniform sampling with s = O k S lmn S T p, the followings hold w.p. 0.9: S T p, X R^ S : the design matrix μ 1, ^ : the row coherence of X S 1 ε bias wo bias w 1 + ε, Good! 1 ε n s var wo var w 1 + ε n s. Bad! Because n s.

30 Statistical Perspective For the sketching methods SRHT or leverage sampling with s = OR uniform sampling with s = O k S lmn S T p, the followings hold w.p. 0.9: S T p, X R^ S : the design matrix μ 1, ^ : the row coherence of X S 1 ε 1 ε n s bias wo bias w 1 + ε, var wo var w 1 + ε n s. If y is noisy variance dominatesbias R w _ R(w ).

31 Conclusions Use sketched solution to initialize numerical optimization. Xw O is close to Xw. Optimization Perspective

32 Conclusions Use sketched solution to initialize numerical optimization. Xw O is close to Xw. Optimization Perspective w : output of the t-th iteration of CG algorithm. Xw Š ƒxw Xw ƒxw p p p 2 p X Ž X ƒi X Ž X ri Initialization is important..

33 Conclusions Use sketched solution to initialize numerical optimization. Xw O is close to Xw. Never use sketched solution to replace the optimal solution. Much higher variance è bad generalization. Optimization Perspective Statistical Perspective

34 Model Averaging

35 Model Averaging Independently draw S i,, S. Compute the sketched solutions w O i,, w O. Model averaging: w O = i O i w.

36 Optimization Perspective For sufficiently large s, w ƒ w w ε holds w.h.p. Without model averaging

37 Optimization Perspective For sufficiently large s, w ƒ w w ε holds w.h.p. Without model averaging Using the same matrix sketching and same s, w ƒ w w T + ε holds w.h.p. With model averaging

38 Optimization Perspective For sufficiently large s, w ƒ w w ε holds w.h.p. Without model averaging Using the same matrix sketching and same s, w ƒ w w T + ε holds w.h.p. With model averaging

39 Optimization Perspective For sufficiently large s, w ƒ w w ε holds w.h.p. Without model averaging Using the same matrix sketching and same s, w ƒ w w T + ε holds w.h.p. With model averaging If s d ε is very small error bound T.

40 Statistical Perspective Risk: R w = i^ E Xw Xw x = bias w + var w Model averaging : bias w O = γ n i ΣU [ S S [ UΣ + nγi S e ΣV [ w x i var w O = p ^ Here X = UΣV [ is the SVD. i U [ S S [ U + nγσ ƒ e U [ S S [ i.,

41 Statistical Perspective For sufficiently large s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 + ε. Without model averaging Using the same sketching methods and same s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 g + ε 2

42 Statistical Perspective For sufficiently large s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 + ε. Without model averaging Using the same sketching methods and same s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 g + ε 2 With model averaging

43 Statistical Perspective For sufficiently large s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 + ε. Without model averaging Using the same sketching methods and same s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 g + ε 2 With model averaging

44 Statistical Perspective For sufficiently large s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 + ε. Without model averaging Using the same sketching methods and same s, the followings hold w.h.p.: bias w O bias w 1 + ε and var w O var w n s 1 g + ε 2 With model averaging If ε is small, then var w O i.

45 Applications to Distributed Optimization x i, y i,, x^, y^ are (randomly) split among g machines. Equivalent to uniform sampling with s = ^.

46 Optimization Perspective Application to distributed optimization: If s = ^ d, w O is very close to w (provably). w O is good initialization of distributed optimization algorithms.

47 Statistical Perspective Application to distributed machine learning: If s = ^ d, then R w O is comparable to R w. If low-precision solution suffices, then w O is a good substitute of w. One-shot solution.

48 Thank You! The paper is at arxiv:

Recovering any low-rank matrix, provably

Recovering any low-rank matrix, provably Rachel Ward University of Texas at Austin October, 2014 Joint work with Yudong Chen (U.C. Berkeley), Srinadh Bhojanapalli and Sujay Sanghavi (U.T. Austin) Matrix