Subsampling for Ridge Regression via Regularized Volume Sampling

Slide 1: Subsampling for Ridge Regression via Regularized Volume Sampling. Michał Dereziński and Manfred Warmuth, University of California at Santa Cruz. AISTATS 2018.

Slide 2: Linear regression. [Figure: data points with labels y plotted against inputs x.]

Slide 3: Optimal solution. $w^* = \operatorname{argmin}_w \sum_i (x_i w - y_i)^2$. [Figure: least-squares line fitted through the data.]
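To ground the notation, here is a minimal sketch of the least-squares objective on this slide, using the 1-d closed form. Python/NumPy is assumed and the data are made up for illustration; this is not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)             # inputs x_i
y = 1.5 * x + rng.normal(0, 0.3, size=50)   # noisy labels y_i

# w* = argmin_w sum_i (x_i w - y_i)^2 has this closed form in 1-d
w_star = np.dot(x, y) / np.dot(x, x)
print(w_star)
```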

Slides 4-5: How many labels are needed to get close to the optimum? All $x_i$ are given, but the labels $y_i$ are unknown. Guess how many are needed? [Figure: unlabeled points along the x-axis.]

Slide 6: Answer: one label. [Figure: a single labeled point $(x_i, y_i)$ suffices.]

Slides 7-10: Which one? Picking $x_{\max}$ (the point furthest from 0) is bad; in fact, any deterministic choice is bad. Good: draw one label $y_i$ with probability $P(i) = x_i^2 / \|x\|^2$ and use $\hat{w}_i = y_i / x_i$. Then
$$\mathbb{E}_i \sum_j (x_j \hat{w}_i - x_j w^*)^2 = \sum_j (x_j w^* - y_j)^2,$$
and the estimator is unbiased:
$$\mathbb{E}_i[\hat{w}_i] = \sum_i \underbrace{\frac{x_i^2}{\|x\|^2}}_{P(i)} \underbrace{\frac{y_i}{x_i}}_{\hat{w}_i} = w^*.$$
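Both claims on this slide can be checked numerically on a tiny 1-d example. A small sketch, with Python/NumPy assumed and invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.uniform(-2, 2, size=n)
y = 1.5 * x + rng.normal(0, 0.5, size=n)

w_star = np.dot(x, y) / np.dot(x, x)      # full-data least squares
p = x**2 / np.dot(x, x)                   # P(i) proportional to x_i^2
w_hat = y / x                             # one-label estimators w_i = y_i / x_i

print(np.dot(p, w_hat), w_star)           # unbiasedness: the two agree

lhs = np.dot(p, [np.sum((x * wi - x * w_star) ** 2) for wi in w_hat])
rhs = np.sum((x * w_star - y) ** 2)
print(lhs, rhs)                           # expected excess loss equals full-data loss
```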

Slides 11-13: Generalization: volume sampling. Given $n$ points $x_i \in \mathbb{R}^d$, choose a subset $S$ of $d$ indices, get labels $y_i \in \mathbb{R}$ for $i \in S$, and let $w^*(S)$ be the least-squares solution for the $d$ examples $\{(x_i, y_i) : i \in S\}$. Key prior result [DW17]: for any $(x_1, y_1), \ldots, (x_n, y_n)$,
$$\mathbb{E}_S \sum_j (x_j^\top w^*(S) - x_j^\top w^*)^2 = d \sum_j (x_j^\top w^* - y_j)^2$$
when $S$ is chosen with probability proportional to the squared volume of the parallelepiped spanned by $\{x_i : i \in S\}$. Unbiasedness: $\mathbb{E}_S[w^*(S)] = w^*$. Open problem: replace the factor $d$ with a small $\epsilon$, using a small sample of size $|S| \geq d$. [DW17] Dereziński and Warmuth, NIPS 2017.
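A brute-force check of the size-$d$ volume sampling identities on these slides, enumerating all subsets of a tiny synthetic problem. Python/NumPy is assumed; this is an exhaustive verification sketch, not the sampling algorithm from the paper.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, n = 2, 6
X = rng.normal(size=(d, n))                        # columns x_i in R^d
y = X.T @ rng.normal(size=d) + rng.normal(0, 0.3, size=n)

w_star, *_ = np.linalg.lstsq(X.T, y, rcond=None)   # full-data least squares

# Volume sampling over size-d subsets: P(S) proportional to det(X_S^T X_S)
subsets = [list(S) for S in combinations(range(n), d)]
weights = np.array([np.linalg.det(X[:, S].T @ X[:, S]) for S in subsets])
probs = weights / weights.sum()

# Least-squares solution on each size-d subset solves X_S^T w = y_S
w_S = [np.linalg.solve(X[:, S].T, y[S]) for S in subsets]

# Unbiasedness: E_S[w*(S)] = w*
print(sum(p * w for p, w in zip(probs, w_S)), w_star)

# Expected excess square loss equals d times the full-data square loss
excess = [np.sum((X.T @ (w - w_star)) ** 2) for w in w_S]
print(np.dot(probs, excess), d * np.sum((X.T @ w_star - y) ** 2))
```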

Slide 14: Two types of volume sampling. Let $X \in \mathbb{R}^{d \times n}$ be full rank, $n \geq d$, and let $X_S = [\,x_i\;x_j\,]$ denote the submatrix of the columns selected by $S$ (here $S = \{i, j\}$). Distribution over all $s$-element subsets $S$:
$$P(S) \propto \begin{cases} \det(X_S^\top X_S) & \text{if } s \leq d \quad \text{[DRVW06]} \\ \det(X_S X_S^\top) & \text{if } s \geq d \quad \text{[AB13]} \end{cases}$$
Here $\det(X_S^\top X_S)$ is the squared volume of the parallelepiped $P(x_i, x_j)$ spanned by the selected columns. [DRVW06] Deshpande et al., SODA 2006. [AB13] Avron and Boutsidis, JMAA 2013.
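A quick numerical illustration of which of the two determinants is nonzero in each regime (Python/NumPy assumed; the matrix is random and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 10
X = rng.normal(size=(d, n))

for s in (2, 3, 5):                      # s < d, s = d, s > d
    XS = X[:, :s]                        # a d x s submatrix of columns
    vol2 = np.linalg.det(XS.T @ XS)      # s x s Gram determinant: used when s <= d
    dual = np.linalg.det(XS @ XS.T)      # d x d determinant: used when s >= d
    print(s, round(vol2, 4), round(dual, 4))
```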

Slides 15-16: Prior work: efficient volume sampling algorithms.
Sampling sets of size $s \leq d$:
1. Deshpande and Rademacher (FOCS 2010): $O(snd^\omega \log d)$
2. Guruswami and Sinop (SODA 2012): $O(snd^2)$
Sampling sets of size $s \geq d$:
1. Li et al. (NIPS 2017), approximate MCMC: $\tilde{O}(s^3 n d^2)$
2. DW (NIPS 2017): $O(n^2 d)$
We give an $O(nd^2)$ algorithm for $s \geq d$.

Slides 17-19: Our contributions: subsampling for linear regression.
1. Statistical
   1.1 Dimension-free error bounds (under random noise assumptions): $\text{Error} = O\!\left(\frac{\text{degrees of freedom}}{\text{subset size } s}\right)$
   1.2 Lower bounds: volume sampling achieves near-optimal error bounds; no i.i.d. sampling works as well for small sample sizes $s$
2. Computational
   2.1 Regularized volume sampling
   2.2 Faster sampling algorithm: $O(nd^2)$ for any $s \geq d$

Slides 20-21: Statistical model: linear labels plus bounded noise. Fixed design matrix $X \in \mathbb{R}^{d \times n}$, linear model $w^* \in \mathbb{R}^d$:
$$y = X^\top w^* + \xi, \quad \text{where } \mathbb{E}[\xi] = 0 \text{ and } \mathrm{Var}[\xi] \preceq \sigma^2 I.$$
Goal: minimize the Mean Squared Prediction Error using few labels:
$$\mathrm{MSPE}(\hat{w}) = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n (x_i^\top \hat{w} - x_i^\top w^*)^2\right].$$
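A minimal Monte Carlo sketch of the MSPE under this fixed-design model. Python/NumPy is assumed; the design, noise level, and the full-data ridge estimator used here are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma, lam = 5, 100, 0.5, 1.0
X = rng.normal(size=(d, n))                  # fixed design, columns x_i
w_star = rng.normal(size=d)

def full_data_ridge(X, y):
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

def mspe(estimator, trials=2000):
    """Monte Carlo estimate of E[(1/n) * sum_i (x_i^T w_hat - x_i^T w*)^2]."""
    errs = []
    for _ in range(trials):
        y = X.T @ w_star + rng.normal(0, sigma, size=n)   # fresh noise each trial
        w_hat = estimator(X, y)
        errs.append(np.mean((X.T @ (w_hat - w_star)) ** 2))
    return np.mean(errs)

print(mspe(full_data_ridge))
```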

Slides 22-23: Dimension-free error bound. Ridge estimator:
$$\hat{w}_\lambda(S) = \operatorname{argmin}_w \|X_S^\top w - y_S\|^2 + \lambda \|w\|^2 = (X_S X_S^\top + \lambda I)^{-1} X_S y_S.$$
Main theorem: if $\lambda \leq \sigma^2 / \|w^*\|^2$, then there is a distribution over subsets $S$ of size $s$ such that
$$\mathrm{MSPE}(\hat{w}_\lambda(S)) \leq \frac{\sigma^2 d_\lambda}{s - d_\lambda + 1} = O\!\left(\frac{\sigma^2 d_\lambda}{s}\right),$$
where $d_\lambda = \sum_{i=1}^d \frac{\lambda_i}{\lambda_i + \lambda} < d$, with $\{\lambda_i\}$ the eigenvalues of $XX^\top$; $d_\lambda$ is the degrees of freedom (statistical dimension) of $X$ given $\lambda$. Note: under decaying eigenvalues, $d_\lambda \ll d$.
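A short sketch of the two quantities in the theorem, the subset ridge estimator and $d_\lambda$. Python/NumPy is assumed, and the helper names (ridge_on_subset, degrees_of_freedom) and the decaying-spectrum design are made up for this example.

```python
import numpy as np

def ridge_on_subset(X, y, S, lam):
    """Subset ridge estimator: w_hat = (X_S X_S^T + lam I)^{-1} X_S y_S."""
    XS, yS = X[:, S], y[S]
    return np.linalg.solve(XS @ XS.T + lam * np.eye(X.shape[0]), XS @ yS)

def degrees_of_freedom(X, lam):
    """d_lambda = sum_i lambda_i / (lambda_i + lam), eigenvalues of X X^T."""
    eigs = np.linalg.eigvalsh(X @ X.T)
    return float(np.sum(eigs / (eigs + lam)))

rng = np.random.default_rng(4)
d, n, lam = 5, 50, 1.0
X = rng.normal(size=(d, n)) * np.linspace(2.0, 0.2, d)[:, None]   # decaying spectrum
y = X.T @ rng.normal(size=d) + rng.normal(0, 0.5, size=n)

print(degrees_of_freedom(X, lam), d)               # d_lambda < d
print(ridge_on_subset(X, y, list(range(10)), lam))
```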

Slides 24-25: Regularized volume sampling. Naive idea: replace $P(S) \propto \det(X_S X_S^\top)$ with $P(S) \propto \det(X_S X_S^\top + \lambda I)$. Problem: it is not clear how to sample from this efficiently. Intuition: there is no simple extension of the Cauchy-Binet formula
$$\sum_{S : |S| = s} \det(X_S X_S^\top) = \binom{n - d}{s - d} \det(XX^\top), \qquad \sum_{S : |S| = s} \det(X_S X_S^\top + \lambda I) = \;???$$
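The unregularized identity can be verified by brute force on a tiny instance. A sketch, with Python/NumPy assumed:

```python
import numpy as np
from math import comb
from itertools import combinations

rng = np.random.default_rng(5)
d, n, s = 2, 6, 4
X = rng.normal(size=(d, n))

# Sum det(X_S X_S^T) over all subsets of size s (with s >= d)
total = sum(np.linalg.det(X[:, list(S)] @ X[:, list(S)].T)
            for S in combinations(range(n), s))
print(total, comb(n - d, s - d) * np.linalg.det(X @ X.T))   # the two sides agree
```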

Slides 26-28: Reverse iterative sampling. Start with $S = \{1, \ldots, n\}$; sample an index $i \in S$; move to the set $S_{-i} = S \setminus \{i\}$; repeat until the desired size is reached (the set shrinks through sizes $n, n-1, \ldots, s+1, s$). The transition probability is
$$P(S_{-i} \mid S) \propto \frac{\det(X_{S_{-i}} X_{S_{-i}}^\top + \lambda I)}{\det(X_S X_S^\top + \lambda I)} = 1 - x_i^\top (X_S X_S^\top + \lambda I)^{-1} x_i.$$
Note: the resulting distribution is not the same as $P(S) \propto \det(X_S X_S^\top + \lambda I)$ (unless $\lambda = 0$).
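A minimal sketch of this reverse-iterative procedure (Python/NumPy assumed). For clarity it recomputes the inverse at every step, so it is slower than the algorithms discussed later; it is not the authors' code.

```python
import numpy as np

def regularized_volume_sample(X, s, lam, rng):
    """Shrink S from {1..n} down to size s, removing index i with probability
    proportional to 1 - x_i^T (X_S X_S^T + lam I)^{-1} x_i."""
    d, n = X.shape
    S = list(range(n))
    while len(S) > s:
        XS = X[:, S]
        A_inv = np.linalg.inv(XS @ XS.T + lam * np.eye(d))
        h = np.array([1.0 - X[:, i] @ A_inv @ X[:, i] for i in S])
        drop = rng.choice(len(S), p=h / h.sum())
        S.pop(drop)
    return S

rng = np.random.default_rng(6)
X = rng.normal(size=(3, 20))
print(regularized_volume_sample(X, s=5, lam=0.1, rng=rng))
```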

Slides 29-30: Key expectation bound.
Lemma: for $\lambda$-regularized volume sampling of a set $S$ of size $s$,
$$\mathbb{E}_S\left[(X_S X_S^\top + \lambda I)^{-1}\right] \preceq \frac{n - d_\lambda + 1}{s - d_\lambda + 1}\, (XX^\top + \lambda I)^{-1}$$
(for $\lambda = 0$, this was shown in [DW17]).
Proof of main theorem: bias-variance decomposition for fixed $S$:
$$\mathrm{MSPE}(\hat{w}_\lambda(S) \mid S) = (\text{bias})^2 + \text{variance} \leq \frac{\sigma^2}{n} \operatorname{tr}\!\left(X^\top (X_S X_S^\top + \lambda I)^{-1} X\right).$$
Take the expectation over $S$ and apply the lemma.
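To make the last step explicit, here is a worked expansion of how the lemma yields the stated bound, using $\operatorname{tr}\!\big(X^\top (XX^\top+\lambda I)^{-1} X\big) = \sum_i \frac{\lambda_i}{\lambda_i+\lambda} = d_\lambda$ and $\frac{n-d_\lambda+1}{n} \le 1$ (a reconstruction of the argument, not text from the slides):

```latex
\mathbb{E}_S\,\mathrm{MSPE}(\hat{w}_\lambda(S)\mid S)
  \;\le\; \frac{\sigma^2}{n}\,\operatorname{tr}\!\Big(X^\top\,
          \mathbb{E}_S\big[(X_S X_S^\top+\lambda I)^{-1}\big]\,X\Big)
  \;\le\; \frac{\sigma^2}{n}\cdot\frac{n-d_\lambda+1}{s-d_\lambda+1}\,
          \operatorname{tr}\!\big(X^\top (XX^\top+\lambda I)^{-1} X\big)
  \;=\; \frac{\sigma^2\,(n-d_\lambda+1)\, d_\lambda}{n\,(s-d_\lambda+1)}
  \;\le\; \frac{\sigma^2 d_\lambda}{s-d_\lambda+1}.
```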

Slides 31-34: Volume sampling vs leverage score sampling. Leverage score sampling: examples selected i.i.d. with probability proportional to $x_i^\top (XX^\top)^{-1} x_i$. How many labels are needed for $\mathrm{MSPE}(\hat{w}) \leq \sigma^2$? ($\sigma^2$ = label noise)
1. Volume sampling: $2 d_\lambda$ labels.
2. Any i.i.d. sampling: $\Omega(d_\lambda \ln d_\lambda)$ labels (lower bound).
Our volume sampling is as fast as computing exact leverage scores.
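For reference, a sketch of the i.i.d. baseline being compared against: computing leverage scores and sampling from them. Python/NumPy is assumed and the helper name leverage_scores is made up here.

```python
import numpy as np

def leverage_scores(X):
    """Leverage score of column i: x_i^T (X X^T)^{-1} x_i. They sum to d."""
    M_inv = np.linalg.inv(X @ X.T)
    return np.einsum('ij,jk,ki->i', X.T, M_inv, X)

rng = np.random.default_rng(7)
X = rng.normal(size=(3, 12))
scores = leverage_scores(X)
print(scores.sum())                           # equals d = 3
probs = scores / scores.sum()                 # i.i.d. sampling distribution
print(rng.choice(X.shape[1], size=6, replace=True, p=probs))
```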

Slide 35: Volume sampling vs leverage scores on real data. [Figure: empirical comparison of the two methods on real datasets.]

Slide 36: Simple algorithm for volume sampling. Start with $S = \{1, \ldots, n\}$; sample an index $i \in S$; move to $S_{-i} = S \setminus \{i\}$; repeat until the desired size. Simple algorithm: update the distribution $P(S_{-i} \mid S)$ at every step. Runtime: $\underbrace{(n - s)}_{\text{steps}} \times \underbrace{O(nd)}_{\text{update}} = O(n^2 d)$. Problem: quadratic dependence on $n$.

Slides 37-38: Faster algorithm via rejection sampling. Recall: $P(S_{-i} \mid S) \propto 1 - x_i^\top (X_S X_S^\top + \lambda I)^{-1} x_i$. Idea: rejection sampling from the distribution $P(S_{-i} \mid S)$:
1. Sample $i$ uniformly from the set $S$,
2. Compute $h_i = 1 - x_i^\top (X_S X_S^\top + \lambda I)^{-1} x_i$ (one trial),
3. With probability $1 - h_i$ reject and go back to step 1.
We show: the number of trials per step is constant w.h.p. Runtime: $\underbrace{(n - s)}_{\text{steps}} \times \underbrace{O(1)}_{\text{trials per step}} \times \underbrace{O(d^2)}_{\text{compute } h_i} = O(nd^2)$. Result: linear dependence on $n$.
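A simplified sketch of the rejection-sampling idea (Python/NumPy assumed). This is an illustrative take on the approach, not the authors' FastRegVol implementation; it keeps the inverse up to date with Sherman-Morrison downdates so that each accepted step costs $O(d^2)$.

```python
import numpy as np

def fast_reg_vol(X, s, lam, rng):
    """Reverse iterative regularized volume sampling via rejection sampling.
    Maintains A_inv = (X_S X_S^T + lam I)^{-1} with rank-one downdates."""
    d, n = X.shape
    S = list(range(n))
    A_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))
    while len(S) > s:
        while True:
            j = rng.integers(len(S))              # 1. sample i uniformly from S
            xi = X[:, S[j]]
            h = 1.0 - xi @ A_inv @ xi             # 2. acceptance probability h_i
            if rng.random() < h:                  # 3. reject with probability 1 - h_i
                break
        # Remove the accepted index; downdate the inverse of A - x_i x_i^T
        A_inv = A_inv + np.outer(A_inv @ xi, xi @ A_inv) / h
        S.pop(j)
    return S

rng = np.random.default_rng(8)
X = rng.normal(size=(4, 50))
print(fast_reg_vol(X, s=10, lam=0.5, rng=rng))
```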

Slide 39: Faster volume sampling on real data. RegVol: simple algorithm; FastRegVol: rejection sampling. [Figure: runtime in seconds vs. data size n on the cadata, MSD, cpusmall, and abalone datasets, comparing RegVol and FastRegVol.]

Slides 40-42: Conclusions.
1. Dimension-free error bounds for subsampled regression.
2. Optimal subsampling must be joint.
3. Why pick volume sampling over leverage score sampling?
   3.1 Better error bounds for small sample sizes,
   3.2 Nearly the same computational efficiency.

Slide 43: Thank you.
