Acceleration of Randomized Kaczmarz Method
Deanna Needell [joint work with Y. Eldar], Stanford University. BIRS, Banff, March 2011.
Problem Background: Setup

Let $Ax = b$ be an overdetermined, consistent system of equations.

Goal: From $A$ and $b$ we wish to recover the unknown $x$. Assume $m \gg n$.
Kaczmarz Method

The Kaczmarz method is an iterative method used to solve $Ax = b$. Due to its speed and simplicity, it is used in a variety of applications.
The Kaczmarz method:

1. Start with an initial guess $x_0$.
2. Set $x_{k+1} = x_k + \frac{b[i] - \langle a_i, x_k \rangle}{\|a_i\|_2^2}\, a_i$, where $i = (k \bmod m)$.
3. Repeat step 2.
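The cyclic update above can be sketched in a few lines of NumPy. This is a minimal illustration (function and variable names are mine, not from the slides):

```python
import numpy as np

def kaczmarz(A, b, x0, sweeps=100):
    """Classical Kaczmarz: cycle through the rows of A, orthogonally
    projecting the iterate onto each row's solution hyperplane."""
    m, _ = A.shape
    x = x0.astype(float).copy()
    for k in range(sweeps * m):
        i = k % m                                  # cyclic row selection
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a          # project onto hyperplane i
    return x

# Consistent overdetermined system: b = A @ x_true, m >> n
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
x_hat = kaczmarz(A, A @ x_true, np.zeros(5))
print(np.linalg.norm(x_hat - x_true))              # essentially zero
```

For a consistent system with reasonably conditioned rows, the iterates converge to the solution regardless of the starting point.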
Kaczmarz Method: Geometrically

Denote $H_i = \{w : \langle a_i, w \rangle = b[i]\}$. Each iteration orthogonally projects the current estimate onto the hyperplane $H_i$.
Kaczmarz Method: But what if...

Denote $H_i = \{w : \langle a_i, w \rangle = b[i]\}$. (A sequence of figures illustrated how the cyclic method can converge slowly when consecutive hyperplanes are nearly parallel.)
Randomized Version: Randomized Kaczmarz

1. Start with an initial guess $x_0$.
2. Set $x_{k+1} = x_k + \frac{b[i] - \langle a_i, x_k \rangle}{\|a_i\|_2^2}\, a_i$, where $i$ is chosen at random.
3. Repeat step 2.
Randomized Version: Randomized Kaczmarz (Strohmer-Vershynin)

1. Start with an initial guess $x_0$.
2. Set $x_{k+1} = x_k + \frac{b[p] - \langle a_p, x_k \rangle}{\|a_p\|_2^2}\, a_p$, where $\mathbb{P}(p = i) = \frac{\|a_i\|_2^2}{\|A\|_F^2}$.
3. Repeat step 2.
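The norm-weighted sampling rule above is easy to sketch in NumPy; the following is a minimal illustration with names of my own choosing:

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, iters=2000, seed=1):
    """RK with the Strohmer-Vershynin rule: row i is sampled with
    probability ||a_i||_2^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()      # ||a_i||^2 / ||A||_F^2
    x = x0.astype(float).copy()
    for _ in range(iters):
        i = rng.choice(len(b), p=probs)            # weighted row selection
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 10))
x_true = rng.standard_normal(10)
x_hat = randomized_kaczmarz(A, A @ x_true, np.zeros(10))
print(np.linalg.norm(x_hat - x_true))
```

Note that each iteration touches only one row of $A$, which is what makes the per-iteration cost $O(n)$.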
Randomized Version: Randomized Kaczmarz (RK), Strohmer-Vershynin

Let $R = \|A^{-1}\|^2 \|A\|_F^2$, where $\|A^{-1}\| \overset{\mathrm{def}}{=} \inf\{M : M\|Ax\|_2 \ge \|x\|_2 \text{ for all } x\}$. Then

$\mathbb{E}\,\|x_k - x\|_2^2 \le \left(1 - \tfrac{1}{R}\right)^k \|x_0 - x\|_2^2.$

For well-conditioned $A$ this gives convergence in $O(n)$ iterations, hence $O(n^2)$ total runtime: better than the $O(mn^2)$ runtime of Gaussian elimination, and empirically often faster than the conjugate gradient method.
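The rate constant can be checked numerically for a random Gaussian matrix. This sketch uses the equivalent form $R = \|A\|_F^2 / \sigma_{\min}(A)^2$ for full-rank $A$ (a standard identity, not stated on the slides), with my own variable names:

```python
import numpy as np

# For full-rank A, ||A^{-1}|| = 1/sigma_min(A), so R = ||A||_F^2 / sigma_min^2.
# Note R >= n always, since ||A||_F^2 = sum_i sigma_i^2 >= n * sigma_min^2.
rng = np.random.default_rng(3)
A = rng.standard_normal((500, 20))
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
R = np.linalg.norm(A, "fro") ** 2 / sigma_min**2

# Expected squared error shrinks by a factor (1 - 1/R) per iteration, so the
# number of iterations per decimal digit of accuracy is:
iters_per_digit = np.log(10) / -np.log(1 - 1 / R)
print(R, iters_per_digit)
```

For a tall Gaussian matrix $R$ stays within a modest factor of $n$, which is what makes the $O(n)$-iteration claim plausible.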
Randomized Version: Randomized Kaczmarz (RK) with noise

We now consider the consistent system $Ax = b$ corrupted by noise, forming the possibly inconsistent system $Ax \approx b + z$.
Theorem [N]: Let $Ax = b$ be corrupted with noise: $Ax \approx b + z$. Then

$\mathbb{E}\,\|x_k - x\|_2 \le \left(1 - \tfrac{1}{R}\right)^{k/2} \|x_0 - x\|_2 + \sqrt{R}\,\gamma, \quad \text{where } \gamma = \max_i \frac{|z[i]|}{\|a_i\|_2}.$

This bound is sharp and is attained in simple examples.
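The convergence horizon $\sqrt{R}\,\gamma$ can be observed numerically. This sketch (my own setup, not from the talk) runs RK on a noisy Gaussian system and prints the final error alongside the threshold:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 500, 10
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
z = 0.01 * rng.standard_normal(m)                  # small noise vector
b = A @ x_true + z                                 # noisy right-hand side

row_norms_sq = np.sum(A**2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()
x = np.zeros(n)
for _ in range(3000):                              # RK on the noisy system
    i = rng.choice(m, p=probs)
    x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]

sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
R = row_norms_sq.sum() / sigma_min**2              # ||A||_F^2 / sigma_min^2
gamma = np.max(np.abs(z) / np.sqrt(row_norms_sq))
print(np.linalg.norm(x - x_true), np.sqrt(R) * gamma)
```

After the exponential phase dies out, the error stalls near the noise floor rather than converging to zero.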
Figure: Comparison between the actual error (blue) and the predicted threshold (pink) over many trials: Gaussian 2000 x 100 after 800 iterations, partial Fourier 700 x 101 after 1000 iterations, and Bernoulli 2000 x 100 after 750 iterations. A scatter plot shows exponential convergence over several trials.
Modified RK: Even better convergence? Noiseless case revisited

Recall $x_{k+1} = x_k + \frac{b[i] - \langle a_i, x_k \rangle}{\|a_i\|_2^2}\, a_i$.

Since these projections are orthogonal, the optimal projection is the one that maximizes $\|x_{k+1} - x_k\|_2$. Therefore we would choose $i$ maximizing $\frac{|b[i] - \langle a_i, x_k \rangle|}{\|a_i\|_2}$. This is too costly, so instead we project onto a low-dimensional subspace and use the low-dimensional representations to predict the optimal projection.
Modified RK: JL Dimension Reduction

Johnson-Lindenstrauss Lemma: Let $\delta > 0$ and let $S$ be a finite set of points in $\mathbb{R}^n$. Then for any $d$ satisfying
$d \ge \frac{C \log |S|}{\delta^2},$ (1)
there exists a Lipschitz mapping $\Phi : \mathbb{R}^n \to \mathbb{R}^d$ such that
$(1 - \delta)\|s_i - s_j\|_2^2 \le \|\Phi(s_i) - \Phi(s_j)\|_2^2 \le (1 + \delta)\|s_i - s_j\|_2^2$ (2)
for all $s_i, s_j \in S$.
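A quick numerical check of the lemma with a scaled Gaussian map, one standard choice of $\Phi$ (the $1/\sqrt{d}$ scaling is my assumption, chosen so squared lengths are preserved in expectation):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, num_points = 1000, 400, 20                   # d ~ C log|S| / delta^2
S = rng.standard_normal((num_points, n))           # the point set
Phi = rng.standard_normal((d, n)) / np.sqrt(d)     # scaled Gaussian JL map

# Squared-distance distortion over all pairs: should stay close to 1
ratios = []
for i in range(num_points):
    for j in range(i + 1, num_points):
        diff = S[i] - S[j]
        ratios.append(np.sum((Phi @ diff) ** 2) / np.sum(diff**2))
print(min(ratios), max(ratios))                    # both near 1
```

Shrinking $d$ widens the spread of these ratios, which is exactly the $\delta$ trade-off in (1).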
Moreover: In the proof of the JL Lemma, the map $\Phi$ is chosen as the projection onto a random $d$-dimensional subspace of $\mathbb{R}^n$. Many known distributions yield such a projection, and recently transforms with fast multiplies have also been shown to satisfy the JL Lemma [Ailon-Chazelle, Hinrichs-Vybiral, Ailon-Liberty, Krahmer-Ward, ...].

Perform reduction: Choose such a $d \times n$ projector $\Phi$ and, during preprocessing, set $\alpha_i = \Phi a_i$.
Modified RK: RK via Johnson-Lindenstrauss (RKJL) [N-Eldar]

Select: Select $n$ rows so that each row $a_i$ is chosen with probability $\|a_i\|_2^2 / \|A\|_F^2$. For each, set
$\gamma_i = \frac{|b[i] - \langle \alpha_i, \Phi x_k \rangle|}{\|\alpha_i\|_2},$
and set $j = \arg\max_i \gamma_i$.

Test: For $a_j$ and the first row $a_l$ selected, set
$\gamma_j = \frac{|b[j] - \langle a_j, x_k \rangle|}{\|a_j\|_2}$ and $\gamma_l = \frac{|b[l] - \langle a_l, x_k \rangle|}{\|a_l\|_2}$.
If $\gamma_l > \gamma_j$, set $j = l$.

Project: Set $x_{k+1} = x_k + \frac{b[j] - \langle a_j, x_k \rangle}{\|a_j\|_2^2}\, a_j$.

Update: Set $k = k + 1$ and repeat.
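The four steps above can be sketched as follows. This is a rough illustration under my own naming and design choices (e.g., recomputing $\Phi x_k$ each iteration rather than maintaining it incrementally), not the authors' reference implementation:

```python
import numpy as np

def rkjl(A, b, x0, Phi, iters=600, seed=6):
    """One possible RKJL loop: score n sampled rows with the compressed
    vectors alpha_i = Phi a_i, check the winner against the first sampled
    row using an exact residual, then take a Kaczmarz step."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()
    alpha = A @ Phi.T                              # preprocessing: alpha_i = Phi a_i
    alpha_norms = np.linalg.norm(alpha, axis=1)
    x = x0.astype(float).copy()
    for _ in range(iters):
        Phi_x = Phi @ x                            # O(nd)
        cand = rng.choice(m, size=n, p=probs)      # Select: n random rows
        g = np.abs(b[cand] - alpha[cand] @ Phi_x) / alpha_norms[cand]
        j = cand[np.argmax(g)]
        l = cand[0]                                # Test: compare exact residuals
        gj = abs(b[j] - A[j] @ x) / np.sqrt(row_norms_sq[j])
        gl = abs(b[l] - A[l] @ x) / np.sqrt(row_norms_sq[l])
        if gl > gj:
            j = l
        x += (b[j] - A[j] @ x) / row_norms_sq[j] * A[j]   # Project
    return x

rng = np.random.default_rng(7)
m, n, d = 300, 20, 60
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
Phi = rng.standard_normal((d, n)) / np.sqrt(d)     # my choice of JL map
x_hat = rkjl(A, A @ x_true, np.zeros(n), Phi)
print(np.linalg.norm(x_hat - x_true))
```

The Test step is what makes the method safe: since the first selected row is distributed exactly as in standard RK, each iteration makes at least as much progress as an RK step, even when the compressed scores are misleading.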
Modified RK: Runtime

Select: calculating $\Phi x_k$ costs $O(nd)$ in general; calculating $\gamma_i$ for each of the $n$ selected rows costs $O(nd)$.
Test: calculating $\gamma_j$ and $\gamma_l$ costs $O(n)$.
Project: calculating $x_{k+1}$ costs $O(n)$.

Overall runtime: since each iteration takes $O(nd)$, we have convergence in $O(n^2 d)$.
Modified RK: Choosing the parameter d

Lemma (choice of d): Let $\Phi$ be the $d \times n$ Gaussian matrix with $d = C\delta^{-2} \log n$ as in the RKJL method, and set $\gamma_i = \langle \Phi a_i, \Phi x_k \rangle$, also as in the method. Then $|\gamma_i - \langle a_i, x_k \rangle| \le 2\delta$ for all $i$ and $k$ in the first $O(n)$ iterations of RKJL.

Low risk: This shows worst-case expected convergence in at most $O(n^2 \log n)$ time, and of course in most cases one expects far faster convergence.
Justification: Analytical Justification

Theorem [assuming row normalization]: Fix an estimate $x_k$ and denote by $x_{k+1}$ and $x_{k+1}^*$ the next estimates using RKJL and the standard RK method, respectively. Set $\gamma_j^* = |\langle a_j, x_k - x \rangle|^2$ and reorder these so that $\gamma_1^* \ge \gamma_2^* \ge \dots \ge \gamma_m^*$. Then when $d = C\delta^{-2} \log n$,

$\mathbb{E}\,\|x_{k+1} - x\|_2^2 \le \min\left\{ \mathbb{E}\,\|x_{k+1}^* - x\|_2^2 - \sum_{j=1}^m \left(p_j - \tfrac{1}{m}\right)\gamma_j^* + 2\delta,\ \ \mathbb{E}\,\|x_{k+1}^* - x\|_2^2 \right\},$

where
$p_j = \binom{m-j}{n-1} \Big/ \binom{m}{n}$ for $j \le m - n + 1$, and $p_j = 0$ for $j > m - n + 1$,
are non-negative values satisfying $\sum_{j=1}^m p_j = 1$ and $p_1 \ge p_2 \ge \dots \ge p_m = 0$.
Corollary: Fix an estimate $x_k$ and denote by $x_{k+1}$ and $x_{k+1}^*$ the next estimates using RKJL and the standard method, respectively. Set $\gamma_j^* = |\langle a_j, x_k - x \rangle|^2$ and reorder these so that $\gamma_1^* \ge \gamma_2^* \ge \dots \ge \gamma_m^*$. Then when exact geometry is preserved ($\delta \to 0$),

$\mathbb{E}\,\|x_{k+1} - x\|_2^2 \le \mathbb{E}\,\|x_{k+1}^* - x\|_2^2 - \sum_{j=1}^m \left(p_j - \tfrac{1}{m}\right)\gamma_j^*.$
Justification: Empirical Evidence

Figure: $\ell_2$-error (y-axis) as a function of the iteration count (x-axis) for RK and RKJL. The dashed line is standard Randomized Kaczmarz; the solid line is the modified method without a Johnson-Lindenstrauss projection: instead, the best move out of the $n$ randomly chosen rows is computed exactly. Note that we cannot afford to do this computationally.
Figure: $\ell_2$-error (y-axis) as a function of the iteration count (x-axis) for RK and RKJL with various values of $d$ (d = 1000, 500, 100, ...), with m = and n = 1000.
Thank you

References:
- Y. C. Eldar and D. Needell, "Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma," Numerical Algorithms, to appear.
- D. Needell, "Randomized Kaczmarz solver for noisy linear systems," BIT Numerical Mathematics, 50(2), 2010.
- T. Strohmer and R. Vershynin, "A randomized Kaczmarz algorithm with exponential convergence," Journal of Fourier Analysis and Applications, 15(2), 2009.
More informationNon-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007
Non-Asymptotic Theory of Random Matrices Lecture 4: Dimension Reduction Date: January 16, 2007 Lecturer: Roman Vershynin Scribe: Matthew Herman 1 Introduction Consider the set X = {n points in R N } where
More informationSparse recovery for spherical harmonic expansions
Rachel Ward 1 1 Courant Institute, New York University Workshop Sparsity and Cosmology, Nice May 31, 2011 Cosmic Microwave Background Radiation (CMB) map Temperature is measured as T (θ, ϕ) = k k=0 l=
More informationWeaker assumptions for convergence of extended block Kaczmarz and Jacobi projection algorithms
DOI: 10.1515/auom-2017-0004 An. Şt. Univ. Ovidius Constanţa Vol. 25(1),2017, 49 60 Weaker assumptions for convergence of extended block Kaczmarz and Jacobi projection algorithms Doina Carp, Ioana Pomparău,
More informationDetecting Sparse Structures in Data in Sub-Linear Time: A group testing approach
Detecting Sparse Structures in Data in Sub-Linear Time: A group testing approach Boaz Nadler The Weizmann Institute of Science Israel Joint works with Inbal Horev, Ronen Basri, Meirav Galun and Ery Arias-Castro
More informationsparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationAn algebraic perspective on integer sparse recovery
An algebraic perspective on integer sparse recovery Lenny Fukshansky Claremont McKenna College (joint work with Deanna Needell and Benny Sudakov) Combinatorics Seminar USC October 31, 2018 From Wikipedia:
More informationWeaker hypotheses for the general projection algorithm with corrections
DOI: 10.1515/auom-2015-0043 An. Şt. Univ. Ovidius Constanţa Vol. 23(3),2015, 9 16 Weaker hypotheses for the general projection algorithm with corrections Alexru Bobe, Aurelian Nicola, Constantin Popa Abstract
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationGradient Descent Methods
Lab 18 Gradient Descent Methods Lab Objective: Many optimization methods fall under the umbrella of descent algorithms. The idea is to choose an initial guess, identify a direction from this point along
More informationSome Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform
Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform Nir Ailon May 22, 2007 This writeup includes very basic background material for the talk on the Fast Johnson Lindenstrauss Transform
More information7.3 The Jacobi and Gauss-Siedel Iterative Techniques. Problem: To solve Ax = b for A R n n. Methodology: Iteratively approximate solution x. No GEPP.
7.3 The Jacobi and Gauss-Siedel Iterative Techniques Problem: To solve Ax = b for A R n n. Methodology: Iteratively approximate solution x. No GEPP. 7.3 The Jacobi and Gauss-Siedel Iterative Techniques
More informationUniform Uncertainty Principle and Signal Recovery via Regularized Orthogonal Matching Pursuit
Claremont Colleges Scholarship @ Claremont CMC Faculty Publications and Research CMC Faculty Scholarship 6-5-2008 Uniform Uncertainty Principle and Signal Recovery via Regularized Orthogonal Matching Pursuit
More informationSolving Corrupted Quadratic Equations, Provably
Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin
More informationUsing the Johnson-Lindenstrauss lemma in linear and integer programming
Using the Johnson-Lindenstrauss lemma in linear and integer programming Vu Khac Ky 1, Pierre-Louis Poirion, Leo Liberti LIX, École Polytechnique, F-91128 Palaiseau, France Email:{vu,poirion,liberti}@lix.polytechnique.fr
More information9.1 Linear Programs in canonical form
9.1 Linear Programs in canonical form LP in standard form: max (LP) s.t. where b i R, i = 1,..., m z = j c jx j j a ijx j b i i = 1,..., m x j 0 j = 1,..., n But the Simplex method works only on systems
More informationCombining geometry and combinatorics
Combining geometry and combinatorics A unified approach to sparse signal recovery Anna C. Gilbert University of Michigan joint work with R. Berinde (MIT), P. Indyk (MIT), H. Karloff (AT&T), M. Strauss
More informationCOMPRESSED Sensing (CS) is a method to recover a
1 Sample Complexity of Total Variation Minimization Sajad Daei, Farzan Haddadi, Arash Amini Abstract This work considers the use of Total Variation (TV) minimization in the recovery of a given gradient
More informationThe Johnson-Lindenstrauss Lemma in Linear Programming
The Johnson-Lindenstrauss Lemma in Linear Programming Leo Liberti, Vu Khac Ky, Pierre-Louis Poirion CNRS LIX Ecole Polytechnique, France Aussois COW 2016 The gist Goal: solving very large LPs min{c x Ax
More informationLecture 18: March 15
CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 18: March 15 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They may
More informationOXPORD UNIVERSITY PRESS
Concentration Inequalities A Nonasymptotic Theory of Independence STEPHANE BOUCHERON GABOR LUGOSI PASCAL MASS ART OXPORD UNIVERSITY PRESS CONTENTS 1 Introduction 1 1.1 Sums of Independent Random Variables
More informationIterative solvers for linear equations
Spectral Graph Theory Lecture 17 Iterative solvers for linear equations Daniel A. Spielman October 31, 2012 17.1 About these notes These notes are not necessarily an accurate representation of what happened
More informationChapter 7 Iterative Techniques in Matrix Algebra
Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition
More informationThe Dual Lattice, Integer Linear Systems and Hermite Normal Form
New York University, Fall 2013 Lattices, Convexity & Algorithms Lecture 2 The Dual Lattice, Integer Linear Systems and Hermite Normal Form Lecturers: D. Dadush, O. Regev Scribe: D. Dadush 1 Dual Lattice
More informationLecture 11. Fast Linear Solvers: Iterative Methods. J. Chaudhry. Department of Mathematics and Statistics University of New Mexico
Lecture 11 Fast Linear Solvers: Iterative Methods J. Chaudhry Department of Mathematics and Statistics University of New Mexico J. Chaudhry (UNM) Math/CS 375 1 / 23 Summary: Complexity of Linear Solves
More informationarxiv: v4 [math.sp] 19 Jun 2015
arxiv:4.0333v4 [math.sp] 9 Jun 205 An arithmetic-geometric mean inequality for products of three matrices Arie Israel, Felix Krahmer, and Rachel Ward June 22, 205 Abstract Consider the following noncommutative
More informationSPARSE signal representations have gained popularity in recent
6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying
More informationCSCI 1951-G Optimization Methods in Finance Part 01: Linear Programming
CSCI 1951-G Optimization Methods in Finance Part 01: Linear Programming January 26, 2018 1 / 38 Liability/asset cash-flow matching problem Recall the formulation of the problem: max w c 1 + p 1 e 1 = 150
More informationTHE NEWTON BRACKETING METHOD FOR THE MINIMIZATION OF CONVEX FUNCTIONS SUBJECT TO AFFINE CONSTRAINTS
THE NEWTON BRACKETING METHOD FOR THE MINIMIZATION OF CONVEX FUNCTIONS SUBJECT TO AFFINE CONSTRAINTS ADI BEN-ISRAEL AND YURI LEVIN Abstract. The Newton Bracketing method [9] for the minimization of convex
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationApproximate Message Passing Algorithms
November 4, 2017 Outline AMP (Donoho et al., 2009, 2010a) Motivations Derivations from a message-passing perspective Limitations Extensions Generalized Approximate Message Passing (GAMP) (Rangan, 2011)
More informationAccelerated Dense Random Projections
1 Advisor: Steven Zucker 1 Yale University, Department of Computer Science. Dimensionality reduction (1 ε) xi x j 2 Ψ(xi ) Ψ(x j ) 2 (1 + ε) xi x j 2 ( n 2) distances are ε preserved Target dimension k
More informationPAVED WITH GOOD INTENTIONS: ANALYSIS OF A RANDOMIZED BLOCK KACZMARZ METHOD 1. INTRODUCTION
PAVED WITH GOOD INTENTIONS: ANALYSIS OF A RANDOMIZED BLOCK KACZMARZ METHOD DEANNA NEEDELL AND JOEL A. TROPP ABSTRACT. The block Kaczmarz method is an iterative scheme for solving overdetermined least-squares
More informationSparse Johnson-Lindenstrauss Transforms
Sparse Johnson-Lindenstrauss Transforms Jelani Nelson MIT May 24, 211 joint work with Daniel Kane (Harvard) Metric Johnson-Lindenstrauss lemma Metric JL (MJL) Lemma, 1984 Every set of n points in Euclidean
More informationLecture Notes 5: Multiresolution Analysis
Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and
More informationCompressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles
Or: the equation Ax = b, revisited University of California, Los Angeles Mahler Lecture Series Acquiring signals Many types of real-world signals (e.g. sound, images, video) can be viewed as an n-dimensional
More informationarxiv: v1 [cs.it] 21 Feb 2013
q-ary Compressive Sensing arxiv:30.568v [cs.it] Feb 03 Youssef Mroueh,, Lorenzo Rosasco, CBCL, CSAIL, Massachusetts Institute of Technology LCSL, Istituto Italiano di Tecnologia and IIT@MIT lab, Istituto
More informationRandomized Block Kaczmarz Method with Projection for Solving Least Squares
Claremont Colleges Scholarship @ Claremont CMC Faculty Publications and Research CMC Faculty Scholarship 3-17-014 Randomized Block Kaczmarz Method with Projection for Solving Least Squares Deanna Needell
More informationNumerical Methods - Numerical Linear Algebra
Numerical Methods - Numerical Linear Algebra Y. K. Goh Universiti Tunku Abdul Rahman 2013 Y. K. Goh (UTAR) Numerical Methods - Numerical Linear Algebra I 2013 1 / 62 Outline 1 Motivation 2 Solving Linear
More informationHigh Dimensional Geometry, Curse of Dimensionality, Dimension Reduction
Chapter 11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by Netflix customer,
More informationTutorial: Sparse Signal Recovery
Tutorial: Sparse Signal Recovery Anna C. Gilbert Department of Mathematics University of Michigan (Sparse) Signal recovery problem signal or population length N k important Φ x = y measurements or tests:
More informationRobust Sparse Recovery via Non-Convex Optimization
Robust Sparse Recovery via Non-Convex Optimization Laming Chen and Yuantao Gu Department of Electronic Engineering, Tsinghua University Homepage: http://gu.ee.tsinghua.edu.cn/ Email: gyt@tsinghua.edu.cn
More informationElementary maths for GMT
Elementary maths for GMT Linear Algebra Part 2: Matrices, Elimination and Determinant m n matrices The system of m linear equations in n variables x 1, x 2,, x n a 11 x 1 + a 12 x 2 + + a 1n x n = b 1
More information