On the exponential convergence of the Kaczmarz algorithm

Liang Dai and Thomas B. Schön
Department of Information Technology, Uppsala University, Uppsala, Sweden.
E-mail: {liang.dai, thomas.schon}@it.uu.se

Abstract

The Kaczmarz algorithm (KA) is a popular method for solving a system of linear equations. In this note we derive a new exponential convergence result for the KA. The key that allows us to establish the new result is to rewrite the KA in such a way that its solution path can be interpreted as the output from a particular dynamical system. The asymptotic stability results for the corresponding dynamical system can then be leveraged to prove exponential convergence of the KA. The new bound is also compared to existing bounds.

Index Terms

Kaczmarz algorithm, Stability analysis, Cyclic algorithm.

I. PROBLEM STATEMENT

In this note, we discuss the exponential convergence property of the Kaczmarz algorithm (KA) [1]. Since its introduction, the KA has been applied in many different fields and many new developments have been reported [2]-[13]. The KA is used to find the solution to the following system of consistent linear equations

Ax = b, (1)

This work was supported by the project Probabilistic modelling of dynamical systems (Contract number: 621-2013-5524) funded by the Swedish Research Council.

March 2015
where x ∈ R^n denotes the unknown vector, A ∈ R^{m×n} with m ≥ n and rank(A) = n, and b ∈ R^m. Denote the i-th row of A by a_i^T and the i-th element of b by b_i, and define the hyperplane H_i as

H_i = {x | a_i^T x = b_i}.

Geometrically, the KA finds the solution by projecting (or approximately projecting) onto the hyperplanes cyclically, starting from an initial approximation x_0, which reads as

x_{k+1} = x_k + λ (b_{i(k)} − a_{i(k)}^T x_k)/‖a_{i(k)}‖² · a_{i(k)}, (2)

where i(k) = mod(k, m) + 1. In the update equation (2), λ is the relaxation parameter, which satisfies 0 < λ < 2. We use the Matlab convention mod(·, ·) to denote the modulus after division, and ‖·‖ to denote the spectral norm of a matrix (the Euclidean norm in the case of a vector).

It is well known that the KA is sometimes rather slow to converge. This is especially true when several consecutive row vectors of the matrix A are in some sense "close" to each other. In order to overcome this drawback, the Randomized Kaczmarz Algorithm (RKA) was introduced in [4] for λ = 1. The key idea of the RKA is that, instead of performing the hyperplane projections cyclically in a deterministic order, the projections are performed in a random order. More specifically, at time k, the hyperplane H_p is selected for projection with probability ‖a_p‖²/‖A‖_F², for p = 1, ..., m. Note that ‖·‖_F is used to denote the Frobenius norm of a matrix. Intuitively speaking, the randomization performs a kind of "preconditioning" of the original matrix equation [6], resulting in a faster exponential convergence rate, as established in [4].

The specific and predefined ordering of the projections in the KA makes it challenging to obtain a tight bound on the convergence rate of the method. In [16], the authors establish the convergence rate of the KA by exploiting the Meany inequality [17], which applies to the case λ = 1 in (2). Convergence rates for the KA with λ ∈ (0, 2) were also established in [18] and [20]. In Section III, we will compare these results in more detail.
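For concreteness, the cyclic update (2) and the row-sampling rule of the RKA can be sketched as follows. This is a minimal NumPy sketch; the function names and the small test system are ours, not part of the original algorithms.

```python
import numpy as np

def kaczmarz(A, b, x0, sweeps, lam=1.0):
    """Cyclic KA of eq. (2): sweep the rows in the fixed order i(k) = mod(k, m) + 1."""
    x = np.array(x0, dtype=float)
    m = A.shape[0]
    for k in range(sweeps * m):
        a = A[k % m]
        x = x + lam * (b[k % m] - a @ x) / (a @ a) * a
    return x

def randomized_kaczmarz(A, b, x0, iters, seed=0):
    """RKA of [4] (lam = 1): row p is picked with probability ||a_p||^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    p = np.sum(A**2, axis=1) / np.linalg.norm(A, 'fro')**2
    for _ in range(iters):
        i = rng.choice(A.shape[0], p=p)
        a = A[i]
        x = x + (b[i] - a @ x) / (a @ a) * a
    return x

# a consistent system b = A x_true, so both methods converge to x_true
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 3))
x_true = rng.standard_normal(3)
b = A @ x_true
x_ka = kaczmarz(A, b, np.zeros(3), sweeps=500)
x_rka = randomized_kaczmarz(A, b, np.zeros(3), iters=5000)
print(np.linalg.norm(x_ka - x_true), np.linalg.norm(x_rka - x_true))
```

Both errors shrink geometrically; since each step is an (under-relaxed) orthogonal projection, the error norm is non-increasing for any 0 < λ < 2.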
In this note, we present a different way to characterize the convergence of the KA described in (2). The key underlying our approach is that we interpret the solution path of the KA as the output of a particular dynamical system. By studying the stability properties of this related dynamical system, we then obtain new exponential convergence results for the KA. Related to this, it is interesting to note that so-called Integral Quadratic Constraints (IQCs) have recently
been used to study the convergence rate of first-order algorithms applied to general convex optimization problems [19].

The note is organized as follows. In the subsequent section we make use of the subsequence {x_{jm} − x}_{j=0}^∞ to enable the derivation of the new exponential convergence result. In Section III we discuss its connections to, and differences from, existing results. Conclusions and ideas for future work are provided in Section IV.

II. THE NEW CONVERGENCE RESULT

First, let us introduce the matrix B ∈ R^{m×n}, whose i-th row b_i^T is defined by b_i = a_i/‖a_i‖ for i = 1, 2, ..., m. Furthermore, let P_i ≜ b_i b_i^T for i = 1, 2, ..., m, and let θ_k ≜ x_k − x for k ≥ 0. Using this new notation allows us to rewrite (2) according to

θ_{k+1} = (I − λP_{i(k)})θ_k, (3)

which can be interpreted as a discrete time-varying linear dynamical system. Hence, this relation inspires us to study the KA by employing techniques for analyzing the stability properties of time-varying linear systems, see e.g. [14], [15]. In what follows we will focus on analyzing the convergence rate of the subsequence {‖θ_{jm}‖}_{j=0}^∞. Given the fact that i(k) = mod(k, m) + 1, we have

θ_{(j+1)m} = (I − λP_m) ⋯ (I − λP_2)(I − λP_1) θ_{jm} ≜ M_m θ_{jm}.

The following theorem provides an upper bound on the spectral norm of M_m.

Theorem 1: Let ρ ≜ ‖M_m‖ and 0 < λ < 2. Then it holds that

ρ² ≤ 1 − λ(2 − λ)/((2 + λ²m²)‖B†‖²), (4)

where B† denotes the pseudo-inverse of the matrix B.

Proof: Let v_0 ∈ R^n be a vector satisfying ‖M_m v_0‖ = ρ, ‖v_0‖ = 1, and let v_i = (I − λP_i)v_{i−1} for i = 1, 2, ..., m. It follows that v_m = M_m v_0 and ‖v_m‖ = ρ. Notice that P_i² = P_i, so we have

(I − λP_i)² = I − λ(2 − λ)P_i,
for i = 1, 2, ..., m. Hence it holds that

‖v_i‖² = v_{i−1}^T (I − λP_i)² v_{i−1} = v_{i−1}^T (I − λ(2 − λ)P_i) v_{i−1} = ‖v_{i−1}‖² − λ(2 − λ)‖P_i v_{i−1}‖²,

which in turn implies that

λ(2 − λ) Σ_{i=1}^m ‖P_i v_{i−1}‖² = ‖v_0‖² − ‖v_m‖² = 1 − ρ². (5)

Also, for any i ∈ {1, 2, ..., m}, we have that

‖v_i − v_0‖ = ‖Σ_{k=1}^i (v_k − v_{k−1})‖ = λ ‖Σ_{k=1}^i P_k v_{k−1}‖ ≤ λ Σ_{k=1}^i ‖P_k v_{k−1}‖ ≤ λ √i (Σ_{k=1}^i ‖P_k v_{k−1}‖²)^{1/2} ≤ λ √i (Σ_{k=1}^m ‖P_k v_{k−1}‖²)^{1/2}.

Together with (5), we get

‖v_i − v_0‖² ≤ (λ i/(2 − λ)) (1 − ρ²). (6)

Meanwhile, using (a + b)² ≤ 2a² + 2b² and ‖P_k‖ ≤ 1, we have that

λ² v_0^T B^T B v_0 = λ² Σ_{k=1}^m v_0^T P_k v_0 = λ² Σ_{k=1}^m ‖P_k v_0‖²
= λ² Σ_{k=1}^m ‖P_k [v_{k−1} + (v_0 − v_{k−1})]‖²
≤ 2λ² Σ_{k=1}^m ‖P_k v_{k−1}‖² + 2λ² Σ_{k=1}^m ‖P_k (v_{k−1} − v_0)‖²
≤ 2λ² Σ_{k=1}^m ‖P_k v_{k−1}‖² + 2λ² Σ_{k=1}^m ‖v_{k−1} − v_0‖². (7)
Together with (5) and (6), we have that

λ² v_0^T B^T B v_0 ≤ (2λ/(2 − λ))(1 − ρ²) + 2λ² Σ_{k=1}^m (λ(k − 1)/(2 − λ))(1 − ρ²) = (λ/(2 − λ))(2 + λ² m(m − 1))(1 − ρ²),

hence it follows that

λ(2 − λ) v_0^T B^T B v_0 ≤ (1 − ρ²)(2 + λ² m(m − 1)). (8)

Since v_0^T B^T B v_0 ≥ 1/‖B†‖², we conclude that

ρ² ≤ 1 − λ(2 − λ)/((2 + λ² m(m − 1))‖B†‖²). (9)

Finally, notice that m(m − 1) ≤ m² holds for any natural number m, which concludes the proof.

Remark 1: Notice that in the proof of Theorem 1, (7) is the main approximation step, and a better approximation here would lead to an improvement of the bound.

The following corollary characterizes the convergence of the KA for λ = 1, which will be used in the subsequent section to enable comparison with the results given in [16], [17]. We omit the proof since it is a direct implication of Theorem 1.

Corollary 1: For the KA with λ = 1 in (2), if m ≥ 2, we have that

ρ² ≤ 1 − 1/(2m²‖B†‖²). (10)

Next, we derive an improvement over the bound (4), enabled by partitioning the matrix A into non-overlapping sub-matrices. Let q = ⌈m/n⌉, where ⌈x⌉ denotes the smallest integer which is greater than or equal to x. Define the sets T_i = {(i − 1)n + 1, ..., in} for i = 1, ..., q − 1 and T_q = {(q − 1)n + 1, ..., m}. Further, for i = 1, ..., q, define B_i as the sub-matrix of B with the rows indexed by the set T_i, and N_i ≜ ∏_{j∈T_i} (I − λP_j), with the factors ordered as in the KA updates.

Corollary 2: Based on the previous definitions, and assuming further that all the sub-matrices B_i for i = 1, ..., q are of rank n, we have that

ρ² ≤ ∏_{i=1}^q (1 − λ(2 − λ)/((2 + λ² n(n − 1))‖B_i†‖²)). (11)
Proof: Notice that since M_m = N_q N_{q−1} ⋯ N_2 N_1, we have that

ρ = ‖M_m‖ ≤ ∏_{i=1}^q ‖N_i‖. (12)

For each N_i, the spectral norm can be bounded analogously to what was done in Theorem 1, resulting in

‖N_i‖² ≤ 1 − λ(2 − λ)/((2 + λ² n(n − 1))‖B_i†‖²).

Finally, inserting this inequality into (12) concludes the proof.

III. DISCUSSION AND NUMERICAL ILLUSTRATION

In Sections III-A and III-B we compare our new bound with the bounds provided by the Meany inequality [16], [17] and by the RKA, respectively. In Section III-B we also provide a numerical illustration. Section III-C is devoted to a comparison with the bound provided in [18], and finally Section III-D compares with the result given in [20].

A. Comparison with the bound given by the Meany inequality

In the following, we assume that m = n and λ = 1. Denote the singular values of B by σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_n. The bound in [16], [17] given by the Meany inequality can then be written as

ρ² ≤ 1 − ∏_{i=1}^n σ_i²,

and the bound given in (10) can be written as

ρ² ≤ 1 − σ_n²/(2n²).

This implies that when σ_n²/(2n²) ≥ ∏_{i=1}^n σ_i², i.e. when

∏_{i=1}^{n−1} σ_i² ≤ 1/(2n²), (13)

holds, the bound in (10) is tighter. In the following lemma, we derive a sufficient condition under which the inequality (13) holds.

Lemma 1: Assume n ≥ 3. If σ_{n−1}² ≤ (n − 2)^{n−2}/(2n^n) holds, the inequality in (13) is satisfied.

Proof: Notice that

∏_{i=1}^{n−1} σ_i² = (∏_{i=1}^{n−2} σ_i²) σ_{n−1}² ≤ ((Σ_{i=1}^{n−2} σ_i²)/(n − 2))^{n−2} σ_{n−1}² ≤ (n/(n − 2))^{n−2} σ_{n−1}². (14)
The inequality (14) holds by the arithmetic-geometric mean inequality, together with

Σ_{i=1}^{n−2} σ_i² ≤ Σ_{i=1}^n σ_i² = ‖B‖_F² = n.

Hence, if

(n/(n − 2))^{n−2} σ_{n−1}² ≤ 1/(2n²)

holds, or equivalently if

σ_{n−1}² ≤ (n − 2)^{n−2}/(2n^n)

holds, then (13) holds, which concludes the proof.

Remark 2: Notice that the right hand side of the inequality in Lemma 1 is of the order 1/n² for large n, since (n − 2)^{n−2}/(2n^n) = (1 − 2/n)^{n−2}/(2n²). Another difference is that the bound provided by Theorem 1 depends explicitly on the size of the matrix, while the bound provided by the Meany inequality does not.

B. Comparison with the bound given by the RKA

Let us now compare our new results to the results available for the RKA. Note that in this case, we set λ = 1. If {θ_{jm}}_{j=0}^∞ denotes the sequence generated by the RKA, then it holds that [4]

E‖θ_{jm}‖² ≤ (1 − 1/(‖A‖_F²‖A†‖²))^{jm} ‖θ_0‖², (15)

for j ≥ 1, where E denotes the expectation operator with respect to the random operations up to index jm. To compare (15) and (10), we make the assumption that A is a matrix with each row normalized, i.e. A = B, for simplicity. It follows that ‖B‖_F² = m, and

1/(2m²‖B†‖²) ≤ 1/(‖B‖_F²‖B†‖²). (16)

Furthermore, since ‖B‖_F²‖B†‖² ≥ 1, we have that

(1 − 1/(‖B‖_F²‖B†‖²))^m ≤ 1 − 1/(‖B‖_F²‖B†‖²), (17)

and combining (16) and (17) results in

1 − 1/(2m²‖B†‖²) ≥ (1 − 1/(‖B‖_F²‖B†‖²))^m.
The above inequality implies that the bound given by (10) is more conservative than the one given by the RKA.

Next, a numerical illustration is implemented to compare the bounds given by (10), (11) and (15). The setup is as follows. Let m = 30 and n = 3, generate A = randn(30, 3) and normalize each row to obtain B, generate x = randn(3, 1) and compute y = Bx. In the implementation of the RKA, we run 1000 realizations with the same initial value x_0 to obtain an average performance result, which is reported in Fig. 1. From the left panel in Fig. 1, we can see that the bound (15) characterizing the convergence of the RKA is closer to the real performance of the RKA, while the bounds given by (10) and (11) for the convergence of the KA are further away from the real performance of the KA. The right panel in Fig. 1 shows a zoomed illustration of the bounds given by (10) and (11). We can observe that the bound given by (11) improves upon (10), which is enabled by the partitioning of the rows of the matrix.

Fig. 1. In the left panel, the curves with tags KA and RKA illustrate the real performance of the KA and the RKA. The curves with tags KA BD1, KA BD2 and RKA BD illustrate the bounds given by (10), (11) and (15), respectively. In the right panel, a zoomed illustration of the curves KA BD1 and KA BD2 from the left panel is given.
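The setup above can be sketched as follows (a NumPy sketch; the random seed, and hence the exact numbers, differ from the run reported in Fig. 1). The script evaluates the per-sweep factors predicted by (10), (11) and (15), and also checks the bound of Theorem 1 by forming M_m explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 30, 3, 1.0
B = rng.standard_normal((m, n))
B /= np.linalg.norm(B, axis=1, keepdims=True)   # normalized rows, so ||B||_F^2 = m

# rho^2 = ||M_m||^2 with M_m = (I - lam P_m) ... (I - lam P_1), P_i = b_i b_i^T
M = np.eye(n)
for i in range(m):
    M = (np.eye(n) - lam * np.outer(B[i], B[i])) @ M
rho2 = np.linalg.norm(M, 2)**2

Bdag2 = np.linalg.norm(np.linalg.pinv(B), 2)**2          # ||B^dagger||^2
ka_bd1 = 1 - 1 / (2 * m**2 * Bdag2)                      # bound (10), lam = 1
ka_bd2 = 1.0                                             # bound (11), blocks T_i of n rows
for i in range(m // n):
    Bi_dag2 = np.linalg.norm(np.linalg.pinv(B[i*n:(i+1)*n]), 2)**2
    ka_bd2 *= 1 - 1 / ((2 + n*(n - 1)) * Bi_dag2)
rka_bd = (1 - 1 / (m * Bdag2))**m                        # bound (15) over one sweep of m steps

print(rho2, ka_bd1, ka_bd2, rka_bd)
```

On any full-rank instance, rho2 stays below ka_bd1 (Theorem 1 even gives the slightly tighter 1 − 1/((2 + m²)‖B†‖²) for λ = 1), and rka_bd never exceeds ka_bd1, consistent with the discussion of (16) and (17) above.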
C. Comparison with the bound given in [18]

To compare with the result given in [18], we assume that A = B, and that they are square and invertible matrices. Under these assumptions, the quantity µ involved in Corollary 4.1 of [18] can be approximated by 1/(2m‖B†‖²), given the results in Theorem 2.1 of [18]. Hence, the convergence rate of the KA given by Theorem 3.1 in [18] can be written as

ρ² ≤ 1 − λ(2 − λ)/(m[1 + (m − 1)λ²]‖B†‖²), (18)

where λ ∈ (0, 2). The result of the current work reads as

ρ² ≤ 1 − λ(2 − λ)/((2 + λ²m²)‖B†‖²), (19)

where λ ∈ (0, 2). A closer look at the two bounds (18) and (19) reveals the following:

1) The optimal choice for the right hand side (RHS) of (18) is λ = (√(4m − 3) − 1)/(2(m − 1)), resulting in

ρ² ≤ 1 − 2/(m(√(4m − 3) + 1)‖B†‖²).

When m is large, 1 − ρ² decreases with the speed 1/(m^{1.5}‖B†‖²).

2) When λ = 1/√m (a suboptimal choice made for simplicity), (19) gives that

ρ² ≤ 1 − (2√m − 1)/(m(m + 2)‖B†‖²).

When m is large, 1 − ρ² decreases with the speed 2/(m^{1.5}‖B†‖²), faster than the one in [18]. A comparison of both bounds when the optimal λ is chosen is given in Fig. 2.

3) When λ is chosen to be 1, 1 − ρ² decreases in the order of 1/m² for both bounds.

D. Comparison with the bound given in [20]

We will once again assume that each row of A is normalized, i.e. B = A. In [20] the authors make use of a subspace correction method to study the convergence speed of the KA. They show (see eq. (13) in [20]) that, when the best relaxation parameter λ is chosen, ρ² can be bounded from above according to

ρ² ≤ 1 − 1/(2 log₂(2m)‖B‖²‖B†‖²). (20)

As we discussed in the previous section, when a near-optimal λ (i.e. λ chosen as 1/√m) is used, the upper bound implied by our analysis gives that

ρ² ≤ 1 − (2√m − 1)/(m(m + 2)‖B†‖²). (21)

By assumption we have ‖B‖_F² = m, which implies that

‖B‖² = ‖B^T B‖ ≥ tr(B^T B)/n = ‖B‖_F²/n = m/n. (22)
Fig. 2. The bounds (18) and (19) (using the optimal parameters) are plotted for a given ‖B†‖ = 0.5 and m ranging from 10 to 1000. The result shows that the bound proposed in this work is always lower than the one given in [18] under these experimental settings.

Hence, the bound obtained by [20] will decrease with a speed of n/(2m log₂(2m)‖B†‖²) as m increases, while the present work gives the decreasing speed 2/(m^{1.5}‖B†‖²).

IV. SUMMARY

By studying the stability properties of a time-varying dynamical system that is related to the KA, we have established new results concerning the convergence speed of the algorithm. The new results have also been compared to several related, previously available results. Let us end the discussion by noting that the following two ideas can possibly lead to further improvements of the results. One potential idea is to improve the inequality in (7), since this step introduces much of the approximation made in establishing the main result of the note; another idea is to try to find an optimal partitioning of the rows of the matrix A, such that the right hand side of (11) is minimized.
V. ACKNOWLEDGEMENT

The authors would like to thank the reviewers for useful comments and for pointing out the references [18], [20], and Marcus Björk for helpful suggestions.

REFERENCES

[1] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin International de l'Académie Polonaise des Sciences et des Lettres, 35, 355-357, 1937.
[2] P. P. B. Eggermont, G. T. Herman, A. Lent, Iterative algorithms for large partitioned linear systems, with applications to image reconstruction, Linear Algebra and its Applications, 40, 37-67, 1981.
[3] Y. Censor, Row-action methods for huge and sparse systems and their applications, SIAM Review, 23(4), 444-466, 1981.
[4] T. Strohmer, R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, Journal of Fourier Analysis and Applications, 15(2), 262-278, 2009.
[5] F. Natterer, The Mathematics of Computerized Tomography, Wiley, New York, 1986.
[6] T. Strohmer, R. Vershynin, Comments on the randomized Kaczmarz method, Journal of Fourier Analysis and Applications, 15(4), 437-440, 2009.
[7] D. Needell, Randomized Kaczmarz solver for noisy linear systems, BIT Numerical Mathematics, 50(2), 395-403, 2010.
[8] Y. Eldar, D. Needell, Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma, Numerical Algorithms, 58(2), 163-177, 2011.
[9] X. Chen, A. Powell, Almost sure convergence of the Kaczmarz algorithm with random measurements, Journal of Fourier Analysis and Applications, 18(6), 1195-1214, 2012.
[10] A. Zouzias, N. M. Freris, Randomized extended Kaczmarz for solving least squares, SIAM Journal on Matrix Analysis and Applications, 2013.
[11] G. Thoppe, V. Borkar, D. Manjunath, A stochastic Kaczmarz algorithm for network tomography, Automatica, 2013.
[12] D. Needell, J. A. Tropp, Paved with good intentions: Analysis of a randomized block Kaczmarz method, Linear Algebra and its Applications, 441, 199-221, 2014.
[13] L. Dai, M. Soltanalian, and K.
Pelckmans, On the randomized Kaczmarz algorithm, IEEE Signal Processing Letters, 21(3), 330-333, 2014.
[14] L. Guo, Time-varying stochastic systems: Stability, estimation and control, Jilin Science and Technology Press, Changchun, 1993.
[15] L. Guo, L. Ljung, Exponential stability of general tracking algorithms, IEEE Transactions on Automatic Control, 40(8), 1376-1387, 1995.
[16] A. Galántai, On the rate of convergence of the alternating projection method in finite dimensional spaces, Journal of Mathematical Analysis and Applications, 310(1), 30-44, 2005.
[17] R. K. Meany, A matrix inequality, SIAM Journal on Numerical Analysis, 6(1), 104-107, 1969.
[18] J. Mandel, Convergence of the cyclical relaxation method for linear inequalities, Mathematical Programming, 30(2), 218-228, 1984.
[19] L. Lessard, B. Recht, A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints, arXiv preprint, arXiv:1408.3595, 2014.
[20] P. Oswald, W. Zhou, Convergence estimates for Kaczmarz-type methods, preprint, 2015.