Learning Theory of Randomized Kaczmarz Algorithm


Journal of Machine Learning Research 16 (2015) 3341-3365    Submitted 6/14; Revised 4/15; Published 12/15

Junhong Lin
Ding-Xuan Zhou
Department of Mathematics
City University of Hong Kong
83 Tat Chee Avenue, Kowloon, Hong Kong, China

Editor: Gabor Lugosi

©2015 Junhong Lin and Ding-Xuan Zhou.

Abstract

A relaxed randomized Kaczmarz algorithm is investigated in a least squares regression setting by a learning theory approach. When the sampling values are accurate and the regression function (conditional means) is linear, such an algorithm has been well studied in the community of non-uniform sampling. In this paper, we are mainly interested in the different case of either noisy random measurements or a nonlinear regression function. In this case, we show that relaxation is needed. A necessary and sufficient condition on the sequence of relaxation parameters (step sizes) for the convergence of the algorithm in expectation is presented. Moreover, polynomial rates of convergence, both in expectation and in probability, are provided explicitly. As a result, the almost sure convergence of the algorithm is proved by applying the Borel-Cantelli Lemma.

Keywords: learning theory, relaxed randomized Kaczmarz algorithm, online learning, space of homogeneous linear functions, almost sure convergence

1. Introduction

The Kaczmarz method is an iterative projection algorithm. It was originally proposed for solving overdetermined systems of linear equations, and has been adapted to image reconstruction, signal processing and numerous other applications. Given a matrix $A \in \mathbb{R}^{m \times d}$ and a vector $b \in \mathbb{R}^m$, the classical Kaczmarz algorithm (Kaczmarz, 1937) approximates a solution of the linear system $Ax = b$ by the iterative scheme

$$x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k \rangle}{\|a_i\|^2} \, a_i, \qquad (1)$$

where $i = k \bmod m$, $a_i^T$ is the $i$-th row of the matrix $A$, and $x_1 \in \mathbb{R}^d$ is an initial vector. Here $\langle \cdot, \cdot \rangle$ is the inner product in $\mathbb{R}^d$ and $\|\cdot\|$ the induced norm. The convergence of the Kaczmarz algorithm is well understood (Kaczmarz, 1937), and its convergence rate depends on the order of the rows of $A$. To avoid this dependence, a randomized Kaczmarz algorithm was considered in (Strohmer and Vershynin, 2009) by setting the probability of selecting a row to be proportional to its squared norm. It takes the form

$$x_{k+1} = x_k + \frac{b_{p(i)} - \langle a_{p(i)}, x_k \rangle}{\|a_{p(i)}\|^2} \, a_{p(i)}, \qquad (2)$$

where $p(i)$ takes values in $\{1, \ldots, m\}$ with probability $\|a_{p(i)}\|^2 / \|A\|_F^2$, with $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{d} a_{ij}^2$ being the squared Frobenius norm of $A$.
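As an illustration of the two schemes above, the following NumPy sketch implements the cyclic update (1) and the row-norm-weighted randomized update (2). It is not part of the original paper; the test matrix, the iteration budget and the consistent right-hand side are illustrative choices.

```python
import numpy as np

def kaczmarz(A, b, n_iter=1000, x0=None, rng=None, randomized=True):
    """Cyclic update (1) or randomized update (2) for the system A x = b."""
    m, d = A.shape
    x = np.zeros(d) if x0 is None else np.asarray(x0, dtype=float).copy()
    rng = np.random.default_rng() if rng is None else rng
    row_sq = np.sum(A**2, axis=1)          # squared row norms ||a_i||^2
    probs = row_sq / row_sq.sum()          # P(row i) = ||a_i||^2 / ||A||_F^2
    for k in range(n_iter):
        i = rng.choice(m, p=probs) if randomized else k % m
        # project the current iterate onto the hyperplane <a_i, x> = b_i
        x += (b[i] - A[i] @ x) / row_sq[i] * A[i]
    return x

# illustrative consistent system: fast decay of the error is expected
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
x_star = rng.standard_normal(100)
x_hat = kaczmarz(A, A @ x_star, n_iter=5000, rng=rng)
print(np.linalg.norm(x_hat - x_star))
```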

Exponential convergence of the expected error $\mathbb{E}\|x_{k+1} - x\|^2$ of the randomized Kaczmarz algorithm was proved in (Strohmer and Vershynin, 2009). When noise exists in the sample values, $b = Ax + \xi$ with $\xi$ a noise vector, a bound for the expected error was obtained in (Needell, 2010), and divergence was proved when the variance of $\xi$ is positive. The error bound consists of an exponentially convergent part and a noise-driven term proportional to the noise level $\max_i |\xi_i| / \|a_i\|$.

The randomized Kaczmarz algorithm was generalized in (Chen and Powell, 2012) to a setting with a sequence of independent random measurement vectors $\{\varphi_t \in \mathbb{R}^d\}_t$ as

$$x_{k+1} = x_k + \frac{y_k - \langle \varphi_k, x_k \rangle}{\|\varphi_k\|^2} \, \varphi_k. \qquad (3)$$

When the measurements have no noise, $y_k = \langle \varphi_k, x \rangle$, almost sure convergence was proved and quantitative error bounds were provided in (Chen and Powell, 2012).

When the linear system $Ax = b$ is overdetermined ($m > d$) and has no solution, the Kaczmarz algorithm can be modified by introducing a relaxation parameter $\eta_k > 0$ in front of $\frac{b_i - \langle a_i, x_k \rangle}{\|a_i\|^2} a_i$, and the output sequence $\{x_k\}$ converges to the least squares solution $\arg\min_{x \in \mathbb{R}^d} \|Ax - b\|^2$ when $\lim_{k \to \infty} \eta_k = 0$. See, e.g., (Zouzias and Freris, 2013) and references therein.

Setting $\psi_k = \frac{1}{\|\varphi_k\|} \varphi_k \in S^{d-1}$ and $\tilde{y}_k = \frac{1}{\|\varphi_k\|} y_k$ yields an equivalent form of the scheme (3) as $x_{k+1} = x_k + \{\tilde{y}_k - \langle \psi_k, x_k \rangle\} \psi_k$. This form is similar to those in the literature of online learning for least squares regression, and together with the relaxed Kaczmarz method (Zouzias and Freris, 2013) it motivates us to consider the following relaxed randomized Kaczmarz algorithm.

Definition 1 With normalized measurement vectors $\{\psi_t \in S^{d-1}\}_t$ and sample values $\{\tilde{y}_t \in \mathbb{R}\}_t$, the relaxed randomized Kaczmarz algorithm is defined by

$$x_{t+1} = x_t + \eta_t \{\tilde{y}_t - \langle \psi_t, x_t \rangle\} \psi_t, \qquad t = 1, 2, \ldots, \qquad (4)$$

where $x_1 \in \mathbb{R}^d$ is an initial vector and $\{\eta_t\}$ is a sequence of relaxation parameters or step sizes.

The purpose of this paper is to provide a learning theory analysis for the relaxed randomized Kaczmarz algorithm. We shall assume throughout the paper that $0 < \eta_t \le 2$ for each $t \in \mathbb{N}$ and that the sequence $\{z_t := (\psi_t, \tilde{y}_t)\}_{t \in \mathbb{N}}$ is independently drawn according to a Borel probability measure $\rho$ on $Z := S^{d-1} \times \mathbb{R}$ which satisfies $\mathbb{E}[\tilde{y}^2] < \infty$.

Our first goal is to deal with the noisy setting for the randomized Kaczmarz algorithm. When the sampling process is noisy or nonlinear (to be defined below), we show that $\{x_t\}_t$ converges to some $x \in \mathbb{R}^d$ in expectation if and only if $\lim_{t \to \infty} \eta_t = 0$ and $\sum_{t=1}^{\infty} \eta_t = \infty$. Moreover, the rate of convergence in expectation cannot be too fast. This tells us that the relaxation parameter is necessary for convergence in the noisy setting. When $\{\eta_t\}_t$ takes the form $\eta_t = \eta_1 t^{-\theta}$, we provide convergence rates in expectation and in confidence, and prove the almost sure convergence.
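The following sketch, again not from the paper, implements the relaxed scheme (4) of Definition 1 with the polynomial step sizes $\eta_t = \eta_1 t^{-\theta}$ studied below; the sampling model, the values of $\eta_1$, $\theta$, and the noise level are illustrative assumptions.

```python
import numpy as np

def relaxed_rk(sample, d, T, eta1=0.5, theta=0.75, x1=None):
    """Relaxed randomized Kaczmarz (4): x_{t+1} = x_t + eta_t (y_t - <psi_t, x_t>) psi_t,
    where sample() returns a pair (psi_t, y_t) with ||psi_t|| = 1."""
    x = np.zeros(d) if x1 is None else np.asarray(x1, dtype=float).copy()
    for t in range(1, T + 1):
        psi, y = sample()
        eta = eta1 * t ** (-theta)          # relaxation parameter / step size
        x += eta * (y - psi @ x) * psi
    return x

# illustrative noisy linear model: y = <x*, psi> + Gaussian noise
rng = np.random.default_rng(0)
d = 50
x_star = rng.standard_normal(d)

def sample():
    psi = rng.standard_normal(d)
    psi /= np.linalg.norm(psi)              # normalize onto the sphere S^{d-1}
    return psi, psi @ x_star + 0.02 * rng.standard_normal()

x_hat = relaxed_rk(sample, d, T=20000)
print(np.linalg.norm(x_hat - x_star))
```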

Such results were presented in the case of no noise in (Strohmer and Vershynin, 2009; Chen and Powell, 2012) and are new in the noisy setting.

Our second goal is to give the first almost sure convergence result in online learning for least squares regression when regularization is not needed. Such a result can be found in (Tarrès and Yao, 2014) when regularization is imposed, while the convergence in expectation without regularization was proved in (Ying and Pontil, 2008). We also present the first consistency result for online learning when the approximation error (to be defined below) does not tend to zero.

2. Main Results

To introduce our learning theory approach to the relaxed randomized Kaczmarz algorithm (4), we decompose the probability measure $\rho$ on $Z = S^{d-1} \times \mathbb{R}$ into its marginal distribution $\rho_X$ on $X := S^{d-1}$ and the conditional distributions $\rho(\cdot \mid \psi)$ at $\psi \in X$. The conditional means define the regression function $f_\rho : X \to \mathbb{R}$ as

$$f_\rho(\psi) = \int_{\mathbb{R}} \tilde{y} \, d\rho(\tilde{y} \mid \psi), \qquad \psi \in X. \qquad (5)$$

The hypothesis space for the Kaczmarz algorithm (4) consists of homogeneous linear functions,

$$\mathcal{H} = \big\{ f_x \in L^2_{\rho_X} : x \in \mathbb{R}^d \big\}, \qquad \text{where } f_x(\psi) := \langle x, \psi \rangle, \ \psi \in X. \qquad (6)$$

Definition 2 The sampling process associated with $\rho$ is said to be noise-free if $\tilde{y} = f_\rho(\psi)$ almost surely. Otherwise, it is called noisy. It is said to be linear if $f_\rho \in \mathcal{H}$ as a function in $L^2_{\rho_X}$. Otherwise, it is called nonlinear.

The main difference between our analysis in this paper and that in the literature (Strohmer and Vershynin, 2009; Needell, 2010; Chen and Powell, 2012) lies in the setting when the sampling process is either noisy or nonlinear. These two situations can be handled simultaneously by means of the least squares generalization error $\mathcal{E}(f) = \int_Z (\tilde{y} - f(\psi))^2 d\rho$, a well developed concept in learning theory. The assumption $\mathbb{E}[\tilde{y}^2] < \infty$ on $\rho$ ensures $f_\rho \in L^2_{\rho_X}$ and $\mathcal{E}(f_\rho) < \infty$. The noise-free condition can be stated as $\mathcal{E}(f_\rho) = 0$. It is well known that the regression function minimizes $\mathcal{E}(f)$ among all the (square integrable with respect to $\rho_X$) functions $f \in L^2_{\rho_X}$, and satisfies

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|^2_{L^2_{\rho_X}} = \int_X \big(f(\psi) - f_\rho(\psi)\big)^2 d\rho_X. \qquad (7)$$

Since the hypothesis space $\mathcal{H}$ is a finite dimensional subspace of $L^2_{\rho_X}$, the continuous functional $\mathcal{E}(f)$ achieves a minimizer

$$f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}} \mathcal{E}(f). \qquad (8)$$

From (7) we see that $f_{\mathcal{H}}$ is the best approximation of $f_\rho$ in the subspace $\mathcal{H}$. It is unique as the orthogonal projection of $f_\rho$ onto $\mathcal{H}$. It can be written as $f_{\mathcal{H}} = f_x$ for some $x \in \mathbb{R}^d$, but such a vector $x$ is not necessarily unique.

The linear condition can be stated as $f_\rho = f_{\mathcal{H}}$, or $f_\rho \in \mathcal{H}$, as functions in $L^2_{\rho_X}$. So we see that the sampling process is noisy or nonlinear if and only if $\mathcal{E}(f_{\mathcal{H}}) > 0$.

Now we can state our first main result, to be proved in Section 4, which gives a characterization of the convergence of $\{x_t\}_t$ to some $x \in \mathbb{R}^d$ in expectation.

Theorem 3 Define the sequence $\{x_t\}_t$ by (4). Assume $\mathcal{E}(f_{\mathcal{H}}) > 0$. Then we have the limit $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 = 0$ for some $x \in \mathbb{R}^d$ if and only if

$$\lim_{t \to \infty} \eta_t = 0 \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t = \infty. \qquad (9)$$

In this case, we have

$$\sum_{T=1}^{\infty} \sqrt{\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2} = \infty. \qquad (10)$$

Compared with the result on exponential convergence in expectation in the linear case without noise (Strohmer and Vershynin, 2009), the somewhat negative result (10) tells us that in the noisy setting the convergence in expectation cannot be as fast as $\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = O(T^{-\theta})$ for any $\theta > 2$. But for $\theta < 1$, such learning rates can be achieved by taking $\eta_t = \eta_1 t^{-\theta}$, as shown in the following second main result, to be proved in Section 4.

Theorem 4 Let $\eta_t = \eta_1 t^{-\theta}$ for some $\theta \in (0, 1]$ and $\eta_1 \in (0, 1)$. Define the sequence $\{x_t\}_t$ by (4). Then for some $x \in \mathbb{R}^d$ we have

$$\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 \le \begin{cases} C_0 \, T^{-\theta}, & \text{if } \theta < 1, \\ C_0 \, T^{-\lambda_r \eta_1}, & \text{if } \theta = 1, \end{cases} \qquad (11)$$

where $C_0$ is a constant independent of $T \in \mathbb{N}$ (given explicitly in the proof) and $\lambda_r$ is the smallest positive eigenvalue of the covariance matrix $C_{\rho_X}$ of the probability measure $\rho_X$ defined by

$$C_{\rho_X} = \mathbb{E}_{\rho_X}[\psi \psi^T] = \int_X \psi \psi^T d\rho_X. \qquad (12)$$

Our third main result is the following confidence-based estimate for the error, which will be proved in Section 5.

Theorem 5 Assume that for some constant $M > 0$, $|\tilde{y}| \le M$ almost surely. Let $\theta \in [1/2, 1]$, $\eta_t = \eta_1 t^{-\theta}$ with $0 < \eta_1 < \min\{1, \frac{1}{2\lambda_r}\}$, and $T \in \mathbb{N}$. Then for some $x \in \mathbb{R}^d$ and for any $0 < \delta < 1$, with confidence at least $1 - \delta$ we have

$$\|x_{T+1} - x\| \le \begin{cases} C_1 \, T^{-\theta/2} \log^2\frac{4}{\delta} \log T, & \text{when } \theta \in [1/2, 1), \\ C_1 \, T^{-\lambda_r \eta_1} \log^2\frac{4}{\delta} \log T, & \text{when } \theta = 1, \end{cases} \qquad (13)$$

where $C_1$ is a positive constant independent of $T$ or $\delta$ (given explicitly in the proof).
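The rates in Theorems 4 and 5 are governed by $\lambda_r$, the smallest positive eigenvalue of the covariance matrix $C_{\rho_X}$ in (12). The hedged sketch below, which is not part of the paper, estimates $\lambda_r$ from a finite sample of normalized measurement vectors; the sample size and the tolerance used to discard numerically zero eigenvalues are illustrative choices.

```python
import numpy as np

def smallest_positive_eigenvalue(psis, tol=1e-10):
    """Estimate lambda_r from rows psis (unit vectors in S^{d-1}):
    form the empirical covariance matrix and drop numerically zero eigenvalues."""
    C = psis.T @ psis / psis.shape[0]       # empirical version of E[psi psi^T]
    eigs = np.linalg.eigvalsh(C)            # symmetric matrix, eigvalsh is appropriate
    positive = eigs[eigs > tol]
    return positive.min()

# illustrative check with Example 2 below: psi drawn from normalized rows of A
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
rows = A / np.linalg.norm(A, axis=1, keepdims=True)
probs = np.sum(A**2, axis=1) / np.sum(A**2)
idx = rng.choice(200, size=50000, p=probs)
print(smallest_positive_eigenvalue(rows[idx]))
```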

Our last main result is about the almost sure convergence of the algorithm, which will be proved in Section 6.

Theorem 6 Under the assumptions of Theorem 5, for any $\epsilon \in (0, 1]$ the following holds for some $x \in \mathbb{R}^d$:

(A) When $1/2 \le \theta < 1$, $\lim_{t \to \infty} t^{\theta(1-\epsilon)/2} \|x_{t+1} - x\| = 0$ almost surely.

(B) When $\theta = 1$, $\lim_{t \to \infty} t^{\lambda_r \eta_1 (1-\epsilon)} \|x_{t+1} - x\| = 0$ almost surely.

Let us demonstrate our setting by two examples without noise considered in the literature. The first example appeared in (Chen and Powell, 2012).

Example 1 If the random measurement vectors $\{\varphi_t\}_{t=1}^{\infty}$ are independent and nonzero almost surely, then the normalized vectors $\{\psi_k = \frac{1}{\|\varphi_k\|}\varphi_k \in S^{d-1}\}$ are independent.

The second example is from (Strohmer and Vershynin, 2009).

Example 2 Define the random vector $\varphi$ to be a normalized row of a full rank matrix $A \in \mathbb{R}^{m \times d}$, with probabilities given by

$$\varphi = \frac{a_j}{\|a_j\|} \quad \text{with probability } \frac{\|a_j\|^2}{\|A\|_F^2}, \qquad j = 1, \ldots, m.$$

It was shown in (Strohmer and Vershynin, 2009) that the smallest eigenvalue of the covariance matrix is positive:

$$\lambda_{\min}\big(\mathbb{E}[\varphi \varphi^T]\big) \ge \frac{1}{\|A\|_F^2 \|A^{-1}\|^2}.$$

It means $r = d$ and $\lambda_r \ge \frac{1}{\|A\|_F^2 \|A^{-1}\|^2}$.

The third example is on homoskedastic models (Johnston).

Example 3 In the literature of homoskedastic models, it is assumed that the sample values $\{y_t\}_t$ satisfy $y_t = \langle x, \psi_t \rangle + \xi_t$ with $\{\xi_t\}_t$ being independently drawn according to a zero mean probability measure. This corresponds to the special case when the conditional distribution $\rho(\cdot \mid \psi)$ is the distribution of $f_\rho(\psi) + \xi$. Our setting induced by $\rho$ is more general and allows heteroskedastic models.

3. Connections to Learning Theory

The relaxed randomized Kaczmarz algorithm defined by (4) may be rewritten as an online learning algorithm with output functions from the hypothesis space (6), and our main results stated in the last section are new even in the online learning literature. To demonstrate this, we denote the $t$-th output function $F_t$ on $X$ induced by the vector $x_t$ as $F_t(\psi) = \langle x_t, \psi \rangle$ for $\psi \in X$. Then the iteration relation (4) gives

$$F_{t+1} = F_t + \eta_t \{\tilde{y}_t - F_t(\psi_t)\} \langle \cdot, \psi_t \rangle. \qquad (14)$$

This is a special kernel-based least squares online learning algorithm. Here a Mercer kernel on a metric space $X$ means a function $K : X \times X \to \mathbb{R}$ which is continuous, symmetric, and such that the matrix $(K(x_i, x_j))_{i,j=1}^{l}$ is positive semidefinite for any finite subset $\{x_i\}_{i=1}^{l} \subset X$.

It generates a reproducing kernel Hilbert space $(\mathcal{H}_K, \|\cdot\|_K)$ spanned by the set of fundamental functions $\{K(\cdot, x) : x \in X\}$ with the inner product $\langle K(\cdot, x), K(\cdot, y) \rangle_K = K(x, y)$.

A least squares regularized online learning algorithm in $\mathcal{H}_K$ is defined with $\{(\psi_t, \tilde{y}_t) \in X \times \mathbb{R}\}_t$ drawn independently according to a probability measure on $Z = X \times \mathbb{R}$ as

$$F_{t+1} = F_t - \eta_t \big\{ (F_t(\psi_t) - \tilde{y}_t) K(\cdot, \psi_t) + \lambda F_t \big\}, \qquad t = 1, 2, \ldots, \qquad (15)$$

where $\lambda \ge 0$ is a regularization parameter. The consistency of the online learning algorithm (15) is well understood when the approximation error $D(\lambda)$ tends to zero as $\lambda \to 0$.

Definition 7 The approximation error (or regularization error) of the pair $(\rho, K)$ is defined for $\lambda > 0$ as

$$D(\lambda) = \inf_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^2 \big\} = \inf_{f \in \mathcal{H}_K} \big\{ \|f - f_\rho\|^2_{L^2_{\rho_X}} + \lambda \|f\|_K^2 \big\}. \qquad (16)$$

When $\lambda > 0$ (with regularization) and $\lim_{\lambda \to 0} D(\lambda) = 0$, the error $\|F_{T+1} - f_\rho\|_{L^2_{\rho_X}}$ in expectation and in confidence was bounded in (Smale and Yao, 2005; Ying and Zhou, 2006; Smale and Zhou, 2009; Tarrès and Yao, 2014) by means of the decay of $D(\lambda)$ and $T$. The error analysis was done in (Ying and Pontil, 2008) without regularization ($\lambda = 0$) but under the approximation error condition $\lim_{\lambda \to 0} D(\lambda) = 0$. The error $\|F_{T+1} - f_\rho\|_K$ with the $\mathcal{H}_K$-metric was also analyzed when $f_\rho \in \mathcal{H}_K$.

If we take the kernel to be the linear one, $K(x, y) = \langle x, y \rangle$ with $X = \mathbb{R}^d$, and assume that the marginal distribution $\rho_X$ is supported on $X = S^{d-1}$, then $\psi_t \in S^{d-1}$ almost surely. Setting $\lambda = 0$, we see that the relaxed randomized Kaczmarz algorithm expressed in the form (14) is the least squares online learning algorithm (15) without regularization. So the error analysis from (Ying and Pontil, 2008) applies, but the condition $\lim_{\lambda \to 0} D(\lambda) = 0$ is required for the consistency in $L^2_{\rho_X}$, and even stronger conditions (stronger than $f_\rho \in \mathcal{H}_K$) are needed for the consistency in the $\mathcal{H}_K$-metric. Notice that for the linear kernel, $\|x\| = \|\langle \cdot, x \rangle\|_K$. So the error analysis carried out in this paper provides bounds for the error $\|F_{T+1} - f_{\mathcal{H}}\|_K$ without the condition $\lim_{\lambda \to 0} D(\lambda) = 0$. Such results cannot be found in the literature of online learning. It leads to the problem of carrying out similar error analysis for more general online learning algorithms associated with more general kernels.

Moreover, the best convergence rate in expectation of the general kernel-based least squares online learning algorithm in the literature (Smale and Yao, 2005; Ying and Zhou, 2006; Smale and Zhou, 2009; Tarrès and Yao, 2014; Hu et al., 2015) is $O(T^{-1/2})$. Theorem 4 demonstrates that the special online learning algorithm (4) has convergence rates of type $O(T^{-1+\epsilon})$ for any $\epsilon > 0$, and even of type $O(T^{-1} \log T)$ as shown in Theorem 8 below, which is a great improvement.

Note that there is a gap between the negative result (10) and the positive one (11), which leads to the natural question whether learning rates of type $\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 = O(T^{-\theta})$ are possible for $1 < \theta \le 2$. We conjecture that this is impossible for a general probability measure $\rho$, but a noise condition might help. The case $\theta = 1$ with a slight logarithmic modification, $O(T^{-1} \log T)$, can be achieved by imposing a minor restriction on the step size in the following theorem, which will be proved in the next section. (The authors thank Dr. Yiming Ying for pointing out this result.)

Theorem 8 Let $\lambda_r$ be as in Theorem 4 and $\eta_t = \frac{1}{\lambda_r (t + t_0)}$ for some $t_0 \in \mathbb{N}$ such that $t_0 \lambda_r \ge 1$. Define the sequence $\{x_t\}_t$ by (4). Then for some $x \in \mathbb{R}^d$, we have

$$\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 \le C_3 (T + t_0)^{-1} \log T,$$

where $C_3$ is a constant independent of $T \in \mathbb{N}$ given explicitly in the proof.

4. Convergence in Expectation

In this section we prove our main results on convergence in expectation. To this end, we need some preliminary analysis. Recall the function $f_{\mathcal{H}}$ defined by (8). It equals $f_x$ for some $x \in \mathbb{R}^d$. As the orthogonal projection of $f_\rho$ onto the finite dimensional subspace $\mathcal{H}$ in the Hilbert space $L^2_{\rho_X}$, it satisfies

$$\langle f_\rho - f_x, f_{x'} \rangle_{L^2_{\rho_X}} = \int_X \big(f_\rho(\psi) - \langle x, \psi \rangle\big) \langle x', \psi \rangle \, d\rho_X(\psi) = 0, \qquad \forall x' \in \mathbb{R}^d. \qquad (17)$$

The vector $x$ is not necessarily unique. To see this, we use the covariance matrix $C_{\rho_X}$ of the measure $\rho_X$ defined by (12) and denote its eigenvalues as $\lambda_1 \ge \ldots \ge \lambda_r > \lambda_{r+1} = \ldots = \lambda_d = 0$, where $r \in \{1, \ldots, d\}$ is the rank of $C_{\rho_X}$. Denote the eigenspace of $C_{\rho_X}$ associated with the eigenvalue $0$ as $V_0$ and the orthogonal projection onto $V_0$ as $P_0$. Then any vector $x + v$ from the set $x + V_0$ is also a minimizer of $\mathcal{E}(f_{x'})$ over $x' \in \mathbb{R}^d$, but $f_{x+v} = f_x = f_{\mathcal{H}}$ as functions in the space $L^2_{\rho_X}$.

The following lemma about the residual vectors $\{r_t = x_t - x\}_t$ is a crucial step in our analysis in this section.

Lemma 9 Define the sequence $\{x_t\}_t$ by (4). Let $x \in \mathbb{R}^d$ be such that $f_x = f_{\mathcal{H}}$. Denote $r_t = x_t - x$. Then there holds

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] = \|r_t\|^2 + (\eta_t^2 - 2\eta_t) \|f_{r_t}\|^2_{L^2_{\rho_X}} + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}), \qquad t \in \mathbb{N}. \qquad (18)$$

Proof Subtract $x$ from both sides of (4) and take inner products. We see from $\|\psi_t\| = 1$ that

$$\|r_{t+1}\|^2 = \|r_t\|^2 + 2\eta_t \{\tilde{y}_t - \langle \psi_t, x_t \rangle\} \langle \psi_t, r_t \rangle + \eta_t^2 \{\tilde{y}_t - \langle \psi_t, x_t \rangle\}^2. \qquad (19)$$

Since $x_t$ does not depend on $z_t$, taking the expectation with respect to $z_t$, we see from $\mathbb{E}[\tilde{y}_t \mid \psi_t] = f_\rho(\psi_t)$ and $\mathbb{E}_{z_t}\{\tilde{y}_t - \langle \psi_t, x_t \rangle\}^2 = \mathcal{E}(f_{x_t})$ that

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] = \|r_t\|^2 + 2\eta_t \, \mathbb{E}_{\psi_t}\big[\{f_\rho(\psi_t) - \langle \psi_t, x_t \rangle\} \langle \psi_t, r_t \rangle\big] + \eta_t^2 \, \mathcal{E}(f_{x_t}).$$

By (17), we know that the middle term above equals

$$2\eta_t \, \mathbb{E}_{\psi_t}\big[\{\langle \psi_t, x \rangle - \langle \psi_t, x_t \rangle\} \langle \psi_t, r_t \rangle\big] = -2\eta_t \, \mathbb{E}_{\psi_t}\big[\langle \psi_t, r_t \rangle^2\big] = -2\eta_t \|f_{r_t}\|^2_{L^2_{\rho_X}}.$$

Since $f_x$ is the orthogonal projection of $f_\rho$ onto $\mathcal{H}$, there holds $\mathcal{E}(f_{x_t}) = \mathcal{E}(f_\rho) + \|f_\rho - f_x\|^2_{L^2_{\rho_X}} + \|f_x - f_{x_t}\|^2_{L^2_{\rho_X}} = \mathcal{E}(f_{\mathcal{H}}) + \|f_{r_t}\|^2_{L^2_{\rho_X}}$. Then the desired identity (18) follows.
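Identity (18) can be checked numerically. The sketch below (not from the paper) compares a Monte Carlo estimate of $\mathbb{E}_{z_t}[\|r_{t+1}\|^2]$ for a fixed $x_t$ with the right-hand side of (18), in the illustrative case where $\psi$ is uniform on the sphere, so that $C_{\rho_X} = I/d$ and $\mathcal{E}(f_{\mathcal{H}}) = \sigma^2$ for the additive Gaussian noise model used here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, eta = 20, 0.1, 0.3
x_star = rng.standard_normal(d)             # f_H = f_{x*}, so E(f_H) = sigma^2 here
x_t = rng.standard_normal(d)                # a fixed current iterate
r_t = x_t - x_star

n = 300_000                                 # Monte Carlo sample size (illustrative)
psi = rng.standard_normal((n, d))
psi /= np.linalg.norm(psi, axis=1, keepdims=True)   # uniform on the sphere S^{d-1}
y = psi @ x_star + sigma * rng.standard_normal(n)

# one step of (4) from x_t for each sampled z_t = (psi, y), then the squared residual
r_next = x_t[None, :] + eta * (y - psi @ x_t)[:, None] * psi - x_star[None, :]
lhs = np.mean(np.sum(r_next**2, axis=1))    # Monte Carlo E_{z_t} ||r_{t+1}||^2

# right-hand side of (18): for psi uniform on the sphere, ||f_{r_t}||^2 = ||r_t||^2 / d
f_rt_sq = np.sum(r_t**2) / d
rhs = np.sum(r_t**2) + (eta**2 - 2 * eta) * f_rt_sq + eta**2 * sigma**2
print(lhs, rhs)                             # the two values should be close
```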

We are in a position to prove our first main result.

Proof of Theorem 3 (Necessity.) We first analyze the first two terms on the right-hand side of the identity (18) in Lemma 9. Since $0 < \eta_t \le 2$, we have $\eta_t^2 - 2\eta_t \le 0$. Observe from the Schwarz inequality that $|f_{r_t}(\psi)| = |\langle r_t, \psi \rangle| \le \|r_t\| \|\psi\| = \|r_t\|$ and thereby $\|f_{r_t}\|^2_{L^2_{\rho_X}} \le \|r_t\|^2$. It follows that

$$\|r_t\|^2 + (\eta_t^2 - 2\eta_t) \|f_{r_t}\|^2_{L^2_{\rho_X}} \ge \|r_t\|^2 + (\eta_t^2 - 2\eta_t) \|r_t\|^2 = (1 - \eta_t)^2 \|r_t\|^2.$$

This together with (18) implies

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] \ge (1 - \eta_t)^2 \|r_t\|^2 + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}). \qquad (20)$$

Then we can proceed with proving the necessity. If $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = 0$ for some $x \in \mathbb{R}^d$ and $\mathcal{E}(f_{\mathcal{H}}) > 0$, we know from (20) that $\lim_{T \to \infty} \eta_T = 0$. It ensures the existence of some integer $t_0$ such that $\eta_t \le \frac{1}{3}$ for any $t \ge t_0$. Since $(1 - \eta)^2 \ge \exp\{-4\eta\}$ for $0 < \eta \le \frac{1}{3}$, we know that for any $t \ge t_0$, $(1 - \eta_t)^2 \ge \exp\{-4\eta_t\}$. Combining this with (20) yields

$$\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 \ge \Big(\prod_{t=t_0}^{T} \exp\{-4\eta_t\}\Big) \mathbb{E}_{z_1,\ldots,z_{t_0-1}}\|r_{t_0}\|^2.$$

But (20) also tells us that $\mathbb{E}_{z_1,\ldots,z_{t_0-1}}\|r_{t_0}\|^2 \ge \eta_{t_0-1}^2 \, \mathcal{E}(f_{\mathcal{H}}) > 0$. So

$$\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 \ge \exp\Big\{-4 \sum_{t=t_0}^{T} \eta_t\Big\} \, \eta_{t_0-1}^2 \, \mathcal{E}(f_{\mathcal{H}}).$$

Since $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = 0$, we must have $\sum_{t=1}^{\infty} \eta_t = \infty$. This proves the necessity.

(Sufficiency.) Recall that $V_0$ is the eigenspace of the covariance matrix $C_{\rho_X}$ associated with the eigenvalue $0$ and $P_0$ is the orthogonal projection onto $V_0$. Then $\psi_t$ is orthogonal to $V_0$ almost surely for each $t$. It follows that $P_0 x_{t+1} = P_0 x_t$ and thereby $P_0 x_t = P_0 x_1$ for each $t$. Take the vector $x$ to be the minimizer of $\mathcal{E}(f_{x'})$ over $x' \in \mathbb{R}^d$ such that $P_0 x = P_0 x_1$. With this choice, $r_t$ is orthogonal to $V_0$ for each $t$ and belongs to the orthogonal complement $V_0^{\perp}$. Note that the smallest eigenvalue of $C_{\rho_X}$ restricted to the subspace $V_0^{\perp}$ is at least $\lambda_r > 0$. So we have

$$\|f_{r_t}\|^2_{L^2_{\rho_X}} = \int_X \langle \psi, r_t \rangle^2 d\rho_X = \int_X r_t^T \psi \psi^T r_t \, d\rho_X = r_t^T C_{\rho_X} r_t \qquad (21)$$

and $\|f_{r_t}\|^2_{L^2_{\rho_X}} \ge \lambda_r \|r_t\|^2$. The condition $\lim_{t \to \infty} \eta_t = 0$ ensures the existence of some $t_1 \in \mathbb{N}$ such that $\eta_t \le 1$ for any $t \ge t_1$. Thus, we see from (18) in Lemma 9 that for $t \ge t_1$,

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] \le \|r_t\|^2 - \eta_t \|f_{r_t}\|^2_{L^2_{\rho_X}} + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}) \le (1 - \eta_t \lambda_r) \|r_t\|^2 + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}).$$

Applying this inequality iteratively for $t = T, \ldots, t_1$ yields

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \mathbb{E}_{z_1,\ldots,z_{t_1-1}}\big[\|r_{t_1}\|^2\big] \prod_{t=t_1}^{T} (1 - \eta_t \lambda_r) + \mathcal{E}(f_{\mathcal{H}}) \sum_{t=t_1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r), \qquad (22)$$

where we denote $\prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) = 1$ for $t = T$. By the condition $\sum_{t=1}^{\infty} \eta_t = \infty$, one has

$$\prod_{t=t_1}^{T} (1 - \eta_t \lambda_r) \le \exp\Big\{-\lambda_r \sum_{t=t_1}^{T} \eta_t\Big\} \to 0 \quad \text{as } T \to \infty.$$

Thus for any $\varepsilon > 0$, there exists $t_2 = t_2(\varepsilon) \in \mathbb{N}$ such that for any $T \ge t_2$,

$$\mathbb{E}_{z_1,\ldots,z_{t_1-1}}\big[\|r_{t_1}\|^2\big] \prod_{t=t_1}^{T} (1 - \eta_t \lambda_r) \le \varepsilon.$$

To deal with the other term of the bound for $\mathbb{E}\|r_{T+1}\|^2$, we use the assumption $\lim_{t \to \infty} \eta_t = 0$ and find some integer $t(\varepsilon) \ge t_1$ such that $\eta_t \lambda_r \le \varepsilon$ for any $t \ge t(\varepsilon)$. Write

$$\sum_{t=t_1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) = \sum_{t=t_1}^{t(\varepsilon)} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) + \sum_{t=t(\varepsilon)+1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r). \qquad (23)$$

The second term of (23) can be bounded as

$$\sum_{t=t(\varepsilon)+1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) \le \frac{\varepsilon}{\lambda_r} \sum_{t=t(\varepsilon)+1}^{T} \eta_t \lambda_r \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) = \frac{\varepsilon}{\lambda_r} \sum_{t=t(\varepsilon)+1}^{T} \Big\{ \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) - \prod_{k=t}^{T} (1 - \eta_k \lambda_r) \Big\} \le \frac{\varepsilon}{\lambda_r}.$$

To bound the first term of (23), we apply the condition $\sum_{t=1}^{\infty} \eta_t = \infty$ again and find some integer $t_3 = t_3(\varepsilon) > t(\varepsilon)$ such that

$$\sum_{k=t(\varepsilon)+1}^{t_3} \eta_k \ge \frac{1}{\lambda_r} \log \frac{t(\varepsilon)}{\varepsilon}.$$

Hence, for every $t \le t(\varepsilon)$,

$$\sum_{k=t+1}^{T} \eta_k \ge \sum_{k=t(\varepsilon)+1}^{t_3} \eta_k \ge \frac{1}{\lambda_r} \log \frac{t(\varepsilon)}{\varepsilon}, \qquad T \ge t_3.$$

It thus follows that for each $t \in \{t_1, \ldots, t(\varepsilon)\}$ and $T \ge t_3$,

$$\prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) \le \exp\Big\{-\lambda_r \sum_{k=t+1}^{T} \eta_k\Big\} \le \exp\Big\{-\lambda_r \sum_{k=t(\varepsilon)+1}^{t_3} \eta_k\Big\} \le \frac{\varepsilon}{t(\varepsilon)}.$$

Combining this with the fact $\eta_t \le 1$ for each $t \ge t_1$, we see that the first term of (23) can be bounded as

$$\sum_{t=t_1}^{t(\varepsilon)} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) \le \frac{\varepsilon}{t(\varepsilon)} \sum_{t=t_1}^{t(\varepsilon)} \eta_t^2 \le \varepsilon.$$

From the above analysis, we know that when $T \ge \max\{t_1, t(\varepsilon), t_2, t_3\}$,

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \varepsilon + \mathcal{E}(f_{\mathcal{H}}) \Big(\varepsilon + \frac{\varepsilon}{\lambda_r}\Big).$$

This proves the convergence $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = 0$ for some $x \in \mathbb{R}^d$, and the sufficiency is verified.

From the bound (20), we also see that $\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 \ge \eta_T^2 \, \mathcal{E}(f_{\mathcal{H}})$ for every $T \in \mathbb{N}$. This implies that

$$\sum_{T=1}^{\infty} \sqrt{\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2} \ge \sqrt{\mathcal{E}(f_{\mathcal{H}})} \sum_{T=1}^{\infty} \eta_T = \infty.$$

The proof of Theorem 3 is complete.

In the proof of our second main result, we need some elementary inequalities.

Lemma 10 (a) For $\nu, a > 0$, there holds

$$\exp\{-\nu x\} \le \Big(\frac{a}{\nu e}\Big)^a x^{-a}, \qquad x > 0. \qquad (24)$$

(b) Let $\nu > 0$ and $q_2 \ge 0$. If $0 < q_1 < 1$, then for any $t \in \mathbb{N}$, we have

$$\sum_{i=1}^{t-1} i^{-q_2} \exp\Big\{-\nu \sum_{j=i+1}^{t} j^{-q_1}\Big\} \le c_{q_1, q_2, \nu} \, t^{q_1 - q_2}, \qquad (25)$$

where $c_{q_1, q_2, \nu}$ is a constant depending only on $q_1$, $q_2$ and $\nu$ (given explicitly in Smale and Zhou, 2009). For $q_1 = 1$, we have

$$\sum_{i=1}^{t-1} i^{-q_2} \exp\Big\{-\nu \sum_{j=i+1}^{t} j^{-1}\Big\} \le \begin{cases} c'_{q_2, \nu} \, t^{-\min\{\nu, \, q_2 - 1\}}, & \text{if } \nu \ne q_2 - 1, \\ c'_{q_2, \nu} \, t^{-\nu} \log t, & \text{if } \nu = q_2 - 1. \end{cases} \qquad (26)$$

(c) For any $t < T \in \mathbb{N}$ and $\theta \in (0, 1]$, there holds

$$\sum_{k=t+1}^{T} k^{-\theta} \ge \begin{cases} \frac{1}{1-\theta}\big[(T+1)^{1-\theta} - (t+1)^{1-\theta}\big], & \text{if } \theta < 1, \\ \log(T+1) - \log(t+1), & \text{if } \theta = 1. \end{cases} \qquad (27)$$

(d) For $\theta \in (0, 1]$, $\mu > 0$, and $T \in \mathbb{N}$, there holds

$$\exp\Big\{-\mu \sum_{t=1}^{T} t^{-\theta}\Big\} \le \begin{cases} \exp\Big\{\frac{\mu}{1-\theta}\Big\} \Big(\frac{\theta}{\mu e}\Big)^{\frac{\theta}{1-\theta}} T^{-\theta}, & \text{if } \theta < 1, \\ T^{-\mu}, & \text{if } \theta = 1. \end{cases} \qquad (28)$$

Proof The inequalities in parts (a) and (b) can be found in (Smale and Zhou, 2009, Lemma 2).

Part (c) can be proved by noting that

$$\sum_{k=t+1}^{T} k^{-\theta} \ge \sum_{k=t+1}^{T} \int_{k}^{k+1} x^{-\theta} dx = \int_{t+1}^{T+1} x^{-\theta} dx.$$

For part (d), we use the inequality (27) in part (c) to derive

$$\exp\Big\{-\mu \sum_{t=1}^{T} t^{-\theta}\Big\} \le \begin{cases} \exp\Big\{\frac{\mu}{1-\theta}\Big\} \exp\Big\{-\frac{\mu}{1-\theta} T^{1-\theta}\Big\}, & \text{if } \theta < 1, \\ (T+1)^{-\mu} \le T^{-\mu}, & \text{if } \theta = 1. \end{cases}$$

For $\theta \in (0, 1)$, by applying (24) with $\nu = \frac{\mu}{1-\theta}$, $x = T^{1-\theta}$ and $a = \frac{\theta}{1-\theta}$, we get

$$\exp\Big\{-\frac{\mu}{1-\theta} T^{1-\theta}\Big\} \le \Big(\frac{\theta}{\mu e}\Big)^{\frac{\theta}{1-\theta}} T^{-\theta}.$$

This proves the result.

We can now prove our second main result. This is done by following the estimates in the proof of Theorem 3.

Proof of Theorem 4 Since $\eta_1 < 1$, we have $\eta_t < 1$ for all $t \in \mathbb{N}$. Therefore, we can take $t_1 = 1$ in (22) and obtain

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \prod_{t=1}^{T} (1 - \eta_t \lambda_r) + \mathcal{E}(f_{\mathcal{H}}) \sum_{t=1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r)
\le \|r_1\|^2 \exp\Big\{-\lambda_r \eta_1 \sum_{t=1}^{T} t^{-\theta}\Big\} + \mathcal{E}(f_{\mathcal{H}}) \, \eta_1^2 \sum_{t=1}^{T} t^{-2\theta} \exp\Big\{-\lambda_r \eta_1 \sum_{k=t+1}^{T} k^{-\theta}\Big\}. \qquad (29)$$

Applying part (d) of Lemma 10 with $\mu = \lambda_r \eta_1$, we know that the first term of (29) can be bounded as

$$\|r_1\|^2 \exp\Big\{-\lambda_r \eta_1 \sum_{t=1}^{T} t^{-\theta}\Big\} \le \begin{cases} \|r_1\|^2 \exp\Big\{\frac{\lambda_r \eta_1}{1-\theta}\Big\} \Big(\frac{\theta}{\lambda_r \eta_1 e}\Big)^{\frac{\theta}{1-\theta}} T^{-\theta}, & \text{if } \theta < 1, \\ \|r_1\|^2 \, T^{-\lambda_r \eta_1}, & \text{if } \theta = 1. \end{cases}$$

Applying part (b) of Lemma 10 with $q_1 = \theta$, $q_2 = 2\theta$ and $\nu = \lambda_r \eta_1$, and noting that $\lambda_r \eta_1 < 1$ by $\eta_1 \in (0, 1)$ and $\lambda_r \in (0, 1]$ (so that $\min\{\lambda_r \eta_1, 2\theta - 1\} = \lambda_r \eta_1$ when $\theta = 1$), we know that the second term of (29) can be bounded by $\mathcal{E}(f_{\mathcal{H}}) \eta_1^2 \, c_{\theta, 2\theta, \lambda_r \eta_1} T^{-\theta}$ when $\theta < 1$ and by $\mathcal{E}(f_{\mathcal{H}}) \eta_1^2 \, c'_{2, \lambda_r \eta_1} T^{-\lambda_r \eta_1}$ when $\theta = 1$. Thus, we get our desired result (11) with $C_0$ given by the sum of the constants in the two displayed bounds. This completes the proof of Theorem 4.

Remark 11 From the proof of Theorem 4, we see that if $\mathcal{E}(f_{\mathcal{H}}) = 0$, then for some $x \in \mathbb{R}^d$, we have

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \prod_{t=1}^{T} (1 - \eta_t \lambda_r).$$

The above argument actually can be used to prove Theorem 8.

Proof of Theorem 8 Since $\eta_1 \le 1$, we have $\eta_t \le 1$ for all $t \in \mathbb{N}$. Thus, we can take $t_1 = 1$ in (22) and obtain

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \prod_{t=1}^{T} (1 - \eta_t \lambda_r) + \mathcal{E}(f_{\mathcal{H}}) \sum_{t=1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r)
= \|r_1\|^2 \prod_{t=1}^{T} \Big(1 - \frac{1}{t + t_0}\Big) + \frac{\mathcal{E}(f_{\mathcal{H}})}{\lambda_r^2} \sum_{t=1}^{T} \frac{1}{(t + t_0)^2} \prod_{k=t+1}^{T} \Big(1 - \frac{1}{k + t_0}\Big).$$

We note that

$$\prod_{k=t+1}^{T} \Big(1 - \frac{1}{k + t_0}\Big) = \prod_{k=t+1}^{T} \frac{k + t_0 - 1}{k + t_0} = \frac{t + t_0}{T + t_0}.$$

It thus follows that

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \, \frac{t_0}{T + t_0} + \frac{\mathcal{E}(f_{\mathcal{H}})}{\lambda_r^2 (T + t_0)} \sum_{t=1}^{T} \frac{1}{t + t_0}.$$

With $\sum_{t=1}^{T} \frac{1}{t + t_0} \le \log \frac{T + t_0}{t_0} \le 2 \log T$, we get the desired result with $C_3$ given by

$$C_3 = t_0 \|r_1\|^2 + \frac{2 \, \mathcal{E}(f_{\mathcal{H}})}{\lambda_r^2}.$$

This proves Theorem 8.
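The following sketch (not from the paper) contrasts the two step-size schedules analysed in this section, $\eta_t = \eta_1 t^{-\theta}$ from Theorem 4 and $\eta_t = 1/(\lambda_r(t+t_0))$ from Theorem 8, on a synthetic noisy problem with $\psi$ uniform on the sphere, so that $\lambda_r = 1/d$. All parameter choices are illustrative, and the run is only a sanity check of the predicted decay, not a reproduction of any experiment in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, T = 50, 0.05, 20000
x_star = rng.standard_normal(d)
lam_r = 1.0 / d                              # for psi uniform on S^{d-1}, C_rho = I/d

def run(step_size):
    """Run the relaxed scheme (4) for T steps and record the squared error."""
    x = np.zeros(d)
    errs = np.empty(T)
    for t in range(1, T + 1):
        psi = rng.standard_normal(d)
        psi /= np.linalg.norm(psi)
        y = psi @ x_star + sigma * rng.standard_normal()
        x += step_size(t) * (y - psi @ x) * psi
        errs[t - 1] = np.sum((x - x_star) ** 2)
    return errs

t0 = int(np.ceil(1.0 / lam_r))                        # t_0 with t_0 * lambda_r >= 1
err_poly = run(lambda t: 0.9 * t ** (-0.75))          # Theorem 4 schedule, theta = 3/4
err_thm8 = run(lambda t: 1.0 / (lam_r * (t + t0)))    # Theorem 8 schedule
print(err_poly[-1], err_thm8[-1])
```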

5. Confidence-Based Estimates for Convergence

In this section, we prove our third main result, Theorem 5. Recall that $V_0$ is the eigenspace of the covariance matrix $C_{\rho_X}$ associated with the eigenvalue $0$. We choose $x$ as in the proof of the sufficiency part of Theorem 3. With this choice, $r_t$ belongs to the orthogonal complement $V_0^{\perp}$ almost surely. Our error analysis is based on the following error decomposition.

5.1 Error Decomposition

For $t \in \mathbb{N}$, set the operator $\Pi_k^t = \prod_{j=k}^{t} (I - \eta_j C_{\rho_X})$ on $\mathbb{R}^d$ for $k \le t$ and $\Pi_{t+1}^t = I$. Subtracting $x$ from both sides of (4), we have

$$r_{k+1} = (I - \eta_k C_{\rho_X}) r_k + \eta_k \chi_k, \qquad (30)$$

where $\chi_k = (\tilde{y}_k - \langle \psi_k, x \rangle) \psi_k + (C_{\rho_X} - \psi_k \psi_k^T) r_k$. Applying this relationship iteratively for $k = t, \ldots, 1$, we get

$$r_{t+1} = \Pi_1^t r_1 + \sum_{k=1}^{t} \eta_k \Pi_{k+1}^t \chi_k.$$

Thus

$$\|r_{t+1}\| \le \|\Pi_1^t r_1\| + \Big\| \sum_{k=1}^{t} \eta_k \Pi_{k+1}^t \chi_k \Big\|. \qquad (31)$$

The first term of the bound (31) is caused by the initial error; it is deterministic and will be estimated in Subsection 5.2. The second term is the sample error depending on the sample. Since $r_k$ is independent of $z_k$, by $\mathbb{E}[\tilde{y}_k \mid \psi_k] = f_\rho(\psi_k)$ and (17),

$$\mathbb{E}[\chi_k \mid z_1, \ldots, z_{k-1}] = \int_X \big\{ \big(f_\rho(\psi) - \langle x, \psi \rangle\big) \psi + \big(C_{\rho_X} - \psi \psi^T\big) r_k \big\} d\rho_X(\psi) = 0.$$

It tells us that $\{\omega_k := \eta_k \Pi_{k+1}^t \chi_k\}_k$ is a martingale difference sequence. The idea of analyzing the sample error by properties of martingale difference sequences can be found in the recent work (Tarrès and Yao, 2014), to which we refer for details about martingale difference sequences. In particular, we can apply the following Pinelis-Bernstein inequality from (Tarrès and Yao, 2014), derived from (Pinelis, 1994, Theorem 3.4), to estimate the sample error.

Lemma 12 Let $\{\omega_k\}_k$ be a martingale difference sequence in a Hilbert space. Suppose that almost surely $\|\omega_k\| \le B$ and $\sum_{k=1}^{t} \mathbb{E}\big[\|\omega_k\|^2 \mid \omega_1, \ldots, \omega_{k-1}\big] \le L_t$. Then for any $0 < \delta < 1$, the following holds with probability at least $1 - \delta$:

$$\sup_{1 \le j \le t} \Big\| \sum_{k=1}^{j} \omega_k \Big\| \le 2 \Big( \frac{B}{3} + \sqrt{L_t} \Big) \log \frac{2}{\delta}.$$

The required bounds $B$ and $L_t$ will be presented in Subsections 5.3 and 5.4, respectively.

5.2 Initial Error

Lemma 13 Let $\eta_k = \eta_1 k^{-\theta}$ with $\theta \in (0, 1]$ and $\eta_1 \in (0, 1)$. Then

$$\|\Pi_1^t r_1\| \le \begin{cases} C_0' \, t^{-\theta}, & \text{when } \theta < 1, \\ C_0' \, t^{-\lambda_r \eta_1}, & \text{when } \theta = 1, \end{cases} \qquad \text{where} \qquad C_0' = \begin{cases} \|r_1\| \exp\Big\{\frac{\lambda_r \eta_1}{1-\theta}\Big\} \Big(\frac{\theta}{\lambda_r \eta_1 e}\Big)^{\frac{\theta}{1-\theta}}, & \text{when } \theta < 1, \\ \|r_1\|, & \text{when } \theta = 1. \end{cases}$$

Proof By our choice of $x$, we know that $r_1$ belongs to the subspace $V_0^{\perp}$. Thus, we have $\|\Pi_1^t r_1\| \le \|\Pi_1^t|_{V_0^{\perp}}\| \, \|r_1\|$. Here $\Pi_1^t|_{V_0^{\perp}}$ denotes the restriction of the self adjoint operator $\Pi_1^t$ onto $V_0^{\perp}$. Since $\{\lambda_l : l = 1, 2, \ldots, r\}$ are the eigenvalues of $C_{\rho_X}$ restricted to $V_0^{\perp}$, $\lambda_1 \le 1$ and $\eta_1 < 1$, we have

$$\|\Pi_1^t|_{V_0^{\perp}}\| = \sup_{1 \le l \le r} \prod_{k=1}^{t} (1 - \eta_k \lambda_l) \le \prod_{k=1}^{t} (1 - \eta_1 \lambda_r k^{-\theta}) \le \exp\Big\{-\lambda_r \eta_1 \sum_{k=1}^{t} k^{-\theta}\Big\}.$$

Applying part (d) of Lemma 10, we get our desired result.

5.3 Bounding the Residual Sequence

To bound $\omega_k = \eta_k \Pi_{k+1}^t \chi_k$, we start with a rough bound for $\|r_t\|$.

Lemma 14 Assume that for some constant $M > 0$, $|\tilde{y}| \le M$ almost surely. Let $\theta \in [0, 1]$ and $\eta_t = \eta_1 t^{-\theta}$ with $\eta_1 \in (0, 1)$. Then for any $t \in \mathbb{N}$, we have almost surely

$$\|r_t\| \le \begin{cases} C_1' \, t^{(1-\theta)/2}, & \text{when } \theta \in [0, 1), \\ C_1' \, \sqrt{\log(et)}, & \text{when } \theta = 1, \end{cases} \qquad (32)$$

where $C_1'$ is a constant independent of $t$ given by

$$C_1' = \begin{cases} \sqrt{\|r_1\|^2 + \frac{\eta_1 (M + \|x\|)^2}{1-\theta}}, & \text{when } \theta \in [0, 1), \\ \sqrt{\|r_1\|^2 + \eta_1 (M + \|x\|)^2}, & \text{when } \theta = 1. \end{cases}$$

Proof Rewrite (19) with $x_t = x + r_t$ as

$$\|r_{t+1}\|^2 = \|r_t\|^2 + 2\eta_t \big(\tilde{y}_t - \langle \psi_t, x \rangle - \langle \psi_t, r_t \rangle\big) \langle \psi_t, r_t \rangle + \eta_t^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle - \langle \psi_t, r_t \rangle\big)^2 = \|r_t\|^2 + F(\langle \psi_t, r_t \rangle),$$

where $F = F_{\eta_t, \tilde{y}_t, \psi_t, x} : \mathbb{R} \to \mathbb{R}$ is a quadratic function given by

$$F(\mu) = (\eta_t^2 - 2\eta_t) \mu^2 + 2\eta_t (1 - \eta_t) \big(\tilde{y}_t - \langle \psi_t, x \rangle\big) \mu + \eta_t^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2.$$

Note that $\eta_t^2 - 2\eta_t < 0$ by $0 < \eta_t \le \eta_1 < 1$. A simple calculation shows that

$$\max_{\mu \in \mathbb{R}} F(\mu) = \frac{\eta_t^2 (1 - \eta_t)^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2}{2\eta_t - \eta_t^2} + \eta_t^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2 = \frac{\eta_t \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2}{2 - \eta_t}.$$

Since $|\tilde{y}_t| \le M$ almost surely and $\|\psi_t\| = 1$,

$$\big|\tilde{y}_t - \langle \psi_t, x \rangle\big| \le |\tilde{y}_t| + \|\psi_t\| \|x\| \le M + \|x\|.$$

Thus,

$$\|r_{t+1}\|^2 \le \|r_t\|^2 + \frac{\eta_t \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2}{2 - \eta_t} \le \|r_t\|^2 + \eta_t (M + \|x\|)^2.$$

Using this relationship iteratively yields

$$\|r_{t+1}\|^2 \le \|r_1\|^2 + (M + \|x\|)^2 \sum_{k=1}^{t} \eta_k = \|r_1\|^2 + \eta_1 (M + \|x\|)^2 \sum_{k=1}^{t} k^{-\theta}.$$

Since

$$\sum_{k=1}^{t} k^{-\theta} \le 1 + \sum_{k=2}^{t} \int_{k-1}^{k} x^{-\theta} dx \le \begin{cases} \frac{1}{1-\theta} \, t^{1-\theta}, & \text{when } \theta \in [0, 1), \\ \log(et), & \text{when } \theta = 1, \end{cases}$$

we get

$$\|r_t\|^2 \le \begin{cases} \Big(\|r_1\|^2 + \frac{\eta_1 (M + \|x\|)^2}{1-\theta}\Big) t^{1-\theta}, & \text{when } \theta \in [0, 1), \\ \big(\|r_1\|^2 + \eta_1 (M + \|x\|)^2\big) \log(et), & \text{when } \theta = 1, \end{cases}$$

which leads to the desired result.

5.4 Estimating Conditional Variance and Upper Bound

In this subsection, we give bounds for the two quantities $\sum_{k=1}^{t} \eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big]$ and $\sup_{1 \le k \le t} \|\eta_k \Pi_{k+1}^t \chi_k\|$ required for applying the Pinelis-Bernstein inequality.

Lemma 15 Let $\eta_k = \eta_1 k^{-\theta}$ with $\theta \in (0, 1]$ and $\eta_1 \in (0, 1)$. Then almost surely we have

$$\eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \eta_1^2 \, k^{-2\theta} \exp\Big\{-2\eta_1 \lambda_r \sum_{j=k+1}^{t} j^{-\theta}\Big\} \big(\mathcal{E}(f_{\mathcal{H}}) + \|r_k\|^2\big). \qquad (33)$$

Proof Recall that both $\psi_k$ and $r_k$ belong to $V_0^{\perp}$ almost surely for each $k \in \mathbb{N}$. As a result, $\chi_k$ also belongs to $V_0^{\perp}$ almost surely for each $k$. Hence

$$\eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \eta_k^2 \, \big\|\Pi_{k+1}^t|_{V_0^{\perp}}\big\|^2 \, \mathbb{E}\big[\|\chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big].$$

Since $\eta_1 < 1$ and $\lambda_1 \le 1$, we have

$$\big\|\Pi_{k+1}^t|_{V_0^{\perp}}\big\| = \sup_{1 \le l \le r} \prod_{j=k+1}^{t} (1 - \eta_j \lambda_l) \le \prod_{j=k+1}^{t} (1 - \eta_j \lambda_r) \le \exp\Big\{-\lambda_r \sum_{j=k+1}^{t} \eta_j\Big\} = \exp\Big\{-\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\}. \qquad (34)$$

Thus,

$$\eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \eta_k^2 \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\} \mathbb{E}\big[\|\chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big]. \qquad (35)$$

Since $r_k$ does not depend on $z_k$, we see from $\|\psi_k\| = 1$, $\mathbb{E}[\tilde{y}_k \mid \psi_k] = f_\rho(\psi_k)$ and (17) that

$$\mathbb{E}_{z_k}\big[\big\langle (\tilde{y}_k - \langle \psi_k, x \rangle)\psi_k, \, (C_{\rho_X} - \psi_k \psi_k^T) r_k \big\rangle\big]
= \mathbb{E}_{\psi_k}\big[\big(f_\rho(\psi_k) - \langle \psi_k, x \rangle\big)\langle \psi_k, C_{\rho_X} r_k \rangle\big] - \mathbb{E}_{\psi_k}\big[\big(f_\rho(\psi_k) - \langle \psi_k, x \rangle\big)\langle \psi_k, r_k \rangle\big] = 0.$$

It thus follows that

$$\mathbb{E}\big[\|\chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] = \mathbb{E}_{z_k}\big[\|\chi_k\|^2\big] = \mathbb{E}_{z_k}\big[(\tilde{y}_k - \langle \psi_k, x \rangle)^2\big] + \mathbb{E}_{z_k}\big[\|(C_{\rho_X} - \psi_k \psi_k^T) r_k\|^2\big]
= \mathcal{E}(f_{\mathcal{H}}) + \big\langle (C_{\rho_X} - C_{\rho_X}^2) r_k, r_k \big\rangle \le \mathcal{E}(f_{\mathcal{H}}) + \|r_k\|^2.$$

Putting the above bound into (35), we get the desired result.

Lemma 16 Assume that for some constant $M > 0$, $|\tilde{y}| \le M$ almost surely. Let $\theta \in [1/2, 1]$ and $\eta_t = \eta_1 t^{-\theta}$ with $\eta_1 \in (0, 1)$. Then for any $t \in \mathbb{N}$, we have almost surely

$$\sup_{1 \le k \le t} \|\eta_k \Pi_{k+1}^t \chi_k\| \le \begin{cases} C_2' \, t^{-\theta} \max\big\{\sup_{1 \le k \le t} \|r_k\|, 1\big\}, & \text{when } \theta < 1, \\ C_2' \, t^{-\lambda_r \eta_1} \max\big\{\sup_{1 \le k \le t} \|r_k\|, 1\big\}, & \text{when } \theta = 1, \end{cases} \qquad (36)$$

where $C_2'$ is a constant independent of $t$ given explicitly in the proof.

Proof Let $k \in \{1, \ldots, t\}$. From the definition of $\chi_k$, we have $\|\chi_k\| \le \big(|\tilde{y}_k| + \|\psi_k\| \|x\|\big)\|\psi_k\| + \|C_{\rho_X} - \psi_k \psi_k^T\| \, \|r_k\|$. But $|\tilde{y}_k| \le M$, $\|\psi_k\| = 1$ and $\|C_{\rho_X}\| \le 1$. So we have

$$\|\chi_k\| \le M + \|x\| + 2\|r_k\| \le (M + \|x\| + 2) \max\{\|r_k\|, 1\}.$$

This together with (34) and the fact that $\chi_k$ belongs to $V_0^{\perp}$ implies

$$\|\eta_k \Pi_{k+1}^t \chi_k\| \le \eta_1 (M + \|x\| + 2) \, k^{-\theta} \exp\Big\{-\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\} \max\{\|r_k\|, 1\}.$$

What is left is to estimate $I_k := k^{-\theta} \exp\big\{-\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\big\}$.

For $\theta \in [1/2, 1)$, applying part (c) of Lemma 10 gives

$$I_k \le k^{-\theta} \exp\Big\{-\frac{\lambda_r \eta_1}{1-\theta}\big[(t+1)^{1-\theta} - (k+1)^{1-\theta}\big]\Big\}.$$

If $k \ge t/2$, then $k^{-\theta} \le 2^{\theta} t^{-\theta}$ and thus $I_k \le 2^{\theta} t^{-\theta}$. If $1 \le k < t/2$, then we have $k+1 \le (t+1)/2$ and $(t+1)^{1-\theta} - (k+1)^{1-\theta} \ge (1 - 2^{\theta-1})(t+1)^{1-\theta}$. It follows that

$$I_k \le \exp\Big\{-\frac{\lambda_r \eta_1 (1 - 2^{\theta-1})}{1-\theta} \, t^{1-\theta}\Big\}.$$

Applying part (a) of Lemma 10 with $x = t^{1-\theta}$, $\nu = \frac{\lambda_r \eta_1 (1 - 2^{\theta-1})}{1-\theta}$ and $a = \frac{\theta}{1-\theta}$, we get

$$I_k \le \Big(\frac{\theta}{e \lambda_r \eta_1 (1 - 2^{\theta-1})}\Big)^{\frac{\theta}{1-\theta}} t^{-\theta}.$$

For $\theta = 1$, by part (c) of Lemma 10 and $\lambda_r \eta_1 < 1$, we have

$$I_k \le k^{-1} \Big(\frac{k+1}{t+1}\Big)^{\lambda_r \eta_1} \le 2^{\lambda_r \eta_1} k^{\lambda_r \eta_1 - 1} t^{-\lambda_r \eta_1} \le 2 \, t^{-\lambda_r \eta_1}.$$

From the above analysis, we conclude the desired result (36) with $C_2' = \eta_1 (M + \|x\| + 2)\big(2^{\theta} + \big(\frac{\theta}{e \lambda_r \eta_1 (1 - 2^{\theta-1})}\big)^{\frac{\theta}{1-\theta}}\big)$ when $\theta < 1$ and $C_2' = 2\eta_1 (M + \|x\| + 2)$ when $\theta = 1$.

5.5 Preliminary Error Analysis

Based on the above estimates, we can apply Lemma 12 to obtain an error bound.

Proposition 17 Under the assumptions of Theorem 5, for some $x \in \mathbb{R}^d$, for any $0 < \delta < 1$ and fixed $t \in \mathbb{N}$, with confidence at least $1 - \delta$ we have

$$\|x_{t+1} - x\| \le \begin{cases} C_2 \, t^{(1-2\theta)/2} \log\frac{2}{\delta}, & \text{when } \theta \in [1/3, 1), \\ C_2 \, t^{-\lambda_r \eta_1} \sqrt{\log(et)} \, \log\frac{2}{\delta}, & \text{when } \theta = 1, \end{cases} \qquad (37)$$

where $C_2$ is a positive constant independent of $t$ or $\delta$ given explicitly in the proof.

Proof To apply the Pinelis-Bernstein inequality to estimate $\big\|\sum_{k=1}^{t} \eta_k \Pi_{k+1}^t \chi_k\big\|$, we need the bounds $B$ and $L_t$. By Lemmas 14 and 16, we have

$$\sup_{1 \le k \le t} \|\eta_k \Pi_{k+1}^t \chi_k\| \le \begin{cases} C_2'(C_1' + 1) \, t^{(1-3\theta)/2}, & \text{when } \theta < 1, \\ C_2'(C_1' + 1) \, t^{-\lambda_r \eta_1} \sqrt{\log(et)}, & \text{when } \theta = 1. \end{cases} \qquad (38)$$

By Lemmas 15 and 14, we get

$$\sum_{k=1}^{t} \eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \begin{cases} C_3' \sum_{k=1}^{t} k^{-(3\theta-1)} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\}, & \text{when } \theta \in [1/3, 1), \\ C_3' \log(et) \sum_{k=1}^{t} k^{-2} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-1}\Big\}, & \text{when } \theta = 1, \end{cases}$$

where $C_3' = \big(\mathcal{E}(f_{\mathcal{H}}) + C_1'^2\big)\eta_1^2$. Applying part (b) of Lemma 10 with $\nu = 2\lambda_r \eta_1 < 1$, $q_1 = \theta$ and $q_2 = 3\theta - 1$, we have for $\theta < 1$,

$$\sum_{k=1}^{t-1} k^{-(3\theta-1)} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\} \le c_{\theta, 3\theta-1, 2\lambda_r \eta_1} \, t^{1-2\theta},$$

and for $\theta = 1$,

$$\sum_{k=1}^{t-1} k^{-2} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-1}\Big\} \le c'_{2, 2\lambda_r \eta_1} \, t^{-2\lambda_r \eta_1}.$$

Adding the term with $k = t$, we therefore get

$$\sum_{k=1}^{t} \eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \begin{cases} C_4 \, t^{1-2\theta}, & \text{when } \theta \in [1/3, 1), \\ C_4 \, t^{-2\lambda_r \eta_1} \log(et), & \text{when } \theta = 1, \end{cases} \qquad (39)$$

with $C_4 = C_3'\big(c_{\theta, 3\theta-1, 2\lambda_r \eta_1} + 1\big)$ when $\theta \in [1/3, 1)$ and $C_4 = C_3'\big(c'_{2, 2\lambda_r \eta_1} + 1\big)$ when $\theta = 1$. Applying Lemma 12 to the martingale difference sequence $\{\omega_k := \eta_k \Pi_{k+1}^t \chi_k\}_k$ with $B$ and $L_t$ given by (38) and (39) respectively, we know that with probability at least $1 - \delta$,

$$\sup_{1 \le j \le t} \Big\|\sum_{k=1}^{j} \eta_k \Pi_{k+1}^t \chi_k\Big\| \le \begin{cases} C_5 \, t^{(1-2\theta)/2} \log\frac{2}{\delta}, & \text{when } \theta \in [1/3, 1), \\ C_5 \, t^{-\lambda_r \eta_1} \sqrt{\log(et)} \, \log\frac{2}{\delta}, & \text{when } \theta = 1, \end{cases}$$

where $C_5 = 2\big(C_2'(C_1' + 1)/3 + \sqrt{C_4}\big)$. Putting this bound into (31) with $t$ replaced by $j$, and then applying Lemma 13 to bound the initial error, we get the desired result with $C_2 = C_0' + C_5$.

In the above procedure, we have used the rough bound (32) for $\|r_t\|$. This rough bound tends to $\infty$ as $t$ becomes large. In contrast, the bound provided in Proposition 17 tends to $0$ when $\theta \in (1/2, 1]$ and is much better, but it holds only with confidence. We shall use this refined bound to improve our estimates in the following subsection.

5.6 Improved Error Analysis

In this subsection, we prove our third main result by improving the preliminary confidence-based error bound in Proposition 17.

Proof of Theorem 5 When $\theta = 1$, our desired bound follows from (37) with $C_1 = 2C_2$. It remains to prove the case $\theta \in [1/2, 1)$. Let $T \in \mathbb{N}$. Applying Proposition 17 with $t = 1, \ldots, T$ and taking the union event, followed by rescaling, we know that there exists a subset $Z^T_{\delta}$ of $Z^T$ with measure at least $1 - \delta/2$ such that

$$\|r_t\| \le C_6 \log\frac{4}{\delta} \log T, \qquad t = 1, \ldots, T+1, \quad (z_1, \ldots, z_T) \in Z^T_{\delta}, \qquad (40)$$

where $C_6 = 2C_2 + \|r_1\|$.

Now we turn to the essential part of the proof. Define another martingale difference sequence $\{\tilde{\omega}_k\}_k$ by multiplying the one in the proof of Proposition 17 by the characteristic function of the event $\{\|r_k\| \le C_6 \log\frac{4}{\delta} \log T\}$:

$$\tilde{\omega}_k = \eta_k \Pi_{k+1}^T \chi_k \, \mathbf{1}_{\{\|r_k\| \le C_6 \log\frac{4}{\delta} \log T\}}.$$

From (36) and the multiplication by the characteristic function, we have

$$\sup_{1 \le k \le T} \|\tilde{\omega}_k\| \le C_2' C_6 \log\frac{4}{\delta} \log T \cdot T^{-\theta}. \qquad (41)$$

Notice that the characteristic function $\mathbf{1}_{\{\|r_k\| \le C_6 \log\frac{4}{\delta} \log T\}}$ is independent of $z_k$. Also, from the proof of Lemma 15, we know that for each $k \in \{1, \ldots, T\}$,

$$\mathbb{E}\big[\|\Pi_{k+1}^T \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{T} j^{-\theta}\Big\} \big(\mathcal{E}(f_{\mathcal{H}}) + \|r_k\|^2\big).$$

It follows, by setting $C_7 = \big(\mathcal{E}(f_{\mathcal{H}}) + C_6^2\big)\eta_1^2$, that

$$\sum_{k=1}^{T} \mathbb{E}\big[\|\tilde{\omega}_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le C_7 \log^2\frac{4}{\delta} \log^2 T \sum_{k=1}^{T} k^{-2\theta} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{T} j^{-\theta}\Big\}.$$

Applying part (b) of Lemma 10 yields

$$\sum_{k=1}^{T} \mathbb{E}\big[\|\tilde{\omega}_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le C_7 \big(c_{\theta, 2\theta, 2\lambda_r \eta_1} + 1\big) \log^2\frac{4}{\delta} \log^2 T \cdot T^{-\theta}.$$

Using this bound as $L_T$ and (41) as the bound $B$ in Lemma 12, we know that there exists another subset $\tilde{Z}^T_{\delta}$ of $Z^T$ with measure at least $1 - \delta/2$ such that for every $(z_1, \ldots, z_T) \in \tilde{Z}^T_{\delta}$, there holds

$$\Big\|\sum_{k=1}^{T} \tilde{\omega}_k\Big\| \le C_8 \, T^{-\theta/2} \log^2\frac{4}{\delta} \log T,$$

where $C_8 = 2\big(C_2' C_6 / 3 + \sqrt{C_7 (c_{\theta, 2\theta, 2\lambda_r \eta_1} + 1)}\big)$. This together with (40) tells us that for every $(z_1, \ldots, z_T) \in Z^T_{\delta} \cap \tilde{Z}^T_{\delta}$, there holds

$$\Big\|\sum_{k=1}^{T} \eta_k \Pi_{k+1}^T \chi_k\Big\| \le C_8 \, T^{-\theta/2} \log^2\frac{4}{\delta} \log T. \qquad (42)$$

The subset $Z^T_{\delta} \cap \tilde{Z}^T_{\delta}$ has measure at least $1 - \delta$. Therefore, we can put (42) into (31) and apply Lemma 13 to bound the initial error, which proves Theorem 5 for the case $\theta \in [1/2, 1)$ with the constant $C_1 = C_0' + C_8$.

6. Almost Sure Convergence

In this section, we prove the almost sure convergence of the randomized Kaczmarz algorithm. Recall that the almost sure convergence of a sequence of random variables $\{X_n\}$ towards $X$ means that

$$P\Big(\lim_{n \to \infty} X_n = X\Big) = 1,$$

or equivalently,

$$\lim_{n \to \infty} P\Big(\sup_{k \ge n} |X_k - X| > \varepsilon\Big) = 0 \quad \text{for any } \varepsilon > 0.$$

The Borel-Cantelli Lemma (see, e.g., Klenke, 2008) asserts for a sequence $(E_n)_n$ of events that if the sum of the probabilities is finite, $\sum_{n=1}^{\infty} P(E_n) < \infty$, then the probability that infinitely many of them occur is $0$, that is,

$$P\Big(\limsup_{n \to \infty} E_n\Big) = P\Big(\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k\Big) = 0.$$

The following lemma is an easy consequence of the Borel-Cantelli Lemma. We give the proof for completeness.

Lemma 18 Let $\{X_n\}$ be a sequence of random variables in some probability space and $\{\varepsilon_n\}$ be a sequence of positive numbers satisfying $\lim_{n \to \infty} \varepsilon_n = 0$. If

$$\sum_{n=1}^{\infty} P\big(|X_n - X| > \varepsilon_n\big) < \infty,$$

then $X_n$ converges to $X$ almost surely.

Proof Since $\lim_{n \to \infty} \varepsilon_n = 0$, for any $\varepsilon > 0$, there exists some $n_0 \in \mathbb{N}$ such that for all $k \ge n_0$, $\varepsilon_k < \varepsilon$. Thus, for $n \ge n_0$,

$$P\Big(\sup_{k \ge n} |X_k - X| > \varepsilon\Big) \le P\Big(\bigcup_{k \ge n} \big\{|X_k - X| > \varepsilon_k\big\}\Big) \le \sum_{k \ge n} P\big(|X_k - X| > \varepsilon_k\big).$$

Letting $n \to \infty$, one gets $\lim_{n \to \infty} P\big(\sup_{k \ge n} |X_k - X| > \varepsilon\big) = 0$. This proves the result.

Now we can apply Lemma 18 to prove our last main result.

Proof of Theorem 6 Set $\Lambda_t = t^{-\theta/2}$ when $\theta < 1$ and $\Lambda_t = t^{-\lambda_r \eta_1}$ when $\theta = 1$. By Theorem 5, we have for any $t$ and $0 < \delta_t < 1$,

$$P\Big(\Lambda_t^{\epsilon - 1} \|x_{t+1} - x\| > C_1 \Lambda_t^{\epsilon} \log^2\frac{4}{\delta_t} \log t\Big) \le \delta_t.$$

Choose $\delta_t = t^{-2}$ and $\varepsilon_t = C_1 \Lambda_t^{\epsilon} \log^2(4/\delta_t) \log t$. Obviously

$$\sum_{t=2}^{\infty} P\Big(\Lambda_t^{\epsilon - 1} \|x_{t+1} - x\| > \varepsilon_t\Big) \le \sum_{t=2}^{\infty} \delta_t < \infty$$

and $\varepsilon_t \le 16 \, C_1 \Lambda_t^{\epsilon} \log^3 t \to 0$ as $t \to \infty$. Then the conclusion of Theorem 6 follows from Lemma 18.

Remark 19 The above method of proof can be used to get a more quantitative estimate for the almost sure convergence of the Kaczmarz algorithm with noiseless random measurements (Chen and Powell, 2012). In that setting, $\eta_t \equiv 1$, $y_t = f_\rho(\psi_t)$ and $r = d$. It was shown in (Strohmer and Vershynin, 2009; Chen and Powell, 2012) that, with $q = 1 - \lambda_r$,

$$\mathbb{E}\|x_{t+1} - x\|^2 \le q^t \|r_1\|^2.$$

It follows from the Chebyshev inequality that for any $\epsilon \in (0, 1)$,

$$P\Big(q^{t(\epsilon-1)/2} \|x_{t+1} - x\| > q^{t\epsilon/2} t\Big) = P\Big(\|x_{t+1} - x\| > q^{t/2} t\Big) \le \frac{\mathbb{E}\big[\|x_{t+1} - x\|^2\big]}{q^t t^2}.$$

Thus, we get

$$P\Big(q^{t(\epsilon-1)/2} \|x_{t+1} - x\| > q^{t\epsilon/2} t\Big) \le \|r_1\|^2 t^{-2}.$$

Obviously, $q^{t\epsilon/2} t \to 0$ as $t \to \infty$, and $\sum_{t=1}^{\infty} \|r_1\|^2 t^{-2} < \infty$. Applying Lemma 18 with $\varepsilon_t = q^{t\epsilon/2} t$, we know that for any $\epsilon \in (0, 1)$,

$$\lim_{t \to \infty} (1 - \lambda_r)^{t(\epsilon-1)/2} \|x_{t+1} - x\| = 0 \quad \text{almost surely.}$$

7. Simulations and Discussions

In this section we provide some numerical simulations and further discussions on our error analysis. To illustrate our derived convergence rates and compare with the existing literature, we carry out numerical simulations corresponding to Example 2 with the same data distributions as in (Needell, 2010): $m = 200$, $d = 100$, $A \in \mathbb{R}^{200 \times 100}$ is a Gaussian matrix with each entry drawn independently from the standard normal distribution $N(0, 1)$, and $y \in \mathbb{R}^{200}$ is a Gaussian noise vector with each component drawn independently from the normal distribution with mean $0$ and standard deviation $0.02$. The measurement vectors $\psi_t = \frac{1}{\|\varphi_t\|}\varphi_t$ are drawn from the normalized rows of $A$ as in Example 2 and $\tilde{y}_t = y_t / \|\varphi_t\|$, with the true vector $x = 0$. We conduct 100 trials for each choice of the relaxation parameter sequence $\eta_t = 1$, $\eta_t = 1/\sqrt{t}$, $\eta_t = 1/t$. In each trial, algorithm (4) is run 100 times with random Gaussian initial vectors of norm $\|x_1\| = 0.02$.

Figure 1 depicts the error $\|x_{t+1} - x\|^2$ for $t = 1, \ldots, 1500$, averaged over the 100 trials and 100 initial vectors. The black line is a plot with the constant relaxation parameter sequence $\eta_t = 1$, which verifies the divergence of the algorithm, as proved in (Needell, 2010). The blue line is a plot with $\eta_t = 1/\sqrt{t}$, which hints at a slow convergence of the algorithm. The red line is a plot with $\eta_t = 1/t$, which confirms a faster convergence. The above simulations are consistent with our error analysis.

[Figure 1: Error of the relaxed randomized Kaczmarz algorithm with η_t = 1 (black line), η_t = 1/√t (blue line), and η_t = 1/t (red line).]
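A reduced version of the simulation described above can be sketched as follows; it is not the authors' code. The number of trials is lowered for speed, the noise level, initial norm and step-size choices follow the description in the text, and plotting is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, T, n_trials = 200, 100, 1500, 20
A = rng.standard_normal((m, d))
noise = 0.02 * rng.standard_normal(m)             # b = A x* + noise with x* = 0
row_norm = np.linalg.norm(A, axis=1)
probs = row_norm**2 / np.sum(row_norm**2)         # row selection as in Example 2
psis, ys = A / row_norm[:, None], noise / row_norm

def average_error(step_size):
    """Average squared error ||x_t - x*||^2 of scheme (4) over a few random starts."""
    errs = np.zeros(T)
    for _ in range(n_trials):
        x = rng.standard_normal(d)
        x *= 0.02 / np.linalg.norm(x)             # random initial vector, ||x_1|| = 0.02
        for t in range(1, T + 1):
            i = rng.choice(m, p=probs)
            x += step_size(t) * (ys[i] - psis[i] @ x) * psis[i]
            errs[t - 1] += np.sum(x**2) / n_trials   # x* = 0, so the error is ||x_t||^2
    return errs

for name, eta in [("eta_t = 1", lambda t: 1.0),
                  ("eta_t = 1/sqrt(t)", lambda t: t ** -0.5),
                  ("eta_t = 1/t", lambda t: 1.0 / t)]:
    print(name, average_error(eta)[-1])
```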

In this paper, a learning theory approach to the relaxed randomized Kaczmarz algorithm is presented. It yields new results and observations, including the necessary and sufficient condition (9), stated in Theorem 3, for the convergence in expectation when the sampling process is noisy or nonlinear. For noise-free and linear sampling processes (that is, $\mathcal{E}(f_{\mathcal{H}}) = 0$), we can see from Remark 11 with $\eta_t \equiv 1$ that

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|x_{T+1} - x\|^2\big] \le \|x_1 - x\|^2 (1 - \lambda_r)^T.$$

This exponential convergence result was proved in (Strohmer and Vershynin, 2009) for Example 2 under the restriction that the matrix $A$ has full column rank, where the number $1 - \lambda_r$ is replaced by a quantity involving $\|A^{-1}\| := \inf\{M : M \|Ax\| \ge \|x\| \text{ for all } x\}$. Our result is more general, being valid for underdetermined systems with $\|A^{-1}\| = \infty$.

In the framework of Kaczmarz algorithms, we consider online learning algorithms associated with the least squares loss. It would be interesting to extend our study to algorithms associated with more general loss functions (Ying and Zhou, 2006) such as the hinge loss, and to consider error analysis without requiring the approximation error (Ying and Zhou, 2006) to tend to zero.

Acknowledgments

The work described in this paper is supported partially by the Research Grants Council of Hong Kong [Project No. CityU ]. The authors would like to thank the referees for constructive suggestions and Dr. Xin Guo for helping with the numerical simulations. The corresponding author is Ding-Xuan Zhou.

References

X. Chen and A. Powell. Almost sure convergence for the Kaczmarz algorithm with random measurements. Journal of Fourier Analysis and Applications, 18, 2012.

T. Hu, J. Fan, Q. Wu, and D. X. Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13, 2015.

J. Johnston. Econometric Methods. McGraw Hill, New York.

S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres A, 35, 1937.

A. Klenke. Probability Theory: A Comprehensive Course. Springer-Verlag, London, 2008.

D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50, 2010.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Annals of Probability, 22, 1994.

Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8, 2008.

S. Smale and Y. Yao. Online learning algorithms. Foundations of Computational Mathematics, 6, 2005.

S. Smale and D. X. Zhou. Online learning with Markov sampling. Analysis and Applications, 7:87-113, 2009.

T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15:262-278, 2009.

P. Tarrès and Y. Yao. Online learning as stochastic approximations of regularization paths. IEEE Transactions on Information Theory, 60, 2014.

V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

Y. Ying and D. X. Zhou. Online regularized classification algorithms. IEEE Transactions on Information Theory, 52, 2006.

A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM Journal on Matrix Analysis and Applications, 34, 2013.


More information

SPECTRAL THEOREM FOR SYMMETRIC OPERATORS WITH COMPACT RESOLVENT

SPECTRAL THEOREM FOR SYMMETRIC OPERATORS WITH COMPACT RESOLVENT SPECTRAL THEOREM FOR SYMMETRIC OPERATORS WITH COMPACT RESOLVENT Abstract. These are the letcure notes prepared for the workshop on Functional Analysis and Operator Algebras to be held at NIT-Karnataka,

More information

Lecture 3. Random Fourier measurements

Lecture 3. Random Fourier measurements Lecture 3. Random Fourier measurements 1 Sampling from Fourier matrices 2 Law of Large Numbers and its operator-valued versions 3 Frames. Rudelson s Selection Theorem Sampling from Fourier matrices Our

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,

More information

Abstract. 1. Introduction

Abstract. 1. Introduction Journal of Computational Mathematics Vol.28, No.2, 2010, 273 288. http://www.global-sci.org/jcm doi:10.4208/jcm.2009.10-m2870 UNIFORM SUPERCONVERGENCE OF GALERKIN METHODS FOR SINGULARLY PERTURBED PROBLEMS

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Invertibility of symmetric random matrices

Invertibility of symmetric random matrices Invertibility of symmetric random matrices Roman Vershynin University of Michigan romanv@umich.edu February 1, 2011; last revised March 16, 2012 Abstract We study n n symmetric random matrices H, possibly

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

Are There Sixth Order Three Dimensional PNS Hankel Tensors?

Are There Sixth Order Three Dimensional PNS Hankel Tensors? Are There Sixth Order Three Dimensional PNS Hankel Tensors? Guoyin Li Liqun Qi Qun Wang November 17, 014 Abstract Are there positive semi-definite PSD) but not sums of squares SOS) Hankel tensors? If the

More information

Chapter 4. Measure Theory. 1. Measure Spaces

Chapter 4. Measure Theory. 1. Measure Spaces Chapter 4. Measure Theory 1. Measure Spaces Let X be a nonempty set. A collection S of subsets of X is said to be an algebra on X if S has the following properties: 1. X S; 2. if A S, then A c S; 3. if

More information

Fredholm Theory. April 25, 2018

Fredholm Theory. April 25, 2018 Fredholm Theory April 25, 208 Roughly speaking, Fredholm theory consists of the study of operators of the form I + A where A is compact. From this point on, we will also refer to I + A as Fredholm operators.

More information

On the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables

On the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables On the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables Deli Li 1, Yongcheng Qi, and Andrew Rosalsky 3 1 Department of Mathematical Sciences, Lakehead University,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

1 Regression with High Dimensional Data

1 Regression with High Dimensional Data 6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

Weighted Sums of Orthogonal Polynomials Related to Birth-Death Processes with Killing

Weighted Sums of Orthogonal Polynomials Related to Birth-Death Processes with Killing Advances in Dynamical Systems and Applications ISSN 0973-5321, Volume 8, Number 2, pp. 401 412 (2013) http://campus.mst.edu/adsa Weighted Sums of Orthogonal Polynomials Related to Birth-Death Processes

More information

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define Homework, Real Analysis I, Fall, 2010. (1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define ρ(f, g) = 1 0 f(x) g(x) dx. Show that

More information

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals Acta Applicandae Mathematicae 78: 145 154, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 145 Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals M.

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Some limit theorems on uncertain random sequences

Some limit theorems on uncertain random sequences Journal of Intelligent & Fuzzy Systems 34 (218) 57 515 DOI:1.3233/JIFS-17599 IOS Press 57 Some it theorems on uncertain random sequences Xiaosheng Wang a,, Dan Chen a, Hamed Ahmadzade b and Rong Gao c

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

Scaling Limits of Waves in Convex Scalar Conservation Laws under Random Initial Perturbations

Scaling Limits of Waves in Convex Scalar Conservation Laws under Random Initial Perturbations Scaling Limits of Waves in Convex Scalar Conservation Laws under Random Initial Perturbations Jan Wehr and Jack Xin Abstract We study waves in convex scalar conservation laws under noisy initial perturbations.

More information

Your first day at work MATH 806 (Fall 2015)

Your first day at work MATH 806 (Fall 2015) Your first day at work MATH 806 (Fall 2015) 1. Let X be a set (with no particular algebraic structure). A function d : X X R is called a metric on X (and then X is called a metric space) when d satisfies

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Convergence rates of spectral methods for statistical inverse learning problems

Convergence rates of spectral methods for statistical inverse learning problems Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control

Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control Chun-Hua Guo Dedicated to Peter Lancaster on the occasion of his 70th birthday We consider iterative methods for finding the

More information

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS SPRING 006 PRELIMINARY EXAMINATION SOLUTIONS 1A. Let G be the subgroup of the free abelian group Z 4 consisting of all integer vectors (x, y, z, w) such that x + 3y + 5z + 7w = 0. (a) Determine a linearly

More information

Time Series 3. Robert Almgren. Sept. 28, 2009

Time Series 3. Robert Almgren. Sept. 28, 2009 Time Series 3 Robert Almgren Sept. 28, 2009 Last time we discussed two main categories of linear models, and their combination. Here w t denotes a white noise: a stationary process with E w t ) = 0, E

More information

FUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES. Yılmaz Yılmaz

FUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES. Yılmaz Yılmaz Topological Methods in Nonlinear Analysis Journal of the Juliusz Schauder Center Volume 33, 2009, 335 353 FUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES Yılmaz Yılmaz Abstract. Our main interest in this

More information

1 Math 241A-B Homework Problem List for F2015 and W2016

1 Math 241A-B Homework Problem List for F2015 and W2016 1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let

More information

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Jinfeng Yi JINFENGY@US.IBM.COM IBM Thomas J. Watson Research Center, Yorktown

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Viscosity Iterative Approximating the Common Fixed Points of Non-expansive Semigroups in Banach Spaces

Viscosity Iterative Approximating the Common Fixed Points of Non-expansive Semigroups in Banach Spaces Viscosity Iterative Approximating the Common Fixed Points of Non-expansive Semigroups in Banach Spaces YUAN-HENG WANG Zhejiang Normal University Department of Mathematics Yingbing Road 688, 321004 Jinhua

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization

An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization H. Mansouri M. Zangiabadi Y. Bai C. Roos Department of Mathematical Science, Shahrekord University, P.O. Box 115, Shahrekord,

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

On the simplest expression of the perturbed Moore Penrose metric generalized inverse

On the simplest expression of the perturbed Moore Penrose metric generalized inverse Annals of the University of Bucharest (mathematical series) 4 (LXII) (2013), 433 446 On the simplest expression of the perturbed Moore Penrose metric generalized inverse Jianbing Cao and Yifeng Xue Communicated

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

SPATIAL STRUCTURE IN LOW DIMENSIONS FOR DIFFUSION LIMITED TWO-PARTICLE REACTIONS

SPATIAL STRUCTURE IN LOW DIMENSIONS FOR DIFFUSION LIMITED TWO-PARTICLE REACTIONS The Annals of Applied Probability 2001, Vol. 11, No. 1, 121 181 SPATIAL STRUCTURE IN LOW DIMENSIONS FOR DIFFUSION LIMITED TWO-PARTICLE REACTIONS By Maury Bramson 1 and Joel L. Lebowitz 2 University of

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information