Learning Theory of Randomized Kaczmarz Algorithm


Journal of Machine Learning Research 16 (2015) 3341-3365    Submitted 6/14; Revised 4/15; Published 12/15

Junhong Lin
Ding-Xuan Zhou
Department of Mathematics
City University of Hong Kong
83 Tat Chee Avenue, Kowloon, Hong Kong, China

Editor: Gabor Lugosi

©2015 Junhong Lin and Ding-Xuan Zhou.

Abstract

A relaxed randomized Kaczmarz algorithm is investigated in a least squares regression setting by a learning theory approach. When the sampling values are accurate and the regression function (conditional means) is linear, such an algorithm has been well studied in the community of non-uniform sampling. In this paper, we are mainly interested in the different case of either noisy random measurements or a nonlinear regression function. In this case, we show that relaxation is needed. A necessary and sufficient condition on the sequence of relaxation parameters (step sizes) for the convergence of the algorithm in expectation is presented. Moreover, polynomial rates of convergence, both in expectation and in probability, are provided explicitly. As a result, the almost sure convergence of the algorithm is proved by applying the Borel-Cantelli Lemma.

Keywords: learning theory, relaxed randomized Kaczmarz algorithm, online learning, space of homogeneous linear functions, almost sure convergence

1. Introduction

The Kaczmarz method is an iterative projection algorithm. It was originally proposed for solving overdetermined systems of linear equations, and has been adapted to image reconstruction, signal processing and numerous other applications. Given a matrix $A \in \mathbb{R}^{m \times d}$ and a vector $b \in \mathbb{R}^m$, the classical Kaczmarz algorithm (Kaczmarz, 1937) approximates a solution of the linear system $Ax = b$ by the iterative scheme

$$x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k \rangle}{\|a_i\|^2} \, a_i, \qquad (1)$$

where $i = k \bmod m$, $a_i^T$ is the $i$-th row of the matrix $A$, and $x_1 \in \mathbb{R}^d$ is an initial vector. Here $\langle \cdot, \cdot \rangle$ is the inner product in $\mathbb{R}^d$ and $\|\cdot\|$ the induced norm. The convergence of the Kaczmarz algorithm is well understood (Kaczmarz, 1937), and its convergence rate depends on the order of the rows of $A$. To avoid this dependence, a randomized Kaczmarz algorithm was considered in (Strohmer and Vershynin, 2009) by setting the probability of selecting a row to be proportional to its squared norm. It takes the form

$$x_{k+1} = x_k + \frac{b_{p(i)} - \langle a_{p(i)}, x_k \rangle}{\|a_{p(i)}\|^2} \, a_{p(i)}, \qquad (2)$$

where $p(i)$ takes values in $\{1, \ldots, m\}$ with probability $\|a_{p(i)}\|^2 / \|A\|_F^2$, with $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{d} a_{ij}^2$ being the squared Frobenius norm of $A$.
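As an illustration of the two schemes above, the following NumPy sketch implements the cyclic update (1) and the row-norm-weighted randomized update (2). It is not part of the original paper; the test matrix, the iteration budget and the consistent right-hand side are illustrative choices.

```python
import numpy as np

def kaczmarz(A, b, n_iter=1000, x0=None, rng=None, randomized=True):
    """Cyclic update (1) or randomized update (2) for the system A x = b."""
    m, d = A.shape
    x = np.zeros(d) if x0 is None else np.asarray(x0, dtype=float).copy()
    rng = np.random.default_rng() if rng is None else rng
    row_sq = np.sum(A**2, axis=1)          # squared row norms ||a_i||^2
    probs = row_sq / row_sq.sum()          # P(row i) = ||a_i||^2 / ||A||_F^2
    for k in range(n_iter):
        i = rng.choice(m, p=probs) if randomized else k % m
        # project the current iterate onto the hyperplane <a_i, x> = b_i
        x += (b[i] - A[i] @ x) / row_sq[i] * A[i]
    return x

# illustrative consistent system: fast decay of the error is expected
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
x_star = rng.standard_normal(100)
x_hat = kaczmarz(A, A @ x_star, n_iter=5000, rng=rng)
print(np.linalg.norm(x_hat - x_star))
```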

Exponential convergence of the expected error $\mathbb{E}\|x_{k+1} - x\|^2$ of the randomized Kaczmarz algorithm was proved in (Strohmer and Vershynin, 2009). When noise exists in the sample values, $b = Ax + \xi$ with $\xi$ a noise vector, a bound for the expected error was obtained in (Needell, 2010), and divergence was proved when the variance of $\xi$ is positive. The error bound consists of an exponentially convergent part and a noise-driven term proportional to the noise level $\max_i |\xi_i| / \|a_i\|$.

The randomized Kaczmarz algorithm was generalized in (Chen and Powell, 2012) to a setting with a sequence of independent random measurement vectors $\{\varphi_t \in \mathbb{R}^d\}_t$ as

$$x_{k+1} = x_k + \frac{y_k - \langle \varphi_k, x_k \rangle}{\|\varphi_k\|^2} \, \varphi_k. \qquad (3)$$

When the measurements have no noise, $y_k = \langle \varphi_k, x \rangle$, almost sure convergence was proved and quantitative error bounds were provided in (Chen and Powell, 2012).

When the linear system $Ax = b$ is overdetermined ($m > d$) and has no solution, the Kaczmarz algorithm can be modified by introducing a relaxation parameter $\eta_k > 0$ in front of $\frac{b_i - \langle a_i, x_k \rangle}{\|a_i\|^2} a_i$, and the output sequence $\{x_k\}$ converges to the least squares solution $\arg\min_{x \in \mathbb{R}^d} \|Ax - b\|^2$ when $\lim_{k \to \infty} \eta_k = 0$. See, e.g., (Zouzias and Freris, 2013) and references therein.

Setting $\psi_k = \frac{1}{\|\varphi_k\|} \varphi_k \in S^{d-1}$ and $\tilde{y}_k = \frac{1}{\|\varphi_k\|} y_k$ yields an equivalent form of the scheme (3) as $x_{k+1} = x_k + \{\tilde{y}_k - \langle \psi_k, x_k \rangle\} \psi_k$. This form is similar to those in the literature of online learning for least squares regression, and together with the relaxed Kaczmarz method (Zouzias and Freris, 2013) it motivates us to consider the following relaxed randomized Kaczmarz algorithm.

Definition 1 With normalized measurement vectors $\{\psi_t \in S^{d-1}\}_t$ and sample values $\{\tilde{y}_t \in \mathbb{R}\}_t$, the relaxed randomized Kaczmarz algorithm is defined by

$$x_{t+1} = x_t + \eta_t \{\tilde{y}_t - \langle \psi_t, x_t \rangle\} \psi_t, \qquad t = 1, 2, \ldots, \qquad (4)$$

where $x_1 \in \mathbb{R}^d$ is an initial vector and $\{\eta_t\}$ is a sequence of relaxation parameters or step sizes.

The purpose of this paper is to provide a learning theory analysis for the relaxed randomized Kaczmarz algorithm. We shall assume throughout the paper that $0 < \eta_t \le 2$ for each $t \in \mathbb{N}$ and that the sequence $\{z_t := (\psi_t, \tilde{y}_t)\}_{t \in \mathbb{N}}$ is independently drawn according to a Borel probability measure $\rho$ on $Z := S^{d-1} \times \mathbb{R}$ which satisfies $\mathbb{E}[\tilde{y}^2] < \infty$.

Our first goal is to deal with the noisy setting for the randomized Kaczmarz algorithm. When the sampling process is noisy or nonlinear (to be defined below), we show that $\{x_t\}_t$ converges to some $x \in \mathbb{R}^d$ in expectation if and only if $\lim_{t \to \infty} \eta_t = 0$ and $\sum_{t=1}^{\infty} \eta_t = \infty$. Moreover, the rate of convergence in expectation cannot be too fast. This tells us that the relaxation parameter is necessary for convergence in the noisy setting. When $\{\eta_t\}_t$ takes the form $\eta_t = \eta_1 t^{-\theta}$, we provide convergence rates in expectation and in confidence, and prove the almost sure convergence.
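The following sketch, again not from the paper, implements the relaxed scheme (4) of Definition 1 with the polynomial step sizes $\eta_t = \eta_1 t^{-\theta}$ studied below; the sampling model, the values of $\eta_1$, $\theta$, and the noise level are illustrative assumptions.

```python
import numpy as np

def relaxed_rk(sample, d, T, eta1=0.5, theta=0.75, x1=None):
    """Relaxed randomized Kaczmarz (4): x_{t+1} = x_t + eta_t (y_t - <psi_t, x_t>) psi_t,
    where sample() returns a pair (psi_t, y_t) with ||psi_t|| = 1."""
    x = np.zeros(d) if x1 is None else np.asarray(x1, dtype=float).copy()
    for t in range(1, T + 1):
        psi, y = sample()
        eta = eta1 * t ** (-theta)          # relaxation parameter / step size
        x += eta * (y - psi @ x) * psi
    return x

# illustrative noisy linear model: y = <x*, psi> + Gaussian noise
rng = np.random.default_rng(0)
d = 50
x_star = rng.standard_normal(d)

def sample():
    psi = rng.standard_normal(d)
    psi /= np.linalg.norm(psi)              # normalize onto the sphere S^{d-1}
    return psi, psi @ x_star + 0.02 * rng.standard_normal()

x_hat = relaxed_rk(sample, d, T=20000)
print(np.linalg.norm(x_hat - x_star))
```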

Such results were presented in the case of no noise in (Strohmer and Vershynin, 2009; Chen and Powell, 2012) and are new in the noisy setting.

Our second goal is to give the first almost sure convergence result in online learning for least squares regression when regularization is not needed. Such a result can be found in (Tarrès and Yao, 2014) when regularization is imposed, while the convergence in expectation without regularization was proved in (Ying and Pontil, 2008). We also present the first consistency result for online learning when the approximation error (to be defined below) does not tend to zero.

2. Main Results

To introduce our learning theory approach to the relaxed randomized Kaczmarz algorithm (4), we decompose the probability measure $\rho$ on $Z = S^{d-1} \times \mathbb{R}$ into its marginal distribution $\rho_X$ on $X := S^{d-1}$ and the conditional distributions $\rho(\cdot \mid \psi)$ at $\psi \in X$. The conditional means define the regression function $f_\rho : X \to \mathbb{R}$ as

$$f_\rho(\psi) = \int_{\mathbb{R}} \tilde{y} \, d\rho(\tilde{y} \mid \psi), \qquad \psi \in X. \qquad (5)$$

The hypothesis space for the Kaczmarz algorithm (4) consists of homogeneous linear functions,

$$\mathcal{H} = \big\{ f_x \in L^2_{\rho_X} : x \in \mathbb{R}^d \big\}, \qquad \text{where } f_x(\psi) := \langle x, \psi \rangle, \ \psi \in X. \qquad (6)$$

Definition 2 The sampling process associated with $\rho$ is said to be noise-free if $\tilde{y} = f_\rho(\psi)$ almost surely. Otherwise, it is called noisy. It is said to be linear if $f_\rho \in \mathcal{H}$ as a function in $L^2_{\rho_X}$. Otherwise, it is called nonlinear.

The main difference between our analysis in this paper and that in the literature (Strohmer and Vershynin, 2009; Needell, 2010; Chen and Powell, 2012) lies in the setting when the sampling process is either noisy or nonlinear. These two situations can be handled simultaneously by means of the least squares generalization error $\mathcal{E}(f) = \int_Z (\tilde{y} - f(\psi))^2 d\rho$, a well developed concept in learning theory. The assumption $\mathbb{E}[\tilde{y}^2] < \infty$ on $\rho$ ensures $f_\rho \in L^2_{\rho_X}$ and $\mathcal{E}(f_\rho) < \infty$. The noise-free condition can be stated as $\mathcal{E}(f_\rho) = 0$. It is well known that the regression function minimizes $\mathcal{E}(f)$ among all the (square integrable with respect to $\rho_X$) functions $f \in L^2_{\rho_X}$, and satisfies

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|^2_{L^2_{\rho_X}} = \int_X \big(f(\psi) - f_\rho(\psi)\big)^2 d\rho_X. \qquad (7)$$

Since the hypothesis space $\mathcal{H}$ is a finite dimensional subspace of $L^2_{\rho_X}$, the continuous functional $\mathcal{E}(f)$ achieves a minimizer

$$f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}} \mathcal{E}(f). \qquad (8)$$

From (7) we see that $f_{\mathcal{H}}$ is the best approximation of $f_\rho$ in the subspace $\mathcal{H}$. It is unique as the orthogonal projection of $f_\rho$ onto $\mathcal{H}$. It can be written as $f_{\mathcal{H}} = f_x$ for some $x \in \mathbb{R}^d$, but such a vector $x$ is not necessarily unique.

The linear condition can be stated as $f_\rho = f_{\mathcal{H}}$, or $f_\rho \in \mathcal{H}$, as functions in $L^2_{\rho_X}$. So we see that the sampling process is noisy or nonlinear if and only if $\mathcal{E}(f_{\mathcal{H}}) > 0$.

Now we can state our first main result, to be proved in Section 4, which gives a characterization of the convergence of $\{x_t\}_t$ to some $x \in \mathbb{R}^d$ in expectation.

Theorem 3 Define the sequence $\{x_t\}_t$ by (4). Assume $\mathcal{E}(f_{\mathcal{H}}) > 0$. Then we have the limit $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 = 0$ for some $x \in \mathbb{R}^d$ if and only if

$$\lim_{t \to \infty} \eta_t = 0 \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t = \infty. \qquad (9)$$

In this case, we have

$$\sum_{T=1}^{\infty} \sqrt{\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2} = \infty. \qquad (10)$$

Compared with the result on exponential convergence in expectation in the linear case without noise (Strohmer and Vershynin, 2009), the somewhat negative result (10) tells us that in the noisy setting the convergence in expectation cannot be as fast as $\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = O(T^{-\theta})$ for any $\theta > 2$. But for $\theta < 1$, such learning rates can be achieved by taking $\eta_t = \eta_1 t^{-\theta}$, as shown in the following second main result, to be proved in Section 4.

Theorem 4 Let $\eta_t = \eta_1 t^{-\theta}$ for some $\theta \in (0, 1]$ and $\eta_1 \in (0, 1)$. Define the sequence $\{x_t\}_t$ by (4). Then for some $x \in \mathbb{R}^d$ we have

$$\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 \le \begin{cases} C_0 \, T^{-\theta}, & \text{if } \theta < 1, \\ C_0 \, T^{-\lambda_r \eta_1}, & \text{if } \theta = 1, \end{cases} \qquad (11)$$

where $C_0$ is a constant independent of $T \in \mathbb{N}$ (given explicitly in the proof) and $\lambda_r$ is the smallest positive eigenvalue of the covariance matrix $C_{\rho_X}$ of the probability measure $\rho_X$ defined by

$$C_{\rho_X} = \mathbb{E}_{\rho_X}[\psi \psi^T] = \int_X \psi \psi^T d\rho_X. \qquad (12)$$

Our third main result is the following confidence-based estimate for the error, which will be proved in Section 5.

Theorem 5 Assume that for some constant $M > 0$, $|\tilde{y}| \le M$ almost surely. Let $\theta \in [1/2, 1]$, $\eta_t = \eta_1 t^{-\theta}$ with $0 < \eta_1 < \min\{1, \frac{1}{2\lambda_r}\}$, and $T \in \mathbb{N}$. Then for some $x \in \mathbb{R}^d$ and for any $0 < \delta < 1$, with confidence at least $1 - \delta$ we have

$$\|x_{T+1} - x\| \le \begin{cases} C_1 \, T^{-\theta/2} \log^2\frac{4}{\delta} \log T, & \text{when } \theta \in [1/2, 1), \\ C_1 \, T^{-\lambda_r \eta_1} \log^2\frac{4}{\delta} \log T, & \text{when } \theta = 1, \end{cases} \qquad (13)$$

where $C_1$ is a positive constant independent of $T$ or $\delta$ (given explicitly in the proof).
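The rates in Theorems 4 and 5 are governed by $\lambda_r$, the smallest positive eigenvalue of the covariance matrix $C_{\rho_X}$ in (12). The hedged sketch below, which is not part of the paper, estimates $\lambda_r$ from a finite sample of normalized measurement vectors; the sample size and the tolerance used to discard numerically zero eigenvalues are illustrative choices.

```python
import numpy as np

def smallest_positive_eigenvalue(psis, tol=1e-10):
    """Estimate lambda_r from rows psis (unit vectors in S^{d-1}):
    form the empirical covariance matrix and drop numerically zero eigenvalues."""
    C = psis.T @ psis / psis.shape[0]       # empirical version of E[psi psi^T]
    eigs = np.linalg.eigvalsh(C)            # symmetric matrix, eigvalsh is appropriate
    positive = eigs[eigs > tol]
    return positive.min()

# illustrative check with Example 2 below: psi drawn from normalized rows of A
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
rows = A / np.linalg.norm(A, axis=1, keepdims=True)
probs = np.sum(A**2, axis=1) / np.sum(A**2)
idx = rng.choice(200, size=50000, p=probs)
print(smallest_positive_eigenvalue(rows[idx]))
```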

Our last main result is about the almost sure convergence of the algorithm, which will be proved in Section 6.

Theorem 6 Under the assumptions of Theorem 5, for any $\epsilon \in (0, 1]$ the following holds for some $x \in \mathbb{R}^d$:

(A) When $1/2 \le \theta < 1$, $\lim_{t \to \infty} t^{\theta(1-\epsilon)/2} \|x_{t+1} - x\| = 0$ almost surely.

(B) When $\theta = 1$, $\lim_{t \to \infty} t^{\lambda_r \eta_1 (1-\epsilon)} \|x_{t+1} - x\| = 0$ almost surely.

Let us demonstrate our setting by two examples without noise considered in the literature. The first example appeared in (Chen and Powell, 2012).

Example 1 If the random measurement vectors $\{\varphi_t\}_{t=1}^{\infty}$ are independent and nonzero almost surely, then the normalized vectors $\{\psi_k = \frac{1}{\|\varphi_k\|}\varphi_k \in S^{d-1}\}$ are independent.

The second example is from (Strohmer and Vershynin, 2009).

Example 2 Define the random vector $\varphi$ to be a normalized row of a full rank matrix $A \in \mathbb{R}^{m \times d}$, with probabilities given by

$$\varphi = \frac{a_j}{\|a_j\|} \quad \text{with probability } \frac{\|a_j\|^2}{\|A\|_F^2}, \qquad j = 1, \ldots, m.$$

It was shown in (Strohmer and Vershynin, 2009) that the smallest eigenvalue of the covariance matrix is positive:

$$\lambda_{\min}\big(\mathbb{E}[\varphi \varphi^T]\big) \ge \frac{1}{\|A\|_F^2 \|A^{-1}\|^2}.$$

It means $r = d$ and $\lambda_r \ge \frac{1}{\|A\|_F^2 \|A^{-1}\|^2}$.

The third example is on homoskedastic models (Johnston).

Example 3 In the literature of homoskedastic models, it is assumed that the sample values $\{y_t\}_t$ satisfy $y_t = \langle x, \psi_t \rangle + \xi_t$ with $\{\xi_t\}_t$ being independently drawn according to a zero mean probability measure. This corresponds to the special case when the conditional distribution $\rho(\cdot \mid \psi)$ is the distribution of $f_\rho(\psi) + \xi$. Our setting induced by $\rho$ is more general and allows heteroskedastic models.

3. Connections to Learning Theory

The relaxed randomized Kaczmarz algorithm defined by (4) may be rewritten as an online learning algorithm with output functions from the hypothesis space (6), and our main results stated in the last section are new even in the online learning literature. To demonstrate this, we denote the $t$-th output function $F_t$ on $X$ induced by the vector $x_t$ as $F_t(\psi) = \langle x_t, \psi \rangle$ for $\psi \in X$. Then the iteration relation (4) gives

$$F_{t+1} = F_t + \eta_t \{\tilde{y}_t - F_t(\psi_t)\} \langle \cdot, \psi_t \rangle. \qquad (14)$$

This is a special kernel-based least squares online learning algorithm. Here a Mercer kernel on a metric space $X$ means a function $K : X \times X \to \mathbb{R}$ which is continuous, symmetric, and such that the matrix $(K(x_i, x_j))_{i,j=1}^{l}$ is positive semidefinite for any finite subset $\{x_i\}_{i=1}^{l} \subset X$.

It generates a reproducing kernel Hilbert space $(\mathcal{H}_K, \|\cdot\|_K)$ spanned by the set of fundamental functions $\{K(\cdot, x) : x \in X\}$ with the inner product $\langle K(\cdot, x), K(\cdot, y) \rangle_K = K(x, y)$.

A least squares regularized online learning algorithm in $\mathcal{H}_K$ is defined with $\{(\psi_t, \tilde{y}_t) \in X \times \mathbb{R}\}_t$ drawn independently according to a probability measure on $Z = X \times \mathbb{R}$ as

$$F_{t+1} = F_t - \eta_t \big\{ (F_t(\psi_t) - \tilde{y}_t) K(\cdot, \psi_t) + \lambda F_t \big\}, \qquad t = 1, 2, \ldots, \qquad (15)$$

where $\lambda \ge 0$ is a regularization parameter. The consistency of the online learning algorithm (15) is well understood when the approximation error $D(\lambda)$ tends to zero as $\lambda \to 0$.

Definition 7 The approximation error (or regularization error) of the pair $(\rho, K)$ is defined for $\lambda > 0$ as

$$D(\lambda) = \inf_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^2 \big\} = \inf_{f \in \mathcal{H}_K} \big\{ \|f - f_\rho\|^2_{L^2_{\rho_X}} + \lambda \|f\|_K^2 \big\}. \qquad (16)$$

When $\lambda > 0$ (with regularization) and $\lim_{\lambda \to 0} D(\lambda) = 0$, the error $\|F_{T+1} - f_\rho\|_{L^2_{\rho_X}}$ in expectation and in confidence was bounded in (Smale and Yao, 2005; Ying and Zhou, 2006; Smale and Zhou, 2009; Tarrès and Yao, 2014) by means of the decay of $D(\lambda)$ and $T$. The error analysis was done in (Ying and Pontil, 2008) without regularization ($\lambda = 0$) but under the approximation error condition $\lim_{\lambda \to 0} D(\lambda) = 0$. The error $\|F_{T+1} - f_\rho\|_K$ with the $\mathcal{H}_K$-metric was also analyzed when $f_\rho \in \mathcal{H}_K$.

If we take the kernel to be the linear one, $K(x, y) = \langle x, y \rangle$ with $X = \mathbb{R}^d$, and assume that the marginal distribution $\rho_X$ is supported on $X = S^{d-1}$, then $\psi_t \in S^{d-1}$ almost surely. Setting $\lambda = 0$, we see that the relaxed randomized Kaczmarz algorithm expressed in the form (14) is the least squares online learning algorithm (15) without regularization. So the error analysis from (Ying and Pontil, 2008) applies, but the condition $\lim_{\lambda \to 0} D(\lambda) = 0$ is required for the consistency in $L^2_{\rho_X}$, and even stronger conditions (stronger than $f_\rho \in \mathcal{H}_K$) are needed for the consistency in the $\mathcal{H}_K$-metric. Notice that for the linear kernel, $\|x\| = \|\langle \cdot, x \rangle\|_K$. So the error analysis carried out in this paper provides bounds for the error $\|F_{T+1} - f_{\mathcal{H}}\|_K$ without the condition $\lim_{\lambda \to 0} D(\lambda) = 0$. Such results cannot be found in the literature of online learning. It leads to the problem of carrying out similar error analysis for more general online learning algorithms associated with more general kernels.

Moreover, the best convergence rate in expectation of the general kernel-based least squares online learning algorithm in the literature (Smale and Yao, 2005; Ying and Zhou, 2006; Smale and Zhou, 2009; Tarrès and Yao, 2014; Hu et al., 2015) is $O(T^{-1/2})$. Theorem 4 demonstrates that the special online learning algorithm (4) has convergence rates of type $O(T^{-1+\epsilon})$ for any $\epsilon > 0$, and even of type $O(T^{-1} \log T)$ as shown in Theorem 8 below, which is a great improvement.

Note that there is a gap between the negative result (10) and the positive one (11), which leads to the natural question whether learning rates of type $\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 = O(T^{-\theta})$ are possible for $1 < \theta \le 2$. We conjecture that this is impossible for a general probability measure $\rho$, but a noise condition might help. The case $\theta = 1$ with a slight logarithmic modification, $O(T^{-1} \log T)$, can be achieved by imposing a minor restriction on the step size in the following theorem, which will be proved in the next section. (The authors thank Dr. Yiming Ying for pointing out this result.)

Theorem 8 Let $\lambda_r$ be as in Theorem 4 and $\eta_t = \frac{1}{\lambda_r (t + t_0)}$ for some $t_0 \in \mathbb{N}$ such that $t_0 \lambda_r \ge 1$. Define the sequence $\{x_t\}_t$ by (4). Then for some $x \in \mathbb{R}^d$, we have

$$\mathbb{E}_{z_1,\ldots,z_T} \|x_{T+1} - x\|^2 \le C_3 (T + t_0)^{-1} \log T,$$

where $C_3$ is a constant independent of $T \in \mathbb{N}$ given explicitly in the proof.

4. Convergence in Expectation

In this section we prove our main results on convergence in expectation. To this end, we need some preliminary analysis. Recall the function $f_{\mathcal{H}}$ defined by (8). It equals $f_x$ for some $x \in \mathbb{R}^d$. As the orthogonal projection of $f_\rho$ onto the finite dimensional subspace $\mathcal{H}$ in the Hilbert space $L^2_{\rho_X}$, it satisfies

$$\langle f_\rho - f_x, f_{x'} \rangle_{L^2_{\rho_X}} = \int_X \big(f_\rho(\psi) - \langle x, \psi \rangle\big) \langle x', \psi \rangle \, d\rho_X(\psi) = 0, \qquad \forall x' \in \mathbb{R}^d. \qquad (17)$$

The vector $x$ is not necessarily unique. To see this, we use the covariance matrix $C_{\rho_X}$ of the measure $\rho_X$ defined by (12) and denote its eigenvalues as $\lambda_1 \ge \ldots \ge \lambda_r > \lambda_{r+1} = \ldots = \lambda_d = 0$, where $r \in \{1, \ldots, d\}$ is the rank of $C_{\rho_X}$. Denote the eigenspace of $C_{\rho_X}$ associated with the eigenvalue $0$ as $V_0$ and the orthogonal projection onto $V_0$ as $P_0$. Then any vector $x + v$ from the set $x + V_0$ is also a minimizer of $\mathcal{E}(f_{x'})$ over $x' \in \mathbb{R}^d$, but $f_{x+v} = f_x = f_{\mathcal{H}}$ as functions in the space $L^2_{\rho_X}$.

The following lemma about the residual vectors $\{r_t = x_t - x\}_t$ is a crucial step in our analysis in this section.

Lemma 9 Define the sequence $\{x_t\}_t$ by (4). Let $x \in \mathbb{R}^d$ be such that $f_x = f_{\mathcal{H}}$. Denote $r_t = x_t - x$. Then there holds

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] = \|r_t\|^2 + (\eta_t^2 - 2\eta_t) \|f_{r_t}\|^2_{L^2_{\rho_X}} + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}), \qquad t \in \mathbb{N}. \qquad (18)$$

Proof Subtract $x$ from both sides of (4) and take inner products. We see from $\|\psi_t\| = 1$ that

$$\|r_{t+1}\|^2 = \|r_t\|^2 + 2\eta_t \{\tilde{y}_t - \langle \psi_t, x_t \rangle\} \langle \psi_t, r_t \rangle + \eta_t^2 \{\tilde{y}_t - \langle \psi_t, x_t \rangle\}^2. \qquad (19)$$

Since $x_t$ does not depend on $z_t$, taking the expectation with respect to $z_t$, we see from $\mathbb{E}[\tilde{y}_t \mid \psi_t] = f_\rho(\psi_t)$ and $\mathbb{E}_{z_t}\{\tilde{y}_t - \langle \psi_t, x_t \rangle\}^2 = \mathcal{E}(f_{x_t})$ that

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] = \|r_t\|^2 + 2\eta_t \, \mathbb{E}_{\psi_t}\big[\{f_\rho(\psi_t) - \langle \psi_t, x_t \rangle\} \langle \psi_t, r_t \rangle\big] + \eta_t^2 \, \mathcal{E}(f_{x_t}).$$

By (17), we know that the middle term above equals

$$2\eta_t \, \mathbb{E}_{\psi_t}\big[\{\langle \psi_t, x \rangle - \langle \psi_t, x_t \rangle\} \langle \psi_t, r_t \rangle\big] = -2\eta_t \, \mathbb{E}_{\psi_t}\big[\langle \psi_t, r_t \rangle^2\big] = -2\eta_t \|f_{r_t}\|^2_{L^2_{\rho_X}}.$$

Since $f_x$ is the orthogonal projection of $f_\rho$ onto $\mathcal{H}$, there holds $\mathcal{E}(f_{x_t}) = \mathcal{E}(f_\rho) + \|f_\rho - f_x\|^2_{L^2_{\rho_X}} + \|f_x - f_{x_t}\|^2_{L^2_{\rho_X}} = \mathcal{E}(f_{\mathcal{H}}) + \|f_{r_t}\|^2_{L^2_{\rho_X}}$. Then the desired identity (18) follows.
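Identity (18) can be checked numerically. The sketch below (not from the paper) compares a Monte Carlo estimate of $\mathbb{E}_{z_t}[\|r_{t+1}\|^2]$ for a fixed $x_t$ with the right-hand side of (18), in the illustrative case where $\psi$ is uniform on the sphere, so that $C_{\rho_X} = I/d$ and $\mathcal{E}(f_{\mathcal{H}}) = \sigma^2$ for the additive Gaussian noise model used here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, eta = 20, 0.1, 0.3
x_star = rng.standard_normal(d)             # f_H = f_{x*}, so E(f_H) = sigma^2 here
x_t = rng.standard_normal(d)                # a fixed current iterate
r_t = x_t - x_star

n = 300_000                                 # Monte Carlo sample size (illustrative)
psi = rng.standard_normal((n, d))
psi /= np.linalg.norm(psi, axis=1, keepdims=True)   # uniform on the sphere S^{d-1}
y = psi @ x_star + sigma * rng.standard_normal(n)

# one step of (4) from x_t for each sampled z_t = (psi, y), then the squared residual
r_next = x_t[None, :] + eta * (y - psi @ x_t)[:, None] * psi - x_star[None, :]
lhs = np.mean(np.sum(r_next**2, axis=1))    # Monte Carlo E_{z_t} ||r_{t+1}||^2

# right-hand side of (18): for psi uniform on the sphere, ||f_{r_t}||^2 = ||r_t||^2 / d
f_rt_sq = np.sum(r_t**2) / d
rhs = np.sum(r_t**2) + (eta**2 - 2 * eta) * f_rt_sq + eta**2 * sigma**2
print(lhs, rhs)                             # the two values should be close
```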

We are in a position to prove our first main result.

Proof of Theorem 3 (Necessity.) We first analyze the first two terms on the right-hand side of the identity (18) in Lemma 9. Since $0 < \eta_t \le 2$, we have $\eta_t^2 - 2\eta_t \le 0$. Observe from the Schwarz inequality that $|f_{r_t}(\psi)| = |\langle r_t, \psi \rangle| \le \|r_t\| \|\psi\| = \|r_t\|$ and thereby $\|f_{r_t}\|^2_{L^2_{\rho_X}} \le \|r_t\|^2$. It follows that

$$\|r_t\|^2 + (\eta_t^2 - 2\eta_t) \|f_{r_t}\|^2_{L^2_{\rho_X}} \ge \|r_t\|^2 + (\eta_t^2 - 2\eta_t) \|r_t\|^2 = (1 - \eta_t)^2 \|r_t\|^2.$$

This together with (18) implies

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] \ge (1 - \eta_t)^2 \|r_t\|^2 + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}). \qquad (20)$$

Then we can proceed with proving the necessity. If $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = 0$ for some $x \in \mathbb{R}^d$ and $\mathcal{E}(f_{\mathcal{H}}) > 0$, we know from (20) that $\lim_{T \to \infty} \eta_T = 0$. It ensures the existence of some integer $t_0$ such that $\eta_t \le \frac{1}{3}$ for any $t \ge t_0$. Since $(1 - \eta)^2 \ge \exp\{-4\eta\}$ for $0 < \eta \le \frac{1}{3}$, we know that for any $t \ge t_0$, $(1 - \eta_t)^2 \ge \exp\{-4\eta_t\}$. Combining this with (20) yields

$$\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 \ge \Big(\prod_{t=t_0}^{T} \exp\{-4\eta_t\}\Big) \mathbb{E}_{z_1,\ldots,z_{t_0-1}}\|r_{t_0}\|^2.$$

But (20) also tells us that $\mathbb{E}_{z_1,\ldots,z_{t_0-1}}\|r_{t_0}\|^2 \ge \eta_{t_0-1}^2 \, \mathcal{E}(f_{\mathcal{H}}) > 0$. So

$$\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 \ge \exp\Big\{-4 \sum_{t=t_0}^{T} \eta_t\Big\} \, \eta_{t_0-1}^2 \, \mathcal{E}(f_{\mathcal{H}}).$$

Since $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = 0$, we must have $\sum_{t=1}^{\infty} \eta_t = \infty$. This proves the necessity.

(Sufficiency.) Recall that $V_0$ is the eigenspace of the covariance matrix $C_{\rho_X}$ associated with the eigenvalue $0$ and $P_0$ is the orthogonal projection onto $V_0$. Then $\psi_t$ is orthogonal to $V_0$ almost surely for each $t$. It follows that $P_0 x_{t+1} = P_0 x_t$ and thereby $P_0 x_t = P_0 x_1$ for each $t$. Take the vector $x$ to be the minimizer of $\mathcal{E}(f_{x'})$ over $x' \in \mathbb{R}^d$ such that $P_0 x = P_0 x_1$. With this choice, $r_t$ is orthogonal to $V_0$ for each $t$ and belongs to the orthogonal complement $V_0^{\perp}$. Note that the smallest eigenvalue of $C_{\rho_X}$ restricted to the subspace $V_0^{\perp}$ is at least $\lambda_r > 0$. So we have

$$\|f_{r_t}\|^2_{L^2_{\rho_X}} = \int_X \langle \psi, r_t \rangle^2 d\rho_X = \int_X r_t^T \psi \psi^T r_t \, d\rho_X = r_t^T C_{\rho_X} r_t \qquad (21)$$

and $\|f_{r_t}\|^2_{L^2_{\rho_X}} \ge \lambda_r \|r_t\|^2$. The condition $\lim_{t \to \infty} \eta_t = 0$ ensures the existence of some $t_1 \in \mathbb{N}$ such that $\eta_t \le 1$ for any $t \ge t_1$. Thus, we see from (18) in Lemma 9 that for $t \ge t_1$,

$$\mathbb{E}_{z_t}\big[\|r_{t+1}\|^2\big] \le \|r_t\|^2 - \eta_t \|f_{r_t}\|^2_{L^2_{\rho_X}} + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}) \le (1 - \eta_t \lambda_r) \|r_t\|^2 + \eta_t^2 \, \mathcal{E}(f_{\mathcal{H}}).$$

Applying this inequality iteratively for $t = T, \ldots, t_1$ yields

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \mathbb{E}_{z_1,\ldots,z_{t_1-1}}\big[\|r_{t_1}\|^2\big] \prod_{t=t_1}^{T} (1 - \eta_t \lambda_r) + \mathcal{E}(f_{\mathcal{H}}) \sum_{t=t_1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r), \qquad (22)$$

where we denote $\prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) = 1$ for $t = T$. By the condition $\sum_{t=1}^{\infty} \eta_t = \infty$, one has

$$\prod_{t=t_1}^{T} (1 - \eta_t \lambda_r) \le \exp\Big\{-\lambda_r \sum_{t=t_1}^{T} \eta_t\Big\} \to 0 \quad \text{as } T \to \infty.$$

Thus for any $\varepsilon > 0$, there exists $t_2 = t_2(\varepsilon) \in \mathbb{N}$ such that for any $T \ge t_2$,

$$\mathbb{E}_{z_1,\ldots,z_{t_1-1}}\big[\|r_{t_1}\|^2\big] \prod_{t=t_1}^{T} (1 - \eta_t \lambda_r) \le \varepsilon.$$

To deal with the other term of the bound for $\mathbb{E}\|r_{T+1}\|^2$, we use the assumption $\lim_{t \to \infty} \eta_t = 0$ and find some integer $t(\varepsilon) \ge t_1$ such that $\eta_t \lambda_r \le \varepsilon$ for any $t \ge t(\varepsilon)$. Write

$$\sum_{t=t_1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) = \sum_{t=t_1}^{t(\varepsilon)} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) + \sum_{t=t(\varepsilon)+1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r). \qquad (23)$$

The second term of (23) can be bounded as

$$\sum_{t=t(\varepsilon)+1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) \le \frac{\varepsilon}{\lambda_r} \sum_{t=t(\varepsilon)+1}^{T} \eta_t \lambda_r \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) = \frac{\varepsilon}{\lambda_r} \sum_{t=t(\varepsilon)+1}^{T} \Big\{ \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) - \prod_{k=t}^{T} (1 - \eta_k \lambda_r) \Big\} \le \frac{\varepsilon}{\lambda_r}.$$

To bound the first term of (23), we apply the condition $\sum_{t=1}^{\infty} \eta_t = \infty$ again and find some integer $t_3 = t_3(\varepsilon) > t(\varepsilon)$ such that

$$\sum_{k=t(\varepsilon)+1}^{t_3} \eta_k \ge \frac{1}{\lambda_r} \log \frac{t(\varepsilon)}{\varepsilon}.$$

Hence, for every $t \le t(\varepsilon)$,

$$\sum_{k=t+1}^{T} \eta_k \ge \sum_{k=t(\varepsilon)+1}^{t_3} \eta_k \ge \frac{1}{\lambda_r} \log \frac{t(\varepsilon)}{\varepsilon}, \qquad T \ge t_3.$$

It thus follows that for each $t \in \{t_1, \ldots, t(\varepsilon)\}$ and $T \ge t_3$,

$$\prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) \le \exp\Big\{-\lambda_r \sum_{k=t+1}^{T} \eta_k\Big\} \le \exp\Big\{-\lambda_r \sum_{k=t(\varepsilon)+1}^{t_3} \eta_k\Big\} \le \frac{\varepsilon}{t(\varepsilon)}.$$

Combining this with the fact $\eta_t \le 1$ for each $t \ge t_1$, we see that the first term of (23) can be bounded as

$$\sum_{t=t_1}^{t(\varepsilon)} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r) \le \frac{\varepsilon}{t(\varepsilon)} \sum_{t=t_1}^{t(\varepsilon)} \eta_t^2 \le \varepsilon.$$

From the above analysis, we know that when $T \ge \max\{t_1, t(\varepsilon), t_2, t_3\}$,

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \varepsilon + \mathcal{E}(f_{\mathcal{H}}) \Big(\varepsilon + \frac{\varepsilon}{\lambda_r}\Big).$$

This proves the convergence $\lim_{T \to \infty} \mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 = 0$ for some $x \in \mathbb{R}^d$, and the sufficiency is verified.

From the bound (20), we also see that $\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2 \ge \eta_T^2 \, \mathcal{E}(f_{\mathcal{H}})$ for every $T \in \mathbb{N}$. This implies that

$$\sum_{T=1}^{\infty} \sqrt{\mathbb{E}_{z_1,\ldots,z_T}\|x_{T+1} - x\|^2} \ge \sqrt{\mathcal{E}(f_{\mathcal{H}})} \sum_{T=1}^{\infty} \eta_T = \infty.$$

The proof of Theorem 3 is complete.

In the proof of our second main result, we need some elementary inequalities.

Lemma 10 (a) For $\nu, a > 0$, there holds

$$\exp\{-\nu x\} \le \Big(\frac{a}{\nu e}\Big)^a x^{-a}, \qquad x > 0. \qquad (24)$$

(b) Let $\nu > 0$ and $q_2 \ge 0$. If $0 < q_1 < 1$, then for any $t \in \mathbb{N}$, we have

$$\sum_{i=1}^{t-1} i^{-q_2} \exp\Big\{-\nu \sum_{j=i+1}^{t} j^{-q_1}\Big\} \le c_{q_1, q_2, \nu} \, t^{q_1 - q_2}, \qquad (25)$$

where $c_{q_1, q_2, \nu}$ is a constant depending only on $q_1$, $q_2$ and $\nu$ (given explicitly in Smale and Zhou, 2009). For $q_1 = 1$, we have

$$\sum_{i=1}^{t-1} i^{-q_2} \exp\Big\{-\nu \sum_{j=i+1}^{t} j^{-1}\Big\} \le \begin{cases} c'_{q_2, \nu} \, t^{-\min\{\nu, \, q_2 - 1\}}, & \text{if } \nu \ne q_2 - 1, \\ c'_{q_2, \nu} \, t^{-\nu} \log t, & \text{if } \nu = q_2 - 1. \end{cases} \qquad (26)$$

(c) For any $t < T \in \mathbb{N}$ and $\theta \in (0, 1]$, there holds

$$\sum_{k=t+1}^{T} k^{-\theta} \ge \begin{cases} \frac{1}{1-\theta}\big[(T+1)^{1-\theta} - (t+1)^{1-\theta}\big], & \text{if } \theta < 1, \\ \log(T+1) - \log(t+1), & \text{if } \theta = 1. \end{cases} \qquad (27)$$

(d) For $\theta \in (0, 1]$, $\mu > 0$, and $T \in \mathbb{N}$, there holds

$$\exp\Big\{-\mu \sum_{t=1}^{T} t^{-\theta}\Big\} \le \begin{cases} \exp\Big\{\frac{\mu}{1-\theta}\Big\} \Big(\frac{\theta}{\mu e}\Big)^{\frac{\theta}{1-\theta}} T^{-\theta}, & \text{if } \theta < 1, \\ T^{-\mu}, & \text{if } \theta = 1. \end{cases} \qquad (28)$$

Proof The inequalities in parts (a) and (b) can be found in (Smale and Zhou, 2009, Lemma 2).

Part (c) can be proved by noting that

$$\sum_{k=t+1}^{T} k^{-\theta} \ge \sum_{k=t+1}^{T} \int_{k}^{k+1} x^{-\theta} dx = \int_{t+1}^{T+1} x^{-\theta} dx.$$

For part (d), we use the inequality (27) in part (c) to derive

$$\exp\Big\{-\mu \sum_{t=1}^{T} t^{-\theta}\Big\} \le \begin{cases} \exp\Big\{\frac{\mu}{1-\theta}\Big\} \exp\Big\{-\frac{\mu}{1-\theta} T^{1-\theta}\Big\}, & \text{if } \theta < 1, \\ (T+1)^{-\mu} \le T^{-\mu}, & \text{if } \theta = 1. \end{cases}$$

For $\theta \in (0, 1)$, by applying (24) with $\nu = \frac{\mu}{1-\theta}$, $x = T^{1-\theta}$ and $a = \frac{\theta}{1-\theta}$, we get

$$\exp\Big\{-\frac{\mu}{1-\theta} T^{1-\theta}\Big\} \le \Big(\frac{\theta}{\mu e}\Big)^{\frac{\theta}{1-\theta}} T^{-\theta}.$$

This proves the result.

We can now prove our second main result. This is done by following the estimates in the proof of Theorem 3.

Proof of Theorem 4 Since $\eta_1 < 1$, we have $\eta_t < 1$ for all $t \in \mathbb{N}$. Therefore, we can take $t_1 = 1$ in (22) and obtain

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \prod_{t=1}^{T} (1 - \eta_t \lambda_r) + \mathcal{E}(f_{\mathcal{H}}) \sum_{t=1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r)
\le \|r_1\|^2 \exp\Big\{-\lambda_r \eta_1 \sum_{t=1}^{T} t^{-\theta}\Big\} + \mathcal{E}(f_{\mathcal{H}}) \, \eta_1^2 \sum_{t=1}^{T} t^{-2\theta} \exp\Big\{-\lambda_r \eta_1 \sum_{k=t+1}^{T} k^{-\theta}\Big\}. \qquad (29)$$

Applying part (d) of Lemma 10 with $\mu = \lambda_r \eta_1$, we know that the first term of (29) can be bounded as

$$\|r_1\|^2 \exp\Big\{-\lambda_r \eta_1 \sum_{t=1}^{T} t^{-\theta}\Big\} \le \begin{cases} \|r_1\|^2 \exp\Big\{\frac{\lambda_r \eta_1}{1-\theta}\Big\} \Big(\frac{\theta}{\lambda_r \eta_1 e}\Big)^{\frac{\theta}{1-\theta}} T^{-\theta}, & \text{if } \theta < 1, \\ \|r_1\|^2 \, T^{-\lambda_r \eta_1}, & \text{if } \theta = 1. \end{cases}$$

Applying part (b) of Lemma 10 with $q_1 = \theta$, $q_2 = 2\theta$ and $\nu = \lambda_r \eta_1$, and noting that $\lambda_r \eta_1 < 1$ by $\eta_1 \in (0, 1)$ and $\lambda_r \in (0, 1]$ (so that $\min\{\lambda_r \eta_1, 2\theta - 1\} = \lambda_r \eta_1$ when $\theta = 1$), we know that the second term of (29) can be bounded by $\mathcal{E}(f_{\mathcal{H}}) \eta_1^2 \, c_{\theta, 2\theta, \lambda_r \eta_1} T^{-\theta}$ when $\theta < 1$ and by $\mathcal{E}(f_{\mathcal{H}}) \eta_1^2 \, c'_{2, \lambda_r \eta_1} T^{-\lambda_r \eta_1}$ when $\theta = 1$. Thus, we get our desired result (11) with $C_0$ given by the sum of the constants in the two displayed bounds. This completes the proof of Theorem 4.

Remark 11 From the proof of Theorem 4, we see that if $\mathcal{E}(f_{\mathcal{H}}) = 0$, then for some $x \in \mathbb{R}^d$, we have

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \prod_{t=1}^{T} (1 - \eta_t \lambda_r).$$

The above argument actually can be used to prove Theorem 8.

Proof of Theorem 8 Since $\eta_1 \le 1$, we have $\eta_t \le 1$ for all $t \in \mathbb{N}$. Thus, we can take $t_1 = 1$ in (22) and obtain

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \prod_{t=1}^{T} (1 - \eta_t \lambda_r) + \mathcal{E}(f_{\mathcal{H}}) \sum_{t=1}^{T} \eta_t^2 \prod_{k=t+1}^{T} (1 - \eta_k \lambda_r)
= \|r_1\|^2 \prod_{t=1}^{T} \Big(1 - \frac{1}{t + t_0}\Big) + \frac{\mathcal{E}(f_{\mathcal{H}})}{\lambda_r^2} \sum_{t=1}^{T} \frac{1}{(t + t_0)^2} \prod_{k=t+1}^{T} \Big(1 - \frac{1}{k + t_0}\Big).$$

We note that

$$\prod_{k=t+1}^{T} \Big(1 - \frac{1}{k + t_0}\Big) = \prod_{k=t+1}^{T} \frac{k + t_0 - 1}{k + t_0} = \frac{t + t_0}{T + t_0}.$$

It thus follows that

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|r_{T+1}\|^2\big] \le \|r_1\|^2 \, \frac{t_0}{T + t_0} + \frac{\mathcal{E}(f_{\mathcal{H}})}{\lambda_r^2 (T + t_0)} \sum_{t=1}^{T} \frac{1}{t + t_0}.$$

With $\sum_{t=1}^{T} \frac{1}{t + t_0} \le \log \frac{T + t_0}{t_0} \le 2 \log T$, we get the desired result with $C_3$ given by

$$C_3 = t_0 \|r_1\|^2 + \frac{2 \, \mathcal{E}(f_{\mathcal{H}})}{\lambda_r^2}.$$

This proves Theorem 8.
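The following sketch (not from the paper) contrasts the two step-size schedules analysed in this section, $\eta_t = \eta_1 t^{-\theta}$ from Theorem 4 and $\eta_t = 1/(\lambda_r(t+t_0))$ from Theorem 8, on a synthetic noisy problem with $\psi$ uniform on the sphere, so that $\lambda_r = 1/d$. All parameter choices are illustrative, and the run is only a sanity check of the predicted decay, not a reproduction of any experiment in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, T = 50, 0.05, 20000
x_star = rng.standard_normal(d)
lam_r = 1.0 / d                              # for psi uniform on S^{d-1}, C_rho = I/d

def run(step_size):
    """Run the relaxed scheme (4) for T steps and record the squared error."""
    x = np.zeros(d)
    errs = np.empty(T)
    for t in range(1, T + 1):
        psi = rng.standard_normal(d)
        psi /= np.linalg.norm(psi)
        y = psi @ x_star + sigma * rng.standard_normal()
        x += step_size(t) * (y - psi @ x) * psi
        errs[t - 1] = np.sum((x - x_star) ** 2)
    return errs

t0 = int(np.ceil(1.0 / lam_r))                        # t_0 with t_0 * lambda_r >= 1
err_poly = run(lambda t: 0.9 * t ** (-0.75))          # Theorem 4 schedule, theta = 3/4
err_thm8 = run(lambda t: 1.0 / (lam_r * (t + t0)))    # Theorem 8 schedule
print(err_poly[-1], err_thm8[-1])
```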

5. Confidence-Based Estimates for Convergence

In this section, we prove our third main result, Theorem 5. Recall that $V_0$ is the eigenspace of the covariance matrix $C_{\rho_X}$ associated with the eigenvalue $0$. We choose $x$ as in the proof of the sufficiency part of Theorem 3. With this choice, $r_t$ belongs to the orthogonal complement $V_0^{\perp}$ almost surely. Our error analysis is based on the following error decomposition.

5.1 Error Decomposition

For $t \in \mathbb{N}$, set the operator $\Pi_k^t = \prod_{j=k}^{t} (I - \eta_j C_{\rho_X})$ on $\mathbb{R}^d$ for $k \le t$ and $\Pi_{t+1}^t = I$. Subtracting $x$ from both sides of (4), we have

$$r_{k+1} = (I - \eta_k C_{\rho_X}) r_k + \eta_k \chi_k, \qquad (30)$$

where $\chi_k = (\tilde{y}_k - \langle \psi_k, x \rangle) \psi_k + (C_{\rho_X} - \psi_k \psi_k^T) r_k$. Applying this relationship iteratively for $k = t, \ldots, 1$, we get

$$r_{t+1} = \Pi_1^t r_1 + \sum_{k=1}^{t} \eta_k \Pi_{k+1}^t \chi_k.$$

Thus

$$\|r_{t+1}\| \le \|\Pi_1^t r_1\| + \Big\| \sum_{k=1}^{t} \eta_k \Pi_{k+1}^t \chi_k \Big\|. \qquad (31)$$

The first term of the bound (31) is caused by the initial error; it is deterministic and will be estimated in Subsection 5.2. The second term is the sample error depending on the sample. Since $r_k$ is independent of $z_k$, by $\mathbb{E}[\tilde{y}_k \mid \psi_k] = f_\rho(\psi_k)$ and (17),

$$\mathbb{E}[\chi_k \mid z_1, \ldots, z_{k-1}] = \int_X \big\{ \big(f_\rho(\psi) - \langle x, \psi \rangle\big) \psi + \big(C_{\rho_X} - \psi \psi^T\big) r_k \big\} d\rho_X(\psi) = 0.$$

It tells us that $\{\omega_k := \eta_k \Pi_{k+1}^t \chi_k\}_k$ is a martingale difference sequence. The idea of analyzing the sample error by properties of martingale difference sequences can be found in the recent work (Tarrès and Yao, 2014), to which we refer for details about martingale difference sequences. In particular, we can apply the following Pinelis-Bernstein inequality from (Tarrès and Yao, 2014), derived from (Pinelis, 1994, Theorem 3.4), to estimate the sample error.

Lemma 12 Let $\{\omega_k\}_k$ be a martingale difference sequence in a Hilbert space. Suppose that almost surely $\|\omega_k\| \le B$ and $\sum_{k=1}^{t} \mathbb{E}\big[\|\omega_k\|^2 \mid \omega_1, \ldots, \omega_{k-1}\big] \le L_t$. Then for any $0 < \delta < 1$, the following holds with probability at least $1 - \delta$:

$$\sup_{1 \le j \le t} \Big\| \sum_{k=1}^{j} \omega_k \Big\| \le 2 \Big( \frac{B}{3} + \sqrt{L_t} \Big) \log \frac{2}{\delta}.$$

The required bounds $B$ and $L_t$ will be presented in Subsections 5.3 and 5.4, respectively.

5.2 Initial Error

Lemma 13 Let $\eta_k = \eta_1 k^{-\theta}$ with $\theta \in (0, 1]$ and $\eta_1 \in (0, 1)$. Then

$$\|\Pi_1^t r_1\| \le \begin{cases} C_0' \, t^{-\theta}, & \text{when } \theta < 1, \\ C_0' \, t^{-\lambda_r \eta_1}, & \text{when } \theta = 1, \end{cases} \qquad \text{where} \qquad C_0' = \begin{cases} \|r_1\| \exp\Big\{\frac{\lambda_r \eta_1}{1-\theta}\Big\} \Big(\frac{\theta}{\lambda_r \eta_1 e}\Big)^{\frac{\theta}{1-\theta}}, & \text{when } \theta < 1, \\ \|r_1\|, & \text{when } \theta = 1. \end{cases}$$

Proof By our choice of $x$, we know that $r_1$ belongs to the subspace $V_0^{\perp}$. Thus, we have $\|\Pi_1^t r_1\| \le \|\Pi_1^t|_{V_0^{\perp}}\| \, \|r_1\|$. Here $\Pi_1^t|_{V_0^{\perp}}$ denotes the restriction of the self adjoint operator $\Pi_1^t$ onto $V_0^{\perp}$. Since $\{\lambda_l : l = 1, 2, \ldots, r\}$ are the eigenvalues of $C_{\rho_X}$ restricted to $V_0^{\perp}$, $\lambda_1 \le 1$ and $\eta_1 < 1$, we have

$$\|\Pi_1^t|_{V_0^{\perp}}\| = \sup_{1 \le l \le r} \prod_{k=1}^{t} (1 - \eta_k \lambda_l) \le \prod_{k=1}^{t} (1 - \eta_1 \lambda_r k^{-\theta}) \le \exp\Big\{-\lambda_r \eta_1 \sum_{k=1}^{t} k^{-\theta}\Big\}.$$

Applying part (d) of Lemma 10, we get our desired result.

5.3 Bounding the Residual Sequence

To bound $\omega_k = \eta_k \Pi_{k+1}^t \chi_k$, we start with a rough bound for $\|r_t\|$.

Lemma 14 Assume that for some constant $M > 0$, $|\tilde{y}| \le M$ almost surely. Let $\theta \in [0, 1]$ and $\eta_t = \eta_1 t^{-\theta}$ with $\eta_1 \in (0, 1)$. Then for any $t \in \mathbb{N}$, we have almost surely

$$\|r_t\| \le \begin{cases} C_1' \, t^{(1-\theta)/2}, & \text{when } \theta \in [0, 1), \\ C_1' \, \sqrt{\log(et)}, & \text{when } \theta = 1, \end{cases} \qquad (32)$$

where $C_1'$ is a constant independent of $t$ given by

$$C_1' = \begin{cases} \sqrt{\|r_1\|^2 + \frac{\eta_1 (M + \|x\|)^2}{1-\theta}}, & \text{when } \theta \in [0, 1), \\ \sqrt{\|r_1\|^2 + \eta_1 (M + \|x\|)^2}, & \text{when } \theta = 1. \end{cases}$$

Proof Rewrite (19) with $x_t = x + r_t$ as

$$\|r_{t+1}\|^2 = \|r_t\|^2 + 2\eta_t \big(\tilde{y}_t - \langle \psi_t, x \rangle - \langle \psi_t, r_t \rangle\big) \langle \psi_t, r_t \rangle + \eta_t^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle - \langle \psi_t, r_t \rangle\big)^2 = \|r_t\|^2 + F(\langle \psi_t, r_t \rangle),$$

where $F = F_{\eta_t, \tilde{y}_t, \psi_t, x} : \mathbb{R} \to \mathbb{R}$ is a quadratic function given by

$$F(\mu) = (\eta_t^2 - 2\eta_t) \mu^2 + 2\eta_t (1 - \eta_t) \big(\tilde{y}_t - \langle \psi_t, x \rangle\big) \mu + \eta_t^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2.$$

Note that $\eta_t^2 - 2\eta_t < 0$ by $0 < \eta_t \le \eta_1 < 1$. A simple calculation shows that

$$\max_{\mu \in \mathbb{R}} F(\mu) = \frac{\eta_t^2 (1 - \eta_t)^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2}{2\eta_t - \eta_t^2} + \eta_t^2 \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2 = \frac{\eta_t \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2}{2 - \eta_t}.$$

Since $|\tilde{y}_t| \le M$ almost surely and $\|\psi_t\| = 1$,

$$\big|\tilde{y}_t - \langle \psi_t, x \rangle\big| \le |\tilde{y}_t| + \|\psi_t\| \|x\| \le M + \|x\|.$$

Thus,

$$\|r_{t+1}\|^2 \le \|r_t\|^2 + \frac{\eta_t \big(\tilde{y}_t - \langle \psi_t, x \rangle\big)^2}{2 - \eta_t} \le \|r_t\|^2 + \eta_t (M + \|x\|)^2.$$

Using this relationship iteratively yields

$$\|r_{t+1}\|^2 \le \|r_1\|^2 + (M + \|x\|)^2 \sum_{k=1}^{t} \eta_k = \|r_1\|^2 + \eta_1 (M + \|x\|)^2 \sum_{k=1}^{t} k^{-\theta}.$$

Since

$$\sum_{k=1}^{t} k^{-\theta} \le 1 + \sum_{k=2}^{t} \int_{k-1}^{k} x^{-\theta} dx \le \begin{cases} \frac{1}{1-\theta} \, t^{1-\theta}, & \text{when } \theta \in [0, 1), \\ \log(et), & \text{when } \theta = 1, \end{cases}$$

we get

$$\|r_t\|^2 \le \begin{cases} \Big(\|r_1\|^2 + \frac{\eta_1 (M + \|x\|)^2}{1-\theta}\Big) t^{1-\theta}, & \text{when } \theta \in [0, 1), \\ \big(\|r_1\|^2 + \eta_1 (M + \|x\|)^2\big) \log(et), & \text{when } \theta = 1, \end{cases}$$

which leads to the desired result.

5.4 Estimating Conditional Variance and Upper Bound

In this subsection, we give bounds for the two quantities $\sum_{k=1}^{t} \eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big]$ and $\sup_{1 \le k \le t} \|\eta_k \Pi_{k+1}^t \chi_k\|$ required for applying the Pinelis-Bernstein inequality.

Lemma 15 Let $\eta_k = \eta_1 k^{-\theta}$ with $\theta \in (0, 1]$ and $\eta_1 \in (0, 1)$. Then almost surely we have

$$\eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \eta_1^2 \, k^{-2\theta} \exp\Big\{-2\eta_1 \lambda_r \sum_{j=k+1}^{t} j^{-\theta}\Big\} \big(\mathcal{E}(f_{\mathcal{H}}) + \|r_k\|^2\big). \qquad (33)$$

Proof Recall that both $\psi_k$ and $r_k$ belong to $V_0^{\perp}$ almost surely for each $k \in \mathbb{N}$. As a result, $\chi_k$ also belongs to $V_0^{\perp}$ almost surely for each $k$. Hence

$$\eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \eta_k^2 \, \big\|\Pi_{k+1}^t|_{V_0^{\perp}}\big\|^2 \, \mathbb{E}\big[\|\chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big].$$

Since $\eta_1 < 1$ and $\lambda_1 \le 1$, we have

$$\big\|\Pi_{k+1}^t|_{V_0^{\perp}}\big\| = \sup_{1 \le l \le r} \prod_{j=k+1}^{t} (1 - \eta_j \lambda_l) \le \prod_{j=k+1}^{t} (1 - \eta_j \lambda_r) \le \exp\Big\{-\lambda_r \sum_{j=k+1}^{t} \eta_j\Big\} = \exp\Big\{-\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\}. \qquad (34)$$

Thus,

$$\eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \eta_k^2 \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\} \mathbb{E}\big[\|\chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big]. \qquad (35)$$

Since $r_k$ does not depend on $z_k$, we see from $\|\psi_k\| = 1$, $\mathbb{E}[\tilde{y}_k \mid \psi_k] = f_\rho(\psi_k)$ and (17) that

$$\mathbb{E}_{z_k}\big[\big\langle (\tilde{y}_k - \langle \psi_k, x \rangle)\psi_k, \, (C_{\rho_X} - \psi_k \psi_k^T) r_k \big\rangle\big]
= \mathbb{E}_{\psi_k}\big[\big(f_\rho(\psi_k) - \langle \psi_k, x \rangle\big)\langle \psi_k, C_{\rho_X} r_k \rangle\big] - \mathbb{E}_{\psi_k}\big[\big(f_\rho(\psi_k) - \langle \psi_k, x \rangle\big)\langle \psi_k, r_k \rangle\big] = 0.$$

It thus follows that

$$\mathbb{E}\big[\|\chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] = \mathbb{E}_{z_k}\big[\|\chi_k\|^2\big] = \mathbb{E}_{z_k}\big[(\tilde{y}_k - \langle \psi_k, x \rangle)^2\big] + \mathbb{E}_{z_k}\big[\|(C_{\rho_X} - \psi_k \psi_k^T) r_k\|^2\big]
= \mathcal{E}(f_{\mathcal{H}}) + \big\langle (C_{\rho_X} - C_{\rho_X}^2) r_k, r_k \big\rangle \le \mathcal{E}(f_{\mathcal{H}}) + \|r_k\|^2.$$

Putting the above bound into (35), we get the desired result.

Lemma 16 Assume that for some constant $M > 0$, $|\tilde{y}| \le M$ almost surely. Let $\theta \in [1/2, 1]$ and $\eta_t = \eta_1 t^{-\theta}$ with $\eta_1 \in (0, 1)$. Then for any $t \in \mathbb{N}$, we have almost surely

$$\sup_{1 \le k \le t} \|\eta_k \Pi_{k+1}^t \chi_k\| \le \begin{cases} C_2' \, t^{-\theta} \max\big\{\sup_{1 \le k \le t} \|r_k\|, 1\big\}, & \text{when } \theta < 1, \\ C_2' \, t^{-\lambda_r \eta_1} \max\big\{\sup_{1 \le k \le t} \|r_k\|, 1\big\}, & \text{when } \theta = 1, \end{cases} \qquad (36)$$

where $C_2'$ is a constant independent of $t$ given explicitly in the proof.

Proof Let $k \in \{1, \ldots, t\}$. From the definition of $\chi_k$, we have $\|\chi_k\| \le \big(|\tilde{y}_k| + \|\psi_k\| \|x\|\big)\|\psi_k\| + \|C_{\rho_X} - \psi_k \psi_k^T\| \, \|r_k\|$. But $|\tilde{y}_k| \le M$, $\|\psi_k\| = 1$ and $\|C_{\rho_X}\| \le 1$. So we have

$$\|\chi_k\| \le M + \|x\| + 2\|r_k\| \le (M + \|x\| + 2) \max\{\|r_k\|, 1\}.$$

This together with (34) and the fact that $\chi_k$ belongs to $V_0^{\perp}$ implies

$$\|\eta_k \Pi_{k+1}^t \chi_k\| \le \eta_1 (M + \|x\| + 2) \, k^{-\theta} \exp\Big\{-\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\} \max\{\|r_k\|, 1\}.$$

What is left is to estimate $I_k := k^{-\theta} \exp\big\{-\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\big\}$.

For $\theta \in [1/2, 1)$, applying part (c) of Lemma 10 gives

$$I_k \le k^{-\theta} \exp\Big\{-\frac{\lambda_r \eta_1}{1-\theta}\big[(t+1)^{1-\theta} - (k+1)^{1-\theta}\big]\Big\}.$$

If $k \ge t/2$, then $k^{-\theta} \le 2^{\theta} t^{-\theta}$ and thus $I_k \le 2^{\theta} t^{-\theta}$. If $1 \le k < t/2$, then we have $k+1 \le (t+1)/2$ and $(t+1)^{1-\theta} - (k+1)^{1-\theta} \ge (1 - 2^{\theta-1})(t+1)^{1-\theta}$. It follows that

$$I_k \le \exp\Big\{-\frac{\lambda_r \eta_1 (1 - 2^{\theta-1})}{1-\theta} \, t^{1-\theta}\Big\}.$$

Applying part (a) of Lemma 10 with $x = t^{1-\theta}$, $\nu = \frac{\lambda_r \eta_1 (1 - 2^{\theta-1})}{1-\theta}$ and $a = \frac{\theta}{1-\theta}$, we get

$$I_k \le \Big(\frac{\theta}{e \lambda_r \eta_1 (1 - 2^{\theta-1})}\Big)^{\frac{\theta}{1-\theta}} t^{-\theta}.$$

For $\theta = 1$, by part (c) of Lemma 10 and $\lambda_r \eta_1 < 1$, we have

$$I_k \le k^{-1} \Big(\frac{k+1}{t+1}\Big)^{\lambda_r \eta_1} \le 2^{\lambda_r \eta_1} k^{\lambda_r \eta_1 - 1} t^{-\lambda_r \eta_1} \le 2 \, t^{-\lambda_r \eta_1}.$$

From the above analysis, we conclude the desired result (36) with $C_2' = \eta_1 (M + \|x\| + 2)\big(2^{\theta} + \big(\frac{\theta}{e \lambda_r \eta_1 (1 - 2^{\theta-1})}\big)^{\frac{\theta}{1-\theta}}\big)$ when $\theta < 1$ and $C_2' = 2\eta_1 (M + \|x\| + 2)$ when $\theta = 1$.

5.5 Preliminary Error Analysis

Based on the above estimates, we can apply Lemma 12 to obtain an error bound.

Proposition 17 Under the assumptions of Theorem 5, for some $x \in \mathbb{R}^d$, for any $0 < \delta < 1$ and fixed $t \in \mathbb{N}$, with confidence at least $1 - \delta$ we have

$$\|x_{t+1} - x\| \le \begin{cases} C_2 \, t^{(1-2\theta)/2} \log\frac{2}{\delta}, & \text{when } \theta \in [1/3, 1), \\ C_2 \, t^{-\lambda_r \eta_1} \sqrt{\log(et)} \, \log\frac{2}{\delta}, & \text{when } \theta = 1, \end{cases} \qquad (37)$$

where $C_2$ is a positive constant independent of $t$ or $\delta$ given explicitly in the proof.

Proof To apply the Pinelis-Bernstein inequality to estimate $\big\|\sum_{k=1}^{t} \eta_k \Pi_{k+1}^t \chi_k\big\|$, we need the bounds $B$ and $L_t$. By Lemmas 14 and 16, we have

$$\sup_{1 \le k \le t} \|\eta_k \Pi_{k+1}^t \chi_k\| \le \begin{cases} C_2'(C_1' + 1) \, t^{(1-3\theta)/2}, & \text{when } \theta < 1, \\ C_2'(C_1' + 1) \, t^{-\lambda_r \eta_1} \sqrt{\log(et)}, & \text{when } \theta = 1. \end{cases} \qquad (38)$$

By Lemmas 15 and 14, we get

$$\sum_{k=1}^{t} \eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \begin{cases} C_3' \sum_{k=1}^{t} k^{-(3\theta-1)} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\}, & \text{when } \theta \in [1/3, 1), \\ C_3' \log(et) \sum_{k=1}^{t} k^{-2} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-1}\Big\}, & \text{when } \theta = 1, \end{cases}$$

where $C_3' = \big(\mathcal{E}(f_{\mathcal{H}}) + C_1'^2\big)\eta_1^2$. Applying part (b) of Lemma 10 with $\nu = 2\lambda_r \eta_1 < 1$, $q_1 = \theta$ and $q_2 = 3\theta - 1$, we have for $\theta < 1$,

$$\sum_{k=1}^{t-1} k^{-(3\theta-1)} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-\theta}\Big\} \le c_{\theta, 3\theta-1, 2\lambda_r \eta_1} \, t^{1-2\theta},$$

and for $\theta = 1$,

$$\sum_{k=1}^{t-1} k^{-2} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{t} j^{-1}\Big\} \le c'_{2, 2\lambda_r \eta_1} \, t^{-2\lambda_r \eta_1}.$$

Adding the term with $k = t$, we therefore get

$$\sum_{k=1}^{t} \eta_k^2 \, \mathbb{E}\big[\|\Pi_{k+1}^t \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \begin{cases} C_4 \, t^{1-2\theta}, & \text{when } \theta \in [1/3, 1), \\ C_4 \, t^{-2\lambda_r \eta_1} \log(et), & \text{when } \theta = 1, \end{cases} \qquad (39)$$

with $C_4 = C_3'\big(c_{\theta, 3\theta-1, 2\lambda_r \eta_1} + 1\big)$ when $\theta \in [1/3, 1)$ and $C_4 = C_3'\big(c'_{2, 2\lambda_r \eta_1} + 1\big)$ when $\theta = 1$. Applying Lemma 12 to the martingale difference sequence $\{\omega_k := \eta_k \Pi_{k+1}^t \chi_k\}_k$ with $B$ and $L_t$ given by (38) and (39) respectively, we know that with probability at least $1 - \delta$,

$$\sup_{1 \le j \le t} \Big\|\sum_{k=1}^{j} \eta_k \Pi_{k+1}^t \chi_k\Big\| \le \begin{cases} C_5 \, t^{(1-2\theta)/2} \log\frac{2}{\delta}, & \text{when } \theta \in [1/3, 1), \\ C_5 \, t^{-\lambda_r \eta_1} \sqrt{\log(et)} \, \log\frac{2}{\delta}, & \text{when } \theta = 1, \end{cases}$$

where $C_5 = 2\big(C_2'(C_1' + 1)/3 + \sqrt{C_4}\big)$. Putting this bound into (31) with $t$ replaced by $j$, and then applying Lemma 13 to bound the initial error, we get the desired result with $C_2 = C_0' + C_5$.

In the above procedure, we have used the rough bound (32) for $\|r_t\|$. This rough bound tends to $\infty$ as $t$ becomes large. In contrast, the bound provided in Proposition 17 tends to $0$ when $\theta \in (1/2, 1]$ and is much better, but it holds only with confidence. We shall use this refined bound to improve our estimates in the following subsection.

5.6 Improved Error Analysis

In this subsection, we prove our third main result by improving the preliminary confidence-based error bound in Proposition 17.

Proof of Theorem 5 When $\theta = 1$, our desired bound follows from (37) with $C_1 = 2C_2$. It remains to prove the case $\theta \in [1/2, 1)$. Let $T \in \mathbb{N}$. Applying Proposition 17 with $t = 1, \ldots, T$ and taking the union event, followed by rescaling, we know that there exists a subset $Z^T_{\delta}$ of $Z^T$ with measure at least $1 - \delta/2$ such that

$$\|r_t\| \le C_6 \log\frac{4}{\delta} \log T, \qquad t = 1, \ldots, T+1, \quad (z_1, \ldots, z_T) \in Z^T_{\delta}, \qquad (40)$$

where $C_6 = 2C_2 + \|r_1\|$.

Now we turn to the essential part of the proof. Define another martingale difference sequence $\{\tilde{\omega}_k\}_k$ by multiplying the one in the proof of Proposition 17 by the characteristic function of the event $\{\|r_k\| \le C_6 \log\frac{4}{\delta} \log T\}$:

$$\tilde{\omega}_k = \eta_k \Pi_{k+1}^T \chi_k \, \mathbf{1}_{\{\|r_k\| \le C_6 \log\frac{4}{\delta} \log T\}}.$$

From (36) and the multiplication by the characteristic function, we have

$$\sup_{1 \le k \le T} \|\tilde{\omega}_k\| \le C_2' C_6 \log\frac{4}{\delta} \log T \cdot T^{-\theta}. \qquad (41)$$

Notice that the characteristic function $\mathbf{1}_{\{\|r_k\| \le C_6 \log\frac{4}{\delta} \log T\}}$ is independent of $z_k$. Also, from the proof of Lemma 15, we know that for each $k \in \{1, \ldots, T\}$,

$$\mathbb{E}\big[\|\Pi_{k+1}^T \chi_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{T} j^{-\theta}\Big\} \big(\mathcal{E}(f_{\mathcal{H}}) + \|r_k\|^2\big).$$

It follows, by setting $C_7 = \big(\mathcal{E}(f_{\mathcal{H}}) + C_6^2\big)\eta_1^2$, that

$$\sum_{k=1}^{T} \mathbb{E}\big[\|\tilde{\omega}_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le C_7 \log^2\frac{4}{\delta} \log^2 T \sum_{k=1}^{T} k^{-2\theta} \exp\Big\{-2\lambda_r \eta_1 \sum_{j=k+1}^{T} j^{-\theta}\Big\}.$$

Applying part (b) of Lemma 10 yields

$$\sum_{k=1}^{T} \mathbb{E}\big[\|\tilde{\omega}_k\|^2 \mid z_1, \ldots, z_{k-1}\big] \le C_7 \big(c_{\theta, 2\theta, 2\lambda_r \eta_1} + 1\big) \log^2\frac{4}{\delta} \log^2 T \cdot T^{-\theta}.$$

Using this bound as $L_T$ and (41) as the bound $B$ in Lemma 12, we know that there exists another subset $\tilde{Z}^T_{\delta}$ of $Z^T$ with measure at least $1 - \delta/2$ such that for every $(z_1, \ldots, z_T) \in \tilde{Z}^T_{\delta}$, there holds

$$\Big\|\sum_{k=1}^{T} \tilde{\omega}_k\Big\| \le C_8 \, T^{-\theta/2} \log^2\frac{4}{\delta} \log T,$$

where $C_8 = 2\big(C_2' C_6 / 3 + \sqrt{C_7 (c_{\theta, 2\theta, 2\lambda_r \eta_1} + 1)}\big)$. This together with (40) tells us that for every $(z_1, \ldots, z_T) \in Z^T_{\delta} \cap \tilde{Z}^T_{\delta}$, there holds

$$\Big\|\sum_{k=1}^{T} \eta_k \Pi_{k+1}^T \chi_k\Big\| \le C_8 \, T^{-\theta/2} \log^2\frac{4}{\delta} \log T. \qquad (42)$$

The subset $Z^T_{\delta} \cap \tilde{Z}^T_{\delta}$ has measure at least $1 - \delta$. Therefore, we can put (42) into (31) and apply Lemma 13 to bound the initial error, which proves Theorem 5 for the case $\theta \in [1/2, 1)$ with the constant $C_1 = C_0' + C_8$.

6. Almost Sure Convergence

In this section, we prove the almost sure convergence of the randomized Kaczmarz algorithm. Recall that the almost sure convergence of a sequence of random variables $\{X_n\}$ towards $X$ means that

$$P\Big(\lim_{n \to \infty} X_n = X\Big) = 1,$$

or equivalently,

$$\lim_{n \to \infty} P\Big(\sup_{k \ge n} |X_k - X| > \varepsilon\Big) = 0 \quad \text{for any } \varepsilon > 0.$$

The Borel-Cantelli Lemma (see, e.g., Klenke, 2008) asserts for a sequence $(E_n)_n$ of events that if the sum of the probabilities is finite, $\sum_{n=1}^{\infty} P(E_n) < \infty$, then the probability that infinitely many of them occur is $0$, that is,

$$P\Big(\limsup_{n \to \infty} E_n\Big) = P\Big(\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k\Big) = 0.$$

The following lemma is an easy consequence of the Borel-Cantelli Lemma. We give the proof for completeness.

Lemma 18 Let $\{X_n\}$ be a sequence of random variables in some probability space and $\{\varepsilon_n\}$ be a sequence of positive numbers satisfying $\lim_{n \to \infty} \varepsilon_n = 0$. If

$$\sum_{n=1}^{\infty} P\big(|X_n - X| > \varepsilon_n\big) < \infty,$$

then $X_n$ converges to $X$ almost surely.

Proof Since $\lim_{n \to \infty} \varepsilon_n = 0$, for any $\varepsilon > 0$, there exists some $n_0 \in \mathbb{N}$ such that for all $k \ge n_0$, $\varepsilon_k < \varepsilon$. Thus, for $n \ge n_0$,

$$P\Big(\sup_{k \ge n} |X_k - X| > \varepsilon\Big) \le P\Big(\bigcup_{k \ge n} \big\{|X_k - X| > \varepsilon_k\big\}\Big) \le \sum_{k \ge n} P\big(|X_k - X| > \varepsilon_k\big).$$

Letting $n \to \infty$, one gets $\lim_{n \to \infty} P\big(\sup_{k \ge n} |X_k - X| > \varepsilon\big) = 0$. This proves the result.

Now we can apply Lemma 18 to prove our last main result.

Proof of Theorem 6 Set $\Lambda_t = t^{-\theta/2}$ when $\theta < 1$ and $\Lambda_t = t^{-\lambda_r \eta_1}$ when $\theta = 1$. By Theorem 5, we have for any $t$ and $0 < \delta_t < 1$,

$$P\Big(\Lambda_t^{\epsilon - 1} \|x_{t+1} - x\| > C_1 \Lambda_t^{\epsilon} \log^2\frac{4}{\delta_t} \log t\Big) \le \delta_t.$$

Choose $\delta_t = t^{-2}$ and $\varepsilon_t = C_1 \Lambda_t^{\epsilon} \log^2(4/\delta_t) \log t$. Obviously

$$\sum_{t=2}^{\infty} P\Big(\Lambda_t^{\epsilon - 1} \|x_{t+1} - x\| > \varepsilon_t\Big) \le \sum_{t=2}^{\infty} \delta_t < \infty$$

and $\varepsilon_t \le 16 \, C_1 \Lambda_t^{\epsilon} \log^3 t \to 0$ as $t \to \infty$. Then the conclusion of Theorem 6 follows from Lemma 18.

Remark 19 The above method of proof can be used to get a more quantitative estimate for the almost sure convergence of the Kaczmarz algorithm with noiseless random measurements (Chen and Powell, 2012). In that setting, $\eta_t \equiv 1$, $y_t = f_\rho(\psi_t)$ and $r = d$. It was shown in (Strohmer and Vershynin, 2009; Chen and Powell, 2012) that, with $q = 1 - \lambda_r$,

$$\mathbb{E}\|x_{t+1} - x\|^2 \le q^t \|r_1\|^2.$$

It follows from the Chebyshev inequality that for any $\epsilon \in (0, 1)$,

$$P\Big(q^{t(\epsilon-1)/2} \|x_{t+1} - x\| > q^{t\epsilon/2} t\Big) = P\Big(\|x_{t+1} - x\| > q^{t/2} t\Big) \le \frac{\mathbb{E}\big[\|x_{t+1} - x\|^2\big]}{q^t t^2}.$$

Thus, we get

$$P\Big(q^{t(\epsilon-1)/2} \|x_{t+1} - x\| > q^{t\epsilon/2} t\Big) \le \|r_1\|^2 t^{-2}.$$

Obviously, $q^{t\epsilon/2} t \to 0$ as $t \to \infty$, and $\sum_{t=1}^{\infty} \|r_1\|^2 t^{-2} < \infty$. Applying Lemma 18 with $\varepsilon_t = q^{t\epsilon/2} t$, we know that for any $\epsilon \in (0, 1)$,

$$\lim_{t \to \infty} (1 - \lambda_r)^{t(\epsilon-1)/2} \|x_{t+1} - x\| = 0 \quad \text{almost surely.}$$

7. Simulations and Discussions

In this section we provide some numerical simulations and further discussions on our error analysis. To illustrate our derived convergence rates and compare with the existing literature, we carry out numerical simulations corresponding to Example 2 with the same data distributions as in (Needell, 2010): $m = 200$, $d = 100$, $A \in \mathbb{R}^{200 \times 100}$ is a Gaussian matrix with each entry drawn independently from the standard normal distribution $N(0, 1)$, and $y \in \mathbb{R}^{200}$ is a Gaussian noise vector with each component drawn independently from the normal distribution with mean $0$ and standard deviation $0.02$. The measurement vectors $\psi_t = \frac{1}{\|\varphi_t\|}\varphi_t$ are drawn from the normalized rows of $A$ as in Example 2 and $\tilde{y}_t = y_t / \|\varphi_t\|$, with the true vector $x = 0$. We conduct 100 trials for each choice of the relaxation parameter sequence $\eta_t = 1$, $\eta_t = 1/\sqrt{t}$, $\eta_t = 1/t$. In each trial, algorithm (4) is run 100 times with random Gaussian initial vectors of norm $\|x_1\| = 0.02$.

Figure 1 depicts the error $\|x_{t+1} - x\|^2$ for $t = 1, \ldots, 1500$, averaged over the 100 trials and 100 initial vectors. The black line is a plot with the constant relaxation parameter sequence $\eta_t = 1$, which verifies the divergence of the algorithm, as proved in (Needell, 2010). The blue line is a plot with $\eta_t = 1/\sqrt{t}$, which hints at a slow convergence of the algorithm. The red line is a plot with $\eta_t = 1/t$, which confirms a faster convergence. The above simulations are consistent with our error analysis.

[Figure 1: Error of the relaxed randomized Kaczmarz algorithm with η_t = 1 (black line), η_t = 1/√t (blue line), and η_t = 1/t (red line).]
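A reduced version of the simulation described above can be sketched as follows; it is not the authors' code. The number of trials is lowered for speed, the noise level, initial norm and step-size choices follow the description in the text, and plotting is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, T, n_trials = 200, 100, 1500, 20
A = rng.standard_normal((m, d))
noise = 0.02 * rng.standard_normal(m)             # b = A x* + noise with x* = 0
row_norm = np.linalg.norm(A, axis=1)
probs = row_norm**2 / np.sum(row_norm**2)         # row selection as in Example 2
psis, ys = A / row_norm[:, None], noise / row_norm

def average_error(step_size):
    """Average squared error ||x_t - x*||^2 of scheme (4) over a few random starts."""
    errs = np.zeros(T)
    for _ in range(n_trials):
        x = rng.standard_normal(d)
        x *= 0.02 / np.linalg.norm(x)             # random initial vector, ||x_1|| = 0.02
        for t in range(1, T + 1):
            i = rng.choice(m, p=probs)
            x += step_size(t) * (ys[i] - psis[i] @ x) * psis[i]
            errs[t - 1] += np.sum(x**2) / n_trials   # x* = 0, so the error is ||x_t||^2
    return errs

for name, eta in [("eta_t = 1", lambda t: 1.0),
                  ("eta_t = 1/sqrt(t)", lambda t: t ** -0.5),
                  ("eta_t = 1/t", lambda t: 1.0 / t)]:
    print(name, average_error(eta)[-1])
```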

In this paper, a learning theory approach to the relaxed randomized Kaczmarz algorithm is presented. It yields new results and observations, including the necessary and sufficient condition (9), stated in Theorem 3, for the convergence in expectation when the sampling process is noisy or nonlinear. For noise-free and linear sampling processes (that is, $\mathcal{E}(f_{\mathcal{H}}) = 0$), we can see from Remark 11 with $\eta_t \equiv 1$ that

$$\mathbb{E}_{z_1,\ldots,z_T}\big[\|x_{T+1} - x\|^2\big] \le \|x_1 - x\|^2 (1 - \lambda_r)^T.$$

This exponential convergence result was proved in (Strohmer and Vershynin, 2009) for Example 2 under the restriction that the matrix $A$ has full column rank, where the number $1 - \lambda_r$ is replaced by a quantity involving $\|A^{-1}\| := \inf\{M : M \|Ax\| \ge \|x\| \text{ for all } x\}$. Our result is more general, being valid for underdetermined systems with $\|A^{-1}\| = \infty$.

In the framework of Kaczmarz algorithms, we consider online learning algorithms associated with the least squares loss. It would be interesting to extend our study to algorithms associated with more general loss functions (Ying and Zhou, 2006) such as the hinge loss, and to consider error analysis without requiring the approximation error (Ying and Zhou, 2006) to tend to zero.

Acknowledgments

The work described in this paper is supported partially by the Research Grants Council of Hong Kong [Project No. CityU ]. The authors would like to thank the referees for constructive suggestions and Dr. Xin Guo for helping with the numerical simulations. The corresponding author is Ding-Xuan Zhou.

References

X. Chen and A. Powell. Almost sure convergence for the Kaczmarz algorithm with random measurements. Journal of Fourier Analysis and Applications, 18, 2012.

T. Hu, J. Fan, Q. Wu, and D. X. Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13, 2015.

J. Johnston. Econometric Methods. McGraw Hill, New York.

S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres A, 35, 1937.

A. Klenke. Probability Theory: A Comprehensive Course. Springer-Verlag, London, 2008.

D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50, 2010.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Annals of Probability, 22, 1994.

Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8, 2008.

S. Smale and Y. Yao. Online learning algorithms. Foundations of Computational Mathematics, 6, 2005.

S. Smale and D. X. Zhou. Online learning with Markov sampling. Analysis and Applications, 7:87-113, 2009.

T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15:262-278, 2009.

P. Tarrès and Y. Yao. Online learning as stochastic approximations of regularization paths. IEEE Transactions on Information Theory, 60, 2014.

V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

Y. Ying and D. X. Zhou. Online regularized classification algorithms. IEEE Transactions on Information Theory, 52, 2006.

A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM Journal on Matrix Analysis and Applications, 34, 2013.


More information

SPECTRAL THEOREM FOR SYMMETRIC OPERATORS WITH COMPACT RESOLVENT

SPECTRAL THEOREM FOR SYMMETRIC OPERATORS WITH COMPACT RESOLVENT SPECTRAL THEOREM FOR SYMMETRIC OPERATORS WITH COMPACT RESOLVENT Abstract. These are the letcure notes prepared for the workshop on Functional Analysis and Operator Algebras to be held at NIT-Karnataka,

More information

Lecture 3. Random Fourier measurements

Lecture 3. Random Fourier measurements Lecture 3. Random Fourier measurements 1 Sampling from Fourier matrices 2 Law of Large Numbers and its operator-valued versions 3 Frames. Rudelson s Selection Theorem Sampling from Fourier matrices Our

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,

More information

Abstract. 1. Introduction

Abstract. 1. Introduction Journal of Computational Mathematics Vol.28, No.2, 2010, 273 288. http://www.global-sci.org/jcm doi:10.4208/jcm.2009.10-m2870 UNIFORM SUPERCONVERGENCE OF GALERKIN METHODS FOR SINGULARLY PERTURBED PROBLEMS

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Invertibility of symmetric random matrices

Invertibility of symmetric random matrices Invertibility of symmetric random matrices Roman Vershynin University of Michigan romanv@umich.edu February 1, 2011; last revised March 16, 2012 Abstract We study n n symmetric random matrices H, possibly

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

Are There Sixth Order Three Dimensional PNS Hankel Tensors?

Are There Sixth Order Three Dimensional PNS Hankel Tensors? Are There Sixth Order Three Dimensional PNS Hankel Tensors? Guoyin Li Liqun Qi Qun Wang November 17, 014 Abstract Are there positive semi-definite PSD) but not sums of squares SOS) Hankel tensors? If the

More information

Chapter 4. Measure Theory. 1. Measure Spaces

Chapter 4. Measure Theory. 1. Measure Spaces Chapter 4. Measure Theory 1. Measure Spaces Let X be a nonempty set. A collection S of subsets of X is said to be an algebra on X if S has the following properties: 1. X S; 2. if A S, then A c S; 3. if

More information

Fredholm Theory. April 25, 2018

Fredholm Theory. April 25, 2018 Fredholm Theory April 25, 208 Roughly speaking, Fredholm theory consists of the study of operators of the form I + A where A is compact. From this point on, we will also refer to I + A as Fredholm operators.

More information

On the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables

On the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables On the Set of Limit Points of Normed Sums of Geometrically Weighted I.I.D. Bounded Random Variables Deli Li 1, Yongcheng Qi, and Andrew Rosalsky 3 1 Department of Mathematical Sciences, Lakehead University,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

1 Regression with High Dimensional Data

1 Regression with High Dimensional Data 6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

Weighted Sums of Orthogonal Polynomials Related to Birth-Death Processes with Killing

Weighted Sums of Orthogonal Polynomials Related to Birth-Death Processes with Killing Advances in Dynamical Systems and Applications ISSN 0973-5321, Volume 8, Number 2, pp. 401 412 (2013) http://campus.mst.edu/adsa Weighted Sums of Orthogonal Polynomials Related to Birth-Death Processes

More information

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define Homework, Real Analysis I, Fall, 2010. (1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define ρ(f, g) = 1 0 f(x) g(x) dx. Show that

More information

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals Acta Applicandae Mathematicae 78: 145 154, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 145 Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals M.

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Some limit theorems on uncertain random sequences

Some limit theorems on uncertain random sequences Journal of Intelligent & Fuzzy Systems 34 (218) 57 515 DOI:1.3233/JIFS-17599 IOS Press 57 Some it theorems on uncertain random sequences Xiaosheng Wang a,, Dan Chen a, Hamed Ahmadzade b and Rong Gao c

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

Scaling Limits of Waves in Convex Scalar Conservation Laws under Random Initial Perturbations

Scaling Limits of Waves in Convex Scalar Conservation Laws under Random Initial Perturbations Scaling Limits of Waves in Convex Scalar Conservation Laws under Random Initial Perturbations Jan Wehr and Jack Xin Abstract We study waves in convex scalar conservation laws under noisy initial perturbations.

More information

Your first day at work MATH 806 (Fall 2015)

Your first day at work MATH 806 (Fall 2015) Your first day at work MATH 806 (Fall 2015) 1. Let X be a set (with no particular algebraic structure). A function d : X X R is called a metric on X (and then X is called a metric space) when d satisfies

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Convergence rates of spectral methods for statistical inverse learning problems

Convergence rates of spectral methods for statistical inverse learning problems Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control

Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control Chun-Hua Guo Dedicated to Peter Lancaster on the occasion of his 70th birthday We consider iterative methods for finding the

More information

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS SPRING 006 PRELIMINARY EXAMINATION SOLUTIONS 1A. Let G be the subgroup of the free abelian group Z 4 consisting of all integer vectors (x, y, z, w) such that x + 3y + 5z + 7w = 0. (a) Determine a linearly

More information

Time Series 3. Robert Almgren. Sept. 28, 2009

Time Series 3. Robert Almgren. Sept. 28, 2009 Time Series 3 Robert Almgren Sept. 28, 2009 Last time we discussed two main categories of linear models, and their combination. Here w t denotes a white noise: a stationary process with E w t ) = 0, E

More information

FUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES. Yılmaz Yılmaz

FUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES. Yılmaz Yılmaz Topological Methods in Nonlinear Analysis Journal of the Juliusz Schauder Center Volume 33, 2009, 335 353 FUNCTION BASES FOR TOPOLOGICAL VECTOR SPACES Yılmaz Yılmaz Abstract. Our main interest in this

More information

1 Math 241A-B Homework Problem List for F2015 and W2016

1 Math 241A-B Homework Problem List for F2015 and W2016 1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let

More information

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Jinfeng Yi JINFENGY@US.IBM.COM IBM Thomas J. Watson Research Center, Yorktown

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Viscosity Iterative Approximating the Common Fixed Points of Non-expansive Semigroups in Banach Spaces

Viscosity Iterative Approximating the Common Fixed Points of Non-expansive Semigroups in Banach Spaces Viscosity Iterative Approximating the Common Fixed Points of Non-expansive Semigroups in Banach Spaces YUAN-HENG WANG Zhejiang Normal University Department of Mathematics Yingbing Road 688, 321004 Jinhua

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization

An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization H. Mansouri M. Zangiabadi Y. Bai C. Roos Department of Mathematical Science, Shahrekord University, P.O. Box 115, Shahrekord,

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

On the simplest expression of the perturbed Moore Penrose metric generalized inverse

On the simplest expression of the perturbed Moore Penrose metric generalized inverse Annals of the University of Bucharest (mathematical series) 4 (LXII) (2013), 433 446 On the simplest expression of the perturbed Moore Penrose metric generalized inverse Jianbing Cao and Yifeng Xue Communicated

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

SPATIAL STRUCTURE IN LOW DIMENSIONS FOR DIFFUSION LIMITED TWO-PARTICLE REACTIONS

SPATIAL STRUCTURE IN LOW DIMENSIONS FOR DIFFUSION LIMITED TWO-PARTICLE REACTIONS The Annals of Applied Probability 2001, Vol. 11, No. 1, 121 181 SPATIAL STRUCTURE IN LOW DIMENSIONS FOR DIFFUSION LIMITED TWO-PARTICLE REACTIONS By Maury Bramson 1 and Joel L. Lebowitz 2 University of

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information