arxiv: v1 [stat.ml] 31 Oct 2018

Size: px
Start display at page:

Download "arxiv: v1 [stat.ml] 31 Oct 2018"

Transcription

1 On Fast Leverage Score Sampling and Optimal Learning Alessandro Rudi, Luigi Carratino 3 luigi.carratino@dibris.unige.it Daniele Calandriello,2 daniele.calandriello@iit.it Lorenzo Rosasco 2,3 lrosasco@mit.edu arxiv: v [stat.ml] 3 Oct 208 Abstract Leverage score sampling provides an appealing way to perform approximate computations for large matrices. Indeed, it allows to derive faithful approximations with a complexity adapted to the problem at hand. Yet, performing leverage scores sampling is a challenge in its own right requiring further approximations. In this paper, we study the problem of leverage score sampling for positive definite matrices defined by a kernel. Our contribution is twofold. First we provide a novel algorithm for leverage score sampling and second, we exploit the proposed method in statistical learning by deriving a novel solver for kernel ridge regression. Our main technical contribution is showing that the proposed algorithms are currently the most efficient and accurate for these problems. Introduction A variety of machine learning problems require manipulating and performing computations with large matrices that often do not fit memory. In practice, randomized techniques are often employed to reduce the computational burden. Examples include stochastic approximations [], columns/rows subsampling and more general sketching techniques [2, 3]. One of the simplest approach is uniform column sampling [4, 5], that is replacing the original matrix with a subset of columns chosen uniformly at random. This approach is fast to compute, but the number of columns needed for a prescribed approximation accuracy does not take advantage of the possible low rank structure of the matrix at hand. As discussed in [6], leverage score sampling provides a way to tackle this shortcoming. Here columns are sampled proportionally to suitable weights, called leverage scores (LS) [7, 6]. With this sampling strategy, the number of columns needed for a prescribed accuracy is governed by the so called effective dimension which is a natural extension of the notion of rank. Despite these nice properties, performing leverage score sampling provides a challenge in its own right, since it has complexity in the same order of an eigendecomposition of the original matrix. Indeed, much effort has been recently devoted to derive fast and provably accurate algorithms for approximate leverage score sampling [2, 8, 6, 9]. Equal contribution. INRIA Sierra-project team & École Normale Supérieure, Paris. 2 LCSL Istituto Italiano di Tecnologia, Genova, Italy & MIT, Cambridge, USA. 3 DIBRIS Università degli Studi di Genova, Genova, Italy.

2 In this paper, we consider these questions in the case of positive semi-definite matrices, central for example in Gaussian processes [0] and kernel methods []. Sampling approaches in this context are related to the so called Nyström approximation [2] and Nyström centers selection problem [0], and are widely studied both in practice [4] and in theory [5]. Our contribution is twofold. First, we propose and study BLESS, a novel algorithm for approximate leverage scores sampling. The first solution to this problem is introduced in [6], but has poor approximation guarantees and high time complexity. Improved approximations are achieved by algorithms recently proposed in [8] and [9]. In particular, the approach in [8] can obtain good accuracy and very efficient computations but only as long as distributed resources are available. Our first technical contribution is showing that our algorithm can achieve state of the art accuracy and computational complexity without requiring distributed resources. The key idea is to follow a coarse to fine strategy, alternating uniform and leverage scores sampling on sets of increasing size. Our second, contribution is considering leverage score sampling in statistical learning with least squares. We extend the approach in [3] for efficient kernel ridge regression based on combining fast optimization algorithms (preconditioned conjugate gradient) with uniform sampling. Results in [3] showed that optimal learning bounds can be achieved with a complexity which is Õ(n n) in time and Õ(n) space. In this paper, we study the impact of replacing uniform with leverage score sampling. In particular, we prove that the derived method still achieves optimal learning bounds but the time and memory is now Õ(nd eff), and Õ(d eff 2 ) respectively, where d eff is the effective dimension which and is never larger, and possibly much smaller, than n. To the best of our knowledge this is the best currently known computational guarantees for a kernel ridge regression solver. 2 Leverage score sampling with BLESS After introducing leverage score sampling and previous algorithms, we present our approach and first theoretical results. 2. Leverage score sampling Suppose K R n n is symmetric and positive semidefinite. A basic question is deriving memory efficient approximation of K [4, 8] or related quantities, e.g. approximate projections on its range [9], or associated estimators, as in kernel ridge regression [4, 3]. The eigendecomposition of K offers a natural, but computationally demanding solution. Subsampling columns (or rows) is an appealing alternative. A basic approach is uniform sampling, whereas a more refined approach is leverage scores sampling. This latter procedure corresponds to sampling columns with probabilities proportional to the leverage scores ( l(i, ) = K( K ) + ni), i [n], () ii where [n] = {,..., n}. The advantage of leverage score sampling, is that potentially very few columns can suffice for the desired approximation. Indeed, letting d () = n max i=,...,n l(i, ), d eff() = 2 n l(i, ), i=

3 for > 0, it is easy to see that d eff () d () / for all, and previous results show that the number of columns required for accurate approximation are d for uniform sampling and d eff for leverage score sampling [5, 6]. However, it is clear from definition () that an exact leverage scores computation would require the same order of computations as an eigendecomposition, hence approximations are needed.the accuracy of approximate leverage scores is typically measured by t > 0 in multiplicative bounds of the form + t l(i, ) l(i, ) ( + t)l(i, ), i [n]. (2) Before proposing a new improved solution, we briefly discuss relevant previous works. To provide a unified view, some preliminary discussion is useful. 2.2 Approximate leverage scores First, we recall how a subset of columns can be used to compute approximate leverage scores. For M n, let J = {j i } M i= with j i [n], and K J,J R M M with entries (K J,J ) lm = K jl,j m. For i [n], let K J,i = ( K j,i,..., K jm,i) and consider for > /n, l J (i, ) = (n) ( K ii K J,i( K J,J + na) KJ,i ), (3) where A R M M is a matrix to be specified (see later for details). The above definition is motivated by the observation that if J = [n], and A = I, then l J (i, ) = l(i, ), by the following identity K( K + ni) = (n) ( K K( K + ni) K). In the following, it is also useful to consider a subset of leverage scores computed as in (3). For M R n, let U = {u i } R i= with u i [n], and Also in the following we will use the notation L J (U, ) = { l J (u, ),..., l J (u R, )}. (4) L J (U, ) J (5) to indicate the leverage score sampling of J U columns based on the leverage scores L J (U, ), that is the procedure of sampling columns from U according to their leverage scores, computed using J, to obtain a new subset of columns J. We end noting that leverage score sampling (5) requires O(M 2 ) memory to store K J, and O(M 3 + RM 2 ) time to invert K J, and compute R leverage scores via (3). Clearly, l J depends on the choice of the matrix A, but we omit this dependence to simplify the notation. 3

4 2.3 Previous algorithms for leverage scores computations We discuss relevant previous approaches using the above quantities. Two-Pass sampling [6]. This is the first approximate leverage score sampling proposed, and is based on using directly (5) as L J (U 2, ) J 2, with U 2 = [n] and J a subset taken uniformly at random. Here we call this method Two-Pass sampling since it requires two rounds of sampling on the whole set [n], one uniform to select J and one using leverage scores to select J 2. Recursive-RLS [9]. This is a development of Two-Pass sampling based on the idea of recursing the above construction. In our notation, let U U 2 U 3 = [n], where U, U 2 are uniformly sampled and have cardinalities n/4 and n/2, respectively. The idea is to start from J = U, and consider first L J (U 2, ) J 2, but then continue with L J2 (U 3, ) J 3. Indeed, the above construction can be made recursive for a family of nested subsets (U h ) H of cardinalities n/2 h, considering J = U and L Jh (U h+, ) J h+. (6) SQUEAK[8]. This approach follows a different iterative strategy. Consider a partition U, U 2, U 3 of [n], so that U j = n/3, for j =, Then, consider J = U, and and then continue with L J U 2 (J U 2, ) J 2, L J2 U 3 (J 2 U 3, ) J 3. Similarly to the other cases, the procedure is iterated considering H subsets (U h ) H h= each with cardinality n/h. Starting from J = U the iterations is L Jh U h+ (J h U h+, ). (7) We note that all the above procedures require specifying the number of iteration to be performed, the weights matrix to compute the leverage scores at each iteration, and a strategy to select the subsets (U h ) h. In all the above cases the selection of U h is based on uniform sampling, while the number of iterations and weight choices arise from theoretical considerations (see [6, 8, 9] for details). Note that Two-Pass sampling uses a set J of cardinality roughly / (an upper bound on d ()) and incurs in a computational cost of RM 2 = n/ 2. In comparison, Recursive-RLS [9] leads to essentially the same accuracy while improving computations. In particular, the sets J h are never larger than d eff (). Taking into account that at the last iteration performs leverage score sampling on U h = [n], the total computational complexity 4

5 Algorithm Bottom-up Leverage Scores Sampling (BLESS) Input: dataset {x i } n i=, regularization, step q, starting reg. 0, constants q, q 2 controlling the approximation level. Output: M h [n] number of selected points, J h set of indexes, A h weights. : J 0 =, A 0 = [], H = log( 0/) log q 2: for h =... H do 3: h = h /q 4: set constant R h = q min{κ 2 / h, n} 5: sample U h = {u,..., u Rh } i.i.d. u i Uniform([n]) 6: compute l Jh (x uk, h ) for all u k U h using Eq. 3 7: set P h = (p h,k ) R h k= with p h,k = l Jh (x uk, h )/( ljh u U h (x u, h )) 8: set constant M h = q 2 d h with d h = n ljh R h u U h (x u, h ), and 9: sample J h = {j,..., j Mh } i.i.d. j ) i Multinomial(P h, U h ) 0: A h = R hm h n diag (p h,j,..., p h,jmh : end for is nd eff () 2. SQUEAK [8] recovers the same accuracy, size of J h, and nd eff () 2 time complexity when U h d eff (), but only requires a single pass over the data. We also note that a distributed version of SQUEAK is discussed in [8], which allows to reduce the computational cost to nd eff () 2 /p, provided p machines are available. 2.4 Leverage score sampling with BLESS The procedure we propose, dubbed BLESS, has similarities to the one proposed in [9] (see (6)), but also some important differences. The main difference is that, rather than a fixed, we consider a decreasing sequence of parameters 0 > > > H = resulting in different algorithmic choices. For the construction of the subsets U h we do not use nested subsets, but rather each (U h ) H h= is sampled uniformly and independently, with a size smoothly increasing as / h. Similarly, as in [9] we proceed iteratively, but at each iteration a different decreasing parameter h is used to compute the leverage scores. Using the notation introduced above, the iteration of BLESS is given by L Jh (U h+, h+ ) J h+, (8) where the initial set J = U is sampled uniformly with size roughly / 0. BLESS has two main advantages. The first is computational: each of the sets U h, including the final U H, has cardinality smaller than /. Therefore the overall runtime has a cost of only RM 2 M 2 /, which can be dramatically smaller than the nm 2 cost achieved by the methods in [9], [8] and is comparable to the distributed version of SQUEAK using p = /n machines. The second advantage is that a whole path of leverage scores {l(i, h )} H h= is computed at once, in the sense that at each iteration accurate approximate leverage scores at scale h are computed. This is extremely useful in practice, as it can be used when cross-validating h. As a comparison, for all previous method a full run of the algorithm is needed for each value of h. 5

6 Algorithm 2 Bottom-up Leverage Scores Sampling without Replacement (BLESS-R) Input: dataset {x i } n i=, regularization, step q, starting reg. 0, constant q 2 controlling the approximation level. Output: M h [n] number of selected points, J h set of indexes, A h weights. : J 0 =, A 0 = [], H = log( 0/) log q, 2: for h =... H do 3: h = h /q 4: set constant β h = min{q 2 κ 2 /( h n), } 5: initialize U h = 6: for i [n] do 7: add i to U h with probability β h 8: end for 9: for j U h do 0: compute p h,j = min{q 2 ljh (x j, h ), } : add j to J h with probability p h,j /β h 2: end for ) 3: J h = {j,..., j Mh }, and A h = diag (p h,j,..., p h,jmh. 4: end for In the paper we consider two variations of the above general idea leading to Algorithm and Algorithm 2. The main difference in the two algorithms lies in the way in which sampling is performed: with and without replacement, respectively. In particular, considering sampling without replacement (see 2) it is possible to take the set (U h ) H h= to be nested and also to obtain slightly improved results, as shown in the next section. The derivation of BLESS rests on some basic ideas. First, note that, since sampling uniformly a set U of size d () / allows a good approximation, then we can replace L [n] ([n], ) J by L U (U, ) J, (9) where J can be taken to have cardinality d eff (). However, this is still costly, and the idea is to repeat and couple approximations at multiple scales. Consider >, a set U of size d ( ) / sampled uniformly, and L U (U, ) J. The basic idea behind BLESS is to replace (9) by L J (U, ) J. The key result, see, is that taking J of cardinality ( /)d eff () (0) suffice to achieve the same accuracy as J. Now, if we take sufficiently large, it is easy to see that d eff ( ) d ( ) /, so that we can take J uniformly at random. However, the factor ( /) in (0) becomes too big. Taking multiple scales fix this problem and leads to the iteration in (8). 2.5 Theoretical guarantees Our first main result establishes in a precise and quantitative way the advantages of BLESS. 6

7 Algorithm Runtime J Uniform Sampling [5] / Exact RLS Sampl. n 3 d eff () Two-Pass Sampling [6] n/ 2 d eff () Recursive RLS [9] nd eff () 2 d eff () SQUEAK [8] nd eff () 2 d eff () This work, Alg. and 2 / d eff () 2 d eff () Table : The proposed algorithms are compared with the state of the art (in Õ notation), in terms of time complexity and cardinality of the set J required to satisfy the approximation condition in Eq. 2. Theorem. Let n N, > 0 and δ (0, ]. Given t > 0, q > and H N, ( h ) H h= defined as in Algorithms and 2, when (J h, a h ) H h= are computed. by Alg. with parameters 0 = κ2 min(t,), q 5κ2 q 2 q(+t), q 2 2q (2t+)2 2. by Alg. 2 with parameters 0 = κ2 min(t,), q 54κ 2 (2t+) 2 log 2Hn t 2 δ, t 2 ( + t) log 2Hn δ, let l Jh (i, h ) as in Eq. (3) depending on J h, A h, then with probability at least δ: (a) + t l(i, h) l Jh (i, h ) ( + min(t, ))l(i, h ), i [n], h [H], (b) J h q 2 d eff ( h ), h [H]. The above result confirms that the subsets J h computed by BLESS are accurate in the desired sense, see (2), and the size of all J h is small and proportional to d eff ( h ), leading to a computational cost of only O ( min (, n) d eff () 2 log 2 ) ( in time and O deff () 2 log 2 ) in space (for additional properties of J h see Thm. 4 in appendixes). Table compares the complexity and number of columns sampled by BLESS with other methods. The crucial point is that in most applications, the parameter is chosen as a decreasing function of n, e.g. = / n, resulting in potentially massive computational gains. Indeed, since BLESS computes leverage scores for sets of size at most /, this allows to perform leverage scores sampling on matrices with millions of rows/columns, as shown in the experiments. In the next section, we illustrate the impact of BLESS in the context of supervised statistical learning. 3 Efficient supervised learning with leverage scores In this section, we discuss the impact of BLESS in a supervised learning. Unlike most previous results on leverage scores sampling in this context [6, 8, 9], we consider the setting of statistical learning, where the challenge is that inputs, as well as the outputs, are random. More precisely, given a probability space (X Y, ρ), where Y R, and considering least 7

8 squares, the problem is to solve min E(f), E(f) = f H X Y (f(x) y) 2 dρ(x, y), () when ρ is known only through (x i, y i ) n i= ρn. In the above minimization problem, H is a reproducing kernel Hilbert space defined by a positive definite kernel K : X X R []. Recall that the latter is defined as the completion of span{k(x, ) x X} with the inner product K(x, ), K(x, ) H = K(x, x ). The quality of an empirical approximate solution f is measured via probabilistic bounds on the excess risk R( f ) = E( f ) min f H E(f). 3. Learning with FALKON-BLESS The algorithm we propose, called FALKON-BLESS, combines BLESS with FALKON [3] a state of the art algorithm to solve the least squares problem presented above. The appeal of FALKON is that it is currently the most efficient solution to achieve optimal excess risk bounds. As we discuss in the following, the combination with BLESS leads to further improvements. We describe the derivation of the considered algorithm starting from kernel ridge regression (KRR) n f (x) = K(x, x i )c i, c = ( K + ni) Ŷ (2) i= where c = (c,..., c n ), Ŷ = (y,..., y n ) and K R n n is the empirical kernel matrix with entries ( K) ij = K(x i, x j ). KRR has optimal statistical properties [5], but large O(n 3 ) time and O(n 2 ) space requirements. FALKON can be seen as an approximate ridge regression solver combining a number of algorithmic ideas. First, sampling is used to select a subset { x,..., x M } of the input data uniformly at random, and to define an approximate solution f,m (x) = M K( x j, x)α j, α = (KnMK nm + K MM ) KnMy, (3) j= where α = (α,..., α M ), K nm R n M, has entries (K nm ) ij = K(x i, x j ) and K MM R M M has entries (K MM ) jj = K( x j, x j ), with i [n], j, j [M]. We note, that the linear system in (3) can be seen to obtained from the one in (2) by uniform column subsampling of the empirical kernel matrix. The columns selected corresponds to the inputs { x,..., x M }. FALKON proposes to compute a solution of the linear system 3 via a preconditioned iterative solver. The preconditioner is the core of the algorithm and is defined by a matrix B such that BB = ( n M K2 MM + K MM ). (4) The above choice provides a computationally efficient approximation to the exact preconditioner of the linear system in (3) corresponding to B such that BB = (K nm K nm + K MM ). The preconditioner in (4) can then be combined with conjugate gradient to solve the linear system in (3). The overall algorithm has complexity O(nMt) in time and O(M 2 ) in space, where t is the number of conjugate gradient iterations performed. 8

9 3 RLS Accuracy Time R-ACC 5 th / 95 th quant BLESS / 2.03 BLESS-R /.50 SQUEAK /.48 Uniform / 3.75 RRLS / 2.70 R-ACC BLESS BLESS-R SQUEAK Uniform RRLS Figure : Leverage scores relative accuracy for = 0 5, n = , M = 0 000, 0 repetitions. In this paper, we analyze a variation of FALKON where the points { x,..., x M } are selected via leverage score sampling using BLESS, see Algorithm or Algorithm 2, so that M = M h and x k = x jk, for J h = {j,..., j Mh } and k [M h ]. Further, the preconditioner in (4) is replaced by B h B h = ( n M K J h,j h A h K J h,j h + h K Jh,J h ). (5) This solution can lead to huge computational improvements. Indeed, the total cost of FALKON-BLESS is the sum of computing BLESS and FALKON, corresponding to O ( nmt + (/)M 2 log n + M 3) O(M 2 ), (6) in time and space respectively, where M is the size of the set J H returned by BLESS. 3.2 Statistical properties of FALKON-BLESS In this section, we state and discuss our second main result, providing an excess risk bound for FALKON-BLESS. Here a population version of the effective dimension plays a key role. Let ρ X be the marginal measure of ρ on X, let C : H H be the linear operator defined as follows and d eff () be the population version of d eff (), d eff () = Tr(C(C + I) ), with (Cf)(x ) = K(x, x)f(x)dρ X (x), for any f H, x X. It is possible to show that d eff () is the limit of d eff () as n goes to infinity, see Lemma below taken from [4]. If we assume throughout that, X K(x, x ) κ 2, x, x X, (7) then the operator C is symmetric, positive definite and trace class, and the behavior of d eff () can be characterized in terms of the properties of the eigenvalues (σ j ) j N of C. Indeed as for d eff (), we have that d eff () κ 2 /, moreover if σ j = O(j α ), for α, we have d eff () = O( /α ). Then for larger α, d eff is smaller than / and faster learning rates are possible, as shown below. We next discuss the properties of the FALKON-BLESS solution denoted by f,n,t. 9

10 Theorem 2. Let n N, > 0 and δ (0, ]. Assume that y [ a 2, a 2 ], almost surely, a > 0, and denote by f H a minimizer of (). There exists n 0 N, such that for any n n 0, if t log n, 9κ2 n log n δ, then the following holds with probability at least δ: ) R( f,n,t ) 4a n + 32 f H 2 H ( a 2 log 2 2 δ n 2 + a d eff() log 2 δ n In particular, when d eff () = O( /α ), for α, by selecting = n α/(α+), we have where c is given explicitly in the proof. R( f,n,t) cn α α+, We comment on the above result discussing the statistical and computational implications. Statistics.The above theorem provides statistical guarantees in terms of finite sample bounds on the excess risk of FALKON-BLESS, A first bound depends of the number of examples n, the regularization parameter and the population effective dimension d eff (). The second bound is derived optimizing, and is the same as the one achieved by exact kernel ridge regression which is known to be optimal [5, 6, 7]. Note that improvements under further assumptions are possible and are derived in the supplementary materials, see Thm. 8. Here, we comment on the computational properties of FALKON-BLESS and compare it to previous solutions. Computations.To discuss computational implications, we recall a result from [4] showing that the population version of the effective dimension d eff () and the effective dimension d eff () associated to the empirical kernel matrix converge up to constants. Lemma. Let > 0 and δ (0, ]. When 9κ2 n log n δ, then with probability at least δ, (/3)d eff () d eff () 3d eff (). Recalling the complexity of FALKON-BLESS (6), using Thm 2 and Lemma, we derive a cost O (nd eff () log n + ) d eff () 2 log n + d eff () 3 in time and O(d eff () 2 ) in space, for all n, satisfying the assumptions in Theorem 2. These expressions can be further simplified. Indeed, it is easy to see that for all > 0, + d eff () κ 2 /, (8) so that d eff () 3 κ2 d eff () 2. Moreover, if we consider the optimal choice = O(n α α+ ) given in Theorem 2, and take d eff () = O( /α ), we have d eff ( ) O(n), and therefore d eff () 2 O(nd eff ()). In summary, for the parameter choices leading to optimal learning rates, FALKON-BLESS has complexity Õ(nd eff ( )), in time and Õ(d eff ( ) 2 ) in space, ignoring log terms. We can compare this to previous results. In [3] uniform. 0

11 Seconds BLESS BLESS-R SQUEAK RRLS Time Comparison Classification Error FALKON-BLESS FALKON-UNI Test Error Number of Points falkon Figure 2: Runtimes with = 0 3 and n increasing Figure 3: C-err at 5 iterations for varying falkon sampling is considered leading to M O(/) and achieving a complexity of Õ(n/) which is always larger than the one achieved by FALKON in view of (8). Approximate leverage scores sampling is also considered in [3] requiring Õ(nd eff() 2 ) time and reducing the time complexity of FALKON to Õ(nd eff( )). Clearly in this case the complexity of leverage scores sampling dominates, and our results provide BLESS as a fix. 4 Experiments Leverage scores accuracy. We first study the accuracy of the leverage scores generated by BLESS and BLESS-R, comparing SQUEAK [8] and Recursive-RLS (RRLS) [9]. We begin by uniformly sampling a subsets of n = points from the SUSY dataset [8], and computing the exact leverage scores l(i, ) using a Gaussian Kernel with σ = 4 and = 0 5, which is at the limit of our computational feasibility. We then run each algorithm to compute the approximate leverage scores l JH (i, ), and we measure the accuracy of each method using the ratio l JH (i, )/l(i, ) (R-ACC). The final results are presented in Figure. On the left side for each algorithm we report runtime, mean R-ACC, and the 5 th and 95 th quantile, each averaged over the 0 repetitions. On the right side a box-plot of the R-ACC. As shown in Figure BLESS and BLESS-R achieve the same optimal accuracy of SQUEAK with just a fraction of time. Note that despite our best efforts, we could not obtain high-accuracy results for RRLS (maybe a wrong constant in the original implementation). However note that RRLS is computationally demanding compared to BLESS, being orders of magnitude slower, as expected from the theory. Finally, although uniform sampling is the fastest approach, it suffers from much larger variance and can over or under-estimate leverage scores by an order of magnitude more than the other methods, making it more fragile for downstream applications. In Fig. 2 we plot the runtime cost of the compared algorithms as the number of points grows from n = 000 to 70000, this time for = 0 3. We see that while previous algorithms runtime grows near-linearly with n, BLESS and BLESS-R run in a constant / runtime, as predicted by the theory. BLESS for supervised learning. We study the performance of FALKON-BLESS and compare it with the original FALKON [3] where an equal number of Nyström centres are

12 0.85 Test Accuracy 0.8 Test Accuracy AUC AUC Iterations Iterations Figure 4: AUC per iteration of the SUSY dataset Figure 5: AUC per iteration of the HIGGS dataset sampled uniformly at random (FALKON-UNI). We take from [3] the two biggest datasets and their best hyper-parameters for the FALKON algorithm. We noticed that it is possible to achieve the same accuracy of FALKON-UNI, by using bless for BLESS and falkon for FALKON with bless falkon, in order to lower the d eff and keep the number of Nyström centres low. For the SUSY dataset we use a Gaussian Kernel with σ = 4, falkon = 0 6, bless = 0 4 obtaining M h 0 4 Nyström centres. For the HIGGS dataset we use a Gaussian Kernel with σ = 22, falkon = 0 8, bless = 0 6, obtaining M h Nyström centres. We then sample a comparable number of centers uniformly for FALKON-UNI. Looking at the plot of their AUC at each iteration (Fig.4,5) we observe that FALKON-BLESS converges much faster than FALKON-UNI. For the SUSY dataset (Figure 4) 5 iterations of FALKON-BLESS (60 seconds) achieve the same accuracy of 20 iterations of FALKON-UNI (60 seconds). Since running BLESS takes just 2 secs. this corresponds to a 4 speedup. For the HIGGS dataset 0 iter. of FALKON-BLESS (with BLESS requiring.5 minutes, for a total of.4 hours) achieve the same accuracy of 20 iter. of FALKON-UNI (2.7 hours). Additionally we observed that FALKON-BLESS is more stable than FALKON-UNI w.r.t. falkon, σ. In Figure 3 the classification error after 5 iterations of FALKON-BLESS and FALKON-UNI over the SUSY dataset ( bless = 0 4 ). We notice that FALKON-BLESS has a wider optimal region (95% of the best error) for the regulariazion parameter ([.3 0 3, ]) w.r.t. FALKON-UNI ([.3 0 3, ]). 5 Conclusions In this paper we presented two algorithms BLESS and BLESS-R to efficiently compute a small set of columns from a large symmetric positive semidefinite matrix K, useful for approximating the matrix or to compute leverage scores with a given precision. Moreover we applied the proposed algorithms in the context of statistical learning with least squares, combining BLESS with FALKON [3]. We analyzed the computational and statistical properties of the resulting algorithm, showing that it achieves optimal statistical guarantees with a cost that is O(nd eff ()) in time, being currently the fastest. We can extend the proposed work in several ways: (a) combine BLESS with fast stochastic gradient algorithms [9] and other approximation schemes (i.e. random features [20, 2]), to further reduce 2

13 the computational complexity for optimal rates, (b) consider the impact of BLESS in the context of multi-tasking [22, 23] or structured prediction [24, 25]. Acknowledgments. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-2326, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla k40 GPU used for this research. L. R. acknowledges the support of the AFOSR projects FA and BAA-AFRL-AFOSR (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV A. R. acknowledges the support of the European Research Council (grant SEQUOIA ). References [] Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for pca and pls. In Communication, Control, and Computing (Allerton), th Annual Allerton Conference on, pages IEEE, 202. [2] David P. Woodruff. Sketching as a tool for numerical linear algebra. arxiv preprint arxiv:4.4357, 204. [3] Joel A Tropp. User-friendly tools for random matrices: An introduction. Technical report, CALIFORNIA INST OF TECH PASADENA DIV OF ENGINEERING AND APPLIED SCIENCE, 202. [4] Christopher Williams and Matthias Seeger. Using the Nystrom method to speed up kernel machines. In Neural Information Processing Systems, 200. [5] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, 203. [6] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 205. [7] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 3(): , 202. [8] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed sequential sampling for kernel matrix approximation. In AISTATS, 207. [9] Cameron Musco and Christopher Musco. Recursive Sampling for the Nyström Method. In NIPS, 207. [0] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, Cambridge, Mass, OCLC: ocm

14 [] Bernhard Schölkopf, Alexander J Smola, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, [2] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning [3] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages , 207. [4] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages , 205. [5] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized leastsquares algorithm. Foundations of Computational Mathematics, 7(3):33 368, [6] Ingo Steinwart, Don R Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In COLT, [7] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over hilbert spaces. Applied and Computational Harmonic Analysis, 208. [8] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 204. [9] Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence _rate for finite training sets. In Advances in neural information processing systems, pages , 202. [20] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 77 84, [2] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages , 207. [22] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3): , [23] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent multitask learning with nonlinear output relations. In Advances in Neural Information Processing Systems, pages , 207. [24] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. Advances in Neural Information Processing Systems 29 (NIPS), pages , 206. [25] Anna Korba, Alexandre Garcia, and Florence d Alché Buc. A structured prediction approach for label ranking. arxiv preprint arxiv: ,

15 [26] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3): , 950. [27] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media,

16 A Theoretical Analysis for Algorithms and 2 In this section, Thm. 4 and Thm. 5 provide guarantees for the two methods, from which Thm. is derived. In particular in Section A.4 some important properties about (out-of-sample-)leverage scores, that will be used in the proofs, are derived. A. Notation Let X be a Polish space and K : X X R a positive semidefinite function on X, we denote H the Hilbert space obtained by the completion of H = span{k(x, ) x X} according to the norm induced by the inner product K(x, ), K(x, ) H = K(x, x ). Spaces H constructed in this way are known as reproducing kernel Hilbert spaces and there is a one-to-one relation between a kernel K and its associated RKHS. For more details on RKHS we refer the reader to [26, 27]. Given a kernel K, in the following we will denote with K x = K(x, ) H for all x X. We say that a kernel is bounded if K x H κ with κ > 0. In the following we will always assume K to be continuous and bounded by κ > 0. The continuity of K with the fact that X is Polish implies H to be separable [27]. In the rest of the appdendizes we denote with A, the operator A+I, for any symmetric linear operator A, R and I the identity operator. A.2 Definitions For n N, (x i ) n i=, and J {,..., n}, A R J J diagonal matrix with positive diagonal, denote l J in eq. (3) by showing the dependence from both J and A as Moreover define ĈJ,A as follows l J,A (i, ) = (n) ( K ii K J,i( K J,J + na) KJ,i ). (9) Ĉ J,A = J J i= A ii K x ji K xji. We define the out-of-sample leverage scores, that are an extension of l J,A to any point x in the space X. Definition (out-of-sample leverage scores). Let J = {j,..., j M } {,..., n}, with M N and A R M M be a positive diagonal matrix. Then for any x X and > 0 we define l J,A (x, ) = n (ĈJ,A + I) /2 K x 2 H. Moreover define l,[] (x, ) = (n) K(x, x). 6

17 In particular we denote by l(x, ) = l [n],i (x, ), the out of sample version of the leverage scores l(i, ). Indeed note that l(x i, ) = l(i, ) for i [n] and > 0 as proven by the next proposition that shows, more generally, the relation between l J,A and l J,A. Proposition. Let n N, (x i ) n i= X. For any > 0, J {,..., n}, A R J J with A positive diagonal, we that that for any x X, l J,A (x, ) in Def. and l J,A (x, ) in Def. 3, satisfy l J, n J A (x i, ) = l J,A (i, ), when J > 0, and l,[] (x i, ) = l,[] (i, ), when J = 0, for any i [n], > 0. Proof. Let J = {j,..., j J }. We will first show that l J,A (x, ) is characterized by, l J,A (x, ) = K(x, x) n n v J(x) (K J + J A) v J (x), with K J R M M with (K J ) lm = K(x jl, x jm ) and v J (x) = (K(x, x j ),..., K(x, x jm )). Denote with Z J : H R J, the linear operator defined by Z J = (K xj,..., K xj J ), that is (Z J f) k = K xjk, f, for f H and k {,... J }. Then, by denoting with B = J A H we have Z JB Z J = J J i= A ii K x ji K xji = ĈJ,A. Now note that, since (Q + I) = (I Q(Q + I) ) for any positive linear operator and > 0, we have l J,A (x, ) = K x, (ĈJ,A + I) K x n = K x, (I H n ĈJ,A(ĈJ,A + I) )K x H K(x, x) = K x, Z n n JB /2 (B /2 Z J ZJB /2 + I) B /2 Z J K x, H where in the last step we use the fact that R R(R R + I) = R (RR + I) R, for any bounded linear operator R and > 0. In particular we used it with R = B /2 Z J. Now note that Z J Z J R J J and in particular Z J Z J = K J, moreover Z J K x = v(x), so K(x, x) l J,A (x, ) = n K(x, x) = n K(x, x) = n n v(x) B /2 (B /2 K J B /2 + I) B /2 v(x) n v(x) (K J + B) v(x) n v(x) (K J + J A) v(x), where in the second step we used the fact that B /2 (B /2 KB /2 + I) B /2 = (K + B), for any invertible B any positive operator K and > 0. Finally note that K(x, x) l J, n J A (x i, ) = n n v(x) (K J + na) v(x) = l J,A (i, ). 7

18 A.3 Preliminary results Denote with G (A, B) the quantity G (A, B) = (A + I) /2 (A B)(A + I) /2, for A, B positive bounded linear operators and for > 0. Proposition 2. Let A, B be positive bounded linear operators and > 0, then I (A + I) /2 (B + I)(A + I) /2 = G (A, B) G (B, A) G (B, A), where the last inequality holds if G (B, A) <. Proof. For the sake of compactness denote with A the operator A + I and with B the operator B + I. First of all note that I = A /2 A A /2, so I A /2 B A /2 = A /2 A A /2 A /2 B A /2 = A /2 (A B )A /2 = A /2 (A B)A /2 = A /2 B /2 B /2 (A B)B /2 where in the last step we multiplied and divided by B /2. Then B /2 A /2, I A /2 B A /2 A /2 B /2 2 B /2 (A B)B /2, moreover, by Prop. 7 of [4] (see also Prop. 8 of [2]), if G (B, A) <, we have A /2 B /2 2 ( G (B, A)). Proposition 3. Let A, B, C be bounded positive linear operators on a Hilbert space. Let > 0. Then, the following holds G (A, C) G (A, B) + ( + G (A, B))G (B, C). Proof. In the following we denote with A the operator A + I and the same for B, C. Then A /2 (A C)A /2 A /2 (A B)A /2 + A /2 (B C)A /2. Now note that, by dividing and multiplying for B /2, we have A /2 (B C)A /2 = A /2 B /2 B /2 (B C)B /2 B /2 A /2 A /2 B /2 2 B /2 (B C)B /2 = A /2 B /2 2 G (B, C). Finally note that, since Z 2 = Z Z for any bounded linear operator Z, we have A /2 B /2 2 = A /2 B A /2 = I + (I A /2 B A /2 ) + I A /2 B A /2. Moreover, by Prop. 2, we have that I A /2 B A /2 = G (A, B). 8

19 Proposition 4. Let B be a bounded linear operator, then I BB σ min (B) 2 σ max (B) 2 + I BB. Proof. Now we recall that, denoting by the Lowner partial order, for a positive bounded operator A such that ai A bi for 0 a b, we have ( b)i I A ( a)i ( + b)i and so, since BB = I (I BB ), we have ( I BB )I σ min (B) 2 I BB σ max (B) 2 I + ( + I BB )I, from we have the desired result. Let HS denote the Hilbert-Schmidt norm. We recall and adapt to our needs a result from Prop. 8 of [4]. Proposition 5. Let > 0 and v,..., v n with n, be identically distributed random vectors on separable Hilbert space H, such that there exists κ 2 > 0 for which v H κ 2 almost surely. Denote by Q the Hermitian operator Q = n n i= E[v i v i ]. Let Q n = n n i= v i v i. Then for any δ (0, ], the following holds (Q + I) /2 (Q Q n )(Q + I) /2 4κ2 β 3n + with probability δ and β = log 4 Tr(Q(Q+I) ) Q(Q+I) δ 8κ2 (+Tr(Q Q)) Q δ. 2κ 2 β n Proof. Let Q = Q + I. Here we apply non-commutative Bernstein inequality like [3] (with the extension to separable Hilbert spaces as in[4], Prop. 2) on the random variables Z i = M Q /2 v i Q /2 v i with M i = Q /2 (E[v i v i ])Q /2 for i n. Note that the expectation of Z i is 0. The random vectors are bounded by Q /2 v i Q /2 v i M i = E v i [Q /2 v i Q /2 v i Q /2 v i Q /2 v i ] H 2 κ 2 (Q + ) /2 2 2κ2, and the second orded moment is E(Z i ) 2 = E v i, Q v /2 i Q v i Q /2 v i Q 2 Q2 κ2 E[Q /2 v Q /2 v ] = κ2 Q(Q + I) =: S. Now we can apply the Bernstein inequality with intrinsic dimension in [3] (or Prop. 2 in [4]). Now some considerations on β. It is β = log 4 Tr S S δ lower bound for Q Q = σ 4 Tr Q = Q Q Q δ, now we need a σ + where σ = Q is the biggest eigenvalue of Q, now, when 0 < σ we have β 8 Tr Q δ. When σ, note that Tr(Q(Q + I) ) Tr(Q) κ 2 /, then So finally β 8(κ2 / Q +Tr(Q Tr(Q(Q + I) ) Q Q κ2 δ Q)) σ σ + 9 = κ2 + κ2 σ 2κ2 σ.

20 A.4 Analytic decomposition Lemma 2. Let > 0, J, J {,..., n}, with J, J and A R J J, A R J J positive diagonal matrices, then with ν = G (ĈJ,A, ĈJ,A). 2ν ν l J,A (x, ) l J,A (x, ) ν l J,A (x, ), x X, Proof. By denoting with B the operator B = (ĈJ,A + I) /2 (ĈJ,A + I)/2, and according to the characterization of l J,A (x, ) via Prop., we have l J,A (x, ) = n (ĈJ,A + I) /2 K x 2 H = n B (ĈJ,A + I) /2 K x 2 H. So, by recalling the fact that, by definition of Lowner partial order, we have a v 2 Av 2 b v 2, for any vector v and bounded linear operator such that ai A A bi with 0 a b, and the fact that σ(a A) = σ(aa ) = σ(a) 2, we have σ min (B) 2 (ĈJ,A + I) /2 K x 2 H B(ĈJ,A + I) /2 K x 2 H σ max (B) 2 (ĈJ,A + I) /2 K x 2 H. That, by Prop., is equivalent to σ min (B) 2 lj,a (x, ) l J,A (x, ) σ max (B) 2 lj,a (x, ). By Prop. 4 we have I BB σ min (B) 2 σ max (B) 2 + I BB. Finally, by Prop. 2, we have I BB ν ν. Lemma 3. Let 0 <, and J {,..., n} and A R J J, then l J,A (x, ) l J,A (x, ) l J,A (x, ), x X. Proof. If J = 0 we have that l,[] (x, ) = K(x,x) n and the desired result is easily verified. If J, let B = (C J,A + I) /2 (C J,A + I) /2. By recalling the fact that, by definition of Lowner partial order, we have a v 2 Av 2 b v 2, for any vector v and bounded linear operator such that ai A A bi with 0 a b, and the fact that σ(a A) = σ(aa ) = σ(a) 2, we have σ min (B) 2 (ĈJ,A + I) /2 K x 2 H B(ĈJ,A + I) /2 K x 2 H σ max (B) 2 (ĈJ,A + I) /2 K x 2 H. That, by Prop., is equivalent to Now note that σ min (B) 2 lj,a (x, ) l J,A (x, ) σ max (B) 2 lj,a (x, ). σ min (B) 2 σ + inf σ 0 σ + =, σ max(b) 2 σ + sup σ 0 σ + =. 20

21 Theorem 3. Let > 0, J {,..., n}, with J and A R J J positive diagonal. Then the following hold for any x X, 2ν J,A ν J,A l(x, ) lj,a (x, ) ν J,A l(x, ), where ν J,A = G (Ĉ, ĈJ,A). Morever note that for any U {,..., n}, we have ν J,A η U + ( + η U )β J,A,U, with β J,A,U = G (ĈU,I, ĈJ,A) and η U = G (Ĉ, ĈU,I). Proof. By applying Lemma 2, with their J = {,..., n}, A = I, and recalling that l(x, ) = l {,...,n},i, we have for all x X 2ν J,A ν J,A l(x, ) lj,a (x, ) ν J,A l(x, ). To conclude the proof we bound ν J,A in terms of β J,A,U and η U, via Prop. 3. A.5 Proof for Algorithm Lemma 4. Let n N, (x i ) n i= X. Let U {,... n}, with U. Let (p k) U k= R be a non-negative sequence summing to. Let M N and J = {j,..., j M } with j i sampled i.i.d. from {,..., U } with probability (p k ) U k= and A = U diag(p j,..., p jm ). Let τ (0, ], and s := sup k {,..., U } U p k (ĈU,I + I) /2 K xuk 2 H. When M 2s log 4n τ, then the following holds with probability at least τ (ĈU,I + I) /2 (ĈJ,A ĈU,I)(ĈU,I + I) /2 Proof. Denote with ζ i the random variable 4s log 4n τ M. ζ i = U p k (ĈU,I + I) /2 (K xji K xji )(ĈU,I + I) /2, for i {,..., M}. In particular note that ζ,..., ζ M are i.i.d. since j,..., j M are. Moreover note the following two facts ζ i = sup k {,..., U } U k= U p k (ĈU,I + I) /2 K xuk 2 H = s, E[ζ i ] = p k (ĈU,I + I) /2 (K xk K xk )(ĈU,I + I) /2 U p k = (ĈU,I + I) /2 Ĉ U,I (ĈU,I + I) /2 =: W, 2

22 where for the second identity we used the fact that d/l k = /(p k U ). Since by definition of Ĉ J,A we have ( ) M ζ i = (ĈU,I + I) /2 M K xji K xji (ĈU,I + I) /2 M J A ii i= i= = (ĈU,I + I) /2 Ĉ J,A (ĈU,I + I) /2, then, by applying non-commutative Bernstein inequality (Prop. 5 is a version specific for our problem), we have (ĈU,I + I) /2 (ĈJ,A ĈU,I)(ĈU,I + I) /2 = M (ζ i E[ζ i ]) 2sη 2s W η M 3M + M, with probability at least τ, and η := log 4 Tr(W ) τ W. In particular, by noting that W by definition, when M 2sη, then 2sη 2s W η 3M + M 2sη 2sη 3M + M 2sη 2sη 4sη 3 M + M M. To conclude note that Tr(W ) W i= 4n rank(w ) U n, so η log τ. Lemma 5. Let n, R N, (x i ) n i= X. Let U = {u,..., u R } with u i i.i.d. with uniform probability on {,..., n}. Let τ (0, ] and let > 0. When R then the following holds with probability τ 2nκ2 4n log n + κ2 τ, (Ĉ + I) /2 (ĈU,I Ĉ)(Ĉ + I) /2 4nκ 2 log 4n τ (n + κ 2 )R. Proof. Denote by ζ i the random variable ζ i = (Ĉ + I) /2 (K xui K xui )(Ĉ + I) /2, for i {,..., R}. Note that ζ i are i.i.d. since u i are. Moreover note that ζ i = sup (Ĉ + I) /2 K xi 2 sup ( i {,...,n} i {,...,n} n K x i K xi + I) /2 K xi 2 nκ2 =: v. n + κ2 Moreover note that E[ζ i ] = n (Ĉ n + I) /2 (K xi K xi )(Ĉ + I) /2 = (Ĉ + I) /2 Ĉ(Ĉ + I) /2 =: W. i= So we have, by non-commutative Bernstein inequality (Prop. 5 is a version specific for our problem), (Ĉ + I) /2 (ĈU,I Ĉ)(Ĉ + I) /2 = M M i= (ζ i E[ζ i ]) 2vη 3R + 2v W η, R 22

23 with probability at least τ, and η := log 4 Tr(W ). In particular, by noting that W by definition, when R have 2vη 3R + 2v W η R η log 4n τ. 2nκ2 η (n+κ 2 )R 4nκ 2 η (n+κ 2 )R τ W, analogously to the end of the proof of Lemma 4, we. To conclude note that Tr(W ) W rank(w ) n, so Lemma 6. Let n, R N, (x i ) n i= X. Let U = {u,..., u R } with u i i.i.d. with uniform probability on {,..., n}. Let τ (0, ] and let > 0. When R then the following holds with probability τ n R i= 6nκ2 4n log n + κ2 τ, R ( l(x ui, ) < max 5, 6 ) 5 d eff(). Proof. First of all denote with z i the random variable z i = n R l(x ui, ) and note that (z i ) R i= are i.i.d. since (u i ) R i= are. Moreover, by the characterization of l(x, ) via Prop., we have z i sup (Ĉ + I) /2 K xk 2 (K xk K xk /n + I) /2 K xk 2 k {,...,n} moreover we have κ 2 R(κ 2 =: v, /n + ) E[z i ] = E[Tr((Ĉ + I) (K xui K xui ))] = Tr((Ĉ + I) E[K xui K xui ]) ( ) n = Tr (Ĉ + ( ) I) n K x k K xk = Tr (Ĉ + I) Ĉ = d eff (). k= So by applying Bernstein inequality, the following holds with probability at least τ n R l(x ui, ) d eff () R = R (z i E[z i ]) R 2v log 2 τ 3R + 2vd eff () log 2 τ. 3R i= So we have n R n l(x ui, ) d eff ()+ R R i= i= i= R l(x ui, ) d eff () d eff()+ 2v log 2 τ 3R + 2vd eff () log 2 τ R Now, if d eff () 4, since R 6v log 2 τ, we have that d eff () + 2v log 2 τ 3R + 2vd eff () log 2 τ 4 + R < 5. If d eff () > 4, since R 6v log 2 τ, we have d eff () + 2v log ( ) 2 τ 3R + 2vd eff () log 2 τ + 3R 24d eff () + d eff () < 6 8d eff () 5 d eff().. 23

24 Theorem 4. Let n N, (x i ) n i= X. Let δ (0, ], t, q >, > 0 and H, d h, h, J h, A h, U h as in Alg.. Let Ā h = n J A h and ν h = G h (Ĉ, ĈJ h,āh ), β h = G h (ĈU h,i, ĈJ h,āh ), η h = G h (Ĉ, ĈU h,i). When 0 = κ 2 min(t, ), q 5κ2 q 2 q( + t), q 2 2q (2t + )2 t 2 then the following holds with probability δ: for any h {0,..., H} ( + t) log 2Hn, δ a) T l(x, h ) l (x) min(t, 2) l(x, Jh,Āh h ), x X, b) d h 3q d eff ( h ) 0q, and J h q 2 (3qd eff ( h ) 0q). c) β h 7, η h 3, ν h. c T c T c T (20) where T = + t and c T = 2 + /(T ). Proof. Let H, c T, q and h, U h, J h, A h, d h, P h = (p h,k ) R h k=, for h {0,..., H} as defined in Alg. and define τ = δ/(3h). Now we are going to define some events and we prove a recurrence relation that they satisfy. Finally we unroll the recurrence relation and bound the resulting events in probability. Definitions of the events Now we are going to define some events that will be useful to prove the theorem. Denote with E h the event such that the conditions in Eq. (20)-(a) hold for J h, A h, U h. Denote with F h the event such that n R h u U h l(xu, h ) 6 5 d eff(). Denote with B,h the event such that β h, satisfies β h 4s h log 4n τ, with s h := sup (ĈU M h k {,...,R h } R h p h,i + h I) /2 K xuk 2. (2) h,k Denote with B 2,h the event such that η h, satisfies 4κ 2 n log κ2 η h h τ ( h n + κ 2. )R h First bound for s h. Note that, by definition of p h,k, that is, by Prop. p h,k = n l Jh,A h (x uk, h )/(d h R h ) = n l Jh,Āh (x u k, h )/(d h R h ), so s h = d h (ĈU h,i + h I) /2 K xuk 2 sup k {,...,R h } n l (x Jh,Āh u k, h ) = sup u U h d h luh,i(x u, h ) l Jh,Āh (x u, h ), 24

25 where the last step consists in apply the definition of l Uh,I. By applying Lemma 2 and 3 to l Uh,I(x, h ), we have l Uh,I(x, h ) η h l(x, h ) h h ( η h ) l(x, h ) and analogously by applying Lemma 3 to l Jh,Āh (x, h), we have l Jh,Āh (x, h) l Jh,Āh (x, h ). So, by extending the sup of s h to the whole X, we have s h d h sup x X l Uh,I(x, h ) l Jh,Āh (x, h) h d h h ( η h ) sup x X l(x, h ) l Jh,Āh (x, h ). Now we are ready to prove the recurrence relation, for h {,... H}, E h B,h B 2,h E h F h. Analysis of E 0. Note that, since Ĉ κ2, then κ 2 + I (Ĉ + I), so for any x X the following holds K(x, x) (κ 2 + )n l(x, K(x, x) ) n. κ Since 0 = 2 min(2,t ) and l,[] (x, 0 ) = K(x,x) 0 n, we have T l(x, 0 ) K(x, x) K(x, x) min(2, T )K(x, x) l T n,[] (x, 0 ) = = 0 n (κ 2 min(2, T ) l(x, 0 ). + 0 )n Setting conventionally d 0, ν 0, η 0, β 0 = 0 (they are not used by the algorithm or the proof), we have that E 0 holds everywhere and so, with probability. Analysis of E h B,h B 2,h. T l(x, h ) and so First note that under E h, the following holds l Jh,Āh (x, h ) s h h d h h ( η h ) sup x X l(x, h ) l Jh,Āh (x, h ) h d h h ( η h ) sup x X l(x, h ) l(x, T h ) T h d h h ( η h ). Now note that under B 2,h, by applying the definition of R h in Alg., by the condition on q, we have 4κ 2 n log κ2 η h h τ 4 log κ2 ( h n + κ 2 h τ 3/(c T ) 3/22. )R h q So under B,h B 2,h E h and the fact that q = h h, we have s h T h d h h ( η h ) (8/7)qT d h and so, since M h = q 2 d h, by the condition on q 2, we have 4s h log 4n τ (32/7)qT d h log 4n τ (32/7)qT log 4n τ β h = < 7, M h M h q 2 c T 25

Optimal kernel methods for large scale learning

Optimal kernel methods for large scale learning Optimal kernel methods for large scale learning Alessandro Rudi INRIA - École Normale Supérieure, Paris joint work with Luigi Carratino, Lorenzo Rosasco 6 Mar 2018 École Polytechnique Learning problem

More information

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Loucas Pillaud-Vivien

More information

Less is More: Computational Regularization by Subsampling

Less is More: Computational Regularization by Subsampling Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro

More information

Less is More: Computational Regularization by Subsampling

Less is More: Computational Regularization by Subsampling Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro

More information

Convergence Rates of Kernel Quadrature Rules

Convergence Rates of Kernel Quadrature Rules Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction

More information

TUM 2016 Class 3 Large scale learning by regularization

TUM 2016 Class 3 Large scale learning by regularization TUM 2016 Class 3 Large scale learning by regularization Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Learning problem Solve min w E(w), E(w) = dρ(x, y)l(w x, y) given (x 1, y 1 ),..., (x n, y n ) Beyond

More information

Pack only the essentials: distributed sequential sampling for adaptive kernel DL

Pack only the essentials: distributed sequential sampling for adaptive kernel DL Pack only the essentials: distributed sequential sampling for adaptive kernel DL with Daniele Calandriello and Alessandro Lazaric SequeL team, Inria Lille - Nord Europe, France appeared in AISTATS 2017

More information

Mathematical Methods for Data Analysis

Mathematical Methods for Data Analysis Mathematical Methods for Data Analysis Massimiliano Pontil Istituto Italiano di Tecnologia and Department of Computer Science University College London Massimiliano Pontil Mathematical Methods for Data

More information

Distributed Adaptive Sampling for Kernel Matrix Approximation

Distributed Adaptive Sampling for Kernel Matrix Approximation Daniele Calandriello Alessandro Lazaric Michal Valko SequeL team, INRIA Lille - Nord Europe Abstract Most kernel-based methods, such as kernel regression, kernel PCA, ICA, or k-means clustering, do not

More information

Approximate Kernel PCA with Random Features

Approximate Kernel PCA with Random Features Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,

More information

arxiv: v1 [cs.it] 21 Feb 2013

arxiv: v1 [cs.it] 21 Feb 2013 q-ary Compressive Sensing arxiv:30.568v [cs.it] Feb 03 Youssef Mroueh,, Lorenzo Rosasco, CBCL, CSAIL, Massachusetts Institute of Technology LCSL, Istituto Italiano di Tecnologia and IIT@MIT lab, Istituto

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Notes on Regularized Least Squares Ryan M. Rifkin and Ross A. Lippert

Notes on Regularized Least Squares Ryan M. Rifkin and Ross A. Lippert Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2007-025 CBCL-268 May 1, 2007 Notes on Regularized Least Squares Ryan M. Rifkin and Ross A. Lippert massachusetts institute

More information

Online Gradient Descent Learning Algorithms

Online Gradient Descent Learning Algorithms DISI, Genova, December 2006 Online Gradient Descent Learning Algorithms Yiming Ying (joint work with Massimiliano Pontil) Department of Computer Science, University College London Introduction Outline

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose

More information

A Spectral Regularization Framework for Multi-Task Structure Learning

A Spectral Regularization Framework for Multi-Task Structure Learning A Spectral Regularization Framework for Multi-Task Structure Learning Massimiliano Pontil Department of Computer Science University College London (Joint work with A. Argyriou, T. Evgeniou, C.A. Micchelli,

More information

Optimal Rates for Multi-pass Stochastic Gradient Methods

Optimal Rates for Multi-pass Stochastic Gradient Methods Journal of Machine Learning Research 8 (07) -47 Submitted 3/7; Revised 8/7; Published 0/7 Optimal Rates for Multi-pass Stochastic Gradient Methods Junhong Lin Laboratory for Computational and Statistical

More information

An inverse problem perspective on machine learning

An inverse problem perspective on machine learning An inverse problem perspective on machine learning Lorenzo Rosasco University of Genova Massachusetts Institute of Technology Istituto Italiano di Tecnologia lcsl.mit.edu Feb 9th, 2018 Inverse Problems

More information

Kernel Learning via Random Fourier Representations

Kernel Learning via Random Fourier Representations Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random

More information

Analysis of Nyström Method with Sequential Ridge Leverage Score Sampling

Analysis of Nyström Method with Sequential Ridge Leverage Score Sampling Analysis of Nyström Method with Sequential Ridge Leverage Score Sampling Daniele Calandriello SequeL team INRIA Lille - Nord Europe Alessandro Lazaric SequeL team INRIA Lille - Nord Europe Michal Valko

More information

Approximate Kernel Methods

Approximate Kernel Methods Lecture 3 Approximate Kernel Methods Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 207 Outline Motivating example Ridge regression

More information

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank

More information

Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms

Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms Junhong Lin Volkan Cevher Laboratory for Information and Inference Systems École Polytechnique Fédérale

More information

MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 04: Features and Kernels. Lorenzo Rosasco

MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 04: Features and Kernels. Lorenzo Rosasco MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 04: Features and Kernels Lorenzo Rosasco Linear functions Let H lin be the space of linear functions f(x) = w x. f w is one

More information

Strictly Positive Definite Functions on a Real Inner Product Space

Strictly Positive Definite Functions on a Real Inner Product Space Strictly Positive Definite Functions on a Real Inner Product Space Allan Pinkus Abstract. If ft) = a kt k converges for all t IR with all coefficients a k 0, then the function f< x, y >) is positive definite

More information

Distribution Regression: A Simple Technique with Minimax-optimal Guarantee

Distribution Regression: A Simple Technique with Minimax-optimal Guarantee Distribution Regression: A Simple Technique with Minimax-optimal Guarantee (CMAP, École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department,

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Convergence rates of spectral methods for statistical inverse learning problems

Convergence rates of spectral methods for statistical inverse learning problems Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.

More information

Some tensor decomposition methods for machine learning

Some tensor decomposition methods for machine learning Some tensor decomposition methods for machine learning Massimiliano Pontil Istituto Italiano di Tecnologia and University College London 16 August 2016 1 / 36 Outline Problem and motivation Tucker decomposition

More information

FALKON: An Optimal Large Scale Kernel Method

FALKON: An Optimal Large Scale Kernel Method FALKON: An Optimal Large Scale Kernel Method Alessandro Rudi INRIA Sierra Project-team, École Normale Supérieure, Paris Luigi Carratino University of Genoa Genova, Italy Lorenzo Rosasco University of Genoa,

More information

Regularization via Spectral Filtering

Regularization via Spectral Filtering Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Regularized Least Squares

Regularized Least Squares Regularized Least Squares Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Summary In RLS, the Tikhonov minimization problem boils down to solving a linear system (and this

More information

On Learnability, Complexity and Stability

On Learnability, Complexity and Stability On Learnability, Complexity and Stability Silvia Villa, Lorenzo Rosasco and Tomaso Poggio 1 Introduction A key question in statistical learning is which hypotheses (function) spaces are learnable. Roughly

More information

Back to the future: Radial Basis Function networks revisited

Back to the future: Radial Basis Function networks revisited Back to the future: Radial Basis Function networks revisited Qichao Que, Mikhail Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 4310 que, mbelkin@cse.ohio-state.edu

More information

Regularization Predicts While Discovering Taxonomy Youssef Mroueh, Tomaso Poggio, and Lorenzo Rosasco

Regularization Predicts While Discovering Taxonomy Youssef Mroueh, Tomaso Poggio, and Lorenzo Rosasco Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2011-029 CBCL-299 June 3, 2011 Regularization Predicts While Discovering Taxonomy Youssef Mroueh, Tomaso Poggio, and

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

More information

Spectral Regularization

Spectral Regularization Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,

More information

Sketching as a Tool for Numerical Linear Algebra

Sketching as a Tool for Numerical Linear Algebra Sketching as a Tool for Numerical Linear Algebra (Part 2) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania February, 2015 Sepehr Assadi (Penn) Sketching

More information

Scalable machine learning for massive datasets: Fast summation algorithms

Scalable machine learning for massive datasets: Fast summation algorithms Scalable machine learning for massive datasets: Fast summation algorithms Getting good enough solutions as fast as possible Vikas Chandrakant Raykar vikas@cs.umd.edu University of Maryland, CollegePark

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

to be more efficient on enormous scale, in a stream, or in distributed settings.

to be more efficient on enormous scale, in a stream, or in distributed settings. 16 Matrix Sketching The singular value decomposition (SVD) can be interpreted as finding the most dominant directions in an (n d) matrix A (or n points in R d ). Typically n > d. It is typically easy to

More information

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1). Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes

More information

Transductive Experiment Design

Transductive Experiment Design Appearing in NIPS 2005 workshop Foundations of Active Learning, Whistler, Canada, December, 2005. Transductive Experiment Design Kai Yu, Jinbo Bi, Volker Tresp Siemens AG 81739 Munich, Germany Abstract

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Online Kernel PCA with Entropic Matrix Updates

Online Kernel PCA with Entropic Matrix Updates Dima Kuzmin Manfred K. Warmuth Computer Science Department, University of California - Santa Cruz dima@cse.ucsc.edu manfred@cse.ucsc.edu Abstract A number of updates for density matrices have been developed

More information

Learning sets and subspaces: a spectral approach

Learning sets and subspaces: a spectral approach Learning sets and subspaces: a spectral approach Alessandro Rudi DIBRIS, Università di Genova Optimization and dynamical processes in Statistical learning and inverse problems Sept 8-12, 2014 A world of

More information

Strictly positive definite functions on a real inner product space

Strictly positive definite functions on a real inner product space Advances in Computational Mathematics 20: 263 271, 2004. 2004 Kluwer Academic Publishers. Printed in the Netherlands. Strictly positive definite functions on a real inner product space Allan Pinkus Department

More information

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression JMLR: Workshop and Conference Proceedings : 9, 08. Symposium on Advances in Approximate Bayesian Inference Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Statistically and Computationally Efficient Variance Estimator for Kernel Ridge Regression

Statistically and Computationally Efficient Variance Estimator for Kernel Ridge Regression Statistically and Computationally Efficient Variance Estimator for Kernel Ridge Regression Meimei Liu Department of Statistical Science Duke University Durham, IN - 27708 Email: meimei.liu@duke.edu Jean

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Convex Multi-Task Feature Learning

Convex Multi-Task Feature Learning Convex Multi-Task Feature Learning Andreas Argyriou 1, Theodoros Evgeniou 2, and Massimiliano Pontil 1 1 Department of Computer Science University College London Gower Street London WC1E 6BT UK a.argyriou@cs.ucl.ac.uk

More information

Statistical Convergence of Kernel CCA

Statistical Convergence of Kernel CCA Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,

More information

Efficient Complex Output Prediction

Efficient Complex Output Prediction Efficient Complex Output Prediction Florence d Alché-Buc Joint work with Romain Brault, Alex Lambert, Maxime Sangnier October 12, 2017 LTCI, Télécom ParisTech, Institut-Mines Télécom, Université Paris-Saclay

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Oslo Class 4 Early Stopping and Spectral Regularization

Oslo Class 4 Early Stopping and Spectral Regularization RegML2017@SIMULA Oslo Class 4 Early Stopping and Spectral Regularization Lorenzo Rosasco UNIGE-MIT-IIT June 28, 2016 Learning problem Solve min w E(w), E(w) = dρ(x, y)l(w x, y) given (x 1, y 1 ),..., (x

More information

Karhunen-Loève decomposition of Gaussian measures on Banach spaces

Karhunen-Loève decomposition of Gaussian measures on Banach spaces Karhunen-Loève decomposition of Gaussian measures on Banach spaces Jean-Charles Croix jean-charles.croix@emse.fr Génie Mathématique et Industriel (GMI) First workshop on Gaussian processes at Saint-Etienne

More information

Matrix concentration inequalities

Matrix concentration inequalities ELE 538B: Mathematics of High-Dimensional Data Matrix concentration inequalities Yuxin Chen Princeton University, Fall 2018 Recap: matrix Bernstein inequality Consider a sequence of independent random

More information

Convergence Rates for Greedy Kaczmarz Algorithms

Convergence Rates for Greedy Kaczmarz Algorithms onvergence Rates for Greedy Kaczmarz Algorithms Julie Nutini 1, Behrooz Sepehry 1, Alim Virani 1, Issam Laradji 1, Mark Schmidt 1, Hoyt Koepke 2 1 niversity of British olumbia, 2 Dato Abstract We discuss

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Spectral Regularization for Support Estimation

Spectral Regularization for Support Estimation Spectral Regularization for Support Estimation Ernesto De Vito DSA, Univ. di Genova, and INFN, Sezione di Genova, Italy devito@dima.ungie.it Lorenzo Rosasco CBCL - MIT, - USA, and IIT, Italy lrosasco@mit.edu

More information

A fast randomized algorithm for overdetermined linear least-squares regression

A fast randomized algorithm for overdetermined linear least-squares regression A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm

More information

TUM 2016 Class 1 Statistical learning theory

TUM 2016 Class 1 Statistical learning theory TUM 2016 Class 1 Statistical learning theory Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Machine learning applications Texts Images Data: (x 1, y 1 ),..., (x n, y n ) Note: x i s huge dimensional! All

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Fastfood Approximating Kernel Expansions in Loglinear Time. Quoc Le, Tamas Sarlos, and Alex Smola Presenter: Shuai Zheng (Kyle)

Fastfood Approximating Kernel Expansions in Loglinear Time. Quoc Le, Tamas Sarlos, and Alex Smola Presenter: Shuai Zheng (Kyle) Fastfood Approximating Kernel Expansions in Loglinear Time Quoc Le, Tamas Sarlos, and Alex Smola Presenter: Shuai Zheng (Kyle) Large Scale Problem: ImageNet Challenge Large scale data Number of training

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Fantope Regularization in Metric Learning

Fantope Regularization in Metric Learning Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Are Loss Functions All the Same?

Are Loss Functions All the Same? Are Loss Functions All the Same? L. Rosasco E. De Vito A. Caponnetto M. Piana A. Verri November 11, 2003 Abstract In this paper we investigate the impact of choosing different loss functions from the viewpoint

More information

2 Tikhonov Regularization and ERM

2 Tikhonov Regularization and ERM Introduction Here we discusses how a class of regularization methods originally designed to solve ill-posed inverse problems give rise to regularized learning algorithms. These algorithms are kernel methods

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Convergence of a Generalized Gradient Selection Approach for the Decomposition Method

Convergence of a Generalized Gradient Selection Approach for the Decomposition Method Convergence of a Generalized Gradient Selection Approach for the Decomposition Method Nikolas List Fakultät für Mathematik, Ruhr-Universität Bochum, 44780 Bochum, Germany Niko.List@gmx.de Abstract. The

More information

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010 Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X

More information

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

Nyström-SGD: Fast Learning of Kernel-Classifiers with Conditioned Stochastic Gradient Descent

Nyström-SGD: Fast Learning of Kernel-Classifiers with Conditioned Stochastic Gradient Descent Nyström-SGD: Fast Learning of Kernel-Classifiers with Conditioned Stochastic Gradient Descent Lukas Pfahler and Katharina Morik TU Dortmund University, Artificial Intelligence Group, Dortmund, Germany

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Distribution Regression with Minimax-Optimal Guarantee

Distribution Regression with Minimax-Optimal Guarantee Distribution Regression with Minimax-Optimal Guarantee (Gatsby Unit, UCL) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information