Optimal kernel methods for large scale learning
1 Optimal kernel methods for large scale learning. Alessandro Rudi, INRIA - École Normale Supérieure, Paris. Joint work with Luigi Carratino, Lorenzo Rosasco. 6 Mar 2018, École Polytechnique.
2 Learning problem. Problem P: find $f_{\mathcal H} = \operatorname{argmin}_{f \in \mathcal H} \mathcal E(f)$, where $\mathcal E(f) = \int d\rho(x, y)\,(y - f(x))^2$, with $\rho$ unknown but $(x_i, y_i)_{i=1}^n$ i.i.d. samples given. Remarks: this is a stochastic optimization problem; $\mathcal H$ is a space of candidate solutions.
3 Outline. Learning with kernels; random projections; FALKON: random projections and preconditioning.
4 Kernel ridge regression. Let $K$ be a p.d. kernel (e.g. $K(x, x') = e^{-\gamma \|x - x'\|^2}$) and $\mathcal H = \operatorname{span}\{K(x, \cdot) \mid x \in X\}$. Problem $P_n$: $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal H} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal H}^2$.
5-6 KRR: Statistics (worst case). Basic assumptions. Noise: $\mathbb E[|Y|^p \mid X = x] \le \frac{1}{2} p!\,\sigma^2 b^{p-2}$ for all $p \ge 2$. Kernel boundedness: $\sup_x K(x, x) < \infty$. Best model: there exists $f_{\mathcal H}$ solving $\min_{f \in \mathcal H} \mathcal E(f)$. Theorem [Caponnetto, De Vito 05]: under the assumptions above, $\mathbb E\,\mathcal E(\hat f_\lambda) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\lambda n} + \lambda$. Balancing the two terms by selecting $\lambda_n = \frac{1}{\sqrt n}$ gives $\mathbb E\,\mathcal E(\hat f_{\lambda_n}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\sqrt n}$.
7-9 KRR: Statistics (refined case). Let $Lf(x') = \mathbb E\,K(x', x) f(x)$ and $\mathcal N(\lambda) = \operatorname{Tr}((L + \lambda I)^{-1} L)$. Capacity condition: $\mathcal N(\lambda) = O(\lambda^{-\gamma})$, $\gamma \in [0, 1]$. Source condition: $f_{\mathcal H} \in \operatorname{Range}(L^r)$, $r \ge 1/2$. Theorem [Caponnetto, De Vito 05]: under (basic) and (refined), $\mathbb E\,\mathcal E(\hat f_\lambda) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r}$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, $\mathbb E\,\mathcal E(\hat f_{\lambda_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
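The effective dimension $\mathcal N(\lambda)$ drives the refined bound, and it can be estimated from data. Below is a minimal sketch (not from the slides) that computes the empirical analogue from the eigenvalues of the kernel matrix, using $K/n$ as a finite-sample proxy for the integral operator $L$; the function name and the Gaussian-kernel example are illustrative choices of ours.

```python
import numpy as np

def effective_dimension(K, lam):
    """Empirical N(lam) = sum_i s_i / (s_i + lam), with s_i the eigenvalues of K/n."""
    s = np.linalg.eigvalsh(K) / K.shape[0]  # spectrum of the empirical operator
    s = np.clip(s, 0.0, None)               # guard against negative round-off
    return float(np.sum(s / (s + lam)))

# Example: Gaussian kernel matrix on random 1-d inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-5.0 * sq)
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, effective_dimension(K, lam))  # N(lam) grows as lam shrinks
```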
10 KRR: Optimization. $\hat f_\lambda(x) = \sum_{i=1}^n K(x, x_i)\, c_i$, where $c$ solves the linear system $(\hat K + \lambda n I)\, c = \hat y$. Computations: space $O(n^2)$, kernel eval. $O(n^2)$, time $O(n^3)$. Large scale ML: running out of time and space... can we fix it?
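As a concrete reference point, here is a minimal sketch of exact KRR, assuming a Gaussian kernel (the helper names are ours): it forms the full kernel matrix and solves $(\hat K + \lambda n I)\, c = \hat y$ directly, so it hits exactly the $O(n^2)$ space and $O(n^3)$ time walls quoted on the slide.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma=1.0):
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)                     # O(n^2) kernel evaluations and space
    return np.linalg.solve(K + lam * n * np.eye(n), y)   # O(n^3) time

def krr_predict(X_train, c, X_test, gamma=1.0):
    # f(x) = sum_i K(x, x_i) c_i
    return gaussian_kernel(X_test, X_train, gamma) @ c
```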
11 Computations for optimal statistical accuracy. Model: $O(n)$; Space: $O(n^2)$; Kernel eval.: $O(n^2)$; Time: $O(n^3)$.
12 Outline. Learning with kernels; random projections; FALKON: random projections and preconditioning.
13-14 Random projections. Solve $P_n$ on $\mathcal H_M = \operatorname{span}\{K(\tilde x_1, \cdot), \dots, K(\tilde x_M, \cdot)\}$: $\hat f_{\lambda, M} = \operatorname{argmin}_{f \in \mathcal H_M} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal H}^2$... that is, pick $M$ columns at random. Then $\hat f_{\lambda, M}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$, where $c$ solves the linear system $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$. - Nyström methods (Smola, Schölkopf 00) - Gaussian processes: inducing inputs (Quiñonero-Candela et al 05).
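A minimal sketch of this estimator (sampling scheme and names are ours, with the Gaussian-kernel helper restated for self-containment): sample $M$ centers uniformly without replacement, then solve the $M \times M$ system above.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    return np.exp(-gamma * ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))

def nystrom_krr_fit(X, y, lam, M, gamma=1.0, seed=0):
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=M, replace=False)  # M uniform centers
    Xm = X[idx]
    KnM = gaussian_kernel(X, Xm, gamma)        # n x M: O(nM) kernel evaluations
    KMM = gaussian_kernel(Xm, Xm, gamma)       # M x M
    H = KnM.T @ KnM + lam * n * KMM            # O(nM^2) time: the training bottleneck
    c = np.linalg.solve(H, KnM.T @ y)          # O(M^3)
    return Xm, c                               # f(x) = sum_j K(x, x~_j) c_j
```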
15 Nyström KRR: Computations. For the linear system $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$: training space $O(n^2) \to O(M_n^2)$, kernel eval. $O(n^2) \to O(n M_n)$, time $O(n^3) \to O(n M_n^2)$; test-time prediction $O(n) \to O(M_n)$.
16-17 Nyström KRR: Statistics (worst case). Theorem [Rudi, Camoriano, Rosasco 15]: under the basic assumptions, $\mathbb E\,\mathcal E(\hat f_{\lambda, M}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M}$. By selecting $\lambda_n = \frac{1}{\sqrt n}$ and $M_n = \frac{1}{\lambda_n} = \sqrt n$, $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\sqrt n}$.
18 Remarks. $M = O(\sqrt n)$ suffices for optimal generalization. Previous works held only for fixed design (Bach 13; Alaoui, Mahoney 15; Yang et al. 15; Musco, Musco 16). Matches the statistical minimax lower bounds [Caponnetto, De Vito 05]. Special case: Sobolev spaces with $s = d/2$, e.g. exponential kernel and Fourier features. Same statistical bound as (kernel) ridge regression [Caponnetto, De Vito 05].
19-20 Nyström KRR: Statistics (refined). Theorem [Rudi, Camoriano, Rosasco 15]: under (basic) and (refined), $\mathbb E\,\mathcal E(\hat f_{\lambda, M}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M}$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$ and $M_n = \frac{1}{\lambda_n}$, $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
21 Remarks. The obtained rate is minimax optimal [Caponnetto, De Vito 05]. Reduces to the worst case for $\gamma = 1$, $r = 1/2$. Comparison with random features [Rudi, Rosasco 17]: $M = n^c$.
22 Computations required for the $1/\sqrt n$ rate. Model: $O(\sqrt n)$; Space: $O(n)$; Kernel eval.: $O(n \sqrt n)$; Time: $O(n^2)$. Possible improvements: adaptive sampling; optimization.
23 Outline. Learning with kernels; random projections; FALKON: random projections and preconditioning.
24 Beyond $O(n^2)$ time? $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$. Bottleneck: computing $\hat K_{nM}^\top \hat K_{nM}$ requires $O(n M^2)$ time.
25 Optimization to the rescue. Write the system as $H c = b$, with $H = \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}$ and $b = \hat K_{nM}^\top \hat y$. Idea: first order methods, $c_t = c_{t-1} - \frac{\tau}{n} \left[ \hat K_{nM}^\top (\hat K_{nM} c_{t-1} - \hat y) + \lambda n \hat K_{MM} c_{t-1} \right]$. Pros: requires $O(n M t)$ time. Cons: $t \approx \kappa(H)$ can be arbitrarily large, where $\kappa(H) = \sigma_{\max}(H)/\sigma_{\min}(H)$ is the condition number.
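A minimal sketch of this iteration (step size and iteration count are illustrative): by applying $\hat K_{nM}$ and $\hat K_{nM}^\top$ separately it never forms $H$, so each step costs $O(nM)$ and $t$ steps cost $O(nMt)$ overall.

```python
import numpy as np

def nystrom_gd(KnM, KMM, y, lam, tau, t_max):
    """Gradient descent on H c = b without ever forming H explicitly."""
    n, M = KnM.shape
    c = np.zeros(M)
    for _ in range(t_max):
        # H c - b = KnM^T (KnM c - y) + lam * n * KMM c, computed in O(nM)
        grad = KnM.T @ (KnM @ c - y) + lam * n * (KMM @ c)
        c -= (tau / n) * grad
    return c
```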
26 Preconditioning. Idea: solve an equivalent linear system with a better condition number. Preconditioning: $H c = b \;\longrightarrow\; P^\top H P\, \beta = P^\top b$, $c = P \beta$. Ideally $P P^\top = H^{-1}$, so that $t = O(\kappa(H))$ becomes $t = O(1)$! Computing a good preconditioner can be hard!
27 Remarks. Preconditioning KRR (Fasshauer et al 12; Avron et al 16; Cutajar 16; Ma, Belkin 17): $H = \hat K + \lambda n I$. Can we precondition Nyström-KRR?
28 Preconditioning Nyström-KRR. Consider $H := \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}$. Proposed preconditioner: $P P^\top = \left( \frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM} \right)^{-1}$. Compare to the naive preconditioner $P P^\top = \left( \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM} \right)^{-1}$.
29 Baby FALKON. With the proposed preconditioner $P P^\top = \left( \frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM} \right)^{-1}$, run gradient descent: $\hat f_{\lambda, M, t}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_{t,i}$, with $c_t = P \beta_t$ and $\beta_t = \beta_{t-1} - \frac{\tau}{n} P^\top \left[ \hat K_{nM}^\top (\hat K_{nM} P \beta_{t-1} - \hat y) + \lambda n \hat K_{MM} P \beta_{t-1} \right]$.
30 FALKON. Gradient descent $\to$ conjugate gradient. Computing $P$: $P = \frac{1}{\sqrt n}\, T^{-1} A^{-1}$, with $T = \operatorname{chol}(K_{MM})$, $A = \operatorname{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right)$, where $\operatorname{chol}(\cdot)$ is the Cholesky decomposition.
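Below is a minimal sketch of this construction, assuming KnM, KMM, y as in the Nyström sketch above (the jitter, iteration cap, and names are our choices): it builds $P$ through the two Cholesky factors, applies $P$ and $P^\top$ by triangular solves, and runs SciPy's conjugate gradient on the preconditioned system $P^\top H P\, \beta = P^\top b$, returning $c = P \beta$.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

def falkon_solve(KnM, KMM, y, lam, t_max=20):
    n, M = KnM.shape
    T = cholesky(KMM + 1e-10 * np.eye(M))          # upper triangular: KMM = T^T T
    A = cholesky(T @ T.T / M + lam * np.eye(M))    # upper triangular: T T^T / M + lam I = A^T A

    def P(v):   # P v = (1/sqrt(n)) T^{-1} A^{-1} v, two triangular solves
        return solve_triangular(T, solve_triangular(A, v, lower=False), lower=False) / np.sqrt(n)

    def Pt(v):  # P^T v = (1/sqrt(n)) A^{-T} T^{-T} v
        return solve_triangular(A.T, solve_triangular(T.T, v, lower=True), lower=True) / np.sqrt(n)

    def matvec(beta):  # P^T H P beta, with H = KnM^T KnM + lam n KMM (never formed)
        c = P(beta)
        return Pt(KnM.T @ (KnM @ c) + lam * n * (KMM @ c))

    op = LinearOperator((M, M), matvec=matvec, dtype=float)
    beta, _ = cg(op, Pt(KnM.T @ y), maxiter=t_max)
    return P(beta)  # c = P beta; predict with f(x) = sum_j K(x, x~_j) c_j
```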
31-32 FALKON statistics (worst case). Theorem: under (basic), when $M > \frac{\log n}{\lambda}$, $\mathbb E\,\mathcal E(\hat f_{\lambda, M, t}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M} + \exp\!\left[ -t \left( 1 - \frac{\log n}{\lambda M} \right)^{1/2} \right]$. By selecting $\lambda_n = \frac{1}{\sqrt n}$, $M_n = \frac{2 \log n}{\lambda_n}$, $t_n = \log n$, then $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n, t_n}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\sqrt n}$.
33-34 FALKON statistics (refined results). Theorem: under (basic) and (refined), when $M > \frac{\log n}{\lambda}$, $\mathbb E\,\mathcal E(\hat f_{\lambda, M, t}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M} + \exp\!\left[ -t \left( 1 - \frac{\log n}{\lambda M} \right)^{1/2} \right]$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, $M_n = \frac{2 \log n}{\lambda_n}$, $t_n = \log n$, then $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n, t_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
35 Remarks. Relevant works: SGD, RF-KRR (Rahimi, Recht 07; Bach 15; Rudi, Rosasco 17); divide and conquer (Zhang et al. 13); NYTRO (Angles et al 16); Nyström SGD (Lin, Rosasco 16). FALKON has the same statistical properties and memory requirements, with a much smaller time complexity.
36-37 Proof: bridging statistics and optimization. Lemma: let $\delta > 0$, $\kappa_P := \kappa(P^\top H P)$, $c_\delta = c_0 \log \frac{1}{\delta}$. When $\lambda \ge \frac{1}{n}$, with probability $1 - \delta$, $\mathcal E(\hat f_{\lambda, M, t}) - \mathcal E(f_{\mathcal H}) \lesssim \mathcal E(\hat f_{\lambda, M}) - \mathcal E(f_{\mathcal H}) + c_\delta \exp(-t/\sqrt{\kappa_P})$. Lemma: let $\delta \in (0, 1]$, $\lambda > 0$. When $M = \frac{2 \log \frac{1}{\delta}}{\lambda}$, then with probability $1 - \delta$, $\kappa(P^\top H P) \le \left( 1 - \frac{\log \frac{1}{\delta}}{\lambda M} \right)^{-1} < 4$.
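A quick numerical sanity check of the second lemma (entirely our construction, not from the slides): build a small Gaussian-kernel Nyström problem, form $P$ explicitly, and compare $\kappa(P^\top H P)$ with $\kappa(H)$; the preconditioned system should be dramatically better conditioned.

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(0)
n, M, lam, gamma = 2000, 100, 1e-3, 5.0
X = rng.uniform(-1, 1, size=(n, 1))
Xm = X[rng.choice(n, size=M, replace=False)]
k = lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
KnM, KMM = k(X, Xm), k(Xm, Xm)

H = KnM.T @ KnM + lam * n * KMM
T = cholesky(KMM + 1e-10 * np.eye(M))          # KMM = T^T T
A = cholesky(T @ T.T / M + lam * np.eye(M))    # T T^T / M + lam I = A^T A
P = np.linalg.solve(T, np.linalg.solve(A, np.eye(M))) / np.sqrt(n)

print("kappa(H)       =", np.linalg.cond(H))
print("kappa(P^T H P) =", np.linalg.cond(P.T @ H @ P))  # close to 1
```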
38 Computational implications. Cost for optimal generalization with FALKON ($M_n = O(\sqrt n)$, $t_n = \log n$). Model: $O(\sqrt n)$; Space: $O(n)$; Kernel eval.: $O(n \sqrt n)$; Time: $O(n^2) \to O(n \sqrt n)$.
39 In practice. [Figure: Higgs dataset, n = 10,000,000, M = 50,…]
40 Some experiments. [Table: MillionSongs (n ≈ 10^6; MSE, relative error, time in s), YELP (n ≈ 10^6; RMSE, time in m), TIMIT (n ≈ 10^6; c-err, time in h); methods compared: FALKON, Prec. KRR, Hierarchical, D&C, Rand. Feat., Nyström, ADMM R.F., BCD R.F., BCD Nyström, KRR, EigenPro, Deep NN, Sparse Kernels, Ensemble.] Times obtained on: a cluster of 128 EC2 r3.2xlarge machines; a cluster of 8 EC2 r3.8xlarge machines; a single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU and 128GB of RAM; a cluster with 512 GB of RAM and an IBM POWER8 12-core processor; an unknown platform.
41 Some more experiments. [Table: SUSY (n ≈ 10^6; c-err, AUC, time in m), HIGGS (n ≈ 10^7; AUC, time in h), IMAGENET (n ≈ 10^6; c-err, time in h); methods compared: FALKON (19.6% c-err on SUSY), EigenPro (19.8%), Hierarchical (20.1%), Boosted Decision Tree, Neural Network, Deep Neural Network, Inception-V….] Architectures: a cluster with an IBM POWER8 12-core CPU and 512 GB RAM; a single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU, 128GB RAM; a single machine.
42 Contributions. Best computations so far for optimal statistics: space $O(n)$, time $O(n \sqrt n)$. Random projections + iterative solvers + preconditioning... fast rates... adaptive sampling. Next? Distributed architectures... Open the kernel black box!
43-48 Proving $\kappa(P^\top H P) \approx 1$. Let $K_x = K(x, \cdot) \in \mathcal H$, $C = \int K_x \otimes K_x \, d\rho_X(x)$, $\hat C_n = \frac{1}{n} \sum_{i=1}^n K_{x_i} \otimes K_{x_i}$, $\hat C_M = \frac{1}{M} \sum_{j=1}^M K_{\tilde x_j} \otimes K_{\tilde x_j}$. Recall that $P = \frac{1}{\sqrt n} T^{-1} A^{-1}$, $T = \operatorname{chol}(K_{MM})$, $A = \operatorname{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right)$. Steps (with $V$ a suitable partial isometry): 1. $P^\top H P = A^{-\top} V^\top (\hat C_n + \lambda I) V A^{-1}$. 2. $P^\top H P = A^{-\top} V^\top (\hat C_M + \lambda I) V A^{-1} + A^{-\top} V^\top (\hat C_n - \hat C_M) V A^{-1}$. 3. $P^\top H P = I + E$, with $E = A^{-\top} V^\top (\hat C_n - \hat C_M) V A^{-1}$. 4. $\kappa(P^\top H P) = \kappa(I + E) \le \frac{1 + \|E\|}{1 - \|E\|}$ when $\|E\| < 1$. 5. $\|E\| = \|A^{-\top} V^\top (\hat C_n - \hat C_M) V A^{-1}\| \le 1/2$ w.h.p. when $M \ge c_0 \frac{\log \frac{1}{\delta}}{\lambda}$.