An inverse problem perspective on machine learning


1 An inverse problem perspective on machine learning
Lorenzo Rosasco
University of Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
lcsl.mit.edu
Feb 9th, 2018, Inverse Problems and Machine Learning Workshop, CM+X, Caltech

2 Today's selection
- Classics: learning as an inverse problem
- Latest releases: kernel methods as a test bed for algorithm design

3 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

4 What's learning? [figure: scattered training points (x_1, y_1), ..., (x_5, y_5)]

5 What's learning? [figure: training points (x_1, y_1), ..., (x_5, y_5) plus new inputs (x_6, ?), (x_7, ?) to be predicted]

6 What's learning? [figure as above] Learning is about inference, not interpolation.

7 Statistical Machine Learning (ML)
- (X, Y) a pair of random variables in X × R
- L : R × R → [0, ∞) a loss function
- H ⊂ R^X a space of functions f : X → R
Problem: solve
    min_{f ∈ H} E[L(f(X), Y)]
given only (x_1, y_1), ..., (x_n, y_n), a sample of n i.i.d. copies of (X, Y).

8 ML theory around 2000
- All algorithms are ERM (empirical risk minimization):
    min_{f ∈ H} (1/n) Σ_{i=1}^n L(f(x_i), y_i)    [Vapnik 96]
- Emphasis on empirical process theory:
    P( sup_{f ∈ H} | (1/n) Σ_{i=1}^n L(f(X_i), Y_i) − E[L(f(X), Y)] | > ε )
  [Vapnik, Chervonenkis 71; Dudley, Giné, Zinn 94]
- ... and complexity measures, e.g. Gaussian/Rademacher complexities:
    C(H) = E[ sup_{f ∈ H} Σ_{i=1}^n σ_i f(x_i) ]
  [Bartlett, Bousquet, Koltchinskii, Massart, Mendelson ... 00]
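
To make the last definition concrete, here is a small numerical illustration (not from the talk): for the norm-bounded linear class {x ↦ ⟨w, x⟩ : ‖w‖ ≤ B} the supremum over the class has the closed form B‖Σ_i σ_i x_i‖, so the empirical Rademacher complexity can be estimated by averaging over random sign vectors. The data, the bound B and the number of Monte Carlo draws are made-up placeholders.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
#   C(H) = E_sigma[ sup_{f in H} sum_i sigma_i f(x_i) ]
# for the linear class H = { x -> <w, x> : ||w||_2 <= B },
# where the supremum equals B * ||sum_i sigma_i x_i||.
rng = np.random.default_rng(0)
n, d, B = 200, 10, 1.0
X = rng.standard_normal((n, d))            # made-up sample x_1, ..., x_n

def empirical_rademacher_linear(X, B, n_mc=2000, rng=rng):
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=X.shape[0])   # Rademacher signs
        vals.append(B * np.linalg.norm(sigma @ X))          # closed-form supremum over the ball
    return float(np.mean(vals))

print(empirical_rademacher_linear(X, B) / n)   # per-sample complexity, of order B / sqrt(n)
```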

9 Around the same time
- Cucker and Smale, On the mathematical foundations of learning theory, AMS
- Caponnetto, De Vito, R., Verri, Learning as an inverse problem, JMLR
- Smale, Zhou, Shannon sampling and function reconstruction from point values, Bull. AMS

10 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

11 Inverse Problems (IP)
- A : H → G a bounded linear operator between Hilbert spaces
- g ∈ G
Problem: find f solving
    A f = g
assuming A and g_δ are given, with ‖g_δ − g‖ ≤ δ.
[Engl, Hanke, Neubauer 96]

12 Ill-posedness
- Existence: g ∉ Range(A)
- Uniqueness: Ker(A) ≠ {0}
- Stability: ‖A^†‖ = ∞ (a large norm is also a mess)
[figure: A maps H to G; g may lie outside Range(A)]
    O = argmin_{f ∈ H} ‖A f − g‖²,    f^† = A^† g = argmin_{f ∈ O} ‖f‖
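
A tiny numerical illustration of the stability issue (not from the talk; the operator and noise level are made up): for an ill-conditioned A, the minimum-norm least-squares solution A^† g_δ is driven far from the truth by a perturbation of size 1e-6.

```python
import numpy as np

rng = np.random.default_rng(0)
# A made-up ill-conditioned operator: rapidly decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s = 10.0 ** -np.arange(50)                       # singular values 1, 1e-1, 1e-2, ...
A = U @ np.diag(s) @ V.T

f_true = rng.standard_normal(50)
g = A @ f_true
g_delta = g + 1e-6 * rng.standard_normal(50)     # small data noise: ||g_delta - g|| ~ 1e-6

f_dagger = np.linalg.pinv(A) @ g_delta           # minimum-norm least-squares solution A^+ g_delta
print(np.linalg.norm(f_dagger - f_true))         # large: noise is amplified by the small singular values
```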

13 Is machine learning an inverse problem?
Machine learning: (X, Y), L : R × R → [0, ∞), H ⊂ R^X; solve min_{f ∈ H} E[L(f(X), Y)] given only (x_1, y_1), ..., (x_n, y_n).
Inverse problem: A : H → G, g ∈ G; find f solving A f = g given A and g_δ with ‖g_δ − g‖ ≤ δ.
Actually yes, under some assumptions.

14 Key assumptions: least squares and RKHS
Assumption (least squares): L(f(x), y) = (f(x) − y)²
Assumption (RKHS):
- (H, ⟨·,·⟩) is a real, separable Hilbert space
- continuous evaluation functionals: for all x ∈ X, let e_x : H → R with e_x(f) = f(x); then |e_x(f) − e_x(f′)| ≲ ‖f − f′‖
[Aronszajn 50]

15 Key assumptions: least squares and RKHS
Assumption (least squares): L(f(x), y) = (f(x) − y)²
Assumption (RKHS):
- (H, ⟨·,·⟩) is a real, separable Hilbert space
- continuous evaluation functionals: for all x ∈ X, let e_x : H → R with e_x(f) = f(x); then |e_x(f) − e_x(f′)| ≲ ‖f − f′‖
Implications:
- ‖f‖_∞ ≲ ‖f‖
- there exists k_x ∈ H such that f(x) = ⟨f, k_x⟩
[Aronszajn 50]

16 Interpolation and sampling operator [Bertero, De Mol, Pike 85, 88]
    f(x_i) = ⟨f, k_{x_i}⟩ = y_i,  i = 1, ..., n    ⇔    S_n f = y
Sampling operator: S_n : H → R^n, (S_n f)_i = ⟨f, k_{x_i}⟩ for all i = 1, ..., n.
[figure: a function interpolating the samples at x_1, ..., x_5]

17 Learning and restriction operator [Caponnetto, De Vito, R. 05]
    ⟨f, k_x⟩ = f_ρ(x)  ρ-almost surely    ⇔    S_ρ f = f_ρ
Restriction operator: S_ρ : H → L²(X, ρ), (S_ρ f)(x) = ⟨f, k_x⟩ almost surely.
    f_ρ(x) = ∫ y dρ(y|x)  ρ-almost surely,    L²(X, ρ) = { f ∈ R^X : ‖f‖²_ρ = ∫ dρ(x) f(x)² < ∞ }
[figure: the target function f_ρ over X]

18 Learning as an inverse problem
Inverse problem: find f solving
    S_ρ f = f_ρ
given S_n and y_n = (y_1, ..., y_n).

19 Learning as an inverse problem
Inverse problem: find f solving
    S_ρ f = f_ρ
given S_n and y_n = (y_1, ..., y_n).
Least squares:
    min_{f ∈ H} ‖S_ρ f − f_ρ‖²_ρ,    ‖S_ρ f − f_ρ‖²_ρ = E[(f(X) − Y)²] − E[(f_ρ(X) − Y)²]

20 Let's see what we got
- Noise model
- Integral operators & covariance operators
- Kernels

21 Noise model
Ideal:      S_ρ f = f_ρ,      S_ρ* S_ρ f = S_ρ* f_ρ
Empirical:  S_n f = y,        S_n* S_n f = S_n* y
Noise model:
    ‖S_n* y − S_ρ* f_ρ‖ ≤ δ_1,    ‖S_ρ* S_ρ − S_n* S_n‖ ≤ δ_2
(cf. discretization of inverse problems, econometrics)

22 Integral and covariance operators
- Extension operator S_ρ* : L²(X, ρ) → H,
    (S_ρ* f)(x′) = ∫ dρ(x) k(x′, x) f(x),
  where k(x, x′) = ⟨k_x, k_{x′}⟩ is positive definite.
- Covariance operator S_ρ* S_ρ : H → H,
    S_ρ* S_ρ = ∫ dρ(x) k_x ⊗ k_x

23 Kernels
Choosing a RKHS implies choosing a representation.
Theorem (Moore-Aronszajn). Let k : X × X → R be positive definite. Then the completion of
    { f ∈ R^X : f = Σ_{i=1}^N c_i k_{x_i},  c_1, ..., c_N ∈ R,  x_1, ..., x_N ∈ X,  N ∈ N }
with respect to the inner product defined by ⟨k_x, k_{x′}⟩ = k(x, x′) is a RKHS.

24 Kernels
If K(x, x′) = xᵀx′, then
- S_n is the n-by-D data matrix (S_ρ an "infinite data matrix")
- S_n* S_n and S_ρ* S_ρ are the empirical and true covariance operators

25 Kernels
If K(x, x′) = xᵀx′, then
- S_n is the n-by-D data matrix (S_ρ an "infinite data matrix")
- S_n* S_n and S_ρ* S_ρ are the empirical and true covariance operators
Other kernels:
- K(x, x′) = (1 + xᵀx′)^p
- K(x, x′) = e^{−‖x − x′‖²}
- K(x, x′) = e^{−‖x − x′‖}
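
For concreteness, a short sketch (not from the talk) of how these kernel matrices are built in practice; the data, the degree p and the scale parameter sigma are made-up placeholders (the slide's formulas omit the scale).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                 # made-up sample: n = 100 points in R^5

def linear_kernel(X, Z):
    return X @ Z.T                                # K(x, x') = x^T x'

def poly_kernel(X, Z, p=3):
    return (1.0 + X @ Z.T) ** p                   # K(x, x') = (1 + x^T x')^p

def sq_dists(X, Z):
    return np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T

def gaussian_kernel(X, Z, sigma=1.0):
    return np.exp(-sq_dists(X, Z) / (2 * sigma**2))        # exp(-||x - x'||^2 / (2 sigma^2))

def laplacian_kernel(X, Z, sigma=1.0):
    return np.exp(-np.sqrt(np.maximum(sq_dists(X, Z), 0.0)) / sigma)  # exp(-||x - x'|| / sigma)

K_n = gaussian_kernel(X, X)                       # n x n kernel matrix with entries K(x_i, x_j)
```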

26 What now? Steal

27 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

28 Tikhonov aka ridge regression
    f_{n,λ} = (S_n* S_n + nλ I)^{−1} S_n* y

29 Tikhonov aka ridge regression
    f_{n,λ} = (S_n* S_n + nλ I)^{−1} S_n* y = S_n* (S_n S_n* + nλ I)^{−1} y,    K_n = S_n S_n*
which in coefficient form reads
    (K_n + nλ I) c = y
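
In practice one forms the n × n kernel matrix and solves the regularized linear system above. A minimal sketch (not from the talk; the Gaussian kernel, the synthetic data and the value of λ are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)      # synthetic regression data

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-3
K_n = gaussian_kernel(X, X)
c = np.linalg.solve(K_n + n * lam * np.eye(n), y)        # Tikhonov: (K_n + n*lambda*I) c = y

# The estimator is f(x) = sum_i c_i K(x, x_i):
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_pred = gaussian_kernel(X_test, X) @ c
```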

30 Statistics
Theorem (Caponnetto, De Vito 05). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If λ_n = n^{−1/(2r+1)}, then
    E[ ‖S_ρ f_{n,λ_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}

31 Statistics
Theorem (Caponnetto, De Vito 05). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If λ_n = n^{−1/(2r+1)}, then
    E[ ‖S_ρ f_{n,λ_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}
Proof sketch: for all λ > 0,
    ‖S_ρ f_{n,λ} − f_ρ‖²_ρ ≲ (1/λ)(δ_1² + δ_2²) + λ^{2r},    with E[δ_1], E[δ_2] ≲ 1/√n;
balancing the two terms gives the choice λ_n = n^{−1/(2r+1)} and the rate n^{−2r/(2r+1)}.

32 Iterative regularization
From the Neumann series ...
    f_n^t = Σ_{j=0}^{t−1} (I − S_n* S_n)^j S_n* y

33 Iterative regularization
From the Neumann series ...
    f_n^t = Σ_{j=0}^{t−1} (I − S_n* S_n)^j S_n* y = S_n* Σ_{j=0}^{t−1} (I − K_n)^j y,    K_n = S_n S_n*

34 Iterative regularization
From the Neumann series ...
    f_n^t = Σ_{j=0}^{t−1} (I − S_n* S_n)^j S_n* y = S_n* Σ_{j=0}^{t−1} (I − K_n)^j y,    K_n = S_n S_n*
... to gradient descent:
    f_n^t = f_n^{t−1} − S_n* (S_n f_n^{t−1} − y),    c^t = c^{t−1} − (K_n c^{t−1} − y)
[figure: training error decreases with t while test error is U-shaped, motivating early stopping]
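
A minimal sketch (not from the talk) of the coefficient-form iteration; the step size and the way the stopping time is chosen are illustrative assumptions (the slide's recursion uses unit step size).

```python
import numpy as np

def gradient_descent_kernel(K_n, y, t_max, step=None):
    """Iterative (Landweber) regularization in coefficient form:
    c^t = c^{t-1} - step * (K_n c^{t-1} - y).
    The number of iterations t plays the role of the regularization parameter."""
    n = K_n.shape[0]
    if step is None:
        step = 1.0 / np.linalg.norm(K_n, 2)       # illustrative choice keeping the iteration stable
    c = np.zeros(n)
    path = [c.copy()]
    for _ in range(t_max):
        c = c - step * (K_n @ c - y)
        path.append(c.copy())                     # keep the whole path: pick t by validation (early stopping)
    return path

# usage: with K_n a kernel matrix and y the labels, choose the iterate along `path`
# whose predictions minimize the error on a held-out set.
```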

35 Iterative regularization: statistics
Theorem (Bauer, Pereverzev, R. 07). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If t_n = n^{1/(2r+1)}, then
    E[ ‖S_ρ f_n^{t_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}

36 Iterative regularization: statistics
Theorem (Bauer, Pereverzev, R. 07). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If t_n = n^{1/(2r+1)}, then
    E[ ‖S_ρ f_n^{t_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}
Proof sketch: for all t > 0,
    ‖S_ρ f_n^t − f_ρ‖²_ρ ≲ t (δ_1² + δ_2²) + (1/t)^{2r},    with E[δ_1], E[δ_2] ≲ 1/√n;
balancing the two terms gives t_n = n^{1/(2r+1)} and the rate n^{−2r/(2r+1)}.

37 Tikhonov vs iterative regularization
- Same statistical properties ...
- ... but time complexities are different: O(n³) vs O(n² · n^{1/(2r+1)})
- Iterative regularization provides a bridge between statistics and computations.
- Kernel methods become a test bed for algorithmic solutions.

38 Computational regularization
Tikhonov: time O(n³) + space O(n²) for a 1/√n learning bound

39 Computational regularization
Tikhonov: time O(n³) + space O(n²) for a 1/√n learning bound
        ⇓
Iterative regularization: time O(n²√n) + space O(n²) for a 1/√n learning bound

40 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

41 Steal from optimization
Acceleration:
- Conjugate gradient [Blanchard, Krämer 96]
- Chebyshev method [Bauer, Pereverzev, R. 07]
- Nesterov acceleration (Nesterov, 83)
Stochastic gradient:
- Single-pass stochastic gradient [Salzo, R. 18]
- Multi-pass incremental gradient [Tarres, Yao 05; Pontil, Ying 09; Bach, Dieuleveut, Flammarion 17]
- Multi-pass stochastic gradient with mini-batches [Villa, R. 15] [Lin, R. 16]
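
As a concrete instance of the single-pass variant, each stochastic step updates the current function using one example, f^t = f^{t−1} − γ (f^{t−1}(x_t) − y_t) k_{x_t}, which adds one kernel coefficient per step. The sketch below is a simplified illustration, not any of the cited algorithms: the constant step size γ and the made-up data are assumptions.

```python
import numpy as np

def single_pass_sgd_kls(X, y, kernel, gamma=0.05):
    """Single-pass SGD for kernel least squares:
    f^t = f^{t-1} - gamma * (f^{t-1}(x_t) - y_t) * k_{x_t}.
    One new coefficient per example; one sweep over the data."""
    n = len(y)
    coef = np.zeros(n)
    for t in range(n):
        x_t = X[t:t + 1]
        pred = kernel(X[:t], x_t)[:, 0] @ coef[:t] if t > 0 else 0.0   # f^{t-1}(x_t)
        coef[t] = -gamma * (pred - y[t])
    return coef        # the estimator is f(x) = sum_t coef[t] * K(x, x_t)

# usage with a linear kernel K(x, x') = x^T x' on made-up data:
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(500)
coef = single_pass_sgd_kls(X, y, lambda A, B: A @ B.T)
```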

42 Computational regularization
Iterative regularization: time O(n²√n) + space O(n²) for a 1/√n learning bound
        ⇓
Stochastic iterative regularization: time O(n²) + space O(n²) for a 1/√n learning bound

43 Can we do better? How about memory?

44 Regularization with projection and preconditioning [Halko, Martinsson, Tropp 09]
Nyström-projected system (M ≪ n centers):
    (K_nM^⊤ K_nM + nλ K_MM) c = K_nM^⊤ y
Preconditioning:
    B B^⊤ = ( (n/M) K_MM² + nλ K_MM )^{−1}
FALKON [Rudi, Carratino, R. 17], see also [Ma, Belkin 17]:
    c_t = B β_t,    β_t = β_{t−1} − (1/n) B^⊤ [ K_nM^⊤ (K_nM B β_{t−1} − y) + nλ K_MM B β_{t−1} ]
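
Stripping away the preconditioner and the iterative solver, the projected system above can be sketched directly: subsample M centers, build K_nM and K_MM, and solve for c. The code below is a plain Nyström kernel ridge regression under those simplifying assumptions, not the FALKON algorithm itself (which replaces the direct solve with preconditioned iterations); the data, M, λ and the jitter term are placeholders.

```python
import numpy as np

def nystrom_krr(X, y, kernel, M, lam, rng=np.random.default_rng(0)):
    """Nystrom-projected kernel ridge regression:
    solve (K_nM^T K_nM + n*lam*K_MM) c = K_nM^T y with M << n uniformly sampled centers.
    Direct solve for illustration; FALKON instead uses preconditioned iterations."""
    n = len(y)
    idx = rng.choice(n, size=M, replace=False)
    centers = X[idx]
    K_nM = kernel(X, centers)                         # n x M
    K_MM = kernel(centers, centers)                   # M x M
    A = K_nM.T @ K_nM + n * lam * K_MM
    b = K_nM.T @ y
    c = np.linalg.solve(A + 1e-10 * np.eye(M), b)     # small jitter for numerical stability
    predict = lambda X_new: kernel(X_new, centers) @ c
    return c, centers, predict

# Time is O(n M^2 + M^3) and memory O(n M), versus O(n^3) time and O(n^2) memory for full KRR.
```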

45 FALKON statistics
Theorem (Rudi, Carratino, R. 17). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If λ_n = n^{−1/(2r+1)}, M_n = n^{1/(2r+1)}, t_n = log n, then
    E[ ‖S_ρ f_{n,λ_n,t_n,M_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}

46 Computational regularization
Stochastic iterative regularization: time O(n²) + space O(n²) for a 1/√n learning bound
        ⇓
FALKON: time Õ(n√n) + space O(n√n) for a 1/√n learning bound

47 Some results
[table: FALKON compared with Prec. KRR, Hierarchical, D&C, Rand. Feat., Nyström, ADMM R.F., BCD R.F., BCD Nyström, KRR, EigenPro, Deep NN, Sparse Kernels and Ensemble methods on MillionSongs (MSE, relative error, time in s), YELP (RMSE, time in m) and TIMIT (classification error, time in h); numeric entries not recovered]

48 Conclusions
Contributions:
- Learning as an inverse problem
- Computational regularization: statistics meets numerics
Future work:
- Scaling things up ...
- Regularization with projections (quadrature, Galerkin methods)
- Connection to PDE/integral equations: exploit more structure
- Structured prediction / deep learning
- Semi-supervised / unsupervised learning
- Embedding and compressed learning
