An inverse problem perspective on machine learning


1 An inverse problem perspective on machine learning
Lorenzo Rosasco
University of Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
lcsl.mit.edu
Feb 9th, 2018, Inverse Problems and Machine Learning Workshop, CM+X, Caltech

2 Today's selection
- Classics: learning as an inverse problem
- Latest releases: kernel methods as a test bed for algorithm design

3 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

4 What's learning? [figure: scattered training points (x_1, y_1), ..., (x_5, y_5)]

5 What's learning? [figure: training points (x_1, y_1), ..., (x_5, y_5) plus new inputs (x_6, ?), (x_7, ?) to be predicted]

6 What's learning? [figure as above] Learning is about inference, not interpolation.

7 Statistical Machine Learning (ML)
- (X, Y) a pair of random variables in X × R
- L : R × R → [0, ∞) a loss function
- H ⊂ R^X a space of functions f : X → R
Problem: solve
    min_{f ∈ H} E[L(f(X), Y)]
given only (x_1, y_1), ..., (x_n, y_n), a sample of n i.i.d. copies of (X, Y).

8 ML theory around 2000
- All algorithms are ERM (empirical risk minimization):
    min_{f ∈ H} (1/n) Σ_{i=1}^n L(f(x_i), y_i)    [Vapnik 96]
- Emphasis on empirical process theory:
    P( sup_{f ∈ H} | (1/n) Σ_{i=1}^n L(f(X_i), Y_i) − E[L(f(X), Y)] | > ε )
  [Vapnik, Chervonenkis 71; Dudley, Giné, Zinn 94]
- ... and complexity measures, e.g. Gaussian/Rademacher complexities:
    C(H) = E[ sup_{f ∈ H} Σ_{i=1}^n σ_i f(x_i) ]
  [Bartlett, Bousquet, Koltchinskii, Massart, Mendelson ... 00]
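
To make the last definition concrete, here is a small numerical illustration (not from the talk): for the norm-bounded linear class {x ↦ ⟨w, x⟩ : ‖w‖ ≤ B} the supremum over the class has the closed form B‖Σ_i σ_i x_i‖, so the empirical Rademacher complexity can be estimated by averaging over random sign vectors. The data, the bound B and the number of Monte Carlo draws are made-up placeholders.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
#   C(H) = E_sigma[ sup_{f in H} sum_i sigma_i f(x_i) ]
# for the linear class H = { x -> <w, x> : ||w||_2 <= B },
# where the supremum equals B * ||sum_i sigma_i x_i||.
rng = np.random.default_rng(0)
n, d, B = 200, 10, 1.0
X = rng.standard_normal((n, d))            # made-up sample x_1, ..., x_n

def empirical_rademacher_linear(X, B, n_mc=2000, rng=rng):
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=X.shape[0])   # Rademacher signs
        vals.append(B * np.linalg.norm(sigma @ X))          # closed-form supremum over the ball
    return float(np.mean(vals))

print(empirical_rademacher_linear(X, B) / n)   # per-sample complexity, of order B / sqrt(n)
```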

9 Around the same time
- Cucker and Smale, On the mathematical foundations of learning theory, AMS
- Caponnetto, De Vito, R., Verri, Learning as an inverse problem, JMLR
- Smale, Zhou, Shannon sampling and function reconstruction from point values, Bull. AMS

10 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

11 Inverse Problems (IP)
- A : H → G a bounded linear operator between Hilbert spaces
- g ∈ G
Problem: find f solving
    A f = g
assuming A and g_δ are given, with ‖g_δ − g‖ ≤ δ.
[Engl, Hanke, Neubauer 96]

12 Ill-posedness
- Existence: g ∉ Range(A)
- Uniqueness: Ker(A) ≠ {0}
- Stability: ‖A^†‖ = ∞ (a large norm is also a mess)
[figure: A maps H to G; g may lie outside Range(A)]
    O = argmin_{f ∈ H} ‖A f − g‖²,    f^† = A^† g = argmin_{f ∈ O} ‖f‖
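
A tiny numerical illustration of the stability issue (not from the talk; the operator and noise level are made up): for an ill-conditioned A, the minimum-norm least-squares solution A^† g_δ is driven far from the truth by a perturbation of size 1e-6.

```python
import numpy as np

rng = np.random.default_rng(0)
# A made-up ill-conditioned operator: rapidly decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s = 10.0 ** -np.arange(50)                       # singular values 1, 1e-1, 1e-2, ...
A = U @ np.diag(s) @ V.T

f_true = rng.standard_normal(50)
g = A @ f_true
g_delta = g + 1e-6 * rng.standard_normal(50)     # small data noise: ||g_delta - g|| ~ 1e-6

f_dagger = np.linalg.pinv(A) @ g_delta           # minimum-norm least-squares solution A^+ g_delta
print(np.linalg.norm(f_dagger - f_true))         # large: noise is amplified by the small singular values
```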

13 Is machine learning an inverse problem?
Machine learning: (X, Y), L : R × R → [0, ∞), H ⊂ R^X; solve min_{f ∈ H} E[L(f(X), Y)] given only (x_1, y_1), ..., (x_n, y_n).
Inverse problem: A : H → G, g ∈ G; find f solving A f = g given A and g_δ with ‖g_δ − g‖ ≤ δ.
Actually yes, under some assumptions.

14 Key assumptions: least squares and RKHS
Assumption (least squares): L(f(x), y) = (f(x) − y)²
Assumption (RKHS):
- (H, ⟨·,·⟩) is a real, separable Hilbert space
- continuous evaluation functionals: for all x ∈ X, let e_x : H → R with e_x(f) = f(x); then |e_x(f) − e_x(f′)| ≲ ‖f − f′‖
[Aronszajn 50]

15 Key assumptions: least squares and RKHS
Assumption (least squares): L(f(x), y) = (f(x) − y)²
Assumption (RKHS):
- (H, ⟨·,·⟩) is a real, separable Hilbert space
- continuous evaluation functionals: for all x ∈ X, let e_x : H → R with e_x(f) = f(x); then |e_x(f) − e_x(f′)| ≲ ‖f − f′‖
Implications:
- ‖f‖_∞ ≲ ‖f‖
- there exists k_x ∈ H such that f(x) = ⟨f, k_x⟩
[Aronszajn 50]

16 Interpolation and sampling operator [Bertero, De Mol, Pike 85, 88]
    f(x_i) = ⟨f, k_{x_i}⟩ = y_i,  i = 1, ..., n    ⇔    S_n f = y
Sampling operator: S_n : H → R^n, (S_n f)_i = ⟨f, k_{x_i}⟩ for all i = 1, ..., n.
[figure: a function interpolating the samples at x_1, ..., x_5]

17 Learning and restriction operator [Caponnetto, De Vito, R. 05]
    ⟨f, k_x⟩ = f_ρ(x)  ρ-almost surely    ⇔    S_ρ f = f_ρ
Restriction operator: S_ρ : H → L²(X, ρ), (S_ρ f)(x) = ⟨f, k_x⟩ almost surely.
    f_ρ(x) = ∫ y dρ(y|x)  ρ-almost surely,    L²(X, ρ) = { f ∈ R^X : ‖f‖²_ρ = ∫ dρ(x) f(x)² < ∞ }
[figure: the target function f_ρ over X]

18 Learning as an inverse problem
Inverse problem: find f solving
    S_ρ f = f_ρ
given S_n and y_n = (y_1, ..., y_n).

19 Learning as an inverse problem
Inverse problem: find f solving
    S_ρ f = f_ρ
given S_n and y_n = (y_1, ..., y_n).
Least squares:
    min_{f ∈ H} ‖S_ρ f − f_ρ‖²_ρ,    ‖S_ρ f − f_ρ‖²_ρ = E[(f(X) − Y)²] − E[(f_ρ(X) − Y)²]

20 Let's see what we got
- Noise model
- Integral operators & covariance operators
- Kernels

21 Noise model
Ideal:      S_ρ f = f_ρ,      S_ρ* S_ρ f = S_ρ* f_ρ
Empirical:  S_n f = y,        S_n* S_n f = S_n* y
Noise model:
    ‖S_n* y − S_ρ* f_ρ‖ ≤ δ_1,    ‖S_ρ* S_ρ − S_n* S_n‖ ≤ δ_2
(cf. discretization of inverse problems, econometrics)

22 Integral and covariance operators
- Extension operator S_ρ* : L²(X, ρ) → H,
    (S_ρ* f)(x′) = ∫ dρ(x) k(x′, x) f(x),
  where k(x, x′) = ⟨k_x, k_{x′}⟩ is positive definite.
- Covariance operator S_ρ* S_ρ : H → H,
    S_ρ* S_ρ = ∫ dρ(x) k_x ⊗ k_x

23 Kernels
Choosing a RKHS implies choosing a representation.
Theorem (Moore-Aronszajn). Let k : X × X → R be positive definite. Then the completion of
    { f ∈ R^X : f = Σ_{i=1}^N c_i k_{x_i},  c_1, ..., c_N ∈ R,  x_1, ..., x_N ∈ X,  N ∈ N }
with respect to the inner product defined by ⟨k_x, k_{x′}⟩ = k(x, x′) is a RKHS.

24 Kernels
If K(x, x′) = xᵀx′, then
- S_n is the n-by-D data matrix (S_ρ an "infinite data matrix")
- S_n* S_n and S_ρ* S_ρ are the empirical and true covariance operators

25 Kernels
If K(x, x′) = xᵀx′, then
- S_n is the n-by-D data matrix (S_ρ an "infinite data matrix")
- S_n* S_n and S_ρ* S_ρ are the empirical and true covariance operators
Other kernels:
- K(x, x′) = (1 + xᵀx′)^p
- K(x, x′) = e^{−‖x − x′‖²}
- K(x, x′) = e^{−‖x − x′‖}
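
For concreteness, a short sketch (not from the talk) of how these kernel matrices are built in practice; the data, the degree p and the scale parameter sigma are made-up placeholders (the slide's formulas omit the scale).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                 # made-up sample: n = 100 points in R^5

def linear_kernel(X, Z):
    return X @ Z.T                                # K(x, x') = x^T x'

def poly_kernel(X, Z, p=3):
    return (1.0 + X @ Z.T) ** p                   # K(x, x') = (1 + x^T x')^p

def sq_dists(X, Z):
    return np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T

def gaussian_kernel(X, Z, sigma=1.0):
    return np.exp(-sq_dists(X, Z) / (2 * sigma**2))        # exp(-||x - x'||^2 / (2 sigma^2))

def laplacian_kernel(X, Z, sigma=1.0):
    return np.exp(-np.sqrt(np.maximum(sq_dists(X, Z), 0.0)) / sigma)  # exp(-||x - x'|| / sigma)

K_n = gaussian_kernel(X, X)                       # n x n kernel matrix with entries K(x_i, x_j)
```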

26 What now? Steal

27 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

28 Tikhonov aka ridge regression
    f_{n,λ} = (S_n* S_n + nλ I)^{−1} S_n* y

29 Tikhonov aka ridge regression
    f_{n,λ} = (S_n* S_n + nλ I)^{−1} S_n* y = S_n* (S_n S_n* + nλ I)^{−1} y,    K_n = S_n S_n*
which in coefficient form reads
    (K_n + nλ I) c = y
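
In practice one forms the n × n kernel matrix and solves the regularized linear system above. A minimal sketch (not from the talk; the Gaussian kernel, the synthetic data and the value of λ are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)      # synthetic regression data

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-3
K_n = gaussian_kernel(X, X)
c = np.linalg.solve(K_n + n * lam * np.eye(n), y)        # Tikhonov: (K_n + n*lambda*I) c = y

# The estimator is f(x) = sum_i c_i K(x, x_i):
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_pred = gaussian_kernel(X_test, X) @ c
```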

30 Statistics
Theorem (Caponnetto, De Vito 05). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If λ_n = n^{−1/(2r+1)}, then
    E[ ‖S_ρ f_{n,λ_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}

31 Statistics
Theorem (Caponnetto, De Vito 05). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If λ_n = n^{−1/(2r+1)}, then
    E[ ‖S_ρ f_{n,λ_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}
Proof sketch: for all λ > 0,
    ‖S_ρ f_{n,λ} − f_ρ‖²_ρ ≲ (1/λ)(δ_1² + δ_2²) + λ^{2r},    with E[δ_1], E[δ_2] ≲ 1/√n;
balancing the two terms gives the choice λ_n = n^{−1/(2r+1)} and the rate n^{−2r/(2r+1)}.

32 Iterative regularization
From the Neumann series ...
    f_n^t = Σ_{j=0}^{t−1} (I − S_n* S_n)^j S_n* y

33 Iterative regularization
From the Neumann series ...
    f_n^t = Σ_{j=0}^{t−1} (I − S_n* S_n)^j S_n* y = S_n* Σ_{j=0}^{t−1} (I − K_n)^j y,    K_n = S_n S_n*

34 Iterative regularization
From the Neumann series ...
    f_n^t = Σ_{j=0}^{t−1} (I − S_n* S_n)^j S_n* y = S_n* Σ_{j=0}^{t−1} (I − K_n)^j y,    K_n = S_n S_n*
... to gradient descent:
    f_n^t = f_n^{t−1} − S_n* (S_n f_n^{t−1} − y),    c^t = c^{t−1} − (K_n c^{t−1} − y)
[figure: training error decreases with t while test error is U-shaped, motivating early stopping]
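
A minimal sketch (not from the talk) of the coefficient-form iteration; the step size and the way the stopping time is chosen are illustrative assumptions (the slide's recursion uses unit step size).

```python
import numpy as np

def gradient_descent_kernel(K_n, y, t_max, step=None):
    """Iterative (Landweber) regularization in coefficient form:
    c^t = c^{t-1} - step * (K_n c^{t-1} - y).
    The number of iterations t plays the role of the regularization parameter."""
    n = K_n.shape[0]
    if step is None:
        step = 1.0 / np.linalg.norm(K_n, 2)       # illustrative choice keeping the iteration stable
    c = np.zeros(n)
    path = [c.copy()]
    for _ in range(t_max):
        c = c - step * (K_n @ c - y)
        path.append(c.copy())                     # keep the whole path: pick t by validation (early stopping)
    return path

# usage: with K_n a kernel matrix and y the labels, choose the iterate along `path`
# whose predictions minimize the error on a held-out set.
```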

35 Iterative regularization: statistics
Theorem (Bauer, Pereverzev, R. 07). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If t_n = n^{1/(2r+1)}, then
    E[ ‖S_ρ f_n^{t_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}

36 Iterative regularization: statistics
Theorem (Bauer, Pereverzev, R. 07). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If t_n = n^{1/(2r+1)}, then
    E[ ‖S_ρ f_n^{t_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}
Proof sketch: for all t > 0,
    ‖S_ρ f_n^t − f_ρ‖²_ρ ≲ t (δ_1² + δ_2²) + (1/t)^{2r},    with E[δ_1], E[δ_2] ≲ 1/√n;
balancing the two terms gives t_n = n^{1/(2r+1)} and the rate n^{−2r/(2r+1)}.

37 Tikhonov vs iterative regularization
- Same statistical properties ...
- ... but time complexities are different: O(n³) vs O(n² · n^{1/(2r+1)})
- Iterative regularization provides a bridge between statistics and computations.
- Kernel methods become a test bed for algorithmic solutions.

38 Computational regularization
Tikhonov: time O(n³) + space O(n²) for a 1/√n learning bound

39 Computational regularization
Tikhonov: time O(n³) + space O(n²) for a 1/√n learning bound
        ⇓
Iterative regularization: time O(n²√n) + space O(n²) for a 1/√n learning bound

40 Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances

41 Steal from optimization
Acceleration:
- Conjugate gradient [Blanchard, Krämer 96]
- Chebyshev method [Bauer, Pereverzev, R. 07]
- Nesterov acceleration (Nesterov, 83)
Stochastic gradient:
- Single-pass stochastic gradient [Salzo, R. 18]
- Multi-pass incremental gradient [Tarres, Yao 05; Pontil, Ying 09; Bach, Dieuleveut, Flammarion 17]
- Multi-pass stochastic gradient with mini-batches [Villa, R. 15] [Lin, R. 16]
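
As a concrete instance of the single-pass variant, each stochastic step updates the current function using one example, f^t = f^{t−1} − γ (f^{t−1}(x_t) − y_t) k_{x_t}, which adds one kernel coefficient per step. The sketch below is a simplified illustration, not any of the cited algorithms: the constant step size γ and the made-up data are assumptions.

```python
import numpy as np

def single_pass_sgd_kls(X, y, kernel, gamma=0.05):
    """Single-pass SGD for kernel least squares:
    f^t = f^{t-1} - gamma * (f^{t-1}(x_t) - y_t) * k_{x_t}.
    One new coefficient per example; one sweep over the data."""
    n = len(y)
    coef = np.zeros(n)
    for t in range(n):
        x_t = X[t:t + 1]
        pred = kernel(X[:t], x_t)[:, 0] @ coef[:t] if t > 0 else 0.0   # f^{t-1}(x_t)
        coef[t] = -gamma * (pred - y[t])
    return coef        # the estimator is f(x) = sum_t coef[t] * K(x, x_t)

# usage with a linear kernel K(x, x') = x^T x' on made-up data:
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(500)
coef = single_pass_sgd_kls(X, y, lambda A, B: A @ B.T)
```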

42 Computational regularization
Iterative regularization: time O(n²√n) + space O(n²) for a 1/√n learning bound
        ⇓
Stochastic iterative regularization: time O(n²) + space O(n²) for a 1/√n learning bound

43 Can we do better? How about memory?

44 Regularization with projection and preconditioning [Halko, Martinsson, Tropp 09]
Nyström-projected system (M ≪ n centers):
    (K_nM^⊤ K_nM + nλ K_MM) c = K_nM^⊤ y
Preconditioning:
    B B^⊤ = ( (n/M) K_MM² + nλ K_MM )^{−1}
FALKON [Rudi, Carratino, R. 17], see also [Ma, Belkin 17]:
    c_t = B β_t,    β_t = β_{t−1} − (1/n) B^⊤ [ K_nM^⊤ (K_nM B β_{t−1} − y) + nλ K_MM B β_{t−1} ]
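
Stripping away the preconditioner and the iterative solver, the projected system above can be sketched directly: subsample M centers, build K_nM and K_MM, and solve for c. The code below is a plain Nyström kernel ridge regression under those simplifying assumptions, not the FALKON algorithm itself (which replaces the direct solve with preconditioned iterations); the data, M, λ and the jitter term are placeholders.

```python
import numpy as np

def nystrom_krr(X, y, kernel, M, lam, rng=np.random.default_rng(0)):
    """Nystrom-projected kernel ridge regression:
    solve (K_nM^T K_nM + n*lam*K_MM) c = K_nM^T y with M << n uniformly sampled centers.
    Direct solve for illustration; FALKON instead uses preconditioned iterations."""
    n = len(y)
    idx = rng.choice(n, size=M, replace=False)
    centers = X[idx]
    K_nM = kernel(X, centers)                         # n x M
    K_MM = kernel(centers, centers)                   # M x M
    A = K_nM.T @ K_nM + n * lam * K_MM
    b = K_nM.T @ y
    c = np.linalg.solve(A + 1e-10 * np.eye(M), b)     # small jitter for numerical stability
    predict = lambda X_new: kernel(X_new, centers) @ c
    return c, centers, predict

# Time is O(n M^2 + M^3) and memory O(n M), versus O(n^3) time and O(n^2) memory for full KRR.
```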

45 FALKON statistics
Theorem (Rudi, Carratino, R. 17). Assume K(X, X) ≤ 1, |Y| ≤ 1 a.s. and f_ρ ∈ Range((S_ρ* S_ρ)^r), 1/2 < r < 1. If λ_n = n^{−1/(2r+1)}, M_n = n^{1/(2r+1)}, t_n = log n, then
    E[ ‖S_ρ f_{n,λ_n,t_n,M_n} − f_ρ‖²_ρ ] ≲ n^{−2r/(2r+1)}

46 Computational regularization
Stochastic iterative regularization: time O(n²) + space O(n²) for a 1/√n learning bound
        ⇓
FALKON: time Õ(n√n) + space O(n√n) for a 1/√n learning bound

47 Some results
[table: FALKON compared with Prec. KRR, Hierarchical, D&C, Rand. Feat., Nyström, ADMM R.F., BCD R.F., BCD Nyström, KRR, EigenPro, Deep NN, Sparse Kernels and Ensemble methods on MillionSongs (MSE, relative error, time in s), YELP (RMSE, time in m) and TIMIT (classification error, time in h); numeric entries not recovered]

48 Conclusions
Contributions:
- Learning as an inverse problem
- Computational regularization: statistics meets numerics
Future work:
- Scaling things up ...
- Regularization with projections (quadrature, Galerkin methods)
- Connection to PDE/integral equations: exploit more structure
- Structured prediction / deep learning
- Semi-supervised / unsupervised learning
- Embedding and compressed learning
