An inverse problem perspective on machine learning
1 An inverse problem perspective on machine learning
Lorenzo Rosasco
University of Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
lcsl.mit.edu
Feb 9th, 2018, Inverse Problems and Machine Learning Workshop, CM+X, Caltech
2 Today's selection
- Classics: learning as an inverse problem
- Latest releases: kernel methods as a test bed for algorithm design
3 Outline Learning theory 2000 Learning as an inverse problem Regularization Recent advances
4 What's learning? [Figure: labeled examples $(x_1,y_1),\dots,(x_5,y_5)$]
5 What's learning? [Figure: the labeled examples plus new points $(x_6,?)$, $(x_7,?)$ whose labels must be predicted]
6 What's learning? Learning is about inference, not interpolation.
7 Statistical Machine Learning (ML)
- $(X, Y)$ a pair of random variables in $\mathcal{X} \times \mathbb{R}$.
- $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ a loss function.
- $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$
Problem: solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$ given only $(x_1, y_1), \dots, (x_n, y_n)$, a sample of $n$ i.i.d. copies of $(X, Y)$.
8 ML theory around 2000
- All algorithms are ERM (empirical risk minimization): $\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$
- Emphasis on empirical process theory: $P\left(\sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i) - \mathbb{E}[L(f(X), Y)] \right| > \epsilon \right)$ [Vapnik 96; Vapnik, Chervonenkis 71; Dudley, Giné, Zinn 94]
- ... and complexity measures, e.g. Gaussian/Rademacher complexities $C(\mathcal{H}) = \mathbb{E} \sup_{f \in \mathcal{H}} \sum_{i=1}^n \sigma_i f(X_i)$ [Bartlett, Bousquet, Koltchinskii, Massart, Mendelson ... '00]
9 Around the same time
- Cucker, Smale, On the mathematical foundations of learning theory, Bull. AMS
- Caponnetto, De Vito, R., Verri, Learning as an Inverse Problem, JMLR
- Smale, Zhou, Shannon sampling and function reconstruction from point values, Bull. AMS
10 Outline Learning theory 2000 Learning as an inverse problem Regularization Recent advances
11 Inverse Problems (IP)
- $A : \mathcal{H} \to \mathcal{G}$ a bounded linear operator between Hilbert spaces
- $g \in \mathcal{G}$
Problem: find $f$ solving $Af = g$, assuming $A$ and $g_\delta$ are given, with $\|g - g_\delta\| \le \delta$. [Engl, Hanke, Neubauer 96]
12 Ill-posedness
- Existence: $g \notin \mathrm{Range}(A)$
- Uniqueness: $\mathrm{Ker}(A) \ne \{0\}$
- Stability: $\|A^\dagger\| = \infty$ (large is also a mess)
[Figure: $A$ maps $f \in \mathcal{H}$ into $\mathrm{Range}(A) \subset \mathcal{G}$; $g$ may fall outside the range]
$\mathcal{O} = \operatorname{argmin}_{\mathcal{H}} \|Af - g\|^2, \qquad f^\dagger = A^\dagger g = \operatorname{argmin}_{f \in \mathcal{O}} \|f\|$
13 Is machine learning an inverse problem?
ML: $(X, Y)$; $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$; $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$; solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$ given only $(x_1, y_1), \dots, (x_n, y_n)$.
IP: $A : \mathcal{H} \to \mathcal{G}$; $g \in \mathcal{G}$; find $f$ solving $Af = g$ given $A$ and $g_\delta$ with $\|g - g_\delta\| \le \delta$.
Actually yes, under some assumptions.
14 Key assumptions: least squares and RKHS
Assumption (loss): $L(f(x), y) = (f(x) - y)^2$
Assumption (space):
- $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ is a Hilbert space (real, separable)
- continuous evaluation functionals: for all $x \in \mathcal{X}$, let $e_x : \mathcal{H} \to \mathbb{R}$ with $e_x(f) = f(x)$; then $|e_x(f) - e_x(f')| \lesssim \|f - f'\|$. [Aronszajn 50]
15 Key assumptions (cont.) Implications:
- $\|f\|_\infty \lesssim \|f\|$
- there exists $k_x \in \mathcal{H}$ such that $f(x) = \langle f, k_x \rangle$
16 Interpolation and sampling operator [Bertero, De Mol, Pike 85, 88]
$f(x_i) = \langle f, k_{x_i} \rangle = y_i$, $i = 1, \dots, n$ $\;\Rightarrow\;$ $S_n f = y$
Sampling operator: $S_n : \mathcal{H} \to \mathbb{R}^n$, $(S_n f)_i = \langle f, k_{x_i} \rangle$ for all $i = 1, \dots, n$.
[Figure: a function $f$ sampled at points $x_1, \dots, x_5$ in $\mathcal{X}$]
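In matrix terms, applying $S_n$ to a function in the span of the representers $k_{x_j}$ reduces to a Gram-matrix product, since $(S_n f)_i = \sum_j c_j k(x_i, x_j)$. A minimal NumPy sketch (the Gaussian kernel and all names here are illustrative choices, not from the slides):

```python
import numpy as np

# A function f = sum_j c_j k_{x_j} in the RKHS is represented by its
# coefficient vector c; applying the sampling operator S_n means evaluating
# f at the sample points: (S_n f)_i = <f, k_{x_i}> = sum_j c_j k(x_i, x_j).

def gaussian_kernel(A, B, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))   # sample points x_1, ..., x_n
c = rng.normal(size=5)        # coefficients of f in the span of the k_{x_j}

K = gaussian_kernel(x, x)     # Gram matrix k(x_i, x_j)
Snf = K @ c                   # (S_n f)_i = f(x_i)
```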
17 Learning and restriction operator [Caponnetto, De Vito, R. 05]
$\langle f, k_x \rangle = f(x)$ $\rho$-almost surely $\;\Rightarrow\;$
Restriction operator: $S_\rho : \mathcal{H} \to L^2(\mathcal{X}, \rho)$, $(S_\rho f)(x) = \langle f, k_x \rangle$ almost surely.
$S_\rho f^\dagger = f_\rho$, where $f_\rho(x) = \int d\rho(y \mid x)\, y$ $\rho$-almost surely, and $L^2(\mathcal{X}, \rho) = \{ f \in \mathbb{R}^{\mathcal{X}} : \|f\|_\rho^2 = \int d\rho\, f(x)^2 < \infty \}$.
[Figure: the restriction $S_\rho f$ of $f$ to the support of $\rho$ on $\mathcal{X}$]
18 Learning as an inverse problem
Inverse problem: find $f$ solving $S_\rho f = f_\rho$, given $S_n$ and $y = (y_1, \dots, y_n)$.
19 Learning as an inverse problem (cont.)
Least squares: $\min_{\mathcal{H}} \|S_\rho f - f_\rho\|_\rho^2$, with $\|S_\rho f - f_\rho\|_\rho^2 = \mathbb{E}(f(X) - Y)^2 - \mathbb{E}(f_\rho(X) - Y)^2$.
20 Let's see what we got
- Noise model
- Integral operators & covariance operators
- Kernels
21 Noise model
Ideal: $S_\rho f = f_\rho$. Empirical: $S_n f = y$.
Normal equations: $S_\rho^* S_\rho f = S_\rho^* f_\rho$ and $S_n^* S_n f = S_n^* y$.
Noise model: $\|S_n^* y - S_\rho^* f_\rho\| \le \delta_1$, $\|S_\rho^* S_\rho - S_n^* S_n\| \le \delta_2$.
(cf. inverse-problem discretization, econometrics)
22 Integral and covariance operators
- Extension operator $S_\rho^* : L^2(\mathcal{X}, \rho) \to \mathcal{H}$: $(S_\rho^* f)(x') = \int d\rho(x)\, k(x', x) f(x)$, where $k(x, x') = \langle k_x, k_{x'} \rangle$ is positive definite.
- Covariance operator $S_\rho^* S_\rho : \mathcal{H} \to \mathcal{H}$: $S_\rho^* S_\rho = \int d\rho(x)\, k_x \otimes k_x$
23 Kernels
Choosing a RKHS implies choosing a representation.
Theorem (Moore-Aronszajn). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be positive definite; then the completion of $\{ f \in \mathbb{R}^{\mathcal{X}} : f = \sum_{j=1}^N c_j k_{x_j},\ c_1, \dots, c_N \in \mathbb{R},\ x_1, \dots, x_N \in \mathcal{X},\ N \in \mathbb{N} \}$ w.r.t. the inner product defined by $\langle k_x, k_{x'} \rangle = k(x, x')$ is a RKHS.
24 Kernels
If $K(x, x') = x^\top x'$, then:
- $S_n$ is the $n$-by-$D$ data matrix ($S_\rho$ an infinite data matrix)
- $S_n^* S_n$ and $S_\rho^* S_\rho$ are the empirical and true covariance operators
25 Kernels (cont.) Other kernels:
- $K(x, x') = (1 + x^\top x')^p$
- $K(x, x') = e^{-\|x - x'\|^2}$
- $K(x, x') = e^{-\|x - x'\|}$
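The kernels just listed can be sketched in a few lines of NumPy; each function below returns the Gram matrix $K[i,j] = K(x_i, z_j)$ (the parameter names `p` and `gamma` are illustrative additions, not the slides' notation):

```python
import numpy as np

# Illustrative implementations of the kernels on the slide above.

def linear(X, Z):
    return X @ Z.T                                  # K(x, x') = x^T x'

def polynomial(X, Z, p=2):
    return (1.0 + X @ Z.T) ** p                     # (1 + x^T x')^p

def _sqdist(X, Z):
    return ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)

def gaussian(X, Z, gamma=1.0):
    return np.exp(-gamma * _sqdist(X, Z))           # e^{-||x - x'||^2}

def laplacian(X, Z, gamma=1.0):
    return np.exp(-gamma * np.sqrt(_sqdist(X, Z)))  # e^{-||x - x'||}

X = np.random.default_rng(1).normal(size=(6, 3))
K_gauss = gaussian(X, X)  # a valid Gram matrix: symmetric, PSD, unit diagonal
```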
26 What now? Steal
27 Outline Learning theory 2000 Learning as an inverse problem Regularization Recent advances
28 Tikhonov aka ridge regression
$\hat{f}_n^{\lambda} = (S_n^* S_n + \lambda n I)^{-1} S_n^* y$
29 Tikhonov aka ridge regression (cont.)
$\hat{f}_n^{\lambda} = (S_n^* S_n + \lambda n I)^{-1} S_n^* y = S_n^* \big( \underbrace{S_n S_n^*}_{K_n} + \lambda n I \big)^{-1} y$, i.e. $\hat{f}_n^{\lambda} = S_n^* c$ with $(K_n + \lambda n I)\, c = y$.
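The dual form above is a single linear solve in the coefficients. A minimal sketch, assuming a Gaussian kernel and a toy 1-D regression problem (all names and parameter values here are illustrative, not the talk's):

```python
import numpy as np

# Tikhonov / kernel ridge regression in dual form:
# solve (K_n + lam * n * I) c = y, then predict f(x) = sum_i c_i k(x, x_i).

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(x, y, lam, gamma=1.0):
    n = len(x)
    K = gaussian_kernel(x, x, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)  # (K_n + lam n I) c = y

def krr_predict(x_train, c, x_test, gamma=1.0):
    return gaussian_kernel(x_test, x_train, gamma) @ c

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=50)

c = krr_fit(x, y, lam=1e-3)
y_hat = krr_predict(x, c, x)   # predictions at the training points
```

The $O(n^3)$ cost of the direct solve is exactly what the later slides on computational regularization set out to beat.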
30 Statistics
Theorem (Caponnetto, De Vito 05). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}((S_\rho^* S_\rho)^r)$, $1/2 < r < 1$. If $\lambda_n = n^{-\frac{1}{2r+1}}$, then $\mathbb{E}[\|S_\rho \hat{f}_n^{\lambda_n} - f_\rho\|_\rho^2] \lesssim n^{-\frac{2r}{2r+1}}$.
31 Statistics (cont.)
Proof idea: for all $\lambda > 0$, $\mathbb{E}[\|S_\rho \hat{f}_n^{\lambda} - f_\rho\|_\rho^2] \lesssim \frac{1}{\lambda}\,\delta(n) + \lambda^{2r}$, with $\mathbb{E}[\delta_1], \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}$.
32 Iterative regularization
From the Neumann series: $\hat{f}_n^t = \sum_{j=0}^{t-1} (I - S_n^* S_n)^j S_n^* y$
33 Iterative regularization (cont.)
$\hat{f}_n^t = \sum_{j=0}^{t-1} (I - S_n^* S_n)^j S_n^* y = S_n^* \sum_{j=0}^{t-1} (I - \underbrace{S_n S_n^*}_{K_n})^j y$
34 Iterative regularization (cont.)
... to gradient descent: $\hat{f}_n^t = \hat{f}_n^{t-1} - S_n^* (S_n \hat{f}_n^{t-1} - y)$, i.e. $c^t = c^{t-1} - (K_n c^{t-1} - y)$.
[Figure: test vs. training error as a function of the iteration $t$; early stopping picks the minimum of the test curve]
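The gradient iteration on the coefficients can be sketched directly; the explicit step size `tau` below is an addition for numerical convergence (the slides absorb it into the notation), and all other names are illustrative:

```python
import numpy as np

# Iterative regularization by gradient descent on the coefficients:
# c_t = c_{t-1} - tau * (K_n c_{t-1} - y).
# The number of steps t plays the role of the regularization parameter
# (early stopping); here we only track the training error.

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=40)

K = gaussian_kernel(x, x)
tau = 1.0 / np.linalg.norm(K, 2)   # step size <= 1/largest eigenvalue
c = np.zeros(40)
errors = []
for t in range(200):
    c = c - tau * (K @ c - y)      # one gradient step on the coefficients
    errors.append(np.mean((K @ c - y) ** 2))
# the training error decreases monotonically with t; on held-out data it
# would eventually turn back up, which is what early stopping exploits
```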
35 Iterative regularization: statistics
Theorem (Bauer, Pereverzev, R. 07). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}((S_\rho^* S_\rho)^r)$, $1/2 < r < 1$. If $t_n = n^{\frac{1}{2r+1}}$, then $\mathbb{E}[\|S_\rho \hat{f}_n^{t_n} - f_\rho\|_\rho^2] \lesssim n^{-\frac{2r}{2r+1}}$.
36 Iterative regularization: statistics (cont.)
Proof idea: for all $t > 0$, $\mathbb{E}[\|S_\rho \hat{f}_n^{t} - f_\rho\|_\rho^2] \lesssim t\,\delta(n) + \frac{1}{t^{2r}}$, with $\mathbb{E}[\delta_1], \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}$.
37 Tikhonov vs iterative regularization
- Same statistical properties...
- ... but time complexities are different: $O(n^3)$ vs $O(n^2 \cdot n^{\frac{1}{2r+1}})$
- Iterative regularization provides a bridge between statistics and computations.
- Kernel methods become a test bed for algorithmic solutions.
38 Computational regularization
Tikhonov: time $O(n^3)$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
39 Computational regularization (cont.)
Iterative regularization: time $O(n^2 \sqrt{n})$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
40 Outline Learning theory 2000 Learning as an inverse problem Regularization Recent advances
41 Steal from optimization
Acceleration:
- Conjugate gradient [Blanchard, Krämer 10]
- Chebyshev method [Bauer, Pereverzev, R. 07]
- Nesterov acceleration [Nesterov 83]
Stochastic gradient:
- Single-pass stochastic gradient [Salzo, R. 18]
- Multi-pass incremental gradient [Tarres, Yao 05; Pontil, Ying 09; Bach, Dieuleveut, Flammarion 17]
- Multi-pass stochastic gradient with mini-batches [Villa, R. 15; Lin, R. 16]
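A single-pass stochastic gradient scheme in the RKHS can be sketched as follows: each example is visited once and contributes one coefficient, $c_i = -\eta_i (f_{i-1}(x_i) - y_i)$. The decaying step-size schedule and Gaussian kernel below are illustrative choices, not the precise scheme from the cited papers:

```python
import numpy as np

# Single-pass stochastic gradient in an RKHS: the iterate
# f_i = sum_{j <= i} c_j k_{x_j} gains one coefficient per example.
# Evaluating f_{i-1}(x_i) costs O(i), giving O(n^2) total time.

def k(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=200)

coef, support = [], []
for i in range(200):
    f_xi = sum(c * k(xs, x[i]) for c, xs in zip(coef, support))
    eta = 0.5 / (1 + i) ** 0.5               # decaying step size (our choice)
    coef.append(-eta * (f_xi - y[i]))        # one stochastic gradient step
    support.append(x[i])

# evaluate the final predictor on the training inputs
pred = np.array([sum(c * k(xs, xi) for c, xs in zip(coef, support)) for xi in x])
```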
42 Computational regularization
Iterative regularization: time $O(n^2 \sqrt{n})$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
Stochastic iterative regularization: time $O(n^2)$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
43 Can we do better? How about memory?
44 Regularization with projection and preconditioning [Halko, Martinsson, Tropp 09]
Projected (Nyström) problem: $(K_{nM}^\top K_{nM} + \lambda n K_{MM})\, c = K_{nM}^\top y$
Preconditioner: $B B^\top = \left( \frac{n}{M} K_{MM}^2 + \lambda n K_{MM} \right)^{-1}$
FALKON [Rudi, Carratino, R. 17], see also [Ma, Belkin 17]: preconditioned iterations $\beta_t = \beta_{t-1} - \frac{\gamma}{n} B^\top \left( K_{nM}^\top (K_{nM} B \beta_{t-1} - y) + \lambda n K_{MM} B \beta_{t-1} \right)$, with $c_t = B \beta_t$.
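The projected Nyström system can be sketched without FALKON's preconditioned conjugate-gradient solver by forming $K_{nM}$ and $K_{MM}$ and solving directly; the tiny jitter on $K_{MM}$ and all parameter values below are our numerical-safeguard additions, not the paper's:

```python
import numpy as np

# Nystrom-projected kernel ridge regression: pick M centers, then solve
# (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y for the M coefficients.

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, M, lam = 300, 30, 1e-4
x = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=n)

centers = x[rng.choice(n, size=M, replace=False)]   # Nystrom subsample
KnM = gaussian_kernel(x, centers)                   # n-by-M
KMM = gaussian_kernel(centers, centers)             # M-by-M

A = KnM.T @ KnM + lam * n * (KMM + 1e-10 * np.eye(M))  # jitter for stability
c = np.linalg.solve(A, KnM.T @ y)
y_hat = KnM @ c                                     # predictions at training x
```

Only $M \ll n$ coefficients are stored, which is where the memory savings of the following slides come from.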
45 FALKON: statistics
Theorem (Rudi, Carratino, R. 17). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}((S_\rho^* S_\rho)^r)$, $1/2 < r < 1$. If $\lambda_n = n^{-\frac{1}{2r+1}}$, $M_n = n^{\frac{1}{2r+1}}$, $t_n = \log n$, then $\mathbb{E}[\|S_\rho \hat{f}_{n, t_n, M_n} - f_\rho\|_\rho^2] \lesssim n^{-\frac{2r}{2r+1}}$.
46 Computational regularization
Stochastic iterative regularization: time $O(n^2)$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
FALKON: time $\tilde{O}(n \sqrt{n})$ + space $O(n \sqrt{n})$ for a $1/\sqrt{n}$ learning bound.
47 Some results
[Table: benchmarks on MillionSongs (MSE, relative error, time in seconds), YELP (RMSE, time in minutes), and TIMIT (classification error %, time in hours), comparing FALKON, Prec. KRR, Hierarchical, D&C, Random Features, Nyström, ADMM R.F., BCD R.F., BCD Nyström, KRR, EigenPro, Deep NN, Sparse Kernels, and Ensemble; the numeric entries are not recoverable from the transcription.]
48 Conclusions
Contributions:
- Learning as an inverse problem
- Computational regularization: statistics meets numerics
Future work:
- Scaling things up...
- Regularization with projections (quadrature, Galerkin methods)
- Connections to PDEs/integral equations: exploit more structure
- Structured prediction / deep learning
- Semi-supervised / unsupervised learning
- Embeddings and compressed learning
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationStochastic Gradient Descent
Stochastic Gradient Descent Weihang Chen, Xingchen Chen, Jinxiu Liang, Cheng Xu, Zehao Chen and Donglin He March 26, 2017 Outline What is Stochastic Gradient Descent Comparison between BGD and SGD Analysis
More informationTutorial on Machine Learning for Advanced Electronics
Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling
More informationFinite-sum Composition Optimization via Variance Reduced Gradient Descent
Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu
More informationIntroductory Machine Learning Notes 1. Lorenzo Rosasco
Introductory Machine Learning Notes 1 Lorenzo Rosasco DIBRIS, Universita degli Studi di Genova LCSL, Massachusetts Institute of Technology and Istituto Italiano di Tecnologia lrosasco@mit.edu October 10,
More informationStructured Prediction
Structured Prediction Ningshan Zhang Advanced Machine Learning, Spring 2016 Outline Ensemble Methods for Structured Prediction[1] On-line learning Boosting AGeneralizedKernelApproachtoStructuredOutputLearning[2]
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationLearning sets and subspaces: a spectral approach
Learning sets and subspaces: a spectral approach Alessandro Rudi DIBRIS, Università di Genova Optimization and dynamical processes in Statistical learning and inverse problems Sept 8-12, 2014 A world of
More informationBayesian Interpretations of Regularization
Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More information