Online Gradient Descent Learning Algorithms


1  DISI, Genova, December 2006

Online Gradient Descent Learning Algorithms

Yiming Ying (joint work with Massimiliano Pontil)
Department of Computer Science, University College London

2  Outline

- Introduction: general learning setting; the online gradient descent algorithm
- Main results: generalization error; implications (consistency); error rates
- Discussions and comparisons
- Conclusions and questions

3  Introduction: the learning theory model

- Input sample space $X$: a subset of Euclidean space $\mathbb{R}^d$.
- Output label space $Y$: a subset of $\mathbb{R}$.
- Distribution $\rho$ on $X \times Y$: $\rho(x, y) = \rho_X(x)\,\rho(y \mid x)$.
- Loss function: $L(f(x), y) = (y - f(x))^2$.
- Statistical assumption: the labelled sample sequence $S = \{z_j = (x_j, y_j) : j = 1, 2, \ldots\}$ is drawn independently and identically distributed according to $\rho$.

4  Goal of learning

Given the sample $S$, find a function $f$ in a suitable hypothesis space such that the true error
$$\mathcal{E}(f) := \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y)$$
is close to the smallest true error $\mathcal{E}(f_\rho)$, where $f_\rho$ is the regression function
$$f_\rho(x) := \int_Y y \, d\rho(y \mid x), \qquad \mathcal{E}(f_\rho) = \inf\{\mathcal{E}(f) : f \colon X \to \mathbb{R}\}.$$

5  Note that
$$\|f - f_\rho\|_{\rho_X}^2 = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X(x) = \mathcal{E}(f) - \mathcal{E}(f_\rho).$$
Equivalent approximation problem: find an approximator $f$ in a hypothesis space such that $\|f - f_\rho\|_\rho^2$ is small.

Hypothesis space assumption
- Hypothesis space: a reproducing kernel Hilbert space $H_K$ (RKHS).
- Gaussian kernel: $K(x, x') = e^{-\sigma \|x - x'\|^2}$.
- Polynomial kernel: $K(x, x') = (1 + \langle x, x' \rangle)^n$.
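The slides contain no code; as a small illustration only, here is a minimal Python sketch of the two kernels above and of the Gram matrix they induce on a sample. NumPy and the parameter names `sigma` and `degree` are assumptions of the example, not part of the talk.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-sigma * ||x - x'||^2)."""
    return np.exp(-sigma * np.sum((x - xp) ** 2))

def polynomial_kernel(x, xp, degree=3):
    """K(x, x') = (1 + <x, x'>)^n."""
    return (1.0 + np.dot(x, xp)) ** degree

def gram_matrix(X, kernel, **kwargs):
    """Gram matrix K_ij = K(x_i, x_j) for a sample X of shape (t, d)."""
    t = X.shape[0]
    K = np.empty((t, t))
    for i in range(t):
        for j in range(t):
            K[i, j] = kernel(X[i], X[j], **kwargs)
    return K
```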

6  Batch learning algorithms

Use the whole data set $S_t = \{z_1, \ldots, z_t\}$ at one time.

Tikhonov regularization (Cucker and Smale; Evgeniou-Pontil-Poggio; Smale-Zhou; De Vito, Verri et al., from different perspectives: regularization networks, approximation theory, inverse problems, etc.):
$$f_{S_t, \lambda} = \arg\min_{f \in H_K} \frac{1}{t} \sum_{j=1}^{t} (y_j - f(x_j))^2 + \lambda \|f\|_K^2, \qquad \lambda > 0.$$

Gradient descent boosting (Yao, Rosasco and Caponnetto), with an early stopping rule instead of a regularization term in $H_K$ (a sketch follows below):
$$f_{k+1} = f_k - \frac{\eta_k}{t} \sum_{j=1}^{t} (f_k(x_j) - y_j) K_{x_j}.$$
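A hedged sketch of the gradient descent boosting iteration above, written in the coefficient representation $f_k = \sum_j c_j K_{x_j}$, so that $f_k(x_i) = (Kc)_i$. The constant step size `eta` and stopping time `T` are illustrative choices, not the talk's tuned constants; as the slide notes, early stopping plays the role of the regularization term.

```python
import numpy as np

def batch_gradient_descent(K, y, eta=0.5, T=100):
    """Run T steps of f_{k+1} = f_k - (eta/t) * sum_j (f_k(x_j) - y_j) K_{x_j}."""
    t = len(y)
    c = np.zeros(t)                     # coefficients of f_k in span{K_{x_j}}
    for _ in range(T):
        residual = K @ c - y            # (f_k(x_j) - y_j) for j = 1..t
        c -= (eta / t) * residual       # one full-sample gradient step
    return c                            # f_T(x) = sum_j c[j] * K(x_j, x)
```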

7  Stochastic online learning in an RKHS

Use the data one by one.

Online regularized learning algorithm (Kivinen, Smola and Williamson; Smale and Yao; Ying-Zhou et al.):
$$f_{j+1} = f_j - \eta_j \bigl( (f_j(x_j) - y_j) K_{x_j} + \lambda f_j \bigr), \qquad j \in \mathbb{N}, \quad \text{e.g. } f_1 = 0,$$
where $\lambda > 0$ is the regularization parameter and $\{\eta_j\}$ are the step sizes (learning rates).

8  The online algorithm studied here
$$f_{j+1} = f_j - \eta_j (f_j(x_j) - y_j) K_{x_j}, \qquad j \in \mathbb{N}, \quad \text{e.g. } f_1 = 0. \tag{1}$$
Two types of step sizes: $\{\eta_j : j \in \mathbb{N}\}$ a universal sequence, or $\{\eta_j = \eta(t) : j = 1, \ldots, t\}$ depending on the sample number $t$.

Our analysis purposes for the above algorithm:
- Stochastic generalization error bounds for $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr]$ in terms of the step sizes and the approximation property of $H_K$.
- The choice of step sizes that guarantees (weak) consistency: $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \to \inf_{f \in H_K} \|f - f_\rho\|_\rho^2$ as $t \to \infty$.
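A minimal Python sketch of algorithm (1): one pass over the data, one kernel coefficient appended per example; setting `lam > 0` recovers the regularized variant of the previous slide. The kernel and the step-size schedule `eta_of` are left as inputs and are assumptions of the example, not prescribed by the slides.

```python
def online_gradient_descent(X, y, kernel, eta_of, lam=0.0):
    """f_{j+1} = f_j - eta_j * ((f_j(x_j) - y_j) K_{x_j} + lam * f_j); lam = 0 gives (1)."""
    centers, coeffs = [], []            # f_1 = 0: empty kernel expansion
    for j, (xj, yj) in enumerate(zip(X, y), start=1):
        eta = eta_of(j)
        fj_xj = sum(c * kernel(xc, xj) for c, xc in zip(coeffs, centers))  # f_j(x_j)
        coeffs = [(1.0 - eta * lam) * c for c in coeffs]  # shrinkage from the lam * f_j term
        centers.append(xj)
        coeffs.append(-eta * (fj_xj - yj))                # new coefficient on K_{x_j}
    def f(x):                                             # the output hypothesis f_{t+1}
        return sum(c * kernel(xc, x) for c, xc in zip(coeffs, centers))
    return f
```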

9  Type I: step sizes form a universal sequence

Main results: generalization error. Define the K-functional
$$\mathcal{K}(s, f_\rho) := \inf_{f \in H_K} \bigl\{ \|f - f_\rho\|_\rho + s \|f\|_K \bigr\}, \qquad s > 0.$$

Theorem 1. Let $\theta \in (0, 1)$ and $\eta_j = \frac{1}{\mu} j^{-\theta}$ for all $j$, with some constant $\mu \ge \mu(\theta)$. Then, for any $t$,
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \le \bigl[\mathcal{K}\bigl(b_{\theta,\mu}\, t^{-(1-\theta)/2}, f_\rho\bigr)\bigr]^2 + O\bigl(t^{-\min\{\theta, 1-\theta\}} \ln t\bigr).$$

Implication for consistency: the K-functional $\mathcal{K}(\cdot, f_\rho)$ is non-decreasing, concave, and $\lim_{s \to 0^+} \mathcal{K}(s, f_\rho) = \inf_{f \in H_K} \|f - f_\rho\|_\rho$, which gives consistency:
$$\lim_{t \to \infty} \mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = \inf_{f \in H_K} \|f - f_\rho\|_\rho^2.$$

10  Error rates

We assume that $f_\rho$ has some smoothness. Define $L_K \colon L^2_{\rho_X} \to L^2_{\rho_X}$ by
$$L_K f(x) = \int_X K(x, y) f(y) \, d\rho_X(y), \qquad x \in X, \ f \in L^2_{\rho_X}.$$
The fractional range space $L_K^{\beta}(L^2_{\rho_X})$ is the range space of $L_K^{\beta}$.

Theorem 2. Let $\theta \in (0, 1)$ and let $\mu(\theta)$ be an absolute constant depending on $\theta$. If $f_\rho \in L_K^{\beta}(L^2_{\rho_X})$ with some $0 < \beta \le 1/2$ then, by selecting $\eta_j = \frac{1}{\mu(\frac{2\beta}{2\beta+1})}\, j^{-\frac{2\beta}{2\beta+1}}$ for all $j$, for any $t$ there holds
$$\mathbb{E}_{Z^t}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr). \tag{2}$$
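The Type I schedule of Theorem 2 is easy to instantiate; here is a sketch, with `mu` standing in for the unspecified constant from the theorem. Only the decay exponent $2\beta/(2\beta+1)$ is dictated by the slide; the numerical values below are illustrative.

```python
def type_one_step_size(beta, mu):
    """Polynomially decaying schedule eta_j = j^{-theta} / mu, theta = 2*beta/(2*beta+1)."""
    theta = 2.0 * beta / (2.0 * beta + 1.0)      # decay exponent matched to smoothness beta
    return lambda j: (1.0 / mu) * j ** (-theta)

# Example: beta = 1/2 gives theta = 1/2 and the rate O(t^{-1/2} ln t) of (2).
eta_of = type_one_step_size(beta=0.5, mu=4.0)
```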

11  Type II: step sizes depending on the sample number

Generalization error.

Theorem 3. Let $\eta_j = \eta$ for all $j$. Then
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \le \bigl[\mathcal{K}\bigl((\eta t)^{-\frac{1}{2}}, f_\rho\bigr)\bigr]^2 + O(\eta \ln t).$$

Rule of early stopping: trade off $\mathcal{K}\bigl((\eta t)^{-\frac{1}{2}}, f_\rho\bigr)$ against $O(\eta \ln t)$, which yields a stopping rule $t = t(\eta)$ ensuring the bound tends to zero as $\eta \to 0^+$. Equivalently, from the perspective of choosing step sizes, $\eta = \eta(t)$.

12  Implication for (weak) consistency: for step sizes depending on the sample number $t$,
$$\lim_{t \to \infty} \eta(t) \ln t = 0 \quad \text{and} \quad \lim_{t \to \infty} t\, \eta(t) = \infty \ \Longrightarrow \ \text{consistency}.$$

Error rates.

Theorem 4. Let $\eta_j = \eta$ for $j = 1, 2, \ldots, t$. If $f_\rho \in L_K^{\beta}(L^2_{\rho_X})$ for some $\beta > 0$ then, by choosing $\eta := \frac{2\beta}{64(1+\kappa)^4 (2\beta+1)}\, t^{-\frac{2\beta}{2\beta+1}}$, we have that
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr).$$
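Similarly, a sketch of the constant (Type II) step size of Theorem 4 together with the two weak-consistency conditions of this slide. The multiplicative constant `C` is an illustrative placeholder for the slide's expression in $\beta$ and $\kappa$; only the scaling $\eta(t) \sim t^{-2\beta/(2\beta+1)}$ is taken from the theorem.

```python
import math

def type_two_step_size(t, beta, C=0.01):
    """Constant step size eta(t) for a sample of size t, scaled as in Theorem 4."""
    return C * t ** (-2.0 * beta / (2.0 * beta + 1.0))

def consistency_conditions(eta_of_t, t):
    """Evaluate eta(t)*ln t (should be small) and t*eta(t) (should be large) at a given t."""
    eta = eta_of_t(t)
    return eta * math.log(t), t * eta

# Example check at t = 10^6 with beta = 1/2:
print(consistency_conditions(lambda s: type_two_step_size(s, beta=0.5), 10**6))
```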

13  Discussions and comparisons

Comparisons are based on the same assumption $f_\rho \in L_K^{\beta}(L^2_{\rho_X})$.

Our error rates for the online gradient descent algorithm (1):
- (I) $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta \in (0, \tfrac{1}{2}]$ for $\{\eta_j, j \in \mathbb{N}\}$ a universal sequence.
- (II) $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta > 0$ for $\{\eta_j = \eta(t) : j = 1, \ldots, t\}$ depending on the sample number.

Batch Tikhonov regularization: $O\bigl(t^{-\frac{2\beta}{2\beta+1}}\bigr)$ with $\beta \in (0, 1]$ (Zhang; Smale and Zhou).

14  Discussions continued

Online regularized algorithm, choosing $\lambda = \lambda(t) > 0$ appropriately:
- Yao and Smale; Ying-Zhou: $O\bigl(t^{-\frac{2\beta}{2\beta+2}} \ln t\bigr)$ with $\beta \in (0, 1]$.
- Pontil and Ying: $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta \in (0, 1]$.

The rate $O\bigl(t^{-\frac{2\beta}{2\beta+1}}\bigr)$ is capacity-independent (eigenvalue-independent) optimal: the only assumption is on $f_\rho$, with no assumption on the decay of the eigenvalues of $L_K$, as implied by Caponnetto and De Vito.

15  Ideas of the proof

Three main steps. Error decomposition: rewrite the online algorithm (1) as
$$f_{j+1} - f_\rho = (I - \eta_j L_K)(f_j - f_\rho) + \eta_j \bigl( L_K(f_j - f_\rho) + (y_j - f_j(x_j)) K_{x_j} \bigr),$$
where $I$ is the identity operator. Define $B(f_j, z_j) := L_K(f_j - f_\rho) + (y_j - f_j(x_j)) K_{x_j}$; then $\mathbb{E}_{z_j}\bigl[B(f_j, z_j)\bigr] = 0$, since $\mathbb{E}_{z_j}\bigl[(y_j - f_j(x_j)) K_{x_j}\bigr] = L_K(f_\rho - f_j)$.

Set $\omega_k^t(L_K) := \prod_{j=k}^{t} (I - \eta_j L_K)$ and $\omega_{t+1}^t(L_K) := I$. Then
$$f_{t+1} - f_\rho = -\,\omega_1^t(L_K) f_\rho + \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K)\, B(f_j, z_j).$$

16  Proof continued

Proposition 1. For any $t$, $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr]$ is bounded by
$$\underbrace{\|\omega_1^t(L_K) f_\rho\|_\rho^2}_{\text{approximation error}} \;+\; \underbrace{2(1+\kappa)^4 \sum_{k=1}^{t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]\, \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr)}_{\text{cumulative sample error}}.$$

Remark 1. The standard cumulative loss $\sum_{k=1}^{t} (y_k - f_k(x_k))^2$ has been extensively studied in the online learning community: Cesa-Bianchi, Warmuth, Smola, et al. Here a weighted cumulative loss appears instead: $\sum_{k=1}^{t} (y_k - f_k(x_k))^2\, \eta_k^2 \big/ \bigl( \sum_{j=k+1}^{t} \eta_j + 1 \bigr)$.

17  Sketch of proof for Proposition 1

$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = \|\omega_1^t(L_K) f_\rho\|_\rho^2 + \mathbb{E}\Bigl[\Bigl\| \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K)\, B(f_j, z_j) \Bigr\|_\rho^2\Bigr] - 2\, \underbrace{\mathbb{E}\Bigl[\Bigl\langle \omega_1^t(L_K) f_\rho,\ \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K)\, B(f_j, z_j) \Bigr\rangle_\rho\Bigr]}_{\text{zero, since } \mathbb{E}_{z_j}[B(f_j, z_j)] = 0}.$$

$$\mathbb{E}_{Z^t}\Bigl[\Bigl\| \sum_{k \le t} \eta_k\, \omega_{k+1}^t(L_K)\, B(f_k, z_k) \Bigr\|_\rho^2\Bigr] = \sum_{k \le t} \eta_k^2\, \mathbb{E}_{Z^k}\bigl[\|\omega_{k+1}^t(L_K)\, B(f_k, z_k)\|_\rho^2\bigr] \le \sum_{k \le t} \eta_k^2\, \bigl\|\omega_{k+1}^t(L_K) L_K^{\frac{1}{2}}\bigr\|^2\, \mathbb{E}_{Z^k}\bigl[\|B(f_k, z_k)\|_K^2\bigr],$$

$$\mathbb{E}_{z_k}\bigl[\|B(f_k, z_k)\|_K^2\bigr] \le c\, \mathcal{E}(f_k), \qquad \bigl\|\omega_{k+1}^t(L_K) L_K^{\frac{1}{2}}\bigr\|^2 \le 2(1+\kappa)^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr).$$

18  Proof continued

Approximation error: for any $f \in H_K$,
$$\|\omega_1^t(L_K) f_\rho\|_\rho \le \|\omega_1^t(L_K)(f - f_\rho)\|_\rho + \|\omega_1^t(L_K) f\|_\rho \le \|f - f_\rho\|_\rho + \bigl\|\omega_1^t(L_K) L_K^{\frac{1}{2}}\bigr\|\, \bigl\|L_K^{-\frac{1}{2}} f\bigr\|_\rho \le \|f - f_\rho\|_\rho + 2(1+\kappa) \Bigl( \sum_{k=1}^{t} \eta_k + 1 \Bigr)^{-\frac{1}{2}} \|f\|_K.$$

K-functional: taking the infimum over $f \in H_K$,
$$\|\omega_1^t(L_K) f_\rho\|_\rho \le \mathcal{K}\Bigl( 2(1+\kappa) \Bigl( \sum_{k=1}^{t} \eta_k + 1 \Bigr)^{-\frac{1}{2}},\ f_\rho \Bigr).$$

19  Proof continued

Cumulative sample error:
$$\sum_{k=1}^{t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]\, \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr) \le \Bigl( \sup_{k=1,\ldots,t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr] \Bigr) \sum_{k=1}^{t} \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr).$$

- Uniform bound for $\mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]$: obtained using $\|f_\rho\|_\rho$ and $\mathcal{E}(f_\rho)$.
- Estimate of $\sum_{k=1}^{t} \eta_k^2 \big/ \bigl( \sum_{j=k+1}^{t} \eta_j + 1 \bigr)$: for instance, for $\eta_j = O(j^{-\theta})$ with $\theta \in (0, 1)$,
$$\sum_{k=1}^{t} \eta_k^2 \Big/ \Bigl( \sum_{j=k+1}^{t} \eta_j + 1 \Bigr) = O\bigl( t^{-\min\{\theta, 1-\theta\}} \ln t \bigr).$$
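The last estimate is easy to check numerically. Below is a small sketch (not part of the proof) that computes the weighted sum for $\eta_j = j^{-\theta}$ and compares it with $t^{-\min\{\theta, 1-\theta\}} \ln t$; the specific values of $\theta$ and $t$ are illustrative.

```python
import math

def weighted_sum(t, theta):
    """Compute sum_{k=1}^t eta_k^2 / (sum_{j=k+1}^t eta_j + 1) for eta_j = j^{-theta}."""
    eta = [j ** (-theta) for j in range(1, t + 1)]
    tail = 0.0      # running value of sum_{j>k} eta_j
    total = 0.0
    for k in range(t, 0, -1):
        total += eta[k - 1] ** 2 / (tail + 1.0)
        tail += eta[k - 1]
    return total

theta = 0.5
for t in (10**3, 10**4, 10**5):
    bound = t ** (-min(theta, 1 - theta)) * math.log(t)
    print(t, weighted_sum(t, theta), bound)   # the ratio should remain bounded as t grows
```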

20  Conclusions

The online gradient descent algorithm is a simple yet competitive algorithm:
- It is statistically consistent (for both types of step sizes).
- Its error rates are essentially the same as those of classical batch regularized learning and online regularized learning algorithms.
- It achieves optimal capacity-independent error rates.

21  Questions

- $H_K$-norm error $\|f_{t+1} - f_\rho\|_K$ with the universal polynomially decaying step sizes.
- Probability inequality estimates and almost sure convergence (strong consistency).
- Generalization error analysis with non-i.i.d. data.
- Can we directly use standard cumulative error bounds to bound the generalization error?

22  Thank you! Merry Christmas and a Happy New Year! (Grazie! Buon Natale e Felice Anno Nuovo!)
