Online Gradient Descent Learning Algorithms
1 DISI, Genova, December 2006
Online Gradient Descent Learning Algorithms
Yiming Ying (joint work with Massimiliano Pontil)
Department of Computer Science, University College London
2 Outline
- Introduction: general learning setting; online gradient descent algorithm
- Main results: generalization error; implications for consistency; error rates
- Discussions and comparisons
- Conclusions and questions
3 Introduction: learning theory model
- Input sample space $X$: a subset of Euclidean space $\mathbb{R}^d$.
- Output label space $Y$: a subset of $\mathbb{R}$.
- Distribution $\rho$ on $X \times Y$: $\rho(x, y) = \rho_X(x)\,\rho(y|x)$.
- Loss function: $L(f(x), y) = (y - f(x))^2$.
- Statistical assumption: the labeled sample sequence $S = \{z_j = (x_j, y_j) : j = 1, 2, \dots\}$ is independent and identically distributed according to $\rho$.
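4 Goal of learning
Given the sample $S$, find a function $f$ in a suitable hypothesis space such that the true error
$$\mathcal{E}(f) := \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y)$$
is close to the smallest true error $\mathcal{E}(f_\rho)$, where $f_\rho$ is the regression function
$$f_\rho(x) := \int_Y y \, d\rho(y|x), \qquad \mathcal{E}(f_\rho) = \inf_{f : X \to \mathbb{R}} \mathcal{E}(f).$$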
5 Note that
$$\|f - f_\rho\|_\rho^2 = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X(x) = \mathcal{E}(f) - \mathcal{E}(f_\rho).$$
Equivalent approximation problem: find an approximator $f$ in a hypothesis space such that $\|f - f_\rho\|_\rho^2$ is small.
Hypothesis space assumption: the hypothesis space is a reproducing kernel Hilbert space (RKHS) $H_K$.
- Gaussian kernel: $K(x, x') = e^{-\sigma \|x - x'\|^2}$.
- Polynomial kernel: $K(x, x') = (1 + \langle x, x' \rangle)^n$.
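The two kernels named on this slide can be written down directly. The sketch below is only an illustration: the function names, the default parameter values, and the choice to represent points as plain Python sequences are assumptions, not part of the talk.

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian kernel K(x, x') = exp(-sigma * ||x - x'||^2), with sigma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sigma * sq_dist)

def polynomial_kernel(x, xp, n=2):
    """Polynomial kernel K(x, x') = (1 + <x, x'>)^n for a positive integer n."""
    inner = sum(a * b for a, b in zip(x, xp))
    return (1.0 + inner) ** n
```

Both functions are symmetric in their arguments, as a kernel must be; for example `polynomial_kernel([1.0, 2.0], [3.0, 4.0])` evaluates $(1 + 11)^2$.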
6 Batch learning algorithm
Use the whole data set $S_t = \{z_1, \dots, z_t\}$ at one time.
- Tikhonov regularization (Cucker and Smale; Evgeniou-Pontil-Poggio; Smale-Zhou; De Vito and Verri et al., from different perspectives: regularization networks, approximation theory, inverse problems, etc.):
$$f_{S_t,\lambda} = \arg\min_{f \in H_K} \Bigl\{ \frac{1}{t} \sum_{j=1}^{t} (y_j - f(x_j))^2 + \lambda \|f\|_K^2 \Bigr\}, \qquad \lambda > 0.$$
- Gradient descent boosting (Yao, Rosasco and Caponnetto):
$$f_{k+1} = f_k - \frac{\eta_k}{t} \sum_{j=1}^{t} (f_k(x_j) - y_j) K_{x_j}.$$
An early stopping rule is used instead of a regularization term in $H_K$.
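The batch gradient descent iteration can be sketched in a few lines if one assumes, as is standard, that $K_{x_j} = K(x_j, \cdot)$ and that every iterate is stored through its coefficients in $f_k = \sum_j c_j K_{x_j}$. The scalar inputs, the constant step size, the toy data, and all names below are illustrative assumptions rather than the setup of the cited papers.

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    # Scalar-input Gaussian kernel, used here only as an example kernel.
    return math.exp(-sigma * (x - xp) ** 2)

def batch_gradient_descent(xs, ys, kernel, eta, n_iter):
    """Iterate f_{k+1} = f_k - (eta/t) * sum_j (f_k(x_j) - y_j) K_{x_j} from f = 0.

    Returns the coefficients c_j of f = sum_j c_j K_{x_j}."""
    t = len(xs)
    gram = [[kernel(xi, xj) for xj in xs] for xi in xs]
    coef = [0.0] * t
    for _ in range(n_iter):
        preds = [sum(gram[i][j] * coef[j] for j in range(t)) for i in range(t)]
        # The update only changes the coefficient of K_{x_j} by the j-th residual.
        coef = [coef[j] - (eta / t) * (preds[j] - ys[j]) for j in range(t)]
    return coef

def empirical_error(xs, ys, kernel, coef):
    """Mean squared error of f = sum_j coef[j] K_{x_j} on the sample."""
    t = len(xs)
    total = 0.0
    for i in range(t):
        pred = sum(coef[j] * kernel(xs[j], xs[i]) for j in range(t))
        total += (pred - ys[i]) ** 2
    return total / t

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]
```

The number of iterations `n_iter` plays the role of the regularization parameter here: stopping early keeps the iterate from fitting the sample too closely.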
7 Stochastic online learning in RKHS
Use the data one by one.
Online regularized learning algorithm:
$$f_{j+1} = f_j - \eta_j \bigl( (f_j(x_j) - y_j) K_{x_j} + \lambda f_j \bigr), \qquad j \in \mathbb{N}, \quad \text{e.g. } f_1 = 0,$$
where $\lambda > 0$ is the regularization parameter and $\{\eta_j\}$ are the step sizes (learning rates).
Kivinen, Smola and Williamson; Smale and Yao; Ying-Zhou et al.
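A minimal sketch of the online regularized update, assuming $K_{x_j} = K(x_j, \cdot)$ and storing $f_j$ as a kernel expansion over the points seen so far. The representation, the scalar inputs, and the function names are illustrative assumptions, not taken from the cited works.

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    # Scalar-input Gaussian kernel, used here only as an example kernel.
    return math.exp(-sigma * (x - xp) ** 2)

def online_regularized(samples, kernel, step_size, lam):
    """Run f_{j+1} = f_j - eta_j * ((f_j(x_j) - y_j) K_{x_j} + lam * f_j) from f_1 = 0.

    Returns (centers, coef) with f_{t+1} = sum_i coef[i] * K(centers[i], .)."""
    centers, coef = [], []
    for j, (x, y) in enumerate(samples, start=1):
        eta = step_size(j)
        pred = sum(c * kernel(xc, x) for c, xc in zip(coef, centers))  # f_j(x_j)
        # The term -eta * lam * f_j shrinks every existing coefficient.
        coef = [(1.0 - eta * lam) * c for c in coef]
        # The term -eta * (f_j(x_j) - y_j) K_{x_j} adds one new expansion term.
        centers.append(x)
        coef.append(-eta * (pred - y))
    return centers, coef
```

Setting `lam=0` removes the shrinkage step and leaves a pure gradient step on the squared loss at the current sample.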
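8 The online algorithm studied here
$$f_{j+1} = f_j - \eta_j (f_j(x_j) - y_j) K_{x_j}, \qquad j \in \mathbb{N}, \quad \text{e.g. } f_1 = 0. \tag{1}$$
Two types of step sizes:
- $\{\eta_j, j \in \mathbb{N}\}$: a universal sequence.
- $\{\eta_j = \eta(t) : j = 1, \dots, t\}$, where $t$ is the sample number.
Purpose of our analysis of the above algorithm:
- Stochastic generalization error bounds for $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr]$ in terms of the step sizes and the approximation property of $H_K$.
- The choice of step sizes that guarantees (weak) consistency:
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \to \inf_{f \in H_K} \|f - f_\rho\|_\rho^2 \quad \text{as } t \to \infty.$$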
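Update (1) can be sketched as follows, again assuming $K_{x_j} = K(x_j, \cdot)$ and a kernel-expansion representation of the iterates. The helper names `universal_steps` and `constant_steps`, the constants passed to them, and the noisy sine data are hypothetical choices made only to illustrate the two step-size types.

```python
import math
import random

def gaussian_kernel(x, xp, sigma=1.0):
    # Scalar-input Gaussian kernel, used here only as an example kernel.
    return math.exp(-sigma * (x - xp) ** 2)

def online_gradient_descent(samples, kernel, step_size):
    """Run f_{j+1} = f_j - eta_j * (f_j(x_j) - y_j) K_{x_j} from f_1 = 0.

    Returns (centers, coef) with f_{t+1} = sum_j coef[j] * K(centers[j], .)."""
    centers, coef = [], []
    for j, (x, y) in enumerate(samples, start=1):
        pred = sum(a * kernel(xc, x) for a, xc in zip(coef, centers))  # f_j(x_j)
        centers.append(x)
        coef.append(-step_size(j) * (pred - y))
    return centers, coef

def predict(centers, coef, kernel, x):
    """Evaluate the kernel expansion at x."""
    return sum(a * kernel(xc, x) for a, xc in zip(coef, centers))

def universal_steps(theta, mu):
    """Type I: eta_j = j^(-theta) / mu, independent of the sample size."""
    return lambda j: j ** (-theta) / mu

def constant_steps(eta):
    """Type II: eta_j = eta for all j, with eta chosen from the sample size t."""
    return lambda j: eta

if __name__ == "__main__":
    rng = random.Random(0)
    samples = []
    for _ in range(300):
        x = rng.uniform(0.0, 3.0)
        samples.append((x, math.sin(x) + rng.gauss(0.0, 0.1)))
    for name, steps in [("Type I", universal_steps(0.5, 2.0)),
                        ("Type II", constant_steps(0.1))]:
        centers, coef = online_gradient_descent(samples, gaussian_kernel, steps)
        print(name, predict(centers, coef, gaussian_kernel, 1.5), "target", math.sin(1.5))
```

Each sample is touched exactly once and adds one term to the expansion, so a pass over $t$ samples costs on the order of $t^2$ kernel evaluations in this naive form.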
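9 Type I: step sizes are a universal sequence
Main results: generalization error
Define the K-functional:
$$K(s, f_\rho) := \inf_{f \in H_K} \bigl\{ \|f - f_\rho\|_\rho + s \|f\|_K \bigr\}, \qquad s > 0.$$
Theorem 1. Let $\theta \in (0, 1)$ and $\{\eta_j = \frac{1}{\mu} j^{-\theta} : j \in \mathbb{N}\}$ with some constant $\mu \ge \mu(\theta)$. Then, for any $t \in \mathbb{N}$,
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \le \bigl[K\bigl(b_{\theta,\mu}\, t^{-(1-\theta)/2}, f_\rho\bigr)\bigr]^2 + O\bigl(t^{-\min\{\theta, 1-\theta\}} \ln t\bigr).$$
Implication for consistency:
- The K-functional $K(\cdot, f_\rho)$ is non-decreasing, concave, and $\lim_{s \to 0^+} K(s, f_\rho) = \inf_{f \in H_K} \|f - f_\rho\|_\rho$.
- $\Longrightarrow$ Consistency: $\lim_{t \to \infty} \mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = \inf_{f \in H_K} \|f - f_\rho\|_\rho^2$.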
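10 Error rates
We assume that $f_\rho$ has some smoothness. Define $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ by
$$L_K f(x) = \int_X K(x, y) f(y) \, d\rho_X(y), \qquad x \in X, \quad f \in L^2_{\rho_X}.$$
The fractional range space $L_K^\beta(L^2_{\rho_X})$ is the range of $L_K^\beta$.
Theorem 2. Let $\theta \in (0, 1)$ and let $\mu(\theta)$ be absolute constants depending on $\theta$. If $f_\rho \in L_K^\beta(L^2_{\rho_X})$ with some $0 < \beta \le 1/2$ then, by selecting $\eta_j = \frac{1}{\mu\left(\frac{2\beta}{2\beta+1}\right)}\, j^{-\frac{2\beta}{2\beta+1}}$ for $j \in \mathbb{N}$, for any $t \in \mathbb{N}$ there holds
$$\mathbb{E}_{Z^t}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr). \tag{2}$$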
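One way to see where the exponent $2\beta/(2\beta+1)$ could come from is to balance the two terms of Theorem 1. The sketch below assumes the standard approximation estimate $K(s, f_\rho) \le C s^{2\beta}$ for $f_\rho \in L_K^\beta(L^2_{\rho_X})$ with $0 < \beta \le 1/2$. That estimate and the constant $C$ are not stated on the slides, so this is a heuristic reading and not necessarily the authors' argument.

```latex
% Heuristic balancing of the two terms in Theorem 1,
% assuming K(s, f_rho) <= C s^{2 beta} (an assumption, not from the slides).
\begin{align*}
\bigl[K\bigl(b_{\theta,\mu}\, t^{-(1-\theta)/2}, f_\rho\bigr)\bigr]^2
  &\le C^2\, b_{\theta,\mu}^{4\beta}\, t^{-2\beta(1-\theta)}, \\
O\bigl(t^{-\min\{\theta,\, 1-\theta\}} \ln t\bigr)
  &= O\bigl(t^{-\theta} \ln t\bigr) \quad \text{when } \theta \le \tfrac12, \\
2\beta(1-\theta) = \theta
  &\iff \theta = \frac{2\beta}{2\beta+1},
  \qquad \text{and } \frac{2\beta}{2\beta+1} \le \frac12 \iff \beta \le \frac12 .
\end{align*}
```

Under this reading both terms are of order $t^{-2\beta/(2\beta+1)}$ up to the logarithm, which matches the rate in (2) and is consistent with the restriction $\beta \le 1/2$ in Theorem 2.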
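11 Type II: step sizes depending on the sample number
Generalization error
Theorem 3. Let $\{\eta_j = \eta : j \in \mathbb{N}\}$. Then we have that
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] \le \bigl[K\bigl((\eta t)^{-\frac12}, f_\rho\bigr)\bigr]^2 + O(\eta \ln t).$$
Rule of early stopping: trade off $K\bigl((\eta t)^{-\frac12}, f_\rho\bigr)$ and $O(\eta \ln t)$ $\Longrightarrow$ stopping rule: $t = t(\eta)$ chosen to ensure the bounds tend to zero as $\eta \to 0^+$.
Equivalently, this can be viewed from the perspective of choosing the step size $\eta = \eta(t)$.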
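12 Implication for (weak) consistency: if the step size $\eta(t)$, depending on the sample number $t$, satisfies
$$\lim_{t \to \infty} \eta(t) \ln t = 0 \quad \text{and} \quad \lim_{t \to \infty} t\, \eta(t) = \infty,$$
then the algorithm is consistent.
Error rates
Theorem 4. Let $\{\eta_j = \eta : j = 1, 2, \dots, t\}$. If $f_\rho \in L_K^\beta(L^2_{\rho_X})$ for some $\beta > 0$ then, by choosing $\eta := \frac{\beta}{64(1+\kappa)^4 (2\beta+1)}\, t^{-\frac{2\beta}{2\beta+1}}$, we have that
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr).$$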
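The step size of Theorem 4 can be evaluated numerically. The sketch below reads the formula as $\eta = \frac{\beta}{64(1+\kappa)^4(2\beta+1)}\, t^{-2\beta/(2\beta+1)}$ and treats $\kappa$ as a given kernel-dependent constant. The slides do not define $\kappa$ (a common convention is $\kappa = \sup_x \sqrt{K(x,x)}$, which is an assumption here), and the function names are hypothetical.

```python
import math

def theorem4_step_size(t, beta, kappa):
    """Constant step size eta(t) = beta / (64 (1+kappa)^4 (2 beta + 1)) * t^(-2 beta/(2 beta + 1))."""
    if t < 1 or beta <= 0 or kappa < 0:
        raise ValueError("need t >= 1, beta > 0 and kappa >= 0")
    constant = beta / (64.0 * (1.0 + kappa) ** 4 * (2.0 * beta + 1.0))
    return constant * t ** (-2.0 * beta / (2.0 * beta + 1.0))

def consistency_quantities(t, beta, kappa):
    """Return (eta(t) * ln t, t * eta(t)), the two quantities in the consistency conditions."""
    eta = theorem4_step_size(t, beta, kappa)
    return eta * math.log(t), t * eta

if __name__ == "__main__":
    for t in (10 ** 2, 10 ** 4, 10 ** 6):
        print(t, theorem4_step_size(t, 0.5, 1.0), consistency_quantities(t, 0.5, 1.0))
```

For large $t$ the first quantity shrinks while the second grows, which is the behaviour the two consistency conditions ask for.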
13 Discussions and comparisons
Comparisons are based on the same assumption $f_\rho \in L_K^\beta(L^2_{\rho_X})$.
Our error rates for the online gradient descent algorithm (1):
- (I) $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta \in (0, \frac12]$ for $\{\eta_j, j \in \mathbb{N}\}$ a universal sequence.
- (II) $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta > 0$ for $\{\eta_j = \eta(t) : j = 1, \dots, t\}$ depending on the sample number.
Batch Tikhonov regularization: $O\bigl(t^{-\frac{2\beta}{2\beta+1}}\bigr)$ with $\beta \in (0, 1]$ (Zhang; Smale and Zhou).
14 Discussions continued
Online regularized algorithm, choosing $\lambda = \lambda(t) > 0$ appropriately:
- Yao and Smale; Ying-Zhou: $O\bigl(t^{-\frac{2\beta}{2\beta+2}} \ln t\bigr)$ with $\beta \in (0, 1]$.
- Pontil and Ying: $O\bigl(t^{-\frac{2\beta}{2\beta+1}} \ln t\bigr)$ with $\beta \in (0, 1]$.
The rate $O\bigl(t^{-\frac{2\beta}{2\beta+1}}\bigr)$ is capacity independent (eigenvalue independent) optimal: it uses only the assumption on $f_\rho$ and no assumption on the decay of the eigenvalues of $L_K$. This is implied by Caponnetto and De Vito.
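To compare the exponents quoted above, it can help to tabulate them for a few smoothness levels. The short sketch below only evaluates the two exponents $2\beta/(2\beta+1)$ and $2\beta/(2\beta+2)$ that appear on these slides; the function names and the chosen values of $\beta$ are illustrative.

```python
def ogd_rate_exponent(beta):
    """Exponent a in O(t^(-a) ln t) with a = 2 beta / (2 beta + 1)."""
    return 2.0 * beta / (2.0 * beta + 1.0)

def earlier_online_regularized_exponent(beta):
    """Exponent a in O(t^(-a) ln t) with a = 2 beta / (2 beta + 2)."""
    return 2.0 * beta / (2.0 * beta + 2.0)

for beta in (0.25, 0.5, 1.0):
    print(beta, ogd_rate_exponent(beta), earlier_online_regularized_exponent(beta))
```

A larger exponent means a faster decay of the error bound, and for every $\beta > 0$ the first exponent is the larger of the two.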
15 Ideas of proof
Three main steps. First step: error decomposition.
Rewrite the online algorithm (1) as
$$f_{j+1} - f_\rho = (I - \eta_j L_K)(f_j - f_\rho) + \eta_j \bigl( L_K(f_j - f_\rho) + (y_j - f_j(x_j)) K_{x_j} \bigr),$$
where $I$ is the identity operator.
Define $B(f_j, z_j) := L_K(f_j - f_\rho) + (y_j - f_j(x_j)) K_{x_j}$. Then $\mathbb{E}_{z_j}\bigl[B(f_j, z_j)\bigr] = 0$.
Set $\omega_k^t(L_K) := \prod_{j=k}^{t} (I - \eta_j L_K)$ and $\omega_{t+1}^t(L_K) := I$. Then, since $f_1 = 0$,
$$f_{t+1} - f_\rho = -\omega_1^t(L_K) f_\rho + \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K) B(f_j, z_j).$$
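The decomposition on this slide can be checked in three short steps. The sketch below uses only the definitions above plus the standard identities $\int_Y y \, d\rho(y|x) = f_\rho(x)$ and $L_K g = \int_X g(x) K_x \, d\rho_X(x)$. The intermediate lines are filled in here and are not on the original slide.

```latex
\begin{align*}
% One step: the L_K terms cancel, leaving update (1).
(I - \eta_j L_K)(f_j - f_\rho) + \eta_j B(f_j, z_j)
  &= f_j - f_\rho + \eta_j (y_j - f_j(x_j)) K_{x_j}
   = f_{j+1} - f_\rho, \\
% Zero mean: f_j does not depend on z_j, so only (x_j, y_j) is integrated out.
\mathbb{E}_{z_j}\bigl[(y_j - f_j(x_j)) K_{x_j}\bigr]
  &= \int_X (f_\rho(x) - f_j(x)) K_x \, d\rho_X(x)
   = -L_K (f_j - f_\rho), \\
% Unrolling the one-step identity down to j = 1.
f_{t+1} - f_\rho
  &= \omega_1^t(L_K)(f_1 - f_\rho)
   + \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K) B(f_j, z_j).
\end{align*}
```

The second line gives $\mathbb{E}_{z_j}[B(f_j, z_j)] = 0$, and with $f_1 = 0$ the first term of the third line becomes $-\omega_1^t(L_K) f_\rho$.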
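16 Proof continued
Proposition 1. For any $t \in \mathbb{N}$, $\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr]$ is bounded by
$$\underbrace{\|\omega_1^t(L_K) f_\rho\|_\rho^2}_{\text{approximation error}} + \underbrace{2(1+\kappa)^4 \sum_{k=1}^{t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr] \, \eta_k^2 \Big/ \Bigl(\sum_{j=k+1}^{t} \eta_j + 1\Bigr)}_{\text{cumulative sample error}}.$$
Remark 1. The standard cumulative loss $\sum_{k=1}^{t} (y_k - f_k(x_k))^2$ has been extensively studied in the online learning community: Cesa-Bianchi, Warmuth, Smola et al.
Weighted cumulative loss:
$$\sum_{k=1}^{t} (y_k - f_k(x_k))^2 \, \eta_k^2 \Big/ \Bigl(\sum_{j=k+1}^{t} \eta_j + 1\Bigr).$$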
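17 Proof continued: sketch of proof for Proposition 1
$$\mathbb{E}\bigl[\|f_{t+1} - f_\rho\|_\rho^2\bigr] = \|\omega_1^t(L_K) f_\rho\|_\rho^2 + \mathbb{E}\Bigl[\Bigl\|\sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K) B(f_j, z_j)\Bigr\|_\rho^2\Bigr] - 2\, \mathbb{E}\Bigl[\Bigl\langle \omega_1^t(L_K) f_\rho, \sum_{j=1}^{t} \eta_j\, \omega_{j+1}^t(L_K) B(f_j, z_j) \Bigr\rangle_\rho\Bigr],$$
where the last term is zero since $\mathbb{E}_{z_j}\bigl[B(f_j, z_j)\bigr] = 0$.
$$\mathbb{E}_{Z^t}\Bigl[\Bigl\|\sum_{k=1}^{t} \eta_k\, \omega_{k+1}^t(L_K) B(f_k, z_k)\Bigr\|_\rho^2\Bigr] = \sum_{k=1}^{t} \eta_k^2\, \mathbb{E}_{Z^k}\bigl[\|\omega_{k+1}^t(L_K) B(f_k, z_k)\|_\rho^2\bigr] \le \sum_{k=1}^{t} \eta_k^2\, \|\omega_{k+1}^t(L_K) L_K^{\frac12}\|^2\, \mathbb{E}_{Z^k}\bigl[\|B(f_k, z_k)\|_K^2\bigr],$$
with
$$\mathbb{E}_{z_k}\bigl[\|B(f_k, z_k)\|_K^2\bigr] \le c\, \mathcal{E}(f_k), \qquad \|\omega_{k+1}^t(L_K) L_K^{\frac12}\|^2 \le 2(1+\kappa)^2 \Big/ \Bigl(\sum_{j=k+1}^{t} \eta_j + 1\Bigr).$$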
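18 Proof continued
Approximation error: for any $f \in H_K$,
$$\|\omega_1^t(L_K) f_\rho\|_\rho \le \|\omega_1^t(L_K)(f - f_\rho)\|_\rho + \|\omega_1^t(L_K) f\|_\rho \le \|f - f_\rho\|_\rho + \|\omega_1^t(L_K) L_K^{\frac12}\| \, \|L_K^{-\frac12} f\|_\rho \le \|f - f_\rho\|_\rho + 2(1+\kappa) \Bigl(\sum_{k=1}^{t} \eta_k + 1\Bigr)^{-\frac12} \|f\|_K.$$
In terms of the K-functional:
$$\|\omega_1^t(L_K) f_\rho\|_\rho \le K\Bigl(2(1+\kappa)\Bigl(\sum_{k=1}^{t} \eta_k + 1\Bigr)^{-\frac12}, f_\rho\Bigr).$$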
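19 Proof continued
Cumulative sample error:
$$\sum_{k=1}^{t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr] \, \eta_k^2 \Big/ \Bigl(\sum_{j=k+1}^{t} \eta_j + 1\Bigr) \le \Bigl(\sup_{k=1,\dots,t} \mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]\Bigr) \sum_{k=1}^{t} \eta_k^2 \Big/ \Bigl(\sum_{j=k+1}^{t} \eta_j + 1\Bigr).$$
- Uniformly bound $\mathbb{E}\bigl[\mathcal{E}(f_k)\bigr]$ using $\|f_\rho\|_\rho$ and $\mathcal{E}(f_\rho)$.
- Estimate $\sum_{k=1}^{t} \eta_k^2 / \bigl(\sum_{j=k+1}^{t} \eta_j + 1\bigr)$. For instance, for $\eta_j = O(j^{-\theta})$ with $\theta \in (0, 1)$,
$$\sum_{k=1}^{t} \eta_k^2 \Big/ \Bigl(\sum_{j=k+1}^{t} \eta_j + 1\Bigr) = O\bigl(t^{-\min\{\theta, 1-\theta\}} \ln t\bigr).$$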
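The last estimate can be checked numerically for a concrete step-size sequence. The sketch below assumes $\eta_j = j^{-\theta}/\mu$ and compares the weighted sum with the reference rate $t^{-\min\{\theta, 1-\theta\}} \ln t$. The function names and the value $\mu = 1$ are illustrative, and a bounded ratio on a few values of $t$ is only a sanity check, not a proof.

```python
import math

def weighted_step_sum(t, theta, mu=1.0):
    """Compute sum_{k=1}^t eta_k^2 / (sum_{j=k+1}^t eta_j + 1) for eta_j = j^(-theta) / mu."""
    eta = [j ** (-theta) / mu for j in range(1, t + 1)]
    total = 0.0
    tail = 0.0  # running value of sum_{j=k+1}^t eta_j, built from k = t downwards
    for k in range(t, 0, -1):
        total += eta[k - 1] ** 2 / (tail + 1.0)
        tail += eta[k - 1]
    return total

def reference_rate(t, theta):
    """Reference rate t^(-min(theta, 1 - theta)) * ln t from the slide."""
    return t ** (-min(theta, 1.0 - theta)) * math.log(t)

if __name__ == "__main__":
    for t in (10 ** 2, 10 ** 3, 10 ** 4):
        s = weighted_step_sum(t, 0.5)
        print(t, s, s / reference_rate(t, 0.5))
```

Accumulating the tail sum from $k = t$ downwards keeps the computation linear in $t$ instead of recomputing each inner sum.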
20 Conclusions
The online gradient descent algorithm is a simple yet competitive algorithm:
- It is statistically consistent (for both types of step sizes).
- Its error rates are essentially the same as those of classical batch regularized learning and online regularized learning algorithms.
- It attains optimal capacity-independent error rates.
21 Questions
- $H_K$-norm error $\|f_{t+1} - f_\rho\|_K$ with universal polynomially decaying step sizes.
- Probability inequality estimates and almost sure convergence (strong consistency).
- Generalization error analysis with non-i.i.d. data.
- Can we directly use standard cumulative error bounds to bound the generalization error?
22 Thank you! Merry Christmas and Happy New Year!