Less is More: Computational Regularization by Subsampling
Lorenzo Rosasco, University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology (lcsl.mit.edu)
Joint work with Alessandro Rudi and Raffaello Camoriano
Paris
A Starting Point
Classically, statistics and optimization are distinct steps in algorithm design: empirical process theory + optimization.
At large scale, consider the interplay between statistics and optimization! (Bottou, Bousquet 08)
Computational regularization: computational tricks = regularization.
Supervised Learning
Problem: estimate f* given S_n = {(x_1, y_1), ..., (x_n, y_n)}.
The setting:
- y_i = f*(x_i) + ε_i, i ∈ {1, ..., n}
- ε_i ∈ R, x_i ∈ R^d random (bounded but with unknown distribution)
- f* unknown
Outline
- Nonparametric Learning
- Data Dependent Subsampling
- Data Independent Subsampling
Non-linear/non-parametric learning
f(x) = Σ_{i=1}^{M} c_i q(x, w_i)
- q: a non-linear function
- w_i ∈ R^d: the centers
- c_i ∈ R: the coefficients
- M = M_n: could/should grow with n
Question: how to choose w_i, c_i and M given S_n?
Learning with Positive Definite Kernels
There is an elegant answer if q is symmetric and all the matrices Q_{ij} = q(x_i, x_j) are positive semi-definite (i.e., have non-negative eigenvalues).
Representer Theorem (Kimeldorf, Wahba 70; Schölkopf et al. 01): M = n, w_i = x_i, and the c_i are obtained by convex optimization!
Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization
f̂_λ = argmin_{f ∈ H} (1/n) Σ_{i=1}^{n} (y_i − f(x_i))^2 + λ ||f||^2
where
H = {f | f(x) = Σ_{i=1}^{M} c_i q(x, w_i), c_i ∈ R, w_i ∈ R^d (any center!), M ∈ N (any length!)}
and the norm is induced by the inner product ⟨f, f'⟩ = Σ_{i,j} c_i c'_j q(x_i, x_j).
Solution: f̂_λ(x) = Σ_{i=1}^{n} c_i q(x, x_i), with c = (Q̂ + λnI)^{-1} ŷ.
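The closed-form solution above amounts to building the kernel matrix once and solving one linear system. A minimal NumPy sketch; the Gaussian kernel, bandwidth and toy data are illustrative choices, not taken from the slides:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # q(a, b) = exp(-gamma * ||a - b||^2), computed pairwise.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma=1.0):
    # Coefficients c = (Q + lam * n * I)^{-1} y.
    n = len(X)
    Q = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, gamma=1.0):
    # f_hat(x) = sum_i c_i q(x, x_i).
    return gaussian_kernel(X_test, X_train, gamma) @ c

# Toy usage: noisy sine regression, with the theoretical choice lambda = 1/sqrt(n).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
c = krr_fit(X, y, lam=1.0 / np.sqrt(n))
y_hat = krr_predict(X, c, X)
```

The O(n^2) kernel matrix and the O(n^3) solve in `krr_fit` are exactly the costs discussed on the next slides.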
KRR: Statistics
Well understood statistical properties.
Classical Theorem: if f* ∈ H, then λ = 1/√n gives E(f̂_λ(x) − f*(x))^2 ≲ 1/√n.
Remarks:
1. Optimal nonparametric bound.
2. More refined results for smooth kernels: λ = n^{-1/(2s+1)} gives E(f̂_λ(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
3. Adaptive tuning, e.g. via cross validation.
4. Proofs: inverse problems results + random matrices (Smale, Zhou; Caponnetto, De Vito, R.).
KRR: Optimization
f̂_λ(x) = Σ_{i=1}^{n} c_i q(x, x_i), where c solves the linear system (Q̂ + λnI) c = ŷ.
Complexity: space O(n^2), time O(n^3).
BIG DATA? Running out of time and space... Can this be fixed?
Beyond Tikhonov: Spectral Filtering
(Q̂ + λI)^{-1} is an approximation of Q̂^{-1} controlled by λ. Can we approximate Q̂^{-1} while saving computations? Yes!
Spectral filtering (Engl 96, inverse problems; Rosasco et al. 05, ML): g_λ(Q̂) ≈ Q̂^{-1}. The filter function g_λ defines the form of the approximation.
Spectral Filtering: Examples
- Tikhonov → ridge regression
- Truncated SVD → principal component regression
- Landweber iteration → GD / L2-boosting
- nu-method → accelerated GD / Chebyshev method
- ...
Landweber iteration (truncated power series):
c_t = g_t(Q̂) ŷ = γ Σ_{r=0}^{t−1} (I − γQ̂)^r ŷ
... it's GD for ERM! For r = 1, ..., t:
c_r = c_{r−1} − γ(Q̂ c_{r−1} − ŷ), c_0 = 0
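The identity between the truncated power series and gradient descent on the empirical risk can be checked numerically. A small sketch; the kernel, step size and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
# A positive semi-definite kernel matrix Q_ij = exp(-||x_i - x_j||^2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = np.exp(-sq)
y = rng.standard_normal(50)

gamma = 1.0 / np.linalg.norm(Q, 2)  # step size <= 1/||Q|| for stability
t = 100

# Gradient descent for ERM: c_r = c_{r-1} - gamma * (Q c_{r-1} - y), c_0 = 0.
c = np.zeros(50)
for _ in range(t):
    c = c - gamma * (Q @ c - y)

# Truncated power series: c_t = gamma * sum_{r=0}^{t-1} (I - gamma Q)^r y.
c_series = np.zeros(50)
P = np.eye(50)
for _ in range(t):
    c_series += gamma * (P @ y)
    P = P @ (np.eye(50) - gamma * Q)

print(np.allclose(c, c_series))  # the two computations coincide
```

Here the iteration count t plays the role of the regularization parameter, which is the point of the semiconvergence slide below.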
Statistics and Computations with Spectral Filtering
The different filters achieve essentially the same optimal statistical error! The difference is in computations (note: λ^{-1} = t for iterative methods):

Filter         | Time          | Space
Tikhonov       | n^3           | n^2
GD             | n^2 λ^{-1}    | n^2
Accelerated GD | n^2 λ^{-1/2}  | n^2
Truncated SVD  | n^2 λ^{-γ}    | n^2
Semiconvergence
[plot: empirical vs. expected error as a function of the iteration number]
Iterations control statistics and time complexity.
Computational Regularization
BIG DATA? Running out of time and space...
Is there a principle to control statistics, time and space complexity?
Outline
- Nonparametric Learning
- Data Dependent Subsampling
- Data Independent Subsampling
Subsampling
1. Pick the w_i at random from the training set (Smola, Schölkopf 00): {w_1, ..., w_M} ⊆ {x_1, ..., x_n}, M ≤ n.
2. Perform KRR on H_M = {f | f(x) = Σ_{i=1}^{M} c_i q(x, w_i), c_i ∈ R}.
Linear system: Q̂_M c = ŷ.
Complexity: space O(n^2) → O(nM), time O(n^3) → O(nM^2).
What about statistics? What's the price for efficient computations?
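Restricting KRR to H_M turns the n × n system into an M × M one: writing K_nM for the n × M matrix q(x_i, w_j) and K_MM for the M × M matrix q(w_i, w_j), the coefficients solve (K_nM^T K_nM + λn K_MM) c = K_nM^T ŷ. A hedged sketch of this Nyström-style solver; the Gaussian kernel, jitter term and toy data are illustrative:

```python
import numpy as np

def gk(A, B):
    # Gaussian kernel q(a, b) = exp(-||a - b||^2), computed pairwise.
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def nystrom_krr(X, y, M, lam, rng):
    # 1. Subsample M centers uniformly at random from the training set.
    idx = rng.choice(len(X), size=M, replace=False)
    W = X[idx]
    # 2. KRR restricted to span{q(., w_1), ..., q(., w_M)}: solve the
    #    M x M system (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y.
    K_nM = gk(X, W)          # n x M
    K_MM = gk(W, W)          # M x M
    A = K_nM.T @ K_nM + lam * len(X) * K_MM
    c = np.linalg.solve(A + 1e-10 * np.eye(M), K_nM.T @ y)  # tiny jitter
    return W, c

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
W, c = nystrom_krr(X, y, M=25, lam=1e-3, rng=rng)  # M ~ sqrt(n) centers
y_hat = gk(X, W) @ c
```

Only K_nM (size nM) is ever stored, and the solve costs O(nM^2): exactly the complexities claimed above.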
Putting Our Result in Context
- *Many* different subsampling schemes (Smola, Schölkopf 00; Williams, Seeger 01; ...).
- Theoretical guarantees mainly on matrix approximation (Mahoney, Drineas 09; Cortes et al. 10; Kumar et al. ...): ||Q − Q_M|| ≲ 1/√M.
- Statistical guarantees suboptimal or in restricted settings (Cortes et al. 10; Jin et al. 11; Bach 13; El Alaoui, Mahoney 14).
Main Result (Rudi, Camoriano, Rosasco 15)
Theorem: if f* ∈ H, then λ = 1/√n and M = 1/λ = √n give E(f̂_{λ,M}(x) − f*(x))^2 ≲ 1/√n.
Remarks:
1. Subsampling achieves the optimal bound with M = √n ≪ n!!
2. More generally, λ = n^{-1/(2s+1)} and M = 1/λ give E_x(f̂_{λ,M}(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
Note: an interesting insight is obtained by rewriting the result...
Computational Regularization by Subsampling (Rudi, Camoriano, Rosasco 15)
A simple idea: swap the roles of λ and M...
Theorem: if f* ∈ H with a smooth kernel, then M = n^{1/(2s+1)} and λ = 1/M give E_x(f̂_{λ,M}(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
λ and M play the same role; new interpretation: subsampling regularizes!
This suggests a natural incremental algorithm:
1. Pick a center + compute the solution.
2. Pick another center + rank-one update.
3. Pick another center...
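A sketch of the incremental idea: with λ fixed, grow the number of centers M and monitor the validation error. For clarity this sketch re-solves the M × M system at every step instead of performing the rank-one updates the algorithm actually uses; the kernel, schedule and data are illustrative:

```python
import numpy as np

def gk(A, B):
    # Gaussian kernel q(a, b) = exp(-||a - b||^2), computed pairwise.
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

lam = 1e-3                          # n and lambda stay fixed
order = rng.permutation(len(X_tr))  # random order in which centers are added
val_errs = []
for M in (1, 2, 4, 8, 16, 32):
    # Grow the set of centers; here we simply re-solve the M x M system,
    # whereas the actual algorithm updates the factorization rank-one per center.
    W = X_tr[order[:M]]
    K_nM, K_MM = gk(X_tr, W), gk(W, W)
    A = K_nM.T @ K_nM + lam * len(X_tr) * K_MM + 1e-10 * np.eye(M)
    c = np.linalg.solve(A, K_nM.T @ y_tr)
    val_errs.append(np.mean((gk(X_val, W) @ c - y_val) ** 2))
print(val_errs)  # tracking this curve is how M is chosen: M regularizes
```

Stopping when the validation error flattens tailors the time/space requirement to generalization, as the next slide illustrates.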
Nyström CoRe Illustrated
[plot: validation error as a function of computation, with n and λ fixed]
Computation controls stability! Time/space requirements tailored to generalization.
Experiments
Comparable/better w.r.t. the state of the art.
[table: test error on the Ins. Co., CPU, CT slices, Year Pred and Forest datasets (with n_tr and d per dataset) for Incremental CoRe, Standard KRLS, Standard Nyström, Random Features RF and Fastfood; the numeric entries did not survive transcription]
Random Features (Rahimi, Recht 07); Fastfood (Le et al. 13).
Summary So Far
- Optimal learning with data dependent subsampling.
- Computational regularization: subsampling regularizes!
A few more questions:
- Can one do better than uniform sampling? Yes: leverage score sampling...
- What about data independent sampling?
Outline
- Nonparametric Learning
- Data Dependent Subsampling
- Data Independent Subsampling
Random Features
f(x) = Σ_{i=1}^{M} c_i q(x, w_i)
- q: a general non-linear function.
- Pick the w_i at random according to a distribution µ: w_1, ..., w_M ∼ µ.
- Perform KRR on H_M = {f | f(x) = Σ_{i=1}^{M} c_i q(x, w_i), c_i ∈ R}.
Random Fourier Features (Rahimi, Recht 07)
Consider q(x, w) = e^{i w^T x} with w ∼ µ(w) = N(0, I). Then
E_w[q(x, w) q(x', w)*] = e^{−||x − x'||^2 / 2γ} = K(x, x').
By sampling w_1, ..., w_M we are implicitly considering the approximate kernel
(1/M) Σ_{i=1}^{M} q(x, w_i) q(x', w_i)* = K_M(x, x').
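In practice one works with the real-valued variant z(x) = √(2/M) cos(w^T x + b), b ∼ U[0, 2π], whose expectation also recovers the Gaussian kernel (here with unit bandwidth). A sketch of the kernel approximation and of the resulting M × M ridge regression; dimensions, λ and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 400, 3, 200
X = rng.standard_normal((n, d))

# Random Fourier features z(x) = sqrt(2/M) cos(W x + b), w_i ~ N(0, I),
# b_i ~ U[0, 2*pi]; E[z(x)^T z(x')] = exp(-||x - x'||^2 / 2) = K(x, x').
W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)

def features(A):
    return np.sqrt(2.0 / M) * np.cos(A @ W.T + b)

# Monte Carlo kernel approximation: |K_M - K| shrinks like 1/sqrt(M).
Z = features(X)
K_M = Z @ Z.T
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)
err = np.abs(K_M - K).max()

# KRR in feature space: an M x M linear system instead of an n x n one.
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
lam = 1e-3
c = np.linalg.solve(Z.T @ Z + lam * n * np.eye(M), Z.T @ y)
y_hat = Z @ c
```

Training touches only the n × M feature matrix Z, giving the O(nM) space and O(nM^2) time stated below.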
More Random Features
- Translation invariant kernels: K(x, x') = H(x − x'), q(x, w) = e^{i w^T x}, w ∼ µ = F(H) (the Fourier transform of H).
- Infinite neural nets kernels: q(x, w) = |w^T x + b|_+, (w, b) ∼ µ = U[S^d].
- Infinite dot product kernels.
- Homogeneous additive kernels.
- Group invariant kernels.
- ...
Note: connections with hashing and sketching techniques.
Properties of Random Features
Optimization: time O(n^3) → O(nM^2), space O(n^2) → O(nM).
Statistics: as before, do we pay a price for efficient computations?
Previous Works
- *Many* different random features for different kernels (Rahimi, Recht 07; Vedaldi, Zisserman; ...).
- Theoretical guarantees mainly on kernel approximation (Rahimi, Recht 07; ...; Sriperumbudur, Szabó 15): |K(x, x') − K_M(x, x')| ≲ 1/√M.
- Statistical guarantees suboptimal or in restricted settings (Rahimi, Recht 09; Yang et al.; Bach 15).
Main Result
Let q(x, w) = e^{i w^T x} with w ∼ µ(w) = c_d (1 + ||w||^2)^{−(d+1)/2}.
Theorem: if f* ∈ H^s (a Sobolev space), then λ = n^{-1/(2s+1)} and M = (1/λ)^{2s} give E(f̂_{λ,M}(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
Random features achieve the optimal bound! Efficient subsampling in the worst case (M = √n), but they cannot exploit smoothness.
Remarks & Extensions
Nyström vs. random features:
- Both achieve optimal rates.
- Nyström seems to need fewer samples (random centers).
How tight are the results?
[plot: test error as a function of log λ and log M]
Contributions
- Optimal bounds for data dependent/independent subsampling.
- Subsampling: Nyström vs. random features.
- Beyond ridge regression: early stopping and multiple passes of SGD (see arXiv).
Some questions:
- The quest for the best sampling.
- Regularization by projection: inverse problems and preconditioning.
- Beyond randomization: non-convex neural nets optimization?
Some perspectives:
- Computational regularization: subsampling regularizes.
- Algorithm design: control stability for good statistics/computations.
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationAn Adaptive Test of Independence with Analytic Kernel Embeddings
An Adaptive Test of Independence with Analytic Kernel Embeddings Wittawat Jitkrittum 1 Zoltán Szabó 2 Arthur Gretton 1 1 Gatsby Unit, University College London 2 CMAP, École Polytechnique ICML 2017, Sydney
More informationConvergence rates of spectral methods for statistical inverse learning problems
Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationKernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt.
SINGAPORE SHANGHAI Vol TAIPEI - Interdisciplinary Mathematical Sciences 19 Kernel-based Approximation Methods using MATLAB Gregory Fasshauer Illinois Institute of Technology, USA Michael McCourt University
More informationDistribution Regression
Zoltán Szabó (École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) Dagstuhl Seminar 16481
More informationApproximate Spectral Clustering via Randomized Sketching
Approximate Spectral Clustering via Randomized Sketching Christos Boutsidis Yahoo! Labs, New York Joint work with Alex Gittens (Ebay), Anju Kambadur (IBM) The big picture: sketch and solve Tradeoff: Speed
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationOnline Kernel PCA with Entropic Matrix Updates
Dima Kuzmin Manfred K. Warmuth Computer Science Department, University of California - Santa Cruz dima@cse.ucsc.edu manfred@cse.ucsc.edu Abstract A number of updates for density matrices have been developed
More informationLarge-scale Image Annotation by Efficient and Robust Kernel Metric Learning
Large-scale Image Annotation by Efficient and Robust Kernel Metric Learning Supplementary Material Zheyun Feng Rong Jin Anil Jain Department of Computer Science and Engineering, Michigan State University,
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More information10-701/ Recitation : Kernels
10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer
More informationA Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22
More informationStochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos
1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and
More informationApproximate Principal Components Analysis of Large Data Sets
Approximate Principal Components Analysis of Large Data Sets Daniel J. McDonald Department of Statistics Indiana University mypage.iu.edu/ dajmcdon April 27, 2016 Approximation-Regularization for Analysis
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationMIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 04: Features and Kernels. Lorenzo Rosasco
MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 04: Features and Kernels Lorenzo Rosasco Linear functions Let H lin be the space of linear functions f(x) = w x. f w is one
More informationRegularized Least Squares
Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose
More informationRKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee
RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets 9.520 Class 22, 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce an alternate perspective of RKHS via integral operators
More informationSketched Ridge Regression:
Sketched Ridge Regression: Optimization and Statistical Perspectives Shusen Wang UC Berkeley Alex Gittens RPI Michael Mahoney UC Berkeley Overview Ridge Regression min w f w = 1 n Xw y + γ w Over-determined:
More informationStat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.
Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationOptimal Distributed Learning with Multi-pass Stochastic Gradient Methods
Optimal Distributed Learning with Multi-pass Stochastic Gradient Methods Junhong Lin Volkan Cevher Abstract We study generalization properties of distributed algorithms in the setting of nonparametric
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date
More informationAdvances in kernel exponential families
Advances in kernel exponential families Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS, 2017 1/39 Outline Motivating application: Fast estimation of complex multivariate
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement
More informationIntroductory Machine Learning Notes 1
Introductory Machine Learning Notes 1 Lorenzo Rosasco DIBRIS, Universita degli Studi di Genova LCSL, Massachusetts Institute of Technology and Istituto Italiano di Tecnologia lrosasco@mit.edu December
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationSparse Random Features Algorithm as Coordinate Descent in Hilbert Space
Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space Ian E.H. Yen 1 Ting-Wei Lin Shou-De Lin Pradeep Ravikumar 1 Inderjit S. Dhillon 1 Department of Computer Science 1: University of
More informationStochastic gradient descent and robustness to ill-conditioning
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More informationDistribution Regression: A Simple Technique with Minimax-optimal Guarantee
Distribution Regression: A Simple Technique with Minimax-optimal Guarantee (CMAP, École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department,
More informationAn Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM
An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department
More informationRandomized Algorithms
Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models
More informationarxiv: v6 [stat.ml] 17 Mar 2016
Less is More: Nyström Computational Regularization Alessandro Rudi 1 Raffaello Camoriano 1,2 Lorenzo Rosasco 1,3 ale rudi@mit.edu raffaello.camoriano@iit.it lrosasco@mit.edu arxiv:1507.04717v6 [stat.ml]
More information