Optimal kernel methods for large scale learning
1 Optimal kernel methods for large scale learning. Alessandro Rudi, INRIA - École Normale Supérieure, Paris. Joint work with Luigi Carratino, Lorenzo Rosasco. 6 Mar 2018, École Polytechnique.
2 Learning problem. Problem P: find $f_{\mathcal H} = \operatorname{argmin}_{f \in \mathcal H} \mathcal E(f)$, where $\mathcal E(f) = \int d\rho(x, y)\,(y - f(x))^2$, with $\rho$ unknown but $(x_i, y_i)_{i=1}^n$ i.i.d. samples given. Remarks: this is a stochastic optimization problem; $\mathcal H$ is a space of candidate solutions.
3 Outline. Learning with kernels; random projections; FALKON: random projections and preconditioning.
4 Kernel ridge regression. Let $K$ be a p.d. kernel (e.g. $K(x, x') = e^{-\gamma \|x - x'\|^2}$) and $\mathcal H = \operatorname{span}\{K(x, \cdot) \mid x \in X\}$. Problem $P_n$: $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal H} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal H}^2$.
5-6 KRR: Statistics (worst case). Basic assumptions. Noise: $\mathbb E[|Y|^p \mid X = x] \le \frac{1}{2} p!\,\sigma^2 b^{p-2}$ for all $p \ge 2$. Kernel boundedness: $\sup_x K(x, x) < \infty$. Best model: there exists $f_{\mathcal H}$ solving $\min_{f \in \mathcal H} \mathcal E(f)$. Theorem [Caponnetto, De Vito 05]: under the assumptions above, $\mathbb E\,\mathcal E(\hat f_\lambda) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\lambda n} + \lambda$. Balancing the two terms by selecting $\lambda_n = \frac{1}{\sqrt n}$ gives $\mathbb E\,\mathcal E(\hat f_{\lambda_n}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\sqrt n}$.
7-9 KRR: Statistics (refined case). Let $Lf(x') = \mathbb E\,K(x', x) f(x)$ and $\mathcal N(\lambda) = \operatorname{Tr}((L + \lambda I)^{-1} L)$. Capacity condition: $\mathcal N(\lambda) = O(\lambda^{-\gamma})$, $\gamma \in [0, 1]$. Source condition: $f_{\mathcal H} \in \operatorname{Range}(L^r)$, $r \ge 1/2$. Theorem [Caponnetto, De Vito 05]: under (basic) and (refined), $\mathbb E\,\mathcal E(\hat f_\lambda) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r}$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, $\mathbb E\,\mathcal E(\hat f_{\lambda_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
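The effective dimension $\mathcal N(\lambda)$ drives the refined bound, and it can be estimated from data. Below is a minimal sketch (not from the slides) that computes the empirical analogue from the eigenvalues of the kernel matrix, using $K/n$ as a finite-sample proxy for the integral operator $L$; the function name and the Gaussian-kernel example are illustrative choices of ours.

```python
import numpy as np

def effective_dimension(K, lam):
    """Empirical N(lam) = sum_i s_i / (s_i + lam), with s_i the eigenvalues of K/n."""
    s = np.linalg.eigvalsh(K) / K.shape[0]  # spectrum of the empirical operator
    s = np.clip(s, 0.0, None)               # guard against negative round-off
    return float(np.sum(s / (s + lam)))

# Example: Gaussian kernel matrix on random 1-d inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-5.0 * sq)
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, effective_dimension(K, lam))  # N(lam) grows as lam shrinks
```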
10 KRR: Optimization. $\hat f_\lambda(x) = \sum_{i=1}^n K(x, x_i)\, c_i$, where $c$ solves the linear system $(\hat K + \lambda n I)\, c = \hat y$. Computations: space $O(n^2)$, kernel eval. $O(n^2)$, time $O(n^3)$. Large scale ML: running out of time and space... can we fix it?
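As a concrete reference point, here is a minimal sketch of exact KRR, assuming a Gaussian kernel (the helper names are ours): it forms the full kernel matrix and solves $(\hat K + \lambda n I)\, c = \hat y$ directly, so it hits exactly the $O(n^2)$ space and $O(n^3)$ time walls quoted on the slide.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma=1.0):
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)                     # O(n^2) kernel evaluations and space
    return np.linalg.solve(K + lam * n * np.eye(n), y)   # O(n^3) time

def krr_predict(X_train, c, X_test, gamma=1.0):
    # f(x) = sum_i K(x, x_i) c_i
    return gaussian_kernel(X_test, X_train, gamma) @ c
```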
11 Computations for optimal statistical accuracy. Model: $O(n)$; Space: $O(n^2)$; Kernel eval.: $O(n^2)$; Time: $O(n^3)$.
12 Outline. Learning with kernels; random projections; FALKON: random projections and preconditioning.
13-14 Random projections. Solve $P_n$ on $\mathcal H_M = \operatorname{span}\{K(\tilde x_1, \cdot), \dots, K(\tilde x_M, \cdot)\}$: $\hat f_{\lambda, M} = \operatorname{argmin}_{f \in \mathcal H_M} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal H}^2$... that is, pick $M$ columns at random. Then $\hat f_{\lambda, M}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$, where $c$ solves the linear system $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$. - Nyström methods (Smola, Schölkopf 00) - Gaussian processes: inducing inputs (Quiñonero-Candela et al 05).
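A minimal sketch of this estimator (sampling scheme and names are ours, with the Gaussian-kernel helper restated for self-containment): sample $M$ centers uniformly without replacement, then solve the $M \times M$ system above.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    return np.exp(-gamma * ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))

def nystrom_krr_fit(X, y, lam, M, gamma=1.0, seed=0):
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=M, replace=False)  # M uniform centers
    Xm = X[idx]
    KnM = gaussian_kernel(X, Xm, gamma)        # n x M: O(nM) kernel evaluations
    KMM = gaussian_kernel(Xm, Xm, gamma)       # M x M
    H = KnM.T @ KnM + lam * n * KMM            # O(nM^2) time: the training bottleneck
    c = np.linalg.solve(H, KnM.T @ y)          # O(M^3)
    return Xm, c                               # f(x) = sum_j K(x, x~_j) c_j
```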
15 Nyström KRR: Computations. For the linear system $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$: training space $O(n^2) \to O(M_n^2)$, kernel eval. $O(n^2) \to O(n M_n)$, time $O(n^3) \to O(n M_n^2)$; test-time prediction $O(n) \to O(M_n)$.
16-17 Nyström KRR: Statistics (worst case). Theorem [Rudi, Camoriano, Rosasco 15]: under the basic assumptions, $\mathbb E\,\mathcal E(\hat f_{\lambda, M}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M}$. By selecting $\lambda_n = \frac{1}{\sqrt n}$ and $M_n = \frac{1}{\lambda_n} = \sqrt n$, $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\sqrt n}$.
18 Remarks. $M = O(\sqrt n)$ suffices for optimal generalization. Previous works held only for fixed design (Bach 13; Alaoui, Mahoney 15; Yang et al. 15; Musco, Musco 16). Matches the statistical minimax lower bounds [Caponnetto, De Vito 05]. Special case: Sobolev spaces with $s = d/2$, e.g. exponential kernel and Fourier features. Same statistical bound as (kernel) ridge regression [Caponnetto, De Vito 05].
19-20 Nyström KRR: Statistics (refined). Theorem [Rudi, Camoriano, Rosasco 15]: under (basic) and (refined), $\mathbb E\,\mathcal E(\hat f_{\lambda, M}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M}$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$ and $M_n = \frac{1}{\lambda_n}$, $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
21 Remarks. The obtained rate is minimax optimal [Caponnetto, De Vito 05]. Reduces to the worst case for $\gamma = 1$, $r = 1/2$. Comparison with random features [Rudi, Rosasco 17]: $M = n^c$.
22 Computations required for the $1/\sqrt n$ rate. Model: $O(\sqrt n)$; Space: $O(n)$; Kernel eval.: $O(n \sqrt n)$; Time: $O(n^2)$. Possible improvements: adaptive sampling; optimization.
23 Outline. Learning with kernels; random projections; FALKON: random projections and preconditioning.
24 Beyond $O(n^2)$ time? $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$. Bottleneck: computing $\hat K_{nM}^\top \hat K_{nM}$ requires $O(n M^2)$ time.
25 Optimization to the rescue. Write the system as $H c = b$, with $H = \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}$ and $b = \hat K_{nM}^\top \hat y$. Idea: first order methods, $c_t = c_{t-1} - \frac{\tau}{n} \left[ \hat K_{nM}^\top (\hat K_{nM} c_{t-1} - \hat y) + \lambda n \hat K_{MM} c_{t-1} \right]$. Pros: requires $O(n M t)$ time. Cons: $t \approx \kappa(H)$ can be arbitrarily large, where $\kappa(H) = \sigma_{\max}(H)/\sigma_{\min}(H)$ is the condition number.
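A minimal sketch of this iteration (step size and iteration count are illustrative): by applying $\hat K_{nM}$ and $\hat K_{nM}^\top$ separately it never forms $H$, so each step costs $O(nM)$ and $t$ steps cost $O(nMt)$ overall.

```python
import numpy as np

def nystrom_gd(KnM, KMM, y, lam, tau, t_max):
    """Gradient descent on H c = b without ever forming H explicitly."""
    n, M = KnM.shape
    c = np.zeros(M)
    for _ in range(t_max):
        # H c - b = KnM^T (KnM c - y) + lam * n * KMM c, computed in O(nM)
        grad = KnM.T @ (KnM @ c - y) + lam * n * (KMM @ c)
        c -= (tau / n) * grad
    return c
```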
26 Preconditioning. Idea: solve an equivalent linear system with a better condition number. Preconditioning: $H c = b \;\longrightarrow\; P^\top H P\, \beta = P^\top b$, $c = P \beta$. Ideally $P P^\top = H^{-1}$, so that $t = O(\kappa(H))$ becomes $t = O(1)$! Computing a good preconditioner can be hard!
27 Remarks. Preconditioning KRR (Fasshauer et al 12; Avron et al 16; Cutajar 16; Ma, Belkin 17): $H = \hat K + \lambda n I$. Can we precondition Nyström-KRR?
28 Preconditioning Nyström-KRR. Consider $H := \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}$. Proposed preconditioner: $P P^\top = \left( \frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM} \right)^{-1}$. Compare to the naive preconditioner $P P^\top = \left( \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM} \right)^{-1}$.
29 Baby FALKON. With the proposed preconditioner $P P^\top = \left( \frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM} \right)^{-1}$, run gradient descent: $\hat f_{\lambda, M, t}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_{t,i}$, with $c_t = P \beta_t$ and $\beta_t = \beta_{t-1} - \frac{\tau}{n} P^\top \left[ \hat K_{nM}^\top (\hat K_{nM} P \beta_{t-1} - \hat y) + \lambda n \hat K_{MM} P \beta_{t-1} \right]$.
30 FALKON. Gradient descent $\to$ conjugate gradient. Computing $P$: $P = \frac{1}{\sqrt n}\, T^{-1} A^{-1}$, with $T = \operatorname{chol}(K_{MM})$, $A = \operatorname{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right)$, where $\operatorname{chol}(\cdot)$ is the Cholesky decomposition.
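Below is a minimal sketch of this construction, assuming KnM, KMM, y as in the Nyström sketch above (the jitter, iteration cap, and names are our choices): it builds $P$ through the two Cholesky factors, applies $P$ and $P^\top$ by triangular solves, and runs SciPy's conjugate gradient on the preconditioned system $P^\top H P\, \beta = P^\top b$, returning $c = P \beta$.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

def falkon_solve(KnM, KMM, y, lam, t_max=20):
    n, M = KnM.shape
    T = cholesky(KMM + 1e-10 * np.eye(M))          # upper triangular: KMM = T^T T
    A = cholesky(T @ T.T / M + lam * np.eye(M))    # upper triangular: T T^T / M + lam I = A^T A

    def P(v):   # P v = (1/sqrt(n)) T^{-1} A^{-1} v, two triangular solves
        return solve_triangular(T, solve_triangular(A, v, lower=False), lower=False) / np.sqrt(n)

    def Pt(v):  # P^T v = (1/sqrt(n)) A^{-T} T^{-T} v
        return solve_triangular(A.T, solve_triangular(T.T, v, lower=True), lower=True) / np.sqrt(n)

    def matvec(beta):  # P^T H P beta, with H = KnM^T KnM + lam n KMM (never formed)
        c = P(beta)
        return Pt(KnM.T @ (KnM @ c) + lam * n * (KMM @ c))

    op = LinearOperator((M, M), matvec=matvec, dtype=float)
    beta, _ = cg(op, Pt(KnM.T @ y), maxiter=t_max)
    return P(beta)  # c = P beta; predict with f(x) = sum_j K(x, x~_j) c_j
```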
31-32 FALKON statistics (worst case). Theorem: under (basic), when $M > \frac{\log n}{\lambda}$, $\mathbb E\,\mathcal E(\hat f_{\lambda, M, t}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M} + \exp\!\left[ -t \left( 1 - \frac{\log n}{\lambda M} \right)^{1/2} \right]$. By selecting $\lambda_n = \frac{1}{\sqrt n}$, $M_n = \frac{2 \log n}{\lambda_n}$, $t_n = \log n$, then $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n, t_n}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{1}{\sqrt n}$.
33-34 FALKON statistics (refined results). Theorem: under (basic) and (refined), when $M > \frac{\log n}{\lambda}$, $\mathbb E\,\mathcal E(\hat f_{\lambda, M, t}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M} + \exp\!\left[ -t \left( 1 - \frac{\log n}{\lambda M} \right)^{1/2} \right]$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, $M_n = \frac{2 \log n}{\lambda_n}$, $t_n = \log n$, then $\mathbb E\,\mathcal E(\hat f_{\lambda_n, M_n, t_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
35 Remarks. Relevant works: SGD, RF-KRR (Rahimi, Recht 07; Bach 15; Rudi, Rosasco 17); divide and conquer (Zhang et al. 13); NYTRO (Angles et al 16); Nyström SGD (Lin, Rosasco 16). FALKON has the same statistical properties and memory requirements, with a much smaller time complexity.
36-37 Proof: bridging statistics and optimization. Lemma: let $\delta > 0$, $\kappa_P := \kappa(P^\top H P)$, $c_\delta = c_0 \log \frac{1}{\delta}$. When $\lambda \ge \frac{1}{n}$, with probability $1 - \delta$, $\mathcal E(\hat f_{\lambda, M, t}) - \mathcal E(f_{\mathcal H}) \lesssim \mathcal E(\hat f_{\lambda, M}) - \mathcal E(f_{\mathcal H}) + c_\delta \exp(-t/\sqrt{\kappa_P})$. Lemma: let $\delta \in (0, 1]$, $\lambda > 0$. When $M = \frac{2 \log \frac{1}{\delta}}{\lambda}$, then with probability $1 - \delta$, $\kappa(P^\top H P) \le \left( 1 - \frac{\log \frac{1}{\delta}}{\lambda M} \right)^{-1} < 4$.
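A quick numerical sanity check of the second lemma (entirely our construction, not from the slides): build a small Gaussian-kernel Nyström problem, form $P$ explicitly, and compare $\kappa(P^\top H P)$ with $\kappa(H)$; the preconditioned system should be dramatically better conditioned.

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(0)
n, M, lam, gamma = 2000, 100, 1e-3, 5.0
X = rng.uniform(-1, 1, size=(n, 1))
Xm = X[rng.choice(n, size=M, replace=False)]
k = lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
KnM, KMM = k(X, Xm), k(Xm, Xm)

H = KnM.T @ KnM + lam * n * KMM
T = cholesky(KMM + 1e-10 * np.eye(M))          # KMM = T^T T
A = cholesky(T @ T.T / M + lam * np.eye(M))    # T T^T / M + lam I = A^T A
P = np.linalg.solve(T, np.linalg.solve(A, np.eye(M))) / np.sqrt(n)

print("kappa(H)       =", np.linalg.cond(H))
print("kappa(P^T H P) =", np.linalg.cond(P.T @ H @ P))  # close to 1
```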
38 Computational implications. Cost for optimal generalization with FALKON ($M_n = O(\sqrt n)$, $t_n = \log n$). Model: $O(\sqrt n)$; Space: $O(n)$; Kernel eval.: $O(n \sqrt n)$; Time: $O(n^2) \to O(n \sqrt n)$.
39 In practice. [Figure: Higgs dataset, n = 10,000,000, M = 50,…]
40 Some experiments. [Table: MillionSongs (n ≈ 10^6; MSE, relative error, time in s), YELP (n ≈ 10^6; RMSE, time in m), TIMIT (n ≈ 10^6; c-err, time in h); methods compared: FALKON, Prec. KRR, Hierarchical, D&C, Rand. Feat., Nyström, ADMM R.F., BCD R.F., BCD Nyström, KRR, EigenPro, Deep NN, Sparse Kernels, Ensemble.] Times obtained on: a cluster of 128 EC2 r3.2xlarge machines; a cluster of 8 EC2 r3.8xlarge machines; a single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU and 128GB of RAM; a cluster with 512 GB of RAM and an IBM POWER8 12-core processor; an unknown platform.
41 Some more experiments. [Table: SUSY (n ≈ 10^6; c-err, AUC, time in m), HIGGS (n ≈ 10^7; AUC, time in h), IMAGENET (n ≈ 10^6; c-err, time in h); methods compared: FALKON (19.6% c-err on SUSY), EigenPro (19.8%), Hierarchical (20.1%), Boosted Decision Tree, Neural Network, Deep Neural Network, Inception-V….] Architectures: a cluster with an IBM POWER8 12-core CPU and 512 GB RAM; a single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU, 128GB RAM; a single machine.
42 Contributions. Best computations so far for optimal statistics: space $O(n)$, time $O(n \sqrt n)$. Random projections + iterative solvers + preconditioning... fast rates... adaptive sampling. Next? Distributed architectures... Open the kernel black box!
43-48 Proving $\kappa(P^\top H P) \approx 1$. Let $K_x = K(x, \cdot) \in \mathcal H$, $C = \int K_x \otimes K_x \, d\rho_X(x)$, $\hat C_n = \frac{1}{n} \sum_{i=1}^n K_{x_i} \otimes K_{x_i}$, $\hat C_M = \frac{1}{M} \sum_{j=1}^M K_{\tilde x_j} \otimes K_{\tilde x_j}$. Recall that $P = \frac{1}{\sqrt n} T^{-1} A^{-1}$, $T = \operatorname{chol}(K_{MM})$, $A = \operatorname{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right)$. Steps (with $V$ a suitable partial isometry): 1. $P^\top H P = A^{-\top} V^\top (\hat C_n + \lambda I) V A^{-1}$. 2. $P^\top H P = A^{-\top} V^\top (\hat C_M + \lambda I) V A^{-1} + A^{-\top} V^\top (\hat C_n - \hat C_M) V A^{-1}$. 3. $P^\top H P = I + E$, with $E = A^{-\top} V^\top (\hat C_n - \hat C_M) V A^{-1}$. 4. $\kappa(P^\top H P) = \kappa(I + E) \le \frac{1 + \|E\|}{1 - \|E\|}$ when $\|E\| < 1$. 5. $\|E\| = \|A^{-\top} V^\top (\hat C_n - \hat C_M) V A^{-1}\| \le 1/2$ w.h.p. when $M \ge c_0 \frac{\log \frac{1}{\delta}}{\lambda}$.