Optimal convex optimization under Tsybakov noise through connections to active learning


1 Optimal convex optimization under Tsybakov noise through connections to active learning. Aarti Singh. Joint work with: Aaditya Ramdas

2 Connections between convex optimization and active learning (a formal reduction)

3 Role of feedback in convex optimization and in active learning. Convex optimization: minimize computational complexity (# queries needed to find the optimum x*). Active learning: minimize sample complexity (# queries needed to find the decision boundary). [Raginsky-Rakhlin 09] (Figure: a convex f(x) with minimizer x* on domain X, and a regression function on X with its decision boundary.)

4 Active learning oracle model. Oracle provides Y ∈ {0, 1}: query x_1 → Y_1, query x_2 → Y_2, ..., query x_N → Y_N, where E[Y | X] = P(Y = 1 | X).

5 Stochastic optimization oracle model (first-order). Oracle provides noisy values of f(x) and g(x) = ∇f(x): query x_1 → f̂(x_1), ĝ(x_1); query x_2 → f̂(x_2), ĝ(x_2); ...; query x_N → f̂(x_N), ĝ(x_N), with E[f̂(x)] = f(x) and E[ĝ(x)] = g(x) (unbiased, variance σ²).
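This oracle is easy to emulate in code. Below is a minimal sketch, assuming Gaussian noise with standard deviation σ; the class name StochasticFirstOrderOracle and the toy quadratic in the usage line are illustrative choices, not from the talk.

```python
import numpy as np

class StochasticFirstOrderOracle:
    """Returns unbiased, noisy values of f(x) and g(x) = grad f(x)."""

    def __init__(self, f, grad, sigma=1.0, rng=None):
        self.f, self.grad, self.sigma = f, grad, sigma
        self.rng = rng or np.random.default_rng(0)

    def query(self, x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        f_hat = self.f(x) + self.sigma * self.rng.standard_normal()
        g_hat = self.grad(x) + self.sigma * self.rng.standard_normal(x.shape)
        return f_hat, g_hat  # E[f_hat] = f(x), E[g_hat] = grad f(x)

# Example: f(x) = ||x||^2 / 2, so grad f(x) = x
oracle = StochasticFirstOrderOracle(lambda x: 0.5 * x @ x, lambda x: x, sigma=0.5)
print(oracle.query(np.array([1.0, -2.0])))
```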

6 Connections in the 1-dim noiseless setting. Convex optimization: for a convex f(x) on domain X, P(sign(g(x)) = + | X) is a step function, jumping from 0 to 1 at the minimizer x*. Active learning: P(Y = 1 | X) is a step function, jumping at the decision boundary. (Figure: the two step functions over X.)

7 Connections in the 1-dim noisy setting. Convex optimization: with zero-mean noise on the gradient, P(sign(ĝ(x)) = + | X) still crosses 1/2 at the minimizer x*, but is no longer a step function. Active learning: P(Y = 1 | X) crosses 1/2 at the decision boundary. (Figure: the two regression functions over X.)

8 Minimax active learning rates in 1-dim. If the Tsybakov Noise Condition (TNC) holds with exponent κ ≥ 1, i.e. |P(Y = 1 | X = x) − 1/2| ≳ ‖x − x*‖^(κ−1), then the minimax optimal active learning rate in 1-dim is E[‖x̂_N − x*‖] ≍ N^(−1/(2κ−2)), and under 0/1 loss plus smoothness of P(Y | X), Risk(x̂_N) − Risk(x*) ≍ N^(−κ/(2κ−2)). [Castro-Nowak 07] (For κ = 2 these rates are N^(−1/2) and N^(−1).)

9 TNC and strong convexity. Strong convexity implies TNC with κ = 2: f(x) − f(x*) ≥ (λ/2)‖x − x*‖² implies ‖g(x) − g(x*)‖ ≥ λ‖x − x*‖, with g(x*) = 0. If the noise density grows linearly around its zero mean (Gaussian, uniform, triangular), then |P(sign(ĝ(x)) = + | X = x) − 1/2| ≳ ‖x − x*‖.
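The implication on this slide can be spelled out in one chain of inequalities; the following is a sketch in 1-dim assuming Gaussian gradient noise N(0, σ²) (the constant c can be taken as Φ(1) − 1/2).

```latex
f(x)-f(x^*) \;\ge\; \tfrac{\lambda}{2}\,|x-x^*|^2
\;\Longrightarrow\; |g(x)| \;\ge\; \lambda\,|x-x^*| \qquad (g(x^*)=0).

\text{With } \hat g(x) = g(x) + \sigma Z,\ Z \sim N(0,1),\ \text{and } x > x^*:\quad
P\big(\operatorname{sign}(\hat g(x)) = +\big) - \tfrac12
 \;=\; \Phi\!\big(g(x)/\sigma\big) - \tfrac12
 \;\ge\; c\,\min\!\Big(\tfrac{\lambda\,|x-x^*|}{\sigma},\,1\Big).
```

So near x* the label sign(ĝ(x)) satisfies the TNC with exponent κ = 2, which is exactly the condition under which the 1-dim active learning rates above apply.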

10 Algorithmic reduction (1-dim). In 1-dim, consider any active learning algorithm that is optimal for TNC exponent κ = 2. When given labels Y = sign(ĝ(x)), where f(x) is a strongly convex function with Lipschitz gradients, it yields E[‖x̂_T − x*‖] = O(T^(−1/2)) and E[f(x̂_T) − f(x*)] = O(T^(−1)). This matches the optimal rates for strongly convex functions [Nemirovski-Yudin 83, Agarwal-Bartlett-Ravikumar-Wainwright 10]. What about d-dim?
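As a toy illustration of this reduction (not the optimal active learning procedure analyzed in the talk), the sketch below feeds Y = sign(ĝ(x)) to a repeated-query bisection; all names and the per-round repetition schedule are assumptions.

```python
import numpy as np

def bisect_on_gradient_sign(noisy_grad, lo=0.0, hi=1.0, budget=2000):
    """Locate the minimizer of a 1-dim convex f on [lo, hi] using only the
    binary labels Y = sign(noisy_grad(x)), via repeated-query bisection."""
    rounds = max(1, int(np.log2(budget)))   # halve the interval this many times
    per_round = max(1, budget // rounds)    # label queries spent at each midpoint
    for _ in range(rounds):
        mid = 0.5 * (lo + hi)
        votes = sum(np.sign(noisy_grad(mid)) for _ in range(per_round))
        if votes > 0:    # gradient looks positive -> minimizer lies to the left
            hi = mid
        else:            # gradient looks non-positive -> minimizer lies to the right
            lo = mid
    return 0.5 * (lo + hi)

# Example: f(x) = (x - 0.3)^2, so g(x) = 2(x - 0.3), with Gaussian gradient noise
rng = np.random.default_rng(1)
noisy_g = lambda x: 2.0 * (x - 0.3) + rng.normal(scale=0.5)
print(bisect_on_gradient_sign(noisy_g, budget=4000))  # close to 0.3
```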

11 1-dim vs. d-dim. Convex optimization: the minimizer x* is a point (0-dim); rate T^(−1). Active learning: the decision boundary is a (d−1)-dim curve; rate N^(−2/(2+d−1)). Complexity of convex optimization in any dimension is the same as the complexity of active learning in 1 dimension.

12 Algorithmic reduction (d-dim): random coordinate descent with a 1-dim active learning subroutine. For e = 1, ..., E = d(log T)²: choose coordinate j at random from 1, ..., d; do active learning along coordinate j with sample budget T_e = T/E, treating sign(ĝ_j(x_t)) as the label Y_t. If f is strongly convex with Lipschitz gradients: inf_x̂ sup_{f,O} E[‖x̂ − x*_f‖] = Õ(T^(−1/2)) and inf_x̂ sup_{f,O} E[f(x̂) − f(x*)] = Õ(T^(−1)).
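A minimal sketch of the d-dim reduction under the same caveats: random coordinate choice per epoch, a bisection-style 1-dim subroutine on the sign of the chosen partial derivative, and a separable quadratic as the test function. The epoch count d(log T)² follows the slide; everything else (interval width, repetition schedule) is a placeholder.

```python
import numpy as np

def coordinate_descent_with_1d_al(noisy_grad, d, T, radius=1.0, rng=None):
    """Random coordinate descent: each epoch runs 1-dim active learning
    (repeated-query bisection on the sign of the j-th partial derivative)."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(d)
    E = max(1, int(d * np.log(T) ** 2))   # number of epochs, budget T_e = T/E each
    T_e = max(1, T // E)
    for _ in range(E):
        j = rng.integers(d)               # pick a coordinate uniformly at random
        lo, hi = x[j] - radius, x[j] + radius
        rounds = max(1, int(np.log2(T_e)))
        per_round = max(1, T_e // rounds)
        for _ in range(rounds):           # 1-dim subroutine along coordinate j
            mid = 0.5 * (lo + hi)
            x_mid = x.copy()
            x_mid[j] = mid
            votes = sum(np.sign(noisy_grad(x_mid)[j]) for _ in range(per_round))
            lo, hi = (lo, mid) if votes > 0 else (mid, hi)
        x[j] = 0.5 * (lo + hi)
    return x

# Example: f(x) = 0.5 * ||x - c||^2 with noisy gradients
c = np.array([0.3, -0.2, 0.5])
rng = np.random.default_rng(2)
g_hat = lambda x: (x - c) + 0.3 * rng.standard_normal(x.shape)
print(coordinate_descent_with_1d_al(g_hat, d=3, T=30000, rng=rng))
```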

13 Degree of convexity via the Tsybakov noise condition (TNC)

14 Degree of convexity via TNC. TNC for convex functions, κ > 1: f(x) − f(x*) ≥ λ‖x − x*‖^κ, which gives ‖g(x) − g(x*)‖ ≥ λ‖x − x*‖^(κ−1). This controls the strength of convexity around the minimum x*. A uniformly convex function (κ ≥ 2) implies TNC; uniform convexity controls the strength of convexity everywhere in the domain.

15 Minimax convex optimization rates. Theorem: If the TNC for convex functions holds, f(x) − f(x*) ≥ λ‖x − x*‖^κ with κ > 1, and f is Lipschitz, then the minimax optimal convex optimization rate over a bounded set (diameter ≤ 1) in d dimensions is ‖x̂_T − x*‖ ≍ T^(−1/(2κ−2)) and f(x̂_T) − f(x*) ≍ T^(−κ/(2κ−2)). Precisely the rates for 1-dim active learning! (Function-error rates: T^(−3/2) for κ = 3/2, T^(−1) for strongly convex κ = 2, T^(−1/2) for general convex, κ → ∞.)

16 Lower bounds based on active learning: inf_x̂ sup_{f ∈ S, O} E[‖x̂ − x*_f‖] = Ω(T^(−1/(2κ−2))). Setup: S = [0, 1]^d ∩ {‖x‖ ≤ 1}; oracle O: f̂(x) ~ N(f(x), σ²), ĝ(x) ~ N(g(x), σ² I_d). Two candidate functions: f_0(x) = c_1 Σ_{i=1}^d |x_i|^κ, and f_1(x) = c_1(|x_1 − 2a|^κ + Σ_{i=2}^d |x_i|^κ) + c_2 for x_1 ≤ 4a, f_1(x) = f_0(x) otherwise. (Figure: f_0 and f_1 agree for x_1 ≥ 4a; f_1's minimizer sits at x_1 = 2a.) Observation distributions: P_0 = P({X_i, f_0(X_i), g_0(X_i)}_{i=1}^T), P_1 = P({X_i, f_1(X_i), g_1(X_i)}_{i=1}^T).

17 Lower bounds based on active learning: inf_x̂ sup_f E[‖x̂ − x*_f‖] = Ω(T^(−1/(2κ−2))). Fano's inequality: if KL(P_0, P_1) ≤ constant, then inf_x̂ sup_f P(‖x̂ − x*_f‖ > ‖x*_{f_0} − x*_{f_1}‖/2) ≥ constant. KL(P_0, P_1) is controlled by the query that yields the maximum difference between the function/gradient values of f_0 and f_1 [Castro-Nowak 07], and is ≤ constant provided ‖x*_{f_0} − x*_{f_1}‖/2 = a ≍ T^(−1/(2κ−2)).
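For completeness, here is a sketch of why this choice of a keeps the KL divergence constant, using the Gaussian oracle and the construction from the previous slide (constants suppressed).

```latex
KL(P_0, P_1)
  \;=\; \sum_{i=1}^{T} \mathbb{E}\,
        \frac{\big(f_0(X_i)-f_1(X_i)\big)^2 + \|g_0(X_i)-g_1(X_i)\|^2}{2\sigma^2}
  \;\le\; \frac{T}{2\sigma^2}\,
        \max_{x}\Big[\big(f_0(x)-f_1(x)\big)^2 + \|g_0(x)-g_1(x)\|^2\Big]
  \;\lesssim\; \frac{T\,a^{2\kappa-2}}{\sigma^2},
```

since f_0 and f_1 differ only on the slab x_1 ≤ 4a, where their gradients differ by at most order a^(κ−1). Hence KL(P_0, P_1) stays below a constant once a ≍ T^(−1/(2κ−2)), while the two minimizers are 2a apart, which forces any estimator to incur error Ω(a) = Ω(T^(−1/(2κ−2))) against one of f_0, f_1.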

18 Lower bounds based on active learning (continued): inf_x̂ sup_f E[‖x̂ − x*_f‖] = Ω(T^(−1/(2κ−2))). The same Fano argument (if KL(P_0, P_1) ≤ constant, then inf_x̂ sup_f P(‖x̂ − x*_f‖ > ‖x*_{f_0} − x*_{f_1}‖/2) ≥ constant), with KL(P_0, P_1) bounded via the query that yields the maximum difference between function/gradient values [Castro-Nowak 07], also yields lower bounds for uniformly convex functions and for the zeroth-order oracle, matching [Iouditski-Nesterov 10, Jamieson-Nowak-Recht 12].

19 Epoch-based gradient descent. Initialize e = 1, the starting point x_1^1, the epoch length T_1, the radius R_1, and the step size; until the oracle budget T is exhausted, run projected gradient descent within the current ball, then update R_{e+1} and the step size and set e ← e + 1 (this update requires knowledge of κ). If f is a convex function that satisfies TNC(κ) and is Lipschitz: inf_x̂ sup_{f,O} E[‖x̂ − x*_f‖] = Õ(T^(−1/(2κ−2))) and inf_x̂ sup_{f,O} E[f(x̂) − f(x*)] = Õ(T^(−κ/(2κ−2))).
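A minimal sketch of such an epoch scheme, assuming κ is known; the epoch lengths, step sizes, and the shrink factor R_{e+1} = R_e · 2^(−1/(κ−1)) are illustrative placeholders rather than the constants from the talk.

```python
import numpy as np

def epoch_gd(noisy_grad, d, T, kappa=2.0, R1=1.0, G=1.0):
    """Epoch-based projected GD: averaged projected gradient steps inside a ball
    of radius R_e around the current center, then shrink the ball and repeat."""
    center = np.zeros(d)
    R, spent, e = R1, 0, 1
    while spent < T:
        T_e = min(2 ** e * 100, T - spent)          # epoch budget (placeholder schedule)
        eta = R / (G * np.sqrt(T_e))                # step size for convex Lipschitz f
        x, x_avg = center.copy(), np.zeros(d)
        for t in range(T_e):
            x = x - eta * noisy_grad(x)
            diff = x - center                       # project back onto B(center, R)
            nrm = np.linalg.norm(diff)
            if nrm > R:
                x = center + R * diff / nrm
            x_avg += (x - x_avg) / (t + 1)          # running average of iterates
        center = x_avg
        R = R / 2 ** (1.0 / (kappa - 1.0))          # shrink rate uses knowledge of kappa
        spent += T_e
        e += 1
    return center

# Example: f(x) = 0.5 * ||x - c||^2 (TNC with kappa = 2), noisy gradients
c = np.array([0.4, -0.1])
rng = np.random.default_rng(3)
g_hat = lambda x: (x - c) + 0.2 * rng.standard_normal(x.shape)
print(epoch_gd(g_hat, d=2, T=20000, kappa=2.0))
```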

20 Adapting to the degree of convexity

21 Adapting to the degree of convexity. For e = 1, ..., E = log √(T/log T) (ignoring κ): run any optimization procedure that is optimal for convex functions with sample budget T_e = T/E; then set R_{e+1} = R_e/2, e ← e + 1. Adapted from [Iouditski-Nesterov 10]. Since λ‖x_e − x*_e‖^κ ≤ f(x_e) − f(x*_e) ≲ R_e/√T_e (the rate for convex Lipschitz functions over a ball of radius R_e), there exists ē such that ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)) and x*_ē = x*.

22 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Figure: epoch 1, search ball of radius R_1 centered at x_1, containing x*_1 = x*.

23 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Figure: epoch 2, search ball of radius R_2 centered at x_2, containing x*_2 = x*.

24 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Figure: epoch ē, ball of radius R_ē around x_ē, with ‖x_ē − x*‖ ≲ T^(−1/(2κ−2)).

25 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Moreover, for all e ≥ ē, ‖x_e − x_ē‖ ≲ T^(−1/(2κ−2)). Figure: later epochs e > ē, with iterates x_e staying within T^(−1/(2κ−2)) of x_ē.

26 Adapting to the degree of convexity: guarantee. Same procedure (E = log √(T/log T) epochs, each running a convex-optimal procedure with budget T_e = T/E, then R_{e+1} = R_e/2). If f is a convex function that satisfies TNC(κ) and is Lipschitz: inf_x̂ sup_{f,O} E[‖x̂ − x*_f‖] = Õ(T^(−1/(2κ−2))) and inf_x̂ sup_{f,O} E[f(x̂) − f(x*)] = Õ(T^(−κ/(2κ−2))).
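A sketch of the κ-free wrapper: equal budget per epoch, halve the radius each epoch, and delegate each epoch to any procedure that is optimal for convex Lipschitz functions (here a projected averaged SGD supplied by the caller). The epoch count below approximates the log √T-type schedule on the slide; the rest is illustrative.

```python
import numpy as np

def adaptive_epochs(convex_opt_in_ball, dim, T):
    """kappa-free wrapper: R_{e+1} = R_e / 2, T_e = T / E, E ~ log sqrt(T) epochs."""
    E = max(1, int(0.5 * np.log2(T)))     # roughly log sqrt(T) epochs, independent of kappa
    T_e = max(1, T // E)
    center, R = np.zeros(dim), 1.0
    for _ in range(E):
        # any optimal convex-Lipschitz procedure, restricted to the ball B(center, R)
        center = convex_opt_in_ball(center, R, T_e)
        R /= 2.0
    return center

# Example epoch subroutine: projected averaged SGD on f(x) = 0.5 * ||x - c_true||^2
c_true = np.array([0.3, -0.4])
rng = np.random.default_rng(4)

def convex_opt_in_ball(center, R, n):
    x, avg = center.copy(), np.zeros_like(center)
    eta = R / np.sqrt(n)
    for t in range(n):
        g = (x - c_true) + 0.3 * rng.standard_normal(x.shape)   # noisy gradient
        x = x - eta * g
        diff = x - center
        if np.linalg.norm(diff) > R:                            # project onto B(center, R)
            x = center + R * diff / np.linalg.norm(diff)
        avg += (x - avg) / (t + 1)
    return avg

print(adaptive_epochs(convex_opt_in_ball, dim=2, T=20000))
```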

27 Adaptive active learning

28 Adaptive 1-dim active learning: Robust Binary Search, adaptive to κ. For e = 1, ..., E = log √(T/log T) (ignoring κ): do passive learning on the current interval with sample budget T_e = T/E; then set R_{e+1} = R_e/2, e ← e + 1. Adapted from [Iouditski-Nesterov 10].

29 Adaptive 1-dim active learning (same procedure). Since c‖x_e − x*_e‖^κ ≤ Risk(x_e) − Risk(x*_e) ≲ R_e/√T_e (the passive rate for threshold classifiers), there exists ē such that ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)) and x*_ē = x*.

30 Adaptive 1-dim active learning (same procedure). Moreover, for all e ≥ ē, ‖x_e − x_ē‖ ≲ T^(−1/(2κ−2)), with x*_ē = x*. Much simpler than [Hanneke 09].
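A toy version of the robust-binary-search idea: each epoch spends T_e passive (uniformly sampled) label queries on the current interval, fits the best threshold classifier by empirical risk, and then halves the interval around it. The details below are illustrative assumptions, not the exact procedure from the talk.

```python
import numpy as np

def robust_binary_search(label_oracle, T, lo=0.0, hi=1.0, rng=None):
    """Each epoch: passive threshold learning on [lo, hi], then halve the interval."""
    rng = rng or np.random.default_rng(0)
    E = max(1, int(0.5 * np.log2(T)))          # roughly log sqrt(T) epochs
    T_e = max(1, T // E)
    for _ in range(E):
        xs = rng.uniform(lo, hi, size=T_e)     # passive (uniform) sample on the interval
        ys = np.array([label_oracle(x) for x in xs])
        order = np.argsort(xs)
        xs, ys = xs[order], ys[order]
        # empirical risk of threshold classifiers "predict 1 iff x >= threshold":
        # errors[k] = (# of 1-labels among first k points) + (# of 0-labels among the rest)
        ones_left = np.concatenate(([0], np.cumsum(ys == 1)))
        zeros_right = (ys == 0).sum() - np.concatenate(([0], np.cumsum(ys == 0)))
        errors = ones_left + zeros_right
        k = int(np.argmin(errors))
        t_hat = lo if k == 0 else (hi if k == T_e else 0.5 * (xs[k - 1] + xs[k]))
        half = (hi - lo) / 4.0                 # halve the interval width around t_hat
        lo, hi = t_hat - half, t_hat + half
    return 0.5 * (lo + hi)

# Example: P(Y=1 | x) = clip(0.5 + (x - 0.37), 0, 1), decision boundary at 0.37
rng = np.random.default_rng(5)
oracle = lambda x: int(rng.random() < np.clip(0.5 + (x - 0.37), 0, 1))
print(robust_binary_search(oracle, T=20000, rng=rng))
```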

31 Reference & future directions. A. Ramdas and A. Singh, "Optimal rates for stochastic convex optimization under Tsybakov noise condition", to appear at ICML (available on arXiv). Open directions: Reduction from d-dim convex optimization to 1-dim active learning for κ-TNC functions (κ ≠ 2)? Adaptive d-dimensional active learning / model selection in active learning? Porting active learning results to yield non-convex optimization guarantees?
