Minimax rates for Batched Stochastic Optimization

1 Minimax rates for Batched Stochastic Optimization John Duchi based on joint work with Feng Ruan and Chulhee Yun Stanford University

2-3 Tradeoffs Major problem in theoretical statistics: how do we characterize statistical optimality for problems with constraints? Examples of constraints: Computational [Berthet & Rigollet 13, Ma & Wu 15, Brennan et al. 18, Feldman et al. 18]; Privacy [Dwork et al. 06, Hardt & Talwar 09, Duchi et al. 13]; Robustness [Huber 81, Hardt & Moitra 13, Diakonikolas et al. 16]; Memory / communication [Duchi et al. 14, Braverman et al. 15, Steinhardt & Duchi 16].

4-8 Tradeoffs [Figure: statistical performance as a function of data set size $n$, traded off against other resources: computation, energy use, disclosure risk (privacy).]

9-10 Problem Setting minimize $f(x)$ where $f$ is convex, given mean-zero noisy gradient information $g = \nabla f(x) + \xi$. What is the computational complexity for these problems?

11-13 Stochastic Gradient methods Iterate (for $k = 1, 2, \ldots$): $g_k = \nabla f(x_k) + \xi_k$, $x_{k+1} = x_k - \alpha_k g_k$. [Diagram: a fully sequential chain in which each iterate $x_{k+1}$ depends on the noisy gradient $g_k$ observed at $x_k$.]
Theorem (Nemirovski & Yudin 83; Nemirovski et al. 09; Agarwal et al. 11). After $k$ iterations, we have the (optimal) convergence $\mathbb{E}[f(x_k)] - f^\star \lesssim \frac{1}{\sqrt{k}}$.
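
To make the iteration concrete, here is a minimal Python sketch (not from the talk) of stochastic gradient descent on a noisy quadratic; the objective, noise level, and $1/k$ step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.0                       # dimension and gradient-noise level (illustrative)
x_star = rng.normal(size=d)             # unknown minimizer of f(x) = 0.5 * ||x - x_star||^2

def noisy_gradient(x):
    """Unbiased gradient oracle: grad f(x) plus mean-zero Gaussian noise."""
    return (x - x_star) + sigma * rng.normal(size=d)

x = np.zeros(d)
for k in range(1, 10_001):
    g = noisy_gradient(x)
    x = x - g / k                       # step size alpha_k = 1/k for this strongly convex example

print("f(x) - f* =", 0.5 * np.linalg.norm(x - x_star) ** 2)
```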

14-15 Parallelization and interactivity? Stochastic gradient descent requires many iterations and lots of interaction, and admits no parallelism. [Diagram: the sequential chain $x_1, \ldots, x_9$ with one gradient per iterate, versus the same nine gradients grouped into three batches feeding only the iterates $x_1, x_2, x_3$.] Can we trade off breadth for depth?

16-18 Batched optimization? Medical trials [Perchet et al. 16, Hardwick & Stout 02, Stein 45]. Ideal: get a patient, give a treatment, observe the outcome, then adapt before the next patient; but each such round takes a year, so a fully sequential trial would take 100s of years. Reality: patients must be treated in a small number of large batches.

19 Local Differential Privacy

20-25 Problem Statement Problem: given $M$ rounds of adaptation and $n$ computations, what is the optimal error in optimization? Let $\mathcal{A}_{M,n}$ denote the algorithms with $M$ rounds of computation and $n$ (noisy) gradient computations, and let $\mathcal{F}$ be the function class of interest. Study the minimax optimization error
$$\mathcal{M}_{M,n}(\mathcal{F}) := \inf_{\hat x \in \mathcal{A}_{M,n}}\ \sup_{f \in \mathcal{F}}\ \big\{\mathbb{E}[f(\hat x)] - f^\star\big\}.$$
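
To make the class $\mathcal{A}_{M,n}$ concrete, the interaction protocol can be written schematically as below; splitting the $n$ gradient evaluations evenly across the $M$ rounds is an assumption of this sketch, not something the slides specify.

```latex
\text{For rounds } t = 1, \ldots, M:
\quad \text{choose query points } x^{(t)}_1, \ldots, x^{(t)}_{n/M}
      \text{ using only observations from rounds } 1, \ldots, t-1;
\quad \text{observe } g^{(t)}_i = \nabla f\big(x^{(t)}_i\big) + \xi^{(t)}_i,
      \;\; \mathbb{E}\big[\xi^{(t)}_i\big] = 0.
\qquad
\text{After round } M, \text{ output } \hat{x} \text{ as a function of all } n \text{ observations.}
```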

26-29 Background and hopes
[Perchet, RCS 18] Batched bandits: for the 2-armed bandit, optimal regret is achievable with $M = O(\log\log n)$ batches.
[Nemirovski et al. 09, Ghadimi & Lan 12] Stochastic strongly convex optimization: $f(\hat x) - f^\star \lesssim \|x_0 - x^\star\|^2 \exp\big(-M/\sqrt{\mathrm{Cond}(f)}\big) + \frac{\mathrm{Var}(\xi)}{n}$, i.e. $M \gtrsim \sqrt{\mathrm{Cond}(f)}\, \log n$ rounds suffice.
[Smith, TU 17] To solve convex optimization, need $M \gtrsim \log(1/\epsilon)$ rounds.

30-32 Main Results Define $\mathcal{F}_{H,\lambda} := \{\lambda\text{-strongly convex}, H\text{-smooth } f\}$.
Theorem (D., Ruan, Yun 18). $\mathcal{M}_{M,n}(\mathcal{F}_{H,\lambda}) \geq C(d, n)\, n^{-\left(1 - (\frac{d}{d+2})^{M}\right)}$, where the prefactor $C(d, n)$ is sub-polynomial in $n$.
Theorem (D., Ruan, Yun 18). If $M \leq \frac{d}{2}\log\log n$, then there is an algorithm such that $P\Big(f(\hat x) - f^\star \geq C\, n^{-\left(1 - (\frac{d}{d+2})^{M}\right)} \log n\Big) \to 0$.
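
The exponent $1 - (d/(d+2))^M$ interpolates between the one-round rate and the fully sequential rate $1/n$; the short check below (with illustrative values of $d$, $M$, and $n$) evaluates it and the number of rounds suggested by the second theorem.

```python
import math

def batched_exponent(d: int, M: int) -> float:
    """Exponent of n in the batched minimax rate n^{-(1 - (d/(d+2))^M)}."""
    return 1.0 - (d / (d + 2)) ** M

n = 10**6
for d in (2, 10, 50):
    # rounds needed so that the batched rate is within a constant of the sequential 1/n rate
    rounds = math.ceil(math.log(math.log(n)) / math.log(1 + 2 / d))
    print(f"d={d:3d}: exponent with M=3 is {batched_exponent(d, 3):.3f}, "
          f"~{rounds} rounds reach the sequential rate")
```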

33-34 Achievability For convex functions $f$, $x^\star \in \{y : \langle \nabla f(x), y - x\rangle \leq 0\}$: every gradient evaluation at $x$ rules out the open halfspace $\{y : \langle \nabla f(x), y - x\rangle > 0\}$ as a location for the minimizer.

35-39 Achievability Maintain a feasible box $B_t = c_t + [-r_t, r_t]^d$ with center $c_t$. At round $t$, take points $x_i \in B_t$, $i = 1, \ldots, m$, and get parallel (noisy) gradient estimates $\hat\nabla f(x_1), \ldots, \hat\nabla f(x_m)$. Then, w.h.p., $x^\star \in \{y : \langle \hat\nabla f(x_i), y - x_i\rangle \leq \varepsilon\, \|y - x_i\|\}$ for each $i$, where $\varepsilon$ accounts for the gradient-estimation error. [Figure: a box with query points $x_1, \ldots, x_4$ and the resulting approximate halfspace cuts.]
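
A minimal sketch of one such round, assuming a quadratic objective, Gaussian gradient noise, and a brute-force candidate grid for applying the cuts; this only illustrates the feasibility-set idea and is not the algorithm analyzed in the talk.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 2, 0.5
x_star = np.array([0.3, -0.2])          # unknown minimizer (illustrative)

def noisy_grad(x, repeats):
    """Average of `repeats` noisy gradients of f(x) = 0.5 * ||x - x_star||^2."""
    noise = sigma * rng.normal(size=(repeats, d)).mean(axis=0)
    return (x - x_star) + noise

def one_round(center, r, m_per_side=3, repeats=50, eps=0.2):
    """One batched round: query a grid in the box, keep the candidate points that
    satisfy every (noisy) gradient cut, and return the bounding box of survivors."""
    side = np.linspace(-r, r, m_per_side)
    queries = [center + np.array(p) for p in itertools.product(side, repeat=d)]
    grads = [noisy_grad(x, repeats) for x in queries]

    cand_side = np.linspace(-r, r, 41)  # dense candidate grid over the current box
    candidates = np.array([center + np.array(p)
                           for p in itertools.product(cand_side, repeat=d)])
    keep = np.ones(len(candidates), dtype=bool)
    for x, g in zip(queries, grads):
        diff = candidates - x
        keep &= (diff @ g <= eps * np.linalg.norm(diff, axis=1))

    survivors = candidates[keep]
    if len(survivors) == 0:             # noise too large this round: keep the old box
        return center, r
    new_center = survivors.mean(axis=0)
    new_r = float(np.abs(survivors - new_center).max())
    return new_center, new_r

center, r = np.zeros(d), 1.0
for t in range(4):
    center, r = one_round(center, r)
    print(f"round {t + 1}: center={np.round(center, 3)}, radius={r:.3f}")
```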

40-45 Recursive shrinking The box radius decreases as $r_t \leq \delta\, r_{t-1}^{\gamma}$ for constants $\delta < 1$ and $\gamma < 1$ determined below; or, recursively,
$$r_t \leq \delta\, r_{t-1}^{\gamma} \leq \delta^{1+\gamma}\, r_{t-2}^{\gamma^2} \leq \cdots \leq \delta^{\sum_{j=0}^{t-1}\gamma^j}\, r_0^{\gamma^t}.$$
[Figure: nested boxes of radii $r_1 > r_2 > r_3$.]

46-49 Recursive shrinking For us, in dimension $d$: $\gamma = \frac{d}{d+2}$ and $\delta = n^{-\frac{1}{d+2}}$, so that (taking $r_0 \leq 1$) $r_t \leq \delta^{\sum_{j<t}\gamma^j} = n^{-\frac{1}{2}(1-\gamma^t)}$. The box is small enough that the optimality gap (which smoothness bounds by a multiple of $r_t^2$) is of order $\frac{1}{n}$ iff $\gamma^t \lesssim \frac{1}{\log n}$. Solution: $t \gtrsim \frac{\log\log n}{\log(1/\gamma)} = \frac{\log\log n}{\log(1 + 2/d)} \approx \frac{d}{2}\log\log n$ rounds.
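
A quick numerical check of this recursion (with illustrative $n$ and $d$, ignoring constants): iterate $r_t = \delta r_{t-1}^{\gamma}$ and count rounds until the gap proxy $r_t^2$ is within a constant of $1/n$.

```python
import math

def rounds_until_sequential_rate(n: int, d: int) -> int:
    """Iterate r_t = delta * r_{t-1}^gamma with gamma = d/(d+2), delta = n^{-1/(d+2)},
    starting from r_0 = 1, until the gap proxy r_t^2 is within a factor 2 of 1/n."""
    gamma, delta = d / (d + 2), n ** (-1.0 / (d + 2))
    r, t = 1.0, 0
    while r * r > 2.0 / n:
        r = delta * r ** gamma
        t += 1
    return t

n, d = 10**8, 10
print("rounds needed   :", rounds_until_sequential_rate(n, d))
print("(d/2) log log n ~", round(0.5 * d * math.log(math.log(n)), 1))
```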

50 Main Results (recap) Restating the two theorems above: over $\mathcal{F}_{H,\lambda}$, the batched minimax rate is $n^{-\left(1 - (\frac{d}{d+2})^{M}\right)}$ up to lower-order factors, so the fully sequential rate $1/n$ is attained (up to logarithms) once $M \gtrsim \frac{d}{2}\log\log n$.

51-55 How do we prove lower bounds? 1. Construct two functions where optimizing one means not optimizing the other [Agarwal BRW 13, Duchi 17]. 2. Make the information available to the algorithm to distinguish them small. [Figure: $f_0$ and $f_1$ separated by $\mathrm{dist}(f_0, f_1)$.] Together these give the two-point bound
$$\inf_{\hat x}\, \max_{v \in \{0,1\}} \mathbb{E}\big[f_v(\hat x) - f_v^\star\big] \;\geq\; \frac{\mathrm{dist}(f_0, f_1)}{2}\; \underbrace{\inf_{\mathrm{Alg}\ A} P\big(A \text{ fails to distinguish } f_0, f_1\big)}_{=\,1 - \|P_0 - P_1\|_{\mathrm{TV}}}.$$
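
For completeness, the standard reduction behind this display, written out as a sketch; defining $\mathrm{dist}(f_0, f_1)$ as the unavoidable combined suboptimality of a single point is one common convention and is an assumption here.

```latex
% Le Cam-style two-point reduction (sketch).
% Define dist(f_0, f_1) := inf_x [ (f_0(x) - f_0^\star) + (f_1(x) - f_1^\star) ],
% the combined suboptimality that any single point x must incur.
\begin{aligned}
\max_{v \in \{0,1\}} \mathbb{E}_v\bigl[f_v(\hat x) - f_v^\star\bigr]
  &\ge \tfrac{1}{2}\Bigl(\mathbb{E}_0\bigl[f_0(\hat x) - f_0^\star\bigr]
        + \mathbb{E}_1\bigl[f_1(\hat x) - f_1^\star\bigr]\Bigr) \\
  &\ge \tfrac{1}{2}\,\mathrm{dist}(f_0, f_1)\,
       \bigl(1 - \|P_0 - P_1\|_{\mathrm{TV}}\bigr),
\end{aligned}
% since the two expectations can be lower bounded on the event where the
% observation distributions P_0 and P_1 overlap, which has mass 1 - TV.
```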

56-60 Lower bound: recursive packing [Figure: a depth-$M$ tree of nested balls.] Let $U^{(1)}$ be a $\delta_1$-packing of the initial set, and for each $u$ let $U^{(t)}_u$ be a $2\delta_t$-packing of the ball centered at $u$. Idea: 1. define functions recursively on the balls; 2. optimizing well means identifying the correct ball at each level.
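
As background (a standard volume argument, not a slide from the talk), and assuming the ball around $u$ at level $t-1$ has radius of order $\delta_{t-1}$, such packings can be taken of size

```latex
% A Euclidean ball of radius R in R^d contains a 2\delta-packing of cardinality
% at least (R / (2\delta))^d (volume comparison), so
\#\, U^{(t)}_u \;\gtrsim\; \Bigl(\frac{\delta_{t-1}}{2\delta_t}\Bigr)^{d}
\;\asymp\; \Bigl(\frac{\delta_{t-1}}{\delta_t}\Bigr)^{d},
% which is the "# in packing" count that reappears in the information recursion below.
```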

61-64 Function constructions Index functions by a path down the tree, $u_{1:M} = (u_1, \ldots, u_M)$. The construction satisfies
$$f^{(1)}_{u_{1:M}}(x) = f^{(0)}_{u_{1:M}}(x) \ \text{ for } x \notin u_M + \delta_M \mathbb{B}, \qquad f^{(\pm 1)}_{u_{1:M}}(x) = f^{(\pm 1)}_{u_{1:t}, \tilde u_{t+1:M}}(x) \ \text{ for } x \notin u_t + \delta_t \mathbb{B};$$
that is, functions whose index paths agree through level $t$ are identical outside the level-$t$ ball. Optimizing well means identifying the sequence $u_1, u_2, \ldots, u_M$ defining the function.

65 Function construction Recurse: $f_{u_1}(x) = \frac{1}{2}\|x - u_1\|^2$ and $f_{u_{1:t}}(x) = \mathrm{SmoothMax}\big\{f_{u_{1:t-1}}(x),\ \|x - u_t\|^2 + b_t\big\}$. [Figure: two quadratics $f_1, f_2$ and a smooth approximation to $\max\{f_1(x), f_2(x)\}$.]
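
The slides do not define SmoothMax here; one standard smooth surrogate for the maximum is the log-sum-exp (softmax) function, sketched below for intuition only (the construction in the paper may use a different smoothing).

```python
import numpy as np

def smooth_max(values, tau=0.05):
    """Log-sum-exp approximation to max(values); overestimates by at most tau * log(len(values)).
    Smaller tau means closer to the hard max but larger derivatives of the smoothing."""
    values = np.asarray(values, dtype=float)
    m = values.max()                    # subtract the max for numerical stability
    return m + tau * np.log(np.exp((values - m) / tau).sum())

# compare with the hard max of two simple quadratics f1, f2 on a grid
xs = np.linspace(-2.0, 2.0, 9)
f1 = 0.5 * (xs - 1.0) ** 2
f2 = 0.5 * (xs + 1.0) ** 2 + 0.1
hard = np.maximum(f1, f2)
soft = np.array([smooth_max([a, b]) for a, b in zip(f1, f2)])
print(np.round(hard - soft, 4))         # differences lie in [-tau*log(2), 0]
```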

66-69 Information recursion At each round $t$ (of $M$):
$$D_{\mathrm{kl}}\big(\nabla f_{u_{1:M}}(x) + \xi \,\big\|\, \nabla f_{u_{1:t}, \tilde u_{t+1:M}}(x) + \xi\big) \;\lesssim\; \delta_t^2.$$
Choose the radius so that each round carries only a constant amount of information relative to the size of the packing:
$$\delta_t^2\, n\, \big(\#\text{ in packing}\big)^{-1} \lesssim 1, \qquad \text{i.e.} \qquad \delta_t^2\, n\, \delta_t^{d}\, \delta_{t-1}^{-d} \lesssim 1.$$
Solution for the lower bound: $\delta_M = n^{-\frac{1}{d+2}\left(1 + \frac{d}{d+2} + \cdots + (\frac{d}{d+2})^{M-1}\right)} = n^{-\frac{1}{2}\left(1 - (\frac{d}{d+2})^{M}\right)}$, so the resulting optimization-error lower bound is of order $\delta_M^2 = n^{-\left(1 - (\frac{d}{d+2})^{M}\right)}$, matching the upper bound.
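
Solving that recursion explicitly (algebra only, taking $\delta_0 \asymp 1$ and treating the constraints as equalities; constants ignored):

```latex
\delta_t^{2}\, n \asymp \Bigl(\frac{\delta_{t-1}}{\delta_t}\Bigr)^{d}
\;\Longrightarrow\;
\delta_t^{\,d+2}\, n \asymp \delta_{t-1}^{\,d}
\;\Longrightarrow\;
\delta_t \asymp \delta_{t-1}^{\,\frac{d}{d+2}}\, n^{-\frac{1}{d+2}} .
% Unrolling with \gamma = d/(d+2) and \delta_0 \asymp 1:
\delta_M \asymp n^{-\frac{1}{d+2}\,\sum_{j=0}^{M-1}\gamma^{j}}
        = n^{-\frac{1-\gamma^{M}}{(d+2)(1-\gamma)}}
        = n^{-\frac{1}{2}\,(1-\gamma^{M})},
\qquad
\delta_M^{2} = n^{-(1-\gamma^{M})} .
```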
