Online (and Distributed) Learning with Information Constraints. Ohad Shamir


Ohad Shamir, Weizmann Institute of Science
Online Algorithms and Learning Workshop, Leiden, November 2014

Main Question
How well can we learn with information constraints on how we can interact with data?

Partial-Information Constraints
Only part of the data is visible to the learning algorithm.
Examples:
- A few coordinates from each example: multi-armed bandits and variants, attribute-efficient learning, missing features...
- A linear projection of each example: bandit linear optimization
- Some other function of each example: partial monitoring, bandit convex optimization...
Price of partial information: known in some cases, setting-dependent.

Memory Constraints
Example (Online Convex Optimization): consider the squared loss or exp-concave losses.
- O(log T) regret is possible, but the known algorithms require Ω(d²) memory [Vovk 1998; Azoury and Warmuth 2001; Hazan et al. 2007]
- O(√T) regret is possible with O(d)-memory algorithms (e.g. online gradient descent)
Is there a price to pay for linear-memory methods?
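To illustrate the O(d)-memory side of this trade-off, here is a minimal sketch of projected online gradient descent for the squared loss; the step size, radius, and data-stream interface are illustrative choices, not taken from the talk. The logarithmic-regret algorithms cited above maintain a d×d matrix, which is where the Ω(d²) memory comes from.

```python
import numpy as np

def online_gradient_descent(stream, d, radius=1.0):
    """O(d)-memory online learner: stores only the current iterate w.

    `stream` yields (a_t, y_t) pairs; the round-t loss is (<w, a_t> - y_t)^2.
    With bounded data and a ~1/sqrt(t) step size, regret grows as O(sqrt(T)).
    """
    w = np.zeros(d)
    total_loss = 0.0
    for t, (a, y) in enumerate(stream, start=1):
        pred = w @ a
        total_loss += (pred - y) ** 2
        grad = 2.0 * (pred - y) * a        # gradient of the squared loss at w
        w -= grad / np.sqrt(t)             # step size ~ 1/sqrt(t)
        norm = np.linalg.norm(w)
        if norm > radius:                  # project back onto the ball of radius `radius`
            w *= radius / norm
    return w, total_loss
```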

Memory Constraints
Example (Online PCA): i.i.d. data stream x_1, ..., x_m in R^d.
- Goal: find the direction(s) with most variance
- Solution: consider the empirical covariance matrix (1/m) Σ_{i=1}^m x_i x_i^T, a d×d matrix that is too large in high dimensions
Can we do online PCA with limited memory? [Warmuth and Kuzmin 2008; Arora et al. 2012; Mitliagkas et al. 2013; Balsubramani et al.]
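For contrast, a common O(d)-memory approach to streaming PCA (related to, though not identical to, the algorithms in the papers cited above) is an Oja-style update of a single direction. A minimal sketch, with an illustrative step-size schedule:

```python
import numpy as np

def oja_streaming_pca(stream, d, lr=1.0, seed=0):
    """Estimate the top principal direction with O(d) memory (Oja's rule).

    `stream` yields vectors x_t in R^d; no d x d covariance matrix is stored.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream, start=1):
        w += (lr / t) * (x @ w) * x        # stochastic step on the Rayleigh quotient
        w /= np.linalg.norm(w)             # keep the estimate on the unit sphere
    return w
```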

Communication Constraints
Training examples are partitioned across a distributed system. Communication is (relatively) slow and expensive. Can we learn with small communication complexity?

Main Questions
- Can we quantify, information-theoretically, how such constraints affect our performance?
- Can we trade off between data size and information constraints?
[Figure: trade-off plot of memory/communication/information per example vs. sample complexity (data size), showing the feasible region]

Related Work
- Problem-specific lower bounds (e.g. multi-armed bandits)
- Learning with communication constraints [e.g. Balcan et al. 2012, Zhang et al. 2013]: non-i.i.d. data or a very small communication budget
- Communication/space bounds in theoretical computer science, but not for learning problems and/or with non-i.i.d. data
- Statistical estimation with limited memory [e.g. Hellman and Cover 1970, Leighton and Rivest 1986, Ertin and Potter 2003, Kontorovich 2012]: asymptotic / small-memory regime

(b, n, m) Protocols
Definition:
For t = 1, ..., m:
- Receive n i.i.d. instances X_t
- Compute W_t = f_t(X_t, W_1, W_2, ..., W_{t-1}), where W_t is at most b bits
Return f(W_1, ..., W_m)
[Figure: each batch X_t is compressed into a b-bit message W_t, which the learning algorithm sees together with W_1, ..., W_{t-1}]
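To make the definition concrete, here is a minimal sketch of the protocol as a driver loop, under the assumption that messages are byte strings; `compress` (the f_t's) and `combine` (the final f) are placeholders supplied by the protocol designer, and the b-bit budget is only checked, not enforced by construction.

```python
def bit_length(message: bytes) -> int:
    """Hypothetical helper: message size in bits, assuming byte-string messages."""
    return 8 * len(message)

def run_protocol(batches, compress, combine, b):
    """Generic (b, n, m) protocol: one round per batch X_t of n i.i.d. instances.

    Each message W_t may depend on X_t and on all previous messages
    W_1, ..., W_{t-1}, but not on the earlier raw batches.
    """
    messages = []                          # W_1, ..., W_{t-1}
    for X_t in batches:
        W_t = compress(X_t, messages)      # W_t = f_t(X_t, W_1, ..., W_{t-1})
        assert bit_length(W_t) <= b, "message exceeds the b-bit budget"
        messages.append(W_t)
    return combine(messages)               # output f(W_1, ..., W_m)
```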

Examples of (b, n, m) Protocols
Bounded-memory online algorithms:
For t = 1, ..., m:
- Receive an i.i.d. instance x_t
- Compute W_t = f_t(x_t, W_{t-1}), where W_t is at most b bits
Return f(W_m)

Examples of (b, n, m) Protocols
Learning from expert advice with partial information:
For t = 1, ..., m:
- A reward vector x_t is generated
- Observe W_t = f_t(x_t, W_1, W_2, ..., W_{t-1}), where W_t is at most b bits
E.g. for multi-armed bandits, W_t = x_{t,i_t}, where the coordinate i_t is selected based on the past observations x_{1,i_1}, ..., x_{t-1,i_{t-1}}.
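A minimal sketch of the multi-armed-bandit instance of this feedback model; the epsilon-greedy index rule is only a stand-in for "i_t selected based on past observations", not an algorithm from the talk, and the full reward matrix is assumed to be given up front purely for simulation.

```python
import numpy as np

def bandit_feedback(rewards, eps=0.1, seed=0):
    """Simulate bandit feedback: only x_{t, i_t} is revealed each round.

    `rewards` is a (T, d) array of reward vectors x_1, ..., x_T in {0, 1}^d;
    the index i_t depends only on previously observed coordinates.
    """
    rng = np.random.default_rng(seed)
    T, d = rewards.shape
    sums, counts = np.zeros(d), np.zeros(d)
    observed = []                                   # the messages W_t
    for t in range(T):
        if counts.min() == 0 or rng.random() < eps:
            i_t = int(rng.integers(d))              # explore
        else:
            i_t = int(np.argmax(sums / counts))     # exploit the empirical means
        w_t = rewards[t, i_t]                       # the only part of x_t observed
        sums[i_t] += w_t
        counts[i_t] += 1
        observed.append((i_t, w_t))
    return observed
```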

Examples of (b, n, m) Protocols
Non-interactive distributed learning:
[Figure: machines 1, ..., m hold batches X_1, ..., X_m and each independently sends a b-bit message W_t to a central learner]

Examples of (b, n, m) Protocols
One-pass distributed learning:
[Figure: machines 1, ..., m hold batches X_1, ..., X_m and are processed in a single pass, each machine producing a b-bit message W_t that is passed along]
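A minimal sketch of the non-interactive case, assuming the task is mean estimation: each machine compresses its local batch into a low-precision local mean (a stand-in for a b-bit message, here b = 16·d bits via float16) and a coordinator averages the messages; the task and the quantization are illustrative choices, not taken from the talk.

```python
import numpy as np

def machine_message(X_local):
    """One machine's message: its local mean stored in float16 (16 * d bits)."""
    return X_local.mean(axis=0).astype(np.float16)

def coordinator(messages):
    """Non-interactive aggregation: average the machines' messages."""
    return np.mean([msg.astype(np.float64) for msg in messages], axis=0)

# Usage: m machines, each holding n i.i.d. samples in R^d.
rng = np.random.default_rng(0)
m, n, d = 8, 100, 5
batches = [rng.normal(size=(n, d)) for _ in range(m)]
estimate = coordinator([machine_message(X) for X in batches])
```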

Hide-and-Seek Problem
- Instances x_t are drawn i.i.d. from a product distribution on {-1, +1}^d
- A single unknown coordinate j ∈ {1, ..., d} is biased: E[x_t] = ρ·e_j
- Goal: given the sampled data, find j

Result for (b, 1, m) Protocols (e.g. bounded memory)
Theorem. Given m i.i.d. instances and bias ρ:
- The biased coordinate can be detected if m ≳ log(d)/ρ²: just return the coordinate J with the highest empirical mean
- Any (b, 1, m) protocol will fail if m ≲ (d/b)/ρ²
Tight up to log factors; can be generalized to n > 1.
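The detector from the first bullet is just the empirical-mean maximizer. A minimal self-contained sketch, with the sampler included; the particular values of d, ρ and the constant in the sample size are illustrative.

```python
import numpy as np

def sample_hide_and_seek(m, d, j, rho, rng):
    """m i.i.d. samples from {-1, +1}^d where only coordinate j has mean rho."""
    p = np.full(d, 0.5)
    p[j] = (1.0 + rho) / 2.0               # P(x_j = +1) = (1 + rho) / 2
    return np.where(rng.random((m, d)) < p, 1, -1)

def detect_biased_coordinate(X):
    """Return the coordinate with the highest empirical mean.

    Succeeds with high probability once m is on the order of log(d) / rho^2
    (the unconstrained upper bound in the theorem).
    """
    return int(np.argmax(X.mean(axis=0)))

rng = np.random.default_rng(0)
d, rho, j = 1000, 0.2, 17
m = int(10 * np.log(d) / rho ** 2)         # sample size matching the upper bound
X = sample_hide_and_seek(m, d, j, rho, rng)
print(detect_biased_coordinate(X) == j)    # True with high probability
```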

Proof Idea
[Figure: the coordinates x_{t,1}, ..., x_{t,d} of instance x_t, with the biased coordinate j among them, are compressed into the b-bit message W_t]
- As long as W_t doesn't know j, it can only convey about b/d bits on x_{t,j} in expectation
- x_{t,j} conveys at most ρ² bits on j
- Standard data processing inequality: W_t conveys at most min{ρ², b/d} bits on j
- We prove an information contraction inequality: W_t conveys at most ρ²·b/d bits on j

Proof Idea
- W_t conveys at most ρ²·b/d bits on j
- After m rounds: at most m·ρ²·b/d bits of information on j
- Insufficient to detect j if m ≲ (d/b)/ρ²

Application: Online Learning with Partial Information
T reward vectors x_1, ..., x_T are chosen from {0, 1}^d, and some b bits of each x_t are sequentially observed.
Theorem: expected regret is Ω(√(d·T/b)).
The result is generic: it holds regardless of which b bits those are:
- The chosen coordinate x_{t,i_t} (multi-armed bandits)
- Some other coordinate x_{t,j_t}, or some subset of coordinates x_{t,j_1}, x_{t,j_2}, ..., x_{t,j_k} (semi-bandit feedback; prediction with limited advice; bandits with side information; attribute-efficient learning...)
- A linear projection of x_t (bandit linear optimization)
- Feedback through some bounded-width feedback matrix (partial monitoring)...
It holds even if the algorithm can choose the b bits based on x_t!

Application: Stochastic/Online Optimization
Theorem. There are stochastic/online optimization problems in R^d (caveat: non-convex, but efficiently solvable) such that:
- It is possible to get Õ(√T) regret
- Any b-memory online algorithm (or b-communication distributed algorithm) has Ω(√(d²·T/b)) regret
Conclusion: any online algorithm with o(d²) memory is suboptimal (e.g. gradient descent, mirror descent).
New memory-sample and communication-sample trade-offs: in a statistical setting, one can get the same average regret with more data, even under memory/communication constraints.

Application: Sparse PCA / Covariance Estimation
Simplest case, detecting correlations: given an i.i.d. sample x_1, ..., x_m, detect a single pair of correlated features.
- Statistically optimal methods use the empirical covariance matrix (1/m) Σ_{i=1}^m x_i x_i^T
- Problem: this requires too much communication/memory when the dimension d and the sample size m are large
- We show: there are situations where no memory/communication-efficient method can be statistically optimal

Application: Sparse PCA / Covariance Estimation
Theorem. Suppose that:
- E[x x^T] = I_d + τ(E_{i,j} + E_{j,i}) for unknown i, j
- Given m samples, for all i ≠ j, Pr(|(1/m) Σ_t x_{t,i} x_{t,j} − E[x_i x_j]| ≥ τ/2) ≤ 2 exp(−m·τ²/6)
Then for τ = Θ(1/d):
- m ≳ d²: enough to return the largest off-diagonal entry of the empirical covariance matrix
- m ≲ d⁴/b: any b-memory online algorithm / (b, n, m) protocol (for appropriate n) will fail on some such distribution
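The first bullet's statistically optimal baseline, sketched below; note that it materializes the full d×d empirical covariance matrix, which is exactly what the memory/communication constraint rules out.

```python
import numpy as np

def detect_correlated_pair(X):
    """Return the pair (i, j), i != j, with the largest off-diagonal entry of
    the empirical covariance matrix (1/m) * sum_t x_t x_t^T.

    Uses O(d^2) memory; with tau = Theta(1/d) it succeeds once m is on the
    order of d^2 samples.
    """
    m, d = X.shape
    C = (X.T @ X) / m                     # d x d empirical covariance matrix
    np.fill_diagonal(C, -np.inf)          # ignore the diagonal
    i, j = np.unravel_index(np.argmax(C), C.shape)
    return int(i), int(j)
```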

Summary
A generic framework for information constraints in learning. The same results apply to:
- Different information constraints (memory, communication, partial information...)
- Different learning settings (multi-armed bandits and variants, stochastic/online optimization, sparse PCA and covariance estimation...)
New resource trade-offs:
- More data, less memory
- More data, less communication (in a natural regime)
Information trade-offs for other learning settings?

More details: "Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation", on arXiv and in NIPS 2014. Thanks!
