Lecture 14: Bandits with Budget Constraints

IEOR 8100-001: Learning and Optimization for Sequential Decision Making 03/07/16

Lecture 14: Bandits with Budget Constraints

Instructor: Shipra Agrawal. Scribed by: Zhipeng Lu

1 Problem definition

In the regular Multi-armed Bandit problem, we pull an arm $I_t$ at each round $t$ and collect a reward that depends on the arm $I_t$. Now suppose that in each round, after pulling the arm, a cost is also incurred. When the total cost up to time $t$ surpasses a given budget, the algorithm stops. This setting is called Bandits with Budget Constraints. The formal description is as follows.

At time $t$, pull an arm $I_t = i$ and observe a reward $r_t \in [0,1]$ and a cost vector $c_t \in [0,1]^d$. Given $I_t = i$, $(r_t, c_t) \sim D_i$, where $D_i$ is a joint distribution that depends on arm $i$; denote $\mathbb{E}[r_t \mid I_t = i] = \mu_i$ and $\mathbb{E}[c_{t,j} \mid I_t = i] = C_{ij}$. Stop when any budget constraint is violated. The goal is to

$$\text{maximize } \sum_{t=1}^T r_t \quad \text{subject to } \sum_{t=1}^T c_{t,j} \le B_j, \; j = 1, \ldots, d,$$

where $B_j \ge 0$.

Remark. The above setting is also called Bandits with Knapsacks (BwK). Usually we replace the constraints with a simplified version by letting $B = \min_j B_j$ and setting $B_j = B$ for all $j = 1, \ldots, d$.

Example 1 (Dynamic pricing with limited supply). Suppose that at price $q$ the product is sold with probability $S(q)$, yielding revenue $q$ and decreasing the inventory by 1. So the distribution of reward and cost for the bandit problem is: pulling arm $q$,

$$(r_t, c_t) = \begin{cases} (q, 1) & \text{w.p. } S(q), \\ (0, 0) & \text{w.p. } 1 - S(q). \end{cases}$$

2 Optimal static policy

A policy is a mapping from histories to actions. It can be pure static, in which case we keep pulling the single arm with the highest expected reward. It can be mixed, in which case we pull arms randomly according to some fixed distribution. Or it can be dynamic, where in each round we make a decision based on the history and the remaining resources. In this section we will show that there is an optimal static policy whose expected reward is at least that of the optimal dynamic policy, while satisfying the constraints in expectation.
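The setting of Example 1, together with a pure static policy that posts a single fixed price until the inventory budget runs out, can be simulated directly. Below is a minimal sketch; the linear demand curve $S(q) = 1 - q$ and all parameter values are illustrative assumptions, not taken from the notes:

```python
import random

def simulate_static_pricing(q, S, B, T, seed=0):
    """Example 1 under a pure static policy: post price q every round.
    A sale occurs w.p. S(q), giving reward q and consuming one unit of the
    inventory budget B. Stop when the budget is exhausted or the horizon T ends."""
    rng = random.Random(seed)
    total_reward, inventory_used = 0.0, 0
    for _ in range(T):
        if inventory_used >= B:      # budget constraint would be violated: stop
            break
        if rng.random() < S(q):      # item sells with probability S(q)
            total_reward += q        # reward r_t = q
            inventory_used += 1      # cost c_t = 1
    return total_reward, inventory_used

# Illustrative linear demand curve S(q) = 1 - q on [0, 1].
S = lambda price: 1.0 - price
reward, used = simulate_static_pricing(q=0.5, S=S, B=50, T=1000, seed=1)
```

Since every sale pays exactly $q$, the realized reward always equals $q$ times the number of units sold, and the loop never consumes more than $B$ units; these are the two invariants the stopping rule enforces.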

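The optimal static (mixed) policy characterized next is the solution of a small linear program. As a rough numerical sketch, that LP can be approximated by a brute-force grid search over the probability simplex; the two-arm instance below is hypothetical, and a real implementation would call an LP solver instead:

```python
import itertools

def static_policy_opt(mu, C, B, T, grid=100):
    """Brute-force the LP: maximize T * sum_i p_i * mu[i] subject to
    T * sum_i p_i * C[i][j] <= B[j] for every resource j, sum_i p_i <= 1,
    and p_i >= 0, by scanning a uniform grid over the probabilities p."""
    N, d = len(mu), len(C[0])
    best_val, best_p = 0.0, [0.0] * N
    steps = [k / grid for k in range(grid + 1)]
    for p in itertools.product(steps, repeat=N):
        if sum(p) > 1.0 + 1e-12:
            continue  # p must be a (sub-)distribution over arms
        if all(T * sum(p[i] * C[i][j] for i in range(N)) <= B[j] + 1e-9
               for j in range(d)):
            val = T * sum(p[i] * mu[i] for i in range(N))
            if val > best_val:
                best_val, best_p = val, list(p)
    return best_val, best_p

# Hypothetical instance: arm 0 pays more per pull, arm 1 pays more per unit cost.
mu = [1.0, 0.3]        # expected rewards mu_i
C = [[1.0], [0.1]]     # expected costs C_ij (a single resource)
opt_val, opt_p = static_policy_opt(mu, C, B=[50.0], T=100)
```

On this instance the exact LP optimum mixes both arms ($p \approx (4/9, 5/9)$, value $\approx 61.1$), strictly beating the best single arm (value 50), which illustrates why the optimal static policy is mixed rather than pure.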
Suppose in each round we pull arm $i$ with probability $p_i$, and the constraints are all satisfied in expectation. Then the optimal total reward OPT is:

$$\text{OPT} = \max_p \; T \sum_{i=1}^N p_i \mu_i$$
$$\text{subject to } T \sum_{i=1}^N p_i C_{ij} \le B_j, \quad j = 1, \ldots, d,$$
$$\sum_{i=1}^N p_i \le 1, \quad p_i \ge 0.$$

Notice that the second constraint implies that we allow not pulling any arm during a round. Next we prove that OPT is better than what an optimal dynamic policy could achieve. Define $X_i$ as the total number of times an optimal dynamic policy picks arm $i$, and let $p_i = \mathbb{E}[X_i]/T$. Any feasible policy must satisfy

$$\sum_{s=1}^t c_{s,j} \le B_j, \quad j = 1, \ldots, d, \quad t = 1, \ldots, T.$$

Taking expectations on both sides (using Wald's identity), we have

$$\sum_{i=1}^N \mathbb{E}\Big[\sum_{t : I_t = i} c_{t,j}\Big] = \sum_{i=1}^N \mathbb{E}[X_i]\, C_{ij} = T \sum_{i=1}^N p_i C_{ij} \le B_j.$$

Therefore $\{p_i\}_{i=1}^N$ is feasible for the LP above, and the total expected reward of this optimal dynamic policy is

$$\mathbb{E}\Big[\sum_i X_i \mu_i\Big] = \sum_i \mathbb{E}[X_i]\, \mu_i = T \sum_i p_i \mu_i \le \text{OPT}.$$

3 Bandit algorithm: reducing to an unconstrained f problem

Suppose the algorithm stops at time $\tau$, when a budget constraint is violated. Define the regret of such a policy as

$$R(T) = \text{OPT} - \mathbb{E}\Big[\sum_{t=1}^\tau r_t\Big].$$

Since the total reward of the optimal dynamic policy is bounded by OPT, the gap between the optimal dynamic policy and the algorithm is bounded by $R(T)$. Now we reduce the constrained problem to an unconstrained problem with a nonlinear objective function, by applying a Lagrangian multiplier to the constrained problem. The new unconstrained problem will maximize a function $f$,

which is a concave and Lipschitz continuous function. Writing $\bar r = \frac{1}{T}\sum_{t=1}^T r_t$ and $\bar c_j = \frac{1}{T}\sum_{t=1}^T c_{t,j}$, define

$$f(\bar r, \bar c) = \bar r - z \max_{j = 1, \ldots, d} \max\Big(\bar c_j - \frac{B}{T},\; 0\Big).$$

In the above definition of $f$, the first term is the average reward and the second term is the penalty from the maximum violation of a budget constraint. We define the penalty coefficient $z$ as $z = 2\,\text{OPT}/B$, a choice we will explain momentarily.

Suppose we relax the budget $B$ to $B + \delta$ and define $\epsilon = \delta/B$. Denote the OPT with budget $B(1+\epsilon)$ as $\text{OPT}_{1+\epsilon}$. Any $\text{OPT}_{1+\epsilon}$-feasible $p$ satisfies

$$T \sum_i \frac{p_i}{1+\epsilon}\, C_{ij} \le \frac{(1+\epsilon)B}{1+\epsilon} = B, \quad j = 1, \ldots, d,$$

therefore $p/(1+\epsilon)$ is OPT-feasible. Thus we have

$$T \sum_{i=1}^N p_i \mu_i = (1+\epsilon)\, T \sum_{i=1}^N \frac{p_i}{1+\epsilon}\, \mu_i \le (1+\epsilon)\,\text{OPT},$$

so if we relax the budget constraint by $\delta$, the optimal value OPT will increase by no more than $\epsilon\,\text{OPT} = z\delta/2$. So we can set $z = 2\,\text{OPT}/B$: overspending each budget by $\delta$ can increase the achievable reward by at most $z\delta/2$ while incurring a penalty of $z\delta$ in $Tf$, which guarantees that violating the budget won't give a benefit in terms of increasing the value of $f$. Theorem 2 below will provide the exact relation between optimizing $f$ and the Bandits with Knapsacks problem; the significance of this value of $z$ will be illustrated more precisely in the proof of that theorem.

Now we define the regret of the algorithm with objective function $f$ as

$$R_f(T) = T\Big(\text{OPT}_f - \mathbb{E}\big[f(\bar r, \bar c)\big]\Big),$$

where

$$\text{OPT}_f = \max_{p :\, \sum_i p_i \le 1,\; p_i \ge 0} f(p\mu, pC) = \max_p \Big\{ \sum_i p_i \mu_i - z \max_{j} \max\Big(\sum_i p_i C_{ij} - \frac{B}{T},\; 0\Big) \Big\},$$

and $f(p\mu, pC)$ is the value of $f$ when arm $i$ is pulled with probability $p_i$ in every round. And we define $R(T)$ as the regret of the constrained problem with the larger budget $\tilde B = B + 2R_f(T)/z + \tilde O(\sqrt{T})$:

$$R(T) = \text{OPT}(\tilde B) - \mathbb{E}\Big[\sum_{t=1}^\tau r_t\Big],$$

where $\tau$ is the first time a budget $\tilde B$ is violated. We conclude this lecture with the following theorem.

Theorem 2. If an algorithm achieves $R_f(T)$ regret for the unconstrained $f$ problem, then $R(T) \le 3R_f(T) + z\,\tilde O(\sqrt{T})$, and the algorithm will not violate the budget $\tilde B$ at any time step $t \le T$ with high probability.
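The role of the penalty coefficient $z = 2\,\text{OPT}/B$ can be sanity-checked numerically before the proof. A minimal sketch with made-up numbers: by the relaxation argument above, overspending the budget by a factor $(1+\epsilon)$ can raise the collected reward by at most the factor $(1+\epsilon)$, while the penalty term grows twice as fast.

```python
def f_objective(avg_reward, avg_cost, z, B, T):
    """The Lagrangian objective f: per-round average reward minus z times the
    largest per-round budget violation, max_j max(avg_cost[j] - B/T, 0)."""
    violation = max(max(cj - B / T, 0.0) for cj in avg_cost)
    return avg_reward - z * violation

OPT, B, T = 50.0, 50.0, 100        # illustrative values only
z = 2 * OPT / B                    # the penalty coefficient from the notes

# Spend the budget exactly: average reward OPT/T, average cost B/T, no penalty.
f_exact = f_objective(OPT / T, [B / T], z, B, T)
# Overspend by 20%: reward can grow by at most the same 20% factor,
# but the penalty z * (0.2 * B / T) grows twice as fast.
f_over = f_objective(1.2 * OPT / T, [1.2 * B / T], z, B, T)
```

Here `f_exact` evaluates to $0.5$ while `f_over` evaluates to $0.6 - 2 \cdot 0.1 = 0.4$: with $z = 2\,\text{OPT}/B$, violating the budget strictly decreases $f$ even under the most optimistic reward gain.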

Proof. Proof outline: the proof follows from the following two claims (recall $\bar r = \frac{1}{T}\sum_t r_t$ and $\bar c_j = \frac{1}{T}\sum_t c_{t,j}$, and let $\delta = \tilde B - B$):

1. $T\,\text{OPT}_f \ge \text{OPT}(\tilde B) - z\delta$.

2. Let $c_t$ be the cost of the decision at time $t$ for the unconstrained $f$ algorithm; then with high probability, $\sum_{t=1}^T c_{t,j} \le \tilde B$ for all $j$.

If the above two claims are true, then $\tau = T$ with high probability, and

$$R(T) = \text{OPT}(\tilde B) - \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big] \le \text{OPT}(\tilde B) - \mathbb{E}\big[T f(\bar r, \bar c)\big] \le \big(T\,\text{OPT}_f + z\delta\big) - \mathbb{E}\big[T f(\bar r, \bar c)\big] = R_f(T) + z\delta,$$

where the first inequality holds because $\sum_{t=1}^T r_t = T f(\bar r, \bar c) + zT \max_{j=1,\ldots,d} \max(\bar c_j - B/T, 0) \ge T f(\bar r, \bar c)$, and the second is Claim 1. Using $\delta = 2R_f(T)/z + \tilde O(\sqrt{T})$, we get the theorem statement. Next we prove the above two claims.

The first claim holds because any solution $p$ with $\sum_i p_i \le 1$, $p \ge 0$ that is feasible for $\text{OPT}(\tilde B)$, in particular the optimal one, forms a feasible solution for $\text{OPT}_f$: its per-round violation of the budget $B$ is at most $(\tilde B - B)/T = \delta/T$, so its $f$-value is at least $\text{OPT}(\tilde B)/T - z\delta/T$.

For the second claim, let $M$ be the maximum (per-round) budget violation above $B$ by the algorithm for the unconstrained $f$ problem, i.e.,

$$M := \max_{j=1,\ldots,d} \max\Big(\frac{1}{T}\sum_{t=1}^T c_{t,j} - \frac{B}{T},\; 0\Big), \quad \text{so that} \quad T f(\bar r, \bar c) = \sum_{t=1}^T r_t - zTM.$$

Then

$$T\,\text{OPT}_f - R_f(T) = \mathbb{E}\big[T f(\bar r, \bar c)\big] = \mathbb{E}\Big[\sum_{t=1}^T r_t\Big] - zT\,\mathbb{E}[M] \le \Big(\text{OPT} + \frac{zT\,\mathbb{E}[M]}{2}\Big) - zT\,\mathbb{E}[M] = \text{OPT} - \frac{zT\,\mathbb{E}[M]}{2}.$$

The second-to-last inequality follows using our earlier observation that $\text{OPT}_{1+\epsilon} \le (1+\epsilon)\,\text{OPT}$ together with $z = 2\,\text{OPT}/B$: a policy that consumes at most $B + TM$ units of each budget can collect expected reward at most $\text{OPT}(1 + TM/B) = \text{OPT} + zTM/2$, and this bound is linear in $M$, so it survives taking expectations. Also, $T\,\text{OPT}_f \ge \text{OPT}$, because the optimal solution $p$ for OPT forms a feasible solution for $\text{OPT}_f$ with no penalty, i.e., with value $\text{OPT}/T$. Then, rearranging,

$$\frac{zT\,\mathbb{E}[M]}{2} \le R_f(T) + \text{OPT} - T\,\text{OPT}_f \le R_f(T), \quad \text{i.e.,} \quad T\,\mathbb{E}[M] \le \frac{2R_f(T)}{z}.$$

Therefore, by the definition of $M$,

$$\mathbb{E}\Big[\sum_{t=1}^T c_{t,j}\Big] \le B + \frac{2R_f(T)}{z} \quad \text{for all } j,$$

so that, using Azuma-Hoeffding, with high probability

$$\sum_{t=1}^T c_{t,j} \le B + \frac{2R_f(T)}{z} + \tilde O(\sqrt{T}) = \tilde B,$$

proving the second claim.

Why is the above theorem useful? Conceptually, it shows that to bound the regret of Bandits with Knapsacks, we can instead use an algorithm that has a small regret bound for the unconstrained $f$ bandit problem. In the next lecture we will prove regret bounds for the unconstrained $f$ problem. In particular, for the $f$ discussed above, a uniform regret bound of $R_f(T) \le \tilde O(z\sqrt{NT})$ can be proven, though we will not prove this special case in class. Then, given the above theorem,

we could solve the unconstrained $f$ bandit problem with a slightly smaller budget

$$B' = B - \frac{2R_f(T)}{z} - \tilde O(\sqrt{T}),$$

so that the above theorem, applied with $\tilde B = B$, gives

$$R(T) \le 3R_f(T) + z\,\tilde O(\sqrt{T}) \le \tilde O(z\sqrt{NT}) + z\,\tilde O(\sqrt{T}) = \tilde O(z\sqrt{NT}).$$

Using $z = 2\,\text{OPT}/B$, we get a bound of $\tilde O\big((\text{OPT}/B)\sqrt{NT}\big)$ on the regret for Bandits with Knapsacks, i.e., a multiplicative guarantee that the reward achieved by the bandit algorithm is at least $\text{OPT}\big(1 - \tilde O(\sqrt{NT}/B)\big)$.

Note that the above algorithm assumed that the value of OPT is known (OPT was used to set the value of $z$). If OPT is not known, then it needs to be estimated using pure exploration in the beginning. This initial exploration might cause additional regret; this is a limitation of the above approach of reducing the problem to the unconstrained $f$ problem. For a direct approach that does not require knowing OPT, refer to Section 4.2 of [1].

References

[1] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 989-1006. ACM, 2014.