Low-Regret for Online Decision-Making

Size: px

Start display at page:

Download "Low-Regret for Online Decision-Making"

Augustus Wilkins
5 years ago
Views:

1 Siddhartha Banerjee and Alberto Vera November 6, /17

2 Introduction Compensated Coupling Bayes Selector Conclusion Motivation 2/17

3 Introduction Compensated Coupling Bayes Selector Conclusion Motivation Offline is a host who knew all the arrivals ahead of time 2/17

4 Example: Online Packing (NRM) Resources with initial budgets B i, i = 1,..., d. 3/17

5 Example: Online Packing (NRM) Resources with initial budgets B i, i = 1,..., d. Type j wants a ij units of resource i. 3/17

6 Example: Online Packing (NRM) Resources with initial budgets B i, i = 1,..., d. Type j wants a ij units of resource i. Type j arrives with probability p j and has reward r j, j = 1,..., n. 3/17

7 Example: Online Packing (NRM) Resources with initial budgets B i, i = 1,..., d. Type j wants a ij units of resource i. Type j arrives with probability p j and has reward r j, j = 1,..., n. If we had full information, solve max r x s.t. Ax B (Budget Consumption) x Z(T ) (Respect Arrivals) x N n. Paradigm: two agents solving this problem, Online and Offline 3/17

8 Traditional Approach and Our Contributions State State S1 S1 S2 S2 S3 S3 S4 Offline S4 Online Offline Time Time t1 t2 t3 t4 t1 t2 t3 t4 4/17

9 Traditional Approach and Our Contributions State State S1 S1 S2 S2 S3 S3 S4 Offline S4 Online Offline Time Time t1 t2 t3 t4 t1 t2 t3 t4 Our Contributions Compensated Coupling: for analyzing algorithms Bayes Selector: meta-algorithm based on estimators Applications: constant regret 4/17

10 Application: Constant Regret for Online Packing V on and V off collected values, Regret := V off V on. 5/17

11 Application: Constant Regret for Online Packing V on and V off collected values, Regret := V off V on. V off = max{r x : Ax B, 0 x Z(T )}. 5/17

12 Application: Constant Regret for Online Packing V on and V off collected values, Regret := V off V on. V off = max{r x : Ax B, 0 x Z(T )}. Theorem (Informal) We provide a natural policy with constant expected regret for online packing problems. 5/17

13 Application: Constant Regret for Online Packing V on and V off collected values, Regret := V off V on. V off = max{r x : Ax B, 0 x Z(T )}. Theorem (Informal) We provide a natural policy with constant expected regret for online packing problems. Regret depends on d, n, κ(a), and the distribution of Z(t) Independent of T and initial budget levels B Correlated and time-varying distributions 1 o(1) approximation whenever T and B are large 5/17

14 Related Work Constant Regret for Multi-Secretary Arlotto and Gurvich (2017) Network Revenue Management / Online Packing: Asymptotic analysis O( T ) regret Talluri and Van Ryzin (2004) Later o( T ), O(1) conditional, O(1) Poisson Reiman and Wang (2008),Jasin and Kumar (2012),Bumpensanti and Wang (2018) Prophet: worst case distribution (competitive ratio) Ratio 1 e 1 for maximum of i.i.d. Hill and Kertz (1982) Best ratio 0.745, used in posted pricing Correa et al. (2017) Online Matching Manshadi et al (2012) Approximate Dynamic Programming Powell (2011) 6/17

15 Framework for Online Decision-Making Arrival type J t J presented at t = T, T 1,..., 1 7/17

16 Framework for Online Decision-Making Arrival type J t J presented at t = T, T 1,..., 1 States s S and actions a A 7/17

17 Framework for Online Decision-Making Arrival type J t J presented at t = T, T 1,..., 1 States s S and actions a A Rewards R(a, s, j) 7/17

18 Framework for Online Decision-Making Arrival type J t J presented at t = T, T 1,..., 1 States s S and actions a A Rewards R(a, s, j) Transitions T (a, s, j) 7/17

19 Framework for Online Decision-Making Arrival type J t J presented at t = T, T 1,..., 1 States s S and actions a A Rewards R(a, s, j) Transitions T (a, s, j) Examples: Online Packing (NRM), Multi-Secretary, Online Matching, Ski-Rental, etc. 7/17

20 Example: Multi-Secretary One resource (d = 1) with capacity B. Goal: choose B arrivals to maximize their sum t = t is time to go 8/17

21 Value of Offline V off (t, s)[ω] is the optimal offline value starting on s with t periods to go on sample path ω. 9/17

22 Value of Offline V off (t, s)[ω] is the optimal offline value starting on s with t periods to go on sample path ω t B = B = /17

23 Value of Offline V off (t, s)[ω] is the optimal offline value starting on s with t periods to go on sample path ω t B = B = It satisfies the Bellman Equation V off (t, s)[ω] = max{r(a, s, J t )+V off (t 1, T (a, s, J t ))[ω] : a A}. 9/17

24 Offline Follows Online Online uses a policy π on. We want Offline to follow π on 10/17

25 Offline Follows Online Online uses a policy π on. We want Offline to follow π on What if Offline is not satisfied with a = π on (t, s, j)? V off (t, s)[ω] > R(a, s, J t ) + V off (t 1, T (a, s, J t ))[ω]. 10/17

26 Offline Follows Online Online uses a policy π on. We want Offline to follow π on What if Offline is not satisfied with a = π on (t, s, j)? V off (t, s)[ω] > R(a, s, J t ) + V off (t 1, T (a, s, J t ))[ω]. We need to compensate him by paying V off (t, s)[ω] R(a, s, J t ) V off (t 1, T (a, s, J t ))[ω]. 10/17

27 Offline Follows Online Online uses a policy π on. We want Offline to follow π on What if Offline is not satisfied with a = π on (t, s, j)? V off (t, s)[ω] > R(a, s, J t ) + V off (t 1, T (a, s, J t ))[ω]. We need to compensate him by paying V off (t, s)[ω] R(a, s, J t ) V off (t 1, T (a, s, J t ))[ω] t = /17

28 Compensations How big are the compensations? 11/17

29 Compensations How big are the compensations? r max for multi-secretary and A j r max dr max for packing. 11/17

30 Compensations How big are the compensations? r max for multi-secretary and A j r max dr max for packing. Q(t, s, a) is the disagreement set: ω Ω such that Offline loses optimality by taking a. 11/17

31 Compensations How big are the compensations? r max for multi-secretary and A j r max dr max for packing. Q(t, s, a) is the disagreement set: ω Ω such that Offline loses optimality by taking a. Lemma (Compensated Coupling) Fix any policy for Online. Let S t be Online s state over time and A t Online s action over time. If r is a bound on the compensations, then [ ] E[Regret] re P[Q(t, S t, A t )] t 11/17

32 Compensated Coupling View State State Compensation S1 S1 S2 S2 S3 S3 S4 Online Offline S4 Online Offline 1 Time Time t1 t2 t3 t4 t1 t 1 t2 t3 t4 12/17

33 Compensated Coupling View State State Compensation S1 S1 S2 S2 S3 S3 S4 Online Offline S4 Online Offline 1 Time Time State t1 t2 t3 t4 Compensation t1 t 1 t2 t3 t4 Compensation Online Offline 2 t1 t 1 t2 t3 t4 Time 12/17

34 Compensated Coupling View State State Compensation S1 S1 S2 S2 S3 S3 S4 Online Offline S4 Online Offline 1 Time Time t1 t2 t3 t4 t1 t 1 t2 t3 t4 State Compensation State Compensation Compensation S1 Compensation Compensation S2 S3 Online Offline 2 S4 Online Offline 3 t1 t 1 t2 t3 t4 Time t1 t2 t3 t4 Time 12/17

35 The Bayes Selector Policy Compensated coupling works for any policy. Can we turn it around and find a policy? 13/17

36 The Bayes Selector Policy Compensated coupling works for any policy. Can we turn it around and find a policy? Alice is making decisions and we want to copy her. An oracle tells us she picked red/blue with probabilities 0.7/0.3. What to do? 13/17

37 The Bayes Selector Policy Compensated coupling works for any policy. Can we turn it around and find a policy? Alice is making decisions and we want to copy her. An oracle tells us she picked red/blue with probabilities 0.7/0.3. What to do? Q(t, s, a) is the disagreement set: ω Ω such that Offline loses optimality by taking a. 13/17

38 The Bayes Selector Policy Compensated coupling works for any policy. Can we turn it around and find a policy? Alice is making decisions and we want to copy her. An oracle tells us she picked red/blue with probabilities 0.7/0.3. What to do? Q(t, s, a) is the disagreement set: ω Ω such that Offline loses optimality by taking a. Policy: at each time t, take action A t BS argmin{p[q(t, S t BS, a)] : a A}. 13/17

39 Regret Guarantee of Bayes Selector Ideal Policy: at time t, A t BS argmin{p[q(t, St BS, a)] : a A}. 14/17

40 Regret Guarantee of Bayes Selector Ideal Policy: at time t, A t BS argmin{p[q(t, St BS, a)] : a A}. Meta-Algorithm: Based on estimators ˆq(t, s, a) P[Q(t, s, a)] 14/17

41 Regret Guarantee of Bayes Selector Ideal Policy: at time t, A t BS argmin{p[q(t, St BS, a)] : a A}. Meta-Algorithm: Based on estimators ˆq(t, s, a) P[Q(t, s, a)] at time t, A t BS argmin{ˆq(t, St BS, a) : a A}. 14/17

42 Regret Guarantee of Bayes Selector Ideal Policy: at time t, A t BS argmin{p[q(t, St BS, a)] : a A}. Meta-Algorithm: Based on estimators ˆq(t, s, a) P[Q(t, s, a)] at time t, A t BS argmin{ˆq(t, St BS, a) : a A}. Theorem Let r be an upper bound on the compensations. The Bayes Selector achieves [ ] E[Regret] re ˆq(t, SBS, t A t BS) t 14/17

43 Regret Guarantee of Bayes Selector Ideal Policy: at time t, A t BS argmin{p[q(t, St BS, a)] : a A}. Meta-Algorithm: Based on estimators ˆq(t, s, a) P[Q(t, s, a)] at time t, A t BS argmin{ˆq(t, St BS, a) : a A}. Theorem Let r be an upper bound on the compensations. The Bayes Selector achieves [ ] E[Regret] re ˆq(t, SBS, t A t BS) t Theorem The Bayes Selector achieves constant regret for online packing and matching problems. 14/17

44 Estimators for Online Packing ˆq(t, s, a) P[Q(t, s, a)], with Q(t, s, a) is the disagreement set. Given Online s budget B t, Offline solves max{r x : Ax B t, 0 x Z(t)}. 15/17

45 Estimators for Online Packing ˆq(t, s, a) P[Q(t, s, a)], with Q(t, s, a) is the disagreement set. Given Online s budget B t, Offline solves max{r x : Ax B t, 0 x Z(t)}. Let X solve max{r x : Ax B t, 0 x E[Z(t)]} 15/17

46 Estimators for Online Packing ˆq(t, s, a) P[Q(t, s, a)], with Q(t, s, a) is the disagreement set. Given Online s budget B t, Offline solves max{r x : Ax B t, 0 x Z(t)}. Let X solve max{r x : Ax B t, 0 x E[Z(t)]} X j /E[Z j (t)] [0, 1] for all j. 15/17

47 Estimators for Online Packing ˆq(t, s, a) P[Q(t, s, a)], with Q(t, s, a) is the disagreement set. Given Online s budget B t, Offline solves max{r x : Ax B t, 0 x Z(t)}. Let X solve max{r x : Ax B t, 0 x E[Z(t)]} X j /E[Z j (t)] [0, 1] for all j. Accept j iff X j /E[Z j (t)] 1/2. 15/17

48 Estimators for Online Packing ˆq(t, s, a) P[Q(t, s, a)], with Q(t, s, a) is the disagreement set. Given Online s budget B t, Offline solves max{r x : Ax B t, 0 x Z(t)}. Let X solve max{r x : Ax B t, 0 x E[Z(t)]} X j /E[Z j (t)] [0, 1] for all j. Accept j iff X j /E[Z j (t)] 1/2. min{ˆq(t, B t, accept), ˆq(t, B t, reject)} exp( t(p j /2κ) 2 /25) 15/17

49 Numerical Results Bayes Selector Randomized Adaptive Policy Randomized Static Policy Regret Scaling 16/17

50 Conclusions and Future Directions Compensated coupling: A way to understand online decision-making 17/17

51 Conclusions and Future Directions Compensated coupling: A way to understand online decision-making Bayes Selector: Natural policy with a performance guarantee Bridge between estimation and optimization 17/17

52 Conclusions and Future Directions Compensated coupling: A way to understand online decision-making Bayes Selector: Natural policy with a performance guarantee Bridge between estimation and optimization Online packing and matching: the Bayes Selector achieves constant regret Future work: learning and multi-stage problems Promising directions: online convex optimization 17/17

53 Appendix: Details for Numerical Results Type j Valuation Resource ,2 1,2 p j /1

Allocating Resources, in the Future

Allocating Resources, in the Future Sid Banerjee School of ORIE May 3, 2018 Simons Workshop on Mathematical and Computational Challenges in Real-Time Decision Making online resource allocation: basic model......