Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning

Size: px

Start display at page:

Download "Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning"

Magdalen Jordan
5 years ago
Views:

1 Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Pascal Poupart (University of Waterloo) INFORMS

2 Outline Dynamic Pricing as a POMDP Symbolic Perseus Generic POMDP solver Point-based value iteration Algebraic decision diagrams Experimental evaluation Conclusion 2

3 Setting One or several firms (monopoly or oligopoly) Fixed capacity and fixed number of selling rounds (i.e., sale of seasonal items) Finite range of prices Unknown and varying demand Question: how to dynamically adjust prices to maximize sales? 3

4 POMDPs Formulation (monopoly) Firm Consumer Time Time Time Time 4

5 POMDPs Formulation (oligopoly) Firm Competitors Consumer Time Time Time Time -i -i -i -i -i -i -i -i 5

6 Unknown demand & competitors Firm Competitors Consumer Time Time Time Time -i -i -i -i -i -i -i -i 6

7 Demand Model Probability that consumer chooses firm i: Pr(=i) = e a i+b i p i Σ i e a i+b i p i + 1 Parameters a i and b i are unknown Learn them From historical data As process evolves 7

8 Model each competitor: Competitors Pricing strategy: inv/time price Two thresholds: t up and t down If inv/time < t up price If inv/time > t down price Learn thresholds From historical data As process evolves 8

9 Expanded POMDP Firm Consumer A, B A, B A, B A, B Time Time Time Time Competitors -i -i -i -i -i -i -i -i T, T T, T T, T T, T 9

10 POMDPs Partially Observable Markov Decision Processes S: set of states Cross product of domain of all variables S = i dom(v i ) (exponentially large!) A: set of actions {price, price, price unchanged} O: set of observations Cross product of domain of observable variables T(s,a,s ) = Pr(s s,a): transition function Factored rep: Pr(s s,a) = i Pr(V i parents(v i )) R(s,a) = r: reward function Sale = price x 10

11 Belief: b(s) Distribution over states Belief monitoring Belief update: Bayes theorem b ao (s ) = k Σ s S b(s) Pr(s s,a) Pr(o a,s ) b ao = < o, a, b > Demand learning and opponent modeling: Implicit learning by belief monitoring 11

12 Policy trees Policy π Mapping from past actions & obs to next action Tree representation a 1 o 1 o 2 a 2 o 1 o 2 a 3 o 1 o 2 a 4 a 5 a 6 a 7 Problem: tree grows exponentially with time 12

13 Policy π : B A Policy Optimization mapping from beliefs to actions Value function V π (b) = Σ t γ t E b t π [R] Optimal policy π*: V*(b) V π (b) for all π,b Bellman s Equation: V*(b) = max a E b [R] + γ Σ o Pr(o s,a) V*(b ao ) 13

14 Difficulties Exponentially large state space S = i dom(v i ) Solution: algebraic decision diagrams Complex policy space Policy π : B A Continuous belief space Solution: point-based Bellman backups 14

15 Symbolic Perseus Point-based value iteration + algebraic decision diagrams Publicly available: Has been used to solve POMDPs with millions of states Currently used by Intel, Toronto Rehabilitation Institute, Univ of Dundee, Technical Univ of Lisbon, Univ of British Columbia, Univ of Manchester, Univ of Waterloo 15

16 Piecewise linear & convex val fn Value of a policy tree β is linear V β (b 0 ) = Σ s S b 0 (s) V β (s) Value of an optimal finite horizon policy is piecewise-linear and convex [SS73] V π b(s)=0 belief space b(s)=1 16

17 Point-based value iteration Point-based backup (Pineau & al. 2003) α t-1 (b) = max a E b [R] + γ Σ o Pr(o s,a) α t (b ao ) V t-1 V t b b ao 1 b ao 2 17

18 Algebraic Decision Diagrams First use in MDPs: Hoey et al Factored Representation Exploit conditional independence Pr(s s,a) = i Pr(V i parents(v i )) Automatic State aggregation Exploit context specific independence Exploit sparsity 18

19 Factored Representation Time Time Time Time Transition fn: Pr(s s,a) Flat representation: matrix O( S 2 ) Factored representation: often O(log S ) 19

20 Computation with Factored Rep Belief monitoring: b ao (s ) = k Pr(o a,s ) Σ s b(s) Pr(s s,a) Point-based Bellman backup: α(s) = max a R(s,a) + Σ s o Pr(s s,a) Pr(o a,s ) α ao (s ) Flat representation: O( S 2 ) Factored representation: often O( S log S ) 20

21 Algebraic Decision Diagrams X Tree-based representation Acyclic directed graph Y Z Y 3 Avoid duplicate entries Exploit context specific independence Exploit sparsity xyz xy~z x~yz ~xyz 0 ~xy~z 0 ~x~yz x~y~z 2 ~x~y~z 3 21

22 Empirical Results Monopolistic Dynamic Pricing / Time S SP Value Upper bound Runtime (min) 10 / 20 73, / , / , / , / , / ,

23 COACH project Automated prompting system to help elderly persons wash their hands IATSL: Alex Mihailidis, Jesse Hoey, Jennifer Boger et al. 23

24 Policy Optimization Partially observable MDP: Handle noisy HandLocation and noisy WaterFlow Can adapt to user responsiveness 50,181,120 states, 20 actions, 12 observations Approximation: fully observable MDP Assume HandLocation, WaterFlow are fully observable Remove responsiveness user variable 25,090,560 states, 20 actions 24

25 Empirical Comparison (Simulation) 25

26 Conclusion Natural encoding of Dynamic Pricing as POMDP Demand and competitor learning by belief monitoring Factored model Symbolic Perseus (generic POMDP solvers) Point-based value iteration + algebraic decision diagrams Exploit problem specific structure Future work Bayesian reinforcement learning Planning as inference 26

An Analytic Solution to Discrete Bayesian Reinforcement Learning

An Analytic Solution to Discrete Bayesian Reinforcement Learning Pascal Poupart (U of Waterloo) Nikos Vlassis (U of Amsterdam) Jesse Hoey (U of Toronto) Kevin Regan (U of Waterloo) 1 Motivation Automated