Central-limit approach to risk-aware Markov decision processes

1 Central-limit approach to risk-aware Markov decision processes. Jia Yuan Yu, Concordia University, November 27, 2015. Joint work with Pengqian Yu and Huan Xu.

2 Inventory Management. 1 Look at current inventory level. 2 Decide how much to order. 3 Receive inventory, sell inventory (random demand). 4 New inventory level. (Figure credit: LeanCor.)

3 What is an MDP? 1 Single decision maker, 2 sequential decisions, 3 uncertainty. Time horizon 1, 2, .... States: X. Actions: A. Reward function r : X × A → R. Transition probabilities P : X × A × X → [0, 1].

4 Policies. Sequential decision problem. A policy is a mapping π : X → A. If the decision-maker sticks to policy π, we have a Markov chain x_1^π, x_2^π, ..., and we write r(x_1^π), r(x_2^π), ... instead of r(x_1, π(x_1)), r(x_2, π(x_2)), ....

5 Example. 1 Time horizon: from Black Friday to end of year. 2 State at day t: how many iPhones x_t in stock. 3 Action: how many iPhones a_t to order from the warehouse. 4 Transition: from one inventory level x_t to the next x_{t+1}, driven by random demand and a_t. 5 Cost (negative reward): c(x_t, a_t) includes holding cost, order cost, backorder cost, etc.
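To make this example concrete, here is a minimal sketch of one step of such an inventory MDP. The capacity, cost coefficients, and Poisson demand law are illustrative assumptions, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not from the talk).
CAPACITY = 20          # maximum iPhones the store can hold
HOLD_COST = 1.0        # per unit left in stock overnight
ORDER_COST = 2.0       # per unit ordered
BACKORDER_COST = 4.0   # per unit of unmet demand

def step(x_t, a_t):
    """One day of the inventory MDP: order a_t, observe random demand, pay costs."""
    stocked = min(x_t + a_t, CAPACITY)   # receive the order, capped by capacity
    demand = rng.poisson(3)              # assumed Poisson demand
    x_next = max(stocked - demand, 0)    # next inventory level
    unmet = max(demand - stocked, 0)     # backordered demand
    cost = HOLD_COST * x_next + ORDER_COST * a_t + BACKORDER_COST * unmet
    return x_next, cost
```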

6 What is risk? Comparing probability distributions of alternatives.

7 What is risk? Many notions (risk measures): mappings of random variables (probability distributions) to real numbers; parametrized, e.g., by the curvature of u or the value of λ.
Expected utility (von Neumann and Morgenstern): ρ_u(X) = E u(X) with a concave u.
Mean-variance (Markowitz): ρ_λ(X) = E X − λ Var X.
Value-at-risk (VAR): ρ_λ(X) = F_X^{-1}(λ).
Expected shortfall (CVAR or AVAR): an integral of VAR.
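All four notions are easy to estimate from samples. Below is a small Python sketch using one common convention for each (sign and level conventions vary); the utility u, the level λ = 0.05, and the test distribution are assumptions made for illustration only.

```python
import numpy as np

def expected_utility(samples, u=np.sqrt):
    """rho_u(X) = E[u(X)] for a concave utility u (sqrt assumes X >= 0)."""
    return np.mean(u(samples))

def mean_variance(samples, lam=0.5):
    """rho_lam(X) = E[X] - lam * Var(X)."""
    return np.mean(samples) - lam * np.var(samples)

def value_at_risk(samples, lam=0.05):
    """rho_lam(X) = F_X^{-1}(lam), the lam-quantile of X."""
    return np.quantile(samples, lam)

def expected_shortfall(samples, lam=0.05):
    """Mean of the worst lam-tail: an average of VaR levels below lam."""
    var = np.quantile(samples, lam)
    return np.mean(samples[samples <= var])

x = np.random.default_rng(1).normal(loc=10.0, scale=3.0, size=100_000)
print(value_at_risk(x), expected_shortfall(x))
```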

8 VAR. (Figure illustrating value-at-risk on a distribution omitted.)

9 Let's cook up an example. 1 Chain of many stores. 2 The previous boss minimizes E[ Σ_{t=1}^T c(x_t^π) ] for each store. 3 You want to try a different approach: for a given λ, you want
min_π v   s.t.   P( Σ_{t=1}^T c(x_t^π) ≤ v ) ≥ λ.

10 Let's cook up an example. 1 For a given λ, you want
min_π F_S^{-1}(λ),   where S = Σ_{t=1}^T c(x_t^π)
and F_S^{-1}(λ) is the value-at-risk ρ_λ(S). 2 You can then say that a λ-fraction of stores spend less than v on inventory management costs (under an i.i.d. assumption).

11 What set of policies? 1 For a given λ, π* = arg min_{π ∈ Π} F_S^{-1}(λ). 2 π* is the training you give to each store's inventory manager. 3 You want to limit personnel training! 4 ⇒ Limit Π to stationary (Markov) policies π : X → A.

12 What's been done? MDPs without risk (risk-neutral). Risk in i.i.d. settings (in finance, operations research). Risk in MDPs: mean-variance [MT11], [PG13]; probabilistic objective, chance constraint [XM11]; dynamic (Markovian) risk measures [Rus10], [Oso12]; risk constraint (CVAR or AVAR) [CG14]; gradient approach [TCGM15].

13 Risk-neutral MDP. Finite horizon and infinite horizon:
max_{π ∈ Π} E[ Σ_{t=0}^{T−1} r(x_t^π) ],   max_{π ∈ Π} E[ Σ_{t=0}^∞ α^t r(x_t^π) ].
Solution via Bellman equation, value iteration, etc. Unlike E, our ρ is usually not linear.
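For reference, the risk-neutral finite-horizon problem is solved by backward induction. A minimal sketch, assuming a small tabular MDP given as arrays (the state/action encoding is an assumption for illustration):

```python
import numpy as np

def finite_horizon_value_iteration(P, r, T):
    """
    Risk-neutral backward induction.
    P[a] is an |X| x |X| transition matrix under action a; r is an |X| x |A| reward table.
    Returns the optimal value at stage 0 and a greedy decision rule for each stage.
    """
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                       # value-to-go after the horizon
    policies = []
    for _ in range(T):
        Q = np.stack([r[:, a] + P[a] @ V for a in range(n_actions)], axis=1)
        policies.append(Q.argmax(axis=1))        # greedy action per state
        V = Q.max(axis=1)
    policies.reverse()                           # policies[t] is the stage-t decision rule
    return V, policies
```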

14 What makes risk measure ρ hard? Stationary policy π ⇒ Markov chain {x_t^π : x ∈ X, t = 1, 2, ...}.
Find the CDF of S_T^π = Σ_{t=0}^{T−1} r(x_t^π), then solve min_{π ∈ Π} ρ(S_T^π).
For risk-neutral MDPs, an optimal policy lies in Π (well-known). Not in the risk-averse setting! There we need history-dependent policies π_1 : H_1 → A, π_2 : H_2 → A, ..., giving a non-Markov chain {x_t} and
S_T^{π_1,...,π_T} = Σ_{t=0}^{T−1} r(x_t, π_t(x_t)),
and we solve min_{π_1,...,π_T} ρ( S_T^{π_1,...,π_T} ).

15 Mean-variance [MT11], [PG13]. Max-reward:
max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]   s.t.   Var( Σ_{t=1}^T r(x_t^π) ) ≤ v.
Min-variance:
min_{π ∈ Π} Var( Σ_{t=1}^T r(x_t^π) )   s.t.   E[ Σ_{t=1}^T r(x_t^π) ] ≥ w.
These problems are NP-hard (reduction from subset sum).

16 Probabilistic goal, chance constraint [XM11]. Given v ∈ R:
max_{π ∈ Π} P( Σ_{t=1}^T r(x_t^π) ≥ v ).
Given v ∈ R and α ∈ (0, 1):
max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]   s.t.   P( Σ_{t=1}^T r(x_t^π) ≥ v ) ≥ α.

17 Risk constraint [CG14]. Given β:
min_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]   s.t.   ρ( Σ_{t=1}^T r(x_t^π) ) ≤ β,
where ρ is CVAR.

18 Dynamic (Markovian) risk measures [Rus10], [Oso12]. For an RV X on (Ω, F, P) and a coherent ρ: ρ(X) = sup_{Q absolutely continuous wrt P} E_Q X. For a fixed x_1 and horizon T, optimize over π_1, ..., π_T the nested objective
r(x_1, π_1(x_1)) + ρ( c(x_2, π_2(x_2)) + ρ( ... + ρ( c_{T+1}(x_{T+1}) ) ... ) ).
Infinite horizon, discount γ:
max_{π ∈ Π} r(x_0^π) + γ ρ( r(x_1^π) + γ ρ( r(x_2^π) + γ ρ( r(x_3^π) + ... ) ) ),
where each inner argument is a random variable and each ρ(·) is a real number. Solve for a stationary π by a version of value iteration.
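As one concrete instance of such a nested backup, the sketch below replaces the expectation over the next state with a one-step CVaR at an assumed level α and an assumed discount γ. It only illustrates the "version of value iteration" idea; it is not the exact operator of [Rus10] or [Oso12].

```python
import numpy as np

def discrete_cvar(values, probs, alpha):
    """CVaR_alpha of a discrete random cost: mean of the worst alpha-probability tail."""
    order = np.argsort(values)[::-1]             # largest costs first
    v, p = values[order], probs[order]
    cum = np.cumsum(p)
    prev = np.concatenate(([0.0], cum[:-1]))
    tail = np.clip(np.minimum(cum, alpha) - np.minimum(prev, alpha), 0.0, None)
    return float(tail @ v) / alpha

def risk_aware_backup(P, c, V, gamma=0.95, alpha=0.1):
    """One Bellman-style backup with CVaR replacing the expectation over next states.
    P[a] is an |X| x |X| transition matrix, c is an |X| x |A| cost table, V the value vector."""
    n_states, n_actions = c.shape
    Q = np.empty((n_states, n_actions))
    for x in range(n_states):
        for a in range(n_actions):
            Q[x, a] = c[x, a] + gamma * discrete_cvar(V, P[a][x], alpha)
    return Q.min(axis=1)
```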

19 Gradient approach [TCGM15]. Static and dynamic risk, without evaluating or estimating the risk value.
Static coherent ρ: min_π ρ( Σ_{t=1}^T r(x_t^π) ); build an estimator for ∇_π ρ( Σ_{t=1}^T r(x_t^π) ) and plug it into a gradient descent algorithm to find a local optimum.
Dynamic (Markovian) ρ_T (cf. [Rus10]): ρ_∞(π) = lim_{T→∞} ρ_T(x_1^π, ..., x_T^π); min_π ρ_∞(π); estimate ∇_π ρ_∞(π), etc.

20 What do we propose? There is a huge number of policies! We consider only the risk of a stationary policy:
min_{π ∈ Π} ρ( r(x_1^π) + ... + r(x_T^π) ). (1)
Each policy π has a risk value attached to it. Evaluating the risk of each policy is challenging. Policy improvement from π to π'. Approximate solution to an NP-hard problem.

21 Policy evaluation. 1 Transition probabilities P^π known. 2 Transition probabilities P^π unknown. 3 Tools: CLT for Markov chains (e.g., Kontoyiannis and Meyn); asymptotic reward variance for an MDP with a fixed policy (e.g., Meyn and Tweedie).

22 CLT. Theorem. Suppose that R_1, R_2, ... are i.i.d. with finite mean µ and variance σ². Let S_n = (R_1 + ... + R_n)/n. Then √n (S_n − µ) ⇒ N(0, σ²).

23 CLT for Markov chains. Theorem (Theorem 5.1, [KM03]). Suppose that X^π and r satisfy ergodicity and continuity assumptions. Let G_T^π(y) denote the distribution function
G_T^π(y) = P{ (S_T^π − T ϕ^π(r)) / (σ √T) ≤ y }.
Let r̂^π(x) be a solution to the Poisson equation
P^π r̂^π = r̂^π − r + ϕ^π(r) 1. (2)
Then, for all x_0^π ∈ X,
G_T^π(y) = g(y) + (γ(y) / (σ √T)) [ (ϱ / (6σ²)) (1 − y²) − r̂^π(x_0^π) ] + o(1/√T),
uniformly in y ∈ R, where g and γ denote the standard normal distribution and density.

24 P^π known. Input: P^π, T, r, λ, x_0.
1 Solve P^π ξ^π = ξ^π for the stationary distribution ξ^π.
2 Compute the asymptotic expected reward per time step: ϕ^π(r) = Σ_x ξ^π(x) r(x).
3 Solve the Poisson equation (2) for r̂^π.
4 Compute the asymptotic variance [MT12]: σ² = Σ_{x ∈ X} [ r̂^π(x)² − (P^π r̂^π(x))² ] ξ^π(x).
5 Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
6 Output the VAR estimate: ρ̂_λ(S_T^π) = inf{ y ∈ R : g( (y − T ϕ^π(r)) / (σ √T) ) > λ }.
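A minimal numerical sketch of these six steps, assuming the transition matrix of the induced chain is known and ergodic, and using a pseudoinverse to pick one solution of the Poisson equation; the O(1/√T) correction involving r̂^π(x_0) from the theorem above is omitted here.

```python
import numpy as np
from scipy.stats import norm

def clt_var_estimate(P_pi, r, T, lam):
    """
    CLT-based VaR estimate of S_T, the sum of rewards under a fixed policy.
    P_pi: |X| x |X| transition matrix of the induced chain; r: |X| reward vector.
    """
    n = len(r)
    # 1. Stationary distribution: left eigenvector of P_pi for eigenvalue 1.
    evals, evecs = np.linalg.eig(P_pi.T)
    xi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    xi = xi / xi.sum()
    # 2. Asymptotic reward per time step.
    phi = xi @ r
    # 3. Poisson equation (I - P_pi) r_hat = r - phi * 1; any solution works.
    r_hat = np.linalg.pinv(np.eye(n) - P_pi) @ (r - phi)
    # 4. Asymptotic variance (cf. Meyn & Tweedie).
    sigma2 = xi @ (r_hat**2 - (P_pi @ r_hat)**2)
    # 5-6. Normal approximation of S_T and its lam-quantile.
    return T * phi + np.sqrt(T * sigma2) * norm.ppf(lam)
```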

25 Estimation error. (Figure: distribution estimates G_T(y) versus y, for σ = 1 and σ = 1.1.)

26 P^π needs to be estimated. Input: π, T, r, λ, x_0.
1 Estimate P^π with P̂^π (the hard part: how many samples?).
2 Estimate the stationary distribution ξ̂^π.
3 Compute the asymptotic expected reward per time step: ϕ̂^π(r) = Σ_x ξ̂^π(x) r(x).
4 Solve the Poisson equation for r̂^π.
5 Compute the asymptotic variance [MT12, Equation 17.44]: σ̂² = Σ_{x ∈ X} [ r̂^π(x)² − (P̂^π r̂^π(x))² ] ξ̂^π(x).
6 Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
7 Output the VAR estimate: ρ̂_λ(S_T^π) = inf{ y ∈ R : g( (y − T ϕ̂^π(r)) / (σ̂ √T) ) > λ }.
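Step 1 might be done by counting transitions along one long trajectory generated under π. The sketch below is an assumption-laden illustration: the Laplace smoothing keeps every row a proper distribution, and how many samples suffice is exactly the hard part flagged above. The estimate can then be fed to the evaluation routine sketched after slide 24.

```python
import numpy as np

def estimate_transition_matrix(trajectory, n_states, smoothing=1.0):
    """Empirical transition matrix of the chain induced by the fixed policy.
    trajectory is a sequence of visited state indices; smoothing is an assumed prior count."""
    counts = np.full((n_states, n_states), smoothing)
    for x, x_next in zip(trajectory[:-1], trajectory[1:]):
        counts[x, x_next] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# Usage (hypothetical names): P_hat = estimate_transition_matrix(traj, n_states)
#                             rho_hat = clt_var_estimate(P_hat, r, T, lam)
```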

27 Policy improvement. Parametrized space of policies π_θ, for θ ∈ Θ ⊂ R^d. Randomized policies π_θ (a distribution over actions for each state), continuous in the parameter θ. Stochastic approximation and policy gradient. SPSA (simultaneous perturbation stochastic approximation), with perturbation Δ:
∇̂_θ^{(i)} ρ(S_T^θ) = ( ρ(S_T^{θ + β Δ}) − ρ(S_T^θ) ) / (β Δ^{(i)}),
θ_{t+1}^{(i)} = Γ_i( θ_t^{(i)} + λ(t) ∇̂_θ^{(i)} ρ(S_T^θ) ) for all i.
Step sizes satisfy the stochastic approximation assumptions.
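A minimal sketch of one SPSA step, assuming a Rademacher perturbation Δ, fixed β and step size (the talk uses decaying stochastic-approximation step sizes and a projection Γ, omitted here), and a caller-supplied routine risk_of that returns the risk estimate ρ̂(S_T^θ) for the policy π_θ. The sign is chosen to descend, i.e. to reduce risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_step(theta, risk_of, beta=0.05, step=0.01):
    """One simultaneous-perturbation step on the policy parameter theta.
    risk_of(theta) must return an estimate of rho(S_T^theta), e.g. the CLT-based VaR estimate."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)             # Rademacher perturbation
    grad = (risk_of(theta + beta * delta) - risk_of(theta)) / (beta * delta)
    return theta - step * grad                                    # move to reduce the risk
```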

28 Policy improvement. We can find a local optimum.

29 What can we guarantee? Soon on arXiv.
Policy evaluation. Theorem. Our estimated risk ρ̂ has an error of O(1/(σ√T)) with high probability: for every ε, δ > 0 and T large enough, we have
P( |ρ̂_λ(S_T^π) − ρ_λ(S_T^π)| ≤ ε ) ≥ 1 − δ.
Policy improvement. Theorem. Under continuity and stochastic approximation assumptions, for every ε > 0, there exists β such that θ_t converges almost surely to an ε-neighbourhood of a local optimum.

30 Extensions. Any risk measure that is a functional of the probability distribution of S_T^π. Use this algorithm for other NP-hard problems (subset sum)? Extension to infinite horizon, broken down into phases of length T.

31 References
[CG14] Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, 2014.
[KM03] Ioannis Kontoyiannis and Sean P. Meyn. Spectral theory and limit theorems for geometrically ergodic Markov processes. Annals of Applied Probability, 2003.
[MT11] S. Mannor and J. N. Tsitsiklis. Mean-variance optimization in Markov decision processes. In Proceedings of ICML, 2011.
[MT12] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
[Oso12] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, 2012.
[PG13] L. A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, 2013.
[Rus10] Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2), 2010.
[TCGM15] Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, 2015.
[XM11] Huan Xu and Shie Mannor. Probabilistic goal Markov decision processes. In Proceedings of IJCAI, 2011.
