Central-limit approach to risk-aware Markov decision processes
Jia Yuan Yu, Concordia University
November 27, 2015
Joint work with Pengqian Yu and Huan Xu.
Inventory Management

1. Look at the current inventory level.
2. Decide how much to order.
3. Receive inventory, sell inventory (random demand).
4. Observe the new inventory level.
5. ...

(Image: LeanCor.)
What is an MDP?

1. A single decision maker,
2. sequential decisions,
3. uncertainty.

Time horizon: 1, 2, .... States: X. Actions: A. Reward function r : X × A → R. Transition probabilities P : X × A × X → [0, 1].
Policies

Sequential decision problem. A policy is a mapping π : X → A. If the decision-maker sticks to policy π, we have a Markov chain x_1^π, x_2^π, ..., and we write r(x_1^π), r(x_2^π), ... instead of r(x_1, π(x_1)), r(x_2, π(x_2)), ....
Example

1. Time horizon: from Black Friday to the end of the year.
2. State at day t: how many iPhones x_t are in stock.
3. Action: how many iPhones a_t to order from the warehouse.
4. Transition: from one inventory level x_t to the next x_{t+1}, driven by the random demand and the order a_t.
5. Cost (= −Reward): c(x_t, a_t) includes holding cost, order cost, backorder cost, etc.

(This MDP is sketched in code below.)
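To make the running example concrete, here is a minimal simulation sketch of the inventory MDP. The capacity, cost coefficients, Poisson demand, and the naive order-up-to policy are all illustrative assumptions, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

CAPACITY = 20                            # assumed maximum stock level
HOLD, ORDER, BACKORDER = 1.0, 2.0, 5.0   # assumed cost coefficients

def step(x_t, a_t):
    """One transition: order a_t, then sell against random demand."""
    demand = rng.poisson(3)                       # assumed demand distribution
    x_next = min(x_t + a_t, CAPACITY) - demand    # may go negative = backorders
    cost = (HOLD * max(x_next, 0)                 # holding cost for leftover stock
            + ORDER * a_t                         # per-unit ordering cost
            + BACKORDER * max(-x_next, 0))        # penalty for unmet demand
    return x_next, cost

x, total = 5, 0.0
for t in range(30):                   # Black Friday to end of year, say
    a = max(0, 10 - x)                # a naive order-up-to-10 policy
    x, c = step(x, a)
    total += c
print(total)                          # one realization of the total cost S_T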
What is risk?

Comparing probability distributions of alternatives. [Figure from http://www.fao.org/]
What is risk?

Many notions (risk measures): mappings from random variables (probability distributions) to real numbers, parametrized by, e.g., the curvature of u or the value of λ:

- Expected utility (von Neumann and Morgenstern): ρ_u(X) = E[u(X)] with a concave u,
- Mean-variance (Markowitz): ρ_λ(X) = E[X] − λ V[X],
- Value-at-risk (VaR): ρ_λ(X) = F_X^{-1}(λ),
- Expected shortfall (CVaR or AVaR): an integral of VaR.

(Each of these can be estimated from samples; see the sketch below.)
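As a concrete illustration, each of the four risk measures above can be estimated from Monte Carlo samples of X. In the sketch below, the exponential utility, the λ values, and the empirical-quantile estimator are illustrative assumptions, not choices from the talk.

```python
import numpy as np

def expected_utility(x, alpha=0.5):
    """von Neumann-Morgenstern with concave u(x) = 1 - exp(-alpha*x) (an assumed u)."""
    return np.mean(1.0 - np.exp(-alpha * x))

def mean_variance(x, lam=0.1):
    """Markowitz: E[X] - lambda * V[X]."""
    return np.mean(x) - lam * np.var(x)

def value_at_risk(x, lam=0.05):
    """VaR: the lambda-quantile F_X^{-1}(lambda), estimated empirically."""
    return np.quantile(x, lam)

def expected_shortfall(x, lam=0.05):
    """CVaR/AVaR: average of outcomes below the lambda-quantile (an integral of VaR)."""
    return np.mean(x[x <= value_at_risk(x, lam)])

samples = np.random.default_rng(1).normal(loc=1.0, scale=2.0, size=100_000)
for f in (expected_utility, mean_variance, value_at_risk, expected_shortfall):
    print(f.__name__, f(samples))
```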
VaR

[Figure illustrating value-at-risk as the λ-quantile F_X^{-1}(λ).]
Let's cook up an example

1. A chain of many stores.
2. The previous boss minimizes E[Σ_{t=1}^T c(x_t^π)] for each store.
3. You want to try a different approach. For a given λ, you want:

   min_{π, v} v
   s.t. P( Σ_{t=1}^T c(x_t^π) ≤ v ) ≥ λ.
Let's cook up an example

1. For a given λ, you want:

   min_π F_S^{-1}(λ),  where S = Σ_{t=1}^T c(x_t^π)

   and F_S^{-1}(λ) is the value-at-risk ρ_λ(S).

2. You can then say: a λ-fraction of stores spend less than v on inventory-management costs (under an i.i.d. assumption).
What set of policies?

1. For a given λ, π* = argmin_{π ∈ Π} F_S^{-1}(λ).
2. π* is the training you give to each store's inventory manager.
3. You want to limit personnel training!
4. ⇒ Limit Π to stationary (Markov) policies π : X → A.
What's been done?

- MDPs without risk (risk-neutral),
- risk in i.i.d. settings (in finance, operations research),
- risk in MDPs:
  - mean-variance [MT11] [PG13],
  - probabilistic objective, chance constraint [XM11],
  - dynamic (Markovian) risk measures [Rus10] [Oso12],
  - risk constraint (CVaR or AVaR) [CG14],
  - gradient approach [TCGM15].
Risk-neutral MDP

Finite horizon:   max_{π ∈ Π} E[ Σ_{t=0}^{T−1} r(x_t^π) ]
Infinite horizon: max_{π ∈ Π} E[ Σ_{t=0}^∞ α^t r(x_t^π) ]

Solution via the Bellman equation, value iteration, etc.; a minimal sketch follows below. Unlike E, our ρ is usually not linear.
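For contrast with the risk-averse case, here is a minimal sketch of finite-horizon value iteration for the risk-neutral objective; the randomly generated P and r are placeholders, and the backward-induction form of the Bellman equation is standard.

```python
import numpy as np

rng = np.random.default_rng(2)
nX, nA, T = 4, 3, 10

P = rng.dirichlet(np.ones(nX), size=(nX, nA))  # P[x, a] = distribution over next states
r = rng.uniform(size=(nX, nA))                 # r(x, a)

# Backward induction on the Bellman equation:
#   V_t(x) = max_a [ r(x, a) + sum_y P(y | x, a) V_{t+1}(y) ]
V = np.zeros(nX)
policy = np.zeros((T, nX), dtype=int)
for t in reversed(range(T)):
    Q = r + P @ V                 # Q[x, a] = r(x, a) + E[V(x')]
    policy[t] = Q.argmax(axis=1)  # greedy (generally time-dependent) policy
    V = Q.max(axis=1)
print(V)  # optimal expected total reward from each initial state
```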
What makes risk measure ρ hard?

A stationary policy π induces a Markov chain {x_t^π : x ∈ X, t = 1, 2, ...}:
- Find the CDF of S_T^π = Σ_{t=0}^{T−1} r(x_t^π),
- Solve min_{π ∈ Π} ρ(S_T^π).

For risk-neutral MDPs, π* ∈ Π (well-known). Not in the risk-averse setting! There, we need history-dependent policies π_1 : H_1 → A, π_2 : H_2 → A, ..., which induce a non-Markov chain {x_t}:
- S_T^{π_1,...,π_T} = Σ_{t=0}^{T−1} r(x_t, π_t(x_t)),
- Solve min_{π_1,...,π_T} ρ( S_T^{π_1,...,π_T} ).
Mean-variance [MT11], [PG13]

Max-reward:
   max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]
   s.t. Var( Σ_{t=1}^T r(x_t^π) ) ≤ v.

Min-variance:
   min_{π ∈ Π} Var( Σ_{t=1}^T r(x_t^π) )
   s.t. E[ Σ_{t=1}^T r(x_t^π) ] ≥ w.

These problems are NP-hard (reduction from subset sum).
Probabilistic goal, chance constraint [XM11]

Given v ∈ R:
   max_{π ∈ Π} P( Σ_{t=1}^T r(x_t^π) ≥ v ).

Given v ∈ R and α ∈ (0, 1):
   max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]
   s.t. P( Σ_{t=1}^T r(x_t^π) ≥ v ) ≥ α.
Risk constraint [CG14]

Given β:
   min_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]
   s.t. ρ( Σ_{t=1}^T r(x_t^π) ) ≤ β,
where ρ is CVaR.
Dynamic (Markovian) risk measures [Rus10] [Oso12]

For a random variable X on (Ω, F, P), a coherent risk measure has the dual representation
   ρ(X) = sup_{Q absolutely continuous w.r.t. P} E_Q[X].

For a fixed x_1 and T, maximize over π_1, ..., π_T the nested objective
   r(x_1, π_1(x_1)) + ρ( r(x_2, π_2(x_2)) + ρ( ... + ρ( r_{T+1}(x_{T+1}) ) ... ) ).

Infinite horizon, discount γ:
   max_{π ∈ Π} r(x_0^π) + γ ρ( r(x_1^π) + γ ρ( r(x_2^π) + γ ρ( r(x_3^π) + ... ) ) ),
where each ρ maps a random variable to a real number.

Solve for a stationary π by a version of value iteration.
Gradient approach [TCGM15]

Handles static and dynamic risk, without evaluating or estimating the risk value.

Static coherent ρ:
- min_π ρ( Σ_{t=1}^T r(x_t^π) ),
- build an estimator for ∇_π ρ( Σ_{t=1}^T r(x_t^π) ),
- plug it into a gradient-descent algorithm to find a local optimum.

Dynamic (Markovian) ρ_T (cf. [Rus10]):
- ρ_∞(π) = lim_{T→∞} ρ_T(x_1^π, ..., x_T^π),
- min_π ρ_∞(π), estimate ∇_π ρ_∞(π), etc.
What do we propose?

There is a huge number of policies! We consider only the risk of a stationary policy:

   min_{π ∈ Π} ρ( r(x_1^π) + ... + r(x_T^π) ).   (1)

- Each policy π has a risk value attached to it.
- Evaluating the risk of each policy is challenging.
- Policy improvement from π to π′.
- Approximate solution to an NP-hard problem.
Policy evaluation

1. Transition probabilities P^π known.
2. Transition probabilities P^π unknown.

Tools:
- CLT for Markov chains (e.g., Kontoyiannis and Meyn),
- asymptotic reward variance for an MDP with a fixed policy (e.g., Meyn and Tweedie).
CLT

Theorem. Suppose that R_1, R_2, ... are i.i.d. with finite mean μ and variance σ². Let S_n = (R_1 + ... + R_n)/n. Then:
   √n (S_n − μ) → N(0, σ²)  in distribution.
CLT for Markov chains

Theorem (Theorem 5.1, [KM03]). Suppose that X^π and r satisfy ergodicity and continuity assumptions. Let G_T^π(y) denote the distribution function

   G_T^π(y) ≜ P{ (S_T^π − T φ^π(r)) / (σ √T) ≤ y }.

Let r̂^π be a solution to the Poisson equation

   P^π r̂^π = r̂^π − r + φ^π(r) 1.   (2)

Then, for all x_0^π ∈ X,

   G_T^π(y) = g(y) + (γ(y) / (σ √T)) [ (ϱ / (6σ²)) (1 − y²) − r̂^π(x_0^π) ] + o(1/√T),

uniformly in y ∈ R, where g and γ denote the standard normal distribution and density.
P^π known

Input: P^π, T, r, λ, x_0.

1. Solve ξ^π P^π = ξ^π for the stationary distribution ξ^π.
2. Compute the asymptotic expected reward per time step: φ^π(r) = Σ_x ξ^π(x) r(x).
3. Solve the Poisson equation (2) for r̂^π.
4. Compute the asymptotic variance [MT12]:
   σ² = Σ_{x ∈ X} [ r̂^π(x)² − (P^π r̂^π(x))² ] ξ^π(x).
5. Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
6. Output the VaR estimate:
   ρ̂_λ(S_T^π) = inf{ y ∈ R : g( (y − T φ^π(r)) / (σ √T) ) > λ }.

(A code sketch of these steps follows.)
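The following is a minimal NumPy/SciPy sketch of the six steps above for a finite chain with known P^π. The normalization ξ^π · r̂^π = 0 used to pin down the Poisson solution, and the least-squares solver, are implementation choices assumed here, not taken from the talk.

```python
import numpy as np
from scipy.stats import norm

def evaluate_var(P, r, T, lam):
    """Estimate VaR of S_T^pi = sum_t r(x_t^pi) via the CLT approximation.

    P: (n, n) transition matrix of the chain induced by pi; r: (n,) reward.
    """
    n = len(r)
    # 1. Stationary distribution: xi P = xi, sum(xi) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    xi, *_ = np.linalg.lstsq(A, b, rcond=None)
    # 2. Asymptotic expected reward per time step.
    phi = xi @ r
    # 3. Poisson equation P r_hat = r_hat - r + phi*1, i.e. (I - P) r_hat = r - phi*1,
    #    pinned down by the (assumed) normalization xi @ r_hat = 0.
    A2 = np.vstack([np.eye(n) - P, xi])
    b2 = np.concatenate([r - phi, [0.0]])
    r_hat, *_ = np.linalg.lstsq(A2, b2, rcond=None)
    # 4. Asymptotic variance (cf. [MT12]).
    sigma2 = xi @ (r_hat**2 - (P @ r_hat)**2)
    # 5-6. CLT: S_T is approximately Normal(T*phi, sigma2*T); VaR is its lam-quantile.
    return norm.ppf(lam, loc=T * phi, scale=np.sqrt(sigma2 * T))

P = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy two-state chain
r = np.array([1.0, -1.0])
print(evaluate_var(P, r, T=1000, lam=0.05))
```

Note that the variance in step 4 is invariant to adding a constant to r̂^π, so the normalization only serves to make the linear system well-posed.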
Estimation error

[Figure: estimated CDFs G_T(y) versus y, for σ = 1 and σ = 1.1.]
P^π needs to be estimated

Input: π, T, r, λ, x_0.

1. Estimate P^π with P̂^π (hard part: how many samples?); see the sketch below.
2. Estimate the stationary distribution ξ̂^π.
3. Compute the asymptotic expected reward per time step: φ̂^π(r) = Σ_x ξ̂^π(x) r(x).
4. Solve the Poisson equation for r̂^π.
5. Compute the asymptotic variance [MT12, Equation 17.44]:
   σ̂² = Σ_{x ∈ X} [ r̂^π(x)² − (P̂^π r̂^π(x))² ] ξ̂^π(x).
6. Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
7. Output the VaR estimate:
   ρ̂_λ(S_T^π) = inf{ y ∈ R : g( (y − T φ̂^π(r)) / (σ̂ √T) ) > λ }.
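For step 1, one simple estimator (assumed here; the talk leaves the sample-complexity question open) is the empirical transition frequency along a single trajectory, with a small smoothing constant to keep the estimated chain ergodic. The remaining steps can then reuse the evaluate_var routine sketched above, with P̂^π in place of P^π.

```python
import numpy as np

def estimate_transition_matrix(trajectory, n_states, smoothing=1e-3):
    """Empirical estimate of P^pi from one observed state trajectory.

    Laplace-style smoothing keeps every transition probability positive,
    so the estimated chain stays ergodic (an ad hoc choice).
    """
    counts = np.full((n_states, n_states), smoothing)
    for x, x_next in zip(trajectory[:-1], trajectory[1:]):
        counts[x, x_next] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# Simulate a trajectory from a true chain, then recover it.
rng = np.random.default_rng(3)
P_true = np.array([[0.9, 0.1], [0.2, 0.8]])
traj = [0]
for _ in range(10_000):
    traj.append(rng.choice(2, p=P_true[traj[-1]]))
print(estimate_transition_matrix(traj, n_states=2))
```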
Policy improvement

- Parametrized space of policies: π_θ, for θ ∈ Θ ⊆ R^d.
- Randomized and continuous policies: π_θ : X × A → [0, 1], continuous in the parameter θ.
- Stochastic approximation and policy gradient.
- SPSA (simultaneous perturbation stochastic approximation):

   ∇_θ^{(i)} ρ(S_T^θ) = ( ρ(S_T^{θ + β∆}) − ρ(S_T^θ) ) / (β ∆^{(i)}),
   θ_{t+1}^{(i)} = Γ_i( θ_t^{(i)} + λ(t) ∇_θ^{(i)} ρ(S_T^θ) )  for all i.

- Step sizes satisfy the stochastic approximation assumptions.

(A code sketch of this update follows.)
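Below is a sketch of the one-sided SPSA update above, with rho_of standing in for a black-box risk evaluator such as the CLT-based estimator. The Rademacher perturbations, the 1/t step sizes λ(t), the box projection Γ, and the minus sign (the slide writes the ascent form; here we minimize ρ) are standard stochastic-approximation choices assumed here.

```python
import numpy as np

rng = np.random.default_rng(4)

def spsa_minimize(rho_of, theta0, n_iters=200, beta=0.1, bounds=(-5.0, 5.0)):
    """One-sided SPSA: perturb all coordinates at once with a random
    Rademacher direction Delta, and form a gradient estimate from only
    two risk evaluations per iteration."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, n_iters + 1):
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
        grad = (rho_of(theta + beta * delta) - rho_of(theta)) / (beta * delta)
        step = 1.0 / t                                      # lambda(t): Robbins-Monro step
        theta = np.clip(theta - step * grad, *bounds)       # Gamma: projection onto Theta
    return theta

# Toy check on a smooth deterministic objective (stands in for rho_lambda(S_T^theta)).
print(spsa_minimize(lambda th: np.sum((th - 1.0) ** 2), theta0=np.zeros(3)))
```

The appeal of SPSA here is that it needs only evaluations of ρ, never its gradient, which matches the policy-evaluation routines above.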
Policy improvement

We can find a local optimum.
What can we guarantee? (Soon on arXiv.)

Policy evaluation.
Theorem. Our estimated risk ρ̂ has an error of O(1/(σ√T)) with high probability: for every ε, δ > 0 and T large enough, we have
   P( |ρ̂_λ(S_T^π) − ρ_λ(S_T^π)| ≤ ε ) ≥ 1 − δ.

Policy improvement.
Theorem. Under continuity and stochastic approximation assumptions, for every ε > 0, there exists β such that θ_t converges almost surely to an ε-neighbourhood of a local optimum.
Extensions

- Any risk measure that is a functional of the probability distribution of S_T^π.
- Use this algorithm for other NP-hard problems (subset sum)?
- Extension to infinite horizon, broken down into phases of length T.
References

Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509–3517, 2014.

Ioannis Kontoyiannis and Sean P. Meyn. Spectral theory and limit theorems for geometrically ergodic Markov processes. Annals of Applied Probability, pages 304–362, 2003.

S. Mannor and J. N. Tsitsiklis. Mean-variance optimization in Markov decision processes. In Proceedings of ICML, 2011.

Sean P. Meyn and Richard L. Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.

Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.

L.A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252–260, 2013.

Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.

Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, 2015.