Central-limit approach to risk-aware Markov decision processes


Central-limit approach to risk-aware Markov decision processes. Jia Yuan Yu, Concordia University, November 27, 2015. Joint work with Pengqian Yu and Huan Xu.

Inventory Management (example from LeanCor):
1. Look at current inventory level.
2. Decide how much to order.
3. Receive inventory, sell inventory (random demand).
4. New inventory level.
5. ...

What is an MDP?
1. Single decision maker,
2. Sequential decisions,
3. Uncertainty.
Time horizon 1, 2, ... States: X. Actions: A. Reward function r : X × A → R. Transition probabilities P : X × A × X → [0, 1].
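
As a concrete illustration (not from the talk), these ingredients can be stored as plain arrays; the sizes and random numbers below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3          # illustrative sizes

# Reward function r : X x A -> R as an |X| x |A| array.
r = rng.normal(size=(n_states, n_actions))

# Transition probabilities P : X x A x X -> [0, 1];
# each P[x, a, :] is a distribution over the next state.
P = rng.random(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# A deterministic stationary policy pi : X -> A is one action index per state.
pi = rng.integers(n_actions, size=n_states)
```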

Policies. Sequential decision problem. A policy is a mapping π : X → A. If the decision-maker sticks to policy π, we have a Markov chain x_1^π, x_2^π, ..., and we write r(x_1^π), r(x_2^π), ... instead of r(x_1, π(x_1)), r(x_2, π(x_2)), ...

Example:
1. Time horizon: from Black Friday to end of year.
2. State at day t: how many iPhones x_t in stock.
3. Action: how many iPhones a_t to order from the warehouse.
4. Transition: from one inventory level x_t to the next x_{t+1}, driven by random demand and a_t.
5. Cost (−Reward): c(x_t, a_t) includes holding cost, order cost, backorder cost, etc.
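
A toy numerical instantiation of this inventory MDP, with made-up capacities, demand distribution, and cost coefficients (a sketch, not the talk's actual model):

```python
import numpy as np

MAX_STOCK, MAX_ORDER = 5, 3                    # x_t in {0..5}, a_t in {0..3}
demand_pmf = np.array([0.3, 0.4, 0.2, 0.1])    # P(demand = 0, 1, 2, 3)
h, k, b = 1.0, 2.0, 4.0                        # holding, ordering, backorder costs

n_states, n_actions = MAX_STOCK + 1, MAX_ORDER + 1
P = np.zeros((n_states, n_actions, n_states))  # transition probabilities
c = np.zeros((n_states, n_actions))            # expected one-step cost

for x in range(n_states):
    for a in range(n_actions):
        stock = min(x + a, MAX_STOCK)          # receive the order (capped storage)
        for d, p in enumerate(demand_pmf):
            x_next = max(stock - d, 0)         # sell up to the random demand
            P[x, a, x_next] += p
            c[x, a] += p * (h * x_next + k * a + b * max(d - stock, 0))
```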

What is risk? Comparing probability distributions of alternatives. (Figure from http://www.fao.org/)

What is risk? Many notions (risk measures): mappings of random variables (probability distributions) to real numbers, parametrized by the curvature of u or the value of λ.
- Expected utility (von Neumann and Morgenstern): ρ_u(X) = E u(X) with a concave u,
- Mean-variance (Markowitz): ρ_λ(X) = E X − λ V X,
- Value-at-risk (VaR): ρ_λ(X) = F_X^{-1}(λ),
- Expected shortfall (CVaR or AVaR): an integral of VaR.
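
A quick numerical illustration of these risk measures on simulated cost samples (the gamma distribution and the parameter values are arbitrary choices for the sketch):

```python
import numpy as np

def value_at_risk(samples, lam):
    """VaR: the lambda-quantile F_X^{-1}(lambda) of the empirical distribution."""
    return np.quantile(samples, lam)

def expected_shortfall(samples, lam):
    """CVaR/AVaR: average of the outcomes at or beyond the lambda-quantile
    (treating larger values as worse, as for the inventory costs)."""
    q = value_at_risk(samples, lam)
    return samples[samples >= q].mean()

def mean_variance(samples, lam):
    """Markowitz criterion E[X] - lambda * Var[X] (stated for rewards)."""
    return samples.mean() - lam * samples.var()

costs = np.random.default_rng(0).gamma(shape=2.0, scale=10.0, size=100_000)
print(value_at_risk(costs, 0.95), expected_shortfall(costs, 0.95))
```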

VAR

Let's cook up an example:
1. Chain of many stores.
2. The previous boss minimizes E[ Σ_{t=1}^T c(x_t^π) ] for each store.
3. You want to try a different approach. For a given λ, you want:
   min_{π, v} v   s.t.   P( Σ_{t=1}^T c(x_t^π) ≤ v ) ≥ λ.

Let's cook up an example:
1. For a given λ, you want min_π F_S^{-1}(λ), where S = Σ_{t=1}^T c(x_t^π) and F_S^{-1}(λ) is the value-at-risk ρ_λ(S).
2. You can then say that a λ-fraction of stores spend less than v on inventory management costs (under an i.i.d. assumption).

What set of policies?
1. For a given λ, π* = arg min_{π ∈ Π} F_S^{-1}(λ).
2. π* is the training you give to each store's inventory manager.
3. You want to limit personnel training!
4. ⇒ Limit Π to stationary (Markov) policies π : X → A.

What's been done?
- MDPs without risk (risk-neutral)
- Risk in i.i.d. settings (in finance, operations research)
- Risk in MDPs:
  - mean-variance [MT11] [PG13]
  - probabilistic objective, chance constraint [XM11]
  - dynamic (Markovian) risk measures [Rus10] [Oso12]
  - risk constraint (CVaR or AVaR)
  - gradient approach [TCGM15]

Risk-neutral MDP. Finite horizon: max_{π ∈ Π} E[ Σ_{t=0}^{T−1} r(x_t^π) ]. Infinite horizon: max_{π ∈ Π} E[ Σ_{t=0}^∞ α^t r(x_t^π) ]. Solution via Bellman equation, value iteration, etc. Unlike E, our ρ is usually not linear.
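
For contrast with what follows, here is a plain value-iteration sketch for the risk-neutral discounted objective, using the array layout from the earlier sketches (illustrative, not the talk's code):

```python
import numpy as np

def value_iteration(P, r, alpha=0.95, tol=1e-8):
    """Iterate the Bellman operator
    V(x) <- max_a [ r(x, a) + alpha * sum_x' P(x' | x, a) V(x') ]."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = r + alpha * (P @ V)             # Q[x, a]; P @ V averages over x'
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value function, greedy stationary policy
        V = V_new
```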

What makes the risk measure ρ hard?
- Stationary policy π ⇒ Markov chain {x_t^π : t = 1, 2, ...} taking values in X. Find the CDF of S_T^π = Σ_{t=0}^{T−1} r(x_t^π), then solve min_{π ∈ Π} ρ(S_T^π).
- For risk-neutral MDPs, π* ∈ Π (well-known). Not in the risk-averse setting!
- History-dependent policies π_1 : H_1 → A, π_2 : H_2 → A, ... give a non-Markov chain {x_t}. Then S_T^{π_1,...,π_T} = Σ_{t=0}^{T−1} r(x_t, π_t(x_t)), and we must solve min_{π_1,...,π_T} ρ(S_T^{π_1,...,π_T}).

Mean-variance [MT11], [PG13].
Max-reward: max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ] s.t. Var( Σ_{t=1}^T r(x_t^π) ) ≤ v.
Min-variance: min_{π ∈ Π} Var( Σ_{t=1}^T r(x_t^π) ) s.t. E[ Σ_{t=1}^T r(x_t^π) ] ≥ w.
These problems are NP-hard (reduction from subset sum).

Probabilistic goal, chance constraint [XM11].
Given v ∈ R: max_{π ∈ Π} P( Σ_{t=1}^T r(x_t^π) ≥ v ).
Given v ∈ R and α ∈ (0, 1): max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ] s.t. P( Σ_{t=1}^T r(x_t^π) ≥ v ) ≥ α.

Risk constraint [CG14]. Given β: min_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ] s.t. ρ( Σ_{t=1}^T r(x_t^π) ) ≤ β, where ρ is CVaR.

Dynamic (Markovian) risk measures [Rus10] [Oso12].
For a random variable X on (Ω, F, P) and a coherent ρ: ρ(X) = sup_{Q absolutely continuous w.r.t. P} E_Q[X].
For a fixed x_1 and T, optimize over π_1, ..., π_T:
r(x_1, π_1(x_1)) + ρ( c(x_2, π_2(x_2)) + ρ( ... + ρ( c_{T+1}(x_{T+1}) ) ... ) ).
Infinite horizon, discount γ: max_{π ∈ Π} r(x_0^π) + γ ρ( r(x_1^π) + γ ρ( r(x_2^π) + γ ρ( r(x_3^π) + ... ) ) ), where each inner sum is a random variable and each ρ(·) of it is a real number.
Solve for a stationary π* by a version of value iteration.
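
To make the nested recursion concrete, here is a sketch of a risk-aware value iteration in which the one-step coherent risk map is taken to be CVaR over the transition distribution; this is one instance of the general recipe, not the specific algorithm of [Rus10] or [Oso12]:

```python
import numpy as np

def cvar_of_next_value(values, probs, alpha):
    """Risk-averse evaluation of the next-state value: the mean of the worst
    alpha-tail of the discrete distribution (values, probs)."""
    order = np.argsort(values)                   # worst outcomes first
    v, p = values[order], probs[order]
    cum = np.cumsum(p)
    w = np.clip(np.minimum(cum, alpha) - (cum - p), 0.0, None)
    return float(w @ v) / alpha

def risk_aware_value_iteration(P, r, gamma=0.95, alpha=0.3, tol=1e-8):
    """V(x) <- max_a [ r(x, a) + gamma * CVaR_alpha over P(.|x,a) of V ]."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = np.array([[r[x, a] + gamma * cvar_of_next_value(V, P[x, a], alpha)
                       for a in range(n_actions)] for x in range(n_states)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```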

Gradient approach [TCGM15]. Static and dynamic risk, without evaluating or estimating the risk value itself.
- Static coherent ρ: min_π ρ( Σ_{t=1}^T x_t^π ); build an estimator for ∇_π ρ( Σ_{t=1}^T x_t^π ) and plug it into a gradient descent algorithm to find a local optimum.
- Dynamic (Markovian) ρ_T (cf. [Rus10]): ρ_∞(π) = lim_{T→∞} ρ_T(x_1^π, ..., x_T^π); min_π ρ_∞(π), estimate ∇_π ρ_∞(π), etc.

What do we propose? There is a huge number of policies! We consider only the risk of a stationary policy:
min_{π ∈ Π} ρ( r(x_1^π) + ... + r(x_T^π) ).   (1)
Each policy π has a risk value attached to it. Evaluating the risk of each policy is challenging. Policy improvement from π to π'. Approximate solution to an NP-hard problem.

Policy evaluation:
1. Transition probabilities P^π known.
2. Transition probabilities P^π unknown.
3. Tools: CLT for Markov chains (e.g., Kontoyiannis and Meyn); asymptotic reward variance for an MDP with a fixed policy (e.g., Meyn and Tweedie).

CLT. Theorem. Suppose that R_1, R_2, ... are i.i.d. with finite mean µ and variance σ². Let S_n = (R_1 + ... + R_n)/n. Then √n (S_n − µ) converges in distribution to N(0, σ²).
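
A small simulation of this statement (the exponential samples are an arbitrary choice; here µ = 2 and σ² = 4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 2_000
R = rng.exponential(scale=2.0, size=(trials, n))   # i.i.d., mean 2, variance 4
z = np.sqrt(n) * (R.mean(axis=1) - 2.0)            # ~ N(0, 4) by the CLT
print(z.mean(), z.var())                           # close to 0 and 4
```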

CLT for Markov chains. Theorem (Theorem 5.1, [KM03]). Suppose that X^π and r satisfy ergodicity and continuity assumptions. Let G_T^π(y) denote the distribution function
G_T^π(y) := P{ (S_T^π − T φ^π(r)) / (σ √T) ≤ y }.
Let r̂^π(x) be a solution to the Poisson equation
P^π r̂^π = r̂^π − r + φ^π(r) 1.   (2)
Then, for all x_0^π ∈ X,
G_T^π(y) = g(y) + (γ(y) / (σ √T)) [ (ϱ / (6σ²)) (1 − y²) + r̂^π(x_0^π) ] + o(1/√T),
uniformly in y ∈ R, where g and γ denote the standard normal distribution function and density.

P^π known. Input: P^π, T, r, λ, x_0.
1. Solve ξ^π P^π = ξ^π for the stationary distribution ξ^π.
2. Compute the asymptotic expected reward per time step: φ^π(r) = Σ_x ξ^π(x) r(x).
3. Solve the Poisson equation (2) for r̂^π.
4. Compute the asymptotic variance [MT12]: σ² = Σ_{x ∈ X} [ r̂^π(x)² − (P^π r̂^π(x))² ] ξ^π(x).
5. Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
6. Output the VaR estimate: ρ̂_λ(S_T^π) = inf { y ∈ R : g( (y − T φ^π(r)) / (σ √T) ) > λ }.
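
A compact sketch of these six steps as straightforward linear algebra (assumed inputs: the policy's transition matrix P_pi and per-state reward vector r_pi; scipy's standard normal stands in for g; this illustrates the recipe and is not the authors' implementation):

```python
import numpy as np
from scipy.stats import norm

def var_estimate_known_P(P_pi, r_pi, T, lam):
    n = P_pi.shape[0]

    # 1. Stationary distribution: xi P_pi = xi, normalized to sum to one.
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    xi = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]

    # 2. Asymptotic expected reward per step: phi = sum_x xi(x) r(x).
    phi = xi @ r_pi

    # 3. Poisson equation (2): (I - P_pi) rhat = r - phi * 1.
    #    The system is singular; any solution works, since sigma^2 below is
    #    invariant to adding a constant to rhat.
    rhat = np.linalg.lstsq(np.eye(n) - P_pi, r_pi - phi, rcond=None)[0]

    # 4. Asymptotic variance: sigma^2 = sum_x [rhat(x)^2 - (P_pi rhat)(x)^2] xi(x).
    sigma2 = xi @ (rhat ** 2 - (P_pi @ rhat) ** 2)

    # 5-6. Normal (CLT) approximation of S_T^pi and its lambda-quantile (VaR).
    return norm.ppf(lam, loc=T * phi, scale=np.sqrt(max(sigma2, 1e-12) * T))
```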

Estimation error. (Figure: estimated CDF G_T plotted against y, for σ = 1 and σ = 1.1.)

P^π needs to be estimated. Input: π, T, r, λ, x_0.
1. Estimate P^π with P̂^π (hard part: how many samples?).
2. Estimate the stationary distribution ξ̂^π.
3. Compute the asymptotic expected reward per time step: φ̂^π(r) = Σ_x ξ̂^π(x) r(x).
4. Solve the Poisson equation for r̂^π.
5. Compute the asymptotic variance [MT12, Equation 17.44]: σ̂² = Σ_{x ∈ X} [ r̂^π(x)² − (P̂^π r̂^π(x))² ] ξ̂^π(x).
6. Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
7. Output the VaR estimate: ρ̂_λ(S_T^π) = inf { y ∈ R : g( (y − T φ̂^π(r)) / (σ̂ √T) ) > λ }.
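
For step 1, a minimal empirical estimator of P^π from a single observed trajectory under π (the uniform fallback for unvisited states is an arbitrary choice); steps 2-7 can then reuse the var_estimate_known_P sketch above with P̂^π in place of P^π:

```python
import numpy as np

def estimate_P_pi(trajectory, n_states):
    """Empirical transition matrix of the chain x_1^pi, x_2^pi, ... observed
    while following the fixed policy pi (sample-size question not addressed)."""
    counts = np.zeros((n_states, n_states))
    for x, x_next in zip(trajectory[:-1], trajectory[1:]):
        counts[x, x_next] += 1.0
    rows = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / n_states)   # fallback for unvisited states
    return np.divide(counts, rows, out=uniform, where=rows > 0)
```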

Policy improvement.
- Parametrized space of policies: π_θ, for θ ∈ Θ ⊂ R^d.
- Randomized and continuous policies, with π_θ continuous in the parameter θ.
- Stochastic approximation and policy gradient.
- SPSA (simultaneous perturbation stochastic approximation):
  ∇_θ^{(i)} ρ(S_T^θ) = [ ρ(S_T^{θ + β^{(i)}}) − ρ(S_T^θ) ] / β,
  θ_{t+1}^{(i)} = Γ_i( θ_t + λ(t) ∇_θ^{(i)} ρ(S_T^θ) ) for all i.
- Step sizes satisfy stochastic-approximation assumptions.
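
A generic simultaneous-perturbation step in this spirit (standard two-sided SPSA with a Rademacher perturbation and a box projection standing in for Γ; the talk's exact finite-difference form, projection, and step sizes λ(t) may differ, and rho_hat is assumed to return the VaR estimate of π_θ from the evaluation routine sketched above):

```python
import numpy as np

def spsa_step(theta, rho_hat, beta=0.1, step=0.01, rng=None):
    """One projected stochastic-gradient step on the policy parameter theta,
    descending the estimated risk rho_hat(theta) of S_T^theta."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)         # Rademacher perturbation
    grad = (rho_hat(theta + beta * delta) - rho_hat(theta - beta * delta)) \
           / (2.0 * beta * delta)                             # SPSA gradient estimate
    return np.clip(theta - step * grad, -10.0, 10.0)          # projection onto a box Theta
```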

Policy improvement We can find a local optimum.

What can we guarantee? (Details soon on arXiv.)
Policy evaluation Theorem. Our estimated risk ρ̂ has an error of order 1/(σ√T) with high probability: for every ɛ, δ > 0 and T large enough, we have P( |ρ̂_λ(S_T^π) − ρ_λ(S_T^π)| ≤ ɛ ) ≥ 1 − δ.
Policy improvement Theorem. Under continuity and stochastic-approximation assumptions, for every ɛ > 0 there exists β such that θ_t converges almost surely to an ɛ-neighbourhood of a local optimum.

Extensions:
- Any risk measure that is a functional of the probability distribution of S_T^π.
- Use this algorithm for other NP-hard problems (Subset Sum)?
- Extension to the infinite horizon, broken down into phases of length T.

References
[CG14] Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509-3517, 2014.
[KM03] Ioannis Kontoyiannis and Sean P. Meyn. Spectral theory and limit theorems for geometrically ergodic Markov processes. Annals of Applied Probability, pages 304-362, 2003.
[MT11] S. Mannor and J. N. Tsitsiklis. Mean-variance optimization in Markov decision processes. In Proceedings of ICML, 2011.
[MT12] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
[Oso12] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233-241, 2012.
[PG13] L. A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252-260, 2013.
[Rus10] Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235-261, 2010.