Practicable Robust Markov Decision Processes


Practicable Robust Markov Decision Processes. Huan Xu, Department of Mechanical Engineering, National University of Singapore. Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Technion), and Ofir Mebel (Apple). 18 Nov 2015, IMS.

Classical planning problems. We typically want to maximize the expected average reward. In planning: the model is known; there is a single scalar reward; rare events (black swans) only crop up through expectations.

Motivation example: mail catalog. A large US retailer (Fortune 500 company) faces a marketing problem: send or do not send a coupon/invitation/mail-order catalogue. Common wisdom: per customer, look at RFM (Recency, Frequency, Monetary value). Dynamics matter. Every model will be wrong: how do you model humans?

Common to these problems: the real state space is huge, with lots of uncertainty and many parameters; batch data are available; the operative solution is to build a smallish MDP (fewer than 300 states!), solve it, and apply it. Computational speed is less of an issue. Uncertainty and risk are THE concern (and cannot be made scalar).

The question: how do we optimize when the model is not (fully) known, but we have some idea of the magnitude of the uncertainty?

Markov Decision Processes. An MDP is defined by a tuple ⟨T, γ, S, A, p, r⟩: T is the (possibly infinite) decision horizon; γ is the discount factor; S is the set of states; A is the set of actions; p is the transition probability, in the form p_t(s' | s, a); r is the immediate reward, in the form r_t(s, a).

The total reward is defined as R = Σ_{t=1}^T γ^{t−1} r_t(s_t, a_t). The classical goal is to find a policy π that maximizes the expected total reward under π. Three solution approaches: value iteration, policy iteration, and linear programming.
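
As a concrete reference point for these approaches, here is a minimal finite-horizon value-iteration sketch (my own illustration, not from the slides; the arrays P and R and the horizon T are assumed placeholders for a known model):

```python
import numpy as np

def value_iteration(P, R, gamma, T):
    """Finite-horizon value iteration for a known MDP.

    P: array of shape (S, A, S), P[s, a, s2] = p(s2 | s, a)
    R: array of shape (S, A), immediate rewards r(s, a)
    gamma: discount factor, T: horizon
    """
    S, A, _ = P.shape
    V = np.zeros((T + 1, S))            # V[T] = 0: terminal values
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):      # backward induction
        Q = R + gamma * (P @ V[t + 1])  # Bellman backup, shape (S, A)
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```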

Two types of uncertainty. Internal uncertainty: uncertainty due to random transitions and rewards (risk-aware MDPs, not this talk). Parameter uncertainty: uncertainty in the parameters themselves (robust MDPs, this talk). Risk vs. ambiguity: Ellsberg's paradox.

Parameter uncertainty arises from: (1) noisy or incorrect observations, (2) estimation from finite samples, (3) environment-dependent parameters, and (4) simplification of the problem.

Robust MDPs. S and A are known; p and r are unknown. When in doubt, assume the worst. Set-inclusive uncertainty: p and r belong to a known set (the uncertainty set U). Look for a policy with the best worst-case performance. The problem becomes: max_{policy} min_{parameter ∈ U} E^{policy, parameter}[ Σ_t γ^t r_t ].

Game against nature. In general the problem is NP-hard, except in the rectangular case. There is also a nice (non-worst-case) generative probabilistic interpretation.

Robust MDPs: conventional wisdom. The problem is solvable when the uncertainty set is rectangular, i.e., the Cartesian product of uncertainty sets of individual states: U = { (p, r) : p_s ∈ U^p(s), r_s ∈ U^r(s) for all s } = ∏_s U^p(s) × ∏_s U^r(s). It is solved using robust dynamic programming (robust value iteration, robust policy iteration) by Nilim and El Ghaoui (OR, 2005) and Iyengar (MOR, 2005), among others: take the worst outcome for every state. If really bad events are to be included, U^p and U^r must be made huge, which leads to conservativeness. This is a complicated tradeoff to make: bad events cannot be correlated across states.
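
To make the rectangular robust Bellman backup concrete, here is a minimal sketch (my own illustration, not the authors' code) of robust value iteration where each per-state uncertainty set U^p(s, a) is an L1 ball of radius eps around a nominal transition row; the inner minimization shifts probability mass toward the worst successor state:

```python
import numpy as np

def worst_case_expectation(p_nom, V, eps):
    """min_p p.V over the L1 ball ||p - p_nom||_1 <= eps intersected with the
    simplex: shift up to eps/2 probability mass from the highest-value
    successor states onto the lowest-value one."""
    p = p_nom.copy()
    budget = eps / 2.0
    worst = int(np.argmin(V))
    for s in np.argsort(V)[::-1]:       # take mass from high-value states first
        if s == worst or budget <= 0:
            continue
        move = min(p[s], budget)
        p[s] -= move
        p[worst] += move
        budget -= move
    return p @ V

def robust_value_iteration(P_nom, R, gamma, eps, n_iter=500):
    """(s, a)-rectangular robust value iteration around a nominal model."""
    S, A, _ = P_nom.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + gamma * worst_case_expectation(P_nom[s, a], V, eps)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```

The inner problem decomposes per state-action pair precisely because of rectangularity, which is what makes the backup tractable.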

Practicable? The problem is not solved yet; issues remain before robust MDPs are practically successful: more flexible uncertainty sets (this talk); probabilistic uncertainty (not this talk); large scale (not this talk); learning the uncertainty (this talk).

Coupling uncertainty: intuition. No coupling is conservative; general coupling is NP-hard. Is there some sort of coupling that remains tractable? k-rectangular uncertainty.

Example 1: lightning does not strike twice. How many times can lightning strike?

Lightning does not strike twice (figure).

Lightning does not strike twice: limiting the number of deviations allowed. U_K = { (p, r) : p_s = p_s^nom and r_s = r_s^nom for all states except at most K states s_1, ..., s_K, where (p_{s_i}, r_{s_i}) ∈ U^p(s_i) × U^r(s_i) }. If K = 0, this is the naive (nominal) MDP; if K = |S|, it is the standard robust MDP; K in between is the interesting regime.
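
As a small illustration of this definition, a membership test for U_K might look as follows (a sketch with my own data layout; in_state_set is an assumed callback checking (p_s, r_s) against U^p(s) × U^r(s)):

```python
import numpy as np

def in_U_K(p, r, p_nom, r_nom, K, in_state_set):
    """Membership test for U_K: at most K states may deviate from the nominal
    (p_nom, r_nom), and each deviating state s must satisfy
    (p[s], r[s]) in U^p(s) x U^r(s), tested by the callback in_state_set."""
    deviating = [s for s in range(len(r))
                 if not (np.array_equal(p[s], p_nom[s]) and r[s] == r_nom[s])]
    if len(deviating) > K:
        return False
    return all(in_state_set(s, p[s], r[s]) for s in deviating)
```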

Non-rectangular uncertainty sets. Capturing this with rectangular sets would require large uncertainty, yet the LDST set is close to being rectangular. Rectangularity means the conditional projection of the uncertainty set remains the same. For LDST, there are K + 1 possible conditional projections of the uncertainty set.

Example 2: union of rectangular uncertainty sets. The uncertainty set is a union of several rectangular uncertainty sets (which may not be disjoint), i.e., U = ⋃_{j=1}^J U^j, where U^j = ∏_{s ∈ S} U^j_s for j = 1, ..., J. There are at most 2^J − 1 possible conditional projections of the uncertainty set.

k-rectangular uncertainty set. Definition: let k be an integer; an uncertainty set U is called k-rectangular if, for any subset of states, the class of conditional projection sets contains no more than k sets. k-rectangularity is closed under operations such as union, intersection, and Cartesian product (with a potential increase of k).

The uncertainty model. Uncertainty is realized in a history-dependent way, and the decision maker observes the realized uncertain parameters. With a Markovian uncertainty set the problem is NP-hard; it amounts to solving a two-player zero-sum stochastic game. Key observation: the uncertainty sets are coupled only through k, the index of the conditional projection.

Algorithm for k-rectangular uncertainty sets. Theorem: the robust MDP with a k-rectangular uncertainty set can be solved in polynomial time. How? Augment the state space to include k; double the decision stages, with odd stages for the decision maker and even stages for Nature; then solve by dynamic programming on the augmented state space, alternating between the decision maker and Nature.
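
A minimal sketch of the state-augmentation idea (my own simplified variant, not the exact construction from the talk): assume a finite horizon, reward-only uncertainty with nominal rewards R_nom and worst-case rewards R_bad, and a budget of K deviations that Nature may spend along the trajectory. The augmented value function carries the remaining budget k:

```python
import numpy as np

def ldst_budget_dp(P_nom, R_nom, R_bad, gamma, T, K):
    """DP on the augmented state (s, remaining budget k): Nature may swap the
    nominal reward R_nom(s, a) for the worst-case reward R_bad(s, a) at most
    K times along the trajectory."""
    S, A, _ = P_nom.shape
    V = np.zeros((T + 1, S, K + 1))
    for t in range(T - 1, -1, -1):
        for s in range(S):
            for k in range(K + 1):
                q_vals = []
                for a in range(A):
                    nominal = R_nom[s, a] + gamma * (P_nom[s, a] @ V[t + 1, :, k])
                    if k > 0:   # Nature's move: deviate here (spend budget) or not
                        deviated = R_bad[s, a] + gamma * (P_nom[s, a] @ V[t + 1, :, k - 1])
                        q_vals.append(min(nominal, deviated))
                    else:
                        q_vals.append(nominal)
                V[t, s, k] = max(q_vals)   # decision maker's move
    return V   # V[0, s0, K] is the robust value from s0 with budget K
```

The decision maker's maximization and Nature's minimization alternate inside a single backup here; unfolding them into odd and even stages, as on the slide, gives the same values.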

Illustration via LDST (figure).

Question: where do I get the uncertainty sets? There are two types of uncertainty. Stochastic uncertainty: there are some true p and r; we just do not know their exact values. Adversarial uncertainty: there are no true p and r; each time the state is visited the parameter can vary, due to model simplification or some adversarial effect that was ignored. If I can collect more data, can I identify the nature of the uncertainty, learn the value of the stochastic uncertainty, and learn the level of the adversarial uncertainty? Yes we can!

Formal setup. An MDP with finite states and actions, and rewards in [0, 1]. For each pair (s, a), we are given a (potentially infinite) class of nested uncertainty sets U(s, a). Each pair (s, a) can be either stochastic or adversarial, and we do not know which. If (s, a) is stochastic, then its true p and r are unknown. If (s, a) is adversarial, then its true uncertainty set (also unknown) belongs to U(s, a). We are allowed to repeat the MDP many times.

Challenge and objective. For an adversarial state-action pair, the parameter can be arbitrary (and adaptive to the algorithm), hence it is not possible to exactly identify the nature of the uncertainty. It is also not possible to achieve diminishing regret against the optimal stationary policy in hindsight; that is, we may not take full advantage if the adversary chooses to play nice. We can, however, achieve vanishing regret against the performance of the robust MDP that knows exactly p and r for each stochastic pair and the true uncertainty set of each adversarial pair.

Main intuition. When the model is purely stochastic, one can resort to RL algorithms such as UCRL (which consistently uses optimistic estimation) to achieve diminishing regret. However, an adversary can hurt.

Main intuition (figure: a small MDP with states s_0, ..., s_4, actions a_1, a_2, a_3, and rewards g, g + β, and g − α, with 2β < α < 3β). The adversary chooses the solid-line transitions in phase 1 (2T steps) and the dashed-line transitions in phase 2 (T steps). The expected value of s_4 is g + (β − α)/2, and the expected value of s_1 is g + (3β − α)/4 > g. The total accumulated reward is 3Tg + T(2β − α). Compared to the minimax policy, the overall regret is non-diminishing.

Main intuition: be optimistic, but cautious. Using UCRL, start by assuming all state-action pairs are stochastic, and monitor the outcome of each pair's transitions. Use a statistical check to identify pairs that are assumed to be stochastic but are in fact adversarial, as well as pairs assumed to have an uncertainty set smaller than their true uncertainty set. Update the information of the pairs that fail the statistical check, and re-solve the minimax MDP.

The algorithm: OLRM. Input: S, A, T, δ, and for each (s, a) the class U(s, a). (1) Initialize the set F ← {}; for each (s, a), set the current uncertainty set Û(s, a) ← {}. (2) Initialize k ← 1. (3) Compute an optimistic robust policy π, assuming all state-action pairs in F are adversarial with uncertainty sets given by Û(s, a). (4) Execute π until one of the following happens: the execution count of some state-action pair (s, a) has doubled; or the executed state-action pair (s, a) fails the stochasticity check, in which case (s, a) is added to F (if not already in F) and Û(s, a) is updated. (5) Increment k and go back to step (3).
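
A schematic of this loop in code (a sketch only; env_reset, env_step, compute_policy, check_fails, and update_U are assumed callbacks standing in for the environment, the optimistic robust planner of the next slide, the stochasticity check, and the uncertainty-set update, and none of these names come from the talk):

```python
def olrm(n_episodes, T, env_reset, env_step, compute_policy, check_fails, update_U):
    """Skeleton of the OLRM episode loop (callbacks are assumed stand-ins for
    the environment, the optimistic robust planner, the stochasticity check,
    and the uncertainty-set update)."""
    F = set()          # pairs currently treated as adversarial
    counts = {}        # execution counts N(s, a)
    snapshot = {}      # counts at the last policy (re)computation
    k = 1
    pi = compute_policy(F, k)
    for _ in range(n_episodes):
        s = env_reset()
        for t in range(T):
            a = pi(t, s)
            s_next, reward = env_step(s, a)      # reward would feed regret accounting
            counts[(s, a)] = counts.get((s, a), 0) + 1
            recompute = False
            if check_fails(s, a):                # stochasticity check failed
                F.add((s, a))
                update_U(s, a)
                recompute = True
            if counts[(s, a)] >= 2 * max(snapshot.get((s, a), 0), 1):
                recompute = True                 # count doubled since the last plan
            if recompute:
                snapshot = dict(counts)
                k += 1
                pi = compute_policy(F, k)
            s = s_next
    return F
```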

Computing the optimistic robust policy. Input: S, A, T, δ, F, k, and for each (s, a), Û(s, a), the empirical transition estimate P̂_k(·|s, a), and the count N_k(s, a). (1) Set Ṽ^k_T(s) = 0 for all s. (2) Repeat, for t = T − 1, ..., 0: for each (s, a) ∈ F, set Q̃^k_t(s, a) = min{ T − t, r(s, a) + min_{p ∈ Û(s,a)} p(·)·Ṽ^k_{t+1}(·) }; for each (s, a) ∉ F, set Q̃^k_t(s, a) = min{ T − t, r(s, a) + P̂_k(·|s, a)·Ṽ^k_{t+1}(·) + (T + 1) √( log(14SATk²/δ) / (2 N_k(s, a)) ) }; for each s, set Ṽ^k_t(s) = max_a Q̃^k_t(s, a) and π_t(s) = argmax_a Q̃^k_t(s, a). (3) Output π. Robust for adversarial pairs, optimistic for stochastic pairs.
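
In code, this backup might look as follows (a sketch with my own data layout; U_min(s, a, V) is an assumed callback returning min_{p ∈ Û(s,a)} p·V, e.g., the L1-ball routine sketched earlier, and R_hat, P_hat, N are empirical rewards, transitions, and counts):

```python
import numpy as np

def optimistic_robust_policy(S, A, T, delta, k, F, R_hat, P_hat, N, U_min):
    """Backward induction: robust backup for pairs in F, optimistic
    (UCRL-style) backup with a confidence bonus for the remaining pairs.

    R_hat (S, A), P_hat (S, A, S), N (S, A) are empirical estimates and
    counts; U_min(s, a, V) returns the worst-case expectation over the
    current uncertainty set (assumed to be supplied by the caller)."""
    V = np.zeros((T + 1, S))
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                if (s, a) in F:            # robust backup
                    q = R_hat[s, a] + U_min(s, a, V[t + 1])
                else:                      # optimistic backup
                    bonus = (T + 1) * np.sqrt(
                        np.log(14 * S * A * T * k ** 2 / delta) / (2 * max(N[s, a], 1)))
                    q = R_hat[s, a] + P_hat[s, a] @ V[t + 1] + bonus
                Q[s, a] = min(T - t, q)    # cap by the remaining horizon
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return pi, V
```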

Stochasticity check. When (s, a) ∉ F, it fails the stochasticity check if Σ_{j=1}^n [ P̂_{k_j}(·|s, a)·Ṽ^{k_j}_{t_j+1}(·) − Ṽ^{k_j}_{t_j+1}(s_j) ] > (2.5 + T + 3.5T√S) √( n log(14SATτ²/δ) ). When (s, a) ∈ F, it fails the stochasticity check if Σ_{j=n'+1}^n [ min_{p ∈ Û(s,a)} p(·)·Ṽ^{k_j}_{t_j+1}(·) − Ṽ^{k_j}_{t_j+1}(s_j) ] > 2T √( 2(n − n') log(14τ²/δ) ). If (s, a) fails the stochasticity check, add (s, a) to F and update Û(s, a) to the smallest set in the class U(s, a) that satisfies Σ_{j=n'+1}^n [ min_{p ∈ Û(s,a)} p(·)·Ṽ^{k_j}_{t_j+1}(·) − Ṽ^{k_j}_{t_j+1}(s_j) ] < T √( 2(n − n') log(14τ²/δ) ).
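
A sketch of the first check (for a pair not yet in F; variable names are my own, with each element of visits holding the empirical transition row P̂_{k_j}(·|s, a), the next-stage value function Ṽ^{k_j}_{t_j+1}, and the realized next state s_j recorded at the j-th visit to the pair):

```python
import numpy as np

def fails_first_check(visits, S, A, T, tau, delta):
    """First stochasticity check, for a pair (s, a) not yet in F.

    visits: list of (P_hat_row, V_next, s_next) tuples, one per past visit:
    the empirical transition row used at that visit, the next-stage value
    function, and the state actually reached."""
    n = len(visits)
    if n == 0:
        return False
    gap = sum(P_hat_row @ V_next - V_next[s_next]
              for P_hat_row, V_next, s_next in visits)
    threshold = (2.5 + T + 3.5 * T * np.sqrt(S)) * np.sqrt(
        n * np.log(14 * S * A * T * tau ** 2 / delta))
    return gap > threshold
```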

More on the stochasticity check. It essentially checks whether the value of the actual transitions from (s, a) falls below what is expected from the parameter estimate. No false alarms: with high probability, all stochastic state-action pairs always pass the stochasticity check, and all adversarial state-action pairs pass it as long as Û(s, a) ⊇ U*(s, a). The check may fail to identify adversarial pairs if the adversary plays nicely; however, satisfactory rewards are then accumulated, so nothing needs to change. If the adversary plays nasty, the stochasticity check detects it and subsequently protects against it: if it ain't broke, don't fix it.

Performance guarantee. Theorem: given δ, T, S, A and U, if |U(s, a)| ≤ C for all (s, a), then the total regret of OLRM after m episodes is ∆(m) ≤ O( T^{3/2} (√S + √C) √( SAm log(SATm/δ) ) ) for all m, with probability at least 1 − δ. The total number of steps is τ = Tm, hence the regret is Õ[ T (√S + √C) √(SAτ) ].

Performance guarantee. What if U is an infinite class? We consider the following class: U(s, a) = { η(s, a) + α b(s, a) : 0 ≤ α ≤ ᾱ(s, a) } ∩ P(S). (1) Theorem: given δ, T, S, A, and U as defined in Eq. (1), the total regret of OLRM is ∆(m) ≤ Õ[ T ( S√(Aτ) + (SAᾱB)^{2/3} τ^{1/3} + (SAᾱB)^{1/3} τ^{2/3} ) ] for all m, with probability at least 1 − δ.

Infinite-horizon average reward. Assume that for any p in the true uncertainty set, the resulting MDP is unichain and communicating. A similar algorithm applies, except that computing the optimistic robust policy is trickier. Similar performance guarantees hold: O(√τ) regret for finite U and O(τ^{2/3}) for infinite U.

Simulation setup (figure: Gold and Silver outcomes under the usual condition and the abnormal condition).

Simulation (figure: an MDP with states s_0, ..., s_6, actions a_1, ..., a_4, and rewards R_minimax, R_good, and R_bad).

Simulation results (plot: total rewards versus time steps, comparing UCRL2, OLRM2, and the optimal minimax policy).

Conclusion. Coupled uncertainty: k-rectangularity, coupled through the index k, solved via state augmentation. Learning the uncertainty: reinforcement learning adapted to the robust setting, with diminishing regret. Together these make robust MDPs practicable.