Practicable Robust Markov Decision Processes
Huan Xu, Department of Mechanical Engineering, National University of Singapore
Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Technion), Ofir Mebel (Apple)
Huan Xu (NUS) Practicable Robust MDPs 18, Nov, 2015; IMS 1 / 40
Classical planning problems
We typically want to maximize the expected average reward. In planning:
Model is known
A single scalar reward
Rare events (black swans) crop up only through expectations
Motivating example: mail catalog
Large US retailer (Fortune 500 company)
Marketing problem: send or do not send a coupon/invitation/mail-order catalogue
Common wisdom: per customer, look at RFM: Recency, Frequency, Monetary value
Dynamics matter
Every model will be wrong: how do you model humans?
Common to these problems
Real state space is huge, with lots of uncertainty and parameters
Batch data are available
Operative solution: build a smallish MDP (< 300 states!), solve, apply
Computational speed is less of an issue
Uncertainty and risk are THE concern (and cannot be made scalar)
The Question
How to optimize when the model is not (fully) known, but you have some idea of the magnitude of the uncertainty?
Markov Decision Processes
Defined by a tuple (T, γ, S, A, p, r):
T is the (possibly infinite) decision horizon.
γ is the discount factor.
S is the set of states.
A is the set of actions.
p is the transition probability, in the form p_t(s' | s, a).
r is the immediate reward, in the form r_t(s, a).
Markov Decision Processes
Total reward is defined as R = Σ_{t=1}^{T} γ^{t−1} r_t(s_t, a_t).
Classical goal: find a policy π that maximizes the expected total reward under π.
Three solution approaches:
Value Iteration
Policy Iteration
Linear Programming
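The value iteration approach can be sketched in a few lines. This is a generic, minimal illustration of the classical (non-robust) algorithm; the tensor layout of `P` and `R` is a convention chosen here, not something fixed by the talk:

```python
import numpy as np

def value_iteration(P, R, gamma, n_iter=1000, tol=1e-8):
    """Standard value iteration for a finite MDP.

    P: transition tensor of shape (S, A, S) with P[s, a, s'] = p(s' | s, a).
    R: reward matrix of shape (S, A).
    Returns the (approximately) optimal value function and a greedy policy.
    """
    S, A = R.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        # Q[s, a] = r(s, a) + gamma * sum_{s'} p(s' | s, a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)
```

For a single absorbing state with reward 1 and γ = 0.9, this converges to the geometric-series value 1/(1 − γ) = 10.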
Two types of uncertainty
Internal uncertainty: uncertainty due to random transitions/rewards. Risk-aware MDPs. Not this talk.
Parameter uncertainty: uncertainty in the parameters. Robust MDPs. This talk.
Risk vs. ambiguity: Ellsberg's paradox
Parameter Uncertainty
Parameter uncertainty is due to:
1. noisy/incorrect observation
2. estimation from finite samples
3. environment-dependent parameters
4. simplification of the problem
Robust MDPs
S and A are known; p and r are unknown.
When in doubt, assume the worst:
Set-inclusive uncertainty: p and r belong to a known set (the "uncertainty set").
Look for a policy with the best worst-case performance. The problem becomes:
max_{policy} min_{parameter ∈ U} E_{policy, parameter} [ Σ_t γ^t r_t ]
Game against nature
In general, the problem is NP-hard, except in the rectangular case.
Nice (non-worst-case) generative probabilistic interpretation.
Robust MDPs: conventional wisdom
The problem is solvable when the uncertainty set is rectangular, i.e., the Cartesian product of uncertainty sets of individual states:
U = { (p, r) : p_s ∈ U_p(s), r_s ∈ U_r(s), ∀ s }
Eventually solved using robust dynamic programming (robust value iteration, robust policy iteration) by Nilim and El Ghaoui (OR, 2005) and Iyengar (MOR, 2005), among others.
Take the worst outcome for every state. If really bad events are to be included, U_p and U_r must be made huge, which leads to conservativeness.
A complicated tradeoff to make: bad events cannot be correlated.
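Robust dynamic programming under rectangularity replaces the expectation in the Bellman backup by a worst case over each state's uncertainty set. Below is a minimal sketch for one common choice of rectangular set, an L1 ball around a nominal transition model; this is an illustration of the idea, not the exact algorithm of Nilim and El Ghaoui or Iyengar:

```python
import numpy as np

def worst_case_expectation(p_nom, V, eps):
    """min of p . V over {p in simplex : ||p - p_nom||_1 <= eps}.

    The minimizer shifts up to eps/2 probability mass from the
    highest-value states onto the lowest-value state.
    """
    p = p_nom.copy()
    budget = eps / 2.0
    worst = int(np.argmin(V))
    for s in np.argsort(V)[::-1]:          # drain high-value states first
        if s == worst or budget <= 0:
            continue
        take = min(p[s], budget)
        p[s] -= take
        p[worst] += take
        budget -= take
    return p @ V

def robust_value_iteration(P_nom, R, gamma, eps, n_iter=500):
    """(s,a)-rectangular robust value iteration with L1 uncertainty balls."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + gamma * worst_case_expectation(P_nom[s, a], V, eps)
        V = Q.max(axis=1)
    return V
```

With eps = 0 the inner minimum is the nominal expectation and the recursion reduces to classical value iteration.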
Practicable?
The problem is not solved yet. Issues still to address for practically successful robust MDPs:
More flexible uncertainty sets (this talk)
Probabilistic uncertainty (not this talk)
Large scale (not this talk)
Learning the uncertainty (this talk)
Coupling uncertainty: intuition
No coupling: conservative.
General coupling: NP-hard.
Is there some sort of coupling that remains tractable? k-rectangular uncertainty.
Example 1: lightning does not strike twice
How many times can lightning strike?
Lightning does not strike twice
Lightning does not strike twice
Limit the number of deviations allowed:
U_K = { (p, r) : p_s = p_s^nom, r_s = r_s^nom, except at at most K states s_1, …, s_K, where (p_{s_i}, r_{s_i}) ∈ U_p(s_i) × U_r(s_i) }
If K = 0: naive MDP.
If K = |S|: standard robust MDP.
K in between: the interesting regime.
Non-rectangular uncertainty sets
Rectangular sets lead to large uncertainty. But U_K is close to being rectangular.
Rectangularity means the conditional projection of the uncertainty set remains the same.
For LDST, there are K + 1 possible conditional projections of the uncertainty set.
Example 2: union of rectangular uncertainty sets
The uncertainty set is a union of several rectangular uncertainty sets (which may not be disjoint):
U = ∪_{j=1}^{J} U_j, where U_j = ∏_{s ∈ S} U_j(s), j = 1, …, J.
There are at most 2^J − 1 possible conditional projections of the uncertainty set.
k-rectangular uncertainty set
Definition: Let k be an integer. An uncertainty set U is called k-rectangular if, for any subset of states, the class of conditional projection sets contains no more than k sets.
k-rectangularity is closed under operations such as union, intersection, and Cartesian product (with a potential increase of k).
The uncertainty model
Uncertainty is realized in a history-dependent way. The decision maker observes the realized uncertain parameters.
(A Markovian uncertainty set is NP-hard: it amounts to solving a two-player zero-sum stochastic game.)
Key observation: the uncertainty sets are coupled only through k, the index of the conditional projection.
Algorithm for k-rectangular uncertainty sets
Theorem: The robust MDP with a k-rectangular uncertainty set can be solved in polynomial time.
How?
Augment the state space to include k.
Double the decision stages: odd stages for the decision maker, even stages for Nature.
Solve by dynamic programming on the augmented state space, alternating between the decision maker and Nature.
Illustration via LDST
Question: where do I get the uncertainty sets?
There are two types of uncertainty.
Stochastic uncertainty: there is some true p and true r; we just don't know their exact values.
Adversarial uncertainty: there is no true p and r; each time the state is visited, the parameter can vary. Due to model simplification, or some adversarial effect being ignored.
If I can collect more data, can I
identify the nature of the uncertainty?
learn the value of the stochastic uncertainty?
learn the level of the adversarial uncertainty?
Yes we can!
Formal setup
MDP with finite states and actions; rewards in [0, 1].
For each pair (s, a), we are given a (potentially infinite) class of nested uncertainty sets U(s, a).
Each pair (s, a) can be either stochastic or adversarial, and which one it is is not known.
If (s, a) is stochastic, then the true p and r are unknown.
If (s, a) is adversarial, then its true uncertainty set (also unknown) belongs to U(s, a).
We are allowed to repeat the MDP many times.
Challenge and objective
For an adversarial state-action pair, the parameter can be arbitrary (and adaptive to the algorithm). Hence it is not possible to exactly identify the nature of the uncertainty.
It is not possible to achieve diminishing regret against the optimal stationary policy in hindsight. That is, we may not take full advantage if the adversary chooses to play nice.
We can achieve vanishing regret against the performance of the robust MDP that knows exactly p and r for each stochastic pair, and the true uncertainty set of each adversarial pair.
Main intuition
When the model is purely stochastic, one can resort to RL algorithms such as UCRL (which consistently uses optimistic estimation) to achieve diminishing regret. However, an adversary can hurt.
Main intuition
(Figure: an example MDP with states s_0, …, s_4, actions a_1, a_2, a_3, and rewards g, g + β, g − α.)
Assume 2β < α < 3β. The adversary chooses the solid line in phase 1 (2T steps) and the dashed line in phase 2 (T steps). The expected value of s_4 is g + (β − α)/2, and the expected value of s_1 is g + (3β − α)/4 > g. The total accumulated reward is 3Tg + T(2β − α). Compared to the minimax policy, the overall regret is non-diminishing.
Main intuition
Be optimistic, but cautious.
Using UCRL, start by assuming that all state-action pairs are stochastic. Monitor the outcome of the transitions of each pair.
Use a statistical check to identify pairs that are assumed to be stochastic but are in fact adversarial, as well as pairs assumed to have an uncertainty set smaller than their true uncertainty set.
Update the information of the pairs that fail the check, and re-solve the minimax MDP.
The algorithm: OLRM
Input: S, A, T, δ, and for each (s, a), U(s, a).
1. Initialize the set F ← {}. For each (s, a), set U(s, a) ← {}.
2. Initialize k ← 1.
3. Compute an optimistic robust policy π, assuming all state-action pairs in F are adversarial with uncertainty sets as given by U(s, a).
4. Execute π until one of the following happens:
   The execution count of some state-action pair (s, a) has doubled.
   The executed state-action pair (s, a) fails the stochasticity check. In this case (s, a) is added to F if it is not yet in F. Update U(s, a).
5. Increment k. Go back to step 3.
Computing the optimistic robust policy
Input: S, A, T, δ, F, k, and for each (s, a), U(s, a), P̂_k(· | s, a) and N_k(s, a).
1. Set Ṽ_T^k(s) = 0 for all s.
2. Repeat, for t = T − 1, …, 0:
   For each (s, a) ∈ F, set
   Q̃_t^k(s, a) = min{ T − t, r(s, a) + min_{p ∈ U(s,a)} p(·)ᵀ Ṽ_{t+1}^k(·) }.
   For each (s, a) ∉ F, set
   Q̃_t^k(s, a) = min{ T − t, r(s, a) + P̂_k(· | s, a)ᵀ Ṽ_{t+1}^k(·) + (T + 1) √( log(14SATk²/δ) / (2 N_k(s, a)) ) }.
   For each s, set Ṽ_t^k(s) = max_a Q̃_t^k(s, a) and π_t(s) = argmax_a Q̃_t^k(s, a).
3. Output π.
Robust to adversarial, optimistic to stochastic.
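One backward step of this backup can be sketched as follows. The robust inner minimization is abstracted into a callback `U_worst(s, a, V)` (a hypothetical interface, assumed to return min_{p ∈ U(s,a)} p · V), so the sketch shows only the split between the robust branch and the UCRL-style optimistic branch:

```python
import numpy as np

def optimistic_robust_backup(t, T, V_next, R, P_hat, N, U_worst, in_F, k, delta, S, A):
    """One backward step of an OLRM-style backup (a sketch).

    Pairs flagged adversarial (in_F[s, a] True) get the worst-case value
    over their current uncertainty set via U_worst; unflagged pairs get
    the empirical model P_hat plus an optimism bonus, and every Q-value
    is clipped at the maximum remaining reward T - t.
    """
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            if in_F[s, a]:
                val = R[s, a] + U_worst(s, a, V_next)
            else:
                bonus = (T + 1) * np.sqrt(
                    np.log(14 * S * A * T * k**2 / delta) / (2 * max(N[s, a], 1))
                )
                val = R[s, a] + P_hat[s, a] @ V_next + bonus
            Q[s, a] = min(T - t, val)
    V = Q.max(axis=1)
    policy = Q.argmax(axis=1)
    return V, Q, policy
```

As the visit count N grows, the bonus shrinks at rate 1/√N, so the optimistic estimate for stochastic pairs converges to the empirical Bellman backup.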
Stochasticity check
When (s, a) ∉ F, it fails the stochasticity check if
Σ_{j=1}^{n} { P̂_{k_j}(· | s, a)ᵀ Ṽ_{t_j+1}^{k_j}(·) − Ṽ_{t_j+1}^{k_j}(s_j) } > (2.5 + T + 3.5T√S) √( n log(14SATτ²/δ) ).
When (s, a) ∈ F, it fails the stochasticity check if
Σ_{j=n'+1}^{n} { min_{p ∈ U(s,a)} p(·)ᵀ Ṽ_{t_j+1}^{k_j}(·) − Ṽ_{t_j+1}^{k_j}(s_j) } > 2T √( 2(n − n') log(14τ²/δ) ).
If (s, a) fails the stochasticity check, add (s, a) to F, and update U(s, a) to the smallest set in U(s, a) that satisfies
Σ_{j=n'+1}^{n} { min_{p ∈ U(s,a)} p(·)ᵀ Ṽ_{t_j+1}^{k_j}(·) − Ṽ_{t_j+1}^{k_j}(s_j) } < T √( 2(n − n') log(14τ²/δ) ).
More on the stochasticity check
Essentially, we check whether the value of the actual transitions from (s, a) is below what is expected from the parameter estimate.
No false alarms: with high probability, all stochastic state-action pairs always pass the stochasticity check, and an adversarial state-action pair passes the check if U(s, a) ⊇ U*(s, a).
The check may fail to identify adversarial pairs if the adversary plays nicely. However, satisfactory rewards are then accumulated, so nothing needs to be changed.
If the adversary plays nasty, the stochasticity check will detect it and subsequently protect against it.
"If it ain't broke, don't fix it."
Performance guarantee
Theorem: Given δ, T, S, A and U, if |U(s, a)| ≤ C for all (s, a), then the total regret of OLRM is
Δ(m) ≤ O( T^{3/2} (√S + √C) √( SAm log(SATm/δ) ) )
for all m, with probability at least 1 − δ.
The total number of steps is τ = Tm; hence the regret is Õ( T (√S + √C) √(SAτ) ).
Performance guarantee
What if U is an infinite class? We consider the following class:
U(s, a) = { η(s, a) + α B(s, a) : 0 ≤ α ≤ ᾱ(s, a) } ∩ P(S)   (1)
Theorem: Given δ, T, S, A, and U as defined in Eq. (1), the total regret of OLRM is
Δ(m) ≤ Õ[ T ( S√(Aτ) + (SAᾱB)^{2/3} τ^{1/3} + (SAᾱB)^{1/3} τ^{2/3} ) ]
for all m, with probability at least 1 − δ.
Infinite-horizon average reward
Assume that for any p in the true uncertainty set, the resulting MDP is unichain and communicating.
Similar algorithm, except that computing the optimistic robust policy is trickier.
Similar performance guarantees: O(√τ) regret for finite U, and O(τ^{2/3}) for infinite U.
Simulation
(Figure: a mining example with gold and silver mines, under usual and abnormal conditions.)
Simulation
(Figure: an MDP with states s_0, …, s_6, actions a_1, …, a_4, and rewards R_minimax, R_good, R_bad.)
Simulation
(Plot: total rewards (×10^5) versus time steps (×10^6) for UCRL2, OLRM2, and the optimal minimax policy.)
Conclusion
Coupled uncertainty: k-rectangularity.
Coupled through the index k; solved via state augmentation.
Learning the uncertainty: reinforcement learning adapted; diminishing regret.
Make robust MDPs practicable.