Practicable Robust Markov Decision Processes
Practicable Robust Markov Decision Processes
Huan Xu, Department of Mechanical Engineering, National University of Singapore
Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Technion), Ofir Mebel (Apple)
Huan Xu (NUS) Practicable Robust MDPs 18 Nov 2015; IMS 1 / 40
Classical planning problems
- We typically want to maximize the expected average reward.
- In planning: the model is known.
- A single scalar reward.
- Rare events (black swans) only crop up through expectations.
Motivating example: mail catalog
- Large US retailer (Fortune 500 company).
- Marketing problem: send or do not send a coupon/invitation/mail-order catalogue.
- Common wisdom: per customer, look at RFM: Recency, Frequency, Monetary value.
- Dynamics matter.
- Every model will be wrong: how do you model humans?
Common to these problems
- The real state space is huge, with lots of uncertainty and parameters.
- Batch data are available.
- Operative solution: build a smallish MDP (< 300 states!), solve it, apply it.
- Computational speed is less of an issue.
- Uncertainty and risk are THE concern (and cannot be made scalar).
The question
How do we optimize when the model is not (fully) known, but we have some idea of the magnitude of the uncertainty?
Markov Decision Processes
Defined by a tuple $\langle T, \gamma, S, A, p, r \rangle$:
- $T$ is the (possibly infinite) decision horizon.
- $\gamma$ is the discount factor.
- $S$ is the set of states.
- $A$ is the set of actions.
- $p$ is the transition probability, in the form $p_t(s' \mid s, a)$.
- $r$ is the immediate reward, in the form $r_t(s, a)$.
Markov Decision Processes
- Total reward: $R = \sum_{t=1}^{T} \gamma^{t-1} r_t(s_t, a_t)$.
- Classical goal: find a policy $\pi$ that maximizes the expected total reward under $\pi$.
- Three solution approaches: value iteration, policy iteration, linear programming.
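As a quick illustration of the first approach, finite-horizon value iteration for a known (nominal) model can be sketched as follows. This is a minimal sketch; the array shapes and names are illustrative, not taken from the talk.

```python
import numpy as np

def value_iteration(P, R, gamma, T):
    """Finite-horizon value iteration for a known MDP.
    P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(T):
        Q = R + gamma * (P @ V)   # Bellman backup, shape (S, A)
        V = Q.max(axis=1)         # greedy improvement
    return V, Q.argmax(axis=1)    # values and a greedy policy

# Tiny example: in state 0, action 0 pays 1 and stays; action 1 moves to
# an absorbing zero-reward state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[1.0, 0.0], [0.0, 0.0]])
V, pi = value_iteration(P, R, gamma=0.5, T=60)
```

With $\gamma = 0.5$ the value of state 0 approaches $\sum_t \gamma^{t-1} = 2$, and the greedy policy keeps choosing the rewarding action.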
Two types of uncertainty
- Internal uncertainty: uncertainty due to random transitions/rewards. Risk-aware MDPs. Not this talk.
- Parameter uncertainty: uncertainty in the parameters. Robust MDPs. This talk.
- Risk vs. ambiguity: Ellsberg's paradox.
Parameter uncertainty
Parameter uncertainty is due to:
1. noisy/incorrect observation
2. estimation from finite samples
3. environment-dependent parameters
4. simplification of the problem
Robust MDPs
- $S$ and $A$ are known; $p$ and $r$ are unknown.
- When in doubt, assume the worst.
- Set-inclusive uncertainty: $p$ and $r$ belong to a known set (the "uncertainty set").
- Look for a policy with the best worst-case performance. The problem becomes
$$\max_{\text{policy}} \; \min_{\text{parameter} \in U} \; \mathbb{E}_{\text{policy, parameter}}\Big[ \sum_t \gamma^t r_t \Big]$$
Game against nature
- In general the problem is NP-hard, except in the rectangular case.
- Nice (non-worst-case) generative probabilistic interpretation.
Robust MDPs: conventional wisdom
- The problem is solvable when the uncertainty set is rectangular, i.e., the Cartesian product of uncertainty sets of individual states:
$$U = \Big\{ (p, r) : p_s \in U_p(s),\; r_s \in U_r(s) \;\; \forall s \Big\}$$
- Eventually solved using robust dynamic programming (robust value iteration, robust policy iteration) by Nilim and El Ghaoui (OR, 2005) and Iyengar (MOR, 2005), among others: take the worst outcome for every state.
- If really bad events are to be included, $U_p$ and $U_r$ must be made huge, leading to conservativeness.
- A complicated tradeoff to make: bad events cannot be correlated.
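With a rectangular set, the robust Bellman backup decomposes state by state. A minimal sketch, assuming an L1-ball uncertainty set of radius $\alpha$ around a nominal transition model (a common concrete choice; the talk does not commit to a specific set here). The inner minimization over the ball has a classic greedy solution: shift up to $\alpha/2$ probability mass onto the worst successor state, trimming mass from the best ones.

```python
import numpy as np

def worst_case_expectation(p_nom, V, alpha):
    """min p @ V over distributions p with ||p - p_nom||_1 <= alpha,
    solved greedily by moving mass onto the lowest-value state."""
    order = np.argsort(V)                  # successor states, worst first
    p = p_nom.astype(float).copy()
    worst = order[0]
    p[worst] = min(1.0, p[worst] + alpha / 2.0)
    excess = p.sum() - 1.0                 # mass to trim from high-value states
    for s in order[::-1]:
        if excess <= 0:
            break
        if s == worst:
            continue
        cut = min(p[s], excess)
        p[s] -= cut
        excess -= cut
    return float(p @ V)

def robust_backup(P_nom, R, V, gamma, alpha):
    """One robust value-iteration step for an (s,a)-rectangular set:
    V(s) = max_a [ r(s,a) + gamma * min_{p in U(s,a)} p @ V ]."""
    S, A = R.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            Q[s, a] = R[s, a] + gamma * worst_case_expectation(P_nom[s, a], V, alpha)
    return Q.max(axis=1)
```

With $\alpha = 0$ this reduces to the nominal backup, which is one way to sanity-check an implementation.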
Practicable?
The problem is not solved yet. Issues still to address for practically successful robust MDPs:
- More flexible uncertainty sets (this talk)
- Probabilistic uncertainty (not this talk)
- Large scale (not this talk)
- Learning the uncertainty (this talk)
Coupling uncertainty: intuition
- No coupling: conservative.
- General coupling: NP-hard.
- Is there some sort of coupling that remains tractable? k-rectangular uncertainty.
Example 1: lightning does not strike twice
How many times can lightning strike?
Lightning does not strike twice (LDST)
Limit the number of deviations allowed:
$$U_K = \Big\{ (p, r) : p_s = p^{\mathrm{nom}}_s,\; r_s = r^{\mathrm{nom}}_s \text{ except at most } K \text{ states } s_1, \dots, s_K, \text{ where } (p_{s_i}, r_{s_i}) \in U_p(s_i) \times U_r(s_i) \Big\}$$
- If $K = 0$: the naive (nominal) MDP.
- If $K = |S|$: the standard robust MDP.
- $K$ in between: the interesting regime.
Non-rectangular uncertainty sets
- Capturing such events with rectangular sets requires large uncertainty, but $U_K$ is close to being rectangular.
- Rectangularity means the conditional projection of the uncertainty set remains the same.
- For LDST, there are $K + 1$ possible conditional projections of the uncertainty set.
Example 2: union of rectangular uncertainty sets
- The uncertainty set is a union of several rectangular uncertainty sets (which may not be disjoint):
$$U = \bigcup_{j=1}^{J} U_j, \quad \text{where } U_j = \prod_{s \in S} U_j^s, \quad j = 1, \dots, J.$$
- There are at most $2^J - 1$ possible conditional projections of the uncertainty set.
k-rectangular uncertainty sets
Definition: Let $k$ be an integer. An uncertainty set $U$ is called k-rectangular if, for any subset of states, the class of conditional projection sets contains no more than $k$ sets.
- k-rectangularity is closed under operations such as union, intersection, and Cartesian product (with a potential increase of $k$).
The uncertainty model
- Uncertainty is realized in a history-dependent way.
- The decision maker observes the realized uncertain parameters.
- A Markovian uncertainty set is NP-hard: it amounts to solving a two-player zero-sum stochastic game.
- Key observation: uncertainty sets are coupled only through $k$, the index of the conditional projection.
Algorithm for k-rectangular uncertainty sets
Theorem: A robust MDP with a k-rectangular uncertainty set can be solved in polynomial time.
How?
- Augment the state space to include $k$.
- Double the decision stages: odd stages for the decision maker, even stages for Nature.
- Solve by dynamic programming on the augmented state space, alternating between the decision maker and Nature.
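The state-augmentation idea can be sketched for the LDST instance, where the augmented component is Nature's remaining deviation budget. This is an illustration under a simplifying assumption: each state-action pair has a finite list of candidate deviations (the talk's uncertainty sets may be continuous, in which case the inner minimum becomes an optimization).

```python
import numpy as np

def ldst_robust_values(P_nom, R_nom, devs, gamma, T, K):
    """Finite-horizon robust DP on the augmented state (s, k), where k is
    Nature's remaining deviation budget.  devs maps (s, a) to a list of
    candidate deviations (p_dev, r_dev); Nature deviates only if it helps."""
    S, A, _ = P_nom.shape
    V = np.zeros((S, K + 1))              # V[s, k]
    for _ in range(T):
        V_new = np.empty_like(V)
        for s in range(S):
            for k in range(K + 1):
                best = -np.inf            # decision maker maximizes over a
                for a in range(A):
                    # Nature's choice: play nominal, or spend one deviation
                    val = R_nom[s, a] + gamma * P_nom[s, a] @ V[:, k]
                    if k > 0:
                        for p_dev, r_dev in devs.get((s, a), []):
                            val = min(val, r_dev + gamma * p_dev @ V[:, k - 1])
                    best = max(best, val)
                V_new[s, k] = best
        V = V_new
    return V
```

On a two-state example whose single action nominally pays 1 per step while a deviation pays 0, the value with budget $K = 1$ under $\gamma = 0.5$ drops from about 2 to about 1: exactly one lightning strike is priced in.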
Illustration via LDST
Question: where do I get the uncertainty sets?
There are two types of uncertainty:
- Stochastic uncertainty: there are some true $p$ and $r$; we just do not know their exact values.
- Adversarial uncertainty: there is no true $p$ and $r$; each time the state is visited, the parameters can vary, due to model simplification or some ignored adversarial effect.
If I can collect more data, can I
- identify the nature of the uncertainty?
- learn the values of the stochastic parameters?
- learn the level of the adversarial uncertainty?
Yes we can!
Formal setup
- MDP with finite states and actions; rewards in $[0, 1]$.
- For each pair $(s, a)$, we are given a (potentially infinite) class of nested uncertainty sets $\mathbf{U}(s, a)$.
- Each pair $(s, a)$ can be either stochastic or adversarial; which one is not known.
  - If $(s, a)$ is stochastic, the true $p$ and $r$ are unknown.
  - If $(s, a)$ is adversarial, its true uncertainty set (also unknown) belongs to $\mathbf{U}(s, a)$.
- We are allowed to repeat the MDP many times.
Challenge and objective
- For an adversarial state-action pair, the parameters can be arbitrary (and adaptive to the algorithm). Hence it is not possible to exactly identify the nature of the uncertainty.
- It is not possible to achieve diminishing regret against the optimal stationary policy in hindsight; that is, we may not take full advantage if the adversary chooses to play nice.
- We can achieve vanishing regret against the performance of the robust MDP that knows the exact $p$ and $r$ of each stochastic pair and the true uncertainty set of each adversarial pair.
Main intuition
- When the model is purely stochastic, one can resort to RL algorithms such as UCRL (which consistently uses optimistic estimation) to achieve diminishing regret.
- However, an adversary can hurt.
Main intuition
[Figure: a small MDP with states $s_0, \dots, s_4$, actions $a_1, a_2, a_3$, and rewards $g$, $g + \beta$, $g - \alpha$, where $2\beta < \alpha < 3\beta$.]
- The adversary chooses the solid line in phase 1 ($2T$ steps) and the dashed line in phase 2 ($T$ steps).
- The expected value of $s_4$ is $g + \frac{\beta - \alpha}{2}$, and the expected value of $s_1$ is $g + \frac{3\beta - \alpha}{4} > g$.
- The total accumulated reward is $3Tg + T(2\beta - \alpha)$. Compared to the minimax policy, the overall regret is non-diminishing.
Main intuition
Be optimistic, but cautious:
- Using UCRL, start by assuming all state-action pairs are stochastic.
- Monitor the outcome of each pair's transitions.
- Use a statistical check to identify pairs that are assumed stochastic but are in fact adversarial, as well as pairs assumed to have an uncertainty set smaller than their true uncertainty set.
- Update the information of pairs that fail the statistical check, and re-solve the minimax MDP.
The algorithm: OLRM
Input: $S$, $A$, $T$, $\delta$, and for each $(s, a)$, $\mathbf{U}(s, a)$.
1. Initialize the set $F \leftarrow \{\}$. For each $(s, a)$, set $U(s, a) \leftarrow \{\}$.
2. Initialize $k \leftarrow 1$.
3. Compute an optimistic robust policy $\pi$, assuming all state-action pairs in $F$ are adversarial with uncertainty sets as given by $U(s, a)$.
4. Execute $\pi$ until one of the following happens:
   - The execution count of some state-action pair $(s, a)$ has doubled.
   - The executed state-action pair $(s, a)$ fails the stochasticity check. In this case $(s, a)$ is added to $F$ if it is not yet in $F$, and $U(s, a)$ is updated.
5. Increment $k$. Go back to step 3.
Computing the optimistic robust policy
Input: $S$, $A$, $T$, $\delta$, $F$, $k$, and for each $(s, a)$: $U(s, a)$, $\hat{P}_k(\cdot \mid s, a)$ and $N_k(s, a)$.
1. Set $\tilde{V}^k_T(s) = 0$ for all $s$.
2. Repeat, for $t = T - 1, \dots, 0$:
   - For each $(s, a) \in F$, set $\tilde{Q}^k_t(s, a) = \min\big\{ T - t,\; r(s, a) + \min_{p \in U(s,a)} p(\cdot) \tilde{V}^k_{t+1}(\cdot) \big\}$.
   - For each $(s, a) \notin F$, set
$$\tilde{Q}^k_t(s, a) = \min\Big\{ T - t,\; r(s, a) + \hat{P}_k(\cdot \mid s, a) \tilde{V}^k_{t+1}(\cdot) + (T + 1)\sqrt{\frac{1}{2 N_k(s, a)} \log \frac{14 S A T k^2}{\delta}} \Big\}.$$
   - For each $s$, set $\tilde{V}^k_t(s) = \max_a \tilde{Q}^k_t(s, a)$ and $\pi_t(s) = \arg\max_a \tilde{Q}^k_t(s, a)$.
3. Output $\pi$.
Robust to the adversarial, optimistic to the stochastic.
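The optimistic backup for a pair outside $F$ can be sketched as an empirical Bellman backup plus a Hoeffding-style bonus, clipped at the largest value reachable in the remaining steps. Names and shapes below are illustrative, not from the talk.

```python
import numpy as np

def optimistic_q(r_sa, P_hat_sa, V_next, N_sa, t, T, S, A, k, delta):
    """Optimistic Q-value for a state-action pair assumed stochastic:
    empirical backup plus an exploration bonus that shrinks as 1/sqrt(N),
    clipped at T - t (the most reward obtainable before the horizon)."""
    bonus = (T + 1) * np.sqrt(np.log(14 * S * A * T * k**2 / delta)
                              / (2 * N_sa))
    return min(T - t, r_sa + P_hat_sa @ V_next + bonus)
```

With few samples the bonus dominates and the Q-value saturates at $T - t$, so under-explored pairs look attractive; as $N_k(s, a)$ grows the bonus vanishes and the backup approaches the empirical estimate.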
Stochasticity check
- When $(s, a) \notin F$, it fails the stochasticity check if
$$\sum_{j=1}^{n} \Big\{ \hat{P}_{k_j}(\cdot \mid s, a) \tilde{V}^{k_j}_{t_j + 1}(\cdot) - \tilde{V}^{k_j}_{t_j + 1}(s_j) \Big\} > \big(2.5 + T + 3.5 T \sqrt{S}\big) \sqrt{n \log \frac{14 S A T \tau^2}{\delta}}.$$
- When $(s, a) \in F$, it fails the stochasticity check if
$$\sum_{j=n'+1}^{n} \Big\{ \min_{p \in U(s,a)} p(\cdot) \tilde{V}^{k_j}_{t_j + 1}(\cdot) - \tilde{V}^{k_j}_{t_j + 1}(s_j) \Big\} > 2 T \sqrt{2 (n - n') \log \frac{14 \tau^2}{\delta}}.$$
- If $(s, a)$ fails the stochasticity check, add $(s, a)$ to $F$, and update $U(s, a)$ to the smallest set in $\mathbf{U}(s, a)$ that satisfies
$$\sum_{j=n'+1}^{n} \Big\{ \min_{p \in U(s,a)} p(\cdot) \tilde{V}^{k_j}_{t_j + 1}(\cdot) - \tilde{V}^{k_j}_{t_j + 1}(s_j) \Big\} < T \sqrt{2 (n - n') \log \frac{14 \tau^2}{\delta}}.$$
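The first check is a one-sided deviation test. A sketch with illustrative names: `predicted[j]` stands for the model's expected next-state value $\hat{P}_{k_j}(\cdot \mid s, a) \tilde{V}^{k_j}_{t_j+1}(\cdot)$ at visit $j$, and `realized[j]` for the value $\tilde{V}^{k_j}_{t_j+1}(s_j)$ of the state actually reached.

```python
import numpy as np

def fails_stochasticity_check(predicted, realized, T, S, A, tau, delta):
    """One-sided test: flag (s, a) when the model's predicted next-state
    values exceed the realized ones by more than a concentration threshold
    (so a genuinely stochastic pair almost never fails)."""
    predicted = np.asarray(predicted, dtype=float)
    realized = np.asarray(realized, dtype=float)
    n = len(predicted)
    gap = np.sum(predicted - realized)
    threshold = ((2.5 + T + 3.5 * T * np.sqrt(S))
                 * np.sqrt(n * np.log(14 * S * A * T * tau**2 / delta)))
    return gap > threshold
```

An unbiased model passes (the gap concentrates around zero), while an adversary that keeps steering transitions toward low-value states drives the gap past the threshold.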
More on the stochasticity check
- It essentially checks whether the value of the actual transitions from $(s, a)$ is below what is expected from the parameter estimates.
- No false alarms: with high probability, all stochastic state-action pairs always pass the stochasticity check, and all adversarial state-action pairs pass the check if $U(s, a) \supseteq U^*(s, a)$.
- The check may fail to identify adversarial pairs if the adversary plays nicely. However, satisfactory rewards are then accumulated, so nothing needs to change.
- If the adversary plays nasty, the stochasticity check detects it and subsequently protects against it.
- In short: if it ain't broke, don't fix it.
Performance guarantee
Theorem: Given $\delta$, $T$, $S$, $A$ and $\mathbf{U}$, if $|\mathbf{U}(s, a)| \le C$ for all $(s, a)$, then the total regret of OLRM satisfies
$$\Delta(m) \le O\Big[ T^{3/2} \big(\sqrt{S} + \sqrt{C}\big) \sqrt{S A m \log \frac{S A T m}{\delta}} \Big]$$
for all $m$, with probability at least $1 - \delta$.
The total number of steps is $\tau = T m$, hence the regret is $\tilde{O}\big[ T (\sqrt{S} + \sqrt{C}) \sqrt{S A \tau} \big]$.
Performance guarantee
What if $\mathbf{U}$ is an infinite set? We consider the following class:
$$\mathbf{U}(s, a) = \big\{ \eta(s, a) + \alpha B(s, a) : \alpha_0(s, a) \le \alpha \le \bar{\alpha} \big\} \cap \mathcal{P}(S) \quad (1)$$
Theorem: Given $\delta$, $T$, $S$, $A$ and $\mathbf{U}$ as defined in Eq. (1), the total regret of OLRM satisfies
$$\Delta(m) \le \tilde{O}\Big[ T \Big( S \sqrt{A \tau} + (S A \bar{\alpha} B)^{2/3} \tau^{1/3} + (S A \bar{\alpha} B)^{1/3} \tau^{2/3} \Big) \Big]$$
for all $m$, with probability at least $1 - \delta$.
Infinite-horizon average reward
- Assume that for any $p$ in the true uncertainty set, the resulting MDP is unichain and communicating.
- Similar algorithm, except that computing the optimistic robust policy is trickier.
- Similar performance guarantees: $O(\sqrt{\tau})$ for finite $\mathbf{U}$, and $O(\tau^{2/3})$ for infinite $\mathbf{U}$.
Simulation
[Figure: a domain with Gold and Silver states, shown under the usual condition and an abnormal condition.]
Simulation
[Figure: an MDP with states $s_0, \dots, s_6$, actions $a_1, \dots, a_4$, and rewards $R_{\text{minimax}}$, $R_{\text{good}}$, $R_{\text{bad}}$.]
Simulation
[Figure: total rewards vs. time steps (in units of $10^6$) for UCRL2, OLRM2, and the optimal minimax policy.]
Conclusion
- Coupled uncertainty: k-rectangularity
  - Coupled through the index k
  - Solved via state augmentation
- Learning the uncertainty
  - Reinforcement learning adapted
  - Diminishing regret
- Make robust MDPs practicable.
Submitted to manuscript Robust Partially Observable Markov Decision Processes Mohammad Rasouli 1, Soroush Saghafian 2 1 Management Science and Engineering, Stanford University, Palo Alto, CA 2 Harvard
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationReinforcement Learning
Reinforcement Learning Lecture 6: RL algorithms 2.0 Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Present and analyse two online algorithms
More informationCS 4100 // artificial intelligence. Recap/midterm review!
CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks
More informationMarkov Chains. Chapter 16. Markov Chains - 1
Markov Chains Chapter 16 Markov Chains - 1 Why Study Markov Chains? Decision Analysis focuses on decision making in the face of uncertainty about one future event. However, many decisions need to consider
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More informationOnline regret in reinforcement learning
University of Leoben, Austria Tübingen, 31 July 2007 Undiscounted online regret I am interested in the difference (in rewards during learning) between an optimal policy and a reinforcement learner: T T
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More informationMS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction
MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent
More informationNotes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed
CS 683 Learning, Games, and Electronic Markets Spring 007 Notes from Week 9: Multi-Armed Bandit Problems II Instructor: Robert Kleinberg 6-30 Mar 007 1 Information-theoretic lower bounds for multiarmed
More informationExponential Moving Average Based Multiagent Reinforcement Learning Algorithms
Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca
More informationExperts in a Markov Decision Process
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More informationOptimism in the Face of Uncertainty Should be Refutable
Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:
More information16.4 Multiattribute Utility Functions
285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate
More informationLogarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract
More informationReinforcement Learning
Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.
More informationPlanning Under Uncertainty II
Planning Under Uncertainty II Intelligent Robotics 2014/15 Bruno Lacerda Announcement No class next Monday - 17/11/2014 2 Previous Lecture Approach to cope with uncertainty on outcome of actions Markov
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationMotivation for introducing probabilities
for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.
More informationSome AI Planning Problems
Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and
More informationComplexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
More informationCS 188: Artificial Intelligence
CS 188: Artificial Intelligence Adversarial Search II Instructor: Anca Dragan University of California, Berkeley [These slides adapted from Dan Klein and Pieter Abbeel] Minimax Example 3 12 8 2 4 6 14
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationBandit models: a tutorial
Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses
More informationLearning to Coordinate Efficiently: A Model-based Approach
Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationCentral-limit approach to risk-aware Markov decision processes
Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory
More informationLearning in Zero-Sum Team Markov Games using Factored Value Functions
Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer
More informationAn analogy from Calculus: limits
COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm
More informationLecture notes on OPP algorithms [Preliminary Draft]
Lecture notes on OPP algorithms [Preliminary Draft] Jesper Nederlof June 13, 2016 These lecture notes were quickly assembled and probably contain many errors. Use at your own risk! Moreover, especially
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationReinforcement Learning. Introduction
Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control
More information15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)
15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we
More informationInfinite-Horizon Average Reward Markov Decision Processes
Infinite-Horizon Average Reward Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Average Reward MDP 1 Outline The average
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationRobust Markov Decision Processes
Robust Markov Decision Processes Wolfram Wiesemann, Daniel Kuhn and Berç Rustem February 9, 2012 Abstract Markov decision processes (MDPs) are powerful tools for decision making in uncertain dynamic environments.
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More information16.410/413 Principles of Autonomy and Decision Making
16.410/413 Principles of Autonomy and Decision Making Lecture 23: Markov Decision Processes Policy Iteration Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology December
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationCORC Tech Report TR Robust dynamic programming
CORC Tech Report TR-2002-07 Robust dynamic programming G. Iyengar Submitted Dec. 3rd, 2002. Revised May 4, 2004. Abstract In this paper we propose a robust formulation for discrete time dynamic programming
More informationDistributionally Robust Convex Optimization
Distributionally Robust Convex Optimization Wolfram Wiesemann 1, Daniel Kuhn 1, and Melvyn Sim 2 1 Department of Computing, Imperial College London, United Kingdom 2 Department of Decision Sciences, National
More informationCS 4649/7649 Robot Intelligence: Planning
CS 4649/7649 Robot Intelligence: Planning Probability Primer Sungmoon Joo School of Interactive Computing College of Computing Georgia Institute of Technology S. Joo (sungmoon.joo@cc.gatech.edu) 1 *Slides
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More informationSafe Policy Improvement by Minimizing Robust Baseline Regret
Safe Policy Improvement by Minimizing Robust Baseline Regret Marek Petrik University of New Hampshire mpetrik@cs.unh.edu Mohammad Ghavamzadeh Adobe Research & INRIA Lille ghavamza@adobe.com Yinlam Chow
More informationMultiagent Value Iteration in Markov Games
Multiagent Value Iteration in Markov Games Amy Greenwald Brown University with Michael Littman and Martin Zinkevich Stony Brook Game Theory Festival July 21, 2005 Agenda Theorem Value iteration converges
More informationIntroduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16
600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an
More informationUsing first-order logic, formalize the following knowledge:
Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the
More informationCMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)
More informationMulti-model Markov Decision Processes
Multi-model Markov Decision Processes Lauren N. Steimle Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, steimle@umich.edu David L. Kaufman Management Studies,
More informationMulti-armed bandit models: a tutorial
Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)
More informationARTIFICIAL INTELLIGENCE. Reinforcement learning
INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationPlanning in Markov Decision Processes
Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov
More informationReinforcement Learning II
Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini
More informationCS788 Dialogue Management Systems Lecture #2: Markov Decision Processes
CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision
More informationModule 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo
Module 8 Linear Programming CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Policy Optimization Value and policy iteration Iterative algorithms that implicitly solve
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationAn Analysis of Model-Based Interval Estimation for Markov Decision Processes
An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl, Michael L. Littman astrehl@gmail.com, mlittman@cs.rutgers.edu Computer Science Dept. Rutgers University
More informationSparse analysis Lecture II: Hardness results for sparse approximation problems
Sparse analysis Lecture II: Hardness results for sparse approximation problems Anna C. Gilbert Department of Mathematics University of Michigan Sparse Problems Exact. Given a vector x R d and a complete
More informationReinforcement Learning. Spring 2018 Defining MDPs, Planning
Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More informationCS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm
CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the
More informationA Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time
A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationChapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS
Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationCS 188: Artificial Intelligence Spring Announcements
CS 188: Artificial Intelligence Spring 2011 Lecture 12: Probability 3/2/2011 Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein. 1 Announcements P3 due on Monday (3/7) at 4:59pm W3 going out
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationInformation, Utility & Bounded Rationality
Information, Utility & Bounded Rationality Pedro A. Ortega and Daniel A. Braun Department of Engineering, University of Cambridge Trumpington Street, Cambridge, CB2 PZ, UK {dab54,pao32}@cam.ac.uk Abstract.
More informationDiscrete Probability and State Estimation
6.01, Fall Semester, 2007 Lecture 12 Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Fall Semester, 2007 Lecture 12 Notes
More informationCourse basics. CSE 190: Reinforcement Learning: An Introduction. Last Time. Course goals. The website for the class is linked off my homepage.
Course basics CSE 190: Reinforcement Learning: An Introduction The website for the class is linked off my homepage. Grades will be based on programming assignments, homeworks, and class participation.
More information15-780: ReinforcementLearning
15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationComp487/587 - Boolean Formulas
Comp487/587 - Boolean Formulas 1 Logic and SAT 1.1 What is a Boolean Formula Logic is a way through which we can analyze and reason about simple or complicated events. In particular, we are interested
More information