Practicable Robust Markov Decision Processes


Practicable Robust Markov Decision Processes. Huan Xu, Department of Mechanical Engineering, National University of Singapore. Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Technion), and Ofir Mebel (Apple). 18 November 2015, IMS.

Classical planning problems. We typically want to maximize the expected average reward. In planning: the model is known, there is a single scalar reward, and rare events (black swans) only crop up through expectations.

Motivating example - mail catalog. A large US retailer (a Fortune 500 company) faces a marketing problem: send or do not send a coupon/invitation/mail-order catalogue. Common wisdom: per customer, look at RFM (Recency, Frequency, Monetary value). Dynamics matter, and every model will be wrong: how do you model humans?

Common to these problems: the real state space is huge, with lots of uncertainty and many parameters, and batch data are available. The operative solution: build a smallish MDP (< 300 states!), solve it, and apply it. Computational speed is less of an issue; uncertainty and risk are THE concern (and cannot be reduced to a scalar).

The question: how do we optimize when the model is not (fully) known, but we have some idea of the magnitude of the uncertainty?

Markov Decision Processes. An MDP is defined by a tuple (T, γ, S, A, p, r): T is the (possibly infinite) decision horizon; γ is the discount factor; S is the set of states; A is the set of actions; p is the transition probability, of the form p_t(s' | s, a); r is the immediate reward, of the form r_t(s, a).

Markov Decision Processes. The total reward is R = Σ_{t=1}^{T} γ^{t-1} r_t(s_t, a_t). The classical goal: find a policy π that maximizes the expected total reward under π. Three solution approaches: value iteration, policy iteration, and linear programming.
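
To make the value-iteration approach concrete, here is a minimal sketch for a finite, stationary, discounted MDP. It is illustrative only; the array layout (P as an S x A x S tensor, R as an S x A matrix) and the stopping tolerance are assumptions, not part of the talk.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P: (S, A, S) transition tensor, R: (S, A) reward matrix."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value function and a greedy policy
        V = V_new
```

Policy iteration and the linear-programming formulation compute the same optimal value function via different routes.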

Two types of uncertainty. Internal uncertainty, due to random transitions and rewards: risk-aware MDPs, not this talk. Parameter uncertainty, i.e., uncertainty in the parameters themselves: robust MDPs, this talk. Risk vs. ambiguity: Ellsberg's paradox.

Parameter uncertainty. Parameter uncertainty arises from (1) noisy or incorrect observations, (2) estimation from finite samples, (3) environment-dependent parameters, and (4) simplification of the problem.

Robust MDPs. S and A are known; p and r are unknown. When in doubt, assume the worst. Set-inclusive uncertainty: p and r belong to a known set U (the uncertainty set). Look for the policy with the best worst-case performance. The problem becomes: max_{policy} min_{parameter ∈ U} E_{policy, parameter} [ Σ_t γ^t r_t ].

Game against nature. In general the problem is NP-hard, except in the rectangular case. There is a nice (non-worst-case) generative probabilistic interpretation.

Robust MDPs - conventional wisdom. The problem is solvable when the uncertainty set is rectangular, i.e., the Cartesian product of the uncertainty sets of individual states: U = { (p, r) : p_s ∈ U_p(s), r_s ∈ U_r(s) for all s }. It is solved using robust dynamic programming (robust value iteration, robust policy iteration) by Nilim and El Ghaoui (OR, 2005) and Iyengar (MOR, 2005), among others: take the worst outcome at every state. If really bad events are to be included, U_p and U_r must be made huge, which leads to conservativeness. The tradeoff is complicated because bad events cannot be correlated across states.
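
A minimal sketch of robust value iteration under (s, a)-rectangular uncertainty, assuming each uncertainty set is an L1 ball of radius eps[s, a] around a nominal transition vector; the ball shape, the greedy inner solver, and the array layout are assumptions made here for illustration, not the talk's formulation.

```python
import numpy as np

def worst_case_expectation(p_nom, v, eps):
    """min p.v over distributions p with ||p - p_nom||_1 <= eps:
    move up to eps/2 probability mass onto the lowest-value state,
    draining the highest-value states first."""
    p = np.array(p_nom, dtype=float)
    lo = int(np.argmin(v))
    add = min(eps / 2.0, 1.0 - p[lo])
    p[lo] += add
    remaining = add
    for s in np.argsort(v)[::-1]:            # donors: high-value states first
        if s == lo or remaining <= 0:
            continue
        take = min(remaining, p[s])
        p[s] -= take
        remaining -= take
    return float(p @ v)

def robust_value_iteration(P_nom, R, eps, gamma=0.95, iters=500):
    """Robust Bellman iteration under (s,a)-rectangular L1-ball uncertainty."""
    S, A, _ = P_nom.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.array([[R[s, a] + gamma * worst_case_expectation(P_nom[s, a], V, eps[s, a])
                       for a in range(A)] for s in range(S)])
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```

The inner minimization is solved per state-action pair, which is exactly what rectangularity buys: the adversary's choice at one state never constrains its choice at another.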

Practicable? The problem is not solved yet; several issues remain before robust MDPs are practically successful: more flexible uncertainty sets (this talk), probabilistic uncertainty (not this talk), large scale (not this talk), and learning the uncertainty (this talk).

Coupling the uncertainty: intuition. No coupling is conservative; general coupling is NP-hard. Is there some form of coupling that remains tractable? Yes: k-rectangular uncertainty.

Example 1: lightning does not strike twice. How many times can lightning strike?

Lightning does not strike twice (figure slide).

Lightning does not strike twice: limit the number of deviations allowed. U_K = { (p, r) : p_s = p_s^nom and r_s = r_s^nom for all states except at most K states s_1, ..., s_K, where (p_{s_i}, r_{s_i}) ∈ U_p(s_i) × U_r(s_i) }. If K = 0 this is the naive (nominal) MDP; if K = |S| it is the standard robust MDP; K in between is the interesting regime.

Non-rectangular uncertainty sets. Rectangular sets force large uncertainty, but the LDST set is close to being rectangular. Rectangularity means that the conditional projection of the uncertainty set always remains the same. For LDST, there are K + 1 possible conditional projections of the uncertainty set.

Example 2: union of rectangular uncertainty sets. The uncertainty set is a union of several rectangular uncertainty sets (which need not be disjoint): U = ∪_{j=1}^{J} U_j, where U_j = ∏_{s ∈ S} U_s^j for j = 1, ..., J. There are at most 2^J − 1 possible conditional projections of the uncertainty set.

k-rectangular uncertainty sets. Definition: for an integer k, an uncertainty set U is called k-rectangular if, for any subset of states, the class of conditional projection sets contains no more than k sets. k-rectangularity is closed under operations such as union, intersection, and Cartesian product (with a potential increase of k).

The uncertainty model. Uncertainty is realized in a history-dependent way, and the decision maker observes the realized uncertain parameters. (A Markovian uncertainty model is NP-hard: it amounts to solving a two-player zero-sum stochastic game.) Key observation: the uncertainty sets are coupled only through k, the index of the conditional projection.

Algorithm for k-rectangular uncertainty sets. Theorem: a robust MDP with a k-rectangular uncertainty set can be solved in polynomial time. How? Augment the state space to include k. Double the decision stages: odd stages belong to the decision maker, even stages to Nature. Solve by dynamic programming on the augmented state space, alternating between the decision maker and Nature.
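
Here is a hedged sketch of the state-augmentation idea applied to the LDST set: the augmented state is (s, k), with k the remaining deviation budget, and the decision maker's and Nature's alternating stages are collapsed into a single backup per step. The finite list U_dev[s][a] of candidate deviated (reward, transition) pairs is an assumption made to keep the example small, not the talk's exact construction.

```python
import numpy as np

def ldst_robust_dp(P_nom, R_nom, U_dev, K, T):
    """P_nom: (S, A, S) nominal transitions, R_nom: (S, A) nominal rewards,
    U_dev[s][a]: list of (reward, transition_vector) deviations, K: budget, T: horizon."""
    S, A, _ = P_nom.shape
    V = np.zeros((T + 1, S, K + 1))              # V[t, s, k]: value with k deviations left
    for t in range(T - 1, -1, -1):
        for s in range(S):
            for k in range(K + 1):
                q = np.empty(A)
                for a in range(A):
                    # Nature plays the nominal model and keeps the budget at k ...
                    worst = R_nom[s, a] + P_nom[s, a] @ V[t + 1, :, k]
                    # ... or spends one deviation here (only if budget remains)
                    if k > 0:
                        for r_dev, p_dev in U_dev[s][a]:
                            worst = min(worst, r_dev + p_dev @ V[t + 1, :, k - 1])
                    q[a] = worst
                V[t, s, k] = q.max()             # the decision maker best-responds
    return V                                     # V[0, s0, K] is the robust value
```

With K = 0 this reduces to planning in the nominal MDP, and with K = |S| (and rich enough deviation lists) it approaches the standard rectangular robust MDP, matching the regimes described on the earlier slide.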

Illustration via LDST (figure slide).

Question: where do I get the uncertainty sets? There are two types of uncertainty. Stochastic uncertainty: there exist a true p and a true r, we just do not know their exact values. Adversarial uncertainty: there is no true p or r; each time the state is visited the parameters can vary, due to model simplification or to an adversarial effect that was ignored. If I can collect more data, can I identify the nature of the uncertainty, learn the values of the stochastic parameters, and learn the level of the adversarial uncertainty? Yes, we can!

Formal setup. An MDP with finitely many states and actions, and rewards in [0, 1]. For each pair (s, a) we are given a (potentially infinite) class of nested uncertainty sets U(s, a). Each pair (s, a) can be either stochastic or adversarial, and which one it is is not known. If (s, a) is stochastic, its true p and r are unknown. If (s, a) is adversarial, its true uncertainty set (also unknown) belongs to U(s, a). We are allowed to repeat the MDP many times.

Challenge and objective. For an adversarial state-action pair the parameters can be arbitrary (and adaptive to the algorithm), so it is not possible to exactly identify the nature of the uncertainty, nor to achieve diminishing regret against the optimal stationary policy in hindsight; that is, we may not take full advantage when the adversary chooses to play nice. We can, however, achieve vanishing regret against the performance of the robust MDP that knows the exact p and r of every stochastic pair and the true uncertainty set of every adversarial pair.

Main intuition. When everything is stochastic, one can resort to RL algorithms such as UCRL (which consistently uses optimistic estimates) to achieve diminishing regret. However, an adversary can hurt.

Main intuition (figure): a small counterexample MDP with states s_0, ..., s_4, actions a_1, a_2, a_3, and rewards g, g + β, and g − α, where 2β < α < 3β. The adversary serves the solid-line outcomes in phase 1 (2T steps) and the dashed-line outcomes in phase 2 (T steps). The expected value of s_4 is g + (β − α)/2, while the expected value of s_1 is g + (3β − α)/4 > g. The total accumulated reward is 3Tg + T(2β − α), so compared to the minimax policy the overall regret is non-diminishing.

Main intuition. Be optimistic, but cautious. Using UCRL, start by assuming all state-action pairs are stochastic and monitor the outcome of each pair's transitions. Use a statistical check to identify pairs that are assumed stochastic but are in fact adversarial, as well as pairs whose assumed uncertainty set is smaller than the true one. Update the information of the pairs that fail the check and re-solve the minimax MDP.

The algorithm: OLRM. Input: S, A, T, δ, and for each (s, a) the nested class U(s, a). (1) Initialize the set F ← ∅ and, for each (s, a), the current uncertainty estimate Û(s, a) ← ∅. (2) Initialize the episode index k ← 1. (3) Compute an optimistic robust policy π̃, treating all state-action pairs in F as adversarial with the uncertainty sets given by Û(s, a). (4) Execute π̃ until one of the following happens: the execution count of some state-action pair (s, a) has doubled, or an executed pair (s, a) fails the stochasticity check, in which case (s, a) is added to F (if not already there) and Û(s, a) is updated. (5) Increment k and go back to step (3).

Computing the optimistic robust policy. Input: S, A, T, δ, F, k, and for each (s, a): Û(s, a), the empirical transition estimate P̂_k(· | s, a), and the visit count N_k(s, a).
(1) Set Ṽ_T^k(s) = 0 for all s.
(2) Repeat, for t = T − 1, ..., 0:
For each (s, a) ∈ F, set Q̃_t^k(s, a) = min{ T − t, r(s, a) + min_{p ∈ Û(s,a)} Σ_{s'} p(s') Ṽ_{t+1}^k(s') }.
For each (s, a) ∉ F, set Q̃_t^k(s, a) = min{ T − t, r(s, a) + Σ_{s'} P̂_k(s' | s, a) Ṽ_{t+1}^k(s') + (T + 1) sqrt( log(14 S A T k² / δ) / (2 N_k(s, a)) ) }.
For each s, set Ṽ_t^k(s) = max_a Q̃_t^k(s, a) and π̃_t(s) = argmax_a Q̃_t^k(s, a).
(3) Output π̃. Robust to adversarial pairs, optimistic to stochastic pairs.
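
A minimal Python sketch of this backup, assuming tabular arrays for P̂, N, and r, and assuming each Û(s, a) is given as a finite list of candidate transition vectors so that the inner minimum is a simple loop; these representations are illustrative choices, not the paper's.

```python
import numpy as np

def optimistic_robust_policy(S, A, T, delta, F, U_hat, P_hat, N, r, k):
    """F: set of (s, a) treated as adversarial; U_hat[(s, a)]: list of transition vectors;
    P_hat: (S, A, S) empirical transitions; N, r: (S, A) visit counts and rewards."""
    V = np.zeros((T + 1, S))
    policy = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                if (s, a) in F:
                    # pessimistic (robust) backup over the current uncertainty set
                    backup = min(p @ V[t + 1] for p in U_hat[(s, a)])
                else:
                    # optimistic backup with a UCRL-style confidence bonus
                    bonus = (T + 1) * np.sqrt(
                        np.log(14 * S * A * T * k**2 / delta) / (2 * max(N[s, a], 1)))
                    backup = P_hat[s, a] @ V[t + 1] + bonus
                Q[s, a] = min(T - t, r[s, a] + backup)
        V[t] = Q.max(axis=1)
        policy[t] = Q.argmax(axis=1)
    return policy, V
```

Truncating at T − t keeps the optimistic values within the range achievable with rewards in [0, 1].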

Stochasticity check. When (s, a) ∉ F, it fails the stochasticity check if
Σ_{j=1}^{n} [ P̂_{k_j}(· | s, a) · Ṽ_{t_j+1}^{k_j}(·) − Ṽ_{t_j+1}^{k_j}(s_j) ] > (2.5 + T + 3.5 T √S) sqrt( n log(14 S A T τ² / δ) ).
When (s, a) ∈ F, it fails the stochasticity check if
Σ_{j=n'+1}^{n} [ min_{p ∈ Û(s,a)} p(·) · Ṽ_{t_j+1}^{k_j}(·) − Ṽ_{t_j+1}^{k_j}(s_j) ] > 2T sqrt( 2(n − n') log(14 τ² / δ) ).
If (s, a) fails the stochasticity check, add (s, a) to F and update Û(s, a) to the smallest set U' in the class U(s, a) that satisfies
Σ_{j=n'+1}^{n} [ min_{p ∈ U'} p(·) · Ṽ_{t_j+1}^{k_j}(·) − Ṽ_{t_j+1}^{k_j}(s_j) ] < T sqrt( 2(n − n') log(14 τ² / δ) ).
Here the sums run over the recorded visits to (s, a), with s_j the realized next state at the j-th visit.

More on the stochasticity check. It essentially checks whether the value of the actual transitions from (s, a) falls below what is expected from the parameter estimates. No false alarms: with high probability, all stochastic state-action pairs always pass the check, and all adversarial pairs pass it as long as Û(s, a) contains the true uncertainty set U*(s, a). The check may fail to identify an adversarial pair if the adversary plays nicely; but then satisfactory rewards are being accumulated, so nothing needs to change. If the adversary plays nasty, the check detects it and subsequently protects against it. In short: if it ain't broke, don't fix it.
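
A hedged sketch of the check for a pair currently treated as stochastic, following the first inequality two slides back. Each recorded visit stores the value predicted from the empirical model and the value of the next state actually reached; the pair fails when the cumulative shortfall exceeds the confidence threshold. The constants are copied from the slide as reconstructed and should be treated as illustrative.

```python
import numpy as np

def fails_stochasticity_check(predicted, realized, S, A, T, tau, delta):
    """predicted[j] = P_hat(.|s,a) . V_{t_j+1};  realized[j] = V_{t_j+1}(s_j),
    the value of the next state actually reached on the j-th visit to (s, a)."""
    predicted, realized = np.asarray(predicted), np.asarray(realized)
    n = len(predicted)
    shortfall = float(np.sum(predicted - realized))
    threshold = (2.5 + T + 3.5 * T * np.sqrt(S)) * np.sqrt(
        n * np.log(14 * S * A * T * tau**2 / delta))
    return shortfall > threshold   # True means (s, a) should be moved into F
```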

Performance guarantee. Theorem: given δ, T, S, A, and U, if |U(s, a)| ≤ C for all (s, a), then the total regret of OLRM after m episodes is Δ(m) ≤ O( T^{3/2} (√S + √C) sqrt( S A m log(S A T m / δ) ) ) for all m, with probability at least 1 − δ. The total number of steps is τ = Tm, hence the regret is Õ( T (√S + √C) √(S A τ) ).

Performance guarantee. What if U(s, a) is an infinite class? We consider classes of the form U(s, a) = { η(s, a) + α B(s, a) : 0 ≤ α ≤ α*(s, a) } ∩ P(S). (1) Theorem: given δ, T, S, A, and U as defined in Eq. (1), the total regret of OLRM is Δ(m) ≤ Õ( T ( S √(A τ) + (S A α* B)^{2/3} τ^{1/3} + (S A α* B)^{1/3} τ^{2/3} ) ) for all m, with probability at least 1 − δ.

Infinite-horizon average reward. Assume that for every p in the true uncertainty set the resulting MDP is unichain and communicating. A similar algorithm applies, except that computing the optimistic robust policy is trickier. Similar performance guarantees hold: O(√τ) regret for a finite class U and O(τ^{2/3}) for an infinite one.

Simulation (figure): panels labelled Gold and Silver, under the usual condition and an abnormal condition.

Simulation (figure): a small MDP with states s_0, ..., s_6, actions a_1, ..., a_4, and rewards R_minimax, R_good, and R_bad.

Simulation (figure): total rewards versus time steps (×10^6) for UCRL2, OLRM2, and the optimal minimax policy.

Conclusion. Coupled uncertainty: k-rectangularity, coupled through the index k, solved via state augmentation. Learning the uncertainty: a reinforcement-learning algorithm adapted to the robust setting, with diminishing regret. Together these make robust MDPs practicable.
