Bayesian reinforcement learning


1 Bayesian reinforcement learning: Markov decision processes and approximate Bayesian computation. Christos Dimitrakakis, Chalmers, April 16, 2015.

2 Overview Subjective probability and utility Subjective probability Rewards and preferences Bandit problems Bernoulli bandits Markov decision processes and reinforcement learning Markov processes Value functions Examples Bayesian reinforcement learning Reinforcement learning Bounds on the utility Planning: Heuristics and exact solutions Belief-augmented MDPs The expected MDP heuristic The maximum MDP heuristic Inference: Approximate Bayesian computation Properties of ABC Christos Dimitrakakis (Chalmers) Bayesian reinforcement learning April 16, / 60

3 Outline (section: Subjective probability and utility).

4 Subjective probability and utility: Subjective probability. Objective probability: an outcome x ∼ P_θ. Figure: The double slit experiment.

7 Subjective probability and utility: Subjective probability. What about everyday life? Making decisions requires making predictions, and the outcomes of decisions are uncertain. How can we represent this uncertainty? Subjective probability: we describe which events we think are more likely, and we quantify this with probability. Why probability? It quantifies uncertainty in a natural way, gives a framework for drawing conclusions from data, and is computationally convenient for decision making.

12 Subjective probability and utility Rewards and preferences Rewards We are going to receive a reward r from a set R of possible rewards We prefer some rewards to others Example 1 (Possible sets of rewards R) R is a set of tickets to different musical events R is a set of financial commodities Christos Dimitrakakis (Chalmers) Bayesian reinforcement learning April 16, / 60

13 Subjective probability and utility: Rewards and preferences. When we cannot select rewards directly. In most problems, we cannot just choose which reward to receive; we can only specify a distribution on rewards. Example 2 (Route selection). Each reward r ∈ R is the time it takes to travel from A to B. Route P_1 is faster than P_2 in heavy traffic, and vice versa. Which route should be preferred, given a certain probability of heavy traffic? In order to choose between random rewards, we use the concept of utility.

14 Subjective probability and utility: Rewards and preferences. Definition 3 (Utility). The utility is a function U : R → ℝ such that, for all a, b ∈ R, a ≽ b iff U(a) ≥ U(b). (1.1) The expected utility of a distribution P on R is E_P(U) = Σ_{r∈R} U(r) P(r) (1.2) = ∫_R U(r) dP(r). (1.3) Assumption 1 (The expected utility hypothesis). The utility of P is equal to the expected utility of the reward under P. Consequently, P ≽ Q iff E_P(U) ≥ E_Q(U), (1.4) i.e. we prefer P to Q iff the expected utility under P is higher than under Q.
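To make the expected utility hypothesis concrete, here is a minimal sketch (not from the slides) that scores the two routes of Example 2 by expected utility; the travel times, the traffic probability and the choice U(r) = -r are illustrative assumptions.

    # Hypothetical travel-time distributions for Example 2 (all numbers invented).
    p_heavy = 0.3  # assumed probability of heavy traffic
    routes = {
        "P1": {20: 1 - p_heavy, 60: p_heavy},   # fast normally, slow in traffic
        "P2": {35: 1 - p_heavy, 40: p_heavy},   # slower normally, robust to traffic
    }
    def utility(r):
        # Shorter travel times are preferred, so take U(r) = -r.
        return -r
    def expected_utility(dist):
        # E_P(U) = sum_r U(r) P(r)
        return sum(utility(r) * p for r, p in dist.items())
    for name, dist in routes.items():
        print(name, expected_utility(dist))
    # We prefer the route whose time distribution has the higher expected utility.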

17 Subjective probability and utility: Rewards and preferences. The St. Petersburg paradox. A simple game [Bernoulli, 1713]: a fair coin is tossed until a head is obtained; if the first head is obtained on the n-th toss, our reward is 2^n currency units. How much are you willing to pay to play this game once? The probability of stopping at round n is 2^{-n}. Thus, the expected monetary gain of the game is Σ_{n=1}^∞ 2^{-n} 2^n = ∞. If your utility function were linear (U(r) = r), you would be willing to pay any amount to play. (You might also not internalise the setup of the game: is the coin really fair?)
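A quick simulation (not in the slides) makes the paradox tangible: the empirical mean payoff keeps creeping upwards as more games are played, while a concave utility of money, here the logarithm as in Bernoulli's resolution, has a small and stable expectation. The game itself is implemented exactly as described above.

    import math
    import random
    random.seed(0)
    def play_once():
        # Toss a fair coin until the first head; the reward is 2**n.
        n = 1
        while random.random() < 0.5:  # tails with probability 1/2
            n += 1
        return 2 ** n
    payoffs = [play_once() for _ in range(100_000)]
    mean_payoff = sum(payoffs) / len(payoffs)                            # unstable, grows with more plays
    mean_log_utility = sum(math.log(x) for x in payoffs) / len(payoffs)  # close to 2 log 2
    print(f"empirical mean payoff:      {mean_payoff:.1f}")
    print(f"empirical mean log-utility: {mean_log_utility:.3f}")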

22 Subjective probability and utility Rewards and preferences Summary We can subjectively indicate which events we think are more likely We can define a subjective probability P for all events Similarly, we can subjectively indicate preferences for rewards We can determine a utility function for all rewards Hypothesis: we prefer the probability distribution with the highest expected utility This allows us to create algorithms for decision making Christos Dimitrakakis (Chalmers) Bayesian reinforcement learning April 16, / 60

23 Subjective probability and utility Rewards and preferences Experimental design and Markov decision processes The following problems Shortest path problems Optimal stopping problems Reinforcement learning problems Experiment design (clinical trial) problems Advertising can be all formalised as Markov decision processes Christos Dimitrakakis (Chalmers) Bayesian reinforcement learning April 16, / 60

24 Outline (section: Bandit problems).

25 Bandit problems. Applications: efficient optimisation, online advertising, clinical trials (e.g. ultrasound), the robot scientist.

32 Bandit problems. The stochastic n-armed bandit problem. Actions and rewards: a set of actions A = {1, …, n}; each action gives a random reward with distribution P(r_t | a_t = i); the expected reward of the i-th arm is ω_i ≜ E(r_t | a_t = i). Utility: the utility is the sum of the individual rewards of the sequence r = r_1, …, r_T, i.e. U(r) ≜ Σ_{t=1}^T r_t. Definition 4 (Policies). A policy π is an algorithm for taking actions given the observed history; P^π(a_{t+1} | a_1, r_1, …, a_t, r_t) is the probability of the next action a_{t+1}.

33 Bandit problems: Bernoulli bandits. Example 5 (Bernoulli bandits). Consider n Bernoulli distributions with parameters ω_i (i = 1, …, n), such that r_t | a_t = i ∼ Bern(ω_i). Then P(r_t = 1 | a_t = i) = ω_i and P(r_t = 0 | a_t = i) = 1 − ω_i, (2.1) so the expected reward of the i-th bandit is E(r_t | a_t = i) = ω_i. Exercise 1 (The optimal policy). If we know ω_i for all i, what is the best policy? What if we don't?

35 Bandit problems: Bernoulli bandits. A simple heuristic for the unknown reward case. Keep a running average of the reward obtained by each arm, ω̂_{t,i} = R_{t,i} / n_{t,i}, where n_{t,i} is the number of times you have played arm i and R_{t,i} is the total reward received from it. Whenever you play a_t = i: R_{t+1,i} = R_{t,i} + r_t and n_{t+1,i} = n_{t,i} + 1. Greedy policy: a_t = arg max_i ω̂_{t,i}. What should the initial values n_{0,i}, R_{0,i} be?
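A minimal runnable sketch of this greedy heuristic on a two-armed Bernoulli bandit; the arm parameters and the optimistic initial values R_{0,i} = n_{0,i} = 1 are illustrative choices, not taken from the slides.

    import random
    random.seed(1)
    true_omega = [0.4, 0.6]          # hidden Bernoulli parameters (illustrative)
    n_arms, T = len(true_omega), 1000
    R = [1.0] * n_arms               # total reward per arm, initialised optimistically
    n = [1.0] * n_arms               # play counts per arm
    total = 0.0
    for t in range(T):
        omega_hat = [R[i] / n[i] for i in range(n_arms)]
        a = max(range(n_arms), key=lambda i: omega_hat[i])    # greedy action
        r = 1.0 if random.random() < true_omega[a] else 0.0   # Bernoulli reward
        R[a] += r
        n[a] += 1
        total += r
    print("estimates:", [round(R[i] / n[i], 3) for i in range(n_arms)])
    print("average reward:", total / T)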

36 Bandit problems: Bernoulli bandits. Figure: The greedy policy under a particular initialisation of n_{0,i} and R_{0,i}, showing the true means ω_1, ω_2, the estimates ω̂_1, ω̂_2, and the running average reward (1/t) Σ_{k=1}^t r_k.

37 Outline (section: Markov decision processes and reinforcement learning).

38 Markov decision processes and reinforcement learning Markov processes A Markov process Christos Dimitrakakis (Chalmers) Bayesian reinforcement learning April 16, / 60

39 Markov decision processes and reinforcement learning: Markov processes. Diagram: s_{t−1} → s_t → s_{t+1}. Definition 6 (Markov process or Markov chain). The sequence {s_t : t = 1, …} of random variables s_t : Ω → S is a Markov process if P(s_{t+1} | s_t, …, s_1) = P(s_{t+1} | s_t). (3.1) Here s_t is the state of the Markov process at time t, and P(s_{t+1} | s_t) is the transition kernel of the process. The state of an algorithm: observe that the (R, n) vectors of our greedy bandit algorithm form a Markov process; they also summarise our belief about which arm is best.
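As a tiny illustration of a transition kernel in action, the sketch below simulates a three-state Markov chain; the kernel values are made up.

    import random
    random.seed(0)
    # Transition kernel P(s' | s) for a 3-state chain (rows sum to 1; values illustrative).
    P = [
        [0.9, 0.1, 0.0],
        [0.1, 0.8, 0.1],
        [0.0, 0.2, 0.8],
    ]
    def step(s):
        # Sample s_{t+1} ~ P(. | s_t): the next state depends only on the current one.
        return random.choices(range(len(P)), weights=P[s])[0]
    s, trajectory = 0, [0]
    for t in range(20):
        s = step(s)
        trajectory.append(s)
    print(trajectory)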

40 Markov decision processes and reinforcement learning: Markov processes. Markov decision processes (MDP). At each time step t we observe state s_t ∈ S, take action a_t ∈ A, and receive a reward r_t ∈ ℝ. Markov property of the reward and state distribution: P_µ(s_{t+1} | s_t, a_t) (transition distribution) and P_µ(r_t | s_t, a_t) (reward distribution).

41 Markov decision processes and reinforcement learning: Markov processes. The agent. The agent's policy π: P^π(a_t | r_{t−1}, s_t, a_{t−1}, …, r_1, s_1, a_1) (history-dependent policy) or P^π(a_t | s_t) (Markov policy). Definition 7 (Utility). Given a horizon T ≥ 0 and discount factor γ ∈ (0, 1], the utility can be defined as U_t ≜ Σ_{k=0}^{T−t} γ^k r_{t+k}. (3.2) The agent wants to find a policy π maximising the expected total future reward E^π_µ U_t = E^π_µ Σ_{k=0}^{T−t} γ^k r_{t+k} (expected utility).

42 Markov decision processes and reinforcement learning: Value functions. State value function: V^π_{µ,t}(s) ≜ E^π_µ(U_t | s_t = s). (3.3) The optimal policy π* dominates all other policies π everywhere in S, and the optimal value function V* is the value function of the optimal policy π*: V^{π*(µ)}_{t,µ}(s) ≥ V^π_{t,µ}(s) for all π, t, s, (3.4) and V*_{t,µ}(s) ≜ V^{π*(µ)}_{t,µ}(s). (3.5)

43 Markov decision processes and reinforcement learning: Examples. Stochastic shortest-path problem with a pit (grid world with pit O and goal X). Properties: r_t = −1 at every step, but r_t = 0 at X and −100 at O, where the problem ends; P_µ(s_{t+1} = X | s_t = X) = 1; A = {North, South, East, West}; the agent moves in a random direction with probability ω; walls block movement.

44 Markov decision processes and reinforcement learning: Examples. Figure: Pit-maze solutions (policies and values) for two values of ω.

45 Markov decision processes and reinforcement learning: Examples. How to evaluate a policy (case γ = 1):
V^π_{µ,t}(s) ≜ E^π_µ(U_t | s_t = s) (3.6)
= Σ_{k=0}^{T−t} E^π_µ(r_{t+k} | s_t = s) (3.7)
= E^π_µ(r_t | s_t = s) + E^π_µ(U_{t+1} | s_t = s) (3.8)
= E^π_µ(r_t | s_t = s) + Σ_{i∈S} V^π_{µ,t+1}(i) P^π_µ(s_{t+1} = i | s_t = s). (3.9)
This derivation directly gives a number of policy evaluation algorithms. Taking the maximum over policies,
max_π V^π_{µ,t}(s) = max_a E_µ(r_t | s_t = s, a) + max_π Σ_{i∈S} V^π_{µ,t+1}(i) P^π_µ(s_{t+1} = i | s_t = s),
gives us the optimal policy value.

50 Markov decision processes and reinforcement learning: Examples. Backwards induction for discounted infinite-horizon problems. We can also apply backwards induction to the infinite-horizon case; the resulting policy is stationary, so memory does not grow with T. Value iteration:
for n = 1, 2, … and s ∈ S do
  v_n(s) = max_a [ r(s, a) + γ Σ_{s'∈S} P_µ(s' | s, a) v_{n−1}(s') ]
end for
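A small runnable version of value iteration (a sketch, not the lecture's code); the two-state, two-action MDP used here is invented purely for illustration.

    import numpy as np
    # Illustrative MDP: P[a, s, s'] = P_mu(s' | s, a), r[s, a] = expected reward.
    P = np.array([
        [[0.9, 0.1],
         [0.2, 0.8]],   # action 0
        [[0.5, 0.5],
         [0.0, 1.0]],   # action 1
    ])
    r = np.array([[0.0, 1.0],
                  [2.0, 0.0]])
    gamma = 0.9
    v = np.zeros(2)
    for n in range(10_000):
        # q[s, a] = r(s, a) + gamma * sum_{s'} P(s' | s, a) v(s')
        q = r + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)                  # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < 1e-10:
            v = v_new
            break
        v = v_new
    print("v* ~", np.round(v, 3), " greedy policy:", q.argmax(axis=1))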

51 Markov decision processes and reinforcement learning: Examples. Summary. Markov decision processes model controllable dynamical systems. Optimal policies maximise expected utility and can be found with backwards induction / value iteration, or with policy iteration. The MDP state can be seen as the state of a dynamic controllable process, or as the internal state of an agent.

52 Outline (section: Bayesian reinforcement learning).

53 Bayesian reinforcement learning: Reinforcement learning. The reinforcement learning problem: learning to act in an unknown world, by interaction and reinforcement. World µ, policy π; at time t, µ generates observation x_t ∈ X, we take action a_t ∈ A using π, and µ gives us reward r_t ∈ ℝ. Definition 8 (Utility and expected utility). U_t = Σ_{k=t}^T r_k, with expected utility E^π_µ U_t = E^π_µ Σ_{k=t}^T r_k. When µ is known, we can calculate max_π E^π_µ U; but knowing µ is contrary to the problem definition.

58 Bayesian reinforcement learning: Reinforcement learning. When µ is not known. Bayesian idea: use a subjective belief ξ(µ) on M. Start with an initial belief ξ(µ). The probability of observing history h under policy π is P^π_µ(h). We can use this to adjust our belief via Bayes' theorem, ξ(µ | h, π) ∝ P^π_µ(h) ξ(µ), and thus conclude which µ is more likely. The subjective expected utility is E^π_ξ U = Σ_µ (E^π_µ U) ξ(µ), and U*_ξ ≜ max_π E^π_ξ U = max_π Σ_µ (E^π_µ U) ξ(µ). This integrates planning and learning, and the exploration-exploitation trade-off.
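A toy sketch of this belief update when M is a small finite set of candidate models; here each candidate µ is simply a Bernoulli reward probability, and all numbers are illustrative.

    models = [0.2, 0.5, 0.8]          # candidate models mu in M (illustrative)
    xi = [1 / 3, 1 / 3, 1 / 3]        # prior belief xi(mu)
    def likelihood(mu, history):
        # P^pi_mu(h) for a history of 0/1 rewards; the policy terms do not depend
        # on mu, so they cancel in the normalised posterior.
        p = 1.0
        for r in history:
            p *= mu if r == 1 else (1 - mu)
        return p
    history = [1, 1, 0, 1]            # observed rewards
    # Bayes' theorem: xi(mu | h, pi) is proportional to P^pi_mu(h) xi(mu)
    post = [likelihood(mu, history) * w for mu, w in zip(models, xi)]
    Z = sum(post)
    post = [p / Z for p in post]
    print("posterior:", {m: round(p, 3) for m, p in zip(models, post)})
    # Subjective expected utility of one more play: sum_mu E_mu(U) xi(mu | h)
    print("expected next reward:", sum(m * p for m, p in zip(models, post)))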

64 Bayesian reinforcement learning: Bounds on the utility. Bounds on the ξ-optimal utility U*_ξ ≜ max_π E^π_ξ U. Figure: expected utility as a function of the belief ξ over two MDPs, µ_1 (no trap) and µ_2 (trap), for policies π_1 and π_2 and for the ξ_1-optimal policy π*_{ξ_1}, which achieves U*_{ξ_1} at the belief ξ_1.

71 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Bernoulli bandits, the decision-theoretic approach. Assume r_t | a_t = i ∼ P_{ω_i}, with ω_i ∈ Ω. Define a prior belief ξ_1 on Ω. At each step t: select the action a_t maximising E_{ξ_t}(U_t | a_t) = E_{ξ_t}(Σ_{k=1}^{T−t} γ^k r_{t+k} | a_t); obtain the reward r_t; calculate the next belief ξ_{t+1} = ξ_t(· | a_t, r_t). How can we implement this?

72 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Bayesian inference on Bernoulli bandits. Likelihood: P_ω(r_t = 1) = ω. Prior: ξ(ω) ∝ ω^{α−1} (1 − ω)^{β−1}, i.e. a Beta(α, β) prior. For a sequence r = r_1, …, r_n, the likelihood is P_ω(r) ∝ ω^{#1(r)} (1 − ω)^{#0(r)}. Posterior: Beta(α + #1(r), β + #0(r)). Figures: the prior belief ξ(ω) about the mean reward ω, the likelihood of ω for 100 plays with 70 ones, and the resulting posterior belief ξ(ω | r).
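This conjugate update is all that is needed to implement a simple posterior-sampling bandit, anticipating the Thompson sampling heuristic discussed later; the sketch below is illustrative, with made-up arm parameters.

    import random
    random.seed(0)
    true_omega = [0.4, 0.6]            # hidden arm parameters (illustrative)
    alpha = [1.0, 1.0]                 # Beta prior parameters per arm
    beta = [1.0, 1.0]
    for t in range(1000):
        # Draw omega_i from each arm's Beta posterior and play the best draw.
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(2)]
        a = max(range(2), key=lambda i: samples[i])
        r = 1 if random.random() < true_omega[a] else 0
        # Conjugate update: Beta(alpha + #1(r), beta + #0(r))
        alpha[a] += r
        beta[a] += 1 - r
    print("posterior means:", [round(alpha[i] / (alpha[i] + beta[i]), 3) for i in range(2)])
    print("plays per arm:", [int(alpha[i] + beta[i] - 2) for i in range(2)])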

75 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Bernoulli example. Consider n Bernoulli distributions with unknown parameters ω_i (i = 1, …, n) such that r_t | a_t = i ∼ Bern(ω_i) and E(r_t | a_t = i) = ω_i. (4.1) Our belief for each parameter ω_i is Beta(α_i, β_i) with density f(ω | α_i, β_i), so that ξ(ω_1, …, ω_n) = Π_{i=1}^n f(ω_i | α_i, β_i) (a priori independence). Let N_{t,i} ≜ Σ_{k=1}^t I{a_k = i} and r̂_{t,i} ≜ (1/N_{t,i}) Σ_{k=1}^t r_k I{a_k = i}. Then the posterior distribution for the parameter of arm i is ξ_t = Beta(α_i + N_{t,i} r̂_{t,i}, β_i + N_{t,i}(1 − r̂_{t,i})). Since r_t ∈ {0, 1}, there are O((2n)^T) possible belief states for a T-step bandit problem.

76 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Belief states. The state of the decision-theoretic bandit problem is the state of our belief. A sufficient statistic is the number of plays and the total rewards: our belief state ξ_t is described by the priors α, β and the vectors N_t = (N_{t,1}, …, N_{t,n}) (4.2) and r̂_t = (r̂_{t,1}, …, r̂_{t,n}). (4.3) The next-state probabilities are P(r_t = 1 | a_t = i, ξ_t) = (α_i + N_{t,i} r̂_{t,i}) / (α_i + β_i + N_{t,i}), and ξ_{t+1} is a deterministic function of ξ_t, r_t and a_t. So the bandit problem can be formalised as a Markov decision process.

77 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Figure: The basic bandit MDP. The decision maker selects a_t while the parameter ω of the process is hidden; it then obtains reward r_t; the process repeats for t = 1, …, T. Figure: The decision-theoretic bandit MDP. While ω is not known, at each time step t we maintain a belief ξ_t on Ω; the reward distribution is then defined through our belief.

78 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Backwards induction (dynamic programming) over belief states:
for t = T, …, 1 and each reachable belief ξ_t do
  E(U_t | ξ_t) = max_{a_t ∈ A} [ E(r_t | ξ_t, a_t) + γ Σ_{ξ_{t+1}} P(ξ_{t+1} | ξ_t, a_t) E(U_{t+1} | ξ_{t+1}) ]
end for
Exact solution methods (exponential in the horizon): dynamic programming (backwards induction etc.), policy search. Approximations: (stochastic) branch and bound, upper confidence trees, approximate dynamic programming, local policy search (e.g. gradient based).
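To make the belief-state dynamic programming concrete, the sketch below computes the Bayes-optimal value of a two-armed Bernoulli bandit with γ = 1 and Beta(1, 1) priors by backwards induction over the Beta counts; the small horizon is deliberate, since the number of belief states grows exponentially, as noted above.

    from functools import lru_cache
    T = 8  # horizon, kept small on purpose
    @lru_cache(maxsize=None)
    def value(a1, b1, a2, b2, t):
        # Bayes-optimal expected remaining reward with t steps left, where
        # (a_i, b_i) are the Beta posterior parameters of arm i.
        if t == 0:
            return 0.0
        best = 0.0
        for a, b, upd in (
            (a1, b1, lambda r: (a1 + r, b1 + 1 - r, a2, b2)),
            (a2, b2, lambda r: (a1, b1, a2 + r, b2 + 1 - r)),
        ):
            p = a / (a + b)  # predictive probability of reward 1 for this arm
            q = p * (1 + value(*upd(1), t - 1)) + (1 - p) * value(*upd(0), t - 1)
            best = max(best, q)
        return best
    print("Bayes-optimal expected total reward over", T, "steps:", round(value(1, 1, 1, 1, T), 4))
    print("belief states evaluated:", value.cache_info().currsize)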

80 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Bayesian RL for unknown MDPs. The MDP as an environment: we are in some environment µ where, at each time step t, we observe state s_t ∈ S, take action a_t ∈ A, and receive reward r_t ∈ ℝ. How can we find the Bayes-optimal policy for unknown MDPs? Figure: The unknown Markov decision process.

81 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Some heuristics: (1) only change the policy at the start of epochs t_i; (2) calculate the belief ξ_{t_i}; (3) find a good policy π_i for the current belief; (4) execute it until the next epoch i + 1.

82 Bayesian reinforcement learning: Planning, heuristics and exact solutions. The expected MDP heuristic. One simple heuristic is to calculate the expected MDP for a given belief ξ, µ̄_ξ ≜ E_ξ µ, and then compute the optimal policy for µ̄_ξ: π*(µ̄_ξ) ∈ arg max_{π∈Π_1} V^π_{µ̄_ξ}. Example 9 (Counterexample). Figure: two MDPs µ_1 and µ_2 with rewards ϵ and 0 on different branches, and the expected MDP µ̄_ξ, for which the heuristic's policy is poor in both true MDPs.

83 Bayesian reinforcement learning: Planning, heuristics and exact solutions. The maximum MDP heuristic. Another heuristic is to take the most probable MDP for a belief ξ, µ*_ξ ∈ arg max_µ ξ(µ), and then compute the optimal policy for µ*_ξ: π*(µ*_ξ) ∈ arg max_{π∈Π_1} V^π_{µ*_ξ}. Example 10. Figure: the MDP µ_i from a family of |A| + 1 MDPs, in which action a = 0 yields ϵ, action a = i yields 1, and the remaining actions yield 0.

84 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Posterior (Thompson) sampling. Another heuristic is to simply sample an MDP from the belief ξ, µ^{(k)} ∼ ξ(µ), and then calculate the optimal policy for µ^{(k)}: π*(µ^{(k)}) ∈ arg max_{π∈Π_1} V^π_{µ^{(k)}}. Properties: O(√T) regret (direct proof: hard [1]; easy proof: convert to a confidence bound [11]); generally applicable for many beliefs; connections to differential privacy [9]; generalises to stochastic value function bounds [8].
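A sketch of posterior sampling for a small discrete MDP, assuming an independent Dirichlet belief over each transition row and known rewards; the sampled MDP is solved by value iteration as in the earlier sketch, and everything numeric is illustrative.

    import numpy as np
    rng = np.random.default_rng(0)
    S, A, gamma = 3, 2, 0.95
    r = rng.uniform(size=(S, A))            # known rewards (illustrative)
    counts = np.ones((S, A, S))             # Dirichlet(1, ..., 1) belief over transitions
    def solve(P, r, iters=2000):
        # Value iteration on the sampled MDP; returns the greedy policy.
        v = np.zeros(S)
        for _ in range(iters):
            q = r + gamma * np.einsum("sat,t->sa", P, v)
            v = q.max(axis=1)
        return q.argmax(axis=1)
    def thompson_policy(counts):
        # 1. Sample a transition kernel row by row from the Dirichlet posterior.
        P = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)] for s in range(S)])
        # 2. Act optimally in the sampled MDP until the next update.
        return solve(P, r)
    print("policy for the sampled MDP:", thompson_policy(counts))
    # After observing a transition (s, a, s'), set counts[s, a, s'] += 1 and resample.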

85 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Belief-augmented MDPs. Unknown bandit problems can be converted into MDPs through the belief state. We can do the same for MDPs: we just create a hyperstate, composed of the current MDP state and the current belief.

86 Bayesian reinforcement learning: Planning, heuristics and exact solutions. Figure: the belief-augmented MDP, (a) the complete model and (b) its compact form, with hyperstates ψ_t composed of s_t and ξ_t. The augmented MDP: P(s_{t+1} ∈ S' | ξ_t, s_t, a_t) ≜ ∫ P_µ(s_{t+1} ∈ S' | s_t, a_t) dξ_t(µ) (4.4) and ξ_{t+1}(·) = ξ_t(· | s_{t+1}, s_t, a_t). (4.5)

87 Bayesian reinforcement learning: Planning, heuristics and exact solutions. So now we have converted the unknown-MDP problem into an MDP, which means we can use dynamic programming to solve it. So are we done? Unfortunately, the exact solution is again exponential in the horizon.

91 Bayesian reinforcement learning: Inference, approximate Bayesian computation. ABC (Approximate Bayesian Computation) RL.¹ How to deal with an arbitrary model space M: the models µ ∈ M may be non-probabilistic simulators, and we may not know how to choose the simulator parameters. Overview of the approach: place a prior on the simulator parameters; observe some data h on the real system; approximate the posterior by statistics on simulated data; calculate a near-optimal policy for the posterior. Results: soundness depends on properties of the statistics; in practice, it can require much less data than a general model. ¹ Dimitrakakis and Tziortiotis, "ABC Reinforcement Learning", ICML 2013.

95 Bayesian reinforcement learning: Inference, approximate Bayesian computation. Figures: a set of trajectories from the real data, a good simulator and a bad simulator, together with their cumulative features. Trajectories are easy to generate; to compare them, use a statistic.

99 Bayesian reinforcement learning: Inference, approximate Bayesian computation. ABC (Approximate Bayesian Computation). When there is no probabilistic model (P_µ is not available): ABC! Ingredients: a prior ξ on a class of simulators M; a history h ∈ H generated with policy π; a statistic f : H → (W, ‖·‖); a threshold ϵ > 0. Example 11 (Cumulative features): given a feature function φ : X → ℝ^k, take f(h) ≜ Σ_t φ(x_t). Example 11 (Utility): f(h) ≜ Σ_t r_t.
ABC-RL using Thompson sampling:
do
  µ̂ ∼ ξ, h' ∼ P^π_µ̂   // sample a model and a history
until ‖f(h') − f(h)‖ ≤ ϵ   // until the statistics are close
µ^{(k)} = µ̂   // approximate posterior sample µ^{(k)} ∼ ξ_ϵ(· | h_t)
π^{(k)} ∈ arg max_π E^π_{µ^{(k)}} U_t   // approximate optimal policy for the sample
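A schematic sketch of this ABC rejection loop for a simulator with a single unknown parameter, using the utility statistic f(h) = Σ_t r_t; the simulator, prior and threshold are all illustrative assumptions, and the accepted samples stand in for µ^{(k)} ∼ ξ_ϵ(· | h).

    import random
    random.seed(0)
    T, eps = 50, 2.0
    def simulate(theta):
        # We can only run the simulator forward; here it emits Bernoulli(theta)
        # rewards for T steps (an illustrative stand-in for a real simulator).
        return [1 if random.random() < theta else 0 for _ in range(T)]
    def f(history):
        # Statistic: the total reward (utility) of the trajectory.
        return sum(history)
    real_theta = 0.7                       # unknown to the learner
    h_real = simulate(real_theta)          # data observed on the real system
    def abc_sample(h):
        # Rejection sampling: propose until the statistics are eps-close.
        while True:
            theta_hat = random.uniform(0, 1)   # mu_hat ~ xi (uniform prior)
            h_sim = simulate(theta_hat)        # h' ~ P^pi_mu_hat
            if abs(f(h_sim) - f(h)) <= eps:
                return theta_hat               # approximate posterior sample
    samples = [abc_sample(h_real) for _ in range(200)]
    print("approximate posterior mean of theta:", round(sum(samples) / len(samples), 3))
    # Each accepted sample could now be passed to a planner, Thompson-sampling style.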

108 Bayesian reinforcement learning: Properties of ABC. The approximate posterior ξ_ϵ(· | h). Corollary 11. If f is a sufficient statistic and ϵ = 0, then ξ(· | h) = ξ_ϵ(· | h). Assumption 2 (A1, Lipschitz log-probabilities). For the policy π there exists L > 0 such that, for all h, h' ∈ H and µ ∈ M, |ln[P^π_µ(h) / P^π_µ(h')]| ≤ L ‖f(h) − f(h')‖. Theorem 12 (The approximate posterior ξ_ϵ(· | h) is close to ξ(· | h)). If A1 holds, then for all ϵ > 0, D(ξ(· | h) ‖ ξ_ϵ(· | h)) ≤ 2Lϵ + ln |A^h_ϵ|, (4.6) where A^h_ϵ ≜ {z ∈ H : ‖f(z) − f(h)‖ ≤ ϵ}.

112 Bayesian reinforcement learning: Properties of ABC. Summary. Unknown MDPs can be handled in a Bayesian framework. This defines a belief-augmented MDP with a state for the MDP and a state for the agent's belief. The Bayes-optimal utility is convex, enabling approximations. A big problem is specifying the right prior. Questions?

113 Belief updates Belief updates Discounted reward MDPs Backwards induction Christos Dimitrakakis (Chalmers) Bayesian reinforcement learning April 16, / 60

114 Belief updates. Updating the belief in discrete MDPs. Let D_t = (s^t, a^{t−1}, r^{t−1}) be the observed data up to time t. Then
ξ(B | D_t, π) = ∫_B P^π_µ(D_t) dξ(µ) / ∫_M P^π_µ(D_t) dξ(µ), (5.1)
ξ_{t+1}(B) ≜ ξ(B | D_{t+1}) = ∫_B P^π_µ(D_{t+1}) dξ(µ) / ∫_M P^π_µ(D_{t+1}) dξ(µ) (5.2)
= ∫_B P_µ(s_{t+1}, r_t | s_t, a_t) π(a_t | s^t, a^{t−1}, r^{t−1}) dξ(µ | D_t) / ∫_M P_µ(s_{t+1}, r_t | s_t, a_t) π(a_t | s^t, a^{t−1}, r^{t−1}) dξ(µ | D_t) (5.3)
= ∫_B P_µ(s_{t+1}, r_t | s_t, a_t) dξ_t(µ) / ∫_M P_µ(s_{t+1}, r_t | s_t, a_t) dξ_t(µ). (5.4)
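A concrete sketch of this update under a common conjugate choice (not spelled out on the slide): an independent Dirichlet belief over each transition row of a discrete MDP, so that the posterior is just a table of counts.

    import numpy as np
    S, A = 4, 2
    dirichlet = np.ones((S, A, S))     # Dirichlet(1, ..., 1) prior over each row P(. | s, a)
    def update(dirichlet, s, a, s_next):
        # xi_{t+1}(.) = xi_t(. | s_{t+1}, s_t, a_t): for a Dirichlet belief this is
        # simply incrementing the count of the observed transition.
        dirichlet[s, a, s_next] += 1.0
        return dirichlet
    def expected_kernel(dirichlet):
        # Expected transition probabilities under the current belief.
        return dirichlet / dirichlet.sum(axis=2, keepdims=True)
    for s, a, s_next in [(0, 1, 2), (2, 0, 2), (2, 0, 3)]:   # observed transitions
        dirichlet = update(dirichlet, s, a, s_next)
    print(np.round(expected_kernel(dirichlet)[2, 0], 3))     # belief about P(. | s=2, a=0)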

115 Belief updates. Backwards induction policy evaluation:
for each state s ∈ S and t = T, …, 1 do
  v_t(s) = E^π_µ(r_t | s_t = s) + Σ_{j∈S} P^π_µ(s_{t+1} = j | s_t = s) v_{t+1}(j) (5.5)
end for
Figure: the values v_t(s) being filled in backwards along a small (s_t, a_t, r_t, s_{t+1}) diagram.

119 Discounted reward MDPs Belief updates Discounted reward MDPs Backwards induction Christos Dimitrakakis (Chalmers) Bayesian reinforcement learning April 16, / 60

120 Discounted reward MDPs. Discounted total reward: U_t = lim_{T→∞} Σ_{k=t}^T γ^{k−t} r_k, γ ∈ (0, 1). Definition 13. A policy π is stationary if π(a_t | s_t) does not depend on t. Remark 1. We can use the Markov chain kernel P_{µ,π} to write the expected utility vector as v^π = Σ_{t=0}^∞ γ^t P^t_{µ,π} r. (6.1)

121 Discounted reward MDPs. Theorem 14. For any stationary policy π, v^π is the unique solution of the fixed-point equation v = r + γ P_{µ,π} v. (6.2) In addition, the solution is v^π = (I − γ P_{µ,π})^{−1} r. (6.3) Example 15. This is analogous to the geometric series Σ_{t=0}^∞ α^t = 1/(1 − α).

122 Discounted reward MDPs. Algorithm 1 (Policy iteration). Input: µ, S. Initialise v_0.
for n = 1, 2, … do
  π_{n+1} = arg max_π {r + γ P_π v_n}   // policy improvement
  v_{n+1} = (I − γ P_{µ,π_{n+1}})^{−1} r   // policy evaluation
  break if π_{n+1} = π_n
end for
Return π_n, v_n.
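A runnable sketch of Algorithm 1 on the same kind of invented two-state MDP used in the value-iteration sketch, with the evaluation step performed by the linear solve of Theorem 14.

    import numpy as np
    P = np.array([                       # P[a, s, s'] (illustrative MDP)
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.5, 0.5], [0.0, 1.0]],
    ])
    r = np.array([[0.0, 1.0],
                  [2.0, 0.0]])           # r[s, a]
    gamma, S = 0.9, 2
    def evaluate(pi):
        # v^pi = (I - gamma P_{mu,pi})^(-1) r_pi  (Theorem 14)
        P_pi = np.array([P[pi[s], s] for s in range(S)])
        r_pi = np.array([r[s, pi[s]] for s in range(S)])
        return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    pi = np.zeros(S, dtype=int)          # arbitrary initial policy
    while True:
        v = evaluate(pi)                                     # policy evaluation
        q = r + gamma * np.einsum("ast,t->sa", P, v)         # action values
        pi_new = q.argmax(axis=1)                            # policy improvement
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new
    print("optimal policy:", pi, " value:", np.round(v, 3))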

123 Discounted reward MDPs: Backwards induction. Backwards induction policy optimisation:
for each state s ∈ S and t = T, …, 1 do
  v_t(s) = max_a [ E_µ(r_t | s_t = s, a_t = a) + Σ_{j∈S} P_µ(s_{t+1} = j | s_t = s, a_t = a) v_{t+1}(j) ] (6.4)
end for
Figure: the optimal values being filled in backwards along a small (s_t, a_t, r_t, s_{t+1}) diagram.

125 References
[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012, 2012.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3), 2002.
[3] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2001.
[4] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] Herman Chernoff. Sequential design of experiments. Annals of Mathematical Statistics, 30(3), 1959.
[6] Herman Chernoff. Sequential models for clinical trials. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4. Univ. of Calif. Press, 1966.
[7] Morris H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970.
[8] Christos Dimitrakakis. Monte-Carlo utility estimates for Bayesian reinforcement learning. In IEEE 52nd Annual Conference on Decision and Control (CDC 2013), 2013. arXiv preprint.
[9] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin Rubinstein. Robust and private Bayesian inference. In Algorithmic Learning Theory, 2014.
[10] Milton Friedman and Leonard J. Savage. The expected-utility hypothesis and the measurability of utility. The Journal of Political Economy, 60(6):463, 1952.

126 References (continued)
[11] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: an optimal finite-time analysis. In ALT 2012, 2012.
[12] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US, 1994.
[13] Leonard J. Savage. The Foundations of Statistics. Dover Publications, 1972.
[14] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML 2010, 2010.
[15] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.


More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Sequential Decision

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information

The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist

More information

The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan

The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan Background: Global Optimization and Gaussian Processes The Geometry of Gaussian Processes and the Chaining Trick Algorithm

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce bandit optimisation: the

More information

Bayesian and Frequentist Methods in Bandit Models

Bayesian and Frequentist Methods in Bandit Models Bayesian and Frequentist Methods in Bandit Models Emilie Kaufmann, Telecom ParisTech Bayes In Paris, ENSAE, October 24th, 2013 Emilie Kaufmann (Telecom ParisTech) Bayesian and Frequentist Bandits BIP,

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Learning to play K-armed bandit problems

Learning to play K-armed bandit problems Learning to play K-armed bandit problems Francis Maes 1, Louis Wehenkel 1 and Damien Ernst 1 1 University of Liège Dept. of Electrical Engineering and Computer Science Institut Montefiore, B28, B-4000,

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models c Qing Zhao, UC Davis. Talk at Xidian Univ., September, 2011. 1 Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models Qing Zhao Department of Electrical and Computer Engineering University

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Sparse Linear Contextual Bandits via Relevance Vector Machines

Sparse Linear Contextual Bandits via Relevance Vector Machines Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,

More information

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract

More information

Introduction to Reinforcement Learning Part 1: Markov Decision Processes

Introduction to Reinforcement Learning Part 1: Markov Decision Processes Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for

More information

Elements of Reinforcement Learning

Elements of Reinforcement Learning Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Approximate Universal Artificial Intelligence

Approximate Universal Artificial Intelligence Approximate Universal Artificial Intelligence A Monte-Carlo AIXI Approximation Joel Veness Kee Siong Ng Marcus Hutter David Silver University of New South Wales National ICT Australia The Australian National

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

RL 14: Simplifications of POMDPs

RL 14: Simplifications of POMDPs RL 14: Simplifications of POMDPs Michael Herrmann University of Edinburgh, School of Informatics 04/03/2016 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

The convergence limit of the temporal difference learning

The convergence limit of the temporal difference learning The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector

More information

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

Optimism in the Face of Uncertainty Should be Refutable

Optimism in the Face of Uncertainty Should be Refutable Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:

More information

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69 R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual

More information

Logic, Knowledge Representation and Bayesian Decision Theory

Logic, Knowledge Representation and Bayesian Decision Theory Logic, Knowledge Representation and Bayesian Decision Theory David Poole University of British Columbia Overview Knowledge representation, logic, decision theory. Belief networks Independent Choice Logic

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

More information

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

Factored State Spaces 3/2/178

Factored State Spaces 3/2/178 Factored State Spaces 3/2/178 Converting POMDPs to MDPs In a POMDP: Action + observation updates beliefs Value is a function of beliefs. Instead we can view this as an MDP where: There is a state for every

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

On Bayesian bandit algorithms

On Bayesian bandit algorithms On Bayesian bandit algorithms Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier, Nathaniel Korda and Rémi Munos July 1st, 2012 Emilie Kaufmann (Telecom ParisTech) On Bayesian bandit algorithms

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted

15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted 15-889e Policy Search: Gradient Methods Emma Brunskill All slides from David Silver (with EB adding minor modificafons), unless otherwise noted Outline 1 Introduction 2 Finite Difference Policy Gradient

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

Reinforcement Learning: the basics

Reinforcement Learning: the basics Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration

Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration Emile Contal David Buffoni Alexandre Robicquet Nicolas Vayatis CMLA, ENS Cachan, France September 25, 2013 Motivating

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Approximate Universal Artificial Intelligence

Approximate Universal Artificial Intelligence Approximate Universal Artificial Intelligence A Monte-Carlo AIXI Approximation Joel Veness Kee Siong Ng Marcus Hutter Dave Silver UNSW / NICTA / ANU / UoA September 8, 2010 General Reinforcement Learning

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

CS 598 Statistical Reinforcement Learning. Nan Jiang

CS 598 Statistical Reinforcement Learning. Nan Jiang CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,

More information

An Optimal Bidimensional Multi Armed Bandit Auction for Multi unit Procurement

An Optimal Bidimensional Multi Armed Bandit Auction for Multi unit Procurement An Optimal Bidimensional Multi Armed Bandit Auction for Multi unit Procurement Satyanath Bhat Joint work with: Shweta Jain, Sujit Gujar, Y. Narahari Department of Computer Science and Automation, Indian

More information

Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS

Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS Many slides adapted from Jur van den Berg Outline POMDPs Separation Principle / Certainty Equivalence Locally Optimal

More information

Exploration and Exploitation in Bayes Sequential Decision Problems

Exploration and Exploitation in Bayes Sequential Decision Problems Exploration and Exploitation in Bayes Sequential Decision Problems James Anthony Edwards Submitted for the degree of Doctor of Philosophy at Lancaster University. September 2016 Abstract Bayes sequential

More information

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes

CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Temporal difference learning

Temporal difference learning Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).

More information

ilstd: Eligibility Traces and Convergence Analysis

ilstd: Eligibility Traces and Convergence Analysis ilstd: Eligibility Traces and Convergence Analysis Alborz Geramifard Michael Bowling Martin Zinkevich Richard S. Sutton Department of Computing Science University of Alberta Edmonton, Alberta {alborz,bowling,maz,sutton}@cs.ualberta.ca

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information