Reinforcement Learning
Lecture 6: RL algorithms 2.0
Alexandre Proutiere, Sadegh Talebi, Jungseul Ok
KTH, The Royal Institute of Technology

Objectives of this lecture
Present and analyse two online algorithms based on the "optimism in front of uncertainty" principle, and compare their regret to that of algorithms with random exploration:
- UCB-VI for episodic RL problems
- UCRL2 for ergodic RL problems

Lecture 6: Outline
1. Minimal exploration in RL
2. UCB-VI
3. UCRL2

Towards minimal exploration
The MDP model is unknown and has to be learnt. Solutions for on-policy algorithms:
1. Estimate the model, then optimise: poor regret and premature exploitation
2. ε-greedy exploration: undirected exploration (explores too many (state, action) pairs with low values)
3. Bandit-like optimal exploration-exploitation trade-off
But how much should a (state, action) pair be explored?

Regret lower bounds
In the case of ergodic RL problems:
- Problem-specific lower bound (Burnetas-Katehakis 1997):
  $\liminf_{T \to \infty} \frac{E[N_{(s,a)}(T)]}{\log(T)} \ge \frac{1}{K_M(s,a)}$,
  leading to an asymptotic regret lower bound scaling as $SA \log(T)$
- Minimax lower bound: $\Theta(\sqrt{SAT})$
We don't know when the asymptotic problem-specific regret lower bound becomes representative; often only for very large $T$!
Read for bandit optimisation: "Explore First, Exploit Next: The True Shape of Regret in Bandit Problems", Garivier et al.

Which regret lower bound should we target?
Example: $SA = 1000$, comparison of $\sqrt{SAT}$ and $SA \log(T)$

Which regret lower bound should we target?
Boundary (where $\sqrt{SAT} = SA \log(T)$): $SA = T / \log(T)^2$
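The comparison of the two scalings can be checked numerically. The sketch below is only illustrative: it drops all constants, and the function names (`minimax_bound`, `problem_specific_bound`, `crossover_horizon`) and the bisection bracket are assumptions introduced here, not part of the lecture.

```python
import math

def minimax_bound(sa: int, t: float) -> float:
    """sqrt(S*A*T), with all constants dropped."""
    return math.sqrt(sa * t)

def problem_specific_bound(sa: int, t: float) -> float:
    """S*A*log(T), with all constants dropped."""
    return sa * math.log(t)

def crossover_horizon(sa: int, lo: float = 10.0, hi: float = 1e12) -> float:
    """Bisection for the T at which sqrt(SAT) = SA*log(T), i.e. SA = T / log(T)^2.

    The default bracket [lo, hi] is an assumption that holds for SA = 1000.
    """
    def diff(t: float) -> float:
        return minimax_bound(sa, t) - problem_specific_bound(sa, t)

    assert diff(lo) < 0 < diff(hi), "bracket does not contain the crossover"
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # geometric midpoint, T spans many decades
        if diff(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi

t_star = crossover_horizon(1000)
print(f"SA = 1000: the two scalings cross near T = {t_star:.3g}")
```

Below the crossover the $\sqrt{SAT}$ scaling is the smaller of the two; above it the $SA \log(T)$ scaling is smaller.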

Optimism in front of uncertainty
Estimate the unknown system parameters (here $p(\cdot|\cdot,\cdot)$ and $r(\cdot,\cdot)$) and build an optimistic reward estimate to trigger exploration.
- Estimate: find confidence balls containing the true model w.h.p.
- Optimistic reward estimate: find the model within the confidence balls leading to the highest value.

Optimism in front of uncertainty: generic algorithm
Algorithm (for infinite-horizon RL problems).
Initialise $\hat p$, $\hat r$, and $N(s,a)$
For $t = 1, 2, \ldots$:
1. Build an optimistic reward model $(Q(s,a))_{s,a}$ from $\hat p$, $\hat r$, and $N(s,a)$
2. Select action $a(t)$ maximising $Q(s(t), a)$ over $a \in A_{s(t)}$
3. Observe the transition to $s(t+1)$ and collect reward $r(s(t), a(t))$
4. Update $\hat p$, $\hat r$, and $N(s,a)$

Examples
- UCB-VI: directly build a confidence ball for the Q-function based on the empirical estimates of the model.
- UCRL2: first build confidence balls for the reward and transition probabilities, and then identify Q.

Finite-horizon MDP to episodic RL problems
- Initial state $s_0$ (could be a r.v.)
- Transition probabilities at time $t$: $p(s'|s,a)$
- Reward at time $t$: $r(s,a)$, and at time $H$: $r_H(s)$
- Unknown transition probabilities and reward function
Objective: quickly learn a policy $\pi^\star$ maximising, over $\pi_0 \in MD$,
$V_H^{\pi_0} := E\left[\sum_{u=0}^{H-1} r(s_u^{\pi_0}, a_u^{\pi_0}) + r_H(s_H^{\pi_0})\right]$.

Finite-horizon MDP to episodic RL problems
- Data: $K$ episodes of length $H$ (actions, states, rewards)
- Learning algorithm $\pi$: data $\mapsto \pi_K \in MD$
- Performance of $\pi$: how close $\pi_K$ is to the optimal policy $\pi^\star$

UCB-VI
UCBVI is an extension of Value Iteration, guaranteeing that the resulting value function is a (high-probability) upper confidence bound (UCB) on the optimal value function.
At the beginning of episode $k$, it computes state-action values using the empirical transition kernel and reward function.
In step $h$ of backward induction (to update $Q_{k,h}(s,a)$ for any $(s,a)$), it adds a bonus $b_{k,h}(s,a)$ to the value, and ensures that $Q_{k,h}$ never exceeds $Q_{k-1,h}$.
Two variants of UCBVI, depending on the choice of bonus $b_{k,h}$:
- UCBVI-CH (Chernoff-Hoeffding bonus)
- UCBVI-BF (Bernstein-Freedman bonus)

UCB-VI algorithm
Variables to be maintained by the algorithm (for a known reward function):
- $\hat p = (\hat p(s'|s,a),\ s, s' \in S,\ a \in A_s)$: estimated transition probabilities
- $Q = (Q_h(s,a),\ h \le H,\ s \in S,\ a \in A_s)$: estimated Q-function
- $b = (b_h(s,a),\ h \le H,\ s \in S,\ a \in A_s)$: Q-value bonus
- $N = (N(s,a),\ s \in S,\ a \in A_s)$: number of visits to $(s,a)$ so far
- $N' = (N'_h(s,a),\ h \le H,\ s \in S,\ a \in A_s)$: number of visits to $(s,a)$ in step $h$ of the episodes so far

UCB-VI algorithm
Algorithm. UCB-VI
Input: initial state distribution $\nu_0$, precision $\delta$
Initialise the variables $\hat p$, $N$, and $N'$
For episode $k = 1, 2, \ldots$:
1. Optimistic reward:
   a. Compute the bonus: $b \leftarrow$ bonus($N$, $N'$, $\hat p$, $Q$, $\delta$)
   b. Estimate the Q-function: $Q \leftarrow$ bellmanopt($Q$, $b$, $\hat p$)
2. Initialise the state $s(0) \sim \nu_0$
3. For $h = 1, \ldots, H$, select action $a \in \arg\max_{a' \in A_{s(h-1)}} Q_h(s(h-1), a')$
4. Observe the transition and update $\hat p$, $N$, and $N'$

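UCB-VI algorithm: bonus
UCBVI-CH:
$b_h(s,a) = \frac{7H}{\sqrt{N(s,a)}} \log(5SAT/\delta)$
UCBVI-BF:
$b_h(s,a) = \sqrt{\frac{8L \, \mathrm{Var}_{\hat p(\cdot|s,a)}(V_{h+1}(Y))}{N(s,a)}} + \frac{14HL}{3N(s,a)} + \sqrt{\sum_y \frac{\hat p(y|s,a)}{N(s,a)} \min\left\{ \frac{10^4 H^3 S^2 A L^2}{N'_{h+1}(y)},\ H^2 \right\}}$
where $L = \log(5SAT/\delta)$.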
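To get a feel for the size of the UCBVI-CH bonus, it can be evaluated directly. This is a minimal sketch, assuming the reading $b_h(s,a) = 7HL/\sqrt{N(s,a)}$ with $L = \log(5SAT/\delta)$; the function name, the parameter names, and the choice to return infinity for an unvisited pair are assumptions made for illustration.

```python
import math

def ucbvi_ch_bonus(n_visits: int, horizon: int, n_states: int,
                   n_actions: int, total_steps: int, delta: float) -> float:
    """Chernoff-Hoeffding style bonus 7 * H * L / sqrt(N(s, a))."""
    if n_visits <= 0:
        # Assumption of this sketch: an unvisited pair gets an infinite bonus,
        # so its Q-value is determined by the clipping at H.
        return float("inf")
    log_term = math.log(5 * n_states * n_actions * total_steps / delta)
    return 7 * horizon * log_term / math.sqrt(n_visits)

# Hypothetical instance: 4 states, 2 actions, H = 3, T = 3000 steps, delta = 0.05.
for n in (1, 100, 10_000, 1_000_000):
    print(n, round(ucbvi_ch_bonus(n, 3, 4, 2, 3000, 0.05), 3))
```

The bonus decays as $1/\sqrt{N(s,a)}$, and it only drops below $H$ (where it starts to matter after clipping) once $N(s,a) \ge 49 L^2$.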

UCB-VI algorithm: Optimistic Bellman operator
bellmanopt($Q$, $b$, $\hat p$) applies Dynamic Programming with a bonus.
Initialisation: $Q_H(s,a) = r_H(s)$ for all $(s,a)$
For step $h = H-1, \ldots, 1$: for all $(s,a)$ visited at least once so far:
$Q_h(s,a) \leftarrow \min\left( Q_h(s,a),\ H,\ r(s,a) + \sum_y \hat p(y|s,a) V_{h+1}(y) + b_h(s,a) \right)$,
where $V_{h+1}(y) = \max_a Q_{h+1}(y,a)$.
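The optimistic backward induction can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: steps are indexed $0, \ldots, H-1$ with the terminal layer at index $H$, Q-values are assumed to be initialised at $H$, and all names (`bellman_opt`, `q_prev`, `visits`, the toy data) are hypothetical.

```python
def bellman_opt(q_prev, p_hat, reward, terminal_reward, bonus, visits, horizon):
    """One optimistic backward-induction pass (dynamic programming with a bonus).

    q_prev[h][s][a]  Q-values of the previous episode, h = 0..horizon
    p_hat[s][a][y]   empirical transition probabilities
    reward[s][a]     known rewards; terminal_reward[s] is the reward at step H
    bonus[h][s][a]   exploration bonus; visits[s][a] are visit counts
    """
    n_states = len(reward)
    q = [[row[:] for row in layer] for layer in q_prev]  # work on a copy
    for s in range(n_states):
        for a in range(len(reward[s])):
            q[horizon][s][a] = terminal_reward[s]
    for h in range(horizon - 1, -1, -1):
        v_next = [max(q[h + 1][y]) for y in range(n_states)]
        for s in range(n_states):
            for a in range(len(reward[s])):
                if visits[s][a] == 0:
                    continue  # unvisited pair: keep the previous optimistic value
                backup = (reward[s][a] + bonus[h][s][a]
                          + sum(p_hat[s][a][y] * v_next[y] for y in range(n_states)))
                # Clip at H and never exceed the previous episode's value.
                q[h][s][a] = min(q_prev[h][s][a], horizon, backup)
    return q

# Toy instance (hypothetical): 2 states, 2 actions, H = 2, pair (1, 1) never visited.
H = 2
reward = [[0.0, 0.5], [1.0, 0.0]]
terminal_reward = [0.0, 0.0]
p_hat = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
visits = [[1, 1], [1, 0]]
bonus = [[[0.1, 0.1], [0.1, 0.1]] for _ in range(H + 1)]
q_init = [[[float(H), float(H)], [float(H), float(H)]] for _ in range(H + 1)]
q_new = bellman_opt(q_init, p_hat, reward, terminal_reward, bonus, visits, H)
print(q_new[0])
```

In the toy instance, the unvisited pair keeps its initial value $H$, which propagates optimism backwards to the pairs leading to state 1.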

UCB-VI: Regret guarantees
Regret up to time $T = KH$: $R^{UCBVI}(T) = \sum_{k=1}^{K} \left( V^\star(x_{k,1}) - V^{\pi_k}(x_{k,1}) \right)$.
Theorem. For any $\delta > 0$, the regret of UCBVI-CH($\delta$) is bounded w.p. at least $1 - \delta$ by:
$R^{UCBVI-CH}(T) \le 20 H L \sqrt{SAT} + 250 H^2 S^2 A L^2$, with $L = \log(5HSAT/\delta)$.
For $T \ge H S^3 A$ and $SA \ge H$, the regret upper bound scales as $\tilde O(H \sqrt{SAT})$ (!?)
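The two terms of the UCBVI-CH bound can be compared numerically to see where the $\sqrt{T}$ term takes over. This is a rough sketch on a hypothetical instance ($H = 3$, $S = 4$, $A = 2$, $\delta = 0.05$); the function name is an assumption, and the $\tilde O$ statement above ignores the constants and logarithmic factors that this computation keeps.

```python
import math

def ucbvi_ch_bound_terms(H: int, S: int, A: int, T: int, delta: float):
    """Return (20 H L sqrt(SAT), 250 H^2 S^2 A L^2) with L = log(5 H S A T / delta)."""
    L = math.log(5 * H * S * A * T / delta)
    leading = 20 * H * L * math.sqrt(S * A * T)
    lower_order = 250 * H**2 * S**2 * A * L**2
    return leading, lower_order

H, S, A, delta = 3, 4, 2, 0.05  # hypothetical instance with SA >= H
for T in (H * S**3 * A, 10**6, 10**9, 10**12):
    leading, lower_order = ucbvi_ch_bound_terms(H, S, A, T, delta)
    print(f"T = {T:.0e}: sqrt(T) term = {leading:.3g}, second term = {lower_order:.3g}")
```

On this instance the second term is still the larger one at $T = HS^3A$, and the $\sqrt{T}$ term only dominates for much larger horizons.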

Sketch of proof
Notations:
- $\pi_k$ is the policy applied by UCBVI in the $k$-th episode
- $V_{k,h}$ is the optimistic value function computed by UCBVI in step $h$ of the $k$-th episode
- $V_h^\pi$ is the value function from step $h$ under $\pi$
- $P^\pi = (p(s'|s,\pi(s)))_{s,s'}$
- $\hat P_k^\pi = (\hat p_k(s'|s,\pi(s)))_{s,s'}$, where $\hat p_k$ is the estimated transition kernel in episode $k$
Claim 1: by construction, with high probability, $V_{k,h} \ge V_h^\star$. Then:
$R^{UCBVI}(T) \le \tilde R(T) := \sum_{k=1}^{K} \left( V_{k,1}(x_{k,1}) - V^{\pi_k}(x_{k,1}) \right)$

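Sketch of proof
Let $\Delta_{k,h} = V_{k,h} - V_h^{\pi_k}$, so that $\tilde R(T) = \sum_{k=1}^{K} \Delta_{k,1}(x_{k,1})$.
Backward induction on $h$ to bound $\Delta_{k,1}$: introduce $\delta_{k,h} = \Delta_{k,h}(x_{k,h})$, then
$\delta_{k,h} \le (\hat P_k^{\pi_k} - P^{\pi_k}) \Delta_{k,h+1}(x_{k,h}) + \delta_{k,h+1} + \epsilon_{k,h} + b_{k,h} + e_{k,h}$
where
$\epsilon_{k,h} = P^{\pi_k} \Delta_{k,h+1}(x_{k,h}) - \Delta_{k,h+1}(x_{k,h+1})$,
$e_{k,h} = (\hat P_k^{\pi_k} - P^{\pi_k}) V_{h+1}^\star(x_{k,h})$.
Ingredients: concentration + martingale (Azuma) + bounding the bonus.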

Numerical experiments
The river-swim example...

[Figure: regret vs. episode. 4 states, H = 2, δ = 0.05 (for UCBVI); ε-greedy with ε_t = min(1, 1000/t). Legend: UCBVI-CH, DP ε-greedy.]

[Figure: regret vs. episode. 4 states, H = 3, δ = 0.05 (for UCBVI); ε-greedy with ε_t = min(1, 1000/t). Legend: UCBVI-CH, DP ε-greedy.]

[Figure: optimistic Q-values, $Q^\star(s,a) - Q_{k,1}(s,a)$ vs. episode. 4 states, H = 3, δ = 0.05 (for UCBVI). One curve per (s, a) pair.]

[Figure: value function convergence under UCBVI, $V^\star(s) - V_k(s)$ vs. episode. 4 states, H = 3, δ = 0.05 (for UCBVI). One curve per state.]

Expected average reward MDP to ergodic RL problems
Stationary transition probabilities $p(s'|s,a)$ and rewards $r(s,a)$, uniformly bounded: $\forall a, s$, $|r(s,a)| \le 1$.
Objective: learn from data a policy $\pi^\star \in MD$ maximising (over all possible policies)
$g^\pi = V_1^\pi(s_0) := \liminf_{T \to \infty} \frac{1}{T} E_{s_0}\left[ \sum_{u=0}^{T-1} r(s_u^\pi, a_u^\pi) \right]$

Ergodic RL problems: Preliminaries
Optimal policy. Recall Bellman's equation:
$g^\star + h^\star(s) = \max_{a \in A} \left( r(s,a) + h^{\star\top} p(\cdot|s,a) \right), \quad \forall s$,
where $g^\star$ is the maximal gain, and $h^\star$ is the bias function ($h^\star$ is uniquely determined up to an additive constant).
Note: $g^\star$ does not depend on the initial state for communicating MDPs.
Let $a^\star(s)$ denote any optimal action for state $s$ (i.e., a maximizer in the above).
Define the gap for sub-optimal action $a$ at state $s$:
$\phi(s,a) := \left( r(s, a^\star(s)) - r(s,a) \right) + h^{\star\top} \left( p(\cdot|s, a^\star(s)) - p(\cdot|s,a) \right)$
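For a known small MDP, the gain, bias and gaps can be computed numerically. The sketch below uses relative value iteration, which is one standard way to solve Bellman's equation for the average reward and is a choice made here, not something prescribed by the lecture; the two-state MDP and all function names are hypothetical.

```python
def relative_value_iteration(r, p, ref=0, tol=1e-12, max_iter=100_000):
    """Solve g + h(s) = max_a [r(s,a) + sum_y p(y|s,a) h(y)], normalised by h(ref) = 0."""
    n = len(r)
    h = [0.0] * n
    for _ in range(max_iter):
        backup = [max(r[s][a] + sum(p[s][a][y] * h[y] for y in range(n))
                      for a in range(len(r[s]))) for s in range(n)]
        gain = backup[ref]
        new_h = [b - gain for b in backup]
        if max(abs(x - y) for x, y in zip(new_h, h)) < tol:
            return gain, new_h
        h = new_h
    raise RuntimeError("relative value iteration did not converge")

def gap_table(r, p, h):
    """phi(s,a) = max_b q(s,b) - q(s,a), with q(s,a) = r(s,a) + sum_y p(y|s,a) h(y)."""
    n = len(r)
    q = [[r[s][a] + sum(p[s][a][y] * h[y] for y in range(n))
          for a in range(len(r[s]))] for s in range(n)]
    return [[max(row) - v for v in row] for row in q]

# Hypothetical 2-state, 2-action MDP: r[s][a] and p[s][a][y].
r = [[0.2, 0.0], [1.0, 0.5]]
p = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
gain, bias = relative_value_iteration(r, p)
phi = gap_table(r, p, bias)
min_gap = min(g for row in phi for g in row if g > 1e-9)
print(f"g* = {gain:.4f}, h* = {bias}, Phi = {min_gap:.4f}")
```

On this example the optimal policy moves from state 0 to state 1 and then plays the first action there, which gives $g^\star = 2/3$ and a smallest positive gap of $7/15$.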

Ergodic RL problems: Preliminaries
Diameter $D$: defined as
$D := \max_{s \ne s'} \min_\pi E[T^\pi_{s,s'}]$,
where $T^\pi_{s,s'}$ denotes the first time step in which $s'$ is reached under $\pi$ starting from initial state $s$.
Remark: all communicating MDPs have a finite diameter.
Important parameters impacting performance:
- Diameter $D$
- Gap $\Phi := \min_{s,\, a \ne a^\star(s)} \phi(s,a)$
- Gap $\Delta := \min_{\pi :\, g^\pi < g^\star} (g^\star - g^\pi)$
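The diameter of a small, known MDP can be computed by solving one shortest-path problem per target state. The following is a sketch under the assumption that the MDP is communicating; the iteration scheme (value iteration on expected hitting times), the function names and the toy MDP are illustrative choices, not taken from the lecture.

```python
def min_expected_hitting_times(p, target, tol=1e-10, max_iter=100_000):
    """tau(s) = min over policies of the expected time to reach `target` from s.

    Fixed point of tau(target) = 0, tau(s) = 1 + min_a sum_y p[s][a][y] * tau(y).
    """
    n = len(p)
    tau = [0.0] * n
    for _ in range(max_iter):
        new_tau = [0.0 if s == target else
                   1.0 + min(sum(p[s][a][y] * tau[y] for y in range(n))
                             for a in range(len(p[s])))
                   for s in range(n)]
        if max(abs(x - y) for x, y in zip(new_tau, tau)) < tol:
            return new_tau
        tau = new_tau
    raise RuntimeError("hitting times did not converge (is the MDP communicating?)")

def diameter(p):
    """D = max over ordered pairs s != s' of the minimal expected hitting time."""
    n = len(p)
    return max(min_expected_hitting_times(p, target)[s]
               for target in range(n) for s in range(n) if s != target)

# Hypothetical MDP: state 0 has two actions (stay, or move w.p. 0.5),
# state 1 has one action (move back w.p. 0.25).
p = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.25, 0.75]],
]
print(diameter(p))
```

Here reaching state 1 from state 0 takes 2 steps on average with the best action, and returning takes 4, so the diameter is 4.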

Ergodic RL problems: Regret lower bounds
Problem-specific regret lower bound (Burnetas-Katehakis): for any algorithm $\pi$,
$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge c_{bk} := \sum_{s,a} \frac{\phi(s,a)}{\inf\{ KL(p(\cdot|s,a), q) : q \in \Theta_{s,a} \}}$,
where $\Theta_{s,a}$ is the set of distributions $q$ s.t. replacing (only) $p(\cdot|s,a)$ by $q$ makes $a$ the unique optimal action in state $s$.
- asymptotic (valid as $T \to \infty$)
- valid for any ergodic MDP
- scales as $\Omega\left(\frac{DSA}{\Phi} \log(T)\right)$ for specific MDPs
Minimax regret lower bound: $\Omega(\sqrt{DSAT})$
- non-asymptotic (valid for all $T \ge DSA$)
- derived for a specific family of hard-to-learn communicating MDPs

Ergodic RL problems: State-of-the-art
Two types of algorithms targeting different regret guarantees:
Problem-specific guarantees
- MDP-specific regret bound scaling as $O(\log(T))$
- Algorithms: B-K (Burnetas & Katehakis, 1997), OLP (Tewari & Bartlett, 2007), UCRL2 (Jaksch et al. 2009), KL-UCRL (Filippi et al. 2010)
Minimax guarantees
- Valid for a class of MDPs with $S$ states and $A$ actions, and (typically) diameter $D$
- Scaling as $\Omega(\sqrt{T})$
- Algorithms: UCRL2 (Jaksch et al. 2009), KL-UCRL (Filippi et al. 2010), REGAL (Bartlett & Tewari, 2009), A-J (Agrawal & Jia, 2010)

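Ergodic RL problems: State-of-the-art

Problem-specific guarantees:
| Algorithm | Setup | Regret |
| B-K | ergodic MDPs, known rewards | $O(c_{bk} \log(T))$, asympt. |
| OLP | ergodic MDPs, known rewards | $O\left(\frac{D^2 SA}{\Phi} \log(T)\right)$, asympt. |
| UCRL | unichain MDPs | $O\left(\frac{S^5 A}{\Delta^2} \log(T)\right)$ |
| UCRL2, KL-UCRL | communicating MDPs | $O\left(\frac{D^2 S^2 A}{\Delta} \log(T)\right)$ |
| Lower bound | ergodic MDPs, known rewards | $\Omega(c_{bk} \log(T))$, $\Omega\left(\frac{DSA}{\Phi} \log(T)\right)$ |

Minimax guarantees:
| Algorithm | Setup | Regret |
| UCRL2 | communicating MDPs | $\tilde O(DS\sqrt{AT})$ |
| KL-UCRL | communicating MDPs | $\tilde O(DS\sqrt{AT})$ |
| REGAL | weakly comm. MDPs, known rewards | $\tilde O(BS\sqrt{AT})$ |
| A-J | communicating MDPs, known rewards | $\tilde O(D\sqrt{SAT})$, $T \ge S^5 A$ |
| Lower bound | known rewards | $\Omega(\sqrt{DSAT})$, $T \ge DSA$ |

*$B$ denotes the span of the bias function of the true MDP, and $B \le D$.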

UCRL2
UCRL2 is an optimistic algorithm that works in episodes of increasing lengths.
- At the beginning of each episode $k$, it maintains a set of plausible MDPs $M_k$ (which contains the true MDP w.h.p.).
- It then computes an optimal policy $\pi_k$, which has the largest gain over all MDPs in $M_k$ ($\pi_k \in \arg\max_{M \in M_k, \pi} g^\pi(M)$).
  - For computational efficiency, UCRL2 computes a $\frac{1}{\sqrt{t_k}}$-optimal policy, where $t_k$ is the starting step of episode $k$
  - To find a near-optimal policy, UCRL2 uses Extended Value Iteration
- It then follows $\pi_k$ within episode $k$ until the number of visits to some pair $(s,a)$ has doubled (and so, a new episode starts).

UCRL2
Notations:
- $k \in \mathbb{N}$: index of an episode
- $N_k(s,a)$: total number of visits to the pair $(s,a)$ before episode $k$
- $\hat p_k(\cdot|s,a)$: empirical transition probabilities of $(s,a)$, built from observations up to episode $k$
- $\hat r_k(s,a)$: empirical mean reward of $(s,a)$, built from observations up to episode $k$
- $\pi_k$: policy followed in episode $k$
- $M_k$: set of models for episode $k$ (defined next)
- $\nu_k(s,a)$: number of visits to the pair $(s,a)$ seen so far in episode $k$

UCRL2: Main ingredients
The set of plausible MDPs $M_k$: for confidence parameter $\delta$, define
$M_k = \Big\{ M = (S, A, r, p) : \forall (s,a),\ |r(s,a) - \hat r_k(s,a)| \le \sqrt{\frac{3.5 \log(2SAt_k/\delta)}{N_k(s,a)^+}},\ \|p(\cdot|s,a) - \hat p_k(\cdot|s,a)\|_1 \le \sqrt{\frac{14 S \log(2At_k/\delta)}{N_k(s,a)^+}} \Big\}$,
where $x^+ := \max(1, x)$.
Optimistic gain: find in $M_k$ the MDP that leads to the highest gain. We need to solve, for episode $k$:
maximise over $(M, \pi)$: $g^\pi(M)$, subject to $M \in M_k$
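The two confidence radii are cheap to evaluate, and doing so shows how slowly the transition ball shrinks. This is a minimal sketch assuming the radii $\sqrt{3.5 \log(2SAt/\delta)/N^+}$ and $\sqrt{14 S \log(2At/\delta)/N^+}$ with $N^+ = \max(1, N)$; the function and parameter names are hypothetical.

```python
import math

def ucrl2_radii(n_visits: int, n_states: int, n_actions: int, t: int, delta: float):
    """Return (reward radius, L1 transition radius) for a pair visited n_visits times."""
    n_plus = max(1, n_visits)
    d_reward = math.sqrt(3.5 * math.log(2 * n_states * n_actions * t / delta) / n_plus)
    d_transition = math.sqrt(14 * n_states * math.log(2 * n_actions * t / delta) / n_plus)
    return d_reward, d_transition

# Hypothetical instance: 6 states, 2 actions, time step t = 10_000, delta = 0.05.
for n in (0, 10, 1_000, 100_000):
    d_r, d_p = ucrl2_radii(n, 6, 2, 10_000, 0.05)
    print(f"N = {n}: reward radius = {d_r:.3f}, transition radius = {d_p:.3f}")
```

Since two distributions are at $L_1$ distance at most 2, the transition constraint only becomes informative once $N_k(s,a) \ge 3.5\, S \log(2At/\delta)$.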

UCRL2 pseudo-code
Algorithm. UCRL2
Input: initial state $s_0$, precision $\delta$, $t = 1$
For each episode $k \ge 1$:
1. Initialisation. $t_k = t$ (start time of the episode). Update $N_k(s,a)$, $\hat r_k(s,a)$, and $\hat p_k(\cdot|s,a)$ for all $(s,a)$
2. Compute the set of possible MDPs $M_k$ (using $\delta$)
3. Compute the policy $\pi_k \leftarrow$ ExtendedValueIteration($M_k$, $1/\sqrt{t_k}$)
4. Execute $\pi_k$ and end the episode:
   While $\nu_k(s_t, \pi_k(s_t)) < \max(1, N_k(s_t, \pi_k(s_t)))$:
   - Play $\pi_k(s_t)$, observe the reward and the next state
   - Update $\nu_k(s_t, \pi_k(s_t)) \leftarrow \nu_k(s_t, \pi_k(s_t)) + 1$ and $t \leftarrow t + 1$
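The stopping rule in step 4 is what makes episodes grow in length. The toy simulation below isolates that rule by assuming a degenerate problem with a single (state, action) pair, so that the same pair is played at every step; this simplification and the function name are assumptions for illustration only.

```python
def episode_lengths_single_pair(total_steps: int) -> list:
    """Episode lengths produced by the UCRL2 stopping rule when only one
    (state, action) pair exists."""
    visits_before = 0  # N_k(s, a): visits before the current episode
    lengths = []
    t = 0
    while t < total_steps:
        visits_in_episode = 0  # nu_k(s, a)
        # The episode continues while nu_k(s, a) < max(1, N_k(s, a)).
        while visits_in_episode < max(1, visits_before) and t < total_steps:
            visits_in_episode += 1
            t += 1
        lengths.append(visits_in_episode)
        visits_before += visits_in_episode
    return lengths

print(episode_lengths_single_pair(16))
```

After the first two episodes of length one, each episode doubles the visit count, so in this toy case the number of episodes grows only logarithmically with the number of steps.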

Extended value iteration
Set of plausible MDPs $M_k$:
$M_k = \{ M = (S, A, r, p) : \forall (s,a),\ |r(s,a) - \hat r_k(s,a)| \le d(s,a),\ \|p(\cdot|s,a) - \hat p_k(\cdot|s,a)\|_1 \le d'(s,a) \}$
We wish to find $M \in M_k$ and a policy $\pi_k$ maximising $g^\pi(M)$ over all possible $M \in M_k$ and policies $\pi$.
Ideas:
a. we can fix the reward to its maximum: $r(s,a) = \hat r_k(s,a) + d(s,a)$
b. solve a large MDP whose set of actions is $A'_s$, where $(a, q) \in A'_s$ if and only if $q \in P_k(s,a)$, with:
$P_k(s,a) = \{ q : \|q(\cdot) - \hat p_k(\cdot|s,a)\|_1 \le d'(s,a) \}$

Extended value iteration
Solution: apply one of the known algorithms for finding an optimal policy in MDPs, e.g., the value iteration algorithm.
Extended Value Iteration: for all $s \in S$, starting from $u_0(s) = 0$:
$u_{i+1}(s) = \max_{a \in A} \left\{ r(s,a) + \max_{q \in P_k(s,a)} u_i^\top q \right\}$
- $P_k(s,a)$ is a polytope, and the inner maximisation can be done in $O(S)$ operations (once the states are sorted by $u_i$).
- To obtain an $\varepsilon$-optimal policy, the update is stopped when
  $\max_s (u_{i+1}(s) - u_i(s)) - \min_s (u_{i+1}(s) - u_i(s)) \le \varepsilon$
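The update above can be sketched as follows. The inner maximisation moves up to half of the $L_1$ radius of probability mass onto the state with the largest $u_i$ and removes the same amount from the states with the smallest $u_i$, which is the greedy procedure described in the UCRL2 paper. The function names, the argument layout and the toy data are assumptions of this sketch, and `r_opt` stands for the already optimistic reward $\hat r_k + d$.

```python
def optimistic_transition(p_hat, radius, u):
    """argmax of sum_y q(y) u(y) over distributions q with ||q - p_hat||_1 <= radius."""
    order = sorted(range(len(u)), key=lambda s: u[s], reverse=True)
    q = list(p_hat)
    best = order[0]
    q[best] = min(1.0, p_hat[best] + radius / 2.0)
    excess = sum(q) - 1.0
    for s in reversed(order):  # take mass away from the lowest-value states first
        if excess <= 1e-15:
            break
        if s == best:
            continue
        removed = min(q[s], excess)
        q[s] -= removed
        excess -= removed
    return q

def extended_value_iteration(r_opt, p_hat, radius, eps, max_iter=10_000):
    """Iterate u <- max_a { r_opt + max_q u.q } until the span of the increments is <= eps."""
    n = len(r_opt)
    u = [0.0] * n
    for _ in range(max_iter):
        new_u, policy = [], []
        for s in range(n):
            values = [r_opt[s][a] + sum(qy * uy for qy, uy in
                                        zip(optimistic_transition(p_hat[s][a], radius[s][a], u), u))
                      for a in range(len(r_opt[s]))]
            best_a = max(range(len(values)), key=values.__getitem__)
            policy.append(best_a)
            new_u.append(values[best_a])
        increments = [x - y for x, y in zip(new_u, u)]
        u = new_u
        if max(increments) - min(increments) <= eps:
            return policy, u
    raise RuntimeError("extended value iteration did not converge")

# Inner maximisation on hypothetical numbers.
q_opt = optimistic_transition([0.5, 0.3, 0.2], 0.4, [1.0, 3.0, 2.0])
# Sanity check with zero radii: reduces to plain value iteration on a toy MDP.
r_opt = [[0.2, 0.0], [1.0, 0.5]]
p_hat = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
zero_radius = [[0.0, 0.0], [0.0, 0.0]]
policy, u = extended_value_iteration(r_opt, p_hat, zero_radius, eps=1e-6)
print(q_opt, policy)
```

Sorting dominates the cost of the inner step here; in a tuned implementation the order would be computed once per iteration and reused for every pair $(s,a)$.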

UCRL2: Regret guarantees
Let $\pi$ = UCRL2. Regret up to time $T$:
$R^\pi(T) = T g^\star - \sum_{t=1}^{T} r(s_t^\pi, a_t^\pi)$,
a random variable capturing the learning cost and the mixing time problems.
Theorem. W.p. at least $1 - \delta$, the regret of UCRL2 satisfies, for any initial state and any $T > 1$,
$R^\pi(T) \le 34 D S \sqrt{A T \log(T/\delta)}$.
For any initial state, any $T \ge 1$, and any $\epsilon > 0$, we have w.p. at least $1 - 3\delta$,
$R^\pi(T) \le 34^2 \frac{D^2 S^2 A \log(T/\delta)}{\epsilon} + \epsilon T$.

[Figure: regret vs. time. 6 states, δ = 0.05 (for UCRL2); ε-greedy with ε_t = min(1, 1000/t). Legend: UCRL2, KL-UCRL, ε-greedy.]

[Figure: regret vs. time. 12 states, δ = 0.05 (for UCRL2). Legend: UCRL2, KL-UCRL.]

References
Episodic RL
- UCBVI algorithm: M. Gheshlaghi Azar, I. Osband, and R. Munos, "Minimax regret bounds for reinforcement learning", Proc. ICML.
Ergodic RL
- UCRL algorithm: P. Auer and R. Ortner, "Logarithmic online regret bounds for undiscounted reinforcement learning", Proc. NIPS.
- UCRL2 algorithm and minimax LB: P. Auer, T. Jaksch, and R. Ortner, "Near-optimal regret bounds for reinforcement learning", J. Machine Learning Research.

