Reinforcement Learning Lecture 6: RL algorithms 2.0 Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology
Objectives of this lecture
Present and analyse two online algorithms based on the "optimism in front of uncertainty" principle, and compare their regret to that of algorithms with random exploration:
- UCB-VI for episodic RL problems
- UCRL2 for ergodic RL problems
Lecture 6: Outline
1. Minimal exploration in RL
2. UCB-VI
3. UCRL2
Towards minimal exploration
The MDP model is unknown and has to be learnt. Solutions for on-policy algorithms:
1. Estimate the model, then optimise: poor regret and premature exploitation
2. ε-greedy exploration: undirected exploration (explores too many (state, action) pairs with low values)
3. Bandit-like optimal exploration-exploitation trade-off
But how much should a (state, action) pair be explored?
Regret lower bounds
In the case of ergodic RL problems:
- Problem-specific lower bound (Burnetas-Katehakis 1997):
$$\liminf_{T\to\infty} \frac{\mathbb{E}[N_{(s,a)}(T)]}{\log(T)} \ge \frac{1}{K_M(s,a)},$$
leading to an asymptotic regret lower bound scaling as $SA\log(T)$.
- Minimax lower bound: $\Theta(\sqrt{SAT})$.
We don't know from which horizon the asymptotic problem-specific regret lower bound is representative; often only for very large $T$!
Read for bandit optimisation: Explore First, Exploit Next: The True Shape of Regret in Bandit Problems, Garivier et al., https://arxiv.org/abs/1602.07182
Which regret lower bound should we target?
Example: $SA = 1000$, comparison of $\sqrt{SAT}$ and $SA\log(T)$.
Which regret lower bound should we target?
Boundary: $SA = T/\log(T)^2$.
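To make the comparison of the two rates concrete, here is a minimal sketch that evaluates both and locates the boundary $T = SA\log(T)^2$ numerically. The function names are illustrative (not from the lecture), all constants in the bounds are ignored, and the bisection assumes $SA > e$.

```python
import math

def minimax_rate(sa: float, t: float) -> float:
    """Minimax lower-bound rate sqrt(SA * T), constants ignored."""
    return math.sqrt(sa * t)

def problem_specific_rate(sa: float, t: float) -> float:
    """Asymptotic problem-specific rate SA * log(T), constants ignored."""
    return sa * math.log(t)

def crossover_horizon(sa: float, tol: float = 1e-9) -> float:
    """Solve T = SA * log(T)^2 for the root above T = SA (assumes SA > e)."""
    g = lambda t: t - sa * math.log(t) ** 2
    lo, hi = sa, 2.0 * sa
    while g(hi) <= 0:            # grow the bracket until g changes sign
        hi *= 2.0
    while hi - lo > tol * hi:    # plain bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if g(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

T_star = crossover_horizon(1000.0)
print(f"SA = 1000: the two rates cross at T = {T_star:.0f}")
```

For $SA = 1000$ the crossing sits at roughly $1.4 \times 10^5$: for horizons between $SA$ and this value, $SA\log(T)$ is the larger of the two rates, and beyond it $\sqrt{SAT}$ dominates.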
Optimism in front of uncertainty
Estimate the unknown system parameters (here $p(\cdot|\cdot,\cdot)$ and $r(\cdot,\cdot)$) and build an optimistic reward estimate to trigger exploration.
- Estimate: find confidence balls containing the true model w.h.p.
- Optimistic reward estimate: find the model within the confidence balls leading to the highest value.
Optimism in front of uncertainty: generic algorithm
Algorithm. (for infinite-horizon RL problems)
Initialise $\hat p$, $\hat r$, and $N(s,a)$
For $t = 1, 2, \ldots$
1. Build an optimistic reward model $(Q(s,a))_{s,a}$ from $\hat p$, $\hat r$, and $N(s,a)$
2. Select action $a(t)$ maximising $Q(s(t), a)$ over $a \in A_{s(t)}$
3. Observe the transition to $s(t+1)$ and collect reward $r(s(t), a(t))$
4. Update $\hat p$, $\hat r$, and $N(s,a)$
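The four steps of the generic algorithm can be sketched as follows. This is only an illustration under loud assumptions: the lecture does not specify how $Q$ is built at this point, so the sketch uses discounted value iteration on the empirical model with a placeholder bonus $c/\sqrt{\max(1, N(s,a))}$; the discount factor, the constant `c`, and all function names are hypothetical.

```python
import numpy as np

def optimistic_q(p_hat, r_hat, n, gamma=0.95, c=1.0, iters=200):
    """Value iteration on the empirical model plus a count-based bonus.

    ASSUMPTION (not from the lecture): discounting with factor gamma and
    the bonus c / sqrt(max(1, N(s, a))) are placeholders for illustration.
    """
    bonus = c / np.sqrt(np.maximum(1, n))
    q = np.zeros_like(r_hat)
    for _ in range(iters):
        v = q.max(axis=1)
        q = r_hat + bonus + gamma * (p_hat @ v)   # (S, A, S) @ (S,) -> (S, A)
    return q

def run_optimistic_agent(p, r, horizon, seed=0):
    """Simulate the generic loop on a known MDP (p, r) used as the environment."""
    rng = np.random.default_rng(seed)
    S, A = r.shape
    n = np.zeros((S, A))            # visit counts N(s, a)
    counts = np.zeros((S, A, S))    # transition counts
    r_sum = np.zeros((S, A))        # cumulated rewards
    s = 0
    for _ in range(horizon):
        n_safe = np.maximum(1, n)
        # Empirical model; unvisited pairs get a uniform transition estimate.
        p_hat = np.where(n[..., None] > 0, counts / n_safe[..., None], 1.0 / S)
        r_hat = r_sum / n_safe
        q = optimistic_q(p_hat, r_hat, n)         # step 1
        a = int(np.argmax(q[s]))                  # step 2
        s_next = int(rng.choice(S, p=p[s, a]))    # step 3
        counts[s, a, s_next] += 1                 # step 4
        r_sum[s, a] += r[s, a]
        n[s, a] += 1
        s = s_next
    return n
```

Recomputing $Q$ at every step is wasteful but keeps the sketch close to the slide; UCB-VI and UCRL2 both recompute it only at the start of an episode.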
Examples
- UCB-VI: directly build a confidence ball for the Q function based on the empirical estimates of the model.
- UCRL2: first build confidence balls for the reward and transition probabilities, and then identify Q.
Finite-horizon MDP to episodic RL problems
- Initial state $s_0$ (could be a r.v.)
- Transition probabilities at time $t$: $p(s'|s,a)$
- Reward at time $t$: $r(s,a)$, and at time $H$: $r_H(s)$
- Unknown transition probabilities and reward function
- Objective: quickly learn a policy $\pi^\star$ maximising over $\pi_0 \in MD$
$$V_H^{\pi_0} := \mathbb{E}\Big[\sum_{u=0}^{H-1} r(s_u^{\pi_0}, a_u^{\pi_0}) + r_H(s_H^{\pi_0})\Big].$$
Finite-horizon MDP to episodic RL problems
- Data: $K$ episodes of length $H$ (actions, states, rewards)
- Learning algorithm $\pi$: data $\mapsto \pi_K \in MD$
- Performance of $\pi$: how close $\pi_K$ is to the optimal policy $\pi^\star$
UCB-VI
UCBVI is an extension of Value Iteration, guaranteeing that the resulting value function is a (high-probability) upper confidence bound (UCB) on the optimal value function.
At the beginning of episode $k$, it computes state-action values using the empirical transition kernel and reward function.
In step $h$ of backward induction (to update $Q_{k,h}(s,a)$ for any $(s,a)$), it adds a bonus $b_{k,h}(s,a)$ to the value, and ensures that $Q_{k,h}$ never exceeds $Q_{k-1,h}$.
Two variants of UCBVI, depending on the choice of bonus $b_{k,h}$:
- UCBVI-CH
- UCBVI-BF
UCB-VI algorithm
Variables to be maintained by the algorithm (for a known reward function):
- $\hat p = (\hat p(s'|s,a),\ s, s' \in S,\ a \in A_s)$: estimated transition probabilities
- $Q = (Q_h(s,a),\ h \le H,\ s \in S,\ a \in A_s)$: estimated Q-function
- $b = (b_h(s,a),\ h \le H,\ s \in S,\ a \in A_s)$: Q-value bonus
- $N = (N(s,a),\ s \in S,\ a \in A_s)$: number of visits to $(s,a)$ so far
- $N' = (N'_h(s,a),\ h \le H,\ s \in S,\ a \in A_s)$: number of visits to $(s,a)$ in the $h$-th step of episodes so far
UCB-VI algorithm
Algorithm. UCB-VI
Input: initial state distribution $\nu_0$, precision $\delta$
Initialise the variables $\hat p$, $N$, and $N'$
For episode $k = 1, 2, \ldots$
1. Optimistic reward:
   a. Compute the bonus: $b \leftarrow$ bonus$(N, N', \hat p, Q, \delta)$
   b. Estimate the Q-function: $Q \leftarrow$ bellmanopt$(Q, b, \hat p)$
2. Initialise the state $s(0) \sim \nu_0$
3. For $h = 1, \ldots, H$, select action $a \in \arg\max_{a' \in A_{s(h-1)}} Q_h(s(h-1), a')$
4. Observe the transition and update $\hat p$, $N$, and $N'$
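UCB-VI algorithm: bonus
UCBVI-CH:
$$b_h(s,a) = \frac{7H}{\sqrt{N(s,a)}} \log(5SAT/\delta)$$
UCBVI-BF:
$$b_h(s,a) = \sqrt{\frac{8L\,\mathrm{Var}_{Y \sim \hat p(\cdot|s,a)}(V_{h+1}(Y))}{N(s,a)}} + \frac{14HL}{3N(s,a)} + \sqrt{\frac{8\sum_y \hat p(y|s,a)\,\min\Big\{\frac{10^4 H^3 S^2 A L^2}{N'_{h+1}(y)},\ H^2\Big\}}{N(s,a)}}$$
where $L = \log(5SAT/\delta)$.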
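As a small illustration, the UCBVI-CH bonus, read as $b_h(s,a) = 7HL/\sqrt{N(s,a)}$ with $L = \log(5SAT/\delta)$, can be computed for all pairs at once. The function name and the use of $\max(1, N)$ to avoid division by zero for unvisited pairs are assumptions of this sketch; the algorithm itself only updates pairs visited at least once.

```python
import math
import numpy as np

def ucbvi_ch_bonus(n_visits, H, S, A, T, delta):
    """Chernoff-Hoeffding style bonus 7 * H * L / sqrt(N(s, a)).

    n_visits: array of visit counts N(s, a); unvisited pairs are treated
    as if N = 1 (an assumption of this sketch).
    """
    L = math.log(5 * S * A * T / delta)
    return 7.0 * H * L / np.sqrt(np.maximum(1, n_visits))

# Example: the bonus halves each time the visit count is multiplied by 4.
b = ucbvi_ch_bonus(np.array([1, 4, 16]), H=3, S=4, A=2, T=3000, delta=0.05)
```

The bonus does not depend on $h$ in this variant; only the Bernstein-type variant uses step-dependent quantities.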
UCB-VI algorithm: Optimistic Bellman operator
bellmanopt$(Q, b, \hat p)$ applies Dynamic Programming with a bonus.
Initialisation: $Q_H(s,a) = r_H(s)$ for all $(s,a)$
For step $h = H-1, \ldots, 1$: for all $(s,a)$ visited at least once so far:
$$Q_h(s,a) \leftarrow \min\Big( Q_h(s,a),\ H,\ r(s,a) + \sum_y \hat p(y|s,a) V_{h+1}(y) + b_h(s,a) \Big)$$
where $V_{h+1}(y) = \max_a Q_{h+1}(y,a)$.
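A minimal NumPy sketch of this optimistic backward induction is given below. It assumes 0-indexed decision steps $h = 0, \ldots, H-1$ followed by a terminal reward (a re-indexing of the slide's convention), a known reward array `r`, and Q-values initialised to $H$ before the first episode; the function and argument names are illustrative.

```python
import numpy as np

def bellman_opt(q_prev, bonus, p_hat, r, r_final, n_visits, H):
    """One optimistic backward induction.

    q_prev   : (H, S, A) Q-values from the previous episode (initialised to H)
    bonus    : (H, S, A) bonuses b_h(s, a)
    p_hat    : (S, A, S) empirical transition probabilities
    r        : (S, A) known rewards; r_final : (S,) terminal rewards
    n_visits : (S, A) visit counts; unvisited pairs keep their previous value
    """
    q = q_prev.copy()
    v_next = np.asarray(r_final, dtype=float)    # value after the last decision step
    for h in range(H - 1, -1, -1):
        target = r + p_hat @ v_next + bonus[h]   # empirical backup plus bonus
        update = np.minimum(np.minimum(q[h], H), target)
        q[h] = np.where(n_visits > 0, update, q[h])
        v_next = q[h].max(axis=1)                # V_h(s) = max_a Q_h(s, a)
    return q
```

Taking the minimum with the previous Q-values is what makes the sequence $Q_{k,h}$ non-increasing in $k$.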
UCB-VI: Regret guarantees
Regret up to time $T = KH$:
$$R^{UCBVI}(T) = \sum_{k=1}^{K} \big( V^\star(x_{k,1}) - V^{\pi_k}(x_{k,1}) \big).$$
Theorem. For any $\delta > 0$, the regret of UCBVI-CH($\delta$) is bounded w.p. at least $1-\delta$ by:
$$R^{UCBVI\text{-}CH}(T) \le 20 H L \sqrt{SAT} + 250 H^2 S^2 A L^2,$$
with $L = \log(5HSAT/\delta)$.
For $T \ge HS^3A$ and $SA \ge H$, the regret upper bound scales as $\tilde{O}(H\sqrt{SAT})$ (!?)
Sketch of proof
Notations:
- $\pi_k$ is the policy applied by UCBVI in the $k$-th episode
- $V_{k,h}$ is the optimistic value function computed by UCBVI in the $h$-th step of the $k$-th episode
- $V_h^{\pi}$ is the value function from step $h$ under $\pi$
- $P^{\pi} = (p(s'|s,\pi(s)))_{s,s'}$
- $\hat P_k^{\pi} = (\hat p_k(s'|s,\pi(s)))_{s,s'}$ where $\hat p_k$ is the estimated transition kernel in episode $k$
Claim 1: by construction, with high probability, $V_{k,h} \ge V_h^\star$. Then:
$$R^{UCBVI}(T) \le \tilde R(T) := \sum_{k=1}^{K} \big( V_{k,1}(x_{k,1}) - V^{\pi_k}(x_{k,1}) \big)$$
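Sketch of proof
Let $\Delta_{k,h} = V_{k,h} - V_h^{\pi_k}$, so that $\tilde R(T) = \sum_{k=1}^{K} \Delta_{k,1}(x_{k,1})$.
Backward induction on $h$ to bound $\Delta_{k,1}$: introduce $\delta_{k,h} = \Delta_{k,h}(x_{k,h})$, then
$$\delta_{k,h} \le (\hat P_k^{\pi_k} - P^{\pi_k}) \Delta_{k,h+1}(x_{k,h}) + \delta_{k,h+1} + \epsilon_{k,h} + b_{k,h} + e_{k,h}$$
where
$$\epsilon_{k,h} = P^{\pi_k} \Delta_{k,h+1}(x_{k,h}) - \Delta_{k,h+1}(x_{k,h+1}), \qquad e_{k,h} = (\hat P_k^{\pi_k} - P^{\pi_k}) V_{h+1}^\star(x_{k,h}).$$
Concentration + Martingale (Azuma) + bounding the bonus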
Numerical experiments
The river-swim example...
Regret
[Figure: regret (log scale) versus episode ($\times 10^5$); legend: UCBVI-CH, DP ε-greedy. Setting: 4 states, $H = 2$, $\delta = 0.05$ (for UCBVI), ε-greedy with $\epsilon_t = \min(1, 1000/t)$.]
Regret
[Figure: regret (log scale) versus episode ($\times 10^5$); legend: UCBVI-CH, DP ε-greedy. Setting: 4 states, $H = 3$, $\delta = 0.05$ (for UCBVI), ε-greedy with $\epsilon_t = \min(1, 1000/t)$.]
Optimistic Q-values
[Figure: $Q^\star(s,a) - Q_{k,1}(s,a)$ versus episode ($\times 10^6$), one curve per pair $(s,a)$ with $s \in \{1,2,3,4\}$ and $a \in \{1,2\}$. Setting: 4 states, $H = 3$, $\delta = 0.05$ (for UCBVI).]
Value function convergence under UCBVI
[Figure: $V^\star(s) - V_k(s)$ versus episode ($\times 10^6$), one curve per state $s \in \{1,2,3,4\}$. Setting: 4 states, $H = 3$, $\delta = 0.05$ (for UCBVI).]
Expected average reward MDP to ergodic RL problems
- Stationary transition probabilities $p(s'|s,a)$ and rewards $r(s,a)$, uniformly bounded: $\forall a, s$, $|r(s,a)| \le 1$
- Objective: learn from data a policy $\pi^\star \in MD$ maximising (over all possible policies)
$$g^{\pi} = V_1^{\pi}(s_0) := \liminf_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{s_0}\Big[\sum_{u=0}^{T-1} r(s_u^{\pi}, a_u^{\pi})\Big]$$
Ergodic RL problems: Preliminaries
Optimal policy
Recall Bellman's equation
$$g^\star + h^\star(s) = \max_{a \in A} \big( r(s,a) + h^{\star\top} p(\cdot|s,a) \big), \quad \forall s,$$
where $g^\star$ is the maximal gain, and $h^\star$ is the bias function ($h^\star$ is uniquely determined up to an additive constant).
Note: $g^\star$ does not depend on the initial state for communicating MDPs.
Let $a^\star(s)$ denote any optimal action for state $s$ (i.e., a maximiser in the above). Define the gap for sub-optimal action $a$ at state $s$:
$$\phi(s,a) := \big( r(s,a^\star(s)) - r(s,a) \big) + h^{\star\top} \big( p(\cdot|s,a^\star(s)) - p(\cdot|s,a) \big)$$
Ergodic RL problems: Preliminaries
Diameter $D$: defined as
$$D := \max_{s \ne s'} \min_{\pi} \mathbb{E}[T^{\pi}_{s,s'}]$$
where $T^{\pi}_{s,s'}$ denotes the first time step in which $s'$ is reached under $\pi$ starting from initial state $s$.
Remark: all communicating MDPs have a finite diameter.
Important parameters impacting performance:
- Diameter $D$
- Gap $\Phi := \min_{s,\, a \ne a^\star(s)} \phi(s,a)$
- Gap $\Delta := \min_{\pi} (g^\star - g^{\pi})$ (minimum over sub-optimal policies $\pi$)
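For a known, small MDP the diameter can be computed numerically: for each target state, the minimal expected hitting time solves a stochastic shortest-path problem, which value iteration handles. The sketch below assumes a communicating MDP given as an array `p` of shape (S, A, S); the function name and tolerances are illustrative.

```python
import numpy as np

def diameter(p, tol=1e-10, max_iter=100_000):
    """D = max over s != s' of the minimal expected time to reach s' from s."""
    S = p.shape[0]
    d = 0.0
    for target in range(S):
        t = np.zeros(S)                           # t[s]: current estimate of the hitting time
        for _ in range(max_iter):
            t_new = 1.0 + (p @ t).min(axis=1)     # one step, then the best continuation
            t_new[target] = 0.0                   # the target is reached at time 0
            converged = np.max(np.abs(t_new - t)) < tol
            t = t_new
            if converged:
                break
        d = max(d, float(t.max()))
    return d

# Deterministic 3-cycle with a single action: the farthest state is 2 steps away.
cycle = np.zeros((3, 1, 3))
for s in range(3):
    cycle[s, 0, (s + 1) % 3] = 1.0
print(diameter(cycle))
```

Starting the iteration from zero makes the estimates increase monotonically towards the true hitting times, so stopping early can only under-estimate $D$.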
Ergodic RL problems: Regret lower bounds
Problem-specific regret lower bound (Burnetas-Katehakis): for any algorithm $\pi$,
$$\liminf_{T\to\infty} \frac{R^{\pi}(T)}{\log(T)} \ge c_{bk} := \sum_{s,a} \frac{\phi(s,a)}{\inf\{ KL(p(\cdot|s,a), q) : q \in \Theta_{s,a} \}}$$
where $\Theta_{s,a}$ is the set of distributions $q$ s.t. replacing (only) $p(\cdot|s,a)$ by $q$ makes $a$ the unique optimal action in state $s$.
- asymptotic (valid as $T \to \infty$)
- valid for any ergodic MDP
- scales as $\Omega\big(\frac{DSA}{\Phi} \log(T)\big)$ for specific MDPs
Minimax regret lower bound: $\Omega(\sqrt{DSAT})$
- non-asymptotic (valid for all $T \ge DSA$)
- derived for a specific family of hard-to-learn communicating MDPs
Ergodic RL problems: State-of-the-art
Two types of algorithms targeting different regret guarantees:
Problem-specific guarantees
- MDP-specific regret bound scaling as $O(\log(T))$
- Algorithms: B-K (Burnetas & Katehakis, 1997), OLP (Tewari & Bartlett, 2007), UCRL2 (Jaksch et al. 2009), KL-UCRL (Filippi et al. 2010)
Minimax guarantees
- Valid for a class of MDPs with $S$ states and $A$ actions, and (typically) diameter $D$
- Scaling as $\Omega(\sqrt{T})$
- Algorithms: UCRL2 (Jaksch et al. 2009), KL-UCRL (Filippi et al. 2010), REGAL (Bartlett & Tewari, 2009), A-J (Agrawal & Jia, 2010)
Ergodic RL problems: State-of-the-art
Problem-specific regret bounds:
Algorithm | Setup | Regret
B-K | ergodic MDPs, known rewards | $O(c_{bk} \log(T))$, asympt.
OLP | ergodic MDPs, known rewards | $O\big(\frac{D^2 SA}{\Phi} \log(T)\big)$, asympt.
UCRL | unichain MDPs | $O\big(\frac{S^5 A}{\Delta^2} \log(T)\big)$
UCRL2, KL-UCRL | communicating MDPs | $O\big(\frac{D^2 S^2 A}{\Delta} \log(T)\big)$
Lower bound | ergodic MDPs, known rewards | $\Omega(c_{bk} \log(T))$, $\Omega\big(\frac{DSA}{\Phi} \log(T)\big)$
Minimax regret bounds:
Algorithm | Setup | Regret
UCRL2 | communicating MDPs | $\tilde{O}(DS\sqrt{AT})$
KL-UCRL | communicating MDPs | $\tilde{O}(DS\sqrt{AT})$
REGAL | weakly comm. MDPs, known rewards | $\tilde{O}(BS\sqrt{AT})$
A-J | communicating MDPs, known rewards | $\tilde{O}(D\sqrt{SAT})$, $T \ge S^5 A$
Lower bound | known rewards | $\Omega(\sqrt{DSAT})$, $T \ge DSA$
*$B$ denotes the span of the bias function of the true MDP, and $B \le D$
UCRL2
UCRL2 is an optimistic algorithm that works in episodes of increasing lengths.
At the beginning of each episode $k$, it maintains a set of plausible MDPs $\mathcal{M}_k$ (which contains the true MDP w.h.p.).
It then computes an optimal policy $\pi_k$, which has the largest gain over all MDPs in $\mathcal{M}_k$: $\pi_k \in \arg\max_{M \in \mathcal{M}_k,\, \pi} g^{\pi}(M)$.
- For computational efficiency, UCRL2 computes a $\frac{1}{\sqrt{t_k}}$-optimal policy, where $t_k$ is the starting step of episode $k$
- To find a near-optimal policy, UCRL2 uses Extended Value Iteration
It then follows $\pi_k$ within episode $k$ until the number of visits to some pair $(s,a)$ is doubled (and so a new episode starts).
UCRL2
Notations:
- $k \in \mathbb{N}$: index of an episode
- $N_k(s,a)$: total number of visits to the pair $(s,a)$ before episode $k$
- $\hat p_k(\cdot|s,a)$: empirical transition probability of $(s,a)$ built from observations up to episode $k$
- $\hat r_k(s,a)$: empirical reward of $(s,a)$ built from observations up to episode $k$
- $\pi_k$: policy followed in episode $k$
- $\mathcal{M}_k$: set of models for episode $k$ (defined next)
- $\nu_k(s,a)$: number of visits to the pair $(s,a)$ seen so far in episode $k$
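UCRL2: Main ingredients
The set of plausible MDPs $\mathcal{M}_k$: for confidence parameter $\delta$, define
$$\mathcal{M}_k = \Big\{ M = (S, A, r, p) : \forall (s,a),\ |r(s,a) - \hat r_k(s,a)| \le \sqrt{\frac{3.5 \log(2SAt_k/\delta)}{N_k(s,a)^+}},\ \ \| p(\cdot|s,a) - \hat p_k(\cdot|s,a) \|_1 \le \sqrt{\frac{14 S \log(2At_k/\delta)}{N_k(s,a)^+}} \Big\}$$
where $N_k(s,a)^+ := \max(1, N_k(s,a))$.
Optimistic gain: find in $\mathcal{M}_k$ the MDP that leads to the highest gain. We need to solve for episode $k$:
maximise over $(M, \pi)$: $g^{\pi}(M)$ subject to $M \in \mathcal{M}_k$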
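The two confidence radii defining $\mathcal{M}_k$ are cheap to compute from the visit counts. The sketch below reads them as $\sqrt{3.5\log(2SAt_k/\delta)/\max(1,N_k)}$ for rewards and $\sqrt{14S\log(2At_k/\delta)/\max(1,N_k)}$ for transitions; taking the time index to be the episode start $t_k$ is an assumption where the slide is ambiguous, and the function name is illustrative.

```python
import math
import numpy as np

def ucrl2_radii(n_k, t_k, delta):
    """Confidence radii for rewards (d_r) and transitions in L1 norm (d_p).

    n_k : (S, A) array of visit counts before episode k
    t_k : start time of episode k (assumed time index in the log terms)
    """
    S, A = n_k.shape
    n_plus = np.maximum(1, n_k)
    d_r = np.sqrt(3.5 * math.log(2 * S * A * t_k / delta) / n_plus)
    d_p = np.sqrt(14 * S * math.log(2 * A * t_k / delta) / n_plus)
    return d_r, d_p

d_r, d_p = ucrl2_radii(np.array([[0, 4], [16, 64]]), t_k=100, delta=0.05)
```

Both radii shrink as $1/\sqrt{N_k(s,a)}$, so rarely visited pairs keep a large set of plausible models and remain attractive to the optimistic planner.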
UCRL2 pseudo-code
Algorithm. UCRL2
Input: initial state $s_0$, precision $\delta$, $t = 1$
For each episode $k \ge 1$:
1. Initialisation. Set $t_k = t$ (start time of the episode). Update $N_k(s,a)$, $\hat r_k(s,a)$, and $\hat p_k(\cdot|s,a)$ for all $(s,a)$
2. Compute the set of possible MDPs $\mathcal{M}_k$ (using $\delta$)
3. Compute the policy $\pi_k \leftarrow$ ExtendedValueIteration$(\mathcal{M}_k, 1/\sqrt{t_k})$
4. Execute $\pi_k$ and end the episode: While $\nu_k(s_t, \pi_k(s_t)) < \max(1, N_k(s_t, \pi_k(s_t)))$:
   - Play $\pi_k(s_t)$, observe the reward and the next state
   - Update $\nu_k(s_t, \pi_k(s_t)) \leftarrow \nu_k(s_t, \pi_k(s_t)) + 1$ and $t \leftarrow t + 1$
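Step 4 (the doubling criterion that ends an episode) can be sketched in isolation. The environment interface `step(s, a) -> (reward, next_state)` and the function name are hypothetical; the policy is a plain list mapping each state to an action.

```python
import numpy as np

def run_episode(policy, n_k, s, step):
    """Follow `policy` until the in-episode count of the current pair
    reaches max(1, N_k) for that pair, i.e. its total count has doubled."""
    nu = np.zeros_like(n_k)            # in-episode visit counts nu_k(s, a)
    trajectory = []
    while nu[s, policy[s]] < max(1, n_k[s, policy[s]]):
        a = policy[s]
        reward, s_next = step(s, a)    # hypothetical environment call
        nu[s, a] += 1
        trajectory.append((s, a, reward, s_next))
        s = s_next
    return nu, trajectory, s

# Toy deterministic environment: every action flips the state, reward is 1.
flip = lambda s, a: (1.0, 1 - s)
nu, traj, s_end = run_episode([0, 0], np.array([[2, 0], [3, 0]]), 0, flip)
```

Because an episode ends only when some count doubles, the number of episodes up to time $T$ grows logarithmically in $T$ for each pair, which keeps the number of planning calls small.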
Extended value iteration
Set of plausible MDPs $\mathcal{M}_k$:
$$\mathcal{M}_k = \Big\{ M = (S, A, r, p) : \forall (s,a),\ |r(s,a) - \hat r_k(s,a)| \le d(s,a),\ \ \| p(\cdot|s,a) - \hat p_k(\cdot|s,a) \|_1 \le d'(s,a) \Big\}$$
We wish to find $M \in \mathcal{M}_k$ and a policy $\pi_k$ maximising $g^{\pi}(M)$ over all possible $M \in \mathcal{M}_k$ and policies $\pi$.
Ideas:
a. we can fix the reward to its maximum: $r(s,a) = \hat r_k(s,a) + d(s,a)$
b. solve a large MDP whose set of actions is $A'_s$, where $(a, q) \in A'_s$ if and only if $q \in \mathcal{P}_k(s,a)$ with:
$$\mathcal{P}_k(s,a) = \{ q : \| q(\cdot) - \hat p_k(\cdot|s,a) \|_1 \le d'(s,a) \}$$
Extended value iteration
Solution: apply one of the known algorithms to find an optimal policy in MDPs, e.g., the value iteration algorithm.
Extended Value Iteration: for all $s \in S$, starting from $u_0(s) = 0$:
$$u_{i+1}(s) = \max_{a \in A} \Big\{ r(s,a) + \max_{q \in \mathcal{P}_k(s,a)} u_i^\top q \Big\}$$
- $\mathcal{P}_k(s,a)$ is a polytope, and the inner maximisation can be done in $O(S)$ operations.
- To obtain an $\varepsilon$-optimal policy, the update is stopped when
$$\max_s \big( u_{i+1}(s) - u_i(s) \big) - \min_s \big( u_{i+1}(s) - u_i(s) \big) < \varepsilon$$
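A compact sketch of Extended Value Iteration follows. The inner maximisation over the L1 ball uses the standard greedy construction: add up to half of the radius to the state with the largest $u$, then remove the excess mass from the states with the smallest $u$. Names are illustrative; `r_opt` is assumed to already hold the optimistic rewards $\hat r_k + d$, and `d_p` the transition radii $d'$.

```python
import numpy as np

def optimistic_transition(p_hat_sa, d_p, order):
    """Maximise u . q over {q : ||q - p_hat_sa||_1 <= d_p}.

    `order` lists the states by decreasing value of u.
    """
    q = p_hat_sa.astype(float).copy()
    best = order[0]
    q[best] = min(1.0, q[best] + d_p / 2.0)    # push mass towards the best state
    for worst in order[::-1]:                  # take the excess from the worst states
        excess = q.sum() - 1.0
        if excess <= 1e-12:
            break
        q[worst] = max(0.0, q[worst] - excess)
    return q

def extended_value_iteration(p_hat, r_opt, d_p, eps, max_iter=10_000):
    """Return a greedy policy and the final value vector u."""
    S, A = r_opt.shape
    u = np.zeros(S)
    q_val = np.zeros((S, A))
    for _ in range(max_iter):
        order = np.argsort(-u)                 # one sort per iteration
        for s in range(S):
            for a in range(A):
                q = optimistic_transition(p_hat[s, a], d_p[s, a], order)
                q_val[s, a] = r_opt[s, a] + q @ u
        u_new = q_val.max(axis=1)
        diff = u_new - u
        u = u_new
        if diff.max() - diff.min() < eps:      # span-based stopping rule
            break
    return q_val.argmax(axis=1), u
```

After the single sort, each inner maximisation is a linear pass over the states, which is where the $O(S)$ cost per pair comes from.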
UCRL2: Regret guarantees
Let $\pi$ = UCRL2.
Regret up to time $T$: $R^{\pi}(T) = T g^\star - \sum_{t=1}^{T} r(s_t^{\pi}, a_t^{\pi})$, a random variable capturing the learning cost and the mixing time problems.
Theorem. W.p. at least $1-\delta$, the regret of UCRL2 satisfies, for any initial state and any $T > 1$,
$$R^{\pi}(T) \le 34\, D S \sqrt{A T \log(T/\delta)}.$$
For any initial state and any $T \ge 1$, we have w.p. at least $1 - 3\delta$,
$$R^{\pi}(T) \le 34^2\, \frac{D^2 S^2 A \log(T/\delta)}{\epsilon} + \epsilon T.$$
Regret
[Figure: regret ($\times 10^4$) versus time ($\times 10^5$); legend: UCRL2, KL-UCRL, ε-greedy. Setting: 6 states, $\delta = 0.05$ (for UCRL2), ε-greedy with $\epsilon_t = \min(1, 1000/t)$.]
Regret
[Figure: regret ($\times 10^4$) versus time ($\times 10^5$); legend: UCRL2, KL-UCRL. Setting: 12 states, $\delta = 0.05$ (for UCRL2).]
References
Episodic RL
- UCBVI algorithm: M. Gheshlaghi Azar, I. Osband, and R. Munos, Minimax regret bounds for reinforcement learning, Proc. ICML, 2017.
Ergodic RL
- UCRL algorithm: P. Auer and R. Ortner, Logarithmic online regret bounds for undiscounted reinforcement learning, Proc. NIPS, 2006.
- UCRL2 algorithm and minimax LB: P. Auer, T. Jaksch, and R. Ortner, Near-optimal regret bounds for reinforcement learning, J. Machine Learning Research, 2010.