Introduction
Supervisor: Freek Stulp
Hauptseminar Intelligente Autonome Systeme (WiSe 04/05)
Research and Teaching Unit Informatics IX, Technische Universität München
November 24, 2004
Introduction: What is Learning?
Learning is often divided into supervised learning and trial-and-error learning. In a computational sense these two are referred to as supervised learning and reinforcement learning. Just as we often combine both kinds of learning in the real world, we also do so in computational learning. Possible techniques for supervised learning include classical machine learning methods, artificial neural networks and more; we will set all of these aside and take a closer look at reinforcement learning (RL).
Outline
1 The Problem
  Markov Decision Processes - The Smaller Problem
  The Gridworld - An Example
2 The Bellman Equation - Calculating the Value Function
  Value Iteration - An Incremental Approach
3 Prediction - Policy Evaluation in a TD World
  Q-Learning - Learning the Optimal Policy
4
The Problem: What is Reinforcement Learning?
Computational learning of an agent by interaction with its environment.
Advantages:
- fast implementation
- polynomial complexity
- wide variety of applications: robot control and planning, ...
Markov Decision Processes (MDPs)
We represent our environment as a (finite) MDP:

  S                      set of states
  A(s)                   set of actions for every state
  P(s_t, a_t, s_{t+1})   transition probability function
  r_{t+1}                scalar reward

Table: The Elements of a Markov Decision Process (MDP)

Markov Property: every transition depends only on the current state and action (and the transition probability function).
The Gridworld: Representation as an MDP
Parts of the Grid:
- every square represents a state
- in every state the possible actions are up, down, left and right
- the MDP is deterministic
- a reward of -1 is given on every transition
- two terminal states (light grey in the graphics)
Figure: Empty 4x4 Gridworld; squares correspond to states.
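The gridworld above can be sketched as a small data structure. This is a minimal illustration, not code from the talk; state numbering (cells 0..15, terminals in the corners) and the wall-bump behaviour (a move off the grid leaves the state unchanged) are assumptions consistent with the figures:

```python
# A minimal sketch of the 4x4 gridworld as a deterministic MDP.
# States are cells 0..15 (row-major), terminals are the two corners,
# every transition yields reward -1, terminal states are absorbing.
N = 4
STATES = range(N * N)                 # the state set S
TERMINALS = {0, N * N - 1}
ACTIONS = {"up": -N, "down": N, "left": -1, "right": 1}

def step(s, a):
    """Deterministic transition: (next state, reward) for action a in state s."""
    if s in TERMINALS:
        return s, 0                   # terminal states are absorbing
    row, col = divmod(s, N)
    if a == "up" and row == 0:        return s, -1   # bump the wall
    if a == "down" and row == N - 1:  return s, -1
    if a == "left" and col == 0:      return s, -1
    if a == "right" and col == N - 1: return s, -1
    return s + ACTIONS[a], -1
```

Because the MDP is deterministic, P(s_t, a_t, s_{t+1}) collapses to this single `step` function.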
The Gridworld: A Random Walk
Figure: A random walk over the Gridworld (built up step by step over several slides in the original).
Figure: The number of steps from each visited state to reach the terminal state, given this particular walk.
The Gridworld: A General Approach to Random Walks
General Random Walk: On the last slide we saw one random walk and the resulting number of steps from each state to reach a terminal state. Now imagine how the grid would be labeled in general if we always took a random action in every state until reaching a terminal state: we are looking for the mean number of steps.

   0  14  20  22
  14  18  20  20
  20  20  18  14
  22  20  14   0

Figure: Gridworld with the mean number of steps under random actions.
Solution to the Gridworld
π     policy, selects an action for every state
V(s)  state-value function, the sum of future rewards from each state

  0  1  2  3
  1  2  3  2
  2  3  2  1
  3  2  1  0

Figure: The Gridworld with the state-value function for the optimal policy and the possible actions of the optimal policy.
The Bellman Equation
Theorem (Bellman Equation):
  V(s_t) = r_{t+1} + γ·V(s_{t+1})
Parts of the Bellman equation:
- the state s_t at time t
- the reward r_{t+1} given for the action a_t under the current policy π
- the successor state s_{t+1}
- the discount factor γ with 0 < γ ≤ 1.
The Bellman Equation: Proof
  V(s_t) = Σ_{k=0}^{n} γ^k r_{t+k+1}
         = r_{t+1} + Σ_{k=1}^{n} γ^k r_{t+k+1}
         = r_{t+1} + Σ_{k=0}^{n-1} γ^{k+1} r_{t+k+2}
         = r_{t+1} + γ Σ_{k=0}^{n-1} γ^k r_{t+k+2}
         = r_{t+1} + γ·V(s_{t+1})
The Bellman Optimality Equation for V*
  V*(s_t) = max_a (r_{t+1} + γ·V*(s_{t+1}))
With s_t, r_{t+1} and γ as stated before, max_a selects the action that maximizes the immediate reward plus the discounted current estimate of the return from the successor state s_{t+1}; this computes the optimal state-value function.
Policy Evaluation, or Computing V^π(s)
Policy Evaluation describes the process of computing the state-value function V^π for a given policy π.
Algorithm Policy Evaluation:
  repeat:
    Δ = 0
    foreach s ∈ S:
      v = V(s)
      choose a from s using π, observe r_{t+1} and s_{t+1}
      V(s) = r_{t+1} + γ·V(s_{t+1})
      Δ = max(Δ, |v − V(s)|)
  until Δ < Θ   // Θ a small positive number
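A minimal Python sketch of policy evaluation for the equiprobable random policy on the 4x4 gridworld. The grid layout and `step` function are illustrative assumptions, and where the slide samples one action from π, this sketch uses the full expected backup over the four equiprobable actions, which converges to the same V^π for this policy. With γ = 1 and reward −1 per step, the result is exactly the negated mean step counts shown earlier:

```python
# Iterative policy evaluation (expected backups) for the random policy
# on the 4x4 gridworld. Dynamics assumed: deterministic moves, reward -1,
# two absorbing terminal corners, off-grid moves leave the state unchanged.
N, GAMMA, THETA = 4, 1.0, 1e-6
TERMINALS = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    row, col = divmod(s, N)
    row, col = row + a[0], col + a[1]
    if 0 <= row < N and 0 <= col < N:
        return row * N + col, -1
    return s, -1                      # bumping a wall leaves s unchanged

V = [0.0] * (N * N)
while True:
    delta = 0.0
    for s in range(N * N):
        if s in TERMINALS:
            continue
        v = V[s]
        # expected update: average over the four equiprobable actions
        V[s] = sum(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS)) / 4
        delta = max(delta, abs(v - V[s]))
    if delta < THETA:
        break
print([round(-v) for v in V])  # mean steps: 0, 14, 20, 22, 14, 18, ...
```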
Value Iteration, or Finding the Optimal Policy
Value Iteration computes the optimal state-value function V*.
Algorithm Value Iteration:
  repeat:
    Δ = 0
    foreach s ∈ S:
      v = V(s)
      V(s) = max_a (r_{t+1} + γ·V(s_{t+1}))
      Δ = max(Δ, |v − V(s)|)
  until Δ < Θ   // Θ a small positive number
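The same sketch with the expectation replaced by a max over actions gives value iteration; on the assumed gridworld dynamics it recovers the optimal values from the "Solution to the Gridworld" slide (the negated minimal step counts):

```python
# Value iteration on the 4x4 gridworld (dynamics assumed as before:
# deterministic moves, reward -1, two absorbing terminal corners).
N, GAMMA, THETA = 4, 1.0, 1e-6
TERMINALS = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    row, col = divmod(s, N)
    row, col = row + a[0], col + a[1]
    if 0 <= row < N and 0 <= col < N:
        return row * N + col, -1
    return s, -1

V = [0.0] * (N * N)
while True:
    delta = 0.0
    for s in range(N * N):
        if s in TERMINALS:
            continue
        v = V[s]
        # backup with a max over actions instead of an expectation
        V[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
        delta = max(delta, abs(v - V[s]))
    if delta < THETA:
        break
print([int(-v) for v in V])  # minimal steps to a terminal: 0 1 2 3 / 1 2 3 2 / ...
```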
Model-based methods:
- the model (S, A(s), P, r) needs to be known
- V(s) is computed by iterating over the entire state set
Temporal Difference (TD) Learning:
- model unknown; based on experience
- samples state, action and reward triples
- approximates V(s) while interacting with the environment
Prediction
Policy evaluation is also known as the prediction problem: V^π(s) is computed by turning the Bellman equation into an update rule and iterating until no more changes occur.
TD Prediction: update rule for TD(0):
  V(s_t) = V(s_t) + α[r_{t+1} + γ·V(s_{t+1}) − V(s_t)]
Convergence criteria:
1 for sufficiently small constant α: convergence in the mean
2 if α decreases over time: convergence with probability 1
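The TD(0) update rule can be sketched on the gridworld: V is learned purely from sampled transitions, without the model. The environment dynamics below are assumptions used only to generate the samples; the two update lines are the rule itself:

```python
# TD(0) prediction for the random policy on the 4x4 gridworld,
# learning V(s) from sampled episodes instead of a model.
import random
random.seed(0)

N, GAMMA, ALPHA = 4, 1.0, 0.05
TERMINALS = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    row, col = divmod(s, N)
    row, col = row + a[0], col + a[1]
    return (row * N + col, -1) if 0 <= row < N and 0 <= col < N else (s, -1)

V = [0.0] * (N * N)
for _ in range(5000):
    s = random.choice([s for s in range(N * N) if s not in TERMINALS])
    while s not in TERMINALS:
        s2, r = step(s, random.choice(ACTIONS))
        # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
        V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
        s = s2
```

With the constant step size α the estimates only converge in the mean, so the learned V fluctuates around the true values (−14, −18, −20, −22) rather than settling on them exactly.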
Q-Learning and the State-Action-Value Function
Learning the state-action-value function:
- value iteration takes the max over all actions, which is only possible with a complete model
- Q-learning computes Q(s, a) instead; taking the max over Q(s, a) for all actions doesn't require a model
Figure: Digraph of a simple MDP.
Off-Policy Learning: Q-Learning
Off-Policy Learning: to ensure we keep exploring the state space, we follow one policy while actually evaluating another.
Algorithm Q-Learning:
  repeat (for each episode):
    initialize s_t
    choose a_t from s_t using the ε-greedy policy derived from Q
    repeat (for each step of the episode):
      take action a_t, observe r_{t+1}, s_{t+1}
      choose a_{t+1} from s_{t+1} using the ε-greedy policy derived from Q
      Q(s_t, a_t) = Q(s_t, a_t) + α[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
      s_t = s_{t+1}, a_t = a_{t+1}
    until s_t is terminal
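A condensed Python sketch of tabular Q-learning on the 4x4 gridworld. All constants and the gridworld dynamics are illustrative assumptions; the slide's version carries a_{t+1} across iterations, which this sketch folds into one ε-greedy choice per step (equivalent here, since the update target uses the max anyway):

```python
# Tabular Q-learning on the 4x4 gridworld: epsilon-greedy behaviour
# policy, greedy (max) target in the update -- hence off-policy.
import random
random.seed(1)

N, GAMMA, ALPHA, EPS = 4, 1.0, 0.1, 0.1
TERMINALS = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    row, col = divmod(s, N)
    row, col = row + a[0], col + a[1]
    return (row * N + col, -1) if 0 <= row < N and 0 <= col < N else (s, -1)

Q = {(s, a): 0.0 for s in range(N * N) for a in ACTIONS}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)                 # explore
    return max(ACTIONS, key=lambda a: Q[(s, a)])      # exploit

for _ in range(2000):
    s = random.choice([s for s in range(N * N) if s not in TERMINALS])
    while s not in TERMINALS:
        a = eps_greedy(s)
        s2, r = step(s, a)
        # off-policy target: max over the actions of the successor state
        target = max(Q[(s2, b)] for b in ACTIONS)     # 0 at terminals
        Q[(s, a)] += ALPHA * (r + GAMMA * target - Q[(s, a)])
        s = s2
print(max(Q[(5, a)] for a in ACTIONS))  # ≈ -2: state 5 is two steps from a terminal
```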
The Acrobot
Parts of the Acrobot:
- two links
- a torque of [−1, 0, 1] is exerted only at the second joint
- continuous state variables θ_1, θ_2, θ̇_1, θ̇_2
- goal: hit a horizontal line at a distance of max(L_1, L_2) in minimal time
- r_{t+1} = −1, γ = 1
Figure: The Acrobot (links L_1, L_2 with masses M_1, M_2; torque applied at the second joint).
The Acrobot II
Parts of the Acrobot II:
- the angular velocities are limited
- the state space is a rectangularly confined region in a four-dimensional space
- the state space is bounded but continuous ⇒ tilings: define discrete intervals for each dimension
Figure: The Acrobot.
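The discretization idea can be sketched as a single uniform tiling of the four-dimensional state space. The bounds and bin counts below are illustrative assumptions, and the actual experiment used several overlapping tilings (tile coding), which this sketch does not show:

```python
# A minimal sketch: map the Acrobot's continuous 4-D state
# (theta1, theta2, dtheta1, dtheta2) to one discrete tile index
# by cutting each dimension into uniform intervals.
import math

# assumed per-dimension bounds and resolutions (illustrative values only)
BOUNDS = [(-math.pi, math.pi), (-math.pi, math.pi),
          (-4 * math.pi, 4 * math.pi), (-9 * math.pi, 9 * math.pi)]
BINS = [6, 6, 7, 7]

def tile_index(state):
    """Map a 4-D continuous state to a single discrete tile number."""
    index = 0
    for (lo, hi), n, x in zip(BOUNDS, BINS, state):
        i = min(n - 1, max(0, int((x - lo) / (hi - lo) * n)))  # clip to range
        index = index * n + i
    return index
```

The tabular methods from the earlier slides can then be applied to the resulting finite set of 6·6·7·7 tiles.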
The Acrobot - An Algorithm that Solves the Problem
Solution: A TD algorithm similar to Q-learning, called Sarsa(λ), was implemented to solve the problem; the λ indicates that updates concern more than one state. The constants were set to α = 0.2/48 and λ = 0.9. With ε = 0 the algorithm acts greedily with respect to the state-action-value function, which is important here since one exploratory move could spoil a whole sequence of good moves. Exploration was instead ensured by optimistically initializing the values of Q(s, a) to 0.
Outline
5 Generalizations
  Exploration vs. Exploitation
  Optimistic Initial Values
6 Conclusion: Back to the Full Problem
Exploration vs. Exploitation
Exploration: extend knowledge of the model by trying things out.
Exploitation: exploit existing knowledge to achieve the maximum return.
Solutions:
- ε-greedy policy selection
- following one policy while evaluating another
- optimistic initial values
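The ε-greedy trade-off from the list above can be written in a few lines (names are illustrative, not from the slides):

```python
# epsilon-greedy action selection: with probability epsilon take a
# uniformly random action (exploration), otherwise the currently
# best-looking one (exploitation).
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: list of Q(s, a) estimates for one state, indexed by action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit
```

With ε = 0 this degenerates to the purely greedy selection used in the Acrobot experiment, where exploration came from optimistic initial values instead.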
Optimistic Initial Values
- initial state or state-action values are set higher than they could ever actually become
- thus exploration in an early phase is guaranteed
- cannot be used in the non-stationary case
Conclusion
Disadvantages:
- one goal only
- Markov property ⇒ abstraction from the environment
- actions come from a discrete space and always take one time step
- in the real world, good initial behavior is often more important than asymptotically optimal behavior
Outlook: some of the previously mentioned disadvantages are addressed by a current research interest: hierarchical multi-agent learning.
F I N