Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016


Overview
We consider probabilistic temporal models where the values of certain variables are chosen (controlled) by a rational agent.
Outline:
1. Simple decision-making
2. Sequential decision-making
3. Markov Decision Process (MDP)
4. Planning algorithms for Markov Decision Processes
5. Learning algorithms for Markov Decision Processes
2 / 45

Decision-making. So far, we have focused on temporal models that describe how a set of random variables, called the state, changes over time. These are models of uncontrolled systems: we just observe, predict, and estimate, but we do nothing to change how things go. In reality, some variables of the system can be controlled by a rational agent. [Figure: "what you see" (weather images for days 1 through 4) vs. "what happens".] 3 / 45

Decision-making. So far, we have focused on temporal models that describe how a set of random variables, called the state, changes over time. These are models of uncontrolled systems: we just observe passively, predict, and estimate, but we do nothing to change how things go. In reality, some variables of the system can be controlled by a rational agent. [Figure: the sprinkler dynamic Bayesian network with variables Cloudy, Sprinkler, Rain, and Wet Grass, together with their conditional probability tables, plus battery/meter variables.] 4 / 45

Decision-making. The set of controlled variables is called an action. Decision-making: choosing values for the actions. The distribution of next states (at t + 1) depends on the current state and the executed action (at t). [Figure: a dynamic decision network with actions a_t, a_{t+1}, states S_t, S_{t+1}, and observations Z_t, Z_{t+1}.] 5 / 45

Interaction between an agent and a dynamical system. [Figure: the agent sends actions to the dynamical system and receives observations.] In this lecture, we consider only fully observable Markov Decision Processes, where the agent knows the state at any time (Z_t = S_t). 6 / 45

Decision-making Search problems are decision-making problems: actions correspond to assigning values to variables. Navigation problems are decision-making problems: Actions correspond to moving in a particular direction. States correspond to the position of the agent. 7 / 45

Example of decision-making problems: robot navigation. State: position of the robot. Actions: move east, move west, move north, move south. [Figure: a robot repeatedly moving east through states s_0, s_1, s_2.] 8 / 45

Example of decision-making problems: buying stocks. State: prices of stocks A and B, how much money we have invested so far, and which stocks we have bought. Action: buying stock A or B. [Figure: a decision tree rooted at (A = $100, B = $100). Buying B leads to (A = $80, B = $110) with probability 0.3 and (A = $90, B = $110) with probability 0.7; buying A leads to (A = $70, B = $100) with probability 0.8 and (A = $80, B = $100) with probability 0.2. Each outcome again offers the choice of buying A or B.] 9 / 45

Value (Utility). An action applied at a given state may lead to different outcomes with varying probabilities. Example: If we buy stock A, its price can change to $70 with probability 0.8 and to $80 with probability 0.2. Decision theory, in its simplest form, deals with choosing among actions based on the desirability of their immediate outcomes. A value (or utility) function V maps every state to a real value that indicates how much we would like to be in that state. Example: V(A = $90, B = $110, stocks we have = {A}, invested money = $100) = 10. Note: In the Russell and Norvig textbook, the term utility (denoted by U) is used instead of value (denoted by V). The two terms are equivalent. 10 / 45

Markov Decision Process (MDP). Formally, an MDP is a tuple ⟨S, A, T, R⟩, where: S is the space of state values; A is the space of action values; T is the transition function; R is a reward function. [Figure from http://artint.info] 11 / 45
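As a concrete illustration of this tuple, the four components can be written down directly as data. The following sketch uses a made-up two-state MDP (not one from the lecture); all names are hypothetical:

```python
# A minimal sketch of an MDP stored as plain Python dictionaries.
# Toy example: two states, two actions (hypothetical, for illustration).
S = ["s0", "s1"]
A = ["stay", "go"]

# T[s][a] maps each next state s' to its probability T(s, a, s').
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# R[s][a] is the immediate reward R(s, a).
R = {
    "s0": {"stay": 0.0, "go": 0.0},
    "s1": {"stay": 1.0, "go": 1.0},
}

# Sanity check: for every (s, a), the transition probabilities sum to 1.
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-12
```

Any finite MDP can be stored this way; planning algorithms then only need to loop over these dictionaries.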

Example of a Markov Decision Process with three states and two actions from Wikipedia 12 / 45

Example of a Markov Decision Process. [Figure: a 3x3 grid of states s_1, ..., s_9, with arrows showing the moves N, S, E, W available in each state.] This robot-navigation task can be represented as a Markov Decision Process. Of course, one can imagine more complicated tasks, such as tracking a mobile target, collecting samples, or navigating in a hostile environment with defective actuators. Most of these problems can be easily cast as MDPs. 13 / 45

Markov Decision Process (MDP) State space S: The definition of a state is recursive: A state is a representation of all the relevant information for predicting future states, in addition to all the information relevant for the related decision problem. In the example of robot navigation, the state space S = {s 1, s 2, s 3, s 4, s 5, s 6, s 7, s 8, s 9 } corresponds to the set of the robot s locations on the grid. The state space may be finite, countably infinite, or continuous. We will focus on models with a finite set of states. In our example, the states correspond to different positions on a discretized grid. 14 / 45

Markov Decision Process (MDP). Action space A: The states of the system are modified by the actions executed by an agent. The goal of the agent is to choose actions that will influence the future of the system so that the more desirable states occur more frequently. The action space can be finite, countably infinite, or continuous, but we will consider only the finite case. In our example, the actions of the robot might be move north, move south, move east, move west, or do not move, so A = {N, S, E, W, nothing}. In the MDP framework, the state changes only after an agent executes an action. 15 / 45

Markov Decision Process (MDP). Transition function T: When an agent executes an action in a given state, the action does not always lead to the same result. This is because the information represented by the state is not sufficient for determining precisely the outcome of the actions. T(s_t, a_t, s_{t+1}) returns the probability of transitioning to state s_{t+1} after executing action a_t in state s_t: T(s_t, a_t, s_{t+1}) = P(s_{t+1} | s_t, a_t). In our example, the actions can be either deterministic, or stochastic if the floor is slippery, in which case the robot might end up in a different position from the one it was trying to move toward. 16 / 45

Markov Decision Process (MDP). Reward function R: The preferences of the agent are defined by the reward function R. This function directs the agent toward desirable states and keeps it away from unwanted ones. R(s_t, a_t) returns a reward (or a penalty) to the agent for executing action a_t in state s_t. The goal of the agent is then to choose actions that maximize its cumulative reward. The elegance of the MDP framework comes from the possibility of modeling complex concurrent tasks by simply assigning rewards to the states. In our example, one may consider a reward of +100 for reaching the goal state, −2 for any movement (consumption of energy), and −1 for not doing anything (waste of time). 17 / 45

Finite or Infinite Horizons, Discount Factors. Given a reward function, the goal of the agent is to maximize the expected cumulative reward over some number H of steps, called the horizon. The horizon can be either finite or infinite. If the horizon is finite, then the optimal actions of the agent depend not only on the states, but also on the number of steps remaining until the end. In our example, if the agent is at the top-right position on the grid and only two steps are left before the end, then it is better to do nothing and receive a reward of −1 per step than to move and receive a reward of −2 per step, since the goal cannot be reached within two steps. 18 / 45

Finite or Infinite Horizons, Discount Factors. Given a reward function, the goal of the agent is to maximize the expected cumulative reward over some number H of steps, called the horizon. If the horizon is infinite (H = ∞), then the optimal actions depend only on the state. In our case, the optimal action at any step is to move toward the goal. A discount factor γ ∈ [0, 1) is also used to indicate how the importance of the earned rewards decreases with every time step of delay. A reward that will be received k time steps later is scaled down by a factor of γ^k. The discount factor can also be interpreted as the probability that the process continues after any step. 19 / 45
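The effect of discounting is easy to check numerically: with γ = 0.9, a reward of 1 earned at every future step has a discounted sum of 1 + γ + γ² + ... = 1/(1 − γ) = 10. A quick sketch (my own toy computation, not from the slides):

```python
# Discounted sum of a constant reward of 1 per step, gamma = 0.9.
# The geometric series gives 1 / (1 - gamma) = 10 exactly; truncating
# at 1000 steps is numerically indistinguishable from the infinite sum.
gamma = 0.9
discounted_sum = sum(gamma**t * 1.0 for t in range(1000))
assert abs(discounted_sum - 1.0 / (1.0 - gamma)) < 1e-6
```

This is also why, in the finite-horizon example above, "do nothing" (−1 per step) beats "move" (−2 per step) when the goal is out of reach: discounting never flips the sign of a per-step comparison.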

Policies and Value Functions. The agent selects its actions according to a policy (or a strategy). A deterministic stationary policy π is a function that maps every state s to an action a: π : S → A. A stochastic stationary policy is a function that maps each state to a probability distribution over the different possible actions. We consider only deterministic policies here. The value function of a policy π is a function V^π that associates with each state the sum of expected rewards that the agent will receive if it starts executing policy π from that state. In other terms: V^π(s) = Σ_{t=0}^∞ γ^t E_{s_t}[R(s_t, π(s_t)) | π, s_0 = s], where π(s_t) is the action chosen in state s_t. 20 / 45

Bellman Equation. The value function of a stationary policy can also be defined recursively:
V^π(s) = Σ_{t=0}^∞ γ^t E_{s_t}[R(s_t, π(s_t)) | π, s_0 = s]
= R(s, π(s)) + Σ_{t=1}^∞ γ^t E_{s_t}[R(s_t, π(s_t)) | π, s_0 = s]
= R(s, π(s)) + γ Σ_{t=0}^∞ γ^t E_{s_t}[R(s_t, π(s_t)) | π, s_0 ~ T(s, π(s), ·)] (the next state is the new s_0, distributed as T(s, π(s), ·))
= R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') Σ_{t=0}^∞ γ^t E_{s_t}[R(s_t, π(s_t)) | π, s_0 = s']
= R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s'). 21 / 45

Bellman Equation. V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s'). Value = immediate reward + γ × (expected value of the next state). This equation plays a central role in dynamic programming, a family of methods for solving a complex problem by breaking it down into a collection of simpler subproblems. In dynamic programming, invented by Richard Bellman in 1957, subproblems are nested recursively inside larger problems. [Photo: Richard Bellman (1920-1984).] 22 / 45

Optimal policies. Bellman Equation: V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s'). Optimal policies: An optimal policy π* is one that satisfies: ∀s ∈ S : π* ∈ arg max_π V^π(s). The value function of an optimal policy is called the optimal value function; it is defined as: V*(s) = max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V*(s')]. 23 / 45

Optimal policies. In his seminal work on dynamic programming, Richard Bellman proved that a stationary deterministic optimal policy exists for any discounted infinite-horizon MDP. If the value function V^π of a given policy π satisfies V^π(s) = max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V^π(s')], then V^π = V* and π is an optimal policy. The equation above is a necessary and sufficient optimality condition. 24 / 45

Planning with Markov Decision Processes. Planning: finding an optimal policy π* given an MDP ⟨S, A, T, R⟩. Most planning algorithms for MDPs fall into one of two categories: policy iteration and value iteration. 25 / 45

Policy Iteration. Start with a randomly chosen policy π_t at t = 0. Alternate between the policy evaluation and the policy improvement operations until convergence. Policy evaluation: Randomly initialize the value function V_k for k = 0. Repeat the operation ∀s ∈ S : V_{k+1}(s) = R(s, π_t(s)) + γ Σ_{s' ∈ S} T(s, π_t(s), s') V_k(s') until ∀s ∈ S : |V_k(s) − V_{k−1}(s)| < ε, for a predefined error threshold ε. 26 / 45
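The policy-evaluation loop above fits in a few lines of code. The two-state MDP below is a made-up example (not the lecture's grid), chosen so the fixed point is easy to verify by hand:

```python
# Iterative policy evaluation on a toy two-state MDP (hypothetical).
# State "b" loops on itself with reward 1; state "a" moves
# deterministically to "b" with reward 0.
gamma = 0.5
states = ["a", "b"]
policy = {"a": "go", "b": "stay"}  # one fixed action per state
T = {("a", "go"): {"b": 1.0}, ("b", "stay"): {"b": 1.0}}
R = {("a", "go"): 0.0, ("b", "stay"): 1.0}

V = {s: 0.0 for s in states}
eps = 1e-10
while True:
    # One sweep of the Bellman backup for the fixed policy.
    V_new = {
        s: R[(s, policy[s])]
        + gamma * sum(p * V[s2] for s2, p in T[(s, policy[s])].items())
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < eps:
        V = V_new
        break
    V = V_new

# Analytically: V(b) = 1 + 0.5 V(b), so V(b) = 2; V(a) = 0 + 0.5 V(b) = 1.
assert abs(V["b"] - 2.0) < 1e-6 and abs(V["a"] - 1.0) < 1e-6
```

The loop converges because each sweep is a γ-contraction: the error shrinks by at least a factor of γ per iteration.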

Policy Iteration. Start with a randomly chosen policy π_t at t = 0. Alternate between the policy evaluation and the policy improvement operations until convergence. Policy improvement: Find a greedy policy π_{t+1} given the value function V_k (computed in the policy evaluation phase): ∀s ∈ S : π_{t+1}(s) = arg max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_k(s')]. The policy iteration process stops when π_t = π_{t−1}, in which case π_t is an optimal policy, i.e. π_t = π*. 27 / 45

Input: An MDP model ⟨S, A, T, R⟩;
/* Initialization */
t = 0, k = 0;
∀s ∈ S: Initialize π_t(s) with an arbitrary action;
∀s ∈ S: Initialize V_k(s) with an arbitrary value;
repeat
  /* Policy evaluation */
  repeat
    ∀s ∈ S : V_{k+1}(s) = R(s, π_t(s)) + γ Σ_{s' ∈ S} T(s, π_t(s), s') V_k(s');
    k ← k + 1;
  until ∀s ∈ S : |V_k(s) − V_{k−1}(s)| < ε;
  /* Policy improvement */
  ∀s ∈ S : π_{t+1}(s) = arg max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_k(s')];
  t ← t + 1;
until π_t = π_{t−1};
π* = π_t;
Output: An optimal policy π*;
Algorithm 1: The policy iteration algorithm. 28 / 45

Example. State space S = {s_11, s_12, s_13, s_21, s_22, s_23, s_31, s_32, s_33}. Action space A = {N, S, E, W, do nothing}. Deterministic transition function. Reward function: ∀a : R(s_33, a) = 1; ∀a, ∀s ≠ s_33 : R(s, a) = 0. Discount factor γ = 0.9. [Figure: the 3x3 grid of states s_11 through s_33, with the arrows of the initial policy.] 29 / 45

Example. Let's perform the policy evaluation on the initial policy. All values start at V_0(s) = 0. The first update gives V_1(s) = R(s, π(s)) + 0.9 × 0, i.e. V_1(s_33) = 1 and V_1(s) = 0 elsewhere. The second update propagates the reward one step back along the policy: V_2(s_23) = 0 + 0.9 × 1 = 0.9 and V_2(s_33) = 1 + 0.9 × 1 = 1.9. The third update: V_3(s_13) = 0 + 0.9 × 0.9 = 0.81, V_3(s_23) = 0 + 0.9 × 1.9 = 1.71, V_3(s_33) = 1 + 0.9 × 1.9 = 2.71, and so on until convergence:

State | V_0 | V_1 | V_2 | V_3  | ... | V_1000
s_11  |  0  |  0  |  0  |  0   | ... | 4.3
s_12  |  0  |  0  |  0  |  0   | ... | 7.3
s_13  |  0  |  0  |  0  | 0.81 | ... | 8.1
s_21  |  0  |  0  |  0  |  0   | ... | 4.8
s_22  |  0  |  0  |  0  |  0   | ... | 6.6
s_23  |  0  |  0  | 0.9 | 1.71 | ... | 9
s_31  |  0  |  0  |  0  |  0   | ... | 5.3
s_32  |  0  |  0  |  0  |  0   | ... | 5.9
s_33  |  0  |  1  | 1.9 | 2.71 | ... | 10

[Figure: the 3x3 grid with the arrows of the initial policy.] 30-37 / 45

Example. Now, we improve the previous policy based on the calculated values: in each state, the improved policy takes a greedy action with respect to the converged values V_1000 from the evaluation phase, i.e. π'(s) = arg max_{a ∈ A} [R(s, a) + γ Σ_{s'} T(s, a, s') V_1000(s')]. [Figure: the 3x3 grid with the arrows of the improved policy.] 38 / 45

Value Iteration. Value iteration can be written as a simple backup operation: ∀s ∈ S : V_{k+1}(s) = max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_k(s')]. This operation is repeated until ∀s ∈ S : |V_k(s) − V_{k−1}(s)| < ε, in which case the optimal policy is simply the greedy policy with respect to the value function V_k: ∀s ∈ S : π*(s) = arg max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_k(s')]. 39 / 45

Value Iteration.
Input: An MDP model ⟨S, A, T, R⟩;
k = 0;
∀s ∈ S: Initialize V_k(s) with an arbitrary value;
repeat
  ∀s ∈ S : V_{k+1}(s) = max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_k(s')];
  k ← k + 1;
until ∀s ∈ S : |V_k(s) − V_{k−1}(s)| < ε;
∀s ∈ S : π*(s) = arg max_{a ∈ A} [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_k(s')];
Output: An optimal policy π*;
Algorithm 2: The value iteration algorithm. 40 / 45
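Algorithm 2 can be sketched as follows on a made-up two-state MDP (not the lecture's grid); the model and all names are my own:

```python
# Value iteration on a toy MDP: state 1 is absorbing with reward 1 per
# step; from state 0, "go" reaches state 1 and "stay" does nothing.
gamma = 0.9
states = [0, 1]
actions = ["stay", "go"]
T = {
    (0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
    (1, "stay"): {1: 1.0}, (1, "go"): {1: 1.0},
}
R = {(0, "stay"): 0.0, (0, "go"): 0.0, (1, "stay"): 1.0, (1, "go"): 1.0}

def backup(V, s, a):
    """One Bellman backup term: R(s, a) + gamma * sum_s' T(s, a, s') V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())

V = {s: 0.0 for s in states}
while True:
    # The max over actions distinguishes value iteration from policy evaluation.
    V_new = {s: max(backup(V, s, a) for a in actions) for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-9:
        V = V_new
        break
    V = V_new

# Greedy policy extraction once the values have converged.
policy = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}

# Analytically: V(1) = 1 / (1 - 0.9) = 10 and V(0) = 0 + 0.9 * V(1) = 9,
# with the greedy action "go" in state 0.
assert abs(V[1] - 10.0) < 1e-4 and abs(V[0] - 9.0) < 1e-4
assert policy[0] == "go"
```

Compared with policy iteration, there is a single loop: the improvement step is folded into the backup via the max over actions.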

Learning with Markov Decision Processes How can we find an optimal policy when we do not know the transition function T? Reinforcement Learning Learning with MDPs generally refers to the problem of finding an optimal policy π for an MDP with unknown transition function T. The agent learns the best actions from experience, by acting and observing the received rewards. This problem is better known under the name of Reinforcement Learning (a.k.a trial-and-error learning). 41 / 45

Reinforcement Learning http://www.ausy.tu-darmstadt.de/research/research 42 / 45

Reinforcement learning in behavioral psychology. Reinforcement: strengthening an organism's future behavior whenever that behavior is preceded by a specific antecedent stimulus. Behaviour will likely be repeated if it has been reinforced (with positive rewards). Behaviour will likely not be repeated if it has been punished (with negative rewards). http://www.cs.utexas.edu/~eladlieb/rlrg.html 43 / 45

Reinforcement Learning. Before presenting some learning algorithms, we first need to introduce the Q-value function. Q-value: A Q-value is the expected sum of rewards that an agent will receive if it executes action a in state s and then follows a policy π for the remaining steps: Q^π(s, a) = R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V^π(s'). Here a can be any action; it is not necessarily π(s). 44 / 45

Reinforcement Learning.
Input: An MDP model ⟨S, A, R⟩ with unknown transition function;
t = 0, s_0 is an initial state;
∀s ∈ S, ∀a ∈ A: Initialize Q_t(s, a) with an arbitrary value;
repeat
  π(s_t) = arg max_{a ∈ A} Q_t(s_t, a);
  Choose action a_t as π(s_t) with probability 1 − ε_t (for exploitation), and as a random action (for exploration) with probability ε_t;
  Execute action a_t and observe the received reward R(s_t, a_t) and the next state s_{t+1};
  Q_{t+1}(s_t, a_t) = (1 − α_t) Q_t(s_t, a_t) + α_t [R(s_t, a_t) + γ max_{a' ∈ A} Q_t(s_{t+1}, a')];
  t ← t + 1;
until the end of learning;
Output: A learned policy π;
Algorithm 3: The Q-learning algorithm. 45 / 45
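Algorithm 3 can be sketched as follows. The two-state environment is a made-up example, and the constant learning rate α = 0.5 and exploration rate ε = 0.2 are arbitrary choices for this toy problem:

```python
import random

# Q-learning on a toy two-state MDP (hypothetical). From state 0, "go"
# reaches state 1; from state 1, any action returns to state 0 and
# yields reward 1. The agent never sees T directly, only samples.
random.seed(0)
gamma, alpha, eps = 0.9, 0.5, 0.2
actions = ["stay", "go"]

def step(s, a):
    """Simulate the environment (unknown to the agent): (reward, next state)."""
    if s == 0:
        return 0.0, (1 if a == "go" else 0)
    return 1.0, 0  # from state 1, every action pays 1 and leads back to 0

Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
s = 0
for _ in range(20000):
    # Epsilon-greedy: explore with probability eps, else exploit.
    if random.random() < eps:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda x: Q[(s, x)])
    r, s_next = step(s, a)
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    target = r + gamma * max(Q[(s_next, x)] for x in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    s = s_next

# The greedy policy in state 0 should be "go", since it leads to the reward.
assert Q[(0, "go")] > Q[(0, "stay")]
```

Note that the update uses the max over next-state Q-values rather than the action actually taken, which is what makes Q-learning an off-policy method: it learns the greedy policy while following an ε-greedy one.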