Reinforcement Learning and Deep Reinforcement Learning


Ashis Kumer Biswas, Ph.D. (ashis.biswas@ucdenver.edu)
Deep Learning, November 5, 2018

Outline
1. Principles of Reinforcement Learning
2. The Q value
3. Q-learning example
4. Q-learning in Python
5. Non-deterministic Environment
6. Temporal difference Learning
7. Q-learning on OpenAI Gym
8. Deep Q-network (DQN)
9. DQN on Keras
10. Double DQN (DDQN)


Reinforcement Learning

Figure: the perception-action-learning loop (agent, environment, action, state, reward, interpreter). Image source: https://en.wikipedia.org/wiki/reinforcement_learning

1. An agent takes an action in an environment.
2. That action is interpreted into a reward $R$ and a representation of the state $S$.
3. These two are then fed back into the agent.

A Crawling Robot Learns to Crawl

Figure: a crawling robot developed by Francis wyffels. Video link: https://www.youtube.com/watch?v=2inrjx6ideo

Introduction to Markov Decision Processes (MDPs)

MDPs formally describe an environment for reinforcement learning (RL) in which the environment is fully observable; that is, the current state completely characterizes the process. Almost all RL problems can be formalized as MDPs.

Markov Property

"The future is independent of the past given the present."

Definition: A state $S_t$ is Markov if and only if
$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t].$$
The state captures all relevant information from the history. Once the state is known, the history may be thrown away; i.e., the state is a sufficient statistic of the future.

State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is defined as
$$\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s].$$
The state transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$:
$$\mathcal{P} = \begin{pmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{pmatrix},$$
where each row of the matrix sums to 1.
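For intuition, a transition matrix is easy to represent and sanity-check in code. The 3-state chain below is a minimal made-up example (not the Student chain from the figures):

```python
import numpy as np

# A hypothetical 3-state transition matrix (made-up numbers for illustration).
# Entry P[i, j] is the probability of moving from state i to state j.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
])

# Each row is a probability distribution, so rows must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Simulate one transition from state 0 by sampling the next state.
rng = np.random.default_rng(0)
next_state = rng.choice(3, p=P[0])
print(next_state)
```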

Markov Process

A Markov process is a memoryless random process, i.e., a sequence of random states $S_1, S_2, \ldots$ with the Markov property.

Definition: A Markov process (or Markov chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ such that:
- $\mathcal{S}$ is a finite set of states.
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$.

Example: Student Markov Chain

Sample episodes for the Student Markov chain, starting from $S_1 = \text{C1}$:
- C1, C2, C3, Pass, Sleep
- C1, FB, FB, C1, C2, Sleep
- C1, C2, C3, Pub, C2, C3, Pass, Sleep
- C1, FB, FB, C1, C2, C3, Pub, C1, FB, FB, FB, C1, C2, C3, Pub, C2, Sleep

Figure: the Student Markov chain transition graph.

Markov Reward Process (MRP)

A Markov reward process is a Markov chain with values.

Definition: A Markov reward process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
- $\mathcal{S}$ is a finite set of states.
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$.
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = E[R_{t+1} \mid S_t = s]$.
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.

The Student MRP (figure)

Return

Definition: The return $G_t$ is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$
The discount $\gamma \in [0, 1]$ is the present value of future rewards; the value of receiving reward $R$ after $k+1$ time-steps is $\gamma^k R$. This values immediate reward above delayed reward:
- $\gamma$ close to 0 leads to "myopic" evaluation.
- $\gamma$ close to 1 leads to "far-sighted" evaluation.

Why Discount?

Most Markov reward and decision processes are discounted. Why?
- It avoids infinite returns in cyclic Markov processes.
- Uncertainty about the future may not be fully represented.
- If the reward is financial, immediate rewards may earn more interest than delayed rewards.
- Animal/human behavior shows a preference for immediate reward.

Value Function

The value function $v(s)$ gives the long-term value of state $s$.

Definition: The state-value function $v(s)$ of an MRP is the expected return starting from state $s$:
$$v(s) = E[G_t \mid S_t = s].$$

The Student MRP: Sample Returns

Sample returns for the Student MRP, starting from $S_1 = \text{C1}$ with $\gamma = \frac{1}{2}$, using $G_1 = R_2 + \gamma R_3 + \gamma^2 R_4 + \cdots$:
- C1, C2, C3, Pass, Sleep: $G_1 = -2 - 2 \cdot \frac{1}{2} - 2 \cdot \frac{1}{4} + 10 \cdot \frac{1}{8} = -2.25$
- C1, FB, FB, C1, C2, Sleep: $G_1 = -2 - 1 \cdot \frac{1}{2} - 1 \cdot \frac{1}{4} - 2 \cdot \frac{1}{8} - 2 \cdot \frac{1}{16} = -3.125$
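These returns can be checked mechanically. The sketch below computes $G_1$ for both episodes; the per-step reward values are read off the Student MRP figure and are stated here as an assumption, since the figure itself is not reproduced:

```python
# Minimal check of the sample returns above. Reward values are an assumption
# taken from the Student MRP figure: C1/C2/C3 give -2, FB gives -1, Pass +10.
def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * R_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([-2, -2, -2, +10], gamma=0.5))     # -2.25
print(discounted_return([-2, -1, -1, -2, -2], gamma=0.5))  # -3.125
```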

State-Value Function for the Student MRP, $\gamma = 0$ (figure): a myopic evaluation.

State-Value Function for the Student MRP, $\gamma = 0.9$ (figure).

State-Value Function for the Student MRP, $\gamma = 1$ (figure): a far-sighted evaluation.

Bellman Equation for MRPs

The value function $v(s)$ can be decomposed into two parts:
- the immediate reward $R_{t+1}$, and
- the discounted value of the successor state, $\gamma v(S_{t+1})$:
$$\begin{aligned}
v(s) &= E[G_t \mid S_t = s] \\
&= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] \\
&= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s] \\
&= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s].
\end{aligned}$$

Bellman Equation for MRPs (continued)

Averaging over successor states, $v(s) = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$ can be written as
$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} \, v(s').$$

Bellman Equation for the Student MRP (figure)

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices:
$$v = \mathcal{R} + \gamma \mathcal{P} v,$$
where $v$ is a column vector with one entry per state:
$$\begin{pmatrix} v(1) \\ \vdots \\ v(n) \end{pmatrix} = \begin{pmatrix} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{pmatrix} + \gamma \begin{pmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{pmatrix} \begin{pmatrix} v(1) \\ \vdots \\ v(n) \end{pmatrix}.$$

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly:
$$v = \mathcal{R} + \gamma \mathcal{P} v \;\Rightarrow\; (I - \gamma \mathcal{P})\, v = \mathcal{R} \;\Rightarrow\; v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}.$$
The computational complexity is $O(n^3)$ for $n$ states, so the direct solution is only feasible for small MRPs (see the sketch below). For large MRPs, iterative methods have been developed:
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-difference learning
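A minimal sketch of the direct solution for a small 3-state MRP; the transition matrix and rewards below are illustrative assumptions, not the Student MRP from the figures:

```python
import numpy as np

# Hypothetical 3-state MRP (illustrative numbers, not the Student MRP).
P = np.array([
    [0.0, 0.9, 0.1],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # absorbing terminal state
])
R = np.array([-2.0, -2.0, 0.0])  # expected immediate reward per state
gamma = 0.9

# Direct solution of v = R + gamma * P v  =>  (I - gamma * P) v = R.
# np.linalg.solve is preferred over forming the matrix inverse explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```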

Markov Decision Process (MDP)

A Markov decision process is a Markov reward process with decisions. It is an environment in which all states are Markov.

Definition: A Markov decision process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
- $\mathcal{S}$ is a finite set of states.
- $\mathcal{A}$ is a finite set of actions.
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$.
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$.
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$.

Student Markov Process (figure)

Student Markov Reward Process, MRP (figure)

Student Markov Decision Process, MDP (figure)

Policies (1 of 2)

Definition: A policy $\pi$ is a distribution over actions given states,
$$\pi(a \mid s) = P[A_t = a \mid S_t = s].$$
A policy fully defines the behavior of an agent. MDP policies depend on the current state (not the history); that is, policies are stationary (time-independent): $A_t \sim \pi(\cdot \mid S_t)$ for all $t > 0$.

Policies (2 of 2)

Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$:
- The state sequence $S_1, S_2, \ldots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$.
- The state and reward sequence $S_1, R_2, S_2, \ldots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$, where
$$\mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}, \qquad \mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^a_s.$$
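A minimal sketch of this policy-averaging for a hypothetical 2-state, 2-action MDP; all numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] and R[a, s] (illustrative numbers).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],  # transitions under action 0
    [[0.3, 0.7], [0.6, 0.4]],  # transitions under action 1
])
R = np.array([
    [1.0, 0.0],                # rewards under action 0
    [0.0, 2.0],                # rewards under action 1
])
pi = np.array([[0.5, 0.5], [0.2, 0.8]])  # pi[s, a]

# Average the dynamics over the policy:
#   P_pi[s, s'] = sum_a pi(a|s) P^a_{ss'},   R_pi[s] = sum_a pi(a|s) R^a_s.
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = np.einsum('sa,as->s', pi, R)
print(P_pi)  # rows still sum to 1: it is a Markov chain
print(R_pi)
```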

Value Functions

Definition: The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$:
$$v_\pi(s) = E_\pi[G_t \mid S_t = s].$$

Definition: The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a].$$

Example: State-Value Function for the Student MDP (figure)

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus the discounted value of the successor state:
$$v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s].$$
The action-value function can be decomposed likewise:
$$q_\pi(s, a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a].$$

Bellman Expectation Equation for $v_\pi$ (1 of 2)
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$

Bellman Expectation Equation for $q_\pi$ (1 of 2)
$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s')$$

Bellman Expectation Equation for $v_\pi$ (2 of 2)
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s') \right)$$

Bellman Expectation Equation for $q_\pi$ (2 of 2)
$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$
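Substituting one equation into the other gives a fixed-point iteration for $v_\pi$, known as iterative policy evaluation. The sketch below applies it to the same hypothetical 2-state MDP used in the policy example above; it is an illustration under those assumptions, not code from the slides:

```python
import numpy as np

# Reuse the hypothetical MDP from the policy example above.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.3, 0.7], [0.6, 0.4]]])  # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[a, s]
pi = np.array([[0.5, 0.5], [0.2, 0.8]])   # pi[s, a]
gamma = 0.9

# Iterate v <- sum_a pi(a|s) (R^a_s + gamma * sum_s' P^a_{ss'} v(s')).
v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * (P @ v)          # q[a, s] = R^a_s + gamma * E[v(s')]
    v_new = (pi * q.T).sum(axis=1)   # average over actions under pi
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)
```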

Bellman Expectation Equation in the Student MDP (figure)

Optimal Value Function

Definition: The optimal state-value function $v_*(s)$ is the maximum state-value function over all policies:
$$v_*(s) = \max_\pi v_\pi(s).$$

Definition: The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies:
$$q_*(s, a) = \max_\pi q_\pi(s, a).$$

The optimal value function specifies the best possible performance in the MDP. An MDP is considered solved when we know the optimal value function.

Optimal State-Value Function for the Student MDP (figure)

Optimal Action-Value Function for the Student MDP (figure)

Optimal Policy

Define a partial ordering over policies: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$.

Theorem: For any Markov decision process,
- there exists an optimal policy $\pi_*$ that is better than or equal to all other policies: $\pi_* \geq \pi$ for all $\pi$;
- all optimal policies achieve the optimal state-value function, $v_{\pi_*}(s) = v_*(s)$;
- all optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$.

Finding an Optimal Policy

An optimal policy can be found by maximizing over $q_*(s, a)$:
$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise.} \end{cases}$$
There is always a deterministic optimal policy for any MDP. If we know $q_*(s, a)$, we immediately have an optimal policy.
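In code, extracting this deterministic policy from a table of $q_*$ values is a single argmax per state; the q-values below are made up for illustration:

```python
import numpy as np

# Hypothetical q* table: q[s, a] for 3 states and 2 actions (made-up values).
q_star = np.array([
    [ 1.0,  2.5],
    [ 0.3, -0.1],
    [-1.0,  0.0],
])

# Deterministic optimal policy: pick the argmax action in each state.
pi_star = q_star.argmax(axis=1)
print(pi_star)  # [1 0 1]
```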

Example: Optimal Policy for the Student MDP (figure)

Bellman Optimality Equation for $v_*$

The optimal value functions are recursively related by the Bellman optimality equations:
$$v_*(s) = \max_a q_*(s, a)$$

Bellman Optimality Equation for $q_*$
$$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s')$$

Bellman Optimality Equation for $v_*$ (2)
$$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s') \right)$$

Bellman Optimality Equation for $q_*$ (2)
$$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$$

Bellman Optimality Equation for the Student MDP (figure)

Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear and, in general, has no closed-form solution. However, many iterative solution methods are available:
- Policy gradient
- Q-learning
- SARSA


The Q Value

How do we find the optimal policy $\pi_*$? How does the agent learn by interacting with the environment? Instead of searching for the policy that maximizes the state-value for all states, we can look for the action that maximizes the Q value in every state.
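In the spirit of the outline's "Q-learning in Python" and "Q-learning on OpenAI Gym" topics, here is a minimal tabular Q-learning sketch. It assumes OpenAI Gym's FrozenLake-v1 environment and the gym >= 0.26 API; the hyperparameters are made-up choices, not values from the lecture:

```python
import numpy as np
import gym  # assumes gym >= 0.26: reset() returns (obs, info), step() a 5-tuple

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # made-up hyperparameters

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update toward the Bellman optimality target.
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

# Greedy policy from the learned Q-table (4x4 FrozenLake grid).
print(Q.argmax(axis=1).reshape(4, 4))
```

Each update moves $Q(s, a)$ toward the target $R + \gamma \max_{a'} Q(s', a')$, which is exactly the fixed point described by the Bellman optimality equation for $q_*$ above.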


Thanks! Questions?