Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina


Reinforcement Learning

Introduction

Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has an outcome, so we know what to predict. Reinforcement learning lies in between: it has no explicit supervision, so it uses a reward system to learn the feature-outcome relationship. The crucial advantage of reinforcement learning is its non-greedy nature: we do not need to improve performance in the short term, but rather optimize long-term achievement.

RL terminology Reinforcement learning is a dynamic process in which, at each step, the decision rule or policy is updated based on new data and the reward system. Terminology used in reinforcement learning: Agent: whoever carries out the learned decisions during the process (the robot in AI). Action ($A$): a decision to be taken during the process. State ($S$): environment variables that may interact with the action. Reward ($R$): a value system to evaluate an action given the state. Note that $(A, S, R)$ are time-step dependent, so we write $(A_t, S_t, R_t)$ for time step $t$.

Reinforcement learning diagram

Maze example

Maze example: continue

Maze example: continue

Mountain car problem

RL Framework

RL Notation At time step $t$, the agent observes a state $S_t$ from a state space $\mathcal{S}_t$ and selects an action $A_t$ from an action space $\mathcal{A}_t$. Together the state and action result in a transition to a new state $S_{t+1}$. Given $(S_t, A_t, S_{t+1})$, the agent receives an immediate reward $R_t = r_t(S_t, A_t, S_{t+1}) \in \mathbb{R}$, where $r_t(\cdot,\cdot,\cdot)$ is called the immediate reward function.

RL mathematical formulation At time $t$, we assume a transition probability function from $(S_t = s, A_t = a)$ to $(S_{t+1} = s')$: $p_t(s'|s,a) \ge 0$, $\int_{s'} p_t(s'|s,a)\,ds' = 1$. We also assume $A_t$ given $S_t$ follows a probability distribution: $\pi_t(a|s) \ge 0$, $\int_a \pi_t(a|s)\,da = 1$. A trajectory (training sample) $(s_1, a_1, s_2, \ldots, s_T, a_T, s_{T+1})$ is generated as follows: start from an initial state $s_1$ drawn from a probability distribution $p(s)$; for $t = 1, 2, \ldots, T$ ($T$ is the total number of steps), (a) $a_t$ is chosen from $\pi_t(\cdot|s_t)$, (b) the next state $s_{t+1}$ is drawn from $p_t(\cdot|s_t, a_t)$. The problem is called finite horizon if $T < \infty$ and infinite horizon if $T = \infty$.

Goal of RL Define the return at time $t$ as $\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1})$, where $\gamma \in [0,1)$ is called the discount factor (discounting long trajectories). An action policy, $\pi = (\pi_1, \ldots, \pi_T)$, is a sequence of probability distribution functions, where $\pi_t$ is a probability distribution for $A_t$ given $S_t$. The goal of RL is to learn the optimal policy, $\pi^* = (\pi^*_1, \pi^*_2, \ldots, \pi^*_T)$, that maximizes the expected return $E_{\pi}[\sum_{j=1}^{T} \gamma^{j-1} r_j(S_j, A_j, S_{j+1})]$, where $E_{\pi}(\cdot)$ means $A_t \mid S_t \sim \pi_t(\cdot|S_t)$.
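To make the notation concrete, here is a minimal sketch (not from the slides) that samples one trajectory from a hypothetical two-state, two-action problem and accumulates the discounted return $\sum_{j=1}^{T} \gamma^{j-1} r_j$; the transition probabilities, rewards, and uniform policy are all invented for illustration.

```python
# A minimal sketch: simulate one trajectory (s_1, a_1, ..., s_T, a_T, s_{T+1})
# from a toy finite MDP and compute its discounted return.  The two-state,
# two-action arrays below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T, gamma = 2, 2, 10, 0.9
# p[s, a, s'] = transition probability; r[s, a, s'] = immediate reward
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])

def policy(s):
    """A fixed stochastic policy pi_t(a|s); here uniform over actions."""
    return rng.integers(n_actions)

s = rng.integers(n_states)          # s_1, here drawn uniformly
ret, discount = 0.0, 1.0
for t in range(T):
    a = policy(s)
    s_next = rng.choice(n_states, p=p[s, a])   # s_{t+1} ~ p_t(.|s_t, a_t)
    ret += discount * r[s, a, s_next]          # gamma^{t} * r(s_t, a_t, s_{t+1})
    discount *= gamma
    s = s_next

print("discounted return of the sampled trajectory:", ret)
```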

Optimal policy RL aims to find the best action decision rules such that the average long-term reward is maximized when those rules are implemented. Note: $\pi^*$ is a function of the states, so for any individual we only know what the action should be at time $t$ after observing its state at time $t$. This is related to so-called adaptive or dynamic decision making.

How is supervised learning framed in the RL context? We can imagine $S_t$ to be all the data (both features and outcomes) collected by step $t$. Then $A_t$ is a prediction rule chosen from a class of prediction functions based on $S_t$ (it need not be a perfect prediction function; it can even be a random prediction), so $\pi_t$ is the probabilistic selection of which prediction function to use at $t$. Based on $(S_t, A_t)$, $S_{t+1}$ can be $S_t$ augmented with additionally collected data, $S_t$ with individual errors, or just $S_t$ itself. $R_t$ is the prediction error evaluated on the data. The goal is to learn the best prediction rule, and RL methods can help!

State-Action and State Value Functions

Two important concepts in RL State-action value function (SAV): the expected return increment at time $t$ given state $S_t = s$ and action $A_t = a$: $Q^{\pi}_t(s,a) = E_{\pi}[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \mid S_t = s, A_t = a]$. $Q^*_t(s,a) = \max_{\pi} Q^{\pi}_t(s,a)$ is the optimal expected return at time $t$. State value function (SV): the expected return increment at time $t$ given state $S_t = s$: $V^{\pi}_t(s) = E_{\pi}[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \mid S_t = s]$. Similarly, $V^*_t(s) = \max_{\pi} V^{\pi}_t(s)$. Clearly, $V^{\pi}_t(s) = \int_a Q^{\pi}_t(s,a)\,\pi_t(a|s)\,da$.
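For a discrete action space the integral relating $V^{\pi}_t$ and $Q^{\pi}_t$ becomes a sum over actions; the following tiny check uses invented Q-values and policy probabilities.

```python
# Minimal check of V^pi_t(s) = sum_a Q^pi_t(s,a) * pi_t(a|s) for a discrete
# action space.  The Q-values and policy probabilities below are invented.
import numpy as np

Q_t = np.array([[1.0, 3.0],     # Q^pi_t(s, a) for 2 states x 2 actions
                [0.5, 2.0]])
pi_t = np.array([[0.4, 0.6],    # pi_t(a|s), each row sums to 1
                 [0.9, 0.1]])

V_t = (Q_t * pi_t).sum(axis=1)  # V^pi_t(s) for each state s
print(V_t)                      # [2.2, 0.65]
```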

Bellman equations The Bellman equation for the SV: $V^{\pi}_t(s) = E_{\pi}[r_t(s, A_t, S_{t+1}) + \gamma V^{\pi}_{t+1}(S_{t+1}) \mid S_t = s] = \int_{s'} \int_a [r_t(s, a, s') + \gamma V^{\pi}_{t+1}(s')]\, \pi_t(a|s)\, p_t(s'|s,a)\, da\, ds'$. The Bellman equation for the SAV: $Q^{\pi}_t(s,a) = E_{\pi}[r_t(s, a, S_{t+1}) + \gamma Q^{\pi}_{t+1}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] = \int_{s'} \int_{a'} [r_t(s, a, s') + \gamma Q^{\pi}_{t+1}(s', a')]\, \pi_{t+1}(a'|s')\, p_t(s'|s,a)\, da'\, ds'$.

Optimal policy learning: Bellman equation Bellman equations for the optimal policy: $V^{\pi^*}_t(s) = \max_a Q^{\pi^*}_t(s,a)$, $Q^{\pi^*}_t(s,a) = E_{\pi^*}[r_t(s, a, S_{t+1}) + \gamma V^{\pi^*}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a]$, $\pi^*_t(a|s) = I\{a = \operatorname{argmax}_{a'} Q^{\pi^*}_t(s, a')\}$.
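The finite-horizon Bellman recursions can be written down directly for a small tabular problem; the sketch below (a toy example, not from the slides) performs backward policy evaluation of $V^{\pi}_t$ together with the optimal backup $V^*_t(s) = \max_a Q^*_t(s,a)$, with invented transitions and rewards.

```python
# Backward Bellman recursions on a toy finite-horizon MDP: policy evaluation
# under a fixed policy pi, and the optimal backup with greedy action extraction.
import numpy as np

n_s, n_a, T, gamma = 2, 2, 5, 0.9
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])      # p_t(s'|s,a), time-homogeneous here
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])      # r_t(s,a,s')
pi = np.full((n_s, n_a), 0.5)                 # a fixed policy pi_t(a|s)

V_pi = np.zeros((T + 2, n_s))                 # V^pi_{T+1} = 0
V_opt = np.zeros((T + 2, n_s))                # V*_{T+1} = 0
for t in range(T, 0, -1):
    # Q_t(s,a) = sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma * V_{t+1}(s') ]
    Q_pi = (p * (r + gamma * V_pi[t + 1])).sum(axis=2)
    Q_opt = (p * (r + gamma * V_opt[t + 1])).sum(axis=2)
    V_pi[t] = (pi * Q_pi).sum(axis=1)         # evaluation under pi
    V_opt[t] = Q_opt.max(axis=1)              # optimal Bellman backup

greedy_action_at_1 = Q_opt.argmax(axis=1)     # pi*_1(s): argmax_a Q*_1(s,a)
print(V_pi[1], V_opt[1], greedy_action_at_1)
```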

Reinforcement Learning for Finite Horizon

Value function given $\pi$ For finite $T$, the Bellman equations suggest a backward procedure to evaluate the value function associated with a particular policy. Start from time $T$: we can learn $Q^{\pi}_T(s,a) = E[R_T \mid S_T = s, A_T = a]\, I(a \sim \pi(\cdot|s))$. At time $T-1$, we learn $Q^{\pi}_{T-1}(s,a)$ as $E[R_{T-1} + \gamma E_{\pi}[Q^{\pi}_T(S_T, A_T) \mid S_T] \mid S_{T-1} = s, A_{T-1} = a]\, I(a \sim \pi(\cdot|s))$. We continue learning backwards until time 1. Note that each step can be estimated using parametric, nonparametric, or machine learning methods.

Optimal policy learning for finite horizon (Q-learning) Start from time $T$: we can learn $Q^{\pi^*}_T(s,a) = E[R_T \mid S_T = s, A_T = a]$ and take $\pi^*_T(s)$ to put probability 1 on $a = \operatorname{argmax}_{a'} Q^{\pi^*}_T(s,a')$. At time $T-1$, we learn $Q^{\pi^*}_{T-1}(s,a)$ as $E[R_{T-1} + \gamma \max_{a'} Q^{\pi^*}_T(S_T, a') \mid S_{T-1} = s, A_{T-1} = a]$ and obtain $\pi^*_{T-1}$ as the policy putting probability 1 on $a = \operatorname{argmax}_{a'} Q^{\pi^*}_{T-1}(s,a')$. We perform the same learning procedure backwards until time 1 to learn all the optimal policies.
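A minimal data-driven sketch of this backward Q-learning for a finite-state, finite-action problem is given below (not from the slides): the conditional means $E[\cdot \mid S_t = s, A_t = a]$ are estimated by cell averages, which plays the role of the regression step on each backward pass; the behavior data are invented.

```python
# Backward Q-learning from n observed T-step trajectories (toy data).
import numpy as np

rng = np.random.default_rng(1)
n, T, n_s, n_a, gamma = 500, 4, 2, 2, 0.9

# invented behavior data: states, actions, next states, rewards
S = rng.integers(n_s, size=(n, T))
A = rng.integers(n_a, size=(n, T))
S_next = rng.integers(n_s, size=(n, T))
R = rng.normal(size=(n, T)) + S + A            # made-up rewards

Q = np.zeros((T + 1, n_s, n_a))                # Q*_{T+1} = 0 by convention
for t in range(T - 1, -1, -1):                 # t = T-1, ..., 0 (0-based time)
    target = R[:, t] + gamma * Q[t + 1, S_next[:, t]].max(axis=1)
    for s in range(n_s):
        for a in range(n_a):
            cell = (S[:, t] == s) & (A[:, t] == a)
            if cell.any():                     # regression = cell average here
                Q[t, s, a] = target[cell].mean()

pi_star = Q[:T].argmax(axis=2)                 # greedy optimal policy at each t
print(pi_star)
```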

Statistical models for the state-action value function Parametric/semiparametric models for $Q^{\pi}(s,a)$ are commonly used. We assume $Q^{\pi}(s,a) = \sum_{b=1}^{B} \theta_b \phi_b(s,a)$, where the $\phi_b(s,a)$ are a sequence of basis functions. In other words, the policy is indirectly represented by the $\theta_b$'s. From the Bellman equation, we note that the conditional mean of $R_t = r(S_t, A_t, S_{t+1})$ given $(S_t, A_t)$ is $Q^{\pi}(S_t, A_t) - \gamma E_{\pi}[Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t, A_t] = \theta^T \psi(S_t, A_t)$ under policy $\pi$, where $\psi(s,a) = \phi(s,a) - \gamma E_{\pi}[\phi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.

Numerical implementation Suppose we have data from $n$ subjects, each with a training sample of $T$ steps, or $n$ training $T$-step samples from the same agent, $(S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT}, A_{iT}, S_{i,T+1})$. We estimate $\psi_b(s,a)$ by $\hat\psi_b(s,a) = \phi_b(s,a) - \gamma\, \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)\, E_{\pi}[\phi_b(S_{i,t+1}, A_{i,t+1}) \mid S_{i,t+1}]}{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)}$. We then perform a least-squares estimation: $\min_{\theta} \frac{1}{nT} \sum_{i=1}^{n}\sum_{t=1}^{T} I(A_{it} \mid S_{it} \sim \pi)\, [\theta^T \hat\psi(S_{it}, A_{it}) - R_{it}]^2$, where $A_{it} \mid S_{it} \sim \pi$ means that $A_{it}$ was obtained by following the policy $\pi$.
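Here is a minimal sketch (not the authors' implementation) of the least-squares step: fit $\theta$ in $Q^{\pi}(s,a) = \theta^T \phi(s,a)$ from on-policy transitions, using the observed next state-action pair as a single-sample plug-in for $E_{\pi}[\phi(S_{t+1}, A_{t+1}) \mid S_t, A_t]$; the data and basis functions are invented.

```python
# Least-squares fitting of theta with linear basis functions (toy data).
import numpy as np

rng = np.random.default_rng(2)
gamma, n_obs = 0.9, 1000

def phi(s, a):
    """Invented basis: polynomial terms in the (scalar) state and action."""
    return np.array([1.0, s, a, s * a, s ** 2])

# made-up on-policy transitions (s_t, a_t, r_t, s_{t+1}, a_{t+1})
s = rng.normal(size=n_obs)
a = rng.integers(0, 2, size=n_obs).astype(float)
s_next = 0.8 * s + rng.normal(scale=0.1, size=n_obs)
a_next = rng.integers(0, 2, size=n_obs).astype(float)
r = s + a + rng.normal(scale=0.1, size=n_obs)

Phi = np.stack([phi(si, ai) for si, ai in zip(s, a)])
Phi_next = np.stack([phi(si, ai) for si, ai in zip(s_next, a_next)])
Psi = Phi - gamma * Phi_next                 # one-sample estimate of psi(s,a)

# least squares: min_theta (1/nT) sum ( theta^T psi - r )^2
theta, *_ = np.linalg.lstsq(Psi, r, rcond=None)
print("fitted theta:", theta)
```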

More on numerical implementation Regularization may be introduced to obtain a sparser solution. $L_2$-minimization can be replaced by $L_1$-minimization to gain robustness. Choice of basis functions: radial basis functions, where the kernel can be the usual Gaussian kernel (one possible definition of the distance $d(s, s')$ is the shortest path from $s$ to $s'$ in the graph defined by the transition probabilities).

Alternative methods Modelling the transition probability functions. Active policy iteration (active learning): actively update the sampling policy.

Reinforcement Learning for Infinite Horizon

Value function learning given $\pi$ When $T = \infty$ or $T$ is large, the backward Q-learning method may not be applicable. The remedy is to take advantage of the stability of the process when $t$ is large, so we can assume the following Markov decision process (MDP): the MDP assumes that the state and action spaces are constant over time; the MDP assumes $p_t(s'|s,a)$ to be independent of $t$; the reward function $r_t(s,a,s')$ is independent of $t$. The MDP assumption is plausible for a long horizon and after a certain number of steps.

Bellman equations under MDP for infinite horizon Under the MDP, $Q^{\pi}_t(s,a) = Q^{\pi}(s,a)$ and $V^{\pi}_t(s) = V^{\pi}(s)$. The Bellman equations become $V^{\pi}(s) = E_{\pi}[r(s, A_t, S_{t+1}) + \gamma V^{\pi}(S_{t+1}) \mid S_t = s]$ and $Q^{\pi}(s,a) = E_{\pi}[r(s, a, S_{t+1}) + \gamma Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.

On- and off-policy estimation We can still apply the least-squares learning algorithm for $Q^{\pi}(s,a)$: minimize $E_{\pi}[\sum_{t=1}^{T} (\theta^T \psi(S_t, A_t) - R_t)^2]$ using history samples $(S_t, A_t)$ collected by following the target policy $\pi$. This is called on-policy reinforcement learning. However, not every policy has been seen in the history sample. An alternative method is to use importance sampling: $E_{\pi}[\sum_{t=1}^{T} (\theta^T \psi(S_t, A_t) - R_t)^2] = E_{\tilde\pi}[\sum_{t=1}^{T} (\theta^T \psi(S_t, A_t) - R_t)^2\, w_t]$, where $w_t = \prod_{j=1}^{t} \pi(A_j|S_j) \big/ \prod_{j=1}^{t} \tilde\pi(A_j|S_j)$ and $\tilde\pi$ is the behavior policy that generated the history sample.
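A minimal sketch (not from the slides) of the cumulative importance weights $w_t$ for one trajectory with discrete actions; the policies and data are invented.

```python
# Per-step importance weights w_t = prod_{j<=t} pi(A_j|S_j) / pi_behavior(A_j|S_j).
import numpy as np

def importance_weights(states, actions, target_probs, behavior_probs):
    """Cumulative importance weights w_1, ..., w_T for one trajectory.

    target_probs(s, a) and behavior_probs(s, a) return pi(a|s) and
    pi_behavior(a|s); behavior_probs must be positive wherever the
    target policy puts mass.
    """
    ratios = np.array([target_probs(s, a) / behavior_probs(s, a)
                       for s, a in zip(states, actions)])
    return np.cumprod(ratios)

# toy example: 2 actions, target policy prefers action 1, behavior is uniform
target = lambda s, a: 0.8 if a == 1 else 0.2
behavior = lambda s, a: 0.5
w = importance_weights(states=[0, 1, 0], actions=[1, 0, 1],
                       target_probs=target, behavior_probs=behavior)
print(w)   # [1.6, 0.64, 1.024]
```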

Off-policy iteration: more We need one assumption: there exists a policy in the history sample, $\tilde\pi$, such that $\tilde\pi(a|s) > 0$ for all $(a, s)$. Adaptive importance weighting replaces $w_t$ by $w_t^{\nu}$ and chooses $\nu$ via cross-validation. When the history sample contains multiple policies $\tilde\pi$'s, we can obtain an estimate by importance weighting with respect to each policy and then aggregate the estimates (sample-reuse policy iteration).

Reinforcement Learning for Optimal Policy The concept of RL is to make use of existing data from some given policies to learn potentially improved policies (EXPLOITATION); it then tries new policies to collect additional data evidence (EXPLORATION). Reinforcement learning methods fall mostly into two groups: (policy iteration) model-based or learning methods to approximate the optimal SAV; (policy search) model-based or learning methods to directly maximize the SV for estimating $\pi^*$.

Optimal policy learning: policy iteration procedure Start from a policy $\pi$. Policy evaluation: evaluate $Q^{\pi}(s,a)$ and thus $V^{\pi}(s)$. Policy improvement: update $\pi(a|s)$ to be $I(a = a^{\pi}(s))$, where $a^{\pi}(s)$ is the action maximizing $Q^{\pi}(s,a)$. Iterate between the policy evaluation and policy improvement steps.
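A minimal tabular policy-iteration sketch on a toy infinite-horizon MDP (not from the slides): it alternates exact policy evaluation, solving the linear Bellman system, with greedy policy improvement until the policy stops changing; the transitions and rewards are invented.

```python
# Tabular policy iteration on a toy MDP.
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])      # p(s'|s,a), invented
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])      # r(s,a,s'), invented

def evaluate(policy):
    """Solve V = r_pi + gamma * P_pi V for a deterministic policy."""
    P_pi = p[np.arange(n_s), policy]                  # (n_s, n_s)
    r_pi = (P_pi * r[np.arange(n_s), policy]).sum(axis=1)
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

policy = np.zeros(n_s, dtype=int)             # initial policy: action 0 everywhere
while True:
    V = evaluate(policy)                      # policy evaluation
    Q = (p * (r + gamma * V)).sum(axis=2)     # Q(s,a) under current V
    new_policy = Q.argmax(axis=1)             # greedy policy improvement
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy:", policy, "value:", V)
```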

Soft policy iteration procedure Selecting a deterministic policy update may be too greedy if the initial policy is far from the optimal one. Softer policy updates include: (softmax policy improvement) $\pi(a|s) \propto \exp\{Q^{\pi}(s,a)/\tau\}$; (ɛ-greedy policy improvement) $\pi(a|s)$ has probability $1 - \epsilon + \epsilon/m$ at $a = a^{\pi}(s)$ and probability $\epsilon/m$ at the other $a$'s, where $m$ is the number of possible actions.
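The two soft improvement rules are easy to write down for one state's vector of Q-values; the sketch below (invented Q-values) is purely illustrative.

```python
# Softmax (Boltzmann) and epsilon-greedy policy improvement for one state.
import numpy as np

def softmax_policy(q_values, tau=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/tau)."""
    z = np.exp((q_values - q_values.max()) / tau)   # subtract max for stability
    return z / z.sum()

def epsilon_greedy_policy(q_values, eps=0.1):
    """Probability 1 - eps + eps/m on the greedy action, eps/m elsewhere."""
    m = len(q_values)
    probs = np.full(m, eps / m)
    probs[np.argmax(q_values)] += 1.0 - eps
    return probs

q = np.array([1.0, 2.0, 0.5])         # invented Q(s, .) for a single state
print(softmax_policy(q, tau=0.5))     # smoother than the greedy update
print(epsilon_greedy_policy(q, 0.1))  # [0.0333..., 0.9333..., 0.0333...]
```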

Optimal policy learning: direct policy search The direct policy search approach aims to find the policy maximizing the expected return. Suppose we model the policy as $\pi(a|s;\theta)$. The expected return under $\pi$ is given by $J(\theta) = \int p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t;\theta) \left\{\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\right\} ds_1 \cdots ds_{T+1}\, da_1 \cdots da_T$. We optimize $J(\theta)$ to find the optimal $\theta$. A gradient approach can be adopted for the optimization. EM-based policy search can also be used. Importance sampling can be used for evaluating $J(\theta)$.
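One common gradient approach (a REINFORCE-style estimator, not necessarily the slides' method) estimates $\nabla_{\theta} J(\theta)$ by $E[(\sum_t \nabla_\theta \log \pi(a_t|s_t;\theta)) \times \text{(discounted return)}]$; the sketch below applies it to a softmax policy on a toy two-state, two-action MDP with invented dynamics.

```python
# REINFORCE-style stochastic gradient ascent on J(theta) for a softmax policy.
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, T, gamma, lr = 2, 2, 20, 0.95, 0.05
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])

theta = np.zeros((n_s, n_a))                      # softmax policy parameters

def pi(s, theta):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

for it in range(200):                             # crude stochastic gradient ascent
    s = rng.integers(n_s)
    grad_logp, ret, disc = np.zeros_like(theta), 0.0, 1.0
    for t in range(T):
        probs = pi(s, theta)
        a = rng.choice(n_a, p=probs)
        # gradient of log pi(a|s): one-hot indicator minus probabilities, row s
        grad_logp[s] += (np.eye(n_a)[a] - probs)
        s_next = rng.choice(n_s, p=p[s, a])
        ret += disc * r[s, a, s_next]
        disc *= gamma
        s = s_next
    theta += lr * grad_logp * ret                 # single-trajectory REINFORCE step

print("learned policy pi(.|s):", [pi(s, theta) for s in range(n_s)])
```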

How does RL work in artificial intelligence? The agent (robot) starts with an initial policy, $\pi^{(0)}$, and runs a trial for a period (each trial is sometimes called an epoch). The agent uses RL algorithms (Q-learning, least-squares estimation) to learn the state-action value function for $\pi^{(0)}$. The agent then uses policy iteration or the direct policy search method to obtain an improved policy $\pi^{(1)}$ and runs a new trial under this policy. The agent continues this process, where SAV function learning can reuse data from all previous policies via importance sampling. It stops when the value or the policy shows negligible change.

What can statisticians do with RL? Improve the design of the initial policy (random policy or other choices). Pilot trials? Improve the learning methods in RL algorithms. Improve the policy update. Characterize convergence rates, and so on. Design better reward systems.

Simulated examples

Robot-Arm control example

Robot-Arm control example: continue

Robot-Arm control example: continue

Mountain car example Action space: force applied to the car, $a \in \{-0.2, 0.2, 0\}$. State space: $(x, \dot{x})$, where $x$ is the horizontal position ($\in [-1.2, 0.5]$) and $\dot{x}$ is the velocity ($\in [-1.5, 1.5]$). Transition: $x_{t+1} = x_t + \dot{x}_{t+1}\,\delta t$, $\dot{x}_{t+1} = \dot{x}_t + (-9.8\, w \cos(3x_t) + a_t/w - k\dot{x}_t)\,\delta t$, where $w$ is the mass (0.2 kg), $k$ is the friction coefficient (0.3), and $\delta t$ is 0.1 second. Reward: $r(s, a, s') = 1$ if $x_{s'} \ge 0.5$ and $-0.01$ otherwise. Policy iteration uses Gaussian kernels with centers at $\{-1.2, -0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}$ and $\sigma = 1$.
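A minimal simulation sketch of the mountain-car dynamics described above; the parameter values come from the slide, while the clipping of the state to its stated ranges and the random placeholder policy are assumptions added for illustration.

```python
# Simulating the mountain-car transition and reward under a random policy.
import numpy as np

rng = np.random.default_rng(4)
w, k, dt = 0.2, 0.3, 0.1                 # mass, friction coefficient, time step
actions = np.array([-0.2, 0.2, 0.0])     # available forces

def step(x, xdot, a):
    """One transition of the mountain-car dynamics, with state clipping."""
    xdot_new = xdot + (-9.8 * w * np.cos(3 * x) + a / w - k * xdot) * dt
    xdot_new = np.clip(xdot_new, -1.5, 1.5)
    x_new = np.clip(x + xdot_new * dt, -1.2, 0.5)
    reward = 1.0 if x_new >= 0.5 else -0.01
    return x_new, xdot_new, reward

x, xdot, total = -0.5, 0.0, 0.0          # start near the valley bottom
for t in range(500):
    a = rng.choice(actions)              # placeholder random policy
    x, xdot, rwd = step(x, xdot, a)
    total += rwd
    if rwd == 1.0:                       # reached the goal position
        break

print(f"stopped at t={t}, x={x:.3f}, cumulative reward={total:.2f}")
```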

Experiment results

Experiment results