Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina


Reinforcement Learning

Introduction

Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has an outcome, so we know what to predict. Reinforcement learning lies in between: it has no explicit supervision, so it uses a reward system to learn the feature-outcome relationship. The crucial advantage of reinforcement learning is its non-greedy nature: we do not need to improve performance in the short term, but rather optimize long-term achievement.

RL terminology Reinforcement learning is a dynamic process in which, at each step, the decision rule or policy is updated based on new data and the reward system. Terminology used in reinforcement learning: Agent: whoever carries out the learned decisions during the process (the robot in AI). Action ($A$): a decision to be taken during the process. State ($S$): environment variables that may interact with the action. Reward ($R$): a value system to evaluate an action given the state. Note that $(A, S, R)$ are time-step dependent, so we write $(A_t, S_t, R_t)$ for time step $t$.

Reinforcement learning diagram

Maze example

Maze example: continue

Maze example: continue

Mountain car problem

RL Framework

RL Notation At time step $t$, the agent observes a state $S_t$ from a state space $\mathcal{S}_t$ and selects an action $A_t$ from an action space $\mathcal{A}_t$. Together the state and action result in a transition to a new state $S_{t+1}$. Given $(S_t, A_t, S_{t+1})$, the agent receives an immediate reward $R_t = r_t(S_t, A_t, S_{t+1}) \in \mathbb{R}$, where $r_t(\cdot,\cdot,\cdot)$ is called the immediate reward function.

RL mathematical formulation At time $t$, we assume a transition probability function from $(S_t = s, A_t = a)$ to $(S_{t+1} = s')$: $p_t(s'|s,a) \ge 0$, $\int_{s'} p_t(s'|s,a)\,ds' = 1$. We also assume $A_t$ given $S_t$ follows a probability distribution: $\pi_t(a|s) \ge 0$, $\int_a \pi_t(a|s)\,da = 1$. A trajectory (training sample) $(s_1, a_1, s_2, \ldots, s_T, a_T, s_{T+1})$ is generated as follows: start from an initial state $s_1$ drawn from a probability distribution $p(s)$; for $t = 1, 2, \ldots, T$ ($T$ is the total number of steps), (a) $a_t$ is chosen from $\pi_t(\cdot|s_t)$, (b) the next state $s_{t+1}$ is drawn from $p_t(\cdot|s_t, a_t)$. The problem is called finite horizon if $T < \infty$ and infinite horizon if $T = \infty$.

Goal of RL Define the return at time $t$ as $\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1})$, where $\gamma \in [0,1)$ is called the discount factor (discounting long trajectories). An action policy, $\pi = (\pi_1, \ldots, \pi_T)$, is a sequence of probability distribution functions, where $\pi_t$ is a probability distribution for $A_t$ given $S_t$. The goal of RL is to learn the optimal policy, $\pi^* = (\pi^*_1, \pi^*_2, \ldots, \pi^*_T)$, that maximizes the expected return $E_{\pi}[\sum_{j=1}^{T} \gamma^{j-1} r_j(S_j, A_j, S_{j+1})]$, where $E_{\pi}(\cdot)$ means $A_t \mid S_t \sim \pi_t(\cdot|S_t)$.
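To make the notation concrete, here is a minimal sketch (not from the slides) that samples one trajectory from a hypothetical two-state, two-action problem and accumulates the discounted return $\sum_{j=1}^{T} \gamma^{j-1} r_j$; the transition probabilities, rewards, and uniform policy are all invented for illustration.

```python
# A minimal sketch: simulate one trajectory (s_1, a_1, ..., s_T, a_T, s_{T+1})
# from a toy finite MDP and compute its discounted return.  The two-state,
# two-action arrays below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T, gamma = 2, 2, 10, 0.9
# p[s, a, s'] = transition probability; r[s, a, s'] = immediate reward
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])

def policy(s):
    """A fixed stochastic policy pi_t(a|s); here uniform over actions."""
    return rng.integers(n_actions)

s = rng.integers(n_states)          # s_1, here drawn uniformly
ret, discount = 0.0, 1.0
for t in range(T):
    a = policy(s)
    s_next = rng.choice(n_states, p=p[s, a])   # s_{t+1} ~ p_t(.|s_t, a_t)
    ret += discount * r[s, a, s_next]          # gamma^{t} * r(s_t, a_t, s_{t+1})
    discount *= gamma
    s = s_next

print("discounted return of the sampled trajectory:", ret)
```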

Optimal policy RL aims to find the best action decision rules such that the average long-term reward is maximized when those rules are implemented. Note: $\pi^*$ is a function of the states, so for any individual we only know what the action should be at time $t$ after observing its state at time $t$. This is related to so-called adaptive or dynamic decision making.

How is supervised learning framed in the RL context? We can imagine $S_t$ to be all the data (both features and outcomes) collected by step $t$. Then $A_t$ is a prediction rule chosen from a class of prediction functions based on $S_t$ (it need not be a perfect prediction function; it can even be a random prediction), so $\pi_t$ is the probabilistic selection of which prediction function to use at $t$. Based on $(S_t, A_t)$, $S_{t+1}$ can be $S_t$ augmented with additionally collected data, $S_t$ with individual errors, or just $S_t$ itself. $R_t$ is the prediction error evaluated on the data. The goal is to learn the best prediction rule, and RL methods can help!

State-Action and State Value Functions

Two important concepts in RL State-action value function (SAV): the expected return increment at time $t$ given state $S_t = s$ and action $A_t = a$: $Q^{\pi}_t(s,a) = E_{\pi}[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \mid S_t = s, A_t = a]$. $Q^*_t(s,a) = \max_{\pi} Q^{\pi}_t(s,a)$ is the optimal expected return at time $t$. State value function (SV): the expected return increment at time $t$ given state $S_t = s$: $V^{\pi}_t(s) = E_{\pi}[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \mid S_t = s]$. Similarly, $V^*_t(s) = \max_{\pi} V^{\pi}_t(s)$. Clearly, $V^{\pi}_t(s) = \int_a Q^{\pi}_t(s,a)\,\pi_t(a|s)\,da$.
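For a discrete action space the integral relating $V^{\pi}_t$ and $Q^{\pi}_t$ becomes a sum over actions; the following tiny check uses invented Q-values and policy probabilities.

```python
# Minimal check of V^pi_t(s) = sum_a Q^pi_t(s,a) * pi_t(a|s) for a discrete
# action space.  The Q-values and policy probabilities below are invented.
import numpy as np

Q_t = np.array([[1.0, 3.0],     # Q^pi_t(s, a) for 2 states x 2 actions
                [0.5, 2.0]])
pi_t = np.array([[0.4, 0.6],    # pi_t(a|s), each row sums to 1
                 [0.9, 0.1]])

V_t = (Q_t * pi_t).sum(axis=1)  # V^pi_t(s) for each state s
print(V_t)                      # [2.2, 0.65]
```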

Bellman equations The Bellman equation for the SV: $V^{\pi}_t(s) = E_{\pi}[r_t(s, A_t, S_{t+1}) + \gamma V^{\pi}_{t+1}(S_{t+1}) \mid S_t = s] = \int_{s'} \int_a [r_t(s, a, s') + \gamma V^{\pi}_{t+1}(s')]\, \pi_t(a|s)\, p_t(s'|s,a)\, da\, ds'$. The Bellman equation for the SAV: $Q^{\pi}_t(s,a) = E_{\pi}[r_t(s, a, S_{t+1}) + \gamma Q^{\pi}_{t+1}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] = \int_{s'} \int_{a'} [r_t(s, a, s') + \gamma Q^{\pi}_{t+1}(s', a')]\, \pi_{t+1}(a'|s')\, p_t(s'|s,a)\, da'\, ds'$.

Optimal policy learning: Bellman equation Bellman equations for the optimal policy: $V^{\pi^*}_t(s) = \max_a Q^{\pi^*}_t(s,a)$, $Q^{\pi^*}_t(s,a) = E_{\pi^*}[r_t(s, a, S_{t+1}) + \gamma V^{\pi^*}_{t+1}(S_{t+1}) \mid S_t = s, A_t = a]$, $\pi^*_t(a|s) = I\{a = \operatorname{argmax}_{a'} Q^{\pi^*}_t(s, a')\}$.
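The finite-horizon Bellman recursions can be written down directly for a small tabular problem; the sketch below (a toy example, not from the slides) performs backward policy evaluation of $V^{\pi}_t$ together with the optimal backup $V^*_t(s) = \max_a Q^*_t(s,a)$, with invented transitions and rewards.

```python
# Backward Bellman recursions on a toy finite-horizon MDP: policy evaluation
# under a fixed policy pi, and the optimal backup with greedy action extraction.
import numpy as np

n_s, n_a, T, gamma = 2, 2, 5, 0.9
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])      # p_t(s'|s,a), time-homogeneous here
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])      # r_t(s,a,s')
pi = np.full((n_s, n_a), 0.5)                 # a fixed policy pi_t(a|s)

V_pi = np.zeros((T + 2, n_s))                 # V^pi_{T+1} = 0
V_opt = np.zeros((T + 2, n_s))                # V*_{T+1} = 0
for t in range(T, 0, -1):
    # Q_t(s,a) = sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma * V_{t+1}(s') ]
    Q_pi = (p * (r + gamma * V_pi[t + 1])).sum(axis=2)
    Q_opt = (p * (r + gamma * V_opt[t + 1])).sum(axis=2)
    V_pi[t] = (pi * Q_pi).sum(axis=1)         # evaluation under pi
    V_opt[t] = Q_opt.max(axis=1)              # optimal Bellman backup

greedy_action_at_1 = Q_opt.argmax(axis=1)     # pi*_1(s): argmax_a Q*_1(s,a)
print(V_pi[1], V_opt[1], greedy_action_at_1)
```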

Reinforcement Learning for Finite Horizon

Value function given $\pi$ For finite $T$, the Bellman equations suggest a backward procedure to evaluate the value function associated with a particular policy. Start from time $T$: we can learn $Q^{\pi}_T(s,a) = E[R_T \mid S_T = s, A_T = a]\, I(a \sim \pi(\cdot|s))$. At time $T-1$, we learn $Q^{\pi}_{T-1}(s,a)$ as $E[R_{T-1} + \gamma E_{\pi}[Q^{\pi}_T(S_T, A_T) \mid S_T] \mid S_{T-1} = s, A_{T-1} = a]\, I(a \sim \pi(\cdot|s))$. We continue learning backwards until time 1. Note that each step can be estimated using parametric, nonparametric, or machine learning methods.

Optimal policy learning for finite horizon (Q-learning) Start from time $T$: we can learn $Q^{\pi^*}_T(s,a) = E[R_T \mid S_T = s, A_T = a]$ and take $\pi^*_T(s)$ to put probability 1 on $a = \operatorname{argmax}_{a'} Q^{\pi^*}_T(s,a')$. At time $T-1$, we learn $Q^{\pi^*}_{T-1}(s,a)$ as $E[R_{T-1} + \gamma \max_{a'} Q^{\pi^*}_T(S_T, a') \mid S_{T-1} = s, A_{T-1} = a]$ and obtain $\pi^*_{T-1}$ as the policy putting probability 1 on $a = \operatorname{argmax}_{a'} Q^{\pi^*}_{T-1}(s,a')$. We perform the same learning procedure backwards until time 1 to learn all the optimal policies.
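A minimal data-driven sketch of this backward Q-learning for a finite-state, finite-action problem is given below (not from the slides): the conditional means $E[\cdot \mid S_t = s, A_t = a]$ are estimated by cell averages, which plays the role of the regression step on each backward pass; the behavior data are invented.

```python
# Backward Q-learning from n observed T-step trajectories (toy data).
import numpy as np

rng = np.random.default_rng(1)
n, T, n_s, n_a, gamma = 500, 4, 2, 2, 0.9

# invented behavior data: states, actions, next states, rewards
S = rng.integers(n_s, size=(n, T))
A = rng.integers(n_a, size=(n, T))
S_next = rng.integers(n_s, size=(n, T))
R = rng.normal(size=(n, T)) + S + A            # made-up rewards

Q = np.zeros((T + 1, n_s, n_a))                # Q*_{T+1} = 0 by convention
for t in range(T - 1, -1, -1):                 # t = T-1, ..., 0 (0-based time)
    target = R[:, t] + gamma * Q[t + 1, S_next[:, t]].max(axis=1)
    for s in range(n_s):
        for a in range(n_a):
            cell = (S[:, t] == s) & (A[:, t] == a)
            if cell.any():                     # regression = cell average here
                Q[t, s, a] = target[cell].mean()

pi_star = Q[:T].argmax(axis=2)                 # greedy optimal policy at each t
print(pi_star)
```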

Statistical models for the state-action value function Parametric/semiparametric models for $Q^{\pi}(s,a)$ are commonly used. We assume $Q^{\pi}(s,a) = \sum_{b=1}^{B} \theta_b \phi_b(s,a)$, where the $\phi_b(s,a)$ are a sequence of basis functions. In other words, the policy is indirectly represented by the $\theta_b$'s. From the Bellman equation, we note that the conditional mean of $R_t = r(S_t, A_t, S_{t+1})$ given $(S_t, A_t)$ is $Q^{\pi}(S_t, A_t) - \gamma E_{\pi}[Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t, A_t] = \theta^T \psi(S_t, A_t)$ under policy $\pi$, where $\psi(s,a) = \phi(s,a) - \gamma E_{\pi}[\phi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.

Numerical implementation Suppose we have data from $n$ subjects, each with a training sample of $T$ steps, or $n$ training $T$-step samples from the same agent, $(S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT}, A_{iT}, S_{i,T+1})$. We estimate $\psi_b(s,a)$ by $\hat\psi_b(s,a) = \phi_b(s,a) - \gamma\, \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)\, E_{\pi}[\phi_b(S_{i,t+1}, A_{i,t+1}) \mid S_{i,t+1}]}{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)}$. We then perform a least-squares estimation: $\min_{\theta} \frac{1}{nT} \sum_{i=1}^{n}\sum_{t=1}^{T} I(A_{it} \mid S_{it} \sim \pi)\, [\theta^T \hat\psi(S_{it}, A_{it}) - R_{it}]^2$, where $A_{it} \mid S_{it} \sim \pi$ means that $A_{it}$ was obtained by following the policy $\pi$.
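Here is a minimal sketch (not the authors' implementation) of the least-squares step: fit $\theta$ in $Q^{\pi}(s,a) = \theta^T \phi(s,a)$ from on-policy transitions, using the observed next state-action pair as a single-sample plug-in for $E_{\pi}[\phi(S_{t+1}, A_{t+1}) \mid S_t, A_t]$; the data and basis functions are invented.

```python
# Least-squares fitting of theta with linear basis functions (toy data).
import numpy as np

rng = np.random.default_rng(2)
gamma, n_obs = 0.9, 1000

def phi(s, a):
    """Invented basis: polynomial terms in the (scalar) state and action."""
    return np.array([1.0, s, a, s * a, s ** 2])

# made-up on-policy transitions (s_t, a_t, r_t, s_{t+1}, a_{t+1})
s = rng.normal(size=n_obs)
a = rng.integers(0, 2, size=n_obs).astype(float)
s_next = 0.8 * s + rng.normal(scale=0.1, size=n_obs)
a_next = rng.integers(0, 2, size=n_obs).astype(float)
r = s + a + rng.normal(scale=0.1, size=n_obs)

Phi = np.stack([phi(si, ai) for si, ai in zip(s, a)])
Phi_next = np.stack([phi(si, ai) for si, ai in zip(s_next, a_next)])
Psi = Phi - gamma * Phi_next                 # one-sample estimate of psi(s,a)

# least squares: min_theta (1/nT) sum ( theta^T psi - r )^2
theta, *_ = np.linalg.lstsq(Psi, r, rcond=None)
print("fitted theta:", theta)
```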

More on numerical implementation Regularization may be introduced to obtain a sparser solution. $L_2$-minimization can be replaced by $L_1$-minimization to gain robustness. Choice of basis functions: radial basis functions, where the kernel can be the usual Gaussian kernel (one possible definition of the distance $d(s, s')$ is the shortest path from $s$ to $s'$ in the graph defined by the transition probabilities).

Alternative methods Modelling the transition probability functions. Active policy iteration (active learning): actively update the sampling policy.

Reinforcement Learning for Infinite Horizon

Value function learning given $\pi$ When $T = \infty$ or $T$ is large, the backward Q-learning method may not be applicable. The remedy is to take advantage of the stability of the process when $t$ is large, so we can assume the following Markov decision process (MDP): the MDP assumes that the state and action spaces are constant over time; the MDP assumes $p_t(s'|s,a)$ to be independent of $t$; the reward function $r_t(s,a,s')$ is independent of $t$. The MDP assumption is plausible for a long horizon and after a certain number of steps.

Bellman equations under MDP for infinite horizon Under the MDP, $Q^{\pi}_t(s,a) = Q^{\pi}(s,a)$ and $V^{\pi}_t(s) = V^{\pi}(s)$. The Bellman equations become $V^{\pi}(s) = E_{\pi}[r(s, A_t, S_{t+1}) + \gamma V^{\pi}(S_{t+1}) \mid S_t = s]$ and $Q^{\pi}(s,a) = E_{\pi}[r(s, a, S_{t+1}) + \gamma Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.

On- and off-policy estimation We can still apply the least-squares learning algorithm for $Q^{\pi}(s,a)$: minimize $E_{\pi}[\sum_{t=1}^{T} (\theta^T \psi(S_t, A_t) - R_t)^2]$ using history samples $(S_t, A_t)$ collected by following the target policy $\pi$. This is called on-policy reinforcement learning. However, not every policy has been seen in the history sample. An alternative method is to use importance sampling: $E_{\pi}[\sum_{t=1}^{T} (\theta^T \psi(S_t, A_t) - R_t)^2] = E_{\tilde\pi}[\sum_{t=1}^{T} (\theta^T \psi(S_t, A_t) - R_t)^2\, w_t]$, where $w_t = \prod_{j=1}^{t} \pi(A_j|S_j) \big/ \prod_{j=1}^{t} \tilde\pi(A_j|S_j)$ and $\tilde\pi$ is the behavior policy that generated the history sample.
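A minimal sketch (not from the slides) of the cumulative importance weights $w_t$ for one trajectory with discrete actions; the policies and data are invented.

```python
# Per-step importance weights w_t = prod_{j<=t} pi(A_j|S_j) / pi_behavior(A_j|S_j).
import numpy as np

def importance_weights(states, actions, target_probs, behavior_probs):
    """Cumulative importance weights w_1, ..., w_T for one trajectory.

    target_probs(s, a) and behavior_probs(s, a) return pi(a|s) and
    pi_behavior(a|s); behavior_probs must be positive wherever the
    target policy puts mass.
    """
    ratios = np.array([target_probs(s, a) / behavior_probs(s, a)
                       for s, a in zip(states, actions)])
    return np.cumprod(ratios)

# toy example: 2 actions, target policy prefers action 1, behavior is uniform
target = lambda s, a: 0.8 if a == 1 else 0.2
behavior = lambda s, a: 0.5
w = importance_weights(states=[0, 1, 0], actions=[1, 0, 1],
                       target_probs=target, behavior_probs=behavior)
print(w)   # [1.6, 0.64, 1.024]
```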

Off-policy iteration: more We need one assumption: there exists a policy in the history sample, $\tilde\pi$, such that $\tilde\pi(a|s) > 0$ for all $(a, s)$. Adaptive importance weighting replaces $w_t$ by $w_t^{\nu}$ and chooses $\nu$ via cross-validation. When the history sample contains multiple policies $\tilde\pi$'s, we can obtain an estimate by importance weighting with respect to each policy and then aggregate the estimates (sample-reuse policy iteration).

Reinforcement Learning for Optimal Policy The concept of RL is to make use of existing data from some given policies to learn potentially improved policies (EXPLOITATION); it then tries new policies to collect additional data evidence (EXPLORATION). Reinforcement learning methods fall mostly into two groups: (policy iteration) model-based or learning methods to approximate the optimal SAV; (policy search) model-based or learning methods to directly maximize the SV for estimating $\pi^*$.

Optimal policy learning: policy iteration procedure Start from a policy $\pi$. Policy evaluation: evaluate $Q^{\pi}(s,a)$ and thus $V^{\pi}(s)$. Policy improvement: update $\pi(a|s)$ to be $I(a = a^{\pi}(s))$, where $a^{\pi}(s)$ is the action maximizing $Q^{\pi}(s,a)$. Iterate between the policy evaluation and policy improvement steps.
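A minimal tabular policy-iteration sketch on a toy infinite-horizon MDP (not from the slides): it alternates exact policy evaluation, solving the linear Bellman system, with greedy policy improvement until the policy stops changing; the transitions and rewards are invented.

```python
# Tabular policy iteration on a toy MDP.
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])      # p(s'|s,a), invented
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])      # r(s,a,s'), invented

def evaluate(policy):
    """Solve V = r_pi + gamma * P_pi V for a deterministic policy."""
    P_pi = p[np.arange(n_s), policy]                  # (n_s, n_s)
    r_pi = (P_pi * r[np.arange(n_s), policy]).sum(axis=1)
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

policy = np.zeros(n_s, dtype=int)             # initial policy: action 0 everywhere
while True:
    V = evaluate(policy)                      # policy evaluation
    Q = (p * (r + gamma * V)).sum(axis=2)     # Q(s,a) under current V
    new_policy = Q.argmax(axis=1)             # greedy policy improvement
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy:", policy, "value:", V)
```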

Soft policy iteration procedure Selecting a deterministic policy update may be too greedy if the initial policy is far from the optimal one. Softer policy updates include: (softmax policy improvement) $\pi(a|s) \propto \exp\{Q^{\pi}(s,a)/\tau\}$; (ɛ-greedy policy improvement) $\pi(a|s)$ has probability $1 - \epsilon + \epsilon/m$ at $a = a^{\pi}(s)$ and probability $\epsilon/m$ at the other $a$'s, where $m$ is the number of possible actions.
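The two soft improvement rules are easy to write down for one state's vector of Q-values; the sketch below (invented Q-values) is purely illustrative.

```python
# Softmax (Boltzmann) and epsilon-greedy policy improvement for one state.
import numpy as np

def softmax_policy(q_values, tau=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/tau)."""
    z = np.exp((q_values - q_values.max()) / tau)   # subtract max for stability
    return z / z.sum()

def epsilon_greedy_policy(q_values, eps=0.1):
    """Probability 1 - eps + eps/m on the greedy action, eps/m elsewhere."""
    m = len(q_values)
    probs = np.full(m, eps / m)
    probs[np.argmax(q_values)] += 1.0 - eps
    return probs

q = np.array([1.0, 2.0, 0.5])         # invented Q(s, .) for a single state
print(softmax_policy(q, tau=0.5))     # smoother than the greedy update
print(epsilon_greedy_policy(q, 0.1))  # [0.0333..., 0.9333..., 0.0333...]
```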

Optimal policy learning: direct policy search The direct policy search approach aims to find the policy maximizing the expected return. Suppose we model the policy as $\pi(a|s;\theta)$. The expected return under $\pi$ is given by $J(\theta) = \int p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t;\theta) \left\{\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\right\} ds_1 \cdots ds_{T+1}\, da_1 \cdots da_T$. We optimize $J(\theta)$ to find the optimal $\theta$. A gradient approach can be adopted for the optimization. EM-based policy search can also be used. Importance sampling can be used for evaluating $J(\theta)$.
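One common gradient approach (a REINFORCE-style estimator, not necessarily the slides' method) estimates $\nabla_{\theta} J(\theta)$ by $E[(\sum_t \nabla_\theta \log \pi(a_t|s_t;\theta)) \times \text{(discounted return)}]$; the sketch below applies it to a softmax policy on a toy two-state, two-action MDP with invented dynamics.

```python
# REINFORCE-style stochastic gradient ascent on J(theta) for a softmax policy.
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, T, gamma, lr = 2, 2, 20, 0.95, 0.05
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])

theta = np.zeros((n_s, n_a))                      # softmax policy parameters

def pi(s, theta):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

for it in range(200):                             # crude stochastic gradient ascent
    s = rng.integers(n_s)
    grad_logp, ret, disc = np.zeros_like(theta), 0.0, 1.0
    for t in range(T):
        probs = pi(s, theta)
        a = rng.choice(n_a, p=probs)
        # gradient of log pi(a|s): one-hot indicator minus probabilities, row s
        grad_logp[s] += (np.eye(n_a)[a] - probs)
        s_next = rng.choice(n_s, p=p[s, a])
        ret += disc * r[s, a, s_next]
        disc *= gamma
        s = s_next
    theta += lr * grad_logp * ret                 # single-trajectory REINFORCE step

print("learned policy pi(.|s):", [pi(s, theta) for s in range(n_s)])
```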

How does RL work in artificial intelligence? The agent (robot) starts with an initial policy, $\pi^{(0)}$, and runs a trial for a period (each trial is sometimes called an epoch). The agent uses RL algorithms (Q-learning, least-squares estimation) to learn the state-action value function for $\pi^{(0)}$. The agent then uses policy iteration or the direct policy search method to obtain an improved policy $\pi^{(1)}$ and runs a new trial under this policy. The agent continues this process, where SAV function learning can reuse data from all previous policies via importance sampling. It stops when the value or the policy shows negligible change.

What can statisticians do with RL? Improve the design of the initial policy (random policy or other choices). Pilot trials? Improve the learning methods in RL algorithms. Improve the policy update. Characterize convergence rates, and so on. Design better reward systems.

Simulated examples

Robot-Arm control example

Robot-Arm control example: continue

Robot-Arm control example: continue

Mountain car example Action space: force applied to the car, $a \in \{-0.2, 0.2, 0\}$. State space: $(x, \dot{x})$, where $x$ is the horizontal position ($\in [-1.2, 0.5]$) and $\dot{x}$ is the velocity ($\in [-1.5, 1.5]$). Transition: $x_{t+1} = x_t + \dot{x}_{t+1}\,\delta t$, $\dot{x}_{t+1} = \dot{x}_t + (-9.8\, w \cos(3x_t) + a_t/w - k\dot{x}_t)\,\delta t$, where $w$ is the mass (0.2 kg), $k$ is the friction coefficient (0.3), and $\delta t$ is 0.1 second. Reward: $r(s, a, s') = 1$ if $x_{s'} \ge 0.5$ and $-0.01$ otherwise. Policy iteration uses Gaussian kernels with centers at $\{-1.2, -0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}$ and $\sigma = 1$.
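A minimal simulation sketch of the mountain-car dynamics described above; the parameter values come from the slide, while the clipping of the state to its stated ranges and the random placeholder policy are assumptions added for illustration.

```python
# Simulating the mountain-car transition and reward under a random policy.
import numpy as np

rng = np.random.default_rng(4)
w, k, dt = 0.2, 0.3, 0.1                 # mass, friction coefficient, time step
actions = np.array([-0.2, 0.2, 0.0])     # available forces

def step(x, xdot, a):
    """One transition of the mountain-car dynamics, with state clipping."""
    xdot_new = xdot + (-9.8 * w * np.cos(3 * x) + a / w - k * xdot) * dt
    xdot_new = np.clip(xdot_new, -1.5, 1.5)
    x_new = np.clip(x + xdot_new * dt, -1.2, 0.5)
    reward = 1.0 if x_new >= 0.5 else -0.01
    return x_new, xdot_new, reward

x, xdot, total = -0.5, 0.0, 0.0          # start near the valley bottom
for t in range(500):
    a = rng.choice(actions)              # placeholder random policy
    x, xdot, rwd = step(x, xdot, a)
    total += rwd
    if rwd == 1.0:                       # reached the goal position
        break

print(f"stopped at t={t}, x={x:.3f}, cumulative reward={total:.2f}")
```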

Experiment results

Experiment results