Reinforcement Learning
Yishay Mansour
Google Inc. & Tel-Aviv University
Outline
- Goal of Reinforcement Learning
- Mathematical Model (MDP)
- Planning
- Learning
- Current Research Issues
Goal of Reinforcement Learning
- Goal-oriented learning through interaction.
- Control of large-scale stochastic environments with partial knowledge.
- In contrast, Supervised / Unsupervised Learning learns from labeled / unlabeled examples.
Reinforcement Learning - origins
- Artificial Intelligence
- Control Theory
- Operations Research
- Cognitive Science & Psychology
Solid foundations; well-established research.
Typical Applications
- Robotics: elevator control [CB], robo-soccer [SV].
- Board games: backgammon [T], checkers [S], chess [B].
- Scheduling: dynamic channel allocation [SB], inventory problems.
Contrast with Supervised Learning
- The system has a state.
- The algorithm influences the state distribution.
- Inherent tradeoff: Exploration versus Exploitation.
Mathematical Model - Motivation
- Model of uncertainty: environment, actions, our knowledge.
- Focus on decision making.
- Maximize long-term reward.
- Markov Decision Process (MDP)
Mathematical Model - MDP
Markov decision process:
- S - set of states
- A - set of actions
- δ - transition probability
- R - reward function
Similar to a DFA!
MDP model - states and actions
- Environment = states.
- Actions = transitions δ(s, a, s').
[Figure: a state s, an action a, and two successor states reached with probabilities 0.7 and 0.3]
MDP model - rewards
R(s,a) = the reward at state s for doing action a (a random variable).
Example:
R(s,a) = -1 with probability 0.5
         +10 with probability 0.35
         +20 with probability 0.15
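For this example the expected immediate reward is E[R(s,a)] = 0.5·(-1) + 0.35·10 + 0.15·20 = 6. A minimal Python sketch of sampling such a reward; the distribution is the one from the example, and the function name is illustrative:

```python
import random

# Reward distribution from the example above: (value, probability) pairs.
REWARDS = [(-1, 0.5), (10, 0.35), (20, 0.15)]

def sample_reward():
    """Draw one realization of the random reward R(s, a)."""
    values, probs = zip(*REWARDS)
    return random.choices(values, weights=probs, k=1)[0]

expected = sum(v * p for v, p in REWARDS)
print(expected)  # 6.0 = 0.5*(-1) + 0.35*10 + 0.15*20
```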
MDP model - trajectories
A trajectory is a sequence of states, actions, and rewards:
s0, a0, r0, s1, a1, r1, s2, a2, r2, ...
MDP - Return function
Combining all the immediate rewards into a single value.
Modeling issues:
- Are early rewards more valuable than later rewards?
- Is the system terminating or continuous?
Usually the return is linear in the immediate rewards.
MDP model - return functions
- Finite horizon (parameter H): return = Σ_{i=1..H} R(s_i, a_i)
- Infinite horizon:
  - discounted (parameter γ < 1): return = Σ_{i=0..∞} γ^i R(s_i, a_i)
  - undiscounted: return = lim_{N→∞} (1/N) Σ_{i=0..N-1} R(s_i, a_i)
- Terminating MDP
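As a quick sanity check on the discounted return, a minimal sketch that accumulates Σ γ^i r_i over a finite trajectory prefix; the rewards and γ below are made-up illustrative values:

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma**i * rewards[i] over a finite trajectory prefix."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Illustrative values only: gamma = 0.5 and three unit rewards.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```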
MDP model - action selection
AIM: Maximize the expected return.
- Fully observable - we can see the entire state.
- Policy - a mapping from states to actions.
- Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
Contrast with Supervised Learning
- Supervised Learning: fixed distribution on the examples.
- Reinforcement Learning: the state distribution is policy dependent!!!
A small local change in the policy can make a huge global change in the return.
MDP model - summary
- S: set of states, |S| = n.
- A: set of k actions, |A| = k.
- δ(s1, a, s2): transition function.
- R(s,a): immediate reward function.
- π : S → A: policy.
- Σ_{i=0..∞} γ^i r_i: discounted cumulative return.
Simple example: N-armed bandit
- A single state with actions a1, a2, a3, ...
- Goal: maximize the sum of immediate rewards.
- Given the model: play the greedy action.
- Difficulty: the model is unknown.
N-Armed Bandit: Highlights
Algorithm (near greedy): exponential weights
- G_i = the cumulative reward of action a_i
- w_i = β^{G_i}
- "Follow the (perturbed) leader"
Result: for any sequence of T rewards:
E[online] ≥ max_i {G_i} - sqrt(T log N)
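A minimal sketch of the exponential-weights selection rule w_i = β^{G_i} as stated above. This is the simplest (full-information-style) version; a proper bandit algorithm such as Exp3 would also use importance-weighted reward estimates. The arm means and β value are illustrative assumptions:

```python
import random

def exp_weights_action(G, beta):
    """Pick arm i with probability proportional to beta**G[i],
    where G[i] is the cumulative reward of arm i so far."""
    g_max = max(G)  # shift exponents for numerical stability
    weights = [beta ** (g - g_max) for g in G]
    return random.choices(range(len(G)), weights=weights, k=1)[0]

# Illustrative run: 3 arms with made-up Bernoulli reward means.
means = [0.2, 0.5, 0.8]
G = [0.0, 0.0, 0.0]
for t in range(1000):
    i = exp_weights_action(G, beta=1.05)
    G[i] += 1.0 if random.random() < means[i] else 0.0
print(G)  # the arm with mean 0.8 should accumulate the most reward
```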
Planning - Basic Problems
Given a complete MDP model:
- Policy evaluation: given a policy π, estimate its return.
- Optimal control: find an optimal policy π* (maximizing the return from any start state).
Planning - Value Functions
- V^π(s): the expected return starting at state s and following π.
- Q^π(s,a): the expected return starting at state s, taking action a, and then following π.
- V*(s) and Q*(s,a) are defined using an optimal policy π*:
  V*(s) = max_π V^π(s)
Planning - Policy Evaluation
Discounted infinite horizon (Bellman Eq.):
V^π(s) = E_{s'~π(s)} [ R(s, π(s)) + γ V^π(s') ]
Rewriting the expectation:
V^π(s) = E[R(s, π(s))] + γ Σ_{s'} δ(s, π(s), s') V^π(s')
This is a linear system of equations.
Algorithms - Policy Evaluation Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{(i+a) mod 4}; π is the random policy; R(s_i, a) = i.
[Figure: states s0, s1, s2, s3 arranged in a cycle]
V^π(s0) = 0 + γ [ π(s0, +1) V^π(s1) + π(s0, -1) V^π(s3) ]
Algorithms - Policy Evaluation Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{(i+a) mod 4}; π is the random policy; R(s_i, a) = i.
Since π picks each action with probability 1/2:
V^π(s0) = 0 + (V^π(s1) + V^π(s3)) / 4
Solving the linear system:
V^π(s0) = 5/3, V^π(s1) = 7/3, V^π(s2) = 11/3, V^π(s3) = 13/3
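A minimal sketch that reproduces these values by solving the policy-evaluation linear system V = R + γ P V directly for the 4-state cycle; the variable names are illustrative, the structure follows the example above:

```python
import numpy as np

gamma = 0.5
n = 4  # states s0..s3 arranged in a cycle

# Random policy: from s_i, move to s_{i+1} or s_{i-1} with probability 1/2 each.
P = np.zeros((n, n))
for i in range(n):
    P[i, (i + 1) % n] = 0.5
    P[i, (i - 1) % n] = 0.5

R = np.arange(n, dtype=float)  # R(s_i, a) = i, independent of the action

# Policy evaluation: solve (I - gamma * P) V = R.
V = np.linalg.solve(np.eye(n) - gamma * P, R)
print(V)  # [5/3, 7/3, 11/3, 13/3] ~ [1.667, 2.333, 3.667, 4.333]
```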
Algorithms - Optimal Control
State-action value function:
Q^π(s,a) = E[R(s,a)] + γ E_{s'~δ(s,a)} [ V^π(s') ]
Note that V^π(s) = Q^π(s, π(s)) for a deterministic policy π.
Algorithms - Optimal Control Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{(i+a) mod 4}; π is the random policy; R(s_i, a) = i.
Q^π(s0, +1) = 0 + γ V^π(s1) = 7/6
Q^π(s0, -1) = 0 + γ V^π(s3) = 13/6
Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s:
V^π(s) = max_a { Q^π(s,a) }   (Bellman Eq.)
PROOF: Assume there are a state s and an action a s.t. V^π(s) < Q^π(s,a). Then the strategy of performing a at state s (the first time) is better than π. This holds every time we visit s, so the policy that always performs action a at state s is better than π.
Algorithms - Optimal Control Example
A = {+1, -1}; γ = 1/2; δ(s_i, a) = s_{(i+a) mod 4}; π is the random policy; R(s_i, a) = i.
Improving the policy using the state-action value function: at each state, switch to the action with the larger Q value (here, Q^π(s0, -1) = 13/6 > Q^π(s0, +1) = 7/6).
MDP - computing the optimal policy
1. Linear Programming.
2. Value Iteration method:
   V_{i+1}(s) = max_a { E[R(s,a)] + γ Σ_{s'} δ(s,a,s') V_i(s') }
3. Policy Iteration method:
   π_{i+1}(s) = arg max_a { Q^{π_i}(s,a) }
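A minimal value-iteration sketch for a generic finite MDP, assuming tabular arrays R[s,a] for expected rewards and P[s,a,s'] for δ; the variable names and stopping threshold are illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P: (n, k, n) transition tensor delta(s, a, s');
    R: (n, k) expected immediate rewards.
    Returns the optimal values and a greedy policy."""
    n, k, _ = P.shape
    V = np.zeros(n)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```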
Convergence: Value Iteration
Distance of V_i from the optimal V* (in the L∞ norm):
|Q_{i+1}(s,a) - Q*(s,a)|
  = |E[R(s,a)] + γ Σ_{s'} δ(s,a,s') V_i(s') - E[R(s,a)] - γ Σ_{s'} δ(s,a,s') V*(s')|
  = γ |Σ_{s'} δ(s,a,s') [V_i(s') - V*(s')]|
  ≤ γ ||V_i - V*||_∞
Hence ||V_{i+1} - V*||_∞ ≤ γ ||V_i - V*||_∞.
Convergence rate: 1/(1-γ). ONLY pseudo-polynomial.
Convergence: Policy Iteration
Policy Iteration algorithm:
- Compute Q^π(s,a).
- Set π(s) = arg max_a Q^π(s,a).
- Reiterate.
Convergence: the policy can only improve, V_{t+1}(s) ≥ V_t(s).
Fewer iterations than Value Iteration, but each iteration is more expensive.
OPEN: How many iterations does it require?!
LB: linear. UB: 2^n / n (2-action MDPs) [MS]
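A minimal policy-iteration sketch in the same tabular setting: exact policy evaluation via the linear system, then greedy improvement. The names are illustrative:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P: (n, k, n) transition tensor; R: (n, k) expected rewards."""
    n, k, _ = P.shape
    pi = np.zeros(n, dtype=int)
    while True:
        # Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = P[np.arange(n), pi]  # (n, n) transitions under pi
        R_pi = R[np.arange(n), pi]  # (n,) rewards under pi
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
        # Greedy improvement with respect to the Q function of pi.
        Q = R + gamma * (P @ V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```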
Outline
Done:
- Goal of Reinforcement Learning
- Mathematical Model (MDP)
- Planning: value iteration, policy iteration
Now: Learning Algorithms
- Model based
- Model free
Planning versus Learning
Tightly coupled in Reinforcement Learning.
Goal: maximize return while learning.
Example - Elevator Control
- Learning (alone): model the arrival process well.
- Planning (alone): given an arrival model, build a schedule.
- Real objective: construct a schedule while updating the model.
Learning Algorithms
Given access to the environment only through performing actions:
1. Policy evaluation.
2. Control: find an optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning).
Learning - Model Based
- Estimate the model from the observations (both transition probabilities and rewards).
- Use the estimated model as the true model, and find an optimal policy.
- If we have a good estimated model, we should get a good estimate of the value.
Learning - Model Based: off policy
- Let the policy run for a long time (what is "long"?!), assuming some exploration.
- Build an observed model: transition probabilities and rewards.
- Use the observed model to estimate the value of the policy.
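A minimal sketch of building the observed model from trajectory data: empirical transition frequencies and average rewards per (s, a) pair. This is a simple count-based estimator, and the data format is an assumption:

```python
from collections import defaultdict

def estimate_model(transitions):
    """transitions: iterable of (s, a, r, s_next) tuples observed while
    running the policy. Returns empirical estimates of delta and R."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> s' -> count
    reward_sums = defaultdict(float)
    totals = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        totals[(s, a)] += 1
    delta = {sa: {sn: c / totals[sa] for sn, c in successors.items()}
             for sa, successors in counts.items()}
    R = {sa: reward_sums[sa] / totals[sa] for sa in totals}
    return delta, R
```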
Learning - Model Based: sample size
Sample size (for an optimal policy):
- Naive: O(|S|² |A| log(|S| |A|)) samples (approximate each transition δ(s,a,s') well).
- Better: O(|S| |A| log(|S| |A|)) samples (sufficient to approximate an optimal policy). [KS, NIPS 98]
Learning - Model Based: on policy
- The learner has control over its actions.
- The immediate goal is to learn a model.
- As before: build an observed model (transition probabilities and rewards) and use it to estimate the value of the policy.
- Accelerating the learning: how do we reach new places?!
Learning - Model Based: on policy
[Figure: the state space split into well-sampled nodes and relatively unknown nodes]
Learning - Model Baed: on policy HIGH REAWRD Well ampled node Relatively unknown node Exploration Planning in new MDP 40
Learning: Policy Improvement
- Assume that we can perform the following: given a policy π, estimate the V and Q functions of π.
- Then we can run policy improvement: π = Greedy(Q).
- The process converges if the estimates are accurate.