Introduction to Reinforcement Learning
1 Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011, Bordeaux
2 Outline of the course Part 1: Introduction to Reinforcement Learning and Dynamic Programming Setting, examples Dynamic programming: value iteration, policy iteration RL algorithms: TD(λ), Q-learning. Part 2: Approximate dynamic programming L∞-performance bounds Sample-based algorithms: Least Squares TD, Bellman Residual, Fitted-VI Part 3: Exploration-Exploitation tradeoffs The stochastic bandit: UCB The adversarial bandit: EXP3 Populations of bandits: Tree search, Nash equilibrium. Applications to games (Go, Poker) and planning.
3 Part 1: Introduction to Reinforcement Learning and Dynamic Programming Setting, examples Dynamic programming: value iteration, policy iteration RL algorithms: TD(λ), Q-learning. General references: Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, Introduction to Reinforcement Learning, Sutton and Barto, Markov Decision Problems, Puterman, Markov Decision Processes in Artificial Intelligence, Sigaud and Buffet ed., Algorithms for Reinforcement Learning, Szepesvári, 2009.
4 Introduction to Reinforcement Learning (RL) Acquire skills for sequential decision making in complex, stochastic, partially observable, possibly adversarial environments. Learning from experience a behavior policy (what to do in each situation) from past successes or failures. Examples: Hot and Cold game, chess, learning to ride a bicycle, autonomous robotics, operations research, decision making in stochastic markets,... any adaptive sequential decision making task.
5 Birth of the domain A meeting of fields at the end of the 70s: Computational Neuroscience. Reinforcement of synaptic weights in neuronal transmissions (Hebb's rule, Rescorla-Wagner models). Reinforcement = correlations in neuronal activity. Experimental Psychology. Animal conditioning: reinforcement of behaviors that lead to a satisfaction (behaviorism initiated by Pavlov, Skinner). Reinforcement = satisfaction or discomfort. Independently, a mathematical formalism appeared in the 50s and 60s: Dynamic Programming by R. Bellman, in optimal control theory. Reinforcement = criterion to be optimized.
6 Experimental Psychology About animal conditioning, Thorndike (1911) says: Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. The Law of Effect.
7 History of computational RL Shannon 1950: Programming a computer for playing chess. Minsky 1954: Neural-Analog Reinforcement Systems. Samuel 1959: Learning for the game of checkers. Michie 1961: Trial and error for the tic-tac-toe game. Michie and Chambers 1968: Inverted pendulum. Widrow, Gupta, Maitra 1973: Punish/reward: learning with a critic in adaptive threshold systems. Neuronal rules. Barto, Sutton, Anderson 1983: Actor-Critic neuronal rules. Sutton 1984: Temporal Credit Assignment in RL. Sutton 1988: Learning from temporal differences. Watkins 1989: Q-learning. Tesauro 1992: TD-Gammon.
8 A few applications TD-Gammon [Tesauro ]: Backgammon. KnightCap [Baxter et al. 1998]: chess (~2500 ELO). Robotics: juggling, acrobots [Schaal and Atkeson, 1994]. Mobile robot navigation [Thrun et al., 1999]. Elevator controller [Crites and Barto, 1996]. Packet routing [Boyan and Littman, 1993]. Job-shop scheduling [Zhang and Dietterich, 1995]. Production manufacturing optimization [Mahadevan et al., 1998]. Game of poker (bandit algorithms for Nash computation). Game of go (hierarchical bandits, UCT). szepesva/research/rlapplications.html
9 Intro to Reinforcement Learning Intro to Dynamic Programming DP algorithms... RL algorithms...
10 Reinforcement Learning [Diagram: a decision-making agent interacts with the environment in a loop: it observes a state, chooses an action, and receives a reinforcement; the environment may be stochastic, partially observable, or adversarial.] Environment: can be stochastic (Tetris), adversarial (chess), partially unknown (bicycle), partially observable (robot). Available information: the reinforcement (may be delayed). Goal: maximize the expected sum of future rewards. Problem: how to sacrifice a small short-term reward in order to obtain larger rewards in the long term?
11 Optimal value function Gives an evaluation of each state if the agent plays optimally. Ex: in a stochastic environment, the value V*(x_t) of the current state is linked, through the actions and transition probabilities, to the values V*(x_{t+1}) of the possible next states. Bellman eq.: V*(x_t) = max_{a∈A} [ r(x_t, a) + Σ_y p(y|x_t, a) V*(y) ]. Temporal difference: δ_t(V*) = V*(x_{t+1}) + r(x_t, a_t) − V*(x_t). If V* is known, then when choosing the optimal action a_t, E[δ_t(V*)] = 0 (i.e., on average there is no surprise).
12 Learning the value function How can we learn the value function? Thanks to the surprise! (i.e., the temporal difference.) If V is an approximation of V* and we observe δ_t(V) = V(x_{t+1}) + r(x_t, a_t) − V(x_t) > 0, then V(x_t) should be increased. From V* we deduce the optimal action in x_t: arg max_{a∈A} [ r(x_t, a) + Σ_y p(y|x_t, a) V*(y) ], i.e. locally maximizing the value function = maximizing the sum of future rewards. Note that RL is really different from supervised learning!
13 Challenges of RL The state-dynamics and reward functions are unknown. Two approaches: Model-based: first learn a model, then use dynamic programming. Model-free: use samples to directly learn a good value function. The curse of dimensionality: in high-dimensional problems, the computational complexity may be prohibitively large; we need to learn an approximation of the value function / policy. Exploration issue: where should one explore the state space in order to build a good representation of the unknown function where (and only where) it is useful?
14 Introduction to Dynamic Programming References: Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, Markov Decision Problems, Puterman, Markov Decision Processes in Artificial Intelligence, Sigaud and Buffet ed., 2008.
15 Markov Decision Process [Bellman 1957, Howard 1960, Dubins and Savage 1965] We consider a discrete-time process (x_t) ∈ X where: X: state space (assumed to be finite). A: action space (or decisions) (assumed to be finite). State dynamics: all relevant information about the future is included in the current state and action: P(x_{t+1} | x_t, x_{t−1}, ..., x_0, a_t, a_{t−1}, ..., a_0) = P(x_{t+1} | x_t, a_t) (Markov property). Thus we can define transition probabilities p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a). Reinforcement (or reward): r(x, a) is obtained when choosing action a in state x. An MDP is defined by (X, A, p, r).
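Since X and A are finite, the tuple (X, A, p, r) maps directly onto arrays. A minimal sketch in Python/NumPy; the 2-state, 2-action MDP below is hypothetical, with numbers chosen only for illustration:

```python
import numpy as np

n_states, n_actions = 2, 2

# p[a, x, y] = P(x_{t+1} = y | x_t = x, a_t = a); each row sums to 1.
p = np.array([
    [[0.9, 0.1],   # action 0 from state 0
     [0.2, 0.8]],  # action 0 from state 1
    [[0.5, 0.5],   # action 1 from state 0
     [0.0, 1.0]],  # action 1 from state 1
])

# r[x, a] = reward obtained when choosing action a in state x.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# Sanity check: each p[a, x, :] is a probability distribution over next states.
assert np.allclose(p.sum(axis=2), 1.0)
```

Storing p with the action as the leading axis makes p[a] the N×N transition matrix of action a, which is convenient for the matrix form used in the following slides.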
16 Definition of policy Policy π = (π_1, π_2, ...), where π_t: X → A defines the action π_t(x) chosen in x at time t. If π = (π, π, ..., π), the policy is stationary or Markovian. When following a stationary policy π, the process (x_t)_{t≥0} is a Markov chain with transition probabilities p(x_{t+1}|x_t) = p(x_{t+1}|x_t, π(x_t)). Our goal is to learn a policy that maximizes the sum of rewards.
17 Performance of a policy For any policy π, define the value function V^π. Infinite horizon: Discounted: V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x; π ], where 0 ≤ γ < 1 is the discount factor. Undiscounted: V^π(x) = E[ Σ_{t≥0} r(x_t, a_t) | x_0 = x; π ]. Average: V^π(x) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(x_t, a_t) | x_0 = x; π ]. Finite horizon: V^π(x, t) = E[ Σ_{s=t}^{T−1} r(x_s, a_s) + R(x_T) | x_t = x; π ]. Goal of the MDP: find an optimal policy π*, i.e. V^{π*} = sup_π V^π.
18 Example: Tetris State: wall configuration + new piece. Action: possible positions of the new piece on the wall. Reward: number of lines removed. Next state: resulting configuration of the wall + random new piece. (We can prove that for any policy, the game ends in finite time a.s.) State space: a huge number of states for regular Tetris.
19 The dilemma of the MLSS student [Diagram: a 7-state MDP. In the first states the student chooses between Think and Sleep; some transitions are stochastic (with probabilities such as 0.5, 0.4, 0.6) and rewards of magnitude 0, 1, 10, 100, 1000 are attached to the branches and terminal states.] You try to maximize the sum of rewards!
20 Solution of the MLSS student [Diagram: the same MDP annotated with the computed values V_1, ..., V_7.] The terminal values V_5 = 10, V_6 = 100, V_7 = 1000 follow directly from their rewards; the intermediate values V_4, V_3, V_2 follow by taking expectations over the transitions; and V_1 = max over the two actions (Think or Sleep) of the corresponding expected value. Solving the resulting system yields V_1 = V_2 = 88.3.
21 Infinite horizon, discounted problems Let π be a stationary deterministic policy. Define the value function V^π as: V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ], where 0 ≤ γ < 1 is a discount factor (which weights future rewards relative to current rewards). And define the optimal value function: V* = sup_π V^π. Proposition 1 (Bellman equations). For any policy π, V^π satisfies: V^π(x) = r(x, π(x)) + γ Σ_{y∈X} p(y|x, π(x)) V^π(y), and V* satisfies: V*(x) = max_{a∈A} [ r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y) ].
22 Proof of Proposition 1 V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ] = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).
23 Proof of Proposition 1 (continued) And for any policy π = (a, π') (not necessarily stationary), V*(x) = max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ] = max_{(a,π')} [ r(x, a) + γ Σ_y p(y|x, a) V^{π'}(y) ] = max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π'} V^{π'}(y) ] = max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]. (1) where (1) holds since: max_{π'} Σ_y p(y|x, a) V^{π'}(y) ≤ Σ_y p(y|x, a) max_{π'} V^{π'}(y). Conversely, let π̄ be the policy defined by π̄(y) = arg max_{π'} V^{π'}(y). Then Σ_y p(y|x, a) max_{π'} V^{π'}(y) = Σ_y p(y|x, a) V^{π̄}(y) ≤ max_{π'} Σ_y p(y|x, a) V^{π'}(y).
24 Bellman operators Since X is finite (say with N states), V^π can be considered as a vector of R^N. Write: r^π ∈ R^N, the vector with components r^π(x) = r(x, π(x)); P^π ∈ R^{N×N}, the stochastic matrix with elements P^π(x, y) = p(y|x, π(x)). Define the Bellman operator T^π: R^N → R^N: for any W ∈ R^N, T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y), and the Dynamic Programming operator T: R^N → R^N: T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].
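Treating W as a vector of R^N, both operators are a few lines of code. A sketch assuming the array layout p[a, x, y] and r[x, a] (an illustrative convention, not from the slides); the last lines check the γ-contraction property numerically on a toy MDP:

```python
import numpy as np

def bellman_pi(W, p, r, pi, gamma):
    """Apply T^pi: (T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y|x, pi(x)) W(y)."""
    n = len(W)
    return np.array([r[x, pi[x]] + gamma * p[pi[x], x] @ W for x in range(n)])

def bellman_opt(W, p, r, gamma):
    """Apply T: (T W)(x) = max_a [ r(x, a) + gamma * sum_y p(y|x, a) W(y) ]."""
    n, n_actions = r.shape
    return np.array([max(r[x, a] + gamma * p[a, x] @ W for a in range(n_actions))
                     for x in range(n)])

# Contraction check on a hypothetical 2-state MDP (numbers are illustrative).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
gamma = 0.9
W1, W2 = np.array([1.0, -3.0]), np.array([0.0, 2.0])
lhs = np.max(np.abs(bellman_opt(W1, p, r, gamma) - bellman_opt(W2, p, r, gamma)))
assert lhs <= gamma * np.max(np.abs(W1, ) - W2).max() if False else lhs <= gamma * np.max(np.abs(W1 - W2)) + 1e-12
```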
25 Proposition 2. Properties of the value functions 1. V^π is the unique fixed point of T^π: V^π = T^π V^π. 2. V* is the unique fixed point of T: V* = T V*. 3. For any policy π, we have V^π = (I − γP^π)^{−1} r^π. 4. The policy defined by π*(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ] is optimal (and stationary).
26 Proof of Proposition 2 [part 1] Basic properties of the Bellman operators: Monotonicity: if W_1 ≤ W_2 (componentwise) then T^π W_1 ≤ T^π W_2 and T W_1 ≤ T W_2. Contraction in max-norm: for any vectors W_1 and W_2, ||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞ and ||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞. Indeed, for all x ∈ X, T W_1(x) − T W_2(x) = max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_a [ r(x, a) + γ Σ_y p(y|x, a) W_2(y) ] ≤ γ max_a Σ_y p(y|x, a) |W_1(y) − W_2(y)| ≤ γ ||W_1 − W_2||_∞.
27 Proof of Proposition 2 [part 2] 1. From Proposition 1, V^π is a fixed point of T^π. Uniqueness comes from the contraction property of T^π. 2. Idem for V*. 3. From point 1, V^π = r^π + γP^π V^π. Thus (I − γP^π) V^π = r^π. Now P^π is a stochastic matrix (whose eigenvalues have modulus ≤ 1), thus the eigenvalues of (I − γP^π) have modulus ≥ 1 − γ > 0, so it is invertible. 4. From the definition of π*, we have T^{π*} V* = T V* = V*. Thus V* is a fixed point of T^{π*}. But, by definition, V^{π*} is the fixed point of T^{π*}, and since the fixed point is unique, V^{π*} = V* and π* is optimal.
28 Value Iteration Proposition 3. For any V_0 ∈ R^N, define V_{k+1} = T V_k. Then V_k → V*. Idem for V^π: define V_{k+1} = T^π V_k. Then V_k → V^π. Proof. ||V_{k+1} − V*||_∞ = ||T V_k − T V*||_∞ ≤ γ ||V_k − V*||_∞ ≤ γ^{k+1} ||V_0 − V*||_∞ → 0 (idem for V^π). Variant: asynchronous iterations.
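Value iteration is then a short loop. A sketch under the same hypothetical array layout (p[a, x, y], r[x, a]); the stopping test uses the max-norm, matching the contraction argument:

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-10):
    """Iterate V_{k+1} = T V_k until the max-norm change drops below tol."""
    n, n_actions = r.shape
    V = np.zeros(n)
    while True:
        V_next = np.array([max(r[x, a] + gamma * p[a, x] @ V
                               for a in range(n_actions)) for x in range(n)])
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next

# Hypothetical 2-state MDP (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
V_star = value_iteration(p, r, gamma=0.9)

# Fixed-point check: V* = T V* (up to the tolerance).
TV = np.array([max(r[x, a] + 0.9 * p[a, x] @ V_star for a in range(2))
               for x in range(2)])
assert np.allclose(V_star, TV)
```

The geometric rate γ^k from Proposition 3 means roughly log(1/tol)/log(1/γ) iterations suffice.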
29 Policy Iteration Choose any initial policy π_0. Iterate: 1. Policy evaluation: compute V^{π_k}. 2. Policy improvement: π_{k+1} greedy w.r.t. V^{π_k}: π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ] (i.e. π_{k+1} ∈ arg max_π T^π V^{π_k}). Stop when V^{π_k} = V^{π_{k+1}}. Proposition 4. Policy iteration generates a sequence of policies with increasing performance (V^{π_{k+1}} ≥ V^{π_k}) and terminates in a finite number of steps with the optimal policy π*.
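A sketch of policy iteration under the same hypothetical array layout, using the closed-form evaluation V^π = (I − γP^π)^{−1} r^π from Proposition 2 for the evaluation step:

```python
import numpy as np

def policy_iteration(p, r, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n, n_actions = r.shape
    pi = np.zeros(n, dtype=int)
    while True:
        # Evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi  (Proposition 2, point 3).
        P_pi = np.array([p[pi[x], x] for x in range(n)])
        r_pi = np.array([r[x, pi[x]] for x in range(n)])
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Improvement: pi' greedy with respect to V^pi.
        Q = np.array([[r[x, a] + gamma * p[a, x] @ V for a in range(n_actions)]
                      for x in range(n)])
        pi_next = Q.argmax(axis=1)
        if np.array_equal(pi_next, pi):   # stable policy => optimal
            return pi, V
        pi = pi_next

# Hypothetical 2-state MDP (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
pi_star, V_star = policy_iteration(p, r, gamma=0.9)
```

Termination follows Proposition 4: each improvement step strictly increases the value until the greedy policy is stable.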
30 Proof of Proposition 4 From the definition of the operators T, T^{π_k}, T^{π_{k+1}} and of π_{k+1}, V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k}, (2) and from the monotonicity of T^{π_{k+1}}, we have V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}. Thus (V^{π_k})_k is a non-decreasing sequence. Since there is a finite number of possible policies (finite state and action spaces), the stopping criterion holds for a finite k. We thus have equality in (2), thus V^{π_k} = T V^{π_k}, so V^{π_k} = V* and π_k is an optimal policy.
31 Back to Reinforcement Learning What if the transition probabilities p(y|x, a) and the reward functions r(x, a) are unknown? In DP, we used their knowledge: in value iteration: V_{k+1}(x) = T V_k(x) = max_a [ r(x, a) + γ Σ_y p(y|x, a) V_k(y) ]; in policy iteration: when computing V^{π_k} (which requires iterating T^{π_k}) and when computing the greedy policy: π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ]. RL = introduction of 2 ideas: sampling and Q-functions.
32 1st idea: use sampling Use the real system (or a simulator) to generate trajectories. This means that from any state-action pair (x_t, a_t) we can obtain a next-state sample x_{t+1} ~ p(·|x_t, a_t) and a reward sample r(x_t, a_t). So we can estimate V^π(x) by Monte-Carlo: run trajectories (x_t^k) starting from x and following π, and update V_{k+1}(x) = (1 − η_k) V_k(x) + η_k Σ_{t≥0} γ^t r(x_t^k, π(x_t^k)), where e.g. η_k = 1/k (sample mean), or more generally use Stochastic Approximation with Σ_k η_k = ∞ and Σ_k η_k² < ∞. Problem: this method has to be repeated for all initial states x and is not sample efficient.
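A sketch of this Monte-Carlo estimator on a hypothetical 2-state simulator (the sample mean over trajectories corresponds to η_k = 1/k); the estimate is compared against the exact V^π = (I − γP^π)^{−1} r^π:

```python
import numpy as np

def mc_value(x0, pi, p, r, gamma, rng, n_traj=1000, horizon=100):
    """Average truncated discounted returns of trajectories following pi."""
    total = 0.0
    for _ in range(n_traj):
        x, ret, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            a = pi[x]
            ret += disc * r[x, a]          # reward for choosing a in x
            disc *= gamma
            x = rng.choice(r.shape[0], p=p[a, x])  # x_{t+1} ~ p(.|x_t, a_t)
        total += ret
    return total / n_traj

# Hypothetical 2-state MDP and policy (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
pi, gamma = np.array([0, 0]), 0.9
rng = np.random.default_rng(0)

# Exact value for comparison: V^pi = (I - gamma P^pi)^{-1} r^pi.
P_pi = np.array([p[pi[x], x] for x in range(2)])
r_pi = np.array([r[x, pi[x]] for x in range(2)])
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

V_hat = mc_value(0, pi, p, r, gamma, rng)
assert abs(V_hat - V_exact[0]) < 0.4   # Monte-Carlo noise, hence the loose bound
```

Truncating at horizon 100 is harmless here since γ^100 is tiny; the dominant error is the Monte-Carlo variance, which is exactly the sample-efficiency problem the slide points out.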
33 Temporal difference learning The update rule rewrites (writing r_t = r(x_t, π(x_t))): V_{k+1}(x_s) = (1 − η_k) V_k(x_s) + η_k Σ_{t≥s} γ^{t−s} r_t = V_k(x_s) + η_k Σ_{t≥s} γ^{t−s} [ r_t + γ V_k(x_{t+1}) − V_k(x_t) ], where δ_t(V_k) = r_t + γ V_k(x_{t+1}) − V_k(x_t) is the temporal difference for the transition x_t → x_{t+1}. δ_t(V_k) provides an indication of the direction in which the estimate V_k(x_s) should be updated. Note that E[δ_t(V^π)] = 0 (this is the Bellman equation).
34 TD(λ) algorithm [Sutton, 1988] Observe a trajectory (x_t) by following π. Update the value of each state x_s: V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) Σ_{t≥s} (γλ)^{t−s} δ_t(V_k), where 0 ≤ λ ≤ 1. V_k(x_s) is impacted by δ_t(V_k) at all times t ≥ s. For λ = 1, we recover Monte-Carlo: V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) Σ_{t≥s} γ^{t−s} δ_t(V_k). For λ = 0, we have TD(0): V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) δ_s(V_k).
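The same updates can be applied incrementally with eligibility traces (the backward view of TD(λ)). A sketch on a hypothetical deterministic 3-state chain (0 → 1 → 2, reward 1 per transition, state 2 absorbing with reward 0); the chain and the constant step size are illustrative choices, not from the slides:

```python
import numpy as np

def td_lambda(n_states, step, x0, gamma, lam, alpha, n_episodes, horizon):
    """TD(lambda) with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)              # eligibility trace per state
        x = x0
        for _ in range(horizon):
            y, rew = step(x)
            delta = rew + gamma * V[y] - V[x]  # temporal difference delta_t(V)
            e *= gamma * lam                   # decay all traces by (gamma*lambda)
            e[x] += 1.0                        # mark the visit to x
            V += alpha * delta * e             # credit every recently visited state
            x = y
    return V

# Hypothetical deterministic chain: 0 -> 1 -> 2 (absorbing), reward 1 per move.
def step(x):
    return (x + 1, 1.0) if x < 2 else (2, 0.0)

gamma = 0.9
V = td_lambda(3, step, x0=0, gamma=gamma, lam=0.5, alpha=0.1,
              n_episodes=2000, horizon=10)
# Exact values: V(2) = 0, V(1) = 1, V(0) = 1 + gamma * 1 = 1.9.
assert np.allclose(V, [1.9, 1.0, 0.0], atol=1e-3)
```

With lam = 1.0 the traces implement the Monte-Carlo update and with lam = 0.0 only the current state is updated, matching the two limiting cases above.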
35 Convergence of TD(λ) Proposition 5 (Jaakkola, Jordan and Singh, 1994). Assume that all states x ∈ X are visited infinitely often and that the learning steps η_k(x) satisfy, for all x ∈ X, Σ_{k≥0} η_k(x) = ∞ and Σ_{k≥0} η_k²(x) < ∞; then V_k → V^π almost surely. Proof. TD(0) rewrites V_{k+1}(x_s) = V_k(x_s) + η_k(x_s) [ T̂^π V_k(x_s) − V_k(x_s) ], where T̂^π V_k(x_s) = r_s + γ V_k(x_{s+1}) is a noisy estimate of T^π V_k(x_s) = r(x_s, a_s) + γ Σ_y p(y|x_s, a_s) V_k(y). This is a Stochastic Approximation algorithm for estimating the fixed point of the contraction mapping T^π. TD(λ) is similar.
36 Tradeoff for λ [Plot: expected quadratic error as a function of λ, averaged over 100 trajectories, on a linear-chain example.] Small λ reduces variance; large λ propagates rewards faster.
37 2nd idea: use Q-value functions Define the Q-value function Q^π: X × A → R: for a policy π, Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t) for t ≥ 1 ], and the optimal Q-value function Q*(x, a) = max_π Q^π(x, a). Proposition 6. Q^π and Q* satisfy the Bellman equations: Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) Q^π(y, π(y)) and Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} Q*(y, b). Idea: compute Q*, and then π*(x) ∈ arg max_a Q*(x, a).
38 Q-learning algorithm [Watkins, 1989] Whenever a transition x_t, a_t → r_t, x_{t+1} occurs, update the Q-value: Q_{k+1}(x_t, a_t) = Q_k(x_t, a_t) + η_k(x_t, a_t) [ r_t + γ max_{b∈A} Q_k(x_{t+1}, b) − Q_k(x_t, a_t) ]. Proposition 7 (Watkins and Dayan, 1992). Assume that all state-action pairs (x, a) are visited infinitely often and that the learning steps satisfy, for all (x, a), Σ_{k≥0} η_k(x, a) = ∞ and Σ_{k≥0} η_k²(x, a) < ∞; then Q_k → Q* almost surely. Again the proof relies on a Stochastic Approximation algorithm for estimating the fixed point of a contraction mapping. Remark: this does not say how to explore the space.
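A sketch of tabular Q-learning with uniform random exploration, on a hypothetical deterministic 3-state chain (action 0 moves right, action 1 stays; reward 1 only on entering the terminal state 2). Since the dynamics are deterministic, a constant step size η = 1 suffices:

```python
import numpy as np

def q_learning(n_states, n_actions, step, gamma, n_episodes, horizon, rng):
    """Tabular Q-learning; eta = 1 is valid here because the MDP is deterministic."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        x = int(rng.integers(0, n_states - 1))   # random non-terminal start state
        for _ in range(horizon):
            a = int(rng.integers(n_actions))     # explore uniformly at random
            y, rew = step(x, a)
            Q[x, a] = rew + gamma * Q[y].max()   # eta = 1 update
            x = y
    return Q

# Hypothetical chain: action 0 moves right, action 1 stays; state 2 is terminal.
def step(x, a):
    if x == 2:
        return 2, 0.0
    if a == 0:
        y = x + 1
        return y, 1.0 if y == 2 else 0.0
    return x, 0.0

rng = np.random.default_rng(0)
Q = q_learning(3, 2, step, gamma=0.9, n_episodes=300, horizon=20, rng=rng)
# Q*(1,right)=1, Q*(0,right)=0.9, Q*(1,stay)=0.9, Q*(0,stay)=0.81, Q*(2,.)=0.
assert np.allclose(Q[:2], [[0.9, 0.81], [1.0, 0.9]])
```

Note the update uses max over actions at the next state, so it learns Q* off-policy while the behavior (here, uniformly random) only needs to keep visiting every state-action pair, exactly as Proposition 7 requires.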
39 Q-learning algorithm Deterministic case, discount factor γ = 0.9. Take steps η = 1. After a transition x, a → r, y, update Q_{k+1}(x, a) = r + γ max_{b∈A} Q_k(y, b).
40 Optimal Q-values Bellman's equation: Q*(x, a) = γ max_{b∈A} Q*(next-state(x, a), b).
41 Conclusions of Part 1 When the state space is finite and small: if transition probabilities and rewards are known, then DP algorithms (value iteration, policy iteration) compute the optimal solution; otherwise, sampling-based RL algorithms (TD(λ), Q-learning) apply. 2 big problems: Usually the state space is HUGE! We face the curse of dimensionality and thus need to find approximate solutions → Part 2. We need efficient exploration → Part 3.
More informationApproximate Dynamic Programming
Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationChapter 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
Chapter 7: Eligibility Traces R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Midterm Mean = 77.33 Median = 82 R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
More informationReinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN
Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:
More informationValue Function Based Reinforcement Learning in Changing Markovian Environments
Journal of Machine Learning Research 9 (2008) 1679-1709 Submitted 6/07; Revised 12/07; Published 8/08 Value Function Based Reinforcement Learning in Changing Markovian Environments Balázs Csanád Csáji
More informationLecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010
Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book
More informationAnimal learning theory
Animal learning theory Based on [Sutton and Barto, 1990, Dayan and Abbott, 2001] Bert Kappen [Sutton and Barto, 1990] Classical conditioning: - A conditioned stimulus (CS) and unconditioned stimulus (US)
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationAutonomous Helicopter Flight via Reinforcement Learning
Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationIn Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.
In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous
More informationIntroduction to Reinforcement Learning
CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.
More informationReinforcement learning an introduction
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationReinforcement Learning: An Introduction
Introduction Betreuer: Freek Stulp Hauptseminar Intelligente Autonome Systeme (WiSe 04/05) Forschungs- und Lehreinheit Informatik IX Technische Universität München November 24, 2004 Introduction What is
More informationReinforcement Learning
Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart
More informationApproximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.
An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu
More informationReinforcement Learning
Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.
More informationLeast-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems
: Bias-Variance Trade-off in Control Problems Christophe Thiery Bruno Scherrer LORIA - INRIA Lorraine - Campus Scientifique - BP 239 5456 Vandœuvre-lès-Nancy CEDEX - FRANCE thierych@loria.fr scherrer@loria.fr
More informationAlgorithms for Reinforcement Learning
Algorithms for Reinforcement Learning Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning series by Morgan & Claypool Publishers Csaba Szepesvári June
More informationReinforcement Learning. Up until now we have been
Reinforcement Learning Slides by Rich Sutton Mods by Dan Lizotte Refer to Reinforcement Learning: An Introduction by Sutton and Barto Alpaydin Chapter 16 Up until now we have been Supervised Learning Classifying,
More informationMarks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:
Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,
More informationThis question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.
This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationReinforcement learning
Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationMarkov Decision Processes Chapter 17. Mausam
Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.
More informationECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control
ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu
More informationTemporal Difference Learning & Policy Iteration
Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.
More informationLecture 3: The Reinforcement Learning Problem
Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationCS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study
CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.
More informationRewards and Value Systems: Conditioning and Reinforcement Learning
Rewards and Value Systems: Conditioning and Reinforcement Learning Acknowledgements: Many thanks to P. Dayan and L. Abbott for making the figures of their book available online. Also thanks to R. Sutton
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationDeep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017
Deep Reinforcement Learning STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Outline Introduction to Reinforcement Learning AlphaGo (Deep RL for Computer Go)
More informationReinforcement Learning In Continuous Time and Space
Reinforcement Learning In Continuous Time and Space presentation of paper by Kenji Doya Leszek Rybicki lrybicki@mat.umk.pl 18.07.2008 Leszek Rybicki lrybicki@mat.umk.pl Reinforcement Learning In Continuous
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationReinforcement Learning II
Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini
More information