Misc. Topics and Reinforcement Learning


1 1/79 Misc. Topics and Reinforcement Learning Acknowledgement: parts of these slides are based on Prof. Mengdi Wang's, Prof. Dimitri Bertsekas's and Prof. David Silver's lecture notes

2 2/79 Homework 1) Read the following chapters of Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction: Chapter 2: Multi-armed Bandits; Chapter 3: Finite Markov Decision Processes; Chapter 4: Dynamic Programming; Chapter 5: Monte Carlo Methods; Chapter 6: Temporal-Difference Learning; Chapter 9: On-policy Prediction with Approximation; Chapter 10: On-policy Control with Approximation; Chapter 13: Policy Gradient Methods. 2) Work through at least three Examples in each chapter; if code is provided, test or implement it.

3 Outline 3/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

4 Policy Approximation 4/79
Sometimes it is easier to parameterize the policy µ as µ(i) ≈ µ(i; σ). Direct policy search:
$\min_\sigma \; \mathbb{E}\Big[\sum_{k=1}^{\infty} \alpha^k g(i_k, \mu(i_k;\sigma), j_k)\Big]$
This is a stochastic optimization problem. Very likely to be nonconvex. σ is relatively low-dimensional. The high dimensionality of the DP is hidden in the complicated expectation.

5 Direct policy search via Sample Average Approximation 5/79
One approach is to simulate many long trajectories and formulate a sample average approximation
$\min_\sigma \; \sum_{l=1}^{L}\sum_{k=1}^{\infty} \alpha^k g(i_{kl}, \mu(i_{kl};\sigma), j_{kl})$
which is based on L trajectories $\{i_{0l}, i_{1l}, \ldots\}$, where l = 1, ..., L. As L → ∞, the sample average approximation converges to the original problem, and its solution converges to the optimal σ*.

6 Direct policy search via Stochastic Gradient Descent 6/79
Another approach is to apply stochastic gradient descent. At each iteration l, generate a trajectory $\{i_{0l}, i_{1l}, \ldots\}$ and compute the gradient
$G_l = \sum_{k=1}^{\infty} \alpha^k \nabla_\sigma g(i_{kl}, \mu(i_{kl};\sigma), j_{kl})$,
and update the policy parameter by $\sigma_{l+1} = \sigma_l - \gamma G_l$.
Each $G_l$ is an unbiased sample of the gradient of the overall objective at the current parameter $\sigma_l$. If the gradient is not computable, it can be replaced with finite differences.
If the problem is convex, $\sigma_l \to \sigma^*$ a.s. If nonconvex (the most likely case), $\sigma_l \to$ a local optimum a.s. (good enough in practice most of the time).
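As a concrete illustration of this slide, here is a minimal Python sketch of direct policy search with SGD, where the gradient of each trajectory is replaced by finite differences. The toy dynamics, the two-dimensional parameter σ, and the names `simulate_return` and `finite_diff_gradient` are all invented for illustration and are not part of the lecture.

```python
import numpy as np

def simulate_return(sigma, horizon=200, alpha=0.99, rng=None):
    """Placeholder: roll out one trajectory under policy mu(.; sigma) and
    return the discounted cost. The environment here is purely illustrative."""
    rng = rng or np.random.default_rng()
    cost, state = 0.0, rng.normal()
    for k in range(horizon):
        action = np.tanh(sigma @ np.array([state, 1.0]))   # simple parametric policy
        state = 0.9 * state + action + 0.1 * rng.normal()  # toy dynamics
        cost += alpha**k * state**2                         # toy per-stage cost
    return cost

def finite_diff_gradient(sigma, eps=1e-2, rng=None):
    """Estimate the gradient of the expected cost by forward differences,
    one coordinate at a time, using one trajectory per evaluation."""
    base = simulate_return(sigma, rng=rng)
    grad = np.zeros_like(sigma)
    for k in range(sigma.size):
        e_k = np.zeros_like(sigma)
        e_k[k] = eps
        grad[k] = (simulate_return(sigma + e_k, rng=rng) - base) / eps
    return grad

sigma = np.zeros(2)
for l in range(500):                  # SGD loop: sigma_{l+1} = sigma_l - gamma * G_l
    G_l = finite_diff_gradient(sigma)
    sigma -= 1e-3 * G_l
```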

7 A Very Simple Example: Electricity Price 7/79 Electricity prices are very volatile. It is very difficult to store electricity. Spiky prices under heavy load (hot summer, freezing winter, ...). Average/median price fairly cheap. Related to weather, season, time of day, etc.

8 8/79 Battery Storage Suppose that you operate a battery and participate in the regional electricity market Challenge: find a policy for charging and discharging the battery Strategy posed by the battery manufacturer: Buy low, sell high

9 9/79 There are many approaches based on DP. Solving a DP requires full knowledge of the model: how do prices change? does the storage have a market impact on future prices? if there is a forecast, how good is it? what are the uncertainties and what are their distributions? To gain this knowledge, we could fit a time series model for the price dynamics, then formulate a DP problem and find an optimal policy for it.

10 DP Model 10/79
State: price $p_t$, storage level $c_t$
Action: sell $u_t = 1$, buy $u_t = -1$, do nothing $u_t = 0$
Action constraint: if $c_t = 0$, cannot sell; if $c_t = C$, cannot buy
State transition of storage inventory: $c_{t+1} = c_t - u_t$
State transition of electricity price (learned from data): $p_{t+1} = p_t + f(p_t) + \epsilon_t$

11 DP Model 11/79
The overall objective:
$\max_{\{u_t\}} \; \mathbb{E}\Big[\sum_{t=1}^{T} u_t\, p_t\Big]$
DP algorithm:
$V(p_t, c_t) = \max_{u_t}\ \big\{ u_t\, p_t + \mathbb{E}[V(p_{t+1}, c_t - u_t)] \big\}$

12 12/79 The ultimate problem is to find a function/policy/strategy µ : {state} → {action} such that
$\max_{\mu} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mu(p_t, c_t)\, p_t\Big]$
subject to the state transitions as constraints.
Difficulty I: We do not know the distribution and dynamics of the time series $p_t$ ⟹ Need a data-based approach.
Difficulty II: Searching over a space of functions is hard ⟹ Need to narrow the search to a simple parametric family.

13 Optimizing A Storage Policy 13/79 Consider a simple policy in which we choose a sell price and a buy price

14 Optimizing over Simple Policies 14/79
Now let us search over simple threshold policies. The modified problem is
$\max \; \mathbb{E}\Big[\sum_{t=1}^{T} \mu(p_t, c_t)\, p_t\Big]$
subject to
$\mu(p, c) = \begin{cases} -1 & \text{if } p < \rho_{\text{store}} \text{ and } c < C \\ \;\;\,1 & \text{if } p > \rho_{\text{withdraw}} \text{ and } c > 0 \\ \;\;\,0 & \text{otherwise} \end{cases}$
as well as the state transition constraints. We search for the threshold values $\rho_{\text{store}}$ and $\rho_{\text{withdraw}}$ by backtesting.
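To make the backtesting idea concrete, here is a minimal sketch that evaluates the threshold policy over a price history for each pair of thresholds on a grid and keeps the best pair. The `prices` array, grid resolution, and the `backtest` helper are hypothetical stand-ins; in practice `prices` would be the historical price series.

```python
import numpy as np

def backtest(prices, rho_store, rho_withdraw, C=1):
    """Run the threshold policy once over the price history; return total profit.
    Buy one unit when p < rho_store (and not full), sell one unit when
    p > rho_withdraw (and not empty)."""
    c, profit = 0, 0.0
    for p in prices:
        if p < rho_store and c < C:
            c += 1
            profit -= p          # buy (store)
        elif p > rho_withdraw and c > 0:
            c -= 1
            profit += p          # sell (withdraw)
    return profit

prices = 30 + 10 * np.random.rand(10_000)    # stand-in for historical prices
grid = np.linspace(prices.min(), prices.max(), 25)
best = max((backtest(prices, lo, hi), lo, hi)
           for lo in grid for hi in grid if lo < hi)
print("best profit %.1f at rho_store=%.1f, rho_withdraw=%.1f" % best)
```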

15 15/79 Optimizing over Simple Policies Average historical profit as a function of the two policy parameters. For a given pair $(\rho_{\text{store}}, \rho_{\text{withdraw}})$, the profit value is calculated by one simulation run over the entire price history. The optimal policy stands out!

16 Make the Problem Harder 16/79
Battery charge/discharge start-up time: In practice, we cannot turn the battery or generator on and off immediately; it needs to warm up for some time before charging/generating.
Using a forecast: Suppose that we have a 1-hour price forecast and we want the policy to make use of it. The policy could be: charge if a weighted combination of the current price and the forecast price is below a threshold, and withdraw if the weighted combination is above a threshold. The parameters are: two threshold prices and the weights of the combination.

17 Outline 17/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

18 18/79 Look at the Bellman Equation Again
Consider an MDP model with: states i = 1, ..., n; probability transition matrix under policy µ given by $P_\mu \in \mathbb{R}^{n \times n}$; per-stage cost of transition $g_\mu \in \mathbb{R}^n$.
The Bellman equation is
$J^* = \min_{\mu}\ \{ g_\mu + \alpha P_\mu J^* \}$
This is a nonlinear system of equations. Note: the right-hand side is the infimum of a number of linear mappings of $J^*$!

19 DP is a special case of LP 19/79
Theorem: Every finite-state DP problem is an LP problem.
Let c > 0. We construct the following LP:
max $\; c_1 J(1) + \ldots + c_n J(n)$
s.t. $\; J(i) \le \sum_{j=1}^{n} p_{ij}(u)\, g(i,u,j) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, J(j)$, for all i and all u ∈ A
or more compactly
max $\; c^\top J$  s.t.  $J \le g_\mu + \alpha P_\mu J$ for all µ
The variables are J(i), i = 1, ..., n. For each state-action pair (i, u), there is one inequality constraint.
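The following is a minimal sketch of this LP on a made-up 2-state, 2-action discounted MDP, solved with `scipy.optimize.linprog`. All transition probabilities and costs are invented for illustration; the point is only how the constraints $J(i) \le g(i,u) + \alpha \sum_j p_{ij}(u) J(j)$ are assembled.

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
# Toy 2-state, 2-action MDP (numbers are made up for illustration).
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # transition matrix of action 0
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}   # transition matrix of action 1
g = {0: np.array([1.0, 2.0]),                 # expected per-stage cost of action 0
     1: np.array([3.0, 0.5])}                 # expected per-stage cost of action 1

n = 2
c = np.ones(n)                 # any c > 0 works
A_ub, b_ub = [], []
for a in P:                    # constraint: J(i) - alpha * sum_j p_ij(a) J(j) <= g(i,a)
    for i in range(n):
        row = -alpha * P[a][i].copy()
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(g[a][i])

# linprog minimizes, so maximize c^T J by minimizing -c^T J; J is a free variable.
res = linprog(-c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
print("J* =", res.x)           # should match value iteration on the same toy MDP
```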

20 20/79 DP is a special case of LP
If $J \le TJ$, then $J \le J^*$. If $J \ge TJ$, then $J \ge J^*$.
Proof of the first claim: Suppose that $J \le TJ$. Applying the operator T on both sides k − 1 times, and using the monotonicity of T, we have $J \le TJ \le T^2 J \le \ldots \le T^k J$. Note that $\lim_{k\to\infty} T^k J = J^*$. Hence $J \le J^*$.

21 21/79 DP is a special case of LP
Theorem: The solution to the constructed LP
max $\; c^\top J$  s.t.  $J \le g_\mu + \alpha P_\mu J$ for all µ
is exactly the solution of the Bellman equation $J^* = \min_{\mu}\{ g_\mu + \alpha P_\mu J^* \}$.
Proof: The solution $J^*$ of the Bellman equation is clearly feasible for the LP. Conversely, any feasible J satisfies $J \le TJ$, hence $J \le J^*$ by the previous slide. Since c > 0, the LP objective is maximized only at $J = J^*$, so the LP solution coincides with the unique solution of the Bellman equation.

22 22/79 ADP via Approximate Linear Programming
The constructed LP is of huge scale:
max $\; c^\top J$  s.t.  $J \le g_\mu + \alpha P_\mu J$ for all µ
Approximate LP: We may approximate J by adding the constraint J = Φσ, so the variable dimension becomes smaller. We may sample a subset of all constraints, so the number of constraints becomes smaller. The LP and the approximate LP can be solved by simulation/online.

23 Outline 23/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

24 Multi-Arm Bandits - Simplest Online Learning Model 24/79 Suppose you are faced with N slot machines. Each bandit distributes a random prize. Some bandits are very generous, others not so much. Of course, you don't know what these expected payoffs are. By choosing only one bandit per round, our task is to devise a strategy to maximize our winnings.

25 Multi-Arm Bandits 25/79
The Problem: Let there be K bandits, each giving a random prize with expectation $\mathbb{E}[X_b] \in [0, 1]$, b ∈ {1, ..., K}. If you pull bandit b, you get an independent sample of $X_b$.
We want to solve the one-shot optimization problem
$\max_{b \in \{1,\ldots,K\}} \mathbb{E}[X_b]$,
but we can experiment repeatedly. Our task can be phrased as: find the best bandit, as quickly as possible. We use trial and error to gain knowledge of the expected values.

26 A Naive Strategy 26/79
For each round k, we are given the current estimated means $\hat{X}_b$, b ∈ {1, ..., K}.
1. With probability $1 - \epsilon_k$, select the bandit that currently looks best, $b_k = \arg\max_b \hat{X}_b$.
2. With probability $\epsilon_k$, select a bandit uniformly at random to pull.
3. Observe the result of pulling bandit $b_k$, update $\hat{X}_{b_k}$ as the new sample average, and return to 1.
Exploration vs. Exploitation: The random selection with probability $\epsilon_k$ guarantees exploration. The selection of the current best with probability $1 - \epsilon_k$ guarantees exploitation. With, say, $\epsilon_k = 1/k$, the algorithm increasingly exploits the existing knowledge and is guaranteed to find the truly optimal bandit.
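A minimal sketch of this ε-greedy strategy on Bernoulli arms follows; the number of arms, horizon, and true means are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.uniform(0, 1, size=5)        # unknown to the learner
K, T = len(true_means), 5000
counts, means = np.zeros(K), np.zeros(K)      # pull counts and sample averages

for k in range(1, T + 1):
    eps = 1.0 / k                             # decaying exploration rate
    if rng.random() < eps:
        b = rng.integers(K)                   # explore: uniformly random arm
    else:
        b = int(np.argmax(means))             # exploit: current best estimate
    reward = float(rng.random() < true_means[b])     # Bernoulli prize
    counts[b] += 1
    means[b] += (reward - means[b]) / counts[b]      # incremental sample average

print("estimated best arm:", np.argmax(means),
      "true best arm:", np.argmax(true_means))
```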

27 27/79 Bayesian Bandit Strategy
For each round, we are given the current posterior distributions of all bandits' mean returns.
1. Sample a random variable $X_b$ from the posterior of bandit b, for all b = 1, ..., K.
2. Select the bandit with the largest sample, i.e. select bandit $b^* = \arg\max_b X_b$.
3. Observe the result of pulling bandit $b^*$, update your posterior on bandit $b^*$, and return to 1.
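One common instantiation of this Bayesian (Thompson sampling) strategy uses Beta priors on Bernoulli arms, since the posterior update is then a simple count update. The sketch below assumes that setting; the arm means and horizon are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = rng.uniform(0, 1, size=5)        # unknown Bernoulli success rates
K, T = len(true_means), 5000
alpha, beta = np.ones(K), np.ones(K)          # Beta(1,1) priors on each arm's mean

for _ in range(T):
    samples = rng.beta(alpha, beta)           # one posterior sample per arm
    b = int(np.argmax(samples))               # play the arm with the largest sample
    reward = float(rng.random() < true_means[b])
    alpha[b] += reward                        # conjugate Beta-Bernoulli update
    beta[b] += 1.0 - reward

print("posterior means:", alpha / (alpha + beta))
```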

28 28/79

29 How are bandits related to real-time decision making? 29/79
Each bandit is a possible configuration of the policy parameters. A bandit strategy is online parameter tuning.
Evaluation of goodness:
$\text{Regret} = T \max_{b=1,\ldots,K} \mathbb{E}[X_b] - \sum_{t=1}^{T} \mathbb{E}[X_{b_t}]$
The regret of a reasonable learning strategy is usually between $\log T$ and $\sqrt{T}$.

30 Learning in Sequential Decision Making 30/79 Traditionally, sequential decision-making problems are modeled by dynamic programming. In one-shot optimization, finding the best arm by trial and error is known as online learning. Let's combine these two: DP + Online Learning = Reinforcement Learning

31 Outline 31/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

32 Q-Learning 32/79
Off-policy and model-free:
$Q(s, a) \leftarrow Q(s, a) + \alpha\big(r(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)$
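A minimal tabular Q-learning sketch on a randomly generated toy MDP is given below; the state/action counts, transition probabilities, rewards, and the ε-greedy behavior policy are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 10, 3, 0.95, 0.1
# Toy random MDP (illustration only): transition probabilities and rewards.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(nS, nA))            # reward r(s, a)

Q = np.zeros((nS, nA))
s = 0
for t in range(100_000):
    # epsilon-greedy behavior policy (off-policy: the target is the greedy policy)
    a = rng.integers(nA) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next = rng.choice(nS, p=P[s, a])
    # Q-learning update: bootstrap with the greedy value at the next state.
    td_target = R[s, a] + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    s = s_next

print("greedy policy:", np.argmax(Q, axis=1))
```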

33 DQN 33/79
$\min_\theta \; \mathbb{E}_{(s,a,s') \sim U(D)}\Big[\big(r(s, a) + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]$
Works well in discrete tasks, like Atari games.

34 DQN 34/79

35 Policy Gradient 35/79
So far, we have been discussing value-based RL: learn a value function, improve the policy (e.g. ε-greedy).
Now, consider parametrizing the policy. Policy-based RL: no value function; learn the policy $\pi_\theta(a \mid s) = P(a \mid s, \theta)$ directly.

36 Policy-based RL 36/79
Advantages: better convergence properties; effective in high-dimensional or continuous action spaces; can learn stochastic policies.
Disadvantages: typically converges to a local rather than global optimum; evaluating a policy is typically inefficient and has high variance.

37 Measure the quality of a policy $\pi_\theta$ 37/79
In episodic environments, with $s_0$ the start state of an episode:
$J_1(\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0\Big] = \sum_s p_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r_s^a$
where $p_{\pi_\theta}(s)$ is the unnormalized (discounted) visitation distribution under $\pi_\theta$:
$p_{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi_\theta)$
In continuing environments, use the average reward formulation:
$J_{avr}(\theta) = \sum_s p_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r_s^a$
where $p_{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$:
$p_{\pi_\theta}(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, \pi_\theta)$

38 Policy Gradient 38/79
Goal: find θ to maximize J(θ) by ascending the gradient of J with respect to θ:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\nabla_\theta J(\theta) = \Big(\frac{\partial J(\theta)}{\partial \theta_1}, \ldots, \frac{\partial J(\theta)}{\partial \theta_N}\Big)^\top$
How to compute the gradient? Perturb θ by a small amount ε in the k-th dimension, k ∈ [1, N]:
$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon e_k) - J(\theta)}{\epsilon}$
Simple, but noisy and inefficient in most cases.

39 One-step MDPs 39/79
Consider one-step MDPs: start with s ∼ p(s), and terminate after one step with reward $r_s^a$.
$J(\theta) = \sum_s p(s) \sum_a \pi_\theta(a \mid s)\, r_s^a$
$\nabla_\theta J(\theta) = \sum_s p(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, r_s^a = \sum_s p(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, r_s^a = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, r_s^a]$

40 Policy Gradient Theorem 40/79
Consider multi-step MDPs; the likelihood-ratio trick yields a similar conclusion.
Theorem: For any differentiable policy $\pi_\theta(a \mid s)$, and for either policy objective $J = J_1$ or $J = J_{avr}$, the policy gradient is
$\nabla_\theta J(\theta) = \sum_s p_{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = \sum_s p_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)]$
Reference: Policy Gradient Methods for Reinforcement Learning with Function Approximation, Richard S. Sutton et al.

41 Policy gradient methods 41/79
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)]$
Update the parameter θ along the direction of $\nabla_\theta J(\theta)$. Practically, apply SGD:
$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$
This requires the action-value function, and the step size α is important.
Policy gradients have rather strong convergence guarantees, even when used in conjunction with approximate value functions, and recent results created a theoretically solid framework for policy gradient estimation from sampled data.
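A minimal REINFORCE-style sketch of this update follows, where $Q^{\pi_\theta}(s,a)$ is replaced by the Monte Carlo return of the episode and the policy is a tabular softmax. The toy environment, episode length, and learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, lr = 6, 2, 0.99, 0.05
theta = np.zeros((nS, nA))                      # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics/reward, purely for illustration."""
    s_next = (s + 1) % nS if a == 1 else rng.integers(nS)
    return s_next, 1.0 if a == 1 else 0.0

for episode in range(2000):
    s, traj = 0, []
    for t in range(20):                         # roll out one episode under pi_theta
        a = rng.choice(nA, p=softmax(theta[s]))
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
    G = 0.0
    for s, a, r in reversed(traj):              # Monte Carlo return as the Q estimate
        G = r + gamma * G
        grad_logp = -softmax(theta[s])          # d log pi(a|s) / d theta[s, :]
        grad_logp[a] += 1.0
        theta[s] += lr * grad_logp * G          # SGD step along grad log pi * return
```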

42 42/79 Actor-Critic
Combine value function approximation and policy gradient to reduce the variance. Use a critic to estimate the action-value function: $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$.
Actor-Critic algorithms maintain two sets of parameters:
Critic: update the action-value function parameters w.
Actor: update the policy parameters θ in the direction suggested by the critic:
$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)]$, i.e. $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)$
Can we avoid any bias by choosing the action-value function approximation carefully, so that
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)]$ holds exactly?

43 A3C: 43/79 reference: Asynchronous Methods for Deep Reinforcement Learning

44 Compatible Function Approximation Theorem 44/79
Theorem: If the action-value function approximator $Q_w(s, a)$ satisfies the following two conditions:
$\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)$,
$w = \arg\min_w \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s, a) - Q_w(s, a))^2]$,
then the policy gradient is exact, i.e.
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)]$
The function approximator is compatible with the policy in the sense that if we use the approximation $Q_w(s, a)$ in lieu of the true values to compute the gradient, the result is still exact.

45 Baseline 45/79
To reduce the variance, subtract a baseline function B(s):
$\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, B(s)] = \sum_s p_{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, B(s) = \sum_s p_{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$
Thus,
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, (Q^{\pi_\theta}(s, a) - B(s))]$
A good baseline is $B(s) = V^{\pi_\theta}(s)$, which gives the advantage function $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$.

46 46/79 Policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)] = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)]$
Apply SGD: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)$
The advantage function can significantly reduce the variance of the policy gradient. Estimate both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s, a)$ to obtain $A^{\pi_\theta}(s, a)$, e.g. by TD learning.

47 47/79 Estimation of the advantage function
Apply TD learning to estimate the value function. The TD error
$\delta^{\pi_\theta} = r(s, a, s') + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$
is an unbiased estimate of the advantage function:
$\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta} \mid s, a] = \mathbb{E}_{\pi_\theta}[r(s, a, s') + \gamma V^{\pi_\theta}(s') \mid s, a] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$
Thus the update is $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, \delta^{\pi_\theta}$.
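A minimal one-step actor-critic sketch that uses the TD error as the advantage estimate is shown below; the tabular critic, toy environment, and learning rates are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 6, 2, 0.99
lr_actor, lr_critic = 0.05, 0.1
theta = np.zeros((nS, nA))          # actor: softmax policy parameters
V = np.zeros(nS)                    # critic: tabular state-value estimate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics/reward, for illustration only."""
    s_next = (s + 1) % nS if a == 1 else rng.integers(nS)
    return s_next, 1.0 if a == 1 else 0.0

s = 0
for t in range(50_000):
    probs = softmax(theta[s])
    a = rng.choice(nA, p=probs)
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]      # TD error = advantage estimate
    V[s] += lr_critic * delta                 # critic update
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta[s] += lr_actor * delta * grad_logp  # actor update along grad log pi * delta
    s = s_next
```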

48 Outline 48/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

49 Trust Region Policy Optimization (John Schulman et al., 2015) 49/79
Assume the start-state distribution $\rho_0$ is independent of the policy. The total expected discounted reward of policy π is
$\eta(\pi) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big]$
Between any two policies π and $\tilde{\pi}$:
$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big] = \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a) = \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$

50 Trust Region Policy Optimization 50/79
Find a new policy $\tilde{\pi}$ that maximizes $\eta(\tilde{\pi}) - \eta(\pi)$ for the given π, that is,
$\max_{\tilde{\pi}} \; \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$
For simplicity, maximize the local approximation instead:
$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$
Parameterize the policy, $\tilde{\pi}(a \mid s) := \pi_\theta(a \mid s)$:
$L_{\pi_{\theta_{old}}}(\pi_\theta) = \eta(\pi_{\theta_{old}}) + \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\pi_{\theta_{old}}}(s, a)$

51 Why $L_{\pi_{\theta_{old}}}(\pi_\theta)$? 51/79
A sufficiently small step $\theta_{old} \to \theta$ that improves $L_{\pi_{\theta_{old}}}(\pi_\theta)$ also improves η, since
$L_{\pi_{\theta_{old}}}(\pi_{\theta_{old}}) = \eta(\pi_{\theta_{old}})$, and $\nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_{old}}$.
Lower bound on the improvement of η:
$\eta(\pi_{\theta_{new}}) \ge L_{\pi_{\theta_{old}}}(\pi_{\theta_{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$
where
$\epsilon = \max_s \big|\mathbb{E}_{a \sim \pi_{\theta_{new}}} A_{\pi_{\theta_{old}}}(s, a)\big|$,
$\alpha = D^{\max}_{TV}(\pi_{\theta_{old}}, \pi_{\theta_{new}}) = \max_s D_{TV}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_{\theta_{new}}(\cdot \mid s))$

52 Lower bound 52/79
TV divergence between two distributions p, q (discrete case): $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_{x \in X} |p(x) - q(x)|$
KL divergence between two distributions p, q (discrete case): $D_{KL}(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$
Pinsker-type inequality: $(D_{TV}(p \,\|\, q))^2 \le D_{KL}(p \,\|\, q)$ (Pollard (2000), Ch. 3)
Thus we obtain the lower bound
$\eta(\pi_{\theta_{new}}) \ge L_{\pi_{\theta_{old}}}(\pi_{\theta_{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha$
where $\alpha = D^{\max}_{KL}(\pi_{\theta_{old}}, \pi_{\theta_{new}}) := \max_s D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_{\theta_{new}}(\cdot \mid s))$

53 Practical algorithm 53/79
The penalty coefficient $\frac{2\epsilon\gamma}{(1-\gamma)^2}$ is large in practice, which yields overly small updates. Instead, place a constraint on the KL divergence, i.e. a trust region constraint:
$\max_\theta \; L_{\pi_{\theta_{old}}}(\pi_\theta) \quad \text{s.t.} \quad D^{\max}_{KL}(\pi_{\theta_{old}}, \pi_\theta) \le \delta$
A heuristic approximation:
$\max_\theta \; L_{\pi_{\theta_{old}}}(\pi_\theta) \quad \text{s.t.} \quad D^{\rho_{\pi_{\theta_{old}}}}_{KL}(\pi_{\theta_{old}}, \pi_\theta) \le \delta$
where $D^{\rho_{\pi_{\theta_{old}}}}_{KL}(\pi_{\theta_{old}}, \pi_\theta) = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))\big]$

54 Connection with prior work 54/79
Approximate the objective function linearly and the constraint quadratically:
$\max_\theta \; \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}}^\top (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2}(\theta - \theta_{old})^\top A\, (\theta - \theta_{old}) \le \delta$
where $A_{ij} = \frac{\partial}{\partial \theta_i}\frac{\partial}{\partial \theta_j} \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))\big]\Big|_{\theta=\theta_{old}}$
The update is the natural policy gradient:
$\theta_{new} = \theta_{old} + \lambda A^{-1} \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}} = \theta_{old} + \lambda A^{-1} \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_{old}}$
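A minimal sketch of this natural-gradient step is given below, assuming the policy gradient `grad` and the KL Hessian (Fisher matrix) `A` have already been estimated from samples; the numbers are made up, and TRPO itself computes $A^{-1}g$ with conjugate gradient rather than forming A explicitly.

```python
import numpy as np

def natural_gradient_step(grad, A, delta=0.01):
    """Return the step lambda * A^{-1} grad that moves to the boundary of the
    quadratic trust region (1/2) d^T A d <= delta."""
    direction = np.linalg.solve(A, grad)                 # A^{-1} grad
    step_size = np.sqrt(2.0 * delta / (grad @ direction))
    return step_size * direction

# Toy usage with invented numbers.
grad = np.array([0.3, -0.1, 0.7])
A = np.array([[2.0, 0.1, 0.0],
              [0.1, 1.5, 0.2],
              [0.0, 0.2, 1.0]])
theta_old = np.zeros(3)
theta_new = theta_old + natural_gradient_step(grad, A)
print(theta_new)
```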

55 Connection with prior work 55/79
Approximate the objective function linearly and the constraint quadratically, with A := I:
$\max_\theta \; \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}}^\top (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2}(\theta - \theta_{old})^\top (\theta - \theta_{old}) \le \delta$
The update is the standard policy gradient:
$\theta_{new} = \theta_{old} + \lambda \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}} = \theta_{old} + \lambda \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_{old}}$

56 Outline 56/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

57 57/79 Tetris Height: 12 Width: 7 Rotate and move the falling shape Gravity related to current height Score when eliminating an entire level Game over when reaching the ceiling

58 DP Model of Tetris 58/79 State: the current board, the current falling tile, predictions of future tiles. Termination state: when the tiles reach the ceiling, the game is over with no further reward. Action: rotation and shift. System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly. Uncertainty: randomness in the future tiles. Transitional cost g: if a level is cleared by the current action, score 1; otherwise score 0. Objective: expectation of the total score.

59 Interesting facts about Tetris 59/79 First released in 1984 by Alexey Pajitnov from the Soviet Union. Has been proved to be NP-complete. The game ends with probability 1. For a 12 × 7 board, the number of possible states is about $10^{25}$. Highest score achieved by a human: about 1 million. Highest score achieved by an algorithm: about 35 million (average performance).

60 60/79 Solve Tetris by Approximate Dynamic Programming
Curse of dimensionality: the exact cost-to-go vector $V^*$ has dimension equal to the state space size (about $10^{25}$). The transition matrix is $10^{25} \times 10^{25}$, although sparse. There is no way to solve such a Bellman equation exactly.
Cost function approximation: instead of characterizing the entire board configuration, we use features $\phi_1, \ldots, \phi_d$ to describe the current state, and represent the cost-to-go vector using a linear combination of a small number of features:
$V \approx \beta_1 \phi_1 + \ldots + \beta_d \phi_d$
We approximate the high-dimensional space of V using a low-dimensional feature space of β.

61 Features of Tetris 61/79 The earliest Tetris learning algorithm uses 3 features. The latest best algorithm uses 22 features. The mapping from state space to feature space can be viewed as state aggregation: different states/boards with similar heights/holes/structures are treated as a single meta-state.

62 Learning in Tetris 62/79
Ideally, we still want to solve the Bellman equation $V^* = \max_\mu \{ g_\mu + P_\mu V^* \}$. We consider an approximate solution $\tilde{V}$ to the Bellman equation of the form $\tilde{V} = \Phi\beta$.
We apply policy iteration with approximate policy evaluation:
Outer loop: keep improving the policies $\mu_k$ using one-step lookahead.
Inner loop: find the cost-to-go $V^{\mu_k}$ associated with the current policy $\mu_k$, by taking samples and using the feature representation.
We want to avoid high-dimensional algebraic operations, as well as storing high-dimensional quantities.

63 Algorithms tested on Tetris 63/79 More examples of learning methods: temporal difference learning, TD(λ), LSTD, cross-entropy method, actor-critic, active learning, etc. These algorithms are essentially all combinations of DP, sampling, and parametric models.

64 Tetris World Record: Human vs. AI 64/79

65 More on Games and Learning 65/79 Tetris domain of the 2008 Reinforcement Learning Competition. 3rd International Reinforcement Learning Competition (2009, adversarial Tetris). 2013 Reinforcement Learning Competition (unmanned helicopter control). Since 2006, the Annual Computer Poker Competition. 2013, 2014, 2015 MIT Pokerbots. Many more!

66 66/79 Platform Gym: a toolkit for developing and comparing reinforcement learning algorithms; supports Atari and Mujoco. Universe: measuring and training an AI across the world's supply of games, websites and other applications. Deepmind Lab: a fully 3D game-like platform tailored for agent-based AI research. ViZDoom: allows developing AI bots that play Doom using only visual information.

67 67/79 Platform Rllab: mainly supports TRPO, VPG, CEM, NPG. Baselines: supports TRPO, PPO, DQN, A2C, ... (available on GitHub). Implement your algorithms on top of these packages.

68 Gym 68/79 A simple implementation of sampling a single path with gym
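A sketch of what such an implementation might look like is given below. It assumes the CartPole-v1 environment and the classic gym API (env.reset() returns an observation and env.step returns four values); newer gym/gymnasium releases return extra values, so the unpacking may need adjusting. The helper name `sample_path` and the random placeholder policy are illustrative choices, not part of the lecture.

```python
import gym
import numpy as np

def sample_path(env, policy, max_steps=1000):
    """Roll out a single path (trajectory) under `policy`, a function mapping
    observations to actions. Returns lists of observations, actions, rewards."""
    obs = env.reset()
    observations, actions, rewards = [], [], []
    for _ in range(max_steps):
        action = policy(obs)
        observations.append(obs)
        actions.append(action)
        obs, reward, done, info = env.step(action)   # classic gym API
        rewards.append(reward)
        if done:
            break
    return observations, actions, rewards

env = gym.make("CartPole-v1")
random_policy = lambda obs: env.action_space.sample()   # placeholder policy
obs, acts, rews = sample_path(env, random_policy)
print("path length:", len(rews), "total reward:", np.sum(rews))
```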

69 69/79 Environments Mujoco, continuous tasks: a physics engine for detailed, efficient rigid-body simulations with contacts. Swimmer, Hopper, Walker, Reacher, ... Policy output: Gaussian distribution over continuous actions.

70 70/79 Environments Atari 2600, discrete action space. Policy output: categorical distribution over actions.

71 Outline 71/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

72 AlphaGo Zero 72/79

73 AlphaGo Zero 73/79

74 AlphaGo Zero 74/79

75 AlphaGo Zero 75/79

76 AlphaGo Zero 76/79

77 AlphaGo Zero 77/79

78 AlphaGo Zero 78/79

79 AlphaGo Zero 79/79
