Lecture 4: Misc. Topics and Reinforcement Learning
1 Approximate Dynamic Programming. Lecture 4: Misc. Topics and Reinforcement Learning. Mengdi Wang, Operations Research and Financial Engineering, Princeton University, August 1-4. 1/56
2 Feature Extraction is Linear Approximation of High-d Cost Vector 2/56
3 Today 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 3/56
4 Policy Search: A Simple Example of Energy Storage 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 4/56
5 Policy Search: A Simple Example of Energy Storage. Policy Approximation. Sometimes it is easier to parameterize the policy µ by µ(i) ≈ µ(i; r). Direct policy search: min_r E[ Σ_{k=1}^∞ α^k g(i_k, µ(i_k; r), j_k) ]. This is a stochastic optimization problem: it is very likely to be nonconvex; r is relatively low-dimensional; the high dimensionality of the DP is hidden in the complicated expectation. 5/56
6 Policy Search: A Simple Example of Energy Storage. Direct policy search via Sample Average Approximation. One approach is to simulate many long trajectories and formulate a sample average approximation min_r (1/L) Σ_{ℓ=1}^L Σ_{k=1}^∞ α^k g(i_kℓ, µ(i_kℓ; r), j_kℓ), based on L trajectories {i_0ℓ, i_1ℓ, ...}, ℓ = 1,...,L. As L → ∞, the sample average approximation converges to the original problem, and its solution converges to the optimal parameter r*. 6/56
7 Policy Search: A Simple Example of Energy Storage. Direct policy search via Stochastic Gradient Descent. Another approach is to apply stochastic gradient descent. At each iteration ℓ, generate a trajectory {i_0ℓ, i_1ℓ, ...} and compute the gradient G_ℓ = Σ_{k=1}^∞ α^k ∇_r g(i_kℓ, µ(i_kℓ; r_ℓ), j_kℓ), then update the policy parameter by r_{ℓ+1} = r_ℓ - γ_ℓ G_ℓ with stepsize γ_ℓ > 0. Each G_ℓ is an unbiased sample of the gradient of the overall objective at the current parameter r_ℓ. If the gradient is not computable, it can be replaced with finite differences. If the problem is convex, r_ℓ → r* a.s. If nonconvex (the most likely case), r_ℓ converges a.s. to a local optimum (good enough in practice most of the time). 7/56
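As a concrete illustration, here is a minimal Python sketch of direct policy search by gradient descent with finite-difference gradient estimates, as described on this slide. The `simulate` interface (a generator of per-stage costs under parameter r), the one-sided difference width `h`, and the diminishing stepsize `step/ℓ` are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rollout_cost(simulate, r, horizon=200, alpha=0.95):
    """Discounted cost of one simulated trajectory under policy parameter r."""
    total, discount = 0.0, 1.0
    for cost in simulate(r, horizon):  # yields stage costs g(i_k, mu(i_k; r), j_k)
        total += discount * cost
        discount *= alpha
    return total

def finite_diff_policy_search(simulate, r0, iters=200, step=0.1, h=1e-2):
    """Gradient descent on the rollout cost, with finite-difference gradients."""
    r = np.array(r0, dtype=float)
    for ell in range(1, iters + 1):
        grad = np.zeros_like(r)
        base = rollout_cost(simulate, r)
        for d in range(r.size):        # one-sided finite difference per coordinate
            e = np.zeros_like(r)
            e[d] = h
            grad[d] = (rollout_cost(simulate, r + e) - base) / h
        r -= (step / ell) * grad       # diminishing stepsize gamma_ell = step / ell
    return r
```

With a stochastic simulator each rollout gives a noisy cost, and the diminishing stepsize plays the role of the SGD stepsize γ_ℓ.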
8 Policy Search: A Simple Example of Energy Storage. A Very Simple Example: Electricity Prices. Electricity prices are very volatile. It is very difficult to store electricity. Spiky prices under heavy load (hot summer, freezing winter, etc.). The average/median price is fairly cheap. Related to weather, season, time of day, etc. 8/56
9 Policy Search: A Simple Example of Energy Storage. Battery Storage. Suppose that you operate a battery and participate in the regional electricity market. Challenge: find a policy for charging and discharging the battery. Strategy proposed by the battery manufacturer: Buy low, sell high. 9/56
10 Policy Search: A Simple Example of Energy Storage. There are many approaches based on DP. Solving the DP requires full knowledge of the model: how does the price change? how might the storage have a market impact on future prices? if there is a forecast, how good is it? what are the uncertainties and what are their distributions? To gain this knowledge, we could fit a time series model for the price dynamics, then formulate a DP problem and find an optimal policy of the DP problem. 10/56
11 Policy Search: A Simple Example of Energy Storage. DP Model. State: price p_t, storage level c_t. Action: buy u_t = -1, sell u_t = 1, do nothing u_t = 0. Action constraints: if c_t = 0, cannot sell; if c_t = C, cannot buy. State transition of storage inventory: c_{t+1} = c_t - u_t. State transition of electricity price (learned from data): p_{t+1} = p_t + f(p_t) + ε_t. 11/56
12 Policy Search: A Simple Example of Energy Storage. DP Model. The overall objective: max_{u_t} E[ Σ_{t=1}^T u_t p_t ]. DP algorithm: V(p_t, c_t) = max_{u_t} { u_t p_t + E[ V(p_{t+1}, c_t - u_t) ] }. 12/56
13 Policy Search: A Simple Example of Energy Storage. The ultimate problem is to find a function/policy/strategy µ : {states} → {actions} that solves max_µ E[ Σ_{t=1}^T µ(p_t, c_t) p_t ] subject to the state transitions as constraints. Difficulty I: we do not know the distribution and dynamics of the time series p_t ⇒ need a data-based approach. Difficulty II: searching over a space of functions is hard ⇒ need to narrow the search to a simple parametric family. 13/56
14 Policy Search: A Simple Example of Energy Storage Optimizing A Storage Policy Consider a simple policy in which we choose a sell price and a buy price 14 / 56
15 Policy Search: A Simple Example of Energy Storage. Optimizing over Simple Policies. Now let us search over simple threshold policies. The modified problem is max E[ Σ_{t=1}^T µ(p_t, c_t) p_t ] subject to µ(p, c) = -1 if p < θ_store and c < C; µ(p, c) = 1 if p > θ_withdraw and c > 0; µ(p, c) = 0 otherwise; as well as the state transition constraints. We search for the threshold values θ_store and θ_withdraw by backtesting. 15/56
16 Policy Search: A Simple Example of Energy Storage. Optimizing over Simple Policies. Average historical profit as a function of the two policy parameters. For a given pair (θ_store, θ_withdraw), the profit value is calculated by one simulation run on the entire price history. The optimal policy stands out! 16/56
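The backtesting step on this slide can be sketched in a few lines of Python. The unit charge/discharge per step, the capacity, and the grid of candidate thresholds are assumptions for illustration: `backtest` replays the threshold policy of the previous slide over one historical price path, and `grid_search` evaluates the profit for every candidate pair (θ_store, θ_withdraw).

```python
def backtest(prices, theta_store, theta_withdraw, capacity=1):
    """Profit of the threshold policy on one historical price path."""
    c, profit = 0, 0.0
    for p in prices:
        if p < theta_store and c < capacity:    # buy (charge) one unit
            c += 1
            profit -= p
        elif p > theta_withdraw and c > 0:      # sell (discharge) one unit
            c -= 1
            profit += p
    return profit

def grid_search(prices, grid, capacity=1):
    """Backtest every (theta_store, theta_withdraw) pair and pick the best."""
    results = {(s, w): backtest(prices, s, w, capacity)
               for s in grid for w in grid if s <= w}
    best = max(results, key=results.get)
    return best, results
```

The `results` dictionary is exactly the profit surface plotted on the slide, evaluated on the grid.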
17 Policy Search: A Simple Example of Energy Storage. Make the Problem Harder. Battery charge/discharge lead time: in practice, we cannot turn the battery or generator on and off immediately; it needs to warm up for some time before charging/generating. Using a forecast: suppose that we have a 1-hour price forecast and want the policy to make use of it. The policy could be: charge if a weighted combination of the current price and the forecast price is smaller than a threshold; withdraw if the weighted combination is greater than a threshold. The parameters are the two threshold prices and the weights of the combination. 17/56
18 DP and LP 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 18 / 56
19 DP and LP. Look at the Bellman Equation Again. Consider an MDP model with: states i = 1,...,n; probability transition matrix under policy µ: P_µ ∈ R^{n×n}; cost of transition: g_µ ∈ R^n. The Bellman equation is J = min_µ { g_µ + α P_µ J }. This is a nonlinear system of equations. Note: the right-hand side is the infimum of a number of linear mappings of J! 19/56
20 DP and LP. DP is a special case of LP. Theorem: every finite-state DP problem is an LP problem. We construct the following LP: maximize J(1) + ... + J(n) subject to J(i) ≤ Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α J(j) ) for all i and all u ∈ A; or, more compactly: maximize e'J subject to J ≤ g_µ + α P_µ J for all µ. The variables are J(i), i = 1,...,n. For each state-action pair (i, u) there is one inequality constraint. Dimension of the LP: n variables and n·|A| constraints (here we assume a finite-state problem). 20/56
21 DP and LP. DP is a special case of LP. Theorem: the solution to the constructed LP, maximize e'J subject to J ≤ g_µ + α P_µ J for all µ, is exactly the solution J* of the Bellman equation J = min_µ { g_µ + α P_µ J }. Proof sketch: J* is clearly feasible, since J* = min_µ { g_µ + α P_µ J* } ≤ g_µ + α P_µ J* for every µ. Conversely, any feasible J satisfies J ≤ T_µ J for all µ, hence J ≤ TJ; iterating and using the monotonicity of T gives J ≤ lim_m T^m J = J*. Therefore J* maximizes e'J over the feasible set, and it is the unique maximizer. 21/56
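For a small MDP, the LP on this slide can be handed directly to an off-the-shelf solver. The sketch below assumes SciPy's `linprog` is available and uses an invented array convention (`P[u]` is the n×n transition matrix for action u, `g[u]` the n-vector of expected stage costs): it stacks one block of constraints J ≤ g_u + α P_u J per action and maximizes e'J.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, g, alpha):
    """Solve a finite cost-minimization MDP as an LP.
    P[u]: n x n transition matrix for action u; g[u]: n-vector of
    expected stage costs under u.  Returns the optimal cost vector J*."""
    n = P.shape[1]
    A_ub, b_ub = [], []
    for u in range(P.shape[0]):
        # J <= g_u + alpha * P_u J   <=>   (I - alpha P_u) J <= g_u
        A_ub.append(np.eye(n) - alpha * P[u])
        b_ub.append(g[u])
    res = linprog(c=-np.ones(n),                 # maximize e'J
                  A_ub=np.vstack(A_ub),
                  b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n)
    return res.x
```

On a toy two-state, two-action example the LP solution coincides with the fixed point of value iteration, as the theorem asserts.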
22 DP and LP. ADP via Approximate Linear Programming. The constructed LP is of huge scale: maximize e'J subject to J ≤ g_µ + α P_µ J for all µ. We may approximate J by adding the constraint J = Φr, so the variable dimension becomes smaller. We may sample a subset of all constraints, so the constraint dimension becomes smaller. The LP and the approximate LP can be solved by simulation/online. 22/56
23 Online Learning and Q-Learning 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 23 / 56
24 Online Learning and Q-Learning. Multi-Armed Bandits: the Simplest Online Learning Model. Suppose you are faced with N slot machines ("bandits"). Each bandit dispenses a random prize. Some bandits are very generous, others not so much. Of course, you don't know what these expectations are. Choosing only one bandit per round, our task is to devise a strategy that maximizes our winnings. 24/56
25 Online Learning and Q-Learning. Multi-Armed Bandits: The Problem. Let there be K bandits, each giving a random prize with expectation E[X_b] ∈ [0, 1], b ∈ {1,...,K}. If you pull bandit b, you get an independent sample of X_b. We want to solve the one-shot optimization problem max_{b ∈ {1,...,K}} E[X_b], but we can experiment repeatedly. Our task can be phrased as: find the best bandit, as quickly as possible. We use trial and error to gain knowledge of the expected values. 25/56
26 Online Learning and Q-Learning. A Naive Strategy. In each round k, we are given the current estimated means X̂_b, b ∈ {1,...,K}. 1. With probability 1 - ε_k, select the currently-known-to-be-best bandit b_k = argmax_b X̂_b. 2. With probability ε_k, select a bandit uniformly at random. 3. Observe the result of pulling bandit b_k, update X̂_{b_k} as the new sample average, and return to 1. Exploration vs. Exploitation: the random selection with probability ε_k guarantees exploration; the selection of the current best with probability 1 - ε_k guarantees exploitation. With, say, ε_k = 1/k, the algorithm increasingly exploits the existing knowledge and is guaranteed to find the true optimal bandit. 26/56
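A minimal Python sketch of this naive ε-greedy strategy. The bandit interface `pull(b)` is an assumption, and for robustness over short horizons the sketch floors the exploration probability at 5% rather than letting ε_k = 1/k decay all the way to zero.

```python
import numpy as np

def epsilon_greedy(pull, K, rounds=5000, seed=0):
    """Run the naive strategy; pull(b) returns a sampled reward of bandit b."""
    rng = np.random.default_rng(seed)
    means = np.zeros(K)    # sample-average estimates hat{X}_b
    counts = np.zeros(K)
    for k in range(1, rounds + 1):
        eps = max(0.05, 1.0 / k)           # exploration floor (assumption)
        if rng.random() < eps:
            b = int(rng.integers(K))       # explore: uniformly random bandit
        else:
            b = int(np.argmax(means))      # exploit: current best estimate
        counts[b] += 1
        means[b] += (pull(b) - means[b]) / counts[b]   # incremental average
    return int(np.argmax(means))
```

The incremental-average update is algebraically identical to recomputing the sample mean of all rewards observed from bandit b.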
27 Online Learning and Q-Learning. Bayesian Bandit Strategy. In each round, we are given the current posterior distributions of the bandits' mean returns. 1. Sample a random variable X_b from the posterior of bandit b, for all b = 1,...,K. 2. Select the bandit with the largest sample, i.e., select bandit b* = argmax_b X_b. 3. Observe the result of pulling bandit b*, update your posterior on bandit b*, and return to 1. 27/56
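For Bernoulli rewards the posterior updates are conjugate Beta distributions, so the Bayesian strategy above (Thompson sampling) can be sketched very compactly. The Beta(1,1) priors and the `pull(b)` interface are assumptions.

```python
import numpy as np

def thompson(pull, K, rounds=5000, seed=0):
    """Beta-Bernoulli Thompson sampling with Beta(1,1) priors on each mean."""
    rng = np.random.default_rng(seed)
    a = np.ones(K)   # Beta alpha parameters: successes + 1
    b = np.ones(K)   # Beta beta parameters: failures + 1
    for _ in range(rounds):
        theta = rng.beta(a, b)           # one draw from each posterior
        arm = int(np.argmax(theta))      # play the bandit with largest sample
        if pull(arm):
            a[arm] += 1                  # posterior update on the pulled bandit
        else:
            b[arm] += 1
    return int(np.argmax(a / (a + b)))   # bandit with largest posterior mean
```

Unlike ε-greedy, exploration here is automatic: a bandit with a wide posterior occasionally produces the largest sample and gets pulled.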
29 Online Learning and Q-Learning. How do bandits relate to real-time decision making? Each bandit is a possible configuration of the policy parameters. A bandit strategy is online parameter tuning. Evaluation of goodness: Regret = T · max_{b=1,...,K} E[X_b] - Σ_{t=1}^T E[X_{b_t}]. The regret of a reasonable learning strategy is usually between log T and √T. 29/56
30 Online Learning and Q-Learning. Learning in Sequential Decision Making. Traditionally, sequential decision-making problems are modeled by dynamic programming. In one-shot optimization, finding the best arm by trial and error is known as online learning. Let's combine the two: DP + Online Learning = Reinforcement Learning. 30/56
31 Online Learning and Q-Learning. From DP to Reinforcement Learning. Ideally, DP solves the fixed-point equation: find V such that V = min_µ { g_µ + α P_µ V }. Practically, we often wish to solve Bellman's equation without knowing P_µ, g_µ. What we do have: a simulator that, starting from state i with action a, generates random samples of the transition cost and next state: g(i, a, i_next), i_next. Example: optimize a trading policy to maximize profit; the current transaction has an unknown market impact; use the current order book as states/features. 31/56
32 Online Learning and Q-Learning. Rewrite the Bellman Equation with Q-Factors. Q-factors: we define Q(i, u) to be the value of the post-decision state (i, u): Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α V(j) ). Bellman equation for Q-factors: Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{v ∈ U(j)} Q(j, v) ). The DP algorithm works for Q-functions as well. 32/56
33 Online Learning and Q-Learning. Q-Learning. We hope to solve the Bellman equation using VI: Q*(i, u) = Σ_{j=1}^n p_ij(u) [ g(i, u, j) + α min_v Q*(j, v) ]. Q-learning is simulation-based VI for Q-factors: Q_{k+1}(i, u) = (1 - γ) Q_k(i, u) + γ · (sample of the Bellman equation's right-hand side). 33/56
34 Online Learning and Q-Learning. Q-Learning Algorithm. Generate a state sequence {(i_k, u_k, j_k)}: sample (i_k, j_k) according to the system using control u_k. Update each (i_k, u_k, j_k) with stepsize γ_k > 0: Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k [ g(i_k, u_k, j_k) + α min_v Q_k(j_k, v) ]. A real-time policy can be recovered from the Q-values easily: µ_k(i) = argmin_u Q_k(i, u), i = 1,...,n. Comments: almost sure convergence to the optimal Q-values and policy (classical convergence result of stochastic approximation); fully online and real-time and, more importantly, model-free; the simulator needs to sample all (i, u) sufficiently often (exploration vs. exploitation). 34/56
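A tabular Python sketch of the update above, with ε-greedy exploration so that all (i, u) pairs keep being sampled. The constant stepsize is an assumption that suffices for the deterministic toy problem tested below; for noisy transitions a diminishing stepsize such as γ_k = 1/N(i, u) is the classical choice guaranteeing convergence.

```python
import numpy as np

def q_learning(step, n_states, n_actions, alpha=0.9, gamma=0.5,
               iters=20000, eps=0.1, seed=0):
    """Tabular Q-learning for cost minimization.
    step(i, u) simulates one transition and returns (cost, next_state);
    alpha is the discount factor, gamma the stepsize, eps the exploration rate."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    i = 0
    for _ in range(iters):
        # epsilon-greedy behavior policy: explore with probability eps
        if rng.random() < eps:
            u = int(rng.integers(n_actions))
        else:
            u = int(np.argmin(Q[i]))
        cost, j = step(i, u)
        # sample of the Bellman right-hand side: cost + alpha * min_v Q(j, v)
        Q[i, u] += gamma * (cost + alpha * Q[j].min() - Q[i, u])
        i = j
    return Q
```

The greedy policy µ(i) = argmin_u Q(i, u) is read off the final table, exactly as on the slide.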
35 Online Learning and Q-Learning. Q-Learning for the Optimal Stopping Problem. Stopping problem: i is the state (for example, stock price and time), which moves to state j with probability p_ij. Actions: HOLD or STOP. C(i): cost of stopping at state i. g(i, HOLD, j) = 0, g(i, STOP, j) = C(i). Bellman equation: for all i, Q(i, HOLD) = α Σ_{j=1}^n p_ij min{ Q(j, STOP), Q(j, HOLD) }, and Q(i, STOP) = C(i). 35/56
36 Online Learning and Q-Learning. Q-Learning for the Optimal Stopping Problem: Algorithm. Generate a state path {i_k} according to the stochastic system and randomized actions. Update each state i_k using a stepsize γ_k > 0: Q_{k+1}(i_k) = (1 - γ_k) Q_k(i_k) + γ_k α min{ C(i_{k+1}), Q_k(i_{k+1}) }. 36/56
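A Python sketch of this stopping-problem recursion. Only the HOLD Q-factor needs a table, since the STOP Q-factor is just C(i). The simulator interface `next_state(i)`, the stepsize γ_k = 1/N(i), and the explicit discount factor α are illustrative assumptions.

```python
import numpy as np

def q_learning_stopping(next_state, C, alpha=0.9, steps=30000, seed=0):
    """Q-learning for optimal stopping.  Q[i] estimates the Q-factor of
    HOLD at state i; next_state(i) samples j with probability p_ij."""
    rng = np.random.default_rng(seed)
    n = len(C)
    Q = np.zeros(n)
    N = np.zeros(n)
    i = int(rng.integers(n))
    for _ in range(steps):
        j = next_state(i)
        N[i] += 1
        gamma = 1.0 / N[i]   # diminishing stepsize per visited state
        # target: alpha * min{ stop cost at j, hold Q-factor at j }
        Q[i] += gamma * (alpha * min(C[j], Q[j]) - Q[i])
        i = j
    return Q
```

In the test below, the next state is uniform over three states with stopping costs (-2, -1, 0), for which the HOLD Q-factor solves q = 0.9 · (min(-2, q) + min(-1, q) + min(0, q)) / 3, i.e., q = -1.5 for every state.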
37 Final Remarks 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 37 / 56
38 Final Remarks. Theory of Approximate DP. Infinite-Horizon DP Problem: minimize over policies π = {µ_0, µ_1, ...} the cost function J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} [ Σ_{k=0}^{N-1} α^k g(x_k, µ_k(x_k), w_k) ]. How to Approximate DP. Approximation: parameterize policies/cost vectors, aggregation, etc. Simulation: use simulation-generated trajectories {x_k} to calculate DP quantities, without knowing the system dynamics. 38/56
39 Final Remarks. Markovian Decision Process. Assume the system is an n-state (controlled) Markov chain. Change to Markov chain notation: states i = 1,...,n (instead of x); transition probabilities p_{i_k i_{k+1}}(u_k) (instead of x_{k+1} = f(x_k, u_k, w_k)); cost per stage g(i, u, j) (instead of g(x_k, u_k, w_k)). Cost of a policy π = {µ_0, µ_1, ...}: J_π(i) = lim_{N→∞} E [ Σ_{k=0}^{N-1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ]. 39/56
40 Final Remarks. MDP Continued. The optimal cost vector satisfies the Bellman equation for all i: J*(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J*(j) ), or in matrix form J* = min_{µ: {1,...,n} → U} { g_µ + α P_µ J* }. Shorthand notation for the DP mappings: (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J(j) ), i = 1,...,n; (T_µ J)(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + α J(j) ), i = 1,...,n. 40/56
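The mappings T and T_µ translate directly into code. The sketch below implements T for a finite MDP stored as arrays and runs value iteration J ← TJ to a tolerance; the array layout (`P[u]` is the n×n matrix for action u, `g[u]` the vector of expected stage costs under u) is an assumed convention.

```python
import numpy as np

def bellman_T(J, P, g, alpha):
    """(TJ)(i) = min_u [ g_u(i) + alpha * (P_u J)(i) ], vectorized over i."""
    return np.min(g + alpha * (P @ J), axis=0)

def value_iteration(P, g, alpha, tol=1e-8):
    """Iterate J <- TJ until the sup-norm change drops below tol."""
    J = np.zeros(P.shape[1])
    while True:
        J_new = bellman_T(J, P, g, alpha)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
```

Since T is an α-contraction in the sup norm, the loop converges geometrically to the unique fixed point J*.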
41 Final Remarks. Approximation Architecture. Approximation in Policy Space: parameterize the set of policies µ by a vector r, and then optimize over r. Approximation in Value Space: approximate J* and J_µ from a family of functions parameterized by r, e.g., a linear approximation J ≈ Φr, J(i) ≈ φ(i)'r. 41/56
42 Final Remarks. Approximate DP Algorithms: A Roadmap. Approximate PI (*): implement the two steps of PI in an approximate sense. Policy evaluation: solve J_µt = T_µt J_µt by approximation/simulation; direct approach (*), e.g., simulation-based least squares; indirect approach: solve J_µt = T_µt J_µt by TD/LSTD/LSPE. Policy improvement: T_µ(t+1) J_µt = T J_µt, using the approximate cost vector/Q-factors. Approximate J* and Q*: solve J = TJ or Q = FQ directly by simulation, e.g., Q-learning, Bellman error minimization, the LP approach. 42/56
43 Final Remarks Tetris Height: 12 Width: 7 Rotate and move the falling shape Gravity related to current height Score when eliminating an entire level Game over when reaching the ceiling 43 / 56
44 Final Remarks Tetris at MIT 44 / 56
45 Final Remarks. DP Model of Tetris. State: the current board, the current falling tile, and predictions of future tiles. Termination state: when the tiles reach the ceiling, the game is over with no more future reward. Action: rotation and shift. System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly. Uncertainty: randomness in future tiles. Transition cost g: if a level is cleared by the current action, score 1; otherwise score 0. Objective: expectation of the total score. 45/56
46 Final Remarks. Interesting facts about Tetris. First released in 1984 by Alexey Pajitnov from the Soviet Union. Has been proved to be NP-complete. The game will be over with probability 1. For a 12 × 7 board, the number of possible states is astronomical (on the order of 10^25). Highest score achieved by a human: about 1 million. Highest score achieved by an algorithm: about 35 million (average performance). 46/56
47 Final Remarks. Solving Tetris by Approximate Dynamic Programming. Curse of dimensionality: the exact cost-to-go vector V has dimension equal to the state space size (about 10^25). The transition matrix is correspondingly huge, although sparse. There is no way to solve the Bellman equation exactly. Cost function approximation: instead of characterizing the entire board configuration, we use features φ_1, ..., φ_d to describe the current state, and represent the cost-to-go vector V by a linear combination of a small number of features: V ≈ r_1 φ_1 + ... + r_d φ_d. We approximate the high-dimensional space of V using a low-dimensional feature space. 47/56
48 Final Remarks. Features of Tetris. The earliest Tetris learning algorithm uses 3 features; the latest best algorithm uses 22 features. The mapping from the state space to the feature space can be viewed as state aggregation: different states/boards with similar heights/holes/structures are treated as a single meta-state. 48/56
49 Final Remarks. Learning in Tetris. Ideally, we still want to solve the Bellman equation V = max_µ { g_µ + α P_µ V }. We consider an approximate solution Ṽ to the Bellman equation such that Ṽ = Φr. We apply policy iteration with approximate function evaluation. Outer loop: keep improving the policies µ_k using one-step lookahead. Inner loop: find the cost-to-go V_µk associated with the current policy µ_k, by taking samples and using the feature representation. We want to avoid high-dimensional algebraic operations, as well as storing high-dimensional quantities. 49/56
50 Final Remarks. Algorithms tested on Tetris. More examples of learning methods: temporal difference learning, TD(λ), LSTD, the cross-entropy method, actor-critic, active learning, etc. These algorithms are essentially all combinations of DP, sampling, and parametric models. 50/56
51 Final Remarks Tetris World Record: Human vs. AI 51 / 56
52 Final Remarks. More on Games and Learning. Tetris domain of the 2008 Reinforcement Learning Competition. 3rd International Reinforcement Learning Competition (2009, adversarial Tetris). 2013 Reinforcement Learning Competition (unmanned helicopter control). Since 2006: the Annual Computer Poker Competition. 2013, 2014, 2015 MIT Pokerbots. Many more! 52/56
53 Final Remarks. Exercise 4: Q-Learning. Use Q-learning to evaluate an American call option. Construct a simulator that generates trajectories of {(i_k, j_k)}. Write a Q-learning algorithm that interacts with the simulator function. Upon obtaining each (i_k, j_k, u_k), choose an appropriate stepsize γ_k (for example, γ_k = 1/k) and update the Q-factors. Plot the results. 53/56
54 Final Remarks. Exercise 4: Q-Learning. [Figure: convergence of option prices; option price vs. stock price.] 54/56
55 Final Remarks. Exercise 4: Q-Learning. [Figure: convergence of the cost vectors in Q-learning; option price at S = K vs. number of samples.] 55/56
56 Final Remarks. Exercise 4: Q-Learning. [Figure: convergence of exercising policies (blue: exercise, red: hold); price vs. number of policy iterations.] 56/56
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationMS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction
MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationThis question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.
This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you
More informationProcedia Computer Science 00 (2011) 000 6
Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-
More informationReinforcement Learning
Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.
More informationAlireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017
s s Machine Learning Reading Group The University of British Columbia Summer 2017 (OCO) Convex 1/29 Outline (OCO) Convex Stochastic Bernoulli s (OCO) Convex 2/29 At each iteration t, the player chooses
More informationCSC242: Intro to AI. Lecture 23
CSC242: Intro to AI Lecture 23 Administrivia Posters! Tue Apr 24 and Thu Apr 26 Idea! Presentation! 2-wide x 4-high landscape pages Learning so far... Input Attributes Alt Bar Fri Hun Pat Price Rain Res
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More informationLinear Programming Methods
Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution
More informationLearning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods Yaakov Engel Joint work with Peter Szabo and Dmitry Volkinshtein (ex. Technion) Why use GPs in RL? A Bayesian approach
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More information5. Solving the Bellman Equation
5. Solving the Bellman Equation In the next two lectures, we will look at several methods to solve Bellman s Equation (BE) for the stochastic shortest path problem: Value Iteration, Policy Iteration and
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More information6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE
6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE Undiscounted problems Stochastic shortest path problems (SSP) Proper and improper policies Analysis and computational methods for SSP Pathologies of
More informationMarkov Decision Processes Chapter 17. Mausam
Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationReinforcement Learning. Spring 2018 Defining MDPs, Planning
Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More information1 MDP Value Iteration Algorithm
CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More information1. Using the model and notations covered in class, the expected returns are:
Econ 510a second half Yale University Fall 2006 Prof. Tony Smith HOMEWORK #5 This homework assignment is due at 5PM on Friday, December 8 in Marnix Amand s mailbox. Solution 1. a In the Mehra-Prescott
More informationFigure 1: Bayes Net. (a) (2 points) List all independence and conditional independence relationships implied by this Bayes net.
1 Bayes Nets Unfortunately during spring due to illness and allergies, Billy is unable to distinguish the cause (X) of his symptoms which could be: coughing (C), sneezing (S), and temperature (T). If he
More informationReinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN
Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:
More informationMarkov Decision Processes Chapter 17. Mausam
Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.
More informationConstructing Learning Models from Data: The Dynamic Catalog Mailing Problem
Constructing Learning Models from Data: The Dynamic Catalog Mailing Problem Peng Sun May 6, 2003 Problem and Motivation Big industry In 2000 Catalog companies in the USA sent out 7 billion catalogs, generated
More informationYevgeny Seldin. University of Copenhagen
Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New
More information3E4: Modelling Choice. Introduction to nonlinear programming. Announcements
3E4: Modelling Choice Lecture 7 Introduction to nonlinear programming 1 Announcements Solutions to Lecture 4-6 Homework will be available from http://www.eng.cam.ac.uk/~dr241/3e4 Looking ahead to Lecture
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationOptimal Stopping Problems
2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating
More informationELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches. Ville Kyrki
ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches Ville Kyrki 9.10.2017 Today Direct policy learning via policy gradient. Learning goals Understand basis and limitations
More informationarxiv: v1 [cs.lg] 23 Oct 2017
Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1
More informationLecture 15: Bandit problems. Markov Processes. Recall: Lotteries and utilities
Lecture 15: Bandit problems. Markov Processes Bandit problems Action values (and now to compute them) Exploration-exploitation trade-off Simple exploration strategies -greedy Softmax (Boltzmann) exploration
More informationDRAFT Formulation and Analysis of Linear Programs
DRAFT Formulation and Analysis of Linear Programs Benjamin Van Roy and Kahn Mason c Benjamin Van Roy and Kahn Mason September 26, 2005 1 2 Contents 1 Introduction 7 1.1 Linear Algebra..........................
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability
More informationThe Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount
The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational
More informationReinforcement Learning Wrap-up
Reinforcement Learning Wrap-up Slides courtesy of Dan Klein and Pieter Abbeel University of California, Berkeley [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationCS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs
CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last
More information