Lecture 4: Misc. Topics and Reinforcement Learning
1 Approximate Dynamic Programming. Lecture 4: Misc. Topics and Reinforcement Learning. Mengdi Wang, Operations Research and Financial Engineering, Princeton University, August 1-4. 1/56
2 Feature Extraction is Linear Approximation of High-d Cost Vector 2/56
3 Today 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 3/56
4 Policy Search: A Simple Example of Energy Storage 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 4/56
5 Policy Search: A Simple Example of Energy Storage. Policy Approximation. Sometimes it is easier to parameterize the policy µ by µ(i) ≈ µ(i; r). Direct policy search: min_r E[ Σ_{k=1}^∞ α^k g(i_k, µ(i_k; r), j_k) ]. This is a stochastic optimization problem: it is very likely to be nonconvex; r is relatively low-dimensional; the high dimensionality of the DP is hidden in the complicated expectation. 5/56
6 Policy Search: A Simple Example of Energy Storage. Direct policy search via Sample Average Approximation. One approach is to simulate many long trajectories and formulate a sample average approximation min_r (1/L) Σ_{ℓ=1}^L Σ_{k=1}^∞ α^k g(i_kℓ, µ(i_kℓ; r), j_kℓ), based on L trajectories {i_0ℓ, i_1ℓ, ...}, ℓ = 1,...,L. As L → ∞, the sample average approximation converges to the original problem, and its solution converges to the optimal parameter r*. 6/56
7 Policy Search: A Simple Example of Energy Storage. Direct policy search via Stochastic Gradient Descent. Another approach is to apply stochastic gradient descent. At each iteration ℓ, generate a trajectory {i_0ℓ, i_1ℓ, ...} and compute the gradient G_ℓ = Σ_{k=1}^∞ α^k ∇_r g(i_kℓ, µ(i_kℓ; r_ℓ), j_kℓ), then update the policy parameter by r_{ℓ+1} = r_ℓ - γ_ℓ G_ℓ with stepsize γ_ℓ > 0. Each G_ℓ is an unbiased sample of the gradient of the overall objective at the current parameter r_ℓ. If the gradient is not computable, it can be replaced with finite differences. If the problem is convex, r_ℓ → r* a.s. If nonconvex (the most likely case), r_ℓ converges a.s. to a local optimum (good enough in practice most of the time). 7/56
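As a concrete illustration, here is a minimal Python sketch of direct policy search by gradient descent with finite-difference gradient estimates, as described on this slide. The `simulate` interface (a generator of per-stage costs under parameter r), the one-sided difference width `h`, and the diminishing stepsize `step/ℓ` are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rollout_cost(simulate, r, horizon=200, alpha=0.95):
    """Discounted cost of one simulated trajectory under policy parameter r."""
    total, discount = 0.0, 1.0
    for cost in simulate(r, horizon):  # yields stage costs g(i_k, mu(i_k; r), j_k)
        total += discount * cost
        discount *= alpha
    return total

def finite_diff_policy_search(simulate, r0, iters=200, step=0.1, h=1e-2):
    """Gradient descent on the rollout cost, with finite-difference gradients."""
    r = np.array(r0, dtype=float)
    for ell in range(1, iters + 1):
        grad = np.zeros_like(r)
        base = rollout_cost(simulate, r)
        for d in range(r.size):        # one-sided finite difference per coordinate
            e = np.zeros_like(r)
            e[d] = h
            grad[d] = (rollout_cost(simulate, r + e) - base) / h
        r -= (step / ell) * grad       # diminishing stepsize gamma_ell = step / ell
    return r
```

With a stochastic simulator each rollout gives a noisy cost, and the diminishing stepsize plays the role of the SGD stepsize γ_ℓ.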
8 Policy Search: A Simple Example of Energy Storage. A Very Simple Example: Electricity Prices. Electricity prices are very volatile. It is very difficult to store electricity. Spiky prices under heavy load (hot summer, freezing winter, etc.). The average/median price is fairly cheap. Related to weather, season, time of day, etc. 8/56
9 Policy Search: A Simple Example of Energy Storage. Battery Storage. Suppose that you operate a battery and participate in the regional electricity market. Challenge: find a policy for charging and discharging the battery. Strategy proposed by the battery manufacturer: Buy low, sell high. 9/56
10 Policy Search: A Simple Example of Energy Storage. There are many approaches based on DP. Solving the DP requires full knowledge of the model: how does the price change? how might the storage have a market impact on future prices? if there is a forecast, how good is it? what are the uncertainties and what are their distributions? To gain this knowledge, we could fit a time series model for the price dynamics, then formulate a DP problem and find an optimal policy of the DP problem. 10/56
11 Policy Search: A Simple Example of Energy Storage. DP Model. State: price p_t, storage level c_t. Action: buy u_t = -1, sell u_t = 1, do nothing u_t = 0. Action constraints: if c_t = 0, cannot sell; if c_t = C, cannot buy. State transition of storage inventory: c_{t+1} = c_t - u_t. State transition of electricity price (learned from data): p_{t+1} = p_t + f(p_t) + ε_t. 11/56
12 Policy Search: A Simple Example of Energy Storage. DP Model. The overall objective: max_{u_t} E[ Σ_{t=1}^T u_t p_t ]. DP algorithm: V(p_t, c_t) = max_{u_t} { u_t p_t + E[ V(p_{t+1}, c_t - u_t) ] }. 12/56
13 Policy Search: A Simple Example of Energy Storage. The ultimate problem is to find a function/policy/strategy µ : {states} → {actions} that solves max_µ E[ Σ_{t=1}^T µ(p_t, c_t) p_t ] subject to the state transitions as constraints. Difficulty I: we do not know the distribution and dynamics of the time series p_t ⇒ need a data-based approach. Difficulty II: searching over a space of functions is hard ⇒ need to narrow the search to a simple parametric family. 13/56
14 Policy Search: A Simple Example of Energy Storage Optimizing A Storage Policy Consider a simple policy in which we choose a sell price and a buy price 14 / 56
15 Policy Search: A Simple Example of Energy Storage. Optimizing over Simple Policies. Now let us search over simple threshold policies. The modified problem is max E[ Σ_{t=1}^T µ(p_t, c_t) p_t ] subject to µ(p, c) = -1 if p < θ_store and c < C; µ(p, c) = 1 if p > θ_withdraw and c > 0; µ(p, c) = 0 otherwise; as well as the state transition constraints. We search for the threshold values θ_store and θ_withdraw by backtesting. 15/56
16 Policy Search: A Simple Example of Energy Storage. Optimizing over Simple Policies. Average historical profit as a function of the two policy parameters. For a given pair (θ_store, θ_withdraw), the profit value is calculated by one simulation run on the entire price history. The optimal policy stands out! 16/56
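The backtesting step on this slide can be sketched in a few lines of Python. The unit charge/discharge per step, the capacity, and the grid of candidate thresholds are assumptions for illustration: `backtest` replays the threshold policy of the previous slide over one historical price path, and `grid_search` evaluates the profit for every candidate pair (θ_store, θ_withdraw).

```python
def backtest(prices, theta_store, theta_withdraw, capacity=1):
    """Profit of the threshold policy on one historical price path."""
    c, profit = 0, 0.0
    for p in prices:
        if p < theta_store and c < capacity:    # buy (charge) one unit
            c += 1
            profit -= p
        elif p > theta_withdraw and c > 0:      # sell (discharge) one unit
            c -= 1
            profit += p
    return profit

def grid_search(prices, grid, capacity=1):
    """Backtest every (theta_store, theta_withdraw) pair and pick the best."""
    results = {(s, w): backtest(prices, s, w, capacity)
               for s in grid for w in grid if s <= w}
    best = max(results, key=results.get)
    return best, results
```

The `results` dictionary is exactly the profit surface plotted on the slide, evaluated on the grid.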
17 Policy Search: A Simple Example of Energy Storage. Make the Problem Harder. Battery charge/discharge lead time: in practice, we cannot turn the battery or generator on and off immediately; it needs to warm up for some time before charging/generating. Using a forecast: suppose that we have a 1-hour price forecast and want the policy to make use of it. The policy could be: charge if a weighted combination of the current price and the forecast price is smaller than a threshold; withdraw if the weighted combination is greater than a threshold. The parameters are the two threshold prices and the weights of the combination. 17/56
18 DP and LP 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 18 / 56
19 DP and LP. Look at the Bellman Equation Again. Consider an MDP model with: states i = 1,...,n; probability transition matrix under policy µ: P_µ ∈ R^{n×n}; cost of transition: g_µ ∈ R^n. The Bellman equation is J = min_µ { g_µ + α P_µ J }. This is a nonlinear system of equations. Note: the right-hand side is the infimum of a number of linear mappings of J! 19/56
20 DP and LP. DP is a special case of LP. Theorem: every finite-state DP problem is an LP problem. We construct the following LP: maximize J(1) + ... + J(n) subject to J(i) ≤ Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α J(j) ) for all i and all u ∈ A; or, more compactly: maximize e'J subject to J ≤ g_µ + α P_µ J for all µ. The variables are J(i), i = 1,...,n. For each state-action pair (i, u) there is one inequality constraint. Dimension of the LP: n variables and n·|A| constraints (here we assume a finite-state problem). 20/56
21 DP and LP. DP is a special case of LP. Theorem: the solution to the constructed LP, maximize e'J subject to J ≤ g_µ + α P_µ J for all µ, is exactly the solution J* of the Bellman equation J = min_µ { g_µ + α P_µ J }. Proof sketch: J* is clearly feasible, since J* = min_µ { g_µ + α P_µ J* } ≤ g_µ + α P_µ J* for every µ. Conversely, any feasible J satisfies J ≤ T_µ J for all µ, hence J ≤ TJ; iterating and using the monotonicity of T gives J ≤ lim_m T^m J = J*. Therefore J* maximizes e'J over the feasible set, and it is the unique maximizer. 21/56
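For a small MDP, the LP on this slide can be handed directly to an off-the-shelf solver. The sketch below assumes SciPy's `linprog` is available and uses an invented array convention (`P[u]` is the n×n transition matrix for action u, `g[u]` the n-vector of expected stage costs): it stacks one block of constraints J ≤ g_u + α P_u J per action and maximizes e'J.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, g, alpha):
    """Solve a finite cost-minimization MDP as an LP.
    P[u]: n x n transition matrix for action u; g[u]: n-vector of
    expected stage costs under u.  Returns the optimal cost vector J*."""
    n = P.shape[1]
    A_ub, b_ub = [], []
    for u in range(P.shape[0]):
        # J <= g_u + alpha * P_u J   <=>   (I - alpha P_u) J <= g_u
        A_ub.append(np.eye(n) - alpha * P[u])
        b_ub.append(g[u])
    res = linprog(c=-np.ones(n),                 # maximize e'J
                  A_ub=np.vstack(A_ub),
                  b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n)
    return res.x
```

On a toy two-state, two-action example the LP solution coincides with the fixed point of value iteration, as the theorem asserts.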
22 DP and LP. ADP via Approximate Linear Programming. The constructed LP is of huge scale: maximize e'J subject to J ≤ g_µ + α P_µ J for all µ. We may approximate J by adding the constraint J = Φr, so the variable dimension becomes smaller. We may sample a subset of all constraints, so the constraint dimension becomes smaller. The LP and the approximate LP can be solved by simulation/online. 22/56
23 Online Learning and Q-Learning 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 23 / 56
24 Online Learning and Q-Learning. Multi-Armed Bandits: the Simplest Online Learning Model. Suppose you are faced with N slot machines ("bandits"). Each bandit dispenses a random prize. Some bandits are very generous, others not so much. Of course, you don't know what these expectations are. Choosing only one bandit per round, our task is to devise a strategy that maximizes our winnings. 24/56
25 Online Learning and Q-Learning. Multi-Armed Bandits: The Problem. Let there be K bandits, each giving a random prize with expectation E[X_b] ∈ [0, 1], b ∈ {1,...,K}. If you pull bandit b, you get an independent sample of X_b. We want to solve the one-shot optimization problem max_{b ∈ {1,...,K}} E[X_b], but we can experiment repeatedly. Our task can be phrased as: find the best bandit, as quickly as possible. We use trial and error to gain knowledge of the expected values. 25/56
26 Online Learning and Q-Learning. A Naive Strategy. In each round k, we are given the current estimated means X̂_b, b ∈ {1,...,K}. 1. With probability 1 - ε_k, select the currently-known-to-be-best bandit b_k = argmax_b X̂_b. 2. With probability ε_k, select a bandit uniformly at random. 3. Observe the result of pulling bandit b_k, update X̂_{b_k} as the new sample average, and return to 1. Exploration vs. Exploitation: the random selection with probability ε_k guarantees exploration; the selection of the current best with probability 1 - ε_k guarantees exploitation. With, say, ε_k = 1/k, the algorithm increasingly exploits the existing knowledge and is guaranteed to find the true optimal bandit. 26/56
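A minimal Python sketch of this naive ε-greedy strategy. The bandit interface `pull(b)` is an assumption, and for robustness over short horizons the sketch floors the exploration probability at 5% rather than letting ε_k = 1/k decay all the way to zero.

```python
import numpy as np

def epsilon_greedy(pull, K, rounds=5000, seed=0):
    """Run the naive strategy; pull(b) returns a sampled reward of bandit b."""
    rng = np.random.default_rng(seed)
    means = np.zeros(K)    # sample-average estimates hat{X}_b
    counts = np.zeros(K)
    for k in range(1, rounds + 1):
        eps = max(0.05, 1.0 / k)           # exploration floor (assumption)
        if rng.random() < eps:
            b = int(rng.integers(K))       # explore: uniformly random bandit
        else:
            b = int(np.argmax(means))      # exploit: current best estimate
        counts[b] += 1
        means[b] += (pull(b) - means[b]) / counts[b]   # incremental average
    return int(np.argmax(means))
```

The incremental-average update is algebraically identical to recomputing the sample mean of all rewards observed from bandit b.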
27 Online Learning and Q-Learning. Bayesian Bandit Strategy. In each round, we are given the current posterior distributions of the bandits' mean returns. 1. Sample a random variable X_b from the posterior of bandit b, for all b = 1,...,K. 2. Select the bandit with the largest sample, i.e., select bandit b* = argmax_b X_b. 3. Observe the result of pulling bandit b*, update your posterior on bandit b*, and return to 1. 27/56
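For Bernoulli rewards the posterior updates are conjugate Beta distributions, so the Bayesian strategy above (Thompson sampling) can be sketched very compactly. The Beta(1,1) priors and the `pull(b)` interface are assumptions.

```python
import numpy as np

def thompson(pull, K, rounds=5000, seed=0):
    """Beta-Bernoulli Thompson sampling with Beta(1,1) priors on each mean."""
    rng = np.random.default_rng(seed)
    a = np.ones(K)   # Beta alpha parameters: successes + 1
    b = np.ones(K)   # Beta beta parameters: failures + 1
    for _ in range(rounds):
        theta = rng.beta(a, b)           # one draw from each posterior
        arm = int(np.argmax(theta))      # play the bandit with largest sample
        if pull(arm):
            a[arm] += 1                  # posterior update on the pulled bandit
        else:
            b[arm] += 1
    return int(np.argmax(a / (a + b)))   # bandit with largest posterior mean
```

Unlike ε-greedy, exploration here is automatic: a bandit with a wide posterior occasionally produces the largest sample and gets pulled.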
29 Online Learning and Q-Learning. How do bandits relate to real-time decision making? Each bandit is a possible configuration of the policy parameters. A bandit strategy is online parameter tuning. Evaluation of goodness: Regret = T · max_{b=1,...,K} E[X_b] - Σ_{t=1}^T E[X_{b_t}]. The regret of a reasonable learning strategy is usually between log T and √T. 29/56
30 Online Learning and Q-Learning. Learning in Sequential Decision Making. Traditionally, sequential decision-making problems are modeled by dynamic programming. In one-shot optimization, finding the best arm by trial and error is known as online learning. Let's combine the two: DP + Online Learning = Reinforcement Learning. 30/56
31 Online Learning and Q-Learning. From DP to Reinforcement Learning. Ideally, DP solves the fixed-point equation: find V such that V = min_µ { g_µ + α P_µ V }. Practically, we often wish to solve Bellman's equation without knowing P_µ, g_µ. What we do have: a simulator that, starting from state i with action a, generates random samples of the transition cost and next state: g(i, a, i_next), i_next. Example: optimize a trading policy to maximize profit; the current transaction has an unknown market impact; use the current order book as states/features. 31/56
32 Online Learning and Q-Learning. Rewrite the Bellman Equation with Q-Factors. Q-factors: we define Q(i, u) to be the value of the post-decision state (i, u): Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α V(j) ). Bellman equation for Q-factors: Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{v ∈ U(j)} Q(j, v) ). The DP algorithm works for Q-functions as well. 32/56
33 Online Learning and Q-Learning. Q-Learning. We hope to solve the Bellman equation using VI: Q*(i, u) = Σ_{j=1}^n p_ij(u) [ g(i, u, j) + α min_v Q*(j, v) ]. Q-learning is simulation-based VI for Q-factors: Q_{k+1}(i, u) = (1 - γ) Q_k(i, u) + γ · (sample of the Bellman equation's right-hand side). 33/56
34 Online Learning and Q-Learning. Q-Learning Algorithm. Generate a state sequence {(i_k, u_k, j_k)}: sample (i_k, j_k) according to the system using control u_k. Update each (i_k, u_k, j_k) with stepsize γ_k > 0: Q_{k+1}(i_k, u_k) = (1 - γ_k) Q_k(i_k, u_k) + γ_k [ g(i_k, u_k, j_k) + α min_v Q_k(j_k, v) ]. A real-time policy can be recovered from the Q-values easily: µ_k(i) = argmin_u Q_k(i, u), i = 1,...,n. Comments: almost sure convergence to the optimal Q-values and policy (classical convergence result of stochastic approximation); fully online and real-time and, more importantly, model-free; the simulator needs to sample all (i, u) sufficiently often (exploration vs. exploitation). 34/56
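A tabular Python sketch of the update above, with ε-greedy exploration so that all (i, u) pairs keep being sampled. The constant stepsize is an assumption that suffices for the deterministic toy problem tested below; for noisy transitions a diminishing stepsize such as γ_k = 1/N(i, u) is the classical choice guaranteeing convergence.

```python
import numpy as np

def q_learning(step, n_states, n_actions, alpha=0.9, gamma=0.5,
               iters=20000, eps=0.1, seed=0):
    """Tabular Q-learning for cost minimization.
    step(i, u) simulates one transition and returns (cost, next_state);
    alpha is the discount factor, gamma the stepsize, eps the exploration rate."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    i = 0
    for _ in range(iters):
        # epsilon-greedy behavior policy: explore with probability eps
        if rng.random() < eps:
            u = int(rng.integers(n_actions))
        else:
            u = int(np.argmin(Q[i]))
        cost, j = step(i, u)
        # sample of the Bellman right-hand side: cost + alpha * min_v Q(j, v)
        Q[i, u] += gamma * (cost + alpha * Q[j].min() - Q[i, u])
        i = j
    return Q
```

The greedy policy µ(i) = argmin_u Q(i, u) is read off the final table, exactly as on the slide.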
35 Online Learning and Q-Learning. Q-Learning for the Optimal Stopping Problem. Stopping problem: i is the state (for example, stock price and time), which moves to state j with probability p_ij. Actions: HOLD or STOP. C(i): cost of stopping at state i. g(i, HOLD, j) = 0, g(i, STOP, j) = C(i). Bellman equation: for all i, Q(i, HOLD) = α Σ_{j=1}^n p_ij min{ Q(j, STOP), Q(j, HOLD) }, and Q(i, STOP) = C(i). 35/56
36 Online Learning and Q-Learning. Q-Learning for the Optimal Stopping Problem: Algorithm. Generate a state path {i_k} according to the stochastic system and randomized actions. Update each state i_k using a stepsize γ_k > 0: Q_{k+1}(i_k) = (1 - γ_k) Q_k(i_k) + γ_k α min{ C(i_{k+1}), Q_k(i_{k+1}) }. 36/56
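A Python sketch of this stopping-problem recursion. Only the HOLD Q-factor needs a table, since the STOP Q-factor is just C(i). The simulator interface `next_state(i)`, the stepsize γ_k = 1/N(i), and the explicit discount factor α are illustrative assumptions.

```python
import numpy as np

def q_learning_stopping(next_state, C, alpha=0.9, steps=30000, seed=0):
    """Q-learning for optimal stopping.  Q[i] estimates the Q-factor of
    HOLD at state i; next_state(i) samples j with probability p_ij."""
    rng = np.random.default_rng(seed)
    n = len(C)
    Q = np.zeros(n)
    N = np.zeros(n)
    i = int(rng.integers(n))
    for _ in range(steps):
        j = next_state(i)
        N[i] += 1
        gamma = 1.0 / N[i]   # diminishing stepsize per visited state
        # target: alpha * min{ stop cost at j, hold Q-factor at j }
        Q[i] += gamma * (alpha * min(C[j], Q[j]) - Q[i])
        i = j
    return Q
```

In the test below, the next state is uniform over three states with stopping costs (-2, -1, 0), for which the HOLD Q-factor solves q = 0.9 · (min(-2, q) + min(-1, q) + min(0, q)) / 3, i.e., q = -1.5 for every state.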
37 Final Remarks 1 Policy Search: A Simple Example of Energy Storage 2 DP and LP 3 Online Learning and Q-Learning 4 Final Remarks Exercise 4: Q-Learning 37 / 56
38 Final Remarks. Theory of Approximate DP. Infinite-Horizon DP Problem: minimize over policies π = {µ_0, µ_1, ...} the cost function J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} [ Σ_{k=0}^{N-1} α^k g(x_k, µ_k(x_k), w_k) ]. How to Approximate DP. Approximation: parameterize policies/cost vectors, aggregation, etc. Simulation: use simulation-generated trajectories {x_k} to calculate DP quantities, without knowing the system dynamics. 38/56
39 Final Remarks. Markovian Decision Process. Assume the system is an n-state (controlled) Markov chain. Change to Markov chain notation: states i = 1,...,n (instead of x); transition probabilities p_{i_k i_{k+1}}(u_k) (instead of x_{k+1} = f(x_k, u_k, w_k)); cost per stage g(i, u, j) (instead of g(x_k, u_k, w_k)). Cost of a policy π = {µ_0, µ_1, ...}: J_π(i) = lim_{N→∞} E [ Σ_{k=0}^{N-1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ]. 39/56
40 Final Remarks. MDP Continued. The optimal cost vector satisfies the Bellman equation for all i: J*(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J*(j) ), or in matrix form J* = min_{µ: {1,...,n} → U} { g_µ + α P_µ J* }. Shorthand notation for the DP mappings: (TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J(j) ), i = 1,...,n; (T_µ J)(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + α J(j) ), i = 1,...,n. 40/56
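The mappings T and T_µ translate directly into code. The sketch below implements T for a finite MDP stored as arrays and runs value iteration J ← TJ to a tolerance; the array layout (`P[u]` is the n×n matrix for action u, `g[u]` the vector of expected stage costs under u) is an assumed convention.

```python
import numpy as np

def bellman_T(J, P, g, alpha):
    """(TJ)(i) = min_u [ g_u(i) + alpha * (P_u J)(i) ], vectorized over i."""
    return np.min(g + alpha * (P @ J), axis=0)

def value_iteration(P, g, alpha, tol=1e-8):
    """Iterate J <- TJ until the sup-norm change drops below tol."""
    J = np.zeros(P.shape[1])
    while True:
        J_new = bellman_T(J, P, g, alpha)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
```

Since T is an α-contraction in the sup norm, the loop converges geometrically to the unique fixed point J*.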
41 Final Remarks. Approximation Architecture. Approximation in Policy Space: parameterize the set of policies µ by a vector r, and then optimize over r. Approximation in Value Space: approximate J* and J_µ from a family of functions parameterized by r, e.g., a linear approximation J ≈ Φr, J(i) ≈ φ(i)'r. 41/56
42 Final Remarks. Approximate DP Algorithms: A Roadmap. Approximate PI (*): implement the two steps of PI in an approximate sense. Policy evaluation: solve J_µt = T_µt J_µt by approximation/simulation; direct approach (*), e.g., simulation-based least squares; indirect approach: solve J_µt = T_µt J_µt by TD/LSTD/LSPE. Policy improvement: T_µ(t+1) J_µt = T J_µt, using the approximate cost vector/Q-factors. Approximate J* and Q*: solve J = TJ or Q = FQ directly by simulation, e.g., Q-learning, Bellman error minimization, the LP approach. 42/56
43 Final Remarks Tetris Height: 12 Width: 7 Rotate and move the falling shape Gravity related to current height Score when eliminating an entire level Game over when reaching the ceiling 43 / 56
44 Final Remarks Tetris at MIT 44 / 56
45 Final Remarks. DP Model of Tetris. State: the current board, the current falling tile, and predictions of future tiles. Termination state: when the tiles reach the ceiling, the game is over with no more future reward. Action: rotation and shift. System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly. Uncertainty: randomness in future tiles. Transition cost g: if a level is cleared by the current action, score 1; otherwise score 0. Objective: expectation of the total score. 45/56
46 Final Remarks. Interesting facts about Tetris. First released in 1984 by Alexey Pajitnov from the Soviet Union. Has been proved to be NP-complete. The game will be over with probability 1. For a 12 × 7 board, the number of possible states is astronomical (on the order of 10^25). Highest score achieved by a human: about 1 million. Highest score achieved by an algorithm: about 35 million (average performance). 46/56
47 Final Remarks. Solving Tetris by Approximate Dynamic Programming. Curse of dimensionality: the exact cost-to-go vector V has dimension equal to the state space size (about 10^25). The transition matrix is correspondingly huge, although sparse. There is no way to solve the Bellman equation exactly. Cost function approximation: instead of characterizing the entire board configuration, we use features φ_1, ..., φ_d to describe the current state, and represent the cost-to-go vector V by a linear combination of a small number of features: V ≈ r_1 φ_1 + ... + r_d φ_d. We approximate the high-dimensional space of V using a low-dimensional feature space. 47/56
48 Final Remarks. Features of Tetris. The earliest Tetris learning algorithm uses 3 features; the latest best algorithm uses 22 features. The mapping from the state space to the feature space can be viewed as state aggregation: different states/boards with similar heights/holes/structures are treated as a single meta-state. 48/56
49 Final Remarks. Learning in Tetris. Ideally, we still want to solve the Bellman equation V = max_µ { g_µ + α P_µ V }. We consider an approximate solution Ṽ to the Bellman equation such that Ṽ = Φr. We apply policy iteration with approximate function evaluation. Outer loop: keep improving the policies µ_k using one-step lookahead. Inner loop: find the cost-to-go V_µk associated with the current policy µ_k, by taking samples and using the feature representation. We want to avoid high-dimensional algebraic operations, as well as storing high-dimensional quantities. 49/56
50 Final Remarks. Algorithms tested on Tetris. More examples of learning methods: temporal difference learning, TD(λ), LSTD, the cross-entropy method, actor-critic, active learning, etc. These algorithms are essentially all combinations of DP, sampling, and parametric models. 50/56
51 Final Remarks Tetris World Record: Human vs. AI 51 / 56
52 Final Remarks. More on Games and Learning. Tetris domain of the 2008 Reinforcement Learning Competition. 3rd International Reinforcement Learning Competition (2009, adversarial Tetris). 2013 Reinforcement Learning Competition (unmanned helicopter control). Since 2006: the Annual Computer Poker Competition. 2013, 2014, 2015 MIT Pokerbots. Many more! 52/56
53 Final Remarks. Exercise 4: Q-Learning. Use Q-learning to evaluate an American call option. Construct a simulator that generates trajectories of {(i_k, j_k)}. Write a Q-learning algorithm that interacts with the simulator function. Upon obtaining each (i_k, j_k, u_k), choose an appropriate stepsize γ_k (for example, γ_k = 1/k) and update the Q-factors. Plot the results. 53/56
54 Final Remarks. Exercise 4: Q-Learning. [Figure: convergence of option prices; option price vs. stock price.] 54/56
55 Final Remarks. Exercise 4: Q-Learning. [Figure: convergence of the cost vectors in Q-learning; option price at S = K vs. number of samples.] 55/56
56 Final Remarks. Exercise 4: Q-Learning. [Figure: convergence of exercising policies (blue: exercise, red: hold); price vs. number of policy iterations.] 56/56
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationMS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction
MS&E338 Reinforcement Learning Lecture 1 - April 2, 2018 Introduction Lecturer: Ben Van Roy Scribe: Gabriel Maher 1 Reinforcement Learning Introduction In reinforcement learning (RL) we consider an agent
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationThis question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.
This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you
More informationProcedia Computer Science 00 (2011) 000 6
Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-
More informationReinforcement Learning
Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.
More informationAlireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017
s s Machine Learning Reading Group The University of British Columbia Summer 2017 (OCO) Convex 1/29 Outline (OCO) Convex Stochastic Bernoulli s (OCO) Convex 2/29 At each iteration t, the player chooses
More informationCSC242: Intro to AI. Lecture 23
CSC242: Intro to AI Lecture 23 Administrivia Posters! Tue Apr 24 and Thu Apr 26 Idea! Presentation! 2-wide x 4-high landscape pages Learning so far... Input Attributes Alt Bar Fri Hun Pat Price Rain Res
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More informationLinear Programming Methods
Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution
More informationLearning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods Yaakov Engel Joint work with Peter Szabo and Dmitry Volkinshtein (ex. Technion) Why use GPs in RL? A Bayesian approach
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More information5. Solving the Bellman Equation
5. Solving the Bellman Equation In the next two lectures, we will look at several methods to solve Bellman s Equation (BE) for the stochastic shortest path problem: Value Iteration, Policy Iteration and
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More information6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE
6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE Undiscounted problems Stochastic shortest path problems (SSP) Proper and improper policies Analysis and computational methods for SSP Pathologies of
More informationMarkov Decision Processes Chapter 17. Mausam
Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationReinforcement Learning and Control
CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make
More informationReinforcement Learning. Spring 2018 Defining MDPs, Planning
Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More information1 MDP Value Iteration Algorithm
CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More information1. Using the model and notations covered in class, the expected returns are:
Econ 510a second half Yale University Fall 2006 Prof. Tony Smith HOMEWORK #5 This homework assignment is due at 5PM on Friday, December 8 in Marnix Amand s mailbox. Solution 1. a In the Mehra-Prescott
More informationFigure 1: Bayes Net. (a) (2 points) List all independence and conditional independence relationships implied by this Bayes net.
1 Bayes Nets Unfortunately during spring due to illness and allergies, Billy is unable to distinguish the cause (X) of his symptoms which could be: coughing (C), sneezing (S), and temperature (T). If he
More informationReinforcement Learning for Continuous. Action using Stochastic Gradient Ascent. Hajime KIMURA, Shigenobu KOBAYASHI JAPAN
Reinforcement Learning for Continuous Action using Stochastic Gradient Ascent Hajime KIMURA, Shigenobu KOBAYASHI Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku Yokohama 226-852 JAPAN Abstract:
More informationMarkov Decision Processes Chapter 17. Mausam
Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.
More informationConstructing Learning Models from Data: The Dynamic Catalog Mailing Problem
Constructing Learning Models from Data: The Dynamic Catalog Mailing Problem Peng Sun May 6, 2003 Problem and Motivation Big industry In 2000 Catalog companies in the USA sent out 7 billion catalogs, generated
More informationYevgeny Seldin. University of Copenhagen
Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New
More information3E4: Modelling Choice. Introduction to nonlinear programming. Announcements
3E4: Modelling Choice Lecture 7 Introduction to nonlinear programming 1 Announcements Solutions to Lecture 4-6 Homework will be available from http://www.eng.cam.ac.uk/~dr241/3e4 Looking ahead to Lecture
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationOptimal Stopping Problems
2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating
More informationELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches. Ville Kyrki
ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches Ville Kyrki 9.10.2017 Today Direct policy learning via policy gradient. Learning goals Understand basis and limitations
More informationarxiv: v1 [cs.lg] 23 Oct 2017
Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1
More informationLecture 15: Bandit problems. Markov Processes. Recall: Lotteries and utilities
Lecture 15: Bandit problems. Markov Processes Bandit problems Action values (and now to compute them) Exploration-exploitation trade-off Simple exploration strategies -greedy Softmax (Boltzmann) exploration
More informationDRAFT Formulation and Analysis of Linear Programs
DRAFT Formulation and Analysis of Linear Programs Benjamin Van Roy and Kahn Mason c Benjamin Van Roy and Kahn Mason September 26, 2005 1 2 Contents 1 Introduction 7 1.1 Linear Algebra..........................
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability
More informationThe Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount
The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational
More informationReinforcement Learning Wrap-up
Reinforcement Learning Wrap-up Slides courtesy of Dan Klein and Pieter Abbeel University of California, Berkeley [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationCS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs
CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last
More information