Lecture 4: Misc. Topics and Reinforcement Learning


1 Approximate Dynamic Programming. Lecture 4: Misc. Topics and Reinforcement Learning. Mengdi Wang, Operations Research and Financial Engineering, Princeton University. August 1-4.

2 Feature Extraction is Linear Approximation of a High-Dimensional Cost Vector

3 Today
1 Policy Search: A Simple Example of Energy Storage
2 DP and LP
3 Online Learning and Q-Learning
4 Final Remarks
Exercise 4: Q-Learning

4 Policy Search: A Simple Example of Energy Storage (section outline slide)

5 Policy Search: A Simple Example of Energy Storage
Policy Approximation
Sometimes it is easier to parameterize the policy µ directly, µ(i) ≈ µ(i; r). Direct policy search:
min_r E[ Σ_{k=1}^∞ α^k g(i_k, µ(i_k; r), j_k) ]
This is a stochastic optimization problem:
- Very likely to be nonconvex.
- r is relatively low-dimensional.
- The high dimensionality of the DP is hidden in the complicated expectation.

6 Policy Search: A Simple Example of Energy Storage
Direct Policy Search via Sample Average Approximation
One approach is to simulate many long trajectories and formulate a sample average approximation
min_r Σ_{ℓ=1}^L Σ_{k=1}^∞ α^k g(i_{kℓ}, µ(i_{kℓ}; r), j_{kℓ}),
which is based on L trajectories {i_{0ℓ}, i_{1ℓ}, ...}, ℓ = 1, ..., L. As L → ∞, the sample average approximation converges to the original problem, and its solution converges to the optimal r*.

7 Policy Search: A Simple Example of Energy Storage
Direct Policy Search via Stochastic Gradient Descent
Another approach is to apply stochastic gradient descent. At each iteration ℓ, generate a trajectory {i_{0ℓ}, i_{1ℓ}, ...} and compute the gradient
G_ℓ = Σ_{k=1}^∞ α^k ∇_r g(i_{kℓ}, µ(i_{kℓ}; r), j_{kℓ}),
then update the policy parameter by r_{ℓ+1} = r_ℓ − γ_ℓ G_ℓ.
Each G_ℓ is an unbiased sample of the gradient of the overall objective at the current parameter r_ℓ. If the gradient is not computable, it can be replaced with finite differences.
If the problem is convex, r_ℓ → r* almost surely. If nonconvex (the most likely case), r_ℓ converges almost surely to a local optimum (good enough in practice most of the time).
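As an illustration of the finite-difference variant mentioned above, here is a minimal sketch of direct policy search with a simultaneous-perturbation finite-difference gradient estimate. The simulator interface `simulate_cost(r, rng)` is an assumed placeholder that returns one sampled discounted trajectory cost under the policy µ(·; r); the stepsize and perturbation schedules are illustrative choices, not the lecture's.

```python
import numpy as np

def policy_search_spsa(simulate_cost, r0, num_iters=200, step=0.1, perturb=0.05, seed=0):
    """Direct policy search by stochastic finite-difference gradient descent.

    simulate_cost(r, rng): assumed placeholder that simulates one long
    trajectory under the policy mu(.; r) and returns its discounted cost
    (an unbiased sample of the objective).
    """
    rng = np.random.default_rng(seed)
    r = np.array(r0, dtype=float)
    for ell in range(1, num_iters + 1):
        # Simultaneous-perturbation finite-difference estimate of the gradient.
        delta = rng.choice([-1.0, 1.0], size=r.shape)
        c_plus = simulate_cost(r + perturb * delta, rng)
        c_minus = simulate_cost(r - perturb * delta, rng)
        grad_est = (c_plus - c_minus) / (2.0 * perturb) * delta
        # Diminishing stepsize gamma_ell = step / ell.
        r -= (step / ell) * grad_est
    return r
```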

8 Policy Search: A Simple Example of Energy Storage
A Very Simple Example: Electricity Prices
Electricity prices are very volatile, and it is very difficult to store electricity.
- Spiky prices under heavy load (hot summers, freezing winters, ...).
- Average/median price fairly cheap.
- Related to weather, season, time of day, etc.

9 Policy Search: A Simple Example of Energy Storage
Battery Storage
Suppose that you operate a battery and participate in the regional electricity market.
Challenge: find a policy for charging and discharging the battery.
Strategy proposed by the battery manufacturer: buy low, sell high.

10 Policy Search: A Simple Example of Energy Storage
There are many approaches based on DP. Solving the DP requires full knowledge of the model:
- How does the price change?
- How might the storage have a market impact on future prices?
- If there is a forecast, how good is it?
- What are the uncertainties and what are their distributions?
To gain this knowledge, we could fit a time series model for the price dynamics, then formulate a DP problem and find an optimal policy for it.

11 Policy Search: A Simple Example of Energy Storage
DP Model
State: price p_t, storage level c_t.
Action: buy u_t = −1, sell u_t = +1, do nothing u_t = 0.
Action constraint: if c_t = 0, cannot sell; if c_t = C, cannot buy.
State transition of the storage inventory: c_{t+1} = c_t − u_t.
State transition of the electricity price (learned from data): p_{t+1} = p_t + f(p_t) + ε_t.

12 Policy Search: A Simple Example of Energy Storage
DP Model
The overall objective:
max_{u_t} E[ Σ_{t=1}^T u_t p_t ]
DP algorithm:
V(p_t, c_t) = max_{u_t} { u_t p_t + E[ V(p_{t+1}, c_t − u_t) ] }
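If the price dynamics were known (or fitted), the recursion above could be solved by backward induction on a discretized price grid. The following is a minimal sketch under that assumption; the grid, the transition matrix `trans_prob`, and the unit charge/discharge increments are illustrative discretization choices, not part of the lecture.

```python
import numpy as np

def solve_storage_dp(price_grid, trans_prob, T, C):
    """Backward induction for the toy battery DP (illustrative sketch).

    price_grid: array of discretized prices p.
    trans_prob: trans_prob[i, j] = P(p_{t+1} = price_grid[j] | p_t = price_grid[i]),
                standing in for p_{t+1} = p_t + f(p_t) + eps_t.
    T: horizon, C: battery capacity in integer units of charge.
    Returns the value function V[t, i, c] and a greedy action table.
    """
    n = len(price_grid)
    V = np.zeros((T + 1, n, C + 1))          # V[T, :, :] = 0 (no terminal value)
    policy = np.zeros((T, n, C + 1), dtype=int)
    for t in range(T - 1, -1, -1):
        EV = trans_prob @ V[t + 1]           # EV[i, c] = E[V_{t+1}(p', c) | p = price_grid[i]]
        for i, p in enumerate(price_grid):
            for c in range(C + 1):
                # u = -1: buy (needs c < C); u = +1: sell (needs c > 0); u = 0: hold.
                candidates = {0: EV[i, c]}
                if c < C:
                    candidates[-1] = -p + EV[i, c + 1]
                if c > 0:
                    candidates[1] = p + EV[i, c - 1]
                u_best = max(candidates, key=candidates.get)
                V[t, i, c] = candidates[u_best]
                policy[t, i, c] = u_best
    return V, policy
```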

13 Policy Search: A Simple Example of Energy Storage
The ultimate problem is to find a function/policy/strategy µ : {states} → {actions} such that
max_µ E[ Σ_{t=1}^T µ(p_t, c_t) p_t ]
subject to the state transitions as constraints.
Difficulty I: We do not know the distribution and dynamics of the time series p_t ⇒ need a data-based approach.
Difficulty II: Searching over a space of functions is hard ⇒ need to narrow the search to a simple parametric family.

14 Policy Search: A Simple Example of Energy Storage
Optimizing a Storage Policy
Consider a simple policy in which we choose a sell price and a buy price.

15 Policy Search: A Simple Example of Energy Storage
Optimizing over Simple Policies
Now let us search over simple threshold policies. The modified problem is
max E[ Σ_{t=1}^T µ(p_t, c_t) p_t ]
subject to
µ(p, c) = −1 (store) if p < θ_store and c < C; +1 (withdraw) if p > θ_withdraw and c > 0; 0 otherwise,
as well as the state transition constraints. We search for the threshold values θ_store and θ_withdraw by backtesting, as sketched below.
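A minimal backtesting sketch for this threshold policy on a historical price path; the unit battery capacity and the grid of candidate thresholds are illustrative assumptions.

```python
import itertools

def backtest_threshold_policy(prices, theta_store, theta_withdraw, C=1):
    """Profit of the two-threshold policy on one historical price path.
    Store one unit when p < theta_store and there is room; withdraw one unit
    when p > theta_withdraw and the battery is not empty."""
    c, profit = 0, 0.0
    for p in prices:
        if p < theta_store and c < C:
            c += 1
            profit -= p        # buy at price p
        elif p > theta_withdraw and c > 0:
            c -= 1
            profit += p        # sell at price p
    return profit

def grid_search_thresholds(prices, candidate_thetas):
    """Evaluate every (theta_store, theta_withdraw) pair on the whole history,
    mirroring the profit surface shown on the next slide."""
    return max(
        ((ts, tw) for ts, tw in itertools.product(candidate_thetas, repeat=2) if ts <= tw),
        key=lambda pair: backtest_threshold_policy(prices, *pair),
    )
```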

16 Policy Search: A Simple Example of Energy Storage
Optimizing over Simple Policies
Average historical profit as a function of the two policy parameters. For a given pair (θ_store, θ_withdraw), the profit value is calculated by one simulation run over the entire price history. The optimal policy stands out!

17 Policy Search: A Simple Example of Energy Storage
Make the Problem Harder
Battery charge/discharge initial time: in practice, we cannot turn the battery or generator on and off immediately; the battery or generator needs to warm up for some time before charging/generating.
Using a forecast: suppose that we have a 1-hour price forecast and we want the policy to make use of it. The policy could be: charge if a weighted combination of the current price and the forecast price is smaller than a threshold, and withdraw if the weighted combination is greater than a threshold. The parameters are the two threshold prices and the weights of the combination (see the sketch below).
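A minimal sketch of such a forecast-weighted threshold policy; the single weight w (with 1 − w on the forecast) and the unit capacity are simplifying assumptions for illustration.

```python
def forecast_policy(p, p_forecast, c, w, theta_store, theta_withdraw, C=1):
    """Threshold policy on a weighted combination of the current price and a
    1-hour forecast; (w, theta_store, theta_withdraw) are the parameters to tune."""
    signal = w * p + (1.0 - w) * p_forecast
    if signal < theta_store and c < C:
        return -1   # charge
    if signal > theta_withdraw and c > 0:
        return 1    # withdraw
    return 0        # do nothing
```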

18 DP and LP (section outline slide)

19 DP and LP
Look at the Bellman Equation Again
Consider an MDP model with:
States i = 1, ..., n.
Probability transition matrix under policy µ: P_µ ∈ ℝ^{n×n}.
Cost of a transition: g_µ ∈ ℝ^n.
The Bellman equation is
J = min_µ { g_µ + α P_µ J }.
This is a nonlinear system of equations. Note: the right-hand side is the infimum of a number of linear mappings of J!

20 DP and LP
DP is a special case of LP
Theorem. Every finite-state DP problem is an LP problem.
We construct the following LP:
maximize J(1) + ... + J(n)
subject to J(i) ≤ Σ_{j=1}^n p_ij(u) g(i, u, j) + α Σ_{j=1}^n p_ij(u) J(j), for all i and all u ∈ A(i),
or more compactly:
maximize e'J subject to J ≤ g_µ + α P_µ J, for all µ.
The variables are J(i), i = 1, ..., n. For each state-action pair (i, u), there is one inequality constraint.
Dimension of the LP: n variables and n·|A| constraints (here we assume a finite-state, finite-action problem).

21 DP and LP
DP is a special case of LP
Theorem. The solution to the constructed LP
maximize e'J subject to J ≤ g_µ + α P_µ J, for all µ,
is exactly the solution of the Bellman equation J = min_µ { g_µ + α P_µ J }.
Proof. The Bellman solution J* is clearly feasible for the LP. Conversely, at an LP optimum the constraint at each state must be tight for the minimizing action (otherwise that component of J could be increased), so the LP solution satisfies the Bellman equation; since the Bellman equation has a unique solution, the LP solution equals J*.
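As a concrete illustration (a sketch, not part of the lecture), the exact LP above can be handed to an off-the-shelf LP solver for a tiny MDP; the two-state, two-action numbers below are made up.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_by_lp(P, g, alpha):
    """Exact LP solution of a small discounted-cost MDP (illustrative sketch).

    P[u] is the n x n transition matrix under action u, g[u] the length-n
    expected one-stage cost vector under u, alpha the discount factor.
    Solves:  maximize e'J  s.t.  J <= g_u + alpha * P[u] @ J  for every u,
    which linprog expresses as minimizing -e'J with A_ub @ J <= b_ub.
    """
    n = P[0].shape[0]
    A_ub, b_ub = [], []
    for Pu, gu in zip(P, g):
        # Constraint (I - alpha * Pu) @ J <= gu, componentwise.
        A_ub.append(np.eye(n) - alpha * Pu)
        b_ub.append(gu)
    res = linprog(c=-np.ones(n),
                  A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n, method="highs")
    return res.x  # the optimal cost vector J*

# Tiny 2-state, 2-action example (made-up numbers).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.7, 0.3]])]
g = [np.array([1.0, 2.0]), np.array([1.5, 0.5])]
J_star = solve_mdp_by_lp(P, g, alpha=0.9)
```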

22 DP and LP
ADP via Approximate Linear Programming
The constructed LP is of huge scale:
maximize e'J subject to J ≤ g_µ + α P_µ J, for all µ.
Approximate LP:
- We may approximate J by adding the constraint J = Φr, so the variable dimension becomes smaller.
- We may sample a subset of all constraints, so the constraint dimension becomes smaller.
The LP and the approximate LP can be solved by simulation/online.

23 Online Learning and Q-Learning (section outline slide)

24 Online Learning and Q-Learning
Multi-Armed Bandits: the Simplest Online Learning Model
Suppose you are faced with N slot machines. Each bandit pays out a random prize; some bandits are very generous, others not so much. Of course, you do not know what these expectations are. By choosing only one bandit per round, our task is to devise a strategy that maximizes our winnings.

25 Online Learning and Q-Learning
Multi-Armed Bandits: the Problem
Let there be K bandits, each giving a random prize with expectation E[X_b] ∈ [0, 1], b ∈ {1, ..., K}. If you pull bandit b, you get an independent sample of X_b.
We want to solve the one-shot optimization problem
max_{b ∈ {1,...,K}} E[X_b],
but we can experiment repeatedly. Our task can be phrased as: find the best bandit, as quickly as possible. We use trial and error to gain knowledge of the expected values.

26 Online Learning and Q-Learning
A Naive Strategy
At each round k, we are given the current estimated means X̂_b, b ∈ {1, ..., K}:
1. With probability 1 − ε_k, select the currently known-to-be-best bandit b_k = argmax_b X̂_b.
2. With probability ε_k, select a random bandit to pull (according to the uniform distribution).
3. Observe the result of pulling bandit b_k, and update X̂_{b_k} as the new sample average. Return to 1.
Exploration vs. Exploitation
The random selection with probability ε_k guarantees exploration; the selection of the current best with probability 1 − ε_k guarantees exploitation. With, say, ε_k = 1/k, the algorithm increasingly exploits the existing knowledge and is guaranteed to find the true optimal bandit.
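A minimal sketch of this ε-greedy strategy; the arm interface `pull(b)` is an assumed placeholder that returns one reward sample of arm b.

```python
import numpy as np

def epsilon_greedy_bandit(pull, K, num_rounds, seed=0):
    """Naive bandit strategy from the slide: with probability 1 - eps_k play the
    current best arm, with probability eps_k = 1/k explore uniformly.
    pull(b) is an assumed interface returning a reward sample of arm b."""
    rng = np.random.default_rng(seed)
    means = np.zeros(K)     # running sample means \hat X_b
    counts = np.zeros(K)
    for k in range(1, num_rounds + 1):
        eps_k = 1.0 / k
        if rng.random() < eps_k:
            b = int(rng.integers(K))        # explore
        else:
            b = int(np.argmax(means))       # exploit the current best
        x = pull(b)
        counts[b] += 1
        means[b] += (x - means[b]) / counts[b]   # incremental sample average
    return means
```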

27 Online Learning and Q-Learning
Bayesian Bandit Strategy
At each round, we are given the current posterior distributions of the bandits' mean returns:
1. Sample a random variable X_b from the posterior of bandit b, for all b = 1, ..., K.
2. Select the bandit with the largest sample, i.e. select bandit b = argmax_b X_b.
3. Observe the result of pulling bandit b, and update your posterior on bandit b. Return to 1.
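This strategy is commonly implemented as Thompson sampling. Below is a minimal sketch assuming Bernoulli rewards with Beta(1, 1) priors; `pull(b)` is again an assumed placeholder, here returning 0 or 1.

```python
import numpy as np

def thompson_sampling(pull, K, num_rounds, seed=0):
    """Bayesian bandit strategy of the slide (Thompson sampling), sketched for
    Bernoulli rewards with Beta(1, 1) priors; pull(b) is assumed to return 0 or 1."""
    rng = np.random.default_rng(seed)
    wins = np.ones(K)     # Beta posterior parameters: successes + 1
    losses = np.ones(K)   # failures + 1
    for _ in range(num_rounds):
        samples = rng.beta(wins, losses)      # one draw from each posterior
        b = int(np.argmax(samples))           # play the arm with the largest draw
        reward = pull(b)
        wins[b] += reward                     # conjugate Beta update
        losses[b] += 1 - reward
    return wins / (wins + losses)             # posterior means
```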

28 Online Learning and Q-Learning (figure slide)

29 Online Learning and Q-Learning
How do bandits relate to real-time decision making? Each bandit is a possible configuration of the policy parameters; a bandit strategy is online parameter tuning.
Evaluation of goodness:
Regret = T · max_{b=1,...,K} E[X_b] − Σ_{t=1}^T E[X_{b_t}]
The regret of a reasonable learning strategy is usually between log T and √T.

30 Online Learning and Q-Learning
Learning in Sequential Decision Making
Traditionally, sequential decision-making problems are modeled by dynamic programming. In one-shot optimization, finding the best arm by trial and error is known as online learning. Let's combine these two:
DP + Online Learning = Reinforcement Learning

31 Online Learning and Q-Learning
From DP to Reinforcement Learning
Ideally, DP solves the fixed-point equation: find V such that
V = min_µ { g_µ + α P_µ V }
Practically, we often wish to solve Bellman's equation without knowing P_µ, g_µ. What we do have is a simulator that, starting from state i and given action a, generates random samples of the transition cost and the next state: g(i, i_next, a), i_next.
Example: optimize a trading policy to maximize profit. The current transaction has an unknown market impact. Use the current order book as states/features.

32 Online Learning and Q-Learning
Rewrite the Bellman Equation with Q-Factors
Q-factors: we define Q(i, u) to be the value of the post-decision state (i, u):
Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α V(j) )
Bellman equation for Q-factors:
Q(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α min_{v ∈ U(j)} Q(j, v) )
The DP algorithm works for Q-functions as well.

33 Online Learning and Q-Learning
Q-Learning
We hope to solve the Bellman equation using VI:
Q*(i, u) = Σ_{j=1}^n p_ij(u) [ g(i, u, j) + α min_v Q*(j, v) ]
Q-learning is simulation-based VI for Q-factors:
Q_{k+1}(i, u) = (1 − γ) Q_k(i, u) + γ · (sample of the Bellman equation's right-hand side)

34 Online Learning and Q-Learning
Q-Learning Algorithm
Generate a state sequence {(i_k, u_k, j_k)}: sample (i_k, j_k) according to the system using control u_k.
Update, for each (i_k, u_k, j_k), with stepsize γ_k > 0:
Q_{k+1}(i_k, u_k) = (1 − γ_k) Q_k(i_k, u_k) + γ_k [ g(i_k, u_k, j_k) + α min_v Q_k(j_k, v) ]
A real-time policy can be recovered from the Q-values easily: µ_k(i) = argmin_u Q_k(i, u), i = 1, ..., n.
Comments:
- Almost sure convergence to the optimal Q-values and policy (a classical convergence result of stochastic approximation).
- Fully online and real-time, and, more importantly, model-free.
- The simulator needs to sample all (i, u) sufficiently: exploitation vs. exploration.
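A minimal tabular sketch of this update; the model-free simulator `simulate(i, u, rng)`, the ε-greedy behavior policy, and the 1/(visit count) stepsize are illustrative assumptions, not the lecture's specific choices.

```python
import numpy as np

def q_learning(simulate, n_states, n_actions, alpha_discount=0.95,
               num_steps=100_000, seed=0):
    """Tabular Q-learning for the update on this slide (illustrative sketch).

    simulate(i, u, rng): assumed model-free simulator returning the next state j
    and the transition cost g(i, u, j).  The behavior policy is epsilon-greedy
    so that all (i, u) pairs keep being sampled."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    i = int(rng.integers(n_states))
    for _ in range(num_steps):
        # Exploration vs. exploitation for the simulated control.
        u = int(rng.integers(n_actions)) if rng.random() < 0.1 else int(np.argmin(Q[i]))
        j, cost = simulate(i, u, rng)
        visits[i, u] += 1
        gamma_k = 1.0 / visits[i, u]                       # stepsize gamma_k
        target = cost + alpha_discount * np.min(Q[j])      # sample of the Bellman RHS
        Q[i, u] = (1 - gamma_k) * Q[i, u] + gamma_k * target
        i = j
    policy = np.argmin(Q, axis=1)   # mu(i) = argmin_u Q(i, u)
    return Q, policy
```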

35 Online Learning and Q-Learning
Q-Learning for the Optimal Stopping Problem
Stopping problem:
i: state (for example, stock price and time) that moves to state j with probability p_ij.
Actions: HOLD or STOP.
C(i): cost of stopping at state i.
g(i, HOLD, j) = 0, g(i, STOP, j) = C(i).
Bellman equation: for all i,
Q(i, HOLD) = α Σ_{j=1}^n p_ij min{ Q(j, STOP), Q(j, HOLD) },
and Q(i, STOP) = C(i).

36 Online Learning and Q-Learning
Q-Learning for the Optimal Stopping Problem: Algorithm
Generate a state path {i_k} according to the stochastic system and randomized actions.
Update each visited state i_k using a stepsize γ_k > 0:
Q_{k+1}(i_k) = (1 − γ_k) Q_k(i_k) + γ_k α min{ C(i_{k+1}), Q_k(i_{k+1}) }
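A minimal sketch of this stopping-problem update, exploiting the fact that only the HOLD factor needs to be learned since Q(i, STOP) = C(i) is known. The interfaces `next_state(i, rng)` and `stop_cost(i)` are assumed placeholders for the uncontrolled chain and C(i), and the occasional random restart is an illustrative way to keep visiting all states.

```python
import numpy as np

def q_learning_stopping(next_state, stop_cost, n_states, alpha=0.95,
                        num_steps=200_000, seed=0):
    """Q-learning for the optimal stopping problem of this slide (sketch).

    next_state(i, rng): assumed sampler of j ~ p_ij from the uncontrolled chain.
    stop_cost(i):       assumed evaluation of C(i)."""
    rng = np.random.default_rng(seed)
    Q_hold = np.zeros(n_states)
    visits = np.zeros(n_states)
    i = int(rng.integers(n_states))
    for _ in range(num_steps):
        j = next_state(i, rng)
        visits[i] += 1
        gamma_k = 1.0 / visits[i]
        target = alpha * min(stop_cost(j), Q_hold[j])
        Q_hold[i] = (1 - gamma_k) * Q_hold[i] + gamma_k * target
        # Continue along the path, with an occasional restart for exploration.
        i = j if rng.random() < 0.95 else int(rng.integers(n_states))
    return Q_hold
```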

37 Final Remarks (section outline slide)

38 Final Remarks
Theory of Approximate DP
Infinite-Horizon DP Problem: minimize over policies π = {µ_0, µ_1, ...} the cost function
J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} [ Σ_{k=0}^{N−1} α^k g(x_k, µ_k(x_k), w_k) ]
How to Approximate DP:
- Approximation: parameterize policies/cost vectors, aggregation, etc.
- Simulation: use simulation-generated trajectories {x_k} to calculate DP quantities, without knowing the system dynamics.

39 Final Remarks
Markovian Decision Process
Assume the system is an n-state (controlled) Markov chain. Change to Markov chain notation:
States i = 1, ..., n (instead of x).
Transition probabilities p_{i_k i_{k+1}}(u_k) [instead of x_{k+1} = f(x_k, u_k, w_k)].
Cost per stage g(i, u, j) [instead of g(x_k, u_k, w_k)].
Cost of a policy π = {µ_0, µ_1, ...}:
J_π(i) = lim_{N→∞} E_{w_k, k=0,1,...} [ Σ_{k=0}^{N−1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ]

40 Final Remarks
MDP Continued
The optimal cost vector satisfies the Bellman equation: for all i,
J*(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J*(j) ),
or in matrix form
J* = min_{µ: {1,...,n} → U} { g_µ + α P_µ J* }.
Shorthand notation for the DP mappings:
(TJ)(i) = min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J(j) ), i = 1, ..., n,
(T_µ J)(i) = Σ_{j=1}^n p_ij(µ(i)) ( g(i, µ(i), j) + α J(j) ), i = 1, ..., n.
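A minimal sketch of the operator T and plain value iteration (J ← TJ) for a small tabular MDP; the array-shape conventions below are illustrative choices, not fixed by the lecture.

```python
import numpy as np

def bellman_operator(J, P, g, alpha):
    """Apply (TJ)(i) = min_u sum_j p_ij(u) [ g(i,u,j) + alpha * J(j) ].
    P has shape (n_actions, n, n); g has shape (n_actions, n, n)."""
    # Expected one-stage cost of each (action, state), then minimize over actions.
    Q = np.einsum('uij,uij->ui', P, g) + alpha * np.einsum('uij,j->ui', P, J)
    return Q.min(axis=0)

def value_iteration(P, g, alpha, tol=1e-8, max_iters=10_000):
    """Iterate J <- TJ until (approximately) reaching the Bellman fixed point."""
    J = np.zeros(P.shape[1])
    for _ in range(max_iters):
        J_new = bellman_operator(J, P, g, alpha)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
    return J
```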

41 Final Remarks
Approximation Architecture
Approximation in Policy Space: parameterize the set of policies µ using a vector r, and then optimize over r.
Approximation in Value Space: approximate J* and J_µ from a family of functions parameterized by r, e.g., a linear approximation J ≈ Φr, J(i) ≈ φ(i)'r.

42 Final Remarks
Approximate DP Algorithms: A Roadmap
Approximate PI (*): implement the two steps of PI in an approximate sense.
Policy Evaluation: J_{µ_t} = T_{µ_t} J_{µ_t} by approximation/simulation:
- Direct Approach (*), e.g., simulation-based least squares.
- Indirect Approach: solve J_{µ_t} = T_{µ_t} J_{µ_t} by TD/LSTD/LSPE.
Policy Improvement: T_{µ_{t+1}} J_{µ_t} = T J_{µ_t}, using the approximate cost vector/Q-factors.
Approximate J* and Q*: solve J = TJ or Q = FQ directly by simulation, e.g., Q-learning, Bellman error minimization, the LP approach.

43 Final Remarks
Tetris
Height: 12, width: 7. Rotate and move the falling shape. Gravity related to the current height. Score when eliminating an entire level. Game over when reaching the ceiling.

44 Final Remarks
Tetris at MIT (figure slide)

45 Final Remarks
DP Model of Tetris
State: the current board, the current falling tile, and predictions of future tiles.
Termination state: when the tiles reach the ceiling, the game is over with no more future reward.
Action: rotation and shift.
System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly.
Uncertainty: randomness in future tiles.
Transition cost g: if a level is cleared by the current action, score 1; otherwise score 0.
Objective: expectation of the total score.

46 Final Remarks
Interesting facts about Tetris
First released in 1984 by Alexey Pajitnov from the Soviet Union.
Has been proved to be NP-complete.
The game will end with probability 1.
For a 12 × 7 board, the number of possible states is on the order of 10^25.
Highest score achieved by a human: about 1 million.
Highest score achieved by an algorithm: about 35 million (average performance).

47 Final Remarks
Solve Tetris by Approximate Dynamic Programming
Curse of dimensionality: the exact cost-to-go vector V has dimension equal to the state space size (about 10^25), and the transition matrix is 10^25 × 10^25, although sparse. There is no way to solve such a Bellman equation exactly.
Cost function approximation: instead of characterizing the entire board configuration, we use features φ_1, ..., φ_d to describe the current state and represent the cost-to-go vector V by a linear combination of a small number of features,
V ≈ r_1 φ_1 + ... + r_d φ_d.
We approximate the high-dimensional space of V using a low-dimensional feature space.

48 Final Remarks
Features of Tetris
The earliest Tetris learning algorithm uses 3 features; the latest best algorithm uses 22 features. The mapping from the state space to the feature space can be viewed as state aggregation: different states/boards with similar heights/holes/structures are treated as a single meta-state.

49 Final Remarks
Learning in Tetris
Ideally, we still want to solve the Bellman equation
V = max_µ { g_µ + P_µ V }
We consider an approximate solution Ṽ to the Bellman equation of the form Ṽ = Φr.
We apply policy iteration with approximate policy evaluation:
Outer loop: keep improving the policies µ_k using one-step lookahead.
Inner loop: find the cost-to-go V_{µ_k} associated with the current policy µ_k by taking samples and using the feature representation (a least-squares sketch of this inner loop follows).
We want to avoid high-dimensional algebraic operations, as well as storage of high-dimensional quantities.
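One simple way to carry out the inner loop is simulation-based least squares: regress sampled cost-to-go estimates on the features. A minimal sketch under that assumption:

```python
import numpy as np

def fit_value_weights(features, sampled_costs):
    """Simulation-based least-squares fit of V_mu ~ Phi r (inner-loop sketch).

    features[k]      : feature vector phi(i_k) of a sampled state i_k,
    sampled_costs[k] : Monte Carlo estimate of the cost-to-go from i_k under mu.
    Returns the weight vector r minimizing ||Phi r - c||^2."""
    Phi = np.asarray(features)
    c = np.asarray(sampled_costs)
    r, *_ = np.linalg.lstsq(Phi, c, rcond=None)
    return r
```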

50 Final Remarks
Algorithms tested on Tetris
More examples of learning methods: temporal-difference learning, TD(λ), LSTD, the cross-entropy method, actor-critic, active learning, etc. These algorithms are essentially all combinations of DP, sampling, and parametric models.

51 Final Remarks
Tetris World Record: Human vs. AI (figure slide)

52 Final Remarks
More on Games and Learning
Tetris domain of the 2008 Reinforcement Learning Competition.
3rd International Reinforcement Learning Competition (2009, adversarial Tetris).
2013 Reinforcement Learning Competition (unmanned helicopter control).
Since 2006, the Annual Computer Poker Competition.
2013, 2014, 2015 MIT Pokerbots.
Many more!

53 Final Remarks, Exercise 4: Q-Learning
Q-Learning Exercise 4
Use Q-learning to evaluate an American call option.
Construct a simulator that generates trajectories of {(i_k, j_k)}.
Write a Q-learning algorithm that interacts with the simulator function. Upon obtaining each (i_k, j_k, u_k), choose an appropriate stepsize γ_k (for example, γ_k = 1/k) and update the Q-factors.
Plot the results. (One possible simulator sketch follows.)
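One possible simulator setup for this exercise, offered only as an assumed illustration (not the official solution): the stock price lives on a multiplicative grid and moves up by a factor `up` or down by `1/up` at each step, and exercising at price S pays max(S − K, 0).

```python
import numpy as np

def make_option_simulator(S0=1.0, K=1.0, up=1.02, p_up=0.5, n_levels=101):
    """Assumed binomial-walk price simulator for the American call exercise."""
    prices = S0 * up ** (np.arange(n_levels) - n_levels // 2)

    def next_state(i, rng):
        """Sample j ~ p_ij: step up with probability p_up, else down (clipped at the grid edges)."""
        step = 1 if rng.random() < p_up else -1
        return int(np.clip(i + step, 0, n_levels - 1))

    def exercise_payoff(i):
        return max(prices[i] - K, 0.0)

    return prices, next_state, exercise_payoff
```

Feeding `next_state` together with the stopping cost C(i) = −exercise_payoff(i) into the stopping-problem Q-learning sketch earlier in this lecture, and negating the learned Q-factors, yields estimated option values that can be plotted against the price grid.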

54 Final Remarks, Exercise 4: Q-Learning (figure: convergence of option prices; axes: stock price vs. option price)

55 Final Remarks, Exercise 4: Q-Learning (figure: convergence of cost vectors in Q-learning; option price at S = K vs. number of samples)

56 Final Remarks, Exercise 4: Q-Learning (figure: convergence of exercise policies; blue: exercise, red: hold; price vs. number of policy iterations)
