Misc. Topics and Reinforcement Learning


1 1/79 Misc. Topics and Reinforcement Learning Acknowledgement: parts of these slides are based on Prof. Mengdi Wang's, Prof. Dimitri Bertsekas's and Prof. David Silver's lecture notes

2 2/79 Homework 1) Read the following chapters of Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction: Chapter 2: Multi-armed Bandits; Chapter 3: Finite Markov Decision Processes; Chapter 4: Dynamic Programming; Chapter 5: Monte Carlo Methods; Chapter 6: Temporal-Difference Learning; Chapter 9: On-policy Prediction with Approximation; Chapter 10: On-policy Control with Approximation; Chapter 13: Policy Gradient Methods. 2) Work through at least three Examples in each chapter; if code is provided, test or implement it.

3 Outline 3/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

4 Policy Approximation 4/79
Sometimes it is easier to parameterize the policy µ as µ(i) ≈ µ(i; σ). Direct policy search:
$\min_\sigma \; \mathbb{E}\Big[\sum_{k=1}^{\infty} \alpha^k g(i_k, \mu(i_k;\sigma), j_k)\Big]$
This is a stochastic optimization problem. Very likely to be nonconvex. σ is relatively low-dimensional. The high dimensionality of the DP is hidden in the complicated expectation.

5 Direct policy search via Sample Average Approximation 5/79
One approach is to simulate many long trajectories and formulate a sample average approximation
$\min_\sigma \; \sum_{l=1}^{L}\sum_{k=1}^{\infty} \alpha^k g(i_{kl}, \mu(i_{kl};\sigma), j_{kl})$
which is based on L trajectories $\{i_{0l}, i_{1l}, \ldots\}$, where l = 1, ..., L. As L → ∞, the sample average approximation converges to the original problem, and its solution converges to the optimal σ*.

6 Direct policy search via Stochastic Gradient Descent 6/79
Another approach is to apply stochastic gradient descent. At each iteration l, generate a trajectory $\{i_{0l}, i_{1l}, \ldots\}$ and compute the gradient
$G_l = \sum_{k=1}^{\infty} \alpha^k \nabla_\sigma g(i_{kl}, \mu(i_{kl};\sigma), j_{kl})$,
and update the policy parameter by $\sigma_{l+1} = \sigma_l - \gamma G_l$.
Each $G_l$ is an unbiased sample of the gradient of the overall objective at the current parameter $\sigma_l$. If the gradient is not computable, it can be replaced with finite differences.
If the problem is convex, $\sigma_l \to \sigma^*$ a.s. If nonconvex (the most likely case), $\sigma_l \to$ a local optimum a.s. (good enough in practice most of the time).
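As a concrete illustration of this slide, here is a minimal Python sketch of direct policy search with SGD, where the gradient of each trajectory is replaced by finite differences. The toy dynamics, the two-dimensional parameter σ, and the names `simulate_return` and `finite_diff_gradient` are all invented for illustration and are not part of the lecture.

```python
import numpy as np

def simulate_return(sigma, horizon=200, alpha=0.99, rng=None):
    """Placeholder: roll out one trajectory under policy mu(.; sigma) and
    return the discounted cost. The environment here is purely illustrative."""
    rng = rng or np.random.default_rng()
    cost, state = 0.0, rng.normal()
    for k in range(horizon):
        action = np.tanh(sigma @ np.array([state, 1.0]))   # simple parametric policy
        state = 0.9 * state + action + 0.1 * rng.normal()  # toy dynamics
        cost += alpha**k * state**2                         # toy per-stage cost
    return cost

def finite_diff_gradient(sigma, eps=1e-2, rng=None):
    """Estimate the gradient of the expected cost by forward differences,
    one coordinate at a time, using one trajectory per evaluation."""
    base = simulate_return(sigma, rng=rng)
    grad = np.zeros_like(sigma)
    for k in range(sigma.size):
        e_k = np.zeros_like(sigma)
        e_k[k] = eps
        grad[k] = (simulate_return(sigma + e_k, rng=rng) - base) / eps
    return grad

sigma = np.zeros(2)
for l in range(500):                  # SGD loop: sigma_{l+1} = sigma_l - gamma * G_l
    G_l = finite_diff_gradient(sigma)
    sigma -= 1e-3 * G_l
```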

7 A Very Simple Example: Electricity Price 7/79 Electricity prices are very volatile. It is very difficult to store electricity. Spiky prices under heavy load (hot summer, freezing winter, ...). Average/median price fairly cheap. Related to weather, season, time of day, etc.

8 8/79 Battery Storage Suppose that you operate a battery and participate in the regional electricity market Challenge: find a policy for charging and discharging the battery Strategy posed by the battery manufacturer: Buy low, sell high

9 9/79 There are many approaches based on DP. Solving a DP requires full knowledge of the model: how do prices change? does the storage have a market impact on future prices? if there is a forecast, how good is it? what are the uncertainties and what are their distributions? To gain this knowledge, we could fit a time series model for the price dynamics, then formulate a DP problem and find an optimal policy for it.

10 DP Model 10/79
State: price $p_t$, storage level $c_t$
Action: sell $u_t = 1$, buy $u_t = -1$, do nothing $u_t = 0$
Action constraint: if $c_t = 0$, cannot sell; if $c_t = C$, cannot buy
State transition of storage inventory: $c_{t+1} = c_t - u_t$
State transition of electricity price (learned from data): $p_{t+1} = p_t + f(p_t) + \epsilon_t$

11 DP Model 11/79
The overall objective:
$\max_{\{u_t\}} \; \mathbb{E}\Big[\sum_{t=1}^{T} u_t\, p_t\Big]$
DP algorithm:
$V(p_t, c_t) = \max_{u_t}\ \big\{ u_t\, p_t + \mathbb{E}[V(p_{t+1}, c_t - u_t)] \big\}$

12 12/79 The ultimate problem is to find a function/policy/strategy µ : {state} → {action} such that
$\max_{\mu} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mu(p_t, c_t)\, p_t\Big]$
subject to the state transitions as constraints.
Difficulty I: We do not know the distribution and dynamics of the time series $p_t$ ⟹ Need a data-based approach.
Difficulty II: Searching over a space of functions is hard ⟹ Need to narrow the search to a simple parametric family.

13 Optimizing A Storage Policy 13/79 Consider a simple policy in which we choose a sell price and a buy price

14 Optimizing over Simple Policies 14/79
Now let us search over simple threshold policies. The modified problem is
$\max \; \mathbb{E}\Big[\sum_{t=1}^{T} \mu(p_t, c_t)\, p_t\Big]$
subject to
$\mu(p, c) = \begin{cases} -1 & \text{if } p < \rho_{\text{store}} \text{ and } c < C \\ \;\;\,1 & \text{if } p > \rho_{\text{withdraw}} \text{ and } c > 0 \\ \;\;\,0 & \text{otherwise} \end{cases}$
as well as the state transition constraints. We search for the threshold values $\rho_{\text{store}}$ and $\rho_{\text{withdraw}}$ by backtesting.
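To make the backtesting idea concrete, here is a minimal sketch that evaluates the threshold policy over a price history for each pair of thresholds on a grid and keeps the best pair. The `prices` array, grid resolution, and the `backtest` helper are hypothetical stand-ins; in practice `prices` would be the historical price series.

```python
import numpy as np

def backtest(prices, rho_store, rho_withdraw, C=1):
    """Run the threshold policy once over the price history; return total profit.
    Buy one unit when p < rho_store (and not full), sell one unit when
    p > rho_withdraw (and not empty)."""
    c, profit = 0, 0.0
    for p in prices:
        if p < rho_store and c < C:
            c += 1
            profit -= p          # buy (store)
        elif p > rho_withdraw and c > 0:
            c -= 1
            profit += p          # sell (withdraw)
    return profit

prices = 30 + 10 * np.random.rand(10_000)    # stand-in for historical prices
grid = np.linspace(prices.min(), prices.max(), 25)
best = max((backtest(prices, lo, hi), lo, hi)
           for lo in grid for hi in grid if lo < hi)
print("best profit %.1f at rho_store=%.1f, rho_withdraw=%.1f" % best)
```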

15 15/79 Optimizing over Simple Policies Average historical profit as a function of the two policy parameters. For a given pair $(\rho_{\text{store}}, \rho_{\text{withdraw}})$, the profit value is calculated by one simulation run over the entire price history. The optimal policy stands out!

16 Make the Problem Harder 16/79
Battery charge/discharge start-up time: In practice, we cannot turn the battery or generator on and off immediately; it needs to warm up for some time before charging/generating.
Using a forecast: Suppose that we have a 1-hour price forecast and we want the policy to make use of it. The policy could be: charge if a weighted combination of the current price and the forecast price is below a threshold, and withdraw if the weighted combination is above a threshold. The parameters are: two threshold prices and the weights of the combination.

17 Outline 17/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

18 18/79 Look at the Bellman Equation Again
Consider an MDP model with: states i = 1, ..., n; probability transition matrix under policy µ given by $P_\mu \in \mathbb{R}^{n \times n}$; per-stage cost of transition $g_\mu \in \mathbb{R}^n$.
The Bellman equation is
$J^* = \min_{\mu}\ \{ g_\mu + \alpha P_\mu J^* \}$
This is a nonlinear system of equations. Note: the right-hand side is the infimum of a number of linear mappings of $J^*$!

19 DP is a special case of LP 19/79
Theorem: Every finite-state DP problem is an LP problem.
Let c > 0. We construct the following LP:
max $\; c_1 J(1) + \ldots + c_n J(n)$
s.t. $\; J(i) \le \sum_{j=1}^{n} p_{ij}(u)\, g(i,u,j) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, J(j)$, for all i and all u ∈ A
or more compactly
max $\; c^\top J$  s.t.  $J \le g_\mu + \alpha P_\mu J$ for all µ
The variables are J(i), i = 1, ..., n. For each state-action pair (i, u), there is one inequality constraint.
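The following is a minimal sketch of this LP on a made-up 2-state, 2-action discounted MDP, solved with `scipy.optimize.linprog`. All transition probabilities and costs are invented for illustration; the point is only how the constraints $J(i) \le g(i,u) + \alpha \sum_j p_{ij}(u) J(j)$ are assembled.

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
# Toy 2-state, 2-action MDP (numbers are made up for illustration).
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # transition matrix of action 0
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}   # transition matrix of action 1
g = {0: np.array([1.0, 2.0]),                 # expected per-stage cost of action 0
     1: np.array([3.0, 0.5])}                 # expected per-stage cost of action 1

n = 2
c = np.ones(n)                 # any c > 0 works
A_ub, b_ub = [], []
for a in P:                    # constraint: J(i) - alpha * sum_j p_ij(a) J(j) <= g(i,a)
    for i in range(n):
        row = -alpha * P[a][i].copy()
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(g[a][i])

# linprog minimizes, so maximize c^T J by minimizing -c^T J; J is a free variable.
res = linprog(-c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
print("J* =", res.x)           # should match value iteration on the same toy MDP
```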

20 20/79 DP is a special case of LP
If $J \le TJ$, then $J \le J^*$. If $J \ge TJ$, then $J \ge J^*$.
Proof of the first claim: Suppose that $J \le TJ$. Applying the operator T on both sides k − 1 times, and using the monotonicity of T, we have $J \le TJ \le T^2 J \le \ldots \le T^k J$. Note that $\lim_{k\to\infty} T^k J = J^*$. Hence $J \le J^*$.

21 21/79 DP is a special case of LP
Theorem: The solution to the constructed LP
max $\; c^\top J$  s.t.  $J \le g_\mu + \alpha P_\mu J$ for all µ
is exactly the solution of the Bellman equation $J^* = \min_{\mu}\{ g_\mu + \alpha P_\mu J^* \}$.
Proof: The solution $J^*$ of the Bellman equation is clearly feasible for the LP. Conversely, any feasible J satisfies $J \le TJ$, hence $J \le J^*$ by the previous slide. Since c > 0, the LP objective is maximized only at $J = J^*$, so the LP solution coincides with the unique solution of the Bellman equation.

22 22/79 ADP via Approximate Linear Programming
The constructed LP is of huge scale:
max $\; c^\top J$  s.t.  $J \le g_\mu + \alpha P_\mu J$ for all µ
Approximate LP: We may approximate J by adding the constraint J = Φσ, so the variable dimension becomes smaller. We may sample a subset of all constraints, so the number of constraints becomes smaller. The LP and the approximate LP can be solved by simulation/online.

23 Outline 23/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

24 Multi-Arm Bandits - Simplest Online Learning Model 24/79 Suppose you are faced with N slot machines. Each bandit distributes a random prize. Some bandits are very generous, others not so much. Of course, you don't know what these expected payoffs are. By choosing only one bandit per round, our task is to devise a strategy to maximize our winnings.

25 Multi-Arm Bandits 25/79
The Problem: Let there be K bandits, each giving a random prize with expectation $\mathbb{E}[X_b] \in [0, 1]$, b ∈ {1, ..., K}. If you pull bandit b, you get an independent sample of $X_b$.
We want to solve the one-shot optimization problem
$\max_{b \in \{1,\ldots,K\}} \mathbb{E}[X_b]$,
but we can experiment repeatedly. Our task can be phrased as: find the best bandit, as quickly as possible. We use trial and error to gain knowledge of the expected values.

26 A Naive Strategy 26/79
For each round k, we are given the current estimated means $\hat{X}_b$, b ∈ {1, ..., K}.
1. With probability $1 - \epsilon_k$, select the bandit that currently looks best, $b_k = \arg\max_b \hat{X}_b$.
2. With probability $\epsilon_k$, select a bandit uniformly at random to pull.
3. Observe the result of pulling bandit $b_k$, update $\hat{X}_{b_k}$ as the new sample average, and return to 1.
Exploration vs. Exploitation: The random selection with probability $\epsilon_k$ guarantees exploration. The selection of the current best with probability $1 - \epsilon_k$ guarantees exploitation. With, say, $\epsilon_k = 1/k$, the algorithm increasingly exploits the existing knowledge and is guaranteed to find the truly optimal bandit.
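A minimal sketch of this ε-greedy strategy on Bernoulli arms follows; the number of arms, horizon, and true means are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.uniform(0, 1, size=5)        # unknown to the learner
K, T = len(true_means), 5000
counts, means = np.zeros(K), np.zeros(K)      # pull counts and sample averages

for k in range(1, T + 1):
    eps = 1.0 / k                             # decaying exploration rate
    if rng.random() < eps:
        b = rng.integers(K)                   # explore: uniformly random arm
    else:
        b = int(np.argmax(means))             # exploit: current best estimate
    reward = float(rng.random() < true_means[b])     # Bernoulli prize
    counts[b] += 1
    means[b] += (reward - means[b]) / counts[b]      # incremental sample average

print("estimated best arm:", np.argmax(means),
      "true best arm:", np.argmax(true_means))
```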

27 27/79 Bayesian Bandit Strategy
For each round, we are given the current posterior distributions of all bandits' mean returns.
1. Sample a random variable $X_b$ from the posterior of bandit b, for all b = 1, ..., K.
2. Select the bandit with the largest sample, i.e. select bandit $b^* = \arg\max_b X_b$.
3. Observe the result of pulling bandit $b^*$, update your posterior on bandit $b^*$, and return to 1.
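One common instantiation of this Bayesian (Thompson sampling) strategy uses Beta priors on Bernoulli arms, since the posterior update is then a simple count update. The sketch below assumes that setting; the arm means and horizon are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = rng.uniform(0, 1, size=5)        # unknown Bernoulli success rates
K, T = len(true_means), 5000
alpha, beta = np.ones(K), np.ones(K)          # Beta(1,1) priors on each arm's mean

for _ in range(T):
    samples = rng.beta(alpha, beta)           # one posterior sample per arm
    b = int(np.argmax(samples))               # play the arm with the largest sample
    reward = float(rng.random() < true_means[b])
    alpha[b] += reward                        # conjugate Beta-Bernoulli update
    beta[b] += 1.0 - reward

print("posterior means:", alpha / (alpha + beta))
```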

28 28/79

29 How are bandits related to real-time decision making? 29/79
Each bandit is a possible configuration of the policy parameters. A bandit strategy is online parameter tuning.
Evaluation of goodness:
$\text{Regret} = T \max_{b=1,\ldots,K} \mathbb{E}[X_b] - \sum_{t=1}^{T} \mathbb{E}[X_{b_t}]$
The regret of a reasonable learning strategy is usually between $\log T$ and $\sqrt{T}$.

30 Learning in Sequential Decision Making 30/79 Traditionally, sequential decision-making problems are modeled by dynamic programming. In one-shot optimization, finding the best arm by trial and error is known as online learning. Let's combine these two: DP + Online Learning = Reinforcement Learning

31 Outline 31/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

32 Q-Learning 32/79
Off-policy and model-free:
$Q(s, a) \leftarrow Q(s, a) + \alpha\big(r(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)$
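A minimal tabular Q-learning sketch on a randomly generated toy MDP is given below; the state/action counts, transition probabilities, rewards, and the ε-greedy behavior policy are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 10, 3, 0.95, 0.1
# Toy random MDP (illustration only): transition probabilities and rewards.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(nS, nA))            # reward r(s, a)

Q = np.zeros((nS, nA))
s = 0
for t in range(100_000):
    # epsilon-greedy behavior policy (off-policy: the target is the greedy policy)
    a = rng.integers(nA) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next = rng.choice(nS, p=P[s, a])
    # Q-learning update: bootstrap with the greedy value at the next state.
    td_target = R[s, a] + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    s = s_next

print("greedy policy:", np.argmax(Q, axis=1))
```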

33 DQN 33/79
$\min_\theta \; \mathbb{E}_{(s,a,s') \sim U(D)}\Big[\big(r(s, a) + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]$
Works well in discrete tasks, like Atari games.

34 DQN 34/79

35 Policy Gradient 35/79
So far, we have been discussing value-based RL: learn a value function, improve the policy (e.g. ε-greedy).
Now, consider parametrizing the policy. Policy-based RL: no value function; learn the policy $\pi_\theta(a \mid s) = P(a \mid s, \theta)$ directly.

36 Policy-based RL 36/79
Advantages: better convergence properties; effective in high-dimensional or continuous action spaces; can learn stochastic policies.
Disadvantages: typically converges to a local rather than global optimum; evaluating a policy is typically inefficient and has high variance.

37 Measure the quality of a policy $\pi_\theta$ 37/79
In episodic environments, with $s_0$ the start state of an episode:
$J_1(\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0\Big] = \sum_s p_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r_s^a$
where $p_{\pi_\theta}(s)$ is the unnormalized (discounted) visitation distribution under $\pi_\theta$:
$p_{\pi_\theta}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi_\theta)$
In continuing environments, use the average reward formulation:
$J_{avr}(\theta) = \sum_s p_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r_s^a$
where $p_{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$:
$p_{\pi_\theta}(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, \pi_\theta)$

38 Policy Gradient 38/79
Goal: find θ to maximize J(θ) by ascending the gradient of J with respect to θ:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\nabla_\theta J(\theta) = \Big(\frac{\partial J(\theta)}{\partial \theta_1}, \ldots, \frac{\partial J(\theta)}{\partial \theta_N}\Big)^\top$
How to compute the gradient? Perturb θ by a small amount ε in the k-th dimension, k ∈ [1, N]:
$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon e_k) - J(\theta)}{\epsilon}$
Simple, but noisy and inefficient in most cases.

39 One-step MDPs 39/79
Consider one-step MDPs: start with s ∼ p(s), and terminate after one step with reward $r_s^a$.
$J(\theta) = \sum_s p(s) \sum_a \pi_\theta(a \mid s)\, r_s^a$
$\nabla_\theta J(\theta) = \sum_s p(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, r_s^a = \sum_s p(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, r_s^a = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, r_s^a]$

40 Policy Gradient Theorem 40/79
Consider multi-step MDPs; the likelihood-ratio trick yields a similar conclusion.
Theorem: For any differentiable policy $\pi_\theta(a \mid s)$, and for either policy objective $J = J_1$ or $J = J_{avr}$, the policy gradient is
$\nabla_\theta J(\theta) = \sum_s p_{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = \sum_s p_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)]$
Reference: Policy Gradient Methods for Reinforcement Learning with Function Approximation, Richard S. Sutton et al.

41 Policy gradient methods 41/79
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)]$
Update the parameter θ along the direction of $\nabla_\theta J(\theta)$. Practically, apply SGD:
$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$
This requires the action-value function, and the step size α is important.
Policy gradients have rather strong convergence guarantees, even when used in conjunction with approximate value functions, and recent results created a theoretically solid framework for policy gradient estimation from sampled data.
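A minimal REINFORCE-style sketch of this update follows, where $Q^{\pi_\theta}(s,a)$ is replaced by the Monte Carlo return of the episode and the policy is a tabular softmax. The toy environment, episode length, and learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, lr = 6, 2, 0.99, 0.05
theta = np.zeros((nS, nA))                      # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics/reward, purely for illustration."""
    s_next = (s + 1) % nS if a == 1 else rng.integers(nS)
    return s_next, 1.0 if a == 1 else 0.0

for episode in range(2000):
    s, traj = 0, []
    for t in range(20):                         # roll out one episode under pi_theta
        a = rng.choice(nA, p=softmax(theta[s]))
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
    G = 0.0
    for s, a, r in reversed(traj):              # Monte Carlo return as the Q estimate
        G = r + gamma * G
        grad_logp = -softmax(theta[s])          # d log pi(a|s) / d theta[s, :]
        grad_logp[a] += 1.0
        theta[s] += lr * grad_logp * G          # SGD step along grad log pi * return
```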

42 42/79 Actor-Critic
Combine value function approximation and policy gradient to reduce the variance. Use a critic to estimate the action-value function: $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$.
Actor-Critic algorithms maintain two sets of parameters:
Critic: update the action-value function parameters w.
Actor: update the policy parameters θ in the direction suggested by the critic:
$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)]$, i.e. $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)$
Can we avoid any bias by choosing the action-value function approximation carefully, so that
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)]$ holds exactly?

43 A3C: 43/79 reference: Asynchronous Methods for Deep Reinforcement Learning

44 Compatible Function Approximation Theorem 44/79
Theorem: If the action-value function approximator $Q_w(s, a)$ satisfies the following two conditions:
$\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)$,
$w = \arg\min_w \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s, a) - Q_w(s, a))^2]$,
then the policy gradient is exact, i.e.
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)]$
The function approximator is compatible with the policy in the sense that if we use the approximation $Q_w(s, a)$ in lieu of the true values to compute the gradient, the result is still exact.

45 Baseline 45/79
To reduce the variance, subtract a baseline function B(s):
$\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, B(s)] = \sum_s p_{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, B(s) = \sum_s p_{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$
Thus,
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, (Q^{\pi_\theta}(s, a) - B(s))]$
A good baseline is $B(s) = V^{\pi_\theta}(s)$, which gives the advantage function $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$.

46 46/79 Policy gradient
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)] = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)]$
Apply SGD: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)$
The advantage function can significantly reduce the variance of the policy gradient. Estimate both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s, a)$ to obtain $A^{\pi_\theta}(s, a)$, e.g. by TD learning.

47 47/79 Estimation of the advantage function
Apply TD learning to estimate the value function. The TD error
$\delta^{\pi_\theta} = r(s, a, s') + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$
is an unbiased estimate of the advantage function:
$\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta} \mid s, a] = \mathbb{E}_{\pi_\theta}[r(s, a, s') + \gamma V^{\pi_\theta}(s') \mid s, a] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$
Thus the update is $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, \delta^{\pi_\theta}$.
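A minimal one-step actor-critic sketch that uses the TD error as the advantage estimate is shown below; the tabular critic, toy environment, and learning rates are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 6, 2, 0.99
lr_actor, lr_critic = 0.05, 0.1
theta = np.zeros((nS, nA))          # actor: softmax policy parameters
V = np.zeros(nS)                    # critic: tabular state-value estimate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    """Toy dynamics/reward, for illustration only."""
    s_next = (s + 1) % nS if a == 1 else rng.integers(nS)
    return s_next, 1.0 if a == 1 else 0.0

s = 0
for t in range(50_000):
    probs = softmax(theta[s])
    a = rng.choice(nA, p=probs)
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]      # TD error = advantage estimate
    V[s] += lr_critic * delta                 # critic update
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta[s] += lr_actor * delta * grad_logp  # actor update along grad log pi * delta
    s = s_next
```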

48 Outline 48/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

49 Trust Region Policy Optimization (John Schulman et al., 2015) 49/79
Assume the start-state distribution $\rho_0$ is independent of the policy. The total expected discounted reward of policy π is
$\eta(\pi) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big]$
Between any two policies π and $\tilde{\pi}$:
$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big] = \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a) = \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$

50 Trust Region Policy Optimization 50/79
Find a new policy $\tilde{\pi}$ that maximizes $\eta(\tilde{\pi}) - \eta(\pi)$ for the given π, that is,
$\max_{\tilde{\pi}} \; \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$
For simplicity, maximize the local approximation instead:
$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$
Parameterize the policy, $\tilde{\pi}(a \mid s) := \pi_\theta(a \mid s)$:
$L_{\pi_{\theta_{old}}}(\pi_\theta) = \eta(\pi_{\theta_{old}}) + \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\pi_{\theta_{old}}}(s, a)$

51 Why $L_{\pi_{\theta_{old}}}(\pi_\theta)$? 51/79
A sufficiently small step $\theta_{old} \to \theta$ that improves $L_{\pi_{\theta_{old}}}(\pi_\theta)$ also improves η, since
$L_{\pi_{\theta_{old}}}(\pi_{\theta_{old}}) = \eta(\pi_{\theta_{old}})$, and $\nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_{old}}$.
Lower bound on the improvement of η:
$\eta(\pi_{\theta_{new}}) \ge L_{\pi_{\theta_{old}}}(\pi_{\theta_{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$
where
$\epsilon = \max_s \big|\mathbb{E}_{a \sim \pi_{\theta_{new}}} A_{\pi_{\theta_{old}}}(s, a)\big|$,
$\alpha = D^{\max}_{TV}(\pi_{\theta_{old}}, \pi_{\theta_{new}}) = \max_s D_{TV}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_{\theta_{new}}(\cdot \mid s))$

52 Lower bound 52/79
TV divergence between two distributions p, q (discrete case): $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_{x \in X} |p(x) - q(x)|$
KL divergence between two distributions p, q (discrete case): $D_{KL}(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$
Pinsker-type inequality: $(D_{TV}(p \,\|\, q))^2 \le D_{KL}(p \,\|\, q)$ (Pollard (2000), Ch. 3)
Thus we obtain the lower bound
$\eta(\pi_{\theta_{new}}) \ge L_{\pi_{\theta_{old}}}(\pi_{\theta_{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha$
where $\alpha = D^{\max}_{KL}(\pi_{\theta_{old}}, \pi_{\theta_{new}}) := \max_s D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_{\theta_{new}}(\cdot \mid s))$

53 Practical algorithm 53/79
The penalty coefficient $\frac{2\epsilon\gamma}{(1-\gamma)^2}$ is large in practice, which yields overly small updates. Instead, place a constraint on the KL divergence, i.e. a trust region constraint:
$\max_\theta \; L_{\pi_{\theta_{old}}}(\pi_\theta) \quad \text{s.t.} \quad D^{\max}_{KL}(\pi_{\theta_{old}}, \pi_\theta) \le \delta$
A heuristic approximation:
$\max_\theta \; L_{\pi_{\theta_{old}}}(\pi_\theta) \quad \text{s.t.} \quad D^{\rho_{\pi_{\theta_{old}}}}_{KL}(\pi_{\theta_{old}}, \pi_\theta) \le \delta$
where $D^{\rho_{\pi_{\theta_{old}}}}_{KL}(\pi_{\theta_{old}}, \pi_\theta) = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))\big]$

54 Connection with prior work 54/79
Approximate the objective function linearly and the constraint quadratically:
$\max_\theta \; \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}}^\top (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2}(\theta - \theta_{old})^\top A\, (\theta - \theta_{old}) \le \delta$
where $A_{ij} = \frac{\partial}{\partial \theta_i}\frac{\partial}{\partial \theta_j} \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))\big]\Big|_{\theta=\theta_{old}}$
The update is the natural policy gradient:
$\theta_{new} = \theta_{old} + \lambda A^{-1} \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}} = \theta_{old} + \lambda A^{-1} \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_{old}}$
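A minimal sketch of this natural-gradient step is given below, assuming the policy gradient `grad` and the KL Hessian (Fisher matrix) `A` have already been estimated from samples; the numbers are made up, and TRPO itself computes $A^{-1}g$ with conjugate gradient rather than forming A explicitly.

```python
import numpy as np

def natural_gradient_step(grad, A, delta=0.01):
    """Return the step lambda * A^{-1} grad that moves to the boundary of the
    quadratic trust region (1/2) d^T A d <= delta."""
    direction = np.linalg.solve(A, grad)                 # A^{-1} grad
    step_size = np.sqrt(2.0 * delta / (grad @ direction))
    return step_size * direction

# Toy usage with invented numbers.
grad = np.array([0.3, -0.1, 0.7])
A = np.array([[2.0, 0.1, 0.0],
              [0.1, 1.5, 0.2],
              [0.0, 0.2, 1.0]])
theta_old = np.zeros(3)
theta_new = theta_old + natural_gradient_step(grad, A)
print(theta_new)
```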

55 Connection with prior work 55/79
Approximate the objective function linearly and the constraint quadratically, with A := I:
$\max_\theta \; \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}}^\top (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2}(\theta - \theta_{old})^\top (\theta - \theta_{old}) \le \delta$
The update is the standard policy gradient:
$\theta_{new} = \theta_{old} + \lambda \nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta=\theta_{old}} = \theta_{old} + \lambda \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_{old}}$

56 Outline 56/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

57 57/79 Tetris Height: 12 Width: 7 Rotate and move the falling shape Gravity related to current height Score when eliminating an entire level Game over when reaching the ceiling

58 DP Model of Tetris 58/79 State: the current board, the current falling tile, predictions of future tiles. Termination state: when the tiles reach the ceiling, the game is over with no further reward. Action: rotation and shift. System dynamics: the next board is deterministically determined by the current board and the player's placement of the current tile; the future tiles are generated randomly. Uncertainty: randomness in the future tiles. Transitional cost g: if a level is cleared by the current action, score 1; otherwise score 0. Objective: expectation of the total score.

59 Interesting facts about Tetris 59/79 First released in 1984 by Alexey Pajitnov from the Soviet Union. Has been proved to be NP-complete. The game ends with probability 1. For a 12 × 7 board, the number of possible states is about $10^{25}$. Highest score achieved by a human: about 1 million. Highest score achieved by an algorithm: about 35 million (average performance).

60 60/79 Solve Tetris by Approximate Dynamic Programming
Curse of dimensionality: the exact cost-to-go vector $V^*$ has dimension equal to the state space size (about $10^{25}$). The transition matrix is $10^{25} \times 10^{25}$, although sparse. There is no way to solve such a Bellman equation exactly.
Cost function approximation: instead of characterizing the entire board configuration, we use features $\phi_1, \ldots, \phi_d$ to describe the current state, and represent the cost-to-go vector using a linear combination of a small number of features:
$V \approx \beta_1 \phi_1 + \ldots + \beta_d \phi_d$
We approximate the high-dimensional space of V using a low-dimensional feature space of β.

61 Features of Tetris 61/79 The earliest Tetris learning algorithm uses 3 features. The latest best algorithm uses 22 features. The mapping from state space to feature space can be viewed as state aggregation: different states/boards with similar heights/holes/structures are treated as a single meta-state.

62 Learning in Tetris 62/79
Ideally, we still want to solve the Bellman equation $V^* = \max_\mu \{ g_\mu + P_\mu V^* \}$. We consider an approximate solution $\tilde{V}$ to the Bellman equation of the form $\tilde{V} = \Phi\beta$.
We apply policy iteration with approximate policy evaluation:
Outer loop: keep improving the policies $\mu_k$ using one-step lookahead.
Inner loop: find the cost-to-go $V^{\mu_k}$ associated with the current policy $\mu_k$, by taking samples and using the feature representation.
We want to avoid high-dimensional algebraic operations, as well as storing high-dimensional quantities.

63 Algorithms tested on Tetris 63/79 More examples of learning methods: temporal difference learning, TD(λ), LSTD, cross-entropy method, actor-critic, active learning, etc. These algorithms are essentially all combinations of DP, sampling, and parametric models.

64 Tetris World Record: Human vs. AI 64/79

65 More on Games and Learning 65/79 Tetris domain of the 2008 Reinforcement Learning Competition. 3rd International Reinforcement Learning Competition (2009, adversarial Tetris). 2013 Reinforcement Learning Competition (unmanned helicopter control). Since 2006, the Annual Computer Poker Competition. 2013, 2014, 2015 MIT Pokerbots. Many more!

66 66/79 Platform Gym: a toolkit for developing and comparing reinforcement learning algorithms; supports Atari and Mujoco. Universe: measuring and training an AI across the world's supply of games, websites and other applications. Deepmind Lab: a fully 3D game-like platform tailored for agent-based AI research. ViZDoom: allows developing AI bots that play Doom using only visual information.

67 67/79 Platform Rllab: mainly supports TRPO, VPG, CEM, NPG. Baselines: supports TRPO, PPO, DQN, A2C, ... (available on GitHub). Implement your algorithms on top of these packages.

68 Gym 68/79 A simple implementation of sampling a single path with gym
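A sketch of what such an implementation might look like is given below. It assumes the CartPole-v1 environment and the classic gym API (env.reset() returns an observation and env.step returns four values); newer gym/gymnasium releases return extra values, so the unpacking may need adjusting. The helper name `sample_path` and the random placeholder policy are illustrative choices, not part of the lecture.

```python
import gym
import numpy as np

def sample_path(env, policy, max_steps=1000):
    """Roll out a single path (trajectory) under `policy`, a function mapping
    observations to actions. Returns lists of observations, actions, rewards."""
    obs = env.reset()
    observations, actions, rewards = [], [], []
    for _ in range(max_steps):
        action = policy(obs)
        observations.append(obs)
        actions.append(action)
        obs, reward, done, info = env.step(action)   # classic gym API
        rewards.append(reward)
        if done:
            break
    return observations, actions, rewards

env = gym.make("CartPole-v1")
random_policy = lambda obs: env.action_space.sample()   # placeholder policy
obs, acts, rews = sample_path(env, random_policy)
print("path length:", len(rews), "total reward:", np.sum(rews))
```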

69 69/79 Environments Mujoco, continuous tasks: a physics engine for detailed, efficient rigid-body simulations with contacts. Swimmer, Hopper, Walker, Reacher, ... Policy output: Gaussian distribution over continuous actions.

70 70/79 Environments Atari 2600, discrete action space. Policy output: categorical distribution over actions.

71 Outline 71/79 1 Policy Search: A Simple Example of Energy Storage 2 Off-Policy RL via Linear Duality 3 Online Learning and Regret 4 Other Types of Methods 5 Trust Region Policy Optimization 6 Examples and Platform 7 AlphaGo Zero

72 AlphaGo Zero 72/79

73 AlphaGo Zero 73/79

74 AlphaGo Zero 74/79

75 AlphaGo Zero 75/79

76 AlphaGo Zero 76/79

77 AlphaGo Zero 77/79

78 AlphaGo Zero 78/79

79 AlphaGo Zero 79/79
