Deep Reinforcement Learning
1 Deep Reinforcement Learning
John Schulman, Berkeley Artificial Intelligence Research Lab
MLSS, May 2016, Cadiz
2 Agenda
- Introduction and Overview
- Markov Decision Processes
- Reinforcement Learning via Black-Box Optimization
- Policy Gradient Methods
- Variance Reduction for Policy Gradients
- Trust Region and Natural Gradient Methods
- Open Problems
Course materials: goo.gl/5wsgbj
3 Introduction and Overview
4 What is Reinforcement Learning?
- Branch of machine learning concerned with taking sequences of actions
- Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
[Figure: agent-environment loop. The agent sends an action to the environment; the environment returns an observation and a reward.]
5 Motor Control and Robotics
Robotics:
- Observations: camera images, joint angles
- Actions: joint torques
- Rewards: stay balanced, navigate to target locations, serve and protect humans
6 Business Operations
Inventory management:
- Observations: current inventory levels
- Actions: number of units of each item to purchase
- Rewards: profit
Resource allocation: who to provide customer service to first
Routing problems: in management of a shipping fleet, which trucks / truckers to assign to which cargo
7 Games
A different kind of optimization problem (min-max) but still considered to be RL.
- Go (complete information, deterministic): AlphaGo [2]
- Backgammon (complete information, stochastic): TD-Gammon [3]
- Stratego (incomplete information, deterministic)
- Poker (incomplete information, stochastic)
[2] David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. In: Nature (2016).
[3] Gerald Tesauro. Temporal difference learning and TD-Gammon. In: Communications of the ACM 38.3 (1995).
8 Approaches to RL
[Diagram: taxonomy of RL approaches.
- Policy Optimization: DFO / Evolution, Policy Gradients
- Dynamic Programming: Policy Iteration, Value Iteration (with modified policy iteration)
- Also shown: Actor-Critic Methods, Q-Learning]
9 What is Deep RL?
- RL using nonlinear function approximators
- Usually, updating parameters with stochastic gradient descent
10 What's Deep RL? Whatever the front half of the cerebral cortex does (motor and executive cortices)
11 Markov Decision Processes
12 Definition
A Markov Decision Process (MDP) is defined by $(S, A, P)$, where
- $S$: state space
- $A$: action space
- $P(r, s' \mid s, a)$: a transition probability distribution
Extra objects defined depending on problem setting:
- $\mu$: initial state distribution
- $\gamma$: discount factor
13 Episodic Setting
In each episode, the initial state is sampled from $\mu$, and the process proceeds until the terminal state is reached. For example:
- Taxi robot reaches its destination (termination = good)
- Waiter robot finishes a shift (fixed time)
- Walking robot falls over (termination = bad)
Goal: maximize expected reward per episode
14 Policies
- Deterministic policies: $a = \pi(s)$
- Stochastic policies: $a \sim \pi(a \mid s)$
- Parameterized policies: $\pi_\theta$
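15 Episodic Setting
$$
\begin{aligned}
s_0 &\sim \mu(s_0) \\
a_0 &\sim \pi(a_0 \mid s_0) \\
s_1, r_0 &\sim P(s_1, r_0 \mid s_0, a_0) \\
a_1 &\sim \pi(a_1 \mid s_1) \\
s_2, r_1 &\sim P(s_2, r_1 \mid s_1, a_1) \\
&\;\;\vdots \\
a_{T-1} &\sim \pi(a_{T-1} \mid s_{T-1}) \\
s_T, r_{T-1} &\sim P(s_T, r_{T-1} \mid s_{T-1}, a_{T-1})
\end{aligned}
$$
Objective: maximize $\eta(\pi)$, where
$$\eta(\pi) = E[r_0 + r_1 + \dots + r_{T-1} \mid \pi]$$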
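16 Episodic Setting
[Figure: the agent-environment interaction unrolled over time. The agent's policy $\pi$ maps states $s_0, s_1, \dots, s_{T-1}$ to actions $a_0, a_1, \dots, a_{T-1}$; the environment, with initial state distribution $\mu_0$ and dynamics $P$, produces the states $s_0, \dots, s_T$ and rewards $r_0, r_1, \dots, r_{T-1}$.]
Objective: maximize $\eta(\pi)$, where $\eta(\pi) = E[r_0 + r_1 + \dots + r_{T-1} \mid \pi]$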
17 Parameterized Policies
A family of policies indexed by parameter vector $\theta \in \mathbb{R}^d$
- Deterministic: $a = \pi(s, \theta)$
- Stochastic: $\pi(a \mid s, \theta)$
Analogous to classification or regression with input $s$, output $a$. E.g. for neural network stochastic policies:
- Discrete action space: network outputs vector of probabilities
- Continuous action space: network outputs mean and diagonal covariance of Gaussian
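The two output conventions above can be sketched in a few lines of Python. This is only an illustration: the logits, means, and log standard deviations are hard-coded stand-ins for what a policy network would output, and the function names (`softmax`, `sample_discrete`, `sample_gaussian`) are hypothetical, not from the slides.

```python
import math
import random

def softmax(logits):
    """Convert a list of real-valued logits into a probability vector."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete(logits, rng):
    """Discrete action space: sample an action index from softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_gaussian(mean, log_std, rng):
    """Continuous action space: sample from a Gaussian with diagonal covariance.

    The standard deviation of each coordinate is exp(log_std), an assumed
    parameterization that keeps it positive.
    """
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]

rng = random.Random(0)
probs = softmax([2.0, 0.0, -1.0])                  # stand-in network output
action = sample_discrete([2.0, 0.0, -1.0], rng)    # an index in {0, 1, 2}
torques = sample_gaussian([0.5, -0.5], [-1.0, -1.0], rng)
```

In both cases the policy is stochastic only through the final sampling step; the mapping from state to distribution parameters is deterministic.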
18 Reinforcement Learning via Black-Box Optimization
19 Derivative Free Optimization Approach
- Objective: maximize $E[R \mid \pi(\cdot, \theta)]$
- View the mapping $\theta \to R$ as a black box
- Ignore all information other than $R$ collected during the episode
20 Cross-Entropy Method
- Evolutionary algorithm
- Works embarrassingly well
István Szita and András Lörincz. Learning Tetris using the noisy cross-entropy method. In: Neural Computation (2006).
Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. Approximate Dynamic Programming Finally Performs Well in the Game of Tetris. In: Advances in Neural Information Processing Systems. 2013.
21 Cross-Entropy Method
A similar algorithm, Covariance Matrix Adaptation, has become standard in graphics.
22 Cross-Entropy Method
Initialize $\mu \in \mathbb{R}^d$, $\sigma \in \mathbb{R}^d$
for iteration = 1, 2, ... do
    Collect $n$ samples of $\theta_i \sim N(\mu, \mathrm{diag}(\sigma))$
    Perform a noisy evaluation $R_i \sim \theta_i$
    Select the top $p\%$ of samples (e.g. $p = 20$), which we'll call the elite set
    Fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new $\mu$, $\sigma$.
end for
Return the final $\mu$.
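The loop above might be sketched as follows in standard-library Python. Several things here are assumptions made for illustration: `sigma` is treated as a per-coordinate standard deviation, a small constant is added to it so the search distribution never collapses completely, and `noisy_quadratic` is a made-up toy objective standing in for an episode return.

```python
import random
import statistics

def cross_entropy_method(evaluate, dim, n_iters=30, n_samples=50,
                         elite_frac=0.2, init_std=1.0, seed=0):
    """Maximize a noisy black-box function evaluate(theta, rng) over R^dim."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    sigma = [init_std] * dim
    n_elite = max(2, int(n_samples * elite_frac))
    for _ in range(n_iters):
        # Collect n samples theta_i from the diagonal Gaussian.
        thetas = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
                  for _ in range(n_samples)]
        # One noisy evaluation per sample.
        returns = [evaluate(theta, rng) for theta in thetas]
        # Keep the top fraction of samples (the elite set).
        ranked = sorted(zip(returns, thetas), key=lambda p: p[0], reverse=True)
        elite = [theta for _, theta in ranked[:n_elite]]
        # Refit the diagonal Gaussian to the elite set.
        mu = [statistics.mean([t[j] for t in elite]) for j in range(dim)]
        sigma = [statistics.pstdev([t[j] for t in elite]) + 1e-3
                 for j in range(dim)]
    return mu

def noisy_quadratic(theta, rng):
    """Hypothetical stand-in for an episode return, maximized at (1, -2)."""
    target = [1.0, -2.0]
    clean = -sum((t - g) ** 2 for t, g in zip(theta, target))
    return clean + rng.gauss(0.0, 0.01)

mu = cross_entropy_method(noisy_quadratic, dim=2, init_std=2.0)
```

On this toy problem the returned `mu` should land close to the maximizer (1, -2). In an RL setting, `evaluate` would run one or more episodes with policy parameters `theta` and return the total reward.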
23 Cross-Entropy Method
- Analysis: a very similar algorithm is a minorization-maximization (MM) algorithm, guaranteed to monotonically increase expected reward
- Recall that the Monte-Carlo EM algorithm collects samples, reweights them, and then maximizes their logprob
- We can derive an MM algorithm where at each iteration you maximize $\sum_i \log p(\theta_i) R_i$
24 Policy Gradient Methods
25 Policy Gradient Methods: Overview
Problem: maximize $E[R \mid \pi_\theta]$
Intuitions: collect a bunch of trajectories, and ...
1. Make the good trajectories more probable
2. Make the good actions more probable (actor-critic, GAE)
3. Push the actions towards good actions (DPG, SVG)
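26 Score Function Gradient Estimator
Consider an expectation $E_{x \sim p(x \mid \theta)}[f(x)]$. Want to compute gradient wrt $\theta$:
$$
\begin{aligned}
\nabla_\theta E_x[f(x)] &= \nabla_\theta \int dx\, p(x \mid \theta) f(x) \\
&= \int dx\, \nabla_\theta p(x \mid \theta) f(x) \\
&= \int dx\, p(x \mid \theta) \frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta)} f(x) \\
&= \int dx\, p(x \mid \theta) \nabla_\theta \log p(x \mid \theta) f(x) \\
&= E_x[f(x) \nabla_\theta \log p(x \mid \theta)].
\end{aligned}
$$
The last expression gives us an unbiased gradient estimator. Just sample $x_i \sim p(x \mid \theta)$, and compute $\hat{g}_i = f(x_i) \nabla_\theta \log p(x_i \mid \theta)$.
Need to be able to compute and differentiate the density $p(x \mid \theta)$ wrt $\theta$.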
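A quick numerical check of the estimator, as a sketch under assumptions chosen for convenience: take $x \sim N(\theta, 1)$ and $f(x) = x^2$, so that $E[f(x)] = \theta^2 + 1$ and the exact gradient is $2\theta$. For this density the score is $\nabla_\theta \log p(x \mid \theta) = x - \theta$.

```python
import random

def score_function_gradient(f, theta, n_samples, rng):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[f(x)] by averaging f(x) * score."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta  # gradient of log N(x; theta, 1) wrt theta
        total += f(x) * score
    return total / n_samples

rng = random.Random(0)
theta = 1.0
estimate = score_function_gradient(lambda x: x * x, theta, 200_000, rng)
exact = 2.0 * theta  # because E[x^2] = theta^2 + 1
```

Note that `f` is only ever evaluated, never differentiated, which is why the estimator also applies when `f` is discontinuous or unknown.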
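27 Derivation via Importance Sampling
Alternate derivation using importance sampling:
$$
\begin{aligned}
E_{x \sim \theta}[f(x)] &= E_{x \sim \theta_{\text{old}}}\left[\frac{p(x \mid \theta)}{p(x \mid \theta_{\text{old}})} f(x)\right] \\
\nabla_\theta E_{x \sim \theta}[f(x)] &= E_{x \sim \theta_{\text{old}}}\left[\frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta_{\text{old}})} f(x)\right] \\
\nabla_\theta E_{x \sim \theta}[f(x)] \Big|_{\theta = \theta_{\text{old}}} &= E_{x \sim \theta_{\text{old}}}\left[\frac{\nabla_\theta p(x \mid \theta)\big|_{\theta = \theta_{\text{old}}}}{p(x \mid \theta_{\text{old}})} f(x)\right] \\
&= E_{x \sim \theta_{\text{old}}}\left[\nabla_\theta \log p(x \mid \theta)\big|_{\theta = \theta_{\text{old}}} f(x)\right]
\end{aligned}
$$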
28 Score Function Gradient Estimator: Intuition
$$\hat{g}_i = f(x_i) \nabla_\theta \log p(x_i \mid \theta)$$
- Let's say that $f(x)$ measures how good the sample $x$ is.
- Moving in the direction $\hat{g}_i$ pushes up the logprob of the sample, in proportion to how good it is
- Valid even if $f(x)$ is discontinuous, and unknown, or the sample space (containing $x$) is a discrete set
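31 Score Function Gradient Estimator for Policies
Now the random variable $x$ is a whole trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$
$$\nabla_\theta E_\tau[R(\tau)] = E_\tau[\nabla_\theta \log p(\tau \mid \theta) R(\tau)]$$
Just need to write out $p(\tau \mid \theta)$:
$$
\begin{aligned}
p(\tau \mid \theta) &= \mu(s_0) \prod_{t=0}^{T-1} \left[\pi(a_t \mid s_t, \theta) P(s_{t+1}, r_t \mid s_t, a_t)\right] \\
\log p(\tau \mid \theta) &= \log \mu(s_0) + \sum_{t=0}^{T-1} \left[\log \pi(a_t \mid s_t, \theta) + \log P(s_{t+1}, r_t \mid s_t, a_t)\right] \\
\nabla_\theta \log p(\tau \mid \theta) &= \nabla_\theta \sum_{t=0}^{T-1} \log \pi(a_t \mid s_t, \theta) \\
\nabla_\theta E_\tau[R] &= E_\tau\left[R \, \nabla_\theta \sum_{t=0}^{T-1} \log \pi(a_t \mid s_t, \theta)\right]
\end{aligned}
$$
Interpretation: using good trajectories (high $R$) as supervised examples in classification / regression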
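32 Policy Gradient: Slightly Better Formula
Previous slide:
$$\nabla_\theta E_\tau[R] = E_\tau\left[\left(\sum_{t=0}^{T-1} r_t\right)\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right)\right]$$
But we can cut the trajectory to $t'$ steps and derive a gradient estimator for one reward term $r_{t'}$:
$$\nabla_\theta E[r_{t'}] = E\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]$$
Sum this formula over $t'$, obtaining
$$
\begin{aligned}
\nabla_\theta E[R] &= E\left[\sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right] \\
&= E\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'}\right]
\end{aligned}
$$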
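For a single sampled trajectory, the final formula weights each step's score by the sum of rewards from that step onward. A minimal sketch, assuming the per-step gradients $\nabla_\theta \log \pi(a_t \mid s_t, \theta)$ have already been computed elsewhere and are passed in as plain lists (the function names are illustrative):

```python
def rewards_to_go(rewards):
    """Return [sum of rewards[t:] for each t], computed in one backward pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

def policy_gradient_estimate(grad_logps, rewards):
    """Single-trajectory estimate: sum_t grad_logp[t] * (reward-to-go at t)."""
    togo = rewards_to_go(rewards)
    dim = len(grad_logps[0])
    grad = [0.0] * dim
    for g_t, ret_t in zip(grad_logps, togo):
        for j in range(dim):
            grad[j] += g_t[j] * ret_t
    return grad

rewards = [1.0, 0.0, 2.0, 3.0]
returns = rewards_to_go(rewards)  # [6.0, 5.0, 5.0, 3.0]

# Hypothetical per-step score vectors for a 2-parameter policy.
grad_logps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
grad = policy_gradient_estimate(grad_logps, rewards)  # [11.0, 10.0]
```

Averaging this quantity over many sampled trajectories gives the Monte Carlo estimate of the expectation.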
33 Adding a Baseline
- Suppose $f(x) \ge 0, \ \forall x$
- Then for every $x_i$, the gradient estimator $\hat{g}_i$ tries to push up its density
- We can derive a new unbiased estimator that avoids this problem, and only pushes up the density for better-than-average $x_i$:
$$
\begin{aligned}
\nabla_\theta E_x[f(x)] &= \nabla_\theta E_x[f(x) - b] \\
&= E_x[\nabla_\theta \log p(x \mid \theta)(f(x) - b)]
\end{aligned}
$$
- A near-optimal choice of $b$ is always $E[f(x)]$ (which must be estimated)
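The effect of the baseline can be seen numerically. The sketch below reuses an assumed toy setup, $x \sim N(\theta, 1)$ with the always-positive $f(x) = x^2 + 10$, and for simplicity plugs in the exact value $b = E[f(x)] = \theta^2 + 11$ rather than estimating it.

```python
import random

def gradient_samples(f, theta, baseline, n_samples, rng):
    """Per-sample estimates (f(x) - b) * score for x ~ N(theta, 1)."""
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        samples.append((f(x) - baseline) * (x - theta))
    return samples

def mean_and_variance(values):
    """Sample mean and (biased) sample variance of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def f(x):
    return x * x + 10.0  # non-negative everywhere

theta = 1.0
rng = random.Random(0)
plain = gradient_samples(f, theta, 0.0, 50_000, rng)
centered = gradient_samples(f, theta, theta ** 2 + 11.0, 50_000, rng)

mean_plain, var_plain = mean_and_variance(plain)
mean_centered, var_centered = mean_and_variance(centered)
```

Both sample means should be close to the exact gradient $2\theta = 2$, while the per-sample variance with the baseline should be roughly an order of magnitude smaller on this problem.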
34 Policy Gradient with Baseline
Recall
$$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'}\right]$$
Using the fact that $E_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)] = 0$, we can show
$$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$$
for any baseline function $b : S \to \mathbb{R}$
- Increase the logprob of action $a_t$ proportionally to how much the returns $\sum_{t'=t}^{T-1} r_{t'}$ are better than expected
- Later: use value functions to further isolate the effect of the action, at the cost of bias
- For a more general picture of the score function gradient estimator, see stochastic computation graphs [4].
[4] John Schulman, Nicolas Heess, et al. Gradient Estimation Using Stochastic Computation Graphs. In: Advances in Neural Information Processing Systems. 2015.
35 That's all for today
Course materials: goo.gl/5wsgbj
36 Variance Reduction for Policy Gradients
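37 Review (I)
Process for generating trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$:
$$
\begin{aligned}
s_0 &\sim \mu(s_0) \\
a_0 &\sim \pi(a_0 \mid s_0) \\
s_1, r_0 &\sim P(s_1, r_0 \mid s_0, a_0) \\
a_1 &\sim \pi(a_1 \mid s_1) \\
s_2, r_1 &\sim P(s_2, r_1 \mid s_1, a_1) \\
&\;\;\vdots \\
a_{T-1} &\sim \pi(a_{T-1} \mid s_{T-1}) \\
s_T, r_{T-1} &\sim P(s_T, r_{T-1} \mid s_{T-1}, a_{T-1})
\end{aligned}
$$
Given parameterized policy $\pi(a \mid s, \theta)$, the optimization problem is
$$\underset{\theta}{\text{maximize}} \ E_\tau[R \mid \pi(\cdot \mid \cdot, \theta)]$$
where $R = r_0 + r_1 + \dots + r_{T-1}$.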
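38 Review (II)
In general, we can compute gradients of expectations with the score function gradient estimator
$$\nabla_\theta E_{x \sim p(x \mid \theta)}[f(x)] = E_x[\nabla_\theta \log p(x \mid \theta) f(x)]$$
We derived a formula for the policy gradient
$$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$$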
39 Value Functions
The state-value function $V^\pi$ is defined as:
$$V^\pi(s) = E[r_0 + r_1 + r_2 + \dots \mid s_0 = s]$$
Measures expected future return, starting with state $s$.
The state-action value function $Q^\pi$ is defined as
$$Q^\pi(s, a) = E[r_0 + r_1 + r_2 + \dots \mid s_0 = s, a_0 = a]$$
The advantage function $A^\pi$ is
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
Measures how much better action $a$ is than what the policy $\pi$ would've done.
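40 Refining the Policy Gradient Formula
Recall
$$
\begin{aligned}
\nabla_\theta E_\tau[R] &= E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right] \\
&= \sum_{t=0}^{T-1} E_\tau\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right] \\
&= \sum_{t=0}^{T-1} E_{s_0 \dots a_t}\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta) \, E_{r_t s_{t+1} \dots s_T}\left[\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right]\right] \\
&= \sum_{t=0}^{T-1} E_{s_0 \dots a_t}\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(Q^\pi(s_t, a_t) - b(s_t)\right)\right]
\end{aligned}
$$
where the last equality used the fact that
$$E_{r_t s_{t+1} \dots s_T}\left[\sum_{t'=t}^{T-1} r_{t'}\right] = Q^\pi(s_t, a_t)$$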
41 Refining the Policy Gradient Formula
From the previous slide, we've obtained
$$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(Q^\pi(s_t, a_t) - b(s_t)\right)\right]$$
Now let's define $b(s) = V^\pi(s)$, which turns out to be near-optimal [5]. We get
$$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) A^\pi(s_t, a_t)\right]$$
Intuition: increase the probability of good actions (positive advantage), decrease the probability of bad ones (negative advantage)
[5] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. In: The Journal of Machine Learning Research 5 (2004).
42 Variance Reduction
Now, we have the following policy gradient formula:
$$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) A^\pi(s_t, a_t)\right]$$
- $A^\pi$ is not known, but we can plug in a random variable $\hat{A}_t$, an advantage estimator
- Previously, we showed that taking $\hat{A}_t = r_t + r_{t+1} + r_{t+2} + \dots - b(s_t)$, for any function $b(s_t)$, gives an unbiased policy gradient estimator. $b(s_t) \approx V^\pi(s_t)$ gives variance reduction.
43 The Delayed Reward Problem One reason RL is difficult is the long delay between action and reward
44 The Delayed Reward Problem
- With policy gradient methods, we are confounding the effect of multiple actions: $\hat{A}_t = r_t + r_{t+1} + r_{t+2} + \dots - b(s_t)$ mixes the effect of $a_t, a_{t+1}, a_{t+2}, \dots$
- SNR of $\hat{A}_t$ scales roughly as $1/T$
- Only $a_t$ contributes to the signal $A^\pi(s_t, a_t)$, but $a_{t+1}, a_{t+2}, \dots$ contribute to noise.
45 Var. Red. Idea 1: Using Discounts
- Discount factor $\gamma$, $0 < \gamma < 1$, downweights the effect of rewards that are far in the future, i.e. it ignores long-term dependencies
- We can form an advantage estimator using the discounted return:
$$\hat{A}_t^\gamma = \underbrace{r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots}_{\text{discounted return}} - b(s_t)$$
  which reduces to our previous estimator when $\gamma = 1$.
- So that the advantage has expectation zero, we should fit the baseline to be the discounted value function
$$V^{\pi,\gamma}(s) = E_\tau\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s\right]$$
- $\hat{A}_t^\gamma$ is a biased estimator of the advantage function
46 Var. Red. Idea 2: Value Functions in the Future
Another approach for variance reduction is to use the value function to estimate future rewards:
- $r_t + r_{t+1} + r_{t+2} + \dots$ : use empirical rewards
- $r_t + V(s_{t+1})$ : cut off at one timestep
- $r_t + r_{t+1} + V(s_{t+2})$ : cut off at two timesteps
- ...
Adding the baseline again, we get the advantage estimators
- $\hat{A}_t = r_t + V(s_{t+1}) - V(s_t)$ : cut off at one timestep
- $\hat{A}_t = r_t + r_{t+1} + V(s_{t+2}) - V(s_t)$ : cut off at two timesteps
- ...
47 Combining Ideas 1 and 2
Can combine discounts and value functions in the future, e.g., $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, where $V$ approximates the discounted value function $V^{\pi,\gamma}$.
The above formula is called an actor-critic method, where the actor is the policy $\pi$, and the critic is the value function $V$. [6]
Going further, the generalized advantage estimator [7]:
$$\hat{A}_t^{\gamma,\lambda} = \delta_t + (\gamma\lambda)\delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \dots, \quad \text{where } \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Interpolates between the two previous estimators:
- $\lambda = 0$: $r_t + \gamma V(s_{t+1}) - V(s_t)$ (low variance, high bias)
- $\lambda = 1$: $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots - V(s_t)$ (low bias, high variance)
[6] Vijay R Konda and John N Tsitsiklis. Actor-Critic Algorithms. In: Advances in Neural Information Processing Systems. Vol. 13. Citeseer. 1999.
[7] John Schulman, Philipp Moritz, et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint (2015).
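The generalized advantage estimator can be computed for a whole trajectory with one backward pass, using the recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. The sketch below assumes a finite episode and a `values` list with one more entry than `rewards` (the last entry being the value of the final state, zero if it is terminal); the function name and the example numbers are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one finite trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)], with V(s_T) = 0 for a terminal state
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # TD residual delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]

# lam = 0 recovers the one-step estimator r_t + gamma * V(s_{t+1}) - V(s_t).
one_step = gae_advantages(rewards, values, gamma=0.9, lam=0.0)
# lam = 1 recovers the discounted return minus V(s_t).
monte_carlo = gae_advantages(rewards, values, gamma=0.9, lam=1.0)
```

For this example `one_step` is approximately `[0.95, 0.95, 0.5]` and `monte_carlo` is approximately `[2.21, 1.4, 0.5]`; intermediate values of `lam` trade bias against variance between these two extremes.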
48 Alternative Approach: Reparameterization
- Suppose the problem has a continuous action space, $a \in \mathbb{R}^d$
- Then $\frac{d}{da} Q^\pi(s, a)$ tells us how to improve our action
- We can use the reparameterization trick, so $a$ is a deterministic function $a = f(s, z)$, where $z$ is noise. Then,
$$\nabla_\theta E_\tau[R] = \nabla_\theta Q^\pi(s_0, a_0) + \nabla_\theta Q^\pi(s_1, a_1) + \dots$$
- This method is called the deterministic policy gradient [8]
- A generalized version, which also uses a dynamics model, is described as the stochastic value gradient [9]
[8] David Silver, Guy Lever, et al. Deterministic policy gradient algorithms. In: ICML. 2014; Timothy P Lillicrap et al. Continuous control with deep reinforcement learning. In: arXiv preprint (2015).
[9] Nicolas Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015.
49 Trust Region and Natural Gradient Methods
50 Optimization Issues with Policy Gradients
- Hard to choose a reasonable stepsize that works for the whole optimization
  - we have a gradient estimate, no objective for line search
  - statistics of data (observations and rewards) change during learning
- They make inefficient use of data: each experience is only used to compute one gradient.
  - Given a batch of trajectories, what's the most we can do with it?
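51 Policy Performance Function
Let $\eta(\pi)$ denote the performance of policy $\pi$:
$$\eta(\pi) = E_\tau[R \mid \pi]$$
The following neat identity holds:
$$\eta(\tilde\pi) = \eta(\pi) + E_{\tau \sim \tilde\pi}\left[A^\pi(s_0, a_0) + A^\pi(s_1, a_1) + A^\pi(s_2, a_2) + \dots\right]$$
Proof: consider the nonstationary policy $\pi_0 \pi_1 \pi_2 \dots$
$$
\begin{aligned}
\eta(\tilde\pi \tilde\pi \tilde\pi \cdots) = \eta(\pi \pi \pi \cdots) &+ \eta(\tilde\pi \pi \pi \cdots) - \eta(\pi \pi \pi \cdots) \\
&+ \eta(\tilde\pi \tilde\pi \pi \cdots) - \eta(\tilde\pi \pi \pi \cdots) \\
&+ \eta(\tilde\pi \tilde\pi \tilde\pi \cdots) - \eta(\tilde\pi \tilde\pi \pi \cdots) \\
&+ \dots
\end{aligned}
$$
The $t$-th difference term equals $A^\pi(s_t, a_t)$, in expectation over $\tau \sim \tilde\pi$.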
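52 Local Approximation
We just derived an expression for the performance of a policy $\tilde\pi$ relative to $\pi$:
$$
\begin{aligned}
\eta(\tilde\pi) &= \eta(\pi) + E_{\tau \sim \tilde\pi}\left[A^\pi(s_0, a_0) + A^\pi(s_1, a_1) + \dots\right] \\
&= \eta(\pi) + E_{s_{0:\infty} \sim \tilde\pi}\left[E_{a_{0:\infty} \sim \tilde\pi}\left[A^\pi(s_0, a_0) + A^\pi(s_1, a_1) + \dots\right]\right]
\end{aligned}
$$
Can't use this to optimize $\tilde\pi$ because the state distribution has a complicated dependence on it.
Let's define $L_\pi$, the local approximation, which ignores the change in state distribution and can be estimated by sampling from $\pi$:
$$
\begin{aligned}
L_\pi(\tilde\pi) &= E_{s_{0:\infty} \sim \pi}\left[E_{a_{0:\infty} \sim \tilde\pi}\left[A^\pi(s_0, a_0) + A^\pi(s_1, a_1) + \dots\right]\right] \\
&= E_{s_{0:\infty} \sim \pi}\left[\sum_{t=0}^{T-1} E_{a \sim \tilde\pi}\left[A^\pi(s_t, a_t)\right]\right] \\
&= E_{s_{0:\infty} \sim \pi}\left[\sum_{t=0}^{T-1} E_{a \sim \pi}\left[\frac{\tilde\pi(a_t \mid s_t)}{\pi(a_t \mid s_t)} A^\pi(s_t, a_t)\right]\right] \\
&= E_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \frac{\tilde\pi(a_t \mid s_t)}{\pi(a_t \mid s_t)} A^\pi(s_t, a_t)\right]
\end{aligned}
$$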
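53 Local Approximation
Now let's consider a parameterized policy, $\pi(a \mid s, \theta)$. Sample with $\theta_{\text{old}}$, now write the local approximation in terms of $\theta$.
$$
\begin{aligned}
L_\pi(\tilde\pi) &= E_{s_{0:\infty} \sim \pi}\left[\sum_{t=0}^{T-1} E_{a \sim \pi}\left[\frac{\tilde\pi(a_t \mid s_t)}{\pi(a_t \mid s_t)} A^\pi(s_t, a_t)\right]\right] \\
L_{\theta_{\text{old}}}(\theta) &= E_{s_{0:\infty} \sim \theta_{\text{old}}}\left[\sum_{t=0}^{T-1} E_{a \sim \theta_{\text{old}}}\left[\frac{\pi(a_t \mid s_t, \theta)}{\pi(a_t \mid s_t, \theta_{\text{old}})} A^{\theta_{\text{old}}}(s_t, a_t)\right]\right]
\end{aligned}
$$
$L_{\theta_{\text{old}}}(\theta)$ matches $\eta(\theta)$ to first order around $\theta_{\text{old}}$:
$$
\begin{aligned}
\nabla_\theta L_{\theta_{\text{old}}}(\theta)\Big|_{\theta = \theta_{\text{old}}} &= E_{s_{0:\infty} \sim \theta_{\text{old}}}\left[\sum_{t=0}^{T-1} E_{a \sim \theta_{\text{old}}}\left[\frac{\nabla_\theta \pi(a_t \mid s_t, \theta)\big|_{\theta = \theta_{\text{old}}}}{\pi(a_t \mid s_t, \theta_{\text{old}})} A^{\theta_{\text{old}}}(s_t, a_t)\right]\right] \\
&= E_{s_{0:\infty} \sim \theta_{\text{old}}}\left[\sum_{t=0}^{T-1} E_{a \sim \theta_{\text{old}}}\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\big|_{\theta = \theta_{\text{old}}} A^{\theta_{\text{old}}}(s_t, a_t)\right]\right] \\
&= \nabla_\theta \eta(\theta)\Big|_{\theta = \theta_{\text{old}}}
\end{aligned}
$$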
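Given a batch collected with $\theta_{\text{old}}$, the local approximation can be estimated from stored log-probabilities and advantage estimates. This is a minimal sketch under assumed inputs: flat lists with one entry per sampled timestep, log-probabilities of the taken actions under the old and new parameters, and advantage estimates computed elsewhere. The function name and the numbers are illustrative.

```python
import math

def surrogate_objective(logp_new, logp_old, advantages, n_trajectories):
    """Sample estimate of L_{theta_old}(theta).

    Sums ratio * advantage over all sampled timesteps and divides by the
    number of trajectories, where ratio = pi(a|s, theta) / pi(a|s, theta_old).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        total += ratio * adv
    return total / n_trajectories

logp_old = [-1.0, -0.5, -2.0]
advantages = [1.0, -1.0, 0.5]

# At theta = theta_old every ratio is 1, so the estimate is the mean
# per-trajectory sum of advantages.
at_old = surrogate_objective(logp_old, logp_old, advantages, n_trajectories=1)

# A candidate theta that raises the probability of the positive-advantage
# action and lowers that of the negative-advantage action scores higher.
logp_new = [-0.5, -1.0, -2.0]
at_new = surrogate_objective(logp_new, logp_old, advantages, n_trajectories=1)
```

Because the same batch can be re-scored for any candidate $\theta$, this quantity gives an objective to evaluate during a line search, which a bare gradient estimate does not.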
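54 MM Algorithm
Theorem (ignoring some details) [10]
$$\eta(\theta) \ge \underbrace{L_{\theta_{\text{old}}}(\theta)}_{\text{local approx. to } \eta} - \underbrace{C \max_s D_{KL}\left[\pi(\cdot \mid \theta_{\text{old}}, s) \,\|\, \pi(\cdot \mid \theta, s)\right]}_{\text{penalty for changing policy}}$$
[Figure: $\eta(\theta)$, the local approximation $L(\theta)$, and the KL-penalized bound plotted against $\theta$.]
If $\theta_{\text{old}} \to \theta_{\text{new}}$ improves the lower bound, it's guaranteed to improve $\eta$
[10] John Schulman, Sergey Levine, et al. Trust Region Policy Optimization. In: arXiv preprint (2015).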
55 Review
- Want to optimize $\eta(\theta)$. Collected data with policy parameter $\theta_{\text{old}}$, now want to do an update
- Derived local approximation $L_{\theta_{\text{old}}}(\theta)$
- Optimizing the KL-penalized local approximation gives guaranteed improvement to $\eta$
- More approximations give a practical algorithm, called TRPO
56 TRPO Approximations
Steps:
- Instead of max over state space, take mean
- Linear approximation to $L$, quadratic approximation to KL divergence
- Use hard constraint on KL divergence instead of penalty
Solve the following problem approximately:
$$\text{maximize } L_{\theta_{\text{old}}}(\theta) \quad \text{subject to } D_{KL}[\theta_{\text{old}} \,\|\, \theta] \le \delta$$
Solve approximately through line search in the natural gradient direction $s = F^{-1} g$
Resulting algorithm is a refined version of natural policy gradient [11]
[11] Sham Kakade. A Natural Policy Gradient. In: NIPS.
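One way to obtain the natural gradient direction $s = F^{-1} g$ without inverting $F$ is conjugate gradient, which only needs matrix-vector products with $F$. The sketch below is an illustration under assumptions, not the method as specified on the slide: `F` is a small explicit matrix standing in for the Fisher information matrix, `g` is a made-up gradient, and the initial step is scaled so the quadratic KL approximation $\tfrac{1}{2}\Delta^\top F \Delta$ equals $\delta$. The backtracking line search on the true objective is omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fisher_vector_product, g, n_iters=10, tol=1e-10):
    """Approximately solve F s = g for symmetric positive definite F."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - F x, with x = 0 initially
    p = list(g)  # search direction
    rr = dot(r, r)
    for _ in range(n_iters):
        if rr < tol:
            break
        Fp = fisher_vector_product(p)
        alpha = rr / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

# Hypothetical 2-parameter example.
F = [[2.0, 0.0], [0.0, 0.5]]
g = [1.0, 1.0]
delta = 0.01

def fvp(v):
    return [dot(row, v) for row in F]

s = conjugate_gradient(fvp, g)               # approximately [0.5, 2.0]
beta = math.sqrt(2.0 * delta / dot(s, fvp(s)))
step = [beta * si for si in s]               # largest step within the trust region
kl_quadratic = 0.5 * dot(step, fvp(step))    # approximately delta
```

With a neural network policy, `fvp` would be replaced by a routine that computes Fisher-vector products from sampled data, so the full matrix is never formed.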
57 Empirical Results: TRPO + GAE
- TRPO, with neural network policies, was applied to learn controllers for 2D robotic swimming, hopping, and walking, and playing Atari games [12]
- Used TRPO along with generalized advantage estimation to optimize locomotion policies for 3D simulated robots [13]
[12] John Schulman, Sergey Levine, et al. Trust Region Policy Optimization. In: arXiv preprint (2015).
[13] John Schulman, Philipp Moritz, et al. High-dimensional continuous control using generalized advantage estimation. In: arXiv preprint (2015).
58 Putting In Perspective
Quick and incomplete overview of recent results with deep RL algorithms
- Policy gradient methods
  - TRPO + GAE
  - Standard policy gradient (no trust region) + deep nets + parallel implementation [14]
  - Repar trick [15]
- Q-learning [16] and modifications [17]
- Combining search + supervised learning [18]
[14] V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint (2013).
[15] Nicolas Heess et al. Learning continuous control policies by stochastic value gradients. In: Advances in Neural Information Processing Systems. 2015; Timothy P Lillicrap et al. Continuous control with deep reinforcement learning. In: arXiv preprint (2015).
[16] V. Mnih et al. Playing Atari with Deep Reinforcement Learning. In: arXiv preprint (2013).
[17] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling Network Architectures for Deep Reinforcement Learning. In: arXiv preprint (2015); Hado V Hasselt. Double Q-learning. In: Advances in Neural Information Processing Systems. 2010.
[18] X. Guo et al. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In: Advances in Neural Information Processing Systems. 2014; Sergey Levine et al. End-to-end training of deep visuomotor policies. In: arXiv preprint (2015); Igor Mordatch et al. Interactive Control of Diverse Complex Characters with Neural Networks. In: Advances in Neural Information Processing Systems. 2015.
59 Open Problems
60 What's the Right Core Model-Free Algorithm?
Policy gradients (score function vs. reparameterization, natural vs. not natural) vs. Q-learning vs. derivative-free optimization vs. others
Desiderata:
- scalable
- sample-efficient
- robust
- learns from off-policy data
61 Exploration
- Exploration: actively encourage the agent to reach unfamiliar parts of state space, avoid getting stuck in a local maximum of performance
- Can solve finite MDPs in polynomial time with exploration [19]
  - optimism about new states and actions
  - maintain distribution over possible models, and plan with them (Bayesian RL, Thompson sampling)
- How to do exploration in the deep RL setting? Thompson sampling [20], novelty bonus [21]
[19] Alexander L Strehl et al. PAC model-free reinforcement learning. In: Proceedings of the 23rd International Conference on Machine Learning. ACM. 2006.
[20] Ian Osband et al. Deep Exploration via Bootstrapped DQN. In: arXiv preprint (2016).
[21] Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. In: arXiv preprint (2015).
62 Hierarchy
[Figure: levels of a control hierarchy, from coarsest to finest time scale.]
- task 1, task 2, task 3, task 4: 10 timesteps / day
- walk to x, fetch object y, say z: 0.01 Hz, $10^3$ timesteps / day
- footstep planning: 1 Hz, $10^5$ timesteps / day
- torque control: 100 Hz, $10^7$ timesteps / day
63 More Open Problems Using learned models Learning from demonstrations
64 The End Questions?
More informationarxiv: v4 [cs.lg] 14 Oct 2018
Equivalence Between Policy Gradients and Soft -Learning John Schulman 1, Xi Chen 1,2, and Pieter Abbeel 1,2 1 OpenAI 2 UC Berkeley, EECS Dept. {joschu, peter, pieter}@openai.com arxiv:174.644v4 cs.lg 14
More informationLearning Dexterity Matthias Plappert SEPTEMBER 6, 2018
Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018 OpenAI OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence. OpenAI OpenAI is a non-profit