Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. March 8, 2017
|
|
- Daisy Kelly
- 5 years ago
- Views:
Transcription
1 Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More March 8, 2017
2 Defining a Loss Function for RL Let η(π) denote the expected return of π [ ] η(π) = E s0 ρ 0,a t π( s t) γ t r t We collect data with π old. Want to optimize some objective to get a new policy π t=0 A useful identity 1 : [ ] η(π) = η(π old ) + E τ π γ t A π old (s t, a t ) t=0 1 S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. ICML
3 Proof of Useful Identity First note that A π old (s, a) = E s P(s s,a) [r(s) + γv π old (s ) V π old (s)]. [ ] E τ π γ t A πold (s t, a t ) t=0 [ ] = E τ π γ t (r(s t ) + γv π old (s t+1 ) V π old (s t )) t=0 = E τ π [ V π old (s 0 ) + ] γ t r(s t ) t=0 [ ] = E s0 [V π old (s 0 )] + E τ π γ t r(s t ) = η(π old ) + η(π) t=0
4 Surrogate Loss Function Want to manipulate η(π) into an objective that we can estimate from sampled data η(π) = const + E s π,a π [A π old (s, a)] [ ] π(a s) = const + E s π,a πold π old (a s) Aπ old (s, a) Define L πold (π) to be the surrogate objective that ignores change in state distrib: L(π) = E s πold,a π [A π old (s t, a t )] [ ] π(a s) = E s πold,a π old π old (a s) Aπ old (s, a) Matches to first order for parameterized policy θ L(π θ ) θold = E s,a πold [ θ π θ (a s) π old (a s) Aπ old (s, a)] θold = E s,a πold [ θ log π θ (a s)a π old (s, a)] θold = θ η(π θ ) θ=θold Local approximation to the performance of the policy
5 Improvement Theory Theory: bound the difference between L πold (π) and η(π), the performance of the policy (error due because we re ignoring state distrib. change) Result: η(π) L πold (π) C max s KL[π old ( s), π( s)], where c = 2ɛγ/(1 γ) 2 Monotonic improvement guaranteed (MM algorithm)
6 Practical Approximations Theory: should maximize L πold (π) C max s KL[π old ( s), π( s)]. Approximations: Estimate L πold (π) using trajectories sampled from π old ˆL πold (π) = π(a n s n) n π old (a n s n)ân If just gradient needed, can use ˆLπold (π) = n log π(s n s n )Ân Use mean KL divergence E s πold [KL[π old ( s), π( s)]] instead of max s KL[... ] Define KLπold (π) = E s πold [KL[π old ( s), π( s)]] Use estimate from samples, n KL[π old( s n ), π( s n )] C is too pessimistic and provides guarantee for discounted return Natural policy gradient and PPO: use fixed or adaptive coefficient C TRPO: use hard constraint with fixed KL penalty
7 Solving KL-Penalized Problem maximize θ L πθold (π θ ) C KL πθold (π θ ) Make linear approximation to L πθold maximize θ and quadratic approximation to KL term: g (θ θ old ) C 2 (θ θ old) T F (θ θ old ) where g = θ L π θold (π θ ) θ=θold, F = 2 2 θ KL π θold (π θ ) θ=θold Quadratic part of L is negligible compared to KL term F is positive semidefinite, but not if we include Hessian of L Solution: θ θ old = 1 C F 1 g
8 Solving Linear Systems using Conjugate Gradient Previous slide: θ θ old = 1 F 1 g. Don t want to form full Hessian matrix C F = 2 KL 2 θ π θold (π θ ) θ=θold (memory and computation) Can compute F 1 g approximately using conjugate gradient algorithm without forming F explicitly
9 Truncated Newton Method Conjugate gradient algorithm approximately solves for x = A 1 b, without explicitly forming matrix A, just reads A through matrix-vector products v Av. After k iterations, CG has minimized 1 2 x T Ax bx in subspace spanned by b, Ab, A 2 b,..., A k 1 b Given vector v with same dimension as θ, want to compute H 1 v, where H = 2 2 θ f (θ). To perform CG, Hessian-vector products v Hv. Can form this function using autodiff software like Tensorflow. Example: theta = tf.placeholder(...) # parameter vector f_of_theta =... # scalar vector = tf.placeholder([dim_theta]) gradient = tf.grad(f, theta) gradient_vector_product = tf.sum( gradient * vector ) hessian_vector_product = tf.grad(gradient_vector_product, theta) Hessian vector product computation takes 1-2 times as long as gradient computation S. J. Wright and J. Nocedal. Numerical optimization. Springer New York, 1999
10 Truncated Newton Method Hessian-vector can be formed as follows if KL Hessian (F ) is computed using KL theta = tf.placeholder(...) # parameter vector actprob =... # batchsize x n_actions. # E.g., outputs of neural network with softmax oldactprob = tf.stop_gradient(actprob) vector = tf.placeholder([dim_theta]) kl = tf.sum( oldactprob * tf.log(oldactprob / actprob), axis=1) gradient = tf.grad(kl, theta) gradient_vector_product = tf.sum( gradient * vector ) hessian_vector_product = tf.grad(gradient_vector_product, theta) S. J. Wright and J. Nocedal. Numerical optimization. Springer New York, 1999
11 Solving KL-Penalized Problem: Summary maximize θ L πθold (π θ ) C KL πθold (π θ ) Make linear approximation to L πθold and quadratic approximation to KL term: maximize θ g (θ θ old ) C 2 (θ θ old) T F (θ θ old ) where g = θ L π θold (π θ ) θ=θold, F = 2 2 θ KL π θold (π θ ) θ=θold Solution: θ θ old = 1 C F 1 g Solve for F 1 g approximately using conjugate gradient algorithm by forming Hessian-vector product function
12 Truncated Natural Policy Gradient Algorithm for iteration=1, 2,... do Run policy for T timesteps or N trajectories Estimate advantage function at all timesteps Compute policy gradient g Use CG (with Hessian-vector products) to compute F 1 g Update policy parameter θ = θ old + αf 1 g end for
13 TRPO: KL-Constrained Problem Unconstrained problem: maximize L πθold (π θ ) C KL πθold (π θ ) Constrained problem: maximize L πθold (π θ ) subject to KL πθold (π θ ) δ Often easier to set hyperparameter δ rather than C, can remain fixed over whole learning process We ll solve constrained quadratic problem: compute F 1 g, and then rescale step to get correct KL Take linear and quadratic constraint: 1 maximize θ g (θ θ old ) subject to 2 (θ θ old) T F (θ θ old ) δ Form Lagrangian L(θ, λ) = g (θ θold ) λ 2 [(θ θ old) T F (θ θ old ) δ] Differentiate wrt θ, get θ θold = 1 λ F 1 g To satisfy constraint, want 1 2 st Fs = δ. Given candidate step sunscaled, rescale to s = 2δ s unscaled (Fs unscaled ) s unscaled
14 TRPO: KL-Constrained Problem Compute s unscaled = F 1 g Rescale: s = 2δ s unscaled (Hs unscaled ) s unscaled Now do backtracking line search on original problem (before quadratic constraint) maximize L πθold (π θ ) 1[KL πθold (π θ ) δ] Use steps s, s/2, s/4,... until line search objective improves
15 TRPO Algorithm for iteration=1, 2,... do Run policy for T timesteps or N trajectories Estimate advantage function at all timesteps Compute policy gradient g Use CG (with Hessian-vector products) to compute F 1 g Compute rescaled step s = αf 1 g with rescaling and line search Apply update: θ = θ old + αf 1 g end for
16 Alternative Method for Calculating Natural Gradients Given parameterized probability density p θ (x) Fisher information matrix 2 θ KL[p θ old, p θ ] = E x pθold [ ( ) T ( θ(x))] θ log p θ(x) θ log p θ=θold FIM forms a metric on policy s parameter space, induced by KL divergence. Makes step invariant to reparameterization of coordinates (θ = f (θ)), whereas gradient is not invariant. In policy optimization setting, instead of forming Fisher by differentiating KL, can explictly form ( n log π θ θ(a n s n ) ) T ( log π θ θ(a n s n ) )
17 Proximal Policy Optimization Back to penalty instead of constraint maximize θ N n=1 π θ (a n s n ) π θold (a n s n )Ân C KL πθold (π θ ) Pseudocode: for iteration=1, 2,... do Run policy for T timesteps or N trajectories Estimate advantage function at all timesteps Do SGD on above objective for some number of epochs If KL too high, increase β. If KL too low, decrease β. end for same performance as TRPO, but only first-order optimization
18 Approximations in Supervised vs Reinforcement Learning Supervised learning Linear approximation given by gradient f (θ) f (θ0 ) + (θ θ 0 ) g Training loss approximates test loss Reinforcement learning (policy gradients) Linear approximation given by gradient of surrogate f (θ) f (θ 0 ) + (θ θ 0 ) g Training surrogate approximates test surrogate (sampled data is representative of visitation distribution) State distribution doesn t change much
19 Further Reading S. Kakade. A Natural Policy Gradient. NIPS S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. ICML J. Peters and S. Schaal. Natural actor-critic. Neurocomputing (2008) J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust Region Policy Optimization. ICML (2015) Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking Deep Reinforcement Learning for Continuous Control. ICML (2016) J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. Neural Networks: Tricks of the Trade. Springer, 2012
20 That s all. Questions?
Trust Region Policy Optimization
Trust Region Policy Optimization Yixin Lin Duke University yixin.lin@duke.edu March 28, 2017 Yixin Lin (Duke) TRPO March 28, 2017 1 / 21 Overview 1 Preliminaries Markov Decision Processes Policy iteration
More informationLecture 9: Policy Gradient II 1
Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John
More informationLecture 9: Policy Gradient II (Post lecture) 2
Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver
More informationReinforcement Learning via Policy Optimization
Reinforcement Learning via Policy Optimization Hanxiao Liu November 22, 2017 1 / 27 Reinforcement Learning Policy a π(s) 2 / 27 Example - Mario 3 / 27 Example - ChatBot 4 / 27 Applications - Video Games
More informationDeep Reinforcement Learning: Policy Gradients and Q-Learning
Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use
More informationPolicy Gradient Methods. February 13, 2017
Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length
More informationDeep Reinforcement Learning via Policy Optimization
Deep Reinforcement Learning via Policy Optimization John Schulman July 3, 2017 Introduction Deep Reinforcement Learning: What to Learn? Policies (select next action) Deep Reinforcement Learning: What to
More informationDeterministic Policy Gradient, Advanced RL Algorithms
NPFL122, Lecture 9 Deterministic Policy Gradient, Advanced RL Algorithms Milan Straka December 1, 218 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics
More informationVIME: Variational Information Maximizing Exploration
1 VIME: Variational Information Maximizing Exploration Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel Reviewed by Zhao Song January 13, 2017 1 Exploration in Reinforcement
More informationDeep Reinforcement Learning
Deep Reinforcement Learning John Schulman 1 MLSS, May 2016, Cadiz 1 Berkeley Artificial Intelligence Research Lab Agenda Introduction and Overview Markov Decision Processes Reinforcement Learning via Black-Box
More informationVariance Reduction for Policy Gradient Methods. March 13, 2017
Variance Reduction for Policy Gradient Methods March 13, 2017 Reward Shaping Reward Shaping Reward Shaping Reward shaping: r(s, a, s ) = r(s, a, s ) + γφ(s ) Φ(s) for arbitrary potential Φ Theorem: r admits
More informationReinforcement Learning
Reinforcement Learning Policy Search: Actor-Critic and Gradient Policy search Mario Martin CS-UPC May 28, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 28, 2018 / 63 Goal of this lecture So far
More informationReinforcement Learning for NLP
Reinforcement Learning for NLP Advanced Machine Learning for NLP Jordan Boyd-Graber REINFORCEMENT OVERVIEW, POLICY GRADIENT Adapted from slides by David Silver, Pieter Abbeel, and John Schulman Advanced
More informationO P T I M I Z I N G E X P E C TAT I O N S : T O S T O C H A S T I C C O M P U TAT I O N G R A P H S. john schulman. Summer, 2016
O P T I M I Z I N G E X P E C TAT I O N S : FROM DEEP REINFORCEMENT LEARNING T O S T O C H A S T I C C O M P U TAT I O N G R A P H S john schulman Summer, 2016 A dissertation submitted in partial satisfaction
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationBehavior Policy Gradient Supplemental Material
Behavior Policy Gradient Supplemental Material Josiah P. Hanna 1 Philip S. Thomas 2 3 Peter Stone 1 Scott Niekum 1 A. Proof of Theorem 1 In Appendix A, we give the full derivation of our primary theoretical
More informationLecture 8: Policy Gradient I 2
Lecture 8: Policy Gradient I 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver and John Schulman
More informationDeep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory
Deep Reinforcement Learning Jeremy Morton (jmorton2) November 7, 2016 SISL Stanford Intelligent Systems Laboratory Overview 2 1 Motivation 2 Neural Networks 3 Deep Reinforcement Learning 4 Deep Learning
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationCSC2541 Lecture 5 Natural Gradient
CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient 1 / 12 Motivation Two classes of optimization procedures used throughout ML (stochastic) gradient descent,
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationActor-critic methods. Dialogue Systems Group, Cambridge University Engineering Department. February 21, 2017
Actor-critic methods Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department February 21, 2017 1 / 21 In this lecture... The actor-critic architecture Least-Squares Policy Iteration
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationImplicit Incremental Natural Actor Critic
Implicit Incremental Natural Actor Critic Ryo Iwaki and Minoru Asada Osaka University, -1, Yamadaoka, Suita city, Osaka, Japan {ryo.iwaki,asada}@ams.eng.osaka-u.ac.jp Abstract. The natural policy gradient
More informationarxiv: v4 [cs.lg] 14 Oct 2018
Equivalence Between Policy Gradients and Soft -Learning John Schulman 1, Xi Chen 1,2, and Pieter Abbeel 1,2 1 OpenAI 2 UC Berkeley, EECS Dept. {joschu, peter, pieter}@openai.com arxiv:174.644v4 cs.lg 14
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationDueling Network Architectures for Deep Reinforcement Learning (ICML 2016)
Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology October 11, 2016 Outline
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationOptimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs
Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs John Schulman Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report
More informationLecture 6: CS395T Numerical Optimization for Graphics and AI Line Search Applications
Lecture 6: CS395T Numerical Optimization for Graphics and AI Line Search Applications Qixing Huang The University of Texas at Austin huangqx@cs.utexas.edu 1 Disclaimer This note is adapted from Section
More informationREINFORCEMENT LEARNING
REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents
More informationNew Insights and Perspectives on the Natural Gradient Method
1 / 18 New Insights and Perspectives on the Natural Gradient Method Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology March 13, 2018 Motivation 2 / 18
More informationCS Deep Reinforcement Learning HW5: Soft Actor-Critic Due November 14th, 11:59 pm
CS294-112 Deep Reinforcement Learning HW5: Soft Actor-Critic Due November 14th, 11:59 pm 1 Introduction For this homework, you get to choose among several topics to investigate. Each topic corresponds
More informationMulti-Batch Experience Replay for Fast Convergence of Continuous Action Control
1 Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control Seungyul Han and Youngchul Sung Abstract arxiv:1710.04423v1 [cs.lg] 12 Oct 2017 Policy gradient methods for direct policy
More informationNumerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09
Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods
More informationLecture 7: CS395T Numerical Optimization for Graphics and AI Trust Region Methods
Lecture 7: CS395T Numerical Optimization for Graphics and AI Trust Region Methods Qixing Huang The University of Texas at Austin huangqx@cs.utexas.edu 1 Disclaimer This note is adapted from Section 4 of
More informationGradient Methods for Markov Decision Processes
Gradient Methods for Markov Decision Processes Department of Computer Science University College London May 11, 212 Outline 1 Introduction Markov Decision Processes Dynamic Programming 2 Gradient Methods
More informationReinforcement Learning
Reinforcement Learning Inverse Reinforcement Learning LfD, imitation learning/behavior cloning, apprenticeship learning, IRL. Hung Ngo MLR Lab, University of Stuttgart Outline Learning from Demonstrations
More informationProximal Policy Optimization Algorithms
Proximal Policy Optimization Algorithms John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov OpenAI {joschu, filip, prafulla, alec, oleg}@openai.com arxiv:177.6347v2 [cs.lg] 28 Aug
More informationValue Function Methods. CS : Deep Reinforcement Learning Sergey Levine
Value Function Methods CS 294-112: Deep Reinforcement Learning Sergey Levine Class Notes 1. Homework 2 is due in one week 2. Remember to start forming final project groups and writing your proposal! Proposal
More informationLearning Dexterity Matthias Plappert SEPTEMBER 6, 2018
Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018 OpenAI OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence. OpenAI OpenAI is a non-profit
More informationReinforcement Learning
Reinforcement Learning Policy gradients Daniel Hennes 26.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Policy based reinforcement learning So far we approximated the action value
More informationProximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) default reinforcement learning algorithm at OpenAI Policy Gradient On-policy Off-policy Add constraint DeepMind https://youtu.be/gn4nrcc9twq OpenAI https://blog.openai.com/o
More informationTowards Generalization and Simplicity in Continuous Control
Towards Generalization and Simplicity in Continuous Control Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, Sham Kakade Paul Allen School of Computer Science University of Washington Seattle { aravraj,
More information6 Basic Convergence Results for RL Algorithms
Learning in Complex Systems Spring 2011 Lecture Notes Nahum Shimkin 6 Basic Convergence Results for RL Algorithms We establish here some asymptotic convergence results for the basic RL algorithms, by showing
More informationCustomised Pearlmutter Propagation: A Hardware Architecture for Trust Region Policy Optimisation
Customised Pearlmutter Propagation: A Hardware Architecture for Trust Region Policy Optimisation Shengjia Shao and Wayne Luk Department of Computing, Imperial College London E-mail: {shengjia.shao12, w.luk}@imperial.ac.uk
More informationarxiv: v2 [cs.lg] 8 Aug 2018
: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Tuomas Haarnoja 1 Aurick Zhou 1 Pieter Abbeel 1 Sergey Levine 1 arxiv:181.129v2 [cs.lg] 8 Aug 218 Abstract Model-free deep
More informationMisc. Topics and Reinforcement Learning
1/79 Misc. Topics and Reinforcement Learning http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: some parts of this slides is based on Prof. Mengdi Wang s, Prof. Dimitri Bertsekas and Prof.
More informationSoft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel & Sergey Levine Department of Electrical Engineering and Computer
More informationOptimization for neural networks
0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make
More informationOPER 627: Nonlinear Optimization Lecture 14: Mid-term Review
OPER 627: Nonlinear Optimization Lecture 14: Mid-term Review Department of Statistical Sciences and Operations Research Virginia Commonwealth University Oct 16, 2013 (Lecture 14) Nonlinear Optimization
More informationCombining PPO and Evolutionary Strategies for Better Policy Search
Combining PPO and Evolutionary Strategies for Better Policy Search Jennifer She 1 Abstract A good policy search algorithm needs to strike a balance between being able to explore candidate policies and
More informationReal Time Value Iteration and the State-Action Value Function
MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing
More informationCSC321 Lecture 22: Q-Learning
CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize
More informationIPAM Summer School Optimization methods for machine learning. Jorge Nocedal
IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationModule 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo
Module 8 Linear Programming CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Policy Optimization Value and policy iteration Iterative algorithms that implicitly solve
More informationarxiv: v5 [cs.lg] 9 Sep 2016
HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan and Pieter Abbeel Department of Electrical Engineering and Computer
More informationTrust-Region Optimization Methods Using Limited-Memory Symmetric Rank-One Updates for Off-The-Shelf Machine Learning
Trust-Region Optimization Methods Using Limited-Memory Symmetric Rank-One Updates for Off-The-Shelf Machine Learning by Jennifer B. Erway, Joshua Griffin, Riadh Omheni, and Roummel Marcia Technical report
More informationarxiv: v3 [cs.ai] 22 Feb 2018
TRUST-PCL: AN OFF-POLICY TRUST REGION METHOD FOR CONTINUOUS CONTROL Ofir Nachum, Mohammad Norouzi, Kelvin Xu, & Dale Schuurmans {ofirnachum,mnorouzi,kelvinxx,schuurmans}@google.com Google Brain ABSTRACT
More informationDeep Learning & Neural Networks Lecture 4
Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly
More informationComparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control
Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control Shangtong Zhang Dept. of Computing Science University of Alberta shangtong.zhang@ualberta.ca Osmar R. Zaiane Dept. of
More informationLecture 8: Policy Gradient
Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve
More informationarxiv: v2 [cs.lg] 7 Mar 2018
Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control arxiv:1712.00006v2 [cs.lg] 7 Mar 2018 Shangtong Zhang 1, Osmar R. Zaiane 2 12 Dept. of Computing Science, University
More informationReinforcement Learning
Reinforcement Learning Cyber Rodent Project Some slides from: David Silver, Radford Neal CSC411: Machine Learning and Data Mining, Winter 2017 Michael Guerzhoy 1 Reinforcement Learning Supervised learning:
More informationReinforcement Learning
Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University
More informationImproving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationGenerative Adversarial Imitation Learning
Generative Adversarial Imitation Learning Jonathan Ho OpenAI hoj@openai.com Stefano Ermon Stanford University ermon@cs.stanford.edu Abstract Consider learning a policy from example expert behavior, without
More informationNonlinear Optimization for Optimal Control
Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]
More informationSuppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.
Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of
More informationTrust Region Evolution Strategies
Trust Region Evolution Strategies Guoqing Liu, Li Zhao, Feidiao Yang, Jiang Bian, Tao Qin, Nenghai Yu, Tie-Yan Liu University of Science and Technology of China Microsoft Research Institute of Computing
More informationReinforcement Learning
Reinforcement Learning From the basics to Deep RL Olivier Sigaud ISIR, UPMC + INRIA http://people.isir.upmc.fr/sigaud September 14, 2017 1 / 54 Introduction Outline Some quick background about discrete
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationNormalization Techniques in Training of Deep Neural Networks
Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,
More informationDiscrete Variables and Gradient Estimators
iscrete Variables and Gradient Estimators This assignment is designed to get you comfortable deriving gradient estimators, and optimizing distributions over discrete random variables. For most questions,
More informationConstraint-Space Projection Direct Policy Search
Constraint-Space Projection Direct Policy Search Constraint-Space Projection Direct Policy Search Riad Akrour 1 riad@robot-learning.de Jan Peters 1,2 jan@robot-learning.de Gerhard Neumann 1,2 geri@robot-learning.de
More informationSubsampled Hessian Newton Methods for Supervised
1 Subsampled Hessian Newton Methods for Supervised Learning Chien-Chih Wang, Chun-Heng Huang, Chih-Jen Lin Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan Keywords: Subsampled
More informationTowards Generalization and Simplicity in Continuous Control
Towards Generalization and Simplicity in Continuous Control Aravind Rajeswaran Kendall Lowrey Emanuel Todorov Sham Kakade University of Washington Seattle { aravraj, klowrey, todorov, sham } @ cs.washington.edu
More informationReinforcement Learning
Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques
More informationarxiv: v1 [cs.lg] 31 May 2016
Curiosity-driven Exploration in Deep Reinforcement Learning via Bayesian Neural Networks arxiv:165.9674v1 [cs.lg] 31 May 216 Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel
More informationCSC 411 Lecture 17: Support Vector Machine
CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17
More informationarxiv: v1 [stat.ml] 6 Nov 2018
Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? Andrew Ilyas 1, Logan Engstrom 1, Shibani Santurkar 1, Dimitris Tsipras 1, Firdaus Janoos 2, Larry Rudolph 1,2, and Aleksander Mądry
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More information(Deep) Reinforcement Learning
Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015
More informationDevelopment of a Deep Recurrent Neural Network Controller for Flight Applications
Development of a Deep Recurrent Neural Network Controller for Flight Applications American Control Conference (ACC) May 26, 2017 Scott A. Nivison Pramod P. Khargonekar Department of Electrical and Computer
More informationPolicy Gradient Reinforcement Learning for Robotics
Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology May 9, 2017 Outline Stochastic Gradient Estimation
More informationarxiv: v1 [cs.lg] 13 Dec 2018
Soft Actor-Critic Algorithms and Applications Tuomas Haarnoja Aurick Zhou Kristian Hartikainen George Tucker arxiv:1812.05905v1 [cs.lg] 13 Dec 2018 Sehoon Ha Jie Tan Vikash Kumar Henry Zhu Abhishek Gupta
More informationMachine Learning. Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING. Slides adapted from Tom Mitchell and Peter Abeel
Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING Slides adapted from Tom Mitchell and Peter Abeel Machine Learning: Jordan Boyd-Graber UMD Machine Learning
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationTrust Regions. Charles J. Geyer. March 27, 2013
Trust Regions Charles J. Geyer March 27, 2013 1 Trust Region Theory We follow Nocedal and Wright (1999, Chapter 4), using their notation. Fletcher (1987, Section 5.1) discusses the same algorithm, but
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationReinforcement Learning
Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation
More information