Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018


OpenAI OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence.


OpenAI Charter Broadly Distributed Benefits, Long-Term Safety, Technical Leadership, Cooperative Orientation. Full text available on our blog: https://blog.openai.com/openai-charter/

Learning Dexterity

Reinforcement Learning Go (AlphaGo Zero), Dota 2 (OpenAI Five)

Reinforcement Learning for Robotics (1) Rajeswaran et al. (2017)

Reinforcement Learning for Robotics (2) Kumar et al. (2016)

Reinforcement Learning for Robotics (3) Levine et al. (2018)

Sim2Real Simulated environments for training → Transfer → Real robot hardware

Transfer


Task & Setup

Shadow Dexterous Hand

PhaseSpace tracking

Top RGB camera Right RGB camera Left RGB camera

Challenges Reinforcement learning for the real world, high-dimensional control, noisy and partial observations, manipulating more than one object

Reinforcement Learning + Domain Randomization

Reinforcement Learning

Reinforcement Learning (1) The agent sends an action a_t to the environment; the environment returns the next state s_{t+1} and a reward r_t.

Reinforcement Learning (2) Formalize the problem as a Markov decision process M = (S, A, P, ρ, r): a set of states S, a set of actions A, transition probabilities s_{t+1} ~ P(· | s_t, a_t), an initial state distribution s_0 ~ ρ(·), and a reward function r : S × A → ℝ. The agent uses a policy to select actions: a_t ~ π(· | s_t).
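
To make the MDP tuple concrete, here is a minimal toy example in Python. The states, actions, probabilities, and rewards below are invented purely for illustration and have nothing to do with the hand manipulation task.

```python
import random

# Toy MDP M = (S, A, P, rho, r); all names and numbers are illustrative.
S = ["s0", "s1"]
A = ["left", "right"]

# Transition probabilities P(s' | s, a) as nested dicts.
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.1, "s1": 0.9},
}

rho = {"s0": 1.0, "s1": 0.0}                                 # initial state distribution
r = lambda s, a: 1.0 if (s, a) == ("s1", "right") else 0.0   # reward function r(s, a)

def step(s, a):
    """Sample s_{t+1} ~ P(. | s_t, a_t) and return (s_{t+1}, r(s_t, a_t))."""
    next_states, probs = zip(*P[(s, a)].items())
    return random.choices(next_states, weights=probs)[0], r(s, a)
```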

Reinforcement Learning (3) Let τ denote a trajectory with s_0 ~ ρ(·), a_t ~ π(· | s_t), s_{t+1} ~ P(· | s_t, a_t). The discounted return is then defined as R(τ) := Σ_t γ^t r(s_t, a_t) with γ ∈ [0, 1). We wish to find a policy that maximizes the expected discounted return: π* := arg max_π E_τ[R(τ)].
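
As a small sketch of these definitions, the snippet below rolls out a policy to collect a trajectory and computes its discounted return; `reset`, `step`, and `policy` are assumed to be generic callables (for instance the toy MDP above together with any function mapping states to actions).

```python
def rollout(reset, step, policy, horizon=100):
    """Collect one trajectory tau as a list of (s_t, a_t, r_t) tuples."""
    s = reset()                        # s_0 ~ rho(.)
    traj = []
    for _ in range(horizon):
        a = policy(s)                  # a_t ~ pi(. | s_t)
        s_next, reward = step(s, a)    # s_{t+1} ~ P(. | s_t, a_t)
        traj.append((s, a, reward))
        s = s_next
    return traj

def discounted_return(traj, gamma=0.99):
    """R(tau) = sum_t gamma^t r(s_t, a_t)."""
    return sum(gamma ** t * reward for t, (_, _, reward) in enumerate(traj))

def estimate_objective(reset, step, policy, n=100):
    """Monte Carlo estimate of J = E_tau[R(tau)]."""
    return sum(discounted_return(rollout(reset, step, policy)) for _ in range(n)) / n
```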

Reinforcement Learning (4) Depending on the assumptions, many methods exist for finding optimal policies, for example dynamic programming, policy gradient methods, and Q-learning. This talk focuses on policy gradient methods. Policy gradients are model-free, i.e. they do not require knowledge of the transition distribution.

Policy Gradients (1) Let's assume a parameterized policy π_θ, where θ is some parameter vector. Our optimization objective then is J(θ) := E_τ[R(τ)]; the expectation depends on θ through τ, while R itself has no direct dependency on θ. Simple idea: compute the gradient with respect to θ and do gradient ascent.

Policy Gradients (2) Goal: compute the gradient ∇_θ J(θ) = ∇_θ E_τ[R(τ)]. Expanding the expectation and rearranging, we get ∇_θ J(θ) = ∇_θ ∫ R(τ) p(τ) dτ = ∫ R(τ) ∇_θ p(τ) dτ.

Policy Gradients (3) Goal: compute ∇_θ J(θ) = ∫ R(τ) ∇_θ p(τ) dτ. Use the "log derivative trick": ∇_θ log p(τ) = (1 / p(τ)) ∇_θ p(τ), hence ∇_θ p(τ) = p(τ) ∇_θ log p(τ). Plugging this back in, we obtain ∇_θ J(θ) = ∫ R(τ) ∇_θ log p(τ) p(τ) dτ = E_τ[R(τ) ∇_θ log p(τ)].

Policy Gradients (4) Goal: compute ∇_θ log p(τ). The probability of a trajectory τ is p(τ) = ρ(s_0) Π_t P(s_{t+1} | s_t, a_t) π_θ(a_t | s_t), so log p(τ) = log ρ(s_0) + Σ_t [log P(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t)]. Taking the gradient, the initial-state and dynamics terms do not depend on θ and vanish: ∇_θ log p(τ) = Σ_t ∇_θ log π_θ(a_t | s_t).

Policy Gradients (5) Putting it all together: ∇_θ J(θ) = ∇_θ E_τ[R(τ)] = ∫ R(τ) ∇_θ p(τ) dτ = E_τ[R(τ) ∇_θ log p(τ)] = E_τ[R(τ) Σ_t ∇_θ log π_θ(a_t | s_t)]. Last missing piece: computing the expectation.

Policy Gradients (6) Goal: compute the expectation in ∇_θ J(θ) = E_τ[R(τ) Σ_t ∇_θ log π_θ(a_t | s_t)]. It can be estimated using Monte Carlo sampling, which corresponds to rolling out the policy multiple times to collect N trajectories τ^(1), τ^(2), ..., τ^(N). Our final estimate of the policy gradient is thus ∇_θ J(θ) ≈ (1/N) Σ_{n=1}^{N} [R(τ^(n)) Σ_t ∇_θ log π_θ(a_t^(n) | s_t^(n))].

Policy Gradients (7) The vanilla policy gradient algorithm:
1. Initialize θ arbitrarily.
2. Repeat:
   a. Collect N trajectories τ^(1), τ^(2), ..., τ^(N).
   b. Estimate the gradient: g ≈ (1/N) Σ_{n=1}^{N} [R(τ^(n)) Σ_t ∇_θ log π_θ(a_t^(n) | s_t^(n))].
   c. Update the parameters: θ ← θ + α g.
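
The loop above maps almost line by line onto code. Below is a minimal REINFORCE sketch in PyTorch, assuming a gym-style environment `env` with a discrete action space; the network size, learning rate, N, and discount factor are illustrative choices, not the settings used for the hand.

```python
import torch
from torch.distributions import Categorical

obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n   # assumed gym-style env
N, gamma = 16, 0.99
policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(), torch.nn.Linear(64, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):
    surrogate_terms = []
    for _ in range(N):                                       # 2a. collect N trajectories
        obs, done, log_probs, rewards = env.reset(), False, [], []
        while not done:
            dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))          # log pi_theta(a_t | s_t)
            obs, reward, done, _ = env.step(action.item())
            rewards.append(reward)
        R = sum(gamma ** t * rew for t, rew in enumerate(rewards))   # R(tau)
        surrogate_terms.append(R * torch.stack(log_probs).sum())
    loss = -torch.stack(surrogate_terms).mean()              # 2b. gradient estimate via autograd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                         # 2c. gradient ascent step on J(theta)
```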

Proximal Policy Optimization The vanilla policy gradient algorithm is simple but has several shortcomings. We use Proximal Policy Optimization (PPO) in all our experiments (Schulman et al., 2017). The underlying idea is exactly the same, but PPO adds: baselines for variance reduction with generalized advantage estimation, importance sampling to use slightly off-policy data, and clipping to improve stability.
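
For reference, a sketch of PPO's clipped surrogate loss is shown below. It assumes the rollout data (observations, actions, old log-probabilities, and advantage estimates such as GAE) have already been computed, and that `policy(obs)` returns a torch distribution; this illustrates the clipping idea, not the exact implementation used in the project.

```python
import torch

def ppo_clip_loss(policy, obs, actions, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from Schulman et al. (2017), negated for gradient descent."""
    dist = policy(obs)                                # assumed to return a torch distribution
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)      # importance sampling ratio pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic bound: take the element-wise minimum of the two surrogates.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```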

Reward Function r_t = d_t − d_{t+1}, where d_t and d_{t+1} denote the rotation angle between the desired and the current object orientation before and after the transition. On success: +5. On drop: −20.
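
A minimal sketch of this reward, with the success and drop bonuses folded in; the function signature is hypothetical and only mirrors the description above.

```python
def rotation_reward(d_t, d_t1, success=False, dropped=False):
    """d_t, d_t1: rotation angle between desired and current object orientation
    before and after the transition (radians)."""
    r = d_t - d_t1          # positive when the object rotated closer to the goal
    if success:
        r += 5.0            # goal orientation reached
    if dropped:
        r -= 20.0           # object fell off the hand
    return r
```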

Policy Architecture Noisy observations (fingertip positions, object pose) and the goal are normalized, passed through a fully-connected ReLU layer and an LSTM, and mapped to an action distribution over finger joint positions.
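
A rough PyTorch sketch of that architecture is given below. The layer sizes, the LayerNorm stand-in for observation normalization, and the Gaussian action head are assumptions for illustration, not the exact model from the paper.

```python
import torch
import torch.nn as nn

class PolicySketch(nn.Module):
    """Normalization -> fully-connected ReLU -> LSTM -> action distribution."""

    def __init__(self, obs_dim, goal_dim, act_dim, hidden=1024, lstm_size=512):
        super().__init__()
        self.norm = nn.LayerNorm(obs_dim + goal_dim)          # stand-in for observation normalization
        self.fc = nn.Sequential(nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, lstm_size, batch_first=True)
        self.mean = nn.Linear(lstm_size, act_dim)             # per-joint action means
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, goal, hidden_state=None):
        # obs, goal: (batch, time, features); returns a distribution over joint actions.
        x = self.norm(torch.cat([obs, goal], dim=-1))
        x, hidden_state = self.lstm(self.fc(x), hidden_state)
        dist = torch.distributions.Normal(self.mean(x), self.log_std.exp())
        return dist, hidden_state
```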

Distributed Training with Rapid Rollout workers generate experience and send it to the optimizer; the optimizer sends updated parameters back to the workers. Scale: 6,000 CPU cores and 8 GPUs.

Transfer


Domain Randomization

Domain Randomization (1) Sadeghi & Levine (2017)

Domain Randomization (2) Tobin et al. (2017)

Vision Randomizations

Vision Architecture Each camera (Camera 1, Camera 2, Camera 3) is processed by its own Conv → Pool → ResNet → SSM (spatial softmax) stack; the three streams are concatenated, passed through a fully-connected ReLU layer, and mapped to the object position and object rotation.
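
A sketch of that three-camera pose estimator in PyTorch follows; the convolutional trunk, channel counts, and output parameterization (position plus quaternion) are illustrative assumptions rather than the exact network.

```python
import torch
import torch.nn as nn

class SpatialSoftmax(nn.Module):
    """Turns each feature map into expected (x, y) keypoint coordinates."""
    def forward(self, feat):                                   # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        probs = torch.softmax(feat.view(b, c, -1), dim=-1).view(b, c, h, w)
        ys = torch.linspace(-1, 1, h, device=feat.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=feat.device).view(1, 1, 1, w)
        return torch.cat([(probs * xs).sum((2, 3)), (probs * ys).sum((2, 3))], dim=1)

class VisionSketch(nn.Module):
    """Per-camera Conv -> Pool -> (ResNet-style) trunk -> spatial softmax, then concat -> ReLU -> pose."""
    def __init__(self, channels=32):
        super().__init__()
        def trunk():
            return nn.Sequential(
                nn.Conv2d(3, channels, 5, stride=2, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),   # stand-in for ResNet blocks
                SpatialSoftmax(),
            )
        self.trunks = nn.ModuleList([trunk() for _ in range(3)])          # one per camera
        self.head = nn.Sequential(nn.Linear(3 * 2 * channels, 128), nn.ReLU())
        self.position = nn.Linear(128, 3)                                 # (x, y, z)
        self.rotation = nn.Linear(128, 4)                                 # quaternion

    def forward(self, cams):                                              # cams: list of 3 (B, 3, H, W) tensors
        feats = torch.cat([t(c) for t, c in zip(self.trunks, cams)], dim=1)
        h = self.head(feats)
        return self.position(h), self.rotation(h)
```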

Physics Randomizations (1) Peng et al. (2018)

Physics Randomizations (2) Object dimensions, object and robot link masses, surface friction coefficients, robot joint damping coefficients, actuator force gains, joint limits, gravity vector.
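
As a sketch of how such randomizations can be sampled per episode, the snippet below draws one set of physics parameters; the parameter names, ranges, and the log-uniform choice for scale-like factors are illustrative assumptions, not the values from the paper.

```python
import numpy as np

PHYSICS_RANGES = {
    "object_scale":    (0.95, 1.05),   # multiplicative factor on nominal object dimensions
    "link_mass_scale": (0.5, 1.5),
    "friction_scale":  (0.7, 1.3),
    "joint_damping":   (0.3, 3.0),
    "actuator_gain":   (0.75, 1.5),
    "gravity_z":       (-10.3, -9.3),  # m/s^2
}

def sample_physics(rng: np.random.Generator) -> dict:
    """Draw one randomized set of physics parameters for a training episode."""
    params = {}
    for name, (lo, hi) in PHYSICS_RANGES.items():
        if name.endswith("_scale") or name in ("joint_damping", "actuator_gain"):
            # log-uniform for multiplicative, scale-like factors
            params[name] = float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
        else:
            params[name] = float(rng.uniform(lo, hi))
    return params

randomized = sample_physics(np.random.default_rng(0))
```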

Full System Recap

Results


Learned Strategies

Learned Grasps Tip Pinch, Palmar Pinch, Tripod, Quadpod, Power Grasp, 5-Finger Precision Grasp

Quantitative Results


Ablation of Randomizations

Training Time

Effect of Memory

Surprises Tactile sensing is not necessary to manipulate real-world objects. Randomizations developed for one object generalize to others with similar properties. With physical robots, good systems engineering is as important as good algorithms.

Challenges Ahead Less manual tuning in domain randomization. Making use of demonstrations. Faster learning in the real world.

Blog post: https://blog.openai.com/learning-dexterity/ Pre-print: https://arxiv.org/pdf/1808.00177.pdf

Thank You Visit openai.com for more information. FOLLOW @OPENAI ON TWITTER

References
Levine et al. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." The International Journal of Robotics Research 37.4-5 (2018): 421-436.
Peng et al. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint arXiv:1710.06537 (2017).
Tobin et al. "Domain randomization for transferring deep neural networks from simulation to the real world." Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017.
Rajeswaran et al. "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations." arXiv preprint arXiv:1709.10087 (2017).
Kumar, Vikash, Emanuel Todorov, and Sergey Levine. "Optimal control with learned local models: Application to dexterous manipulation." Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016.
Schulman et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
Sadeghi & Levine. "CAD2RL: Real single-image flight without a single real image." arXiv preprint arXiv:1611.04201 (2016).