
1 Learning Dexterity Matthias Plappert SEPTEMBER 6, 2018

2 OpenAI OpenAI is a non-profit AI research company, discovering and enacting the path to safe artificial general intelligence.


6 OpenAI Charter: Broadly Distributed Benefits, Long-Term Safety, Technical Leadership, Cooperative Orientation. Full text available on our blog.

7 Learning Dexterity


11 Reinforcement Learning: Go (AlphaGo Zero), Dota 2 (OpenAI Five)

12 Reinforcement Learning for Robotics (1) Rajeswaran et al. (2017)

13 Reinforcement Learning for Robotics (2) Kumar et al. (2016)

14 Reinforcement Learning for Robotics (3) Levine et al. (2018)

15 Sim2Real: simulated environments for training, then transfer to real robot hardware.

16 Transfer


18 Task & Setup


22 Shadow Dexterous Hand

23 PhaseSpace tracking

24 Three RGB cameras: top, left, and right.

25 Challenges: reinforcement learning for the real world, high-dimensional control, noisy and partial observations, manipulating more than one object.

26 Reinforcement Learning + Domain Randomization

27 Reinforcement Learning

28 Reinforcement Learning (1): The agent sends an action a_t to the environment and receives the next state s_{t+1} and reward r_t.

29 Reinforcement Learning (2): Formalize as a Markov decision process M = (S, A, P, ρ, r), where S is the set of states, A the set of actions, s_{t+1} ~ P(· | s_t, a_t) the transition probabilities, s_0 ~ ρ(·) the initial state distribution, and r : S × A → R the reward function. The agent uses a policy to select actions: a_t ~ π(· | s_t).

36 Reinforcement Learning (3): Let τ denote a trajectory with s_0 ~ ρ(·), a_t ~ π(· | s_t), s_{t+1} ~ P(· | s_t, a_t). The discounted return is then defined as R(τ) := Σ_t γ^t r(s_t, a_t), with γ ∈ [0, 1). We wish to find a policy that maximizes the expected discounted return: π* := argmax_π E_τ[R(τ)].
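As a quick sanity check of the return definition, here is a minimal Python sketch (illustrative, not from the talk) that computes the discounted return of a list of rewards:

def discounted_return(rewards, gamma=0.99):
    # R(tau) = sum_t gamma^t * r(s_t, a_t)
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: three steps of reward 1.0 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))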

39 Reinforcement Learning (4): Depending on the assumptions, many methods exist to find optimal policies, e.g. dynamic programming, policy gradient methods, and Q-learning. This talk focuses on policy gradient methods. Policy gradients are model-free, i.e. we do not need to know the transition probabilities.

40 Policy Gradients (1): Let's assume a parameterized policy π_θ, where θ is some parameter vector. Our optimization objective then is J(θ) := E_τ[R(τ)]; the return R(τ) has no direct dependency on θ, but the expectation depends on θ through the trajectory distribution. Simple idea: compute the gradient with respect to θ and do gradient ascent.

44 Policy Gradients (2): Goal: compute the gradient ∇_θ J(θ) = ∇_θ E_τ[R(τ)]. Expanding the expectation and rearranging we get ∇_θ J(θ) = ∇_θ ∫ R(τ) p(τ) dτ = ∫ R(τ) ∇_θ p(τ) dτ.

47 Policy Gradients (3): Goal: compute ∇_θ J(θ) = ∫ R(τ) ∇_θ p(τ) dτ. Use the log-derivative trick: ∇_θ log p(τ) = (1 / p(τ)) ∇_θ p(τ), hence ∇_θ p(τ) = ∇_θ log p(τ) · p(τ). Plugging back in, we obtain ∇_θ J(θ) = ∫ R(τ) ∇_θ log p(τ) p(τ) dτ = E_τ[R(τ) ∇_θ log p(τ)].

52 Policy Gradients (4): Goal: compute ∇_θ log p(τ). The probability of a trajectory τ is p(τ) = ρ(s_0) Π_t P(s_{t+1} | s_t, a_t) π_θ(a_t | s_t), so log p(τ) = log ρ(s_0) + Σ_t [log P(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t)]. Since the initial-state and dynamics terms do not depend on θ, taking the gradient leaves ∇_θ log p(τ) = Σ_t ∇_θ log π_θ(a_t | s_t).

56 Policy Gradients (5): Putting it all together: ∇_θ J(θ) = ∇_θ E_τ[R(τ)] = ∫ R(τ) ∇_θ p(τ) dτ = E_τ[R(τ) ∇_θ log p(τ)] = E_τ[R(τ) Σ_t ∇_θ log π_θ(a_t | s_t)]. Last missing piece: computing the expectation.

61 Policy Gradients (6): The expectation in ∇_θ J(θ) = E_τ[R(τ) Σ_t ∇_θ log π_θ(a_t | s_t)] can be estimated using Monte Carlo sampling, which corresponds to rolling out the policy multiple times to collect N trajectories τ^(1), τ^(2), …, τ^(N). Our final estimate of the policy gradient is thus ∇_θ J(θ) ≈ (1/N) Σ_{n=1}^{N} R(τ^(n)) Σ_t ∇_θ log π_θ(a_t^(n) | s_t^(n)).

64 Policy Gradients (7): The vanilla policy gradient algorithm:
1. Initialize θ arbitrarily.
2. Repeat:
   a. Collect N trajectories τ^(1), τ^(2), …, τ^(N).
   b. Estimate the gradient: g ≈ (1/N) Σ_{n=1}^{N} R(τ^(n)) Σ_t ∇_θ log π_θ(a_t^(n) | s_t^(n)).
   c. Update the parameters: θ ← θ + α g.
A runnable sketch of this loop follows.
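Below is a self-contained Python sketch of the algorithm, using a softmax policy on a made-up two-state toy MDP (the environment, episode length, and hyperparameters are all illustrative assumptions, not the talk's setup):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 2 states, 2 actions. Taking action 1 in state 0
# yields reward 1 and moves to state 1; everything else yields 0 and
# returns to state 0. Episodes last 10 steps.
def step(s, a):
    if s == 0 and a == 1:
        return 1, 1.0
    return 0, 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros((2, 2))        # one row of action logits per state
gamma, alpha, N = 0.99, 0.1, 50

for iteration in range(200):
    g = np.zeros_like(theta)
    for _ in range(N):                           # a. collect N trajectories
        s, ret, grad_logp = 0, 0.0, np.zeros_like(theta)
        for t in range(10):
            probs = softmax(theta[s])
            a = rng.choice(2, p=probs)
            grad_logp[s] += np.eye(2)[a] - probs # grad of log pi(a|s) w.r.t. logits
            s, r = step(s, a)
            ret += gamma**t * r
        g += ret * grad_logp                     # b. R(tau) * sum_t grad log pi
    theta += alpha * g / N                       # c. gradient ascent step

print(softmax(theta[0]))  # state 0 should now strongly prefer action 1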

65 Proximal Policy Optimization: The vanilla policy gradient algorithm is simple but has several shortcomings. We use Proximal Policy Optimization (PPO) in all our experiments (Schulman et al., 2017). The underlying idea is exactly the same, but PPO adds: baselines for variance reduction with generalized advantage estimation, importance sampling to use slightly off-policy data, and clipping to improve stability.
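The clipped surrogate objective at the core of PPO fits in a few lines; here is a sketch following Schulman et al. (2017), with array shapes and the epsilon value as illustrative assumptions:

import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    # Importance sampling ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = np.exp(logp_new - logp_old)
    # Clipping removes the incentive to push the ratio outside [1 - eps, 1 + eps],
    # keeping the update close to the policy that collected the data.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # PPO maximizes the elementwise minimum of the two surrogates.
    return np.minimum(ratio * advantages, clipped * advantages).mean()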

66 Reward Function: r_t = d_t − d_{t+1}, where d_t and d_{t+1} denote the rotation angle between the desired and current orientation before and after the transition. On success: +5. On drop: −20.
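With orientations stored as unit quaternions, the rotation angle d can be computed as below (a sketch; the talk does not spell out the implementation):

import numpy as np

def rotation_angle(q_a, q_b):
    # Angle of the relative rotation between two unit quaternions;
    # abs() accounts for the double cover (q and -q are the same rotation).
    dot = abs(np.dot(q_a, q_b))
    return 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))

def rotation_reward(d_t, d_t_plus_1):
    # r_t = d_t - d_{t+1}: positive when the transition reduced the angle to the goal.
    return d_t - d_t_plus_1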

67 Policy Architecture: Noisy observations (fingertip positions, object pose) and the goal are normalized, passed through a fully-connected ReLU layer and an LSTM, and the network outputs an action distribution over finger joint positions.
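A rough PyTorch rendering of this stack (the layer sizes, normalization choice, and discretized action head are assumptions for illustration; the talk does not specify them):

import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim, goal_dim, n_joints, n_bins, hidden=256):
        super().__init__()
        self.norm = nn.LayerNorm(obs_dim + goal_dim)      # input normalization
        self.fc = nn.Sequential(nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_joints * n_bins)  # action distribution logits

    def forward(self, obs, goal, state=None):
        x = self.norm(torch.cat([obs, goal], dim=-1))     # (batch, time, features)
        x, state = self.lstm(self.fc(x), state)
        return self.head(x), state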

68 Distributed Training with Rapid: Workers generate rollouts and exchange parameters with the optimizer; training used 6,000 CPU cores and 8 GPUs.

69 Transfer


71 Domain Randomization

72 Domain Randomization (1) Sadeghi & Levine (2017)

73 Domain Randomization (2) Tobin et al. (2017)


75 Vision Randomizations

76 Vision Architecture: Each of the three cameras feeds its own Conv → Pool → ResNet → SSM (spatial softmax) tower; the tower outputs are concatenated and passed through a fully-connected ReLU layer to predict the object position and rotation.
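A schematic PyTorch version of one possible reading of this diagram (channel counts and the ResNet stand-in are assumptions; SSM is interpreted as a spatial softmax returning expected 2-D feature coordinates):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    # Turns each feature map into the expected (x, y) location of its activation.
    def forward(self, x):
        b, c, h, w = x.shape
        attn = F.softmax(x.view(b, c, h * w), dim=-1).view(b, c, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        ey = (attn.sum(dim=3) * ys).sum(dim=2)   # expected y per channel
        ex = (attn.sum(dim=2) * xs).sum(dim=2)   # expected x per channel
        return torch.cat([ex, ey], dim=1)        # shape (b, 2 * c)

def camera_tower(out_channels=32):
    # Conv -> Pool -> (stand-in for ResNet blocks) -> SSM, one tower per camera.
    return nn.Sequential(
        nn.Conv2d(3, out_channels, 5, padding=2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(),
        SpatialSoftmax(),
    )

class VisionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.towers = nn.ModuleList(camera_tower() for _ in range(3))
        self.fc = nn.Sequential(nn.Linear(3 * 64, 128), nn.ReLU())
        self.pos_head = nn.Linear(128, 3)   # object position
        self.rot_head = nn.Linear(128, 4)   # object rotation (quaternion)

    def forward(self, cams):                # cams: list of 3 image batches
        feats = torch.cat([t(c) for t, c in zip(self.towers, cams)], dim=1)
        h = self.fc(feats)
        return self.pos_head(h), self.rot_head(h)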

77 Physics Randomizations (1) Peng et al. (2018)

78 Physics Randomizations (2): object dimensions, object and robot link masses, surface friction coefficients, robot joint damping coefficients, actuator force gains, joint limits, gravity vector. A sampling sketch follows.
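One common way to implement such randomizations is to rescale each calibrated simulator parameter by a random factor at the start of every episode; here is a sketch with made-up parameter names and ranges (not the paper's values):

import numpy as np

rng = np.random.default_rng()

# Illustrative nominal values and multiplicative randomization ranges.
NOMINAL = {"object_mass": 0.05, "friction": 1.0, "joint_damping": 0.1}
RANGE = {"object_mass": 1.5, "friction": 2.0, "joint_damping": 2.0}

def sample_physics():
    # Scale each parameter by a log-uniform factor in [1/r, r].
    params = {}
    for name, nominal in NOMINAL.items():
        log_r = np.log(RANGE[name])
        params[name] = nominal * np.exp(rng.uniform(-log_r, log_r))
    return params

# A fresh set of physics parameters is drawn for every training episode.
print(sample_physics())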

79 Full System Recap


82 Results


84 Learned Strategies

85 Learned Grasps: Tip, Palmar, Tripod, Quadpod, Power, 5-Finger Precision.

86 Quantitative Results


89 Ablation of Randomizations

90 Training Time

91 Effect of Memory

92 Surprises: Tactile sensing is not necessary to manipulate real-world objects. Randomizations developed for one object generalize to others with similar properties. With physical robots, good systems engineering is as important as good algorithms.

93 Challenges Ahead: less manual tuning in domain randomization, making use of demonstrations, faster learning in the real world.

94 Blog Post and Pre-print

95 Thank You. Visit openai.com for more information, or find us on Twitter.

96 References
Levine et al. "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection." The International Journal of Robotics Research (2018).
Peng et al. "Sim-to-real transfer of robotic control with dynamics randomization." arXiv preprint (2017).
Tobin et al. "Domain randomization for transferring deep neural networks from simulation to the real world." Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017.
Rajeswaran et al. "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations." arXiv preprint (2017).
Kumar, Vikash, Emanuel Todorov, and Sergey Levine. "Optimal control with learned local models: Application to dexterous manipulation." Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016.
Schulman et al. "Proximal policy optimization algorithms." arXiv preprint (2017).
Sadeghi & Levine. "CAD2RL: Real single-image flight without a single real image." arXiv preprint (2016).
