Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
Yaakov Engel
Joint work with Peter Szabo and Dmitry Volkinshtein (ex. Technion)
Why use GPs in RL?
- A Bayesian approach to value estimation
- Forces us to make our assumptions explicit
- Non-parametric: priors are placed and inference is performed directly in function space (kernels)
- Domain knowledge intuitively coded into priors
- Provides a full posterior, not just point estimates
- Efficient, on-line implementation, suitable for large problems
Markov Decision Processes
- $\mathcal{X}$: state space; $\mathcal{U}$: action space
- Transitions: $p : \mathcal{X} \times \mathcal{X} \times \mathcal{U} \to [0,1]$, $x_{t+1} \sim p(\cdot \mid x_t, u_t)$
- Rewards: $q : \mathbb{R} \times \mathcal{X} \times \mathcal{U} \to [0,1]$, $R(x_t, u_t) \sim q(\cdot \mid x_t, u_t)$
- Stationary policy: $\mu : \mathcal{U} \times \mathcal{X} \to [0,1]$, $u_t \sim \mu(\cdot \mid x_t)$
- Discounted return: $D^\mu(x) = \sum_{i=0}^{\infty} \gamma^i R(x_i, u_i)$, with $x_0 = x$
- Value function: $V^\mu(x) = \mathbb{E}^\mu\left[D^\mu(x)\right]$
- Goal: find a policy $\mu$ maximizing $V^\mu(x)$ for all $x \in \mathcal{X}$
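To make the definitions concrete, here is a minimal Python sketch (not from the slides) that estimates $V^\mu(x)$ by Monte Carlo, truncating the infinite discounted sum at a finite horizon. `policy` and `step` are hypothetical stand-ins for $\mu$, $p$ and $q$; the value of $\gamma$ is illustrative.

```python
import numpy as np

def sample_return(x0, policy, step, gamma=0.95, horizon=200, rng=None):
    """Roll out one trajectory from x0 and accumulate sum_i gamma^i R(x_i, u_i)."""
    rng = rng or np.random.default_rng()
    x, ret, discount = x0, 0.0, 1.0
    for _ in range(horizon):            # truncates the infinite sum
        u = policy(x, rng)              # u_t ~ mu(.|x_t)
        x, r = step(x, u, rng)          # x_{t+1} ~ p(.|x_t, u_t), r ~ q(.|x_t, u_t)
        ret += discount * r
        discount *= gamma
    return ret

def mc_value(x0, policy, step, n_rollouts=1000, **kw):
    """V^mu(x0) = E^mu[D^mu(x0)], approximated by averaging sampled returns."""
    return float(np.mean([sample_return(x0, policy, step, **kw)
                          for _ in range(n_rollouts)]))
```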
Bellman's Equation
For a fixed policy $\mu$:
$$V^\mu(x) = \mathbb{E}_{u, x' \mid x}\left[ R(x, u) + \gamma V^\mu(x') \right]$$
Optimal value and policy:
$$V^*(x) = \max_\mu V^\mu(x), \qquad \mu^* = \operatorname{argmax}_\mu V^\mu(x)$$
How to solve it?
- Methods based on Value Iteration (e.g. Q-learning)
- Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)
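For the value-iteration family, a hedged tabular sketch (assuming a small finite MDP with a known model, unlike the sampled-trajectory setting used later in these slides): repeatedly apply the Bellman optimality backup, then read off the greedy policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P: (A, S, S) transition tensor; R: (S, A) expected rewards.
    Iterates V <- max_a [R(., a) + gamma * P_a V] to a fixed point."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        Q = R.T + gamma * (P @ V)       # Q[a, s] = R(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # optimal values, greedy policy
        V = V_new
```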
Solution Method Taxonomy
RL algorithms split into:
- Purely policy-based (policy gradient)
- Value-function based:
  - Value Iteration type (Q-learning)
  - Policy Iteration type (Actor-Critic, OPI, SARSA)
PI methods need a subroutine for policy evaluation.
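That policy-evaluation subroutine has a closed form when the model is known and tabular; a minimal sketch for contrast (GPTD, introduced next, replaces exactly this step when only sampled trajectories are available):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.95):
    """Exact tabular policy evaluation: V^pi solves (I - gamma * P_pi) V = R_pi.
    P: (A, S, S) transitions; R: (S, A) expected rewards;
    pi: (S,) deterministic action index per state."""
    S = P.shape[1]
    P_pi = P[pi, np.arange(S), :]       # (S, S): row s is p(.|s, pi(s))
    R_pi = R[np.arange(S), pi]          # (S,): expected reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```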
Gaussian Process Temporal Difference Learning
Model equations:
$$R(x_i) = V(x_i) - \gamma V(x_{i+1}) + N(x_i)$$
Or, in compact form, $R_{t-1} = H_t V_t + N_{t-1}$, where
$$H_t = \begin{pmatrix} 1 & -\gamma & 0 & \dots & 0 \\ 0 & 1 & -\gamma & \dots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \dots & 1 & -\gamma \end{pmatrix} \in \mathbb{R}^{(t-1) \times t}$$
Our (Bayesian) goal: find the posterior distribution of $V(\cdot)$, given a sequence of observed states and rewards.
The Posterior
General noise covariance: $\mathrm{Cov}[N_{t-1}] = \Sigma_t$
Joint distribution:
$$\begin{pmatrix} R_{t-1} \\ V(x) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} H_t K_t H_t^\top + \Sigma_t & H_t \mathbf{k}_t(x) \\ \mathbf{k}_t(x)^\top H_t^\top & k(x, x) \end{pmatrix} \right)$$
Invoke Bayes' rule:
$$\mathbb{E}[V(x) \mid R_{t-1}] = \mathbf{k}_t(x)^\top \alpha_t$$
$$\mathrm{Cov}[V(x), V(x') \mid R_{t-1}] = k(x, x') - \mathbf{k}_t(x)^\top C_t \mathbf{k}_t(x')$$
where $\mathbf{k}_t(x) = (k(x_1, x), \dots, k(x_t, x))^\top$, and, by standard Gaussian conditioning, $\alpha_t = H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} R_{t-1}$ and $C_t = H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} H_t$.
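The slides' algorithm is online with sparsification; purely as an illustration, here is a batch NumPy sketch of the same posterior, assuming the simple noise choice $\Sigma_t = \sigma^2 I$ (the slides leave the covariance general) and a user-supplied kernel.

```python
import numpy as np

def gptd_posterior(X, r, kernel, gamma=0.95, sigma2=1.0):
    """Batch GPTD posterior over V, from states X = (x_1..x_t) and rewards
    r = (r_1..r_{t-1}). kernel(A, B) returns the Gram matrix between rows."""
    t = len(X)
    K = kernel(X, X)                                    # K_t, (t, t)
    H = np.zeros((t - 1, t))                            # rows: [0.. 1  -gamma ..0]
    H[np.arange(t - 1), np.arange(t - 1)] = 1.0
    H[np.arange(t - 1), np.arange(1, t)] = -gamma
    G = H @ K @ H.T + sigma2 * np.eye(t - 1)            # H_t K_t H_t^T + Sigma_t
    alpha = H.T @ np.linalg.solve(G, r)                 # E[V(x)|R] = k_t(x)^T alpha
    C = H.T @ np.linalg.solve(G, H)                     # posterior covariance correction

    def posterior(x_new):
        k = kernel(X, x_new)                            # k_t(x), (t, m)
        mean = k.T @ alpha
        cov = kernel(x_new, x_new) - k.T @ C @ k
        return mean, cov

    return posterior
```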
The Octopus Arm
- Can bend and twist at any point
- Can do this in any direction
- Can be elongated and shortened
- Can change its cross section
- Can grab using any part of the arm
- Virtually infinitely many degrees of freedom (DOF)
The Muscular Hydrostat Mechanism
A constraint: muscle fibers can only contract (actively).
In vertebrate limbs, two separate muscle groups (agonists and antagonists) are used to control each DOF of every joint by exerting opposite torques. But the octopus has no skeleton!
Balloon example.
Muscle tissue is incompressible; therefore, if muscles are arranged so that different muscle groups are interleaved in perpendicular directions in the same region, contraction in one direction results in extension in at least one of the other directions. This is the muscular hydrostat mechanism.
Octopus Arm Anatomy 101 (figure slide)
Our Arm Model
(figure: schematic of the arm; labeled parts: arm base, arm tip, dorsal side, ventral side, compartments $C_1 \dots C_N$, muscle pairs #1 to #N+1, longitudinal and transverse muscles)
The Muscle Model
$$f(a) = \left(k_0 + a (k_{\max} - k_0)\right)(l - l_0) + \beta \frac{dl}{dt}, \qquad a \in [0, 1]$$
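In code, the muscle is a damped spring whose stiffness interpolates between the passive stiffness $k_0$ and the fully activated stiffness $k_{\max}$; a direct transcription (parameter values are not given on this slide and must be supplied):

```python
def muscle_force(a, l, dl_dt, k0, k_max, l0, beta):
    """f(a) = (k0 + a * (k_max - k0)) * (l - l0) + beta * dl/dt, with a in [0, 1].
    l: current muscle length, l0: rest length, beta: damping coefficient."""
    assert 0.0 <= a <= 1.0, "activation must lie in [0, 1]"
    return (k0 + a * (k_max - k0)) * (l - l0) + beta * dl_dt
```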
Other Forces
- Gravity
- Buoyancy
- Water drag
- Internal pressures (maintain constant compartmental volume)
Dimensionality: 10 compartments, 22 point masses, each with state $(x, y, \dot{x}, \dot{y})$, giving 88 state variables.
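The arithmetic behind the 88 state variables, as a small sketch (the flat array layout is an assumption; the slides do not specify one):

```python
import numpy as np

N_COMPARTMENTS = 10
N_MASSES = 2 * (N_COMPARTMENTS + 1)    # one dorsal and one ventral node per pair = 22

def pack_state(pos, vel):
    """pos, vel: (22, 2) arrays of (x, y) positions and (xdot, ydot) velocities.
    Returns the flat state vector: 22 * 2 + 22 * 2 = 88 variables."""
    assert pos.shape == vel.shape == (N_MASSES, 2)
    return np.concatenate([pos.ravel(), vel.ravel()])
```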
The Control Problem
Starting from a random position, bring the arm into contact with a goal region, optimally; in one task variant any part of the arm may touch the goal, in another only the tip.
- Optimality criteria: time, energy, obstacle avoidance
- Constraint: we only have access to sampled trajectories
Our approach:
- Define the problem as an MDP
- Apply Reinforcement Learning algorithms
The Task
(figure: simulation snapshot at t = 1.38)
Actions
Each action specifies a set of fixed activations, one for each muscle in the arm.
(figure: six activation templates, Action #1 through Action #6)
Base rotation adds duplicates of actions 1, 2, 4 and 5, with positive and negative torques applied to the base.
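A hypothetical encoding of such an action set (the slides define the six templates only pictorially, so the grouping and values below are illustrative, not the actual templates):

```python
import numpy as np

def make_action(dorsal, ventral, transverse, n_compartments=10):
    """One action: a fixed activation in [0, 1] per muscle, held for the whole
    action interval. The muscle grouping here is an assumption."""
    return np.concatenate([
        np.full(n_compartments, dorsal),         # dorsal longitudinal muscles
        np.full(n_compartments, ventral),        # ventral longitudinal muscles
        np.full(n_compartments + 1, transverse), # transverse muscle pairs
    ])

ACTIONS = [
    make_action(0.1, 0.9, 0.2),   # e.g. curl toward the ventral side
    make_action(0.9, 0.1, 0.2),   # e.g. curl toward the dorsal side
    # ... four more templates, plus +/- base-torque variants of actions 1, 2, 4, 5
]
```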
Rewards
Deterministic rewards:
- +10 for a goal state
- a large negative value for hitting an obstacle
- -1 otherwise
Energy economy: a constant multiple of the energy expended by the muscles in each action interval was deducted from the reward.
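The slide's reward rule as one function (a sketch: the obstacle penalty and the energy coefficient are illustrative placeholders, not values from the slides):

```python
def reward(goal_hit, obstacle_hit, action_energy,
           energy_coef=0.01, obstacle_penalty=-50.0):
    """+10 at the goal, a large negative value on obstacle contact, -1 otherwise,
    minus a constant multiple of the energy the muscles expended this interval."""
    if goal_hit:
        r = 10.0
    elif obstacle_hit:
        r = obstacle_penalty
    else:
        r = -1.0
    return r - energy_coef * action_energy
```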
Fixed Base Task I (demonstration slide)
Fixed Base Task II (demonstration slide)
Rotating Base Task I (demonstration slide)
Rotating Base Task II (demonstration slide)
To Wrap Up
- There's more to GPs than regression and classification
- Online sparsification works
Challenges:
- How to use value uncertainty?
- What's a disciplined way to select actions?
- What's the best noise covariance?
- More complicated tasks