Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods Yaakov Engel Joint work with Peter Szabo and Dmitry Volkinshtein (ex. Technion)

Why use GPs in RL?
- A Bayesian approach to value estimation
- Forces us to make our assumptions explicit
- Non-parametric: priors are placed and inference is performed directly in function space (kernels)
- Domain knowledge is intuitively coded into the priors
- Provides a full posterior, not just point estimates
- Efficient, on-line implementation, suitable for large problems
2/23

Markov Decision Processes
- $\mathcal{X}$: state space; $\mathcal{U}$: action space
- Transitions: $p: \mathcal{X} \times \mathcal{X} \times \mathcal{U} \to [0, 1]$, $x_{t+1} \sim p(\cdot \mid x_t, u_t)$
- Rewards: $q: \mathbb{R} \times \mathcal{X} \times \mathcal{U} \to [0, 1]$, $R(x_t, u_t) \sim q(\cdot \mid x_t, u_t)$
- Stationary policy: $\mu: \mathcal{U} \times \mathcal{X} \to [0, 1]$, $u_t \sim \mu(\cdot \mid x_t)$
- Discounted return: $D^{\mu}(x) = \sum_{i=0}^{\infty} \gamma^{i} R(x_i, u_i)$, with $x_0 = x$
- Value function: $V^{\mu}(x) = \mathbb{E}_{\mu}[D^{\mu}(x)]$
- Goal: find a policy $\mu$ maximizing $V^{\mu}(x)$ for all $x \in \mathcal{X}$
3/23
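
The discounted return and the value function can be illustrated with a few lines of code. This is a minimal sketch, not part of the talk: it Monte Carlo estimates $V^{\mu}(x_0)$ by averaging discounted returns over sampled trajectories; the reward sequences and discount factor below are made up.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted return D(x0) = sum_i gamma^i * R(x_i, u_i) for one sampled trajectory."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def mc_value_estimate(sample_trajectories, gamma=0.95):
    """Monte Carlo estimate of V(x0) = E[D(x0)]: average the discounted returns of
    trajectories that all start at the same state x0."""
    returns = [discounted_return(traj_rewards, gamma) for traj_rewards in sample_trajectories]
    return np.mean(returns)

# Example: two sampled reward sequences from the same start state (made-up numbers).
print(mc_value_estimate([[-1, -1, 10], [-1, -1, -1, 10]], gamma=0.95))
```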

Bellman's Equation
For a fixed policy $\mu$:
$$V^{\mu}(x) = \mathbb{E}_{u \sim \mu(\cdot \mid x),\ x' \sim p(\cdot \mid x, u)}\left[ R(x, u) + \gamma V^{\mu}(x') \right]$$
Optimal value and policy:
$$V^{*}(x) = \max_{\mu} V^{\mu}(x), \qquad \mu^{*} = \operatorname{argmax}_{\mu} V^{\mu}(x)$$
How to solve it?
- Methods based on Value Iteration (e.g. Q-learning)
- Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)
4/23
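
For intuition on what a policy-evaluation subroutine does, here is a small tabular sketch of iterative policy evaluation: repeatedly apply the Bellman operator for a fixed policy until a fixed point is reached. It assumes known transition and reward models, unlike the talk's sample-based setting, and the toy MDP in the usage example is an arbitrary placeholder.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.95, tol=1e-8):
    """Iterative policy evaluation: apply the Bellman operator
    V(x) <- sum_u pi(u|x) * (R(x,u) + gamma * sum_x' P(x'|x,u) V(x'))
    until convergence. Shapes: P is (U, X, X), R is (X, U), policy is (X, U)."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        r_pi = np.sum(policy * R, axis=1)              # expected reward under the policy, (X,)
        P_pi = np.einsum('xu,uxy->xy', policy, P)      # state-to-state transitions under the policy
        V_new = r_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state, 2-action MDP (all numbers made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0: P[0, x, x']
              [[0.5, 0.5], [0.5, 0.5]]])   # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[x, u]
pi = np.array([[1.0, 0.0], [0.0, 1.0]])    # deterministic policy
print(policy_evaluation(P, R, pi))
```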

Solution Method Taxonomy
RL Algorithms:
- Purely policy based (Policy Gradient)
- Value function based
  - Value Iteration type (Q-Learning)
  - Policy Iteration type (Actor-Critic, OPI, SARSA)
PI methods need a subroutine for policy evaluation.
5/23

Gaussian Process Temporal Difference Learning
Model equations:
$$R(x_i) = V(x_i) - \gamma V(x_{i+1}) + N(x_i)$$
Or, in compact form:
$$R_t = H_{t+1} V_{t+1} + N_t, \qquad
H_t =
\begin{bmatrix}
1 & -\gamma & 0 & \cdots & 0 \\
0 & 1 & -\gamma & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & -\gamma
\end{bmatrix}$$
Our (Bayesian) goal: find the posterior distribution of $V(\cdot)$, given a sequence of observed states and rewards.
6/23
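
As a concrete illustration of the compact form, here is a short sketch (my own, not the talk's code) that builds the bidiagonal matrix relating the observed rewards to the values of successive states.

```python
import numpy as np

def build_H(t, gamma):
    """t x (t+1) GPTD model matrix: row i encodes R(x_i) = V(x_i) - gamma * V(x_{i+1}) + N(x_i)."""
    H = np.zeros((t, t + 1))
    for i in range(t):
        H[i, i] = 1.0
        H[i, i + 1] = -gamma
    return H

print(build_H(3, 0.9))
# [[ 1.  -0.9  0.   0. ]
#  [ 0.   1.  -0.9  0. ]
#  [ 0.   0.   1.  -0.9]]
```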

The Posterior
General noise covariance: $\mathrm{Cov}[N_t] = \Sigma_t$
Joint distribution:
$$\begin{bmatrix} R_{t-1} \\ V(x) \end{bmatrix} \sim
\mathcal{N}\!\left(
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix}
H_t K_t H_t^{\top} + \Sigma_t & H_t k_t(x) \\
k_t(x)^{\top} H_t^{\top} & k(x, x)
\end{bmatrix}
\right)$$
Invoke Bayes' rule:
$$\mathbb{E}[V(x) \mid R_{t-1}] = k_t(x)^{\top} \alpha_t, \qquad
\mathrm{Cov}[V(x), V(x') \mid R_{t-1}] = k(x, x') - k_t(x)^{\top} C_t k_t(x'),$$
where $k_t(x) = (k(x_1, x), \ldots, k(x_t, x))^{\top}$.
7/23
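
The posterior above is ordinary Gaussian conditioning, so a naive batch version is easy to write down. The sketch below is my own illustration, not the talk's online sparsified algorithm: it computes $\alpha_t$ and $C_t$ directly in $O(t^3)$, with a white-noise covariance, an RBF kernel, and toy states and rewards all chosen arbitrarily.

```python
import numpy as np

def gptd_posterior(kernel, states, rewards, gamma=0.95, sigma2=0.1):
    """Naive batch GPTD posterior over the value function V.
    states has length t+1, rewards has length t. Returns posterior mean and variance functions."""
    t = len(rewards)
    K = np.array([[kernel(a, b) for b in states] for a in states])   # (t+1, t+1) Gram matrix
    H = np.zeros((t, t + 1))
    for i in range(t):
        H[i, i], H[i, i + 1] = 1.0, -gamma
    Sigma = sigma2 * np.eye(t)                       # simple white-noise covariance
    Q = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ Q @ np.asarray(rewards)            # E[V(x)|R] = k_t(x)^T alpha
    C = H.T @ Q @ H                                  # Cov[V(x),V(x')|R] = k(x,x') - k_t(x)^T C k_t(x')

    def mean(x):
        k = np.array([kernel(xi, x) for xi in states])
        return k @ alpha

    def variance(x):
        k = np.array([kernel(xi, x) for xi in states])
        return kernel(x, x) - k @ C @ k

    return mean, variance

# Usage with a toy 1-D chain and an RBF kernel (all numbers made up).
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
mean, var = gptd_posterior(rbf, states=[0.0, 1.0, 2.0, 3.0], rewards=[-1.0, -1.0, 10.0])
print(mean(1.5), var(1.5))
```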


The Octopus Arm
- Can bend and twist at any point
- Can do this in any direction
- Can be elongated and shortened
- Can change its cross section
- Can grab using any part of the arm
- Virtually infinitely many degrees of freedom (DOF)
9/23

The Muscular Hydrostat Mechanism
A constraint: muscle fibers can only contract (actively).
In vertebrate limbs, two separate muscle groups (agonists and antagonists) are used to control each DOF of every joint by exerting opposite torques. But the octopus has no skeleton! (Balloon example.)
Muscle tissue is incompressible; therefore, if muscles are arranged such that different muscle groups are interleaved in perpendicular directions in the same region, contraction in one direction results in extension in at least one of the other directions. This is the muscular hydrostat mechanism.
10/23

Octopus Arm Anatomy 101 11/23

Our Arm Model
(Diagram: the arm is modeled as a chain of compartments C_1 ... C_N between muscle pairs #1 ... #N+1, running from the arm base to the arm tip, with a dorsal and a ventral side; the compartments are actuated by longitudinal and transverse muscles.)
12/23

The Muscle Model
$$f(a) = \left(k_0 + a\,(k_{\max} - k_0)\right)(l - l_0) + \beta \frac{dl}{dt}, \qquad a \in [0, 1]$$
13/23
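
A direct transcription of the muscle model into code, with placeholder constants since the talk does not give the values of $k_0$, $k_{\max}$, $l_0$, or $\beta$:

```python
def muscle_force(a, l, dl_dt, k0=1.0, k_max=3.0, l0=1.0, beta=0.1):
    """Spring-damper muscle model from the slide:
    f(a) = (k0 + a*(k_max - k0)) * (l - l0) + beta * dl/dt, with activation a in [0, 1].
    The default constants are arbitrary placeholders, not the values used in the talk."""
    assert 0.0 <= a <= 1.0
    return (k0 + a * (k_max - k0)) * (l - l0) + beta * dl_dt
```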

Other Forces
- Gravity
- Buoyancy
- Water drag
- Internal pressures (maintain constant compartmental volume)
Dimensionality: 10 compartments, 22 point masses, each with state $(x, y, \dot{x}, \dot{y})$, giving 88 state variables.
14/23
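
For concreteness, the dimensionality works out as 22 point masses times 4 variables each. The array layout below is an illustrative assumption, not the simulator's actual data structure:

```python
import numpy as np

# 22 point masses, each with (x, y, x_dot, y_dot), giving 22 * 4 = 88 state variables.
n_masses = 22
arm_state = np.zeros((n_masses, 4))   # columns: x, y, x_dot, y_dot
flat_state = arm_state.reshape(-1)    # 88-dimensional state vector seen by the learner
assert flat_state.size == 88
```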

The Control Problem
Starting from a random position, bring {any part, tip} of the arm into contact with a goal region, optimally.
Optimality criteria: time, energy, obstacle avoidance.
Constraint: we only have access to sampled trajectories.
Our approach:
- Define the problem as an MDP
- Apply Reinforcement Learning algorithms
15/23

The Task
(Figure: snapshot of the simulated arm and goal region at t = 1.38; axis ticks omitted.)
16/23

Actions
Each action specifies a set of fixed activations, one for each muscle in the arm.
(Figure: the six base actions, Action #1 through Action #6.)
Base rotation adds duplicates of actions 1, 2, 4 and 5 with positive and negative torques applied to the base.
17/23

Rewards
Deterministic rewards: +10 for a goal state, a large negative value for hitting an obstacle, -1 otherwise.
Energy economy: a constant multiple of the energy expended by the muscles in each action interval was deducted from the reward.
18/23
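
A sketch of this reward scheme as code; the obstacle penalty and the energy coefficient are placeholders, since the talk only says "large negative value" and "a constant multiple":

```python
def reward(reached_goal, hit_obstacle, energy_used, energy_cost=0.01, obstacle_penalty=-50.0):
    """Reward scheme from the slide: +10 at a goal state, a large negative value for
    hitting an obstacle, -1 otherwise, minus a constant multiple of the energy expended.
    obstacle_penalty and energy_cost are assumed values, not those used in the talk."""
    if reached_goal:
        r = 10.0
    elif hit_obstacle:
        r = obstacle_penalty
    else:
        r = -1.0
    return r - energy_cost * energy_used
```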

Fixed Base Task I 19/23

Fixed Base Task II 20/23

Rotating Base Task I 21/23

Rotating Base Task II 22/23

To Wrap Up
- There's more to GPs than regression and classification
- Online sparsification works
Challenges:
- How to use value uncertainty?
- What's a disciplined way to select actions?
- What's the best noise covariance?
- More complicated tasks
23/23