Robotics Part II: From Learning Model-based Control to Model-free Reinforcement Learning

Size: px

Start display at page:

Download "Robotics Part II: From Learning Model-based Control to Model-free Reinforcement Learning"

Sabrina Boone
5 years ago
Views:

1 Robotics Part II: From Learning Model-based Control to Model-free Reinforcement Learning Stefan Schaal Max-Planck-Institute for Intelligent Systems Tübingen, Germany & Computer Science, Neuroscience, & Biomedical Engineering University of Southern California, Los Angeles

2 Where Did We Stop...

3 Outline A Bit of Robotics History Foundations of Control Adaptive Control Learning Control - Model-based Robot Learning - Reinforcement Learning

4 What Needs to Be Learned in Learning Control? Internal Models Value Functions Control Policies Coordinate Transformations The Majority of the Learning Problems Involve Function Approximation Unsupervised Learning & Classification

5 Learning Internal Models Forward Models - models the causal functional relationship - for example: Inverse Models - models the inverse of the causal functional relationship - for example: y = f ( x) ( ( )!q G( q) )!!q = B 1 ( q) τ C q,!q x = f 1 ( y) B( q)!!q + C( q,!q )!q + G( q) = τ - NOTE: inverse models are not necessarily functions any more!

6 Inverse Models May Not Be Trivially Learnable

7 Inverse Models May Not Be Trivially Learnable t = f ( θ 1 1 1,θ ) 2 t = f ( θ 2 2 1,θ ) 2 what is f 1 ( t)?

Characteristics of Function Approximation in Robotics Incremental Learning large amounts of data continual learning to be approximated functions of

8 Characteristics of Function Approximation in Robotics Incremental Learning large amounts of data continual learning to be approximated functions of growing and unknown complexity Fast Learning data efficient computationally efficient real-time Robust Learning minimal interference hundreds of inputs

9 Linear Regression: One of the Simplest Function Approximation Methods Recall the simple adaptive control model with: f ( x) = θx - find the line through all data points - imagine a spring attached between the line and each data point - all springs have the same spring constant - points far away generate more force (danger of outliers) - springs are vertical - solution is the minimum energy solution achieved by the springs y x

10 Linear Regression: One of the Simplest The data generating model: The Least Squares cost function Minimizing the cost gives the least-square solution Function Approximation Methods y =!w T!x + w 0 + ε = w T x + ε where x = x T T!w,1,w = w 0 J = 1 2 t y where : t = J w = 0 = J w, E{ ε} = 0 ( ) T ( t y) = 1 ( 2 t Xw ) T ( t Xw) t 1 t 2 t n = t T X + Xw, X = x 1 T x 2 T x n T 1 ( 2 t Xw ) T ( t Xw) ( ) T X = t T X + w T X T X thus : t T X = w T X T X or X T t = X T Xw result : w = ( X T X) 1 X T t ( ) T X = t Xw

11 Recursive Least Squares: An Incremental Version of Linear Regression Based on the matrix inversion theorem: ( A BC) 1 = A 1 + A 1 B( I + CA 1 B) 1 CA 1 Incremental updating of a linear regression model Initialize: P n = I 1 γ For every new data point ( x,t) (note that x includes the bias): P n+1 = 1 λ Pn Pn xx T P n λ + x T P n x W n+1 = W n + P n+1 x( t W nt x) T where γ << 1 (note P ( X T X) 1 where λ = 1 if no forgetting < 1 if forgetting - NOTE: RLS gives exactly the same solution as linear regression if no forgetting

12 Making Linear Regression Nonlinear: Locally Weighted Regression Region of Validity 2θ k Receptive Field Activation w 1 Linear Model 0 N i=1 ( ) 2 J = w i y i x i T β Note: Using GPs, SVR, Mixture Models, etc., are other ways to nonlinear regression

13 Locally Weighted Regression Piecewise linear function approximation, Each local model is learned from only local data No over-fitting due to too many local models (unlike RBFs, ME)

x Weighting Kernel: w = exp 1 2 ( x c )T D( x c) Combined Prediction: y = K i=1 K w i i=1 w i y k where learned with D = M T M Gradient descent in

14 Locally Weighted Regression Linear Model: y = β x T x + β 0 = β T x where learned with x = [ x T 1] T Recursive weighted least squares: β n+1 k = β n k + wp n+1 k x( y!x T n β ) T k P n+1 k = 1 λ P n k P n k!x!x T n P k λ w +!xt P k n!x Weighting Kernel: w = exp 1 2 ( x c )T D( x c) Combined Prediction: y = K i=1 K w i i=1 w i y k where learned with D = M T M Gradient descent in penalized leave-one-out local cross-validation (PRESS) cost function: M n+1 k = M n k α J M N n J = N w k,i y i ŷ k,i, i + γ D k,ij i=1 w k,i i=1 ( ) < w gen i=1, j =1 add model when if min w k k createnew RF at c K +1 = x

15 Locally Weighted Regression ( ( ( ))) z = max exp( 10x 2 ),exp( 50y 2 ),1.25exp 5 x 2 + y x x

16 Locally Weighted Regression Inserted into Adaptive Control

17 Locally Weighted Regression Learn forward model of task dynamics, then computer controller

18 Locally Weighted Regression Learn forward model of task dynamics, then computer controller

19 Criticism of Locally Weighted Learning Breaks down in high-dimensional spaces Computationally expensive and numerically brittle due to (incremental) dxd matrix inversion Not compatible with modern probabilistic statistical learning algorithms Too many manual tuning parameters

20 The Curse of Dimensionality The power of local learning comes from exploiting the discriminative power of local neighborhood relations. But the notion of a local breaks down in high dim. spaces!

21 The Curse of Dimensionality Movement Data is Locally Low Dimensional 0.25 / / 0.2 Probability 0.15 Thus, locally weighted learning can work if used with local dimensionality reduction! Dimensionality / / 105 Derived with Bayesian Factor Analysis

22 A Bayesian Approach to Locally Weighted Learning Linear Regression as a Graphical Model y i = x i T β + ε ( ) ε N 0,ψ y β = ( X T X) 1 Xy

23 A Bayesian Approach to Locally Weighted Learning Inserting a Partial-Least-Squares-like projection as a set of hidden variables z i,m = x i,m β j + η m y i = d z i,m m=1 ε N 0,ψ y + ε ( ) η N ( 0,ψ ) m z,m

24 A Bayesian Approach to Locally Weighted Learning Robust linear regression with automatic relevance detection (ARD, sparsification) z i,m = x i,m β j + η m y i = d z i,m m=1 ε N 0,ψ y + ε ( ) η m N ( 0,ψ ) z,m β m N 0, 1 α m α m Gamma( a α,b α,)

25 A Full Bayesian Treatment of Locally Weighted Learning The final model for full Bayesian parameter adaptation for regression and locality ψ y y i i = 1,.., N ψ z1 ψ z2 ψ zd z i1 b 1 z i2 z id b d b 2 x i1 w i1 x i2 w x i2 id w id h 1 h 2 h d

26 Locally Weighted Learning In High Dimensional Spaces Learning the cross function in 20-dimensional space z TextEnd 0.5 z TextEnd y x y x

Locally Weighted Learning In High Dimensional Spaces Learning the cross function in 20-dimensional space nmse on Test Set 0.14 0.12 0.1 0.08 0.

27 Locally Weighted Learning In High Dimensional Spaces Learning the cross function in 20-dimensional space nmse on Test Set D-cross 10D-cross 20D-cross #Training Data Points #Receptive Fields / Average #Projections

28 Locally Weighted Learning In High Dimensional Spaces Learning inverse kinematics in 60 dimensional space

29 Locally Weighted Learning In High Dimensional Spaces Skill learning

30 Outline A Bit of Robotics History Foundations of Control Adaptive Control Learning Control - Model-based Robot Learning - Reinforcement Learning

31 Given: A Parameterized Policy and a Controller Note: we are now starting to address planning, i.e,. where do desired trajectories come from?

32 Trial & Error Learning Reinforcement Learning from Trajectories Problem: How can a motor system learn a novel motor skill? Reinforcement learning is a general approach to this problem, but little work has been done to scale to the highdimensional continuous stateaction domains of humans Approach: Teach with imitation learning the initial skill using a parameterized control policy Provide an objective function for the skill Perform trial-and-error learning from exploratory trajectories

Reinforcement Learning Terminology Policies perceived state to action mapping (can be probabilistic) Reward functions maps the perceived stateaction pair into a a single number, an immediate reward

33 Reinforcement Learning Terminology Policies perceived state to action mapping (can be probabilistic) Reward functions maps the perceived stateaction pair into a a single number, an immediate reward (stochastic) Value functions maps the state into the accumulated expected reward that would be received if starting in the state Models predicts the next state given the current state and action (can be probabilistic) l Policy: what to do l Reward: what is good l Value: what is good because it predicts reward l Model: what follows what Objective: Optimize Reward!

34 Value Functions The value of a state is the expected return starting from that state; depends on the agent s policy: State - value function for policy π : { } = E π γ k r t + k +1 x t = x V π (x) = E π R t x t = x The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π : Action - value function for policy π : Q π (x,u) = E π R t x t = x,u t = u k =0 k =0 { } = E π γ k r t + k +1 x t = x,u t = u

35 Bellman Equation for a Policy π The basic idea: So: V π (x) = E { π R t x t = x} = E { π r t +1 + γ V ( x ) t +1 x t = x}

36 Bellman Optimality Equation for V* The value of a state under an optimal policy must equal the expected return for the best action from that state: V *( x) = max Q π ( x,u) u A(x) = max u A(x) E{ r + γ V *( x x = x,u = u) } t +1 t +1 t t V is the unique solution of this system of equations.

37 Bellman Optimality Equation for Q* The value of a state/action under an optimal policy must equal the expected return for this action from that state, and then following the optimal policy: { } Q (x,u) = E r + γ maxq (x,u') x = x,u = u t+1 t+1 t t u' Q is the unique solution of this system of equations.

38 Example: Learning a Pendulum Swing-Up Note: Both policy and value function are rather complex landscapes with discontinuities! Policy Value Function

39 Some More Exciting Examples

40 State-Based vs. Trajectory-based Reinforcement Learning From about , value function-based (i.e., state-based) reinforcement learning has been dominant (textbook Sutton&Barto) Pros: - well-understood theory - convergence proofs for discrete state-action systems - a useful set of algorithms to work with (model-based and model-free) - ideally a globally optimal solution Cons: - problematic in continuous state-action spaces (max-operator in continuous spaces) - curse of dimensionality in high-dimensional systems - hard to combine with function approximation - greed (= agressive) updating Trajectory-based reinforcement learning Pros: - can work in high dimensional continuous state-action spaces - does not suffer from the curse of dimensionality Cons: - Locally optimal solutions - classical methods learn very slowly

41 Trajectory-based Reinforement Learning with Parameterized Policies ( ) = π x t u t!x d t ( ) = π x d t ( ( ),t,α ) or ( ( ),t,α ) Example: Dynamic Systems Policies, initalized by imitation τ!!y = α z ( β z ( g y)!y ) + τ!x = α x x k i=1 k i=1 w i b i x w i

42 Trajectory-based Reinforcement Learning Define a cost function along the trajectory: T J = E τ i=0 And a parameterized control policy (e.g., a movement primitive) τ!y = f ( y, goal,b) r i Optimize J with respect to parameters b, e.g., by gradient descent b n+1 = b n + α J b

43 Example: Learning with Natural Gradients Goal: Hit ball to fly far Note: about trials are needed.

44 Reinforcement Learning from Trajectories State-of-the-art of Reinforcement Learning from Trajectories: - Given the cost per trajectory τ : - The motor primitives with parameters b: RL with Natural Gradients J = E τ T τ!y = f ( y,goal,b) b new = b old + α J NAC b Probabilistic RL with Reward-Weighted Regression b new Trajectory-based Q-learning (fitted Q-iteration) - an actor-critic based method based on an action-value function over trajectories RL with path-integrals (a probabilistic, model-based/model-free approach derived from stochastic optimal control) i=0 r i R τ b τ / R τ T T

45 Reinforcement Learning Based on Path Integrals Pre-requisites System Dynamics (Control-Affine): ( ) = F x,u,t!x = f ( x,t) + G( x) u( t) + ε( t) ( ) Cost Function: r t = q(x t ) u T t Ru t T J xt = E xt q T + r t ' dt ' t '=t Note: this is a more structured approach to RL Goal: find commands u that minimize this cost

46 Reinforcement Learning Based on Path Integrals Sketch of the Path-Integral Derivation Stochastic HJB Equations: t V ( x t,t) = min u t:t m r t + x V ( x t,t) T F( x,u,t ) { ( )} Tr Ω ( x,u,t ) 2 V x,t x t min u t:t m 1 2 u T t Ru t + q t + x V x t,t ( ) T f x,t ( ) + x V x t,t ( ) T G x ( )u t u T t R + x V ( x t,t) T G( x t ) = 0 u t = R 1 G( x t ) T x V ( x t,t) ( ) Tr G x { ( )Σ G( x)t 2 x V ( x t,t)} = 0

47 Reinforcement Learning Based on Path Integrals Sketch of the Path-Integral Derivation t V ( x t,t) = min u t:t m r t + x V ( x t,t) T F( x,u,t ) u t = R 1 G( x ) T t x V ( x t,t) ( )!x = f ( x,t) + G( x) u( t) + ε( t) { ( )} Tr Ω ( x,u,t ) 2 V x,t x t t V ( x t,t) = 1 2 V x x (,t t ) T G( x)r 1 G( x) T x V ( x t,t) + q t + x V ( x t,t) T f x,t ( ) Tr G x { ( )Σ G( x)t 2 x V ( x t,t)}

48 Reinforcement Learning Based on Path Integrals Sketch of the Path-Integral Derivation t V ( x t,t) = 1 2 V x x (,t t ) T G( x)r 1 G( x) T x V ( x t,t) + q t + x V ( x t,t) T f x,t ( ) Tr G x { ( )Σ G( x)t 2 x V ( x t,t)} Simplification: λr 1 = Σ Log-Transformation Trick: V ( x t,t) = λ logψ ( x t,t) t ψ ( x t,t) = 1 λ ψ ( x t,t)q t x ψ ( x t,t) T f x,t ( ) 1 2 Tr G x { ( )Σ G( x)t 2 xψ ( x t,t)} Chapman Kolmogorov PDE: 2nd Order and Linear

49 Reinforcement Learning Based on Path Integrals Sketch of the Path-Integral Derivation t ψ ( x t,t) = 1 λ ψ ( x,t t )q t x ψ ( x t,t) T f x,t ( ) 1 2 Tr G x { ( )Σ G( x)t 2 xψ ( x t,t)} Application of Feynman-Kac Theorem: A numerical method to solve certain PDEs ( ) = E τ ψ x T,T ψ x t,t ( )exp t '=T t '=t 1 λ q t ' dt '

50 Reinforcement Learning Based on Path Integrals Sketch of the Path-Integral Derivation ( ) = E τ ψ x T,T ψ x t,t u t = R 1 G x t ( )exp t '=T t '=t ( ) T x V ( x t,t) 1 λ q t ' dt ' A bit of algebra... { u t = E τ w τ R 1 G( x t ) T ( G( x t )R 1 G ( x ) T ) 1 t G( } x t )ε t Optimal Control Law

51 Path Integral RL Applied to Parameterized Policies (Motor Primitives) Note that a version of motor primitives can be written as control affine stochastic differential equations!x = f ( x) + g T ( θ + ε) - ε is interpreted as intentionally injected exploration noise - the parameters θ are the control vector - f(x) is the spring-damper of the primitives - g(x) are the basis functions of the function approximator It is also necessary to create a iterative version of path integral optimal control - the original path integral optimal control framework explores only based on the passive dynamics, i.e., u=0

52 PI 2 Reinforcement Learning For parameterized policies like dynamic motor primitives, a beautifully simple algorithm results: 1) Create K trajectories of the motor primitive for a given task with noise. 2) We can write the cost to go from every time step t of the trajectory as: R t = q T + T i=t r i 3) The probability of a trajectory becomes k P( ξ t ) = exp 1 λ R t K j =1 k exp 1 λ R t j 4) Update the parameter θ of the motor primitive as Δθ t = K k =1 k P( ξ t ) R 1 g k (x t )g k (x t ) T g k (x t ) T R 1 g k (x t ) 5) Final parameter update θ new = θ old + Δθ t ε k t Note that there a NO open tuning parameters except for the exploration noise

53 PI 2 Reinforcement Learning The Intuition of Path Integral Reinforcement Learning - Generate multiple trials i with some variation, e.g., due to noise or exploration - For every time t, compute i the cost R t for every trial: R t i T = q T + q(x t ) u t t T Ru t dτ t - Convert the cost into a positive weight w t i = exp( i λr ) t - Update the motor command at every time step to be the reward weighted average of all experienced commands in the trial w i i t u t u t new = i i w t i Surprisingly, this intuition turns out to be the optimal solution

54 PI 2 Reinforcement Learning: Some Remarks PI 2 can be model-based to model-free Rigid Body Dynamics:!!q = M q Control Law: u = u ff + K p ( ) 1 u C( q,!q )!q + G( q) ( ) ( q d q) + K D (!q d!q ) Motor Primitives:!!q i d = α ( z β ( z g i q i ) d!q i ) d + ψ T θ PI 2 can optimize trajectory plans, controllers, or both PI 2 has only one open parameter, i.e., the level of exploration noise PI 2 allows a rather simple derivation of inverse reinforcement learning

Example: Results on 2D Reaching Through a Via Point 10000000 Via-Point 9000000 8000000 0.

55 Example: Results on 2D Reaching Through a Via Point Via-Point Cost y [m] Number of Roll-Outs Initial x [m] PI2 REINFORCE PG NAC

56 Example: Results on 20D Reaching Through a Via Point Via-Point Cost y [m] Number of Roll-Outs x [m]

Example: Results on 50D Reaching Through a Via Point 10000000 Via-Point 9000000 8000000 0.

57 Example: Results on 50D Reaching Through a Via Point Via-Point Cost y [m] Number of Roll-Outs x [m]

Learning converges after about 20-30 trial!

58 Example: Dog Jump 600 This is a 12 DOF motor system, using 50 basis functions per primitive. Learning converges after about trial! Performance improved by 15cm (0.5 body lengths) Cost Number of Roll-Outs

59 Reinforcement Learning in Manipulation

60 Learning Locomotion over Rough Terrain

61 Outline A Bit of Robotics History Foundations of Control Adaptive Control Learning Control - Model-based Robot Learning - Reinforcement Learning What Comes Next?

62 Towards Truly Autonomous Robots Big Robots Very Little Robots

Path Integral Stochastic Optimal Control for Reinforcement Learning

Preprint August 3, 204 The st Multidisciplinary Conference on Reinforcement Learning and Decision Making RLDM203 Path Integral Stochastic Optimal Control for Reinforcement Learning Farbod Farshidian Institute