Data-Efficient Control Policy Search using Residual Dynamics Learning
Matteo Saveriano 1, Yuchao Yin 1, Pietro Falco 1 and Dongheui Lee 1,2

Abstract: In this work, we propose a model-based and data-efficient approach for reinforcement learning. The main idea of our algorithm is to combine simulated and real rollouts to efficiently find an optimal control policy. While performing rollouts on the robot, we exploit sensory data to learn a probabilistic model of the residual difference between the measured state and the state predicted by a simplified model. The simplified model can be any dynamical system, from a very accurate system to a simple, linear one. The residual difference is learned with Gaussian processes. Hence, we assume that the difference between the real and the simplified model is Gaussian distributed, which is less strict than assuming that the real system is Gaussian distributed. The combination of the partial model and the learned residuals is exploited to predict the real system behavior and to search for an optimal policy. Simulations and experiments show that our approach significantly reduces the number of rollouts needed to find an optimal control policy for the real system.

Fig. 1. Overview of the algorithm. A policy π^a is learned in simulation on the approximate model (states x^a) and improved (Sec. II-E, II-F) into π^n with parameters θ^n, which is applied to the robot (real policy π^r, states x^r); the residual Δx between real and approximate states is used for learning the residual dynamics (Sec. II-B, II-C).

I. INTRODUCTION

Robots that learn novel tasks by self-practice have the chance to be exploited in a wide range of situations, including industrial and social applications. Reinforcement Learning (RL) gives the agent the possibility to autonomously learn a task by trial and error [1]. In particular, the agent performs a number of trials (rollouts) and uses sensory data to improve the execution. In robotics and control applications, RL presents two main disadvantages [2].
First, the state and action spaces are continuous-valued and high-dimensional, and existing approaches in RL to search continuous spaces are not sample-efficient. A common strategy [3]-[6] to reduce the search space to a discrete space consists in parameterizing the control policy with a discrete number of learnable parameters. Typical policy parameterizations adopt Gaussian mixture models [3], hidden Markov models [7], [8], or stable dynamical systems [5]. Second, it is extremely time consuming to perform a big number of trials (rollouts) on real devices. The rollouts can be reduced with a proper initialization of the policy, for example using kinesthetic teaching approaches [9]. However, reducing the rollouts needed to find an optimal policy is still an open problem.

RL algorithms can be classified into two categories, namely model-free and model-based approaches. Model-free methods [4]-[6] directly learn the policy on the data samples obtained after each rollout. Model-free approaches do not require any model approximation and are applicable in many situations. The problem of model-free approaches is that they may require many rollouts to find the optimal policy [4]. Model-based methods, instead, explicitly learn a model of the system to control from sensory data and use this learned model to search for the optimal policy. The comparison in [10] shows that model-based approaches are efficient compared with model-free methods. However, model-based methods strongly suffer from the so-called model bias problem. Indeed, they assume that the learned model is sufficiently accurate to represent the real system [11]-[13]. Learning an accurate model of the system is not always possible. This is the case, for instance, when the training data is too sparse.

1 Human-centered Assistive Robotics, Technical University of Munich, Munich, Germany {matteo.saveriano, yuchao.yin, pietro.falco, dhlee}@tum.de
2 Institute of Robotics and Mechatronics, German Aerospace Center (DLR), Germany
It is clear that searching for a control policy using inaccurate models might lead to poor results. The Probabilistic Inference for Learning Control (PILCO) algorithm in [10] alleviates the model-bias problem by including uncertainty in the learned model. In particular, PILCO leverages Gaussian Processes (GPs) [14] to learn a probabilistic model of the system which represents model uncertainties as Gaussian noise. The GP model allows to consider uncertainties in the long-term state predictions, and to explicitly consider eventual model errors in the policy evaluation. The comparison performed in [10] shows that PILCO outperforms state-of-the-art algorithms in terms of performed rollouts on the real robot in pendulum and cart-pole swing-up tasks. Despite the good performance, PILCO needs to learn a reliable model from sensory data to find an optimal policy. In order to explore the state space as much as possible and to rapidly learn a reliable model, PILCO applies a random policy at the first stage of the learning. In general, starting from a randomly initialized policy can be dangerous for real systems. The robot, in fact, can exhibit unstable behaviors in the first stages of the learning process, with the consequent risk of damage.

The work in [15] combines the benefits of model-free and model-based RL. The authors propose to model the system to control as a dynamical system and to find an initial control policy by using optimal control. The learned policy is then applied to the real system and improved using a model-free approach. The policy learned on the model represents a good starting point for policy search, significantly reducing the number of performed rollouts. The approach in [16] also uses optimal control to find an initial policy, but it uses sensory data to fit a locally linear model of the system.

In this paper, we propose a model-based RL approach that exploits an approximate, analytical model of the robot to rapidly learn an optimal control policy. The proposed PI-REM (Policy Improvement with REsidual Model learning) assumes that the system is modeled as a non-linear dynamical system. In real scenarios, it is not trivial to analytically derive the entire model, but it is reasonable to assume that part of the model is easy to derive. For instance, in a robotic application, the model of the robot in free motion has a well-known structure [17], but it is hard to model the sensor noise or the friction coefficient [18]. We refer to the known part of the model as the approximate model. Given the approximate model, we learn an optimal control policy in simulation. We exploit PILCO [10] in this study, but other choices are possible. Directly applying the policy learned in simulation on a real robot does not guarantee to fulfill the task. Hence, we exploit sensory data to learn the residual difference between the real and the approximate model using Gaussian Processes (GPs) [14].
The superposition of the approximate model and the learned residual term represents a good estimation of the real dynamics and it is adopted for model-based policy search. An overview of PI-REM is shown in Fig. 1. In contrast to PILCO, our approach exploits the approximate model to be more data efficient and to converge faster to the optimal policy. Compared to the work in [15], PI-REM explicitly learns the residual difference to improve the model and speed up the policy search. Indeed, the approximate model is globally valid and we use sensory data only to estimate local variations of that model. In contrast to [16], we do not constrain the system dynamics to be locally linear.

The rest of the paper is organized as follows. Section II describes our approach for model-based policy search. Experiments and comparisons are reported in Sec. III. Section IV states the conclusion and proposes future extensions.

II. POLICY IMPROVEMENT WITH RESIDUAL MODEL LEARNING

A. Preliminaries

The goal of a reinforcement learning algorithm is to find a control policy u = π(x) that minimizes the expected return

  J(π_θ) = Σ_{t=0}^{N} E[c(x_t)]    (1)

where E[·] is the operator that computes the expected value of a random variable, and c(x_t) is the cost of being in state x_t at time t. The vector θ represents a set of values used to parameterize the continuous-valued policy function π(x, θ).

In this work, we consider that the system to control is modeled as a non-linear dynamical system in the form

  x^r_{t+1} = f^r(x^r_t, u^r_t)    (2)

where x^r ∈ R^d is the continuous-valued state vector, u^r ∈ R^f is the control input vector, and f^r ∈ R^d is a continuous and continuously differentiable function. The superscript r indicates that the dynamical system (2) is the real (exact) model of the system. The difference equation (2) can model a variety of physical systems, including robotic platforms.

B. Additive Unknown Dynamics

In general, it is not trivial to derive the exact model in (2).
For example, although it is relatively simple to compute the dynamical model of a robotic manipulator in free motion, it is complicated to model eventual external forces. In many robotics applications, we can assume that the function f^r(·,·) in (2) consists of two terms: (i) a deterministic and known term, and (ii) an additive unknown term. Under this assumption, the dynamical system in (2) can be rewritten as

  x^r_{t+1} = f^a(x^r_t, u^r_t) + f̃(x^r_t, u^r_t) + w    (3)

where the function f^a(·,·) is assumed known and deterministic, f̃(·,·) + w is an unknown and stochastic term, and w ∈ R^d is an additive Gaussian noise. The dynamical system

  x^a_{t+1} = f^a(x^a_t, u^a_t)    (4)

represents the known dynamics in (3), and we call it the approximate model. Note that x^a_t and x^r_t represent the same state vector in different points of the state space R^d. The approximate model is known a priori. Hence, a control policy u^a = π^a(x^a) for (3) can be learned in simulation by using a reinforcement learning or an optimal control approach. In this work, we used the PILCO algorithm [10] to learn π^a, but other choices are possible. It is worth noting that searching for the policy π^a requires only simulations, i.e., no experiments on the robot are performed at this stage.

The learned policy π^a represents a good starting point to search for a control policy π^r for the real robot. This idea of initializing a control policy from simulations has been effectively applied in [15], where a linear system is used as approximate model. Apart from initializing a policy from simulation, this work exploits real data from the robot to explicitly learn the unknown dynamics f̃(·,·) + w in (3). In other words, when the learned policy π^a is applied to the real robot, the measured states are used either to improve the current policy or to learn an estimate of the unknown dynamics. This significantly reduces the number of rollouts to perform on the real robot [10].
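To make the decomposition in (3)-(4) concrete, here is a minimal numerical sketch with a hypothetical 1-D system (all constants are made up; the unmodeled term plays the role of f̃ and the sampled noise that of w):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_a(x, u):
    # Known, deterministic approximate model, cf. Eq. (4)
    return 0.9 * x + 0.1 * u

def f_tilde(x, u):
    # Unknown residual dynamics, e.g. an unmodeled elastic force
    return 0.05 * np.sin(x)

def f_r(x, u):
    # Real system, cf. Eq. (3): approximate part + residual + Gaussian noise w
    return f_a(x, u) + f_tilde(x, u) + rng.normal(0.0, 1e-3)

# Rolling out the same input sequence on both systems exposes the state
# residual Delta x_t = x^r_t - x^a_t that the GP is later trained to predict.
x_r = x_a = 1.0
for _ in range(3):
    x_r, x_a = f_r(x_r, 0.5), f_a(x_a, 0.5)
delta_x = x_r - x_a
print(delta_x)
```

Since w is small, Δx_t is dominated by the accumulated effect of f̃; this is exactly the signal that Sec. II-C models with a GP.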
C. Learning the Unknown Dynamics using Gaussian Processes

The unknown dynamics f̃(·,·) + w in (3) is learned from sensory data using Gaussian Processes (GPs) [14]. The goal of the learning process is to obtain an estimate f̂ of f̃ + w such that f^a(·,·) + f̂(·,·) ≈ f^a(·,·) + f̃(·,·) + w. To learn the unknown dynamics, we exploit the real x^r_t and approximate x^a_t states, as well as the real u^r_t and approximate u^a_t control inputs. For convenience, let us define Δ_t = (Δx_t, Δu_t), where Δx_t = x^r_t − x^a_t and Δu_t = u^r_t − u^a_t. We also define x̃^r_t = (x^r_t, u^r_t) and x̃^a_t = (x^a_t, u^a_t). Using these definitions, it is straightforward to show that x̃^r_t = x̃^a_t + Δ_t and to rewrite the real dynamics in (3) as

  f^r(x̃^r_t) = f^r(x̃^a_t + Δ_t) = f^a(x̃^a_t + Δ_t) + f̃(x̃^a_t + Δ_t) + w    (5)

Recalling that the zero-order Taylor series expansion of a function is f(a + Δa) = f(a) + r_a, we can write

  f^r(x̃^a_t + Δ_t) = f^a(x̃^a_t) + r^a_t + f̃(x̃^a_t) + r̃_t + w    (6)

where r^a_t and r̃_t are the residual errors generated by stopping the Taylor series expansion at the zero order. Considering (6), the difference between the real (3) and approximate (4) systems can be written as

  Δx_{t+1} = f^r(x̃^a_t + Δ_t) − f^a(x̃^a_t)
           = f^a(x̃^a_t) + f̃(x̃^a_t) + r_t + w − f^a(x̃^a_t)
           = f̃(x̃^a_t) + r_t + w = f̂(x̃^a_t, Δ_t)    (7)

where we pose r_t = r^a_t + r̃_t. GPs assume that the data are generated by a set of underlying functions whose joint probability distribution is Gaussian [14]. Having assumed a Gaussian noise term w, we can conclude that the stochastic process (7) can be effectively represented as a GP.

Equation (7) shows that f̂ maps x̃^a_t and Δ_t into the next state difference Δx_{t+1}. Hence, the training inputs of our GP are the tuples {x̃^a_t, Δ_t}_{t=0}^{N−1}, while the training outputs are {y_t = Δx_{t+1} − Δx_t}_{t=0}^{N−1}. The training data are obtained as follows. At the n-th rollout, the current policy π^n is applied to the robot and to the approximate model. As already mentioned, for the first rollout we set π^1 = π^a. After the trial, we have the two data sets X^a = {x^a_t, u^a_t}_{t=0}^{N} and X^r = {x^r_t, u^r_t}_{t=0}^{N}. Hence, we simply have to calculate the element-wise difference X^r − X^a to obtain the tuples X^d = {Δx_t, Δu_t}_{t=0}^{N}, which contain the rest of the training data for the Gaussian process.

The state difference predicted with a GP, Δx_{t+1}, is Gaussian distributed, i.e., p(Δx_{t+1} | x̃^a_t, Δ_t) = N(Δx_{t+1} | μ_{t+1}, Σ_{t+1}). Given N training inputs X̃ = [{x̃^a_1, Δ_1}, ..., {x̃^a_N, Δ_N}], training outputs y = [y_1, ..., y_N]^T, and a query point x̃* = [x̃^{aT}_t, Δ_t^T]^T, the predicted one-step mean μ_{t+1} and variance Σ_{t+1} are

  μ_{t+1} = Δx_t + K_{x̃*X̃}(K_{X̃X̃} + σ_n^2 I)^{-1} y
  Σ_{t+1} = k(x̃*, x̃*) − K_{x̃*X̃}(K_{X̃X̃} + σ_n^2 I)^{-1} K_{X̃x̃*}    (8)

where K_{x̃*X̃} = k(x̃*, X̃), K_{X̃x̃*} = K_{x̃*X̃}^T, k(·,·) is a covariance function, and the generic element (i, j) of the matrix K_{X̃X̃} is given by k(x̃_i, x̃_j). In our approach, k(·,·) is the squared exponential covariance function

  k(x̃_i, x̃_j) = σ_k^2 exp(−(1/2)(x̃_i − x̃_j)^T Λ^{-1} (x̃_i − x̃_j))    (9)

where Λ = diag(l_1^2, ..., l_D^2). The tunable parameters Λ, σ_k^2 and σ_n^2 are learned from the training data by using evidence maximization [14]. The standard GP formulation works for scalar outputs. For systems with a multidimensional state, we consider one GP for each dimension, assuming that the dimensions are independent of each other.

D. Policy Parameterization and Cost Function

The control of a robotic device requires a continuous-valued control policy. It is clear that searching for a control policy in a continuous space is unfeasible in terms of computational cost. A common strategy [2] to solve this issue consists in assuming that the policy π is parameterized by θ ∈ R^p, i.e., π(x, θ). In this work, we assume that the policy is the mean of a GP in the form

  π(x, θ) = Σ_{i=1}^{D} k(x, c_i) [(K_{CC} + σ_π^2 I)^{-1} y_π]_i    (10)

where x is the current state and C = [c_1, ..., c_D] are the centers of the Gaussian basis function k(·,·), defined as in (9). The vector y_π plays the role of the GP training targets, while K_{CC} is defined as in (8). The tunable parameters of the policy (10) are θ = [C, Λ, y_π]. As suggested in [10], we set the variance σ_π^2 = 0.01 in our experiments.
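A minimal numpy sketch of the one-step GP prediction in (8) with the squared exponential kernel (9). For brevity it uses an isotropic length scale and fixed hyperparameters instead of evidence maximization, and it regresses the targets directly rather than adding the Δx_t offset of (8); names and constants are illustrative:

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0, sig_k=1.0):
    # Squared exponential covariance, cf. Eq. (9), isotropic length scale
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sig_k ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X, y, x_star, sig_n=1e-2):
    # GP posterior mean and variance at a query point, cf. Eq. (8)
    K = sq_exp_kernel(X, X) + sig_n ** 2 * np.eye(len(X))
    k_s = sq_exp_kernel(x_star[None, :], X)          # K_{x*, X}
    mean = k_s @ np.linalg.solve(K, y)
    var = sq_exp_kernel(x_star[None, :], x_star[None, :]) \
          - k_s @ np.linalg.solve(K, k_s.T)
    return float(mean), float(var)

# Train on a toy scalar residual (stand-in for one output dimension of Eq. 7)
X = np.linspace(-2.0, 2.0, 20)[:, None]
y = 0.3 * np.sin(2.0 * X[:, 0])
mean, var = gp_predict(X, y, np.array([0.5]))
print(mean, var)
```

For a d-dimensional state, one such GP is fitted per output dimension, as described above.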
As already mentioned, the optimal policy minimizes the expected return J(π_θ) in (1). In this work, we adopt a saturating cost function c(x^r_t) with spread σ_c^2

  c(x^r_t) = 1 − exp(−(1/(2σ_c^2)) ||x^r_t − x_g||^2) ∈ [0, 1]    (11)

where x_g denotes the goal (target) state.(1)

E. Policy Evaluation

The learned GP model is used to compute the long-term predictions p(Δx_1 | π), ..., p(Δx_N | π) starting from p(Δx_0). The long-term predictions, in fact, are needed to compute the expected costs and to minimize the expected return in (1). Computing the long-term predictions from the one-step predictions in (8) requires the predictive distribution

  p(y_t) = ∫∫ p(f̂(ξ_t) | ξ_t) p(ξ_t) df̂ dξ_t    (12)

where ξ_t = [x̃^{aT}_t, Δ_t^T]^T and y_t = Δx_{t+1} − Δx_t. Solving the integral in (12) is analytically intractable. Hence, we assume a Gaussian distribution p(Δ_t) = N(Δ_t | μ_Δ, Σ_Δ). The mean μ_Δ and covariance Σ_Δ are obtained by applying the moment matching algorithm in [10]. The predictive distribution p(Δx_{t+1} | π) is also Gaussian, with mean and covariance

  μ_{t+1} = μ_t + μ_Δ
  Σ_{t+1} = Σ_t + Σ_Δ + cov(Δx_t, Δ_t) + cov(Δ_t, Δx_t)    (13)

where the covariance cov(·,·) is computed as in [19]. Once the long-term predictions are computed, the estimated real state can be calculated as x^r_t = x^a_t + Δx_t, where x^a_t is the deterministic state of the approximate model (see Sec. II-A). Given that x^r_t = x^a_t + Δx_t, the expected real long-term cost can be expressed as

  J(π_θ) = Σ_{t=0}^{N} E_{x^r_t}[c(x^r_t)] = Σ_{t=0}^{N} E_{Δx_t}[c̃(Δx_t)]    (14)

Indeed, by substituting Δx_t = x^r_t − x^a_t into (11) we obtain

  c(x^r_t) = 1 − exp(−(1/(2σ_c^2)) ||x^r_t − x_g||^2)
           = 1 − exp(−(1/(2σ_c^2)) ||Δx_t + x^a_t − x_g||^2)
           = 1 − exp(−(1/(2σ_c^2)) ||Δx_t − (x_g − x^a_t)||^2)
           = 1 − exp(−(1/(2σ_c^2)) ||Δx_t − x̃_{g,t}||^2) = c̃(Δx_t)

The vector x^a_t is the approximate state computed by applying the initial policy π^a and it is deterministic. Hence, we can precalculate all the x̃_{g,t} = x_g − x^a_t for t = 1, ..., N. Note that by minimizing c̃(Δx_t) we force the real robot to follow the same trajectory as the approximate model. Recalling that the predictive distribution p(Δx_t | π) = N(Δx_t | μ_t, Σ_t), we can rewrite the real expected costs in (14) as

  E_{Δx_t}[c̃(Δx_t)] = ∫ c̃(Δx_t) N(Δx_t | μ_t, Σ_t) dΔx_t    (15)

where μ_t and Σ_t are defined as in (13). The expected value E_{Δx_t}[c̃(Δx_t)] in (15) can be computed analytically and it is differentiable, which allows the adoption of gradient-based approaches for policy search.

F. Policy Search

We leverage the gradient ∂J(π_θ)/∂θ to search for the policy parameters θ that minimize the expected long-term cost J(π_θ). In our formulation, the expected costs in (14) can be written in the form (15). Given that the expected costs have the form (15), and assuming that the moment matching algorithm is used for long-term predictions, the gradient ∂J(π_θ)/∂θ can be computed analytically. The computation of the gradient ∂J(π_θ)/∂θ is detailed in [10]. The result of the policy improvement is a set of policy parameters θ^n, where n indicates the n-th iteration. The control policy π^n = π(x, θ^n) drives the estimated real state x^a_t + Δx_t towards the desired goal. The new policy π^n is applied to the real robot to obtain new training data and to refine the residual model.

(1) In [19], the author presents the benefits of the saturating cost (11) compared to a quadratic cost.
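The Gaussian expectation in (15) of the saturating cost (11) is available in closed form (cf. [19]); below is a numpy sketch with illustrative numbers, checked against Monte Carlo sampling:

```python
import numpy as np

def sat_cost(x, x_g, sigma_c):
    # Saturating cost, cf. Eq. (11)
    d = x - x_g
    return 1.0 - np.exp(-0.5 * (d @ d) / sigma_c ** 2)

def expected_sat_cost(mu, Sigma, x_g, sigma_c):
    # E[c(x)] for x ~ N(mu, Sigma), computed analytically (cf. [19]):
    # 1 - |I + Sigma/sigma_c^2|^{-1/2} exp(-0.5 d^T (Sigma + sigma_c^2 I)^{-1} d)
    D = len(mu)
    d = mu - x_g
    S1 = np.eye(D) + Sigma / sigma_c ** 2
    quad = d @ np.linalg.solve(Sigma + sigma_c ** 2 * np.eye(D), d)
    return 1.0 - np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(S1))

mu = np.array([0.3, -0.1])
Sigma = 0.05 * np.eye(2)
x_g = np.zeros(2)
ec = expected_sat_cost(mu, Sigma, x_g, sigma_c=0.5)
print(ec)
```

Because ec is a smooth function of μ and Σ, its gradients with respect to the policy parameters can be propagated analytically, which is what makes the gradient-based search of Sec. II-F possible.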
The process is repeated until the task is learned. PI-REM is summarized in Algorithm 1.

Algorithm 1 Policy Improvement with REsidual Model learning (PI-REM)
1: Learn a policy π^a for the approximate model (4)
2: π^n = π^a
3: while task learning is not completed do
4:   Apply π^n to the approximate model and the real robot
5:   Calculate the new training set X^d = X^r − X^a
6:   Calculate the new target set x̃_{g,t} = x_g − x^a_t
7:   Learn the GP model of the dynamics (7)
8:   while J(π_θ) not minimized do
9:     Policy evaluation using (13) and (15)
10:    Policy improvement (Sec. II-F and [10])
11:  end while
12:  return θ
13: end while
14: return π

III. EXPERIMENTS

Experiments in this section show the effectiveness of our approach in learning control policies for simulated and real physical systems. The proposed approach is compared with the PILCO algorithm in [10], considering the number of rollouts performed on the real system and the total time duration of the rollouts (real experience). Both PI-REM and PILCO are implemented in Matlab®. In all the experiments, we use the saturating cost function in (11). To consider bounded control inputs, the control policy in (10) is squashed into the interval [−u_max, u_max] by using the function σ(x) = u_max (9 sin(x) + sin(3x))/8.

Fig. 2. (a) The pendulum interacting with an elastic environment. (b) The cart-pole system connected to a spring.

A. Simulation Results

We tested PI-REM in two different tasks, namely a pendulum swing-up and a cart-pole swing-up (see Fig. 2), and compared the results with PILCO [10]. For a fair comparison, we use the same parameters for PI-REM and PILCO.

1) Pendulum swing-up: The task consists in balancing the pendulum in the inverted position θ_g = π rad, starting from the stable position θ = 0 rad. As shown in Fig. 2(a), the pendulum interacts with an elastic environment, modeled as a spring, when it reaches the vertical position θ = π rad. The interaction generates an extra force f_k on the pendulum, which is neglected in our approximate model. Hence, the approximate model is the standard pendulum model

  ((1/4) m l^2 + I) θ̈_t + b θ̇_t + (1/2) m l g sin θ_t = u_t

where the mass m = 1 kg, the length l = 1 m, I = (1/12) m l^2 is the moment of inertia of the pendulum around its midpoint, b = 0.1 N m s/rad is the friction coefficient, g = 9.81 m/s^2 is the gravity acceleration, and u_t is the control torque. The state of the pendulum is defined as x = [θ̇, θ]^T, while the goal to reach is x_g = [0, π]^T. The external force f_k depends on the stiffness of the spring. We tested three different stiffness values, namely 100, 200 and 500 N/m. Results are reported in Tab. I. Note that the time of real experience in Tab. I is a multiple of the number of rollouts. This is because we keep the duration of the task fixed for each rollout.

TABLE I. RESULTS FOR THE PENDULUM SWING-UP (stiffness [N/m]; real rollouts [#]; real experience [s], for PI-REM and PILCO [10]).

Stiffness 100 N/m: For a stiffness equal to 100 N/m, u_max = 5 Nm is sufficient to fulfill the task. The total duration of the task is T = 4 s, while the sampling time is dt = 0.1 s. Results in Tab. I show that PI-REM needs only 2 iterations and 8 s of real experience to learn the control policy. For comparison, PILCO needs 6 iterations and 24 s of real experience. The improvement mostly depends on the reduced state prediction error of PI-REM. As shown in Fig. 3(a), the approximate model accurately represents the system until the pendulum touches the spring. This lets PI-REM properly estimate the state after 2 iterations (see Fig. 3(b)). PILCO, instead, spends 5 iterations to learn a proper model of the system. As shown in Fig. 4, after 2 iterations the pendulum is able to reach the goal x_g = [0, π]^T. The learned control policy is bounded, u ∈ [−5, 5] Nm, and it drives the pendulum to the goal position in less than 1 s (see Fig. 4(a)). This results in an average expected return J(π_θ) ≈ 12 (see Fig. 4(b)). PILCO needs more rollouts (6 instead of 2) to find the optimal policy. Since we used the same policy parameterization and cost function in both approaches, PI-REM and PILCO learn almost the same control policy, which generates a similar behavior of the pendulum.
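The pendulum model above is easy to simulate; the sketch below integrates it with a simple Euler scheme (the integrator used in the paper is not specified) and models the elastic contact as a torsional spring active beyond θ = π, which is an assumption on the contact geometry. It also includes the squashing function σ(x) = u_max(9 sin x + sin 3x)/8 used to bound the control inputs:

```python
import numpy as np

m, l, b, g = 1.0, 1.0, 0.1, 9.81
I = m * l ** 2 / 12.0                 # inertia about the midpoint
k = 100.0                             # spring stiffness (the 100 N/m case)

def squash(x, u_max):
    # Bounds the policy output to [-u_max, u_max]
    return u_max * (9.0 * np.sin(x) + np.sin(3.0 * x)) / 8.0

def theta_ddot(theta, theta_dot, u, with_spring=True):
    # Pendulum dynamics; the contact torque is an assumed model of f_k
    torque = u - b * theta_dot - 0.5 * m * l * g * np.sin(theta)
    if with_spring and theta > np.pi:
        torque -= k * l ** 2 * (theta - np.pi)
    return torque / (0.25 * m * l ** 2 + I)

def step(state, u, dt=0.1, with_spring=True):
    theta, theta_dot = state          # explicit Euler integration step
    acc = theta_ddot(theta, theta_dot, u, with_spring)
    return np.array([theta + dt * theta_dot, theta_dot + dt * acc])

s = np.array([0.0, 0.0])              # hanging rest position
for _ in range(40):                   # T = 4 s at dt = 0.1 s, zero torque
    s = step(s, 0.0)
print(s)
```

With zero torque the hanging position is an equilibrium, so the state stays at [0, 0]; the approximate model used by PI-REM is obtained by simply setting with_spring=False.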
Stiffness 200 N/m: For a stiffness equal to 200 N/m, we have to reduce the sampling time to 0.05 s to avoid numerical instabilities in the simulation of the real model. Moreover, we increase the maximum control input to 15 Nm. In this case, our approach needs 3 iterations and 12 s of real experience to learn the control policy (see Tab. I).

Stiffness 500 N/m: For a stiffness equal to 500 N/m, we have to further reduce the sampling time to 0.0125 s. To speed up the learning, we reduce the prediction time to 2 s. Moreover, we increase the maximum control input to 25 Nm. In this case, our approach needs 2 iterations and 4 s of real experience to learn the control policy (see Tab. I).

Fig. 3. Pendulum model learning results (stiffness 100 N/m). (a) State predicted and measured after one iteration of PI-REM. (b) State prediction root mean square errors of PI-REM and PILCO for different iterations.

Fig. 4. Results for the pendulum swing-up (stiffness 100 N/m). (a) Pendulum angle and control policy obtained with PI-REM after 2 iterations. The black solid line represents the goal. (b) Evolution of the expected return J(π_θ) for different rollouts (mean and std over 5 executions).

2) Cart-pole swing-up: The task consists in swinging up the pendulum on the cart and balancing it in the inverted position θ_g = π rad, starting from the stable position θ = 0 rad. The control input is the horizontal force u_t that makes the cart move to the left or to the right. As shown in Fig. 2(b), the cart is connected to a wall through a spring. The additive force f_k of the spring affects the motion of the cart. The mass of the cart is m_c = 0.5 kg, the mass of the pendulum is m_p = 0.5 kg, and the length of the pendulum is l = 0.5 m. The state of the cart-pole system is x = [p, ṗ, θ, θ̇]^T, where p is the position of the cart, ṗ is the velocity of the cart, and θ and θ̇ are respectively the angle and the angular velocity of the pendulum. The goal state is x_g = [0, 0, π, 0]^T.
The cart-pole system is more complicated than the single pendulum, since the state has four dimensions instead of two. Also in this case, the approximate model does not consider the extra force f_k. The approximate model of the cart-pole pendulum is then

  (m_c + m_p) p̈_t + (1/2) m_p l θ̈_t cos θ_t − (1/2) m_p l θ̇_t^2 sin θ_t = u_t − b ṗ_t
  2 l θ̈_t + 3 p̈_t cos θ_t + 3 g sin θ_t = 0

We tested three different stiffness values, namely 250, 500 and 1200 N/m. Results are reported in Tab. II. Recall that the time of real experience in Tab. II is a multiple of the number of rollouts.

TABLE II. RESULTS FOR THE CART-POLE SWING-UP (stiffness [N/m]; real rollouts [#]; real experience [s], for PI-REM and PILCO [10]).

Stiffness 250 N/m: For a stiffness equal to 250 N/m, u_max = 10 N is sufficient to fulfill the task. The total duration of the task is T = 4 s, while the sampling time is dt = 0.1 s. Results in Tab. II show that our approach needs only 2 iterations and 8 s of real experience to learn the control policy. For comparison, PILCO needs 5 iterations and 20 s of real experience.

Stiffness 500 N/m: For a stiffness equal to 500 N/m, we have to increase the maximum control input to 15 N in order to fulfill the task. In this case, PI-REM needs 3 iterations and 12 s of real experience to learn the policy (see Tab. II).

Stiffness 1200 N/m: For a stiffness equal to 1200 N/m, we have to reduce the sampling time to 0.05 s to avoid numerical instabilities in the simulation of the real model. We also reduce the duration of the task to T = 2 s to speed up the learning process. In this case, our approach needs 15 iterations and 30 s of real experience to learn the control policy (see Tab. II). As shown in Fig. 5, after 15 iterations the cart-pole system reaches the goal x_g = [0, 0, π, 0]^T. The learned control policy is bounded, u ∈ [−15, 15] N, and it drives the cart-pole system to the goal position (see Fig. 5(a) to 5(c)). This results in an average expected return J(π_θ) ≈ 2 (see Fig. 5(d)). PILCO needs more rollouts (23 instead of 15) to find the optimal policy. Since we used the same policy parameterization and cost function in both approaches, PI-REM and PILCO learn almost the same control policy, which generates a similar behavior of the cart-pole system.

Fig. 5. Results for the cart-pole swing-up (stiffness 1200 N/m). (a), (b) State of the cart-pole system and (c) learned control policy after 15 iterations of PI-REM. The black solid line represents the goal. (d) Evolution of the expected return J(π_θ) for different rollouts (mean and std over 5 executions).

In the considered simulations, the proposed PI-REM performs better than PILCO. In terms of rollouts, our approach takes from 25% to 67% fewer iterations than PILCO. As a consequence, the time of real experience is reduced by 25% to 67% (from a minimum of 2 s to a maximum of 16 s) with respect to PILCO.

B. Robot Experiment

PI-REM is applied to control a qbmove Maker variable stiffness actuator (VSA) [20]. As shown in Fig. 6, the arm consists of four VSAs. The first joint of the arm is actuated, while the other three are not controlled. The task consists in swinging up the robotic arm by controlling the actuated joint. As approximate model, we use the mass-spring system

  m θ̈_t + k (θ_t − u_t) = 0    (16)

where m = 0.78 kg is the mass of the unactuated joints, and k = 14 Nm/rad is the maximum stiffness of the qbmove Maker [20]. The variable θ_t represents the link position, while u_t is the commanded motor position. It is worth noticing that the approximate model in (16) neglects the length, inertia and friction of the arm in Fig. 6. Moreover, the model in (16) represents a VSA only in the region where the behavior of the spring is linear.

Fig. 6. The VSA arm used in the robotic experiment (one actuated joint, three unactuated joints).

The goal is to learn a control policy u_t that drives the model from the initial state x_0 = [θ_0, θ̇_0]^T = [π/2, 0]^T to the goal x_g = [−π/2, 0]^T in T = 5 s. The maximum input position is u_max = 1.7 rad. Joint velocities are computed from the measured joint angles using a constant-velocity Kalman filter. The robot is controlled at 50 Hz via a Simulink® interface. Results in Tab. III show that our approach needs 3 iterations and 15 s of real experience to learn the task. For comparison, PILCO needs 5 iterations and 25 s of real experience to learn the same task. Hence, PI-REM shows a performance improvement of 40%. Figure 7 shows the learned policy, the cost evolution and the state of the robot after 3 iterations of PI-REM. Note that, for a better visualization, we show only the first two seconds of the task. As shown in Fig. 7(a), the robot effectively reaches the goal after approximately 0.5 s. This results in an expected return J(π_θ) ≈ 9 (see Fig. 7(d)). PILCO needs 5 rollouts instead of 3 to find the optimal policy. As shown in Fig. 7(d), the expected return of PILCO is also J(π_θ) ≈ 9. This indicates that PILCO and PI-REM learn a similar control policy that generates a similar behavior of the VSA pendulum.
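Putting the pieces together, the outer loop of Algorithm 1 can be sketched on a deliberately tiny, hypothetical 1-D system. To keep the example short, the GP of Sec. II-C is replaced by ridge regression on RBF features and the gradient-based search of Sec. II-F by a grid search; every constant is made up:

```python
import numpy as np

def f_a(x, u):                       # known approximate model, cf. Eq. (4)
    return 0.8 * x + 0.5 * u

def f_r(x, u):                       # "real" system: f_a plus an unmodeled term
    return f_a(x, u) + 0.4 * np.sin(2.0 * x)

centers = np.linspace(-2.0, 2.0, 15)

def features(x):                     # RBF features standing in for the GP
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centers) / 0.3) ** 2)

def rollout(f, theta, x0=1.5, N=25):
    xs, x = [x0], x0
    for _ in range(N):
        x = f(x, -theta * x)         # linear state-feedback policy u = -theta*x
        xs.append(x)
    return np.array(xs)

def cost(xs):                        # saturating cost towards the goal x_g = 0
    return float(np.sum(1.0 - np.exp(-0.5 * xs ** 2 / 0.25)))

theta, grid = 0.5, np.linspace(0.0, 2.0, 81)
for _ in range(4):                   # outer loop of Algorithm 1
    xr = rollout(f_r, theta)         # rollout on the real system
    us = -theta * xr[:-1]
    Phi = features(xr[:-1])
    targets = xr[1:] - f_a(xr[:-1], us)   # residual targets, cf. Eq. (7)
    w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(len(centers)),
                        Phi.T @ targets)
    def f_hat(x, u, w=w):            # approximate model + learned residual
        return f_a(x, u) + float(features(x)[0] @ w)
    # policy improvement on the composed model (grid search stand-in)
    theta = min(grid, key=lambda th: cost(rollout(f_hat, th)))
print(cost(rollout(f_r, theta)))
```

The composed model f_hat is refitted after every real rollout, so the policy search never needs more real data than the rollouts themselves; this mirrors the role of the residual GP in PI-REM.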
Fig. 7. Results for the VSA pendulum swing-up. (a) State of the VSA pendulum, (b) learned control policy, and (c) snapshots of the task execution after 3 iterations of PI-REM. The black solid line in (a) represents the goal. (d) Evolution of the expected return J(π_θ) for different rollouts.

Performed simulations and experiments show that learning a black-box model of the residual, instead of a black-box model of the entire system, can significantly improve the performance of the policy search algorithm.

TABLE III. RESULTS FOR THE VSA PENDULUM SWING-UP.

            Real rollouts [#]   Real experience [s]
PI-REM             3                  15
PILCO [10]         5                  25

IV. CONCLUSION AND FUTURE WORK

In this work, we proposed PI-REM, a model-based reinforcement learning approach that exploits a simplified model to find an optimal control policy for a given task. The proposed approach works in two steps. First, an optimal policy is found for the simplified model using standard policy search algorithms. Second, sensory data from the real robot are exploited to learn a probabilistic model that represents the difference between the real system and its simplified model. Differently from most model-based RL approaches, our method does not learn a black-box model of the entire system. PI-REM learns, instead, a black-box model of the differences between the real system and an approximate model, i.e., the model residual. As a consequence, our approach is suitable in applications where the system model is inaccurate or only partially available. The approach has been compared with PILCO, showing improvements in terms of performed rollouts and time of real experience. Our future research will focus on testing the proposed approach in more complicated tasks with a high-dimensional search space. An important open question is how inaccurate the model can be in order to maintain enhanced performance.
We expect that with a completely wrong model, the performance of the algorithm becomes similar or slightly worse than classical black-box approaches sch as. However, formally qantifying the tolerated ncertainty still remains an open theoretical problem. Another potential limitation of or approach is that we assme additive model residals. Evalating the generality of this assmption more precisely will be a possible ftre research topic. ACKNOWLEDGEMENTS This work has been partially spported by the Technical University of Mnich, International Gradate School of Science and Engineering, and by the Marie Crie Action LEACON, EU project REFERENCES [1] R. Stton and A. Barto, Reinforcement learning: an introdction, ser. A Bradford book. MIT Press, [2] J. Kober, D. Bagnell, and J. Peters, Reinforcement learning in robotics: a srvey, IJRR, vol. 32, no. 11, pp , 213. [3] S. Calinon, P. Kormshev, and D. Caldwell, Compliant skills acqisition and mlti-optima policy search with EM-based reinforcement learning, RAS, vol. 61, no. 4, pp , 213. [4] J. Kober and J. Peters, Policy search for motor primitives in robotics, in NIPS, 29, pp [5] E. Theodoro, J. Bchli, and S. Schaal, A generalized path integral control approach to reinforcement learning, Jornal of Machine Learning Research, vol. 11, pp , 21. [6] J. Peters, K. Melling, and Y. Altn, Relative entropy policy search, in Conference on Articial Intelligence AAAI, 21, pp [7] D. Lee, H. Knori, and Y. Nakamra, Association of whole body motion from tool knowledge for hmanoid robots, in International Conference on Intelligent Robots and Systems, 28, pp [8] H. Knori, D. Lee, and Y. Nakamra, Associating and reshaping of whole body motions for object maniplation, in International Conference on Intelligent Robots and Systems, 29, pp [9] M. Saveriano, S. An, and D. Lee, Incremental kinesthetic teaching of end-effector and nll-space motion primitives, in International Conference on Robotics and Atomation, 215, pp [1] M. P. Deisenroth, D. 
Fox, and C. E. Rasmussen, Gaussian processes for data-efficient learning in robotics and control, TPAMI, vol. 37, no. 2, 2015.
[11] J. G. Schneider, Exploiting model uncertainty estimates for safe dynamic control learning, in Advances in Neural Information Processing Systems 9. MIT Press, 1996.
[12] C. G. Atkeson and S. Schaal, Robot learning from demonstration, in International Conference on Machine Learning, 1997.
[13] C. G. Atkeson and J. C. Santamaria, A comparison of direct and model-based reinforcement learning, in International Conference on Robotics and Automation, 1997.
[14] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[15] F. Farshidian, M. Neunert, and J. Buchli, Learning of closed-loop motion control, in IROS, 2014.
[16] S. Levine and P. Abbeel, Learning neural network policies with guided policy search under unknown dynamics, in Advances in Neural Information Processing Systems, 2014.
[17] B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo, Robotics: Modelling, Planning and Control. Springer, 2009.
[18] G. De Maria, P. Falco, C. Natale, and S. Pirozzi, Integrated force/tactile sensing: the enabling technology for slipping detection and avoidance, in ICRA, 2015.
[19] M. Deisenroth, Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing, 2010.
[20] M. G. Catalano, G. Grioli, M. Garabini, F. Bonomo, M. Mancini, N. Tsagarakis, and A. Bicchi, VSA-CubeBot: a modular variable stiffness platform for multiple degrees of freedom robots, in International Conference on Robotics and Automation, 2011.