Learning to Serve and Bounce a Ball


Sandra Amend, Gregor Gebhardt
Technische Universität Darmstadt

Abstract

In this paper we investigate learning the tasks of ball serving and ball bouncing. These tasks display characteristics which are common in a variety of motor skills. To learn the required motor skills for these tasks, the robot uses Relative Entropy Policy Search (REPS), which is a state-of-the-art method in policy search reinforcement learning. Our experiments show that REPS converges consistently to solutions that are not only good but also robust.

1 Introduction

Learning motor skills similar to those of human beings poses a challenging task for robots. They are difficult and non-trivial to learn, but are necessary for performing complex tasks under varying conditions. There are several common aspects to motor skills. Motor skills are often target oriented. For example, reaching movements are often directed to specific objects. Motor skills are often also constrained by time. As an example, a ball has to be caught before it hits the ground. Some tasks such as walking require periodic motor skills, which involve repeatedly performing similar movements. Most motor skills also require feedback to compensate for errors. For example, when writing we try to keep a constant pressure on the pen.

In this paper we will take a closer look at the motor skills of serving and bouncing a ball. Ball serving requires the robot to hit a dropped ball to a desired target location on the ground. Ball bouncing involves repeatedly hitting the ball into the air while keeping it centered above the paddle. They are good representatives of the common aspects of motor skills mentioned above. Ball serving is target oriented and time dependent, while ball bouncing extends serving with the aspects of feedback and periodic movement. A detailed description of the tasks follows in Section 2. For learning these tasks we will use the state-of-the-art method Relative Entropy Policy Search (REPS), which is explained in the section on the learning method. Furthermore, we will compare REPS to finite differences and evaluate the robustness of our solution in Section 4.
The benefit of learning a feedback controller for ball bouncing will also be evaluated.

2 Background

Before we come to the methods we used to learn the ball bouncing task, we first describe the setup of the ball bouncing environment we used for the experiments and then show how we modeled the movements of the robot arm.

2.1 The Experiment Setup

For the experiments we used a Barrett WAM Arm with seven degrees of freedom and a table tennis racket attached as the end effector. The robot arm was simulated using the SL Simulation and Real-Time Control Software Package (Schaal). The kinematic configuration of the Barrett WAM is shown in Figure 1. For our experiments we only actuated the 4th and the 7th joint. The 4th joint was used to perform a stroke movement, which is explained in detail in the next section, and the 7th joint was used to control the motion of the ball in the x-direction.

Figure 1: The kinematic configuration of the 7-DOF Barrett WAM Arm. The blue arrows depict the rotation axes of the joints. For our experiments we actuated the 4th and the 7th joint, while the other joints were kept in their zero position.

In the ball serving task the ball was initially dropped from the ceiling above the table tennis racket. The robot has to perform a hitting motion to redirect the ball to a target location on the ground. The target locations were specified as part of the task. Given the target location $x_g$ and the location $x_b$ where the ball landed, the reward for the ball serving task was given by

$$R = -\lVert x_g - x_b \rVert^2$$

In the ball bouncing task the ball was also dropped from the ceiling. Instead of a target on the ground, the ball had to be hit back to the initial starting height above the racket. In this way the robot and the ball return to the same state as at the start of the movement, and the action can be performed repeatedly. Given that the peak of the ball's trajectory is at $x_p$ and the initial position was at $x_i$, the reward function for the ball bouncing task was given by

$$R = -\lVert x_i - x_p \rVert^2 - \lVert \dot{x}_p \rVert^2$$

For the ball bouncing task the ball position was limited to the x-z plane. The learned controller should also be robust to disturbances in the x direction.

2.2 Parametric Representation of the Stroke

The first and most intuitive parametric representation of the stroke was a simple sinusoidal movement in the 4th joint (elbow) of the robot. The parameters of this movement consisted only of the amplitude A of the sine and its period T. The trajectory of the 4th joint for this movement is outlined in Figure 2. However, this representation did not lead to good results.

Figure 2: First approach of a parametric representation of the stroke: a sinusoidal trajectory of joint 4 (elbow) with amplitude A and period T.

The second approach we followed was a slightly more sophisticated representation. The part of the movement in which the ball is hit is still a sinusoidal trajectory with amplitude A and period T. Additionally, we introduced a delay phase with the parameter d before the stroke phase and a return phase after it. So the total movement is composed of a delay phase, a sinusoidal hitting phase, and a fixed return phase in which the robot arm returns to its initial position. The delay phase starts when the ball has reached its peak position. The trajectory of the elbow joint of this representation is depicted in Figure 3. We will show in Section 4 that we were able to achieve good results with this representation.

Figure 3: Second approach of a parametric representation of the stroke. The movement consists of three phases: (1) the delay phase with parameter d, in which the arm rests at its initial position; (2) the hitting phase, in which the arm follows a sinusoidal trajectory with amplitude A and period T in joint 4 to hit the ball; (3) the return phase, in which the arm returns to its initial position.
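The three-phase stroke described above can be sketched as a function of time. This is only an illustration under stated assumptions, not the authors' implementation: the text does not say how much of the sine period the hitting phase covers or how the return is shaped, so the sketch assumes the hitting phase ends at the sine peak (a quarter period) and that the return is a cosine blend of fixed duration `return_time`. The function name `stroke_angle` and all parameter names are hypothetical.

```python
import math


def stroke_angle(t, delay, amplitude, period, return_time=0.5):
    """Elbow (joint 4) angle at time t for a delay / hit / return stroke.

    Assumptions (not from the paper): the hitting phase covers the first
    quarter of the sine period, and the return phase is a cosine blend
    back to zero that lasts `return_time` seconds.
    """
    if t < delay:
        # Delay phase: the arm rests at its initial position.
        return 0.0
    tau = t - delay
    hit_time = period / 4.0
    if tau < hit_time:
        # Hitting phase: sinusoidal trajectory with amplitude A and period T.
        return amplitude * math.sin(2.0 * math.pi * tau / period)
    tau -= hit_time
    if tau < return_time:
        # Return phase: smooth blend from the peak back to the rest position.
        return amplitude * 0.5 * (1.0 + math.cos(math.pi * tau / return_time))
    return 0.0
```

With this parametrization a single stroke is fully described by the three numbers (d, A, T), which is what makes the stateless policy search of Section 3 possible.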
2.3 PD Control

In addition to the stroke movement in joint 4 (elbow) of the robot, we learned a PD (proportional-derivative) controller for the 7th joint to react to movements of the ball in the x-direction. A PD controller has the two parameters $k_p$ and $k_d$, which are applied to the error and the error's derivative, respectively:

$$u = k_p e + k_d \dot{e}. \qquad (1)$$

The error $e$ is the difference between the desired position $x_d$ and the ball's position $x_b$. Analogously, the error's derivative is the difference between the desired velocity $\dot{x}_d$ and the ball's velocity $\dot{x}_b$ (all of them are only the x-components of positions and velocities):

$$u = k_p (x_d - x_b) + k_d (\dot{x}_d - \dot{x}_b). \qquad (2)$$

For the ball bouncing task the desired position of the ball is $x_d = 0$, which is centered above the racket, and its desired velocity is $\dot{x}_d = 0$. For the ball serving task the PD controller was not used. Instead, the robot selected a fixed angle for the wrist joint throughout the trajectory. Using this joint it is able to hit the ball in different x directions.

2.4 The Learning Method

In reinforcement learning the general setup (Sutton and Barto, 1998) considers an agent that interacts with its environment. The actions taken by this agent are based on a Markov Decision Process (MDP). Hence, if the agent is in state $s \in S$ it selects an action $a \in A$ using the policy $\pi(a|s)$. The agent then transfers to the next state $s'$ with the transition probability $P^a_{ss'} = p(s'|s,a)$. This transition yields a reward $r(s,a) = R^a_s$ for the agent. The goal of reinforcement learning is now to find a policy that maximizes the expected reward of the agent,

$$J(\pi) = \sum_{s,a} \mu^\pi(s)\,\pi(a|s)\,R^a_s. \qquad (3)$$

Here, $\mu^\pi(s)$ denotes the probability of the agent being in state $s$, the state distribution. Policy search methods are one group of methods that maximize the expected reward by directly searching for an optimal policy. However, one drawback of most of these methods is that they take only the experience of the most recent trials into account when computing the new policy. Hence, information from older policy evaluations is lost during the policy improvement step.

Relative Entropy Policy Search

To circumvent the problem of loss of information when directly optimizing the policy, Peters et al. proposed the relative entropy policy search (REPS) method. The objective function here is again the expected reward, which is to be maximized. Additionally, the Kullback-Leibler divergence, or relative entropy, between the observed data distribution $q(s,a)$ and the data distribution $p^\pi(s,a) = \mu^\pi(s)\pi(a|s)$ (Peters et al., 2010) is constrained by an upper bound $\varepsilon$:

$$D(p^\pi \,\|\, q) = \sum_{s,a} \mu^\pi(s)\,\pi(a|s) \log \frac{\mu^\pi(s)\,\pi(a|s)}{q(s,a)} \leq \varepsilon. \qquad (4)$$

Together with the assumption of a stationary state distribution $\mu^\pi(s)$ and the constraint that probability distributions must sum to 1, they get the following problem statement:

Problem Statement. The goal of relative entropy policy search is to obtain policies that maximize the expected reward $J(\pi)$ while the information loss is bounded, i.e.,

$$\max_{\pi,\mu^\pi} J(\pi) = \sum_{s,a} \mu^\pi(s)\,\pi(a|s)\,R^a_s \qquad (5)$$
$$\text{s.t.} \quad \varepsilon \geq \sum_{s,a} \mu^\pi(s)\,\pi(a|s) \log \frac{\mu^\pi(s)\,\pi(a|s)}{q(s,a)} \qquad (6)$$
$$\sum_{s'} \mu^\pi(s')\,\phi_{s'} = \sum_{s,a,s'} \mu^\pi(s)\,\pi(a|s)\,P^a_{ss'}\,\phi_{s'} \qquad (7)$$
$$1 = \sum_{s,a} \mu^\pi(s)\,\pi(a|s) \qquad (8)$$

Both $\mu^\pi$ and $\pi$ are probability distributions and the features $\phi_s$ of the MDP are stationary under policy $\pi$.

3 Learning the Stroke

As we described in Section 2.2, the stroke movements are abstracted using a parametric description. Due to this simplification we do not have any states or, from another point of view, we are always in the same state when executing the movement. Thus we can simplify the REPS problem statement to the following form:

Problem Statement. Maximize the expected reward $J(\pi)$ while the loss of information is bounded, i.e.,

$$\max_{\pi} J(\pi) = \sum_a \pi(a)\,R_a \qquad (9)$$
$$\text{s.t.} \quad \varepsilon \geq \sum_a \pi(a) \log \frac{\pi(a)}{q(a)} \qquad (10)$$
$$1 = \sum_a \pi(a) \qquad (11)$$

From this simplified problem we can then derive the policy update

$$\pi(a) = \frac{q(a) \exp(R_a/\eta)}{\sum_b q(b) \exp(R_b/\eta)}, \qquad (12)$$

with the Lagrangian parameter $\eta$, which we obtain from the minimization of the dual function

$$g(\eta) = \eta \log \sum_a q(a) \exp(R_a/\eta) + \eta\varepsilon. \qquad (13)$$

As we are using policy iteration, the data distribution $q$ is inherent in the distribution of the samples from the old policy. Hence the sample-based policy update becomes a weighted maximum likelihood estimation with weight $w_i$ for the i-th sample:

$$w_i = \frac{\exp(r_i/\eta)}{\sum_{j=1}^{N} \exp(r_j/\eta)}, \qquad (14)$$

where $r_i$ is the reward received by the i-th sample of the policy. The dual function then becomes

$$\hat{g}(\eta) = \eta \log \left( \frac{1}{N} \sum_{i=1}^{N} \exp(r_i/\eta) \right) + \eta\varepsilon, \qquad (15)$$

with $N$ the number of samples. The learning algorithm is outlined in Algorithm 1.

Algorithm 1: Policy iteration with adapted REPS. fmin_BFGS stands for the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization method.

    Input: maximal information loss ε, initial policy π_0, number of iterations N, number of samples M.
    for each iteration k do
        Sampling: Draw M samples a_i from the policy π_k.
        Critic: Evaluate the policy.
            for each sample a_i, i ∈ {1, ..., M} do
                Perform the experiment with sample a_i to obtain the reward r_i.
            Compute the dual function:
                $\hat{g}(\eta) = \eta \log \left( \frac{1}{M} \sum_{i=1}^{M} \exp(r_i/\eta) \right) + \eta\varepsilon$
            Compute the dual function's derivative:
                $\frac{\partial \hat{g}}{\partial \eta} = \log \left( \frac{1}{M} \sum_{i=1}^{M} \exp(r_i/\eta) \right) - \frac{1}{\eta} \frac{\sum_{i=1}^{M} r_i \exp(r_i/\eta)}{\sum_{i=1}^{M} \exp(r_i/\eta)} + \varepsilon$
            Optimize: $\eta = \text{fmin\_BFGS}(\hat{g}, \partial\hat{g}/\partial\eta, \eta_0)$.
        Actor: Improve the policy.
            The new policy is the weighted maximum likelihood estimate of the samples with weights
                $w_i = \frac{\exp(r_i/\eta)}{\sum_{j=1}^{M} \exp(r_j/\eta)}$

4 Experiments

To evaluate REPS for learning motor skills we ran three experiments. In the first experiment we compared the performance of REPS with that of finite differences with RPROP on the ball serving task. In the second experiment we evaluated the effect of different values for ε on the performance of REPS. In the third experiment the robot learns ball bouncing with a feedback controller to compensate for errors.

4.1 Comparison with Finite Differences

In this experiment the robot was given the task of serving the ball to three different locations:

$x_g = [0, 1.5]$, $x_g = [1, 2]$, $x_g = [2, 1]$

Each target location was evaluated three times with both methods. For comparison the robot used REPS and finite differences with RPROP. For REPS, ε was set to 1. For finite differences with RPROP, the step size started from a fixed initial value, was increased by a factor of 1.2 when the gradient direction stayed the same, and was decreased by a factor of 0.5 when the gradient flipped. Both methods were given 50 iterations for each task and 15 samples per iteration. The results of the experiments are shown in Figure 4.

As can be seen in the plots, both methods started with the same performance, as they were initialized with the same parameters. However, REPS quickly converged to the target location. In comparison, the finite differences method converged more slowly. The solutions found by finite differences tended to perform worse than those of REPS. This is because finite differences often ends in a local maximum where it simply lets the ball drop to the ground. At this point there is no longer a gradient and the method stops learning. This experiment shows that REPS is more robust against getting stuck in such a local maximum.
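The critic and actor steps of Algorithm 1 can be sketched in plain Python. This is a minimal sketch under stated assumptions rather than the authors' code. It evaluates the sample-based dual of Equation 15 and its derivative in a numerically shifted form. Instead of BFGS it finds η by bisection on the sign of the derivative, which is possible because the dual is convex in η. The actor step assumes a one-dimensional Gaussian policy, which the text does not specify. All function names are hypothetical.

```python
import math


def _shifted_exps(rewards, eta):
    # Subtract the maximum reward before exponentiating to avoid overflow.
    m = max(rewards)
    return [math.exp((r - m) / eta) for r in rewards], m


def dual(eta, rewards, epsilon):
    """Sample-based dual: eta * log(mean(exp(r_i / eta))) + eta * epsilon."""
    exps, m = _shifted_exps(rewards, eta)
    return m + eta * math.log(sum(exps) / len(exps)) + eta * epsilon


def dual_grad(eta, rewards, epsilon):
    """Derivative of the sample-based dual with respect to eta."""
    exps, m = _shifted_exps(rewards, eta)
    total = sum(exps)
    weighted_shifted = sum(e * (r - m) for e, r in zip(exps, rewards)) / total
    return math.log(total / len(exps)) - weighted_shifted / eta + epsilon


def solve_eta(rewards, epsilon, lo=1e-6, hi=1e6, iters=200):
    """Bisection in log(eta) for the root of the dual's derivative.

    Assumption: swapped in for the BFGS call of Algorithm 1. If epsilon is
    too large for the constraint to be active, this returns a value near `lo`.
    """
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if dual_grad(math.exp(mid), rewards, epsilon) < 0.0:
            log_lo = mid  # eta too small: the KL bound is still exceeded
        else:
            log_hi = mid
    return math.exp(0.5 * (log_lo + log_hi))


def reps_weights(rewards, eta):
    """Normalized sample weights w_i proportional to exp(r_i / eta)."""
    exps, _ = _shifted_exps(rewards, eta)
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gaussian_fit(samples, weights):
    """Weighted maximum likelihood mean and variance (1-D Gaussian policy)."""
    mean = sum(w * x for w, x in zip(weights, samples))
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, samples))
    return mean, var
```

At the root of the derivative, the KL divergence between the weights and the uniform sample distribution equals ε, so ε directly controls how far one iteration may move the policy.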


Figure 4: Comparison of finite differences with RPROP and REPS.

Figure 5: Effect of different ε values (plot "Rewards After 50 Iterations": final reward over epsilon).

4.2 Effect of Different Values for ε

In the second experiment we investigated the effects of changing the main parameter of the REPS method. The evaluation was again performed using the ball serving task. In this case the ball was always served to the same target location, $x_g = [2, 2]$. As performance measure we looked at the rewards achieved after 50 iterations. Again each iteration consisted of 15 samples. Each ε value was tested five times. The results of the experiment are shown in Figure 5.

As can be seen, the REPS algorithm is able to consistently obtain good final rewards when using ε values in the range 0.4 to 2. Using higher ε values seems to have led to numerical instabilities, and the method does not learn. For smaller ε values the performance gradually decreases. The wide peak of performance indicates that the performance of REPS is not sensitive to ε, so the parameter can be tuned easily.

4.3 Learning Ball Bouncing

In the final experiment the robot was given the task to robustly perform the ball bouncing task. First the robot learned a standard hitting movement where the ball was dropped from the initial position as usual. To learn this hitting motion REPS was used with 50 iterations, ε = 1 and 15 samples per iteration. The learned movement was concatenated into a periodic motion according to the time it took the ball to get back to the initial position. The resulting ball trajectory can be seen in Figure 6. A small error accumulated in the x direction, which would eventually result in the ball falling off the paddle.

Figure 6: Ball trajectory without feedback controller (plot "Ball Bouncing without Controller": z position over x position). A small error accumulates and would eventually lead to the ball falling off the paddle.

We therefore also learned a PD controller based on the ball's position and velocity in the x direction, as already described in Section 2.3. To learn the gains of the PD controller we again used REPS for 50 iterations. The resulting controller was evaluated by initially dropping the ball at 2.5 cm increments across the width of the racket. In each trial the robot successfully managed to bounce the ball back to the center of the racket and keep bouncing it there. As an additional test we evaluated the situation where the ball is thrown near the edge of the paddle with a horizontal speed of 1.25 m/s. The resulting ball trajectory is shown in Figure 7. The figure shows that even in this extreme situation the learned controller was still able to compensate for the error and successfully perform the ball bouncing task.
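The PD law of Equation 2 is simple enough to sketch directly. The block below is an illustration only, with hypothetical names and gains. The `simulate` helper uses toy dynamics in which the control value acts directly as the ball's horizontal acceleration. That is an assumption made to show the error decaying, and it is not the racket-and-ball dynamics of the SL simulation, where the controller acts on the wrist joint.

```python
def pd_control(x_ball, v_ball, k_p, k_d, x_des=0.0, v_des=0.0):
    """PD law u = k_p * (x_d - x_b) + k_d * (xdot_d - xdot_b), x-components only."""
    return k_p * (x_des - x_ball) + k_d * (v_des - v_ball)


def simulate(x0, v0, k_p, k_d, dt=0.001, duration=10.0):
    """Toy closed loop (assumption): the control is the ball's x-acceleration."""
    x, v = x0, v0
    for _ in range(int(round(duration / dt))):
        u = pd_control(x, v, k_p, k_d)
        v += u * dt  # semi-implicit Euler step for the velocity
        x += v * dt  # then for the position
    return x, v
```

For example, `simulate(0.1, 1.25, 4.0, 4.0)` starts 10 cm off-center with a horizontal speed of 1.25 m/s and, with these hand-picked gains, drives both position and velocity errors toward zero. In the paper the gains are learned with REPS rather than chosen by hand.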

Figure 7: Ball trajectory with feedback controller (plot "Ball Bouncing with Controller": z position over x position). The ball is thrown onto the paddle with a speed of 1.25 m/s. The width of the paddle is marked with a black line. The robot is able to slowly bring the ball to the center of the paddle.

5 Conclusion

In this paper we investigated using the Relative Entropy Policy Search algorithm for learning robot motor skills. We looked at the tasks of ball serving and ball bouncing. Our experimental results show that REPS is robust against getting stuck in local maxima. REPS is based on bounding the information loss between policies by a value ε. We found that the performance of the algorithm is not sensitive to this parameter, which means it can be set easily. Using the REPS algorithm the robot was able to learn to serve the ball to various target locations and to robustly perform the ball bouncing task. The learned controller was even able to compensate for the ball being thrown onto the paddle with a horizontal velocity of 1.25 m/s.

In the future we would like to use a controller based on inverse kinematics, such that the paddle can easily be rotated around the x axis. Using such a controller, the PD feedback controller could also be learned for the y direction, similar to how the robot learns it now in the x direction. We also plan to investigate using the state-dependent version of REPS to directly learn to serve the ball to different target locations.

References

C. Daniel, G. Neumann, and J. Peters. Hierarchical relative entropy policy search. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2012), 2012a.

C. Daniel, G. Neumann, and J. Peters. Learning concurrent motor skills in versatile solution spaces. In Proceedings of the International Conference on Robot Systems (IROS), 2012b.

Jan Peters, K. Mülling, and Yasemin Altun. Relative entropy policy search. National Conference on Artificial Intelligence, 2010.

Stefan Schaal. The SL simulation and real-time control software package. Processing, pages 1-94.

Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

Derivation of Adapted REPS

Lagrangian of the program in Equations 9-11:

$$L = \sum_a \pi(a) R_a + \eta \left( \varepsilon - \sum_a \pi(a) \log \frac{\pi(a)}{q(a)} \right) + \lambda \left( 1 - \sum_a \pi(a) \right) = \sum_a \pi(a) \left[ R_a - \eta \log \frac{\pi(a)}{q(a)} - \lambda \right] + \eta\varepsilon + \lambda \qquad (16)$$

Differentiate the Lagrangian with respect to $\pi(a)$:

$$\frac{\partial L}{\partial \pi(a)} = R_a - \eta \log \frac{\pi(a)}{q(a)} - \lambda - \eta\,\pi(a) \frac{1}{\pi(a)} = R_a - \eta \log \frac{\pi(a)}{q(a)} - \lambda - \eta \qquad (17)$$

Set to zero and solve for $\pi(a)$:

$$\eta \log \frac{\pi(a)}{q(a)} = R_a - \lambda - \eta \qquad (18)$$
$$\pi(a) = q(a) \exp\left( \frac{R_a - \lambda}{\eta} - 1 \right) \qquad (19)$$
$$\pi(a) = q(a) \exp\left( \frac{R_a}{\eta} \right) \exp\left( -1 - \frac{\lambda}{\eta} \right) \qquad (20)$$

Since we require that $\sum_a \pi(a) = 1$, we can sum up both sides of Equation 20 over $a$ and obtain:

$$1 = \sum_a q(a) \exp\left( \frac{R_a}{\eta} \right) \exp\left( -1 - \frac{\lambda}{\eta} \right)$$
$$1 = \exp\left( -1 - \frac{\lambda}{\eta} \right) \sum_a q(a) \exp\left( \frac{R_a}{\eta} \right)$$
$$\exp\left( -1 - \frac{\lambda}{\eta} \right) = \frac{1}{\sum_a q(a) \exp(R_a/\eta)} \qquad (21)$$

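If we insert Equation 21 into Equation 20 we get:

$$\pi(a) = \frac{q(a) \exp(R_a/\eta)}{\sum_b q(b) \exp(R_b/\eta)} \qquad (22)$$

We can now replace $\pi$ in the Lagrangian (Equation 16) using Equation 22 and obtain the dual function:

$$L = \sum_a \frac{q(a) \exp(R_a/\eta)}{\sum_b q(b) \exp(R_b/\eta)} \left[ R_a - \eta \log \frac{q(a) \exp(R_a/\eta)}{q(a) \sum_b q(b) \exp(R_b/\eta)} - \lambda \right] + \eta\varepsilon + \lambda \qquad (23)$$

For the sake of limited space the following equations show the simplification of only the term in the big square brackets of Equation 23:

$$R_a - \eta \log \frac{q(a) \exp(R_a/\eta)}{q(a) \sum_b q(b) \exp(R_b/\eta)} - \lambda = R_a - R_a + \eta \log \left[ \sum_b q(b) \exp(R_b/\eta) \right] - \lambda = \eta \log \left[ \sum_b q(b) \exp(R_b/\eta) \right] - \lambda \qquad (24)$$

As the term in Equation 24 no longer depends on the running variable of the outer summation in Equation 23, we can pull it out of that sum, and thus

$$\sum_a \frac{q(a) \exp(R_a/\eta)}{\sum_b q(b) \exp(R_b/\eta)} = 1. \qquad (25)$$

Applying the simplifications of Equations 24 and 25 to the Lagrangian in Equation 23, we obtain the dual function as follows:

$$g(\eta) = \eta \log \sum_a q(a) \exp(R_a/\eta) - \lambda + \eta\varepsilon + \lambda = \eta \log \sum_a q(a) \exp(R_a/\eta) + \eta\varepsilon \qquad (26)$$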
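The end result of the derivation can be checked numerically on a small discrete example: plugging the policy of Equation 22 into the Lagrangian of Equation 16 should give exactly the dual value of Equation 26, for any λ. The sketch below does this with hypothetical helper names and arbitrary example numbers. It is a sanity check of the algebra, not part of the original method.

```python
import math


def softmax_policy(q, rewards, eta):
    """Policy of Equation 22: pi(a) proportional to q(a) * exp(R_a / eta)."""
    unnorm = [qa * math.exp(r / eta) for qa, r in zip(q, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def lagrangian(pi, q, rewards, eta, lam, epsilon):
    """Lagrangian of Equation 16 for a discrete action set."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / qa) for p, qa in zip(pi, q) if p > 0.0)
    return expected_reward + eta * (epsilon - kl) + lam * (1.0 - sum(pi))


def dual(q, rewards, eta, epsilon):
    """Dual function of Equation 26."""
    z = sum(qa * math.exp(r / eta) for qa, r in zip(q, rewards))
    return eta * math.log(z) + eta * epsilon
```

For instance, with `q = [0.1, 0.2, 0.3, 0.4]`, `rewards = [1.0, -0.5, 0.3, 0.0]`, `eta = 0.7` and `epsilon = 0.5`, the two values agree up to floating-point error, while any other distribution, such as `q` itself, yields a Lagrangian value no larger than the dual.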
