International Scientific and Technological Conference EXTREME ROBOTICS, October 8-9, 2015, Saint-Petersburg, Russia

LEARNING MOBILE ROBOT BASED ON ADAPTIVE CONTROLLED MARKOV CHAINS

V.Ya. Vilisov
University of Technology, Energy IT LLP, Russia, Moscow Region, Korolev
vvib@yandex.ru

Abstract. Herein we suggest a mobile-robot training algorithm based on approximating the preferences of the decision maker who controls the robot, the robot in its turn being managed by a Markov chain. The model parameters are tuned on the basis of data on the situations encountered and the decisions made by the decision maker. The model that adapts to the decision maker's preferences can be tuned a priori, during the robot's normal operation, or during specially planned testing sessions. Using simulation data of the robot's operation and of the decision maker's control of the robot, we tuned the model parameters, thus illustrating both the operability of all components of the algorithm and the effectiveness of the adaptation.

Key words: mobile robot, learning, adaptation, controlled Markov chains.

Introduction

There are many ways to increase the effectiveness of a robotic system (RS): we can provide high effectiveness of its sensor system (SS) and/or the maximum possible autonomy of the RS. Within the framework of the first method, it is possible to increase the informativeness and the number of data-collection channels; however, this may also require increasing the performance of the on-board computer. SS effectiveness can also be increased by using effective algorithms for processing the data received from the sensors (for example, by applying various filtering models: Kalman filtering, Bayesian filtering, POMDP, etc., as well as an effective solution of the SLAM problem).
Within the first approach, one of the problems is to discover the structure and composition of the SS that would possess minimum sufficiency for the performance of the RS's tasks. The second approach is closely related to the necessity of providing behavior of the RS that corresponds to the objectives and preferences of the decision maker who controls the RS. This direction is closely related to research on training an RS to behave adequately and to interact more effectively with the decision maker [1, 2, 3]. Bayesian learning tools and Markov processes, including POMDP, are widely used here. The solutions obtained with these approaches can often complement each other and/or compensate for each other's disadvantages. The purpose of this work is to solve the problem of selecting the minimum composition of the SS that would be sufficient for the RS to execute its tasks independently. However, due to the limited size of the article, the work concentrates only on teaching an independent mobile robot (IMR) the preferences of the decision maker and on estimating the effectiveness of task performance for the selected configuration of the SS. As the RS we take an IMR possessing two contact (or other functionally similar) sensors and a pair of driving wheels. The task of the robotic system is to scan some area (a room, a territory, etc.). Typical examples of such an IMR are a robotic vacuum cleaner (RVC) and a robotic lawn mower. The RS performs this task by scanning the room using one of the stored programs chosen by the decision maker. The effectiveness of the task performance varies with the program selection; it depends on both the program and the room's shape. The logic of the RS is the following: by using different behavior modes in similar situations (activation of specific sensors), it makes different decisions. For example, when the left sensor is activated, the RS may decide to move back while taking a right turn, etc.
To simplify the presentation of this work's materials, we shall consider the simplest variant of the IMR, containing two sensors (states) and two decisions (alternative actions). In particular, the sensor area is represented by two front (left and right) crash/proximity transducers, with the two actions/decisions being the crash reactions: moving back while turning left and moving back while turning right. Increasing the sets of states and decisions will enrich the variety and effectiveness of IMR behavior without influencing the content of the suggested algorithm in any way. All the main mathematical constructions are performed for the case of two states and two decisions. This class of problems is often solved with discrete Markov Decision Processes (MDP). The suggested algorithm of adapting the RS to the objective
preferences of the decision maker is founded on the reverse problem for the Markov payoff chain (RPMDP). This problem observes effective actions of the decision maker and computes an estimate of the payoff/objective function of the MDP. Then, when solving the direct problem for the MDP (DPMDP), the optimal controls become adapted to the objective preferences of the decision maker.

Markov Models Applied in RS Management

Markov payoff chains (Markov profit chains), also called controlled Markov chains, are a development of Markov chains whose description is complemented with a control element: the decision made by the decision maker when the Markov chain is in one of its states. The decision is represented by a set of alternatives and a corresponding set of transition probability matrices. A large group is comprised by Partially Observable Markov Decision Processes (POMDP); in comparison with MDP, their additional element is the set of measurements. These models contain all the components inherent to the traditional models of dynamic controlled systems [12], which contain equations of the process/system, control, and measurement. This representation allows us to solve three main groups of management problems: filtering, identification, and optimal control. When using MDP in RS management problems [5, 6, 7, 8, 9], the payoff functions (PF) are considered known and set a priori when the system or the control algorithms were developed. The structure and parameters of the PF influence the specific value of the optimal decision. If the PF does not correspond to the value hierarchy and preference system of the decision maker who controls the RS, the computed solution will be optimal for that specific PF while being non-optimal with respect to the objective preferences of the decision maker. Thus, the optimal control influence obtained by any method is optimal only relative to its PF.
An MDP is considered specified if we know the following elements: the set of states i = 1, ..., m; the initial state probability vector p = [p_i]; the set of decisions k = 1, ..., K; the one-step conditional transition probability matrices P^k = [p_ij^k]; and the one-step conditional payoff matrices R^k = [r_ij^k]. The solution of the MDP is the optimal strategy f*, one of the S possible strategies. An arbitrary strategy with index s = 1, ..., S can be written as the column vector f^s = [k_1^s k_2^s ... k_m^s]^T, where T is the transposition symbol. Hereinafter we record column vectors as transposed row vectors for compactness. The optimal strategy provides a maximum of the one-step accumulated or mean profit/payoff. Within the strategy vector, k_i^s is the decision that should be made according to the s-th strategy if the process at the current stage n is in state i. Applying a specific strategy for decision making means that, instead of the sets of matrices P^k and R^k, single working matrices P^s and R^s are generated.

Solution of Direct Problem MDP (DPMDP)

When solving an MDP, one usually applies [10, 11] a recurrent algorithm based on the principle of Richard Bellman, or the iteration algorithm of Ronald Howard [11], which improves the solution stepwise. If the spaces of states and decisions are not large, the optimal solution may be found by a brute-force search over the strategies; we used this method for the model calculations below. The brute-force search compares the competing strategies by the value of the one-step mean payoff in the steady state. The search is performed within the class of stationary strategies. Let us define the mean payoff per step V^s of an arbitrary s-th strategy in the steady state. First, let us construct the working matrices P^s and R^s for the s-th strategy.
They are formed from the original transition matrices using the strategy vector f^s = [k_1^s k_2^s ... k_m^s]^T as a key: the first row of P^s is taken from the first row of the matrix P^{k_1^s}, the second row from the second row of the matrix P^{k_2^s}, and so on. The matrix R^s is constructed in the same manner. Thus, for the s-th strategy, the sets of matrices P^k and R^k are collapsed into a single one-step transition matrix P^s and a single one-step payoff matrix R^s. Then, given that the process was in state i, the one-step mean payoff is defined as:

r_i^s = \sum_{j=1}^{m} p_{ij}^s r_{ij}^s .

To calculate the unconditional mean payoff, we define the state probability vector in the steady state, p^N = [p_1^N p_2^N ... p_m^N]^T, where N indicates that the probabilities correspond to large step numbers, accounting for the steady-state character of the process. In this case, the mean payoff per step V^s of the stationary s-th strategy is defined as:

V^s = \sum_{i=1}^{m} p_i^N r_i^s = \sum_{i=1}^{m} p_i^N \sum_{j=1}^{m} p_{ij}^s r_{ij}^s .
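As an illustration of this row-selection rule, here is a minimal sketch (assuming numpy; the matrix values are those of Table 1 from the model example below) that builds P^s and R^s for a given strategy and computes the one-step mean payoffs r_i^s:

```python
import numpy as np

# Conditional matrices P^k and R^k per decision k (values of Table 1 below)
P = {1: np.array([[0.05, 0.95], [0.19, 0.81]]),
     2: np.array([[0.27, 0.73], [0.48, 0.52]])}
R = {1: np.array([[45.0, 79.0], [44.0, 31.0]]),
     2: np.array([[25.0, 23.0], [93.0, 45.0]])}

def working_matrices(f):
    """Row i of P^s (and of R^s) is row i of the matrix of decision f[i]."""
    Ps = np.vstack([P[k][i] for i, k in enumerate(f)])
    Rs = np.vstack([R[k][i] for i, k in enumerate(f)])
    return Ps, Rs

Ps, Rs = working_matrices((1, 2))   # strategy f^2 = [1 2]^T
r_s = (Ps * Rs).sum(axis=1)         # r_i^s = sum_j p_ij^s * r_ij^s
print(np.round(r_s, 2))             # r_1^s = 77.3, r_2^s = 68.04
```

The same helper is reused below for the steady-state payoff calculation.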
Should the payoff have the meaning of profit, the criterion for selecting the optimal strategy is:

s^* = \arg\max_{s \in \{1, \dots, S\}} V^s .

The limiting state probability vector p^N of the Markov process satisfies the matrix equation:

(P^s)^T p^N = p^N .

In addition, the normalization condition must hold for the state probabilities:

\sum_{i=1}^{m} p_i^N = 1 .

Solving the system comprised of the last two equations yields the coordinate values of the vector p^N.

Solution of Reverse Problem MDP (RPMDP)

Let us consider what data the observations include and what should be found as a result of the RPMDP solution. A set of realizations (presentations) is available for observation. At each step n, the chain state i_n and the decision k_n made by the decision maker are observed, where i_n, k_n ∈ {1, 2}. After the end of a presentation, the value v of the received payoff is computed. In the context of the example under consideration (a scanning IMR), the state is the actuation of the left or the right sensor, while the decision is moving back with a left or a right turn. The useful (net of waste) length of the path traveled by the robot during a certain period of time (which is proportional to the scanned area) can serve as each presentation's payoff. Thus, the unit of observation considered in the reverse problem solution algorithm is one presentation, i.e. the totality of the alternating states, the decisions made, and the presentation's total payoff. The result of the RPMDP solution is the set of elements necessary for solving the direct problem (DPMDP), i.e. the probability and payoff matrices. The RPMDP solution scheme can be presented in 3 stages.

1st Stage. Each presentation follows a strategy f^s ∈ {f^1, f^2, f^3, f^4} chosen by the decision maker. The complete set of strategies is: f^1 = [1 1]^T; f^2 = [1 2]^T; f^3 = [2 1]^T; f^4 = [2 2]^T.
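For the four pure strategies just listed, the direct problem can be solved exactly as described above: build the working matrices, find the stationary vector from (P^s)^T p^N = p^N together with the normalization condition, and take the argmax of V^s. A sketch, again assuming numpy and the Table 1 parameters (function names are illustrative):

```python
import numpy as np
from itertools import product

# Conditional matrices P^k and R^k per decision k (values of Table 1 below)
P = {1: np.array([[0.05, 0.95], [0.19, 0.81]]),
     2: np.array([[0.27, 0.73], [0.48, 0.52]])}
R = {1: np.array([[45.0, 79.0], [44.0, 31.0]]),
     2: np.array([[25.0, 23.0], [93.0, 45.0]])}

def stationary(Ps):
    """Solve (P^s)^T p = p together with sum(p) = 1 as a least-squares system."""
    m = Ps.shape[0]
    A = np.vstack([Ps.T - np.eye(m), np.ones(m)])
    b = np.append(np.zeros(m), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def mean_payoff(f):
    """Steady-state one-step mean payoff V^s of strategy f (f[i] = decision in state i)."""
    Ps = np.vstack([P[k][i] for i, k in enumerate(f)])
    Rs = np.vstack([R[k][i] for i, k in enumerate(f)])
    return stationary(Ps) @ (Ps * Rs).sum(axis=1)

best = max(product((1, 2), repeat=2), key=mean_payoff)
print(best, mean_payoff(best))   # best is (1, 2), i.e. f^2, with V ≈ 71
```

This brute-force enumeration reproduces the model example below: f^2 = [1 2]^T is optimal with a mean payoff of about 71 units per step.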
For each presentation, we estimate the frequencies with which each decision was made in each state. From these frequencies we then determine the closest strategy, which is assigned to the given presentation.

2nd Stage. For each presentation, the whole set of transition probability matrices is estimated: P^1, P^2, P^3, P^4. Each of these matrices is conditional (i.e. it applies when the decision corresponding to the upper index of the matrix is selected). Similarly to the 1st stage, for each presentation we separately compute the frequencies of transitions from one state to another (using pairs of successive steps of the Markov chain), taking into account the decision made (the condition of the transition). The frequencies obtained in this manner are estimates of the one-step conditional transition probabilities of the MDP and are placed into the corresponding cells of the transition probability matrices.

3rd Stage. The purpose of constructing the payoff estimates is to calculate estimates of the payoff vector r_i^s on the basis of the observed estimates of the transition probability matrices and of the payoffs of the observations (presentations). For this we use the least squares method in its recurrent form [11], which connects the previous estimate (q) with the current one (q+1):

r_{q+1} = r_q + Q_q P_q [P_q^T Q_q P_q + 1]^{-1} [v_{q+1} - P_q^T r_q],

Q_{q+1} = Q_q - Q_q P_q [P_q^T Q_q P_q + 1]^{-1} P_q^T Q_q,

where Q_q = (P_q^T P_q)^{-1}; v_{q+1} is the payoff in observation number (q+1); P_q is the transition matrix estimate obtained at the 2nd stage in observation number q. In these recurrent estimation equations, the observation step is an MDP presentation, while an MDP step is one step of the Markov chain made within the framework of a specific presentation.
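The recurrent update above can be sketched generically as follows. This is only an illustration of the update formulas, not the paper's experiment: the 4-component "true" payoff vector, the random regressors x_q standing in for the stage-2 quantities, and the weak-prior initialization of Q (instead of Q_q = (P_q^T P_q)^{-1}) are all invented for the demonstration.

```python
import numpy as np

def rls_update(r, Q, x, v):
    """One recurrent least-squares step:
    r <- r + Q x (x^T Q x + 1)^{-1} (v - x^T r)
    Q <- Q - Q x (x^T Q x + 1)^{-1} x^T Q
    """
    Qx = Q @ x
    gain = Qx / (x @ Qx + 1.0)       # scalar denominator, vector gain
    r_new = r + gain * (v - x @ r)
    Q_new = Q - np.outer(gain, Qx)
    return r_new, Q_new

# Toy demonstration: recover an (invented) 4-component payoff vector
# from noise-free scalar observations v = x^T r_true.
rng = np.random.default_rng(0)
r_true = np.array([45.0, 79.0, 93.0, 45.0])
r, Q = np.zeros(4), 1e3 * np.eye(4)  # weak prior: large initial covariance
for _ in range(200):
    x = rng.random(4)                # regressor for one presentation
    r, Q = rls_update(r, Q, x, x @ r_true)
print(np.round(r, 1))                # approaches r_true
```

With each new observation the estimate is refined without re-solving the full least-squares problem, which is what makes the scheme suitable for on-line adaptation.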
With the appearance of each new presentation number q, the payoff vector estimates are refined recurrently. This is a formal characterization of the decision maker's positive experience, captured with the help of the MDP. One can also say that the current preferences of the decision maker are approximated by the Markov decision-making chain. The recurrent estimation algorithm not only removes the prior uncertainty but also allows adapting to drift of the payoffs, objectives, and preferences of the decision maker by correcting the strategies on the basis of the payoff vectors, which are adjusted according to the current observations of the decision maker's actions.

Model Example

To check the operability and effectiveness of the suggested scheme of the reverse problem solution (which is the nucleus of the mobile robot adaptive control procedure), we conducted a simulation experiment. The input data were generated randomly. One of the parameter
variants of the modelled MDP is provided in Table 1 below. Solving the direct problem by the brute-force strategy search showed that the 2nd strategy is optimal: f^2 = [1 2]^T. According to its logic, the first decision should be chosen in the first process state and the second decision in the second state. In this case, the one-step mean payoff in the steady-state mode is 71 units. In the terminology of game theory, this solution corresponds to a pure strategy of the decision maker. As a rule, when the operator (decision maker) controls the robot in reality, he or she takes into account a multitude of performance targets (not only a single one). Moreover, the operator does not "feel" the strategy that is optimal by any single criterion, so the decision maker may often follow his or her own subjective, mixed control strategy.

Table 1. Parameters of the Modelled MDP

  Decision  State   One-Step Transition        One-Step Payoffs
  k         i       Probabilities p_ij         r_ij
                    j=1      j=2               j=1      j=2
  1         1       .05      .95               45       79
  1         2       .19      .81               44       31
  2         1       .27      .73               25       23
  2         2       .48      .52               93       45

Pic. 1. Mean Payoffs (the one-step mean payoff V^s of each of the four strategies f^1, ..., f^4)

To solve the reverse problem, we simulated 100 presentations, each containing 300 points; that is, in each presentation the modelled decision maker made 300 decisions in response to the occurring values of the current states. One of the four pure strategies was applied during each presentation. These data were processed according to the three stages of the reverse problem solution algorithm described above. From the statistics of the decisions made, the pure strategies applied by the decision maker were identified unambiguously at the first stage; this is because a fully observable MDP was considered in each run. At the second stage, we computed successively refined estimates of the one-step transition matrices.
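How such presentations can be simulated and processed at the 2nd stage may be sketched as follows (assuming numpy and the Table 1 probabilities; states are indexed 0 and 1 here, and all names are illustrative):

```python
import numpy as np

# Conditional transition matrices P^k from Table 1 (states indexed 0 and 1)
P = {1: np.array([[0.05, 0.95], [0.19, 0.81]]),
     2: np.array([[0.27, 0.73], [0.48, 0.52]])}

def simulate_presentation(f, n_steps, rng):
    """One presentation: the sequence of (state, decision) pairs generated
    under the pure strategy f, where f[i] is the decision made in state i."""
    traj, i = [], int(rng.integers(2))
    for _ in range(n_steps):
        k = f[i]
        traj.append((i, k))
        i = int(rng.choice(2, p=P[k][i]))   # next state from row i of P^k
    return traj

def estimate_transitions(traj):
    """Stage 2: frequency estimates of p_ij^k from pairs of successive steps."""
    counts = {k: np.zeros((2, 2)) for k in (1, 2)}
    for (i, k), (j, _) in zip(traj, traj[1:]):
        counts[k][i, j] += 1
    return {k: c / np.maximum(c.sum(axis=1, keepdims=True), 1.0)
            for k, c in counts.items()}

rng = np.random.default_rng(1)
traj = simulate_presentation({0: 1, 1: 2}, 300, rng)  # pure strategy f^2 = [1 2]^T
est = estimate_transitions(traj)
# est[1][0] approaches (.05, .95); est[2][1] approaches (.48, .52)
```

Under a pure strategy only one decision is observed per state, so only the corresponding rows of the matrices are updated by a given presentation; accumulating presentations with different strategies fills in the rest.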
During the iterative refinement of the estimates, each of the 100 presentations was used as another observation. Picture 2 shows the step-by-step changes of the estimates of 4 probabilities (the total number of matrix probabilities is 8, but only 4 of them are independent; the remaining 4 are computed as complements to one).

Pic. 2. Convergence of MDP's Probability Estimates (the estimates p_12(1), p_22(1), p_12(2), p_22(2) versus the number of processed presentations; the modelled values from Table 1 are marked separately)

Pic. 3. Convergence of MDP's Payoff Estimates (the estimates r_11(1), ..., r_22(2) versus the number of processed presentations)

The modelled probability values (provided in Table 1) are marked separately on Picture 2. The estimates r_i^k of the elements folded into the payoff vectors are computed at the third stage by the recurrent relations, using the observations calculated at each step (i.e. for each new presentation) and the payoff corresponding to the performed presentation. For the dimensionalities considered, four such estimates are sufficient (as for the probability estimates). The convergence of these estimates is shown graphically on Pic. 3, while Pic. 4 shows the solutions of the direct problem of MDP made on the basis of the step-by-step estimates. Pic. 4 clearly shows that the adaptation process converges quickly.

Pic. 4. Convergence of MDP Solutions According to Its Parameter Estimates (the strategy number k computed from the current estimates versus the modelled strategy number k_model)

Conclusions

Research performed for other dimensionalities of the state space (the capacity of the sensor field) and of the solution space showed that the convergence of the solution remains high for the considered class of models. Moreover, optimal experiment planning tools used during RS training allow shortening the time usually needed to tune the model of the decision maker's preferences. The preference model tuned and built into the RS is highly adequate to the preferences and objectives of the decision maker, while the quality of the decisions made by the RS is not inferior to the quality of the decisions made by the "teacher" of the MDP model. Should signs of environment non-stationarity appear, or should the decision maker's preferences change, the model can be re-tuned and reloaded into the RS. The tuning/training of the model can be performed on a separate model or a special testing device, while the MDP model adjusted to the new conditions can be loaded as a "hot" update without interrupting normal operation of the RS. Further development of the suggested approach can unfold in several directions; in particular, the considered spectrum of the robotic system's sensor-field variants can be extended, and/or POMDP models can be used.

References

[1] M.P. Woodward, R.J. Wood. Framing Human-Robot Task Communication as a POMDP. [Web version]. URL: http://arxiv.org/pdf/1204.0280.pdf. [Last access date: 21.05.2015]
[2] M.P. Woodward, R.J. Wood. Using Bayesian Inference to Learn High-Level Tasks from a Human Teacher. [Web version]. Conf.
on Artificial Intelligence and Pattern Recognition, Orlando, FL, July 2009, pp. 138-145.
[3] A.A. Zhdanov. Independent Artificial Intellect. Moscow: BINOM Publishing House, 2008. 359 pp.
[4] S. Koenig, R.G. Simmons. Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models. Carnegie Mellon University. [Web version]. URL: http://citeseerx.ist.psu.edu/viewdoc/download?do.1.1.13.588&rep=rep1&type=pdf. [Last access date: 21.05.2015]
[5] M.E. Lopez, R. Barea, L.M. Bergasa. Global Navigation of Assistant Robots using Partially Observable Markov Decision Processes. [Web version]. URL: http://cdn.intechopen.com/pdfs/141/intech-Global_navigation_of_assistant_robots_using_partially_observable_markov_decision_processes.pdf. [Last access date: 21.05.2015]
[6] S. Ong, S.W. Png, D. Hsu, W.S. Lee. POMDPs for Robotic Tasks with Mixed Observability. Robotics: Science & Systems, 2009. [Web version]. URL: http://www.comp.nus.edu.sg/~leews/publications/rss09.pdf. [Last access date: 21.05.2015]
[7] R.S. Sutton, A.G. Barto. Reinforcement Learning: An Introduction. Cambridge: The M.I.T. Press, 1998.
[8] V.Ya. Vilisov. Robot Training Under Conditions of Incomplete Information. [Web version]. URL: http://arxiv.org/ftp/arxiv/papers/1402/1402.2996.pdf. [Last access date: 21.05.2015]
[9] H. Mine, S. Osaki. Markovian Decision Processes. NY: American Elsevier Publishing Company, Inc., 1970. 176 p.
[10] V.Ya. Vilisov. Adaptive Models Used for Researching Economic Operations. Moscow: Enit Publishing House, 2007. 286 pp.
[11] R.A. Howard. Dynamic Programming and Markov Processes. NY-London: The M.I.T. Press, 1960.
[12] G.F. Franklin, J.D. Powell, M. Workman. Digital Control of Dynamic Systems. NY: Addison Wesley Longman, 1998.