International Scientific and Technological Conference EXTREME ROBOTICS October 8-9, 2015, Saint-Petersburg, Russia


LEARNING MOBILE ROBOT BASED ON ADAPTIVE CONTROLLED MARKOV CHAINS

V.Ya. Vilisov
University of Technology, Energy IT LLP, Russia, Moscow Region, Korolev
vvib@yandex.ru

Abstract. We propose a training algorithm for a mobile robot that is based on approximating the preferences of the decision maker who controls the robot, the robot itself being managed by a Markov chain. The model parameters are fitted from data on the situations encountered and the decisions the decision maker made in them. The model that adapts to the decision maker's preferences can be tuned a priori, during the robot's normal operation, or during specially planned testing sessions. Using simulation data of the robot's operation and of the decision maker's control of the robot, we have fitted the model parameters, illustrating both the working capacity of all components of the algorithm and the effectiveness of the adaptation.

Key words: mobile robot, learning, adaptation, controlled Markov chains.

Introduction

There are many ways to increase the effectiveness of a robotic system (RS): we can provide highly effective sensor systems (SS) and/or the maximum possible autonomy of the RS. Within the first approach it is possible to increase the informativeness and the number of data collection channels; however, this may also require a more powerful on-board computer. SS effectiveness can also be increased by using effective algorithms for processing the data received from the sensors (for example, various filtering models: Kalman filtering, Bayesian filtering, POMDP etc., as well as effective solutions of the SLAM problem). One of the problems within the first approach is to find the structure and composition of the SS that is minimally sufficient for the RS to perform its tasks. The second approach is closely related to the need to provide behavior of the RS that corresponds to the objectives and preferences of the decision maker who controls the RS. This direction is closely related to research on training an RS to behave adequately and to interact more effectively with the decision maker [1, 2, 3]; Bayesian learning tools and Markov processes, including POMDP, are widely used here. The solutions obtained with these approaches can often complement each other and/or compensate for each other's disadvantages.

The purpose of this work is to solve the problem of selecting the minimum composition of the SS that is sufficient for the RS to execute its tasks autonomously. Due to the limited size of the article, however, we concentrate only on teaching an independent mobile robot (IMR) the preferences of the decision maker and on estimating the effectiveness of task performance within the selected SS configuration.

As the RS we take an IMR with two contact (or functionally similar) sensors and a pair of driving wheels. The task of the robotic system is to scan some area (a room, a territory etc.). Typical examples of such an IMR are a robotic vacuum cleaner (RVC) and a robotic lawn mower. The RS performs this task by scanning the room with one of its stored programs, chosen by the decision maker. The effectiveness of the task performance varies with the chosen program; it depends both on the program and on the shape of the room.
The logic of the RS is as follows: by applying different behavior modes to similar situations (activation of specific sensors), it makes different decisions. For example, when the left sensor is activated, the RS decides to move back with a right turn, and so on. To simplify the presentation, we consider the simplest IMR variant with two sensors (states) and two decisions (alternative actions). Specifically, the sensor field is represented by two front (left and right) collision/proximity transducers, and the two actions/decisions are the reactions to a collision: moving back with a left turn and moving back with a right turn. Increasing the number of states and decisions enriches the variety and effectiveness of the IMR's behavior without changing the content of the suggested algorithm in any way, so all the main mathematical constructions are carried out for the case of two states and two decisions. This class of problems is often solved with discrete Markov Decision Processes (MDP).

The suggested algorithm for adapting the RS to the objective preferences of the decision maker is founded on the reverse problem for the Markov payoff chain (RPMDP). This problem takes the observed effective actions of the decision maker and computes an estimate of the payoff/objective function of the MDP. Consequently, when the direct problem for the MDP (DPMDP) is then solved, the resulting optimal controls are adapted to the objective preferences of the decision maker.

Markov Models Applied in RS Management

Markov payoff chains (Markov profit chains), also called controlled Markov chains, are an extension of Markov chains whose description is complemented with a control element: the decision made by the decision maker when the Markov chain is in one of its states. The decision is represented by a set of alternatives and the corresponding set of transition probability matrices. A large group is comprised by the Partially Observable Markov Decision Processes (POMDP); compared with MDP, their additional element is the set of measurements (observations). These models contain all the components inherent to the traditional models of dynamic controlled systems [12], which in turn contain equations of the process/system, the control and the measurements. This representation allows us to solve the three main groups of management problems: filtering, identification and optimal control.

When MDPs are used in RS management problems [5, 6, 7, 8, 9], the payoff functions (PF) are considered known and set a priori when the system or the control algorithms were developed. The structure and parameters of the PF determine the specific value of the optimal decision. If the PF does not correspond to the value hierarchy and preference system of the decision maker who controls the RS, the computed solution will be optimal for that specific PF while being non-optimal with respect to the objective preferences of the decision maker. Hence, the optimal control obtained by any method is optimal only relative to the PF.

An MDP is considered specified if we know the following elements: the set of states i = 1, ..., m; the initial state probability vector p = (p_i); the set of decisions k = 1, ..., K; the one-step conditional transition probability matrices P^k = ||p_{ij}^k||; and the one-step conditional payoff matrices R^k = ||r_{ij}^k||. The solution of the MDP is the optimal strategy f*, one of S possible strategies. An arbitrary strategy with index s = 1, ..., S can be written as the column vector f_s = [k_1^s  k_2^s  ...  k_m^s]^T, where T denotes transposition (hereinafter column vectors are written as transposed row vectors for compactness). The optimal strategy maximizes the one-step accumulated or mean profit/payoff. Within the strategy structure, k_i^s is the decision that should be made according to strategy s if the process at the current stage n is in state i. When a specific strategy is applied for decision making, the set of matrices P^k and R^k collapses into single working matrices P^s and R^s.

Solution of the Direct Problem MDP (DPMDP)

When solving an MDP one usually applies [10, 11] a recurrent algorithm based on Bellman's principle of optimality, or Howard's policy iteration algorithm [11], which improves the solution step by step. If the state and decision spaces are not large, the optimal solution may also be found by brute-force search over strategies. We used the brute-force search method for the model calculations below.
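To fix notation for the sketches that follow, the MDP elements listed above can be stored as one transition matrix and one payoff matrix per decision. This is a minimal illustrative sketch in Python with placeholder numbers; the array names and the 0-based encoding are ours, not the paper's notation.

```python
import numpy as np

# Minimal MDP specification for m = 2 states and K = 2 decisions.
# P[k] is the one-step transition probability matrix under decision k+1,
# R[k] is the one-step payoff matrix under decision k+1 (placeholder values).
P = np.array([[[0.50, 0.50],     # P^1, rows: current state, columns: next state
               [0.30, 0.70]],
              [[0.20, 0.80],     # P^2
               [0.60, 0.40]]])
R = np.array([[[10.0, 20.0],     # R^1, payoff r_ij^k for the transition i -> j
               [15.0,  5.0]],
              [[ 8.0, 12.0],     # R^2
               [25.0, 10.0]]])

# A stationary strategy assigns one decision to each state; with 0-based
# indices, f = [0, 1] corresponds to f_s = [1 2]^T in the paper's notation.
f = np.array([0, 1])
```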
Brute-force search over strategies compares the competing strategies by the value of the one-step mean payoff in the steady-state regime; the search is performed within the class of stationary strategies. Let us define the mean payoff per step, V_s, of an arbitrary strategy s in the steady state. First we construct the working matrices P^s and R^s of the strategy: they are assembled from the original matrices using the strategy configuration f_s = [k_1^s  k_2^s  ...  k_m^s]^T as a key. Thus the first row of P^s is taken from the first row of the matrix P^{k_1^s}, the second row from the second row of P^{k_2^s}, and so on; R^s is constructed in the same manner. Hence, for strategy s, the set of matrices P^k and R^k yields a single one-step transition matrix P^s and a single one-step payoff matrix R^s. On condition that the process is in state i, the expected one-step payoff is

r_i^s = \sum_{j=1}^{m} p_{ij}^s r_{ij}^s .

To compute the unconditional mean payoff we need the steady-state probability vector p^N = [p_1^N  p_2^N  ...  p_m^N]^T, where the superscript N indicates that the probabilities correspond to large step numbers, i.e. to the steady-state regime of the process. The one-step mean payoff of the stationary strategy s is then

V_s = \sum_{i=1}^{m} p_i^N r_i^s = \sum_{i=1}^{m} p_i^N \sum_{j=1}^{m} p_{ij}^s r_{ij}^s .
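The construction of the working matrices, the expected one-step payoffs and the mean payoff V_s translate directly into code. The sketch below continues the arrays P, R and the strategy encoding introduced above; the steady-state vector is obtained from the stationarity and normalization conditions given just below. It is an illustration of the brute-force scheme under our own naming, not the authors' implementation.

```python
import numpy as np
from itertools import product

def working_matrices(P, R, f):
    """Row i of P^s (and of R^s) is row i of the matrices of decision f[i]."""
    m = len(f)
    Ps = np.array([P[f[i], i, :] for i in range(m)])
    Rs = np.array([R[f[i], i, :] for i in range(m)])
    return Ps, Rs

def mean_payoff(P, R, f):
    """One-step mean payoff V_s of the stationary strategy f in the steady state."""
    Ps, Rs = working_matrices(P, R, f)
    m = len(f)
    r = (Ps * Rs).sum(axis=1)                    # r_i^s = sum_j p_ij^s r_ij^s
    # Steady state: (P^s)^T p^N = p^N together with sum_i p_i^N = 1.
    A = np.vstack([Ps.T - np.eye(m), np.ones(m)])
    b = np.concatenate([np.zeros(m), [1.0]])
    pN = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(pN @ r), pN

def best_strategy(P, R):
    """Brute-force search: enumerate all K^m stationary strategies."""
    K, m = P.shape[0], P.shape[1]
    candidates = [np.array(f) for f in product(range(K), repeat=m)]
    return max(candidates, key=lambda f: mean_payoff(P, R, f)[0])
```

With the parameters of Table 1 given later in the model example, this enumeration singles out the strategy f_2 = [1 2]^T, in agreement with the result reported there.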

If the payoff has the meaning of profit, the optimal strategy selection criterion is

s^* = \arg\max_{s \in \{1, ..., S\}} V_s .

The limiting (steady-state) probability vector p^N of the Markov process satisfies the matrix equation

(P^s)^T p^N = p^N ,

together with the normalization condition on the state probabilities

\sum_{i=1}^{m} p_i^N = 1 .

Solving the system formed by the last two equations yields the coordinates of the vector p^N.

Solution of the Reverse Problem MDP (RPMDP)

Let us consider what data the observations contain and what has to be found as a result of solving the RPMDP. A set of presentations is available for observation. At each step n the chain state i_n and the decision k_n made by the decision maker are observed, where i_n, k_n ∈ {1; 2}. After a presentation ends, the value v of the received payoff is computed. In the context of the example under consideration (a scanning IMR), the state is the actuation of the left or the right sensor, while the decision is moving back with a left or a right turn. The useful (i.e. not counting waste) length of the path travelled by the robot during a given period of time, which is proportional to the scanned area, can serve as the payoff of each presentation. Thus, the unit of observation used by the reverse problem algorithm is one presentation, i.e. the whole sequence of alternating states and decisions together with the presentation's total payoff. The result of solving the RPMDP is the set of elements needed to solve the direct problem (DPMDP), i.e. the probability and payoff matrices. The RPMDP solution scheme can be presented in three stages.

1st Stage. Each presentation realizes a strategy f_s ∈ {f_1, f_2, f_3, f_4} chosen by the decision maker. The complete set of strategies is: f_1 = [1 1]^T; f_2 = [1 2]^T; f_3 = [2 1]^T; f_4 = [2 2]^T. For each presentation we estimate the frequencies with which each decision was made in each state and, on the basis of these frequencies, determine the closest strategy, which is then associated with the given presentation.

2nd Stage. For each presentation, the whole set of conditional transition probability matrices P^k is estimated; each matrix is conditional in the sense that it applies when the decision corresponding to its upper index is selected. Similarly to the 1st stage, for each presentation we compute the frequencies of transitions from one state to another (over pairs of consecutive steps of the Markov chain), taking into account the decision made (the condition of the transition). The frequencies obtained in this way are the estimates of the one-step conditional transition probabilities of the MDP and are placed into the corresponding cells of the transition probability matrices.
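Stages 1 and 2 amount to frequency counting over the observed sequence of (state, decision) pairs of a presentation. A possible sketch, again with our own function names and 0-based indices:

```python
import numpy as np

def identify_strategy(states, decisions, m=2, K=2):
    """Stage 1: for every state, take the decision the operator used most often;
    the resulting vector is the pure strategy closest to the observed behaviour.
    (A state that never occurred defaults to decision 0 in this sketch.)"""
    counts = np.zeros((m, K))
    for i, k in zip(states, decisions):
        counts[i, k] += 1
    return counts.argmax(axis=1)

def estimate_transitions(states, decisions, m=2, K=2):
    """Stage 2: relative frequencies of transitions i -> j under decision k,
    computed over consecutive pairs of steps of the presentation."""
    counts = np.zeros((K, m, m))
    for t in range(len(states) - 1):
        counts[decisions[t], states[t], states[t + 1]] += 1
    with np.errstate(invalid="ignore"):
        P_hat = counts / counts.sum(axis=2, keepdims=True)
    return np.nan_to_num(P_hat)      # rows with no observations remain zero
```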
3rd Stage. The purpose of this stage is to compute estimates of the payoff vector r_i^s from the observed parameters (the estimates of the transition probability matrices) and from the payoff recorded in each observation (presentation). For this we use the least squares method in its recurrent form [11], which connects the estimate obtained after q observations with the one obtained after observation (q+1):

\hat{r}_{q+1} = \hat{r}_q + Q_q P_q [P_q^T Q_q P_q + 1]^{-1} [v_{q+1} - P_q^T \hat{r}_q] ,

Q_{q+1} = Q_q - Q_q P_q [P_q^T Q_q P_q + 1]^{-1} P_q^T Q_q ,

where Q_q = (P_q^T P_q)^{-1}; v_{q+1} is the payoff in observation (q+1); and P_q is the matrix of transition probability estimates obtained at the 2nd stage in observation q. In these recurrent estimation equations the observation step is one MDP presentation, while one MDP step is a single step of the Markov chain made within a specific presentation. With the arrival of each new presentation, numbered q, the estimate of the payoff vector is refined recurrently. This is the formal description of how the decision maker's positive experience is captured with the help of the MDP; one may also say that the current preferences of the decision maker are approximated by the Markov decision-making chain. The recurrent estimation algorithm not only removes the prior uncertainty, but also allows adapting to a drift of the payoffs, objectives and preferences of the decision maker, by correcting the strategies on the basis of payoff vectors adjusted according to the current observations of the decision maker's actions.
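A single step of this recursion can be sketched as follows. The regressor vector stands in for the paper's P_q (exactly how it is assembled from the stage-2 probability estimates is left abstract here), and the function name is ours.

```python
import numpy as np

def rls_update(r_hat, Q, phi, v):
    """One recursive least-squares step for the payoff-vector estimate.

    r_hat : current estimate of the payoff vector (stage 3)
    Q     : current matrix Q_q, initialized as (P_q^T P_q)^{-1}
    phi   : regressor vector built from the stage-2 probability estimates
            of the current presentation (plays the role of P_q)
    v     : observed total payoff of the new presentation
    """
    phi = phi.reshape(-1, 1)
    denom = float(phi.T @ Q @ phi) + 1.0              # scalar [P^T Q P + 1]
    gain = (Q @ phi) / denom                          # Q P [P^T Q P + 1]^{-1}
    innovation = v - float(phi.T @ r_hat.reshape(-1, 1))
    r_new = r_hat + (gain * innovation).ravel()       # refined payoff estimate
    Q_new = Q - gain @ (phi.T @ Q)                    # updated Q matrix
    return r_new, Q_new
```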

Model Example

In order to check the working capacity and effectiveness of the suggested scheme for solving the reverse problem (which is the core of the mobile robot adaptive control procedure), we conducted a simulation experiment. The input data were generated randomly; one of the parameter variants of the modelled MDP is given in Table 1. Solving the direct problem by brute-force strategy search showed that the 2nd strategy is the optimal one: f_2 = [1 2]^T. According to it, the first decision should be chosen in the first process state and the second decision in the second state. In this case the one-step mean payoff in the steady-state regime is about 71 units (see Fig. 1). In game-theoretic terminology this solution corresponds to a pure strategy of the decision maker. In reality, when the operator (decision maker) controls the robot, he or she usually takes into account a multitude of performance targets rather than a single one, and does not "feel" the strategy that is optimal under any single criterion; the decision maker may therefore often follow his or her own subjective, mixed control strategy.

Table 1. Parameters of the Modelled MDP

Decision k   State i   One-step transition probabilities p_ij   One-step payoffs r_ij
                          j=1       j=2                            j=1     j=2
    1           1         0.05      0.95                            45      79
    1           2         0.19      0.81                            44      31
    2           1         0.27      0.73                            25      23
    2           2         0.48      0.52                            93      45

[Fig. 1. Mean payoffs V_s of the four strategies f_1, ..., f_4]

In order to solve the reverse problem we simulated a series of presentations; within each presentation the modelled decision maker made a decision at every occurrence of a current state value, and one of the four pure strategies was applied throughout each presentation. These data were processed according to the three stages of the reverse problem algorithm described above. At the first stage, from the statistics of the decisions made, the pure strategies applied by the decision maker were identified unambiguously; this is due to the fact that a fully observable MDP was considered in every run. At the second stage we computed successively refined estimates of the one-step transition matrices; within this iterative refinement, each of the simulated presentations was used as another observation. Fig. 2 shows the step-by-step evolution of the estimates of 4 probabilities (the total number of matrix entries is 8, but only 4 of them are independent, the remaining 4 being their complements to one); the modelled probability values from Table 1 are marked separately on the plot.

[Fig. 2. Convergence of the MDP probability estimates (p_12 and p_22 under each decision) to their modelled values over the observation steps]

[Fig. 3. Convergence of the MDP payoff estimates r_11, ..., r_22 under each decision over the observation steps]

The estimates r_i^k of the elements folded into the payoff vectors are computed at the third stage, using the observations obtained at every step (i.e. for each new presentation) and the payoff of that presentation, according to the recurrent relations. For the considered dimensionality, four such estimates are sufficient (as with the probability estimates); a numerical check of the direct-problem figures quoted above is sketched below.
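As a quick numerical check of the figures quoted above, the steady-state mean payoff of the strategy f_2 = [1 2]^T can be recomputed directly from the Table 1 parameters. This is a self-contained sketch (our own code), and it reproduces the roughly 71 units reported for the optimal strategy.

```python
import numpy as np

# Working matrices of strategy f_2 = [1 2]^T assembled from Table 1:
Ps = np.array([[0.05, 0.95],    # row 1 of P^1
               [0.48, 0.52]])   # row 2 of P^2
Rs = np.array([[45.0, 79.0],    # row 1 of R^1
               [93.0, 45.0]])   # row 2 of R^2

r = (Ps * Rs).sum(axis=1)                       # expected one-step payoffs r_i^s
A = np.vstack([Ps.T - np.eye(2), np.ones(2)])   # stationarity + normalization
pN = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
print(pN, float(pN @ r))                        # steady state and V_s, about 71
```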

The convergence of the payoff estimates is shown in Fig. 3, while Fig. 4 shows the solutions of the direct MDP problem computed from the step-by-step parameter estimates. Fig. 4 clearly shows that the adaptation process converges quickly.

[Fig. 4. Convergence of the MDP solution (selected strategy versus the modelled one) as the parameter estimates are refined]

Conclusions

Experiments carried out for other dimensionalities of the state space (the capacity of the sensor field) and of the decision space showed that the convergence of the solution remains fast for the considered class of models. Moreover, optimal experiment design tools used during the training of the RS allow shortening the time needed to fit the decision maker's preferences. The preference model fitted and embedded in the RS is highly adequate to the preferences and objectives of the decision maker, while the quality of the decisions made by the RS is not inferior to that of the decisions made by the "teacher" of the MDP model. If signs of environment non-stationarity appear or the decision maker's preferences change, the model can be re-fitted and reloaded into the RS. The fitting/training of the model can be done on a separate model or a special test rig, and the MDP model adjusted to the new conditions can be loaded as a "hot" update without interrupting the normal operation of the RS. Further development of the suggested approach can proceed in several directions; in particular, the considered range of sensor field variants of the robotic system can be extended and/or POMDP models can be used.

References

[1] M.P. Woodward, R.J. Wood. Framing Human-Robot Task Communication as a POMDP. [Web version]. URL: http://arxiv.org/pdf/124.28.pdf [last accessed 21.05.2015].
[2] M.P. Woodward, R.J. Wood. Using Bayesian Inference to Learn High-Level Tasks from a Human Teacher. Conf. on Artificial Intelligence and Pattern Recognition, Orlando, FL, July 2009, pp. 138-145.
[3] A.A. Zhdanov. Autonomous Artificial Intelligence. Moscow: BINOM Publishing House, 2008. 359 pp.
[4] S. Koenig, R.G. Simmons. Xavier: A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models. Carnegie Mellon University. [Web version]. URL: http://citeseerx.ist.psu.edu/viewdoc/download?do.1.1.13.588&rep=rep1&type=pdf [last accessed 21.05.2015].
[5] M.E. Lopez, R. Barea, L.M. Bergasa. Global Navigation of Assistant Robots using Partially Observable Markov Decision Processes. [Web version]. URL: http://cdn.intechopen.com/pdfs/141/intech-Global_navigation_of_assistant_robots_using_partially_observable_markov_decision_processes.pdf [last accessed 21.05.2015].
[6] S. Ong, S.W. Png, D. Hsu, W.S. Lee. POMDPs for Robotic Tasks with Mixed Observability. Robotics: Science & Systems, 2009. [Web version]. URL: http://www.comp.nus.edu.sg/~leews/publications/rss9.pdf [last accessed 21.05.2015].
[7] R.S. Sutton, A.G. Barto. Reinforcement Learning: An Introduction. Cambridge: The MIT Press, 1998.
[8] V.Ya. Vilisov. Robot Training Under Conditions of Incomplete Information. [Web version]. URL: http://arxiv.org/ftp/arxiv/papers/142/142.2996.pdf [last accessed 21.05.2015].
[9] H. Mine, S. Osaki. Markovian Decision Processes. NY: American Elsevier Publishing Company, Inc., 1970. 176 p.
[10] V.Ya. Vilisov. Adaptive Models Used for Researching Economic Operations. Moscow: Enit Publishing House, 2007. 286 pp.
[11] R.A. Howard. Dynamic Programming and Markov Processes. NY-London: The M.I.T. Press, 1960.
[12] G.F. Franklin, J.D. Powell, M. Workman. Digital Control of Dynamic Systems. NY: Addison Wesley Longman, 1998.