ANNUAL CONFERENCE OF ITA, SEPT.

Controlling Autonomous Data Ferries with Partial Observations

Ting He, Kang-Won Lee, and Ananthram Swami

Abstract: We consider supporting communications in highly-partitioned mobile wireless networks via controllable data ferries. While existing ferry control techniques assume either stationary nodes or complete ferry observation of node locations, we address the more challenging scenario of highly mobile nodes and partial ferry observations. Using the tool of Partially Observable Markov Decision Processes (POMDPs), we develop a comprehensive framework in which we expand the solution space from predetermined trajectories to policies that map ferry observations to navigation actions dynamically. Under this framework, we present an optimal policy and several efficient heuristic policies. We compare the proposed policies with predetermined control through analysis and simulations with respect to multiple node mobility parameters, including speed, locality, activeness, and range of movement. The comparison shows a significant performance gain of up to twice the contact rate in cases of high uncertainty. In cases of low uncertainty, we give a sufficient condition under which predetermined control is optimal.

Index Terms: Dynamic data ferry control, Partially Observable Markov Decision Process, Performance analysis.

I. INTRODUCTION

The emerging application of Mobile Ad Hoc Networks (MANETs) in mission-critical environments brings new technical challenges. Consider a MANET consisting of highly mobile nodes covering a large, infrastructureless field. Examples of such networks can be found in military tactical operations, search-and-rescue scenarios, as well as mobile sensor networks. Such networks exhibit properties drastically different from traditional networks: low density, high mobility, and poor connectivity. On top of these, the mobility in these networks is usually dictated by mission needs and physical constraints such as roads and obstacles, which may not coincide with the mobility needed to provide sufficient contact opportunities. These challenges make it extremely difficult to enable efficient communications via in-network solutions.

Meanwhile, the fast development of robot technology has brought to life robots with increasingly sophisticated mobility, communication, and computation capabilities at significantly reduced size and cost. In particular, existing work [2] has shown that it is feasible to build light-weight robots from off-the-shelf elements that can be mounted on many types of mobile platforms, including unmanned ground or aerial vehicles. The advance in robotic technology and the challenges in mission-critical MANETs inspire our investigation of applying autonomous data ferries in highly-partitioned MANETs.

T. He and K.-W. Lee are with the IBM T.J. Watson Research Center, Hawthorne, NY. E-mail: {the,kangwon}@us.ibm.com. A. Swami is with the Army Research Laboratory, Adelphi, MD. E-mail: aswami@arl.army.mil.

Research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-06-3-0001. The views and conclusions contained in this document are those of the author(s) and do not represent the official policies of the U.S. and U.K. Governments. The Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. Part of this work was reported in ACM MobiHoc [1].
Specifically, we propose to enhance these networks with dedicated communication nodes acting as data ferries. We envision a heterogeneous network structure as in [3], where the network consists of regular nodes that passively move according to their inherent mobility patterns and data ferries that actively move according to external control. The regular nodes have communication needs among themselves, while the data ferries only serve as relays in a load-carry-and-deliver manner. To fully utilize these ferries, however, effective control of their mobility is needed.

The problem of ferry mobility control has been extensively studied in the case of stationary regular nodes with known locations [3]-[5]. The solutions there involve designing ferry routes to cover the nodes and letting the ferries follow these routes deterministically like shuttles. When nodes are mobile, however, very little work has been done, with the only existing solutions to our knowledge [6], [7] assuming complete observation of nodes' current locations by the ferries. In practice, this assumption may fail due to obstacles, limited communication range, or poor channel conditions.

In this paper, we consider the scenario where nodes move stochastically and data ferries can only observe a subset of the field at a time, referred to as partial observations. Without complete observations, the ferries will have uncertainty about node locations. Our goal is to develop a solution that enables the ferries to self-navigate based only on partial observations and statistical knowledge of node mobility patterns.

A. Related Work

The idea of using controlled mobility to facilitate communications was first proposed in [8], where nodes deliberately change trajectories to send messages to otherwise disconnected nodes. This solution has limited application in mission-critical networks since regular nodes often cannot afford to change trajectories for communication purposes. For such applications, designated communication nodes are used instead [3], [4], [6], [7]. To compensate for the sparsity of the networks, a delay-tolerant scheme is used, where communication nodes physically carry data between task nodes [3], [4], [6], [7] or from sensors to base stations [5]. The solution in [7] actually belongs to the stationary-node category, as nodes are stationary unless chosen to be ferries.

Our work belongs to this category. The existing work is limited by several assumptions: most solutions assume the task nodes to be stationary with known positions [3]-[5]; when task nodes are mobile, the solutions assume that ferries know nodes' current locations through either out-of-band communications [6] or external location management systems [7]. There have been multiple extensions in terms of channel models, performance metrics, traffic demands, etc., but the main assumption of a fully observable network state remains. In contrast, we aim to explicitly handle stochastic node movements and incomplete ferry observations.

Technically, our approach belongs to stochastic control, in particular Partially Observable Markov Decision Processes (POMDP) [9]. Although previously applied to agent navigation in robotics [10], [11], its application to mobility control in communication networks has not been explored before. Our problem differs from the agent navigation problem in that the subsystem that evolves dynamically (task nodes) is different from the subsystem that is controlled (data ferries), which leads to unique properties as explained later.

B. Our Approach and Results

We take the approach of stochastic control, where in contrast to predetermined control in stationary networks, we show that the optimal control in mobile networks needs to be dynamic in nature. Specifically, our contributions are:

A comprehensive control framework: We model the entire network of nodes and ferries as an interactive system represented by a Partially Observable Markov Decision Process (POMDP), which seamlessly combines the tasks of information collection and decision making. The framework enables a systematic solution to handle uncertainties due to stochastic node movements and partial ferry observations.

Optimal and efficient heuristic policies: Based on the proposed framework, we develop both optimal and heuristic policies by applying POMDP techniques. We develop a recursive algorithm to compute the optimal policy and, due to the complexity of the optimal policy, a set of efficient heuristic algorithms. Moreover, we show that our problem has a unique structure that allows an explicit policy representation linked back to ferry trajectories.

Analysis and simulations: We conduct rigorous analysis of the proposed solution, verified by simulations. In particular, since dynamic control generally requires higher costs than predetermined control, we want to characterize their performance gap and how it is influenced by external parameters, to provide general guidance on system design. Using the ferry-node inter-contact time as the performance measure, we derive both upper bounds (achievability) and lower bounds (converse). Our achievability results show promising performance of the proposed dynamic control in that it outperforms predetermined control the most when improvement is most needed: when the field is large (or the sensing range is small) and the prior knowledge is minimal (nodes' steady-state distributions are close to uniform). In this case, our solution can reduce the inter-contact time by at least Ω(√n) for a domain of size n, and even Θ(n) if nodes are constrained to 1-D routes. The converse shows that this reduction is at most half, achievable by our policy under symmetric and infrequent mobility.

Fig. 1. Example of two consecutive slots: one (gateway) node per domain; contact only occurs in the second slot. r: ferry sensing radius.
On the other hand, when nodes' steady-state distributions are localized, we also give a sufficient condition under which predetermined control is optimal.

The rest of the paper is organized as follows. Section II formulates the problem of dynamic ferry control. Section III presents a control framework via POMDP. Section IV develops the optimal and heuristic policies, whose performance is analyzed in Section V. Section VI evaluates the solution through simulations, and Section VII concludes the paper.

II. PROBLEM FORMULATION

A. Network Model

As illustrated in Fig. 1, we assume that the network of regular (i.e., non-ferry) nodes is divided into N partitions, each occupying a disjoint region called a domain. Within each domain, at least one node is selected as the gateway for inter-domain communications (assuming sufficient intra-domain contacts). We make this distinction because selected nodes may need special equipment (e.g., high-power antennas) to communicate with ferries, although for our purpose it suffices to model only these gateways, which will simply be called nodes in the sequel. Each node is equipped with a radio of range r and moves within its domain according to some stochastic mobility model to be specified later.

Assume that each data ferry is equipped with a radio of the same range as the nodes. In addition, it has a buffer to store messages and a controller to specify its movements according to a control policy π (defined later). Assume the ferry self-navigates periodically with a predetermined interval (called the slot length), and at the end of each slot t, it senses the node's presence (e.g., by broadcasting beacons) to obtain a binary observation z_t (z_t = 1 for contact and 0 for miss). Assume the sensing range to be equal to the radio range. Upon contact, the ferry will exchange messages with the node and then move to the next domain to ferry data.

B. Main Problem: Ferry Mobility Control

The main problem in this paper is to design a control policy that optimizes the inter-domain communication performance. To facilitate control, we discretize the space by partitioning each domain into a number of cells of radius r (equivalently, r denotes the smaller of the ferry/node ranges; see Fig. 1) such that the ferry can sense the entire cell from its center.
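The discretization above only requires that every point of a cell lie within sensing range r of the cell center. A minimal sketch of one such partition, assuming square cells of side r*sqrt(2) over a rectangular domain (the square shape is our illustrative assumption; the paper does not fix a particular cell geometry):

```python
import math

def make_cells(width, height, r):
    """Partition a width x height rectangular domain into square cells of
    side r*sqrt(2), so that every point of a cell lies within distance r of
    the cell center (the ferry can sense the whole cell from its center)."""
    side = r * math.sqrt(2)
    nx, ny = math.ceil(width / side), math.ceil(height / side)
    centers = [((i + 0.5) * side, (j + 0.5) * side)
               for j in range(ny) for i in range(nx)]
    return centers  # cell k is identified with centers[k]

# Example: a 100 x 60 field with sensing radius r = 10.
print(len(make_cells(100, 60, 10)))
```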

Each control action u_t specifies the cell for the ferry to move to during slot t. We model the design criteria by a reward function E[R(z_1^T, u_1^T)], where R(·) is the performance metric as a function of the observation history z_1^T and the action history u_1^T up to time T, and T > 0 defines a time window of control (called the horizon). (We use x_{t1}^{t2} to denote the vector (x_{t1}, ..., x_{t2}) for t1 ≤ t2.) We focus on the infinite-horizon case (T = ∞) for long-term control.

Compared with ferry mobility control in the literature (e.g., [3]), the main difference here is that instead of designing the entire ferry trajectory u_1^T ahead of time, we look at it as a stochastic sequence, where each action u_t can be optimized individually at run time. This means that the control decision u_t can be based not only on prior knowledge such as nodes' steady-state distributions, but also on run-time information including the ferry's past observations z_1^{t-1} and actions u_1^{t-1}. Accordingly, the control problem also shifts from designing the trajectory to designing the control policy π, which is a sequence of mappings π = (π_t)_{t=1}^T that map all the available information to the next action at each slot, i.e., π_t(z_1^{t-1}, u_1^{t-1}) = u_t. Our goal is to find a policy π* that maximizes the total reward:

    π* = arg max_π E[ R(z_1^T, u_1^T) | u_t = π_t(z_1^{t-1}, u_1^{t-1}) ].    (1)

III. DESIGN FRAMEWORK

We will focus on the control of an individual ferry. Consider the case when there is one (gateway) node per domain and the number of domains N is large (the cases of multiple nodes per domain and small N are addressed separately in [1]). To ensure scalability, we divide the ferry control into global control and local control as follows.

Global control: The global control determines which domain to go to. An intuitive solution is to iterate among all the domains in a predetermined or random order. More sophisticated solutions involve optimization of certain performance metrics. Since the domains are fixed, we can view an entire domain as a stationary node and apply existing solutions for ferry route design [3] to order the domains.

Local control: The local control handles the specific movements among cells within a domain. This is a more challenging task because the ferry can only see one cell at a time, and will be the focus of this paper.

A. A Simple Example

To illustrate the difficulty of this problem, let us consider a simple example. Suppose that each domain is only partitioned into two cells, labeled cell 1 and cell 2. In each slot, a node moves from cell 1 to cell 2 with probability p and from cell 2 to cell 1 with probability q. Assuming that a contact occurs every time the ferry is in the same cell as a node, after which it moves to a different domain, we want to maximize the contact rate r_c = lim_{T→∞} R(z_1^T, u_1^T)/T, where R(·) is the number of contacts.

Simple as it seems, much thought is required to achieve optimality. A baseline policy of random movements will only give a contact rate of 0.25. An intuitive solution is to check the other cell whenever the ferry has a miss. This policy leads to oscillating movements, whose contact rate can be shown to be

    r_c^o = (p + q)(1 - pq) / [2 + p + q - (2 - p - q)(1 - pq)].

Another intuitive solution is to let the ferry wait in one cell, whose contact rate can be shown to be

    r_c^w = (p + q)pq / [p^2 + q^2 + pq(p + q)].
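To make the comparison concrete, the following is a minimal Monte Carlo sketch of this two-cell example. It assumes one concrete reading of the model (the ferry's observation at the end of a slot is checked against the node's position after it has moved in that slot, and after every contact the ferry jumps to a fresh domain whose node is drawn from the steady state); the exact numbers depend on these modeling assumptions, so the sketch only illustrates how such policies can be compared empirically, not the closed-form rates above.

```python
import random

def simulate(policy, p, q, slots=200_000, seed=1):
    """Estimate the contact rate of a two-cell local policy.
    policy(last_checked) -> cell to check (1 or 2); last_checked is None in a fresh domain."""
    rng = random.Random(seed)
    pi1 = q / (p + q)                      # steady-state probability of cell 1
    node = 1 if rng.random() < pi1 else 2  # node of the first domain, drawn from steady state
    last, contacts = None, 0
    for _ in range(slots):
        u = policy(last)
        # the node moves during the slot
        if node == 1 and rng.random() < p:
            node = 2
        elif node == 2 and rng.random() < q:
            node = 1
        if u == node:                      # contact: ferry switches to a fresh domain
            contacts += 1
            node = 1 if rng.random() < pi1 else 2
            last = None
        else:                              # miss: remember where we looked
            last = u
    return contacts / slots

oscillate = lambda last: 2 if last in (None, 1) else 1   # check the other cell after a miss
wait      = lambda last: 2                               # always wait in cell 2
rand_move = lambda last: random.choice((1, 2))           # baseline: random movements

for p, q in [(0.3, 0.3), (0.7, 0.7), (0.25, 0.75)]:
    rates = {name: simulate(f, p, q) for name, f in
             [("osc", oscillate), ("wait", wait), ("rand", rand_move)]}
    print(p, q, {k: round(v, 3) for k, v in rates.items()})
```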
We see that none of these policies can guarantee optimality: e.g., the waiting policy is better if p = q > 0.5, and the oscillating policy is better if p = q < 0.5. A smarter design is perhaps to keep chasing the node to its most likely cell, called the myopic policy. Interestingly, even this policy is not optimal in general (e.g., for certain values of p and q the oscillating policy outperforms the myopic policy). In the sequel, we introduce a systematic approach to achieve optimality.

B. A Systematic Solution via POMDP

The preceding simple example turns out to require a sophisticated solution [12]. In general, the problem is even trickier since the ferry will not know the exact location of the node before contact. In this section, we introduce a systematic design framework using the tool of Partially Observable Markov Decision Processes (POMDP) [9]. A POMDP models a control problem where the goal is to select control actions, based on run-time observations reflecting the state of the controlled system, so as to optimize a reward. A POMDP problem is represented by the tuple <S, U, Z, P, P_z, r, γ, b_0> [10]:

1) S: the state space, defined as the set of internal states of the controlled system; here the state s_t denotes the cell of the current node at the beginning of slot t, and the state space S is the set of all cells, denoted by {1, 2, ..., n};

2) U: the action space, defined as the set of control options; here the action u_t denotes the cell the ferry will move to in slot t, which can be either in the current domain or in the next domain (specified by the global control); accordingly, U is divided into follow actions U_f = {1, ..., n} (i ∈ U_f means to move to cell i of the current domain) and switch actions U_s = {1', ..., n'} (j ∈ U_s means to switch to cell j of the next domain);

3) Z: the observation space, defined as the set of possible observations; here Z = {0, 1}, with 1 denoting contact and 0 miss (z_t ∈ Z is taken at the end of slot t);

4) P: the state transition model specifying the transition probabilities of the system state; here it is the transition matrix of the node mobility model (assuming Markovian mobility), where P(i, j) = Pr{s_{t+1} = j | s_t = i} is the probability for the node to move from cell i to cell j during slot t;

5) P_z: the observation model specifying how observations are related to states and actions; here P_z(s, u, z) = Pr{z_t = z | s_{t+1} = s, u_t = u} is the probability of sensing z if the node moves to cell s and the ferry to cell u; we assume P_z(s, u, z) = 1 if and only if z = 1(s = u) (perfect sensing);

6) r(z, u): the payoff function specifying the reward of taking action u and observing z; here we use r(z, u) = z to give a unit reward for each contact;

7) γ ∈ (0, 1]: the discount factor specifying how much delayed rewards will be discounted; the total reward is given by R(z_1^T, u_1^T) = Σ_{t=1}^T γ^t r(z_t, u_t) (there are other reward structures, but we select the discounted reward as it is the most common one for the infinite horizon);

8) b_0: the initial belief specifying the initial distribution of the system state; here we assume it is the node's steady-state distribution b_0 (i.e., the solution to b_0^T = b_0^T P, assuming it exists).

Many of the above elements are design choices and can be extended. The main assumption behind this framework is the Markovian node mobility model. While noting that it is a limiting assumption, we point out that most mobility models, e.g., random walk, random waypoint, and their variations, can be mapped to Markovian models under proper quantization.

IV. OPTIMAL AND HEURISTIC POLICIES

We now develop policies for the local control under the framework in Section III-B. We will first develop a mathematical representation of the optimal policy and then investigate the algorithms. For clarity, we assume nodes in different domains move i.i.d. with transition matrix P and steady-state distribution b_0. The non-i.i.d. case can be easily handled by plugging in different matrices and distributions.

A. Optimal Policy

It is known in POMDP theory that a sufficient statistic is the belief vector b_t, defined as the posterior distribution of the node (at the beginning of slot t) given all past actions and observations, i.e., b_t(s) = Pr{s_t = s | z_1^{t-1}, u_1^{t-1}}. It thus suffices to design the policy based on b_t. We introduce an auxiliary function V_T^π, called the value function, where V_T^π(b) denotes the total discounted reward gained by policy π over a horizon T starting from belief b [9]. It is known that the optimal value function must satisfy the following equation, known as the value iteration:

    V_T(b) = γ max_{u∈U} [ r(b, u) + Σ_{z∈Z} P_z(b, u, z) V_{T-1}(f(b, u, z)) ],    (2)

where r(b, u) is the expected reward in the current slot, P_z(b, u, z) the probability of observing z, and f(b, u, z) the belief after taking action u and observing z. The optimal policy is the solution to (2), i.e., π_T*(b) is the action achieving V_T(b).

We now translate the above definitions to our problem. In our problem, the belief vector experiences two updates per slot: a belief transition and a Bayesian update. The belief transition takes place after the ferry takes an action u_t at the beginning of slot t and before it takes an observation z_t at the end of slot t, to model node movements during this slot:

    b'_t = f_1(b_t, u_t) = { P^T b_t  if u_t ∈ U_f,
                             b_0      if u_t ∈ U_s, }    (3)

where the belief is reset if the ferry switches to a new domain. The Bayesian update occurs after the ferry observes z_t:

    b_{t+1} = f_2(b'_t, u_t, z_t) = { b_0            if z_t = 1,
                                      (b'_t)^{\u_t}  if z_t = 0, }    (4)

where the belief is reset after a contact (and the ferry will switch domains), or updated by the Bayesian rule after a miss. (The notation b' = b^{\u} means b'(u) = 0 and b'(s) = b(s)/(1 - b(u)) for s ≠ u.) Then the procedure continues in the next slot.
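A minimal sketch of the two belief updates (3) and (4) in code, using plain Python lists for belief vectors (the function and variable names are ours, chosen for illustration):

```python
def belief_transition(b, P, u_is_switch, b0):
    """Eq. (3): node-movement update. P[i][j] = Pr{next cell j | current cell i}."""
    if u_is_switch:
        return list(b0)                       # belief resets in a new domain
    n = len(b)
    return [sum(b[i] * P[i][j] for i in range(n)) for j in range(n)]  # P^T b

def bayesian_update(b_prime, u, z, b0):
    """Eq. (4): observation update after checking cell u and observing z."""
    if z == 1:
        return list(b0)                       # contact: reset (ferry switches domains)
    miss_mass = 1.0 - b_prime[u]
    post = [x / miss_mass for x in b_prime]   # renormalize over the remaining cells
    post[u] = 0.0                             # the node is certainly not in cell u
    return post

# One slot of the two-cell example: follow action u = 1 (0-indexed), then a miss.
P = [[0.7, 0.3], [0.4, 0.6]]                  # p = 0.3, q = 0.4
b0 = [4/7, 3/7]                               # steady state of P
b1 = belief_transition(b0, P, u_is_switch=False, b0=b0)
b2 = bayesian_update(b1, u=1, z=0, b0=b0)
print(b1, b2)
```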
Based on the above belief updates and the definitions in Section III-B, the value iteration (2) in our problem becomes

    V_T(b) = γ max_{u∈U} [ b'(u) + b'(u) V_{T-1}(b_0) + (1 - b'(u)) V_{T-1}((b')^{\u}) ],    (5)

where b' = f_1(b, u), and the action achieving (5) gives the optimal policy. Physically, V_T(b) is the maximum discounted number of contacts over T slots starting from a node distribution b. Although γ does not appear in the physical problem, it is needed to ensure a unique optimal solution in the long run.

Fact 4.1 ([13]): There exists a unique stationary policy π that optimizes the total reward as T → ∞ if γ < 1, and its value function V_∞ is invariant under (2).

B. Policy-Computing Algorithms

Although mathematically sound, equation (5) cannot be directly used to compute the policy, as there are infinitely many belief vectors. In this section, we explore algorithms that can compute the policy in a finite number of steps.

1) Hardness of the Optimal Policy: Sondik [9] first showed the possibility of computing the optimal policy by utilizing the piecewise-linear and convex property of the optimal value function: V_T(b) = max_{α∈Γ_T} b^T α, where α is a coefficient vector, called a value vector, that represents a linear segment of V_T. The problem is then reduced to finding the set of value vectors Γ_T ⊂ (R^+)^n. Our problem differs from the classic POMDP in that we have two belief updates per slot, (3) and (4), that interleave with the action. Consequently, our optimal value function has a unique form as follows.

Proposition 4.2: The optimal value function of (5) can be written as

    V_T(b) = max( max_{α∈Γ_T^f} b^T α,  max_{α∈Γ_T^s} b_0^T α ).    (6)

Moreover, Γ_T^f and Γ_T^s (T ≥ 1) satisfy the recursion (for T = 0, Γ_0^f = Γ_0^s = {(0, ..., 0)}):

    Γ_T^f = P Γ'_T,    Γ_T^s = arg max_{α∈Γ'_T} b_0^T α,    (7)

where P Γ'_T = {Pα : α ∈ Γ'_T}, and Γ'_T = {α̃_1^{u,α}, α̃_2^{u} : u ∈ S, α ∈ Γ_{T-1}^f} for

    α̃_1^{u,α}(s) = { γ(1 + c_1)  if s = u,
                      γ α(s)     otherwise, }    (8)

    α̃_2^{u}(s)   = { γ(1 + c_1)  if s = u,
                      γ c_2      otherwise, }    (9)

c_1 = max_{α ∈ Γ_{T-1}^f ∪ Γ_{T-1}^s} b_0^T α, and c_2 = max_{α ∈ Γ_{T-1}^s} b_0^T α.

The above result also gives us an explicit algorithm to compute the optimal policy. During the offline phase (when bootstrapping the controller), we start with Γ_0^f = Γ_0^s = {0} and iteratively apply (7) for T = 1, 2, ... to compute Γ_T^f and Γ_T^s for any given T. During the online phase (when applying control), we extract the policy by solving for the action that achieves the maximization on the right-hand side of (5), where V_{T-1}(b_0) and V_{T-1}(b^{\u}) are computed from Γ_{T-1}^f and Γ_{T-1}^s by (6).

The drawback of the above algorithm is that the number of value vectors grows fast. It can be shown that |Γ_T^f| = |S|(|S|^T - 1)/(|S| - 1) = O(|S|^T) and |Γ_T^s| = 1, and thus the worst-case complexity grows exponentially with T (O(|S|^{T+1})). This is actually not specific to our algorithm, but due to the intrinsic hardness of POMDPs.

2) Efficient Heuristic Policies: Due to the high complexity of the optimal policy, we turn to heuristic policies. Most heuristic policies seek to approximate the optimal value function. For example, we can approximate the exact value iteration (5) by the Interpolation-Extrapolation method [10], which quantizes the belief space into a finite set of representative belief vectors and runs value iterations only on these beliefs; the values at other beliefs are computed via interpolation. Alternatively, we can approximate the value-vector-based algorithm in Section IV-B1 by the Point-Based Value Iteration method [11], which limits the set of value vectors to those achieving the maximum value in (6) for at least one of the representative beliefs. Other heuristic methods include reducing the problem to a fully observable case (MDP and QMDP approximations), directly approximating the value function by curve fitting (Least-Squares Fit), etc.; see [10], [11] and references therein.

An extreme case is when we ignore the value function altogether, which leads to a much simpler policy, called the myopic policy, that greedily optimizes the probability of immediate contact:

    π_m(b) = arg max_{u∈U} b'(u).    (10)

Unlike the above sophisticated policies, this policy does not require any offline computation and has a very short online response time. On the other hand, it may suffer more performance loss due to greediness. In certain cases, however, we show that the myopic policy gives the optimal action.

Theorem 4.3: At horizon T and belief state b, it is optimal to follow the myopic policy if

    γ b'_{(1)} ≥ (1 - b'_{(2)}) [ 1 + b'_{(1)} γ(1 - γ^{T-1})/(1 - γ) ],    (11)

where b'_{(1)} = max_{u∈U} b'(u) (achieved at u*) is the largest immediate reward, and b'_{(2)} = max_{u∈U\{u*}} b'(u) the second largest immediate reward.

This theorem gives a sufficient condition for the myopic policy to be optimal. The condition depends on two parameters: the number of remaining slots T and the current belief b. By this theorem, we can safely short-circuit value iterations in certain subsets of the belief space by following the myopic policy where condition (11) is satisfied.
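To illustrate the structure of (5) and the myopic rule (10), here is a small, deliberately naive sketch: it evaluates V_T by direct recursion over beliefs (exponential in T, so only usable for tiny horizons) rather than by the value-vector algorithm above, and it reuses the belief updates of Section III-B. All names are ours; switch actions are modeled simply as checking a cell of a fresh domain whose belief is b_0.

```python
from functools import lru_cache

def make_model(P, b0, gamma):
    n = len(b0)

    def transition(b):                       # eq. (3), follow actions: P^T b
        return tuple(sum(b[i] * P[i][j] for i in range(n)) for j in range(n))

    def clear(b, u):                         # belief after a miss at cell u, eq. (4)
        m = 1.0 - b[u]
        return tuple(0.0 if s == u else b[s] / m for s in range(n))

    def candidate_actions(b):
        bp_follow, bp_switch = transition(b), tuple(b0)
        for u in range(n):                   # follow actions
            yield u, bp_follow
        for u in range(n):                   # switch actions (fresh domain)
            yield u, bp_switch

    @lru_cache(maxsize=None)
    def V(T, b):                             # eq. (5), naive recursion over beliefs
        if T == 0:
            return 0.0
        best = 0.0
        for u, bp in candidate_actions(b):
            miss_p = 1.0 - bp[u]
            future_miss = 0.0 if miss_p <= 1e-12 else miss_p * V(T - 1, clear(bp, u))
            val = gamma * (bp[u] * (1.0 + V(T - 1, tuple(b0))) + future_miss)
            best = max(best, val)
        return best

    def myopic(b):                           # eq. (10): maximize immediate contact prob.
        return max(candidate_actions(b), key=lambda a: a[1][a[0]])[0]

    return V, myopic

# Two-cell example with p = 0.3, q = 0.4.
P = [[0.7, 0.3], [0.4, 0.6]]
b0 = (4/7, 3/7)
V, myopic = make_model(P, b0, gamma=0.9)
print("V_5(b0) =", round(V(5, b0), 4), " myopic action at b0:", myopic(b0))
```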
3) Explicit Policy Representation: Different from general POMDPs, our problem has a special property, as follows.

Proposition 4.4: Each belief-based dynamic policy π corresponds to a unique trajectory u_π = (u_π(t))_{t=1}^d (d can be ∞) within one domain, such that π is equivalent to a policy of following u_π and repeating it in different domains until contact, after which this process is repeated in a new domain.

This is because the ferry's observations are binary and the controller restarts (in a new domain) after each contact, which reduces a policy to a sequence of actions (u_π) as long as the observations are consecutive misses. This property allows us to represent the policy explicitly by u_π. Note that this is different from predetermined ferry trajectories: instead of following u_π deterministically, the ferry will stop and switch domains immediately after a contact, and its actual trajectory will be stochastic. To distinguish it from the actual trajectory, we will call u_π the policy trajectory of policy π.

V. PERFORMANCE ANALYSIS

We analyze the performance of the proposed policies using the measure of average inter-contact time E[K], i.e., the mean number of slots between two consecutive ferry-node contacts (1/E[K] is the contact rate). Our goal is two-fold. First, we want to compare dynamic policies with predetermined policies. Since dynamic policies are more expensive to implement, the comparison will provide guidance on the performance-cost tradeoff. Moreover, we want to see how the performance is influenced by node mobility models. Instead of limiting ourselves to specific models, we focus on generic parameters such as speed, locality, activeness, and the size and shape of the field, to draw insights on when dynamic control is worthwhile and when predetermined control suffices.

A. Dynamic vs. Predetermined Control

To analyze the fundamental performance gain, we need to compare the optimal dynamic policy with the optimal predetermined counterpart. We begin with the following observation.

Claim 5.1: The optimal predetermined policy is to keep switching among the most likely cells of different domains (called the switching policy).

Therefore, it suffices to compare dynamic policies with the switching policy. In the sequel, we use the notation that (b_(i))_{i=1}^n is an ordered sequence of the elements of b_0 such that b_(1) ≥ b_(2) ≥ ⋯. Let K_SW denote the inter-contact time of the switching policy. It is easy to see that E[K_SW] = 1/b_(1).
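Proposition 4.4 suggests a simple way to inspect any belief-based local policy: feed it consecutive misses and record the cells it visits. A minimal sketch (our own helper, reusing the belief updates and the greedy rule sketched earlier) that unrolls a policy into its policy trajectory u_π together with the per-step contact probabilities used in the bounds below:

```python
def policy_trajectory(policy, P, b0, max_len=20):
    """Unroll a belief-based local policy into its policy trajectory u_pi
    (Proposition 4.4) by assuming consecutive misses; also return the
    conditional contact probabilities p_t along the trajectory."""
    n = len(b0)
    b = list(b0)
    cells, probs = [], []
    for _ in range(max_len):
        bp = [sum(b[i] * P[i][j] for i in range(n)) for j in range(n)]  # eq. (3)
        u = policy(bp)                        # next cell, given the pre-observation belief
        cells.append(u)
        probs.append(bp[u])                   # p_t: contact prob. given all previous misses
        miss = 1.0 - bp[u]
        if miss <= 0.0:
            break
        b = [0.0 if s == u else bp[s] / miss for s in range(n)]          # eq. (4), miss branch
    return cells, probs

# Example: the myopic policy in the two-cell model.
P = [[0.7, 0.3], [0.4, 0.6]]
b0 = [4/7, 3/7]
greedy = lambda bp: max(range(len(bp)), key=lambda u: bp[u])
print(policy_trajectory(greedy, P, b0, max_len=5))
```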

For dynamic policies, an exact analysis is intractable, as we cannot even write down the optimal policy explicitly (see Section IV-B). Instead, we seek to bound the performance from both above and below.

For the upper bound, we need to analyze a good but simple dynamic policy. Specifically, consider the myopic policy with policy trajectory u = (u(t))_{t=1}^d, and let p_t denote the conditional probability of contact in step t given that there has been no contact in the previous t - 1 steps. Then p_t and u(t) can be computed recursively by

    p_t = max_s b'_t(s),    u(t) = arg max_s b'_t(s),    (12)

where b'_{t+1} = P^T (b'_t)^{\u(t)} and b'_1 = b_0; d is the smallest t such that p_{t+1} < b_(1).

For the lower bound, we need to modify the problem into one that allows superior performance. Consider a hypothetical scenario where all nodes are stationary with distribution b_0, such that the ferry never needs to visit a cell twice. The policy trajectory in this case is the sequence of cells corresponding to b_(1), b_(2), .... The conditional contact probability p̃_t is given by p̃_t = b_(t) / (1 - Σ_{i=1}^{t-1} b_(i)), and the trajectory length d̃ is the smallest t such that p̃_{t+1} < b_(1). With these notations, we present the following bounds.

Theorem 5.2: The average inter-contact time E[K_DY] of the optimal dynamic policy satisfies

    (K̃ + d̃(1 - p̃))/p̃  ≤  E[K_DY]  ≤  (K + d(1 - p))/p,    (13)

where p = 1 - Π_{t=1}^d (1 - p_t), K = Σ_{t=1}^d t p_t Π_{i=1}^{t-1} (1 - p_i), and p̃, K̃ are similarly defined with p_t, d replaced by p̃_t, d̃.

This theorem provides both achievability and converse results on dynamic ferry control. The achievability result (the upper bound) gives a guaranteed performance gain in comparison with predetermined control (E[K_SW]). The converse result (the lower bound) establishes a fundamental limit that no control policy can outperform, which can serve as a reference in evaluating the effectiveness of specific policies.

B. Impact of Domain Size and Node Speed

We now apply the result in Theorem 5.2 to analyze the impact of node mobility parameters. We first examine the impact of domain size, measured by the number of cells n per domain, and node speed, measured by the number of cells m a node can move across per slot (i.e., it can move to locations at most m cells away). We consider a symmetric mobility model where nodes are uniformly distributed in the steady state, so as to focus only on n and m; asymmetric mobility models will be treated separately in Section V-C.

Intuitively, the ferry starts with little knowledge about the node location, but will learn more as it visits cells and makes its belief vector more concentrated. However, due to node movements, if the ferry misses the node in one cell, its probabilities of seeing the node in neighboring cells will also drop in the next slot, and this effect will propagate to farther neighbors over time. To compensate for this effect, we make the ferry follow a dominating trajectory, defined as a sequence of cells c(1), c(2), ... such that no c(k) can be reached by a node starting from cell c(j), k - j slots earlier. The idea is to let the ferry scan the field by visiting only one cell per neighborhood to best utilize the contact opportunities; see Fig. 2 for illustrations. Using this policy, we obtain the following bounds.

Fig. 2. Dominating trajectories for a 2-D random walk: (a) and (b) are both the longest.

Corollary 5.3: If b_0 is uniform, then E[K_SW] = n, and

    n/2  ≤  E[K_DY]  ≤  n - d(n)/2,    (14)

where d(n) is the length of the longest dominating trajectory.
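Under the reconstruction of (12)-(14) above, the bounds are straightforward to evaluate numerically. A small sketch (our own helper functions) that builds the per-step contact probabilities of (12) and evaluates the upper bound of (13); with a uniform b_0 and a dominating trajectory of length d, the per-step probabilities are 1/(n - t + 1) and the bound comes out close to n - d/2, matching (14).

```python
def myopic_probs(P, b0, horizon=50):
    """Per-step contact probabilities p_t of the myopic policy trajectory, eq. (12),
    with the trajectory length capped at `horizon`."""
    n = len(b0)
    b = list(b0)
    b1 = max(b0)                                  # b_(1), the switching threshold
    probs = []
    for _ in range(horizon):
        bp = [sum(b[i] * P[i][j] for i in range(n)) for j in range(n)]
        u = max(range(n), key=lambda s: bp[s])
        if probs and bp[u] < b1:                  # d: stop once p_{t+1} < b_(1)
            break
        probs.append(bp[u])
        miss = 1.0 - bp[u]
        b = [0.0 if s == u else bp[s] / miss for s in range(n)]
    return probs

def intercontact_upper_bound(probs):
    """Upper bound of eq. (13): (K + d(1-p))/p for one pass with contact probs p_1..p_d."""
    d = len(probs)
    survive, K = 1.0, 0.0
    for t, pt in enumerate(probs, start=1):
        K += t * pt * survive
        survive *= (1.0 - pt)
    p = 1.0 - survive
    return (K + d * (1.0 - p)) / p

# Uniform steady state over n cells with a dominating trajectory of length d:
n, d = 100, 10
probs = [1.0 / (n - t + 1) for t in range(1, d + 1)]
print(intercontact_upper_bound(probs))   # close to n - d/2 (exactly n - (d-1)/2 = 95.5 here)

# Myopic-policy bound for the two-cell example of Section III-A (p = 0.3, q = 0.4):
P2, b2 = [[0.7, 0.3], [0.4, 0.6]], [4/7, 3/7]
print(intercontact_upper_bound(myopic_probs(P2, b2)))
```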
Corollary 5.3 gives closed-form bounds on the control performance in terms of the domain size and the length of the dominating trajectory. Intuitively, the more stretched the domain is and the slower the node moves, the longer the dominating trajectory, and the better dynamic policies will perform. For example, for random walks (i.e., m = 1), d(n) ≈ n/2 if the domain is 1-D (trajectory 1, 3, ...) and d(n) ≈ √n if 2-D (see Fig. 2). In general, it can be shown that d(n) = Θ(n/m) in 1-D and d(n) = Ω(√n/m) in 2-D. This result is consistent with the intuition that it is easier to catch nodes that move slowly or in constrained domains.

Meanwhile, since a predetermined controller only uses the steady-state distribution, its performance is independent of the above parameters. As the domain size n increases (for fixed m), the absolute performance gain E[K_SW] - E[K_DY] will always increase, at least as Ω(√n). If, in addition, nodes are constrained to some routes (1-D mobility), then even the relative gain (E[K_SW] - E[K_DY])/E[K_SW] will be positive, i.e., there will be a positive-fraction improvement in the contact rate. This is significant given the fact that the contact rate can at most be doubled, as seen from the lower bound. Although not proved for 2-D mobility, simulation shows that the upper bound is loose in this case and dynamic control still provides significant improvement in the contact rate (see Section VI).

C. Impact of Node Locality

As nodes' mobility becomes asymmetric, their steady-state distribution b_0 will deviate from the uniform distribution. After a certain point, b_0 alone can already give a good prediction of the node location, and dynamic control becomes unnecessary. Indeed, we have confirmed the above intuition with the following result.

Corollary 5.4: If the node mobility is sufficiently biased such that b_(1) > b_(2)/(1 - b_(1)), then E[K_DY] = E[K_SW], i.e., the switching policy is optimal.

This result provides a sufficient condition for the switching policy to be optimal. When this condition applies, we know for sure that having the ferry hop among the most probable cells of different domains is the best policy, and no dynamic controller is needed.

D. Impact of Node Activeness

Intuitively, there are two aspects of a node's level of mobility: how far a node can move in one slot, and with what probability. The first aspect reflects node speed and has been discussed in Section V-B; the second reflects how active the nodes are and will be the focus of this subsection. To model node activeness, we start from a base transition matrix P and scale all the off-diagonal entries by a parameter β > 0, called the activeness parameter. We then modify the diagonal entries to make it a valid transition matrix again, i.e., the new matrix P_β is given by (P_β)_ij = βP_ij for i ≠ j and (P_β)_ii = 1 - β Σ_{j≠i} P_ij, where 0 < β < 1/[max_i Σ_{j≠i} P_ij]. Intuitively, β models how frequently nodes cross cell boundaries during one slot. It is orthogonal to the steady-state distribution, as stated below.

Lemma 5.5: Let P be a base transition matrix and P_β its scaled version according to an activeness parameter β. Then their associated steady-state distributions are equal for all β ∈ (0, 1/(1 - min_i P_ii)).

Since b_0 is independent of β, the previous results that only depend on b_0, e.g., the performance of the switching policy and the lower bound in (13), are also independent of β. Although the general upper bound in (13) still holds here (with P replaced by P_β), we can narrow the gap by explicitly including β, as follows.

Theorem 5.6: If we scale the transition matrix in Theorem 5.2 by an activeness parameter β, then besides the bounds in (13), E[K_DY] also satisfies E[K_DY] ≤ (K̂ + d̃(1 - p̂))/p̂, where

    p̂ = Σ_{t=1}^{d̃} b_(t) - βδ Σ_{t=1}^{d̃} b_(t) (d̃ - t)(d̃ - t + 1)/2,    (15)

    K̂ = Σ_{t=1}^{d̃} t ( b_(t) - βδ Σ_{i=1}^{t-1} b_(i) (t - i) ),    (16)

for δ = max_{i≠j} P_ij and d̃ defined as in Theorem 5.2. In particular, as β → 0, E[K_DY] converges to the lower bound at rate O(β).

Besides verifying the intuition that, as nodes become less active, it will be easier to infer node locations from past observations and the control performance will improve, the above result also guarantees that the convergence rate is (at least) linear in β, i.e., an x% decrease in node activeness will generate at least an x% decrease in the performance gap relative to a ferry in a stationary network.

VI. SIMULATIONS

We now test the proposed policies via simulations. We use a generalized random walk to model node mobility, where, under peak speed m as defined in Section V-B, a node moves from its current cell i to one of its (m - 1)-hop neighbors, cell j, with transition probability P_ij proportional to e^{-τ ||j - h||} for a tightness parameter τ and a home cell h (||j - h|| is the taxicab distance between cells j and h). The home cell h models a spot that the node drifts toward (for τ > 0) or away from (for τ < 0) with strength |τ|, which can be tuned to control its locality. Moreover, we control its activeness by setting Σ_{j≠i} P_ij = β ∈ (0, 1).
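A sketch of this simulation mobility model, assuming a 1-D domain for brevity: it builds the generalized-random-walk matrix from (τ, h, m), sets the off-diagonal mass of each row to the activeness parameter β as in Section V-D, and checks numerically that the steady-state distribution does not depend on β (Lemma 5.5). All function names are ours.

```python
import numpy as np

def random_walk_matrix(n, m, tau, h, beta):
    """Generalized random walk on a 1-D domain of n cells: from cell i the node
    jumps to a cell j within (m-1) hops with probability proportional to
    exp(-tau * |j - h|); the total off-diagonal mass per row is set to beta."""
    P = np.zeros((n, n))
    for i in range(n):
        nbrs = [j for j in range(max(0, i - (m - 1)), min(n, i + m)) if j != i]
        w = np.array([np.exp(-tau * abs(j - h)) for j in nbrs])
        P[i, nbrs] = beta * w / w.sum()      # activeness: off-diagonal mass = beta
        P[i, i] = 1.0 - beta                 # remaining mass stays put
    return P

def steady_state(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a distribution."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

n, m, tau, h = 20, 2, 0.5, 10
b_low  = steady_state(random_walk_matrix(n, m, tau, h, beta=0.1))
b_high = steady_state(random_walk_matrix(n, m, tau, h, beta=0.6))
print(np.max(np.abs(b_low - b_high)))        # ~0: activeness does not change b_0 (Lemma 5.5)
```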
Policies: We first compare various dynamic control policies as well as the predetermined switching policy; see Figs. 3 and 4. Besides the myopic policy ("MY"), the switching policy ("SW"), and the optimal policy ("VI"; see Section IV-B), we also implement several state-of-the-art heuristic POMDP policies: Point-Based Value Iteration [11] ("PBVI"), Interpolation-Extrapolation [10] ("IE"), and Fast-Informed Bound [10] ("FIB"). (FIB computes an upper bound on the value function by storing only one value vector that dominates the optimal set of value vectors in every element.) Although suboptimal, the myopic policy closely approximates the optimal and the other sophisticated policies at a much lower complexity. To see the reason, we plot both the discounted reward and the actual performance under different γ in Fig. 5. While the discounted reward is sensitive to γ, the actual performance is not, which explains the good performance of the myopic policy, as γ controls the level of greediness. Similar observations hold in 2-D cases.

Mobility parameters: We then verify the impact of various node mobility parameters. To test the impact of domain size and layout, we plot the mean inter-contact times of the switching and the myopic policies under both 1-D and 2-D mobilities, together with the bounds in Corollary 5.3; see Fig. 6. The switching policy and the lower bound are invariant to the domain layout as long as b_0 is fixed; the myopic policy, however, is sensitive. As predicted by the analysis, it performs better in constrained domains (1-D), improving on the switching policy by an amount that grows linearly with n. As the domain becomes unconstrained (2-D), the performance degrades, but is clearly better than the upper bound, implying that the upper bound is loose in 2-D and dynamic control can still provide a significant performance gain. As the node (peak) speed m increases (not shown), the myopic policy becomes less sensitive to the domain layout, with inter-contact times for 1-D and 2-D both converging to the upper bound with d(n) ≈ n/m, i.e., E[K_MY] ≈ (2m - 1)n/(2m).

Next, we test the impact of locality and activeness by varying the tightness parameter τ and the activeness parameter β; see Figs. 7 and 8. Fig. 7 shows that the ferry's performance is highly sensitive to variations in node locality. Dynamic control ("MY") outperforms predetermined control ("SW") when the node's steady-state distribution is close to uniform (τ ≈ 0), but if τ exceeds a certain point, the distribution becomes so localized that the two policies become identical. Interestingly, instead of smooth transitions, we see sharp jumps at both ends of the optimality region of the switching policy.

Fig. 3. Policy comparisons: performance.
Fig. 4. Policy comparisons: complexity (offline computation time and online response time).
Fig. 5. Impact of the discount factor γ (same settings as in Fig. 3).
Fig. 6. Impact of domain size and layout.
Fig. 7. Impact of locality, controlled by the tightness τ.
Fig. 8. Impact of activeness.

This region contains the optimality region derived from Corollary 5.4 (outside the vertical lines), verifying the sufficient optimality condition. Fig. 8 confirms that the switching policy and the lower bound are independent of the level of activeness, while the myopic policy quickly converges to the lower bound as β decreases, as predicted by Theorem 5.6.

VII. CONCLUSIONS

We studied mobility control of autonomous data ferries in mobile networks. In contrast to existing solutions, the proposed solution is fully dynamic. Our evaluations show promising performance improvement, measured by increases in the contact rate, especially when the performance suffers most. Although the presented results are based on the specific scenario of one gateway per domain and separate global/local control, our approach is also applicable to the more general scenario of multiple gateways per domain and joint control, where we have obtained initial results [1]. Our work reveals more possibilities. While the current solution is designed for statistically stationary mobility, can similar techniques be used to track node speed, in cases where nodes drift in fixed directions but with uncertain speed, so that the ferry can catch the nodes accurately? Moreover, we have focused on contact optimization, but are the solutions amenable to optimizing end-to-end performance metrics such as delay, and how does this change the comparison between policies? Our study provides a sound foundation for tackling these problems.

REFERENCES

[1] T. He, K.-W. Lee, and A. Swami, "Flying in the Dark: Controlling Autonomous Data Ferries with Partial Observations," in ACM MobiHoc, 2010.
[2] T. Brown, B. Argrow, C. Dixon, S. Doshi, R.-G. Thekkekunnel, and D. Henkel, "Ad hoc UAV ground network (AUGNet)," in AIAA Unmanned Unlimited Conference, 2004.
[3] W. Zhao, M. Ammar, and E. Zegura, "Controlling the mobility of multiple data transport ferries in a delay-tolerant network," in IEEE INFOCOM, 2005.
[4] D. Henkel and T. Brown, "On controlled node mobility in delay-tolerant networks of unmanned aerial vehicles," in International Symposium on Advanced Radio Technologies, 2006.
[5] Data Mule: Mobility for Enhanced Performance in Sensor Networks.
[6] W. Zhao, M. Ammar, and E. Zegura, "A message ferrying approach for data delivery in sparse mobile ad hoc networks," in ACM MobiHoc, 2004.
[7] J. Wu, S. Yang, and F. Dai, "Logarithmic store-carry-forward routing in mobile ad hoc networks," IEEE Trans. Parallel and Distributed Systems.
[8] Q. Li and D. Rus, "Sending messages to mobile users in disconnected ad-hoc wireless networks," in ACM MobiCom, 2000.
[9] E. Sondik, "The optimal control of partially observable Markov decision processes," Ph.D. dissertation, Stanford University, 1971.
[10] M. Hauskrecht, "Value-function approximations for partially observable Markov decision processes," Journal of AI Research.
[11] J. Pineau, G. Gordon, and S. Thrun, "Anytime point-based approximations for large POMDPs," Journal of AI Research.
[12] I. MacPhee and B. Jordan, "Optimal search for a moving target," Probability in the Engineering and Informational Sciences, 1995.
[13] E. Sondik, "The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs," Operations Research.
[14] T. He, K.-W. Lee, and A. Swami, "Controlling Autonomous Data Ferries: Proof of Selected Theorems," IBM Tech. Rep. RC498, April.


More information

Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem

Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem Parisa Mansourifard 1/37 Bayesian Congestion Control over a Markovian Network Bandwidth Process:

More information

Multi-channel Opportunistic Access: A Case of Restless Bandits with Multiple Plays

Multi-channel Opportunistic Access: A Case of Restless Bandits with Multiple Plays Multi-channel Opportunistic Access: A Case of Restless Bandits with Multiple Plays Sahand Haji Ali Ahmad, Mingyan Liu Abstract This paper considers the following stochastic control problem that arises

More information

Continuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks

Continuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks Continuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks Husheng Li 1 and Huaiyu Dai 2 1 Department of Electrical Engineering and Computer

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

Data Gathering and Personalized Broadcasting in Radio Grids with Interferences

Data Gathering and Personalized Broadcasting in Radio Grids with Interferences Data Gathering and Personalized Broadcasting in Radio Grids with Interferences Jean-Claude Bermond a,b,, Bi Li b,a,c, Nicolas Nisse b,a, Hervé Rivano d, Min-Li Yu e a Univ. Nice Sophia Antipolis, CNRS,

More information

Worst case analysis for a general class of on-line lot-sizing heuristics

Worst case analysis for a general class of on-line lot-sizing heuristics Worst case analysis for a general class of on-line lot-sizing heuristics Wilco van den Heuvel a, Albert P.M. Wagelmans a a Econometric Institute and Erasmus Research Institute of Management, Erasmus University

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

Planning by Probabilistic Inference

Planning by Probabilistic Inference Planning by Probabilistic Inference Hagai Attias Microsoft Research 1 Microsoft Way Redmond, WA 98052 Abstract This paper presents and demonstrates a new approach to the problem of planning under uncertainty.

More information

The Multi-Agent Rendezvous Problem - The Asynchronous Case

The Multi-Agent Rendezvous Problem - The Asynchronous Case 43rd IEEE Conference on Decision and Control December 14-17, 2004 Atlantis, Paradise Island, Bahamas WeB03.3 The Multi-Agent Rendezvous Problem - The Asynchronous Case J. Lin and A.S. Morse Yale University

More information

Sensor Tasking and Control

Sensor Tasking and Control Sensor Tasking and Control Sensing Networking Leonidas Guibas Stanford University Computation CS428 Sensor systems are about sensing, after all... System State Continuous and Discrete Variables The quantities

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Online Learning to Optimize Transmission over an Unknown Gilbert-Elliott Channel

Online Learning to Optimize Transmission over an Unknown Gilbert-Elliott Channel Online Learning to Optimize Transmission over an Unknown Gilbert-Elliott Channel Yanting Wu Dept. of Electrical Engineering University of Southern California Email: yantingw@usc.edu Bhaskar Krishnamachari

More information

EE 550: Notes on Markov chains, Travel Times, and Opportunistic Routing

EE 550: Notes on Markov chains, Travel Times, and Opportunistic Routing EE 550: Notes on Markov chains, Travel Times, and Opportunistic Routing Michael J. Neely University of Southern California http://www-bcf.usc.edu/ mjneely 1 Abstract This collection of notes provides a

More information

Motivation for introducing probabilities

Motivation for introducing probabilities for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information

Session-Based Queueing Systems

Session-Based Queueing Systems Session-Based Queueing Systems Modelling, Simulation, and Approximation Jeroen Horters Supervisor VU: Sandjai Bhulai Executive Summary Companies often offer services that require multiple steps on the

More information

Consensus Algorithms are Input-to-State Stable

Consensus Algorithms are Input-to-State Stable 05 American Control Conference June 8-10, 05. Portland, OR, USA WeC16.3 Consensus Algorithms are Input-to-State Stable Derek B. Kingston Wei Ren Randal W. Beard Department of Electrical and Computer Engineering

More information

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets

On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets On Prediction and Planning in Partially Observable Markov Decision Processes with Large Observation Sets Pablo Samuel Castro pcastr@cs.mcgill.ca McGill University Joint work with: Doina Precup and Prakash

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes

Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Region-Based Dynamic Programming for Partially Observable Markov Decision Processes Zhengzhu Feng Department of Computer Science University of Massachusetts Amherst, MA 01003 fengzz@cs.umass.edu Abstract

More information

arxiv:cs/ v1 [cs.ni] 27 Feb 2007

arxiv:cs/ v1 [cs.ni] 27 Feb 2007 Joint Design and Separation Principle for Opportunistic Spectrum Access in the Presence of Sensing Errors Yunxia Chen, Qing Zhao, and Ananthram Swami Abstract arxiv:cs/0702158v1 [cs.ni] 27 Feb 2007 We

More information

AN ABSTRACT OF THE THESIS OF

AN ABSTRACT OF THE THESIS OF AN ABSTRACT OF THE THESIS OF Thai Duong for the degree of Master of Science in Computer Science presented on May 20, 2015. Title: Data Collection in Sensor Networks via the Novel Fast Markov Decision Process

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract)

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Arthur Choi and Adnan Darwiche Computer Science Department University of California, Los Angeles

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Optimality of Myopic Sensing in Multi-Channel Opportunistic Access

Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Tara Javidi, Bhasar Krishnamachari,QingZhao, Mingyan Liu tara@ece.ucsd.edu, brishna@usc.edu, qzhao@ece.ucdavis.edu, mingyan@eecs.umich.edu

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS

Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS Partially Observable Markov Decision Processes (POMDPs) Pieter Abbeel UC Berkeley EECS Many slides adapted from Jur van den Berg Outline POMDPs Separation Principle / Certainty Equivalence Locally Optimal

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Distributed Estimation and Detection for Smart Grid

Distributed Estimation and Detection for Smart Grid Distributed Estimation and Detection for Smart Grid Texas A&M University Joint Wor with: S. Kar (CMU), R. Tandon (Princeton), H. V. Poor (Princeton), and J. M. F. Moura (CMU) 1 Distributed Estimation/Detection

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Physics-Aware Informative Coverage Planning for Autonomous Vehicles

Physics-Aware Informative Coverage Planning for Autonomous Vehicles Physics-Aware Informative Coverage Planning for Autonomous Vehicles Michael J. Kuhlman 1, Student Member, IEEE, Petr Švec2, Member, IEEE, Krishnanand N. Kaipa 2, Member, IEEE, Donald Sofge 3, Member, IEEE,

More information

10 Robotic Exploration and Information Gathering

10 Robotic Exploration and Information Gathering NAVARCH/EECS 568, ROB 530 - Winter 2018 10 Robotic Exploration and Information Gathering Maani Ghaffari April 2, 2018 Robotic Information Gathering: Exploration and Monitoring In information gathering

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

EE C128 / ME C134 Feedback Control Systems

EE C128 / ME C134 Feedback Control Systems EE C128 / ME C134 Feedback Control Systems Lecture Additional Material Introduction to Model Predictive Control Maximilian Balandat Department of Electrical Engineering & Computer Science University of

More information

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

1 Modelling and Simulation

1 Modelling and Simulation 1 Modelling and Simulation 1.1 Introduction This course teaches various aspects of computer-aided modelling for the performance evaluation of computer systems and communication networks. The performance

More information

Factored State Spaces 3/2/178

Factored State Spaces 3/2/178 Factored State Spaces 3/2/178 Converting POMDPs to MDPs In a POMDP: Action + observation updates beliefs Value is a function of beliefs. Instead we can view this as an MDP where: There is a state for every

More information

Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast. Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas

Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast. Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas Network Algorithms and Complexity (NTUA-MPLA) Reliable Broadcast Aris Pagourtzis, Giorgos Panagiotakos, Dimitris Sakavalas Slides are partially based on the joint work of Christos Litsas, Aris Pagourtzis,

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Near-optimal policies for broadcasting files with unequal sizes

Near-optimal policies for broadcasting files with unequal sizes Near-optimal policies for broadcasting files with unequal sizes Majid Raissi-Dehkordi and John S. Baras Institute for Systems Research University of Maryland College Park, MD 20742 majid@isr.umd.edu, baras@isr.umd.edu

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Policy Gradient Reinforcement Learning for Robotics

Policy Gradient Reinforcement Learning for Robotics Policy Gradient Reinforcement Learning for Robotics Michael C. Koval mkoval@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu May 9, 211 1 Introduction Learning in an environment with a continuous

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Persuading a Pessimist

Persuading a Pessimist Persuading a Pessimist Afshin Nikzad PRELIMINARY DRAFT Abstract While in practice most persuasion mechanisms have a simple structure, optimal signals in the Bayesian persuasion framework may not be so.

More information

Efficient Sensor Network Planning Method. Using Approximate Potential Game

Efficient Sensor Network Planning Method. Using Approximate Potential Game Efficient Sensor Network Planning Method 1 Using Approximate Potential Game Su-Jin Lee, Young-Jin Park, and Han-Lim Choi, Member, IEEE arxiv:1707.00796v1 [cs.gt] 4 Jul 2017 Abstract This paper addresses

More information

Introduction. next up previous Next: Problem Description Up: Bayesian Networks Previous: Bayesian Networks

Introduction. next up previous Next: Problem Description Up: Bayesian Networks Previous: Bayesian Networks Next: Problem Description Up: Bayesian Networks Previous: Bayesian Networks Introduction Autonomous agents often find themselves facing a great deal of uncertainty. It decides how to act based on its perceived

More information

An Introduction to Markov Decision Processes. MDP Tutorial - 1

An Introduction to Markov Decision Processes. MDP Tutorial - 1 An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal

More information