Routing a fleet of vehicles for decentralized reconnaissance with shared workload among regions with uncertain information


Yan Xia (1), Rajan Batta (2)
Department of Industrial and Systems Engineering, The State University of New York at Buffalo, Buffalo, NY, USA

Rakesh Nagi (3)
Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801

Abstract

This paper studies the problem of controlling a fleet of vehicles to search and collect information reward within a specified mission time from a set of regions containing uncertain information. We seek a decentralized time-allocation policy using pre-calculated routes to maximize the total reward. We demonstrate that sharing regions among vehicles is beneficial. However, shared regions make the decentralized time-allocation problem computationally intractable. To overcome this, we develop an approximate formulation using an independency assumption. This approximate model allows us to decompose, by vehicle, the time-allocation problem, and obtain an easily implementable policy that takes on a Markovian form. We derive a tight upper bound for the decentralized time-allocation policy using the obtained Markovian policy. We also develop a sufficient condition under which the approximate formulation becomes exact. A numerical study establishes the computational efficiency of the method (only a few CPU seconds are needed for problems with a planning horizon of 300 time units and 40 regions) and demonstrates the benefit of using a region-sharing strategy. The numerical study also examines the fleet's workload sharing behavior with respect to the cooperation factor (which measures the fused information reward gained from sharing), the mission duration and the search sequence.

Keywords: Search Theory; Decentralized Control; Resource Allocation; Markovian Policy; Multiagent System

(1) E-mail: yanxia@buffalo.edu
(2) E-mail: batta@buffalo.edu
(3) E-mail: nagi@illinois.edu

1. Introduction, Motivation and Contribution

Information collection is a prerequisite in various real-world operations. In military operations, a mission commander's decisions are typically made based on information about the enemy positions and assets, which usually refers to the existence of military targets such as arsenals, radar stations, and airports. Collection of the information can, for example, correspond to a fixed period of video surveillance. In disaster relief operations, the relief allocation is usually decided based on information about the damage to the area, where information can be the existence of casualties in a region and collection of the information can be identifying the size of the casualty group (Gong and Batta, 2007). In rescue management, prior information is collected to facilitate rescue planning. For instance, in the search for the missing Malaysian plane MH370 (BBC, 2014), the search team attempts to localize as many pieces of suspicious debris as possible over the sea. The debris is then checked by rescue ships in the hope of finding the plane. In forest fire fighting, besides knowing whether a wildfire exists, information about the spreading direction and the level of the fire is also important for fighting wildfires (Merino et al., 2006). In border control (Pietz and Royset, 2013), it is important to identify smugglers (information) and track them (collection) so that smuggling can be stopped by the coast guard. In space exploration, one important task is to collect environmental information from different sites of a planet (Becker et al., 2004).

Information collection is traditionally performed by manned aircraft such as helicopters or by ground vehicles, which can be very expensive. The recent development of automatic agents such as unmanned aerial vehicles (UAVs) provides a more economical solution for collecting information from a large area with disjoint regions of interest (Romesh, 2013; Wayne, 2014).
This paper considers a reconnaissance problem of controlling a fleet of vehicles (agents) to collect information from a set of regions. In each region, information may or may not exist, and if the information exists, it takes a vehicle a random amount of time to detect it. After detecting it, the vehicle can decide whether to collect the information, which takes a given amount of time and provides a reward (a measurement of the information's value) to the fleet. The goal is to maximize the total reward collected by the fleet within a given mission time.

Two commonly used control policies are centralized and decentralized. Our focus is on a decentralized policy. Clearly, a centralized policy, by definition, provides better performance than a decentralized policy; however, it requires that a central agent, through online communication, collect real-time observations from individual members, process the observations, and return the control decisions to the individual members. These requirements create several drawbacks. First, the centralized method lacks robustness to communication loss (Seiler and Sengupta, 2001), under which some individual members cannot contact the central agent. Second, transfer of each individual member's real-time observations to the central agent may be restricted by limited bandwidth. In either situation, the central agent cannot collect all the observations necessary to make a decision for each member, and centralized control fails. Third, deciding a centralized control policy has high computational complexity and is difficult to implement efficiently (Shima and Rasmussen, 2009). Finally, security is a concern since the communication network can be hacked and sensitive messages may be revealed to the enemy, which creates a risk to the mission (Howard, 2013). Even when centralized policies can be implemented, strategies need to be developed for operations over a period of time when communication is lost between vehicles and the central agent. Decentralized strategies provide a way to operate in such a situation, with the understanding that once communications are restored a switch is made back to a centralized policy.

The decentralized control problem studied in this paper can be viewed as a finite-horizon decentralized partially observable Markov decision process (DEC-POMDP), for which Seuken and Zilberstein (2008) provide an excellent review. Optimizing such a process is proved to be NEXP-hard (Bernstein et al., 2002). Several exact and approximate methods have been designed to solve a DEC-POMDP problem. Szer et al. (2005) develop a search heuristic based on the widely applied A* algorithm (Hart et al., 1968), and in Szer and Charpillet (2006) they propose a dynamic-programming-based approach, which can be executed exactly or approximately. Oliehoek et al. (2008b) apply approximate Q-learning, a classical approximate dynamic programming method (Powell, 2007), to a general DEC-POMDP problem. As a meta-heuristic for the problem, Oliehoek et al. (2008a) implement the cross-entropy method (Kroese, 2010) and compare the method's performance with another heuristic called joint equilibrium search for policies (JESP), which is designed for solving general DEC-POMDP problems (Nair et al., 2003). JESP starts from any feasible policy and iteratively improves the policy to an equilibrium where no agent can improve its policy by itself. Aras and Dutech (2010) study the problem using a mixed integer linear programming (MILP) approach, which can exploit the high-efficiency computation provided by commercial MILP solvers such as CPLEX (IBM, 2014). These reviewed methods provide policies of good quality, but their applicability is limited to small problems with planning horizons shorter than 10, whereas we attempt to solve realistic-size problems far beyond these methods' computational capability.
General DEC-POMDPs require policies with history-dependent actions. An implementation in our context would require exponential space. Therefore, we seek an alternative method and use a policy whose action depends only on the vehicle's remaining mission time. As illustrated later, such a policy takes only bilinear space and generates a sequence of look-up tables, one for each region-vehicle pair. In each look-up table, a decision is stored for each decision epoch (introduced in the next section) and the remaining time. The policy is straightforward to implement: whenever a vehicle reaches a decision epoch, it simply executes the action listed in the look-up table.

We develop a two-stage solution procedure to obtain such a policy. We plan routes in the first stage to determine the regions for each vehicle and the sequence in which to visit them. Multiple route families are generated in this stage, each of which is composed of one route for each fleet member. In the second stage, we evaluate each route family using a decentralized time-allocation policy, which provides the entries of each look-up table. To obtain the time-allocation policy for a route family, we need to solve a DEC-POMDP problem. The final policy is established by finding the time-allocation policy that provides the maximal expected reward to the fleet.

The first stage of the solution procedure relates to a stream of literature that uses deterministic models to analyze the problem of controlling a fleet of UAVs in reconnaissance missions, e.g., Chao et al. (1996), Schumacher et al. (2006), Rathinam et al. (2007), Kress and Royset (2008), Murray and Karwan (2010), Mufalli et al. (2012), and Pietz and Royset (2013). The models established in these papers consider different constraints and objectives, but their final solutions all assign a route to each vehicle, which is composed of a sequence of way-points. Depending on the objectives, some of the models also decide the amount of time each vehicle spends at each way-point to collect reward, e.g., Kress and Royset (2008), Mufalli et al. (2012), and Pietz and Royset (2013). The main difference between these models and our work is the following: the objective value (e.g., the expected total reward collected by the fleet) is automatically calculated in a deterministic model once the routes are decided; however, in our problem we need to further solve a decentralized time-allocation problem to know how much reward the fleet can collect from a route family.

As opposed to deterministic models, our approach creates a demonstrated need for sharing regions. Our numerical studies illustrate that the extent to which sharing occurs, and which regions get selected for sharing, is a function of the cooperation factor, the mission duration, and the search sequence for each vehicle. We also make methodological contributions to solving the DEC-POMDP problem required by the second stage of the solution procedure, which are: (i) design of an approximate formulation to obtain an efficient decentralized time-allocation policy for each route family under an independency assumption; (ii) derivation of a tight upper bound for the decentralized time-allocation policy; (iii) development of a sufficient condition for the approximate formulation to be exact.

The rest of this paper is organized as follows: §2 presents the model description. §3 explains how to create route families. §4 formulates and solves the decentralized time-allocation problem for a given route family. For an arbitrary route family, a closed-form upper bound for the decentralized time-allocation policy is developed in §5. The numerical study is presented in §6. Finally, we make our concluding discussion and propose future research directions in §7.
To enhance readability and conciseness of the paper, much of the material is presented in the appendix: §A provides integer programming formulations for the routing models introduced in §3 of the main article; §B provides an algorithm that extracts two types of quantities from a time-allocation policy, which are used in §4 of the main article; §C contains the proofs of some theorems and propositions that are omitted in the main article; §D shows the tightness of the bound developed in §5.

2. Model Description

Consider a reconnaissance problem in which a fleet of vehicles $U = \{1, 2, \ldots, n\}$ is assigned to search and collect information from a set of regions $A = \{1, 2, \ldots, L\}$. Each vehicle $i \in U$ is assigned a start depot and an end depot, and needs to visit a subset of regions in $A$ and arrive at its end depot within a given mission time $T$. We use the remaining time ($t$) of a vehicle to represent the time that a vehicle has left to reach its end depot. Time is modeled as a discrete entity.

At most one piece of information exists in region $\alpha \in A$, with a priori probability $e_\alpha$. Consider a vehicle that arrives at region $\alpha$ and reserves $\bar{x}$ units of time to search for information. The actual time it consumes to detect the information is a discrete random variable whose value is $x$ with probability $e_\alpha p_\alpha(x)$, $x = 1, 2, \ldots, \bar{x}$, where $p_\alpha(x)$ is the conditional probability of finding the information in the $x$th unit of time given that the information exists. The vehicle may also fail to detect any information after spending $\bar{x}$ units of time, with probability $1 - e_\alpha P_\alpha(\bar{x})$, where we call $P_\alpha(\bar{x}) = \sum_{x=1}^{\bar{x}} p_\alpha(x)$ the detection function of a region. An example of how to obtain $P_\alpha(x)$ is provided in our numerical study (§6). If the information is found, the vehicle can either collect it (which takes $s_\alpha \ge 0$ units of time) or leave the region without collecting the information (which is instantaneous).

For region $\alpha$, we define a random variable $\theta_\alpha$ that represents whether information exists ($\theta_\alpha = 1$) or not ($\theta_\alpha = 0$). We also define a random variable $w^i_\alpha$ that indicates the time that vehicle $i$ needs to detect the information in region $\alpha$. If vehicle $i$ schedules $\bar{x}$ units of time to search for information in region $\alpha$, the actual time it spends on searching is $\min\{w^i_\alpha, \bar{x}\}$. If $\bar{x} \ge w^i_\alpha$, the vehicle will detect the target; otherwise, it will not. Let $f_\chi(x)$ be the probability that an arbitrary random variable $\chi$ takes the value $x$, and $f_\chi(x \mid \cdot)$ be the corresponding conditional probability given any condition specified in "$\cdot$". When $\theta_\alpha = 1$ (information exists), we have $f_{w^i_\alpha}(w \mid \theta_\alpha = 1) = p_\alpha(w)$, for $w = 0, 1, 2, \ldots$ When $\theta_\alpha = 0$ (information does not exist), we have

$$f_{w^i_\alpha}(w \mid \theta_\alpha = 0) = \begin{cases} 1 & \text{if } w = T + 1, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Under (1), the vehicle can never detect the information when it does not exist, and it consumes all its reserved search time, since the vehicle can never schedule more than $T$ units of time to search in a region. To simplify the notation, throughout the remainder of this paper we use $w^i_\alpha$ and $\theta_\alpha$ to represent both the random variable and its realization.

If the vehicle does not detect any information after spending $\bar{x}$ units of time in searching, the posterior probability that $\theta_\alpha = 1$ can be obtained using Bayes' rule. We define an observation variable $o^i_\alpha$, given that $\bar{x}$ units of time are reserved for searching for information in region $\alpha$, which satisfies

$$o^i_\alpha = \begin{cases} \bar{x} & \text{if } \bar{x} < w^i_\alpha, \\ -1 & \text{if } \bar{x} \ge w^i_\alpha. \end{cases} \tag{2}$$

In (2), $o^i_\alpha \ge 0$ means that the information is not found and $o^i_\alpha$ units of time have been spent in searching, and $o^i_\alpha = -1$ means that the information is found. The conditional probability that $\theta_\alpha = 1$ given $o^i_\alpha$ is

$$f_{\theta_\alpha}(1 \mid o^i_\alpha) = \begin{cases} \dfrac{e_\alpha (1 - P_\alpha(o^i_\alpha))}{e_\alpha (1 - P_\alpha(o^i_\alpha)) + 1 - e_\alpha} & \text{if } o^i_\alpha \ge 0, \\ 1 & \text{if } o^i_\alpha = -1. \end{cases} \tag{3}$$

We make the following assumption for assigning regions to vehicles.

Assumption 1 A region can be assigned to at most two vehicles.

If a region $\alpha$ is searched by a single vehicle, a fixed reward $g_\alpha$ is collected. If two vehicles successfully collect the information, a reward of $(1 + \gamma_\alpha) g_\alpha$ is obtained by the fleet, where $\gamma_\alpha \ge 0$ is the cooperation factor of the vehicles in region $\alpha$. Information fusion (a technique that merges the information from heterogeneous sources) can be used to obtain an appropriate value of $\gamma_\alpha$ (Nakamura et al., 2007). The cooperation factor provides a quantitative measure of the extra reward gained by fusing the information.

We use $P(\cdot)$ to represent the probability that event "$\cdot$" will happen. For the search process of each vehicle in each of its assigned regions, we make the following assumption.

Assumption 2 For any subset $U' \subseteq U$ and $A_i \subseteq A$, $i \in U'$, we have

$$P\Big(\{w^i_\alpha\}_{\alpha \in A_i,\, i \in U'},\ \{\theta_\alpha\}_{\alpha \in \cup_{i \in U'} A_i}\Big) = \prod_{\alpha \in \cup_{i \in U'} A_i} f_{\theta_\alpha}(\theta_\alpha) \prod_{i \in U',\, \alpha \in A_i} f_{w^i_\alpha}(w^i_\alpha \mid \theta_\alpha). \tag{4}$$

Equation (4) implies three types of independency. Existence of information in a region is independent of existence of information in other regions. A similar independency applies to the time needed to detect information in a region. Lastly, for any shared region $\alpha \in A_i \cap A_j$, the detection time is independent for both assigned vehicles given $\theta_\alpha$.

Our decentralized policy is constructed under the following considerations:

1. Each region is visited by a vehicle at most once.
2. Each vehicle follows a specified route, composed of a sequence of regions from lower order to higher order. During the mission, the vehicle is only allowed to travel from a lower-order region to a higher-order region. (Note that the order of a region can be different for different vehicles.)

With these considerations, we represent vehicle $i$'s route as $H_i = \{h^i_0, h^i_1, h^i_2, \ldots, h^i_{L_i}, h^i_{L_i+1}\}$, $i \in U$. Here $h^i_l$, $1 \le l \le L_i$, indicates the index of a region and $l$ is the order index of the region, where a larger $l$ means a higher order; $h^i_0$ and $h^i_{L_i+1}$ are the start and end depots, respectively. The routes in a route family are created such that each region is visited by either one or two vehicles. This implies that for each vehicle $i \in U$, we can divide the regions that it visits into two sets, $S_i$ (shared) and $O_i$ (not shared).
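The search model of this section, namely the detection function $P_\alpha$ and the posterior in (3), can be sketched in a few lines. This is a minimal illustration only: the function names, the list encoding `p[k] = p_alpha(k + 1)`, and the numbers in the usage note are assumptions made for the example, not part of the paper.

```python
def detection_function(p, x_bar):
    """P_alpha(x_bar) = sum_{x=1}^{x_bar} p_alpha(x).

    Assumed encoding: p[k] holds p_alpha(k + 1), the probability of finding the
    information in unit k + 1 of the search, given that it exists.
    """
    return sum(p[:x_bar])


def posterior_exists(e, p, obs):
    """Posterior probability that information exists, following (3).

    obs == -1 means the information was found; obs >= 0 means it was not found
    after obs units of search. Assumes the observation has positive probability
    (for instance e < 1), so the denominator is nonzero.
    """
    if obs == -1:
        return 1.0
    miss = e * (1.0 - detection_function(p, obs))  # exists but not yet detected
    return miss / (miss + 1.0 - e)
```

With a hypothetical prior `e = 0.5` and `p = [0.5, 0.25, 0.125]`, one unsuccessful unit of search lowers the belief that information exists from 1/2 to 1/3, while zero units of search leave the prior unchanged.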
Given the route, a vehicle has at most three decision epochs in each region: when the vehicle arrives, when the information is detected, and when the vehicle decides to leave. Consider a region $\alpha$ that is assigned to vehicle $i$: $x^i_\alpha(\cdot)$ is an integer that specifies the amount of time to reserve for searching; $y^i_\alpha(\cdot)$ is a binary variable that specifies whether the information should be collected ($y^i_\alpha(\cdot) = 1$) or not ($y^i_\alpha(\cdot) = 0$); $z^i_\alpha(\cdot)$ is an integer that indicates the order index of the next region to visit. At the start depot we only have $z^i_\alpha(T)$, which determines the first region to visit if the fleet is given $T$ units of mission time.

3. Routing with Shared Regions

The first stage of the solution procedure is to create route families. An initial route family can be obtained by using a suitable deterministic model. We provide two examples in §A. For each initial route family, we use a Minimal Insertion Rule to select and assign shared regions according to a threshold $\delta$. Let $H_i = \{h^i_0, h^i_1, h^i_2, \ldots, h^i_{L_i}, h^i_{L_i+1}\}$ be vehicle $i$'s route assigned by the initial route family, and let $d(\alpha_1, \alpha_2)$ represent the travel time between region $\alpha_1$ and region $\alpha_2$. The travel cost to insert a region $\alpha$ into route $H_i$ is

$$d_M(\alpha, H_i) = \min_{l = 0, 1, \ldots, L_i} \{ d(h^i_l, \alpha) + d(\alpha, h^i_{l+1}) - d(h^i_l, h^i_{l+1}) \}.$$

Let $\tilde{O}_i$ be the set of non-shared regions in vehicle $i$'s initial route. We define the selection rule:

Definition 1 (Minimal Insertion Rule) For $i \in U$, assign region $\alpha \in \tilde{O}_i$ as a shared region to vehicle $j \ne i$ if $d_M(\alpha, H_j) < \delta$ and $d_M(\alpha, H_j) = \min_{j' \ne i} \{ d_M(\alpha, H_{j'}) \}$, where ties are broken arbitrarily.

By varying the threshold $\delta$ of the Minimal Insertion Rule, different sets of shared regions can be created. To enumerate all possible combinations, we can increase $\delta$ one unit at a time from zero to an upper bound under which all regions are shared.

Observation 1 If $d_{\max} = \max_{\alpha_i, \alpha_j \in A} d(\alpha_i, \alpha_j)$, then $d_M(\alpha, H_i) \le 2 d_{\max}$ for $i \in U$, $\alpha \in A$.

Since for any regions $\alpha_1, \alpha_2, \alpha_3$ we have $d(\alpha_1, \alpha_2) + d(\alpha_2, \alpha_3) - d(\alpha_1, \alpha_3) \le 2 d_{\max} - 0$, Observation 1 holds. According to Observation 1 and the strict inequality in Definition 1, any region in the area will be shared if $\delta > 2 d_{\max}$.

For each set of shared regions that a vehicle receives, we re-optimize the vehicle's route. To do so, we consider two methods: unconstrained and constrained re-optimization. The unconstrained re-optimization method seeks a minimal-travel-time route for each vehicle to visit all its assigned regions. The constrained re-optimization pursues a minimal-travel-time route but retains the order of the initial regions, i.e., if a region has a lower order than another region in the initial route, it must retain a lower order than that region in the re-optimized route. We will explain the reason for performing constrained re-optimization in §4.3. The integer programming formulations of these two re-optimization methods are presented in §A.

4. Finding a decentralized time-allocation policy

To streamline presentation of the material, we divide this section into four parts.
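As an illustration of §3, the insertion cost $d_M$ and the Minimal Insertion Rule of Definition 1 can be sketched as follows. The function names, the dictionary-based route encoding, and the one-dimensional toy layout in the usage note are assumptions made for this example; the re-optimization step is not shown.

```python
def insertion_cost(alpha, route, d):
    """d_M(alpha, H): cheapest extra travel time from inserting alpha between
    two consecutive stops of route (a list that starts and ends with depots)."""
    return min(
        d(route[l], alpha) + d(alpha, route[l + 1]) - d(route[l], route[l + 1])
        for l in range(len(route) - 1)
    )


def shared_assignments(routes, non_shared, d, delta):
    """Minimal Insertion Rule (Definition 1).

    routes: {vehicle: route list}; non_shared: {vehicle i: regions on i's initial route}.
    Returns {(i, alpha): j}: region alpha of vehicle i is also assigned to vehicle j.
    Ties are broken by dictionary order here, which is one arbitrary choice.
    """
    shared = {}
    for i, regions in non_shared.items():
        for alpha in regions:
            costs = {j: insertion_cost(alpha, routes[j], d) for j in routes if j != i}
            if not costs:
                continue  # a single-vehicle fleet has nobody to share with
            j = min(costs, key=costs.get)
            if costs[j] < delta:  # strict inequality, as in Definition 1
                shared[(i, alpha)] = j
    return shared
```

For a hypothetical layout on a line with positions `s1 = 0, a = 2, e1 = 4` for vehicle 1 and `s2 = 0, b = 3, e2 = 4` for vehicle 2, both insertion costs are zero, so both regions become shared for `delta = 1` and none is shared for `delta = 0`.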
§4.1 formulates the decentralized time-allocation problem. §4.2 develops a decomposable approximation for the formulation in §4.1. §4.3 contains an algorithm to find a locally optimal solution. Finally, §4.4 develops a sufficient condition for the approximation in §4.2 to be exact, and also explains calculation of the exact value of a Markovian policy. All of these elements are integral to our approach for finding a decentralized time-allocation policy.

4.1 The decentralized time-allocation problem

To simplify notation, we use a two-vehicle setting to establish the results in the following two sections. The model and its results can be naturally extended to the multi-vehicle scenario, for which we will provide an explanation in the conclusion. Note that when there are two vehicles, we have $U = \{1, 2\}$ and $S = S_1 = S_2$, and we use the notation interchangeably.

Assume that each vehicle $i$ has been assigned a route $H_i = \{h^i_0, h^i_1, h^i_2, \ldots, h^i_{L_i}, h^i_{L_i+1}\}$. The time-allocation problem for a given route family is to determine a decentralized time-allocation policy that provides the maximum expected reward. We represent vectors in bold and define $\mathbf{o}^i_{h^i_l} = (o^i_{h^i_1}, o^i_{h^i_2}, \ldots, o^i_{h^i_l})$ for $1 \le l \le L_i$. We also define $\mathbf{o}^i_{h^i_0}$ for consistency of the notation; it has no practical meaning. The form of the optimal time-allocation policy is given in the following proposition.

Proposition 1 (Oliehoek et al., 2008b) There exists an optimal decentralized time-allocation policy $\pi = \{\pi^i\}_{i \in U}$ such that $\pi^i$, $i \in U$, can be represented as

$$\pi^i = \Big\{ z^i_{h^i_0}(T);\ x^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}}),\ y^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}}),\ z^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}}) : t = 0, 1, 2, \ldots, T,\ l = 1, 2, \ldots, L_i \Big\}. \tag{5}$$

Proposition 1 is a direct result of Proposition 2.1 in Oliehoek et al. (2008b). The feasibility of a time-allocation policy is restricted by the vehicle's remaining time and the order of the regions specified by the given route. More explicitly, for a policy to be feasible, the vehicle should always have enough time to reach the end depot. In addition, when the vehicle decides to leave a region, it can only travel to a region that has a higher order in the given route.

Observation 2 A policy $\pi = \{\pi^i\}_{i \in U}$ in the form of (5) is feasible if and only if, for $i \in U$,

$$T - d\big(h^i_0, h^i_{z^i_{h^i_0}(T)}\big) \ge d\big(h^i_{z^i_{h^i_0}(T)}, h^i_{L_i+1}\big), \quad L_i + 1 \ge z^i_{h^i_0}(T) > 0, \tag{6a}$$

$$t - x^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}}) \ge d(h^i_l, h^i_{L_i+1}), \tag{6b}$$

$$t - y^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}})\, s_{h^i_l} \ge d(h^i_l, h^i_{L_i+1}), \tag{6c}$$

$$t - d\big(h^i_l, h^i_{z^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}})}\big) \ge d\big(h^i_{z^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}})}, h^i_{L_i+1}\big), \tag{6d}$$

$$z^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}}) > l, \tag{6e}$$

where (6b)-(6e) hold for $l = 1, \ldots, L_i$, $T \ge t \ge d(h^i_l, h^i_{L_i+1})$, and $\mathbf{o}^i_{h^i_{l-1}} \ge -1$.

We use $\hat{\Pi}^i$ to represent the set of all feasible policies of vehicle $i$ in the form of (5). Constraint (6a) states the initial condition that the vehicle can only start with a region if it still has enough time to reach the end depot after arriving at the region.
Constraints (6b) and (6c) state that the vehicle has to reserve enough time to travel to the end depot when it decides to search for or collect information in a region. Constraint (6d) ensures that if the vehicle decides to travel to region $h^i_{l'}$, where $l' = z^i_{h^i_l}(t, \mathbf{o}^i_{h^i_{l-1}})$, it must have enough time to reach the end depot after arriving at the region. Constraint (6e) requires that the vehicle can only travel to a region that has a higher order than its current region.

Throughout the paper, "a vehicle succeeds in a region" means that the vehicle detects information in the region and collects it. For a feasible policy $\pi^i$ of vehicle $i$, we use $\tau^i_\alpha(\pi^i)$ to represent the conditional probability that the vehicle succeeds in region $\alpha$ given that information exists in the region. The expected reward collected from a non-shared region $\beta \in O_i$ is $R_\beta(\pi^i) = e_\beta g_\beta \tau^i_\beta(\pi^i)$. For a decentralized policy $\pi$ we let $\tau^1_\alpha(\pi \mid 1)$ be the conditional probability that vehicle 1 succeeds in region $\alpha$ given that information exists and vehicle 2 also succeeds, and $\tau^1_\alpha(\pi \mid 0)$ be the conditional probability that vehicle 1 succeeds in region $\alpha$ given that information exists but vehicle 2 does not succeed. The expected reward that the fleet collects from a shared region $\alpha \in S$ under policy $\pi$ is

$$R_\alpha(\pi) = e_\alpha g_\alpha \big[ \tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi^2)) + (1 - \tau^1_\alpha(\pi \mid 1))\, \tau^2_\alpha(\pi^2) + (1 + \gamma_\alpha)\, \tau^1_\alpha(\pi \mid 1)\, \tau^2_\alpha(\pi^2) \big].$$

For each vehicle $i \in U$, we define $R_{\pi^i} = \sum_{\beta \in O_i} R_\beta(\pi^i)$ as the expected reward that vehicle $i$ collects from all its non-shared regions under policy $\pi^i$. The expected total reward collected by the fleet under policy $\pi$ is

$$R(\pi) = \sum_{i \in U} R_{\pi^i} + \sum_{\alpha \in S} R_\alpha(\pi). \tag{7}$$

Then we can formulate the time-allocation problem as

$$\max\ R(\pi) \quad \text{s.t.}\ \pi^1 \in \hat{\Pi}^1,\ \pi^2 \in \hat{\Pi}^2. \tag{8}$$

To solve the problem defined by (8), we need to optimize a DEC-POMDP, which is impractical considering the size of the problem that we attempt to solve. Furthermore, a policy in the form of (5) is not only difficult to compute but also expensive to store, since the number of possible realizations of $\mathbf{o}^i_{h^i_l}$ is exponential in $l$ and we need to determine an action for each combination of the realization of $\mathbf{o}^i_{h^i_l}$ and the corresponding remaining time $t$. Given these difficulties associated with a general policy in the form of (5), we focus on the implementation of Markovian policies.

Definition 2 A decentralized policy $\pi = \{\pi^i\}_{i \in U}$ is a Markovian policy if $\pi^i$, $i \in U$, can be represented as

$$\pi^i = \Big\{ z^i_{h^i_0}(T);\ x^i_{h^i_l}(t),\ y^i_{h^i_l}(t),\ z^i_{h^i_l}(t) : t = 0, 1, 2, \ldots, T,\ l = 1, 2, \ldots, L_i \Big\}.$$

Similar to (6), for a Markovian policy to be feasible, we have the following observation.

Observation 3 A Markovian policy $\pi = \{\pi^i\}_{i \in U}$ is feasible if and only if, for $i \in U$,

$$\begin{aligned}
& T - d\big(h^i_0, h^i_{z^i_{h^i_0}(T)}\big) \ge d\big(h^i_{z^i_{h^i_0}(T)}, h^i_{L_i+1}\big), \quad L_i + 1 \ge z^i_{h^i_0}(T) > 0, \\
& t - x^i_{h^i_l}(t) \ge d(h^i_l, h^i_{L_i+1}), \\
& t - y^i_{h^i_l}(t)\, s_{h^i_l} \ge d(h^i_l, h^i_{L_i+1}), \\
& t - d\big(h^i_l, h^i_{z^i_{h^i_l}(t)}\big) \ge d\big(h^i_{z^i_{h^i_l}(t)}, h^i_{L_i+1}\big), \\
& L_i + 1 \ge z^i_{h^i_l}(t) > l,
\end{aligned} \tag{9}$$

where the last four conditions hold for $l = 1, \ldots, L_i$ and $T \ge t \ge d(h^i_l, h^i_{L_i+1})$.
We use $\Pi^i$ to represent the set of all feasible Markovian policies of vehicle $i$. The size of a Markovian policy is bilinear in $L_i$ and $T$, but it is still difficult to find the optimal Markovian policy since it requires a forward induction over the entire Markovian policy space given by (9). Nevertheless, a sub-optimal Markovian policy can be obtained by replacing the objective function of (8) with an approximation and decomposing the optimization problem into sub-problems, each of which can be considered as a single-vehicle problem.
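To make the look-up-table form of a Markovian policy (Definition 2) concrete, the sketch below executes one vehicle's tables along its route. It is an illustration under assumptions: the name `run_policy`, the table layout `x[l][t]`, `y[l][t]`, `z[l][t]` indexed by order index and remaining time, and the convention that a realized detection time above the reserved search time means "not found" are choices made for this example.

```python
def run_policy(route, d, T, x, y, z, s, w):
    """Execute a Markovian look-up-table policy for one vehicle.

    route: [start depot, region_1, ..., region_L, end depot]; d[u][v]: travel time.
    x, y, z: tables indexed as table[l][t] (order index l, remaining time t).
    s[h]: collection time of region h; w[h]: realized detection time (>= 1),
    set to T + 1 when no information exists, as in (1).
    Returns (regions collected, remaining time at the end depot).
    """
    collected = []
    l, t = 0, T
    nxt = z[0][t]                      # first region to visit from the start depot
    while True:
        t -= d[route[l]][route[nxt]]   # travel to the chosen region
        l = nxt
        if l == len(route) - 1:        # reached the end depot
            return collected, t
        h = route[l]
        x_bar = x[l][t]                # epoch 1: reserve search time on arrival
        if w[h] <= x_bar:              # information detected within reserved time
            t -= w[h]
            if y[l][t]:                # epoch 2: collect or not
                t -= s[h]
                collected.append(h)
        else:
            t -= x_bar                 # reserved time used up, nothing found
        nxt = z[l][t]                  # epoch 3: choose the next region
```

Each decision is a single table look-up keyed by the current order index and remaining time, which is why the storage is bilinear in $L_i$ and $T$.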

4.2 Formulation of a Decomposable Approximation

The approximate formulation is developed by using an approximate objective function. The expected reward collected from a shared region $\alpha$ under a policy $\pi$ is approximated as

$$\tilde{R}_\alpha(\pi) = e_\alpha g_\alpha \big[ \tau^1_\alpha(\pi^1)(1 - \tau^2_\alpha(\pi^2)) + \tau^2_\alpha(\pi^2)(1 - \tau^1_\alpha(\pi^1)) + (1 + \gamma_\alpha)\, \tau^1_\alpha(\pi^1)\, \tau^2_\alpha(\pi^2) \big]. \tag{10}$$

The approximation (10) is based on the assumption that, under any decentralized policy $\pi$ in the form of (5), whether a vehicle succeeds in a shared region $\alpha$ is conditionally independent of the other vehicle's success given that information exists in the region. We should note that this assumption generally does not hold, but we will provide a sufficient condition in §4.4 under which it is satisfied. With (10), the expected reward of applying policy $\pi$ is approximated as

$$\tilde{R}(\pi) = \tilde{R}(\pi^1, \pi^2) = R_{\pi^1} + R_{\pi^2} + \sum_{\alpha \in S} \tilde{R}_\alpha(\pi). \tag{11}$$

Under (11) the fleet's time-allocation problem is reformulated as:

(IAP): Independent approximation problem.

$$\max\ \tilde{R}(\pi) \quad \text{s.t.}\ \pi^1 \in \Pi^1,\ \pi^2 \in \Pi^2.$$

We first consider a subproblem of problem (IAP), (IAP$_i$): $\max\{\tilde{R}(\pi) : \pi^i \in \Pi^i \mid \pi^j,\ j \ne i\}$, which optimizes vehicle $i$'s policy while fixing the policy applied by the other vehicle. To prove our main theorem we need the following lemma. It shows that, given the objective function (11), if the policy played by one vehicle is fixed, the maximal expected reward that the other vehicle can collect for the fleet at any decision epoch is determined only by the decision epoch and the vehicle's remaining time.

Lemma 1 Assume that vehicle $i$ always collects a reward of $g_\alpha [(1 - \tau^j_\alpha(\pi^j)) + \gamma_\alpha \tau^j_\alpha(\pi^j)]$ in a shared region $\alpha$ given that the other vehicle applies policy $\pi^j$. For $l = 1, 2, \ldots, L_i$ and $T \ge t \ge d(h^i_l, h^i_{L_i+1})$, there exist:

1. $G^s_{h^i_l}(t)$: The maximal expected (future) reward that vehicle $i$ can collect if it arrives at region $h^i_l$ with $t$ units of remaining time.
2. $G^c_{h^i_l}(t)$: The maximal expected (future) reward that vehicle $i$ can collect if it detects the information in region $h^i_l$ with $t$ units of remaining time.
3. $G^d_{h^i_l}(t)$: The maximal expected (future) reward that vehicle $i$ can collect if it leaves region $h^i_l$ with $t$ units of remaining time.

Proof. Since no information reward can be collected at the end depot, we can define $G^s_{h^i_{L_i+1}}(t) = 0$ for $t = 0, 1, 2, \ldots, T$, which is also the maximal expected reward that vehicle $i$ can collect when it arrives at the end depot with $t$ units of remaining time. We will prove the lemma using induction. As the induction hypothesis, we assume that for $L_i + 1 \ge l' > l$, $G^s_{h^i_{l'}}(t)$, $d(h^i_{l'}, h^i_{L_i+1}) \le t \le T$, is well defined. We will show that $G^s_{h^i_l}(t)$, $G^c_{h^i_l}(t)$ and $G^d_{h^i_l}(t)$, $d(h^i_l, h^i_{L_i+1}) \le t \le T$, are also well defined.

When vehicle $i$ decides to leave region $h^i_l$ with $t$ units of remaining time, it can travel to region $h^i_{l'}$ if $l' > l$ and $t - d(h^i_l, h^i_{l'}) \ge d(h^i_{l'}, h^i_{L_i+1})$. If the vehicle decides to travel to region $h^i_{l'}$, according to the induction hypothesis it will collect a maximal expected reward of $G^s_{h^i_{l'}}(t - d(h^i_l, h^i_{l'}))$. Because the vehicle has to travel to one of the regions in $\{h^i_{l+1}, \ldots, h^i_{L_i+1}\}$, the maximal expected future reward that the vehicle can collect is

$$G^d_{h^i_l}(t) = \max \big\{ G^s_{h^i_{l'}}(t - d(h^i_l, h^i_{l'})) : L_i + 1 \ge l' > l,\ t - d(h^i_l, h^i_{l'}) \ge d(h^i_{l'}, h^i_{L_i+1}) \big\}. \tag{12}$$

For $T \ge t \ge d(h^i_l, h^i_{L_i+1})$, since $t - d(h^i_l, h^i_{L_i+1}) \ge 0 = d(h^i_{L_i+1}, h^i_{L_i+1})$, traveling directly to the end depot is always feasible and $G^d_{h^i_l}(t)$ is well defined.

For $G^c_{h^i_l}(t)$, we consider two possible cases:

1. $h^i_l \in O_i$ (region $h^i_l$ is a non-shared region): If the vehicle decides to collect the information, the vehicle collects an immediate reward of $g_{h^i_l}$ and consumes $s_{h^i_l}$ units of time.
2. $h^i_l \in S$ (region $h^i_l$ is a shared region): If the vehicle decides to collect the information, due to the lemma's assumption, the vehicle collects a reward of $g_{h^i_l} (\gamma_{h^i_l} \tau^j_{h^i_l}(\pi^j) + 1 - \tau^j_{h^i_l}(\pi^j))$ and consumes $s_{h^i_l}$ units of time.

In both cases, if the vehicle decides to leave without collecting, the vehicle collects no immediate reward and consumes no time. After either decision, the vehicle leaves region $h^i_l$, and the maximal expected reward that the vehicle can collect afterwards is given by $G^d_{h^i_l}(\cdot)$. We define

$$\hat{\tau}^i_{h^i_l} = \begin{cases} 0 & \text{if } h^i_l \in O_i, \\ \tau^j_{h^i_l}(\pi^j) & \text{if } h^i_l \in S. \end{cases}$$

Since the vehicle can only collect the information if $t - s_{h^i_l} \ge d(h^i_l, h^i_{L_i+1})$, the maximal expected future reward that the vehicle can collect is

$$G^c_{h^i_l}(t) = \begin{cases} G^d_{h^i_l}(t) & \text{if } d(h^i_l, h^i_{L_i+1}) \le t < d(h^i_l, h^i_{L_i+1}) + s_{h^i_l}, \\ \max \big\{ G^d_{h^i_l}(t),\ G^d_{h^i_l}(t - s_{h^i_l}) + g_{h^i_l} (\gamma_{h^i_l} \hat{\tau}^i_{h^i_l} + (1 - \hat{\tau}^i_{h^i_l})) \big\} & \text{if } d(h^i_l, h^i_{L_i+1}) + s_{h^i_l} \le t \le T. \end{cases} \tag{13}$$

Since $G^d_{h^i_l}(t)$ is well defined for $T \ge t \ge d(h^i_l, h^i_{L_i+1})$, $G^c_{h^i_l}(t)$ is well defined for $T \ge t \ge d(h^i_l, h^i_{L_i+1})$.

When the vehicle arrives at region $h^i_l$, it decides the maximal amount of time ($\bar{x}$) to search for the information. The information can be found after spending $x = 1, 2, \ldots, \bar{x}$ units of time, and then the vehicle needs to decide whether to collect the information, in which case the maximal expected reward that the vehicle can collect is given by $G^c_{h^i_l}(\cdot)$. The vehicle may also fail to find the information, in which case zero reward is collected from region $h^i_l$ and the vehicle leaves the region with the expected future reward given by $G^d_{h^i_l}(\cdot)$. Since the vehicle cannot schedule more than $t - d(h^i_l, h^i_{L_i+1})$ units of time to search for the information, the maximal expected reward that the vehicle can collect is

$$G^s_{h^i_l}(t) = \max_{0 \le \bar{x} \le t - d(h^i_l, h^i_{L_i+1})} \Big\{ e_{h^i_l} \sum_{x=1}^{\bar{x}} p_{h^i_l}(x)\, G^c_{h^i_l}(t - x) + (1 - e_{h^i_l} P_{h^i_l}(\bar{x}))\, G^d_{h^i_l}(t - \bar{x}) \Big\}. \tag{14}$$

Since $\bar{x} = 0$ is always a feasible decision when $t \ge d(h^i_l, h^i_{L_i+1})$, $G^s_{h^i_l}(t)$ is well defined for $T \ge t \ge d(h^i_l, h^i_{L_i+1})$. The induction is complete and the lemma is proved.

Our main result establishes a Markovian policy that solves problem (IAP$_i$).

Theorem 1 There is a Markovian policy $\pi^i \in \Pi^i$ that solves problem (IAP$_i$) given $\pi^j \in \Pi^j$.

Proof. Given $G^s_{h^i_l}(t)$, $G^c_{h^i_l}(t)$, and $G^d_{h^i_l}(t)$ defined for $l = 1, 2, \ldots, L_i + 1$, we can design a Markovian policy as follows. For $l = 1, 2, \ldots, L_i$ and $T \ge t \ge d(h^i_l, h^i_{L_i+1})$:

$$x^i_{h^i_l}(t) = \arg\max_{0 \le \bar{x} \le t - d(h^i_l, h^i_{L_i+1})} \Big\{ e_{h^i_l} \sum_{x=1}^{\bar{x}} p_{h^i_l}(x)\, G^c_{h^i_l}(t - x) + (1 - e_{h^i_l} P_{h^i_l}(\bar{x}))\, G^d_{h^i_l}(t - \bar{x}) \Big\}, \tag{15a}$$

$$y^i_{h^i_l}(t) = \begin{cases} 1 & \text{if } G^c_{h^i_l}(t) = G^d_{h^i_l}(t - s_{h^i_l}) + g_{h^i_l} (\gamma_{h^i_l} \hat{\tau}^i_{h^i_l} + (1 - \hat{\tau}^i_{h^i_l})), \\ 0 & \text{otherwise,} \end{cases} \tag{15b}$$

$$z^i_{h^i_l}(t) = \arg\max_{l'} \big\{ G^s_{h^i_{l'}}(t - d(h^i_l, h^i_{l'})) : L_i + 1 \ge l' > l,\ t - d(h^i_l, h^i_{l'}) \ge d(h^i_{l'}, h^i_{L_i+1}) \big\}. \tag{15c}$$

Under the policy specified by (15), the maximal expected rewards defined by $G^s_{h^i_l}(t)$, $G^c_{h^i_l}(t)$ and $G^d_{h^i_l}(t)$ can be achieved for each decision epoch and the corresponding remaining time. When $l = 0$, similar to what we did for $l > 0$, the maximal expected reward that the vehicle can collect from regions $h^i_1, h^i_2, \ldots, h^i_{L_i+1}$, given that vehicle $i$ leaves region $h^i_0$ with $T$ units of remaining time, is

$$G^d_{h^i_0}(T) = \max \big\{ G^s_{h^i_{l'}}(T - d(h^i_0, h^i_{l'})) : L_i + 1 \ge l' > 0,\ T - d(h^i_0, h^i_{l'}) \ge d(h^i_{l'}, h^i_{L_i+1}) \big\}. \tag{16}$$

The decision $z^i_{h^i_0}(T)$ that achieves the maximal expected reward is

$$z^i_{h^i_0}(T) = \arg\max_{l'} \big\{ G^s_{h^i_{l'}}(T - d(h^i_0, h^i_{l'})) : L_i + 1 \ge l' > 0,\ T - d(h^i_0, h^i_{l'}) \ge d(h^i_{l'}, h^i_{L_i+1}) \big\}. \tag{17}$$

Now we show that the policy specified by (15) and (17) solves problem (IAP$_i$) for any given $\pi^j$ in the form of (5). Let $\pi^i$ be the policy specified by (15) and (17). We consider any policy $\hat{\pi}^i$ in the form of (5).
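The recursions (12)-(14) and the policy (15)-(17) amount to a backward induction over the route. The sketch below is one possible implementation under stated assumptions: the function name `best_response`, the dictionary and nested-list data layout, and integer travel times with `d[v][v] == 0` are illustrative choices, and the other vehicle's success probabilities `tau_hat` are taken as given inputs (the paper extracts them with the algorithm in §B, which is not reproduced here).

```python
import math


def best_response(route, d, T, e, p, s, g, gamma, tau_hat):
    """Backward induction for one vehicle; returns value tables and look-up tables.

    route: [start depot, region_1, ..., region_L, end depot]; d[u][v]: integer travel
    time with d[v][v] == 0; e, s, g, gamma, tau_hat: dicts keyed by region;
    p[h][x - 1] = p_h(x). tau_hat[h] is 0 for a non-shared region and the other
    vehicle's success probability for a shared one.
    """
    L, end, NEG = len(route) - 2, route[-1], -math.inf
    Gs = [[NEG] * (T + 1) for _ in range(L + 2)]
    Gc = [[NEG] * (T + 1) for _ in range(L + 2)]
    Gd = [[NEG] * (T + 1) for _ in range(L + 2)]
    x = [[0] * (T + 1) for _ in range(L + 2)]
    y = [[0] * (T + 1) for _ in range(L + 2)]
    z = [[None] * (T + 1) for _ in range(L + 2)]
    Gs[L + 1] = [0.0] * (T + 1)            # no reward is collected at the end depot
    for l in range(L, -1, -1):
        h = route[l]
        to_end = d[h][end]
        for t in range(to_end, T + 1):      # leaving region l, as in (12) and (16)
            for k in range(l + 1, L + 2):
                r = t - d[h][route[k]]
                if r >= d[route[k]][end] and Gs[k][r] > Gd[l][t]:
                    Gd[l][t], z[l][t] = Gs[k][r], k
        if l == 0:
            break                           # the start depot only has the z decision
        bonus = g[h] * (gamma[h] * tau_hat[h] + (1 - tau_hat[h]))
        for t in range(to_end, T + 1):      # after detection, as in (13) and (15b)
            Gc[l][t] = Gd[l][t]
            if t - s[h] >= to_end and Gd[l][t - s[h]] + bonus >= Gd[l][t]:
                Gc[l][t], y[l][t] = Gd[l][t - s[h]] + bonus, 1
        for t in range(to_end, T + 1):      # on arrival, as in (14) and (15a)
            Gs[l][t], found, cum = Gd[l][t], 0.0, 0.0   # x_bar = 0: leave at once
            for xb in range(1, t - to_end + 1):
                px = p[h][xb - 1] if xb <= len(p[h]) else 0.0
                found += px * Gc[l][t - xb]
                cum += px
                val = e[h] * found + (1 - e[h] * cum) * Gd[l][t - xb]
                if val > Gs[l][t]:
                    Gs[l][t], x[l][t] = val, xb
    return {"Gs": Gs, "Gc": Gc, "Gd": Gd, "x": x, "y": y, "z": z}
```

In this sketch `Gd[0][T]` plays the role of (16) and `z[0][T]` of (17). The running sums `found` and `cum` avoid recomputing the inner sum of (14) for every candidate search time, so each region costs on the order of $T^2$ operations.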
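According to the definition of $G^d_{h^i_0}(T)$, we have

$$\sum_{\alpha \in O^i} e_\alpha g_\alpha \tau^i_\alpha(\pi^i) + \sum_{\alpha \in S} e_\alpha g_\alpha \tau^i_\alpha(\pi^i)\big[(1 - \tau^j_\alpha(\pi^j)) + \gamma_\alpha \tau^j_\alpha(\pi^j)\big] \ge \sum_{\alpha \in O^i} e_\alpha g_\alpha \tau^i_\alpha(\hat{\pi}^i) + \sum_{\alpha \in S} e_\alpha g_\alpha \tau^i_\alpha(\hat{\pi}^i)\big[(1 - \tau^j_\alpha(\pi^j)) + \gamma_\alpha \tau^j_\alpha(\pi^j)\big].$$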

Given $\pi^j$, we consider $\tilde{R}(\pi^i, \pi^j) - \tilde{R}(\hat{\pi}^i, \pi^j)$. We have

$$\begin{aligned} \tilde{R}(\pi^i, \pi^j) - \tilde{R}(\hat{\pi}^i, \pi^j) ={}& R_{\pi^i} + R_{\pi^j} + \sum_{\alpha \in S} e_\alpha g_\alpha \big[\tau^i_\alpha(\pi^i)(1 - \tau^j_\alpha(\pi^j)) + \tau^j_\alpha(\pi^j)(1 - \tau^i_\alpha(\pi^i)) + (1 + \gamma_\alpha)\tau^i_\alpha(\pi^i)\tau^j_\alpha(\pi^j)\big] \\ &- R_{\hat{\pi}^i} - R_{\pi^j} - \sum_{\alpha \in S} e_\alpha g_\alpha \big[\tau^i_\alpha(\hat{\pi}^i)(1 - \tau^j_\alpha(\pi^j)) + \tau^j_\alpha(\pi^j)(1 - \tau^i_\alpha(\hat{\pi}^i)) + (1 + \gamma_\alpha)\tau^i_\alpha(\hat{\pi}^i)\tau^j_\alpha(\pi^j)\big] \\ ={}& \sum_{\alpha \in O^i} e_\alpha g_\alpha \tau^i_\alpha(\pi^i) - \sum_{\alpha \in O^i} e_\alpha g_\alpha \tau^i_\alpha(\hat{\pi}^i) + \sum_{\alpha \in S} e_\alpha g_\alpha \tau^i_\alpha(\pi^i)\big[(1 - \tau^j_\alpha(\pi^j)) + \gamma_\alpha \tau^j_\alpha(\pi^j)\big] \\ &- \sum_{\alpha \in S} e_\alpha g_\alpha \tau^i_\alpha(\hat{\pi}^i)\big[(1 - \tau^j_\alpha(\pi^j)) + \gamma_\alpha \tau^j_\alpha(\pi^j)\big] \ge 0. \end{aligned}$$

Since $\tilde{R}(\pi^i, \pi^j)$ is at least as large as $\tilde{R}(\hat{\pi}^i, \pi^j)$ for any $\hat{\pi}^i \in \hat{\Pi}^i$, the policy established by (15) and (17) solves problem (IAP$_i$), and it is a Markovian policy by its form.

Remark 1 When no regions are shared, the solution of problem (IAP$_i$) maximizes $R_{\pi^i}$. Since we also have $R(\pi) = R_{\pi^1} + R_{\pi^2}$, the policies obtained by solving problems (IAP$_1$) and (IAP$_2$) compose an optimal time-allocation policy in this case.

For readability and conciseness of the paper, we defer the proofs of the remaining theorems and propositions to Appendix C.

4.3 Obtaining a Local Optimum

We first define a local optimum of problem (IAP).

Definition 3 A Markovian policy $\pi^{L,*} = (\pi^{L,*}_1, \pi^{L,*}_2)$ is a local optimum of problem (IAP) if

$$\pi^{L,*}_1 \in \arg\max\{\tilde{R}(\pi^1, \pi^{L,*}_2) : \pi^1 \in \Pi^1\}, \qquad \pi^{L,*}_2 \in \arg\max\{\tilde{R}(\pi^{L,*}_1, \pi^2) : \pi^2 \in \Pi^2\}.$$

Algorithm 1 implements the idea of the JESP algorithm (Nair et al., 2003), which iteratively improves one vehicle's policy while fixing the policy applied by the other vehicle. Since we solve the approximate problem (IAP) in the algorithm instead of the true problem, our algorithm runs much more efficiently.

Theorem 2 Algorithm 1 converges to a local optimum of problem (IAP) in a finite number of steps.

Each restart of Algorithm 1 may converge to a different local optimum since the initial vehicle $i$ is randomly selected and the values of $\hat{\tau}^i_\alpha$, $\alpha \in S$, are also randomly generated. To interpret this, we can regard Algorithm 1 as using a local-search method to find a local optimum of a non-convex optimization problem. Starting from a different initial point, the local-search method may converge to a different local optimum.

Algorithm 1
1: Initialization: Select a random $i \in \{1, 2\}$ and generate a random $\hat{\tau}^i_\alpha \in [0, 1]$, for $\alpha \in S$. Set $\bar{R} \leftarrow 1$, $\underline{R} \leftarrow 0$ and $j \in \{1, 2\} \setminus \{i\}$.
2: Solve (IAP$_i$) using the $\hat{\tau}^i_\alpha$, $\alpha \in S$, and obtain the policy $\pi^i$. For $\alpha \in S$: $\hat{\tau}^j_\alpha \leftarrow \tau^i_\alpha(\pi^i)$.
3: while $\bar{R} \ne \underline{R}$ do
4:   $\underline{R} \leftarrow \bar{R}$.
5:   Solve (IAP$_j$) and use the obtained policy to update $\pi^j$. (Do not update if the current $\pi^j$ solves (IAP$_j$).)
6:   For $\alpha \in S$: $\hat{\tau}^i_\alpha \leftarrow \tau^j_\alpha(\pi^j)$.
7:   Solve (IAP$_i$) and use the obtained policy to update $\pi^i$. (Do not update if the current $\pi^i$ solves (IAP$_i$).)
8:   For $\alpha \in S$: $\hat{\tau}^j_\alpha \leftarrow \tau^i_\alpha(\pi^i)$.
9:   Obtain $\tilde{R}(\pi^1, \pi^2)$ using (10) and update $\bar{R} \leftarrow \tilde{R}(\pi^1, \pi^2)$.
10: end while

Now we explain why constrained re-optimization is performed. Any feasible time-allocation policy for the initial route family is still feasible for the route family obtained by constrained re-optimization, and the same expected reward is collected by the fleet from the two families. To see this, we consider an arbitrary feasible policy for the initial route family. The $z^i_\alpha(\cdot)$ decisions in the policy will not lead the vehicle to a new region, and all $z^i_\alpha(\cdot)$ decisions are still feasible since the order of the initial regions does not change. Also, because the $\bar{x}^i_\alpha(\cdot)$ and $y^i_\alpha(\cdot)$ decisions for each region are the same, each vehicle will behave in the re-optimized route exactly as it does in the initial route. Thus, if we use the optimal policy of the initial route family to generate the $\hat{\tau}^i_\alpha$, $\alpha \in S$, for Algorithm 1, there is a high chance that we will obtain a better time-allocation policy from the new route family.

Even though Algorithm 1 may still go over all feasible policies before convergence in the worst case, in practice it only goes over a few policies to converge, which we will illustrate in our numerical study. More importantly, each iteration (lines 4-9) runs in time polynomial in $L$ and $T$; this is the main reason that the algorithm can solve realistic-size problems. The problems in line 5 and line 7 are solved using backward induction on the recursions defined by (12), (13), (14) and (16) with initial condition $G_{h^i_{L_i+1}}(t) = 0$, $t = 0, 1, 2, \ldots, T$. It is easy to verify that the backward induction takes $O(TL^2 + T^2L)$ time.
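The backward induction described above can be sketched for a single vehicle on a fixed route. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `backward_induction` and all argument names are hypothetical, the detection function is taken to be $P(x) = 1 - \rho^x$ as in (23), and the collect-or-skip value $G^c$ is written as the maximum of the two alternatives suggested by (15b), since recursion (13) itself is not reproduced in this section.

```python
def backward_induction(d, e, g, s, rho, gamma, tau_hat, T):
    """Return G^d at the origin depot for T units of mission time.

    Hypothetical inputs: d[k][l] is the integer travel time between route
    positions k < l (0 = origin depot, last = destination depot); e, g, s,
    rho, gamma are per-position prior, reward, collection time, detection
    parameter and cooperation factor; tau_hat[k] is the assumed success
    probability of the other vehicle (0 for non-shared regions).
    """
    n = len(d)
    last = n - 1
    neg = float("-inf")
    Gs = [[neg] * (T + 1) for _ in range(n)]  # value on arrival (search)
    Gc = [[neg] * (T + 1) for _ in range(n)]  # value once information is found
    Gd = [[neg] * (T + 1) for _ in range(n)]  # value on departure
    for t in range(T + 1):
        Gs[last][t] = 0.0                     # initial condition at the depot

    def P(k, x):
        # Assumed detection function: probability of detection within x units.
        return 1.0 - rho[k] ** x

    for k in range(last - 1, -1, -1):
        lo = d[k][last]                       # time needed to reach the depot
        for t in range(lo, T + 1):            # departure: choose next position
            best = neg
            for l in range(k + 1, last + 1):
                r = t - d[k][l]
                if r >= d[l][last]:
                    best = max(best, Gs[l][r])
            Gd[k][t] = best
        if k == 0:
            break                             # nothing is searched at the origin
        bonus = g[k] * (gamma[k] * tau_hat[k] + (1.0 - tau_hat[k]))
        for t in range(lo, T + 1):            # collect or skip (assumed form)
            Gc[k][t] = Gd[k][t]
            if t - s[k] >= lo:
                Gc[k][t] = max(Gc[k][t], Gd[k][t - s[k]] + bonus)
        for t in range(lo, T + 1):            # search-time decision, as in (14)
            best = Gd[k][t]                   # x_bar = 0
            found = 0.0
            for xbar in range(1, t - lo + 1):
                px = P(k, xbar) - P(k, xbar - 1)
                found += e[k] * px * Gc[k][t - xbar]
                miss = (1.0 - e[k] * P(k, xbar)) * Gd[k][t - xbar]
                best = max(best, found + miss)
            Gs[k][t] = best
    return Gd[0][T]
```

The three nested loops give the $O(TL^2 + T^2L)$ running time quoted above: the departure step costs $O(L)$ per $(k, t)$ pair and the search step costs $O(T)$ per pair. If $T$ is smaller than the direct travel time between the depots, the sketch returns negative infinity to signal infeasibility.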
We use forward induction in line 6 and line 8 to update the variables $\hat{\tau}^i_\alpha$, $\alpha \in S$, $i \in U$, for which we provide an algorithm in Appendix B. The forward induction is performed over the route of each vehicle (from $h^i_0$ to $h^i_{L_i+1}$ for vehicle $i \in U$), but the size of the induction tree in each iteration is linear in $T$. The forward induction takes $O(LT^2)$ time. In line 9, we calculate $\tilde{R}_\alpha(\pi)$ using stored values, which takes only $O(L)$ time.

4.4 Calculation of Exact Value of a Markovian Policy and a Sufficient Condition for the Approximation to be Exact

Algorithm 1 provides an efficient method to find a policy; however, it does not offer the true value of the obtained policy since $\tilde{R}(\pi)$ is only an approximation of $R(\pi)$. To calculate $R(\pi)$, we use conditional independence. The method is derived based on the shared order index of each shared region for each vehicle, which is defined as follows:

Definition 4 The shared order index of a shared region $\alpha$ for vehicle $i$ is defined as $I^i_\alpha = k$, if $\alpha$ has the $k$th highest order index among the vehicle's shared regions.

15 The maxima and minima shared order indices of region α are defined as I α = max{iα, 1 Iα} 2 and I α = min{iα, 1 Iα}, 2 respectivey. Based on the shared order indices of the shared regions, we define the dependent set of each shared region. Definition 5 Φ α = {β : I β I α, β S} is the dependent set of a shared region α. According to the definition of the dependent set, for any shared region α, if 1 and 2 are the corresponding order indices of region α in the two vehices routes, it is easy to verify that Φ α = {h 1 1,..., h } {h 2 1,..., h }. For any region set V A, we use θ V = {θ α } α V to represent a reaization of the existence of information in each region that beongs to V. Theorem 3 Whether a vehice succeeds in a shared region α is conditionay independent of the other vehice under a decentraized Marovian poicy given θ α and θ Φα. Using Theorem 3, we can cacuate R α (π) using conditiona independence given θ α and θ Φα. R α (π) = P (θ Φα )[τα(π 1 1 θ Φα )(1 τα(π 2 2 θ Φα )) + τα(π 2 2 θ Φα )(1 τα(π 1 1 θ Φα )) θ Φα {0,1} Φα + (γ α + 1)τα(π 1 1 θ Φα )τα(π 2 2 θ Φα )], (19) where τ i α(π i θ Φα ) is the conditiona probabiity that vehice i succeeds in region α given θ α and θ Φα. τ i α(π i θ Φα ), i U, α S, can be cacuated using the agorithm provided in B and P (θ Φα ) can be cacuated using Lemma 2 provided in C. The foowing proposition estabishes the monotonicity of the dependent set for each shared region of two vehices with respect to the region s minima shared order index. Proposition 2 The dependent set of a shared region α is monotone w.r.t. I α, i.e., if I α1 I α2, Φ α1 Φ α2. Using the resut of Proposition 2, the compexity of cacuating R(π) is determined by the size of the argest dependent set. In other words, we ony need to enumerate a possibe reaizations of θ Φβ where β arg max α S {I α } so that we can cacuate the expected reward coected in each shared region and the number of possibe reaizations for θ Φβ is 2 Φβ. 
Note that we have $|\Phi_\beta| \le |S| - 1$, and $|\Phi_\beta| = |S| - 1$ if and only if $\beta$ is the last shared region for both vehicles to search.

Proposition 3 Whether a vehicle succeeds in $\alpha \in S$ is conditionally independent of the other vehicle given $\theta_\alpha$ under a decentralized policy $\pi$ in the form of (5) if the shared regions are searched in an exactly opposite order by the two vehicles.

Proposition 3 gives a sufficient condition under which (10) provides the exact expected reward collected by any Markovian policy from a shared region. Under a weaker condition, we are able to prove an important result.

Theorem 4 If whether a vehicle succeeds in $\alpha \in S$ is conditionally independent of the other vehicle given $\theta_\alpha$ under an optimal decentralized policy $\pi$ in the form of (5), there exists a Markovian policy that solves the time-allocation problem optimally.

The result of Theorem 4 provides a sufficient condition under which the approximate problem (IAP) is equivalent to the original decentralized time-allocation problem, which, in its general form, is a DEC-POMDP. The theorem also implies that under the condition of Proposition 3, the optimal solution of problem (IAP) provides an optimal solution to the time-allocation problem for a given route family.

We end this section with an illustration of the largest dependent set of a route family in Figure 1. In scenario (L), since both vehicles visit the shared regions in the same order, the size of the largest dependent set reaches its maximum, which is $|S| - 1$. In scenario (R), since both vehicles visit the shared regions in an exactly opposite order, the largest dependent set is empty.

[Figure 1: The dependent set of a shared region. Panels: scenario (L), same direction, showing the dependent set of region $\alpha$; scenario (R), opposite direction, where the dependent set of each shared region is empty. Legend: regions belonging to the dependent set of region $\alpha$; shared regions; vehicle 1's non-shared regions; vehicle 2's non-shared regions.]

5. An upper bound on the maximum reward collected from a group of routes

In this section, we use the value provided by a local optimum of problem (IAP), i.e., $\tilde{R}(\pi^{L,*})$, to develop an upper bound on the maximum expected reward that the fleet can collect from a given route family by following a decentralized time-allocation policy, i.e., $R(\pi^*)$. To establish the upper bound, we derive a lower bound on the ratio $\tilde{R}(\pi^{L,*}) / R(\pi^*)$. The lower bound on the ratio is established with respect to the cooperation factors of the shared regions, each of which can be greater than, smaller than, or equal to 1.
When the cooperation factor is greater than 1, the second vehicle that collects the information gains more reward for the fleet. This implies that the fused information from different sensors provides much more value than the information from a single source. For instance, multi-sensor image fusion can provide a fused image that is more informative than any of the input images (Haghighat et al., 2011). It is also common for the second vehicle to collect less information than the first vehicle, since the total information existing in a region can decrease after each successful collection; in that situation we expect the cooperation factor to be smaller than 1.

The lower bound of the ratio is found in a closed form of $\bar{\gamma}$ and $\underline{\gamma}$, which satisfy $\bar{\gamma} = \max\{1, \max_{\alpha \in S} \gamma_\alpha\}$ and $\underline{\gamma} = \min\{1, \min_{\alpha \in S} \gamma_\alpha\}$.

Theorem 5

$$\frac{\tilde{R}(\pi^{L,*})}{R(\pi^*)} \ge \frac{1}{(2 - \underline{\gamma})\,\bar{\gamma}}. \tag{20}$$

We consider two special scenarios of Theorem 5.

Remark 2 When $\bar{\gamma} = 1$, (20) degenerates to

$$\frac{\tilde{R}(\pi^{L,*})}{R(\pi^*)} \ge \frac{1}{2 - \underline{\gamma}}. \tag{21}$$

When $\underline{\gamma} = 1$, (20) degenerates to

$$\frac{\tilde{R}(\pi^{L,*})}{R(\pi^*)} \ge \frac{1}{\bar{\gamma}}. \tag{22}$$

Remark 2 highlights two special cases that are common in practice. The bound (21) refers to the scenario where the value of information in each shared region decreases after the first collection. For this scenario, we have $\frac{1}{2 - \underline{\gamma}} \ge \frac{1}{2}$. This implies that $\tilde{R}(\pi^{L,*})$ achieves at least 50% of the optimal value in the worst case. The bound (22) refers to the scenario where a much larger information reward can be obtained by fusing the information collected by two vehicles. We can observe that $\tilde{R}(\pi^{L,*})$ can be arbitrarily bad compared to the optimal value if $\bar{\gamma}$ approaches infinity. In Appendix D, we provide two corollaries to show how the bounds (21) and (22) can be approached from a theoretical point of view. We also notice that when $\underline{\gamma}$ and $\bar{\gamma}$ approach 1, the general bound (20) converges to 1, which means that $\tilde{R}(\pi^{L,*})$ converges to the same value that the optimal policy provides. In this situation, bound (20) can be used to estimate the convergence speed.

To examine the worst-case performance, we can first compute an upper bound of $R(\pi^*)$ using the value $\tilde{R}(\pi^{L,*})$ and the bound (20). Then the worst-case performance ratio $R(\pi^{L,*}) / R(\pi^*)$ can be derived. Since we notice that $\tilde{R}(\pi^{L,*})$ and $R(\pi^{L,*})$ are usually close in practice, we can use the bound (20) as an estimate of the true worst-case performance ratio.

6. Numerical Study

The purpose of the numerical study is threefold. Section 6.2 establishes the efficiency of the algorithm in Section 4.3 that finds a Markovian time-allocation policy for a given route family. Section 6.3 explores the benefit of using a region-sharing strategy. Section 6.4 presents additional insights related to the mission duration, cooperation factor value and search sequence. All tests are performed on a personal computer with an Intel i7 CPU and 8 GB RAM under Windows 7.

6.1 Parameter Setup and Scenario Generation

A rectangular map of size $W \times L$ has two depots located at the middle points of the left and right edges. Two vehicles are assigned to search $N$ regions, which are uniformly generated at integer points of the map, within $T$ units of mission time. The travel time between two regions is calculated using Euclidean distance rounded down to the nearest integer. Two basic scenarios are tested, which are the scenarios (L) and (R) presented in Figure 1 (at the end of Section 4). In scenario (L), both vehicles travel from the left depot to the right depot. In scenario (R), one vehicle travels from the left depot to the right depot while the other vehicle travels from the right depot to the left depot. We generate values for parameters associated with a region $\alpha$ as follows:

$e_\alpha$, the a priori probability that information exists, is uniformly generated from a sub-interval $[\underline{e}, \bar{e}] \subseteq [0, 1]$.
$g_\alpha$, the reward of collecting the information, is assumed to be 1.
$s_\alpha$, the time to collect the information, is an integer random variable whose value is equally likely to be 0, 1, 2, 3, 4 or 5.
$\gamma_\alpha$, the cooperation factor, is uniformly generated in the range $[\underline{\gamma}, \bar{\gamma}] = [0.0, 2.0]$.
$P_\alpha(t)$, the detection function, is assumed to follow the random search formula in Koopman (1980) and takes the form

$$P_\alpha(t) = 1 - \rho_\alpha^t, \tag{23}$$

where $\rho_\alpha$ is uniformly generated from $[\underline{\rho}, \bar{\rho}] \subset (0, 1)$. According to the development in Koopman (1980), $\rho_\alpha$ is determined in practice by the size of the region, the speed of the vehicle, and the effective range of the sensor, and always falls within $(0, 1)$.

For each map, we generate two initial route families using the formulations provided in Appendix A. The first family minimizes the total travel time of the fleet and the second family minimizes the maximal travel time of a vehicle. We obtain the optimal policies for these two families and use NS to represent the better policy.
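The region parameters and travel times described in this subsection can be sampled with a few lines of code. The following is a minimal sketch, not the generator used in the study: the names `detection_probability`, `travel_time` and `generate_regions` are hypothetical, the default ranges are the ones quoted in the text, and the choice of which map dimension is horizontal is an assumption.

```python
import math
import random


def detection_probability(rho, t):
    """Detection function (23): probability of detection within t time units."""
    return 1.0 - rho ** t


def travel_time(a, b):
    """Euclidean distance between two integer points, rounded down."""
    return math.floor(math.hypot(a[0] - b[0], a[1] - b[1]))


def generate_regions(n, width, length, e_range=(0.3, 0.7),
                     rho_range=(0.15, 0.35), gamma_range=(0.0, 2.0), seed=None):
    """Sample n regions at integer points of a width-by-length map.

    Assumption: the first coordinate ranges over [0, length] and the
    second over [0, width]; the text does not fix the orientation.
    """
    rng = random.Random(seed)
    regions = []
    for _ in range(n):
        regions.append({
            "pos": (rng.randint(0, length), rng.randint(0, width)),
            "e": rng.uniform(*e_range),          # prior that information exists
            "g": 1.0,                            # reward of collecting
            "s": rng.randint(0, 5),              # collection time, 0..5
            "gamma": rng.uniform(*gamma_range),  # cooperation factor
            "rho": rng.uniform(*rho_range),      # detection parameter
        })
    return regions
```

For example, with $\rho_\alpha = 0.25$ a single time unit of search detects existing information with probability $1 - 0.25 = 0.75$, which is the value used in the idealized scenario of Section 6.4.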
For each initial route, we use the minimum insertion rule to select all possible sets of shared regions for the two vehicles, and for each set of shared regions we obtain two route families using the constrained and unconstrained re-optimization, respectively. For each re-optimized route family, a time-allocation policy is generated by Algorithm 1. Among the policies generated for each initial route family, we select the one that has the highest approximate value ($\tilde{R}(\pi)$). Then, the exact values of the selected policies are calculated using (7) and (19). We use S to represent the better policy of the two selected policies. The notation $\tilde{V}^{(i)}(j)$ and $V^{(i)}(j)$ is used to denote, respectively, the approximate and exact values of the policy $j$ for scenario $i$. For example, $V^{(L)}(S)$ represents the exact value of the policy S obtained under scenario (L).

6.2 The efficiency of finding a decentralized time-allocation policy

We investigate the efficiency of computing a Markovian policy for a given route family using Algorithm 1. We use the following parameter setup: $W \times L = 30 \times 60$, $[\underline{e}, \bar{e}] = [0.3, 0.7]$, $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$, and $[\underline{\gamma}, \bar{\gamma}] = [0.0, 2.0]$. For each map, we randomly select an initial route family from the two initial families. The threshold of the minimal insertion rule is set to $\delta = 25$. After adding the selected shared regions to each vehicle's initial route, we use one of the two re-optimization methods to modify vehicle routes. The objective is to test how much computation time it takes to obtain a local optimum of problem (IAP) and the time needed to calculate the policy's expected value ($R(\pi)$). We first present the results for scenario (R) under different units of mission time ($T$) and different numbers of regions ($N$). For each $(T, N)$ pair, the results are collected from 50 random maps.

Table 1: Computation Complexity: Scenario (R)
Columns: $(T, N)$; Iterations (min, max, avg); Time(C) (min, max, avg); Time(E) (min, max, avg); $|S|$ (avg); $|\Phi_\beta|$ (avg)
Rows: (100, 20), (200, 20), (300, 20), (100, 30), (200, 30), (300, 30), (100, 40), (200, 40), (300, 40)

In Table 1, Iterations represents the number of iterations for Algorithm 1 to converge, Time(C) captures the corresponding CPU seconds consumed, and Time(E) records the CPU seconds to calculate $R(\pi)$. For each quantity, min, max and avg provide the minimum, maximum and average of that quantity among the 50 random maps. $|S|$ represents the number of shared regions and $|\Phi_\beta|$ refers to the size of the largest dependent set, for which we provide average values in the table. For the largest problem, i.e., (300, 40), the maximal time of computing a local optimum of problem (IAP) is still small (less than 5 seconds), and the average time is about half of the maximal time. This shows the efficiency of using the approximate formulation to find a decentralized time-allocation policy. However, Time(E) has larger variation. Even though, on average, it takes a negligible amount of time to calculate $R(\pi)$ under this scenario, it may take several seconds to compute it in the worst case. In these situations, we observe large dependent sets. Table 2 provides the test results for scenario (L).
We observe large computation times for calculating $R(\pi)$ for the cases (100, 20), (200, 20) and (300, 20), and we do not list the computation time for the other cases since we experienced constant out-of-memory problems during the experiments. This is because the size of the dependent set can be much larger when the vehicles travel in the same direction than when they travel in opposite directions, which can be observed by comparing the $|\Phi_\beta|$ columns in the two tables. We use a breadth-first algorithm to calculate $R(\pi)$, which saves computation time but may exhaust the memory when the dependent set is large. In such a scenario, we can use a memory-saving algorithm, e.g., a depth-first enumeration algorithm or Monte Carlo simulation, to estimate $R(\pi)$.

Table 2: Computation Complexity: Scenario (L)
Columns: $(T, N)$; Iterations (min, max, avg); Time(C) (min, max, avg); Time(E) (min, max, avg); $|S|$ (avg); $|\Phi_\beta|$ (avg)
Rows: (100, 20), (200, 20), (300, 20), (100, 30), (200, 30), (300, 30), (100, 40), (200, 40), (300, 40)

We also observe from both tables that the time to compute a policy and the number of iterations to converge do not change much even though the average size of the largest dependent set ($|\Phi_\beta|$) differs significantly between the two tables. This observation implies that the convergence of Algorithm 1 is not influenced by the size of the dependent set. Both tables show that the average computation time and number of iterations needed to obtain a policy increase with $N$ and $T$. This implies that it takes more iterations to converge for a larger problem. Since the time to perform an iteration also increases with the problem size, the time to compute a policy also increases with the problem size.

Finally, we do not provide the computation time of solving the routing problems since we use standard routing models and solve them using a standard solver. According to our experiments, the largest routing problems listed above, i.e., the problems generated under (300, 40), can be solved in a few seconds by a standard commercial solver (CPLEX) using the formulations provided in Appendix A. When it becomes expensive to solve the routing problem exactly, heuristic methods can be applied to create and re-optimize the routes.

6.3 Benefits of sharing

This section illustrates the influence of the cooperation factor and mission time values on sharing strategies. We use the following parameter setup: $W \times L = 14 \times 20$, $N = 16$, $[\underline{e}, \bar{e}] = [0.3, 0.7]$ and $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$. In each map, we use a constant cooperation factor for all regions. The tested values are $\gamma = 0.2, 0.5, 0.8, 1.1, 1.4, 1.7, 2.0$ and 50 random maps are generated for each value. The tests are performed for scenario (L) with two different mission times, $T = 80$ and $T = 110$, and the results are shown in Figures 2 and 3, respectively.
In each figure, the three lines present the maximum, average, and minimum of the percentage improvements through region-sharing, i.e., $\frac{V^{(L)}(S)}{V^{(L)}(NS)} - 1$, among the 50 random maps generated for each cooperation factor value. We have the following observations from comparing the two figures. First, the three quantities increase with the cooperation factor in both figures, which implies that a larger cooperation factor brings a larger benefit through region-sharing.

[Figure 2: Percentage improvements under scenario (L) for different cooperation factors and $T = 80$. Horizontal axis: cooperation factor; vertical axis: percentage improvement; series: maximum, average, minimum. Data labels shown: 10.75% and 4.59%.]

Second, in Figure 2, the average improvements are much smaller than the ones in Figure 3. This implies that a larger benefit is gained through region-sharing when the fleet is given more mission time. In addition, the average and minimal performance improvements also increase much faster with respect to the cooperation factor value in Figure 3 than in Figure 2, especially when the cooperation factor is low. This indicates that when the fleet has more mission time, it also benefits more from an increase in cooperation efficiency (i.e., cooperation factor value). Third, for $\gamma = 0.2, 0.5$ the average improvements in Figure 2 are less than 5%. This means that having two vehicles search one region is generally not economical when the cooperation factor is low and the mission time is insufficient. Nevertheless, we still observe in Figure 2 that the maximal improvement at $\gamma = 0.2$ is more than 10%, which shows the potential benefit of region-sharing even when the fleet has low cooperation efficiency and insufficient mission time. Furthermore, when more mission time is given to each vehicle, i.e., the scenarios presented in Figure 3, the minimal performance improvement through region-sharing already reaches 7.98% at $\gamma = 0.5$ and the average improvement is more than 10%. This shows that we can expect a stable performance improvement through region-sharing at a relatively low cooperation factor if the fleet has a sufficiently large mission time.

We present how the improvements are distributed among the random cases generated for $\gamma = 0.2, 2$ and $T = 80, 110$ in Figure 4.
Except for the setting where $\gamma = 0.2$, $T = 80$, we observe that the improvement follows a unimodal distribution and the peak is close to the average. This shows that the average improvement curves presented in Figures 2 and 3 generally reflect the benefit of using a region-sharing strategy for the corresponding cooperation factor value and mission time. However, for the setting $\gamma = 0.2$, $T = 80$, the peak of the distribution is to the right of the average and the distribution has a long tail. This shows that when the fleet has insufficient mission time and the cooperation factor is low, the improvement has a decreasing chance to be larger, but the chance of having a large improvement cannot be ignored.

[Figure 3: Percentage improvements under scenario (L) for different cooperation factors and $T = 110$. Horizontal axis: cooperation factor; vertical axis: percentage improvement; series: maximum, average, minimum. Data label shown: 7.98%.]

[Figure 4: Distribution of the improvements under scenario (L) for $\gamma = 0.2, 2$ and $T = 80, 110$. Four histogram panels of frequency against percentage improvement range, with reported averages 1.78% (Cf = 0.2, $T = 80$), 65.26% (Cf = 2.0, $T = 80$), 4.52% (Cf = 0.2, $T = 110$) and 95.12% (Cf = 2.0, $T = 110$).]

Finally, we investigate the benefit of sharing under an extreme scenario where we have $\gamma = 0$ for all regions. This means that if one vehicle collects the information from a region, the other vehicle gains no extra reward for the fleet even if it also collects the information. This time, we consider scenario (R), and three settings are tested with the corresponding parameters listed in Table 3. For each setting, we generate 25 maps. We present two ratios in the table. Ratio 1 presents the performance improvement through region-sharing according to the policy's approximate value ($\tilde{R}(\pi)$) and Ratio 2 presents the true improvement. We present the maps that have the largest and the second largest Ratio 2 among the 25 maps for the corresponding setting. We observe a performance improvement of 15.37% in map 2. In addition, in maps 1 and 3, we observe the improvements through region-sharing directly from the approximate values of the corresponding policies. In maps 2, 4, 5, and 6, we do not observe large improvements from the policies' approximate values (Ratio 1), but the true improvements (Ratio 2) are much larger. This is because a large dependent set can exist when the two vehicles travel in the same direction, which can make a significant difference between $\tilde{V}^{(R)}(S)$ and $V^{(R)}(S)$.

Table 3: Results under scenario (R) where $\max_{\alpha \in S} \gamma_\alpha = 0$
Columns: Map; $W \times L$; $N$; $T$; $[\underline{e}, \bar{e}]$; $[\underline{\rho}, \bar{\rho}]$; $V^{(R)}(NS)$; $\tilde{V}^{(R)}(S)$; $V^{(R)}(S)$; Ratio 1; Ratio 2
Map 1: $[\underline{e}, \bar{e}] = [0.3, 0.7]$, $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$, Ratio 2 = 9.12%
Map 2: $[\underline{e}, \bar{e}] = [0.3, 0.7]$, $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$, Ratio 2 = 15.37%
Map 3: $[\underline{e}, \bar{e}] = [0.3, 0.7]$, $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$, Ratio 2 = 10.04%
Map 4: $[\underline{e}, \bar{e}] = [0.3, 0.7]$, $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$, Ratio 2 = 6.05%
Map 5: $[\underline{e}, \bar{e}] = [0.3, 0.7]$, $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$, Ratio 2 = 5.97%
Map 6: $[\underline{e}, \bar{e}] = [0.3, 0.7]$, $[\underline{\rho}, \bar{\rho}] = [0.15, 0.35]$, Ratio 2 = 9.20%
Ratio 1: $\frac{\tilde{V}^{(R)}(S)}{V^{(R)}(NS)} - 1$, Ratio 2: $\frac{V^{(R)}(S)}{V^{(R)}(NS)} - 1$.

6.4 Additional Insights

This section investigates the joint behavior of the fleet through region-sharing with respect to the mission duration, the value of the cooperation factor and the search sequence.

The study of an idealized scenario.

We first investigate the problem using an idealized scenario presented in Figure 5. We have 10 regions located on a straight line and every two adjacent regions have a distance of 1 unit. For each region $\alpha$, we set $e_\alpha = 0.5$, $\rho_\alpha = 0.25$, and $s_\alpha = 0$. Such a scenario excludes the influence of the regions' location, since it is not beneficial to skip a region, and the influence of the information collection time in each region, since a vehicle will always collect the information if it detects it.

[Figure 5: A map with regions on a straight line. Regions 1, 2, ..., 10 lie between the starting points of Vehicle 1 and Vehicle 2.]

In the given scenario, both vehicles have to spend 11 units of travel time whether or not they skip a region. When $T = 16$, each of them has 5 units of time to allocate. From the results for $\gamma = 0.5$, we observe that all regions are searched, which means that each of the vehicles searches 5 regions by spending 1 unit of time in each of them and none of the regions is shared. For $\gamma = 1.5$, however, only 5 regions are actually searched and all of them are shared by the two vehicles. For $\gamma = 0.5$, when we give the vehicles more mission time, each vehicle tends to allocate the additional time to the regions that are initially searched by the other vehicle instead of its own regions. This can be observed from the increase of Shared(No.) in the upper bracket of the table. This shows that the vehicles can prefer region-sharing to securing their own information reward even if the cooperation factor is below 1 (i.e., with low cooperation efficiency). For $\gamma = 1.5$, when the vehicles have more mission time, they can prefer allocating more time to secure the joint reward in a shared region that has already been searched over searching new regions (under high cooperation efficiency). This can be observed from Searched(No.) in the lower bracket of the table, since only 9 regions are actually searched when each vehicle is given 22 units of mission time.

Table 4: Percentages of the regions that are searched and shared
Upper bracket ($\gamma = 0.5$): rows $T$, Searched(No.), Shared(No.); Searched(No.) = 10 for each of the seven tested values of $T$
Lower bracket ($\gamma = 1.5$): rows $T$, Searched(No.), Shared(No.)
Searched(No.): The number of regions that are actually searched. Shared(No.): The number of regions that are shared.

To investigate the influence of the cooperation factor, we now let the 10 regions have different cooperation factors, from 0.2 to 2.0 with an increment of 0.2 per region. Two scenarios are tested. (A): The regions are located so that each region has a lower cooperation factor than the regions on its left. (B): The regions are located so that each region has a lower cooperation factor than the regions on its right. We present the test results for different mission times $T$ in Tables 5 and 6, respectively. Comparing the rewards (i.e., $V^{(A)}(S)$ and $V^{(B)}(S)$) in the two tables, we find that the expected reward gained in scenario (A) is slightly higher than the one gained in scenario (B). In addition, the numbers below the cooperation factors are the conditional probabilities that vehicles 1 (left) and 2 (right) succeed in the corresponding region given that the information exists. Since the regions are the same except for their cooperation factors, when regions are shared, intuitively we should expect the vehicles to have larger probabilities to succeed in the regions that have larger cooperation factors. We observe this intuitive behavior in scenario (A), while in scenario (B) we find that both vehicles have the largest probabilities to succeed in the second last region ($\gamma = 1.8$).
This is explained as follows: If both vehicles reserve large amounts of time for the last region, they may end up wasting the remaining time if they detect the information early in the last region. As a result, they reserve the largest amounts of time for the second last region.

Table 5: Test results for scenario (A) with different mission time $T$
Columns: $T$; $\gamma = 2.0$; $\gamma = 1.8$; ...; $\gamma = 0.4$; $\gamma = 0.2$; $\tilde{V}^{(A)}(S)$; $V^{(A)}(S)$

Table 6: Test results for scenario (B) with different mission time $T$
Columns: $T$; $\gamma = 2.0$; $\gamma = 1.8$; ...; $\gamma = 0.4$; $\gamma = 0.2$; $\tilde{V}^{(B)}(S)$; $V^{(B)}(S)$

The sequence also influences the vehicles' behaviors in the regions with low cooperation factors. In scenario (A), we observe that both vehicles have large probabilities to succeed in the regions with low cooperation factors ($\gamma = 0.2, 0.4$) when more mission time is given ($T = 35, 40$). This is because the regions with low cooperation factors are searched at the end, and if any vehicle spends less time in the early stages of its search it will search these regions. However, in scenario (B) the regions with low cooperation factors are searched in the beginning. As a result, only one vehicle will commit to each of them since the cooperation factor is too low, i.e., it is not economical for both vehicles to search them without knowing how much time they will need in the future.

To further investigate the influence of the sequence, we take scenario (A) and then change the position (counted from the left) of the region with the highest cooperation factor ($\gamma = 2.0$). The results for $T = 40$ are presented in Table 7. Two observations are obtained from Table 7. When the region is located in a later position, both vehicles have smaller probabilities to succeed in the region. More importantly, the expected reward that the fleet can collect also decreases. Combined with the former observations, this suggests that it is more beneficial to search regions with higher cooperation factors before searching regions with lower cooperation factors.

Table 7: Results for different positions of the region with the highest cooperation factor
Columns: Position; $\tau^1$; $\tau^2$; Reward
$\tau^i$: The conditional probability for vehicle $i \in \{1, 2\}$ to succeed in the region given that the information exists.

The study of random scenarios.

In Table 3, we observe some cases where $\tilde{V}^{(R)}(S)$ is very different from $V^{(R)}(S)$ (maps 2, 4, 5, 6), while in Tables 5 and 6 the approximate and exact values are very close. To further investigate this observation, we use three different amounts of mission time, $T = 65, 80, 100$, and randomly generate 25 maps for each with the parameter setups listed in Table 8. The tests are performed for scenarios (L) and (R). As illustrated in Figure 1 in Section 4 and the results presented in Tables 1 and 2, when the two vehicles travel in the same direction, we expect large dependent sets; on the contrary, when they travel in opposite directions, we expect the dependent sets to be small or empty. We select the maps where we observed a considerable difference (more than 1.5% or smaller than -1.5%) between the approximate and exact values of the best policy obtained after sharing for each scenario (Ratio 1 and Ratio 2), or between the exact values of the best policies obtained for both scenarios (Ratio 3). We found 8 out of 75 random maps, which are listed in Table 8.

Comparing Ratio 1 and Ratio 2 in the table, we find that the exact and approximate values of the policy obtained for scenario (L) are more likely to be different. In maps 7, 8, 10, 11, 12, 13, and 14, we observe a large difference between $\tilde{R}(\pi)$ and $R(\pi)$. On the contrary, the difference in scenario (R) can be negligible except for map 22. This shows that a larger dependent set is more likely to lead to a difference between $\tilde{R}(\pi)$ and $R(\pi)$. However, the influence of the dependency can be either beneficial or harmful. In maps 8, 10, 13, 14, $R(\pi)$ is greater than $\tilde{R}(\pi)$, but in maps 7, 11, 12, $R(\pi)$ is smaller.
Combining this observation with Ratio 3, we find that when the dependency has a positive influence, the policy obtained in scenario (L) can be better than the one obtained in scenario (R), where the influence of the dependency is negligible. We also notice that 7 maps are selected for T = 80, 100 but only 1 map is selected for T = 60. This shows that the dependency has a larger impact when the fleet is given more mission time. Finally, for both scenarios presented in Table 8, $\tilde{R}(\pi)$ and $R(\pi)$ are close in map 9 and $R(\pi)$ is greater than $\tilde{R}(\pi)$ in map 14. However, in both maps we observe that the policy found in scenario (R) is better than the one found in scenario (L). This means that regardless of the dependency, the travel direction can have a significant influence on the performance of the policy obtained through region-sharing. This observation is consistent with the observation obtained from Tables 5 and 6 that the search sequence of the regions can have an impact on the expected reward that the fleet can collect.
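The map-selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the ratios follow the definitions given with Table 8, and the numeric values are made up for the example rather than taken from the study.

```python
def ratios(v_tilde_l, v_l, v_tilde_r, v_r):
    """Return (Ratio 1, Ratio 2, Ratio 3) as fractions rather than percentages."""
    ratio1 = v_l / v_tilde_l - 1.0  # exact vs. approximate value, scenario (L)
    ratio2 = v_r / v_tilde_r - 1.0  # exact vs. approximate value, scenario (R)
    ratio3 = v_l / v_r - 1.0        # exact values, scenario (L) vs. scenario (R)
    return ratio1, ratio2, ratio3


def is_selected(v_tilde_l, v_l, v_tilde_r, v_r, threshold=0.015):
    """A map is kept when any ratio is above +1.5% or below -1.5%."""
    return any(abs(r) > threshold
               for r in ratios(v_tilde_l, v_l, v_tilde_r, v_r))


# Hypothetical values, for illustration only (not taken from Table 8).
selected = is_selected(v_tilde_l=10.0, v_l=10.4, v_tilde_r=10.2, v_r=10.2)
not_selected = is_selected(v_tilde_l=10.0, v_l=10.05, v_tilde_r=10.0, v_r=10.0)
```

In the first hypothetical map the exact value in scenario (L) exceeds the approximate one by 4%, so the map would be kept; in the second, every ratio stays within the ±1.5% band.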

Table 8: Random maps for testing the influence of travel directions
Columns: Map; W; L; T; $\tilde{V}^{(L)}(S)$; $V^{(L)}(S)$; $\tilde{V}^{(R)}(S)$; $V^{(R)}(S)$; Ratio 1; Ratio 2; Ratio 3. Only the last two columns are legible, row by row:

Ratio 2    Ratio 3
0.29%      -3.34%
0.00%      6.15%
0.00%      -6.33%
0.09%      5.74%
0.08%      -1.74%
0.00%      -7.85%
0.00%      4.82%
7.42%      -3.08%

Ratio 1: $V^{(L)}(S)/\tilde{V}^{(L)}(S) - 1$; Ratio 2: $V^{(R)}(S)/\tilde{V}^{(R)}(S) - 1$; Ratio 3: $V^{(L)}(S)/V^{(R)}(S) - 1$.
Unlisted parameters: $[\underline{\gamma}, \overline{\gamma}] = [0.0, 2.0]$, $[\underline{e}, \overline{e}] = [0.3, 0.7]$, $[\underline{\rho}, \overline{\rho}] = [0.15, 0.35]$.

6.5 Multi-vehicle Implementation

Finally, we explain how to implement the method in cases having more than two vehicles under Assumption 1. The approximate formulation (IAP) and the solution algorithm (Algorithm 1) can be modified in the following way: for each vehicle $i$, problem (IAP$_i$) is constructed by fixing the policies applied by all vehicles except vehicle $i$, where $\hat{\tau}^{-i}_{\alpha}(\pi)$, $\alpha \in S_i$, may be calculated from different vehicles' policies. Then we can obtain a Markovian policy by solving problem (IAP$_i$) using the same formulation defined by (15) and (17). Within each iteration of the algorithm, we solve problem (IAP$_i$) for all $i \in U$, and the algorithm continues until no improvement is made during an entire iteration, at which point a local optimum of problem (IAP) is reached.

To calculate the expected reward of the obtained policy, we first calculate $R^{\pi^i}$, $i \in U$, i.e., the expected reward that vehicle $i$ collects from its non-shared regions, for which we can use the same algorithm provided in §B. Since a shared region $\alpha$ is assigned to two vehicles under Assumption 1, we can still use Theorem 3 to identify the dependent set of this region with respect to the two vehicles. Then (19) can be applied to calculate the expected reward collected from the region. However, as explained in §4.4, calculating the expected reward from a shared region requires enumerating $2^{|S|-1}$ possible scenarios in the worst case, where $|S|$ is the number of shared regions between the two vehicles.
If every vehicle shares regions with each of the other vehicles, we need to enumerate a total of $\sum_{i \neq j} 2^{|S_i \cap S_j| - 1}$ possible scenarios in the worst case, which is computationally expensive. This is the reason that we do not perform numerical tests for scenarios having more than two vehicles. Nevertheless, the computation time to obtain a Markovian policy that establishes a local optimum of problem (IAP) does not increase significantly, since the computation time of each iteration of Algorithm 1 increases linearly with the number of vehicles. Finally, we should note that the bound in Theorem 5 still holds in the multi-vehicle case as long as Assumption 1 holds. The corresponding proof can be mimicked using the steps presented in §C.
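The multi-vehicle iteration described in this section has the shape of a best-response loop: one vehicle's policy is re-optimized at a time while the others are held fixed, and the loop stops after a full pass without improvement. The sketch below illustrates only that control flow. It assumes each vehicle has a small finite set of candidate policies and replaces the solution of (IAP$_i$) via (15) and (17) with plain enumeration; `best_response_iteration`, `toy_objective` and the integer "policies" are hypothetical stand-ins, not the paper's formulation.

```python
def best_response_iteration(policy_sets, objective, start):
    """Re-optimize one vehicle at a time until a full pass brings no improvement.

    policy_sets[i] is the finite candidate set of vehicle i (an assumption of
    this sketch); objective maps a tuple of policies to a number.
    """
    current = list(start)
    value = objective(tuple(current))
    improved = True
    while improved:
        improved = False
        for i, candidates in enumerate(policy_sets):
            def with_policy(p):
                # Fleet objective when only vehicle i switches to policy p.
                trial = list(current)
                trial[i] = p
                return objective(tuple(trial))

            best = max(candidates, key=with_policy)
            best_value = with_policy(best)
            if best_value > value:  # accept strict improvements only
                current[i], value = best, best_value
                improved = True
    return tuple(current), value


def toy_objective(policies):
    """Hypothetical coupled objective standing in for the approximate reward."""
    a, b, c = policies
    return 4 * a - a * a + 3 * b - b * b + 2 * c - c * c + a * b - b * c


policy_sets = [range(4), range(4), range(4)]
fleet_policy, fleet_value = best_response_iteration(
    policy_sets, toy_objective, start=(0, 0, 0))
```

Because only strict improvements are accepted and the candidate sets are finite, the loop terminates, and on termination no single vehicle can improve the objective by changing its own policy. This is the sense in which the result is a local optimum; each pass solves one subproblem per vehicle, which matches the linear growth of per-iteration time noted above.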

7. Conclusion and Future Research

This paper studies how to route a fleet of vehicles to search and collect information from a set of regions with uncertain information. We demonstrate the benefit of using a region-sharing strategy in a decentralized environment and develop a method that overcomes the computational difficulty of the associated time-allocation problem. A Markovian policy is derived to guide the vehicles' decentralized decisions, which is obtained through a decomposable approximation of the original problem. We propose the concept of a dependent set to calculate the exact value of such a policy using conditional independence. A sufficient condition is developed under which there exists an optimal Markovian policy that solves the time-allocation problem, which also implies that the approximation is exact under the condition. To examine the performance loss of using a Markovian policy, we develop a tight upper bound on the performance of a decentralized time-allocation policy. Through a numerical study we show that region-sharing is beneficial even in scenarios where the regions' cooperation factors are low. In addition, when the cooperation factors increase, we observe an increasing trend in the performance improvement through region-sharing. These results establish the value of the method developed in this paper. Insights are gained for understanding how mission time, cooperation factor and search sequence influence the vehicles' behaviors under a region-sharing strategy.

The strength and simplicity of a Markovian policy make it a valuable alternative solution for other DEC-POMDPs that involve multi-agent sequential resource allocation. The resource to allocate is not restricted to time. For instance, we may have a team of agents, each of which has a fixed budget to fund several assigned projects. Each project may consume a different amount of money to continue in each period and the success of a (some) project(s) provides a (joint) reward to the team.
A Markovian policy can be obtained using a similar iterative method under an independency assumption.

The study in this paper has several limitations. The theory only holds for the scenario where a region is shared by no more than two vehicles, since we do not provide a reward model to determine the joint reward collected in a region that is searched by more than two vehicles. Given a proper reward model, we can still find a Markovian policy using the iterative algorithm designed in the paper under the same independency assumption, but the upper bound will not hold. We leave this investigation for future research, considering that many missions may require cooperation among several vehicles for a single task (region). Another extension is to allow a region to have multiple pieces of information. In this situation, a vehicle might not leave a region after detecting a single piece of information. Moreover, our model assumes that the existence of information and the corresponding detection time in different regions are independent, and that the detection times of two vehicles in a shared region are conditionally independent given that information exists. Relaxation of these independence assumptions is suggested for future work.

In short, our study is only an initial attempt to model and solve the decentralized multi-vehicle resource-allocation problem for information searching and collecting. With future relaxations, we believe that the model can cover a broader class of DEC-POMDP problems. Our hope is that the proposed method can turn into a powerful tool to conquer those problems, which are known for their notorious computational complexity.


A. Integer Programming Formulation for Routing Models

In this section, we provide two examples to illustrate how to create initial route families. The first example minimizes the total travel time of the fleet under the condition that each region has to be visited by exactly one vehicle. This model seeks the maximum total amount of remaining time for the fleet to invest in searching and information collecting; however, it may assign an unbalanced workload to the fleet members. For instance, a vehicle may incur a large travel time to cover many regions so that it has little time to invest in searching. On the contrary, another may have a large amount of remaining time to spend but only have a few regions to search. To help circumvent this issue, the second model minimizes the maximal travel time of a vehicle. This model seeks a solution under which each vehicle will have a reasonable amount of time to spend in searching and information collecting after the assignment.

We define the following notation.

$U$: set of vehicles.
$A$: set of regions to search, $A = \{1, 2, \ldots, L\}$.
$m$: index of a vehicle, $m \in U$.
$i, j$: indices of regions, $i, j \in A$.
$0, L+1$: indices that represent a vehicle's start and end depots, respectively.
$x^m_{i,j}$: a binary variable that represents whether vehicle $m$ travels from region $i$ to $j$, $i, j \in A$.
$x^m_{0,i}$: a binary variable that represents whether vehicle $m$ travels from its start depot to region $i$, $i \in A$.
$x^m_{i,L+1}$: a binary variable that represents whether vehicle $m$ travels from region $i$ to its end depot, $i \in A$.
$x^m_{0,L+1}$: a binary variable that represents whether vehicle $m$ travels from its start depot to its end depot.
$d(i, j)$: a given integer that provides the travel time between regions $i$ and $j$, $i, j \in A$.
$d^m(0, i)$: a given integer that provides the time for vehicle $m$ to travel from its start depot to region $i$, $i \in A$.
$d^m(i, L+1)$: a given integer that provides the time for vehicle $m$ to travel from region $i$ to its end depot, $i \in A$.
$d^m(0, L+1)$: a given integer that provides the time for vehicle $m$ to travel from its start depot to its end depot.
$k^m_i$: an integer variable that represents the order index of region $i$ for vehicle $m$.
$z^m$: an integer variable that represents the total travel time of vehicle $m$.
$z^{\max}$: an integer variable that represents the maximal travel time of a vehicle among the fleet.

We first present the formulation for the model that minimizes the total travel time of the fleet subject to the condition that each region in the area is visited by exactly one vehicle.

(Rout 1): Minimize the total travel time.

$$\min \sum_{m \in U} \Big[ \sum_{i,j \in A} d(i,j)\, x^m_{i,j} + \sum_{i \in A} \big( d^m(0,i)\, x^m_{0,i} + d^m(i,L+1)\, x^m_{i,L+1} \big) + d^m(0,L+1)\, x^m_{0,L+1} \Big]$$

subject to

$$\sum_{i \in A} x^m_{0,i} + x^m_{0,L+1} = 1, \quad \forall m \in U, \tag{24a}$$
$$\sum_{i \in A} x^m_{i,L+1} + x^m_{0,L+1} = 1, \quad \forall m \in U, \tag{24b}$$
$$x^m_{0,i} + \sum_{j \in A} x^m_{j,i} = x^m_{i,L+1} + \sum_{j \in A} x^m_{i,j}, \quad \forall i \in A,\ m \in U, \tag{24c}$$
$$\sum_{m \in U} \Big[ \sum_{j \neq i} x^m_{j,i} + x^m_{0,i} \Big] = 1, \quad \forall i \in A, \tag{24d}$$
$$k^m_j - k^m_i \ge x^m_{i,j} - (1 - x^m_{i,j})(L - 1), \quad \forall i, j \in A,\ m \in U, \tag{24e}$$
$$x^m_{i,j} \in \{0, 1\}, \quad \forall i, j \in A,\ m \in U, \tag{24f}$$
$$x^m_{0,i} \in \{0, 1\}, \quad \forall i \in A,\ m \in U, \tag{24g}$$
$$x^m_{i,L+1} \in \{0, 1\}, \quad \forall i \in A,\ m \in U, \tag{24h}$$
$$x^m_{0,L+1} \in \{0, 1\}, \quad \forall m \in U. \tag{24i}$$

In (Rout 1), the objective function is the total travel time of the fleet. (24a)-(24c) compose the flow balance from the start depot to the end depot for each vehicle. (24d) requires that each region is visited by exactly one vehicle. (24e) enforces a strict order of the regions visited by a vehicle. (24f)-(24i) are binary constraints.

Then we present the formulation for minimizing the maximal travel time of a fleet member while each region still has to be assigned to at least one vehicle.

(Rout 2): Minimize the maximal travel time of each vehicle.

$$\min z^{\max}$$

subject to (24a)-(24i) and

$$z^{\max} \ge \sum_{i,j \in A} d(i,j)\, x^m_{i,j} + \sum_{i \in A} \big( d^m(0,i)\, x^m_{0,i} + d^m(i,L+1)\, x^m_{i,L+1} \big) + d^m(0,L+1)\, x^m_{0,L+1}, \quad \forall m \in U. \tag{25}$$

In (Rout 2), the right-hand side of (25) is the total travel time of vehicle $m$. Since $z^{\max}$ is greater than or equal to each vehicle's total travel time, minimizing $z^{\max}$ is equivalent to minimizing the maximal travel time of a fleet member.

Next, we present our re-optimization methods. Consider that vehicle $m$ receives a new assignment $H^m = \{h^m_1, h^m_2, \ldots, h^m_{L_m}, h^m_{L_m+1}, \ldots, h^m_{L_m+v_m}\}$. Here $h^m_1, h^m_2, \ldots, h^m_{L_m}$ are the indices of the originally assigned regions and $1, 2, \ldots, L_m$ are their corresponding order indices in the original route. $h^m_{L_m+1}, \ldots, h^m_{L_m+v_m}$ are the indices of the shared regions added to the route. We first introduce the formulation for the unconstrained re-optimization method.
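As a rough illustration of what the unconstrained and constrained re-optimization methods compute, the sketch below brute-forces a single vehicle's route on a toy instance instead of solving an integer program. Everything in it is an assumption made for the example: the coordinates, the Manhattan travel times, and the names `route_time` and `reoptimize`. Enumeration is only practical for a handful of regions.

```python
from itertools import permutations


def route_time(order, start, end, dist):
    """Total travel time of start depot -> regions in `order` -> end depot."""
    stops = [start, *order, end]
    return sum(dist[a][b] for a, b in zip(stops, stops[1:]))


def reoptimize(original, added, start, end, dist, keep_original_order=False):
    """Return (time, order) of the shortest route over all assigned regions.

    With keep_original_order=True, the originally assigned regions must keep
    their relative order, as in the constrained re-optimization.
    """
    best = None
    for order in permutations([*original, *added]):
        if keep_original_order:
            kept = [r for r in order if r in original]
            if kept != list(original):
                continue
        time = route_time(order, start, end, dist)
        if best is None or time < best[0]:
            best = (time, order)
    return best


# Hypothetical toy instance: integer grid points with Manhattan travel times.
coords = {"start": (0, 0), "end": (6, 0),
          1: (4, 0), 2: (2, 0), 3: (1, 1), 4: (5, 1)}
dist = {a: {b: abs(pa[0] - pb[0]) + abs(pa[1] - pb[1])
            for b, pb in coords.items()}
        for a, pa in coords.items()}

# Regions 1 and 2 are the original route (in that order); 3 and 4 are added.
unconstrained = reoptimize([1, 2], [3, 4], "start", "end", dist)
constrained = reoptimize([1, 2], [3, 4], "start", "end", dist,
                         keep_original_order=True)
```

Since the constrained search explores a subset of the routes considered by the unconstrained one, its travel time can never be smaller. The gap between the two is the price of keeping the original visiting order.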

(Rout 3): Minimize the total travel time of vehicle $m$.

$$\min \sum_{i,j \in H^m} d(i,j)\, x^m_{i,j} + \sum_{i \in H^m} \big( d^m(0,i)\, x^m_{0,i} + d^m(i,L+1)\, x^m_{i,L+1} \big) + d^m(0,L+1)\, x^m_{0,L+1}$$

subject to

$$\sum_{i \in H^m} x^m_{0,i} = 1, \tag{26a}$$
$$\sum_{i \in H^m} x^m_{i,L+1} = 1, \tag{26b}$$
$$\sum_{j \in H^m} x^m_{j,i} + x^m_{0,i} = \sum_{j \in H^m} x^m_{i,j} + x^m_{i,L+1}, \quad \forall i \in H^m, \tag{26c}$$
$$\sum_{j \in H^m} x^m_{j,i} + x^m_{0,i} = 1, \quad \forall i \in H^m, \tag{26d}$$
$$t^m_j - t^m_i \ge x^m_{i,j} - (1 - x^m_{i,j})(L - 1), \quad \forall i, j \in H^m, \tag{26e}$$
$$x^m_{i,j} \in \{0, 1\}, \quad \forall i, j \in A, \tag{26f}$$
$$x^m_{0,i} \in \{0, 1\}, \quad \forall i \in A, \tag{26g}$$
$$x^m_{i,L+1} \in \{0, 1\}, \quad \forall i \in A, \tag{26h}$$
$$x^m_{0,L+1} \in \{0, 1\}. \tag{26i}$$

In (Rout 3), the objective function is the total travel time of vehicle $m$. Constraints (26a)-(26c) establish the flow balance. (26d) requires that each assigned region has to be visited by the vehicle. (26e) enforces a strict order of the assigned regions to be visited. (26f)-(26i) are binary constraints.

Finally, we present the formulation for the constrained re-optimization method.

(Rout 4): Minimize the total travel time of vehicle $m$ but retain the orders of its original regions.

$$\min z^{\max}$$

subject to (26a)-(26i) and

$$t^m_{h^m_{k+1}} \ge t^m_{h^m_k}, \quad k = 1, 2, \ldots, L_m - 1. \tag{27}$$

(27) is the additional constraint that we add for the constrained re-optimization, which retains the orders of the original regions. Note that region $h^m_{k+1}$ has a higher order than region $h^m_k$ in the original route, and (27) enforces that it must be visited later than region $h^m_k$ in the re-optimized route.

B. An Algorithm to Calculate the Expected Reward and Success Probabilities

For a suggested route $H^i = \{h^i_0, h^i_1, h^i_2, \ldots, h^i_{L_i}, h^i_{L_i+1}\}$ and a Markovian policy $\pi^i$, Algorithm 2 provides two outputs: $\tau^i_{h^i_k}(\pi^i \mid \theta_{V^i_k})$ for $h^i_k \in A$ and $R^{\pi^i}$ (only when $V = \emptyset$). For any given $V \subseteq S$, we define $V^i_k = \{h^i_1, h^i_2, \ldots, h^i_{k-1}\} \cap V$. $\tau^i_{h^i_k}(\pi^i \mid \theta_{V^i_k})$ is the conditional probability that vehicle $i$ succeeds in region $h^i_k$ given $\theta_{V^i_k}$ and $\theta_{h^i_k} = 1$.

35 Agorithm 2 Require: H i = {h i 0, h i 1, h i 2,..., h i L i, h i L }, π i+1 i = {z i (T ); x i (t), y i (t), z i (t) : t = 0, 1, 2,..., T, = 1, 2,..., L h i i i i i } and θ V. 1: Initiaization: R πi 0; c 0; f s (t θ h i V i ), f c (t θ i h i V i ), f c (t θ i h i V i ), f d (t θ i h i V i ) 0 for = 1 : L i, t = 0 : T ; +1 i +1 τ i (π h i i θ V i ) 0 for = 1 : L i. i 2: c z i (T ), f h i s d(h 0 c(t i 0, h i i c ) θ V i ) 1. h i 3: for = c : L i do 4: for t : f s (t θ h i V i ) > 0 do i 5: for x = 1 : x i (t) do h i c 6: f (t x θ h i V i ) f c (t x θ i h i V i ) + e h h i p h i (x). 7: end for 8: if h i V then 9: for x = 1 : x i (t) do h i 10: f c (t x θ h i V i ) f c (t x θ i h i V i ) + e h h +1 i i p h i (x) : end for 12: f d (t x i (t) θ h i i V i ) f d (t x i (t) θ i h i i V i ) + [1 e h h +1 i i P h i (x i (t))]f s (t θ h i i V i ). +1 i 13: ese 14: if θ h i = 1 then 15: for x = 1 : x i (t) do h i 16: f c (t x θ h i V i ) f c (t x θ i h i V i ) + p h h +1 i i (x) : end for 18: f d (t x i (t) θ h i i V i ) f d (t x i (t) θ i h i i V i ) + [1 P h h +1 i i (x i (t))]f s (t θ h i i V i ). +1 i 19: ese 20: f d (t x i (t) θ h i i V i ) f d (t x i (t) θ i h i i V i ) + f s (t θ h +1 i h i V i ). +1 i 21: end if 22: end if 23: end for c 24: for t : f (t θ h i V i ) > 0 do i 25: τ i (π h i i θ V i ) τ i (π i h i i θ V i ) + y i (t) f c (t θ h i h i i V i )/(1 e h i i ). 26: end for 27: for t : f c (t θ h i V i ) > 0 do i +1 28: if i O i and V = then 29: R πi R πi + y i (t)f c (t θ h i i V i )g h i i : end if 31: f d (t y i (t)s h i i h i θ V i ) f d (t y i (t)s h i h i i h i θ V i ) + f c (t θ h +1 i h i V i ). +1 i +1 32: end for 33: for t : f d (t θ h i V i ) > 0 do i +1 34: d z i (t), f s (t d(h i h i i, hi d ) θ V i ) f d (t θ d h i h i V i ). i d +1 35: end for 36: end for 37: return R πi and τ i (π h i i θ V i ) for = 1 : L i. i 35

36 We use the foowing quantities in Agorithm 2: For a given θ V i, f s h i h(t θ i V i ) is the conditiona h i probabiity that vehice i arrives at region h i with t units of remaining time and f c h(t θ i V i ) is the h i conditiona probabiity that vehice i detects the information in region h i with t units of remaining time. For a given θ V i, f c (t θ h i h i V i ) is the conditiona probabiity that vehice i detects the information +1 i +1 in h i with t units of remaining time and f d h(t θ i V i ) is the conditiona probabiity that vehice i eaves h i +1 region h i with t units of remaining time. In the agorithm, ines 3-23 correspond to searching. Lines 6-9 update f s h(t θ i V i ). Lines 8-23 update h i V; ines are for the scenario f c h(t θ i V i ) and f d (t θ h i h i V i ): Lines 9-12 are for the scenario h i +1 i +1 where θ h i = 1 is given in θ V i ; ine 20 is for the scenario where θ h h i i = 0 is given in θ V i. Lines h +1 i correspond to information coecting. Lines update τ i h(π i i θ V i ) using f s (t θ h i h i V i ) cacuated i ). Lines correspond in ines 6-9. Lines update the reward R π i. Line 31 updates f d h i (t θ V i h i +1 to eaving a region where we update f s h i d (t θ V i h i d ) for the regions that wi be visited afterwards using f d h(t θ i V i ). h i +1 Finay, if we set V =, the agorithm outputs R π i and τ i h(π i i ). If we set V = Φ β where Φ β is the argest dependent set, τ i h(π i i θ V i ) is the conditiona probabiity required by (19) in 4.4. h i C. Proofs This section presents the proofs of the theorems and propositions proposed in 4 and 5 of the main artice. Let x be an arbitrary -dimension integer vector, throughout the reminder of the appendix, represents that the summation is over the -dimension integer attice, i.e., Z. x Proof of Theorem 2. The feet s poicy changes ony in ine 5 and ine 7 of Agorithm 1. Without osing generaity, we assume that i = 1, j = 2. 
Let (π1, π2) be the poicy obtained after iterations of the oop defined in ines R and R are the corresponding vaues of R and R at the end of the th iteration (after ine 9 is executed). R represents the objective vaue achieved in the former iteration and R is the objective vaue achieved in the current iteration. At iteration > 1, the probem soved in ine 5 is π1 arg max{ R(π 1, π2 1 ) : π 1 Π 1 } and the probem soved in ine 7 is π2 arg max{ R(π 1, π 2 ) : π 2 Π 2 }. We shoud note that R 1 = R(π 1, π2 1 ) and R = R(π 1, π2). First, we have R = R(π 1 1, π 1 2 ) max{ R(π 1, π 1 2 ) : π 1 Π 1 } max{ R(π 1, π 2 ) : π 2 Π 2 } = R. (28) If R > R for = 1, 2,..., K, we have R(π 1 K, π2 K K 1 ) > R(π 1, π2 K 1 ) >... > R(π 1, 1 π2). 1 Thus, we must go over K different poicies. Since there is a finite number of poicies, after a finite number of iterations (K), we must reach R K 1 = R, which is equivaent to R(π 1, π2 K 1 ) = R(π 1 K, π2 K ). According to (28), π1 K 1 must sove max{ R(π 1, π2 1 ) : π 1 Π 1 } in ine 5 and π2 K 1 1 must sove max{ R(π 1, π 2 ) : π 2 Π 2 } 36

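in line 7 (note that no update is made at line 5); otherwise, the objective value achieved in the current iteration must be strictly greater than the one achieved in the former iteration. Therefore,

$$\pi_1^{K-1} \in \arg\max\{\tilde{R}(\pi_1, \pi_2^{K-1}) : \pi_1 \in \Pi_1\}, \qquad \pi_2^{K-1} \in \arg\max\{\tilde{R}(\pi_1^{K-1}, \pi_2) : \pi_2 \in \Pi_2\},$$

and the algorithm converges to a local optimum of problem (IAP).

Assumption 2 serves as the key assumption to establish the proof. We first introduce a lemma that develops three direct results from Assumption 2.

Lemma 2. Under Assumption 2, for any route family we have the following results:

i) For $V \subseteq A$,

$$P(\{\theta_\alpha\}_{\alpha \in V}) = \prod_{\alpha \in V} f_{\theta_\alpha}(\theta_\alpha). \tag{29}$$

ii) Consider two region sets $V_1, V_2 \subseteq A$. Let $S = V_1 \cap V_2$ and $O_i = V_i \setminus S$, $i \in \{1, 2\}$. We have

$$P(\{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in O_1}, \{w^2_{\alpha_2}, \theta_{\alpha_2}\}_{\alpha_2 \in O_2}, \{w^1_\alpha, w^2_\alpha\}_{\alpha \in S} \mid \{\theta_\alpha\}_{\alpha \in S})$$
$$= \prod_{\alpha_1 \in O_1} f_{w^1_{\alpha_1}}(w^1_{\alpha_1} \mid \theta_{\alpha_1}) f_{\theta_{\alpha_1}}(\theta_{\alpha_1}) \prod_{\alpha_2 \in O_2} f_{w^2_{\alpha_2}}(w^2_{\alpha_2} \mid \theta_{\alpha_2}) f_{\theta_{\alpha_2}}(\theta_{\alpha_2}) \prod_{\alpha \in S} f_{w^1_\alpha}(w^1_\alpha \mid \theta_\alpha) f_{w^2_\alpha}(w^2_\alpha \mid \theta_\alpha)$$
$$= P(\{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in O_1}, \{w^1_\alpha\}_{\alpha \in S} \mid \{\theta_\alpha\}_{\alpha \in S})\, P(\{w^2_{\alpha_2}, \theta_{\alpha_2}\}_{\alpha_2 \in O_2}, \{w^2_\alpha\}_{\alpha \in S} \mid \{\theta_\alpha\}_{\alpha \in S}), \tag{30}$$

where

$$P(\{w^i_{\alpha_i}, \theta_{\alpha_i}\}_{\alpha_i \in O_i}, \{w^i_\alpha\}_{\alpha \in S} \mid \{\theta_\alpha\}_{\alpha \in S}) = \prod_{\alpha_i \in O_i} f_{w^i_{\alpha_i}}(w^i_{\alpha_i} \mid \theta_{\alpha_i}) f_{\theta_{\alpha_i}}(\theta_{\alpha_i}) \prod_{\alpha \in S} f_{w^i_\alpha}(w^i_\alpha \mid \theta_\alpha), \quad i \in \{1, 2\}. \tag{31}$$

iii) Consider two region sets $V_1, V_2 \subseteq A$ satisfying $V_1 \cap V_2 = \emptyset$. We have

$$P(\{w^2_{\alpha_2}, \theta_{\alpha_2}\}_{\alpha_2 \in V_2}, \{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in V_1}) = P(\{w^2_{\alpha_2}, \theta_{\alpha_2}\}_{\alpha_2 \in V_2})\, P(\{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in V_1}). \tag{32}$$

Moreover, let $g(\{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in V_1})$ be a function of $\{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in V_1}$ and $x$ be a reachable value of $g(\{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in V_1})$. We have

$$P(\{w^2_{\alpha_2}, \theta_{\alpha_2}\}_{\alpha_2 \in V_2} \mid g(\{w^1_{\alpha_1}, \theta_{\alpha_1}\}_{\alpha_1 \in V_1}) = x) = \prod_{\alpha_2 \in V_2} f_{w^2_{\alpha_2}}(w^2_{\alpha_2} \mid \theta_{\alpha_2}) f_{\theta_{\alpha_2}}(\theta_{\alpha_2}). \tag{33}$$

Proof. i) Let $U = \{1\}$ in (4) and we have

$$P(\{\theta_\alpha\}_{\alpha \in V}) = \sum_{\{w^1_\alpha\}_{\alpha \in V}} P(\{w^1_\alpha, \theta_\alpha\}_{\alpha \in V}) = \sum_{\{w^1_\alpha\}_{\alpha \in V}} \prod_{\alpha \in V} f_{w^1_\alpha}(w^1_\alpha \mid \theta_\alpha) f_{\theta_\alpha}(\theta_\alpha) = \prod_{\alpha \in V} \Big[ f_{\theta_\alpha}(\theta_\alpha) \sum_{w^1_\alpha} f_{w^1_\alpha}(w^1_\alpha \mid \theta_\alpha) \Big] = \prod_{\alpha \in V} f_{\theta_\alpha}(\theta_\alpha).$$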

38 ii). It is obvious that P ({w 1 α 1, θ α1 } α1 O 1, {w2 α 2, θ α2 } α2 O 2, {w1 α, w 2 α} α S {θ α } α S ) = P ({w1 α 1, θ α1 } α1 O 1, {w2 α 2, θ α2 } α2 O 2, {w1 α, wα, 2 θ α } α S ). P ({θ α } α S ) (34) Using Assumption 2 and (29), (34) can be written as P ({wα 1 1, θ α1 } α1 O 1, {w2 α 2, θ α2 } α2 O 2, {w1 α, wα} 2 α S {θ α } α S ) α 1 O f 1 = w 1 α1 (wα 1 1 θ α1 )f θα1 (θ α1 ) α 2 O f 2 w 2 α2 (wα 2 2 θ α2 )f θα2 (θ α2 ) α S f w 1 α (wα θ 1 α )f w i α (wα θ 2 α )f θα (θ α ) α S f θα (θ α ) = f w 1 α1 (wα 1 1 θ α1 )f θα1 (θ α1 ) f w 2 α2 (wα 2 2 θ α2 )f θα2 (θ α2 ) f w 1 α (wα θ 1 α )f w 2 α (wα θ 2 α ). α S α 1 O 1 α 2 O 2 We ony need to show that (31) is true. Using Assumption 2 and (29), we have P ({wα i i, θ αi } αi O i, {wi α} α S {θ α } α S ) = P ({wi α i, θ αi } αi O i, {wi α, θ α } α S ) P ({θ α } α S ) α i O f i = w i αi (wα i i θ αi )f θαi (θ αi ) α S f w i α (wα θ i α )f θα (θ α ) = f w i α S f θα (θ α ) αi (wα i i θ αi )f θαi (θ αi ) f w i α (wα θ i α ). α S iii). We first prove (32). Using Assumption 2, P ({w 2 α 2, θ α2 } α2 V 2, {w 1 α 1, θ α1 } α1 V 1 ) = =P ({w 2 α 2, θ α2 } α2 V 2 )P ({w 1 α 1, θ α1 } α1 V 1 ). α i O i α 1 V 1 f w 1 α1 (w 1 α 1 θ α1 )f θα1 (θ α1 ) α 2 V 2 f w 2 α2 (w 2 α 2 θ α2 )f θα2 (θ α2 ) The ast equaity aso comes from Assumption 2 by setting U = {i} and A i = V i for i {1, 2}. Now we prove (33). Define an indicator function κ(x, y) so that κ(x, y) = 1 if x = y; otherwise, κ(x) = 0. P ({wα 2 2, θ α2 } α2 V 2 g({wα 1 1, θ α1 } α1 V 1 ) = x) {wα = 1 1,θ α1 } α1 κ(g({w 1 V 1 α 1, θ α1 } α1 V 1 ), x)p ({wα 2 2, θ α2 } α2 V 2, {wα 1 1, θ α1 } α1 V 1 ) {wα 1 1,θ α1 } α1 κ(g({w 1 V 1 α 1, θ α1 } α1 V 1, x)p ({wα 1 1, θ α1 } α1 V 1 ) = [ {wα 1 1,θ α1 } α1 κ(g({w 1 V 1 α 1, θ α1 } α1 V 1 ), x)p ({wα 1 1, θ α1 } α1 V 1 )]P ({wα 2 2, θ α2 } α2 V 2 ) {wα 1 1,θ α1 } α1 κ(g({w 1 V 1 α 1, θ α1 } α1 V 1, x)p ({wα 1 1, θ α1 } α1 V 1 ) =P ({wα 2 2, θ α2 } α2 V 2 ) = f w 2 α2 (wα 2 2 θ α2 )f θα2 (θ α2 ). 
Note that the second equality comes from (32).

We define a random variable $t^{\pi^i}_{h^i_k}$: for a random scenario, if region $h^i_k$ is not skipped by vehicle $i$ under policy $\pi^i$, $t^{\pi^i}_{h^i_k}$ is the remaining time of the vehicle when it arrives at region $h^i_k$; otherwise, $t^{\pi^i}_{h^i_k}$ is equal to

$d(h^i_k, h^i_{L_i+1})$. Here a random scenario refers to a realization of the random variables $\{w^i_{h^i_{k'}}, \theta_{h^i_{k'}}\}_{k'=1:k-1}$. With the help of $t^{\pi^i}_{h^i_k}$, we can define an indicator function $u^{\pi^i}_{h^i_k}(w^i_{h^i_k}, t^{\pi^i}_{h^i_k})$ so that $u^{\pi^i}_{h^i_k}(w^i_{h^i_k}, t^{\pi^i}_{h^i_k}) = 1$ if vehicle $i$ succeeds in region $h^i_k$ given $w^i_{h^i_k}$ and $t^{\pi^i}_{h^i_k}$; otherwise, $u^{\pi^i}_{h^i_k}(w^i_{h^i_k}, t^{\pi^i}_{h^i_k}) = 0$. Such an indicator function can be represented as follows:

$$u^{\pi^i}_{h^i_k}(w^i_{h^i_k}, t^{\pi^i}_{h^i_k}) = \begin{cases} 1 & \text{if } w^i_{h^i_k} \le x^i_{h^i_k}(t^{\pi^i}_{h^i_k}) \text{ and } y^i_{h^i_k}(t^{\pi^i}_{h^i_k} - w^i_{h^i_k}) = 1, \\ 0 & \text{otherwise.} \end{cases} \tag{35}$$

To show that the indicator function defined by (35) is consistent with the fact, we consider three possible cases for a random scenario:

1. Region $h^i_k$ is skipped under the given scenario: When region $h^i_k$ is skipped, we have $t^{\pi^i}_{h^i_k} = d(h^i_k, h^i_{L_i+1})$. It is obvious that $x^i_{h^i_k}(t^{\pi^i}_{h^i_k}) = 0 < 1 \le w^i_{h^i_k}$. Thus, we have $u^{\pi^i}_{h^i_k}(w^i_{h^i_k}, t^{\pi^i}_{h^i_k}) = 0$ in this case.

2. Region $h^i_k$ is visited by vehicle $i$ but does not have information: When region $h^i_k$ does not have information, we have $w^i_{h^i_k} = T + 1$. Since $x^i_{h^i_k}(t^{\pi^i}_{h^i_k}) \le T < T + 1$, we also have $u^{\pi^i}_{h^i_k}(w^i_{h^i_k}, t^{\pi^i}_{h^i_k}) = 0$ in this case.

3. Region $h^i_k$ is visited by vehicle $i$ and has information: The search time scheduled by vehicle $i$ in region $h^i_k$ is $x^i_{h^i_k}(t^{\pi^i}_{h^i_k})$, and the information can be detected if and only if $x^i_{h^i_k}(t^{\pi^i}_{h^i_k}) \ge w^i_{h^i_k}$. If the information is detected by the vehicle, the vehicle will have $t^{\pi^i}_{h^i_k} - w^i_{h^i_k}$ units of remaining time. Then, the vehicle will collect the information if and only if $y^i_{h^i_k}(t^{\pi^i}_{h^i_k} - w^i_{h^i_k}) = 1$. Thus, the vehicle will succeed in this case if and only if $x^i_{h^i_k}(t^{\pi^i}_{h^i_k}) \ge w^i_{h^i_k}$ and $y^i_{h^i_k}(t^{\pi^i}_{h^i_k} - w^i_{h^i_k}) = 1$.

Therefore, the indicator function is consistent with the fact.

The proof of Theorem 3 relies on the following lemma, which states that a vehicle's remaining time when it arrives at a region depends only on the realizations of the random variables associated with the regions that have lower orders than the current region.
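The three cases above can be checked mechanically with a small sketch of the indicator in (35). The search rule `x`, the collection rule `y`, the horizon `T` and all numeric values below are hypothetical choices made only so that each case can be exercised; they are not the policies studied in the paper.

```python
T = 10  # hypothetical mission time; w = T + 1 encodes "no information"


def x(t):
    """Hypothetical search-time rule: search at most 4 units, keep 3 in reserve."""
    return max(0, min(4, t - 3))


def y(t):
    """Hypothetical collection rule: collect only with at least 3 units left."""
    return 1 if t >= 3 else 0


def succeeds(w, t, search=x, collect=y):
    """Indicator u(w, t) from (35).

    w: detection time in the region (T + 1 when the region holds nothing),
    t: remaining time on arrival (or the skip value when the region is skipped).
    """
    return 1 if (w <= search(t) and collect(t - w) == 1) else 0


skipped = succeeds(w=1, t=2)             # case 1: x(2) = 0 < 1 <= w
no_information = succeeds(w=T + 1, t=8)  # case 2: x(8) <= T < T + 1 = w
found_in_time = succeeds(w=3, t=8)       # case 3: w <= x(8) and y(8 - 3) = 1
found_too_late = succeeds(w=5, t=8)      # case 3: w > x(8), never detected
```

The conjunction is evaluated left to right, so the collection rule is consulted only when detection happens within the scheduled search time. This mirrors the order of events in case 3.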
Lemma 3. Given a Markovian policy $\pi^i$, $t^{\pi^i}_{h^i_{k_i}}$ can be written as a function of $\{\theta_{h^i_k}, w^i_{h^i_k}\}_{k=1:k_i-1}$ for any $k_i \ge 1$, i.e.,

$$t^{\pi^i}_{h^i_{k_i}} = g^{\pi^i}_{h^i_{k_i}}(\{\theta_{h^i_k}, w^i_{h^i_k}\}_{k=1:k_i-1}), \tag{36}$$

where $g^{\pi^i}_{h^i_{k_i}}(\cdot)$ maps an integer vector to an integer value.

Proof. If $k_i < z^i_{h^i_0}(T)$, we have $t^{\pi^i}_{h^i_{k_i}} = d(h^i_{k_i}, h^i_{L_i+1})$. Then we define $g^{\pi^i}_{h^i_{k_i}}(\{\theta_{h^i_k}, w^i_{h^i_k}\}_{k=1:k_i-1}) = d(h^i_{k_i}, h^i_{L_i+1})$. If $k_i = z^i_{h^i_0}(T)$, we have $t^{\pi^i}_{h^i_{k_i}} = T - d(h^i_0, h^i_{k_i})$. Then we define $g^{\pi^i}_{h^i_{k_i}}(\{\theta_{h^i_k}, w^i_{h^i_k}\}_{k=1:k_i-1}) = T - d(h^i_0, h^i_{k_i})$. Therefore, the lemma holds for $k_i \le z^i_{h^i_0}(T)$.

We will prove the result for $k_i \ge z^i_{h^i_0}(T)$ using induction. As the induction hypothesis, suppose (36) holds for $k_i \le k$, where $k \ge z^i_{h^i_0}(T)$. To complete the induction we only need to show that (36) holds for $k_i = k + 1$. Consider two possible cases:

40 1. If region h i +1 is sipped under the scenario, we have tπ i h i +1 = d(h i +1, h L i +1). Then we can define g π i h i +1({θ h i, w i h i } =1: ) = d(h i +1, h L i +1). 2. If the vehice visits region h i +1, et hi be the region that the vehice comes from, where. We have { } t π i = t π h i i min w i +1 h i h i, xi h i i ) u π (tπ h i i (w i h i h i, tπ i )s h i h i d(hi, hi +1). (37) In the right hand side of (37), the first term is the vehice s remaining time when it arrives at region h i. The second and the third terms are the time spent in searching and in coecting information from region h i, respectivey. The ast term is the trave time from region hi to region h i +1. According to the induction hypothesis, tπ i (π h i i ) = g π i ({θ h i h i, w i h} i =1: 1). Define { } g π i h+1({θ i h i, w i h } i =1: ) =g π i ({θ h i h i, wi i } =1: 1) min w i h i, xi h i i ({θ (gπ h i h i, wi i } =1: 1)) Combine the two cases and we have t π i h i i u π i h i (w i h i, gπ i h i ({θ h i, wi h i } =1: 1))s h i d(hi, hi +1). (38) = g π i h i i ({θ h i, wi h i } =1: i ) we defined for i = + 1. It is easy to verify that g π i h i +1( ) ony provides integer vaues in (38) since a terms used are integers. Thus, the induction is compete and the emma hods for z i h i 0(T ). Proof of Theorem 3. Let 1 and 2 be the corresponding orders of a shared region α in vehice 1 s and vehice 2 s routes, i.e., α = h 1 1 = h 2 2. For a random scenario, et {θ h i, w i h i } =1:i be the reaization of the corresponding random variabes that vehice i wi encounter. According to Lemma 3, t π i g π i ({θ h i h i, w i } h i =1:i 1), i {1, 2} where g π i ( ), i {1, 2} are we defined functions. According to i h i i the definition of dependent set, we have {h 1 1, h 1 2,..., h } {h 2 1, h 2 2,..., h } = Φ α. Define O i = {h 1 1, h 1 2,..., h } \ Φ α for i {1, 2} and S = Φ α { α}. Let τ α (π θ Φ α ) be the conditiona probabiity that both vehices succeed in region α given θ S = { θ α } α S where θ α = 1. 
Using the indicator functions u π i (w i, t π h i h i i ), i {1, 2}, we have i i i i τ α (π θ Φ α ) = {θ h 1,w 1 1 h 1,t π 1 h 1 } {θ h 2,w h 2,t π 2 h 1 2 } 2 2 h i i = { } u π 1 (w 1 h 1 h, t π 1 1 )u π 1 1 h 1 2 (w 2 2 h, t π 2 2 )P (θ h 2 h 1 1, w 1 h, θ , w 2 h, t π 2 1, t π 2 h 1 2 θ 2 S ). (39) 1 2 To rewrite (39), we use the same indicator function κ(x, y) defined for proving ii) of Lemma 2 and τ α (π θ Φ α ) is equa to {θ h 1,w 1 1 } =1:1 {θ h 2,w 2 2 } =1:2 t 1 t 2 { u π 1 h 1 1 (w 1 h 1 1, t 1 )κ(t 1, g π 1 κ(t 2, g π 2({θ h 2 h 2, w 2 h } 2 =1:2 1))P ({θ h 1, w 1 h } 1 =1:1, {θ h 2, w 2 h} 2 =1:2 θ S ) 40 h 1 ({θ h 1, w 1 h 1 } =1:1 1))u π 2 h 2 2 (w 2 h 2 2, t 2 ) }.

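For $P(\{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1}, \{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2} \mid \theta_S)$, we have
\[
P(\{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1}, \{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2} \mid \theta_S) =
\begin{cases}
P(\{\theta_{\alpha_1}, w^1_{\alpha_1}\}_{\alpha_1 \in O_1}, \{\theta_{\alpha_2}, w^2_{\alpha_2}\}_{\alpha_2 \in O_2}, \{w^1_{\alpha'}, w^2_{\alpha'}\}_{\alpha' \in S} \mid \theta_S) & \text{if } \theta_{h^1_l} = \theta_{h^2_{l'}} = \theta_{\alpha'} \text{ for } h^1_l = h^2_{l'} = \alpha' \in S, \\
0 & \text{otherwise.}
\end{cases} \tag{40}
\]
For $P(\{\theta_{h^i_l}, w^i_{h^i_l}\}_{l=1:k_i} \mid \theta_S)$, $i \in U$, we have
\[
P(\{\theta_{h^i_l}, w^i_{h^i_l}\}_{l=1:k_i} \mid \theta_S) =
\begin{cases}
P(\{\theta_{\alpha_i}, w^i_{\alpha_i}\}_{\alpha_i \in O_i}, \{w^i_{\alpha'}\}_{\alpha' \in S} \mid \theta_S) & \text{if } \theta_{h^i_l} = \theta_{\alpha'} \text{ for } h^i_l = \alpha' \in S, \\
0 & \text{otherwise.}
\end{cases} \tag{41}
\]
Using ii) of Lemma 2, (40) and (41), $P(\{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1}, \{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2} \mid \theta_S)$ can be further rewritten as
\[
P(\{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1}, \{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2} \mid \theta_S) = \prod_{i \in U} P(\{\theta_{h^i_l}, w^i_{h^i_l}\}_{l=1:k_i} \mid \theta_S). \tag{42}
\]
Using (42), $\tau_\alpha(\pi \mid \theta_{\Phi_\alpha})$ can be written as
\[
\Big[ \sum_{\{\theta_{h^1_l},\, w^1_{h^1_l}\}_{l=1:k_1}} \sum_{t_1} u^{\pi_1}_{h^1_{k_1}}(w^1_{h^1_{k_1}}, t_1)\, \kappa\big(t_1, g^{\pi_1}_{h^1_{k_1}}(\{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1-1})\big)\, P(\{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1} \mid \theta_S) \Big] \cdot \Big[ \sum_{\{\theta_{h^2_l},\, w^2_{h^2_l}\}_{l=1:k_2}} \sum_{t_2} u^{\pi_2}_{h^2_{k_2}}(w^2_{h^2_{k_2}}, t_2)\, \kappa\big(t_2, g^{\pi_2}_{h^2_{k_2}}(\{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2-1})\big)\, P(\{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2} \mid \theta_S) \Big]. \tag{43}
\]
Each factor in (43) is
\[
\tilde\tau^i_\alpha(\pi_i \mid \theta_{\Phi_\alpha}) = \sum_{\{\theta_{h^i_l},\, w^i_{h^i_l}\}_{l=1:k_i}} \sum_{t_i} u^{\pi_i}_{h^i_{k_i}}(w^i_{h^i_{k_i}}, t_i)\, \kappa\big(t_i, g^{\pi_i}_{h^i_{k_i}}(\{\theta_{h^i_l}, w^i_{h^i_l}\}_{l=1:k_i-1})\big)\, P(\{\theta_{h^i_l}, w^i_{h^i_l}\}_{l=1:k_i} \mid \theta_S),
\]
which is the conditional probability that vehicle $i$ succeeds in region $\alpha$ given $\theta_S$. Therefore, whether a vehicle succeeds in region $\alpha$ is conditionally independent of the other vehicle given $\theta_{\Phi_\alpha}$ and that the information exists.

Proof of Proposition 2. We just need to show that if $\alpha \in \Phi_{\alpha_2}$ we must have $\alpha \in \Phi_{\alpha_1}$. To see this, since $\alpha \in \Phi_{\alpha_2}$ we have I α I α2. According to the proposition's condition, we have I α I α2 I α1. Therefore, we have $\alpha \in \Phi_{\alpha_1}$.

The proof of Proposition 3 requires the following lemma, which considers the policy in the form of (5) and proves a similar result to Lemma 3 but for both the remaining time and the observation variable.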

42 Lemma 4 Consider a poicy π = (π 1, π 2 ) where π i = {x i (t, o i ), y i (t, o i ), z i (t, o i : t = h i i 1 h i i 1 h i 1) i 0, 1, 2,..., T, = 0, 1, 2,..., L i }. Let {θ h i, w i h} i =1:i be a reaization of the scenario that vehice i wi encounter unti eaving region h i i. Under this scenario, for any i 1, t π i h i i o i h i i = ζ π i h i i ({θ h i, w i h i } =1:i 1), = η π i h i i ({θ h i, w i h i } =1:i ), (45) where ζ π i h i i ( ) and η π i h i i ( ) are functions that map an integer vector to an integer vaue. Proof. If i < z i ), we have t h0(t π i i h i i 0. Therefore, the emma hods for i < z i h i 0 T d(h i 0, h i i ). Given w i h i, define η π i ({θ h i h i, w h i } =1:i ) = i = ζ π i ({θ h i h i, w i } h i =1:i 1) = d(h i i i, h i L i +1 ) and oi = η π h i i i (T ). If i = z i h i 0(T ), we have t π i h i i x i h i i (T d(h i 0, h i i ), 0) if x i α(t d(h i 0, h i i ), 0) < w i h i i, ({θ h i h i, w i h i i } =1:i ) = = ζ π i ({θ h i h i, w i h i i } =1:i 1) = 1 if x i h i (T d(h i 0, h i i ), 0) w i h i i. Here 0 is a zero vector that represents previous observations. Thus, the emma aso hods for i = z i ). We define an indicator function u h0(t π i i h(w i t, o i h, i ) for = 1, 2,..., L i h i i satisfying u π i 1 h(w i t, o i h, i = i h 1) i 1 if x i (t, o i ) w i and y i (t w i, o i = 1, which corresponds to the scenario where vehice i h i i 1 h i i i 1) i finds the information in region h i and coects it; otherwise, uπ i h(w i t, o i h, i = 0. i h 1) i For i z i ), we prove the emma s resut using induction. As the induction hypothesis, we h0(t i assume (45) hods for i where z i ). Now we consider h0(t i i = + 1. Let h i be the region that vehice i traves from. We have two possibe cases: 1. If region h i +1 is sipped under the scenario, we define: ζ π i h+1({θ i h i, w i } h i =1: ) = d(h i +1, hi L i +1 ) and ηπ i h+1({θ i h i, w i h} i =1:+1 ) = If the vehice visits region h i +1, et hi be the region that the vehice comes from, where. 
We have { } t π i = t π h i i min x i +1 h i h i i, o i (tπ h i h ), w i i h u π i i (w i 1 h i h i, tπ i, o i h i h i )s h i d(hi, hi +1). According to the induction hypothesis, we have: t π i h i = ζ πi ({θ h i h i, w h i } =1: 1), o i = {η π h i i ({θ h i h i 1, w h i } =1: )} =1: 1. We can define h+1({θ i h i, w i h } i =1: ) = ζ π i ({θ h i h i, wi i } =1: 1) { } min w h i, xi h i (ζπi h i ({θ h i, wi i } =1: 1), {η π i ({θ i h i, w i h} i =1: )} =1: 1) ζ π i u π i h i (w i h i, ζπ i h i ({θ h i, wi h i } =1: 1), {η π i h i ({θ h i, w i h i } =1: )} =1: 1)s h i d(hi, hi +1). (46) 42

43 For o i we have h i +1 Then we can define o i h i +1 = η π i h i +1({θ h i, w i h i } =1:+1 ) = { x i (t π h i i o +1 h+1, i if x i h) i (t π i h i i o +1 h+1, i w i h) i i h+1, i w h i +1 if x i h+1(t π i i o h+1, i < w i h) i i h+1. i x i h i +1(ζ π i h i +1({θ h i, wi h i } =1:), {η π i h i ({θ h i, w i h i } =1: )} =1:) if x i h i +1(ζ π i h i +1({θ h i, wi h i } =1:), {η π i h i ({θ h i, w i h i } =1: )} =1:) w i h i +1, w i h i +1 if x i h i +1(ζ π i h i +1({θ h i, wi h i } =1:), {η π i h i ({θ h i, w i h i } =1: )} =1:) < w i h i +1. Combine the two cases, we have t π i = ζ π h i i +1 h+1({θ i h i, w i h} i =1: ) and o i = η π h i i +1 h+1({θ i h i, w i h} i =1:+1 ) we defined. It is easy to verify that ζ π i ( ) and η π h i i defined in (46) and (47) ony provide integer +1 h+1( ) i vaues since a terms used are integers. Thus, the induction is compete and the emma hods for i z i ). h0(t i Proof of Proposition 3. Consider a shared region h 1 1 = h 2 2 = α. Let 1 and 2 be the orders of region α in vehice 1 s and vehice 2 s routes, respectivey. Since the shared regions are searched in an exact opposite order by the two vehices, it is easy to verify that Φ α =. Given θ α = 1, the conditiona probabiity that both vehices succeed in region α is provided by (48) using the indicator function u π (w i, t, o i defined in the proof of Lemma 4. 
h i i 1) i (47) τ α (π) = {w 1 h 1,w 2 h 2,t π 1 h 1 1,t π 2 h 2 2,o 1 h 1 1,o 2 h } 2 1 u π 1 (w 1 h 1 h, t π 1 1, o h 1 h )u π 1 2 (w h 2 h, t π 2 2, o h 2 h ) = {θ h 1,w 1 1 } =1:1 1 {θ h 2,w 2 2 } =1:2 1 P (w 1 h, w 2 1, t π 2 1, t π 1 2 h 1 2, o 1 2 h o , 2 h ) θ α {t π 1 h 1,t π 2 h 2,o 1 h 1 1,o 2 h } 2 1 [u π 1 h 1 1 (w 1 h 1 1, t π 1 h 1 1, o 1 h )u π 2 h 2 2 (w 2 h 2 2, t π 2 h 2 2, o 2 h 2 2 1) (48) P ({θ h 1, w 1 h } 1 =1:1 1, {θ h 2, w 2 h } 2 =1:2 1, w 1, w 2 1, t π 2 1, t π 1 2 h 1 2, o 1 2 h o , 2 h )] θ α Given {θ h 1, w 1 } h 1 =1:1 1 and {θ h 2, w 2 } h 2 =1:2 1, using the indicator function κ(x, y) defined in the proof 43

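of Theorem 3, we have
\[
\sum_{\{t^{\pi_1}_{h^1_{k_1}},\, t^{\pi_2}_{h^2_{k_2}},\, o^1_{h^1_{k_1-1}},\, o^2_{h^2_{k_2-1}}\}} u^{\pi_1}_{h^1_{k_1}}(w^1_{h^1_{k_1}}, t^{\pi_1}_{h^1_{k_1}}, o^1_{h^1_{k_1-1}})\, u^{\pi_2}_{h^2_{k_2}}(w^2_{h^2_{k_2}}, t^{\pi_2}_{h^2_{k_2}}, o^2_{h^2_{k_2-1}})\, P(\{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1-1}, \{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2-1}, w^1_{h^1_{k_1}}, w^2_{h^2_{k_2}}, t^{\pi_1}_{h^1_{k_1}}, t^{\pi_2}_{h^2_{k_2}}, o^1_{h^1_{k_1-1}}, o^2_{h^2_{k_2-1}} \mid \theta_\alpha)
\]
\[
= \sum_{\{t_1,\, t_2,\, \hat o^1_{h^1_{k_1-1}},\, \hat o^2_{h^2_{k_2-1}}\}} \Big\{ u^{\pi_1}_{h^1_{k_1}}(w^1_{h^1_{k_1}}, \zeta^{\pi_1}_{h^1_{k_1}}(\cdot), \hat o^1_{h^1_{k_1-1}})\, u^{\pi_2}_{h^2_{k_2}}(w^2_{h^2_{k_2}}, \zeta^{\pi_2}_{h^2_{k_2}}(\cdot), \hat o^2_{h^2_{k_2-1}})\, \kappa(\zeta^{\pi_1}_{h^1_{k_1}}(\cdot), t_1)\, \kappa(\zeta^{\pi_2}_{h^2_{k_2}}(\cdot), t_2) \Big[ \prod_{l=1}^{k_1-1} \kappa(\eta^{\pi_1}_{h^1_l}(\cdot), \hat o^1_{h^1_l}) \Big] \Big[ \prod_{l=1}^{k_2-1} \kappa(\eta^{\pi_2}_{h^2_l}(\cdot), \hat o^2_{h^2_l}) \Big] P(w^1_{h^1_{k_1}}, w^2_{h^2_{k_2}}, \{\theta_{h^1_l}, w^1_{h^1_l}\}_{l=1:k_1-1}, \{\theta_{h^2_l}, w^2_{h^2_l}\}_{l=1:k_2-1} \mid \theta_\alpha) \Big\}. \tag{49}
\]
In (49), we use $\eta^{\pi_i}_{h^i_l}(\cdot)$ and $\zeta^{\pi_i}_{h^i_{k_i}}(\cdot)$ as concise forms of their counterparts defined in Lemma 4. $\hat o^i_{h^i_{k_i-1}}$ is an integer vector that has the same dimension as $o^i_{h^i_{k_i-1}}$, $i \in U$. Since all shared regions are searched in an exact opposite order, we have $\{h^1_1, \ldots, h^1_{k_1-1}\} \cap \{h^2_1, \ldots, h^2_{k_2-1}\} = \emptyset$. Using ii) of Lemma 2, we can further rewrite (49) as
\[
\prod_{i \in U} \sum_{t_i,\, \hat o^i_{h^i_{k_i-1}}} u^{\pi_i}_{h^i_{k_i}}(w^i_{h^i_{k_i}}, \zeta^{\pi_i}_{h^i_{k_i}}(\cdot), \hat o^i_{h^i_{k_i-1}})\, \kappa(\zeta^{\pi_i}_{h^i_{k_i}}(\cdot), t_i) \Big[ \prod_{l=1}^{k_i-1} \kappa(\eta^{\pi_i}_{h^i_l}(\cdot), \hat o^i_{h^i_l}) \Big] P(w^i_{h^i_{k_i}}, \{\theta_{h^i_l}, w^i_{h^i_l}\}_{l=1:k_i-1} \mid \theta_\alpha). \tag{50}
\]
Combining (48), (49) and (50), $\tau_\alpha(\pi)$ can be written as
\[
\prod_{i \in U} \Big\{ \sum_{\{\theta_{h^i_l},\, w^i_{h^i_l}\}_{l=1:k_i}} \; \sum_{t_i,\, \hat o^i_{h^i_{k_i-1}}} u^{\pi_i}_{h^i_{k_i}}(w^i_{h^i_{k_i}}, \zeta^{\pi_i}_{h^i_{k_i}}(\cdot), \hat o^i_{h^i_{k_i-1}})\, \kappa(\zeta^{\pi_i}_{h^i_{k_i}}(\cdot), t_i) \Big[ \prod_{l=1}^{k_i-1} \kappa(\eta^{\pi_i}_{h^i_l}(\cdot), \hat o^i_{h^i_l}) \Big] P(w^i_{h^i_{k_i}}, \{\theta_{h^i_l}, w^i_{h^i_l}\}_{l=1:k_i-1} \mid \theta_\alpha) \Big\}, \tag{51}
\]
which is the product of the conditional probabilities that each vehicle will succeed in region $\alpha$ given $\theta_\alpha$. Therefore, whether a vehicle succeeds in the region is conditionally independent of the other vehicle given $\theta_\alpha$.

Proof of Theorem 4. A Markovian policy can be viewed as a special case of the policy in the form of (5). To see this, we can assign the same decision to vehicle $i \in U$ in a policy in the form of (5) as long as the vehicle has the same remaining time at the same decision epoch, no matter what history of observations the vehicle received. Then a Markovian policy is established.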
Let $\pi^* = (\pi_1^*, \pi_2^*)$ be an optimal time-allocation policy in the form of (5) and $R(\pi^*)$ be the expected reward provided by $\pi^*$. Since whether a vehicle succeeds in a shared region is independent of the other vehicle, we have $\tilde R(\pi^*) = R(\pi^*)$. Using Theorem 1, let $\hat\pi_1$ be a Markovian policy that solves $\max\{\tilde R(\pi_1, \pi_2^*) : \pi_1 \in \Pi_1\}$ and let $\hat\pi_2$ be a

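Markovian policy that solves $\max\{\tilde R(\hat\pi_1, \pi_2) : \pi_2 \in \Pi_2\}$. We should have $\tilde R(\hat\pi_1, \hat\pi_2) \ge \tilde R(\pi^*) = R(\pi^*)$. Since policy $(\hat\pi_1, \hat\pi_2)$ can be viewed as a special case of the policy in the form of (5), according to the theorem's condition, we should have $\tilde R(\hat\pi_1, \hat\pi_2) = R(\hat\pi_1, \hat\pi_2)$. Therefore, we have $R(\hat\pi_1, \hat\pi_2) \ge R(\pi^*)$. Since $\pi^*$ is an optimal policy, $(\hat\pi_1, \hat\pi_2)$ must also be an optimal policy.

We establish three lemmas to facilitate the proof of Theorem 5. In Lemma 5, we duplicate each shared region $\alpha$ satisfying $\gamma_\alpha < 1$ and assign one copy to each vehicle. Then we relate the corresponding expected rewards when the same policy is applied. We use $\alpha \in S : \gamma_\alpha \le 1$ to denote all $\alpha \in S$ satisfying $\gamma_\alpha \le 1$; similarly, $\alpha \in S : \gamma_\alpha > 1$ denotes all $\alpha \in S$ satisfying $\gamma_\alpha > 1$.

Lemma 5. For any policy $\pi = (\pi_1, \pi_2)$, let $R^O(\pi)$ be the expected reward the fleet will collect by following policy $\pi$ assuming that, for $\alpha \in S$ satisfying $\gamma_\alpha \le 1$, $g_\alpha$ will be collected by each vehicle if it succeeds in the region. We have
\[
R^O(\pi) \ge R(\pi). \tag{52}
\]
Proof. The expected reward collected by the fleet under policy $\pi$ is
\[
R(\pi) = \sum_{\alpha \in S: \gamma_\alpha \le 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi \mid 0)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) \big] + R^{\pi_1} + R^{\pi_2} + \sum_{\alpha \in S: \gamma_\alpha > 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi \mid 0)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) \big]. \tag{53}
\]
Due to the assumption that the same amount of reward will be collected regardless of the other vehicle's result, an expected reward of $e_\alpha g_\alpha \tau^i_\alpha(\pi_i)$ will be collected by vehicle $i$ in any shared region $\alpha$ satisfying $\gamma_\alpha < 1$. Therefore, we have
\[
R^O(\pi) = R^{\pi_1} + R^{\pi_2} + \sum_{\alpha \in S: \gamma_\alpha \le 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi_1) + \tau^2_\alpha(\pi_2) \big] + \sum_{\alpha \in S: \gamma_\alpha > 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi \mid 0)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) \big]. \tag{54}
\]
Combining (53) and (54), we have
\[
R^O(\pi) - R(\pi) = \sum_{\alpha \in S: \gamma_\alpha \le 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi_1) + \tau^2_\alpha(\pi_2) - \tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi_2)) - \tau^2_\alpha(\pi \mid 0)(1 - \tau^1_\alpha(\pi_1)) - (1 + \gamma_\alpha)\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) \big].
\]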
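Since
\[
\tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi \mid 0)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) \le \tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi_2)) + \tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) + \tau^2_\alpha(\pi \mid 0)(1 - \tau^1_\alpha(\pi_1)) + \tau^2_\alpha(\pi \mid 1)\tau^1_\alpha(\pi_1) = \tau^1_\alpha(\pi_1) + \tau^2_\alpha(\pi_2), \tag{55}
\]
we have $R^O(\pi) \ge R(\pi)$. Note that the inequality in (55) comes from the fact that $\gamma_\alpha < 1$ and $\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) = \tau^2_\alpha(\pi \mid 1)\tau^1_\alpha(\pi_1) = \tau_\alpha(\pi) \ge 0$.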
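As a worked check of (55), the gap between its two sides can be written out explicitly. The short derivation below is a reader's sketch rather than part of the original proof; it uses only the identity $\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) = \tau^2_\alpha(\pi \mid 1)\tau^1_\alpha(\pi_1) = \tau_\alpha(\pi)$ stated above, and the summation range $\gamma_\alpha \le 1$ is taken from the statement of Lemma 5.

```latex
% Right-hand side of (55) minus its left-hand side, for one region \alpha:
\begin{align*}
&\tau^1_\alpha(\pi \mid 1)\,\tau^2_\alpha(\pi_2) + \tau^2_\alpha(\pi \mid 1)\,\tau^1_\alpha(\pi_1)
 - (1+\gamma_\alpha)\,\tau^1_\alpha(\pi \mid 1)\,\tau^2_\alpha(\pi_2) \\
&\qquad = 2\,\tau_\alpha(\pi) - (1+\gamma_\alpha)\,\tau_\alpha(\pi)
 = (1-\gamma_\alpha)\,\tau_\alpha(\pi) \;\ge\; 0 .
\end{align*}
% Summing over the duplicated regions gives the reward gap of Lemma 5:
\[
R^O(\pi) - R(\pi) = \sum_{\alpha \in S:\, \gamma_\alpha \le 1} g_\alpha e_\alpha\,(1-\gamma_\alpha)\,\tau_\alpha(\pi).
\]
```

Read this way, the gap vanishes exactly when $\gamma_\alpha = 1$ or when the two vehicles never succeed together in a duplicated region.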

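In Lemma 6, we compensate each vehicle for its reward loss assuming that it is always the second vehicle to collect the information if both vehicles succeed in a shared region, and extend inequality (52) to (56). The compensation is calculated using a fixed $\bar\tau$, which provides each vehicle the conditional probability that the other vehicle succeeds in each shared region $\alpha$ given that information exists in the region. Let $\bar\tau = \{\bar\tau^1_\alpha, \bar\tau^2_\alpha\}_{\alpha \in S: \gamma_\alpha \le 1}$ where $0 \le \bar\tau^1_\alpha, \bar\tau^2_\alpha \le 1$.

Lemma 6. For any policy $\pi = (\pi_1, \pi_2)$, let $R^+(\pi \mid \bar\tau)$ be the expected reward that the fleet will receive by applying policy $\pi$ assuming that each vehicle $i \in U$ will collect an expected reward of $g_\alpha e_\alpha \tau^i_\alpha(\pi_i)[(1 - \bar\tau^i_\alpha) + \bar\tau^i_\alpha \gamma_\alpha]$ from region $\alpha \in S$ satisfying $\gamma_\alpha \le 1$. We have
\[
R^O(\pi) \le R^+(\pi \mid \bar\tau) + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha. \tag{56}
\]
Proof. To compare $R^O(\pi)$ and $R^+(\pi \mid \bar\tau)$, we only need to compare the expected total reward collected from each shared region $\alpha$ satisfying $\gamma_\alpha \le 1$, since the expected total rewards collected from all the other regions are the same. We use a big $M$ to represent the expected total reward collected from all the other regions and have
\[
R^O(\pi) = M + \sum_{\alpha \in S: \gamma_\alpha \le 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi_1) + \tau^2_\alpha(\pi_2) \big].
\]
\[
\begin{aligned}
R^+(\pi \mid \bar\tau) + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha
&= M + \sum_{\alpha \in S: \gamma_\alpha \le 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi_1)(1 - \bar\tau^1_\alpha) + \tau^2_\alpha(\pi_2)(1 - \bar\tau^2_\alpha) + \gamma_\alpha(\tau^1_\alpha(\pi_1)\bar\tau^1_\alpha + \tau^2_\alpha(\pi_2)\bar\tau^2_\alpha) + (\bar\tau^1_\alpha + \bar\tau^2_\alpha)(1 - \gamma_\alpha) \big] \\
&\ge M + \sum_{\alpha \in S: \gamma_\alpha \le 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi_1)(1 - \bar\tau^1_\alpha) + \tau^2_\alpha(\pi_2)(1 - \bar\tau^2_\alpha) + \tau^1_\alpha(\pi_1)\bar\tau^1_\alpha + \tau^2_\alpha(\pi_2)\bar\tau^2_\alpha \big] \\
&= M + \sum_{\alpha \in S: \gamma_\alpha \le 1} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi_1) + \tau^2_\alpha(\pi_2) \big] = R^O(\pi).
\end{aligned} \tag{57}
\]
The inequality in (57) holds since $\bar\tau^1_\alpha + \bar\tau^2_\alpha \ge \tau^1_\alpha(\pi_1)\bar\tau^1_\alpha + \tau^2_\alpha(\pi_2)\bar\tau^2_\alpha$.

Lemma 7 provides a ratio under which we can extend the inequalities derived in Lemma 5 and Lemma 6 in the proof of Theorem 5. Let $\bar\tau^1_\alpha = \tau^2_\alpha(\pi_2)$, $\bar\tau^2_\alpha = \tau^1_\alpha(\pi_1)$ for $\alpha \in S$ and $\bar\tau = \{\bar\tau^1_\alpha, \bar\tau^2_\alpha\}_{\alpha \in S}$.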
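Define $R^+(\pi_i \mid \bar\tau) = R^{\pi_i} + \sum_{\alpha \in S} e_\alpha g_\alpha \big[ \tau^i_\alpha(\pi_i)(1 - \bar\tau^i_\alpha) + \gamma_\alpha \tau^i_\alpha(\pi_i)\bar\tau^i_\alpha \big]$, which is the expected total reward that vehicle $i$ will collect if it collects a reward of $[\gamma_\alpha \bar\tau^i_\alpha + (1 - \bar\tau^i_\alpha)]$ when it succeeds in any shared region $\alpha \in S$.

Lemma 7. For any Markovian policy $\pi = (\pi_1, \pi_2)$, we have
\[
(2 - \underline{\gamma})\, \tilde R(\pi) \ge \frac{1 + \bar\gamma}{2\bar\gamma} \sum_{i \in U} R^+(\pi_i \mid \bar\tau) + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha. \tag{58}
\]
Proof. We first write $\tilde R(\pi)$ as
\[
\tilde R(\pi) = \sum_{i \in U} R^{\pi_i} + \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha \big[ \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \big] + \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \big[ \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \big]. \tag{59}
\]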

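For any $\alpha \in S$ where $\gamma_\alpha \le 1$, we have
\[
\begin{aligned}
&(2 - \underline{\gamma})\big[ \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \big] \\
&\ge (2 - \gamma_\alpha)\big[ \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \big] \\
&= \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \gamma_\alpha \tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) + (1 - \gamma_\alpha)\tau^2_\alpha(\pi_2) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + \gamma_\alpha \tau^2_\alpha(\pi_2)\tau^1_\alpha(\pi_1) \\
&\quad + (1 - \gamma_\alpha)\tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + (1 - \gamma_\alpha)(1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \\
&\ge \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \gamma_\alpha \tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) + (1 - \gamma_\alpha)\tau^2_\alpha(\pi_2) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + \gamma_\alpha \tau^2_\alpha(\pi_2)\tau^1_\alpha(\pi_1) + (1 - \gamma_\alpha)\tau^1_\alpha(\pi_1) \\
&= \sum_{i \in U} \big[ \tau^i_\alpha(\pi_i)(1 - \bar\tau^i_\alpha) + \gamma_\alpha \tau^i_\alpha(\pi_i)\bar\tau^i_\alpha \big] + \sum_{i \in U} (1 - \gamma_\alpha)\bar\tau^i_\alpha.
\end{aligned} \tag{60}
\]
Using (60), we have
\[
\begin{aligned}
&(2 - \underline{\gamma}) \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha \big[ \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \big] \\
&\ge \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha \big[ \tau^i_\alpha(\pi_i)(1 - \bar\tau^i_\alpha) + \gamma_\alpha \tau^i_\alpha(\pi_i)\bar\tau^i_\alpha \big] + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha \\
&\ge \frac{1 + \bar\gamma}{2\bar\gamma} \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha \big[ \tau^i_\alpha(\pi_i)(1 - \bar\tau^i_\alpha) + \gamma_\alpha \tau^i_\alpha(\pi_i)\bar\tau^i_\alpha \big] + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha.
\end{aligned} \tag{61}
\]
For $\alpha \in S : \gamma_\alpha > 1$,
\[
\begin{aligned}
&\tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \\
&= \frac{1 + \bar\gamma}{2\bar\gamma} \Big\{ \frac{2\bar\gamma}{1 + \bar\gamma} \big[ \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) \big] + \frac{2\bar\gamma(1 + \gamma_\alpha)}{1 + \bar\gamma} \tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \Big\} \\
&\ge \frac{1 + \bar\gamma}{2\bar\gamma} \big[ \bar\tau^1_\alpha(1 - \bar\tau^2_\alpha) + \bar\tau^2_\alpha(1 - \bar\tau^1_\alpha) + 2\gamma_\alpha \bar\tau^1_\alpha \bar\tau^2_\alpha \big] \\
&= \frac{1 + \bar\gamma}{2\bar\gamma} \sum_{i \in U} \big[ \tau^i_\alpha(\pi_i)(1 - \bar\tau^i_\alpha) + \gamma_\alpha \tau^i_\alpha(\pi_i)\bar\tau^i_\alpha \big].
\end{aligned}
\]
This implies that
\[
(2 - \underline{\gamma}) \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \big[ \tau^1_\alpha(\pi_1)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi_2)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi_1)\tau^2_\alpha(\pi_2) \big] \ge \frac{1 + \bar\gamma}{2\bar\gamma} \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \big[ \tau^i_\alpha(\pi_i)(1 - \bar\tau^i_\alpha) + \gamma_\alpha \tau^i_\alpha(\pi_i)\bar\tau^i_\alpha \big]. \tag{62}
\]
Since $\frac{1 + \bar\gamma}{2\bar\gamma} \sum_{i \in U} R^{\pi_i} \le \sum_{i \in U} R^{\pi_i}$, combining (59), (61) and (62), we have (58) hold.

With the assistance of the three lemmas, we can complete the proof of Theorem 5.

Proof of Theorem 5. Let $\pi^{L,*}$ be an arbitrary local optimum of problem (IAP) and $\pi^*$ be an optimal policy in the form of (5). Set $\bar\tau^i_\alpha = \tau^{-i}_\alpha(\pi^{L,*}_{-i})$ for $i \in U$, $\alpha \in S$, and we first show
\[
R^+(\pi^* \mid \bar\tau) \le \frac{1 + \bar\gamma}{2} \sum_{i \in U} R^+(\pi^{L,*}_i \mid \bar\tau). \tag{63}
\]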

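According to Definition 3, $\pi^{L,*}_i$ optimizes $R^+(\pi_i \mid \bar\tau)$ in $\Pi_i$. We have
\[
R^+(\pi^{L,*}_i \mid \bar\tau) \ge R^+(\pi^*_i \mid \bar\tau). \tag{64}
\]
Now we consider $R^+(\pi^* \mid \bar\tau)$:
\[
\begin{aligned}
R^+(\pi^* \mid \bar\tau) &= \underbrace{\sum_{i \in U} \Big[ R^{\pi^*_i} + \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha \big( \tau^i_\alpha(\pi^*_i)(1 - \bar\tau^i_\alpha) + \gamma_\alpha \tau^i_\alpha(\pi^*_i)\bar\tau^i_\alpha \big) \Big]}_{= M} \\
&\quad + \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \Big[ \tau^1_\alpha(\pi^* \mid 0)(1 - \tau^2_\alpha(\pi^*_2)) + \tau^2_\alpha(\pi^* \mid 0)(1 - \tau^1_\alpha(\pi^*_1)) + \frac{1 + \gamma_\alpha}{2}\big( \tau^1_\alpha(\pi^* \mid 1)\tau^2_\alpha(\pi^*_2) + \tau^2_\alpha(\pi^* \mid 1)\tau^1_\alpha(\pi^*_1) \big) \Big] \\
&\le M + \frac{1 + \bar\gamma}{2} \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \big( \tau^1_\alpha(\pi^*_1) + \tau^2_\alpha(\pi^*_2) \big) \\
&\le \frac{1 + \bar\gamma}{2} \Big[ M + \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \big( \tau^1_\alpha(\pi^*_1) + \tau^2_\alpha(\pi^*_2) \big) \Big].
\end{aligned} \tag{65}
\]
Note that the first inequality in (65) holds since $\tau^1_\alpha(\pi^*_1) = \tau^1_\alpha(\pi^* \mid 0)(1 - \tau^2_\alpha(\pi^*_2)) + \tau^1_\alpha(\pi^* \mid 1)\tau^2_\alpha(\pi^*_2)$. On the other hand, we have
\[
\sum_{i \in U} R^+(\pi^*_i \mid \bar\tau) = M + \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \big[ \tau^1_\alpha(\pi^*_1)(1 - \bar\tau^1_\alpha) + \gamma_\alpha \tau^1_\alpha(\pi^*_1)\bar\tau^1_\alpha + \tau^2_\alpha(\pi^*_2)(1 - \bar\tau^2_\alpha) + \gamma_\alpha \tau^2_\alpha(\pi^*_2)\bar\tau^2_\alpha \big] \ge M + \sum_{\alpha \in S: \gamma_\alpha > 1} e_\alpha g_\alpha \big( \tau^1_\alpha(\pi^*_1) + \tau^2_\alpha(\pi^*_2) \big). \tag{66}
\]
Combining (64), (65) and (66), we have (63) hold. Using Lemma 5 and Lemma 6, we have
\[
R^+(\pi^* \mid \bar\tau) + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha \ge R^O(\pi^*) \ge R(\pi^*). \tag{67}
\]
Finally, combining (58), (63) and (67), we have
\[
\begin{aligned}
R(\pi^*) &\le R^+(\pi^* \mid \bar\tau) + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha \\
&\le \Big[ \frac{1 + \bar\gamma}{2} \sum_{i \in U} R^+(\pi^{L,*}_i \mid \bar\tau) + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha \Big] \\
&\le \bar\gamma \Big[ \frac{1 + \bar\gamma}{2\bar\gamma} \sum_{i \in U} R^+(\pi^{L,*}_i \mid \bar\tau) + \sum_{i \in U} \sum_{\alpha \in S: \gamma_\alpha \le 1} e_\alpha g_\alpha (1 - \gamma_\alpha)\bar\tau^i_\alpha \Big] \\
&\le \bar\gamma (2 - \underline{\gamma})\, \tilde R(\pi^{L,*}).
\end{aligned}
\]

D. Tightness of the upper bound

This section illustrates how the ratios highlighted in Remark 2 can be approached. We use $\epsilon$ to represent a very small positive number.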

Corollary 1. The ratio in (21) can be approached arbitrarily closely when $\underline{\gamma} = 0$ or $\underline{\gamma} = 1$.

Before proving the corollary, we first study a special case of the time-allocation problem.

Lemma 8. If $\gamma_\alpha = 1$, $\forall \alpha \in S$, for any policy $\bar\pi = (\bar\pi_1, \bar\pi_2)$ in the form of (5), $(\hat\pi_1, \hat\pi_2)$ is an optimal policy if
\[
\hat\pi_1 \in \arg\max\{\tilde R(\pi_1, \bar\pi_2) : \pi_1 \in \Pi_1\}, \tag{68a}
\]
\[
\hat\pi_2 \in \arg\max\{\tilde R(\bar\pi_1, \pi_2) : \pi_2 \in \Pi_2\}. \tag{68b}
\]
Proof. For $\alpha \in S$ and any policy $\pi = (\pi_1, \pi_2)$, we have
\[
R_\alpha(\pi_1, \pi_2) = g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi \mid 0)(1 - \tau^2_\alpha(\pi_2)) + \tau^2_\alpha(\pi \mid 0)(1 - \tau^1_\alpha(\pi_1)) + (1 + \gamma_\alpha)\tau^1_\alpha(\pi \mid 1)\tau^2_\alpha(\pi_2) \big] = g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi_1) + \tau^2_\alpha(\pi_2) \big] = \tilde R_\alpha(\pi_1, \pi_2). \tag{69}
\]
Let $(\pi^*_1, \pi^*_2)$ be an optimal policy. Combining (68a) and (69), we have
\[
\tilde R(\hat\pi_1, \bar\pi_2) = R^{\hat\pi_1} + R^{\bar\pi_2} + \sum_{\alpha \in S} g_\alpha e_\alpha \big[ \tau^1_\alpha(\hat\pi_1) + \tau^2_\alpha(\bar\pi_2) \big] \ge \tilde R(\pi^*_1, \bar\pi_2) = R^{\pi^*_1} + R^{\bar\pi_2} + \sum_{\alpha \in S} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi^*_1) + \tau^2_\alpha(\bar\pi_2) \big]. \tag{70}
\]
From (70), we can obtain
\[
R(\hat\pi_1, \pi^*_2) = R^{\hat\pi_1} + R^{\pi^*_2} + \sum_{\alpha \in S} g_\alpha e_\alpha \big[ \tau^1_\alpha(\hat\pi_1) + \tau^2_\alpha(\pi^*_2) \big] \ge R^{\pi^*_1} + R^{\pi^*_2} + \sum_{\alpha \in S} g_\alpha e_\alpha \big[ \tau^1_\alpha(\pi^*_1) + \tau^2_\alpha(\pi^*_2) \big] = R(\pi^*_1, \pi^*_2). \tag{71}
\]
Therefore, $(\hat\pi_1, \pi^*_2)$ is also an optimal policy. Using a similar method to replace $\pi^*_2$ with $\hat\pi_2$, we can easily show that $(\hat\pi_1, \hat\pi_2)$ is also an optimal policy.

Lemma 8 shows that a single iteration of Algorithm 1 finds an optimal time-allocation policy when the cooperation factor is equal to one for all shared regions.

Proof of Corollary 1. Given $\bar\gamma = 1$, if $\underline{\gamma} = 1$, then $\gamma_\alpha = 1$ for all $\alpha \in S$. According to Lemma 8, a local optimum of problem (IAP) is also an optimal time-allocation policy. Therefore, $R(\pi^{L,*})/R(\pi^*) = 1$. We use an example to prove the rest of Corollary 1. Consider the example illustrated in Figure 6, where two vehicles are assigned to search three regions. Since we have only one shared region in the example, $\tilde R(\pi) = R(\pi)$ for any Markovian policy $\pi$ according to Proposition 3. Combining Proposition 3 and Theorem 4, the optimal Markovian policy is also an optimal policy to the time-allocation problem. Note that we also have the same result for the example that we will create to prove Corollary 2.
We have the following parameter setups: for regions 1, 2, 3, we set $e_1 = 0.5$, $e_2 = 0.5$, $e_3 = 0.5$; $g_1 = 1 - \epsilon$, $g_2 = \epsilon$, $g_3 = 1 + \epsilon$; $p_1(1) = 1 - \epsilon$, $p_2(1) = 1 - \epsilon^2$, $p_3(1) = 1 - \epsilon$; $\gamma_3 = \underline{\gamma} = 0$; $s_1 = s_2 = s_3 = 0$. Note that under the last condition, it is always optimal for a vehicle to collect the information since it takes

zero amount of time and provides non-negative reward to the fleet.

Figure 6: The example to show that $R(\pi^{L,*})/R(\pi^*)$ can approach 1/2 arbitrarily closely when $\underline{\gamma} = 0$ (regions 1, 3, 2; Vehicle 1 and Vehicle 2).

In addition, each vehicle's route is a straight line, and therefore there is no benefit to skip a region for both vehicles. Since the travel time does not influence the solution, we assume zero travel time between each pair of regions to simplify the representation of the policy. Each vehicle is given one unit of mission time and there are two feasible policies for each vehicle under this scenario: $\pi^a_1 = \{x^1_1(1) = 0, x^1_3(1) = 1\}$, $\pi^b_1 = \{x^1_1(1) = 1, x^1_3(1) = 0\}$ and $\pi^a_2 = \{x^2_2(1) = 0, x^2_3(1) = 1\}$, $\pi^b_2 = \{x^2_2(1) = 1, x^2_3(1) = 0\}$ for vehicles 1 and 2, respectively.

We first show that $(\pi^a_1, \pi^b_2)$ is a local optimum of problem (IAP). First, we have $\tilde R(\pi^a_1, \pi^b_2) > \tilde R(\pi^b_1, \pi^b_2)$, since for vehicle 1, region 3 provides a larger reward than region 1 if the vehicle succeeds, while the vehicle has the same probability to succeed in both regions. We also have $\tilde R(\pi^a_1, \pi^a_2) = e_3 g_3 [1 - (1 - p_3(1))^2] = 0.5(1 - \epsilon^2)$. Since $\tilde R(\pi^a_1, \pi^b_2) = e_3 g_3 p_3(1) + e_1 g_1 p_1(1) = 0.5(1 - \epsilon) + 0.5\epsilon(1 - \epsilon^2) = 0.5(1 - \epsilon^3) > 0.5(1 - \epsilon^2) = \tilde R(\pi^a_1, \pi^a_2)$, $(\pi^a_1, \pi^b_2)$ is a local optimum. However, it is easy to verify that the optimal policy should be $(\pi^b_1, \pi^a_2)$, where $R(\pi^b_1, \pi^a_2) = e_3 g_3 p_3(1) + e_2 g_2 p_2(1)$. Since
\[
\lim_{\epsilon \to 0} \frac{0.5(1 - \epsilon^3)}{R(\pi^b_1, \pi^a_2)} = \frac{1}{2} = \frac{1}{2 - \underline{\gamma}},
\]
the ratio provided in (21) can be approached arbitrarily closely when $\underline{\gamma} = 0$.

Corollary 2. For any $\bar\gamma \ge 1$, (22) can be approached arbitrarily closely.

Figure 7: The example to show that $R(\pi^{L,*})/R(\pi^*)$ can approach $1/\bar\gamma$ arbitrarily closely (regions 1, 2; Vehicle 1 and Vehicle 2).
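The Figure 6 example above can be checked numerically. The sketch below is illustrative only and rests on several assumptions that are not confirmed by the source: the function names are hypothetical; each vehicle spends its single time unit in exactly one region (policy a searches the shared region 3, policy b searches the vehicle's own region, following the policy definitions); successes in region 3 are independent, as Proposition 3 gives for a single shared region; and $g_3 = 1$ is used because that value reproduces the stated rewards $0.5(1 - \epsilon^2)$ and $0.5(1 - \epsilon^3)$, whereas the extracted parameter list reads $g_3 = 1 + \epsilon$.

```python
from itertools import product


def fleet_reward(region1, region2, eps, gamma3=0.0):
    """Expected fleet reward when vehicle 1 searches `region1` (1 or 3)
    and vehicle 2 searches `region2` (2 or 3) for one time unit."""
    e = {1: 0.5, 2: 0.5, 3: 0.5}                 # probability information exists
    g = {1: 1 - eps, 2: eps, 3: 1.0}             # rewards; g[3] = 1.0 is an assumption
    p = {1: 1 - eps, 2: 1 - eps**2, 3: 1 - eps}  # detection probability for 1 time unit
    if region1 == 3 and region2 == 3:
        q = p[3]
        # Exactly one vehicle succeeds: reward g; both succeed: (1 + gamma) * g.
        return e[3] * g[3] * (2 * q * (1 - q) + (1 + gamma3) * q * q)
    return e[region1] * g[region1] * p[region1] + e[region2] * g[region2] * p[region2]


def policy_table(eps):
    """Rewards for all four policy pairs, keyed by (vehicle 1 policy, vehicle 2 policy)."""
    v1 = {"a": 3, "b": 1}  # policy a: shared region 3; policy b: own region 1
    v2 = {"a": 3, "b": 2}  # policy a: shared region 3; policy b: own region 2
    return {(c1, c2): fleet_reward(v1[c1], v2[c2], eps)
            for c1, c2 in product("ab", repeat=2)}


def is_local_optimum(table, pair):
    """True if neither vehicle can improve the reward by deviating alone."""
    c1, c2 = pair
    best1 = max(table[(d, c2)] for d in "ab")
    best2 = max(table[(c1, d)] for d in "ab")
    return table[pair] >= best1 and table[pair] >= best2


if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        t = policy_table(eps)
        ratio = t[("a", "b")] / max(t.values())
        print(eps, is_local_optimum(t, ("a", "b")), ratio)
```

Under these assumptions the pair $(\pi^a_1, \pi^b_2)$ passes the unilateral-deviation check, and its reward divided by the best reward in the table moves toward 1/2 as $\epsilon$ shrinks.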

