Information Gathering and Reward Exploitation of Subgoals for POMDPs

1 Information Gathering and Reward Exploitation of Subgoals for POMDPs. Hang Ma and Joelle Pineau, McGill University. AAAI, January 27, 2015.

3 POMDPs. A partially observable Markov decision process (POMDP) is a tuple $\langle S, A, \Omega, T, O, R, b_0, \gamma \rangle$: $S$ is a finite set of states, $A$ is a finite set of actions, and $\Omega$ is a finite set of observations; $T(s, a, s') = p(s' \mid s, a)$ is the transition function, mapping each state and action to a probability distribution over successor states; $O(s', a, o) = p(o \mid s', a)$ is the observation function, mapping each successor state and action to a probability distribution over observations; $R(s, a)$ is the reward function; $b_0(s)$ is the initial belief state; and $\gamma$ is the discount factor.

4 Beliefs. A belief state $b \in B$ is a sufficient statistic for the history. It is updated after taking action $a$ and receiving observation $o$ as follows:
$$b^{a,o}(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{p(o \mid a, b)}, \qquad (1)$$
where $p(o \mid a, b) = \sum_{s \in S} b(s) \sum_{s' \in S} T(s, a, s')\, O(s', a, o)$ is a normalizing factor.
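For concreteness, here is a minimal Python sketch of Equation (1); it is an illustration, not code from the paper, and the two-state toy model at the end uses purely illustrative numbers.

```python
def belief_update(b, a, o, states, T, O):
    """Posterior belief b^{a,o} after taking action a and observing o.
    b: dict state -> probability; T[(s, a, s2)] = p(s2 | s, a);
    O[(s2, a, o)] = p(o | s2, a)."""
    unnorm = {s2: O[(s2, a, o)] * sum(T[(s, a, s2)] * b[s] for s in states)
              for s2 in states}
    p_o = sum(unnorm.values())  # normalizer p(o | a, b)
    if p_o == 0.0:
        raise ValueError("observation has zero probability under b and a")
    return {s2: x / p_o for s2, x in unnorm.items()}

# Toy model (illustrative only): one passive action and a sensor that
# reports the true state with probability 0.85.
states = ["s0", "s1"]
T = {(s, "stay", s2): float(s == s2) for s in states for s2 in states}
O = {(s2, "stay", o): 0.85 if s2 == o else 0.15 for s2 in states for o in states}
b0 = {"s0": 0.5, "s1": 0.5}
print(belief_update(b0, "stay", "s0", states, T, O))  # -> {'s0': 0.85, 's1': 0.15}
```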

5 Policies. A policy is a mapping from the current belief state to an action. The value function $V^\pi(b)$ specifies the expected reward gained by starting from $b$ and following policy $\pi$:
$$V^\pi(b) = \sum_{s \in S} b(s)\, R(s, \pi(b)) + \gamma \sum_{o \in \Omega} p(o \mid b, \pi(b))\, V^\pi\!\left(b^{\pi(b), o}\right).$$
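Expanding this definition directly gives a simple finite-horizon evaluator; the sketch below reuses `belief_update` from the previous snippet and is a definition check (exponential in depth), not a solver.

```python
def policy_value(b, policy, depth, states, obs, T, O, R, gamma):
    """Finite-horizon estimate of V^pi(b) by literal expansion of the
    equation above; policy maps a belief to an action."""
    if depth == 0:
        return 0.0
    a = policy(b)
    v = sum(b[s] * R[(s, a)] for s in states)        # expected immediate reward
    for o in obs:
        p_o = sum(b[s] * T[(s, a, s2)] * O[(s2, a, o)]
                  for s in states for s2 in states)  # p(o | b, a)
        if p_o > 0.0:
            b_ao = belief_update(b, a, o, states, T, O)
            v += gamma * p_o * policy_value(b_ao, policy, depth - 1,
                                            states, obs, T, O, R, gamma)
    return v
```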

6 POMDP Planning. Goal: find an optimal policy, i.e., one that maximizes the value function.

7 Challenges. Computational complexity: we must reason in an $(n-1)$-dimensional continuous belief space, where $n = |S|$ (the curse of dimensionality), and complexity also grows quickly with the length of the planning horizon (the curse of history).

8 Challenges. Computational complexity: we must reason in an $(n-1)$-dimensional continuous belief space, where $n = |S|$ (the curse of dimensionality), and complexity also grows quickly with the length of the planning horizon (the curse of history). Practical challenges: how to carry out intelligent information gathering in a large, high-dimensional belief space, and how to scale up planning with long sequences of actions and delayed rewards.

9 Figure: A belief tree rooted at $b_0$, branching on actions $a_1, a_2$ and observations $o_1, o_2$.

10 Existing Solvers. PBVI (Pineau et al., 2003); HSVI/HSVI2 (Smith and Simmons, 2004, 2005); FSVI (Shani et al., 2007); SARSOP (Kurniawati et al., 2008); RTDP-Bel (Bonet and Geffner, 2009); MiGS (Kurniawati et al., 2011). Some online solvers, e.g., PUMA (He et al., 2010) and POMCP (Silver and Veness, 2010).

11 Our Approach: Information Gathering and Reward Exploitation of Subgoals (IGRES) Main ideas:

12 Our Approach: Information Gathering and Reward Exploitation of Subgoals (IGRES) Main ideas: 1 Sample potentially important states as subgoals.

13 Our Approach: Information Gathering and Reward Exploitation of Subgoals (IGRES) Main ideas: 1 Sample potentially important states as subgoals. 2 Generate macro-actions (sequences of actions) for transitions to subgoals.

14 Our Approach: Information Gathering and Reward Exploitation of Subgoals (IGRES) Main ideas: 1 Sample potentially important states as subgoals. 2 Generate macro-actions (sequences of actions) for transitions to subgoals. 3 Generate macro-actions for information gathering and reward exploitation in the neighborhood of subgoals.

15 Capturing Important States. 1. Identify importance heuristic functions with respect to immediate reward and information gain. 2. Sample states as subgoals, each with probability proportional to its importance (see the sketch below).
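Step 2 amounts to importance-weighted sampling. A minimal sketch follows, where `importance` is a caller-supplied stand-in for the slide's heuristics (the paper's exact reward and information-gain formulas are only summarized here):

```python
import random

def sample_subgoals(states, importance, k, rng=random):
    """Draw k subgoal states with probability proportional to their
    importance (with replacement; deduplicate if distinct subgoals are
    required). `importance(s)` is a stand-in for the slide's heuristics."""
    weights = [max(importance(s), 0.0) for s in states]
    if sum(weights) == 0.0:
        return rng.sample(list(states), k)  # degenerate case: uniform
    return rng.choices(list(states), weights=weights, k=k)
```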

16 Leverage the Structure. 1. Specify a distance between states that reflects the approximate similarity of their values. 2. Group all states by their distance to the subgoals, as sketched below.
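As a minimal illustration of the grouping step, the sketch below assigns each state to its nearest subgoal under a caller-supplied distance; the value-similarity distance itself is only summarized on the slide, so `dist` is an assumption here.

```python
def group_by_subgoal(states, subgoals, dist):
    """Partition states by nearest subgoal. `dist(s, g)` stands in for
    the value-similarity distance, which is not spelled out here."""
    groups = {g: [] for g in subgoals}
    for s in states:
        nearest = min(subgoals, key=lambda g: dist(s, g))
        groups[nearest].append(s)
    return groups
```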

17 Sampling Belief States with Macro-actions. 1. Associate each belief with an estimated current state. 2. Generate a macro-action towards the corresponding subgoal (one possible realization is sketched below). 3. Gather information and exploit rewards around the subgoal. Figure: a macro-action drives the belief sequence $b_0 \xrightarrow{a_1, o_1} b_1 \to \cdots \to b_l$ in parallel with the sampled state sequence $s_0 \xrightarrow{a_1} s_1 \to \cdots \to s_l$.
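One plausible way to realize step 2 (an assumption on my part, since the slide only names the step) is to plan in a determinized model: take the belief's most likely state, then search the most-likely-transition graph for an action sequence reaching the subgoal.

```python
from collections import deque

def macro_action_to_subgoal(belief, subgoal, states, actions, T):
    """Sketch, not the paper's construction: BFS in the graph whose edges
    are each (state, action)'s most likely successor."""
    start = max(states, key=lambda s: belief[s])      # step 1: belief's mode
    succ = {(s, a): max(states, key=lambda s2: T[(s, a, s2)])
            for s in states for a in actions}         # determinized dynamics
    frontier, parent = deque([start]), {start: None}
    while frontier:
        s = frontier.popleft()
        if s == subgoal:                              # walk back to recover the plan
            plan = []
            while parent[s] is not None:
                s, a = parent[s]
                plan.append(a)
            return list(reversed(plan))
        for a in actions:
            s2 = succ[(s, a)]
            if s2 not in parent:
                parent[s2] = (s, a)
                frontier.append(s2)
    return []  # subgoal unreachable in the determinized graph
```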

18 Overview of IGRES

19 Conclusion. Reduce planning complexity by using macro-actions. Sample the potentially most useful beliefs (based on subgoal states). Gain the capability of planning in large state spaces over long horizons.

20 Figure: Tiger Domain. $s_0$ = tiger-left: $\Pr(o = \text{obs-left} \mid s = s_0, a = \text{listen}) = 0.85$ and $\Pr(o = \text{obs-right} \mid s = s_0, a = \text{listen}) = 0.15$. $s_1$ = tiger-right: $\Pr(o = \text{obs-left} \mid s = s_1, a = \text{listen}) = 0.15$ and $\Pr(o = \text{obs-right} \mid s = s_1, a = \text{listen}) = 0.85$. Reward function: opening the wrong door: $-100$; opening the correct door: $+10$; listen action: $-1$.
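The slide's numbers assemble into the following tables, compatible with the `belief_update` sketch earlier. The slide gives only the observation and reward numbers; the reset-after-opening transition is the standard Tiger convention and is assumed here.

```python
states = ["tiger-left", "tiger-right"]
actions = ["listen", "open-left", "open-right"]
observations = ["obs-left", "obs-right"]

T = {}
for s in states:
    for s2 in states:
        T[(s, "listen", s2)] = 1.0 if s == s2 else 0.0  # listening is passive
        T[(s, "open-left", s2)] = 0.5                    # assumed: opening resets
        T[(s, "open-right", s2)] = 0.5                   # the tiger uniformly

O = {}
for s2 in states:
    for o in observations:
        correct = (s2 == "tiger-left") == (o == "obs-left")
        O[(s2, "listen", o)] = 0.85 if correct else 0.15
        O[(s2, "open-left", o)] = 0.5   # opening a door gives no information
        O[(s2, "open-right", o)] = 0.5

R = {("tiger-left", "listen"): -1, ("tiger-right", "listen"): -1,
     ("tiger-left", "open-left"): -100, ("tiger-right", "open-left"): 10,
     ("tiger-left", "open-right"): 10, ("tiger-right", "open-right"): -100}
b0 = {"tiger-left": 0.5, "tiger-right": 0.5}

# e.g. belief_update(b0, "listen", "obs-left", states, T, O)
#   -> {'tiger-left': 0.85, 'tiger-right': 0.15}
```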

21 Figure: RockSample Domain

22 Figure: Underwater Navigation Domain, showing the robot's possible initial positions, its destinations, and the locations (marked O) where it can fully localize.

23 Figure: Homecare Domain

24 Figure: 3D-Navigation Domain, in which an unmanned aerial vehicle (UAV) navigates a 3-D environment containing danger zones; the state $(x, y, z, \theta_p, \theta_y)$ comprises the position, the pitch angle $\theta_p$, and the yaw angle $\theta_y$.

25 Table: Results on the benchmark domains, reporting average return and computation time (s) for RTDP-Bel, HSVI, FSVI, SARSOP, MiGS, and IGRES. Domains and IGRES subgoal counts: Tiger (|S| = 2, |A| = 3, |Ω| = 2; 1 subgoal), Noisy-tiger (|S| = 2, |A| = 3, |Ω| = 2; 1 subgoal), RockSample(4,4) (|S| = 257, |A| = 9, |Ω| = 2; 4 subgoals), RockSample(7,8) (|S| = 12545, |A| = 13, |Ω| = 2; 8 subgoals), Hallway2 (|S| = 92, |A| = 5, |Ω| = 17; 20 subgoals), Tag (|S| = 870, |A| = 5, |Ω| = 30; 20 subgoals), Underwater Navigation (|S| = 2653, |A| = 6, |Ω| = 103; 20 subgoals), Homecare (|S| = 5408, |A| = 9, |Ω| = 928; 30 subgoals), and 3D-Navigation (|S| = 16969, |A| = 5, |Ω| = 14; 163 subgoals). FSVI failed on Tiger and Noisy-tiger (*), RTDP-Bel (**) and FSVI (***) failed on Homecare, and FSVI failed on 3D-Navigation (**).
* ArrayIndexOutOfBoundsException is thrown. ** Solver is not able to compute a solution given a large amount of computation time. *** OutOfMemoryError is thrown.

26 Table: Results on the adaptive management of migratory birds problem, reporting average return and computation time (s) for symbolic Perseus* and IGRES. Species and IGRES subgoal counts: Lesser sand plover (|S| = 108, |A| = 3, |Ω| = 36; 18 subgoals), Bar-tailed godwit b. (|S| = 972, |A| = 5, |Ω| = 324; 36 subgoals), Terek sandpiper (|S| = 2916, |A| = 6, |Ω| = 972; 72 subgoals), Bar-tailed godwit m. (|S| = 2916, |A| = 6, |Ω| = 972; 72 subgoals), and Grey-tailed tattler (|S| = 2916, |A| = 6, |Ω| = 972; 72 subgoals).
* Results from Nicol et al. (2013).

27 Figure: Performance on the Grey-tailed tattler domain.

28 Thank You!

29 Reference I
B. Bonet and H. Geffner. Solving POMDPs: RTDP-Bel vs. point-based algorithms. In International Joint Conference on Artificial Intelligence, 2009.
R. He, E. Brunskill, and N. Roy. PUMA: Planning under uncertainty with macro-actions. In National Conference on Artificial Intelligence, 2010.
H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.
H. Kurniawati, Y. Du, D. Hsu, and W. S. Lee. Motion planning under uncertainty for robotic tasks with long time horizons. International Journal of Robotics Research, 30(3), 2011.

30 Reference II
S. Nicol, O. Buffet, T. Iwamura, and I. Chadès. Adaptive management of migratory birds under sea level rise. In International Joint Conference on Artificial Intelligence, 2013.
J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, 2003.
G. Shani, R. I. Brafman, and S. E. Shimony. Forward search value iteration for POMDPs. In International Joint Conference on Artificial Intelligence, 2007.
D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, 2010.
T. Smith and R. G. Simmons. Heuristic search value iteration for POMDPs. In Uncertainty in Artificial Intelligence, 2004.

31 Reference III
T. Smith and R. G. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Uncertainty in Artificial Intelligence, 2005.
