Non-Myopic Multi-Aspect Sensing with Partially Observable Markov Decision Processes


1 Non-Myopic Multi-Aspect Sensing with Partially Observable Markov Decision Processes
Shihao Ji (1), Ronald Parr (2), and Lawrence Carin (1)
(1) Department of Electrical & Computer Engineering, (2) Department of Computer Engineering
Duke University, Durham, NC

2 Outline
- Summary of the underlying partially observed Markov model, with corresponding actions
- Partially observed Markov decision processes (POMDP) and belief state, costs, and Bayes risk
- Learning the POMDP policy via value iteration, with the policy defining the optimal action for a given belief state, accounting for a discounted infinite horizon (non-myopic)
- Two POMDP implementation strategies for multi-target scattering data
- Myopic (greedy) sensing alternative with a stop criterion
- Example results on scattering data measured by NRL
  - Action: selection of optimal target-sensor orientation (full-band data)
  - Action: selection of optimal target-sensor orientation and frequency sub-band

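3 Basic Construct
[Figure: a target surrounded by angular sectors labeled S1, S2, S3, S4]
- Scattering data can be segmented into angular bins, each characterized by particular physics
- Each such angular range is termed a state: S1, S2, ..., SN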

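4 Hidden Markov Model
[Figure: hidden Markov model over the target states; the only surviving labels are the probabilities π1, π2, ...]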

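5 Action-Dependent State-Transition Matrix
- Let $d_{ij}$ represent the angular distance between the centers of states $i$ and $j$ in a prescribed direction (e.g., clockwise)
- The probability of transitioning from state $i$ to state $j$ after moving an angular distance $\phi$ is
$$p(s_j \mid s_i, \phi) = \frac{w_j(\phi - d_{ij})}{\sum_{k=1}^{K} w_k(\phi - d_{ik})}, \qquad w_j(\phi) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{\phi^2}{2\sigma_j^2}\right)$$
- The standard deviation $\sigma_j$ is dictated by the width of state $j$: $\sigma_j = (\phi_j^{\max} - \phi_j^{\min})/2$, where $\phi_j^{\min}$ and $\phi_j^{\max}$ are the boundary angles of state $j$
[Figure: two states $i$ and $j$ separated by the angular distance $d_{ij}$]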
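The transition rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of degrees, the wrap-around of angular offsets on a 360° circle, and taking $\sigma_j$ as half the state width are choices made for the sketch, not details confirmed by the slides.

```python
import math


def gaussian_weight(offset, sigma):
    """Gaussian density w_j evaluated at an angular offset (degrees)."""
    return math.exp(-offset * offset / (2.0 * sigma * sigma)) / math.sqrt(
        2.0 * math.pi * sigma * sigma
    )


def transition_matrix(centers, widths, phi, period=360.0):
    """Return P with P[i][j] = p(s_j | s_i, phi), each row normalized to 1.

    centers[j] is the center angle of state j and widths[j] its angular width.
    """
    n = len(centers)
    # Assumption: sigma_j is half the angular width of state j.
    sigmas = [w / 2.0 for w in widths]
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            # Clockwise angular distance between the centers of states i and j.
            d_ij = (centers[j] - centers[i]) % period
            # Assumption: offsets wrap around the circle into [-period/2, period/2).
            offset = (phi - d_ij + period / 2.0) % period - period / 2.0
            row.append(gaussian_weight(offset, sigmas[j]))
        total = sum(row)
        matrix.append([w / total for w in row])
    return matrix
```

For example, with four 90° states centered at 45°, 135°, 225°, and 315°, a 90° move from the first state makes the second state the most likely destination.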


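7 Belief State: Sufficient Statistic
- The belief state $b$ quantifies the probability that the sensor is in state $s$, given a sequence of actions and corresponding observations
- The belief state at time $T$ is a sufficient statistic for all actions and observations up to that point:
$$b_T(s') = \Pr(s' \mid o_T, a_T, b_{T-1}) = \frac{\Pr(o_T \mid s', a_T)\, \Pr(s' \mid a_T, b_{T-1})}{\Pr(o_T \mid a_T, b_{T-1})} = \frac{\Pr(o_T \mid s', a_T) \sum_{s \in S} p(s' \mid s, a_T)\, b_{T-1}(s)}{\Pr(o_T \mid a_T, b_{T-1})}$$
- Very important for practical implementation: there is no need to store all previous actions and observations
- The belief state is computed readily using the underlying target POMDP model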
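The belief update on this slide is the standard Bayes filter for a POMDP, and a minimal sketch of it follows. The function name and the nested-list layout of the transition and observation models are assumptions made for the illustration.

```python
def belief_update(belief, action, observation, trans, obs_model):
    """Return b_T from b_{T-1}, the action a_T, and the observation o_T.

    Assumed (hypothetical) layout:
      trans[action][s][s2]               = p(s2 | s, action)
      obs_model[action][s2][observation] = Pr(observation | s2, action)
    """
    n = len(belief)
    unnormalized = []
    for s2 in range(n):
        # Prediction step: probability of reaching s2 under the action.
        predicted = sum(trans[action][s][s2] * belief[s] for s in range(n))
        # Correction step: weight by the likelihood of the observation.
        unnormalized.append(obs_model[action][s2][observation] * predicted)
    evidence = sum(unnormalized)  # Pr(o_T | a_T, b_{T-1})
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / evidence for u in unnormalized]
```

Because each call needs only the previous belief, the full history of actions and observations never has to be stored.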

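8 Belief State and Bayes Risk
- The belief state may also be used to compute the probability that target $n$ is being interrogated, based on previous actions and observations, where $S_n$ is the set of states belonging to target $n$:
$$p(n \mid o_1, \ldots, o_T, a_1, \ldots, a_T) = p(n \mid b_T) = \sum_{s \in S_n} b_T(s)$$
- This fact plays a key role in the subsequent policy design, which maps belief states to corresponding actions, because the belief state may be used to compute the Bayes risk of a classification decision:
$$\text{Target} = \arg\min_u \sum_{v=1}^{N} C_{uv}\, p(v \mid b_T) = \arg\min_u \sum_{v=1}^{N} C_{uv} \sum_{s \in S_v} b_T(s)$$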

9 Actions and Sensing Costs
Two types of actions:
- Sensing actions $a$, which select the next angle of observation and/or frequency of operation
- Decision action $\hat{a}$, for which sensing is stopped and a classification decision is made
Cost for sensing actions: $c(s, a)$, independent of which target state $s$ is visited; this represents the cost of performing a measurement and is possibly sensor dependent
Introduce a risk-based terminal reward for making a decision; this is termed action $\hat{a}$

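10 Classification Cost
- Upon performing classification action $\hat{a}$ we move into a new state $s_{ij}$, corresponding to declaring target $i$ when the actual target is target $j$
- The cost associated with state $s_{ij}$ is represented as $C_{ij}$
- The probability of interrogating target $j$ given belief state $b$, where $s$ are the underlying states of the target, is
$$p(j \mid b) = \sum_{s \in S_j} b(s)$$
- The expected immediate cost of taking the terminal classification action $\hat{a}$ in belief state $b$ may therefore be represented as the minimum over declarations, consistent with the Bayes-risk decision rule:
$$C(b, \hat{a}) = \min_i \sum_j C_{ij}\, p(j \mid b) = \min_i \sum_j C_{ij} \sum_{s \in S_j} b(s)$$
- The immediate expected value of terminating sensing and declaring target $i$ (action $\hat{a}_i$) is driven by the Bayes risk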
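As a small illustration of the classification cost, the sketch below sums the belief over the states of each target and picks the declaration with the lowest expected cost. The names `target_posterior`, `classification_cost`, and `state_target` (a list mapping each state index to its target index) are hypothetical, and the cost matrix in the usage note is illustrative only.

```python
def target_posterior(belief, state_target, n_targets):
    """p(j | b): sum the belief over the states that belong to each target j."""
    posterior = [0.0] * n_targets
    for s, prob in enumerate(belief):
        posterior[state_target[s]] += prob
    return posterior


def classification_cost(belief, state_target, cost):
    """Return (declared target, expected cost) for the terminal action.

    cost[i][j] is the cost of declaring target i when the actual target is j.
    """
    n_targets = len(cost)
    posterior = target_posterior(belief, state_target, n_targets)
    risks = [
        sum(cost[i][j] * posterior[j] for j in range(n_targets))
        for i in range(n_targets)
    ]
    best = min(range(n_targets), key=lambda i: risks[i])
    return best, risks[best]
```

With four states split evenly between two targets, a belief of `[0.1, 0.2, 0.3, 0.4]`, and `cost = [[-10, 40], [40, -10]]`, the second target is declared because most of the belief mass lies on its states.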

11 POMDP Formulation Summary
Actions:
- Sensing actions: move the platform by angle $\phi$; perform a measurement with one of $M$ sensors
- Classification action: stop sensing and declare the object under test to be one member of the set $\{1, 2, \ldots, N\}$
States:
- $S = \{ s_k^n \}$: target states $k$ across all targets $n \in \{1, 2, \ldots, N\}$
- $s_{uv}$: corresponds to declaring target $u$ when in reality target $v$ is being sensed; both $u$ and $v$ are members of the set $\{1, 2, \ldots, N\}$
Costs:
- $c(s, a_m)$, with $m$ representing one of the $M$ possible sensors; independent of the target state $s$ visited
- $C_{uv}$ for classification state $s_{uv}$. In terms of target states in $S$: $c(s, a = u) = C_{uv}$ for all $s$ associated with target $v$

12 POMDP Summary
- The algorithm has two types of states: the underlying states $s$ of the target, plus terminal states $s_{ij}$ reached after making a classification decision
- The optimal policy learns which sensing actions to take given the belief state, as well as when to make a decision (stop sensing), as a function of the belief state
- May include different costs for different sensor modalities, while also accounting, via the Bayes risk, for the costs $C_{ij}$ of different misclassifications
- The optimal policy is determined via a point-based algorithm that preserves the local slope of the value function


14 Implementation Issues
- The cost of acting optimally in belief state $b$ at $t$ steps from the horizon is
$$\chi_t(b) = \min_a \Big[ C(b, a) + \gamma \sum_{b' \in B} p(b' \mid b, a)\, \chi_{t-1}(b') \Big]$$
where $C(b, a)$ is the immediate expected cost and the second term is the discounted expected future cost
- This becomes a dynamic-programming problem for learning the optimal policy, which maps belief state to action (discounted infinite-horizon problem)
- Value-iteration dynamic programming stabilizes when a fixed action is defined for each belief state, defining the optimal discounted infinite-horizon policy

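15 Implementation Issues - 2
$$\chi_t(b) = \min_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s)$$
$$\chi_t(b) = \min_{a \in A} \Big[ \sum_{s \in S} C(s, a)\, b(s) + \gamma \sum_{o \in O} \min_{\alpha \in \Gamma_{t-1}} \sum_{s \in S} \sum_{s' \in S} p(s' \mid s, a)\, p(o \mid s', a)\, \alpha(s')\, b(s) \Big]$$
Here $\Gamma_t$ denotes the set of slope vectors $\alpha$ at step $t$, and $C(b, a) = \sum_{s \in S} C(s, a)\, b(s)$.
- Each term is linear in the belief state, which implies that the cost function is piecewise linear and concave over the belief-space simplex
- For a local region in belief space, we track and update the slope $\alpha$ that minimizes the local cost
- Value iteration becomes a problem of learning the belief-state local slopes $\alpha$, for each of which there is an optimal action (policy)
- The policy is learned by tracking the slopes approximately
[Figure: cost function $\chi$ plotted over the belief space]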
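To make the slope representation concrete, the sketch below evaluates a piecewise-linear cost function stored as a list of slope vectors, each paired with the action it recommends. The function name, the `(slope, action)` pairing, and the numbers in the usage note are assumptions for illustration; the sketch shows only how a learned set of slopes is used, not how the point-based algorithm learns it.

```python
def evaluate_cost(belief, alpha_vectors):
    """Return (cost, action) for the slope vector that minimizes the cost at `belief`.

    alpha_vectors is a list of (slope, action) pairs, where slope[s] is the
    cost contribution of state s and action is the action tied to that slope.
    """
    best_cost, best_action = None, None
    for slope, action in alpha_vectors:
        # Each slope defines a cost that is linear in the belief state.
        cost = sum(a * b for a, b in zip(slope, belief))
        if best_cost is None or cost < best_cost:
            best_cost, best_action = cost, action
    return best_cost, best_action
```

With the hypothetical slopes `[0, 10]` (declare target 1), `[10, 0]` (declare target 2), and `[4, 4]` (sense), a confident belief such as `[0.9, 0.1]` selects a declaration, while the ambiguous belief `[0.5, 0.5]` selects sensing. Taking the minimum of linear functions is what makes the resulting cost concave in the belief.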


17 Two Distinct POMDP Formulations
Infinite horizon with reset upon each classification
- Appropriate when we wish to perform a sequence of many classifications
- Multi-target sensing within a budget
- The policy has the opportunity to opt out of a difficult sensing case (target ambiguity)
Algorithm transitions into an absorbing state after classification
- Finite-horizon policy, with the horizon dictated by the difficulty of the initial belief state
- Does not have the opportunity to opt out of a difficult classification case

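18 [Figure: state diagram for two targets (Target 1, Target 2). Sensing actions incur cost $c(s, a_m)$ for sensor $m$. Classification actions $\hat{a}_1$ and $\hat{a}_2$ lead to terminal states with costs $C_{11}$, $C_{12}$, $C_{21}$, $C_{22}$, depending on whether the object is actually Target 1 or Target 2. After classification the process either undergoes a random reset or enters an absorbing state.]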


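20 Myopic Sensing with Stop Criterion
- The Bayes risk of classifying in belief state $b_T$ is
$$R(b_T) = \min_u \sum_{v=1}^{N} C_{uv} \sum_{s \in S_v} b_T(s)$$
- Given belief state $b_T$, the expected risk after taking sensing action $a_{T+1}$ may be expressed as
$$E[\hat{R}(b_{T+1})] = \sum_{o_{T+1} \in O} p(o_{T+1} \mid b_T, a_{T+1}) \min_u \sum_{v=1}^{N} C_{uv} \sum_{s \in S_v} b_{T+1}(s)$$
- We compute the difference between the cost of action $a_{T+1}$ and the corresponding expected reduction in risk:
$$\Delta = c(s, a_{T+1}) - \big( R(b_T) - E[\hat{R}(b_{T+1})] \big)$$
- We terminate sensing when this difference becomes positive (the cost exceeds the expected reduction in risk)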
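The stop criterion can be sketched as follows for a single candidate sensing action. All names and the list-based layout of the transition and observation models are assumptions for this illustration; a fuller greedy policy would also compare several candidate sensing actions, which is omitted here.

```python
def bayes_risk(belief, state_target, cost):
    """R(b): minimum over declarations u of the expected classification cost."""
    n_targets = len(cost)
    posterior = [0.0] * n_targets
    for s, prob in enumerate(belief):
        posterior[state_target[s]] += prob
    return min(
        sum(cost[u][v] * posterior[v] for v in range(n_targets))
        for u in range(n_targets)
    )


def expected_risk_after(belief, trans_a, obs_a, state_target, cost):
    """Expected Bayes risk after one sensing action a.

    Assumed layout: trans_a[s][s2] = p(s2 | s, a), obs_a[s2][o] = Pr(o | s2, a).
    """
    n = len(belief)
    predicted = [sum(trans_a[s][s2] * belief[s] for s in range(n)) for s2 in range(n)]
    total = 0.0
    for o in range(len(obs_a[0])):
        joint = [obs_a[s2][o] * predicted[s2] for s2 in range(n)]
        p_o = sum(joint)  # probability of observing o after the action
        if p_o > 0.0:
            updated = [x / p_o for x in joint]
            total += p_o * bayes_risk(updated, state_target, cost)
    return total


def should_stop(belief, sensing_cost, trans_a, obs_a, state_target, cost):
    """True when the sensing cost exceeds the expected reduction in risk."""
    reduction = bayes_risk(belief, state_target, cost) - expected_risk_after(
        belief, trans_a, obs_a, state_target, cost
    )
    return sensing_cost - reduction > 0.0
```

An informative measurement lowers the expected risk by more than its cost, so sensing continues; a measurement whose observations are equally likely in every state cannot reduce the risk, so any positive sensing cost triggers the stop.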


22 [Figure: the five targets (Target 1 through Target 5) and the target-sensor orientation angle φ; one surviving label reads "Internal Oscillator"]

23 Multi-Aspect Data
[Figure: five panels, "Response for Target 1" through "Response for Target 5", each plotting frequency (kHz) against angle (deg)]

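24 Full-Band Data: Classification Accuracy vs. Average Number of Actions
Settings: $c(s, a) = 1$, $C_{uu} = -10$, and $C_{uv} = C_c$, with $C_c$ variable from 5 to [upper limit lost in extraction]
[Figure: classification accuracy plotted against the average number of actions. Curves: POMDP - absorbing; POMDP - reset; Greedy - stopping; Greedy - no stopping]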

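25 Full-Band Data: Classification Accuracy vs. Cost of Misclassification
Settings: $c(s, a) = 1$, $C_{uu} = -10$, and $C_{uv} = C_c$, with $C_c$ variable from 5 to [upper limit lost in extraction]
[Figure: classification accuracy plotted against the cost of misclassification. Curves: POMDP - absorbing; POMDP - reset; Greedy - stopping]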

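26 Full-Band Data: Cost per Action vs. Cost of Misclassification
Settings: $c(s, a) = 1$, $C_{uu} = -10$, and $C_{uv} = C_c$, with $C_c$ variable from 5 to [upper limit lost in extraction]
[Figure: cost per action plotted against the cost of misclassification. Curves: POMDP - absorbing; POMDP - reset; Greedy - stopping]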


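28 Action: Select Sub-band and Angle
Settings: $c(s, a) = 1$, $C_{uu} = -10$, and $C_{uv} = C_c$, with $C_c = 40$
[Table: classification results; the layout did not survive extraction]
Row labels: Fixed angular sampling (5°); Angle selection
Column labels: Fixed sub-band LL; Fixed sub-band HL; Fixed sub-band LH; Fixed sub-band HH; Fixed band (full band); Sub-band selection
Surviving values, in extraction order: 86.11%, 72.67%, 73.72%, 77.72%, 76.50%, 90.72%. The remaining percentages were lost; a lone value of 2.5 also survives without its label.
Legend:
- Black: HMM with fixed angular sampling of 5° (five actions)
- Blue: myopic POMDP with a fixed number of five actions
- Red: non-myopic POMDP with reset (average number of actions)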

29 Summary and Future Work
- Have developed a POMDP formulation for a general sensing problem, with the policy designed to define the optimal action for a given belief state, accounting for a discounted infinite horizon (non-myopic)
- The algorithm operates in real time and optimally integrates the sensing and signal-processing tasks (a perfect match for UUVs, for example)
- Key point: the POMDP formulation assumes access to a model of the targets in order to learn the optimal policy; this may not be realistic in many settings
- Reinforcement learning (RL) is a generalization of POMDP, wherein the sensing actions are not performed simply to exploit an underlying model optimally; the actions also address exploration, to learn more about a given environment or target that may not have been seen previously
- RL POMDP optimally executes actions in a non-myopic setting to address the exploitation-exploration tradeoff; the research is now being extended toward RL POMDP
