Lecture 13: Sequential Decision Making (CS 486/686)

Outline
CS 486/686, Lecture 13, October 27, 2009
- Markov Decision Processes
- Dynamic Decision Networks
Readings: Russell and Norvig, Sect. 17.1, 17.2 (up to p. 620), 17.4

Sequential Decision Making
- Static decision making: decision networks
- Static inference: Bayesian networks
- Sequential inference: hidden Markov models, dynamic Bayesian networks
- Sequential decision making: Markov decision processes, dynamic decision networks

Sequential Decision Making: Applications
Wide range of applications:
- Robotics (e.g., control)
- Investments (e.g., portfolio management)
- Computational linguistics (e.g., dialogue management)
- Operations research (e.g., inventory management, resource allocation, call admission control)
- Assistive technologies (e.g., patient monitoring and support)

Markov Decision Process
Intuition: a Markov process with decision nodes a_0, a_1, a_2, ... and utility nodes attached to the states s_0, s_1, s_2, s_3, ...

Stationary Preferences
Hmm, but why many utility nodes? A single utility function U(s_0, s_1, s_2, ...) over an infinite process would be an infinite utility function.
Solution: assume stationary and additive preferences:
U(s_0, s_1, s_2, ...) = Σ_t R(s_t)

Discounted/Average Rewards
If the process is infinite, isn't Σ_t R(s_t) infinite?
Solution 1: discounted rewards.
- Discount factor: 0 ≤ γ < 1
- Finite utility: Σ_t γ^t R(s_t) is a geometric sum
- γ is like an inflation rate of 1/γ − 1
- Intuition: prefer utility sooner rather than later
Solution 2: average rewards.
- More complicated computationally
- Beyond the scope of this course

Markov Decision Process: Definition
- Set of states: S
- Set of actions (i.e., decisions): A
- Transition model: Pr(s_t | a_{t−1}, s_{t−1})
- Reward model (i.e., utility): R(s_t)
- Discount factor: 0 ≤ γ ≤ 1
- Horizon (i.e., # of time steps): h
Goal: find an optimal policy.

Inventory Management
- States: inventory levels
- Actions: {doNothing, orderWidgets}
- Transition model: stochastic demand
- Reward model: sales − costs − storage
- Discount factor:
- Horizon:
Tradeoff: increasing supplies decreases the odds of missed sales but increases storage costs.

Policy
- Choice of action at each time step
- Formally: a mapping from states to actions, i.e., δ(s_t) = a_t
- Assumption: fully observable states. This allows a_t to be chosen based only on the current state s_t. Why? Because the dynamics are Markovian, the current state summarizes everything about the past that is relevant to the future.

Policy Optimization
Policy evaluation: compute the expected utility
EU(δ) = Σ_{t=0..h} γ^t Pr(s_t | δ) R(s_t)
Optimal policy δ*: the policy with the highest expected utility, i.e., EU(δ) ≤ EU(δ*) for all δ.
Three algorithms to optimize the policy (a small policy-evaluation sketch follows below):
- Value iteration
- Policy iteration
- Linear programming
Value iteration is equivalent to variable elimination.
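To make policy evaluation concrete, here is a minimal numpy sketch (written for these notes, not code from the lecture). It propagates the state distribution Pr(s_t | δ) forward one step at a time and accumulates Σ_t γ^t Pr(s_t | δ) R(s_t); the two-state transition matrices, rewards and policy are made-up toy values.

```python
import numpy as np

def evaluate_policy(P, R, delta, d0, gamma, h):
    """EU(delta) = sum_{t=0..h} gamma^t Pr(s_t | delta) R(s_t).

    P[a][s, s'] = Pr(s' | s, a); R[s] = reward of state s;
    delta[s] = action taken in state s; d0 = initial state distribution.
    """
    n = len(R)
    # Transition matrix induced by the (stationary) policy delta.
    P_delta = np.array([P[delta[s]][s] for s in range(n)])
    d, eu = d0.copy(), 0.0
    for t in range(h + 1):
        eu += gamma**t * (d @ R)  # gamma^t * sum_s Pr(s_t = s) * R(s)
        d = d @ P_delta           # d_{t+1}(s') = sum_s d_t(s) Pr(s' | s, delta(s))
    return eu

# Toy example: 2 states, 2 actions (all numbers invented).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.1, 0.9]])}
R = np.array([1.0, -1.0])
print(evaluate_policy(P, R, delta=[0, 1], d0=np.array([1.0, 0.0]), gamma=0.9, h=50))
```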

Value Iteration
- Nothing more than variable elimination
- Performs dynamic programming
- Optimizes decisions in reverse order
(Network: decision nodes a_0, a_1, a_2 over states s_0, s_1, s_2, s_3 with reward nodes.)

At each t, starting from t = h down to 0, optimize a_t, i.e., compute EU(a_t | s_t):
- Factors: Pr(s_{i+1} | a_i, s_i) and R(s_i), for 0 ≤ i ≤ h
- Restrict to s_t
- Eliminate s_{t+1}, ..., s_h and a_{t+1}, ..., a_h

Value when no time steps are left:
V(s_h) = R(s_h)
Value with one time step left:
V(s_{h−1}) = max_{a_{h−1}} R(s_{h−1}) + γ Σ_{s_h} Pr(s_h | s_{h−1}, a_{h−1}) V(s_h)
Value with two time steps left:
V(s_{h−2}) = max_{a_{h−2}} R(s_{h−2}) + γ Σ_{s_{h−1}} Pr(s_{h−1} | s_{h−2}, a_{h−2}) V(s_{h−1})
Bellman's equation (implemented in the sketch below):
V(s_t) = max_{a_t} R(s_t) + γ Σ_{s_{t+1}} Pr(s_{t+1} | s_t, a_t) V(s_{t+1})
a_t* = argmax_{a_t} R(s_t) + γ Σ_{s_{t+1}} Pr(s_{t+1} | s_t, a_t) V(s_{t+1})

Markov Decision Process: Example
You own a company. In every state you must choose between Saving money or Advertising. States: Poor & Unknown (PU), Poor & Famous (PF), Rich & Unknown (RU), Rich & Famous (RF); γ = 0.9.
(Figure: transition diagram and the table of values V(PU), V(PF), V(RU), V(RF) for t = h, h−1, ..., h−5; the numbers are omitted.)

Finite Horizon
When h is finite, the optimal policy is non-stationary:
- The best action can differ at each time step
- Intuition: the best action varies with the amount of time left
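Here is a minimal sketch of finite-horizon value iteration implementing the Bellman backup above (again an illustration for these notes, not the lecture's code; the toy P and R match the earlier sketch). Note that it returns one decision rule per time step, since the finite-horizon optimal policy is non-stationary:

```python
import numpy as np

def value_iteration(P, R, gamma, h):
    """Backward induction: V(s_h) = R(s_h), then Bellman backups down to t = 0."""
    actions = sorted(P)
    V = R.copy()                 # V(s_h) = R(s_h)
    policy = []                  # policy[t][s] = a_t*(s); non-stationary in general
    for t in range(h - 1, -1, -1):
        # Q[a, s] = R(s) + gamma * sum_{s'} Pr(s' | s, a) V(s')
        Q = np.array([R + gamma * (P[a] @ V) for a in actions])
        policy.insert(0, Q.argmax(axis=0))  # a_t*(s) = argmax_a Q[a, s]
        V = Q.max(axis=0)                   # V(s_t) = max_a Q[a, s]
    return V, policy

# Same toy model as in the policy-evaluation sketch.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.1, 0.9]])}
R = np.array([1.0, -1.0])
V0, policy = value_iteration(P, R, gamma=0.9, h=10)
print(V0, policy[0])  # values at t = 0 and the first decision rule
```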

Infinite Horizon
When h is infinite, the optimal policy is stationary:
- Same best action at each time step
- Intuition: the same (infinite) amount of time is left at each time step, hence the same best action
Problem: value iteration would need an infinite number of iterations.
Assuming a discount factor γ, after k time steps rewards are scaled down by γ^k. For large enough k, the remaining rewards become insignificant since γ^k → 0.
Solution (see the sketch below for picking k):
- Pick a large enough k
- Run value iteration for k steps
- Execute the policy found at the k-th iteration

Computational Complexity
Space and time: O(k|A||S|^2), where k is the number of iterations.
But what if S and A are defined by several random variables, and are consequently exponential in size?
Solution: exploit conditional independence with a dynamic decision network.
(Figure: a dynamic decision network with action nodes Act_{t−2}, ..., Act_t, state variables M, T, L, C, N in each time slice, and reward nodes R_{t−2}, ..., R_{t+1}.)

Dynamic Decision Network
Similarly to dynamic Bayes nets:
- Compact representation
- Exponential time for decision making

Partial Observability
What if states are not fully observable?
Solution: Partially Observable Markov Decision Process.
(Figure: the MDP network of states s_0, ..., s_3, actions a_0, a_1, a_2 and rewards, augmented with observation nodes o_t hanging off the states.)
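How large must k be? A quick illustration (my own derivation, consistent with the slide's argument): with rewards bounded by Rmax, the neglected tail Σ_{t≥k} γ^t R(s_t) is at most γ^k Rmax / (1 − γ), so it suffices to pick the smallest k that drives this bound below a tolerance ε.

```python
import math

def enough_iterations(gamma, r_max, eps):
    """Smallest k with gamma^k * r_max / (1 - gamma) < eps, so that the
    rewards ignored after k steps contribute less than eps to the utility."""
    assert 0 < gamma < 1
    return math.ceil(math.log(eps * (1 - gamma) / r_max, gamma))

print(enough_iterations(gamma=0.9, r_max=1.0, eps=1e-3))  # 88
# Run value iteration for this many steps, then execute the resulting policy.
```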

Partially Observable Markov Decision Process (POMDP)
Definition:
- Set of states: S
- Set of actions (i.e., decisions): A
- Set of observations: O
- Transition model: Pr(s_t | a_{t−1}, s_{t−1})
- Observation model: Pr(o_t | s_t)
- Reward model (i.e., utility): R(s_t)
- Discount factor: 0 ≤ γ ≤ 1
- Horizon (i.e., # of time steps): h

POMDP
Problem: the action choice generally depends on all previous observations.
Two solutions:
- Consider only policies that depend on a finite history of observations
- Find stationary sufficient statistics encoding the relevant past observations (the belief state; see the sketch below)
Policy: a mapping from past observations to actions.

Partially Observable DDN
Actions do not depend on all state variables.
(Figure: a partially observable DDN with action nodes Act_{t−2}, ..., Act_t, state variables M, T, L, C, N in each time slice, and reward nodes R_{t−2}, ..., R_{t+1}.)

Policy Optimization
- Value iteration (variable elimination)
- Policy iteration
POMDP and PODDN complexity: exponential in |O| and k when the action choice depends on all previous observations. In practice, good policies based on a subset of past observations can still be found.

COACH Project
Aging population: an automated prompting system to help elderly persons wash their hands.
IATSL: Alex Mihailidis, Pascal Poupart, Jennifer Boger, Jesse Hoey, Geoff Fernie and Craig Boutilier.

Dementia
- Deterioration of intellectual faculties
- Confusion
- Memory losses (e.g., Alzheimer's disease)
Consequences:
- Loss of autonomy
- Continual and expensive care required
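The stationary sufficient statistic mentioned above is the belief state b_t(s) = Pr(s_t | a_0, o_1, ..., a_{t−1}, o_t), updated by Bayes' rule after each action and observation. A minimal sketch (illustrative, with made-up transition model T and observation model Z):

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """b'(s') proportional to Pr(o | s') * sum_s Pr(s' | s, a) * b(s).

    b[s] = current belief; T[a][s, s'] = Pr(s' | s, a); Z[s, o] = Pr(o | s).
    """
    b_pred = b @ T[a]           # predict with the transition model
    b_new = Z[:, o] * b_pred    # correct with the observation likelihood
    return b_new / b_new.sum()  # normalize

# Toy example: 2 states, 1 action, 2 observations (all numbers invented).
T = {0: np.array([[0.7, 0.3], [0.3, 0.7]])}
Z = np.array([[0.9, 0.1],   # Pr(o | s = 0)
              [0.2, 0.8]])  # Pr(o | s = 1)
print(belief_update(np.array([0.5, 0.5]), a=0, o=1, T=T, Z=Z))
```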

Intelligent Assistive Technology
Let's facilitate aging in place.
- Intelligent assistive technology
- Non-obtrusive, yet pervasive
- Adaptable sensors
Benefits:
- Greater autonomy
- Feeling of independence

System Overview
(Figure: sensors feed a planning module, which issues verbal cues to assist with hand washing.)

Prompting Strategy
- Sequential decision problem: a sequence of prompts
- Noisy sensors & imprecise actuators: noisy image processing, uncertain prompt effects
- Partially unknown environment: unknown user habits, preferences and abilities
- Tradeoff between complex concurrent goals: rapid task completion vs. greater autonomy
Approach: Partially Observable Markov Decision Processes (POMDPs).

POMDP Components
State set S = dom(HL) x dom(WF) x dom(D) x ...
- Hand Location (HL): {tap, water, soap, towel, sink, away, ...}
- Water Flow (WF): {on, off}
- Dementia (D): {high, low}, etc.
Observation set O = dom(C) x dom(F)
- Camera (C): {handsAtTap, handsAtTowel, ...}
- Faucet sensor (F): {waterOn, waterOff}
Action set A:
- DoNothing, CallCaregiver
- Prompt: {turnOnWater, rinseHands, useSoap, ...}
(A sketch enumerating this factored state set follows below.)

POMDP Components (continued)
Transition function Pr(s' | s, a) and observation function Pr(o | s):
(Table: example entries for states such as (sink, off), (tap, on) and (soap, off); e.g., the observation function assigns probability 0.95 to correctly observing (tap, on).)

Reward function R(s, a):
- Task completed: +10
- Call caregiver: −30
- Each prompt: −1, −2 or −3

Next Class
Machine Learning: Decision Trees
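To see how quickly the factored state set grows, here is a small sketch (my own illustration; the domains abbreviate the slide's) that enumerates S = dom(HL) x dom(WF) x dom(D) as tuples:

```python
from itertools import product

# Abbreviated variable domains from the slide (illustrative subset).
hand_location = ["tap", "water", "soap", "towel", "sink", "away"]
water_flow = ["on", "off"]
dementia = ["high", "low"]

# Factored state set S = dom(HL) x dom(WF) x dom(D).
states = list(product(hand_location, water_flow, dementia))
print(len(states))  # 6 * 2 * 2 = 24 joint states

# Each extra variable multiplies |S|, which is why flat value iteration's
# O(k|A||S|^2) cost blows up and DDNs/PODDNs exploit conditional independence.
```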
