Grundlagen der Künstlichen Intelligenz


1 Grundlagen der Künstlichen Intelligenz: Formal models of interaction. Daniel Hennes (WS 2017/18), University Stuttgart - IPVS - Machine Learning & Robotics

2 Today
  - Taxonomy of domains
  - Models of interaction
  - Planning domain definition language
  - Markov decision processes ... and extensions
  - Basics of reinforcement learning
  - Models in control theory
  - Russell & Norvig: Chapter 2.3, 10.1,

3 Taxonomy of domains
  Domains (or models of domains) can be distinguished based on their state representation:
  - atomic: black box with no internal structure
  - factored: variables that are boolean, integer, real-valued, or one of a fixed set of symbols
  - structured: objects with attributes, relationships between objects

4 Relational representations of state
  The world is composed of objects; its state is described in terms of properties and relations of objects.
  Formally:
  - A set of constants (referring to objects)
  - A set of predicates (referring to object properties or relations)
  - A set of functions ()
  A (grounded) state can then be described by a conjunction of predicates (and functions), e.g.:
  - Constants: $C_1, C_2, P_1, P_2, SFO, JFK$
  - Predicates: At(.,.), Cargo(.), Plane(.), Airport(.)
  - A state description: $At(C_1, SFO) \land At(C_2, JFK) \land At(P_1, SFO) \land At(P_2, JFK) \land Cargo(C_1) \land Cargo(C_2) \land Plane(P_1) \land Plane(P_2) \land Airport(JFK) \land Airport(SFO)$

5 Taxonomy of domains
  Categories of Russell & Norvig:
  - Fully observable vs. partially observable
  - Single agent vs. multiagent
  - Deterministic vs. stochastic
  - Episodic vs. sequential
  - Static vs. dynamic
  - Discrete vs. continuous
  - Known vs. unknown
  Additionally:
  - Propositional vs. relational
  - Time discrete vs. time continuous

6 Taxonomy of domains
  (from Russell & Norvig, Section 2.3, The Nature of Environments)

  Task Environment           Observable  Agents  Deterministic  Episodic    Static   Discrete
  Crossword puzzle           Fully       Single  Deterministic  Sequential  Static   Discrete
  Chess with a clock         Fully       Multi   Deterministic  Sequential  Semi     Discrete
  Poker                      Partially   Multi   Stochastic     Sequential  Static   Discrete
  Backgammon                 Fully       Multi   Stochastic     Sequential  Static   Discrete
  Taxi driving               Partially   Multi   Stochastic     Sequential  Dynamic  Continuous
  Medical diagnosis          Partially   Single  Stochastic     Sequential  Dynamic  Continuous
  Image analysis             Fully       Single  Deterministic  Episodic    Semi     Continuous
  Part-picking robot         Partially   Single  Stochastic     Episodic    Dynamic  Continuous
  Refinery controller        Partially   Single  Stochastic     Sequential  Dynamic  Continuous
  Interactive English tutor  Partially   Multi   Stochastic     Sequential  Dynamic  Discrete

  Figure 2.6 Examples of task environments and their characteristics.

7 Formal models of interaction
  - Tables
  - Planning languages
    - Planning Domain Definition Language (PDDL)
    - Stanford Research Institute Problem Solver (STRIPS)
    - Action description language (ADL)
  - Markov decision processes (MDPs) ... and extensions
  - Models in control theory

8 Tables
  Trivial: store a mapping $(s, a) \mapsto s'$ as a table
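A minimal sketch of the table model, assuming a hypothetical two-state light-switch domain that is not part of the slides. The dictionary `transition_table` and the helpers `step` and `rollout` are illustrative names only; the point is that every (state, action) pair needs its own entry.

```python
# Hypothetical toy domain (an assumption for illustration): a light switch.
transition_table = {
    ("off", "press"): "on",
    ("on", "press"): "off",
    ("off", "wait"): "off",
    ("on", "wait"): "on",
}


def step(state, action):
    """Look up the successor state s' for the pair (s, a)."""
    return transition_table[(state, action)]


def rollout(state, actions):
    """Apply a sequence of actions and return all visited states."""
    visited = [state]
    for action in actions:
        state = step(state, action)
        visited.append(state)
    return visited
```

The table grows with the product of the number of states and actions, which is one motivation for the more compact rule-based descriptions discussed next.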

9 Planning Domain Definition Language
  Developed for the 1998/2000 International Planning Competition

10 PDDL vs. table
  PDDL also describes a deterministic mapping $(s, a) \mapsto s'$, but using a set of action schemas (rules) of the form
    ActionName(...) : PRECONDITION $\rightarrow$ EFFECT
  where action arguments are variables and the preconditions and effects are conjunctions of predicates
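The following sketch shows how one action schema can replace many table rows. It is not PDDL syntax: it assumes a STRIPS-style encoding in which a state is a set of ground atoms and an effect is split into an add list and a delete list. The `Fly` schema is borrowed from the air cargo example in Russell & Norvig, Section 10.1, and the names `ActionSchema`, `ground`, and `apply` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSchema:
    name: str
    params: tuple   # variable names, e.g. ("p", "src", "dst")
    precond: tuple  # atoms over variables that must hold in the state
    add: tuple      # atoms made true by the effect
    delete: tuple   # atoms made false by the effect


def ground(atoms, binding):
    """Substitute constants for variables in a tuple of atoms."""
    return frozenset(
        (atom[0],) + tuple(binding.get(arg, arg) for arg in atom[1:])
        for atom in atoms
    )


def apply(schema, args, state):
    """Return the successor state, or None if the precondition fails."""
    binding = dict(zip(schema.params, args))
    if not ground(schema.precond, binding) <= state:
        return None
    return (state - ground(schema.delete, binding)) | ground(schema.add, binding)


# Assumed schema, following the textbook air cargo example.
fly = ActionSchema(
    name="Fly",
    params=("p", "src", "dst"),
    precond=(("At", "p", "src"), ("Plane", "p"),
             ("Airport", "src"), ("Airport", "dst")),
    add=(("At", "p", "dst"),),
    delete=(("At", "p", "src"),),
)

state = frozenset({
    ("At", "P1", "SFO"), ("At", "P2", "JFK"),
    ("Plane", "P1"), ("Plane", "P2"),
    ("Airport", "SFO"), ("Airport", "JFK"),
})
next_state = apply(fly, ("P1", "SFO", "JFK"), state)
```

A single schema covers every plane and every pair of airports, whereas the table would need one row per grounded combination.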


12 Interaction domains with uncertainty
  An agent is an entity that perceives and acts.
  Uncertainty due to:
  - Stochasticity in the system (outcome of actions uncertain)
  - Incomplete observability (cannot observe all the variables)
  - Incomplete modeling (discard some of the information)
  - Laziness (simple uncertain rule vs. complex certain one)
  Noisy Deictic Rules: a probabilistic extension of PDDL rules

13 Recall: Expectimax
  $\mathrm{Expectimax}(C) = \sum_i P(d_i) \max_{s \in S(d_i)} \mathrm{Minimax}(s)$
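As a small illustration of the formula, the sketch below evaluates a single chance node: for every chance outcome it takes the best successor value and weights it by the outcome probability. It assumes the successor values have already been computed; the function name `chance_node_value` and the example numbers are hypothetical.

```python
def chance_node_value(outcomes):
    """Value of a chance node.

    outcomes: list of (probability, successor_values) pairs, where
    successor_values holds the already evaluated values of the states
    reachable after that chance outcome.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    # Expectation over chance outcomes of the best successor value.
    return sum(p * max(values) for p, values in outcomes)


# Hypothetical chance node with two equally likely outcomes.
value = chance_node_value([(0.5, [3.0, 1.0]), (0.5, [-2.0, 0.0])])
```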

14 Example: grid world
  - actions: Up, Down, Left, Right
  - goal states +1, -1 are terminal states
  - fully observable
  - non-deterministic

15 Example: grid world strategy: [Up, Up, Right, Right, Right] = [Right, Right, Up, Up, Right] =
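The slide does not state the grid layout or the transition noise, so the sketch below rests entirely on assumptions: the 4x3 grid of Russell & Norvig with a wall at (2, 2), terminals +1 at (4, 3) and -1 at (4, 2), start at (1, 1), and an action that moves in the intended direction with probability 0.8 and slips to either perpendicular direction with probability 0.1 each. Under those assumptions, the hypothetical function `goal_probability` propagates the state distribution through a fixed action sequence and returns the probability of ending in the +1 state.

```python
from collections import defaultdict

# ASSUMED layout and noise model (Russell & Norvig 4x3 grid); not on the slide.
COLS, ROWS = 4, 3
WALL = (2, 2)
TERMINALS = {(4, 3): +1, (4, 2): -1}
START = (1, 1)
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
# Perpendicular slip directions for each intended action.
SIDEWAYS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
            "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def move(state, direction):
    """Deterministic move; bumping into the wall or border leaves the state unchanged."""
    dx, dy = MOVES[direction]
    target = (state[0] + dx, state[1] + dy)
    inside = 1 <= target[0] <= COLS and 1 <= target[1] <= ROWS
    return target if inside and target != WALL else state


def goal_probability(actions, p_intended=0.8, p_slip=0.1):
    """Probability of sitting in the +1 terminal after executing the fixed sequence."""
    dist = {START: 1.0}
    for action in actions:
        new_dist = defaultdict(float)
        for state, prob in dist.items():
            if state in TERMINALS:  # terminal states are absorbing
                new_dist[state] += prob
                continue
            new_dist[move(state, action)] += p_intended * prob
            for side in SIDEWAYS[action]:
                new_dist[move(state, side)] += p_slip * prob
        dist = new_dist
    return dist.get((4, 3), 0.0)


p_up_first = goal_probability(["Up", "Up", "Right", "Right", "Right"])
p_right_first = goal_probability(["Right", "Right", "Up", "Up", "Right"])
```

Because a fixed sequence cannot react to slips, such open-loop strategies succeed only with limited probability, which motivates policies that choose an action for every state.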

16 Markov Decision Process
  - Set of states $s \in S$
  - Set of actions $a \in A$
    - might depend on the current state $s$: $A(s)$
  - Transition function $T(s, a, s')$
    - probability that taking action $a$ in $s$ leads to $s'$
    - $T(s, a, s') = P(s' \mid s, a)$
    - also called the model
  - Reward function $R(s, a, s')$
    - or $R(s)$, $R(s')$, or $R(s, a)$
  - Start state $s_0$ (or distribution $P(s_0)$)
  - Termination criteria
  - Stationary if $T$ and $R$ are independent of time
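One possible way to hold these components in code is sketched below. The class `MDP`, its field layout, and the two-state example are assumptions made for illustration; the slides do not prescribe a data structure. The `validate` method checks that every $T(s, a, \cdot)$ is a probability distribution.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    T: dict      # T[(s, a)] -> {s_next: probability}
    R: dict      # R[(s, a, s_next)] -> reward (missing entries count as 0)
    start: str

    def validate(self, tol=1e-9):
        """Check that the successor probabilities of every (s, a) sum to 1."""
        for (s, a), successors in self.T.items():
            total = sum(successors.values())
            if abs(total - 1.0) > tol:
                raise ValueError(f"T({s!r}, {a!r}, .) sums to {total}")
        return True

    def expected_reward(self, s, a):
        """Expected immediate reward of taking action a in state s."""
        return sum(p * self.R.get((s, a, s_next), 0.0)
                   for s_next, p in self.T[(s, a)].items())


# Hypothetical two-state example.
toy = MDP(
    states=["A", "B"],
    actions=["stay", "go"],
    T={("A", "go"): {"B": 0.9, "A": 0.1},
       ("A", "stay"): {"A": 1.0},
       ("B", "go"): {"B": 1.0},
       ("B", "stay"): {"B": 1.0}},
    R={("A", "go", "B"): 1.0},
    start="A",
)
```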

17 Markov Decision Process
  [Figure: graphical model over states s_0, s_1, s_2, actions a_0, a_1, a_2 and rewards r_0, r_1, r_2]
  $P(s_{0:T+1}, a_{0:T}, r_{0:T}; \pi) = P(s_0) \prod_{t=0}^{T} P(a_t \mid s_t; \pi) \, P(r_t \mid s_t, a_t) \, P(s_{t+1} \mid s_t, a_t)$
  - initial state distribution: $P(s_0)$
  - transition probabilities: $P(s_{t+1} \mid s_t, a_t)$
  - reward probabilities: $P(r_t \mid s_t, a_t)$
  - agent's policy: $\pi(a_t \mid s_t) = P(a_t \mid s_t; \pi)$ (or deterministic $a_t = \pi(s_t)$)
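The factorization can be read as a sampling recipe: draw the initial state, then repeatedly draw an action from the policy, a reward, and a successor state. The sketch below follows that order. All distributions are plain dictionaries mapping outcomes to probabilities, and the function name `sample_trajectory` as well as the toy model are hypothetical.

```python
import random


def sample_trajectory(P0, policy, T, R, horizon, rng):
    """Sample (s_t, a_t, r_t, s_{t+1}) for t = 0..horizon from the joint distribution."""
    def draw(dist):
        outcomes, weights = zip(*dist.items())
        return rng.choices(outcomes, weights=weights)[0]

    s = draw(P0)                      # s_0 ~ P(s_0)
    trajectory = []
    for _ in range(horizon + 1):
        a = draw(policy[s])           # a_t ~ P(a_t | s_t; pi)
        r = draw(R[(s, a)])           # r_t ~ P(r_t | s_t, a_t)
        s_next = draw(T[(s, a)])      # s_{t+1} ~ P(s_{t+1} | s_t, a_t)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


# Hypothetical toy model with two states.
P0 = {"A": 1.0}
policy = {"A": {"go": 0.5, "stay": 0.5}, "B": {"stay": 1.0}}
T = {("A", "go"): {"B": 0.9, "A": 0.1},
     ("A", "stay"): {"A": 1.0},
     ("B", "stay"): {"B": 1.0}}
R = {("A", "go"): {0.0: 1.0},
     ("A", "stay"): {0.0: 1.0},
     ("B", "stay"): {1.0: 1.0}}

trajectory = sample_trajectory(P0, policy, T, R, horizon=4, rng=random.Random(0))
```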

18 Markov property
  Given the present state, the future and the past are independent.
  $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t)$

19 Solving MDPs
  - Deterministic search problems: optimal plan, or sequence of actions, from start to a goal
  - MDP: optimal policy $\pi : S \rightarrow A$
    - an optimal policy maximizes expected reward
    - gives an action for each state
    - defines a reflex agent

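20 Maximizing expected reward
  Cumulative reward:
    $E[r_{t_0} + r_{t_1} + r_{t_2} + \cdots + r_T] = E\left[\sum_{t=t_0}^{T} r_t\right]$
  Discounted reward:
    $E[r_{t_0} + \gamma r_{t_1} + \gamma^2 r_{t_2} + \cdots] = E\left[\sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_t\right]$
  with $\gamma < 1$ and $\forall t : r_t \le r_{\max}$:
    $\sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_t \le \sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_{\max} = \frac{r_{\max}}{1-\gamma}$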
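A quick numerical check of the bound, assuming $0 \le \gamma < 1$ and a constant reward stream at $r_{\max}$ (the worst case for the bound). The helper `discounted_return` and the chosen values $\gamma = 0.9$, $r_{\max} = 1$ are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence, with k counted from t_0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


gamma, r_max = 0.9, 1.0
bound = r_max / (1.0 - gamma)            # closed form of the geometric series
partial = discounted_return([r_max] * 200, gamma)
```

The finite sum approaches the bound from below as the horizon grows, which is why discounting keeps the return finite even for infinitely long interactions.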

21 Reinforcement learning
  MDPs are the basis of reinforcement learning, where $P(s' \mid s, a)$ is not known to the agent.
  - Autonomous agent that interacts with its environment
  - Learning through interaction
  - Agent receives feedback in the form of rewards
  - Improves over time through trial & error
  [Figure: agent-environment loop; the agent observes state s_t and reward r_t and emits action a_t, the environment returns s_{t+1} and r_{t+1}]

22 Recent successes of (deep) reinforcement learning
  [Images from "Human-level control through deep reinforcement learning" and "Continuous control with deep reinforcement learning" (Google DeepMind / Nature)]

23 Extensions of MDPs
  - MDP: Markov decision process
  - POMDP: Partially observable Markov decision process
  - MMDP: Multi-agent Markov decision process
  - Dec-POMDP: Decentralized partially observable Markov decision process
  - SG: Stochastic game

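24 Partially Observable MDP (POMDP)
  [Figure: graphical model in which the agent receives observations y_0, y_1, y_2 and chooses actions a_0, a_1, a_2, over hidden states s_0, s_1, s_2 with rewards r_0, r_1, r_2]
  - initial state distribution: $P(s_0)$
  - transition probabilities: $P(s_{t+1} \mid s_t, a_t)$
  - observation probabilities: $P(y_{t+1} \mid s_{t+1}, a_t)$
  - reward probabilities: $P(r_t \mid s_t, a_t)$
  - The Markov property does not hold for $y$: $P(y_{t+1} \mid y_t, a_t)$ is unknown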

25 Agent models for POMDP
  - The agent maps $y_t \mapsto a_t$ (stimulus-response mapping ... generally non-optimal)
  - The agent stores all previous observations and maps $y_{0:t}, a_{0:t-1} \mapsto a_t$ (agent function)
  - The agent stores only the recent history and maps $y_{t-k:t}, a_{t-k:t-1} \mapsto a_t$ (may be a good heuristic)
  - The agent is some machine with its own internal state $n_t$, e.g., a computer, a finite state machine, a brain ... The agent maps $(n_{t-1}, y_t) \mapsto n_t$ (internal state update) and $n_t \mapsto a_t$
  - The agent maintains a full probability distribution (belief) $b_t(s_t)$ over the state, maps $(b_{t-1}, y_t) \mapsto b_t$ (Bayesian belief update), and $b_t \mapsto a_t$
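The Bayesian belief update in the last agent model can be sketched as a prediction step through the transition model followed by a correction step with the observation model, i.e. $b_t(s') \propto P(y_t \mid s', a_{t-1}) \sum_s P(s' \mid s, a_{t-1}) \, b_{t-1}(s)$. The dictionary layout, the function name `belief_update`, and the two-state listening example are assumptions for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """One Bayesian belief update.

    belief: {s: probability}
    T[(s, a)] -> {s_next: probability}      (transition model)
    O[(s_next, a)] -> {y: probability}      (observation model)
    """
    # Prediction: push the belief through the transition model.
    predicted = {}
    for s, b in belief.items():
        for s_next, p in T[(s, action)].items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * b
    # Correction: weight by the likelihood of the observation.
    unnormalized = {s_next: O[(s_next, action)].get(observation, 0.0) * p
                    for s_next, p in predicted.items()}
    evidence = sum(unnormalized.values())
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s_next: p / evidence for s_next, p in unnormalized.items()}


# Hypothetical example: the hidden state does not change while listening,
# and the observation is correct with probability 0.85.
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
     ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85}}
belief = belief_update({"left": 0.5, "right": 0.5}, "listen", "hear_left", T, O)
```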

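26 POMDP coupled to a state machine agent
  [Figure: graphical model with agent internal states n_0, n_1, n_2, observations y_0, y_1, y_2, actions a_0, a_1, a_2, hidden states s_0, s_1, s_2 and rewards r_0, r_1, r_2]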

27 Dec-POMDP
  [Figure from F. Oliehoek (2012)]

28 Models in control theory
  - Time is continuous: $t \in \mathbb{R}$
  - The system state, actions and observations are continuous: $x(t) \in \mathbb{R}^n$, $u(t) \in \mathbb{R}^d$, $y(t) \in \mathbb{R}^m$
  - A controlled system can be described as:
    - linear: $\dot{x} = Ax + Bu$, $y = Cx + Du$
    - non-linear: $\dot{x} = f(x, u)$, $y = h(x, u)$
  - A typical agent model is a feedback regulator (stimulus-response): $u = K e(y)$, where $e$ is some error term
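A minimal simulation of the linear model with a feedback regulator, under several simplifying assumptions: a scalar system (so A, B, C become the numbers `a`, `b`, `c`), D = 0, an error term chosen as e(y) = y_ref - y, and forward Euler integration with a fixed step. The function name `simulate` and all parameter values are hypothetical.

```python
def simulate(a, b, c, gain, x0, y_ref=0.0, dt=0.01, steps=500):
    """Forward Euler simulation of x' = a*x + b*u, y = c*x, u = gain * (y_ref - y)."""
    x = x0
    xs = [x]
    for _ in range(steps):
        y = c * x                     # observation
        u = gain * (y_ref - y)        # feedback regulator u = K e(y)
        x = x + dt * (a * x + b * u)  # Euler step of the state equation
        xs.append(x)
    return xs


# With a = 1 the uncontrolled system is unstable; feedback with gain 5 stabilizes it.
closed_loop = simulate(a=1.0, b=1.0, c=1.0, gain=5.0, x0=1.0)
open_loop = simulate(a=1.0, b=1.0, c=1.0, gain=0.0, x0=1.0)
```

With these values the closed-loop dynamics reduce to x' = -4x, so the state decays towards the reference, while the open-loop state grows without bound.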

29 Stochastic control
  The differential equations become stochastic:
    $dx = f(x, u) \, dt + d\xi_x$
    $dy = h(x, u) \, dt + d\xi_y$
  $d\xi$ is a Wiener process with $\langle d\xi, d\xi \rangle = C_{ij}(x, u)$
  This is the control theory analogue to POMDPs

30 Next
  - Solving MDPs
    - Dynamic programming
    - Bellman equation
    - Value iteration
    - Policy iteration
  - Reinforcement learning
    - Q-learning
    - SARSA
