CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Instructor: Byron Wallace. Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials. Thanks to John DeNero and Dan Klein.

What have we covered?* Search! - BFS, DFS, UCS - A* and informed search Constraint Satisfaction Problems (CSPs) Adversarial Search / game playing / expecti-max * Large areas; non-exhaustive

What have we covered?* (continued) Markov Decision Processes (MDPs) Reinforcement Learning (RL) Markov Models (MMs) / Hidden Markov Models (HMMs) - And corresponding probability theory * Large areas; non-exhaustive

What have we covered?* Search! - BFS, DFS, UCS - A* and informed search Constraint Satisfaction Problems (CSPs) Adversarial Search / game playing / expecti-max * Large areas; non-exhaustive

Search basics Agents that Plan Ahead We will treat plans as search problems Uninformed Search Methods - Depth-First Search - Breadth-First Search - Uniform-Cost Search

Search problems. A search problem consists of: a state space; a successor function (with actions and costs, e.g., N, 1.0 or E, 1.0); a start state; and a goal test. A solution is a sequence of actions (a plan) that transforms the start state into a goal state.

World states vs. search states. The world state includes every last detail of the environment; a search state keeps only the details needed for planning (abstraction). Problem: Pathing. States: (x,y) location. Actions: NSEW. Successor: update location only. Goal test: is (x,y) = end? Problem: Eat-All-Dots. States: {(x,y), dot booleans}. Actions: NSEW. Successor: update location and possibly a dot boolean (if we eat food). Goal test: dots all false.

State space graphs vs. search trees. [Figure: a small state space graph with states S, a-h, p, q, r, and G, alongside the corresponding search tree rooted at S.] Each node in the search tree is an entire path in the state space graph.

Depth-First Search (DFS). [Figure: the example state space graph and the portion of the search tree that DFS expands.] Strategy: expand a deepest node first. Implementation: the fringe is a stack (LIFO).

Breadth-First Search (BFS). [Figure: the example state space graph and the search tree expanded in tiers.] Strategy: expand a shallowest node first ("search tiers"). Implementation: the fringe is a FIFO queue.

Uniform Cost Search (UCS). Strategy: expand a cheapest node first. Implementation: the fringe is a priority queue (priority: cumulative cost). [Figure: the example graph with edge costs and the resulting cost contours in the search tree.]
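
To make the "only the fringe changes" idea concrete, here is a minimal sketch (not course-provided code) of the three uninformed searches on a toy graph. The graph encoding, function names, and the example graph itself are assumptions for illustration.

```python
# Minimal graph-search sketch: DFS, BFS, and UCS differ only in how the
# fringe orders nodes. The graph format here is illustrative: state -> list
# of (successor, step cost).
import heapq
from collections import deque

def dfs(graph, start, goal):
    """Depth-first search: the fringe is a stack (LIFO)."""
    fringe = [(start, [start])]          # each entry: (state, path so far)
    visited = set()
    while fringe:
        state, path = fringe.pop()       # pop from the end -> LIFO
        if state == goal:
            return path
        if state in visited:
            continue
        visited.add(state)
        for nxt, _cost in graph.get(state, []):
            fringe.append((nxt, path + [nxt]))
    return None

def bfs(graph, start, goal):
    """Breadth-first search: the fringe is a FIFO queue."""
    fringe = deque([(start, [start])])
    visited = {start}
    while fringe:
        state, path = fringe.popleft()   # pop from the front -> FIFO
        if state == goal:
            return path
        for nxt, _cost in graph.get(state, []):
            if nxt not in visited:
                visited.add(nxt)
                fringe.append((nxt, path + [nxt]))
    return None

def ucs(graph, start, goal):
    """Uniform-cost search: the fringe is a priority queue keyed by path cost."""
    fringe = [(0, start, [start])]       # (cumulative cost, state, path)
    visited = set()
    while fringe:
        cost, state, path = heapq.heappop(fringe)
        if state == goal:
            return cost, path
        if state in visited:
            continue
        visited.add(state)
        for nxt, step in graph.get(state, []):
            heapq.heappush(fringe, (cost + step, nxt, path + [nxt]))
    return None

# Toy graph for demonstration only.
graph = {'S': [('a', 1), ('b', 4)], 'a': [('b', 2), ('G', 6)], 'b': [('G', 1)]}
print(dfs(graph, 'S', 'G'), bfs(graph, 'S', 'G'), ucs(graph, 'S', 'G'))
```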

Informed search: A* and beyond

Search heuristics. A heuristic is a function that estimates how close a state is to a goal, designed for a particular search problem. What might we use for PacMan (e.g., for pathing)? Manhattan distance, Euclidean distance. [Figure: PacMan grid with example heuristic values 10, 5, and 11.2.]

Combining UCS and Greedy (example due to Teg Grenager). Uniform-cost orders by path cost, or backward cost g(n); greedy orders by goal proximity, or forward cost h(n). A* search orders by the sum: f(n) = g(n) + h(n). [Figure: example graph with start S, goal G, and intermediate nodes a-e, with g and h values annotated at each search node.]
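
A minimal A* sketch in the same illustrative style as the search code above; the graph format, the toy graph, and the heuristic are assumptions, with the heuristic chosen to be admissible for this example.

```python
# Illustrative A* sketch: identical to UCS except nodes are ordered by
# f(n) = g(n) + h(n).
import heapq

def a_star(graph, start, goal, h):
    fringe = [(h(start), 0, start, [start])]   # (f, g, state, path)
    best_g = {}                                # cheapest g found per state
    while fringe:
        f, g, state, path = heapq.heappop(fringe)
        if state == goal:
            return g, path
        if state in best_g and best_g[state] <= g:
            continue                           # already reached more cheaply
        best_g[state] = g
        for nxt, step in graph.get(state, []):
            g2 = g + step
            heapq.heappush(fringe, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Toy graph and an admissible heuristic for it (assumed for illustration).
graph = {'S': [('a', 1), ('d', 3)], 'a': [('G', 9)], 'd': [('G', 2)]}
h = {'S': 4, 'a': 5, 'd': 2, 'G': 0}.get
print(a_star(graph, 'S', 'G', h))              # -> (5, ['S', 'd', 'G'])
```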

Admissible heuristics, formally. A heuristic h is admissible (optimistic) if 0 ≤ h(n) ≤ h*(n), where h*(n) is the true cost to a nearest goal. Coming up with admissible heuristics is most of what's involved in using A* in practice.

Consistency of heuristics. [Figure: small graph with nodes A and C and goal G; heuristic values h=4, h=2, h=1 and arc costs 1 and 3.] Main idea: estimated heuristic costs ≤ actual costs. Admissibility: heuristic cost ≤ actual cost to goal, i.e., h(A) ≤ actual cost from A to G. Consistency: heuristic "arc" cost ≤ actual cost for each arc, i.e., h(A) − h(C) ≤ cost(A to C). Consequences of consistency: the f value along a path never decreases (since h(A) ≤ cost(A to C) + h(C)), and A* graph search is optimal.

Search example problem

What have we covered?* Search! - BFS, DFS, UCS - A* and informed search Constraint Satisfaction Problems (CSPs) Adversarial Search / game playing / expecti-max * Large areas; non-exhaustive

Constraint Satisfaction Problems (CSPs). Standard search problems: the state is a "black box" (an arbitrary data structure), the goal test can be any function over states, and the successor function can also be anything. Constraint satisfaction problems (CSPs): a special subset of search problems in which the state is defined by variables X_i with values from a domain D (sometimes D depends on i), and the goal test is a set of constraints specifying allowable combinations of values for subsets of variables. A simple example of a formal representation language; allows useful general-purpose algorithms with more power than standard search algorithms.

CSP Examples

Constraint graphs

Filtering: forward checking. Filtering: keep track of domains for unassigned variables and cross off bad options. [Figure: Australia map-coloring example with variables WA, NT, SA, Q, NSW, V.]

Filtering: constraint propagation. Forward checking propagates information from assigned to unassigned variables, but doesn't provide early detection for all failures. [Figure: Australia map-coloring example with variables WA, NT, SA, Q, NSW, V.] NT and SA cannot both be blue! Why didn't we detect this yet? Constraint propagation: reason from constraint to constraint.

Consistency of a single arc. An arc X → Y is consistent iff for every x in the tail there is some y in the head which could be assigned without violating a constraint. [Figure: Australia map-coloring example.] Forward checking: enforcing consistency of arcs pointing to each new assignment.
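
As a sketch of what enforcing consistency of a single arc looks like in code (illustrative only; the domain and constraint representations are assumptions, not course code):

```python
# Enforce consistency of one arc X -> Y by removing tail values with no
# compatible head value. Assumed representation: domains maps a variable to
# a set of values; constraint(x, y) returns True if the pair is allowed.

def revise(domains, X, Y, constraint):
    """Make arc X -> Y consistent; return True if X's domain changed."""
    removed = False
    for x in set(domains[X]):
        # x survives only if some y in Y's domain is compatible with it
        if not any(constraint(x, y) for y in domains[Y]):
            domains[X].discard(x)
            removed = True
    return removed

# Map-coloring flavor: adjacent regions must differ.
domains = {'SA': {'blue'}, 'NT': {'blue', 'green'}}
different = lambda x, y: x != y
print(revise(domains, 'NT', 'SA', different), domains['NT'])  # True {'green'}
```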

Ordering: minimum remaining values. Variable ordering: minimum remaining values (MRV) — choose the variable with the fewest legal values left in its domain. Why min rather than max? Also called the most constrained variable; a "fail-fast" ordering.

Ordering: least constraining value Value Ordering: Least Constraining Value Given a choice of variable, choose the least constraining value I.e., the one that rules out the fewest values in the remaining variables Note that it may take some computation to determine this! (E.g., rerunning filtering) Why least rather than most? Combining these ordering ideas makes 1000 queens feasible

Tree-structured CSPs. Algorithm for tree-structured CSPs: Order: choose a root variable, order variables so that parents precede children. Remove backward: for i = n down to 2, apply RemoveInconsistent(Parent(X_i), X_i). Assign forward: for i = 1 to n, assign X_i consistently with Parent(X_i).
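
A minimal sketch of this tree-structured CSP procedure, assuming the variables are already given in a root-first order with a parent map; the representations (domains dict, constraint function) and the toy example are assumptions, not course code.

```python
# Illustrative tree-structured CSP solver: prune backward, then assign forward.

def remove_inconsistent(domains, p, c, constraint):
    """RemoveInconsistent(Parent(X_i), X_i): prune parent values that have no
    compatible value left in the child's domain."""
    for x in set(domains[p]):
        if not any(constraint(x, y) for y in domains[c]):
            domains[p].discard(x)
    return bool(domains[p])          # False signals an emptied domain -> failure

def solve_tree_csp(order, parent, domains, constraint):
    # Remove backward: for i = n..2, apply RemoveInconsistent(Parent(X_i), X_i).
    for child in reversed(order[1:]):
        if not remove_inconsistent(domains, parent[child], child, constraint):
            return None              # some domain emptied: no solution
    # Assign forward: for i = 1..n, pick any value consistent with the parent.
    assignment = {order[0]: next(iter(domains[order[0]]))}
    for child in order[1:]:
        assignment[child] = next(v for v in domains[child]
                                 if constraint(assignment[parent[child]], v))
    return assignment

# Toy coloring on a path WA - NT - Q (a tree); adjacent variables must differ.
order = ['WA', 'NT', 'Q']
parent = {'NT': 'WA', 'Q': 'NT'}
domains = {v: {'red', 'green', 'blue'} for v in order}
domains['WA'] = {'red'}
print(solve_tree_csp(order, parent, domains, lambda a, b: a != b))
```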

CSP example problem

What have we covered?* Search! - BFS, DFS, UCS - A* and informed search Constraint Satisfaction Problems (CSPs) Adversarial Search / game playing / expecti-max * Large areas; non-exhaustive

Adversarial search

Adversarial game trees. [Figure: example game tree with terminal values including -20, -8, -18, -5, -10, +4, and +8.]

Minimax values. [Figure: game tree distinguishing states under the agent's control (max nodes), states under the opponent's control (min nodes), and terminal states, with example values such as -8, -10, -5, and +8.]

Adversarial search (minimax). Deterministic, zero-sum games: tic-tac-toe, chess, checkers. One player maximizes the result; the other minimizes it. Minimax values are computed recursively. Minimax search: a state-space search tree in which players alternate turns; compute each node's minimax value, i.e., the best achievable utility against a rational (optimal) adversary. Terminal values are part of the game. [Figure: small minimax tree with a max layer, a min layer, and terminal values 8, 2, 5, 6.]
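
A minimal minimax sketch on a toy tree. The nested-list encoding and the example values are assumptions for illustration, not the slide's figure; a real game would supply successor and terminal-utility functions instead.

```python
# Minimax for a two-player, zero-sum game tree encoded as nested lists,
# where leaves are terminal utilities.

def minimax(node, maximizing):
    if not isinstance(node, list):            # terminal state: utility value
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Max root, min layer below, then terminal values.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree, maximizing=True))         # -> 3
```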

Pruning. [Figure: minimax tree with a max root, a min layer, and leaf values 3, 12, 8, 3, 4, 6, 14, 5, 2.] What did we do that was inefficient here?

Minimax pruning. [Figure: the same tree searched with pruning; only the leaves 3, 12, 8, 3, 14, 5, 2 are examined and the remaining branches are cut off.]
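
An alpha-beta pruning sketch over the same nested-list tree format as the minimax example above (illustrative; that encoding is an assumption): same values come back, fewer leaves are examined.

```python
# Alpha-beta pruning: stop expanding a node's children once its value can no
# longer affect the ancestor's choice.
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    if not isinstance(node, list):            # terminal state
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                 # min ancestor won't allow more
                break
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:                 # max ancestor won't allow less
                break
        return value

tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, maximizing=True))       # -> 3, same as plain minimax
```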

Evaluation functions. Evaluation functions score non-terminals in depth-limited search. The ideal function returns the actual minimax value of the position. In practice it is typically a weighted linear sum of features, Eval(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s), e.g., f_1(s) = (num white queens − num black queens), etc.

Adversarial search example problem

What have we covered?* (continued) Markov Decision Processes (MDPs) Reinforcement Learning (RL) Markov Models (MMs) / Hidden Markov Models (HMMs) - And corresponding probability theory * Large areas; non-exhaustive

Markov Decision Processes. An MDP is defined by: states s ∈ S; actions a ∈ A; a transition function T(s, a, s') — the probability that a from s leads to s', i.e., P(s' | s, a), also called the model or the dynamics; a reward function R(s, a, s') (sometimes just R(s) or R(s')); a start state; and maybe a terminal state. MDPs are non-deterministic search problems: one way to solve them is with expectimax search, but we'll do better. They can go on forever. Arguably, life is an MDP.

Optimal quantities. The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally. The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally. The optimal policy: π*(s) = optimal action from state s. [Figure: expectimax-style diagram where s is a state, (s, a) is a q-state, and (s, a, s') is a transition.]

Values of states. Fundamental operation: compute the (expectimax) value of a state — the expected utility under optimal action, which is an average sum of (discounted) rewards. This is just what expectimax computed! Recursive definition of value: V*(s) = max_a Q*(s, a), with Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ].

Value iteration. Start at the bottom with V_0(s) = 0: no time steps left means an expected reward sum of zero. Given a vector of V_k(s) values, do one ply of expectimax from each state: V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]. Repeat until convergence. Complexity of each iteration: O(S^2 A). Theorem: value iteration will converge to the unique optimal values. Basic idea: approximations get refined towards optimal values (but will only converge if we use a discount / have a finite horizon!). The policy often converges before the values do!
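
A minimal value-iteration sketch. The MDP encoding and the tiny example MDP are assumptions for illustration, not course-provided code.

```python
# Illustrative value iteration. Assumed encoding: T[s][a] is a list of
# (probability, next_state, reward) triples; actions(s) lists legal actions.

def value_iteration(states, actions, T, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            if not actions(s):                        # terminal: no actions
                V_new[s] = 0.0
                continue
            # One ply of expectimax:
            # V_{k+1}(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V_k(s')]
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                for a in actions(s))
        V = V_new
    return V

# Made-up two-state MDP with a terminal 'exit' state.
states = ['A', 'B', 'exit']
actions = lambda s: [] if s == 'exit' else ['stay', 'go']
T = {
    'A': {'stay': [(1.0, 'A', 0.0)], 'go': [(0.8, 'B', 0.0), (0.2, 'A', 0.0)]},
    'B': {'stay': [(1.0, 'B', 1.0)], 'go': [(1.0, 'exit', 10.0)]},
}
print(value_iteration(states, actions, T))
```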

MDP example problem

What have we covered?* (continued) Markov Decision Processes (MDPs) Reinforcement Learning (RL) Markov Models (MMs) / Hidden Markov Models (HMMs) - And corresponding probability theory * Large areas; non-exhaustive

Reinforcement learning. [Figure: the agent-environment loop — the agent observes a state s and a reward r and sends actions a to the environment.] Basic idea: receive feedback in the form of rewards; the agent's utility is defined by the reward function; it must (learn to) act so as to maximize expected rewards. All learning is based on observed samples of outcomes!

Temporal Difference Learning (model-free!). Big idea: learn from every experience! Update V(s) each time we experience a transition (s, a, s', r); likely outcomes s' will contribute updates more often. Temporal difference learning of values: the policy π is still fixed, and we are still doing evaluation! Move values toward the value of whatever successor occurs (a running average). Sample of V(s): sample = R(s, π(s), s') + γ V^π(s'). Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample. Can rewrite as: V^π(s) ← V^π(s) + α (sample − V^π(s)).
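
A minimal TD(0) value-update sketch under a fixed policy; variable names and the toy transition sequence are assumptions for illustration.

```python
# Illustrative TD(0): after each observed transition (s, a, s', r) while
# following pi, nudge V(s) toward the sample r + gamma * V(s').
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    sample = r + gamma * V[s_next]            # sample of V(s) from this transition
    V[s] += alpha * (sample - V[s])           # running average toward the sample
    return V

V = defaultdict(float)
# A few observed transitions (state, reward, next_state) while following pi.
for s, r, s_next in [('A', 0.0, 'B'), ('B', 1.0, 'C'), ('A', 0.0, 'B')]:
    td_update(V, s, r, s_next)
print(dict(V))
```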

Q-Learning. Q-learning: sample-based Q-value iteration — learn Q(s, a) values as you go. Receive a sample (s, a, s', r). Consider your old estimate: Q(s, a). Consider your new sample estimate: sample = r + γ max_{a'} Q(s', a'). Incorporate the new estimate into a running average: Q(s, a) ← (1 − α) Q(s, a) + α · sample.
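
A minimal tabular Q-learning update sketch; the tiny action set, state names, and learning-rate values are assumptions for this example.

```python
# Illustrative Q-learning update: blend the old estimate Q(s,a) with the new
# sample r + gamma * max_a' Q(s',a').
from collections import defaultdict

ACTIONS = ['left', 'right']

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q

Q = defaultdict(float)
q_update(Q, 'A', 'right', 0.0, 'B')
q_update(Q, 'B', 'right', 10.0, 'terminal')
q_update(Q, 'A', 'right', 0.0, 'B')
print(Q[('A', 'right')], Q[('B', 'right')])   # A's value now reflects B's
```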

Exploration functions. When to explore? Random actions: explore a fixed amount. Better idea: explore areas whose badness is not (yet) established, and eventually stop exploring. Exploration function: takes a value estimate u and a visit count n, and returns an optimistic utility, e.g., f(u, n) = u + k/n. Regular Q-update: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]. Modified Q-update: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} f(Q(s', a'), N(s', a')) ].
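
A sketch of the modified update with an exploration function. The bonus weight K, the counts table N, and the k/(n + 1) form (a small tweak of f(u, n) = u + k/n to avoid dividing by zero) are assumptions for illustration.

```python
# Illustrative exploration-function Q-update: replace the successor's Q-value
# with an optimistic estimate that is large while the visit count is small.
from collections import defaultdict

ACTIONS = ['left', 'right']
K = 2.0                                       # exploration bonus weight (assumed)

def f(u, n):
    """Optimistic utility: current estimate plus a bonus that fades with visits."""
    return u + K / (n + 1)                    # +1 avoids division by zero

def q_update_explore(Q, N, s, a, r, s_next, alpha=0.5, gamma=0.9):
    N[(s, a)] += 1                            # count this visit
    sample = r + gamma * max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

Q, N = defaultdict(float), defaultdict(int)
q_update_explore(Q, N, 'A', 'right', 0.0, 'B')
print(Q[('A', 'right')])                      # positive purely from the bonus
```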

RL problem example

What have we covered?* (continued) Markov Decision Processes (MDPs) Reinforcement Learning (RL) Markov Models (MMs) / Hidden Markov Models (HMMs) - And corresponding probability theory * Large areas; non-exhaustive

Chain rule and Markov models. From the chain rule, every joint distribution over X_1, X_2, ..., X_T can be written as: P(X_1, X_2, ..., X_T) = P(X_1) ∏_{t=2}^{T} P(X_t | X_1, X_2, ..., X_{t−1}). Assuming that for all t: X_t ⊥ X_1, ..., X_{t−2} | X_{t−1}, this gives us the expression posited on the earlier slide: P(X_1, X_2, ..., X_T) = P(X_1) ∏_{t=2}^{T} P(X_t | X_{t−1}). [Figure: Markov chain X_1 → X_2 → X_3 → X_4.]

Mini-forward algorithm. Question: what's P(X) on some day t? [Figure: Markov chain X_1 → X_2 → X_3 → X_4.] P(x_t) = Σ_{x_{t−1}} P(x_{t−1}, x_t) = Σ_{x_{t−1}} P(x_t | x_{t−1}) P(x_{t−1}). Forward simulation.
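
A minimal mini-forward (forward simulation) sketch; the weather-style transition model is a made-up example, not from the slides.

```python
# Illustrative mini-forward algorithm: push a distribution over X through the
# transition model for t steps.

def mini_forward(p0, transition, steps):
    """p0: dict state -> P(X_1 = state); transition[s][s2] = P(X_t = s2 | X_{t-1} = s)."""
    p = dict(p0)
    for _ in range(steps):
        p = {s2: sum(p[s] * transition[s][s2] for s in p) for s2 in p}
    return p

p0 = {'sun': 1.0, 'rain': 0.0}
transition = {'sun': {'sun': 0.9, 'rain': 0.1},
              'rain': {'sun': 0.3, 'rain': 0.7}}
for t in (1, 2, 10):
    print(t, mini_forward(p0, transition, t))   # drifts toward the stationary distribution
```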

The chain rule and HMMs, in general. [Figure: HMM with hidden states X_1, X_2, X_3 and evidence variables E_1, E_2, E_3.] From the chain rule, every joint distribution over X_1, E_1, ..., X_T, E_T can be written as: P(X_1, E_1, ..., X_T, E_T) = P(X_1) P(E_1 | X_1) ∏_{t=2}^{T} P(X_t | X_1, E_1, ..., X_{t−1}, E_{t−1}) P(E_t | X_1, E_1, ..., X_{t−1}, E_{t−1}, X_t). Assuming that for all t: the state is independent of all past states and all past evidence given the previous state, i.e., X_t ⊥ X_1, E_1, ..., X_{t−2}, E_{t−2}, E_{t−1} | X_{t−1}; and the evidence is independent of all past states and all past evidence given the current state, i.e., E_t ⊥ X_1, E_1, ..., X_{t−2}, E_{t−2}, X_{t−1}, E_{t−1} | X_t. This gives us: P(X_1, E_1, ..., X_T, E_T) = P(X_1) P(E_1 | X_1) ∏_{t=2}^{T} P(X_t | X_{t−1}) P(E_t | X_t).

Implied conditional independencies. [Figure: HMM with hidden states X_1, X_2, X_3 and evidence variables E_1, E_2, E_3.] There are many implied conditional independencies, e.g., E_1 ⊥ X_2, E_2, X_3, E_3 | X_1. We can prove these as we did last class for Markov models (but we won't today). This also comes from the graphical model; we'll cover this more formally in a later lecture.

Dynamic programming (the "forward algorithm"). [Figure: HMM trellis unrolled over time steps 1 through T.]
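
A minimal forward-algorithm sketch that alternates a time-elapse step with an observation step; the toy umbrella-style model and the normalization at each step are assumptions for illustration.

```python
# Illustrative HMM forward algorithm: apply the transition model, weight by
# the evidence likelihood, renormalize, repeat.

def forward(prior, transition, emission, evidence):
    """Return P(X_t | e_1..t) after processing the evidence sequence."""
    belief = dict(prior)
    for e in evidence:
        # Time elapse: B'(x') = sum_x P(x' | x) B(x)
        belief = {x2: sum(transition[x][x2] * belief[x] for x in belief)
                  for x2 in belief}
        # Observe: B(x') is proportional to P(e | x') B'(x')
        belief = {x2: emission[x2][e] * belief[x2] for x2 in belief}
        total = sum(belief.values())
        belief = {x2: p / total for x2, p in belief.items()}
    return belief

prior = {'sun': 0.5, 'rain': 0.5}
transition = {'sun': {'sun': 0.9, 'rain': 0.1},
              'rain': {'sun': 0.3, 'rain': 0.7}}
emission = {'sun': {'umbrella': 0.2, 'none': 0.8},
            'rain': {'umbrella': 0.9, 'none': 0.1}}
print(forward(prior, transition, emission, ['umbrella', 'umbrella']))
```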

Markov model problem example