ARTIFICIAL INTELLIGENCE. Uncertainty: probabilistic reasoning


INFOB2KI 2017-2018, Utrecht University, The Netherlands
ARTIFICIAL INTELLIGENCE
Uncertainty: probabilistic reasoning
Lecturer: Silja Renooij
These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

Outline
- Reasoning under uncertainty
- Probabilities
- Bayes' rule & Bayesian networks
- Bayesian skill rating

Uncertainty
Let action A_t = leave for airport t minutes before flight. Will A_t get me there on time?
Problems:
1. partial observability (road state, other drivers' plans, etc.)
2. noisy sensors (traffic reports)
3. uncertainty in action outcomes (flat tire, etc.)
4. immense complexity of modeling and predicting traffic
Hence a purely logical approach either
1. risks falsehood: "A_125 will get me there on time", or
2. leads to conclusions that are too weak for decision making: "A_125 will get me there on time if there's no accident on the bridge and it doesn't rain and my tires remain intact, etc."
(A_1440 might reasonably be said to get me there on time, but I'd have to stay overnight in the airport.)

How do we deal with uncertainty?
Implicit:
- Ignore what you are uncertain of, when you can
- Build procedures that are robust to uncertainty
Explicit:
- Build a model of the world that describes uncertainty about its state, dynamics, and observations
- Reason about the effect of actions given the model

Methods for handling uncertainty
Default or nonmonotonic logic:
- e.g. assume my car does not have a flat tire
- e.g. assume A_125 works unless contradicted by evidence
- Issues: What assumptions are reasonable? How to handle contradiction?
Rules with fudge factors:
- e.g. A_125 ↦(0.3) get there on time
- e.g. Sprinkler ↦(0.99) WetGrass; WetGrass ↦(0.7) Rain
- Issues: problems with combination, e.g., Sprinkler implies Rain??
Fuzzy logic:
- e.g. "The road is busy"
- e.g. "At the airport 120 minutes before departure" is "more than in time"
- e.g. IF road(busy) and A_125 THEN at_airport(just in time)
Probability:
- Model the agent's degree of belief, given the available evidence
- e.g. A_25 will get me there on time with probability 0.04

Probability
A well-known and well-understood framework for uncertainty.
Probabilistic assertions summarize effects of
- laziness: failure to enumerate exceptions, qualifications, etc.
- ignorance: lack of relevant facts, initial conditions, etc.
Clear semantics (mathematically correct)
Provides principled answers for:
- Combining evidence
- Predictive & diagnostic reasoning
- Incorporation of new evidence
Intuitive (at some level) to human experts
Can be assessed from data

Axioms of probability
For any propositions A, B:
- 0 ≤ P(A) ≤ 1
- P(True) = 1 and P(False) = 0
- P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Note:
  P(A ∨ ¬A) = P(A) + P(¬A) − P(A ∧ ¬A)
  P(True) = P(A) + P(¬A) − P(False)
  1 = P(A) + P(¬A)
So: P(A) = 1 − P(¬A)

Frequency Interpretation
Draw a ball from an urn containing n balls of the same size, r red and s yellow. The probability that the proposition A = "the ball is red" is true corresponds to the relative frequency with which we expect to draw a red ball: P(A) = ?
I.e. to the frequentist, probability lies objectively in the external world.
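The frequency reading can be tried out with a small simulation. The sketch below is only an illustration: the urn contents (r = 3, s = 7), the number of draws and the function name `estimate_red_probability` are assumptions, not part of the slides. It draws with replacement and compares the observed relative frequency of red with r / (r + s).

```python
import random

def estimate_red_probability(r, s, draws, seed=0):
    """Estimate P(red) as a relative frequency, drawing with replacement
    from an urn holding r red and s yellow balls."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    urn = ["red"] * r + ["yellow"] * s
    hits = sum(1 for _ in range(draws) if rng.choice(urn) == "red")
    return hits / draws

r, s = 3, 7                            # hypothetical urn contents
estimate = estimate_red_probability(r, s, draws=100_000)
exact = r / (r + s)                    # relative frequency we expect
print(estimate, exact)
```

With many draws the estimate should settle close to r / (r + s), which is the value the frequentist assigns to P(A).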

Subjective Interpretation
There are many situations in which there is no objective frequency interpretation:
- On a windy day, just before paragliding from the top of El Capitan, you say "there is a probability of 0.05 that I am going to die"
- You have worked hard on your AI class and you believe that the probability that you will pass is 0.9
Bayesian viewpoint: probability is "degree of belief", or "degree of uncertainty". To the Bayesian, probability lies subjectively in the mind, and can be different for people with different information, e.g., the probability that Wayne will get rich from selling his kidney.

Bayesian probability updating
1. You have a prior (or unconditional) assessment of the probability of an event
2. You subsequently receive additional information or evidence
3. Your posterior assessment is now your previous assessment, updated with this new info
[Images for steps 1 to 3 omitted; images from Moserware.com]

Random variables
A proposition that takes the value True with probability p and False with probability 1 − p is a random variable with distribution <p, 1 − p>.
If an urn contains balls having 3 possible colors (red, yellow, and blue), the color of a ball picked at random from the urn is a random variable with 3 possible values.
The (probability) distribution of a random variable X with n values x_1, x_2, ..., x_n is <p_1, p_2, ..., p_n> with P(X = x_i) = p_i and Σ_i p_i = 1.

Joint Distribution
Consider k random variables X_1, ..., X_k. The joint distribution on these variables is a table where each entry gives the probability of one combination of values of X_1, ..., X_k.
Example: two-valued variables Cavity and Toothache.
Shorthand notation for propositions: cavity for Cavity = yes and ¬cavity for Cavity = no.

P(C ∧ T)    toothache   ¬toothache
cavity      0.04        0.06
¬cavity     0.01        0.89

Each entry is the probability of a conjunction, e.g. P(¬cavity ∧ toothache) = 0.01 and P(¬cavity ∧ ¬toothache) = 0.89.

Joint Distribution Says It All
(Joint distribution P(C ∧ T): see the table on the Joint Distribution slide.)
P(toothache) = P((toothache ∧ cavity) ∨ (toothache ∧ ¬cavity))
             = P(toothache ∧ cavity) + P(toothache ∧ ¬cavity)    (marginalisation)
             = 0.04 + 0.01 = 0.05
Use P(a ∨ b) = P(a) + P(b) − P(a ∧ b), or P(a) = P(a ∧ b) + P(a ∧ ¬b).
P(toothache ∨ cavity) = P((toothache ∧ cavity) ∨ (toothache ∧ ¬cavity) ∨ (¬toothache ∧ cavity))
                      = 0.04 + 0.01 + 0.06 = 0.11
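The two calculations on this slide amount to summing the joint entries that satisfy an event. A minimal sketch, using the four numbers from the Cavity/Toothache table; the dictionary layout and the helper name `prob` are illustrative choices, not from the slides.

```python
# Joint distribution P(Cavity, Toothache); keys are (cavity, toothache).
joint = {
    (True, True): 0.04,    # cavity, toothache
    (True, False): 0.06,   # cavity, no toothache
    (False, True): 0.01,   # no cavity, toothache
    (False, False): 0.89,  # no cavity, no toothache
}

def prob(event):
    """Sum the joint entries for which event(cavity, toothache) is true."""
    return sum(p for (cavity, toothache), p in joint.items()
               if event(cavity, toothache))

p_toothache = prob(lambda cavity, toothache: toothache)
p_toothache_or_cavity = prob(lambda cavity, toothache: toothache or cavity)
print(round(p_toothache, 2), round(p_toothache_or_cavity, 2))
```

Any proposition over the two variables can be answered the same way, which is the sense in which the joint distribution "says it all".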

Conditional Probability
Definition: P(A | B) = P(A ∧ B) / P(B)    (assumes P(B) > 0!)
Read P(A | B) as: the probability of A given that B is known to be true.
This can also be written as P(A ∧ B) = P(A | B) P(B), which is called the product rule.
Note: P(A ∧ B) is often written as P(A, B).

Example
(Joint distribution P(C ∧ T): see the table on the Joint Distribution slide.)
P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
P(cavity ∧ toothache) = ?
P(toothache) = ?
P(cavity | toothache) = 0.04 / 0.05 = 0.8

Normalization
(Joint distribution P(C ∧ T): see the table on the Joint Distribution slide.)
The denominator can be viewed as a normalization constant α:
P(cavity | toothache) = α P(cavity, toothache) = α × 0.04
P(¬cavity | toothache) = α P(¬cavity, toothache) = α × 0.01
1 = α × 0.04 + α × 0.01 = α × 0.05, so α = 20
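The normalization step can be written out as a few lines of code. This is a sketch under the same table values; the function name `posterior_cavity` and the dictionary layout are assumptions made for the example.

```python
# Joint distribution P(Cavity, Toothache); keys are (cavity, toothache).
joint = {
    (True, True): 0.04,
    (True, False): 0.06,
    (False, True): 0.01,
    (False, False): 0.89,
}

def posterior_cavity(toothache):
    """Return (P(Cavity | Toothache=toothache), alpha).

    The posterior is obtained by selecting the joint entries consistent
    with the evidence and rescaling them so that they sum to 1.
    """
    unnormalised = {cavity: joint[(cavity, toothache)] for cavity in (True, False)}
    alpha = 1.0 / sum(unnormalised.values())
    return {cavity: alpha * p for cavity, p in unnormalised.items()}, alpha

posterior, alpha = posterior_cavity(True)
print(posterior, alpha)
```

For the evidence toothache = true this gives α of about 20 and P(cavity | toothache) of about 0.8, matching the slide, without ever computing P(toothache) separately.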

Bayes' Rule
From the product rule: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Bayes' rule: P(B | A) = P(A | B) P(B) / P(A)
Useful for assessing diagnostic probability from causal probability:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck:
P(m | s) = P(s | m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
Note: the posterior probability of meningitis is still very small!
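The meningitis calculation is a direct application of the rule. A small sketch; the function name `bayes` and its parameter names are illustrative only.

```python
def bayes(p_effect_given_cause, p_cause, p_effect):
    """P(cause | effect) = P(effect | cause) * P(cause) / P(effect)."""
    if p_effect <= 0.0:
        raise ValueError("P(effect) must be positive")
    return p_effect_given_cause * p_cause / p_effect

# Numbers from the slide: P(s | m) = 0.8, P(m) = 0.0001, P(s) = 0.1
p_m_given_s = bayes(p_effect_given_cause=0.8, p_cause=0.0001, p_effect=0.1)
print(p_m_given_s)
```

The result is about 0.0008: observing a stiff neck raises the probability of meningitis eightfold, yet it stays small because the prior is so low.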

Generalizations
P(A ∧ B ∧ C) = P(A ∧ B | C) P(C) = P(A | B ∧ C) P(B | C) P(C)
P(A ∧ B ∧ C) = P(A ∧ B | C) P(C) = P(B | A ∧ C) P(A | C) P(C)    (chain rule)
P(B | A, C) = P(A | B, C) P(B | C) / P(A | C)
Marginalisation rule: P(X) = Σ_y P(X ∧ Y = y)

Representing Probability
Naïve representations of probability run into problems.
Example: patients in a hospital are described by several attributes (variables):
- Background: age, gender, history of diseases, ...
- Symptoms: fever, blood pressure, headache, ...
- Diseases: pneumonia, heart attack, ...
A probability distribution needs to assign a number to each combination of values of these variables:
- 20 binary variables already require 2^20 ≈ 10^6 numbers
- Real examples usually involve hundreds of attributes

Practical Representation
Key idea: exploit regularities.
Here we focus on exploiting (conditional) independence properties.

Independent Random Variables
Two variables X and Y are independent if P(X = x | Y = y) = P(X = x) for all values x, y.
That is, learning the value of Y does not change the prediction of X.
If X and Y are independent, then P(X, Y) = P(X | Y) P(Y) = P(X) P(Y).
In general, if X_1, ..., X_n are independent, then P(X_1, ..., X_n) = P(X_1) ... P(X_n). This requires O(n) parameters.

Independence: example
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
32 entries reduced to 12 (4 for Weather and 8 for Toothache & Catch & Cavity).
Absolute independence: powerful but rare.
Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Conditional Independence
A more suitable notion is that of conditional independence.
Two variables X and Y are conditionally independent given Z if P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z.
That is, learning the value of Y does not change the prediction of X once we know the value of Z.

Examples
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Variable Catch is conditionally independent of variable Toothache given variable Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Conditional independence contd.
Write out the full joint distribution using the chain rule:
P(Toothache, Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
  = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
I.e., 2 + 2 + 1 = 5 independent numbers.
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in the number of variables n to linear in n.
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
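To see the factorisation at work, the sketch below builds the 8-entry joint over Toothache, Catch and Cavity from just 5 numbers and then checks that Catch does not depend on Toothache once Cavity is known. The five probabilities are made-up illustrative values (the slides give none), and all names are assumptions for the example.

```python
from itertools import product

# Illustrative numbers only (assumed, not taken from the slides).
p_cavity = 0.2
p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch_given = {True: 0.9, False: 0.2}       # P(catch | Cavity)

def bernoulli(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Joint P(Toothache, Catch, Cavity) = P(T | Cavity) P(Catch | Cavity) P(Cavity)
joint = {}
for toothache, catch, cavity in product((True, False), repeat=3):
    joint[(toothache, catch, cavity)] = (
        bernoulli(p_toothache_given[cavity], toothache)
        * bernoulli(p_catch_given[cavity], catch)
        * bernoulli(p_cavity, cavity)
    )

def p_catch(toothache, cavity):
    """P(catch | Toothache=toothache, Cavity=cavity), read off the joint."""
    numerator = joint[(toothache, True, cavity)]
    denominator = sum(joint[(toothache, c, cavity)] for c in (True, False))
    return numerator / denominator

print(p_catch(True, True), p_catch(False, True))
```

Both printed values equal P(catch | cavity): the toothache adds no information about the probe once the cavity status is fixed.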

A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph (DAG), representing (conditional) independences
   [Figure: graph with arrows A → B, B → C, B → D]
2. A set of tables for each node in the graph, representing (conditional) probability distributions

A      P(A)
false  0.6
true   0.4

A      B      P(B | A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

B      D      P(D | B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95

B      C      P(C | B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

A Directed Acyclic Graph
Each node in the graph is a random variable.
A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g. A is a parent of B.
Informally, an arrow from node X to node Y means X has a direct influence on Y; not necessarily causal!
Formally, chains of arrows only capture the independence relation between the variables (by means of d-separation): we can reason in any direction!
[Figure: graph with nodes A, B, C, D]

A Set of Tables for Each Node
Each node X_i has a conditional probability distribution P(X_i | Parents(X_i)) that quantifies the effect of the parents on the node.
The parameters are the probabilities in these CPTs (conditional probability tables).

A      P(A)
false  0.6
true   0.4

A      B      P(B | A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

B      C      P(C | B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

B      D      P(D | B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95

A Set of Tables for Each Node
Conditional probability distribution for C given B:

B      C      P(C | B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

For a given combination of conditioning values (for parents), the entries for P(C = true | B) and P(C = false | B) must add to 1, e.g. P(C = true | B = false) + P(C = false | B = false) = 1.
If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).

Bayesian Networks
Two important properties:
1. Encodes the conditional independence relationships between the variables in the graph structure
2. Is a compact representation of the joint probability distribution over the variables
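How compact the representation is can be counted directly. The sketch below assumes all variables are Boolean, so a node with k parents stores 2^k numbers, and uses the parent sets of the four-node example network (B depends on A; C and D depend on B). The function names are illustrative.

```python
# Parent sets of the example network.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}

def bn_parameter_count(parents):
    """Numbers to store for Boolean nodes: 2**k per node with k Boolean parents."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())

def full_joint_parameter_count(n):
    """Independent numbers in a full joint table over n Boolean variables."""
    return 2 ** n - 1

print(bn_parameter_count(parents), full_joint_parameter_count(len(parents)))
```

For this small network the counts are 7 against 15; the gap widens quickly as the number of variables grows while the number of parents per node stays small.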

Conditional Independence
The Markov condition: given its parents (P_1, P_2), a node (X) is conditionally independent of its nondescendants (ND_1, ND_2, ...).
[Figure: node X with parents P_1, P_2, nondescendants ND_1, ND_2 and children C_1, C_2]

The Joint Probability Distribution
Due to the Markov condition, we can compute the joint probability distribution over all the variables X_1, ..., X_n in the Bayesian net using the formula:
P(X_1 = x_1, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | Parents(X_i))
where Parents(X_i) means the values of the parents of the node X_i with respect to the graph: so it is a product over CPT parameters.

Using a Bayesian Network: Example
Using the network in the example, suppose you want to calculate:
P(A = true, B = true, C = true, D = true)
  = P(A = true) × P(B = true | A = true) × P(C = true | B = true) × P(D = true | B = true)
  = 0.4 × 0.3 × 0.1 × 0.95 = 0.0114
This product is determined by the graph structure (A → B, B → C, B → D); the numbers are from the conditional probability tables (CPTs).
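The same product can be computed for any assignment. A minimal sketch with the CPT values of the example network; only P(node = true | parent) is stored, the complementary entry is derived, and the names `cond` and `joint` are assumptions for the example.

```python
# CPTs of the example network, storing only P(node = true | parent value).
p_a_true = 0.4
p_b_true_given_a = {True: 0.3, False: 0.99}
p_c_true_given_b = {True: 0.1, False: 0.6}
p_d_true_given_b = {True: 0.95, False: 0.98}

def cond(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) as a product of one CPT entry per node."""
    return (cond(p_a_true, a)
            * cond(p_b_true_given_a[a], b)
            * cond(p_c_true_given_b[b], c)
            * cond(p_d_true_given_b[b], d))

print(joint(True, True, True, True))
```

For the all-true assignment this reproduces the slide's value of about 0.0114; summing `joint` over all 16 assignments gives 1.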

Inference
Using a Bayesian network to compute probabilities is called inference.
In general, inference involves queries of the form:
P(X | E) = P(X ∧ E) / P(E) = α P(X ∧ E)
E = the observed evidence variable(s)
X = the query variable(s): in standard inference only 1
- compute P(X = x ∧ E = e) from the joint;
- compute P(E = e) = P(X = x ∧ E = e) + P(X = ¬x ∧ E = e) = 1/α

Using a BN: Example II
Using the network in the example we calculate:
P(C = true | A = false) = P(C = true ∧ A = false) / P(A = false)
The numerator equals:
P(C = true ∧ A = false)
  = P(C = true ∧ A = false ∧ B = true) + P(C = true ∧ A = false ∧ B = false)    (marginalisation)
  = P(C = true | B = true) × P(B = true | A = false) × P(A = false)
    + P(C = true | B = false) × P(B = false | A = false) × P(A = false)
  = (0.1 × 0.99 × 0.6) + (0.6 × 0.01 × 0.6) = 0.063
These products are determined by the graph structure.
As a result we get: 0.063 / 0.6 = 0.105
What about P(A = false | C = true)?
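Queries like these can be answered mechanically by summing the joint over all unobserved variables (inference by enumeration). The sketch below does this for the four-node example network; it is a brute-force illustration that only suits tiny networks, and the helper names (`joint`, `probability`, `query`) are assumptions.

```python
from itertools import product

VARIABLES = ("A", "B", "C", "D")

# CPTs of the example network, storing only P(node = true | parent value).
P_A_TRUE = 0.4
P_B_TRUE = {True: 0.3, False: 0.99}    # indexed by the value of A
P_C_TRUE = {True: 0.1, False: 0.6}     # indexed by the value of B
P_D_TRUE = {True: 0.95, False: 0.98}   # indexed by the value of B

def cond(p_true, value):
    """Probability of a Boolean value when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(world):
    """Joint probability of a full assignment {variable: bool}."""
    return (cond(P_A_TRUE, world["A"])
            * cond(P_B_TRUE[world["A"]], world["B"])
            * cond(P_C_TRUE[world["B"]], world["C"])
            * cond(P_D_TRUE[world["B"]], world["D"]))

def probability(fixed):
    """P(fixed): sum the joint over all assignments consistent with fixed."""
    total = 0.0
    for values in product((True, False), repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        if all(world[var] == val for var, val in fixed.items()):
            total += joint(world)
    return total

def query(target, evidence):
    """P(target | evidence) = P(target and evidence) / P(evidence)."""
    return probability({**target, **evidence}) / probability(evidence)

p_c_given_not_a = query({"C": True}, {"A": False})
p_not_a_given_c = query({"A": False}, {"C": True})
print(p_c_given_not_a, p_not_a_given_c)
```

The first query reproduces 0.105. The second, the slide's closing question, works out to 0.063 / 0.243, roughly 0.26, showing that the same network supports reasoning against the direction of the arrows.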

Inference
[Figure: Bayesian network over the variables HasAnthrax, HasCough, HasFever, HasDifficultyBreathing, HasWideMediastinum]
An example of a query would be: P(HasAnthrax = true | HasFever = true, HasCough = true)
Note: even though variables HasDifficultyBreathing and HasWideMediastinum are in the network, they are not given values in the query (i.e. they do not appear either as query variables or evidence variables); they are treated as unobserved variables.

Bayes' rule and conditional independence
P(Cavity | toothache ∧ catch)
  = α P(toothache ∧ catch | Cavity) P(Cavity)
  = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
This is an example of a naïve Bayes model:
P(Cause, Effect_1, ..., Effect_n) = P(Cause) ∏_i P(Effect_i | Cause)
The total number of parameters is linear in n.
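A naïve Bayes posterior can be computed with one multiplication per observed effect followed by normalization. The sketch below is illustrative: the prior and likelihood values are made-up numbers (the slides give none), and the function name `naive_bayes_posterior` and the data layout are assumptions.

```python
def naive_bayes_posterior(prior, likelihoods, observed):
    """Posterior over causes given observed Boolean effects.

    prior:       {cause: P(cause)}
    likelihoods: {cause: {effect: P(effect = true | cause)}}
    observed:    {effect: bool}
    """
    unnormalised = {}
    for cause, p in prior.items():
        for effect, value in observed.items():
            p_true = likelihoods[cause][effect]
            p *= p_true if value else 1.0 - p_true
        unnormalised[cause] = p
    alpha = 1.0 / sum(unnormalised.values())   # normalization constant
    return {cause: alpha * p for cause, p in unnormalised.items()}

# Illustrative numbers only (assumed, not taken from the slides).
prior = {"cavity": 0.2, "no_cavity": 0.8}
likelihoods = {
    "cavity": {"toothache": 0.6, "catch": 0.9},
    "no_cavity": {"toothache": 0.1, "catch": 0.2},
}
posterior = naive_bayes_posterior(prior, likelihoods,
                                  {"toothache": True, "catch": True})
print(posterior)
```

Adding another effect variable only adds one likelihood entry per cause, which is why the parameter count grows linearly in n.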

Complexity
Exact inference is NP-hard:
- Exact inference is feasible in small to medium-sized networks
- Exact inference in large, dense networks takes a very long time
Approximate inference techniques exist which are much faster and give pretty good results (but no guarantees).

How to build BNs
There are two options (or combinations thereof):
- Handcrafting with the help of an expert in the domain of application
- Machine learning it from data

A real application: TrueSkill
Algorithm used in Xbox Live for ranking and matching players.
Leaderboard: how do you determine your game skills?
- Idea: skill is related to probability of winning: s_1 > s_2 ⇒ P(player 1 wins) > P(player 2 wins)
Who is a suitable opponent?
- Idea: someone you beat with 50% chance.
TrueSkill material: thanks to Ralf Herbrich and Thore Graepel, Microsoft Research Cambridge

TrueSkill: what is your true skill?
- each player has a skill distribution: N(μ, σ²)
- each player has a TrueSkill: s = μ − 3σ
- a novice player starts with μ_0 = 25, σ_0 = 25/3, yielding a TrueSkill of 0
The TrueSkill parameters are updated given the outcome of a game.
[Image from Moserware.com]
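The displayed rating is a conservative estimate: the mean minus three standard deviations. A tiny sketch of that formula; the function name `conservative_skill` and the second example player are hypothetical.

```python
def conservative_skill(mu, sigma, k=3.0):
    """TrueSkill-style conservative rating s = mu - k * sigma (k = 3 on the slide)."""
    return mu - k * sigma

novice = conservative_skill(25.0, 25.0 / 3.0)   # starting values from the slide
veteran = conservative_skill(30.0, 1.5)         # hypothetical player after many games
print(novice, veteran)
```

Because σ shrinks as more games are observed, a player's rating can rise even when μ barely changes: the system simply becomes more certain about the skill.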

TrueSkill: updating
[Figure: confidence in skill level plotted over skill level (0 to 50), updated after game outcomes: 1st place, 2nd place, 3rd place]

TrueSkill: how updating works
Bayesian network (continuous):
- consider two players, each with their own skill distribution
- given their skill, players will deliver a certain performance with a certain probability
- upon which one of the two will win (or draw)
[Figure: network skill_1 → perf_1, skill_2 → perf_2, with perf_1 and perf_2 both pointing to outcome]
We can compute: P(1 beats 2 | skill_1 and skill_2)    (matching)
But also: P(skill_1 | 1 beats 2 and skill_2)    (skill updating)
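The matching query P(1 beats 2 | skill_1 and skill_2) has a closed form under one common modelling assumption, which the slides do not spell out: each performance is the player's skill plus independent Gaussian noise with standard deviation β, and draws are ignored. The value β = 25/6 and the function names below are assumptions made for this sketch.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_player1_wins(skill1, skill2, beta=25.0 / 6.0):
    """P(perf1 > perf2) when perf_i ~ N(skill_i, beta**2), draws ignored.

    perf1 - perf2 is Gaussian with mean skill1 - skill2 and
    variance 2 * beta**2, so the win probability is a normal CDF.
    """
    return normal_cdf((skill1 - skill2) / (math.sqrt(2.0) * beta))

print(p_player1_wins(25.0, 25.0), p_player1_wins(30.0, 25.0))
```

Equal skills give a win probability of 0.5, which is the matching criterion mentioned earlier (an opponent you beat with 50% chance). Skill updating runs the same model in the other direction and needs the full Bayesian treatment, which this sketch does not attempt.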

Does it work? An experiment
Data sets: Halo 2 Beta, Halo 3 Public Beta
3 game modes: Free-for-All, Two Teams, 1-on-1
Thousands of players, tens of thousands of outcomes

Convergence speed
After how many games is your true skill determined? Who is better: char, or SQLWildman?
[Figure: skill level (0 to 40) against number of games played (0 to 400), with curves for char (TrueSkill), SQLWildman (TrueSkill), char (Halo 2 rank) and SQLWildman (Halo 2 rank)]

Two players compared
Who is better: char, or SQLWildman? 5/8 games won by char.
[Figure: winning percentage (0% to 100%) against number of games played (0 to 500), with curves for char wins, SQLWildman wins and draw]

Matching players
Halo 3 Public Beta; Team Slayer Hopper
[Figure: snapshots after 1, 10, 30 and 100 games]
Most players have 50% chance of winning, independent of skill level!

Applications of BN Also: Bayesian medical kiosk (http://vimeo.com/64474130)