UNCERTAIN KNOWLEDGE AND REASONING


QUESTION BANK
DEPARTMENT: CSE    SEMESTER: VI
SUBJECT CODE / Name: CS2351 ARTIFICIAL INTELLIGENCE

UNIT IV - UNCERTAIN KNOWLEDGE AND REASONING

PART - A (2 Marks)

1. List down two applications of temporal probabilistic models.
A suitable way to deal with this problem is to identify a temporal causal model that may effectively explain the patterns observed in the data. Here we concentrate on probabilistic models, which provide a convenient framework to represent and manage underspecified information; in particular, we consider the class of Causal Probabilistic Networks (CPN).

2. Define Dempster-Shafer theory.
The Dempster-Shafer theory (DST) is a mathematical theory of evidence. It allows one to combine evidence from different sources and arrive at a degree of belief (represented by a belief function) that takes into account all the available evidence. The theory was first developed by Arthur P. Dempster and Glenn Shafer.

3. Define uncertainty.
Uncertainty means that many of the simplifications that are possible with deductive inference are no longer valid.

4. State the reasons why first-order logic fails to cope with a domain like medical diagnosis.
Three reasons:
Laziness: it is too much work to list the complete set of antecedents or consequents needed to ensure an exceptionless rule.
Theoretical ignorance: medical science has no complete theory for the domain.
Practical ignorance: even if we know all the rules, we may be uncertain about a particular case because not all the needed information is available.

5. What is the need for probability theory in uncertainty?
Probability provides a way of summarizing the uncertainty that comes from our laziness and ignorance. Probability statements do not have quite the same semantics as logical statements: they are made with respect to the evidence available, not the world itself.

6. What is the need for utility theory in uncertainty?
Utility theory says that every state has a degree of usefulness, or utility, to an agent, and that the agent will prefer states with higher utility. Agents use utility theory to represent and reason with preferences.

7. What is called decision theory?
Preferences, as expressed by utilities, are combined with probabilities in the general theory of rational decisions called decision theory.
Decision Theory = Probability Theory + Utility Theory.

8. Define conditional probability.
Once the agent has obtained some evidence concerning the previously unknown propositions making up the domain, conditional or posterior probabilities, written P(a|b), are used. Note that P(a|b) applies only when b is all that is known.

9. When is a probability distribution used?
If we want the probabilities of all the possible values of a random variable, a probability distribution is used. E.g.: P(Weather) = (0.7, 0.2, 0.08, 0.02). This type of notation simplifies many equations.

10. What is an atomic event?
An atomic event is an assignment of particular values to all variables; in other words, a complete specification of the state of the domain.

11. Define joint probability distribution.
A joint probability distribution completely specifies an agent's probability assignments to all propositions in the domain. The joint probability distribution P(X1, X2, ..., Xn) assigns probabilities to all possible atomic events, where X1, X2, ..., Xn are the variables.

12. What is meant by a belief network?
A belief network is a graph in which the following holds:
A set of random variables makes up the nodes.
A set of directed links or arrows connects pairs of nodes.
There is a conditional probability table for each node.
The graph has no directed cycles.

13. What are called polytrees?
Singly connected networks, in which at most one undirected path is present between any two nodes, are known as polytrees. There are inference algorithms that work efficiently only on such networks.

14. What is a multiply connected graph?
A multiply connected graph is one in which two nodes are connected by more than one path.

15. List the three basic classes of algorithms for evaluating multiply connected graphs.
The three basic classes of algorithms for evaluating multiply connected graphs are:

Clustering methods;
Conditioning methods;
Stochastic simulation methods.

16. What is called the principle of Maximum Expected Utility (MEU)?
The basic idea is that an agent is rational if and only if it chooses the action that yields the highest expected utility, averaged over all the possible outcomes of the action. This is known as MEU.

17. What is meant by deterministic nodes?
A deterministic node has its value specified exactly by the values of its parents, with no uncertainty.

18. What are the uses of a belief network?
The uses of a belief network are:
Making decisions based on probabilities in the network and on the agent's utilities;
Deciding which additional evidence variables should be observed in order to gain useful information;
Performing sensitivity analysis to understand which aspects of the model have the greatest impact on the probabilities of the query variables (and therefore must be accurate);
Explaining the results of probabilistic inference to the user.

19. What is called a Markov Decision Problem?
The problem of calculating an optimal policy in an accessible, stochastic environment with a known transition model is called a Markov Decision Problem (MDP).

20. Define Dynamic Belief Network.
A belief network with one node for each state and sensor variable for each time step is called a Dynamic Belief Network (DBN).

21. Define Dynamic Decision Network.
A dynamic decision network is obtained by adding utility nodes and decision nodes for actions to a DBN. A DDN calculates the expected utility of each decision sequence.

PART - B

1. Explain Probabilistic Reasoning.

The students should understand the role of uncertainty in knowledge representation.
Students should learn the use of probability theory to represent uncertainty.
Students should understand the basics of probability theory, including:
o Probability distributions
o Joint probability
o Marginal probability
o Conditional probability
o Independence
o Conditional independence
Students should learn inference mechanisms in probability theory, including:
o Bayes rule
o Product rule
Students should be able to convert natural language statements into probabilistic statements and apply inference rules.
Students should understand Bayesian networks as a data structure to represent conditional independence, including the syntax and semantics of Bayes nets, inferencing mechanisms in Bayes nets, efficient inferencing techniques like variable ordering, the concept of d-separation, and the inference mechanism for the special case of polytrees.
Students should have an idea about approximate inference techniques in Bayesian networks.

At the end of this lesson the student should be able to do the following:
Represent a problem in terms of probabilistic statements.
Apply Bayes rule and the product rule for inferencing.
Represent a problem using a Bayes net.
Perform probabilistic inferencing using a Bayes net.

Probabilistic Reasoning

Using logic to represent and reason, we can represent knowledge about the world with facts and rules, like the following ones:

bird(tweety).
fly(X) :- bird(X).

We can also use a theorem-prover to reason about the world and deduce new facts about it, e.g.,

?- fly(tweety).
Yes

However, this often does not work outside of toy domains: non-tautologous, certain rules are hard to find. A way to handle knowledge representation in real problems is to extend logic by using certainty factors.

In other words, we replace

IF condition THEN fact

with

IF condition WITH certainty x THEN fact WITH certainty f(x)

Unfortunately, we cannot really adapt logical inference to probabilistic inference, since the latter is not context-free. Replacing rules with conditional probabilities makes inferencing simpler. Replace

smoking -> lung cancer
or
lotsofconditions, smoking -> lung cancer

with

P(lung cancer | smoking) = 0.6

Uncertainty is represented explicitly and quantitatively within probability theory, a formalism that has been developed over centuries.

A probabilistic model describes the world in terms of a set S of possible states - the sample space. We don't know the true state of the world, so we (somehow) come up with a probability distribution over S which gives the probability of any state being the true one. The world is usually described by a set of variables or attributes.

Consider the probabilistic model of a fictitious medical expert system. The world is described by 8 binary-valued variables:

Visit to Asia? A
Tuberculosis? T
Either tub. or lung cancer? E
Lung cancer? L
Smoking? S
Bronchitis? B
Dyspnoea? D
Positive X-ray? X

We have 2^8 = 256 possible states or configurations, and so 256 probabilities to find.
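To make this blow-up concrete, here is a minimal Python sketch (the variable names follow the list above) that enumerates the atomic events of such a model:

    # Enumerate the atomic events of the 8-variable model above.
    from itertools import product

    variables = ["A", "T", "E", "L", "S", "B", "D", "X"]
    states = list(product([False, True], repeat=len(variables)))
    print(len(states))  # 256 -- one probability per atomic event

    # One atomic event is an assignment of a value to every variable.
    example = dict(zip(variables, states[0]))
    print(example)      # {'A': False, 'T': False, ..., 'X': False}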

2. Explain the Review of Probability Theory.

The primitives in probabilistic reasoning are random variables, just as the primitives in propositional logic are propositions. A random variable is not in fact a variable, but a function from a sample space S to another space, often the real numbers. For example, let the random variable Sum (representing the outcome of two die throws) be defined thus:

Sum(die1, die2) = die1 + die2

Each random variable has an associated probability distribution determined by the underlying distribution on the sample space. Continuing our example:

P(Sum = 2) = 1/36, P(Sum = 3) = 2/36, ..., P(Sum = 12) = 1/36

Consider the probabilistic model of the fictitious medical expert system mentioned before. The sample space is described by 8 binary-valued variables:

Visit to Asia? A
Tuberculosis? T
Either tub. or lung cancer? E
Lung cancer? L
Smoking? S
Bronchitis? B
Dyspnoea? D
Positive X-ray? X

There are 2^8 = 256 events in the sample space. Each event is determined by a joint instantiation of all of the variables.

S = {(A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = f),
     (A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = t),
     ...
     (A = t, T = t, E = t, L = t, S = t, B = t, D = t, X = t)}

Since S is defined in terms of joint instantiations, any distribution defined on it is called a joint distribution. All underlying distributions will be joint distributions in this module. The variables {A, T, E, L, S, B, D, X} are in fact random variables, which project values:

L(A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = f) = f
L(A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = t) = f
L(A = t, T = t, E = t, L = t, S = t, B = t, D = t, X = t) = t

Each of the random variables {A, T, E, L, S, B, D, X} has its own distribution, determined by the underlying joint distribution. This is known as the marginal distribution. For example, the distribution for L is denoted P(L), and this distribution is defined by the two probabilities P(L = f) and P(L = t). For example,

P(L = f) = P(A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = f)
         + P(A = f, T = f, E = f, L = f, S = f, B = f, D = f, X = t)
         + P(A = f, T = f, E = f, L = f, S = f, B = f, D = t, X = f)
         ...
         + P(A = t, T = t, E = t, L = f, S = t, B = t, D = t, X = t)
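The summing-out itself is mechanical. A short Python sketch, over a joint table that is randomly generated (and normalized) purely for illustration, computes P(L) this way:

    # Marginalize a full joint table: P(L = f) sums the 128 events with L = f.
    import random
    from itertools import product

    variables = ["A", "T", "E", "L", "S", "B", "D", "X"]
    L_INDEX = variables.index("L")

    random.seed(0)
    weights = {s: random.random() for s in product([False, True], repeat=8)}
    total = sum(weights.values())
    joint = {s: w / total for s, w in weights.items()}   # a valid joint

    p_L_false = sum(p for s, p in joint.items() if not s[L_INDEX])
    print(round(p_L_false, 4), round(1.0 - p_L_false, 4))  # P(L = f), P(L = t)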

P(L) is an example of a marginal distribution.

Here is a joint distribution over two binary-valued variables A and B. We get the marginal distribution over B by simply adding up the different possible values of A for any value of B (and putting the result in the margin).

In general, given a joint distribution over a set of variables, we can get the marginal distribution over a subset by simply summing out those variables not in the subset. In the medical expert system case, we can get the marginal distribution over, say, {A, D} by simply summing out the other variables. However, computing marginals is not always an easy task. For example,

P(A = t, D = f) = P(A = t, T = f, E = f, L = f, S = f, B = f, D = f, X = f)
                + P(A = t, T = f, E = f, L = f, S = f, B = f, D = f, X = t)
                + P(A = t, T = f, E = f, L = f, S = f, B = t, D = f, X = f)
                + P(A = t, T = f, E = f, L = f, S = f, B = t, D = f, X = t)
                ...
                + P(A = t, T = t, E = t, L = t, S = t, B = t, D = f, X = t)

This has 64 summands, each of whose value needs to be estimated from empirical data. For the estimates to be of good quality, each of the instances that appear in the summands should appear a sufficiently large number of times in the empirical data. Often such a large amount of data is not available. However, computation can be simplified under certain special but common conditions. This is the condition of independence of variables.

Two random variables A and B are independent iff

P(A, B) = P(A) P(B)

i.e. we can get the joint from the marginals. This is quite a strong statement: it means that for any value x of A and any value y of B,

P(A = x, B = y) = P(A = x) P(B = y)

Note that the independence of two random variables is a property of the underlying probability distribution.

Conditional probability is defined as

P(A | B) = P(A, B) / P(B)    (whenever P(B) > 0)

It means that for any value x of A and any value y of B,

P(A = x | B = y) = P(A = x, B = y) / P(B = y)

If A and B are independent, then

P(A | B) = P(A)

Conditional probabilities can represent causal relationships in both directions: from cause to (probable) effect, and from effect to (probable) cause.
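A short sketch over a small invented joint table (the numbers are not from the text) checks the factorization P(A, B) = P(A)P(B) and computes one conditional probability:

    # Check independence and compute a conditional from a 2x2 joint table.
    joint = {                       # P(A, B) for binary A, B (illustrative)
        (True, True): 0.08, (True, False): 0.12,
        (False, True): 0.32, (False, False): 0.48,
    }

    pA = {a: sum(p for (a2, b), p in joint.items() if a2 == a) for a in (True, False)}
    pB = {b: sum(p for (a, b2), p in joint.items() if b2 == b) for b in (True, False)}

    # Independence: P(A = x, B = y) == P(A = x) * P(B = y) for every x, y.
    independent = all(abs(joint[(a, b)] - pA[a] * pB[b]) < 1e-12
                      for a in (True, False) for b in (True, False))
    print(independent)                      # True for this particular table

    # Conditional probability: P(A = t | B = t) = P(A = t, B = t) / P(B = t).
    print(joint[(True, True)] / pB[True])   # 0.2, which equals P(A = t) here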

3. Explain Probabilistic Inference Rules.

Two rules in probability theory are important for inferencing, namely, the product rule and Bayes' rule. Here is a simple example of an application of Bayes' rule.

Suppose you have been tested positive for a disease; what is the probability that you actually have the disease? It depends on the accuracy and sensitivity of the test, and on the background (prior) probability of the disease.

Let P(Test = +ve | Disease = true) = 0.95, so the false negative rate, P(Test = -ve | Disease = true), is 5%. Let P(Test = +ve | Disease = false) = 0.05, so the false positive rate is also 5%. Suppose the disease is rare: P(Disease = true) = 0.01 (1%). Let D denote Disease and "T = +ve" denote the positive test. Then,

                            P(T=+ve | D=true) * P(D=true)
P(D=true | T=+ve) = ---------------------------------------------------------------
                    P(T=+ve | D=true) * P(D=true) + P(T=+ve | D=false) * P(D=false)

                          0.95 * 0.01
                  = ----------------------- = 0.161
                     0.95*0.01 + 0.05*0.99

So the probability of having the disease given that you tested positive is just 16%. This seems too low, but here is an intuitive argument to support it. Of 100 people, we expect only 1 to have the disease, but we expect about 5% of the rest (about 5 people) to test positive. So of the roughly 6 people who test positive, we only expect 1 of them to actually have the disease; and indeed 1/6 is approximately 0.16. In other words, the reason the number is so small is that you believed that this is a rare disease; the test has made it 16 times more likely that you have the disease, but it is still unlikely in absolute terms.

If you want to be "objective", you can set the prior to uniform (i.e. effectively ignore the prior), and then get

                    P(T=+ve | D=true) * P(D=true)
P(D=true | T=+ve) = -----------------------------
                              P(T=+ve)

                         0.95 * 0.5        0.475
                  = ------------------- = ------- = 0.95
                     0.95*0.5 + 0.05*0.5    0.5

This, of course, is just the true positive rate of the test. However, this conclusion relies on your belief that, if you did not conduct the test, half the people in the world have the disease, which does not seem reasonable.

A better approach is to use a plausible prior (e.g. P(D=true) = 0.01), but then conduct multiple independent tests; if they all show up positive, then the posterior will increase. For example, if we conduct two (conditionally independent) tests T1, T2 with the same reliability, and they are both positive, we get

                             P(T1=+ve | D=true) * P(T2=+ve | D=true) * P(D=true)
P(D=true | T1=+ve, T2=+ve) = ----------------------------------------------------
                                           P(T1=+ve, T2=+ve)

                               0.95 * 0.95 * 0.01          0.009025
                           = ------------------------------- = -------- = 0.785
                              0.95*0.95*0.01 + 0.05*0.05*0.99   0.0115

The assumption that the pieces of evidence are conditionally independent is called the naive Bayes assumption. This model has been successfully used for many applications, including classifying email as spam (D = true) or not (D = false) given the presence of various key words (Ti = +ve if word i is in the text, else Ti = -ve). It is clear that the words are not independent, even conditioned on spam/not-spam, but the model works surprisingly well nonetheless.
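The worked numbers above are easy to reproduce. Here is a small Python sketch of Bayes' rule, with the repeated-test case handled under the naive Bayes assumption:

    # Posterior P(disease | n conditionally independent positive tests).
    def posterior(prior, sens, fpr, n_positive_tests=1):
        like_d = sens ** n_positive_tests     # P(all tests +ve | D = true)
        like_nd = fpr ** n_positive_tests     # P(all tests +ve | D = false)
        return like_d * prior / (like_d * prior + like_nd * (1.0 - prior))

    print(round(posterior(0.01, 0.95, 0.05), 3))     # 0.161 -- one positive test
    print(round(posterior(0.5, 0.95, 0.05), 3))      # 0.95  -- uniform prior
    print(round(posterior(0.01, 0.95, 0.05, 2), 3))  # 0.785 -- two positive tests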

In many problems, complete independence of variables does not hold, though many variables are conditionally independent. X and Y are conditionally independent given Z iff

P(X, Y | Z) = P(X | Z) P(Y | Z),  or equivalently  P(X | Y, Z) = P(X | Z)

In full: X and Y are conditionally independent given Z iff for any instantiation x, y, z of X, Y, Z we have

P(X = x | Y = y, Z = z) = P(X = x | Z = z)

An example of conditional independence: consider the following three Boolean random variables: LeaveBy8, GetTrain, OnTime. Suppose we can assume that

P(OnTime | GetTrain, LeaveBy8) = P(OnTime | GetTrain)

but NOT

P(OnTime | LeaveBy8) = P(OnTime)

Then OnTime is dependent on LeaveBy8, but independent of LeaveBy8 given GetTrain. We can represent P(OnTime | GetTrain, LeaveBy8) = P(OnTime | GetTrain) graphically by:

LeaveBy8 -> GetTrain -> OnTime

Inferencing in Bayesian Networks

10.5.5.1 Exact Inference

The basic inference problem in BNs is described as follows:

Given
1. A Bayesian network BN
2. Evidence e - an instantiation of some of the variables in BN (e can be empty)
3. A query variable Q

Compute P(Q | e) - the (marginal) conditional distribution over Q. Given what we do know, compute the distribution over what we do not.
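Before looking at the kinds of inference tasks, here is what such a network looks like as data: the LeaveBy8 chain above as CPTs in Python (all numbers invented), with a check that the factored joint really does make OnTime independent of LeaveBy8 given GetTrain:

    # The LeaveBy8 -> GetTrain -> OnTime chain as CPTs; numbers invented.
    p_leave = {True: 0.6, False: 0.4}                 # P(LeaveBy8)
    p_train = {True: {True: 0.9, False: 0.1},         # P(GetTrain | LeaveBy8)
               False: {True: 0.3, False: 0.7}}
    p_ontime = {True: {True: 0.95, False: 0.05},      # P(OnTime | GetTrain)
                False: {True: 0.2, False: 0.8}}

    def joint(l, g, o):
        # The factored joint distribution the graph encodes.
        return p_leave[l] * p_train[l][g] * p_ontime[g][o]

    # P(OnTime=t | GetTrain=t, LeaveBy8=l) is the same for both values of l.
    for l in (True, False):
        num = joint(l, True, True)
        den = sum(joint(l, True, o) for o in (True, False))
        print(l, round(num / den, 4))   # 0.95 either way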

Four categories of inferencing tasks are usually encountered:

1. Diagnostic inferences (from effects to causes). Given that John calls, what is the probability of burglary? i.e. find P(B | J).
2. Causal inferences (from causes to effects). Given Burglary, what is the probability that John calls, i.e. P(J | B), or that Mary calls, i.e. P(M | B)?
3. Intercausal inferences (between causes of a common event). Given Alarm, what is the probability of burglary, i.e. P(B | A)? Now given Earthquake as well, what is the probability of burglary, i.e. P(B | A, E)?
4. Mixed inferences (some causes and some effects known). Given that John calls and there is no earthquake, what is the probability of Alarm, i.e. P(A | J, ~E)?

We will demonstrate below the inferencing procedure for BNs. As an example, consider a linear BN - a chain A -> B -> C -> D -> E - without any a priori evidence.

Consider computing all the marginals (with no evidence). P(A) is given, and

P(B) = Σ_a P(B | A = a) P(A = a)

We don't need any conditional independence assumption for this. For example, suppose A and B are binary; then we have

P(B = t) = P(B = t | A = t) P(A = t) + P(B = t | A = f) P(A = f)
P(B = f) = P(B = f | A = t) P(A = t) + P(B = f | A = f) P(A = f)

Now, P(B) (the marginal distribution over B) was not given originally... but we just computed it in the last step, so we're OK (assuming we remembered to store P(B) somewhere). If C were not independent of A given B, we would have a CPT for P(C | A, B), not P(C | B). Note that we had to wait for P(B) before P(C) was calculable.

If each node has k values, and the chain has n nodes, this algorithm has complexity O(nk^2). Summing over the joint has complexity O(k^n).
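The forward pass is short in code. A Python sketch on a binary chain A -> B -> C -> D -> E, with invented CPTs, computes every marginal in O(nk^2):

    # Forward pass on a chain: each marginal comes from the previous one.
    p_prior = {True: 0.3, False: 0.7}              # P(A)
    cpts = [                                        # P(B|A), P(C|B), P(D|C), P(E|D)
        {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}},
        {True: {True: 0.7, False: 0.3}, False: {True: 0.4, False: 0.6}},
        {True: {True: 0.6, False: 0.4}, False: {True: 0.2, False: 0.8}},
        {True: {True: 0.9, False: 0.1}, False: {True: 0.5, False: 0.5}},
    ]

    marginal = p_prior
    for cpt in cpts:   # f(child) = sum_parent P(child | parent) f(parent)
        marginal = {child: sum(cpt[parent][child] * marginal[parent]
                               for parent in (True, False))
                    for child in (True, False)}
    print(marginal)    # P(E), computed in O(n k^2) instead of O(k^n)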

Complexity can be reduced by more efficient summation, by pushing sums into products.

Dynamic programming may also be used for the problem of exact inferencing in the above Bayes net. The steps are as follows:

1. We first compute

f1(B) = Σ_a P(B | A = a) P(A = a)

2. f1(B) is a function representable by a table of numbers, one for each possible value of B.

3. Here, f1(B) is just the marginal distribution P(B).

4. We then use f1(B) to calculate f2(C) by summation over B:

f2(C) = Σ_b P(C | B = b) f1(b)

This method of solving a problem (i.e. finding P(D)) by solving subproblems and storing the results is characteristic of dynamic programming.

The above methodology may be generalized. We eliminated variables starting from the root, but we don't have to; we might also have eliminated variables starting from the leaf E and worked backwards, computing intermediate tables such as f1(C, E).

The following points are to be noted about the above algorithm. The algorithm computes intermediate results which are not individual probabilities but entire tables, such as f1(C, E). It so happens that f1(C, E) = P(E | C), but we will see examples where the intermediate tables do not represent probability distributions.

Dealing with Evidence

Dealing with evidence is easy. Suppose {A, B, C, D, E} are all binary and we want P(C | A = t, E = t). Computing P(C, A = t, E = t) is enough: it is a table of numbers, one for each value of C. We just need to renormalise it so that the entries add up to 1. It was noticed from the above computation that conditional distributions are basically just normalised marginal distributions. Hence, the algorithms we study are only concerned with computing marginals; getting the actual conditional probability values is a trivial tidying-up last step.

Now let us concentrate on computing P(C, A = t, E = t). It can be done by plugging in the observed values for A and E and summing out B and D:

P(C, A = t, E = t) = Σ_b Σ_d P(A = t) P(B = b | A = t) P(C | B = b) P(D = d | C) P(E = t | D = d)
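A sketch of the same computation in Python, reusing the invented chain CPTs from the earlier sketch: clamp A = t and E = t, sum out B and D, and renormalise over C:

    # Evidence handling on the chain A -> B -> C -> D -> E.
    from itertools import product

    p_a = {True: 0.3, False: 0.7}
    p_b_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
    p_c_b = {True: {True: 0.7, False: 0.3}, False: {True: 0.4, False: 0.6}}
    p_d_c = {True: {True: 0.6, False: 0.4}, False: {True: 0.2, False: 0.8}}
    p_e_d = {True: {True: 0.9, False: 0.1}, False: {True: 0.5, False: 0.5}}

    unnormalized = {}
    for c in (True, False):   # P(C = c, A = t, E = t), summing out B and D
        unnormalized[c] = sum(p_a[True] * p_b_a[True][b] * p_c_b[b][c] *
                              p_d_c[c][d] * p_e_d[d][True]
                              for b, d in product((True, False), repeat=2))

    z = sum(unnormalized.values())    # P(A = t, E = t); P(A = t) cancels in the ratio
    posterior = {c: p / z for c, p in unnormalized.items()}
    print(posterior)                  # P(C | A = t, E = t)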

We don't really care about P(A = t), since it will cancel out when we renormalise.

Now let us see how evidence-induced independence can be exploited. Observing a variable can render other variables conditionally independent, so parts of the summation factor apart; clever variable elimination would jump straight to the simplified expression. Choosing an optimal order of variable elimination leads to a large amount of computational saving. However, finding the optimal order is a hard problem.

10.5.5.1.1 Variable Elimination

For a Bayes net, we can sometimes use the factored representation of the joint probability distribution to do marginalization efficiently. The key idea is to "push sums in" as far as possible when summing (marginalizing) out irrelevant terms. For example, for the water sprinkler network (Cloudy -> Sprinkler, Cloudy -> Rain, with Sprinkler and Rain both parents of WetGrass),

P(w) = Σ_c Σ_s Σ_r P(c) P(s | c) P(r | c) P(w | s, r)
     = Σ_c P(c) Σ_s P(s | c) Σ_r P(r | c) P(w | s, r)

Notice that, as we perform the innermost sums, we create new terms, which need to be summed over in turn, e.g.,

P(w) = Σ_c P(c) Σ_s P(s | c) T1(c, w, s)

where

T1(c, w, s) = Σ_r P(r | c) P(w | s, r)

Continuing this way,

P(w) = Σ_c P(c) T2(c, w)

where

T2(c, w) = Σ_s P(s | c) T1(c, w, s)

In a nutshell, the variable elimination procedure repeats the following steps:
1. Pick a variable Xi.
2. Multiply all expressions involving that variable, resulting in an expression f over a number of variables (including Xi).
3. Sum out Xi, i.e. compute and store f' = Σ_xi f.

For the multiplication, we must compute a number for each joint instantiation of all variables in f, so complexity is exponential in the largest number of variables participating in one of these multiplicative subexpressions.

If we wish to compute several marginals at the same time, we can use dynamic programming to avoid the redundant computation that would be involved if we used variable elimination repeatedly.
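A Python sketch of this elimination on the water sprinkler network, using the CPT values commonly quoted for this example, shows the new term created by the innermost sum:

    # "Pushing sums in" on Cloudy -> {Sprinkler, Rain} -> WetGrass.
    p_c = {True: 0.5, False: 0.5}
    p_s_c = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}
    p_r_c = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}
    p_w_sr = {(True, True): 0.99, (True, False): 0.9,
              (False, True): 0.9, (False, False): 0.0}   # P(W=t | S, R)

    # P(W=t) = sum_c P(c) sum_s P(s|c) sum_r P(r|c) P(W=t|s,r)
    p_w = 0.0
    for c in (True, False):
        inner_s = 0.0
        for s in (True, False):
            # t1(c, s) = sum_r P(r|c) P(W=t|s,r): the new factor created
            # by the innermost sum, then summed over in turn.
            t1 = sum(p_r_c[c][r] * p_w_sr[(s, r)] for r in (True, False))
            inner_s += p_s_c[c][s] * t1
        p_w += p_c[c] * inner_s
    print(round(p_w, 4))   # ~0.6471 with these CPTs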

Exact inferencing in a general Bayes net is a hard problem. However, for networks with some special topologies, efficient inferencing techniques exist. We discuss one such technique for a class of networks called polytrees.

Inferencing in Poly-Trees

A poly-tree is a graph where there is at most one undirected path between any pair of nodes. The inferencing problem in poly-trees may be stated as follows.

U: U1 ... Um, the parents of node X
Y: Y1 ... Yn, the children of node X
X: the query variable
E: the evidence variables (whose truth values are known)

Objective: compute P(X | E).

E+_X is the set of causal support for X, comprising the variables above X connected through its parents, which are known. E-_X is the set of evidential support for X, comprising the variables below X connected through its children. In order to compute P(X | E) we have

P(X | E) = P(X | E+_X, E-_X)

           P(E-_X | X, E+_X) P(X | E+_X)
         = -----------------------------
                  P(E-_X | E+_X)

Since X d-separates E+_X from E-_X, we can simplify the numerator as

P(X | E) = α P(E-_X | X) P(X | E+_X)

where 1/α is the constant representing the denominator P(E-_X | E+_X). This simplification rests on X's conditional independence relations: if the parents are known, X is conditionally independent of all other nodes in the causal support set; similarly, given the children, X is independent of all other variables in the evidential support set.

Approximate Inferencing in Bayesian Networks

Many real models of interest have a large number of nodes, which makes exact inference very slow (exact inference is NP-hard in the worst case). We must therefore resort to approximation techniques. Unfortunately, approximate inference is #P-hard, but we can nonetheless come up with approximations which often work well in practice. Below is a list of the major techniques.

Variational methods. The simplest example is the mean-field approximation, which exploits the law of large numbers to approximate large sums of random variables by their means. In particular, we essentially decouple all the nodes and introduce a new parameter, called a variational parameter, for each node, and iteratively update these parameters so as to minimize the cross-entropy (KL distance) between the approximate and true probability distributions. Updating the variational parameters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood. More sophisticated methods are possible, which give tighter lower (and upper) bounds.

Sampling (Monte Carlo) methods. The most popular approach in high dimensions is Markov Chain Monte Carlo (MCMC), which includes as special cases Gibbs sampling and the Metropolis-Hastings algorithm.

Bounded cutset conditioning. By instantiating subsets of the variables, we can break loops in the graph. Unfortunately, when the cutset is large, this is very slow. By instantiating only a subset of values of the cutset, we can compute lower bounds on the probabilities of interest. Alternatively, we can sample the cutsets jointly, a technique known as block Gibbs sampling.

Parametric approximation methods. These express the intermediate summands in a simpler form, e.g., by approximating them as a product of smaller factors. "Minibuckets" and the Boyen-Koller algorithm fall into this category.

R. Loganathan, AP/CSE, Mahalakshmi Engineering College, Trichy