CMPT 882 Machine Learning: Bayesian Learning Lecture Scribe for Week 4, Jan 30th & Feb 4th


Stephen Fagan, sfagan@sfu.ca

Overview:
- Introduction
  - Who was Bayes?
  - Bayesian Statistics Versus Classical Statistics
  - Bayesian Learning (in a nutshell)
- Prerequisites From Probability Theory
  - Basic Terms
  - Basic Formulas
- Bayes Theorem
  - Derivation
  - Significance
  - Example
- Naive Bayes Classifier
- Bayesian Belief Networks
  - Conditional Independence
  - Representation
  - Inference
  - Causation
  - A Brief History of Causation

Introduction: Who was Bayes?

"It is impossible to understand a man's work unless you understand something of his character and unless you understand something of his environment."

[Portrait caption: Thomas Bayes? (possible, but not probable)]

Reverend Thomas Bayes (1702-1761) was an English theologian and mathematician. Motivated by his religious beliefs, he proposed the well-known argument for the existence of God known as the argument by design. Basically, the argument is: without assuming the existence of God, the operation of the universe is extremely unlikely; therefore, since the operation of the universe is a fact, it is very likely that God exists. To back up this argument, Bayes produced a general mathematical theory which introduced probabilistic inference (a method for calculating the probability that an event will occur in the future from the frequency with which it has occurred in prior trials). Central to this theory was a theorem, now known as Bayes Theorem (published 1764), which states that one's evidence confirms the likelihood of a hypothesis only to the degree that the appearance of this evidence would be more probable with the assumption of the hypothesis than without it (see below for its formal statement).

Bayesian Statistics Versus Classical Statistics:

The central difference between Bayesian and classical statistics is that in Bayesian statistics we assume that we know the probability of any event (before any calculations), while the classical statistician does not. The probabilities that the Bayesian assumes we know are called prior probabilities.

Bayesian Learning (in a nutshell):

As Bayesians, we assume that we have a prior probability distribution for all events. This gives us a quantitative method to weight the evidence that we come across during

learning. Such methods allow us to construct a more detailed ranking of the alternative hypotheses than if we were only concerned with the consistency of the hypotheses with the evidence (though, as we will see, consistency-based learning is a subclass of Bayesian learning). As a result, Bayesian methods provide practical learning algorithms (though they require prior probabilities - see below) such as naive Bayes learning and Bayesian belief network learning. In addition to this, Bayesian methods are thought to provide a useful conceptual framework that gives us a standard for evaluating other learning algorithms.

Prerequisites From Probability Theory

In order to confidently use Bayesian learning methods, we will need to be familiar with a few basic terms and formulas from probability theory.

Basic Terms

a. Random Variable: Since our concern is with machine learning, we can think of random variables as being like attributes which can take various values, e.g. sunny, rain, cloudy, snow.

b. Domain: This is the set of possible values that a random variable can take. It could be finite or infinite.

c. Probability Distribution: This is a mapping from a domain (see above) to values in [0,1]. When the domain has a finite or countably infinite number of distinct elements, the sum of all of the probabilities given by the probability distribution equals 1. e.g. P = <0.7, 0.2, 0.08, 0.02> over <sunny, rain, cloudy, snow>, so P(sunny) = 0.7, and so on.

d. Event: Each assignment of a domain value to a random variable is called an event, e.g. rain.

Basic Formulas

a. Conditional Probability: This formula allows you to calculate the probability of an event A given that event B is assumed to have been obtained. This probability is denoted by P(A|B).

P(A|B) = P(A ∧ B) / P(B)

b. Product Rule: This rule, derived from a, gives the probability of a conjunction of events A and B:

P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

c. Sum Rule: This rule gives the probability of the disjunction of events A and B:

P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

d. Theorem of Total Probability: If events A_1, ..., A_n are mutually exclusive with sum_{i=1}^{n} P(A_i) = 1, then

P(B) = sum_{i=1}^{n} P(B|A_i) P(A_i).

Bayes Theorem

The informal statement of Bayes Theorem is that one's evidence confirms the likelihood of a hypothesis only to the degree that the appearance of this evidence would be more probable with the assumption of the hypothesis than without it. Formally, in the special case for machine learning, we get:

P(h|D) = P(D|h) P(h) / P(D)

where:
- D is a set of training data.
- h is a hypothesis.
- P(h|D) is the posterior probability, i.e. the conditional probability of h after the training data (evidence) is presented.
- P(h) is the prior probability of hypothesis h. This non-classical quantity is often found by looking at data from the past (or in the training data).
- P(D) is the prior probability of the training data D. This quantity is often a constant value, P(D) = P(D|h) P(h) + P(D|¬h) P(¬h), which can be computed easily when we insist that P(h|D) and P(¬h|D) sum to 1.
- P(D|h) is the probability of D given h, and is called the likelihood. This quantity is often easy to calculate since we sometimes assign it the value 1 when D and h are consistent, and assign it 0 when they are inconsistent.

It should be noted that Bayes Theorem is completely general and can be applied to any situation where one wants to calculate a conditional probability and one has knowledge of prior probabilities. Its generality is demonstrated through its derivation, which is very simple.

To help our intuitive understanding of Bayes Theorem, consider the example where we see some clouds in the sky and we are wondering what the chances of rain are. That is, we

want to know P(rain|clouds). By Bayes Theorem, we know that this is equal to P(clouds|rain) P(rain) / P(clouds). Here are some properties of Bayes Theorem which make this formula more intuitive:

- The more likely P(clouds|rain) is, the more likely P(rain|clouds) is.
- If P(clouds|rain) = 0, then P(rain|clouds) = 0. If we take all of the probabilities to be 0 or 1, then we get the propositional calculus.
- Bayes Theorem is only usable when P(clouds) > 0. However, there is research about extending Bayes Theorem to handle cases like P(clouds) = 0 (e.g. belief revision).
- The more likely P(rain) is, the more likely P(rain|clouds) is.
- If P(clouds) = 1, then P(rain|clouds) = P(rain).
- The more surprising your evidence (the smaller P(clouds) is), the larger its effect (the larger P(rain|clouds) is).

Derivation of Bayes Theorem

The derivation of this famous theorem is quite trivial. It is short and only uses the definition of conditional probability and the commutativity of conjunction:

P(D|h) P(h) / P(D) = [P(D ∧ h) / P(h)] · P(h) / P(D) = P(D ∧ h) / P(D) = P(h ∧ D) / P(D) = P(h|D)

Despite this formal simplicity, Bayes Theorem is still considered an important result.

Significance

Bayes Theorem is important for several reasons:
1. Bayesians regard the theorem as a rule for updating beliefs in response to new evidence.
2. The posterior probability, P(h|D), is a quantity that people find hard to assess (they are more used to calculating P(D|h)). The theorem expresses this quantity in terms that are more accessible.
3. It forms the basis for some practical learning algorithms (see below).

The general Bayesian learning strategy is:
1. Start with your prior probabilities, P(H).
2. Use data D to form P(H|D).
3. Adopt the most likely hypothesis given P(H|D).
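The update strategy above can be sketched in a few lines of Python. The numbers for the clouds/rain example are assumed purely for illustration (the notes give no concrete values):

```python
# A minimal sketch of Bayes' theorem as an update rule,
# using assumed (illustrative) numbers for the clouds/rain example.

def posterior(prior_h, likelihood, evidence_prob):
    """P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood * prior_h / evidence_prob

# Assumed prior and conditional probabilities (not from the notes):
p_rain = 0.10                   # P(rain)
p_clouds_given_rain = 0.90      # P(clouds | rain)
p_clouds_given_no_rain = 0.20   # P(clouds | not rain)

# P(clouds) via the theorem of total probability
p_clouds = (p_clouds_given_rain * p_rain
            + p_clouds_given_no_rain * (1 - p_rain))

p_rain_given_clouds = posterior(p_rain, p_clouds_given_rain, p_clouds)
print(round(p_rain_given_clouds, 3))  # 0.09 / 0.27 = 1/3
```

Note how the surprising-evidence property shows up here: shrinking P(clouds) while holding the numerator fixed raises the posterior.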

Bayes Theorem is used to choose the hypothesis that has the highest probability of being correct, given some set of training data. We call such a hypothesis a maximum a posteriori (MAP) hypothesis, and denote it by h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

Notice that P(D) is omitted from the denominator in the last line because it is essentially a constant with respect to the class of hypotheses, H. If all of the hypotheses have the same prior probability, (∀ i,j) P(h_i) = P(h_j), then we define the maximum likelihood (ML) hypothesis as:

h_ML = argmax_{h_i ∈ H} P(D|h_i).

Example

Given:

P(cancer) = 0.008        P(¬cancer) = 0.992
P(⊕|cancer) = 0.98       P(⊖|cancer) = 0.02
P(⊕|¬cancer) = 0.03      P(⊖|¬cancer) = 0.97

where ⊕ and ⊖ represent positive and negative cancer-test results, respectively.

Question: What is the probability that I have cancer given that my cancer-test result is positive? (i.e. what is P(cancer|⊕)?)

Calculation: By Bayes Theorem, P(cancer|⊕) = P(⊕|cancer) P(cancer) / P(⊕). We know P(⊕|cancer) and P(cancer), but we must calculate P(⊕) as follows:

P(⊕) = P(⊕ ∧ cancer) + P(⊕ ∧ ¬cancer)
     = P(⊕|cancer) P(cancer) + P(⊕|¬cancer) P(¬cancer)

So,

P(cancer|⊕) = P(⊕|cancer) P(cancer) / P(⊕)
            = 0.98 × 0.008 / (0.98 × 0.008 + 0.03 × 0.992)
            = 0.0078 / (0.0078 + 0.0298)
            ≈ 0.21

We see that P(cancer|⊕) is still less than 1/2; however, this does not imply any particular action. For different people, actions will vary (even with the same information) depending on their subjective utility judgements. Another thing to notice is that this value, P(cancer|⊕) ≈ 0.21, is often confused (even by doctors) with P(⊕|cancer) = 0.98.

Naive Bayes Classifier

The naive Bayes classifier is a highly practical Bayesian learning method that can be used when:
1. the amount of training data is moderate or large (so that the frequencies of the events in the data accurately reflect their probability of occurring outside of the training data), and
2. the attribute values that describe instances are independent given the classification (see below). That is, given the target value (i.e. classification), v, of an instance that has attributes a_1, a_2, ..., a_n,

P(a_1 ∧ a_2 ∧ ... ∧ a_n | v) = prod_i P(a_i|v).

Here is what the naive Bayes classifier does:
- Let x be an instance described by a conjunction of attribute values and let f(x) be a target function whose range is some finite set V (representing the classes).
- The learner is provided with a set of training examples of the target function and is then asked to classify (i.e. predict the target value of) a new instance which is described by a tuple of attributes, <a_1, a_2, ..., a_n>.
- The learner assigns to the new instance the most probable target value, v_MAP, given the attribute values that describe it, where

v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, ..., a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)   (by Bayes Theorem)
      = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j)   (because P(a_1, a_2, ..., a_n) is a constant given the instance)
      = argmax_{v_j ∈ V} P(v_j) prod_i P(a_i|v_j)   (by the assumption of conditional independence of the attributes given the target value)
      = v_NB   (denoting the target value output by the naive Bayes classifier)

To calculate this value, the learner first estimates the P(v_j) values from the training data by simply counting the frequency with which each v_j occurs in the data. The P(a_i|v_j) values are then calculated by counting the frequency with which a_i occurs in the training examples that get the target value v_j. Thus, the importance of having a large training set is due to the fact that it determines the accuracy of these critical values.

If the number of attribute values is n and the number of distinct target values is k, then the learner only needs to calculate n × k such P(a_i|v_j) values. Computationally, this is very cheap compared with the number of P(a_1, a_2, ..., a_n | v_j) values that would have to be calculated if we did not have the assumption of conditional independence. Another interesting thing to note about the naive Bayes learning method is that it doesn't perform an explicit search through the hypothesis space. Instead, it merely counts the frequency of various data combinations in the training set to calculate probabilities. For examples of the naive Bayes classifier, see section 6.9.1 and section 6.10 of the text.

Bayesian Belief Networks

In many cases, the condition of complete conditional independence cannot be met, and so the naive Bayes classifier will not learn successfully.
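The counting procedure described above can be sketched as follows. The toy weather data, attribute values, and class names are invented for illustration:

```python
# A minimal sketch of the naive Bayes classifier: estimate P(v) and
# P(a_i | v) by frequency counting, then pick
#   v_NB = argmax_v P(v) * prod_i P(a_i | v).
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    # attr_counts[v][i][a] = number of v-examples whose i-th attribute is a
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[v][i][a] += 1
    return class_counts, attr_counts, len(examples)

def classify(attrs, model):
    class_counts, attr_counts, n = model
    best_v, best_score = None, -1.0
    for v, cv in class_counts.items():
        score = cv / n                              # estimate of P(v)
        for i, a in enumerate(attrs):
            score *= attr_counts[v][i][a] / cv      # estimate of P(a_i | v)
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# Invented toy data: (weather, temperature) -> activity
data = [(("sunny", "warm"), "play"), (("sunny", "cold"), "play"),
        (("rain", "cold"), "stay"), (("rain", "warm"), "stay"),
        (("sunny", "warm"), "play")]
model = train_naive_bayes(data)
print(classify(("sunny", "warm"), model))  # play
```

As the notes point out, there is no explicit hypothesis-space search here: training is nothing but counting.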
However, as we have seen, to remove this condition completely is computationally very expensive: we would have to find a number of conditional probabilities equal to the number of instances times the number of target values (as opposed to merely n × k). Bayesian belief networks (aka Bayes nets, belief nets, probability nets, causal nets) offer us a compromise: "A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities." (page 184 of text)

To summarize, Bayes nets provide compact representations of joint probability distributions in systems with a lot of independencies (but some dependencies). In addition to the information contained in the training data, Bayesian nets allow us to incorporate any prior knowledge we have about the dependencies (and independencies) among the variables. This method of stating conditional independencies that apply to subsets of the variables is less constraining than the global assumption of conditional independence made by the naive Bayes classifier.

Conditional Independence

Let X, Y, and Z be three discrete-valued random variables where each can take on values from the domains V(X), V(Y), and V(Z), respectively. We say that X is conditionally independent of Y given Z provided

(∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j ∧ Z = z_k) = P(X = x_i | Z = z_k)

where x_i ∈ V(X), y_j ∈ V(Y), and z_k ∈ V(Z). This expression is abbreviated as P(X | Y ∧ Z) = P(X | Z). This definition easily extends to sets of variables as well (see text, page 185).

Representation

Bayesian belief networks are graphically represented by a directed acyclic graph and associated probability matrices which describe the prior and conditional probabilities of the variables. In the graph (network), each variable is represented by a node. The directed arcs between the nodes indicate that each variable is conditionally independent of its non-descendants given its immediate predecessors in the network. X is a descendant of Y if there is a directed path from Y to X. Associated with every node is a probability matrix which describes the probability distribution for that variable given the values of its immediate predecessors.

Example of a Bayesian Belief Network

[Network diagram from http://www.gpfn.sk.ca/~daryle/papers/bayesian_networks/bayes.html]

For the above graph, the probability matrix for the Alarm node given the events of Earthquake and Burglary might look like this:

Earthquake  Burglary  P(A | E, B)  P(¬A | E, B)
yes         yes       0.90         0.10
yes         no        0.20         0.80
no          yes       0.90         0.10
no          no        0.01         0.99

We can use the information in such tables to calculate the probability of any desired assignment <y_1, ..., y_n> to the tuple of network variables <Y_1, ..., Y_n> using

P(y_1, ..., y_n) = prod_{i=1}^{n} P(y_i | Parents(Y_i)).

Inference

Given a specified Bayes network, we may want to infer the value of a specific variable given the observed values of the other variables. Since we are dealing with probabilities, we will likely not get a specific value. Instead, we will calculate a probability distribution for the variable and then output the most probable value(s). In the above example, we were wondering whether an apple tree is sick given that it is losing its leaves. The result is that the chances are slightly in favor of the tree not being sick. This calculation is straightforward when the values of all of the other nodes are known, but when only a subset of the variables is known, the problem becomes NP-hard. There's a lot of research being done on methods of probabilistic inference in Bayesian nets, as well as on devising effective algorithms for learning Bayesian networks from training data.
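The factored joint above can be evaluated directly from the tables. In this sketch, the Alarm table is the one shown in the notes, while the prior probabilities for Earthquake and Burglary are assumed for illustration (the notes do not give them):

```python
# A sketch of the factored joint
#   P(y_1, ..., y_n) = prod_i P(y_i | Parents(Y_i))
# for a Burglary/Earthquake -> Alarm fragment.

p_e = {True: 0.01, False: 0.99}   # assumed P(Earthquake)
p_b = {True: 0.02, False: 0.98}   # assumed P(Burglary)
# P(Alarm | Earthquake, Burglary), from the table above
p_a = {(True, True): 0.90, (True, False): 0.20,
       (False, True): 0.90, (False, False): 0.01}

def joint(e, b, a):
    """P(E=e, B=b, A=a) as a product over the nodes' local tables."""
    pa = p_a[(e, b)]
    return p_e[e] * p_b[b] * (pa if a else 1 - pa)

# e.g. P(no earthquake, burglary, alarm) = 0.99 * 0.02 * 0.90
print(joint(False, True, True))
```

Summing `joint` over all eight assignments gives 1, which is a quick sanity check that the factored representation really is a probability distribution.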

Causation

Another reason for the popularity of Bayesian belief networks is that they are thought to be a convenient way to represent causal knowledge. We already know that the arrows in a Bayes net indicate that the variables are conditionally independent of their non-descendants given their immediate predecessors in the network. This statement is known as the Markov condition. What does this condition imply about causation? Consider the following simple graph:

switch -> light

This graph might represent the fact that the switch causes the light to be on. However, the issue is not this simple. Since the graph only represents statistical information, we might infer that there is some kind of causal relationship, but we don't know the details of such a relationship. For example, we consistently find that when we turn the switch on, the light comes on; however, with similar consistency we find that when the light is on, so is the switch.

More specifically, there are two issues concerning causation in Bayesian nets:

1. Are there unobserved common causes? That is, are there other variables, not represented in the network, that are causally related to both a node and that node's parent? For example, consider the following two graphs:

[Two graphs: smoking -> cancer; and a gene with arrows to both smoking and cancer]

In the first (two-node) graph, the fact that smoking causes cancer is represented. In the second graph, a gene which causes both smoking and cancer is represented. If such a common cause existed but was not represented in our Bayesian network, then our inferences from the network would likely be inaccurate. In cases where we are unsure whether there is an unobserved common cause, we have two options:

- Fisher suggested that controlled (randomized) experiments would help to uncover unobserved common causes. However, sometimes such experiments are impossible due to ethical constraints or because the data is uncontrollable.
- Or, we could assume there are no unobserved common causes (as long as inferences appear accurate).
Question: Are there ways, other than controlled experiments, to determine

if there might be common causes? (see work from CMU and by Pearl)

2. Which way do the causal relationships go? That is, on the assumption that there are no unobservable common causes, how do we determine what is a cause and what is an effect? For example, given our training data, can we distinguish between the following two graphs:

[Graph A and Graph B: two networks over the variables sick, loses, and dry that differ in the direction of their arrows]

Notation: Let A ⊥ B denote that A is independent of B.

Yes, we can tell the difference using the independence relation. In Graph A we find that (sick ⊥ dry), but in Graph B we find only that (sick ⊥ dry | loses). So in Graph A, if we alter P(sick) then dry wouldn't change, but in Graph B, if we change P(sick), then dry would change. Essentially, we want to determine whether or not P(sick|dry) = P(sick) holds. To do so, we could look at our data and determine whether the percentage of dry trees that are sick is the same as the percentage of non-dry trees that are sick.
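The frequency comparison suggested above can be sketched as follows. The sample records and the tolerance threshold are invented for illustration:

```python
# A sketch of the independence test described in the text: compare the
# fraction of dry trees that are sick with the fraction of non-dry trees
# that are sick. If the two rates are (approximately) equal, sick and dry
# look marginally independent; a large gap suggests dependence.

def looks_independent(records, tol=0.05):
    """records: list of (sick, dry) boolean pairs; tol is an assumed threshold."""
    sick_when_dry = [s for s, d in records if d]
    sick_when_not_dry = [s for s, d in records if not d]
    rate_dry = sum(sick_when_dry) / len(sick_when_dry)
    rate_not_dry = sum(sick_when_not_dry) / len(sick_when_not_dry)
    return abs(rate_dry - rate_not_dry) <= tol

# Invented data in which dryness tells us nothing about sickness:
same = ([(True, True)] * 2 + [(False, True)] * 8
        + [(True, False)] * 2 + [(False, False)] * 8)
print(looks_independent(same))  # True: 20% sick either way
```

A real test would replace the fixed tolerance with a proper statistical test of independence, but the comparison of conditional frequencies is the core idea.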

A Brief History of Causation

18th Century, Philosophy, Hume:
- Causality isn't real; we can't see it.
- We can't distinguish causation from mere correlation.

19th Century, Philosophy, Mill:
- Mill's methods for causal inference, which assume:
  - There are no unobserved causes.
  - An effect is caused by a conjunction of variables.
- The methods fail for disjunctions of causes.
- (1970: Winston reinvents Mill's methods)

20th Century, Philosophy, Positivism:
- Causation isn't real.
- Causation isn't scientific (not definable in first-order logic).
- Causation is defeasible (situation dependent).

Statistics, Pearson:
- No cause from correlation.

Fisher:
- Randomized experiments can determine causes.

1980s, Philosophy, Lewis-Stalnaker:
- Theory of counterfactuals and causation: the logic of "If I were to..."
- We can reason about things yet to happen.

Tetrad Group:
- Causal graphs.

Comp. Sci., Pearl:
- Bayesian graphs as compact representations of probability distributions.
- How to behave given causal information.
- Causality wins the 2001 Lakatos Award.