Machine Learning. Bayesian Learning. Michael M. Richter
1 Machine Learning Bayesian Learning
2 Topic This is concept learning the probabilistic way: everything that is stated is stated exactly, but it is not always true. The learned concept is therefore equipped with a probability of being correct.
3 History Bayesian Decision Theory came long before Version Spaces, Decision Tree Learning and Neural Networks. It was studied in the field of Statistical Theory and, more specifically, in the field of Pattern Recognition. Bayesian Decision Theory underlies important learning schemes such as the Naïve Bayes Classifier, Learning Bayesian Belief Networks and the EM Algorithm. Bayesian Decision Theory is also useful as it provides a framework within which many non-Bayesian classifiers can be studied (see [Mitchell, Sections 6.3-6.6]).
4 Why Bayesian Classification? Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems. Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data. Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities. Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
5 Maximum Likelihood Suppose a number of hypotheses have been generated and that for each one the probability of being the right one is calculated. The maximum likelihood principle says one should choose the hypothesis h with the highest probability. P(h|D) is the a posteriori probability of h (after seeing the data D), P(h) is the a priori probability of h, and P(D|h) is the likelihood of D under h.
6 Part 1 The Naïve Bayesian Approach
7 Basic Formulas for Probabilities Product Rule: probability P(A,B) of a conjunction of two events A and B: P(A,B) = P(A|B) P(B) = P(B|A) P(A). Sum Rule: probability of a disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) − P(A,B). Theorem of Total Probability: if events A1, ..., An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B|Ai) P(Ai).
8 A Basic Learning Scenario (1) Event Y = y: the observed example event; event Z = z: correctness of hypothesis z; D: data. Bayes' rule: P(Z = z | Y = y) = P(Z = z) P(Y = y | Z = z) / P(Y = y). In learning terms: P(h|D) = P(h) P(D|h) / P(D), where P(h|D) is the probability that h is a correct hypothesis for data D, P(h) is the probability that h is a correct hypothesis, P(D|h) is the probability that D is observed if h is correct, and P(D) is the probability that D is observed.
9 A Basic Learning Scenario (2) Notation: P(h) is the a priori probability of h, P(D|h) is the likelihood of D under h, and P(h|D) is the a posteriori probability of h given D. The basic theorem (Bayes' rule): P(h|D) = P(h) P(D|h) / P(D). This theorem makes applications possible because it reduces the unknown conditional probability to quantities that are known a priori.
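A minimal sketch of how Bayes' rule turns priors and likelihoods into posteriors; the hypothesis names, priors and likelihoods below are invented purely for illustration.

```python
# Minimal sketch of Bayes' rule: P(h|D) = P(h) P(D|h) / P(D).
# The hypotheses, priors, and likelihoods are invented illustration values.

def posteriors(prior, likelihood):
    """Return P(h | D) for every hypothesis h, given P(h) and P(D | h)."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)   # P(D)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

prior = {"h1": 0.3, "h2": 0.7}        # P(h), before seeing the data
likelihood = {"h1": 0.8, "h2": 0.1}   # P(D | h), probability of the data under h

print(posteriors(prior, likelihood))  # {'h1': 0.774..., 'h2': 0.225...}
```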
10 A Basic Learning Scenario (3) The learner has hypotheses h1, ..., hk and uses observed data D. Wanted: some h ∈ {h1, ..., hk} for which P(h|D) is maximal (the maximum a posteriori hypothesis). A posteriori means: after seeing the data. Background knowledge: the a priori probability P(h) of h. A priori means: before seeing the data.
11 Bayesian Classification and Decision (1) The Bayes decision rule selects the class with minimum conditional risk. In the case of minimum-error-rate classification, the rule selects the class with the maximum posterior probability. Suppose there are k classes c1, c2, ..., ck. Given a feature vector x, the minimum-error-rate rule assigns it to the class cj if P(cj|x) > P(ci|x) for all i ≠ j.
12 Bayesian Classification and Decision (2) An equivalent but more useful criterion for minimum-error-rate classification is: choose class cj so that P(x|cj)P(cj) > P(x|ci)P(ci) for all i ≠ j. This relies on Bayes' theorem. Note: no method can exist that finds the correct hypothesis with higher probability. But: that can change if one has additional knowledge.
13 Example Assume: (1) A lab test D for a form of cancer has a 98% chance of giving a positive result if the cancer is present, and a 97% chance of giving a negative result if the cancer is absent. (2) 0.8% of the population has this cancer: P(cancer) = 0.008 and P(~cancer) = 0.992. What is the probability that the cancer is present given a positive result? P(cancer|D) = P(D|cancer)P(cancer) / P(D) = 0.98*0.008 / (0.98*0.008 + 0.03*0.992) = 0.21
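The same computation as a short script (the numbers are taken from the slide; the variable names are just for illustration):

```python
# Cancer example: P(cancer | positive test) via Bayes' rule.
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer = 0.98       # sensitivity: positive result if cancer present
p_pos_given_no_cancer = 0.03    # false positive rate: 1 - 0.97

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer
print(round(p_pos_given_cancer * p_cancer / p_pos, 2))   # 0.21
```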
14 MAP and ML Given some data D and a hypothesis space H, what is the most probable hypothesis h ∈ H, i.e., for which P(h|D) is maximal? This hypothesis is called the maximum a posteriori hypothesis h_MAP: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)P(h). Again: h_MAP is optimal in the sense that no method can exist that finds the correct hypothesis with higher probability. If P(h) = P(h') for all h, h' ∈ H, then this reduces to the maximum likelihood principle: h_ML = argmax_{h ∈ H} P(D|h).
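A small sketch of the difference between h_MAP and h_ML; the three hypotheses and their probabilities are made up so that the two choices differ.

```python
# MAP vs. ML hypothesis selection on invented values.
prior      = {"h1": 0.6, "h2": 0.3, "h3": 0.1}   # P(h)
likelihood = {"h1": 0.1, "h2": 0.5, "h3": 0.9}   # P(D | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])   # argmax P(D|h) P(h)
h_ml  = max(prior, key=lambda h: likelihood[h])              # argmax P(D|h)
print(h_map, h_ml)   # h2 h3 -- they differ because the prior is not uniform
```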
15 The Gibbs Classifier The Bayes optimal classifier is optimal but expensive; it uses all hypotheses in H. Non-optimal but much more efficient is the Gibbs classifier. Algorithm: Given: a sample S = {x1, ..., xm} of data D, a hypothesis space H with a probability distribution P, and some x to be classified. Method: 1. Select h ∈ H randomly according to P (this is similar to GA!). 2. Output: h(x). Surprisingly: E(errorGibbs) ≤ 2 E(errorBayesOptimal).
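A sketch of the Gibbs classifier under the assumption that the posterior P(h|D) is already available; the hypotheses are toy functions invented for illustration.

```python
import random

# Gibbs classifier sketch: draw a single hypothesis according to P(h | D)
# and classify with it, instead of averaging over all of H as Bayes optimal does.
hypotheses = {
    "h1": lambda x: +1,                   # always predicts +1
    "h2": lambda x: -1,                   # always predicts -1
    "h3": lambda x: +1 if x > 0 else -1,  # sign of x
}
posterior = {"h1": 0.2, "h2": 0.1, "h3": 0.7}   # P(h | D), assumed given

def gibbs_classify(x):
    names = list(hypotheses)
    h = random.choices(names, weights=[posterior[n] for n in names])[0]
    return hypotheses[h](x)

print(gibbs_classify(2.5))   # usually +1, occasionally -1
```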
16 The Naïve Bayesian Algorithm (1) Learning scenario: examples x1, ..., xm with xi = (ai1, ..., ain) for attributes A1, ..., An; hypotheses H = {h1, ..., hk} for classes; the class of x is C(x). Two ways to proceed: 1) use Bayes optimal classification; 2) do not access H for classification. Method 2) avoids surveying all hypotheses in H, which is often very difficult and impractical.
17 Estimation of Probabilities from Samples Attributes X1, X2, ..., XN and class C. Two classes: -1, +1; N boolean attributes. How do we estimate P(C)? E.g. by simple binomial estimation: count the number of instances with C = -1 and with C = +1. How do we estimate P(X1, ..., XN | C)? Count to estimate P(X1, ..., XN | C = +1) and P(X1, ..., XN | C = -1). A very complex task!
18 Conditional Independence Conditional independence is supposed to simplify the estimation task. Def.: (i) Y is independent of Z if for all y ∈ Y, z ∈ Z: P(Y = y, Z = z) = P(Y = y) P(Z = z). (ii) X is conditionally independent of Y given Z if P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z). Another formulation: P(X = x | Y = y, Z = z) = P(X = x | Z = z). This reduces the complexity for n variables from O(2^n) in the product space to O(n)!
19 The Naïve Bayesian Algorithm (2) Given x = (a1, ..., an), the (conditional) independence assumption says: P(a1, ..., an | h) = P(a1|h) P(a2|h) ... P(an|h). The assumption is called naive. This reduces the parameter estimation from the product space (which is O(2^n)) to the sum of the attribute spaces (which is O(n)). However, it is not always satisfied (e.g. thunder is not independent of rain). The goal is now to avoid needing knowledge of P(h) for all h ∈ H.
20 The Naïve Bayesian Algorithm (3) Therefore we proceed: h_MAP = the h ∈ {h1, ..., hk} for which P(C(x) = h | x = (a1, ..., an)) is maximal. Equivalent: the h for which P(x = (a1, ..., an) | C(x) = h) P(C(x) = h) is maximal. The probabilities on the right-hand side are estimated from a given set S of examples. Without the independence assumption this would be impracticable because S would need to be too large.
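A minimal naive Bayes sketch on an invented toy sample: P(h) and each P(aj|h) are estimated by counting, and classification picks the class maximizing P(h) times the product of the P(aj|h). (No smoothing of zero counts, to keep it short.)

```python
from collections import Counter, defaultdict

# Toy training sample S: (attribute tuple, class); values invented for illustration.
S = [
    (("sunny", "hot"),  "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

class_counts = Counter(c for _, c in S)
attr_counts = defaultdict(Counter)          # attr_counts[(position, class)][value]
for attrs, c in S:
    for j, a in enumerate(attrs):
        attr_counts[(j, c)][a] += 1

def classify(x):
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(S)                                 # estimate of P(h)
        for j, a in enumerate(x):
            score *= attr_counts[(j, c)][a] / n_c            # estimate of P(a_j | h)
        if score > best_score:
            best, best_score = c, score
    return best

print(classify(("sunny", "mild")))   # 'no' on this toy sample
```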
21 Example A naive Bayes classifier adopts the assumption of conditional independence. Given: P(pneumonia) = 0.01, P(flu) = 0.05, P(cough|pneumonia) = 0.9, P(fever|pneumonia) = 0.9, P(chest-pain|pneumonia) = 0.8, P(cough|flu) = 0.5, P(fever|flu) = 0.9, P(chest-pain|flu) = 0.1. Suppose a patient had cough, fever, but no chest pain. What is the probability ratio between pneumonia and flu? What is the best diagnosis? Solution: Probability ratio = (0.01 * 0.9 * 0.9 * (1-0.8)) / (0.05 * 0.5 * 0.9 * (1-0.1)) = 0.08. So flu is at least ten times more likely than pneumonia.
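The same ratio as a short computation, using only the values given on the slide:

```python
# Probability ratio pneumonia : flu under the naive independence assumption;
# "no chest pain" contributes (1 - P(chest-pain | class)).
p_pneumonia = 0.01 * 0.9 * 0.9 * (1 - 0.8)
p_flu       = 0.05 * 0.5 * 0.9 * (1 - 0.1)
print(round(p_pneumonia / p_flu, 2))   # 0.08 -> flu is roughly 12 times more likely
```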
22 Discussion (1) Advantages: Tends to work well despite the strong assumption of conditional independence. Experiments show it to be quite competitive with other classification methods on standard UCI datasets. Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases. Able to learn conjunctive concepts in any case.
23 Discussion (2) Disadvantages: Does not perform any search of the hypothesis space; it directly constructs a hypothesis from parameter estimates that are easily calculated from the training data. Strong bias. Does not guarantee consistency with the training data. Typically handles noise well, since it does not even focus on completely fitting the training data.
24 Part 2 Belief Networks
25 Bayesian Belief Networks (1) Discussing the independence assumption: Positive: it makes computation feasible. Negative: it is often not satisfied. Reason: there are causal or influential relations between the attributes. Such relations are background knowledge. Idea: make them visible in a graph. Conditional independence is then assumed only between subsets of variables. Belief networks combine both.
26 Bayesian Belief Networks (2) A Bayesian belief net (BBN) is a directed acyclic graph, together with an associated set of probability tables. The nodes represent variables, which can be discrete or continuous. The edges represent causal/influential relationships between variables. Nodes not connected by edges are independent.
27 Causality (1) Although Bayesian networks are often used to represent causal relationships, this need not be the case: a directed edge from u to v does not require that X_v be causally dependent on X_u. Example: the graphs A → B → C and C → B → A are equivalent, that is, they impose exactly the same conditional independence requirements.
28 Causality (2) A causal network is a Bayesian network with an explicit requirement that the relationships be causal. The additional semantics of causal networks specify that if a node X is actively caused to be in a given state x (an action written as do(X=x)), then the probability density function changes to that of the network obtained by cutting the links from X's parents to X and setting X to the caused value x. Using these semantics, one can predict the impact of external interventions from data obtained prior to intervention.
29 Influence Diagrams The network can represent influence diagrams. Such diagrams are used to represent decision models. Therefore they are a method to support decision making.
30 Example (1) [Network with nodes Temperature, Winds, Cloudiness, Rain, Umbrella.] Temperature: cold, mild, hot. Cloudiness: none, partial, covered. Winds: no, mild, strong. Each node has a conditional probability table.
31 Example (2) [Network with nodes Storm, Lightning, Thunder, BusTourGroup, Campfire, ForestFire.] Associated with each node is a conditional probability table, which specifies the conditional distribution for the variable given its immediate parents in the graph. Each node is asserted to be conditionally independent of its non-descendants, given its immediate parents.
32 Inference in Bayesian Networks (1) In general: calculate conditional probabilities along the directed edges. This can be done in a forward or backward mode. Example, forward mode: suppose we have the edge A → B; then we get P(B) = P(B|A)P(A) + P(B|not A)P(not A) and P(not B) = P(not B|A)P(A) + P(not B|not A)P(not A).
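A tiny numeric check of the forward step, with made-up values for P(A) and P(B|A):

```python
# Forward inference along the edge A -> B (illustrative numbers).
p_a = 0.3
p_b_given_a, p_b_given_not_a = 0.9, 0.2

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_not_b = (1 - p_b_given_a) * p_a + (1 - p_b_given_not_a) * (1 - p_a)
print(round(p_b, 2), round(p_not_b, 2))   # 0.41 0.59 -- they sum to 1
```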
33 Inference in Bayesian Networks (2) Suppose we want to calculate P(AB|E). Using P(A,B) = P(A|B) P(B) we get: P(AB|E) = P(A|E) * P(B|AE) and P(AB|E) = P(B|E) * P(A|BE). Therefore: P(A|BE) = ( P(A|E) * P(B|AE) ) / P(B|E) (another version of Bayes' theorem).
34 Example (1) [Network with nodes Age, Income, House Owner, Living Location, Newspaper Preference, Voting Pattern.] How likely are elderly rich people to buy the Sun? P(paper = Sun | Age > 60, Income > 60k)
35 Example (2) [Same network: Age, Income, House Owner, Living Location, Newspaper Preference, Voting Pattern.] How likely are elderly rich people who voted liberal to buy the Herald? P(paper = Herald | Age > 60, Income > 60k, Voting = liberal)
36 Unobserved Variables Bayesian networks can be used to answer probabilistic queries about unobserved variables. They can be used to find out updated knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. A Bayesian network can thus be considered a mechanism for automatically applying Bayes' theorem to complex problems.
37 Inference in Bayesian Networks (3) In the network we can chain over several edges. Find the probability of H given that A1, A2, A3 and E have happened: P(H|A1A2A3E) = ( P(H|E) * P(A1A2A3|HE) ) / P(A1A2A3|E), because P(A1A2A3|E) = P(A1|A2A3E) * P(A2A3|E) = P(A1|A2A3E) * P(A2|A3E) * P(A3|E). With independence this simplifies; e.g. we get: P(H|A1A2E) = ( P(H|E) * P(A1|HE) * P(A2|HE) ) / ( P(A1|E) * P(A2|E) ).
38 Recalculation (1) Consider the net with nodes A and B as parents of C. Given probabilities: P(A) = 0.1, P(~A) = 0.9; P(B) = 0.4, P(~B) = 0.6. Conditional probability table for C: P(C|AB) = 0.8, P(C|A~B) = 0.6, P(C|~AB) = 0.5, P(C|~A~B) = 0.5; correspondingly P(~C|AB) = 0.2, P(~C|A~B) = 0.4, P(~C|~AB) = 0.5, P(~C|~A~B) = 0.5.
39 Recalculation (2) Calculation of the probability of C: P(C) = P(C,A,B) + P(C,~A,B) + P(C,A,~B) + P(C,~A,~B) = P(C|AB) P(A,B) + P(C|~AB) P(~A,B) + P(C|A~B) P(A,~B) + P(C|~A~B) P(~A,~B) = P(C|AB) P(A) P(B) + P(C|~AB) P(~A) P(B) + P(C|A~B) P(A) P(~B) + P(C|~A~B) P(~A) P(~B) = 0.518. Recalculation of P(A) and P(B) if we know that C is true, using Bayes' rule: P(B|C) = ( P(C|B) * P(B) ) / P(C) = ( ( P(C|AB) * P(A) + P(C|~AB) * P(~A) ) * P(B) ) / P(C) = ( (0.8 * 0.1 + 0.5 * 0.9) * 0.4 ) / 0.518 = 0.409. P(A|C) = ( P(C|A) * P(A) ) / P(C) = ( ( P(C|AB) * P(B) + P(C|A~B) * P(~B) ) * P(A) ) / P(C) = ( (0.8 * 0.4 + 0.6 * 0.6) * 0.1 ) / 0.518 = 0.131.
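The recalculation reproduced as a short script, using the same numbers as above:

```python
# Recalculation example: A and B are independent parents of C.
p_a, p_b = 0.1, 0.4
p_c = {  # P(C | A, B) from the table
    (True, True): 0.8, (True, False): 0.6,
    (False, True): 0.5, (False, False): 0.5,
}

# Marginal P(C), summing over the parents
p_C = sum(p_c[(a, b)]
          * (p_a if a else 1 - p_a)
          * (p_b if b else 1 - p_b)
          for a in (True, False) for b in (True, False))
print(round(p_C, 3))   # 0.518

# Posteriors of the parents once C is observed to be true (Bayes' rule)
p_c_given_b = p_c[(True, True)] * p_a + p_c[(False, True)] * (1 - p_a)   # P(C | B)
p_c_given_a = p_c[(True, True)] * p_b + p_c[(True, False)] * (1 - p_b)   # P(C | A)
print(round(p_c_given_b * p_b / p_C, 3))   # P(B | C) = 0.409
print(round(p_c_given_a * p_a / p_C, 3))   # P(A | C) = 0.131
```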
40 Complete and Incomplete Information 1. The network structure is given in advance and all the variables are fully observable in the training examples. ==> Trivial Case: just estimate the conditional probabilities. 2. The network structure is given in advance but only some of the variables are observable in the training data. ==> Similar to learning the weights for the hidden units of a Neural Net: Gradient Ascent Procedure 3. The network structure is not known in advance. ==> Use a heuristic search or constraint-based technique to search through potential structures.
41 Parameter Learning In order to fully specify the Bayesian network and thus fully represent the joint probability distribution, it is necessary to specify for each node X the probability distribution for X conditional upon X's parents. The distribution of X conditional upon its parents may have any form.
42 Expectation Maximization: Unobservable Relevant Variables Example: assume that data points have been generated uniformly from k distinct Gaussians with the same known variance. Problem: find a hypothesis h = <μ1, μ2, ..., μk> that describes the means of the k distributions. In particular, we are looking for a maximum likelihood hypothesis for these means. We extend the problem description as follows: for each point xi there are k hidden variables zi1, ..., zik such that zil = 1 if xi was generated by the l-th normal distribution and ziq = 0 for all q ≠ l.
43 EM Algorithm Initially: an arbitrary initial hypothesis h = <μ1, μ2, ..., μk> is chosen. The EM algorithm iterates two steps: Step 1 (Estimation, E): calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = <μ1, μ2, ..., μk> holds. Step 2 (Maximization, M): calculate a new maximum likelihood hypothesis h' = <μ'1, μ'2, ..., μ'k>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in step 1. Then replace the hypothesis h by the new hypothesis h' and iterate.
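A sketch of this EM loop for k Gaussians with equal known variance and (implicitly) equal mixing weights; the data, the number of iterations and the initialization are invented for illustration.

```python
import math
import random

def em_means(xs, k, sigma=1.0, iters=50):
    """Estimate the k Gaussian means by EM, assuming a common known variance."""
    mus = random.sample(xs, k)                  # arbitrary initial hypothesis
    for _ in range(iters):
        # E-step: expected value E[z_ij] of each hidden variable given current means
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: new maximum likelihood means, weighting each x_i by E[z_ij]
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(k)]
    return mus

data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
print(sorted(em_means(data, 2)))   # roughly [0, 5]
```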
44 Problems and Limitations (1) A computational problem is exploring a previously unknown network. To calculate the probability of any branch of the network, all branches must be calculated. This process of network discovery is an NP-hard task which might either be too costly to perform, or impossible given the number and combination of variables.
45 Problems and Limitations (2) The network relies on the quality and coverage of the prior beliefs (which are knowledge!) used in the inference processing. The network is only as useful as this background knowledge is reliable. An either too optimistic or too pessimistic expectation of the quality of these prior beliefs will invalidate the results. Related to this is the selection of the statistical distribution used in modeling the data. Selecting the proper distribution model to describe the data has a notable effect on the quality of the resulting network.
46 Dependency Networks These are a generalization of and an alternative to Bayesian networks. A dependency network also has a graph component and a probability component, but the graph can be cyclic. The probability component is as in a Bayesian network.
47 Loops If belief propagation (BP) is used on graphs with loops, messages may circulate indefinitely. Empirically, a good approximation is still achievable: stop after a fixed number of iterations, or stop when there is no significant change in beliefs. If the solution is not oscillatory but converges, it usually is a good approximation.
48 Applications Bayesian learning is a standard method in many application areas such as medicine (classification, prediction), image retrieval and pattern recognition, and quality control for materials. Some competitors are, e.g., support vector machines and clustering methods.
49 Tools Hugin: implements the propagation algorithm of Lauritzen and Spiegelhalter. A more modern and powerful BBN tool is AgenaRisk; with it, fast propagation is possible in large BBNs (with hundreds of nodes and millions of state combinations). Other tools: GeNIe, the WinMine Toolkit, Weka, Matlab.
50 Summary Bayes' theorem; Bayesian decision; maximum a posteriori and maximum likelihood; the naïve Bayesian method and conditional independence; the Gibbs classifier; belief nets, inference in nets, and belief revision; estimating unknown parameters: the EM algorithm; limitations.
51 Some References (1) Bernardo, J. M. and Smith, A. F. M. (1994): Bayesian Theory. New York: John Wiley. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995): Bayesian Data Analysis. London: Chapman & Hall. Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann. David W. Aha: Machine Learning tools. home.earthlink.net/~dwaha/research/machinelearning.html
52 Some References (2) Heckerman, David: Tutorial on Learning with Bayesian Networks. In: Jordan, Michael Irwin (ed.), Learning in Graphical Models, Adaptive Computation and Machine Learning, MIT Press, 1998. Borgelt, Christian; Kruse, Rudolf (March 2002): Graphical Models for Data Analysis and Mining. Chichester. D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, C. Kadie: Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Journal of Machine Learning Research, Vol. 1, 2000.