Applied Logic, Lecture 4 part 2: Bayesian inductive reasoning. Marcin Szczuka, Institute of Informatics, The University of Warsaw
1 Applied Logic, Lecture 4 part 2: Bayesian inductive reasoning

Marcin Szczuka (MIMUW), Institute of Informatics, The University of Warsaw
Monographic lecture, Spring semester 2017/2018 (34 slides)
2 Those illiterate in general probability theory still keep asking why, above all, Trurl probabilized the dragon rather than an elf or a dwarf. They do so out of ignorance, for they do not know that a dragon is simply more probable than a dwarf...

Stanisław Lem, The Cyberiad, "Fable three, or dragons of probability"
3 Lecture plan

1. Introduction
2. Bayesian reasoning
3. Bayesian prediction and decision support
   - Classification problems
   - Selecting hypotheses: MAP and ML
   - Bayesian Optimal Classifier
   - Naïve Bayes classifier
4. Hypothesis selection: general issues
4 Measure of truth/possibility

Recall that from an inductive (quasi-)formal system that we dare to call an inductive logic we expect a measure of support. This measure gives us the level of influence of the truthfulness of premises on the truthfulness of conclusions. We require:
1. Fulfillment of the Criterion of Adequacy (CoA).
2. Assurance that the degree of confidence in the inferred conclusion is no greater than the confidence in the premises and inference rules.
3. The ability to clearly discern proper conclusions (hypotheses) from nonsensical ones.
4. An intuitive interpretation.
5 Probabilistic inference

From the earliest days researchers tried to match the inductive reasoning paradigm with probability and/or statistics. Over time probability-based reasoning, in particular Bayesian reasoning, has established itself as a central focal point for philosophers and logicians working on the formalisation of inductive systems (inductive logics).

Elements of probabilistic reasoning can be found in the works of Pascal, Fermat, and others. A modern, formal approach to inductive logic based on the notions of similarity and probability was proposed by John Maynard Keynes in A Treatise on Probability (1921). Rudolf Carnap developed these ideas further in his Logical Foundations of Probability (1950) and other works, which are now considered a cornerstone of probabilistic logic. After the mathematical theory of probability was axiomatised by Kolmogorov, probabilistic reasoning gained more traction as a proper, formal theory.
6 Probabilistic inductive logic

In the case of inductive logics, in particular those based on probability, there is very little point in considering the strict formal consequence relation ⊢ and its relationship with the entailment relation ⊨. Instead of exact logical consequence we usually consider a support (probability) mapping.

Support mapping (function)
A function P : L × L → [0, 1], where L is a set of statements (a language) and P(A|B) is read as the support of A given B, is called a support function if for statements A, B, C in L the following holds:
1. There exists at least one pair of statements D, E ∈ L for which P(D|E) < 1.
2. If B ⊨ A, then P(A|B) = 1.
3. If ⊨ (B ≡ C), then P(A|B) = P(A|C).
4. If C ⊨ ¬(A ∧ B), then either P((A ∨ B)|C) = P(A|C) + P(B|C), or for every D ∈ L, P(D|C) = 1.
5. P((A ∧ B)|C) = P(A|(B ∧ C)) · P(B|C).
7 Probabilistic inductive logic

It is easy to see that the conditions for the support function P are a re-formulation of the axioms for a probability measure. In the definition of P the operator | corresponds to logical entailment, i.e., the basic step in reasoning. It is also easy to see that the mapping P is not uniquely defined.

The conditions for P are essentially the same as for (unconditional) probability. It suffices to set P(A) = P(A|(D ∨ ¬D)) for some sentence (event) D. However, these conditions also allow for establishing the value P(A|C) when the probability of the event C is 0 (P(C) = P(C|(D ∨ ¬D)) = 0).

Condition 1 (non-triviality) in the definition of P can also be expressed as: there exists A ∈ L such that P((A ∧ ¬A)|(A ∨ ¬A)) < 1.
9 Probability

At this point we need to introduce the (simplified) axioms for a probability measure that we will use further on. In order to clearly discern from the previous notation, we will write Pr for the probability measure.

Axioms for discrete probability (Kolmogorov)
1. For each event A ⊆ Ω the value Pr(A) ∈ [0, 1].
2. Unit measure: Pr(Ω) = 1.
3. Additivity: if A_1, ..., A_n are mutually exclusive events, then Pr(A_1 ∪ ... ∪ A_n) = Σ_{i=1}^{n} Pr(A_i). In particular, if also Σ_{i=1}^{n} Pr(A_i) = 1, then for any event B: Pr(B) = Σ_{i=1}^{n} Pr(B|A_i) · Pr(A_i).

Axiom 2 (unit measure) may be a source of some concern for us.
10 Properties of probability

Pr(A ∧ B) = Pr(B) · Pr(A|B) = Pr(A) · Pr(B|A)
Pr(A ∨ B) = Pr(A) + Pr(B) - Pr(A ∧ B)
Pr(A|B) - the (conditional) probability of A given B:
Pr(A|B) = Pr(A ∧ B) / Pr(B)

Bayes rule
Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)
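A small numeric check of the total-probability and Bayes formulas above. This is not from the lecture: the event names and numbers (a rare condition A and a test result B) are made up purely for illustration; exact rational arithmetic makes the result easy to verify by hand.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen only to illustrate the formulas.
pr_A = F(1, 100)             # prior Pr(A)
pr_B_given_A = F(9, 10)      # Pr(B | A)
pr_B_given_notA = F(5, 100)  # Pr(B | not A)

# Total probability: Pr(B) = Pr(B|A)*Pr(A) + Pr(B|not A)*Pr(not A)
pr_B = pr_B_given_A * pr_A + pr_B_given_notA * (1 - pr_A)

# Bayes rule: Pr(A|B) = Pr(B|A)*Pr(A) / Pr(B)
pr_A_given_B = pr_B_given_A * pr_A / pr_B
print(pr_A_given_B)  # 2/13, i.e. about 0.154
```

Note how the posterior stays small despite the high value of Pr(B|A): the low prior Pr(A) dominates, which is exactly the interplay the Bayes rule captures.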
11 Bayesian inference

For reasons that will become clear in the next part of the lecture, we will use the following notation:
- T ⊆ X - the set of premises (evidence set), coming from a (huge) universe X.
- h ∈ H - a conclusion (hypothesis), coming from some (huge) set of hypotheses H.
- VS_{H,T} - the version space, i.e., the subset of H containing the hypotheses that are consistent with T.

Inference rule (Bayes')
For a hypothesis h ∈ H and an evidence set T ⊆ X:
Pr(h|T) = Pr(T|h) · Pr(h) / Pr(T)

The probability (level of support) of the conclusion (hypothesis) h is established on the basis of the support of the premises (evidence) and the degree to which the hypothesis justifies the existence of the evidence (premises).
12 Remarks

- Pr(h|T) - the a posteriori (posterior) probability of hypothesis h given the premises (evidence data) T. That is what we are looking for.
- Pr(T) - the probability of the premises (evidence data) T. Fortunately, we do not have to know it if we are only interested in comparing the posterior probabilities of hypotheses. If, for some reason, we need to calculate it directly, then we may have a problem.
- We need to calculate Pr(h) and Pr(T|h). For the moment we assume that we can do that and that H is known.
- Pr(T|h) determines the degree to which h justifies the appearance (truthfulness) of the premises in T.
15 Decision support tasks

The real usefulness of the Bayesian approach is visible in its practical applications. The most popular of these is decision support (classification). Decision support (classification) is an example of using inductive inference methods such as prediction, argument by analogy, and eliminative induction.

We are going to construct Bayesian classifiers, i.e., algorithms (procedures) that learn the probability of a decision value (classification) for new cases on the basis of cases observed previously (a training sample). By restricting the reasoning task to prediction of a decision value we can produce a computationally viable, automated tool.
16 Classifiers - basic notions

The domain (space, universe) is a set X from which we draw examples. An element x ∈ X is addressed as an example (instance, case, record, entity, vector, object, row).

An attribute (feature, variable, measurement) is a function a : X → A. The set A is called the attribute value set or attribute domain. We assume that each example x ∈ X is completely represented by the vector ⟨a_1(x), ..., a_n(x)⟩, where a_i : X → A_i for i = 1, ..., n. The number n is sometimes called the size (length) of the example.

For our purposes we usually distinguish a special decision attribute (decision, class), traditionally denoted dec or d.
17 Tabular data

Outlook   Temp  Humid   Wind   EnjoySpt
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   no
overcast  hot   high    FALSE  yes
rainy     mild  high    FALSE  yes
rainy     cool  normal  FALSE  yes
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  no
...       ...   ...     ...    ...
rainy     mild  high    TRUE   no
18 Classifier

- The training set (training sample) T ⊆ X corresponds to the set of premises.
- T_d - the subset of training data with decision d, which corresponds to the set of premises supporting a particular hypothesis.
- T_{d,a_i=v} - the subset of training data with attribute a_i equal to v and decision d. This corresponds to the set of premises of a particular type supporting a particular hypothesis.
- The hypothesis space H is now limited to the set of possible decision values, i.e., conditions (dec = d), where d ∈ V_dec.

Classification task
Given a training sample T, determine the best (most probable) value of dec(x) for a previously unseen case x ∈ X (x ∉ T).

Question: how to choose the best value of the decision?
20 Hypothesis selection - MAP

In Bayesian classification we want to find the most probable decision value for a new example x given the collection of previously seen (training) examples and the attribute values for x. So, using the Bayes formula, we need to find a hypothesis h (decision value) that maximises support (empirical probability).

MAP - Maximum A Posteriori hypothesis
Given a training set T, we attempt to classify an example x ∈ X using the hypothesis h_MAP ∈ H by assigning to x the decision value given by:
h_MAP = argmax_{h∈H} Pr(h|T) = argmax_{h∈H} Pr(T|h) · Pr(h)

In MAP we choose the hypothesis that is the most probable.
21 Hypothesis selection - ML

ML - Maximum Likelihood hypothesis
Given a training set T, we attempt to classify an example x ∈ X using the hypothesis h_ML ∈ H by assigning to x the decision value given by:
h_ML = argmax_{h∈H} Pr(T|h)

In the ML approach we choose the hypothesis that best explains (makes most likely) the existence of our training sample. Note that the hypothesis h_ML may itself have low probability, yet be very well adjusted to our particular data.
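To make the MAP/ML distinction concrete, here is a minimal sketch. The hypothesis names, priors Pr(h), and likelihoods Pr(T|h) are entirely made up for illustration; the point is only that the two rules can pick different hypotheses when a likely-looking hypothesis has a small prior.

```python
# Made-up priors and likelihoods for three hypothetical hypotheses.
prior = {'h1': 0.6, 'h2': 0.3, 'h3': 0.1}        # Pr(h)
likelihood = {'h1': 0.1, 'h2': 0.3, 'h3': 0.8}   # Pr(T | h)

# MAP maximises Pr(T|h) * Pr(h); ML maximises Pr(T|h) alone.
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
h_ml = max(prior, key=lambda h: likelihood[h])

print(h_map)  # 'h2': 0.3*0.3 = 0.09 beats 0.08 and 0.06
print(h_ml)   # 'h3': the best fit to the data, despite its low prior
```

Here h3 explains the data best (ML picks it), but its prior is so low that h2 wins once the prior is taken into account (MAP picks it).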
22 Discussion of ML and MAP

- Both methods require knowledge of Pr(T|h). In the case of MAP we also need Pr(h) to be able to use the Bayes formula.
- MAP is quite natural, but has major drawbacks. In particular, it promotes the dominating decision value.
- Both methods assume that the training set is error-free and that the hypothesis we look for is in H.
- ML is close to the intuitive understanding of inductive learning. In the process of selecting a hypothesis we go for the one that gives the best reason for the existence of the particular training set we have.
- The MAP rule selects the most probable hypothesis, while we are rather interested in selecting the most probable decision value for an example. Consider V_dec = {0, 1} and H = {h_MAP, h_1, ..., h_m} with h_i(x) = 0 for all 1 ≤ i ≤ m, h_MAP(x) = 1, and Pr(h_MAP|T) ≤ Σ_{i=1}^{m} Pr(h_i|T). Then the decision value 0 is collectively more probable for x, although MAP returns 1.
23 Finding probabilities

Pr(h) - the easier part. We may either be given a probability (by the learning method) or treat all hypotheses equally. In the latter case:
Pr(h) = 1/|H|
The problem is the size of H. It may be a HUGE space. Also, in reality, we may not even know the whole H.

Pr(T|h) - the harder part. Notice that we are in fact only interested in decision making. We want to know the probability that a sample T will be consistent (will have the same decision) with hypothesis h. This yields:
Pr(T|h) = 1 if h ∈ VS_{H,T}, and 0 if h ∉ VS_{H,T}
Unfortunately, the problem with the size of H is still present.
24 ML and MAP in practice

MAP and/or ML, despite serious practical limitations, can still be used in some special cases, given that:
- The hypothesis space is very restricted (and reasonably small).
- We use MAP and/or ML to score a few competing hypotheses constructed by other means. This relates to the topics of stacking, coupled classifiers, and layered learning.
26 Bayesian Optimal Classifier

The Bayesian Optimal Classifier (BOC) always returns the most probable decision value for an example. In this respect it cannot be beaten by any other algorithm in terms of true (global) error. Sadly, the BOC isn't very useful from a practical point of view, since it uses the entire hypothesis space.

Let c(·) be the desired decision (target concept) and T the training sample. Then:
h_BOC(x) = argmax_{d∈V_dec} Pr(c(x) = d | T)
where:
Pr(c(x) = d | T) = Σ_{h∈H} Pr(c(x) = d | h) · Pr(h|T)
Pr(c(x) = d | h) = 1 if h(x) = d, and 0 if h(x) ≠ d

Note that the hypothesis returned by BOC may not belong to H.
28 Naïve Bayes classifier

Let x* be a new example that we need to classify. We should select a hypothesis h such that:
h(x*) = argmax_{d∈V_dec} Pr(c(x) = d | a_1(x) = a_1(x*), ..., a_n(x) = a_n(x*))
Hence, from the Bayes formula:
argmax_{d∈V_dec} Pr(c(x) = d) · Pr(⋀_{i=1}^{n} a_i(x) = a_i(x*) | c(x) = d)
If we (naïvely) assume that the attributes are independent as random variables, then:
argmax_{d∈V_dec} Pr(c(x) = d) · ∏_{i=1}^{n} Pr(a_i(x) = a_i(x*) | c(x) = d)
All that is left to do is to estimate Pr(c(x) = d) and Pr(a_i(x) = v | c(x) = d) from data.
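The product formula above can be sketched directly on the eight complete rows of the weather table from slide 17. This is a bare frequency-counting version (no smoothing), written only to illustrate the formula; the function name and the chosen test example are not from the lecture.

```python
from collections import Counter

# Training sample: the eight complete rows of the weather table (slide 17).
data = [
    ('sunny',    'hot',  'high',   'FALSE', 'no'),
    ('sunny',    'hot',  'high',   'TRUE',  'no'),
    ('overcast', 'hot',  'high',   'FALSE', 'yes'),
    ('rainy',    'mild', 'high',   'FALSE', 'yes'),
    ('rainy',    'cool', 'normal', 'FALSE', 'yes'),
    ('rainy',    'cool', 'normal', 'TRUE',  'no'),
    ('overcast', 'cool', 'normal', 'TRUE',  'yes'),
    ('sunny',    'mild', 'high',   'FALSE', 'no'),
]

def naive_bayes(x_new):
    """Return argmax_d Pr(c=d) * prod_i Pr(a_i = v_i | c=d), by raw counts."""
    class_counts = Counter(row[-1] for row in data)
    best_d, best_score = None, -1.0
    for d, count_d in class_counts.items():
        score = count_d / len(data)          # Pr(c(x) = d)
        rows_d = [row for row in data if row[-1] == d]
        for i, v in enumerate(x_new):
            # Pr(a_i(x) = v | c(x) = d) by relative frequency within class d
            score *= sum(1 for row in rows_d if row[i] == v) / len(rows_d)
        if score > best_score:
            best_d, best_score = d, score
    return best_d

print(naive_bayes(('sunny', 'cool', 'high', 'TRUE')))  # 'no'
```

Note that 'sunny' never occurs among the 'yes' rows, so the whole product for class 'yes' collapses to 0. This fragility of raw counts is exactly what the m-estimate on the next slide is designed to mitigate.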
29 NBC - technical details

Usually we employ an m-estimate:
Pr(a_i(x) = v | c(x) = d) = (|T_{d,a_i=v}| + m·p) / (|T_d| + m)
where m is an integer parameter and p is a prior estimate of the probability being estimated. Usually, if no background knowledge is given, we set m = |A_i| and p = 1/|A_i|, where A_i is the (finite) set of values of attribute a_i.

Complexity of NBC
For each example we have to modify the counts for the decision class and for particular attribute values. That is, in total, O(n·|T|) basic computational steps. This is the lowest reasonable estimate for any classification algorithm without prior knowledge, since the data has to be read at least once. Also, each step in NBC is fast and cheap, hence the method is computationally efficient.
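A minimal sketch of the m-estimate formula above, using the slide's counts |T_{d,a_i=v}| and |T_d| as plain integers (the function name and the numeric example are illustrative, not from the lecture):

```python
def m_estimate(count_dv, count_d, n_values, m=None):
    """Pr(a_i(x)=v | c(x)=d) ~ (|T_{d,a_i=v}| + m*p) / (|T_d| + m)."""
    if m is None:
        m = n_values          # default from the slide: m = |A_i|
    p = 1.0 / n_values        # uniform prior over the attribute values
    return (count_dv + m * p) / (count_d + m)

# Example: Outlook has 3 values; 'sunny' never occurs with decision 'yes'
# among 4 'yes' rows, yet the estimate stays strictly positive:
print(m_estimate(0, 4, 3))  # (0 + 3*(1/3)) / (4 + 3) = 1/7
```

With zero counts the estimate falls back to the prior p rather than to 0, so a single unseen attribute value can no longer zero out the whole Naïve Bayes product.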
31 Requirements for hypotheses

On a higher level of abstraction we can demand that a hypothesis not only be the best (most probable) explanation, but also the simplest one. This may be seen as a special application of lex parsimoniae (Occam's razor). We prefer the simplest explanation, i.e., the hypothesis that requires, according to William of Occam, the least amount of assumptions. In practice, lex parsimoniae is frequently replaced by the simpler Minimum Description Length (MDL) principle.

MDL - Minimum Description Length
MDL recommends the hypothesis that yields the simplest method of re-encoding the data, i.e., the hypothesis that gives the best compression. Choosing this particular hypothesis produces the shortest algorithm for reproducing the data. In classification this usually means the shortest hypothesis.
32 MDL in Bayesian classification

Bayesian classifiers are considered one of the best methods for producing MDL-compliant hypotheses. For the purposes of comparing description lengths below, we define the length as the negative binary logarithm of probability.

Taking the logarithm of the Bayes formula, we get:
log Pr(h|T) = log Pr(h) + log Pr(T|h) - log Pr(T)
Substituting L(.) for -log Pr(.) we obtain:
L(h|T) = L(h) + L(T|h) - L(T)
where L(h) and L(T|h) represent the length of the hypothesis h and the length of the data T (given h). In both cases we assume that the encoding is known and optimal.
33 MDL in Bayesian classification

Ultimately, we select the hypothesis that is best w.r.t. MDL:
h_MDL = argmin_{h∈H} L_{Enc_H}(h) + L_{Enc_D}(T|h)
Assuming that Enc_H and Enc_D are optimal encodings of, respectively, the hypotheses and the data, we get: h_MDL = h_MAP.

Intuitively, MDL helps to find the right balance between the quality and the simplicity of a hypothesis. The MDL principle is frequently used for scoring candidate hypotheses constructed by other means. It is also applicable to the task of simplifying existing hypotheses, for example in the filtering of decision rule sets and in decision tree pruning. It also provides an effective stop criterion for many practical algorithms.
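The equality h_MDL = h_MAP under optimal encodings can be checked numerically: with L(.) = -log2 Pr(.), minimising L(h) + L(T|h) is the same as maximising Pr(h) · Pr(T|h). The priors and likelihoods below are made up for illustration.

```python
import math

# Illustrative only: made-up priors and likelihoods for three hypotheses.
prior = {'h1': 0.5, 'h2': 0.3, 'h3': 0.2}        # Pr(h)
likelihood = {'h1': 0.02, 'h2': 0.10, 'h3': 0.12}  # Pr(T | h)

def length_bits(h):
    # L(h) + L(T|h), with L(.) = -log2 Pr(.) assuming optimal encodings
    return -math.log2(prior[h]) - math.log2(likelihood[h])

h_mdl = min(prior, key=length_bits)
h_map = max(prior, key=lambda h: prior[h] * likelihood[h])
print(h_mdl == h_map)  # True: minimising code length maximises the posterior
```

The agreement is no accident: -log2 is strictly decreasing, so the argmin of the summed code lengths is always the argmax of the product Pr(h) · Pr(T|h).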
34 Kolmogorov complexity

MDL is also connected with the more general notion of Kolmogorov complexity (descriptive complexity, Kolmogorov-Chaitin complexity, algorithmic entropy). The Kolmogorov complexity of a finite or infinite sequence of symbols (stream of data) is defined as the length of the simplest (shortest) algorithm that generates this data.

Naturally, the notion of algorithm length is quite complicated and requires a formal definition. Such a definition is usually given with the use of formal languages and Turing machines. In most non-trivial cases the task of calculating the Kolmogorov complexity of a sequence is very hard, frequently practically impossible (undecidable).

Consider two finite sequences of numbers. A prefix of the decimal expansion of π has a very low Kolmogorov complexity, since there exists a very simple algorithm that generates it. A randomly generated sequence of the same length has a potentially very high Kolmogorov complexity.
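Kolmogorov complexity itself is uncomputable, but the contrast on this slide can be illustrated with an ordinary compressor: the compressed size is a crude upper bound on the description length. This is only a proxy, not the actual Kolmogorov complexity; the particular sequences below are made up for the demonstration.

```python
import random
import zlib

# A highly regular sequence: generated by a trivially short rule.
regular = ('0123456789' * 100).encode()   # 1000 bytes

# A (pseudo-)random sequence of the same length: no short description known.
random.seed(0)
noise = bytes(random.randrange(256) for _ in range(1000))

# Compressed size as a rough stand-in for description length.
print(len(zlib.compress(regular)))  # a few dozen bytes: the rule is short
print(len(zlib.compress(noise)))    # about 1000 bytes: barely compressible
```

The regular sequence shrinks to a tiny fraction of its length, while the random one stays essentially incompressible, mirroring the difference between the digits of π (short generating program) and random digits (no program much shorter than the data itself).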
More informationLecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan
Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step
More informationConcept Learning.
. Machine Learning Concept Learning Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg Martin.Riedmiller@uos.de
More informationIntroduction. Decision Tree Learning. Outline. Decision Tree 9/7/2017. Decision Tree Definition
Introduction Decision Tree Learning Practical methods for inductive inference Approximating discrete-valued functions Robust to noisy data and capable of learning disjunctive expression ID3 earch a completely
More informationQuestion of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning
Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis
More informationSome Concepts of Probability (Review) Volker Tresp Summer 2018
Some Concepts of Probability (Review) Volker Tresp Summer 2018 1 Definition There are different way to define what a probability stands for Mathematically, the most rigorous definition is based on Kolmogorov
More informationDiscrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14
CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten
More informationGenerative Techniques: Bayes Rule and the Axioms of Probability
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2016/2017 Lesson 8 3 March 2017 Generative Techniques: Bayes Rule and the Axioms of Probability Generative
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Intelligent Data Analysis Decision Trees Paul Prasse, Niels Landwehr, Tobias Scheffer Decision Trees One of many applications:
More informationEECS 349:Machine Learning Bryan Pardo
EECS 349:Machine Learning Bryan Pardo Topic 2: Decision Trees (Includes content provided by: Russel & Norvig, D. Downie, P. Domingos) 1 General Learning Task There is a set of possible examples Each example
More informationImagine we ve got a set of data containing several types, or classes. E.g. information about customers, and class=whether or not they buy anything.
Decision Trees Defining the Task Imagine we ve got a set of data containing several types, or classes. E.g. information about customers, and class=whether or not they buy anything. Can we predict, i.e
More informationData Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,
More informationCOMP 328: Machine Learning
COMP 328: Machine Learning Lecture 2: Naive Bayes Classifiers Nevin L. Zhang Department of Computer Science and Engineering The Hong Kong University of Science and Technology Spring 2010 Nevin L. Zhang
More informationThe Bayesian Learning
The Bayesian Learning Rodrigo Fernandes de Mello Invited Professor at Télécom ParisTech Associate Professor at Universidade de São Paulo, ICMC, Brazil http://www.icmc.usp.br/~mello mello@icmc.usp.br First
More informationText Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)
Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed
More informationCMPT Machine Learning. Bayesian Learning Lecture Scribe for Week 4 Jan 30th & Feb 4th
CMPT 882 - Machine Learning Bayesian Learning Lecture Scribe for Week 4 Jan 30th & Feb 4th Stephen Fagan sfagan@sfu.ca Overview: Introduction - Who was Bayes? - Bayesian Statistics Versus Classical Statistics
More informationDecision Trees. Gavin Brown
Decision Trees Gavin Brown Every Learning Method has Limitations Linear model? KNN? SVM? Explain your decisions Sometimes we need interpretable results from our techniques. How do you explain the above
More informationApplied Logic. Lecture 4 part 1 Inductive reasoning. Marcin Szczuka. Institute of Informatics, The University of Warsaw
Applied Logic Lecture 4 part 1 Inductive reasoning Marcin Szczuka Institute of Informatics, The University of Warsaw Monographic lecture, Spring semester 2016/2017 Marcin Szczuka (MIMUW) Applied Logic
More informationData classification (II)
Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification
More informationConcept Learning Mitchell, Chapter 2. CptS 570 Machine Learning School of EECS Washington State University
Concept Learning Mitchell, Chapter 2 CptS 570 Machine Learning School of EECS Washington State University Outline Definition General-to-specific ordering over hypotheses Version spaces and the candidate
More informationOutline. Introduction. Bayesian Probability Theory Bayes rule Bayes rule applied to learning Bayesian learning and the MDL principle
Outline Introduction Bayesian Probability Theory Bayes rule Bayes rule applied to learning Bayesian learning and the MDL principle Sequence Prediction and Data Compression Bayesian Networks Copyright 2015
More informationM chi h n i e n L e L arni n n i g Decision Trees Mac a h c i h n i e n e L e L a e r a ni n ng
1 Decision Trees 2 Instances Describable by Attribute-Value Pairs Target Function Is Discrete Valued Disjunctive Hypothesis May Be Required Possibly Noisy Training Data Examples Equipment or medical diagnosis
More informationDecision Tree Learning - ID3
Decision Tree Learning - ID3 n Decision tree examples n ID3 algorithm n Occam Razor n Top-Down Induction in Decision Trees n Information Theory n gain from property 1 Training Examples Day Outlook Temp.
More informationQuestion of the Day? Machine Learning 2D1431. Training Examples for Concept Enjoy Sport. Outline. Lecture 3: Concept Learning
Question of the Day? Machine Learning 2D43 Lecture 3: Concept Learning What row of numbers comes next in this series? 2 2 22 322 3222 Outline Training Examples for Concept Enjoy Sport Learning from examples
More informationJialiang Bao, Joseph Boyd, James Forkey, Shengwen Han, Trevor Hodde, Yumou Wang 10/01/2013
Simple Classifiers Jialiang Bao, Joseph Boyd, James Forkey, Shengwen Han, Trevor Hodde, Yumou Wang 1 Overview Pruning 2 Section 3.1: Simplicity First Pruning Always start simple! Accuracy can be misleading.
More informationThe Solution to Assignment 6
The Solution to Assignment 6 Problem 1: Use the 2-fold cross-validation to evaluate the Decision Tree Model for trees up to 2 levels deep (that is, the maximum path length from the root to the leaves is
More informationModern Information Retrieval
Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationLearning with Probabilities
Learning with Probabilities CS194-10 Fall 2011 Lecture 15 CS194-10 Fall 2011 Lecture 15 1 Outline Bayesian learning eliminates arbitrary loss functions and regularizers facilitates incorporation of prior
More informationDecision Support. Dr. Johan Hagelbäck.
Decision Support Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Decision Support One of the earliest AI problems was decision support The first solution to this problem was expert systems
More informationDecision Tree Learning
0. Decision Tree Learning Based on Machine Learning, T. Mitchell, McGRAW Hill, 1997, ch. 3 Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell PLAN 1. Concept learning:
More informationClassification: Rule Induction Information Retrieval and Data Mining. Prof. Matteo Matteucci
Classification: Rule Induction Information Retrieval and Data Mining Prof. Matteo Matteucci What is Rule Induction? The Weather Dataset 3 Outlook Temp Humidity Windy Play Sunny Hot High False No Sunny
More informationMachine Learning 2nd Edi7on
Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edi7on CHAPTER 9: Decision Trees ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins alpaydin@boun.edu.tr h1p://www.cmpe.boun.edu.tr/~ethem/i2ml2e
More informationI. Induction, Probability and Confirmation: Introduction
I. Induction, Probability and Confirmation: Introduction 1. Basic Definitions and Distinctions Singular statements vs. universal statements Observational terms vs. theoretical terms Observational statement
More informationChapter 3: Decision Tree Learning (part 2)
Chapter 3: Decision Tree Learning (part 2) CS 536: Machine Learning Littman (Wu, TA) Administration Books? Two on reserve in the math library. icml-03: instructional Conference on Machine Learning mailing
More informationLecture 2: Foundations of Concept Learning
Lecture 2: Foundations of Concept Learning Cognitive Systems II - Machine Learning WS 2005/2006 Part I: Basic Approaches to Concept Learning Version Space, Candidate Elimination, Inductive Bias Lecture
More informationIntroduction to Machine Learning. Lecture 2
Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for
More informationIntroduction and Models
CSE522, Winter 2011, Learning Theory Lecture 1 and 2-01/04/2011, 01/06/2011 Lecturer: Ofer Dekel Introduction and Models Scribe: Jessica Chang Machine learning algorithms have emerged as the dominant and
More informationClassification and Prediction
Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model
More informationDecision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology
Decision trees Special Course in Computer and Information Science II Adam Gyenge Helsinki University of Technology 6.2.2008 Introduction Outline: Definition of decision trees ID3 Pruning methods Bibliography:
More informationMachine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Computational Learning Theory Le Song Lecture 11, September 20, 2012 Based on Slides from Eric Xing, CMU Reading: Chap. 7 T.M book 1 Complexity of Learning
More informationDecision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1
Decision Trees Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, 2018 Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Roadmap Classification: machines labeling data for us Last
More informationMachine Learning
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 13, 2011 Today: The Big Picture Overfitting Review: probability Readings: Decision trees, overfiting
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationIntroduction to machine learning. Concept learning. Design of a learning system. Designing a learning system
Introduction to machine learning Concept learning Maria Simi, 2011/2012 Machine Learning, Tom Mitchell Mc Graw-Hill International Editions, 1997 (Cap 1, 2). Introduction to machine learning When appropriate
More informationData Mining and Machine Learning
Data Mining and Machine Learning Concept Learning and Version Spaces Introduction Concept Learning Generality Relations Refinement Operators Structured Hypothesis Spaces Simple algorithms Find-S Find-G
More information