III. Naïve Bayes (pp. 70-72)

This is a short section in our text, but we are presenting more material in these notes.

Probability review

Definition of probability: The probability of an event E is the ratio of the number of cases where E occurs to the total number of cases. Note well: this applies only when all the cases are equally likely!

Example: We roll a pair of (fair) dice. Find the probability of the event E = "The sum is five."

In the experiment above, calculate the probabilities of the events A = "The sum is odd" and B = "The sum is prime."

Union (or) and intersection (and) of events

Cast a die and define these events:
A = "The number of dots is 1"
B = "The number of dots is odd"
Calculate P(A), P(B), P(A∩B), P(A∪B).

In the two-dice experiment, calculate the probabilities of these events:
A = "The sum is <10 and the second die is >4"
B = "The sum is <10 or the second die is >4"
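As a quick check of the definition, here is a minimal Python sketch (an illustration, not part of the textbook) that enumerates the 36 equally likely outcomes of two fair dice and counts the cases for the example event E = "The sum is five."

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# Event E: the sum is five
favorable = [(d1, d2) for d1, d2 in outcomes if d1 + d2 == 5]

# Probability = (cases where E occurs) / (total cases)
print(len(favorable), "/", len(outcomes), "=", len(favorable) / len(outcomes))
# 4 / 36 = 0.111...
```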

Law of conditional probability:

P(A|B) = P(A∩B)/P(B), or P(A given B) = P(A and B)/P(B)    (*)

Interpretation: since we know that B occurred, we renormalize by dividing by its probability.

Example: Assuming the dots in the figure are equally likely, calculate P(A|B) and P(B|A). Do it two ways: from scratch, with the definition of probability as a ratio, and by applying the formula of conditional probability.

In the two-dice experiment, calculate the probability of the event C = "The sum is <10, given that the second die is >4."

Equation (*) can be used to calculate any of the three probabilities involved, knowing the other two. In particular, this form is very useful:

P(A∩B) = P(A|B) P(B) = P(B|A) P(A)    (**)
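The renormalization in (*) is easy to see in code. The sketch below is an illustration of the formula (the helper names prob and cond_prob are not from the notes); it reuses the single-die events A = "one dot" and B = "odd number of dots" from the previous quiz.

```python
def prob(event, space):
    """Probability as a ratio of counts, assuming equally likely outcomes."""
    return sum(1 for x in space if event(x)) / len(space)

def cond_prob(event_a, event_b, space):
    """P(A|B) = P(A and B) / P(B), per equation (*)."""
    return prob(lambda x: event_a(x) and event_b(x), space) / prob(event_b, space)

die = range(1, 7)
A = lambda d: d == 1        # one dot
B = lambda d: d % 2 == 1    # odd number of dots

print(cond_prob(A, B, die))   # 0.333... = 1/3
print(cond_prob(B, A, die))   # 1.0
```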

Bayes' Theorem:

P(A|B) = P(B|A) P(A) / P(B)    (***)

Prove Bayes' Theorem, using Eq. (**).

There is a big party on campus to which all CS and Business majors are invited. One in ten Business majors is shy, and six in ten CS majors are shy. We meet a student who is shy. Is it more likely for their major to be CS or Business? Technically, what is the probability that their major is CS? (Please ponder for a minute...)
Hint: We are missing an important piece of information: there are about 100 CS majors and 1000 Business majors in this university!

Solve the previous problem by using Bayes' Theorem (***) directly.
Hint: First define the events A and B!

Solution:
A = "The student we meet is a CS major."
B = "The student we meet is shy."
A|B = ????
B|A = ????
Can we calculate the probabilities of all the events on the RHS of Bayes' Theorem?

P(B|A) = P(shy, given CS major) = 6/10 = 0.6
P(A) = P(CS major) = 100/(100 + 1000) = 100/1100 ≈ 0.091
P(B) = P(shy) = ????...

Let us think back on what we did in the first solution: where do the 100 and the 60 come from?

P(shy) = P(shy and CS) + P(shy and Business)

Now we apply conditional probability (**) to each term:

= P(shy, given CS) P(CS) + P(shy, given Business) P(Business) = 0.6 × 100 + 0.1 × 1000

Note: Since all probabilities involved have a denominator of 1,100, we only wrote the numerators!

The addition of probabilities performed above is so useful that it was enshrined as another theorem or law of Probability Theory:

Law of Total Probability (LTP): P(B) = P(B|A) P(A) + P(B|not A) P(not A)

Note: P(Business) = P(not CS) = P(not A).

When using the LTP in the denominator of Bayes' Theorem, we get this more detailed form of Bayes' Theorem:

P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|not A) P(not A)]    (****)

P(A) is called the prior, and P(A|B) the posterior probability of A. B is called the evidence. The ratio P(B|A)/P(B) is the support B offers A.
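Putting the numbers worked out above into code (a sketch of the calculation, not part of the notes), the posterior P(CS | shy) follows directly from the detailed form (****):

```python
# Party problem: P(CS | shy), using the detailed form (****) of Bayes' Theorem
p_cs = 100 / (100 + 1000)          # prior P(A)
p_not_cs = 1 - p_cs                # P(not A) = P(Business)
p_shy_given_cs = 0.6               # P(B|A)
p_shy_given_business = 0.1         # P(B|not A)

# Law of Total Probability for the denominator
p_shy = p_shy_given_cs * p_cs + p_shy_given_business * p_not_cs

posterior = p_shy_given_cs * p_cs / p_shy
print(posterior)   # 0.375 -> Business is still the more likely major
```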

Two cab companies serve a city: the Green company operates 85% of the cabs and the Blue company operates 15% of the cabs. One of the cabs is involved in a hit-and-run accident at night, and a witness identifies the hit-and-run cab as a Blue cab. When the court tests the reliability of the witness under circumstances similar to those on the night of the accident, (s)he correctly identifies the color of a cab 80% of the time and misidentifies it the other 20% of the time. What is the probability that the cab involved in the accident was Blue, as stated by the witness?
Hint: Use Bayes' Theorem. Define the events A and B.

For more practice: Wikipedia's page for Bayes' Theorem has three nice examples - study them all:
- Drug tests
- Reliability of factory machines
- Identification of beetles

Application: TEXT LEARNING

We have a number of messages written by several authors [1]. For simplicity, let us call the authors A1 and A2. Also for simplicity, the classification will not be based on all the words present, but on a (relatively small) subset of them [2]. In this example, let us consider only three magic words: foo, bar, and baz.

For each author, we calculate the probability of each word appearing in a message:

P(foo|A1) = 0.2   P(bar|A1) = 0.3   P(baz|A1) = 0.4
P(foo|A2) = 0.3   P(bar|A2) = 0.1   P(baz|A2) = 0.3

We also need to know the distribution of the messages between A1 and A2. Ideally, they are evenly distributed: P(A1) = P(A2) = 0.5.

We now have a new message, whose author is unknown, either A1 or A2. We find which of the magic words are present in the message, for example foo and bar. We call {foo, bar} a bag of words, because the order and proximity of the words are not considered.

We would like to calculate the probabilities P(A1|foo,bar) and P(A2|foo,bar), because then we would predict that the author with the higher probability is the author of the message. Mnemonic: The formulas are easy to remember with A for author and B for bag of words: P(A|B).

We apply Bayes' Theorem (***) to find:

P(A1|foo,bar) = P(foo,bar|A1) P(A1) / P(foo,bar), and similarly for A2.

Note: In Bayes' Theorem we only have one piece of evidence, but here we have two: foo and bar. What to do? Here is where the "naive" in Naive Bayes comes into play: we assume that the words occur independently [3], so we can factorize the intersections:

P(foo,bar|A1) = P(foo|A1) P(bar|A1), and similarly for A2.

Combining the last two equations, we have:

P(A1|foo,bar) = P(foo|A1) P(bar|A1) P(A1) / P(foo,bar), and similarly for A2.

Since A1 and A2 have the same denominator, we don't need it in order to establish which probability is greater, so we further simplify the formulas to:

P(A1|foo,bar) ~ P(foo|A1) P(bar|A1) P(A1)
P(A2|foo,bar) ~ P(foo|A2) P(bar|A2) P(A2)    (o)

With the numerical values shown at the beginning of this section, find out which author is more likely for the message (a code sketch of this calculation follows the footnotes below).

[1] For example, the 85 Federalist Papers were written by Alexander Hamilton, James Madison, and John Jay.
[2] Several studies on the disputed Federalist Papers are based on a set of 70 so-called function words.
[3] In principle, if we had a large enough corpus of messages from an author, we could estimate joint distributions for each combination of words, but in practice the combinatorial explosion prevents us from doing so.
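Here is a minimal Python sketch of the products in (o), using the word probabilities and priors given at the top of this section (the helper name score is just for illustration; the same numbers are worked out by hand on the Solutions page):

```python
# Word probabilities per author, as given above
p_word = {
    "A1": {"foo": 0.2, "bar": 0.3, "baz": 0.4},
    "A2": {"foo": 0.3, "bar": 0.1, "baz": 0.3},
}
prior = {"A1": 0.5, "A2": 0.5}

def score(author, bag):
    """Unnormalized posterior (o): product of P(word|author) over the bag, times the prior."""
    s = prior[author]
    for word in bag:
        s *= p_word[author][word]
    return s

bag = {"foo", "bar"}
print(score("A1", bag), score("A2", bag))   # 0.03 vs 0.015 -> A1 is more likely
```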

How about baz, or in general any magic word that is missing from the message? Its absence may also count as information. How do the formulas (o) change to take this into account?

Recalculate the probabilities from the problem above, taking into account that the word baz is missing. Which author is more likely now?

Note: The BernoulliNB classifier from scikit-learn does take into account the probabilities of the missing features!
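The adjustment applied on the Solutions page below is to multiply by 1 - P(word|author) for every magic word that is absent, which is exactly the Bernoulli likelihood. A sketch under that assumption (the helper name bernoulli_score is just for illustration):

```python
# Same word probabilities and priors as above
p_word = {
    "A1": {"foo": 0.2, "bar": 0.3, "baz": 0.4},
    "A2": {"foo": 0.3, "bar": 0.1, "baz": 0.3},
}
prior = {"A1": 0.5, "A2": 0.5}

def bernoulli_score(author, bag, vocabulary=("foo", "bar", "baz")):
    """Present words contribute P(word|author); absent magic words contribute 1 - P(word|author)."""
    s = prior[author]
    for word in vocabulary:
        p = p_word[author][word]
        s *= p if word in bag else (1 - p)
    return s

print(bernoulli_score("A1", {"foo", "bar"}))   # 0.5 * 0.2 * 0.3 * (1 - 0.4) = 0.018
print(bernoulli_score("A2", {"foo", "bar"}))   # 0.5 * 0.3 * 0.1 * (1 - 0.3) = 0.0105
```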

Solutions:

Prove Bayes' Theorem, using Eq. (**).
Multiply both sides of (***) by P(B), to get P(A|B) P(B) = P(B|A) P(A). According to (**), both sides are P(A∩B).

Two cab companies serve a city: the Green company operates 85% of the cabs and the Blue company operates 15% of the cabs. One of the cabs is involved in a hit-and-run accident at night, and a witness identifies the hit-and-run cab as a Blue cab. When the court tests the reliability of the witness under circumstances similar to those on the night of the accident, he correctly identifies the color of a cab 80% of the time and misidentifies it the other 20% of the time. What is the probability that the cab involved in the accident was Blue, as stated by the witness?
Hint: Use Bayes' Theorem. Define the events A and B.

With the numerical values shown at the beginning of this section, find out which author is more likely for the message.
P(A1|foo,bar) ~ P(foo|A1) P(bar|A1) P(A1) = 0.2*0.3*0.5 = 0.03
P(A2|foo,bar) ~ P(foo|A2) P(bar|A2) P(A2) = 0.3*0.1*0.5 = 0.015
A1 is more likely.

Recalculate the probabilities from the problem above, taking into account that the word baz is missing. Which author is more likely now?
The 1st probability above is further multiplied by 1 - P(baz|A1): 0.03*0.6 = 0.018
The 2nd probability above is further multiplied by 1 - P(baz|A2): 0.015*0.7 = 0.0105
A1 is still more likely.
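For the cab problem, a sketch of the calculation following the hint (my own code, with A = "the cab is Blue" and B = "the witness says the cab is Blue"):

```python
# Cab problem: A = "the cab is Blue", B = "the witness identifies the cab as Blue"
p_blue = 0.15
p_green = 0.85
p_says_blue_given_blue = 0.80    # witness identifies correctly
p_says_blue_given_green = 0.20   # witness misidentifies a Green cab as Blue

# Law of Total Probability for P(B), then Bayes' Theorem (***)
p_says_blue = p_says_blue_given_blue * p_blue + p_says_blue_given_green * p_green
posterior = p_says_blue_given_blue * p_blue / p_says_blue
print(posterior)   # ≈ 0.41: despite the witness, a Green cab is still more likely
```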

In our textbook, we are shown how to represent four messages (each message is a row) and their words (each feature/column is a word) in vectorized form. Let us use W0, W1, W2, and W3 for the words, for clarity. The targets/classes are represented in the array y; let us use A0 and A1 for the classes, for clarity.

We write code to count the number of occurrences of each word in each class, by summing each column (axis=0), as sketched in the code below. The function np.unique returns the sorted unique elements of a numpy array.

Now we can convert the counts above to the probabilities we need for the Naive Bayes algorithm!

Finish calculating the missing probabilities above!

For easy reference, place all the probabilities obtained above in a table.
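A sketch of what that counting code can look like. The arrays X and y below are placeholders (the textbook's actual values are not reproduced here); the point is the counting pattern with a boolean mask and sum(axis=0).

```python
import numpy as np

# Hypothetical vectorized messages: rows = messages, columns = words W0..W3
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])   # classes A0 and A1

for c in np.unique(y):                   # sorted unique class labels
    counts = X[y == c].sum(axis=0)       # occurrences of each word in class c
    n_msgs = (y == c).sum()              # number of messages in class c
    print(f"A{c}: counts = {counts}, P(Wi|A{c}) = {counts / n_msgs}")
```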

Since we are going to use these probabilities for multiplication, we can avoid the zeroes by adding one to all denominators and numerators. This is called Laplace smoothing.

We have a new message: [1, 1, 0, 0]. Calculate the products for the Naive Bayes algorithm and decide which author is more likely. Do it by using only the positive occurrences.

For more practice: Use both positive and negative occurrences.

For more practice: Programming Naive Bayes classification from scratch. Write a Python function that takes an array of four binary values (the message) as argument, and returns the prediction. Hint: For this problem, it is sufficient to hard-code the probability table as a two-dimensional list-of-lists or numpy array (a sketch with placeholder probabilities follows below).
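A minimal sketch of such a function. The probability table below is a placeholder (the real one comes from the counts computed on the previous page); rows are the classes A0 and A1, columns are the words W0..W3, and both positive and negative occurrences are used.

```python
import numpy as np

# Placeholder smoothed probabilities P(Wi | class); replace with the table computed above.
prob_table = np.array([[0.5, 0.7, 0.3, 0.6],    # class A0
                       [0.6, 0.2, 0.8, 0.4]])   # class A1
priors = np.array([0.5, 0.5])

def predict(message):
    """Return 0 (A0) or 1 (A1) for a length-4 binary message."""
    message = np.asarray(message)
    # Bernoulli likelihood: p where the word is present, (1 - p) where it is absent
    likelihoods = np.where(message == 1, prob_table, 1 - prob_table).prod(axis=1)
    return int(np.argmax(priors * likelihoods))

print(predict([1, 1, 0, 0]))
```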

Create an array X for the four-word example, with 20 rows and 4 columns. Place 10 messages from A0 first, followed by 10 from A1. The vector y has 10 zeros (A0), followed by 10 ones (A1, or not A0).

Solution: Below is a CSV (comma-separated values) file, visualized with a spreadsheet editor (left) and with a plain-text editor (right). The name of the file is messages.csv. The first 4 columns have the data for the array X, and the 5th has the data for y (targets). Because values are missing, we import the data into a numpy array using genfromtxt.

Now we create and train a Naive Bayes classifier that implements the algorithm described above. It is called Bernoulli Naive Bayes (a code sketch of these steps follows below).

The (smoothing) parameter alpha has a meaning in NB classification that is slightly different from regression, but similar in that it controls the complexity of the model: If all the words appear in each class of the training set (as was the case in the example above), then no smoothing is necessary. If, however, one word, e.g. W3, is missing from a class in the training set, e.g. A1, then the estimated conditional probability is zero: P(W3|A1) = 0. All future messages from A1 that happen to contain W3 will be given a probability of zero, irrespective of any other words they contain! This is effectively noise: due to the accidental content of our sample, we are under the wrong impression that W3 never occurs in A1's messages. A model that attempts to model this accident is too complex, so alpha reduces this complexity.

To avoid the case described above, a constant alpha is added to all the counts. By default, alpha = 1 (Laplace smoothing). The NB classifiers in scikit-learn do not allow alpha = 0: even if we give alpha a value of zero, it will be automatically set to a very small value that is practically equal to zero.
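A minimal sketch of the import-and-train steps just described, assuming a messages.csv laid out as above (four indicator columns followed by the target column):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Import the CSV into a numpy array; genfromtxt turns missing entries into NaN
data = np.genfromtxt("messages.csv", delimiter=",")
X, y = data[:, :4], data[:, 4]

# Bernoulli Naive Bayes with Laplace smoothing (alpha=1 is the default)
clf = BernoulliNB(alpha=1.0)
clf.fit(X, y)
```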

The feature counts we calculated manually above are available as an attribute of the classifier, the number of datapoints in each class is also tallied automatically, and the prior probabilities are stored in the classifier as well.

Due to the multiplicative nature of the calculations in the Naive Bayes algorithm, the probabilities are stored in logarithmic form - this way they can be added rather than multiplied. In our example, note that the result is -0.693..., which is simply the natural logarithm of 0.5, since both authors are equally represented.

As with all classifiers and regressors, a member function allows us to calculate the score for an array of points. Since we used the entire dataset for training, let us find the training score (see the sketch below).

What conclusion do we draw from the score above? Underfitting (because the data set is too small!)

SKIP MultinomialNB and GaussianNB
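Continuing the sketch above (it assumes clf, X, and y as defined there), the stored quantities and the training score mentioned in this discussion can be inspected like this:

```python
# Continuing the training sketch above (clf, X, y already defined)
print(clf.feature_count_)     # per-class occurrence counts of each word, tallied during fit
print(clf.class_count_)       # number of training messages in each class
print(clf.class_log_prior_)   # log of the prior probabilities; ln(0.5) ≈ -0.693 for balanced classes
print(clf.score(X, y))        # training score (mean accuracy on the training set)
```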

Conclusions on Naive Bayes (BernoulliNB):

Strengths:
- alpha is not as important as in regression, but it can still fine-tune the model.
- Works well (efficiently) with large, sparse matrices X (more in the lab!).
- Like linear models: fast to train and predict, easy to understand. On very large datasets, it is even faster to train than a linear model!

Weaknesses/limitations:
- Is used only when the features are binary (0 or 1), and specifically for classifying text.
- Assumes independence of features, which may not be the case in real life, e.g. the feature overcast (Y/N) is probably correlated with temperature (High/Low).
- Data scarcity... can be mitigated using smoothing (alpha).

Solutions:

We have a new message: [1, 1, 0, 0]. Calculate the products for the Naive Bayes algorithm and decide which author is more likely. Do it by using only the positive occurrences.
Conclusion: not A0, a.k.a. A1, is more likely.

For more practice: Use both positive and negative occurrences.
Hint: The two authors turn out to be equally likely!
