Probabilistic and Bayesian Analytics Based on a Tutorial by Andrew W. Moore, Carnegie Mellon University


1 Probabilistic and Bayesian Analytics. Based on a tutorial by Andrew W. Moore, Carnegie Mellon University.

2 Discrete Random Variables. A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs. Examples: A = The US president in 2023 will be male. A = You wake up tomorrow with a headache. A = You have Ebola.

3 Probabilities. We write P(A) as the fraction of possible worlds in which A is true. We could at this point spend 2 hours on the philosophy of this. But we won't.

4 Visualizing A. Picture: the event space of all possible worlds; its total area is 1. The worlds in which A is true form an oval whose area is P(A); the remaining worlds are those in which A is false.

5 The Axioms of Probability: 0 <= P(A) <= 1; P(True) = 1; P(False) = 0; P(A or B) = P(A) + P(B) - P(A and B). Where do these axioms come from? Were they discovered? Answers coming up later.

6 Interpreting the axioms (0 <= P(A) <= 1): the area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.

7 Interpreting the axioms (0 <= P(A) <= 1): the area of A can't get any bigger than 1, and an area of 1 would mean all worlds will have A true.

8 Interpreting the axioms (P(A or B) = P(A) + P(B) - P(A and B)). Picture: two overlapping events A and B in the event space.

9 Interpreting the axioms. P(A or B) = P(A) + P(B) - P(A and B): the area of "A or B" is the area of A plus the area of B, minus the doubly counted overlap "A and B". Simple addition and subtraction.

10 These Axioms are Not to be Trifled With. There have been attempts to do different methodologies for uncertainty: Fuzzy Logic, Three-valued logic, Dempster-Shafer, Non-monotonic reasoning. But the axioms of probability are the only system with this property: If you gamble using them you can't be unfairly exploited by an opponent using some other system. [de Finetti 1931]

11 Theorems from the Axioms. Using 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, and P(A or B) = P(A) + P(B) - P(A and B), we can prove: P(not A) = P(~A) = 1 - P(A). How?

12 Another important theorem. From the same axioms we can prove: P(A) = P(A ^ B) + P(A ^ ~B). How?

13 Multivalued Random Variables. Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, ..., v_k}. Thus: P(A=v_i ^ A=v_j) = 0 if i ≠ j, and P(A=v_1 or A=v_2 or ... or A=v_k) = 1.

14 An easy fact about Multivalued Random Variables. Using the axioms of probability, and assuming that A obeys P(A=v_i ^ A=v_j) = 0 if i ≠ j and P(A=v_1 or ... or A=v_k) = 1, it's easy to prove that:
P(A=v_1 or A=v_2 or ... or A=v_i) = Σ_{j=1}^{i} P(A=v_j)

15 An easy fact about Multivalued Random Variables (continued). And thus we can prove:
Σ_{j=1}^{k} P(A=v_j) = 1

16 Another fact about Multivalued Random Variables. Using the axioms and the same two assumptions about A, it's easy to prove that:
P(B ^ [A=v_1 or A=v_2 or ... or A=v_i]) = Σ_{j=1}^{i} P(B ^ A=v_j)

17 Another fact about Multivalued Random Variables (continued). And thus we can prove:
P(B) = Σ_{j=1}^{k} P(B ^ A=v_j)

18 Elementary Probability in Pictures. P(~A) + P(A) = 1

19 Elementary Probability in Pictures. P(B) = P(B ^ A) + P(B ^ ~A)

20 Elementary Probability in Pictures. Σ_{j=1}^{k} P(A=v_j) = 1

21 Elementary Probability in Pictures. P(B) = Σ_{j=1}^{k} P(B ^ A=v_j)

22 Definition of Conditional Probability: P(A|B) = P(A ^ B) / P(B). Corollary, The Chain Rule: P(A ^ B) = P(A|B) P(B).

23 Probabilistic Inference. H = "Have a headache". F = "Coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu." Is this reasoning good?

24 Bayes Rule. P(A ^ B) = P(A|B) P(B) = P(B|A) P(A), so:
P(B|A) = P(A|B) P(B) / P(A)
This is Bayes Rule. Bayes, Thomas (1763), "An essay towards solving a problem in the doctrine of chances", Philosophical Transactions of the Royal Society of London, 53:370-418.
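Applying Bayes Rule to the headache/flu example on the previous slide answers its question:
P(F|H) = P(H|F) P(F) / P(H) = (1/2 × 1/40) / (1/10) = 1/8
So the reasoning was not good: given a headache, the chance of flu is 1/8, not 50-50.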

25 More General Forms of Bayes Rule:
P(A|B) = P(B|A) P(A) / ( P(B|A) P(A) + P(B|~A) P(~A) )
P(A | B ^ X) = P(B | A ^ X) P(A | X) / P(B | X)

26 More General Forms of Bayes Rule:
P(A=v_i | B) = P(B | A=v_i) P(A=v_i) / Σ_{k=1}^{n_A} P(B | A=v_k) P(A=v_k)

27 Useful Easy-to-prove facts: P(A|B) + P(~A|B) = 1, and Σ_{k=1}^{n_A} P(A=v_k | B) = 1.

28 The Joint Distribution. Recipe for making a joint distribution of M variables. Example: Boolean variables A, B, C.

29 The Joint Distribution. Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). Example: Boolean variables A, B, C give the eight rows 000 through 111.

30 The Joint Distribution (continued). 2. For each combination of values, say how probable it is (for the A, B, C example, values such as 0.30, 0.05, 0.10, ...).

31 The Joint Distribution (continued). 3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10
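As an illustrative sketch (not part of the original slides), the three-step recipe fits in a few lines of Python; the variable name `joint` is mine and the probabilities are the example values above:

```python
from itertools import product

# Step 1: a truth table of all 2^M combinations (here M = 3 Booleans: A, B, C).
# Step 2: assign a probability to each combination (example values from the slide).
joint = dict(zip(product([0, 1], repeat=3),
                 [0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10]))

# Step 3: the axioms require the eight numbers to sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
```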

32 Using the Joint. Once you have the JD you can ask for the probability of any logical expression involving your attributes:
P(E) = Σ_{rows matching E} P(row)

33 Using the Joint. P(Poor ^ Male) = 0.4654: sum P(row) over the rows matching "Poor and Male".

34 Using the Joint. P(Poor) = 0.7604: sum P(row) over the rows matching "Poor".

35 Inference with the Joint.
P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)

36 Inference with the Joint (continued). For example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612.
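A minimal sketch of both queries, assuming a joint table like the `joint` dict built above (tuples of attribute values mapped to probabilities); the helper names are mine:

```python
def prob(joint, matches):
    """P(E): add up P(row) over the rows matching expression E."""
    return sum(p for row, p in joint.items() if matches(row))

def cond_prob(joint, e1, e2):
    """P(E1 | E2) = P(E1 ^ E2) / P(E2)."""
    return prob(joint, lambda r: e1(r) and e2(r)) / prob(joint, e2)

# Example with rows indexed as (a, b, c): P(A | B)
p_a_given_b = cond_prob(joint, lambda r: r[0] == 1, lambda r: r[1] == 1)
```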

37 Inference is a big deal. "I've got this evidence. What's the chance that this conclusion is true?" I've got a sore neck: how likely am I to have meningitis? I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?

39 Inference is a big deal (continued). There's a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis.

40 Where do Joint Distributions come from? Idea One: Expert Humans. Idea Two: Simpler probabilistic facts and some algebra. Example: suppose you knew
P(A) = 0.7
P(B|A) = 0.2, P(B|~A) = 0.1
P(C|A^B) = 0.1, P(C|A^~B) = 0.8, P(C|~A^B) = 0.3, P(C|~A^~B) = 0.1
Then you can automatically compute the JD using the chain rule:
P(A=x ^ B=y ^ C=z) = P(C=z | A=x ^ B=y) P(B=y | A=x) P(A=x)
In another lecture: Bayes Nets, a systematic way to do this.
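A sketch of this chain-rule computation under the numbers above (the dictionary layout and names are my own):

```python
# Given facts (from the slide): P(A), P(B|A), P(B|~A), and P(C | each A,B combination).
p_a = 0.7
p_b_given_a = {True: 0.2, False: 0.1}
p_c_given_ab = {(True, True): 0.1, (True, False): 0.8,
                (False, True): 0.3, (False, False): 0.1}

def joint_row(a, b, c):
    """P(A=a ^ B=b ^ C=c) = P(C=c | A=a ^ B=b) * P(B=b | A=a) * P(A=a)."""
    pa = p_a if a else 1.0 - p_a
    pb = p_b_given_a[a] if b else 1.0 - p_b_given_a[a]
    pc = p_c_given_ab[(a, b)] if c else 1.0 - p_c_given_ab[(a, b)]
    return pc * pb * pa

# The eight rows of the implied JD sum to 1, as the axioms demand.
bools = (False, True)
assert abs(sum(joint_row(a, b, c)
               for a in bools for b in bools for c in bools) - 1.0) < 1e-9
```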

41 Where do Joint Distributions come from? Idea Three: Learn them from data! Prepare to see one of the most impressive learning algorithms you'll come across in the entire course.

42 Learning a joint distribution. Build a JD table for your attributes in which the probabilities are unspecified. Then fill in each row with
P^(row) = (# records matching row) / (total number of records)

A B C  Prob         A B C  Prob
0 0 0  ?            0 0 0  0.30
0 0 1  ?            0 0 1  0.05
0 1 0  ?            0 1 0  0.10
0 1 1  ?     ->     0 1 1  0.05
1 0 0  ?            1 0 0  0.05
1 0 1  ?            1 0 1  0.10
1 1 0  ?            1 1 0  0.25
1 1 1  ?            1 1 1  0.10

(For example, 0.25 is the fraction of all records in which A and B are True but C is False.)
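The learning step itself is one counting pass; a sketch, assuming records are tuples of attribute values (the function name is mine):

```python
from collections import Counter

def learn_joint(records):
    """P^(row) = (# records matching row) / (total number of records)."""
    counts = Counter(tuple(r) for r in records)
    total = len(records)
    return {row: c / total for row, c in counts.items()}
```

Note that any row that never occurs in the data simply gets probability zero, which is exactly the overfitting problem discussed later.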

43 Example of Learning a Joint. This Joint was obtained by learning from three attributes in the UCI "Adult" Census Database [Kohavi 1995].

44 Where are we? We have recalled the fundamentals of probability. We have become content with what JDs are and how to use them. And we even know how to learn JDs from data.

45 Density Estimation. Our Joint Distribution learner is our first example of something called Density Estimation. A Density Estimator learns a mapping from a set of attributes to a Probability: Input Attributes -> Density Estimator -> Probability.

46 Density Estimation. Compare it against the two other major kinds of models:
Classifier: Input Attributes -> Prediction of categorical output
Density Estimator: Input Attributes -> Probability
Regressor: Input Attributes -> Prediction of real-valued output

47 Evaluating Density Estimation. Test-set criterion for estimating performance on future data* (*see the Decision Tree or Cross Validation lecture for more detail):
Classifier: Input Attributes -> Prediction of categorical output; evaluated by test set accuracy.
Density Estimator: Input Attributes -> Probability; evaluated by... ?
Regressor: Input Attributes -> Prediction of real-valued output; evaluated by test set accuracy.

48 Evaluating a density estimator. Given a record x, a density estimator M can tell you how likely the record is: P^(x | M). Given a dataset with R records, a density estimator can tell you how likely the dataset is, under the assumption that all records were independently generated from the Density Estimator's JD:
P^(dataset | M) = P^(x_1 ^ x_2 ^ ... ^ x_R | M) = Π_{k=1}^{R} P^(x_k | M)

49 A small dataset: Miles Per Gallon. 192 Training Set Records, from the UCI repository (thanks to Ross Quinlan).

mpg   modelyear  maker
good  75to78     asia
bad   70to74     america
bad   75to78     europe
bad   70to74     america
bad   70to74     america
bad   70to74     asia
bad   70to74     asia
bad   75to78     america
:     :          :
bad   70to74     america
good  79to83     america
bad   75to78     america
good  79to83     america
bad   75to78     america
good  79to83     america
good  79to83     america
bad   70to74     america
good  75to78     europe
bad   75to78     europe

51 A small dataset: Miles Per Gallon. For the 192 training records above:
P^(dataset | M) = P^(x_1 ^ x_2 ^ ... ^ x_R | M) = Π_{k=1}^{R} P^(x_k | M) = 3.4 × 10^-203 in this case

52 Log Probabilities. Since probabilities of datasets get so small we usually use log probabilities:
log P^(dataset | M) = log Π_{k=1}^{R} P^(x_k | M) = Σ_{k=1}^{R} log P^(x_k | M)
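A sketch of this evaluation, assuming `estimator` is any callable that returns P^(x | M) for a record x (the names are mine):

```python
import math

def log_likelihood(records, estimator):
    """log P^(dataset | M) = sum over records of log P^(x_k | M).
    Returns -inf if any record gets probability zero."""
    total = 0.0
    for x in records:
        p = estimator(x)
        if p == 0:
            return float("-inf")
        total += math.log(p)
    return total
```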

53 A small dataset: Miles Per Gallon. For the 192 training records:
log P^(dataset | M) = log Π_{k=1}^{R} P^(x_k | M) = Σ_{k=1}^{R} log P^(x_k | M) ≈ -466.2 in this case (the natural log of 3.4 × 10^-203)

54 Summary: The Good News. We have a way to learn a Density Estimator from data. Density estimators can do many good things: Can sort the records by probability, and thus spot weird records (anomaly detection). Can do inference: P(E1|E2) (Automatic Doctor / Help Desk etc.). Ingredient for Bayes Classifiers (see later).

55 Summary: The Bad News. Density estimation by directly learning the joint is trivial, mindless and dangerous.

56 Using a test set. An independent test set with 196 cars has a worse log likelihood (actually it's a billion quintillion quintillion quintillion quintillion times less likely). Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!

57 Overfitting Density Estimators. If this ever happens, it means there are certain combinations that we learn are impossible:
log P^(testset | M) = Σ_{k=1}^{R} log P^(x_k | M) = -∞ if P^(x_k | M) = 0 for any k

58 Using a test set. The only reason that our test set didn't score -infinity is that my code is hard-wired to always predict a probability of at least one in 10^20. We need Density Estimators that are less prone to overfitting.

59 Naïve Density Estimation. The problem with the Joint Estimator is that it just mirrors the training data. We need something which generalizes more usefully. The naïve model generalizes strongly: assume that each attribute is distributed independently of any of the other attributes.

60 Independently Distributed Data. Let x[i] denote the i-th field of record x. The independently distributed assumption says that for any i, v, u_1, ..., u_M:
P(x[i]=v | x[1]=u_1, x[2]=u_2, ..., x[i-1]=u_{i-1}, x[i+1]=u_{i+1}, ..., x[M]=u_M) = P(x[i]=v)
Or in other words, x[i] is independent of {x[1], x[2], ..., x[i-1], x[i+1], ..., x[M]}. This is often written as:
x[i] ⊥ {x[1], x[2], ..., x[i-1], x[i+1], ..., x[M]}

61 A note about independence. Assume A and B are Boolean Random Variables. Then A and B are independent if and only if P(A|B) = P(A). "A and B are independent" is often notated as A ⊥ B.

62 Independence Theorems. Assume P(A|B) = P(A). Then P(A ^ B) = P(A) P(B). Assume P(A|B) = P(A). Then P(B|A) = P(B).
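As a quick numeric sanity check (an illustrative sketch, not from the slides), the equivalent product form P(A ^ B) = P(A) P(B) can be tested directly on a joint table like the one built earlier; the function and parameter names are mine:

```python
def independent(joint, i, j, tol=1e-9):
    """Check the product form of independence, P(A ^ B) == P(A) * P(B),
    for Boolean fields i and j of a joint table (rows are 0/1 tuples)."""
    pa = sum(p for row, p in joint.items() if row[i])
    pb = sum(p for row, p in joint.items() if row[j])
    pab = sum(p for row, p in joint.items() if row[i] and row[j])
    return abs(pab - pa * pb) < tol
```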

63 Independence Theorems. Assume P(A|B) = P(A). Then P(~A|B) = P(~A). Assume P(A|B) = P(A). Then P(A|~B) = P(A).

64 Multivalued Independence. For multivalued Random Variables A and B, A ⊥ B if and only if: for all u, v, P(A=u | B=v) = P(A=u). From which you can then prove things like: for all u, v, P(A=u ^ B=v) = P(A=u) P(B=v), and for all u, v, P(B=v | A=u) = P(B=v).

65 Using the Naïve Distribution. Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?

66 Using the Naïve Distribution (continued).
P(A ^ ~B ^ C ^ ~D)
= P(A | ~B ^ C ^ ~D) P(~B ^ C ^ ~D)
= P(A) P(~B ^ C ^ ~D)
= P(A) P(~B | C ^ ~D) P(C ^ ~D)
= P(A) P(~B) P(C ^ ~D)
= P(A) P(~B) P(C | ~D) P(~D)
= P(A) P(~B) P(C) P(~D)

67 Naïve Distribution General Case. Suppose x[1], x[2], ..., x[M] are independently distributed. Then:
P(x[1]=u_1, x[2]=u_2, ..., x[M]=u_M) = Π_{k=1}^{M} P(x[k]=u_k)
So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. So we can do any inference. But how do we learn a Naïve Density Estimator?

68 Learning a Naïve Density Estimator.
P^(x[i]=u) = (# records in which x[i]=u) / (total number of records)
Another trivial learning algorithm!
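Both the learner and the row-probability computation of the previous slide fit in a few lines; a sketch, assuming records are tuples of discrete values (the names are mine):

```python
from collections import Counter

def learn_naive(records):
    """For each field i: P^(x[i]=u) = (# records with x[i]=u) / (total records)."""
    total, m = len(records), len(records[0])
    return [{u: c / total for u, c in Counter(r[i] for r in records).items()}
            for i in range(m)]

def naive_row_prob(marginals, row):
    """P^(x[1]=u_1 ^ ... ^ x[M]=u_M) = product of the per-field marginals."""
    p = 1.0
    for i, u in enumerate(row):
        p *= marginals[i].get(u, 0.0)
    return p
```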

69 Contrast.
Joint DE: Can model anything. No problem to model "C is a noisy copy of A". Given 100 records and more than 6 Boolean attributes, it will screw up badly.
Naïve DE: Can model only very boring distributions. "C is a noisy copy of A" is outside Naïve's scope. Given 100 records and 10,000 multivalued attributes, it will be fine.

70 Reminder: The Good News. We have two ways to learn a Density Estimator from data. (*In other lectures we'll see vastly more impressive Density Estimators: Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more.) Density estimators can do many good things: anomaly detection; inference, P(E1|E2) (Automatic Doctor / Help Desk etc.); ingredient for Bayes Classifiers.

71 How to build a Bayes Classifier. Assume you want to predict output Y, which has arity n_Y and values v_1, v_2, ..., v_{n_Y}. Assume there are m input attributes called X_1, X_2, ..., X_m. Break the dataset into n_Y smaller datasets called DS_1, DS_2, ..., DS_{n_Y}, where DS_i = the records in which Y = v_i. For each DS_i, learn a Density Estimator M_i to model the input distribution among the Y = v_i records.

72 How to build a Bayes Classifier (continued). M_i estimates P(X_1, X_2, ..., X_m | Y = v_i).

73 How to build a Bayes Classifier (continued). Idea: when a new set of input values (X_1 = u_1, X_2 = u_2, ..., X_m = u_m) comes along to be evaluated, predict the value of Y that makes P(X_1, ..., X_m | Y = v) most likely:
Y^predict = argmax_v P(X_1=u_1, ..., X_m=u_m | Y=v)
Is this a good idea?

74 This is a Maximum Likelihood classifier. It can get silly if some values of Y are very unlikely.

75 Much Better Idea: predict the value of Y that makes P(Y = v | X_1, ..., X_m) most likely:
Y^predict = argmax_v P(Y=v | X_1=u_1, ..., X_m=u_m)
Is this a good idea?

76 Terminology.
MLE (Maximum Likelihood Estimator): Y^predict = argmax_v P(X_1=u_1, ..., X_m=u_m | Y=v)
MAP (Maximum A-Posteriori Estimator): Y^predict = argmax_v P(Y=v | X_1=u_1, ..., X_m=u_m)

77 Getting what we need: Y^predict = argmax_v P(Y=v | X_1=u_1, ..., X_m=u_m)

78 Getting a posterior probability.
P(Y=v | X_1=u_1, ..., X_m=u_m)
= P(X_1=u_1, ..., X_m=u_m | Y=v) P(Y=v) / P(X_1=u_1, ..., X_m=u_m)
= P(X_1=u_1, ..., X_m=u_m | Y=v) P(Y=v) / Σ_{j=1}^{n_Y} P(X_1=u_1, ..., X_m=u_m | Y=v_j) P(Y=v_j)

79 Bayes Classifiers in a nutshell.
1. Learn the distribution over inputs for each value v of Y.
2. This gives P(X_1, X_2, ..., X_m | Y=v_i).
3. Estimate P(Y=v_i) as the fraction of records with Y=v_i.
4. For a new prediction:
Y^predict = argmax_v P(Y=v | X_1=u_1, ..., X_m=u_m) = argmax_v P(X_1=u_1, ..., X_m=u_m | Y=v) P(Y=v)

80 Bayes Classifiers in a nutshell (continued). We can use our favorite Density Estimator for step 1. Right now we have two options: the Joint Density Estimator and the Naïve Density Estimator.
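A sketch of the step-4 decision rule, assuming the per-class priors and density estimators M_i have already been learned (e.g. with either learner above); the names are mine, and the argmax ignores the shared denominator P(X_1=u_1, ..., X_m=u_m) since it does not affect the winner:

```python
def bayes_classify(inputs, priors, estimators):
    """Y^predict = argmax_v P(X=inputs | Y=v) * P(Y=v).
    priors:     {v: P(Y=v)}, estimated as fractions of the training records.
    estimators: {v: callable returning P^(X=inputs | Y=v)} -- one density
                estimator M_i per class, joint or naive."""
    return max(priors, key=lambda v: estimators[v](inputs) * priors[v])
```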

81 Joint Density Bayes Classifier.
Y^predict = argmax_v P(X_1=u_1, ..., X_m=u_m | Y=v) P(Y=v)
In the case of the joint Bayes Classifier this degenerates to a very simple rule: Y^predict = the most common value of Y among records in which X_1=u_1, X_2=u_2, ..., X_m=u_m. Note that if no records have the exact set of inputs X_1=u_1, ..., X_m=u_m, then P(X_1, ..., X_m | Y=v_i) = 0 for all values of Y. In that case we just have to guess Y's value.

82 Naïve Bayes Classifier.
Y^predict = argmax_v P(Y=v) P(X_1=u_1, ..., X_m=u_m | Y=v)
In the case of the naïve Bayes Classifier this can be simplified to:
Y^predict = argmax_v P(Y=v) Π_{j=1}^{m} P(X_j=u_j | Y=v)

83 Naïve Bayes Classifier (continued). Technical Hint: if you have 10,000 input attributes, that product will underflow in floating point math. You should use logs:
Y^predict = argmax_v [ log P(Y=v) + Σ_{j=1}^{m} log P(X_j=u_j | Y=v) ]
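A log-space sketch of this hint, assuming `marginals[v]` is the per-attribute distribution list learned (as in `learn_naive` above) from the records with Y = v; the names are mine:

```python
import math

def naive_bayes_predict(inputs, priors, marginals):
    """argmax_v [ log P(Y=v) + sum_j log P(X_j=u_j | Y=v) ], computed in
    log space so that thousands of small factors cannot underflow."""
    def score(v):
        s = math.log(priors[v])
        for j, u in enumerate(inputs):
            p = marginals[v][j].get(u, 0.0)
            if p == 0:
                return float("-inf")
            s += math.log(p)
        return s
    return max(priors, key=score)
```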

84 BC Results: "XOR". The "XOR" dataset consists of 40,000 records and 2 Boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b. The Classifier learned by "Joint BC". The Classifier learned by "Naïve BC".

85 Naïve BC Results: "All Irrelevant". The "all irrelevant" dataset consists of 40,000 records and 15 Boolean attributes called a, b, c, d, ..., o, where a, b, c, ... are generated 50-50 randomly as 0 or 1. output = 1 with probability 0.75, 0 with probability 0.25. The Classifier learned by "Naïve BC".

86 More Facts About Bayes Classifiers. Many other density estimators can be slotted in*. Density estimation can be performed with real-valued inputs*. Bayes Classifiers can be built with real-valued inputs*. Rather Technical Complaint: Bayes Classifiers don't try to be maximally discriminative---they merely try to honestly model what's going on*. Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help*. Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully! (*See future Andrew Lectures.)

87 What you should know. Probability: fundamentals of Probability and Bayes Rule; what a Joint Distribution is; how to do inference (i.e. P(E1|E2)) once you have a JD. Density Estimation: what DE is and what it is good for; how to learn a Joint DE; how to learn a naïve DE.

88 What you should know (continued). Bayes Classifiers: how to build one; how to predict with a BC; the contrast between naïve and joint BCs.
