Probability Theory Review
1 Probability Theory Review. Brendan O'Connor. Recitation, Sept 11 & 12.
2 Mathematical tools for machine learning: probability theory, linear algebra, calculus. Wikipedia is a great reference.
3 Probability theory in AI. Fundamental problem: reasoning about uncertainty. "Uncertainty in Artificial Intelligence" has a history of non-probabilistic approaches: fuzzy logic, nonmonotonic reasoning, Dempster-Shafer, three-valued logics... Pearl 1988 (among others): probability theory is the way to go. (Judea Pearl: 2011 Turing Award winner.) Probabilistic approaches completely dominate modern machine learning.
4 Why probability theory? Dutch book argument (Bruno de Finetti): if you violate these rules, you will lose money. Kolmogorov axioms (Andrey Kolmogorov). Cox axioms: probabilities of boolean variables, as an extension of logic.
5-7 Cox axioms. Probability theory as an extension of boolean logic. Let B(X|Z) be a degree-of-belief function: belief in proposition X, conditional on background knowledge Z. Assume:
1. Degrees of belief are well-ordered.
2. B(~X) = f[ B(X) ]
3. B(X and Y) = g( B(X), B(Y|X) )
Then B(.) can be mapped to P(.) where:
1. P(true) = 1
2. P(false) = 0
3. P(X and Y) = P(X) P(Y|X)
4. P(~X) = 1 - P(X)
Equivalently on the log scale (G = log P):
1. G(true) = 0
2. G(false) = -infinity
3. G(X and Y) = G(X) + G(Y|X)
4. G(~X) = log(1 - exp[G(X)])
(Note: there are controversies -- see references.)
8 Many views of probability. The meaning of a probability: epistemic (knowledge/belief) vs. long-run frequencies. Related: Bayesian vs. frequentist statistics.
9 Discrete random variables. Two views of an RV: (1) the outcome of a random process, or (2) it has a true value that you don't know.
Binary (boolean) variable: Values(A) = {0, 1}, or {-1, +1}, or any two symbols (the slide's example symbols were images, lost in this transcription).
Categorical variable: Values(A) = {0, 1, 2}, or {55, 32, 10}, or any three symbols.
Probability (mass) function: P(A=a) -> [0, 1], e.g. P(A=1) = 0.7, P(A=0) = 0.3.
Many notations: P_A(a), P(A), P(a).
10 Crucial definitions/laws. They hold for all RVs. Make sure you know them!
Conditional probability: P(A|B) = P(AB) / P(B)
Chain rule: P(AB) = P(A|B) P(B)
Law of total probability: P(A) = sum_b P(A, B=b) = sum_b P(A|B=b) P(B=b)
Disjunction (union): P(A or B) = P(A) + P(B) - P(AB)
Negation (complement): P(~A) = 1 - P(A)
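These laws can be sanity-checked mechanically. A minimal Python sketch, using a made-up joint table over two binary variables (the numbers are not from the slides):

```python
# Joint distribution over two binary variables A, B (invented for illustration).
P = {(0, 0): 0.10, (0, 1): 0.30,
     (1, 0): 0.20, (1, 1): 0.40}

def p_A(a):                      # marginal via law of total probability
    return sum(P[(a, b)] for b in (0, 1))

def p_B(b):
    return sum(P[(a, b)] for a in (0, 1))

def p_A_given_B(a, b):           # definition of conditional probability
    return P[(a, b)] / p_B(b)

# Chain rule: P(AB) = P(A|B) P(B)
assert abs(P[(1, 1)] - p_A_given_B(1, 1) * p_B(1)) < 1e-12

# Law of total probability (conditional form): P(A) = sum_b P(A|B=b) P(B=b)
assert abs(p_A(1) - sum(p_A_given_B(1, b) * p_B(b) for b in (0, 1))) < 1e-12

# Disjunction: P(A or B) = P(A) + P(B) - P(AB)  (reading "A" as A=1, "B" as B=1)
p_or = sum(p for (a, b), p in P.items() if a == 1 or b == 1)
assert abs(p_or - (p_A(1) + p_B(1) - P[(1, 1)])) < 1e-12

# Negation: P(~A) = 1 - P(A)
assert abs(p_A(0) - (1 - p_A(1))) < 1e-12
print("all laws check out")
```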
11 Three variables A, B, C, with a joint probability table over their value combinations (the table's entries were images and are lost in this transcription, apart from one visible value, 0.10).
Example: Titanic passengers, cross-tabulated by Class (1st / lower), Gender (male / female), and Age (adult / child) against Survival (no / yes); the counts are likewise lost.
Any close conditional independences? P(Survive | adult, lower, female) ≈ P(Survive | child, lower, female), i.e. Survive ⊥ Age | {lower, female}.
12-15 Data: 1035 (A,B) pairs. (The two values of A and of B were shown as image symbols, lost in this transcription; write them a1, a2 and b1, b2 below.)
Empirical marginals: P(A=a1) = #{A=a1} / 1035 and P(A=a2) = #{A=a2} / 1035 (the counts themselves are lost).
Sum rule: sum over a in Values(A) of P(A=a) = 1, so P(A=a1) + P(A=a2) = 1.
Negation (complement): P(A != a1) = 1 - P(A=a1).
16 Data: 1035 (A,B) pairs. Disjunctions:
P(A=a1 or B=b1) = #{A=a1 or B=b1} / N.
Equivalently, by P(A or B) = P(A) + P(B) - P(AB):
P(A=a1 or B=b1) = P(A=a1) + P(B=b1) - P(A=a1, B=b1). (The numeric values are lost in this transcription.)
17-22 Data: 1035 (A,B) pairs. Joint and conditional probabilities:
Joint: P(A=a1, B=b1) = #{A=a1, B=b1} / N.
Conditional: P(A=a1 | B=b1) = #{A=a1, B=b1} / #{B=b1} = 15 / #{B=b1} (the remaining counts are lost in this transcription).
Every probability has a universe, or background knowledge: P(A=a1, B=b1, in the table); P(A=a1 | B=b1, and in the table).
Interpretation: the fraction of worlds in which B is true that also have A true.
Definition of conditional probability: P(A|B) = P(AB) / P(B), so P(A=a1 | B=b1) = P(A=a1, B=b1) / P(B=b1).
Chain rule: P(AB) = P(A|B) P(B).
23-24 Data: 1035 (A,B) pairs. Law of total probability: P(A) = sum_b P(A, B=b).
Form 1: break A into B subgroups and add the joint probs across subgroups: P(A=a1) = P(A=a1, B=b1) + P(A=a1, B=b2).
Form 2: break the universe into B subgroups and average the conditional probs across subgroups, P(A) = sum_b P(A|B=b) P(B=b): P(A=a1) = P(A=a1|B=b1) P(B=b1) + P(A=a1|B=b2) P(B=b2).
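The two forms of the law of total probability can be checked on counts. A sketch using hypothetical counts (only the total 1035 and the joint count 15 come from the slides; the rest are invented to fill the table):

```python
# Hypothetical count table over two made-up values of A and B.
N = 1035
counts = {("a1", "b1"): 15,  ("a1", "b2"): 300,   # 15 is from the slides; rest invented
          ("a2", "b1"): 120, ("a2", "b2"): 600}
assert sum(counts.values()) == N

def P_B(b):
    return sum(v for (a, bb), v in counts.items() if bb == b) / N

def P_A_given_B(a, b):
    return counts[(a, b)] / sum(v for (aa, bb), v in counts.items() if bb == b)

# Form 1: add joint probs across the B subgroups.
p_a1_joint = counts[("a1", "b1")] / N + counts[("a1", "b2")] / N

# Form 2: average conditional probs, weighted by P(B=b).
p_a1_cond = sum(P_A_given_B("a1", b) * P_B(b) for b in ("b1", "b2"))

assert abs(p_a1_joint - p_a1_cond) < 1e-12
print(round(p_a1_joint, 4))   # → 0.3043
```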
25 Crucial definitions/laws (recap). They hold for all RVs.
Conditional probability: P(A|B) = P(AB) / P(B). Chain rule: P(AB) = P(A|B) P(B).
Law of total probability: P(A) = sum_b P(A, B=b) = sum_b P(A|B=b) P(B=b).
Disjunction (union): P(A or B) = P(A) + P(B) - P(AB). Negation (complement): P(~A) = 1 - P(A).
26-28 Background-conditional variants: every law still holds when everything is further conditioned on background knowledge Z.
P(~A) = 1 - P(A) becomes P(~A|Z) = 1 - P(A|Z).
P(AB) = P(A|B) P(B) becomes P(AB|Z) = P(A|BZ) P(B|Z). Etc.
29-30 Independence. A and B are independent when P(A|B) = P(A).
Alternate formulation: P(AB) = P(A|B) P(B) = P(A) P(B).
31-32 Independence and the chain rule. In general: P(ABC) = P(A|BC) P(B|C) P(C).
But when A, B, C are independent: P(ABC) = P(A) P(B) P(C).
If you have independence, you can estimate the joint probability of thousands of RVs.
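To see why independence helps at scale, here is a small sketch (the variable count and probabilities are invented): with n independent binary variables, the joint is specified by just n marginal numbers, yet it still defines a valid distribution over all 2^n outcomes.

```python
import itertools
import random

random.seed(0)
n = 12
p = [random.random() for _ in range(n)]      # p[i] = P(X_i = 1), one number per RV

def joint(x):
    """P(X_1..X_n = x) under full independence: a product of marginals."""
    out = 1.0
    for xi, pi in zip(x, p):
        out *= pi if xi == 1 else 1 - pi
    return out

# The 2^12 = 4096 joint probabilities sum to 1, but we never stored that table.
total = sum(joint(x) for x in itertools.product((0, 1), repeat=n))
assert abs(total - 1.0) < 1e-9
```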
33 Notation. All equivalent shorthands for the joint: P(AB), P(A ∧ B), P(A, B), spelled out as P(A=a, B=b); similarly P(A|B) abbreviates P(A=a | B=b).
P(A) = sum_b P(A, B=b) abbreviates: P(A=a) = sum_b P(A=a, B=b), for each value a.
34-35 Discrete vs. continuous RVs.
Discrete: probability mass function. P(X=x) ∈ [0, 1], and sum over x in Values(X) of P(X=x) = 1.
Continuous: probability density function p(x) ∈ [0, ∞). There is no probability at one point! P(X ∈ [a, b]) = integral from a to b of p(x) dx, and the integral of p(x) over the whole range is 1.
Quiz: can p(x) > 1? Yes -- a density is bounded below by 0 but not above. (The slide shows the Beta(x; 4, 2) density as an example.)
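The quiz can be answered numerically. A sketch of the slide's Beta(4, 2) density using only the standard library (the grid-integration check is my addition):

```python
from math import gamma

def beta_pdf(x, a=4, b=2):
    """Beta(x; a, b) density on [0, 1]; normalizing constant is 20 for (4, 2)."""
    const = gamma(a + b) / (gamma(a) * gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

# Density at the mode (a-1)/(a+b-2) = 0.75 exceeds 1:
print(beta_pdf(0.75))   # → 2.109375

# Yet it still integrates to 1 (crude Riemann sum):
dx = 1e-4
area = sum(beta_pdf(i * dx) * dx for i in range(1, 10000))
assert abs(area - 1.0) < 1e-3
```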
36 Bayes Rule. Want P(H|E) but only have P(E|H) -- why? e.g. P(E|H) has been measured, or H causes E. Example: H = have the disease? E = outcome of a medical test.
Since P(H|E) P(E) = P(E|H) P(H):
P(H|E) = P(E|H) P(H) / P(E)
Posterior = Likelihood × Prior / Normalizer. (Rev. Thomas Bayes; Pierre-Simon Laplace.)
37 Bayes Rule and its pesky denominator.
P(H|E) = P(E|H) P(H) / P(E) = P(E|H) P(H) / sum_h' P(E|h') P(h') = (1/Z) P(E|H) P(H).
Z: whatever lets the posterior, summed across h, sum to 1. Also written P(H|E) ∝ P(E|H) P(H): the unnormalized posterior, which by itself does not sum to 1!
Z is the Zustandssumme, "sum over states", from 19th-century statistical physics. (It is called a partition function for certain types of models, e.g. Markov random fields.) Computing Z can be very difficult in high dimensions -- tons of work in statistical machine learning is essentially devoted to computing Z or variants of it.
38-42 Bayes Rule: discrete example, over hypotheses a, b, c.
Prior P(H=h): [0.2, 0.2, 0.6] -- sums to 1.
Multiply by likelihood P(E|H=h): [0.2, 0.05, 0.05] -- need not sum to 1.
Unnormalized posterior P(E|H=h) P(H=h): [0.04, 0.01, 0.03] -- does not sum to 1.
Normalize: posterior (1/Z) P(E|H=h) P(H=h): [0.500, 0.125, 0.375] -- sums to 1.
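The table above is easy to reproduce. A sketch of the multiply-then-normalize recipe using the slide's numbers:

```python
# Discrete Bayes rule: posterior = normalized (likelihood × prior).
hyps       = ["a", "b", "c"]
prior      = [0.2, 0.2, 0.6]
likelihood = [0.2, 0.05, 0.05]          # P(E | H=h); need not sum to 1

unnorm = [l * p for l, p in zip(likelihood, prior)]
Z = sum(unnorm)                          # the "pesky denominator" P(E)
posterior = [u / Z for u in unnorm]

print([round(u, 3) for u in unnorm])      # → [0.04, 0.01, 0.03]
print([round(p, 3) for p in posterior])   # → [0.5, 0.125, 0.375]
```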
43-44 Bayes Rule: discrete, uniform prior. A uniform distribution is an uninformative prior.
Prior P(H=h): [0.33, 0.33, 0.33] -- sums to 1.
Likelihood P(E|H=h): [0.2, 0.05, 0.05] -- need not sum to 1.
Unnormalized posterior: [0.066, 0.016, 0.016] -- does not sum to 1.
Normalized posterior: [0.66, 0.16, 0.16] -- sums to 1. A uniform prior implies the posterior is just the renormalized likelihood.
45 Bayes Rule: continuous, with a prior. Prior: Beta(2, 2) density over θ. Likelihood L(θ) for data of 0 heads, 5 tails. Multiply (likelihood × prior = unnormalized posterior, does not integrate to 1), then normalize (posterior = likelihood × prior / Z, integrates to 1). (The slide's plots are lost in this transcription.)

46 Bayes Rule: continuous, uniform prior. A uniform distribution is an uninformative prior: the posterior is just the renormalized likelihood. Same data (0 heads, 5 tails) and the same multiply-then-normalize steps. (Plots lost in transcription.)
47 Bayes Rule: more data swamps the prior. Shown for 1 head & 1 tail, 3 heads & 1 tail, and 8 heads & 1 tail:
Prior: p_Beta(θ; 2, 2) = (1/Z) θ (1-θ)
Likelihood: θ^#H (1-θ)^#T
Posterior: (1/Z) θ^(#H+1) (1-θ)^(#T+1)
As the counts grow, the likelihood dominates and the posterior concentrates near the empirical frequency.
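The swamping effect can be read off the conjugate update. A sketch assuming the standard Beta-Bernoulli result (a Beta(2, 2) prior plus #H heads and #T tails gives posterior Beta(2 + #H, 2 + #T)); the final (80, 10) data point is my own extrapolation beyond the slide's examples:

```python
def posterior_mean(n_heads, n_tails, a=2, b=2):
    """Mean of the Beta(a + #H, b + #T) posterior for a Bernoulli parameter."""
    return (a + n_heads) / (a + b + n_heads + n_tails)

# As data accumulate, the posterior mean moves from the prior mean (0.5)
# toward the empirical frequency of heads.
for h, t in [(1, 1), (3, 1), (8, 1), (80, 10)]:
    print(h, t, round(posterior_mean(h, t), 3))
```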
48-49 Bayes Rule: odds-ratio form.
Probability form: P(H|E) = P(E|H) P(H) / P(E); posterior prob, likelihood, and prior prob each lie in [0, 1].
Odds form: P(h1|E) / P(h2|E) = [P(E|h1) / P(E|h2)] × [P(h1) / P(h2)] -- posterior odds = likelihood ratio × prior odds, each in [0, inf).
Binary probability rescalings (with the value for a fair coin flip):
- prob: p, range [0, 1]; coin flip: 1/2
- log-prob: log(p), range (-inf, 0]; coin flip: log(1/2)
- odds: p / (1-p), range [0, +inf); coin flip: 1
- log-odds (logit): log[p / (1-p)], range (-inf, +inf); coin flip: 0
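The rescalings convert back and forth mechanically. A sketch with helper names and example numbers of my own choosing (not from the slides):

```python
from math import exp, isclose, log

def odds(p):        return p / (1 - p)
def logit(p):       return log(p / (1 - p))
def inv_logit(z):   return 1 / (1 + exp(-z))   # a.k.a. the logistic sigmoid

p = 0.5                        # a fair coin flip
assert odds(p) == 1.0          # odds 1:1
assert logit(p) == 0.0         # log-odds 0
assert isclose(inv_logit(logit(0.9)), 0.9)   # round trip

# Bayes in odds form: posterior odds = likelihood ratio × prior odds.
prior_odds = odds(0.2)                  # prior prob 0.2 → odds 0.25 (illustrative)
lr = 0.2 / 0.05                         # likelihood ratio of 4 (illustrative)
post_odds = lr * prior_odds             # → 1.0, i.e. posterior prob 0.5
assert isclose(post_odds / (1 + post_odds), 0.5)
```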
50 i.i.d. likelihood. Bernoulli random process: each X_i comes out H with probability θ, else comes out T.
Likelihood of θ -- the probability of the data under parameter θ:
L(θ) = P(X_1, X_2, X_3, ..., X_n | θ)
     = prod_i P(X_i | θ)                     (Independent...)
     = prod_i {θ if X_i = H else 1-θ}        (...and Identically Distributed!)
     = prod_i θ^1{X_i=H} (1-θ)^1{X_i=T}
     = θ^#{X_i=H} (1-θ)^#{X_i=T}
51 Maximum likelihood. If we want to believe a single point value of θ, what should we believe? How about the maximum-likelihood value of θ: the one that makes the data most probable.
L(θ) = θ^#{X_i=H} (1-θ)^#{X_i=T}
argmax_θ L(θ) = argmax_θ log L(θ) -- most important trick: take the log! Same argmax, since log is monotonic.
log L(θ) = #{X_i=H} log θ + #{X_i=T} log(1-θ)
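The log trick can be checked numerically. A sketch with invented counts (7 heads, 3 tails), comparing a grid maximizer of the log-likelihood against the closed-form MLE #H / (#H + #T):

```python
from math import log

heads, tails = 7, 3   # invented data

def log_lik(theta):
    """log L(θ) = #H log θ + #T log(1-θ), for the Bernoulli likelihood above."""
    return heads * log(theta) + tails * log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]    # avoid θ = 0 and θ = 1
theta_hat = max(grid, key=log_lik)
print(theta_hat)   # → 0.7, matching heads / (heads + tails)
```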
52 Maximum a posteriori (MAP).
p(θ|D) = P(D|θ) p(θ) / P(D), so p(θ|D) ∝ P(D|θ) p(θ), and
argmax_θ p(θ|D) = argmax_θ P(D|θ) p(θ).
Another trick: the unnormalized posterior is good enough for MAP.
54 Entropy.
H(X) = - sum_x p(x) log p(x)
H(X) = E[-log p(X)]
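A minimal sketch of the entropy formula in bits (base-2 log; the example distributions are mine):

```python
from math import log2

def entropy(ps):
    """H(X) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))    # → 1.0  (fair coin: one bit)
print(entropy([0.25] * 4))    # → 2.0  (four equal outcomes: two bits)
```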
More informationConsider an experiment that may have different outcomes. We are interested to know what is the probability of a particular set of outcomes.
CMSC 310 Artificial Intelligence Probabilistic Reasoning and Bayesian Belief Networks Probabilities, Random Variables, Probability Distribution, Conditional Probability, Joint Distributions, Bayes Theorem
More informationRandom Variables Chris Piech CS109, Stanford University. Piech, CS106A, Stanford University
Random Variables Chris Piech CS109, Stanford University Piech, CS106A, Stanford University Piech, CS106A, Stanford University Let s Play a Game Game set-up We have a fair coin (come up heads with p = 0.5)
More informationProbability theory basics
Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:
More informationIntroduction to Machine Learning
Uncertainty Introduction to Machine Learning CS4375 --- Fall 2018 a Bayesian Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell Most real-world problems deal with
More informationWith Question/Answer Animations. Chapter 7
With Question/Answer Animations Chapter 7 Chapter Summary Introduction to Discrete Probability Probability Theory Bayes Theorem Section 7.1 Section Summary Finite Probability Probabilities of Complements
More informationUncertainty and knowledge. Uncertainty and knowledge. Reasoning with uncertainty. Notes
Approximate reasoning Uncertainty and knowledge Introduction All knowledge representation formalism and problem solving mechanisms that we have seen until now are based on the following assumptions: All
More informationUNCERTAINTY. In which we see what an agent should do when not all is crystal-clear.
UNCERTAINTY In which we see what an agent should do when not all is crystal-clear. Outline Uncertainty Probabilistic Theory Axioms of Probability Probabilistic Reasoning Independency Bayes Rule Summary
More informationExercise 1: Basics of probability calculus
: Basics of probability calculus Stig-Arne Grönroos Department of Signal Processing and Acoustics Aalto University, School of Electrical Engineering stig-arne.gronroos@aalto.fi [21.01.2016] Ex 1.1: Conditional
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationLecture 10: Introduction to reasoning under uncertainty. Uncertainty
Lecture 10: Introduction to reasoning under uncertainty Introduction to reasoning under uncertainty Review of probability Axioms and inference Conditional probability Probability distributions COMP-424,
More informationChapter 2 Classical Probability Theories
Chapter 2 Classical Probability Theories In principle, those who are not interested in mathematical foundations of probability theory might jump directly to Part II. One should just know that, besides
More informationProbabilistic Reasoning. Doug Downey, Northwestern EECS 348 Spring 2013
Probabilistic Reasoning Doug Downey, Northwestern EECS 348 Spring 2013 Limitations of logic-based agents Qualification Problem Action s preconditions can be complex Action(Grab, t) => Holding(t).unless
More informationBayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007
Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.
More informationProbability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014
Probability Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh August 2014 (All of the slides in this course have been adapted from previous versions
More informationProbability Review and Naïve Bayes
Probability Review and Naïve Bayes Instructor: Alan Ritter Some slides adapted from Dan Jurfasky and Brendan O connor What is Probability? The probability the coin will land heads is 0.5 Q: what does this
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationBayesian Networks BY: MOHAMAD ALSABBAGH
Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional
More informationSociology 6Z03 Topic 10: Probability (Part I)
Sociology 6Z03 Topic 10: Probability (Part I) John Fox McMaster University Fall 2014 John Fox (McMaster University) Soc 6Z03: Probability I Fall 2014 1 / 29 Outline: Probability (Part I) Introduction Probability
More informationBasics of Bayesian analysis. Jorge Lillo-Box MCMC Coffee Season 1, Episode 5
Basics of Bayesian analysis Jorge Lillo-Box MCMC Coffee Season 1, Episode 5 Disclaimer This presentation is based heavily on Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python
More informationLecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan
Lecture 9: Naive Bayes, SVM, Kernels Instructor: Outline 1 Probability basics 2 Probabilistic Interpretation of Classification 3 Bayesian Classifiers, Naive Bayes 4 Support Vector Machines Probability
More informationThe enumeration of all possible outcomes of an experiment is called the sample space, denoted S. E.g.: S={head, tail}
Random Experiment In random experiments, the result is unpredictable, unknown prior to its conduct, and can be one of several choices. Examples: The Experiment of tossing a coin (head, tail) The Experiment
More informationLecture 23 Maximum Likelihood Estimation and Bayesian Inference
Lecture 23 Maximum Likelihood Estimation and Bayesian Inference Thais Paiva STA 111 - Summer 2013 Term II August 7, 2013 1 / 31 Thais Paiva STA 111 - Summer 2013 Term II Lecture 23, 08/07/2013 Lecture
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationTime Series and Dynamic Models
Time Series and Dynamic Models Section 1 Intro to Bayesian Inference Carlos M. Carvalho The University of Texas at Austin 1 Outline 1 1. Foundations of Bayesian Statistics 2. Bayesian Estimation 3. The
More information