KEY CONCEPTS IN PROBABILITY: SMOOTHING, MLE, AND MAP



1 KEY CONCEPTS IN PROBABILITY: SMOOTHING, MLE, AND MAP

2 Outline
MAPs and MLEs (catch-up from last week)
Joint distributions (a new learner)
Naïve Bayes (another new learner)

3 Administrivia
Homework: due tomorrow; hardcopy and Autolab submission (see wiki).
Texts: Mitchell or Murphy are optional. This week: an update from Tom Mitchell's long-expected new edition.
Bishop is also excellent if you prefer, but a little harder to skip around in.
Pick one or the other (both is overkill). The main differences are in notation, not content.

4 Some practical problems
I bought a loaded d20 on eBay, but it didn't come with any useful specs. How can I find out how it behaves?
[Figure: histogram of frequency vs. face shown]
1. Collect some data (20 rolls).
2. Estimate $\hat{\Pr}(i) = C(\text{rolls of } i) / C(\text{any roll})$.
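
The count-and-divide estimate on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `mle_estimate` and the 20 rolls are made up, not taken from the lecture.

```python
from collections import Counter


def mle_estimate(rolls, num_faces=20):
    """Return {face: C(rolls of face) / C(any roll)} for faces 1..num_faces."""
    if not rolls:
        raise ValueError("need at least one roll")
    counts = Counter(rolls)
    total = len(rolls)
    return {face: counts[face] / total for face in range(1, num_faces + 1)}


# Hypothetical data: 20 rolls of the loaded d20 (invented for illustration).
rolls = [20, 20, 3, 20, 7, 20, 1, 20, 20, 15,
         20, 3, 20, 9, 20, 20, 2, 20, 11, 20]
probs = mle_estimate(rolls)
```

Note that any face that never appears in the 20 rolls is assigned probability exactly zero, which is hard to believe for a die that has only been rolled 20 times.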

5 A better solution
I bought a loaded d20 on eBay, but it didn't come with any specs. How can I find out how it behaves?
[Figure: histogram of frequency vs. face shown]
0. Imagine some data (20 rolls, each face i shows up once).
1. Collect some data (20 rolls).
2. Estimate $\hat{\Pr}(i) = C(\text{rolls of } i) / C(\text{any roll})$, counting the imagined rolls together with the real ones.

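6 A better solution?
Q: What if I imagined m rolls, with a probability of q = 1/20 of rolling any i?
With 20 imagined rolls (one per face): $\hat{\Pr}(i) = \frac{C(i) + 1}{C(\mathrm{ANY}) + C(\mathrm{IMAGINED})}$
In general: $\hat{\Pr}(i) = \frac{C(i) + mq}{C(\mathrm{ANY}) + m}$
I can use this formula with m > 20, or even with m < 20, say with m = 1.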
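
A minimal sketch of the smoothed estimate $(C(i) + mq) / (C(\mathrm{ANY}) + m)$, assuming a uniform q = 1/20. The function name `smoothed_estimate` and the sample rolls are hypothetical.

```python
from collections import Counter


def smoothed_estimate(rolls, m, num_faces=20):
    """Return {face: (C(face) + m*q) / (C(ANY) + m)} with uniform q = 1/num_faces."""
    q = 1.0 / num_faces
    counts = Counter(rolls)
    total = len(rolls)
    if total + m == 0:
        raise ValueError("need real or imagined rolls")
    return {face: (counts[face] + m * q) / (total + m)
            for face in range(1, num_faces + 1)}


# Hypothetical data: twelve 20s, two 3s, and six other faces once each.
rolls = [20] * 12 + [3] * 2 + [1, 2, 7, 9, 11, 15]
add_one = smoothed_estimate(rolls, m=20)   # 20 imagined rolls, one per face
weak_prior = smoothed_estimate(rolls, m=1)  # a single imagined roll, spread evenly
```

With m = 20 an unseen face gets (0 + 1) / (20 + 20); with m = 1 it still gets a small nonzero probability, and with m = 0 the formula reduces to the plain count ratio.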

7 Terminology (more later)
This is called a uniform Dirichlet prior.
C(i) and C(ANY) are sufficient statistics.
$\hat{\Pr}(i) = \frac{C(i) + mq}{C(\mathrm{ANY}) + m}$
Tom's notes are different.
MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate

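8 Some differences
William: estimate each probability Pr(i) associated with a multinomial with MLE as
$\hat{\Pr}(i) = \frac{C(i)}{C(\mathrm{ANY})}$, for C(i) = count of times you saw i,
and estimate with MAP as
$\hat{\Pr}(i) = \frac{C(i) + mq}{C(\mathrm{ANY}) + m}$
Tom: estimate $\theta = P(\text{heads})$ for a binomial with MLE as
$\hat{\theta} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$, where $\alpha_1$ = #heads and $\alpha_0$ = #tails,
and with MAP as
$\hat{\theta} = \frac{\alpha_1 + \gamma_1}{(\alpha_1 + \gamma_1) + (\alpha_0 + \gamma_0)}$, where $\gamma_1$ = #imaginary heads and $\gamma_0$ = #imaginary tails.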

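9 Some apparent differences
William: $\hat{\Pr}(i) = \frac{C(i) + mq}{C(\mathrm{ANY}) + m}$
Tom: estimate $\theta = P(\text{heads})$ for a binomial from #heads, #tails, #imaginary heads, and #imaginary tails.
The two agree under the correspondence
$C(i) = \alpha_1$, $C(\mathrm{ANY}) = \alpha_0 + \alpha_1$, $m = \gamma_0 + \gamma_1$ (confidence in the prior), $q = \gamma_1 / (\gamma_0 + \gamma_1)$ (the prior).
William's form emphasizes the prior; Tom's form emphasizes the pseudo-data.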

10 [Figure: two panels] Imagined m = 60 samples with q = 0.3; imagined m = 60 samples with q = 0.4

11 [Figure: two panels] Imagined m = 120 samples with q = 0.3; imagined m = 120 samples with q = 0.4

12 Why we call this a MAP
Simple case: replace the die with a coin. Now there's one parameter: q = P(H).
I start with a prior over q, P(q).
I get some data: D = {D1 = H, D2 = T, …}.
I compute the maximum of the posterior of q:
MAP estimate: $\arg\max_q P(q \mid D) = \arg\max_q \frac{P(D \mid q)\,P(q)}{P(D)} = \arg\max_q P(D \mid q)\,P(q)$
MLE estimate: $\arg\max_q P(D \mid q)$

13 Why we call this a MAP
Simple case: replace the die with a coin. Now there's one parameter: q = P(H).
I start with a prior over q, P(q).
I get some data: D = {D1 = H, D2 = T, …}.
I compute the posterior of q.
The math works if the pdf of P(q) is P(x) = [equation not preserved]
α+1, β+1 are counts of imaginary pos/neg examples.

14 Why we call this a MAP
The math works if the pdf is P(x) = [equation not preserved]

15 Why we call this a MAP
This is called a beta distribution.
The generalization to multinomials is called a Dirichlet distribution.
Parameters are [not preserved]; $f(x_1, \ldots, x_K)$ = [equation not preserved]
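
To see numerically that "add imaginary counts" and "maximize the posterior" coincide for a coin, here is a small sketch. It assumes the standard Beta(a, b) parametrization, with density proportional to $q^{a-1}(1-q)^{b-1}$; the slide's α, β may be indexed differently, so a and b here are this example's own names. The counts are invented.

```python
import math


def log_posterior_unnormalized(q, heads, tails, a, b):
    """log P(D|q) + log P(q), up to an additive constant, for a Beta(a, b) prior."""
    log_likelihood = heads * math.log(q) + tails * math.log(1.0 - q)
    log_prior = (a - 1) * math.log(q) + (b - 1) * math.log(1.0 - q)
    return log_likelihood + log_prior


def map_by_grid(heads, tails, a, b, steps=10000):
    """Brute-force argmax of the posterior over a grid on the open interval (0, 1)."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda q: log_posterior_unnormalized(q, heads, tails, a, b))


def map_closed_form(heads, tails, a, b):
    """Mode of Beta(heads + a, tails + b): a-1 and b-1 act as imaginary heads/tails."""
    return (heads + a - 1) / (heads + tails + a + b - 2)


def mle(heads, tails):
    return heads / (heads + tails)


# Hypothetical data: 7 heads, 3 tails; prior Beta(4, 4), i.e. 3 imaginary heads and tails.
q_grid = map_by_grid(7, 3, 4, 4)
q_closed = map_closed_form(7, 3, 4, 4)
q_mle = mle(7, 3)
```

The grid search and the closed form should agree to within the grid spacing, and both are pulled from the MLE toward the prior's center of 0.5.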

16 KEY CONCEPTS IN PROBABILITY: THE JOINT DISTRIBUTION

17 Some practical problems
I have 3 d6 dice: 1 standard fair die and 2 loaded dice, one loaded high and one loaded low.
Loaded high: P(X=6) = 0.50. Loaded low: P(X=1) = 0.50.
Experiment: pick two of the dice uniformly at random (A) and roll them. Which is more likely: rolling a seven or rolling doubles?
Three combinations: HL, HF, FL.
$P(D) = P(D \wedge A=\mathrm{HL}) + P(D \wedge A=\mathrm{HF}) + P(D \wedge A=\mathrm{FL})$
$= P(D \mid A=\mathrm{HL})\,P(A=\mathrm{HL}) + P(D \mid A=\mathrm{HF})\,P(A=\mathrm{HF}) + P(D \mid A=\mathrm{FL})\,P(A=\mathrm{FL})$

18 A brute-force solution
A    Roll 1   Roll 2   P                  Comment
FL   1        1        1/3 * 1/6 * 1/2    doubles
FL   1        2        1/3 * 1/6 * 1/10
FL   1        …
FL   1        6                           seven
FL   2        1
FL   2        …
FL   6        6                           doubles
HL   1        1
HL   1        2
…
HF   1        1                           doubles
A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk.
With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g.
P(doubles)
P(seven or eleven)
P(total is higher than 5)
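
The brute-force table can be built and queried directly. In this sketch, the assumption that each loaded die puts 1/10 on every non-favored face is inferred from the "1/3 * 1/6 * 1/10" row above; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Per-die face distributions (exact fractions avoid rounding noise).
FAIR = {face: Fraction(1, 6) for face in range(1, 7)}
HIGH = {face: Fraction(1, 2) if face == 6 else Fraction(1, 10) for face in range(1, 7)}
LOW = {face: Fraction(1, 2) if face == 1 else Fraction(1, 10) for face in range(1, 7)}

# The three equally likely pairs; the first die gives Roll 1, the second Roll 2.
PAIRS = {"HL": (HIGH, LOW), "HF": (HIGH, FAIR), "FL": (FAIR, LOW)}


def joint_table():
    """Map (A, roll1, roll2) -> P(A) * P(roll1 | die 1) * P(roll2 | die 2)."""
    table = {}
    for name, (die1, die2) in PAIRS.items():
        for r1, r2 in product(range(1, 7), repeat=2):
            table[(name, r1, r2)] = Fraction(1, 3) * die1[r1] * die2[r2]
    return table


def prob(table, event):
    """Sum the rows of the joint table on which event(A, roll1, roll2) is true."""
    return sum(p for (a, r1, r2), p in table.items() if event(a, r1, r2))


table = joint_table()
p_doubles = prob(table, lambda a, r1, r2: r1 == r2)
p_seven = prob(table, lambda a, r1, r2: r1 + r2 == 7)
p_seven_or_eleven = prob(table, lambda a, r1, r2: r1 + r2 in (7, 11))
```

Any boolean combination of the primitive events is just a different predicate passed to `prob`. Under these assumptions a seven comes out more likely than doubles.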

19 The Joint Distribution Example: Boolean variables A, B, C Recipe for making a joint distribution of M variables:

20 The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have $2^M$ rows).
[Table: columns A, B, C]

21 The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have $2^M$ rows).
2. For each combination of values, say how probable it is.
[Table: columns A, B, C, Prob]

22 The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have $2^M$ rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
[Table: columns A, B, C, Prob]

23 Estimating The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have $2^M$ rows).
2. For each combination of values, estimate how probable it is from data.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
[Table: columns A, B, C, Prob]
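
The recipe above can be sketched as follows, using plain relative frequencies for step 2. The function name `estimate_joint` and the eight data rows are invented for illustration.

```python
from collections import Counter
from itertools import product


def estimate_joint(rows, num_vars):
    """Estimate P(combo) as (number of matching rows) / (total rows)
    for every one of the 2**num_vars Boolean combinations."""
    if not rows:
        raise ValueError("need at least one data row")
    counts = Counter(rows)
    total = len(rows)
    return {combo: counts[combo] / total
            for combo in product((0, 1), repeat=num_vars)}


# Hypothetical observations of (A, B, C).
data = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0),
        (0, 0, 1), (1, 1, 1), (1, 1, 0), (0, 1, 0)]
joint = estimate_joint(data, num_vars=3)
```

The table has $2^3 = 8$ rows and its entries sum to 1 by construction; combinations that never occur in the data get probability 0.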

24 Pros and Cons of the Joint Distribution
Pro: you can do a lot with it! It can answer any query $\Pr(Y_1, Y_2, \ldots \mid X_1, X_2, \ldots)$.
Con: it takes up a lot of room!
Con: it takes a lot of data to train!
Con: it can be expensive to use.
The big question: how do you simplify (approximate, compactly store, …) the joint and still be able to answer interesting queries?
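
As an illustration of "answer any query", a conditional $\Pr(Y \mid X)$ can be read off a joint table by summing matching rows and dividing. The table values and the helper `query` below are hypothetical, chosen only so that they sum to 1.

```python
# Hypothetical joint over Boolean (A, B, C); index 0 is A, 1 is B, 2 is C.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}


def matches(combo, condition):
    """True if the row agrees with every {variable_index: value} in condition."""
    return all(combo[index] == value for index, value in condition.items())


def query(joint, target, given):
    """Pr(target | given) = Pr(target and given) / Pr(given); None if Pr(given) is 0."""
    p_given = sum(p for combo, p in joint.items() if matches(combo, given))
    if p_given == 0:
        return None
    p_both = sum(p for combo, p in joint.items()
                 if matches(combo, given) and matches(combo, target))
    return p_both / p_given


p_a_given_b = query(joint, target={0: 1}, given={1: 1})             # Pr(A=1 | B=1)
p_c_given_a_not_b = query(joint, target={2: 1}, given={0: 1, 1: 0})  # Pr(C=1 | A=1, B=0)
```

Every query scans all rows, which is one concrete form of the "expensive to use" concern: the scan grows as $2^M$ with the number of Boolean variables.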

25 Density Estimation
Our joint distribution learner is our first example of something called density estimation.
A density estimator learns a mapping from a set of attribute values to a probability.
Input Attributes → Density Estimator → Probability
Copyright Andrew W. Moore

26 Density Estimation (looking ahead)
Compare it to two other major kinds of models:
Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)
Input Attributes → Density Estimator → Probability
Input Attributes → Regressor → Prediction of real-valued output

27 Another example

28 Another example
Starting point: Google Books 5-gram data.
All 5-grams that appear >= 40 times in a corpus of 1M English books.
30 GB compressed, [size not preserved] GB uncompressed.
Each 5-gram contains a frequency distribution over years (which I ignored).
Pulled out counts for all 5-grams (A,B,C,D,E) where C=affect or C=effect, and turned this into a joint probability table.

29 Some of the Joint Distribution
(p values not preserved; [?] marks a token lost in extraction)
A      B        C        D     E
is     the      effect   of    the
is     the      effect   of    a
[?]    The      effect   of    this
to     this     effect   :     [?]
be     the      effect   of    the
not    the      effect   of    any
does   not      affect   the   general
does   not      affect   the   question
any    manner   affect   the   principle
…about 50k more rows, which summarize 90M 5-gram instances in text.

30 Example queries
Pr(C)?
[Table: c vs. Pr(C=c) for C=effect, C=affect, C=Effect, C=EFFECT, C=effecT; values not preserved]

31 Example queries
Pr(B | C=affect)?
[Table: b vs. Pr(B=b | C=affect) for B=not, B=to, B=may, B=they, B=which; values not preserved]

32 Example queries
Pr(C | B=not, D=the)?
[Table: c vs. Pr(C=c | B=not, D=the) for C=affect, C=effect; values not preserved]

33 Density Estimation As a Classifier
Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)
Input Attributes → Density Estimator → Probability $P(X_1=x_1, \ldots, X_n=x_n)$
Input Attributes + Class Y → Density Estimator → Probabilities $P(Y=y_1 \mid X_1=x_1, \ldots, X_n=x_n), \ldots, P(Y=y_k \mid X_1=x_1, \ldots, X_n=x_n)$
Predict: $f(X_1=x_1, \ldots, X_n=x_n) = \arg\max_{y_i} P(Y=y_i \mid X_1=x_1, \ldots, X_n=x_n)$
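
A minimal sketch of turning a count-based joint estimate into a classifier: estimate $P(Y=y \mid X=x)$ from counts of (attributes, class) pairs and predict the argmax. The function names and the tiny training set are hypothetical.

```python
from collections import Counter


def train_joint_counts(examples):
    """examples: list of (attributes_tuple, label). Returns counts of each pair."""
    return Counter(examples)


def posterior(counts, attributes, labels):
    """Estimate P(Y=y | X=attributes) for each label; None if the attributes were never seen."""
    label_counts = {y: counts[(attributes, y)] for y in labels}
    total = sum(label_counts.values())
    if total == 0:
        return None
    return {y: n / total for y, n in label_counts.items()}


def predict(counts, attributes, labels):
    """Return the label with the highest estimated posterior, or None with no data."""
    post = posterior(counts, attributes, labels)
    if post is None:
        return None
    return max(post, key=post.get)


# Hypothetical training pairs: attributes are (B, D), the class is the center word C.
examples = ([(("not", "the"), "affect")] * 3
            + [(("not", "the"), "effect")]
            + [(("the", "of"), "effect")] * 4)
labels = ["affect", "effect"]
counts = train_joint_counts(examples)
guess = predict(counts, ("not", "the"), labels)
unseen = predict(counts, ("may", "all"), labels)
```

The `None` result for an unseen attribute combination is the weak spot of the brute-force approach: with many attributes, most test inputs have no matching rows at all.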

34 An experiment: how useful is the brute-force joint classifier?
Test set: extracted all uses of affect or effect in a 20k-document newswire corpus: about 723 n-grams, 661 distinct.
Tried to predict the center word C with $\arg\max_c \Pr(C=c \mid A=a, B=b, D=d, E=e)$, using the joint estimated from the Google n-gram data.

35 Poll time

36 Example queries
How many errors would I expect in 100 trials if my classifier always just guesses the most frequent class?
[Table: c vs. Pr(C=c) for C=effect, C=affect, C=Effect, C=EFFECT, C=effecT; values not preserved]

37 Performance summary
[Table: columns Pattern, Used, Errors; row P(C | A,B,D,E); values not preserved]
But: no counts at all for a,b,d,e for 622 of the 723 instances!

38 Slightly fancier idea
Tried to predict the center word with:
Pr(C | A=a, B=b, D=d, E=e)
then P(C | A,B,D) if there's no data for that
then P(C | B,D) if there's no data for that
then P(C | B)
then P(C)
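
The back-off chain above can be sketched as follows: keep one count table per context pattern and use the most specific pattern that has any data. This is not the lecture's actual code; the names `BACKOFF_ORDER`, `train`, and `predict` and the 5-gram counts are invented.

```python
from collections import Counter, defaultdict

# Context patterns from most to least specific, in the order tried on the slide.
BACKOFF_ORDER = [("A", "B", "D", "E"), ("A", "B", "D"), ("B", "D"), ("B",), ()]


def train(five_grams):
    """five_grams: iterable of ((a, b, c, d, e), count).
    Returns tables[pattern][context][c] = total count."""
    tables = {pattern: defaultdict(Counter) for pattern in BACKOFF_ORDER}
    for (a, b, c, d, e), n in five_grams:
        slots = {"A": a, "B": b, "D": d, "E": e}
        for pattern in BACKOFF_ORDER:
            context = tuple(slots[name] for name in pattern)
            tables[pattern][context][c] += n
    return tables


def predict(tables, a, b, d, e):
    """Return (word, estimated probability, pattern used), backing off when a context has no data."""
    slots = {"A": a, "B": b, "D": d, "E": e}
    for pattern in BACKOFF_ORDER:
        context = tuple(slots[name] for name in pattern)
        counts = tables[pattern].get(context)  # .get avoids creating empty entries
        if counts:
            word, n = counts.most_common(1)[0]
            return word, n / sum(counts.values()), pattern
    return None


# Hypothetical counts, loosely shaped like rows of the 5-gram table.
data = [
    (("is", "the", "effect", "of", "the"), 50),
    (("does", "not", "affect", "the", "general"), 30),
    (("does", "not", "affect", "the", "question"), 10),
    (("not", "the", "effect", "of", "any"), 20),
    (("can", "not", "effect", "the", "change"), 5),
]
tables = train(data)
exact = predict(tables, "does", "not", "the", "general")
backed_off = predict(tables, "would", "not", "the", "economy")
prior_only = predict(tables, "w", "x", "y", "z")
```

Here `exact` is answered from the full context, `backed_off` falls through to P(C | B,D), and `prior_only` reaches P(C). Returning the pattern alongside the prediction makes it easy to tabulate how often each level is used.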

39 EXAMPLES
The cumulative _ of the → effect (1.0)
Go into _ on January → effect (1.0)
From cumulative _ of accounting: not present in training data
Nor is From cumulative _ of _
But _ cumulative _ of _ → effect (1.0)
Would not _ Finance Minister: not present
But _ not _ → affect (0.9625)

40 Performance summary
[Table: columns Pattern, Used, Errors; rows P(C | A,B,D,E), P(C | A,B,D), P(C | B,D), P(C | B), P(C); values not preserved]
Annotations: % error, 5% error, 15% error
