Probability & statistics for linguists
Class 2: more probability
D. Lassiter (h/t: R. Levy)
conditional probability

P(A | B) = P(A ∩ B) / P(B)

When in doubt about meaning: draw pictures. Conditioning keeps B-consistent values proportionally the same: normalize after removing not-B as a possibility.

FYI: Bayesians often think of conditional probability judgments as conceptually basic, and use them to define conjunctive probability: P(A ∩ B) = P(A | B) P(B).
probabilistic dynamics

A core Bayesian assumption: for any propositions A and B, your degree of belief P(B) after observing that A is true should be equal to your conditional degree of belief P(B | A) before you made this observation. Dynamics of belief are determined by the initial model, the data received, and conditioning. [Don't adjust anything without a reason!]
CP example

Hypothetical probabilities from historical English:

                     Pronoun   Not Pronoun
  Object Preverbal    0.224       0.655
  Object Postverbal   0.014       0.107

How do we calculate P(Pronoun | Postverbal)?

P(Pronoun | Postverbal) = P(Postverbal ∩ Pronoun) / P(Postverbal)
                        = 0.014 / (0.014 + 0.107)
                        ≈ 0.116
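A quick way to check this calculation in R: store the joint table and divide. The numbers are the slide's hypothetical joint probabilities.

  # Joint probabilities from the hypothetical historical-English table
  joint <- matrix(c(0.224, 0.655,
                    0.014, 0.107),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(position = c("preverbal", "postverbal"),
                                  object = c("pronoun", "not pronoun")))

  # P(Pronoun | Postverbal) = P(Postverbal & Pronoun) / P(Postverbal)
  joint["postverbal", "pronoun"] / sum(joint["postverbal", ])  # ~0.116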
Bayes rule

P(A | B) = P(B | A) P(A) / P(B)

Exercise: prove this from the definition of conditional probability. Why is this mathematical triviality so exciting?
More on Bayes rule

With conventional names for the terms:

P(A | B) = P(B | A) P(A) / P(B)

where P(B | A) is the likelihood, P(A) is the prior, and P(B) is the normalizing constant.

With extra background variables:

P(A | B, I) = P(B | A, I) P(A | I) / P(B | I)
Bayes rule example

Consider hypothetical Old English:

P(Object Animate) = 0.4
P(Object Postverbal | Object Animate) = 0.7
P(Object Postverbal | Object Inanimate) = 0.8

Imagine you're an incremental sentence processor. You encounter a transitive verb but haven't encountered the object yet. How likely is it that the object is animate? I.e., compute P(Anim | PostV).
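One way to check your answer in R, using the law of total probability for the denominator:

  # Hypothetical Old English probabilities from the slide
  p_anim <- 0.4             # P(Animate)
  p_postv_anim   <- 0.7     # P(Postverbal | Animate)
  p_postv_inanim <- 0.8     # P(Postverbal | Inanimate)

  # Normalizing constant via the law of total probability
  p_postv <- p_postv_anim * p_anim + p_postv_inanim * (1 - p_anim)

  # Bayes rule
  p_postv_anim * p_anim / p_postv   # P(Anim | PostV) ~ 0.368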
random variables

A discrete RV is a function from Ω to a finite or countably infinite set of reals. (Next week: continuous RVs.) P(X = x) is a probability mass function.

Ex: a Bernoulli trial with outcomes success and failure (yes/no, 0/1, ...):

P(X = x) = π        if x = 1
         = 1 − π    if x = 0
         = 0        otherwise
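In R, a Bernoulli trial is just a binomial with size = 1; the success probability 0.3 below is an illustrative choice, not from the slide.

  p <- 0.3                             # the slide's pi (illustrative value)
  rbinom(10, size = 1, prob = p)       # ten Bernoulli draws
  dbinom(c(1, 0), size = 1, prob = p)  # pmf: P(X = 1) = p, P(X = 0) = 1 - p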
multinomial trials

Generalizing the Bernoulli trial to multiple outcomes c_1, ..., c_r, we get a multinomial trial with r − 1 parameters π_1, ..., π_{r−1}:

P(X = x) = π_1                       if x = c_1
         = π_2                       if x = c_2
           ⋮
         = π_{r−1}                   if x = c_{r−1}
         = 1 − Σ_{i=1}^{r−1} π_i     if x = c_r
         = 0                         otherwise

[Why only r − 1 parameters, instead of r?]
example of multinomial trials

You decide to pull Alice in Wonderland off your bookshelf, open to a random page, put your finger down randomly on that page, and record the letter that your finger is resting on. In Alice in Wonderland, 12.6% of the letters are e, 9.9% are t, 8.2% are a, and so forth. We could write the parameters of our model as π_e = 0.126, π_t = 0.099, π_a = 0.082, and so forth.
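A sketch of this trial in R, using only the three parameters the slide gives and lumping everything else into an "other" outcome:

  # Letter probabilities from the slide; "other" absorbs the remaining mass,
  # playing the role of the implicit r-th parameter
  probs <- c(e = 0.126, t = 0.099, a = 0.082)
  probs <- c(probs, other = 1 - sum(probs))

  # Twenty multinomial trials: random letters under this model
  sample(names(probs), size = 20, replace = TRUE, prob = probs)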
another perspective

A (random) variable is a partition on Ω: what semanticists think of as a question denotation.

rain? = ⟦is it raining?⟧ = {{w : rain(w)}, {w : ¬rain(w)}}

Dan-hunger = ⟦How hungry is Dan?⟧ = {{w : ¬hungry(w)(d)}, {w : sorta-hungry(w)(d)}, {w : very-hungry(w)(d)}}
joint probability

We'll often use capital letters for RVs, lowercase for specific answers. P(X = x): the probability that the answer to X is x.

Joint probability: a distribution over all possible combinations of a set of variables. P(X = x ∧ Y = y), usually written P(X = x, Y = y).
2-Variable model

[Figure: a grid with rows {not hungry, sorta hungry, very hungry} and columns {rain, no rain}.] A joint distribution determines a number for each cell.
marginal probability

P(X = x) = Σ_y P(X = x ∧ Y = y)

P(it's raining) is the sum of:
P(it's raining and Dan's not hungry)
P(it's raining and Dan's kinda hungry)
P(it's raining and Dan's very hungry)

Obvious, given that RVs are just partitions.
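A minimal R sketch of marginalization, with invented numbers for the hunger × rain joint distribution:

  # Hypothetical joint distribution over hunger x rain (numbers are made up)
  joint <- matrix(c(0.05, 0.10,    # not hungry
                    0.35, 0.25,    # sorta hungry
                    0.10, 0.15),   # very hungry
                  nrow = 3, byrow = TRUE,
                  dimnames = list(hunger = c("not", "sorta", "very"),
                                  rain = c("rain", "no rain")))

  # Marginal P(rain): sum the rain column over all hunger values
  sum(joint[, "rain"])  # 0.5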
independence

X ⫫ Y iff ∀x ∀y: P(X = x) = P(X = x | Y = y)

X and Y are independent RVs iff changing P(X) does not affect P(Y); equivalently, iff changing P(Y) does not affect P(X).

Pearl '00: independence is a default; cognitively more basic than probability estimation. Ex.: traffic in LA vs. the price of beans in China. Independence greatly simplifies probabilistic inference.
2-Variable model

[Figure: the same grid, with probability proportional to area; rain and Dan-hunger are independent.] Probably it's raining; probably Dan is sorta hungry.
2-RV structured model

[Figure: the same grid, but rain and Dan-hunger are not independent: rain reduces appetite.] If rain, Dan's probably not hungry; if no rain, Dan's probably sorta hungry.
conditional independence

If A is conditionally independent of B given C, we write A ⫫ B | C. This holds iff, for all values of A and B,

P(A | B, C) = P(A | C)

Important: A ⫫ B does not imply A ⫫ B | C. This is exemplified by explaining away.
Bayes nets

[Figure: a network with rain and sprinkler as parents of wet grass; wet grass is dependent on rain and sprinkler.]

Upon observing wet grass = 1, update P(V) := P(V | wet grass = 1): high probability that at least one enabler is true (Pearl, 1988). rain and sprinkler are independent, but conditionally dependent given wet grass!
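A sketch of explaining away in R, with invented priors and treating wet grass as a deterministic OR of rain and sprinkler:

  # Rain and sprinkler independent a priori (numbers are made up)
  p_rain <- 0.3; p_sprinkler <- 0.5

  # Enumerate the four joint configurations and their probabilities
  joint <- expand.grid(rain = c(0, 1), sprinkler = c(0, 1))
  joint$p <- ifelse(joint$rain == 1, p_rain, 1 - p_rain) *
             ifelse(joint$sprinkler == 1, p_sprinkler, 1 - p_sprinkler)
  joint$wet <- pmax(joint$rain, joint$sprinkler)  # deterministic OR

  # Condition on wet grass = 1
  post <- joint[joint$wet == 1, ]
  post$p <- post$p / sum(post$p)

  sum(post$p[post$rain == 1])   # P(rain | wet) ~ 0.46, up from the prior 0.3
  sum(post$p[post$rain == 1 & post$sprinkler == 1]) /
    sum(post$p[post$sprinkler == 1])   # P(rain | wet, sprinkler = 1) = 0.3:
                                       # the sprinkler explains the wet grass away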
cumulative probability

Flip a π-weighted coin till you get heads, then stop. Let Z denote the question/RV "how many flips before stopping?". Possible values are the z's: 0, 1, 2, 3, ...

P(0) = ? P(1) = ? P(2) = ? P(3) = ? ...
cumulative probability

Flip a π-weighted coin till you get heads, then stop. Let Z denote the question/RV "how many flips before stopping?". Possible values are the z's: 0, 1, 2, 3, ...

P(0) = 0
P(1) = π
P(2) = (1 − π) * π
P(3) = (1 − π)² * π
...
cumulative probability

Flip a π-weighted coin till you get heads, then stop. Let Z denote the question/RV "how many flips before stopping?". Possible values are the z's: 0, 1, 2, 3, ...

In general, P(z) = (1 − π)^(z−1) π. Why?
cumulative probability

Flip a π-weighted coin till you get heads, then stop. Let Z denote the question/RV "how many flips before stopping?". Possible values are the z's: 0, 1, 2, 3, ...

P(z) = (1 − π)^(z−1)   [prob. of z − 1 tails]   ×   π   [prob. of a head]

Complicated-looking models are usually built up from simple logical reasoning like this.
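This pmf is the geometric distribution. R's built-in dgeom counts the tails before the first head, so the slide's z corresponds to z − 1 there:

  p <- 0.1                 # the slide's pi
  z <- 1:5
  (1 - p)^(z - 1) * p      # pmf built from the slide's reasoning
  dgeom(z - 1, prob = p)   # same numbers from R's geometric pmf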
cumulative probability

Z is "how many flips before stopping?". F(z) is the probability that Z ≤ z. Values of a RV are mutually exclusive, so just add:

F(0) = 0
F(1) = P(1)
F(2) = P(1) + P(2)
F(3) = P(1) + P(2) + P(3)
...
cumulative probability

Z is "how many flips before stopping?". F(z) is the probability that Z ≤ z.

F(0) = 0
F(1) = P(1)
F(2) = F(1) + P(2)
F(3) = F(2) + P(3)
F(4) = F(3) + P(4)
...
cumulative probability

Z is "how many flips before stopping?". F(z) is the probability that Z ≤ z. In general,

F(z) = Σ_{i ≤ z} (1 − π)^(i−1) π
cumulative probability

Suppose π = .1. Then P(z) and F(z) are:

[Figure: two panels plotting P(z) (probability) and F(z) (cumulative probability) against z, for z = 0, ..., 80.]

Now let's start learning how to do these calculations and plot them in R.
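For instance, a minimal sketch of those two panels in base R (the layout details are guesses, not the original plotting code):

  p <- 0.1                                       # the slide's pi
  z <- 0:80
  Pz <- ifelse(z == 0, 0, (1 - p)^(z - 1) * p)   # pmf, with P(0) = 0
  Fz <- cumsum(Pz)                               # cdf by accumulation

  par(mfrow = c(1, 2))
  plot(z, Pz, type = "h", ylab = "P(z)", main = "Probability")
  plot(z, Fz, type = "s", ylab = "F(z)", main = "Cumulative probability")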