CS395T Computational Statistics with Application to Bioinformatics


1. CS395T Computational Statistics with Application to Bioinformatics. Prof. William H. Press, Spring Term, 2010, The University of Texas at Austin. Unit 2: Estimation of Parameters.

2. Our next topic is Bayesian estimation of parameters. We'll ease into it with an example that looks a lot like the Monty Hall problem, the Jailer's Tip: Of 3 prisoners (A, B, C), 2 will be released tomorrow. A, who thinks he has a 2/3 chance of being released, asks the jailer for the name of one of the lucky prisoners other than himself. The jailer says, truthfully, "B". "Darn," thinks A, "now my chances are only 1/2, C or me." Is this like Monty Hall? Did the data ("B") change the probabilities?

3. Further, suppose (unlike Monty Hall) the jailer is not indifferent about responding "B" versus "C". Does that change your answer to the previous question? Let

$P(S_B \mid BC) = x, \quad 0 \le x \le 1$  (the probability that the jailer says "B" when B and C are the ones released)

$P(A \mid S_B) = P(AB \mid S_B) + P(AC \mid S_B) = \frac{P(S_B \mid AB)\,P(AB)}{P(S_B \mid AB)P(AB) + P(S_B \mid BC)P(BC) + P(S_B \mid CA)P(CA)} = \frac{1 \cdot \tfrac{1}{3}}{1 \cdot \tfrac{1}{3} + x \cdot \tfrac{1}{3} + 0 \cdot \tfrac{1}{3}} = \frac{1}{1+x}$

(The P(AC | S_B) term is zero, because if A and C were the released pair the jailer would have to say "C".) So if A knows the value x, he can calculate his chances. If x = 1/2 (like Monty Hall), his chances are 2/3, same as before, so (unlike Monty Hall) he got no new information. If x ≠ 1/2, he does get new information: his chances change. But what if he doesn't know x at all?
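(Not on the original slide: a quick Monte Carlo check of the 1/(1+x) result, in MATLAB since that is the course's tool. The value x = 0.7 is an arbitrary choice for illustration.)

% Monte Carlo check of P(A released | jailer says "B") = 1/(1+x)
x = 0.7;                          % jailer's prob. of saying "B" when B,C are released
Ntrials = 1e6;
pair = randi(3, Ntrials, 1);      % 1 = AB released, 2 = BC released, 3 = CA released
saysB = (pair == 1) | (pair == 2 & rand(Ntrials,1) < x);   % for CA he must say "C"
Arel  = (pair == 1) | (pair == 3);
fprintf('simulated %.4f   analytic %.4f\n', mean(Arel(saysB)), 1/(1+x));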

4. Marginalization (this is important!) When a model has unknown, or uninteresting, parameters we "integrate them out", multiplying by any knowledge of their distribution: at worst, just a prior informed by background information; at best, a narrower distribution based on data. This is not any new assumption about the world, it's just the law of de-anding (e.g., the Jailer's Tip):

$P(A \mid S_B I) = \int_x P(A \mid S_B x I)\, p(x \mid I)\, dx = \int_x \frac{1}{1+x}\, p(x \mid I)\, dx$

Law of de-anding: the x's are EME (exhaustive and mutually exclusive)!

5. (Repeating the previous equation:)

$P(A \mid S_B I) = \int_x P(A \mid S_B x I)\, p(x \mid I)\, dx = \int_x \frac{1}{1+x}\, p(x \mid I)\, dx$

This is the first time we've seen a continuous probability distribution, but we'll skip the obvious repetition of all the previous laws. We abbreviate $p(x) \equiv p(x \mid I)$. (Notice that p(x) is a probability of a probability! That is fairly common in Bayesian inference.) Normalization carries over from the discrete to the continuous case:

$\sum_i P_i = 1 \qquad \sum_i p(x_i)\, dx_i = 1 \qquad \int_x p(x)\, dx = 1$

6. (Repeating the previous equation:)

$P(A \mid S_B I) = \int_x P(A \mid S_B x I)\, p(x \mid I)\, dx = \int_x \frac{1}{1+x}\, p(x \mid I)\, dx$

What should Prisoner A take for p(x)? Maybe the uniform prior?

$p(x) = 1, \;\; 0 \le x \le 1: \qquad P(A \mid S_B I) = \int_0^1 \frac{dx}{1+x} = \ln 2 \approx 0.693$

Not the same as the prior massed at x = 1/2, a Dirac delta function, which just substitutes the value and removes the integral:

$p(x) = \delta(x - \tfrac{1}{2}), \;\; 0 \le x \le 1: \qquad P(A \mid S_B I) = \frac{1}{1 + 1/2} = \frac{2}{3}$
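(Not on the slide: a quick numeric check of the two priors, in plain MATLAB.)

% Uniform prior: marginalize 1/(1+x) over x in [0,1]
integral(@(x) 1./(1+x), 0, 1)      % 0.6931... = ln 2
% Prior massed at x = 1/2: the delta function just evaluates the integrand there
1/(1 + 0.5)                        % 0.6667 = 2/3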

7. Review where we are: we are trying to estimate a parameter

$x = P(S_B \mid BC), \quad 0 \le x \le 1$

$P(A \mid S_B I) = \int_x P(A \mid S_B x I)\, p(x \mid I)\, dx = \int_x \frac{1}{1+x}\, p(x \mid I)\, dx$

The form of our estimate is a (Bayesian) probability distribution of the parameter (itself here just happening to be a probability). This is a sterile exercise if it is just a debate about priors. What we need is data! Data might be a previous history of choices by the jailer in identical circumstances:

BCBCCBCCCBBCBCBCCCCBBCBCCCBCBCBBCCB
N = 35, N_B = 15, N_C = 20

(What's wrong with x = 15/35 ≈ 0.43? Hold on...) We hypothesize (and might later try to check) that these are i.i.d. (independent and identically distributed) Bernoulli trials and therefore informative about x. As good Bayesians, we now need P(data | x).
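(Not on the slide: tallying the jailer's history in MATLAB, just to confirm the counts.)

s  = 'BCBCCBCCCBBCBCBCCCCBBCBCCCBCBCBBCCB';
N  = numel(s)          % 35
NB = sum(s == 'B')     % 15
NC = sum(s == 'C')     % 20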

8. P(data | x) means different things in frequentist vs. Bayesian contexts, so this is a good time to understand the differences (we'll use both ideas as appropriate).

Frequentist: considers the universe of what might have been, imagining repeated trials even if they weren't actually tried, and needs no prior. Since the trials are i.i.d., only the N's can matter (a so-called "sufficient statistic"):

$P(\text{data} \mid x) = \binom{N}{N_B}\, x^{N_B}(1-x)^{N_C}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$

The binomial coefficient counts the number of equivalent arrangements; $x^{N_B}(1-x)^{N_C}$ is the probability of the exact sequence seen.

Bayesian: considers only the exact data seen, and has a prior:

$P(x \mid \text{data}) \propto x^{N_B}(1-x)^{N_C}\, p(x \mid I)$

No binomial coefficient, both conceptually and also because it is independent of x and absorbed in the proportionality. Use only the data you see, not "equivalent arrangements" that you didn't see. This issue is one we'll return to, not always entirely sympathetically to the Bayesians (e.g., goodness-of-fit). But we might first suppose that the prior is uniform.
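(Not on the slide: the binomial coefficient is a constant in x, so after normalization the frequentist likelihood and the Bayesian kernel with a uniform prior give the same curve. A small MATLAB check:)

nn = 35; nb = 15;
xx = linspace(0, 1, 1001);
kern  = xx.^nb .* (1-xx).^(nn-nb);       % Bayesian kernel, uniform prior
withC = nchoosek(nn, nb) * kern;         % frequentist likelihood, with binomial coeff.
post1 = kern  / trapz(xx, kern);
post2 = withC / trapz(xx, withC);
max(abs(post1 - post2))                  % ~0, differs only by round-off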

9. With the uniform prior, the Bayes numerator and denominator are

numerator: $x^{N_B}(1-x)^{N-N_B} \cdot 1$

denominator: $\int_0^1 x^{N_B}(1-x)^{N-N_B}\, dx = \frac{\Gamma(N_B+1)\,\Gamma(N-N_B+1)}{\Gamma(N+2)}$

[Slide figure: plot of numerator over denominator for N = 35, N_B = 15.] So this is what the data tells us about p(x).

10. You should learn to do calculations like this in MATLAB or Mathematica:

MATLAB (Symbolic Math Toolbox):
syms nn nb x
num = x^nb * (1-x)^(nn-nb)
    num = x^nb*(1-x)^(nn-nb)
denom = int(num, 0, 1)
    denom = gamma(nn-nb+1)*gamma(nb+1)/gamma(nn+2)
p = num / denom
    p = x^nb*(1-x)^(nn-nb)/gamma(nn-nb+1)/gamma(nb+1)*gamma(nn+2)
ezplot(subs(p, [nn, nb], [35, 15]), [0, 1])

Mathematica:
In[7]:=  num = x^nb (1 - x)^(nn - nb)
Out[7]=  (1 - x)^(-nb + nn) x^nb
In[8]:=  denom = Integrate[num, {x, 0, 1}, GenerateConditions -> False]
Out[8]=  (Gamma[1 + nb] Gamma[1 - nb + nn]) / Gamma[2 + nn]
In[9]:=  p[x_] = num / denom
Out[9]=  ((1 - x)^(-nb + nn) x^nb Gamma[2 + nn]) / (Gamma[1 + nb] Gamma[1 - nb + nn])
In[12]:= Plot[p[x] /. {nn -> 35, nb -> 15}, {x, 0, 1}, PlotRange -> All, Frame -> True]
Out[12]= -Graphics-
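(Not on the slide: a purely numeric alternative in case the Symbolic Math Toolbox isn't at hand. MATLAB's beta(a,b) = Γ(a)Γ(b)/Γ(a+b) gives the same denominator directly.)

nn = 35; nb = 15;
num   = @(x) x.^nb .* (1-x).^(nn-nb);
denom = beta(nb+1, nn-nb+1);       % = gamma(nb+1)*gamma(nn-nb+1)/gamma(nn+2)
p     = @(x) num(x) / denom;
xx = linspace(0, 1, 1000);
plot(xx, p(xx)); xlabel('x'); ylabel('p(x | data)');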

11. Find the mean, standard error, and mode of our estimate for x, where

$P(x \mid \text{data}) \propto x^{N_B}(1-x)^{N-N_B}$  (normalized by the denominator computed above)

Mode: setting $\frac{d}{dx} P(x \mid \text{data}) = 0$ gives $x = \frac{N_B}{N}$; the maximum likelihood (ML) answer is to estimate x as exactly the fraction seen.

Mean (the 1st moment): $\langle x \rangle = \int_0^1 x\, P(x \mid \text{data})\, dx = \frac{N_B + 1}{N + 2}$; notice it's different from ML!

Variance involves the 2nd moment:

$\mathrm{Var}(x) = \langle x^2 \rangle - \langle x \rangle^2 = \int_0^1 x^2\, P(x \mid \text{data})\, dx - \langle x \rangle^2 = \frac{(N_B+1)(N-N_B+1)}{(N+2)^2 (N+3)}$

This shows how p(x) gets narrower as the amount of data increases.
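(Not on the slide: a numeric check of these three summaries for N = 35, N_B = 15, in MATLAB.)

nn = 35; nb = 15;
xx = linspace(0, 1, 20001);
post = xx.^nb .* (1-xx).^(nn-nb);
post = post / trapz(xx, post);                 % normalize on the grid
[~, imax] = max(post);
fprintf('mode %.4f  vs  N_B/N         = %.4f\n', xx(imax), nb/nn);
m1 = trapz(xx, xx .* post);
fprintf('mean %.4f  vs  (N_B+1)/(N+2) = %.4f\n', m1, (nb+1)/(nn+2));
v  = trapz(xx, xx.^2 .* post) - m1^2;
fprintf('var  %.6f  vs  closed form   = %.6f\n', v, (nb+1)*(nn-nb+1)/((nn+2)^2*(nn+3)));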

12. (Let's leave behind the metaphor of the Jailer and Prisoner A.) What we are illustrating is called Bernoulli trials: two possible outcomes; i.i.d. events; a single parameter x (the probability of one outcome); a sufficient statistic is the pair of numbers N and N_B.

$P(\text{data} \mid x) = x^{N_B}(1-x)^{N-N_B}$  (in the Bayesian sense)

$P(x \mid \text{data}) \propto x^{N_B}(1-x)^{N-N_B}\, P(x \mid I)$

[Slide figure: portraits of Jacob and Johann Bernoulli.] For a uniform prior, the Bayes denominator is, as we've seen, easy to calculate:

$\int_0^1 x^{N_B}(1-x)^{N-N_B}\, dx = \frac{\Gamma(N_B+1)\,\Gamma(N-N_B+1)}{\Gamma(N+2)}$

13. Are there any other mathematical forms for the prior that would still leave the Bayes denominator easy to calculate? Yes! Try

$P(x \mid I) \propto x^{\beta}(1-x)^{\alpha}$

Choose α and β to give any desired center and width. Then

$P(x \mid \text{data}) \propto x^{N_B}(1-x)^{N-N_B} \cdot x^{\beta}(1-x)^{\alpha}$

$\int_0^1 x^{N_B+\beta}(1-x)^{N-N_B+\alpha}\, dx = \frac{\Gamma(N_B+\beta+1)\,\Gamma(N-N_B+\alpha+1)}{\Gamma(N+\alpha+\beta+2)}$

Priors that preserve the analytic form of p(x) are called conjugate priors. There is nothing special about them except mathematical convenience. If you start with a conjugate prior, you'll also be able to assimilate new data trivially, just by changing the parameters of your estimate. This is because every posterior is in the right analytic form to be the new prior!
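(Not on the slide: a sketch of how trivially a conjugate prior assimilates new data. Here the prior is written as Beta(a, b) ∝ x^(a-1) (1-x)^(b-1), so a plays the role of β+1 and b of α+1 above; the second data batch is invented purely for illustration.)

% Each new batch of Bernoulli data just increments the two Beta parameters.
a = 1;  b = 1;                    % uniform prior
a = a + 15;  b = b + (35 - 15);   % batch 1: the jailer data, 15 "B"s out of 35
a = a + 4;   b = b + (10 - 4);    % batch 2: a hypothetical further 10 trials, 4 "B"s
post_mean = a / (a + b)           % posterior mean after both batches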

14. By the way, if I show a special love of Bernoulli trials, it might be because I am an academic descendant of the Bernoulli brothers! Actually, this is not a very exclusive club: Gauss and the Bernoullis each have ~50,000 recorded descendants in the Mathematics Genealogy database, and probably many times more unrecorded.

15. Next example (with some biology): Individual identity, or ancestry, can be determined by variable-length short tandem repeats (STRs) in the genome. The mutation probability is ~0.5% per STR per generation (though highly variable). If you use the Y chromosome only, you get paternal ancestry. There are companies that sell certificates with your genotype. A bit opportunistic, since in a few years your whole genome will be sequenced by your health plan.

16. Margaret, my ex-wife, is really into the Towne family. (And she's neither a biologist nor a Towne.)

17. Here's data from Margaret on 8 recent Townes (identified only by T code). (We'll use this data several times in the next few lectures.) [Slide figures: a table of the STR genotypes, and a second table showing just the changes.]

18. Let's do a Bayesian estimation of the parameter r, the mutation rate per locus per generation. Let's assume no back mutations! (Their effect on this data set would be small.) [Slide figure: the Towne family tree, with each branch annotated by its number of generations, the number of STR changes observed along it, and the corresponding binomial factor, e.g. bin(0, 3×37, r), bin(1, 5×37, r), bin(0, 6×37, r), bin(1, 11×37, r). The full product of branch factors is the model given two slides below.]

19. Unraveling dependencies: we used ancestry as an intuitive example of how dependencies work because its causal mechanism is well known! Later, we'll do more general Bayesian networks to capture models with more complicated causal dependencies. [Slide figure: a small directed graph on nodes a, b, c, d, e.] Another important idea is conditional independence. Example: b and e are conditionally independent given a, while b and d are not conditionally independent given a. (Why?)

20. So we have a statistical model for the data, that is, a way to compute P(data | parameters). It is not exact, but statistical models rarely (never?) are: it neglects back mutations, assumes a single probability for all loci, etc. The model is

$P(\text{data} \mid r) = \mathrm{bin}(0, 3\cdot37, r)\,\mathrm{bin}(0, 3\cdot37, r)\,\mathrm{bin}(1, 5\cdot37, r)\,\mathrm{bin}(0, 5\cdot37, r)\,\mathrm{bin}(0, 6\cdot37, r)\,\mathrm{bin}(1, 11\cdot37, r)\,\mathrm{bin}(3, 10\cdot37, r)$

where bin(k, n, r) is the binomial probability of k events in n tries, each with probability r. Bayes estimation of the parameter:

$P(r \mid \text{data}) \propto P(\text{data} \mid r)\, P(r) \propto P(\text{data} \mid r)\, \frac{1}{r}$

What kind of prior is this??? It is called log-uniform. The log-uniform prior has equal probability in each order of magnitude:

$\int_r^{10r} P(r')\, dr' = \int_r^{10r} \frac{dr'}{r'} = \log 10$

It is often taken as the non-informative prior when you don't even know the order of magnitude of the (positive) quantity. It is an improper prior, since its integral is infinite. This is almost always OK, but it is possible to construct paradoxes with improper priors (e.g., the "marginalization paradox").
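(Not on the slide: a numeric sketch of this estimate in MATLAB. The grid over r and its bounds are arbitrary choices.)

% Posterior for the mutation rate r with the log-uniform (1/r) prior.
bin = @(k, n, r) nchoosek(n, k) .* r.^k .* (1-r).^(n-k);
r = linspace(1e-4, 0.02, 2000);                % grid of plausible rates
like = bin(0,3*37,r) .* bin(0,3*37,r) .* bin(1,5*37,r) .* bin(0,5*37,r) ...
    .* bin(0,6*37,r) .* bin(1,11*37,r) .* bin(3,10*37,r);
post = like ./ r;                              % multiply by the 1/r prior
post = post / trapz(r, post);                  % normalize numerically
plot(r, post); xlabel('r'); ylabel('P(r | data)');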

21. Here is the plot of the (normalized) P(r | data). This is (almost) real biology: we've measured the mutation probability, per locus per generation, of Y-chromosome STRs. This tells us something about the actual DNA replication machinery!

22. It really did matter (a bit) that we sorted out the conditional dependencies correctly. Here's a comparison to doing it wrong by assuming all the data independent: the true dependencies allow somewhat larger values of r, because we don't wrongly count the branches with 0 observed mutations multiple times. We'll come back to the Towne family for some fancier stuff later! Ignoring conditional dependencies and just multiplying the probabilities of the data as if they were independent is called naïve Bayes. People often do this. It is mathematically incorrect, but sometimes it is all you can do!

23. The basic paradigm of Bayesian parameter estimation: (1) Construct a statistical model for the probability of the observed data as a function of all parameters, treating dependency in the data correctly. (2) Assign prior distributions to the parameters, jointly or independently as appropriate, using the results of previous data if available. (3) Use Bayes' law to get the (multivariate) posterior distribution of the parameters. (4) Marginalize as desired to get the distributions of single (or a manageable few multivariate) parameters. [Slide figure: cosmological models are typically fit to many parameters; marginalization yields the distribution of the parameters of interest, here two, shown as contours.]
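(Not on the slide: a toy sketch of step 4, marginalizing a posterior evaluated on a grid. The two-parameter posterior here is made up purely for illustration.)

% Marginalize a 2-D gridded posterior down to each single parameter.
a = linspace(-3, 3, 201);  b = linspace(-3, 3, 201);
[A, B] = meshgrid(a, b);
post2d = exp(-(A.^2 - 1.2*A.*B + B.^2));       % unnormalized toy posterior
post2d = post2d / trapz(b, trapz(a, post2d, 2));
pa = trapz(b, post2d, 1);                      % integrate out b, leaving p(a)
pb = trapz(a, post2d, 2);                      % integrate out a, leaving p(b)
plot(a, pa); hold on; plot(b, pb); hold off;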

24. A small cloud: the way we trimmed the data mattered (and should trouble us a bit!). Here's the effect of including T11 and T13, both of which seemed to be outliers:

$P(\text{data} \mid r) = [\text{old model}] \times \mathrm{bin}(5, 9\cdot37, r)\,\mathrm{bin}(4, 10\cdot12, r)$  (the two appended factors are the ones that now include them)

Editing outliers is a tricky issue that we will return to when we learn about mixture models.
