CS 395T: Computational Statistics with Application to Bioinformatics


Slide 1

CS 395T: Computational Statistics with Application to Bioinformatics
Instructor: Prof. William H. Press
Spring Term, 2008
The University of Texas at Austin

Lecture 14

Slide 2

Last lecture, we
- Explored the use of the permutation test for the significance of contingency tables
  - requires expanding the contingency table to a dataset
  - combines nicely with compression of ordinal data [example: maternal drinking]
    - e.g., test difference of means, or of higher moments
    - if you test multiple moments, think about multiple hypothesis correction, but don't necessarily do it
  - or use the permutation test directly on (nominal) contingency tables, in which case you have to pick a statistic; Wald or Pearson are fine (a sketch of this follows the list)
- Contrasted the permutation test to bootstrap resampling
  - one breaks the causal connection in the data, the other doesn't
  - one lives in the null-hypothesis world, the other in the (estimated) true population
  - therefore, one addresses false positives (p-values), the other, false negatives
- Observed that, on (nominal) contingency tables, the permutation test samples exactly the null population of Fisher's Exact Test
  - preserves all marginals
  - gives an easy Monte Carlo way to compute Fisher's Exact Test on contingency tables of any size
- Saw that even if we can easily compute it, Fisher's Exact Test remains problematic
  - discrete values, and data always lands on a tie
  - the 2-tailed test is fragile to irrelevant number-theoretic coincidences
  - and it still doesn't properly model an actual protocol!
- Decided, as good Bayesians, to marginalize over (by sampling from) the unknown column and row probabilities
  - we use the Dirichlet, the conjugate distribution to the multinomial
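As a concrete reminder of that Wald-or-Pearson point, here is a minimal sketch (not from the slides) of the permutation test applied directly to a nominal contingency table, using the 2x2 table from the next slide and the pearson() statistic function defined later in this unit: expand the table to one row label and one column label per individual, permute one set of labels, and re-tabulate.

% minimal sketch, assuming the 2x2 table from the next slide and a pearson.m on the path
table = [8 3; 16 26];
rowlab = []; collab = [];
for row = 1:size(table,1)
    for col = 1:size(table,2)
        rowlab = [rowlab; repmat(row, table(row,col), 1)];   % one row label per individual
        collab = [collab; repmat(col, table(row,col), 1)];   % one column label per individual
    end
end
nperm = 10000;
samps = zeros(1,nperm);
for k = 1:nperm
    shuffled = rowlab(randperm(numel(rowlab)));        % break any row-column association
    permtab = accumarray([shuffled, collab], 1, size(table));
    samps(k) = pearson(permtab);                       % statistic under the permutation null
end
pval = mean(samps >= pearson(table))

Note that this permutation null preserves both sets of marginals, which is exactly the Fisher's-Exact-Test population mentioned above.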

Slide 3

In case it's not obvious that sampling over q is the same as marginalizing over it, here's a formal proof:

We generate a deviate pair (f, q) by choosing a q, and then, independently, an f given q, so

    p(f, q) = p(q) p(f|q)

We then ignore q, so our f is drawn from the distribution

    p(f) = ∫ p(f, q) dq = ∫ p(q) p(f|q) dq

which is the same as the desired marginalization

    p(f) = ∫ p(f|q) p(q) dq

(This is so close to self-evident that I'm not sure the proof adds anything!)
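This is not on the slide, but a tiny numerical check of the argument, with an arbitrary Beta/Binomial choice for p(q) and p(f|q): draw q, then f given q, ignore q, and compare the histogram of f against the analytic marginal (here a beta-binomial).

% hypothetical choice of distributions, just to illustrate that sampling = marginalizing
a = 3; b = 5; n = 20; nsamp = 100000;
q = betarnd(a, b, nsamp, 1);                    % q ~ p(q)
f = binornd(n, q);                              % f ~ p(f|q), one f per sampled q
empir = accumarray(f+1, 1, [n+1 1]) / nsamp;    % histogram of f, with q ignored
k = (0:n)';
analytic = exp(gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1) ...
    + betaln(k+a, n-k+b) - betaln(a, b));       % beta-binomial: the integral of p(f|q) p(q) dq
bar(0:n, [empir analytic])                      % the two sets of bars should agree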

Slide 4

So let's reanalyze assuming that the condition (column) marginals were fixed by the protocol, and we Bayes-sample the row probabilities. (You'll never believe my sampling function unless I go through an example!)

>> table = [8 3; 16 26]
>> marfix = sum(table,1)'                 % column marginals (transposed): [24; 29]
>> marvar = sum(table,2)'                 % row marginals (transposed): [11 42]
>> gammas = gamrnd(marvar+1,1)            % ~ Gamma(n_i + 1)
>> q = gammas ./ sum(gammas)              % generated (random) row probabilities q_i
>> qmat = repmat(q,[size(table,2),1])
>> tabout = mnrnd(marfix,qmat)'           % finally, we generate multinomial deviates for each
                                          % column, using the generated row probabilities

The reason everything is done in the transpose is because of the way that Matlab's mnrnd function expects its arguments to be shaped. Sorry about that!

Slide 5

Encapsulate the sampling process into a function. Then generate a bunch of samples and look at their Wald statistics.

function tabout = tabnullsamp(tabin)
marfix = sum(tabin,1)';
marvar = sum(tabin,2)';
q = gamrnd(marvar+1,1);
qmat = repmat(q./sum(q),[size(tabin,2),1]);
tabout = mnrnd(marfix,qmat)';

wald(table)
tabnullsamp(table), tabnullsamp(table)
wald(tabnullsamp(table))
samps = arrayfun(@(x) wald(tabnullsamp(table)), 1:30000);
hist(samps,-4:.05:4)
cdfplot(samps)
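The wald function here is carried over from the previous unit and is not shown in this lecture. A plausible 2x2 version, assuming the usual signed difference-of-column-proportions form (consistent with the -4 to 4 histogram range used above), might look like this; treat it as a guess, not the lecture's definition.

% hedged sketch of a Wald statistic for a 2x2 table with the conditions in the columns
function t = wald(table)
n = sum(table,1);                                     % the two column totals
p = table(1,:) ./ n;                                  % proportion of row 1 under each condition
se = sqrt(p(1)*(1-p(1))/n(1) + p(2)*(1-p(2))/n(2));   % standard error of the difference
t = (p(1) - p(2)) / se;                               % signed, roughly unit-normal under the null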

Slide 6

There are still discreteness effects (after all, these are integer tables), but they are less troubling:

pval = numel(samps(samps>=wald(table)))/numel(samps)
pvaltt = (numel(samps(samps>=wald(table))) + numel(samps(samps<= -wald(table))))/numel(samps)

The one-tail vs. two-tail comparison is now much more reasonable. This is probably the most honest answer that we can get for the significance of this particular contingency table.

Slide 7

Let's reanalyze the maternal drinking data using the same methodology, but (just for variety) with the Pearson statistic:

function chis = pearson(table)
nhtable = sum(table,2)*sum(table,1)/sum(sum(table));
chis = sum(sum((table-nhtable).^2 ./ nhtable));

table = [ ; ]'
pearson(table)

The table is transposed to make the unfixed marginals be the rows, as before (this is a guess as to what the protocol actually was!); the columns are the conditions.

samps = arrayfun(@(x) pearson(tabnullsamp(table)), 1:10000);
hist(samps,0:0.25:90)

Slide 8

Giving the results:

pval = numel(samps(samps>pearson(table)))/numel(samps)

Why is this less significant than the previous analysis, which gave p = .015 or .011 (mean or square-mean)? Because here we didn't use the fact that the factors were ordinal and quantitatively related. By virtue of using that information, and compressing the rows, the previous method was more powerful.

Slide 9

Actually, it seems likely that this data was a cross-sectional study with no fixed marginals. If so, a better sampling of the null hypothesis would be:

function tabout = tabnullsamp2(tabin)
marcol = sum(tabin,1);
marrow = sum(tabin,2);
ntot = sum(marcol);
q = gamrnd(marcol+1,1); q = q./sum(q);
p = gamrnd(marrow+1,1); p = p./sum(p);
pq = p * q;
tabout = reshape(mnrnd(ntot,pq(:)'),size(tabin));

samps = arrayfun(@(x) pearson(tabnullsamp2(table)), 1:10000);
hist(samps,0:0.25:90)
pval = numel(samps(samps>pearson(table)))/numel(samps)

The p-value is different from the previous one: protocol matters! This would probably be the honest answer if the table were nominal, not ordinal. But, as said, since it is ordinal, the previous analysis using difference of means is more powerful.

You now know all you need to know about contingency tables and much more than almost everyone who uses them!
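A quick check, not on the slide, that the two samplers really do encode different protocols: tabnullsamp holds the column marginals of its input fixed, while tabnullsamp2 holds only the grand total fixed.

t1 = tabnullsamp(table);
t2 = tabnullsamp2(table);
[sum(table,1); sum(t1,1); sum(t2,1)]      % column sums: first two rows agree, third varies
[sum(table(:)) sum(t1(:)) sum(t2(:))]     % grand totals: all three agree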

Slide 10

New topic: the Multivariate Normal Distribution
- Generalizes the Normal (Gaussian) to M dimensions
- Like the 1-d Gaussian, it is completely defined by its mean and (co-)variance
- The mean is an M-vector; the covariance is an M x M matrix
- Because the mean and covariance are easy to estimate from a data set, it is easy (perhaps too easy) to fit a multivariate normal distribution to data:

    μ = ⟨x⟩        Σ = ⟨(x − μ)(x − μ)^T⟩

Sample the sizes of the 1st and 2nd introns for 1000 genes:

g = readgenestats('genestats.dat');
ggg = g(g.ne>2,:);
which = randsample(size(ggg,1),1000);
iilen = ggg.intronlen(which);
i1llen = zeros(1,numel(which)); i2llen = zeros(1,numel(which));
for j=1:numel(i1llen), i1llen(j) = log10(iilen{j}(1)); end;
for j=1:numel(i2llen), i2llen(j) = log10(iilen{j}(2)); end;
plot(i1llen,i2llen,'+')
hold on

Slide 11

This is kind of fun, because it's not just the usual featureless scatter plot; notice the biology! Is there a significant correlation here? (Yes, we'll see.) (Do you think the correlation could be only from the biological censoring of low values, and not be a real association?)

Slide 12

Biological digression: the hard lower bounds on intron length are because the intron has to fit around the big spliceosome machinery! It's all carefully arranged to allow exons of any length, even quite small. Why? Could the spliceosome have evolved to require a minimum exon length, too? Are we seeing chance early history, or selection?

credit: Alberts et al., Molecular Biology of the Cell

Slide 13

We can easily sample from the fitted multivariate Gaussian:

mu = mean([i1llen; i2llen],2)
sig = cov(i1llen,i2llen)
rsamp = mvnrnd(mu,sig,1000);
plot(rsamp(:,1),rsamp(:,2),'or')
hold off

By the way, if renormalized, the covariance matrix is the linear correlation matrix, and Matlab has built-ins for the correlation and its significance:

r = sig ./ sqrt(diag(sig) * diag(sig)')
tval = sqrt(numel(iilen))*r        % statistical significance of the correlation in standard deviations (but note: uses the CLT)
[rr p] = corrcoef(i1llen,i2llen)
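A hedged follow-up, not on the slide: to turn that standard-deviations number into a two-tailed p-value under the same CLT approximation, pass its off-diagonal element through the normal CDF; with 1000 points it should land very close to the p reported by corrcoef.

z = tval(1,2);                       % correlation significance in standard deviations
ptail = 2 * (1 - normcdf(abs(z)))    % two-tailed CLT p-value; compare with p from corrcoef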

Slide 14

How to generate multivariate normal deviates: Cholesky-factor the covariance matrix, Σ = L L^T. If z is a vector of independent unit-normal deviates, then L z has covariance ⟨L z z^T L^T⟩ = L L^T = Σ. So, just take

    x = μ + L z

and you get a multivariate normal deviate!
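A minimal MATLAB sketch of this recipe, reusing the mu and sig fitted a few slides back (the sample count and plot color are arbitrary choices); in distribution it is equivalent to the earlier mvnrnd call.

L = chol(sig, 'lower');                 % sig = L*L'
nsamp = 1000;
z = randn(2, nsamp);                    % independent unit-normal deviates
x = repmat(mu, 1, nsamp) + L * z;       % each column is a multivariate normal deviate
plot(x(1,:), x(2,:), '.g')              % should look like the mvnrnd scatter above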

Slide 15

A related, useful Cholesky trick is to draw error ellipses (ellipsoids, ...). The locus of points at 1 standard deviation is x = μ + L z, where z is on the unit circle (sphere, ...):

function [x y] = errorellipse(mu,sigma,stdev,n)
L = chol(sigma,'lower');
circle = [cos(2*pi*(0:n)/n); sin(2*pi*(0:n)/n)].*stdev;
ellipse = L*circle + repmat(mu,[1,n+1]);
x = ellipse(1,:);
y = ellipse(2,:);

plot(i1llen,i2llen,'+b'); hold on
[xx yy] = errorellipse(mu,sig,1,100); plot(xx,yy,'-r');
[xx yy] = errorellipse(mu,sig,2,100); plot(xx,yy,'-r')

Slide 16

A few slides back, I put out the red herring that there could be a correlation r that wasn't a real association but instead came from weirdness in the individual distributions. I was being a provocateur. It can't be so:

1. The theorem that r is ~normal (around zero) for independent distributions is a CLT and doesn't depend on the individual distributions (as long as they have suitably convergent moments and the number of data points is large).
2. It's clear for our example data if you plot even one random permutation:

plot(i1llen,i2llen,'+b')
hold on
plot(i1llen(randperm(numel(i1llen))),i2llen,'+r');

3. In case of any lingering doubt, you could use brute force: do the permutation test and plot the histogram of r values. You'll recover the CLT result (a sketch of this follows below).
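Here is a minimal sketch of the brute-force option in point 3, using the intron data already in hand: permute one variable, recompute r each time, and check that the spread of the permuted r values matches the CLT prediction of roughly 1/sqrt(N).

nperm = 10000;
rperm = zeros(1,nperm);
for k = 1:nperm
    cc = corrcoef(i1llen(randperm(numel(i1llen))), i2llen);
    rperm(k) = cc(1,2);                 % r for one permuted (hence independent) pairing
end
hist(rperm, 50)                         % centered on zero, approximately normal
std(rperm) * sqrt(numel(i1llen))        % should be close to 1, as the CLT predicts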
