The locfdr Package

August 19, 2006

Title: Computes local false discovery rates
Version:
Author: Bradley Efron, Brit Turnbull and Balasubramanian Narasimhan
Description: Computation of local false discovery rates
Maintainer: Bradley Efron
License: GPL 2.0

R topics documented:

    hivdata
    lfdrsim
    locfdr
    Index


hivdata    HIV data set

Description

The data comprise 7680 z-values, each relating to a two-sample t-test. The test compares gene expression values for 4 HIV patients with values for 4 normal subjects. The t-score t[i] for gene i has been transformed to a normal scale, z[i] = qnorm(pt(t[i], df=6)), so that the z[i] values would theoretically have a standard N(0, 1) distribution under the null hypothesis. The original experiment is described in van 't Wout et al. (2003).

Usage

    data(hivdata)

Format

A vector containing 7680 z-values.

References

van 't Wout et al. (2003), "Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+-T-cell lines", Journal of Virology 77.
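The transformation z[i] = qnorm(pt(t[i], df=6)) described above can be sketched in base R. This is only an illustration under stated assumptions: the expression matrix is simulated noise (the real HIV expression data are not part of this manual), the names `expr`, `hiv`, `normal`, and `row_var` are hypothetical, and a pooled-variance two-sample t statistic is assumed because it is the form that gives 4 + 4 - 2 = 6 degrees of freedom.

```r
# Hypothetical stand-in data: 7680 "genes" x 8 samples of pure noise,
# so every gene is null (the real expression matrix is not in this manual).
set.seed(1)
n_genes <- 7680
expr <- matrix(rnorm(n_genes * 8), nrow = n_genes)
hiv    <- expr[, 1:4]   # 4 HIV patients (simulated)
normal <- expr[, 5:8]   # 4 normal subjects (simulated)

# Row-wise sample variance without looping over 7680 rows.
row_var <- function(m) rowSums((m - rowMeans(m))^2) / (ncol(m) - 1)

# Pooled-variance two-sample t statistic; df = 4 + 4 - 2 = 6.
n1 <- ncol(hiv)
n2 <- ncol(normal)
df <- n1 + n2 - 2
pooled_var <- ((n1 - 1) * row_var(hiv) + (n2 - 1) * row_var(normal)) / df
t_stat <- (rowMeans(hiv) - rowMeans(normal)) /
  sqrt(pooled_var * (1 / n1 + 1 / n2))

# Map the t scores to the normal scale, as in the text.
z <- qnorm(pt(t_stat, df = df))
```

Because every simulated gene is null here, `mean(z)` and `sd(z)` should come out close to 0 and 1, which is the property the transformation is meant to deliver.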

lfdrsim    Simulated data set for locfdr

Description

A simulated dataset that involves 2000 "genes", each of which has yielded a test statistic "zex", with zex[i] ~ N(mu[i], 1), independently for i = 1, 2, ..., 2000. The data comprise 2000 mu[i] values and 2000 z-values.

Usage

    data(lfdrsim)

Format

A matrix of 2000 rows and 2 columns containing mu and the z-score values (zex).


locfdr    Local False Discovery Rate Calculation

Description

Compute local false discovery rates, following the definitions and description in Efron (2004), JASA, Volume 99; Efron, B. (2005), "Local false discovery rates"; and Efron, B. (2005), "Correlation and large-scale simultaneous significance testing" (stanford.edu/~brad/papers/).

Usage

    locfdr(zz, bre=120, df=7, pct=0, pct0=1/4, nulltype=1, type=0, plot=1, mult, main=" ", sw=0)

Arguments

zz
    A vector of summary statistics, one for each case under simultaneous consideration. In a microarray experiment there would be one component of zz for each gene, perhaps a t-statistic comparing gene expression levels under two different conditions. Results may be improved by transforming zz so that its components are theoretically distributed as N(0, 1) under the null hypothesis, for example via z[i] = qnorm(pt(t[i], df)) when using t-statistics. This is especially important when the theoretical null option is invoked (see nulltype). Recentering and rescaling zz may be necessary if its central histogram looks very far removed from mean 0 and variance 1. The calculations assume a large number of cases, say length(zz) exceeding 200.

bre
    Number of breaks in the discretization of the z-score axis, or a vector of breakpoints fully describing the discretization. If length(zz) is small, such as when the number of cases is less than about 1000, set bre to a number lower than the default of 120.

df
    Degrees of freedom for fitting the estimated density f(z). Larger values of df may be required if f(z) has sharp bends or other irregularities. A warning is issued if the fitted curve does not adequately match the histogram counts. It is a good idea to use the plot option to view the histogram and fitted curve.

pct
    Excluded tail proportions of the zz values when fitting f(z). pct=0 includes the full range of zz. pct can also be a 2-vector, describing the fitting range.

pct0
    Proportion of the zz distribution used in fitting the null density f0(z) by central matching. If a 2-vector, e.g. pct0=c(0.25,0.60), the range [pct0[1], pct0[2]] is used. If a scalar, [pct0, 1-pct0] is used.

nulltype
    Type of null hypothesis assumed in estimating f0(z), for use in the fdr calculations:
      0  the theoretical null N(0, 1), which assumes that the original zz scores have been scaled to have a N(0, 1) distribution under the null hypothesis;
      1  (the default) the empirical null with parameters estimated by maximum likelihood;
      2  the empirical null with parameters estimated by central matching (see second reference);
      3  a "split normal" version of 2, in which f0(z) is allowed to have different scales on the two sides of the maximum.
    Unless sw == 2 or 3, the theoretical, maximum likelihood, and central matching estimates will all be output in the matrix fp0, and both the theoretical and the specified nulltype will be used in the calculations output in mat. Only the specified nulltype is used in the calculation of the output fdr (local fdr estimates for every case).

type
    Type of fitting used for f(z): 0 is a natural spline, 1 is a polynomial, in either case with degrees of freedom df (so the total degrees of freedom, including the intercept, is df+1).

plot
    Plots desired.
      plot=0  gives no plots.
      plot=1  gives a single plot showing the histogram of zz and the fitted densities f(z) and f0(z); colored histogram bars indicate estimated non-null counts; yellow triangles on the x-axis indicate threshold z-values for fdr <= 0.2.
      plot=2  also gives a plot of fdr, and the right and left tail area Fdr curves.
      plot=3  gives instead the f1 cdf of the estimated fdr curve, as in figure 4 of the second reference.
      plot=4  gives all three plots.

mult
    Optional scalar multiple (or vector of multiples) of the sample size for calculation of the corresponding hypothetical Efdr value(s).

main
    Main heading for the histogram plot when plot>0.

sw
    Determines the type of output desired. sw = 2 gives a list consisting of the last 5 values listed under Value. sw = 3 gives the square matrix of dimension bre-1 representing the influence function of log(fdr), i.e. the derivative of log(fdr) (for each bin) with respect to the bin counts. Any other value of sw returns a list consisting of the first 5 (6 if mult is supplied) values listed under Value.

Details

The standard error estimate lfdrse assumes independence of the zz values and should usually be considered as a lower bound on the true standard errors. See the third reference. The density estimates f, f0, f0theo are scaled to add up to approximately the number of zz values. The non-null density f1 is scaled to add up to approximately (1-p0) times the number of zz values, i.e. the estimated number of non-null zz values.

Value

fdr
    the estimated local false discovery rate for each case, using the selected options for type and nulltype.

fp0
    the estimated parameters delta (mean of f0), sigma (standard deviation of f0), and p0, along with their standard errors. If nulltype<3, fp0 is a 5 by 3 matrix, with columns representing delta, sigma, and p0 and rows representing nulltypes and estimate vs. standard error. If nulltype==3, a fourth column represents the sigma estimate for the right side of f0.

Efdr
    the expected false discovery rate for the non-null cases, a measure of the experiment's power as described in Section 3 of the second reference. Large values of Efdr, say Efdr>0.4, indicate low power. Overall Efdr and right and left values are given, both for the specified nulltype and for nulltype 0. If nulltype==0, values are given for nulltypes 1 and 0.

cdf1
    a 99x2 matrix giving the estimated cdf of fdr under the non-null distribution f1. Large values of the cdf for small fdr values indicate good power; see Section 3 of the second reference. Set plot to 3 or 4 to see the cdf plot.

mat
    A matrix summarizing the estimates of f(z), f0(z), fdr(z), etc. at the bre-1 midpoints "x" of the break discretization. These are convenient for comparisons and plotting. mat includes fdr from nulltype 1, 2, or 3 as specified, estimates of the usual tail-area False Discovery Rates, Fdrleft and Fdrright, and also fdrtheo and f0theo, the fdr and f0 estimates assuming the theoretical null density N(0, 1). If nulltype==0, the fdr and f0 columns of mat are calculated using nulltype 1. The 10th column of mat, "lfdrse", is an estimate of standard error for the curve log(fdr) and is calculated based on the specified nulltype. The 11th column of mat is an estimate p1f1 of the subdensity for the non-null z-scores. Column "counts" gives the histogram counts for zz.

mult
    If the argument mult was supplied, vector of the ratios of hypothetical Efdr for the supplied multiples of the sample size to Efdr for the actual sample size.

pds
    The estimates of p0, delta, and sigma.

x
    The bin midpoints.

f
    The values of f(z) at the bin midpoints.

pds.
    The derivative of the estimates of p0 (when nulltype==1) or log(p0) (when nulltype==0 or 2), delta, and sigma with respect to the bin counts.

stdev
    The delta-method estimates of the standard deviations of the p0, delta, and sigma estimates.

Author(s)

Bradley Efron

References

Efron, B. (2004) "Large-scale simultaneous hypothesis testing: the choice of a null hypothesis", Jour Amer Stat Assoc, 99.

Efron, B. (2006) "Size, Power, and False Discovery Rates"

Efron, B. (2006) "Correlation and Large-Scale Simultaneous Significance Testing"

Examples

    ## HIV data example
    data(hivdata)
    w <- locfdr(hivdata)
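To make the quantities named above more concrete (bin midpoints x, fitted f(z), theoretical null f0, and fdr), here is a simplified base-R sketch of a local fdr calculation under the theoretical null. It is not the package's code. The assumptions are: the data are freshly simulated in the spirit of lfdrsim (the 90% null proportion and the N(3, 1) non-null means are choices made for this sketch, not documented properties of lfdrsim); f(z) is fitted by Poisson regression of the histogram counts on a natural spline basis; and p0 is fixed at 1, which is conservative, rather than estimated.

```r
library(splines)

# Simulated z-values: roughly 90% null (mu = 0), the rest with mu ~ N(3, 1).
set.seed(2)
N   <- 2000
mu  <- ifelse(runif(N) < 0.9, 0, rnorm(N, mean = 3, sd = 1))
zex <- rnorm(N, mean = mu, sd = 1)

# Discretize the z axis: bre equally spaced breaks give bre - 1 bins.
bre    <- 120
rng    <- range(zex) + c(-0.01, 0.01)   # small padding so every z is binned
breaks <- seq(rng[1], rng[2], length.out = bre)
h      <- hist(zex, breaks = breaks, plot = FALSE)
x      <- h$mids     # bin midpoints
counts <- h$counts   # histogram counts

# Fit f(z) on the count scale: Poisson regression on a natural spline, df = 7.
fit <- glm(counts ~ ns(x, df = 7), family = poisson)
f   <- fitted(fit)   # expected count per bin; adds up to about N

# Theoretical null N(0, 1) on the same count scale, with p0 = 1 assumed.
width <- diff(breaks)[1]
p0    <- 1
f0    <- N * width * dnorm(x)

# Local false discovery rate at each bin midpoint, capped at 1.
fdr <- pmin(p0 * f0 / f, 1)
```

Plotting `fdr` against `x` should show values near 1 around z = 0 and small values in the right tail. If the package is installed, `locfdr(zex, nulltype = 0, plot = 0)$fdr` gives its per-case estimates for comparison; they will differ from this sketch because the package estimates p0 rather than fixing it.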

