Package IGG. April 9, 2018


Type Package
Title Inverse Gamma-Gamma
Version 1.0
Date 2018-04-04
Author Ray Bai, Malay Ghosh
Maintainer Ray Bai <raybai07@ufl.edu>
Description Implements Bayesian linear regression, normal means estimation, and variable selection using the inverse gamma-gamma prior, as introduced by Bai and Ghosh (2018) <arxiv:1710.04369>.
License GPL-3
LazyData true
Depends R (>= 3.1.0), Matrix, MASS, glmnet, pscl, GIGrvg
NeedsCompilation yes
Repository CRAN
Date/Publication 2018-04-09 12:11:15 UTC

R topics documented:

IGG-package
diabetes
igg
igg.normalmeans
singh2002

IGG-package    Inverse Gamma-Gamma

Description

This package contains the functions igg and igg.normalmeans for implementing Bayesian linear regression, sparse normal means estimation, and variable selection with the inverse gamma-gamma (IGG) prior, as introduced by Bai and Ghosh (2018) <arxiv:1710.04369>.

Details

Index of help topics:

IGG-package        Inverse Gamma-Gamma
diabetes           Blood and other measurements in diabetics
igg                Inverse Gamma-Gamma Regression
igg.normalmeans    Normal Means Estimation and Classification with the IGG Prior
singh2002          Prostate Cancer Study of Singh et al. (2002)

This package implements the IGG model for sparse Bayesian linear regression and the normal means problem. It performs both estimation and model selection. The igg and igg.normalmeans functions also return the endpoints of the 95 percent credible intervals (i.e. the 2.5th and 97.5th percentiles) for every parameter of interest, so that uncertainty can be assessed.

Author(s)

Ray Bai and Malay Ghosh

Maintainer: Ray Bai <raybai07@ufl.edu>

References

Bai, R. and Ghosh, M. (2018). "The Inverse Gamma-Gamma Prior for Optimal Posterior Contraction and Multiple Hypothesis Testing." Submitted, arxiv:1711.07635.
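Examples

The percentile endpoints mentioned in the package overview can be computed from any matrix of posterior draws. The following is a minimal base-R sketch, not the package's internal code. The helper name interval_select, the synthetic draws matrix, and the rule "select a parameter when its interval excludes zero" are assumptions made for illustration.

```r
# Hypothetical helper (not part of IGG): summarize a matrix of posterior draws.
# `draws` is an (iterations x parameters) matrix of MCMC samples.
interval_select <- function(draws, level = 0.95) {
  alpha <- 1 - level
  # Lower and upper percentile endpoints for every column (parameter)
  endpoints <- apply(draws, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2))
  lower <- endpoints[1, ]
  upper <- endpoints[2, ]
  # Assumed rule: a parameter is selected when its interval excludes zero
  selected <- as.integer(lower > 0 | upper < 0)
  list(intervals = rbind(lower = lower, upper = upper),
       classifications = selected)
}

# Synthetic draws: first parameter centered at 5, second centered at 0
set.seed(1)
draws <- cbind(rnorm(2000, mean = 5, sd = 1),
               rnorm(2000, mean = 0, sd = 1))
res <- interval_select(draws)
res$intervals
res$classifications
```

With draws like these, the first column's interval lies well above zero and the second column's interval straddles zero, so only the first parameter is flagged.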

diabetes    Blood and other measurements in diabetics

Description

The diabetes data frame has 442 rows and 3 columns. These are the data used in the Efron et al. "Least Angle Regression" paper and are also available in the lars package.

Usage

data(diabetes)

Format

This data frame consists of the following columns:

x: a design matrix with 10 columns (no interactions).
y: a numeric vector.
x2: a design matrix with 64 columns (includes interactions).

Details

The x matrix has been standardized to have unit L2 norm in each column and zero mean. The matrix x2 consists of x plus certain interactions.

Source

https://cran.r-project.org/web/packages/lars/

References

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2003). "Least Angle Regression" (with discussion). Annals of Statistics, 32(2): 407-499.

igg    Inverse Gamma-Gamma Regression

Description

This function provides a fully Bayesian approach for obtaining a sparse estimate of the $p \times 1$ vector $\beta$ in the univariate linear regression model,

$y = X\beta + \epsilon$, where $\epsilon \sim N_n(0, \sigma^2 I_n)$.

This is achieved by placing the inverse gamma-gamma (IGG) prior on the coefficients of $\beta$. In the case where $p > n$, we utilize an efficient sampler from Bhattacharya et al. (2016) to reduce the computational cost of sampling from the full conditional density of $\beta$ to $O(n^2 p)$.
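The following base-R sketch illustrates the kind of sampler described by Bhattacharya et al. (2016); it is not taken from the IGG source code. It draws from $N(\mu, \Sigma)$ with $\Sigma = (\Phi^T \Phi + D^{-1})^{-1}$ and $\mu = \Sigma \Phi^T \alpha$, where $D$ is diagonal, by solving only an $n \times n$ linear system. The function name fast_gaussian_draw and its arguments are hypothetical. Mapping it to the regression model above (for example $\Phi = X/\sigma$, $\alpha = y/\sigma$, and $D = \sigma^2 \mathrm{diag}(\lambda_i \xi_i)$) is an assumption of this sketch.

```r
# Hypothetical helper (not part of IGG): one draw from N(mu, Sigma) with
#   Sigma = (Phi' Phi + D^{-1})^{-1},  mu = Sigma Phi' alpha,
# where D = diag(d). Only an n x n system is solved, so the cost is
# dominated by forming Phi D Phi', which is O(n^2 p).
fast_gaussian_draw <- function(Phi, alpha, d) {
  n <- nrow(Phi)
  p <- ncol(Phi)
  u <- rnorm(p, mean = 0, sd = sqrt(d))   # u ~ N(0, D)
  delta <- rnorm(n)                       # delta ~ N(0, I_n)
  v <- as.vector(Phi %*% u) + delta       # v = Phi u + delta
  Phi_D <- sweep(Phi, 2, d, `*`)          # Phi %*% diag(d), without forming diag(d)
  M <- Phi_D %*% t(Phi) + diag(n)         # Phi D Phi' + I_n  (n x n)
  w <- solve(M, alpha - v)                # solve M w = alpha - v
  u + as.vector(t(Phi_D) %*% w)           # theta = u + D Phi' w
}

# Small p > n illustration on synthetic inputs
set.seed(1)
n <- 5
p <- 50
Phi <- matrix(rnorm(n * p), n, p)
alpha <- rnorm(n)
d <- rep(1, p)
draw <- fast_gaussian_draw(Phi, alpha, d)
length(draw)
```

The design choice is to avoid inverting the $p \times p$ matrix $\Phi^T \Phi + D^{-1}$, which would cost $O(p^3)$ when $p$ is much larger than $n$.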

Usage

igg(X, y, a=NA, b=NA, sigma2=NA, max.steps=10000, burnin=5000)

Arguments

X            $n \times p$ design matrix.
y            $n \times 1$ response vector.
a            The parameter for IG(a, 1). If not specified (a=NA), defaults to 1/2 + 1/p. The user may specify a value for a between 0 and 1.
b            The parameter for G(b, 1). If not specified (b=NA), defaults to 1/p. The user may specify a value for b between 0 and 1.
sigma2       The variance parameter. If the user does not specify this (sigma2=NA), the Gibbs sampler will estimate it using Jeffreys prior. If $\sigma^2$ is known or estimated separately (e.g. through empirical Bayes), the user may also specify it.
max.steps    The total number of iterations to run in the Gibbs sampler. Defaults to 10,000.
burnin       The number of burn-in iterations for the Gibbs sampler. Defaults to 5,000.

Details

The function performs sparse estimation of $\beta$ in the standard linear regression model and variable selection from the p covariates. Variable selection is performed by assessing the 95 percent marginal posterior credible intervals. The lower and upper endpoints of the 95 percent credible intervals for each of the p covariates are also returned so that the user may assess uncertainty.

The full model is:

$Y \mid (X, \beta) \sim N_n(X\beta, \sigma^2 I_n)$,
$\beta_i \mid (\lambda_i, \xi_i, \sigma^2) \sim N(0, \sigma^2 \lambda_i \xi_i), \quad i = 1, \ldots, p$,
$\lambda_i \sim IG(a, 1), \quad i = 1, \ldots, p$,
$\xi_i \sim G(b, 1), \quad i = 1, \ldots, p$,
$\sigma^2 \propto 1/\sigma^2$.

If $\sigma^2$ is known or estimated separately, the Gibbs sampler does not sample from the full conditional for $\sigma^2$.

Value

The function returns a list containing the following components:

beta.hat               The posterior mean estimate of $\beta$.
beta.med               The posterior median estimate of $\beta$.
beta.intervals         The lower and upper endpoints of the 95 percent credible intervals for all p estimates in $\beta$.
igg.classifications    A p-dimensional binary vector with "1" if the covariate is selected and "0" if it is deemed irrelevant.
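To get a feel for how sparse the prior in the hierarchy above is, one can simulate coefficients from it directly. This is a minimal base-R sketch, not package code. It assumes IG(a, 1) means an inverse gamma with shape a and scale 1, that G(b, 1) means a gamma with shape b and rate 1, and it fixes sigma2 at 1 for illustration.

```r
# Simulate coefficients from the prior hierarchy with the default
# hyperparameters a = 1/2 + 1/p and b = 1/p.
# Assumed parameterization: IG(a, 1) has shape a and scale 1;
# G(b, 1) has shape b and rate 1.
set.seed(42)
p <- 1000
a <- 1 / 2 + 1 / p
b <- 1 / p
sigma2 <- 1  # fixed here for illustration only

lambda <- 1 / rgamma(p, shape = a, rate = 1)  # lambda_i ~ IG(a, 1)
xi <- rgamma(p, shape = b, rate = 1)          # xi_i ~ G(b, 1)
beta <- rnorm(p, mean = 0, sd = sqrt(sigma2 * lambda * xi))

# Proportion of simulated coefficients that are essentially zero
mean(abs(beta) < 1e-3)
# The heavy tail of lambda_i still allows occasional large values
max(abs(beta))
```

Because b = 1/p is very small, most xi_i are extremely close to zero, which pulls the corresponding coefficients toward zero.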

Author(s)

Ray Bai and Malay Ghosh

References

Bai, R. and Ghosh, M. (2018). "The Inverse Gamma-Gamma Prior for Optimal Posterior Contraction and Multiple Hypothesis Testing." Submitted, arxiv:1711.07635.

Bhattacharya, A., Chakraborty, A., and Mallick, B.K. (2016). "Fast Sampling with Gaussian Scale Mixture Priors in High-Dimensional Regression." Biometrika, 69(2): 447-457.

Examples

# Load diabetes data
data(diabetes)
attach(diabetes)
X <- scale(diabetes$x)
y <- scale(diabetes$y)

# Fit the IGG regression model
igg.model <- igg(X=X, y=y, max.steps=5000, burnin=2500)

# Posterior median estimates
igg.model$beta.med

# 95 percent posterior credible intervals
igg.model$beta.intervals

# Variable selection
igg.model$igg.classifications

igg.normalmeans    Normal Means Estimation and Classification with the IGG Prior

Description

This function provides a fully Bayesian approach for obtaining a sparse estimate of $\theta = (\theta_1, \ldots, \theta_n)$ in the normal means problem,

$X_i = \theta_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.

This is achieved by placing the inverse gamma-gamma (IGG) prior on the individual $\theta_i$'s. Variable selection can also be performed, either by thresholding the shrinkage factor in the posterior mean or by examining the marginal 95 percent credible intervals.

Usage

igg.normalmeans(x, a=NA, b=NA, sigma2=NA, var.select = c("threshold", "intervals"), max.steps=10000, burnin=5000)

Arguments

x             An $n \times 1$ multivariate normal vector.
a             The parameter for IG(a, 1). If not specified (a=NA), defaults to 1/2 + 1/n. The user may specify a value for a between 0 and 1.
b             The parameter for G(b, 1). If not specified (b=NA), defaults to 1/n. The user may specify a value for b between 0 and 1.
sigma2        The variance parameter. If the user does not specify this (sigma2=NA), the Gibbs sampler will estimate it using Jeffreys prior. If $\sigma^2$ is known or estimated separately (e.g. through empirical Bayes), the user may also specify it.
var.select    The method of variable selection. "threshold" selects variables by thresholding the shrinkage factor in the posterior mean. "intervals" classifies entries of x as either signals ($x_i \neq 0$) or noise ($x_i = 0$) by examining the 95 percent marginal credible intervals.
max.steps     The total number of iterations to run in the Gibbs sampler. Defaults to 10,000.
burnin        The number of burn-in iterations for the Gibbs sampler. Defaults to 5,000.

Details

The function performs sparse estimation of $\theta = (\theta_1, \ldots, \theta_n)$ in the normal means problem. The full model is:

$X \mid \theta \sim N_n(\theta, \sigma^2 I_n)$,
$\theta_i \mid (\lambda_i, \xi_i, \sigma^2) \sim N(0, \sigma^2 \lambda_i \xi_i), \quad i = 1, \ldots, n$,
$\lambda_i \sim IG(a, 1), \quad i = 1, \ldots, n$,
$\xi_i \sim G(b, 1), \quad i = 1, \ldots, n$,
$\sigma^2 \propto 1/\sigma^2$.

If $\sigma^2$ is known or estimated separately, the Gibbs sampler does not sample from the full conditional for $\sigma^2$.
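The "threshold" option of var.select can be sketched from posterior draws of the local scale parameters. The snippet below is illustrative base R, not the package's implementation. It assumes the shrinkage factor is $\kappa_i = 1/(1 + \lambda_i \xi_i)$, a form this manual does not spell out, so that the posterior mean of $\theta_i$ is $E(1 - \kappa_i \mid X_i) X_i$ and an entry is flagged as signal when that weight exceeds 1/2. The helper name threshold_select and the synthetic draws are hypothetical.

```r
# Hypothetical helper (not part of IGG): classify entries by thresholding
# the posterior shrinkage weight E(1 - kappa_i | X_i).
# `lambda_draws` and `xi_draws` are (iterations x n) matrices of posterior draws.
threshold_select <- function(lambda_draws, xi_draws, x) {
  # Assumed form of the shrinkage factor: kappa_i = 1 / (1 + lambda_i * xi_i)
  kappa <- 1 / (1 + lambda_draws * xi_draws)
  weight <- colMeans(1 - kappa)  # Monte Carlo estimate of E(1 - kappa_i | X_i)
  list(weight = weight,
       theta.hat = weight * x,                       # shrunken posterior mean
       classifications = as.integer(weight > 0.5))   # 1 = signal, 0 = noise
}

# Synthetic "posterior draws" with constant values, for illustration only
n_iter <- 500
x <- c(0.3, 2.5, 7.1)
lambda_draws <- matrix(1, n_iter, 3)
xi_draws <- matrix(rep(c(0.01, 4, 100), each = n_iter), n_iter, 3)

res <- threshold_select(lambda_draws, xi_draws, x)
res$weight
res$classifications
```

A weight near 0 shrinks the observation almost entirely to zero, while a weight near 1 leaves it nearly untouched.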

As described in Bai and Ghosh (2018), the posterior mean is of the form $[E(1 - \kappa_i \mid X_i)] X_i$, $i = 1, \ldots, n$. The "threshold" method for variable selection is to classify $\theta_i$ as signal ($\theta_i \neq 0$) if

$E(1 - \kappa_i \mid X_i) > 1/2$,

and to classify $\theta_i$ as noise ($\theta_i = 0$) if

$E(1 - \kappa_i \mid X_i) \leq 1/2$.

Value

The function returns a list containing the following components:

theta.hat              The posterior mean estimate of $\theta$.
theta.med              The posterior median estimate of $\theta$.
theta.intervals        The lower and upper endpoints of the 95 percent credible intervals for all n components of $\theta$.
igg.classifications    An n-dimensional binary vector with "1" if the covariate is selected and "0" if it is deemed irrelevant.

Author(s)

Ray Bai and Malay Ghosh

References

Bai, R. and Ghosh, M. (2018). "The Inverse Gamma-Gamma Prior for Optimal Posterior Contraction and Multiple Hypothesis Testing." Submitted, arxiv:1711.07635.

Examples

## Example on synthetic data.
## 5 percent of entries in a sparse vector theta
## are set equal to signal value A = 7.

n <- 100
sparsity.level <- 5
A <- 7

# Initialize theta vector of all zeros
theta.true <- rep(0,n)

# Set (sparsity.level)% of them to be A
q <- floor(n*(sparsity.level/100))

# Pick random indices of theta.true to equal A
signal.indices <- sample(1:n, size=q, replace=FALSE)

# Generate true theta
theta.true[signal.indices] <- A

# Generate data X by corrupting theta.true with noise
X <- theta.true + rnorm(n,0,1)

# Run the IGG model on X
# For optimal performance, set max.steps=10000 and burnin=5000.
igg.model <- igg.normalmeans(X, var.select="threshold", max.steps=2000, burnin=1000)

# Calculate mean squared error
igg.mse <- sum((igg.model$theta.med-theta.true)^2)/n
igg.mse

# To compute misclassification probability and false discovery rate
true.classifications <- rep(0,n)
signal.indices <- which(theta.true != 0)
true.classifications[signal.indices] <- 1
igg.classifications <- igg.model$igg.classifications
false.pos <- length(which(igg.classifications != 0 & true.classifications == 0))
num.pos <- length(which(igg.classifications != 0))
false.neg <- length(which(igg.classifications == 0 & true.classifications != 0))

# Calculate FDR and misclassification probability (MP)
igg.fdr <- false.pos/num.pos
igg.fdr
igg.mp <- (false.pos+false.neg)/n
igg.mp

## Not run:
## Prostate cancer data analysis.

## Running this code will allow you to reproduce
## the results in Section 6 of Bai and Ghosh (2018)

# Load the data
data(singh2002)
attach(singh2002)

# Only look at the gene expression data.
# First 50 rows are the cancer patients,
# and the last 52 rows are the control subjects.
prostate.data <- singh2002$x

# Perform 2-sample t-test and obtain z-scores
n <- ncol(prostate.data)
test.stat <- rep(NA, n)
z.scores <- rep(NA, n)

# Fill in the vectors
for(i in 1:n){
  test.stat[i] <- t.test(prostate.data[51:102,i], prostate.data[1:50,i])$statistic
  z.scores[i] <- qnorm(pt(test.stat[i],100))
}

# Apply IGG model on the z-scores.
# Here sigma2 is known with sigma2 = 1
igg.model <- igg.normalmeans(z.scores, sigma2=1, var.select="threshold")

# How many genes flagged as significant?
num.sig <- sum(igg.model$igg.classifications != 0)
num.sig

# Estimated effect size for 10 most significant genes
most.sig <- c(610,1720,332,364,914,3940,4546,1068,579,4331)
igg.model$theta.hat[most.sig]

## End(Not run)

singh2002    Prostate Cancer Study of Singh et al. (2002)

Description

Gene expression data (6033 genes for 102 samples) from the microarray study of Singh et al. (2002). Also available in the sda package.

Usage

data(singh2002)

Format

A list of two components:

x: a $102 \times 6033$ matrix containing the expression levels. The rows contain the samples and the columns the genes.
y: a factor containing the diagnosis for each sample ("cancer" or "healthy").

Details

This data set contains measurements of the gene expression of 6033 genes for 102 observations. The first 52 rows are for the cancer patients and the last 50 rows are for the normal control subjects.

Source

The data are described in Singh et al. (2002) and are provided in exactly the form used by Efron (2010).

References

Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D'Amico, A.V., Richie, J.P., Lander, E.S., Loda, M., Kantoff, P.W., Golub, T.R., and Sellers, W.R. (2002). "Gene expression correlates of clinical prostate cancer behavior." Cancer Cell, 1(2): 203-209.

Efron, B. (2010). "The Future of Indirect Evidence." Statistical Science, 25(2): 145-157.
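Examples

A samples-by-genes matrix with a two-level diagnosis factor, like the x and y components described above, can be turned into per-gene z-scores for use in a normal means analysis. The sketch below uses synthetic data rather than singh2002 so that it runs without the package. The helper name t_to_z is hypothetical, and using nrow(x) - 2 degrees of freedom in the t-to-z conversion is an assumption of this sketch.

```r
# Hypothetical helper (not part of IGG): two-sample t-statistic per column,
# converted to a z-score via z = qnorm(pt(t, df)).
# `x` is a (samples x features) matrix, `group` has one label per sample.
t_to_z <- function(x, group) {
  group <- factor(group)
  stopifnot(nlevels(group) == 2, nrow(x) == length(group))
  in_first <- group == levels(group)[1]
  df <- nrow(x) - 2  # assumed degrees of freedom for the conversion
  vapply(seq_len(ncol(x)), function(j) {
    t_stat <- unname(t.test(x[in_first, j], x[!in_first, j])$statistic)
    qnorm(pt(t_stat, df))
  }, numeric(1))
}

# Synthetic data: 20 samples, 5 features, only feature 1 differs by group
set.seed(7)
x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)
x[1:10, 1] <- x[1:10, 1] + 5
group <- rep(c("cancer", "healthy"), each = 10)

z <- t_to_z(x, group)
z
```

The resulting vector z could then be passed to a normal means routine with the variance fixed at 1, in the same way the prostate example passes z.scores to igg.normalmeans.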

Index

diabetes
IGG (IGG-package)
igg
IGG-package
igg.normalmeans
singh2002