Package threeboost. February 20, 2015


Type Package
Title Thresholded variable selection and prediction based on estimating equations
Version 1.1
Date 2014-08-09
Author Julian Wolfson and Christopher Miller
Maintainer Julian Wolfson <julianw@umn.edu>
Description This package implements a thresholded version of the EEBoost algorithm described in Wolfson (2011, JASA). EEBoost is a general-purpose method for variable selection which can be applied whenever inference would be based on an estimating equation. The package currently implements variable selection based on the Generalized Estimating Equations, but can also accommodate user-provided estimating functions. Thresholded EEBoost is a generalization which allows multiple variables to enter the model at each boosting step.
License GPL-3
Imports Matrix
Suggests mvtnorm
NeedsCompilation no
Repository CRAN
Date/Publication 2014-08-11 00:18:02

R topics documented:

threeboost-package
coef_traceplot
ee.gee
eeboost
geeboost
QIC
threeboost
Index

threeboost-package    Thresholded boosting based on estimating equations

Description

The threeboost package implements a thresholded version of the EEBoost algorithm described in Wolfson (2011, JASA). EEBoost is a general-purpose method for variable selection which can be applied whenever inference would be based on an estimating equation. Thresholded EEBoost (function threeboost) is a generalization of EEBoost which allows multiple variables to enter the model at each boosting step. EEBoost (function eeboost) is a special case of thresholded boosting with the threshold set to 1.

The package currently provides a "pre-packaged" function, geeboost, which carries out variable selection for correlated outcome data based on the Generalized Estimating Equations. However, the threeboost (and eeboost) functions can also accommodate user-provided estimating functions.


coef_traceplot    Draw a coefficient traceplot

Description

This function draws a traceplot of coefficient values vs. number of iterations.

Usage

coef_traceplot(coef.mat, varnames = NULL)

Arguments

coef.mat    The matrix of coefficients (one coefficient vector per row).
varnames    (Optional) list of variable name labels to make the traceplot more readable.


ee.gee    GEE estimating functions

Description

Internal functions for computing the GEE. Should generally not be called by the user.

Usage

ee.gee(Y, X, b, mu.y, g.y, v.y, aux, id = 1:length(Y),
  uid = sort(unique(id)),
  rows.indivs = lapply(uid, function(j) { which(id == j) }),
  corstr = "ind")

ee.gee.aux(Y, X, b, mu.y, g.y, v.y, id = 1:length(Y),
  uid = sort(unique(id)),
  rows.indivs = lapply(uid, function(j) { which(id == j) }))

Arguments

Y              Vector of (correlated) outcomes
X              Matrix of predictors
b              Vector of coefficients
mu.y           Mean function
g.y            Link function (inverse of mean function)
v.y            Variance function
aux            Auxiliary function for computing (co)variance parameters
id             Vector of cluster IDs
uid            Vector of unique subject IDs
rows.indivs    List of rows of X corresponding to each subject ID
corstr         Working correlation structure


eeboost    EEBoost

Description

Alias for ThrEEBoost (which defaults to a threshold value of 1).

Usage

eeboost(...)

Arguments

...    Arguments to threeboost.

See Also

threeboost
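As a rough illustration of what a GEE estimating function evaluates, the sketch below computes the standard GEE score, U(b) = sum over clusters of D_i' V_i^{-1} (Y_i - mu_i(b)), in base R. It is not the package's ee.gee code. The name gee_score_sketch, the dmu argument (derivative of the mean function with respect to the linear predictor) and the fixed exchangeable correlation rho are assumptions made for this example only.

```r
# Hypothetical helper, not part of threeboost: standard GEE score for
# clustered data with an exchangeable working correlation (rho = 0 gives
# the independence working correlation).
gee_score_sketch <- function(Y, X, b, id,
                             mu  = identity,                         # mean function
                             dmu = function(eta) rep(1, length(eta)), # d mu / d eta
                             v   = function(m) rep(1, length(m)),     # variance function
                             rho = 0) {
  U <- rep(0, ncol(X))
  for (rows in split(seq_along(Y), id)) {
    Xi  <- X[rows, , drop = FALSE]
    eta <- drop(Xi %*% b)
    m   <- mu(eta)
    ni  <- length(rows)
    R <- matrix(rho, ni, ni)
    diag(R) <- 1
    A_half <- diag(sqrt(v(m)), ni)     # ni x ni, also safe when ni == 1
    V <- A_half %*% R %*% A_half       # working covariance of cluster i
    D <- dmu(eta) * Xi                 # scales row k of Xi by dmu(eta)[k]
    U <- U + drop(crossprod(D, solve(V, Y[rows] - m)))
  }
  U
}

# Small demonstration with simulated data (identity link, unit variance)
set.seed(2)
id <- rep(1:10, each = 3)
X  <- matrix(rnorm(30 * 2), 30, 2)
Y  <- rnorm(30)
U_ind <- gee_score_sketch(Y, X, b = c(0, 0), id = id)
```

With the identity link, unit variance and rho = 0, this reduces to the least-squares score t(X) %*% (Y - X %*% b).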

geeboost    GEEBoost

Description

Thresholded boosting for correlated data via GEE

Usage

geeboost(Y, X, id = 1:length(Y), family = "gaussian", corstr = "ind",
  traceplot = FALSE, ...)

Arguments

Y            Vector of (presumably correlated) outcomes
X            Matrix of predictors
id           Index indicating clusters of correlated observations
family       Outcome distribution to be used. "gaussian" (the default), "binomial", and "poisson" are currently implemented.
corstr       Working correlation structure to use. "ind" (for independence, the default) and "exch" (for exchangeable) are currently implemented.
traceplot    Option of whether or not to produce a traceplot of the coefficient values. See coef_traceplot for details.
...          Additional arguments to be passed to the threeboost function. See the threeboost help page for details.

Details

This function implements thresholded EEBoost for the Generalized Estimating Equations. The arguments are consistent with those used by geepack.

Value

A list with three entries:

coefmat        A matrix with maxit rows and ncol(X) columns, with each row containing the parameter vector from an iteration of EEBoost.
QICs           A vector of QICs computed from the coefficients.
final.model    The coefficients corresponding to the model (set of coefficients) yielding the smallest QIC.

See Also

threeboost

References

Wolfson, J. EEBoost: A general method for prediction and variable selection using estimating equations. Journal of the American Statistical Association, 2011.
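The final.model entry is the row of the coefficient matrix with the smallest QIC. The selection step can be sketched in base R as below. The scoring function is only a stand-in: it assumes a Gaussian QICu-style score (residual sum of squares plus twice the number of nonzero coefficients), which may differ from what the package's QIC function computes. The names qic_u_gaussian and select_by_score are hypothetical.

```r
# Hypothetical stand-in for a QIC-type score (assumption, Gaussian case):
# -2 * quasi-likelihood with unit dispersion, plus 2 * (number of nonzero
# coefficients).
qic_u_gaussian <- function(Y, X, b) {
  resid <- Y - drop(X %*% b)
  sum(resid^2) + 2 * sum(b != 0)
}

# Score every row of a coefficient path and return the best-scoring row.
select_by_score <- function(coef_mat, Y, X, score_fn) {
  scores <- apply(coef_mat, 1, function(b) score_fn(Y, X, b))
  list(scores = scores, final.model = coef_mat[which.min(scores), ])
}

# Tiny deterministic demonstration: three candidate coefficient vectors
X <- cbind(c(1, 2, 3, 4), c(1, -1, 1, -1))
Y <- c(1, 2, 3, 4)                  # exactly 1 * first column
coef_mat <- rbind(c(0, 0), c(1, 0), c(1, 1))
sel <- select_by_score(coef_mat, Y, X, qic_u_gaussian)
print(sel$final.model)
```

In this toy case the second candidate, c(1, 0), fits exactly and uses one coefficient, so it has the smallest score.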

Examples

# Generate some test data
library(mvtnorm)
library(Matrix)
n <- 30
n.var <- 50
clust.size <- 4

B <- c(rep(2,5), rep(0.2,5), rep(0.05,10), rep(0,n.var-20))
mn.x <- rep(0, n.var)
sd.x <- 0.5
rho.x <- 0.3
cov.sig.x <- sd.x^2*((1-rho.x)*diag(rep(1,10)) + rho.x*matrix(data=1,nrow=10,ncol=10))
sig.x <- as.matrix( Matrix::bdiag(lapply(1:(n.var/10), function(x) { cov.sig.x } ) ) )
sd.y <- 0.5
rho.y <- 0.3
indiv.sig <- sd.y^2*( (1-rho.y)*diag(rep(1,4)) + rho.y*matrix(data=1,nrow=4,ncol=4) )
sig.list <- vector("list", n)
for(i in 1:n) { sig.list[[i]] <- indiv.sig }
Sig <- Matrix::bdiag(sig.list)
indiv.index <- rep(1:n, each=clust.size)
sig.y <- as.matrix(Sig)

if(require(mvtnorm)) {
  X <- mvtnorm::rmvnorm(n*clust.size, mean=mn.x, sigma=sig.x)
  mn.y <- X %*% B
  Y <- mvtnorm::rmvnorm(1, mean=mn.y, sigma=sig.y) ## Correlated continuous outcomes
  expit <- function(x) { exp(x) / (1 + exp(x)) }
  ## Correlated binary outcomes
  Y.bin <- rbinom(n*clust.size, 1, p=expit(mvtnorm::rmvnorm(1, mean=mn.y, sigma=sig.y)))
  Y.pois <- rpois(length(Y), lambda=exp(mn.y)) ## Correlated Poisson outcomes
} else { stop("Need mvtnorm package to generate correlated data.") }

## Run EEBoost (w/ indep working correlation)
results.lin <- geeboost(Y, X, id=indiv.index, maxit=1000)
## Not run:
results.bin <- geeboost(Y.bin, X, id=indiv.index, family="binomial", maxit=1000)
results.pois <- geeboost(Y.pois, X, id=indiv.index, family="poisson", maxit=1000, traceplot=TRUE)
## End(Not run)

print(results.lin$final.model)


QIC    Pan's QIC

Description

Calculates a simple version of Pan's QIC for a GEE model defined by a vector of regression coefficients.

Usage

QIC(Y, X, b, family = "gaussian")

Arguments

Y         A vector of outcomes.
X         A matrix of predictors.
b         A vector of regression coefficients (e.g., a row from the coefficient matrix produced by geeboost).
family    Version of QIC to implement, for either "gaussian", "binomial" or "poisson" outcomes. Should match the family argument used in the original boosting algorithm.


threeboost    Thresholded EEBoost

Description

Run the thresholded EEBoost procedure.

Usage

threeboost(Y, X, EE.fn, b.init = rep(0, ncol(X)), eps = 0.01,
  maxit = 1000, itertrack = FALSE, reportinterval = 1,
  stop.rule = "on.repeat", thresh = 1)

Arguments

Y                 Vector of outcomes.
X                 Matrix of predictors. Will be automatically scaled using the scale function.
EE.fn             Estimating function taking arguments Y, X, and parameter vector b.
b.init            Initial parameter values. For variable selection, typically start with a vector of zeroes (the default).
eps               Step length. Default is 0.01; the value should be relatively small.
maxit             Maximum number of iterations. Default is 1000.
itertrack         Indicates whether or not diagnostic information should be printed out at each iteration. Default is FALSE.
reportinterval    If itertrack is TRUE, how many iterations the algorithm should wait between each diagnostic report.
stop.rule         Rule for stopping the iterations before maxit is reached. Possible values are "on.repeat" and "pct.change". See Details for more information.
thresh            Threshold parameter for ThrEEBoost.

Details

Implements a thresholded version of the EEBoost algorithm described in Wolfson (2011, JASA). EEBoost is a general-purpose method for variable selection which can be applied whenever inference would be based on an estimating equation. The package currently implements variable selection based on the Generalized Estimating Equations, but can also accommodate user-provided estimating functions. Thresholded EEBoost is a generalization which allows multiple variables to enter the model at each boosting step. Thresholded EEBoost with thresholding parameter = 1 is equivalent to EEBoost.

Typically, the boosting procedure is run for maxit iterations, producing maxit models, each defined by a set of regression coefficients. An additional step (e.g. model scoring, cross-validated estimate of prediction error) is needed to select a final model. However, an alternative is to stop the iterations before maxit is reached. The user can request this feature by setting stop.rule to one of the following options:

"on.repeat": Sometimes, ThrEEBoost will alternate between stepping on the same two directions, usually indicating numerical problems. Setting stop.rule="on.repeat" will terminate the algorithm if this happens.

"pct.change": Stop if, for consecutive iterations, the sum of the magnitudes of the elements of the estimating equation changes by < 1%.

Value

A matrix with maxit rows and ncol(X) columns, with each row containing the parameter vector from an iteration of ThrEEBoost.

See Also

geeboost for an example of how to call (Thr)EEBoost with a custom estimating function.

References

Wolfson, J. EEBoost: A general method for prediction and variable selection using estimating equations. Journal of the American Statistical Association, 2011.
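The thresholded boosting step described in Details can be sketched in base R as follows. This is a minimal illustration and not the package's implementation. It assumes that the estimating function is score-like (so each selected coefficient moves by eps in the direction of the sign of its component) and that a coordinate is selected when its magnitude is at least thresh times the largest magnitude. The names thr_boost_sketch and score_gaussian are hypothetical, and no early stopping rule is included.

```r
# Hypothetical score-like estimating function (Gaussian, independence):
# U(b) = t(X) %*% (Y - X %*% b)
score_gaussian <- function(Y, X, b) {
  drop(crossprod(X, Y - X %*% b))
}

# Sketch of a thresholded boosting loop (assumed update rule, see lead-in).
thr_boost_sketch <- function(Y, X, ee_fn, eps = 0.01, maxit = 200, thresh = 1) {
  b <- rep(0, ncol(X))
  coef_mat <- matrix(0, nrow = maxit, ncol = ncol(X))
  for (it in seq_len(maxit)) {
    g <- ee_fn(Y, X, b)
    # Coordinates whose magnitude is at least `thresh` times the largest one;
    # with thresh = 1 only the largest coordinate is stepped on.
    active <- which(abs(g) >= thresh * max(abs(g)))
    b[active] <- b[active] + eps * sign(g[active])
    coef_mat[it, ] <- b             # one coefficient vector per iteration
  }
  coef_mat
}

# Simulated data: two of five predictors carry signal
set.seed(1)
X <- scale(matrix(rnorm(200 * 5), 200, 5))
Y <- drop(X %*% c(2, 0, 0, 1, 0)) + rnorm(200)
path1 <- thr_boost_sketch(Y, X, score_gaussian, thresh = 1)    # EEBoost-like
path2 <- thr_boost_sketch(Y, X, score_gaussian, thresh = 0.5)  # thresholded
```

Each row of the returned matrix is one model along the path, so a separate scoring step is still needed to choose among the rows.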

Examples

library(Matrix)

# Generate some test data - uses mvtnorm package
n <- 30
n.var <- 50
clust.size <- 4

B <- c(rep(2,5), rep(0.2,5), rep(0.05,10), rep(0,n.var-20))
mn.x <- rep(0, n.var)
sd.x <- 0.5
rho.x <- 0.3
cov.sig.x <- sd.x^2*((1-rho.x)*diag(rep(1,10)) + rho.x*matrix(data=1,nrow=10,ncol=10))
sig.x <- as.matrix( Matrix::bdiag(lapply(1:(n.var/10), function(x) { cov.sig.x } ) ) )
sd.y <- 0.5
rho.y <- 0.3
indiv.sig <- sd.y^2*( (1-rho.y)*diag(rep(1,4)) + rho.y*matrix(data=1,nrow=4,ncol=4) )
sig.list <- vector("list", n)
for(i in 1:n) { sig.list[[i]] <- indiv.sig }
Sig <- Matrix::bdiag(sig.list)
indiv.index <- rep(1:n, each=clust.size)
sig.y <- as.matrix(Sig)

if(require(mvtnorm)) {
  X <- mvtnorm::rmvnorm(n*clust.size, mean=mn.x, sigma=sig.x)
  mn.y <- X %*% B
  ## Correlated continuous outcome
  Y <- mvtnorm::rmvnorm(1, mean=mn.y, sigma=sig.y)
} else { stop("Need mvtnorm package to generate correlated example data.") }

## Define the Gaussian GEE estimating function with independence working correlation
mu.lin <- function(eta){eta}
g.lin <- function(m){m}
v.lin <- function(eta){rep(1,length(eta))}
EE.fn.ind <- function(Y,X,b) {
  ee.gee(Y, X, b,
    mu.y=mu.lin,
    g.y=g.lin,
    v.y=v.lin,
    aux=function(...) { ee.gee.aux(..., mu.y=mu.lin, g.y=g.lin, v.y=v.lin) },
    id=indiv.index,
    corstr="ind")
}

## These two give the same result
coef.mat <- eeboost(Y, X, EE.fn.ind, maxit=250)
coef.mat2 <- geeboost(Y, X, id=indiv.index, family="gaussian", corstr="ind", maxit=250)$coefmat

par(mfrow=c(1,2))
coef_traceplot(coef.mat)
coef_traceplot(coef.mat2)
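For readers without the package installed, a coefficient traceplot of the kind drawn above can be approximated with base graphics. The sketch below is not the package's coef_traceplot: the function name traceplot_sketch, the legend placement and the demonstration matrix are assumptions for illustration.

```r
# Hypothetical base-graphics stand-in for a coefficient traceplot:
# one line per coefficient, iteration number on the horizontal axis.
traceplot_sketch <- function(coef_mat, varnames = NULL) {
  cols <- seq_len(ncol(coef_mat))
  matplot(seq_len(nrow(coef_mat)), coef_mat, type = "l", lty = 1, col = cols,
          xlab = "Iteration", ylab = "Coefficient value")
  if (!is.null(varnames)) {
    legend("topleft", legend = varnames, col = cols, lty = 1, cex = 0.7)
  }
  invisible(NULL)
}

# Made-up coefficient path: the first coefficient grows from the start,
# the second enters (with a negative sign) at iteration 26.
demo_path <- cbind(cumsum(rep(0.01, 50)),
                   c(rep(0, 25), cumsum(rep(-0.01, 25))))
```

Calling traceplot_sketch(demo_path, varnames = c("b1", "b2")) draws the two paths on the active graphics device.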

Index

coef_traceplot
ee.gee
eeboost
geeboost
QIC
scale
threeboost
threeboost-package