Package sodavis. May 13, 2018

Similar documents
Package EMVS. April 24, 2018

Package SimSCRPiecewise

Package BNN. February 2, 2018

Package IsingFit. September 7, 2016

Package IGG. R topics documented: April 9, 2018

arxiv: v2 [stat.me] 29 May 2017

Package invgamma. May 7, 2017

Package covsep. May 6, 2018

Package SpatialVS. R topics documented: November 10, Title Spatial Variable Selection Version 1.1

Package sbgcop. May 29, 2018

Package msir. R topics documented: April 7, Type Package Version Date Title Model-Based Sliced Inverse Regression

Package flexcwm. R topics documented: July 11, Type Package. Title Flexible Cluster-Weighted Modeling. Version 1.0.

Package NonpModelCheck

Package ShrinkCovMat

Package ICBayes. September 24, 2017

Package flexcwm. R topics documented: July 2, Type Package. Title Flexible Cluster-Weighted Modeling. Version 1.1.

Package CoxRidge. February 27, 2015

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Package Grace. R topics documented: April 9, Type Package

Package clogitboost. R topics documented: December 21, 2015

Package CopulaRegression

Package IUPS. February 19, 2015

Package scio. February 20, 2015

Package gma. September 19, 2017

Package depth.plot. December 20, 2015

Package comparec. R topics documented: February 19, Type Package

Package IndepTest. August 23, 2018

Package gtheory. October 30, 2016

Package generalhoslem

Package FinCovRegularization

Package ebal. R topics documented: February 19, Type Package

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Package ltsbase. R topics documented: February 20, 2015

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Package covtest. R topics documented:

Package factorqr. R topics documented: February 19, 2015

Package idmtpreg. February 27, 2018

Package ICGOR. January 13, 2017

Package anoint. July 19, 2015

Statistics 262: Intermediate Biostatistics Model selection

Package FDRreg. August 29, 2016

Package riv. R topics documented: May 24, Title Robust Instrumental Variables Estimator. Version Date

Package darts. February 19, 2015

Package GORCure. January 13, 2017

Package face. January 19, 2018

Package JointModel. R topics documented: June 9, Title Semiparametric Joint Models for Longitudinal and Counting Processes Version 1.

Package ARCensReg. September 11, 2016

Package CovSel. September 25, 2013

Package jmcm. November 25, 2017

Package penmsm. R topics documented: February 20, Type Package

Package threeboost. February 20, 2015

Package MultisiteMediation

Package bivariate. December 10, 2018

Package gwqs. R topics documented: May 7, Type Package. Title Generalized Weighted Quantile Sum Regression. Version 1.1.0

Package ForecastCombinations

Package effectfusion

Package spcov. R topics documented: February 20, 2015

Package metafuse. October 17, 2016

Package gk. R topics documented: February 4, Type Package Title g-and-k and g-and-h Distribution Functions Version 0.5.

Package ProDenICA. February 19, 2015

Package reinforcedpred

Package survidinri. February 20, 2015

Package bpp. December 13, 2016

Package RcppSMC. March 18, 2018

Package HIMA. November 8, 2017

Package LPTime. March 3, 2015

Package SPreFuGED. July 29, 2016

Package crrsc. R topics documented: February 19, 2015

Package decon. February 19, 2015

Package threg. August 10, 2015

Package mar. R topics documented: February 20, Title Multivariate AutoRegressive analysis Version Author S. M. Barbosa

Package My.stepwise. June 29, 2017

Package locfdr. July 15, Index 5

Package basad. November 20, 2017

Package abundant. January 1, 2017

Package sklarsomega. May 24, 2018

Package spatial.gev.bma

Package drgee. November 8, 2016

Package eel. September 1, 2015

Package TruncatedNormal

Package beam. May 3, 2018

Package SEMModComp. R topics documented: February 19, Type Package Title Model Comparisons for SEM Version 1.0 Date Author Roy Levy

Bayesian linear regression

Package BayesNI. February 19, 2015

Luke B Smith and Brian J Reich North Carolina State University May 21, 2013

Package MACAU2. R topics documented: April 8, Type Package. Title MACAU 2.0: Efficient Mixed Model Analysis of Count Data. Version 1.

Package HDtest. R topics documented: September 13, Type Package

Machine Learning Linear Classification. Prof. Matteo Matteucci

Package LCAextend. August 29, 2013

Other likelihoods. Patrick Breheny. April 25. Multinomial regression Robust regression Cox regression

Confidence Intervals for the Interaction Odds Ratio in Logistic Regression with Two Binary X s

Regression, Ridge Regression, Lasso

Package BayesS5. February 24, 2017

Package OGI. December 20, 2017

Package KRLS. July 10, 2017

Package Blendstat. February 21, 2018

Package hurdlr. R topics documented: July 2, Version 0.1

Package sscor. January 28, 2016

Package netcoh. May 2, 2016

The nltm Package. July 24, 2006

Transcription:

Type Package Package sodavis May 13, 2018 Title SODA: Main and Interaction Effects Selection for Logistic Regression, Quadratic Discriminant and General Index Models Version 1.2 Depends R (>= 3.0.0), nnet, MASS, mvtnorm Date 2018-05-12 Author Yang Li, Jun S. Liu Maintainer Yang Li <yangli.stat@gmail.com> Variable and interaction selection are essential to classification in high-dimensional setting. In this package, we provide the implementation of SODA procedure, which is a forwardbackward algorithm that selects both main and interaction effects under logistic regression and quadratic discriminant analysis. We also provide an extension, S-SODA, for dealing with the variable selection problem for semi-parametric models with continuous responses. License GPL-2 NeedsCompilation no Repository CRAN Date/Publication 2018-05-13 21:24:03 UTC R topics documented: mich_lung.......................................... 2 pumadyn.......................................... 2 soda............................................. 3 soda_trace_cv....................................... 4 s_soda............................................ 5 s_soda_model........................................ 6 s_soda_pred......................................... 7 s_soda_pred_grid...................................... 8 Index 9 1

2 pumadyn mich_lung Gene expression data for Michigan lung cancer study in Beer et al. (2002) Gene expression data of 5217 genes for n = 86 subjects, with 62 subjects in "good outcomes" (class 1) and 24 subjects in "poor outcomes" (class 2), from the microarray study of Beer et al. (2002). data(mich_lung) Format Response variable vector and design matrix on 86 observations for expression of 5217 genes. References Beer et al. (1999) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature medicine, 286(8): 816-824. pumadyn Pumadyn dataset This is a dataset synthetically generated from a realistic simulation of the dynamics of a Unimation Puma 560 robot arm. data(pumadyn) Format Response variable vector and design matrix on 4499 in-sample and 3693 out-sample observations for 32 predictor variables. References Corke, P. I. (1996). A Robotics Toolbox for MATLAB. IEEE Robotics and Automation Magazine, 3 (1): 24-32.

soda 3 soda SODA algorithm for variable and interaction selection SODA is a forward-backward variable and interaction selection algorithm under logistic regression model with second-order terms. In the forward stage, a stepwise procedure is conducted to screen for important predictors with both main and interaction effects, and in the backward stage SODA remove insignificant terms so as to optimize the extended BIC (EBIC) criterion. SODA is applicable for variable selection for logistic regression, linear/quadratic discriminant analysis and other discriminant analysis with generative model being in exponential family. soda(xx, yy, norm = F, debug = F, gam = 0, minf = 3) xx yy The response vector of dimension n * 1. norm debug gam minf Logical flag for xx variable quantile normalization to standard normal, prior to performing SODA algorithm. Default is norm=false. Quantile-normalization is suggested if the data contains obvious outliers. Logical flag for printing debug information. Tuning paramter gamma in extended BIC criterion. EBIC for selected set S: EBIC = -2 * log-likelihood + S * log(n) + 2 * S * gamma * log(p) Minimum number of steps in forward interaction screening. Default is minf=3. EBIC Type Var Term final_ebic final_var final_term Trace of extended Bayesian information criterion (EBIC) score. Trace of step type ("Forward (Main)", "Forward (Int)", "Backward"). Trace of selected variables. Trace of selected main and interaction terms. Final selected term set EBIC score. Final selected variables. Final selected main and interaction terms. Author(s) Yang Li, Jun S. Liu

4 soda_trace_cv References Li Y, Liu JS. (2017). Robust Variable and Interaction Selection for Logistic Regression and Multiple Index Models. Technical Report. Examples # # (uncomment the code to run) # # simulation study with 1 main effect and 2 interactions # N = 250; # p = 1000; # r = 0.5; # s = 1; # H = abs(outer(1:p, 1:p, "-")) # S = s * r^h; # S[cbind(1:p, 1:p)] = S[cbind(1:p, 1:p)] * s # xx = as.matrix(data.frame(mvrnorm(n, rep(0,p), S))); # zz = 1 + xx[,1] - xx[,10]^2 + xx[,10]*xx[,20]; # yy = as.numeric(runif(n) < exp(zz) / (1+exp(zz))) # res_soda = soda(xx, yy, gam=0.5); # cv_soda = soda_trace_cv(xx, yy, res_soda) # cv_soda # # Michigan lung cancer dataset # data(mich_lung); # res_soda = soda(mich_lung_xx, mich_lung_yy, gam=0.5); # cv_soda = soda_trace_cv(mich_lung_xx, mich_lung_yy, res_soda) # cv_soda soda_trace_cv Calculate a trace of cross-validation error rate for SODA forwardbackward procedure This function takes a SODA result variable as input, and calculates the cross-validation error for each step of the SODA procedure. soda_trace_cv(xx, yy, res_soda) xx yy The response vector of dimension n * 1. res_soda SODA result varaible. See example below.

s_soda 5 Author(s) Yang Li, Jun S. Liu Examples # Michigan lung cancer dataset (uncomment the code to run) #data(mich_lung); #res_soda = soda(mich_lung_xx, mich_lung_yy, gam=0.5); #cv_soda = soda_trace_cv(mich_lung_xx, mich_lung_yy, res_soda) #cv_soda s_soda S-SODA algorithm for general index model variable selection S-SODA is an extension of SODA to conduct variable selection for general index models with continuous response. S-SODA first evenly discretizes the continuous response into H slices, and then apply SODA on the discretized response. Compared with existing variable selection methods based on the Sliced Inverse Regression (SIR), SODA requires neither the linearity nor the constant variance condition and is much more robust. s_soda(x, y, H = 5, gam = 0, minf = 3, norm = F, debug = F) x y The response vector of dimension n * 1. H gam minf norm debug The number of slices. EBIC penalization coefficient parameter for SODA. Minimum number of steps in forward interaction screening. Default is minf=3. If set as True, S-SODA first marginally quantile-normalize each predictor to the standard normal distribution. If print debug information. BIC Var Term best_bic best_var best_term Trace of extended Bayesian information criterion (EBIC) score. Trace of selected variables. Trace of selected main and interaction terms. Final selected term set EBIC score. Final selected variables. Final selected main and interaction terms.

6 s_soda_model Examples # # (uncomment the code to run) # # Simulation: x1 / (1 + x2^2) example # N = 500 # x1 = runif(n, -3, +3) # x2 = runif(n, -3, +3) # x3 = x1 / exp(x2^2) + rnorm(n, 0, 0.2) # ss = s_soda_model(cbind(x1,x2), x3, H=25) # # # true surface in grid # MM = 50 # xx1 = seq(-3, +3, length.out = MM) # xx2 = seq(-3, +3, length.out = MM) # yyy = matrix(0, MM, MM) # for(i in 1:MM) # for(j in 1:MM) # yyy[i,j] = xx1[i] / exp(xx2[j]^2) # # # predicted surface # ppp = s_soda_pred_grid(xx1, xx2, ss, po=1) # # par(mfrow=c(1, 2), mar=c(1.75, 3, 1.25, 1.5)) # persp(xx1, xx2, yyy, theta=-45, xlab="x1", ylab="x2", zlab="y") # persp(xx1, xx2, ppp, theta=-45, xlab="x1", ylab="x2", zlab="pred") # # # Pumadyn dataset # #data(pumadyn); # #s_soda(pumadyn_isample_x, pumadyn_isample_y, H=25, gam=0) s_soda_model S-SODA model estimation. S-SODA assumes within each slice the X vector follow multivariate normal distribution. function estimates the mean vector and covariance matrix of X for each slice. This s_soda_model(x, y, H = 10) x y The response vector of dimension n * 1. H The number of slices.

s_soda_pred 7 int_h int_p int_l int_m int_v Slice index. Proportion of samples in each slice. Length of each slice (max - min response). Mean vector of covariates in each slice. Covariance matrix of covariates in each slice. s_soda_pred Predict the response y using S-SODA model. S-SODA assumes within each slice the X vector follow multivariate normal distribution. This function predicts the response y by reverting the P(X slice(y)) to P(slice(y) X), and estimates the E(y X) as sum_h E(y slice(y)=h, X) P (slice(y)=h X) s_soda_pred(x, model, po = 1) x model po S-SODA model estimated from s_soda_model function. Order of terms in X to approximate E(y slice(y)=h, X). If po=0, E(y slice(y)=h, X) is the mean of y in slice h. If po=1, E(y slice(y)=h, X) is the linear regression of X to predict y in slice h. If po=2, the linear regression also include 2nd order terms of X. Predicted response.

8 s_soda_pred_grid s_soda_pred_grid Predict the response y using S-SODA model in a 2-dimensional grid. Calls function s_soda_pred in a 2-dimensional grid defined by x1 and x2. s_soda_pred_grid(xx1, xx2, model, po = 1) xx1 Grid breakpoints for predictor 1. xx2 Grid breakpoints for predictor 2. model S-SODA model estimated from s_soda_model. po Order of terms in X to approximate E(y slice(y)=h, X). Predicted response.

Index Topic Prediction s_soda_pred, 7 s_soda_pred_grid, 8 Topic S-SODA s_soda, 5 s_soda_model, 6 s_soda_pred, 7 s_soda_pred_grid, 8 Topic SODA soda_trace_cv, 4 Topic cross-validation soda_trace_cv, 4 Topic datasets mich_lung, 2 pumadyn, 2 Topic general index model s_soda, 5 Topic interaction_selection s_soda, 5 Topic logistic_regression Topic quadratic_discriminant_analysis s_soda_pred_grid, 8 soda_trace_cv, 4 mich_lung, 2 mich_lung_xx (mich_lung), 2 mich_lung_yy (mich_lung), 2 pumadyn, 2 pumadyn_isample_x (pumadyn), 2 pumadyn_isample_y (pumadyn), 2 pumadyn_osample_x (pumadyn), 2 pumadyn_osample_y (pumadyn), 2 s_soda, 5 s_soda_model, 6 s_soda_pred, 7 9