Submitted to the Annals of Applied Statistics

SUPPLEMENTARY APPENDICES FOR "WAVELET-DOMAIN REGRESSION AND PREDICTIVE INFERENCE IN PSYCHIATRIC NEUROIMAGING"

By Philip T. Reiss, Lan Huo, Yihong Zhao, Clare Kelly and R. Todd Ogden

APPENDIX A: SIMULATION DETAILS

A.1. Comparative simulation study. Given the true coefficient image $\beta = \beta^{(k)}$ for $k = 1$ or $2$ as defined in Section 4, we generated continuous outcomes
\[
y_i = x_i^T \beta + \varepsilon_i \text{ with } \varepsilon_i \sim N(0, \sigma^2), \tag{A1}
\]
with $\sigma^2$ chosen as the solution to
\[
1 - \frac{\sigma^2}{s_{x\beta}^2 + \sigma^2} = R^2, \tag{A2}
\]
where $s_{x\beta}^2$ is the sample variance of $x_1^T\beta, \ldots, x_n^T\beta$, representing the variance explained by the model, and $R^2$ is the specified value (0.1 or 0.5). The left side of (A2) is similar to what Tibshirani and Knight (1999) called the theoretical $R^2$, and has the interpretation that, for responses generated according to (A1) and (A2), the coefficient of determination for the true model is approximately equal to the specified value $R^2$.

An $R^2$ analogue for logistic regression (Menard, 2000) is given by
\[
R_L^2 = 1 - \frac{\log(L_M)}{\log(L_0)}, \tag{A3}
\]
where $L_M$ is the likelihood of the given model and $L_0$ is the likelihood of the model containing only an intercept. For the logistic regression simulation settings we defined a theoretical version of $R_L^2$ analogous to the left side of (A2). Suppose we are simulating responses $y_i \sim \mathrm{Bernoulli}(p_i)$, $i = 1, \ldots, n$, where
\[
\log\frac{p_i}{1 - p_i} = \delta_0 + x_i^T\beta \tag{A4}
\]
for given $\delta_0$, $\beta$, and predictors $x_i$ ($i = 1, \ldots, n$). Let $E(\cdot)$ denote expectation under the assumed model (A4), and let $L_{\cdots}$ denote the likelihood based on the true values of the parameters given in the subscript(s), with parameters not given in the subscripts set to zero. Then the proposed variant of (A3) for use in simulations is
\begin{align}
R_L^2 &= 1 - \frac{E\log(L_{\delta_0,\beta})}{E\log(L_{\delta_0})} \tag{A5}\\
&= 1 - \frac{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0 - x_i^T\beta)\}}{1+\exp(-\delta_0 - x_i^T\beta)} + \dfrac{\log\{1+\exp(\delta_0 + x_i^T\beta)\}}{1+\exp(\delta_0 + x_i^T\beta)}\right]}{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0)\}}{1+\exp(-\delta_0 - x_i^T\beta)} + \dfrac{\log\{1+\exp(\delta_0)\}}{1+\exp(\delta_0 + x_i^T\beta)}\right]}. \tag{A6}
\end{align}
To perform simulations with a desired value of $R_L^2$, we take $\beta = s\beta_0$ in (A6) for a given $\beta_0$, and numerically solve for $s > 0$ such that (A6) equals the specified value.

We used 8-fold CV with 8 repetitions (see Section 3.6), i.e., minimization of (3) or (4) with $K = R = 8$, to choose from among the following candidate tuning parameter values. For FPCR, the number of basis functions along each dimension was chosen between 20 and 30, and the number of PCs was chosen from among the values 1–20; for given numbers of basis functions and PCs, the roughness penalty parameter was chosen by restricted maximum likelihood (Reiss and Ogden, 2009; Wood, 2011). For the three wavelet methods, the decomposition level parameter $j_0$ was set to 4. For WPCR and WPLS, we retained $c = 200$ wavelet coefficients, and again chose from among 1–20 components. (In our experience, a small number of retained wavelet coefficients can sometimes attain the minimal CV, but at the expense of highly unstable estimates; we therefore chose the moderate value $c = 200$ rather than varying $c$ for what would likely be a minuscule reduction in CV.) For WNet, the mixing parameter $\alpha$ in (1) was chosen from among 0.1, 0.4, 0.7, 1, and a dense grid of candidate $\lambda$ values was automatically chosen by the glmnet algorithm (Friedman, Hastie and Tibshirani, 2010) for each value of $\alpha$.

A.2. Permutation test simulation study. To test the power of the permutation procedure for model (A1), we again generated continuous responses with specified $R^2$ values, in the sense that $\sigma^2$ was chosen to satisfy (A2).
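The numerical step described in Section A.1, solving for the scale factor $s$ so that (A6) attains a specified $R_L^2$, can be sketched as follows. This Python snippet is an illustrative sketch of that calibration, not the authors' code: the function name and the simulated values standing in for $x_i^T\beta_0$ are hypothetical, and the bracketed terms of (A6) are evaluated in a numerically stable form.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit  # numerically stable logistic function


def theoretical_r2_l(s, delta0, xb0):
    """Theoretical R^2_L of (A5)-(A6) for beta = s * beta0,
    where xb0[i] stands in for x_i^T beta0 (hypothetical values)."""
    eta = delta0 + s * xb0  # true linear predictor under the model (A4)
    # Expected log-likelihood of the full model (delta0, beta); each term
    # matches the numerator of (A6), with log(1+e^z) = logaddexp(0, z)
    # and 1/(1+e^{-z}) = expit(z) for numerical stability:
    full = -np.sum(np.logaddexp(0.0, -eta) * expit(eta)
                   + np.logaddexp(0.0, eta) * expit(-eta))
    # Expected log-likelihood of the intercept-only model, expectation
    # still taken under the true probabilities (denominator of (A6)):
    null = -np.sum(np.logaddexp(0.0, -delta0) * expit(eta)
                   + np.logaddexp(0.0, delta0) * expit(-eta))
    return 1.0 - full / null


# Hypothetical example: scale beta0 so that R^2_L = 0.1.
rng = np.random.default_rng(0)
xb0 = rng.normal(size=333)  # stand-in for the x_i^T beta0 values
s_star = brentq(lambda s: theoretical_r2_l(s, 0.0, xb0) - 0.1, 1e-8, 50.0)
```

Since $R_L^2$ increases monotonically in $s$ (from 0 at $s = 0$ toward 1), a bracketing root finder such as `brentq` suffices. The same approach would handle the scalar-covariate setting of Section A.2: first solve the approximate $R_t^2$ equation for $\delta_1$, then solve (A10) for $s$.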
For the linear model with a scalar covariate
\[
y_i = t_i\delta_1 + x_i^T\beta + \varepsilon_i, \tag{A7}
\]
we fixed the coefficient of determination $R_t^2 = 0.2$ for the scalar predictor, and set the partial coefficient of determination $R_{x|t}^2 = 0.02, 0.04, \ldots, 0.16$ for the image predictors (see Anderson-Sprecher, 1994), by choosing $\delta_1$ and $\sigma^2$ in (A7) to satisfy the pair of equations
\[
1 - \frac{s_{x\beta}^2 + \sigma^2}{s_{t\delta_1+x\beta}^2 + \sigma^2} = R_t^2 = 0.2, \qquad 1 - \frac{\sigma^2}{s_{x\beta}^2 + \sigma^2} = R_{x|t}^2, \tag{A8}
\]
where $s_{t\delta_1+x\beta}^2$ is the sample variance of the $t_i\delta_1 + x_i^T\beta$ ($i = 1, \ldots, n$). The second equation in (A8) can be solved directly for $\sigma^2$. Substituting this value into the first equation, and noting that $s_{t\delta_1+x\beta}^2 = \delta_1^2 s_t^2 + 2\delta_1 s_{t,x\beta} + s_{x\beta}^2$ (where $s_t^2$ is the sample variance of the $t_i$'s and $s_{t,x\beta}$ is the sample covariance of the $t_i$'s and the $x_i^T\beta$'s), yields a quadratic equation that can be solved for $\delta_1$.

To evaluate the permutation test for logistic regression, we used the same set of $R^2$ values for the case without scalar covariates, and the same $R_t^2$ and $R_{x|t}^2$ values for the case with scalar covariates. As in Section A.1, this required defining $R_t^2$ and $R_{x|t}^2$ (partial $R^2$) for simulating from a logistic regression model, in this case
\[
\log\frac{p_i}{1-p_i} = \delta_0 + t_i\delta_1 + x_i^T\beta. \tag{A9}
\]
With notation analogous to (A5), we define
\[
R_t^2 = 1 - \frac{E\log(L_{\delta_0,\delta_1})}{E\log(L_{\delta_0})}
= 1 - \frac{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0 - t_i\delta_1)\}}{1+\exp(-\delta_0 - t_i\delta_1 - x_i^T\beta)} + \dfrac{\log\{1+\exp(\delta_0 + t_i\delta_1)\}}{1+\exp(\delta_0 + t_i\delta_1 + x_i^T\beta)}\right]}{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0)\}}{1+\exp(-\delta_0 - t_i\delta_1 - x_i^T\beta)} + \dfrac{\log\{1+\exp(\delta_0)\}}{1+\exp(\delta_0 + t_i\delta_1 + x_i^T\beta)}\right]}
\]
and
\[
R_{x|t}^2 = 1 - \frac{E\log(L_{\delta_0,\delta_1,\beta})}{E\log(L_{\delta_0,\delta_1})}
= 1 - \frac{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0 - t_i\delta_1 - x_i^T\beta)\}}{1+\exp(-\delta_0 - t_i\delta_1 - x_i^T\beta)} + \dfrac{\log\{1+\exp(\delta_0 + t_i\delta_1 + x_i^T\beta)\}}{1+\exp(\delta_0 + t_i\delta_1 + x_i^T\beta)}\right]}{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0 - t_i\delta_1)\}}{1+\exp(-\delta_0 - t_i\delta_1 - x_i^T\beta)} + \dfrac{\log\{1+\exp(\delta_0 + t_i\delta_1)\}}{1+\exp(\delta_0 + t_i\delta_1 + x_i^T\beta)}\right]}. \tag{A10}
\]
Given $(t_i, x_i)$ ($i = 1, \ldots, n$), $\delta_0$, and $\beta_0$ such that $\beta = s\beta_0$ for some $s$, attaining specified values of $R_t^2$ and $R_{x|t}^2$ reduces to solving the above two equations for $\delta_1$ and $s$. Assuming the $x_i$'s have mean zero, we can simplify the problem via the approximation
\[
R_t^2 \approx 1 - \frac{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0 - t_i\delta_1)\}}{1+\exp(-\delta_0 - t_i\delta_1)} + \dfrac{\log\{1+\exp(\delta_0 + t_i\delta_1)\}}{1+\exp(\delta_0 + t_i\delta_1)}\right]}{\sum_{i=1}^n \left[\dfrac{\log\{1+\exp(-\delta_0)\}}{1+\exp(-\delta_0 - t_i\delta_1)} + \dfrac{\log\{1+\exp(\delta_0)\}}{1+\exp(\delta_0 + t_i\delta_1)}\right]}.
\]
We treat this as an equality and solve it for $\delta_1$, then insert the result into (A10) and solve for $s$.

APPENDIX B: PERMUTATION OF RESIDUALS

The original permutation of regressor residuals (PRR) procedure of Potter (2005) differs somewhat from what we propose in Section 5.1 of the main text. The PRR procedure (adapted slightly to the image-predictor context) uses the design matrix
\[
[T \;\; \Pi(I - P_T)X] \tag{A11}
\]
rather than $[T \;\; P_T X + \Pi(I - P_T)X]$ as in (5); in other words, it simply replaces the $X$ portion of the design matrix with the permuted residuals, instead of adding the permuted residuals back to $P_T X$. For the unpenalized model considered by Potter (2005) (see also Section 2.4.3 of Ridgway, 2009), the simpler design matrix (A11) is equivalent to (5). But for penalized models such as the wavelet-domain elastic net, the two design matrices tend to produce slightly different results. We therefore prefer the permuted-data design matrix (5), which preserves the original data's dependence between the scalar and image predictors. In a different neuroimaging setting, Winkler et al. (2014) show that PRR (which they refer to as the "Smith procedure") compares favorably with other permutation test procedures for linear models with nuisance predictors.

APPENDIX C: LINEAR REGRESSION POWER SIMULATION RESULTS

Here we report linear regression simulation results for the permutation test procedure (see Section 5.1 for logistic regression results). We first considered the case without scalar covariates, and generated responses
\[
y_i = x_i^T\beta + \varepsilon_i \text{ with } \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n = 333, \tag{A12}
\]
where $x_i \in \mathbb{R}^{64^2}$ is the $i$th image (expressed as a vector), $\beta$ is the true coefficient image shown in Figure A1(a) (similarly vectorized),
Fig A1. (a) True coefficient image $\beta$ used in the power study: gray denotes 0, black denotes 1. (b) Estimated probability of rejecting the null hypothesis $\beta = 0$ as a function of $R^2$, with 95% confidence intervals, for model (A12). (c) Same, for model (A13).

and $\sigma^2$ is chosen to attain approximate $R^2$ values as in Supplementary Appendix A. We simulated 200 response vectors to assess power to reject $H_0: \beta = 0$ at the $p = .05$ level for each of the $R^2$ values 0.04, 0.07, 0.1, 0.15, 0.2, 0.25, 0.3, as well as 1000 response vectors with $\beta = 0$ ($R^2 = 0$) to assess the type-I error rate. Next we considered testing the same null hypothesis for the model
\[
y_i = t_i\delta_1 + x_i^T\beta + \varepsilon_i \text{ with } \varepsilon_i \sim N(0, \sigma^2), \tag{A13}
\]
with a scalar covariate $t_i$ such that $R^2$ for the submodel $E(y_i \mid t_i) = t_i\delta_1$ is approximately 0.2. We generated the same number of response vectors as above for each of the above $R^2$ values, but here $R^2$ refers to the partial $R^2$ adjusting for $t_i$ (see Supplementary Appendix A.2). The results, displayed in Figure A1(b) and (c), indicate that the nominal type-I error rate is approximately attained for both models, and that power exceeds 90% when $R^2$ is at least 0.15 for either model (A12) or model (A13).

APPENDIX D: SELECTING A SUBSAMPLE OF THE ADHD-200 DATA SET

Of the 776 individuals in the ADHD-200 training sample, we considered only the 450 individuals who were right-handed and were either typically developing controls (340) or diagnosed with combined-type ADHD (110), the subtype expected to be most readily distinguishable from controls.

Head motion artifacts have recently emerged as a major concern in the resting-state fMRI literature (e.g. Van Dijk, Sabuncu and Buckner, 2012). Since there is no consensus as yet on how to address this issue, we chose to sacrifice a considerable amount of data in order to minimize the risk of spurious findings due to motion artifacts. We excluded those subjects whose
mean framewise displacement (FD) (Power et al., 2012), a motion score, exceeded 0.25. We then matched the control and ADHD groups on mean FD by dividing the sample into mean FD deciles, and randomly subsampling either controls or ADHD individuals within each decile so as to attain roughly equal control-to-ADHD ratios across deciles. This reduced the number of subjects to 333 (257 controls and 76 with combined-type ADHD; 198 males, 135 females; age range 7.7–20.45).

The fALFF data were processed and made available by the Neuro Bureau via the NITRC repository; the data and full details of the image processing steps are available at http://www.nitrc.org/plugins/mwiki/index.php/neurobureau:athenapipeline. Nonzero fALFF values were recorded only for voxels within the brain, but owing to inter-subject differences in scan volume coverage, the set of brain voxels varied somewhat among subjects. Our analysis included the 92 voxels located within the brain for all 333 subjects.

APPENDIX E: FROM 2D TO 3D PREDICTORS

We fitted model (8) by the wavelet-domain elastic net, using the same set of 333 individuals as in Section 6, but with 3D maps (a 32 × 32 × 32 set of voxels from the fALFF maps) rather than with 2D slices. Here, as in Section 4, we retained sufficiently many wavelet coefficients to capture 99.5% of the excess variance. We were particularly interested in whether the relative performance of lower vs. higher values of $\alpha$ (less sparse vs. more sparse fits) differed when 3D rather than 2D images were used. Figure A2 shows the observed CV deviance when we used 1, 16, or all 32 of the 32 × 32 axial slices. For 1 slice, the lowest CV score is attained with $\alpha = 1$, i.e., the lasso. But for 16 or 32 slices, less sparse models, in particular $\alpha = 0.1$, are favored. This suggests that as the number of voxels grows, choosing a sparse coefficient image incurs a higher cost in terms of predictive accuracy.

REFERENCES

Anderson-Sprecher, R. (1994). Model comparisons and R². The American Statistician 48 113–117.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 1–22.
Menard, S. (2000). Coefficients of determination for multiple logistic regression analysis. The American Statistician 54 17–24.
Potter, D. M. (2005). A permutation test for inference in logistic regression with small- and moderate-sized data sets. Statistics in Medicine 24 693–708.
Power, J. D., Barnes, K. A., Snyder, A. Z., Schlaggar, B. L. and Petersen, S. E. (2012). Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. NeuroImage 59 2142–2154.
Fig A2. CV deviance for the wavelet-domain elastic net fitted to our subsample of the ADHD-200 data set using a 32 × 32 × 32 set of voxels from the fALFF images. Panels show results for 1 slice, 16 slices, and 32 slices; within each panel, CV deviance is plotted against $\lambda$ for $\alpha = 0.1, 0.4, 0.7, 1$.

Reiss, P. T. and Ogden, R. T. (2009). Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society: Series B 71 505–523.
Ridgway, G. R. (2009). Statistical analysis for longitudinal MR imaging of dementia. PhD thesis, University College London.
Tibshirani, R. and Knight, K. (1999). The covariance inflation criterion for adaptive model selection. Journal of the Royal Statistical Society: Series B 61 529–546.
Van Dijk, K. R. A., Sabuncu, M. R. and Buckner, R. L. (2012). The influence of head motion on intrinsic functional connectivity MRI. NeuroImage 59 431–438.
Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M. and Nichols, T. E. (2014). Permutation inference for the general linear model. NeuroImage 92 381–397.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B 73 3–36.

Department of Child and Adolescent Psychiatry
New York University School of Medicine
1 Park Ave., 7th floor
New York, NY 10016
E-mail: phil.reiss@nyumc.org
lan.huo@nyumc.org
yihong.zhao@nyumc.org
amclarekelly@gmail.com

Department of Biostatistics
Columbia University
722 W. 168th St., 6th floor
New York, NY 10032
E-mail: to166@columbia.edu