Regression When the Predictors Are Images
1 University of Haifa, From the SelectedWorks of Philip T. Reiss, May 2009
Regression When the Predictors Are Images
Philip T. Reiss, New York University
2 Regression When the Predictors Are Images
Philip T. Reiss, New York University
Wright State University, Dayton, Ohio, May 1, 2009
Joint work with Todd Ogden
Research supported in part by grant 5F31MH (NIMH)
3 Simple linear regression: fit the least-squares line. [Figure: scatterplots of y vs. x with fitted lines]
4 Multiple linear regression: fit the least-squares (hyper)plane. [Figure: 3-D scatterplots of y vs. (x1, x2) with fitted planes]
5 Basic model equation for multiple linear regression:

y_i = β_0 + β_1 x_{i1} + ... + β_{p-1} x_{i,p-1} + ε_i,  for i = 1, ..., n,

where
y_i is the ith subject's outcome;
β_0 is the intercept of the true line (or plane ...);
x_{i1}, ..., x_{i,p-1} are subject i's values for predictors 1, ..., p-1;
β_1, ..., β_{p-1} are the coefficients or "effects" of predictors 1, ..., p-1;
ε_i is an error term, which in the nicest case is "iid normal":
1. independent across subjects,
2. identically distributed for each subject,
3. normally distributed.
Each of these 3 conditions can be relaxed.
6 The above system of n equations in p unknowns can be written in inner product form as

y_i = x_i^T β + ε_i,  i = 1, ..., n,

where x_i = (1, x_{i1}, ..., x_{i,p-1})^T and β = (β_0, β_1, ..., β_{p-1})^T, or in matrix form as

y = Xβ + ε,

where X is the n × p matrix with ith row x_i^T (the "design matrix").
The least-squares estimate of β is then β̂ = arg min_{β ∈ R^p} ||y - Xβ||².
If rank(X) = p, then we have the unique minimizer β̂ = (X^T X)^{-1} X^T y.
But if n < p, there are ordinarily infinitely many β ∈ R^p such that ||y - Xβ||² = 0. So we generally seek the optimal β within some reasonable subset of R^p.
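The closed-form least-squares estimate above can be checked numerically. A minimal sketch in Python with simulated data (all variable names are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n subjects with p - 1 predictors (full-rank case, n > p).
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix: intercept + predictors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form least-squares estimate: solve (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate via the numerically preferred least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

With n much larger than p and little noise, beta_hat lands close to the generating coefficients; the n < p case below is exactly where this recipe breaks down.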
7 A motivating example: serotonin receptors in depression
Major depressive disorder affects 9.5% of the U.S. population each year.
Serotonin (5-HT), a neurotransmitter, is believed to play an important role in the disorder, mediated in part by the distribution of 5-HT_1A receptors in the brain.
5-HT_1A receptor binding potential (BP), a measure of the receptors' availability, can be mapped at each voxel (volume unit) of the brain via PET imaging and tracer kinetic modeling.
Question: How can we relate these BP maps to a depression-related outcome such as the Hamilton Depression Score (HAM-D)?
8 Idea: regress HAM-D (scalar) on BP (image):

y_i = α + s_i^T f + ε_i
9 This can be thought of:
1. as super-high-dimensional regression,
2. as regression with functional data (L² inner product).
Many potential applications in psychiatry, including:
1. predicting treatment response in depression,
2. predicting progression from mild cognitive impairment to Alzheimer's disease.
10 Road map: linear vs. generalized linear regression, crossed with 1-D signal predictors vs. 2-D image predictors.
11 An easier application: near-infrared spectroscopy
[Figure: (a) raw wheat spectra and (b) differenced wheat spectra (log(1/R) vs. wavelength, nm), with the corresponding estimated coefficient functions below.]

We model a vector y of n scalar responses as

y = Xα + Sf + ε,

where
X is an n × p covariate matrix (may just be a column of 1s);
S is an n × N matrix whose ith row s_i^T represents a signal predictor defined at points v_1, ..., v_N, and each column has mean zero;
ε denotes iid errors.
Since n ≪ N, we must reduce the dimension of S to solve for the coefficient function f.
12 Previous approaches to solving for f:

1. Principal component regression (Massy, 1965)
Replace the model y = Xα + Sf + ε with y = Xα + SV_q β + ε, where V_q comprises the q leading columns of V, given the singular value decomposition S = UDV^T; take f̂ = V_q β̂.
13 2. Penalized B-spline expansion
Replace y = Xα + Sf + ε with y = Xα + SBβ + ε, where the columns of B form a B-spline basis; take f̂ = Bβ̂, where (α̂, β̂) minimize the penalized least squares criterion

||y - Xα - SBβ||² + λ β^T P β.

P is chosen so that β^T P β is a measure of the roughness of f = Bβ, e.g., ∫ f''(v)² dv (Cardot, Ferraty, and Sarda, 2003) or a difference penalty (Marx and Eilers, 1999).
The choice of λ > 0 governs the tradeoff between fidelity to the data and smoothness of the coefficient function (higher λ ⇒ smoother).
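A minimal sketch of the penalized basis expansion, with a Gaussian-bump basis standing in for B-splines and a Marx-Eilers-style second-difference penalty (basis choice, sizes, and λ are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, K = 60, 100, 20
v = np.linspace(0.0, 1.0, N)

# A smooth K-column basis standing in for B-splines (Gaussian bumps at equally spaced knots).
knots = np.linspace(0.0, 1.0, K)
B = np.exp(-0.5 * ((v[:, None] - knots[None, :]) / 0.08) ** 2)  # N x K

S = rng.normal(size=(n, N))              # signal matrix
f_true = np.sin(2 * np.pi * v)
y = S @ f_true + 0.1 * rng.normal(size=n)

# Difference penalty: P = D2^T D2 with D2 the second-difference matrix on the coefficients.
D2 = np.diff(np.eye(K), n=2, axis=0)
P = D2.T @ D2

# Minimize ||y - S B beta||^2 + lam * beta^T P beta (no extra covariates X here).
lam = 10.0
beta_hat = np.linalg.solve(B.T @ S.T @ S @ B + lam * P, B.T @ S.T @ y)
f_hat = B @ beta_hat
```

Raising lam shrinks the second differences of beta_hat toward zero, i.e., pulls f_hat toward a linear function, which is the fidelity/smoothness tradeoff described above.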
14 Functional principal component regression
Reiss and Ogden (2007) propose to combine the above two approaches to obtain a coefficient function estimate

f̂ = B V_q ζ̂,

where
B is N × K, with columns forming a B-spline basis as before;
V_q is a K × q matrix whose columns are the leading columns of V, given the singular value decomposition SB = UDV^T.
This functional principal component regression (FPCR) modifies the above penalized least squares criterion ||y - Xα - SBβ||² + λ β^T P β by further restricting to the q leading principal components, to obtain

||y - Xα - SBV_q ζ||² + λ ζ^T V_q^T P V_q ζ.
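Putting the two pieces together, a sketch of the FPCR fit (same illustrative bump basis and difference penalty as above; no covariate matrix X):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, K, q = 60, 100, 20, 5
v = np.linspace(0.0, 1.0, N)
knots = np.linspace(0.0, 1.0, K)
B = np.exp(-0.5 * ((v[:, None] - knots[None, :]) / 0.08) ** 2)  # stand-in spline basis, N x K

S = rng.normal(size=(n, N))
y = S @ np.sin(2 * np.pi * v) + 0.1 * rng.normal(size=n)

# FPCR: SVD of SB, keep the q leading right singular vectors, penalized fit in zeta.
D2 = np.diff(np.eye(K), n=2, axis=0)
P = D2.T @ D2
_, _, Vt = np.linalg.svd(S @ B, full_matrices=False)
Vq = Vt[:q].T                                   # K x q
Z = S @ B @ Vq                                  # reduced n x q design matrix
lam = 1.0
zeta_hat = np.linalg.solve(Z.T @ Z + lam * Vq.T @ P @ Vq, Z.T @ y)
f_hat = B @ Vq @ zeta_hat                       # coefficient function estimate
```

The key computational point: after the two reductions the normal equations are q × q rather than N × N, even though f_hat is still a full-resolution function of v.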
15 The second dimension reduction, SB → SBV_q, is optimal in the following minimax sense.

Proposition 1. Let UDV^T be the singular value decomposition of the n × K matrix Z, let U_q be the matrix consisting of the first q < min(n, K) columns of U, and let D_q be the q × q upper left submatrix of D. Then M_0 = U_q D_q minimizes

max_{w ∈ R^K, ||w|| = 1} ||Zw - proj_M Zw||

over all n × q matrices M, where proj_M denotes projection onto the column space of M.

In other words: if we wish to replace an n × K design matrix Z = SB with an n × q design matrix M with q < K, the maximum perturbation in fitted values is minimized by taking M to be U_q D_q, which equals SBV_q, the FPCR design matrix.
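Two facts in Proposition 1 are easy to verify numerically: M_0 = U_q D_q coincides with Z V_q, and the perturbation ||Zw - proj_{M_0} Zw|| is bounded by the (q+1)th singular value for any unit w (a quick illustrative check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, q = 30, 12, 4
Z = rng.normal(size=(n, K))                     # plays the role of SB

U, D, Vt = np.linalg.svd(Z, full_matrices=False)
Uq, Vq = U[:, :q], Vt[:q].T

# The optimal reduced design M0 = U_q D_q is exactly Z V_q:
M0 = Uq @ np.diag(D[:q])
assert np.allclose(Z @ Vq, M0)

# For any unit vector w, the perturbation in fitted values is at most the
# (q+1)th singular value, which is the minimax value achieved by M0.
w = rng.normal(size=K)
w /= np.linalg.norm(w)
resid = Z @ w - Uq @ (Uq.T @ (Z @ w))           # Zw minus its projection onto col(M0)
assert np.linalg.norm(resid) <= D[q] + 1e-9
```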
16 Tuning parameter 1: number of principal components
The q in ||y - Xα - SBV_q ζ||² + λ ζ^T V_q^T P V_q ζ can be chosen by multifold cross-validation:
1. Divide the data points (y_i, x_i, s_i) into, say, 5 validation sets.
2. For given q, and for k = 1, ..., 5: remove the kth validation set and obtain estimates α̂^(-k)(q), ζ̂^(-k)(q) from the remaining data (the "training set"); use those estimates to obtain predicted values ŷ_k^(-k)(q) for the left-out outcomes; obtain the prediction error ||y_k - ŷ_k^(-k)(q)||² for the kth validation set.
3. Choose q such that Σ_{k=1}^{5} ||y_k - ŷ_k^(-k)(q)||² is minimized.
Alternatively, just choose a q that seems large enough to capture the necessary detail.
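The three steps above can be sketched directly; here with an unpenalized q-component fit and a true coefficient function built from three leading PCs (all sizes and the fold scheme are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 100, 50
S = rng.normal(size=(n, N))
_, _, Vt_full = np.linalg.svd(S, full_matrices=False)
f_true = Vt_full[:3].T @ np.array([5.0, -3.0, 2.0])   # built from the 3 leading PCs
y = S @ f_true + 0.5 * rng.normal(size=n)

def cv_error(q, folds=5):
    """Multifold CV prediction error for a q-component (unpenalized) fit."""
    err = 0.0
    idx = np.arange(n)
    for k in range(folds):
        val = idx[k::folds]                    # kth validation set
        tr = np.setdiff1d(idx, val)            # training set
        _, _, Vt = np.linalg.svd(S[tr], full_matrices=False)
        Vq = Vt[:q].T
        beta, *_ = np.linalg.lstsq(S[tr] @ Vq, y[tr], rcond=None)
        err += np.sum((y[val] - S[val] @ Vq @ beta) ** 2)
    return err

q_best = min(range(1, 9), key=cv_error)        # step 3: minimize summed prediction error
```

Underfitting (q below the true number of components) inflates the validation error far more than mild overfitting does, which is why the CV curve typically has a sharp elbow at the right q.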
17 Tuning parameter 2: roughness penalty parameter λ
[Figure: GCV as a function of λ, with "too bumpy," "too smooth," and "just right!" fits indicated.]
Given q, the λ in ||y - Xα - SBV_q ζ||² + λ ζ^T V_q^T P V_q ζ can be chosen by optimizing over λ a criterion function such as:
1. generalized cross-validation (GCV; Craven and Wahba, 1979): related to cross-validation, but does not require fitting the model repeatedly;
2. restricted maximum likelihood (REML; Ruppert, Wand, and Carroll, 2003): relies on the connection with linear mixed models.
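A sketch of the GCV criterion for a penalized fit of this form, using the standard formula GCV(λ) = n · RSS(λ) / (n - tr H(λ))², where H(λ) is the hat matrix (design, penalty, and λ grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 80, 20
Z = rng.normal(size=(n, K))                 # reduced design, playing the role of SBV_q
beta_true = np.exp(-0.3 * np.arange(K))
y = Z @ beta_true + rng.normal(size=n)

D2 = np.diff(np.eye(K), n=2, axis=0)
P = D2.T @ D2                               # roughness penalty on the coefficients

def gcv(lam):
    """GCV(lam) = n * RSS / (n - tr H)^2 for the ridge-type fit with penalty lam * P."""
    H = Z @ np.linalg.solve(Z.T @ Z + lam * P, Z.T)   # hat matrix
    rss = np.sum((y - H @ y) ** 2)
    edf = np.trace(H)                        # effective degrees of freedom
    return n * rss / (n - edf) ** 2

grid = 10.0 ** np.arange(-3, 4)
lam_gcv = min(grid, key=gcv)
```

Each evaluation reuses the single fit at that λ, which is the sense in which GCV avoids repeatedly refitting held-out folds.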
18 Reiss and Ogden (2009a) derived a one-parameter family of functions h_k (k ≥ 1) such that

sgn[ (d/dλ) REML(λ) ] = sgn[h_2(λ) - h_1(λ)]  and  sgn[ (d/dλ) GCV(λ) ] = sgn[h_2(λ) - h_3(λ)].

In non-pathological cases, h_1 crosses h_2 (from smaller to larger) at a unique value λ̂_REML, whereas h_3 crosses h_2 (from larger to smaller) at a unique point λ̂_GCV. REML smooths more than GCV iff the latter crossing of h_2 precedes the former one.
[Figure: h_1, h_2, h_3 vs. log(λ), with the GCV and REML crossing points marked.]
19 Some asymptotic results
Assume the outcomes are generated by the model y = α1 + S*Bβ* + ε, where
S* = (s*_1, ..., s*_n)^T; the s*_i = (s*_{i1}, ..., s*_{iN})^T, i = 1, 2, ..., are i.i.d. random vectors with E(s*_{1j}) = 0 and E(s*_{1j}^4) < ∞ for each j = 1, ..., N;
ε is a vector of i.i.d. errors with mean zero and finite variance, independent of S*; and
B is a fixed N × K B-spline basis matrix.
Let V*_q be the K × q population principal component matrix whose columns are the eigenvectors of E(B^T s*_1 s*_1^T B) corresponding to its leading eigenvalues ξ_1 > ... > ξ_q > 0, and let Ξ_q = diag(ξ_1, ..., ξ_q).

Theorem 1. Suppose β* = V*_q ζ* for some ζ* = (ζ*_1, ..., ζ*_q)^T. If β̂_n = V_q ζ̂ denotes a q-component FPCR estimate for which λ_n is chosen to be o_P(n^{1/2}), then n^{1/2}(β̂_n - β*) →_d Z_1 + Z_2, where Z_1 ~ N_K(0, σ² V*_q Ξ_q^{-1} V*_q^T) and Z_2 ~ N_K(0, W) for a K × K matrix W not depending on σ².
20 Under mild assumptions, the λ_n = o_P(n^{1/2}) condition imposed in Theorem 1 is met if λ_n is chosen by GCV or REML. Indeed, a stronger condition then holds:

Theorem 2. Let λ_n be the GCV or REML value associated with the q-component FPCR estimate. Assume that V*_q^T P V*_q is nonsingular, and that f = Bβ* with V*_q^T β* ≠ 0. Then there exists M > 0 such that P(λ_n > M) → 0 as n → ∞.
21 Consistency of choosing the number of components by multifold CV:

Theorem 3. Assume that f = BV*_q ζ* with ζ*_q ≠ 0. Suppose the number of FPCR components is chosen by multifold CV using D_n divisions of the n observations into training and validation sets of sizes n_t and n_v = n - n_t, respectively, with λ_n = o_P(n^{1/2}). If n_t, n_v → ∞ and D_n = o_P[(min{n_t, n_v})^{1/2}], then for any positive integer q_1 < q, the q-component model will be chosen over the q_1-component model with probability tending to 1 as n → ∞.
22 [Figure: (a) raw wheat spectra and (b) differenced wheat spectra (log(1/R) vs. wavelength, nm); (c) coefficient function for moisture by PCR; (d) coefficient function for moisture by FPCR with REML.]
23 Simulation study
Given an n × N signal matrix S (an N-dimensional signal for each of n subjects), create a true coefficient function f ∈ R^N and simulate outcomes y = Sf + ε, where ε consists of n iid normal errors. Then use FPCR to derive the estimate f̂.
Two key performance metrics are:
estimation error ||f̂ - f||²;
prediction error ||ŷ - E(y|S)||² = ||S f̂ - Sf||².
24 Table 1: Average scaled mean squared error of prediction, for smooth and bumpy true coefficient functions on the wheat and gasoline data sets. Methods compared: PBSE (GCV, REML, oracle); PCR and B-spline PCR; FPCR_C; FPCR_R (GCV, REML, oracle, plug-in); PLS and B-spline PLS; FPLS_C; FPLS_R (GCV, REML). [Table values lost in transcription.]
25 Table 2: Mean L² norm of error (root integrated squared error) in estimating f, for the same functions, data sets, and methods as in Table 1. [Table values lost in transcription.]
26 Road map: linear vs. generalized linear regression, crossed with 1-D signal predictors vs. 2-D image predictors.
27 Generalized linear models
Problem: the regression model y_i = x_i^T β + ε_i is not appropriate for certain types of outcome y_i, e.g., if y_i is binary (0/1).
Under the standard assumptions that ε_i is independent of x_i and has expectation 0, µ_i ≡ E(y_i | x_i) = x_i^T β. This motivates the generalized linear model

µ_i = g^{-1}(x_i^T β),

where g is an invertible link function.
Key example: logistic regression. If y_i is binary, µ_i is simply p_i ≡ P(y_i = 1). With inverse link g^{-1}(t) = e^t / (1 + e^t), the above becomes

p_i = exp(x_i^T β) / [1 + exp(x_i^T β)],  i.e.,  y_i ~ Bernoulli( exp(x_i^T β) / [1 + exp(x_i^T β)] ).
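A sketch of fitting the logistic model above by iteratively reweighted least squares, the standard Newton-type algorithm for GLMs (simulated data; sizes and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 1.5, -1.0])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))   # p_i = inverse-logit of x_i^T beta
y = rng.binomial(1, prob)                     # Bernoulli outcomes

# Maximum likelihood by iteratively reweighted least squares (Newton's method):
# each step solves a weighted least-squares problem with weights mu_i (1 - mu_i).
beta = np.zeros(p)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities
    W = mu * (1.0 - mu)                       # Bernoulli variance weights
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
```

The update is beta += (X^T W X)^{-1} X^T (y - mu), i.e., a Newton step on the Bernoulli log-likelihood; with a FPCR design the same loop applies with X replaced by [X, SBV_q].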
28 FPCR for generalized linear models
Reiss and Ogden (2009b) extended FPCR from the linear model µ = Xα + Sf to the generalized linear model

g(µ) ≡ [g(µ_1), ..., g(µ_n)]^T = Xα + Sf,

where g is an appropriate link function.
Again we apply the restriction f = BV_q ζ to obtain, e.g., the logistic model

y_i ~ Bernoulli( exp[(Xα + SBV_q ζ)_i] / (1 + exp[(Xα + SBV_q ζ)_i]) ).
29 Road map: linear vs. generalized linear regression, crossed with 1-D signal predictors vs. 2-D image predictors.
30 Extension to image predictors
As for the linear model with 1-D predictors, we seek to minimize ||y - Xα - SBV_q ζ||² + λ ζ^T V_q^T P V_q ζ, but must modify B and P.
For the columns of B, we chose radial cubic B-splines (Saranli and Baykal, 1998) centered at an equally spaced grid of knots κ_1, ..., κ_K ∈ R².
We chose P to yield the thin plate penalty, given in two dimensions by

ζ^T V_q^T P V_q ζ = ∫∫ [ (∂²f/∂v_1²)² + 2 (∂²f/∂v_1∂v_2)² + (∂²f/∂v_2²)² ] dv_1 dv_2.
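A discrete analogue of the two-dimensional roughness penalty can be built with Kronecker products of difference matrices on a knot grid; this sketch is an assumption-laden stand-in for the radial-spline thin-plate penalty of the talk, not the authors' construction:

```python
import numpy as np

# Discrete analogue of the thin-plate penalty for coefficients on a k1 x k2 knot grid:
# squared second differences along each axis plus twice the squared mixed differences.
k1, k2 = 8, 8
D1 = np.diff(np.eye(k1), n=2, axis=0)       # second differences along v1
D2 = np.diff(np.eye(k2), n=2, axis=0)       # second differences along v2
E1 = np.diff(np.eye(k1), n=1, axis=0)       # first differences (for the mixed term)
E2 = np.diff(np.eye(k2), n=1, axis=0)

P = (np.kron(np.eye(k2), D1.T @ D1)
     + np.kron(D2.T @ D2, np.eye(k1))
     + 2.0 * np.kron(E2.T @ E2, E1.T @ E1))  # K x K with K = k1 * k2

# A constant coefficient image is perfectly smooth; an oscillating one is not.
flat = np.ones(k1 * k2)
bumpy = np.sin(np.linspace(0.0, 6.0 * np.pi, k1 * k2))
```

Like the continuous thin-plate penalty, this discrete P assigns zero roughness to constant (and separable linear) coefficient images, so the penalty shrinks toward smooth surfaces rather than toward zero.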
31 y_i = α + s_i^T f + ε_i
Key scientific question regarding PET images: where in the brain (at which voxels v) is f(v) significantly positive or negative?
32 This question can be approached through simultaneous confidence bands for the coefficient image.
A 95% (pointwise) confidence interval for f(v) is an interval [L, U] (determined by the data) such that P(L ≤ f(v) ≤ U) = .95.
A 95% simultaneous confidence band for f is given by functions f̂_L(·), f̂_U(·) such that P[f̂_L(v) ≤ f(v) ≤ f̂_U(v) for all v] = .95.
[Figure: a 1-D coefficient function with simultaneous bands, and regions flagged as significantly positive or negative.]
33 No analytic expression is available for f̂_L, f̂_U, so we pursue a bootstrapping approach (Mandel and Betensky, 2008):
1. Using, say, 999 random samples with replacement of n data points (y*, x*, s*) from the n observations, obtain bootstrap estimates f̂*_1, ..., f̂*_999 of the coefficient function.
2. At each v, order the function estimates: f̂*_(1)(v) ≤ ... ≤ f̂*_(999)(v).
3. [f̂*_(25)(v), f̂*_(975)(v)] is a pointwise 95% confidence interval for f(v).
4. ∏_v [f̂*_(25)(v), f̂*_(975)(v)] is in general not a simultaneous 95% confidence band for f; but the Cartesian product of, say, 99% pointwise intervals will be.
5. Given a 95% simultaneous confidence band [f̂_L, f̂_U] for f, the coefficient function is declared significantly positive at all v such that f̂_L(v) > 0.
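The mechanics of steps 1-3 can be sketched as follows; here a plain q-component PCR fit stands in for the full FPCR estimator, 199 resamples replace 999 to keep the loop quick, and no simultaneity adjustment is made (so these are the pointwise intervals of step 3 only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, N, q = 80, 30, 3
S = rng.normal(size=(n, N))
_, _, Vt = np.linalg.svd(S, full_matrices=False)
f_true = Vt[:2].T @ np.array([4.0, -3.0])     # coefficient in the leading PC space
y = S @ f_true + 0.5 * rng.normal(size=n)

def fit(S_, y_):
    """Plain q-component PCR fit, standing in for the full FPCR estimator."""
    _, _, Vt_ = np.linalg.svd(S_, full_matrices=False)
    Vq = Vt_[:q].T
    beta, *_ = np.linalg.lstsq(S_ @ Vq, y_, rcond=None)
    return Vq @ beta

n_boot = 199
fits = np.empty((n_boot, N))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)          # step 1: resample (y_i, s_i) pairs with replacement
    fits[b] = fit(S[idx], y[idx])

fits.sort(axis=0)                             # step 2: order the estimates at each point v
lo, hi = fits[4], fits[194]                   # step 3: ~95% pointwise interval (5th and 195th of 199)
```

As step 4 warns, taking the product of these pointwise intervals over v does not give 95% simultaneous coverage; widening to higher-level pointwise intervals is what restores it.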
34 Oversimplified example
35 More realistic example
Function estimates cross over each other. Simultaneous band formed by the Cartesian product of pointwise intervals running from the 2nd-smallest to the 2nd-largest function estimate at each point:
[Figure]
36 Simulation studies
We used 68 maps of binding potential of 5-HT_1A receptors, obtained by Parsey et al. (2006), to perform two simulation studies:
Study 1: compare the performance of FPCR with two other methods.
Study 2: assess how well the simultaneous inference procedure detects nonzero regions of f.
37 True coefficient image f for simulation studies
38 Simulation study 1: performance comparisons
We compared three methods, for both linear and logistic regression. The methods all begin by restricting the coefficient function to a spline basis, but differ in what happens next.
(a) Roughness penalty only: referred to above as the penalized B-spline expansion.
(b) Reduction to leading PCs only: similar to FPCR, but without a roughness penalty (i.e., λ = 0). Thus the only smoothing parameter is the number of components, which we chose by 5-fold cross-validation from among the values 1, ..., 10.
(c) Leading PCs plus roughness penalty: i.e., FPCR, with 35 PCs (accounting for 96% of the variation in SB).
For both linear and logistic regression, we generated 100 outcome vectors with equally spaced R² ("proportion of explained variation") values from .2 to .95, and computed each method's prediction error.
39 [Figure 1: prediction error vs. R² for (a) linear and (b) logistic regression, comparing all components (roughness penalty only), 1-10 components with λ = 0, and 35 components (FPCR).]
40 Simulation study 2: detecting significance
For R²_L = .5, .6, .7, .8, .9 (Menard, 2000), we simulated 20 sets of binary outcomes

y_i ~ Bernoulli( exp(s_i^T f) / [1 + exp(s_i^T f)] ),  i = 1, ..., 68,

where s_i denotes subject i's binding potential map (1 slice, 5778 voxels), and f = k f_0, where f_0 is the artificial coefficient image shown previously and k is chosen to attain the specified R²_L.
For each set of outcomes, we estimated f by logistic FPCR with 35 PCs, and formed simultaneous confidence bands from 1999 bootstrap samples. Using these bands, we determined how many times (out of 20) each voxel was deemed significantly positive (or negative).
41 [Figure: maps of the number of times (out of 20) each voxel was found significant, for R² = 0.5, 0.6, 0.7, 0.8, 0.9.]
42 f, f̂, and simultaneous bands: 10 examples with R²_L = .7
[Figure: ten panels of coefficient function vs. voxel; per-panel detection counts: 9/9 true with 2 false, 9/9 true with 1 false, 4/9 true, 8/9 true, 6/9 true, 5/9 true, 0/9 true, 5/9 true with 1 false, 9/9 true, 8/9 true with 1 false.]
43 f, f̂, and simultaneous bands: zooming in
[Figure: the same ten panels as the previous slide, zoomed in.]
44 Discussion
Regressing scalar outcomes on entire images is a very challenging problem, but our method seems to do quite well at detecting salient (f ≠ 0) regions.
Extension to 3-D images will require tackling several practical issues.
Another extension: a locally adaptive smoothing parameter.
Ongoing work (Todd Ogden and Yihong Zhao) uses wavelets rather than splines, and seeks a sparse representation of the coefficient function.
45 Thank you!
46 References
[1] Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica 13.
[2] Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31.
[3] Mandel, M., and Betensky, R. A. (2008). Simultaneous confidence intervals based on the percentile bootstrap approach. Computational Statistics and Data Analysis 52.
[4] Marx, B. D., and Eilers, P. H. C. (1999). Generalized linear regression on sampled signals and curves: a P-spline approach. Technometrics 41.
[5] Massy, W. F. (1965). Principal components regression in exploratory statistical research. Journal of the American Statistical Association 60.
[6] Menard, S. (2000). Coefficients of determination for multiple logistic regression analysis. The American Statistician 54.
[7] Parsey, R. V., Oquendo, M. A., Ogden, R. T., Olvet, D. M., Simpson, N., Huang, Y., Van Heertum, R. L., Arango, V., and Mann, J. J. (2006). Altered serotonin 1A binding in major depression: a [carbonyl-C-11]WAY positron emission tomography study. Biological Psychiatry 59.
47 [8] Reiss, P. T., and Ogden, R. T. (2007). Functional principal component regression and functional partial least squares. Journal of the American Statistical Association 102.
[9] Reiss, P. T., and Ogden, R. T. (2009a). Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society, Series B 71.
[10] Reiss, P. T., and Ogden, R. T. (2009b). Functional generalized linear models with images as predictors. Biometrics, to appear.
[11] Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge: Cambridge University Press.
[12] Saranli, A., and Baykal, B. (1998). Complexity reduction in radial basis function (RBF) networks by using radial B-spline functions. Neurocomputing 18.
More informationTowards a Regression using Tensors
February 27, 2014 Outline Background 1 Background Linear Regression Tensorial Data Analysis 2 Definition Tensor Operation Tensor Decomposition 3 Model Attention Deficit Hyperactivity Disorder Data Analysis
More informationLecture 14: Variable Selection - Beyond LASSO
Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationStatistics for high-dimensional data: Group Lasso and additive models
Statistics for high-dimensional data: Group Lasso and additive models Peter Bühlmann and Sara van de Geer Seminar für Statistik, ETH Zürich May 2012 The Group Lasso (Yuan & Lin, 2006) high-dimensional
More informationData analysis strategies for high dimensional social science data M3 Conference May 2013
Data analysis strategies for high dimensional social science data M3 Conference May 2013 W. Holmes Finch, Maria Hernández Finch, David E. McIntosh, & Lauren E. Moss Ball State University High dimensional
More informationRegularization Methods for Additive Models
Regularization Methods for Additive Models Marta Avalos, Yves Grandvalet, and Christophe Ambroise HEUDIASYC Laboratory UMR CNRS 6599 Compiègne University of Technology BP 20529 / 60205 Compiègne, France
More informationReduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation
Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation Curtis B. Storlie a a Los Alamos National Laboratory E-mail:storlie@lanl.gov Outline Reduction of Emulator
More informationProteomics and Variable Selection
Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial
More informationSupport Vector Machines
Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 12: Logistic regression (v1) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 30 Regression methods for binary outcomes 2 / 30 Binary outcomes For the duration of this
More informationSpatial Process Estimates as Smoothers: A Review
Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationLONGITUDINAL FUNCTIONAL MODELS WITH STRUCTURED PENALTIES
Johns Hopkins University, Dept. of Biostatistics Working Papers 11-2-2012 LONGITUDINAL FUNCTIONAL MODELS WITH STRUCTURED PENALTIES Madan G. Kundu Indiana University School of Medicine, Department of Biostatistics
More informationChapter 7: Model Assessment and Selection
Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has
More informationComputational methods for mixed models
Computational methods for mixed models Douglas Bates Department of Statistics University of Wisconsin Madison March 27, 2018 Abstract The lme4 package provides R functions to fit and analyze several different
More informationNonparametric Regression. Badr Missaoui
Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y
More informationCan we do statistical inference in a non-asymptotic way? 1
Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.
More informationBayesian Grouped Horseshoe Regression with Application to Additive Models
Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu 1,2, Daniel F. Schmidt 1, Enes Makalic 1, Guoqi Qian 2, John L. Hopper 1 1 Centre for Epidemiology and Biostatistics,
More informationSmooth Common Principal Component Analysis
1 Smooth Common Principal Component Analysis Michal Benko Wolfgang Härdle Center for Applied Statistics and Economics benko@wiwi.hu-berlin.de Humboldt-Universität zu Berlin Motivation 1-1 Volatility Surface
More informationHigh-Dimensional Statistical Learning: Introduction
Classical Statistics Biological Big Data Supervised and Unsupervised Learning High-Dimensional Statistical Learning: Introduction Ali Shojaie University of Washington http://faculty.washington.edu/ashojaie/
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate
More informationThe Hodrick-Prescott Filter
The Hodrick-Prescott Filter A Special Case of Penalized Spline Smoothing Alex Trindade Dept. of Mathematics & Statistics, Texas Tech University Joint work with Rob Paige, Missouri University of Science
More informationA Confidence Region Approach to Tuning for Variable Selection
A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized
More informationSpectral Regularization
Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationRidge Regression. Flachs, Munkholt og Skotte. May 4, 2009
Ridge Regression Flachs, Munkholt og Skotte May 4, 2009 As in usual regression we consider a pair of random variables (X, Y ) with values in R p R and assume that for some (β 0, β) R +p it holds that E(Y
More informationExact Likelihood Ratio Tests for Penalized Splines
Exact Likelihood Ratio Tests for Penalized Splines By CIPRIAN CRAINICEANU, DAVID RUPPERT, GERDA CLAESKENS, M.P. WAND Department of Biostatistics, Johns Hopkins University, 615 N. Wolfe Street, Baltimore,
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write
More informationFlexible penalized regression for functional data...and other complex data objects
University of Haifa From the SelectedWorks of Philip T. Reiss November, 2015 Flexible penalized regression for functional data...and other complex data objects Philip T. Reiss Available at: https://works.bepress.com/phil_reiss/41/
More informationFunctional Latent Feature Models. With Single-Index Interaction
Generalized With Single-Index Interaction Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University Naisyin Wang and
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationAn Introduction to Functional Data Analysis
An Introduction to Functional Data Analysis Chongzhi Di Fred Hutchinson Cancer Research Center cdi@fredhutch.org Biotat 578A: Special Topics in (Genetic) Epidemiology November 10, 2015 Textbook Ramsay
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationStatistics 203: Introduction to Regression and Analysis of Variance Penalized models
Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance
More informationRegularization via Spectral Filtering
Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationStatistical Methods for SVM
Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,
More informationSmall Area Estimation for Skewed Georeferenced Data
Small Area Estimation for Skewed Georeferenced Data E. Dreassi - A. Petrucci - E. Rocco Department of Statistics, Informatics, Applications "G. Parenti" University of Florence THE FIRST ASIAN ISI SATELLITE
More informationTwo Applications of Nonparametric Regression in Survey Estimation
Two Applications of Nonparametric Regression in Survey Estimation 1/56 Jean Opsomer Iowa State University Joint work with Jay Breidt, Colorado State University Gerda Claeskens, Université Catholique de
More informationSmoothing parameter selection and outliers accommodation for smoothing splines
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 011, Dublin (Session CPS008) p.6037 Smoothing parameter selection and outliers accommodation for smoothing splines Osorio, Felipe Universidad
More informationGeneral Functional Concurrent Model
General Functional Concurrent Model Janet S. Kim Arnab Maity Ana-Maria Staicu June 17, 2016 Abstract We propose a flexible regression model to study the association between a functional response and multiple
More informationEffective Computation for Odds Ratio Estimation in Nonparametric Logistic Regression
Communications of the Korean Statistical Society 2009, Vol. 16, No. 4, 713 722 Effective Computation for Odds Ratio Estimation in Nonparametric Logistic Regression Young-Ju Kim 1,a a Department of Information
More informationSTK-IN4300 Statistical Learning Methods in Data Science
Outline of the lecture STK-I4300 Statistical Learning Methods in Data Science Riccardo De Bin debin@math.uio.no Model Assessment and Selection Cross-Validation Bootstrap Methods Methods using Derived Input
More information4 Bias-Variance for Ridge Regression (24 points)
Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationCovariate Adjusted Varying Coefficient Models
Biostatistics Advance Access published October 26, 2005 Covariate Adjusted Varying Coefficient Models Damla Şentürk Department of Statistics, Pennsylvania State University University Park, PA 16802, U.S.A.
More informationAdaptive Piecewise Polynomial Estimation via Trend Filtering
Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering
More information