Sparse Functional Regression

Junier B. Oliva, Barnabás Póczos, Aarti Singh, Jeff Schneider, Timothy Verstynen
Machine Learning Department, Robotics Institute, Psychology Department
Carnegie Mellon University, Pittsburgh, PA 15213

1 Introduction

There are a multitude of applications and domains where the study of a mapping that takes in a functional input and outputs a real value is of interest. That is, if $I$ is some class of input functions with domain $\mathbb{R}$ and range $\mathbb{R}$, then one may be interested in a mapping $h : I \to \mathbb{R}$, $h(f) = Y$ (Figure 1(a)). Examples include: a mapping that takes in the time-series of a commodity's price in the past ($f$ is a function with the domain of time and range of price) and outputs the expected price of the commodity in the near future; or a mapping that takes a patient's cardiac-monitor time-series and outputs a health index. Recently, work by [5] has explored this type of regression problem when the input function is a distribution. Furthermore, the general case of an arbitrary functional input is related to functional data analysis [1]. However, it is often expected that the response one is interested in regressing depends on not just one, but many functions. That is, it may be fruitful to consider a mapping $h : I_1 \times \cdots \times I_p \to \mathbb{R}$, $h(f_1, \ldots, f_p) = Y$ (Figure 1(b)). For instance, this is likely the case in regressing the future price of a commodity, since the commodity's future price depends not only on the history of its own price, but also on the histories of other commodities' prices. A response's dependence on multiple functional covariates is especially common in neurological data, where thousands of voxels in the brain may each contain a corresponding function. In fact, in such domains it is not uncommon to have a number of input functional covariates that far exceeds the number of training instances in a data-set.
Thus, it would be beneficial to have an estimator that is sparse in the number of functional covariates used to regress the response. That is, we wish to find an estimate $\hat{h}$ that depends on a small subset $\{i_1, \ldots, i_S\} \subseteq \{1, \ldots, p\}$, such that $\hat{h}(f_1, \ldots, f_p) = \hat{h}_S(f_{i_1}, \ldots, f_{i_S})$ (Figure 1(c)). Here we present a semi-parametric estimator to perform sparse regression with multiple input functional covariates and a real-valued response: FuSSO, the Functional Shrinkage and Selection Operator. No parametric assumptions are made on the nature of the input functions. We shall assume that the response is the result of a sparse set of linear combinations of the input functions and other non-parametric functions $\{g_j\}$: $Y = \sum_j \langle f_j, g_j \rangle + \epsilon$. The resulting method is a LASSO-like [7] estimator that effectively zeros out entire functions from consideration in regressing the response. The estimator was found to be effective in regressing the age of a subject when given orientation distribution function (ODF) data for the subject's white matter.

2 Related Work

As previously mentioned, [5] recently explored regression with a mapping that takes in a probability density function and outputs a real value. Furthermore, [4] studies the case when both the inputs and outputs are distributions. In addition, functional data analysis relates to the study of functional data [1]. In all of these works, the mappings studied take in only one functional covariate. However, it is not immediately evident how to expand on these ideas to develop an estimator that simultaneously performs regression and feature selection with multiple functional covariates.
Figure 1: (a) Model where the mapping takes in a function $f$ and produces a real value $Y$. (b) Model where the response $Y$ is dependent on multiple input functions $f_1, \ldots, f_p$. (c) Sparse model where the response $Y$ is dependent on a sparse subset of the input functions $f_1, \ldots, f_p$.

To our knowledge, there has been no prior work studying sparse mappings that take multiple functional inputs and produce a real-valued output. LASSO-like regression estimators that work with functional data include the following. In [3], one has a functional output and several real-valued covariates; there, the estimator finds a sparse set of functions to scale by the real-valued covariates to produce a functional response. Also, [10, 2] study the case when one has one functional covariate $f$ and one real-valued response that is linearly dependent on $f$ and some function $g$: $Y = \langle f, g \rangle = \int f g$. In [10], the estimator searches for sparsity across wavelet basis projection coefficients. In [2], sparsity is achieved in the time (input) domain of the $d$-th derivative of $g$; i.e., $[D^d g](t) = 0$ for many values of $t$, where $D^d$ is the differential operator. Hence, roughly speaking, [10] and [2] look for sparsity across the frequency and time domains, respectively, for the regressing function $g$. However, these methods do not consider the case where one has many input functional covariates $\{f_1, \ldots, f_p\}$ and needs to choose amongst them. That is, [10, 2] do not provide a method to select among functional covariates in a fashion analogous to how the LASSO selects among real-valued covariates. Lastly, it is worth noting that in our estimator we will have an additive linear model, $\sum_j \langle f_j, g_j \rangle$, where we search for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function.
Such a task is similar in nature to the SpAM estimator [6], in which one also has an additive model $\sum_j g_j(X_j)$ (over the dimensions of a real vector $X$) and searches for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function. Note, though, that in the SpAM model the $\{g_j\}$ functions are applied to real covariates via function evaluation. In the FuSSO model, $\{g_j\}$ are applied to functional covariates via an inner product; that is, FuSSO works over functional, not real-valued, covariates, unlike SpAM.

3 Model

In order to better understand FuSSO's model, we draw several analogies to real-valued linear regression and the Group-LASSO [9]. First, consider a model for typical real-valued linear regression with a data-set of input-output pairs $\{(X_i, Y_i)\}_{i=1}^{N}$:

$Y_i = \langle X_i, w \rangle + \epsilon_i$,

where $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$, $w \in \mathbb{R}^d$, $\epsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, and $\langle X_i, w \rangle = \sum_{j=1}^{d} X_{ij} w_j$. If, instead, one were working with functional data $\{(f^{(i)}, Y_i)\}_{i=1}^{N}$, where $f^{(i)} : [0,1] \to \mathbb{R}$ and $f^{(i)} \in L_2[0,1]$, one might similarly consider a linear model:

$Y_i = \langle f^{(i)}, g \rangle + \epsilon_i$,

where $g : [0,1] \to \mathbb{R}$, and $\langle f^{(i)}, g \rangle = \int_0^1 f^{(i)}(t)\, g(t)\, dt$.
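As a concrete illustration (not from the original text), the functional inner product $\langle f, g \rangle = \int_0^1 f(t) g(t)\, dt$ above can be approximated by a Riemann sum on an evenly spaced grid; a minimal sketch in Python, where the particular choices of $f$ and $g$ are hypothetical:

```python
import numpy as np

# Approximate <f, g> = \int_0^1 f(t) g(t) dt by a Riemann sum on the
# grid 1/n, 2/n, ..., 1. The example functions are hypothetical:
# f(t) = sin(2*pi*t), g(t) = 2*sin(2*pi*t), so the true integral is 1.
n = 1000
t = np.arange(1, n + 1) / n       # evenly spaced grid on (0, 1]
f = np.sin(2 * np.pi * t)         # example input function f
g = 2 * np.sin(2 * np.pi * t)     # example regressing function g
inner = f @ g / n                 # Riemann-sum approximation of <f, g>
```

On this grid the approximation is essentially exact, since the discretization error of the trigonometric integrand cancels over full periods.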
If $\Phi = \{\varphi_m\}_{m=1}^{\infty}$ is an orthonormal basis for $L_2[0,1]$ [8], then we have that

$f^{(i)}(x) = \sum_{m=1}^{\infty} \alpha_m^{(i)} \varphi_m(x)$, where $\alpha_m^{(i)} = \int_0^1 f^{(i)}(t)\, \varphi_m(t)\, dt$.  (1)

Similarly, $g(x) = \sum_{m=1}^{\infty} \beta_m \varphi_m(x)$. Thus,

$Y_i = \langle f^{(i)}, g \rangle + \epsilon_i = \left\langle \sum_{m=1}^{\infty} \alpha_m^{(i)} \varphi_m, \sum_{k=1}^{\infty} \beta_k \varphi_k \right\rangle + \epsilon_i = \sum_{m=1}^{\infty} \sum_{k=1}^{\infty} \alpha_m^{(i)} \beta_k \langle \varphi_m, \varphi_k \rangle + \epsilon_i = \sum_{m=1}^{\infty} \alpha_m^{(i)} \beta_m + \epsilon_i$,

where the last step follows from the orthonormality of $\Phi$.

Going back to the real-valued covariate case, if instead of having one feature vector per data instance, $X_i \in \mathbb{R}^d$, one had $p$ feature vectors associated with each data instance, $\{X_{ij} \mid 1 \le j \le p,\ X_{ij} \in \mathbb{R}^d\}$, an additive linear model could be used for regression:

$Y_i = \sum_{j=1}^{p} \langle X_{ij}, w_j \rangle + \epsilon_i$, where $w_1, \ldots, w_p \in \mathbb{R}^d$.

Similarly, in the functional case, one may have $p$ functions associated with data instance $i$: $\{f_j^{(i)} \mid 1 \le j \le p,\ f_j^{(i)} \in L_2[0,1]\}$. Then, an additive linear model would be:

$Y_i = \sum_{j=1}^{p} \langle f_j^{(i)}, g_j \rangle + \epsilon_i = \sum_{j=1}^{p} \sum_{m=1}^{\infty} \alpha_{jm}^{(i)} \beta_{jm} + \epsilon_i$,  (2)

where $g_1, \ldots, g_p \in L_2[0,1]$, and $\alpha_{jm}^{(i)}$ and $\beta_{jm}$ are the projection coefficients of $f_j^{(i)}$ and $g_j$, respectively.

Suppose that one has few observations relative to the number of features ($N \ll p$). In the real-valued case, in order to effectively find a solution for $w = (w_1^T, \ldots, w_p^T)^T$, one may search for a group-sparse solution where many $w_j = 0$. To do so, one may consider the following Group-LASSO regression:

$w^\star = \operatorname*{argmin}_{w} \frac{1}{2N} \Big\| Y - \sum_{j=1}^{p} X_j w_j \Big\|_2^2 + \lambda_N \sum_{j=1}^{p} \| w_j \|_2$,  (3)

where $X_j$ is the $N \times d$ matrix $X_j = [X_{1j} \cdots X_{Nj}]^T$, $Y = (Y_1, \ldots, Y_N)^T$, and $\|\cdot\|_2$ is the Euclidean norm. If in the functional case (2) one also has $N \ll p$, one may set up an optimization similar to (3), whose direct analogue is:

$g^\star = \operatorname*{argmin}_{g_1, \ldots, g_p} \frac{1}{2N} \sum_{i=1}^{N} \Big( Y_i - \sum_{j=1}^{p} \langle f_j^{(i)}, g_j \rangle \Big)^2 + \lambda_N \sum_{j=1}^{p} \| g_j \|_{L_2}$;  (4)

equivalently,

$\beta^\star = \operatorname*{argmin}_{\beta} \frac{1}{2N} \sum_{i=1}^{N} \Big( Y_i - \sum_{j=1}^{p} \sum_{m=1}^{\infty} \alpha_{jm}^{(i)} \beta_{jm} \Big)^2 + \lambda_N \sum_{j=1}^{p} \sqrt{\sum_{m=1}^{\infty} \beta_{jm}^2}$,  (5)

where $g^\star = \{g_j^\star\}_{j=1}^{p} = \{\sum_m \beta_{jm}^\star \varphi_m\}_{j=1}^{p}$.

However, it is intractable to assume that one is able to directly observe the functional inputs $\{f_j^{(i)} \mid 1 \le i \le N,\ 1 \le j \le p\}$. Thus, we shall instead assume that one observes $\{\vec{y}_j^{(i)} \mid 1 \le i \le N,\ 1 \le j \le p\}$, where

$\vec{y}_j^{(i)} = \vec{f}_j^{(i)} + \vec{\xi}_j^{(i)}$, $\quad \vec{f}_j^{(i)} = \big( f_j^{(i)}(1/n),\ f_j^{(i)}(2/n),\ \ldots,\ f_j^{(i)}(1) \big)^T$, $\quad \vec{\xi}_j^{(i)} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2 I)$.  (6)
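To make this concrete, the projection coefficients in the expansion above can be estimated from the noisy grid observations of model (6) by replacing the integral with a grid average. A minimal sketch, assuming the cosine basis $\varphi_1(t) = 1$, $\varphi_m(t) = \sqrt{2}\cos((m-1)\pi t)$ for $m \ge 2$ (the basis used in the experiments later); the test function and noise level are hypothetical:

```python
import numpy as np

def cosine_basis(m, t):
    """Cosine basis: phi_1(t) = 1, phi_m(t) = sqrt(2) cos((m-1) pi t), m >= 2."""
    if m == 1:
        return np.ones_like(t)
    return np.sqrt(2.0) * np.cos((m - 1) * np.pi * t)

def projection_coeffs(y, M):
    """Estimate alpha_m ~= (1/n) phi_m^T y for m = 1, .., M from grid samples y."""
    n = y.shape[0]
    t = np.arange(1, n + 1) / n
    return np.array([cosine_basis(m, t) @ y / n for m in range(1, M + 1)])

rng = np.random.default_rng(0)
n = 500
t = np.arange(1, n + 1) / n
# Hypothetical example: observe f(t) = sqrt(2) cos(pi t) plus noise,
# so the true coefficients are alpha_2 = 1 and alpha_m = 0 otherwise.
y = np.sqrt(2.0) * np.cos(np.pi * t) + 0.1 * rng.standard_normal(n)
alpha = projection_coeffs(y, M=5)    # alpha[1] should be close to 1
```

Averaging against the basis functions both integrates out the grid and averages down the observation noise, which is why the truncated coefficient vector is a reasonable stand-in for the unobserved function.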
That is, we observe a grid of $n$ noisy values for each functional input. Then, one may estimate $\alpha_{jm}^{(i)}$ as:

$\tilde{\alpha}_{jm}^{(i)} = \frac{1}{n} \vec{\varphi}_m^T \vec{y}_j^{(i)} = \frac{1}{n} \vec{\varphi}_m^T \big( \vec{f}_j^{(i)} + \vec{\xi}_j^{(i)} \big) = \frac{1}{n} \vec{\varphi}_m^T \vec{f}_j^{(i)} + \frac{1}{n} \vec{\varphi}_m^T \vec{\xi}_j^{(i)}$,

where $\vec{\varphi}_m = (\varphi_m(1/n), \varphi_m(2/n), \ldots, \varphi_m(1))^T$. Furthermore, we may truncate the number of basis functions used to express $f_j^{(i)}$ to $M_n$, estimating it as:

$\tilde{f}_j^{(i)}(x) = \sum_{m=1}^{M_n} \tilde{\alpha}_{jm}^{(i)} \varphi_m(x)$.  (7)

Using the truncated estimate (7), one has:

$\langle \tilde{f}_j^{(i)}, g_j \rangle = \sum_{m=1}^{M_n} \tilde{\alpha}_{jm}^{(i)} \beta_{jm}$ and $\| \tilde{f}_j^{(i)} \| = \sqrt{\sum_{m=1}^{M_n} \big( \tilde{\alpha}_{jm}^{(i)} \big)^2}$.

Hence, using the approximations (7), (5) becomes:

$\hat{\beta} = \operatorname*{argmin}_{\beta} \frac{1}{2N} \sum_{i=1}^{N} \Big( Y_i - \sum_{j=1}^{p} \sum_{m=1}^{M_n} \tilde{\alpha}_{jm}^{(i)} \beta_{jm} \Big)^2 + \lambda_N \sum_{j=1}^{p} \sqrt{\sum_{m=1}^{M_n} \beta_{jm}^2}$  (8)

$= \operatorname*{argmin}_{\beta} \frac{1}{2N} \Big\| Y - \sum_{j=1}^{p} \tilde{A}_j \beta_j \Big\|_2^2 + \lambda_N \sum_{j=1}^{p} \| \beta_j \|_2$,  (9)

where $\tilde{A}_j$ is the $N \times M_n$ matrix with values $\tilde{A}_j(i, m) = \tilde{\alpha}_{jm}^{(i)}$ and $\beta_j = (\beta_{j1}, \ldots, \beta_{j M_n})^T$. Note that one need not consider projection coefficients $\beta_{jm}$ for $m > M_n$, since such projection coefficients will not decrease the MSE term in (8) (because $\tilde{\alpha}_{jm}^{(i)} = 0$ for $m > M_n$), while $\beta_{jm} \ne 0$ for $m > M_n$ increases the norm penalty term in (8). Hence, we see that our sparse functional estimates are given by a Group-LASSO problem on the projection coefficients. In a future publication, we shall show that if $\{f_j^{(i)}\}$ and $\{g_j\}$ are in a Sobolev function class and some other mild assumptions hold, then our estimator is asymptotically sparsistent.

4 Experiments

We tested the FuSSO estimator with neurological data consisting of 89 total subjects. Orientation distribution function (ODF) data (Figure 2(a)) was provided for each subject in a template space for white-matter voxels; the ODFs of over 5 thousand voxels in total were regressed on. We looked to regress a subject's age given his/her respective ODF data. The projection coefficients for the ODFs at each voxel were estimated using the cosine basis. The FuSSO estimator gave a held-out MSE of 7.855, where the variance of age was 56.465.

Figure 2: (a) An example ODF for a voxel. (b) Histogram of ages for subjects. (c) Voxels in the support of the model, shown in blue. (d) Histogram of held-out error magnitudes.
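The optimization (9) is a standard Group-LASSO problem on the stacked coefficient blocks $\beta_1, \ldots, \beta_p$, so any group-sparse solver applies. The following is an illustrative proximal-gradient (ISTA) sketch with block soft-thresholding, not the authors' implementation; the function name and the synthetic data are hypothetical:

```python
import numpy as np

def fusso_group_lasso(A_list, Y, lam, n_iter=1000):
    """Minimize (1/(2N)) ||Y - sum_j A_j b_j||^2 + lam * sum_j ||b_j||_2,
    where A_list[j] is the N x M_n matrix of estimated projection
    coefficients for functional covariate j (the A~_j of eq. 9)."""
    N = Y.shape[0]
    A = np.hstack(A_list)                    # N x (p * M_n) design matrix
    sizes = [Aj.shape[1] for Aj in A_list]
    step = N / (np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant of gradient
    b = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = b - step * (A.T @ (A @ b - Y)) / N   # gradient step on the MSE term
        blocks, start = [], 0
        for M in sizes:                          # block soft-threshold each beta_j
            zj = z[start:start + M]
            nrm = np.linalg.norm(zj)
            scale = max(0.0, 1.0 - step * lam / nrm) if nrm > 0 else 0.0
            blocks.append(scale * zj)
            start += M
        b = np.concatenate(blocks)
    return np.split(b, np.cumsum(sizes)[:-1])    # list of beta_j estimates

# Hypothetical synthetic check: only functional covariate 0 is relevant.
rng = np.random.default_rng(1)
N, p, M = 100, 5, 4
A_list = [rng.standard_normal((N, M)) for _ in range(p)]
beta_true = np.array([2.0, 0.0, 0.0, 0.0])
Y = A_list[0] @ beta_true + 0.05 * rng.standard_normal(N)
B = fusso_group_lasso(A_list, Y, lam=0.2)        # B[0] active, rest zeroed out
```

The block soft-threshold either shrinks a covariate's whole coefficient block or sets it exactly to zero, which is precisely how entire functions are removed from the model.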
References

[1] F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer, 2006.
[2] Gareth M. James, Jing Wang, and Ji Zhu. Functional linear regression that's interpretable. The Annals of Statistics, pages 2083–2108, 2009.
[3] Nicola Mingotti, Rosa E. Lillo, and Juan Romo. Lasso variable selection in functional regression. 2013.
[4] Junier B. Oliva, Barnabás Póczos, and Jeff Schneider. Distribution to distribution regression.
[5] B. Poczos, A. Rinaldo, A. Singh, and L. Wasserman. Distribution-free distribution regression. arXiv preprint arXiv:1302.0082, 2013.
[6] Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.
[7] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[8] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
[9] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[10] Yihong Zhao, R. Todd Ogden, and Philip T. Reiss. Wavelet-based lasso in functional linear regression. Journal of Computational and Graphical Statistics, 21(3):600–617, 2012.