FuSSO: Functional Shrinkage and Selection Operator

Junier B. Oliva, Barnabás Póczos, Timothy Verstynen, Aarti Singh, Jeff Schneider (Carnegie Mellon University); Fang-Cheng Yeh, Wen-Yih Tseng (National Taiwan University)

Abstract

We present the FuSSO, a functional analogue to the LASSO, that efficiently finds a sparse set of functional input covariates to regress a real-valued response against. The FuSSO does so in a semi-parametric fashion, making no parametric assumptions about the nature of the input functional covariates and assuming a linear form for the mapping of functional covariates to the response. We provide a statistical backing for use of the FuSSO via a proof of asymptotic sparsistency under various conditions. Furthermore, we observe good results on both synthetic and real-world data.

1 Introduction

Modern data collection has allowed us to collect not just more data, but more complex data. In particular, complex objects like sets, distributions, and functions are becoming prevalent in many domains. It would be beneficial to perform machine learning tasks over these complex objects. However, many existing techniques cannot handle complex, possibly infinite-dimensional, objects; hence one often resorts to the ad-hoc technique of representing these complex objects by arbitrary summary statistics.

In this paper, we look to perform a regression task when dealing with functional data. Specifically, we look to regress a mapping that takes in many functional input covariates and outputs a real value. Moreover, since we are considering many functional covariates, possibly many more than the number of instances in one's data-set, we look to find an estimator that performs feature selection by only regressing on a subset of all possible input functional covariates. To this end we present the Functional Shrinkage and Selection Operator (FuSSO), for performing sparse functional regression in a principled, semi-parametric manner.

(Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.)

Indeed, there are a multitude of applications and domains where the study of a mapping that takes in a functional input and outputs a real value is of interest. That is, if $\mathcal{I}$ is some class of input functions with domain $\Psi \subseteq \mathbb{R}$ and range $\mathbb{R}$, then one may be interested in a mapping $h : \mathcal{I} \to \mathbb{R}$, $Y \approx h(f)$ (Figure 1a). Examples include: a mapping that takes in the time series of a commodity's price in the past ($f$ is a function with a domain of time and a range of price) and outputs the expected price of the commodity in the near future; also, a mapping that takes a patient's cardiac-monitor time series and outputs a health index. Recently, work by [8] has explored this type of regression problem when the input function is a distribution. Furthermore, the general case of an arbitrary functional input is related to functional analysis [2]. However, often it is expected that the response one is interested in regressing depends on not just one, but many functions. That is, it may be fruitful to consider a mapping $h : \mathcal{I}_1 \times \ldots \times \mathcal{I}_p \to \mathbb{R}$, $Y \approx h(f_1, \ldots, f_p)$ (Figure 1b). For instance, this is likely the case in regressing the future price of a commodity, since the commodity's future price depends not only on the history of its own price, but also on the histories of other commodities' prices. A response's dependence on multiple functional covariates is especially common in neurological data, where thousands of voxels in the brain may each contain a corresponding function.
In fact, in such domains it is not uncommon for the number of input functional covariates to far exceed the number of training instances in one's data-set. Thus, it would be beneficial to have an estimator that is sparse in the number of functional covariates used to regress the response against. That is, we wish to find an estimate $\hat{h}_S$ that depends only on a small subset $\{i_1, \ldots, i_S\} \subseteq \{1, \ldots, p\}$, such that $\hat{h}(f_1, \ldots, f_p) = \hat{h}_S(f_{i_1}, \ldots, f_{i_S})$ (Figure 1c). Here we present a semi-parametric estimator, the FuSSO (Functional Shrinkage and Selection Operator), to perform sparse regression with multiple input functional covariates and a real-valued response.

Figure 1: (a) Model where the mapping takes in a single function $f$ and produces a real value. (b) Model where the response depends on multiple input functions $f_1, \ldots, f_p$. (c) Sparse model where the response depends on a sparse subset of the input functions $f_1, \ldots, f_p$.

No parametric assumptions are made on the nature of the input functions. We shall assume that the response is the result of a sparse set of linear combinations of the input functions and other non-parametric functions $\{g_j\}$: $Y = \sum_j \langle f_j, g_j \rangle$. The resulting method is a LASSO-like estimator that effectively zeros out entire functions from consideration in regressing the response.

Our contributions are as follows. We introduce the FuSSO, an estimator for performing regression with many functional covariates and a real-valued response. Furthermore, we provide a theoretical backing for the FuSSO estimator via a proof of asymptotic sparsistency under certain conditions. We also illustrate the estimator with applications on synthetic data, as well as in regressing the age of a subject given orientation distribution function (dODF) [14] data for the subject's white matter.

2 Related Work

As previously mentioned, [8] recently explored regression with a mapping that takes in a probability density function and outputs a real value. Furthermore, [7] studies the case when both the input and the output are distributions. In addition, functional analysis [2] relates to the study of functional data. In all of these works, the mappings studied take in only one functional covariate. Based on them, it is not immediately evident how to expand on these ideas to develop an estimator that simultaneously performs regression and feature selection with multiple functional covariates. To the best of our knowledge, there has been no prior work studying sparse mappings that take multiple functional inputs and produce a real-valued output.

LASSO-like regression estimators that work with functional data include the following. In [6], one has a functional output and several real-valued covariates; there, the estimator finds a sparse set of functions to scale by the real-valued covariates to produce the functional response. Also, [17] and [3] study the case where one has a single functional covariate $f$ and a real-valued response that depends linearly on $f$ through some function $g$: $Y = \langle f, g \rangle = \int f g$. In [17] the estimator searches for sparsity across wavelet basis projection coefficients. In [3], sparsity is achieved in the time domain of the $d$th derivative of $g$; i.e. $D^d g(t) = 0$ for many values of $t$, where $D^d$ is the differential operator. Hence, roughly speaking, [17] and [3] look for sparsity across the frequency and time domains, respectively, for the regressing function $g$. However, these methods do not consider the case where one has many input functional covariates $\{f_1, \ldots, f_p\}$ and needs to choose among them. That is, [17] and [3] do not provide a method to select among functional covariates in a fashion analogous to how the LASSO selects among real-valued covariates. Lastly, it is worth noting that our estimator will have an additive linear model, $Y = \sum_j \langle f_j, g_j \rangle$, where we search for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function.
Such a task is similar in nature to the SpAM estimator [9], in which one also has an additive model $Y = \sum_j g_j(X_j)$ in the dimensions of a real vector $X$ and searches for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function. Note, though, that in the SpAM model the $\{g_j\}$ functions are applied to real covariates via function evaluation, whereas in the FuSSO model the $\{g_j\}$ are applied to functional covariates via an inner product; that is, FuSSO works over functional, not real-valued, covariates, unlike SpAM.
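To make the contrast concrete, the two additive models can be written side by side; the display below simply restates, in the notation of the next section, the SpAM model of [9] and the FuSSO model described above.
$$\text{SpAM:}\quad Y_i = \sum_{j=1}^{p} g_j(X_{ij}) + \epsilon_i, \qquad\qquad \text{FuSSO:}\quad Y_i = \sum_{j=1}^{p} \langle f_{ij}, g_j \rangle + \epsilon_i = \sum_{j=1}^{p} \int f_{ij}(t)\, g_j(t)\, dt + \epsilon_i.$$
In both cases sparsity means that many of the $g_j$ are identically zero; the difference lies only in how $g_j$ acts on the $j$th covariate.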

3 Model

To better understand FuSSO's model, we draw several analogies to real-valued linear regression and the Group-LASSO [16]. Note that although for simplicity we focus on functions over a one-dimensional domain, it is straightforward to extend the estimator and results to the multidimensional case.

Consider a model for typical real-valued linear regression with a data-set of input-output pairs $\{(X_i, Y_i)\}_{i=1}^N$:
$$Y_i = \langle X_i, w \rangle + \epsilon_i,$$
where $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$, $w \in \mathbb{R}^d$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, and $\langle X_i, w \rangle = \sum_{j=1}^d X_{ij} w_j$. If instead one were working with functional data $\{(f_i, Y_i)\}_{i=1}^N$, where $f_i : [0,1] \to \mathbb{R}$ and $f_i \in L_2[0,1]$, one may similarly consider a linear model:
$$Y_i = \langle f_i, g \rangle + \epsilon_i, \tag{1}$$
where $g : [0,1] \to \mathbb{R}$ and $\langle f_i, g \rangle = \int_0^1 f_i(t) g(t)\,dt$. If $\Phi = \{\varphi_m\}_{m=1}^\infty$ is an orthonormal basis for $L_2[0,1]$, then $f_i(x) = \sum_{m=1}^\infty \alpha_m^{(i)} \varphi_m(x)$, where $\alpha_m^{(i)} = \int_0^1 f_i(t)\varphi_m(t)\,dt$. Similarly, $g(x) = \sum_{m=1}^\infty \beta_m \varphi_m(x)$, where $\beta_m = \int_0^1 g(t)\varphi_m(t)\,dt$. Thus,
$$Y_i = \langle f_i, g \rangle + \epsilon_i = \Big\langle \sum_{m=1}^\infty \alpha_m^{(i)} \varphi_m,\ \sum_{k=1}^\infty \beta_k \varphi_k \Big\rangle + \epsilon_i = \sum_{m=1}^\infty \sum_{k=1}^\infty \alpha_m^{(i)} \beta_k \langle \varphi_m, \varphi_k \rangle + \epsilon_i = \sum_{m=1}^\infty \alpha_m^{(i)} \beta_m + \epsilon_i,$$
where the last step follows from the orthonormality of $\Phi$.

Going back to the real-valued covariate case, if instead of having one feature vector per data instance, $X_i \in \mathbb{R}^d$, one had $p$ feature vectors associated with each data instance, $\{X_{ij}\}_{j=1}^p$ with $X_{ij} \in \mathbb{R}^d$, an additive linear model may be used for regression:
$$Y_i = \sum_{j=1}^p \langle X_{ij}, w_j \rangle + \epsilon_i,$$
where $w_1, \ldots, w_p \in \mathbb{R}^d$. Similarly, in the functional case one may have $p$ functions associated with data instance $i$, $\{f_{ij}\}_{j=1}^p$ with $f_{ij} \in L_2[0,1]$. Then, an additive linear model would be:
$$Y_i = \sum_{j=1}^p \langle f_{ij}, g_j \rangle + \epsilon_i = \sum_{j=1}^p \sum_{m=1}^\infty \alpha_m^{(ij)} \beta_m^{(j)} + \epsilon_i, \tag{2}$$
where $g_1, \ldots, g_p \in L_2[0,1]$, and $\alpha_m^{(ij)}$ and $\beta_m^{(j)}$ are the projection coefficients of $f_{ij}$ and $g_j$, respectively.

Suppose that one has few observations $N$ relative to the number of features $p$. In the real-valued case, in order to effectively find a solution for $w = (w_1^T, \ldots, w_p^T)^T$ one may search for a group-sparse solution in which many $w_j = 0$. To do so, one may consider the following Group-LASSO regression:
$$\hat{w} = \operatorname*{argmin}_w \frac{1}{2N}\|Y - Xw\|_2^2 + \lambda_N \sum_{j=1}^p \|w_j\|_2, \tag{3}$$
where $X$ is the $N \times dp$ matrix $X = [X_1 \ \ldots \ X_N]^T$, $Y = (Y_1, \ldots, Y_N)^T$, and $\|\cdot\|_2$ is the Euclidean norm. If in the functional case one also has $p \gg N$, one may set up an optimization similar to (3), whose direct analogue is:
$$\hat{g} = \operatorname*{argmin}_g \frac{1}{2N}\sum_{i=1}^N \Big( Y_i - \sum_{j=1}^p \langle f_{ij}, g_j \rangle \Big)^2 \tag{4}$$
$$\qquad + \lambda_N \sum_{j=1}^p \|g_j\|_2; \tag{5}$$
equivalently,
$$\hat{\beta} = \operatorname*{argmin}_\beta \frac{1}{2N}\sum_{i=1}^N \Big( Y_i - \sum_{j=1}^p \sum_{m=1}^\infty \alpha_m^{(ij)} \beta_m^{(j)} \Big)^2 \tag{6}$$
$$\qquad + \lambda_N \sum_{j=1}^p \sqrt{\sum_{m=1}^\infty \big(\beta_m^{(j)}\big)^2}, \tag{7}$$
where $\hat{g} = \{\hat{g}_j\}_{j=1}^p = \{\sum_m \hat{\beta}_m^{(j)} \varphi_m\}_{j=1}^p$.

However, it is infeasible to directly observe the functional inputs $\{f_{ij} : 1 \le i \le N,\ 1 \le j \le p\}$. Thus, we shall instead assume that one observes $\{\vec{y}_{ij} : 1 \le i \le N,\ 1 \le j \le p\}$, where
$$\vec{y}_{ij} = \vec{f}_{ij} + \vec{\xi}_{ij}, \tag{8}$$
$$\vec{f}_{ij} = \big(f_{ij}(1/n),\ f_{ij}(2/n),\ \ldots,\ f_{ij}(1)\big)^T, \tag{9}$$
$$\vec{\xi}_{ij} \sim \mathcal{N}(0, \sigma_\xi^2 I_n). \tag{10}$$
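As a quick numerical illustration of the reduction above, the short sketch below builds a smooth function and a regressor from an orthonormal cosine basis on $[0,1]$, checks that $\langle f, g \rangle \approx \sum_m \alpha_m \beta_m$, and then simulates the noisy grid observations of (8)-(10). The specific basis, decay rates, and noise level are illustrative choices, not values from the paper.

```python
import numpy as np

# A minimal numerical check of the projection-coefficient reduction
# <f, g> = sum_m alpha_m beta_m, plus simulated grid observations as in (8)-(10).
# The basis choice, decay rates, and noise level below are illustrative assumptions.

rng = np.random.default_rng(0)
n = 500                                   # grid size
t = np.arange(1, n + 1) / n               # grid points 1/n, 2/n, ..., 1

def phi(m, x):
    """Orthonormal cosine basis on [0,1]: phi_1 = 1, phi_m = sqrt(2) cos(pi (m-1) x)."""
    return np.ones_like(x) if m == 1 else np.sqrt(2) * np.cos(np.pi * (m - 1) * x)

M = 20
f = sum(np.exp(-m / 3.0) * phi(m, t) for m in range(1, M + 1))   # a smooth input function
g = sum(m ** -2.0 * phi(m, t) for m in range(1, M + 1))          # a smooth regressor g

# Projection coefficients via Riemann sums over the grid (approximating the integrals).
alpha = np.array([np.mean(f * phi(m, t)) for m in range(1, M + 1)])
beta = np.array([np.mean(g * phi(m, t)) for m in range(1, M + 1)])
print(np.mean(f * g), alpha @ beta)       # <f, g> vs. sum_m alpha_m beta_m: nearly equal

# Noisy grid observations of f as in (8)-(10); these are what the estimator actually sees.
sigma_xi = 0.1
y_vec = f + sigma_xi * rng.standard_normal(n)
```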

That is, we observe a grid of $n$ noisy values for each functional input. Then, one may estimate $\alpha_m^{(ij)}$ as:
$$\tilde{\alpha}_m^{(ij)} = \tfrac{1}{n}\vec{\varphi}_m^T \vec{y}_{ij} = \tfrac{1}{n}\vec{\varphi}_m^T\big(\vec{f}_{ij} + \vec{\xi}_{ij}\big) = \bar{\alpha}_m^{(ij)} + \eta_m^{(ij)}, \tag{11}$$
where $\vec{\varphi}_m = \big(\varphi_m(1/n), \varphi_m(2/n), \ldots, \varphi_m(1)\big)^T$. Furthermore, we may truncate the number of basis functions used to express $f_{ij}$ to $M_n$, estimating it as:
$$\tilde{f}_{ij}(x) = \sum_{m=1}^{M_n} \tilde{\alpha}_m^{(ij)} \varphi_m(x). \tag{12}$$
Using the truncated estimate, one has $\langle \tilde{f}_{ij}, g_j \rangle = \sum_{m=1}^{M_n} \tilde{\alpha}_m^{(ij)} \beta_m^{(j)}$. Hence, using these approximations, (7) becomes:
$$\hat{\beta} = \operatorname*{argmin}_\beta \frac{1}{2N}\sum_{i=1}^N \Big( Y_i - \sum_{j=1}^p \sum_{m=1}^{M_n} \tilde{\alpha}_m^{(ij)} \beta_m^{(j)} \Big)^2 \tag{13}$$
$$\qquad + \lambda_N \sum_{j=1}^p \sqrt{\sum_{m=1}^{M_n} \big(\beta_m^{(j)}\big)^2} \tag{14}$$
$$= \operatorname*{argmin}_\beta \frac{1}{2N}\Big\| Y - \sum_{j=1}^p \tilde{A}_j \beta^{(j)} \Big\|_2^2 + \lambda_N \sum_{j=1}^p \|\beta^{(j)}\|_2, \tag{15}$$
where $\tilde{A}_j$ is the $N \times M_n$ matrix with values $[\tilde{A}_j]_{i,m} = \tilde{\alpha}_m^{(ij)}$ and $\beta^{(j)} = (\beta_1^{(j)}, \ldots, \beta_{M_n}^{(j)})^T$. Note that one need not consider projection coefficients $\beta_m^{(j)}$ for $m > M_n$, since such projection coefficients will not decrease the MSE term in (13) (because $\tilde{\alpha}_m^{(ij)} = 0$ for $m > M_n$ in the truncated representation), while $\beta_m^{(j)} \ne 0$ for $m > M_n$ would increase the norm penalty term in (14). Hence we see that our sparse functional estimate is given by a Group-LASSO problem on the projection coefficients.

Extensions

It is useful to note that there are several straightforward extensions to the FuSSO as presented. First, it may be possible to estimate the inner product of a function $f_{ij}$ and $g_j$ as $\langle f_{ij}, g_j \rangle \approx \tfrac{1}{n}\vec{y}_{ij}^T\vec{g}_j$, where $\vec{g}_j = (g_j(1/n), \ldots, g_j(1))^T$. This effectively allows one to use the naive approach of simply running the Group-LASSO on the $\vec{y}_{ij}$ feature vectors directly; we will refer to this method as Y-GL. It is important to note, however, that Y-GL will be less robust to noise, and less adaptive and efficient with respect to smoothness, than the FuSSO. Furthermore, it is not necessary for the FuSSO that the observations of the input functions lie on a grid, since one may estimate projection coefficients in the case of an irregular design. Moreover, we may also estimate projection coefficients for density functions from samples drawn from the pdfs. Note that Y-GL would fail to estimate our model in the irregular-design case, and would not be possible in the case where the functions are pdfs. Also, a two-stage estimator as described in [5], where one first uses the regularization penalty with a large $\lambda_N$ to find the support, and then solves the optimization problem with a smaller $\lambda_N$ on just the estimated support, may be more efficient at estimating the response. Furthermore, a problem analogous to (15) may be framed to perform logistic regression and classification.
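To make the estimator concrete, the sketch below simulates data from the sparse additive functional model, forms the estimated projection-coefficient design of (11), and solves the Group-LASSO problem (15) by proximal gradient descent with block soft-thresholding. This is a minimal illustrative implementation, not the authors' code; all constants (dimensions, decay rates, noise levels, and the regularization weight `lam`) are assumptions, and in practice $M_n$ and $\lambda_N$ would be chosen by cross-validation.

```python
import numpy as np

# Sketch of the FuSSO pipeline on simulated data: (i) estimate projection
# coefficients from noisy grid observations as in (11), (ii) truncate to M_n
# basis functions, and (iii) solve the Group-LASSO problem (15) by proximal
# gradient with block soft-thresholding. Constants are illustrative assumptions.

rng = np.random.default_rng(1)
N, p, s, n, M_n = 100, 50, 5, 200, 10        # instances, covariates, support, grid, truncation
t = np.arange(1, n + 1) / n                  # grid 1/n, ..., 1

def phi(m, x):                               # orthonormal cosine basis on [0, 1]
    return np.ones_like(x) if m == 1 else np.sqrt(2) * np.cos(np.pi * (m - 1) * x)

Phi = np.stack([phi(m, t) for m in range(1, M_n + 1)])           # M_n x n

# Simulate smooth input functions via decaying random projection coefficients,
# noisy grid observations as in (8)-(10), and responses from the sparse model (2).
decay = np.arange(1, M_n + 1) ** 2.0
alpha = rng.uniform(-1, 1, size=(N, p, M_n)) / decay
F = alpha @ Phi                                                  # N x p x n true function values
Y_grid = F + 0.1 * rng.standard_normal(F.shape)                  # noisy observations y_ij
beta_true = np.zeros((p, M_n))
beta_true[:s] = rng.uniform(-1, 1, size=(s, M_n)) / decay
Y = np.einsum('ijm,jm->i', alpha, beta_true) + 0.05 * rng.standard_normal(N)

# (11): estimated projection coefficients, stacked into the N x (p*M_n) design of (15).
A_tilde = np.einsum('mk,ijk->ijm', Phi, Y_grid) / n              # N x p x M_n
A = A_tilde.reshape(N, p * M_n)

def fusso(A, Y, p, M_n, lam, iters=2000):
    """Group-LASSO (15) via proximal gradient; one block per functional covariate."""
    beta = np.zeros(p * M_n)
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 / len(Y))            # 1 / Lipschitz constant
    for _ in range(iters):
        grad = A.T @ (A @ beta - Y) / len(Y)                     # gradient of the MSE term
        z = (beta - step * grad).reshape(p, M_n)
        norms = np.linalg.norm(z, axis=1, keepdims=True)
        z *= np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))  # block soft-threshold
        beta = z.ravel()
    return beta.reshape(p, M_n)

beta_hat = fusso(A, Y, p, M_n, lam=0.01)
support = np.where(np.linalg.norm(beta_hat, axis=1) > 1e-8)[0]
print("estimated support:", support)         # ideally the indices of the first s covariates
```

Sweeping `lam` from large to small traces out regularization paths of the kind shown later in Figure 2, with whole blocks (i.e., whole functional covariates) entering the model as the penalty decreases.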

4 Theory

Next, we show that the FuSSO is able to recover the correct sparsity pattern asymptotically; i.e., that the FuSSO estimate is sparsistent. In order to do so, we shall show that with high probability there is an optimal solution to our optimization problem (15) with the correct sparsity pattern. We follow an argument similar to [12, 9]. We shall use a witness technique to show that there is a coefficient/subgradient pair $(\hat{\beta}, \hat{u})$ such that $\mathrm{supp}(\hat{\beta}) = \mathrm{supp}(\beta^*)$, for the true response-generating $\beta^*$. Let $\Omega(\beta) = \sum_{j=1}^p \|\beta^{(j)}\|_2$ be our penalty term (14). Let $S$ denote the true set of non-zero functions, i.e. $S = \{j : \beta^{*(j)} \ne 0\}$, with $s = |S|$. First, we fix $\hat{\beta}_{S^c} = 0$, and set $\hat{u}_S \in \partial\Omega(\hat{\beta}_S)$. Note that for a vector $\beta$, $\partial\Omega(\beta) = \{u\}$ where $u^{(j)} = \beta^{(j)}/\|\beta^{(j)}\|_2$ if $\beta^{(j)} \ne 0$, and $\|u^{(j)}\|_2 \le 1$ if $\beta^{(j)} = 0$. We shall show that, with high probability, $\forall j \in S$, $\hat{\beta}^{(j)} \ne 0$ and $\forall j \in S^c$, $\|\hat{u}^{(j)}\|_2 < 1$, thus showing that there is an optimal solution to (15) that has the true sparsity pattern with high probability. First, we elaborate on our assumptions.

4.1 Assumptions

Let $\Phi$ be the trigonometric basis: $\varphi_1(x) \equiv 1$ and, for $k \in \{1, 2, \ldots\}$, $\varphi_{2k}(x) = \sqrt{2}\cos(2\pi k x)$, $\varphi_{2k+1}(x) = \sqrt{2}\sin(2\pi k x)$. Let $D = \{(\{\vec{y}_{ij}\}_{j=1}^p, Y_i)\}_{i=1}^N$, where $\vec{y}_{ij}$ is as in (8) and $Y_i = \sum_{j=1}^p\sum_m \alpha_m^{(ij)}\beta_m^{*(j)} + \epsilon_i$ as in (2). Assume that for all $i$ and all $j \in \{1,\ldots,p\}$: $\alpha^{(ij)} \in \Theta(\gamma, Q)$, where
$$\Theta(\gamma, Q) = \Big\{\theta : \sum_{k=1}^\infty c_k^2\theta_k^2 \le Q\Big\}, \qquad c_k = k^\gamma \text{ if } k \text{ is even or } k = 1, \quad c_k = (k-1)^\gamma \text{ otherwise},$$
$\alpha^{(ij)} = \{\alpha_m^{(ij)} \in \mathbb{R} : \alpha_m^{(ij)} = \langle f_{ij}, \varphi_m \rangle,\ m \in \mathbb{Z}^+\}$, for finite $Q > 0$ and smoothness parameter $\gamma$. Furthermore, assume that for the true $\beta^*$ generating the observed responses $Y_i$, $\forall j \in \{1,\ldots,p\}$: $\beta^{*(j)} \in \Theta(\gamma, Q)$.

Let $A_j$ be the $N \times M_n$ matrix with entries $[A_j]_{i,m} = \alpha_m^{(ij)}$. Let $A_S$ denote the matrix made by horizontally concatenating the $A_j$ matrices with $j \in S$; i.e. $A_S = [A_{j_1} \ldots A_{j_s}]$, where $\{j_1, \ldots, j_s\} = S$ and $j_i < j_k$ for $i < k$. Suppose the following:
$$\Lambda_{\max}\big(\tfrac{1}{N}A_S^T A_S\big) \le C_{\max} < \infty, \tag{16}$$
$$\Lambda_{\min}\big(\tfrac{1}{N}A_S^T A_S\big) \ge C_{\min} > 0. \tag{17}$$
Also, suppose $\exists\,\delta \in (0,1]$ s.t. $\forall j \in S^c$:
$$\Lambda_{\max}\big(\tfrac{1}{N}A_j^T A_j\big) \le C_{\max} < \infty, \tag{18}$$
$$\big\|\tfrac{1}{N}A_j^T A_S\big(\tfrac{1}{N}A_S^T A_S\big)^{-1}\big\| \le \frac{1-\delta}{\sqrt{s}}. \tag{19}$$
Let $\bar{A}_j$ be the $N \times M_n$ matrix with entries $[\bar{A}_j]_{i,m} = \bar{\alpha}_m^{(ij)} = \tfrac{1}{n}\vec{\varphi}_m^T\vec{f}_{ij}$. Let $H_j$ be the $N \times M_n$ matrix with entries $[H_j]_{i,m} = \eta_m^{(ij)} = \tfrac{1}{n}\vec{\varphi}_m^T\vec{\xi}_{ij}$. Thus, $\tilde{A}_j = \bar{A}_j + H_j$. Furthermore, let $E_j = \bar{A}_j - A_j$; then $\tilde{A}_j = A_j + E_j + H_j$.

In addition to the aforementioned assumptions, we shall further assume a set of rate conditions tying together $N$, $n$, $p$, $s$, $M_n$, $\lambda_N$, and the smoothness $\gamma$: there is an exponent $a < 1/2$ such that the tail bound of Lemma 4 below vanishes; the minimum signal strength $\rho \equiv \min_{j \in S}\|\beta^{*(j)}\|_2$ is bounded away from zero; the leading error terms appearing in the proof of Proposition 1, namely $s^{3/2}M_n\,n^{-(\gamma+1/2)}$, $sM_n\sqrt{\log N/n}$, $s^{3/2}M_n^{1/2-\gamma}$, $\sqrt{\log(sM_n)/N}$, and $\lambda_N\sqrt{sM_n}$, are all $o(\rho)$; and $\lambda_N$ does not shrink too quickly, in that $\sqrt{M_n\log p}/(\lambda_N\sqrt{N}) \to 0$ and $\sqrt{s}\,M_n^{1/2-\gamma}/\lambda_N \to 0$.

We may further simplify these assumptions if we take $n \asymp \sqrt{N}$ and choose $M_n$ optimally for function estimation, $M_n \asymp n^{1/(2\gamma+1)} = N^{1/(4\gamma+2)}$, and furthermore take $s = O(1)$, $\rho = \Theta(1)$, and $\gamma$ fixed. Under these conditions, the assumptions reduce to a constraint bounding the grid exponent $a$ below (and above by $1/2$), together with a handful of explicit rates in $p$, $N$, and $\lambda_N$ going to zero; in particular, $\lambda_N$ must shrink to zero, but slowly enough that terms of order $\sqrt{\log p}/(\lambda_N\sqrt{N})$ still vanish.

4.2 Sparsistency

Theorem 1. $\Pr(\hat{S} = S) \to 1$.

First, we state some lemmas, whose proofs may be found in the supplementary materials.

4.2.1 Lemmata

Lemma 1. Let $X$ be a non-negative r.v. and $C$ a measurable event; then $\mathbb{E}[X \mid C]\,\Pr(C) \le \mathbb{E}[X]$.

Lemma 2. $\frac{1}{n}\sum_{k=1}^n \varphi_m(k/n)\,\varphi_l(k/n) = \mathbb{1}\{l = m\}$, for $1 \le l, m \le n$.

Lemma 3. The rows $H_j^{(i)}$ of $H_j$ satisfy $H_j^{(i)} \sim \mathcal{N}(0, \tfrac{\sigma_\xi^2}{n}I)$; likewise the rows $H_S^{(i)}$ of $H_S$ satisfy $H_S^{(i)} \sim \mathcal{N}(0, \tfrac{\sigma_\xi^2}{n}I)$.

Lemma 4. $\Pr\big(\|H\|_{\max} \ge n^{a-1}\big) \le 2NpM_n\,\sigma_\xi\,n^{1/2-a}\,e^{-n^{2a-1}/(2\sigma_\xi^2)}$.

Lemma 5. $\|E\|_{\max} \le C_Q\,n^{-(\gamma+1/2)}$, where $C_Q$ is a constant depending on $Q$.

Lemma 6. $\|\beta^*_S\|_2 \le \sqrt{Qs}$.

Lemma 7. There exist $N_0$, $n_0$, constants $0 < C_{\min} \le C_{\max} < \infty$, and $0 < \delta \le 1$ such that, if $\|H\|_{\max} < n^{a-1}$, $N > N_0$, and $n > n_0$, then
$$\Lambda_{\max}\big(\tfrac{1}{N}\tilde{A}_S^T\tilde{A}_S\big) \le C_{\max} < \infty, \tag{29}$$
$$\Lambda_{\min}\big(\tfrac{1}{N}\tilde{A}_S^T\tilde{A}_S\big) \ge C_{\min} > 0, \tag{30}$$
$$\forall j \in S^c:\quad \big\|\tfrac{1}{N}\tilde{A}_j^T\tilde{A}_S\big(\tfrac{1}{N}\tilde{A}_S^T\tilde{A}_S\big)^{-1}\big\| \le \frac{1-\delta}{\sqrt{s}}. \tag{31}$$
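Before the proof, it is convenient to record the stationarity (KKT) conditions of the convex problem (15) in the notation above; this is the standard Group-LASSO optimality condition, restated here rather than quoted from the paper, and it is exactly what the witness argument below verifies block by block:
$$\frac{1}{N}\,\tilde{A}_j^T\Big(\sum_{k=1}^{p}\tilde{A}_k\hat{\beta}^{(k)} - Y\Big) + \lambda_N\,\hat{u}^{(j)} = 0 \ \text{ for all } j, \qquad \hat{u}^{(j)} = \frac{\hat{\beta}^{(j)}}{\|\hat{\beta}^{(j)}\|_2} \text{ if } \hat{\beta}^{(j)} \ne 0, \quad \|\hat{u}^{(j)}\|_2 \le 1 \text{ otherwise}.$$
A pair $(\hat{\beta}, \hat{u})$ with $\hat{\beta}_{S^c} = 0$ that satisfies these conditions, with $\|\hat{u}^{(j)}\|_2 < 1$ strictly for $j \in S^c$, certifies an optimal solution with the desired support.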

4.2.2 Proof of Theorem 1

Proposition 1. $\Pr\big(\forall j \in S : \hat{\beta}^{(j)} \ne 0\big) \to 1$.

Proof. Recall that, by assumption, $\rho \equiv \min_{j \in S}\|\beta^{*(j)}\|_2 > 0$. Thus, to prove that $\hat{\beta}^{(j)} \ne 0$ for all $j \in S$, it suffices to show that $\|\hat{\beta}_S - \beta^*_S\| < \rho$; to do so we show $\Pr(\|\hat{\beta}_S - \beta^*_S\| > \rho) \to 0$. Let $B$ be the event that $\|H\|_{\max} < n^{a-1}$. Note that
$$\Pr\big(\|\hat{\beta}_S - \beta^*_S\| > \rho\big) \le \Pr\big(\|\hat{\beta}_S - \beta^*_S\| > \rho \,\big|\, B\big)\Pr(B) + \Pr(B^c),$$
and furthermore $\Pr(\|\hat{\beta}_S - \beta^*_S\| > \rho \mid B) \le \rho^{-1}\,\mathbb{E}\big[\|\hat{\beta}_S - \beta^*_S\| \,\big|\, B\big]$. Looking at the stationarity condition on the support $S$:
$$\frac{1}{N}\tilde{A}_S^T\big(\tilde{A}_S\hat{\beta}_S - Y\big) + \lambda_N\hat{u}_S = 0. \tag{32}$$

Let $V$ be the vector with entries $V_i = \sum_{j \in S}\sum_{m=M_n+1}^{\infty}\alpha_m^{(ij)}\beta_m^{*(j)}$, i.e. the error from truncation, so that $Y = A_S\beta^*_S + V + \epsilon$. Substituting into (32), using $\tilde{A}_S = A_S + E_S + H_S$, and rearranging,
$$\frac{1}{N}\tilde{A}_S^T\tilde{A}_S\big(\hat{\beta}_S - \beta^*_S\big) = -\frac{1}{N}\tilde{A}_S^T\big(E_S + H_S\big)\beta^*_S + \frac{1}{N}\tilde{A}_S^T V + \frac{1}{N}\tilde{A}_S^T\epsilon - \lambda_N\hat{u}_S. \tag{33}$$
Letting $\tilde{\Sigma}_{SS} \equiv \frac{1}{N}\tilde{A}_S^T\tilde{A}_S$, we see that
$$\hat{\beta}_S - \beta^*_S = -\tilde{\Sigma}_{SS}^{-1}\tfrac{1}{N}\tilde{A}_S^T\big(E_S + H_S\big)\beta^*_S + \tilde{\Sigma}_{SS}^{-1}\tfrac{1}{N}\tilde{A}_S^T V + \tilde{\Sigma}_{SS}^{-1}\tfrac{1}{N}\tilde{A}_S^T\epsilon - \lambda_N\tilde{\Sigma}_{SS}^{-1}\hat{u}_S.$$
Thus, we proceed to bound each term on the right-hand side in expectation, conditioned on $B$.

For the first term, write $\tilde{A}_S = A_S + E_S + H_S$. On the event $B$, Lemma 7 gives $\Lambda_{\min}(\tilde{\Sigma}_{SS}) \ge C_{\min}$, so that $\|\tilde{\Sigma}_{SS}^{-1}\| \le \sqrt{sM_n}/C_{\min}$; Lemma 5 controls the discretization error through $\|E_S\|_{\max} \le C_Q\,n^{-(\gamma+1/2)}$, so that on $B$ we also have $\|E_S + H_S\|_{\max} \le C_Q\,n^{-(\gamma+1/2)} + n^{a-1}$; Lemmas 3 and 6 show that each entry of $H_S\beta^*_S$ is Gaussian with variance at most $\sigma_\xi^2 Qs/n$, so a Gaussian inequality (e.g. [13]) bounds $\mathbb{E}\|H_S\beta^*_S\|$; and Lemma 1 lets conditional expectations multiplied by $\Pr(B)$ be bounded by their unconditional counterparts.

The next terms are bounded as follows.

Lemma 8. $\mathbb{E}\big[\|\tilde{\Sigma}_{SS}^{-1}\tfrac{1}{N}\tilde{A}_S^T V\| \,\big|\, B\big]\Pr(B) = O\big(s^{3/2}/M_n^{\gamma-1/2}\big)$.

Lemma 9. $\mathbb{E}\big[\|\tilde{\Sigma}_{SS}^{-1}\tfrac{1}{N}\tilde{A}_S^T\epsilon\| \,\big|\, B\big]\Pr(B) = O\big(\sqrt{\log(sM_n)/N}\big)$.

(Proofs of Lemmas 8 and 9 may be found in the supplementary materials.) Lastly, since $\|\hat{u}_S\|_{\max} \le 1$, we have $\|\hat{u}_S\| \le \sqrt{sM_n}$ and hence, on $B$, $\lambda_N\|\tilde{\Sigma}_{SS}^{-1}\hat{u}_S\| \le \lambda_N\sqrt{sM_n}/C_{\min}$. Keeping only leading terms,
$$\mathbb{E}\big[\|\hat{\beta}_S - \beta^*_S\| \,\big|\, B\big]\Pr(B) = O\Big(s^{3/2}M_n\,n^{-(\gamma+1/2)} + sM_n\sqrt{\tfrac{\log N}{n}} + \frac{s^{3/2}}{M_n^{\gamma-1/2}} + \sqrt{\tfrac{\log(sM_n)}{N}} + \lambda_N\sqrt{sM_n}\Big).$$

Hence, by the rate assumptions of Section 4.1, $\Pr(\|\hat{\beta}_S - \beta^*_S\| > \rho) \to 0$, which proves the proposition. $\square$

One may similarly look at the stationarity conditions over $S^c$ to analyze $\hat{u}^{(j)}$ for $j \in S^c$:
$$\frac{1}{N}\tilde{A}_j^T\big(\tilde{A}_S\hat{\beta}_S - Y\big) + \lambda_N\hat{u}^{(j)} = 0,$$
so that, using $Y = A_S\beta^*_S + V + \epsilon$ and (33),
$$\lambda_N\hat{u}^{(j)} = \tilde{\Sigma}_{jS}\tilde{\Sigma}_{SS}^{-1}\Big(\tfrac{1}{N}\tilde{A}_S^T(E_S + H_S)\beta^*_S - \tfrac{1}{N}\tilde{A}_S^T V - \tfrac{1}{N}\tilde{A}_S^T\epsilon + \lambda_N\hat{u}_S\Big) - \tfrac{1}{N}\tilde{A}_j^T(E_S + H_S)\beta^*_S + \tfrac{1}{N}\tilde{A}_j^T(V + \epsilon),$$
where $\tilde{\Sigma}_{jS} \equiv \frac{1}{N}\tilde{A}_j^T\tilde{A}_S$. We wish to show that $\hat{u}$ over $S^c$ satisfies the KKT conditions; that is:

Proposition 2. $\Pr\big(\max_{j \in S^c}\|\hat{u}^{(j)}\| < 1\big) \to 1$.

Proof. Let $\mu_H^{(j)} \equiv \mathbb{E}[\hat{u}^{(j)} \mid H]$. We proceed as follows:
$$\Pr\Big(\max_{j \in S^c}\|\hat{u}^{(j)}\| \ge 1\Big) \le \Pr\Big(\max_{j \in S^c}\|\mu_H^{(j)}\| \ge 1 - \delta\Big) + \Pr\Big(\max_{j \in S^c}\|\hat{u}^{(j)} - \mu_H^{(j)}\| \ge \delta\Big).$$
We obtain the following results:

Lemma 10. $\Pr\big(\max_{j \in S^c}\|\mu_H^{(j)}\| \ge 1 - \delta\big) \to 0$.

Lemma 11. $\Pr\big(\max_{j \in S^c}\|\hat{u}^{(j)} - \mu_H^{(j)}\| \ge \delta\big) \to 0$.

Hence, we have that $\Pr\big(\max_{j \in S^c}\|\hat{u}^{(j)}\| < 1\big) \to 1$. $\square$

5 Experiments

5.1 Synthetic Data

We tested the FuSSO on synthetic data-sets $D = \{(\{\vec{y}_{ij}\}_{j=1}^p, Y_i)\}_{i=1}^N$, where $\vec{y}_{ij}$ is as in (8). The experiments were performed as follows. First, we fix $N$, $n$, $p$, and $s$. For $i = 1, \ldots, N$ and $j = 1, \ldots, p$ we create random functions using a maximum of $M$ projection coefficients as follows: (1) set $a_m \sim \mathrm{Unif}(-1, 1)$ for $m = 1, \ldots, M$; (2) set $a_m = a_m/c_m$, where $c_m = m^\gamma$ if $m = 1$ or $m$ is even, and $c_m = (m-1)^\gamma$ if $m$ is odd (matching the ellipsoid weights of Section 4.1); (3) set $a = a/\|a\|$; (4) set $\alpha^{(ij)} = a$. See Figures 2a and 2c for typical functions. Similarly, we generate $\beta^{(j)}$ for $j = 1, \ldots, s$; for $j = s+1, \ldots, p$, we set $\beta^{(j)} = 0$. Then, we generate $Y_i$ as $Y_i = \sum_{j=1}^p\langle\beta^{(j)}, \alpha^{(ij)}\rangle + \epsilon_i = \sum_{j=1}^s\langle\beta^{(j)}, \alpha^{(ij)}\rangle + \epsilon_i$, where $\epsilon_i$ is zero-mean Gaussian noise. Also, a grid of $n$ noisy function evaluations was generated to make $\vec{y}_{ij}$ as in (8), with noise level $\sigma_\xi$. These were then used to compute $\tilde{\alpha}_m^{(ij)}$ for $m = 1, \ldots, M_n$ as in (11); $M_n$ was chosen by cross-validation. See Figures 2a and 2c for typical noisy observations and function estimates at the smaller and larger grid sizes, respectively.

We fixed $s = 5$ and varied $p$, $N$, and the grid size $n$ over several configurations; for each $(p, N, n)$ configuration, multiple random trials were performed. We recorded $r$, the fraction of trials in which some value of $\lambda_N$ recovered the correct sparsity pattern (i.e., that only the first 5 functions are in the support). We also recorded the mean length of the range of $\lambda_N$ values that recovered the correct support, i.e. $\bar{t} = \frac{1}{T}\sum_t (t_f^{(t)} - t_l^{(t)})/t_{\max}^{(t)}$, where $t_f^{(t)}$ is the largest $\lambda_N$ found to recover the correct support in the $t$th trial, $t_l^{(t)}$ is the smallest such $\lambda_N$, and $t_{\max}^{(t)}$ is the smallest $\lambda_N$ producing $\hat{\beta} = 0$ ($\bar{t}$ is taken to be zero for a trial if no $\lambda_N$ recovered the correct support). The results showed reliable support recovery across the three configurations tested, with $\bar{t}$ values of approximately .68, .477, and .479. Hence we see that even when the number of observations per function is small and the total number of input functional covariates is large, the FuSSO can recover the correct support.

Also, to illustrate the point that running the Group-LASSO directly on the $\vec{y}_{ij}$ features (Y-GL) is less robust to noise and less adaptive to smoothness, we ran noisier trials using one of the configurations above, increasing the standard deviations of the noise on the grid function observations and on the response. Under these conditions the FuSSO was able to recover the support in 49% of the trials, whereas Y-GL recovered the support in 3% of the trials. Furthermore, the FuSSO had a $\bar{t}$ of .743, compared to a $\bar{t}$ of .54 for Y-GL.
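The data-generating recipe above is easy to reproduce; the sketch below follows steps (1)-(4) for the projection coefficients, with the basis truncation `M`, the decay exponent `gamma`, and the noise levels treated as illustrative assumptions since their exact values are not restated here.

```python
import numpy as np

# Synthetic data in the spirit of Section 5.1: random smooth functions via
# decaying, normalized projection coefficients, a sparse set of regressors
# beta_j, responses Y_i, and noisy grid observations. M, gamma, and the noise
# levels are illustrative assumptions, not the paper's exact settings.

rng = np.random.default_rng(7)
N, p, s, n, M, gamma = 200, 100, 5, 100, 20, 2.0

def c(m, gamma):
    """Decay weights: c_m = m^gamma if m == 1 or m is even, else (m - 1)^gamma."""
    return float(m) ** gamma if (m == 1 or m % 2 == 0) else float(m - 1) ** gamma

cm = np.array([c(m, gamma) for m in range(1, M + 1)])

def random_coeffs(size):
    """Steps (1)-(3): uniform draws, divided by c_m, then normalized."""
    a = rng.uniform(-1.0, 1.0, size=size + (M,)) / cm
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

alpha = random_coeffs((N, p))                      # step (4): alpha^(ij) = a
beta = np.zeros((p, M))
beta[:s] = random_coeffs((s,))                     # only the first s regressors are non-zero

Y = np.einsum('ijm,jm->i', alpha, beta) + 0.1 * rng.standard_normal(N)

# Noisy grid observations as in (8), using the trigonometric basis on [0, 1].
t = np.arange(1, n + 1) / n
def phi(m, x):
    if m == 1:
        return np.ones_like(x)
    k = m // 2
    return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if m % 2 == 0 else np.sin(2 * np.pi * k * x))

Phi = np.stack([phi(m, t) for m in range(1, M + 1)])            # M x n basis evaluations
y_obs = alpha @ Phi + 0.1 * rng.standard_normal((N, p, n))      # y_ij = f_ij(grid) + noise
```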

Figure 2: (a, c) Two typical functions with their noisy observations and estimates, at the smaller and larger grid size $n$, respectively. (b, d) Regularization paths showing the norms $\|\hat{\beta}^{(j)}\|$ (red for $j$ in the support, blue otherwise) over a range of $\lambda_N$; the rightmost vertical line indicates the largest $\lambda_N$ able to recover the support, the leftmost line the smallest such $\lambda_N$.

5.2 Neurological Data

We also tested the FuSSO estimator on a neurological data-set, using a total of 89 subjects [14]. Subjects spanned a broad range of ages (see Figure 3b for the age histogram). Our goal was to learn a regression that maps the dODFs at each white-matter voxel for a subject to the subject's age. The dODF is a function that represents the amount of water molecules, or spins, undergoing diffusion in different orientations over the $S^2$ sphere [15]; i.e., each dODF is a function with a 2-d domain of (azimuth, elevation) spherical coordinates and a range of reals representing the strength of water diffusion at the given orientation (see Figure 3a). Data was provided for each subject in a template space for white-matter voxels; a total of over 5 thousand voxel dODFs were regressed on.

We also compared regression using the FuSSO and functional covariates to using the LASSO and real-valued covariates. We used the non-functional collection of quantitative anisotropy (QA) values for the same white-matter voxels as with the dODF functions. QA values are the estimated amount of spins that undergo diffusion in the direction of the principal fiber orientation, i.e., the peak of the dODF; QAs have been used as a measure of white-matter integrity in the underlying voxel, hence making for a descriptive and effective summary statistic of a dODF function for age regression [15]. The projection coefficients for the dODFs at each voxel were estimated using the cosine basis. The FuSSO estimator gave a cross-validated MSE of 7.855, where the variance of the ages was larger; selected voxels in the support may be seen in Figure 3c, and a histogram of held-out absolute errors in Figure 3d. The LASSO estimate using QA values gave a larger cross-validated MSE. Thus, one may see that considering the entire functional data gave better results for age regression. We note that we were unable to use the naive approach of Y-GL in this case because of memory constraints and the fact that the function evaluation points did not lie on a 2-d square grid.

Figure 3: (a) An example ODF for a voxel. (b) Histogram of ages for the subjects. (c) Voxels in the support of the model, shown in blue. (d) Histogram of held-out error magnitudes.

6 Conclusion

In conclusion, this paper presents the FuSSO, a functional analogue to the LASSO. The FuSSO allows one to efficiently find a sparse set of functional input covariates to regress a real-valued response against. The FuSSO makes no parametric assumptions about the nature of the input functional covariates and assumes a linear form for the mapping of functional covariates to the response. We provide a statistical backing for use of the FuSSO via a proof of asymptotic sparsistency.

Acknowledgements

This work is supported in part by NSF grants IIS47658 and IIS535.

References

[1] Rajendra Bhatia. Matrix Analysis, volume 169. Springer, 1997.
[2] F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer, 2006.
[3] Gareth M. James, Jing Wang, and Ji Zhu. Functional linear regression that's interpretable. The Annals of Statistics, pages 2083-2108, 2009.
[4] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer, 1991.
[5] Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374-393, 2007.
[6] Nicola Mingotti, Rosa E. Lillo, and Juan Romo. Lasso variable selection in functional regression.
[7] Junier B. Oliva, Barnabás Póczos, and Jeff Schneider. Distribution to distribution regression. In International Conference on Machine Learning (ICML), 2013.
[8] B. Póczos, A. Rinaldo, A. Singh, and L. Wasserman. Distribution-free distribution regression. AISTATS, 2013.
[9] Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009-1030, 2009.
[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[11] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
[12] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. arXiv preprint math/0605740, 2006.
[13] Larry Wasserman. Probability inequalities. Lecture notes, August.
[14] Fang-Cheng Yeh and Wen-Yih Isaac Tseng. NTU-90: a high angular resolution brain atlas constructed by q-space diffeomorphic reconstruction. NeuroImage, 58(1):91-99, 2011.
[15] Fang-Cheng Yeh, Van Jay Wedeen, Wen-Yih Isaac Tseng, et al. Generalized q-sampling imaging. IEEE Transactions on Medical Imaging, 29(9):1626-1635, 2010.
[16] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.
[17] Yihong Zhao, R. Todd Ogden, and Philip T. Reiss. Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics, 21(3):600-617, 2012.
