Sparse Functional Regression


Junier B. Oliva, Barnabás Póczos, Aarti Singh, Jeff Schneider, Timothy Verstynen
Machine Learning Department, Robotics Institute, Psychology Department
Carnegie Mellon University, Pittsburgh, PA 15213

1 Introduction

There are a multitude of applications and domains where it is of interest to study a mapping that takes in a functional input and outputs a real value. That is, if $\mathcal{I}$ is some class of input functions with domain $\mathbb{R}$ and range $\mathbb{R}$, then one may be interested in a mapping $h : \mathcal{I} \to \mathbb{R}$, $h(f) = Y$ (Figure 1(a)). Examples include: a mapping that takes in the time-series of a commodity's past price ($f$ is a function with domain time and range price) and outputs the expected price of the commodity in the near future; or a mapping that takes a patient's cardiac-monitor time-series and outputs a health index. Recently, work by [5] has explored this type of regression problem when the input function is a distribution. Furthermore, the general case of an arbitrary functional input is related to functional data analysis [1].

However, the response one is interested in regressing often depends not on just one, but on many functions. That is, it may be fruitful to consider a mapping $h : \mathcal{I}_1 \times \cdots \times \mathcal{I}_p \to \mathbb{R}$, $h(f_1, \ldots, f_p) = Y$ (Figure 1(b)). For instance, this is likely the case when regressing the future price of a commodity, since the future price depends not only on the history of its own price, but also on the histories of other commodities' prices. A response's dependence on multiple functional covariates is especially common in neurological data, where thousands of voxels in the brain may each contain a corresponding function. In fact, in such domains it is not uncommon for the number of input functional covariates to far exceed the number of training instances in a data-set. Thus, it would be beneficial to have an estimator that is sparse in the number of functional covariates used to regress the response. That is, find an estimate $\hat{h}$ that depends on a small subset $\{i_1, \ldots, i_S\} \subset \{1, \ldots, p\}$, such that $\hat{h}(f_1, \ldots, f_p) = \hat{h}_S(f_{i_1}, \ldots, f_{i_S})$ (Figure 1(c)).

Here we present a semi-parametric estimator to perform sparse regression with multiple input functional covariates and a real-valued response: FuSSO, the Functional Shrinkage and Selection Operator. No parametric assumptions are made on the nature of the input functions. We shall assume that the response is the result of a sparse set of linear combinations of the input functions and other non-parametric functions $\{g_j\}$: $Y = \sum_j \langle f_j, g_j \rangle + \epsilon$. The resulting method is a LASSO-like [7] estimator that effectively zeros out entire functions from consideration in regressing the response. The estimator was found to be effective in regressing the age of a subject given orientation distribution function (ODF) data for the subject's white matter.

2 Related Work

As previously mentioned, [5] recently explored regression with a mapping that takes in a probability density function and outputs a real value. Furthermore, [4] studies the case when both the inputs and outputs are distributions. In addition, functional data analysis concerns the study of such functional data [1]. In all of these works, the mappings studied take in only one functional covariate. However, it is not immediately evident how to expand on these ideas to develop an estimator that simultaneously performs regression and feature selection with multiple functional covariates.

[Figure 1 diagrams: (a) Single Functional Covariate; (b) Multiple Functional Covariates; (c) Sparse Model.]

Figure 1: (a) Model where a mapping takes in a function $f$ and produces a real value $Y$. (b) Model where the response $Y$ depends on multiple input functions $f_1, \ldots, f_p$. (c) Sparse model where the response $Y$ depends on a sparse subset of the input functions $f_1, \ldots, f_p$.

To our knowledge, there has been no prior work studying sparse mappings that take multiple functional inputs and produce a real-valued output. LASSO-like regression estimators that work with functional data include the following. In [3], one has a functional output and several real-valued covariates; the estimator finds a sparse set of functions to scale by the real-valued covariates to produce a functional response. Also, [10, 2] study the case when one has one functional covariate $f$ and one real-valued response that is linearly dependent on $f$ and some function $g$: $Y = \langle f, g \rangle = \int fg$. In [10], the estimator searches for sparsity across wavelet-basis projection coefficients. In [2], sparsity is achieved in the time (input) domain of the $d$-th derivative of $g$; i.e., $[D^d g](t) = 0$ for many values of $t$, where $D^d$ is the differential operator. Hence, roughly speaking, [10] and [2] look for sparsity across the frequency and time domains, respectively, for the regressing function $g$. However, these methods do not consider the case where one has many input functional covariates $\{f_1, \ldots, f_p\}$ and needs to choose amongst them. That is, [10, 2] do not provide a method to select among functional covariates in a fashion analogous to how the LASSO selects among real-valued covariates.

Lastly, it is worth noting that our estimator will have an additive linear model, $\sum_j \langle f_j, g_j \rangle$, where we search for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function. Such a task is similar in nature to the SpAM estimator [6], in which one also has an additive model $\sum_j g_j(X_j)$ (over the dimensions of a real vector $X$) and searches for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function. Note, though, that in the SpAM model the $\{g_j\}$ functions are applied to real covariates via function evaluation, whereas in the FuSSO model the $\{g_j\}$ are applied to functional covariates via an inner product; that is, FuSSO works over functional, not real-valued, covariates, unlike SpAM.

3 Model

In order to better understand FuSSO's model, we draw several analogies to real-valued linear regression and the Group-LASSO [9]. First, consider a model for typical real-valued linear regression with a data-set of input-output pairs $\{(X_i, Y_i)\}_{i=1}^N$:
$$Y_i = \langle X_i, w \rangle + \epsilon_i,$$
where $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$, $w \in \mathbb{R}^d$, $\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, and $\langle X_i, w \rangle = X_i^T w$. If, instead, one were working with functional data $\{(f^{(i)}, Y_i)\}_{i=1}^N$, where $f^{(i)} : [0,1] \to \mathbb{R}$ and $f^{(i)} \in L_2[0,1]$, one might similarly consider a linear model:
$$Y_i = \langle f^{(i)}, g \rangle + \epsilon_i, \quad \text{where } g : [0,1] \to \mathbb{R} \text{ and } \langle f^{(i)}, g \rangle = \int_0^1 f^{(i)}(t)\, g(t)\, dt.$$
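Before introducing basis expansions, it may help to note (our illustration, not part of the paper) that once functions are sampled on a grid, the inner product $\langle f, g \rangle$ is approximated by a Riemann sum, so the functional linear model reduces to an ordinary linear model on grid values. A minimal numpy sketch, with both functions chosen arbitrarily:

```python
import numpy as np

# Hypothetical illustration: on the grid 1/n, 2/n, ..., 1, the inner product
# <f, g> = int_0^1 f(t) g(t) dt becomes a scaled Euclidean inner product.
n = 500
t = np.arange(1, n + 1) / n
f = np.sin(2 * np.pi * t)        # an example input function f
g = np.exp(-t)                   # an example regressing function g

inner = np.mean(f * g)           # (1/n) * sum_t f(t) g(t), approximating <f, g>
print(inner)
```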

If $\{\varphi_m\}_{m=1}^\infty$ is an orthonormal basis for $L_2[0,1]$ [8], then we have that
$$f^{(i)}(x) = \sum_{m=1}^\infty \alpha_m^{(i)} \varphi_m(x), \quad \text{where } \alpha_m^{(i)} = \int_0^1 f^{(i)}(t)\, \varphi_m(t)\, dt. \quad (1)$$
Similarly, $g(x) = \sum_m \beta_m \varphi_m(x)$. Thus,
$$Y_i = \langle f^{(i)}, g \rangle + \epsilon_i = \Big\langle \sum_{m=1}^\infty \alpha_m^{(i)} \varphi_m, \sum_{k=1}^\infty \beta_k \varphi_k \Big\rangle + \epsilon_i = \sum_{m=1}^\infty \sum_{k=1}^\infty \alpha_m^{(i)} \beta_k \langle \varphi_m, \varphi_k \rangle + \epsilon_i = \sum_{m=1}^\infty \alpha_m^{(i)} \beta_m + \epsilon_i,$$
where the last step follows from the orthonormality of $\{\varphi_m\}$.

Going back to the real-valued covariate case, if instead of having one feature vector per data instance, $X_i \in \mathbb{R}^d$, one had $p$ feature vectors associated with each data instance, $\{X_{ij} : 1 \le j \le p,\ X_{ij} \in \mathbb{R}^d\}$, an additive linear model could be used for regression:
$$Y_i = \sum_{j=1}^p \langle X_{ij}, w_j \rangle + \epsilon_i, \quad \text{where } w_1, \ldots, w_p \in \mathbb{R}^d.$$
Similarly, in the functional case, one may have $p$ functions associated with data instance $i$: $\{f_j^{(i)} : 1 \le j \le p,\ f_j^{(i)} \in L_2[0,1]\}$. Then, an additive linear model would be:
$$Y_i = \sum_{j=1}^p \langle f_j^{(i)}, g_j \rangle + \epsilon_i = \sum_{j=1}^p \sum_{m=1}^\infty \alpha_{jm}^{(i)} \beta_{jm} + \epsilon_i, \quad (2)$$
where $g_1, \ldots, g_p \in L_2[0,1]$, and $\alpha_{jm}^{(i)}$ and $\beta_{jm}$ are projection coefficients.

Suppose that one has few observations relative to the number of features ($N \ll p$). In the real-valued case, in order to effectively find a solution for $w = (w_1^T, \ldots, w_p^T)^T$ one may search for a group-sparse solution in which many $w_j = 0$. To do so, one may consider the following Group-LASSO regression:
$$w^\star = \operatorname*{argmin}_w \frac{1}{N} \Big\| Y - \sum_{j=1}^p X_j w_j \Big\|^2 + \lambda_N \sum_{j=1}^p \| w_j \|, \quad (3)$$
where $X_j$ is the $N \times d$ matrix $X_j = [X_{1j} \ldots X_{Nj}]^T$, $Y = (Y_1, \ldots, Y_N)^T$, and $\|\cdot\|$ is the Euclidean norm. If in the functional case (2) one also has $N \ll p$, one may set up a similar optimization to (3), whose direct analogue is:
$$g^\star = \operatorname*{argmin}_g \frac{1}{N} \sum_{i=1}^N \Big( Y_i - \sum_{j=1}^p \langle f_j^{(i)}, g_j \rangle \Big)^2 + \lambda_N \sum_{j=1}^p \| g_j \|; \quad (4)$$
equivalently,
$$\beta^\star = \operatorname*{argmin}_\beta \frac{1}{N} \sum_{i=1}^N \Big( Y_i - \sum_{j=1}^p \sum_{m=1}^\infty \alpha_{jm}^{(i)} \beta_{jm} \Big)^2 + \lambda_N \sum_{j=1}^p \sqrt{\sum_{m=1}^\infty \beta_{jm}^2}, \quad (5)$$
where $g^\star = \{g_j^\star\}_{j=1}^p = \{\sum_m \beta_{jm}^\star \varphi_m\}_{j=1}^p$.

However, it is intractable to assume that one is able to directly observe the functional inputs $\{f_j^{(i)} : 1 \le i \le N,\ 1 \le j \le p\}$. Thus, we shall instead assume that one observes $\{\vec{y}_j^{(i)} : 1 \le i \le N,\ 1 \le j \le p\}$, where
$$\vec{y}_j^{(i)} = \vec{f}_j^{(i)} + \xi_j^{(i)}, \quad \vec{f}_j^{(i)} = \big( f_j^{(i)}(1/n), f_j^{(i)}(2/n), \ldots, f_j^{(i)}(1) \big)^T, \quad \xi_j^{(i)} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2 I). \quad (6)$$
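The observation model (6) and the truncated estimates that follow are easy to simulate. A short numpy sketch (our own illustration; the true function and noise level are hypothetical, while the cosine basis $\varphi_1(t) = 1$, $\varphi_{m+1}(t) = \sqrt{2}\cos(\pi m t)$ matches the basis used in the experiments of Section 4):

```python
import numpy as np

def cosine_basis(n, M):
    """Rows: phi_1, ..., phi_M evaluated on the grid 1/n, 2/n, ..., 1."""
    t = np.arange(1, n + 1) / n
    rows = [np.ones(n)] + [np.sqrt(2) * np.cos(np.pi * m * t) for m in range(1, M)]
    return np.vstack(rows)                      # shape (M, n)

n, M_n = 1000, 20                               # grid size n, truncation level M_n
rng = np.random.default_rng(0)
t = np.arange(1, n + 1) / n
f = t ** 2 + np.cos(3 * np.pi * t)              # hypothetical true input function
y = f + 0.1 * rng.standard_normal(n)            # noisy grid observations, as in (6)

Phi = cosine_basis(n, M_n)
alpha_tilde = Phi @ y / n                       # alpha_tilde_m = (1/n) phi_m^T y
f_hat = alpha_tilde @ Phi                       # truncated estimate, as in (7) below
print(np.mean((f_hat - f) ** 2))                # mean-squared reconstruction error
```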

Thus, we observe a grid of $n$ noisy values for each functional input, and one may estimate the projection coefficients as:
$$\tilde{\alpha}_{jm}^{(i)} = \frac{1}{n} \vec{\varphi}_m^T \vec{y}_j^{(i)} = \frac{1}{n} \vec{\varphi}_m^T \big( \vec{f}_j^{(i)} + \xi_j^{(i)} \big),$$
where $\vec{\varphi}_m = (\varphi_m(1/n), \varphi_m(2/n), \ldots, \varphi_m(1))^T$. Furthermore, we may truncate the number of basis functions used to express $f_j^{(i)}$ to $M_n$, estimating it as:
$$\hat{f}_j^{(i)}(x) = \sum_{m=1}^{M_n} \tilde{\alpha}_{jm}^{(i)} \varphi_m(x). \quad (7)$$
Using the truncated estimate (7), one has:
$$\langle \hat{f}_j^{(i)}, g_j \rangle = \sum_{m=1}^{M_n} \tilde{\alpha}_{jm}^{(i)} \beta_{jm} \quad \text{and} \quad \| \hat{f}_j^{(i)} \|^2 = \sum_{m=1}^{M_n} \big( \tilde{\alpha}_{jm}^{(i)} \big)^2.$$
Hence, using the approximations (7), (5) becomes:
$$\hat{\beta} = \operatorname*{argmin}_\beta \frac{1}{N} \sum_{i=1}^N \Big( Y_i - \sum_{j=1}^p \sum_{m=1}^{M_n} \tilde{\alpha}_{jm}^{(i)} \beta_{jm} \Big)^2 + \lambda_N \sum_{j=1}^p \sqrt{\sum_{m=1}^{M_n} \beta_{jm}^2} \quad (8)$$
$$= \operatorname*{argmin}_\beta \frac{1}{N} \Big\| Y - \sum_{j=1}^p \tilde{A}_j \beta_j \Big\|^2 + \lambda_N \sum_{j=1}^p \| \beta_j \|, \quad (9)$$
where $\tilde{A}_j$ is the $N \times M_n$ matrix with values $\tilde{A}_j(i, m) = \tilde{\alpha}_{jm}^{(i)}$ and $\beta_j = (\beta_{j1}, \ldots, \beta_{jM_n})^T$. Note that one need not consider projection coefficients for $m > M_n$: such coefficients will not decrease the MSE term in (8) (because $\tilde{\alpha}_{jm}^{(i)} = 0$ for $m > M_n$), while $\beta_{jm} \ne 0$ for $m > M_n$ would increase the norm penalty term in (8). Hence, we see that our sparse functional estimates are given by a Group-LASSO problem on the projection coefficients, so standard group-lasso solvers apply directly (see the sketch following Section 4). In a future publication, we shall show that if $\{f_j^{(i)}\}$ and $\{g_j\}$ are in a Sobolev function class and some other mild assumptions hold, then our estimator is asymptotically sparsistent.

4 Experiments

We tested the FuSSO estimator on neurological data consisting of 89 total subjects. Orientation distribution function (ODF) data (Figure 2(a)) was provided for each subject in a template space for white-matter voxels; the ODFs of over 5 thousand voxels in total were regressed on. We looked to regress a subject's age given his/her respective ODF data. The projection coefficients for the ODFs at each voxel were estimated using the cosine basis. The FuSSO estimator gave a held-out MSE of 7.855, where the variance for age was 56.465.

[Figure 2 panels here.]

Figure 2: (a) An example ODF for a voxel. (b) Histogram of ages for subjects. (c) Voxels in the support of the model shown in blue. (d) Histogram of held-out error magnitudes.
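As noted after (9), the estimator is a standard Group-LASSO problem on the stacked coefficient matrices $\tilde{A}_j$, so generic solvers apply. Below is a minimal proximal-gradient sketch with block soft-thresholding, written by us as an illustration under our own naming; it is not the authors' implementation:

```python
import numpy as np

def group_lasso(A_list, Y, lam, n_iter=1000):
    """Minimize (1/N) || Y - sum_j A_j beta_j ||^2 + lam * sum_j || beta_j ||
    by proximal gradient descent with block soft-thresholding.
    A_list: p matrices of shape (N, M_n) holding projection coefficients."""
    A = np.hstack(A_list)                        # (N, p * M_n)
    N = Y.shape[0]
    p, M = len(A_list), A_list[0].shape[1]
    beta = np.zeros(p * M)
    L = 2.0 * np.linalg.norm(A, 2) ** 2 / N      # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ beta - Y) / N    # gradient of the squared-error term
        z = beta - grad / L
        for j in range(p):                       # block soft-threshold each group
            b = z[j * M:(j + 1) * M]
            nrm = np.linalg.norm(b)
            shrink = max(0.0, 1.0 - lam / (L * nrm)) if nrm > 0 else 0.0
            beta[j * M:(j + 1) * M] = shrink * b
    return beta.reshape(p, M)                    # row j holds beta_j

# Hypothetical usage: 3 functional covariates, only the first is relevant.
rng = np.random.default_rng(1)
A_list = [rng.standard_normal((50, 10)) for _ in range(3)]
Y = A_list[0] @ rng.standard_normal(10) + 0.1 * rng.standard_normal(50)
B = group_lasso(A_list, Y, lam=0.5)
print(np.linalg.norm(B, axis=1))                 # group norms; irrelevant groups shrink toward zero
```

Block soft-thresholding is the proximal operator of the group penalty $\lambda_N \sum_j \|\beta_j\|$, which is what zeros out entire functional covariates at once.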

References

[1] F. Ferraty and P. Vieu. Nonparametric functional data analysis: theory and practice. Springer, 2006.

[2] Gareth M. James, Jing Wang, and Ji Zhu. Functional linear regression that's interpretable. The Annals of Statistics, pages 2083–2108, 2009.

[3] Nicola Mingotti, Rosa E. Lillo, and Juan Romo. Lasso variable selection in functional regression. 2013.

[4] Junier B. Oliva, Barnabás Póczos, and Jeff Schneider. Distribution to distribution regression.

[5] B. Poczos, A. Rinaldo, A. Singh, and L. Wasserman. Distribution-Free Distribution Regression. arXiv preprint arXiv:1302.0082, 2013.

[6] Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.

[7] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[8] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer, 2008.

[9] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[10] Yihong Zhao, R. Todd Ogden, and Philip T. Reiss. Wavelet-based lasso in functional linear regression. Journal of Computational and Graphical Statistics, 21(3):600–617, 2012.