Efficient Sampling for Learning Sparse Additive Models in High Dimensions

Hemant Tyagi, ETH Zürich; Andreas Krause, ETH Zürich; Bernd Gärtner, ETH Zürich

Abstract

We consider the problem of learning sparse additive models, i.e., functions of the form f(x) = Σ_{ℓ∈S} φ_ℓ(x_ℓ), x ∈ R^d, from point queries of f. Here S is an unknown subset of coordinate variables with |S| = k ≪ d. Assuming the φ_ℓ's to be smooth, we propose a set of points at which to sample f and an efficient randomized algorithm that recovers a uniform approximation to each unknown φ_ℓ. We provide a rigorous theoretical analysis of our scheme along with sample complexity bounds. Our algorithm utilizes recent results from compressive sensing theory along with a novel convex quadratic program for recovering robust uniform approximations to univariate functions from point queries corrupted with arbitrary bounded noise. Lastly, we theoretically analyze the impact of noise (either arbitrary but bounded, or stochastic) on the performance of our algorithm.

1 Introduction

Several problems in science and engineering require estimating a real-valued, non-linear (and often non-convex) function f defined on a compact subset of R^d in high dimensions. This challenge arises, e.g., when characterizing complex engineered or natural (e.g., biological) systems [1, 2, 3]. The numerical solution of such problems involves learning the unknown f from point evaluations (x_i, f(x_i))_{i=1}^n. Unfortunately, if the only assumption on f is mere smoothness, then the problem is in general intractable. For instance, it is well known [4] that if f is C^s-smooth then n = Ω((1/δ)^{d/s}) samples are needed for uniformly approximating f within error 0 < δ < 1. This exponential dependence on d is referred to as the curse of dimensionality. Fortunately, many functions arising in practice are much better behaved in the sense that they are intrinsically low-dimensional, i.e., depend on only a small subset of the d variables.
Estimating such functions has received much attention and has led to a considerable amount of theory along with algorithms that do not suffer from the curse of dimensionality (cf. [5, 6, 7, 8]). Here we focus on the problem of learning one such class of functions, assuming f possesses the sparse additive structure:

f(x_1, x_2, ..., x_d) = Σ_{ℓ∈S} φ_ℓ(x_ℓ);  S ⊂ {1, ..., d}, |S| = k ≪ d.  (1.1)

Functions of the form (1.1) are referred to as sparse additive models (SPAMs) and generalize sparse linear models, to which they reduce if each φ_ℓ is linear. The problem of estimating SPAMs has received considerable attention in the regression setting (cf. [9, 10, 11] and references within), where (x_i, f(x_i))_{i=1}^n are typically i.i.d. samples from some unknown probability measure P. This setting, however, does not consider the possibility of sampling f at specifically chosen points, tailored to the additive structure of f. In this paper, we propose a strategy for querying f, together with an efficient recovery algorithm, with much stronger guarantees than known in the regression setting. In particular, we provide the first results guaranteeing uniformly accurate recovery of each individual component φ_ℓ of the SPAM. This can be crucial in applications where the goal is not merely to approximate f, but to gain insight into its structure.

Related work. SPAMs have been studied extensively in the regression setting, with observations being corrupted with random noise. [9] proposed the COSSO method, which is an extension of the Lasso to the reproducing kernel Hilbert space (RKHS) setting. A similar extension was considered in [10]. In [12], the authors propose a least squares method regularized with smoothness, with each φ_ℓ lying in an RKHS, and derive error rates for estimating f in the L²(P) norm¹. [13, 14] propose methods based on least squares loss regularized with sparsity and smoothness constraints. [13] proves consistency of its method in terms of mean squared risk, while [14] derives error rates for estimating f in the empirical L²(P_n) norm¹. [11] considers the setting where each φ_ℓ lies in an RKHS. They propose a convex program for estimating f and derive error rates for the same, in the L²(P) and L²(P_n) norms. Furthermore, they establish the minimax optimality of their method for the L²(P) norm. For instance, they derive an error rate of O((k log d)/n + k n^{-2s/(2s+1)}) in the L²(P) norm for estimating C^s smooth SPAMs. An estimator similar to the one in [11] was also considered by [15]. They derive similar error rates as in [11], albeit under stronger assumptions on f. There is further related work in approximation theory, where it is assumed that f can be sampled at a desired set of points. [5] considers a setting more general than (1.1), with f simply assumed to depend on an unknown subset of k ≪ d coordinate variables. They construct a set of sampling points of size O(c^k log d) for some constant c > 0, and present an algorithm that recovers a uniform approximation² to f. This model is generalized in [8], with f assumed to be of the form f(x) = g(Ax) for unknown A ∈ R^{k×d}; each row of A is assumed to be sparse. [7] generalizes this by removing the sparsity assumption on A. While the methods of [5, 8, 7] could be employed for learning SPAMs, their sampling sets would be of size exponential in k, and hence sub-optimal.
Furthermore, while these methods derive uniform approximations to f, they are unable to recover the individual φ_ℓ's.

Our contributions. Our contributions are threefold:

1. We propose an efficient algorithm that queries f at O(k log d) locations and recovers: (i) the active set S along with (ii) a uniform approximation to each φ_ℓ, ℓ ∈ S. In contrast, the existing error bounds in the statistics community [11, 12, 15] are in the much weaker L²(P) sense. Furthermore, the existing theory in both statistics and approximation theory provides explicit error bounds for recovering f and not the individual φ_ℓ's.

2. An important component of our algorithm is a novel convex quadratic program for estimating an unknown univariate function from point queries corrupted with arbitrary bounded noise. We derive rigorous error bounds for this program in the L∞ norm that demonstrate the robustness of the solution returned. We also explicitly demonstrate the effect of noise, sampling density and the curvature of the function on the solution returned.

3. We theoretically analyze the impact of additive noise in the point queries on the performance of our algorithm, for two noise models: arbitrary bounded noise and stochastic (i.i.d.) noise. In particular, for additive Gaussian noise, we show that our algorithm recovers a robust uniform approximation to each φ_ℓ with at most O(k³(log d)²) point queries of f. We also provide simulation results that validate our theoretical findings.

2 Problem statement

For any function g we denote its p-th derivative by g^(p) when p is large, else we use the appropriate number of prime symbols. ‖g‖_{L∞[a,b]} denotes the L∞ norm of g in [a, b]. For a vector x we denote its ℓ_q norm, 1 ≤ q ≤ ∞, by ‖x‖_q. We consider approximating functions f : R^d → R from point queries. In particular, for some unknown active set S ⊂ {1, ..., d} with |S| = k ≪ d, we assume f to be of the additive form f(x_1, ..., x_d) = Σ_{ℓ∈S} φ_ℓ(x_ℓ). Here the φ_ℓ : R → R are the individual univariate components of the model.
Our goal is to query f at suitably chosen points in its domain in order to recover an estimate φ_est,ℓ of φ_ℓ in a compact subset Ω ⊂ R for each ℓ ∈ S. We measure the approximation error in the L∞ norm. For simplicity, we assume that Ω = [−1, 1], meaning that we guarantee an upper

¹ ‖f‖²_{L²(P)} = ∫ f(x)² dP(x) and ‖f‖²_{L²(P_n)} = (1/n) Σ_i f²(x_i).
² This means in the L∞ norm.

bound on ‖φ_est,ℓ − φ_ℓ‖_{L∞[−1,1]} for each ℓ ∈ S. Furthermore, we assume that we can query f from a slight enlargement³ [−(1+r), (1+r)]^d of [−1, 1]^d for some small r > 0. As will be seen later, the enlargement r can be made arbitrarily close to 0. We now list our main assumptions for this problem.

1. Each φ_ℓ is assumed to be sufficiently smooth. In particular, we assume that φ_ℓ ∈ C⁵[−(1+r), (1+r)], where C⁵ denotes five times continuous differentiability. Since [−(1+r), (1+r)] is compact, this implies that there exist constants B₁, ..., B₅ ≥ 0 so that

max_{ℓ∈S} ‖φ_ℓ^(p)‖_{L∞[−(1+r),(1+r)]} ≤ B_p;  p = 1, ..., 5.  (2.1)

2. We assume each φ_ℓ to be centered in the interval [−1, 1], i.e. ∫_{−1}^{1} φ_ℓ(t) dt = 0 for ℓ ∈ S. Such a condition is necessary for unique identification of the φ_ℓ's. Otherwise one could simply replace each φ_ℓ with φ_ℓ + a_ℓ for a_ℓ ∈ R with Σ_ℓ a_ℓ = 0, and unique identification would not be possible.

3. We require that for each ℓ ∈ S there exists a connected interval I_ℓ ⊆ [−1, 1] with µ(I_ℓ) ≥ δ so that |φ′_ℓ(x)| ≥ D for all x ∈ I_ℓ. Here µ(I) denotes the Lebesgue measure of I, and δ, D > 0 are constants assumed to be known to the algorithm. This assumption essentially enables us to detect the active set S. If, say, φ′_ℓ were zero or close to zero throughout [−1, 1] for some ℓ ∈ S, then due to Assumption 2 this would imply that φ_ℓ is zero or close to zero.

We remark that it suffices to use estimates for our problem parameters instead of exact values. In particular, we can use upper bounds for k and B_p, p = 1, ..., 5, and lower bounds for the parameters D, δ. Our methods and results stated in the coming sections will remain unchanged.

3 Our sampling scheme and algorithm

In this section, we first motivate and describe our sampling scheme for querying f. We then outline our algorithm and explain the intuition behind its different stages. Consider the Taylor expansion of f at any point ξ ∈ R^d along the direction v ∈ R^d with step size ɛ > 0. For any C^p smooth f, p ≥ 2, we obtain, for ζ = ξ + θv with some 0 < θ < ɛ, the following expression:

(f(ξ + ɛv) − f(ξ))/ɛ = ⟨v, ∇f(ξ)⟩ + (ɛ/2) v^T ∇²f(ζ) v.
(3.1)

Note that (3.1) can be interpreted as taking a noisy linear measurement of ∇f(ξ) with the measurement vector v, the noise being the Taylor remainder term. Importantly, due to the sparse additive form of f, we have φ_ℓ ≡ 0 for ℓ ∉ S, implying that ∇f(ξ) = [φ′₁(ξ₁) φ′₂(ξ₂) ... φ′_d(ξ_d)]^T is at most k-sparse. Hence (3.1) actually represents a noisy linear measurement of the k-sparse vector ∇f(ξ). For any fixed ξ, we know from compressive sensing (CS) [16, 17] that ∇f(ξ) can be recovered (with high probability) using few random linear measurements⁴. This motivates the following sets of points with which we query f, as illustrated in Figure 1. For integers m_x, m_v > 0 we define

X := { ξ_i = (i/m_x)(1, 1, ..., 1)^T ∈ R^d : i = −m_x, ..., m_x },  (3.2)

V := { v_j ∈ R^d : v_{j,ℓ} = ±1/√m_v w.p. 1/2 each; j = 1, ..., m_v, ℓ = 1, ..., d }.  (3.3)

Using (3.1) at each ξ_i ∈ X and v_j ∈ V, for i = −m_x, ..., m_x and j = 1, ..., m_v, leads to:

y_{i,j} := (f(ξ_i + ɛv_j) − f(ξ_i))/ɛ = ⟨v_j, ∇f(ξ_i)⟩ + (ɛ/2) v_j^T ∇²f(ζ_{i,j}) v_j =: ⟨v_j, x_i⟩ + n_{i,j},  (3.4)

³ In case f : [a, b]^d → R, we can define g : [−1, 1]^d → R where g(x) = f(x(b−a)/2 + (b+a)/2) = Σ_{ℓ∈S} φ̄_ℓ(x_ℓ), with φ̄_ℓ(x_ℓ) = φ_ℓ(x_ℓ(b−a)/2 + (b+a)/2). We then sample g from within [−(1+r), (1+r)]^d for some small r > 0 by querying f, and estimate φ̄_ℓ in [−1, 1], which in turn gives an estimate of φ_ℓ in [a, b].
⁴ Estimating sparse gradients via compressive sensing has been considered previously by Fornasier et al. [8], albeit for a substantially different function class than ours. Hence their sampling scheme differs considerably from ours, and is not tailored for learning SPAMs.

where x_i = ∇f(ξ_i) = [φ′₁(i/m_x) φ′₂(i/m_x) ... φ′_d(i/m_x)]^T is k-sparse. Let us denote V = [v₁ ... v_{m_v}]^T, y_i = [y_{i,1} ... y_{i,m_v}]^T and n_i = [n_{i,1} ... n_{i,m_v}]^T. Then for each i, we can write (3.4) in the succinct form:

y_i = V x_i + n_i.  (3.5)

Here V ∈ R^{m_v×d} represents the linear measurement matrix, y_i ∈ R^{m_v} denotes the measurement vector at ξ_i, and n_i represents noise on account of the non-linearity of f. Note that we query f at |X|(|V| + 1) = (2m_x + 1)(m_v + 1) many points. Given y_i and V, we can recover a robust approximation to x_i via ℓ₁ minimization [16, 17]. On account of the structure of ∇f, we thus recover noisy estimates of the φ′_ℓ at equispaced points along the interval [−1, 1]. We are now in a position to formally present our algorithm for learning SPAMs.

Our algorithm for learning SPAMs. The steps involved in our learning scheme are outlined in Algorithm 1. Steps 1-4 involve the CS-based recovery stage, wherein we use the aforementioned sampling sets to formulate our problem as a CS one. Step 4 involves a simple thresholding procedure where an appropriate threshold τ is employed to recover the unknown active set S. In Section 4 we provide precise conditions on our sampling parameters which guarantee exact recovery, i.e. Ŝ = S. Step 5 leverages a convex quadratic program (P) that uses the noisy estimates x̂_{i,ℓ} of φ′_ℓ(i/m_x), for each ℓ ∈ Ŝ and i = −m_x, ..., m_x, to return a cubic spline estimate φ̃_ℓ of φ′_ℓ. This program and its theoretical properties are explained in Section 4. Finally, in Step 6 we derive our final estimate φ_est,ℓ via piecewise integration of φ̃_ℓ for each ℓ ∈ Ŝ. Hence our final estimate of φ_ℓ is a spline of degree 4. The performance of Algorithm 1 for recovering S and the individual φ_ℓ's is presented in Theorem 1, which is also our first main result. All proofs are deferred to the appendix.

Figure 1: The points ξ_i ∈ X (blue disks) and ξ_i + ɛv_j (red arrows) for v_j ∈ V.

Algorithm 1: Algorithm for learning the φ_ℓ in the SPAM f(x) = Σ_{ℓ∈S} φ_ℓ(x_ℓ)

1: Choose m_x, m_v and construct sampling sets X and V as in (3.2), (3.3).
2: Choose step size ɛ > 0. Query f at ξ_i and ξ_i + ɛv_j for i = −m_x, ..., m_x and j = 1, ..., m_v.
3: Construct y_i where y_{i,j} = (f(ξ_i + ɛv_j) − f(ξ_i))/ɛ for i = −m_x, ..., m_x and j = 1, ..., m_v.
4: Set x̂_i := argmin_{z : y_i = Vz} ‖z‖₁. For τ > 0 compute Ŝ = ∪_{i=−m_x}^{m_x} {ℓ ∈ {1, ..., d} : |x̂_{i,ℓ}| > τ}.
5: For each ℓ ∈ Ŝ, run (P) as defined in Section 4 using (x̂_{i,ℓ})_{i=−m_x}^{m_x}, τ and some smoothing parameter γ ≥ 0, to obtain φ̃_ℓ.
6: For each ℓ ∈ Ŝ, set φ_est,ℓ to be the piecewise integral of φ̃_ℓ as explained in Section 4.

Theorem 1. There exist constants C, C₁ > 0 such that if m_x ≥ 1/δ, m_v ≥ C₁ k log d, 0 < ɛ < D√m_v/(CkB₂) and τ = CɛkB₂/(2√m_v), then with high probability Ŝ = S, and for any γ ≥ 0 the estimate φ_est,ℓ returned by Algorithm 1 satisfies, for each ℓ ∈ S:

‖φ_est,ℓ − φ_ℓ‖_{L∞[−1,1]} ≤ [59(1 + γ)] (CɛkB₂/√m_v) + (1/(64 m_x⁴)) ‖φ_ℓ^(5)‖_{L∞[−1,1]}.  (3.6)

Recall that k, B₂, D, δ are our problem parameters introduced in Section 2, while ɛ is the step size parameter from (3.4). We see that with O(k log d) point queries of f and with ɛ < D√m_v/(CkB₂), the active set is recovered exactly. The error bound in (3.6) holds for all such choices of ɛ. It is a sum of two terms, the first of which arises during the estimation of ∇f in the CS stage. The second error term is the interpolation error bound for interpolating φ′_ℓ from its samples in the noise-free setting. We note that our point queries lie in [−(1 + ɛ/√m_v), (1 + ɛ/√m_v)]^d. For the stated condition on ɛ in Theorem 1 we have ɛ/√m_v < D/(CkB₂), which can be made arbitrarily close to zero by choosing an appropriately small ɛ. Hence we sample from only a small enlargement of [−1, 1]^d.
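As a minimal sketch of Steps 1-3 of Algorithm 1 (the toy function f, the dimensions and all variable names are our own illustrative choices, not from the paper), the sampling sets (3.2), (3.3) and the finite-difference measurements (3.4) can be written down directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_sets(d, m_x, m_v):
    """Build X (3.2): 2*m_x+1 points on the diagonal of [-1,1]^d,
    and V (3.3): m_v random sign vectors scaled by 1/sqrt(m_v)."""
    X = np.array([(i / m_x) * np.ones(d) for i in range(-m_x, m_x + 1)])
    V = rng.choice([-1.0, 1.0], size=(m_v, d)) / np.sqrt(m_v)
    return X, V

def cs_measurements(f, X, V, eps):
    """Finite differences y_{i,j} = (f(xi_i + eps*v_j) - f(xi_i)) / eps,
    i.e. noisy linear measurements <v_j, grad f(xi_i)> as in (3.4)."""
    return np.array([[(f(xi + eps * v) - f(xi)) / eps for v in V] for xi in X])

# Hypothetical SPAM with d = 20 and active set {0, 3}:
d, m_x, m_v, eps = 20, 5, 12, 1e-3
f = lambda x: np.sin(np.pi * x[0]) + x[3] ** 3
X, V = sampling_sets(d, m_x, m_v)
Y = cs_measurements(f, X, V, eps)   # shape (2*m_x + 1, m_v)
```

Each row Y[i] is then a compressed measurement y_i = V x_i + n_i of the k-sparse gradient at ξ_i, which is what the decoding stage consumes.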

4 Analyzing the algorithm

We now describe and analyze in more detail the individual stages of Algorithm 1. We first analyze Steps 1-4, which constitute the compressive sensing (CS) based recovery stage. Next, we analyze Step 5, where we also introduce our convex quadratic program. Lastly, we analyze Step 6, where we derive our final estimate φ_est,ℓ.

Compressive sensing-based recovery stage. This stage of Algorithm 1 involves solving a sequence of linear programs for recovering estimates of x_i = [φ′₁(i/m_x) ... φ′_d(i/m_x)]^T for i = −m_x, ..., m_x. We note that the measurements y_i are noisy linear measurements of x_i, with the noise being arbitrary and bounded. For such a noise model, it is known that ℓ₁ minimization results in robust recovery of the sparse signal [18]. Using this result in our setting allows us to quantify the recovery error ‖x̂_i − x_i‖₂, as specified in Lemma 1.

Lemma 1. There exist constants c₃ ≥ 1 and C, c₁ > 0 such that for m_v satisfying c₃ k log d < m_v < d/(log 6)², we have with probability at least 1 − e^{−c₁ m_v} − e^{−√(m_v d)} that x̂_i satisfies ‖x̂_i − x_i‖₂ ≤ CɛkB₂/(2√m_v) for all i = −m_x, ..., m_x. Furthermore, given that this holds and m_x ≥ 1/δ is satisfied, we then have for any ɛ < D√m_v/(CkB₂) that the choice τ = CɛkB₂/(2√m_v) implies Ŝ = S.

Thus, upon using ℓ₁ minimization based decoding at the 2m_x + 1 points, we recover robust estimates x̂_i of x_i, which immediately gives us estimates φ̂′_ℓ(i/m_x) = x̂_{i,ℓ} of φ′_ℓ(i/m_x) for i = −m_x, ..., m_x and ℓ = 1, ..., d. In order to recover the active set S, we first note that the spacing between consecutive samples in X is 1/m_x. Therefore the condition m_x ≥ 1/δ implies, on account of Assumption 3, that the sample spacing is fine enough to ensure that for each ℓ ∈ S there exists a sample i for which |φ′_ℓ(i/m_x)| ≥ D holds. The stated choice of the step size ɛ essentially guarantees, for all ℓ ∉ S and all i, that φ̂′_ℓ(i/m_x) lies within a sufficiently small neighborhood of the origin, in turn enabling detection of the active set. Therefore, after this stage of Algorithm 1, we have at hand the active set S along with the estimates (φ̂′_ℓ(i/m_x))_{i=−m_x}^{m_x} for each ℓ ∈ S.
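The ℓ₁-minimization decoding in Step 4 is itself a linear program; a sketch (assuming SciPy's `linprog`; the noiseless instance below is purely illustrative) rewrites argmin ‖z‖₁ s.t. Vz = y as an LP over the positive and negative parts of z:

```python
import numpy as np
from scipy.optimize import linprog

def l1_decode(V, y):
    """Basis-pursuit decoding: argmin ||z||_1 s.t. V z = y.
    Split z = u - w with u, w >= 0 and minimize sum(u) + sum(w)."""
    m_v, d = V.shape
    c = np.ones(2 * d)
    A_eq = np.hstack([V, -V])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    uw = res.x
    return uw[:d] - uw[d:]

# Hypothetical instance: a 2-sparse gradient observed through m_v = 12 measurements.
rng = np.random.default_rng(1)
d, m_v = 20, 12
V = rng.choice([-1.0, 1.0], size=(m_v, d)) / np.sqrt(m_v)
x_true = np.zeros(d)
x_true[2], x_true[7] = 1.5, -0.8
x_hat = l1_decode(V, V @ x_true)
S_hat = {l for l in range(d) if abs(x_hat[l]) > 0.1}   # thresholding as in Step 4
```

The equality-constrained form shown here corresponds to the noiseless case; with the bounded Taylor-remainder noise of (3.5), the same decoder is applied to y_i = Vx_i + n_i and Lemma 1 bounds the resulting error.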
Furthermore, it is easy to see that |φ̂′_ℓ(i/m_x) − φ′_ℓ(i/m_x)| ≤ τ = CɛkB₂/(2√m_v) for all ℓ ∈ S and all i.

Robust estimation via cubic splines. Our aim now is to recover a smooth, robust estimate of φ′_ℓ using the noisy samples (φ̂′_ℓ(i/m_x))_{i=−m_x}^{m_x}. Note that the noise here is arbitrary and bounded by τ = CɛkB₂/(2√m_v). To this end we choose cubic splines as our estimates, which are essentially piecewise cubic polynomials that are C² smooth [19]. There is a considerable amount of literature in the statistics community devoted to the problem of estimating univariate functions from noisy samples via cubic splines (cf. [20, 21, 22, 23]), albeit under the setting of random noise. Cubic splines have also been studied extensively in the approximation-theoretic setting for interpolating samples (cf. [19, 24, 25]). We introduce our solution to this problem in a more general setting. Consider a smooth function g : [t₁, t₂] → R and a uniform mesh⁵ ∆ : t₁ = x₀ < x₁ < ... < x_{n−1} < x_n = t₂ with x_i − x_{i−1} = h. We have at hand noisy samples ĝ_i = g(x_i) + e_i, with the noise e_i being arbitrary and bounded: |e_i| ≤ τ. In the noiseless scenario the problem would be an interpolation one, for which a popular class of cubic splines are the not-a-knot cubic splines [24]. These achieve optimal O(h⁴) error rates for C⁴ smooth g without using any higher-order information about g as boundary conditions. Let H²_∆[t₁, t₂] denote the space of cubic splines defined on [t₁, t₂] w.r.t. ∆. We then propose finding the cubic spline estimate as a solution of the following convex optimization problem (in the 4n coefficients of the n cubic polynomials) for some parameter γ ≥ 0:

(P)  min_{L ∈ H²_∆[t₁,t₂]} ∫_{t₁}^{t₂} L″(x)² dx  (4.1)

s.t.  ĝ_i − γτ ≤ L(x_i) ≤ ĝ_i + γτ;  i = 0, ..., n,  (4.2)

L‴(x₁⁻) = L‴(x₁⁺),  L‴(x_{n−1}⁻) = L‴(x_{n−1}⁺).  (4.3)

⁵ We consider uniform meshes for clarity of exposition. The results in this section can be easily generalized to non-uniform meshes.

Note that (P) is a convex QP with linear constraints. The objective function can be verified to be a positive definite quadratic form in the spline coefficients⁶. Specifically, the objective measures the total curvature of a feasible cubic spline in [t₁, t₂]. Each of the constraints (4.2)-(4.3), along with the implicit continuity constraints of L^(p), p = 0, 1, 2, at the interior points of ∆, are linear equalities/inequalities in the coefficients of the piecewise cubic polynomials. (4.3) refers to the not-a-knot boundary conditions [24], which are also linear equalities in the spline coefficients. These conditions imply that L‴ is continuous⁷ at the knots x₁, x_{n−1}. Thus, (P) searches amongst the space of all not-a-knot cubic splines such that L(x_i) lies within a ±γτ interval of ĝ_i, and returns the smoothest solution, i.e., the one with the least total curvature. The parameter γ ≥ 0 controls the degree of smoothness of the solution. Clearly, γ = 0 implies interpolating the noisy samples (ĝ_i)_{i=0}^n. As γ increases, the search interval [ĝ_i − γτ, ĝ_i + γτ] becomes larger for all i, leading to smoother feasible cubic splines. The following theorem formally describes the estimation properties of (P) and is also our second main result.

Theorem 2. For g ∈ C⁴[t₁, t₂], let L : [t₁, t₂] → R be a solution of (P) for some parameter γ ≥ 0. We then have that

‖L − g‖_{L∞[t₁,t₂]} ≤ [118(1 + γ)] τ + (h⁴/64) ‖g^(4)‖_{L∞[t₁,t₂]}.  (4.4)

We show in the appendix that if ∫_{t₁}^{t₂} (L″(x))² dx > 0, then L is unique. Note that the error bound (4.4) is a sum of two terms. The first term is proportional to the external noise bound τ, indicating that the solution is robust to noise. The second term is the error that would arise even if the perturbation were absent, i.e. τ = 0. Intuitively, if γτ is large enough, then we would expect the solution returned by (P) to be a line. Indeed, a larger value of γτ would imply a larger search interval in (4.2), which, if sufficiently large, would allow a line (which has zero curvature) to lie in the feasible region.
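For the γ = 0 case, (P) reduces to not-a-knot interpolation of the noisy samples, and SciPy provides that boundary condition directly; its antiderivative method also performs the piecewise integration used later in Step 6. The sketch below (the function g = φ′ and all parameters are illustrative, and `CubicSpline` is only a stand-in for the solution of (P) at γ = 0):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Noisy samples of g = phi' on a uniform mesh of [-1, 1], with |e_i| <= tau.
phi = lambda t: np.sin(np.pi * t)            # hypothetical centered component
dphi = lambda t: np.pi * np.cos(np.pi * t)   # g = phi'
n, tau = 20, 1e-3
t = np.linspace(-1.0, 1.0, n + 1)
rng = np.random.default_rng(2)
g_hat = dphi(t) + rng.uniform(-tau, tau, size=t.shape)

# gamma = 0 in (P): interpolate the samples with not-a-knot boundary conditions (4.3).
L = CubicSpline(t, g_hat, bc_type="not-a-knot")

# Step 6 sketch: piecewise integration, then centering so the estimate
# integrates to zero over [-1, 1] (Assumption 2), as in (4.6).
psi = L.antiderivative()
phi_est = lambda x: psi(x) - 0.5 * psi.integrate(-1.0, 1.0)

tt = np.linspace(-1.0, 1.0, 2001)
spline_err = np.max(np.abs(L(tt) - dphi(tt)))     # O(tau + h^4 ||g^(4)||), cf. (4.4)
final_err = np.max(np.abs(phi_est(tt) - phi(tt)))
```

For γ > 0 the feasible tube in (4.2) must be handled by a QP solver rather than an interpolation routine, but the γ = 0 endpoint already exhibits the O(τ + h⁴) behavior of Theorem 2.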
More formally, we show in the appendix sufficient conditions, τ = Ω(n^{1/2} ‖g‖_∞ γ^{−1}) and γ > 1, which, if satisfied, imply that the solution returned by (P) is a line. This indicates that if either n is small or g has small curvature, then moderately large values of τ and/or γ will cause the solution returned by (P) to be a line. If an estimate of ‖g‖_∞ is available, then one could, for instance, use the upper bound 1 + O(n^{1/2} ‖g‖_∞/τ) to restrict the range of values of γ within which (P) is used.

Theorem 2 has the following corollary for estimation of the C⁴ smooth φ′_ℓ in the interval [−1, 1]. The proof simply involves replacing g with φ′_ℓ, n + 1 with 2m_x + 1, h with 1/m_x and τ with CɛkB₂/(2√m_v). As the perturbation τ is directly proportional to the step size ɛ, we show in the appendix that if additionally ɛ = Ω(√(m_x m_v) ‖φ′_ℓ‖_∞ γ^{−1}), γ > 1, holds, then the corresponding estimate φ̃_ℓ will be a line.

Corollary 1. Let (P) be employed for each ℓ ∈ S using the noisy samples (φ̂′_ℓ(i/m_x))_{i=−m_x}^{m_x}, with step size ɛ satisfying 0 < ɛ < D√m_v/(CkB₂). Denoting by φ̃_ℓ the corresponding solution returned by (P), we then have for any γ ≥ 0 that:

‖φ̃_ℓ − φ′_ℓ‖_{L∞[−1,1]} ≤ [59(1 + γ)] (CɛkB₂/√m_v) + (1/(64 m_x⁴)) ‖φ_ℓ^(5)‖_{L∞[−1,1]}.  (4.5)

The final estimate. We now derive the final estimate φ_est,ℓ of φ_ℓ for each ℓ ∈ S. Denote x₀ (= −1) < x₁ < ... < x_{2m_x−1} < x_{2m_x} (= 1) as our equispaced set of points on [−1, 1]. Since φ̃_ℓ : [−1, 1] → R returned by (P) is a cubic spline, we have φ̃_ℓ(x) = φ̃_{ℓ,i}(x) for x ∈ [x_i, x_{i+1}], where φ̃_{ℓ,i} is a polynomial of degree at most 3. We then define φ_est,ℓ(x) := Φ_{ℓ,i}(x) + F_i for x ∈ [x_i, x_{i+1}] and i = 0, ..., 2m_x − 1. Here Φ_{ℓ,i} is an antiderivative of φ̃_{ℓ,i}, and the F_i are constants of integration. Denoting F₀ = F, we have that φ_est,ℓ is continuous at x₁, ..., x_{2m_x−1} for:

F_i = Φ_{ℓ,0}(x₁) + Σ_{j=1}^{i−1} (Φ_{ℓ,j}(x_{j+1}) − Φ_{ℓ,j}(x_j)) − Φ_{ℓ,i}(x_i) + F =: F̃_i + F;  1 ≤ i ≤ 2m_x − 1.

Hence, by denoting ψ_{ℓ,i}(x) := Φ_{ℓ,i}(x) + F̃_i, we obtain φ_est,ℓ(x) = ψ_ℓ(x) + F, where ψ_ℓ(x) = ψ_{ℓ,i}(x) for

⁶ Shown in the appendix.
⁷ f(x⁻) = lim_{h→0⁻} f(x + h) and f(x⁺) = lim_{h→0⁺} f(x + h) denote left and right hand limits respectively.

x ∈ [x_i, x_{i+1}]. Now, on account of Assumption 2, we require φ_est,ℓ to also be centered, implying F = −(1/2) ∫_{−1}^{1} ψ_ℓ(x) dx. Hence we output our final estimate of φ_ℓ to be:

φ_est,ℓ(x) := ψ_ℓ(x) − (1/2) ∫_{−1}^{1} ψ_ℓ(t) dt;  x ∈ [−1, 1].  (4.6)

Since φ_est,ℓ is by construction continuous in [−1, 1], is a piecewise combination of polynomials of degree at most 4, and since φ′_est,ℓ = φ̃_ℓ is a cubic spline, φ_est,ℓ is a spline function of order 4. Lastly, we show in the proof of Theorem 1 that ‖φ_est,ℓ − φ_ℓ‖_{L∞[−1,1]} ≤ 3 ‖φ̃_ℓ − φ′_ℓ‖_{L∞[−1,1]} holds. Using Corollary 1, this provides us with the error bounds stated in Theorem 1.

5 Impact of noise on the performance of our algorithm

Our third main contribution involves analyzing the more realistic scenario where the point queries are corrupted with additive external noise z. Thus, querying f in Step 2 of Algorithm 1 results in the noisy values f(ξ_i) + z_i and f(ξ_i + ɛv_j) + z_{i,j}, respectively. This changes (3.5) to the noisy linear system y_i = V x_i + n_i + z′_i, where z′_{i,j} = (z_{i,j} − z_i)/ɛ for i = −m_x, ..., m_x and j = 1, ..., m_v. Notice that the external noise gets scaled by 1/ɛ, while n_{i,j} scales linearly with ɛ.

Arbitrary bounded noise. In this model, the external noise is arbitrary but bounded, so that |z_i|, |z_{i,j}| < κ for all i, j. It can be verified, along the lines of the proof of Lemma 1, that ‖n_i + z′_i‖₂ ≤ 2κ√m_v/ɛ + ɛkB₂/(2√m_v). Observe that, unlike the noiseless setting, ɛ cannot be made arbitrarily close to 0, as this would blow up the impact of the external noise. The following theorem shows that if κ is small relative to D² < |φ′_ℓ(x)|², x ∈ I_ℓ, ℓ ∈ S, then⁸ there exists an interval for choosing ɛ within which Algorithm 1 recovers exactly the active set S. This condition has the natural interpretation that if the signal-to-external-noise ratio in I_ℓ is sufficiently large, then S can be detected exactly.

Theorem 3.
There exist constants C, C₁ > 0 such that if κ < D²/(16C²kB₂), m_x ≥ 1/δ, and m_v ≥ C₁ k log d hold, then for any ɛ ∈ (D√m_v/(2CkB₂))·[1 − A, 1 + A], where A := √(1 − (16C²kB₂κ)/D²), and τ = C(2κ√m_v/ɛ + ɛkB₂/(2√m_v)), we have in Algorithm 1, with high probability, that Ŝ = S and, for any γ ≥ 0 and each ℓ ∈ S:

‖φ_est,ℓ − φ_ℓ‖_{L∞[−1,1]} ≤ [59(1 + γ)] (4C√m_v κ/ɛ + CɛkB₂/√m_v) + (1/(64 m_x⁴)) ‖φ_ℓ^(5)‖_{L∞[−1,1]}.  (5.1)

Stochastic noise. In this model, the external noise is assumed to be i.i.d. Gaussian, so that z_i, z_{i,j} ∼ N(0, σ²) i.i.d. for all i, j. In this setting we consider resampling f at each query point N times and then averaging the noisy samples, in order to reduce σ. Given this, we now have z_i, z_{i,j} ∼ N(0, σ²/N) i.i.d. for all i, j. Using standard tail bounds for Gaussians, we can show that for any κ > 0, if N is chosen large enough, then |z_i − z_{i,j}| ≤ 2κ for all i, j with high probability. Hence the external noise would be bounded with high probability, and the analysis for Theorem 3 can be used in a straightforward manner. Of course, an advantage we have in this setting is that κ can be chosen arbitrarily close to zero by choosing a correspondingly large value of N. We state all this formally in the following theorem.

Theorem 4. There exist constants C, C₁ > 0 such that for κ < D²/(16C²kB₂), m_x ≥ 1/δ, and m_v ≥ C₁ k log d, if we re-sample each query in Step 2 of Algorithm 1 N > (σ²/κ²) log(2σ|X||V|/(κp)) times for 0 < p < 1, and average the values, then for any ɛ ∈ (D√m_v/(2CkB₂))·[1 − A, 1 + A], where A := √(1 − (16C²kB₂κ)/D²), and τ = C(2κ√m_v/ɛ + ɛkB₂/(2√m_v)), we have in Algorithm 1, with probability at least 1 − p − o(1), that Ŝ = S and, for any γ ≥ 0 and each ℓ ∈ S:

‖φ_est,ℓ − φ_ℓ‖_{L∞[−1,1]} ≤ [59(1 + γ)] (4C√m_v κ/ɛ + CɛkB₂/√m_v) + (1/(64 m_x⁴)) ‖φ_ℓ^(5)‖_{L∞[−1,1]}.  (5.2)

⁸ I_ℓ is the critical interval defined in Assumption 3 for detecting S.
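The resampling argument behind Theorem 4 is elementary to check numerically: averaging N i.i.d. N(0, σ²) draws gives N(0, σ²/N), and a Gaussian tail bound with a union bound over all queries keeps every averaged noise term below κ with probability at least 1 − p. The sketch below uses the standard tail bound 2·exp(−Nκ²/(2σ²)) for choosing N (constants and query count are illustrative, not the exact expression in Theorem 4):

```python
import numpy as np

# Averaging N i.i.d. N(0, sigma^2) draws per query reduces the noise to
# N(0, sigma^2 / N); a tail bound plus a union bound over all queries then
# keeps every averaged term below kappa with probability >= 1 - p_fail.
rng = np.random.default_rng(3)
sigma, kappa, p_fail = 0.01, 0.002, 0.1
n_queries = 855                       # (2*m_x + 1)(m_v + 1), as in Section 5

# N from P(|zbar| > kappa) <= 2 exp(-N kappa^2 / (2 sigma^2)) <= p_fail / n_queries:
N = int(np.ceil(2.0 * (sigma / kappa) ** 2 * np.log(2.0 * n_queries / p_fail)))

z_bar = rng.normal(0.0, sigma, size=(n_queries, N)).mean(axis=1)
frac_bounded = np.mean(np.abs(z_bar) <= kappa)
```

Because N grows like (σ/κ)², halving the target κ quadruples the number of repetitions, which is exactly the k² factor that appears in the overall O(k³(log d)²) query complexity.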

Note that we now query f N|X|(|V| + 1) times. Also, |X| = 2m_x + 1 = Θ(1) and κ = O(k⁻¹), as D, C, B₂, δ are constants. Hence the choice |V| = O(k log d) gives us N = O(k² log(p⁻¹ k² log d)) and leads to an overall query complexity of O(k³ log d log(p⁻¹ k² log d)) when the samples are corrupted with additive Gaussian noise. Choosing p = O(d^{−c}) for any constant c > 0 gives us a sample complexity of O(k³(log d)²), and ensures that the result holds with high probability. The o(1) term goes to zero exponentially fast as d → ∞.

Simulation results. We now provide simulation results on synthetic data to support our theoretical findings. We consider the noisy setting with the point queries being corrupted with Gaussian noise. For d = 1000, k = 4 and S = {2, 105, 424, 782}, consider f : R^d → R where f = φ₂(x₂) + φ₁₀₅(x₁₀₅) + φ₄₂₄(x₄₂₄) + φ₇₈₂(x₇₈₂), with φ₂(x) = sin(πx), φ₁₀₅(x) = exp(−2x) − …, φ₄₂₄(x) = (1/3)cos³(πx) − …, φ₇₈₂(x) = …. We choose δ = 0.3, D = 0.2, which can be verified to be valid parameters for the above φ_ℓ's. Furthermore, we choose m_x = ⌈2/δ⌉ = 7 and m_v = ⌈2k log d⌉ = 56 to satisfy the conditions of Theorem 4. Next, we choose the constants C = 0.2, B₂ = 35 and κ = D²/(16C²kB₂), as required by Theorem 4. For the choice ɛ = D√m_v/(2CkB₂), we then query f at (2m_x + 1)(m_v + 1) = 855 points. The function values are corrupted with Gaussian noise N(0, σ²/N) for σ = 0.01 and N = 100. This is equivalent to resampling and averaging the point queries N times. Importantly, the sufficient condition on N as stated in Theorem 4 is (σ²/κ²) log(2σ|X||V|/(κp)) ≈ 6974 for p = 0.1. Thus we consider a significantly undersampled regime. Lastly, we select the threshold τ = C(2κ√m_v/ɛ + ɛkB₂/(2√m_v)), as stated by Theorem 4, and employ Algorithm 1 for different values of the smoothing parameter γ.

Figure 2: Estimates φ_est,ℓ of φ_ℓ (black) for γ = 0.3 (red), γ = 1 (blue) and γ = 5 (green); panels (a)-(d) show the estimates of φ₂, φ₁₀₅, φ₄₂₄ and φ₇₈₂ respectively.
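The parameter choices in this simulation are easy to reproduce. The short computation below is a sketch (the ceiling-rounding is our assumption); it recovers the stated query count of 855, and its value for the sufficient condition on N lands in the same several-thousand range as the stated 6974, far above the N = 100 actually used:

```python
import numpy as np

# Parameters of the simulation in Section 5.
d, k = 1000, 4
delta, D, C, B2 = 0.3, 0.2, 0.2, 35.0
sigma, p = 0.01, 0.1

m_x = int(np.ceil(2.0 / delta))            # = 7
m_v = int(np.ceil(2.0 * k * np.log(d)))    # = 56
n_queries = (2 * m_x + 1) * (m_v + 1)      # = 855 query points

kappa = D ** 2 / (16.0 * C ** 2 * k * B2)  # noise level required by Theorem 4
eps = D * np.sqrt(m_v) / (2.0 * C * k * B2)

# Sufficient number of repetitions per query from Theorem 4 (several thousand,
# versus the N = 100 used in the undersampled experiment).
X_sz, V_sz = 2 * m_x + 1, m_v
N_suff = (sigma / kappa) ** 2 * np.log(2.0 * sigma * X_sz * V_sz / (kappa * p))
```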
The results are shown in Figure 2. Over 10 independent runs of the algorithm, we observed that S was recovered exactly each time. Furthermore, we see from Figure 2 that the recovery is quite accurate for γ = 0.3. For γ = 1, we notice that the search interval γτ becomes large enough to cause the estimates φ_est,424, φ_est,782 to become relatively smoother. For γ = 5, the search interval γτ becomes wide enough for a line to fit in the feasible region for φ̃₄₂₄, φ̃₇₈₂. This results in φ_est,424, φ_est,782 being quadratic functions. In the case of φ₂, φ₁₀₅, the search interval is not sufficiently wide for a line to lie in the feasible region, even for γ = 5. However, we notice that the estimates φ_est,2, φ_est,105 become relatively smoother, as expected.

6 Conclusion

We proposed an efficient sampling scheme for learning SPAMs. In particular, we showed that with only a few queries, we can derive uniform approximations to each underlying univariate function of the SPAM. A crucial component of our approach is a novel convex QP for robust estimation of univariate functions via cubic splines, from samples corrupted with arbitrary bounded noise. Lastly, we showed how our algorithm can handle noisy point queries for both (i) arbitrary bounded and (ii) i.i.d. Gaussian noise models. An important direction for future work would be to determine the optimality of our sampling bounds by deriving corresponding lower bounds on the sample complexity.

Acknowledgments. This research was supported in part by an SNSF grant and a Microsoft Research Faculty Fellowship.

References

[1] Th. Müller-Gronbach and K. Ritter. Minimal errors for strong and weak approximation of stochastic differential equations. Monte Carlo and Quasi-Monte Carlo Methods, pages 53-82.
[2] M.H. Maathuis, M. Kalisch, and P. Bühlmann. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A).
[3] M.J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12).
[4] J.F. Traub, G.W. Wasilkowski, and H. Wozniakowski. Information-Based Complexity. Academic Press, New York.
[5] R. DeVore, G. Petrova, and P. Wojtaszczyk. Approximation of functions of few variables in high dimensions. Constr. Approx., 33.
[6] A. Cohen, I. Daubechies, R.A. DeVore, G. Kerkyacharian, and D. Picard. Capturing ridge functions in high dimensions from point queries. Constr. Approx., pages 1-19.
[7] H. Tyagi and V. Cevher. Active learning of multi-index function models. Advances in Neural Information Processing Systems 25.
[8] M. Fornasier, K. Schnass, and J. Vybíral. Learning functions of few arbitrary linear parameters in high dimensions. Foundations of Computational Mathematics, 12(2).
[9] Y. Lin and H.H. Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5).
[10] M. Yuan. Nonnegative garrote component selection in functional ANOVA models. In AISTATS, volume 2.
[11] G. Raskutti, M.J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13(1).
[12] V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In COLT.
[13] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5).
[14] L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B).
[15] V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6).
[16] E.J. Candès, J.K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8).
[17] D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4).
[18] P. Wojtaszczyk. ℓ₁ minimization with noisy data. SIAM Journal on Numerical Analysis, 50(2).
[19] J.H. Ahlberg, E.N. Nilson, and J.L. Walsh. The Theory of Splines and Their Applications. Academic Press, New York.
[20] I.J. Schoenberg. Spline functions and the problem of graduation. Proceedings of the National Academy of Sciences, 52(4).
[21] C.M. Reinsch. Smoothing by spline functions. Numer. Math., 10.
[22] G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 24(5).
[23] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31(4).
[24] C. de Boor. A Practical Guide to Splines. Springer-Verlag, New York.
[25] C.A. Hall and W.W. Meyer. Optimal error bounds for cubic spline interpolation. Journal of Approximation Theory, 16(2).

An Algorithm for Pruning Redundant Modules in Min-Max Modular Network An Agorithm for Pruning Redundant Modues in Min-Max Moduar Network Hui-Cheng Lian and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University 1954 Hua Shan Rd., Shanghai

More information

More Scattering: the Partial Wave Expansion

More Scattering: the Partial Wave Expansion More Scattering: the Partia Wave Expansion Michae Fower /7/8 Pane Waves and Partia Waves We are considering the soution to Schrödinger s equation for scattering of an incoming pane wave in the z-direction

More information

arxiv: v3 [math.ca] 8 Nov 2018

arxiv: v3 [math.ca] 8 Nov 2018 RESTRICTIONS OF HIGHER DERIVATIVES OF THE FOURIER TRANSFORM MICHAEL GOLDBERG AND DMITRIY STOLYAROV arxiv:1809.04159v3 [math.ca] 8 Nov 018 Abstract. We consider severa probems reated to the restriction

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

Reconstructions that Combine Cell Average Interpolation with Least Squares Fitting

Reconstructions that Combine Cell Average Interpolation with Least Squares Fitting App Math Inf Sci, No, 7-84 6) 7 Appied Mathematics & Information Sciences An Internationa Journa http://dxdoiorg/8576/amis/6 Reconstructions that Combine Ce Average Interpoation with Least Squares Fitting

More information