arxiv: v4 [stat.ml] 21 Dec 2016


Online Active Linear Regression via Thresholding

Carlos Riquelme, Stanford University
Ramesh Johari, Stanford University
Baosen Zhang, University of Washington

Abstract

We consider the problem of online active learning to collect data for regression modeling. Specifically, we consider a decision maker with a limited experimentation budget who must efficiently learn an underlying linear population model. Our main contribution is a novel threshold-based algorithm for selection of most informative observations; we characterize its performance and fundamental lower bounds. We extend the algorithm and its guarantees to sparse linear regression in high-dimensional settings. Simulations suggest the algorithm is remarkably robust: it provides significant benefits over passive random sampling in real-world datasets that exhibit high nonlinearity and high dimensionality, significantly reducing both the mean and variance of the squared error.

1 Introduction

This paper studies online active learning for estimation of linear models. Active learning is motivated by the premise that in many sequential data collection scenarios, labeling or obtaining output from observations is costly. Thus ongoing decisions must be made about whether to collect data on a particular unit of observation. Active learning has a rich history; see, e.g., [3, 6, 7, 8, 17].

As a motivating example, suppose that an online marketing organization plans to send display advertising promotions to a new target market. Their goal is to estimate the revenue that can be expected for an individual with a given covariate vector. Unfortunately, providing the promotion and collecting data on each individual is costly. Thus the goal of the marketing organization is to acquire first the most informative observations. They must do this in an online fashion: opportunities to display the promotion to individuals arrive sequentially over time.
In online active learning, this is achieved by selecting those observational units (target individuals in this case) that provide the most information to the model fitting procedure.

Linear models are ubiquitous in both theory and practice, often used even in settings where the data may exhibit strong nonlinearity, in large part because of their interpretability, flexibility, and simplicity. As a consequence, in practice, people tend to add a large number of features and interactions to the model, hoping to capture the right signal at the expense of introducing some noise. Moreover, the input space can be updated and extended iteratively after data collection if the decision maker feels predictions on a held-out set are not good enough. As a result, the number of covariates often becomes higher than the number of available observations. In those cases, selecting the subsequent most informative data is even more critical. Accordingly, our focus is on actively choosing observations for optimal prediction of the resulting high-dimensional linear models.

Our main contributions are as follows. We initially focus on standard linear models, and build the theory that we later extend to high-dimensional settings. First, we develop an algorithm that sequentially selects observations if they have sufficiently large norm, in an appropriate space (dependent on the data-generating distribution). Second, we provide a comprehensive theoretical analysis of our algorithm, including upper and lower bounds. We focus on minimizing mean squared prediction error (MSE), and show a high probability upper bound on the MSE of our approach (cf. Theorem 3.1). In addition, we provide a lower bound on the best possible achievable performance in high probability and expectation (cf. Section 4). In some distributional settings of interest we show that this lower bound structurally matches our upper bound, suggesting our algorithm is near-optimal.

The results above show that the improvement of active learning progressively weakens as the dimension of the data grows, and a new approach is needed. To tackle our original goal and address this degradation, under standard sparsity assumptions, we design an adaptive extension of the thresholding algorithm that initially devotes some budget to learn the sparsity pattern of the model, in order to subsequently apply active learning to the relevant lower dimensional subspace. We find that in this setting, the active learning algorithm provides significant benefit over passive random sampling. Theoretical guarantees are given in Theorem 3.3.

Finally, we empirically evaluate our algorithm's performance. Our tests on real-world data show our approach is remarkably robust: the gain of active learning remains significant even in settings that fall outside our theory. Our results suggest that the threshold-based rule may be a valuable tool to leverage in observation-limited environments, even when the assumptions of our theory may not exactly hold.

Active learning has mainly been studied for classification; see, e.g., [1, 2, 9, 10, 28]. For regression, see, e.g., [5, 18, 24] and the references within. A closely related work to our setting is [23]: they study online or stream-based active learning for linear regression, with random design. They propose a theoretical algorithm that partitions the space by stratification based on Monte-Carlo methods, where a recently proposed algorithm for linear regression [14] is used as a black box. It converges to the globally optimal oracle risk under possibly misspecified models (with suitable assumptions). Due to the relatively weak model assumptions, they achieve a constant gain over passive learning.
As we adopt stronger assumptions (well-specified model), we are able to achieve larger than constant gains, with a computationally simpler algorithm. Suppose covariate vectors are Gaussian with dimension $d$; the total number of observations is $n$; and the algorithm is allowed to label at most $k$ of them. Then, we beat the standard $\sigma^2 d/k$ MSE to obtain $\sigma^2 d^2 / \big(k\,[d + 2(\delta - 1)\log k]\big)$ when $n = k^\delta$, so active learning truly improves performance when $k = \Omega(\exp(d))$ or $\delta = \Omega(d)$. While [23] does not tackle high-dimensional settings, we overcome the exponential data requirements via $\ell_1$-regularization.

The remainder of the paper is organized as follows. We define our setting in Section 2. In Section 3, we introduce the algorithm and provide analysis of a corresponding upper bound. Lower bounds are given in Section 4. Simulations are presented in Section 5, and Section 6 concludes.

2 Problem Definition

The online active learning problem for regression is defined as follows. We sequentially observe $n$ covariate vectors in a $d$-dimensional space, $X_i \in \mathbb{R}^d$, which are i.i.d. When presented with the $i$-th observation, we must choose whether we want to label it or not, i.e., choose to observe the outcome. If we decide to label the observation, then we obtain $Y_i \in \mathbb{R}$. Otherwise, we do not see its label, and the outcome remains unknown. We can label at most $k$ out of the $n$ observations.

We assume covariates are distributed according to some known distribution $D$, with zero mean $\mathbb{E}X = 0$, and covariance matrix $\Sigma = \mathbb{E}XX^T$. We relax this assumption later. In addition, we assume that $Y$ follows a linear model: $Y = X^T\beta + \epsilon$, where $\beta \in \mathbb{R}^d$ and $\epsilon \sim N(0, \sigma^2)$ i.i.d. We denote observations by $X, X_i \in \mathbb{R}^d$, components by $X_j \in \mathbb{R}$, and sets in boldface: $\mathbf{X} \in \mathbb{R}^{k \times d}$, $\mathbf{Y} \in \mathbb{R}^k$.

After selecting $k$ observations, $(\mathbf{X}, \mathbf{Y})$, we output an estimate $\hat\beta \in \mathbb{R}^d$, with no intercept (see footnote 1). Our goal is to minimize the expected MSE of $\hat\beta$ in $\Sigma$ norm, i.e., $\mathbb{E}\|\hat\beta - \beta\|_\Sigma^2$, under random design; that is, when the $X_i$'s are random and the algorithm may be randomized. This is related to the A-optimality criterion [22].
We use the experimentation budget to minimize the variance of $\hat\beta$ by sampling $X$ from a different thresholded distribution. Minimizing expected MSE is equivalent to minimizing the trace of the normalized inverse of the Fisher information matrix $\mathbf{X}^T\mathbf{X}$,

$$\mathbb{E}\big[(Y - X^T\hat\beta)^2\big] = \mathbb{E}\big[\|\hat\beta - \beta\|_\Sigma^2\big] + \sigma^2 = \sigma^2\,\mathbb{E}\big[\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1})\big] + \sigma^2,$$

where expectations are over all sources of randomness. In this setting, the OLS estimator is the best linear unbiased estimator by the Gauss-Markov Theorem. Also, for any set $\mathbf{X}$ of $k$ i.i.d. observations, the OLS estimator $\hat\beta := \hat\beta_{\mathrm{OLS}}$ has sampling distribution $\hat\beta \mid \mathbf{X} \sim N(\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$ [13]. In Section 3.3, we tackle high-dimensionality, where $k \le d$, via Lasso estimators within a two-stage algorithm.

Footnote 1: We assume covariates and outcome are centered.

3 Algorithm and Main Results

In this section we motivate the algorithm, state the main result quantifying its performance for general distributions, and provide a high-level overview of the proof. A corollary for the Gaussian distribution is presented, and we also extend the algorithm by making the threshold adaptive. Finally, we show how to generalize the results to sparse linear regression. In Appendix E, we derive a CLT approximation with guarantees that is useful in complex or unknown distributional settings.

Without loss of generality, we assume that each observation is white, that is, $\mathbb{E}[XX^T]$ is the identity matrix. For correlated observations $X'$, we apply $X := D^{-1/2}U^TX'$ to whiten them, where $\Sigma = UDU^T$ (see Appendix A). Note that $\mathrm{Tr}(\Sigma(\mathbf{X}'^T\mathbf{X}')^{-1}) = \mathrm{Tr}((\mathbf{X}^T\mathbf{X})^{-1})$. We bound the whitened trace as

$$\frac{d}{\lambda_{\max}(\mathbf{X}^T\mathbf{X})} \le \mathrm{Tr}\big((\mathbf{X}^T\mathbf{X})^{-1}\big) \le \frac{d}{\lambda_{\min}(\mathbf{X}^T\mathbf{X})}. \qquad (1)$$

To minimize the expected MSE, we need to maximize the minimum eigenvalue of $\mathbf{X}^T\mathbf{X}$ with high probability. The thresholding procedure in Algorithm 1 maximizes the minimum eigenvalue of $\mathbf{X}^T\mathbf{X}$ through two observations. First, since the sum of the eigenvalues of $\mathbf{X}^T\mathbf{X}$ is the trace of $\mathbf{X}^T\mathbf{X}$, which is in turn the sum of the squared norms of the observations, the algorithm chooses observations of large (weighted) norm. Second, the eigenvalues of $\mathbf{X}^T\mathbf{X}$ should be balanced, that is, have similar magnitudes. This is achieved by selecting the appropriate weights for the norm.

Let $\xi \in \mathbb{R}^d_+$ be a vector of weights defining the norm $\|X\|_\xi^2 = \sum_{j=1}^d \xi_j X_j^2$. Let $\Gamma > 0$ be a threshold. Algorithm 1 simply selects the observations with $\xi$-weighted norm larger than $\Gamma$. The selected observations can be thought of as i.i.d. samples from an induced distribution $\bar D$: the original distribution conditional on $\|X\|_\xi \ge \Gamma$. Suppose $k$ observations are chosen and denoted by $\mathbf{X} \in \mathbb{R}^{k \times d}$. Then $\mathbb{E}\,\mathbf{X}^T\mathbf{X} = \sum_{i=1}^k \mathbb{E}\,X_iX_i^T = \sum_{i=1}^k H_i = kH$, where $H$ is the covariance matrix with respect to $\bar D$. This covariance matrix is diagonal under density symmetry assumptions, as thresholding preserves uncorrelation; its diagonal terms are

$$H_{jj} = \mathbb{E}_{\bar D} X_j^2 = \mathbb{E}_D\big[X_j^2 \mid \|X\|_\xi \ge \Gamma\big] =: \phi_j. \qquad (2)$$

Hence, $\lambda_{\min}(\mathbb{E}\,\mathbf{X}^T\mathbf{X}) = k \min_j \phi_j$, and $\lambda_{\max}(\mathbb{E}\,\mathbf{X}^T\mathbf{X}) = k \max_j \phi_j$. The main technical result in Theorem 3.1 is to link the eigenvalues of the random matrix $\mathbf{X}^T\mathbf{X}$ to its deterministic counterpart $\mathbb{E}\,\mathbf{X}^T\mathbf{X}$. From the above calculations, the goal is to find $(\xi, \Gamma)$ such that $\min_j \phi_j \approx \max_j \phi_j$, and both are as large as possible. The first objective is achieved when there exists some $\phi$ such that

$$\mathbb{E}_D\big[X_j^2 \mid \|X\|_\xi \ge \Gamma\big] = \phi_j = \phi, \quad \text{for all } j. \qquad (3)$$

We note that if $X$ has independent components with the same marginal distribution (after whitening), then it suffices to choose $\xi_j = 1$ for all $j$. It is necessary to choose unequal weights when the marginal distributions of the components are different, e.g., some are Gaussian and some are uniform, or components are dependent. For joint Gaussian, whitening removes dependencies, so we set $\xi_j = 1$.

3.1 Thresholding Algorithm

The algorithm is simple. For each incoming observation $X_i$ we compute its weighted norm $\|X_i\|_\xi$ (possibly after whitening if necessary). If the norm is above the threshold $\Gamma$, then we select the observation, otherwise we ignore it. We stop when we have collected $k$ observations. Note that random sampling is equivalent to setting $\Gamma = 0$. We want to catch the $k$ largest observations given our budget, therefore we require that $\Gamma$ satisfies

$$P_D\big(\|X\|_\xi \ge \Gamma\big) = k/n. \qquad (4)$$
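Condition (4) can be met numerically when $D$ is known or can be sampled. The sketch below is a minimal illustration, not the authors' implementation: it assumes white Gaussian covariates with equal weights $\xi_j = 1$, and the helper name `calibrate_threshold` is hypothetical. It sets $\Gamma$ to the empirical $(1 - k/n)$-quantile of the weighted norm over unlabeled reference draws.

```python
import numpy as np


def calibrate_threshold(reference_white, xi, k, n):
    """Return Gamma whose exceedance probability is about k / n.

    reference_white: (m, d) array of whitened, unlabeled draws from D.
    xi: (d,) nonnegative weights defining ||x||_xi^2 = sum_j xi_j * x_j^2.
    (Hypothetical helper, for illustration only.)
    """
    weighted_norms = np.sqrt((reference_white ** 2) @ xi)
    # The (1 - k/n)-quantile leaves a fraction k/n of the mass above it.
    return float(np.quantile(weighted_norms, 1.0 - k / n))


rng = np.random.default_rng(0)
d, k, n = 10, 100, 1000
xi = np.ones(d)                      # assumption: i.i.d. white components
reference = rng.standard_normal((200_000, d))
gamma = calibrate_threshold(reference, xi, k, n)

# Check on fresh draws that roughly a fraction k/n exceeds the threshold.
fresh = rng.standard_normal((200_000, d))
selected_fraction = float(np.mean(np.sqrt((fresh ** 2) @ xi) >= gamma))
print(gamma, selected_fraction)
```

For Gaussian data the squared norm is chi-square with $d$ degrees of freedom, so the same threshold could instead be read off a chi-square quantile; the Monte Carlo route is used here only because it carries over to other distributions.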

Algorithm 1 Thresholding Algorithm.
  Set $(\xi, \Gamma) \in \mathbb{R}^{d+1}$ satisfying (3) and (4).
  Set $S = \emptyset$.
  for observation $1 \le i \le n$ do
    Observe $X_i$.
    Compute $\tilde X_i = D^{-1/2}U^TX_i$.
    if $\|\tilde X_i\|_\xi > \Gamma$ or $k - |S| = n - i + 1$ then
      Choose $X_i$: $S = S \cup \{X_i\}$.
      if $|S| = k$ then
        break.
      end if
    end if
  end for
  Return OLS estimate $\hat\beta$ based on observations in $S$.

If we apply this rule to $n$ independent observations coming from $D$, on average we select $k$ of them: the $\xi$-largest. If $(\xi, \Gamma)$ is a solution to (3) and (4), then $(c\,\xi, \sqrt{c}\,\Gamma)$ is also a solution for any $c > 0$. So we require $\sum_i \xi_i = d$. Algorithm 1 can be seen as a regularizing process similar to ridge regression, where the amount of regularization depends on the distribution $D$ and the budget ratio $k/n$; it improves the conditioning of the problem.

Guarantees when $\Sigma$ is unknown can be derived as follows: we allocate an initial sequence of points to estimation of the inverse of the covariance matrix, and the remainder to labeling (where we no longer update our estimate). In this manner observations remain independent. Note that $O(d)$ observations are required for accurate recovery when $D$ is subgaussian, and $O(d \log d)$ if subexponential [26]. Errors from using the estimate to whiten and make decisions are bounded, and small with high probability (via Cauchy-Schwarz), and the result is equivalent to using a slightly worse threshold.

Algorithm 1b Adaptive Thresholding Algorithm.
  Set $S = \emptyset$.
  for observation $1 \le i \le n$ do
    Observe $X_i$, estimate $\hat\Sigma_i = \hat U_i \hat D_i \hat U_i^T$.
    Compute $\tilde X_i = \hat D_i^{-1/2}\hat U_i^TX_i$.
    Let $(\xi_i, \Gamma_i)$ satisfy (3) and (5).
    if $\|\tilde X_i\|_{\xi_i} > \Gamma_i$ or $k - |S| = n - i + 1$ then
      Choose $X_i$: $S = S \cup \{X_i\}$.
      if $|S| = k$ then
        break.
      end if
    end if
  end for
  Return OLS estimate $\hat\beta$ based on observations in $S$.

Algorithm 1 keeps the threshold fixed from the beginning, leading to a mathematically convenient analysis, as it generates i.i.d. observations. However, Algorithm 1b, which is adaptive and updates its parameters after each observation, produces slightly better results, as we empirically show in Appendix K.
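The fixed-threshold rule of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the covariance $\Sigma$ is taken as known, the data are Gaussian with $\xi_j = 1$, the threshold is calibrated from simulated chi-square draws, and the function names `threshold_select` and `trace_criterion` are hypothetical. It compares $\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1})$ for thresholded and passively sampled rows.

```python
import numpy as np


def threshold_select(X, Sigma, xi, gamma, k):
    """Indices labeled by the fixed-threshold rule, scanning rows of X in order."""
    n = X.shape[0]
    eigvals, U = np.linalg.eigh(Sigma)      # Sigma = U diag(eigvals) U^T
    whitener = U / np.sqrt(eigvals)         # row x maps to x @ U D^{-1/2}
    chosen = []
    for i in range(n):                      # i is 0-based, so n - i rows remain
        x_white = X[i] @ whitener
        weighted_norm = np.sqrt(np.sum(xi * x_white ** 2))
        # Forced selection once the remaining stream equals the remaining budget.
        must_take = (k - len(chosen)) == (n - i)
        if weighted_norm > gamma or must_take:
            chosen.append(i)
            if len(chosen) == k:
                break
    return np.array(chosen)


def trace_criterion(X_sel, Sigma):
    """Tr(Sigma (X^T X)^{-1}), the quantity bounded in the text."""
    return float(np.trace(Sigma @ np.linalg.inv(X_sel.T @ X_sel)))


rng = np.random.default_rng(0)
d, k, n = 5, 200, 5000
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                 # an arbitrary correlated covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

xi = np.ones(d)
# Whitened Gaussian squared norms are chi-square(d); take the (1 - k/n)-quantile.
gamma = float(np.sqrt(np.quantile(rng.chisquare(d, 200_000), 1.0 - k / n)))

active_idx = threshold_select(X, Sigma, xi, gamma, k)
trace_active = trace_criterion(X[active_idx], Sigma)
trace_random = trace_criterion(X[:k], Sigma)   # first k rows = passive sampling
print(len(active_idx), trace_active, trace_random)
```

Because the rows are i.i.d., taking the first $k$ of them stands in for random sampling. The forced-selection branch guarantees that exactly $k$ rows are labeled even if too few exceed the threshold.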
Before making a decision on $X_i$, Algorithm 1b finds $(\xi_i, \Gamma_i)$ satisfying (3) and

$$P_D\big(\|X_i\|_{\xi_i} \ge \Gamma_i\big) = \frac{k - |S_{i-1}|}{n - i + 1}, \qquad (5)$$

where $|S_{i-1}|$ is the number of observations already labeled. The idea is identical: set the threshold to capture, on average, the number of observations still to be labeled, that is $k - |S_{i-1}|$, out of the number still to be observed, $n - i + 1$.
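The adaptive target rate in (5) can be illustrated as follows. The sketch isolates the threshold update only: it assumes the observations are already white Gaussian with $\xi_j = 1$ (so it omits the online covariance estimate of Algorithm 1b), and it approximates the quantile of the squared norm from a reference sample. The name `adaptive_select` is hypothetical.

```python
import numpy as np


def adaptive_select(X_white, k, ref_sq_norms):
    """Threshold rule whose target rate is refreshed before every decision."""
    n = X_white.shape[0]
    ref_sorted = np.sort(ref_sq_norms)
    m = len(ref_sorted)
    chosen = []
    for i in range(n):                       # 0-based: n - i observations remain
        budget_left = k - len(chosen)
        stream_left = n - i
        rate = budget_left / stream_left     # right-hand side of (5)
        # Empirical (1 - rate)-quantile of the squared norm under D.
        idx = min(m - 1, int((1.0 - rate) * m))
        gamma_sq = ref_sorted[idx]
        if np.sum(X_white[i] ** 2) > gamma_sq or budget_left == stream_left:
            chosen.append(i)
            if len(chosen) == k:
                break
    return np.array(chosen)


rng = np.random.default_rng(0)
d, k, n = 5, 100, 3000
X_white = rng.standard_normal((n, d))
reference = rng.chisquare(d, 100_000)        # squared norms of white Gaussians
idx = adaptive_select(X_white, k, reference)
mean_sq_norm = float(np.mean(np.sum(X_white[idx] ** 2, axis=1)))
print(len(idx), mean_sq_norm)
```

When early observations happen to be small, the rate rises and the threshold drops, so the budget is less likely to be left to forced selections at the end of the stream.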

Importantly, active learning not only decreases the expected MSE, but also its variance. Since the variance of the MSE for fixed $\mathbf{X}$ depends on $\sum_j 1/\lambda_j(\mathbf{X}^T\mathbf{X})^2$ [13], it is also minimized by selecting observations that lead to large eigenvalues of $\mathbf{X}^T\mathbf{X}$.

3.2 Main Theorem

Theorem 3.1 states that by sampling $k$ observations from $\bar D$ where $(\xi, \Gamma)$ satisfy (3), the estimation performance is significantly improved, compared to randomly sampling $k$ observations from the original distribution. Section 4 shows the gain in Theorem 3.1 essentially cannot be improved and Algorithm 1 is optimal. A sketch of the proof is provided at the end of this section (see Appendix B).

Theorem 3.1 Let $n > k > d$. Assume observations $X \in \mathbb{R}^d$ are distributed according to subgaussian $D$ with covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. Also, assume marginal densities are symmetric around zero after whitening. Let $\mathbf{X}$ be a $k \times d$ matrix with $k$ observations sampled from the distribution induced by the thresholding rule with parameters $(\xi, \Gamma) \in \mathbb{R}^{d+1}_+$ satisfying (3). Let $\alpha > 0$, so that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$. Then, with probability at least $1 - 2\exp(-ct^2)$,

$$\mathrm{Tr}\big(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\big) \le \frac{d}{(1 - \alpha)^2\,\phi\,k}, \qquad (6)$$

where constants $c, C$ depend on the subgaussian norm of $\bar D$.

While Theorem 3.1 is stated in fairly general terms, we can apply the result to specific settings. We first present the Gaussian case where white components are independent. The proof is in Appendix D.

Corollary 3.2 If the observations in Theorem 3.1 are jointly Gaussian with covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, $\xi_j = 1$ for all $j = 1, \ldots, d$, and $\Gamma = C\sqrt{d + 2\log(n/k)}$, for some constant $C \ge 1$, then with probability at least $1 - 2\exp(-ct^2)$ we have that

$$\mathrm{Tr}\big(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\big) \le \frac{d}{(1 - \alpha)^2\,k\,\big(1 + 2\log(n/k)/d\big)}. \qquad (7)$$

The MSE of random sampling for white Gaussian data is proportional to $d/(k - d - 1)$, by the inverse Wishart distribution. Active learning provides a gain factor of order $1/(1 + 2\log(n/k)/d)$ with high probability (a very similar $1 - \alpha$ term shows up for random sampling).

Note that our algorithm may select fewer than $k$ observations.
Then, when the number of observations yet to be seen equals the remaining labeling budget, we should select all of them (equivalent to random sampling). The number of observations with $\|X\|_\xi > \Gamma$ has a binomial distribution and is highly concentrated around its mean $k$, with variance $k(1 - k/n)$. By the Chernoff bounds, the probability that the algorithm selects fewer than $k - C\sqrt{k}$ observations decreases exponentially fast in $C$. Thus, these deviations are dominated in the bound of Theorem 3.1 by the leading term. In practice, one may set the threshold in (4) by choosing $k(1 + \epsilon)$ observations for some small $\epsilon > 0$, or use the adaptive threshold in Algorithm 1b.

3.3 Sparsity and Regularization

The gain provided by active learning in our setting suffers from the curse of dimensionality, as it diminishes very fast when $d$ increases, and Section 4 shows the gain cannot be improved in general. For high-dimensional settings (where $k \le d$) we assume $s$-sparsity in $\beta$, that is, we assume the support of $\beta$ contains at most $s$ non-zero components, for some $s \ll d$. In Appendix J, we also provide related results for ridge regression.

We state the two-stage Sparse Thresholding Algorithm (see Algorithm 2) and show this algorithm effectively overcomes the curse of dimensionality. For simplicity, we assume the data is Gaussian, $D = N(0, \Sigma)$. Based, for example, on the results of [25] and Theorem 1 in [16], we could extend our results to subgaussian data via the Orthogonal Matching Pursuit algorithm for recovery. The two-stage algorithm works as follows. First, we focus on recovering the true support, $S = S(\beta)$, by selecting the very first $k_1$ observations (without thresholding), and computing the Lasso estimator $\hat\beta_1$. Second, we assign the weights $\xi$: for $i \in S(\hat\beta_1)$, we set $\xi_i = 1$, otherwise we set $\xi_i = 0$. Then, we apply the thresholding rule to select the remaining $k_2 = k - k_1$ observations. While observations are collected in all dimensions, our final estimate $\hat\beta_2$ is the OLS estimator computed only including the observations selected in the second stage, and exclusively in those dimensions in $S(\hat\beta_1)$.

Note that, in general, the points that end up being selected by our algorithm are informational outliers, while not necessarily geometric outliers in the original space. After applying the whitening transformation, ignoring some dimensions based on the Lasso results, and then thresholding based on a weighted norm possibly learnt from data (say, if components are not independent, and we recover the covariance matrix in an online fashion), the algorithm is able to identify good points for the underlying data distribution and $\beta$.

Algorithm 2 Sparse Thresholding Algorithm.
  Set $S_1 = \emptyset$, $S_2 = \emptyset$. Let $k = k_1 + k_2$, $n = k_1 + n_2$.
  for observation $1 \le i \le k_1$ do
    Observe $X_i$. Choose $X_i$: $S_1 = S_1 \cup \{X_i\}$.
  end for
  Set $\gamma = 1/2$, $\lambda = \sqrt{4\sigma^2\log(d)/(\gamma^2 k_1)}$.
  Compute Lasso estimate $\hat\beta_1$ on $S_1$, regularization $\lambda$.
  Set weights: $\xi_i = 1$ if $i \in S(\hat\beta_1)$, $\xi_i = 0$ otherwise.
  Set $\Gamma = C\sqrt{s + 2\log(n_2/k_2)}$.
  Factorize $\Sigma_{S(\hat\beta_1)S(\hat\beta_1)} = UDU^T$.
  for observation $k_1 + 1 \le i \le n$ do
    Observe $X_i \in \mathbb{R}^d$. Restrict to $X_i^S := X_i^{S(\hat\beta_1)} \in \mathbb{R}^s$.
    Compute $\tilde X_i^S = D^{-1/2}U^TX_i^S$.
    if $\|\tilde X_i^S\|_\xi > \Gamma$ or $k_2 - |S_2| = n - i + 1$ then
      Choose $X_i^S$: $S_2 = S_2 \cup \{X_i^S\}$.
      if $|S_2| = k_2$ then
        break.
      end if
    end if
  end for
  Return OLS estimate $\hat\beta_2$ based on observations in $S_2$.

Theorem 3.3 summarizes the performance of Algorithm 2; it requires the standard assumptions on $\Sigma$, $\lambda$ and $\min_i |\beta_i|$ for support recovery (see Theorem 3 in [27]).

Theorem 3.3 Let $D = N(0, \Sigma)$. Assume $\Sigma$, $\lambda$ and $\min_i |\beta_i|$ satisfy the standard conditions given in Theorem 3 of [27]. Assume we run the Sparse Thresholding algorithm with $k_1 = C' s \log d$ observations to recover the support of $\beta$, for an appropriate $C' \ge 0$. Let $\mathbf{X}_2$ be $k_2 = k - k_1$ observations sampled via thresholding on $S(\hat\beta_1)$. It follows that for $\alpha > 0$ such that $t = \alpha\sqrt{k_2} - C\sqrt{s} > 0$, there exist some universal constants $c_1, c_2$, and $c, C$ that depend on the subgaussian norm of $\bar D \mid S(\hat\beta_1)$, such that with probability at least

$$1 - 2e^{-\min\big(c_2 \min(s,\, \log(d - s)) - \log(c_1),\; ct^2 - \log(2)\big)}$$

it holds that

$$\mathrm{Tr}\big(\Sigma_{SS}(\mathbf{X}_2^T\mathbf{X}_2)^{-1}\big) \le \frac{s}{(1 - \alpha)^2\,k_2\,\big(1 + 2\log(n_2/k_2)/s\big)}.$$

Performance for random sampling with the Lasso estimator is $O(s \log d / k)$. A regime of interest is $s \ll d$, $k = C_1 s \log d$, and $n = C_2 d$, for large enough $C_1$, and $C_2 > 0$. In that case, Algorithm 2 leads to a bound of order smaller than $1/\log(d)$, as opposed to a weaker constant guarantee for random sampling. The gain is at least a $\log d$ factor with high probability. The proof is in Appendix H.

In practice, the performance of the algorithm is improved by using all the $k$ observations to fit the final estimate $\hat\beta_2$, as shown in simulations. However, in that case, observations are no longer i.i.d. Also, using thresholding to select the initial $k_1$ observations decreases the probability of making a mistake in support recovery. In Section 5 we provide simulations comparing different methods.
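The two stages can be sketched end to end as below. This is a rough illustration under simplifying assumptions, not the authors' implementation: the covariates are white Gaussian (so the factorization step is skipped), the Lasso is solved with a plain proximal-gradient (ISTA) loop rather than a dedicated solver, $C = 1$, and the estimated support size is used in place of $s$ in the threshold. The names `lasso_ista`, `sparse_threshold`, `y_oracle`, and `gamma_incoh` are hypothetical.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=2000):
    """Lasso via proximal gradient on (1/(2m))||y - Xb||^2 + lam * ||b||_1."""
    m, d = X.shape
    step = m / np.linalg.norm(X, 2) ** 2      # inverse Lipschitz constant
    b = np.zeros(d)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y)) / m
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b


def sparse_threshold(X, y_oracle, k1, k2, sigma, C=1.0):
    """Two-stage sketch for white Gaussian rows; y_oracle(i) reveals label i."""
    n, d = X.shape
    # Stage 1: label the first k1 rows and fit the Lasso.
    y1 = np.array([y_oracle(i) for i in range(k1)])
    gamma_incoh = 0.5
    lam = np.sqrt(4 * sigma ** 2 * np.log(d) / (gamma_incoh ** 2 * k1))
    support = np.flatnonzero(lasso_ista(X[:k1], y1, lam))
    # Stage 2: threshold the norm restricted to the estimated support.
    n2 = n - k1
    thresh = C * np.sqrt(len(support) + 2 * np.log(n2 / k2))
    chosen = []
    for i in range(k1, n):
        must_take = (k2 - len(chosen)) == (n - i)
        if np.linalg.norm(X[i, support]) > thresh or must_take:
            chosen.append(i)
            if len(chosen) == k2:
                break
    chosen = np.array(chosen)
    y2 = np.array([y_oracle(i) for i in chosen])
    # Final OLS uses only stage-2 rows and only the estimated support.
    coef, *_ = np.linalg.lstsq(X[np.ix_(chosen, support)], y2, rcond=None)
    beta_hat = np.zeros(d)
    beta_hat[support] = coef
    return beta_hat, support, chosen


rng = np.random.default_rng(0)
d, s, n, k1, k2, sigma = 500, 3, 5000, 200, 100, 0.2
beta = np.zeros(d)
beta[:s] = 1.0
X = rng.standard_normal((n, d))
noise = sigma * rng.standard_normal(n)
beta_hat, support, chosen = sparse_threshold(
    X, lambda i: X[i] @ beta + noise[i], k1, k2, sigma)
err = float(np.sum((beta_hat - beta) ** 2))
print(sorted(support.tolist()), len(chosen), err)
```

The oracle callback makes the labeling budget explicit: only $k_1 + k_2$ labels are ever requested, even though all $n$ covariate vectors are inspected.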

[Figure 1: Sparse Linear Regression (700 iters). Panel (a), zooming out (normalized MSE, max 0.575): Random Sampling, Algorithm 1, Algorithm 2. Panel (b), zooming in (normalized MSE, max 0.013): Random Sampling (Correct Support), Algorithm 1 (Correct Support), Algorithm 2, Algorithm 2 (all observations). We fix the effective dimension to $s = 7$, and increase the ambient dimension from $d = 100$ to $d = 500$. The budget scales as $k = Cs\log d$ for $C \approx 3.4$, while $n = 4d$. We set $k_1 = 2k/3$ and $k_2 = k/3$.]

3.4 Proof of Theorem 3.1

The complete proof of Theorem 3.1 is in Appendix B. We only provide a sketch here. The proof is a direct application of spectral results in [26], which are derived via a covering argument using a discrete net $\mathcal{N}$ on the unit Euclidean sphere $S^{d-1}$, together with a Bernstein-type concentration inequality that controls deviations of $\|\mathbf{X}w\|_2$ for each element $w \in \mathcal{N}$ in the net. Finally, a union bound is taken over the net. Importantly, the proof shows that if our algorithm uses $(\xi, \Gamma)$ which are approximate solutions to (3), then (10) still holds with $\min_j \mathbb{E}_{\bar D} X_j^2$ in the denominator of the RHS, instead of $\phi$. This fact can be quite useful in practice, when $F$ is unknown. We can devote some initial budget $X_1, \ldots, X_T$ to recover $F$, and then find $(\xi, \Gamma)$ approximately solving (3) and (4) under $\hat F$. Note that no labeling is required. Also, the result can be extended to subexponential distributions. In this case, the probabilistic bound will be weaker (including a $d$ term in front of the exponential). More generally, our probabilistic bounds are strongest when $k \ge C\,d\log d$ for some constant $C \ge 0$, a common situation in active learning [23], where super-linear requirements in $d$ seem unavoidable in noisy settings.

A simple bound for the parameter $\phi$ can be calculated as follows. Assume there exists $(\xi, \Gamma)$ such that $\phi_j = \phi$, and consider the weighted squared norm $Z_\xi = \sum_{j=1}^d \xi_j X_j^2$. Then $\mathbb{E}_{\bar D}[Z_\xi] = \sum_{j=1}^d \xi_j \mathbb{E}_{\bar D}[X_j^2] = \sum_{j=1}^d \xi_j \phi_j = d\phi$, and

$$\phi = \mathbb{E}_D\big[Z_\xi \mid Z_\xi \ge \Gamma^2\big]/d \ge \Gamma^2/d = F^{-1}_{Z_\xi}(1 - k/n)/d,$$

which implies that $1/\lambda_{\min}(\mathbb{E}\,\mathbf{X}^T\mathbf{X}) = 1/(k\phi) \le d/(k\Gamma^2)$. For specific distributions, $\Gamma^2/d$ can be easily computed.
The last inequality is close to equality in cases where the conditional density decays extremely fast for values of $\sum_{j=1}^d \xi_j X_j^2$ above $\Gamma^2$. Heavy-tailed distributions allocate mass to significantly higher values, and $\phi$ could be much larger than $\Gamma^2/d$.

4 Lower Bound

In this section we derive a lower bound for the $k > d$ setting. Suppose all the data are given. Again choose the $k$ observations with largest norms, denoted by $\mathbf{X}$. To minimize the prediction error, the best possible $\mathbf{X}^T\mathbf{X}$ is diagonal, with identical entries, and trace equal to the sum of the squared norms. No selection algorithm, online or offline, can do better. Algorithm 1 achieves this by selecting observations with large norms and uncorrelated entries (through whitening if necessary). Theorem 4.1 captures this intuition.
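The intuition above rests on a standard inequality: for a positive definite $d \times d$ matrix $A$ with eigenvalues $\lambda_j$, the arithmetic-harmonic mean inequality gives $\mathrm{Tr}(A^{-1}) = \sum_j 1/\lambda_j \ge d^2/\sum_j \lambda_j = d^2/\mathrm{Tr}(A)$, with equality only when all eigenvalues coincide. Since $\mathrm{Tr}(\mathbf{X}^T\mathbf{X})$ is the sum of squared norms of the selected rows, no subset of size $k$ can beat $d^2$ divided by the sum of the $k$ largest squared norms. The short numerical check below assumes white Gaussian observations; the helper name `trace_inv` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 4, 50, 2000
X_all = rng.standard_normal((n, d))          # white observations
sq_norms = np.sum(X_all ** 2, axis=1)


def trace_inv(rows):
    """Tr((X^T X)^{-1}) for the selected rows of X_all."""
    Xs = X_all[rows]
    return float(np.trace(np.linalg.inv(Xs.T @ Xs)))


top = np.argsort(sq_norms)[-k:]              # the k largest-norm observations
offline_bound = d ** 2 / float(np.sum(sq_norms[top]))

trace_top = trace_inv(top)
trace_rand = trace_inv(rng.choice(n, size=k, replace=False))
print(offline_bound, trace_top, trace_rand)
```

Both selections sit above the offline bound; the gap for the top-norm subset reflects how far its eigenvalues are from being perfectly balanced.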

[Figure 2: MSE of $\hat\beta_{\mathrm{OLS}}$ for Algorithm 1b and Random Sampling; normalized MSE versus $n$ (x 1000). Panels: (a) Protein Structure, 150 iters; (b) Bike Sharing, 300 iters; (c) YearPredictionMSD, 150 iters. The (0.05, 0.95) quantile confidence intervals are displayed. Solid: median; dashed: mean.]

Theorem 4.1 Let $A$ be an algorithm for the problem we described in Section 2. Then,

$$\mathbb{E}_A\,\mathrm{Tr}\big(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\big) \ge \frac{d^2}{\mathbb{E}\big[\sum_{i=1}^k \|X_{(i)}\|^2\big]} \ge \frac{d}{k\,\mathbb{E}\big[\frac{1}{d}\max_{i \in [n]} \|X_i\|^2\big]}, \qquad (8)$$

where $X_{(i)}$ is the white observation with the $i$-th largest norm. Moreover, fix $\alpha \in (0, 1)$. Let $F$ be the cdf of $\max_{i \in [n]} \|X_i\|^2$. Then, $\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge d^2/\big(k\,F^{-1}(1 - \alpha)\big)$ with probability at least $1 - \alpha$.

The proof is in Appendix E. The upper bound in Theorem 3.1 has a similar structure, with denominator equal to $k\phi$. By Theorem 3.1, $\phi = \mathbb{E}_D[X_j^2 \mid \|X\|_\xi^2 \ge \Gamma^2]$ for every component $j$. Hence, summing over all components: $\phi = \mathbb{E}_{\bar D}\big[\|X\|^2/d\big]$. The latter expectation is taken with respect to $\bar D$, which only captures the expected $\xi$-largest $k$ observations out of $n$, as opposed to $\mathbb{E}_D\big[(1/k)\sum_{i=1}^k \|X_{(i)}\|^2/d\big]$ in (8). The weights $\xi$ simply account for the fact that, in reality, we cannot make all components have equal norm, something we implicitly assume in our lower bound.

We specialize the lower bound to the Gaussian setting, for which we computed the upper bound of Theorem 3.1. The proofs are based on the Fisher-Tippett Theorem and the Gumbel distribution; see Appendix F.

Corollary 4.2 For Gaussian observations $X_i \sim N(0, \Sigma)$ and large $n$, for any algorithm $A$,

$$\mathbb{E}_A\,\mathrm{Tr}\big(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\big) \ge \frac{d}{k\left(\frac{2\log n}{d} + \log\log n\right)}.$$

Moreover, let $\alpha \in (0, 1)$. Then, for any $A$, with probability at least $1 - \alpha$ and $C = 2\log\Gamma(d/2)/d$,

$$\mathrm{Tr}\big(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\big) \ge \frac{d}{k\left(\frac{2\log n}{d} + \log\log n - \frac{1}{d}\log\log\frac{1}{1 - \alpha} - C\right)}.$$

The results from Corollary 3.2 have the same structure as the lower bound; hence in this setting our algorithm is near optimal. Similar results and conclusions are derived for the CLT approximation in Appendix I.
5 Simulations

We conducted experiments in various settings: regularized estimators in high dimensions, and the basic thresholding approach on real-world data to explore its performance in strongly non-linear environments.

Regularized Estimators. We compare the performance in high-dimensional settings of random sampling and Algorithm 1 (both with an appropriately adjusted Lasso estimator) against Algorithm 2, which takes into account the structure of the problem ($s \ll d$). For completeness, we also show the performance of Algorithm 2 when all observations are included in the final OLS estimate, and that of random sampling (RS) and Algorithm 1 (Thr) when the true support $S$ is known in advance, and the OLS computed on $S$. In Figure 1 (a), we see that Algorithm 2 dramatically reduces the MSE, while in Figure 1 (b) we zoom in to see that, quite remarkably, Algorithm 2 using all observations for the final estimate outperforms random sampling that knows the sparsity pattern in hindsight. We use $k_1 = (2/3)k$ for recovery. More experiments are provided in Appendix K.

Real-World Data. We show the results of Algorithm 1b (online $\Sigma$ estimation) with the simplest distributional assumption (Gaussian threshold, $\xi_j = 1$) versus random sampling on publicly available real-world datasets (UCI, [20]), measuring test squared prediction error. We fix a sequence of values of $n$, together with $k = \sqrt{n}$, and for each pair $(n, k)$ we run a number of iterations. In each one, we randomly split the dataset into training ($n$ observations, random order), and test (the rest of them). Finally, $\hat\beta_{\mathrm{OLS}}$ is computed on selected observations, and the prediction error estimated on the test set. All datasets are initially centered to have zero means (covariates and response). Confidence intervals are provided.

We first analyze the Physicochemical Properties of Protein Tertiary Structure dataset, where we predict the size of the residue, based on $d = 9$ variables, including the total surface area of the protein and its molecular mass. Figure 2 (a) shows the results; Algorithm 1b outperforms random sampling for all values of $(n, k)$. The reduction in variance is substantial. In the Bike Sharing dataset [12] we predict the number of hourly users of the service, given weather conditions, including temperature, wind speed, humidity, and temporal covariates. We use $d = 12$ covariates. Our estimator has lower mean, median and variance of the MSE than random sampling; see Figure 2 (b).
Finally, for the YearPredictionMSD dataset [4], we predict the year a song was released based on $d = 90$ covariates, mainly metadata and audio features. The MSE and variance did strongly improve; see Figure 2 (c).

In the examples we see that, while active learning leads to strong improvements in MSE and variance reduction for moderate values of $k$ with respect to $d$, the gain vanishes when $k$ grows large. This was expected; the reason might be that by sampling so many outliers, we end up learning about parts of the space where heavy non-linearities arise, which may not be important to the test distribution. However, the motivation of active learning is situations of limited labeling budget, and hybrid approaches combining random sampling and thresholding could be easily implemented if needed.

6 Conclusion

Our paper provides a comprehensive analysis of thresholding algorithms for online active learning of linear regression models, which are shown to perform well both theoretically and empirically. Several natural open directions suggest themselves. Additional robustness could be guaranteed in other settings by combining our algorithm as a black box with other approaches: for example, some addition of random sampling or stratified sampling could be used to determine if significant nonlinearity is present, and to determine the fraction of observations that are collected via thresholding.

7 Acknowledgments

The authors would like to thank Sven Schmit for his excellent comments and suggestions, Mohammad Ghavamzadeh for fruitful discussions, and the anonymous reviewers for their valuable feedback. We gratefully acknowledge support from the National Science Foundation under grants CMMI, CNS, and CNS.

References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning. ACM.
[2] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Learning Theory. Springer.
[3] M.-F. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3).

[4] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset.
[5] W. Cai, Y. Zhang, and J. Zhou. Maximizing expected model change for active learning in regression. In Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE.
[6] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. Pages 5-19.
[7] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2).
[8] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research.
[9] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning. ACM.
[10] S. Dasgupta, C. Monteleoni, and D. J. Hsu. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems.
[11] P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling extremal events, volume 33. Springer Science & Business Media.
[12] H. Fanaee-T and J. Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 1-15.
[13] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67.
[14] D. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 37-45.
[15] T. Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30(2).
[16] A. Joseph. Variable selection in high-dimension with random designs and orthogonal matching pursuit. Journal of Machine Learning Research, 14(1).
[17] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11.
[18] A. Krause and C. Guestrin. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proceedings of the 24th International Conference on Machine Learning. ACM.
[19] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics.
[20] M. Lichman. UCI machine learning repository.
[21] K. B. Petersen et al. The matrix cookbook.
[22] F. Pukelsheim. Optimal design of experiments, volume 50. SIAM.
[23] S. Sabato and R. Munos. Active regression by stratification. In Advances in Neural Information Processing Systems.
[24] M. Sugiyama and S. Nakajima. Pool-based active learning in approximate linear regression. Machine Learning, 75(3).
[25] J. Tropp and A. C. Gilbert. Signal recovery from partial information via orthogonal matching pursuit.
[26] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint.

[27] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5).
[28] Y. Wang and A. Singh. Noise-adaptive margin-based active learning and lower bounds under Tsybakov noise condition. arXiv preprint.

Appendix

A Whitening

Before thresholding the norm of incoming observations, it is useful to decorrelate and standardize their components, i.e., to whiten the data. Then, we apply the algorithm to uncorrelated covariates, with zero mean and unit variance (not necessarily independent). The covariance matrix $\Sigma$ can be decomposed as $\Sigma = UDU^T$, where $U$ is orthogonal, and $D$ is diagonal with $d_{ii} = \lambda_i(\Sigma)$. We whiten each observation to $\bar X = D^{-1/2}U^T X \in \mathbb{R}^{d \times 1}$ (while for $\mathbf{X} \in \mathbb{R}^{k \times d}$, $\bar{\mathbf{X}} = \mathbf{X} U D^{-1/2}$), so that $E\,\bar X \bar X^T = I$. We denote whitened observations by $\bar X$ and $\bar{\mathbf{X}}$ in the appendix. After some algebra we see that

$$\frac{d}{\lambda_{\max}(\bar{\mathbf{X}}^T\bar{\mathbf{X}})} \le \mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{Tr}((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1}) \le \frac{d}{\lambda_{\min}(\bar{\mathbf{X}}^T\bar{\mathbf{X}})}. \tag{9}$$

We focus on algorithms that maximize the minimum eigenvalue of $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$ with high probability, or, in general, that lead to large and even eigenvalues of $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$.

B Proof of Theorem 3.1

Theorem B.1 Let $n > k > d$. Assume observations $X \in \mathbb{R}^d$ are distributed according to subgaussian $\mathcal{D}$ with covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. Also, assume marginal densities are symmetric around zero after whitening. Let $\mathbf{X}$ be a $k \times d$ matrix with $k$ observations sampled from the distribution induced by the thresholding rule with parameters $(\xi, \Gamma) \in \mathbb{R}^{d+1}_+$ satisfying (3). Let $\alpha > 0$, so that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$. Then, with probability at least $1 - 2\exp(-ct^2)$,

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \le \frac{d}{(1-\alpha)^2\,\phi\,k}, \tag{10}$$

where the constants $c, C$ depend on the subgaussian norm of $\mathcal{D}$.

Proof We would like to choose $k$ out of $n$ observations $X_1, \ldots, X_n \sim F$ iid. Assume our sampling induces a new distribution $\bar F$. The loss we want to minimize for our OLS estimate $\hat\beta = \hat\beta(\mathbf{X}, Y)$ is

$$E_{(\mathbf{X},Y) \sim \bar F,\; X \sim F}\left[\left(X^T\hat\beta - X^T\beta^*\right)^2\right] = \sigma^2\, E_{(\mathbf{X},Y)\sim \bar F}\left[\mathrm{Tr}\left(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\right)\right], \tag{11}$$

where we assume Gaussian noise with variance $\sigma^2$. Let us see how we construct $\bar F$. We sample $X \sim F$, we whiten the observation, $Z = \Sigma^{-1/2}X$, and then we select it or not according to a fixed thresholding rule. If $\|Z\|_\xi \ge \Gamma$, then we keep $X = \Sigma^{1/2}Z$. We choose $\xi$ and $\Gamma$ so that there exists $\phi > 0$ such that, for all $i = 1, \ldots, d$,

$$E_{W \sim \tilde F}\left[W(i)^2\right] = \phi, \tag{12}$$

where $W(i)$ denotes the $i$-th component of $W \sim \tilde F$, and $\tilde F$ is the distribution of the whitened observations that pass the test. Note that $\tilde F = \tilde F(\xi, \Gamma)$. $Z$ is a linear transformation of $X$; $W$ is not a linear transformation of $Z$. Moreover, the covariance matrix of $\tilde F$ is $\Sigma_{\tilde F} = \phi I$. If $F$ is a general subgaussian distribution, thresholding could change the mean away from zero.
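The whitening and selection steps described above can be sketched numerically. This is only an illustrative sketch, not the authors' implementation: the function names, the covariance matrix, and the threshold value `Gamma=2.5` are hypothetical, the weighted norm is taken to be $\|z\|_\xi^2 = \sum_j \xi_j z_j^2$, and whitening uses the eigendecomposition $\Sigma = UDU^T$ from Appendix A.

```python
import numpy as np

def whiten(X, Sigma):
    """Whiten the rows of X (shape (m, d)) as X U D^{-1/2}, with Sigma = U D U^T."""
    eigvals, U = np.linalg.eigh(Sigma)        # Sigma is symmetric positive definite
    return X @ U / np.sqrt(eigvals)           # column j is scaled by eigvals[j]^{-1/2}

def threshold_select(X, Sigma, xi, Gamma):
    """Keep the rows of X whose whitened weighted norm is at least Gamma."""
    Z = whiten(X, Sigma)
    weighted_norm = np.sqrt((Z ** 2) @ xi)    # assumed form: ||z||_xi^2 = sum_j xi_j z_j^2
    return X[weighted_norm >= Gamma]

rng = np.random.default_rng(0)
d, n = 3, 20000
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)               # hypothetical covariance matrix
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

Z = whiten(X, Sigma)
emp_cov = Z.T @ Z / n                         # close to the identity after whitening

selected = threshold_select(X, Sigma, xi=np.ones(d), Gamma=2.5)
W = whiten(selected, Sigma)                   # whitened observations that passed the test
phi_hat = (W ** 2).mean(axis=0)               # empirical per-coordinate second moment
```

In this Gaussian example with uniform weights, the entries of `phi_hat` should be roughly equal and noticeably larger than 1, which is the role $\phi$ plays in (12).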

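Assume that, after running our algorithm, we end up with $\mathbf{X} \in \mathbb{R}^{k \times d}$. We denote by $\mathbf{W}$ the observations after whitening; note that by design every $w \in \mathbf{W}$ passed our test: $\|w\|_\xi \ge \Gamma$. In other words, $w \sim \tilde F$. We see that $\mathbf{W} = \mathbf{X}\Sigma^{-1/2}$ or, alternatively, $\mathbf{X} = \mathbf{W}\Sigma^{1/2}$. Now, we can derive

$$\mathrm{Tr}\left(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\right) = \mathrm{Tr}\left(\Sigma\left(\Sigma^{1/2}\mathbf{W}^T\mathbf{W}\Sigma^{1/2}\right)^{-1}\right) \tag{13}$$
$$= \mathrm{Tr}\left(\left(\Sigma^{-1/2}\Sigma^{1/2}\mathbf{W}^T\mathbf{W}\Sigma^{1/2}\Sigma^{-1/2}\right)^{-1}\right) \tag{14}$$
$$= \mathrm{Tr}\left(\left(\mathbf{W}^T\mathbf{W}\right)^{-1}\right) \tag{15}$$
$$= \mathrm{Tr}\left(\left(\Sigma_{\tilde F}^{1/2}\,\Sigma_{\tilde F}^{-1/2}\mathbf{W}^T\mathbf{W}\Sigma_{\tilde F}^{-1/2}\,\Sigma_{\tilde F}^{1/2}\right)^{-1}\right) \tag{16}$$
$$= \mathrm{Tr}\left(\Sigma_{\tilde F}^{-1}\left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)^{-1}\right) \tag{17}$$
$$= \mathrm{Tr}\left((\phi I)^{-1}\left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)^{-1}\right) \tag{18}$$
$$= \frac{1}{\phi}\,\mathrm{Tr}\left(\left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)^{-1}\right) \tag{19}$$
$$\le \frac{d}{\phi}\,\frac{1}{\lambda_{\min}\left(\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)} = \frac{d}{\phi k}\,\frac{1}{\lambda_{\min}\left(\frac{1}{k}\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right)}, \tag{20}$$

where $\bar{\mathbf{W}} = \mathbf{W}\Sigma_{\tilde F}^{-1/2}$ is actually white data. Thus, note that $\bar{\mathbf{W}}^T\bar{\mathbf{W}}/k \to I$ as $k \to \infty$.

Assume that $F$ is subgaussian such that if $k > d$, then $\mathbf{X}^T\mathbf{X}$ has full rank with probability one. Thresholding will not change the shape of the tails of the distribution, so $\tilde F$ will also be subgaussian. At this point, we need to measure how fast $\lambda_{\min}(\bar{\mathbf{W}}^T\bar{\mathbf{W}})/k$ goes to 1. We can use Theorem 5.39 in [26], which guarantees that, for $\alpha > 0$ such that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$, with probability at least $1 - 2\exp(-ct^2)$ we have

$$\lambda_{\min}\left(\frac{1}{k}\bar{\mathbf{W}}^T\bar{\mathbf{W}}\right) \ge (1-\alpha)^2, \tag{21}$$

as $\bar{\mathbf{W}}$ is white subgaussian. It follows that for $\alpha > C\sqrt{d/k}$, with probability at least $1 - 2\exp(-ct^2)$,

$$\mathrm{Tr}\left(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}\right) \le \frac{d}{(1-\alpha)^2\,\phi\,k}. \tag{22}$$

Note that $1/(1-\alpha) \approx 1 + O(\sqrt{d/k})$.

C Proof of $\mathrm{Tr}(X^{-1}) \ge \mathrm{Tr}(\mathrm{Diag}(X)^{-1})$

In order to justify that we want $S = \bar{\mathbf{X}}^T\bar{\mathbf{X}}$ to be as close as possible to diagonal, we show the following lemma. Under our assumptions $S$ is symmetric positive definite with probability 1.

Lemma C.1 Let $X$ be an $n \times n$ symmetric positive definite matrix. Then,

$$\mathrm{Tr}(X^{-1}) \ge \mathrm{Tr}(\mathrm{Diag}(X)^{-1}), \tag{23}$$

where $\mathrm{Diag}(\cdot)$ returns a diagonal matrix with the same diagonal as the argument. In other words, we show that among all positive definite matrices with the same diagonal elements, the diagonal matrix (the matrix with all off-diagonal elements equal to 0) has the least trace after the inverse operation.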
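Before the proof, the inequality in Lemma C.1 can be checked numerically on random positive definite matrices. The following is a small sanity check under assumed test matrices (random Gram matrices $B^TB$, a hypothetical choice), not part of the original argument.

```python
import numpy as np

def trace_inverse(M):
    """Tr(M^{-1}) for an invertible matrix M."""
    return np.trace(np.linalg.inv(M))

def trace_inverse_of_diagonal(M):
    """Tr(Diag(M)^{-1}): sum of reciprocals of the diagonal entries of M."""
    return np.sum(1.0 / np.diag(M))

rng = np.random.default_rng(1)
gaps = []
for _ in range(200):
    n = rng.integers(2, 8)                    # matrix size between 2 and 7
    B = rng.normal(size=(n + 5, n))
    S = B.T @ B                               # positive definite with probability one
    gaps.append(trace_inverse(S) - trace_inverse_of_diagonal(S))

min_gap = min(gaps)                           # the lemma says this is non-negative

D = np.diag([1.0, 2.0, 4.0])                  # equality holds for a diagonal matrix
diag_gap = trace_inverse(D) - trace_inverse_of_diagonal(D)
```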

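Proof We show this by induction. Consider a $2 \times 2$ matrix

$$X = \begin{bmatrix} a & b \\ b & c \end{bmatrix} \tag{24}$$

and

$$\mathrm{Tr}(X^{-1}) = \frac{1}{ac - b^2}\,(a + c). \tag{25}$$

Since $ac - b^2 > 0$ ($X$ is positive definite), the above expression is minimized when $b^2 = 0$, that is, when $X$ is diagonal.

Assume the statement is true for all $n \times n$ matrices. Let $X$ be an $(n+1) \times (n+1)$ positive definite matrix. Decompose it as

$$X = \begin{bmatrix} A & b \\ b^T & c \end{bmatrix}. \tag{26}$$

By the block inverse formula (see for example [21]),

$$\mathrm{Tr}(X^{-1}) = \mathrm{Tr}(A^{-1}) + \frac{1}{\kappa} + \frac{1}{\kappa}\,\mathrm{Tr}(A^{-1}bb^TA^{-1}), \tag{27}$$

where $\kappa = c - b^TA^{-1}b$. Note that $\kappa > 0$ by Schur's complement for positive definite matrices. Using the induction hypothesis, $\mathrm{Tr}(A^{-1}) \ge \mathrm{Tr}(\mathrm{Diag}(A)^{-1})$. By the positive definiteness of $A$, $b^TA^{-1}b \ge 0$, and therefore $\frac{1}{\kappa} \ge \frac{1}{c}$. Also, $\mathrm{Tr}(A^{-1}bb^TA^{-1}) \ge 0$. Thus,

$$\mathrm{Tr}(X^{-1}) \ge \mathrm{Tr}(A^{-1}) + \frac{1}{c} \ge \mathrm{Tr}(\mathrm{Diag}(A)^{-1}) + \frac{1}{c} = \mathrm{Tr}(\mathrm{Diag}(X)^{-1}), \tag{28}$$

and the result follows.

D Proof of Corollary 3.2

Corollary D.1 If the observations in Theorem 3.1 are jointly Gaussian with covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, $\xi_j = 1$ for all $j = 1, \ldots, d$, and $\Gamma = C\sqrt{d + 2\log(n/k)}$ for some constant $C \ge 1$, then with probability at least $1 - 2\exp(-ct^2)$ we have that

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \le \frac{d}{(1-\alpha)^2\,k\,\left(1 + \frac{2\log(n/k)}{d}\right)}. \tag{29}$$

Proof We have to show that $\xi_j = 1$ for all $j$, and $\Gamma = C\sqrt{d + 2\log(n/k)}$ satisfy the equations

$$P_{\mathcal{D}}\left(\|\bar X\|_\xi \ge \Gamma\right) = \alpha = \frac{k}{n}, \tag{30}$$
$$E_{\mathcal{D}}\left[\bar X_j^2 \,\middle|\, \|\bar X\|_\xi^2 \ge \Gamma^2\right] = \phi, \quad \text{for all } j, \tag{31}$$

and $\phi > \log(n/k)/d$. The components of $\bar X$ are independent, as observations are jointly Gaussian. It immediately follows that $\xi_j = 1$ for all $1 \le j \le d$. Thus,

$$Z_\xi = \sum_{j=1}^{d} \bar X_j^2 \sim \chi^2_d, \qquad \Gamma^2 = F^{-1}_{\chi^2_d}\left(1 - \frac{k}{n}\right). \tag{32}$$

The value of $Z_\xi$ is strongly concentrated around its mean, $E Z_\xi = d$. We now use two tail approximations to obtain our desired result. By [19], we have that

$$P\left(Z_\xi - d \ge 2\sqrt{dx} + 2x\right) \le \exp(-x). \tag{33}$$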

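If we take $\exp(-x) = \alpha$, then $x = \log(n/k)$. In this case, we conclude that

$$P\left(Z_\xi \ge d + 2\sqrt{d\log\left(\frac{n}{k}\right)} + 2\log\left(\frac{n}{k}\right)\right) \le \alpha = \frac{k}{n}. \tag{34}$$

Note that $P(\|\bar X\|_\xi > \Gamma) = P(Z_\xi > \Gamma^2) = \alpha$. Therefore, by definition,

$$\Gamma^2 \le d + 2\sqrt{d\log\left(\frac{n}{k}\right)} + 2\log\left(\frac{n}{k}\right). \tag{35}$$

On the other hand, we would like to show that

$$P\left(Z_\xi \ge d + 2\log\left(\frac{n}{k}\right)\right) \ge \alpha, \tag{36}$$

as that would directly imply that $\Gamma \ge \sqrt{d + 2\log(n/k)}$. We can use Proposition 3.1 of [15]. For $d > 2$ and $x > d - 2$,

$$P(Z_\xi \ge x) \ge \frac{1 - e^{-2}}{2}\,\frac{x}{x - d + 2\sqrt{d}}\,\exp\left\{-\frac{1}{2}\left(x - d - (d-2)\log\left(\frac{x}{d}\right) + \log d\right)\right\}.$$

Take $x = d + 2\psi$, where $\psi = \log(n/k)$. It follows that

$$P(Z_\xi \ge d + 2\psi) \ge \frac{1 - e^{-2}}{2}\,\frac{d + 2\psi}{2(\sqrt{d} + \psi)}\,\exp\left\{-\frac{1}{2}\left(2\psi - (d-2)\log\left(1 + \frac{2\psi}{d}\right) + \log d\right)\right\}$$
$$= \frac{1 - e^{-2}}{2}\,\frac{d + 2\psi}{2(\sqrt{d} + \psi)}\,\exp\left\{\frac{d-2}{2}\log\left(1 + \frac{2\psi}{d}\right) - \frac{1}{2}\log d\right\}\exp\{-\psi\} \ge \exp\{-\psi\}, \tag{37}$$

where we assume, for example, $d \ge 9$ and $n/k > 17$ (as in Proposition 5.1 of [15]). In any case, in those rare cases (in our context) where $d < 9$ and $n/k$ is very small, the previous bound still holds if we subtract a small constant $C \in [0, 5/2]$ from the LHS: $P(Z_\xi \ge d + 2\psi - C)$. Equivalently, from (37),

$$P\left(Z_\xi \ge d + 2\log(n/k)\right) \ge k/n = \alpha. \tag{38}$$

We conclude that

$$d + 2\log\left(\frac{n}{k}\right) \le \Gamma^2 \le d + 2\log\left(\frac{n}{k}\right) + 2\sqrt{d\log\left(\frac{n}{k}\right)}. \tag{39}$$

Finally, we have that

$$\phi \ge \frac{\Gamma^2}{d} \ge 1 + \frac{2\log(n/k)}{d}. \tag{40}$$

By Theorem B.1, the corollary follows.

E CLT Approximation

As we explained in the main text, it is sometimes difficult to directly compute the distribution of the $\xi$-norm of a white observation, given by $Z_\xi$. Recall that $\Gamma^2 = F^{-1}_{Z_\xi}(1 - k/n)$. Fortunately, $Z_\xi$ is the sum of random variables, and, in high-dimensional spaces, a CLT approximation can help us to choose a good threshold. In this section we derive some theoretical guarantees. The CLT is a good idea for bounded variables (as the square is still bounded, and therefore subgaussian), but if the underlying components $\bar X_j$ are unbounded subgaussian, $Z_\xi$ will be at least subexponential (as the square of a subgaussian random variable is subexponential, [26]), and a higher threshold (like that coming from chi-squared) is more appropriate.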
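To illustrate the last remark for Gaussian components, the sketch below compares a Monte Carlo estimate of the chi-squared threshold $\Gamma^2 = F^{-1}_{\chi^2_d}(1 - k/n)$ with a normal approximation that matches the mean $d$ and variance $2d$ of $Z_\xi$ under unit weights, and it also checks the two-sided bound (39). The values `d = 10` and `ratio = 0.01` (standing in for $k/n$) are hypothetical choices made for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)
d, ratio = 10, 0.01                           # ratio plays the role of k/n

# Monte Carlo estimate of the (1 - k/n)-quantile of a chi-squared with d degrees of freedom.
chi2_samples = rng.chisquare(d, size=400_000)
gamma_sq_chi2 = np.quantile(chi2_samples, 1 - ratio)

# Normal approximation with mean d and variance 2d (Gaussian components, unit weights).
gamma_sq_clt = d + NormalDist().inv_cdf(1 - ratio) * np.sqrt(2 * d)

# Lower and upper bounds on Gamma^2 from (39).
log_term = np.log(1 / ratio)
lower_39 = d + 2 * log_term
upper_39 = d + 2 * log_term + 2 * np.sqrt(d * log_term)
```

In this setting the normal approximation gives a smaller threshold than the chi-squared quantile, so it would select more than a fraction $k/n$ of the observations.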

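In addition, in the context of heavy tails, catastrophic effects are expected, as $P(\max_j \bar X_j > t) \sim P(\sum_j \bar X_j > t)$, leading to observations dominated by single dimensions.

Assume that the components $\bar X_j$ are independent (while not necessarily identically distributed). By Lyapunov's CLT, one can show that

$$Z_\xi = \sum_{j=1}^{d} \xi_j \bar X_j^2 \;\to\; N\left(d,\; \sum_{j=1}^{d}\xi_j^2\left(E[\bar X_j^4] - 1\right)\right).$$

(Some mild additional moment/regularity conditions on each $\bar X_j$ are required to satisfy Lyapunov's condition.) It follows that $\Gamma$ satisfies $P_{\mathcal{D}}(\|\bar X_i\|_\xi \ge \Gamma) = k/n$ if

$$\Gamma^2 \approx d + \Phi^{-1}\left(1 - \frac{k}{n}\right)\sqrt{\sum_{i=1}^{d}\xi_i^2\left(E[\bar X_i^4] - 1\right)}.$$

In the sequel, assume $d$ is large enough, and the approximation error is negligible. Define

$$\gamma = \sqrt{\sum_{i=1}^{d}\xi_i^2\left(E[\bar X_i^4] - 1\right)}.$$

Corollary E.1 Assume $Z_\xi = N(d, \gamma^2)$ and $\Gamma^2 = d + \gamma\,\Phi^{-1}(1 - k/n)$, with $\xi_j$ satisfying (3). Let $\alpha > 0$, so that $t = \alpha\sqrt{k} - C\sqrt{d} > 0$. Then with probability at least $1 - 2\exp(-ct^2)$ we have that

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \le \frac{d}{(1-\alpha)^2\,k\,\left(1 + \frac{\gamma\sqrt{2\log(n/k)}}{d}\left(1 - O\left(\frac{\log\log(n/k)}{\log(n/k)}\right)\right)\right)}. \tag{41}$$

Proof Note that, by definition, $\|\bar X\|_\xi^2 = Z_\xi$, and $\xi$ and $\Gamma$ jointly solve the equations required by Theorem B.1. In order to apply the theorem, all we need to do is to estimate the magnitude of

$$\phi = E_{\mathcal{D}}\left[\bar X_j^2 \,\middle|\, \|\bar X\|_\xi^2 \ge \Gamma^2\right] \ge \frac{\Gamma^2}{d} = 1 + \frac{\gamma}{d}\,\Phi^{-1}\left(1 - \frac{k}{n}\right). \tag{42}$$

Therefore, we want to find bounds on tail probabilities of the normal distribution. By Theorem 2.1 of [15], we have that for small $k/n$

$$\sqrt{2\log(n/k)} - \frac{\log(4\log(n/k)) + 2}{2\sqrt{2\log(n/k)}} \le \Phi^{-1}\left(1 - \frac{k}{n}\right) \tag{43}$$
$$\le \sqrt{2\log(n/k)} - \frac{\log(2\log(n/k)) + 3/2}{2\sqrt{2\log(n/k)}}, \tag{44}$$

and the result follows.

We can also show how to apply the previous result to independent uniform distributions centered around zero. In that case, the fourth moment is $E[\bar X_j^4] = 9/5$, so $\gamma = \sqrt{4d/5}$, leading to a gain factor

$$\phi = 1 + \sqrt{\frac{8\log(n/k)}{5d}}\left(1 + o\left(\frac{\log\log(n/k)}{\log(n/k)}\right)\right).$$

F Proof of Theorem 4.1

Theorem F.1 Let $A$ be an algorithm for the problem we described in Section 2. Then,

$$E_A\,\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge \frac{d^2}{E\left[\sum_{i=1}^{k}\|\bar X_{(i)}\|^2\right]} \ge \frac{d}{k\,E\left[\frac{1}{d}\max_{i\in[n]}\|\bar X_i\|^2\right]}, \tag{45}$$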

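where $\bar X_{(i)}$ is the white observation with the $i$-th largest norm. Moreover, fix $\alpha \in (0, 1)$. Let $F$ be the cdf of $\max_{i\in[n]}\|\bar X_i\|^2$. Then, with probability at least $1 - \alpha$,

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge \frac{d^2}{k\,F^{-1}(1-\alpha)}. \tag{46}$$

Proof We want to minimize $\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{Tr}((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1})$. Let us define $S = \bar{\mathbf{X}}^T\bar{\mathbf{X}}$. One can prove that $H \mapsto \mathrm{Tr}(H^{-1})$ is convex for symmetric positive definite matrices $H$. It then follows by Jensen's inequality (assuming $k > d$, so $S$ is symmetric positive definite with high probability) that

$$E\,\mathrm{Tr}(S^{-1}) \ge \mathrm{Tr}\left((ES)^{-1}\right) = \sum_{j=1}^{d}\frac{1}{\lambda_j(ES)}. \tag{47}$$

Let $ES$ be the expected value of $S$ for an arbitrary algorithm $A$ that selects its observations sequentially. We want to understand what is the minimum possible value the RHS of (47) can take. The sum of eigenvalues is upper bounded by

$$\sum_{j=1}^{d}\lambda_j(ES) = \mathrm{Tr}(ES) = \sum_{j=1}^{d}E(S_{jj}) = \sum_{j=1}^{d}\sum_{i=1}^{k}E[\bar X_{ij}^2] = \sum_{i=1}^{k}E[\|\bar X_i\|^2] \le E\left[\sum_{i=1}^{k}\|\bar X_{(i)}\|^2\right] \le k\,E\left[\max_{i\in[n]}\|\bar X_i\|^2\right],$$

where $\bar X_{(i)}$ denotes the observation with the $i$-th largest norm. Because $ES$ is symmetric positive definite, its eigenvalues are real and non-negative, so that

$$0 < \lambda_{\min}(ES) \le \frac{\mathrm{Tr}(ES)}{d} \le \frac{1}{d}\,E\left[\sum_{i=1}^{k}\|\bar X_{(i)}\|^2\right] \le \frac{k}{d}\,E\left[\max_{i\in[n]}\|\bar X_i\|^2\right].$$

We conclude that the solution to the minimization problem of (47) (that is, when all eigenvalues are equal) is lower bounded by

$$E\,\mathrm{Tr}(S^{-1}) \ge \sum_{j=1}^{d}\frac{1}{\lambda_j(ES)} \ge \frac{d^2}{E\left[\sum_{i=1}^{k}\|\bar X_{(i)}\|^2\right]} \ge \frac{d^2}{k\,E\left[\max_{i\in[n]}\|\bar X_i\|^2\right]},$$

which proves (45). In order to prove the high-probability statement (46), note that

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{Tr}((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1}) = \sum_{i=1}^{d}\frac{1}{\lambda_i(\bar{\mathbf{X}}^T\bar{\mathbf{X}})} \ge \sum_{i=1}^{d}\frac{1}{\sum_{j=1}^{k}\|\bar X_j\|^2/d} \ge \frac{d^2}{\sum_{j=1}^{k}\|\bar X_{(j)}\|^2} \ge \frac{d^2}{k\,\max_{i\in[n]}\|\bar X_i\|^2}. \tag{48}$$

We directly conclude that with probability at least $1 - \alpha$,

$$\max_{i\in[n]}\|\bar X_i\|^2 \le F^{-1}(1-\alpha), \tag{49}$$

as $F$ is the cdf of $\max_{i\in[n]}\|\bar X_i\|^2$. It follows that with probability at least $1 - \alpha$,

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge \frac{d^2}{k\,F^{-1}(1-\alpha)}. \tag{50}$$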
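The deterministic chain of inequalities in (48) holds for any choice of $k$ rows, which can be checked on simulated data. The sketch below assumes white Gaussian observations and hypothetical sizes `n`, `k`, `d`; it compares the loss of two selections (a random subset and the $k$ largest-norm rows) against the two lower bounds.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, d = 2000, 50, 5
X_white = rng.normal(size=(n, d))             # white Gaussian observations (assumed for illustration)
sq_norms = (X_white ** 2).sum(axis=1)

def trace_inv_gram(rows):
    """Tr((rows^T rows)^{-1}) for a selection of whitened rows."""
    return np.trace(np.linalg.inv(rows.T @ rows))

top_k = np.argsort(sq_norms)[-k:]             # indices of the k largest squared norms
bound_top_k = d ** 2 / sq_norms[top_k].sum()  # d^2 / sum of the k largest squared norms
bound_max = d ** 2 / (k * sq_norms.max())     # d^2 / (k * largest squared norm)

random_choice = rng.choice(n, size=k, replace=False)
loss_random = trace_inv_gram(X_white[random_choice])
loss_top_k = trace_inv_gram(X_white[top_k])
```

No selection of $k$ rows can fall below `bound_top_k`, and `bound_max` is the looser bound that leads to (46).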

G Proof of Corollary 4.2

Corollary G.1 For Gaussian observations $X_i \sim N(0, \Sigma)$ and large $n$, for any algorithm $A$,

$$E_A\,\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge \frac{d}{k\left(\frac{2\log n}{d} + \log\log n\right)}. \tag{51}$$

Moreover, let $\alpha \in (0, 1)$. Then, for any $A$, with probability at least $1 - \alpha$ and $C = 2\log\Gamma(d/2)/d$,

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge \frac{d}{k\left(\frac{2\log n}{d} + \log\log n - \frac{1}{d}\log\log\frac{1}{1-\alpha} - C\right)}. \tag{52}$$

Proof In order to apply Theorem F.1, we need to upper bound $E[\max_{i\in[n]}\|\bar X_i\|^2]$, where $\bar X_i$ is a $d$-dimensional Gaussian random variable with identity covariance matrix. In other words, we need to upper bound the expected maximum of $n$ chi-squared random variables with $d$ degrees of freedom.

Let us start by proving (51). We can use extreme value theory to find the limiting distribution of the maximum of $n$ random variables. Firstly, note that the chi-squared distribution is a particular case of the Gamma distribution. More specifically, $\chi^2_d \sim \Gamma(d/2, 2)$. If we parameterize the $\Gamma$ distribution by $\alpha$ (shape) and $\beta$ (rate), then $\alpha = d/2$ and $\beta = 1/2$. By the Fisher-Tippett Theorem we know that there are only three limiting distributions for $\lim_{n\to\infty} X_{(n)} = \lim_{n\to\infty}\max_{i\le n} X_i$, where the $X_i$ are iid random variables, namely, the Frechet, Weibull and Gumbel distributions. It is known that the Gamma distribution is in the max-domain of attraction of the Gumbel distribution. Further, the normalizing constants are known (see Chapter 3 of [11]). In particular, we know that if $X_{(n)} := \max_{i\in[n]}\|\bar X_i\|^2$,

$$\lim_{n\to\infty} P\left(X_{(n)} \le 2x + 2\ln n + 2(d/2 - 1)\ln\ln n - 2\ln\Gamma(d/2)\right) = \Lambda(x) = e^{-e^{-x}}. \tag{53}$$

We can assume that the asymptotic limit holds, as $n$ is in practice very large, and compute the mean value of $X_{(n)}$. As $X_{(n)}$ is a positive random variable,

$$E[X_{(n)}] = \int_0^\infty P\left(X_{(n)} \ge t\right)dt \tag{54}$$
$$= \int_0^\infty \left(1 - P\left(X_{(n)} \le t\right)\right)dt. \tag{55}$$

We make the change of variables $t = 2x + C$, where $C = 2\ln n + (d-2)\ln\ln n - 2\ln\Gamma(d/2)$. Then,

$$E[X_{(n)}] = \int_0^\infty P\left(X_{(n)} \ge t\right)dt \tag{56}$$
$$= \int_{-C/2}^\infty 2\left(1 - P\left(X_{(n)} \le 2x + C\right)\right)dx \tag{57}$$
$$= \int_{-C/2}^\infty 2\left(1 - e^{-e^{-x}}\right)dx \tag{58}$$
$$= \int_{-C/2}^0 2\left(1 - e^{-e^{-x}}\right)dx + \int_0^\infty 2\left(1 - e^{-e^{-x}}\right)dx \tag{59}$$
$$\approx \int_{-C/2}^0 2\,dx + 2\gamma = C + 2\gamma, \tag{60}$$

where $\gamma$ is the Euler–Mascheroni constant. We conclude that

$$E[X_{(n)}] \approx C + 2\gamma \approx 2\ln n + (d-2)\ln\ln n. \tag{61}$$

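If we take the $k$ largest observations, and assume we could split the weight equally among all dimensions (which is desirable), we see that the best we can do in expectation is upper bounded by

$$\frac{k}{d}\,E[X_{(n)}] \approx k\left(\frac{2\ln n}{d} + \ln\ln n\right). \tag{62}$$

Now, let us prove (52). The following inequalities simplify our task to finding a high-probability upper bound on $\max_{i\in[n]}\|\bar X_i\|^2$. We have that

$$\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{Tr}((\bar{\mathbf{X}}^T\bar{\mathbf{X}})^{-1}) = \sum_{i=1}^{d}\frac{1}{\lambda_i(\bar{\mathbf{X}}^T\bar{\mathbf{X}})} \ge \sum_{i=1}^{d}\frac{1}{\sum_{j=1}^{k}\|\bar X_j\|^2/d} \ge \frac{d^2}{\sum_{j=1}^{k}\|\bar X_{(j)}\|^2} \ge \frac{d^2}{k\,\max_{i\in[n]}\|\bar X_i\|^2}. \tag{63}$$

Fix $\alpha \in [0, 1]$. We need to find a constant $Q$ such that with probability at least $1 - \alpha$, $Q \ge \max_{i\in[n]}\|\bar X_i\|^2$, so that we conclude that $\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge d^2/(kQ)$ with high probability. By (53) we know that

$$\lim_{n\to\infty} P\left(X_{(n)} \le 2x + 2\ln n + 2(d/2 - 1)\ln\ln n - 2\ln\Gamma(d/2)\right) = \Lambda(x) = e^{-e^{-x}}. \tag{64}$$

For large $n$, we assume the previous upper bound for $X_{(n)}$ is exact. We want to find $Q > 0$ such that $P(X_{(n)} \le Q) = 1 - \alpha$. Note that if $1 - \alpha = e^{-e^{-x}}$, then

$$x = -\log\log\frac{1}{1-\alpha}. \tag{65}$$

It follows that $Q = 2\ln n + 2(d/2 - 1)\ln\ln n - \log\log(1-\alpha)^{-1} - 2\ln\Gamma(d/2)$. Finally, (52) follows as

$$Q = d\left(\frac{2\log n}{d} + \left(1 - \frac{2}{d}\right)\log\log n - \frac{1}{d}\log\log(1-\alpha)^{-1} - \frac{2}{d}\ln\Gamma(d/2)\right). \tag{66}$$

H Proof of Theorem

Recall the Sparse Thresholding Algorithm below. We show the following theorem.

Theorem H.1 Let $\mathcal{D} = N(0, \Sigma)$. Assume $\Sigma$, $\lambda$ and $\min_i |\beta^*_i|$ satisfy the standard conditions given in Theorem 3 of [27]. Assume we run the Sparse Thresholding algorithm with $k_1 = C s\log d$ observations to recover the support of $\beta^*$, for an appropriate $C \ge 0$. Let $\mathbf{X}_2$ be $k_2 = k - k_1$ observations sampled via thresholding on $S(\hat\beta_1)$. It follows that for $\alpha > 0$ such that $t = \alpha\sqrt{k_2} - C\sqrt{s} > 0$, there exist some universal constants $c_1, c_2$, and $c, C$ that depend on the subgaussian norm of $\mathcal{D}\,|\,S(\hat\beta_1)$, such that with probability at least

$$1 - 2e^{-\min\left(c_2\min(s,\,\log(d-s)) - \log(c_1),\; ct^2 - \log(2)\right)}$$

it holds that

$$\mathrm{Tr}\left(\Sigma_{SS}(\mathbf{X}_2^T\mathbf{X}_2)^{-1}\right) \le \frac{s}{(1-\alpha)^2\,k_2\left(1 + \frac{2\log(n_2/k_2)}{s}\right)}.$$

For support recovery, we use Theorem 3 from [27]:

Theorem H.2 Consider the linear model with random Gaussian design

$$Y = \mathbf{X}\beta^* + \epsilon, \quad \text{with i.i.d. rows } x_i \sim N(0, \Sigma) \in \mathbb{R}^d, \tag{67}$$

with noise $\epsilon \sim N(0, \sigma^2 I_{k\times k})$. Assume the covariance matrix $\Sigma$ satisfies

$$\left\|\Sigma_{S^C S}(\Sigma_{SS})^{-1}\right\|_\infty \le (1 - \gamma), \quad \text{for some } \gamma \in (0, 1], \tag{68}$$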

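$$\lambda_{\min}(\Sigma_{SS}) \ge C_{\min} > 0. \tag{69}$$

Let $|S| = s$. Consider the family of regularization parameters, for $\phi_k \ge 2$,

$$\lambda_k(\phi_k) = \sqrt{\frac{\phi_k\,\rho_u(\Sigma_{S^C|S})}{\gamma^2}\,\frac{2\sigma^2\log(d)}{k}}. \tag{70}$$

If for some fixed $\delta > 0$, the sequence $(k, d, s)$ and regularization sequence $\{\lambda_k\}$ satisfy

$$\frac{k}{2s\log(d-s)} > (1+\delta)\,\theta_u(\Sigma)\left(1 + \frac{\sigma^2 C_{\min}}{\lambda_k^2\,s}\right), \tag{71}$$

then the following holds with probability at least $1 - c_1\exp(-c_2\min\{s, \log(d-s)\})$:

1. The Lasso has a unique solution $\hat\beta$ with support in $S$ (i.e. $S(\hat\beta) \subseteq S(\beta^*)$).
2. Define the gap

$$g(\lambda_k) := c_3\,\lambda_k\left\|\Sigma_{SS}^{-1/2}\right\|_\infty^2 + 20\sqrt{\frac{\sigma^2\log s}{C_{\min}\,k}}. \tag{72}$$

Then, if $\beta_{\min} := \min_{i\in S}|\beta^*_i| > g(\lambda_k)$, the signed support $S_\pm(\hat\beta)$ is identical to $S_\pm(\beta^*)$, and moreover $\|\hat\beta_S - \beta^*_S\|_\infty \le g(\lambda_k)$.

The required definitions to apply the previous theorem are

$$\rho_l(\Sigma) = \frac{1}{2}\min_{i\ne j}\left(\Sigma_{ii} + \Sigma_{jj} - 2\Sigma_{ij}\right), \qquad \rho_u(\Sigma) = \max_i \Sigma_{ii}, \tag{73}$$
$$\theta_l(\Sigma) = \frac{\rho_l(\Sigma_{S^C|S})}{C_{\max}\,(2 - \gamma(\Sigma))^2}, \qquad \theta_u(\Sigma) = \frac{\rho_u(\Sigma_{S^C|S})}{C_{\min}\,\gamma^2(\Sigma)}. \tag{74}$$

Algorithm 3 Sparse Thresholding Algorithm.
1: Set $S_1 = \emptyset$, $S_2 = \emptyset$. Let $k = k_1 + k_2$, $n = k_1 + n_2$.
2: for observation $1 \le i \le k_1$ do
3:   Observe $X_i$. Choose $X_i$: $S_1 = S_1 \cup X_i$.
4: end for
5: Set $\gamma = 1/2$, $\lambda = \sqrt{4\sigma^2\log(d)/(\gamma^2 k_1)}$.
6: Compute Lasso estimate $\hat\beta_1$ based on $S_1$, with regularization $\lambda$.
7: Set weights: $\xi_i = 1$ if $i \in S(\hat\beta_1)$, $\xi_i = 0$ otherwise.
8: Set $\Gamma = C\sqrt{s + 2\log(n_2/k_2)}$. Factorize $\Sigma_{S(\hat\beta_1)S(\hat\beta_1)} = UDU^T$.
9: for observation $k_1 < i \le n$ do
10:   Observe $X_i \in \mathbb{R}^d$. Restrict to $X^i_S := X^i_{S(\hat\beta_1)} \in \mathbb{R}^s$.
11:   Compute $\bar X^i_S = D^{-1/2}U^T X^i_S$.
12:   if $\|\bar X^i_S\|_\xi > \Gamma$ or $k_2 - |S_2| = n - i + 1$ then
13:     Choose $X^i_S$: $S_2 = S_2 \cup X^i_S$.
14:     if $|S_2| = k_2$ then
15:       break.
16:     end if
17:   end if
18: end for
19: Return OLS estimate $\hat\beta_2$ with observations in $S_2$.

Proof (Theorem H.1)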

Let $X \sim N(0, \Sigma)$ with $\Sigma$ satisfying (68) and (69). Let $\lambda_k(\phi_k)$ be as in (70), for some $\phi_k > 2$. Assume we choose the number of observations $k_1$ in the first stage to be at least

$$k_1 > 2(1+\delta)\,\theta_u(\Sigma)\left(1 + \frac{\sigma^2 C_{\min}}{\lambda_k^2\,s}\right)s\log(d-s) \tag{75}$$
$$= C(\Sigma, k, s)\,s\log(d-s), \tag{76}$$

and that $\beta_{\min}$ is greater than (72). Then, with probability at least $1 - c_1\exp(-c_2\min\{s, \log(d-s)\})$, we recover the right support $S(\beta^*) = S(\hat\beta)$ in the first stage of the algorithm.

Conditional on this event, we apply our algorithm on the remaining observations. In the second stage, we only look at those dimensions in $S(\hat\beta)$, by setting weights $\xi_{S(\hat\beta)} = 1$, and zero otherwise. Finally, we run OLS along the dimensions in the recovered support, using the observations collected during the second stage. Importantly, note that the new observations are $N(0, \Sigma_{SS})$.

We can now apply our original results. Denote by $\mathbf{X}_2 \in \mathbb{R}^{k_2\times s}$ the set of observations collected in the second stage of the algorithm. In particular, by Corollary D.1, we conclude that for $\alpha > 0$ such that $t = \alpha\sqrt{k_2} - C\sqrt{s} > 0$, the following holds with probability at least $1 - 2\exp(-ct^2)$:

$$\mathrm{Tr}\left(\Sigma_{SS}(\mathbf{X}_2^T\mathbf{X}_2)^{-1}\right) \le \frac{s}{(1-\alpha)^2\,k_2\left(1 + \frac{2\log(n_2/k_2)}{s}\right)}. \tag{77}$$

Under the event that the recovery is correct, the contribution to the MSE of the components of $\beta^*$ that are not in its support is zero. In other words,

$$\|\beta^* - \hat\beta_2\|_\Sigma^2 = (\beta^* - \hat\beta_2)^T\Sigma(\beta^* - \hat\beta_2) \tag{78}$$
$$= (\beta^*_S - \hat\beta_{2S})^T\Sigma_{SS}(\beta^*_S - \hat\beta_{2S}) = \|\beta^*_S - \hat\beta_{2S}\|_{\Sigma_{SS}}^2. \tag{79}$$

As the events that the first and second stages succeed are independent, we conclude that (77) holds with probability at least

$$1 - c_1 e^{-c_2\min\{s,\,\log(d-s)\}} - 2e^{-ct^2} \tag{80}$$
$$\ge 1 - 2e^{-\min\left(c_2\min(s,\,\log(d-s)) - \log(c_1),\; ct^2 - \log(2)\right)}. \tag{81}$$

I Proof of CLT Lower Bound

Corollary I.1 Assume the norm of white observations is distributed according to $Z_\xi = N(d, \gamma^2)$. Then, we have that for any algorithm $A$,

$$E_A\,\mathrm{Tr}(\Sigma(\mathbf{X}^T\mathbf{X})^{-1}) \ge \frac{d}{k\left(1 + \frac{\gamma\sqrt{2\log n}}{d}\right)}. \tag{82}$$

Proof By Theorem F.1, we need to compute $E[\max_{i\in[n]}\|\bar X_i\|^2]$. By assumption $\|\bar X_i\|^2 \sim N(d, \gamma^2)$ for each $i$, which implies

$$E\left[\max_{i\in[n]}\|\bar X_i\|^2\right] = E\left[d + \gamma\max_{i\in[n]}\frac{\|\bar X_i\|^2 - d}{\gamma}\right] \tag{83}$$
$$= d + \gamma\,E\left[\max_{i\in[n]} N(0, 1)\right] \tag{84}$$
$$\le d + \gamma\sqrt{2\log n}, \tag{85}$$

and the result follows.
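The last step of the proof uses the standard bound $E[\max_{i\in[n]} N(0,1)] \le \sqrt{2\log n}$. The sketch below checks it by Monte Carlo for one hypothetical value of `n` and evaluates the resulting lower bound (82), taking $\gamma$ to be the standard deviation of $Z_\xi$ as in the statement of Corollary I.1. The helper name `clt_lower_bound` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 2000

# Monte Carlo estimate of the expected maximum of n independent standard normals.
max_draws = rng.normal(size=(reps, n)).max(axis=1)
mc_expected_max = max_draws.mean()
upper = np.sqrt(2 * np.log(n))                # the bound used in (85)

def clt_lower_bound(n, k, d, gamma):
    """Right-hand side of (82): d / (k * (1 + gamma * sqrt(2 log n) / d))."""
    return d / (k * (1 + gamma * np.sqrt(2 * np.log(n)) / d))
```

With `gamma = 0` the norms carry no spread and the bound reduces to $d/k$; larger values of `gamma` lower the bound, reflecting the room an active algorithm has to improve.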

J Ridge Regression

Regularized linear estimators also benefit from large and balanced observations. We show that, under mild assumptions, the performance of ridge regression is directly aligned with that of previous sections. The ridge estimator is $\hat\beta_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^TY$, given $(\mathbf{X}, Y)$ and $\lambda > 0$. The following result shows how large values of $\lambda_{\min}(\mathbf{X}^T\mathbf{X})$ help to control the MSE of $\hat\beta_\lambda$. As the optimal penalty parameter $\lambda$ is unknown until the end of the data collection process, we assume it is uniformly random in a small interval.

Theorem J.1 Let $R > 0$. Assume the penalty parameter for ridge regression is chosen uniformly at random, $\lambda \sim U[0, R]$. Then, the MSE of $\hat\beta_\lambda$ is upper bounded by

$$E_{\lambda,\mathbf{X}}\,\|\hat\beta_\lambda - \beta^*\|^2 \le E_{\mathbf{X}}\,f\left(\lambda_{\min}(\mathbf{X}^T\mathbf{X})\right), \tag{86}$$

where $f$ is the following decreasing function of $\lambda_{\min}$:

$$f(\lambda_{\min}) = \frac{\sigma^2 d}{\lambda_{\min} + R} + \|\beta^*\|^2\left(1 - \frac{2\lambda_{\min}}{R}\log\left(1 + \frac{R}{\lambda_{\min}}\right) + \frac{\lambda_{\min}}{\lambda_{\min} + R}\right). \tag{87}$$

Proof The SVD decomposition $\mathbf{X} = USV^T$ implies that $\mathbf{X}^T\mathbf{X} = VSU^TUSV^T = VS^2V^T$, where $U$ and $V$ are orthogonal matrices. We define $W = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}$, and see that

$$W = \left(V(S^2 + \lambda I)V^T\right)^{-1} = V\,\mathrm{Diag}\left(\frac{1}{s_{jj}^2 + \lambda}\right)_{j=1}^{d}V^T.$$

In this case, the MSE of $\hat\beta_\lambda$ has two sources: squared bias and the trace of the covariance matrix. The covariance matrix of $\hat\beta_\lambda$ is $\mathrm{Cov}(\hat\beta_\lambda) = \sigma^2 W\mathbf{X}^T\mathbf{X}W$, while its bias is given by $-\lambda W\beta^*$ (see [13]). Thus,

$$\mathrm{Cov}(\hat\beta_\lambda) = \sigma^2\,V\,\mathrm{Diag}\left(\frac{s_{jj}^2}{(s_{jj}^2 + \lambda)^2}\right)_{j=1}^{d}V^T. \tag{88}$$

Note that $s_{jj}^2 = \lambda_j$, where the $s_{jj}$ are the singular values of $\mathbf{X}$, and the $\lambda_j$ are the eigenvalues of $\mathbf{X}^T\mathbf{X}$. As $V$ is orthogonal, $\mathrm{Tr}[\mathrm{Cov}(\hat\beta_\lambda)] = \sigma^2\sum_{j=1}^{d}\lambda_j/(\lambda_j + \lambda)^2$.

Unfortunately, in practice, the value of $\lambda$ is unknown before collecting the data. A common technique consists in using an additional validation set to choose the optimal regularization parameter $\lambda$. Generally, in supervised learning, the validation set comes from the same distribution as the test set, while in active learning it does not. As in the unregularized case, we want to train on unlikely data, but we want to test on likely data. We achieve robustness against this fact as follows.
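We fix some fairly large $R > 0$ such that we assume $\lambda \in (0, R)$. We treat $\lambda$ as a random variable, and we impose a uniform prior $D_\lambda$ over $(0, R)$. Then, we see that

$$E_{\lambda\sim D_\lambda}\left[\mathrm{Tr}\left[\mathrm{Cov}(\hat\beta_\lambda)\right]\right] = \sigma^2\sum_{j=1}^{d}\frac{\lambda_j}{R}\int_0^R\frac{d\lambda}{(\lambda_j + \lambda)^2} = \sigma^2\sum_{j=1}^{d}\frac{1}{\lambda_j + R} \le \frac{\sigma^2 d}{\lambda_{\min} + R}. \tag{89}$$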
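The averaging step above rests on the identity $\frac{1}{R}\int_0^R \frac{\lambda_j}{(\lambda_j + \lambda)^2}\,d\lambda = \frac{1}{\lambda_j + R}$. As a quick sanity check, the sketch below compares a midpoint-rule approximation of the left-hand side with the closed form, and verifies the final eigenvalue bound on a hypothetical spectrum. The function names, the grid size, and the eigenvalues are illustrative assumptions.

```python
import numpy as np

def averaged_variance_term(lam_j, R, grid=200_000):
    """Midpoint-rule estimate of (1/R) * integral_0^R lam_j / (lam_j + lam)^2 d lam."""
    lam = (np.arange(grid) + 0.5) * (R / grid)    # midpoints of a uniform grid on (0, R)
    return np.mean(lam_j / (lam_j + lam) ** 2)

def closed_form(lam_j, R):
    """Closed form of the same average: 1 / (lam_j + R)."""
    return 1.0 / (lam_j + R)

R = 10.0
eigenvalues = np.array([0.5, 3.0, 40.0])          # hypothetical eigenvalues of X^T X
numeric = np.array([averaged_variance_term(l, R) for l in eigenvalues])
exact = closed_form(eigenvalues, R)

# The sum over eigenvalues is at most d / (lambda_min + R).
sum_term = exact.sum()
bound = len(eigenvalues) / (eigenvalues.min() + R)
```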

The squared bias can be upper bounded by

$$\lambda^2\,\beta^{*T}W^TW\beta^* = \beta^{*T}\,V\,\mathrm{Diag}\left(\frac{\lambda^2}{(\lambda_j + \lambda)^2}\right)_{j}V^T\beta^* \le \|\beta^*\|_2^2\,\max_j\left(\frac{\lambda}{\lambda_j + \lambda}\right)^2 = \|\beta^*\|^2\left(\frac{\lambda}{\lambda_{\min} + \lambda}\right)^2 \tag{90}$$

for every $\lambda > 0$, as $\lambda_j \ge 0$ for all $j$. Taking expectations on both sides of (90) with respect to $\lambda \sim D_\lambda$, and after some algebra,

$$E_{D_\lambda}\,\mathrm{Bias}^2(\hat\beta_\lambda) \le \|\beta^*\|^2\left(1 - \frac{2\lambda_{\min}}{R}\log\left(1 + \frac{R}{\lambda_{\min}}\right) + \frac{\lambda_{\min}}{\lambda_{\min} + R}\right), \tag{91}$$

where the RHS is a decreasing function of $\lambda_{\min}$ that tends to zero as $\lambda_{\min}$ grows. It follows that $E\|\hat\beta_\lambda - \beta^*\|^2$ can be controlled by maximizing $\lambda_{\min}(\mathbf{X}^T\mathbf{X})$, and we can focus on maximizing $\lambda_{\min}(\bar{\mathbf{X}}^T\bar{\mathbf{X}})$ by the equivalence shown in the Problem Definition section of the main paper.

K Simulations

We conducted several experiments in various settings. We present here some experiments that complement those shown in the main paper. In particular, we show experiments for linear models, synthetic linear models, synthetic non-linear data, and additional regularized and real-world datasets.

K.1 Linear Models

We first empirically show the results proved in Theorem B.1. For a sequence of values of $n$, we choose $k = \sqrt{n}$ observations in $\mathbb{R}^d$, with fixed $d = 10$. The observations are generated according to $N(0, I)$, and $y$ follows a linear model with $\beta_i \sim U(-5, 5)$. For each tuple $(n, k)$ we repeat the experiment 200 times, and compute the squared error ($\beta$ is known). The results in Figure 3 (a) show that the average MSE of Algorithm 1 significantly outperforms that of random sampling. We also see a strong variance reduction. Figure 3 (b) restricts the comparison to fixed and adaptive threshold algorithms; while the latter outperforms the former, the difference is small. In Figure 3 (c) we keep $n$ and $d$ fixed, and vary $k$.

Finally, in Figure 4 (a) we show the case where $\Sigma \ne I$. For completeness, we repeated the simulation with observations generated according to a joint Gaussian distribution with a random covariance matrix that had $\mathrm{Tr}(\Sigma) = 21.59$, $\lambda_{\min} = 5$, and $\lambda_{\max} =$ … Figure 4 (a) shows that thresholding algorithms outperform random sampling in a similar way as in the white case presented in the paper. Also, Figure 4 (b) shows how the adaptive threshold slightly beats the fixed one.

Finally, in Figure 4 (c), we show the results of simulations when observations are sampled from Laplace correlated marginals (through a Gaussian copula). We compare random sampling to two versions of the thresholding algorithm. The simplest one, denoted by Unif-Weig Algorithm, assigns uniform weights (i.e., $\xi_i = 1$ for all $i$). On the other hand, we denote by Opt-Weig Algorithm the algorithm that uses the optimal weights, previously pre-computed (in this case $\max_i \xi_i/\min_i \xi_i \approx 7$; independent variables tend to require higher weights). As one would expect, the latter does better than the former. However, it is remarkable that the difference between random sampling and thresholding is far more substantial than the difference between optimal and approximate thresholding, an observation that can be very useful in practice.

Figure 3: MSE of $\hat\beta_{OLS}$; white Gaussian observations; (0.25, 0.75) quantile confidence intervals displayed in (a), (c). Panels: (a) with $k = \sqrt{n}$, $d = 10$; (b) with $k = \sqrt{n}$, $d = 10$; (c) with $n = 3000$, $d = 10$. Vertical axis: normalized MSE. Legend: Algorithm 1b, Algorithm 1, Random Sampling, Theorem ($\psi = 0.3$), Passive Theory.

K.2 Synthetic Non-Linear Data

The theory and algorithms presented in this paper are based on the linearity of the model. To understand the impact of this assumption, we perform an experiment where the response model was $y = x^T\beta + \psi\,x^Tx$ for various values of $\psi$, and $\beta_i \sim U(-5, 5)$. Note that high-order terms and transformations can easily be included in the design matrix (not done in this case). As expected, the results in Figure 5 show an intersection point. The active learning algorithms are robust to some level of non-linearity but, at some point, random sampling becomes more effective.

K.3 Regularization

An appealing property of the proposed algorithms is that their gain is preserved under regularized estimators such as ridge and lasso. This is especially relevant as it allows for higher dimensional models where transformations and interactions of the original variables are added to better capture non-linearities in the data (and regularization is used to avoid overfitting). In fact, our algorithm can be thought of as a type of regularizing process.

We repeated the first experiment from the linear model simulations, using the ridge estimator with $\lambda = 0.01$. Figure 6 (a) shows that the average MSEs of Algorithms 1 and 1b strongly outperform the results of random sampling. Their variance is less than 30% that of random sampling in all cases.

We performed two experiments with Lasso estimators to investigate the behavior of our algorithms in the presence of sparse models. We do not test Algorithm 2 here, but only simple thresholding approaches. First, we fixed $n = 5000$, $k = 150$, $d = 70$ and white Gaussian data. The dimension of the latent subspace, or effective dimension of the model, ranges from $d_{\mathrm{eff}} = 5$ to $d_{\mathrm{eff}} = 70$. Results are shown in Figure 6 (b). Algorithm 1 and Algorithm 1b strongly improve the performance of random sampling, while their variance is at most half that of random sampling.

Figure 4: In (a), (b), MSE of $\hat\beta_{OLS}$; $N(0, \Sigma)$ data; (0.05, 0.95) confidence intervals. Panels: (a) with $k = \sqrt{n}$, $d = 10$; (b) with $k = \sqrt{n}$, $d = 10$; (c) Laplace correlated data via Gaussian copula; uniform vs. optimal weights (normalized MSE; Unif-Weig Algorithm, Opt-Weig Algorithm, Random Sampling).

Figure 5: Model is $y = \sum_i \beta_i x_i + \psi\sum_i x_i^2$. Normalized MSE against $\psi$ for Algorithm 1b, Algorithm 1, and Random Sampling.

In the second experiment, we fixed $d_{\mathrm{eff}} = 7$, and progressively increased the dimension of the space from $d = 70$ to $d = 450$. Also, we kept fixed $n = 1000$ and $k = 100$. Results are shown in Figure 6 (c). Thresholding algorithms consistently decrease the MSE of the lasso estimator with respect to random sampling, even though we are adding a large number of purely noisy dimensions. The reason is simple. While these algorithms do not actively try to find the latent subspace (Algorithm 2 does), their observations will be, on average, larger in those dimensions too. There may be ways to leverage this fact, like batched approaches where weights $\xi$ are updated by giving more importance to promising dimensions.

Figure 6: MSE of regularized estimators, $\lambda = 0.01$; white Gaussian observations. The (0.05, 0.95) confidence intervals in (a), and (0.25, 0.75) in (b). Panels: (a) Ridge; $d = 10$, $k = \sqrt{n}$. (b) Lasso; $n = 5000$, $k = 150$, $d = 70$. (c) Lasso; $n = 1000$, $k = 100$, $d_{\mathrm{eff}} = 7$.

K.4 Real World Datasets

The Combined Cycle Power dataset has 9568 observations. The outcome is the net hourly electrical energy output of the plant, and it has $d = 4$ covariates: temperature, pressure, humidity, and exhaust vacuum. In Figure 7, we see the phenomenon explained in the main paper (for large $k$, the gain vanishes). In this case, and after adding all second order interactions, active learning solves the problem. Random sampling with interactions is not shown as the error was much larger.

In addition, in Figure 8 we show the scatterplots of the datasets used in the paper (we omitted the YearPredictionMSD dataset as $d = 90$).

Figure 7: Combined Cycle Power (150 iters). Normalized MSE against $n$ (× 1000) for Algorithm 1b, Random Sampling, and Algorithm 1b (+interactions).

Figure 8: Scatter Plots of Real World Datasets. (a) Protein Structure Dataset. (b) Combined Cycle Power Plant Dataset. (c) Bike Sharing Dataset.
