Linear Regression with Limited Observation
Elad Hazan, Tomer Koren
Technion - Israel Institute of Technology, Technion City 32000, Haifa, Israel

Abstract

We consider the most common variants of linear regression, including Ridge, Lasso and Support-vector regression, in a setting where the learner is allowed to observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: for Lasso and Ridge regression they need the same total number of attributes (up to constants) as do full-information algorithms, for reaching a certain accuracy. For Support-vector regression, we require exponentially fewer attributes compared to the state of the art. By that, we resolve an open problem recently posed by Cesa-Bianchi et al. (2010). Experiments show the theoretical bounds to be justified by superior performance compared to the state of the art.

1. Introduction

In regression analysis the statistician attempts to learn from examples the underlying variables affecting a given phenomenon. For example, in medical diagnosis a certain combination of conditions reflects whether a patient is afflicted with a certain disease. In certain common regression cases, various limitations are placed on the information available from the examples. In the medical example, not all parameters of a certain patient can be measured, due to cost, time and patient reluctance. In this paper we study the problem of regression in which only a small subset of the attributes per example can be observed. In this setting, we have access to all attributes and we are required to choose which of them to observe. Recently, Cesa-Bianchi et al. (2010) studied this problem and asked the following interesting question: can we efficiently learn the optimal regressor in the attribute-efficient setting with the same total number of attributes as in the unrestricted regression setting?

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
In other words, the question amounts to whether the information limitation hinders our ability to learn efficiently at all. Ideally, one would hope that instead of observing all attributes of every example, one could compensate for fewer attributes by analyzing more examples, but retain the same overall sample and computational complexity. Indeed, we answer this question in the affirmative for the main variants of regression: Ridge and Lasso. For Support-vector regression we make significant advancement, reducing the parameter dependence by an exponential factor. Our results are summarized in the table below (1), which gives bounds for the number of examples needed to attain an error of ε, such that at most k attributes (2) are viewable per example. We denote by d the dimension of the attribute space.

    Regression | New bound                  | Prev. bound
    Ridge      | O(d/(kε²))                 | O((d²/(kε²)) log²(1/ε))
    Lasso      | O((d log d)/(kε²))         | O((d²/(kε²)) log²(1/ε))
    SVR        | O(d/k) · e^(O(log²(1/ε)))  | O(e^(d²/(kε²)))

    Table 1. Our sample complexity bounds.

Our bounds imply that for reaching a certain accuracy, our algorithms need the same total number of attributes as their full-information counterparts. In particular, when k = Ω(d) our bounds coincide with those of full-information regression, up to constants (cf. Kakade et al. 2008). We complement these upper bounds and prove that Ω(d/ε²) attributes are in fact necessary to learn an ε-accurate Ridge regressor. For Lasso regression, Cesa-Bianchi et al. (2010) proved that Ω(d/ε) attributes are necessary, and asked what is the correct dependence on the problem dimension. Our bounds imply that the number of attributes necessary for regression learning grows linearly with the problem dimension.

(1) The previous bounds are due to (Cesa-Bianchi et al., 2010). For SVR, the bound is obtained by additionally incorporating the methods of (Cesa-Bianchi et al., 2011).
(2) For SVR, the number of attributes viewed per example is a random variable whose expectation is k.

The algorithms themselves are very simple to implement, and run in linear time. As we show in later sections, these theoretical improvements are clearly visible in experiments on standard datasets.

1.1. Related work

The setting of learning with limited attribute observation (LAO) was first put forth in (Ben-David & Dichterman, 1998), who coined the term "learning with restricted focus of attention". Cesa-Bianchi et al. (2010) were the first to discuss linear prediction in the LAO setting, and gave an efficient algorithm (as well as lower bounds) for linear regression, which is the primary focus of this paper.

2. Setting and Result Statement

2.1. Linear regression

In the linear regression problem, each instance is a pair (x, y) of an attributes vector x ∈ R^d and a target variable y ∈ R. We assume the standard framework of statistical learning (Haussler, 1992), in which the pairs (x, y) follow a joint probability distribution D over R^d × R. The goal of the learner is to find a vector w for which the linear rule ŷ ← w⊤x provides a good prediction of the target y. To measure the performance of the prediction, we use a convex loss function l(ŷ, y) : R² → R. The most common choice is the square loss l(ŷ, y) = ½(ŷ − y)², which stands for the popular least-squares regression. Hence, in terms of the distribution D, the learner would like to find a regressor w ∈ R^d with low expected loss, defined as

    L_D(w) = E_{(x,y)∼D}[l(w⊤x, y)].    (1)

The standard paradigm for learning such a regressor is seeking a vector w ∈ R^d that minimizes a trade-off between the expected loss and an additional regularization term, which is usually a norm of w.
An equivalent form of this optimization problem is obtained by replacing the regularization term with a proper constraint, giving rise to the problem

    min_{w ∈ R^d} L_D(w)  s.t.  ||w||_p ≤ B,    (2)

where B > 0 is a regularization parameter and p ≥ 1. The main variants of regression differ on the type of l_p-norm constraint as well as the loss function in the above definition:

- Ridge regression: p = 2 and the square loss, l(ŷ, y) = ½(ŷ − y)².
- Lasso regression: p = 1 and the square loss.
- Support-vector regression: p = 2 and the δ-insensitive absolute loss (Vapnik, 1995), l(ŷ, y) = |ŷ − y|_δ := max{0, |ŷ − y| − δ}.

Since the distribution D is unknown, we learn by relying on a training set S = {(x_t, y_t)}_{t=1}^m of examples, that are assumed to be sampled independently from D. We use the notation l_t(w) := l(w⊤x_t, y_t) to refer to the loss function induced by the instance (x_t, y_t). We distinguish between two learning scenarios. In the full-information setup, the learner has unrestricted access to the entire data set. In the limited attribute observation (LAO) setting, for any given example pair (x, y), the learner can observe y, but only k attributes of x (where k ≥ 1 is a parameter of the problem). The learner can actively choose which attributes to observe.

2.2. Limitations on LAO regression

Cesa-Bianchi et al. (2010) proved the following sample complexity lower bound on any LAO Lasso regression algorithm.

Theorem 2.1. Let 0 < ε < 1/16, k ≥ 1 and d > 4k. For any regression algorithm accessing at most k attributes per training example, there exist a distribution D over {x : ||x||_∞ ≤ 1} × {±1} and a regressor w* with ||w*||_1 ≤ 1 such that the algorithm must see (in expectation) at least Ω(d/(kε)) examples in order to learn a linear regressor w with L_D(w) − L_D(w*) < ε.

We complement this lower bound by providing a stronger lower bound on the sample complexity of any Ridge regression algorithm, using information-theoretic arguments.

Theorem 2.2. Let ε = Ω(1/√d).
For any regression algorithm accessing at most k attributes per training example, there exist a distribution D over {x : ||x||_2 ≤ 1} × {±1} and a regressor w* with ||w*||_2 ≤ 1 such that the algorithm must see (in expectation) at least Ω(d/(kε²)) examples in order to learn a linear regressor w, ||w||_2 ≤ 1, with L_D(w) − L_D(w*) ≤ ε.

Our algorithm for LAO Ridge regression (see Section 3) implies this lower bound to be tight up to constants.
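For concreteness, the loss functions of the regression variants defined in Section 2.1 are simple to state in code; the snippet below is our own illustration, not code from the paper.

```python
# The square loss (Ridge/Lasso) and the delta-insensitive absolute
# loss (SVR) from Section 2.1, as plain Python functions.

def square_loss(y_hat, y):
    # l(y_hat, y) = (1/2) * (y_hat - y)^2
    return 0.5 * (y_hat - y) ** 2

def delta_insensitive_loss(y_hat, y, delta):
    # l(y_hat, y) = |y_hat - y|_delta = max{0, |y_hat - y| - delta}
    return max(0.0, abs(y_hat - y) - delta)
```

Both functions are convex in the prediction y_hat, which is the only property the algorithms below rely on.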
Note, however, that the bound applies only to a particular regime of the problem parameters.

2.3. Our algorithmic results

We give efficient regression algorithms that attain the following risk bounds. For our Ridge regression algorithm, we prove the risk bound

    E[L_D(w̃)] ≤ min_{||w||_2 ≤ B} L_D(w) + O(B² √(d/(km))),

while for our Lasso regression algorithm we establish the bound

    E[L_D(w̃)] ≤ min_{||w||_1 ≤ B} L_D(w) + O(B² √(d log d/(km))).

Here we use w̃ to denote the output of each algorithm on a training set of m examples, and the expectations are taken with respect to the randomization of the algorithms. For Support-vector regression we obtain a risk bound that depends on the desired accuracy ε (3). Our bound implies that

    m = O(d/k) · exp(O(log²(B/ε)))

examples are needed (in expectation) for obtaining an ε-accurate regressor.

(3) Indeed, there are (full-information) algorithms that are known to converge at a O(1/ε) rate; see e.g. (Hazan et al., 2007).

3. Algorithms for LAO least-squares regression

In this section we present and analyze our algorithms for Ridge and Lasso regression in the LAO setting. The loss function under consideration here is the square loss, that is, l_t(w) = ½(w⊤x_t − y_t)². For convenience, we show algorithms that use k + 1 attributes of each instance, for k ≥ 1 (4). Our algorithms are iterative, and maintain a regressor w_t along the iterations. The update of the regressor at iteration t is based on gradient information, and specifically on g_t := ∇l_t(w_t), which equals (w_t⊤x_t − y_t)x_t for the square loss. In the LAO setting, however, we do not have access to this information, thus we build upon unbiased estimators of the gradients.

(4) We note that by our approach it is impossible to learn using a single attribute of each example (i.e., with k = 0), and we are not aware of any algorithm that is able to do so. See (Cesa-Bianchi et al., 2011) for a related impossibility result.
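To make the scheme concrete before the formal description: the following is a minimal Python sketch, written by us for illustration, of the Ridge variant (AERR, Algorithm 1 below). The unbiased gradient estimator combines uniform attribute sampling (to estimate x_t) with importance sampling over the coordinates of w_t (to estimate the residual w_t⊤x_t − y_t). The synthetic data in the usage example is our own assumption, not an experiment from the paper.

```python
import math
import random

def aerr(samples, d, k, B, eta, rng):
    """Sketch of AERR: online gradient descent on the 2-norm ball,
    observing only k + 1 attributes of each example."""
    w = [B / math.sqrt(d)] * d          # w_1 != 0 with ||w_1||_2 <= B
    w_avg = [0.0] * d
    m = len(samples)
    for x, y in samples:
        for i in range(d):              # running average of the iterates
            w_avg[i] += w[i] / m
        # Estimate x_t from k uniformly sampled attributes.
        x_est = [0.0] * d
        for _ in range(k):
            i = rng.randrange(d)
            x_est[i] += d * x[i] / k
        # Importance-sample one more attribute to estimate w^T x - y.
        sq = sum(wi * wi for wi in w)
        if sq < 1e-12:                  # degenerate w ~ 0: w^T x - y = -y
            phi = -y
        else:
            r = rng.random() * sq
            acc = 0.0
            for j in range(d):          # Pr[j] = w[j]^2 / ||w||_2^2
                acc += w[j] * w[j]
                if r <= acc:
                    break
            phi = sq * x[j] / w[j] - y
        # Gradient step, then project back onto the 2-norm ball.
        v = [wi - eta * phi * xi for wi, xi in zip(w, x_est)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        w = [vi * B / max(norm, B) for vi in v]
    return w_avg

# Usage on synthetic data (our own illustration):
rng = random.Random(0)
d, k, B, m = 5, 2, 1.0, 2000
w_star = [0.4, -0.2, 0.1, 0.3, -0.1]
samples = []
for _ in range(m):
    x = [rng.uniform(-1.0, 1.0) / math.sqrt(d) for _ in range(d)]  # ||x||_2 <= 1
    samples.append((x, sum(a * b for a, b in zip(w_star, x))))
eta = math.sqrt(k / (2.0 * d * m))      # the step size from Theorem 3.1
w_hat = aerr(samples, d, k, B, eta, rng)
assert math.sqrt(sum(v * v for v in w_hat)) <= B + 1e-9
```

Note that each example costs k + 1 observed attributes, matching the budget stated above.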
Algorithm 1 AERR
Parameters: B, η > 0
Input: training set S = {(x_t, y_t)}_{t∈[m]} and k > 0
Output: regressor w̃ with ||w̃||_2 ≤ B
 1: Initialize w_1 ≠ 0, ||w_1||_2 ≤ B arbitrarily
 2: for t = 1 to m do
 3:   for r = 1 to k do
 4:     Pick i_{t,r} ∈ [d] uniformly and observe x_t[i_{t,r}]
 5:     x̃_{t,r} ← d · x_t[i_{t,r}] · e_{i_{t,r}}
 6:   end for
 7:   x̃_t ← (1/k) Σ_{r=1}^k x̃_{t,r}
 8:   Choose j_t ∈ [d] with Pr[j_t = j] = w_t[j]² / ||w_t||_2², and observe x_t[j_t]
 9:   φ̃_t ← ||w_t||_2² · x_t[j_t]/w_t[j_t] − y_t
10:   g̃_t ← φ̃_t · x̃_t
11:   v_t ← w_t − η g̃_t
12:   w_{t+1} ← v_t · B / max{||v_t||_2, B}
13: end for
14: w̃ ← (1/m) Σ_{t=1}^m w_t

3.1. Ridge regression

Recall that in Ridge regression, we are interested in the linear regressor that is the solution to the optimization problem (2) with p = 2, given explicitly as

    min_{w ∈ R^d} L_D(w)  s.t.  ||w||_2 ≤ B.    (3)

Our algorithm for the LAO setting is based on a randomized Online Gradient Descent (OGD) strategy (Zinkevich, 2003). More specifically, at each iteration t we use a randomized estimator g̃_t of the gradient g_t to update the regressor w_t via an additive rule. Our gradient estimators make use of an importance-sampling method inspired by (Clarkson et al., 2010). The pseudo-code of our Attribute Efficient Ridge Regression (AERR) algorithm is given in Algorithm 1. In the following theorem, we show that the regressor learned by our algorithm is competitive with the optimal linear regressor having 2-norm bounded by B.

Theorem 3.1. Assume the distribution D is such that ||x||_2 ≤ 1 and |y| ≤ B with probability 1. Let w̃ be the output of AERR, when run with η = √(k/(2dm)). Then, ||w̃||_2 ≤ B and for any w* ∈ R^d with ||w*||_2 ≤ B,

    E[L_D(w̃)] ≤ L_D(w*) + 4B² √(2d/(km)).

Analysis

Theorem 3.1 is a consequence of the following two lemmas. The first lemma is obtained as a result of a standard regret bound for the OGD algorithm (see Zinkevich 2003), applied to the vectors g̃_1, ..., g̃_m.
Lemma 3.2. For any ||w*||_2 ≤ B we have

    Σ_{t=1}^m g̃_t⊤(w_t − w*) ≤ 2B²/η + (η/2) Σ_{t=1}^m ||g̃_t||_2².    (4)

The second lemma shows that the vector g̃_t is an unbiased estimator of the gradient g_t := ∇l_t(w_t) at iteration t, and establishes a variance bound for this estimator. To simplify notations, here and in the rest of the paper we use E_t[·] to denote the conditional expectation with respect to all randomness up to time t.

Lemma 3.3. The vector g̃_t is an unbiased estimator of the gradient g_t := ∇l_t(w_t), that is, E_t[g̃_t] = g_t. In addition, for all t we have E_t[||g̃_t||_2²] ≤ 8B²d/k.

For a proof of the lemma, see (Hazan & Koren, 2011). We now turn to prove Theorem 3.1.

Proof (of Theorem 3.1). First note that as ||w_t||_2 ≤ B for all t, we clearly have ||w̃||_2 ≤ B. Taking the expectation of (4) with respect to the randomization of the algorithm, and letting G² := max_t E_t[||g̃_t||_2²], we obtain

    E[Σ_{t=1}^m g̃_t⊤(w_t − w*)] ≤ 2B²/η + (η/2) G² m.

On the other hand, the convexity of l_t gives l_t(w_t) − l_t(w*) ≤ g_t⊤(w_t − w*). Together with the above, this implies that for η = 2B/(G√m),

    E[(1/m) Σ_{t=1}^m l_t(w_t)] ≤ (1/m) Σ_{t=1}^m l_t(w*) + 2BG/√m.

Taking the expectation of both sides with respect to the random choice of the training set, and using G ≤ 2B√(2d/k) (according to Lemma 3.3), we get

    E[(1/m) Σ_{t=1}^m L_D(w_t)] ≤ L_D(w*) + 4B² √(2d/(km)).

Finally, recalling the convexity of L_D and using Jensen's inequality, the theorem follows.

3.2. Lasso regression

We now turn to describe our algorithm for Lasso regression in the LAO setting, in which we would like to solve the problem

    min_{w ∈ R^d} L_D(w)  s.t.  ||w||_1 ≤ B.    (5)
The algorithm we provide for this problem is based on a stochastic variant of the EG algorithm (Kivinen & Warmuth, 1997), that employs multiplicative updates based on an estimation of the gradients ∇l_t. The multiplicative nature of the algorithm, however, makes it highly sensitive to the magnitude of the updates. To make the updates more robust, we clip the entries of the gradient estimator so as to prevent them from getting too large. Formally, this is accomplished via the following clip operation:

    clip(x, c) := max{min{x, c}, −c}    for x ∈ R and c > 0.

This clipping has an even stronger effect in the more general setting we consider in Section 4. We give our Attribute Efficient Lasso Regression (AELR) algorithm in Algorithm 2, and establish a corresponding risk bound in the following theorem.

Algorithm 2 AELR
Parameters: B, η > 0
Input: training set S = {(x_t, y_t)}_{t∈[m]} and k > 0
Output: regressor w̃ with ||w̃||_1 ≤ B
 1: Initialize z_1⁺ ← 1, z_1⁻ ← 1
 2: for t = 1 to m do
 3:   w_t ← (z_t⁺ − z_t⁻) · B / (||z_t⁺||_1 + ||z_t⁻||_1)
 4:   for r = 1 to k do
 5:     Pick i_{t,r} ∈ [d] uniformly and observe x_t[i_{t,r}]
 6:     x̃_{t,r} ← d · x_t[i_{t,r}] · e_{i_{t,r}}
 7:   end for
 8:   x̃_t ← (1/k) Σ_{r=1}^k x̃_{t,r}
 9:   Choose j_t ∈ [d] with Pr[j_t = j] = |w_t[j]| / ||w_t||_1, and observe x_t[j_t]
10:   φ̃_t ← ||w_t||_1 · sign(w_t[j_t]) · x_t[j_t] − y_t
11:   g̃_t ← φ̃_t · x̃_t
12:   for i = 1 to d do
13:     ḡ_t[i] ← clip(g̃_t[i], 1/η)
14:     z_{t+1}⁺[i] ← z_t⁺[i] · exp(−η ḡ_t[i])
15:     z_{t+1}⁻[i] ← z_t⁻[i] · exp(+η ḡ_t[i])
16:   end for
17: end for
18: w̃ ← (1/m) Σ_{t=1}^m w_t

Theorem 3.4. Assume the distribution D is such that ||x||_∞ ≤ 1 and |y| ≤ B with probability 1. Let w̃ be the output of AELR, when run with

    η = (1/(4B)) √(2k log(2d) / (5dm)).

Then, ||w̃||_1 ≤ B, and for any w* ∈ R^d with ||w*||_1 ≤ B we have

    E[L_D(w̃)] ≤ L_D(w*) + 4B² √(10 d log(2d) / (km)),

provided that m ≥ log(2d).
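A minimal Python sketch of AELR, written by us for illustration, mirrors the pseudo-code above: the same two-part gradient estimator as in AERR, followed by the clipped exponentiated-gradient update on the doubled weight vectors z⁺, z⁻. The synthetic usage data is our own assumption.

```python
import math
import random

def clip(x, c):
    # clip(x, c) = max{min{x, c}, -c}
    return max(min(x, c), -c)

def aelr(samples, d, k, B, eta, rng):
    """Sketch of AELR: clipped multiplicative (EG-style) updates,
    observing only k + 1 attributes of each example."""
    zp = [1.0] * d                      # z_t^+
    zn = [1.0] * d                      # z_t^-
    w_avg = [0.0] * d
    m = len(samples)
    for x, y in samples:
        s = sum(zp) + sum(zn)
        w = [(p - n) * B / s for p, n in zip(zp, zn)]
        for i in range(d):
            w_avg[i] += w[i] / m
        # Estimate x_t from k uniformly sampled attributes.
        x_est = [0.0] * d
        for _ in range(k):
            i = rng.randrange(d)
            x_est[i] += d * x[i] / k
        # Importance-sample a coordinate by |w[j]| / ||w||_1.
        l1 = sum(abs(wi) for wi in w)
        if l1 < 1e-12:                  # w_1 = 0 at the first step: w^T x - y = -y
            phi = -y
        else:
            r = rng.random() * l1
            acc = 0.0
            for j in range(d):
                acc += abs(w[j])
                if r <= acc:
                    break
            phi = l1 * math.copysign(1.0, w[j]) * x[j] - y
        # Clipped exponentiated-gradient update.
        for i in range(d):
            g_bar = clip(phi * x_est[i], 1.0 / eta)
            zp[i] *= math.exp(-eta * g_bar)
            zn[i] *= math.exp(+eta * g_bar)
    return w_avg

# Usage on synthetic data (our own illustration):
rng = random.Random(1)
d, k, B, m = 8, 2, 1.0, 1000
w_star = [0.5, -0.3, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
samples = []
for _ in range(m):
    x = [rng.uniform(-1.0, 1.0) for _ in range(d)]          # ||x||_inf <= 1
    samples.append((x, sum(a * b for a, b in zip(w_star, x))))
eta = (1.0 / (4 * B)) * math.sqrt(2 * k * math.log(2 * d) / (5 * d * m))
w_hat = aelr(samples, d, k, B, eta, rng)
assert sum(abs(v) for v in w_hat) <= B + 1e-9
```

The construction of w_t from z⁺ and z⁻ guarantees ||w_t||_1 ≤ B at every step, so the averaged output also satisfies the 1-norm constraint.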
Analysis

In the rest of the section, for a vector v we let v² denote the vector for which v²[i] = (v[i])² for all i. In order to prove Theorem 3.4, we first consider the augmented vectors z̄_t := (z_t⁺, z_t⁻) ∈ R^{2d} and ḡ'_t := (ḡ_t, −ḡ_t) ∈ R^{2d}, and let p_t := z̄_t / ||z̄_t||_1. For these vectors, we have the following.

Lemma 3.5.

    Σ_{t=1}^m p_t⊤ḡ'_t ≤ min_{i∈[2d]} Σ_{t=1}^m ḡ'_t[i] + log(2d)/η + η Σ_{t=1}^m p_t⊤(ḡ'_t)².

The lemma is a consequence of a second-order regret bound for the Multiplicative-Weights algorithm, essentially due to (Clarkson et al., 2010). By means of this lemma, we establish a risk bound with respect to the clipped linear functions ḡ_t⊤w.

Lemma 3.6. Assume that E_t[||g̃_t||_∞²] ≤ G² for all t, for some G > 0. Then, for any ||w*||_1 ≤ B, we have

    E[Σ_{t=1}^m ḡ_t⊤w_t] ≤ E[Σ_{t=1}^m ḡ_t⊤w*] + B(log(2d)/η + ηG²m).

Our next step is to relate the risk generated by the linear functions g̃_t⊤w to that generated by the clipped functions ḡ_t⊤w.

Lemma 3.7. Assume that E_t[||g̃_t||_∞²] ≤ G² for all t, for some G > 0. Then, for 0 < η ≤ 1/(2G) we have

    E[Σ_{t=1}^m g̃_t⊤w_t] ≤ E[Σ_{t=1}^m ḡ_t⊤w_t] + 4BηG²m.

The final component of the proof is a variance bound, similar to that of Lemma 3.3.

Lemma 3.8. The vector g̃_t is an unbiased estimator of the gradient g_t := ∇l_t(w_t), that is, E_t[g̃_t] = g_t. In addition, for all t we have E_t[||g̃_t||_∞²] ≤ 8B²d/k.

For the complete proofs, refer to (Hazan & Koren, 2011). We are now ready to prove Theorem 3.4.

Proof (of Theorem 3.4). Since ||w_t||_1 ≤ B for all t, we obtain ||w̃||_1 ≤ B. Next, note that as E_t[g̃_t] = g_t, we have E[Σ_{t=1}^m g̃_t⊤w_t] = E[Σ_{t=1}^m g_t⊤w_t]. Putting Lemmas 3.6 and 3.7 together, we get for η ≤ 1/(2G) that

    E[Σ_{t=1}^m g̃_t⊤(w_t − w*)] ≤ B(log(2d)/η + 5ηG²m).

Proceeding as in the proof of Theorem 3.1, and choosing η = (1/G)√(log(2d)/(5m)), we obtain the bound

    E[L_D(w̃)] ≤ L_D(w*) + 2BG √(5 log(2d)/m).

Note that for this choice of η we indeed have η ≤ 1/(2G), as we originally assumed that m ≥ log(2d). Finally, putting G = 2B√(2d/k) as implied by Lemma 3.8, we obtain the bound in the statement of the theorem.
4. Support-vector regression

In this section we show how our approach can be extended to deal with loss functions other than the square loss, of the form

    l(w⊤x, y) = f(w⊤x − y),    (6)

(with f real and convex) and most importantly, with the δ-insensitive absolute loss function of SVR, for which f(x) = |x|_δ := max{|x| − δ, 0} for some fixed 0 ≤ δ ≤ B (recall that in our results we assume the labels y_t have |y_t| ≤ B). For concreteness, we consider only the 2-norm variant of the problem (as in the standard formulation of SVR); the results we obtain can be easily adjusted to the 1-norm setting. We overload notation, and keep using the shorthand l_t(w) := l(w⊤x_t, y_t) for referring to the loss function induced by the instance (x_t, y_t). It should be highlighted that our techniques can be adapted to deal with many other common loss functions, including classification losses (i.e., of the form l(w⊤x, y) = f(y · w⊤x)). Due to its importance and popularity, we chose to describe our method in the context of SVR.

Unfortunately, there are strong indications that SVR learning (more generally, learning with a non-smooth loss function) in the LAO setting is impossible via our approach of unbiased gradient estimations (see Cesa-Bianchi et al. 2011 and the references therein). For that reason, we make two modifications to the learning setting: first, we shall henceforth relax the budget constraint to allow k observed attributes per instance in expectation; and second, we shall aim for biased gradient estimators, instead of unbiased as before. To obtain such biased estimators, we uniformly ε-approximate the function f by an analytic function f_ε, and learn with the approximate loss function l_t^ε(w) = f_ε(w⊤x_t − y_t) instead. Clearly, any ε-suboptimal regressor of the approximate problem is a 2ε-suboptimal regressor of the original problem. For learning the approximate problem we use a novel technique, inspired by (Cesa-Bianchi et al., 2011), for estimating gradients of analytic loss functions.
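The particular smoothing used later in Section 4.2 is built from the function ρ(x) = x·erf(x) + e^(−x²)/√π, whose scaled version ε·ρ(x/ε) uniformly approximates |x| within ε/√π. The check below is our own numerical illustration of this fact, not code from the paper.

```python
# Numerical check: eps * rho(x/eps), with rho(x) = x*erf(x) + exp(-x^2)/sqrt(pi),
# approximates |x| uniformly; the worst-case gap is attained at x = 0 and
# equals eps / sqrt(pi) <= eps.
import math

def rho(x):
    return x * math.erf(x) + math.exp(-x * x) / math.sqrt(math.pi)

def smoothed_abs(x, eps):
    return eps * rho(x / eps)

eps = 0.1
gap = max(abs(smoothed_abs(0.001 * i, eps) - abs(0.001 * i))
          for i in range(-5000, 5001))
assert gap <= eps / math.sqrt(math.pi) + 1e-9
assert abs(smoothed_abs(0.0, eps) - eps / math.sqrt(math.pi)) < 1e-12
```

Since ρ′(x) = erf(x), the derivative of the smoothed loss stays in [−1, 1], exactly like the subgradients of |x| it replaces.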
Our estimators for ∇l_t^ε can then be viewed as biased estimators of ∇l_t (we note, however, that the resulting bias might be quite large).
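The core trick, given in Procedure 3 below, is a randomized Taylor-series estimator: to estimate f′(z) from independent unbiased estimates θ_1, θ_2, ... of z, draw a random degree n with Pr[n] = 2^(−(n+1)) and output 2^(n+1)·a_n·θ_1⋯θ_n, where f′(x) = Σ_n a_n x^n. The sketch below is our own illustration of the basic (un-truncated) form of this trick, using f′(x) = e^x (so a_n = 1/n!) and noisy estimates of z; it omits the variance-reduction branch of Procedure 3.

```python
# Unbiased randomized Taylor estimator for f'(z) = exp(z), given only
# independent unbiased (noisy) estimates of z.  Unbiasedness:
#   E[est] = sum_n 2^-(n+1) * 2^(n+1) * a_n * E[theta]^n = sum_n a_n z^n = f'(z).
import math
import random

def taylor_estimate(theta_draw, coeff, rng):
    """One draw of the estimator; theta_draw() returns an unbiased
    estimate of z, and coeff(n) is the n-th Taylor coefficient a_n."""
    n = 0
    while rng.random() < 0.5:           # Pr[n] = (1/2)^(n+1)
        n += 1
    prod = 1.0
    for _ in range(n):
        prod *= theta_draw()            # independent draws => E[prod] = z^n
    return 2.0 ** (n + 1) * coeff(n) * prod

rng = random.Random(3)
z = 0.5
theta = lambda: z + rng.uniform(-0.1, 0.1)      # unbiased estimate of z
a = lambda n: 1.0 / math.factorial(n)           # coefficients of f'(x) = e^x
trials = 200000
mean = sum(taylor_estimate(theta, a, rng) for _ in range(trials)) / trials
assert abs(mean - math.exp(z)) < 0.02
```

For heavier-tailed coefficients the variance of the plain product estimator can blow up, which is why Procedure 3 switches to averaged estimates θ̃_r for large degrees n.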
Procedure 3 GenEst
Parameters: {a_n}_{n≥0}, the Taylor coefficients of f′
Input: regressor w, instance (x, y)
Output: φ̂ with E[φ̂] = f′(w⊤x − y)
 1: Let N = 4B²
 2: Choose n ≥ 0 with probability Pr[n] = (1/2)^{n+1}
 3: if n ≤ 2 log₂ N then
 4:   for r = 1, ..., n do
 5:     Choose j ∈ [d] with probability w[j]² / ||w||_2², and observe x[j]
 6:     θ̃_r ← ||w||_2² · x[j]/w[j] − y
 7:   end for
 8: else
 9:   for r = 1, ..., n do
10:     Choose j_1, ..., j_N ∈ [d] w.p. w[j]² / ||w||_2² (independently), and observe x[j_1], ..., x[j_N]
11:     θ̃_r ← (1/N) Σ_{s=1}^N (||w||_2² · x[j_s]/w[j_s]) − y
12:   end for
13: end if
14: φ̂ ← 2^{n+1} · a_n · θ̃_1 θ̃_2 ⋯ θ̃_n

4.1. Estimators for analytic loss functions

Let f : R → R be a real, analytic function (on the entire real line). The derivative f′ is thus also analytic and can be expressed as f′(x) = Σ_{n=0}^∞ a_n x^n, where {a_n} are the Taylor expansion coefficients of f′. In Procedure 3 we give an unbiased estimator of f′(w⊤x − y) in the LAO setting, defined in terms of the coefficients {a_n} of f′. For this estimator, we have the following (proof is omitted).

Lemma 4.1. The estimator φ̂ is an unbiased estimator of f′(w⊤x − y). Also, assuming ||x||_2 ≤ 1, ||w||_2 ≤ B and |y| ≤ B, the second moment E[φ̂²] is upper bounded by exp(O(log² B)), provided that the Taylor series of f′(x) converges absolutely for |x| ≤ 1. Finally, the expected number of attributes of x used by this estimator is no more than 3.

4.2. Approximating SVR

In order to approximate the δ-insensitive absolute loss function, we define

    f_ε(x) = (ε/2) ρ((x − δ)/ε) + (ε/2) ρ((x + δ)/ε) − δ,

where ρ is expressed in terms of the error function erf,

    ρ(x) = x · erf(x) + (1/√π) e^{−x²},

and consider the approximate loss functions l_t^ε(w) = f_ε(w⊤x_t − y_t). Indeed, we have the following.
Claim 4.2. For any ε > 0, f_ε is convex, analytic on the entire real line, and

    sup_{x∈R} |f_ε(x) − |x|_δ| ≤ ε.

The claim follows easily from the identity |x|_δ = ½|x − δ| + ½|x + δ| − δ. In addition, for using Procedure 3 we need the following simple observation, which follows immediately from the series expansion of erf(x).

Claim 4.3. ρ′(x) = Σ_{n=0}^∞ a_{2n+1} x^{2n+1}, with the coefficients {a_{2n+1}}_{n≥0} given in (7) below.

We now give the main result of this section, which is a sample complexity bound for the Attribute Efficient SVR (AESVR) algorithm, given in Algorithm 4.

Algorithm 4 AESVR
Parameters: B, δ, η > 0 and accuracy ε > 0
Input: training set S = {(x_t, y_t)}_{t∈[m]} and k > 0
Output: regressor w̃ with ||w̃||_2 ≤ B
 1: Let

       a_{2n} = 0,   a_{2n+1} = (2/√π) · (−1)^n / (n!(2n+1)),   n ≥ 0    (7)

 2: Execute Algorithm 1 with lines 8–9 replaced by:
       x_t ← x_t/ε,   y_t⁺ ← (y_t + δ)/ε,   y_t⁻ ← (y_t − δ)/ε,
       φ̃_t ← ½ [GenEst(w_t, x_t, y_t⁺) + GenEst(w_t, x_t, y_t⁻)]
 3: Return the output w̃ of the algorithm

Theorem 4.4. Assume the distribution D is such that ||x||_2 ≤ 1 and |y| ≤ B with probability 1. Then, for any w* ∈ R^d with ||w*||_2 ≤ B, we have

    E[L_D(w̃)] ≤ L_D(w*) + ε,

where w̃ is the output of AESVR (with η properly tuned) on a training set of size

    m = O(d/k) · exp(O(log²(B/ε))).    (8)

The algorithm queries at most k + 6 attributes of each instance in expectation.

Proof. First, note that for the approximate loss functions l_t^ε we have

    ∇l_t^ε(w_t) = ½ [ρ′(w_t⊤x_t − y_t⁺) + ρ′(w_t⊤x_t − y_t⁻)] · x_t.

Hence, Lemma 4.1 and Claim 4.3 above imply that g̃_t in Algorithm 4 is an unbiased estimator of ∇l_t^ε(w_t). Furthermore, since ||x_t/ε||_2 ≤ 1/ε and |y_t ± δ|/ε ≤ 2B/ε, according to the same lemma we have E_t[φ̃_t²] = exp(O(log²(B/ε))). Repeating the proof of Lemma 3.3,
we then have

    E_t[||g̃_t||_2²] = E_t[φ̃_t²] · E_t[||x̃_t||_2²] = exp(O(log²(B/ε))) · d/k.

Replacing G² in the proof of Theorem 3.1 with the above bound, we get for the output of Algorithm 4,

    E[L_D(w̃)] ≤ L_D(w*) + exp(O(log²(B/ε))) · √(d/(km)),

which implies that for obtaining an ε-accurate regressor w̃ of the approximate problem, it is enough to take m as given in (8). However, Claim 4.2 now gives that w̃ itself is a 2ε-accurate regressor of the original problem, and the proof is complete.

5. Experiments

In this section we give experimental evidence that supports our theoretical bounds, and demonstrates the superior performance of our algorithms compared to the state of the art. Naturally, we chose to compare our AERR and AELR algorithms (5) with the AER algorithm of (Cesa-Bianchi et al., 2010). We note that AER is in fact a hybrid algorithm that combines 1-norm and 2-norm regularizations, thus we use it for benchmarking in both the Ridge and Lasso settings.

We essentially repeated the experiments of (Cesa-Bianchi et al., 2010) and used the popular MNIST digit recognition dataset (LeCun et al., 1998). Each instance in this dataset is a 28 × 28 image of a handwritten digit 0–9. We focused on the "3 vs. 5" task, on a subset of the dataset that consists of the "3" digits (labeled −1) and the "5" digits (labeled +1). We applied the regression algorithms to this task by regressing to the labels. In all our experiments, we randomly split the data into training and test sets, and used 10-fold cross-validation for tuning the parameters of each algorithm. Then, we ran each algorithm on increasingly longer prefixes of the dataset and tracked the obtained square error on the test set. For faithfully comparing partial- and full-information algorithms, we also recorded the total number of attributes used by each algorithm.

In our first experiment, we executed AELR, AER and (offline) Lasso on the "3 vs. 5" task.
We allowed both AELR and AER to use only k = 4 pixels of each training image, while giving Lasso unrestricted access to the entire set of attributes (a total of 784) of each instance. The results, averaged over 10 runs on random train/test splits, are presented in Figure 1.

(5) The AESVR algorithm is presented mainly for theoretical considerations, and was not implemented in the experiments.

Figure 1. Test square error of Lasso algorithms with k = 4, over increasing prefixes of the "3 vs. 5" dataset.

Note that the x-axis represents the cumulative number of attributes used for training. The graph ends at the total number of attributes allowed for the partial-information algorithms; Lasso, however, exhausts this budget after seeing merely 62 examples. As we see from the results, AELR keeps its test error significantly lower than that of AER along the entire execution, almost bridging the gap with the full-information Lasso. Note that the latter has the clear advantage of being an offline algorithm, while both AELR and AER are online in nature. Indeed, when we compared AELR with an online Lasso solver, our algorithm obtained a test error almost 10 times better.

In the second experiment, we evaluated AERR, AER and Ridge regression on the same task, but now allowing the partial-information algorithms to use as much as k = 56 pixels (which amounts to 2 rows) of each instance. The results of this experiment are given in Figure 2. We see that even if we allow the algorithms to view a considerable number of attributes, the gap between AERR and AER is large.

6. Conclusions and Open Questions

We have considered the fundamental problem of statistical regression analysis, and in particular Lasso and Ridge regression, in a setting where the observation upon each training instance is limited to a few attributes, and gave algorithms that improve over the state of the art by a leading-order term with respect to the sample complexity.
This resolves an open question of (Cesa-Bianchi et al., 2010). The algorithms are efficient, and give a clear experimental advantage in
Figure 2. Test square error of Ridge algorithms with k = 56, over increasing prefixes of the "3 vs. 5" dataset.

previously-considered benchmarks. For the challenging case of regression with general convex loss functions, we described an exponential improvement in sample complexity, which applies in particular to Support-vector regression.

It is interesting to resolve the sample complexity gap of 1/ε which still remains for Lasso regression, and to improve upon the pseudo-polynomial factor in ε for Support-vector regression. In addition, establishing analogous bounds for our algorithms that hold with high probability (rather than in expectation) appears to be non-trivial, and is left for future work. Another possible direction for future research is adapting our results to the setting of learning with (randomly) missing data, which was recently investigated; see e.g. (Rostamizadeh et al., 2011; Loh & Wainwright, 2011). The sample complexity bounds our algorithms obtain in this setting are slightly worse than those presented in the current paper, and it is interesting to check whether one can do better.

Acknowledgments

We thank Shai Shalev-Shwartz for several useful discussions, and the anonymous referees for their detailed comments.

References

Ben-David, S. and Dichterman, E. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3), 1998.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12), December 2011.

Clarkson, K. L., Hazan, E., and Woodruff, D. P. Sublinear optimization for machine learning. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. IEEE, 2010.

Haussler, D.
Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.

Hazan, E. and Koren, T. Optimal algorithms for ridge and lasso regression with partially observed attributes. arXiv preprint, 2011.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2), 2007.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 22, 2008.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.

Loh, P. L. and Wainwright, M. J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, 2011.

Rostamizadeh, A., Agarwal, A., and Bartlett, P. Learning with missing features. In The 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning. ACM, 2003.
More informationSchrödinger s equation.
Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of
More informationNecessary and Sufficient Conditions for Sketched Subspace Clustering
Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This
More informationAgmon Kolmogorov Inequalities on l 2 (Z d )
Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,
More informationOptimization of Geometries by Energy Minimization
Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.
More information7.1 Support Vector Machine
67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to
More informationInfluence of weight initialization on multilayer perceptron performance
Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -
More information6 General properties of an autonomous system of two first order ODE
6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x
More informationOn combinatorial approaches to compressed sensing
On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu
More informationTable of Common Derivatives By David Abraham
Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec
More informationTopic 7: Convergence of Random Variables
Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information
More informationSublinear Optimization for Machine Learning
Sublinear Optimization for Machine Learning Kenneth L. Clarkson, IBM Almaen Research Center Ela Hazan, Technion - Israel Institute of technology Davi P. Wooruff, IBM Almaen Research Center In this paper
More informationHow to Minimize Maximum Regret in Repeated Decision-Making
How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:
More informationThis module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics
This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.
More informationLinear and quadratic approximation
Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function
More informationCascaded redundancy reduction
Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,
More informationA Weak First Digit Law for a Class of Sequences
International Mathematical Forum, Vol. 11, 2016, no. 15, 67-702 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1288/imf.2016.6562 A Weak First Digit Law for a Class of Sequences M. A. Nyblom School of
More informationSeparation of Variables
Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical
More informationarxiv: v4 [cs.ds] 7 Mar 2014
Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning
More informationLecture 2: Correlated Topic Model
Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables
More informationThe derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)
Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)
More informationHomework 2 Solutions EM, Mixture Models, PCA, Dualitys
Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture
More informationTractability results for weighted Banach spaces of smooth functions
Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March
More informationarxiv: v4 [math.pr] 27 Jul 2016
The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,
More informationParameter estimation: A new approach to weighting a priori information
Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a
More informationAnalyzing Tensor Power Method Dynamics in Overcomplete Regime
Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical
More informationProof of SPNs as Mixture of Trees
A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a
More informationGeneralized Tractability for Multivariate Problems
Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,
More informationA Sketch of Menshikov s Theorem
A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p
More informationPermanent vs. Determinant
Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an
More informationPerfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs
Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite
More informationIntroduction to Machine Learning
How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression
More informationIntroduction to the Vlasov-Poisson system
Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its
More informationGaussian processes with monotonicity information
Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process
More informationLogarithmic spurious regressions
Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate
More informationThe total derivative. Chapter Lagrangian and Eulerian approaches
Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function
More informationTutorial on Maximum Likelyhood Estimation: Parametric Density Estimation
Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing
More informationMath 1B, lecture 8: Integration by parts
Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores
More informationAll s Well That Ends Well: Supplementary Proofs
All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationMulti-View Clustering via Canonical Correlation Analysis
Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms
More informationModelling and simulation of dependence structures in nonlife insurance with Bernstein copulas
Moelling an simulation of epenence structures in nonlife insurance with Bernstein copulas Prof. Dr. Dietmar Pfeifer Dept. of Mathematics, University of Olenburg an AON Benfiel, Hamburg Dr. Doreen Straßburger
More informationTime-of-Arrival Estimation in Non-Line-Of-Sight Environments
2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor
More informationA. Exclusive KL View of the MLE
A. Exclusive KL View of the MLE Lets assume a change-of-variable moel p Z z on the ranom variable Z R m, such as the one use in Dinh et al. 2017: z 0 p 0 z 0 an z = ψz 0, where ψ is an invertible function
More informationLECTURE NOTES ON DVORETZKY S THEOREM
LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.
More informationApproximate Constraint Satisfaction Requires Large LP Relaxations
Approximate Constraint Satisfaction Requires Large LP Relaxations oah Fleming April 19, 2018 Linear programming is a very powerful tool for attacking optimization problems. Techniques such as the ellipsoi
More informationMulti-View Clustering via Canonical Correlation Analysis
Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu
More informationAdmin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016
Amin Assignment 7 Assignment 8 Goals toay BACKPROPAGATION Davi Kauchak CS58 Fall 206 Neural network Neural network inputs inputs some inputs are provie/ entere Iniviual perceptrons/ neurons Neural network
More informationCUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu
CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, an Tony Wu Abstract Popular proucts often have thousans of reviews that contain far too much information for customers to igest. Our goal for the
More informationSelf-normalized Martingale Tail Inequality
Online-to-Confience-Set Conversions an Application to Sparse Stochastic Banits A Self-normalize Martingale Tail Inequality The self-normalize martingale tail inequality that we present here is the scalar-value
More informationLeaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes
Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,
More informationLectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs
Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent
More informationMath 342 Partial Differential Equations «Viktor Grigoryan
Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite
More informationSwitching Time Optimization in Discretized Hybrid Dynamical Systems
Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set
More informationMulti-View Clustering via Canonical Correlation Analysis
Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu
More informationON THE OPTIMALITY SYSTEM FOR A 1 D EULER FLOW PROBLEM
ON THE OPTIMALITY SYSTEM FOR A D EULER FLOW PROBLEM Eugene M. Cliff Matthias Heinkenschloss y Ajit R. Shenoy z Interisciplinary Center for Applie Mathematics Virginia Tech Blacksburg, Virginia 46 Abstract
More informationRobustness and Perturbations of Minimal Bases
Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important
More informationOn the Value of Partial Information for Learning from Examples
JOURNAL OF COMPLEXITY 13, 509 544 (1998) ARTICLE NO. CM970459 On the Value of Partial Information for Learning from Examples Joel Ratsaby* Department of Electrical Engineering, Technion, Haifa, 32000 Israel
More informationOnline Appendix for Trade Policy under Monopolistic Competition with Firm Selection
Online Appenix for Trae Policy uner Monopolistic Competition with Firm Selection Kyle Bagwell Stanfor University an NBER Seung Hoon Lee Georgia Institute of Technology September 6, 2018 In this Online
More informationMonte Carlo Methods with Reduced Error
Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for
More informationAdaptive Online Learning in Dynamic Environments
Adaptive Online Learning in Dynamic Environments Lijun Zhang, Shiyin Lu, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology Nanjing University, Nanjing 210023, China {zhanglj, lusy, zhouzh}@lamda.nju.edu.cn
More informationA variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs
Mathias Fuchs, Norbert Krautenbacher A variance ecomposition an a Central Limit Theorem for empirical losses associate with resampling esigns Technical Report Number 173, 2014 Department of Statistics
More informationEstimation of the Maximum Domination Value in Multi-Dimensional Data Sets
Proceeings of the 4th East-European Conference on Avances in Databases an Information Systems ADBIS) 200 Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Eleftherios Tiakas, Apostolos.
More informationTechnion - Computer Science Department - M.Sc. Thesis MSC Constrained Codes for Two-Dimensional Channels.
Technion - Computer Science Department - M.Sc. Thesis MSC-2006- - 2006 Constraine Coes for Two-Dimensional Channels Keren Censor Technion - Computer Science Department - M.Sc. Thesis MSC-2006- - 2006 Technion
More informationCapacity Analysis of MIMO Systems with Unknown Channel State Information
Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,
More informationSturm-Liouville Theory
LECTURE 5 Sturm-Liouville Theory In the three preceing lectures I emonstrate the utility of Fourier series in solving PDE/BVPs. As we ll now see, Fourier series are just the tip of the iceberg of the theory
More informationarxiv: v5 [cs.lg] 28 Mar 2017
Equilibrium Propagation: Briging the Gap Between Energy-Base Moels an Backpropagation Benjamin Scellier an Yoshua Bengio * Université e Montréal, Montreal Institute for Learning Algorithms March 3, 217
More informationinflow outflow Part I. Regular tasks for MAE598/494 Task 1
MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the
More informationarxiv: v3 [cs.lg] 3 Dec 2017
Context-Aware Generative Aversarial Privacy Chong Huang, Peter Kairouz, Xiao Chen, Lalitha Sankar, an Ram Rajagopal arxiv:1710.09549v3 [cs.lg] 3 Dec 2017 Abstract Preserving the utility of publishe atasets
More informationMulti-View Clustering via Canonical Correlation Analysis
Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in
More informationWEIGHTING A RESAMPLED PARTICLE IN SEQUENTIAL MONTE CARLO. L. Martino, V. Elvira, F. Louzada
WEIGHTIG A RESAMPLED PARTICLE I SEQUETIAL MOTE CARLO L. Martino, V. Elvira, F. Louzaa Dep. of Signal Theory an Communic., Universia Carlos III e Mari, Leganés (Spain). Institute of Mathematical Sciences
More information. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.
S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial
More informationLevel Construction of Decision Trees in a Partition-based Framework for Classification
Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S
More informationOn the number of isolated eigenvalues of a pair of particles in a quantum wire
On the number of isolate eigenvalues of a pair of particles in a quantum wire arxiv:1812.11804v1 [math-ph] 31 Dec 2018 Joachim Kerner 1 Department of Mathematics an Computer Science FernUniversität in
More informationAN INTRODUCTION TO NUMERICAL METHODS USING MATHCAD. Mathcad Release 14. Khyruddin Akbar Ansari, Ph.D., P.E.
AN INTRODUCTION TO NUMERICAL METHODS USING MATHCAD Mathca Release 14 Khyruin Akbar Ansari, Ph.D., P.E. Professor of Mechanical Engineering School of Engineering an Applie Science Gonzaga University SDC
More informationEuler equations for multiple integrals
Euler equations for multiple integrals January 22, 2013 Contents 1 Reminer of multivariable calculus 2 1.1 Vector ifferentiation......................... 2 1.2 Matrix ifferentiation........................
More informationLecture 5. Symmetric Shearer s Lemma
Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here
More informationA comparison of small area estimators of counts aligned with direct higher level estimates
A comparison of small area estimators of counts aligne with irect higher level estimates Giorgio E. Montanari, M. Giovanna Ranalli an Cecilia Vicarelli Abstract Inirect estimators for small areas use auxiliary
More informationWUCHEN LI AND STANLEY OSHER
CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability
More informationEquilibrium in Queues Under Unknown Service Times and Service Value
University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University
More informationComputing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions
Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5
More informationMath Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors
Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+
More informationRamsey numbers of some bipartite graphs versus complete graphs
Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer
More informationIPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy
IPA Derivatives for Make-to-Stock Prouction-Inventory Systems With Backorers Uner the (Rr) Policy Yihong Fan a Benamin Melame b Yao Zhao c Yorai Wari Abstract This paper aresses Infinitesimal Perturbation
More information