Linear Regression with Limited Observation


Elad Hazan    Tomer Koren
Technion – Israel Institute of Technology, Technion City 32000, Haifa, Israel

Abstract

We consider the most common variants of linear regression, including Ridge, Lasso and Support-vector regression, in a setting where the learner is allowed to observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: for Lasso and Ridge regression they need the same total number of attributes (up to constants) as do full-information algorithms, for reaching a certain accuracy. For Support-vector regression, we require exponentially fewer attributes compared to the state of the art. By that, we resolve an open problem recently posed by Cesa-Bianchi et al. (2010). Experiments show the theoretical bounds to be justified by superior performance compared to the state of the art.

1. Introduction

In regression analysis the statistician attempts to learn from examples the underlying variables affecting a given phenomenon. For example, in medical diagnosis a certain combination of conditions reflects whether a patient is afflicted with a certain disease. In certain common regression cases various limitations are placed on the information available from the examples. In the medical example, not all parameters of a certain patient can be measured, due to cost, time and patient reluctance.

In this paper we study the problem of regression in which only a small subset of the attributes per example can be observed. In this setting, we have access to all attributes and we are required to choose which of them to observe. Recently, Cesa-Bianchi et al. (2010) studied this problem and asked the following interesting question: can we efficiently learn the optimal regressor in the attribute-efficient setting with the same total number of attributes as in the unrestricted regression setting? In other words, the question amounts to whether the information limitation hinders our ability to learn efficiently at all. Ideally, one would hope that instead of observing all attributes of every example, one could compensate for fewer attributes by analyzing more examples, but retain the same overall sample and computational complexity.

Indeed, we answer this question in the affirmative for the main variants of regression: Ridge and Lasso. For support-vector regression we make a significant advancement, reducing the parameter dependence by an exponential factor. Our results are summarized in Table 1 below¹, which gives bounds on the number of examples needed to attain an error of $\epsilon$, such that at most $k$ attributes² are viewable per example. We denote by $d$ the dimension of the attribute space.

Regression | New bound | Prev. bound
Ridge | $O(d / (k\epsilon^2))$ | $O(d^2 \log^2(d/\epsilon) / (k\epsilon^2))$
Lasso | $O(d \log d / (k\epsilon^2))$ | $O(d^2 \log^2(d/\epsilon) / (k\epsilon^2))$
SVR | $O(d/k) \cdot e^{O(\log^2(1/\epsilon))}$ | $O(e^d d^2 / (k\epsilon^2))$

Table 1. Our sample complexity bounds.

Our bounds imply that for reaching a certain accuracy, our algorithms need the same number of attributes as their full-information counterparts. In particular, when $k = \Omega(d)$ our bounds coincide with those of full-information regression, up to constants (cf. Kakade et al., 2008). We complement these upper bounds and prove that $\Omega(d/\epsilon^2)$ attributes are in fact necessary to learn an $\epsilon$-accurate Ridge regressor.

¹ The previous bounds are due to (Cesa-Bianchi et al., 2010). For SVR, the bound is obtained by additionally incorporating the methods of (Cesa-Bianchi et al., 2011).

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
² For SVR, the number of attributes viewed per example is a random variable whose expectation is $k$.

For Lasso regression, Cesa-Bianchi et al. (2010) proved that $\Omega(d/\epsilon)$ attributes are necessary, and asked what is the correct dependence on the problem dimension. Our bounds imply that the number of attributes necessary for regression learning grows linearly with the problem dimension. The algorithms themselves are very simple to implement, and run in linear time. As we show in later sections, these theoretical improvements are clearly visible in experiments on standard datasets.

1.1. Related work

The setting of learning with limited attribute observation (LAO) was first put forth in (Ben-David & Dichterman, 1998), who coined the term "learning with restricted focus of attention". Cesa-Bianchi et al. (2010) were the first to discuss linear prediction in the LAO setting, and gave an efficient algorithm (as well as lower bounds) for linear regression, which is the primary focus of this paper.

2. Setting and Result Statement

2.1. Linear regression

In the linear regression problem, each instance is a pair $(x, y)$ of an attributes vector $x \in \mathbb{R}^d$ and a target variable $y \in \mathbb{R}$. We assume the standard framework of statistical learning (Haussler, 1992), in which the pairs $(x, y)$ follow a joint probability distribution $D$ over $\mathbb{R}^d \times \mathbb{R}$. The goal of the learner is to find a vector $w$ for which the linear rule $\hat{y} \leftarrow w^\top x$ provides a good prediction of the target $y$. To measure the performance of the prediction, we use a convex loss function $\ell(\hat{y}, y) : \mathbb{R}^2 \to \mathbb{R}$. The most common choice is the square loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$, which stands for the popular least-squares regression. Hence, in terms of the distribution $D$, the learner would like to find a regressor $w \in \mathbb{R}^d$ with low expected loss, defined as

    $L_D(w) = \mathbb{E}_{(x, y) \sim D}[\ell(w^\top x, y)]$.   (1)

The standard paradigm for learning such a regressor is seeking a vector $w \in \mathbb{R}^d$ that minimizes a trade-off between the expected loss and an additional regularization term, which is usually a norm of $w$. An equivalent form of this optimization problem is obtained by replacing the regularization term with a proper constraint, giving rise to the problem

    $\min_{w \in \mathbb{R}^d} L_D(w)$  s.t.  $\|w\|_p \le B$,   (2)

where $B > 0$ is a regularization parameter and $p \ge 1$. The main variants of regression differ on the type of $\ell_p$-norm constraint as well as the loss function in the above definition:

- Ridge regression: $p = 2$ and the square loss, $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$.
- Lasso regression: $p = 1$ and the square loss.
- Support-vector regression: $p = 2$ and the $\delta$-insensitive absolute loss (Vapnik, 1995), $\ell(\hat{y}, y) = |\hat{y} - y|_\delta := \max\{0, |\hat{y} - y| - \delta\}$.

Since the distribution $D$ is unknown, we learn by relying on a training set $S = \{(x_t, y_t)\}_{t=1}^m$ of examples, that are assumed to be sampled independently from $D$. We use the notation $\ell_t(w) := \ell(w^\top x_t, y_t)$ to refer to the loss function induced by the instance $(x_t, y_t)$.

We distinguish between two learning scenarios. In the full information setup, the learner has unrestricted access to the entire data set. In the limited attribute observation (LAO) setting, for any given example pair $(x, y)$, the learner can observe $y$, but only $k$ attributes of $x$ (where $k \ge 1$ is a parameter of the problem). The learner can actively choose which attributes to observe.
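To make the LAO access model concrete, here is a minimal Python sketch of access to a single example, assuming a small wrapper class (the name `LAOExample` and its interface are our own illustrative choices, not part of the paper): the label is visible, while attributes of $x$ must be queried explicitly, and at most $k$ queries are allowed.

    import numpy as np

    class LAOExample:
        """Limited attribute observation (LAO) access to one example (x, y):
        the label y is visible, and at most k attributes of x may be queried,
        chosen actively by the learner (a sketch of the model of Section 2.1)."""

        def __init__(self, x, y, k):
            self._x, self.y, self.k = np.asarray(x, dtype=float), y, k
            self._queries = 0

        def attribute(self, i):
            """Return x[i]; only k attribute queries are allowed in total."""
            if self._queries >= self.k:
                raise RuntimeError("attribute budget of k queries exhausted")
            self._queries += 1
            return self._x[i]

A learner receives such an object for each training example and decides, possibly at random, which coordinates to query; the algorithms of Section 3 use exactly $k + 1$ such queries per example.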
2.2. Limitations on LAO regression

Cesa-Bianchi et al. (2010) proved the following sample complexity lower bound on any LAO Lasso regression algorithm.

Theorem 2.1. Let $0 < \epsilon < \frac{1}{16}$, $k \ge 1$ and $d > 4k$. For any regression algorithm accessing at most $k$ attributes per training example, there exists a distribution $D$ over $\{x : \|x\|_\infty \le 1\} \times \{\pm 1\}$ and a regressor $w^*$ with $\|w^*\|_1 \le 1$ such that the algorithm must see (in expectation) at least $\Omega(d/(k\epsilon))$ examples in order to learn a linear regressor $w$ with $L_D(w) - L_D(w^*) < \epsilon$.

We complement this lower bound by providing a stronger lower bound on the sample complexity of any Ridge regression algorithm, using information-theoretic arguments.

Theorem 2.2. Let $\epsilon = \Omega(1/\sqrt{d})$. For any regression algorithm accessing at most $k$ attributes per training example, there exists a distribution $D$ over $\{x : \|x\|_2 \le 1\} \times \{\pm 1\}$ and a regressor $w^*$ with $\|w^*\|_2 \le 1$ such that the algorithm must see (in expectation) at least $\Omega(d/(k\epsilon^2))$ examples in order to learn a linear regressor $w$, $\|w\|_2 \le 1$, with $L_D(w) - L_D(w^*) \le \epsilon$.

Our algorithm for LAO Ridge regression (see Section 3) implies this lower bound to be tight up to constants. Note, however, that the bound applies only to a particular regime of the problem parameters.

2.3. Our algorithmic results

We give efficient regression algorithms that attain the following risk bounds. For our Ridge regression algorithm, we prove the risk bound

    $\mathbb{E}[L_D(\bar{w})] \le \min_{\|w\|_2 \le B} L_D(w) + O\!\left(B^2 \sqrt{\frac{d}{km}}\right)$,

while for our Lasso regression algorithm we establish the bound

    $\mathbb{E}[L_D(\bar{w})] \le \min_{\|w\|_1 \le B} L_D(w) + O\!\left(B^2 \sqrt{\frac{d \log d}{km}}\right)$.

Here we use $\bar{w}$ to denote the output of each algorithm on a training set of $m$ examples, and the expectations are taken with respect to the randomization of the algorithms. Setting the excess risk to $\epsilon$ and solving for $m$ yields the sample complexity bounds of Table 1. For Support-vector regression we obtain a risk bound that depends on the desired accuracy $\epsilon$.³ Our bound implies that

    $m = O\!\left(\frac{d}{k}\right) \exp\!\left(O\!\left(\log^2 \frac{B}{\epsilon}\right)\right)$

examples are needed (in expectation) for obtaining an $\epsilon$-accurate regressor.

3. Algorithms for LAO least-squares regression

In this section we present and analyze our algorithms for Ridge and Lasso regression in the LAO setting. The loss function under consideration here is the square loss, that is, $\ell_t(w) = \frac{1}{2}(w^\top x_t - y_t)^2$. For convenience, we show algorithms that use $k + 1$ attributes of each instance, for $k \ge 1$.⁴

Our algorithms are iterative and maintain a regressor $w_t$ along the iterations. The update of the regressor at iteration $t$ is based on gradient information, and specifically on $g_t := \nabla \ell_t(w_t)$, which equals $(w_t^\top x_t - y_t)\, x_t$ for the square loss. In the LAO setting, however, we do not have access to this information, thus we build upon unbiased estimators of the gradients.

³ Indeed, there are (full-information) algorithms that are known to converge at an $O(1/\epsilon)$ rate; see e.g. (Hazan et al., 2007).
⁴ We note that by our approach it is impossible to learn using a single attribute of each example (i.e., with $k = 0$), and we are not aware of any algorithm that is able to do so. See (Cesa-Bianchi et al., 2011) for a related impossibility result.

Algorithm 1 AERR
Parameters: $B, \eta > 0$
Input: training set $S = \{(x_t, y_t)\}_{t \in [m]}$ and $k > 0$
Output: regressor $\bar{w}$ with $\|\bar{w}\|_2 \le B$
1: Initialize $w_1 \ne 0$, $\|w_1\|_2 \le B$ arbitrarily
2: for $t = 1$ to $m$ do
3:   for $r = 1$ to $k$ do
4:     Pick $i_{t,r} \in [d]$ uniformly and observe $x_t[i_{t,r}]$
5:     $\tilde{x}_{t,r} \leftarrow d \cdot x_t[i_{t,r}]\, e_{i_{t,r}}$
6:   end for
7:   $\tilde{x}_t \leftarrow \frac{1}{k} \sum_{r=1}^k \tilde{x}_{t,r}$
8:   Choose $j_t \in [d]$ with probability $w_t[j]^2 / \|w_t\|_2^2$, and observe $x_t[j_t]$
9:   $\tilde{\phi}_t \leftarrow \|w_t\|_2^2\, x_t[j_t] / w_t[j_t] - y_t$
10:  $\tilde{g}_t \leftarrow \tilde{\phi}_t\, \tilde{x}_t$
11:  $v_t \leftarrow w_t - \eta\, \tilde{g}_t$
12:  $w_{t+1} \leftarrow v_t \cdot B / \max\{\|v_t\|_2, B\}$
13: end for
14: $\bar{w} \leftarrow \frac{1}{m} \sum_{t=1}^m w_t$

3.1. Ridge regression

Recall that in Ridge regression, we are interested in the linear regressor that is the solution to the optimization problem (2) with $p = 2$, given explicitly as

    $\min_{w \in \mathbb{R}^d} L_D(w)$  s.t.  $\|w\|_2 \le B$.   (3)

Our algorithm for the LAO setting is based on a randomized Online Gradient Descent (OGD) strategy (Zinkevich, 2003). More specifically, at each iteration $t$ we use a randomized estimator $\tilde{g}_t$ of the gradient $g_t$ to update the regressor $w_t$ via an additive rule. Our gradient estimators make use of an importance-sampling method inspired by (Clarkson et al., 2010). The pseudo-code of our Attribute Efficient Ridge Regression (AERR) algorithm is given in Algorithm 1. In the following theorem, we show that the regressor learned by our algorithm is competitive with the optimal linear regressor having 2-norm bounded by $B$.

Theorem 3.1. Assume the distribution $D$ is such that $\|x\|_2 \le 1$ and $|y| \le B$ with probability 1. Let $\bar{w}$ be the output of AERR, when run with $\eta = \sqrt{k/(2dm)}$. Then, $\|\bar{w}\|_2 \le B$ and for any $w^* \in \mathbb{R}^d$ with $\|w^*\|_2 \le B$,

    $\mathbb{E}[L_D(\bar{w})] \le L_D(w^*) + 4B^2 \sqrt{\frac{2d}{km}}$.
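For concreteness, the following is a minimal NumPy sketch of the AERR loop, written to mirror Algorithm 1. The function name `aerr` and the simulation interface (the full matrix `X` is held in memory, and the code merely restricts itself to reading $k + 1$ entries of each row) are our own illustrative choices, not part of the paper.

    import numpy as np

    def aerr(X, y, B, k, eta, rng=None):
        """Sketch of Attribute Efficient Ridge Regression (Algorithm 1).

        X : (m, d) array held only for simulation -- each iteration reads
            at most k + 1 of its entries.  y : (m,) targets.  B : 2-norm bound.
        k : attribute budget.  eta : step size (Theorem 3.1 uses
            eta = sqrt(k / (2 * d * m)))."""
        rng = np.random.default_rng() if rng is None else rng
        m, d = X.shape
        w = np.full(d, B / np.sqrt(d))          # any nonzero w_1 with ||w_1||_2 <= B
        w_sum = np.zeros(d)
        for t in range(m):
            w_sum += w                          # running sum of the iterates w_t
            x_t, y_t = X[t], y[t]
            # unbiased estimate of x_t from k uniformly sampled coordinates
            idx = rng.integers(0, d, size=k)
            x_tilde = np.zeros(d)
            np.add.at(x_tilde, idx, d * x_t[idx] / k)
            # importance-sampled estimate of w_t.x_t - y_t (one more attribute)
            j = rng.choice(d, p=w**2 / np.dot(w, w))
            phi = np.dot(w, w) * x_t[j] / w[j] - y_t
            g_tilde = phi * x_tilde             # unbiased gradient estimate
            v = w - eta * g_tilde               # additive (OGD) step
            w = v * B / max(np.linalg.norm(v), B)   # project onto the 2-norm ball
        return w_sum / m                        # averaged regressor

Note that the coordinate $j$ is sampled with probability proportional to $w_t[j]^2$, so the division by $w[j]$ is only ever performed on nonzero entries.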
Analysis

Theorem 3.1 is a consequence of the following two lemmas. The first lemma is obtained as a result of a standard regret bound for the OGD algorithm (see Zinkevich, 2003), applied to the vectors $\tilde{g}_1, \ldots, \tilde{g}_m$.

Lemma 3.2. For any $\|w^*\|_2 \le B$ we have

    $\sum_{t=1}^m \tilde{g}_t^\top (w_t - w^*) \le \frac{2B^2}{\eta} + \frac{\eta}{2} \sum_{t=1}^m \|\tilde{g}_t\|_2^2$.   (4)

The second lemma shows that the vector $\tilde{g}_t$ is an unbiased estimator of the gradient $g_t := \nabla \ell_t(w_t)$ at iteration $t$, and establishes a variance bound for this estimator. To simplify notation, here and in the rest of the paper we use $\mathbb{E}_t[\cdot]$ to denote the conditional expectation with respect to all randomness up to time $t$.

Lemma 3.3. The vector $\tilde{g}_t$ is an unbiased estimator of the gradient $g_t := \nabla \ell_t(w_t)$, that is, $\mathbb{E}_t[\tilde{g}_t] = g_t$. In addition, for all $t$ we have $\mathbb{E}_t[\|\tilde{g}_t\|_2^2] \le 8B^2 d/k$.

For a proof of the lemma, see (Hazan & Koren, 2011). We now turn to prove Theorem 3.1.

Proof (of Theorem 3.1). First note that as $\|w_t\|_2 \le B$, we clearly have $\|\bar{w}\|_2 \le B$. Taking the expectation of (4) with respect to the randomization of the algorithm, and letting $G^2 := \max_t \mathbb{E}_t[\|\tilde{g}_t\|_2^2]$, we obtain

    $\mathbb{E}\!\left[\sum_{t=1}^m g_t^\top (w_t - w^*)\right] \le \frac{2B^2}{\eta} + \frac{\eta}{2}\, G^2 m$.

On the other hand, the convexity of $\ell_t$ gives $\ell_t(w_t) - \ell_t(w^*) \le g_t^\top (w_t - w^*)$. Together with the above this implies that for $\eta = 2B/(G\sqrt{m})$,

    $\mathbb{E}\!\left[\frac{1}{m}\sum_{t=1}^m \ell_t(w_t)\right] \le \frac{1}{m}\sum_{t=1}^m \ell_t(w^*) + \frac{2BG}{\sqrt{m}}$.

Taking the expectation of both sides with respect to the random choice of the training set, and using $G \le 2B\sqrt{2d/k}$ (according to Lemma 3.3), we get

    $\mathbb{E}\!\left[\frac{1}{m}\sum_{t=1}^m L_D(w_t)\right] \le L_D(w^*) + 4B^2 \sqrt{\frac{2d}{km}}$.

Finally, recalling the convexity of $L_D$ and using Jensen's inequality, the theorem follows.

3.2. Lasso regression

We now turn to describe our algorithm for Lasso regression in the LAO setting, in which we would like to solve the problem

    $\min_{w \in \mathbb{R}^d} L_D(w)$  s.t.  $\|w\|_1 \le B$.   (5)

The algorithm we provide for this problem is based on a stochastic variant of the EG algorithm (Kivinen & Warmuth, 1997), which employs multiplicative updates based on an estimation of the gradients $\nabla \ell_t$. The multiplicative nature of the algorithm, however, makes it highly sensitive to the magnitude of the updates. To make the updates more robust, we clip the entries of the gradient estimator so as to prevent them from getting too large. Formally, this is accomplished via the following clip operation: $\mathrm{clip}(x, c) := \max\{\min\{x, c\}, -c\}$ for $x \in \mathbb{R}$ and $c > 0$. This clipping has an even stronger effect in the more general setting we consider in Section 4.

Algorithm 2 AELR
Parameters: $B, \eta > 0$
Input: training set $S = \{(x_t, y_t)\}_{t \in [m]}$ and $k > 0$
Output: regressor $\bar{w}$ with $\|\bar{w}\|_1 \le B$
1: Initialize $z_1^+ \leftarrow \mathbf{1}$, $z_1^- \leftarrow \mathbf{1}$
2: for $t = 1$ to $m$ do
3:   $w_t \leftarrow (z_t^+ - z_t^-) \cdot B / (\|z_t^+\|_1 + \|z_t^-\|_1)$
4:   for $r = 1$ to $k$ do
5:     Pick $i_{t,r} \in [d]$ uniformly and observe $x_t[i_{t,r}]$
6:     $\tilde{x}_{t,r} \leftarrow d \cdot x_t[i_{t,r}]\, e_{i_{t,r}}$
7:   end for
8:   $\tilde{x}_t \leftarrow \frac{1}{k} \sum_{r=1}^k \tilde{x}_{t,r}$
9:   Choose $j_t \in [d]$ with probability $|w_t[j]| / \|w_t\|_1$, and observe $x_t[j_t]$
10:  $\tilde{\phi}_t \leftarrow \|w_t\|_1\, \mathrm{sign}(w_t[j_t])\, x_t[j_t] - y_t$
11:  $\tilde{g}_t \leftarrow \tilde{\phi}_t\, \tilde{x}_t$
12:  for $i = 1$ to $d$ do
13:    $\bar{g}_t[i] \leftarrow \mathrm{clip}(\tilde{g}_t[i], 1/\eta)$
14:    $z_{t+1}^+[i] \leftarrow z_t^+[i] \exp(-\eta\, \bar{g}_t[i])$
15:    $z_{t+1}^-[i] \leftarrow z_t^-[i] \exp(+\eta\, \bar{g}_t[i])$
16:  end for
17: end for
18: $\bar{w} \leftarrow \frac{1}{m} \sum_{t=1}^m w_t$

We give our Attribute Efficient Lasso Regression (AELR) algorithm in Algorithm 2, and establish a corresponding risk bound in the following theorem.

Theorem 3.4. Assume the distribution $D$ is such that $\|x\|_\infty \le 1$ and $|y| \le B$ with probability 1. Let $\bar{w}$ be the output of AELR, when run with

    $\eta = \frac{1}{4B}\sqrt{\frac{2k \log 2d}{5dm}}$.

Then, $\|\bar{w}\|_1 \le B$ and for any $w^* \in \mathbb{R}^d$ with $\|w^*\|_1 \le B$ we have

    $\mathbb{E}[L_D(\bar{w})] \le L_D(w^*) + 4B^2 \sqrt{\frac{10\, d \log 2d}{km}}$,

provided that $m \ge \log 2d$.
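As with AERR, here is a minimal NumPy sketch of the AELR loop, written to mirror Algorithm 2; the function name `aelr` and the simulation interface are again our own illustrative choices. The special case $w_t = 0$ (which occurs at the first iteration, where $z_1^+ = z_1^-$) is handled by noting that the estimate of $w_t^\top x_t$ is then exactly zero, so no extra attribute needs to be observed.

    import numpy as np

    def aelr(X, y, B, k, eta, rng=None):
        """Sketch of Attribute Efficient Lasso Regression (Algorithm 2):
        multiplicative (EG-style) updates on z_plus/z_minus, with the gradient
        estimate clipped entrywise at 1/eta before the update."""
        rng = np.random.default_rng() if rng is None else rng
        m, d = X.shape
        z_plus, z_minus = np.ones(d), np.ones(d)
        w_sum = np.zeros(d)
        for t in range(m):
            x_t, y_t = X[t], y[t]
            w = (z_plus - z_minus) * B / (z_plus.sum() + z_minus.sum())
            w_sum += w
            # unbiased estimate of x_t from k uniformly sampled coordinates
            idx = rng.integers(0, d, size=k)
            x_tilde = np.zeros(d)
            np.add.at(x_tilde, idx, d * x_t[idx] / k)
            # importance-sampled estimate of w_t.x_t - y_t (one more attribute)
            w_l1 = np.abs(w).sum()
            if w_l1 > 0:
                j = rng.choice(d, p=np.abs(w) / w_l1)
                phi = w_l1 * np.sign(w[j]) * x_t[j] - y_t
            else:
                phi = -y_t                      # w_t = 0, so w_t.x_t = 0 exactly
            g_tilde = phi * x_tilde
            g_bar = np.clip(g_tilde, -1.0 / eta, 1.0 / eta)   # clip(., 1/eta)
            z_plus *= np.exp(-eta * g_bar)
            z_minus *= np.exp(eta * g_bar)
        return w_sum / m                        # averaged regressor

Per Theorem 3.4, a natural choice of step size is $\eta = \frac{1}{4B}\sqrt{2k\log(2d)/(5dm)}$.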

Analysis

In the rest of the section, for a vector $v$ we let $v^2$ denote the vector for which $v^2[i] = (v[i])^2$ for all $i$. In order to prove Theorem 3.4, we first consider the augmented vectors $\bar z_t := (z_t^+, z_t^-) \in \mathbb{R}^{2d}$ and $\hat g_t := (\bar g_t, -\bar g_t) \in \mathbb{R}^{2d}$, and let $p_t := \bar z_t / \|\bar z_t\|_1$. For these vectors, we have the following.

Lemma 3.5.

    $\sum_{t=1}^m p_t^\top \hat g_t \le \min_{i \in [2d]} \sum_{t=1}^m \hat g_t[i] + \frac{\log 2d}{\eta} + \eta \sum_{t=1}^m p_t^\top (\hat g_t)^2$.

The lemma is a consequence of a second-order regret bound for the Multiplicative-Weights algorithm, essentially due to (Clarkson et al., 2010). By means of this lemma, we establish a risk bound with respect to the clipped linear functions $\bar g_t^\top w$.

Lemma 3.6. Assume that $\mathbb{E}_t[\|\tilde g_t\|_\infty^2] \le G^2$ for all $t$, for some $G > 0$. Then, for any $\|w^*\|_1 \le B$, we have

    $\mathbb{E}\!\left[\sum_{t=1}^m \bar g_t^\top w_t\right] \le \mathbb{E}\!\left[\sum_{t=1}^m \bar g_t^\top w^*\right] + B\left(\frac{\log 2d}{\eta} + \eta G^2 m\right)$.

Our next step is to relate the risk generated by the linear functions $\tilde g_t^\top w$ to that generated by the clipped functions $\bar g_t^\top w$.

Lemma 3.7. Assume that $\mathbb{E}_t[\|\tilde g_t\|_\infty^2] \le G^2$ for all $t$, for some $G > 0$. Then, for $0 < \eta \le 1/(2G)$ we have

    $\mathbb{E}\!\left[\sum_{t=1}^m \tilde g_t^\top w_t\right] \le \mathbb{E}\!\left[\sum_{t=1}^m \bar g_t^\top w_t\right] + 4B\eta G^2 m$.

The final component of the proof is a variance bound, similar to that of Lemma 3.3.

Lemma 3.8. The vector $\tilde g_t$ is an unbiased estimator of the gradient $g_t := \nabla \ell_t(w_t)$, that is, $\mathbb{E}_t[\tilde g_t] = g_t$. In addition, for all $t$ we have $\mathbb{E}_t[\|\tilde g_t\|_\infty^2] \le 8B^2 d/k$.

For the complete proofs, refer to (Hazan & Koren, 2011). We are now ready to prove Theorem 3.4.

Proof (of Theorem 3.4). Since $\|w_t\|_1 \le B$ for all $t$, we obtain $\|\bar w\|_1 \le B$. Next, note that as $\mathbb{E}_t[\tilde g_t] = g_t$, we have $\mathbb{E}[\sum_{t=1}^m g_t^\top w_t] = \mathbb{E}[\sum_{t=1}^m \tilde g_t^\top w_t]$. Putting Lemmas 3.6 and 3.7 together, we get for $\eta \le 1/(2G)$ that

    $\mathbb{E}\!\left[\sum_{t=1}^m \tilde g_t^\top (w_t - w^*)\right] \le B\left(\frac{\log 2d}{\eta} + 5\eta G^2 m\right)$.

Proceeding as in the proof of Theorem 3.1, and choosing $\eta = \frac{1}{G}\sqrt{\frac{\log 2d}{5m}}$, we obtain the bound

    $\mathbb{E}[L_D(\bar w)] \le L_D(w^*) + 2BG\sqrt{\frac{5 \log 2d}{m}}$.

Note that for this choice of $\eta$ we indeed have $\eta \le 1/(2G)$, as we originally assumed that $m \ge \log 2d$. Finally, putting $G = 2B\sqrt{2d/k}$ as implied by Lemma 3.8, we obtain the bound in the statement of the theorem.

4. Support-vector regression

In this section we show how our approach can be extended to deal with loss functions other than the square loss, of the form

    $\ell(w^\top x, y) = f(w^\top x - y)$,   (6)

(with $f$ real and convex) and most importantly, with the $\delta$-insensitive absolute loss function of SVR, for which $f(x) = |x|_\delta := \max\{|x| - \delta, 0\}$ for some fixed $0 \le \delta \le B$ (recall that in our results we assume the labels $y_t$ have $|y_t| \le B$). For concreteness, we consider only the 2-norm variant of the problem (as in the standard formulation of SVR); the results we obtain can be easily adjusted to the 1-norm setting. We overload notation, and keep using the shorthand $\ell_t(w) := \ell(w^\top x_t, y_t)$ for referring to the loss function induced by the instance $(x_t, y_t)$.

It should be highlighted that our techniques can be adapted to deal with many other common loss functions, including classification losses (i.e., of the form $\ell(w^\top x, y) = f(y\, w^\top x)$). Due to its importance and popularity, we chose to describe our method in the context of SVR.

Unfortunately, there are strong indications that SVR learning (more generally, learning with a non-smooth loss function) in the LAO setting is impossible via our approach of unbiased gradient estimations (see Cesa-Bianchi et al., 2011, and the references therein). For that reason, we make two modifications to the learning setting: first, we shall henceforth relax the budget constraint to allow $k$ observed attributes per instance in expectation; and second, we shall aim for biased gradient estimators, instead of unbiased as before.
To obtain such biased estimators, we uniformly $\epsilon$-approximate the function $f$ by an analytic function $f_\epsilon$ and learn with the approximate loss function $\ell_t^\epsilon(w) = f_\epsilon(w^\top x_t - y_t)$ instead. Clearly, any $\epsilon$-suboptimal regressor of the approximate problem is a $2\epsilon$-suboptimal regressor of the original problem. For learning the approximate problem we use a novel technique, inspired by (Cesa-Bianchi et al., 2011), for estimating gradients of analytic loss functions. Our estimators for $\nabla \ell_t^\epsilon$ can then be viewed as biased estimators of $\nabla \ell_t$ (we note, however, that the resulting bias might be quite large).

Procedure 3 GenEst
Parameters: $\{a_n\}_{n=0}^\infty$ — Taylor coefficients of $f'$
Input: regressor $w$, instance $(x, y)$
Output: $\hat\phi$ with $\mathbb{E}[\hat\phi] = f'(w^\top x - y)$
1: Let $N = 4B^2$
2: Choose $n \ge 0$ with probability $\Pr[n] = (\frac{1}{2})^{n+1}$
3: if $n \le 2\log_2 N$ then
4:   for $r = 1, \ldots, n$ do
5:     Choose $j \in [d]$ with probability $w[j]^2 / \|w\|_2^2$, and observe $x[j]$
6:     $\tilde\theta_r \leftarrow \|w\|_2^2\, x[j]/w[j] - y$
7:   end for
8: else
9:   for $r = 1, \ldots, n$ do
10:    Choose $j_1, \ldots, j_N \in [d]$ with probability $w[j]^2/\|w\|_2^2$ (independently), and observe $x[j_1], \ldots, x[j_N]$
11:    $\tilde\theta_r \leftarrow \frac{1}{N} \sum_{s=1}^N \|w\|_2^2\, x[j_s]/w[j_s] - y$
12:  end for
13: end if
14: $\hat\phi \leftarrow 2^{n+1} a_n\, \tilde\theta_1 \tilde\theta_2 \cdots \tilde\theta_n$

4.1. Estimators for analytic loss functions

Let $f : \mathbb{R} \to \mathbb{R}$ be a real, analytic function (on the entire real line). The derivative $f'$ is thus also analytic and can be expressed as $f'(x) = \sum_{n=0}^\infty a_n x^n$, where $\{a_n\}$ are the Taylor expansion coefficients of $f'$. In Procedure 3 we give an unbiased estimator of $f'(w^\top x - y)$ in the LAO setting, defined in terms of the coefficients $\{a_n\}$ of $f'$. For this estimator, we have the following (proof is omitted).

Lemma 4.1. The estimator $\hat\phi$ is an unbiased estimator of $f'(w^\top x - y)$. Also, assuming $\|x\|_2 \le 1$, $\|w\|_2 \le B$ and $|y| \le B$, the second moment $\mathbb{E}[\hat\phi^2]$ is upper bounded by $\exp(O(\log^2 B))$, provided that the Taylor series of $f'(x)$ converges absolutely for $|x| \le 1$. Finally, the expected number of attributes of $x$ used by this estimator is no more than 3.

4.2. Approximating SVR

In order to approximate the $\delta$-insensitive absolute loss function, we define

    $f_\epsilon(x) = \frac{\epsilon}{2}\, \rho\!\left(\frac{x - \delta}{\epsilon}\right) + \frac{\epsilon}{2}\, \rho\!\left(\frac{x + \delta}{\epsilon}\right) - \delta$,

where $\rho$ is expressed in terms of the error function $\mathrm{erf}$,

    $\rho(x) = x\, \mathrm{erf}(x) + \frac{1}{\sqrt{\pi}}\, e^{-x^2}$,

and consider the approximate loss functions $\ell_t^\epsilon(w) = f_\epsilon(w^\top x_t - y_t)$. Indeed, we have the following.

Claim 4.2. For any $\epsilon > 0$, $f_\epsilon$ is convex, analytic on the entire real line, and $\sup_{x \in \mathbb{R}} |f_\epsilon(x) - |x|_\delta| \le \epsilon$.

The claim follows easily from the identity $|x|_\delta = \frac{1}{2}|x - \delta| + \frac{1}{2}|x + \delta| - \delta$. In addition, for using Procedure 3 we need the following simple observation, which follows immediately from the series expansion of $\mathrm{erf}(x)$.

Claim 4.3. $\rho'(x) = \sum_{n=0}^\infty a_{2n+1} x^{2n+1}$, with the coefficients $\{a_{2n+1}\}_{n \ge 0}$ given in (7).

We now give the main result of this section, which is a sample complexity bound for the Attribute Efficient SVR (AESVR) algorithm, given in Algorithm 4.

Algorithm 4 AESVR
Parameters: $B, \delta, \eta > 0$ and accuracy $\epsilon > 0$
Input: training set $S = \{(x_t, y_t)\}_{t \in [m]}$ and $k > 0$
Output: regressor $\bar w$ with $\|\bar w\|_2 \le B$
1: Let $a_{2n} = 0$ for $n \ge 0$, and
     $a_{2n+1} = \frac{2\,(-1)^n}{\sqrt{\pi}\; n!\,(2n+1)}$,  $n \ge 0$.   (7)
2: Execute Algorithm 1 with lines 8–9 replaced by:
     $x_t \leftarrow x_t/\epsilon$
     $y_t^+ \leftarrow (y_t + \delta)/\epsilon$,  $y_t^- \leftarrow (y_t - \delta)/\epsilon$
     $\tilde\phi_t \leftarrow \frac{1}{2}\left[\mathrm{GenEst}(w_t, x_t, y_t^+) + \mathrm{GenEst}(w_t, x_t, y_t^-)\right]$
3: Return the output $\bar w$ of the algorithm

Theorem 4.4. Assume the distribution $D$ is such that $\|x\|_2 \le 1$ and $|y| \le B$ with probability 1. Then, for any $w^* \in \mathbb{R}^d$ with $\|w^*\|_2 \le B$, we have

    $\mathbb{E}[L_D(\bar w)] \le L_D(w^*) + \epsilon$,

where $\bar w$ is the output of AESVR (with $\eta$ properly tuned) on a training set of size

    $m = O\!\left(\frac{d}{k}\right) \exp\!\left(O\!\left(\log^2 \frac{B}{\epsilon}\right)\right)$.   (8)

The algorithm queries at most $k + 6$ attributes of each instance in expectation.

Proof. First, note that for the approximate loss functions $\ell_t^\epsilon$ we have

    $\nabla \ell_t^\epsilon(w_t) = \frac{1}{2}\left[\rho'(w_t^\top x_t - y_t^+) + \rho'(w_t^\top x_t - y_t^-)\right] x_t$

(with $x_t$ and $y_t^\pm$ as rescaled in Algorithm 4). Hence, Lemma 4.1 and Claim 4.3 above imply that $\tilde g_t$ in Algorithm 4 is an unbiased estimator of $\nabla \ell_t^\epsilon(w_t)$. Furthermore, since after the rescaling $\|x_t\|_2 \le 1/\epsilon$ and $|y_t^\pm| \le 2B/\epsilon$, according to the same lemma we have $\mathbb{E}_t[\tilde\phi_t^2] = \exp(O(\log^2 \frac{B}{\epsilon}))$. Repeating the proof of Lemma 3.3, we then have

    $\mathbb{E}_t[\|\tilde g_t\|_2^2] = \mathbb{E}_t[\tilde\phi_t^2]\, \mathbb{E}_t[\|\tilde x_t\|_2^2] = \exp\!\left(O\!\left(\log^2 \frac{B}{\epsilon}\right)\right) \frac{d}{k}$.

Replacing $G^2$ in the proof of Theorem 3.1 with the above bound, we get for the output of Algorithm 4,

    $\mathbb{E}[L_D(\bar w)] \le L_D(w^*) + \exp\!\left(O\!\left(\log^2 \frac{B}{\epsilon}\right)\right) \sqrt{\frac{d}{km}}$,

which implies that for obtaining an $\epsilon$-accurate regressor $\bar w$ of the approximate problem, it is enough to take $m$ as given in (8). However, Claim 4.2 now gives that $\bar w$ itself is a $2\epsilon$-accurate regressor of the original problem, and the proof is complete.
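To illustrate Procedure 3, here is a minimal Python sketch of the randomized Taylor estimator together with the coefficients of equation (7). The function names (`gen_est`, `erf_coefficient`) are ours, and the averaging size $N = 4B^2$ and the threshold $2\log_2 N$ follow our reading of the procedure; this is a sketch under those assumptions, not the authors' implementation.

    import math
    import numpy as np

    def erf_coefficient(n):
        """Taylor coefficients of rho'(x) = erf(x), as in equation (7):
        a_{2m+1} = 2(-1)^m / (sqrt(pi) * m! * (2m+1)), and a_n = 0 for even n."""
        if n % 2 == 0:
            return 0.0
        m = (n - 1) // 2
        return 2.0 * (-1) ** m / (math.sqrt(math.pi) * math.factorial(m) * (2 * m + 1))

    def gen_est(a, w, x, y, B, rng=None):
        """Sketch of Procedure 3 (GenEst): a randomized estimate of f'(w.x - y),
        where a(n) returns the n-th Taylor coefficient of f'.  Only a few
        randomly chosen attributes of x are observed."""
        rng = np.random.default_rng() if rng is None else rng
        d = len(w)
        p = w ** 2 / np.dot(w, w)                  # importance weights (w must be nonzero)

        def theta():                               # one-attribute estimate of w.x - y
            j = rng.choice(d, p=p)
            return np.dot(w, w) * x[j] / w[j] - y

        N = max(1, math.ceil(4 * B ** 2))          # averaging size ("N = 4B^2")
        n = rng.geometric(0.5) - 1                 # Pr[n] = (1/2)^(n+1), n = 0, 1, 2, ...
        if n <= 2 * math.log2(N):
            thetas = [theta() for _ in range(n)]                    # single samples
        else:
            thetas = [np.mean([theta() for _ in range(N)])          # N-sample averages
                      for _ in range(n)]
        return 2 ** (n + 1) * a(n) * math.prod(thetas)

Taking the expectation over $n$ and over the independent one-attribute estimates recovers $\sum_n a_n (w^\top x - y)^n = f'(w^\top x - y)$, which is the unbiasedness claim of Lemma 4.1. In AESVR, two such estimates with `a = erf_coefficient`, the rescaled instance $x/\epsilon$ and targets $y^\pm = (y \pm \delta)/\epsilon$ are averaged to form $\tilde\phi_t$.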

5. Experiments

In this section we give experimental evidence that supports our theoretical bounds, and demonstrates the superior performance of our algorithms compared to the state of the art. Naturally, we chose to compare our AERR and AELR algorithms⁵ with the AER algorithm of (Cesa-Bianchi et al., 2010). We note that AER is in fact a hybrid algorithm that combines 1-norm and 2-norm regularizations, thus we use it for benchmarking in both the Ridge and Lasso settings.

We essentially repeated the experiments of (Cesa-Bianchi et al., 2010) and used the popular MNIST digit recognition dataset (LeCun et al., 1998). Each instance in this dataset is a 28 × 28 image of a handwritten digit 0–9. We focused on the "3 vs. 5" task, on a subset of the dataset that consists of the "3" digits (labeled −1) and the "5" digits (labeled +1). We applied the regression algorithms to this task by regressing to the labels. In all our experiments, we randomly split the data into training and test sets, and used 10-fold cross-validation for tuning the parameters of each algorithm. Then, we ran each algorithm on increasingly longer prefixes of the dataset and tracked the obtained square error on the test set. For faithfully comparing partial- and full-information algorithms, we also recorded the total number of attributes used by each algorithm.

⁵ The AESVR algorithm is presented mainly for theoretical considerations, and was not implemented in the experiments.

In our first experiment, we executed AELR, AER and (offline) Lasso on the 3 vs. 5 task. We allowed both AELR and AER to use only k = 4 pixels of each training image, while giving Lasso unrestricted access to the entire set of attributes (total of 784) of each instance. The results, averaged over 10 runs on random train/test splits, are presented in Figure 1. Note that the x-axis represents the cumulative number of attributes used for training. The graph ends at roughly 5 × 10⁴ attributes, which is the total number of attributes allowed for the partial-information algorithms. Lasso, however, completes this budget after seeing merely 62 examples.

Figure 1. Test square error of Lasso algorithms with k = 4, over increasing prefixes of the 3 vs. 5 dataset.

As we see from the results, AELR keeps its test error significantly lower than that of AER along the entire execution, almost bridging the gap with the full-information Lasso. Note that the latter has the clear advantage of being an offline algorithm, while both AELR and AER are online in nature. Indeed, when we compared AELR with an online Lasso solver, our algorithm obtained test error almost 10 times better.

In the second experiment, we evaluated AERR, AER and Ridge regression on the same task, but now allowing the partial-information algorithms to use as much as k = 56 pixels (which amounts to 2 rows) of each instance. The results of this experiment are given in Figure 2.

Figure 2. Test square error of Ridge algorithms with k = 56, over increasing prefixes of the 3 vs. 5 dataset.

We see that even if we allow the algorithms to view a considerable number of attributes, the gap between AERR and AER is large.
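For reference, here is a small Python sketch of the bookkeeping behind these plots, under our own assumptions: it reuses the hypothetical `aerr`/`aelr` sketches above (the prefix sizes and the `learner` interface are placeholders, and the paper tunes B and η by 10-fold cross-validation).

    import numpy as np

    def error_vs_attribute_budget(learner, train_X, train_y, test_X, test_y,
                                  k, prefix_sizes=(100, 200, 400, 800, 1600)):
        """Run `learner` on increasingly long prefixes of the training data and
        record test square error against the cumulative number of attributes
        observed (k + 1 attributes per training example for AERR/AELR)."""
        curve = []
        for m in prefix_sizes:
            if m > len(train_y):
                break
            w = learner(train_X[:m], train_y[:m])            # returns a regressor
            test_err = 0.5 * np.mean((test_X @ w - test_y) ** 2)
            curve.append((m * (k + 1), test_err))            # (attributes used, error)
        return curve

For example, `error_vs_attribute_budget(lambda X, y: aelr(X, y, B=1.0, k=4, eta=0.01), ...)` traces a curve comparable to the partial-information curves of Figure 1; an offline baseline would instead spend d = 784 attributes per example.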
6. Conclusions and Open Questions

We have considered the fundamental problem of statistical regression analysis, and in particular Lasso and Ridge regression, in a setting where the observation of each training instance is limited to a few attributes, and gave algorithms that improve over the state of the art by a leading-order term with respect to the sample complexity. This resolves an open question of (Cesa-Bianchi et al., 2010). The algorithms are efficient, and give a clear experimental advantage in previously-considered benchmarks.

For the challenging case of regression with general convex loss functions, we describe an exponential improvement in sample complexity, which applies in particular to support-vector regression.

It is interesting to resolve the sample complexity gap of $1/\epsilon$ which still remains for Lasso regression, and to improve upon the pseudo-polynomial factor in $\epsilon$ for support-vector regression. In addition, establishing analogous bounds for our algorithms that hold with high probability (rather than in expectation) appears to be non-trivial, and is left for future work. Another possible direction for future research is adapting our results to the setting of learning with (randomly) missing data, which was recently investigated; see e.g. (Rostamizadeh et al., 2011; Loh & Wainwright, 2011). The sample complexity bounds our algorithms obtain in this setting are slightly worse than those presented in the current paper, and it is interesting to check if one can do better.

Acknowledgments

We thank Shai Shalev-Shwartz for several useful discussions, and the anonymous referees for their detailed comments.

References

Ben-David, S. and Dichterman, E. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3), 1998.
Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12), December 2011.
Clarkson, K. L., Hazan, E., and Woodruff, D. P. Sublinear optimization for machine learning. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. IEEE, 2010.
Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
Hazan, E. and Koren, T. Optimal algorithms for ridge and lasso regression with partially observed attributes. arXiv preprint, 2011.
Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2), 2007.
Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 22, 2008.
Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
Loh, P. L. and Wainwright, M. J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, 2011.
Rostamizadeh, A., Agarwal, A., and Bartlett, P. Learning with missing features. In The 27th Conference on Uncertainty in Artificial Intelligence, 2011.
Vapnik, V. N. The nature of statistical learning theory. Springer-Verlag, 1995.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, volume 20. ACM, 2003.


AN INTRODUCTION TO NUMERICAL METHODS USING MATHCAD. Mathcad Release 14. Khyruddin Akbar Ansari, Ph.D., P.E. AN INTRODUCTION TO NUMERICAL METHODS USING MATHCAD Mathca Release 14 Khyruin Akbar Ansari, Ph.D., P.E. Professor of Mechanical Engineering School of Engineering an Applie Science Gonzaga University SDC

More information

Euler equations for multiple integrals

Euler equations for multiple integrals Euler equations for multiple integrals January 22, 2013 Contents 1 Reminer of multivariable calculus 2 1.1 Vector ifferentiation......................... 2 1.2 Matrix ifferentiation........................

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

A comparison of small area estimators of counts aligned with direct higher level estimates

A comparison of small area estimators of counts aligned with direct higher level estimates A comparison of small area estimators of counts aligne with irect higher level estimates Giorgio E. Montanari, M. Giovanna Ranalli an Cecilia Vicarelli Abstract Inirect estimators for small areas use auxiliary

More information

WUCHEN LI AND STANLEY OSHER

WUCHEN LI AND STANLEY OSHER CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Ramsey numbers of some bipartite graphs versus complete graphs

Ramsey numbers of some bipartite graphs versus complete graphs Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer

More information

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy IPA Derivatives for Make-to-Stock Prouction-Inventory Systems With Backorers Uner the (Rr) Policy Yihong Fan a Benamin Melame b Yao Zhao c Yorai Wari Abstract This paper aresses Infinitesimal Perturbation

More information