Aggregation of sleeping predictors to forecast electricity consumption


Aggregation of sleeping predictors to forecast electricity consumption

Marie Devaine^{1,2,3}, Yannig Goude^{2}, and Gilles Stoltz^{1,4}

^1 École Normale Supérieure, Paris, France
^2 EDF R&D, Clamart, France
^3 Université Paris-Sud, Orsay, France
^4 HEC Paris, Jouy-en-Josas, France


Contents

1 Introduction and context
  1.1 Description of the framework
2 Base forecasters
  2.1 Exponentially Weighted Average Beta-Forecaster
  2.2 Exponentially Weighted Average Forecaster
  2.3 Exponential Gradient Forecaster
  2.4 Exponential Gradient Forecaster Partition
  2.5 Renormalized Exponential Gradient Forecaster
  2.6 Zinkevich's forecasters
  2.7 Ridge Regression Forecaster
  2.8 Ridge Regression Partition Forecaster
  2.9 Normalized Ridge Forecaster
  Fixed-Share Forecaster
  Exponentiated-Gradient Fixed-Share Forecaster
  Another sleeping adaptation of the fixed-share forecaster
3 Tricks
  On-line calibration of one parameter
  Towards a reduction of the bias of the ridge-type forecasters?
  Compensated regrets
  Calibration of the couple of parameters for fixed-share type forecasters
4 Another data set: French data
  Description of the French data
  Results obtained by the forecasters

Chapter 1: Introduction and context

1.1 Description of the framework

Data. The data set consists of hourly observations of the Slovakian consumption of electricity. The units are megawatts and the period of reference runs from 01/01/2005 to 12/31/2007. There are 35 base forecasters (experts). The experts do not necessarily output a prediction at each step; they express themselves only at certain moments. (Yet at each step at least one expert makes a prediction.) The framework is therefore called prediction with sleeping experts.

Notations. Time rounds are indexed by t = 1, 2, ..., n. Experts are indexed by i = 1, ..., N. The active experts at round t are given by a subset E_t ⊆ {1, ..., N}. The observation at round t is denoted by y_t. The prediction of expert i at round t exists if and only if i ∈ E_t, and we denote it by f_{i,t}. For practical reasons, we set the prediction of expert i to zero when i ∉ E_t. We can thus define the vector of the predictions of the experts at time t as f_t = (f_{1,t}, ..., f_{N,t}).

At each round the master forecaster outputs a vector of weights (an element of R^N), which it uses to form a linear prediction (to be compared to y_t). Two prototypical cases arise.

Constrained prediction: for each t, the vector used for the prediction is a probability distribution over E_t. In this case we denote the vector of weights by p_t ∈ X_{E_t}. More precisely, p_t ∈ R^N belongs to X_{E_t} if p_{i,t} = 0 for i ∉ E_t, if p_{i,t} ≥ 0 for all i, and if Σ_{i∈E_t} p_{i,t} = 1. (Note that in the classical framework X is simply the simplex over {1, ..., N}.) In this constrained case, the master forecaster forms the prediction

  ŷ_t = Σ_{i∈E_t} p_{i,t} f_{i,t} = p_t · f_t.

Unconstrained prediction: the vector used for the prediction at time t may be any vector u_t ∈ R^N. In this unconstrained case, the master forecaster forms the prediction

  ŷ_t = Σ_{i∈E_t} u_{i,t} f_{i,t} = u_t · f_t.

Note that the second equality holds because of the convention that f_{i,t} = 0 for i ∉ E_t.
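As a concrete illustration of the constrained case, here is a minimal NumPy sketch (the function name and the array layout are ours, not from the report): thanks to the convention f_{i,t} = 0 for sleeping experts, the aggregation is a plain inner product once the weights are renormalized over E_t.

```python
import numpy as np

def constrained_prediction(weights, forecasts, active):
    """Aggregate the active experts' forecasts with a probability vector.

    weights   -- nonnegative scores over the N experts
    forecasts -- the N expert forecasts f_{i,t} (values on sleepers are irrelevant)
    active    -- boolean mask encoding the active set E_t
    """
    w = np.where(active, weights, 0.0)
    w = w / w.sum()                       # renormalize over E_t
    f = np.where(active, forecasts, 0.0)  # convention: f_{i,t} = 0 if i not in E_t
    return float(w @ f)
```

For instance, with two active experts forecasting 10 and 20 under equal weights, the prediction is their average, whatever the sleeping expert announces.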
An idea to deal with this sleeping-experts framework is to go back to the classical framework by considering a partition of the data set that depends on the E_t. Let K be the number of values taken by the E_t on the data (for t ∈ {1, ..., n}); we denote by U_1, ..., U_K the corresponding values. In addition, we denote by k_t the index corresponding to step t, so that E_t = U_{k_t}.

Criteria to assess the quality of a master forecaster

We use a root mean squared error (rmse) criterion. Formally, the rmse of a sequence v_1^n = (v_1, ..., v_n) of aggregation vectors is defined by

  rmse(v_1^n) = sqrt( (1/n) Σ_{t=1}^n ( Σ_{i∈E_t} v_{i,t} f_{i,t} − y_t )² ).

We give below some lower bounds beyond which it is difficult to go.

1. The following bounds have counterparts in the classical framework:

  B_oracle = min_{(j_1,...,j_n) ∈ E_1×...×E_n} rmse(δ_{j_1}, ..., δ_{j_n}) = min_{(j_1,...,j_n) ∈ E_1×...×E_n} sqrt( (1/n) Σ_{t=1}^n (f_{j_t,t} − y_t)² )

and

  B_{R^N} = inf_{v∈R^N} rmse(v, ..., v) = inf_{v∈R^N} sqrt( (1/n) Σ_{t=1}^n (v·f_t − y_t)² ).

The second oracle has a counterpart in terms of a given probability distribution p ∈ X (it is legally defined thanks to the convention that f_{i,t} = 0 for i ∉ E_t). However, the predictions p·f_t will in general be too biased; this is, by the way, already the case even with a general v ∈ R^N. The performance table below will show that the oracle B_{R^N} is not a desirable target to achieve.

2. A first family of oracles designed for the setting of sleeping experts is stated in terms of the partition of time according to the values of the E_t; first with any linear combinations,

  B_{part,R^N} = inf_{(v_1,...,v_K) ∈ (R^N)^K} rmse(v_{k_1}, ..., v_{k_n}) = inf_{(v_1,...,v_K) ∈ (R^N)^K} sqrt( (1/n) Σ_{t=1}^n (v_{k_t}·f_t − y_t)² ),

then only with probability distributions,

  B_{part,X} = min_{(q_1,...,q_K) ∈ X_{U_1}×...×X_{U_K}} rmse(q_{k_1}, ..., q_{k_n}) = min_{(q_1,...,q_K) ∈ X_{U_1}×...×X_{U_K}} sqrt( (1/n) Σ_{t=1}^n (q_{k_t}·f_t − y_t)² ),

and finally only with Dirac distributions,

  B_part = min_{(j_1,...,j_K) ∈ U_1×...×U_K} rmse(δ_{j_{k_1}}, ..., δ_{j_{k_n}}) = min_{(j_1,...,j_K) ∈ U_1×...×U_K} sqrt( (1/n) Σ_{t=1}^n (f_{j_{k_t}},_t − y_t)² ).

3. We can also take some renormalizations into account to deal with experts being asleep. In this respect, for q ∈ X we denote by q(E_t) = Σ_{i∈E_t} q_i a renormalization factor, and we let q^{E_t} ∈ X_{E_t} be the conditional distribution of q given E_t: it is defined from q by keeping only the components indexed by E_t and dividing them by q(E_t). By convention, q^{E_t} = (0, ..., 0) when q(E_t) = 0.

a) A first version of the oracle is

  B^(a)_renorm = min_{q∈X} sqrt( ( Σ_{t=1}^n I_{q(E_t)≠0} (q^{E_t}·f_t − y_t)² ) / Σ_{t=1}^n I_{q(E_t)≠0} ),

which entails a natural average at each round over the active forecasters when taking q as the uniform distribution (1/N, ..., 1/N):

  B^(a)_ave = sqrt( (1/n) Σ_{t=1}^n ( (1/|E_t|) Σ_{j∈E_t} f_{j,t} − y_t )² ).

b) A second (more continuous) version is given by an rmse criterion putting non-equal weights on the losses:

  B^(b)_renorm = min_{q∈X} sqrt( ( Σ_{t=1}^n q(E_t) (q^{E_t}·f_t − y_t)² ) / Σ_{t=1}^n q(E_t) ),

with the corresponding average at each round over the active forecasters:

  B^(b)_ave = sqrt( ( Σ_{t=1}^n |E_t| ( (1/|E_t|) Σ_{j∈E_t} f_{j,t} − y_t )² ) / Σ_{t=1}^n |E_t| ).

The versions a) and b) give the same bound when we take the minimum over Dirac distributions only:

  B_best-exp = min_{j=1,...,N} sqrt( ( Σ_{t=1}^n I_{j∈E_t} (f_{j,t} − y_t)² ) / Σ_{t=1}^n I_{j∈E_t} ).

4. Instead of renormalizing the restriction of q ∈ X to the active experts, we may consider the Euclidean projection onto X_{E_t}, which we denote by Π_{X_{E_t}}. For q ∈ X, we denote by q_{E_t} its restriction to the set E_t of active experts, defined as q_{E_t} = (q_{i_1}, ..., q_{i_{|E_t|}}), where {i_1, ..., i_{|E_t|}} = E_t and i_1 < i_2 < ... < i_{|E_t|}. Finally, we define the projection P_t as follows: for all q ∈ X,

  P_t(q) = Π_{X_{E_t}}(q_{E_t}).

The corresponding oracle is

  B_project = min_{q∈X} sqrt( (1/n) Σ_{t=1}^n ( P_t(q)·f_t − y_t )² ).

No simple reduction is obtained in this case when we take the minimum over Dirac distributions only, since P_t(δ_j) = δ_j when j ∈ E_t but P_t(δ_j) is the uniform distribution over X_{E_t} when j ∉ E_t.
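As an aside, the first oracle B_oracle reduces to picking, at each round, the best active expert in hindsight, so it is cheap to compute; here is a NumPy sketch (our own naming, not from the report):

```python
import numpy as np

def oracle_rmse(forecasts, active, y):
    """B_oracle: at each round, the best active expert in hindsight.

    forecasts -- (n, N) array of expert forecasts f_{i,t}
    active    -- (n, N) boolean array, active[t, i] iff i is in E_t
    y         -- (n,) observations
    """
    sq = (forecasts - y[:, None]) ** 2
    sq = np.where(active, sq, np.inf)    # sleeping experts cannot be chosen
    return float(np.sqrt(sq.min(axis=1).mean()))
```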

Notations used in the description of the algorithms

We denote by ℓ(·,·) the square loss, ℓ(x, y) = (x − y)². We need in the proofs a uniform bound on the losses; it exists when the observations and the predictions are bounded by B for instance, which is the case in practice. In this case, as the data are nonnegative, we can take B² as a uniform bound on our losses. (The only difficulty is to exhibit a value for B in advance.)

For each round t, we denote by

  ℓ_t(v) = ℓ( Σ_{i∈E_t} v_i f_{i,t}, y_t )

the instantaneous loss of a vector v ∈ R^N (possibly a probability distribution on E_t). When v = δ_i is a Dirac mass on an active expert, i.e., i ∈ E_t, we simply write ℓ_{i,t} = ℓ_t(δ_i) = ℓ(f_{i,t}, y_t). The loss of the master forecaster at round t equals ℓ̂_t = ℓ(ŷ_t, y_t), with the notations above.

All these quantities have cumulative counterparts. The cumulative loss of the forecaster is referred to as L̂_n = Σ_{t=1}^n ℓ̂_t. The cumulative losses of the experts are less easy to define in the framework of sleeping experts. We consider

1. for v ∈ R^N,
  L_n(v) = Σ_{t=1}^n ℓ_t(v);

2. for v_1^K = (v_1, ..., v_K) ∈ (R^N)^K (where v_1^K can possibly be a sequence of probability distributions in X_{U_1} × ... × X_{U_K}),
  L_n(v_1^K) = Σ_{t=1}^n ℓ_t(v_{k_t});

3. for a distribution q ∈ X,
  L'_n(q) = Σ_{t=1}^n q(E_t) ℓ_t(q^{E_t}),   and in particular   L'_{i,n} = Σ_{t=1}^n I_{i∈E_t} ℓ_{i,t},
where the second definition follows from the first by taking the probability distribution q = δ_i;

4. for a distribution q ∈ X,
  L''_n(q) = Σ_{t=1}^n ℓ_t(P_t(q)).

Minimization of rmse via the minimization of regret

We attempt to minimize the rmse of the master forecaster by ensuring that it has a small regret, where the regret is defined

1. either as sup_{v∈R^N} R_n(v), where R_n(v) = L̂_n − L_n(v);

2. or as sup_{v_1^K ∈ (R^N)^K} R_n(v_1^K), where R_n(v_1^K) = L̂_n − L_n(v_1^K) = L̂_n − Σ_{t=1}^n ℓ_t(v_{k_t});

3. or as sup_{q∈X} R'_n(q), where R'_n(q) = Σ_{t=1}^n q(E_t) ℓ̂_t − L'_n(q); when we restrict our attention in the previous definition to Dirac distributions only, we get

  max_{j=1,...,N} R'_n(δ_j) =: max_{j=1,...,N} R'_{j,n},   where   R'_{j,n} = Σ_{t=1}^n I_{j∈E_t} ( ℓ̂_t − ℓ_{j,t} );

4. or as sup_{q∈X} R''_n(q), where R''_n(q) = L̂_n − L''_n(q).

Numerical values

For practical purposes we do not use the whole data set to build our forecasts. As predictions are made for the next 24 hours, we split our data set into 24 fixed-hour subsets and run 24 algorithms in parallel. In our examples and simulations, we chose the subset corresponding to noon. We denote by n the total number of prediction steps and by n_12 the number of predictions in the noon data subset. (We index by a subscript 12 the quantities that refer to this subset only.) M is some typical order of magnitude of the y_t.

[Table: values of n, n_12, M, N, K and K_12.]

Some standard rmse values are summarized for the noon data subset below. As argued above, B_{R^N} is high because of the bias induced by sleeping experts. (When a highly weighted expert is missing, the bias is important and the corresponding instantaneous loss is large.) We must therefore use the information of sleeping/active experts to get lower bounds smaller than this one.

[Table: values of B_oracle, B_{R^N}, B_{part,R^N}, B_{part,X}, B_part, B^(a)_renorm, B^(a)_ave, B^(b)_renorm, B^(b)_ave, B_project, B_best-exp.]

The global performance of the experts (results are very similar for each hourly data set) is drawn in Figure 1.1. In Figure 1.2, we plot the global performance of each expert with respect to its percentage of activity.

Figure 1.1: Performance of the experts (rmse against expert index)

Figure 1.2: rmse against percentage of activity

Chapter 2: Base forecasters

2.1 Exponentially Weighted Average Beta-Forecaster

2.1.1 References

This forecaster is described in [BM05].

2.1.2 Theoretical bound(s)

For β such that log(1/β) is of the order of (1/B) sqrt(log N / n), we have

  max_{i∈{1,...,N}} R'_{i,n} ≤ O( B sqrt(n log N) ).

2.1.3 Interpretation and/or comments

This is a variant of the exponentially weighted average forecaster in the context of sleeping experts. It is arguably less intuitive than the one presented in Section 2.2, but a theoretical bound on its performance can be proved in a straightforward way. The parameter β used here should be thought of as e^{−η} with η > 0. The weights obtained with this algorithm are thus of the same form as the weights given by the algorithm of Section 2.2. The difference, however, is that it is not the same quantity (the same regret) that is used in the exponential weighting, even though in both cases the aim is to control the original regret max_i R'_{i,n}.

2.1.4 Statement and implementation

The parameter β belongs to (0, 1). We introduce the β-regret: for i ∈ {1, ..., N} and n ≥ 1,

  R'_{β,i,n} = Σ_{t=1}^n I_{i∈E_t} ( β ℓ̂_t − ℓ_{i,t} ).

Note that R'_{i,n} is a β-regret for β = 1 (which is however a forbidden value for β in this section). For t ≥ 1, the convex weights p_t are defined as

  p_{i,t} = I_{i∈E_t} β^{−R'_{β,i,t−1}} / Σ_{j∈E_t} β^{−R'_{β,j,t−1}}.

By convention, R'_{β,i,0} = 0 for all i = 1, ..., N.

The forecaster is implemented as follows.

Parameters: β ∈ (0, 1)
Initialization: R'_{β,i,0} = 0 for all i = 1, ..., N, and p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and compute the regrets, for all i = 1, ..., N,
      R'_{β,i,t} = R'_{β,i,t−1} + I_{i∈E_t} ( β ℓ̂_t − ℓ_{i,t} );
  (3) compute p_{t+1} as, for all i = 1, ..., N,
      p_{i,t+1} = I_{i∈E_{t+1}} β^{−R'_{β,i,t}} / Σ_{j∈E_{t+1}} β^{−R'_{β,j,t}}.

2.1.5 Proof of the theoretical bound

Without loss of generality, we can assume that B = 1. (If this is not the case, one simply considers the ℓ_{i,t}/B² instead of the ℓ_{i,t}.) We first introduce some notation: for i ∈ {1, ..., N} and t ≥ 1, we denote

  w'_{i,t} = β^{−R'_{β,i,t−1}},   w_{i,t} = I_{i∈E_t} w'_{i,t},   W_t = Σ_{i=1}^N w_{i,t},   W'_t = Σ_{i=1}^N w'_{i,t}.

In particular, we have p_{i,t} = w_{i,t}/W_t. By convexity of ℓ in its first argument, we have

  ℓ̂_t ≤ Σ_{i=1}^N p_{i,t} ℓ_{i,t} = Σ_{i=1}^N w_{i,t} ℓ_{i,t} / W_t,   and thus   W_t ℓ̂_t − Σ_{i=1}^N w_{i,t} ℓ_{i,t} ≤ 0,

an inequality we will need later. We now bound W'_t by N for all t. We have W'_1 = N, and we now show that the sequence (W'_t)_{t≥1} is non-increasing. We use, for β ∈ (0, 1) and x ∈ [0, 1], the inequality β^x ≤ 1 − (1−β)x

and the inequality β^{−x} ≤ 1 + (1−β)x/β to get:

  W'_{t+1} = Σ_{i=1}^N w'_{i,t} β^{−I_{i∈E_t}(β ℓ̂_t − ℓ_{i,t})}
           ≤ Σ_{i=1}^N w'_{i,t} ( 1 − (1−β) ℓ_{i,t} I_{i∈E_t} ) ( 1 + (1−β) β ℓ̂_t I_{i∈E_t} / β )
           ≤ Σ_{i=1}^N w'_{i,t} ( 1 + (1−β) I_{i∈E_t} ( ℓ̂_t − ℓ_{i,t} ) )
           ≤ W'_t + (1−β) ( W_t ℓ̂_t − Σ_{i=1}^N w_{i,t} ℓ_{i,t} )
           ≤ W'_t,

where the last inequality uses the convexity inequality established above. In particular, we have w'_{i,t+1} ≤ N for all t ≥ 0. Since w'_{i,t+1} = β^{−R'_{β,i,t}}, we have

  R'_{β,i,t} ≤ log N / log(1/β),

which rewrites as

  Σ_{t=1}^n I_{i∈E_t} ℓ̂_t ≤ (1/β) ( Σ_{t=1}^n I_{i∈E_t} ℓ_{i,t} + log N / log(1/β) ) ≤ Σ_{t=1}^n I_{i∈E_t} ℓ_{i,t} + (1/β − 1) n + (1/β) log N / log(1/β).

Optimizing in β (by taking the parametrization β = e^{−η} and optimizing in η), we get the claimed bound.

2.1.6 Performance

[Table: rmse as a function of 1 − β, over values ranging from 1e-3 down to 1e-6; the best value is obtained at 1 − β = 4e-6.]

Forward note: we obtain the same performance as with the algorithm of Section 2.2. Indeed, setting β = e^{−η}, the two algorithms can be seen to be equivalent for sufficiently small values of η. These values are the ones we are interested in and, for those, β is close to 1. Thus, the table above is the same as the one which will be given in Section 2.2 for small values of η (for which 1 − β ≈ η).
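The loop above can be sketched in a few lines of NumPy (a sketch, not the report's own code; the naming and the boolean-mask encoding of the sets E_t are our own):

```python
import numpy as np

def ewa_beta(forecasts, active, y, beta):
    """Sleeping EWA beta-forecaster: on the active set, weights are
    proportional to beta^(-R'_{beta,i,t-1}); the beta-regrets accumulate
    I_{i in E_t} (beta * hat-loss_t - loss_{i,t})."""
    n, N = forecasts.shape
    R = np.zeros(N)                       # beta-regrets R'_{beta,i,t}
    preds = np.empty(n)
    for t in range(n):
        w = np.where(active[t], beta ** (-R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        loss_hat = (preds[t] - y[t]) ** 2
        R += active[t] * (beta * loss_hat - (forecasts[t] - y[t]) ** 2)
    return preds
```

On a toy stream where expert 2 is always right, the weights drift toward expert 2 and the predictions improve round after round.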

2.1.7 Graphical evolution of the weights

Figure 2.1: Evolution of the weights for β = 1 − 4e-6

2.2 Exponentially Weighted Average Forecaster

2.2.1 References

This is an adaptation of the forecaster of Section 2.1. We designed it and there is no previous occurrence of it in the literature.

2.2.2 Theoretical bound(s)

For η of the order of (1/B) sqrt(log N / n), we have

  max_{i∈{1,...,N}} R'_{i,n} ≤ O( B sqrt(n log N) ).

2.2.3 Interpretation and/or comments

We claim that the forecaster introduced below is more natural than the one of Section 2.1, since it is based on the regret itself and not on a variant of it. Note that, with respect to the standard (i.e., non-sleeping) version of the exponentially weighted average forecaster, it cannot be defined only in terms of the cumulative losses of the experts. Here, unlike in the standard case, the loss of the master forecaster in the numerator does not cancel out with the corresponding terms in the denominator of the expression defining the weights.

2.2.4 Statement and implementation

The parameter η belongs to (0, +∞). For t ≥ 1, the convex weights p_t are defined as

  p_{i,t} = I_{i∈E_t} e^{η R'_{i,t−1}} / Σ_{j∈E_t} e^{η R'_{j,t−1}}.

By convention, R'_{i,0} = 0 for all i = 1, ..., N.

The forecaster is implemented as follows.

Parameters: η > 0
Initialization: R'_{i,0} = 0 for all i = 1, ..., N, and p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and compute the regrets, for all i = 1, ..., N,
      R'_{i,t} = R'_{i,t−1} + I_{i∈E_t} ( ℓ̂_t − ℓ_{i,t} );
  (3) compute p_{t+1} as, for all i = 1, ..., N,
      p_{i,t+1} = I_{i∈E_{t+1}} e^{η R'_{i,t}} / Σ_{j∈E_{t+1}} e^{η R'_{j,t}}.

2.2.5 Proof of the theoretical bound

We do not provide a direct proof but rather derive the bound from the one of Section 2.1. In particular, taking β = e^{−η} ∈ (0, 1), we use that

  e^{η R'_{i,n}} = β^{−R'_{β,i,n}} e^{η(1−β) Σ_{t=1}^n ℓ̂_t I_{i∈E_t}} = w'_{i,n+1} e^{η(1−β) Σ_{t=1}^n ℓ̂_t I_{i∈E_t}},

where we used the notations of Section 2.1.5 for the two equalities. We know from Section 2.1.5 that w'_{i,n+1} ≤ N for all i ∈ {1, ..., N}, so that

  e^{η R'_{i,n}} ≤ N e^{η(1−β) Σ_{t=1}^n ℓ̂_t I_{i∈E_t}}.

Using in addition 1 − e^{−η} ≤ η, we thus get, for η > 0,

  η R'_{i,n} ≤ log N + η (1 − e^{−η}) Σ_{t=1}^n ℓ̂_t I_{i∈E_t} ≤ log N + η² Σ_{t=1}^n ℓ̂_t I_{i∈E_t}.

Rearranging, and bounding the sum in a crude but uniform manner by B² n,

  R'_{i,n} ≤ log N / η + η B² n.

The claimed bound is obtained by taking η = sqrt( log N / (B² n) ):

  R'_{i,n} ≤ 2 B sqrt( n log N ).

2.2.6 Performance

[Table: rmse as a function of η ∈ {1e-8, 1e-7, 1e-6, 1e-5, 1e-4}; the best value is obtained at η = 4e-6.]

2.2.7 Graphical evolution of the weights

Figure 2.2: Evolution of the weights for η = 4e-6
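For completeness, a NumPy sketch of this variant (our own naming); compared with the sketch of Section 2.1, only the regret increment changes, from β ℓ̂_t − ℓ_{i,t} to ℓ̂_t − ℓ_{i,t}:

```python
import numpy as np

def sleeping_ewa(forecasts, active, y, eta):
    """Sleeping EWA forecaster: on the active set E_t, weights are
    proportional to exp(eta * R'_{i,t-1}), the regrets accumulating
    I_{i in E_t} (hat-loss_t - loss_{i,t})."""
    n, N = forecasts.shape
    R = np.zeros(N)                       # regrets R'_{i,t}
    preds = np.empty(n)
    for t in range(n):
        w = np.where(active[t], np.exp(eta * R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        R += active[t] * ((preds[t] - y[t]) ** 2 - (forecasts[t] - y[t]) ** 2)
    return preds
```

Note that the regrets of sleeping experts are frozen: on a round where only one expert is awake, the forecaster simply copies its prediction.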

2.3 Exponential Gradient Forecaster

2.3.1 References

This forecaster follows from the one of Section 2.2 by considering the gradients of the losses; see [CBL06, Section 2.5].

2.3.2 Theoretical bound(s)

For η of the order of (1/C) sqrt(log N / n), we have

  max_{i∈{1,...,N}} R'_{i,n} ≤ O( C sqrt(n log N) ),

where C is a constant such that −C ≤ (∇ℓ̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B².

2.3.3 Interpretation and/or comments

The forecaster below is the exponentiated-gradient version of the forecaster of Section 2.2, in which the cumulative regrets appearing in the exponents of the weights are replaced by gradient-based upper bounds. We only adapt here the forecaster of Section 2.2. We also considered a gradient version of the algorithm of Section 2.1 but, again, it obtains about the same performance as the one presented in this section; this is why we do not devote a dedicated section to it.

2.3.4 Statement and implementation

The parameter η belongs to (0, +∞). We use ∇ℓ̂_t to denote the gradient of the convex function v ↦ ℓ_t(v) taken at p_t, that is, ∇ℓ̂_t = 2(ŷ_t − y_t) f_t. For t ≥ 1, the vector of weights p_t is defined component-wise as

  p_{i,t} = I_{i∈E_t} exp( η Σ_{s=1}^{t−1} I_{i∈E_s} ∇ℓ̂_s·(p_s − δ_i) ) / Σ_{j∈E_t} exp( η Σ_{s=1}^{t−1} I_{j∈E_s} ∇ℓ̂_s·(p_s − δ_j) )
         = I_{i∈E_t} exp( 2η Σ_{s=1}^{t−1} I_{i∈E_s} (ŷ_s − y_s)(ŷ_s − f_{i,s}) ) / Σ_{j∈E_t} exp( 2η Σ_{s=1}^{t−1} I_{j∈E_s} (ŷ_s − y_s)(ŷ_s − f_{j,s}) ).

By convention, an empty sum is null.

The forecaster is implemented as follows.

Parameters: learning rate η > 0
Initialization: R̃_{i,0} = 0 for all i = 1, ..., N, and p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and compute the pseudo-regrets, for all i = 1, ..., N,
      R̃_{i,t} = R̃_{i,t−1} + 2 I_{i∈E_t} (ŷ_t − y_t)(ŷ_t − f_{i,t});
  (3) compute p_{t+1} as, for all i = 1, ..., N,
      p_{i,t+1} = I_{i∈E_{t+1}} e^{η R̃_{i,t}} / Σ_{j∈E_{t+1}} e^{η R̃_{j,t}}.

2.3.5 Proof of the theoretical bound

Since v ↦ ℓ_t(v) is convex and differentiable, we can upper bound the instantaneous regrets as

  ℓ(ŷ_t, y_t) − ℓ(f_{i,t}, y_t) = ℓ_t(p_t) − ℓ_t(δ_i) ≤ ∇ℓ̂_t·(p_t − δ_i) = ℓ̃(ŷ_t, y_t) − ℓ̃(f_{i,t}, y_t),

where we denoted, for all probability distributions q ∈ X_{E_t}, the linearized losses by ℓ̃(q·f_t, y_t) = ∇ℓ̂_t·q. Summing up the bound, we get, for all i = 1, ..., N,

  R'_{i,n} = Σ_{t=1}^n I_{i∈E_t} ( ℓ̂_t − ℓ_{i,t} ) ≤ Σ_{t=1}^n I_{i∈E_t} ( ℓ̃(ŷ_t, y_t) − ℓ̃(f_{i,t}, y_t) ) ≤ log N / η + η C² n,

where C is some bound on the ℓ̃, for instance C = 2B². We used here the result of Section 2.2.5, up to the replacement of the losses by their linearized counterparts; we can do so because the proofs only use that the losses are bounded and convex in their first argument.

2.3.6 Performance

[Table: rmse as a function of η ∈ {1e-7, 1e-6, 1e-5, 1e-4, 1e-3}; the best value is obtained at η = 1.1e-4.]

2.3.7 Graphical evolution of the weights

Figure 2.3: Evolution of the weights for η = 1.1e-4
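The gradient version only changes the quantity fed to the exponential weights; a NumPy sketch (our own naming, as before):

```python
import numpy as np

def sleeping_eg(forecasts, active, y, eta):
    """Exponentiated-gradient version: the pseudo-regrets accumulate
    2 * I_{i in E_t} * (yhat_t - y_t) * (yhat_t - f_{i,t})."""
    n, N = forecasts.shape
    R = np.zeros(N)                       # gradient-based pseudo-regrets
    preds = np.empty(n)
    for t in range(n):
        w = np.where(active[t], np.exp(eta * R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        R += 2.0 * active[t] * (preds[t] - y[t]) * (preds[t] - forecasts[t])
    return preds
```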

2.4 Exponential Gradient Forecaster Partition

2.4.1 References

This forecaster is an adaptation of the EG forecaster studied in [CB99].

2.4.2 Theoretical bound(s)

For η of the order of (1/C) sqrt(K log N / n), we have

  sup R_n(q_1^K) ≤ O( C sqrt(K n log N) ),

where the supremum is over q_1^K ∈ X_{U_1} × ... × X_{U_K}, with q_1^K a shorthand notation for (q_1, ..., q_K), and where C is a constant such that −C ≤ (∇ℓ̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B².

2.4.3 Interpretation and/or comments

This forecaster is obtained by running in parallel K classical exponential gradient algorithms, one on each subset U_k, for k ∈ {1, ..., K}. The bound is obtained by summing up the base bounds of [CB99] over the subsets. Note that if K is large, then the proposed bound is poor.

2.4.4 Statement and implementation

The parameter η belongs to (0, +∞). For t ≥ 2, we introduce the notation

  𝒰_t = { s ∈ {1, ..., t−1} : E_s = U_{k_t} },

which refers to the set of past rounds with the same sleeping configuration as the current one. For t ≥ 1, the weight vector p_t is defined as

  p_{i,t} = I_{i∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{i,s}) ) / Σ_{j∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{j,s}) ).

By convention, an empty sum is null. The definition is legal because if i ∈ E_t then, by definition of 𝒰_t, one also has i ∈ E_s for all s ∈ 𝒰_t.

This forecaster is implemented as follows.

Parameters: learning rate η > 0
For each round t = 1, ..., n:
  (1) if 𝒰_t = ∅, set p_{i,t} = 1/|E_t| if i ∈ E_t and p_{i,t} = 0 if i ∉ E_t; otherwise, set, for i = 1, ..., N,
      p_{i,t} = I_{i∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{i,s}) ) / Σ_{j∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{j,s}) );
  (2) predict with ŷ_t = p_t · f_t.

2.4.5 Performance

[Table: rmse as a function of η ∈ {1e-6, 1e-5, 1e-4, 1e-3}; the best value is obtained at η = 7.4e-5.]
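A compact way to run one EG instance per sleeping configuration is to key the pseudo-regrets by the active set itself; a NumPy sketch (our own naming and data layout):

```python
import numpy as np

def eg_partition(forecasts, active, y, eta):
    """One EG instance per distinct sleeping configuration U_k;
    uniform weights on the first visit of a configuration."""
    n, N = forecasts.shape
    regrets = {}                          # active set -> pseudo-regret vector
    preds = np.empty(n)
    for t in range(n):
        key = tuple(active[t])
        R = regrets.setdefault(key, np.zeros(N))
        w = np.where(active[t], np.exp(eta * R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        R += 2.0 * active[t] * (preds[t] - y[t]) * (preds[t] - forecasts[t])
    return preds
```

The in-place update of `R` modifies the array stored in the dictionary, so each configuration keeps its own learning history.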

2.4.6 Graphical evolution of the weights

Figure 2.4: Evolution of the weights for η = 1.1e-4

2.5 Renormalized Exponential Gradient Forecaster

2.5.1 References

This forecaster is described in [FSS97].

2.5.2 Theoretical bound(s)

For η of the order of (1/C) sqrt(log N / n), we have

  sup_{q∈X} R'_n(q) ≤ O( C sqrt(n log N) ),

where C is a constant such that −C ≤ (∇ℓ̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B².

2.5.3 Interpretation and/or comments

This forecaster is another adaptation of the gradient-based forecasters to the setting of sleeping experts. The notion of regret minimized is not the same as in Section 2.3; however, their respective performance is similar. This is despite the fact that the two algorithms are quite different, as the graphical evolutions of the weights reveal.

2.5.4 Statement and implementation

The parameter η belongs to (0, +∞). This forecaster is implemented as follows.

Parameters: learning rate η
Initialization: p_{i,1} = 1/N for i = 1, ..., N
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t^{E_t} · f_t;
  (2) observe y_t and compute p_{t+1} as follows:
      p_{i,t+1} = p_{i,t} e^{−2η f_{i,t}(ŷ_t − y_t)} ( Σ_{j∈E_t} p_{j,t} ) / ( Σ_{k∈E_t} p_{k,t} e^{−2η f_{k,t}(ŷ_t − y_t)} )   if i ∈ E_t,
      p_{i,t+1} = p_{i,t}   if i ∉ E_t.

2.5.5 Proof of the theoretical bound

Since v ↦ ℓ_t(v) is convex and differentiable, we can upper bound the instantaneous regrets as

  ℓ(ŷ_t, y_t) − ℓ(q^{E_t}·f_t, y_t) = ℓ_t(p_t^{E_t}) − ℓ_t(q^{E_t}) ≤ ∇ℓ̂_t·( p_t^{E_t} − q^{E_t} ).

We thus have

  R'_n(q) ≤ Σ_{t=1}^n q(E_t) ∇ℓ̂_t·( p_t^{E_t} − q^{E_t} ).

We now prove the following bound:

  ∇ℓ̂_t·( p_t^{E_t} − q^{E_t} ) ≤ η C²/2 + (1/η) ( K(q^{E_t}, p_t^{E_t}) − K(q^{E_t}, p_{t+1}^{E_t}) ),

where we denoted by K(·,·) the Kullback-Leibler divergence between two distributions. We recall that the latter is defined, for two probability distributions p and q over a set with R elements, as

  K(q, p) = Σ_{i=1}^R q_i log( q_i / p_i ).

The claimed bound actually follows from an application of the general lemma stated below, together with the fact that, by definition of the forecaster,

  p_{i,t+1}^{E_t} = p_{i,t}^{E_t} e^{−η γ_{i,t}} / Σ_{j=1}^N p_{j,t}^{E_t} e^{−η γ_{j,t}}

for all t ∈ {1, ..., n} and i = 1, ..., N, where we set γ_{i,t} = 2 (ŷ_t − y_t) f_{i,t} = (∇ℓ̂_t)_i.

Lemma 1. Let q, p be two probability distributions over a set with R elements and let γ ∈ R^R be any R-dimensional real vector. Define a distribution p' as follows: for i = 1, ..., R,

  p'_i = p_i e^{−η γ_i} / Σ_{j=1}^R p_j e^{−η γ_j}.

Then, denoting by D a bound such that −D ≤ γ_i ≤ D for i = 1, ..., R, one has

  η p·γ − η Σ_{i=1}^R q_i γ_i − η² D²/2 ≤ K(q, p) − K(q, p').   (2.1)

Proof. We start with a chain of equalities:

  K(q, p) − K(q, p') = Σ_{i=1}^R q_i log( p'_i / p_i ) = − log( Σ_{j=1}^R p_j e^{−η γ_j} ) − η Σ_{i=1}^R q_i γ_i.

Resorting to Hoeffding's lemma, we have

  log( Σ_{j=1}^R p_j e^{−η γ_j} ) ≤ −η γ·p + η² D²/2,

so that K(q, p) − K(q, p') ≥ η γ·p − η Σ_i q_i γ_i − η² D²/2, which is the claimed inequality (2.1).

We now get back to the main proof. So far, we have

  R'_n(q) ≤ Σ_{t=1}^n q(E_t) ( η C²/2 + (1/η) ( K(q^{E_t}, p_t^{E_t}) − K(q^{E_t}, p_{t+1}^{E_t}) ) ).

In our case, as we have p_{i,t} = p_{i,t+1} for i ∉ E_t, and therefore p_{t+1}(E_t) = p_t(E_t), we derive

  q(E_t) ( K(q^{E_t}, p_t^{E_t}) − K(q^{E_t}, p_{t+1}^{E_t}) ) = Σ_{i∈E_t} q_i log( p_{i,t+1} / p_{i,t} ) = Σ_{i=1}^N q_i log( p_{i,t+1} / p_{i,t} ) = K(q, p_t) − K(q, p_{t+1}).   (2.2)

Substituting, we get

  R'_n(q) ≤ Σ_{t=1}^n ( q(E_t) η C²/2 + (1/η) ( K(q, p_t) − K(q, p_{t+1}) ) ).

A telescoping sum has appeared, and we are left with

  R'_n(q) ≤ η n C²/2 + K(q, p_1)/η ≤ η n C²/2 + log N / η,

where the last inequality proceeds from the choice of p_1 as the uniform distribution. Optimizing in η, we obtain the claimed bound.

2.5.6 Performance

[Table: rmse as a function of η ∈ {1e-7, 1e-6, 1e-5, 1e-4, 1e-3}; the best value is obtained at η = 1e-4.]
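The multiplicative update that freezes sleeping coordinates and preserves the total mass of the active set can be sketched as follows (our own naming; a sketch, not the report's implementation):

```python
import numpy as np

def renormalized_eg(forecasts, active, y, eta):
    """Renormalized EG: multiplicative update on the active coordinates only,
    preserving their total mass; sleeping coordinates are left unchanged."""
    n, N = forecasts.shape
    p = np.full(N, 1.0 / N)
    preds = np.empty(n)
    for t in range(n):
        a = active[t]
        cond = np.where(a, p, 0.0)
        cond = cond / cond.sum()          # conditional distribution p_t^{E_t}
        preds[t] = cond @ np.where(a, forecasts[t], 0.0)
        mult = np.exp(-2.0 * eta * forecasts[t] * (preds[t] - y[t]))
        scaled = p * mult
        mass = np.where(a, p, 0.0).sum()  # active mass, kept unchanged
        p = np.where(a, scaled * mass / np.where(a, scaled, 0.0).sum(), p)
    return preds
```

Since the active mass is preserved and sleeping coordinates are untouched, p remains a probability distribution over all N experts throughout.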

2.5.7 Graphical evolution of the weights

Figure 2.5: Evolution of the weights for η = 1e-4

2.6 Zinkevich's forecasters

2.6.1 References

These forecasters are adaptations of the ones proposed in [Zin03].

2.6.2 First forecaster: lazy projection

Theoretical bound

The aim of this forecaster is to minimize the regret defined in terms of the R''_n(q), that is, to achieve a performance nearly as good as the oracle B_project. We do not provide any theoretical bound in the sleeping case for the time being.

Interpretation and/or comments

The forecaster below adapts the standard lazy version of Zinkevich's forecaster to the framework of sleeping experts. The original forecaster proceeds by projecting onto the simplex of size N the weights obtained by so-called gradient descent. Here, in the sleeping-experts context, at each round the projection is performed on the subset of the weights indexed by the active experts.

Statement and implementation

The parameter η belongs to (0, +∞). We first define a sequence of intermediate weights w_t by w_1 = (1/N, ..., 1/N) and

  w_{t+1} = w_t − η I_t ⊙ ∇ℓ̂_t,

where I_t = (I_{1∈E_t}, ..., I_{N∈E_t}) and, for two vectors u, v ∈ R^N, u ⊙ v denotes the componentwise product (u_1 v_1, ..., u_N v_N). The final convex weights p_t output at round t are then given by p_t = P_t(w_t), where we use again the notations of Section 1.1.

We briefly recall how to project a real m-tuple x = (x_1, ..., x_m) ∈ R^m onto X_m, the simplex of dimension m: let a be the unique real number such that

  Σ_{i=1}^m (x_i + a)_+ = 1.

Then the Euclidean projection of x onto X_m is given by

  Π_{X_m}(x) = ( (x_1 + a)_+, ..., (x_m + a)_+ ).
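The number a recalled above can be found by sorting; here is a sketch of the standard sort-based routine (this particular routine is a common way to compute a, not the report's own code):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of x onto the probability simplex:
    find a such that sum((x_i + a)_+) = 1 and return (x + a)_+."""
    m = len(x)
    u = np.sort(x)[::-1]                          # sorted in decreasing order
    css = np.cumsum(u)
    # largest index rho (0-based) with u_rho + (1 - css_rho)/(rho+1) > 0
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, m + 1) > 0)[0][-1]
    a = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(x + a, 0.0)
```

For instance, (0.2, 0) projects to (0.6, 0.4): both coordinates are shifted up by the same a = 0.4.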

The forecaster is implemented as follows.

Parameters: η > 0
Initialization: p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and perform the update
      w_{t+1} = w_t − 2η (ŷ_t − y_t) (I_t ⊙ f_t);
  (3) compute p_{t+1} = P_{t+1}(w_{t+1}).

Performance

[Table: rmse as a function of η ∈ {1e-9, 1e-8, 1e-7, 1e-6}; the best value is obtained at η = 2e-8.]

Graphical evolution of the weights

Figure 2.6: Evolution of the weights of the lazy version for η = 2e-8.
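Putting the gradient step and the projection together, the lazy version can be sketched as follows (our own naming; the projection routine is the standard sort-based one):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(x) + 1) > 0)[0][-1]
    return np.maximum(x + (1.0 - css[rho]) / (rho + 1.0), 0.0)

def lazy_zinkevich(forecasts, active, y, eta):
    """Lazy projection: descend on the unprojected weights w_t, and at each
    round project only the active coordinates onto the simplex."""
    n, N = forecasts.shape
    w = np.full(N, 1.0 / N)               # intermediate weights w_t
    preds = np.empty(n)
    for t in range(n):
        a = active[t]
        p = np.zeros(N)
        p[a] = project_simplex(w[a])      # projection on the active subset
        preds[t] = p @ np.where(a, forecasts[t], 0.0)
        w = w - 2.0 * eta * (preds[t] - y[t]) * np.where(a, forecasts[t], 0.0)
    return preds
```

"Lazy" refers to the fact that the descent is carried out on the unprojected weights w_t; the projection only produces the played weights.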

2.6.3 Second forecaster: by plug-in and projection

Interpretation and/or comments

This second version is less intuitive, but it achieves a better practical performance. In view of its statement, the oracle its performance should be compared to is given by

  min_{q∈X} Σ_{t=1}^n ℓ_t(q_t),

where the sequence of the q_t is defined iteratively by q_1 = P_1(q) and q_{s+1} = P_{s+1}(q_s) for s ≥ 1.

Theoretical bound

One can prove a O(sqrt(n)) bound on the regret

  L̂_n − min_{q∈X} Σ_{t=1}^n ℓ_t(q_t).

Statement and implementation

The forecaster is implemented as follows.

Parameters: η > 0
Initialization: p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) compute p_{t+1} = P_{t+1}( p_t − 2η (ŷ_t − y_t) (I_t ⊙ f_t) ).

Performance

[Table: rmse as a function of η ∈ {1e-7, 1e-6, 1e-5, 1e-4}; the best value is obtained at η = 1e-5.]

Graphical evolution of the weights

Figure 2.7: Evolution of the weights of the plug-in version for η = 1e-5.

2.7 Ridge Regression Forecaster

2.7.1 References

This is an adaptation of [CBL06, Section 11.7].

2.7.2 Theoretical bound

We have

  R_n(v) ≤ λ ||v||² + ( Σ_{i=1}^N log( 1 + µ_i/λ ) ) max_{1≤t≤n} ℓ_t(v),

where µ_1, ..., µ_N are the eigenvalues of Σ_{t=1}^n f_t f_t^⊤. See [CBL06, Section 11.7] for the proof.

2.7.3 Interpretation and/or comments

The forecaster below uses the standard version of the ridge regression forecaster in the framework of sleeping experts. In order to ensure that this forecaster is well defined at each round, the predictions of inactive experts are set to zero, as explained in the introduction. Because of that, the predictions of the forecaster are strongly biased. In fact, this forecaster is designed to come close to the performance given by B_{R^N}. It actually achieves this goal, but the latter is not ambitious and quite irrelevant, as the introduction underlines.

2.7.4 Statement and implementation

The parameter λ belongs to (0, +∞). We take u_1 as the uniform distribution over E_1. For t ≥ 2, the vector of weights u_t is defined as

  u_t = argmin_{v∈R^N} { λ ||v||² + Σ_{s=1}^{t−1} ( v·f_s − y_s )² }.

The computation of this least-squares estimate is given by u_t = A_t^{-1} b_t, where, for t ≥ 2,

  A_t = λ I + Σ_{s=1}^{t−1} f_s f_s^⊤   and   b_t = Σ_{s=1}^{t−1} y_s f_s.

Simple manipulations lead to the recursive update

  u_{t+1} = u_t − A_{t+1}^{-1} ( u_t·f_t − y_t ) f_t.

The forecaster is implemented as follows.

Parameters: penalization factor λ
Initialization: A_1 = λ I and u_1 = (1/N, ..., 1/N)
For each round t = 1, ..., n:
  (1) predict with ŷ_t = u_t · f_t;
  (2) observe y_t and update A_{t+1} = A_t + f_t f_t^⊤;
  (3) compute u_{t+1} = u_t − A_{t+1}^{-1} ( u_t·f_t − y_t ) f_t.

2.7.5 Performance

[Table: rmse as a function of λ ∈ {1e-3, 1, 1e+3, 1e+6}; the best value is obtained at λ = 4e+5.]
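A NumPy sketch of this forecaster (our own naming; we solve the linear system directly rather than maintaining an explicit inverse, which is numerically safer and equivalent for a sketch):

```python
import numpy as np

def ridge_forecaster(forecasts, active, y, lam):
    """Ridge regression on sleeping experts: the forecasts of inactive
    experts are set to 0, then plain penalized least squares is run."""
    n, N = forecasts.shape
    A = lam * np.eye(N)                   # A_t = lam*I + sum f_s f_s^T
    b = np.zeros(N)                       # b_t = sum y_s f_s
    u = np.full(N, 1.0 / N)
    preds = np.empty(n)
    for t in range(n):
        f = np.where(active[t], forecasts[t], 0.0)
        preds[t] = u @ f
        A += np.outer(f, f)
        b += y[t] * f
        u = np.linalg.solve(A, b)         # u_{t+1} = A_{t+1}^{-1} b_{t+1}
    return preds
```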

2.7.6 Graphical evolution of the weights

Figure 2.8: Evolution of the weights for λ = 4e+5

2.8 Ridge Regression Partition Forecaster

2.8.1 References

This is an adaptation of [CBL06, Section 11.7].

2.8.2 Theoretical bound

We proceed as in Section 2.4 and derive the global bound as a sum over the K different sub-regimes U_k, for k = 1, ..., K:

  R_n(v_1^K) ≤ Σ_{k=1}^K ( λ ||v_k||² + ( Σ_{i=1}^N log( 1 + µ_i^k/λ ) ) max_{t : E_t=U_k} ℓ_t(v_k) )

for all v_1^K = (v_1, ..., v_K) ∈ (R^N)^K, where, for each k = 1, ..., K, the µ_1^k, ..., µ_N^k are the eigenvalues of Σ_{t : E_t=U_k} f_t f_t^⊤.

2.8.3 Interpretation and/or comments

This forecaster is another adaptation of the ridge regression forecaster to the setting of sleeping experts. It simply runs K simultaneous instances of ridge regression, which avoids any bias issue (see the issues encountered with the adaptation of Section 2.7). Nevertheless, this results in poor performance on this data set, because K is large and the first steps of each instance of the ridge regression forecaster suffer large losses. We tried to improve these results by cleaning the data set, e.g., by removing the data corresponding to rounds t with |E_t| = 1 (we remove 40 such t). The results are then slightly better but still very poor.

2.8.4 Statement and implementation

The parameter λ belongs to (0, +∞). We introduce the sets

  R^N_{E_t} = { v ∈ R^N : v_i = 0 for all i ∉ E_t }   and   𝒰_t = { s ∈ {1, ..., t−1} : E_s = U_{k_t} }.

For t ≥ 1, if 𝒰_t = ∅, the vector of weights u_t is defined as the uniform distribution over E_t; otherwise, u_t is defined as

  u_t = argmin_{v∈R^N_{E_t}} { λ ||v||² + Σ_{s∈𝒰_t} ( v·f_s − y_s )² }.

Note that it is equivalent to consider that we perform K simultaneous ridge regressions (one on each U_k, for k = 1, ..., K). We thus have at step t that, if 𝒰_t ≠ ∅, then u_t = A_t^{-1} b_t, with

  A_t = λ I + Σ_{s∈𝒰_t} f_s f_s^⊤   and   b_t = Σ_{s∈𝒰_t} y_s f_s.

This forecaster is implemented as follows.

Parameters: penalization factor λ

For each round t = 1, ..., n,
(1) if U_t = ∅, define u_t by u_{i,t} = 1/|E_t| if i ∈ E_t and u_{i,t} = 0 if i ∉ E_t;
(2) otherwise, when U_t ≠ ∅, compute A_t = λI + Σ_{s∈U_t} f_s f_s^⊤ and take (u_{i,t})_{i∈E_t} given by the corresponding coordinates of A_t^{-1} Σ_{s∈U_t} y_s f_s, with u_{i,t} = 0 for i ∉ E_t;
(3) predict with ŷ_t = u_t · f_t;
(4) observe y_t.

Performance

The performance is shown for the initial (not cleaned) data set. It is slightly better on the cleaned data but remains poor (the best rmse equaling 137). The rmse was evaluated for λ ranging from 1e-3 to 1e+9; the best performance is achieved for λ* = 1.1e+5.
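The K simultaneous ridge regressions can be sketched as follows. This is a hypothetical sketch (class and method names are invented) in which each sub-regime U_k is identified with a distinct active set, as the partition of Section 2.4 suggests.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A small, dense)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

class PartitionRidge:
    """One ridge instance per sub-regime; a regime is identified here with an
    active set, so all rounds sharing an active set share one regression."""
    def __init__(self, lam, N):
        self.lam, self.N = lam, N
        self.stats = {}          # frozenset(E) -> (sum of f f^T, sum of y f)

    def weights(self, E):
        key = frozenset(E)
        if key not in self.stats:                     # U_t empty: uniform on E_t
            return [1.0 / len(E) if i in E else 0.0 for i in range(self.N)]
        G, b = self.stats[key]
        A = [[G[i][j] + (self.lam if i == j else 0.0) for j in range(self.N)]
             for i in range(self.N)]
        u = solve(A, b)                               # u_t = A_t^{-1} b_t
        return [u[i] if i in E else 0.0 for i in range(self.N)]

    def update(self, E, f, y):
        key = frozenset(E)
        if key not in self.stats:
            self.stats[key] = ([[0.0] * self.N for _ in range(self.N)],
                               [0.0] * self.N)
        G, b = self.stats[key]
        for i in range(self.N):
            b[i] += y * f[i]
            for j in range(self.N):
                G[i][j] += f[i] * f[j]
```

A regime that has never been observed falls back to the uniform distribution over its active set, which is exactly the costly "first steps" effect discussed above when K is large.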

2.8.6 Graphical evolution of the weights

Figure 2.9: Evolution of the weights for λ = 1.1e+5.

2.9 Normalized Ridge Forecaster

References

This is a new forecaster!

Theoretical bound

We have no theoretical bound yet.

Interpretation and/or comments

The forecaster results from an adaptation of the forecaster of Section 2.7 (the least-squares formulation) in the spirit of the one of Section 2.5 (the renormalization factors in front of the squares). Its aim is to get close to the oracle 28.8 that is the linear counterpart, with weights in R^N, of B^(b)_renorm.

Statement and implementation

The parameter λ belongs to ]0, +∞[. We take u_1 as the uniform distribution over E_1. For t ≥ 2, we compute u_t ∈ R^N as indicated below and then aggregate the forecasts of the experts with u_t^{E_t}, where for all v ∈ R^N, we define, similarly to the notations used in Section 1.1,

v(E_t) = Σ_{i=1}^N I_{i∈E_t} v_i    and    v^{E_t}_i = v_i / v(E_t) if i ∈ E_t,    0 if i ∉ E_t,

and u_t is defined as

u_t = argmin_{v∈R^N} { λ‖v‖² + Σ_{s=1}^{t−1} v(E_s) ( v^{E_s} · f_s − y_s )² }.

Note that we do not provide any efficient recursive computation anymore.

The forecaster is implemented as follows.

Parameters: penalization factor λ
Initialization: u_1 = (1/N, ..., 1/N)

For each round t = 1, ..., n,
(1) predict with ŷ_t = u_t^{E_t} · f_t;
(2) observe y_t and compute u_{t+1} as

u_{t+1} = argmin_{v∈R^N} { λ‖v‖² + Σ_{s=1}^t v(E_s) ( v^{E_s} · f_s − y_s )² }.

Performance

The performance of this forecaster is better than that of ridge regression (Section 2.7) but remains poor (worse than that of the best expert). This might be due to a lack of precision in the implementation: since no closed-form expression is available for the u_t, numerical minimizations (implemented in the software R) are performed at each step. The rmse was evaluated for λ ranging from 1e+4 to 1e+7; the best performance is achieved for λ* = 1e+6.

Graphical evolution of the weights

Figure 2.10: Evolution of the weights for λ = 1e+6.

2.10 Fixed-Share Forecaster

References

This is an adaptation of a forecaster described, e.g., in [CBL06, Section 5.2].

Theoretical bound

We take a larger comparison class, whose elements are indexed by sequences (i_1, ..., i_n) of elements of {1, ..., N} and are called compound experts. They predict at round t as expert i_t does. In the context of sleeping experts we thus only consider sequences such that, for all t, one has i_t ∈ E_t; a sequence satisfying this condition will be referred to as an admissible sequence. The tracking regret for such a sequence is defined as

R_n(i_1, ..., i_n) = Σ_{t=1}^n ( l̂_t − l_{i_t,t} ).

The size

size(i_1, ..., i_n) = Σ_{t=2}^n I_{i_{t−1} ≠ i_t}

of a sequence counts how many switches occur in the sequence. The following theoretical bound holds for the tracking regret of the forecaster presented in this section:

R_n(i_1, ..., i_n) ≤ ((m+1)/η) log N + (1/η) log( 1 / ( (α/N)^m (1−α)^{n−m−1} ) ) + (η/8) n B⁴

for all admissible sequences (i_1, ..., i_n) of size smaller than m. Thus, taking α = m/(n−1) and

η = (1/B²) √( (8/n) ( (m+1) log N + (n−1) H( m/(n−1) ) ) ),

we get

R_n(i_1, ..., i_n) ≤ B² √( (n/2) ( (m+1) log N + (n−1) H( m/(n−1) ) ) ),

where H denotes the binary entropy function: H(x) = −x log x − (1−x) log(1−x) for x ∈ [0, 1].

Interpretation and/or comments

At each step, this forecaster performs two updates. The first one is similar to the one performed by the exponentially weighted average forecaster of Section 2.2. The second one mixes the previous weights to ensure that all base forecasters keep a sufficient weight, which allows them to recover quickly higher weights in case they start outputting good predictions. Note that for α = 0, we do not recover the forecaster of Section 2.2. This is because we deal in a different way with the fact that some experts may be inactive.

Statement and implementation

The parameter η belongs to ]0, +∞[, while α belongs to [0, 1]. For t ≥ 1, the convex weights p_t are defined in two steps: the loss update (the same as in Section 2.2) and the share update, which allows our forecaster to be more reactive to breakpoints (rounds when the index of the best expert changes). The forecaster is implemented as follows.

Parameters: η > 0 and 0 ≤ α ≤ 1
Initialization: w_0 = (1/N, ..., 1/N)

For each round t = 1, 2, ..., n,
(1) predict with the weights p_t = w_{t−1} / Σ_{j=1}^N w_{j,t−1};
(2) Loss update: observe y_t and update, for each i = 1, ..., N,

v_{i,t} = w_{i,t−1} e^{η ( l̂_t − l_{i,t} )} if i ∈ E_t,    undefined if i ∉ E_t;

(3) Share update: let w_{i,t} = 0 if i ∉ E_{t+1}, and

w_{i,t} = (1/|E_{t+1}|) Σ_{j∈E_t \ E_{t+1}} v_{j,t} + (α/|E_{t+1}|) Σ_{j∈E_t ∩ E_{t+1}} v_{j,t} + (1−α) I_{i∈E_t ∩ E_{t+1}} v_{i,t}

if i ∈ E_{t+1} (with the convention that an empty sum is null).

Note: in (2), it is equivalent to include the l̂_t term or to omit it.

Proof of the theoretical bound

The proof is a straightforward adaptation of the proof in the non-sleeping case proposed in [CBL06, Section 5.2]. Indeed, the version proposed here corresponds to a (fake) prior weight assignment to the set of admissible compound experts given by a certain Markovian process. We give here the relation that defines the transition probability:

w'_0(i_1, ..., i_{t+1}) = w'_0(i_1, ..., i_t) I_{i_{t+1}∈E_{t+1}} ( (1−α) I_{i_{t+1}=i_t} + α/|E_{t+1}| + I_{i_t∉E_{t+1}} (1−α)/|E_{t+1}| ).

That is, the transition function from i_t to i_{t+1} is given, at round t+1, by

Tr_{t+1}(i_t → i_{t+1}) = I_{i_{t+1}∈E_{t+1}} ( (1−α) I_{i_{t+1}=i_t} + α/|E_{t+1}| + I_{i_t∉E_{t+1}} (1−α)/|E_{t+1}| ).

Rewriting the proof of [CBL06, Theorem 5.1] with these prior weights shows how to compute the w_{i,t} efficiently from the vector v_t. In particular, the last equality of the

proof yields

w_{i,t} = Σ_{j∈E_t} v_{j,t} Tr_{t+1}(j → i) = Σ_{j∈E_t} v_{j,t} I_{i∈E_{t+1}} ( (1−α) I_{i=j} + α/|E_{t+1}| + I_{j∉E_{t+1}} (1−α)/|E_{t+1}| ),

from which the expression stated in our implementation follows. As the prior weights w'_0 are larger than the weights w_0 used by the original version of [CBL06, Section 5.2], which puts weights on all compound experts, the proposed bound follows.

Performance

This forecaster has a good performance (better than that of the exponentiated gradient forecaster of Section 2.3). This might be because this forecaster uses two parameters; however, in practice, tuning two parameters online is often more delicate than calibrating a single one. We will propose several solutions to this issue in Section 3.4. The rmse was evaluated for α ∈ {0.01, 0.05, 0.1, 0.2} and η on a grid of powers of ten; the best performance (rmse of 27.0) is achieved for (η*, α*) = (2e-3, 0.2).
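The loss and share updates of the implementation above can be sketched in pure Python. This is a minimal sketch assuming the square loss, with the common factor e^{η l̂_t} omitted from the loss update (which is equivalent up to normalization, as noted above).

```python
import math

def fixed_share_round(w, f, y, E_t, E_next, eta, alpha):
    """One round of the sleeping fixed-share forecaster (square loss assumed).
    w: pre-weights w_{t-1} (they already vanish outside E_t after round 1);
    f: expert predictions, with f[i] = 0 for sleeping experts."""
    N = len(w)
    total = sum(w)
    p = [wi / total for wi in w]                      # (1) convex weights p_t
    y_hat = sum(pi * fi for pi, fi in zip(p, f))      # aggregated prediction
    # (2) loss update on the active experts (the common factor e^{eta lhat_t}
    # is omitted: it cancels in the normalization)
    v = {i: w[i] * math.exp(-eta * (f[i] - y) ** 2) for i in E_t}
    # (3) share update toward the next active set E_{t+1}
    dead = sum(v[j] for j in E_t - E_next)            # experts falling asleep
    alive = sum(v[j] for j in E_t & E_next)           # experts staying active
    m = len(E_next)
    w_new = [0.0] * N
    for i in E_next:
        w_new[i] = dead / m + alpha * alive / m + (1 - alpha) * v.get(i, 0.0)
    return y_hat, w_new
```

The total mass Σ_j v_{j,t} is preserved by the share update, so the normalization in step (1) of the next round remains well defined.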

Graphical evolution of the weights

Figure 2.11: Evolution of the weights for (η*, α*) = (2e-3, 0.2).

2.11 Exponentiated-Gradient Fixed-Share Forecaster

References

It follows from the forecaster of Section 2.10 by considering the gradients of the losses, see [CBL06, Section 2.5].

Theoretical bound

Recall that ∇l̂_t = 2(ŷ_t − y_t) f_t. We denote by C a constant such that −C ≤ (∇l̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B², so that the range is 4B². The bound of Section 2.10 holds in particular for the forecaster described here, up to the replacement of the range B² by the new range 4B².

Interpretation and/or comments

The forecaster below is the exponentiated gradient version of the forecaster of Section 2.10, in which the cumulative regret of interest is first upper-bounded by some gradient-based sum.

Statement and implementation

The parameter η belongs to ]0, +∞[, while α belongs to [0, 1]. The forecaster is implemented as follows.

Parameters: η > 0 and 0 ≤ α ≤ 1
Initialization: w_0 = (1/N, ..., 1/N)

For each round t = 1, 2, ..., n,
(1) predict with the weights p_t = w_{t−1} / Σ_{j=1}^N w_{j,t−1};
(2) Loss update: observe y_t and update

v_{i,t} = w_{i,t−1} e^{−2η (ŷ_t − y_t) f_{i,t}} if i ∈ E_t,    undefined if i ∉ E_t;

(3) Share update: let w_{i,t} = 0 if i ∉ E_{t+1}, and

w_{i,t} = (1/|E_{t+1}|) Σ_{j∈E_t \ E_{t+1}} v_{j,t} + (α/|E_{t+1}|) Σ_{j∈E_t ∩ E_{t+1}} v_{j,t} + (1−α) I_{i∈E_t ∩ E_{t+1}} v_{i,t}

if i ∈ E_{t+1} (with the convention that an empty sum is null).
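Only the loss update differs from Section 2.10; it can be sketched as follows under the square loss (the function name is illustrative).

```python
import math

def eg_loss_update(w, f, y_hat, y, eta, E_t):
    """Loss update of the exponentiated-gradient version: expert i is charged
    the i-th coordinate, 2 (y_hat - y) f_i, of the gradient of the square
    loss of the aggregated prediction y_hat."""
    return {i: w[i] * math.exp(-2.0 * eta * (y_hat - y) * f[i]) for i in E_t}
```

The share update of step (3) is then applied to v exactly as in Section 2.10.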

Performance

This forecaster has a relatively good performance (better than that of the exponentiated gradient forecaster of Section 2.3), but surprisingly enough it does not beat the exponentially weighted average version of the fixed-share forecaster presented in Section 2.10. The rmse was evaluated for α ∈ {0.01, 0.05, 0.1, 0.2} and η on a grid of powers of ten; the best performance is achieved for the pair (η*, α*) = (2e-4, 0.07).

Graphical evolution of the weights

Figure 2.12: Evolution of the weights for (η*, α*) = (2e-4, 0.07).

2.12 Another sleeping adaptation of the fixed-share forecaster

References

This is another adaptation of the original fixed-share forecaster described, e.g., in [CBL06, Section 5.2]. Like the forecaster of Section 2.10, this forecaster has a counterpart in terms of exponentiated gradient.

Interpretation and/or comments

We do not provide any theoretical result for these forecasters yet. They achieve a good performance, especially in the exponentiated gradient case. Moreover, if we take α = 0 we recover exactly the forecaster of Section 2.2 (respectively, Section 2.3 in the exponentiated gradient case), which is not the case for the forecasters of Sections 2.10 and 2.11. However, the weights obtained in this case are not exactly the same because of computational issues: the loss update for the fixed-share forecasters is performed multiplicatively, whereas it is performed additively in the case of the forecasters of Sections 2.10 and 2.11.

Statement and implementation

We present here only the implementation of the variant of the exponentially weighted average fixed-share forecaster. The exponentiated gradient version can be obtained by replacing the losses by the gradients of the losses. The parameter η belongs to ]0, +∞[, while α belongs to [0, 1] (when α = 0, we recover exactly the forecaster of Section 2.2). The forecaster is implemented as follows.

Parameters: η > 0 and 0 ≤ α ≤ 1
Initialization: w_0 = (1/N, ..., 1/N)

For each round t = 1, 2, ..., n,
(1) predict with the weights p_{i,t} = I_{i∈E_t} w_{i,t−1} / Σ_{j=1}^N I_{j∈E_t} w_{j,t−1} for each i = 1, ..., N;
(2) Loss update: observe y_t and update, for each i = 1, ..., N,

v_{i,t} = w_{i,t−1} e^{η ( l̂_t − l_{i,t} )} if i ∈ E_t,    v_{i,t−1} if i ∉ E_t;

(3) Share update: let

w_{i,t} = α W_t / |E_{t+1}| + (1−α) v_{i,t} if i ∈ E_{t+1},    0 if i ∉ E_{t+1},

where W_t = Σ_{i=1}^N I_{i∈E_{t+1}} v_{i,t}.
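One round of this variant can be sketched in pure Python (an illustrative sketch, square loss assumed). Note that the v-values of sleeping experts are frozen between rounds, so a separate v state has to be carried along, and the compensation by l̂_t in the loss update matters here, since the frozen v-values are not rescaled.

```python
import math

def alt_fixed_share_round(w, v_prev, f, y, E_t, E_next, eta, alpha):
    """One round of this variant (square loss assumed). Sleeping experts keep
    their last v-value frozen; the share update spreads a fraction alpha of
    the active mass W_t uniformly over the next active set E_{t+1}."""
    N = len(w)
    total = sum(w[i] for i in E_t)
    p = [w[i] / total if i in E_t else 0.0 for i in range(N)]   # (1) weights p_t
    y_hat = sum(p[i] * f[i] for i in E_t)                       # prediction
    # (2) multiplicative loss update; the l_hat compensation matters because
    # the frozen v-values of the sleeping experts are not rescaled
    l_hat = (y_hat - y) ** 2
    v = [w[i] * math.exp(eta * (l_hat - (f[i] - y) ** 2)) if i in E_t
         else v_prev[i] for i in range(N)]
    # (3) share update over E_{t+1}
    W = sum(v[i] for i in E_next)
    w_new = [alpha * W / len(E_next) + (1 - alpha) * v[i] if i in E_next else 0.0
             for i in range(N)]
    return y_hat, w_new, v
```

A newly waking expert i ∈ E_{t+1} \ E_t re-enters with (1−α) times its frozen v-value plus the shared mass, which is how it can recover a high weight quickly.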

Performance

The two forecasters have a good performance. That of the exponentially weighted average version is similar to the one of the forecaster of Section 2.10; its best performance is achieved for (η*, α*) = (1e-3, 0.2). The exponentiated gradient version has a performance better than the one of the forecaster of Section 2.11; its best performance (rmse of 26.5) is achieved for (η*, α*) = (2e-4, 0.05). In both cases the rmse was evaluated for α ∈ {0.01, 0.05, 0.1, 0.2} and η on a grid of powers of ten.

Graphical evolution of the weights

Figure 2.13: Evolution of the weights of the two variants of the fixed-share forecasters considered here: the exponentially weighted average version (left) and the exponentiated gradient version (right).

Chapter 3

Tricks

3.1 On-line calibration of one parameter

References

This adaptive calibration method was introduced in [GMS08].

Theoretical Bound

No theoretical bound yet. The aim is to achieve a performance nearly as good as that of the considered prediction method tuned with the best parameter in hindsight.

Interpretation and/or comments

This method provides a generic adaptive implementation of all forecasters that depend only on a (possibly vector-valued) tuning parameter. It automatically calibrates this parameter by choosing, at each round, the parameter value that minimizes the past cumulative loss of the considered forecaster. Several optimization methods can be used. We choose a grid-based optimization procedure, but continuous methods, though necessarily inaccurate and more difficult to control, could be more effective.

Statement and implementation

Consider a forecaster depending on a parameter λ and whose weight vector at step t is denoted by v_t = v_t^{(λ)}. The parameter λ ∈ Λ is called a tuning parameter of the forecaster and our aim is to choose it in an automatic way. The parameter space is, for instance, Λ = ]0, +∞[ for the ridge regression, exponentially weighted average, and exponentiated gradient forecasters. In addition we assume that v_1^{(λ)} = v_1 does not depend on λ, which is the case for all methods studied so far. The proposed calibration method tunes the forecaster automatically by minimizing the empirical loss; it is thus called the ELM calibrated forecaster hereafter. It chooses v_1 and, for t ≥ 2, v_t = v_t^{(λ̂_t)}, where

λ̂_t ∈ argmin_{λ∈Λ̃} Σ_{s=1}^{t−1} ( v_s^{(λ)} · f_s − y_s )².

We chose a finite logarithmically-scaled parameter grid Λ̃ ⊂ Λ to approximate the minimization above.

This is implemented as follows.

Parameters: a grid Λ̃
Initialization: the common initial weight vector v_1

For each round t = 1, 2, ..., n,
(1) predict with ŷ_t = v_t · f_t;
(2) observe y_t and compute λ̂_{t+1} ∈ argmin_{λ∈Λ̃} Σ_{s=1}^t ( v_s^{(λ)} · f_s − y_s )²;
(3) compute v_{t+1} as v_{t+1} = v_{t+1}^{(λ̂_{t+1})}.

Performance

We have tested the calibration via empirical loss minimization (ELM) on three simple prediction methods: the exponentiated gradient (EG, see Section 2.3) and exponentially weighted average (EWA, see Section 2.2) forecasters, as well as the plug-in version of Zinkevich's forecaster (see Section 2.6). We chose in all three cases a uniform logarithmic grid: over [1e-6, 1e-2] for the exponentiated gradient forecaster; over [1e-8, 1e-4] for the exponentially weighted average forecaster; over [1e-7, 1e-3] for the plug-in version of Zinkevich's forecaster. The rmse of the resulting ELM calibrated forecasters was computed for different numbers of grid points |Λ̃| = 5, 9, 21, and 41 (for the plug-in version of Zinkevich's forecaster, only the case |Λ̃| = 5 was computed), and compared with the best and, respectively, the worst performance obtained with a constant value of the parameter taken on the grid.
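The grid-based ELM calibration can be sketched as follows. For brevity this illustrative sketch wraps a plain (non-sleeping) exponentially weighted average forecaster, and all names are invented for the example.

```python
import math

def ewa_weights(eta, past_losses, N):
    """Exponentially weighted average: w_i proportional to exp(-eta * cumulative loss)."""
    w = [math.exp(-eta * sum(l[i] for l in past_losses)) for i in range(N)]
    s = sum(w)
    return [wi / s for wi in w]

def elm_forecast(F, Y, grid):
    """At each round, predict with the grid value whose EWA forecaster has the
    smallest past cumulative squared loss (empirical loss minimization)."""
    N = len(F[0])
    cum = {eta: 0.0 for eta in grid}     # past cumulative loss of each grid point
    past_losses = []                     # per-round vectors of expert square losses
    preds = []
    for f, y in zip(F, Y):
        best = min(grid, key=lambda e: cum[e])       # the ELM choice of eta
        p = ewa_weights(best, past_losses, N)
        preds.append(sum(pi * fi for pi, fi in zip(p, f)))
        for eta in grid:                 # update every grid point's cumulative loss
            q = ewa_weights(eta, past_losses, N)
            y_hat = sum(qi * fi for qi, fi in zip(q, f))
            cum[eta] += (y_hat - y) ** 2
        past_losses.append([(fi - y) ** 2 for fi in f])
    return preds
```

On a toy run where the first expert is perfect and the second always predicts 0, the wrapper quickly abandons the too-small learning rate and its predictions converge to the observations.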


Sequential prediction with coded side information under logarithmic loss under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science

More information

EECS 545 Project Report: Query Learning for Multiple Object Identification

EECS 545 Project Report: Query Learning for Multiple Object Identification EECS 545 Project Report: Query Learning for Multiple Object Identification Dan Lingenfelter, Tzu-Yu Liu, Antonis Matakos and Zhaoshi Meng 1 Introduction In a typical query learning setting we have a set

More information

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi Gradient Descent This lecture introduces Gradient Descent a meta-algorithm for unconstrained minimization. Under convexity of the

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Segment Description of Turbulence

Segment Description of Turbulence Dynamics of PDE, Vol.4, No.3, 283-291, 2007 Segment Description of Turbulence Y. Charles Li Communicated by Y. Charles Li, received August 25, 2007. Abstract. We propose a segment description for turbulent

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Online Optimization in X -Armed Bandits

Online Optimization in X -Armed Bandits Online Optimization in X -Armed Bandits Sébastien Bubeck INRIA Lille, SequeL project, France sebastien.bubeck@inria.fr Rémi Munos INRIA Lille, SequeL project, France remi.munos@inria.fr Gilles Stoltz Ecole

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

The Sherrington-Kirkpatrick model

The Sherrington-Kirkpatrick model Stat 36 Stochastic Processes on Graphs The Sherrington-Kirkpatrick model Andrea Montanari Lecture - 4/-4/00 The Sherrington-Kirkpatrick (SK) model was introduced by David Sherrington and Scott Kirkpatrick

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information

Online Learning and Online Convex Optimization

Online Learning and Online Convex Optimization Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game

More information

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria 12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos

More information

Convex Optimization / Homework 1, due September 19

Convex Optimization / Homework 1, due September 19 Convex Optimization 1-725/36-725 Homework 1, due September 19 Instructions: You must complete Problems 1 3 and either Problem 4 or Problem 5 (your choice between the two). When you submit the homework,

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Online Kernel PCA with Entropic Matrix Updates

Online Kernel PCA with Entropic Matrix Updates Dima Kuzmin Manfred K. Warmuth Computer Science Department, University of California - Santa Cruz dima@cse.ucsc.edu manfred@cse.ucsc.edu Abstract A number of updates for density matrices have been developed

More information

Robust online aggregation of ensemble forecasts

Robust online aggregation of ensemble forecasts Robust online aggregation of ensemble forecasts with applications to the forecasting of electricity consumption and of exchange rates Gilles Stoltz CNRS HEC Paris The framework of this talk Sequential

More information

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 14: Random Walks, Local Graph Clustering, Linear Programming Lecturer: Shayan Oveis Gharan 3/01/17 Scribe: Laura Vonessen Disclaimer: These

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

More information

NATCOR. Forecast Evaluation. Forecasting with ARIMA models. Nikolaos Kourentzes

NATCOR. Forecast Evaluation. Forecasting with ARIMA models. Nikolaos Kourentzes NATCOR Forecast Evaluation Forecasting with ARIMA models Nikolaos Kourentzes n.kourentzes@lancaster.ac.uk O u t l i n e 1. Bias measures 2. Accuracy measures 3. Evaluation schemes 4. Prediction intervals

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

SBS Chapter 2: Limits & continuity

SBS Chapter 2: Limits & continuity SBS Chapter 2: Limits & continuity (SBS 2.1) Limit of a function Consider a free falling body with no air resistance. Falls approximately s(t) = 16t 2 feet in t seconds. We already know how to nd the average

More information

Internal Regret in On-line Portfolio Selection

Internal Regret in On-line Portfolio Selection Internal Regret in On-line Portfolio Selection Gilles Stoltz (gilles.stoltz@ens.fr) Département de Mathématiques et Applications, Ecole Normale Supérieure, 75005 Paris, France Gábor Lugosi (lugosi@upf.es)

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Sequential Investment, Universal Portfolio Algos and Log-loss

Sequential Investment, Universal Portfolio Algos and Log-loss 1/37 Sequential Investment, Universal Portfolio Algos and Log-loss Chaitanya Ryali, ECE UCSD March 3, 2014 Table of contents 2/37 1 2 3 4 Definitions and Notations 3/37 A market vector x = {x 1,x 2,...,x

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Weighted Majority and the Online Learning Approach

Weighted Majority and the Online Learning Approach Statistical Techniques in Robotics (16-81, F12) Lecture#9 (Wednesday September 26) Weighted Majority and the Online Learning Approach Lecturer: Drew Bagnell Scribe:Narek Melik-Barkhudarov 1 Figure 1: Drew

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Intelligent Systems Discriminative Learning, Neural Networks

Intelligent Systems Discriminative Learning, Neural Networks Intelligent Systems Discriminative Learning, Neural Networks Carsten Rother, Dmitrij Schlesinger WS2014/2015, Outline 1. Discriminative learning 2. Neurons and linear classifiers: 1) Perceptron-Algorithm

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

The Free Matrix Lunch

The Free Matrix Lunch The Free Matrix Lunch Wouter M. Koolen Wojciech Kot lowski Manfred K. Warmuth Tuesday 24 th April, 2012 Koolen, Kot lowski, Warmuth (RHUL) The Free Matrix Lunch Tuesday 24 th April, 2012 1 / 26 Introduction

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Review of Multi-Calculus (Study Guide for Spivak s CHAPTER ONE TO THREE)

Review of Multi-Calculus (Study Guide for Spivak s CHAPTER ONE TO THREE) Review of Multi-Calculus (Study Guide for Spivak s CHPTER ONE TO THREE) This material is for June 9 to 16 (Monday to Monday) Chapter I: Functions on R n Dot product and norm for vectors in R n : Let X

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Lecture 19: Follow The Regulerized Leader

Lecture 19: Follow The Regulerized Leader COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey

More information

Statistical Ranking Problem

Statistical Ranking Problem Statistical Ranking Problem Tong Zhang Statistics Department, Rutgers University Ranking Problems Rank a set of items and display to users in corresponding order. Two issues: performance on top and dealing

More information

Bisection Ideas in End-Point Conditioned Markov Process Simulation

Bisection Ideas in End-Point Conditioned Markov Process Simulation Bisection Ideas in End-Point Conditioned Markov Process Simulation Søren Asmussen and Asger Hobolth Department of Mathematical Sciences, Aarhus University Ny Munkegade, 8000 Aarhus C, Denmark {asmus,asger}@imf.au.dk

More information

Piecewise Constant Prediction

Piecewise Constant Prediction Piecewise Constant Prediction Erik Ordentlich Information heory Research Hewlett-Packard Laboratories Palo Alto, CA 94304 Email: erik.ordentlich@hp.com Marcelo J. Weinberger Information heory Research

More information

Learning correlated equilibria in games with compact sets of strategies

Learning correlated equilibria in games with compact sets of strategies Learning correlated equilibria in games with compact sets of strategies Gilles Stoltz a,, Gábor Lugosi b a cnrs and Département de Mathématiques et Applications, Ecole Normale Supérieure, 45 rue d Ulm,

More information