Aggregation of sleeping predictors to forecast electricity consumption


Aggregation of sleeping predictors to forecast electricity consumption

Marie Devaine^{1,2,3}, Yannig Goude^{2}, and Gilles Stoltz^{1,4}

^1 École Normale Supérieure, Paris, France
^2 EDF R&D, Clamart, France
^3 Université Paris-Sud, Orsay, France
^4 HEC Paris, Jouy-en-Josas, France


Contents

1 Introduction and context
  1.1 Description of the framework
2 Base forecasters
  2.1 Exponentially Weighted Average Beta-Forecaster
  2.2 Exponentially Weighted Average Forecaster
  2.3 Exponential Gradient Forecaster
  2.4 Exponential Gradient Forecaster Partition
  2.5 Renormalized Exponential Gradient Forecaster
  2.6 Zinkevich's forecasters
  2.7 Ridge Regression Forecaster
  2.8 Ridge Regression Partition Forecaster
  2.9 Normalized Ridge Forecaster
  Fixed-Share Forecaster
  Exponentiated-Gradient Fixed-Share Forecaster
  Another sleeping adaptation of the fixed-share forecaster
3 Tricks
  On-line calibration of one parameter
  Towards a reduction of the bias of the ridge-type forecasters?
  Compensated regrets
  Calibration of the couple of parameters for fixed-share type forecasters
4 Another data set: French data
  Description of the French data
  Results obtained by the forecasters

Chapter 1: Introduction and context

1.1 Description of the framework

Data. The data set consists of hourly observations of the Slovakian consumption of electricity. The units are megawatts and the period of reference runs from 01/01/2005 to 12/31/2007. There are 35 base forecasters (experts). The experts do not necessarily output a prediction at each step; they express themselves only at certain moments. (Yet at each step at least one expert makes a prediction.) The framework is therefore called prediction with sleeping experts.

Notations. Time rounds are indexed by t = 1, 2, ..., n. Experts are indexed by i = 1, ..., N. The active experts at round t are given by a subset E_t ⊆ {1, ..., N}. The observation at round t is denoted by y_t. The prediction of expert i at round t exists if and only if i ∈ E_t, and we denote it by f_{i,t}. For practical reasons, we set the prediction of expert i to zero when i ∉ E_t. We can thus define the vector of the predictions of the experts at time t as f_t = (f_{1,t}, ..., f_{N,t}).

At each round the master forecaster outputs a vector of weights (an element of R^N), which it uses to form a linear prediction (to be compared to y_t). Two prototypical cases arise.

Constrained prediction: for each t, the vector used for the prediction is a probability distribution over E_t. In this case we denote the vector of weights by p_t ∈ X_{E_t}. More precisely, p_t ∈ R^N belongs to X_{E_t} if p_{i,t} = 0 for i ∉ E_t, if p_{i,t} ≥ 0 for all i, and if Σ_{i∈E_t} p_{i,t} = 1. (Note that in the classical framework X is simply the simplex over {1, ..., N}.) In this constrained case, the master forecaster forms the prediction

  ŷ_t = Σ_{i∈E_t} p_{i,t} f_{i,t} = p_t · f_t.

Unconstrained prediction: the vector used for the prediction at time t may be any vector u_t ∈ R^N. In this unconstrained case, the master forecaster forms the prediction

  ŷ_t = Σ_{i∈E_t} u_{i,t} f_{i,t} = u_t · f_t.

Note that the second equality holds because of the convention that f_{i,t} = 0 for i ∉ E_t.
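As a concrete illustration of the constrained case, here is a minimal NumPy sketch (the function name and the array layout are ours, not from the report): thanks to the convention f_{i,t} = 0 for sleeping experts, the aggregation is a plain inner product once the weights are renormalized over E_t.

```python
import numpy as np

def constrained_prediction(weights, forecasts, active):
    """Aggregate the active experts' forecasts with a probability vector.

    weights   -- nonnegative scores over the N experts
    forecasts -- the N expert forecasts f_{i,t} (values on sleepers are irrelevant)
    active    -- boolean mask encoding the active set E_t
    """
    w = np.where(active, weights, 0.0)
    w = w / w.sum()                       # renormalize over E_t
    f = np.where(active, forecasts, 0.0)  # convention: f_{i,t} = 0 if i not in E_t
    return float(w @ f)
```

For instance, with two active experts forecasting 10 and 20 under equal weights, the prediction is their average, whatever the sleeping expert announces.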
An idea to deal with this sleeping-experts framework is to go back to the classical framework by considering a partition of the data set that depends on the E_t. Let K be the number of values taken by the E_t on the data (for t ∈ {1, ..., n}); we denote by U_1, ..., U_K the corresponding values. In addition, we denote by k_t the index corresponding to step t, so that E_t = U_{k_t}.

Criteria to assess the quality of a master forecaster

We use a root mean squared error (rmse) criterion. Formally, the rmse of a sequence v_1^n = (v_1, ..., v_n) of aggregation vectors is defined by

  rmse(v_1^n) = sqrt( (1/n) Σ_{t=1}^n ( Σ_{i∈E_t} v_{i,t} f_{i,t} − y_t )² ).

We give below some lower bounds beyond which it is difficult to go.

1. The following bounds have counterparts in the classical framework:

  B_oracle = min_{(j_1,...,j_n) ∈ E_1×...×E_n} rmse(δ_{j_1}, ..., δ_{j_n}) = min_{(j_1,...,j_n) ∈ E_1×...×E_n} sqrt( (1/n) Σ_{t=1}^n (f_{j_t,t} − y_t)² )

and

  B_{R^N} = inf_{v∈R^N} rmse(v, ..., v) = inf_{v∈R^N} sqrt( (1/n) Σ_{t=1}^n (v·f_t − y_t)² ).

The second oracle has a counterpart in terms of a given probability distribution p ∈ X (it is legally defined thanks to the convention that f_{i,t} = 0 for i ∉ E_t). However, the predictions p·f_t will in general be too biased; this is, by the way, already the case even with a general v ∈ R^N. The performance table below will show that the oracle B_{R^N} is not a desirable target to achieve.

2. A first family of oracles designed for the setting of sleeping experts is stated in terms of the partition of time according to the values of the E_t; first with any linear combinations,

  B_{part,R^N} = inf_{(v_1,...,v_K) ∈ (R^N)^K} rmse(v_{k_1}, ..., v_{k_n}) = inf_{(v_1,...,v_K) ∈ (R^N)^K} sqrt( (1/n) Σ_{t=1}^n (v_{k_t}·f_t − y_t)² ),

then only with probability distributions,

  B_{part,X} = min_{(q_1,...,q_K) ∈ X_{U_1}×...×X_{U_K}} rmse(q_{k_1}, ..., q_{k_n}) = min_{(q_1,...,q_K) ∈ X_{U_1}×...×X_{U_K}} sqrt( (1/n) Σ_{t=1}^n (q_{k_t}·f_t − y_t)² ),

and finally only with Dirac distributions,

  B_part = min_{(j_1,...,j_K) ∈ U_1×...×U_K} rmse(δ_{j_{k_1}}, ..., δ_{j_{k_n}}) = min_{(j_1,...,j_K) ∈ U_1×...×U_K} sqrt( (1/n) Σ_{t=1}^n (f_{j_{k_t}},_t − y_t)² ).

3. We can also take some renormalizations into account to deal with experts being asleep. In this respect, for q ∈ X we denote by q(E_t) = Σ_{i∈E_t} q_i a renormalization factor, and we let q^{E_t} ∈ X_{E_t} be the conditional distribution of q given E_t: it is defined from q by keeping only the components indexed by E_t and dividing them by q(E_t). By convention, q^{E_t} = (0, ..., 0) when q(E_t) = 0.

a) A first version of the oracle is

  B^(a)_renorm = min_{q∈X} sqrt( ( Σ_{t=1}^n I_{q(E_t)≠0} (q^{E_t}·f_t − y_t)² ) / Σ_{t=1}^n I_{q(E_t)≠0} ),

which entails a natural average at each round over the active forecasters when taking q as the uniform distribution (1/N, ..., 1/N):

  B^(a)_ave = sqrt( (1/n) Σ_{t=1}^n ( (1/|E_t|) Σ_{j∈E_t} f_{j,t} − y_t )² ).

b) A second (more continuous) version is given by an rmse criterion putting non-equal weights on the losses:

  B^(b)_renorm = min_{q∈X} sqrt( ( Σ_{t=1}^n q(E_t) (q^{E_t}·f_t − y_t)² ) / Σ_{t=1}^n q(E_t) ),

with the corresponding average at each round over the active forecasters:

  B^(b)_ave = sqrt( ( Σ_{t=1}^n |E_t| ( (1/|E_t|) Σ_{j∈E_t} f_{j,t} − y_t )² ) / Σ_{t=1}^n |E_t| ).

The versions a) and b) give the same bound when we take the minimum over Dirac distributions only:

  B_best-exp = min_{j=1,...,N} sqrt( ( Σ_{t=1}^n I_{j∈E_t} (f_{j,t} − y_t)² ) / Σ_{t=1}^n I_{j∈E_t} ).

4. Instead of renormalizing the restriction of q ∈ X to the active experts, we may consider the Euclidean projection onto X_{E_t}, which we denote by Π_{X_{E_t}}. For q ∈ X, we denote by q_{E_t} its restriction to the set E_t of active experts, defined as q_{E_t} = (q_{i_1}, ..., q_{i_{|E_t|}}), where {i_1, ..., i_{|E_t|}} = E_t and i_1 < i_2 < ... < i_{|E_t|}. Finally, we define the projection P_t as follows: for all q ∈ X,

  P_t(q) = Π_{X_{E_t}}(q_{E_t}).

The corresponding oracle is

  B_project = min_{q∈X} sqrt( (1/n) Σ_{t=1}^n ( P_t(q)·f_t − y_t )² ).

No simple reduction is obtained in this case when we take the minimum over Dirac distributions only, since P_t(δ_j) = δ_j when j ∈ E_t but P_t(δ_j) is the uniform distribution over X_{E_t} when j ∉ E_t.
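As an aside, the first oracle B_oracle reduces to picking, at each round, the best active expert in hindsight, so it is cheap to compute; here is a NumPy sketch (our own naming, not from the report):

```python
import numpy as np

def oracle_rmse(forecasts, active, y):
    """B_oracle: at each round, the best active expert in hindsight.

    forecasts -- (n, N) array of expert forecasts f_{i,t}
    active    -- (n, N) boolean array, active[t, i] iff i is in E_t
    y         -- (n,) observations
    """
    sq = (forecasts - y[:, None]) ** 2
    sq = np.where(active, sq, np.inf)    # sleeping experts cannot be chosen
    return float(np.sqrt(sq.min(axis=1).mean()))
```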

Notations used in the description of the algorithms

We denote by ℓ(·,·) the square loss, ℓ(x, y) = (x − y)². We need in the proofs a uniform bound on the losses; it exists when the observations and the predictions are bounded by B for instance, which is the case in practice. In this case, as the data are nonnegative, we can take B² as a uniform bound on our losses. (The only difficulty is to exhibit a value for B in advance.)

For each round t, we denote by

  ℓ_t(v) = ℓ( Σ_{i∈E_t} v_i f_{i,t}, y_t )

the instantaneous loss of a vector v ∈ R^N (possibly a probability distribution on E_t). When v = δ_i is a Dirac mass on an active expert, i.e., i ∈ E_t, we simply write ℓ_{i,t} = ℓ_t(δ_i) = ℓ(f_{i,t}, y_t). The loss of the master forecaster at round t equals ℓ̂_t = ℓ(ŷ_t, y_t), with the notations above.

All these quantities have cumulative counterparts. The cumulative loss of the forecaster is referred to as L̂_n = Σ_{t=1}^n ℓ̂_t. The cumulative losses of the experts are less easy to define in the framework of sleeping experts. We consider

1. for v ∈ R^N,
  L_n(v) = Σ_{t=1}^n ℓ_t(v);

2. for v_1^K = (v_1, ..., v_K) ∈ (R^N)^K (where v_1^K can possibly be a sequence of probability distributions in X_{U_1} × ... × X_{U_K}),
  L_n(v_1^K) = Σ_{t=1}^n ℓ_t(v_{k_t});

3. for a distribution q ∈ X,
  L'_n(q) = Σ_{t=1}^n q(E_t) ℓ_t(q^{E_t}),   and in particular   L'_{i,n} = Σ_{t=1}^n I_{i∈E_t} ℓ_{i,t},
where the second definition follows from the first by taking the probability distribution q = δ_i;

4. for a distribution q ∈ X,
  L''_n(q) = Σ_{t=1}^n ℓ_t(P_t(q)).

Minimization of rmse via the minimization of regret

We attempt to minimize the rmse of the master forecaster by ensuring that it has a small regret, where the regret is defined

1. either as sup_{v∈R^N} R_n(v), where R_n(v) = L̂_n − L_n(v);

2. or as sup_{v_1^K ∈ (R^N)^K} R_n(v_1^K), where R_n(v_1^K) = L̂_n − L_n(v_1^K) = L̂_n − Σ_{t=1}^n ℓ_t(v_{k_t});

3. or as sup_{q∈X} R'_n(q), where R'_n(q) = Σ_{t=1}^n q(E_t) ℓ̂_t − L'_n(q); when we restrict our attention in the previous definition to Dirac distributions only, we get

  max_{j=1,...,N} R'_n(δ_j) =: max_{j=1,...,N} R'_{j,n},   where   R'_{j,n} = Σ_{t=1}^n I_{j∈E_t} ( ℓ̂_t − ℓ_{j,t} );

4. or as sup_{q∈X} R''_n(q), where R''_n(q) = L̂_n − L''_n(q).

Numerical values

For practical purposes we do not use the whole data set to build our forecasts. As predictions are made for the next 24 hours, we split our data set into 24 fixed-hour subsets and run 24 algorithms in parallel. In our examples and simulations, we chose the subset corresponding to noon. We denote by n the total number of prediction steps and by n_12 the number of predictions in the noon data subset. (We index by a subscript 12 the quantities that refer to this subset only.) M is some typical order of magnitude of the y_t.

[Table: values of n, n_12, M, N, K and K_12.]

Some standard rmse values are summarized for the noon data subset below. As argued above, B_{R^N} is high because of the bias induced by sleeping experts. (When a highly weighted expert is missing, the bias is important and the corresponding instantaneous loss is large.) We must therefore use the information of sleeping/active experts to get lower bounds smaller than this one.

[Table: values of B_oracle, B_{R^N}, B_{part,R^N}, B_{part,X}, B_part, B^(a)_renorm, B^(a)_ave, B^(b)_renorm, B^(b)_ave, B_project, B_best-exp.]

The global performance of the experts (results are very similar for each hourly data set) is drawn in Figure 1.1. In Figure 1.2, we plot the global performance of each expert with respect to its percentage of activity.

Figure 1.1: Performance of the experts (rmse against expert index)

Figure 1.2: rmse against percentage of activity

Chapter 2: Base forecasters

2.1 Exponentially Weighted Average Beta-Forecaster

2.1.1 References

This forecaster is described in [BM05].

2.1.2 Theoretical bound(s)

For β such that log(1/β) is of the order of (1/B) sqrt(log N / n), we have

  max_{i∈{1,...,N}} R'_{i,n} ≤ O( B sqrt(n log N) ).

2.1.3 Interpretation and/or comments

This is a variant of the exponentially weighted average forecaster in the context of sleeping experts. It is arguably less intuitive than the one presented in Section 2.2, but a theoretical bound on its performance can be proved in a straightforward way. The parameter β used here should be thought of as e^{−η} with η > 0. The weights obtained with this algorithm are thus of the same form as the weights given by the algorithm of Section 2.2. The difference, however, is that it is not the same quantity (the same regret) that is used in the exponential weighting, even though in both cases the aim is to control the original regret max_i R'_{i,n}.

2.1.4 Statement and implementation

The parameter β belongs to (0, 1). We introduce the β-regret: for i ∈ {1, ..., N} and n ≥ 1,

  R'_{β,i,n} = Σ_{t=1}^n I_{i∈E_t} ( β ℓ̂_t − ℓ_{i,t} ).

Note that R'_{i,n} is a β-regret for β = 1 (which is however a forbidden value for β in this section). For t ≥ 1, the convex weights p_t are defined as

  p_{i,t} = I_{i∈E_t} β^{−R'_{β,i,t−1}} / Σ_{j∈E_t} β^{−R'_{β,j,t−1}}.

By convention, R'_{β,i,0} = 0 for all i = 1, ..., N.

The forecaster is implemented as follows.

Parameters: β ∈ (0, 1)
Initialization: R'_{β,i,0} = 0 for all i = 1, ..., N, and p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and compute the regrets, for all i = 1, ..., N,
      R'_{β,i,t} = R'_{β,i,t−1} + I_{i∈E_t} ( β ℓ̂_t − ℓ_{i,t} );
  (3) compute p_{t+1} as, for all i = 1, ..., N,
      p_{i,t+1} = I_{i∈E_{t+1}} β^{−R'_{β,i,t}} / Σ_{j∈E_{t+1}} β^{−R'_{β,j,t}}.

2.1.5 Proof of the theoretical bound

Without loss of generality, we can assume that B = 1. (If this is not the case, one simply considers the ℓ_{i,t}/B² instead of the ℓ_{i,t}.) We first introduce some notation: for i ∈ {1, ..., N} and t ≥ 1, we denote

  w'_{i,t} = β^{−R'_{β,i,t−1}},   w_{i,t} = I_{i∈E_t} w'_{i,t},   W_t = Σ_{i=1}^N w_{i,t},   W'_t = Σ_{i=1}^N w'_{i,t}.

In particular, we have p_{i,t} = w_{i,t}/W_t. By convexity of ℓ in its first argument, we have

  ℓ̂_t ≤ Σ_{i=1}^N p_{i,t} ℓ_{i,t} = Σ_{i=1}^N w_{i,t} ℓ_{i,t} / W_t,   and thus   W_t ℓ̂_t − Σ_{i=1}^N w_{i,t} ℓ_{i,t} ≤ 0,

an inequality we will need later. We now bound W'_t by N for all t. We have W'_1 = N, and we now show that the sequence (W'_t)_{t≥1} is non-increasing. We use, for β ∈ (0, 1) and x ∈ [0, 1], the inequality β^x ≤ 1 − (1−β)x

and the inequality β^{−x} ≤ 1 + (1−β)x/β to get:

  W'_{t+1} = Σ_{i=1}^N w'_{i,t} β^{−I_{i∈E_t}(β ℓ̂_t − ℓ_{i,t})}
           ≤ Σ_{i=1}^N w'_{i,t} ( 1 − (1−β) ℓ_{i,t} I_{i∈E_t} ) ( 1 + (1−β) β ℓ̂_t I_{i∈E_t} / β )
           ≤ Σ_{i=1}^N w'_{i,t} ( 1 + (1−β) I_{i∈E_t} ( ℓ̂_t − ℓ_{i,t} ) )
           ≤ W'_t + (1−β) ( W_t ℓ̂_t − Σ_{i=1}^N w_{i,t} ℓ_{i,t} )
           ≤ W'_t,

where the last inequality uses the convexity inequality established above. In particular, we have w'_{i,t+1} ≤ N for all t ≥ 0. Since w'_{i,t+1} = β^{−R'_{β,i,t}}, we have

  R'_{β,i,t} ≤ log N / log(1/β),

which rewrites as

  Σ_{t=1}^n I_{i∈E_t} ℓ̂_t ≤ (1/β) ( Σ_{t=1}^n I_{i∈E_t} ℓ_{i,t} + log N / log(1/β) ) ≤ Σ_{t=1}^n I_{i∈E_t} ℓ_{i,t} + (1/β − 1) n + (1/β) log N / log(1/β).

Optimizing in β (by taking the parametrization β = e^{−η} and optimizing in η), we get the claimed bound.

2.1.6 Performance

[Table: rmse as a function of 1 − β, over values ranging from 1e-3 down to 1e-6; the best value is obtained at 1 − β = 4e-6.]

Forward note: we obtain the same performance as with the algorithm of Section 2.2. Indeed, setting β = e^{−η}, the two algorithms can be seen to be equivalent for sufficiently small values of η. These values are the ones we are interested in and, for those, β is close to 1. Thus, the table above is the same as the one which will be given in Section 2.2 for small values of η (for which 1 − β ≈ η).
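The loop above can be sketched in a few lines of NumPy (a sketch, not the report's own code; the naming and the boolean-mask encoding of the sets E_t are our own):

```python
import numpy as np

def ewa_beta(forecasts, active, y, beta):
    """Sleeping EWA beta-forecaster: on the active set, weights are
    proportional to beta^(-R'_{beta,i,t-1}); the beta-regrets accumulate
    I_{i in E_t} (beta * hat-loss_t - loss_{i,t})."""
    n, N = forecasts.shape
    R = np.zeros(N)                       # beta-regrets R'_{beta,i,t}
    preds = np.empty(n)
    for t in range(n):
        w = np.where(active[t], beta ** (-R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        loss_hat = (preds[t] - y[t]) ** 2
        R += active[t] * (beta * loss_hat - (forecasts[t] - y[t]) ** 2)
    return preds
```

On a toy stream where expert 2 is always right, the weights drift toward expert 2 and the predictions improve round after round.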

2.1.7 Graphical evolution of the weights

Figure 2.1: Evolution of the weights for β = 1 − 4e-6

2.2 Exponentially Weighted Average Forecaster

2.2.1 References

This is an adaptation of the forecaster of Section 2.1. We designed it and there is no previous occurrence of it in the literature.

2.2.2 Theoretical bound(s)

For η of the order of (1/B) sqrt(log N / n), we have

  max_{i∈{1,...,N}} R'_{i,n} ≤ O( B sqrt(n log N) ).

2.2.3 Interpretation and/or comments

We claim that the forecaster introduced below is more natural than the one of Section 2.1, since it is based on the regret itself and not on a variant of it. Note that, with respect to the standard (i.e., non-sleeping) version of the exponentially weighted average forecaster, it cannot be defined only in terms of the cumulative losses of the experts. Here, unlike in the standard case, the loss of the master forecaster in the numerator does not cancel out with the corresponding terms in the denominator of the expression defining the weights.

2.2.4 Statement and implementation

The parameter η belongs to (0, +∞). For t ≥ 1, the convex weights p_t are defined as

  p_{i,t} = I_{i∈E_t} e^{η R'_{i,t−1}} / Σ_{j∈E_t} e^{η R'_{j,t−1}}.

By convention, R'_{i,0} = 0 for all i = 1, ..., N.

The forecaster is implemented as follows.

Parameters: η > 0
Initialization: R'_{i,0} = 0 for all i = 1, ..., N, and p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and compute the regrets, for all i = 1, ..., N,
      R'_{i,t} = R'_{i,t−1} + I_{i∈E_t} ( ℓ̂_t − ℓ_{i,t} );
  (3) compute p_{t+1} as, for all i = 1, ..., N,
      p_{i,t+1} = I_{i∈E_{t+1}} e^{η R'_{i,t}} / Σ_{j∈E_{t+1}} e^{η R'_{j,t}}.

2.2.5 Proof of the theoretical bound

We do not provide a direct proof but rather derive the bound from the one of Section 2.1. In particular, taking β = e^{−η} ∈ (0, 1), we use that

  e^{η R'_{i,n}} = β^{−R'_{β,i,n}} e^{η(1−β) Σ_{t=1}^n ℓ̂_t I_{i∈E_t}} = w'_{i,n+1} e^{η(1−β) Σ_{t=1}^n ℓ̂_t I_{i∈E_t}},

where we used the notations of Section 2.1.5 for the two equalities. We know from Section 2.1.5 that w'_{i,n+1} ≤ N for all i ∈ {1, ..., N}, so that

  e^{η R'_{i,n}} ≤ N e^{η(1−β) Σ_{t=1}^n ℓ̂_t I_{i∈E_t}}.

Using in addition 1 − e^{−η} ≤ η, we thus get, for η > 0,

  η R'_{i,n} ≤ log N + η (1 − e^{−η}) Σ_{t=1}^n ℓ̂_t I_{i∈E_t} ≤ log N + η² Σ_{t=1}^n ℓ̂_t I_{i∈E_t}.

Rearranging, and bounding the sum in a crude but uniform manner by B² n,

  R'_{i,n} ≤ log N / η + η B² n.

The claimed bound is obtained by taking η = sqrt( log N / (B² n) ):

  R'_{i,n} ≤ 2 B sqrt( n log N ).

2.2.6 Performance

[Table: rmse as a function of η ∈ {1e-8, 1e-7, 1e-6, 1e-5, 1e-4}; the best value is obtained at η = 4e-6.]

2.2.7 Graphical evolution of the weights

Figure 2.2: Evolution of the weights for η = 4e-6
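For completeness, a NumPy sketch of this variant (our own naming); compared with the sketch of Section 2.1, only the regret increment changes, from β ℓ̂_t − ℓ_{i,t} to ℓ̂_t − ℓ_{i,t}:

```python
import numpy as np

def sleeping_ewa(forecasts, active, y, eta):
    """Sleeping EWA forecaster: on the active set E_t, weights are
    proportional to exp(eta * R'_{i,t-1}), the regrets accumulating
    I_{i in E_t} (hat-loss_t - loss_{i,t})."""
    n, N = forecasts.shape
    R = np.zeros(N)                       # regrets R'_{i,t}
    preds = np.empty(n)
    for t in range(n):
        w = np.where(active[t], np.exp(eta * R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        R += active[t] * ((preds[t] - y[t]) ** 2 - (forecasts[t] - y[t]) ** 2)
    return preds
```

Note that the regrets of sleeping experts are frozen: on a round where only one expert is awake, the forecaster simply copies its prediction.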

2.3 Exponential Gradient Forecaster

2.3.1 References

This forecaster follows from the one of Section 2.2 by considering the gradients of the losses; see [CBL06, Section 2.5].

2.3.2 Theoretical bound(s)

For η of the order of (1/C) sqrt(log N / n), we have

  max_{i∈{1,...,N}} R'_{i,n} ≤ O( C sqrt(n log N) ),

where C is a constant such that −C ≤ (∇ℓ̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B².

2.3.3 Interpretation and/or comments

The forecaster below is the exponentiated-gradient version of the forecaster of Section 2.2, in which the cumulative regrets appearing in the exponents of the weights are replaced by gradient-based upper bounds. We only adapt here the forecaster of Section 2.2. We also considered a gradient version of the algorithm of Section 2.1 but, again, it obtains about the same performance as the one presented in this section; this is why we do not devote a dedicated section to it.

2.3.4 Statement and implementation

The parameter η belongs to (0, +∞). We use ∇ℓ̂_t to denote the gradient of the convex function v ↦ ℓ_t(v) taken at p_t, that is, ∇ℓ̂_t = 2(ŷ_t − y_t) f_t. For t ≥ 1, the vector of weights p_t is defined component-wise as

  p_{i,t} = I_{i∈E_t} exp( η Σ_{s=1}^{t−1} I_{i∈E_s} ∇ℓ̂_s·(p_s − δ_i) ) / Σ_{j∈E_t} exp( η Σ_{s=1}^{t−1} I_{j∈E_s} ∇ℓ̂_s·(p_s − δ_j) )
         = I_{i∈E_t} exp( 2η Σ_{s=1}^{t−1} I_{i∈E_s} (ŷ_s − y_s)(ŷ_s − f_{i,s}) ) / Σ_{j∈E_t} exp( 2η Σ_{s=1}^{t−1} I_{j∈E_s} (ŷ_s − y_s)(ŷ_s − f_{j,s}) ).

By convention, an empty sum is null.

The forecaster is implemented as follows.

Parameters: learning rate η > 0
Initialization: R̃_{i,0} = 0 for all i = 1, ..., N, and p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and compute the pseudo-regrets, for all i = 1, ..., N,
      R̃_{i,t} = R̃_{i,t−1} + 2 I_{i∈E_t} (ŷ_t − y_t)(ŷ_t − f_{i,t});
  (3) compute p_{t+1} as, for all i = 1, ..., N,
      p_{i,t+1} = I_{i∈E_{t+1}} e^{η R̃_{i,t}} / Σ_{j∈E_{t+1}} e^{η R̃_{j,t}}.

2.3.5 Proof of the theoretical bound

Since v ↦ ℓ_t(v) is convex and differentiable, we can upper bound the instantaneous regrets as

  ℓ(ŷ_t, y_t) − ℓ(f_{i,t}, y_t) = ℓ_t(p_t) − ℓ_t(δ_i) ≤ ∇ℓ̂_t·(p_t − δ_i) = ℓ̃(ŷ_t, y_t) − ℓ̃(f_{i,t}, y_t),

where we denoted, for all probability distributions q ∈ X_{E_t}, the linearized losses by ℓ̃(q·f_t, y_t) = ∇ℓ̂_t·q. Summing up the bound, we get, for all i = 1, ..., N,

  R'_{i,n} = Σ_{t=1}^n I_{i∈E_t} ( ℓ̂_t − ℓ_{i,t} ) ≤ Σ_{t=1}^n I_{i∈E_t} ( ℓ̃(ŷ_t, y_t) − ℓ̃(f_{i,t}, y_t) ) ≤ log N / η + η C² n,

where C is some bound on the ℓ̃, for instance C = 2B². We used here the result of Section 2.2.5, up to the replacement of the losses by their linearized counterparts; we can do so because the proofs only use that the losses are bounded and convex in their first argument.

2.3.6 Performance

[Table: rmse as a function of η ∈ {1e-7, 1e-6, 1e-5, 1e-4, 1e-3}; the best value is obtained at η = 1.1e-4.]

2.3.7 Graphical evolution of the weights

Figure 2.3: Evolution of the weights for η = 1.1e-4
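The gradient version only changes the quantity fed to the exponential weights; a NumPy sketch (our own naming, as before):

```python
import numpy as np

def sleeping_eg(forecasts, active, y, eta):
    """Exponentiated-gradient version: the pseudo-regrets accumulate
    2 * I_{i in E_t} * (yhat_t - y_t) * (yhat_t - f_{i,t})."""
    n, N = forecasts.shape
    R = np.zeros(N)                       # gradient-based pseudo-regrets
    preds = np.empty(n)
    for t in range(n):
        w = np.where(active[t], np.exp(eta * R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        R += 2.0 * active[t] * (preds[t] - y[t]) * (preds[t] - forecasts[t])
    return preds
```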

2.4 Exponential Gradient Forecaster Partition

2.4.1 References

This forecaster is an adaptation of the EG forecaster studied in [CB99].

2.4.2 Theoretical bound(s)

For η of the order of (1/C) sqrt(K log N / n), we have

  sup R_n(q_1^K) ≤ O( C sqrt(K n log N) ),

where the supremum is over q_1^K ∈ X_{U_1} × ... × X_{U_K}, with q_1^K a shorthand notation for (q_1, ..., q_K), and where C is a constant such that −C ≤ (∇ℓ̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B².

2.4.3 Interpretation and/or comments

This forecaster is obtained by running in parallel K classical exponential gradient algorithms, one on each subset U_k, for k ∈ {1, ..., K}. The bound is obtained by summing up the base bounds of [CB99] over the subsets. Note that if K is large, then the proposed bound is poor.

2.4.4 Statement and implementation

The parameter η belongs to (0, +∞). For t ≥ 2, we introduce the notation

  𝒰_t = { s ∈ {1, ..., t−1} : E_s = U_{k_t} },

which refers to the set of past rounds with the same sleeping configuration as the current one. For t ≥ 1, the weight vector p_t is defined as

  p_{i,t} = I_{i∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{i,s}) ) / Σ_{j∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{j,s}) ).

By convention, an empty sum is null. The definition is legal because if i ∈ E_t then, by definition of 𝒰_t, one also has i ∈ E_s for all s ∈ 𝒰_t.

This forecaster is implemented as follows.

Parameters: learning rate η > 0
For each round t = 1, ..., n:
  (1) if 𝒰_t = ∅, set p_{i,t} = 1/|E_t| if i ∈ E_t and p_{i,t} = 0 if i ∉ E_t; otherwise, set, for i = 1, ..., N,
      p_{i,t} = I_{i∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{i,s}) ) / Σ_{j∈E_t} exp( 2η Σ_{s∈𝒰_t} (ŷ_s − y_s)(ŷ_s − f_{j,s}) );
  (2) predict with ŷ_t = p_t · f_t.

2.4.5 Performance

[Table: rmse as a function of η ∈ {1e-6, 1e-5, 1e-4, 1e-3}; the best value is obtained at η = 7.4e-5.]
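A compact way to run one EG instance per sleeping configuration is to key the pseudo-regrets by the active set itself; a NumPy sketch (our own naming and data layout):

```python
import numpy as np

def eg_partition(forecasts, active, y, eta):
    """One EG instance per distinct sleeping configuration U_k;
    uniform weights on the first visit of a configuration."""
    n, N = forecasts.shape
    regrets = {}                          # active set -> pseudo-regret vector
    preds = np.empty(n)
    for t in range(n):
        key = tuple(active[t])
        R = regrets.setdefault(key, np.zeros(N))
        w = np.where(active[t], np.exp(eta * R), 0.0)
        p = w / w.sum()
        preds[t] = p @ np.where(active[t], forecasts[t], 0.0)
        R += 2.0 * active[t] * (preds[t] - y[t]) * (preds[t] - forecasts[t])
    return preds
```

The in-place update of `R` modifies the array stored in the dictionary, so each configuration keeps its own learning history.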

2.4.6 Graphical evolution of the weights

Figure 2.4: Evolution of the weights for η = 1.1e-4

2.5 Renormalized Exponential Gradient Forecaster

2.5.1 References

This forecaster is described in [FSS97].

2.5.2 Theoretical bound(s)

For η of the order of (1/C) sqrt(log N / n), we have

  sup_{q∈X} R'_n(q) ≤ O( C sqrt(n log N) ),

where C is a constant such that −C ≤ (∇ℓ̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B².

2.5.3 Interpretation and/or comments

This forecaster is another adaptation of the gradient-based forecasters to the setting of sleeping experts. The notion of regret minimized is not the same as in Section 2.3; however, their respective performance is similar. This is despite the fact that the two algorithms are quite different, as the graphical evolutions of the weights reveal.

2.5.4 Statement and implementation

The parameter η belongs to (0, +∞). This forecaster is implemented as follows.

Parameters: learning rate η
Initialization: p_{i,1} = 1/N for i = 1, ..., N
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t^{E_t} · f_t;
  (2) observe y_t and compute p_{t+1} as follows:
      p_{i,t+1} = p_{i,t} e^{−2η f_{i,t}(ŷ_t − y_t)} ( Σ_{j∈E_t} p_{j,t} ) / ( Σ_{k∈E_t} p_{k,t} e^{−2η f_{k,t}(ŷ_t − y_t)} )   if i ∈ E_t,
      p_{i,t+1} = p_{i,t}   if i ∉ E_t.

2.5.5 Proof of the theoretical bound

Since v ↦ ℓ_t(v) is convex and differentiable, we can upper bound the instantaneous regrets as

  ℓ(ŷ_t, y_t) − ℓ(q^{E_t}·f_t, y_t) = ℓ_t(p_t^{E_t}) − ℓ_t(q^{E_t}) ≤ ∇ℓ̂_t·( p_t^{E_t} − q^{E_t} ).

We thus have

  R'_n(q) ≤ Σ_{t=1}^n q(E_t) ∇ℓ̂_t·( p_t^{E_t} − q^{E_t} ).

We now prove the following bound:

  ∇ℓ̂_t·( p_t^{E_t} − q^{E_t} ) ≤ η C²/2 + (1/η) ( K(q^{E_t}, p_t^{E_t}) − K(q^{E_t}, p_{t+1}^{E_t}) ),

where we denoted by K(·,·) the Kullback-Leibler divergence between two distributions. We recall that the latter is defined, for two probability distributions p and q over a set with R elements, as

  K(q, p) = Σ_{i=1}^R q_i log( q_i / p_i ).

The claimed bound actually follows from an application of the general lemma stated below, together with the fact that, by definition of the forecaster,

  p_{i,t+1}^{E_t} = p_{i,t}^{E_t} e^{−η γ_{i,t}} / Σ_{j=1}^N p_{j,t}^{E_t} e^{−η γ_{j,t}}

for all t ∈ {1, ..., n} and i = 1, ..., N, where we set γ_{i,t} = 2 (ŷ_t − y_t) f_{i,t} = (∇ℓ̂_t)_i.

Lemma 1. Let q, p be two probability distributions over a set with R elements and let γ ∈ R^R be any R-dimensional real vector. Define a distribution p' as follows: for i = 1, ..., R,

  p'_i = p_i e^{−η γ_i} / Σ_{j=1}^R p_j e^{−η γ_j}.

Then, denoting by D a bound such that −D ≤ γ_i ≤ D for i = 1, ..., R, one has

  η p·γ − η Σ_{i=1}^R q_i γ_i − η² D²/2 ≤ K(q, p) − K(q, p').   (2.1)

Proof. We start with a chain of equalities:

  K(q, p) − K(q, p') = Σ_{i=1}^R q_i log( p'_i / p_i ) = − log( Σ_{j=1}^R p_j e^{−η γ_j} ) − η Σ_{i=1}^R q_i γ_i.

Resorting to Hoeffding's lemma, we have

  log( Σ_{j=1}^R p_j e^{−η γ_j} ) ≤ −η γ·p + η² D²/2,

so that K(q, p) − K(q, p') ≥ η γ·p − η Σ_i q_i γ_i − η² D²/2, which is the claimed inequality (2.1).

We now get back to the main proof. So far, we have

  R'_n(q) ≤ Σ_{t=1}^n q(E_t) ( η C²/2 + (1/η) ( K(q^{E_t}, p_t^{E_t}) − K(q^{E_t}, p_{t+1}^{E_t}) ) ).

In our case, as we have p_{i,t} = p_{i,t+1} for i ∉ E_t, and therefore p_{t+1}(E_t) = p_t(E_t), we derive

  q(E_t) ( K(q^{E_t}, p_t^{E_t}) − K(q^{E_t}, p_{t+1}^{E_t}) ) = Σ_{i∈E_t} q_i log( p_{i,t+1} / p_{i,t} ) = Σ_{i=1}^N q_i log( p_{i,t+1} / p_{i,t} ) = K(q, p_t) − K(q, p_{t+1}).   (2.2)

Substituting, we get

  R'_n(q) ≤ Σ_{t=1}^n ( q(E_t) η C²/2 + (1/η) ( K(q, p_t) − K(q, p_{t+1}) ) ).

A telescoping sum has appeared, and we are left with

  R'_n(q) ≤ η n C²/2 + K(q, p_1)/η ≤ η n C²/2 + log N / η,

where the last inequality proceeds from the choice of p_1 as the uniform distribution. Optimizing in η, we obtain the claimed bound.

2.5.6 Performance

[Table: rmse as a function of η ∈ {1e-7, 1e-6, 1e-5, 1e-4, 1e-3}; the best value is obtained at η = 1e-4.]
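The multiplicative update that freezes sleeping coordinates and preserves the total mass of the active set can be sketched as follows (our own naming; a sketch, not the report's implementation):

```python
import numpy as np

def renormalized_eg(forecasts, active, y, eta):
    """Renormalized EG: multiplicative update on the active coordinates only,
    preserving their total mass; sleeping coordinates are left unchanged."""
    n, N = forecasts.shape
    p = np.full(N, 1.0 / N)
    preds = np.empty(n)
    for t in range(n):
        a = active[t]
        cond = np.where(a, p, 0.0)
        cond = cond / cond.sum()          # conditional distribution p_t^{E_t}
        preds[t] = cond @ np.where(a, forecasts[t], 0.0)
        mult = np.exp(-2.0 * eta * forecasts[t] * (preds[t] - y[t]))
        scaled = p * mult
        mass = np.where(a, p, 0.0).sum()  # active mass, kept unchanged
        p = np.where(a, scaled * mass / np.where(a, scaled, 0.0).sum(), p)
    return preds
```

Since the active mass is preserved and sleeping coordinates are untouched, p remains a probability distribution over all N experts throughout.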

2.5.7 Graphical evolution of the weights

Figure 2.5: Evolution of the weights for η = 1e-4

2.6 Zinkevich's forecasters

2.6.1 References

These forecasters are adaptations of the ones proposed in [Zin03].

2.6.2 First forecaster: lazy projection

Theoretical bound

The aim of this forecaster is to minimize the regret defined in terms of the R''_n(q), that is, to achieve a performance nearly as good as the oracle B_project. We do not provide any theoretical bound in the sleeping case for the time being.

Interpretation and/or comments

The forecaster below adapts the standard lazy version of Zinkevich's forecaster to the framework of sleeping experts. The original forecaster proceeds by projecting onto the simplex of size N the weights obtained by so-called gradient descent. Here, in the sleeping-experts context, at each round the projection is performed on the subset of the weights indexed by the active experts.

Statement and implementation

The parameter η belongs to (0, +∞). We first define a sequence of intermediate weights w_t by w_1 = (1/N, ..., 1/N) and

  w_{t+1} = w_t − η I_t ⊙ ∇ℓ̂_t,

where I_t = (I_{1∈E_t}, ..., I_{N∈E_t}) and, for two vectors u, v ∈ R^N, u ⊙ v denotes the componentwise product (u_1 v_1, ..., u_N v_N). The final convex weights p_t output at round t are then given by p_t = P_t(w_t), where we use again the notations of Section 1.1.

We briefly recall how to project a real m-tuple x = (x_1, ..., x_m) ∈ R^m onto X_m, the simplex of dimension m: let a be the unique real number such that

  Σ_{i=1}^m (x_i + a)_+ = 1.

Then the Euclidean projection of x onto X_m is given by

  Π_{X_m}(x) = ( (x_1 + a)_+, ..., (x_m + a)_+ ).
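The number a recalled above can be found by sorting; here is a sketch of the standard sort-based routine (this particular routine is a common way to compute a, not the report's own code):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of x onto the probability simplex:
    find a such that sum((x_i + a)_+) = 1 and return (x + a)_+."""
    m = len(x)
    u = np.sort(x)[::-1]                          # sorted in decreasing order
    css = np.cumsum(u)
    # largest index rho (0-based) with u_rho + (1 - css_rho)/(rho+1) > 0
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, m + 1) > 0)[0][-1]
    a = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(x + a, 0.0)
```

For instance, (0.2, 0) projects to (0.6, 0.4): both coordinates are shifted up by the same a = 0.4.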

The forecaster is implemented as follows.

Parameters: η > 0
Initialization: p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) observe y_t and perform the update
      w_{t+1} = w_t − 2η (ŷ_t − y_t) (I_t ⊙ f_t);
  (3) compute p_{t+1} = P_{t+1}(w_{t+1}).

Performance

[Table: rmse as a function of η ∈ {1e-9, 1e-8, 1e-7, 1e-6}; the best value is obtained at η = 2e-8.]

Graphical evolution of the weights

Figure 2.6: Evolution of the weights of the lazy version for η = 2e-8.
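Putting the gradient step and the projection together, the lazy version can be sketched as follows (our own naming; the projection routine is the standard sort-based one):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(x) + 1) > 0)[0][-1]
    return np.maximum(x + (1.0 - css[rho]) / (rho + 1.0), 0.0)

def lazy_zinkevich(forecasts, active, y, eta):
    """Lazy projection: descend on the unprojected weights w_t, and at each
    round project only the active coordinates onto the simplex."""
    n, N = forecasts.shape
    w = np.full(N, 1.0 / N)               # intermediate weights w_t
    preds = np.empty(n)
    for t in range(n):
        a = active[t]
        p = np.zeros(N)
        p[a] = project_simplex(w[a])      # projection on the active subset
        preds[t] = p @ np.where(a, forecasts[t], 0.0)
        w = w - 2.0 * eta * (preds[t] - y[t]) * np.where(a, forecasts[t], 0.0)
    return preds
```

"Lazy" refers to the fact that the descent is carried out on the unprojected weights w_t; the projection only produces the played weights.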

2.6.3 Second forecaster: by plug-in and projection

Interpretation and/or comments

This second version is less intuitive, but it achieves a better practical performance. In view of its statement, the oracle its performance should be compared to is given by

  min_{q∈X} Σ_{t=1}^n ℓ_t(q_t),

where the sequence of the q_t is defined iteratively by q_1 = P_1(q) and q_{s+1} = P_{s+1}(q_s) for s ≥ 1.

Theoretical bound

One can prove a O(sqrt(n)) bound on the regret

  L̂_n − min_{q∈X} Σ_{t=1}^n ℓ_t(q_t).

Statement and implementation

The forecaster is implemented as follows.

Parameters: η > 0
Initialization: p_{i,1} = 1/|E_1| if i ∈ E_1, p_{i,1} = 0 if i ∉ E_1.
For each round t = 1, 2, ..., n:
  (1) predict with ŷ_t = p_t · f_t;
  (2) compute p_{t+1} = P_{t+1}( p_t − 2η (ŷ_t − y_t) (I_t ⊙ f_t) ).

Performance

[Table: rmse as a function of η ∈ {1e-7, 1e-6, 1e-5, 1e-4}; the best value is obtained at η = 1e-5.]

Graphical evolution of the weights

Figure 2.7: Evolution of the weights of the plug-in version for η = 1e-5.

2.7 Ridge Regression Forecaster

2.7.1 References

This is an adaptation of [CBL06, Section 11.7].

2.7.2 Theoretical bound

We have

  R_n(v) ≤ λ ||v||² + ( Σ_{i=1}^N log( 1 + µ_i/λ ) ) max_{1≤t≤n} ℓ_t(v),

where µ_1, ..., µ_N are the eigenvalues of Σ_{t=1}^n f_t f_t^⊤. See [CBL06, Section 11.7] for the proof.

2.7.3 Interpretation and/or comments

The forecaster below uses the standard version of the ridge regression forecaster in the framework of sleeping experts. In order to ensure that this forecaster is well defined at each round, the predictions of inactive experts are set to zero, as explained in the introduction. Because of that, the predictions of the forecaster are strongly biased. In fact, this forecaster is designed to come close to the performance given by B_{R^N}. It actually achieves this goal, but the latter is not ambitious and quite irrelevant, as the introduction underlines.

2.7.4 Statement and implementation

The parameter λ belongs to (0, +∞). We take u_1 as the uniform distribution over E_1. For t ≥ 2, the vector of weights u_t is defined as

  u_t = argmin_{v∈R^N} { λ ||v||² + Σ_{s=1}^{t−1} ( v·f_s − y_s )² }.

The computation of this least-squares estimate is given by u_t = A_t^{-1} b_t, where, for t ≥ 2,

  A_t = λ I + Σ_{s=1}^{t−1} f_s f_s^⊤   and   b_t = Σ_{s=1}^{t−1} y_s f_s.

Simple manipulations lead to the recursive update

  u_{t+1} = u_t − A_{t+1}^{-1} ( u_t·f_t − y_t ) f_t.

The forecaster is implemented as follows.

Parameters: penalization factor λ
Initialization: A_1 = λ I and u_1 = (1/N, ..., 1/N)
For each round t = 1, ..., n:
  (1) predict with ŷ_t = u_t · f_t;
  (2) observe y_t and update A_{t+1} = A_t + f_t f_t^⊤;
  (3) compute u_{t+1} = u_t − A_{t+1}^{-1} ( u_t·f_t − y_t ) f_t.

2.7.5 Performance

[Table: rmse as a function of λ ∈ {1e-3, 1, 1e+3, 1e+6}; the best value is obtained at λ = 4e+5.]
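A NumPy sketch of this forecaster (our own naming; we solve the linear system directly rather than maintaining an explicit inverse, which is numerically safer and equivalent for a sketch):

```python
import numpy as np

def ridge_forecaster(forecasts, active, y, lam):
    """Ridge regression on sleeping experts: the forecasts of inactive
    experts are set to 0, then plain penalized least squares is run."""
    n, N = forecasts.shape
    A = lam * np.eye(N)                   # A_t = lam*I + sum f_s f_s^T
    b = np.zeros(N)                       # b_t = sum y_s f_s
    u = np.full(N, 1.0 / N)
    preds = np.empty(n)
    for t in range(n):
        f = np.where(active[t], forecasts[t], 0.0)
        preds[t] = u @ f
        A += np.outer(f, f)
        b += y[t] * f
        u = np.linalg.solve(A, b)         # u_{t+1} = A_{t+1}^{-1} b_{t+1}
    return preds
```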

2.7.6 Graphical evolution of the weights

Figure 2.8: Evolution of the weights for λ = 4e+5

2.8 Ridge Regression Partition Forecaster

2.8.1 References

This is an adaptation of [CBL06, Section 11.7].

2.8.2 Theoretical bound

We proceed as in Section 2.4 and derive the global bound as a sum over the K different sub-regimes U_k, for k = 1, ..., K:

  R_n(v_1^K) ≤ Σ_{k=1}^K ( λ ||v_k||² + ( Σ_{i=1}^N log( 1 + µ_i^k/λ ) ) max_{t : E_t=U_k} ℓ_t(v_k) )

for all v_1^K = (v_1, ..., v_K) ∈ (R^N)^K, where, for each k = 1, ..., K, the µ_1^k, ..., µ_N^k are the eigenvalues of Σ_{t : E_t=U_k} f_t f_t^⊤.

2.8.3 Interpretation and/or comments

This forecaster is another adaptation of the ridge regression forecaster to the setting of sleeping experts. It simply runs K simultaneous instances of ridge regression, which avoids any bias issue (see the issues encountered with the adaptation of Section 2.7). Nevertheless, this results in poor performance on this data set, because K is large and the first steps of each instance of the ridge regression forecaster suffer large losses. We tried to improve these results by cleaning the data set, e.g., by removing the data corresponding to rounds t with |E_t| = 1 (we remove 40 such t). The results are then slightly better but still very poor.

2.8.4 Statement and implementation

The parameter λ belongs to (0, +∞). We introduce the sets

  R^N_{E_t} = { v ∈ R^N : v_i = 0 for all i ∉ E_t }   and   𝒰_t = { s ∈ {1, ..., t−1} : E_s = U_{k_t} }.

For t ≥ 1, if 𝒰_t = ∅, the vector of weights u_t is defined as the uniform distribution over E_t; otherwise, u_t is defined as

  u_t = argmin_{v∈R^N_{E_t}} { λ ||v||² + Σ_{s∈𝒰_t} ( v·f_s − y_s )² }.

Note that it is equivalent to consider that we perform K simultaneous ridge regressions (one on each U_k, for k = 1, ..., K). We thus have at step t that, if 𝒰_t ≠ ∅, then u_t = A_t^{-1} b_t, with

  A_t = λ I + Σ_{s∈𝒰_t} f_s f_s^⊤   and   b_t = Σ_{s∈𝒰_t} y_s f_s.

This forecaster is implemented as follows.

Parameters: penalization factor λ

For each round t = 1, ..., n,
(1) if U_t = ∅, define u_t by u_{i,t} = 1/|E_t| if i ∈ E_t and u_{i,t} = 0 if i ∉ E_t;
(2) otherwise, when U_t ≠ ∅, compute A_t = λI + Σ_{s∈U_t} f_s f_s^⊤ and take (u_{i,t})_{i∈E_t} given by the corresponding coordinates of A_t^{-1} Σ_{s∈U_t} y_s f_s, with u_{i,t} = 0 for i ∉ E_t;
(3) predict with ŷ_t = u_t · f_t;
(4) observe y_t.

Performance

The performance is shown for the initial (not cleaned) data set. It is slightly better on the cleaned data but remains poor (the best rmse equaling 137). The rmse was evaluated for λ ranging from 1e-3 to 1e+9; the best performance is achieved for λ* = 1.1e+5.
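The K simultaneous ridge regressions can be sketched as follows. This is a hypothetical sketch (class and method names are invented) in which each sub-regime U_k is identified with a distinct active set, as the partition of Section 2.4 suggests.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A small, dense)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

class PartitionRidge:
    """One ridge instance per sub-regime; a regime is identified here with an
    active set, so all rounds sharing an active set share one regression."""
    def __init__(self, lam, N):
        self.lam, self.N = lam, N
        self.stats = {}          # frozenset(E) -> (sum of f f^T, sum of y f)

    def weights(self, E):
        key = frozenset(E)
        if key not in self.stats:                     # U_t empty: uniform on E_t
            return [1.0 / len(E) if i in E else 0.0 for i in range(self.N)]
        G, b = self.stats[key]
        A = [[G[i][j] + (self.lam if i == j else 0.0) for j in range(self.N)]
             for i in range(self.N)]
        u = solve(A, b)                               # u_t = A_t^{-1} b_t
        return [u[i] if i in E else 0.0 for i in range(self.N)]

    def update(self, E, f, y):
        key = frozenset(E)
        if key not in self.stats:
            self.stats[key] = ([[0.0] * self.N for _ in range(self.N)],
                               [0.0] * self.N)
        G, b = self.stats[key]
        for i in range(self.N):
            b[i] += y * f[i]
            for j in range(self.N):
                G[i][j] += f[i] * f[j]
```

A regime that has never been observed falls back to the uniform distribution over its active set, which is exactly the costly "first steps" effect discussed above when K is large.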

2.8.6 Graphical evolution of the weights

Figure 2.9: Evolution of the weights for λ = 1.1e+5.

2.9 Normalized Ridge Forecaster

References

This is a new forecaster!

Theoretical bound

We have no theoretical bound yet.

Interpretation and/or comments

The forecaster results from an adaptation of the forecaster of Section 2.7 (the least-squares formulation) in the spirit of the one of Section 2.5 (the renormalization factors in front of the squares). Its aim is to get close to the oracle 28.8 that is the linear counterpart, with weights in R^N, of B^(b)_renorm.

Statement and implementation

The parameter λ belongs to ]0, +∞[. We take u_1 as the uniform distribution over E_1. For t ≥ 2, we compute u_t ∈ R^N as indicated below and then aggregate the forecasts of the experts with u_t^{E_t}, where for all v ∈ R^N, we define, similarly to the notations used in Section 1.1,

v(E_t) = Σ_{i=1}^N I_{i∈E_t} v_i    and    v^{E_t}_i = v_i / v(E_t) if i ∈ E_t,    0 if i ∉ E_t,

and u_t is defined as

u_t = argmin_{v∈R^N} { λ‖v‖² + Σ_{s=1}^{t−1} v(E_s) ( v^{E_s} · f_s − y_s )² }.

Note that we do not provide any efficient recursive computation anymore.

The forecaster is implemented as follows.

Parameters: penalization factor λ
Initialization: u_1 = (1/N, ..., 1/N)

For each round t = 1, ..., n,
(1) predict with ŷ_t = u_t^{E_t} · f_t;
(2) observe y_t and compute u_{t+1} as

u_{t+1} = argmin_{v∈R^N} { λ‖v‖² + Σ_{s=1}^t v(E_s) ( v^{E_s} · f_s − y_s )² }.

Performance

The performance of this forecaster is better than that of ridge regression (Section 2.7) but remains poor (worse than that of the best expert). This might be due to a lack of precision in the implementation: since no closed-form expression is available for the u_t, numerical minimizations (implemented in the software R) are performed at each step. The rmse was evaluated for λ ranging from 1e+4 to 1e+7; the best performance is achieved for λ* = 1e+6.

Graphical evolution of the weights

Figure 2.10: Evolution of the weights for λ = 1e+6.

2.10 Fixed-Share Forecaster

References

This is an adaptation of a forecaster described, e.g., in [CBL06, Section 5.2].

Theoretical bound

We take a larger comparison class, whose elements are indexed by sequences (i_1, ..., i_n) of elements of {1, ..., N} and are called compound experts. They predict at round t as expert i_t does. In the context of sleeping experts we thus only consider sequences such that, for all t, one has i_t ∈ E_t; a sequence satisfying this condition will be referred to as an admissible sequence. The tracking regret for such a sequence is defined as

R_n(i_1, ..., i_n) = Σ_{t=1}^n ( l̂_t − l_{i_t,t} ).

The size

size(i_1, ..., i_n) = Σ_{t=2}^n I_{i_{t−1} ≠ i_t}

of a sequence counts how many switches occur in the sequence. The following theoretical bound holds for the tracking regret of the forecaster presented in this section:

R_n(i_1, ..., i_n) ≤ ((m+1)/η) log N + (1/η) log( 1 / ( (α/N)^m (1−α)^{n−m−1} ) ) + (η/8) n B⁴

for all admissible sequences (i_1, ..., i_n) of size smaller than m. Thus, taking α = m/(n−1) and

η = (1/B²) √( (8/n) ( (m+1) log N + (n−1) H( m/(n−1) ) ) ),

we get

R_n(i_1, ..., i_n) ≤ B² √( (n/2) ( (m+1) log N + (n−1) H( m/(n−1) ) ) ),

where H denotes the binary entropy function: H(x) = −x log x − (1−x) log(1−x) for x ∈ [0, 1].

Interpretation and/or comments

At each step, this forecaster performs two updates. The first one is similar to the one performed by the exponentially weighted average forecaster of Section 2.2. The second one mixes the previous weights to ensure that all base forecasters keep a sufficient weight, which allows them to recover quickly higher weights in case they start outputting good predictions. Note that for α = 0, we do not recover the forecaster of Section 2.2. This is because we deal in a different way with the fact that some experts may be inactive.

Statement and implementation

The parameter η belongs to ]0, +∞[, while α belongs to [0, 1]. For t ≥ 1, the convex weights p_t are defined in two steps: the loss update (the same as in Section 2.2) and the share update, which allows our forecaster to be more reactive to breakpoints (rounds when the index of the best expert changes). The forecaster is implemented as follows.

Parameters: η > 0 and 0 ≤ α ≤ 1
Initialization: w_0 = (1/N, ..., 1/N)

For each round t = 1, 2, ..., n,
(1) predict with the weights p_t = w_{t−1} / Σ_{j=1}^N w_{j,t−1};
(2) Loss update: observe y_t and update, for each i = 1, ..., N,

v_{i,t} = w_{i,t−1} e^{η ( l̂_t − l_{i,t} )} if i ∈ E_t,    undefined if i ∉ E_t;

(3) Share update: let w_{i,t} = 0 if i ∉ E_{t+1}, and

w_{i,t} = (1/|E_{t+1}|) Σ_{j∈E_t \ E_{t+1}} v_{j,t} + (α/|E_{t+1}|) Σ_{j∈E_t ∩ E_{t+1}} v_{j,t} + (1−α) I_{i∈E_t ∩ E_{t+1}} v_{i,t}

if i ∈ E_{t+1} (with the convention that an empty sum is null).

Note: in (2), it is equivalent to include the l̂_t term or to omit it.

Proof of the theoretical bound

The proof is a straightforward adaptation of the proof in the non-sleeping case proposed in [CBL06, Section 5.2]. Indeed, the version proposed here corresponds to a (fake) prior weight assignment to the set of admissible compound experts given by a certain Markovian process. We give here the relation that defines the transition probability:

w'_0(i_1, ..., i_{t+1}) = w'_0(i_1, ..., i_t) I_{i_{t+1}∈E_{t+1}} ( (1−α) I_{i_{t+1}=i_t} + α/|E_{t+1}| + I_{i_t∉E_{t+1}} (1−α)/|E_{t+1}| ).

That is, the transition function from i_t to i_{t+1} is given, at round t+1, by

Tr_{t+1}(i_t → i_{t+1}) = I_{i_{t+1}∈E_{t+1}} ( (1−α) I_{i_{t+1}=i_t} + α/|E_{t+1}| + I_{i_t∉E_{t+1}} (1−α)/|E_{t+1}| ).

Rewriting the proof of [CBL06, Theorem 5.1] with these prior weights shows how to compute the w_{i,t} efficiently from the vector v_t. In particular, the last equality of the

proof yields

w_{i,t} = Σ_{j∈E_t} v_{j,t} Tr_{t+1}(j → i) = Σ_{j∈E_t} v_{j,t} I_{i∈E_{t+1}} ( (1−α) I_{i=j} + α/|E_{t+1}| + I_{j∉E_{t+1}} (1−α)/|E_{t+1}| ),

from which the expression stated in our implementation follows. As the prior weights w'_0 are larger than the weights w_0 used by the original version of [CBL06, Section 5.2], which puts weights on all compound experts, the proposed bound follows.

Performance

This forecaster has a good performance (better than that of the exponentiated gradient forecaster of Section 2.3). This might be because this forecaster uses two parameters; however, in practice, tuning two parameters online is often more delicate than calibrating a single one. We will propose several solutions to this issue in Section 3.4. The rmse was evaluated for α ∈ {0.01, 0.05, 0.1, 0.2} and η on a grid of powers of ten; the best performance (rmse of 27.0) is achieved for (η*, α*) = (2e-3, 0.2).
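The loss and share updates of the implementation above can be sketched in pure Python. This is a minimal sketch assuming the square loss, with the common factor e^{η l̂_t} omitted from the loss update (which is equivalent up to normalization, as noted above).

```python
import math

def fixed_share_round(w, f, y, E_t, E_next, eta, alpha):
    """One round of the sleeping fixed-share forecaster (square loss assumed).
    w: pre-weights w_{t-1} (they already vanish outside E_t after round 1);
    f: expert predictions, with f[i] = 0 for sleeping experts."""
    N = len(w)
    total = sum(w)
    p = [wi / total for wi in w]                      # (1) convex weights p_t
    y_hat = sum(pi * fi for pi, fi in zip(p, f))      # aggregated prediction
    # (2) loss update on the active experts (the common factor e^{eta lhat_t}
    # is omitted: it cancels in the normalization)
    v = {i: w[i] * math.exp(-eta * (f[i] - y) ** 2) for i in E_t}
    # (3) share update toward the next active set E_{t+1}
    dead = sum(v[j] for j in E_t - E_next)            # experts falling asleep
    alive = sum(v[j] for j in E_t & E_next)           # experts staying active
    m = len(E_next)
    w_new = [0.0] * N
    for i in E_next:
        w_new[i] = dead / m + alpha * alive / m + (1 - alpha) * v.get(i, 0.0)
    return y_hat, w_new
```

The total mass Σ_j v_{j,t} is preserved by the share update, so the normalization in step (1) of the next round remains well defined.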

Graphical evolution of the weights

Figure 2.11: Evolution of the weights for (η*, α*) = (2e-3, 0.2).

2.11 Exponentiated-Gradient Fixed-Share Forecaster

References

It follows from the forecaster of Section 2.10 by considering the gradients of the losses, see [CBL06, Section 2.5].

Theoretical bound

Recall that ∇l̂_t = 2(ŷ_t − y_t) f_t. We denote by C a constant such that −C ≤ (∇l̂_t)_i ≤ C for all i = 1, ..., N and t = 1, ..., n. We can take C = 2B², so that the range is 4B². The bound of Section 2.10 holds in particular for the forecaster described here, up to the replacement of the range B² by the new range 4B².

Interpretation and/or comments

The forecaster below is the exponentiated gradient version of the forecaster of Section 2.10, in which the cumulative regret of interest is first upper-bounded by some gradient-based sum.

Statement and implementation

The parameter η belongs to ]0, +∞[, while α belongs to [0, 1]. The forecaster is implemented as follows.

Parameters: η > 0 and 0 ≤ α ≤ 1
Initialization: w_0 = (1/N, ..., 1/N)

For each round t = 1, 2, ..., n,
(1) predict with the weights p_t = w_{t−1} / Σ_{j=1}^N w_{j,t−1};
(2) Loss update: observe y_t and update

v_{i,t} = w_{i,t−1} e^{−2η (ŷ_t − y_t) f_{i,t}} if i ∈ E_t,    undefined if i ∉ E_t;

(3) Share update: let w_{i,t} = 0 if i ∉ E_{t+1}, and

w_{i,t} = (1/|E_{t+1}|) Σ_{j∈E_t \ E_{t+1}} v_{j,t} + (α/|E_{t+1}|) Σ_{j∈E_t ∩ E_{t+1}} v_{j,t} + (1−α) I_{i∈E_t ∩ E_{t+1}} v_{i,t}

if i ∈ E_{t+1} (with the convention that an empty sum is null).
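Only the loss update differs from Section 2.10; it can be sketched as follows under the square loss (the function name is illustrative).

```python
import math

def eg_loss_update(w, f, y_hat, y, eta, E_t):
    """Loss update of the exponentiated-gradient version: expert i is charged
    the i-th coordinate, 2 (y_hat - y) f_i, of the gradient of the square
    loss of the aggregated prediction y_hat."""
    return {i: w[i] * math.exp(-2.0 * eta * (y_hat - y) * f[i]) for i in E_t}
```

The share update of step (3) is then applied to v exactly as in Section 2.10.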

Performance

This forecaster has a relatively good performance (better than that of the exponentiated gradient forecaster of Section 2.3), but surprisingly enough it does not beat the exponentially weighted average version of the fixed-share forecaster presented in Section 2.10. The rmse was evaluated for α ∈ {0.01, 0.05, 0.1, 0.2} and η on a grid of powers of ten; the best performance is achieved for the pair (η*, α*) = (2e-4, 0.07).

Graphical evolution of the weights

Figure 2.12: Evolution of the weights for (η*, α*) = (2e-4, 0.07).

2.12 Another sleeping adaptation of the fixed-share forecaster

References

This is another adaptation of the original fixed-share forecaster described, e.g., in [CBL06, Section 5.2]. Like the forecaster of Section 2.10, this forecaster has a counterpart in terms of exponentiated gradient.

Interpretation and/or comments

We do not provide any theoretical result for these forecasters yet. They achieve a good performance, especially in the exponentiated gradient case. Moreover, if we take α = 0 we recover exactly the forecaster of Section 2.2 (respectively, Section 2.3 in the exponentiated gradient case), which is not the case for the forecasters of Sections 2.10 and 2.11. However, the weights obtained in this case are not exactly the same because of computational issues: the loss update for the fixed-share forecasters is performed multiplicatively, whereas it is performed additively in the case of the forecasters of Sections 2.10 and 2.11.

Statement and implementation

We present here only the implementation of the variant of the exponentially weighted average fixed-share forecaster. The exponentiated gradient version can be obtained by replacing the losses by the gradients of the losses. The parameter η belongs to ]0, +∞[, while α belongs to [0, 1] (when α = 0, we recover exactly the forecaster of Section 2.2). The forecaster is implemented as follows.

Parameters: η > 0 and 0 ≤ α ≤ 1
Initialization: w_0 = (1/N, ..., 1/N)

For each round t = 1, 2, ..., n,
(1) predict with the weights p_{i,t} = I_{i∈E_t} w_{i,t−1} / Σ_{j=1}^N I_{j∈E_t} w_{j,t−1} for each i = 1, ..., N;
(2) Loss update: observe y_t and update, for each i = 1, ..., N,

v_{i,t} = w_{i,t−1} e^{η ( l̂_t − l_{i,t} )} if i ∈ E_t,    v_{i,t−1} if i ∉ E_t;

(3) Share update: let

w_{i,t} = α W_t / |E_{t+1}| + (1−α) v_{i,t} if i ∈ E_{t+1},    0 if i ∉ E_{t+1},

where W_t = Σ_{i=1}^N I_{i∈E_{t+1}} v_{i,t}.
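One round of this variant can be sketched in pure Python (an illustrative sketch, square loss assumed). Note that the v-values of sleeping experts are frozen between rounds, so a separate v state has to be carried along, and the compensation by l̂_t in the loss update matters here, since the frozen v-values are not rescaled.

```python
import math

def alt_fixed_share_round(w, v_prev, f, y, E_t, E_next, eta, alpha):
    """One round of this variant (square loss assumed). Sleeping experts keep
    their last v-value frozen; the share update spreads a fraction alpha of
    the active mass W_t uniformly over the next active set E_{t+1}."""
    N = len(w)
    total = sum(w[i] for i in E_t)
    p = [w[i] / total if i in E_t else 0.0 for i in range(N)]   # (1) weights p_t
    y_hat = sum(p[i] * f[i] for i in E_t)                       # prediction
    # (2) multiplicative loss update; the l_hat compensation matters because
    # the frozen v-values of the sleeping experts are not rescaled
    l_hat = (y_hat - y) ** 2
    v = [w[i] * math.exp(eta * (l_hat - (f[i] - y) ** 2)) if i in E_t
         else v_prev[i] for i in range(N)]
    # (3) share update over E_{t+1}
    W = sum(v[i] for i in E_next)
    w_new = [alpha * W / len(E_next) + (1 - alpha) * v[i] if i in E_next else 0.0
             for i in range(N)]
    return y_hat, w_new, v
```

A newly waking expert i ∈ E_{t+1} \ E_t re-enters with (1−α) times its frozen v-value plus the shared mass, which is how it can recover a high weight quickly.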

Performance

The two forecasters have a good performance. That of the exponentially weighted average version is similar to the one of the forecaster of Section 2.10; its best performance is achieved for (η*, α*) = (1e-3, 0.2). The exponentiated gradient version has a performance better than the one of the forecaster of Section 2.11; its best performance (rmse of 26.5) is achieved for (η*, α*) = (2e-4, 0.05). In both cases the rmse was evaluated for α ∈ {0.01, 0.05, 0.1, 0.2} and η on a grid of powers of ten.

Graphical evolution of the weights

Figure 2.13: Evolution of the weights of the two variants of the fixed-share forecasters considered here: the exponentially weighted average version (left) and the exponentiated gradient version (right).

Chapter 3

Tricks

3.1 On-line calibration of one parameter

References

This adaptive calibration method was introduced in [GMS08].

Theoretical Bound

No theoretical bound yet. The aim is to achieve a performance nearly as good as that of the considered prediction method tuned with the best parameter in hindsight.

Interpretation and/or comments

This method provides a generic adaptive implementation of all forecasters that depend only on a (possibly vector-valued) tuning parameter. It automatically calibrates this parameter by choosing, at each round, the parameter value that minimizes the past cumulative loss of the considered forecaster. Several optimization methods can be used. We choose a grid-based optimization procedure, but continuous methods, though necessarily inaccurate and more difficult to control, could be more effective.

Statement and implementation

Consider a forecaster depending on a parameter λ and whose weight vector at step t is denoted by v_t = v_t^{(λ)}. The parameter λ ∈ Λ is called a tuning parameter of the forecaster and our aim is to choose it in an automatic way. The parameter space is, for instance, Λ = ]0, +∞[ for the ridge regression, exponentially weighted average, and exponentiated gradient forecasters. In addition we assume that v_1^{(λ)} = v_1 does not depend on λ, which is the case for all methods studied so far. The proposed calibration method tunes the forecaster automatically by minimizing the empirical loss; it is thus called the ELM calibrated forecaster hereafter. It chooses v_1 and, for t ≥ 2, v_t = v_t^{(λ̂_t)}, where

λ̂_t ∈ argmin_{λ∈Λ̃} Σ_{s=1}^{t−1} ( v_s^{(λ)} · f_s − y_s )².

We chose a finite logarithmically-scaled parameter grid Λ̃ ⊂ Λ to approximate the minimization above.

This is implemented as follows.

Parameters: a grid Λ̃
Initialization: the common initial weight vector v_1

For each round t = 1, 2, ..., n,
(1) predict with ŷ_t = v_t · f_t;
(2) observe y_t and compute λ̂_{t+1} ∈ argmin_{λ∈Λ̃} Σ_{s=1}^t ( v_s^{(λ)} · f_s − y_s )²;
(3) compute v_{t+1} as v_{t+1} = v_{t+1}^{(λ̂_{t+1})}.

Performance

We have tested the calibration via empirical loss minimization (ELM) on three simple prediction methods: the exponentiated gradient (EG, see Section 2.3) and exponentially weighted average (EWA, see Section 2.2) forecasters, as well as the plug-in version of Zinkevich's forecaster (see Section 2.6). We chose in all three cases a uniform logarithmic grid: over [1e-6, 1e-2] for the exponentiated gradient forecaster; over [1e-8, 1e-4] for the exponentially weighted average forecaster; over [1e-7, 1e-3] for the plug-in version of Zinkevich's forecaster. The rmse of the resulting ELM calibrated forecasters was computed for different numbers of grid points |Λ̃| = 5, 9, 21, and 41 (for the plug-in version of Zinkevich's forecaster, only the case |Λ̃| = 5 was computed), and compared with the best and, respectively, the worst performance obtained with a constant value of the parameter taken on the grid.
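The grid-based ELM calibration can be sketched as follows. For brevity this illustrative sketch wraps a plain (non-sleeping) exponentially weighted average forecaster, and all names are invented for the example.

```python
import math

def ewa_weights(eta, past_losses, N):
    """Exponentially weighted average: w_i proportional to exp(-eta * cumulative loss)."""
    w = [math.exp(-eta * sum(l[i] for l in past_losses)) for i in range(N)]
    s = sum(w)
    return [wi / s for wi in w]

def elm_forecast(F, Y, grid):
    """At each round, predict with the grid value whose EWA forecaster has the
    smallest past cumulative squared loss (empirical loss minimization)."""
    N = len(F[0])
    cum = {eta: 0.0 for eta in grid}     # past cumulative loss of each grid point
    past_losses = []                     # per-round vectors of expert square losses
    preds = []
    for f, y in zip(F, Y):
        best = min(grid, key=lambda e: cum[e])       # the ELM choice of eta
        p = ewa_weights(best, past_losses, N)
        preds.append(sum(pi * fi for pi, fi in zip(p, f)))
        for eta in grid:                 # update every grid point's cumulative loss
            q = ewa_weights(eta, past_losses, N)
            y_hat = sum(qi * fi for qi, fi in zip(q, f))
            cum[eta] += (y_hat - y) ** 2
        past_losses.append([(fi - y) ** 2 for fi in f])
    return preds
```

On a toy run where the first expert is perfect and the second always predicts 0, the wrapper quickly abandons the too-small learning rate and its predictions converge to the observations.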


Sequential prediction with coded side information under logarithmic loss under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science

More information

EECS 545 Project Report: Query Learning for Multiple Object Identification

EECS 545 Project Report: Query Learning for Multiple Object Identification EECS 545 Project Report: Query Learning for Multiple Object Identification Dan Lingenfelter, Tzu-Yu Liu, Antonis Matakos and Zhaoshi Meng 1 Introduction In a typical query learning setting we have a set

More information

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi Gradient Descent This lecture introduces Gradient Descent a meta-algorithm for unconstrained minimization. Under convexity of the

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Segment Description of Turbulence

Segment Description of Turbulence Dynamics of PDE, Vol.4, No.3, 283-291, 2007 Segment Description of Turbulence Y. Charles Li Communicated by Y. Charles Li, received August 25, 2007. Abstract. We propose a segment description for turbulent

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Online Optimization in X -Armed Bandits

Online Optimization in X -Armed Bandits Online Optimization in X -Armed Bandits Sébastien Bubeck INRIA Lille, SequeL project, France sebastien.bubeck@inria.fr Rémi Munos INRIA Lille, SequeL project, France remi.munos@inria.fr Gilles Stoltz Ecole

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

The Sherrington-Kirkpatrick model

The Sherrington-Kirkpatrick model Stat 36 Stochastic Processes on Graphs The Sherrington-Kirkpatrick model Andrea Montanari Lecture - 4/-4/00 The Sherrington-Kirkpatrick (SK) model was introduced by David Sherrington and Scott Kirkpatrick

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information

Online Learning and Online Convex Optimization

Online Learning and Online Convex Optimization Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game

More information

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria 12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos

More information

Convex Optimization / Homework 1, due September 19

Convex Optimization / Homework 1, due September 19 Convex Optimization 1-725/36-725 Homework 1, due September 19 Instructions: You must complete Problems 1 3 and either Problem 4 or Problem 5 (your choice between the two). When you submit the homework,

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Online Kernel PCA with Entropic Matrix Updates

Online Kernel PCA with Entropic Matrix Updates Dima Kuzmin Manfred K. Warmuth Computer Science Department, University of California - Santa Cruz dima@cse.ucsc.edu manfred@cse.ucsc.edu Abstract A number of updates for density matrices have been developed

More information

Robust online aggregation of ensemble forecasts

Robust online aggregation of ensemble forecasts Robust online aggregation of ensemble forecasts with applications to the forecasting of electricity consumption and of exchange rates Gilles Stoltz CNRS HEC Paris The framework of this talk Sequential

More information

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 14: Random Walks, Local Graph Clustering, Linear Programming Lecturer: Shayan Oveis Gharan 3/01/17 Scribe: Laura Vonessen Disclaimer: These

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

More information

NATCOR. Forecast Evaluation. Forecasting with ARIMA models. Nikolaos Kourentzes

NATCOR. Forecast Evaluation. Forecasting with ARIMA models. Nikolaos Kourentzes NATCOR Forecast Evaluation Forecasting with ARIMA models Nikolaos Kourentzes n.kourentzes@lancaster.ac.uk O u t l i n e 1. Bias measures 2. Accuracy measures 3. Evaluation schemes 4. Prediction intervals

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

SBS Chapter 2: Limits & continuity

SBS Chapter 2: Limits & continuity SBS Chapter 2: Limits & continuity (SBS 2.1) Limit of a function Consider a free falling body with no air resistance. Falls approximately s(t) = 16t 2 feet in t seconds. We already know how to nd the average

More information

Internal Regret in On-line Portfolio Selection

Internal Regret in On-line Portfolio Selection Internal Regret in On-line Portfolio Selection Gilles Stoltz (gilles.stoltz@ens.fr) Département de Mathématiques et Applications, Ecole Normale Supérieure, 75005 Paris, France Gábor Lugosi (lugosi@upf.es)

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Sequential Investment, Universal Portfolio Algos and Log-loss

Sequential Investment, Universal Portfolio Algos and Log-loss 1/37 Sequential Investment, Universal Portfolio Algos and Log-loss Chaitanya Ryali, ECE UCSD March 3, 2014 Table of contents 2/37 1 2 3 4 Definitions and Notations 3/37 A market vector x = {x 1,x 2,...,x

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Weighted Majority and the Online Learning Approach

Weighted Majority and the Online Learning Approach Statistical Techniques in Robotics (16-81, F12) Lecture#9 (Wednesday September 26) Weighted Majority and the Online Learning Approach Lecturer: Drew Bagnell Scribe:Narek Melik-Barkhudarov 1 Figure 1: Drew

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

Intelligent Systems Discriminative Learning, Neural Networks

Intelligent Systems Discriminative Learning, Neural Networks Intelligent Systems Discriminative Learning, Neural Networks Carsten Rother, Dmitrij Schlesinger WS2014/2015, Outline 1. Discriminative learning 2. Neurons and linear classifiers: 1) Perceptron-Algorithm

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

The Free Matrix Lunch

The Free Matrix Lunch The Free Matrix Lunch Wouter M. Koolen Wojciech Kot lowski Manfred K. Warmuth Tuesday 24 th April, 2012 Koolen, Kot lowski, Warmuth (RHUL) The Free Matrix Lunch Tuesday 24 th April, 2012 1 / 26 Introduction

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Review of Multi-Calculus (Study Guide for Spivak s CHAPTER ONE TO THREE)

Review of Multi-Calculus (Study Guide for Spivak s CHAPTER ONE TO THREE) Review of Multi-Calculus (Study Guide for Spivak s CHPTER ONE TO THREE) This material is for June 9 to 16 (Monday to Monday) Chapter I: Functions on R n Dot product and norm for vectors in R n : Let X

More information

4. Multilayer Perceptrons

4. Multilayer Perceptrons 4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Lecture 19: Follow The Regulerized Leader

Lecture 19: Follow The Regulerized Leader COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey

More information

Statistical Ranking Problem

Statistical Ranking Problem Statistical Ranking Problem Tong Zhang Statistics Department, Rutgers University Ranking Problems Rank a set of items and display to users in corresponding order. Two issues: performance on top and dealing

More information

Bisection Ideas in End-Point Conditioned Markov Process Simulation

Bisection Ideas in End-Point Conditioned Markov Process Simulation Bisection Ideas in End-Point Conditioned Markov Process Simulation Søren Asmussen and Asger Hobolth Department of Mathematical Sciences, Aarhus University Ny Munkegade, 8000 Aarhus C, Denmark {asmus,asger}@imf.au.dk

More information

Piecewise Constant Prediction

Piecewise Constant Prediction Piecewise Constant Prediction Erik Ordentlich Information heory Research Hewlett-Packard Laboratories Palo Alto, CA 94304 Email: erik.ordentlich@hp.com Marcelo J. Weinberger Information heory Research

More information

Learning correlated equilibria in games with compact sets of strategies

Learning correlated equilibria in games with compact sets of strategies Learning correlated equilibria in games with compact sets of strategies Gilles Stoltz a,, Gábor Lugosi b a cnrs and Département de Mathématiques et Applications, Ecole Normale Supérieure, 45 rue d Ulm,

More information