Construction of an Informative Hierarchical Prior Distribution: Application to Electricity Load Forecasting


Anne Philippe
Laboratoire de Mathématiques Jean Leray, Université de Nantes
Workshop EDF-INRIA, 5 April 2012, IHP, Paris
Work in collaboration with Tristan Launay (LMJL / EDF OSIRIS, R39) and Sophie Lamarche (EDF OSIRIS, R39)
anne.philippe@univ-nantes.fr

Plan

1 Introduction
2 Bayesian approach and asymptotic results
3 Construction of the prior
4 Numerical results
5 Bibliography

1 Introduction

The electricity load signal

[Figure: three panels of relative load: over the years 2002-2007; over two sample weeks in June and December 2005 (Monday through Sunday); and against temperature.]

Comments:
- the electricity demand exhibits multiple strong seasonalities: yearly, weekly and daily;
- daylight saving time, holidays and bank holidays induce several breaks;
- the link with the temperature is non-linear.

The model

The electricity load $X_t$ is written as
$$X_t = s_t + w_t + \epsilon_t$$
- $s_t$: seasonal part
- $w_t$: non-linear weather-dependent part (heating)
- $(\epsilon_t)_{t \in \mathbb{N}}$ i.i.d. from $\mathcal{N}(0, \sigma^2)$

Heating part:
$$w_t = \begin{cases} g(T_t - u) & \text{if } T_t < u \\ 0 & \text{if } T_t \geq u \end{cases}$$
where $g$ is the heating gradient and $u$ is the heating threshold.
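As an illustration of the two regimes, here is a minimal NumPy sketch of the heating part; the function name `heating_part` and its signature are assumptions for illustration, not from the slides:

```python
import numpy as np

def heating_part(temps, g, u):
    """Heating part: w_t = g * (T_t - u) when T_t < u, and 0 otherwise.

    temps : array of temperatures T_t
    g     : heating gradient
    u     : heating threshold
    """
    temps = np.asarray(temps, dtype=float)
    return np.where(temps < u, g * (temps - u), 0.0)
```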

Model (cont.)

Seasonal part:
$$s_t = \sum_{j=1}^{p} \left\{ a_j \cos\left(\frac{2\pi j}{365.25}\, t\right) + b_j \sin\left(\frac{2\pi j}{365.25}\, t\right) \right\} + \sum_{j=1}^{q} \omega_j 1_{\Omega_j}(t) + \sum_{j=1}^{r} \kappa_j 1_{K_j}(t)$$

- The truncated Fourier series captures the average seasonal behaviour.
- Offsets (parameters $\omega_j \in \mathbb{R}$) represent the average levels over predetermined periods given by a partition $(\Omega_j)_{j \in \{1,\dots,q\}}$ of the calendar.
- Shapes (parameters $\kappa_j$) provide day-to-day adjustments; they depend on the so-called day types, given by a second partition $(K_j)_{j \in \{1,\dots,r\}}$ of the calendar.
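In the same spirit, a hedged sketch of the seasonal part, with the two calendar partitions encoded as integer label arrays; all names and this encoding are illustrative assumptions:

```python
import numpy as np

def seasonal_part(t, a, b, omega, offset_labels, kappa, daytype_labels):
    """Truncated Fourier series plus offsets and day-type shapes.

    t              : array of day indices
    a, b           : Fourier coefficients a_1..a_p and b_1..b_p
    omega          : offset levels omega_1..omega_q
    offset_labels  : for each t, the (0-based) index j of its period Omega_j
    kappa          : shape adjustments kappa_1..kappa_r
    daytype_labels : for each t, the (0-based) index j of its day type K_j
    """
    t = np.asarray(t, dtype=float)
    s = np.zeros_like(t)
    for j, (aj, bj) in enumerate(zip(a, b), start=1):
        w = 2.0 * np.pi * j * t / 365.25
        s += aj * np.cos(w) + bj * np.sin(w)
    s += np.asarray(omega)[offset_labels]    # average level over the current period
    s += np.asarray(kappa)[daytype_labels]   # day-type adjustment
    return s
```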

2 Bayesian approach and asymptotic results

Bayesian principle

Let $X_1, \dots, X_N$ be the observations. The likelihood of the model, conditionally on the temperatures, is
$$l(X_1, \dots, X_N \mid \theta) = \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[X_t - (s_t + w_t)\right]^2 \right\}$$

Let $\pi(\theta)$ be the prior distribution on $\theta$. Bayesian inference is based on the posterior distribution of $\theta$:
$$\pi(\theta \mid X_1, \dots, X_N) \propto l(X_1, \dots, X_N \mid \theta)\, \pi(\theta)$$
The Bayes estimate (under quadratic loss) is $\hat\theta = E[\theta \mid X_1, \dots, X_N]$.

Predictive distribution:
$$p(X_{N+k} \mid X_1, \dots, X_N) = \int_\Theta l(X_{N+k} \mid \theta)\, \pi(\theta \mid X_1, \dots, X_N)\, d\theta$$
The optimal prediction (under quadratic loss) is $\hat X_{N+k} = E[X_{N+k} \mid X_1, \dots, X_N]$.
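To make these quantities concrete, a minimal sketch of the Gaussian log-likelihood above, together with the usual Monte Carlo reading of the Bayes estimate; the names are illustrative, and `draws` is assumed to come from a posterior sampler:

```python
import numpy as np

def log_likelihood(x, model_mean, sigma2):
    """Log-likelihood of observations x under x_t = (s_t + w_t) + N(0, sigma2) noise."""
    x, model_mean = np.asarray(x), np.asarray(model_mean)
    resid = x - model_mean
    return -0.5 * x.size * np.log(2.0 * np.pi * sigma2) - resid @ resid / (2.0 * sigma2)

# Under quadratic loss the Bayes estimate is the posterior mean; with posterior
# draws theta_1..theta_M (e.g. from MCMC) it is approximated by
#   theta_hat = draws.mean(axis=0)
```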

Asymptotic results for the heating part

The observations $X_{1:n} = (X_1, \dots, X_n)$ depend on an exogenous variable $T_{1:n} = (T_1, \dots, T_n)$ via the model
$$X_i = \mu(\eta, T_i) + \xi_i := g(T_i - u)\, 1_{[T_i, +\infty[}(u) + \xi_i, \qquad i = 1, \dots, n,$$
where $(\xi_i)_{i \in \mathbb{N}}$ is an i.i.d. sequence from $\mathcal{N}(0, \sigma^2)$, and $\theta_0$ denotes the true value of
$$\theta = (g, u, \sigma^2) \in \Theta = \mathbb{R} \times\, ]\underline{u}, \overline{u}[\, \times \mathbb{R}_+.$$

Consistency

Assumptions (A):
1. There exists $K_\Theta$, a compact subset of the parameter space $\Theta$, such that the MLE $\hat\theta_n \in K_\Theta$ for any $n$ large enough.
2. There exists a cumulative distribution function $F$, continuously differentiable and whose derivative $F'$ does not vanish over $[\underline{u}, \overline{u}]$, such that
$$F_n(u) = \frac{1}{n}\sum_{i=1}^{n} 1_{[T_i,+\infty[}(u) \;\xrightarrow[n \to +\infty]{}\; F(u).$$

Theorem (Launay et al. [2])
Let $\pi(\cdot)$ be a prior distribution on $\theta$, continuous and positive on a neighbourhood of $\theta_0$, and let $U$ be a neighbourhood of $\theta_0$. Then, under Assumptions (A), as $n \to +\infty$,
$$\int_U \pi(\theta \mid X_{1:n})\, d\theta \xrightarrow{\text{a.s.}} 1.$$

Asymptotic normality

Theorem (Launay et al. [2])
Let $\pi(\cdot)$ be a prior distribution on $\theta$, continuous and positive at $\theta_0$, and let $k_0 \in \mathbb{N}$ be such that $\int \|\theta\|^{k_0} \pi(\theta)\, d\theta < +\infty$. Denote $t = n^{1/2}(\theta - \hat\theta_n)$ and let $\pi_n(\cdot \mid X_{1:n})$ be the posterior density of $t$ given $X_{1:n}$. Then, under Assumptions (A), for any $0 \le k \le k_0$, as $n \to +\infty$,
$$\int_{\mathbb{R}^3} \|t\|^k \left|\pi_n(t \mid X_{1:n}) - (2\pi)^{-\frac{3}{2}} |I(\theta_0)|^{\frac{1}{2}} e^{-\frac{1}{2} t' I(\theta_0) t}\right| dt \;\xrightarrow{P}\; 0,$$
where $I(\theta)$ is the asymptotic Fisher information matrix.

3 Construction of the prior

Objective

Consider the general model $X_t = s_t + w_t + \epsilon_t$.

Aim: estimation and prediction for the parametric model over a short dataset denoted B.

The parameter of the model:
$$\theta = (\eta, \sigma^2), \qquad \eta = (\underbrace{a_1, \dots, a_p, b_1, \dots, b_p}_{\text{Fourier}},\; \underbrace{\omega_1, \dots, \omega_q}_{\text{offsets}},\; \underbrace{\kappa_1, \dots, \kappa_r}_{\text{shapes}},\; \underbrace{g, u}_{\text{heating part}})$$

Problem: the high dimensionality of the model leads to overfitting, and so the prediction errors are larger.

Prior information: A, a long dataset known to share some common features with B.

We wish to improve parameter estimation and model prediction over B with the help of A.

Choice of the prior distribution

Construction of the prior distribution for B, the short dataset, using the information coming from the long dataset A:

A: prior distribution $\pi_A$  →  posterior distribution $\pi_A(\cdot \mid X_A)$
B: prior distribution $\pi_B$ = ?  →  posterior distribution $\pi_B(\cdot \mid X_B)$

Methodology

- We assume $\pi_B$ belongs to a parametric family $F = \{\pi_\lambda, \lambda \in \Lambda\}$.
- We want to pick $\lambda_B$ such that $\pi_B$ is "close" to $\pi_A(\cdot \mid X_A)$.
- Let $T : F \to \Lambda$ be an operator such that for every $\lambda \in \Lambda$, $T[\pi_\lambda] = \lambda$.

Choice of the prior distribution on B: we select $\lambda_B$ "proportional to" $T[\pi_A(\cdot \mid X_A)]$, in the sense that
$$\lambda_B = K\, T[\pi_A(\cdot \mid X_A)]$$
where $K : \Lambda \to \Lambda$ is a diagonal operator, $K = \mathrm{Diag}(k_1, \dots, k_j, \dots)$.

- The operator $K$ can be interpreted as a similarity operator between A and B (the diagonal components of $K$ are hyperparameters of the prior).
- We give each $k_j$ a vague hierarchical prior distribution centred around $q$; the prior on $q$ is vague and centred around 1.

Methodology (cont.)

Example 1: Method of moments. We assume that the elements of $F$ can be identified via their first $m$ moments:
$$T[\pi_\lambda] = F(E(\theta), \dots, E(\theta^m)) = \lambda$$
The expression of $\lambda_B$ then becomes
$$\lambda_B = K\, F(E(\theta \mid X_A), \dots, E(\theta^m \mid X_A))$$

Example 2: Conjugacy. We assume that $F$ is the family of conjugate priors for the model, so that $\pi_A \in F$ implies $\pi_A(\cdot \mid X_A) \in F$. Let $\lambda_A(X_A) \in \Lambda$ be the parameter of the posterior distribution $\pi_A(\cdot \mid X_A)$. The expression of $\lambda_B$ thus reduces to
$$\lambda_B = K\, \lambda_A(X_A)$$
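As a toy numerical sketch of the conjugacy route for a Gaussian family: the posterior over A is summarised by $(\mu_A, \Sigma_A)$, and the prior for B rescales the location by the similarity coefficients. Rescaling only the mean, and every name here, are illustrative assumptions:

```python
import numpy as np

# Posterior over the long dataset A, summarised by (mu_A, Sigma_A).
mu_A = np.array([1.2, -0.4, 0.9])
Sigma_A = np.diag([0.05, 0.02, 0.10])

# Similarity coefficients k_j: the diagonal of the operator K.
k = np.array([0.8, 1.0, 0.95])

# Prior for B in the same Gaussian family, lambda_B = K * lambda_A(X_A):
mu_B_prior = k * mu_A        # rescaled location
Sigma_B_prior = Sigma_A      # covariance kept unscaled in this sketch
```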

Application to the model of electricity load

Non-informative prior for the long dataset A; $\pi_B \in F = \{\text{Gaussian distributions}\}$:

A: non-informative prior $\pi_A$  →  posterior distribution $\pi_A(\cdot \mid X_A)$  →  $\mu_A, \Sigma_A$
B: $\pi_B$ Gaussian distribution  →  posterior distribution $\pi_B(\cdot \mid X_B)$

Non-informative prior [long dataset A]

Non-informative prior for A: we consider a prior distribution of the form
$$\pi_A(\theta) = \pi(\eta)\,\pi(\sigma^2), \qquad \pi(\eta) \propto 1, \qquad \pi(\sigma^2) \propto \sigma^{-2},$$
i.e. the Jeffreys prior for a standard regression model.

Posterior distribution: the associated posterior distribution is well defined,
$$\int_\Theta l(X_1, \dots, X_N \mid \theta)\, \pi_A(\theta)\, d\theta < +\infty$$

Explicit forms of $\mu_A, \Sigma_A$ are not available → MCMC algorithm.
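As an indication of what the MCMC step could look like, a minimal random-walk Metropolis sketch targeting a log-posterior function; this is a generic sampler under stated assumptions, not the algorithm used in the paper, and all tuning values are illustrative:

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=10_000, step=0.05, seed=0):
    """Random-walk Metropolis chain targeting exp(log_post)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    lp = log_post(theta)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# mu_A and Sigma_A are then approximated by the empirical moments of the
# (post burn-in) chain:
#   mu_A, Sigma_A = chain.mean(axis=0), np.cov(chain, rowvar=False)
```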

Informative prior distribution [short dataset B]

The hierarchical prior that we use is built as follows. The prior distribution is of the form
$$\pi(\theta, k, l, q, r) = \pi(\eta \mid k, l)\, \pi(k \mid q, r)\, \pi(l)\, \pi(q)\, \pi(r)\, \pi(\sigma^2)$$

- Prior: the parameters of B are close to $\mu_A$ and $\Sigma_A$:
$$\pi(\sigma^2) \propto \sigma^{-2}, \qquad \eta \mid k, l \sim \mathcal{N}(M_A k,\; l^{-1} \Sigma_A), \qquad M_A = \mathrm{Diag}(\mu_A)$$
- Hyperparameters $k$ and $l$ correspond to the similarity between A and B:
$$k_j \mid q, r \sim \mathcal{N}(q, r^{-1}), \qquad l \sim \mathcal{G}(a_l, b_l), \qquad a_l = b_l = 10^{-3}$$
- Hyperparameters $q$ and $r$ are more general indicators of how close A and B are: $q$ corresponds to the mean of the coordinates of $k$, and $r$ is the inverse variance of the coordinates of $k$:
$$q \sim \mathcal{N}(1, \sigma_q^2), \qquad r \sim \mathcal{G}(a_r, b_r), \qquad \sigma_q^2 = 10^4, \quad a_r = b_r = 10^{-6}$$
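To fix ideas, a sketch that draws once from this hierarchical prior, sampling top-down through the hierarchy; the hyperparameter values mirror the vague choices above as reconstructed, and the exponents are an assumption:

```python
import numpy as np

def sample_hierarchical_prior(mu_A, Sigma_A, rng=None):
    """One top-down draw of (eta, k, l, q, r) from the hierarchical prior."""
    rng = rng or np.random.default_rng()
    d = mu_A.size
    q = rng.normal(1.0, np.sqrt(1e4))             # q ~ N(1, sigma_q^2), vague
    r = rng.gamma(1e-6, 1.0 / 1e-6)               # r ~ G(a_r, b_r), vague (b_r read as a rate)
    k = rng.normal(q, np.sqrt(1.0 / r), size=d)   # k_j | q, r ~ N(q, 1/r)
    l = rng.gamma(1e-3, 1.0 / 1e-3)               # l ~ G(a_l, b_l), vague (b_l read as a rate)
    eta = rng.multivariate_normal(mu_A * k, Sigma_A / l)  # eta | k, l ~ N(M_A k, Sigma_A / l)
    return eta, k, l, q, r
```

With priors this vague, raw draws are extreme; the point of the sketch is only the conditional structure, which is what a posterior sampler exploits.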

4 Numerical results

Simulation

Dataset A. We simulated 4 years of daily data for A with parameters chosen to mimic the typical electricity load of France:
- 4 frequencies used for the truncated Fourier series
- 7 day types: one day type for each day of the week
- 2 offsets to simulate the daylight saving time effect
The temperatures $(T_t)$ are those measured from September 1996 to August 2000 at 10:00 AM.

Dataset B. We simulated 1 year of daily data for B with parameters:
- seasonal: $\alpha_i^B = k \alpha_i^A$, $\alpha = (a, b, \omega)$, $i = 1, \dots, d_\alpha$
- shape: $\kappa_j^B = \kappa_j^A$, $j = 1, \dots, d_\beta$
- heating: $g^B = k g^A$, $u^B = u^A$

We compare the quality of the Bayesian predictions associated with the hierarchical prior and with the non-informative prior. We simulate an extra year of daily data for B to evaluate the predictions.

The comparison criterion

Let $X_t = f_t(\theta) + \epsilon_t$. The prediction error can be written as
$$X_t - \hat X_t = \underbrace{[X_t - f_t(\theta)]}_{\text{noise } \epsilon_t} + \underbrace{[f_t(\theta) - \hat X_t]}_{\text{difference w.r.t. the true model}}$$
Given that we want to validate our model on simulated data, we choose as our quality criterion for a model the quadratic distance between the true and the predicted model over a year, i.e.
$$\frac{1}{365} \sum_{t=1}^{365} \left[f_t(\theta) - \hat X_t\right]^2.$$
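In code, the criterion is a one-liner; a minimal sketch with illustrative names:

```python
import numpy as np

def quality_criterion(f_true, x_pred):
    """Mean quadratic distance between the true model values f_t(theta)
    and the predictions over one year (365 daily values)."""
    f_true, x_pred = np.asarray(f_true), np.asarray(x_pred)
    return np.mean((f_true - x_pred) ** 2)
```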

Comparison: RMSE info. / RMSE non-info.

[Figure: RMSE (predictions) ratio between the informative and non-informative priors, plotted against k for k between 0.5 and 1.]

- abscissa: $k$, similarity coefficient between the seasonality and heating gradient of populations A and B (other parameters remain untouched)
- ordinate: RMSE info. / RMSE non-info.

Whether the similarity between A and B is ideal ($k = 1$) or not ($k \neq 1$), prior information improves prediction quality.

Parameter estimation: info. vs non-info.

[Figure: estimation errors (posterior mean minus true value) in three panels: Seasonal ($\alpha_1, \dots, \alpha_{10}$), Shape ($\beta_1, \dots, \beta_6$), Heating ($\gamma, u$).]

- 300 replications for the ideal situation ($k = 1$)
- ordinate: $\hat\theta - \theta$ for the informative prior (left) and the non-informative prior (right)

Prior information improves the estimations a lot.

Parameter estimation: info. vs non-info. (cont.)

[Figure: estimation errors (posterior mean minus true value) in three panels: Seasonal ($\alpha_1, \dots, \alpha_{10}$), Shape ($\beta_1, \dots, \beta_6$), Heating ($\gamma, u$).]

- 300 replications for a difficult situation ($k = 0.5$)
- ordinate: $\hat\theta - \theta$ for the informative prior (left) and the non-informative prior (right)

Prior information improves the estimations only slightly.

Hyperparameter estimation ($q$ and $r$): $k_j \sim \mathcal{N}(q, r^{-1})$

[Figure: posterior means of $q$ and of $r$ (log scale) plotted against k, for k between 0.5 and 1.]

- abscissa: $k$, similarity coefficient between the seasonality and heating gradient of populations A and B (other parameters remain untouched)
- ordinate: Bayes estimators of $q$ and $r$ for the 5 300 replications considered

$q$ and $r$ correspond to the mean and the precision (inverse of the variance) of the similarity coefficients $k_j$.

Application

Datasets. A corresponds to a specific population in France frequently referred to as non-metered; $B_1$ is a subpopulation of A; $B_2$ covers the same population as A (ERDF).

The sizes (in days) of the datasets are:
- population A: non-metered (833 days)
- population $B_1$: res1+2 (177 + 30 days)
- population $B_2$: non-metered ERDF (121 + 30 days)

Model: 4 frequencies used for the truncated Fourier series, 5 day types, 2 offsets.

We kept the last 30 days of each B out of the estimation datasets and assessed the model quality over the predictions for those 30 days.

Results

B_1          non-informative   hierarchical   comparison
RMSE est.    775.93            786.97         +1.42%
RMSE pred.   1863.25           894.00         -52.01%
MAPE est.    4.00              3.93           -0.07
MAPE pred.   19.37             9.30           -10.07

B_2          non-informative   hierarchical   comparison
RMSE est.    1127.60           1202.32        +6.62%
RMSE pred.   2286.42           1339.14        -41.83%
MAPE est.    2.82              2.98           +0.15
MAPE pred.   8.65              3.48           -5.17

TABLE: Results for the datasets B_1 (top) and B_2 (bottom). RMSE is the root mean square error and MAPE is the mean absolute percentage error. Both of these common measures of accuracy were computed on the estimation (est.) and prediction (pred.) parts of the two datasets.

Estimated posterior marginal distributions of $k_i$ for the hierarchical method and for both datasets $B_1$ (upper row) and $B_2$ (lower row).

[Figure: posterior densities in three panels per row: Seasonal ($k_{\alpha_1}, \dots, k_{\alpha_{10}}$), Shape ($k_{\beta_1}, \dots, k_{\beta_4}$), Heating ($k_\gamma, k_u$).]

Estimation of the hyperparameters:
$$(\hat q, \hat r) = \begin{cases} (0.73,\ 24.48) & \text{on } B_1 \\ (1.02,\ 795.16) & \text{on } B_2 \end{cases}$$

5 Bibliography

References

[1] T. Launay, A. Philippe and S. Lamarche (2011). Construction of an informative hierarchical prior distribution: application to electricity load forecasting. http://arxiv.org/abs/1109.4533v2

[2] T. Launay, A. Philippe and S. Lamarche (2012). Consistency of the posterior distribution and MLE for piecewise linear regression. http://arxiv.org/abs/1203.4753