Construction of an Informative Hierarchical Prior Distribution: Application to Electricity Load Forecasting


Anne Philippe
Laboratoire de Mathématiques Jean Leray, Université de Nantes
Workshop EDF-INRIA, 5 April 2012, IHP, Paris
Work in collaboration with Tristan Launay (LMJL / EDF OSIRIS, R39) and Sophie Lamarche (EDF OSIRIS, R39)
anne.philippe@univ-nantes.fr

Plan

1 Introduction
2 Bayesian approach and asymptotic results
3 Construction of the prior
4 Numerical results
5 Bibliography

1 Introduction

The electricity load signal

[Figure: three panels of relative load: over the years 2002-2007; over two sample weeks in June and December 2005 (Monday through Sunday); and against temperature.]

Comments:
- the electricity demand exhibits multiple strong seasonalities: yearly, weekly and daily;
- daylight saving time, holidays and bank holidays induce several breaks;
- the link with the temperature is non-linear.

The model

The electricity load $X_t$ is written as
$$X_t = s_t + w_t + \epsilon_t$$
- $s_t$: seasonal part
- $w_t$: non-linear weather-dependent part (heating)
- $(\epsilon_t)_{t \in \mathbb{N}}$ i.i.d. from $\mathcal{N}(0, \sigma^2)$

Heating part:
$$w_t = \begin{cases} g(T_t - u) & \text{if } T_t < u \\ 0 & \text{if } T_t \geq u \end{cases}$$
where $g$ is the heating gradient and $u$ is the heating threshold.
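As an illustration of the two regimes, here is a minimal NumPy sketch of the heating part; the function name `heating_part` and its signature are assumptions for illustration, not from the slides:

```python
import numpy as np

def heating_part(temps, g, u):
    """Heating part: w_t = g * (T_t - u) when T_t < u, and 0 otherwise.

    temps : array of temperatures T_t
    g     : heating gradient
    u     : heating threshold
    """
    temps = np.asarray(temps, dtype=float)
    return np.where(temps < u, g * (temps - u), 0.0)
```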

Model (cont.)

Seasonal part:
$$s_t = \sum_{j=1}^{p} \left\{ a_j \cos\left(\frac{2\pi j}{365.25}\, t\right) + b_j \sin\left(\frac{2\pi j}{365.25}\, t\right) \right\} + \sum_{j=1}^{q} \omega_j 1_{\Omega_j}(t) + \sum_{j=1}^{r} \kappa_j 1_{K_j}(t)$$

- The truncated Fourier series captures the average seasonal behaviour.
- Offsets (parameters $\omega_j \in \mathbb{R}$) represent the average levels over predetermined periods given by a partition $(\Omega_j)_{j \in \{1,\dots,q\}}$ of the calendar.
- Shapes (parameters $\kappa_j$) provide day-to-day adjustments; they depend on the so-called day types, given by a second partition $(K_j)_{j \in \{1,\dots,r\}}$ of the calendar.
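In the same spirit, a hedged sketch of the seasonal part, with the two calendar partitions encoded as integer label arrays; all names and this encoding are illustrative assumptions:

```python
import numpy as np

def seasonal_part(t, a, b, omega, offset_labels, kappa, daytype_labels):
    """Truncated Fourier series plus offsets and day-type shapes.

    t              : array of day indices
    a, b           : Fourier coefficients a_1..a_p and b_1..b_p
    omega          : offset levels omega_1..omega_q
    offset_labels  : for each t, the (0-based) index j of its period Omega_j
    kappa          : shape adjustments kappa_1..kappa_r
    daytype_labels : for each t, the (0-based) index j of its day type K_j
    """
    t = np.asarray(t, dtype=float)
    s = np.zeros_like(t)
    for j, (aj, bj) in enumerate(zip(a, b), start=1):
        w = 2.0 * np.pi * j * t / 365.25
        s += aj * np.cos(w) + bj * np.sin(w)
    s += np.asarray(omega)[offset_labels]    # average level over the current period
    s += np.asarray(kappa)[daytype_labels]   # day-type adjustment
    return s
```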

2 Bayesian approach and asymptotic results

Bayesian principle

Let $X_1, \dots, X_N$ be the observations. The likelihood of the model, conditionally on the temperatures, is
$$l(X_1, \dots, X_N \mid \theta) = \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[X_t - (s_t + w_t)\right]^2 \right\}$$

Let $\pi(\theta)$ be the prior distribution on $\theta$. Bayesian inference is based on the posterior distribution of $\theta$:
$$\pi(\theta \mid X_1, \dots, X_N) \propto l(X_1, \dots, X_N \mid \theta)\, \pi(\theta)$$
The Bayes estimate (under quadratic loss) is $\hat\theta = E[\theta \mid X_1, \dots, X_N]$.

Predictive distribution:
$$p(X_{N+k} \mid X_1, \dots, X_N) = \int_\Theta l(X_{N+k} \mid \theta)\, \pi(\theta \mid X_1, \dots, X_N)\, d\theta$$
The optimal prediction (under quadratic loss) is $\hat X_{N+k} = E[X_{N+k} \mid X_1, \dots, X_N]$.
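To make these quantities concrete, a minimal sketch of the Gaussian log-likelihood above, together with the usual Monte Carlo reading of the Bayes estimate; the names are illustrative, and `draws` is assumed to come from a posterior sampler:

```python
import numpy as np

def log_likelihood(x, model_mean, sigma2):
    """Log-likelihood of observations x under x_t = (s_t + w_t) + N(0, sigma2) noise."""
    x, model_mean = np.asarray(x), np.asarray(model_mean)
    resid = x - model_mean
    return -0.5 * x.size * np.log(2.0 * np.pi * sigma2) - resid @ resid / (2.0 * sigma2)

# Under quadratic loss the Bayes estimate is the posterior mean; with posterior
# draws theta_1..theta_M (e.g. from MCMC) it is approximated by
#   theta_hat = draws.mean(axis=0)
```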

Asymptotic results for the heating part

The observations $X_{1:n} = (X_1, \dots, X_n)$ depend on an exogenous variable $T_{1:n} = (T_1, \dots, T_n)$ via the model
$$X_i = \mu(\eta, T_i) + \xi_i := g(T_i - u)\, 1_{[T_i, +\infty[}(u) + \xi_i, \qquad i = 1, \dots, n,$$
where $(\xi_i)_{i \in \mathbb{N}}$ is an i.i.d. sequence from $\mathcal{N}(0, \sigma^2)$, and $\theta_0$ denotes the true value of
$$\theta = (g, u, \sigma^2) \in \Theta = \mathbb{R} \times\, ]\underline{u}, \overline{u}[\, \times \mathbb{R}_+.$$

Consistency

Assumptions (A):
1. There exists $K_\Theta$, a compact subset of the parameter space $\Theta$, such that the MLE $\hat\theta_n \in K_\Theta$ for any $n$ large enough.
2. There exists a cumulative distribution function $F$, continuously differentiable and whose derivative $F'$ does not vanish over $[\underline{u}, \overline{u}]$, such that
$$F_n(u) = \frac{1}{n}\sum_{i=1}^{n} 1_{[T_i,+\infty[}(u) \;\xrightarrow[n \to +\infty]{}\; F(u).$$

Theorem (Launay et al. [2])
Let $\pi(\cdot)$ be a prior distribution on $\theta$, continuous and positive on a neighbourhood of $\theta_0$, and let $U$ be a neighbourhood of $\theta_0$. Then, under Assumptions (A), as $n \to +\infty$,
$$\int_U \pi(\theta \mid X_{1:n})\, d\theta \xrightarrow{\text{a.s.}} 1.$$

Asymptotic normality

Theorem (Launay et al. [2])
Let $\pi(\cdot)$ be a prior distribution on $\theta$, continuous and positive at $\theta_0$, and let $k_0 \in \mathbb{N}$ be such that $\int \|\theta\|^{k_0} \pi(\theta)\, d\theta < +\infty$. Denote $t = n^{1/2}(\theta - \hat\theta_n)$ and let $\pi_n(\cdot \mid X_{1:n})$ be the posterior density of $t$ given $X_{1:n}$. Then, under Assumptions (A), for any $0 \le k \le k_0$, as $n \to +\infty$,
$$\int_{\mathbb{R}^3} \|t\|^k \left|\pi_n(t \mid X_{1:n}) - (2\pi)^{-\frac{3}{2}} |I(\theta_0)|^{\frac{1}{2}} e^{-\frac{1}{2} t' I(\theta_0) t}\right| dt \;\xrightarrow{P}\; 0,$$
where $I(\theta)$ is the asymptotic Fisher information matrix.

3 Construction of the prior

Objective

Consider the general model $X_t = s_t + w_t + \epsilon_t$.

Aim: estimation and prediction for the parametric model over a short dataset denoted B.

The parameter of the model:
$$\theta = (\eta, \sigma^2), \qquad \eta = (\underbrace{a_1, \dots, a_p, b_1, \dots, b_p}_{\text{Fourier}},\; \underbrace{\omega_1, \dots, \omega_q}_{\text{offsets}},\; \underbrace{\kappa_1, \dots, \kappa_r}_{\text{shapes}},\; \underbrace{g, u}_{\text{heating part}})$$

Problem: the high dimensionality of the model leads to overfitting, and so the prediction errors are larger.

Prior information: A, a long dataset known to share some common features with B.

We wish to improve parameter estimation and model prediction over B with the help of A.

Choice of the prior distribution

Construction of the prior distribution for B, the short dataset, using the information coming from the long dataset A:

A: prior distribution $\pi_A$  →  posterior distribution $\pi_A(\cdot \mid X_A)$
B: prior distribution $\pi_B$ = ?  →  posterior distribution $\pi_B(\cdot \mid X_B)$

Methodology

- We assume $\pi_B$ belongs to a parametric family $F = \{\pi_\lambda, \lambda \in \Lambda\}$.
- We want to pick $\lambda_B$ such that $\pi_B$ is "close" to $\pi_A(\cdot \mid X_A)$.
- Let $T : F \to \Lambda$ be an operator such that for every $\lambda \in \Lambda$, $T[\pi_\lambda] = \lambda$.

Choice of the prior distribution on B: we select $\lambda_B$ "proportional to" $T[\pi_A(\cdot \mid X_A)]$, in the sense that
$$\lambda_B = K\, T[\pi_A(\cdot \mid X_A)]$$
where $K : \Lambda \to \Lambda$ is a diagonal operator, $K = \mathrm{Diag}(k_1, \dots, k_j, \dots)$.

- The operator $K$ can be interpreted as a similarity operator between A and B (the diagonal components of $K$ are hyperparameters of the prior).
- We give each $k_j$ a vague hierarchical prior distribution centred around $q$; the prior on $q$ is vague and centred around 1.

Methodology (cont.)

Example 1: Method of moments. We assume that the elements of $F$ can be identified via their first $m$ moments:
$$T[\pi_\lambda] = F(E(\theta), \dots, E(\theta^m)) = \lambda$$
The expression of $\lambda_B$ then becomes
$$\lambda_B = K\, F(E(\theta \mid X_A), \dots, E(\theta^m \mid X_A))$$

Example 2: Conjugacy. We assume that $F$ is the family of conjugate priors for the model, so that $\pi_A \in F$ implies $\pi_A(\cdot \mid X_A) \in F$. Let $\lambda_A(X_A) \in \Lambda$ be the parameter of the posterior distribution $\pi_A(\cdot \mid X_A)$. The expression of $\lambda_B$ thus reduces to
$$\lambda_B = K\, \lambda_A(X_A)$$
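As a toy numerical sketch of the conjugacy route for a Gaussian family: the posterior over A is summarised by $(\mu_A, \Sigma_A)$, and the prior for B rescales the location by the similarity coefficients. Rescaling only the mean, and every name here, are illustrative assumptions:

```python
import numpy as np

# Posterior over the long dataset A, summarised by (mu_A, Sigma_A).
mu_A = np.array([1.2, -0.4, 0.9])
Sigma_A = np.diag([0.05, 0.02, 0.10])

# Similarity coefficients k_j: the diagonal of the operator K.
k = np.array([0.8, 1.0, 0.95])

# Prior for B in the same Gaussian family, lambda_B = K * lambda_A(X_A):
mu_B_prior = k * mu_A        # rescaled location
Sigma_B_prior = Sigma_A      # covariance kept unscaled in this sketch
```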

Application to the model of electricity load

Non-informative prior for the long dataset A; $\pi_B \in F = \{\text{Gaussian distributions}\}$:

A: non-informative prior $\pi_A$  →  posterior distribution $\pi_A(\cdot \mid X_A)$  →  $\mu_A, \Sigma_A$
B: $\pi_B$ Gaussian distribution  →  posterior distribution $\pi_B(\cdot \mid X_B)$

Non-informative prior [long dataset A]

Non-informative prior for A: we consider a prior distribution of the form
$$\pi_A(\theta) = \pi(\eta)\,\pi(\sigma^2), \qquad \pi(\eta) \propto 1, \qquad \pi(\sigma^2) \propto \sigma^{-2},$$
i.e. the Jeffreys prior for a standard regression model.

Posterior distribution: the associated posterior distribution is well defined,
$$\int_\Theta l(X_1, \dots, X_N \mid \theta)\, \pi_A(\theta)\, d\theta < +\infty$$

Explicit forms of $\mu_A, \Sigma_A$ are not available → MCMC algorithm.
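As an indication of what the MCMC step could look like, a minimal random-walk Metropolis sketch targeting a log-posterior function; this is a generic sampler under stated assumptions, not the algorithm used in the paper, and all tuning values are illustrative:

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=10_000, step=0.05, seed=0):
    """Random-walk Metropolis chain targeting exp(log_post)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    lp = log_post(theta)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# mu_A and Sigma_A are then approximated by the empirical moments of the
# (post burn-in) chain:
#   mu_A, Sigma_A = chain.mean(axis=0), np.cov(chain, rowvar=False)
```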

Informative prior distribution [short dataset B]

The hierarchical prior that we use is built as follows. The prior distribution is of the form
$$\pi(\theta, k, l, q, r) = \pi(\eta \mid k, l)\, \pi(k \mid q, r)\, \pi(l)\, \pi(q)\, \pi(r)\, \pi(\sigma^2)$$

- Prior: the parameters of B are close to $\mu_A$ and $\Sigma_A$:
$$\pi(\sigma^2) \propto \sigma^{-2}, \qquad \eta \mid k, l \sim \mathcal{N}(M_A k,\; l^{-1} \Sigma_A), \qquad M_A = \mathrm{Diag}(\mu_A)$$
- Hyperparameters $k$ and $l$ correspond to the similarity between A and B:
$$k_j \mid q, r \sim \mathcal{N}(q, r^{-1}), \qquad l \sim \mathcal{G}(a_l, b_l), \qquad a_l = b_l = 10^{-3}$$
- Hyperparameters $q$ and $r$ are more general indicators of how close A and B are: $q$ corresponds to the mean of the coordinates of $k$, and $r$ is the inverse variance of the coordinates of $k$:
$$q \sim \mathcal{N}(1, \sigma_q^2), \qquad r \sim \mathcal{G}(a_r, b_r), \qquad \sigma_q^2 = 10^4, \quad a_r = b_r = 10^{-6}$$
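To fix ideas, a sketch that draws once from this hierarchical prior, sampling top-down through the hierarchy; the hyperparameter values mirror the vague choices above as reconstructed, and the exponents are an assumption:

```python
import numpy as np

def sample_hierarchical_prior(mu_A, Sigma_A, rng=None):
    """One top-down draw of (eta, k, l, q, r) from the hierarchical prior."""
    rng = rng or np.random.default_rng()
    d = mu_A.size
    q = rng.normal(1.0, np.sqrt(1e4))             # q ~ N(1, sigma_q^2), vague
    r = rng.gamma(1e-6, 1.0 / 1e-6)               # r ~ G(a_r, b_r), vague (b_r read as a rate)
    k = rng.normal(q, np.sqrt(1.0 / r), size=d)   # k_j | q, r ~ N(q, 1/r)
    l = rng.gamma(1e-3, 1.0 / 1e-3)               # l ~ G(a_l, b_l), vague (b_l read as a rate)
    eta = rng.multivariate_normal(mu_A * k, Sigma_A / l)  # eta | k, l ~ N(M_A k, Sigma_A / l)
    return eta, k, l, q, r
```

With priors this vague, raw draws are extreme; the point of the sketch is only the conditional structure, which is what a posterior sampler exploits.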

4 Numerical results

Simulation

Dataset A. We simulated 4 years of daily data for A with parameters chosen to mimic the typical electricity load of France:
- 4 frequencies used for the truncated Fourier series
- 7 day types: one day type for each day of the week
- 2 offsets to simulate the daylight saving time effect
The temperatures $(T_t)$ are those measured from September 1996 to August 2000 at 10:00 AM.

Dataset B. We simulated 1 year of daily data for B with parameters:
- seasonal: $\alpha_i^B = k \alpha_i^A$, $\alpha = (a, b, \omega)$, $i = 1, \dots, d_\alpha$
- shape: $\kappa_j^B = \kappa_j^A$, $j = 1, \dots, d_\beta$
- heating: $g^B = k g^A$, $u^B = u^A$

We compare the quality of the Bayesian predictions associated with the hierarchical prior and with the non-informative prior. We simulate an extra year of daily data for B to evaluate the predictions.

The comparison criterion

Let $X_t = f_t(\theta) + \epsilon_t$. The prediction error can be written as
$$X_t - \hat X_t = \underbrace{[X_t - f_t(\theta)]}_{\text{noise } \epsilon_t} + \underbrace{[f_t(\theta) - \hat X_t]}_{\text{difference w.r.t. the true model}}$$
Given that we want to validate our model on simulated data, we choose as our quality criterion for a model the quadratic distance between the true and the predicted model over a year, i.e.
$$\frac{1}{365} \sum_{t=1}^{365} \left[f_t(\theta) - \hat X_t\right]^2.$$
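In code, the criterion is a one-liner; a minimal sketch with illustrative names:

```python
import numpy as np

def quality_criterion(f_true, x_pred):
    """Mean quadratic distance between the true model values f_t(theta)
    and the predictions over one year (365 daily values)."""
    f_true, x_pred = np.asarray(f_true), np.asarray(x_pred)
    return np.mean((f_true - x_pred) ** 2)
```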

Comparison: RMSE info. / RMSE non-info.

[Figure: RMSE (predictions) ratio between the informative and non-informative priors, plotted against k for k between 0.5 and 1.]

- abscissa: $k$, similarity coefficient between the seasonality and heating gradient of populations A and B (other parameters remain untouched)
- ordinate: RMSE info. / RMSE non-info.

Whether the similarity between A and B is ideal ($k = 1$) or not ($k \neq 1$), prior information improves prediction quality.

Parameter estimation: info. vs non-info.

[Figure: estimation errors (posterior mean minus true value) in three panels: Seasonal ($\alpha_1, \dots, \alpha_{10}$), Shape ($\beta_1, \dots, \beta_6$), Heating ($\gamma, u$).]

- 300 replications for the ideal situation ($k = 1$)
- ordinate: $\hat\theta - \theta$ for the informative prior (left) and the non-informative prior (right)

Prior information improves the estimations a lot.

Parameter estimation: info. vs non-info. (cont.)

[Figure: estimation errors (posterior mean minus true value) in three panels: Seasonal ($\alpha_1, \dots, \alpha_{10}$), Shape ($\beta_1, \dots, \beta_6$), Heating ($\gamma, u$).]

- 300 replications for a difficult situation ($k = 0.5$)
- ordinate: $\hat\theta - \theta$ for the informative prior (left) and the non-informative prior (right)

Prior information improves the estimations only slightly.

Hyperparameter estimation ($q$ and $r$): $k_j \sim \mathcal{N}(q, r^{-1})$

[Figure: posterior means of $q$ and of $r$ (log scale) plotted against k, for k between 0.5 and 1.]

- abscissa: $k$, similarity coefficient between the seasonality and heating gradient of populations A and B (other parameters remain untouched)
- ordinate: Bayes estimators of $q$ and $r$ for the 5 300 replications considered

$q$ and $r$ correspond to the mean and the precision (inverse of the variance) of the similarity coefficients $k_j$.

Application

Datasets. A corresponds to a specific population in France frequently referred to as non-metered; $B_1$ is a subpopulation of A; $B_2$ covers the same population as A (ERDF).

The sizes (in days) of the datasets are:
- population A: non-metered (833 days)
- population $B_1$: res1+2 (177 + 30 days)
- population $B_2$: non-metered ERDF (121 + 30 days)

Model: 4 frequencies used for the truncated Fourier series, 5 day types, 2 offsets.

We kept the last 30 days of each B out of the estimation datasets and assessed the model quality over the predictions for those 30 days.

Results

B_1          non-informative   hierarchical   comparison
RMSE est.    775.93            786.97         +1.42%
RMSE pred.   1863.25           894.00         -52.01%
MAPE est.    4.00              3.93           -0.07
MAPE pred.   19.37             9.30           -10.07

B_2          non-informative   hierarchical   comparison
RMSE est.    1127.60           1202.32        +6.62%
RMSE pred.   2286.42           1339.14        -41.83%
MAPE est.    2.82              2.98           +0.15
MAPE pred.   8.65              3.48           -5.17

TABLE: Results for the datasets B_1 (top) and B_2 (bottom). RMSE is the root mean square error and MAPE is the mean absolute percentage error. Both of these common measures of accuracy were computed on the estimation (est.) and prediction (pred.) parts of the two datasets.

Estimated posterior marginal distributions of $k_i$ for the hierarchical method and for both datasets $B_1$ (upper row) and $B_2$ (lower row).

[Figure: posterior densities in three panels per row: Seasonal ($k_{\alpha_1}, \dots, k_{\alpha_{10}}$), Shape ($k_{\beta_1}, \dots, k_{\beta_4}$), Heating ($k_\gamma, k_u$).]

Estimation of the hyperparameters:
$$(\hat q, \hat r) = \begin{cases} (0.73,\ 24.48) & \text{on } B_1 \\ (1.02,\ 795.16) & \text{on } B_2 \end{cases}$$

5 Bibliography

References

[1] T. Launay, A. Philippe and S. Lamarche (2011). Construction of an informative hierarchical prior distribution: application to electricity load forecasting. http://arxiv.org/abs/1109.4533v2

[2] T. Launay, A. Philippe and S. Lamarche (2012). Consistency of the posterior distribution and MLE for piecewise linear regression. http://arxiv.org/abs/1203.4753