High dimensional statistical learning methods applied to energy data sets
1 High dimensional statistical learning methods applied to energy data sets Dominique Picard Université Paris-Diderot LPMA
2 Data science challenge: post-modern rationality. Replace human logic by statistical logic. Prediction during Greek Antiquity: the oracle. Physical modeling (prediction): mathematical calculation using knowledge of the causes and mathematical equations for the consequences. Post-modern prediction: predictive algorithms using rather elementary indices of statistical correlations between phenomena on large data sets.
3 Learning in high dimension. Basic ideas: 1 Representation of the data 2 Sparsity, regularity, smoothing 3 Divide and conquer 4 Aggregation
4 Parametric-nonparametric 1 Representation of the data 2 Sparsity: only a small number of parameters matter 3 Parametric: you know which parameters matter 4 Nonparametric: you don't know which parameters matter: smoothing
5 Three examples 1 Forecasting the electrical consumption 2 Modeling-Forecasting wind production 3 Segmentation of the country into regions using meteorological data
6 Segmentation of the country into homogeneous climate regions ( ) Homogeneous climate regions using learning algorithms, 2018, submitted, M. Mougeot, D. P., V. Lefieux, M. Marchand
7 Arpege French Meteorological data. At n = 259 locations: Temperature and Wind, for 14 years, hourly sample rate. d = points for raw data. Y data matrix (n x d), n << d. [Figure: Temperature and Wind maps]
8 Objective and questions. Goals: segmentation of the country into regions using meteorological data, Temperature and/or Wind; study the between-year variability.
9 Wind and Temperature spots for 2014. [Figure: daily Temperature and Wind curves for 2014 at Chamonix, Brest and Toulouse]
10 Segmentation for 2001, 2007, 2014 daily data. [Figure: Temperature maps, Kmeans 5 with thresholding 0.05 (dicota) for 2001, 2007 and 2014; Wind maps, Kmeans 4 (2001, 2007) and Kmeans 5 (2014) with the same thresholding]
11 Segmentation results : stability Number of regions stability Geographical stability (no GPS coordinates in the data) Between years stability Wind-temperature stability
12 Intraday load curve forecasting, here 48 h. [Figure: observed curve Y, sparse approximation Yapx and forecast beyond tpred]
13 Forecasting procedure. Construction of an encyclopedia of past scenarios on a learning set of days ( ). For each day: explanatory variables, exogenous and endogenous. Sparse approximation of the consumption of the day. Prediction procedure: build a set of prediction experts, then aggregate the prediction experts. ( ) Forecasting Intra Day Load Curves Using Sparse Functional Regression (2015), Lecture Notes in Statistics, vol 217, M. Mougeot, D. P., V. Lefieux, L. Maillard-Teyssier
14 Wind power with machine learning. [Figure: turbine builder's power curve, electrical power [kW] versus wind speed [m/s]; cp = 0.44]
15 Wind power with machine learning. Models inspired by physics: Persistence: Y_t = Y_{t-1}; turbine builder's power curve. Parametric models: Linear regression: Y_t = a_0 + a_1 W_t. Logistic regression: Y_t = C / (1 + exp(a_0 + a_1 W_t)). Polynomial logistic regression: Y_t = C / (1 + exp(a_0 + a_1 W_t + a_2 W_t^2 + a_3 W_t^3)). LASSO. Statistical learning models: CART, Bagging+CART, Random Forest, Support Vector Machine, k-Nearest Neighbors.
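To make the parametric side concrete, a minimal sketch of fitting the logistic power curve by nonlinear least squares; the data and parameter values below are synthetic and illustrative, not those of the study.

```python
import numpy as np
from scipy.optimize import curve_fit

# Logistic power curve Y = C / (1 + exp(a0 + a1 * W)), as on the slide.
def logistic_curve(w, C, a0, a1):
    return C / (1.0 + np.exp(a0 + a1 * w))

rng = np.random.default_rng(0)
w = rng.uniform(0, 25, 500)                                   # wind speed [m/s]
y = logistic_curve(w, 2000.0, 6.0, -0.8) + rng.normal(0, 30, 500)

# Nonlinear least-squares fit of (C, a0, a1) from a rough initial guess.
params, _ = curve_fit(logistic_curve, w, y, p0=[1500.0, 5.0, -0.5])
C_hat, a0_hat, a1_hat = params
```

The same synthetic data can serve to compare the statistical learning models listed above.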
16 Aim: statistical learning for wind power. Can a statistical model be of interest compared to a physically based model? In a forecast view, what is the effect on the modeling of the lack of precision on the meteorological variables? ( ) Statistical learning for wind power: A modeling and stability study towards forecasting, Wind Energy, 2017, A. Fisher, L. Montuelle, M. Mougeot, D. P. [Slide footer: Dominique Picard (Paris-Diderot-LPMA), High dimensional statistical learning methods applied to energy data sets, February, /64]
17 Forecasting procedure ( ) Forecasting Intra Day Load Curves Using Sparse Functional Regression (2015) Lecture Notes in Statistics, vol 217 M. Mougeot, D. P., V. Lefieux, L. Maillard-Teyssier
18 Forecasting pipeline 1 Construction of a smart encyclopedia of past scenarios out of a database, using learning algorithms. 2 Build a set of prediction experts consulting the encyclopedia. 3 Aggregate the prediction experts.
19 Database. The past database: electrical consumption of the past; meteorological input; endogenous variables: calendar data, functional bases.
20 Electrical consumption of the past. Recorded every half hour from January 1st, 2003 to August 31st. For this period of time, the global consumption signal is split into N = 2800 sub-signals (Y_1, ..., Y_t, ..., Y_N). Y_t in R^n defines the intraday load curve for the t-th day, of size n = 48.
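The splitting of the half-hourly record into daily curves amounts to a reshape; a minimal sketch with a synthetic stand-in for the recorded signal:

```python
import numpy as np

# Half-hourly record split into N = 2800 daily curves of length n = 48.
n, N = 48, 2800
raw = np.arange(N * n, dtype=float)   # stand-in for the recorded consumption

Y = raw.reshape(N, n)                 # Y[t] is the intraday load curve of day t
```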
21 Intraday load curve for seven days, Monday January 25th to Sunday January 31st. [Figure]
22 Calendar and functional effects. [Figure: typical load curves in autumn, winter, spring and summer]
23 Functional aspect : dictionary Figure: Functions of the dictionary. Constant (black-solid line), cosine (blue-dotted line), sine (blue-dashdot line), Haar (red-dashed line) and temperature (green-solid line with points) functions.
24 Meteorological inputs. A total of 371 (= 2x39 + 293) meteorological variables recorded each day half-hourly over the 2800 days of the same period of time. Temperature: T^k for k = 1, ..., 39, measured in 39 weather stations scattered all over the French territory. Cloud cover: N^k for k = 1, ..., 39, measured in the same 39 weather stations. Wind: W^k for k = 1, ..., 293, available at 293 network points scattered all over the territory.
25 Weather stations Figure: Temperature and Cloud covering measurement stations. Wind stations
26 Explanatory variables. For each index t of the day of interest, we register the daily electrical consumption signal Y_t, and Z_t = [P_t M_t] = [C_t B_t [T]_t [N]_t [W]_t] (1) is the concatenation of the calendar and functional variables and climate variables previously described.
27 Sparse approximation on the learning set. Sparse approximation of each consumption day on a learning set of days ( ), using exogenous and endogenous variables. For each day t of the learning set, we build an approximation Yhat_t of the (observed) signal Y_t with the help of the endogenous and exogenous variables Z_t: Yhat_t = G_t(Z_t).
28 Forecasting procedure SL-E-A Sparse learning of each consumption day on the past set. Construction of a set of forecasting experts. Aggregation of the experts.
29 Forecasting experts: expert associated to the strategy M. Strategy: M is a purely non-anticipative function, data dependent or not, from N to N such that for any d in N, M(d) < d. Plug-in: to the strategy M we associate the expert Ytilde_t^M, the prediction of the signal of day t using forecasting strategy M: Ytilde_t^M = G_{M(t)}(Z_t).
30 Examples of experts: time depending. tm1: refer to the day before (the coefficients used for prediction are those calculated the previous day): M(d) = d - 1. tm7: refer to one week before: M(d) = d - 7.
31 Experts introducing climatic patterns of the day. T: find the day having the closest temperature indicators (min, max, med, std) regarding the sup distance (over the days and over the indicators): M(d) = ArgMin_t sup_{k in {1,...,6}, i in {1,...,48}} |T_d^k(i) - T_t^k(i)|. Tm: find the day having the closest median temperature with the sup distance (over the day): M(d) = ArgMin_t sup_{i in {1,...,48}} |T_d^3(i) - T_t^3(i)|.
32 Experts introducing a climatic configuration of the day constrained by the type of the day. s/j: closest cloud cover indicators (min, max, med, std) regarding the sup distance, restricted to days of the same type as Y_t: M(d) = ArgMin_{t : J(t) = J(d)} sup_{i,k} |N_d^k(i) - N_t^k(i)|.
33 MAPE error. For day t, the prediction MAPE error over the interval [0,T] is defined by: MAPE(Y, Ytilde_t^M)(T) = (1/T) sum_{i=1}^T |Ytilde_t^M(i) - Y_t(i)| / |Y_t(i)|, and MISE(Y, Ytilde_t^M)(T) = (1/T) sum_{i=1}^T |Ytilde_t^M(i) - Y_t(i)|^2.
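The time-depending strategies are just non-anticipative shifts of the day index; a minimal sketch:

```python
# The two time-shift strategies as non-anticipative maps with M(d) < d.
strategies = {
    "tm1": lambda d: d - 1,   # refer to the day before
    "tm7": lambda d: d - 7,   # refer to the same weekday one week before
}

for name, M in strategies.items():
    assert M(100) < 100       # non-anticipative: M(d) < d
```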
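The temperature expert's search can be sketched as an argmin over past days of the sup distance between indicator curves; the array shapes and data below are illustrative, not the study's.

```python
import numpy as np

# Temperature indicators per day: (days, indicators, half-hours), synthetic.
rng = np.random.default_rng(1)
T_ind = rng.normal(15.0, 5.0, size=(200, 6, 48))

def closest_day(d, T_ind):
    past = T_ind[:d]                                      # strictly earlier days only
    dist = np.max(np.abs(past - T_ind[d]), axis=(1, 2))   # sup over k and i
    return int(np.argmin(dist))

m = closest_day(150, T_ind)
```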
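The two error measures above translate directly; a minimal sketch with a toy day of three points:

```python
import numpy as np

def mape(y, yhat):
    # mean absolute percentage error over the day
    return np.mean(np.abs(yhat - y) / np.abs(y))

def mise(y, yhat):
    # mean squared error over the day
    return np.mean((yhat - y) ** 2)

y = np.array([100.0, 200.0, 400.0])
yhat = np.array([110.0, 190.0, 400.0])
# mape(y, yhat) = (0.10 + 0.05 + 0.0) / 3 = 0.05
```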
34 Prediction evaluation. [Table: MAPE errors (columns mean, med, min, max) for the different strategies for intraday forecasting (M, L5): Naive, Apx, tm1, tm7, T, Tm, T/N, Tm/N, T/G, T/d, T/c, Ns/G, N/d, N/c]
35 Aggregation of predictors: exponential weights (inspired by various theoretical results, see Lecue, Rigollet, Stoltz, Tsybakov, ...): Ytilde_d^wgt = sum_{m=1}^M w_d^m Ytilde_d^m / sum_{m=1}^M w_d^m, with w_d^M = exp(-(1/(T theta)) sum_{i=1}^T |Ytilde_d^M(i) - Y_t(i)|^2). theta is a parameter (often called temperature in physical applications), T = Tpred.
36 Forecasting (MAPE = 0.7%). [Figure: observed curve Y, sparse approximation Yapx and forecast beyond tpred]
37 Sparse approximation on the learning set. For each day t of the learning set, Z_t = [P_t M_t] = [C_t B_t [T]_t [N]_t [W]_t] is the concatenation of calendar variables and climate variables. We build an approximation Yhat_t of the (observed) signal Y_t using Z_t: Yhat_t = G_t(Z_t) = Z_t betahat_t.
38 High dimensional linear models: Y = X beta + epsilon. beta in R^p is the unknown parameter (to be estimated). epsilon = (epsilon_1, ..., epsilon_n) is a (non observed) vector of random errors, assumed to be i.i.d. N(0, sigma^2) variables. X is a known n x p matrix. High dimension: p >> n.
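The exponential-weight aggregation can be sketched in a few lines; the expert forecasts and errors below are toy values, and the weighting follows the formula above (weight decaying with past squared error, temperature theta).

```python
import numpy as np

def aggregate(preds, sq_errors, theta=1.0):
    # preds: (M, n) expert forecasts; sq_errors: (M,) past mean squared errors
    w = np.exp(-sq_errors / theta)      # exponential weights
    w = w / w.sum()                     # normalize
    return w @ preds

preds = np.array([[1.0, 2.0], [3.0, 4.0]])
sq_errors = np.array([0.0, 100.0])      # second expert performed much worse
agg = aggregate(preds, sq_errors)       # essentially the first expert's forecast
```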
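One way to compute the sparse per-day coefficients betahat_t is an l1 (lasso) penalty, one of the penalizations the talk lists; a minimal sketch with a synthetic design standing in for Z_t:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 48, 371                         # 48 half-hours, 371 candidate regressors
Z = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[3, 40, 200]] = [2.0, -1.5, 1.0]  # only three variables truly matter
Y = Z @ beta + rng.normal(0.0, 0.1, n)

# l1-penalized least squares selects a sparse beta_hat for this "day".
model = Lasso(alpha=0.1).fit(Z, Y)
n_active = int(np.sum(model.coef_ != 0))
```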
39 Sparsity conditions
40 Penalization for sparsity. Many penalizations have been introduced historically in the regression framework (to put identification constraints on beta). Ridge: E(beta, lambda) = ||Y - X beta||^2 + lambda sum_j beta_j^2. Lasso: E(beta, lambda) = ||Y - X beta||^2 + lambda sum_j |beta_j|. SCAD: E(beta, lambda) = ||Y - X beta||^2 + lambda sum_j w_j g(beta_j). Two-step thresholding procedures. Many others. The goal is to choose the relevant variables, for each day.
41 Sparse approximation on the learning set. Sparse approximation of each consumption day on a learning set of days ( ), using the set of potentially explanatory variables. For each day t of the learning set, we build an approximation Yhat_t of the (observed) signal Y_t with the help of the set of explanatory variables Z_t: Yhat_t = G_t(Z_t) = Z_t betahat_t. ( ) Sparse Approximation and Knowledge Extraction for Electrical Consumption Signals, 2012, M. Mougeot, D. P., K. Tribouley, V. Lefieux, L. Teyssier-Maillard
42 Forecasting (MAPE = 0.7%). [Figure: observed curve Y, sparse approximation Yapx and forecast beyond tpred]
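For an orthonormal design the ridge and lasso penalties act coordinatewise on the least-squares coefficients, which makes their contrast concrete: ridge shrinks every coefficient proportionally, the lasso soft-thresholds and sets small ones exactly to zero. A minimal numpy sketch of the generic coordinatewise formulas (not tied to the load data):

```python
import numpy as np

def ridge_coord(b, lam):
    # argmin over beta of (b - beta)^2 + lam * beta^2  ->  proportional shrinkage
    return b / (1.0 + lam)

def lasso_coord(b, lam):
    # argmin over beta of (b - beta)^2 + lam * |beta|  ->  soft thresholding
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)

b = np.array([3.0, 0.2, -1.0])
r = ridge_coord(b, 1.0)        # [1.5, 0.1, -0.5]: everything shrinks
l = lasso_coord(b, 1.0)        # [2.5, 0.0, -0.5]: small coefficient set to zero
```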
43 Winter forecast. Figure: forecast (solid blue line) and observed (dashed dark line) electrical consumption for a winter week, from Monday February 1st to Sunday February 7th.
44 Spring forecast. Figure: forecast (solid blue line) and observed (dashed dark line) electrical consumption for a spring week, from Monday June 14th to Sunday June 21st.
45 Wind farm modeling ( ) Statistical learning for wind power : A modeling and stability study towards forecasting, Wind Energy, 2017, A. Fisher, L. Montuelle, M. Mougeot, D. P.
46 Wind farm modeling: data. 3 farms in the North and East of France; 4 to 6 turbines per farm; 8.2 to 12.3 MW per farm. On each turbine, we measure: power; 2 wind speed measures plus the average; wind direction, coded with cos and sin; temperature; state of the turbine (on/off/starting). Three additional variables: variance of the wind speed; variance of the wind direction.
47 Aim: statistical learning for wind power. Can a statistical model be of interest compared to a physically based model? In a forecast view, what is the effect on the modeling of the lack of precision on the meteorological variables?
48 Modeling. Models inspired by physics: Persistence: Y_t = Y_{t-1}; turbine builder's power curve. Parametric models: Linear regression: Y_t = a_0 + a_1 W_t. Logistic regression: Y_t = C / (1 + exp(a_0 + a_1 W_t)). Polynomial logistic regression: Y_t = C / (1 + exp(a_0 + a_1 W_t + a_2 W_t^2 + a_3 W_t^3)). LASSO. Statistical learning models: CART, Bagging+CART, Random Forest, Support Vector Machine, k-Nearest Neighbors.
49 CART (Breiman et al. 1984). Grow a binary tree: choose the variable that provides the best cut; criterion: intra-leaf variance. Prune the tree to avoid over-fitting: cross-validation, minimize MSE. Prediction: mean of the leaf.
50 Random Forest (Breiman 2001). Bootstrap one sample per tree to be grown. Grow a binary tree: choose randomly m variables among the variables, calculate the best split; criterion: minimize the variance of the children. Prediction of each tree: mean of the leaf. Average the trees.
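A minimal sketch of the statistical-learning side: a single regression tree and a random forest fitted on a synthetic logistic power-curve relation (the data and hyperparameters are illustrative, not the study's).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
W = rng.uniform(0, 25, size=(1000, 1))                       # wind speed [m/s]
y = 2000.0 / (1.0 + np.exp(6.0 - 0.8 * W[:, 0])) + rng.normal(0, 20, 1000)

cart = DecisionTreeRegressor(max_depth=5).fit(W, y)          # single depth-limited tree
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(W, y)

pred = forest.predict([[20.0]])[0]                           # near the rated power
```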
51 Real time modeling One model per turbine Prediction of the farm power Training on around 8000 instant-points Test on 10 data sets of 724 points (around 15 days)
52 [Table: RMSE for Persistence, Linear Reg, Logistic Reg, Poly Log Reg, Lasso, CART, CART+Bagging, RF, SVM, kNN]
53 [Figure: boxplots of errors for Linear Reg, Logistic Reg, Poly Log Reg, Lasso, CART, CART+Bagging, RF, SVM, kNN]
54 Segmentation of French territory ( ) Homogeneous climate regions using learning algorithms 2018, submitted, M. Mougeot, D. P., V. Lefieux, M. Marchand
55 Segmentation for 2001, 2007, 2014 daily data. [Figure: Temperature maps, Kmeans 5 with thresholding 0.05 (dicota) for 2001, 2007 and 2014; Wind maps, Kmeans 4 (2001, 2007) and Kmeans 5 (2014) with the same thresholding]
56 3 steps. There are generally 3 important steps (in fact linked): 1 Representation of the data 2 Smoothing 3 Selection procedure
57 Representation of the data Ad-hoc representations PCA Functional representation Kernel clustering Spectral clustering...
58 Natural-time aggregation smoothing. The data are observed hourly. It is commonly admitted to take: 1 the average over a day: daily observed data, T = 365 for one year; 2 the average over a week: weekly observed data, T = 52 for one year; 3 the average over a month: monthly observed data, T = 12 for one year. [Figure: daily temperature at Brest, 2014]
59 PCA reduction. Projection of the observations using a data-driven orthonormal basis. X: centered data matrix (n, d), n = 259, d >> n large. The feature matrix (n, T) is computed by projection, T << d: Z = X U_T, where U_T is the matrix defined by the first T eigenvectors of Sigma, the variance-covariance matrix. T is chosen so that (lambda_1 + ... + lambda_T) / sum_j lambda_j >= kappa_pca (0.95). Global linear method involving all the n = 259 spots to compute U_T. Is U_T similar between years?
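The three natural-time aggregations can be sketched with pandas on a synthetic hourly series (dates and values are illustrative):

```python
import numpy as np
import pandas as pd

# One year of hourly observations (synthetic values).
idx = pd.date_range("2014-01-01", periods=365 * 24, freq="h")
hourly = pd.Series(np.arange(365 * 24, dtype=float), index=idx)

daily = hourly.resample("D").mean()            # T = 365 values for one year
weekly = hourly.resample("W").mean()           # T ~ 52 values
monthly = hourly.groupby(idx.month).mean()     # T = 12 values
```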
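The PCA reduction with the 95% variance rule can be sketched via an SVD of the centered data matrix; the data below are synthetic and the dimension d is an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(259, 2000))            # n = 259 spots, d = 2000 (illustrative)
X = X - X.mean(axis=0)                      # center the data matrix

# Singular values squared are the eigenvalues of the variance-covariance matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
T = int(np.searchsorted(ratio, 0.95)) + 1   # smallest T reaching 95% of the variance

Z = X @ Vt[:T].T                            # feature matrix (n, T), T << d
```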
61 Functional smoothing. Data are (in fact) functions of time, regularly spaced: X_t^i = f^i(t/d) + epsilon_t^i, with f^i unknown, epsilon_t^i ~ N(0, sigma^2), t = 1, ..., d. Nonparametric estimation of f^i: fhat^i = sum_{l=1}^T betahat_l^i g_l, with D = {g_1, ..., g_p} a dictionary of functions. With the coefficients ordered so that |betahat^i_(1)| >= ... >= |betahat^i_(n)|, we note Xhat^i_{j0} = sum_{j=1}^{j0} betahat^i_(j) g_(j). How to choose T? Take j0 such that ||Xhat^i(j0)||^2 / ||X^i||^2 >= tau_NP (= 0.95). [Figure: temperature in Paris, May]
62 Selection procedure: k-means clustering. Choose k, the number of clusters. Find the argmin (in C_1, ..., C_k) of sum_{r=1}^k sum_{j in C_r} ||Y_j - Ybar_r||^2, where Ybar_r = (1/|C_r|) sum_{j in C_r} Y_j; equivalently, sum_{r=1}^k (1/(2|C_r|)) sum_{j, j' in C_r} ||Y_j - Y_{j'}||^2. Other methods exist.
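The dictionary fit with the 95% energy rule can be sketched with an orthonormal cosine dictionary; the dictionary choice, grid size and signal below are illustrative assumptions, not the study's dictionary.

```python
import numpy as np

d = 96
k = np.arange(1, 20)[:, None]
n_grid = np.arange(d)[None, :]
# Orthonormal DCT-II style cosine atoms plus the constant function.
G = np.vstack([np.ones((1, d)) / np.sqrt(d),
               np.sqrt(2.0 / d) * np.cos(np.pi * k * (2 * n_grid + 1) / (2 * d))])

rng = np.random.default_rng(5)
x = 5.0 * G[1] + 2.0 * G[3] + rng.normal(0.0, 0.05, d)   # two atoms + small noise

beta = G @ x                                  # coefficients (rows are orthonormal)
order = np.argsort(-np.abs(beta))             # sort |beta| in decreasing order
energy = np.cumsum(beta[order] ** 2) / np.sum(x ** 2)
j0 = int(np.searchsorted(energy, 0.95)) + 1   # keep atoms until 95% of the energy
x_hat = beta[order[:j0]] @ G[order[:j0]]      # sparse reconstruction
```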
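The k-means step on the reduced feature matrix can be sketched as follows; the two well-separated synthetic groups stand in for climate regions, and k = 2 is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# Two well-separated synthetic groups of "stations" in feature space.
A = rng.normal(0.0, 0.3, size=(130, 10))
B = rng.normal(5.0, 0.3, size=(129, 10))
Z = np.vstack([A, B])                          # (259, T) feature matrix, T = 10

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)
labels = km.labels_                            # cluster label of each station
```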
63 Segmentation results : stability Number of regions stability Geographical stability Between years stability Wind-temperature stability
64 Thank you for your attention
More informationSolar irradiance forecasting for Chulalongkorn University location using time series models
Senior Project Proposal 2102499 Year 2016 Solar irradiance forecasting for Chulalongkorn University location using time series models Vichaya Layanun ID 5630550721 Advisor: Assist. Prof. Jitkomut Songsiri
More informationSystems Operations. PRAMOD JAIN, Ph.D. Consultant, USAID Power the Future. Astana, September, /6/2018
Systems Operations PRAMOD JAIN, Ph.D. Consultant, USAID Power the Future Astana, September, 26 2018 7/6/2018 Economics of Grid Integration of Variable Power FOOTER GOES HERE 2 Net Load = Load Wind Production
More informationStats 170A: Project in Data Science Predictive Modeling: Regression
Stats 170A: Project in Data Science Predictive Modeling: Regression Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine Reading,
More informationThat s Hot: Predicting Daily Temperature for Different Locations
That s Hot: Predicting Daily Temperature for Different Locations Alborz Bejnood, Max Chang, Edward Zhu Stanford University Computer Science 229: Machine Learning December 14, 2012 1 Abstract. The problem
More informationThe prediction of house price
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationMachine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression
Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationMinwise hashing for large-scale regression and classification with sparse data
Minwise hashing for large-scale regression and classification with sparse data Nicolai Meinshausen (Seminar für Statistik, ETH Zürich) joint work with Rajen Shah (Statslab, University of Cambridge) Simons
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationINTRODUCTION TO DATA SCIENCE
INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #13 3/9/2017 CMSC320 Tuesdays & Thursdays 3:30pm 4:45pm ANNOUNCEMENTS Mini-Project #1 is due Saturday night (3/11): Seems like people are able to do
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationDavid John Gagne II, NCAR
The Performance Impacts of Machine Learning Design Choices for Gridded Solar Irradiance Forecasting Features work from Evaluating Statistical Learning Configurations for Gridded Solar Irradiance Forecasting,
More informationRemote Ground based observations Merging Method For Visibility and Cloud Ceiling Assessment During the Night Using Data Mining Algorithms
Remote Ground based observations Merging Method For Visibility and Cloud Ceiling Assessment During the Night Using Data Mining Algorithms Driss BARI Direction de la Météorologie Nationale Casablanca, Morocco
More informationl 1 and l 2 Regularization
David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about
More informationMachine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015
Machine Learning Regression basics Linear regression, non-linear features (polynomial, RBFs, piece-wise), regularization, cross validation, Ridge/Lasso, kernel trick Marc Toussaint University of Stuttgart
More information1. (Rao example 11.15) A study measures oxygen demand (y) (on a log scale) and five explanatory variables (see below). Data are available as
ST 51, Summer, Dr. Jason A. Osborne Homework assignment # - Solutions 1. (Rao example 11.15) A study measures oxygen demand (y) (on a log scale) and five explanatory variables (see below). Data are available
More informationAnalysis of Fast Input Selection: Application in Time Series Prediction
Analysis of Fast Input Selection: Application in Time Series Prediction Jarkko Tikka, Amaury Lendasse, and Jaakko Hollmén Helsinki University of Technology, Laboratory of Computer and Information Science,
More informationNonconcave Penalized Likelihood with A Diverging Number of Parameters
Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized
More informationFORECASTING: A REVIEW OF STATUS AND CHALLENGES. Eric Grimit and Kristin Larson 3TIER, Inc. Pacific Northwest Weather Workshop March 5-6, 2010
SHORT-TERM TERM WIND POWER FORECASTING: A REVIEW OF STATUS AND CHALLENGES Eric Grimit and Kristin Larson 3TIER, Inc. Pacific Northwest Weather Workshop March 5-6, 2010 Integrating Renewable Energy» Variable
More informationSmoothed Prediction of the Onset of Tree Stem Radius Increase Based on Temperature Patterns
Smoothed Prediction of the Onset of Tree Stem Radius Increase Based on Temperature Patterns Mikko Korpela 1 Harri Mäkinen 2 Mika Sulkava 1 Pekka Nöjd 2 Jaakko Hollmén 1 1 Helsinki
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More informationBayesian Grouped Horseshoe Regression with Application to Additive Models
Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne
More informationCHAPTER 6 CONCLUSION AND FUTURE SCOPE
CHAPTER 6 CONCLUSION AND FUTURE SCOPE 146 CHAPTER 6 CONCLUSION AND FUTURE SCOPE 6.1 SUMMARY The first chapter of the thesis highlighted the need of accurate wind forecasting models in order to transform
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationDimension Reduction Methods
Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.
More informationGreene, Econometric Analysis (7th ed, 2012)
EC771: Econometrics, Spring 2012 Greene, Econometric Analysis (7th ed, 2012) Chapters 2 3: Classical Linear Regression The classical linear regression model is the single most useful tool in econometrics.
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationpeak half-hourly South Australia
Forecasting long-term peak half-hourly electricity demand for South Australia Dr Shu Fan B.S., M.S., Ph.D. Professor Rob J Hyndman B.Sc. (Hons), Ph.D., A.Stat. Business & Economic Forecasting Unit Report
More informationIs the test error unbiased for these programs? 2017 Kevin Jamieson
Is the test error unbiased for these programs? 2017 Kevin Jamieson 1 Is the test error unbiased for this program? 2017 Kevin Jamieson 2 Simple Variable Selection LASSO: Sparse Regression Machine Learning
More informationSparse Gaussian conditional random fields
Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian
More informationCourse in Data Science
Course in Data Science About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst. The course gives an
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Matrix Notation Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them at end of class, pick them up end of next class. I need
More informationBayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson
Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n
More informationLectures in AstroStatistics: Topics in Machine Learning for Astronomers
Lectures in AstroStatistics: Topics in Machine Learning for Astronomers Jessi Cisewski Yale University American Astronomical Society Meeting Wednesday, January 6, 2016 1 Statistical Learning - learning
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More information1 Handling of Continuous Attributes in C4.5. Algorithm
.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous
More information