Multi-Plant Photovoltaic Energy Forecasting Challenge with Regression Tree Ensembles and Hourly Average Forecasts

Kathrin Bujna¹ and Martin Wistuba²
¹ Paderborn University
² IBM Research Ireland

Abstract. In this paper, we present the winning solution to the ECML-PKDD Discovery Challenge 2017 on Multi-Plant Photovoltaic (PV) Energy Forecasting. The goal of the challenge is to use historic data on meteorological conditions and energy production of three different PV plants in Italy in order to forecast their energy production. A major problem is that the data contains many missing values for most sensors, especially for the period of the year for which predictions have to be made in the subsequent year. We investigate two approaches: a regression tree ensemble whose hyperparameters were tuned via Bayesian optimization, and a simple rule that just predicts the hourly averages observed during the previous year. The latter approach is the winning solution of the challenge.

1 Introduction

Due to the urgent need to reduce pollutant emissions, clean and renewable energy sources have become a strategic sector for the European Union and internationally. In particular, photovoltaic (PV) power plants, which are already widely distributed in Europe, are becoming a major source of renewable energy. The main challenges faced by the renewable energy market are grid integration, load balancing, and energy trading. In order to face these challenges, we need reliable tools for predicting the energy production.

Challenge. In the ECML PKDD 2017 Multi-Plant Photovoltaic Energy Forecasting Challenge, we are given multiple time series regarding meteorological conditions and production, collected by sensors at three closely located PV plants in Italy. Given the data for the period from January to December 2012, the goal is to forecast the energy production of each plant for the period from January to March 2013.

Related Work. [1] assess several approaches for PV power forecasting. In contrast to the given challenge, the authors consider a larger number of features, for instance, the weather forecast provided by numerical weather prediction systems and the geographic coordinates of the plants. Moreover, they consider data collected from 18 power plants instead of just 3 plants.

2 Task

For detailed information on the given data, we refer to [1].

Time Series. The given data set consists of time series regarding weather conditions and production, collected by sensors at three closely located PV plants in Italy. The elements of these time series are hourly aggregations obtained as the average of all measures available during a specific hour of a specific day of the year. That is, for each plant, day, and sensor, we are given a time series of 19 values representing hourly aggregated observations (the plants are active from 2am to 8pm). The data was gathered locally, from sensors available on the plants and from the closest meteorological stations. Three time series describe the plant: irradiance, temperature, and power production. Seven time series describe the weather: temperature, cloudcover, dewpoint, humidity, pressure, windbearing, and windspeed.

Train and Test. The given training data spans a period of 12 months (year 2012) and includes the target time series, i.e. the power production observed for each plant. The testing data consists of 3 months (January to March 2013) for which the target time series is not provided. According to the challenge website, min-max scaling (between 0 and 1) of the power variables needs to be performed on the training and testing data before applying the prediction model.

Additional Information. According to the challenge website, the data can contain a reasonable amount of missing values and outliers. Missing values are indicated by zero measurements. Besides that, the PV plants have a maximal power rate of 1 000 kW/h.
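As an illustration of the required scaling (not part of the original submission), the following Python sketch applies a min-max scaling to a power series; the column handling and the choice of 1 000 kW/h as the upper bound are assumptions based on the stated maximal power rate.

```python
import pandas as pd

def minmax_scale(series: pd.Series, lo: float = 0.0, hi: float = 1000.0) -> pd.Series:
    """Min-max scale a power series to [0, 1].

    We assume the plants' stated maximal power rate (1000) as the upper bound;
    the organisers may instead intend the per-series minimum and maximum.
    """
    return ((series - lo) / (hi - lo)).clip(0.0, 1.0)

# Usage (hypothetical column name "power"):
# train["power"] = minmax_scale(train["power"])
```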

3 Exploration

Some of the power measurements are much higher than the maximal power rate of 1 000 kW/h (e.g. 885 124 kW/h). Therefore, in the following figures, we clean the power measurements by setting all power values larger than 1 000 kW/h to exactly 1 000 kW/h.

3.1 Single Time Series vs. Power Production Time Series

In Fig. 1 we consider the impact of the single variables on the power production and, additionally, the average monthly and hourly power production. The results confirm our expectations: First and foremost, we expect the power production to be higher during summer, during the middle of the day, and while the plant irradiance is higher. We expect that, if the weather is sunny, then the plant irradiance is higher. We expect sunny weather if the weather temperature is higher, the pressure is rather high, there are few clouds, and the humidity is low.

Most notably, there is a strong correlation between the plant irradiance and the power production. The correlation of the power production with the weather temperature, pressure, cloud cover, and humidity is less striking. Besides that, in Fig. 2, we see that these variables also correlate with the plant irradiance. The dewpoint, the wind speed, and the wind bearing do not seem to be correlated with the power production. One might suspect that the informativeness of the wind speed and bearing depends on the location. However, except for the wind bearing measured at the plant with index 1 (see Fig. 1), these location-dependent figures are similar to the averages (over all plants) depicted in Fig. 1. Last but not least, there is apparently a significant number of missing values, which have been set to zero by default (see Sec. 2).

3.2 Daily Measurements of the Plant Sensors

We considered the plant irradiance and the plant temperature the most promising variables. Fig. 3 depicts the daily average values of these variables during the year 2012.

For plant 1, a lot of the plant irradiance and plant temperature measurements are probably missing between day 30 and day 58. During this interval, both measurements are constantly 0. Besides that, we observe that the power of plant 1 drops significantly during the very same time. That is, in comparison to the other plants, we would expect roughly 150 kW/h, but observe about 100 kW/h.

Regarding plant 2, there seem to be even more missing values in the first 90 days of the year 2012: during these days, the plant irradiance and plant temperature measurements are constantly 0. Besides that, the plant power measurements behave as expected. Still, there seems to be a slight drop in the power production between day 25 and 50.

Finally, consider the measurements regarding plant 3. Here, there seem to be no missing measurements. However, we observe a strange behavior of the plant irradiance measurements: approximately between day 50 and day 150, the plant irradiance increases much faster than the irradiance measured for plants 1 and 2. At day 150, the measurements drop back to values similar to those measured for plants 1 and 2. Again, there seems to be a slight drop in the power production between day 25 and 50.

To sum up, the time series of all three plants exhibit unusual behavior in the first 90 days of the year 2012. Unfortunately, the test data consists of measurements for the first 90 days of the subsequent year. On a side note, the test data does not seem to contain missing plant irradiance or plant temperature measurements.
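To make the exploration in Sec. 3.2 concrete, here is a minimal sketch (not the authors' code) of how the daily averages shown in Fig. 3 could be computed; the column names plant_id, day_of_year, power, plant_irradiance, and plant_temperature are hypothetical.

```python
import pandas as pd

def daily_plant_averages(df: pd.DataFrame) -> pd.DataFrame:
    """Average power, irradiance, and temperature per plant and day of the
    year, i.e. the kind of aggregation depicted in Fig. 3."""
    cols = ["power", "plant_irradiance", "plant_temperature"]
    return df.groupby(["plant_id", "day_of_year"])[cols].mean()
```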

Fig. 1. Relation between single variables and the power production of the plants. Note that we collected the measurements for all three plants. In the plots depicting the plant irradiance, wind speed, wind bearing, dewpoint, and pressure, we rounded the measured values to obtain reasonable box plots.

Fig. 2. Relation between single variables and the plant irradiance. Note that we collected the measurements for all three plants.

Fig. 3. Daily average measurements of plant power, plant irradiance, and plant temperature per plant.

4 Data Preprocessing

From Sec. 3, we know that the data set most probably contains a lot of missing values, which are indicated by zero measurements, as well as outliers.

Missing Data. Recall that missing values are indicated by zero measurements. However, a zero measurement does not necessarily indicate a missing value. Therefore, we did not mark all zero measurements as missing: In the plant irradiance time series, we marked a zero measurement as missing if, at the same time point, the plant temperature was also zero (i.e. we marked 2577 of 7921 zero measurements as missing; cf. the exploration of the data regarding plant 2 in Sec. 3.2). After that, we marked all zero measurements in the plant temperature and weather pressure time series as missing.

Outlier Removal. First, we removed unusual values. In the given training data, the minimum plant temperature is −3 °C, while in the test data we observe plant temperature measurements of down to −202 °C. This is why we decided to remove (i.e. mark as missing) all plant temperature measurements of less than −3 °C from the test data. Second, we used the outlier removal proposed by [1]. That is, given a time series $(v_t)_{t=1,\dots,T}$ (i.e. the plant irradiance time series), we consider the value $v_t$ at time $t$ an outlier if $v_t \notin [\bar{v} - 4\sigma_v,\, \bar{v} + 4\sigma_v]$, where $\bar{v} = \frac{1}{T}\sum_{t=1}^{T} v_t$ denotes the sample mean and $\sigma_v^2 = \frac{1}{T-1}\sum_{t=1}^{T} (v_t - \bar{v})^2$ denotes the (unbiased) sample variance. To be precise, we conducted this outlier removal separately for the time series given by the training data and by the test data, respectively.

Normalization of Power Measurements. After our outlier removal, there were still some power measurements larger than the maximal power value of 1 000 kW/h. We simply replaced these measurements with the maximal power value (i.e. 1 000 kW/h).

Additional Date Features. We created one training and one test data set that contain the plant ids, the plant sensor features, the weather features, and, additionally, three date features. The date features indicate the hour of the day (values 1, ..., 19), the day of the year (values 1, ..., 366), and the month (values 1, ..., 12).
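The following Python sketch illustrates these preprocessing steps. It is not the authors' code, and the column names (plant_irradiance, plant_temperature, weather_pressure, power) are assumptions; the date features are assumed to be added elsewhere.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, is_test: bool = False) -> pd.DataFrame:
    """Illustrative preprocessing following Sec. 4 (column names are assumed)."""
    df = df.copy()

    # Missing data: a zero irradiance counts as missing only if the plant
    # temperature is also zero at the same time point.
    both_zero = (df["plant_irradiance"] == 0) & (df["plant_temperature"] == 0)
    df.loc[both_zero, "plant_irradiance"] = np.nan
    for col in ["plant_temperature", "weather_pressure"]:
        df.loc[df[col] == 0, col] = np.nan

    # Unusual values: in the test data, mark plant temperatures below the
    # training minimum of -3 degrees Celsius as missing.
    if is_test:
        df.loc[df["plant_temperature"] < -3, "plant_temperature"] = np.nan

    # Outlier removal as in [1]: values outside mean +/- 4 standard deviations.
    v = df["plant_irradiance"]
    mean, std = v.mean(), v.std(ddof=1)
    outliers = (v < mean - 4 * std) | (v > mean + 4 * std)
    df.loc[outliers, "plant_irradiance"] = np.nan

    # Cap power measurements at the maximal power rate (training data only,
    # as the test data contains no power column).
    if "power" in df.columns:
        df["power"] = df["power"].clip(upper=1000)

    return df
```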

5.1 Training Data Splits

In order to estimate the performance of our approaches, we split the training data into two parts: The first split $T_{1\text{-}23}$ contains all data from the training data set concerning the first 23 days of each month in 2012. The second split $T_{24+}$ contains the remaining data, i.e. all training data from the 24th day through the last day of each month in 2012. In the following, we train our models using the larger split $T_{1\text{-}23}$ and evaluate the resulting models with respect to $T_{24+}$.

Our motivation for these splits is twofold: First of all, we want to ensure that enough information about each month is contained in each split. Second, we want the splits to correspond to rather independent time series. In other words, we want the number of measurements $v_t$ for which the subsequent measurement $v_{t+1}$ is contained in a different split than $v_t$ to be as small as possible. This property would be violated if, for example, we assigned each measurement to one of the splits uniformly at random.

5.2 Approaches: Regression Tree Ensembles and Hourly Averages

Our first approach is to use gradient boosted regression trees (GBRT) [2], which can handle heterogeneous data with missing values. To tune the hyperparameters of the GBRT technique, we apply a black-box optimization method [3]. That is, we apply Bayesian optimization to find the hyperparameters $\lambda$ which minimize the RMSE, on the validation split $T_{24+}$, of gradient boosted regression trees trained with hyperparameters $\lambda$ on the training split $T_{1\text{-}23}$.

As the resulting regression tree ensemble predicted rather strange time series (regarding the testing data), we also tried the following simple baseline. Our second approach is to predict hourly average values: For each plant $p$, hour of the day $h$, and month $m$, we computed the average power production of plant $p$ during month $m$ at this specific hour $h$ of the day. For instance, our prediction of the power production of plant 1 for February 1st between 1pm and 2pm is the average power production of plant 1 between 1pm and 2pm in February 2012.

5.3 Evaluation

First, let us consider the performance of our models with respect to the validation split $T_{24+}$: Our regression tree ensemble yields a root mean squared error (RMSE) of 0.0685. In comparison, our hourly average approach performs poorly and only achieves an RMSE of 0.1297. However, our models perform differently with respect to the testing data. There, our simple hourly average approach turned out to be the winning approach. Tab. 1 shows the temporary and the final leaderboard, which have been published on the challenge website. Our simple hourly average approach already achieved the second best score on the temporary leaderboard, which is based on 10% of the testing data. On the complete testing data, the hourly average approach outperformed all other competing solutions.
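A minimal sketch of the training/validation split and of the winning hourly average baseline is given below. It is not the authors' implementation, and the column names (plant_id, month, hour, day_of_month, power) are hypothetical.

```python
import pandas as pd

def split_train_validation(train: pd.DataFrame):
    """Split the 2012 training data into T_1-23 (days 1-23 of each month)
    and T_24+ (the remaining days of each month), as in Sec. 5.1."""
    mask = train["day_of_month"] <= 23
    return train[mask], train[~mask]

def hourly_average_forecast(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Predict each test row's power as the 2012 average power of the same
    plant, at the same hour of the day, in the same month."""
    averages = (
        train.groupby(["plant_id", "month", "hour"])["power"]
        .mean()
        .rename("prediction")
        .reset_index()
    )
    merged = test.merge(averages, on=["plant_id", "month", "hour"], how="left")
    return merged["prediction"]
```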

Table 1. Leaderboards: The testing data consists of 3 months (January to March 2013) for which the target time series (power) is not provided. The temporary leaderboard is obtained with respect to only 10% of the testing data, while the final leaderboard takes into account the complete testing data. Only the best submission per team is listed. Our hourly average approach is marked with an asterisk (*).

Temporary leaderboard           Final leaderboard
Pos.   RMSE                     Pos. (prev.)   RMSE
 1     0.16697563293016          1 (2) *       0.20797269178412
 2 *   0.16864488931773          2 (4)         0.22345258018878
 3     0.17726390671562          3 (6)         0.22633730458759
 4     0.18583154100811          4 (3)         0.25383336655979
 5     0.19462844662109          5 (1)         0.25432166920657
 6     0.21447590228825          6 (10)        0.26261669303547
 7     0.22291339472308          7 (7)         0.26687963564708
 8     0.22307448875416          8 (9)         0.27442245421399
 9     0.22404629864761          9 (8)         0.27587241955111
10     0.2267988609685          10 (11)        0.28524305512214
11     0.23031635550655         11 (5)         0.30189592552655

5.4 Conclusion

As explained above, the given training data contained many missing values, especially during the time of the year for which we have to predict the power production in the subsequent year (see Sec. 3.2). We expect that, given cleaner and more complete data, a machine learning approach will outperform our simple hourly average approach.

References

1. Ceci, M., Corizzo, R., Fumarola, F., Malerba, D., Rashkovska, A.: Predictive modeling of PV energy production: How to set up the learning task for a better prediction? IEEE Transactions on Industrial Informatics 13(3), 956-966 (June 2017)
2. Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785-794. ACM, New York, NY, USA (2016), http://doi.acm.org/10.1145/2939672.2939785
3. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, Nevada, United States, pp. 2960-2968 (2012), http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms