Service & Repair Demand Forecasting Timothy Wong (Senior Data Scientist, Centrica plc)

Similar documents
How to deal with non-linear count data? Macro-invertebrates in wetlands

Model checking overview. Checking & Selecting GAMs. Residual checking. Distribution checking

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Exercise 5.4 Solution

Checking, Selecting & Predicting with GAMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

COMPARING PARAMETRIC AND SEMIPARAMETRIC ERROR CORRECTION MODELS FOR ESTIMATION OF LONG RUN EQUILIBRIUM BETWEEN EXPORTS AND IMPORTS

Logistic Regressions. Stat 430

Generalized Additive Models

Generalized Additive Models

Poisson Regression. The Training Data

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Statistical Prediction

Linear Regression Models P8111

PAPER 206 APPLIED STATISTICS

Generalized linear models

Stat 4510/7510 Homework 7

Interactions in Logistic Regression

Generalized Linear Models

Log-linear Models for Contingency Tables

SCW6-Doc05. CPUE standardization for the offshore fleet fishing for Jack mackerel in the SPRFMO area

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Generalized Additive Models (GAMs)

Course in Data Science

R Hints for Chapter 10

Logistic Regression - problem 6.14

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Prashant Pant 1, Achal Garg 2 1,2 Engineer, Keppel Offshore and Marine Engineering India Pvt. Ltd, Mumbai. IJRASET 2013: All Rights are Reserved 356

mgcv: GAMs in R Simon Wood Mathematical Sciences, University of Bath, U.K.

Homework 10 - Solution

Classification. Chapter Introduction. 6.2 The Bayes classifier

Reaction Days

9 Generalized Linear Models

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017

WEATHER NORMALIZATION METHODS AND ISSUES. Stuart McMenamin Mark Quan David Simons

Recap. HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis:

BMI 541/699 Lecture 22

Checking the Poisson assumption in the Poisson generalized linear model

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

CP:

Forecasting the electricity consumption by aggregating specialized experts

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Stat 587: Key points and formulae Week 15

Falkland Island Fisheries Department. Castelo (ZDLT1) Falkland Islands. Dates 9/02/ /02/ May

Statistical Methods for Forecasting

Aedes egg laying behavior Erika Mudrak, CSCU November 7, 2018

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

CWV Review London Weather Station Move

Short Term Load Forecasting Using Multi Layer Perceptron

Hurricane Forecasting Using Regression Modeling

Generalised linear models. Response variable can take a number of different formats

Dennis Bricker Dept of Mechanical & Industrial Engineering The University of Iowa. Forecasting demand 02/06/03 page 1 of 34

R Output for Linear Models using functions lm(), gls() & glm()

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Stat 8053, Fall 2013: Multinomial Logistic Models

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Modeling Overdispersion

Stat 401B Final Exam Fall 2016

Frequency Forecasting using Time Series ARIMA model

Short-term management of hydro-power systems based on uncertainty model in electricity markets

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear.

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Section 9.4. Notation. Requirements. Definition. Inferences About Two Means (Matched Pairs) Examples

Exponential Smoothing. INSR 260, Spring 2009 Bob Stine

STA 450/4000 S: January

cor(dataset$measurement1, dataset$measurement2, method= pearson ) cor.test(datavector1, datavector2, method= pearson )

Chart types and when to use them

Linear Regression Model. Badr Missaoui

SHORT TERM LOAD FORECASTING

Lecture 9 STK3100/4100

Electricity Demand Forecasting using Multi-Task Learning

Non-Gaussian Response Variables

Using R in 200D Luke Sonnet

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph.

Generalized Linear Models

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article

Learning with multiple models. Boosting.

Forecasting: principles and practice. Rob J Hyndman 1.1 Introduction to Forecasting

Chapter 7 Forecasting Demand

Time Series and Forecasting

A Multistage Modeling Strategy for Demand Forecasting

Alain F. Zuur Highland Statistics Ltd. Newburgh, UK.

A Bayesian Perspective on Residential Demand Response Using Smart Meter Data

Modelling Wind Farm Data and the Short Term Prediction of Wind Speeds

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

Forecasting Unemployment Rates in the UK and EU

A strategy for modelling count data which may have extra zeros

way and atmospheric models

Sample solutions. Stat 8051 Homework 8

1. Logistic Regression, One Predictor 2. Inference: Estimating the Parameters 3. Multiple Logistic Regression 4. AIC and BIC in Logistic Regression

Statistical Machine Learning I

Q30b Moyale Observed counts. The FREQ Procedure. Table 1 of type by response. Controlling for site=moyale. Improved (1+2) Same (3) Group only

Multivariate Regression (Trees)

Investigation of forecasting methods for the hourly spot price of the Day-Ahead Electric Power Markets

LOAD FORECASTING APPLICATIONS for THE ENERGY SECTOR

Introduction to Machine Learning and Cross-Validation

Understanding Travel Time to Airports in New York City Sierra Gentry Dominik Schunack

Transcription:

Service & Repair Demand Forecasting Timothy Wong (Senior Data Scientist, Centrica plc) European R Users Meeting 14 th -16 th May, 2018 Budapest, Hungary We supply energy and services to over 27 million customer accounts Supported by around 12,000 engineers and technicians Our areas of focus are Energy Supply & Services, Connected Home, Distributed Energy & Power, Energy Marketing & Trading

Overview Driven by many factors Customer Contact Creates Job Demand My gas boiler is not working. Booking Initial Appointment Not yet done 2 nd Appointment Not yet done 3 rd Appointment Not yet done Done Done Done We can help. Would you like to book an appointment? Closed Closed Closed

Gas boiler service & repair demand Strong causality, e.g.: Cold weather use more gas high repair demand Holiday away from home less repair demand 173 service patches in the UK Each has dependent variables, e.g. weather observations. Temperature : Independent variable Number of contact : Dependent variable

Linear Models Linear fit y = β 0 + β 1 x Polynomial fit K y = β 0 + β k x k k=1 Piecewise polynomial fit K y = β 0 + β k x k x (0, 5] k=1 K y = β 0 + β k x k x (5,10] k=1 K y = β 0 + β k x k x (10,15] k=1

Poisson Distribution Goodness-of-fit test for Poisson distribution > summary(gf) Goodness-of-fit test for poisson distribution X^2 df P(> X^2) Likelihood Ratio 543.702 32 2.288901e-94 Poisson GLM library(vcd) gf <- goodfit(x) summary(gf) plot(gf) Assumption: y i = β 0 + x i,1 β 1 + x i,2 β 2 + + ε i y i ~Poisson(λ) ε i ~N(0, σ 2 ) Response variable y i is contact count.

Generalised Additive Model (GAM) Variables may have nonlinear relationship e.g. warm weather low demand, but we don t expect zero demand on extremely hot day GAM deals with smoothing splines (basis functions) K s x = β k b k (x) k=1 Family: poisson Link function: log Formula: contact_priority ~ s(avg_temp) Parametric coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 2.49418 0.01109 224.9 <2e-16 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(avg_temp) 5.681 6.858 588.6 <2e-16 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 R-sq.(adj) = 0.315 Deviance explained = 31.5% UBRE = 0.88378 Scale est. = 1 n = 694 GAM: Spline function

GLM vs GAM AIC = 4263 AIC = 4260 myglm <- glm(formula = contact_priority ~ avg_temp, data = mydata, family = poisson()) mygam <- gam(formula = contact_priority ~ s(avg_temp), data = mydata, family = poisson()) Statistically significant anova(myglm, mygam, test="chisq") Analysis of Deviance Table AVOVA: Check reduction of sum of squared Model 1: contact_priority ~ avg_temp Model 2: contact_priority ~ s(avg_temp) Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 692.00 1307.1 2 687.32 1294.0 4.6808 13.087 0.01813 * --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1

Wind Speed More Variables mygam2 <- gam(formula = contact_priority ~ te(avg_temp, avg_wind), data = mydata, family = poisson()) Colour showing output density Family: poisson Link function: log Formula: contact_priority ~ te(avg_temp, avg_wind) Parametric coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 2.4927 0.0111 224.5 <2e-16 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Approximate significance of smooth terms: edf Ref.df Chi.sq p-value te(avg_temp,avg_wind) 14.12 16.52 613.6 <2e-16 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 R-sq.(adj) = 0.321 Deviance explained = 33.1% UBRE = 0.86457 Scale est. = 1 n = 694 Temperature

Results For each response variable y we also know the standard error Establish confidence interval Actual data Confidence Interval Prediction

Accuracy measurement Consistent results across patches London area:

GAM Results: Aggregated View

Accuracy measurement Defined as 1-MAPE (%) MAX(0, 1 - ABS(Forecast Actual)/Actual) Average accuracy of each quarter:

Potential Improvements Feature transformation Manually hand-craft linear features Combine and transform existing variables Use linear methods Easier to interpret GAM + Bagging Multilevel linear regression ( Mixed-effect model ) Service patches as groups Single model for all patches

Potential Improvements Time Series Approach ARMA (Auto-Regressive Moving Average) / ARIMA Analyse seasonality Other machine learning techniques Boosted trees Random Forest Works nicely with ordinal/categorical variables Neural net (RNNs) Substantially longer model training time Less interpretable, No confidence interval

Thanks Project Team (Names in alphabetical order) Angus Montgomery Hari Ramkumar Harriet Carmo Kerry Wilson Morgan Martin Thornalley Matthew Pearce Philip Szakowski Terry Phipps Timothy Wong Tonia Ryan European R Users Meeting 14 th -16 th May, 2018 Budapest, Hungary Timothy Wong Senior Data Scientist Centrica plc timothy.wong@centrica.com @timothywong731 github.com/timothy-wong linkedin.com/in/timothy-wong-7824ba30