AN APPLICATION OF MULTIPLE CORRESPONDENCE ANALYSIS TO THE CLASSIFICATION OF RESIDENTIAL ELECTRICITY CONSUMERS
Ronaldo Rocha Bastos (UFJF)
Henrique Steinherz Hippert (UFJF)
Augusto Carvalho Souza (ViannaJr)

Classification of consumers from relatively small areas into different demand groups may help in the estimation of where, and by how much, electricity consumption is likely to change, thus reducing the uncertainties in the aggregate projections required from regional electricity supply companies in Brazil by the Câmara de Comercialização de Energia Elétrica (Electricity Commercialisation Chamber), responsible for the purchase of all energy generated nationwide since 2004. In this paper we compare some methods of classification which can account for the socio-economic profile of electricity consumers and the characteristics of households that generate such residential consumption. In particular, we were interested in studying the usefulness of Multiple Correspondence Analysis (MCA), a multivariate exploratory technique for categorical data, in the identification of covariates and in the combination of its results with regression modelling and intensive computation techniques. The strategies analysed were logistic regression (LR) with both the original categorical covariates and with MCA factors as covariates, and artificial neural networks (ANN) with both types of input variables. The data used in this paper come from a multiple-stage sample survey undertaken among the residential electricity consumers of the city of Juiz de Fora, state of Minas Gerais, an area supplied by a single regional electricity company. The results obtained have shown the importance of MCA as an intermediate step to estimate coefficients that can be used in classification.
For LR, despite the similar performance of both strategies, a significant reduction of the dimensionality of the problem was attained. As to the ANN, a methodology originally designed for numerical predictors, the use of MCA factors not only improves the performance of the method, but permits its application to categorical predictors.
Keywords: Classification; logistic regression; multiple correspondence analysis; artificial neural networks; load forecasting
1. Introduction

The provision for the growing residential electricity load is made by each regional electricity supply company at a highly disaggregate level, which normally involves large capital investments in distribution lines and supply infrastructure. The overall domestic load to be supplied is, therefore, the summation of the disaggregate loads across all individual households. In Brazil, since the creation in 2004 of the national Câmara de Comercialização de Energia Elétrica CCEE (Electricity Commercialisation Chamber), responsible for the purchase of all energy generated nationwide, all electricity distributors are required to present aggregate demand forecasts of the loads they will need in their concession areas for a five-year time span. These forecasts often carry with them a high level of uncertainty, especially if little is known about the patterns of individual consumption. Under the new regulations, however, the CCEE is entitled to penalise the distributors for forecasting errors, thus increasing the need to develop accurate forecasting systems if financial risks are to be minimised. Classification of consumers from relatively small areas into different demand groups may help in the estimation of where, and by how much, electricity consumption is likely to change, thus reducing the uncertainties of such aggregate projections. One major factor affecting electricity consumption is the typology of the population which generates it, generally described by a series of categorical variables such as educational level, car ownership, and type and size of dwelling, to name but a few. By taking into account differences in population characteristics, one can better analyse expected differences in consumption. Some studies of small-area demand forecasting strategies, which demonstrate the usefulness and flexibility of such approaches, have been undertaken for electricity supply companies (e.g. Madden et al., 1994).
In Brazil, attempts to account for the characteristics of the consumers and their households in the determination of consumption patterns and in the classification of populations into electricity consumption groups are still rare (see, e.g. Hippert, 2006). This is probably due to the scarcity of adequate data or to the difficulties involved in merging data coming from different sources at different aggregation levels. The data described in section 2 made this kind of approach possible. In this paper we apply and compare some methods of classification which can account for the socio-economic profile of electricity consumers and the characteristics of households that generate such domestic consumption. In particular, we were interested in studying the usefulness of Multiple Correspondence Analysis (MCA), a multivariate exploratory technique for categorical data, in the identification of covariates and in the combination of its results with regression modelling and intensive computation techniques. The strategies analysed were logistic regression, both with the original categorical covariates and with the MCA factors as covariates, and artificial neural networks (ANN), using the same two kinds of input data.

2. Data used

The data we used come from a data set created by the Department of Statistics (UFJF), containing all variables obtained from a multiple-stage sample survey undertaken among the residential electricity consumers of the city of Juiz de Fora, state of Minas Gerais, an area supplied by a single regional electricity company. A total of 557 households were visited and their heads interviewed by means of a structured questionnaire, which contained questions about dwelling characteristics (built area, number of rooms, etc.), socio-economic characteristics of the households (educational level of head, cars owned, existence
and number of maids, etc.), energy-consuming habits (frequency and time of use of electrical household appliances), attitudes towards energy efficiency and conservation measures, and opinions about possible rationalisation of consumption. All households interviewed were also geo-referenced with the aid of a Global Positioning System (GPS) device, in order to enable their classification according to the neighbourhoods where they were located. In the study discussed here we did not use the variables related to attitudes, energy-consuming habits and location. We focused, instead, on the categorical variables (most of them ordinal) which described the dwelling characteristics and the socio-economic profiles of the consumers. The problem of missing data was not tackled in this study, as our aim was to use the results obtained from different strategies for comparative purposes. After the elimination of all cases with missing data in any of the variables mentioned above, 444 cases remained for analysis.

3. Classification Methods Adopted and Data Analysis

3.1 Multiple Correspondence Analysis

Multiple Correspondence Analysis (MCA) is a multivariate exploratory technique aimed at reducing the dimensionality of a categorical data set by means of uncorrelated factors which maximise the projection distances between the different categories of the variables considered. The solution is obtained through singular value decomposition of a rectangular matrix (see, e.g. Greenacre & Blasius, 2006, p. 12). One of its strengths is the possibility of presenting results in graphical form, which facilitates the understanding of possible similarities and associations among variables. This technique, first proposed by Benzécri in the 1960s and 1970s (Benzécri, 1992), has been used in applications in all branches of science (see, e.g. Greenacre and Blasius, 1994; 2006).
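The first step of an MCA is to recode the observations into a 0/1 indicator matrix with one column per category; the factors are then obtained from the singular value decomposition of a suitably scaled version of this matrix. A minimal sketch of that recoding step is given below; the variable names and categories are hypothetical, not taken from the survey.

```python
# Illustrative sketch: dummy (indicator) coding of categorical records,
# the first step of MCA. Variables and categories here are made up.

def indicator_matrix(records, variables):
    """Build the 0/1 indicator matrix Z used by MCA.

    records   -- list of dicts mapping variable name -> observed category
    variables -- dict mapping variable name -> ordered list of categories
    Returns (Z, columns), where each row of Z has exactly one 1 per variable.
    """
    columns = [(var, cat) for var in variables for cat in variables[var]]
    Z = [[1 if rec[var] == cat else 0 for (var, cat) in columns]
         for rec in records]
    return Z, columns

# Two toy households described by two categorical variables
variables = {"cars": ["none", "one", "two+"], "maid": ["yes", "no"]}
records = [{"cars": "one", "maid": "no"},
           {"cars": "two+", "maid": "yes"}]

Z, cols = indicator_matrix(records, variables)
print(Z)  # [[0, 1, 0, 0, 1], [0, 0, 1, 1, 0]]
```

MCA software (such as the R routines used in this study) then applies the SVD to this matrix after dividing by the grand total and centring by the row and column masses.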
In order to explore the possible patterns existing among the categorical variables related to the consumers and their households, we undertook an MCA with some selected variables from the dataset, as indicated in Table 1.

Table 1 – The categorical variables used in the study

  Categorical variable (levels)   Categories                                                  Type
  Educational level (3)           Up to High School; High School to Undergraduate; Graduate   predictor
  Cars owned (3)                  None; One; Two or More                                      predictor
  House maid (2)                  Yes; No                                                     predictor
  Dwelling area (4)               < 50 m²; … m²; … m²; > 151 m²                               predictor
  Meter type (2)                  One phase; Two/Three phases                                 grouping

A total of four predictor variables and 12 corresponding categories were analysed with MCA, giving an overall solution with eight factors. Table 2 summarises the solution, presenting the percentage of the total inertia explained by each factor.

Table 2 – MCA solution: eigenvalues and % of inertia explained by each of the eight factors (Factor 1: 22.31%; Factor 2: 14.93%)

The graphical solution for the first two factors, which account for approximately 37% of the explained inertia, is presented in Figure 1. In this graph, categories of different variables are considered to be associated when located close to each other, whereas categories of the same variable which are close to each other are considered to be similar. Factors 1 and 2 are very discriminating for the variables analysed. For example, positive values of the category coordinates on Factor 1 are related to category 1 of the variable meter type (lower consumption), and negative ones are related to category 2 (higher consumption) of the same variable. The variable meter type was plotted in the graph as a supplementary variable (Greenacre, 2006, p. 31-32), inasmuch as it does not influence the solution. The position of its two categories in the principal plane of MCA, however, confirms the expected pattern, observed for the other variables, that more affluent consumers, who tend to have a higher demand for electricity (here indicated by the type of meter), are separated from the less affluent consumers. The variable meter type, therefore, is the dependent variable to be used in the classification methods that follow.

Figure 1 – Principal plane of MCA (Factor 1 = 22.31% of inertia; Factor 2 = 14.93%)

3.2 Logistic Regression

Logistic regression (LR) is a linear method of classification that can be applied to the data presented. The conditional probabilities of a binary outcome (or an outcome of higher dimension, if polychotomous logistic regression is used) are estimated and used to classify cases described by particular values of the covariates. LR can easily incorporate categorical predictors as covariates and, according to Saporta & Niang (2006), performs better than linear discriminant analysis when the conditional distributions are not normal or have different covariances. Following the proposal of Saporta & Niang (2006), we initially performed stepwise selection of covariates (probability level 0.05) for the estimation of the conditional probability of an observation x coming from one particular group (baseline). Then, stepwise logistic regression (probability level 0.05) was again performed, this time with the factors obtained by MCA as covariates. We then analysed how close the two solutions were. In order to avoid the problem of resubstitution bias, Saporta & Niang (2006) recommend splitting the total sample into subsets for training (estimating coefficients) and for testing (classification). We followed their advice and performed a similar simulation exercise, in which 50 random samples of both types were taken without replacement, stratified by the two groups of meter type. In each sample, 356 cases (80% of the total) were used for estimation and the remaining 88 (20%) were used for classification. The same subsamples were used for both strategies. The performance of each strategy in terms of classification was evaluated by comparison of their Receiver Operating Characteristic (ROC) curves and the corresponding areas under the ROC curve (AUC). The ROC curve, formed by the true positive rates and false positive rates at different thresholds, represents the overall performance of the model for any threshold. Figure 2 presents two of these curves, for illustrative purposes, obtained by both strategies for the whole sample available in the database. It can be seen that the curves for both strategies are similar, and have about the same AUC.
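The AUC used here has a simple interpretation: it is the probability that a randomly chosen case from the positive group receives a higher predicted score than a randomly chosen case from the negative group (ties counting one half). A minimal sketch of this computation follows, on made-up scores rather than the study's data.

```python
# Minimal sketch of the AUC as a pair-counting probability; the scores
# and labels below are illustrative only.

def auc(labels, scores):
    """Area under the ROC curve by direct pair counting.

    labels -- iterable of 0/1 group indicators (1 = positive group)
    scores -- predicted probabilities, or any monotone score
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(auc(labels, scores))  # 0.888...: 8 of the 9 pairs are correctly ordered
```

An AUC of 0.5 corresponds to random choice, and 1.0 to perfect separation of the two groups, which is the yardstick used in the comparisons below.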
MCA, LR, ROC curves and AUC were computed throughout this analysis by means of routines implemented in R.

Figure 2 – ROC curves (sensitivity vs. 1 − specificity) for LR with categorical covariates and with MCA factors as covariates, on the complete sample
Table 3a presents a summary of the frequency of the number of covariates retained in the models for both strategies, and Table 3b presents the frequency of each covariate over the whole simulation exercise.

Table 3 – Categorical variable and MCA factor selection for LR in 50 samples: frequency of covariates retained

  (a) LR with categorical predictors              (b) LR with MCA factors as predictors
  Predictor                  # of times retained  Predictor   # of times retained
  Area: … m²                 49                   Factor 1    50
  Area: … m²                 49                   Factor 2    50
  Area: > 151 m²             49                   Factor 3    …
  Cars: 1                    50                   Factor 4    …
  Cars: 2+                   …                    Factor 5    …
  Maid: 1+                   3                    Factor 6    0
  Ed. level: HS–undergrad.   40                   Factor 7    3
  Ed. level: graduate        40                   Factor 8    6

It can be noticed that LR with MCA factors tends to retain fewer covariates than LR with categorical covariates. Furthermore, one can see that the categorical covariates dwelling area, cars owned and educational level appear most frequently as significant covariates; dwelling area and cars owned were present in almost all 50 models. On the other hand, factors F1, F2 (the principal plane of MCA) and F5 were the most frequent; F1 and F2 were always present in the 50 models.

3.3 Artificial Neural Networks

Artificial Neural Networks (ANN) may be viewed as nonlinear regression models which map the input variables onto the output space by means of complex sets of nested functions whose parameters are iteratively estimated by training algorithms. For this application, the ANN had the same inputs as the LR models, and a single output, restricted to the [0,1] interval, so that it could be interpreted as the probability of the input vector belonging to a given class. The ANN models we experimented with were fully-connected feed-forward perceptrons with one hidden layer and one output neuron. Hyperbolic tangent functions were used for the activation of the hidden neurons, and a logistic function for the output neuron.
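As an illustration of this architecture, the sketch below implements a one-hidden-layer perceptron with tanh hidden units and a logistic output neuron. It is trained here by plain online gradient descent (a simpler substitute for the training used in the study), keeping the weight snapshot that minimises the error on a validation set; all data are synthetic.

```python
# Sketch of a feed-forward perceptron: one tanh hidden layer, logistic
# output. Trained by plain gradient descent with a validation snapshot;
# not the study's training algorithm, and all data below are synthetic.
import math
import random

def forward(x, W1, b1, W2, b2):
    # hidden layer: tanh activations; output: logistic, so y lies in [0, 1]
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = sum(w * hi for w, hi in zip(W2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z)), h

def train(train_set, val_set, n_hidden=3, lr=0.5, max_epochs=500):
    random.seed(0)
    n_in = len(train_set[0][0])
    W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
          for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    best, best_err = None, float("inf")
    for _ in range(max_epochs):
        for x, t in train_set:
            y, h = forward(x, W1, b1, W2, b2)
            d_out = y - t  # gradient of cross-entropy w.r.t. output net input
            for j in range(n_hidden):
                d_h = d_out * W2[j] * (1.0 - h[j] ** 2)  # through tanh
                W2[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_h * x[i]
                b1[j] -= lr * d_h
            b2 -= lr * d_out
        # keep the weights with the lowest validation error so far
        err = sum((forward(x, W1, b1, W2, b2)[0] - t) ** 2 for x, t in val_set)
        if err < best_err:
            best_err, best = err, ([r[:] for r in W1], b1[:], W2[:], b2)
    return best

# Toy separable problem; in a real run the validation set would be distinct
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 1), ([1, 1], 1)]
W1, b1, W2, b2 = train(data, data)
probs = [forward(x, W1, b1, W2, b2)[0] for x, _ in data]
print([round(p, 2) for p in probs])
```

Because the output neuron is logistic, the network's outputs are directly comparable to the conditional probabilities estimated by LR, which is what allows the ROC/AUC comparison below.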
Training was performed with the Levenberg-Marquardt algorithm (Bishop, 1998), with early stopping defined by cross-validation: after each iteration, the ANN performance was checked on a sub-sample of the data, distinct from the training data; when the error started to increase (which meant the ANN had started to overfit the training data), the iterations were stopped (Haykin, 1999). Out of the 444 cases available in the sample, 332 were randomly selected for network training, 24 for cross-validation, and the remaining 88 cases for out-of-sample testing (the sample used for ANN testing therefore had the same size as the samples used for LR testing). The number of hidden processing elements (neurons) was defined by a grid search which compared the performances of ANNs of eight different sizes, containing 1, 2, 3, 5, 10, 15, 20 and 30 neurons. The ANNs were tested according to the same two strategies as used for the LR models: with the original covariates (Table 1), and with the factors obtained by MCA (Table 2). In both situations, Meter type was used as the output variable. Since the training algorithms tend to produce slightly different results each time they are run, as they start from random initialisation values, we ran each ANN model 30 times. In order to evaluate the performance at each run, the vectors of 88 output values were used to build ROC curves, and the areas under each curve (AUC) were estimated. For each ANN size, then, we had 30 AUC estimates. Figure 3 shows boxplots displaying the 30 areas estimated for the 30-neuron ANN under each strategy (with original covariates and with MCA factors). The eight ANNs we tested in each strategy obtained mean and median areas which were roughly equivalent; the larger ANNs, however, tended to show less dispersion in the areas and to produce fewer outliers.

4. Comparison of the classification strategies used

The methodologies used in Sections 3.2 and 3.3 were different: for each strategy (original covariates as input data, versus MCA factors), there were 50 simulations, and consequently 50 different samples, for each of the two LR strategies, as compared to 30 replications of the same ANN model on a single sample. Therefore, it is not possible to compare the results coming from LR directly with those from ANN. It is advisable, therefore, to proceed with separate comparisons, bearing in mind that the main objective of the present study was to observe the contribution of MCA as an intermediate step to both LR and ANN. Figure 3 summarises all the results discussed here. Firstly, the two LR strategies are clearly comparable in their classification capabilities, as measured by the box-plots of the AUC obtained in the 50 different test samples. A comparison by means of non-parametric tests showed that the two distributions are not significantly different (p=0.085 in Mann-Whitney, p=0.27 in Kolmogorov-Smirnov).
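The Mann-Whitney test used here compares the two samples of AUC values through the U statistic: the number of cross-sample pairs in which one sample's value exceeds the other's. A minimal sketch of the statistic follows, on made-up AUC values rather than the study's results.

```python
# Sketch of the Mann-Whitney U statistic used to compare two samples of
# AUC values; the sample values below are illustrative only.

def mann_whitney_u(xs, ys):
    """U statistic: number of (x, y) pairs with x > y (ties count 1/2)."""
    return sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)

auc_factors = [0.84, 0.86, 0.85, 0.87]   # hypothetical AUCs, strategy 1
auc_categ   = [0.82, 0.83, 0.85, 0.81]   # hypothetical AUCs, strategy 2
u = mann_whitney_u(auc_factors, auc_categ)
print(u)  # 14.5 of a maximum of 16 pairs favour the first sample
```

Dividing U by the number of pairs recovers exactly the AUC interpretation used earlier: the probability that a value drawn from the first sample exceeds one drawn from the second.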
The advantage of the strategy using MCA factors is that the models were fitted with fewer inputs (only three factors on average), favouring the principle of parsimony for regression models. Secondly, the ANN trained with MCA factors as input variables shows clearly superior classification performance to the ANN trained with categorical variables (p=0.000 in Mann-Whitney and in Kolmogorov-Smirnov). The median of the distribution of the 30 runs for the former strategy is clearly higher than the median obtained for the other strategy. Finally, all four strategies clearly increase the level of information that can be input to load forecast models, as all of them present a classification performance, represented by their AUC, superior to random choice.

Figure 3 – Comparison of different strategies for the classification of electricity consumers
5. Conclusion

The results obtained have shown the importance of MCA as an intermediate step to estimate coefficients that can be used in classification. For LR, despite the similar performance of both strategies, the goal of lowering the dimensionality of the problem was attained. As to the ANN, a methodology originally designed for numerical predictors, the use of MCA factors not only improves the performance of the method, but permits its application to categorical predictors, as suggested by Saporta & Niang (2006). The typologies obtained by such classification methods are expected to be used as input to medium- or long-term demand forecast models, further improving their accuracy. The results reported in this paper are part of a larger study that intends to develop models that associate spatially referenced data about the socio-economic profile of the consumers, and the characteristics of their households, with their most likely levels of electricity demand. These models will then be used to define a demand typology to be used as an input to load forecasting models, particularly those geared to medium- or long-term forecasting for specific areas whose population characteristics are known.

References

BISHOP, C.M. Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1998.
GREENACRE, M. & BLASIUS, J. (eds.) Correspondence Analysis in the Social Sciences. London: Academic Press, 1994.
GREENACRE, M. & BLASIUS, J. (eds.) Multiple Correspondence Analysis and Related Methods. Boca Raton: Chapman & Hall/CRC, 2006.
HAYKIN, S. Neural Networks: a Comprehensive Foundation. 2nd ed. Upper Saddle River: Prentice Hall, 1999.
HIPPERT, H.S. Modelagem e Previsão de Demanda de Energia Elétrica com Base em Dados Geo-Referenciados. Anais do IX Encontro de Modelagem Computacional. Belo Horizonte, 2006.
MADDEN, M.; STEVENSON, M.A.; BROWN, P.J.B. & BATEY, P.W.J. Developing a Small-Area Electricity Demand Forecasting System. The Journal of Energy and Development, Vol. XX, n. 1, p. 1-24, 1994.
SAPORTA, G. & NIANG, N. Correspondence Analysis and Classification. In: GREENACRE, M. & BLASIUS, J. (eds.) Multiple Correspondence Analysis and Related Methods. Boca Raton: Chapman & Hall/CRC, 2006.
More informationA simulation study of model fitting to high dimensional data using penalized logistic regression
A simulation study of model fitting to high dimensional data using penalized logistic regression Ellinor Krona Kandidatuppsats i matematisk statistik Bachelor Thesis in Mathematical Statistics Kandidatuppsats
More informationNeural networks (not in book)
(not in book) Another approach to classification is neural networks. were developed in the 1980s as a way to model how learning occurs in the brain. There was therefore wide interest in neural networks
More informationAdvising on Research Methods: A consultant's companion. Herman J. Ader Gideon J. Mellenbergh with contributions by David J. Hand
Advising on Research Methods: A consultant's companion Herman J. Ader Gideon J. Mellenbergh with contributions by David J. Hand Contents Preface 13 I Preliminaries 19 1 Giving advice on research methods
More informationArtificial Neural Network
Artificial Neural Network Contents 2 What is ANN? Biological Neuron Structure of Neuron Types of Neuron Models of Neuron Analogy with human NN Perceptron OCR Multilayer Neural Network Back propagation
More informationStatistical aspects of prediction models with high-dimensional data
Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by
More informationShort Term Load Forecasting Using Multi Layer Perceptron
International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Short Term Load Forecasting Using Multi Layer Perceptron S.Hema Chandra 1, B.Tejaswini 2, B.suneetha 3, N.chandi Priya 4, P.Prathima
More informationDiscrete Response Multilevel Models for Repeated Measures: An Application to Voting Intentions Data
Quality & Quantity 34: 323 330, 2000. 2000 Kluwer Academic Publishers. Printed in the Netherlands. 323 Note Discrete Response Multilevel Models for Repeated Measures: An Application to Voting Intentions
More informationClassification of Ordinal Data Using Neural Networks
Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade
More informationSUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION. University of Minnesota
Submitted to the Annals of Statistics arxiv: math.pr/0000000 SUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION By Wei Liu and Yuhong Yang University of Minnesota In
More informationNeural Networks. Nethra Sambamoorthi, Ph.D. Jan CRMportals Inc., Nethra Sambamoorthi, Ph.D. Phone:
Neural Networks Nethra Sambamoorthi, Ph.D Jan 2003 CRMportals Inc., Nethra Sambamoorthi, Ph.D Phone: 732-972-8969 Nethra@crmportals.com What? Saying it Again in Different ways Artificial neural network
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification
More informationTurning a research question into a statistical question.
Turning a research question into a statistical question. IGINAL QUESTION: Concept Concept Concept ABOUT ONE CONCEPT ABOUT RELATIONSHIPS BETWEEN CONCEPTS TYPE OF QUESTION: DESCRIBE what s going on? DECIDE
More informationFeature Engineering, Model Evaluations
Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering
More informationCombination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters
Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Kyriaki Kitikidou, Elias Milios, Lazaros Iliadis, and Minas Kaymakis Democritus University of Thrace,
More informationRandomized Decision Trees
Randomized Decision Trees compiled by Alvin Wan from Professor Jitendra Malik s lecture Discrete Variables First, let us consider some terminology. We have primarily been dealing with real-valued data,
More informationARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92
ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000
More informationMachine learning for pervasive systems Classification in high-dimensional spaces
Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version
More informationECE 661: Homework 10 Fall 2014
ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;
More informationArtificial Neural Networks
Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks
More informationUNIVERSITY OF THE PHILIPPINES LOS BAÑOS INSTITUTE OF STATISTICS BS Statistics - Course Description
UNIVERSITY OF THE PHILIPPINES LOS BAÑOS INSTITUTE OF STATISTICS BS Statistics - Course Description COURSE COURSE TITLE UNITS NO. OF HOURS PREREQUISITES DESCRIPTION Elementary Statistics STATISTICS 3 1,2,s
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationThunderstorm Forecasting by using Artificial Neural Network
Thunderstorm Forecasting by using Artificial Neural Network N.F Nik Ismail, D. Johari, A.F Ali, Faculty of Electrical Engineering Universiti Teknologi MARA 40450 Shah Alam Malaysia nikfasdi@yahoo.com.my
More informationDimensionality Reduction Techniques (DRT)
Dimensionality Reduction Techniques (DRT) Introduction: Sometimes we have lot of variables in the data for analysis which create multidimensional matrix. To simplify calculation and to get appropriate,
More informationA4. Methodology Annex: Sampling Design (2008) Methodology Annex: Sampling design 1
A4. Methodology Annex: Sampling Design (2008) Methodology Annex: Sampling design 1 Introduction The evaluation strategy for the One Million Initiative is based on a panel survey. In a programme such as
More informationepochs epochs
Neural Network Experiments To illustrate practical techniques, I chose to use the glass dataset. This dataset has 214 examples and 6 classes. Here are 4 examples from the original dataset. The last values
More information* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.
Name of the course Statistical methods and data analysis Audience The course is intended for students of the first or second year of the Graduate School in Materials Engineering. The aim of the course
More informationday month year documentname/initials 1
ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi
More informationPrincipal Component Analysis Applied to Polytomous Quadratic Logistic
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS024) p.4410 Principal Component Analysis Applied to Polytomous Quadratic Logistic Regression Andruski-Guimarães,
More informationTextbook Examples of. SPSS Procedure
Textbook s of IBM SPSS Procedures Each SPSS procedure listed below has its own section in the textbook. These sections include a purpose statement that describes the statistical test, identification of
More informationStatistics Toolbox 6. Apply statistical algorithms and probability models
Statistics Toolbox 6 Apply statistical algorithms and probability models Statistics Toolbox provides engineers, scientists, researchers, financial analysts, and statisticians with a comprehensive set of
More informationNeural Networks and Ensemble Methods for Classification
Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated
More informationUnivariate versus Multivariate Models for Short-term Electricity Load Forecasting
Univariate versus Multivariate Models for Short-term Electricity Load Forecasting Guilherme Guilhermino Neto 1, Samuel Belini Defilippo 2, Henrique S. Hippert 3 1 IFES Campus Linhares. guilherme.neto@ifes.edu.br
More informationPrediction of Hourly Solar Radiation in Amman-Jordan by Using Artificial Neural Networks
Int. J. of Thermal & Environmental Engineering Volume 14, No. 2 (2017) 103-108 Prediction of Hourly Solar Radiation in Amman-Jordan by Using Artificial Neural Networks M. A. Hamdan a*, E. Abdelhafez b
More informationNonlinear Classification
Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions
More informationAssignment 3. Introduction to Machine Learning Prof. B. Ravindran
Assignment 3 Introduction to Machine Learning Prof. B. Ravindran 1. In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively
More informationLecture 9: Classification, LDA
Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis October 13, 2017 1 / 21 Review: Main strategy in Chapter 4 Find an estimate ˆP (Y X). Then, given an input x 0, we
More informationDATABASE AND METHODOLOGY
CHAPTER 3 DATABASE AND METHODOLOGY In the present chapter, sources of database used and methodology applied for the empirical analysis has been presented The whole chapter has been divided into three sections
More informationLecture 9: Classification, LDA
Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis Jonathan Taylor, 10/12 Slide credits: Sergio Bacallado 1 / 1 Review: Main strategy in Chapter 4 Find an estimate ˆP
More informationSample questions for Fundamentals of Machine Learning 2018
Sample questions for Fundamentals of Machine Learning 2018 Teacher: Mohammad Emtiyaz Khan A few important informations: In the final exam, no electronic devices are allowed except a calculator. Make sure
More informationEstimation of extreme flow quantiles and quantile uncertainty for ungauged catchments
Quantification and Reduction of Predictive Uncertainty for Sustainable Water Resources Management (Proceedings of Symposium HS2004 at IUGG2007, Perugia, July 2007). IAHS Publ. 313, 2007. 417 Estimation
More informationCSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning
CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.
More informationThe Service Pipe A Forgotten Asset in Leak Detection
Leakage 2005 - Conference Proceedings Page 1 R P Warren Tynemarch Systems Engineering Ltd, Crossways House, 54-60 South Street, Dorking, Surrey, RH4 2HQ, UK, rwarren@tynemarch.co.uk Keywords: service pipes;
More informationComparison of Predictive Accuracy of Neural Network Methods and Cox Regression for Censored Survival Data
Comparison of Predictive Accuracy of Neural Network Methods and Cox Regression for Censored Survival Data Stanley Azen Ph.D. 1, Annie Xiang Ph.D. 1, Pablo Lapuerta, M.D. 1, Alex Ryutov MS 2, Jonathan Buckley
More informationForecasting Crude Oil Price Using Neural Networks
CMU. Journal (2006) Vol. 5(3) 377 Forecasting Crude Oil Price Using Neural Networks Komsan Suriya * Faculty of Economics, Chiang Mai University, Chiang Mai 50200, Thailand *Corresponding author. E-mail:
More informationPATTERN CLASSIFICATION
PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS
More informationUrban Pattern Geometry and its Potential Energy Efficiency
DOI: 10.14621/tna.20170104 Urban Pattern Geometry and its Potential Energy Efficiency Anna Yunitsyna* 1, Ernest Shtepani 2 1 Department of Architecture, Epoka University Tirana, Albania; ayunitsyna@epoka.edu.al
More informationMODELLING ENERGY DEMAND FORECASTING USING NEURAL NETWORKS WITH UNIVARIATE TIME SERIES
MODELLING ENERGY DEMAND FORECASTING USING NEURAL NETWORKS WITH UNIVARIATE TIME SERIES S. Cankurt 1, M. Yasin 2 1&2 Ishik University Erbil, Iraq 1 s.cankurt@ishik.edu.iq, 2 m.yasin@ishik.edu.iq doi:10.23918/iec2018.26
More informationClassifying Hungarian sub-regions by their competitiveness
Classifying Hungarian sub-regions by their competitiveness Péter KOVÁCS, lecturer, Department of Statistics and Demography, Miklós LUKOVICS, lecturer, Institute of Economics and Economic Development, Faculty
More informationLecture 9: Classification, LDA
Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis October 13, 2017 1 / 21 Review: Main strategy in Chapter 4 Find an estimate ˆP (Y X). Then, given an input x 0, we
More informationForecasting demand in the National Electricity Market. October 2017
Forecasting demand in the National Electricity Market October 2017 Agenda Trends in the National Electricity Market A review of AEMO s forecasting methods Long short-term memory (LSTM) neural networks
More informationstatistical methods for tailoring seasonal climate forecasts Andrew W. Robertson, IRI
statistical methods for tailoring seasonal climate forecasts Andrew W. Robertson, IRI tailored seasonal forecasts why do we make probabilistic forecasts? to reduce our uncertainty about the (unknown) future
More informationIntroduction to Neural Networks
CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character
More informationMachine Learning (CSE 446): Neural Networks
Machine Learning (CSE 446): Neural Networks Noah Smith c 2017 University of Washington nasmith@cs.washington.edu November 6, 2017 1 / 22 Admin No Wednesday office hours for Noah; no lecture Friday. 2 /
More informationNeural network modelling of reinforced concrete beam shear capacity
icccbe 2010 Nottingham University Press Proceedings of the International Conference on Computing in Civil and Building Engineering W Tizani (Editor) Neural network modelling of reinforced concrete beam
More informationLecture 7 Artificial neural networks: Supervised learning
Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in
More information