…for aggregated data
Rosanna Verde (rosanna.verde@unina2.it), Antonio Irpino (antonio.irpino@unina2.it), Dominique Desbois (desbois@agroparistech.fr)
Second University of Naples, Dept. of Political Sciences "J. Monnet"
1 of 30
Motivations and aims of the talk
Motivation: The regulations of official statistical institutes do not allow the diffusion of microdata, for privacy reasons. In general, it is easier to obtain aggregated data about a set of individuals. Most modelling tools in statistics (e.g., regression) work on microdata and cannot be easily extended to macrodata.
Methods: In this talk, we show the use of a regression method we developed, in which both the explanatory and the response variables have quantile distributions as observations. A PCA method on quantile data is used to visualize the relationships between the predicted distributions.
Application: The analysis has been performed on a dataset of economic indicators related to the specific costs of agricultural products in the regions of France.
2 of 30
DATA
We observed data coming from RICA, the French Farm Accountancy Data Network (FADN), aggregated over the 22 metropolitan regions of France.

CODE  REGION                 CODE  REGION
121   Île de France          162   Pays de la Loire
131   Champagne-Ardenne      163   Bretagne
132   Picardie               164   Poitou-Charentes
133   Haute-Normandie        182   Aquitaine
134   Centre                 183   Midi-Pyrénées
135   Basse-Normandie        184   Limousin
136   Bourgogne              192   Rhône-Alpes
141   Nord-Pas-de-Calais     193   Auvergne
151   Lorraine               201   Languedoc-Roussillon
152   Alsace                 203   Provence-Alpes-Côte d'Azur
153   Franche-Comté          204   Corse
3 of 30
Economic indicators available for each region
o Y_TSC - the Total Specific Cost (TSC) of farm holdings;
o X_WHEAT - the wheat output variable;
o X_PIG - the pig output variable;
o X_MILKC - the cow milk output variable.
The available data
o Each region is described by the vector of the estimates of the 10 deciles of the distribution observed for each of the four variables.
Not-available information (for privacy reasons)
o Raw data are not available: we do not know the values of the four variables for each farm;
o We do not know the association structure within each region;
o We do not know the number of farms observed in each region.
4 of 30
An example of a row of the data table
(Figure: for Bretagne, the row reports Y_TSC, X_Wheat, X_Pig and X_Cmilk, each shown as a CDF plot, a histogram, and a smoothed histogram.)
5 of 30
The data table: CDFs (Cumulative distribution functions) and corresponding histograms 6 of 30
A first research question
Is it possible to predict Y_TSC from the other variables? Classic regression methods cannot be used with this kind of data.
Proposal
We may use histogram-valued data analysis: the regression for quantile functions (Verde-Irpino regression). With each quantile function a distribution is associated. Irpino-Verde regression is a novel method for the regression analysis of distributional data.
7 of 30
A regression model for histogram variables based on Wasserstein distance 8 of 30
A regression model for histogram data
Data = Model Fit + Residual
Linear regression is a general method for estimating/describing the association between a continuous outcome (dependent) variable and one or more predictors in a single equation. This is a conceptually easy task with classic data, but what does it mean when dealing with histogram data?
(Figure: example histograms over the bins 0-10, 10-20, ..., 90-100.)
Billard, Diday, IFCS 2006; Verde, Irpino, COMPSTAT 2010, CLADAG 2011; Dias, Brito, ISI 2011
9 of 30
Linear Regression Model for histogram data (Verde, Irpino, 2013)
Given histogram variables X, we search for a linear transformation of the X's which allows us to predict the histogram variable Y. For example: is it possible to predict the distribution of the Y_TSC observed in a region using a linear combination of the predictor histogram variables?
10 of 30
Multiple regression model for quantile functions
Our multiple regression model is:

  y_i(t) = β_0 + Σ_{j=1}^{p} β_j x_ij(t) + ε_i(t)

in matrix notation:

  Y(t) = X(t)β + ε(t)

where the y_i(t) and x_ij(t) are the quantile functions associated with histogram/distribution data. This formulation is analogous to the functional linear model (Ramsay, Silverman, 2003), except that the β parameters are constants and the functions y_i(t), x_ij(t) are quantile functions, while each ε_i(t) is a residual function (a distribution?), for all i = 1, ..., n.
11 of 30
Parameter estimation - LS method using the Wasserstein distance
According to the nature of the variables, for the parameter estimation we propose to extend the Least Squares principle to the functional case, using a typical metric between quantile functions.

Squared error based on the L2 Wasserstein distance between the observed and the predicted quantile functions:

  ε_i^2( y_i(t), ŷ_i(t) ) = ∫_0^1 ( y_i(t) − β_0 − Σ_{j=1}^{p} β_j x_ij(t) )^2 dt

L2 Wasserstein distance between two quantile functions:

  d_W^2( x_i, x_j ) = ∫_0^1 ( F_i^{-1}(t) − F_j^{-1}(t) )^2 dt
12 of 30
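As a quick numerical aside (our own illustration, not from the talk; function name and toy data are ours): when each distribution is summarised by equally spaced quantiles, the integral over t in [0, 1] defining the squared L2 Wasserstein distance becomes a simple mean of squared quantile differences.

```python
import numpy as np

def wasserstein_l2(q_x, q_y):
    """Squared L2 Wasserstein distance between two distributions,
    approximated from equally spaced quantiles of each: the integral
    over t in [0, 1] becomes a mean of squared quantile differences."""
    q_x = np.asarray(q_x, dtype=float)
    q_y = np.asarray(q_y, dtype=float)
    return np.mean((q_x - q_y) ** 2)

# toy example: two cost distributions summarised by decile mid-quantiles
t = np.linspace(0.05, 0.95, 10)
region_a = np.quantile(np.random.default_rng(0).gamma(2.0, 10.0, 500), t)
region_b = region_a + 5.0        # same shape, shifted by 5
print(wasserstein_l2(region_a, region_b))   # ≈ 25: a pure location shift
```

Note how a pure location shift of size c contributes exactly c² to the squared distance, while shape differences add on top of it.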
Fitting the linear regression model
Find a linear transformation of the quantile functions x_ij (for j = 1, ..., p) in order to predict the quantile function of y_i, i.e.:

  ŷ_i(t) = β_0 + Σ_{j=1}^{p} β_j x_ij(t),   t ∈ [0, 1]

The linear transformation is unique: the parameters β_0 and β_j are estimated for all the x_ij and y_i distributions.
A first problem: only if β_j > 0 can a quantile function ŷ_i(t) be derived.
In order to overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance and on the NNLS algorithm.
13 of 30
OLS estimate (Irpino and Verde, 2012)
The quantile function can be decomposed as:

  x_ij(t) = x̄_ij + x^c_ij(t)

where x^c_ij(t) = x_ij(t) − x̄_ij is the centred quantile function. Then, we propose the following regression model:

  y_i(t) = β_0 + Σ_{j=1}^{p} β_j x̄_ij + Σ_{j=1}^{p} γ_j x^c_ij(t) + ε_i(t),   0 ≤ t ≤ 1

Using the Wasserstein distance it is possible to set up an OLS method that returns the two sets of coefficients (β_0, β_j; γ_j), under a positiveness constraint on the γ_j.
14 of 30
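A minimal sketch of how such a two-component fit could be implemented (our own illustration under stated assumptions, not the authors' code): the mean component is estimated by ordinary least squares on the distribution means, and the centred component by non-negative least squares (scipy's `nnls`), which enforces the positivity constraint on the γ_j.

```python
import numpy as np
from scipy.optimize import nnls

def fit_two_component(Y, X):
    """Sketch of the two-component regression fit.
    Y : (n, m) array, row i = m equally spaced quantiles of y_i
    X : (n, p, m) array, X[i, j] = m quantiles of x_ij
    Returns (beta, gamma): beta for the means (OLS with intercept),
    gamma >= 0 for the centred quantile functions (NNLS)."""
    n, p, m = X.shape
    x_mean = X.mean(axis=2)                      # (n, p) distribution means
    y_mean = Y.mean(axis=1)                      # (n,)
    # mean component: ordinary least squares with intercept
    A = np.column_stack([np.ones(n), x_mean])
    beta, *_ = np.linalg.lstsq(A, y_mean, rcond=None)
    # centred component: stack centred quantiles, solve under gamma >= 0
    Xc = (X - x_mean[:, :, None]).transpose(0, 2, 1).reshape(n * m, p)
    Yc = (Y - y_mean[:, None]).reshape(n * m)
    gamma, _ = nnls(Xc, Yc)
    return beta, gamma

# noiseless toy data with one predictor: y_i(t) = 2 + 3*x_mean + 0.5*x^c(t)
rng = np.random.default_rng(1)
n, p, m = 20, 1, 10
X = np.sort(rng.gamma(2.0, 10.0, size=(n, p, m)), axis=2)
x_mean = X.mean(axis=2)
Y = 2.0 + 3.0 * x_mean[:, [0]] + 0.5 * (X[:, 0, :] - x_mean[:, [0]])
beta, gamma = fit_two_component(Y, X)
print(beta, gamma)   # ≈ [2, 3] and [0.5]
```

The centring makes the two components orthogonal, which is what allows them to be estimated separately in this sketch.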
Interpretation of the parameters
Regression parameters for the distribution mean locations:
  β̂_0, β̂_1, ..., β̂_p ∈ R
Shrinking factors for the variability:
  γ̂_1, ..., γ̂_p ∈ R^+
If γ̂_j > 1 (< 1), the predicted ŷ_i histogram has a greater (smaller) variability than the x_ij histogram.
15 of 30
Advantages of the regression on quantile functions
The regression on quantile functions takes into account the whole distribution (described by the quantiles). It is more powerful than a classic regression on the means of the distributions, because it considers information about the sizes and shapes of the distributions.
It is different from the well-known quantile regression, which requires all the microdata and estimates one quantile at a time (independently of the others); in that case the order among the estimated quantiles is not guaranteed. Our method works on aggregated data when microdata are not available, and estimates the quantiles using a single model, which guarantees the natural order among the estimated quantiles.
The method suffers less from outlying observations, thus it guarantees a more robust estimation of the tails of the distributions.
16 of 30
Regression results (only the first 19 regions are used; the last three have regressors equal to zero)
The estimated model, for t ∈ [0; 1]:

  Y_TSC_i(t) = 16,834.4 + 0.6671 X̄_WHEAT_i + 0.7793 X^c_WHEAT_i(t)
                        + 0.6095 X̄_PIG_i   + 0.5478 X^c_PIG_i(t)
                        + 0.2651 X̄_MILKC_i + 0.3438 X^c_MILKC_i(t)

Goodness-of-fit indices:
o Root Mean Square Error (Verde & Irpino, 2013): RMSE = 7,238.2;
o Omega index (Dias & Brito, 2011): Ω = 0.9069 (0 worst fit, 1 best fit);
o Pseudo R-squared (Verde & Irpino, 2013): PR² = 0.7233 (0 worst fit, 1 best fit).
17 of 30
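For intuition only, here is one plausible reading of two of these indices (the exact definitions are in the cited papers; the code below is our hedged sketch, with rows holding equally spaced quantiles as before): the RMSE averages the squared Wasserstein distances between observed and predicted distributions, and the pseudo R² compares that error to the total Wasserstein variation around the barycentric distribution.

```python
import numpy as np

def rmse_w(Y, Y_hat):
    # root mean of the squared Wasserstein distances, one per region;
    # the inner mean over quantiles approximates the integral over t
    return np.sqrt(np.mean((Y - Y_hat) ** 2, axis=1).mean())

def pseudo_r2(Y, Y_hat):
    # share of total Wasserstein variation reproduced by the model
    # (one plausible reading of PR^2; see Verde & Irpino, 2013)
    y_bar = Y.mean(axis=0)                    # barycentric quantiles
    sse = np.mean((Y - Y_hat) ** 2, axis=1).sum()
    sst = np.mean((Y - y_bar) ** 2, axis=1).sum()
    return 1.0 - sse / sst

# sanity check: a perfect fit gives RMSE = 0 and PR^2 = 1
Y = np.sort(np.random.default_rng(2).gamma(2.0, 10.0, (19, 10)), axis=1)
print(rmse_w(Y, Y), pseudo_r2(Y, Y))
```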
Plot of observed CDFs vs predicted CDFs Observed Predicted 18 of 30
Plot observed vs predicted (zoom) 19 of 30
A visualization tool for distributions
Motivations: A distribution, being a function, is a high-dimensional datum. Looking at the plots in the last two slides, this kind of visualization is not very communicative: it is difficult to compare different distributions visually. We need a visualization tool that organizes the distributions graphically according to a similarity criterion.
A new visualization tool: Quantile PCA (Irpino and Verde, 2013). Given a fixed number of quantiles, Quantile PCA performs a principal component analysis on a single distributional variable (a column of the data table).
20 of 30
PCA of quantiles
The X matrix decomposed in Q-PCA: we fix a set of m quantiles, and each individual is represented by a sequence of m+1 ordered values (including the minimum and the maximum):

  x_i = [ min(x_i), Q_{i,1}, ..., Q_{i,j}, ..., Q_{i,m-1}, Max(x_i) ]

  X = | min(x_1)  Q_{1,1}  ...  Q_{1,j}  ...  Q_{1,m-1}  Max(x_1) |
      | ...                                                       |
      | min(x_i)  Q_{i,1}  ...  Q_{i,j}  ...  Q_{i,m-1}  Max(x_i) |
      | ...                                                       |
      | min(x_N)  Q_{N,1}  ...  Q_{N,j}  ...  Q_{N,m-1}  Max(x_N) |
21 of 30
Average quantiles vector
The m+1 quantile column variables are centred with respect to the average quantiles vector:

  x̄ = [ mean of min(x), Q̄_1, ..., Q̄_j, ..., Q̄_{m-1}, mean of Max(x) ]

  X_c = X − 1_N x̄'   with 1_N the unitary vector of N elements

A PCA on the variance-covariance matrix of the quantiles is then performed. Note: the trace of the covariance matrix is an approximation of a variance measure defined for a distributional variable.
22 of 30
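The steps above (build the quantile rows, centre on the average quantiles, diagonalize the covariance) can be sketched as follows; this is our own illustration with simulated microdata, used here only to build the quantile rows, since the method itself needs quantiles only.

```python
import numpy as np

def quantile_pca(samples, m=9):
    """Sketch of a PCA of quantiles for ONE distributional variable.
    samples : list of 1-D arrays, one per individual (e.g. region).
    Each row holds min, m-1 inner quantiles (and the median), max,
    i.e. m+1 ordered values; returns eigenvalues and PC scores."""
    probs = np.linspace(0.0, 1.0, m + 2)       # min ... max
    X = np.vstack([np.quantile(s, probs) for s in samples])
    Xc = X - X.mean(axis=0)                    # centre on average quantiles
    cov = Xc.T @ Xc / (len(samples) - 1)       # quantile covariance matrix
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]            # largest inertia first
    evals, evecs = evals[order], evecs[:, order]
    return evals, Xc @ evecs                   # eigenvalues, PC scores

# 22 simulated "regions" with increasingly right-skewed cost distributions
rng = np.random.default_rng(0)
samples = [rng.gamma(shape, 10.0, 400) for shape in np.linspace(1, 4, 22)]
evals, scores = quantile_pca(samples)
print(evals[:2] / evals.sum())   # the first axes carry most of the inertia
```

The trace of `cov` plays the role noted on the slide: it approximates the Wasserstein-based variance of the distributional variable.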
Eigenvalues and explained inertia
Wasserstein-based variance of the variable Y_TSC = 2.5318x10^8; trace of the quantile variance-covariance matrix = 2.7815x10^8.

     Inertia        % explained   % cum.
E1   2.5739 x10^8   92.54          92.54
E2   0.1702 x10^8    6.12          98.66
E3   0.0223 x10^8    0.80          99.46
E4   0.0095 x10^8    0.34          99.80
E5   0.0027 x10^8    0.10          99.89
E6   0.0015 x10^8    0.05          99.95
E7   0.0007 x10^8    0.02          99.97
E8   0.0005 x10^8    0.01          99.98
E9   0.0001 x10^8    0.006        100.00

(Figure: scree plot of the eigenvalues and cumulative percentage of explained inertia.)
23 of 30
The plot of variables: the Spanish-fan plot
(Figure: fan of quantile-variable arrows, from the lower quantiles through the median to the upper quantiles.)
Comment: a great part of the variability is due to differences in the right tail (right-skewness).
24 of 30
Principal Component Analysis of the Y_TSC variable: First factorial plane 25 of 30
PCA of the Y_TSC variable: First factorial plane (distribution colored according to the means) Higher mean Lower mean 26 of 30
PCA of the Y_TSC variable: First factorial plane (distribution coloured according to standard deviations) Lower std Higher std 27 of 30
PCA of the Y_TSC variable: First factorial plane, a joint view
Comment: means and standard deviations seem slightly positively correlated; both are related to right-heavy-tailed distributions.
28 of 30
Conclusions
In this talk, starting from aggregated data and without knowing the microdata, we showed that it is possible to analyze, predict and visualize such summary structures using new tools from distributional-valued data analysis, defined in a space of univariate distributions equipped with the L2 Wasserstein metric.
A regression technique is able to work with this kind of data, and it provides accurate results that are interpretable (also for practitioners) with respect to the relationships among the variables.
A PCA on quantiles is a promising tool for a fast and easy visualization of the different distribution features.
An R package is going to be released in the next quarter.
As future work, a graphical analysis of predicted vs observed distribution-valued data can be introduced using more sophisticated factorial analyses (this is in progress).
29 of 30
Main references
1. Arroyo, J., Maté, C.: Forecasting histogram time series with k-nearest neighbors methods. International Journal of Forecasting 25 (1), 192-207 (2009)
2. Arroyo, J., González-Rivera, G., Maté, C.: Forecasting with interval and histogram data. Some financial applications. In: Handbook of Empirical Economics and Finance, 247-280 (2010)
3. Dias, S., Brito, P.: A new linear regression model for histogram-valued variables. In: 58th ISI World Statistics Congress, Dublin, Ireland. URL: http://isi2011.congressplanner.eu/pdfs/950662.pdf (2011)
4. Irpino, A., Verde, R.: Dimension reduction techniques for distributional symbolic data. In: SIS 2013 Statistical Conference "Advances in Latent Variables - Methods, Models and Applications". URL: http://meetings.sis-statistica.org/index.php/sis2013/alv/paper/viewfile/2586/443 (2013)
5. Irpino, A., Verde, R.: Ordinary Least Squares for Histogram Data Based on Wasserstein Distance. In: COMPSTAT 2010, 19th Conference of IASC-ERS (Physica Verlag), pp. 581-589 (2010)
6. Irpino, A., Verde, R.: Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, doi: 10.1007/s11634-014-0176-4 (in press, 2015)
7. Rüschendorf, L.: Wasserstein metric. In: Hazewinkel, M. (ed.), Encyclopedia of Mathematics, Springer (2001)
8. Verde, R., Irpino, A.: Multiple Linear Regression for Histogram Data using Least Squares of Quantile Functions: a Two-components model. Revue des Nouvelles Technologies de l'Information, vol. RNTI-E-25, pp. 78-93 (2013)
30 of 30
Thanks for listening!
Antonio Irpino, Rosanna Verde, Dominique Desbois - November 25th, 2014
31 of 30