How to cope with modelling and privacy concerns? A regression model and a visualization tool for aggregated data


A regression model and a visualization tool for aggregated data
Rosanna Verde (rosanna.verde@unina2.it), Antonio Irpino (antonio.irpino@unina2.it), Dominique Desbois (desbois@agroparistech.fr)
Second University of Naples, Dept. of Political Sciences "J. Monnet" 1 of 30

Motivations and aims of the talk
Motivation: The regulations of official statistical institutes do not allow the dissemination of microdata, for privacy reasons. In general, it is easier to obtain aggregated data for a set of individuals. Most modelling tools in statistics (e.g., regression) work on microdata and cannot be easily extended to macrodata.
Methods: In this talk, we show the use of a regression method in which both the explanatory and the response variables have quantile distributions as observations. A PCA method on quantile data is used in order to visualize relationships between the predicted distributions.
Application: The analysis has been performed on a dataset of economic indicators related to the specific costs of agricultural products in the French regions. 2 of 30

DATA
We observed data coming from RICA, the French Farm Accountancy Data Network (FADN), aggregated into the 22 metropolitan regions of France.

CODE  REGION               CODE  REGION
121   Île de France        162   Pays de la Loire
131   Champagne-Ardenne    163   Bretagne
132   Picardie             164   Poitou-Charentes
133   Haute-Normandie      182   Aquitaine
134   Centre               183   Midi-Pyrénées
135   Basse-Normandie      184   Limousin
136   Bourgogne            192   Rhône-Alpes
141   Nord-Pas-de-Calais   193   Auvergne
151   Lorraine             201   Languedoc-Roussillon
152   Alsace               203   Provence-Alpes-Côte d'Azur
153   Franche-Comté        204   Corse
3 of 30

Economic indicators available for each region
- Y_TSC: Total Specific Cost (TSC) of farm holdings;
- X_WHEAT: the wheat output variable;
- X_PIG: the pig output variable;
- X_MILKC: the cow milk output variable.
The available data
- Each region is described, for each variable, by the vector of the estimated 10 deciles of the observed distribution (see the sketch below).
Information not available (for privacy concerns)
- Raw data are not available: for each farm we do not know the values of the four variables;
- We do not know the association structure within each region;
- We do not know the number of farms observed in each region. 4 of 30
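
Below is a minimal, hypothetical sketch (not the FADN/RICA pipeline) of how such a decile summary would be built from farm-level microdata; the farm values, sample size and variable name are simulated for illustration only.

```python
# Hypothetical illustration: reduce confidential farm-level values to the
# released summary (min, 9 deciles, max). The data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
farm_tsc = rng.lognormal(mean=10.5, sigma=0.6, size=250)   # simulated farm-level Y_TSC values

probs = np.linspace(0.1, 0.9, 9)                           # decile levels 10%, ..., 90%
deciles = np.quantile(farm_tsc, probs)
summary = np.concatenate(([farm_tsc.min()], deciles, [farm_tsc.max()]))
print(np.round(summary, 1))                                # what one region's row would look like
```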

An example of a row of the data table
[Figure: for the Bretagne region, the four variables Y_TSC, X_Wheat, X_Pig, X_Cmilk shown as a CDF plot, a histogram, and a smoothed histogram.] 5 of 30

The data table: CDFs (Cumulative distribution functions) and corresponding histograms 6 of 30

A first research question
Is it possible to predict Y_TSC from the other variables? Classic methods of regression cannot be used with this kind of data.
Proposal
We may use histogram-valued data analysis: the regression for quantile functions (Verde-Irpino regression). With each quantile function a distribution is associated; Irpino-Verde regression is a novel method for the regression analysis of distributional data. 7 of 30

A regression model for histogram variables based on Wasserstein distance 8 of 30

A regression model for histogram data
Data = Model Fit + Residual
Linear regression is a general method for estimating/describing the association between a continuous outcome (dependent) variable and one or more predictors in a single equation. This is an easy conceptual task with classic data, but what does it mean when dealing with histogram data?
[Figure: two example histograms over the bins 0-10, 10-20, ..., 90-100.]
Billard, Diday, IFCS 2006; Verde, Irpino, COMPSTAT 2010, CLADAG 2011; Dias, Brito, ISI 2011 9 of 30

Linear regression model for histogram data (Verde, Irpino, 2013)
Given a histogram variable X, we search for a linear transformation of X which allows us to predict the histogram variable Y. For example: given the histograms of the predictor variables observed in a region, is it possible to predict the distribution of Y_TSC using a linear combination of the predictor histogram variables? 10 of 30

Multiple regression model for quantile functions
Our concurrent multiple regression model is:

$$ y_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, x_{ij}(t) + \varepsilon_i(t), \qquad \text{in matrix notation: } Y(t) = X(t)\,\beta + \varepsilon(t) $$

where y_i(t) and x_ij(t) are the quantile functions associated with the histogram/distribution data. This formulation is analogous to the functional linear model (Ramsay, Silverman, 2003), except that the parameters β are constants and the functions y_i(t), x_ij(t) are quantile functions, while each ε_i(t) is a residual function (a distribution?) for all i = 1, ..., n. 11 of 30

Parameter estimation: LS method using the Wasserstein distance
According to the nature of the variables, for the parameter estimation we propose to extend the Least Squares principle to the functional case, using a typical metric between quantile functions:

$$ \varepsilon_i^2\big(y_i(t), \hat y_i(t)\big) = \int_0^1 \Big( y_i(t) - \beta_0 - \sum_{j=1}^{p} \beta_j\, x_{ij}(t) \Big)^2 dt $$

a squared error based on the Wasserstein l2 distance between two quantile functions,

$$ d_W^2(x_i, x_j) = \int_0^1 \big( F_i^{-1}(t) - F_j^{-1}(t) \big)^2 dt. $$

12 of 30
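
As a hedged numerical sketch, the integral above can be approximated on a finite, equally spaced grid of probability levels; the function and the samples below are illustrative assumptions, not part of the original material.

```python
# Approximate d_W^2(x, y) = ∫_0^1 (F_x^{-1}(t) - F_y^{-1}(t))^2 dt on a grid of t values.
import numpy as np

def wasserstein2_sq(q_x, q_y):
    """Squared L2 Wasserstein distance from quantile values on the same equally spaced grid."""
    q_x, q_y = np.asarray(q_x, float), np.asarray(q_y, float)
    return float(np.mean((q_x - q_y) ** 2))                # Riemann-sum approximation of the integral

t = np.linspace(0.05, 0.95, 19)                            # probability grid
rng = np.random.default_rng(1)
q_x = np.quantile(rng.normal(0.0, 1.0, 5000), t)           # empirical quantile function of sample x
q_y = np.quantile(rng.normal(1.0, 2.0, 5000), t)           # empirical quantile function of sample y
print(wasserstein2_sq(q_x, q_y))                           # roughly (mean shift)^2 + (sd gap)^2 for Gaussians
```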

Fitting the linear regression model
Find a linear transformation of the quantile functions x_ij (for j = 1, ..., p) in order to predict the quantile function of y_i, i.e.:

$$ \hat y_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, x_{ij}(t), \qquad t \in [0, 1] $$

The linear transformation is unique: the parameters β_0 and β_j are estimated for all the x_ij and y_i distributions.
A first problem: only if the β_j are non-negative is $\hat y_i(t)$ guaranteed to be a quantile function.
In order to overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance and on the NNLS algorithm. 13 of 30

OLS estimate (Irpino and Verde, 2012)
The quantile function can be decomposed as:

$$ x_{ij}(t) = \bar x_{ij} + x^{c}_{ij}(t), \qquad \text{where } x^{c}_{ij}(t) = x_{ij}(t) - \bar x_{ij} \text{ is the centred quantile function.} $$

Then, we propose the following regression model:

$$ y_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, \bar x_{ij} + \sum_{j=1}^{p} \gamma_j\, x^{c}_{ij}(t) + \varepsilon_i(t), \qquad 0 \le t \le 1 $$

Using the Wasserstein distance it is possible to set up an OLS method that returns the two sets of coefficients (β_0, β_j; γ_j), under a positiveness constraint on the γ_j. 14 of 30
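
The following is a hedged sketch of the two-component idea, assuming each distribution is stored as a row of evenly spaced quantiles: the level coefficients (β) are fitted by ordinary least squares on the distribution means and the shape coefficients (γ) by non-negative least squares (NNLS) on the centred quantile functions. It illustrates the principle rather than reproducing the authors' exact estimator.

```python
import numpy as np
from scipy.optimize import nnls

def fit_two_component(Y, X_list):
    """Y: (n, m) quantiles of the response; X_list: list of p (n, m) predictor quantile matrices."""
    n = Y.shape[0]
    # Means and centred quantile functions of each variable
    Xbar = np.column_stack([X.mean(axis=1) for X in X_list])      # (n, p) matrix of means
    Xc = [X - X.mean(axis=1, keepdims=True) for X in X_list]      # centred (n, m) matrices
    ybar = Y.mean(axis=1)
    Yc = Y - ybar[:, None]

    # Level part: OLS of the response means on the predictor means (with intercept)
    A = np.column_stack([np.ones(n), Xbar])
    beta = np.linalg.lstsq(A, ybar, rcond=None)[0]                # beta_0, beta_1, ..., beta_p

    # Shape part: NNLS of the stacked centred quantiles on the stacked centred predictors
    B = np.column_stack([X.ravel() for X in Xc])                  # (n*m, p)
    gamma, _ = nnls(B, Yc.ravel())                                # gamma_j >= 0 by construction
    return beta, gamma

# Usage with toy data (rows sorted so that each row is a valid quantile vector)
rng = np.random.default_rng(0)
X1 = np.sort(rng.normal(5, 1, (19, 10)), axis=1)
X2 = np.sort(rng.normal(2, 0.5, (19, 10)), axis=1)
Y = np.sort(3 + 0.8 * X1 + 0.4 * X2 + rng.normal(0, 0.1, (19, 10)), axis=1)
beta, gamma = fit_two_component(Y, [X1, X2])
print(beta, gamma)
```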

Interpretation of the parameters
$\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p \in \mathbb{R}$: regression parameters for the mean locations of the distributions.
$\hat\gamma_1, \ldots, \hat\gamma_p \ge 0$: shrinking factors for the variability; if $\hat\gamma_j > 1$ (< 1), the predicted $\hat y_i$ histogram has a greater (smaller) variability than the x_ij histogram. 15 of 30

Advantages of the regression on quantile functions
The regression on quantile functions takes into account the whole distribution (described by its quantiles). It is more powerful than a classic regression on the means of the distributions, because it uses information about the sizes and shapes of the distributions.
It is different from the well-known quantile regression, which requires all the microdata and estimates one quantile at a time (independently of the others); in that case the order among the estimated quantiles is not guaranteed. Our method works on aggregated data when microdata are not available, estimates all the quantiles with a single model, and guarantees the natural order among the estimated quantiles.
The method is less affected by outlying observations, thus it guarantees a more robust estimation of the tails of the distributions. 16 of 30

Regression results (only the first 19 regions are used; the last three have regressors equal to zero)
The estimated model:

$$ \widehat{Y\_TSC}_i(t) = 16{,}834.4 + 0.6671\, \overline{X\_WHEAT}_i + 0.7793\, X\_WHEAT^{c}_i(t) + 0.6095\, \overline{X\_PIG}_i + 0.5478\, X\_PIG^{c}_i(t) + 0.2651\, \overline{X\_MILKC}_i + 0.3438\, X\_MILKC^{c}_i(t), \qquad t \in [0, 1] $$

Goodness-of-fit indices:
Root Mean Square Error (Verde & Irpino, 2013): RMSE = 7,238.2;
Omega index (Dias & Brito, 2011): Ω = 0.9069 (0 worst fit, 1 best fit);
Pseudo R-squared (Verde & Irpino, 2013): PR² = 0.7233 (0 worst fit, 1 best fit). 17 of 30
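
A short, hedged sketch of how Wasserstein-based fit indices of this kind can be computed from matrices of observed and predicted quantiles; these are plausible readings of the RMSE and pseudo R-squared, not the published formulas verbatim (the Omega index of Dias & Brito is omitted).

```python
import numpy as np

def rmse_w(Q_obs, Q_pred):
    """Root of the mean (over regions) squared Wasserstein error; each squared
    distance is approximated by the mean over the quantile grid."""
    sq_w = np.mean((Q_obs - Q_pred) ** 2, axis=1)
    return float(np.sqrt(sq_w.mean()))

def pseudo_r2(Q_obs, Q_pred):
    """1 - SSE/SST, with SST measured around the average quantile function
    (the Wasserstein barycenter of the observed distributions)."""
    barycenter = Q_obs.mean(axis=0)
    sse = np.sum((Q_obs - Q_pred) ** 2)
    sst = np.sum((Q_obs - barycenter) ** 2)
    return float(1.0 - sse / sst)

# Usage: Q_obs and Q_pred are (n_regions, n_quantiles) arrays on the same grid.
```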

Plot of observed CDFs vs predicted CDFs
[Figure: observed and predicted CDFs overlaid.] 18 of 30

Plot observed vs predicted (zoom) 19 of 30

A visualization tool for distributions
Motivations: A distribution, being a function, is a high-dimensional datum. Looking at the plots in the last two slides, this kind of visualization is not very communicative: it is difficult to compare different distributions visually. We need a visualization tool that organizes the distributions graphically according to a similarity criterion.
A new visualization tool: Quantile PCA (Irpino and Verde, 2013). Given a fixed number of quantiles, Quantile PCA performs a principal component analysis on a single distributional variable (a column of the data table). 20 of 30

PCA of quantiles: the X matrix decomposed in Q-PCA
We fix a set of m quantiles. Each individual is represented by a sequence of m+1 ordered values (including the minimum and the maximum):

$$ x_i = \big[ \min(x_i),\; Q_{i,1},\; \ldots,\; Q_{i,j},\; \ldots,\; Q_{i,m-1},\; \max(x_i) \big] $$

$$ X = \begin{bmatrix} \min(x_1) & Q_{1,1} & \cdots & Q_{1,j} & \cdots & Q_{1,m-1} & \max(x_1) \\ \vdots & & & & & & \vdots \\ \min(x_i) & Q_{i,1} & \cdots & Q_{i,j} & \cdots & Q_{i,m-1} & \max(x_i) \\ \vdots & & & & & & \vdots \\ \min(x_N) & Q_{N,1} & \cdots & Q_{N,j} & \cdots & Q_{N,m-1} & \max(x_N) \end{bmatrix} $$

21 of 30

Average quantiles vector
The m+1 quantile column variables are centred:

$$ \bar x = \big[ \overline{\min(x)},\; \bar Q_1,\; \ldots,\; \bar Q_j,\; \ldots,\; \bar Q_{m-1},\; \overline{\max(x)} \big] \quad \text{(the vector of average quantiles)} $$

$$ X^c = X - \mathbf{1}_N\, \bar x, \qquad \text{with } \mathbf{1}_N \text{ the unitary vector of N elements.} $$

A PCA on the variance-covariance matrix of the quantiles is then performed. Note: the trace of the covariance matrix is an approximation of a variance measure defined for a distributional variable. 22 of 30
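
A minimal sketch of the Quantile PCA step just described, assuming each row of Q holds the ordered values [min, Q_1, ..., Q_{m-1}, max] for one region; the toy data are illustrative.

```python
import numpy as np

def quantile_pca(Q):
    """PCA on the variance-covariance matrix of the quantile columns."""
    Qc = Q - Q.mean(axis=0)                          # centre each quantile column
    cov = np.cov(Qc, rowvar=False)                   # variance-covariance matrix of quantiles
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]                 # sort eigenvalues in decreasing order
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Qc @ eigvec                             # factorial coordinates of the regions
    explained = eigval / eigval.sum() * 100          # % of explained inertia
    return eigval, explained, scores

rng = np.random.default_rng(0)
Q = np.sort(rng.lognormal(10, 0.5, (22, 11)), axis=1)   # 22 regions, 11 ordered values each
eigval, explained, scores = quantile_pca(Q)
print(np.round(explained[:3], 2))                       # leading components' explained inertia
```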

Eigenvalues and explained inertia
Wasserstein-based variance of the variable Y_TSC = 2.5318×10^8; trace of the quantile variance-covariance matrix = 2.7815×10^8.

      Inertia        % explained   % cumulative
E1    2.5739×10^8    92.54         92.54
E2    0.1702×10^8    6.12          98.66
E3    0.0223×10^8    0.80          99.46
E4    0.0095×10^8    0.34          99.80
E5    0.0027×10^8    0.10          99.89
E6    0.0015×10^8    0.05          99.95
E7    0.0007×10^8    0.02          99.97
E8    0.0005×10^8    0.01          99.98
E9    0.0001×10^8    0.006         100.00

[Figures: bar chart of the eigenvalues and plot of the cumulative percentage of explained inertia.] 23 of 30

The plot of variables: the Spanish-fan plot
[Figure: Spanish-fan plot showing the median, the upper quantiles and the lower quantiles on the first factorial plane.]
Comment: a great part of the variability is due to differences in the right tail (right-skewness). 24 of 30

Principal Component Analysis of the Y_TSC variable: First factorial plane 25 of 30

PCA of the Y_TSC variable: first factorial plane (distributions coloured according to their means, from lower to higher mean) 26 of 30

PCA of the Y_TSC variable: first factorial plane (distributions coloured according to their standard deviations, from lower to higher std) 27 of 30

PCA of the Y_TSC variable: first factorial plane, a joint view
Comment: means and standard deviations seem slightly positively correlated; they are both related to right heavy-tailed distributions. 28 of 30

Conclusions
In this talk, starting from aggregated data and without knowing the microdata, we showed that it is possible to analyze, predict and display such summary structures using new tools from distributional-valued data analysis, defined in a space of univariate distributions equipped with the L2 Wasserstein metric.
A regression technique is able to work with this kind of data, and it provides accurate results that are interpretable (also by practitioners) for the interpretation of the causal relationships.
A PCA on quantiles is a promising tool for a fast and easy visualization of the different distribution features.
An R package is going to be released in the next quarter.
As future work, a graphical analysis of predicted vs observed distribution-valued data can be introduced using more sophisticated factorial analyses (this is in progress). 29 of 30

Main references
1. Arroyo, J., Maté, C.: Forecasting histogram time series with k-nearest neighbors methods. International Journal of Forecasting 25(1), 192-207 (2009)
2. Arroyo, J., González-Rivera, G., Maté, C.: Forecasting with interval and histogram data. Some financial applications. In: Handbook of Empirical Economics and Finance, pp. 247-280 (2010)
3. Dias, S., Brito, P.: A new linear regression model for histogram-valued variables. In: 58th ISI World Statistics Congress, Dublin, Ireland. URL: http://isi2011.congressplanner.eu/pdfs/950662.pdf (2011)
4. Irpino, A., Verde, R.: Dimension reduction techniques for distributional symbolic data. In: SIS 2013 Statistical Conference, Advances in Latent Variables - Methods, Models and Applications. URL: http://meetings.sis-statistica.org/index.php/sis2013/alv/paper/viewfile/2586/443 (2013)
5. Irpino, A., Verde, R.: Ordinary Least Squares for Histogram Data Based on Wasserstein Distance. In: COMPSTAT 2010, 19th Conference of IASC-ERS, Physica-Verlag, pp. 581-589 (2010)
6. Irpino, A., Verde, R.: Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, doi: 10.1007/s11634-014-0176-4 (in press, 2015)
7. Rüschendorf, L.: Wasserstein metric. In: Hazewinkel, M. (ed.), Encyclopedia of Mathematics, Springer (2001)
8. Verde, R., Irpino, A.: Multiple Linear Regression for Histogram Data using Least Squares of Quantile Functions: a Two-components model. Revue des Nouvelles Technologies de l'Information, vol. RNTI-E-25, pp. 78-93 (2013)
30 of 30

Thanks for listening.
Antonio Irpino, Rosanna Verde, Dominique Desbois, November 25th, 2014 31 of 30